Institute of Mathematical Statistics LECTURE NOTES-MONOGRAPH SERIES Volume 39
R.R. Bahadur's Lectures on the Theory of Estimation
Stephen M. Stigler, Wing Hung Wong and Daming Xu, Editors
Institute of Mathematical Statistics Beachwood, Ohio
Institute of Mathematical Statistics Lecture Notes-Monograph Series Series Editor: Joel Greenhouse
The production of the Institute of Mathematical Statistics Lecture Notes-Monograph Series is managed by the IMS Business Office: Julia A. Norton, Treasurer, and Elyse Gustafson, Executive Director.
Library of Congress Control Number: 2002104854 International Standard Book Number 0-940600-53-6 Copyright © 2002 Institute of Mathematical Statistics All rights reserved Printed in the United States of America
Editors' Preface
In the Winter Quarter of the academic year 1984-1985, Raj Bahadur gave a series of lectures on estimation theory at the University of Chicago. The role of statistical theory in Chicago's graduate curriculum has always varied according to faculty interests, but the hard and detailed examination of the classical theory of estimation was in those years Raj's province, to be treated in a special topics course when his time and the students' interests so dictated. Winter 1985 was one of those times. In ten weeks, Raj covered what most mathematical statisticians would agree should be standard topics in a course on parametric estimation: Bayes estimates, unbiased estimation, Fisher information, Cramer-Rao bounds, and the theory of maximum likelihood estimation. As a seasoned teacher, Raj knew that mathematical prerequisites could not be taken entirely for granted, that even students who had fulfilled them would benefit from a refresher, and accordingly he began with a review of the geometry of L2 function spaces. Two of us were in that classroom, WHW who was then a junior member of the statistics faculty and DX who was then an advanced graduate student. We both had previously studied parametric estimation, but never from Raj's distinctive perspective. What started as a visit motivated by curiosity (just how would one of the architects of the modern theory explain what were by then its standard elements?) soon became a compelling, not-to-be-missed pilgrimage, three times a week. Raj's approach was elegant and turned what we thought were shop-worn topics into polished gems; these gems were not only attractive on the surface but translucent, and with his guidance and insight we could look deeply within them as well, gaining new understanding with every lecture. Topics we thought we understood well, such as locally unbiased estimates and likelihood ratios in Lecture 11, and the asymptotic optimality of maximum likelihood estimators in Lectures 28-30, were given new life and much more general understanding as we came to better understand the principles and geometry that underlay them. The two of us (WHW and DX) took detailed notes, reviewing them after the lectures to recover any gaps and smooth the presentation to a better approximation of what we had heard. Some time after the course, Raj was pleased to receive from us an edited, hand-copied version of his lectures. In these lectures, Raj Bahadur strived towards, and in most cases succeeded in deriving the most general results using the simplest arguments. After stating a result in class, he would usually begin its derivation by saying "it is really very simple...", and indeed his argument would often appear to be quite elementary and simple. However, upon further study, we would find that his arguments were far from elementary — they
appear to be so only because they are so carefully crafted and follow such impeccable logic. Isaac Newton's Principia abounds with "simple" demonstrations based upon a single well-designed diagram, demonstrations that in another's hands might have been given through pages of dense and intricate mathematics, the result proved correctly but without insight. Others, even the great astrophysicist S. Chandrasekhar (1995), would marvel at how Newton saw what could be so simply yet generally accomplished. So it is with some of these lectures. Raj focused on the essential and core aspect of problems, often leading him to arguments useful not only for the solutions of the immediate problems, but also for the solutions of very large classes of related problems. This is illustrated by his treatment of Fisher's bound for the asymptotic variance of estimators (Lectures 29-30). Here he used the Neyman-Pearson lemma and the asymptotic normal distribution under local alternatives to provide a remarkably elegant proof of the celebrated result (which he attributed to LeCam) that the set of parameter points at which the Fisher bound fails must be of measure zero. Although Raj developed this theory under the assumption of asymptotic normality, his approach is in fact applicable to a much wider class of estimators (Wong, 1992; Zheng and Fang, 1994). In this course, Raj Bahadur did not use any particular text and clearly did not rely on any book in his preparation for lectures, although he did encourage students to consult other sources to broaden their perspective. Among the texts that he mentioned are Cramer (1946), Pitman (1979), Ibragimov and Hasminskii (1981), and Lehmann (1983). Raj's own papers were useful, particularly his short and elegant treatment of Fisher's bound (Bahadur, 1964, which is (23) in the appended bibliography of Raj's works). Raj did lecture from notes, but they were no more than a sketchy outline of intentions, and none of them survive. And so these lecture notes are exactly that, the notes of students. A couple of years after these notes were recorded, Raj gave the course again, but we know of no record of any changes in the material, and he may well have used these very notes in preparing for those lectures. While Raj at one time considered using these notes as a basis for a monographic treatment, that was not to be. As a consequence they reflect very much the pace and flavor of the occasion on which the course was given. Some topics occupied several lectures; others were shorter. Homework problems were stated at the time he thought them most appropriate, not at the end of a topic or chapter. Notation would shift occasionally, and the one change that we have made in this published version is to attempt to make the notation consistent. Unfortunately, many memorable aspects of Raj's teaching style cannot be conveyed in these notes. He had a great sense of humor and was able to amuse the class with a good number of unexpected jokes that were not recorded. Raj also possessed a degree of humility that is rare among scholars of his stature. He showed no outward signs of self-importance and shared his time and insight without reservation. After class his door was always open for those who needed help and advice, except, of course, immediately after lunch hour, when he was needed in the Billiard Room of the Quadrangle Club. For many years these notes were circulated as xeroxed copies of the handwritten originals.
The repeated requests for copies we received over many years led us to prepare them for publication. These lectures are now superbly typed in TeX by Loren Spice, himself an expert mathematician, and in the process he helped clarify the exposition in
countless small ways. They have been proofread by George Tseng, Donald Truax, and the editors. We are grateful for the support of the University of Chicago's Department of Statistics and its Chairman Michael Stein, and for the assistance of Mitzi Nakatsuka in their final preparation. These lecture notes are presented here with the permission of Thelma Bahadur, in the hope of helping the reader to appreciate the great beauty and utility of the core results in estimation theory, as taught by a great scholar and master teacher, Raj Bahadur.

Stephen M. Stigler (University of Chicago)
Wing Hung Wong (Harvard University)
Daming Xu (University of Oregon)
Table of Contents

Editors' Preface

Lectures on Estimation Theory

Chapter 1. Review of L2 Theory
  Lecture 1. Review of L2 geometry
  Lecture 2. Orthogonal projection
  Lecture 3. L2(S, A, P)
  Lecture 4. Orthogonal projection in L2(S, A, P)
  Lecture 5. Formulation of the estimation problem
Chapter 2. Bayes Estimation
  Lecture 6. Bayes estimates
  Lecture 7. Bayes estimates (continued)
  Lecture 8. Bayes estimates and maximum likelihood estimates
  Lecture 9. Example: binomial sampling
  Lecture 10. Example: negative binomial sampling
Chapter 3. Unbiased Estimation
  Lecture 11. Unbiased estimates and orthogonal projections on the space spanned by likelihood ratios
  Lecture 12. LMVUE and UMVUE
Chapter 4. The Score Function, Fisher Information and Bounds
  Lecture 13. Score function, Fisher information, Bhattacharya bound and Cramer-Rao bound
  Lecture 14. Heuristics for maximum likelihood estimates
  Lecture 15. Necessary conditions for the attainment of the Cramer-Rao bound and Chapman-Robbins bound (one-parameter case)
Chapter 5. Sufficient Conditions for the Attainment of the Cramer-Rao Bound and the Bhattacharya Bound
  Lecture 16. Exact statement of the Cramer-Rao inequality and Bhattacharya inequality
  Lecture 17. One-parameter exponential family and sufficient conditions for the attainment of the bounds
  Lecture 18. Examples
Chapter 6. Multi-Parameter Case
  Lecture 19. Vector-valued score functions, Fisher information and the Cramer-Rao inequality
  Lecture 20. Heuristics for maximum likelihood estimates in the multi-parameter case
  Lecture 21. Example
  Lecture 22. Exact statement of the Cramer-Rao inequality in the multi-parameter case
  Lecture 23. Metrics in the parameter space
  Lecture 24. Multi-parameter exponential family
Chapter 7. Score Function, MLE and Related Issues
  Lecture 25. Using the score function
  Lecture 26. Examples
  Lecture 27. Review of heuristics for maximum likelihood estimates
Chapter 8. The Asymptotic Distribution and Asymptotic Efficiency of MLE and Related Estimators
  Lecture 28. Asymptotic distribution of MLE, super-efficiency example
  Lecture 29. Fisher's bound for asymptotic variance and asymptotic optimality of the MLE
  Lecture 30. Multi-parameter case
References
Appendix
  Raghu Raj Bahadur 1924-1997, by Stephen M. Stigler
  Bibliography of Bahadur's Publications
Lectures on Estimation Theory

R. R. Bahadur

(Notes taken by Wing Hung Wong and Daming Xu and typed by Loren Spice)

(Winter quarter of academic year 1984-1985)
Chapter 1

Note on the notation: Throughout, Professor Bahadur used the symbols $\varphi(s), \varphi_1(s), \varphi_2(s), \ldots$ to denote functions of the sample that are generally of little importance in the discussion of the likelihood. These functions often arise in his derivations without prior definition.
Lecture 1

Review of L2 geometry

Let $(S, \mathcal{A}, P)$ be a probability space. We call two functions $f_1$ and $f_2$ on $S$ EQUIVALENT if and only if $P(f_1 = f_2) = 1$, and set
\[ V = L_2(S, \mathcal{A}, P) := \Big\{ f : f \text{ is measurable and } E(f^2) = \int_S f(s)^2\,dP(s) < \infty \Big\}, \]
where we have identified equivalent functions. We may abbreviate $L_2(S, \mathcal{A}, P)$ to $L_2(P)$ or, if the probability space is understood, to just $L_2$. For $f, g \in V$, we define $\|f\| = +\sqrt{E(f^2)}$ and $(f, g) = E(fg)$, so that $\|f\|^2 = (f, f)$. Throughout this list $f$ and $g$ denote arbitrary (collections of equivalent) functions in $V$.

1. $V$ is a real vector space.

2. $(\cdot, \cdot)$ is an inner product on $V$ - i.e., a bilinear, symmetric and positive definite function.

3. CAUCHY-SCHWARZ INEQUALITY: $|(f, g)| \le \|f\|\,\|g\|$, with equality if and only if $f$ and $g$ are linearly dependent.

Proof. Let $x$ and $y$ be real; then, by expanding $\|\cdot\|$ in terms of $(\cdot, \cdot)$, we find that
\[ 0 \le \|xf + yg\|^2 = x^2\|f\|^2 + 2xy(f, g) + y^2\|g\|^2, \]
from which the result follows immediately on letting $x = \|g\|$ and $y = \pm\|f\|$. □

4. TRIANGLE INEQUALITY: $\|f + g\| \le \|f\| + \|g\|$.

Proof. $\|f + g\|^2 = \|f\|^2 + 2(f, g) + \|g\|^2 \le (\|f\| + \|g\|)^2$, again by expanding $\|\cdot\|$ in terms of $(\cdot, \cdot)$ and using the Cauchy-Schwarz inequality. □

5. PARALLELOGRAM LAW: $\|f + g\|^2 + \|f - g\|^2 = 2\|f\|^2 + 2\|g\|^2$.

Proof. Direct computation, as above. □

6. $\|\cdot\|$ is a continuous function on $V$, and $(\cdot, \cdot)$ is a continuous function on $V \times V$.

Proof. Suppose $f_n \to f$ in $L_2$; then $\|f_n\| \le \|f_n - f\| + \|f\|$ and $\|f\| \le \|f - f_n\| + \|f_n\|$. From these two statements it follows that $\lim \|f_n\| = \|f\|$. □
7. $V$ is a complete metric space under $\|\cdot\|$ - i.e., if $\{g_n\}$ is a sequence in $V$ and $\|g_n - g_m\| \to 0$ as $n, m \to \infty$, then there exists $g \in V$ such that $\|g_n - g\| \to 0$.

Proof. The proof proceeds in four parts.

1. $\{g_n\}$ is a Cauchy sequence in probability:
\[ P(|g_m - g_n| > \varepsilon) = P(|g_m - g_n|^2 > \varepsilon^2) \le \frac{1}{\varepsilon^2} E(|g_m - g_n|^2) = \frac{1}{\varepsilon^2}\|g_m - g_n\|^2. \]

2. Hence there exists a subsequence $\{g_{n_k}\}$ converging a.e. $(P)$ to, say, $g$.

3. $g \in V$. Proof: $E(g^2) = \int (\lim_k g_{n_k}^2)\,dP \le \liminf_k \int g_{n_k}^2\,dP$ by Fatou's lemma; but $\{\int g_{n_k}^2\,dP = \|g_{n_k}\|^2\}$ is a bounded sequence, since $\{g_n\}$ is Cauchy. □

4. $\|g_n - g\| \to 0$. Proof: For any $\varepsilon > 0$, choose $k = k(\varepsilon)$ so that $\|g_m - g_n\| < \varepsilon$ whenever $m, n \ge k(\varepsilon)$. Then, by Fatou's lemma,
\[ \int |g_n - g|^2\,dP = \int \big(\lim_k |g_n - g_{n_k}|^2\big)\,dP \le \liminf_k \int |g_n - g_{n_k}|^2\,dP = \liminf_k \|g_n - g_{n_k}\|^2 \le \varepsilon^2 \]
provided that $n \ge k(\varepsilon)$. □

Let $W$ be a subset of $V$. If $W$ is closed under addition and scalar multiplication, then it is called a LINEAR MANIFOLD in $V$. If, furthermore, $W$ is topologically closed, then it is called a SUBSPACE of $V$. Note that a finite-dimensional linear manifold must be topologically closed (hence a subspace). If $C$ is any collection of vectors in $V$, then let $C_1$ be the collection of all finite linear combinations of vectors in $C$ and $C_2$ be the closure of $C_1$. Then $C_2$ is the smallest subspace of $V$ containing $C$, and is called the subspace SPANNED by $C$. $C_1$ is called the linear manifold spanned by $C$.

Let $W$ be a fixed subspace of $V$, and $f$ a fixed vector in $V$. We say that the vector $g \in W$ is an ORTHOGONAL PROJECTION of $f$ to $W$ if and only if $\|f - g\| = \inf_{h \in W} \|f - h\|$.

8. There exists a unique orthogonal projection $g$ of $f$ to $W$.

Proof. Let $t = \inf_{h \in W} \|f - h\|$, and let $\{g_n\}$ be a sequence in $W$ such that $\|f - g_n\| \to t$; then, by the parallelogram law,
\[ \|g_m - g_n\|^2 = 2\|f - g_m\|^2 + 2\|f - g_n\|^2 - 4\Big\|f - \tfrac{g_m + g_n}{2}\Big\|^2, \]
where the first two terms on the right converge to $2t^2$ each and the last term is at least $4t^2$ (since $\tfrac{g_m + g_n}{2} \in W$), from which we see that $\|g_m - g_n\|^2 \to 0$ as $m, n \to \infty$. Thus $\{g_n\}$ is a Cauchy sequence; but this means that there is some $g$ such that $g_n \to g$. Since $W$ is a subspace of $V$, it is closed; so, since each $g_n \in W$, so too is $g \in W$. □
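The following numerical sketch is editorial (not part of the original notes): it illustrates the orthogonal projection of a vector onto the span of finitely many vectors, viewing $\mathbb{R}^n$ as $L_2$ of a uniform probability on $n$ points. The dimensions and random data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: project f onto W = span{c_1, ..., c_k} via the normal equations.
rng = np.random.default_rng(0)
n, k = 100, 3
C = rng.normal(size=(n, k))      # columns c_1, ..., c_k span the subspace W
f = rng.normal(size=n)

a, *_ = np.linalg.lstsq(C, f, rcond=None)   # solves (C^T C) a = C^T f
proj = C @ a                                 # pi_W(f)
resid = f - proj

# The residual is orthogonal to every spanning vector (property (**) of Lecture 2),
# and the Pythagorean identity ||f||^2 = ||pi_W f||^2 + ||f - pi_W f||^2 holds.
print(np.max(np.abs(C.T @ resid)))                              # ~ 0 up to rounding
print(np.allclose(f @ f, proj @ proj + resid @ resid))          # True
```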
Lecture 2

Definition. For two vectors $f_1, f_2 \in V$, we say that $f_1$ is ORTHOGONAL to $f_2$, and write $f_1 \perp f_2$, if and only if $(f_1, f_2) = 0$. Throughout, we fix a subspace $W$ of $V$ and vectors $f, f_1, f_2 \in V$.

9. PYTHAGOREAN THEOREM (and its converse): $f_1 \perp f_2$ if and only if $\|f_1 + f_2\|^2 = \|f_1\|^2 + \|f_2\|^2$.

10.
a. Given the above definition of orthogonality, there are two natural notions of orthogonal projection:
(*) $\gamma \in W$ is an orthogonal projection of $f$ on $W$ if and only if $\|f - \gamma\| = \inf_{h \in W}\|f - h\|$;
(**) $\gamma \in W$ is an orthogonal projection of $f$ on $W$ if and only if $(f - \gamma) \perp g$ for all $g \in W$.
These two definitions are equivalent (i.e., $\gamma$ satisfies (*) if and only if it satisfies (**)).
b. Exactly one vector $\gamma \in W$ satisfies (**) - i.e., a solution of the minimisation problem exists and is unique.
c. $\|f\|^2 = \|\gamma\|^2 + \|f - \gamma\|^2$.

Proof of (10).
a. ($\Rightarrow$) Choose $h \in W$. For all real $x$, $\gamma + xh \in W$ also. Therefore, if (*) holds, then (setting $\delta = f - \gamma$)
\[ x^2\|h\|^2 - 2x(\delta, h) \ge 0 \quad \forall x \in \mathbb{R}. \]
This is possible only if $(\delta, h) = 0$. Thus (**) holds.
($\Leftarrow$) If (**) holds then, for every $h \in W$,
\[ (f - \gamma) \perp (\gamma - h) \;\Rightarrow\; \|f - h\|^2 = \|f - \gamma\|^2 + \|\gamma - h\|^2 \;\Rightarrow\; \|f - h\|^2 \ge \|f - \gamma\|^2. \]
Thus (*) holds.
b. Suppose that both $\gamma_1$ and $\gamma_2$ are solutions to (**) in $W$. Since $\gamma_1 - \gamma_2 \in W$, $(f - \gamma_1) \perp (\gamma_1 - \gamma_2)$ and hence, by (9),
\[ \|f - \gamma_2\|^2 = \|f - \gamma_1\|^2 + \|\gamma_1 - \gamma_2\|^2. \]
By (a), however, $\gamma_1$ and $\gamma_2$ both also satisfy (*), so $\|f - \gamma_1\| = \|f - \gamma_2\| = \inf_{g \in W}\|f - g\|$, and hence $\|\gamma_1 - \gamma_2\| = 0 \Rightarrow \gamma_1 = \gamma_2$.
c. Since $\gamma \in W$, $(f - \gamma) \perp \gamma$, so $\|f\|^2 = \|\gamma\|^2 + \|f - \gamma\|^2$ by (9), as desired. □

Definition. We denote by $\pi_W f$ the orthogonal projection of $f$ on $W$.

Note. $\|\pi_W f\| \le \|f\|$, with equality iff $\pi_W f = f$ - i.e., iff $f \in W$. (For, by 10(c), $\|f\|^2 = \|\pi_W f\|^2 + \|f - \pi_W f\|^2$.) It is easy to see that $W = \{f \in V : \pi_W f = f\} = \{\pi_W f : f \in V\}$.

Definition. The ORTHOGONAL COMPLEMENT of $W$ in $V$ is defined to be
\[ W^{\perp} := \{h \in V : h \perp g \ \forall g \in W\}. \]
Note that $W^{\perp} = \{h \in V : \pi_W h = 0\}$.

11. $W^{\perp}$ is a subspace of $V$.

12. $\pi_W : V \to V$ is linear, idempotent and self-adjoint.

Proof. We abbreviate $\pi_W$ to $\pi$. Let $a_1, a_2 \in \mathbb{R}$ and $f, f_1, f_2 \in V$ be arbitrary. Then we have by (10) that $f_1 - \pi f_1$ and $f_2 - \pi f_2$ are in $W^{\perp}$ and hence by (11) that
\[ (a_1 f_1 + a_2 f_2) - (a_1 \pi f_1 + a_2 \pi f_2) = a_1(f_1 - \pi f_1) + a_2(f_2 - \pi f_2) \in W^{\perp}. \quad (*) \]
Since $\pi f_1, \pi f_2 \in W$ and $W$ is a subspace, $a_1 \pi f_1 + a_2 \pi f_2 \in W$; therefore, by (10) and (*) above, $\pi(a_1 f_1 + a_2 f_2) = a_1 \pi f_1 + a_2 \pi f_2$. Thus $\pi$ is linear. We also have by (10) that $\pi(\pi f) = \pi f$, since $\pi f \in W$; thus $\pi$ is idempotent. Finally, since $\pi f_1, \pi f_2 \in W$, once more by (10) we have that $(f_1 - \pi f_1, \pi f_2) = 0$; thus
\[ (f_1, \pi f_2) = ((f_1 - \pi f_1) + \pi f_1, \pi f_2) = (f_1 - \pi f_1, \pi f_2) + (\pi f_1, \pi f_2) = (\pi f_1, \pi f_2). \]
Similarly, $(\pi f_1, f_2) = (\pi f_1, \pi f_2)$, so that $(f_1, \pi f_2) = (\pi f_1, f_2)$. Thus $\pi$ is self-adjoint. □

13. We have from the above description of $W^{\perp}$ that $W^{\perp} = \{f - \pi_W f : f \in V\}$.

14. (This is a converse to (12).) If $U : V \to V$ is linear, idempotent and self-adjoint, then $U$ is an orthogonal projection to some subspace (i.e., there is a subspace $W'$ of $V$ so that $U = \pi_{W'}$).

15. Given an arbitrary $f \in V$, we may write uniquely $f = g + h$, with $g \in W$ and $h \in W^{\perp}$. In fact, $g = \pi_W f$ and $h = \pi_{W^{\perp}} f$. From this we conclude that $\pi_{W^{\perp}} \circ \pi_W \equiv 0 \equiv \pi_W \circ \pi_{W^{\perp}}$ and $(W^{\perp})^{\perp} = W$.

16. Suppose that $W_1$ and $W_2$ are two subspaces of $V$ such that $W_2 \subset W_1$. Then $\pi_{W_2} \circ \pi_{W_1} = \pi_{W_2} = \pi_{W_1} \circ \pi_{W_2}$, and $\|\pi_{W_2} f\| \le \|\pi_{W_1} f\|$, with equality iff $\pi_{W_1} f \in W_2$.
Lecture 3

Note. The above concepts and statements (regarding projections etc.) are valid in any Hilbert space, but we are particularly interested in the case $V = L_2(S, \mathcal{A}, P)$.

Note. If $V$ is a Hilbert space and $W$ is a subspace of $V$, then $W$ is a Hilbert space when equipped with the same inner product as $V$.

Homework 1

1. If $V = L_2(S, \mathcal{A}, P)$, show that $V$ is finite-dimensional if $P$ is concentrated on a finite number of points in $S$. You may assume that the one-point sets $\{s\}$ are measurable.

2. Suppose that $S = [0,1]$, $\mathcal{A}$ is the Borel field (on $[0,1]$) and $P$ is the uniform probability measure. Let $V = L_2$ and, for $I, J$ fixed disjoint subintervals of $S$, define $W = W_{I,J} := \{f \in V : f = 0 \text{ a.e. on } I \text{ and } f \text{ is constant a.e. on } J\}$. Show that $W$ is a subspace and find $W^{\perp}$. Also compute $\pi_W f$ for $f \in V$ arbitrary.

3. Let $S = \mathbb{R}^1$, $\mathcal{A} = \mathcal{B}^1$ and $P$ be arbitrary, and set $V = L_2$. Suppose that $s \in V$ is such that $E(e^{ts}) < \infty$ for all $t$ sufficiently small (i.e., for all $t$ in a neighbourhood of 0). Show that the subspace spanned by $\{1, s, s^2, \ldots\}$ is equal to $V$. (HINT: Check first that the hypothesis implies that $1, s, s^2, \ldots$ are indeed in $V$. Then check that, if $g \in V$ satisfies $g \perp s^r$ for $r = 0, 1, 2, \ldots$, then $g = 0$ a.e.$(P)$. This may be done by using the uniqueness of the moment-generating function.)

Definition. Let $S = \{s\}$ and $V = L_2(S, \mathcal{A}, P)$. Let $(R, \mathcal{C})$ be a measurable space, and let $F : S \to R$ be a measurable function. If we let $Q = P \circ F^{-1}$ (so that $Q(T) = P(F^{-1}[T])$), then $F(s)$ is called a STATISTIC with corresponding probability space $(R, \mathcal{C}, Q)$. $W' = L_2(R, \mathcal{C}, Q)$ is isomorphic to the subspace $W = L_2(S, F^{-1}[\mathcal{C}], P)$ of $V$.

Application to prediction

Let $S = \mathbb{R}^{k+1}$, $\mathcal{A} = \mathcal{B}^{k+1}$ be the Borel field in $\mathbb{R}^{k+1}$, $P$ be arbitrary and $V = L_2$. Let $s = (X_1, \ldots, X_k; Y)$. A PREDICTOR of $Y$ is a Borel function $G = G(\underline{X})$ of $\underline{X} = (X_1, \ldots, X_k)$. We assume that $E(Y^2) < \infty$ and take the MSE of $G$, i.e., $E(|G(\underline{X}) - Y|^2)$, as a criterion. What should we mean by saying that $G$ is the "best" predictor of $Y$?

i. No restriction on $G$: Consider the set $W$ of all measurable $G = G(\underline{X})$ with $E(|G|^2) < \infty$. $W$ is clearly (isomorphic to) a subspace of $V$ and, for $G \in W$, $E(G - Y)^2 = \|Y - G\|^2$. Then the best predictor of $Y$ is just the orthogonal projection of $Y$ on $W$, which is the same as the conditional expectation of $Y$ given $\underline{X} = (X_1, \ldots, X_k)$.

Proof (informal). Let $G^*(\underline{X}) = E(Y \mid \underline{X})$. For an arbitrary $G = G(\underline{X}) \in W$,
\[ \|Y - G\|^2 = \|Y - G^*\|^2 + \|G - G^*\|^2 + 2(Y - G^*, G^* - G), \]
but
\[ (Y - G^*, G^* - G) = E\big((Y - G^*)(G^* - G)\big) = E\big[(G^* - G)\,E(Y - G^* \mid \underline{X})\big] = 0, \]
so that $\|Y - G\|^2 = \|Y - G^*\|^2 + \|G - G^*\|^2$, whence $G^*$ must be the unique projection. □

ii. $G$ an affine function: We require that $G$ be an affine function of $\underline{X}$ - i.e., that there be constants $a_0, a_1, \ldots, a_k$ such that $G(\underline{X}) = G(X_1, \ldots, X_k) = a_0 + \sum_{i=1}^k a_i X_i$ for all $\underline{X}$. The class of such $G$ is a subspace $W'$ of the space $W$ defined in the previous case. The best predictor of $Y$ in this class is the orthogonal projection of $Y$ on $W'$, which is called the LINEAR REGRESSION of $Y$ on $(X_1, \ldots, X_k)$.
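As an editorial illustration (not part of the original notes), the following sketch contrasts the two notions of "best" predictor on a simulated nonlinear model; the model $Y = X^2 + \text{noise}$ and all constants are illustrative assumptions.

```python
import numpy as np

# Compare the best predictor E(Y | X) with the best affine predictor a0 + a1*X.
rng = np.random.default_rng(1)
X = rng.normal(size=200_000)
Y = X**2 + rng.normal(scale=0.5, size=X.size)

cond_mean = X**2                       # E(Y | X) for this particular model
a1 = np.cov(X, Y)[0, 1] / np.var(X)    # affine (linear-regression) coefficients
a0 = Y.mean() - a1 * X.mean()
affine = a0 + a1 * X

print("MSE of E(Y|X):       ", np.mean((Y - cond_mean) ** 2))   # ~ 0.25
print("MSE of affine pred.: ", np.mean((Y - affine) ** 2))      # strictly larger
```

The conditional expectation attains the smaller mean-squared error, in line with case (i) above; the affine predictor of case (ii) is optimal only within the smaller subspace $W'$.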
Lecture 4

We return to predicting $Y$ using an affine function of $\underline{X}$. We define
\[ W = \{a_0 + a_1 X_1 + \cdots + a_k X_k : a_0, \ldots, a_k \in \mathbb{R}\} \]
and denote by $\hat Y$ the orthogonal projection of $Y$ on $W$. $\hat Y$ is characterized by the two facts that
\[ (*)\ (Y - \hat Y) \perp 1 \qquad \text{and} \qquad (**)\ (Y - \hat Y) \perp X_i^0 \ \ (i = 1, \ldots, k), \]
where $X_i^0 = X_i - EX_i$. Since $W = \mathrm{Span}\{1, X_1, \ldots, X_k\}$, we may suppose that $\hat Y = \beta_0 + \sum_{i=1}^k \beta_i X_i^0$. From (*), $\beta_0 = EY$ and, from (**), $\Sigma\beta = c$ (the 'normal equations'), where $\beta = (\beta_1, \ldots, \beta_k)^T$, $c = (c_1, \ldots, c_k)^T$, $\Sigma = (\sigma_{ij})$, $c_i = E(X_i^0 Y^0) = \mathrm{Cov}(X_i, Y)$, $\sigma_{ij} = E(X_i^0 X_j^0) = \mathrm{Cov}(X_i, X_j)$ and $Y^0 = Y - EY$. We have (by considering the minimization problem) that there exists a solution $\beta$ to these two equations; and (by uniqueness of the orthogonal projection) that, if $\beta$ is any such solution, then $\hat Y = \beta_0 + \sum_{i=1}^k \beta_i X_i^0$. $\Sigma$ is positive semi-definite and symmetric.

Homework 1

4. Show that $\Sigma$ is nonsingular iff, whenever $P(a_1 X_1^0 + \cdots + a_k X_k^0 = 0) = 1$, $a_1 = \cdots = a_k = 0$; and that this is true iff, whenever $P(b_0 + b_1 X_1 + \cdots + b_k X_k = 0) = 1$, $b_0 = b_1 = \cdots = b_k = 0$.

Let us assume that $\Sigma$ is nonsingular; then $\beta = \Sigma^{-1} c$ and $\hat Y = EY + \sum_{i=1}^k \beta_i X_i^0$.

Note.

i. $\hat Y$ is called the LINEAR REGRESSION of $Y$ on $(X_1, \ldots, X_k)$, or the AFFINE REGRESSION or the LINEAR REGRESSION of $Y$ on $(1, X_1, \ldots, X_k)$.

ii. $\hat Y^0 = \sum_{i=1}^k \beta_i X_i^0$ is the projection of $Y^0$ on $\mathrm{Span}\{X_1^0, \ldots, X_k^0\}$. Thus
\[ \mathrm{Var}\,Y = \|Y^0\|^2 = \|Y^0 - \hat Y^0\|^2 + \|\hat Y^0\|^2 = \mathrm{Var}(Y - \hat Y) + \mathrm{Var}\,\hat Y \]
or, more suggestively, Var(predictand) = Var(residual) + Var(regression).

A related problem concerns
\[ R := \sup_{a_1, \ldots, a_k} \mathrm{Corr}(Y, a_1 X_1 + \cdots + a_k X_k) = \,? \]
We have that
\[ \mathrm{Corr}(Y, a_1 X_1 + \cdots + a_k X_k) = \frac{(Y^0, L)}{\|Y^0\|\,\|L\|}, \]
where $L = \sum a_i X_i^0$. Since $Y^0 = (Y^0 - \hat Y^0) + \hat Y^0$ and $(Y^0 - \hat Y^0) \perp L$,
\[ (Y^0, L) = (\hat Y^0, L) \le \|\hat Y^0\|\,\|L\|, \]
with equality iff $L/\|L\| = d\,\hat Y^0$ for some $d > 0$ (we have used the Cauchy-Schwarz inequality). In particular, $c(\beta_1, \ldots, \beta_k)$ (with $c$ a positive constant) are the maximizing choices of $(a_1, \ldots, a_k)$. Plugging in any one of these maximizing choices gives us that $R = \|\hat Y^0\|/\|Y^0\|$ and hence that $R^2 = \mathrm{Var}\,\hat Y/\mathrm{Var}\,Y$, from which we conclude that
\[ \mathrm{Var}\,\hat Y = R^2\,\mathrm{Var}\,Y \quad \text{and} \quad \mathrm{Var}(Y - \hat Y) = (1 - R^2)\,\mathrm{Var}\,Y. \]
From the above discussion we see that Hilbert spaces are related to regression, and hence to statistics.

Note. Suppose that $k = 1$, and that we have data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ with serial numbers $1, 2, \ldots, n$. We may then let $S$ be the set consisting of the points $(1; x_1, y_1), \ldots, (n; x_n, y_n)$, to each of which we assign probability $1/n$. If we define $X(i, x_i, y_i) = x_i$ and $Y(i, x_i, y_i) = y_i$ for $i = 1, 2, \ldots, n$, then $EX = \bar x$ and $EY = \bar y$. $\hat Y$ is the affine regression of $y$ on $x$ and $R$ is the correlation between $x$ and $y$, which is
\[ r = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{s_x s_y}. \]
This extends also to the case $k > 1$.
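The sketch below is editorial, not from the notes: it solves the normal equations $\beta = \Sigma^{-1}c$ numerically and checks the variance decomposition and $R^2$; the data-generating model is an illustrative assumption.

```python
import numpy as np

# Affine regression of Y on (X1, ..., Xk) via the normal equations beta = Sigma^{-1} c.
rng = np.random.default_rng(2)
n, k = 50_000, 3
X = rng.normal(size=(n, k)) @ np.array([[1.0, 0.3, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
Y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

Sigma = np.cov(X, rowvar=False)                                  # sigma_ij = Cov(X_i, X_j)
c = np.array([np.cov(X[:, i], Y)[0, 1] for i in range(k)])       # c_i = Cov(X_i, Y)
beta = np.linalg.solve(Sigma, c)                                 # normal equations
Yhat = Y.mean() + (X - X.mean(axis=0)) @ beta

R2 = np.var(Yhat) / np.var(Y)
# Var(Y) = Var(residual) + Var(regression), and Var(regression) = R^2 Var(Y)
print(np.var(Y), np.var(Y - Yhat) + np.var(Yhat), R2)
```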
Lecture 5

Classical estimation problem for inference

In the following, $S$ is a sample space, with sample point $s$; $\mathcal{A}$ is a $\sigma$-field on $S$; and $\mathcal{P}$ is a set of probability measures $P_\theta$ on $\mathcal{A}$, indexed by a set $\Theta = \{\theta\}$. We call $\Theta$ the PARAMETER SPACE. (The distinction between probability and statistics is that, in probability, $\Theta$ has only one element, whereas, in statistics, $\Theta$ is richer.) Suppose we are given a real-valued function $g$ on $\Theta$ and a sample point $s \in S$. We are interested in estimating the actual value of $g(\theta)$ using $s$, and describing the quality of the estimate.

Example 1. Estimate $g(\theta)$ from iid $X_i = \theta + e_i$, where the $e_i$ are iid with distribution symmetric around 0. We let $s = (X_1, \ldots, X_n)$ and $\Theta = (-\infty, \infty)$, and define $g$ by $g(\theta) = \theta$ for all $\theta \in \Theta$. We might have:

a. $X_i$'s iid $N(\theta, 1)$.

b. $X_i$'s iid double exponential with density $\tfrac{1}{2}e^{-|x - \theta|}$ (for $-\infty < x < \infty$), with respect to Lebesgue measure.

c. $X_i$'s iid Cauchy, with density $\dfrac{1}{\pi(1 + (x - \theta)^2)}$.

Possible estimates are $t_1(s) = \bar X$, $t_2(s) = \mathrm{median}\{X_1, \ldots, X_n\}$ and $t_3(s) = $ the 10% trimmed mean of $\{X_1, \ldots, X_n\}$; there are many others.

In the general case, $(S, \mathcal{A}, P_\theta)$, $\theta \in \Theta$, an ESTIMATE (of $g(\theta)$) is a measurable function $t$ on $S$ such that
\[ E_\theta(t^2) = \int_S t(s)^2\,dP_\theta(s) < \infty \quad \forall \theta \in \Theta. \]

What is a "good" estimate? Suppose that the loss involved in estimating $g(\theta)$ to be $t$ when it is actually $g$ is $L(t, g)$. (Some important choices of loss functions are $L(t, g) = |t - g|$ - the absolute error - and $L(t, g) = |t - g|^2$ - the square error.) Then the EXPECTED LOSS for a particular estimate $t$ (and $\theta \in \Theta$) is
\[ R_t(\theta) := E_\theta\big(L(t(s), g(\theta))\big); \]
$R_t$ is called the RISK FUNCTION for $t$. For $t$ to be a "good" estimate, we want $R_t$ "small".

We consider now a heuristic for the square error function: Assume that $L \ge 0$ and that, for each $g$, $L(g, g) = 0$ and $L(\cdot, g)$ is a smooth function of $t$. Then
\[ L(t, g) = 0 + (t - g)\,\frac{\partial L}{\partial t}(g, g) + \frac{1}{2}(t - g)^2\,\frac{\partial^2 L}{\partial t^2}(g, g) + \cdots = \frac{1}{2}a(g)(t - g)^2 + \cdots, \]
where $a(g) \ge 0$ (the first-order term vanishes because $L(\cdot, g)$ is minimized at $t = g$). Let us assume that in fact $a(g) > 0$; then we define
\[ R_t(\theta) := \frac{1}{2}a(g)\,E_\theta(t(s) - g(\theta))^2, \]
so that $R_t$ is locally proportional to $E_\theta(t - g)^2$, the MSE of $t$ at $\theta$. Assume henceforth that $R_t(\theta) = E_\theta(t - g)^2$ and denote by $b_t(\theta) = E_\theta(t) - g(\theta)$ the 'bias' of $t$ at $\theta$.

1. $R_t(\theta) = \mathrm{Var}_\theta(t) + [b_t(\theta)]^2$.

Note. It is possible to regard $P_\theta(|t(s) - g(\theta)| > \varepsilon)$ (for $\varepsilon > 0$ small) - i.e., the distribution of $t$ - as a criterion for how "good" the estimate $t$ is. Now, for $Z \ge 0$, $EZ = \int_0^\infty P(Z > z)\,dz$; hence
\[ R_t(\theta) = \int_0^\infty P_\theta\big(|t(s) - g(\theta)| > \sqrt{z}\big)\,dz. \]

There are several approaches to making $R_t$ small. Three of them are:

ADMISSIBILITY: The estimate $t$ is INADMISSIBLE if there is some estimate $t'$ such that $R_{t'}(\theta) \le R_t(\theta)$ for all $\theta \in \Theta$, and the inequality is strict for at least one $\theta$. $t_0$ is admissible if it is not inadmissible. (This may be called the "sure-thing principle".)

MINIMAXITY: The estimate $t_0$ is MINIMAX if
\[ \sup_{\theta \in \Theta} R_{t_0}(\theta) \le \sup_{\theta \in \Theta} R_t(\theta) \]
for all estimates $t$.

BAYES ESTIMATION: Let $\lambda$ be a probability on $\Theta$ and let $\bar R_t = \int_\Theta R_t(\theta)\,d\lambda(\theta)$ be the average risk with respect to $\lambda$. The estimate $t^*$ is then BAYES (with respect to $\lambda$) if $\bar R_{t^*} = \inf_t \bar R_t$.

2. If $t^*$ has constant risk, i.e., $R_{t^*}(\theta) = c$ for all $\theta \in \Theta$, and $t^*$ is Bayes with respect to some probability $\lambda$ on $\Theta$, then $t^*$ is minimax.

Proof. Let $t$ be arbitrary; then
\[ c = \sup_{\theta \in \Theta} R_{t^*}(\theta) = \bar R_{t^*} \le \bar R_t \le \sup_{\theta \in \Theta} R_t(\theta). \] □

3. If $t^*$ is the essentially unique Bayes estimate with respect to a probability $\lambda$ on $\Theta$, then $t^*$ is admissible.

Proof. Suppose that $t$ is such that $R_t(\theta) \le R_{t^*}(\theta)$ for all $\theta \in \Theta$; then $\bar R_t \le \bar R_{t^*}$, so $t$ is also Bayes with respect to $\lambda$. Hence, by the definition of essential uniqueness, $t = t^*$ essentially, and it follows that $R_{t^*}(\theta) = R_t(\theta)$ for all $\theta \in \Theta$. □

Another approach to making $R_t$ small is:

UNBIASEDNESS: We require all estimates $t$ to be unbiased - i.e., $E_\theta(t) = g(\theta) \Leftrightarrow b_t(\theta) = 0$ for all $\theta \in \Theta$.

Several questions arise:

i. Are there any unbiased estimates at all?

ii. If so, which $t$, if any, has minimum variance at a given $\theta$? (We call such a $t$ a LOCALLY MINIMUM-VARIANCE UNBIASED ESTIMATE.)

iii. If there is a locally minimum variance unbiased estimate, is it independent of $\theta$? (If so, then it is the uniformly minimum-variance unbiased estimate. If this estimate exists, what is it?)

There are two approaches: (I) general; and (II) sufficiency (i.e., via complete sufficient statistics).
Chapter 2
Lecture 6

Bayes estimation

We have the setup from the previous lecture: $(S, \mathcal{A}, P_\theta)$, $\theta \in \Theta$. We want to estimate $g(\theta)$. Let $\mathcal{B}$ be a $\sigma$-field in $\Theta$ and $\lambda$ a probability on $\mathcal{B}$. We assume that $g$ is $\mathcal{B}$-measurable and that $g \in L_2(\Theta, \mathcal{B}, \lambda)$ - i.e., that $\int_\Theta g(\theta)^2\,d\lambda(\theta) < \infty$. We regard $\theta$ as a random element and $P_\theta(A)$ as a conditional probability that $s \in A$, given $\theta$. Let $w = (s, \theta)$, $\Omega = S \times \Theta$ and $\mathcal{C} = \mathcal{A} \times \mathcal{B}$ be the smallest $\sigma$-field containing all sets of the form $A \times B$ with $A \in \mathcal{A}$ and $B \in \mathcal{B}$. We assume that $\theta \mapsto P_\theta(A)$ is $\mathcal{B}$-measurable for all $A \in \mathcal{A}$.

Lemma 1. There exists a unique probability measure $M$ on $\mathcal{C}$ such that
\[ M(A \times B) = \int_B P_\theta(A)\,d\lambda(\theta) \quad \forall A \in \mathcal{A},\ B \in \mathcal{B}. \]
($\lambda$ is the distribution of $\theta$, $P_\theta$ is the distribution of $s$ given $\theta$ and $M$ is the joint distribution of $w = (s, \theta)$. We will see the explicit formula for $Q_s$ - the distribution of $\theta$ given $s$ - soon.)

Consider an estimate $t$. Our assumption on $g$ is that
\[ E(g^2) = \int_\Omega g^2\,dM = \int_\Theta g^2\,d\lambda < \infty, \]
so
\[ \bar R_t = \int_\Theta R_t(\theta)\,d\lambda(\theta) = \int_\Theta E_\theta(t - g(\theta))^2\,d\lambda(\theta) = E\big(E((t - g)^2 \mid \theta)\big) = E(t - g)^2 = \|t - g\|^2 \]
(the norm taken in $L_2(\Omega, \mathcal{C}, M)$). We would like to choose a $t$ to minimize this quantity. Since $t$ is a function only of $s$, the desired minimizing estimate - which we will denote by $t^*$ - is the projection of $g$ to the subspace of all $\mathcal{A}$-measurable functions $t(s)$ satisfying $E_M(t(s)^2) < \infty$. We know that $t^*(s) = E(g(\theta) \mid s)$.

4.

a. There exists a minimizing $t^*$ (this is the Bayes estimate for $g$ with respect to $\lambda$).

b. The minimizing $t^*$ is unique up to $\bar P$-equivalence, where
\[ \bar P(A) = \int_\Theta P_\theta(A)\,d\lambda(\theta) \quad \text{for } A \in \mathcal{A}. \]

c.
\[ \bar R_{t^*} = \inf_t \bar R_t = \|t^* - g\|^2 = \|g\|^2 - \|t^*\|^2 = \int_\Omega g^2\,dM - \int_\Omega (t^*)^2\,dM = \int_\Theta g(\theta)^2\,d\lambda(\theta) - \int_S t^*(s)^2\,d\bar P(s). \]

d. $t^*(s) = E(g(\theta) \mid s)$, where $(s, \theta)$ is distributed according to $M$.

Proof. Clear. □

Note. $\bar P$ is the marginal distribution of $s$.

Some explicit formulas

Suppose we begin with a dominated family $\{P_\theta : \theta \in \Theta\}$ - i.e., a family such that there exists a $\sigma$-finite $\mu$ such that each $P_\theta$ admits a density, say $\ell_\theta$, with respect to $\mu$ such that $\ell_\theta(s)$ is measurable, $0 \le \ell_\theta(s) < \infty$ and
\[ P_\theta(A) = \int_A \ell_\theta(s)\,d\mu(s) \quad \forall A \in \mathcal{A}. \]
Two basic examples are:

i. $S = \{s_1, s_2, \ldots\}$ is countable and $\mu$ is counting measure. Then $\ell_\theta(s)$ is the $P_\theta$-probability of the atomic set $\{s\}$.

ii. $S = \mathbb{R}^k$, $\mathcal{A} = \mathcal{B}^k$ and $\mu$ is Lebesgue measure on $\mathbb{R}^k$. Then $\ell_\theta$ is the probability density in the familiar sense.

We assume that $(s, \theta) \mapsto \ell_\theta(s)$ is a $\mathcal{C}$-measurable function. We write $\nu = \mu \times \lambda$.

5.

a. $M$ is given by $\dfrac{dM}{d\nu}(s, \theta) = \ell_\theta(s)$.

b. $\bar P$ is given by $\dfrac{d\bar P}{d\mu}(s) = \int_\Theta \ell_\theta(s)\,d\lambda(\theta)$.

c. For all $s \in S$, $Q_s$, the conditional probability on $\Theta$ given $s$ when $(s, \theta)$ is distributed according to $M$, is well-defined and given by
\[ \frac{dQ_s}{d\lambda}(\theta) = \frac{\ell_\theta(s)}{\bar\ell(s)}, \quad \text{where } \bar\ell(s) = \int_\Theta \ell_\theta(s)\,d\lambda(\theta). \]

d. For all $g \in L_2(\lambda)$,
\[ E(g \mid s) = t^*(s) = \int_\Theta g(\theta)\,dQ_s(\theta) = \int_\Theta g(\theta)\,\frac{\ell_\theta(s)}{\bar\ell(s)}\,d\lambda(\theta). \]

Proof. These are all easy consequences of Fubini's theorem. (For example, since
\[ P(s \in A, \theta \in B) = E\big(I_A(s)I_B(\theta)\big) = E\big(I_B(\theta)E(I_A(s) \mid \theta)\big) = E\big(I_B(\theta)P_\theta(A)\big) = \int_B\Big[\int_A \ell_\theta(s)\,d\mu(s)\Big]d\lambda(\theta) = \int_{A \times B} \ell_\theta(s)\,d\nu(s, \theta), \]
we have (a).) □
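As an editorial illustration of formulas 5(c)-(d) (not part of the original notes), the sketch below computes a posterior and the Bayes estimate for a discrete prior on a grid; the grid, the binomial likelihood and the data values are illustrative assumptions.

```python
import numpy as np

# Posterior dQ_s proportional to ell_theta(s) d(lambda), and t*(s) = E(g(theta) | s)
# with g(theta) = theta, for a uniform grid prior and a binomial likelihood.
theta = np.linspace(0.01, 0.99, 99)
lam = np.full(theta.size, 1.0 / theta.size)     # prior weights lambda({theta})

n, T = 10, 7                                    # data: T successes in n trials
lik = theta**T * (1 - theta)**(n - T)           # ell_theta(s)

post = lik * lam / np.sum(lik * lam)            # dQ_s on the grid
t_star = np.sum(theta * post)                   # posterior mean = Bayes estimate
print(t_star)                                   # close to (T + 1) / (n + 2) here
```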
Lecture 7

In any $V = L_2(\Omega, \mathcal{C}, M)$, the constant functions form a subspace, which we will denote by $W_c = W_c(\Omega, \mathcal{C}, M)$. The projection of $f \in V$ on $W_c$ is just $E(f)$.

Note. In the present context, with $w = (s, \theta)$ and $\Omega = S \times \Theta$, $s$ is the datum and $\theta$ is the unknown parameter, $\lambda$ is the prior distribution of $\theta$, $Q_s$ is the posterior distribution (after $s$ is observed) of $\theta$, $t^*(s) = E(g \mid s)$ is the posterior mean of $g(\theta)$, $\theta \mapsto \ell_\theta(s)$ is the likelihood function and $\bar\ell = \dfrac{d\bar P}{d\mu}$.

Note. If we do have a $\lambda$ on hand but no data, then the Bayes estimate is just $Eg = \int_\Theta g(\theta)\,d\lambda(\theta)$.

6. If $t^*$ is a Bayes estimate of $g$, then $t^*$ cannot be unbiased, except in trivial cases.

Proof. Suppose that $t^*$ is unbiased and Bayes (with respect to $\lambda$). Then, by unbiasedness,
\[ (t^*, g) = E(t^* g) = E\big(E(t^* g \mid \theta)\big) = E\big(g(\theta)E(t^* \mid \theta)\big) = E\big(g(\theta)^2\big) = \|g\|^2; \]
but also, since $t^*$ is a Bayes estimate,
\[ (t^*, g) = E\big(E(t^* g \mid s)\big) = E\big(t^*(s)E(g \mid s)\big) = E\big(t^*(s)^2\big) = \|t^*\|^2. \]
From this we conclude that
\[ \bar R_{t^*} = \|t^* - g\|^2 = \|t^*\|^2 + \|g\|^2 - 2(t^*, g) = 0 \Leftrightarrow \int_\Theta P_\theta\big(t^*(s) \ne g(\theta)\big)\,d\lambda(\theta) = 0 \Leftrightarrow P_\theta\big(t^*(s) = g(\theta)\big) = 1 \text{ a.e.}(\lambda). \]
This last statement, though, holds iff there is an estimate $t$ such that $P_\theta(t(s) = g(\theta)) = 1$ a.e.$(\lambda)$. Hence, except in this trivial case, $t^*$ cannot be both Bayes and unbiased. □

Example 1(a). $s = (X_1, X_2, \ldots, X_n)$, $X_i \sim N(\theta, 1)$ and $\Theta = \mathbb{R}^1$. Let $\mu$ be Lebesgue measure; then $d\mu = dX_1 \cdots dX_n$ and $\ell_\theta(s) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}e^{-(X_i - \theta)^2/2}$. Consider the estimation of $g(\theta) = \theta$. $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ is an unbiased estimate of $g$ (in fact, it is the UMVUE of $\theta$), so $\bar X$ is not Bayes with respect to any $\lambda$. We will see later that:

i. $\bar X$ is minimax,

ii. $\bar X$ is the pointwise limit of Bayes estimates and

iii. $\bar X$ is admissible.

Suppose that $\lambda$ is the $N(0, \sigma^2)$ distribution, so that $d\lambda(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\theta^2/(2\sigma^2)}\,d\theta$. Given $s$, $dQ_s$ is proportional to $\ell_\theta(s)\,d\lambda(\theta)$, i.e., $dQ_s(\theta) = \varphi_1(s)\ell_\theta(s)\,d\lambda(\theta)$ for some function $\varphi_1$ of $s$. Now
\[ \ell_\theta(s) = (2\pi)^{-n/2} e^{-\frac{n}{2}(\theta - \bar X)^2 - \frac{n}{2}v}, \quad \text{where } v = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2, \]
so that
\[ dQ_s(\theta) = \varphi_1(s)\varphi_2(s)\,e^{-\frac{n}{2}(\theta - \bar X)^2 - \frac{\theta^2}{2\sigma^2}}\,d\theta. \]
Completing the square,
\[ Q_s \sim N\Big(\frac{\bar X}{1 + h^2},\ \frac{1}{n(1 + h^2)}\Big) \]
(where $h^2 = (1/n)/\sigma^2$ is the ratio of observation variance to prior variance) and $t^*(s) = E(\theta \mid s) = \dfrac{\bar X}{1 + h^2}$. Therefore
\[ t^*(s) = \frac{1}{1 + h^2}\,\bar X + \frac{h^2}{1 + h^2}\cdot 0, \]
where the 0 arises from the fact that it is the Bayes estimate with no data (the prior mean); the two terms represent a weighted average of data mean and prior mean, with weights proportional to inverse variances. We have
\[ E_\theta(t^*) = \frac{\theta}{1 + h^2}, \qquad b_{t^*}(\theta) = -\frac{h^2}{1 + h^2}\,\theta, \qquad \mathrm{Var}_\theta(t^*) = \frac{1}{n(1 + h^2)^2}, \]
\[ R_{t^*}(\theta) = \frac{1}{n(1 + h^2)^2} + \Big(\frac{h^2}{1 + h^2}\Big)^2\theta^2 \qquad \text{and} \qquad \bar R_{t^*} = \frac{1}{n(1 + h^2)}. \]
(The last can be checked by noting that $\bar R_{t^*} = \|\theta - t^*\|^2 = \|\theta\|^2 - \|t^*\|^2$.)
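The following sketch is editorial (not from the notes) and simply evaluates the posterior of Example 1(a) on simulated data; the sample size, prior variance and true $\theta$ are illustrative assumptions.

```python
import numpy as np

# X_i ~ N(theta, 1) iid, prior theta ~ N(0, sigma^2).  With h^2 = (1/n)/sigma^2 the
# posterior is N(xbar/(1+h^2), 1/(n(1+h^2))) and the Bayes estimate shrinks xbar to 0.
rng = np.random.default_rng(3)
n, sigma2, theta_true = 25, 4.0, 1.3
x = rng.normal(theta_true, 1.0, size=n)

h2 = (1.0 / n) / sigma2
t_star = x.mean() / (1.0 + h2)          # posterior mean E(theta | s)
post_var = 1.0 / (n * (1.0 + h2))

print(x.mean(), t_star, post_var)        # t_star lies slightly closer to 0 than xbar
```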
Lecture 8

In the framework $(S, \mathcal{A}, P_\theta)$, $\theta \in \Theta$, set up in the previous lectures, suppose that $dP_\theta = \ell_\theta\,d\mu$ on $S$, where $\mu$ is a fixed measure (this is another way of saying that $\ell_\theta$ is the likelihood function). Given $s$, suppose $\hat\theta = \hat\theta(s)$ is a point in $\Theta$ such that $\ell_{\hat\theta(s)}(s) = \sup_{\theta \in \Theta}\ell_\theta(s)$. Then $\hat\theta$ is the (or an) ML estimate of $\theta$. Given a function $g$ on $\Theta$, $g(\hat\theta(s))$ is the ML estimate of $g(\theta)$.

Homework 2

1. What is $\hat\theta$ in Example 1(b) (Lecture 5)? In Example 1(c) (no explicit answer is available in this case)? Assume that $\Theta = \mathbb{R}^1$.

We return to our investigation of Example 1(a).

Example 1. Here $\mu$ is $n$-dimensional Lebesgue measure and $\ell_\theta(s) = \varphi(s)e^{-\frac{n}{2}(\bar X - \theta)^2}$, so that $\hat\theta(s) = \bar X$. The ML estimate of $\theta^2$ is $\bar X^2$, the ML estimate of $|\theta|$ is $|\bar X|$, etc. Under square error loss, we know that, if $\lambda$ is the $N(0, \sigma^2)$ distribution with $\sigma^2 = \frac{1}{nh^2}$, then the Bayes estimate of $\theta$ is $t^*(s) = \frac{\bar X}{1 + h^2}$. The Bayes estimate of $\theta^2$ is
\[ \int_{\mathbb{R}^1}\theta^2\,dQ_s(\theta) = \frac{\bar X^2}{(1 + h^2)^2} + \frac{1}{n(1 + h^2)}, \]
and the Bayes estimate of $|\theta|$ is $\int_{\mathbb{R}^1}|\theta|\,dQ_s(\theta)$, which is a sum involving special functions.

Note.

i. We have seen that the Bayes estimates of $\theta$ converge to $\bar X$, which is ML and UMVUE, as $h \to 0$ (i.e., $\sigma^2 \to +\infty$). This does not always happen, however: the Bayes estimates of $\theta^2$ converge to $\bar X^2 + \frac{1}{n}$, the ML estimate of $\theta^2$ is $\bar X^2$ and the UMVUE of $\theta^2$ is $\bar X^2 - \frac{1}{n}$. These three are not identical, but they are close if $n$ is large.

ii. The case $h = 0$ ($\sigma^2 = +\infty$) corresponds to uniform prior ignorance about $\theta$; i.e., for two intervals $(a, b)$ and $(c, d)$ in $\mathbb{R}^1 = \Theta$, $\frac{\lambda((a,b))}{\lambda((c,d))} \to \frac{b - a}{d - c}$ as $h \downarrow 0$ (from Homework 2).

iii. $t^*(s) = \frac{\bar X}{1 + h^2}$ ($h$ fixed) is admissible because $t^*$ is an essentially unique Bayes estimate in $L_2(M)$ (i.e., if $t_0$ is also Bayes, then $\bar P(t_0 = t^*) = 1$, which is equivalent to saying that $P_\theta(t_0 = t^*) = 1$ for all $\theta \in \Theta$). It follows that, for any constant $c$, $t(s) = a\bar X + (1 - a)c$ is admissible for any $0 < a < 1$. (Let $X_i' = X_i + c$ and $\theta' = \theta + c$, and apply the above result.)

Homework 2

2. Find a necessary and sufficient condition such that $b_0 + b_1 X_1 + \cdots + b_n X_n$ be admissible for $\theta$.

For any $h > 0$, $R_{t^*}(\theta) = \frac{1}{n(1 + h^2)^2} + \big(\frac{h^2}{1 + h^2}\big)^2\theta^2$, so $\sup_{\theta \in \Theta}R_{t^*}(\theta) = +\infty$ and $t^*$ is not minimax. We do have, however, that:

i. $\bar X$ is minimax.

Proof. Choose any estimate $t$. Then
\[ \sup_\theta R_t(\theta) \ge \bar R_t \ge \bar R_{t^*} = \frac{1}{n(1 + h^2)}, \]
so, since $h$ is arbitrary, $\sup_\theta R_t(\theta) \ge \frac{1}{n}$, which is the (constant) risk of $\bar X$. Thus $\bar X$ is minimax. □

ii. $\bar X$ is admissible.

Proof. Let $t$ be an estimate such that
\[ R_t(\theta) \le R_{\bar X}(\theta) = \frac{1}{n} \quad \forall\theta \in \Theta, \]
and let, for $h > 0$, $\lambda$ be the $N(0, 1/nh^2)$ distribution. We have that $\bar R_t \ge \bar R_{t^*}$ because $t^*$ is Bayes for $\lambda$. Now $t - \theta = (t - t^*) + (t^* - \theta)$ and $(t - t^*) \perp (t^* - \theta)$ in $L_2(M)$, so
\[ \|t - t^*\|^2 = \|t - \theta\|^2 - \|t^* - \theta\|^2 = \bar R_t - \bar R_{t^*} \le \frac{1}{n} - \frac{1}{n(1 + h^2)} = \frac{h^2}{n(1 + h^2)}. \]
We have also that
\[ \|t - t^*\|^2 = \int_\Omega (t - t^*)^2\,dM = \int_S (t - t^*)^2\,d\bar P = \int_{\mathbb{R}^{n+1}} \big(t(s) - t^*(s)\big)^2\,\ell_\theta(s)\,\frac{\sqrt{nh^2}}{\sqrt{2\pi}}\,e^{-nh^2\theta^2/2}\,d\mu(s)\,d\theta. \]
From these two equalities we conclude that
\[ \int_{\mathbb{R}^{n+1}} \big(t(s) - t^*(s)\big)^2\,\ell_\theta(s)\,e^{-nh^2\theta^2/2}\,d\mu(s)\,d\theta \le \frac{h^2}{n(1 + h^2)}\cdot\frac{\sqrt{2\pi}}{\sqrt{nh^2}} \to 0 \quad \text{as } h \to 0. \]
Letting $h \to 0$ and using Fatou's lemma and the fact that $t^* \to \bar X$ (as $h \to 0$), we have that
\[ \int_{\mathbb{R}^{n+1}} \big(t(s) - \bar X\big)^2\,\ell_\theta(s)\,d\mu(s)\,d\theta = 0 \]
and hence that $(t(s) - \bar X)^2\ell_\theta(s) = 0$ a.e. (with respect to Lebesgue measure) in $\mathbb{R}^{n+1}$. Since $\ell_\theta(s) > 0$ for all $(s, \theta) \in \mathbb{R}^{n+1}$ by presumption, we have that
\[ \big(t(s) = \bar X \text{ a.e. (Lebesgue) on } \mathbb{R}^{n+1}\big) \Rightarrow \big(t(s) = \bar X \text{ a.e. (Lebesgue) on } S = \mathbb{R}^n\big) \Rightarrow \big(P_\theta(t(s) = \bar X) = 1\ \forall\theta \in \Theta\big) \Rightarrow \big(R_t(\theta) = R_{\bar X}(\theta) = \tfrac{1}{n}\ \forall\theta \in \Theta\big). \] □
Lecture 9

Homework 2

3. In Example 1(a), what is the marginal distribution $\bar P$ of $s$? Are the $X_i$'s normal and independent under $\bar P$? What is the distribution of $\bar X$ under $\bar P$? Here the prior is assumed to be $\lambda \sim N(0, 1/nh^2)$.

Example 2(a). Let $n$ be a fixed positive integer and $s = (X_1, \ldots, X_n)$ with the $X_i$ iid random variables assuming the values 1 and 0 with probabilities $\theta$ and $1 - \theta$ respectively, and $\Theta = [0, 1]$. Let $\mu$ be counting measure on $S$, the set of $2^n$ possible values assumed by $s$. Then
\[ \ell_\theta(s) = P_\theta(\{s\}) = \theta^{T(s)}(1 - \theta)^{n - T(s)}, \]
where $T(s) = \sum_{i=1}^n X_i$. The ML estimate is $\hat\theta(s) = T(s)/n = \bar X$, which is unbiased for $\theta$. (We will see that in fact $\bar X$ is the UMVUE.) $R_{\bar X}(\theta) = \mathrm{Var}_\theta(\bar X) = \theta(1 - \theta)/n$. The ML estimate of $\theta(1 - \theta)/n$ is just $\bar X(1 - \bar X)/n$, which is not unbiased. We shall see later that $\bar X(1 - \bar X)/(n - 1)$ is the UMVUE for $\theta(1 - \theta)/n$.

Let $\lambda$ be the Beta distribution
\[ d\lambda(\theta) = \frac{\theta^{a-1}(1 - \theta)^{b-1}}{B(a, b)}\,d\theta \]
for $0 < \theta < 1$, with parameters $a, b > 0$. Here $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$ is the Beta function; it is easy to check that $E_\lambda(\theta) = \frac{a}{a + b}$ and $\mathrm{Var}_\lambda(\theta) = \frac{ab}{(a + b)^2(a + b + 1)}$.

We visualize the product space $\Omega = S \times \Theta$ as a unit interval attached to each of the $2^n$ points of $S$. We let $\mu$ be counting measure on $S$, $\lambda$ the prior measure on $\Theta$ and $\nu = \mu \times \lambda$ the product measure on $\Omega$. We have $dM(s, \theta) = \ell_\theta(s)\,d\nu(s, \theta)$ on $\Omega$, so that $M(C) = \int_C \ell_\theta(s)\,d\nu(s, \theta)$ for all $C \in \mathcal{C} = \mathcal{A} \times \mathcal{B}$, and
\[ dQ_s(\theta) = \varphi_1(s)\,\theta^{a + T(s) - 1}(1 - \theta)^{b + n - T(s) - 1}\,d\theta, \]
where $\varphi_1(s) = B(a + T(s), b + n - T(s))^{-1}$, so that $Q_s$ is the $B(a + T(s), b + n - T(s))$ distribution.

The Bayes estimate for $\theta$ is $t^*(s) = E_{Q_s}(\theta) = \dfrac{a + T(s)}{a + b + n}$; similarly, the Bayes estimate for $\theta(1 - \theta)$ is $E_{Q_s}(\theta) - E_{Q_s}(\theta^2)$.

If we choose $a = \frac{\sqrt n}{2} = b$, then $t^*$ becomes
\[ t^*_m = \frac{T + \sqrt n/2}{n + \sqrt n} \]
and
\[ R_{t^*_m}(\theta) = \frac{1}{4(\sqrt n + 1)^2}, \]
a constant. Hence $t^*_m$ is a Bayes estimate with constant risk; therefore it must be minimax. The risk functions of $\bar X$ and $t^*_m$ can be compared: $R_{\bar X}(\theta) = \theta(1 - \theta)/n$ is a parabola vanishing at $\theta = 0, 1$ with maximum $1/(4n)$ at $\theta = 1/2$, while $R_{t^*_m}$ is the constant $1/(4(\sqrt n + 1)^2)$, which is smaller than $1/(4n)$. As $n \to \infty$, both risks tend to 0. Neither $\bar X$ nor $t^*_m$ is perfect; for example, if $n = 100$ and $T = 0$, then $\bar X = \hat\theta = 0$, which is too low, but $t^*_m = 5/110 \approx 4.5\%$, which may be too high.

Note. With $a = \frac{\sqrt n}{2} = b$, the prior mean is $\frac{1}{2}$ and the prior variance is $\dfrac{1}{4(\sqrt n + 1)}$.
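The short sketch below is editorial (not from the notes): it evaluates the two risk functions just discussed for an illustrative sample size.

```python
import numpy as np

# Risks in Example 2(a): xbar = T/n versus the constant-risk (minimax) Bayes
# estimate t*_m = (T + sqrt(n)/2) / (n + sqrt(n)).
n = 100
theta = np.linspace(0, 1, 201)

risk_xbar = theta * (1 - theta) / n                              # Var of T/n
risk_tm = np.full_like(theta, 1 / (4 * (np.sqrt(n) + 1) ** 2))   # constant risk

print(risk_xbar.max(), risk_tm[0])   # 1/(4n) = 0.0025 versus ~0.00207
```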
Homework 2

4. Show that $\bar X$ is admissible in two ways:

a. Show that $\bar X$ is the pointwise limit of $t^*$ as $a \downarrow 0$ and $b \downarrow 0$.

b. Redefine the loss function by $L(t, \theta) = \dfrac{(t - \theta)^2}{\theta(1 - \theta)}$, so that $\bar X$ becomes a Bayes estimate with constant risk. (Admissibility with respect to this loss function is equivalent to admissibility with respect to the loss function $L(t, \theta) = (t - \theta)^2$.)

5. Show that $\bar X$ is the unique Bayes estimate with respect to some $\lambda$.
Lecture 10

Example 2(b) (negative binomial sampling). We have $\theta \in \Theta = (0, 1)$. We choose a positive integer $k$ and observe the iid
\[ X_i = \begin{cases} 1 & \text{with probability } \theta \\ 0 & \text{with probability } 1 - \theta \end{cases} \]
until exactly $k$ 1's are observed. Let $N$ be the total number of $X_i$'s observed. Here $s = (X_1, \ldots, X_N)$ and $N$ is a random variable. Now let $S$ be the set of all possible values of $s$; then $S$ is countable (obviously $N \ge k$; there are $\binom{r-1}{k-1}$ values of $s$ corresponding to $N = r$). Let $\mu$ be counting measure on $S$ (such that sample points with the same $N$ have the same probability). Then
\[ \ell_\theta(s) = P_\theta(\text{observing } s) = \theta^{k-1}(1 - \theta)^{N(s)-k}\cdot\theta = \theta^k(1 - \theta)^{N(s)-k}. \]
The MLE of $\theta$ is $\bar X = k/N(s)$, which is not unbiased. Note that $N = N_1 + \cdots + N_k$, where $N_1$ is the number of trials until the first 'success' (i.e., observation of a 1), $N_2$ is the number of additional trials required for the second success, etc. The $N_i$'s are iid, so that $E_\theta(N) = kE_\theta(N_1)$ and $\mathrm{Var}_\theta(N) = k\,\mathrm{Var}_\theta(N_1)$. Since $P_\theta(N_1 = r) = (1 - \theta)^{r-1}\theta$ (for $r = 1, 2, \ldots$), we have $E_\theta(N_1) = 1/\theta$ and $\mathrm{Var}_\theta(N_1) = (1 - \theta)/\theta^2$. Thus
\[ E_\theta(N) = \frac{k}{\theta} \qquad \text{and} \qquad \mathrm{Var}_\theta(N) = \frac{k(1 - \theta)}{\theta^2}; \]
remember, however, that, by the Cauchy-Schwarz inequality, $E(X)E(1/X) \ge 1$ for any random variable $X > 0$, with equality iff $P(X = c) = 1$, and so
\[ E_\theta\Big(\frac{k}{N}\Big) \ge \frac{k}{E_\theta(N)} = \theta \]
- i.e., $\bar X = k/N$ is biased upwards. It can be shown by the Rao-Blackwell theorem that the estimate $t = \frac{k-1}{N-1}$ is unbiased when $k \ge 2$. In fact, $t$ is (by the Lehmann-Scheffe theorem or a geometrical approach) the UMVUE. (Heuristically, we see that, if $s = (X_1, \ldots, X_{N-1}, X_N)$, then necessarily $X_N = 1$ (we stop as soon as we observe the $k$-th 1) and so only $(X_1, \ldots, X_{N-1})$ constitute the active part. Then
\[ t(s) = \frac{\text{number of successes in active part}}{\text{number of trials in active part}} = \frac{k - 1}{N - 1}, \]
and one sees that this is unbiased.) Directly,
\[ P_\theta(N = r) = \binom{r-1}{k-1}\theta^k(1 - \theta)^{r-k} \]
for $r = k, k + 1, \ldots$, so that
\[ E_\theta\Big(\frac{k-1}{N-1}\Big) = \sum_{r=k}^{\infty}\frac{k-1}{r-1}\binom{r-1}{k-1}\theta^k(1 - \theta)^{r-k} = \theta\sum_{r=k}^{\infty}\binom{r-2}{k-2}\theta^{k-1}(1 - \theta)^{(r-1)-(k-1)} = \theta. \]
We have
\[ \bar X - t(s) = \frac{k}{N} - \frac{k-1}{N-1} = \frac{N - k}{N(N - 1)} \ge 0, \]
the inequality being strict with positive probability; so $E_\theta(\bar X) > E_\theta(t) = \theta$.
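A Monte Carlo check of this bias comparison follows; it is an editorial sketch (not from the notes), and the values of $\theta$, $k$ and the number of replications are illustrative assumptions.

```python
import numpy as np

# With inverse (negative binomial) sampling, xbar = k/N overestimates theta,
# while t = (k-1)/(N-1) is unbiased.
rng = np.random.default_rng(4)
theta, k, reps = 0.3, 5, 200_000

# numpy returns the number of failures before k successes, so total trials = k + failures
N = k + rng.negative_binomial(k, theta, size=reps)

print("E(k/N)         ~", np.mean(k / N))              # noticeably above theta
print("E((k-1)/(N-1)) ~", np.mean((k - 1) / (N - 1)))  # close to theta = 0.3
```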
Bayes estimates

Let $\lambda$ be a prior probability measure on $(0, 1)$. As always, $dQ_s(\theta) = \varphi_1(s)\ell_\theta(s)\,d\lambda(\theta)$. Since $\ell_\theta(s)$ is as in Example 2(a), formally the Bayes estimate here is identical to the one there. In particular, $\dfrac{a + k}{a + b + N}$ is admissible and Bayes with respect to the $B(a, b)$ prior with $a, b > 0$.

Note. Although the MLEs in Examples 2(a) and 2(b) are formally identical, the risk functions are different. In Example 2(b), $R_{\bar X}(\theta) = \mathrm{Var}_\theta(\bar X) + [E_\theta(\bar X - \theta)]^2$ and $R_t(\theta) = \mathrm{Var}_\theta(t)$ are complicated expressions.

Example 2(c). In Example 2(a) sampling stops after a fixed number of trials, in Example 2(b) it stops at the $k$-th success, and a third possibility is a two-stage sampling scheme. (The original notes contain a diagram of the stopping points for the three schemes.) Here (as in any scheme) the likelihood function is $\ell_\theta(s) = \theta^{T(s)}(1 - \theta)^{N(s)-T(s)}$, where $T(s)$ is the number of successes (and, of course, $N(s) - T(s)$ is the number of failures), $\mu$ is counting measure and the MLE is $\hat\theta(s) = \frac{T(s)}{N(s)}$ always. $s = (X_1, \ldots, X_N)$, where $N = n_1$ or $N = n_1 + n_2$ depending on $s$. How do we estimate $\theta$? What is the precision of this estimate?
Chapter 3
Lecture 11

Unbiasedness has an appealing property, which we discuss here: Choose any estimate $t(s)$. Imagining for the moment that $s$ is unknown but $\theta$ is provided, what is the best predictor for $t$? Let $\lambda$ be the prior; this determines $M$, as above. Regard $t$ and $g$ as elements of $L_2(M)$.

7. ($t$ is an unbiased estimate of $g$) $\Leftrightarrow$ (for any choice of a probability $\lambda$ on $\Theta$, $g$ is the best (in MSE) predictor for $t$).

Proof. If $t$ is an unbiased estimate of $g$, then, for any $\lambda$, $E(t \mid \theta) = g$ - i.e., $g$ is the projection of $t$ to the subspace of functions in $L_2(M)$ which depend only on $\theta$; or, equivalently, $g$ is the best predictor of $t$ in the sense of $\|\cdot\|_M$. Conversely, assume that each one-point set in $\Theta$ is measurable and take $\lambda$ to be degenerate at a point $\theta$. The assumption that $g$ is the best predictor of $t$ tells us that $g(\theta) = E_\theta(t)$ or, equivalently, that $t$ is an unbiased estimate of $g$. □

Unbiased estimation; likelihood ratios

Choose and fix a $\theta \in \Theta$ and let $\delta \in \Theta$. Assume that $P_\delta$ is absolutely continuous with respect to $P_\theta$ on $\mathcal{A}$; then, by the Radon-Nikodym theorem, there exists an $\mathcal{A}$-measurable function $\Omega_{\delta,\theta}$ satisfying $0 \le \Omega_{\delta,\theta} < +\infty$ and $dP_\delta = \Omega_{\delta,\theta}\,dP_\theta$ (i.e., $P_\delta(A) = \int_A \Omega_{\delta,\theta}(s)\,dP_\theta(s)$ for all $A \in \mathcal{A}$).

Note. Suppose that we begin with $dP_\delta(s) = \ell_\delta(s)\,d\mu(s)$ on $S$, where $\mu$ is given, and that we know that $P_\theta(A) = 0 \Rightarrow P_\delta(A) = 0$ (i.e., that $P_\delta$ is absolutely continuous with respect to $P_\theta$). Then
\[ \Omega_{\delta,\theta}(s) = \frac{\ell_\delta(s)}{\ell_\theta(s)} \]
is an explicit formula for the likelihood ratio. In fact $\Omega_{\delta,\theta}$ can be defined arbitrarily on the set $\{s : \ell_\theta(s) = 0\}$.

In estimating $g$ on the basis of $s$, let $U_g$ be the class of all unbiased estimates of $g$. For an estimate $t \in U_g$, the risk function is given by $R_t(\theta) = E_\theta(t - g)^2 = \mathrm{Var}_\theta(t)$. Two questions arise immediately: What is the infimum (over $U_g$) of the variances at a given $\theta$ of the various estimates of $g$? Is it attained?

Remember that we fixed a $\theta \in \Theta$ above. Let $V_\theta = L_2(S, \mathcal{A}, P_\theta)$; then we assume throughout that
\[ \{\Omega_{\delta,\theta} : \delta \in \Theta\} \subset V_\theta, \]
i.e., that $E_\theta(\Omega_{\delta,\theta}^2) < +\infty$. Let $W_\theta$ be the subspace of $V_\theta$ spanned by $\{\Omega_{\delta,\theta} : \delta \in \Theta\}$.

8.

a. $U_g$ is non-empty iff $U_g \cap W_\theta$ is non-empty. We assume henceforth that $U_g$ is non-empty. Then:

b. $U_g \cap W_\theta$ contains (essentially) only one estimate $\hat t$.

c. $\hat t$ is the orthogonal projection on $W_\theta$ of every $t \in U_g$.

d. $\mathrm{Var}_\theta(t) \ge \mathrm{Var}_\theta(\hat t)$ for all $t \in U_g$.

Note. The above means that $\hat t \in U_g \cap W_\theta$ is the LMVUE of $g$ at $\theta$. $\hat t$ often depends on $\theta$, and this is the problem in practice.

Proof of (8). Note first that

1. $1 \in W_\theta$ (since $\Omega_{\theta,\theta} = 1$).

2. For any $t$, $E_\delta(t) = \int_S t(s)\,dP_\delta(s) = \int_S t(s)\Omega_{\delta,\theta}(s)\,dP_\theta(s) = (t, \Omega_{\delta,\theta})_\theta$, where $(\cdot,\cdot)_\theta$ is the inner product in $L_2(S, \mathcal{A}, P_\theta)$.

To prove (a), suppose that $U_g$ is non-empty. Let $t \in U_g$ and define $\hat t = \pi t$, where $\pi = \pi_{W_\theta}$ is the orthogonal projection on $W_\theta$. Then, for any $\delta \in \Theta$, since $t - \hat t \perp W_\theta$ and $\Omega_{\delta,\theta} \in W_\theta$,
\[ E_\delta(\hat t) = (\hat t, \Omega_{\delta,\theta})_\theta = (t, \Omega_{\delta,\theta})_\theta = E_\delta(t) = g(\delta), \]
so $\hat t \in U_g \cap W_\theta$.

To prove (b), suppose $t_1, t_2 \in U_g \cap W_\theta$; then
\[ (t_1 - t_2, \Omega_{\delta,\theta})_\theta = E_\delta(t_1 - t_2) = g(\delta) - g(\delta) = 0 \quad \forall\delta \in \Theta. \]
Hence $(t_1 - t_2) \perp \Omega_{\delta,\theta}$ for all $\delta \in \Theta$, and so $(t_1 - t_2) \perp W_\theta$; but $t_1 - t_2 \in W_\theta$, so
\[ (t_1 - t_2) \perp (t_1 - t_2) \Rightarrow t_1 - t_2 = 0 \Rightarrow P_\theta(t_1 = t_2) = 1. \]
It follows by absolute continuity that $P_\delta(t_1 = t_2) = 1$ for all $\delta \in \Theta$.

(c) follows from (b) and the above construction. (d) follows from (c) since $\hat t$ is unbiased for $g$. □
Lecture 12 We may restate (8) as follows: 8'.
a. For some t G Wθ, Eδ(t) = Eδ(t) for all δ G θ and t G Ug. b. πWθt is such a ί, and is the (essentially) unique such. c. We have that 2
Ri(θ) = Eθ(t-g(θ))
2
2
= Var,(ί) + [bt(θ)] < Var,(ί) + [bt(θ)] = i?,(0)
with equality iff ί = t. d. ί is (essentially) the only unbiased estimate of g which belongs to Wθ. 9.
a. An estimate t is the locally MVUE of g(δ) := £*(*) at θ iff ί has finite variance at each δ and t eWg. b. An estimate t is the UMVUE of g{θ) := ^ ( ί ) iff ί G f | ^ θ ^ that Ω w G i 2 ( i ^ ) for all 5,5 G θ ) .
9(b) above raises the question: Can we describe C := f]θeΘ WQΊ contains the constant functions; does it contain any others?
(
we
assume
We know it
10 (Lehman-Scheffe). Write V = p | VΘ Π {υ : Eδ(υ) = 0 V5 G θ } . If ί has finite variance for each δ (i.e., t G Πfleθ^)' then ί G C iff, for each δ G θ , we have = 0 Vu G V. Proof. Suppose that t G C. Then t ±^ WδL for all 5 G Θ . NOW, for all ueV,u is an unbiased estimate of 0; from (8), we know that 0 is the projection of u to any Wδ. Since u — 0 + u, we must therefore have u G W^~, so that t _L$ u for each 5 - i.e., Eδ(tu) = 0 for all δ. Conversely, fix a θ G θ and write t = πt + u, where u = t — πt and TΓ = π ^ Then £"j(^) = 0 for all δ and hence, by hypothesis, we have that Eθ(u2) +Eθ(u
πt) - Eθ((πt + u)u) = Eθ(tu) = 0 => Eθ(u2) = ~Eθ(u
πt) = - ( π ί , u) = 0
- i.e., u = 0 a.e.(Pfl) and hence, by absolute continuity of each Pδ, u = 0 a.e.(P^) also for every δ G θ . This means that ί = πt = π ^ ί => ί G W^; since ^ G θ was arbitrary, this means that t G Πfleθ Wfl = C as desired. D
27
Example l(d). We have s = (Xu . . . , Xn), with the X{ iid as N{θ, 1), and θ = {1,2}. We have explicitly that 2 £θ(s) oc e - t ^ - " )
Choose 0 = 1; then nΎ
We = Span{Ω u , Ω 21 } = Span{l, e }
nX
= {a + be
: α, 6 G R}.
Let (/(<$) = 5. Since X is an unbiased estimate of g, we have a unique unbiased estimate of g in We- Hence we want
E(a + ben*J = l 6nX)
2
Since V ^ ( ^ - S) ~ iV(0,1) for 5 G Θ , under 5, using the MGF of iV(0,1), we have
Eδ(en*) = e^Esie
^
for any δ e θ. Solving (*), we find a and b (b > 0). Thus α + &enX is LMVU for Eo(Xχ) at 0 = 1. This is not, however, a reasonable estimate. We already know that θ = {1,2}, but this estimate takes values in (—00,00). (Since θ is not connected, we don't have Taylor's theorem here. Also, the LMVUE at θ = 2 is a very different function of X.) This is absurd. MSE is not suitable because g takes on only two values. We try changing our parameter space to θ = (£, u). Now Wβ = SpanjΩ^ : ί < δ < u} = Span{e ί X : t is sufficiently small} (in the last set, H is sufficiently small' means 'for t in a fixed neighbourhood of 0'). It can be shown that Wθ = {f(X) Proof (outline). Since ^zζ*
: / is a Borel function and Eθf2 < +00}. e Wθ, we have that ftetΎ
e Wθ. Hence XetΎ tx
e Wθ
X
for |ί| sufficiently small. Iterating this reasoning gives us that X e ,..., X V , . . . G Wθ for |ί| sufficiently small (what "sufficiently small" means depends on i). Thus 1, X, X , . . . G Wβ and hence We is as desired. Since X is an unbiased estimate of E$(Xι) which belongs to We, X is LMVU at 0, and hence X is the UMVUE; since X - M s an unbiased estimate of [Eδ(Xι)]2, ~X2 - M s the UMVUE for [Eδ{Xι)]2. (Here VF0 essentially does not depend on 0, and C = {/(X) : / is Borel and Eθf < +00 V0 G θ } 28
by our above computation.) Let A C S be such that Pλ(A) φ P2(A)(ϊor example, A = {s : Xχ(s) > 3/2}). Then a + HA is an unbiased estimate of θ if α and b are chosen properly. Indeed, there are many unbiased estimates. To find the "best", we try to minimize variances, noting that _ nΎ Wι = Span{Ωn, Ω 2 i } = {a + be : α, δ G M} is the class of all estimates which are unbiased for their own expected values and have minimum variance when θ = 1 and hence that there is a t\ G W\ such that E&(t\) = δ for δ = 1, 2. (Exercise: What is ίi?). nX Similarly, W2 = {a + be~ : a,b eR} and there is a ί2 € ^ 2 such that £k(ί 2 ) = 5 for 5 = 1,2. (Exercise: What is £2?) *i Φ h, however; in fact, C is the set of all UMVUEs, which is just the set of constant functions. As noted, the Neyman-Pearson theory implies that we should use a + HA with A = {s : X > c} and b > 0. We should also restrict the estimation theory to a continuum of values (i.e., should have only connected θ ) .
29
Chapter 4
Lecture 13

The score function, Fisher information and bounds

Let $\Theta$ be an open interval in $\mathbb{R}^1$ and suppose that $dP_\theta(s) = \ell_\theta(s)\,d\mu(s)$, where $\mu$ is a fixed measure on $S$. Suppose that $\theta \mapsto \ell_\theta(s)$ is differentiable for each fixed $s$; then $\delta \mapsto \Omega_{\delta,\theta}(s) = \frac{\ell_\delta(s)}{\ell_\theta(s)}$ is also differentiable for each fixed $(s, \theta)$. If we use dashes for derivatives with respect to the parameter as described, then
\[ \gamma_\theta^{(1)}(s) := \frac{\ell_\theta'(s)}{\ell_\theta(s)} \]
is the SCORE FUNCTION at $\theta$ (given $s$). We also define $I(\theta) := E_\theta\big(\gamma_\theta^{(1)}(s)\big)^2$, the FISHER INFORMATION (for estimating $\theta$) in $s$.

Note.
\[ \Big(\int_S \ell_\delta(s)\,d\mu(s) = 1 \ \forall\delta \in \Theta\Big) \Rightarrow \Big(\int_S \ell_\delta'(s)\,d\mu(s) = 0 \ \forall\delta \in \Theta\Big) \Rightarrow E_\theta\big(\gamma_\theta^{(1)}(s)\big) = 0. \]
Similarly, we have $\int_S \ell_\delta''(s)\,d\mu(s) = 0$, $\int_S \ell_\delta^{(j)}(s)\,d\mu(s) = 0$, etc. for all $\delta \in \Theta$, so that $E_\theta\big(\gamma_\theta^{(j)}(s)\big) = 0$ for $j = 1, 2, 3, \ldots$, where $\gamma_\theta^{(j)}(s) = \ell_\theta^{(j)}(s)/\ell_\theta(s)$. Conditions under which the interchanging of differentiation and integration (as above) is valid will be given later.

Suppose that we are interested in $W_\theta$ and want some concrete method of constructing it. We have that
\[ \Omega_{\delta,\theta}(s) = \Omega_{\theta,\theta}(s) + (\delta - \theta)\gamma_\theta^{(1)}(s) + \frac{1}{2}(\delta - \theta)^2\gamma_\theta^{(2)}(s) + \cdots, \]
which suggests that $W_\theta = \mathrm{Span}\{1, \gamma_\theta^{(1)}, \gamma_\theta^{(2)}, \ldots\}$. We will see that this equality holds exactly in a one-parameter exponential family and approximately in general in large samples. To see that $\gamma_\theta^{(1)} \in W_\theta$, we reason as follows: First, of course, we note that $1 \in W_\theta$. Then, since $\Omega_{\delta,\theta}, \Omega_{\theta,\theta} \in W_\theta$, we have that $\frac{1}{\delta - \theta}(\Omega_{\delta,\theta} - \Omega_{\theta,\theta}) \in W_\theta$ for $\delta \ne \theta$, from which it follows that $\gamma_\theta^{(1)} \in W_\theta$. Similar inductive reasoning allows us to conclude that each $\gamma_\theta^{(j)}$ is in $W_\theta$. It is clear that 1 and $\gamma_\theta^{(1)}$ are the most important generators if $s$ is very informative, for then only $\delta$ near the true $\theta$ are important. In any case,
\[ W_\theta^{(k)} := \mathrm{Span}\{1, \gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)}\} \subset W_\theta. \]
We know that, in $V_\theta = L_2(P_\theta)$, every $t \in U_g$ projects to the same $\hat t \in W_\theta$; thus every $t \in U_g$ has the same projection to $W_\theta^{(k)}$ - say $t_{\theta,k}^*$. Then we have:

11. BHATTACHARYA BOUNDS: For each $t \in U_g$,
\[ \mathrm{Var}_\theta(t) \ge E_\theta(t_{\theta,k}^*)^2 - [g(\theta)]^2 \quad \text{for } k = 1, 2, \ldots. \]

Proof. This follows since $\mathrm{Var}_\theta(t) = E_\theta(t^2) - [g(\theta)]^2 \ge E_\theta(t_{\theta,k}^*)^2 - [g(\theta)]^2$, because $t_{\theta,k}^*$ is the projection of $t$ on $W_\theta^{(k)}$. □

Let us consider the case $k = 1$ - i.e., projection to $W_\theta^{(1)} = \mathrm{Span}\{1, \gamma_\theta^{(1)}\}$. We have seen that $1 \perp \gamma_\theta^{(1)}$ - i.e., that $E_\theta(\gamma_\theta^{(1)}) = 0$ - and that $\|\gamma_\theta^{(1)}\|^2 = I(\theta)$. Hence $\{1, \gamma_\theta^{(1)}/\sqrt{I(\theta)}\}$ is an orthonormal basis in $W_\theta^{(1)}$ and, for any $t \in V_\theta$, the projection $t_{\theta,1}^*$ of $t$ to $W_\theta^{(1)}$ is
\[ t_{\theta,1}^* = (1, t)\cdot 1 + \Big(\frac{\gamma_\theta^{(1)}}{\sqrt{I(\theta)}}, t\Big)\frac{\gamma_\theta^{(1)}}{\sqrt{I(\theta)}}. \]
Now $(1, t) = E_\theta(t) = g(\theta)$ since $t$ is unbiased, and
\[ (\gamma_\theta^{(1)}, t) = E_\theta(t\gamma_\theta^{(1)}) = \int_S t(s)\frac{\ell_\theta'(s)}{\ell_\theta(s)}\,dP_\theta(s) = \int_S t(s)\ell_\theta'(s)\,d\mu(s) = \frac{d}{d\theta}\int_S t(s)\ell_\theta(s)\,d\mu(s) = g'(\theta). \]
The above calculations give us that, since the summands are orthogonal,
\[ E_\theta(t_{\theta,1}^*)^2 = [g(\theta)]^2 + \frac{[g'(\theta)]^2}{I(\theta)}. \]
From this we see:

12 (Fisher-Darmois-Cramer-Rao). INFORMATION INEQUALITY: For $t \in U_g$,
\[ \mathrm{Var}_\theta(t) \ge \frac{[g'(\theta)]^2}{I(\theta)}. \]

The Fisher information can be related to the second derivative of the log-likelihood: Let $L_\theta(s) = \log\ell_\theta(s)$. Then $L_\theta'(s) = \frac{\ell_\theta'(s)}{\ell_\theta(s)} = \gamma_\theta^{(1)}(s)$ and
\[ L_\theta''(s) = \frac{\ell_\theta''(s)}{\ell_\theta(s)} - \Big(\frac{\ell_\theta'(s)}{\ell_\theta(s)}\Big)^2 = \gamma_\theta^{(2)}(s) - \big(\gamma_\theta^{(1)}(s)\big)^2; \]
but $E_\theta\big(\ell_\theta''(s)/\ell_\theta(s)\big) = \int_S \ell_\theta''(s)\,d\mu(s) = 0$, and so

13. $E_\theta\big(L_\theta''(s)\big) = -I(\theta)$.

Exact conditions under which statements (11)-(13) hold are deferred until Chapter 5.

Lecture 14

Heuristics for maximum likelihood estimates:

i. $W_\theta^{(k)} = \mathrm{Span}\{1, \gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)}\}$.

ii. $W_\theta \approx \mathrm{Span}\{1, \gamma_\theta^{(1)}\}$ if $s$ is highly informative.

iii. The MLE $\hat\theta(s) \in W_\theta$ (whatever $\theta$ may be!).

The last item gives us that:

iv. $\hat\theta$ is approximately the UMVUE of its own expected value function (the same is true of estimates related to $\hat\theta$ in certain ways).

Let $\hat\theta(s)$ be the MLE of $\theta$ and assume that $\hat\theta$ is close to $\theta$. Since $\hat\theta(s)$ maximizes $L_\theta$, we have
\[ 0 = L_{\hat\theta}' = L_\theta' + (\hat\theta - \theta)L_\theta'' + \cdots \approx L_\theta' + (\hat\theta - \theta)L_\theta''. \]
Assume also that the experiment (that is, $(S, \mathcal{A}, P_\theta)$, $\theta \in \Theta$) is highly informative in the sense that $I(\theta)$ is large (for a given $\theta$). We know that $E_\theta(L_\theta') = 0$ and $\mathrm{Var}_\theta(L_\theta') = I(\theta)$; hence, informally, $L_\theta'$ is "about" 0, "give or take" about $\sqrt{I(\theta)}$. From (13), $E_\theta(-L_\theta'') = I(\theta)$ - i.e., $E_\theta\big(-\frac{L_\theta''}{I(\theta)}\big) = 1$. Assume that the random variable $-\frac{L_\theta''}{I(\theta)} \approx 1$. Then
\[ \hat\theta \approx \theta - \frac{L_\theta'}{L_\theta''} \approx \theta + \frac{L_\theta'}{I(\theta)} = \theta + \frac{\gamma_\theta^{(1)}}{I(\theta)}, \quad (*) \]
and hence $\hat\theta$ is nearly in $W_\theta^{(1)} \subset W_\theta$; so $\hat\theta$ is nearly LMVU, and hence $\hat\theta$ is nearly the UMVUE (of $\theta$). From (*),
\[ E_\theta(\hat\theta) \approx \theta \qquad \text{and} \qquad \mathrm{Var}_\theta(\hat\theta) \approx \frac{1}{I(\theta)}. \]
The MLE of $g(\theta)$ is $g(\hat\theta)$. Assuming that $g$ is continuously differentiable, we have
\[ g(\hat\theta) \approx g(\theta) + (\hat\theta - \theta)g'(\theta) \approx g(\theta) + \frac{g'(\theta)}{I(\theta)}\gamma_\theta^{(1)}. \]
So $g(\hat\theta)$ is nearly in $W_\theta^{(1)}$ (since $1 \in W_\theta$ and $\hat\theta$ is nearly in $W_\theta^{(1)}$). Hence
\[ E_\theta\,g(\hat\theta) \approx g(\theta) \qquad \text{and} \qquad \mathrm{Var}_\theta\,g(\hat\theta) \approx \frac{[g'(\theta)]^2}{I(\theta)}, \]
where $\frac{[g'(\theta)]^2}{I(\theta)}$ is the lower bound in (12).

Note. $\frac{I(\theta)}{[g'(\theta)]^2}$ is the information in $s$ for estimating $g(\theta)$.

Suppose that $(S_1, \mathcal{A}_1, P_\theta^{(1)})$ and $(S_2, \mathcal{A}_2, P_\theta^{(2)})$, $\theta \in \Theta$, are independent experiments concerning $\theta$, with sample points $s_1$ and $s_2$. Let $s = (s_1, s_2)$, $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$ and $P_\theta = P_\theta^{(1)} \times P_\theta^{(2)}$, and let $I_i(\theta)$ be the information in $s_i$ for estimating $\theta$ ($i = 1, 2$). Then the information in $s$ for estimating $\theta$ is $I(\theta) = I_1(\theta) + I_2(\theta)$. (This result extends inductively to any finite number of independent experiments.)

Proof. $dP_\theta^{(i)}(s_i) = \ell_\theta^{(i)}(s_i)\,d\mu^{(i)}(s_i)$ for $i = 1, 2$, so $\ell_\theta(s) = \ell_\theta^{(1)}(s_1)\ell_\theta^{(2)}(s_2)$ with respect to $\nu = \mu^{(1)} \times \mu^{(2)}$, and hence $L_\theta''(s) = (L_\theta^{(1)})''(s_1) + (L_\theta^{(2)})''(s_2)$. The result now follows from (13). □

Example 1(a). $s = (X_1, \ldots, X_n)$, $X_i$ iid $N(\theta, 1)$. The information in $s$ for estimating $\theta$ is the sum of the information in $X_1, \ldots, X_n$, respectively, for estimating $\theta$, which sum is (since the $X_i$ are iid) $n$ times the information in $X_1$, which is (since $X_1$ is distributed as $N(\theta, 1)$) just $n$. $L_\theta'(X_1) = X_1 - \theta = \gamma_\theta^{(1)}(X_1)$ and $\mathrm{Var}_\theta(\gamma_\theta^{(1)}) = 1 = I_1(\theta)$. (We check that $L_\theta'/\sqrt{I(\theta)}$ is about 0, give or take about 1; and $-L_\theta''/I(\theta) \approx 1$ (indeed, here it is identically 1).)

Example 2. $X_1, \ldots, X_n, \ldots$ are iid as
\[ X_i = \begin{cases} 1 & \text{with probability } \theta \\ 0 & \text{with probability } 1 - \theta, \end{cases} \]
and $\Theta = (0, 1)$. $s = (X_1, \ldots, X_N)$, $N$ the stopping time. The three cases we discussed are:

a. $N = n$ ($n$ a fixed positive integer).

b. $N$ is the first time $k$ successes (i.e., 1s) are recorded ($k$ a fixed positive integer).

c. Two-stage scheme.

In all cases (even other than (a)-(c) above),
\[ \ell_\theta(s) = \theta^{T(s)}(1 - \theta)^{N(s)-T(s)}, \quad \text{where } T(s) = \sum_{i=1}^{N(s)}X_i, \]
and hence
\[ L_\theta'(s) = \frac{T(s)}{\theta} - \frac{V(s)}{1 - \theta}, \]
where $V(s) = N(s) - T(s)$. Since $E_\theta(L_\theta') = 0$, we have
\[ \frac{E_\theta(T)}{\theta} = \frac{E_\theta(V)}{1 - \theta}, \tag{4.1} \]
\[ \mathrm{Var}_\theta(L_\theta') = \frac{\mathrm{Var}_\theta(T)}{\theta^2} + \frac{\mathrm{Var}_\theta(V)}{(1 - \theta)^2} - \frac{2}{\theta(1 - \theta)}\mathrm{Cov}_\theta(T, V) \tag{4.2} \]
and
\[ I(\theta) = E_\theta(-L_\theta'') = \frac{E_\theta(T)}{\theta^2} + \frac{E_\theta(V)}{(1 - \theta)^2}. \tag{4.3} \]

Exercise: What happens in case (a) (i.e., $N = n$)? This is like Example 1(a) except that $I(\theta)^{-1} = \frac{\theta(1-\theta)}{n}$ depends on $\theta$.

Suppose now that we are in case (b). Then $T = k$ and $V(s) = N(s) - k$. Hence, from (4.1), $\frac{k}{\theta} = \frac{E_\theta(N - k)}{1 - \theta}$ and therefore $E_\theta(N) = \frac{k}{\theta}$ (which we could also compute directly) and
\[ I(\theta) = \frac{k}{\theta^2(1 - \theta)} \]
(from equation (4.3) above). Hence the heuristics apply when $k$ is large. In all cases the MLE is $\hat\theta(s) = \frac{T(s)}{N(s)}$.

Exercise: Derive $\mathrm{Var}_\theta(N)$ from (4.2) and check the behavior of $L_\theta'/\sqrt{I(\theta)}$.

Example 1(e). $s = (X_1, \ldots, X_n)$, with the $X_i$ iid with density $ae^{-b(x-\theta)^4}$ ($a, b > 0$).

Homework 3

1.

a. Find $a$ and $b$ such that $\mathrm{Var}_\theta(X_1) = 1$ (to make it comparable to Example 1(a)).

b. Find $I(\theta)$.

c. $E_\theta(\bar X) \equiv \theta$, so $\bar X$ is unbiased for $\theta$. Is $\bar X$ the UMVUE? (Note: the answer is no.)

d. What is the UMVUE?

e. Give an explicit method for finding $\hat\theta(s)$.
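The following Monte Carlo sketch is editorial (not part of the notes): it checks, for Example 2 in case (a), that the score has mean 0 and variance $I(\theta)$, and that the variance of the MLE is roughly $1/I(\theta)$, as the heuristics of Lectures 13-14 predict. The parameter values and replication count are illustrative assumptions.

```python
import numpy as np

# X_1,...,X_n iid Bernoulli(theta).  Score L' = T/theta - (n-T)/(1-theta),
# Var(L') = I(theta) = n/(theta(1-theta)), and Var(MLE) ~ 1/I(theta).
rng = np.random.default_rng(5)
theta, n, reps = 0.3, 200, 100_000

T = rng.binomial(n, theta, size=reps)
score = T / theta - (n - T) / (1 - theta)
mle = T / n

I_theory = n / (theta * (1 - theta))
print(score.mean(), score.var(), I_theory)   # mean ~ 0, variance ~ I(theta)
print(mle.var(), 1 / I_theory)               # Var(theta_hat) ~ 1/I(theta)
```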
Lecture 15 13(a). Suppose t G Ug is such that
Ά
e θ;
then {PQ : 0 G θ } is a one-parameter exponential family with statistic t - i.e.,
p
= £θ(s) = φ{s
dμ
where A and B are smooth functions; moreover, g(θ) = —A'(θ)/B'{θ). Proof. By the same argument as used in the proof of (12), we have that t G W^ for all 0 - i.e., that t(s) = a(θ) + b(θ)L'θ(s)
a.e.(Pfl)
for all 0. From this it follows that L'θ(s) = a{θ) + β(θ)t(s) Var^(ί) = 0 for all θ. We rule out this case) and hence that
(if 6 Ξ 0 then
where A(θ) — f a(θ)dθ. This gives the required form for ig(s). Also, 0 = E9(L'Θ) = a(θ) + β(θ)Eθ(t) = a(θ) + β(θ)g(θ) and so g(θ) = -a(θ)/β(θ)
= -A'(θ)/B'(θ).
D
Note. For a near-rigorous proof, see R. A. Wijsman 1973 AS, pp. 538-542, and V. M. Joshi 1976 AS, pp. 998-1002. Note. The necessary conditions on {PQ : θ G θ } and g are also sufficient for the attainment of the C-R bound. We will see this later. Example l(a). Since £θ(s) = φi(s)e-n^-^
= φ2(s)e-n-
the C-R bound is attained by X for estimating g(θ) = θ. This implies that X is LMVU at 0, which in turn implies that it is UMVU. Also, the C-R bound is not attained by any unbiased estimate of any g which is not an affine function of θ. In particular, since X — M s an unbiased estimate of g(θ) = 0 2 , it does not attain the C-R bound since θ2 ψ -A'(θ)/B'(θ). the UMVUE.
We have seen before, however, that X 2 - J is
35
To study the Bhattacharya bounds, note that ί'θ — IQ [-nθ + nX] and ί'g — lβ [-nθ + nXf + U [-n], so that f ^ is affine in X and ^ / ^ is quadratic. This implies that ^ 4 / ^ } = Span{l,X} and Wjp = Span{l,^Λ7M = Span{l,X,X }, whence X — £ € W^2^ attains the Bhattacharya bound and is the UMVUE. In fact, We is the space of all functions of X, and hence any function of X (but not θ) is the UMVUE of its expectation. 1
1
1
Example 1(b). $s = (X_1, X_2, \ldots)$, the $X_i$ iid from $\frac{1}{2}e^{-|x-\theta|}$ on $\mathbb{R}^1$. Here $W_\theta$ is well-defined (i.e., (8)-(10) hold), but (11)-(13) are not applicable since $\ell_\theta$ is not sufficiently smooth. In such a situation, the following is useful.

14 (Chapman-Robbins). Given $(S, \mathcal{A}, P_\theta)$, $\theta \in \Theta$, with $\Theta$ an open interval in $\mathbb{R}^1$, if $t \in U_g$ then
\[
\operatorname{Var}_\theta(t) \;\ge\; \sup_{\delta \ne \theta} \frac{[g(\delta) - g(\theta)]^2}{E_\theta\big[(\Omega_{\delta,\theta} - 1)^2\big]}
\]
for all $\theta$ such that $\Omega_{\delta,\theta} = \ell_\delta/\ell_\theta$ exists for all $\delta$ in a neighborhood of $\theta$.

Proof. $E_\delta(t) = g(\delta) \Rightarrow \int_S t\,\Omega_{\delta,\theta}\, dP_\theta = g(\delta) \Rightarrow \int_S t\,(\Omega_{\delta,\theta} - 1)\, dP_\theta = g(\delta) - g(\theta)$. Dividing by $\delta - \theta$ and applying the Cauchy-Schwarz inequality, we find that
\[
\operatorname{Var}_\theta(t) \;\ge\; \left(\frac{g(\delta) - g(\theta)}{\delta - \theta}\right)^2 \Big/\, E_\theta\!\left[\left(\frac{\Omega_{\delta,\theta} - 1}{\delta - \theta}\right)^2\right]. \qquad \square
\]

Note. If $g$ is differentiable at $\theta$, then
\[
\operatorname{Var}_\theta(t) \;\ge\; [g'(\theta)]^2 \Big/\, \liminf_{\delta \to \theta} E_\theta\!\left[\left(\frac{\Omega_{\delta,\theta} - 1}{\delta - \theta}\right)^2\right].
\]
If, further, $\Omega_{\delta,\theta}$ is differentiable (see (12E) below for exact conditions), then this is the same as $[g'(\theta)]^2/I(\theta)$.
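The Chapman-Robbins bound is easy to evaluate numerically. Below is a sketch of mine (not from the notes) for the double-exponential location family of Example 1(b) with $n = 1$ and $g(\theta) = \theta$; $E_\theta[(\Omega_{\delta,\theta}-1)^2]$ is estimated by Monte Carlo and the bound is maximized over a grid of $\delta$ values.

```python
import numpy as np

# Numerical sketch: Chapman-Robbins bound for g(theta) = theta, double-exponential, n = 1.
rng = np.random.default_rng(2)
theta, reps = 0.0, 1_000_000
x = rng.laplace(loc=theta, scale=1.0, size=reps)    # density (1/2) exp(-|x - theta|)

def log_density(x, th):
    return -np.abs(x - th) - np.log(2.0)

bounds = []
for delta in np.linspace(theta - 2.0, theta + 2.0, 81):
    if delta == theta:
        continue
    omega = np.exp(log_density(x, delta) - log_density(x, theta))   # likelihood ratio
    denom = np.mean((omega - 1.0) ** 2)                              # E_theta[(Omega-1)^2]
    bounds.append((delta - theta) ** 2 / denom)

print("Chapman-Robbins bound (n = 1), approx:", max(bounds))
```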
Homework 3

2. What is the Chapman-Robbins bound for $g(\theta) = \theta$ in Example 1(b)?

3. In Example 1(c), $s$ consists of $n$ iid observations from the Cauchy density
\[
\frac{1}{\pi}\,\frac{1}{1 + (x-\theta)^2}.
\]
For any $g$, the C-R bound is not attained by any $t$; but $\hat\theta$ has nearly the variance $\frac{1}{I(\theta)}$ if $I(\theta)$ is large. Here $I(\theta) = n I_1(\theta)$. Show that $I_1(\theta) = \frac{1}{2}$.
Chapter 5
Lecture 16

Example 1(e). We have $X_i$ iid $a e^{-b(x-\theta)^4}$, with $a, b > 0$ chosen so that this is a density and $\operatorname{Var}_\theta(X_i) = 1$. Then
\[
\ell_\theta(s) = a^n \exp\Big\{-b \sum_{i=1}^n (X_i - \theta)^4\Big\}
= a^n \exp\Big\{-b\Big[\sum_{i=1}^n X_i^4 - 4\theta \sum_{i=1}^n X_i^3 + 6\theta^2 \sum_{i=1}^n X_i^2 - 4\theta^3 \sum_{i=1}^n X_i + n\theta^4\Big]\Big\},
\]
which is not a one-parameter exponential family. It is called a "curved exponential family".
Sufficient conditions for the Cramér-Rao and Bhattacharya inequalities

As usual, we have $(S, \mathcal{A}, P_\theta)$, $\theta \in \Theta$, where $\Theta$ is an open subset of $\mathbb{R}^1$. $\mu$ is a fixed measure on $S$ and $dP_\theta(s) = \ell_\theta(s)\, d\mu(s)$.

Condition 1. $\ell_\theta(s) > 0$ and $\theta \mapsto \ell_\theta(s)$ has, for each $s \in S$, a continuous derivative $\theta \mapsto \ell'_\theta(s)$. Let $\gamma_\theta^{(1)} := \ell'_\theta/\ell_\theta$.

Condition 2. Given any $\theta \in \Theta$, we may find an $\varepsilon = \varepsilon(\theta) > 0$ such that $E_\theta(m_\theta^2) < +\infty$, where $m_\theta(s) = \sup_{|\delta-\theta|\le\varepsilon} |\ell'_\delta(s)|/\ell_\theta(s)$ - i.e., $m_\theta \in V_\theta$, which implies that $I(\theta) = E_\theta\big(\gamma_\theta^{(1)}\big)^2 < +\infty$.

Condition 3. $I(\theta) > 0$.

12E (Exact statement of the Cramér-Rao inequality). Under Conditions 1-3 above, if $U_g$ is non-empty, then $g$ is differentiable and
\[
\operatorname{Var}_\theta(t) \ge \frac{[g'(\theta)]^2}{I(\theta)} \quad \text{for all } t \in U_g,\ \theta \in \Theta.
\]
Proof.

i. $\Omega_{\delta,\theta}(s) = 1 + (\delta-\theta)\,\dfrac{\ell'_{\delta^*}(s)}{\ell_\theta(s)}$ for some $\delta^*$ between $\theta$ and $\delta$. By Condition 2, $\Omega_{\delta,\theta} \in V_\theta$ for $|\delta - \theta|$ sufficiently small.

ii. For each $s \in S$,
\[
\frac{\Omega_{\delta,\theta}(s) - 1}{\delta - \theta} - \gamma_\theta^{(1)}(s) = \frac{\ell'_{\delta^*}(s)}{\ell_\theta(s)} - \frac{\ell'_\theta(s)}{\ell_\theta(s)} \to 0
\]
as $\delta \to \theta$. (From Condition 1, $\delta \mapsto \ell'_\delta(s)$ is continuous.) Also,
\[
E_\theta\Big(\frac{\Omega_{\delta,\theta} - 1}{\delta - \theta} - \gamma_\theta^{(1)}\Big)^2 \to 0 \quad \text{as } \delta \to \theta
\]
(by dominated convergence, using Condition 2) - i.e.,
\[
\frac{\Omega_{\delta,\theta} - 1}{\delta - \theta} \to \gamma_\theta^{(1)} \quad \text{in } V_\theta.
\]
From this it follows that $\gamma_\theta^{(1)} \in W_\theta$.

iii. Choose $t \in U_g$. If we let $(\cdot,\cdot)$ and $\|\cdot\|$ be the inner product and norm, respectively, in $V_\theta$, then $E_\delta(t) = E_\theta(t\,\Omega_{\delta,\theta}) = g(\delta)$ and so
\[
(t,\ \Omega_{\delta,\theta} - 1) = g(\delta) - g(\theta) \;\Rightarrow\; (t - g(\theta),\ \Omega_{\delta,\theta} - 1) = g(\delta) - g(\theta)
\]
(since $E_\theta(\Omega_{\delta,\theta} - 1) = 0$), whence
\[
\Big(t - g(\theta),\ \frac{\Omega_{\delta,\theta} - 1}{\delta - \theta}\Big) = \frac{g(\delta) - g(\theta)}{\delta - \theta}.
\]
From (ii), $\dfrac{g(\delta) - g(\theta)}{\delta - \theta}$ has a finite limit $(t - g(\theta),\ \gamma_\theta^{(1)})$ as $\delta \to \theta$. Thus $g$ is differentiable and $g'(\theta) = (t - g(\theta),\ \gamma_\theta^{(1)})$, so that $|g'(\theta)| \le \|t - g(\theta)\|\,\|\gamma_\theta^{(1)}\|$ - i.e., $\operatorname{Var}_\theta(t) \ge \dfrac{[g'(\theta)]^2}{I(\theta)}$. $\square$
3. To know that $\int_S \ell'_\theta\, d\mu = 0 = \int_S \ell''_\theta\, d\mu$, it suffices to show that $\ell''_\delta(s)$ exists and is continuous for each $s$ and that $\int_S \big\{\max_{|\delta-\theta|\le\varepsilon} |\ell''_\delta(s)|\big\}\, d\mu(s) < +\infty$ for some $\varepsilon = \varepsilon(\theta) > 0$.

Note. Under Conditions 1-3, $\operatorname{Span}\{1, \gamma_\theta^{(1)}\} = W_\theta^{(1)} \subseteq W_\theta$ and $1 \perp \gamma_\theta^{(1)}$ in $V_\theta$. (Take $t \equiv 1$; then $(1, \Omega_{\delta,\theta}) = 1$ and hence $(1, (\Omega_{\delta,\theta}-1)/(\delta-\theta)) = 0$. Letting $\delta \to \theta$, we have that $(1, \gamma_\theta^{(1)}) = 0$.)
Let $k$ be a positive integer.

Condition $1_k$. For each fixed $s$, $\theta \mapsto \ell_\theta(s)$ is positive and is $k$-times continuously differentiable.

Condition $2_k$. Given any $\theta \in \Theta$, we may find an $\varepsilon = \varepsilon(\theta) > 0$ such that $E_\theta(m_\theta^2) < +\infty$, where
\[
m_\theta(s) = \sup_{|\delta-\theta|\le\varepsilon} \big|\gamma_\delta^{(k)}(s)\big|.
\]
(From the above, we have that $1 \perp \gamma_\theta^{(j)}$ for $j = 1, \ldots, k$ - i.e., $E_\theta\big(\gamma_\theta^{(j)}\big) = 0$.)

Let $\Sigma_\theta^{(k)}$ be the covariance matrix of $\big(\gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)}\big)$.

Condition $3_k$. $\Sigma_\theta^{(k)}$ is positive definite.

11E. If Conditions $1_k$-$3_k$ hold and $U_g$ is non-empty, then $g$ is $k$-times continuously differentiable and
\[
\operatorname{Var}_\theta(t) \ge b_k(\theta) \quad \forall\, t \in U_g,\ \theta \in \Theta,
\]
where $b_k(\theta) = h(\theta)\,[\Sigma_\theta^{(k)}]^{-1}\, h(\theta)'$ and $h(\theta) = \big(g'(\theta), \ldots, g^{(k)}(\theta)\big)$. (Of course $g^{(1)} = g'$.)

Proof (outline). $1, \gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)} \subset W_\theta$, so $W_\theta^{(k)} := \operatorname{Span}\{1, \gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)}\} \subseteq W_\theta$, and the bound follows by projecting any $t \in U_g$ onto $W_\theta^{(k)}$.
Lecture 17

i. $L'_\theta, L''_\theta, \ldots$ are derivatives of $\log_e \ell_\theta$, but $\gamma_\theta^{(1)} = \ell'_\theta/\ell_\theta$, $\gamma_\theta^{(2)} = \ell''_\theta/\ell_\theta, \ldots$ are not the same as $L'_\theta, L''_\theta, \ldots$.

ii. Condition 2 in (12E) can be weakened slightly to:

Condition 2'. Given any $\theta \in \Theta$, we may find an $\varepsilon = \varepsilon(\theta) > 0$ such that
\[
E_\theta\Big[\sup_{0<|\delta-\theta|\le\varepsilon} \Big|\frac{\Omega_{\delta,\theta}(s) - 1}{\delta - \theta}\Big|\Big]^2 < +\infty,
\]
and Condition $2_k$ in (11E) can be weakened to:

Condition $2'_k$. The same requirement, with the first-order difference quotient of $\Omega_{\delta,\theta}$ replaced by the corresponding $k$-th order difference quotient.
iii. Suppose that $U_g$ is non-empty; then (8) implies that the projection of any $t \in U_g$ to $W_\theta$ is the (fixed) $\tilde t \in U_g \cap W_\theta$. Also, $t^*_{\theta k}$ is the projection of any $t \in U_g$ to
\[
W_\theta^{(k)} = \operatorname{Span}\{1, \gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)}\} \subseteq W_\theta
\]
- i.e., $t^*_{\theta k}$ is the (affine) "regression" of $t \in U_g$ on $\{\gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)}\}$. Thus
\[
t^*_{\theta k} = g(\theta) + \sum_{j=1}^k a_j\, \gamma_\theta^{(j)},
\]
where $a_1, \ldots, a_k$ are determined as in our discussion of regression, and
\[
b_k(\theta) = E_\theta\big(t^*_{\theta k}\big)^2 - [g(\theta)]^2
= \Big(\frac{dg}{d\theta}, \ldots, \frac{d^k g}{d\theta^k}\Big)\,\big[\Sigma_\theta^{(k)}\big]^{-1}\,\Big(\frac{dg}{d\theta}, \ldots, \frac{d^k g}{d\theta^k}\Big)'
\]
by the regression formula.

iv. $b_1(\theta) \le b_2(\theta) \le \cdots$ (where $b_1(\theta)$ is the C-R bound) because $W_\theta^{(k)} \subseteq W_\theta^{(k+1)}$. If we define $b(\theta) = \lim_{k\to\infty} b_k(\theta)$, then $b(\theta) \le \operatorname{Var}_\theta(\tilde t)$, the actual lower bound at $\theta$ for an unbiased estimate of $g$. We have that $b(\theta) = \operatorname{Var}_\theta(\tilde t)$ iff $\tilde t \in \overline{\operatorname{Span}}\{1, \gamma_\theta^{(1)}, \gamma_\theta^{(2)}, \ldots\}$. This does hold for any $g$ with non-empty $U_g$ if the subspace spanned by $\{1, \gamma_\theta^{(1)}, \ldots, \gamma_\theta^{(k)}, \ldots\}$ is $W_\theta$. This sufficient condition for $b_k \to b$ and $t^*_{\theta k} \to \tilde t$ is plausible since, by the Taylor expansion,
\[
\Omega_{\delta,\theta}(s) = \frac{\ell_\delta(s)}{\ell_\theta(s)} = 1 + \sum_{j\ge 1} \frac{(\delta-\theta)^j}{j!}\, \gamma_\theta^{(j)}(s).
\]
It holds rigorously in the following case:

15 (One-parameter exponential family). Suppose that
\[
\ell_\theta(s) = C(s)\, e^{A(\theta) + B(\theta)\, T(s)},
\]
where $C(s) > 0$, $T$ is a fixed statistic and $B$ is a continuous strictly monotone function on $\Theta \subseteq \mathbb{R}$; then, under Condition (*) below, we have:

a. $W_\theta^{(k)} = \operatorname{Span}\{1, T, \ldots, T^k\}$ for $k = 1, 2, 3, \ldots$.
b. $\operatorname{Span}\{1, T, T^2, \ldots\} = W_\theta$ (under $\theta$).
c. $W_\theta$ is the space of all Borel functions $f$ of $T$ such that $E_\theta\big(f(T)^2\big) < +\infty$.
d. If $U_g$ is non-empty, then $b_k(\theta) \to b(\theta) = \operatorname{Var}_\theta(\tilde t)$.
e. $\tilde t = E_\theta(t \mid T)$ for all $\theta \in \Theta$ and $t \in U_g$.
f. SUFFICIENCY OF $T$: Given any $A \subseteq S$, we may find an $h(T)$ independent of $\theta$ such that $h(T) = P_\theta(A \mid T)$ for all $\theta \in \Theta$.

Proof. (f) follows from (e) by defining $g(\theta) = P_\theta(A)$ and $t = I_A \in U_g$ and applying (c). (e) follows from (c), since projection to $W_\theta$ is then the same as taking conditional expectation given $T$. (d) follows from (a) and (b) and the above notes. It now remains only to prove (a)-(c). To this end, let $\xi = B(\delta) - B(\theta)$. Then $\xi$ is the parameter, and takes values in a neighborhood of 0. We have
\[
\frac{dP_\xi}{dP_0}(s) = \frac{C(s)\, e^{A(\delta) + B(\delta) T(s)}}{C(s)\, e^{A(\theta) + B(\theta) T(s)}} = e^{\xi T(s) - \kappa(\xi)},
\]
where $P_0 := P_\theta$, $P_\xi := P_\delta$ and $\kappa(\xi) := A(\theta) - A(\delta)$.

Suppose that

Condition (*). $\xi = B(\delta) - B(\theta)$ takes all values in a neighborhood of 0 as $\delta$ varies in a neighborhood of $\theta$.

Under this condition,
\[
\int_S e^{\xi T(s)}\, dP_0(s) = e^{\kappa(\xi)} \int_S dP_\xi(s) = e^{\kappa(\xi)} < +\infty,
\]
and hence the MGF of $T$ exists for $\xi$ in a neighborhood of 0, and $\kappa(\xi)$ is the cumulant generating function of $T$ under $P_0$. Thus the family of probabilities on $S$ is $\{P_\xi : \xi$ in a neighborhood of $0\}$, where $dP_\xi(s) = e^{\xi T(s) - \kappa(\xi)}\, dP_0(s)$ - i.e., a one-parameter exponential family with $\xi$ as the "natural" parameter and $T(s)$ as the "natural" statistic. $W_\theta = \operatorname{Span}\{\Omega_{\delta,\theta} : \delta \in \Theta\}$; the spanning set includes $\{e^{\xi T(s) - \kappa(\xi)} : \xi$ in a neighbourhood of $0\}$, so $W_\theta$ contains the subspace spanned by $\{e^{\xi T} : \xi$ in a neighborhood of $0\}$. Now
\[
\frac{1}{\eta - \xi}\big(e^{\eta T} - e^{\xi T}\big) = T e^{\xi T} + \frac{\eta - \xi}{2}\, T^2 e^{\eta^* T}
\]
for some $\eta^*$ between $\eta$ and $\xi$. We have, however, that $\big\|(\eta - \xi)\, T^2 e^{\eta^* T}\big\| \to 0$ as $\eta \to \xi$, since the MGFs of $T$ exist around 0. Hence
\[
T e^{\xi T} = \lim_{\eta \to \xi} \frac{1}{\eta - \xi}\big(e^{\eta T} - e^{\xi T}\big) \in W_\theta.
\]
Similarly, $T^2 e^{\xi T}, T^3 e^{\xi T}, \ldots$ are in $W_\theta$. Taking $\xi = 0$, we get $\{1, T, T^2, \ldots\} \subset W_\theta$, so that the subspace spanned by $\{1, T, T^2, \ldots\}$ is contained in $W_\theta$; but this subspace is the subspace of all square-integrable Borel functions of $T$, so $\operatorname{Span}\{1, T, T^2, \ldots\} = W_\theta$ actually, since each $\Omega_{\delta,\theta}$ is a (square-integrable Borel) function of $T$. $\square$

Example 2. Here $s = (X_1, \ldots, X_N)$, $N$ the total number of trials in a Bernoulli sequence, and $\ell_\theta(s) = \theta^{T(s)}(1-\theta)^{N(s)-T(s)}$, where $T$, the total number of successes, is $X_1 + X_2 + \cdots + X_N$. In general, this is a curved exponential family. In Example 2(a), since $N = n$ (a constant),
\[
\ell_\theta(s) = e^{\,n\log_e(1-\theta) + T\log_e(\theta/(1-\theta))},
\]
so that $T$ is sufficient and any function of $T$ is the UMVUE of its expected value. $C = \bigcap_{\theta\in\Theta} W_\theta$ is the set of estimates of the form $f(T)$. The C-R bound $b_1$ is attained essentially only for $g(\theta) = -A'(\theta)/B'(\theta) = n\theta$ - i.e., only for affine functions $g(\theta) = \alpha + \beta\theta$. The $k$th Bhattacharya bound $b_k$ is attained iff $g(\theta)$ is a polynomial of degree $k \le n$. If $k > n$, then $b_k = b_n = b$.
Lecture 18

Note. In the context of (15), it is sometimes necessary to look at the distribution of the (sufficient) statistic $T$. Suppose that we have found the distribution function of $T$ for a particular $\theta$ - say $F_\theta$; then $F_\delta$ is given by
\[
dF_\delta(x) = e^{(B(\delta)-B(\theta))\,x + A(\delta) - A(\theta)}\, dF_\theta(x),
\]
where $x = T(s)$ (so that the distributions of $T$ form a one-parameter exponential family with statistic the identity). (Please check, by computing, that $P_\delta(T \le x) =: F_\delta(x)$ is indeed given by this formula.)
Example 2(a).

Homework 4

1. $U_g$ is non-empty iff $g$ is a polynomial of degree $\le n$ (in the case of Example 2(a)). $W_\theta$ does not depend on $\theta$; it is the class of all functions of $\bar X$, and hence an estimate is a UMVUE of its expected value iff it is a function of $\bar X$.
We will show that $\sigma^2(\theta) = \theta(1-\theta)$ has a UMVUE when $n \ge 2$. This UMVUE should be a function of $\bar X$. $\theta$ may be estimated by $\bar X = T/n$. How about $\theta^2$? Let
\[
t = \begin{cases} 1 & \text{if } X_1 = 1 = X_2, \\ 0 & \text{otherwise}; \end{cases}
\]
then $E_\theta t = \theta^2$. We know that the projection to $W_\theta$, which is $E_\theta(t \mid T)$, will give $\tilde t$ for $g(\theta) = \theta^2$. (Taking $E_\theta(t \mid T)$ is called "Blackwellization".)
\[
E_\theta(t \mid T = k) = \frac{P_\theta(X_1 = 1 = X_2,\ \text{exactly } k-2 \text{ successes in the subsequent } n-2 \text{ trials})}{P_\theta(T = k)}
= \frac{\theta^2 \binom{n-2}{k-2}\theta^{k-2}(1-\theta)^{n-k}}{\binom{n}{k}\theta^k(1-\theta)^{n-k}}
= \frac{k(k-1)}{n(n-1)},
\]
which is independent of $\theta$, as expected. Thus
\[
\tilde t = \frac{T(T-1)}{n(n-1)},
\]
which is the UMVUE of $\theta^2$, and therefore $\sigma^2(\theta)$ may be estimated by
\[
\bar X - \frac{\bar X(n\bar X - 1)}{n-1} = \frac{n}{n-1}\,\bar X(1-\bar X),
\]
which is a function of $\bar X$ and hence is the UMVUE of $\sigma^2(\theta)$.

Consider the odds ratio $g(\theta) = \frac{\theta}{1-\theta}$. This has no unbiased estimate. Since $\theta$ has MLE $\bar X$, the MLE $t$ for this $g$ is $\frac{\bar X}{1-\bar X}$. Since $P_\theta(\bar X = 1) = \theta^n > 0$, we have $E_\theta(t) = +\infty$, so the expectation breaks down. If, however, $I(\theta) = \frac{n}{\theta(1-\theta)}$ is large - i.e., $n$ is large - then
\[
t = \frac{\bar X}{1-\bar X} = \bar X + \bar X^2 + \cdots + \bar X^n + R_n,
\]
where $R_n = \frac{\bar X^{n+1}}{1-\bar X}$. For each $\theta \in (0,1)$, $R_n$ is very small with large probability, and $R_n \to 0$ in $P_\theta$-probability as $n \to \infty$.
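The Blackwellization above is easy to check by simulation. The following is a minimal Monte Carlo sketch of mine (not from the notes; $\theta$, $n$ and the seed are arbitrary).

```python
import numpy as np

# Monte Carlo sketch: Blackwellization in Example 2(a).  Check that T(T-1)/(n(n-1))
# is unbiased for theta^2 and that n*Xbar*(1-Xbar)/(n-1) is unbiased for theta(1-theta).
rng = np.random.default_rng(3)
theta, n, reps = 0.35, 12, 500_000

x = rng.binomial(1, theta, size=(reps, n))
T = x.sum(axis=1)
xbar = T / n

umvue_theta2 = T * (T - 1) / (n * (n - 1))
umvue_var = n * xbar * (1 - xbar) / (n - 1)

print("E[T(T-1)/(n(n-1))]      ~", umvue_theta2.mean(), " vs theta^2 =", theta**2)
print("E[n Xbar(1-Xbar)/(n-1)] ~", umvue_var.mean(), " vs theta(1-theta) =", theta * (1 - theta))
```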
Example 2(b) (Negative binomial sampling). Here
\[
\ell_\theta = \theta^k (1-\theta)^{N-k} = \exp\Big\{k\log\frac{\theta}{1-\theta} + k\log(1-\theta)\cdot y\Big\},
\]
where $y = N/k$, so that $T = y$,
\[
A(\theta) = k\log\big(\theta/(1-\theta)\big) \quad \text{and} \quad B(\theta) = k\log(1-\theta),
\]
and hence $-A'(\theta)/B'(\theta) = 1/\theta$. Thus $E_\theta(y) = 1/\theta$ and $\operatorname{Var}_\theta(y)$ is the C-R bound, and the C-R bound is attained only for $g(\theta) = a + b/\theta$.
Now assume $k \ge 3$. We know (even for $k \ge 2$) that $\frac{k-1}{N-1}$ is an unbiased estimate of $\theta$. Since $\frac{k-1}{N-1} = \frac{k-1}{ky-1}$ is a function of $y$, it is in fact the UMVUE of $\theta$.

Let $\sigma^2(\theta) = \operatorname{Var}_\theta\big(\frac{k-1}{N-1}\big)$. Since $\tilde t = \frac{k-1}{N-1}$ is not a polynomial in $y$ - in fact, $\tilde t \notin W_\theta^{(j)}$ for every $j$ - we have (for $g(\theta) = \theta$)
\[
b_1(\theta) < \sigma^2(\theta),
\]
but $b_j(\theta) \to \sigma^2(\theta)$ as $j \to \infty$. We can, however, find a UMVUE for $\sigma^2(\theta)$ (without knowing what the $b_j$s are).

Suppose that we can find an unbiased estimate $u$ of $\theta^2$. Then $v = \tilde t^{\,2} - u$ is an unbiased estimate of $\sigma^2(\theta)$ ($\sigma^2(\theta) = \operatorname{Var}_\theta(\tilde t) = E_\theta(\tilde t^{\,2}) - \theta^2$). Let
\[
t = \begin{cases} 1 & \text{if } X_1 = 1 = X_2, \\ 0 & \text{otherwise.} \end{cases}
\]
Then (even at present) $E_\theta(t) = \theta^2$, and hence $u = E_\theta(t \mid N)$ (the Blackwellization of $t$) is the UMVUE of $\theta^2$ (when $k \ge 3$). A direct computation gives
\[
E_\theta(t \mid N = m) = \frac{(k-1)(k-2)}{(m-1)(m-2)}
\]
- i.e., $u = \frac{(k-1)(k-2)}{(N-1)(N-2)}$ is the UMVUE of $\theta^2$, so that the UMVUE of $\sigma^2(\theta)$ is
\[
\Big(\frac{k-1}{N-1}\Big)^2 - \frac{(k-1)(k-2)}{(N-1)(N-2)} = \frac{(k-1)(N-k)}{(N-1)^2(N-2)}.
\]
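Again these identities can be checked numerically; here is a Monte Carlo sketch of mine (not from the notes; $\theta$, $k$ and the seed are arbitrary).

```python
import numpy as np

# Monte Carlo sketch: inverse binomial sampling with k fixed.  Check that (k-1)/(N-1)
# is unbiased for theta, that (k-1)(k-2)/((N-1)(N-2)) is unbiased for theta^2, and that
# their combination estimates Var((k-1)/(N-1)) without bias.
rng = np.random.default_rng(4)
theta, k, reps = 0.4, 6, 1_000_000

N = k + rng.negative_binomial(k, theta, size=reps)     # total number of trials
t1 = (k - 1) / (N - 1)
t2 = (k - 1) * (k - 2) / ((N - 1) * (N - 2))

print("E[(k-1)/(N-1)]             ~", t1.mean(), " vs theta   =", theta)
print("E[(k-1)(k-2)/((N-1)(N-2))] ~", t2.mean(), " vs theta^2 =", theta**2)
print("E[t1^2 - t2]               ~", (t1**2 - t2).mean(), " vs Var(t1) ~", t1.var())
```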
Homework 4

2. Does every polynomial in $\theta$ have an unbiased estimate? (Yes?) Does the odds ratio $\theta/(1-\theta)$ have an unbiased estimate? (No?)
Chapter 6
Lecture 19

The vector-valued score function and information in the multiparameter case

Now we have an experiment $(S, \mathcal{A}, P_\theta)$, $\theta = (\theta_1, \ldots, \theta_p) \in \Theta$ with $\Theta$ an open set in $\mathbb{R}^p$, and a smooth function $g : \Theta \to \mathbb{R}^1$. We assume that $dP_\theta(s) = \ell_\theta(s)\, d\mu(s)$ as before, and define $\ell(\theta \mid s) := \ell_\theta(s)$. Assume that $\ell$ is smooth in $\theta$ and let $g_i(\theta) = \frac{\partial}{\partial\theta_i} g(\theta)$, $\ell_i(\theta \mid s) = \frac{\partial}{\partial\theta_i}\ell(\theta \mid s)$ and $\ell_{ij}(\theta \mid s) = \frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell(\theta \mid s)$ for $1 \le i, j \le p$. There are two approaches to the present topic in this situation.

Approach 1. Generalize the previous one-dimensional discussion: Suppose that $t$ is unbiased for $g$ - that is to say,
\[
\int_S t(s)\,\ell(\delta \mid s)\, d\mu(s) = E_\delta(t) = g(\delta)
\]
for all $\delta \in \Theta$. Then
\[
E_\theta\big(t(s)\,\ell_i(\theta \mid s)/\ell(\theta \mid s)\big) = \int_S t(s)\,\ell_i(\theta \mid s)\, d\mu(s) = g_i(\theta)
\]
for $i = 1, \ldots, p$, and hence every $t \in U_g$ has the same projection on $\operatorname{Span}\{1, L_1, \ldots, L_p\}$, where $L(\theta \mid s) = L_\theta(s)$ and $L_i(\theta \mid s) = \frac{\partial}{\partial\theta_i} L(\theta \mid s)$. This approach is useful for studies of conditions which ensure that $L_1, L_2, \ldots, L_p$ are in $W_\theta = \operatorname{Span}\{\Omega_{\delta,\theta} : \delta \in \Theta\}$.
Approach 2. Use the result for the $\theta$-real case: Fix $\theta \in \Theta$ and a vector $c = (c_1, \ldots, c_p) \ne 0$, and suppose that $\delta$ is restricted to the line passing through $\theta$ and $\theta + c$ - in other words, that we consider only $\delta = \theta + \xi c$ for some scalar $\xi$. (Note that, since $\Theta$ is open, if $\xi$ is sufficiently small then $\theta + \xi c \in \Theta$.) Then $g$ becomes a function of $\xi$ for which $t$ remains unbiased. By (12),
\[
\operatorname{Var}_\theta(t) \;\ge\; \big[\text{Fisher information in } s \text{ for } g \text{ at } \theta \text{ in the restricted problem}\big]^{-1}
= \left.\Big(\frac{dg}{d\xi}\Big)^2\right|_{\xi=0} \Big/\, \big[\text{Fisher information for } \xi \text{ in } s\big]_{\xi=0}.
\]
Now, since $\delta = \theta + \xi c$,
\[
\left.\frac{dg}{d\xi}\right|_{\xi=0} = \sum_{i=1}^p c_i\, g_i(\theta).
\]
The information in the denominator is $E_\theta(dL/d\xi)^2$, and
\[
\left.\frac{dL}{d\xi}\right|_{\xi=0} = \sum_{i=1}^p c_i\, L_i(\theta \mid s),
\]
so that the information may be expressed explicitly as
\[
\sum_{i=1}^p \sum_{j=1}^p c_i\, I_{ij}(\theta)\, c_j,
\]
where $I_{ij}$ is the $(i,j)$th entry of the Fisher information matrix
\[
I(\theta) = \big\{\operatorname{Cov}_\theta\big(L_i(\theta \mid s),\, L_j(\theta \mid s)\big)\big\}_{p\times p}
\]
(where the sample space is $S$). Let
\[
L_{ij} = \frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j} = \frac{\ell_{ij}}{\ell} - \frac{\ell_i\,\ell_j}{\ell^2};
\]
then $E_\theta(\ell_{ij}/\ell) = 0$, and hence we have the $p$-dimensional analogue of (13):

13*. $I(\theta) = \{-E_\theta(L_{ij}(\theta \mid s))\}$.

The above lower bound for $\operatorname{Var}_\theta(t)$ can now be written as
\[
\operatorname{Var}_\theta(t) \;\ge\; \frac{\big(\sum_i c_i\, g_i(\theta)\big)^2}{\sum_{i,j} c_i\, I_{ij}(\theta)\, c_j}.
\]
Let us assume that $I$ is positive definite. It will be shown below that
\[
\sup_c\,\{\text{the bound above}\} \;=\; \sum_{i,j} g_i(\theta)\, I^{ij}(\theta)\, g_j(\theta), \tag{*}
\]
where $\{I^{ij}(\theta)\} = I^{-1}(\theta)$; and the supremum is achieved when $c$ is a multiple of $h(\theta) I^{-1}(\theta)$, where $h(\theta) = (g_1(\theta), \ldots, g_p(\theta)) = \nabla g(\theta)$. Thus we have the $p$-dimensional analogue of (12):
12*. If $t \in U_g$, then
\[
\operatorname{Var}_\theta(t) \ge h(\theta)\, I^{-1}(\theta)\, h(\theta)'.
\]
Assume that this bound is attained, at least approximately; then, for the estimation of $g$, there exists a one-dimensional problem (namely, the one obtained by restricting $\delta$ to $\{\theta + \xi c^* : \xi \in \mathbb{R}\}$, where $c^* = h(\theta) I^{-1}(\theta)$) which is as difficult as the $p$-dimensional problem.

Proof of (*). For $u = (u_1, \ldots, u_p)$ and $v = (v_1, \ldots, v_p)$ in $\mathbb{R}^p$, let $(u|v) := \sum_{i=1}^p u_i v_i$ and $\|u\| := (u|u)^{1/2}$. Let $I$ be a (fixed) positive definite symmetric $p \times p$ matrix and set $(u|v)_* := \sum_{i,j} u_i I_{ij} v_j = u I v'$ and $\|u\|_*^2 := (u|u)_*$. Let $g = (g_1, \ldots, g_p)$ be a fixed point in $\mathbb{R}^p$. Consider the maximization over $a = (a_1, \ldots, a_p) \in \mathbb{R}^p$ of
\[
\frac{(g|a)^2}{\|a\|_*^2}.
\]
The unique (up to scalar multiples) maximizing value is given by $a = g I^{-1}$, and the maximum value is
\[
g\, I^{-1} g' = \|g I^{-1}\|_*^2. \qquad \square
\]
and, under Pθ (i.e., for ξ = 0) 1 i_ L7, so {1, Z//||I/||} is an orthonormal basis for Span{l,Z/} and ί,
ΐ=l
48
Note that c*; = / ιh, so c*Ic*' — hi ιhf = c*hr and so the above formula becomes
We have
I-ι)I{hJ-xγ
{h
More heuristic (as in the one-dimensional parameter case) "ML estimates are nearly unbiased and nearly attain the bound in 12P." We assume that the ML estimate θ of θ exists. Since θ is open and L(- \ s) is continuously differentiate, we have that = 0. θ=θ
Choose and fix θ G θ , and regard it as the actual parameter value. If we assume that θ is close to 0, then p
L
ίfiλ ~ T ίfi\ _]_ \ i\ϋ ) ^> ±Ji\y J ~τ~ /
^(Ω 10'j
ύ \ T — t/j)ljjiyϋ
(A\ ),
ή Z —
1 n 1 , . . . , Ό.
Assume that the sample is highly informative, i.e., that
(We know that Eθ[Lji(θ \ s)) = -Iji(θ). We are thus assuming that
where Sji(θ,s) —> 0 in probability. This happens typically when the data is highly informative.) From this it follows that
Definition. L ( 1 ) (0 | s) := (Li(0 | s ) , . . . , L p (0 | 5)) is the SCORE
VECTOR.
Thus the ML estimate of a given g is
t(s) = g(θ(s)) « g(θ) + £ & ( * ) - θi)9i{θ) = g(θ) + (θ(s) - θ)ti(θ) « p(β) + Uι\θ I s)Γ\θ)ti{θ) = 49
W
under Pθ. Since EΘ(L (Θ \ s)) = 0, we have Eθ(i) « g(θ). Since 0 is arbitrary, i is approximately unbiased for g, i.e., ί e Ug. Since (1)
ι
{ι)
t(s) « g(θ) + L (0 I s)Γ (θ)ti(θ)
= g(θ) + c*(L (θ
\ s))'
under P#, we know that ί G Spanjl, Lu . . . , L p }, so that ί « ί^ under P# and 1
Var^(ί) « V a r ^ ί ^ ) =
h{θ)r {θ)h'(θ).
This is, if true, remarkable, for it happens for e?;eπ/ g and ei>en/ 0 e θ . 2
2
Example 3. Suppose that the X* are iid N(μ,σ ) and 0 = (0i,02) = (μ,σ ). Some 2 functions ^ which may be of interest are g(0) = μ, g(θ) = σ (or g(θ) = σ), ^(0) = μ/σ (or g(0) = σ/μ, Ίί μ φ 0) and ^(0) = the real number c such that PQ{X{ < c) = a (for some fixed 0 < a < 1) - i.e., g(θ) — μ + z α σ, where z α is the normal a fractile. Let us compute /. Since 5 consists of n iid parts, 7(0) for s is simply n/i(0), where Iι(θ) is I for Xι. If AΊ is the entire data, then
where C is a constant and r := σ2 = 02; thus Z/i =
and
Lo = -
Homework 4 3. Check that
L/r 0
0 l/2r 2
Lecture 21 Example 3 (continued). We return to the situation s — (Xu . . . , Xn)\ then
0
l/2τ 2 y
a n
^S' ~ \
0
2τ2/r
Consider g(θ) = μ = 0X; then the most difficult one-dimensional problem is τ
-
θ = (μ,τ)
u 50
This one-dimensional problem is in a one-parameter exponential family with sufficient statistic X, and X is a UMVUE in this one-dimensional problem which attains ι the C-R bound - i.e., ~X is unbiased and Vsiθ(X) = h(θ)I' {θ)h'(θ), where h = (1, 0); thus
Vaxθ(X) = τ/n V0 G θ . The following are some gs (and their corresponding C-R bounds) for which the C-R bound is not attained: 2
i. g{β) = σ ; the C-R bound is ^ . ϋ. p(0) = a; the C-R bound is ^ . iii. g{θ) = μ + zaσ, h = (1, za/2y/r)] the C-R bound is £ + r | | . To see this, it is enough to check case (i), since the reasoning for the other cases is similar. Here _ £(θ I S) = C r -"/2 e -£KX-μ where C is a constant and υ = \ Σ™=1{Xi — X)2',
L(θ \s) = C - | l o g r - i-[n(X - μf + nv], 2τ
where C" = logC; Lλ(θ \ s) = ^(X - μ) and
r
(θ
21
Is\ '
;
=
_.!L + 2 _ 2r
2r 2
Let 5 = (μ*,τ*); then n r and
From these equations it is easily seen that there do not exist constants α(0), 6(0) and c(0) such that £,[α(0) + 6(0)Lχ(0 I 5) + c(0)L 2 (0 | 5)] = r* for all δ = (μ*,?"*) - i.e., there is no unbiased estimate of r* in Span{l, Lι(θ \ •), L2(θ \ •)}, so that the C-R bound is not attainable for g(β) —r. On the other hand, X = μ + ^Lι(θ | 5) is in Span{l, L 1 ? L2] and is unbiased for μ, and so attains the C-R bound for μ. It is easy to check that the ML estimate is 0 = (X,v), so the MLE for μ is X; it is exactly unbiased, and its variation is 51
2
the C-R bound. The MLE for τ = σ is υ = j Σ ί U P ^ Eβ(v) = ^T = τ-z (note that £ is small when / is "large"),
x
2
) ί
w e
h a v e
t h a t
4ί(Λj_1) = which is /ess than the C-R bound ^ ,
for r (so v is noΐ unbiased), and 2
2(n - 1) 2 o r 2r V MSE* Ϊ; = V +— = 2 x
2
r
2
2
2r - < —.
Homework 4 4. The ML estimate for σ = yfτ is y/ϋ. Show that EΘ(y/v) = σ + o(l) and Vaiθ(y/ϋ) = ^+o(l) as n -> oo. (HINT: Z is an X£ <£> \z is a Γ(fc/2) variable. A e l m m Γ(m) variable has density ~^~ in (0, oo). Γ(m+1) = \/2^m m e - + o ( l / m ) a s m - ^ oo, so
as m —> oo for a fixed /ι.)
Lecture 22

Note. In the general case of $(S, \mathcal{A}, P_\theta)$, $\theta \in \Theta$, the above considerations are somewhat more general than are required for strict unbiased estimation. In particular, associated with each $\theta \in \Theta$ there is a set $W_\theta$ of estimates which has the following properties:

Corollary to (8). If we are estimating a scalar $g(\theta)$, then, corresponding to any estimate $t$, there is an estimate $\tilde t \in W_\theta$ such that $E_\delta(\tilde t) = E_\delta(t)$ for all $\delta \in \Theta$ and $E_\theta(t - g(\theta))^2 =: R_t(\theta) \ge R_{\tilde t}(\theta) := E_\theta(\tilde t - g(\theta))^2$, with the inequality strict unless $P_\delta(t = \tilde t) = 1$ for all $\delta \in \Theta$.

In general, $W_\theta$ depends on $\theta$ and we must be content with $C = \bigcap_{\theta\in\Theta} W_\theta$. In some important special cases, however - for example, in an exponential family - $W_\theta$ is independent of $\theta$. In any case, though, the MLE and related estimates have the property that, if "$I(\theta)$" is large, any smooth function $f(\hat\theta)$ is approximately in $W_\theta$ for any fixed $\theta$.

Example 3 (continued). $\theta = (\mu, \tau)$, where $\tau = \sigma^2$. Choose and fix $\theta$; then what is $W_\theta$? There are three methods available:

Method 1. Look at $\Omega_{\delta,\theta}$. $W_\theta$ is the subspace spanned by $\{\Omega_{\delta,\theta} : \delta \in \Theta\}$.

Method 2. (Let $\theta$ be real, under regularity conditions.) $\frac{\partial}{\partial\delta}\Omega_{\delta,\theta}\big|_{\delta=\theta} \in W_\theta$. This is the method which leads to the Cramér-Rao and Bhattacharya inequalities.
a
Method 3. (Due to Stein.) /* ςiSfidδ E Wθ. L( θ
Li θ
We use Method 2. Since £(θ \ s) = e - ^, we have 4(0 | s) = e - ^Li{θ | s), L
^•(0 I a) = e ^[Liό{θ
I a) + Ltf \ s)L3(θ \ a)],
etc., and hence £{/£ = Li, £ij/£ — Lij + LiLj, etc. Thus lifl, tij/ί, etc. are in Wθ. Here we have n(X - μ)
n[v-\- (X - μ?\
n 2τ
T n
T
T
n[v + I(x-μ)2}
n(X - μ) r
n(X-
L ^22
2
r
3
n '
2
Since £n/£ = Ln + L\ is an aίRne function of (X — μ) , we have 2
L 1 (β | -),L2(Θ \ .),tn{θ
|
whence X is the LMVUE of Eδ(X) = μt, v is the LMVUE of Eδ(υ) = ^ r * and ^ is the LMVUE of Es(nv/{n — 1)) = r* (remember δ = (μ*,r*).) Since X, v and ^7j do not depend on θ, they are in fact in C = C\θ£θ We and hence are the UMVUEs of their expected values. (Neither y/v nor ^ (the latter is the MLE of μ/σ) is available by this method, but one can show by the above method that any function of X and υ is in C. If θ is the set of all pairs (μ, σ 2 ), then we are in the two-parameter exponential family case and a result to be stated later applies.)
Regularity conditions θ is open in W and dPθ{s) = l(θ \ s)dμ(s). Condition lp. For each s, £(- \ s) is a positive continuously differentiate function of θ. Condition 2P. Given any θ G θ , we may find an ε = ε(θ) > 0 such that
msx{\Lj(δ \ s)\ : \δi - θi\
<ε}eVθ
(i.e., the function is square-integrable with respect to PQ), or at least τv| o γ J r max
ίA
Ci
*
A
ι l \ Λ ° Is)\ - \°ι
Wΰ)
L e t I(Θ) = EΘ(Li{θ\s)Lj{θ\
ΓΊ
^^
υ
s)).
P
Condition 5 '. For each θ, I(θ) is positive definite. 53
P" >
t\ —εϊ
s- ΛΓ θ
-
a. For each 0, l,Li(0 | s ) , . . . , L p ( 0 \ s) C Wθ) and 1 i . Lj(θ \ s) in Vθ for b. If Ug is non-empty, then g is differentiable and the projection of any t £ Ug to Span{l, L i , . . . , ί^} (which is the projection of £ to Span{l, L 1 ? . . . , L p }) is fθι = g(θ) + h(θ)Γ\θ) (L x (0 I 5 ) , . . . , L p (0 I where h{θ) = grad^(0). 1
f
c. If £ G C/p, then Var^(£) > h(θ)I- (θ)h (θ)
for all 0 G θ .
/. The proof is left as an exercise for the reader. See the proof in the case p = 1 and use Approach 1 rather than Approach 2. Note also that g(θ) is a projection of t to Span{l} and that 1 is orthogonal to L i , . . . , L p , so that the projection of £ — g(θ) to Spanjl, L i , . . . , Lp} is the same as its projection to Span{Lχ,..., Lp}. Thus Var#(£ - g(θ)) > ^(projection of £ - g(θ) to Span{Z/χ,..., Lv})2 =
Eθ(hΓ\Lu...,Lp)'[hΓ1(Lu...,Lpγ}') = Eθ(hΓ\Lu
. . . , Lp)'(Li,..., Lp)Γιti)
= hΓιh'.
Lecture 23 . In the case when θ is open in W, g : θ -» R; is differentiable and conditions V-3P are satisfied, then, for any estimate £, Rt(θ) := Eθ(t(s) - g{θ)f > βt{θ)Γ\θ)β't{θ)
+ [bt{θ)]\
where bt(θ) := Eθ(t) - g(θ) and βt(θ) := grad^(ΐ) = gradί?(0) + gradδt(^). Proof. Let
Then
2
i?t(0) = Varβ(ί) + [6tW] > [gradT^jZ-'WlgradT^)]' by C-R bound.
•
This result is useful even in case p = 1 - see, for example, the proof of the admissibility of θ in Example l(a) in Lehmann (1983, Theory of point estimation).
54
On the distance between θ and δ Ω
Should one use the Euclidean distance di? What is really of interest is the "distance" between /^(i) and Pδ{2) - given, say, by
= sup \Pδil){A) - Pδw(A)\ = ± [\l(δW I s) - £(δ^ \ s)\ds AeΛ
2Js
or
The distance d 3 is used in E. J. G. Pitman (1979, Some basic theory of statistic inference). It is related to the Fisher information in the following way: Suppose that we want to distiguish between Pδ(i) and Pδw on the basis of s. Instead of a hypothesis-testing approach, let us choose a real-valued function t(s). What is the difference between δ^ and δ^ on the basis of ί? Regard t as an estimate of g(δ) := Eδ(t). Then \g(δ^) — g(δ^)\ might be taken as a measure of the distance between δ^ and δ^ on the basis of t. It is, however, more plausible to use the standardized versions and
SD, (2) (ί)
especially if t is approximately normally distributed. Now choose and fix θ € θ and restrict δ to a small neighborhood of θ. Then Var^(ΐ) « Var^(ί), and hence the distance (between δ^ and δ^2\ on the basis of t) is approximately
|g(*"')^>)l
,
Since the distance should be "intrinsic", we should maximize it with respect to t. First, we maximize dtj with respect to t with the expectation function g fixed to get
With δ^ -> ^ and δ^ —>• ^, this is approximately
55
Next, maximize the square of this with respect to h(θ) = gmdg(θ), which then leads to the squared distance
The distance Dθ is called the LOCAL FISHER METRIC in the vicinity of θ. It is the distance between Pδ(i) and Pδ(2) as measured in standard units for a real-valued statistic of the form g(θ), where g is suitably chosen so that gradg(0) — ^ 2
2
Example l(a). Let n = 1, s ~ N(θ,σ ) and θ G θ = (—00,00), where σ is a fixed 2 known quantity. Then I(θ) = 1/σ for all 0,
and mean of Pδ(i) — mean of Pδ(2) common SD If n > 1 and s = ( Λ Ί , . . . , Xn) with the X{ iid, then σ For fixed 0 G EP, A? is the metric derived from the inner product
which has been used before. Exercise (informal): Look at DΘ in Example 3, N(θι, Θ2) Example 4. 7 e K f c has the Nk(θ,Σ) distribution and density £(θ I y) =
-
-\{y-θ)Έ-^y-θ)'
e
with respect to Lebesgue measure. With this density, θ and Σ are respectively the mean and covariance matrices of Y. Show that I(θ) = Σ " 1 for all θ and hence P\W is the fixed square distance (δ^ - ^
Lecture 24 Note. A sufficient condition for 13P - i.e., the equality I(θ) = -{EΘ(Lij(θ that, given any θ G θ , we may find an ε = ε(θ) > 0 such that m
| s))} - is
Note. The theory extends to estimation of vector-valued functions - for example, if u(s) — (ιii(s),... ,up(s)) is an unbiased estimate of θ and Var^(wi) < +00 for each i — 1,... ,p and ί e θ , then QQNQ{U) — I~ι{θ) is positive semidefinite for each θ G θ . Proof. Fix a = ( a l 9 . . . , a p ) G Rp and define #(0) = Σ ? = 1 afli = α0'. Then ί(s) = αu'(s) is an unbiased estimate of g. Since grad(7(0) = α, we have ΐ) = aCovθ{u)o! >
aΓι(θ)a!,
so that (α G Mp having been arbitrary) Cov^(w) — I~ι{θ) is positive semidefinite.
D
Definition. (£, >A, P^), 0 G θ C W is a (p-parameter) EXPONENTIAL FAMILY with statistic T= (TU...,TP):
S -+W if dPθ(s) = £(θ | s)dμ(s), ^(0 I5 λ _
where
Cfs\eBi(Θ)T1{s)+-+Bp(Θ)Tp(s)+A(θ)^
The family is NON-DEGENERATE at a particular θ G θ if
{(Bx(i) - ^ ( ^ , . . . , ^ ( 5 ) - Bp(θ)) : δ e θ } contains a neighborhood of 0 = ( 0 , . . . , 0). We assume non-degeneracy at each θ G θ . Exercise: Check that Example l(a) is a non-degenerate exponential family with p = 1, with Γi = X if θ = R 1 ; Example 2 (a) is a non-degenerate exponential family with p = 1, 7\ = X and θ = (0,1); Example 2(b) is a non-degenerate exponential family with p = 1, 7\ = N and θ = (0,1); Example 3 is a two-parameter nondegenerate exponential family with 7\ = ΣXi, T2 = J^Xf and θ = {(μ5 r) : — 00 < μ < +00 and 0 < r < +00}; and Example 4 is a fc-parameter exponential family with T = ΣVϊ 15*\
=
(Ti,...,T*).
a. For each 0 G θ , Wθ is the space of all Borel functions of T = (Tu ..., Tp) which are in VQ. W
is t h e c l a s s
o f a11
b. C = Γ\eee θ UMVUE - i.e., the class of all Borel 2 functions of T which are in L (PQ) for all θ G θ . c. For any g such that Ug is non-empty, there exists an essentially unique estimate t = t(T) eCΓ\Ug. d. t = ^ ( ί I Γ) for all ί G E^ and 0 G θ . e. For all ACS, EΘ(IA \ T) = PΘ{A \ T) (essentially) is the same for each θ G θ , i.e., T is a sufficient statistic. f. T is a complete statistic. Proof. 57
a. Choose θ 6 θ and write ξ{ = B^δ) - Bi(θ). Then
ΣίiTi(5)
where K$(ξu ,£p) = l o g ^ ( e ) is the cumulant generating function of Γ at ( ξ i , . . . , ξp) under P#. Non-degeneracy means that Kθ(ξu...,ξp)<+oo for ( £ l 3 . . . , ξp) in a neighborhood of 0, and hence We contains all functions τ ( £ 1 ? . . . ,£ p ) in a neighborhood of 0. By differentiation, we find e Σ& i for that Wg contains all polynomials in 7\,... ,T P , so We contains all Borel functions of T which belong to VQ. On the other hand, since each Ωj50 is a Borel function of T, every function in We is such; so (a) is proved. b. This follows from (a) and (9). c. This follows from (a) and (8). d. This follows from (a) and (8) and the fact that, if W is the space of all functions of Γ, projection to W is the conditional expectation given T. e. This follows from (c) and (d) by letting g(θ) = PΘ(A). f. Suppose Eθh = 0 and Eθh2 < +oo for all θ e θ . Then h(T) is the UMVUE of g(β) = 0; but 0 is an unbiased estimate of this g, so Var<9 h = 0 for all θ e θ and hence Pθ(h = 0) = 1 for all θ G θ . D
58
Chapter 7
Lecture 25 Using the score function (or vector) Assume the usual setting, (S, A,Po),θ eθ
CW.
L
L'
First consider the case p = 1. Let u(s) be a trial solution of 7/(0 \ s) = 0. Assume that θ — θ = O(l/>/7(0)) and / is large. (Here 0 is the true parameter, Eβ(θ) « 0 and Var#(0) « 1/7(0).) Assume that u is not very inaccurate in the sense that, for any 0, u(s) -θ = 0(1/0X0)). Then θ-u = 0(1/0X0)) under 0, 0 = L'(θ(s) I 5) = L'(u(s) I 5) + (θ(s) - u(s))L"(u{s) \ s) + 0(1/7(0)) and
0 » =u(β) + ( -
V ^
I
L'(U(S)
u s
κ {)
I a) + 0(1/7(0)).
Dropping the last term (order 1/7(0)), we obtain the 'first Newton iterate' for solving 7/(0 I s) = 0.
Application 1. Let t^°)(s) be a trial solution of 7/(0 | 5) = 0. Let
One hopes that
->• l ( s ) . 59
A variant of this approach consists in taking
= u ω ( β ) + _ _ i _ L ' ( u ω ( s ) is) (since typically -L"(θ \ s)/I(θ) « 1 if I(θ) is large). Suppose we do not think it worthwhile to find θ exactly. Application 2. Start with a plausible estimate u(s) of θ, and improve it to
or
liu-θ = 0{l/yfϊψ)) and Eθ(u) -θ = O(l/y/ϊ(β)), then the first iterates have the same properties as θ, i.e., u* — θ and u** — θ are of order l/y/I(θ) and Var^w*) and Var<,(i**) are h(θ) = The case p > 1 Let «(s) = (ιti(s),... ,tίp(s)) : 5 —>• θ C W be some plausible estimate of θ. Then
u (β) = u(s) + {-L 0 (u(β) I s f t - V a d L p I s)| β = ϋ ( s ) } and u**(s)=u(s)
+ Γ1(u(s)){gmdL(θ
\
s)\θ=u{s)}
are versions of the first iteration of the Newton-Raphson method for solving grad L(θ \ s) 0. Let 11 || be the Euclidean norm. If \\u — θ\\ and \\θ — θ\\ are of the same order and EQ(Θ) W θ and Cov^(^(s)) « /~ 1 (^) J then u* and u** also have these properties - i.e., 1 Eθ(u*) « 0 and Cov^(w*(s)) « J " ^ ) (and similarly for ιz**). Example 1. s = ( Λ Ί , . . . , X n ), with the ^ iid with density f(x - θ) for 0 G R 1 . a. / is the normal density. 7(0) = n, Z/(0J 5) = n ( Z - θ) and L"(0 | 5) = -n. For any ?/, the first iteration gives u* = X = u**. χ
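The one-step (first Newton/scoring iterate) idea is easy to try out numerically. The sketch below is mine (not from the notes); it uses the Cauchy location model of Example 1(c), where $I_1(\theta) = 1/2$ and the likelihood equation has many roots, with the sample median as the $\sqrt{n}$-consistent starting value. All parameter values and the seed are arbitrary.

```python
import numpy as np

# One scoring step u** = u + I(u)^{-1} L'(u) from the sample median, Cauchy location model.
rng = np.random.default_rng(5)
theta, n, reps = 0.0, 200, 2000

def score(x, th):                          # L'(th | s) for the Cauchy location family
    d = x - th
    return np.sum(2 * d / (1 + d**2))

est = []
for _ in range(reps):
    x = theta + rng.standard_cauchy(n)
    u = np.median(x)                       # sqrt(n)-consistent starting value
    est.append(u + score(x, u) / (n / 2))  # I(theta) = n * I_1(theta) = n/2

est = np.array(est)
print("mean of u**   :", est.mean(), " (theta =", theta, ")")
print("n * Var(u**)  :", n * est.var(), " vs 1/I_1 =", 2.0)
```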
b. f(x) = |e~l L We know from the homework that θ is the median of Xu . . . , Xn. Here V and I do not exist, but the Chapman-Robbins bound gives Var^ί) > ^ for any unbiased estimate t of g. Show that Var#() = ^ + O ( ^ ) . (Note that
= I Vsiθ{X1) = - [ ^e^dx = - Γ x n nJ 2 n JJo o so that the variance bound is true for X.) 60
n
n
c. f(x) = \γ^z Here I\{θ) = \ and I(θ) — f. θ is hard to find (there are many roots of L'(θ I 5) = 0).
where C is a constant, and
Let u(s) be the median of {X 1 ? ..., X n }; then
u oo =u(s) +
iJ21+^~_M^))2.
Since it is true that u(s)—θ is 0(1/'y/n), we have Eo(u**) « # and Var0(?i**) « | , the information bound. e. f(χ) = αe~ 5 χ 4 , α,δ > 0, and Var(x) = 1. Here, as in (c) above, it is difficult to find WQ, and W$^ and WQ$ look awful. X is a plausible estimate since
EΘ(X) = θ and Var*(X) = \ = O{l/I(θ)) (I{θ) = n). The most important differences among the above four densities are the different tail behaviors: l(e).
Here a good estimate gives more weight to the extreme values than to the central values.
l(a).
NORMAL:
l(b).
Here the best estimate, the median, gives weights concentrated in the middle.
SHORT TAIL:
Here the best estimate X gives equal weight to all observations.
D O U B L E EXPONENTIAL:
61
l(c). CAUCHY: Here the optimal estimate(s) is (are) unknown.
Lecture 26 Continuing Example l(e) n
s) = a
where m'j = ^ Σ " = 1 ^ / for j = 1,2,3 (notice that m\ = X). This is not a threeparameter exponential family but a curved exponential family; but (m'^m^ra^) is equivalent to (X, 777,2,7713), where rrij = ^ΣILiC^* ~ ^Y f° r i = 2,3, which is the minimal sufficient statistic - i.e., (X, 7712,7713) is an adequate summary of data (for any statistical purpose) and nothing less will do. (In Example l(a), X is the minimal sufficient statistic, and, in Example 3, (X,m2) is the minimal sufficient statistic.)
L'(θ) = 4bΣtι(χi
- θ?
Let
$ = X + zml/2. Since L'{θ) = 0, we have 1
3
1
where 71 = m^/m^2 is the sample coefficient of kurtosis. There are several approaches to getting θ: Approach 3. Get an explicit form of z from the equation in z above, and substitute it into the expression for θ in terms of z. Approach 4. The graphic method:
(In the picture above, g = 7i/3.) 0 < z < \ηλ if 71 > 0 and \ a solution is where 0 < η < 1.
62
< z < 0 if 71 < 0; so
Approach 5. U rss y\. -f-
,
j χ
3 3m2 Note that, if n is large, then ra2 « 1 (since Var^(Xχ) = 1) and so 0 « X + | r a 3 (so outliers are given more weight than given by X). Here 7(0) = n and Var#(X) = ~ = O(j^y), so X is an acceptable starting value for approximating 0. Approach 6. u* = X + | ^ and u** = X + | m 3 (please check). It is not easy to find the exact properties of 0, u* and u**, but w** is the easiest to examine.
Homework 5 1. Show t h a t EΘ(u**) = θ and 2
\n )
2
YΣbn
1.37n
\n J
\n
2
(so that Eθ(m3) = 0 and Cov^(X,m 3 ) < 0). Since ra3 is a function of the (minimal) sufficient statistic T(s) = (X,ra 2 ,m 3 ), this statistic is not complete. Since Cov^(X,m 3 ) φ 0 (m 3 is an unbiased estimate of 0), we know that X is not even locally MVUE. (See Kendall and Stuart, vol. I, for "standard error of moments". A good reference to the use of the score function in general is C. R. Rao's Linear Statistical Inference.) Example 5. Our state space is {1,2} and the transition probability matrix is / 0ii 012 \ \ 021 022 /
=
/ 0i V 1 — 02
l 02
Suppose first that θ = (0,1) x (0,1) and that a Markoff chain with transition probability matrix as above starts at T and is observed for n one-step transitions. Thus s = (X θ5 X l 5 . . . 5 X n ) ? where X o = 1, and
£(θ\s)=
0^ ( 5 ) -0 1 / l l ( s ) (l-0 1 ) / l 2 ^0 2 / 2 2 ( 5 ) (l-0 2 ) / 2 1 ^,
JJ M=l,2
where fij(s) is the number of one-step transitions from i to j in 5. Since Λ1+/12+/22+ f21 = n, we have a three-dimensional minimal sufficient statistic and two parameters. If /21 + /22 > 0, /11 > 0 and /22 > 0, then we have (noticing that fu + / 1 2 > 0) f 0 = (0, 02) where θ\ —/11+/12 . *]*. and 02 = /21+/22 ^f • Since f v
i 5
Δ
n
L
_ fll
τ
1
/) (/I
fl2 -1 f\ ' 1 — (7i
τ
_ /22 -^
/) ι/9
/21
j
1 / 3 ' 1 — t/9
we have Eβ(Lι) = 0 = Eo(L2) (since (1 — θι)Eθ{fn)
63
__ f\\ Z)2 (7i
= 0\Ee(fn)i
f\2 Z) \2 ' 1 1 — (/I I /I
etc.) and
It is known that Eθ(fij) = nπifflθij + o(n) as n -> oc
where πι(θ) and π2(0) are the stationary distribution over {1,2} and
^»)
and
V/
1' Ω i /) λ — ( Ό\ ~t~ C/2 )
2 SO
/ /
/ W = n
ίd\ IΩ (Λ d \ 7Γi I (/ ) / \JΛ I JL — f/1 )
V
0
C\ vJ
τr2(0)/02(l-02)
\ 1
/ \
+
) °W'
The information bound for the variances of estimates of θ\ is ^^1
(and similarly
1
for 0 2 ). Is Var^(0i) « ^ l * ? It can be shown (though not easily) that Var6>(0i) = n + ° ( l / ) as n —>> oo, where &χ(0) is the C-R bound.
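The claimed information bound for Example 5 can be explored by simulation. The following sketch is mine (not from the notes); the transition probabilities, chain length and seed are arbitrary, and the chain is started in state 1 as in the example.

```python
import numpy as np

# Two-state chain with transition matrix [[th1, 1-th1], [1-th2, th2]], observed for n steps.
# Compare Var(th1_hat) with the information bound th1(1-th1)/(n*pi1),
# where pi1 = (1-th2)/(2-th1-th2) is the stationary probability of state 1.
rng = np.random.default_rng(6)
th1, th2, n, reps = 0.7, 0.4, 2000, 1000

est = []
for _ in range(reps):
    counts = np.zeros((2, 2))
    state = 0                                   # state "1" of the notes
    for _ in range(n):
        stay = th1 if state == 0 else th2
        new = state if rng.random() < stay else 1 - state
        counts[state, new] += 1
        state = new
    est.append(counts[0, 0] / counts[0].sum())  # th1_hat = f11 / (f11 + f12)

pi1 = (1 - th2) / (2 - th1 - th2)
print("mean th1_hat    :", np.mean(est))
print("n * Var(th1_hat):", n * np.var(est), " vs bound:", th1 * (1 - th1) / pi1)
```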
Lecture 27 In Example 5, θ is an open unit square consisting of points (0 1? 0 2 ). Let #1 = fj+f12 and 02 = fj+f f if /»j > 0 for all i, j . Otherwise, let 02 be arbitrary - say \, for convenience. It can be shown that
for all sufficiently large n and some fixed 0 < p(θ) < 1. Hence we can ignore the case fij = 0 in the computation of £"#(0) and Var#(0). Suppose we know that 02 = kθ\ for some 0 < k < oo; then now θ = {0X : 0 < 0! < l/jfc} and L oc / n log0i + /i2 log(l - 0i) + /21 log(l - kθλ) + / 2 2 log A;0χ. Exercise: Show that / in the present case is greater than / in the previous case, for sufficiently large n. (Recall that Eθ(fij) = nτri(θ)θij + o(ή).) The equation for 0χ is now a cubic. We can solve it explicitly, or we can approximate it by v = u* or u**, with u = , {" (say). Then we have EQ(V) = 0χ + o(l) and V ( v ) = l/(n present /) + o(l). A special case of the above is when 0χ = 02 - i.e., k = 1 - so that
where of course y = fn + / 2 2 . It turns out that y is a B(n,θι) 0i = y/n satisfies Var*(0i) = ^0 X (1 - 0i). This is the new I " 1 . Example 6. X έ - JV(0,1), θ = (0,1).
64
variable, so that
for all
a.
+ -^2-^3 H
H
n-1 is unbiased for θ and n-2 2
is unbiased for # . «i + ky/U2 is an estimate of 0; what are its properties? b. Covθ(Xi, Xj) = θ for all i φ j . In both cases (Xι,.. .,Xn) is from a stationary sequence. What is I(θ) in 6(a) and 6(b)? What estimate(s) t (t = 0? t = u*Ί t = u**Ί) has (have) the property 1 that Eθ(t) m θ and Vaxβ(t) « i " ^ ) for large n? 1 In 6(a), find |C| and C" , where
C = Covθ(s) = Θ \
n
ι
θ~
•••
θ
1
)
(C'1 is tridiagonal.) In 6(b), find |£>| and D'1, where /I
β
••• θ \
θ
1
••• :
D=
••. \Θ
••.
Θ
••• 0 i
1 ••• 1 (D= ( 1 -
θu, where u = |
: 1
•-. \, so D~ι = αl + βu.) ... 1
Homework 5 2. (Optional) Answer the questions in Example 6.
A review of the preceding heuristics Suppose θ is real. i. CONSISTENCY: θ is close to the true θ. ii. Eβφ) « θ in fact, Eθ{θ) = θ +
0{l/y/ϊψ)). 65
lίu is any estimate such that u = θ + θ(—^=),
then u*, u** etc. also have properties
(ii) and (iii). Consistency is difficult even today. Assuming that θ exists and is consistent, then (ii) and (iii) remain difficult, but one can say that 0, u*, u*\ etc. are « N(θ, l//(0)) where I(θ) is large. Theorem (on consistency). Let Xι be iid. £(θ \ Xι) depends on θ G θ = (α,6)
with -oo
χ
+oo ; and £{θ \ s) = ΠΓ=i W I i)-
Condition 1. For all 5, £(- \ s) is continuous. Let θn : 5 —> θ 6e some function; θ is an ML estimate ^ θ is measurable and £(θ(s) \s)
=sup£(δ\s) δeθ
whenever the supremum exists. Condition 2. lim^_>α£(0 | X\) and lim^_>5^(0 | Xi) exist a.e. with respect to the dominating measure for Xγ\ denote these limits by £(a \ X{) and £(b \ Xι). Condition 3. If θ G θ , then {xι:£(θ\xι)φ£(a\xι)}
and
{an : ^ I x θ ^ £{b
have positive measures (with respect to the dominating measure for Xι).
θ,δeθ
For any
with θφδ, {xx : £(θ I n ) # i(δ I an)}
has positive measure. 1 (LeCam). Condition 1 implies that an ML estimate exists. 2 (Wald). Conditions 1-3 imply that, for all θ G θ , with probability 1, 1. θn actually maximizes the likelihood for all sufficiently large n. 2. liin^oo^fl. Note. The proof of (2) depends on the fact that [α, b] is compact. There are difficulties in extending the proof to, say, θ C W, because it is difficult to find a suitable compactification of θ .
66
Chapter 8
Lecture 28 Example 7. Xi are iid uniformly over (0, θ) for θ G θ = (0, oo).
Homework 6 1. Show that a. With respect to Lebesgue measure on R n ,
Λ otherwise and θ = max{Xi,..., Xn}. b. Condition 2 in the Theorem above is satisfied, and hence θn -^> θ for all θ (which we check directly also); but the likelihood function is not continuous, and hence the information function is not defined. c. Eθ(θn) = ^ 0 , and θn := ^θ
is unbiased.
d. n(θ — θn) has the asymptotic distribution with density | β ~ t on (0, oo), and so θn has a non-normal limiting distribution and θn — θ = O(l/n). (In regular cases, θ has a normal limiting distribution and θn — θ — 0(1/y/n).)
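The limiting law in part (d) above can be checked directly. The sketch below is mine (not from the notes); it uses the fact that the maximum of $n$ Uniform$(0,\theta)$ variables has distribution function $(x/\theta)^n$ and so can be sampled as $\theta U^{1/n}$ with $U$ uniform on $(0,1)$. Parameter values and seed are arbitrary.

```python
import numpy as np

# n*(theta - max X_i) should be approximately exponential with density (1/theta) e^{-t/theta}.
rng = np.random.default_rng(7)
theta, n, reps = 2.0, 500, 200_000

mx = theta * rng.random(reps) ** (1.0 / n)      # same law as max of n Uniform(0, theta)
err = n * (theta - mx)

print("mean of n*(theta - theta_hat):", err.mean(), " (limit mean = theta =", theta, ")")
print("P(err > theta):", (err > theta).mean(), " (limit: exp(-1) =", np.exp(-1), ")")
```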
Asymptotic distribution of θ (θ real) in regular cases X = {#} (arbitrary), C is a σ-field on J , P^ is a probability on C and θ G θ for θ an open interval in R 1 . dPe(x) — ί(θ \ x)dv(x), with v a fixed measure. Let sn = (Xu . . . , Xn) e S ( n ) = I x x I , Λ{n) = C x - x C and P^n) = Pθ x - x Pθ on A(n). We assume that l(θ \ x) > 0, L(θ \ x) = logei(θ \ x) has at least two continuous derivatives, EQ(L'(Θ \ x)) = 0 and 2
h(θ) = EΘ(L'(Θ I x)) = -EΘ(L"(Θ 67
I x)) > 0.
We have L(θ \ sn) = Σti o r an
L θ
I *i), L\θ
(
ven
Isn) = ΣΓ=i L'(θ Iχi)
a n d L
" ( ^ Isn) =
w e
ΣΓ=i ^ " ( ^ I ^*) ^ ^ S^ ^J know that a good estimate of 0 based on sn will be approximately o(0) + 6(0)1/(0 | s n ), and Z/(0 | s n ) « iV(0, *), so a good estimate of 0 based on sn will be approximately normally distributed when n is large. We have L J "M ") -). -/i(0). Assume that: Condition (*). Given any 0 G θ , we may find an ε = ε(θ) > 0 such that max
\L"(δ\x)\
\δ-θ\<ε
has a finite expectation under P#. Assume also that θn exists and is consistent. Then
0 = L'φn I sn) = L'(ί I sn) + (θn - θ)L"(θ*n I sn), where ^* is between θ and 0 n . Since 0* —>• ^ in P#, we have 0
n
(**)
in Pθ.
So where ξn —>• 0 in P$. Since '
-» N(0,h{θ))
in distribution under Pθ,
we have:
1 (Fisher). φι{θn -θ)-+
N(Q,Iλ{θ)).
Note. This does not assert that Eg(θn) = θ + o(l) or that V a r ^ n ) = nIι,θ)
+-
°f (**)• Fix 0. Under (*), we have h{r) := Eθ\max\δ_θ\
\ x) - L"(θ | x)|l < +oo
for sufficiently small r > 0. /ι is continuous in r and decreases to 0 as r —»> 0. For any η > 0, choose r such that h(r) <η. We have
iz/'(0; i β n ) = h"(θ i βn) + Δ , where
= l \ Σ i L " ( θ n I Xi) - L"(θ I X,)] I < I ^ | L " ( ^ IX ι ) - L"(θ \Xi)\ 68
Suppose that \θn - θ\ < r; then \θ*n - θ\ < r and hence | Δ n | < J £ " = 1 M p Q ) , where M(X)
= m a x ί _ , | < J L " ( 5 | X)
- L"{θ
\X)\.
Since E \M(Xi)\ < η, we have n
1 n
ti
Since 77 is arbitrary and 0 n —> ^ in P^, we have that | Δ n | -> 0 in Pθ.
D
e. It was asserted by Fisher (and believed for a long time) that, if tn = tn(sn) any estimate of θ such that
is
y/ΰ(tn — θ) —> Λ^(θ, t'(^))
in distribution as n —» oo,
then υ(θ) > l//i(0). This is, however, not quite correct, as shown by the following counterexample (due to J. L. Hodges, 1951): Let Xι be iid iV(0,1) and θ = K 1 . Let 0 n = ~X~n. y/nφn - 0) is JV(0,1) and h(θ) = 1. Let \~Ϋr~
if
\'Yr~\ \
/κi-l/4
, ,.
tn = <
IrY
if
< n~ι
Y
A
V,
then v^(ίn -θ)-+
N(0,υ(θ)) for all 6>, where
iίβφ0
m = {\ l^c2
if 0 = 0,
and so v(θ) > 1 breaks down at 0 = 0 (if we choose — 1 < c < 1).
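Hodges's construction is easy to simulate. The sketch below is mine (not from the notes); $c$, $n$ and the seed are arbitrary, and $\bar X_n$ is drawn directly from its $N(\theta, 1/n)$ distribution.

```python
import numpy as np

# Hodges's estimator in the N(theta, 1) model: t_n = Xbar_n if |Xbar_n| >= n^{-1/4},
# and c*Xbar_n otherwise (|c| < 1).  At theta = 0 its asymptotic variance is c^2 < 1 = 1/I_1.
rng = np.random.default_rng(8)
c, n, reps = 0.1, 10_000, 200_000

for theta in (0.0, 0.5):
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    tn = np.where(np.abs(xbar) >= n ** -0.25, xbar, c * xbar)
    print("theta =", theta, "  n * Var(t_n) ~", n * tn.var(), "  (1/I_1 = 1)")
```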
Lecture 29 Definition. We say that {zn} is ϋV(μ n ,σ^) if
Pί^U^ \
Φ{z) for all z.
σ
n
J
Consider the condition Condition (***). {tn - θ} is AN(θ,v(θ)/n)
under θ (for each θ).
In Hodges's counterexample in the context of Example l(a),
V n f o - θ) = φ{θ)MX~n
-θ)+ ξn(s, θ),
where ξn —>• 0 in P^-probability and
so that tn is AN{θJv(θ)/n) following theorem:
2
for ^(0) = ¥? (0).
69
This provides an example of the
2 (Le Cam/Bahadur). The set
is always of Lebesgue measure zero for any tn satisfying (***). Corollary. // {tn} is regular in the sense that v is continuous in θ and 1\ is also continuous, then v(θ) > l//i(0) for all θ G θ . Note. This should not be confused with the C-R bound, since (***) does not imply that tn is unbiased, nor that v(θ) = n V&γθ(tn). In the general case, (***) does imply that tn is asymptotically median unbiased, i.e., that Pe(tn < θ) —> \ as n —> oo for each θ. Suppose this holds uniformly; then also it must be true that v(θ) > 1/Iι(θ) for all θ. This follows from: 3. If θ is a point in θ , a > 0 and δn(a) = θ + -^, and lim Pδn(a)(tn > δn(θ)) > ~,
n-> oo
then υ(θ) > l/h(θ)
v
'
v
2
(for the given θ).
Corollary. Suppose that tn is super-efficient (v < 1/Iχ) at a point θ. Then, given any a > 0, we may find S\ = ει(a) > 0 and 62 = S2{o) > 0 such that Pθ+ a I tn > θ + —= J < - - εx ^ V vn/ ^
and
Pθ—j= \tn < θ ^ V
-=) < - - ε2 vn/ 2
for all sufficiently large n. Definition. Let Fn be a sequence of distributions on Rk and Fo be a given distribution on Rk. We say that Fn 4 Fo iff
b(x)dFn(x) -> f
b(x)dFθ(x)
for all bounded continuous functions b : Rk —> R 1 . 4 (Hajek). Let Fn,θ = C(^/n(rn - θ)) and suppose that FnM^
A G for all
\a\ < 1. Then G is the distribution function of X + Y, where X is JV(θ, l//i(0)) and X and Y are independent. (This is true for all θ. G can depend on θ.) Corollary. The variance of G (if it exists) is at least l//i(0). Conclusion. At least in the iid case, Fisher's assertion is essentially correct. 70
Proof of (3) (outline). Choose 0 G θ and a > 0, and let δn = θ + ^ . For fixed n, consider testing θ against δn. τ ^ y is the optimal (LR) test statistic, whose logarithm is L n ( δ n ) - L(θ) = ± L \ Θ ) + | jn
in
W
) = J=L'(Θ) jn
- \c?h{θ) I
+ •••,
where the omitted terms are negligible. Let n I sn) - L(θ Iβ n ) + ^
Kn(sn) =
2
.&„ is equivalent to the LR statistic and Kn —>• AΓ(0,1) under . Consider the distribution of Kn under δn,
(κ
Sn n
JKn
Kn(Sn)
Jyy
->
e~2
Jy
-00
where Fn(j/) = Pθ(Kn < y). Note that_Fn(i/) Given a sequence {ίn} such that limn^oo Psn(tn > δn) > 1/2, choose z > yfa2lι(θ). Then, by the above result, Psn{Kn > z) < 1/2 for all sufficiently large n. Regard {tn > $n} and {Kn > zn} as critical regions for the test; then, by the Neyman-Pearson lemma, we have that, for some subsequence {nk}, Pθ(Knk > z) < Pe(tnk > δnk) for all sufficiently large k; but and Pθ(Kn > z) -> 1 - Φ(*),
Pθ(tn>δn)=Pθ{y/ϊi(tn-θ)>a) so
z > ^Q?h{θ) => Pθ{Knk >z
Pθ(tnk >θ + a
Letting k -> oo, we find that P(ΛΓ(0, ! ) > * ) < P(N(0,1) > a/yfi 2
and hence z > a/y/v(θ). Since z was arbitrary, we must have yja lι(θ) > a/y/v(θ) and hence v(β) > l//i(β). •
71
Lecture 30 Proof of (2). Assume only (***), i.e., that φι(tn - θ) % N(Q,υ(θ)) for θ € θ , and let J be a bounded subinterval of θ , say (a, b). Let θ)
and φn(θ) = |φ n (0) - ^ .
Then 0 < y>n(0) < | and, from (***), Φ n (0) -> | and <^n(0) -+ 0 for each θ. Hence 0 H-> Ij(β)φn(θ), where /,/ is an indicator function, is bounded on θ and tends to 0, 8θfθlj(θ)φn(θ)dθ->0,oτ 1
JR
but Ij(δ + -j=) -> Ij(δ) except for δ an endpoint of J, so
Noticing that Ij(δ)φn[δ+^)
> 0, we have Ij(δ)φn{δ+^τ=)
so that there is some sequence {nk} such that Ij(δ)φnk(δ thnsφnk(δ+φ=)
-> 0 a.e. (Lebesgue) on J-i.e., Pθ+^_(tnk
-> 0 in Lebesgue measure, +-τ=) -> 0 a.e.(Lebesgue); > θ+φ=)-\
-> 0 a.e. on
J. Returning to the original sequence, we have that lim^oo P JL. (tn > ^+77^) > 1/2 a.e. on J and so, from (3), v{θ) > 1/Iι(θ) a.e. on J . Since J was any bounded subinterval of θ , this means that v(θ) > l/h{θ) a.e. on θ . D Θ+
General regular case For each n, let (Sn,An,Pβ)
be an experiment with common parameter Θ=
(θ1,...,θp)eθ,
where θ is open in Mp, such that Sn consists of points sn. No relation between n and n + 1 is assumed. In Examples 1-5, we have Sn = X x - - x X and PQ^ — Pθ x Pθ. In Examples n times
6 and 7, P# is the distribution of sn = ( X i ? . . . , X n ), where the X; are noί iid. Example 8. For n = 2,3,..., let n\ and n 2 be positive integers such that n — nxJrn2. Let sn = (Xu . . . , Xni',Yu , K 2 ) , where A Ί , . . . , Xni,Yι, , Yn2 are independent, 2 Xι,..., X n i are N(μuσ ) distributed and Yu...,Yn2 are Λ^(μ2, ^ 2 ) distributed. Here θ = (μi,μ 2 ,σ 2 ) is entirely unknown. This is a three-parameter exponential family, and the complete sufficient statistic is Til
712
Til
72
712
If Πι/ri2 -> p as n —>• oc for some 0 < p < oo, all regularity conditions to follow are satisfied.
The local asymptotic normality condition n
Choose θ G θ and assume that dP^ \sn) neighborhood of θ.
(sn) holds for all δ in a
= ΩsA^dP^
Condition LAN (at θ G θ). For each a G W, loge(Ωθ+^θ(sn))
f
= az n(θ) - \a'h{β)a
+ Δ n (0, s n ),
where /i is a fixed p xp positive definite matrix, zn(θ) £ W and z n (0) -Λ Λ^(θ, n)
and Δn((9, sn) -> 0 in Pj -probability.
i. If «sn = (J¥"i,... ,Xn)? where the XiS are iid, and 7χ is the information matrix for Xι, then LAN is satisfied for this I\\ but the LAN condition holds in some "irregular" cases also - see Example l(b). ii. The right-hand side in LAN with Δ n omitted is exactly the log-likelihood in the multivariate normal translation-parameter case. See Example 4. Let g : θ —> M1 be continuously differentiate and write h(θ) = gradρ(#). 2P (Le Cam). If tn = tn(sn)
is an estimate of g such that
Vn(*n - 9{β)) ^
7V(O5 t (tf)) V0 € θ ,
then {^ : υ(θ) < h(θ)} is of (p-dimensional) Lebesgue measure 0 if we let bι(θ) = h(θ)Iϊ1{θ)h'(θ). 4? (Hajek). Suppose that un : Sn —ϊ θ is s.t. \fn(un - (θ + a/y/n)) —
n
) uθ
(UΘ independent of α), then UQ may be represented as v# + WΘ, where VQ and are independent and VQ ~ iV(θ,/f 1 ^)). e. No uniformity in a is needed in Hajek's theorem. From the above we see that, for large n, the iV(θ, 7f 1 (^)/n) distribution is nearly the best possible for estimates of θ. n is the "sample size", or cost of observing sn.
73
Sufficient conditions for LAN n)
m$n)
Suppose that L(θ \ sn) exists for each π, i.e., that dP^ (sn) = e dv^(sn) for all n, and that, for each n, L( | sn) has at least two continuous derivatives. We write L ί = e . Let L«(0 | sn) = gradL(0 \ sn). w
Condition 1. ^L (θ
| sn) ^> JV(θ, Ji(0)) for some positive definite Ix. n)
Condition 2. ^{Lijiθ \ sn)} -> -h{θ) in P^ -probability. Condition 3. With M(Θ,Ί,sn)
:= i
^ { | L y ( ί I βn) - ^-(β I 8n)\}, i j l 7
n)
(M(0,7, sn) > ε) = 0 for every ε > 0.
Conditions 1-3 imply LAN with Δ n -> 0, and also the following: (Fisher). Under Conditions 1-3, if θn = 0n(sn), the MLE of 0, exists and is consistent, then V 0 » - 0) -^> ^(O,/!" 1 ^)) V0 G θ . Definition. Let un = un(sn) be an estimate of 0. un is CONSISTENT if un -Λ 0 for all 0, or, equivalently, (ixn - 0)(un - 0)' -A 0. u n is V^-CONSISTENT if n(un - θ)(un - θ)' is bounded in Pθ for all 0. (We say that Yn is BOUNDED in P if, given any ε > 0, we may find k such that -P(|K| > k) < ε for all n sufficiently large.) (continued). If un is a y^-consistent estimate of 0 and
and
then < and u*n* are both AN(θ,I{1(θ)/n). Consequently, t*n = g(u*n) and a r e b o t h C = 5«*) ^^(
74
References

Chandrasekhar, S. (1995). Newton's Principia for the Common Reader. Oxford: Oxford University Press.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.
Ibragimov, I. A., and R. Z. Has'minskii (1981). Statistical Estimation: Asymptotic Theory. New York: Springer-Verlag.
Joshi, V. M. (1976). On the Attainment of the Cramér-Rao Lower Bound. Annals of Statistics, 4, 998-1002.
Le Cam, L. (1953). On Some Asymptotic Properties of Maximum Likelihood Estimates and Related Bayes's Estimates. University of California Publications in Statistics, 1, 277-330.
Lehmann, E. L. (1991). Theory of Point Estimation, 2nd ed. (1st ed. 1983). Pacific Grove: Wadsworth.
Pitman, E. J. G. (1979). Some Basic Theory of Statistical Inference, [2.1] in Chapter Two. Pacific Grove: Wadsworth.
Rao, C. R. (1973). Linear Statistical Inference. New York: John Wiley.
Stuart, A., J. K. Ord and S. Arnold (1994). Kendall's Advanced Theory of Statistics, Chapter 10. London: Arnold.
Wijsman, R. A. (1973). On the Attainment of the Cramér-Rao Lower Bound. Annals of Statistics, 1, 538-542.
Wong, W. H. (1992). On Asymptotic Efficiency in Estimation Theory. Statistica Sinica, 2, 47-68.
Zheng, Z. and K. T. Fang (1994). On Fisher's Bound for Stable Estimators with Extension to the Case of Hilbert Parameter Space. Statistica Sinica, 4, 679-692.
Raghu Raj Bahadur 1924-1997 By Stephen M. Stigler Raghu Raj Bahadur was born in Delhi, India on April 30, 1924. He was extremely modest in demeanor and uncomfortable when being honored, but on several occasions the Chicago Department of Statistics managed to attract him to birthday celebrations by taking advantage of the coincidence of his and Gauss's birthdates ~ he would come to honor Gauss, not to receive honor for himself. At St. Stephen's College of Delhi University he excelled, graduating in 1943 with first class honors in mathematics. In 1944 he won a scholarship and generously returned the money to the College to aid poor students. But it is clear that he had not yet found his calling. In that year, 1944, his was judged the best serious essay by a student in the College. The essay gave no hint of the career that was to follow: it was a somber essay, on the isolation of individuals, and it gave a dark and pessimistic view of the search for meaning in life - a vision that was foreign to the Raj we knew in later years. He continued on at Delhi, receiving a Masters degree in mathematics in 1945. After a year at the Indian Institute of Science in Bangalore he was awarded a scholarship by the government of India for graduate studies, and in October 1947, after spending one year at the Indian Statistical Institute in Calcutta, Raj took an unusual and fateful step. While India was in the upheaval that followed partition and preceded independence, Raj traveled to Chapel Hill, North Carolina, to study mathematical statistics. In barely over two years he completed his Ph.D. His dissertation focused on decision theoretic problems for k populations, a problem suggested by Harold Hotelling (although Herbert Robbins served as his major professor). In a December 1949 letter of reference, Harold Hotelling wrote: "His thesis, which is now practically complete, includes for one thing a discussion of the following paradox: Two samples are known to be from Cauchy populations
whose central values are known, but it is not known which is which. Probability of erroneous assignment of the samples to the two populations may be larger in some cases when the greater sample mean is ascribed to the greater population mean than when the opposite is done." His first paper, including this example, was published in the Annals of Mathematical Statistics (1). At the winter statistical meetings in December 1949, W. Allen Wallis contacted him to sound him out - was he interested in joining the new group of statisticians being formed at the University of Chicago? He was interested, and Allen arranged for Raj to start in the Spring Quarter of 1950. Raj's move to Chicago was to prove a pivotal event in his life. He left Chicago twice (in 1952 and in 1956), and he returned twice (in 1954 and in 1961). He never forgot his roots in India, and the pull of family and the intellectual community in Delhi caused him to return there time and again throughout his life, but Chicago had a special, irresistible allure for him. In the decade following 1948, Allen Wallis assembled an extraordinarily exciting and influential intellectual community. Starting with Jimmie Savage, William Kruskal, Leo Goodman, Charles Stein, and Raj Bahadur, he soon added David Wallace, Paul Meier, and Patrick Billingsley. Raj thrived at Chicago, although sometimes the price was high. One of his great achievements was his 1954 paper on "Sufficiency and Statistical Decision Functions," (5) a monumental paper (it ran to 40 pages in the Annals) that is a masterpiece of both mathematics and statistics. The story of its publication tells much about the atmosphere in Chicago in those days. It was originally submitted in May of 1952, and, with Raj away in India, it was assigned to Jimmie Savage as a referee. Savage was favorable and impressed, and fairly quick in his report (he took two months on what must have been a 100 page manuscript of dense mathematics), so why was there a two year delay in publication? It was not because of a backlog; the Annals was
publishing with a three month delay in those days. Rather it was the character of the report and the care of Raj's response. For while Savage was favorable, his reports (eventually there were three) ran to 20 single-spaced pages, asking probing questions as well as listing over 60 points of linguistic and mathematical style. Somehow Raj survived this barrage, rewriting the paper completely, benefiting from the comments but keeping the work his own, and preserving, over another referee's objections, an expository style that explained the deep results both as mathematics and again as statistics. From 1956-61 Raj was again in India, this time as a Research Statistician at the Indian Statistical Institute, Calcutta, but in 1961 he returned to the University of Chicago to stay, except for two leaves back to India. He retired in 1991 but continued to take vigorous part in the intellectual life of the Department as long as his increasingly frail health permitted. He died on June 7, 1997. Bahadur's research in the 1950s and 1960s played a fundamental role in the development of mathematical statistics over that period. These works included a series of papers on sufficiency (5)-(8), investigations on the conditions under which maximum likelihood estimators will be consistent (including Bahadur's Example of Inconsistency) (12), new methods for the comparison of statistical tests (including the measure based upon the theory of large deviations now known as Bahadur Efficiency) (16,17), and an approach to the asymptotic theory of quantiles (now recognized as the Bahadur Representation of Sample Quantiles) (25). C. R. Rao has written "Bahadur's theorem [his 1957 converse to the Rao-Blackwell theorem (11)] is one of the most beautiful theorems of mathematical statistics" [in Glimpses of India's Statistical Heritage, Ed. J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy, Wiley Eastern, 1992, p. 162]. Other work included his approach to classification of responses from dichotomous
questionnaires (including the Bahadur-Lazarsfeld Expansion) (20, 21), and the asymptotic optimality of the likelihood ratio test in a large deviation sense (24). Bahadur summarized his research in the theory of large deviations in an elegant short monograph, Some Limit Theorems in Statistics (30). Virtually everything Raj did was characterized by a singular depth and elegance. He took particular pleasure in showing how simplicity and greater generality could be allies rather than antagonists, as in his demonstration that LeCam's theorem on Fisher's bound for asymptotic variances could be derived from a clever appeal to the Neyman-Pearson Lemma (23). Raj's work was remarkable for its elegance and deceptive simplicity. He forever sought the "right" way of approaching a subject - a combination of concept and technique that not only yielded the result but also showed precisely how far analysis could go. Isaac Newton labored hard to draw the right diagram, to outline in simple steps a demonstration that made the most deep and subtle principles of celestial mechanics seem clear and unavoidably natural. Raj had a similar touch in mathematical statistics. His own referee's reports were minor works of art; his papers often masterpieces. In the early 1950s he married Thelma Clark, and together they raised two fine children, Sekhar and Sheila Ann, of whom they were immensely proud. From his first arrival in 1950 for the rest of his life, Raj felt Chicago was a precious place. He evidently found here, in the intellectual life of the Department and the close companionship of his family, the meaning he had been seeking when he wrote his somber essay in 1944. Raj Bahadur was President of the IMS in 1974-75, and he was the IMS's 1974 Wald Lecturer. He was honored by the Indian Society for Probability and Statistics in November
1987. In 1993 a Festschrift was published in his honor, Statistics and Probability, edited by J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy, and B. L. S. Prakasa Rao (Wiley Eastern).
Raghu Raj Bahadur
Born April 30, 1924, Delhi, India; died June 7, 1997, Chicago, Illinois.
Education
B.A. (Honours) Mathematics (with Physics), Delhi University (St. Stephen's College), 1943.
M.A. Mathematics, Delhi University (St. Stephen's College), 1945.
Ph.D. Mathematical Statistics, University of North Carolina (Chapel Hill), 1950.
Professional Career
Research Associate in Applied Statistics, Indian Statistical Institute, Calcutta, 1946-47.
Research Associate in Mathematical Statistics, University of North Carolina, 1949-50.
Instructor in Statistics, University of Chicago, 1950-51.
Professor of Statistics, Indian Council of Agricultural Research, New Delhi, 1952-53.
Visiting Assistant Professor of Mathematical Statistics, Columbia University, 1953-54.
Assistant Professor of Statistics, University of Chicago, 1954-56.
Research Statistician, Indian Statistical Institute, Calcutta, 1956-61.
Associate Professor of Statistics, University of Chicago, 1961-65.
Professor of Statistics, University of Chicago, 1965-91.
Distinguished Visiting Professor, Indian Statistical Institute, 1972-97.
Professor Emeritus, University of Chicago, 1992-97.
Professional Memberships
Fellow, Institute of Mathematical Statistics.
Member, International Statistical Institute.
Fellow, Indian National Science Academy.
Fellow, Indian Academy of Sciences.
Professional Activities and Honors
John Simon Guggenheim Fellow, 1968-69.
Ten lectures on limit theorems in statistics at the SIAM regional conference at Tallahassee, Florida, in 1969.
Associate Editor, Annals of Mathematical Statistics, 1964-1973.
Member, Council of the Indian Statistical Institute, 1972-74.
Member, Editorial Board of Sankhya.
Wald Lecturer, 1974 Annual Meeting of the Institute of Mathematical Statistics.
President, Institute of Mathematical Statistics, 1974-75.
Invited to the Department of Mathematics, University of Maryland, to deliver six lectures (September 1975) as part of their "Year in Probability and Statistics" program.
Chairman of the Editorial Board of the IMS-University of Chicago Monograph Series, from April 1977.
Fellow, American Academy of Arts and Sciences, from 1986.
Outstanding Statistician of the Year, Chicago Chapter of the American Statistical Association, 1992.
Publications
(1) "On a problem in the theory of k populations," Ann. Math. Statist. 21 (1950), 362-375.
(2) "The problem of the greater mean" (with H. Robbins), Ann. Math. Statist. 21 (1950), 469-487.
(3) "A property of the t statistic," Sankhya 12 (1952), 78-88.
(4) "Impartial decision rules and sufficient statistics" (with Leo A. Goodman), Ann. Math. Statist. 23 (1952), 553-562.
(5) "Sufficiency and statistical decision functions," Ann. Math. Statist. 25 (1954), 423-462.
(6) "Two comments on sufficiency and statistical decision functions" (with E. L. Lehmann), Ann. Math. Statist. 26 (1955), 139-142.
(7) "A characterization of sufficiency," Ann. Math. Statist. 26 (1955), 286-293.
(8) "Statistics and subfields," Ann. Math. Statist. 26 (1955), 490-497.
(9) "Measurable subspaces and subalgebras," Proc. Amer. Math. Soc. 6 (1955), 565-570.
(10) "The nonexistence of certain statistical procedures in non-parametric problems" (with L. J. Savage), Ann. Math. Statist. 27 (1956), 1115-1122.
(11) "On unbiased estimates of uniformly minimum variance," Sankhya 18 (1957), 211-224.
(12) "Examples of inconsistency of maximum likelihood estimates," Sankhya 20 (1958), 207-210.
(13) "A note on the fundamental identity of sequential analysis," Ann. Math. Statist. 29 (1958), 534-543.
(14) "Some approximations to the binomial distribution function," Ann. Math. Statist. 31 (1960), 43-54.
(15) "Simultaneous comparison of the optimum and sign tests of a normal mean," Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, Stanford University Press (1960), 79-88.
(16) "Stochastic comparison of tests," Ann. Math. Statist. 31 (1960), 276-295.
(17) "On the asymptotic efficiency of tests and estimates," Sankhya 22 (1960), 229-252.
(18) "On deviations of the sample mean" (with R. R. Rao), Ann. Math. Statist. 31 (1960), 1015-1027.
(19) "On the number of distinct values in a large sample from an infinite discrete distribution," Proc. Nat. Inst. Sciences, India, 26, A (Supp. 11) (1960), 67-75.
(20) "A representation of the joint distribution of n dichotomous items," Studies in Item Analysis and Prediction, H. Solomon, ed., Stanford University Press (1961), 158-168.
(21) "On classification based on responses to n dichotomous items," Studies in Item Analysis and Prediction, H. Solomon, ed., Stanford University Press (1961), 169-176.
(22) "Classification into two multivariate normal distributions with unequal covariances" (with T. W. Anderson), Ann. Math. Statist. 33 (1962), 420-431.
(23) "On Fisher's bound for asymptotic variances," Ann. Math. Statist. 35 (1964), 1545-1552.
(24) "An optimal property of the likelihood ratio statistic," Proc. Fifth Berk. Symp. Math. Statist. Prob., 1 (1965), 13-26.
(25) "A note on quantiles in large samples," Ann. Math. Statist. 37 (1966), 577-580.
(26) "Rates of convergence of estimates and test statistics," Ann. Math. Statist. 38 (1967), 303-324. (This was a Special Invited Address to the Institute of Mathematical Statistics.)
(27) "Substitution in conditional expectation" (with P. J. Bickel), Ann. Math. Statist. 39 (1968), 377-378.
(28) "On conditional test levels in large samples" (with P. J. Bickel), University of North Carolina Monograph Series in Probability and Statistics, No. 3 (1970), 25-34.
(29) "Some asymptotic properties of likelihood ratios on general sample spaces" (with M. Raghavachari), Proc. Sixth Berk. Symp. Math. Statist. Prob., 1 (1970), 129-152.
(30) Some Limit Theorems in Statistics. NSF-CBMS Monograph, No. 4 (SIAM, 1971).
(31) "Examples of inconsistency of the likelihood ratio statistic," Sankhya 34 (1972), 81-84.
(32) "A note on UMV estimates and ancillary statistics," Contributions to Statistics (Hajek Memorial Volume), Academia (Prague), 1979, 19-24.
(33) "On large deviations of the sample mean in general vector spaces" (with S. L. Zabell), Ann. Probability 7 (1979), 587-621.
(34) "Large deviations, tests, and estimates" (with J. C. Gupta and S. L. Zabell), Asymptotic Theory of Statistical Tests and Estimation (Hoeffding Volume), 1979, 33-67. Academic Press.
(35) "Hodges Superefficiency," 1980. Encyclopedia of Statistical Sciences, Vol. 3 (F-H). John Wiley.
(36) "On large deviations of maximum likelihood and related estimates," Tech. Report No. 121, Department of Statistics, University of Chicago, 1980.
(37) "A note on the effective variance of a randomly stopped mean," Statistics and Probability: Essays in Honor of C. R. Rao, 1982, 39-43. North-Holland Publishing Co.
(38) "Some further properties of the LR statistic in general sample spaces" (with T. K. Chandra and D. Lambert), 1982. Proceedings of the Golden Jubilee Conference, 1-19, Indian Statistical Institute, Calcutta.
(39) "Large deviations of the maximum likelihood estimate in the Markov chain case," Recent Advances in Statistics, 1983, 273-286. Academic Press.
(40) "Distributional optimality and second-order efficiency of test statistics" (with J. C. Gupta), 1986. In Adaptive Statistical Procedures and Related Topics, Proceedings of a symposium in honor of H. Robbins, IMS Lecture Notes Monograph Series, Vol. 8, 315-331.
IMS Lecture Notes—Monograph Series
NSF-CBMS Regional Conference Series in Probability & Statistics
Order online at: http://www.imstat.org/publications/imslect/index.phtml
[Publisher's catalog of series volumes, editors, and member/general prices appeared here.]
ISBN 0-940600-53-6