Institute of Mathematical Statistics LECTURE NOTES-MONOGRAPH SERIES
Shanti S. Gupta, Series Editor
Volume 9

Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory

Lawrence D. Brown
Cornell University

Institute of Mathematical Statistics
Hayward, California

Institute of Mathematical Statistics Lecture Notes-Monograph Series
Series Editor, Shanti S. Gupta, Purdue University

The production of the IMS Lecture Notes-Monograph Series is managed by the IMS Business Office: Nicholas P. Jewell, Treasurer, and Jose L. Gonzalez, Business Manager.

Library of Congress Catalog Card Number: 87-80020
International Standard Book Number 0-940600-10-2
Copyright © 1986 Institute of Mathematical Statistics
All rights reserved
Printed in the United States of America
To my family for their love and understanding
PREFACE
I first met exponential families as a beginning graduate student. The previous summer I had written a short research report under the direction of Richard Bellman at the RAND Corporation. That report was about a dynamic programming problem concerning sequential observation of binomial variables. Jack Kiefer read that report. He conjectured that the properties of the binomial distribution used there were properties shared by all "Koopman-Darmois" distributions. (This is a name sometimes used for exponential families, in honor of the authors of two of the pioneering papers on the topic. See Koopman (1936), and Darmois (1935), and also Pitman (1936).) Jack suggested that I recast the paper into the Koopman-Darmois setting. That suggestion had two objectives. One was the hope that viewing the problem from this general perspective would lead to a clearer understanding of its structure and perhaps a simpler and better proof. The other objective was the hope of generalizing the result from the binomial to other classes of distributions, for example the Poisson and the gamma. (The resulting manuscript appeared as Brown (1965).) These two objectives of clearer understanding and of possible generalization in statistical applications are the motivation for this monograph. Many if not most of the successful mathematical formulations of statistical questions involve specific exponential families of distributions such as the normal, the exponential and gamma, the beta, the binomial and the multinomial, the geometric and the negative binomial, and the Poisson among others. It is often informative and advantageous to view these mathematical formulations
from the perspective of general exponential families. These notes provide a systematic treatment of the analytic and probabilistic properties of exponential families. This treatment is constructed with a variety of statistical applications in mind. This basic theory appears in Chapters 1-3, 5, 6 and the first part of Chapter 7 (through Section 7.11). Chapter 4, the latter part of Chapter 7, and many of the examples and exercises elsewhere in the text develop selected statistical applications of the basic theory. Almost all the specific statistical applications presented here are within the area of statistical decision theory.
However, as suggested above,
the scope of application of exponential families is much wider yet. They are, for further example, a valuable tool in asymptotic statistical theory. The presentation of the basic theory here was designed to be also suitable for applications in this area. Exercises 2.19.1, 5.15.1-5.15.4 and 7.5.1-7.5.5 provide further background for some of these applications. Efron (1975) gives an elegant example of what can be done in this area. Some earlier treatments of the general topic have proved helpful to me and have influenced my presentation, both consciously and unconsciously. The most important of these is Barndorff-Nielsen (1978). The latter half of that book treats many of the same topics as the current monograph, although they are arranged differently and presented from a different point-of-view. Lehmann (1959) contains an early definitive treatment of some fundamental results such as Theorems 1.13, 2.2, 2.7 and 2.12. Rockafellar (1970) treats in great detail the duality theory which appears in Chapters 5 and 6. I found Johansen (1979) also to be useful, particularly in the preparation of Chapter 1. The first version of this monograph was prepared during a year's leave at the Technion, Haifa, and the second was prepared during a temporary appointment at the Hebrew University, Jerusalem.
I wish to express my gratitude to both
those institutions and especially to my colleagues in both departments for their hospitality, interest, and encouragement.
I also want to acknowledge
the support from the National Science Foundation which I received throughout the preparation of this manuscript. I am grateful to all the colleagues and students who have heard me lecture on the contents or have read versions of this monograph. Nearly all have made measurable, positive contributions. Among these I want to specially thank Richard Ellis, Jiunn Hwang, Iain Johnstone, John Marden, and Yossi Rinott who have particularly influenced specific portions of the text, Jim Berger who made numerous valuable suggestions, and above all Roger Farrell who carefully read and critically and constructively commented on the entire manuscript. The draft version of the index was prepared by Fu-Hsieng Hsieh. Finally, I want to thank the editor of this series, Shanti Gupta, for his gentle but persistent encouragement which made an important contribution to the completion of this monograph.
TABLE OF CONTENTS

CHAPTER 1. BASIC PROPERTIES
   Standard Exponential Families
   Marginal Distributions
   Reduction to a Minimal Family
   Random Samples
   Convexity Property
   Conditional Distributions
   Exercises

CHAPTER 2. ANALYTIC PROPERTIES
   Differentiability and Moments
   Formulas for Moments
   Analyticity
   Completeness
   Mutual Independence
   Continuity Theorem
   Total Positivity
   Partial Order Properties
   Exercises

CHAPTER 3. PARAMETRIZATIONS
   Steep Families
   Mean Value Parametrization
   Mixed Parametrization
   Differentiable Subfamilies
   Exercises

CHAPTER 4. APPLICATIONS
   Information Inequality
   Unbiased Estimates of the Risk
   Generalized Bayes Estimators of Canonical Parameters
   Generalized Bayes Estimators of Expectation Parameters; Conjugate Priors
   Exercises

CHAPTER 5. MAXIMUM LIKELIHOOD ESTIMATION
   Full Families
   Non-Full Families
   Convex Parameter Space
   Fundamental Equation
   Exercises

CHAPTER 6. THE DUAL TO THE MAXIMUM LIKELIHOOD ESTIMATOR
   Convex Duality
   Minimum Entropy Parameter
   Aggregate Exponential Families
   Exercises

CHAPTER 7. TAIL PROBABILITIES
   Fixed Parameter (Via Chebyshev's Inequality)
   Fixed Parameter (Via Kullback-Leibler Information)
   Fixed Reference Set
   Complete Class Theorems for Tests (Separated Hypotheses)
   Complete Class Theorems for Tests (Contiguous Hypotheses)
   Exercises

APPENDIX TO CHAPTER 4. POINTWISE LIMITS OF BAYES PROCEDURES

REFERENCES

INDEX
CHAPTER 1. BASIC PROPERTIES
STANDARD EXPONENTIAL FAMILIES

1.1 Definitions (Standard Exponential Family): Let ν be a σ-finite measure on the Borel subsets of R^k. Let

(1)  N = N_ν = {θ : ∫ e^{θ·x} ν(dx) < ∞} .

Let

(2)  λ(θ) = ∫ e^{θ·x} ν(dx)

(Define λ(θ) = ∞ if the integral in (2) is infinite.) Let ψ(θ) = log λ(θ), and define

(3)  p_θ(x) = exp(θ·x - ψ(θ)) ,   θ ∈ N .

Let Θ ⊆ N. The family of probability densities {p_θ : θ ∈ Θ} is called a k-dimensional standard exponential family (of probability densities). The associated distributions

P_θ(A) = ∫_A p_θ(x) ν(dx) ,   θ ∈ Θ

are also referred to as a standard exponential family (of probability distributions). N is called the natural parameter space. ψ has many names. We will call it the log Laplace transform (of ν) or the cumulant generating function. θ ∈ Θ is sometimes referred to as a canonical parameter, and
x ∈ X is sometimes called a canonical observation, or value of a canonical statistic. The family is called full if Θ = N. It is called regular if N is open, i.e. if N = N°, where N° denotes the interior of N, defined as int N = ∪{Q : Q ⊆ N, Q is open}. As customary, let the support of ν, supp ν, denote the minimal closed set S ⊆ R^k for which ν(S^comp) = 0. Let

(4)  H = convex hull (supp ν) = conhull (supp ν) ,

and let K = K_ν = H̄ (the closure of H). K is called the convex support of ν. (The convex hull of a set S ⊆ R^k is the set {y : ∃ {x_i} ⊆ S, {α_i}, 0 ≤ α_i, Σα_i = 1, with y = Σ α_i x_i}.)

For S ⊆ R^k the dimension of S, dim S, is the dimension of the linear space spanned by the set of vectors {(x_1 - x_2) : x_1, x_2 ∈ S}. A k-dimensional standard family is called minimal if

(5)  dim N = dim K = k .

Note that if K is compact then N = R^k, so that the family is regular.

(The exponential families described above can be called finite dimensional exponential families. Various writers have recently begun to investigate infinite dimensional generalizations. See Soler (1977), Mandelbaum (1983), and Lauritzen (1984) for some results and references.)

Standard exponential families abound in statistical applications. Often a reduction by sufficiency and reparametrization is, however, needed in order to recognize the standard exponential family hidden in specific settings. Here are two of the most fruitful examples.
1.2 Example (Normal samples): Let Y_1,...,Y_n be independent identically distributed normal variables with mean μ and variance σ². Thus, each Y_i has
density (relative to Lebesgue measure)

(1)  φ_{μ,σ²}(y) = (2πσ²)^{-1/2} exp(-(y - μ)²/2σ²)

and cumulative distribution function Φ_{μ,σ²}. Consider the statistics

Ȳ = n^{-1} Σ Y_i ,   S² = n^{-1} Σ (Y_i - Ȳ)² ,
X_1 = Ȳ ,   X_2 = n^{-1} Σ Y_i² = S² + Ȳ² .

The joint density of Y = (Y_1,...,Y_n) can be written in two distinct revealing ways, as

(2)  f_{μ,σ²}(y) = (2πσ²)^{-n/2} exp(-n s²/2σ² - n(ȳ - μ)²/2σ²) ,

or as

(3)  f_{μ,σ²}(y) = (2πσ²)^{-n/2} exp((nμ/σ²)x_1 + (-n/2σ²)x_2) exp(-nμ²/2σ²) .

From the first of these one sees that Ȳ and S² are sufficient statistics. (One can also derive from this expression that Ȳ and S² are independent (see Sections 2.14 - 2.15) with Ȳ being normal mean μ, variance σ²/n, and V = S² being (σ²/n)χ²_{n-1} -- i.e. having density

(4)  f(v) = (n/2σ²)^{m/2} (Γ(m/2))^{-1} v^{(m/2 - 1)} exp(-nv/2σ²) χ_{(0,∞)}(v)

with m = n-1.)

X = (X_1, X_2) is also sufficient. This can be seen from the factorization (3), or from the fact that X is a 1-1 function of (Ȳ, S²). Let ν denote the marginal measure on R² corresponding to X -- i.e. ν(A) = (2π)^{-n/2} ∫_{X^{-1}(A)} dy_1 ... dy_n. (It can be checked that when n ≥ 2, ν(dx) = n^{n/2} (2^{n/2} π^{1/2} Γ((n-1)/2))^{-1} (x_2 - x_1²)^{(n-3)/2} dx over the region K = {(x_1, x_2) : x_1² ≤ x_2}; when n = 1, ν is supported on the curve {(x_1, x_2) : x_1² = x_2}.) Then the density of X relative to ν is

(5)  p_{θ_1,θ_2}(x) = exp(θ_1 x_1 + θ_2 x_2 - ψ(θ))

with

θ_1 = nμ/σ² ,   θ_2 = -n/2σ² ,

and ψ(θ) = -θ_1²/4θ_2 - (n/2) log(-2θ_2/n). Thus the distributions of the sufficient statistic form a 2 dimensional exponential family with canonical parameters (θ_1, θ_2) related to the original parameters as above. This family is minimal. The natural parameter space is

N = {(θ_1, θ_2) : θ_1 ∈ R, θ_2 < 0} .
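The reparametrization in Example 1.2 is easy to verify numerically. The following short Python sketch is not part of the original text; the function names are ours. It checks that, for a fixed sample, log f_{μ,σ²}(y) - (θ·x - ψ(θ)) does not depend on (μ, σ²), which is exactly the exponential-family factorization (3)-(5) with θ = (nμ/σ², -n/2σ²).

```python
import numpy as np

def canonical_params(mu, sigma2, n):
    # theta1 = n*mu/sigma^2, theta2 = -n/(2*sigma^2), as in Example 1.2
    return np.array([n * mu / sigma2, -n / (2.0 * sigma2)])

def psi(theta, n):
    # psi(theta) = -theta1^2/(4*theta2) - (n/2)*log(-2*theta2/n)
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) - (n / 2.0) * np.log(-2.0 * t2 / n)

def log_joint_normal(y, mu, sigma2):
    # log of the joint N(mu, sigma2) density of the whole sample
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=10)
n = y.size
x = np.array([y.mean(), np.mean(y**2)])     # sufficient statistic X = (X1, X2)

for mu, sigma2 in [(0.0, 1.0), (1.5, 0.7), (-2.0, 3.0)]:
    theta = canonical_params(mu, sigma2, n)
    remainder = log_joint_normal(y, mu, sigma2) - (theta @ x - psi(theta, n))
    print(round(float(remainder), 10))      # same value for every (mu, sigma2)
```

The common remainder, -(n/2) log 2π, is the part of the joint density that does not involve the parameters and is absorbed into the dominating measure ν.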
The above can of course be generalized to multivariate normal distributions. See Example 1.14.

1.3 Example (Multinomial distribution): Let X = (X_1,...,X_k) be multinomial (N, π) -- that is

Pr{X = x} = (N!/(x_1! ⋯ x_k!)) Π π_i^{x_i} .

Let ν be the measure concentrated on the set {x : x_i ≥ 0 integers, i = 1,...,k, Σ_{i=1}^k x_i = N}, and given by

(1)  ν({x}) = N!/(x_1! ⋯ x_k!) .

Then the density of X relative to ν is

(2)  p_θ(x) = exp(Σ_{i=1}^k θ_i x_i - ψ(θ))

where

(3)  θ_i = log π_i ,   i = 1,...,k ,

and

(4)  ψ(θ) = N log(Σ_{i=1}^k e^{θ_i}) .

This is a k dimensional exponential family with canonical statistic X. Its canonical parameter is related to π by (3). It has parameter space

(5)  Θ = {(log π_i) : 0 < π_i, Σπ_i = 1} .
Note that this exponential family is not full. The full family has densities {p_θ} as above with Θ = N = R^k. (For Θ as in (5), ψ(θ) ≡ 0; however ψ as defined in (4), rather than ψ ≡ 0, is the appropriate cumulant generating function, as defined in 1.1(3), for the full family.) However

(6)  p_θ = p_{θ + a1}   for all a ∈ R ,

where 1' = (1,...,1). Hence expanding this family to be a full family does not introduce any new distributions.

The above phenomenon is related to the fact that the above family is not minimal, since dim K = k-1 < k. To reduce to a minimal family let X* ∈ R^{k-1} be given by X* = (X_1,...,X_{k-1}). Then X* is sufficient. (In fact, it is essentially equivalent to X since X_k = N - Σ_{i=1}^{k-1} X*_i a.e.(ν).) Let θ* ∈ R^{k-1} be given by θ*_i = θ_i - θ_k, and let ν*({x*}) = N!/(x*_1! ⋯ x*_{k-1}! (N - Σ_{i=1}^{k-1} x*_i)!). Then the density of X* relative to ν* is

(7)  p*_{θ*}(x*) = exp(Σ_{i=1}^{k-1} θ*_i x*_i - ψ*(θ*))

where

(8)  ψ*(θ*) = N log(1 + Σ_{i=1}^{k-1} e^{θ*_i}) .

This is a full minimal standard exponential family with N = R^{k-1}. Note that

(9)  π_i = exp(θ*_i)/(1 + Σ_{j=1}^{k-1} exp(θ*_j)) ,   i = 1,...,k-1 ,
     π_k = 1/(1 + Σ_{j=1}^{k-1} exp(θ*_j)) .

Here, each different θ* ∈ R^{k-1} = N corresponds to a different distribution.
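For the reduced multinomial family it is often convenient to pass back and forth between θ* and π using 1.3(9). Below is a minimal Python sketch (ours, not the book's) of the two maps, together with a check of the identity ψ*(θ*) = -N log π_k implied by (8) and (9).

```python
import numpy as np

def pi_from_theta_star(theta_star):
    # 1.3(9): pi_i = exp(theta*_i)/(1 + sum_j exp(theta*_j)); pi_k = 1/(1 + sum_j ...)
    e = np.exp(theta_star)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)

def theta_star_from_pi(pi):
    # inverse map, from 1.3(3): theta*_i = theta_i - theta_k = log(pi_i) - log(pi_k)
    return np.log(pi[:-1]) - np.log(pi[-1])

def psi_star(theta_star, N):
    # 1.3(8): psi*(theta*) = N * log(1 + sum_i exp(theta*_i))
    return N * np.log(1.0 + np.exp(theta_star).sum())

theta_star = np.array([0.5, -1.0, 2.0])        # k - 1 = 3 coordinates, so k = 4 cells
pi = pi_from_theta_star(theta_star)
print(pi, pi.sum())                            # a probability vector summing to 1
print(theta_star_from_pi(pi))                  # recovers theta_star exactly
print(psi_star(theta_star, N=10), -10 * np.log(pi[-1]))   # the two agree
```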
Reductions by reparametrization and sufficiency like those in the above examples are frequent in statistical applications. Together with proper choice of the dominating measure, ν, they lead to the representation of problems involving exponential families in terms of problems involving standard exponential families. This is formally explained in the next few paragraphs.

1.4 Definition: Let {F_ω : ω ∈ Ω} be a family of distributions on a probability space (Y, B). Suppose F_ω ≪ μ, ω ∈ Ω. Suppose there exist functions

C : Ω → (0, ∞)
R : Ω → R^k
T : Y → R^k     (Borel measurable)
h : Y → [0, ∞)     (Borel measurable)

such that

(1)  f_ω(y) = dF_ω/dμ (y) = C(ω) h(y) exp(R(ω) · T(y)) .

Then {F_ω} (or, {f_ω}) is called a k dimensional exponential family of distributions (or, of densities).
1.5 Proposition: Any k dimensional exponential family (1.4(1)) can be reduced by sufficiency, reparametrization, and proper choice of ν to a k dimensional standard exponential family (1.1(3)). The sufficient statistic is X = T(Y), and its distributions form an exponential family with canonical parameter θ = R(ω).

Proof: X = T(Y) is sufficient by virtue of 1.4(1) and the Neyman factorization theorem. (See e.g. Lehmann (1959), Chapter 2, Theorem 8.) Let μ*(dy) = h(y)μ(dy) and let ν(A) = μ*(T^{-1}(A)) for Borel measurable sets A ⊆ R^k. Then the induced densities of X with respect to ν exist and have the desired form 1.1(3) with θ = R(ω) and ψ(θ) = -log C(R^{-1}(θ)). (Note that if R(ω_1) = R(ω_2), then f_{ω_1} = f_{ω_2} and hence C(ω_1) = C(ω_2), so that ψ is well defined.)  ||
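As a concrete illustration of Proposition 1.5 we add the following sketch (it is not from the original text; the helper names are ours). The Binomial(n, p) family is already in the form 1.4(1) with T(y) = y, R(p) = log(p/(1-p)), h(y) = C(n, y) and C(p) = (1-p)^n, and the reduction produces the standard family with ν({x}) = C(n, x) and ψ(θ) = n log(1 + e^θ).

```python
from math import comb, exp, log

n = 7  # number of Bernoulli trials (fixed)

def binom_pmf(y, p):
    # Pr(Y = y) for Y ~ Binomial(n, p)
    return comb(n, y) * p**y * (1 - p) ** (n - y)

def standard_density(x, theta):
    # p_theta(x) = exp(theta*x - psi(theta)) relative to nu({x}) = comb(n, x)
    psi = n * log(1 + exp(theta))       # psi(theta) = -log C(p), with theta = logit(p)
    return exp(theta * x - psi)

p = 0.3
theta = log(p / (1 - p))                # theta = R(p), the canonical parameter
for y in range(n + 1):
    assert abs(standard_density(y, theta) * comb(n, y) - binom_pmf(y, p)) < 1e-12
print("binomial pmf recovered from the standard-form density")
```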
In spite of appearances the above reduction process is not really unique. Any standard exponential family can be transformed to a different, but equivalent, form by linearly transforming X and Θ with linked nonsingular affine transformations. This is described in the following proposition.

1.6 Proposition: Let {p_θ} be a k-dimensional standard exponential family. Let M be a non-singular k×k matrix and let

(1)  Z = MX + z_0 ,   φ = (M')^{-1} θ + φ_0 .

Then the distributions of Z also form a k-dimensional standard exponential family which is equivalent to the original family.

Proof: The equivalency assertion is immediate since the transformations (1) are 1-1. Furthermore, the density of Z relative to the measure ν_2 defined by ν_2(A) = ν(M^{-1}(A - z_0)) is

(2)  exp(θ'·x(z) - ψ(θ)) = exp((φ - φ_0)' M M^{-1} (z - z_0) - ψ(M'(φ - φ_0)))
     = exp(φ'·z - ψ(M'(φ - φ_0)) - φ'·z_0 - φ_0'·z + φ_0'·z_0) .

(By definition A - z_0 = {x : ∃ z ∈ A, x = z - z_0}.) Let ν_1(dz) = exp(-φ_0'·z) ν_2(dz) and ψ_1(φ) = ψ(M'(φ - φ_0)) + φ'·z_0 - φ_0'·z_0. The densities of Z relative to ν_1 are

(3)  exp{φ'·z - ψ_1(φ)} ,

which, as claimed, form a k parameter exponential family. The natural parameter space for this family is (M')^{-1} N + φ_0 and the cumulant generating function is ψ_1.  ||
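The key algebraic fact behind Proposition 1.6 is that θ'·x = (φ - φ_0)'·(z - z_0) when X and θ are transformed by the linked maps 1.6(1). A quick numerical check (again ours, with arbitrary test values) in Python:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
M = rng.normal(size=(k, k))        # a generic (almost surely non-singular) k x k matrix
z0, phi0 = rng.normal(size=k), rng.normal(size=k)
x, theta = rng.normal(size=k), rng.normal(size=k)

# linked affine transformations 1.6(1)
z = M @ x + z0
phi = np.linalg.inv(M.T) @ theta + phi0

print(np.isclose(theta @ x, (phi - phi0) @ (z - z0)))   # True: the exponents match
```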
Proposition 1.6 shows that one may apply an arbitrary affine transformation either to Θ or to X.
In this way one may assume without loss of
generality that Θ (or X) lies in a convenient position in R^k.
One application
of this process w i l l be discussed at some length in Section 3.11, and such transformations w i l l be used wherever convenient.
MARGINAL DISTRIBUTIONS The proof of Proposition 1.6 yields a statement about marginal distributions generated under linear projections by standard exponential families. The result is important in its own right, and useful in the proof of Theorem 1.8, as well. Some preliminary remarks will be helpful. Let M- : R
-». R m be a
linear map. M, is represented by an (mxk) matrix, M-., of rank m. There is then -* R " m which is orthogonally complementary to M 1 --
a linear map M 2 : R
that is, the rows of the corresponding ((k-m)χk) matrix, Mp, of rank (k-m) are orthogonal to those of M- . (The rows of M 2 can be chosen to be orthonormal, but that is not necessary here.) Let M denote the (kxk) nonsingular matrix M
Z
l kκ l mm M = (*) . If x e R then Z = Mx can be written as (/) with 1 € Λ M2 Z2 1 R , Z
Q
ς- pIΠ~K V_
Γ\
.
M
Let M = ( M ) as defined above. M
Then M
written as
(1)
1
M" = (M~, M")
where M" is (kxm), M^ is (kχ(k-m)) and
exists and can be
BASIC PROPERTIES (M") 1 M^ = 0
(2)
since M 1 and M 2 are orthogonally complementary. Let θ € Rk and φ = ( M " 1 ) ^ = ( * )θ = Γ 1 ) . Then /M"\ Φo Y (M 2 ) 2 (3)
θ'x =
θ'M^Mx
" φiZl+φ2Z2 by (2). For typographical reasons let M!" = (NT) 1 . The special case where M 1 (x 1 > ...,x k ) = (*i» ••»*,„) i s noting.
worth
Here
<4>
M
l " < W °mχ(k-m))=
M
2
= <0(k-m)χm
M
Γ
^k-mMk-m)*
=
M
2~
Somewhat more generally, if the rows of M. and M 2 are orthonormal then M
l =MΓ
M2
1.7
=
M2~
Theorem: Consider a standard exponential family. Φ Ί
Let M, : R
•%
R m and
I
θ = M'( ) as described above.
Fix φ° e M^"(W) c R k " m . Consider the
family of distributions of Z^ = M-X over the parameter space Φ.o = M'"({θ € Θ : M'"θ = φϊ}) . These form an m dimensional standard Φp * ^ ^ exponential family generated by the marginal measure defined by (5)
v 0 (A) Φ
= / , MiΔ^
exp(φϊ M x)v(dx) . t c
Z 1 The natural parameter space for this family is W Φo
= M'"({θ e N : Mi θ = φi}
The statistic 1, is sufficient for the family of densities {p θ (x) : M 2 - e = φ°} .
10
STATISTICAL EXPONENTIAL FAMILIES
Proof:
A direct proof is as easy as an appeal to Proposition 1.6.
The density
of Z relative to the appropriate dominating measure v(M~ •) is (6)
exp(θ'x - ψ(θ))
= exp(φχ
When Φ2 = Φ2 the factor exp(φ° measure, y i e l d i n g v
zχ + Φ2
z 2 - ψ(M'φ)) .
z 2 ) can be absorbed i n t o the dominating
(•) as defined in ( 5 ) .
The r e s u l t i n g family of densities
Φ
2 is the standard exponential family claimed in the statement of the theorem. (Note that (6) also provides a formula for the cumulant generating function of this family.) The assertions concerning W,o and sufficiency follow from ( 6 ) , Φ 2 with Φ2 = Φp > and the Neyman f a c t o r i z a t i o n theorem. || For the special case where IIL is as described i n ( 4 ) , one sees that for fixed exponential
θ
'θm
u+i»
tlΊe
d i s t r i b u t i o n s of Z, = ( X 1 5 . . . , X ι J form an
family. Note that the theorem does not say that the family of d i s t r i b u t i o n s
of 1. = HΛ form a standard exponential family with natural parameter φ, i f the parameter θ ranges over all
of 0 .
In f a c t such a claim is generally
false unless Θ is of dimension <m and s a t i s f i e s (7)
Θ c
{θ : M^'θ = φ°}
for some φ° € Rk~m ,
as w i l l be the case in Theorem 1.9; or (8)
Z,
and
Z2
are independent for some
θ € Θ .
( I t w i l l be seen i n the next chapter that (8) implies independence of Z, and Z 2 for a l l θ € Θ .) (8)
Remark.
The preceding theorem may be given an a l t e r n a t i v e i n t e r p r e t a t i o n .
Let L be a linear variety i n R -- that is L = x Q + V f o r some m dimensional k k linear subspace V c R . Let P : R -> L be any a f f i n e projection onto L -2 that i s , P is a f f i n e , P = P, and P is the i d e n t i t y on L. Let Q denote the orthogonal projection onto V
= {w € R
:
v'w = 0
v
v € V} .
Let
BASIC PROPERTIES
θ
(2)
€ V
Then
" '
{θ € N :
the
famll
of
y
11
d i s t r i b u t i o n s of P(X) as θ ranges over
Qθ = θ/o\ί forms an exponential
family.
To v e r i f y the above, note that there are l i n e a r isometries Sχ
:
Rm £1
L
Rk'm
Sz :
onto
V1
^
.
onto
The theorem applies to the maps M. = S"
° P ,
ϊlL = S i
statement concerning the d i s t r i b u t i o n s of M.,(X) .
° Q , and y i e l d s a
This converts d i r e c t l y to
the above statement about the d i s t r i b u t i o n s of P(X) = S ^ M ^ X ) ) over the appropriate parameter space since S, is a l i n e a r isometry, and S-. and S« are orthogonal, e t c . 1.8
EXAMPLE
(Log-linear models):
described i n Example 1.3.
Consider a multinomial (N, π) variable as
Consider the family of d i s t r i b u t i o n s f o r which the
natural parameter 1.3(3) s a t i s f i e s (1)
θ =
3 e Rm
BB + ΘQ ,
where B is a specified kxm matrix of rank m. (2)
B =
Assume, i n a d d i t i o n , that
( l k , B(2))
where 1/ = ( 1 , . . . , 1 ) and B/2\ is k x (m-1) of rank (m-1). multinomial
model.
This is a
log-linear
The name derives from the f a c t that the l i n e a r constraint
(1) can also be w r i t t e n i n the form log π = B$ where (log π). = log π. , i=l,...,k
.
Condition (2) is imposed because PQ = P f t . a 1 , as noted in 1.3(6). u+a i
Ό
Because of (2) f o r every $\^\
=
e
^2""' rJ
t h e r e
i s
d
u n i c
1ue ^
=
^1^(2)^
such that k Σ π.
(3)
Ί
=
k θ, Σ e Ί
=
1 .
Let NL = B1 and l e t M = ( M ) as i n 1.7. 1 Np Z
(l)
=
M
1
X=
B
'X
i s
a
s u f f Ί c i e n t
statistic.
Theorem 1.7 y i e l d s that
The d i s t r i b u t i o n s of Z , ^
m-dimensional exponential family with corresponding natural parameter
form an
12
STATISTICAL EXPONENTIAL FAMILIES
Mj'θ = β + B"Θ Q .
This family is not minimal since U ^ N ^ = N w.p. 1 .
As i n Example 1.3 one may reduce to an equivalent minimal family having dimension (m-1) and canonical s t a t i s t i c Z*-x = B/o\X
=
( ^ / , \ o9'"9^(l)
m^' '
Here is a famous l o g - l i n e a r model a r i s i n g i n genetics.
Suppose a
parent population contains a l l e l e s G,g a t a c e r t a i n locus, with frequency p,q = 1-p , respectively.
Under the assumptions of random mating and no
selection a generation of N offspring w i l l have genotypes GG, Gg, gg according to a multinomial d i s t r i b u t i o n with π given by
(4)
TΓj
=
p
,
π2
=
π3 = q 2 .
2pq ,
Such a multinomial d i s t r i b u t i o n is called a Hardy-Weinberg
distribution.
This corresponds to a l o g - l i n e a r model with
l
(5)
Thus, Z/.x
B =
= (2
N +
1
2
{ ° \
\
1
θ
) is a sufficient
log-linear family, and z?,x
=
statistic
log
2)
.
for the distributions of this
= 2x, + x ? is a minimal sufficient
statistic.
(This log-linear family can be imbedded in a useful way in the original multinomial family as follows: Let
(6)
Then
1
M"
/5/12 = I 1/6 \-l/12
-1/12 1/6 5/12
-l/2\ 1 ) = (M~, M") . -1/2/
Let ΦQ = ( 0 , 0 , -In2) and z^ = ( 0 , 0 , !j). Z = MX + z 0 is the canonical s t a t i s t i c
According to Proposition 1.6 for an exponential family with
corresponding canonical parameter φ = (M~ )'θ + φQ .
In terms of the original
BASIC PROPERTIES
variables z 1 = 2x^ + x 2 , etc.
13
z 2 = 2x 3 + x^9 z^ = x 2 ,
and Φ3 = (^)log(π|/4π-π 3 ) f
The log-linear family described above is therefore the family of marginal
distributions of ( z . , z«) under the r e s t r i c t i o n φ~ = 0 .
The family of
distributions corresponding to the restriction Φ3 = φ° t 0 also has a natural genetic interpretation as the distribution of a population a f t e r variable selection of genotypes.
See Barndorff-Nielsen (1978, p.123); the
generalization of this model to a m u l t i a l l e l i c locus is also described there.)
REDUCTION TO A MINIMAL FAMILY Any exponential family which is not minimal can be reduced ϊo a minimal standard family through sufficiency, reparametrization, and proper choice of v. This involves only a minor extension of the process used above in Proposition 1.5 and Theorem 1.7. This reduction is unique up to the appearance of linked affine transformations as in Proposition 1.6. Here are the details. 1.9 Theorem Any k dimensional exponential family can be reduced by sufficiency, reparametrization, and proper choice of v to an m dimensional minimal standard exponential family, for some m
Let X,θ and Z,φ denote the canonical
s t a t i s t i c and parameter for two such reductions to an m. and an m2 dimensional minimal family, respectively.
Then m, = m2 and (X,θ), (Z,φ) are related as in
1.6(1). Proof.
The reduction to a minimal standard family w i l l be performed in three
steps.
F i r s t , one may apply Proposition 1.5 to reduce to a standard k
dimensional
family. Suppose for this family that dim Θ = m1 < k.
where V is an m'-dimensional
linear subspace.
Thus θ c θ
Q
+V
One may l e t P be the orthogonal
projection on V and M,, M2 the corresponding orthonormal matrices described above in Theorem 1.7.
Then M2Θ = φ£ , a constant vector.
By Theorem 1.7,
Z 1 = M-X is s u f f i c i e n t , and i t s distributions form a standard exponential family, whose parameter space has dimension m1.
14
STATISTICAL EXPONENTIAL FAMILIES Thus it now suffices to consider the case of a standard m 1 dimension-
al exponential family whose parameter space also has dimension m 1 . Suppose for this family that dim K = m < m' . Then K c x + V , similar to the previous situation. as above.
Let P be the orthogonal projection on V, and NL, NL
Observe that
(1)
θ x = Θ'MJ M χ x + Θ ' M £ M 2 x a.e.v
I t follows that Z. = NLX is a sufficient statistic whose distributions form a standard exponential family with natural parameter M,θ.
(Actually Z is not
merely sufficient, but is actually equivalent to X under v.)
Since
dim (M..K) = dim (M-Θ) = m this family is the desired minimal family formed from the original family through reduction by sufficiency and reparametrization. Suppose {p
: ω € Ω} is a standard k dimensional exponential family
relative to v, and (X,θ), (Z,φ) denote the canonical statistics and parameters for two reductions of {p } to a minimal standard exponential family.
For the
next step l e t P^ ' , P; ' denote their respective probability distributions σ ψ with dimensions m, and nu respectively, etc..
Let ω Q € Ω. Since X and Z
are each sufficient
dP
^
9\
Jtf θϊ ) p
Jf
o
Now, HD(D
= exp(((θ(ω) - θ(ω Q )) and similarly for p' ' .
Hence (4) yields
x-
ae(v)
BASIC PROPERTIES (5)
(θ(ω) - θ(ω Q J)
15
x(y) - U ( 1 ) (θ(ω))
= (Φ(ω) - φ(ω Q ))
z(y) - U ( 2 ) (φ(ω))
a.e. (v)
for all ω € Ω. Suppose m = m 1 < m,,. Since dim {φ(ω) : ω e Ω} = m 2 > m there m+1 exist values α. € R, ω. e Ω, i = l,...,m+l, such that 0 = Σ α.(θ(ω ) - θ(ω n )) m+1 and φ* = Σ α.(φ(ω.) - Φ(ω Q )) f 0. It follows from (5) that (6) But,
φ* (6) implies /C c {z : φ*
z(y) = const
a.e. (v)
z = const} so that dim K^ < m^.
This
contradicts the f a c t that the d i s t r i b u t i o n s of Z form a minimal standard family of dimension nip.
Hence πu = πu = m.
Now choose ω 1 , . . . , ω m so that {θ(ω.j) - θ(ωQ) : m
R.
The preceding argument shows that {φ(ω.j) - φ(ωQ) : m
also span R . φίω^
i =l , . . . , m }
i = l , . . . , m } must
Let M, non-singular, be chosen so that - φ(ωQ)
=
( M ' J ' ^ θ ί ω ^ - θ(ω Q ))
i=l,...,m
.
Then, as i n 1.6(3),
(7)
(θ(ω.) - θ(ω Q ))
x(y) - U ( θ ( ω i ) )
=
(Φ(ω.) - φ(ω Q ))
Mx(y) - U(φ(ω-))
=
(Φ(ω.) - φ(ω Q ))
z(y) - U(φ(ω.))
Let y Q e K be a value for which (7) is valid for i=l,...,m. yields
span
a.e.
(v)
Then (7)
16
(8)
STATISTICAL EXPONENTIAL FAMILIES
(φ(ω.) - φ(ω Q )) =
M(x(y) - x(y Q ))
(φ(ωΊ.) - φ(ω Q ))
(z(y) - z(y Q ))
a.e. (v) i = l,...,m
This implies M(x(y) - x(y Q )) = z(y) - z(y Q ); which verifies 1.6(1) with zo=z(yQ).
1.10
||
Definition Let {p_} be a k-dimensional exponential family. u
Theorem 1.9
shows that there is a unique value, m, such that {pθ> can be reduced to a minimal exponential family of dimension m. This value is called the order of the family p. I f {p Ω } is a standard family i t is clear that its order m Ό
satisfies (1)
m <_ min(dim Θ, dim K)
In most cases equality holds in (1); however, i t is possible to have inequality, even when {p n } is f u l l .
u In view of Theorem 1.9 there is no loss of generality in confining oneself to the study of minimal standard exponential families. A full minimal standard exponential family is also called a canonical exponential family. RANDOM SAMPLES A nearly trivial but very important application of the first part of Theorem 1.9 involves independent identically distributed (i.i.d.) observations from an exponential family.
BASIC PROPERTIES
17
1.11 Theorem Let X l s . . . , X n be i . i . d . observations from some k-dimensional standard exponential family with natural parameter space M and convex support K. Then S =
n Σ X. is a s u f f i c i e n t s t a t i s t i c . i =l Ί
standard k-dimensional
The distributions of S form a
family with natural parameter space W and convex support
n/( = {s : 3 x e K, s = nx} .
The order of the families corresponding to S and
to X. are equal. Proof:
The j o i n t density of X . , . . . , X n = exp( . Σ (θ =1
Pϋf l ( Xii . - . - . x nn)
= Hence X , , . . . , X
n exp( Σ ( θ . i=l Ί
with respect to vx ... xv is x.i - ψ ( θ ) ) )
x. - ψ ( θ . ) ) ) Ί
Ί
with
θ. M Ί
are canonical s t a t i s t i c s from an nk-dimensional
.
exponential
nk family whose parameter space s a t i s f i e s Θ = { ( θ 1 , . . . , θ n ) € R
k : θ^ • θ € R } .
Applying Theorem 1.7 yields that S is s u f f i c i e n t and comes from a standard k-dimensional family with natural parameter space N and convex support nK. (All this is also obvious from the f a c t that n pβ(xΊ,...,xn) = exp(θ Σ x_. - nψ(θ)) u
1
Π
_ i
.)
I
It is easily checked that any linear map which transforms the distributions of X. to a minimal family also transforms those of S to one, and conversely. This yields the assertion concerning the order of the families corresponding to S and X. .
||
Note that the cumulant generating function for the exponential family generated by S is (1)
nψ(θ) . The sufficient statistic X = n S also has distributions from an
exponential family. (Apply Theorem 1.6.) Here, the natural parameter space
18
STATISTICAL EXPONENTIAL FAMILIES
is nW and the convex support is K. The cumulant generating function for X corresponding to the point Φ = nθ in its natural parameter space is (2)
nψ(Φ/n) . (Under appropriate additional conditions a family of distributions
for which there is a nontrivial sufficient statistic based on a sample of size n must be an exponential family. See Dynkin (1951) and Hipp (1974).) 1.12 Examples Example 1.2 displays an instance of t h i s theorem.
I f Y is normal
with mean μ and variance σ 2 then X = (Y, Y ) i s the canonical s t a t i s t i c of a minimal standard exponential family having canonical parameter θ = (μ/σ 2 , -l/2σ 2 ) .
Thus i f one has i . i . d . observations Yp
Ύη
t h e n
n n n p S = Σ X. = ( Σ Y., Σ YT) is a sufficient statistic; and its distributions Ί Ί i= l i = l i = lΊ form a minimal standard exponential family. As another example, suppose Y is a member of the gamma family with unknown index, α, and scale, σ. The density of Y relative to Lebesque measure on (0, °°) is
(l)
f(y) = (^ΓίcOΓV 01 " 11 e" y/σ ,
We w i l l use the notation Y ~ Γ(α, σ) .
y >o .
Note that Γ(m/2, 2 ) = x * .
These
d i s t r i b u t i o n s form a two-dimensional exponential family with canonical s t a t i s t i c (Y, In Y) and canonical parameters (-1/σ, α ) . w i t h density (1) then S1Ί =
n Σ Y
1=1
1
and
Sά9 =
n Σ In Y. Ί
i=l
If
γ
i>
>γn
a r e
i i d.
form a two-dimensional
exponential family. It is interesting to note that the marginal distribution of Sj/n also has a density of the form (1) with index nα and scale nσ . (Here, as well as in the preceding normal example, S. is strongly reproductive in the terminology of Barndorff-Nielsen and Blaesild (1983b). For more details see Theorem 2.14 and Example 2.15.) Another example of interest is provided by the Poisson distribution; where Y has probability function
BASIC PROPERTIES (2)
y
Pr{Y = y }
λ
= λ e~ /y!
We w i l l use the notation Y ~ P(λ) .
19 y=O,ls...
Then X = Y comes from a one-dimensional
exponential family with canonical parameter θ = In λ. n S = Σ Y. Ί i l
The d i s t r i b u t i o n of
is i t s e l f Poisson with natural parameter θ+ In n = In nλ .
CONVEXITY PROPERTY Here is an important fundamental fact about exponential families. 1.13 Theorem (i) (ii) (iii)
N is a convex set and ψ is convex on M. ψ i s lower semi-continuous on R and i s continuous on N°. PQ θ
(1)
= PΩ Θ
l
i f and only i f 2
Ψ(αθ1 + (1 - α ) θ 2 )
for some 0 < α < 1. (iv)
= oίφίθj) + (1 - α)φ(θ 2 )
In this case (1) is then valid for a l l 0 <_ α <_ 1.
I f dim K = k (in particular, i f {p Q } is minimal) then ψ is
s t r i c t l y convex on W, and PΩ θ
l
t V Θ
for any θΊ f θ 9 € W. 2
0 < α < 1.
ά
'
Woof:
Let θ,, θ 2 e W,
Then by Holder's inequality
(2)
exp(ψ(αθ1 + (1 - α)θ 2 )) = /expίtαθj + (1 - α)θ 2 )
x)v(dx)
= /(exp Θ.
x) α
1
x)v(dx)) α (/exp(θ2
(/exp(θ1
(exp θ 2
x ) ( 1 " α ) v(dx) x)v(dx)) ( 1 " α )
ί θ ^ + (1 - α)ψ(θ 2 )) This proves the convexity of ψ, and the convexity of hi follows easily. There is s t r i c t inequality in (1) unless (3)
θj
x
= θ2
x + K
(a.e. (v))
for some constant K; in which case there is equality.
(3) is equivalent to
20 θ e
STATISTICAL EXPONENTIAL FAMILIES i
# x
=
κ e
θ e
2#x
a . e . ( v ) which i s equivalent to the assertion Pfi
1
= PB
2
.
If (3) holds for some θ, ί θ 2 then dim K <_ k - 1. Hence dim K = k implies P Q t P Q for any Θ Ί + θ 9 e W. θ
Θ
l
1
2
ά
Finally, for the continuity assertions, note f i r s t that λ(θ)
= /exp(θ x)v(dx) is lower semi-continuous by Fatou's lemma.
lower semi-continuous.
Any convex function defined and f i n i t e on a convex set
W of R must be continuous on W°. sets.)
Hence ψ is
(We leave this as an exercise on convex
|| Be careful about the above result -- the fact that ψ is s t r i c t l y
convex on hi does not imply that hi is s t r i c t l y convex; for a simple example, see Example 1.2 which involves a minimal family for which
W = ί(θΊ> θ 9 ) : Θ Ί € R, θ 9 < 0} XL.
X
L.
Usually ψ is continuous on all of hi. However examples can be constructed when k >_ 2 where this is not the case. This simple theorem has an interesting direct application. 1.14
Example Let Y be m-variate normal with mean μ and covariance matrix %.
We w i l l use the notation Y - N(μ, t).
Also, ό , = 1 i f i = j and = 0 i f ij^j . •u
The density of Y with respect to Lebesgue measure is (1)
Φ y j Z ( y ) = (2πΓ m / 2 |2Γ*exp(tr(-*- Ί (y - v){y - y)72)) =
(2π)" m ^ 2 |?|"^exp((?" 1 μ)
y + tr((-£ /2)(yy')) - μ'jΓ μ/2)
It follows that the distributions of Y form an (m + m(m+l)/2) dimensional exponential family with canonical statistics Yj,
,Y m , {YΊ-Y ./(I + δ..): i <_ j}
and corresponding canonical parameters (?" μ),,...,(?" μ) > ί(-Z" ).Ί : i £ j} . For the following it is convenient to label these statistics X..,...,X , {X . : i <. j} and the corresponding parameters as (θ,,...,θ , {θ. : i <_ j )}. Write θ = (θ,,...,θ ), S£= (θ. Ί ). x
Πl
Ignoring the factor (2π)" m / ' 2 , which can be
1J
absorbed into the measure v, the cumulant generating function is
BASIC PROPERTIES
(2)
Ψ( ) =
(-^JioglZ"1) + (μ'Z"1μ)/2
=
21
(-*s)log( |-£|) - θ'βθ/2
.
Note that M = {(θ, {θ.. : i <_ j}) : -£ is positive definite} . It is easy to check that N is open, so that this family is regular. (3)
By Theorem 1.12
ψ(0, {θ j : i < j}) =
is s t r i c t l y convex in the variables { θ . . : i <_ j } over the set where Q. is positive definite.
To reinterpret this result s l i g h t l y , let B = -Q.
then (3) yields that (4)
log |B|
is s t r i c t l y concave
as a function of the variables {b. . : i <_ j } over the region where B is *J
positive definite.
(4) yields IB' 1 ]
(5)
=
|BI" 1
is s t r i c t l y convex
((4) can also be proven by directly calculating k+1 the r e s u l t i n g ( « )
. log|B|, and showing
k+1 x
( o ) matrix i s p o s i t i v e d e f i n i t e .
The above proof
is much simpler !) CONDITIONAL DISTRIBUTIONS k Let v be a given σ - f i n i t e measure on the Borel subsets of R , and P «v
a p r o b a b i l i t y measure with density p.
generality) that 0 € W so that v is f i n i t e . M^x) = M-x.
Then the conditional
w i l l be denoted by v( |NLX = z j
Assume (without loss of
measure of v given z, = M,(X) e x i s t s .
or v( | z j .
given M,(X) exists and has density proportional over the set {x : ΛMX) = z-} . is any Borel measurable f u n c t i o n .
Mj : Rk -> Rm be l i n e a r ,
Let
The conditional
distribution
to p( ) r e l a t i v e to vί
lzj)
(More generally these facts are true i f Λ^1 See, f o r example, Neveu
(1965).)
The above s i t u a t i o n resembles that described i n 1.7. M2 : Rk -> Rk"m be an orthogonal complement of My M2 :
It
ίx : Mj(x) = z χ } -* R k " m
Then
Let
of P
22
STATISTICAL EXPONENTIAL FAMILIES
is 1 - 1. We will also use the symbol v( |z.) for the equivalent conditional distribution of M 2 (X) given M χ (X) = Zy
As before,
Φ = M'-lθ = ( M M θ - (φl) It is always possible to choose M 2 to be "orthonormal" so that Ml = M' ,
and so
Mλ" = M« .
To do so simplifies somewhat the resulting formulae. 1.15 Theorem The d i s t r i b u t i o n of Z ? = NLX given Z- = M.X depends only on φ,p\
= M'"θ .
For fixed Z. = z. these d i s t r i b u t i o n s form the (k-m) dimensional
exponential family generated by the measure defined by v( |z,) . Let W z
family.
denote the natural parameter space of t h i s conditional
i
Then Φ2 € M«"W implies Φ
(1)
€ M
2
a e
MX
(v)
Furthermore, if {p } is regular then θ
(2)
Proof:
M^~N c
hlM
a.e.(v)
χ
.
The conditional density of Z2 given Z1 = z- is proportional to '
z
l
+
Φ
2 '
Z
2 "
Hence the density of Zp given Z1 = z, r e l a t i v e to v( |z..) can be w r i t t e n as (3)
pφ(z2)
= exp(φ 2
z2 - ψz ( φ 2 ) )
where (4)
ψ z (Φ 2 )
=
ln(/exp(φ 2
The natural parameter space W
z
i
z2)v(dz2|z1))
.
is the s e t { φ ? } , f o r which the ά
integral on the right of (4) is finite. Let Φ 2 € M 2 "W . There is thus a θ €
BASIC PROPERTIES for which φ 2 = M2~θ . v*(A) = v ( M ^ ( A ) ) .
°° > /exp(θ
23
Let v* denote the marginal measure on Rm defined by
Then
x)v(dx) = /{/exptφj
zχ + φ 2
z 2 )v(dz 2 |z 1 )}v*(dz 1 ) .
oo > /exp(φ 2
z 2 )v(dz 2 |z 1 )
Hence
for
almost every z..(v*) .
This v e r i f i e s ( 1 ) .
Suppose {p Q } is regular. dense subset of W.
Let { θ i : i = l , . . . , } cW
{M2~ θ Ί : i = l , . . . } is dense i n M'~N .
b e a countable
Nl'~ is a l i n e a r map.
Hence M2"M is convex and open since W is convex (by Theorem 1.13) and open (by assumption). I t follows t h a t (5)
conhull
{M£~ θ. : i = l , . . .
}
=
MιfN
.
(We leave (5) as an exercise on convex sets.) Since { θ . } is countable i t follows from (1) that M2~ θ..
c
WM
χ
for a l l
i =l , . . . ,
a.e.(v)
.
Thus M£"N since NL
χ
=
conhull ίM£" θ. : i = l , . . . } c
is convex; which proves ( 2 ) .
^
χ
a.e.(v) ,
||
The above r e s u l t can be given an alternate i n t e r p r e t a t i o n under which the conditional d i s t r i b u t i o n s of X given X ε L form an exponential f a m i l y , for
L a given linear variety i n R .
See 1.7(8).
We omit the d e t a i l s .
Here are two important simple applications of the above ideas. 1.16
Example Let X 1 9 ...,X. be independent Poisson variables with expectations
λ.
.
See 1.12(2).
Then X = ( X 1 f . . . , X j is the canonical s t a t i s t i c of a
standard exponential family with natural parameter θ: θ Ί = In λ^ k The dominating measure has v ( { x } ) = 1/ Π x , ! . i =l Ί
Let
i =l , . . . 9 k .
N > 0 be an integer.
24
STATISTICAL EXPONENTIAL FAMILIES
k Then the distributions of X given Σ1 X. = N form a standard exponential family i=l with dominating measure k
(1)
n
k
v ( { x } | Σ x . = N) = 1 / Π x i !,
for
Σ x. = N .
This measure is proportional to the measure 1.3(1) which generates the multinomial distribution.
Hence the conditional distribution is multinomial (N,π).
The value of π can be easily computed as follows:
orthogonally
project onto {θ: Σθ. = 0} which is the linear subspace parallel to {x : Σxi = N} .
This yields (θ - θl)
(where θ = k"
1
Σ θ 1 ) as the natural
parameter of the conditional multinomial distribution. Thus
with c = ( Σ e Ί'~ )
.
Substituting θ. = In λ
(2) 1.17
k = λ./ Σ λ Ί
πΊ
yields
.
Example Let X be k-variate normal with mean μ and covariance %, For t
given the distributions of X form a standard exponential family with natural parameter θ = ϊ~ μ.
(This can easily be checked directly or derived
from Example 1.14 by using Theorem 1.7.)
The dominating measure for this
family is proportional to v(dx) = exp(-x'Z~ x/2)dx. Let z, = ( x , , . . . , x ),
z 2 = (x + , , . . . , x . ).
The conditional
distributions of Zp given Z, = z, form an exponential family. parameter for this family is just Φ2 = (θ
+1
, . . . , θ . )'.
Partition t as
(1) Then
t
= Q11
12 t
)
with
i n ( m x m) , etc.
The natural
BASIC PROPERTIES
, 1
(2)
I'
Z
Z:
Z
/ * 1 1 ' 12 22 21^ = ( - l - l -l -^tZ2-t2ltntl2) tntn
25
Z
Z
Z
Z
2:
Z:
~ 11 12^ '22" 21 11 12^ > -i .1 ) Z Z Z; (?22" 21 11 12^
((2) is a general formula for block symmetric positive definite matrices. Note 12
that I
1
= -?ϊ}?12(222 " V ' l l ^ '
1
=
Z
Z
(Z
Z
Z
Ϊ
)
]
" 22 21 11 " 1 2 2 2 2 1 ^ '
N o t e
t h a t
the natural parameter can be written as
where
Consider the case where z. = 0. The conditional dominating measure is v(dz2|0)
= c e x p ( - z ^ 2 2 z2/2)
and is thus a normal density with mean 0, variance-covariance ( Z 2 2 ) " 1 = Z 2 2 - ^21^11^12
= Z
* 's a y '
l t
f o l 1 o w s
t h a t
t h e
conditional density
of Z2 given Z, = 0 is normal with this covariance matrix and with mean μ* given by t*~\* = Φ2 , since φ ? must be the value of the natural parameter for both the unconditional and conditional family. (3)
μ* = 2*φ2
Hence
= 2*(221μ(l) + Z22μ(2))
= ^21Z"lίμ(l)
+ μ
(2)
'
For z 1 t 0 i t i s convenient to use the location invariance o f the normal family.
The conditional d i s t r i b u t i o n under (μ,Z) o f l,^) given l,^ =
^(1) " z f 1 ) is the same as the conditional d i s t r i b u t i o n under ( ( ,, h t) o f Z(Z) m μ (2) given l,,\ = 0. By the preceding this is normal with covariance matrix Z* = ( Z 2 2 ) " 1 and mean μ ^ - ^ l ^ ϊ l ^ d )
" Z ( l ) ^'
26
STATISTICAL EXPONENTIAL FAMILIES
EXERCISES 1.1.1
(a) Let C be any closed convex set in R . Show that there oo
exists a standard exponential family with M = C.
[C = n { θ : v. Ί i =l
θ < c,} Ί
with ||v.|| = 1 . Let v. denote Lebesgue measure on the ray {x: x = α v . , α > 0} 1 1 00
and l e t v =
.
Λ
Σ 2" 1 exp(c.v. Ί Ί i=l
x)v./(l+||x|I )• The result is also t r u e , but Ί
harder to prove, i f C is an open convex s e t . ] (b)
Let C = {(Qv
θ 2 ) : ||θ|| 2 < 1} U { ( 0 , 1)}
and show there
exists an exponential family with hi = C. 1,2.1
Verify 1.2(5) (including the formula for v which precedes i t ) . Note
that when n = 1 the measure v can be described by the relations x ? = x-, and v(dx 2 ) = dXj/ZZ? . 1.7.1
(i)
Let Z = MX as in Theorem 1.7.
Show that Z 1 is independent of
Zp for some θ e Θ i f and only i f 1, is independent of Zp for a l l θ € Θ. (ii)
Give an example to show that the assertion is false i f Z., 1^
are non-linear transformations of X. [ ( i ) Assume independence at θ = 0. (ii)
Let X be bivariate normal with mean μ and covariance I , and Z, = ||x||,
Z2 = t a r f ^ x g / x ^ . ] 1.7.2 {?'• *1
Consider the s i t u a t i o n of Theorem 1.7. θ 6 W
e
φ
M'φ° = θ.
i s
is f u l l and minimal. f u l l #
l t
i s
mΊrΊΊmal
i f
Suppose the original family
Then the family of distributions of 1Λ for a n d
onΊ
y
i f
there is a θ e i n t W with
[For a situation where the family of distributions of 1, is not
minimal use Exercise 1 . 1 . l ( b ) , l e t M be as i n ( 4 ) , and l e t φ£ = 1.] 1.7.3
(a)
Show that i f
1.7(7) or (8) are s a t i s f i e d then the d i s t r i -
butions of Z, = M-X form a standard exponential family with natural parameter
ΦΓ (b)
Give an example to show that the distributions of Z. = M-X may
form a standard exponential family with natural parameter d i f f e r e n t from φeven when 1.7(7) and (8) f a i l .
[Consider the d i s t r i b u t i o n of X. when X is
BASIC PROPERTIES
27
multinomial of dimension k :> 3, or equivalently, of X* with X* as i n Example 1.3. There are also some other i n t e r e s t i n g instances of t h i s phenomenon.] 1.8.1
(Contingency table under independence).
Consider a 2x2 contingency
table i n which the observations are Y. ., 1 <_ i , j <_ 2, and have a multinomial (N, p) d i s t r i b u t i o n with p = {p. ., 1 <_ i , j £ 2}. independence p... = p^ +
p . where P i + = Σ P Ί j> e t c . .
Under the model of Write t h i s independence
J
model as a log-linear model in a fashion so that the coordinates of the natural (minimal) sufficient s t a t i s t i c are independent binomial variables. to the model of independence in an r*c contingency table.
Generalize
(For further log-
linear models in contingency tables, see Haberman (1974), (1979).) 1.10.1
Show that in any standard exponential family of dimension k and
order m,
m + k >_ dim K + dim Θ .
m < min(dim Θ, dim K).
Give an example in which
[The simplest example has m = 0, dim Θ = dim K = 1,
k = 2.] 1.12.1
From many points of view the negative binomial distributions are
the discrete analog to the gamma distributions.
The negative binomial,
NB(α, p), distribution has probability function
Show that f o r f i x e d α the family N8(α, •) is a one parameter exponential f a m i l y , but that -- unlike the Γ(α, σ) s i t u a t i o n -- the family N8(α, p) α>0,
0 < p < 1 i s not an exponential
1.12.2 c R
family.
Let v denote counting measure on { ( 0 , 0 ) , ( 1 , 1 ) , ( 2 , 0 ) , (3,1) , ( 4 , 0 ) , . . . } .
Show that the exponential family generated by v has the following
properties:
XΊ has a geometric distribution,
Ge(p,) = M 8 ( l , p ) ;
X2
n a s
a
binomial d i s t r i b u t i o n , B ( p 2 ) ; (X χ - X 2 )/2 has a geometric d i s t r i b u t i o n Ge(pJ and (X- - X 2 )/2 is independent of X2 natural parameters θ^, Θ2
Write p ^ p 2 , P 3 i n terms of the
28
STATISTICAL EXPONENTIAL FAMILIES Let Z 1 ,...,Z m be i.i.d. N(μ,σ 2 ).
1.12.3
Let X = Σ Z 2 . Then X has a Ί
i =l scaled non-central χ
2
distribution with m degrees of freedom, non-centrality
2 2 2 parameter 6 = mμ /σ , and scale parameter σ . Denote this distribution by 2 2 X (6, σ ). ( i ) This distribution has density 2
„ (1)
g(χ)
=
Σ^Γ-—
k=0 where λ = 6/2.
_ λ (2^)(m/2)+k-l e -x/(2σ )
k
k !
— k 2 k σ Γ(k + ψ 2
+
,
/? m / 2
,
(From the form of (1) i t i s evident that K = k ~ P(λ) and 2
2
X|K ~ Γ(k + m/2, σ ) ; thus (X/σ ) K i s central χ (ii)
x>0
2
with k + SJ degrees of freedom.)
The d i s t r i b u t i o n s of X can also be represented as the marginal d i s t r i b u t i o n
of X- from a canonical two parameter exponential family generated by a measure v supported on { ( x ^ x^): x^ > 0, x 2 = 0 , 1 , . . . } . and series expansion prove (1) f o r the case m=l.
[(i)
(1) f o r general m then follows
from facts about sums of Poisson and gamma variables, 2 measure generated by (1) with σ = 1 , λ = 1.] 1.13.1
(i)
By change of variables
(ii)
Let v be the
Show that when k = 1 then ψ must be continuous on N. [Use
1.13 and convexity of W.] (ii)
More generally, l e t ΘQ.Θ- € N and θ = (1 - ρ)θ Q + pθ 1 and
show ψ(θ ) i s continuous i n p f o r 0 <_ p <_ 1.
[Reason as i n ( i ) , or use
Theorem 1.7 and ( i ) . ] (iii) continuous on W. 1.13.2
Give an example of an exponential family i n which ψ is not [Exercise 1.1.l(b) provides an example.]
Generalize 1 . 1 3 . 1 ( i i ) as f o l l o w s :
Let τ . € c o n h u l l ί θ , θ , , . . . ^ , } , 1
1
i =l , . . .
l e t θ € W and θ. € Λ/, j = l , . . . , J .
and τ . •* θ.
u
1
[Write τ i = Σα. . [ ( 1 - p ^ θ . + p..θ] with α. . >_ 0,
Then ψ ( τ . ) •* ψ(θ). I
Σα. . = 1, and p1 f 1. j
J
Use 1 . 1 3 . 1 ( i i ) a n d t h e f a c t t h a t ψ ( τ . ) < Σ α . . Ψ ( ( l - P Ί ) θ . + p . . θ ) . ] j
1.13.3
Let Y = (YQ = 1, Y , . . . Y n ) ' be the i n i t i a l state and n f u r t h e r
observations from an S-state Markov chain with t r a n s i t i o n matrix P p γ
(o
= J'|Y0
i = Ί ') = P. ,
1 < i , j < S, £ = l , . . . , n ) .
(i.e.,
Let N denote the sample
BASIC PROPERTIES
29
t r a n s i t i o n matrix, N = { n . . } ,
(1)
n.j
=
Σ X{Ui}(1^v
Y£)
Suppose p., > 0, 1 <_ i,j <_ S. Show that the distributions of Y form an S 2 dimensional exponential family with canonical statistic N = {n..} and canonical parameters {log p. .} . Show that if n >_ 3 the family has order p
S - 1. [Let E.. denote the matrix with i,j-th entry 1 and all other entries 0. Show that for given 1 < i, j < K there exist sample points N-, ΓL having positive probability and that N- + (E.^ - E. ) = N« and (other) points N-, N ? such that N 1 + ( E ^ - E n ) = N 2 ] 1.14.1
Univariate General Linear Model (G.L.M.).
Let Y be m-variate
normal, Y ~ N(μ, σ I), μ € R , σ > 0. (a) Show that this is an m+1 dimensional exponential family,
(b) In the G.L.M. μ is restricted by μ = Bβ ,
β e Rr
with B a known mxr matrix. Assume (for convenience) B has rank r. Show that this is a full (r+1) dimensional exponential family.
[Use Example 1.14 and
Theorem 1.7.] 1.14.2
Matrix normal distribution. Let μ = {μ. .} be an mxq matrix and
let Γ = {γ. .} and % ={σ. .} be mxm and qxq positive definite matrices, respectively.
Let Y = {Y..} be an mxq random matrix whose entries have a
multivariate normal distribution with
This is the matrix normal d i s t r i b u t i o n , denoted by Y ~ N(μ, Γ, Z ) . (a)
Show that Y has density ( r e l a t i v e to Lebesgue measure on Rmq)
f(y)
= (2πΓ
mq/2
|rΓ
m/2
exp t r ( -
[See Arnold (1981, Theorem 17.4).] (b)
Reduce this to an mq + m ^ m * 1 ] ( ^ q + 1 ^
dimensional minimal exponential
family with canonical parameters θ. . = Γ~ μZ" , 1 <_ i <_ m,
1 £ j £ q, and
'ϋ
θii.
... = γ
Ί 1
'σ
j j
',
l < i < i ' < ι n ,
1 < j < j ' < q , where r "
1
= {γ
1 J
} ,
30
(c)
STATISTICAL EXPONENTIAL FAMILIES
Show t h a t i f m _> 2 and q ^> 2 t h i s i s not a f u l l exponential f a m i l y .
Rather, Θ i s an mq + m(m+l)/2 + q ( q + l ) / 2 - 1 dimensional d i f f e r e n t ! a b l e manifold i n s i d e o f W. (An a l t e r n a t e n o t a t i o n involves w r i t i n g Y = ( γ ( i \ > d e f i n i n g (vec Y) 1 = ( γ ( i ) '
>γ(α\)
τ h e n
γ
~ N^
Γ
>γ(α\)
anc
'
> 2) i s the same as
vec Y ~ N(vec μ, I θ Γ) where θ denotes the Kronecker product. 1.14.3
M u l t i v a r i a t e Linear Model (M.L.M.).
Here Y ~ N(μ, I , %) w i t h %
positive definite and = Bβ
μ
with B a known mxr matrix and 3 an (rxq) matrix of parameters. convenience) B has rank r.
Assume (for
Show that this can be reduced to a f u l l minimal
regular exponential family of dimension rq + q(q+l)/2. 1.14.4
Wishart d i s t r i b u t i o n .
mxm positive definite matrices.
Let X = ( x . . ) and t = (σ..) be symmetric lJ iJ The matrix r ( α , t) d i s t r i b u t i o n has
density (l)
p ^(X)
=
—'
where
Γm(α)
= Z1^-1)/4
Π
Γ(α - ( i -
Show this is an exponential family, and describe the natural observations, natural parameters, and cumulant generating function. ( I f Y.,
i = l , . . . , n , are independent N(0, ϊ)
vectors then
n Σ Y.Yj = X has the Γ(3, 2t) distribution. This is also called the wishart (n, t) distribution and denoted by W(n, %). See e.g. Arnold (1981). Also Σ (Y. - Ϋ)(Y. - Ϋ ) 1 ~W(n-l, %) .) 1.15.1
Consider a 2x2 contingency table (see Exercise 1.8.1). Find the 2 2 conditional distribution of Y.. given Y. = Σ Y.. and Y . = Σ Y.. . Show that
BASIC PROPERTIES
31
these conditional d i s t r i b u t i o n s depend only on the given values Y . + , Y + . and on the odds ratio
PiiP22^Pi2^21
and
^orm
a one
" P a r a m e t e r exponential family.
[Under the independence model the d i s t r i b u t i o n is hypergeometric and independent of p.]
CHAPTER 2. ANALYTIC PROPERTIES
DIFFERENTIABILITY AND MOMENTS
The cumulant generating function has several nice properties. Among these are the fact that its defining expression may be differentiated under the integral sign.
In this manner one obtains the moments of X from
the derivatives of ψ. One needs first to establish a simple bound. 2.1
Lemma I,
Let B = conhull {b^ : i=l,...,I} c R . Let C c B° be compact and let b Q e C.
Then there are constants Kρ (depending on C,B) £=0,1,... such th I b. X k 1 eb'x < Σ e η v b e C , x e R
INI
(1)
L
Also, e
(2)
b x lib -
b o .χ
- e
b o ιι
Let ε > 0.
Proof.
1
<- KK, l £
L
I b. b.1 xx ΣΣ e 1 , i1-1 =
k
x € RK
b € C ,
Note that there exists a K
< »
such that
Jc 9 ε
I r I Λ <_ K
eεlΓl
v
r e R
since lim
|r|A/eε|r|
=
0
Let {e. : i=l,...,k} denote the elementary (orthogonal) unit vectors in R . Then X
<
K
L
X. I
^_
No
^ εlx^ I Ί iΣ ee
32
ς
<
% K'
^ εe ϊ x -εe ; x. Σ M(e + e ), e
r
ANALYTIC PROPERTIES where K!
= k^~2^/2K0
33
. Choose ε > 0 such that (b + € e . ) € B, i = l,...,k,
for all b € C. See Figure 2.1(1). By convexity
since e a # x is convex i n a € R
^.) € B = conhuli {b..}. Then
and
k Σ (e
'εε 1=1
+ e
Figure 2 . 1 ( 1 ) :
( b - ε e ΊΊ ) x
)
<
2k K'
x»9ε
B, C, and p . + = b
max(e
+ εei
b.1 x% )
I
<
2k K
Σ e
b
» - j =i
f o r the proof of Lemma 2 . 1 .
STATISTICAL EXPONENTIAL FAMILIES
34
This proves (1), with K£ = 2k K^ Note that (2) may also be written 1
(bΊ- b π ) x KΣe Ί ° 1 1= 1
<
b - br
bn X H e n c e i ts u f f i c e s t op r o v e ( 2 ) i n t h e c a s e w h e r e b π = 0 , ( s o t h a t e ^ = 1 ) Note that rer -e r,+1 >_ 0 and also that for
and we make this assumption below.
r £ 0 l-e r < |r|. Using the f i r s t inequality when b x > 0 and the second when b x < 0 yields b
eb'x
'x bx b
x
^ max(b x e b'' x , I b x j ) Γb^Γl ' Hence lx||eb'x+ by (1) since b £ C and 0 £ C.
llxl
b. x
| |
FORMULAS FOR MOMENTS Let £. >_ 0 be non-negative integers with
k Σ ί,. = I.
Formal
calculation yields (1)
Ί
_ i i -λ(β)
k θ< x = / ( _ π x ^ ) e v(dx)
.
i=i9θi In particular (2)
Vλ(θ)
-
X /xe θ * v(dx)
These calculations are j u s t i f i e d by the following theorem. 2.2
Theorem Suppose θQ e M° .
Then a l l derivatives of λ and of ψ exist at
ANALYTIC PROPERTIES ΘQ.
They are given by the above expressions
d i f f e r e n t i a t i n g under the i n t e g r a l Proof.
We prove only ( 2 ) .
35
( 1 ) , (2) derived by f o r m a l l y
sign.
(The proof of the general formula (1) i s
and proceeds by i n d u c t i o n on I.
See Exercise 2 . 2 . 1 . )
there is a B = c o n h u l l ί θ ^ : i = l , . . . , I } c W °
Let ΘQ 6 N°.
similar Then
and C c B°, C compact, w i t h
ΘQ € C . Let θ
(3)
d ( θ , x)
=
e
'x
-
- (θ-θn)
e
xe
9
.
llθ - θ o l l By Lemma 2.1 (4)
sup |d(θ, x ) | Θ€C
<_ 2K
i
Σe
Ί
Also |d(θ, x)|
(5)
s i n c e Ve
#x Λ
_Q
= xe
.
-> 0
as
θ -> ΘQ
Hence
π /d(θ,
x)v(dx)
-> 0
as
θ -> ΘQ
by t h e dominated convergence theorem, so t h a t
λ(θ) - λ(θ n ) - (θ - θ n ) 9 Q
(6)
IIθ
which proves ( 1 ) .
-
/xe °
v(dx)
ΘQII
II
Theorem 2.2 immediately yields the following fundamental formulae. 2
k 9f For f : R -> R introduce the notation D^f for the kxk matrix ( - — r — ) . An alternate expression is V'Vf since V1 converts each element of the (column) vector I — into the row vector (a( τ^-)/3χ. : j = l,...,k), and hence D 9 f = V'Vf. dX.
2.3
σX.j
J
c-
Corollary Consider a standard exponential family.
Let θ £ W°.
Then
36
STATISTICAL EXPONENTIAL FAMILIES
(1)
E Θ (X) = Vψ(θ)
(2)
cov X = D ? ψ(θ) = V'Vψ(θ)
Notation.
In the sequel we frequently use the notation
(I 1 )
ξ(θ) = Vψ(θ) = EΘ(X)
and (2 1 )
ϊ(θ) = D2ψ(θ) = ZΘ(X) .
Proof. Calculating formally, Vψ(θ)
= /xe θ ' x v(dx)//e θ ' x v(dx)
=
E
θ< X >
The c a l c u l a t i o n is j u s t i f i e d by Theorem 2.2. (1)
is s i m i l a r .
2.4
Examples
This proves ( 2 ) .
The proof of
II
The reader is invited to use Corollary 2.3 to calculate the familiar formulae for mean and variance in the classic exponential families such as the (univariate) normal, multinomial, Poisson, gamma, negative binomial, etc. For the multivariate normal distribution Corollary 2.3 provides a benefit in the reverse direction.

Let Y be m-variate normal (μ, Σ), as in Example 1.14. Fix μ = 0. Direct calculation (not using Corollary 2.3) yields the familiar result

(1)    E(Y_i Y_j)  =  σ_ij  =  −(Θ^{-1})_ij  =  −θ^{ij}

when μ = 0, where Θ^{-1} = (θ^{ij}). Calculation using Corollary 2.3 and the formula 1.14(3) for the cumulant generating function thus yields, for i ≤ j,

(2)    ∂ψ(θ)/∂θ_ij  =  σ_ij / (1 + δ_ij)  =  −θ^{ij} / (1 + δ_ij) ,

since the corresponding canonical statistics are Y_i Y_j / (1 + δ_ij). Let B = −Θ. Then (2) shows that for any positive definite symmetric matrix B,

(3)    (∂/∂b_ij) log|B|  =  2 b^{ij} / (1 + δ_ij) ,   where   B^{-1} = (b^{ij}) .

Hence, also,

(4)    (∂/∂b_ij) |B|  =  2 b^{ij} |B| / (1 + δ_ij) .
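Corollary 2.3 is easy to verify numerically whenever ψ is available in closed form. The following sketch is an editorial illustration, not part of the original text; it assumes Python with NumPy and uses the Poisson family, for which ψ(θ) = e^θ and the mean and variance both equal e^θ.

```python
import numpy as np

# Poisson family in canonical form: psi(theta) = exp(theta), mean = var = exp(theta).
psi = np.exp

def mean_and_var_from_psi(psi, theta, h=1e-5):
    """Finite-difference versions of Corollary 2.3:
    E_theta(X) = psi'(theta),  Var_theta(X) = psi''(theta)."""
    mean = (psi(theta + h) - psi(theta - h)) / (2 * h)
    var = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h ** 2
    return mean, var

theta = 0.7
mean, var = mean_and_var_from_psi(psi, theta)
print(mean, var, np.exp(theta))   # all three agree to roughly 1e-6
```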
The convexity of ψ together with Theorem 2.2 yields the following useful result.

2.5 Corollary
Let θ_1, θ_2 ∈ N°. Then

(1)    (θ_1 − θ_2)·(ξ(θ_1) − ξ(θ_2))  ≥  0 .

Equality holds in (1) if and only if P_{θ_1} = P_{θ_2}. Consequently ξ(θ_1) = ξ(θ_2) if and only if P_{θ_1} = P_{θ_2}. (If {p_θ} is minimal this happens only when θ_1 = θ_2.)

Proof. ψ is convex. Hence the directional derivative of ψ in direction θ_1 − θ_2 is non-decreasing as one moves along the line from θ_2 to θ_1. That is,

(2)    (θ_1 − θ_2)·∇ψ(θ_2 + ρ(θ_1 − θ_2))  =  (θ_1 − θ_2)·ξ(θ_2 + ρ(θ_1 − θ_2))

is non-decreasing in ρ. This yields (1). If P_{θ_1} ≠ P_{θ_2} then ψ is strictly convex on the line joining θ_2 and θ_1. Hence (2) is strictly increasing for ρ ∈ (0,1). This yields the remaining assertions of the corollary. (The parenthetical assertion is contained in Theorem 1.13.)  ||
The final corollary to Theorem 2.2 establishes the possibility of differentiating inside the integral sign for expectations involving exponential families. The result is stated only for real valued statistics, but obviously generalizes to higher dimensional statistics.

2.6 Corollary
Let T : R^k → R. Let

(1)    N(T)  =  {θ : ∫ |T(x)| e^{θ·x} v(dx) < ∞} .

Then N(T) is convex. Define

(2)    h(θ)  =  ∫ T(x) e^{θ·x} v(dx)  =  e^{ψ(θ)} E_θ(T(X))

for θ ∈ N(T). Then all derivatives of h exist at every θ ∈ N°(T), and they may be computed under the integral sign. In particular

(3)    ∇E_θ(T(X))  =  ∫ (x − ξ(θ)) T(x) exp(θ·x − ψ(θ)) v(dx) .

Proof. Suppose T(x) ≥ 0. Applying Theorem 2.2 to the measure ω(dx) = T(x)v(dx) yields the desired results. For general T the corollary follows upon using the above to separately treat T^+ and T^−.  ||

Note that if T and |T|^{-1} are bounded then N(T) ⊃ N.
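Formula (3) identifies ∇E_θ(T(X)) with the covariance Cov_θ(X, T(X)). A small numerical check is sketched below; it is illustrative only (the Poisson family and the choice T(x) = x² are assumptions of the sketch, not from the text), and it assumes Python with NumPy and SciPy.

```python
import numpy as np
from scipy.stats import poisson

def E_theta(theta, T):
    """E_theta(T(X)) for the Poisson family in canonical form, lam = exp(theta)."""
    lam = np.exp(theta)
    x = np.arange(0, 200)              # truncation is adequate for moderate lam
    return np.sum(T(x) * poisson.pmf(x, lam))

T = lambda x: x ** 2
theta, h = 0.3, 1e-5

# Left side of 2.6(3): d/dtheta E_theta(T(X)), by central differences.
lhs = (E_theta(theta + h, T) - E_theta(theta - h, T)) / (2 * h)

# Right side: Cov_theta(X, T(X)) = E(X T(X)) - E(X) E(T(X)).
rhs = E_theta(theta, lambda x: x * T(x)) - E_theta(theta, lambda x: x) * E_theta(theta, T)
print(lhs, rhs)   # the two values agree to about 1e-5
```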
ANALYTICITY

The moment generating function is analytic. This fact is implicit in the proof of Theorem 2.2. As a preliminary we extend the definition of λ and ψ to the complex domain. Let λ : C^k → C be defined by the same expression as previously, i.e.

(1)    λ(θ)  =  ∫ exp(θ·x) v(dx) .

For θ ∈ C^k let Re θ denote the vector with coordinates (Re θ_1, ..., Re θ_k). Note that for x ∈ R^k

(2)    |e^{θ·x}|  =  e^{(Re θ)·x} .

Hence λ(θ) exists for Re θ ∈ N.

2.7 Theorem
λ(θ) is analytic on {θ ∈ C^k : Re θ ∈ N°}.

Proof. Lemma 2.1 (and its proof) apply for b ∈ C^k, x ∈ R^k. Similarly the proof of Theorem 2.2(2) is valid verbatim for θ ∈ C^k. Thus ∇λ(θ) exists for Re θ ∈ N° (and has the expression 2.2(2)). This implies that λ is analytic on this domain.  ||
Two important properties of analytic functions are: (i) they can be expanded in a Taylor series; and (ii) they are analytic in each variable separately. Thus, for a fixed value of (θ_2,...,θ_k), λ((·,θ_2,...,θ_k)) is analytic. λ((·,θ_2,...,θ_k)) is therefore determined by its values on any subset having an accumulation point. This is the basis for the following result.

2.8 Lemma
Let T : R^k → R, and let

(1)    h(θ)  =  ∫ T(x) e^{θ·x} v(dx) ,   for   Re θ ∈ N(T),

as defined in 2.6(1). Then h is analytic on {θ ∈ C^k : Re θ ∈ N°(T)}. Let L be a line in R^k, and let B ⊂ L ∩ N(T) be any subset of L ∩ N(T) having an accumulation point in N°(T). Then

(2)    h(θ)  =  0    ∀ θ ∈ B

implies h(θ) = 0 for all θ ∈ L ∩ N°(T).

Proof. The first assertion follows upon applying Theorem 2.7 to T^+(x)v(dx) and T^−(x)v(dx). Next, one may apply linked affine transformations as in Proposition 1.6. Because of this it suffices to consider the case where L = {θ ∈ R^k : θ_2 = ... = θ_k = 0}. Then h((θ_1,0,...,0)) is an analytic function of θ_1 ∈ C, as already noted. Hence (2) implies h((θ_1,0,...,0)) ≡ 0 on its domain of analyticity, which is {(θ_1,0,...,0) : (Re θ_1, 0,...,0) ∈ L ∩ N°(T)}. This proves the second assertion.  ||
Note that, more generally, if B is as above then the values of h on B uniquely determine, by analytic continuation, its value on all of L ∩ N°(T). (Straight lines play a special role in the above lemma. However we note that there is a valid generalization of the above lemma in which L can be replaced by a suitable one dimensional curve determined as the locus of points satisfying (k − 1) simultaneous analytic equations (C. Earle (1980), personal communication). For example L may be taken to be the curve x_1² + x_2² = 1, x_3 = ... = x_k = 0.)

2.9 Example
A question which arises in statistical estimation theory is whether the positive part James–Stein estimator for an unknown normal mean,

    δ(x)  =  (1 − (k−2)||x||^{-2})^+ x ,    x ∈ R^k,

can possibly be generalized Bayes for squared error loss. This is equivalent to asking whether δ(·) can be the gradient of a cumulant generating function for some measure v(dθ) having N = R^k. (Note the interchange of the roles of θ and x.) See Theorem 4.16. The answer is, "No." To see this note that δ(x) ≡ 0 for ||x|| ≤ 1. Hence if δ(x) = ∇ψ(x) = ∇λ(x)/λ(x) for ||x|| < 1 it follows by analyticity that ∇ψ(x) = 0 on its domain of analyticity, which in this case is R^k. This implies δ(x) ≡ 0, a contradiction.  ||
2.10 Example
The question arises in the theory of hypothesis tests as to whether the unit square,

    S  =  {x ∈ R^k : |x_i| ≤ 1},    k ≥ 2,

can be a Bayes acceptance region for testing the mean of a normal distribution. Placed in a general context, the question is whether there exist two distinct non-zero finite measures G_0 and G_1 (concentrated on disjoint sets Θ_0 and Θ_1 ⊂ R^k) such that

(1)    d(x)  =  ∫ e^{θ·x − ||θ||²/2} (G_0(dθ) − G_1(dθ))  >  0   if x ∈ S,

and d(x) < 0 if x ∉ S. The answer is, "No."

Proof. Let μ_i(dθ) = e^{−||θ||²/2} G_i(dθ), i = 0,1. Then d(x) = λ_0(x) − λ_1(x), where λ_i is the moment generating function of μ_i. Note that N_{μ_i} = R^k, i = 0,1. Hence d(·) is analytic on R^k. For convenience consider only the case k = 2. Expand d in a Taylor series about (1,1) as

    d((1,1) + (y_1, y_2))  =  Σ_{i=0}^∞ Σ_{j=0}^i a_{j,i−j} y_1^j y_2^{i−j} .

(a_{0,0} = 0 since d((1,1)) = 0.) Let i' be the smallest index for which Σ_{j=0}^{i'} |a_{j,i'−j}| > 0. i' exists since d ≢ 0 if (1) is valid.

Suppose i' is even. Then for y_i ≥ 0, i = 1,2,

(2)    Σ_{j=0}^{i'} a_{j,i'−j} y_1^j y_2^{i'−j}  =  Σ_{j=0}^{i'} a_{j,i'−j} (−y_1)^j (−y_2)^{i'−j} .

There are values (y_1, y_2) in the first quadrant for which (2) ≠ 0, since (2) is a non-zero homogeneous polynomial. Suppose (y_1^0, y_2^0) is such a value. Then

    |ρ|^{−i'} d((1,1) + (ρ y_1^0, ρ y_2^0))  =  Σ_{j=0}^{i'} a_{j,i'−j} (y_1^0)^j (y_2^0)^{i'−j} + o(1)  =  c + o(1)   as |ρ| → 0,

with c ≠ 0. If c > 0 it follows that d((1,1) + (ρy_1^0, ρy_2^0)) > 0 for ρ > 0 sufficiently small, where the point lies outside S; and this would contradict (1). If c < 0 it follows that d((1,1) + (ρy_1^0, ρy_2^0)) < 0 for ρ < 0 sufficiently small, where the point lies inside S; and this would also contradict (1).

If i' is odd, analogous reasoning yields

    ||y||^{−i'} d((1,1) + (y_1, −y_2))  =  −||y||^{−i'} d((1,1) + (−y_1, y_2)) + o(1)   as ||y|| → 0,

and there are values of (y_1, −y_2), y_1 > 0, y_2 > 0, for which the limit of ||y||^{−i'} d((1,1) + ρ(y_1, −y_2)) as ρ ↓ 0 is not 0. It follows that there are values of y in either the fourth quadrant or the second quadrant, arbitrarily near 0, for which d((1,1) + y) > 0; such points lie outside S. This again contradicts (1). Hence (1) is impossible.  ||
COMPLETENESS

2.11 Remarks
A family {F_θ : θ ∈ Θ} of probability distributions (or their associated densities, if these exist) is called statistically complete if T : R^k → R with

(1)    ∫ T(x) F_θ(dx)  =  0    ∀ θ ∈ Θ

implies

(2)    T(x)  =  0   a.e. (F_θ)    ∀ θ ∈ Θ .

(Implicit in (1) is the condition that ∫ |T(x)| F_θ(dx) < ∞ ∀ θ ∈ Θ.) Standard exponential families are complete if the parameter space is large enough. This result, which is equivalent to the uniqueness theorem for Laplace transforms, is proved in Theorem 2.12. (The uniqueness theorem for Laplace transforms states that if N_μ° ∩ N_v° ≠ φ then λ_μ = λ_v if and only if μ = v.)

The most convenient way to prove this theorem seems to be to invoke the uniqueness theorem for Fourier–Stieltjes transforms (equals characteristic functions), which is described in the next paragraph. Let Im = {bi ∈ C : b ∈ R} denote the pure imaginary numbers. Let F be a finite (non-negative) measure on R^k. The function κ_F : R^k → C defined by

    κ_F(b)  =  λ_F(bi) ,    b ∈ R^k,

is the Fourier–Stieltjes transform (or, Fourier transform, or, characteristic function) of F. Hence λ_F restricted to the domain (Im)^k is equivalent to κ_F. Note that κ_F always exists (i.e. Re((Im)^k) = {0} ⊂ N). The uniqueness theorem for Fourier transforms is as follows.

Theorem. Let F and G be two finite non-negative measures on R^k. Then F = G if and only if κ_F ≡ κ_G (i.e. λ_F(bi) = λ_G(bi) ∀ b ∈ R^k).

Proof. This is a standard result in the theory of characteristic functions. Proofs abound. A quick proof may be found in Feller (1966, XV.3). (This proof is explicitly for R, but generalizes immediately to R^k.)  ||
Here is the classic result on completeness of exponential families.

2.12 Theorem
Let {p_θ : θ ∈ Θ} be a standard exponential family. Suppose Θ° ≠ φ. Then {p_θ} is complete.

Proof. Let θ_0 ∈ Θ°. One may translate coordinates using Proposition 1.6 so that θ_0 = 0. There is thus no loss of generality in assuming θ_0 = 0. Suppose ∫ T(x) p_θ(x) v(dx) = 0 ∀ θ ∈ Θ. Then, letting T = T^+ − T^−,

(1)    ∫ T^+(x) e^{θ·x} v(dx)  =  ∫ T^−(x) e^{θ·x} v(dx)    ∀ θ ∈ Θ .

Let F(dx) = T^+(x)v(dx), G(dx) = T^−(x)v(dx). Then (1) becomes

(2)    λ_F(θ)  =  ∫ e^{θ·x} F(dx)  =  ∫ e^{θ·x} G(dx)  =  λ_G(θ) ,    θ ∈ Θ .

Both λ_F(·) and λ_G(·) are analytic on the domain {z ∈ C^k : Re z ∈ Θ°}, and (2) states that they agree on Θ. Hence λ_F(z) = λ_G(z) for all z such that Re z ∈ Θ°. (This follows directly from analyticity. Alternately one may apply the second half of Lemma 2.8 to all lines which intersect Θ°.) In particular

(3)    λ_F(0 + bi)  =  λ_G(0 + bi)    ∀ b ∈ R^k

since 0 ∈ Θ°. Thus F = G by Theorem 2.11. This says that T^+(x)v(dx) = T^−(x)v(dx), which implies T^+ = T^− a.e.(v), which implies T = 0 a.e.(v). Hence {p_θ} is complete.  ||

Note from the above that any canonical family is complete. From this we derive:
2.13 Corollary
A standard family with N° ≠ φ is uniquely determined by its Laplace transform (or by its cumulant generating function).

Note that the corollary applies to all minimal families since they always have N° ≠ φ.

Proof. Consider the standard families in R^k generated by the measures μ and v. Suppose N_μ° ≠ φ and ψ_μ = ψ_v. Let ω = (μ + v)/2. Then ω generates an exponential family with λ_ω = (λ_μ + λ_v)/2, and hence N_ω = N_μ = N_v and ψ_ω = ψ_μ = ψ_v. Let T = dμ/dω − dv/dω. Then

    ∫ T(x) e^{θ·x − ψ_ω(θ)} ω(dx)  =  ∫ e^{θ·x − ψ_μ(θ)} μ(dx) − ∫ e^{θ·x − ψ_v(θ)} v(dx)  =  1 − 1  =  0 .

Hence T = 0 a.e.(ω) by Theorem 2.12; which implies μ = v.  ||
Theorem 2.12 has many other important applications in statistics. It plays an important role, for example, in the theory of unbiased estimates and in the construction of unbiased tests. Some aspects of this role are described in the exercises and in succeeding chapters.

MUTUAL INDEPENDENCE

Lehmann (1959, p. 162-163) describes a nice proof of the independence of X̄ and S² in a normal sample. A different but related proof is a special instance of an argument which applies in several important exponential families. (See Example 2.15.) The basic parts of the argument are due to Neyman (1938) and Basu (1955), but the full result in Theorem 2.14, below, was only recently proved by Bar-Lev (1983) and by Barndorff-Nielsen and Blaesild (1983). The proof below follows that in the second of these papers. See the exercises for an additional related result of Bar-Lev and for several applications of this theorem. Through most of this subsection we consider the situation where
θ and x are, respectively, partitioned as θ' = (θ_(1)', θ_(2)') and x' = (x_(1)', x_(2)'). As in Sections 1.7 and 1.15, problems not in this form can sometimes be reduced to this form through use of linked linear transformations on θ and x. Where convenient, we write ψ(θ) = ψ(θ_(1), θ_(2)). We use the notation Y ~ Expf(θ) to mean that the distributions of Y form a standard exponential family with natural parameter θ. We also use the notation X ⊥ Y to mean that X and Y are independent.

2.14 Theorem
Let X ~ Expf(θ) with θ° ∈ Θ°. Let X' = (X_(1)', X_(2)') where X_(i) is k_i dimensional, and let h(X_(1)) be a k_2 dimensional statistic. Let

(1)    p_1(θ_(1), θ_(2))  =  log E_{θ°}(exp((θ_(1) − θ°_(1))·X_(1) + (θ_(2) − θ°_(2))·h(X_(1))))

       p_2(θ_(2))  =  log E_{θ°}(exp((θ_(2) − θ°_(2))·(X_(2) − h(X_(1))))) .

Then the following conditions are equivalent:

(2)     X_(1) ⊥ (X_(2) − h(X_(1)))   under θ°

(2')    X_(1) ⊥ (X_(2) − h(X_(1)))   for all θ ∈ Θ

(3)     ψ(θ_(1), θ_(2))  =  p_1(θ_(1), θ_(2)) + p_2(θ_(2))    ∀ θ ∈ Θ

(4)     (X_(2) − h(X_(1)))  ~  Expf(θ_(2))

(5)     (X_(1), h(X_(1)))  ~  Expf(θ_(1), θ_(2)) .
Proof. For convenience, assume without loss of generality that θ° = 0. (See Proposition 1.6.) Let ω denote the joint distribution under θ = 0 of V = (X_(1), h(X_(1)), X_(2) − h(X_(1))). Consider the standard exponential family generated by ω, with natural parameter space N_V. Note that, in general,

    {X_(1) ⊥ (X_(2) − h(X_(1)))}  ⇔  {(X_(1), h(X_(1))) ⊥ (X_(2) − h(X_(1)))} .

The equivalence of (2) and (2') is seen in this fashion to be a special case of Exercise 1.7.1.

(2) ⇒ (3) follows from a direct calculation.

(3) ⇒ (2): Let ω_1 denote the distribution under θ° = 0 of (V_(1), V_(2)) = (X_(1), h(X_(1))) and ω_2 that of V_(3) = X_(2) − h(X_(1)). Let ω* = ω_1 × ω_2. Then the cumulant generating function ψ* of ω* satisfies

    ψ*(θ_(1), θ_(2), θ_(2))  =  p_1(θ_(1), θ_(2)) + p_2(θ_(2)) ,    (θ_(1), θ_(2)) ∈ Θ .

Furthermore, the cumulant generating function of the linear function (V_(1), V_(2) + V_(3)) is ψ** given by

    ψ**(θ_(1), θ_(2))  =  ψ*(θ_(1), θ_(2), θ_(2))  =  p_1(θ_(1), θ_(2)) + p_2(θ_(2))  =  ψ(θ_(1), θ_(2)) ,    θ ∈ Θ .

It follows from Corollary 2.13, since Θ° ≠ φ, that (V_(1), V_(2) + V_(3)) has the same distribution under θ° as (X_(1), X_(2)). Thus (X_(1), X_(2) − h(X_(1))) has the same joint distribution under θ° as (V_(1), V_(2) + V_(3) − h(V_(1))). But V_(2) + V_(3) − h(V_(1)) = V_(3). Hence X_(1) ⊥ (X_(2) − h(X_(1))) under θ° since V_(1) ⊥ V_(3).

(2) ⇒ (4) and (5), as can be seen by direct calculation of the marginal distributions involved via the standard formulae (6) and (8), below.

(4) ⇒ (2): The marginal density of V_(3) = X_(2) − h(X_(1)) relative to the marginal distribution ω_2 is
(6)    q_θ(v_(3))  =  ∫ exp(θ_(1)·v_(1) + θ_(2)·h(v_(1)) + θ_(2)·v_(3) − ψ(θ)) ω(dv_(1) | v_(3))   (a.e.),

where ω(·|·) denotes the indicated conditional distribution. By (4)

    q_θ(v_(3))  =  exp(θ_(2)·v_(3) − p_2(θ_(2)))   (a.e.).

Setting θ_(2) = 0 yields

(7)    exp(ψ(θ_(1), 0) − p_2(0))  =  ∫ exp(θ_(1)·v_(1)) ω(dv_(1) | v_(3)) ,    (θ_(1), 0) ∈ Θ , (a.e.) .

Here the Laplace transform of ω(·|v_(3)) exists on an open set and is independent of v_(3) (a.e.). It follows from another application of Corollary 2.13 that ω(·|v_(3)) is independent of v_(3) (a.e.). So, V_(1) is independent of V_(3). This verifies (2).

The proof that (5) ⇒ (2) is similar. The marginal joint density of (V_(1), V_(2)) is

(8)    q'_θ(v_(1), v_(2))  =  ∫ exp(θ_(1)·v_(1) + θ_(2)·h(v_(1)) + θ_(2)·v_(3) − ψ(θ)) ω'(dv_(3) | v_(1))   (a.e.) .

Setting θ_(1) = 0 and cancelling terms in (5) implies

    exp(ψ(0, θ_(2)) − p_1(0, θ_(2)))  =  ∫ exp(θ_(2)·v_(3)) ω'(dv_(3) | v_(1))   (a.e.).

Hence, as before, ω'(·|v_(1)) is independent of v_(1) (a.e.), which yields (2).  ||
2.15 Examples
(i) Let Y_1,...,Y_n be independent N(μ, σ²) variables. Then (Example 1.12) (Y_i, Y_i²) ~ Expf(μ/σ², −1/2σ²), and hence (ΣY_i, ΣY_i²) ~ Expf(μ/σ², −1/2σ²). Also (ΣY_i, (ΣY_i)²/n) ~ Expf(μ/σ², −1/2σ²). This verifies 2.14(5). Hence ΣY_i² − (ΣY_i)²/n = Σ(Y_i − Ȳ)² ~ Expf(−1/2σ²) and is independent of ΣY_i, by 2.14(4) and 2.14(2').

(ii) Similarly, let X_1,...,X_n be independent Γ(α, σ). Then (Example 1.12) (ΣX_i, (1/n)Σ ln X_i) ~ Expf(−1/σ, nα). The marginal distribution of ΣX_i is also Γ(nα, σ); hence (ΣX_i, ln ΣX_i) ~ Expf(−1/σ, nα). Again, Theorem 2.14 yields that ((1/n)Σ ln X_i − ln ΣX_i) ⊥ ΣX_i. This is often re-expressed in the form X̃/X̄ ⊥ X̄, where here X̃ = (Π_{i=1}^n X_i)^{1/n} denotes the geometric mean of the observations. Also, ln(X̃/X̄) ~ Expf(nα). See the Exercises for a double extension of this conclusion.

There are further applications of this theorem. For some of these see the exercises and the references cited above. In particular there are several applications to problems involving the inverse Gaussian distribution. See Chapter 3.
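The independence assertions of Example 2.15 are easy to examine by simulation. The sketch below is an editorial illustration (it assumes Python with NumPy, and uses the sample correlation only as a crude diagnostic): it draws gamma samples and confirms that the ratio of geometric to arithmetic mean is essentially uncorrelated with the sum ΣX_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, sigma, reps = 5, 2.0, 1.5, 20000

x = rng.gamma(shape=alpha, scale=sigma, size=(reps, n))
total = x.sum(axis=1)                                       # canonical statistic sum(X_i)
ratio = np.exp(np.log(x).mean(axis=1)) / x.mean(axis=1)     # geometric / arithmetic mean

# Under Example 2.15(ii) the ratio is independent of the sum, so the sample
# correlation should be near zero (a necessary condition only, of course).
print(np.corrcoef(ratio, total)[0, 1])
```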
CONTINUITY THEOREM The continuity theorem for Laplace transforms refers to the limiting behavior of a sequence of measures and the associated Laplace transforms. We f i r s t need a standard definition and some related remarks. 2.16
Definition Consider R .
functions on R .
Let C denote the space of continuous (real-valued)
Let CQ c C denote the subspace of continuous functions with
compact support - - i . e . c(x)
= 0
for
11 x EI > r ,
some r < oo .
A (non-negative) measure v is called locally v({x : I Ixl I < r } ) < oo v
finite
r € R. Except where specifically noted, a l l measures
are assumed to be locally f i n i t e , σ-finite, and non-negative. sequence of measures.
Let {v } be a
We say v
(1)
if
-* v
(weak*)
/ c(x)v n (dx) -> /c(x)v(dx)
if V c € CQ .
Here are several important facts concerning weak* convergence. For v finite let V^ denote the cumulative distribution
function:
V v ( t ) = v ( { x : xΊ. < t . , i = l , . . . , k } ) . (i) (2)
Then v n + v i f and only i f
Vv ( t ) + V v (t) (ii)
V t € Rk
at which Vv( ) is continuous.
k k S u p p o s e v + v . T h e n l i m i n f v ( R ) > v ( R ) . S u p p o s e there n ~ n —
is a c £ C, c >_ 0, with (3)
lim c(x) = °° llxll-*»
such that (3 1 )
lim sup /c(x)v (dx) < «> n-χ»
Then lim v (R k )
(4)
= v(R k ) < «
n-χ»
(iii) Furthermore, (4) implies (5)
/c(x)v n (dx) - /c(x)v(dx)
for all bounded c ε C. (Condition (3), (3 1 ) is sometimes referred to by saying the sequence is tight.) (iv) If v > 0 is any bounded sequence (i.e. lim sup v (R ) < «>) n-*» then there is a subsequence {vn } and a finite measure v such that v -> v . n π i i For a proof of these facts see Neveu (1965). 2.17
Theorem Let S <= Rk and l e t B = conhull S.
sequence of measures on R
(1)
oo
v
b e S .
Then there e x i s t s a subsequence {n^} and a l o c a l l y
f i n i t e measure v such t h a t
(2)
ebo'x v
n
(dx)
->
eb°#xv(dx)
i
and (3)
Let v n be a
such t h a t
lim i n f sup λ (b) < v beS n Let b Q e B°.
Suppose B° * φ.
λ
(b) - λ v (b)
V b € B° .
The convergence in (3) is uniform on compact subsets of B°,
STATISTICAL EXPONENTIAL FAMILIES (Condition (3) is of course equivalent to
ψ
(b) + ψ (b)
v b e B°
n
i Condition (1) implies the measures v are locally finite.) Remark.
Lemma 2.1 together with (3) shows that / x e b # x v n ( d x ) - /xe b ' x v(dx),
(4)
b ε B° ,
and similarly for higher moments of x. Hence (5)
V λ v (b) + vλ v (b), n
b € B°
i
and similarly for higher order partial derivatives of λ. See Exercise 2.17.1. Similar reasoning also shows that (6)
e θ # b v n (dθ) - e θ # b v(dθ)
weak*
v b e BQ .
Hence the measure v in (2) does not depend on the choice of b Q € BQ . Proof.
We exploit Proposition 1.6 and assume without loss of generality
that bg = 0 £ B°. It also suffices to assume that B is a convex polytope (i.e. B = conhull {b. : i=l,...,m}) since the interior of any convex set is a countable union of such polytopes, and a compact subset of the interior will be contained in one of them. Now, m
lim
Σ
e
bΓx
=
by Lemma 2.1. Thus, for some subsequence {n1.} j m
(7)
b- x lim sup /( Σ e Ί )v.(dx) < co n n-« i =l j
by (1). Hence, the sequence {v ,} is tight, and there exists a further J subsequence {v } and a limiting measure v such that v -• v . This immediately implies that also e b # x v
n
(dx) -> e b ' x v (dx) for any b € R k . i
lim (Σe7 / e b " x ) = °°, again by Lemma 2.1.
Let b € B°. Then As i n (7)
bΓx
lim sup / ^ T - e b ' X v D x n-χ» e
n n
i
(dx) < ».
Hence the sequence e # X v (dx) is also tight. This implies / e b # x v n (dx) + /e b ' x v(dx), which yields (3). Let C c B° be compact. Then ||x||e b χ by Lemma 2 . 1 .
< KΣebi*X
This yields
lim sup sup | |Vλv (b)|| Ί-XΌ beC r\.
m b <_ lim sup K / Σ e Ί i^oo i =i
x v
(dx)
n Ί
m <_ lim sup K Σ λ(b.) < <» . The f u n c t i o n s λ
(•) are thus u n i f o r m l y ( i n { n . } ) u n i f o r m l y c o n t i n u o u s on C.
The convergence i n (3) i s t h e r e f o r e u n i f o r m on C.
||
2.18 Uniform Convergence Theorem 2.17 shows that if A..
(b) -> ψ Λ ( b )
for all
b G B° f φ
A
then
\>. + v i
There is a useful uniform version of this statement. (1)
{v α n : n=l,...,α € A}
be a family of sequences of measures and {v All of these are assumed locally f i n i t e . v when for each c 6 CΛ
Let
-** v
: α € A} be a family of measures.
We say
(weak*) uniformly in α
(2)
/c(x)v
(dx) n
uniformly over α € A. V
α =\
/c(x)v (dx)
->oo
For notational convenience in the following, let
etc
Proposition.
Suppose the family of cumulative distribution functions
{V : α € A} is equicontinuous at e^ery x e R .
Then v
-+v
uniformly in
α i f and only i f (3) Proof.
V α n •*• V α
uniformly for
α €A
The necessity of (3) is proved by applying (2) to continuous
functions c satisfying
c(x)
1
x
i ^X0i - 6
0
X
i
for a l l
= >X
0i
+ 6
f o r some
and then choosing 6 sufficiently small. Conversely, (3) implies /g(x)d(V α n (x) - V α (x)) = • ^ V α n ^ " v α ( χ ) ) d 9 ( χ ) "*• ° uniformly in α for each differentiate g € C Q . If c € C Q and ε > 0 there is a differentiable g € C Q with |g - c| < ε . Then |/(c(x) - g(x)) d(V α n (x) - V α (x))| < 2ε uniformly for all α € A and all n. Combining these facts yields the uniform convergence of v
to v .
||
Extra care in the proof of the above proposition will show that if the ίV α : α e A) are equicontinuous uniformly over x € S and v^ -> v uniformly in α then (3) holds uniformly for α € A, x € S. 2.19
Theorem Let {v n ) and {v } be as in 2.18(1).
and B° ϊ φ. Let λ = λ , etc. α α (1)
Suppose
λ (b) -> λ (b) nκ»
uniformly over α e A, and suppose
Suppose B = conhull S,
v bε S
ANALYTIC PROPERTIES (2)
53
sup sup λ (b) < b€S α α
Then v
-»• v uniformly over α € A. If v -h v uniformly over α € A, there is a c € C n and a sequence an a
Proof. a such that
lim |/c(θ)(v α n (dθ)-v α
(3)
n-χ»
In view of (3) there exists a subsequence vt t vΐ such that if we write v (4)
ω. •* v* , l
1
= ω. and v
= ω. then
λ M (b) + λ *(b) , ω.
n. and limiting measures
b E B ;
v1
and (5) ω. -> v* , λ- β (b) -> λ v *(b) b €B by Theorem 2.17. (To establish (4) we exploit (2) to guarantee condition 2.17(1) for the sequence {ωn } .) i Assumption (1) implies λ *(b) = λ *(b), b e B, which implies v V l 2 vϊ = Vo - This is a contradiction. α € A.
It follows that v
-*• v uniformly over
||
TOTAL POSITIVITY 2.20
Definitions Let S cz R and h : S -> R .
{x
€ S :
Let { X Q < . . . < X Π } <= S.
i = O , l , . . . , n } is called a s t r i c t l y changing sequence f o r h having
order n i f
(1)
The sequence
(sgn h(χ._ 1 ))(sgn h(x.)) = -1
i=l s ...,n
The number S (h) -- the number of strict sign changes of h -- is the maximal order of a sequence of strict sign changes of h. Clearly 0 <_ S (h) <_«> . Let S"(h) = n < °° and let {x. € S : i=0,...,n} be a strictly changing sequence for h having order n. Then the (strict) initial sign of h is (2)
IS"(h) = sgn h(x Q )
.
(It is easy to check that this definition is well-formulated -- i.e. does not depend on the chosen strictly changing sequence for h.) Similarly a sequence {x. € S : i=0,...,n} is called a weakly changing sequence for h having order n if (3)
(sgn h(x 2i ))(sgn h(x 2 j + 1 )) < 0
for
i=0,...,[n/2],
j=0,...,t(n-l)/2]
This means that zeros of the sequence {sgn h(χ.)
: i = 0 , l , . . . , n } can be
reassigned as e i t h e r a (+1) or a (-1) i n a manner so t h a t the r e s u l t i n g sequence of ± Γ s alternates i n sign. of such a sequence.
The number S (h) is the maximal order
Clearly, 0 <_ S (h) <_ °°, and S + (h)
(4)
> S"(h)
.
Let S + (h) = n < °° and l e t {x. e S : i = 0 , . . . , n } be a weakly changing sequence for
(5)
h of order n.
IS"(h)
=
Then +1
if
h(x2 ) > 0
f o r some
0
if
h(χ.) = o
i=0,...,n
-1
if
h(x2i)<0
for some
i=0,...,[n/2]
i=0,...,[n/2] .
It can be checked that this definition is well formulated. 2,21 Theorem Let {p n } be a standard one parameter exponential family. Let u g : R -> R such that v{χ : g(x) ^ 0} > 0 . Let
ANALYTIC PROPERTIES (1)
h(θ) = E θ (g(x)) ,
55
θ € A/°(g) .
Then S + (h) < S"(g) .
(2)
I f equality holds in (2) then IS+(h)
(3) Remark.
= is"(g)
.
The domain of h i n (1) is r e s t r i c t e d to W°(g).
true i f the domain of h i s a l l of N(g).
The theorem remains
We leave this generalization as an
exercise. The sign-change-preserving properties ( 2 ) , (3) are equivalent to "Total P o s i t i v i t y of {p Q } of order « . " standard reference on t h i s t o p i c .
Karlin (1968) is a very u s e f u l ,
See also Brown, Johnstone, and MacGibbon
(1981).
Proof.
Let g(θ)
= /e θ x g(x)dx = e ψ ( θ ) h ( θ )
.
It suffices to prove g has the properties of h in (2), (3), The proof is by induction on n = S~(g). Assume without loss of generality that IS"(g) = +1. When n = 0 the result is trivial since then g ^ 0 and v({χ : g(x) > 0}) > 0 so that g(θ) > 0 for all θ € W(h), as claimed in (2). Assume the theorem is true for n <_ N. Suppose n = N + 1. Let ξ, = infix : g(x) < 0}. u(θ) =
ξ- > -» since IS"(g) = +1. Let
d_( e " θ ζ i3(θ))
= /(x - ξ l )g(x)e θ x v(dx)
.
Now, S"((x - ξi)g(x)) 1 N = n - 1, as can easily be checked from the definition of ζ r
Hence S+(u) <_ N by the induction hypothesis. Integration yields that
S+(U) <_ N + 1 where (4)
U(θ) = / θ u(t)dt
= e " ξ i θ g(θ)
.
(2) follows from (4). (3) may be verified by concentrating the above argument on the case where S + (u) = N and S + (U) = N + 1, and using the induction hypothesis
to keep track of IS (u) and consequently of IS (U).
||
The above property for n = 1 is equivalent to the strict monotone likelihood ratio property. The following is an important consequence of this. 2.22
Corollary Let { p . } be a standard one parameter exponential family. u
g : R + R is non-decreasing and not e s s e n t i a l l y a constant ( v ) .
Suppose
Then E θ (g)
is s t r i c t l y increasing on M°(g). [Remark.
Again, the r e s u l t is true on the f u l l domain, W(g), but we leave
v e r i f i c a t i o n of t h i s as an exercise.) Proof.
Let
ess i n f g( ) < c < ess sup g( ) ί then g( ) - c s a t i s f i e s the
hypotheses of Theorem 2.21 with S~(g-c) = 1. for
θ € W°(g) whenever θ > QΛc)
s t r i c t l y increasing on W°(g).
Hence E J g ) - c > 0
(whenever θ < θ..(c)).
(or < 0)
I t follows that g i s
||
I t is possible to derive from the above some results concerning sign changes f o r multidimensional f a m i l i e s .
In general, these results appear
yery weak by comparison with t h e i r univariate cousins.
Here is an example of
such a r e s u l t which w i l l be useful l a t e r . 2.23
Corollary Let {p Q } be a standard k parameter exponential family.
ΘQ € N and v € Rk.
Suppose g : Rk ->• R s a t i s f i e s
Let θ = ΘQ + pv.
g(x)
£
0
v
x £ α
^
0
v
x ^ α
(1)
for
some α € R.
Let h(ρ)
+
Then S (h) < 1 .
+
=
E +
(g(X)) P
I f S (h) = 1 then IS (h) = - 1 .
.
Let
Proof.
Apply Theorem 2.22 to the one parameter exponential family {pΘ Q } P of densities of v X. Observe that E θ (g(x)|v x = t) = g*(t) is independent o f P by Theorem 1.7, and (1) guarantees that S"(g*) £ 1.
These observations enable the desired application o f the theorem.
II
PARTIAL ORDER PROPERTIES The preceding multidimensional r e s u l t i s not very s a t i s f a c t o r y ; the hypotheses on h are too r e s t r i c t i v e .
Better results may be obtained by
considering p a r t i a l orderings and imposing suitable r e s t r i c t i o n s on the exponential family.
We give one simple r e s u l t as an appetizer f o r what may
be obtained. For t h i s r e s u l t define the p a r t i a l ordering, <* , on R by x « y i f x
i 1 y-j»
i =l , . . . , k .
A function h : R -> R i s non-decreasing r e l a t i v e to t h i s
ordering i f x <χ y implies h(x) <_ h ( y ) . The following preparatory lemma i s also of independent i n t e r e s t . 2.24
Lemma Let X have coordinates X.,...,X. which are independent random
variables with d i s t r i b u t i o n s F 1 ,...,F| ζ i respectively. decreasing r e l a t i v e to the p a r t i a l ordering * .
(1) Proof. well known. of (1) as
Suppose hy h^ are non-
Then
E(h 1 (X)h 2 (X)) > EίhjWJEίhgίX)) . The proof is by induction on k.
Note that for k = 1 the result is
This observation enables one to rewrite and reduce the l e f t side
58
STATISTICAL EXPONENTIAL FAMILIES /.../ hΛx)h ?(x) 1 ά
k Π ΊF.(dχ.) Ί i=l
k-l = / . . . / ( / h 1 (x)h 2 (x)F k (dx | < )) Π F.(dx.) k-l >_ /.../ [/h 1 (x)F k (dx k )][/ h 2 (x)F k (dx k )] n F.(dχ.) . Each function in square brackets is clearly non-decreasing in (xΊ,...,x, - ) . Hence, by induction, (1) is valid.
||
Here is the application to exponential families. 2.25 Theorem Consider a minimal standard exponential family for which the canonical coordinate variables X 1 9 ...,X k are independent. Let h be nondecreasing relative to the partial ordering «. Then E (h) is a non-decreasing function of θ on M°(h). Proof.
(This result may be extended to all of W(h).)
Write
Note that both x. - ξ.(θ) and h(x) are non-decreasing functions of x. Hence J
J
g|τE θ (h) > E (Xj - ξj(θ))Eθ(h(X)) = 0 J
by Lemma 2.24. It follows that E Λ (h) is non-decreasing in each coordinate of θ and hence (equivalently) is non-decreasing relative to «.
||
The preceding theorem is merely a sample of the available results. Other assumptions may replace the independence assumption, above. Notably, the conclusion of Lemma 2.24 remains valid if the joint distribution, F, of X has a density f with respect to Lebesgue measure which is monotone likelihood ratio in each pair of coordinates when the others are held fixed. (Exercise.) (There is also a lattice variable version of this fact.) Such densities are called multivariate totally positive of order 2 (= MTPp). Suppose {pa> is a minimal standard exponential family whose dominating measure, v, is MTPp
It follows by the proof of the theorem above that then h non-
decreasing implies E Q (h) non-decreasing in θ. Under suitable conditions it is also possible to derive analogous "order preserving" results for other partial orderings. For example, one may consider the partial ordering induced by a convex cone C c R , under which xα
c
y if y - x e c.
A rather different but very fruitful partial ordering is that leading k k the notion of Schur convexity. Define x « Qb y if Σ x.Ί = Σ y. and if Ί i=l i=l k1 Σk' X MLΊJI 1 Σ V ΓLΊJ- T 1 <_ k' < k, where x r LΊJ . Ί »i = l,...,k, denote the coordinates i= l i =i of x written in decreasing order, etc. Then h is called Schur convex if it is non-decreasing relative to the ordering <*
(Obviously any such function
must be a symmetric function of x , , . . . ^ . ) For further information about these and other partial orderings, consult Marshall and Olkin (1979), Karlin and Rinott (1981), Eaton (1982), and references cited in these works.
EXERCISES 2.2.1
Generalize 2.1(2) to I eθ#x - θo"x
I
b.
Thus (iΐx ^ (2)
) (
e
θ
χ
-eθo
χ
||Θ
-
(θ - θ0)
-
θ
xeVx)
oll
Use this to prove 2. 2(1) by induction on I. 2.3. 1
Consider a one-dimensional standard exponent!
K c [ 0 , oo).
Show that
(1)
(E0[(l - a)X])2
X >EO[(1 - 2a) ],
0 < a <
and VarQX < oo imply (2)
E 0 (X)
>:
VarQ X
.
[Let e θ = (1 - a) and show by d i f f e r e n t i a t i n g at θ = 0" that (1) implies ψ'(0~) > ψ"(θ").
The finiteness of Var X guarantees that
ψ"(0~) = VargX < «>, e t c . , S. Zamir (personal communication).]
( I t is not known
i f (1) implies (2) without the assumption that VarQX < 00.) 2.4.1
Canonical one-parameter exponential families for which Var Q (X) is
u
a quadratic function of E Q (X) are called quadratic variance function families (= QVF). See Morris (1982, 1983). Verify that the following six families have the QVF property: (1) (2)
N(μ, σ 2 )
μ known
P(λ)
(3)
r(α, σ)
α known
(4)
Bin (r, p)
(5)
Neg. Bin. (r, p)
r known r known
(6) v has density f(x) = (2 cosh(^))" 1 , -00 < x < «> , relative to Lebesgue measure. (X = π log(Y/(l - Y)) where Y ~ Beta [h, h).)
ANALYTIC PROPERTIES [ I n (6) ψ(θ) = - log(cos θ ) . distribution.
61
This is called the hyperbolic secant
The generalized hyperbolic secant distributions are produced
from these by i n f i n i t e d i v i s i b i l i t y and convolution. only QVF families (Morris, 1982). 2.5.1
These families are the
See also Bar-Lev and Enis ( 1 9 8 5 ) . ]
Let {p } be a canonical one-dimensional exponential family.
Then N° = ( θ ^ θ 2 ) , -oo £ ξ
< ξ
£ co.
ξ(W°) = ( ξ ^ ξ 2 ) for some -°° < θ χ < θ 2 < «, and i f K = [x-,°o) then ξ. = x..
(Theorem 3.6 is a m u l t i v a r i a t e
generalization of this r e s u l t . ) 2.10.1
Let {p θ > be a two-dimensional canonical exponential family.
Find
a convex subset of W such that h bounded and E Q (h) = 0 for a l l θ € 3W implies h = 0
(Hence, the family {p Λ : θ € dhl}
Let π.(θ) = E A ( φ ) . Φ θ
Θ 6 0 , , implies π segments.
is "boundedly
Conclude that e\/ery test of ΘQ versus Θ, = hi - ΘQ is "admissible".
complete".) (i.e.
a.e.(v).
Then π. (θ) <. π ( θ ) , θ € Θ Π , and π. (θ) ^ π ( Θ ) , ψ^ — Φ2 u φ^ — Φ2
(θ) Ξ π
( θ ) .)
[3W contains an i n f i n i t e number of l i n e
See F a r r e l l ( 1 9 6 8 ) . ]
Similar Tests and Unbiased Tests 2.12.0
Let ΘΊ c Θ,
i=0,l.
A c r i t i c a l t e s t function φ,
called level α unbiased i f E Q (φ) < α, called similar
(level α) i f EQ(Φ)
Ξ
ot,
0 £ φ <_ 1 , is
θ € Θ n , and E Q (φ) > α , θ € Θ Ί . θ 6 0 Q n 0 . n W.
I t is
The following
problems consider the common case where ΘQ U Θ, = hi so that 8ΘQ n hi = 0 Q n § 1 n hi.
Exercises 2.21.3, 2.21.4 and 2.21.5 contain further applications
of these concepts.
See also 7 . 1 2 . 1 .
2.12.1
Let {p } be a regular canonical family and l e t θ 1 = ( θ | i ) »
X1 = ( X | . j ,
x
(2))
essential here.) (i) (1)
be
p a r t i t i o n e d vectors.
Let L = {θ : θ / ^ = 0} .
θ
(2)^9
(Regularity is convenient but not Assume L Π W° f φ.
Show that a c r i t i c a l function φ is similar on L i f and only i f α
= /Φ(x)v(dxj1j|X(2)^
a
'e
(v)
(Tests with property (1) are said to have Neyman structure. Note that the
right side of (1) is E Θ (Φ|X( 2 )) f o r (ii) (2)
v
θ
€ L
)
Show that φ is similar on L and satisfies VE Λ (φ) = 0
V v € L
1
θ = 0
= { v : v
VΘ6L}
Ό
if and only if φ satisfies (1) and (3)
/ x^jjφίxjvίdx^jlx^j) = 0
a.e. (v) .
(Note that (2) is a necessary condition for a test of HQ: θ € L versus Hy θ f L to be unbiased. for all θ € L,
v € L ,
(3) expresses the fact that v x^x.
=
°
See Lehmann (1959) for many applications of
(1) and (3) to the construction of U.M.P.U. 2.12.2
VE (Φ|X/2^)
( i ) Let X - N(θ, I) in Rk,
tests.)
k >_ 2.
Show there does not exist a
non-constant level α similar test of ΘQ = {θ : θ. <_ 0 for some i } . [Use Example 2.10.] (ii)
Show there exists a non-constant similar test of
ΘQ = {θ : θ = 0 for some i } , but there does not exist a non-constant unbiased test of this hypothesis. 2.12.3
Let X <Ξ Rk,
X. ~ P(λ.), independent.
Show there exists a non-
t r i v i a l similar test of {λ : λ. < 1 V i } but there does not exist a non-trivial unbiased test of this hypothesis. 2.13.1
Let X = (X..) be a matrix Γ(α, I) variable.
(See Exercise 1.14.4.) m Observe that log |X| has the same Laplace transform as Σ log Y. where Y. J
are independent Γ(α - (i-l)/2, 1) variables. Hence |X| has the same distrim bution as Π Y.. Reinterpret this result to show equality of the distribution i = lΊ of the determinant of a Wishart (n, I) matrix and a product of independent χ 2 -variables. 2.13.2 V F Let F, G be two distributions on R . Let μ^ Ί f i
k Ί
'i . = E( Π X. J ) and ϊΊ k j=l J
p
similarly for y .
(1)
Suppose
,
\ζ
= μ?
,
i, = 0,1,...
j = l,...,k ,
and
(2) w h e r e
lim sup £n/f m
2
j,2n = E d X j l " ) .
n
< »
,
j = i,...,k
(Note,nj)2n = μ 0 ) _ ) 2 n ) _
^
.)
Then F = G.
(Condition (2) is slightly weaker than the necessary and sufficient condition, ( \ Σ m*l/
(3)
J
n=l
'^
= - ,
j = l,...,k
,
n
for (1) to imply equality of F and G.
See Feller (1966, Sections XV4 and
VI13) and references cited therein.) [Use Stirling's formula to show that
θ n /n! converges absolutely
Σm. J )Π
for |θ| < ε , j = l,...,k, and hence that λp = λp on an open set in R .] 2.14.1 (Bar-Lev (1983).) Let X - Expf (θ) with Θ° f φ. Let t ( X ( 2 ) I X M ) ) d e n o t e
the
indicated
conditional covariance matrix. Show that 2 θ (X/o)l x /i\) depends only on θ if and only if X / ^ 1 (X/2x - hίX/^)) for some (measurable) function h. [Integrate Z θ ( χ (2)l x (i)) o n
θ startin
9 at 0 e Θ° to find that the
conditional cumulant generating function of X/p\ under P Q is (2)
Ψ(θ|x,,\) = p(θ/ 2 \) + θ/ 2 j
"(X/!))
^o^ some functions p , h .
Show that (2) implies X/ 2 x - n ( χ M ) ) J- x (j) under P Q .] 2.14.2 Suppose X - Expf ( θ ) with Θ° f φ.
Then the following are
equivalent: (1)
X, x l
(2)
Ψ(θ(1), θ(2j)
(3)
X ( i ) ~ Expf ( θ ( i ) )
(4)
cov θ
X(2x
(χ(i)'
for some θ° € Θ, or for a l l
X
= Ψ1(θ(1))
(2)}
=
+
Ψ2(θ(2))
f o r
for i = 1 and 2 ,
°
V θ
e θ
θ e Θ, s o m e
f u n c t i o n s
Ψi
a n d
Ψ
[For (1) - (3) apply Theorem 2.14 with h H 0 and check Ψi = p i ,
1=1,2.
For (4) =* (2) use 2.3(2) and i n t e g r a t e . ] 2.14.3
( P a t i l (1965), Barndorff-Nielsen
and Blaesild
(1983).)
Let P = {P θ : θ € 0} be a family of d i s t r i b u t i o n s on V, 8. k
k
(measurable), 0 <= R with 0° t φ.
X : V -> R
In E Q exp((3 - θ) f o r some function p( ).
X(Y))
=
Suppose
p(β) - p ( θ ) ,
Then X ~ Expf ( θ ) .
Let
β,θ e 0
[Use Corollary 2.13.]
2.14.4 L e t X have a k-dimensional m u l t i n o m i a l ( N , π) d i s t r i b u t i o n . Write X|jj
= ( X ^ . . . ^
)',
X j 2 j = (Xfe
+ 1
,...,Xk)\
Show t h a t the marginal
d i s t r i b u t i o n s o f both X,-x
and X/^x
f o r m an e x p o n e n t i a l f a m i l y , b u t
i s not independent o f X/ ? \
as one m i g h t e x p e c t f r o m Theorem 2 . 1 4 ( 2 ) .
X/jx Why not?
[The f a c t t h a t X i s n o t a minimal f a m i l y i s i r r e l e v a n t ; f o r k >_ 3, k.. < k-2 the
same phenomenon occurs i n t h e minimal model d e f i n e d as i n
1.2(7).]
2.15.1
Let the independent symmetric mxm matrices, X., i=l,...,n, have matrix r(α., t) distributions. 1
Z = Z 1 9 ...,Z n 1
(See Exercise 1.14.4). Show that n n with Z. = IX.J I/I Σ xΊ I is independent of Σ x. . Show that the J i=i i=l 1
distributions of In Z = {In Z. : j = l , . . . , n } form an exponential family, and identify the canonical s t a t i s t i c and parameter for this distribution. generalizes Example 2.15(ii). multivariate beta distribution.
(This
The distributions of Z form the so-called See, e.g., Muirhead (1982).
When m = 1
the X. have ordinary Γ distributions and the distribution of Z is a Dirichlet distribution.
See Exercise 5.6.2.
2.16.1
Suppose v -»• v with v(R ) < <». Then (1)
lim sup v n ( R k )
i f and o n l y i f the sequence { v n } i s t i g h t .
< v(R k ) [ L e t c(x) = i i f rΊ
<_ ||x|| <_
and choose r. 3 sup v ({||x|| >.r.}) <_ 1/i , 1=1,2,... .] Hence a convergent n sequence of probability measures has a probability measure as its limit if and only if it is tight. 2.17.1 Verify 2.17(4),(5). (1)
[From Lemma 2.1(1)
I |/ xe bχ v^dx)! I < Σ II
^
ί
^
^
and the quantity in braces in (1) is 0(1/(1 + llxll)). Now use 2.16(1).] 2.17.2 Let
S c R
k
k
measures on R 0 € B°.
and B = conhull S.
Let v
be a bounded sequence of
( v n ( R ) < K. < » ) with λ^ ( b ) < »,
Define P p
b
b € s, n = l , . . .
Suppose
by dPn
.
- g j ^ = exp(b x - ψv (b))
(1)
.
.
n n Suppose f o r each b € S there is a K = K(b) such t h a t (2)
l i m sup P
.({llxll
< K})
>
0
.
Then there i s a subsequence { n 1 } c {n} and a non-zero l i m i t i n g measure v such t h a t f o r a l l b € B° eb#xv.(dx) n
(3)
[As
-
e b # x v(dx)
,
λ
(b) vn,
-> λ ( b ) v
i n the proof of Theorem 2.17 i t s u f f i c e s to consider the case
where S is f i n i t e .
Then K = maχ{K(b) : b € S} < ~ .
I f b € S,
then 0 < ε < / e b " x v . (dx)/λ (b) < K Ί e K o K / λ (b). n V V " ||χll
s a t i s f i e d on S n {b : I Ibl I < K Q } .
I Ibl I < KQ
Hence 2 . 1 7 ( 1 )
v f 0 since 0 € B° ]
2.18.1 Let {v on X = { 0 , 1 , . . .
}.
: α e A}, Show t h a t v^
n=l,2,...
be a f a m i l y of sequences of measures
•+ v^ uniformly i n α i f and only
if
v ({x}) -• v ({x}) uniformly in α for each x € X. 2.19.1 Let {puQ> be an exponential family with supp v c {0,1,...} and v(0) > 0, v(l) > 0. Let X-,»...,X be a random sample and, as usual, let n S nn = .Σ X.. Define θn (λ) by =1 i (1)
ξ(θ n (λ)) = λ/n
Let FΛΛ , ndenote the distribution of Snnunder the parameter θ n(λ). Show that F i n - * p U )a n d t h ^ convergence is uniform in λ over λ e [a,b] for Λ ,Π
0 < a < b < oo. [0,
b].)
(A s l i g h t elaboration of the argument y i e l d s uniformity over
Generalize t h i s r e s u l t to the case where pQ is a k-dimensional
exponential f a m i l y .
[Show Ψ"(θ (λ)) -+0 as n -> «> since θ n ( λ ) -> -«>, uniformly nc
p:>n
n
= λ ( e p - 1) + o ( l ) as n -> «> uniformly
for
λ € [a, b].
Hence log EQ / , ^ e
for
λ € [a, b].
Then apply Theorem 2.19.
case the l i m i t d i s t r i b u t i o n
In the non-degenerate
is the product of independent Poisson
k-dimensional variables.]
(A special case of the above is the well known r e s u l t Bin ( n , λ/n) -> P(λ). The general form of the above statement was pointed out to me by I . Johnstone.) 2.21.1 Let X be non-central χ 2 with m degrees of freedom and noncentral i t y parameter θ .
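The special case Bin(n, λ/n) → P(λ) mentioned above is easy to examine numerically. The sketch below is illustrative only (it assumes Python with NumPy and SciPy) and tracks an approximate total variation distance as n grows.

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 2.0
for n in (10, 100, 1000):
    x = np.arange(0, n + 1)
    # Total variation distance between Bin(n, lam/n) and P(lam), restricted to 0..n
    # (the neglected Poisson tail beyond n is negligible here).
    tv = 0.5 * np.sum(np.abs(binom.pmf(x, n, lam / n) - poisson.pmf(x, lam)))
    print(n, tv)    # decreases roughly like 1/n
```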
Show that the d i s t r i b u t i o n s of X have the sign-change
preserving properties 2.21(2), ( 3 ) .
[Use Exercise 1.12.1(1).
Write
E θ (h(X)) = E θ (E(h(X)|K)) . ] 2.21.2
Let X be a one-dimensional exponential family and ΘQ € N°. (i) Show that the (essentially unique) level α test of the form
(1)
1 φ(x) = γ 0
x >x Q x =x 0 x <x Q
is the U.M.P. level α test of H Q : θ <_ Θ Q versus H*: θ > Θ Q . (ii) Similarly, show that the (essentially unique) level α test of the form
(2)
Φ(χ)
=
1
X
>
x2
γ
X
=
x
0
i
X
x < xχ
i
< X
l
or < x2
satisfying (3)
E
is the U.M.P.U. level
θ
(χφ(χ))
=
0
0
test of H Q : θ = Θ Q versus Hy θ f Θ Q .
[(i) Let Φ 1 be any different level α test. Then S"(φ - φ 1 ) = 1. E Ω (φ - φ 1 ) = 0 by definition. Now use Theorem 2.18. (ii) Condition (3) is ϋ
o
the one-dimensional version of 2.12.1(3).
Again use Theorem 2.18.]
( I t is
also possible to show by a c o n t i n u i t y argument that level α tests of the form (1) and ( 2 ) , (3) always e x i s t . )
2.21.3
Consider a 2χ2 contingency table. (See Exercise 1.8.1.) Describe the general form of the U.M.P.U. level α tests of the following null hypotheses.
In each case the alternative is the complement of H Q . (i) H Q : P n P 2 2 / P 1 2 P 2 1 (ϋ)
Ho: P 1 1 P 2 2 / P 1 2 P 2 i
(111) H Q : p
π
ί
l
= 1
< p12
(iv) H Q : p 1 2 = p 2 1
.
(This corresponds to the exact form of McNemar's test. See, e.g. Fleiss (1981).)
[Use Exercise 2.21.2 and, for (i), (ii), Exercise 1.15.1. See
Lehmann (1959).] 2.21.4 Consider a 2χ2 contingency table.
Let c > 0,
exist non-trivial similar tests of the null hypothesis
c f 1. Show there
68
H
O
STATISTICAL EXPONENTIAL FAMILIES
:
Pll^Pll
+
Pi2^
=
C
P21^P21
+
^22^ °^
conc
*"""tional p r o b a b i l i t i e s
given p r o p o r t i o n , even though t h i s i s not a l o g - l i n e a r hypothesis. randomized t e s t s . under which Y*,
[Use
Consider the conditional d i s t r i b u t i o n given Y. + ,
and Yp . are independent binomials.
on i t s own m e r i t s . )
in a
Consider the special case Y, +
i = l,2
(This case is of i n t e r e s t = 1 = Y~+
f o r which the
condition f o r s i m i l a r i t y reduces to four l i n e a r equations i n the four variables φ(y) f o r the four c o n d i t i o n a l l y possible outcomes, y .
This t e s t is
unbiased f o r the one-sided version of H Q , but not f o r HQ as defined above. Is t h e r e , i n general, an unbiased t e s t of HQ?
Is t h e r e , i n general, a U.M.P.U.
t e s t of e i t h e r the one- or two-sided hypothesis i n e i t h e r the o r i g i n a l model or the conditional (independent binomial) model?
The somewhat
analogous question of the existence of s i m i l a r and of unbiased tests f o r the Behrens-Fisher problem of equality of means f o r two normal samples with unknown variances is solved i n Wijsman (1958) and i n Linnik (1968).] 2.21.5
Let X 1$ ...,X be a sequence of independent failure times, assumed to have a Γ(α, σ) distribution. Describe the U.M.P.U. tests of H Q : α = 1 versus H,: α > 1 and H': α f 1. [Use Exercise 2.21.2 and Example 2.15.] 2.25.1 Suppose v has density f with respect to Lebesgue measure on R and f is MTP 2 (i.e. has monotone likelihood ratio) in each pair of coordinates. Prove the conclusions of Lemma 2.24 and Theorem 2.25. Prove these also for the case where f, as above, is a density with respect to counting measure on the lattice of points with integer coordinates. [If h(xj,... ,x k ) is nonΊ s
decreasing then, under v, E(h(Xχ,... >\_y \)\ \ = \ ) ' also nondecreasing.] 2.25.2 Let {p Q } be a canonical
k-parameter exponential family with
ANALYTIC PROPERTIES ΘQ G W°.
Let H Q : θ <_ ΘQ and Hy
θ > ΘQ.
69
( i ) Show t h a t any Bayes or
generalized Bayes t e s t , α, of H Q versus Hj has the strong monotonocity property Φ(x)
> 0
y > x
=> φ(y)
=
1
Φ(x)
< 1
y < x
=> Φ(y)
=
0
(1)
Assume ΘQ = 0 and consider V / p ^ x H G ^ d θ )
- GQ(dθ)]
( g e n e r a l i z e d ) p r i o r measure r e s t r i c t e d to H . ] measure v is MTP2
where G i denotes the
( i i ) Suppose the dominating
Show t h a t any ( g e n e r a l i z e d ) Bayes t e s t i s unbiased.
[Use the above and Exercise 2 . 2 5 . 1 . ]
2.25.3
(Slepian's
Inequality)
Let X, Y be k-dimensional
normal v a r i a b l e s w i t h mean 0 and non-
s i n g u l a r covariance matrices A, B, r e s p e c t i v e l y .
Suppose
Then, f o r any C e R k ,
(1)
Pr{X < C }
>
Pr{Y <. C}
.
[ I f Z ( p ) ~ N ( 0 , A + p(B - A ) ) then
aXp( 8p
(2)
z ( p )
-< C)
=
Σ i W
πj ~3 -θ iPj ( Z
( p )
Q i i
< C)
where each α. . >_Q. Note that for i ^ j (3) by 2 . 4 ( 2 ) .
τ j £ - = θ., exp(-ln|*|/2)
= θ,, λ
Hence 92pfl(Z)
(4)
— 9 θ
from C o r o l l a r y 2 . 1 3 .
= ij
θ. . p (Z) ΊJ
θ
^ i
"""j
Combine ( 2 ) and ( 4 ) to y i e l d ( 1 ) . ]
proof of Slepian's i n e q u a l i t y see Saw ( 1 9 7 7 ) .
(For an a l t e r n a t e
For g e n e r a l i z a t i o n s see Joag-
Dev, Perlman, and P i t t (1983) and Brown and R i n o t t
(1986).)
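A quick numerical illustration of Slepian's inequality in Exercise 2.25.3 is sketched below; it is an editorial illustration only (it assumes Python with SciPy's multivariate normal CDF, and the particular matrices A, B and point C are assumptions of the sketch).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Slepian's inequality: A_ij >= B_ij entrywise with equal diagonals implies
# Pr{X <= C} >= Pr{Y <= C} for X ~ N(0, A), Y ~ N(0, B).
A = np.array([[1.0, 0.6], [0.6, 1.0]])
B = np.array([[1.0, 0.1], [0.1, 1.0]])
C = np.array([0.5, 0.3])

p_A = multivariate_normal(mean=[0, 0], cov=A).cdf(C)
p_B = multivariate_normal(mean=[0, 0], cov=B).cdf(C)
print(p_A, p_B, p_A >= p_B)   # True: larger correlation gives the larger orthant probability
```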
CHAPTER 3. PARAMETRIZATIONS
In regular exponential families maximum likelihood estimation is closely related to the so-called mean value parametrization. This parametrization will be described after some brief preliminaries. The relation to maximum likelihood is pursued in Chapter 5. 3.1
Notation For v ε R , α € R let H(v, α) denote the hyperplane H(v, α) = {x € Rk : v
x = α}
Let H (a, α) and H~(a, α) be the open half spaces H + (v, α) = ίx ε R k : v
x > α}
H"(v, α) = {x ε R k : v
x < α}
When (v, α) are clear from the context they will be omitted from the notation Note that the closure of H~ is written H" and, of course, satisfies ΪΓ= H U H 1 . STEEP FAMILIES Most exponential families occurring in practice are regular (i.e. W is open). However, for technical reasons which will become clear in Chapter 6, it is \/ery useful to prove the parametrization Theorem 3.6 for steep families as well. 70
71
PARAMETRIZATIONS 3.2
Definition L e t φ: R
+ (-°°,
« ] be convex.
Assume Φ i s c o n t i n u o u s l y d i f f e r e n t i a t e on N°.
Let
W = {θ € R k : φ ( θ ) < °°}
Let θ . e W -
Λ/°f θ o e W °
,
and l e t θ
= ΘQ + ρ ( θ 1 - Θ Q ) , 0 < p < 1 , denote p o i n t s on t h e l i n e
joining
ΘQ t o Qy
Then, φ i s c a l l e d steep
,
i f f o r a l l θ j € N - W°, ΘQ € A/° Q
(1)
lim (θj -
Vφ(θ ) = oo
ΘQ)
Note that (1) is the same as (I1)
limf-φ(θ) = Λ 1
dp
p
Figure 3.2(1): An i l l u s t r a t i o n of the definition of steepness
A standard exponential family is called steep i f i t s cumulant generating function, ψ, is steep.
(A steep convex function is sometimes
referred to as an "essentially smooth" convex function.) exponential family is regular then i t is a fortiori
Note that i f the
steep since N - W° = φ
STATISTICAL EXPONENTIAL FAMILIES Here is a convenient necessary and sufficient condition for
steepness. 3.3
Proposition A minimal standard exponential family is steep i f and only i f
(1)
EJllxll)
= oo
for a l l
θ € W - W°
o Proof.
Suppose the family is steep. (θ 1 - Θ Q )
Vψ(θ p )
= (θχ - ΘQ)
This i m p l i e s EQ ( ( θ j - Θ Q )
X) -* °°
P
9
Then ξ(θp) + ° o
as p t 1
which i m p l i e s ( 1 ) .
The converse seems not to be easy to prove without further preparation. We postpone the proof to Chapter 6. It appears after the proof of Lemma 6.8. 3.4
||
Example There is one classic example of a steep non-regular family which
occurs in a variety of applications.
I t is the family of densities defined by
(1)    (2π)^{-1/2} z^{-3/2} exp(θ_1 z + θ_2(1/z) − ψ(θ)) ,    ψ(θ)  =  −((−2θ_1)(−2θ_2))^{1/2} − (1/2) ln(−2θ_2) ,

relative to Lebesgue measure on z ∈ (0, ∞). The canonical statistics are (x_1, x_2) = (z, 1/z) and the natural parameter space is

(2)    N  =  (−∞, 0] × (−∞, 0) .

Thus the family is not regular but is steep since E_{(0,θ_2)}(x_1) = ∞ for all θ_2 ∈ (−∞, 0). These densities are referred to as inverse Gaussian. They arise, for example, as the distribution of the first time (x_1) that a standard Brownian motion crosses the line ℓ(t) = (−2θ_2)^{1/2} − (−2θ_1)^{1/2} t. Note that these densities with θ_1 = 0 are the scale family of stable densities on (0, ∞) with index ½. See Feller (1966). For some other steep non-regular families see Bar-Lev and Enis (1984).
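The normalization in 3.4(1), and the steepness just noted, are easy to check numerically. The sketch below is an editorial illustration (it assumes Python with NumPy and SciPy, and uses the density in the form reconstructed above).

```python
import numpy as np
from scipy.integrate import quad

def invgauss_density(z, th1, th2):
    """Density 3.4(1): theta1 <= 0, theta2 < 0, z > 0."""
    return ((2 * np.pi) ** -0.5 * z ** -1.5
            * np.exp(th1 * z + th2 / z
                     + np.sqrt((-2 * th1) * (-2 * th2))
                     + 0.5 * np.log(-2 * th2)))

th1, th2 = -0.5, -2.0
total, _ = quad(invgauss_density, 0, np.inf, args=(th1, th2))
print(total)          # ~1.0: the density is properly normalized

# At theta1 = 0 the density is still proper (the boundary point of N),
# but E(Z) is infinite since z * f(z) ~ z^{-1/2} at infinity: steepness.
total0, _ = quad(invgauss_density, 0, np.inf, args=(0.0, th2))
print(total0)         # ~1.0 as well
```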
MEAN VALUE PARAMETRIZATION We begin with a useful lemma which involves a natural relation between parameter space (Θ) and sample space (X). Similar relations will reoccur several times and we have found it useful to draw pictures to illustrate the geometric relationships involved. Figure 3.5.1, below, is a simple example of such a picture which illustrates the hypotheses of Lemma 3.5.
JN
Figure 3.5.1:
3.5
Illustrating the hypotheses of Lemma 3.5 when k = 2.
Lemma Let v € R k , α € R. Let K <= R k be compact.
v(H (v, α)) > 0.
Then there exists a constant c > 0 such that λ(θ
(1)
+ pv)
> ce pα
(Note that (1) is equivalent (I1)
ψ(θ
Suppose
+ pv)
>^ pα + log c
V θ € K,
p > 0
to V θ € K,
I f θ + pv I A/ then λ(θ + pv) = °° so that (1) is
p _> 0 trivial.)
74
STATISTICAL EXPONENTIAL FAMILIES
Proof. (2)
λ(θ + pv)
= /e(θ+pv)
χ
v(dx)
> e p α Γ e θ χ v(dx) H
> ce p α
where c = inf / e θ # x v(dx) Θ€K H
(3)
> 0
.
((2) shows that i f c = °° here then λ(θ + pv) = °° for a l l θ € K and all p>0.)
|| Note that (3) provides an explicit formula for the constant c
appearing in formula (1).
Exercise 3.5.1 contains a converse to this lemma.
Here is the main result. 3.6 Theorem
Let {p 0 } be a minimal steep standard exponential family. Then ζ(θ) = E Θ (X) defines a homeomorphism of M° and K° continuous, 1-1, and onto.
(i.e., ξ: H° + K° is
Of course, if {p Q } is regular then ξ: U -> K°
since M = M°). Proof.
ξ is continuous on W° by Theorem 2.2 and Corollary 2.3. It is
1-1 by Corollary 2.5.
It remains to prove that ξ(W°) = K°, that is, to show
(1)
x € K° => x e ξ(Λ/)
It suffices to prove (1) for x = 0, for then the desired result for arbitrary x e K° follows upon translating the origin, which is justified by Proposition 1.6. So, assume 0 € K°. Let Sj = ίv e R k : llvll = 1}. Since 0 € K° there is an ε > 0 such that (2) for all v € S..
v(H + (v, ε)) > c > 0 (If not, there would be sequences v. € S 1 with
PARAMETRIZATIONS
v. -> v e Sι
75
and ε^ -> 0 f o r which v ( H + ( v Ί . , ε . ) ) -> 0.
v ( f l + ( v , 0 ) ) = 0 which contradicts 0 € K ° . )
This would imply
Now apply Lemma 3.2
(with
v = θ / | | θ | I and p = I | θ | | ) including the expression 3 . 2 ( 3 ) f o r the constant appearing in the lemma to get
(3)
ψ(θ)
w i t h c as i n ( 2 ) .
>
I I θ l l ε + log c
Thus
(4)
lim
ψ(θ)
= oo
l l θ l IHOO
(See
Exercise 3 . 6 . 2 and Lemma 5 . 3 ( 3 ) f o r restatements of ( 3 ) , ( 4 ) . ) Any lower semi-continuous function (such as ψ) defined on a closed
set
and which also s a t i s f i e s ( 4 ) must assume i t s minimum.
Ψ(ΘΊ.) = i n f ί ψ ( θ ) :
θ e Rk}.
I I Θ . M -> « i s
To see t h i s , l e t
impossible by ( 4 ) .
So, there
i s a convergent subsequence, θ . , -> θ * , and ψ ( θ * ) = i n f ί ψ ( θ ) : θ € R } by lower s e m i - c o n t i n u i t y . )
This minimum is assumed a t a point θ * € W.
Suppose θ * € N - W°. ψ(θ p
Then, f o r some 0 < p1 < 1 ,
,) < ψ ( θ * ) = l i m ψ ( θ n + p ( θ * - θ n ) ) by v i r t u e of 3 . 2 ( 1 ' ) o f the d e f i n i t i o n U U p+1
of steepness.
Hence no θ * G W - W° can be the minimum point f o r ψ.
I t follows
t h a t θ * € W°. Hence ξ(θ*)
=
Vψ(θ*)
=
0
since ψ is differentiate on a neighborhood of θ*. (Here we use Theorem 2.2, Corollary 2.3, and the fact that θ* € W° an open set.) This proves (1) for x = 0 and, as noted, completes the proof of the theorem. 3.7
||
Interpretation Theorem 3.6 shows that a minimal, steep family with parameter
space N° can be parametrized by ξ = ξ(θ), and the range of this parameter is K°. This is the mean value yavametvization.
In this parametrization the
resulting family is an exponential family, but of course is no longer a
standard exponential family (except when ζ( ) is a f f i n e ) . (1)
θ(x)
=
ξ-1(x)
=
(θ :
Write
ξ(θ) = x)
The exponential family parametrized by ξ then has densities Pr(x) = exp(θ(ξ)
x - ψ(θ(ξ))).
For a number of applications t h i s parametri-
zation is more convenient than the "natural" parametrization described by the canonical parameter θ.
I f {p f i } is regular then W = N° and the mean value
parametrization reparametrizes the f u l l
family.
Minimality was used i n Theorem 3.6 only to guarantee that the map i s 1-1.
Even without minimality the map ξ discriminates between d i f f e r e n t
d i s t r i b u t i o n s i n {?'.
θ C N].
Hence one can s t i l l use the mean-value
parametrization to conveniently index {P A : θ e N°}, and the range of the mean
u value parameter is the relative interior of K. (Equivalently, one may reduce to a minimal family by Theorem 1.9 and then apply Theorem 3.3.) If the family is not steep then ξ(W°) c K°. We leave this fact — relatively unimportant for statistical application -- as an exercise. In this case it is even possible to have ξ(W°) not convex. See Exercise 3.7.1 for an example due to Efron (1978). 3.8 Example
(Fisher-VonMises Distribution) For a number of common exponential families the mean value
parametrization is the familiar parametrization, or nearly so. For example, for the Binomial (N, π) family the expectation parameter is Nπ, for the Poisson (λ) family the expectation parameter is λ, and for the exponential distributions (gamma distributions with index α = 1 and unknown scale, σ) the expectation parameter is σ. For the multivariate normal (μ, I) family the 1
expectation parameters are μ and μμ + I (corresponding to the canonical statistics of 1.14). The mean value parameters are not always so convenient. Nevertheless it is necessary to consider this parametrization in order to construct maximum likelihood estimators. See especially Theorem 5.5.
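In practice the mean value map ξ = ∇ψ of Theorem 3.6 is inverted numerically, for instance by Newton's method, since ψ'' = Var_θ(X) > 0 on N°. The sketch below is an editorial illustration (Python with NumPy is assumed; the binomial (N, π) family with ψ(θ) = N log(1 + e^θ) is used as the example, not taken from the text).

```python
import numpy as np

N = 10.0
xi  = lambda th: N * np.exp(th) / (1 + np.exp(th))          # xi(theta) = psi'(theta)
dxi = lambda th: N * np.exp(th) / (1 + np.exp(th)) ** 2     # psi''(theta) = Var_theta(X) > 0

def theta_of(x, tol=1e-12):
    """Solve xi(theta) = x for x in K° = (0, N) by Newton's method."""
    th = 0.0
    while abs(xi(th) - x) > tol:
        th -= (xi(th) - x) / dxi(th)
    return th

x = 3.7
print(theta_of(x), np.log(x / (N - x)))   # agrees with the closed form log(x/(N-x))
```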
Accordingly, we now discuss the mean value parametrization f o r the FisherVonMises d i s t r i b u t i o n . Let v be uniform measure on the sphere of radius one i n R . Consider the exponential family generated by v. VonMises family.
When k = 3 i t is the Fisher
When k = 2 t h i s i s the
family
of d i s t r i b u t i o n s .
These
d i s t r i b u t i o n s appear o f t e n i n a p p l i c a t i o n s , w i t h a v a r i e t y o f parametrizations, to model angular data i n R .
Consult Mardia (1972) f o r an extended treatment
o f these f a m i l i e s ; see also Beran (1979).
(Frequently one considers a sample
of n observations from one of these d i s t r i b u t i o n s .
The sample mean, X , is
then also said to have a VonMises or Fisher d i s t r i b u t i o n . parametrization f o r the family of d i s t r i b u t i o n s of X to t h a t below since E Q ( X j = E Q (X). u n u
The mean value
i s , of course, i d e n t i c a l
See also 5 . 5 ( 3 ) . )
The Laplace transform of v i s
where I (•) denotes the modified Bessel function of order s. When k is odd these functions have a convenient representation in terms of hyperbolic functions; for example I 1 / 2 (r) = (2/πr) 1 / 2 sinh r
(2)
I 3 / 2 (r) = (2/πr) 1/2 (cosh r - (sinh r)/r) (See, for example, Courant and H u b e r t (1953).)
These functions also have
nice recurrence r e l a t i o n s ; i n p a r t i c u l a r (3)
I^(r)
=
I$+1(r) + sls(r)/r ,
s >_ 0,
r>0
By symmetry, or by calculation, it follows that ξ(θ) lies in the same direction as θ, that is (4)
ξ(θ)/||ξ(θ)||
=
θ/||θ|| ,
θ t 0,
and
ξ(0) = 0
78
STATISTICAL EXPONENTIAL FAMILIES
It remains therefore to give a formula for ||ξ(θ)||. For this purpose it suffices to consider the case where θ = (r,0,...,0), and to calculate —: In λ v (θ r ).
For the Fisher distribution (k = 3) one gets from (1) - (3)
that ||ξ(θ)|| = coth ||θ|| - M θ l Γ 1
(5)
For the Von Mises distribution (k = 2) one gets only the less convenient expression (6)
l|ξ(θ)|| = iχ(I |Θ||)/io(||θ|I)
.
Although (6) is less convenient that (5), it can be used in conjunction with series expansions or tables of the modified Bessel function to provide numerical values for ||ξ(θ)||, and other information about ||ξ(θ)||.
MIXED PARAMETRIZATION We refer to the type of situation discussed in 1.7. a partitioned kxk non-singular matrix with M,M* = 0. M.x
= z.
i = 1, 2
(MT)'Θ
= φi
i = 1, 2
M M = (MM ) is 2
Write
(1)
Φi
-i
(Thus (. ) = (M )'θ .) Where convenient we write φ. = φ. (θ) to emphasize the dependence on θ, etc.) Note that (2)
M.ξ(θ) = E ^ M ^ ) = E Θ (Z.) = ζ.(θ)
(say)
i = 1, 2
.
Recall also that one may without loss of generality visualize only the case where M = I. In this case φ] = (θj. . .θ ), z' = (x + 1 , . . . f x k ) , ζ
2 = ^nH-l-
^
etc
PARAMETRIZATIONS
79
The following result is valid for steep families but for simplicity we state and prove it here only for regular families. See Exercise 3.9.1. 3.9 Theorem Let {p Q } be minimal and regular. Then the map /ζΊ(θ)\ e - (1 )
(3)
is 1 - 1 and continuous on N° (=W) with range ζ^W 0 ) x Φ2(W°)
(4)
Proof.
= K°{1) x φ2(W°)
.
Fix φ 2 € Φ 2 (W) and refer to Theorem 1.7. The distributions of
Zj given Φ 2 (θ) = Φ 2 form the minimal regular standard exponential family generated by v0 . According to Theorem 3.6 this family can be parametrized (in a 1 - 1 manner) by ζ Ί (θ) = E Q (Z n ). The range of this map is int (conhull (supp v 0 )) = K° (say) . *2 Φ2 The formula for v is given in 1.7(5), but all that needs to be noted is that Φ K T h e m a2 p i n l° =Kll) ( 3 )i s t h e r e f o r e ι - 1 with range as in (4). Continuity of the map in (3) is immediate from continuity of ζ. 3.10
||
Interpretation The above theorem has an interpretation like that of Theorem 3.6.
Any minimal regular exponential family can be parametrized by parameters of the form 3.9(3), above. This parametrization is called the mixed parametrization. Consider a mixed parametrization with parameter (. ), as above. ζj
Q
Then the family of densities corresponding to the parameters {( ) : Φ 2 = Φ 2 )
80
STATISTICAL EXPONENTIAL FAMILIES
forms a f u l l standard exponential family of order m.
(See Theorem 1.7.)
However, i f one fixes the expectation coordinate and looks at the family ζ
l corresponding to the parameters { ( . ) : ζ
0 = ζ-} then one gets i n general
only some non-full standard family of dimension and order k, whose parameter space is a (k - m) dimensional manifold in hi.
Here is an example.
Consider the parametrization of the three dimensional multinomial (N, π) family discussed following 1.8(6).
A mixed parametrization for t h i s
family involves
4 ζ
'
Z
2π
4
+
and Φ3
=
(h) log
Note that the range of (_ ) is ζ
2 2N}
h <
independent of the value of φ~ € (-«>, °°), as claimed by Theorem 3.9. Φo
=
For fixed
Z 0 l Φo the distributions of ( 7 ) form a 2 dimensional exponential family
(of order 1) having expectation parameter ( r ). ζ
(In the genetic interpretation
2
for this parametrization the parameter Φ^ measures the strength of selection in favor of the heterozygote character Gg.) On the other hand the family of distributions corresponding to fixed ( ζ ) is not so convenient. I t is the non-linear subfamily of the usual 2 f u l l standard family described by (1)
Θ = {θ :
2e
θl
+e
02
=
(ζ^^Σe^h
( I f one reduces the usual standard exponential family to a minimal family of
PARAMETRIZATIONS
81
dimension 2, then the parameter set becomes a smooth one-dimensional curve 2 w i t h i n R . This provides an example cof a curved exponential f a m i l y , as defined below.
See Exercise 3.11.2.)
DIFFERENTIABLE SUBFAMILIES 3.11
Description A differentiate
subfamily
i s a standard exponential family w i t h
parameter space Θ an m-dimensional d i f f e r e n t i a t e manifold i n N.
An
e s p e c i a l l y convenient s i t u a t i o n occurs when Θ i s a one-dimensional manifold -i . e . a d i f f e r e n t i a t e curve.
Such a family i s c a l l e d a curved
family.
i t i s often convenient to assume t h a t the
(A technical p o i n t :
exponential
parameter space i s smoother than being merely d i f f e r e n t ! a b l e -- f o r example, to assume i t possesses second d e r i v a t i v e s .
such an assumption implicit
Whenever convenient we consider
in the definition of a d i f f e r e n t i a t e subfamily,
writing formulae for relevant second or higher derivatives (as in (3) below) carries with i t the assumption that these derivatives exist.) In a d i f f e r e n t i a t e subfamily the parameter space can be written locally as {θ(t) : t e N} where N is a neighborhood in Rm and θ( ) is differentiable and one to one.
Properties of such a family around some
ΘQ € Θ can often be most conveniently studied after invoking Proposition 1.6 to rewrite the family in a more convenient form.
For example in a curved
exponential family m = 1 and the proper choice of ΦQ, zQ and M in that proposition transforms the problem into one in which θQ (1)
ξ(θ 0 )
= 0 = θ(t Q ) = Eθo(X) = 0 Z(θ n )
=
I
STATISTICAL EXPONENTIAL FAMILIES
82
Γ • ϊt
θ (
V
•
(2) a2b a 2 /p 0
θ(t Q ) =
0 (The value p = «> is possible.) Furthermore, one can linearly reparametrize the curve so that Θ Q = θ(0) (i.e. so that t Q = 0) and so that a = 1 and (2) becomes
(3)
θ(0)
=
1/p 0
ό In this form p is the radius of curvature of the curve θ(t) at t = 0. The value of 1/p is sometimes referred to as the statistical curvature of the family at Θ Q . Its magnitude is uniquely determined by the above reduction process. Alternately, in an arbitrary curved exponential family it has the formula
= (Bit
(4) where
A ' ΫfV
with θ = θ ( t Q ) ,
θ = θ(tQ),
Remark on Notation.
% = 2(θ(tQ)).
See Efron (1975).
The general functional notation θ( ) was introduced i n
3.7(1) as θ(x) = ξ" ( x ) .
We w i l l continue to use t h i s general notation i n
PARAMETRIZATIONS
83
contexts not involving s p e c i f i c d i f f e r e n t i a t e subfamilies.
In contexts
involving d i f f e r e n t ! a b l e subfamilies the notation θ( ) w i l l usually r e f e r to a ( l o c a l ) parametrization of the subfamily; i f so, t h i s f a c t w i l l be e x p l i c i t l y noted.
Although t h i s means that the \/ery convenient notation θ( ) can hence-
f o r t h have e i t h e r of two meanings we hope there w i l l be no confusion -simply remember that θ( ) is defined by 3.7(1) except where e x p l i c i t l y stated otherwise. 3.12
Example
Let Z have exponential density, " M z ) = e"z χ/ Q ^ ( z h relative to Lebesgue measure. Let T > 0 be a fixed constant. Let Y be the p
truncated variable Y = min (Z, T) and X(y) € R be (y, 0)
if
y
(y, l)
if
y =T
x(y) =
For λ € (0, o°) the distribution of X form a standard curved exponential family. The dominating measure v is composed of linear Lebesgue measure on the line ((0, T) x 0) plus a point mass on (T, 1). The parameter space for this family is (1)
0 = {θ € R 2 : θ χ = -λ, θ 2 = -In λ, λ € (0, °°)}
and (2)
ψ(θ) = log [^- ( e θ l T - 1) + e θ l
+ θ z
]
(The natural parameter space is R , since v has bounded support.) Figure 1 displays both Θ and K on a single plot.
84
STATISTICAL EXPONENTIAL FAMILIES
Figure 3,12(1): Θ and K for Example 3.12.
We return to this example in Chapter 5.
PARAMETRIZATIONS
85
EXERCISES 3.4.1 Let X ^ X 2 ,...,X n be a sample from a population with the inverse Gaussian distribution 3.4(1).
(i) Show that S = Σ X. also has an inverse i =l
Ί
2
tS Gaussian d i s t r i b u t i o n w i t h parameters θ , , n θ~ [Examine E(e ) . ] ( i i ) Show t h a t S and ( X T 1 - X " 1 ) a r e i n d e p e n d e n t , [ ( i ) shows t h a t 2 ( S , ^ - ) *υ Expf ( θ j , θ 2 ) . Now use Theorem 2 . 1 4 . ] 3.5.1
(Converse to Lemma 3 . 5 ) Let v € R k ,
α € R.
L e t K cz A/ be compact.
I f v ( H " + ( v , a))
= 0
then Ίim sup λ ( θ + p v ) / e p α θ€K
(1)
Also, i f
=
0
v ( H + ( v , α ) ) = 0 then l i m sup λ ( θ + ρ v ) / e p α Θ€K
(2)
<
(Be c a r e f u l , these r e s u l t s may be f a l s e i f K £ A/.) In particular, (3)
for θ e N
ψ ( θ + pv)
-> -~
as
p
i f and o n l y i f v ( H + ( v , 0 ) ) = 0 . 3.5.2
Let Z e K°. Let ε 1 = inf {| |x - Z| |: x ί K} > 0 . (1)
lim
Θ
Show
Z
- ' ) =
[Translate to the case where Z = 0, using 1.6(3) with φ Q = 0 , Z Q = Z. Then this result is a minor variation of 3.6(3), and could also have been used to establish 3.6(4).]
86
STATISTICAL EXPONENTIAL FAMILIES
3.6.1 Is the following assertion a v a l i d converse to Theorem 3.6: Let {p Q } be a minimal standard exponential family. homeomorphism i f and only i f {p Q } is steep. (?)
Then ξ : A/° -> K° is a
[ I f k = 1 t h i s is easy to
prove.] 3.7.1 D e f i n e the measure v on { ( x , , Xo) xλ = 0 , x 2 > 0 }
- 0 0 < Xi < °° »
x2 = 0 or
by
e
-|t|
v((A,
0)) = J c n ^ — r d t , A u l+tH
A d i - , - ) ,
v((0,
A))
= / e - t dt
A c (0, co)
v((R, 0 ) ) =
1
,
(1)
(i) Show the exponential family generated by v has N = {θ: -1 <_ θ- <_ 1, θ 2 < 1} and is not steep,
(ii) Show that ξ(M° ξ(M°) c K° = {x : x 2 ^ 0} and furthermore
that ξ(A/°) is not even convex.
[Show 1 .
for appropriate c, k.]
_J
2
2
See Efron (1978).
3.9.1 Prove the conclusion of Theorem 3.9 i f { p Λ } is minimal and steep.
u
[In
the proof of Theorem 3.9 l e t Φ2 € N° and show (using D e f i n i t i o n 3.2) that
v
is steep. For ease of proof assume (w.l.o.g.) that M = I.]
of of Theorem 3.9 let Φ 2 € N° and show (using Defini
Φ
3.11.1 Verify the formula 3.11(4) for the s t a t i s t i c a l curvature of a curved exponential
family.
PARAMETRIZATIONS
87
3.11.2
( i ) Verify 3.10(1).
( i i ) Reduce the three-dimensional multi-
nomial family to a two-dimensional minimal family and show that 3.10(1) now corresponds to a curved exponential family.
( i i i ) Fix ζ j and calculate the
statistical curvature of the resulting family as a function of the remaining parameter, φ~.
(iv) For what value(s) of ζ,,
Why? 3.11.3
Consider an m-dimensional d i f f e r e n t i a t e subfamily inside a k parameter exponential family.
Write a canonical form for this family analogous
to that in 4.14(1) - (3). [The case m = 1 required two canonical parameters -b,p
— in 4.14(3).
The general case requires
m + m(m + l)/2 parameters.]
3.12.1
Let {p } be a canonical k parameter exponential family. Ό
Let and {p
inf ίψ(θ):
θ e W} < C < sup ίψ(θ):
l e t 0 = {θ € A/°: ψ(θ) = C } . θ e hi] .
( i ) Show t h a t
θ £ W}
{ p Q : θ e 0 } can be c a l l e d a stratum { p Q : θ € 0 } i s a (k - 1) dimensional 1
θ
( i i ) Let θ
i s (k - 1) x 1 and θ / 2 ) i s l x l .
L e t θ ( t ) be any ( l o c a l ) p a r a m e t r i z a t i o n o f
Θ £ W with t e ί c
Rk"1. 8θ
(i)
ξ(1)(θ(t))
(2)^'
differen-
w n e r e
t i a b l e s u b f a m i l y o f {p : θ € N } .
{pΛ:
= (θn\»
Θ
of
(χ)
Then
m
(t)
-ill—
Bθ?(t) +
ξ(2)(θ
( t ) ) -£-
- o
.
J
(iii) Let θ° e 0 be any point with ζ(2)( θ °) ^ ° a
τhen on a
neighborhood of
θ° in 0 one may write θ / 2 \ s a function of θ / ^ — i.e. Θ / 2 N = θ ( 2 ) ^ θ ( i ) ^ "" and
88
STATISTICAL EXPONENTIAL FAMILIES
3.12.2 Show t h a t the d i s t r i b u t i o n s o f X described below can be represented as s t r a t a of canonical exponential f a m i l i e s (See 3.12.1 f o r d e f i n i t i o n . ) (1)
X ~N(Θ, I) ,
(ii)
||θ||
2
= C .
The d i s t r i b u t i o n s o f X - ( 0 , l ) with X defined in Example 3 . 1 2 .
(iii)
Let Y.., Y p , . . .
be i . i . d .
from a canonical regular
exponential f a m i l y , { p . } . Let N be any Markov stopping time {y:
N(y) = n} i s measurable with respect to Y 1 5 . . . , Y j .
Let
X = (S*., N) = (X/i\» X/o\)»
P.(N < °°) = 1 .
anc
(i.e.
Let S n =
n Σ Y. .
* consider only values of φ such t h a t
[ L e t θ = ( Φ - ψ ( θ ) ) where ψ(φ) i s the cumulant generating
function f o r the o r i g i n a l
family { p ώ ) ]
3.12.3 In 3.12.2 ( i i i ) show that 3.12.1(2) is identical to the following conclusion also derivable from the martingale stopping theorem:
(1)
E(SN)
=
E(Y) E(N) .
((S n - n E(Y) is a martingale and so (1) also follows from the stopping theorem applied to this martingale.) 3.12.4 (i) For the family in 3.12.1(i) show the statistical curvature is the constant l/fC.
(ii) Calculate the statistical curvature for the families
described in Example 3.12 and Exercise 3.12.1(ii). 3.12.5 A Poisson process on [0, 1] with intensity function ρ(t) _> 0 may be characterized by the property that the number of observations in any b interval (a, b) c [ 0 , 1] has P( / ρ ( t ) d t ) distribution, and the number of a observations in disjoint intervals are independent random variables.
Let
PARAMETRIZATIONS
89
T, <...< T γ denote the observations from a Poisson process on [0, 1], Suppose (1)
p(t) =
m Ία. Π p (t) i= l Ί
where p. > 0 are known (measurable) functions on [0, 1] and α. are unknown parameters.
Show that the distributions of (T-,...,T γ , Y) form a differen-
t i a t e subfamily of dimension m in an (m + 1) parameter exponential family. Identify the canonical statistics and observations for this family. family a stratum of the original family?
Is this
[The conditional distribution of
T,,...T γ given Y is that of an ordered sample of Y independent observations m α* from a distribution on [0, 1] with density proportional to Π p. ( t ) . ] i =l Ί 3.12.6 Let Z.. be independent i d e n t i c a l l y d i s t r i b u t e d variables with a power s e r i e s (1)
distribution: P ( Z . j = z)
L e t YQ .. YQ = 1 and d e f i n e Y p . ...
=
C(λ) h ( z ) λ Z ,
i n d u c t i v e l y as
Y. =
inductively as Y. =
z=0,l,.. ,
YΊ - 1 ΊΣ Z^.
.
Y ,
λ > 0.
Y,,...
is
Σ Z^. . YQQ , Y,,... i
J " ~ •!•
called the Galton-Watson process with generating d i s t r i b u t i o n ( 1 ) . 2 <^ n < °°.
Show that the d i s t r i b u t i o n s of YQ,
exponential
family with natural s t a t i s t i c s
Y-,...,Y
n-1 ( Σ Y.J
0
,
form a curved
n Σ Y.) J
0
Fix
and t h i s curved
exponential family is a stratum of the corresponding full exponential family.
CHAPTΈRΊ.
APPLICATIONS
This chapter describes three different general applications of the
theory developed so far.
The f i r s t part of the chapter contains a proof
of the information inequality and a proof based on this inequality of Karl in 1 theorem on admissibility of linear estimators. The second part of the chapter describes Stein's unbiased estimat of the risk and proves the minimaxity of the James-Stein estimator as a specific application of this unbiased estimate. The third part of the chapter describes generalized Bayes estimat and contains two principle theorems describing situations in which a l l admiss ble estimators are generalized Bayes -- or at least have a representation similar to that of a generalized Bayes procedure. deals with two basic situations.
This part of the chapter
The f i r s t is estimation of the natural
parameter under squared error loss, and the second is estimation of the expectation parameter under squared error loss.
The so-called conjugate pric
play a natural role in this second situation. The exercises at the end of the chapter contain a non-systematic selection of some of the specific results derivable from the more general development in the body of the chapter.
INFORMATION INEQUALITY The information inequality -- also known as the Cramer-Rao inequality -- is an easy consequence of Corollary 2.6. The version to be proved below applies to vector-valued as well a real-valued s t a t i s t i c s .
For vector-valued statistics one needs the multi-
90
APPLICATIONS variate Cauchy-Schwarz
91
inequality, as described in the following theorem.
If A,B are symmetric (mxm) matrices, write A :> B to mean that A - B is positive semi-definite. 4.1
Theorem Let Tj, T 2 be, respectively (£χ 1) and (mx
variables on some probability space. B
ll "'
E(τ
B
12 -1
E(τ
B
22
=
Let (λ X
i
*>
(l x m)
T i 2>
= E(T 2 V)
(m x m)
and suppose B,, exists and B 2 2 exists and is non-singular.
(1)
B n > B 1 2 B-J B 2 1
Remarks.
Then
.
If I = m = 1 this is the usual Cauchy-Schwarz inequality:
(2)
E 2 (T χ T 2 )
E(T*)E(T*) >
If B 2 2 is singular the inequality (1) remains true with generalized inverses in place of true inverses. See Exercise 4.1.1. If 4.1(1) is applied to the random vectors Tj - E(T ), T 2 - E(T 2 ) it yields the covariance form of the inequality: (3)
ϊn
Proof.
> %ι2 tZ2
Z21
Consider the ((£ + m) x 1) random vector / 11
<
B
/ Let W =( 0
" B 12 B 22v _Ί B 22
Then
1£\
E(U U') = ( 21
B
22
U = ( τ ) . Then '2
92
STATISTICAL EXPONENTIAL FAMILIES 0 < E(WUU'W') = W E(UU')W
- B12B22B21
/ 11
It follows that 0 <_ B
n
Lά LC LI
-B i 2 B 2 2 B 2 Γ
as desired
\
II
One further preparatory lemma is needed for the form of the information inequality which appears below. 4.2 Proposition Let {p Q } be a standard k-parameter exponential family. Let T be Ό
Q
a statistic taking values in R . Suppose Θ Q € N° and the covariance matrix £ β (T) of T exists at θ n . Then E Q (T) exists on a neighborhood of θ n . 0 g
U
0
U
(θ eW°(11T11) i n the notation of 2.6.) Proof.
For some ε > 0,
I|θ - θ o || < ε/2 .
(1)
| |θ - θ J | < ε
implies θ € N.
Let
Then, by the ordinary Cauchy-Schwarz inequality,
EΘ(||T||) = /||T(x)|| exp(θ =
/||T(x)|| exp((θ -ΘQ)
1
[/||T(x)|| 2 exp(θ 0
/ exp(2(θ - θ 0 )
x - φ(θ))v(dx) x - ψ(θ) + ψ(θ 0 )) exp(θ 0
x - ψ(θ Q ))v(dx)
x - ψ(θ Q )) v(dx)
x - 2ψ(θ) + 2ψ(θo))exp(θo
x - ψ(θQ))v(dx)]h
= Eg2 (I |T(x) I |2)[exp ψ(2(θ - θ Q ) + Θ Q ) - 2ψ(θ) + ψ(θ Q )]^
2 since Eθ (||T(x)|| ) < °° by assumption and since 2(θ - ΘQ) + ΘQ 6 W.
4.3
||
Setting The following version of the information inequality applies to
d i f f e r e n t i a t e exponential subfamilies, as defined at the end of Chapter 3.
APPLICATIONS
93
Let {p 0 : θ € 0} be such a family with Θ m-dimensional. m
Let θQ G 0.
For
k
N a neighborhood in R let θ : N + 0 c R , with θ(ρQ) = ΘQ be a parametrization of 0 in a neighborhood of ΘQ.
By definition Vθ(p) is the mxk matrix with
elements
(1)
^
3J7
The parametrization can always be chosen so that Vθ(p) is of rank m, and we assume this is so. Define the information matrix (2)
J(p Q )
J(p) at p = PQ by
= (Vθ(p o ))(2(θ o )(Vθ(p o ))
I f {p_} is a minimal exponential family then 2(θ n ) is non-singular, and so u
U
J(PQ) is then a positive definite mxm symmetric matrix.
The chain rule and
the basic differentiation formula 2.3(2) yield two alternate expressions for J; namely
(3)
1J (JίPnίίn
°
/3 log p θ , p v(X) d log p θ , Θ = θΛ Efl( ^ ^
The f i r s t expression of (3) i s , of course, the usual definition of J in contexts more general than d i f f e r e n t i a t e subfamilies. I f T is a statistic taking values in R let (4)
e(p)
= e τ (p)
= E θ ( p ) (T)
.
Suppose Θ Q e N°(||T||). Then E Θ (T) and its derivatives exists at Θ Q by Corollary 2.6. The chain rule then yields
94
STATISTICAL EXPONENTIAL FAMILIES
(5)
Ve(p Q ) = (Vθ(p Q ))(vE θo (T)) (The preceding formulation of course includes the case where
{pθ> is a full exponential family. Simply set p = θ so that θ(p) Ξ θ. In that case J(p Q ) = Z(θ Q ) and Ve(p Q ) = V E Q (T) .) 4.4 Theorem
(Information inequality) Let {p.: θ e 0} be a differentiate subfamily of a canonical Ό
exponential family with θ Q = θ(p Q ), as above. Let T be an ^-dimensional statistic. Suppose 2 (T) exists. Then e(ρ) = E , J T ) exists and is differentiable on a neighborhood of ρ Q , and the covariance matrix of T satisfies Z θ (T) > (ve(p 0 ))' J" 1 (p 0 )(ve(p 0 ))
(1) Proof.
θ Q £ W°(||T||) by Proposition 4 . 2 .
.
Now apply the Cauchy-Schwarz
i n e q u a l i t y 4 . 1 ( 1 ) with T, = T - EΩ (T) and 1 θ o (2) Then B n (3)
T2(X) = ^ ( T )
=
V In p θ ( p
}
(X)
=
( V θ ( p 0 ) ) (X - ξ ( θ Q ) )
.
,
B22
=
E(T2 T p
=
(Vθ(p0)) 2(θo)(Vθ(po))'
B12
=
E ( T 1 T£)
=
(Vθ(po))(vE
=
J(pQ) ,
and (4)
by 2 . 6 ( 3 ) and 4 . 3 ( 5 ) .
(T))
=
Ve(pQ)
The Cauchy-Schwarz i n e q u a l i t y says B ^ >_ B 1 2 B 2 2 B 2 1
which i s the same as ( 1 ) .
||
A useful f e a t u r e of the form of Theorem 4.4 is the absence of any r e g u l a r i t y condition on T other than the existence of la
θ
(T).
Many other
o
versions of the information inequality contain further assumptions about T (See e.g. Lehmann (1983, Theorem 7.3).) but these are superfluous here.
APPLICATIONS
95
An information inequality l i k e Theorem 4.4 is needed f o r applications of the following type. 4.5
Application
( K a r l i n ' s Theorem on A d m i s s i b i l i t y of Linear Estimates)
The information inequality can sometimes be used to prove admissibility.
In these situations other, more f l e x i b l e , proofs can also
be used, but the information inequality proof is nevertheless easy and revealing.
The following r e s u l t is due to Karlin (1958).
inequality proof,
The information
due to Ping (1964), is a generalization of the f i r s t
proof of t h i s sort in Hodges and Lehmann (1951).
See Lehmann (1983, p.271) f o r
f u r t h e r references and d e t a i l s of the proof. Theorem.
Let { p Λ } be a f u l l regular one-dimensional exponential family with u
N = (θ, θ),
-°° <_ θ < θ <_ °°.
under squared error loss.
Consider the problem of estimating ξ(θ) = EQ(X)
The r i s k of any (non-randomized) estimator δ i s
thus R(θ, δ) = E θ ((δ(x) - ξ ( θ ) ) 2 ) . (1)
δ
Then the linear estimator
(x)
=
αx + p
α,p
is admissible if 0 < α <_ 1 and if (2)
/ exp(-γθ + λψ(θ)) dθ
diverges at both θ and θ, where γ,λ are defined by α
(3) Proof.
= ΓTT'
β
= Γ^T
We consider here only the case p = 0 = γ . (See Exercise 4.5.1.)
Fix α. Let δ be any estimator with finite risk. Let b(θ) = E θ (δ(X)) - αξ(θ). The information inequality yields
(4)
R ( θ ,δ ) >
[(αζ(θ)
+
b
W y f
+ (ξ(θ)(l - α ) -b(θ))2
ξ'(θ) >
α2ξ'(θ)
+ 2αb'(θ)
+ (ξ(θ)(l
- α) -
b(θ))2
96
STATISTICAL EXPONENTIAL FAMILIES
since ξ(θ) = EΘ(X) and ξ'(θ) = J(θ) = Var θ X . For δ^Q 2
(5)
2
R(θ, δ α > 0 ) = Λ ' ( θ ) + (1 - α ) ξ (θ) .
Hence, if (6)
R(θ, δ) < R(θ, 6 n )
then 2b'(θ) - 2λξ(θ) b(θ) + (1 + λ) b 2 (θ) < 0
(7) Let
K(θ) = e λ ψ ( θ ) b(θ)
.
Then (7) becomes (8)
2K'(Θ) + (1 + λ) K 2 ( θ ) e λ ψ ( θ ) < 0
Now, let ΘQ € (a, b) and make the change of variables t(θ) = Θ J exp(λψ(t))dt. Correspondingly, define k(t) by k(t(θ)) = K(θ), so that (8) becomes (9)
2k'(t) + (1 + λ) k2(t) £ 0
where -°° < t < °° by ( 2 ) .
The only s o l u t i o n of (9) f o r t € (-°°, °°) is k = 0
since i n t e g r a t i o n of (9) shows that for t > t k'^t)
- k"1(t1)
k i s non-increasing and
>. (1 + λ ) ( t - t χ ) / 2
and hence k(tj) < 0 is impossible.
A similar inequality for t < t- shows
that k(t χ ) > 0 is also impossible.
It follows that (6) implies b Ξ 0 , which
in turn implies 6 = 6 Q (a.e.(v)) by completeness. This proves admissibility Ofδ
α,0 It is generally conjectured that the condition 4.5(2) is necessary
APPLICATIONS as well as sufficient for admissibility of 6 o. are known in this connection.
97 However only partial results
See Joshi (1969) and also Exercises 4.5.4,
4.5.5. 4.6 Further Developments It is useful in considering asymptotic theory to have available a few further results concerning the information
inequality.
These results are sketched below; the proofs are left for exercises. These results have nothing to do specifically with exponential families but only require a setting in which the information inequality is valid.
Nevertheless,
for precision assume below the setting of Theorem 4.4, and let S c R m denote a (possibly large) open set on which Σ Θ ( O ) ( T ) exists. For convenience we consider below only estimation of p under the quadratic type loss function (1)
L(p, δ) = (6 - p)' J(p)(ό - p) ,
and under a truncated version of this loss. (See (3) below.) For proof of the following assertions see Exercises 4.6.1 - 4.6.7 and Brown (1986). Let h be an absolutely continuous probability density on S , supported on a compact subset H c S , (2)
/ R(p, δ)h( P )dp
Then the expected risk satisfies
> m -/
Note that the right side of this inequality is independent of 6, and thus provides a lower bound for the Bayes risk under the prior density h. A natural truncation of the loss (1) is the function min(L(p, ό ) , K). Generalizations of the information inequality and of (2), like those to be described below, can be stated for this natural truncation; however the statements and proofs are easier under a different truncation which is equally useful in asymptotics.
This truncation will now be described.
Let K > 0. For v e R define
98
STATISTICAL EXPONENTIAL FAMILIES
vκ
-K = Ivl K
v < -K v £K v >K .
For v € R
d e f i n e v u t o be the v e c t o r w i t h c o o r d i n a t e s
i=l,....k.
Now l e t
i\
(v J . i\
L κ (p, δ) = (6 - p)£ J~l(p)(δ - p ) κ
(3)
= (v. )„
i
l
ι\
,
.
Let R κ denote the risk function corresponding to this truncated loss function. If δ is an estimator of p, let ό
(4)
(K)(χ; P)
=
P
κ
and b
(5)
(κ)^
=
E
θ^6(K)^X' P ) )" P
=
e
( K ) ^ p ^ "p
Let λ,(p) >: ... >_ λ (p) > 0 denote the ordered eigenvalues of J(p). Let α be any number satisfying 0 < α < 1. Then (6)
fl +
2
) R (p, ό )
(1 - α)λ m K 2 ^
^ >
α Tr(J(p)(ve(κ)(p))' J^ίpJίVe^jίp)))
(Note: values of p.
K
+ Tr(J(p)b(κ)(p)b|κ)(p))
Ve/^x exists except possibly f o r a countable number of
At these values i n t e r p r e t the r i g h t side of (6) as i t s lim sup;
or use r i g h t (or l e f t ) p a r t i a l derivatives i n place of Ve/^x, f o r these always e x i s t . ) This i n e q u a l i t y becomes more i n t e r e s t i n g as K gets large r e l a t i v e to
1/λ m , f o r then
α
can be chosen near 1 but so that T\—?ΓΊ72 ^l-α;λ m κ
is small,
The i n e q u a l i t y (6) leads to an i n e q u a l i t y concerning the Bayes r i s k j u s t as the usual information i n e q u a l i t y leads to ( 2 ) .
With h as i n (2)
APPLICATIONS
(7)
(l + ^
2 \ j R (p, ( 1 - α ) λ K2/ H K
6
99
)h(p)dp
H The above bound, unlike ( 6 ) , does not involve 6 (through e / ^ J . UNBIASED ESTIMATES OF THE RISK An unbiased estimate of the risk as a tool for proving inadmissib i ϋ t y of estimators f i r s t appears in Stein (1973), and has been widely exploited since then.
The basic technique is embarassingly simple.
It
involves merely an integration by parts which succeeds because of the term θ x e appearing in the exponential density. few of the easier applications.
Here we describe the method and a
For further (more complex) applications, see,
for example, Berger (1980b), Berger and Haff (1981), and Haff (1983).
Here
is the heart of the method. A function t : R •> R is called absolutely
continuous
if
t ( x , , . . , x j , is absolutely continuous in x.., i = l , . . . , k , when a l l Xj, j ^ i are held f i x e d . 4.7
Let t !
Theorem Let s : R •> R be absolutely continuous.
(1) (2)
/ |s(x)|e θ # x dx < θ
/|s'(x)|e '
x
dx
,
Assume
and
< »f
i = 1
k.
Then (3) Proof. (4)
ΘΊ / s ( x ) e θ ' x dx Set i = 1 for convenience. / |s(κ 1
=
-/ s ! ( x ) e θ ' x dx
For almost every (
x 2 , . . . , x k ) | e θ # x dx χ
<
and Pi
J ls Λxi>
x
\ ι ft X
?»
»Λi/ / l e
U Λ
i
^
-
100
STATISTICAL EXPONENTIAL FAMILIES
because of (1), (2). For any such (x 2 5 ...,x k ) integration by parts yields (6)
θjj s(x l s x 2 ,...,x k )e θ # x dx 1 = lim θ / s(x ,x9,...,x. )e
R
{ -/
fl
dx Ί
y
s^(x ls x 2 ,...,x k )e
Γ
fl
x11
dx 1 + s(x 1 ,x 2J ...,x |< )e
> B
-/ s (x 1 ,x 2 ,...,x k )e θ " x dx 1 + lim inf f [|s( s ( xX ir,x x 2 ,9 ,...,xJe . . . ,x k )evθ.χl *xj B-χ»
= -/ s^(x 1 ,x 2 ,...,x k )e θ # X dx by (2) and then (1). Integration over Xp»...,xk then yields (3).
||
The assumptions (1) and (2) are slightly more stringent than necessary, and also can be given alternate forms. For example the assumption (5) together with lim s(x Ί , x ? ,...,x.)e θ ' x = 0
(7)
,
for
X
£
Is,
almost eyery x 2 , . . . , x k implies ( 4 ) , and hence (3) when i = 1.
Or, f o r
example, when k = 1 a p o t e n t i a l l y useful r e s u l t is the equality
J°°θs(x)eθx dx = -f°s'(x)e θ x -s(0 + )
(8) for
a b s o l u t e l y continuous f u n c t i o n s s having / | s ' ( x ) | e
dx < °° and
Ay
lim s(x)e
= 0. However, the version of the theorem given above suffices
for the usual applications. Theorem 4.6 can be expressed in other forms which are more suggestive of its applications, as in the following two corollaries.
APPLICATIONS
101
4.8 Corollary Let p θ (x) be a p r o b a b i l i t y density on R
( r e l a t i v e to Lebesgue
measure) of the form (1)
P θ (x)
=
h(x) exp(θ
x - ψ(θ))
where h >^ 0 is absolutely continuous. Let t : R -> R be absolutely Let t.1 = - 3 1 .
continuous.
i
9
χ
Then
h! θ. E θ (t) = -Eθ((t! + ^ "
(2)
provided both expectations in (2) exist. k k Let t : R •* R be absolutely continuous. Then (3)
E
Θ
k where V
t =
Σ
8x.j
(Θ • t )
s
=
-EΘ(V
+ 2JL. t )
, provided that
(4)
i
and
E θ (| ^
t |) < - ,
1 =1
k.
(In expressions (2), (3), (4) and similar expressions below define ~ = 0 if h = 0 .) Proof.
For (2) note that ~
(th) = (tj + jp t)h and apply Theorem 4.7.
For (3) apply (2) with i=l,...,k and sum. Remarks. (5)
||
Expression (2) immediately yields ΘE Q (t) = -Eθ(Vt + t ^ )
provided the expectations e x i s t .
(3) can also be derived d i r e c t l y from Green's
theorem which implies (under suitable conditions) that
(6)
/s(x)(Ve θ#x )dx
= - /(V s(x))e θ ' x dx
102
STATISTICAL EXPONENTIAL FAMILIES It can also be worthwhile to apply Theorem 4.7 repeatedly, as in
the next proposition which is needed for Theorem 4.10. 4.9
Proposition Let p be as in Corollary 4.8. Assume that h! is also absolutely
continous, and that
and (2) (where h1.111. = -¥- h). Then 8x? (3) k
2
(where V h = Σ h1Ί1.1. ). Proof.
Apply Theorem 4.6 twice for each i=l,...,k and sum over i.
||
Combining the preceding results yields the following unbiased estimator of risk for squared error loss. 4.10 Theorem Let {puΩ } be an exponential family whose densities are of the form 4.8(1) with h satisfying 4.9(1), (2). Let 6: R k -> R k be any absolutely continuous estimator of θ. Suppose 2
(1)
and (2)
E θ (||δ|| )
<
h! Eθ(|δ! + ίpδl) < - ,
oo
1 = 1,....k
.
Then (3)
2
2
E θ (||δ-Θ|| ) = E θ (||δ|| - 2(V
δ +^
δ) + ^ )
APPLICATIONS
103
Note t h a t
Proof.
E θ ( | | δ - θ | | 2 ) = E θ ( | | δ | | 2 - 2Θ Now u s e 4 . 8 ( 3 ) a n d 4 . 9 ( 3 ) t o a r r i v e a t ( 3 ) .
Remarks.
δ+
||
The l e f t s i d e o f ( 3 ) i s t h e r i s k f u n c t i o n f o r squared e r r o r l o s s .
As p r e v i o u s l y , we f r e q u e n t l y use t h e n o t a t i o n R ( θ , δ) f o r a r i s k
function
when the l o s s f u n c t i o n ( h e r e ||δ - θ|| ) i s c l e a r from the c o n t e x t .
The
i n t e g r a n d o f the r i g h t s i d e o f (3) i s f r e e o f θ; hence t h i s i n t e g r a n d i s an unbiased e s t i m a t e o f R ( θ , δ ) .
For most a p p l i c a t i o n s o f (3) one a c t u a l l y needs
o n l y an unbiased e s t i m a t e o f R ( θ , δ j estimators.
- R ( θ , δp) where δ. and 6? are two g i v e n
I n t h a t c a s e , t h e term | | θ | | , l e a d i n g t o - r —
i n ( 3 ) , cancels.
Assumption 4 . 9 ( 2 ) i s t h e r e f o r e not needed t o a r r i v e a t an unbiased e s t i m a t e o f the
(4)
form
R(θ, δ χ )
4.11 Application
- R(θ, δ2)
=
E^MδJI
2
- | | δ 2 | | 2 + 2(V
(δχ - ό2)
(James-Stein estimator)
The neatest application of Theorem 4.10 is to prove the minimaxity of the James-Stein estimator for a multivariate normal mean.
(The
original result in James and Stein (1961) uses a different method of proof.) Let X be k-variate normal, k >^ 3, with mean ξ(θ) = θ and covariance I. Consider the problem of estimating ζ under squared error loss.
The usual
estimator δQ(x) = x is minimax.
However, when k :> 3 i t is not admissible.
(1)
=
δ(x)
( l ί ϋ M l i ) llxll2
where r is absolutely continuous, non-decreasing, and (2)
0 < r( )
< 2(k - 2)
Let
104
STATISTICAL EXPONENTIAL FAMILIES
Then (3)
R(θ, 6) £ R(θ, ό 0 ) = k
Strict inequality holds in (3) except when r Ξ 0 or when r Ξ 2(k - 2), as can be seen from (5) below. The normal density is of the form 4.8(1) and -r- = -x. With ό as in (1)
so that 4.10(4) yields (4)
R(θ, 6 0 ) - R(θ. 6) = E Θ (2V
IIXIΓ
IIXI I
(It remains to check the regularity conditions needed for 4.10(4), and these will be discussed below.) —-—~ = k"2 ? . Hence (4) yields llxll llxll
Observe that V (5)
R(θ, δ n ) - R(θ, δ) = E ^ " * 1 ' ) (2(k-2) - r(llXll)) + 2 Γ ' ( " x " ) ) υ ϋ iixir iixii
The unbiased estimator of the risk which appears on the right of (5) is nonnegative because of (2); hence (3) follows. The first estimator of James and Stein was of the form (1) with r = k - 2, which is the best possible constant value of r. However, a better estimator (as also noted by James and Stein) is +
(6)
δ (x) = (1 - ^ ^ ) llxll which corresponds to the choice r(t)
+
x
= min(t 2 , k-2)
See Exercise 4.11.1. See also Exercises 4.11.5, 4.17.5, and 4.17.6 for generalizations. (It is also of interest to note that in general if
APPLICATIONS
δ
i
=
δ
Oi
+
Ύ
i '
i = 1
k
" -> >
t h e n
4
1 0
105
4
( ) yields k
(7)
R(θ, δ 0 ) - R(θ, δ)
=
E
θ
[ Π ^ γ
Γ
γ { ]
The integrand is formally the same as the Cramer-Rao lower bound (in which b( ) replaces γ( ))
See 4.5(7) (with λ = 0) and Exercise 4.5.6.
Hence the
fact that the inequality
Σ 2 -£- γ. - γ? > 0
(8)
i =l
a x
Ί
i
Ί
"
has a non-trivial solution i f and only i f k _> 3 leads to the proof of the fact that 6Q(X) = x is inadmissible i f and only i f k _> 3.) The regularity conditions stated in Theorem 4.10 are not always satisfied by an estimator of the form (1). δ is not continuous at ||x|| = 0.) a supplementary argument: specified r( ) (9)
Justification of (4) therefore requires
suppose 6 is an estimator of the form (1) with a
Let 6 be the estimator with r( ) replaced by r e (||x||)
Then δ
( I f , for example, r(x) = k-2 then
= min(||x|| 2 /ε ,
r(||x||))
.
satisfieds the conditions of Theorem 4.10 so that (4) holds for
6 .
Passing to the l i m i t as ε Ψ 0 yields that (4) also holds for 6. There is a yjery extensive l i t e r a t u r e concerning the problem of estimating a multivariate normal mean.
For an introduction and some references
consult Lehmann (1983, Chapter 4). 4.12
Remark For discrete exponential families there is an analog of the
unbiased estimates in 4.8 and 4.10 which involves difference operators instead of partial derivatives.
These results are based on the deceptively simple
equality oo
(1)
Σ λh(x)λX x=0
oo
=
Σ h(x - l ) λ X x=l
106
STATISTICAL EXPONENTIAL FAMILIES
They have been particularly useful for certain problems involving Poisson or negative binomial variables. See Hudson (1978), Hwang (1982), and Ghosh, Hwang, and Tsui (1983) for some theory and applications.
GENERALIZED BAYES ESTIMATORS OF CANONICAL PARAMETERS We first define the concept of a generalized Bayes estimator in the current context and state some foundational results. Then we discuss estimation of the canonical parameter of an exponential family. Later in this chapter we discuss estimation of the expectation parameter, including the topic of conjugate priors for exponential families. 4.13
Definition Let {p Q : θ € 0} be an exponential family of densities. Let Ό
ζ: Θ -> R be measurable. Let G be a non-negative (σ-finite) measure on Θ, locally finite at every θ € Θ. G is called a prior measure on Θ. Let S c R . Then 6: S -> R is generalized Bayes on S (for estimating ζ under squared error loss) if / ζ(θ)pft(x)G(dθ) (1) ό(x) = , x € S , / P θ (x)G(dθ) where both numerator and denominator exist for all x € S. We say δ is generalized Bayes if it is generalized Bayes on S where v(S C ) = 0. We will use the symbol δ β to denote the generalized Bayes procedure for G, when this exists. If the loss is squared error loss -(2)
L(θ, a) =
11 a - ζ(θ)|| 2
for estimating ζ(θ) and if the Bayes risk, (3)
B(G) = inf B(G, 6') = inf / R(θ, δ 1 )G(dθ) δ ό1 = inf/E fl(L(θ, δ'(X))G(dθ), δ1 θ
APPLICATIONS
107
satisfies B(G) < ». Then by Fubini's theorem any Bayes estimator for G (i.e. one which minimizes B(G, 6)) must also be generalized Bayes for G. One of the topics in which we shall be interested below is that of characterizing complete classes of procedures under squared error loss (2). Since L is strictly convex the nonrandomized procedures are a complete class. The following theorem is our main tool for proving complete class theorems. (In the current context a complete class is a set of procedures which contains all admissible procedures.) 4.14
Theorem With {p Q } and L as above ewery admissible procedure must be a
limit of Bayes estimators for priors with finite support. More precisely, to eyery admissible procedure corresponds a sequence G. of prior distributions supported on a finite set (and hence having finite Bayes risk) such that (1)
6 G β (x) - 6(x)
a.e.(v)
where (as above) δn denotes the Bayes estimator for G.. Proof.
This theorem is apparently "well known". Its proof is outside the
intended scope of our manuscript. However, I do not know any adequate published reference for it, so a proof is given in the appendix to the monograph.
See Theorem A12. Theorems 3.18 and 3.19 of Wald (1950) come close
to the above theorem as do some comments in Sacks (1963) and in Le Cam (1955).
II We now concentrate on estimation of the canonical parameter.
In
this case generalized Bayes estimators have a particularly convenient form, as described in the next theorem. 4.15 Theorem Let {p f l } be a canonical exponential family and l e t G be a prior measure on Θ for which the generalized Bayes procedure, δG for estimating θ
108
STATISTICAL EXPONENTIAL FAMILIES
exists. Define the measure H by H(dθ) = e " ψ ( θ ) G(dθ)
(1)
θ x and (as usual) l e t λ..(x) = / e H(dθ) denote i t s Laplace transform.
Then δ fi
satisfies
(2)
δ G (x)
=
V In λ H (x)
( I f v(8K) = 0 then, of course,
=
VψH(x) ,
x e Γ .
(2) completely defines δQ since
v((K°)Cmp)
= v(3fC) = 0.)
Proof.
By d e f i n i t i o n the generalized Bayes procedure i s
(3)
δG(x) G
=
/ θ e θ ' x H(dθ) g / e θ x H(dθ)
a.e. (v)
By assumption the i n t e g r a l s on the r i g h t o f ( 3 ) e x i s t a . e . ( v ) ; hence N H 3 K° . the
The denominator e x i s t s on WH, by d e f i n i t i o n , and by Theorem 2 . 2 ,
numerator e x i s t s on N° and i s given by V λ u ( x ) . π π
This proves ( 2 ) .
II
If δ is only generalized Bayes on S c K relative to G one clearly has an analogous representation of δ on S°, namely (4)
δ(x)
= Vψ H (x) ,
x € S° .
An interesting special consequence of the above is that if k = 1, and |δ(x) - x| is bounded, and λδ(x) is generalized Bayes on K° for 0 < λ £ 1 then δ(x) = x + b. See Meeden (1976). The foundation for the following major theorem has been laid above and in Section 2.17. The first theorem of this type was proved by J. Sacks (1963) for dimension k = 1. Indeed Sacks claimed, but did not prove, validity of the result for arbitrary dimension. Brown (1971) proved the result for arbitrary dimensions when {p Q } is a normal location family; and that Ό
proof was extended to arbitrary exponential families by Berger and Srinivasan
APPLICATIONS (1978).
109
The proof below follows Brown and Berger-Srinivasan.
The proof of
Theorem 4.24 is somewhat more l i k e Sacks' original proof. 4.16
Theorem
Let {p Q } be a canonical k parameter exponential family. Then 6 is admissible under squared error loss for estimating θ only if there is a measure H on θ c W such that
(1)
/ θ e θ # x H(dθ) 6(x) = Q-^ = Vψ H (x) , / e H(dθ)
Remarks.
for
x e K°
a.e.(v)
.
The expression (1) i m p l i c i t l y includes the condition N,, 3 K°, so
that both numerator and denominator in (1) are well defined for a l l x € K°. I f H(Θ - Θ) = 0 so that 0 = § c W 5 then one may define (2)
= e ψ ( θ ) H(dθ)
G(dθ)
and rewrite (1) as / θp f l (x)G(dθ) (3)
6(x)
9
=
/
,
x € K°
.
Pθ(x)G(dθ)
Thus 6 is generalized Bayes on K° relative to G. This observation leads to Corollary 4.17 and to further remarks which appear after the corollary. Proof.
Let 6 be admissible. By Theorem 4.14 there is a sequence of prior
measures G., having finite support, such that $ G (x) -> δ G (x) a.e.(v). Let x Q € K° such that 6 Q (xQ) •+ ό(x Q ). Since G 1 has finite support ; e
θ Xo"Ψ(θ)
(2)
fi.(dθ)
=
This is a normalized version of 4 . 1 5 ( 1 ) , so, l e t t i n g ψ. = ψr;
,
110
STATISTICAL EXPONENTIAL FAMILIES
(3)
δ G (x) = Vψ^x)
Since / e
H.(dθ) = 1 we assume w i t h o u t loss o f g e n e r a l i t y the existence o f a
l i m i t i n g measure H, f o r which H. -> H weak*.
(Apply 2 . 1 6 ( i v ) to the measure
e X o # θ H i t o get e X o ' θ ί^ -> H*, say, and l e t H = e " X o # θ H* such t h a t 4.14(1) holds a t x 1 .
Then thbre i s a f i n i t e set S c K ° such t h a t
4.14(1) holds on S and such t h a t B = conhull S s a t i s f i e s x1 e B ° . (4)
Let x e S.
ΨΊ (x) - ψ . ( x 0 )
=
J 1 (x - x 0 )
V Φ ^ X Q + p(x - x o ) ) d p
Vψ^x) i
||x - XQ11
by C o r o l l a r y 2 . 5 . ( N o t e t h a t Ψ ^ X Q ) Ξ 0 . ) I t f o l l o w s Ί i m sup s u p ψ . ( x )
i-**>
xQ € B ° ,
Then
<. ( x - x Q )
(5)
Let x 1 € K°
.)
x€S Ί
=
sup | | ό ( x ) | |
Mδ^x)!!
that
||x-xn||
<
°°
u
x€S
This is the principle assumption of Theorem 2.17, which now implies the existence of a subsequence H., and a limiting measure, which must be H, such that ψΊ.(x) + Ψ H (x),
x € B°, and also Vψ Ί (x) ^ Vψ H (x),
x € B°, by 2.17(5).
Since vψ.(x') = ό.(x') ->• ό(x') we have (4)
δ(x') = Vψ H (x')
This proves (1) since x1 is an arbitrary point of and since 4.14(1) is satisfied 4.17
a.e.(v).
K° satisfying 4.14(1),
||
Corollary Suppose Θ is closed in R and
(1)
v(3K)
= 0
Then the generalized Bayes procedures form a complete class.
APPLICATIONS Proof.
HI
As noted the admissible procedures are a (minimal) complete class.
If 6 is admissible then for some prior measure H on Θ = Θ
δ(x) = 'θ e Π H W
(2)
a.e.(v)
/ e θ # x H(dθ) by 4.16(1) and (1), above. Let G(dθ) = e ψ ( θ ) H(dθ) as in 4.16(2) to get the desired representation, (3) Remarks.
δ(x) =
θpA(x)G(dθ) 2 ( Pθ(x)G(dθ)
If v is dominated by Lebesgue measure then (1) holds since the
Lebesgue measure of the boundary of any convex subset of R is zero. (To see this note that if C is bounded and convex with 0 € intC then 9C = n [(1 + Ί|)C - (1 -Ί j)C] = Π ΊC. , say, where (as usual) i=l i=l aC = {x: By € C, x = ay}. See e.g. Rockafeller (1970). Then / dx = a/dx aC C so that / dx = lim rf dx = lim(^j-)/ dx = 0. If C is unbounded apply the άΛ 8C S" C result for bounded C to C n {x: llxll < b} and let b -> «>.) If v{dK) f 0 then there are, in general, admissible procedures which are not generalized Bayes. See Exercise 4.17.1. Similarly, if Θ is not closed in R there will again be admissible procedures which are not generalized Bayes, even when v(9K) = 0. See Exercise 4.17.2. When Θ = W and the exponential family is regular then Θ is closed if and only if H = R . Hence when Θ + R one cannot assert that all admissible procedures are generalized Bayes. However, the representation 4.16(1) remains valid. This representation is qualitatively similar to a generalized Bayes representation and is generally as useful as one. Not all estimators which can be represented in the form 4.17(3) or 4.16(1) are admissible. In fact, many are not. Nevertheless, representations of this form are valuable stepping-off points for general admissibility
112
STATISTICAL EXPONENTIAL FAMILIES
proofs. See Brown (1971, 1979). The most conspicuous example of an inadmissible generalized Bayes estimator occurs in the problem of estimating a multivariate normal mean already discussed in 4.11. The usual estimator ό(x) = x is generalized Bayes, but when k >^ 3 it is not admissible. When k >_ 3 the positive part JamesStein estimator, defined in 4.11(6), dominates δ(x) = x. However, the positive part James-Stein estimator cannot be generalized Bayes
(see Example 2.9);
hence is itself inadmissible. So far as I know the problem of finding an (admissible) estimator which dominates 4.11(6) remains open. However, theoretical and numerical evidence indicates that such an estimator cannot have a much smaller risk at any parameter point; hence 4.11(6) remains one of the many reasonable alternatives to ό(x) = x when k >_ 3. (See e.g. Berger (1982).)
GENERALIZED BAYES ESTIMATORS OF EXPECTATION PARAMETERS CONJUGATE PRIORS The statistical problem of estimating the expectation parameter ξ(θ), is more often of interest than that considered previously, of estimating the natural parameter.
(Of course for normal location families the two problems
are identical.) In this case, too, there is a representation theorem for generalized Bayes procedures and a complete class theorem based on a representation similar to that of generalized Bayes. (In some (not fully developed) sense the generalized Bayes representation available here is dual to that in the preceding section -- the differentiation operator is with respect to θ and appears inside the integral sign instead of being with respect to x and appearing outside it.) Both these main results are somewhat more limited than those for estimating θ; but are nevertheless useful. A new feature of considerable statistical interest appears here. The linear estimators are (generalized) Bayes for the conjugate (generalized) priors. This result is presented first; the conjugate priors are defined in
APPLICATIONS
113
4.18 and the existence and linearity of their (generalized) Bayes procedures is proved in Theorem 4.19. 4.18
Definition Prior measures having densities relative to Lebesgue measure of
the form g(θ) = C e θ ' Ύ - λ ψ ( θ )
(1)
γ € Rk ,
λ >0
are called conjugate prior measures. Note that if the prior is of the form (1) then the posterior distribution, calculating formally, has the same general form, with new parameters γ + x and λ + 1. For a sample of size n the n parameters become γ + s n = γ + Σ x and λ + n. (Note in (1) that g = 0 if n
1= 1
Ί
θ ί W since then ψ(θ) = °° .) Arguments resembling those in the following proof show that the conjugate prior measure is f i n i t e , and hence can be normalized to be a prior probability distribution i f and only i f (2)
λ > 0
and
γ/λ
€ K°
See Exercise 4.18.1. For estimating ζ(θ) = E_(X), under squared error loss, the Bayes u procedures for conjugate priors are linear in x. This fact (often under extraneous regularity conditions) has been known for decades. See, for example, De Groot (1970, Chapter 9) and Raiffa and Schlaiffer (1961). The following precise statement and its converse first appeared in Diaconis and Ylvisaker (1979). 4.19
(See Exercise 4.19.1 for a statement of the converse.)
Theorem Let {p Q } be a regular canonical exponential family and let g(θ) be θ
a conjugate prior density as defined by 4.18(1). Then the generalized Bayes procedure for estimating ξ(θ) exists on the set
114
STATISTICAL EXPONENTIAL FAMILIES
(1)
S = {x : δ(x) = ^ ^ € λ +1
and has the l i n e a r form
(2)
+
δ(x) = J-J-J
=
γ?Γϊ
αx +
P
I f v(S C ) = 0 then δ i s generalized Bayes.
Remarks.
occurs for γ = 0, v(8K) = 0.
λ > 0.
I t occurs f o r γ = 0,
x
>
€ S
I f 0 € K t h i s always
λ = 0 i f (and only i f )
I t can occur for other values of γ,λ as w e l l . I f x ί S then the generalized Bayes procedure does not e x i s t at x
since /
θ e
*x~ψ^
g(θ)dθ = °°.
See Exercise 4 . 1 9 . 1 .
For the r e l a t i o n between the condition that v(S c ) = 0, so that 6 is generalized Bayes, and K a r l i n ' s c o n d i t i o n , 4 . 5 ( 2 ) , see Exercise 4.19.2. Proof.
Let x € S.
The generalized Bayes procedure at x, i f i t
exists,
has the form
(3)
δ ( χ )
/ (Vφ(θ)) exp((x+γ) - θ - (λ+l)ψ(θ))dθ / exp((x+γ) θ - (λ+l)ψ(θ))dθ
=
because of the form of g and of p Q , and because ξ(θ) = Vψ(θ) on W and g(θ) = 0 for θ (. hi. I f the integrals i n the numerator and denominator of (3) e x i s t then Green's theorem i n the form of 4.7(3) y i e l d s (4)
(x + γ ) / exp((x + γ)
=
θ - (λ + l)ψ(θ))dθ
(λ + 1) / (Vψ(θ)) exp((x + γ)
Rearranging terms in (4) y i e l d s ( 2 ) .
I t remains only to v e r i f y that the
numerator and denominator of (3) e x i s t . L e t z = £J2. . Hence
θ - (λ + l)ψ(θ))dθ
z £ K ° s i n c e x € S.
APPLICATIONS
(5)
θ
liminf
^ > -
IIΘMHOO
θ
'
z
115
> 0
llθll
by 3.5.2(1) (or by 3.6(3) and t r a n s l a t i o n of the o r i g i n ) .
I t follows that
f o r some ε > 0 (6)
exp ((x + γ)
θ - (λ + l)ψ(θ))
=
0(e"
ε||θ|1
)
This proves existence of the integral i n the denominator of ( 3 ) . Now consider ξ, = -r^1
let
ξ 1 (θ) = 0 i f θ f. M.
i n θ 1 for θ € W. for θ 1 < q
(7)
and
du -j
on hi.
For s i m p l i c i t y of notation
Fix Θ 2 , . . . , θ k .
ξ j t θ ^ θ g . . . . »θ k ) is monotone
Thus f o r some q = q ( θ 2 , . . . , θ k ) € R , ξJθyθ^,... ξ j ( θ j , θ 2 , . . . ,θk) ^ 0
f o r θ^ > q.
/ | ξ 1 ( θ r θ 2 , . . . , θ | < ) | exp((x+γ)
below,
, θ k ) <_ 0
Hence
θ - (λ+l)ψ(θ))dθ 1
q l i m / - ξ 1 ( θ 1 , θ ? , . . . , θ j exp((x+γ) B*x> B
θ - (λ+l)ψ(θ))dθ 1
B + lim / ξ 1 ( θ 1 , θ ? , . . . , θ | f ) exp((x+γ) . θ - (λ+l)ψ(θ))dθ Ί l d κ L B-^> q The function
e x p ( - ( λ + l ) ψ ( θ i s θ 2 , . . . , θ k ) ) is absolutely continuous i n θ,
since {p θ > is regular.
( I f { p 0 } were not regular there could be a discon-
t i n u i t y at the boundary of W.)
Let θ = ( q ( θ 2 . . . . , θ k ) , Θ 2 , . . . , θ k ) .
Ordinary
i n t e g r a t i o n by parts y i e l d s (8)
q l i m / - ξ 1 ( θ 1 , θ 2 . . . . . θ k ) exp((x+ γ ) . θ - ( λ + l ) ψ ( θ ) ) d θ 1 Bκ B =
lim j-tXi+γj)
B-χ»
I
/ exp((x+γ)
-B
θ - (λ+l)ψ(θ))dθ 1 q
+ [exp((x+γ)
θ - (λ+l)ψ(θ))]
Ί
>
q = - ( X I + Ύ J / exp((x+γ) * θ - (λ+l)ψ(θ))dθ, + exp((x+γ) θ l i '
*•
q
- (λ+l)ψ(θ )) q
116
STATISTICAL EXPONENTIAL FAMILIES
by (6). Note that (again by (6)) (9)
k ? Θ Q - (λ+l)ψ(θ )) = 0(exp(-ε Σ θί)) q M 3=2 J
exp((x+γ)
Reasoning similarly for the second integral on the right of (7), integrating both integrals over θp»...,θ. , and using (9) yields (10)
/.κ |ζ Ίi(θ) I exp((x+γ) R
Finally, the identical reasoning on ζ., ) exp((x+γ)
θ - (λ+l)φ(θ))dθ
< -
i = l , 2 , . . . , k , shows that θ - (λ+l)ψ(θ))dθ < -
which verifies that the numerator of (3) exists. As noted previously, this completes the proof.
||
4.20 Application For a given k-parameter exponential family {p Q } the conjugate prior distributions, {g
} , say, form a (k+1)-parameter exponential family
with canonical statistics Θ 1S ...,Θ., -ψ(θ). This (k+1)-parameter family is minimal except when ψ(θ) is a linear function of θ. This linearity occurs when p n is the Γ(α, σ) family with known σ, and in certain multivariate u generalizations of this univariate example. Many familiar exponential families are the conjugate families of prior distributions for other familiar exponential families of distributions. (Conjugate prior measures which are not finite then appear as limits of these distributions.) For example, the N(γ, λ I) distributions are conjugate to the N(μ, I) family. The proper conjugate prior distributions for the Γ(α, TZQJ) family (α known, θ < 0) are those of -Θ where Θ ^ Γ(λα, - γ ) , γ < 0, λ > 0. The proper conjugate priors for the P(e ) family have density (i)
g Y λ (θ) = e γ θ - λ e ,
γ < o,
λ >o
with respect to Lebesgue measure on (-«>, °°). Thus the density of ξ = e is
APPLICATIONS
117
Γ(-γ, 1/λ). See also Exercise 5.6.3. The basic representation theorem for generalized Bayes procedures is a simple consequence of Green's Theorem 4.7(3), and is an obvious extension of 4.19(4) in the proof of Theorem 4.19. The regularity conditions in the following statement may be modified as noted in the remark following the theorem, 4.21 Theorem Let {pθ> be a regular canonical exponential family and let G be a prior measure on Θ. Suppose G has a density, g, with respect to Lebesgue measure. Suppose g(θ)e~^ ' is absolutely continuous on R . Assume for x e S (1)
/ e θ χ -Ψ( θ > g(θ)dθ
(2)
/ ||vg(θ)|| e θ ' χ - ψ ( θ ) dθ <
and / ||Vψ(θ)|| g ( θ ) e θ * χ - ψ ( θ ) dθ
(3)
Then the generalized Bayes procedure, 6, for estimating ξ(θ) under squared error loss, exists on S and is given by the formula (9())
Remarks.
de
c
If v(S ) = 0 then, of course, the unrestricted generalized Bayes
procedure exists and is given by (4). Conditions (1) and (2) are of course necessary for the representation (4) to make sense. Condition (3) is necessary in order that the generalized Bayes estimator be well defined. However it can often be deduced as a consequence of (2) and so then need not be checked directly. Suppose (5) for some function h(θ) satisfying
118
STATISTICAL EXPONENTIAL FAMILIES tk~l
h(t)dt
Then (1) is satisfied, and condition (2) implies condition (3). See Exercise 4.21.1. The representation (4) is exploited in Brown and Hwang (1982) as the starting point for a proof of admissibility of generalized Bayes estimators under certain (important) extra regularity conditions. Proof.
Conditions (1), (2), and (3) justify use of the integration by
parts formula 4.7(3), which yields (6)
/ x(g(θ)e" ψ ( θ ) )e θ ' x dθ = / (-Vg(θ) + g(θ)Vψ(θ))e θ ' x " ψ ( θ ) dθ
Rearranging terms (each of which exists by (1), (2), (3)) yields (4).
||
We now turn to the complete class theorem comparable to Theorem 4.16. The result proved below applies only to one parameter exponential families. It appears to us that there exists a satisfactory multiparameter analog of this result which, however, is somewhat more complex to state (and to prove). We hope to present this multiparameter extension in a future manuscript. As with Theorem 4.16 the representation of admissible procedures involves a ratio of integral expressions similar to the formula for a generalized Bayes estimator. Again, under certain additional conditions, this representation reduces to precisely that of a generalized Bayes procedure. A new complication appears in the integral representation below. only on an interval I. whose definition involves ό( ) itself.
It applies (See 4.24(1).)
However, as explained in the remarks following the theorem, the values of δ(x) for x ί ί are uniquely specified by monotonicity considerations. Hence the theorem actually describes exactly the values of ό(x) except for at most two points -- the endpoints of I. . In this sense the complication presented by the presence of I. is just a minor nuisance.
APPLICATIONS
119
We begin with a technical lemmά. 4.22
Lemma Let v
be a sequence of probability measures on R .
Suppose for
some ζ > 0 (1)
lim i n f v n ( { x > K})
>
ζ
> 0
n+ o°
f o r a l l K < oo.
Let ε > 0.
Suppose λ
n=l,...
.
Then
e ε x vn(dx)
ί (2)
( ε ) < °° ,
v
lim n-**>
λ
(ε) n
for all K < «>. Remarks.
The negation of (1) is the condition
(3)
l i m l i m i n f v n ( { | x | > K})
K**>
n
n*»
= 0
This is the usual necessary and s u f f i c i e n t condition for there to exist a subsequence n 1 and a non-zero limiting measure v such that v , •> v. The conclusion (2) can be paraphrased by saying that the sequence of probability measures
e ε x v (dx)/λ
( ε ) sends a l l i t s mass out to -κ». n
Proof.
Let K < oo, 1 < m < «.
/" e
(4)
I
ε x
v(dx)
/ e ε x v n (dx)
/" e
Then
ε x
v(dx)
2 f
/ e e x v n (dx)
—oo
—oo
>
e
e ( m
Now l e t n •»• » and m -»• °° t o f i n d
"
1 ) κ
v n ( { x > mK})
.
120
STATISTICAL EXPONENTIAL FAMILIES
/ e ε x v n (dx) (5)
lim inf
which proves ( 2 ) . 4.23
K
f e ε x v n (dx)
||
Theorem
Let {p Q : θ € Θ} be a regular exponential family on R . Consider u
the problem of estimating the expectation parameter, ξ ( θ ) , under squared e r r o r loss.
Let 6 be an admissible estimator.
function. (1)
Then,
6( ) must be a non-decreasing
Let
I 6 = {x: v ( { y : y >x, ό(y) € K°}) > 0
and v ( { y : y < x, ό(y) € K°}) > 0 } .
Then there exists a f i n i t e measure V on Θ such t h a t f o r a l l x G I
(2)
Remarks.
δ(x)
=
r ζ(θ) θx J li + \l ζr( U )\e θ ) l
In (2) the functions ..
*
|1/Q\ , and Λ
,Γ/Q\. have the obvious
i n t e r p r e t a t i o n on the boundary of M. ( I n other words, i f N = ( a , b) then
= -1 , etc., since
lim ξ(θ) = » , lim ξ(θ) = — . )
By monotonicity of 6, I must be an open i n t e r v a l . -oo £ i < T £ oo.
Suppose K° = ( k , E ) , -» £ k < k <_ ».
Say I = ( i , T ) ,
Then k £ i
( ΐ <^ k,
respectively) and, by monotonicity and the d e f i n i t i o n of I , ό(x) = k f o r k <_ x < i
(ό(x) = R for
T < x <_ k ) .
For i < x < ϊ ,
ό ( x ) i s defined by ( 2 ) .
Thus, the theorem f a i l s to define 6 ( x ) only f o r x = i i f -°° < k < i
or f o r
x = k i f - o o < k = i , and, i f k < «>, f o r x = T or k depending on whether T < R or T = k.
I f v, the dominating measure f o r { p A } , is continuous then these
APPLICATIONS
121
two points have measure 0 and the theorem completely describes δ. Similarly, if K = (-», <*>) then irrespective of v the theorem completely describes 6 If V(N - N) = V(ίa, b}) = 0 then (2) can be rewritten as (3)
Jξ(θ) Pfl(x)G(dθ) § , ί Pθ(x)G(dθ)
δ(x) =
x€Γ
,
where β
ψ(θ)
Thus, δ is then generalized Bayes on T in the ordinary sense. of course, occur i f W = R .)
(This must,
When W f R there may exist admissible procedures
having representation (2) but not (3).
See Exercise 4.24
Finally, note as with Theorem 4.16 that there are many inadmissible procedures satisfying (2). Proof.
See for example Exercise 4.5.4.
I f G is a prior density then the Bayes procedure (assuming i t is
well defined for x € K) is given by the formula
m
* t^
-
f ζ ( θ ) e θ x ' ψ ( θ ) G(dθ)
_
/ ζ ( θ ) e θ x H(dθ)
/eθxH(dθ) where H(dθ) = ce" ψ ^ θ ^ G(dθ). x
θx
e //e H(dθ)
ζ(θ) is monotone on W.
The family of densities
is an exponential family (with parameter x) relative to the
dominating measure H.
In particular, i t has monotone likelihood ratio.
Hence, ό r is monotone non-decreasing by Corollary 2.22.
(6^ is actually
s t r i c t l y increasing unless G is concentrated on a single point.)
All admissi-
ble procedures are (a.e.(v)) limits of Bayes procedures by Theorem 4.14, and limits of monotone functions are monotone. must be monotone non-decreasing.
Hence a l l admissible procedures
(A different proof of a better result is
contained in Brown, Cohen and Strawderman (1976).) Let 6 be admissible and l e t όp Theorem 4.14 having δp
+6
a.e.(v).
be the sequence promised in i Since all δp are monotone non-
122
STATISTICAL EXPONENTIAL FAMILIES
decreasing t h e r e i s no loss o f g e n e r a l i t y i n assuming δg ( x ) -*• 6 ( x ) f o r a l l n x f Γ , and we do so below. Assume 0 e l c Γ ,
(5)
V (dθ)
Define the p r o b a b i l i t y measures,
(1 + | ξ ( θ ) l ) e " ψ ( θ ) G ( d θ ) ^ ^ / ( I + I ξ ( θ ) l ) e " ψ ( θ ) 6n(dθ)
=
n
Let
ε > 0 such t h a t ε € K°.
Then
εθ
(6)
δG(ε)
J
-
e
1 +
V (dθ)
^ Vn(dθ)
i Λ Suppose for some ζ > 0 (7)
Let
Tim inf V ({θ > K}) > ζ > 0
for all
K<
ΘQ be the unique value such t h a t ξ ( θ Q ) = 0 , and l e t K > Θ Q .
1 + iε(θ) 1
1<s
i n c r e a s i n
9
f o r
θ
>
θ
o
The f u n c t i o n
A
PP"ly Lemma 4.23 to get
J e ε θ Vn(dθ)
ξ(K)
Similarly, ,+ tela) ι
1S
decreasing for θ > Θ Q so that
J e ^ V n (d θ ) 1 + ξ(K)
Substitute
( 8 ) and ( 9 ) i n t o
•
t h e f o r m u l a , ( 6 ) , f o r or
( ε ) and l e t K -*• °° t o n
find
APPLICATIONS (10)
123
δ(ε) = lim δ Q (ε) >_ lim ξ(K) . n-*» n κ-χ»
This holds for all e > 0 with ε e K°. It follows from (10) that O i l , contrary to assumption.
Hence (7) must be false. A symmetric argument shows
that lim inf V ({θ < -K}) > ζ > 0 is also impossible. n-*» (11)
Hence
lim lim inf V ({|θl > K}) = 0 .
By translating the origin the same argument can be applied at any x € I c K°. The conclusion is that x e I implies e θ x V n (dθ) (12)
lim lim inf
This i s s l i g h t l y more than i s needed to apply the c o r o l l a r y o f Theorem 2.17 stated i n Exercise 2 . 1 7 . 2 . x interchanged so t h a t P n
( ( 1 2 ) implies 2 . 1 7 . 2 ( 2 ) with the roles o f v
(dθ) = eθxVn(dθ)/ eθxVn(dθ).)
Π j X
fI
θ and
The conclusion of
Π
this exercise is that there exists a subsequence {n 1 } and a limiting measure V on Θ such that (13)
eθxVn,(dθ)
+ e θ x V(dθ)
Note that V(Rk) = λ v ( 0 ) = lim λy
and
(0) = 1.
λ v § (x) - λ γ ( x ) ,
x € 1° .
Since both
and
χ
+^(3))
n
are bounded continuous functions on §, (13) and (6) yield directly 1 + lξ(θjl that for x e 7
e
δ(x)
=
lim
δG
(x)
This verifies ( 2 ) , and completes the proof.
||
v ( d θ )
124
STATISTICAL EXPONENTIAL FAMILIES
EXERCISES 4.1.1 ( i ) Prove the Cauchy-Schwarz inequality (4.1(1)) with B ^ in place of Bpo when Bpo is singular.(ii) Show the inequality remains valid when T,, T^ are respectively Uxs) and (mxs) matrix valued random variables, [ ( i ) Reproduce the proof of Theorem 4.1 with Bio in place of BZi; or rotate coordinates so that B2o is diagonal with diagonal entries d.
> 0, 1 < i < r, and d
= 0,
r+1 £ i <_ m, and apply 4.1(1) for the f i r s t r coordinates of T 2 J 4.2.1 Let v be a measure on R and l e t T be a real valued s t a t i s t i c . 2 Suppose 0 € W° and E(T ) < °°. Show f o r every ε > 0 there i s a polynomial 2 p(x) such t h a t E((T - p) ) < ε .
2 ( I n other words, the monomials x - , . . . , x . , x , ,
Xj x 2 , . . . form a complete basis f o r L 2 ( v ) . ) ^0
=
*' ^1 =
in L9(v).
a
ll
x
+
a
1 0 ' ^2 =
Let α. = E ( T f . ) .
T-g€L2(v),
a
22x
2 +
a
21x
+
[For k = 1 ( f o r s i m p l i c i t y ) l e t a
2 0 ' # * # ^ e o^thonormal f u n c t i o n s
Then Σα? < ET2 so t h a t g = Σ α . f . € L 9 ( v ) .
0 e W°(T - g) and λ ^ ( 0 ) = 0 ,
j=0,l,...
.]
4.3.1 V e r i f y formulae 4.3(3) and 4 . 3 ( 5 ) . 4.3.2 Let p Q be a f u l l canonical exponential f a m i l y and l e t ξ = ξ ( θ ) denote the expectation parameter.
Show t h a t r e l a t i v e t o t h i s parameter the
information matrix i s J ( ζ ) = Σ - 1 ( θ ( ζ ) ) . 4.4.1 Let M be a f i x e d £χ£ p o s i t i v e s e m i - d e f i n i t e symmetric m a t r i x . 1
Write the i n f o r m a t i o n i n e q u a l i t y f o r EQ ( ( T - μ ) M(T - μ ) ) where T i s an Jl-dimensional s t a t i s t i c w i t h mean μ and f i n i t e covariance a t ΘQ.
[This i s
immediate from Theorem 4.4 and (T - μ ) 1 M(T - μ) = Tr(M(T - μ ) ( T - μ ) 1 ) . ]
APPLICATIONS
125
4.4.2 Show that the information inequality 4.4(1) is an equality i f and only i f for some matrix A and vector b (a)
T(x)
= A(Vθ(p Q ))X + b
.
[Show the Cauchy-Schwarz inequality is an equality i f and only i f Tp is an affine transformation of T-.] 4.4.3
Let {p θ : θ e 0} be a different!able subfamily and T an £dimensional statistic. Suppose %a (T) exists for some θ n € 0. Then the w0
u
information inequality is an equality for a l l θ € 0 i f and only i f 0 is an affine subspace of W and T is an affine function of the canonical minimal sufficient s t a t i s t i c for the exponential family {p Q : θ e 0 } .
(That such a
characterization holds under mild regularity conditions for a general family
{p 0 }
was proved in
Wijsman (1973) and Joshi (1976).)
[Use
Exercise 4.4.2.] 4.4.4 Suppose {p Q } is a canonical one-parameter exponential family.
Show
that when the information inequality is not an equality i t can be improved to an inequality of the form:
(1)
Var 0 T > ε'(θ o )M(θ o )ε(θ o )
where ε(θ) is the jχl vector with (2)
J
ε(θ), = - ^ e f θ ) 3Θ
1 =1
J
and M(θ) is an appropriate j x j symmetric matrix, not depending on T. In -1 fact, M(θ) is the covariance matrix at θ of the vector with coordinates
(3) (The inequality (1) with M
3Θ. as in (3) is called a Bhattacharya
inequality.
126
STATISTICAL EXPONENTIAL FAMILIES
Such inequalities are valid also for full k parameter exponential families and for ^-dimensional statistics, as well as for differentiate subfamilies (p replaces θ in (1) - (3)). See e.g. Lehmann (1983, p.129).
[A direct proof
is possible which also yields the formula (3). An alternate proof assumes ΘQ = 0, ψ(θ 0 ) = 1 (w.l.o.g.) and uses Exercise 4.2.1 to write / (T(x) - oυu ) 2 v(dx) >_ Σ α? = Σ /T(x)f. (x)v(dx). 1=1Ί 1=1 Ί
]
4.4.5 Suppose X...... are i . i . d . observations from a d i f f e r e n t i a t e exponential subfamily.
(1)
Let N be a stopping time with PQ
Eθ (exp(ε N))
Let S n =
<
»
for some
ε >0
n Σ X. and l e t T(S M , N) be a s t a t i s t i c for which la
(2)
ZΘQ(T)
where e(ρ) =
E
Θ(D)(T(SN>
>
N
(E θ ( N ) ) " 1 ( V e ( p 0 ) ) '
) )
(N < °°) = 1 and
(T) < «.
Then
yHpQ)(Mp0))
[Prove directly or use Exercise 3.12.2 (iii)
and Theorem 4.4. The regularity condition (1) can be considerably relaxed or modified, but some condition on N is needed in general. See Simons (1980).] 4.4.6 ( i ) When {p Q } is a f u l l canonical exponential family and p
EQ (T ) < «>, the Bhattacharya i n e q u a l i t i e s 4 . 4 . 4 ( 1 ) l i m i t as j + °°.
( i i ) I f {ΌA i s an m-dimensional
tend to e q u a l i t y i n the
d i f f e r e n t i a t e subfamily
w i t h m < k then there a r e s t a t i s t i c s T f o r which the appropriate Bhattacharya i n e q u a l i t i e s do not tend to e q u a l i t y as j + «.
[ ( i ) Use Exercise 4 . 2 . 1 and
proceed from the proof sketched i n the h i n t i n Exercise 4 . 4 . 4 . a curved exponential
f a m i l y i n the canonical
( i i ) Consider
version 3 . 1 1 ( 1 ) , and l e t
T(x) = x2 - x 2 . ] 4.5.1
Prove the assertion in 4.5 when 3 ^ 0 .
[Let Y = X - γ. Apply 4.5
APPLICATIONS
127
to yield αY as an admissible estimator of ξ(θ) - γ. Hence αY + γ is admissible for ξ(θ).] 4.5.2 Show the condition 4.5(2) implies 6 Jx) = (αx + β) € K a.e.(v). α,p
[The theorem would be false otherwise! theorem i s also of i n t e r e s t .
But a d i r e c t proof not involving the
Use Lemma 3 . 5 . ]
4.5.3
Suppose (λ, γ) satisfies condition 4.5(2), λ 1 < λ, and either γ € K° or v is a discrete measure. Then (λ 1 , γ) satisfies condition 4.5(2). If γ € 8K = K - K°, and (λy γ ) , (λ 2 , γ) both satisfy 4.5(2), and λ χ < λ < λ 2 then (λ, γ) satisfies 4.5(2). 4.5.4 Let X ~ Γ(a, σ ) , a known, and consider the problem of estimating σ = E(X) under squared error loss, δ
o
( i ) Using K a r l i n ' s theorem v e r i f y that
( x ) =αx + β is admissible i f α = -^γ
, 3 = 0 or i f α < - ~
α"fΊ
Ot, p
, β > 0.
— a+ l
(ii) Show that if α,β do not satisfy these conditions then 6
Q
is inadmissible
Ot>P
since there i s an admissible l i n e a r estimator which i s b e t t e r . 4.5.5 Consider the one-parameter exponential family defined by 3.4(1) with θo
=
-1 and θ = θi € (-°°> 0 ) .
under squared error loss.
Let δ
Consider the problem of estimating ξ(θ) o
be a l i n e a r estimator as i n 4 . 5 ( 1 ) .
Oί»p
Observe that condition 4.5(2) of K a r l i n ' s theorem i s not s a t i s f i e d at θ = 0. Sh.ow that 6
o
is inadmissible.
[For the case α = 1, β = 0 l e t
OUp
(1)
δ'(x) L
X
X £ C
c + (x-c)/2
x >c
=
Then R(θ, δ 1 ) < R(θ, δ Ί n ) for ξ(θ) £ c and, for ξ(θ) >_ c, a crude bound yields
128
STATISTICAL EXPONENTIAL FAMILIES R(θ, δ') < [h) Var θ (X) + (*)(ς(θ) - c ) 2 + ξ 2 (θ)
(2)
=
3
2
1
Hence f o r c s u f f i c i e n t l y large R(θ, 6 ) < R(θ, δ Q ^ ξ(θ)
2
ξ ( θ ) / 8 + (ξ(θ) - c ) / 4 + ξ ( θ ) 3
= ξ ( θ ) / 2 also when
> c]
4.5.6
Let {pθ> be as in 4.5. Suppose it is desired to estimate g(θ) = ξ(θ) + W'(θ) under squared error loss. Show the estimator δ Q is Otjp
admissible if (1)
/ exp(λψ(θ) + (1 + λ)W(θ) - γλ(θ)dθ
diverges at both θ and θ. [Define b( ) as in 4.5.
4.5(7) becomes
2b'(θ) - 2(λξ(θ) + (1 + λ)W(θ))b(θ) + (1 + λ)b 2 (θ) ± 0 .]
(2)
(See Ghosh and Meeden (1977). Although an estimator δ
o
may be admissible
α,p
here, it is not clear that it is desirable, whereas for the case W = 0 of 4.5 these estimators are yery natural.) 4.5.7 Let {p Q } be a canonical two dimensional exponential family with Ό
o
W = R . Consider the problem of estimating ξ(θ) with squared error loss (so 2
that R(θ, δ) = E0(||δ(X) - ξ(θ)|| )). Show that the estimator δ(x) = x is admissible. Apply this result when (X-, X«) are independent normal, independent Poisson, independent binomial, or the sample means from Von-Mises variables. [Using the bivariate information inequality leads to replacement of 4.5(7) by (1) where V
2v
b(θ) +
2
2 ab.(θ) b(θ) = Σ —9\θ τ — . If b satisfies (1) so does 1=1 i
APPLICATIONS
129
B(θ) = (2π)" 1 / 2 π Q" 1 b(Q.Θ)dφ 0 Φ Φ
(2) where
/cos φ *
-sin φχ
^sin φ
cos φ'
b is spherically symmetric; hence can be written as b(θ) = β(||θ||)θ . Let t = I |θ| |. (1) becomes (3)
2kβ(t) + 2t3'(t) + t V ( t )
< 0 .
Now let K(t) = t 2 β(t) to get 2K'(t) + K2(t)/t £ 0
(4) in place of 4.5(8).
(Note how the argument fails if k > 2!)] (Stein (1956),
Brown and Hwang (1982, Corollary 4.1).) 4.5.8 Let X ~ Γ(α, σ), problem of estimating σ = -r
α > 0 a specified constant. under the loss function
L(σ, a) = i - l n φ - 1
(1)
.
(See Chapter 5 for a natural interpretation of this loss. 4.11.3 and 4.11.4.) estimator.
e(θ)
(i)
Show that
(ii)
Let 6 Q (x) = £ and l e t 6(x) = (1 + Φ(x))ό Q (x) be any
= Eβ(φ)
and
W(t)
R(θ, 6) - R(θ, δ 0 )
= t - ln(l + t) ,
>
- M i M +
t > -1
.
W(e(θ))
Use (3) to show that <5Q is admissible among a l l estimators having
e(θ) £ B for a l l θ € (-°°, 0 ) . on δ.
See also Exercises
Let
(2)
(3)
Consider the
See Brown (1966).)
(6Q is actually admissible with no restriction
130
STATISTICAL EXPONENTIAL FAMILIES [(i) R(θ, δ) - R(θ, δ Q ) = E θ (-ΘXφ(X) -
= -θe'(θ)/α + E θ (φ(X) - ln(l + φ(X))). For (ii) follow the pattern of the proof of Theorem 4.5. (It is also possible to use (3) to prove δg is admissible with no restriction on δ.)] 4.6.1 Prove 4.6(2).
[Use the information inequality to write
/ h(p)R(p, δ) dp >_ m + /{2h(p) Tr(Vb(p)) + h(p)Tr(J(p)b(p)b' (p))}dp . Integrate by parts the f i r s t term in the integrand in order to get an integral whose integrand is a quadratic in b(p) for each fixed p. grand for each p to get 4.5(2).]
Minimize this inte-
See Exercise 5.8.1 for a statistical
application of 4.5(2). 4.6.2
In preparation for the proof of 4.6(6) prove the following facts: (i) For each K, V e (κ)(p) exists for all but at most a countable number of points, p. Fix
PQ, K
for which Ve, K )(p 0 ) exists. Let δ*(x) =
e*(p) = E θ (p)(<5*( χ )K and D = (d..) = Ve*(p Q ) - V e ( κ ) ( p Q ) . (ii) d.. = 0 , 1
ί ") Let
d
[Since all
ρQ),
Show
i f j , and
p
(
χ
θ
κ
l ϋl± θ(p0) l i - i l> ) |D| = ( | d . . | ) w i t h d . . as above and l e t J = J ( p Q ) be symmetric
p o s i t i v e d e f i n i t e w i t h eigenvalues (iv)
<5/ K N(X;
T r ( J D J " D)
< "
| d . | <_ 1 the e i g e n v a l u e s
have magnitude a t most -r^ . λ m
λ , >. . . . τ-^Tr|D| < λ m
>_ λ
> 0.
Show
——^-^ λmK2
- - and hence diagonal elements - - o f JDJ Then Rv = T r ( J K
E) > T— Tr E " λl
where
-1
APPLICATIONS
131
4.6.3 Also in preparation for the proof of 4.6(6) prove the following matrix inequalities (i)
TrίJA'J^A) _> 0 for any (kxk) positive definite symmetric J
and any (kx k) matrix A. (ii) [(i)
Tr(J(A' + B'JJ'^A + B)) > α TrίJA'J^A) - - j ^
Diagonalize J (and J " ) and then w r i t e out Tr( ) as a sum of individual
terms,
( i i ) follows from ( i ) . ]
4.6.4
Now prove 4.6(6).
[Write the information inequality for δ*.
Substitute Ve*(p Q ) = Ve, κ x(p Q ) + D and use 4.6.2(iv) and 4.6.3(ii).
(Note
that both these inequalities are nearly trivial when k = 1, so in that case the overall proof is much simpler to follow.)] 4.6.5 The inequality 4.6(6) is never sharp (except sometimes in the limit as K •+ °°). To examine how far from sharp the inequality is compare R κ and the best lower bound from 4.6(6) in the case where k = 1, L is ordinary squared error loss, X
N(θ, 1), p = θ, and δ(x) = ax (0 < a <_ 1).
[For a = 1, K = 1, I get R κ = .516 >_ .250 = best lower bound. For a = 1, K = 3 I get R κ = .991 >_ .5625 and for a = 1, K = 10 R κ = .999+^ .891 .] 4.6.6 Prove 4.6(7).
[See 4.6.1.]
4.6.7 Investigate the sharpness of (7) by comparing the Bayes risk for L^ and the bound on the right of 4.6(7) when k = 1, L is ordinary squared 2
error loss, X ~ N(θ, 1), p = θ, and h is a normal (0, σ ) density. (Note: h does not have compact support, but it can be shown (Exercise !) that the tails of h decrease fast enough so that 4.6(7) is still valid.) [When K = «
132
STATISTICAL EXPONENTIAL
FAMILIES
2 2 so t h a t 4 . 6 ( 7 ) reduces to 4 . 6 ( 2 ) the Bayes r i s k i s σ / ( I + σ ) and the lower 2 2 bound i s (σ - l ) / σ . Thus even when K = °° the bound i s not sharp, although 2 i t i s asymptotically sharp as σ ->• °° a l s o . ]
4.11.1 Let δ denote the James-Stein estimator 4 . 1 1 ( 1 ) w i t h r Ξ k - 2 and let 6
denote the corresponding " p o s i t i v e p a r t " estimator 4 . 1 1 ( 6 ) . +
R(θ, δ ) < R ( θ , δ ) .
+
2
[ W r i t e R ( θ , δ) - R ( θ , δ ) = E Q ( g ( | |X| | ) ) . 2
and I S " ( g ) = - 1 , and ( t r i v i a l l y ) E g ( g ( | | x | | ) ) > 0 .
Show t h a t
Note S"(g) = 1
Use Exercise 2 . 2 1 . 1 . ]
4.11.2 Suppose X - N(μ, σ 2 l )
(X € R k ) and, independently,
is desired to estimate μ with squared error loss —
V/σ 2 - x ^ .
2 σ is unknown.
It
Let
2
k :> 3. Let σ = V/m and 6(x
, .Λ.
ιiχiι2/δ2
where 0 £ s( ) <_ 2(k-2)m/(m + 2) and s( , σ ) i s d i f f e r e n t i a t e and nondecreasing f o r each value o f σ . [Assume ( w . l . o . g . ) t h a t σ
= 1.
Show t h a t δ ( x ) i s b e t t e r than δ Q ( x ) = x. Condition on σ
apply 4 . 1 1 ( 5 ) with
r ( ) = σ s( , σ ) ; and take the expectation over σ . Λ
(A f r e q u e n t l y recommend-
Λ
2 2 2 2 ed choice f o r s i s s ( | | x | | , σ ) = m i n ( | | x | | / σ , (k-2)m/(m + 2 ) ) corresponding to 4 . 1 1 ( 6 ) . ] 4.11.3 Let
. be independent r ( α . , σ.) v a r i a b l e s w i t h α. known,
i=l,...,k.
Consider the problem o f estimating σ = ( σ - , . . . , σ k ) with loss 2 function L(σ, a ) = Σ σ . ( l - a . / σ . ) . The best l i n e a r estimator f o r t h i s problem i s δ Q with 6 Q i ( x ) = x^/(a. [Use Theorem 4 . 5 . ]
+ 1).
( i ) When k = 1 t h i s estimator i s admissible.
( i i ) f o r k > 2 define δ by k
(1)
δ.(x)
q
= χ./(α Ί . + 1) + (k-l)α. + 1/ Σ (oj + IΓ/XJ
.
APPLICATIONS
133
Show that R(σ, 6), < R(θ, δQ). (This is the easiest of several interesting related results in Berger (1980b).) [Let φ ^ x ) = (^ + Using Corollary 4.7 show (2)
R(σ, 6 0 ) - R(σ, δ) =
-EtΣ (
L-2 L (α. + 1 Γ
2
since σ..(l - a/σ^
^^ X i
φίX)
^———
α
1
+ —LJ (α. + 1) Z
*
3x
i
φ 1
(
2
= (a/ -θ. -
1// -θ..) .
Then show the expectand on the
(Use the fact that -^r"Φ Ί ( χ ) < 0 to eliminate the
right of (2) is negative.
σX
1
terms i n v o l v i n g φ. -r— φ. . ) ] I
σX
1
4.11.4
Let X. ~ Γ(α., σ ^ , α^ > 0 specified constants, i = l,...,k, as in Exercise 4.11.3. Consider the loss function L(σ, a) =
(1)
Define 6 n by 6 .(x) = x / α . . Mx)
= (1 + Φ i ( χ ) )
δ Λ 1
Σ (a /σ. - ln(a./σ.) - 1) . Ί Ί i= l 1 Ί
(See Exercise 4.5.7.)
Let k >_ 3 and define 6 by
( χ ) where cα.
(2)
φ.(x) 1
=
In x. ]
1 + Σ(α. In
?
x^Γ
with 0 < c <_ 1. Show that R(σ, 6) < R(σ, 6 Q ) , σ > 0. [The unbiased estimator of R(σ, ό) - R(σ, δQ) is x. 3φ i
(The following algebra can be simplified by changing variables in (3) to y. = α In x ,
i=l,...,k.)
Then show this is always positive, using the
facts that |Φi | ± c/2 and t - In (1 + t) <_ 2t 2 /3
for |t| <_h.
(You w i l l see
134
STATISTICAL EXPONENTIAL FAMILIES
that values of c somewhat larger than 1 can also be used in (2).)] See Dey, Ghosh, and Srinivasan (1983). (Change variables in Exercise 4.5.7(3) to σ = -| = ξ(θ), and compare with the i -th term in brackets in (3), above. This identity of expressions is analogous to that which occurs in the estimation of normal means with squared error loss. See 4.11(8).) 4.11.5 Let X ~N(θ, I). Consider the problem of estimating θ € R ε>0
under squared error loss. Suppose for some C < «>, (1) Then δΛx)
(δjU) - x) is
x > 2-k+ε
for
llxll > C .
inadmissible.
[Let 6 9 ( x ) = ό Ί ( x ) - ε [ ( | | x | | - C ) + Λ 1] ι
and use 4.10(4).]
*-*-
llxll
(Note that this generalizes Example 4.11 since δΛx) = x
satisfies (1) when k >_ 3.) 4.15.1 (i)
Show t h a t f o r estimating the natural parameter the corres-
pondence between p r i o r measures and t h e i r generalized Bayes procedures is oneone i f Supp v has a non-empty i n t e r i o r ( i . e . show 6 G = 6 H G = H).
[Use Theorem 4.15 and Corollary 2 . 1 3 . ]
show t h a t t h i s u n i c i t y may f a i l i f (Supp v ) °
a.e.(v)
implies
( i i ) Give an example to
= φ.
4.15.2
Show that every admissible estimator of θ under squared error loss satisfies the monotonicity condition (1)
(x 2 - x χ )
(ό(x 2 ) - 6( X l )) i 0
a.e.(v * v) .
[Use 4 . 1 4 , 4 . 1 5 , and 2 . 5 . (Do not use 4 . 1 6 ( 1 ) f o r t h i s would not y i e l d ( 1 ) f o r x Ί e 9/C.)]
APPLICATIONS
135
4.16.1 Let X - P(λ). Let c Q ± 0. Show that the estimator 6(0) = c Q , ό(x) = In x, x=l,2,... , is not an admissible estimator of the natural parameter θ = In λ under squared error loss, (ό is the "maximum likelihood estimate" of θ; see Chapter 5. Also, the squared error loss function L(θ, a) = (a - θ) can be justified in its own right, or one can transform to λ = e θ
and let b = e a . The loss then takes the form (In b - In λ ) 2 2 = (In (b/λ)) = L*(λ, b). The inadmissibility result, above, then says also that ό*(x) = x is an inadmissible estimator of λ under loss L*. Losses of the form L* appear naturally in scale invariant problems; see Brown (1968).) [Use Theorem 4.16. If 6 is of the form 4.16(1) then, by monotonicity, (1)
In [x] <
λi(x) χ ^ H
<
In ([x] + 1) ,
x > 1 .
Hence λ H (x) -»«as x -> <» but λ..(x) = o(e ε x ) as x -* <», v ε > 0. This is impossible by Lemma 3.5 and Exercise 3.5.1.] 4.17.1 Let X ~ Bin(n, p ) , n >_ 3, and consider the problem of estimating the natural parameter θ = In (p/(l - p)) under squared error loss. Show that the procedure -1 ό(x) = 0 1
x = 0 1 £ x <_ n-1 x =n
is admissible. But, 6 is not generalized Bayes. (Note that Corollary 4.17 is not valid here because 4.17(1) is not satisfied. Of course, Theorem 4.16 is satisfied with H giving unit mass to the point θ = 0.) [Let ό 1 be another estimator. Suppose ό'(0) = -1 + α, α > 0. Then lim
I θ Γ ^ R ί θ , ό 1 ) - R(θ, 6)) = α > 0 .
136
STATISTICAL EXPONENTIAL FAMILIES
Hence R(θ, δ 1 ) ± R(θ, δ), V θ, implies
(i) δ'(0) <_ -1. Similarly (ii)
ό'(n) >_ 1. Among all procedures satisfying (i), (ii) 6 uniquely minimizes R(0, 6). Hence δ is admissible. If 6 were generalized Bayes the prior G would have to have support {0} by 2.5; but this would imply 6(0) = 0 = ό(n).] 4.17.2 Let Z ^ r(α,σ) as in 4.17.3, below, (i) Show that the estimator δ Q (x) = 0 cannot be represented as a generalized Bayes estimator of θ = 1/σ. (ii) For α < 2 show δ Q is admissible, [(ii) If δ M
o
(iii) For α > 2 show δ Q is inadmissible,
then, for some € > 0, R(θ,<5) > £ θ α as θ + 0. (iii) Let
δ(x) = (α-2)/x.]
4.17.3 Let Z ~ r(α, σ ) , α known. Then the distributions of X = -Z form an exponential family with natural parameter θ = 1/σ. Consider the problem of estimating θ with squared error loss. Show that δ(x) = b e x
(1) can It
be represented i n the form 4 . 1 6 ( 1 ) .
(= be" z ) [ L e t H be a Poisson d i s t r i b u t i o n !
can f u r t h e r be shown t h a t δ is admissible when α <_ 2 since i t
minimizes (2)
uniquely
m
H ( { 0 } ) l i m s u pp R(θ, ό ) e ψ ( θ ) θ+00
+
Σ
H({i})R(i, δ ) e ψ ( i )
.]
i i=l
4.17.4 Let {p_} be any exponential family with K compact (Binomial, Multinomial, Fisher, Von Mises, etc.). Show that δ(x) = x is an admissible estimator of θ under squared error loss. [Show that δ is Bayes for the prior 2
distribution, G, with density c exp(ψ(θ) - ||θ|| /2) and that B(G) < «> . Admissibility then follows from basic decision theoretic results. See, e.g. Lehmann (1983, Theorem 3.1). Use Exercise 3.4.1 to verify that B(G) is
APPLICATIONS
137
finite (and also that G is finite).] Caution! δ(x) = x is not a very natural estimator of θ, in spite of its admissibility. Hence its use in this problem is not necessarily recommended (unless the prior G is indeed as above). If Supp v is finite then <5(x) = x is a natural estimator of ξ(θ), and is admissible under squared error loss for estimating ξ(θ). See Exercise 4.5.5 and also Brown (1981b). 4.17.5 Let X ~ P(λ). Consider the problem of estimating
λ under loss
function L(λ, a) = (ln(a/λ)) 2 .
(1)
Show that estimator δ ^ x ) = e x is generalized Bayes, but not admissible.
[The
question is equivalent to asking whether the estimator 6(x) = x is generalized Bayes, or admissible for estimating the canonical parameter, θ, under squared error loss. Reason as in Exercise 4.17.4 to show 6(x) = x is generalized fiayes. However, for estimating θ, direct calculation shows that δ'(x) = bx, e"~ <_ b < 1 is better than δ(x). This inadmissibility result shows that the general result of Exercise 4.17.4 does not extend to problems with K not compact, even when k = 1. (All estimators of the form 6(x) = bx, 0 < b <_ 1, are generalized Bayes for estimating θ. We conjecture that none of them are admissible.)] 4.17.6 1/
Let X ~ N(θ, I). Consider the problem of estimating θ€ R under squared error loss, (i) Let G be a generalized prior density. Show that the generalized Bayes estimator (if it exists) can be written in the form
CD
6G(x) = x
where (2)
g*(x) = / pfl(x) G(dθ)
138
STATISTICAL EXPONENTIAL FAMILIES
( i i ) Consider the l i n e a r p a r t i a l d i f f e r e n t i a l inequality (3)
V
(g*(x) vu(x))
< 0
||x|| > 1
subject to the condition that u is continuous on ||x|| _> 1, and (4)
u(x)
=
1
for
||x||
=
1,
u(x)
£
1
for
||x||>l
.
Show that i f ( 3 ) , (4) have a non-constant solution which also s a t i s f i e s
Mllljll2)
(5)
< °° »
θ € Rk ,
then όg i s inadmissible. [(ii)
Let 6(x) = όg(x) + η j .
Use (5) and Green's theorem to
j u s t i f y an expression l i k e 4.10(4) f o r R(θ, όg) - R(θ, 6) but with an extra term involving a surface integral over {x: ||x|| = 1).
This extra term i s
non-negative because of ( 4 ) , and the remainder of the expression i s nonnegative because of ( 3 ) . the above.
(Note that Exercise 4.11.5 is a special case of
Brown (1971) proves that s o l u b i l i t y of ( 3 ) , ( 4 ) implies inadmissi-
b i l i t y of 6Q
(condition (5) is not r e q u i r e d ) , and conversely
if
is bounded -- and somewhat more generally -- then i n s o l u b i l i t y of ( 3 ) , (4) implies a d m i s s i b i l i t y of 6Q. See also Srinivasan (1981).] 4.17.7
(Berger and Srinivasan (1978).) ( i ) Again l e t X ~ N(θ, I ) and consider the problem of estimating
θ € R under squared error loss.
6(χ)
CD
Suppose +
= x i +
o(—ί-Λ
for two constant kxk matrices B and M. Show that ό is inadmissible unless B = cM for some c € R. [Theorem 4.17 and the representation 4.17.5(1) imply V(ln g*(x))
= ^
l
By considering l i n e integrals over closed paths show t h i s is
impossible
APPLICATIONS
unless B = cM.
139
The calculations are easier i f B and M are simultaneously
diagonalized ( w . l . o . g . ) .
Then when k = 2 the only paths that need be considered
are those bounding sets of the form {x:
x 1 >_ 0, χ« _> 0,
r <_ | |x| | <_ r + ε } . ]
( i i ) Suppose, instead, that X ~ N(θ, I) with % known ( p o s i t i v e d e f i n i t e ) ; and 6 is given by ( 1 ) . M f o r a d m i s s i b l i t y of 6.
Now w r i t e a necessary condition on B and
Does the condition involve %Ί What i f the loss
function i s L(θ, a) = (a - θ )
1
D(a - θ)
f o r some (known p o s i t i v e d e f i n i t e
matrix D? 4.18.1
Verify the assertion in 4.18(2).
[Use Lemma 3.5 and Exercise
3.5.1(2).] 4.20.1 If x ί S as defined in 4.20(1) then / Pθ(x)g(θ)dθ = °°, so that the generalized Bayes procedure for the conjugate prior does not exist at x. [See Exercise 4.19.1.] 4.20.2 Show that K a r l i n ' s condition 4.5(2) implies that S => K°.
(Hence,
i f v(3K) = 0 i t implies that the estimator 4.20(2) = 4.5(1) is generalized Bayes.)
( i i ) Give an example where 4.5(2) is s a t i s f i e d but 4.20(2) = 4.5(1)
is not generalized Bayes. 4.20.3 Let { p Q : θ € 0}
be a stratum of an exponential f a m i l y , as
ϋ defined in Exercise 3.12.1. Suppose it is desired to estimate η
(θ)
r
(θ)
= d). . ζ(2)(θ)
un
der squared error loss.
(Note that in the sequential
setting of 3.12.2(iii) and 3.12.3, η(θ) = EΘ(Y) is a \jery natural quantity to estimate.) State general conditions to j u s t i f y the formal manipulation —
140
STATISTICAL EXPONENTIAL FAMILIES
χ (1)
m
(2)
/ η(θ)e θ * X dθ θ
x
/e * dθ
( 1 )
which says that <5(x) = -^- is generalized Bayes on (K/ ,\)° relative to the x (1) (2) prior measure dθ,-v on 0* = {(θ/.j, θ ( 2 ) ^ θ ( l ) ^ € θ
: θ
(l)€ ^ ( 1 ) ^ *
(The conclusion is justified in the situation of 3.12.2(ii) and in that of 3.12.2(iii) if p Q (N <_ N Q ) = 1, and somewhat more generally.) 4.2Q.4 Generalize 4.20.3(1) to obtain a representation for certain estimators xm +a of the form x u ; +. b• . (2) 4.21.1
Show that 4.21(5) and 4.21(2) imply 4.21(1) and 4.21(3). [4.21(1) is trivial from 4.21(5'). For 4.21(3) reason as in the proof of Theorem 4.19. The key fact is that, with q as defined there,
-B
x
x
u
x
-B
9 Θ
1
-B
, etc.
Now integrate over Θ 2 >...»θ k and let B -» «>. The first part of the expression is bounded because of 4.21(1), (2), and the second part because of 4.21(5').] 4.21.2
(Converse to Theorem 4.19.) Let G be a prior measure whose Bayes procedure for estimating ξ(θ)
exists on S and satisfies δ(x) = αx + β. Suppose S° f φ. Assume further that G possesses a density g satisfying 4.21(1), (2), (3). Then G is a conjugate prior measure, and its conjugate prior density, 4.18(1), has α = l/(λ + 1) and 3 = γ/(λ + 1).)
Apply 4.7(3) to the last integral of the
APPLICATIONS
141
equality (λ+l)/(Vψ(θ))g(θ)eθ χ - Ψ ( θ ) d θ
(1)
= γ/g(θ)eθ χ -ψ( θ )dθ + x/g(θ)e θ * x "*< θ >dθ ,
rearrange terms and invoke completeness to find (2)
vg(θ) = (γ - λVψ(θ))g(θ)
.]
(Diaconis and Ylvisaker (1979) show that this statement is true without this "further" assumption that G possess a density.) (A question of interest is whether this unicity result extends to non-linear generalized Bayes estimators.
To be more precise suppose the
generalized Bayes procedures for estimating ξ(θ) under priors G and H exist and are equal everywhere on S with S° t φ.
Does this imply G = H?
In the
case of the normal distributions or the Poisson distribution the answer is yes.
See 4.15.1 for the normal distribution and Johnstone (1982) for the
Poisson distribution.) 4.24.1
Suppose δ( ) is admissible for estimating ξ under squared error loss. Then v{χ : 6(x) (. K} = 0. [Define δ'(x) as the projection of 6(x) on K. If v{x : 6(x) t ό'(x)} t 0 then R(θ, ό') < R(θ, δ) whenever R(θ, δ) < «> . (If δ is admissible there must exist some θ for which R(θ, δ) < «>.)] 4.24.2 (i) Verify that the conclusion of Theorem 4.24 remains valid when {p Q } is a steep exponential family and Θ c W°. (ii) Even more generally, it Ό
is valid for any one-parameter exponential family i f (1)
Θ c
{θ : EΘ(X) = ξ(θ) € R}
and i f the definition 4.24(1) is modified to
142
STATISTICAL EXPONENTIAL FAMILIES
(2)
I' = {x : v({y: y > x, δ(y) € ξ(N°)°}) > 0 and
ξ({y: y < x, δ(y) e ζ(W°)°}) > 0}
( i i i ) Extend Theorem 4.24 to the problem of estimating ρ(θ) under squared error loss where p: N° -> R is a non-decreasing f u n c t i o n .
[The formulation
and proof are i d e n t i c a l to ( i i ) , above.] 4.24.3
Let v = v^. + Vp where v, is Lebesgue measure on (0, 3) and v^ gives mass 1 to each of the points x = 1,2. Consider the estimator 6 of ζ (under squared error loss) given by
(1)
0 h \h 7h 3
δ(x)
x <1 x =1 1 < x <2 x =2 x >2
(i) Show that δ has the representation 4.24(2) on I = (1,2), but (ii) this representation cannot be extended to the points x = 1,2 even though δ(x) e K° for these points.
(iii) Show that δ is a pointwise limit of a
sequence of Bayes procedures,
(δ is also admissible. See Exercise 7.9.1.)
4.24.4 Let X have the geometric d i s t r i b u t i o n w i t h parameter p ( G e ( p ) ) , under which (1)
PKX = x}
=
p(l - p)x
x=0,l,...
( i ) Show t h a t δ ( x ) = x/2 i s an admissible e s t i m a t o r o f E (X) = ( l - p ) / p under squared e r r o r l o s s .
[Use K a r l i n ' s Theorem 4 . 5 .
Note also t h a t the e s t i m a t o r s
δ ( x ) = ex w i t h c > h f a i l t o s a t i s f y 4 . 5 ( 2 ) and are not a d m i s s i b l e . ]
(ii)
Suppose i t i s known i n a d d i t i o n t h a t p <_ ^ , so t h a t E (X) >^ 1 . , Using Theorem 4.24 show t h a t the t r u n c a t e d v e r s i o n o f δ - - namely
δ
' ( x ) = m a x ( δ ( x ) , 1) —
APPLICATIONS is inadmissible. , ( i i i ) than ό1 ??)
Can you find an (admissible) estimator better
143
CHAPTER 5. MAXIMUM LIKELIHOOD ESTIMATION
5.1 Definition Let φ : R k -> [0, «>] be convex. Define % : R k x R k -• [-o°,«>] by (1)
A(θ, x) = £ φ (θ, x) = θ
x - φ(θ)
For S c N let (2)
£(S, x) = sup U ( θ , x) : θ € S}
and let θ s (x) = {θ e S : £(θ, x) = A(S, x)}
(3)
Note that according to this definition θ~ is a subset of S. We will often abuse the notation slightly by letting θ also denote an element of this set. If φ = ψ is the cumulant generating function for an exponential family then £ φ (θ, x) = log p θ (x)
θ €H
is the log likelihood function on N. (Of course, A.(θ, x) = -«> for θ I N in accordance with the natural convention that ψ(θ) = ~ for θ f. N .) θ € θς(x) is then called a maximum likelihood estimate at x relative to ScW,
A function 6 : K + Θ for which δ(x) € Θ 0 (x) a.e.(v) is called the
(a) maximum likelihood estimator.
This terminology is not always properly
used in the literature; and we will also abuse it, at least to the extent of also referring to the set valued function θ (•) as the maximum likelihood estimator. 144
MAXIMUM LIKELIHOOD ESTIMATION
145
5.2 Assumptions , The main results of this section concern the existence and construction of maximum likelihood estimators, θ. The proofs of these results are based on the fact that ψ is a convex function satisfying certain additional properties, and not otherwise on the fact that ψ is a cumulant generating function.
In Chapter 6 we will want to apply these same existence and
construction results to convex functions, φ, which are not cumulant generating functions. To prepare for this application we now make explicit the conditions on φ which are needed in the proofs of the main results of this section. Let φ : R •*(-<»,«>] be a lower semi continuous convex function. Let N = W. = {θ : φ(θ) < »} .
Such a function is called regularly strictly
convex if it is strictly convex and twice different!able on N% and
Φ (1)
Dpφ
is positive definite on
N°
In the following results we will assume φ is regularly strictly convex.
In some of the following we also assume φ is steep. Note that if
ψ is the cumulant generating function of a steep exponential family then it satisfies these assumptions. Here are some useful facts. Let I = I be defined by 5.1(1), and let the mapping ξ : N -> R , be defined by ξ(θ) = Vφ(θ). Then, ζ is continuous and 1 - 1 since φ is strictly convex. (1) says that the Hessian of ξ = vψ is positive definite. Hence ξ(M°) is an open set; call it R, or R φ .
ξ"Ί( ) is continuous on R.
Theorem 3.6 establishes that (2)
R = K°
when φ = ψ is the cumulant generating function of a minimal steep exponential family. (3)
In particular, in this case R is convex
146
STATISTICAL EXPONENTIAL FAMILIES It will be shown in Proposition 6.7 that (3) is always valid under
the above general assumptions on φ including steepness of ψ. As previously, let θ( ) = ξ ' ^ ). i.e. ζ(θ(x)) = x. (The assumption above of the existence of second derivatives and of (1) is convenient, but can be dispensed with. The other assumptions are required for the following development.) We emphasize again: the following results about i, and maximum likelihood estimation concern the general situation where φ is as assumed above. These results therefore apply in particular to maximum likelihood estimation from minimal 5.3
steep standard exponential families.
Lemma
Assume φ is regularly strictly convex. Then, &( , x) is concave k k and upper semi continuous on R for all x € R . It is strictly concave on N . If Θ Q e N° then (1)
V£( , x ) , θ
(2)
D 2 £( , x ) . θ
where ( Z ί θ g ) ) ^
39 9θ. ' J
φ(θ
Q)
= x - ξ(θ Q ) = -D 2 φ(θ Q ) = -2(θ 0 )
1s
P°sitive
d e f i n i t e
I f x € R (= K°) then
lim £(θ, x) = -«
(3)
IIΘIIHOQ
Proof.
The first assertions are immediate from Assumption 5.2. Equations
(1) and (2) are a direct calculation. The positive definiteness of 2 ( Θ Q ) is a consequence of 5.2(1). Assertion (3) has been proved in 3.6(4) for the case where φ = ψ is the cumulant generating function of a minimal steep exponential family. This proof was needed in order to show that R = K° in such a situation. However we now want a proof valid for arbitrary convex functions, φ, satisfying
MAXIMUM LIKELIHOOD ESTIMATION 5.2(1).
147
This i s easily supplied. Assume x e R, then θ(x) € A/°.
Note using ( 1 ) , (2) that
VA(θ(x), x) = 0, and D 2 A(θ(x), x) is negative d e f i n i t e . 6 > 0,
ε >0
(4)
£ ( θ ,x ) = A ( θ ( x ) ,
<
x ) - ( θ- θ(x))
A ( θ ( x ) , x) - ε
1
Hence f o r some
Z(θ- θ(x))/2 +o(||θ -
for
I|θ - θ ( x ) | |
=
2
θ(x)|| )
δ
I t f o l l o w s t h a t when ||θ - θ ( x ) 11 > δ
(5)
£ ( θ , x)
< A ( θ ( x ) , x)
-
11θ - θ n | | δ
5_ε
by (4) since A(θ(x) + (δ/(||θ - θ(x)||))(θ - θ(x))) < (1 - δ/||θ - θ(x)||)£(θ(x), x) + (δ/||θ -θ(x)||H(θ, x) by convexity.
(5) implies (3).
||
(We note that the positive definiteness of I is not really needed to establish (3). It is only necessary that the conclusion of (4) be valid -i.e. for some 6 > 0, ε > 0 (4 1 )
Jt(θ, x) < λ(θ(x), x) - ε
for
||θ - θ(x)|| = δ
This condition follows whenever λ( , x) is a strictly concave function which assumes its maximum at θ(x).) It is useful to now prove the following lemma. This result is used in Theorem 5.5 to show that θ 0 c W° when 0 is convex. 5.4
Lemma Assume φ is steep and regularly strictly convex. Let
θ χ € N - W°, Θ Q € M°. Let θ = Θ Q + p(θ χ - Θ Q ) , 0 < p < 1. Then
148
STATISTICAL EXPONENTIAL FAMILIES
(1)
11m ( |^ £(θ p , x))
Hence there is a p1 < 1 such that (2)
Proof.
a(θ p I , x) > ϋ(Qv x)
From 5.3(1) | ^ A ( θ p f x)
=
as p t 1 because ψ i s steep.
(ΘQ - θ χ )
(x - ξ ( θ p ) )
+
- -
This proves (1) from which (2) is immediate.
(In case ψ i s regular, i . e . N = W°, then l i m Λ(θ , x) = -°° c o n t i n u i t y , which can also be used to prove ( 2 ) . )
by upper semi-
||
FULL FAMILIES Here is a fundamental result concerning maximum likelihood estimation. It follows easily from the above. 5.5 Theorem Let φ be steep and regularly s t r i c t l y convex. (I)
θ w (x)
=
{θ(x)}
c
I f x € R then
A/°
In other words, θ.,(x) consists of the unique point θ = θ(x) satisfying N I (I ) ξ(θ) = x € R I f x £ R then θ.,(x) is empty.
(Recall that i f Φ = Ψ is the cumulant generating
function of a steep canonical exponential family then R = K°.) Proof.
For any x, {Qjχ)}c
Λ/°
by v i r t u e of Lemma 5.4.
Any maximum
l i k e l i h o o d estimator must thus be a local maxima of £( , x) and hence must satisfy
MAXIMUM LIKELIHOOD ESTIMATION
149
V£( , x ) , ~ = 0 This implies ( Γ ) by 5.3(1). Furthermore, the solution to (I 1 ) is unique if it exists, and it exists if and only if x £ R = ζ(W°). Remarks.
||
Maximum likelihood estimation is defined in statistical theory for
a general parametric family of densities {f Q : θ € Θ} by θ(x) = {θ € Θ : f Ω (x) = sup f (x)}. Note that this definition is invariant θ
α
α
under reparametrization. Thus, if ξ = ξ(θ) is a 1 - 1 map on 0 the maximum likelihood estimate of the parameter ξ € ξ(θ) is ξ(θ). Accordingly, Theorem 5.5 says that for minimal steep exponential families x = ξ(θ(x)) is the unique maximum likelihood estimator of the mean value parameter ξ = ξ(θ) at x £ K°. To emphasize, in terms of the mean value parametrization the maximum likelihood estimator is determined by the trivial equation (1")
ξ(x) = x ,
xe r
For the present, (1") is valid if and only if x € K°. This set of course contains almost every x(v) if and only if (2)
v(K - O
= 0
.
Note that (2) is satisfied i f v is absolutely continuous with respect to Lebesgue measure.
I t is never satisfied i f v has f i n i t e support or,
more generally, has countable support and K ^ R . In the last part of Chapter 6 we expand such exponential families so that (1") usually remains valid for a.e.x (v). (Since ζ = EΩ(x) equation (1") also defines ζ(x) = x as the classical method-of-moments estimator. Thus for the mean value parametrization the maximum likelihood and method-of-moments estimators agree.) Suppose that X-,...,X
are independent identically distributed
random variables from the exponential family { p θ ) . Then, as noted in 1.11(2),
150
STATISTICAL EXPONENTIAL FAMILIES
the distributions of the sufficient statistic X = n -1 Σ X
also form an
exponential family with natural parameter α = nθ and cumulant generating function nψ(α/n).
It follows that α(x) = nθ(x). So, the maximum likelihood
estimator of α based on X is nθ(X ) and the maximum likelihood estimator θ/ i of θ = α/n based on X is (3)
θ(n)
5.6 Examples
= Sc/n = θ(x n )
.
(Beta Distribution)
For a variety of common full families the above remarks lead to easy calculation of the maximum likelihood estimator. These are situations such as those mentioned in 3.8 where the mean value parametrization has a convenient form. For example if Y,, Y2>
>Y n are i.i.d. multivariate normal
(μ, %) random variables then the maximum likelihood estimators for μ and -in
n
Ί
μμ 1 + t are, respectively, Y = n
Σ Y, and n Σ Y.Y! . This leads to the i=l Ί i=l Ί Ί conventional maximum likelihood estimates
(1) t = S = n" 1 Σ(Y i - Ϋ)(Y. - Y ) 1 For the Fisher - Von Mises distributions the result of Theorem 5.5 is not so easy to implement. See 3.8. Another not so convenient, but important, family is the beta family, which will now be discussed. Consider the family of densities (2)
f Ay)
= B'^α, β)ya'l(l - y ) 3 " 1 ,
0 < x < 1,
α > 0,
3 > 0 .
ot,p
realtive to Lebesgue measure on (0, 1), where B = B(α, β) denotes the beta function, (3)
B(α. 3) =
T
Γ(α + β)
MAXIMUM LIKELIHOOD ESTIMATION
151
This is a two parameter exponential family with canonical parameters (α, 3) e N = (0, °°) x (0, °°). The corresponding canonical statistics are (4)
xχ
= log y
x2
= log (1 - y)
In this case the canonical parameters themselves have a convenient statistical interpretation since (5)
E(Y) = α / ( α + β ) ,
E(l - Y) = β/(α + β)
Var(Y) = αβ/(α + 3) 2 (α + β + 1) = Var (1 - Y)
.
The mean value parameters are somewhat less convenient. One has (6)
ξ 2 (β. α) = ξ,(α, β) = B " 1 ^ , .β) /(In y)ya'l(l ά ι 0 T - (In B(α, g j
=
3α
M k
=0
α+3+k;
- y)*'ldy
Γ'(α)
Γ'(α + 3)
Γ(α)
Γ(α + β)
- -ΐ i w :
k Q
B
i
(α+k)(α+3+k)
and 3-1 ξ Ί (α, 3) = - Σ 1 (See e.g. Courant and Hubert (1953, (7)
, -irif α+κ p.499)).
3 = 1,2,...
Suppose Y-,...,Y are i.i.d. beta variables, and X.., X^^ are defined from Y. through (3), i=l,...,n . Then the maximum likelihood estimates of (α, 3) can be found numerically by solving (8)
ζj(α, 3) = Xj
i = 1. 2
n
from (6), where X. = n~ 1 Σ X... An exact solution appears to be unavailable, J1 J i=l except when α,3 turn out to be integers so that (7) applies. According to Theorem 5.5, the solution to (8) exists if and only if X € K°. Now,
152
STATISTICAL EXPONENTIAL FAMILIES K = conhull ίln y, In (1 - y) : y e (0, «)}
Since {In y, In (1 - y) : y € (0, 1)} is strictly convex in R this solution n 2 therefore exists if and only if n > 2 and Σ (Y. - Ϋ) > 0. The event 1= 1 2 " Σ (Y. - Y) = 0 occurs with zero probability when n >_ 2; hence the maximum i =l Ί likelihood estimate exists with probability one when n >_ 2. NON-FULL FAMILIES We now proceed to discuss the existence and construction of maximum likelihood estimators when Θ c W. Here is an existence theorem. 5.7
Theorem Let φ be steep and regularly strictly convex. Let 0 c N be a non-
empty relatively closed subset of A/. Suppose x e R. Then θ Q (x) is non-empty. Suppose x € R - R. Suppose there are values x. e R, i=l,...,I, and constants β. < °° such that (1)
Θ c
I y H~((X - x.), β ^
.
Then Θ 0 (x) is non-empty. Remark.
See Exercises 5.7.1-2, 7.9.1-3, and Theorem 5.8 for more infor-
mation about the theorem.
In particular, (1) implies x (. (ζ(Θ))~. See
Figure 5.7(1) for an illustration of 5.7(1). Proof.
Let x e R.
£( , x) is upper semi-continuous and satisfies 5.3(3).
Hence &( , x) assumes its supremum over Θ. But £(θ, x) = -°° for θ € (δ - Θ) C H - M.
It follows that Θ 0 (x) is non-empty.
Suppose x € R - R and (1) is valid. Then for each θ € Θ there is an index i for which θ € ίΓ (x - x., β.). For this index (2)
£(θ, x) = θ
(x - xΊ ) + θ
x Ί - ψ(θ) i
β 1 + θ • x i - ψ(θ)
.
MAXIMUM LIKELIHOOD ESTIMATION
153
0
Figure 5.7(1): An Illustration of 5.7(1) showing R, x € R - R, Θ
2 u H~((x - xΊ. ) , β.) Ί i= l
and ξ(Θ) . It follows that Λ,(θ» x) £
sup {$. + θ
x Ί - ψ(θ) : 1 < l < 1} ->
a
s | |θ| |
θ € Θ by 5.3(3).
The second assertion of the theorem follows from (2) as did the
first from 5.3(3).
||
CONVEX PARAMETER SPACE When Θ is convex one gets a better result, including a fundamental equation defining the maximum likelihood estimator.
154
STATISTICAL EXPONENTIAL FAMILIES
5.8 Theorem Assume φ is as above. Suppose Θ is a relatively closed convex subset of W with Θ n N° f φ. Then θ (x) is non-empty if and only if x 6 R (= K°) or x € R - R and Θ <= H" (x - x r β χ )
(1) for some Xj € R, pj € R.
((1) is the same as 5.7(1) with I = 1.)
If θ 0 (x) is non-empty then it consists of a single point. This is the unique point, θ e Θ n N° satisfying (2)
(x - ξ(θ))
(θ - θ) >_ 0
V θ €Θ
(An alternate form of (2) when x - ξ(θ) t 0 is (2 1 )
Θ
c H" (x - ξ(θ), (x - ζ(θ))
θ)
.)
(Note t h a t i f θ ( x ) € Θ t h e n , o f c o u r s e , θ ( x ) = { θ ( x ) } and θ = θ(x) trivially satisfies Proof.
( 2 ) . See 5 . 9 f o r i l l u s t r a t i o n s
£( , x) i s s t r i c t l y concave on hi and hence can assume i t s maximum
a t o n l y one p o i n t o f the convex s e t Θ. Suppose (2) i s s a t i s f i e d . (3)
£ ( θ , x) - £ ( θ , x)
=
(θ - θ)
+ (θ - θ)
=
(θ - θ)
w i t h e q u a l i t y i f and o n l y i f θ = θ. Λ
o f( 2 ) . )
/S
Furthermore, θ f l c w° by Lemma 5 . 4 . Then f o r θ € Θ (x - ξ ( θ ) )
ξ ( θ ) - (ψ(θ) - ψ ( θ ) )
(x - ξ ( θ ) ) + A ( θ f ζ ( θ ) ) - £ ( θ , ξ ( θ ) )
U ( θ , ξ(θ)) - £(θ, ξ(θ)) > 0
when
Λ.
θ t θ since θ = θ(ξ(θ)) is the unique maximum likelihood estimator over N corresponding to the observation ξ(θ) .) Hence (2) implies that θ Q (x) = {θ}.
MAXIMUM LIKELIHOOD ESTIMATION
155
On the,other hand, suppose (4)
(x - ξ(θ Q ))
(Θ Q - θj) < 0
for some θ Q , θ χ € 0
.
Then θp
= Θ Q + p(θ 1 - Θ Q ) € θ
for 0 _< p £ 1 since Θ is convex. Then ^*(θp, x)|p=Q
= ( x - ζ(θ 0 ))
(θ 1 - Θ Q ) > 0 .
Hence £(θ , x) > £(Θ Q , x) for p > 0 s u f f i c i e n t l y small; and ΘQ cannot be the unique maximum l i k e l i h o o d estimator.
I t follows that the unique maximum
l i k e l i h o o d estimator i f i t e x i s t s , must s a t i s f y ( 2 ) . F i n a l l y , i f x € R or (1) is s a t i s f i e d then θ β i s non-empty by Theorem 5.7. Conversely, i f θ € θ ξ(θ)
is non-empty then θ s a t i s f i e s ( 2 ) . Hence ξ =
€ R and
(x - ζ)
θ
<_ (x - ξ)
by (2) so that (1) is satisfied with x χ = ξ. 5.9
θ
||
Construction The criterion 5.8(1) is particularly easy to apply if
0 = (ΘQ + L) Π W for some linear subspace, L. This is because the vectors {(θ - θ) : θ € L} will then span L. Thus, by (1), in order to find θ one need only search for the unique point θ* 6 0 for which x - ζ(θ*) l L. This process can be viewed from two slightly different perspectives. Because of its importance we illustrate both these perspectives in the simplest case where Θ Q + L is a hyperplane. Thus, consider the case where 0 = H n W with H a hyperplane, say H = H(a, α ) . Let x e R. (The same construction also works for x € R - R if 5.7(1) is satisfied.) To find θ (x) one may proceed from θ(x) along the curve {θ(x + pa) : p 6 R} until the unique point at which θ(x + pa) € 0. This point is θ. The process is illustrated in Figure 5.9(1).
156
STATISTICAL EXPONENTIAL FAMILIES An alternative procedure is to map Θ n A/° into R as ξ(Θ Π A/°).
Then proceed along the line {x + pa : p € R} until the unique point at which x + pa € ξ(Θ n N°). This point is x = ξ(θ). This process is illustrated in Figure 5.9(2).
x+pa
Figure 5.9(1):
First construction of θ
Θ
Figure 5.9(2):
Second construction of θ
There are useful paradigms available also for the case where θ is an arbitrary relatively closed convex set.
These are described in 5.13.
The entire process illustrated above may also be viewed from a different perspective.
Θ is contained in a proper linear subset of R .
Hence the densities {p θ : θ e 0} form an exponential family which is not minimal.
This non-minimal family can be reduced by sufficiency and reparame-
trization to a minimal family of dimension k1 < k.
Let (φ-,. - - ,φ. ,) and
MAXIMUM LIKELIHOOD ESTIMATION
(yi>
J Y ^ I ) denote the natural parameters and corresponding observations
this family. H(a,
157
in
(They are formed by p r o j e c t i n g θ and x, r e s p e c t i v e l y , onto
α) or any t r a n s l a t e H ( a , $) . )
This family w i l l
have log-Laplace
transform ψ*(φ) = Ψ ( θ ( φ ) ) , and the m . l . e . , φ, s a t i s f i e s 5 . 5 ( 1 ) —
Φ(y)
=
i.e.
Φ(y)
where φ(y) is the inverse to ξ*(φ) = Vψ*(φ)
Thus
θ(x) = Φ(y(x)) . These remarks can be used to yield a very simple proof of Theorem 5.8 in the special case where Θ = ( θ Q + L) n W.
They also provide a method of
easily constructing the maximum likelihood estimate in many such cases.
Here
are two examples. 5.10a
Example Consider the classical Hardy-Weinberg situation described in
Example 1.8.
( X j , X2> X^) is multinomial (N, ζ) with expectation
2 2 ξ = N(p , 2pq, q ) ,
0 < p = 1-q < 1. This is a three-dimensional exponential
family with two dimensional parameter space Θ = {θ:
= β j d . l . l ) + B 2 (2,l,0) + ( 0 , In 2, 0)} = H ( ( l , - 2 , 1 ) , -2 In 2 ) .
(This family is not minimal.
This fact affects but does not hinder the
reasoning which follows.) Reduction to a minimal exponential family yields a one-parameter exponential family with parameter φ = 2θ 1 + θ 2 and natural observation y = 2x 1 + Xp.
(Θ is two-dimensional but yields a family of only order one
since the original family was not minimal.) (1)
E(Y)
=
Note that
N(2p2 + 2pq)
= 2pN
Hence 2x
(2)
P
=
2N
=
+ x?
2N
°
< y
< 2N
.
158
STATISTICAL EXPONENTIAL FAMILIES
Correspondingly, ξ = N(p , 2pq, q ) and θ can be defined from θ.. = ^ 3, £ R.
+ In ξ..,
(Note that θ is a line rather than a single point because the original
representation of the multinomial family was not minimal.) The simplicity of (1) is the special fact which enables the preceding construction to proceed so smoothly. behave similarly.
Many other multinomial log-linear models
Classes of such models are discussed in Darroch, Lauritzen,
and Speed (1980) and in Haberman (1974).
Here is a useful example.
5.10b Example Consider a 2χ2χ2 contingency table. denoted by y-jj k >
i , j , k = 0 , 1 . They are multinomial (N) variables with
respective probabilities π. ., . such a table.
The observations w i l l be
There are various useful log-linear models for
The derivation of maximum likelihood estimates for such models
provides a useful and illuminating application of the preceding theory.
Here
we consider the model in which responses in the f i r s t category (corresponding to index i ) are conditionally independent of those in the third category given the level of response in the second category.
This model illustrates
several characteristic phenomena, and allows for direct and explicit maximum likelihood estimates of the parameters TΓ. .. . In order to write the model in customary vector-matrix notation, let π
£
z% = y. =
π
iik'
k
where 1= 1 + i + 2j + 4k
*"et ^ ° 9
π
^ (^eno'te ^
e
(1 <_ I <_ 8 ) , and, similarly,
v e c t 0 Γ
with coordinates log π. , λ = l , . . . , 8 .
Let 1 1 D1
=
-1 1
1 -1
-1 -1
1 1
-1 1
1 -1
-1 -1
1
1
1
1
-1
-1
-1
-1
1
-1
-1
1
1
-1
-1
1
1
1
-1
-1
-1
-1
1
1
1
1
1
1
1
1
1
1
The log-linear model of interest here has (2)
θ* = (log π) = DB,
3 e R6 .
MAXIMUM LIKELIHOOD ESTIMATION
159
In order to normalize π one must choose 3g so that
8 Σπ
(3)
= 1
.
The r e s u l t i n g multinomial family i s an 8-parameter exponential family. canonical s t a t i s t i c can be reduced via s u f f i c i e n c y . θ* z = 3'D'z = β'x*.
Its
Let x* = D'z so that
Furthermore x£ = N with p r o b a b i l i t y one.
Hence x € R
with ( x 1 , . . . , x 5 ) = ( X j , . . . , x £ ) is a s u f f i c i e n t , canonical s t a t i s t i c .
The
corresponding canonical parameter is θ € R with ( θ * , . . . , θ g ) = ( β , , . . . , β r ) . I t can be checked t h a t t h i s l o g - l i n e a r family i s characterized by the conditional independence of responses i n categories 1 and 3 given level of response i n category 2. noting that i f i t Γ ,
l
n
The conditional independence can be checked by k ϊ k1 then (2) y i e l d s
V j V
+
l n τ τ
ijk
From t h i s i t follows that π . + π..^
=
l n
= π+
k
π
i'jk
+
Ί n
π
ijk'
ττΊ ή+» which implies the desired
conditional independence. By Theorem 4.5 x is the maximum l i k e l i h o o d estimate of E(X) = ξ ( θ ) . Thus, (log TT) = D $ ( x ) i s the maximum l i k e l i h o o d estimate of (log π) = θ * w i t h β(x)
= ( e 1 ( x ) . - . . » B 5 ( x ) . 3 6 ( β ) ) ' where 3 6 ( ) i s determined by ( 3 ) . The r e l a t i o n between ξ(θ) and π(θ) is easy to determine via simple
calculations such as ξ χ = EfX^
= Σ ( - l ) 1 E ( y i j k ) , etc.
These y i e l d
(4)
Thus
(5)
Σ(-l) Ί 'y i j k
- xj
= NΣί-l) 1 ^.,
,
etc.
160
STATISTICAL EXPONENTIAL FAMILIES From these relationships and the structure of D it is possible in
this case to give explicit expressions for ί ^ ^ } in terms of ί y ^ } -
Let
a "+" replacing a subscript denote addition over that subscript. Thus, τr1 +u + = Σ π.. . Simple manipulation based on (3) and (5) yields 1Jk Nπ1++ (6)
The conditional independence properties yield
£ Hence Nπ
ijk
FUNDAMENTAL EQUATION
5.11
Definition For ΘΛ 6 0 c R define VQ(ΘQ)> the set of (outward) normals to 1/
Θ a t ΘQ e Θ, t o be the s e t o f a l l δ e R (1)
δ
( θ 0 - θ)
satisfying
>. 0 + o ( | | θ Q - θ||)
v
θ eQ
V is obviously a convex cone, and can easily be shown to be closed. Note that if Θ Q e int Θ then V Θ (θ Q ) = {0}. If Θ Q is an isolated point of Θ then V 0 (θ Q ) = R . If Θ is a different!able manifold with tangent space T at Θ Q then V Q ( Θ Q ) is the orthogonal complement of T — i.e., V 0 (θ o ) = {δ: δ
k
τ = 0 V τ e T}. Here V 0 (θ Q ) is a linear subspace of R .
If Θ is convex and θ Q 6 bd 0 then V 0 (θ) = {δ :Θ c fi" (δ,δ θ Q )} .
MAXIMUM LIKELIHOOD ESTIMATION 5.12
Theorem Assume φ is steep and regularly s t r i c t l y convex.
relatively closed subset of N. (1) Let θ € θ (x) c Λ/°.
(2)
V0(θ)
€
Note that
vθU(θ, ζ(θ))|θ=§ = 0
and x - ξ ( θ ) = 0 when θ = θ. 0
< A(θ,x) - £(θ,x)
.
Hence, as i n 5 . 8 ( 3 ) =
(θ - θ)
( x - ξ(θ)) + A(θ, ξ(θ)) - A(θ, ξ(θ))
=
(θ - θ)
( x - ξ ( θ ) ) + o ( | | θ -θ | | )
Thus, by definition, (1) is satisfied.
||
Note that the theorem does not require x € R (= 5.13
Let Θ be a
Then for any θ € Θ 9 (x) n N° x - ξ(θ)
Proof.
(3)
161
K°).
Construction
The fundamental equation, 5.8(1) or 5.12(1), can be used to picture the process of finding a maximum likelihood estimator, by an extension of the process pictured in 5.9. Fix x e R k . Suppose it is desired to locate Θ 0 (x).
If 0 n W° / φ
one should first check to see whether x € ξ(Θ n N°). If so, then θ(x) = θ β (x). /\ If not, then Θ 0 (x) c bd Θ. To see whether a given θ Q e bd Θ n W° can be an /\ element of θ first locate Θ, Θ Q , X , and x Q = ζ(θ Q ) on their respective graphs. Then carry a vector δ pointing in the direction of x - XQ over to ΘQ in order to check whether δ is an outward normal to 0 at θg. If so, then Θ Q is a candidate for θ. In fact, if Θ is convex {θ Q } = Θ 0 (x).
If Θ is
not convex one must search over bd Θ for all such candidates, then examine β(θ, x) at each of them to eliminate those which are not global maxima. (If φ is not regular and Θ is not convex one needs also to search over
STATISTICAL EXPONENTIAL FAMILIES
162
Θ n (W - W 0 ).) The process is illustrated in Figure 5.13(1).
Figure 5.13(1):
θ Q and
are candidates for Θ 0 (x).
θ 2 is not.
If bd Θ is a curve as in Figure 5.13(1) then this process is relatively convenient. Otherwise, it is usually less convenient to search over all of bd Θ for the set of candidates. An alternate picture can also be constructed.
In this picture one
constructs for each θ € Θ the collection of points in X space for which θ can possibly be the maximum likelihood estimator. In order to construct this picture one locates θ € bd Θ and draws the unit outward normal(s), 6, to θ. One then maps θ to ξ(θ) and carries the vector(s) 6 directly over to X space. The corresponding line or cone with vertex located at ξ(θ) θ
χ
is the locus of values of x for which θ e β ( )
ΊS
a
possibility. Again,
if x falls in more than one such locus then &(θ, x) must be separately examined at all such θ. This process is illustrated in Figure 5.13(2).
MAXIMUM LIKELIHOOD ESTIMATION
Figure 5.13(2):
163
C. is the locus of points, x, for which θ. can possibly fall in Θ 0 (x).
5.14
Example The curved exponential f a m i l y described i n Example 3 . 1 2 provides a
p a r t i c u l a r l y elegant instance o f the above c o n s t r u c t i o n .
The family i s a two-
parameter standard exponential family with θ ( λ ) = ( - λ , - I n λ ) 1 , and Θ = { θ ( λ ) : λ > 0} c M =
(-co, o) x R, and
ψ(θ) = l n [ ( e θ l T - l ) / θ 1 + e θ i T + θ 2 ] .
K = conhull { ( 0 , 0 ) , ( T , 0 ) , ( T , 1)} .
Then,
ζ(θ(λ))
1
K a n d ξ ( θ ) on a s i n g l e p l o t .
p"λT
λT , e"λl).
Figure 5 . 1 4 ( 1 ) shows both Θ and
T h e r e i s no o v e r l a p s i n c e Θ cz { ( θ , , θ p ) : θ , < 0 }
a n d K cz { { xy y x 22): ) : Xj >_0}.
The tangent space to θ(λ) is spanned by (-1, -1/λ) 1 .
Hence
V 0 (θ(λ)) is s the line {p(l, - λ ) : p e R}. The locus, C(λ), of points p x for which θ(λ) can be the maximum likelihood estimator is the line
STATISTICAL EXPONENTIAL FAMILIES
164
, --λT (2)
C(λ)
=
ίξ(θ(λ)) + p ( l , -λ):
{(0,
1) + σ ( l , - λ ) :
as can be seen by letting σ =
p e R} =
+ p,
σ € R} - e
-λT + p.
Formula ( 2 ) reveals that the
loci C(λ) are s t r a i g h t lines through the point ( 0 , 1 ) . It or ( T , 1 ) . (0,
e " λ T -λp): p € R}
Again, see Figure ( 1 ) .
can be seen from Theorem 5.7 that θ ( x ) f φ unless x € K is ( 0 , 0)
(Applying 5 . 7 ( 1 )
for points on the i n t e r i o r of the l i n e j o i n i n g
0) to ( T , 1) requires the choice 1 = 2 .
Of course, these points occur with
p r o b a b i l i t y zero, so i t ' s not worth the e f f o r t ! )
Since the loci C(λ) i n t e r s e c t
only a t ( 0 , 1) f. K i t follows from ( 2 ) that i f x t ( 0 , 0) or (T, 1) then θ Θ (.x) is the single p o i n t , θ ( λ ) , f o r which x € C ( λ ) . If
x = ( 0 , 0) or (T, 1) then θ ( x ) = φ since neither of these points /\ l i e s i n U C ( λ ) . (That θ ( x ) = φ i n this case can also be seen by applying the λ€R f i n a l part of Theorem 5.8 to the parameter set consisting of the convex hull of θ).
Figure 5.14(1): Illustrating the construction of θ(x) via construction of loci C.
MAXIMUM LIKELIHOOD ESTIMATION
165
The original description of this example involves a single observation, X, which can take only values in (0 x [0, T]) u {(T, 1)}. -in
However, if one observes X n = n n
Σ X_ where X . are n i.i.d. variables each Ί i =lΊ
with the given distribution, then X p can take values over more of K. This problem has natural parameter θ* = nθ and log Laplace transform ψ*(θ*) = nψ(θ*/n).
It follows that N9K and ξ(θ) are as before. Θ undergoes a simple
transformation.
It is easy to check that the above picture applies equally
well to this problem, for which various values of x € K° are possible. See also Proposition 5.15. From (2) one sees that the maximum likelihood estimator of λ is (3)
λ = (1 - x 2 )/x χ . In terms of the original motivation for this problem the parameter
1/λ is the mean value (= mean lifetime) of the exponential variable Z.
-x 2 )
Thus,
nxΊ
ι
n In this problem nx1η = Σ Y. = "total time on test", and n(l - xά 9 ) = (number of i = lΊ observations < T) = "number of objects failing before truncation". This supplies the familiar expression for this problem: (3")
/\ (1/λ) =
total time on test number of objects failing before truncation
Note that the value of T does not appear in (3"). This fact has been commented on and exploited by Cox (1975) and many others. It has been noted that the differentiate subfamily treated in this example is a stratum within the full two parameter family.
It is really this
fact which explains the elegance of the above construction and of Figure 5.14(1). See Exercise 5.14.1 - 5.14.3. In general the maximum likelihood estimate for an i.i.d. sample
166
STATISTICAL EXPONENTIAL FAMILIES
is determined exactly as that from a single observation. The latter part of Example 5.14 mentions one special case of this. It is worthwhile to formally note this fact. 5.15 Proposition Let X-,... ,X be i.i.d. random variables from a standard exponential family {p n : θ € 0} . Let θ^ denote the set of maximum likelihood estimators of θ € 0 on the basis of a single observation. The maximum likelihood estimator of θ € 0 based on the sample 1
n
X Ίl>...,X nn is a function of the sufficient statistic, Xn = n .Σ X. . Let =1 l θ ^ ( ) denote this function of X n
θ< n ) (ϊ) = θΘ("x)
(1) Proof.
Then .
The cumulant generating function for the sufficient statistic
S = nX is nψ(θ). The proposition follows from the fact that £ n ψ (θ, s) = θ
s - nψ(θ)
= n(θ s/n - ψ(θ)) = n£ ψ (θ, s/n)
,
since this shows that &niκ( > s) is maximized i f and only i f I ( , s/n) is maximized.
II
MAXIMUM LIKELIHOOD ESTIMATION
167
EXERCISES 5.6.1 Verify formula 5.6(6). 5.6.2 The multivariate generalization of the beta distribution is the Diriohlet θ
0
=
V(a), d e f i n e d a s f o l l o w s : k-1 1=l. ..k; γk = Ί " Σ V
distribution,
k .Σ V
γ
i
>
°>
k _> 2 ; t h e
θ
> 0, i = l , . . . , k ,
d i s t r i b u t i o n has
density with respect to Lebesgue measure over the allowable { ( y ^ . ^ y , , Ί ) }
(l)
fθ(y)
Γ(θ n )
=
k
°
k
(Θ.-1) Ί
πy
π Γ(Θ.) 1 i=l
> i Ί=1
.
This is a k-parameter exponential family with canonical s t a t i s t i c X. = In Y. . (1) (ii)
E(Y ) l
Describe K. Verify the standard formulae:
= θ./θ n i
Var(Y.)
ϋ
l
(θ π -θ.)θ. = —-—•—— 2
(2) θ.θ. Cov (Y Ί , Y.)
(iii)
=
-
Ί
j
2
D e r i v e f o r m u l a e f o r E(X.)
analogous t o 5 . 6 ( 6 ) , ( 7 ) . s
(iv)
j=l,...,£.
Let 1 = s n < . . . < s
= k and d e f i n e z . =
=
i ζ
j Σ
+1
Y
,
Show that Z has a P(θ') distribution, and describe θ 1 in terms of
θ. (v) variables.
Let Y ' 1 ^ , i = l , . . . , n be independent, k-dimensional ^ ( θ ^ 1 ^ )
-1 n ( i ) Verify that the d i s t r i b u t i o n of n Σ Yv '
is
168
STATISTICAL EXPONENTIAL FAMILIES
5.6.3 Let XΊ , i = l , . . . , k, be independent Γ(α., 3) variables. Describe k the conditional distribution of the variable (X Ί S ..-,X b ) given Σ X. as a 1 κ i=l Ί multiple of an appropriate Dirichlet variable.
(Note the partial analogy
between the situation here and that in Example 1.16. Note also that the situation here was described from another perspective in Exercise 2.15.1.) 5.6.4 The following is a valid statement:
the k-dimensional Dirichlet
distributions form the family of (proper) conjugate priors for the parameter ( p . , . . . ,p. _-) of a k-dimensional multinomial distribution.
Relate this
statement to the general theory of Sections 4.18-4.20, and describe (in terms of the Dirichlet parameters) the posterior expectation of p given the multiθ.
nomial. observation.
k-1
θ
[Let p. = e / ( I + Σ e ), e t c . ] 1 j=l
(This conjugate relation between Dirichlet and multinomial distributions has an i n f i n i t e dimensional generalization in which the Dirichlet distribution is replaced by a "Dirichlet process" and the multinomial distribution is replaced by a distribution over the family of cumulative distribution functions on [ 0 , 1]. See Ferguson (1973) and Ghosh and Meeden (1984).) 5.7.1
(i)
Show that 5.7(1) implies x £ (ξ(θ))~.
(ii) Show the converse is not valid by constructing an example in which φ = ψ, R = K° is not strictly convex, x £ (ξ(Θ))~, and 5.7(1) fails.
(I believe no example exists when R is strictly convex. See
Exercise 7.9.2 which shows that when R = K° is strictly convex and x £ (ξ(θ))~ then θ(x) t φ.) [(i) x £ (ξ(H"((x - x Ί ), £.)))" for x. € R, x e R - R. (ii) Let v give mass 1 to each of the four points (+ 1, +1). Let x = (1, 0)
MAXIMUM LIKELIHOOD ESTIMATION
169
and Θ = {(t, 2): t € R} .] 5.7.2 Construct examples in which φ = ψ is steep, R = K°, x e (ξ(θ))~, and (i) θ(x) = φ, (ii) θ(x) f φ. [For both examples let v be the uniform 7 2 distribution on the ball {x: (xΊ - 1) + Σ x.} plus a point mass at 0. For 1
(i)
i=2
l e t 0 = ίθ: θ = (α, 0 , . . . , 0 ) } .
every
Ί
For ( i i ) l e t 0 ={θ : ψ(θ) = 3 } .
u n i t vector v + e , there i s a unique η ( v ) > 0
As v -» e - 9 η ( v ) -* « and hence ξ ( η ( v ) v ) -> 0 .
For
such t h a t ψ ( η ( v ) v ) = 3.
Hence 0 £ ( ξ ( θ ) ) " . ]
5.8.1 Let {p . θ e 0} be a standard one-parameter exponential family. u Suppose ξ ( 0 ) is an unbounded i n t e r v a l — i . e . ξ ( 0 ) => ( ζ Q , ξ j or ξ, = +oo.
with ξ Q = -°°
For ξg < A < ξ 1 suppose e i t h e r
(1)
ζQ
=
- oo and
ξ-
=
oo
A J J(ξ)dξ
= co
or and
1
with J " 1 ^ )
= θ'(ξ) =
estimating ξ.
V a r θ
(ε)(χ)>
is minimax; and
the
t n a t
J(ζ)dξ
J
=
»
denotes the Fisher information for
Consider the problem of estimating ξ under the loss 4 . 6 ( 1 )
i . e . L ( ξ , 6) = J ( ξ ) ( ό - ξ ) 2 .
admissible,
s o
/ A
Show t h a t :
( i ) the maximum l i k e l i h o o d
estimator
( i i ) i f 0 9 W then the maximum l i k e l i h o o d estimator is not
( i i i ) Give examples when 0 = N and ξ ( 0 ) is unbounded in which
maximum l i k e l i h o o d estimator is not minimax, is minimax but not admissible,
is both minimax and admissible,
( i v ) Can you generalize ( i ) to a k-parameter
family? [Let
(2)
—
αn Ψ ξQ,
h^2 ( ξ )
=
3n + ξχ
m1n( /
and
J(t)dt,
Kp,
/
J(t)dt)
170
STATISTICAL EXPONENTIAL FAMILIES
where K is chosen so that h is a probability density. Show K n -+ 0 because of (1). Then use 4.6(2). For (ii) use Theorem 4.24.] 5.9.1 Consider the general l i n e a r model as defined i n 1.14.1.
(a) Verify
that the usual least squares estimators of ξ are also the maximum l i k e l i h o o d estimators ( i . e . μ = Bξ). (b) What is the maximum l i k e l i h o o d estimator of 2 σ ? Is i t unbiased? (Assume m ^ r + 1.) (c) Generalize the preceding p
questions to the situation where Y ~ N(μ, σ ΐ) with μ = Bξ as in 1.14.1 and t a known positive matrix.
[The maximum likelihood estimates are the usual
generalized least squares estimates.] 5.9.2 Generalize 5.9.1 to the multivariate linear model defined in 1.14.3. 5.9.3 Let (X., X2) be the canonical statistics from a normal sample with mean μ. and variance σ?> and let (Z,, Z2) be from an independent normal sample with mean μ 2 and variance σ|.
Suppose μ, <_ μ 2 , but the parameters are
otherwise unrestricted. Show that ( μ . , μ2) = U p z,) i f x, <_ z, and otherwise /\
/\
/\
μ, = μ 2 = μ is the unique solution to
(Assume x 2 > x? and z 2 > z^, which occurs with probability one.) 5.9.4 Let ξ be a normally distributed vector with mean 0 and covariance matrix I.
Given ξ let Y be distributed according to the general linear
model 1.14.1.
(Assume m •> r + 1.)
Suppose B'B is diagonal and t <= P, a
relatively closed convex subset of positive definite diagonal matrices, (a) Show that the (marginal) distributions of Y form an exponential family
MAXIMUM LIKELIHOOD ESTIMATION with Θ a r e l a t i v e l y closed proper convex subset of N. are minimal s u f f i c i e n t s t a t i s t i c s . ]
171 [ ζ and |(Y - Bξ) |
(b) When V is a l l p o s i t i v e d e f i n i t e
diagonal matrices describe the maximum l i k e l i h o o d estimates of 2, σ . (c) Extend (b) to include other suitable subsets, P.
(d) The preceding is a
canonical form f o r a class of random e f f e c t s models (see, e . g . , Arnold (1981)). To see t h i s convert the usual balanced one-way or two-way random e f f e c t s models to a model of this form by applying suitable l i n e a r transformations to the usual parameters.
[For the one-way model having E(Y..) = μ + α . , 1J
μ ~ N ( 0 , σ*)9 μ
a. ~ N(0, σ*), i
i =l , . . . , I ,
j=l,...,J
Ί
l e t ζ. = Iμ +
u
I Σ α,
J.
*
i
and ( ξ 2 i - . . » ξ j ) = ( α ^ . . ^ j j M where M is a I x ( I - 1) matrix whose columns are orthonormal and orthogonal to 1.] [The following three exercises concern the 2χ2χ2 contingency t a b l e . ] 5.10.1 Consider the model under which the f i r s t category and t h i r d category are (marginally) independent ( i . e . , π..+k
= ττi++π++k).
Show t h i s is a
l o g - l i n e a r model and f i n d an e x p l i c i t expression f o r the maximum l i k e l i h o o d estimator. 5.10.2 Consider the l o g - l i n e a r model described by the r e s t r i c t i o n 0 = φ 1 + φ- + φ β + φ 7 - (φ 2 + Φ3 + Φ5 + Φg)
(This is the model described by
the phrase, "no t h i r d - o r d e r i n t e r a c t i o n s . " )
Write the equation(s) determining
the maximum l i k e l i h o o d estimator.
Determine that these equations do not have
a closed form s o l u t i o n , such as 5.10(7). (1980).
(See Darroch, Lauritzen, and Speed
In such a case the l i k e l i h o o d equations must be solved numerically.
The usual methods are the E-M algorithm or the Newton-Raphson algorithm. Bishop, Feinberg, and Holland (1975) and Haberman (1974).) 5.12.1 Consider the model described by h = TΓQ++ = π 1 + +
= π+Q+ = π+1+
See
172
STATISTICAL EXPONENTIAL FAMILIES
Show this corresponds to a d i f f e r e n t i a b l e subfamily within the f u l l exponential family, but is not a log-linear model. for
Find the maximum likelihood estimator
this d i f f e r e n t i a b l e subfamily [TΓQQ+ = π--+
.]
5.14.1 Let
{ p θ : θ € 0} be astratumof
family, as defined in Exercise 3 . 1 2 . 1 .
regular (or a steep)
exponential
(a) Show that for x € R the maximum
likelihood estimator exists and satisfies
(i) ζ
(2)
X
(2)
(b) Discuss the situation when x € R - R.
(c) Show (by example) that there
can be two solutions to (1); but there can never be more than two. Is it possible for both of these solutions to be maximum likelihood estimators? [Suppose the family is defined by ψ(θ) = ΨQ. Note that the set {θ: ψ(θ) £ Ψ Q } is convex and apply Theorem 4.8.] 5.14.2 Show how the result of Exercise 5.14.1 directly yields 5.14(3'). [Translate x 2 ] 5.14.3 Apply 5.14.1 to describe the maximum likelihood estimator for the other examples discussed in 3.12.2. 5.15.1 Let X 1 9 . . . , X exponential family.
be i . i . d . with distribution pQ from a canonical
Let K <= W° be compact.
Then x n is uniformly asympto-
t i c a l l y normal over θ € K with mean ξ ( θ ) and covariance matrix 1 = = n" 1D 2 ψ(θ).
[Apply Theorem 2.19.]
5.15.2 Consider the setting of 5.15.1:
(a) The maximum likelihood
MAXIMUM LIKELIHOOD ESTIMATION
173
estimators θ n and ξ n e x i s t with p r o b a b i l i t y approaching 1 as n •* » uniformly over θ € K.
(b) They are asymptotically normal uniformly over θ € K
with means θ and ζ and covariances n C(a)
£
( θ ) and n
£ ( θ ) , respectively.
P ( x n (. R) converges to 0 ( e x p o n e n t i a l l y f a s t ) , uniformly on K.
g ( t ) = g ( t Q ) + ( h ( t o ) ) ' ( t - t Q ) + o ( | |t - t Q | |)
(b)
if
then g ( x R ) i s asymptotically
normal with mean ξ ( θ ) covariance h' ( ξ ( θ ) ) £ ( θ ) h ( ξ ( θ ) ) s
uniformly f o r θ € K.]
5.15.3
Let X , , . . . , X
be i . i . d . with d i s t r i b u t i o n p
exponential subfamily {p Q : θ € Θ}.
Let K c θ be compact,
uniformly asymptotically normal over θ e K with mean θ .
from a d i f f e r e n t i a t e ( a ) Then ΘM is (b) For a curved
exponential family with θ = θ ( t ) the maximum l i k e l i h o o d estimator t uniformly asymptotically normal a t θ ( t ) G K with mean t . asymptotic variance o f t
of t is
( c ) Write the
as a function o f 2 ( θ ( t ) ) , θ ' ( t ) , and the s t a t i s t i c a l
curvature a t t of the curved exponential f a m i l y .
[See Theorem 5 . 1 2 , the h i n t
to 5 . 1 5 . 2 ( b ) , and Section 3 . 1 1 . For ( c ) , and f o r a geometric i n t e r p r e t a t i o n o f ( a ) and (b) note t h a t */FΓl Iς
- ξ 11 -> 0 i n p r o b a b i l i t y where ξ
p r o j e c t i o n i n the inner product <s, t > = s 1 Σ a t θ to Θ.
( θ ) t of x
denotes the
on the tangent l i n e
I f the problem is w r i t t e n i n the canonical form of Section 3.11
the asymptotic variance is
I.]
5.15.4 1
Let {p Q : θ e 0} be a curved exponential family. Let θ E W but θ 1 ί 0. Assume (w.l.o.g.) that the family has been written in the canonical form 3.11(1) - (4) with 0 = i 0 (9') = θ Θ (ξ(θ')). Show θ' = (0,α,...,0) with 1
α £ p. Let X..,...,X be i.i.d. observations under θ from this family and let t be the maximum likelihood estimator of t. Show that if α
CHAPTER
6. THE DUAL TO THE MAXIMUM LIKELIHOOD ESTIMATOR
KULLBACK-LEIBLER INFORMATION (ENTROPY) Before turning to the dual of the maximum l i k e l i h o o d estimator we define the Kullback-Leibler information, and prove a few of i t s simple properties.
The goal of t h i s detour is to provide a natural p r o b a b i l i s t i c
i n t e r p r e t a t i o n f o r t h i s dual as the minimum entropy expectation parameter. 6.1
Definitions Suppose F, G are two p r o b a b i l i t y d i s t r i b u t i o n s with densities f , g
r e l a t i v e to some dominating information
σ - f i n i t e measure v.
The
Kullbaok-Leibler
of G at F is
(1)
K(F, G)
with the convention that °°
=
EF(ln(f(x)/g(x)))
0 = 0,
0 / 0 = 1 , and y/0 = °° f o r y > 0.
K is als
referred to as the entropy of G a t F. I t can easily be v e r i f i e d that K(F, G) is independent of the choice of dominating measure v.
The existence of K w i l l be established i n
Lemma 6.2 where i t is shown that 0 <_ K £«>. In exponential families i t is convenient to w r i t e (2)
K(Θ Q , θ j )
=
K(PΘ , Pθ ) , 0
For (3)
ΘQ, θ j e N
1
S c H let K(S, θ χ )
=
inf{K(θQ, θ j ) :
etc. 174
ΘQ € S}
,
THE DUAL TO THE MLE
175
K( , ) as defined i n (2) has domain A/χA/.
I t is convenient to
also transfer this d e f i n i t i o n to the expectation parameter space.
Accordingly,
define K(ξ Q , ζ χ ) by (4) for
K(ξ Q , ξ χ ) ( ξ g . ξ j ) € ξ(W°)
x ξ(M°).
=
K(θ(ξ Q ), θ ( ξ 1 ) )
I f the family is steep this d e f i n i t i o n is
valid
on K° x K°. I t is also sometimes convenient to extend the κ
d e f i n i t i o n of K( , ζ,) to a l l of R , by lower semi continuity. for (5)
a minimal steep family, and for ξ Q € R - K°, K(ξ 0 , ξ 1 )
=
For ξ f. K9
lim i n f { K ( ξ , ξ χ ) : eΨO
Accordingly,
ξ>1 € K°, define
ξ € K°9 | |ξ - ξ Q | | < ε}
ξ 1 € K° define K(ξ, ξ j )
(6)
=
-
I t is to be emphasized that this is a formal, analytic extension of the d e f i n i t i o n .
κ
(ξn» ξ i ) f ° r £n f- ^° does not necessarily have a
p r o b a b i l i s t i c interpretation l i k e ( 1 ) .
(Sections 6.18+ give a p r o b a b i l i s t i c
interpretation of K, valid under some auxiliary conditions.) K is often called the Kullback-Leibler "distance" from ΘQ to θy but
i t is not a metric in the topological sense.
general -- not symmetric.
There i s , however, one yery important special case
where K is symmetric and ( K ) 2 is a metric: {P } = {ΦΛ - : θ e R θ θ,2 s t a t i s t i c t* x (7)
,
In p a r t i c u l a r , i t is -- i n
the normal location family,
forms a standard exponential family with canonical
(see Example 1.14), and has K(ΘQ, θj)
=
(θj - θo)iZ"1(θ1 -θQ)/2
The following proposition has already been mentioned above.
176
STATISTICAL EXPONENTIAL FAMILIES
6.2
Proposition For any two distributions K(F, G) exists and satisfies
(1) K(F,
0 <_ K(F, G) £ oo G) = 0 i f and only i f F = G.
Proof.
E F (1n(f(X)/g(X))) = E F ( - l n ( g ( X ) / f ( X ) ) )
>
-In E F (g(X)/f(X))
=
-In 1 = 0
by Jensen's inequality, with equality i f and only i f f = g a . e . ( v ) .
||
For exponential families K has an especially simple and appealing form. 6.3
Proposition Let {pθ> be a standard exponential family.
I f ΘQ € W°, θ j € N
then K(θ Q , θ χ )
( 1)
{Bemark. K(θ Q ,
=
(θQ - θχ)
ξ ( θ Q ) - (ψ(θ 0 ) - ψ ί θ j ) )
=
log (p θ ( ξ ( θ o ) ) / p θ
(ξ(θQ)))
Suppose { p θ ) is steep and ΘQ € N - N°, θ 1 € W°.
θ χ ) = « = lim
K(η, θ χ ) for {ηΊ.} c hl° by steepness.
Then Since the only
^i^o sensible interpretation for ( θ Q - θ j
? ( 6 Q ) is « here, (1) may be considered
valid for a l l ΘQ € hi for regular or steep Proof.
Note
families.)
that
ln(pθ (x)/pθ (x)) and E θ (X) = ξ ( θ Q ) .
||
=
( θ j - ΘQ)
x - (ψtθj) - ψ(θ0))
THE DUAL TO THE MLE 6.4
177
Remark The second part of 6.3(1) shows how the Kullback-Leibler
tion is related to maximum likelihood estimation.
For S c N l e t
(1)
θ χ € S}
K(ΘQ, S)
=
inf{K(θQ, θ j ) :
informa-
Then, by 6 . 3 ( 1 ) , i f ΘQ e A/° (2)
K(ΘQ, S)
=
K(ΘQ, θ)
for θ € S i f and only i f θ e θ s ( ξ ( θ Q ) ) . In other words, for steep families, for Θ = S, and for an observation x € K° the maximum likelihood estimator is the closest point in S to θ(x) in the Kullback-Leibler sense.
(For observations x € K - K° such
an interpretation requires an extension of the definition of K l i k e that to be provided in
Sections 6.18+.)
Note also that K(ΘQ, θ χ )
(3)
= £ ( θ 0 , ξ ( θ Q ) ) - 1{QV
ζ(θQ))
The fact that the quantity on the right is positive (for ΘQ e M°,
Q- f θ Q )
has already been used in 5.8(3) and 5.12(3). 6.5
Theorem
Let {p } be a standard exponential family. θ i n f i n i t e l y d i f f e r e n t i a t e on W° x W°. On W° (1)
VK(ΘQ, •)
(2)
D 2 K(Θ Q , •)
=
ξ( ) - ξ ( θ Q )
= D2ψ( ) = Z( ) ,
If {p_} is minimal and steep then on K° Ό
(3)
Then K( , ) is
VK( , ξ χ ) = θ( ) - θίζj)
Θ Q € H°
178
STATISTICAL EXPONENTIAL FAMILIES
(4)
D2K( , ξj) = Γ ^ θ t )) , Consequently,
(5)
If
K(ξ, ξ j )
ξj € /C°
given ξ , e K° and ε- > 0 there is an ε« > 0 such t h a t
>. ε 2 | | ξ - ζ 1 | |
whenever
llξ-ξjll
>
εj
s c K° is compact then a value ε ? > 0 can be chosen so t h a t ( 5 ) is
valid
uniformly f o r a l l ξ , 6 S.
Proof.
Formulae ( 1 ) - ( 3 ) a r e s t r a i g h t f o r w a r d from 6 . 3 ( 1 ) .
(Note a l s o
t h a t ( 1 ) , ( 2 ) a r e merely a r e s t a t e m e n t o f 5 . 3 ( 1 ) , ( 2 ) . ) ( 4 ) f o l l o w s from ( 3 ) by t h e i n v e r s e f u n c t i o n theorem s i n c e θ ( ) = ξ
( • ) and V ξ ( ) = Σ( )
Formula ( 5 ) f o l l o w s from ( 3 ) , ( 4 ) as d i d t h e analogous c o n c l u s i o n 5 . 3 ( 3 ) , and 5 . 3 ( 5 ) o f Lemma 5 . 3 f o l l o w from 5 . 3 ( 1 ) , ( 2 ) . The a s s e r t e d u n i f o r m i t y o f ( 5 ) over ζ 1 € S i s easy t o check i n t h a t p r o o f . (Note:
||
i f p Q i s not minimal 6 . 5 ( 3 ) u
v a l i d w i t h %" i n t e r p r e t e d as a g e n e r a l i z e d
is s t i l l
v a l i d and 6 . 5 ( 4 )
is
inverse.)
CONVEX DUALITY 6.6 Definition Let φ: R -> (-«>,<»] be convex.
The convex
dual
of φ is the function
k
d : R -> [-oo, oo] d e f i n e d by d φ ( x ) = s u p U φ ( θ , x ) : θ e Rk}
(1) (Recall, JL(θf x) = θ
x - φ(θ).)
We w i l l be i n t e r e s t e d i n t h e s i t u a t i o n when φ i s r e g u l a r l y s t r i c t l y convex and s t e e p . l( (2)
9
(See D e f i n i t i o n 5 . 2 . )
Then i f x e R = ξ ( N φ h
x ) i s s t r i c t l y concave on hi. and V£( , x ) ι θ / x ) = 0 . d φ ( x ) = £ φ ( θ ( x ) , x)
for
x € R
Thus
= ξ(W°)
(In such cases, and somewhat more generally, the pair (d., R ) is called the
THE DUAL TO THE MLE L e g e n d r e t r a n s f o r m o f ( Φ , Λ/ φ ).
179
I t i s e a s y t o check f r o m ( 2 ) a n d Theorem 6 . 5
that (3)
dd (θ) = φ(θ) Φ
for
θeW°
It can be shown that (3) actually holds for all θ € R , but we do not need this fact in what follows.) Suppose ψ is the cumulant generating function of a steep exponential family. Then (4)
dψ(xQ)
= K ( x 0 ,X ; L ) + θ ( X l )
xQ
x Q
If the coordinate system and dominating measure are chosen so that ψ(0) = 0 = ξ(0) then (4) becomes (4 1 )
d φ (x Q ) = K(x 0 , 0)
x € K°
This provides a p r o b a b i l i s t i c i n t e r p r e t a t i o n f o r d(x) on K°.
I t w i l l be
seen l a t e r t h a t d( ) is the maximal lower semi continuous extension o f (d(x):
x € K°) to a l l of R k , and ( 4 ) is v a l i d f o r a l l
xQ € R k .
Lemmas 6.7 and 6 . 8 and Theorem 6.9 present some important basic facts about convex d u a l i t y .
They are j u s t the t i p of a r i c h theory.
We w i l l
not f u r t h e r develop t h i s theory as an a b s t r a c t u n i t ; although other important features of the theory are i m p l i c t in r e s u l t s we s t a t e elsewhere ( e . g . Theorem 5 . 5 ) .
A u n i f i e d presentation of the theory appears i n R o c k a f e l l e r
( 1 9 7 0 ) , and many elements of i t are i n B a r n d o r f f - N i e l s e n ( 1 9 7 8 , e s p e c i a l l y Chapters 5 and 9 ) .
6.7
Lemma
The convex dual d is a lower semi continuous convex function. Hence,N. is convex. Suppose φ is regularly strictly convex. Then d is strictly convex and twice different!able on R. On R
180
STATISTICAL EXPONENTIAL FAMILIES
(1)
Vd(x) = θ(x) ,
and D2d(x) = ( D ^ ) " 1 (θ(x)) .
(2)
Proof.
Since d is the supremum of linear functions i t is lower semi-
continuous and convex. For x € R,
d(x) = x
θ(x) - ψ(θ(x)).
the same computation that yielded 6.5(3), ( 4 ) . since D2d is positive definite.
Hence ( 1 ) , (2) hold, by
d is s t r i c t l y convex on R
( I t is possible to also directly establish
s t r i c t convexity without requiring that φ be twice d i f f e r e n t i a t e . )
||
I t is now convenient to consider
£ d (x, θ)
= x
θ - d(x)
.
Under the conditions of Lemma 6.7 Vd(x) = θ(x) so that for Θ 6 W ° &Λ('>
Θ
)
ΊS
uniquely maximized at the value x for which θ(x) = θ.
is precisely ξ ( θ ) .
This value
This interpretation is developed further below, especially
in Definition 6.10. The following equivalent expression for steepness is a fundamental building block in the proof of Theorem 6.9, and has other uses.
6.8
Lemma
Let φ be regularly strictly convex. Then φ is steep if and only if
implies (2)
l|Vφ(θ.)|| - -
.
THE DUAL TO THE MLE Proof.
181
Assume (1) implies (2). Let θ0 n e ™ N°,' θ, e W - N°, l c
θ p = ΘQ + p ( e j - θ 0 ) .
σ
Then
' θo)
=
d ξ θ
( ( P ) ) -ξ ( θ p )
• «<ΘP>
(Θ
θ 0
P' V -* ( V
d is s t r i c t l y convex and twice d i f f e r e n t i a t e on the open set R with (D ? d) nonsingular on R.
Hence
(4)
for
lim
£ d ( x , θ)
every θ € θ(R) = N° by Lemma 5 . 3 ( 3 ) .
(5)
ξ(θp)
Since θ . e A/, 1
(6)
Since | | ξ ( θ p ) | | + », by ( 2 ) , we have
( θ p - ΘQ) - φ ( θ p ) =
l i m φ(θ ) = Φ ( θ Ί ) i s f i n i t e . p L p+1
ξ(θp)
( θ 1 - ΘQ)
=
ξ(θp)
= -oo
-Ad(ξ(θp),
ΘQ)
-«.
This implies
( θ p - θ Q ) / p •> -
as
p t l
By d e f i n i t i o n , Φ i s steep. Conversely, suppose there i s a sequence s a t i s f y i n g ( 1 ) f o r which (2)
fails.
The sequence can be chosen so t h a t
sup 11Vφ(θ i )11
=
B
<
-
This means that ξ(θ.) = Vφ(θ ), i=l,... is a bounded sequence, thus, without loss of generality, the original sequence {θ..} can be assumed to have been chosen to satisfy ξ(θ.j) -> x*. Hence, for any θ 1 € Rk
,
182
STATISTICAL EXPONENTIAL FAMILIES
(7)
θ
x* - φ(θ) = Tim (θ.
ξ(θ.) - φ(θΊ.))
>_ Tim sup (θ 1 ξ(θΊ.) - Φ(θ')) =
θ
1
x* - φ ( θ ' )
It follows that (8)
d(x*)
= θ
x* - φ(θ) < °°
This means t h a t θ f. hi° s a t i s f i e s θ € θ ( x * ) . impossible i f φ i s steep.
Hence
Proof of Proposition 3.3.
By Theorem 5.5 t h i s is
φ i s not steep.
||
I t is now easy to prove the converse assertion
in Proposition 3.3, namely that a minimal exponential family satisfying (9)
E 0 ( | | x || )
= oo
for
θ G W - W°
is steep. By Fatou's lemma i f 11m ||Vψ(θ.)|| Hence (2) is s a t i s f i e d . 6.9
=
{θ } s a t i s f i e s (1) then
l i m ||E θ .(x)||
> l i m E 0 .( | |x| |)
= «
Thus ψ i s steep, which is the desired r e s u l t .
. ||
Theorem
Assume φ is steep and regularly strictly convex. Then d. is also, and (1) Proof.
«"d = % φ Let x Q e R,
v e Rk.
Note that p > 0 since R i s open. and x p = x Q + p (
X l
- χQ).
- ξ(N ) .
Let p y = i n f {p > 0:
x Q + pv £ R} .
Assume p < °° and l e t x, = x Q + p v
Note that x 1 ί R.
Suppose i t were true that
THE DUAL TO THE MLE (2)
183
lim inf | |θ(x ) 11 < co . p pfl
Then there would be a sequence p.. t 1 with θ ( x p . ) + θ * , say. X j f. R = ξ ( W ° ) .
θ * (. A/° since
But then, since φ is steep, t h i s would imply
I Up.11 = l l ξ ( θ ( χ p . ) ) l l
- -
by Lemma 6 . 8 , which is a c o n t r a d i c t i o n since x p . -> x - .
Hence ( 2 ) is f a l s e ;
so t h a t a c t u a l l y
(3)
lim
||θ(xj||
=
oα .
p
ptl
The argument i n the f i r s t p a r t of the proof of Lemma 6 . 8 applies to y i e l d the dual to 6 . 8 ( 6 ) , namely (4)
θ(xp)
(xχ - x0)
-> oo
as
p t l
.
(Technically, the lemma as stated cannot be directly quoted since we have not yet established that R = M . so that d is regularly strictly convex. But, d has the desired convexity and differentiability properties on R c w, by Lemma 6.7. It is then easy to check that the first part of Lemma 6.8 indeed applies since p
} c R and yields (4) as the dual of 6.8 (6).) i
d is therefore a convex function with
(5)
^ d ( x
+ p(xj - x 0 ) ) +
oo
as
p
t
l
.
This implies t h a t
(6)
d(xQ + p ( x χ - x 0 ) )
= «
for
p > 1
Since the above argument applies f o r a l l
v € R , i t yields
(7)
for
Thus R
d(x)
=> W..
d(x) = θ ( x )
This y i e l d s
=«,
x £ R
(1) s i n c e , a l s o , R c W d
x - Φ ( θ ( x ) ) < oo on R.
because
.
that
184
STATISTICAL EXPONENTIAL
FAMILIES
I t now follows that d is regularly s t r i c t l y convex since i t has the desired smoothness properties, e t c . , on R = N°. by Lemma 6.7. d is steep since (5) applies to any xQ e R, Remark.
x 1 e R - R.
And, f i n a l l y ,
| |
Since d i s convex, lower semi continuous, and d ( x ) = °° f o r x f. R
i t must be t h a t d( ) on R i s the maximal lower semi continuous extension o f d(x):
x e R
(= K°) to a l l o f R k . d(xj
=
1
That i s , f o r x χ € R - R
lim inf {d(x): εΨO
x € R, I | x - x . I I < ε }
I t follows that i f {p Q } is a steep exponential family. between d(x Q ) and K(x Q , x χ ) is valid for a l l xQ € Rk,
L
The relation 6.6(4) x χ € K°.
MINIMUM ENTROPY PARAMETER The path has been prepared for the definition of the dual to maximum likelihood estimation,
and for the basic existence and construction
theorems. 6.10
Definition I,
Let d: Let S
R - * ( - « , °°] be convex and lower semi continuous.
Define
ξs(θ)
=
{ξ € S: £ d ( ξ , θ) = A d (S, θ) = i n f { λ d ( x , θ ) : x € S}} .
Obviously ξ^ is related to I.
in the same fashion as θ, the maximum likelihood
estimator for an exponential family, is related to the log likelihood £,..
function
( I t would therefore seem logical to adopt the notation ξ ς rather than ξς.
However for reasons of convenience and tradition we wish to reserve the notation ξ s for the set of maximum likelihood estimates of expectation parameters.
That i s , ξς(x) = ξ ( θ s ( x ) ) .) The function ξL has been given a variety of f a i r l y
appelations.
inconvenient
For example, values in ξς(θ) can be called minimum entropy
THE DUAL TO THE ME
185
(expectation) parameters r e l a t i v e to the set S c K°.
Barndorff-Nielsen (1978)
refers to values θς(x) = θ ( ξ ς ( . θ ( x ) ) ) , x e K°9 as maΰsimum likelihood
predictors.
(Note however t h a t ξ $ ( θ ) n (K - K°) t φ is possible even i f {p θ > is regular as long as S is not convex
(see Theorem 6.13).
Hence values i n ξ need not always
be expectation parameters.) Another i n t e r p r e t a t i o n is provided by the Kullback-Lei b i e r information.
Consider a steep minimal exponential f a m i l y .
I f ξ e ζς(θ) Π K°
then K(ξ, ξ ( θ ) )
=
i n f {K(x, ξ ( θ ) ) :
Thus, θ € θ ( ξ ς ( θ j )
x € S n K°}
.
is a parameter i n Θ(S) whose Kullback-
Leibler distance to θ, is a minimum over a l l parameters i n θ(S). Suppose { p f i } i s a minimal, steep standard exponential f a m i l y . Then Theorem 6.9 establishes that d, is steep and r e g u l a r l y s t r i c t l y convex with R = ζ(W°) θ i n Chapter 5.
= K°.
Consequently ξ possesses the properties established f o r
The main properties are formally stated below; t h e i r proofs
consist only of reference to the appropriate results i n Chapter 5. Convention.
In the f o l l o w i n g statements {p Q } is a minimal steep standard
exponential f a m i l y . 6.11
Note that R = K° c Wd c K.
Theorem If θ €
then
UΘ)
(1)
I f θ e N - N° then ξ w (θ) i s empty. Proof.
This i s the dual statement to Theorem 5 . 5 .
||
Note t h a t (2)
θ(ζw(θ(x)))
= θ w (x)
,
etc.
In other words, f o r a f u l l exponential family the maximum l i k e l i h o o d p r e d i c t o r
186
STATISTICAL EXPONENTIAL FAMILIES
is the same as the maximum likelihood estimator.
However (2) does not extend
to non-full families. 6.12
Theorem Let S cW.be a non-empty, r e l a t i v e l y closed subset of W^. Suppose
θ e N°.
Then ζ ( θ ) is non-empty. Suppose θ € W - W° and there are values θ i € W°,
i=l,...,I
and
constants $.. < » such that
I (1)
S c
y
H " ( θ - θ . , (3.) .
Then ξ(θ) is non-empty. For any ξ € ξ s (θ) n K° (2) Proof.
θ - θ(ξ) € V s (ξ) . Invoke Theorem 5.7 and Theorem 5.12.
||
6.13 Theorem Suppose S Π W, is a relatively closed convex subset of W^ with S n K° non-empty. Then ξ s (θ) is non-empty if and only if θ € W° or θ e W - W° and (1)
S c H"(θ - θ χ , Bj)
for some θ e W°, 3, € R. If ζ s (θ) is non-empty then it consists of the unique point ξ € S Π K ° satisfying (2) Proof.
(θ - θ(ξ))
(ξ - ξ) > 0
Invoke Theorem 5.8.
v
ξeS
.
||
6.14 Construction Theorems 6.12(2) and 6.13 have a geometrical interpretation which looks exactly like that of their counterparts in Chapter 5. For example,
THE DUAL TO THE MLE
187
suppose S = H n K with H the hyperplane H(a, α ) , and H n K°is non-empty. Then in order to find ζς(θ) one need only search for the unique point ζ* € H for which θ - θ(ζ*) = pa for some p € R. The process can be pictured from two different perspectives. Both of these are shown in Figure 6.14(1). (i) One may proceed from ξ(θ) along the curve {ζ(θ + pa): p € R} until the unique point at which ζ(θ + pa) € H. (ii) Alternatively one may map S n K° back into Θ as θ(S n K°) and then proceed along the line {θ + pa: p € R} until the unique point at which θ + pa ε θ(S n κ°).
e
Θ(S)
Figure 6 . 1 4 ( 1 ) :
Construction of ξ s ( θ ) when S = H ( a , α) n K
There is an important s t a t i s t i c a l d i f f e r e n c e between the s i t u a t i o n p i c t u r e d here and the dual s i t u a t i o n . d i s p l a y e d i n 5 . 9 . In Construction 5.9
Θ = H n N and the problem considered was to
find θ .
In t h a t case one could proceed via the geometrical dual to Figure
6.14(1).
See Figures 5 . 9 ( 1 ) and 5 . 9 ( 2 ) .
However, one could also reduce by
s u f f i c i e n c y to a minimal exponential family with parameter space Θ. then be found by applying Theorem 5.5 to t h i s minimal f a m i l y .
θ 0 could
A corresponding
188
STATISTICAL EXPONENTIAL FAMILIES
statistical interpretation is not available for the dual problem of finding ζ
HnK' Furthermore, i f Θ = H n N and S = ξ(Θ) the maximum likelihood
predictor relative to S cannot legally be found by f i r s t reducing by sufficiency.
This very undesirable property of a s t a t i s t i c a l estimator is
displayed in the following example. 6.15
Example Consider the Hardy-Weinberg problem discussed earlier in
Examples 1.8 and 5.10. Let S = ξ(Θ) and consider the problem of finding ξς. Rather than provide a general formula for ξ (a messy exercise) we discuss a special case, and some implications. Suppose N = 18 and x = ( 3 , 6 , 9 ) .
P =
(1)
2x
*
+x
2
=g
θ(ξ(x))
We have already seen that
Thus ξ ( x ) = 18(J, J , | ) = ( 2 , 8 , 8 ) , and
= θ(x)
=
=
ί p ( l , l , l ) + (In 1, In 4, In 4)}
{ β j ί l . l . l ) - (In 2 ) ( 2 , l , 0 ) + (0, In 2, 0)} c
θ
Note also that (2)
θ(x)
=
{ p d . l . D + (In 1, In 2, In 3)}
.
Of course θ(x) n θ = Φ Since ς(p) = ( p 2 , 2pq, q 2 ) = ( p 2 , 2p(l-p). (1-p) 2 ) space to S = ί ξ ( p ) :
0 < p < 1} can be found by taking 4 - ξ ( P ) .
p = = this tangent space, T, is spanned by the vector τ
By definition v s ( ξ ) = {v:
=
(2p,
"
(
,2 3»
v
2 - 4p, 2 4> 3* - 3 }
τ = 0} .
-2 + 2p)
the tangent Evaluated at
THE DUAL TO THE NILE
189
Now, from (1) and (2) θ(x) - θ(ζ) = {p'(l,l,l) + (0, In 2 - In 4, In 3 - In 4): p 1 € R} . Thus (θ(x) - θ(ξ})
(3)
τ = (2/3) In (1/2) - (4/3)ln (3/4) f 0 .
The implication of (3) is that θ(x) - θ(ζ) £ V $ (ξ).
It follows
from Theorem 6.12(2) that (4)
θ(x) n θ(x) = φ
,
or, in other words, (4')
ξ(x) t ξ(x)
.
Finally, suppose instead that the sample point is x* = (2,8,8). Note that x* = ξ(x) with x = (3,6,9), as above.
In this case ξ(x*) = x*
and hence (5 1 )
ξ(x*) = ξ(x*) = x*
and (5)
θ(x*) = θ(x*) = θ(x*) . Recall from the discussion in Example 5.10 that,over the domain
K°, ξ(x) coincides with the minimal sufficient s t a t i s t i c . (5)
Thus, from (4) and
(or (4 1 ) and (5 1 )) i t can be seen that here the "estimator"
θ(x) = θ(ξ(θ(x))) is not a function of the minimal sufficient
statistic.
is a very undesirable property for a statistical estimator.
Indeed, we
This
emphasize, the primary statistical use of θ does not l i e in i t s use as a statistical estimator, but rather in i t s use in the theory of large deviations. See, for example, 7.5 and Exercises 7.5.1 - 7.5.6.
190
STATISTICAL EXPONENTIAL FAMILIES
ENTROPY 6.16
Discussion In statistical mechanics and elsewhere the term entropy appears
and has a definition whose connection with the quantity K(θ Q , θ,) for exponential families is not at first obvious. See Ellis (1984a; 1984b). k k Let F be a probability distribution on R . Let x e R and define the entropy of x under F as (1)
E F (x) = inf {K(G, F ) : E Q (X) = x} . There is, as yet, no exponential family apparent in this definition.
However, there is indeed an intimate connection between ξ and K, as revealed in the following theorem. The theorem is proved only for the case where F satisfies certain mild assumptions and x € κl or x t Kp
We leave it to the
reader to develop the appropriate results when F does not satisfy these assumptions. The situation where x € K - K° can sometimes be treated using the methods at the end of this chapter. 6.17 Theorem Suppose the exponential family generated by F is a steep minimal family with 0 € int N.
Let ξ Q = ξ(0) = E R (X).
Let K denote the usual
Kullback-Leibler function, 6 . 1 ( 4 ) , for this exponential family. Then (1) i f y € K°.
(2)
Proof. (3)
E F (y)
= K(y, ξ Q )
If y £ K
»
=
Ef(y)
=
K(y, ξ Q )
Suppose y € K°, it is obviously true that E F (y) < K(y, ξ Q )
since the distribution G(dx) = p 0 / %(x)F(dx) = p θ ( y ) ( d χ ) satisfies E Q (X) = y
THE DUAL TO THE MLE
191
and K(G, F) = K(y, ξ Q ) . Suppose K(G, F) < «> and (4)
E G (X) = y = It must be that G <*« F, for otherwise K(G, F) = ». Let g = 4S-
and p = P θ ( y ) (5)
Then
K(G, F) - K ( P θ ( y ) 5
F)
=
/ [ g ( χ ) I n g(χ) -
=
/ g ( x ) ( l n g ( x ) - In p ( x ) ) F ( d x )
p
(χ)
Ί n
p(χ)]
+ /(g(x) - P(x))(ln
=
K(G. P θ ( y ) )
>
G = F θ /y\
I t follows
from ( 3 ) and ( 5 ) t h a t ( 1 ) holds.
is the unique d i s t r i b u t i o n
K(G, F) = Ef(y)
satisfying
p(x))F(dx)
0
since / ( g ( χ ) - p ( x ) ) ( l n p ( x ) ) F ( d x ) = / ( g ( x ) - p ( x ) ) ( θ by ( 4 ) .
F(dx)
x - φ(θ))F(dx) = 0 (Also, note t h a t
( 4 ) and y i e l d i n g
.)
If y £ K then Eg(X) = y implies G « F and hence K(G, F) = - = κ(y, ξ Q ) .
||
AGGREGATE EXPONENTIAL FAMILIES If {p Q } is a full canonical exponential family and x € dK then θ(x) = φ. (See Theorem 5.5.)
If v(8K) > 0 then this means that with
positive probability the maximum likelihood estimator fails to exist. This occurs most commonly when v has countable support.
In most such
cases the family of distributions {p Q : θ € N] can be augmented in a natural way so that the maximum likelihood estimator is always defined over this new, larger family of distributions. The augmented family will be called an aggregate exponential family.
192
STATISTICAL EXPONENTIAL FAMILIES Aggregate exponential families can also be satisfactorily defined
in a few special cases where v does not have countable support, but v(8K) > 0 nevertheless.
However, such situations are rare in applications and the
general theory involves d i f f i c u l t i e s not present in the countable case; hence we do not treat such situations below.
For similar reasons of convenience we
avoid non-regular exponential families. Special cases of the theory are extremely familiar — for example the aggregate family of binomial distributions, which is just B(n, p ) , 0 < p£l.
The general theory for the case where v has f i n i t e support
appears in Barndorff-Nielsen (1978, p.154-158), along with some observations about generalizations. 6.18
Definitions Let v be a measure concentrated on the countable subset
X = {χv
x 2 , . . . } c Rk.
(1)
Thus
v(ίχ.})
>
0
v(X c )
1=1,2,... ,
Consider the closed convex set K = K .
The faces
of K
= 0
.
are the non-empty sets
of the form (2)
F
=
K n H(v, α)
where
K c H~(v, α)
By convention the set K is i t s e l f a face of K (corresponding to v = 0 , α = 0 ) . A f a c e , F, is i t s e l f a closed convex subset, which has dimension s,
0 <_ s <_ k.
interior Rs.
(Only the face F = K can have dimension k.)
The
relative
of F, denoted r i ( F ) is the i n t e r i o r of F considered as a subset of
An a n a l y t i c c h a r a c t e r i z a t i o n of r i ( F ) is t h a t x e r i (F) i f x € F and
f o r every
if
hyperplane H 6 Rk such t h a t x € H but F £ H then both F n H + ϊ
,
and F n H~ f φ. Let F be a face o f K.
I f v ( F ) > 0 then the r e s t r i c t i o n of v to F,
v,c is uniquely defined and non-zero.
We use the notation K. c =K
.
Note
t h a t while i t is usually t r u e t h a t K,r = F t h i s need not always be the case.
THE DUAL TO THE MLE
193
See Exercise 6.18.1. The f i r s t main theorem involves the following structural assumption on X: For e\ίβry ξ G X there is a face F of K such that K ι c = F I F
(3)
and ξ € r i ( F ) . I f X is f i n i t e then (3) is clearly satisfied.
Another important
case where (3) is satisfied is when X = { 0 , 1 , . . . } , as for example when Xj,...,X|^ are independent Poisson or independent negative binomial variables. Assumption 6.22(1) provides an easily verified structural condition which implies ( 3 ) . 6.19
Definition
(Aggregate family)
Let X and v be as in 6.18. family of densities generated by v.
Let {p 0 } be the canonical exponential
Assume the family is regular.
As shown
in Chapter 3 this family can be reparametrized by the expectation parameter ξ = ξ ( θ ) . Let q
(1)
ζ(θ)(x)
Then, { q ξ : ξ € K°} = {p 0 :
=
P
θ(x)
θ
€ W
θ € W} .
Now, for each face, F, of K with v(F) > 0 l e t ψ r = ψ v
and
define the family of densities exp(θ Pθ,F(χ) θ l h
relative to the measure v. measure v . f .
x - ψ.r(θ))
x € F
IF
= 0
x j£ F
This is an exponential family relative to the
Assume this family is regular.
Let ξ.p denote i t s expectation
parameter, and l e t (2)
=
PθlF(x)
Thus ζ ranges over the set r i K. F as θ ranges over N^ = Wvjp
Note that the
194
STATISTICAL EXPONENTIAL FAMILIES
family {p Q jF: θ € N.p} is not minimal. Hence the map θ •*• ξ,p(θ) is not 1 - 1 . However, q>. >p = q^ ι p if and only if ξ, = ξp> by virtue of Theorems 1.9 and 3.6. Let (3)
F = {x: 3 face
F of K 3 v. F t 0 and x e ri(F)} .
Lemma 6.20, below, establishes that for each ξ € F there is a unique F such that ξ € ri(F) and a unique density q^.p corresponding to the pair ξ, F. This density has (4)
E q
(X)
= ξ
.
ξ|F
We denote this density as q f .
The aggregate family of densities
generated by v with parameter space F is the family (5)
{ q ξ : ξ € F}
.
Note that P ξ (X)
(6) 6.20
= 1
V ζ € F
.
Lemma Make the assumptions in 6.18 and 6.19.
is a unique F such that ξ € r i ( F ) .
The density q
Then for each ξ € F there = q^. p satisfies 6.19(4).
I t i s , in f a c t , the unique density of the form q , ( F« having expectation ξ. Proof. K c ίΓ(v\
Suppose ξ e r i ( F ) and a l s o ξ € F 1 = H ( v ' , α ' ) n K where α1).
Then e i t h e r
(i) FcH(v'.a')
or
( i i ) F n H+(v', a 1 ) t φ
and F n H " ( v ' , α 1 ) t φ. I n case ( i i ) H ( v ' , α 1 ) i s n o t a s u p p o r t i n g h y p e r plane, a contradiction.
Hence ( i ) h o l d s , and so F 1 D F.
Reversing t h e roles
o f F, F 1 i n t h e above now shows t h a t ξ € r i ( F ) and ξ e r i ( F ' ) i m p l i e s F = F 1 . By Theorem 3 . 6 , {En (x): θ € N } = r i ( K l t : ) = r i ( F ) by q V I F ξ(θ)IF IF 6 . 1 8 ( 3 ) since v i exists.
p
generates a regular family.
Thus q ξ i p s a t i s f y i n g 6 . 1 9 ( 4 )
THE DUAL TO THE MLE
195
For every ξ € X the preceding shows that ζ = E (X) € ri(F) where q
F is the unique face of K with ξ € r i ( F ) .
Hence ξ = E
q
(X) = E = q£l.
ξ§IF'
(X)
||
i f the conclusion of
6.18(3) holds for a l l ξ € conhull X then F = conhull X. occur that Fcconhull X.
q
ξ|F
implies F = F 1 , and thus, as previously noted, implies q Assumption 6.18(3) guarantees that F 3 X.
ζ
Otherwise i t may
Exercise 6 . 2 0 . 1 sketches an example.
I f Assumption
6.22(1) is s a t i s f i e d then (1)
F =
conhull X
=
K
.
Here is the f i r s t main theorem providing the extension of Theorem 5.5. 6 .21
Theorem Make the assumptions in 6.18 and 6.19.
Then for x € F 3 X the
maximum likelihood estimator, ξ ( x ) , is uniquely determined by the t r i v i a l equation (1)
ξ(x)
Proof.
= x
L e t x e r i ( F ) f o r some f a c e F = H ( v , α ) n K o f K.
I f ξ1 € r i ( F ' )
and x £ F 1 then q ζ , ( x ) = 0 . Now suppose ξ ' e r i ( F ' ) , Lemma 6 . 2 0 ) t h a t F 1 3 F.
x e F 1 , b u t F 1 t F.
I t follows
The argument now t a k e s p l a c e i n F 1 .
Hence we can
assume f o r c o n v e n i e n c e , and w i t h o u t l o s s o f g e n e r a l i t y , t h a t F 1 = R and ξ 1 e K°.
(2) and
l
= θ + p e
OK
We may f u r t h e r assume t h a t x = 0 , K c ί Π e ^ 0 ) , and 0 e r i ( F )
w i t h F = H ( e l f 0 ) n K. θ
(as i n
1
, ρ > 0 .
Then, ξ 1 = ξ ( θ ' ) f o r some θ ' e W ° c Then q
ξ ( θ
j ( 0 ) = exp(-ψ(θp))
Rk.
Let
196
STATISTICAL EXPONENTIAL FAMILIES eΦ(θp)=
(3)
eθ'
/
0 +
x
χ +
Pχ1v(dx)+
e θ ' # x v(dx)
/
f°
eθ"
/
χ
v(dx)
O
= ψ|F(θ )
'
by the monotone convergence theorem and the d e f i n i t i o n of ψ.r. (2)
I t follows from
and (3) t h a t
(4)
qξ,(0)
< q ξ ( θ )(0)
< qξl,,F(0) ,
0 < p < »
where ζ" is the unique point i n r i ( F ) defined by ξ" = ξ ( p ( θ ' ) .
Finally, if ξ 1 " € ri(F) then applying Theorem 5.5 to the measure v | F yields qξ,..|F(0)
(5)
with equality only i f ξ" 1 = 0.
< qQ|F(0)
Combining ( 4 ) , ( 5 ) , and the f i r s t comment
in the proof y i e l d s (6)
ζ(0)
= 0
.
This verifies (1) when ξ = 0 , and completes the proof. Remark.
||
As noted in the remark preceding the theorem it is usually true
that F => conhull X. Assume so and assume the hypotheses of the theorem. Let X,,...,X be i.i.d. random variables with density q f , ξ € F. As usual, let n Xn = .Σ X./n. =1 l
Then Xn € conhull X c F with probability one.
The family of
distributions of the sufficient statistic Xn is then also an aggregate family f i t t i n g the specifications of the theorem. estimator of ξ € F based on X-,...,X (6)
Hence the maximum likelihood
satisfies the t r i v i a l equation
£(Xr...,Xn)
=
Xn
.
The preceding theorem yields the existence of maximum likelihood
THE DUAL TO THE MLE estimates when the parameter space is F.
197
In order to guarantee existence of
these estimates when the parameter space is a proper closed subset of K i t suffices
to establish continuity in ξ of q Λ x ) , x € X.
useful for other purposes as well.
This continuity is
Somewhat unfortunately, the assumptions of
Theorem 6.21 do not imply that q^x) is continuous in ξ (see Exercises 6.23.5-6) and the following theorems demand stronger assumptions. Sufficient assumptions are described below. There is a further, aesthetic, reason for wanting to know that q^(x) is continuous in ξ. family { q ζ ( x ) :
The definition given in 6.19 of the aggregate
θ € F} is structurally natural.
But there is also an analy-
t i c a l l y natural definition for the family of distributions generated from {p n : u
θ G N] -- namely, the set of a l l probability distributions on X which
are limits of sequences of distributions in ί p θ h
These two definitions
coincide when q^(x) is continuous in ξ. 6.22 Assumptions K is called a polyhedral convex set i f i t can be written as the intersection of a f i n i t e number of half spaces (see Rockafellar (1970)). Assume that K is a polyhedral convex set and that for every one of the f i n i t e number of faces, F, of /C (1)
F = K|F
.
As previously noted in 6.20(1), this implies F = K = conhull X. For any convex set S € R define the centered span of S to be the subspace spanned by vectors of the form x - y , subspace by csp S. (2)
Note that i f xQ € ri S then csp S =
span {x - x Q :
Assume that for eyery face F of K
(3)
x,y € X.
Pr
°JCspFW
"
x € S}
Denote this
198
STATISTICAL EXPONENTIAL FAMILIES
Note t h a t i f X is f i n i t e then (1) i s s a t i s f i e d , and ( 3 ) is s a t i s f i e d since A/ |F = R
for a l l
faces F ( i n c l u d i n g F = K).
measure then (1) and (3) are again s a t i s f i e d .
6.23
Theorem
x € K,
Then f o r
The proof involves an i n d u c t i o n on the dimension, k.
r e s u l t is n e a r l y obvious.
assume K c ( - « , ξ Q ] . θ. -> W, and Q. - > « . for x f ξ Q ,
every
ζ 6 K.
q (x) is continuous f o r
Proof.
and
I f v i s a product
See Exercise 6 . 2 2 . 2 .
Make the assumptions in 6 . 1 8 , 6 . 1 9 , and 6 . 2 2 .
the
trivially
Suppose ξ Q e 3K.
For k = 1
Without loss o f g e n e r a l i t y
Then ξ i -• ζ Q w i t h ζ i t ξ Q ,
i =l,...
i m p l i e s ξ Ί = ξ(θ Ί .)»
I t follows t h a t q ξ . U 0 ) = P θ . ( ? 0 ) -* v ^ } ) "
1
= qζ ( ξ Q ) ,
q ξ . ( x ) -• 0 = q ζ ( x ) .
For arbitrary k, including k=l, if ξ Q € K° then q ξ (x) =
pQ,^Λx)
is continous on a neighborhood of ξ Q . This completes the proof for k = 1. We now turn to the case k >_ 2. We need to prove continuity of q>. at ξ Q € dK. Let ξ. -* ξ Q . We need consider only the case where {ζ.} c F with F some face of K, since K has only a finite number of faces. If this F is a proper face of K then q r -> q Γ by the induction hypothesis. Hence we need consider only the case where each ξ. = ζ(θ ), θ^ e A/. There is a unique face F Q of K such that ξ Q € ri F Q = ri ACjp . o K c R " ( e - , 0 ) , - σ e , € K° f o r some
loss o f g e n e r a l i t y
σ > 0,
F Q = H ( e 1 9 0 ) n K a n d c s p F Q = {w € R k :
( 0 <_ s < k - 1 ) . w
(2)
G RS<
Let S
Further,
assume
ξQ = 0,
Without
= csp F Q .
w = (0, ω), ω € RS},
F o r w e Rk w r i t e
assume 0 e N . p » ψ
F
(0) = 0,
ψ|p ( θ ) i s a f u n c t i o n o f θ / 2 ) > a n d so we w i l l
w1 = ( w L , W / ξ
(0) = 0.
( F
write ψ j
F
(θ/2\),
2
J with
Note t h a t where
convenient. We h a v e a l r e a d y assumed 0 e M , r . f o r some 6 Q > 0 . σ(θ),
I t then follows
s a y , such t h a t θ + σ e 1 € W,
from 6 . 2 2 ( 3 ) θ ^ σ ( θ ) .
Hence { θ € S :
| | θ | | <_ ό Q } c W.p
t h a t f o r e a c h such θ t h e r e i s a Since { θ € S:
||θ||<_ό0}
is
THE DUAL TO THE MLE compact, with {θ € S:
| | θ | | <• ό Q ,
θ + oeι
199
€ N} as a r e l a t i v e l y open s u b s e t ,
there must, f u r t h e r , e x i s t s a σQ >_ 0 such t h a t θ + σe, e N f o r a l l σ >_ σ Q , θ e S,
| | θ | | £ <50. For 6 <_ όg, σ _> σQ define
(1)
Q
=
Q(σ, 6)
{θ € Rk: | | θ
=
( 2 )
||
< 6,
v
Note t h a t θ ^ j
x^j
- σgθj
x^j
±
x
(-σ + σ Q ) | | X ( ^ | \< 0 ,
Vx
€ K .
Hence for θ € Q λ(θ) as i n 6 . 2 1 ( 4 ) .
<
λ(σ Q e χ )
< co
I t follows t h a t Q c N.
Now assume f o r convenience, and without loss o f g e n e r a l i t y , t h a t σQ = 0 .
Then f o r θ € Q λ(θ)
(2)
=
Je
θ
Ψ/ e (
χ
v(dx)
2
) (
2
)
as σ •> °°, uniformly f o r θ ^ ) 1 6Q. (3)
sup {|ψ(θ) I:
as σ -> ~, (4)
6 •> 0.
sup
v
/e"
σ l l X
(l)
M + θ
(
2
)
X
(
2
) v(dx)
(dx)
In p a r t i c u l a r
θ £ Q(σ, δ ) }
•+ ψ , F ( 0 ) 1 Γ
= 0
o
I t follows that
ί|pθ(x) - qQ(x)|:
for each x € K.
<
θeQ(σ, 6)}
[For x € FQ the convergence
^ 0
as
σ -^ °°,
6 •> o
in (4) is uniform over
subsets of F o ; however i f x £ FQ then as σ -> «, 6 » 0 ,
p Q (x) =
θ # x e
compact "ψ(θ)
~ eθ#x
-> 0 = q Ω ( x ) , but the convergence is not uniform over a r b i t r a r y compact subsets of K. x € X -
( I t is uniform over bounded subsets of X i f e, Fo.)]
x < -ε < 0 for a l l
200
STATISTICAL EXPONENTIAL FAMILIES I t remains to show t h a t f o r given σ >^ a Q ,
α > 0 such that ||ξ|| < α,
δ <_ δ Q there is an
ξ € K° , implies θ(ξ) € Q(σ, δ ) .
Once t h i s has
been done i t follows from ( 4 ) , and the induction hypothesis, that q^-ίx) i s continuous in ξ ε K f o r each x € K. For convenience we show below only that there is an α > 0 such that ||ξ|| < α implies θ(ξ) € Q(0, δ ) .
The proof f o r a r b i t r a r y α > 0,
in place of σ = 0, requires only minor a l t e r a t i o n s of the constants appearing i n the proof.
In the following α, ε are generic p o s i t i v e constants whose
numerical value may decrease as the proof progresses. is an α > 0 such that ||θ/2%11 > δ
Since 0 € W. F
there
implies ψ J F ( θ / 2 \ ) >. 201 |θ/2% ||.
Let
C c X be a f i n i t e subset of X such that C n FQ t φ and F n C f φ f o r eyery face F of K which properly contains FQ.
The existence of C is guaranteed by
6.22(1).
Suppose I I Θ ( 2 ) M max ξ/-x
{ΘQX
X/^:
> δ and
x € C} > 0 .
θ
(i)*x m
>
for
°
some
x
e κ
τhen
| | ξ | | < α and α i s s u f f i c i e n t l y small
If
i s i n the convex h u l l o f ί x / i \ :
x € C} U { 0 } .
then
Hence t h e r e i s an η ε R
such t h a t
(5)
θ^j
f o r a l l I | ξ |I < α .
ζ^x
<_ ηα max ίθ.^x
L e t p = max { | | X / 2 Λ I I :
x ^
x € C}
x € C},
vQ = min { v ( { x } ) :
x € C}.
Then A(θ, ξ )
= θ
=
θ
ξ -ψ ( θ )
(
2
)
ξ
( 2 )
- β||θ
( 2 )
||
+θ
ξ
( 1 )
( 1 )
- ln(e
(
2
)
λ(θ)).
Now, (6)
λ ( θ ) >_ λ
>.
j
F
(θ(2j)
+ v
0
e x p (231| θ ( 2 j | | )
exp( θ
+ v
0
( 1 )
x
exp ( θ
( 1 )
( 1 )
+ θ
x
( 2 )
( 1 )
x
( 2 )
)
- P||Θ(2)||)
.
THE DUAL TO THE MLE For n o t a t i o n a l s i m p l i c i t y l e t t = θ / , x (7)
A(θ, ξ ) 1 θ
( 2 )
ξ
( 2 )
201
x ^ x > 0 . Then f o r α <_ 3 / 2
- 3 | | θ ( 2 ) | | + η α t- I n (
3 l l θ e
2
( )
M
+
exp ( t - p I I θ / o \ I I ~ 31 I θ / p \ I I /
VQ
£
- ε + η α t- ( 3 | | θ ( 2 ) | |
1
-ε
V ( t - (p + 3 ) | | θ ( 2 ) | | + I n v Q ) )
for α > 0 sufficiently small, since 3||θ
3t p + 23 + a
| | V ( t - ( p + 3 ) | | θ ( 2 ) | |1 1 - a ό ) > (2)ir v^; -
f o r I | θ / 2 x 11 > δ , a _> 0 . If
ι ( θ .ξ )
(8)
I | θ # 2 j | I > δ butΘ Q J
< θ
θ
1
(l)
( 2 )
- ΨlFo(θ(2))
(2) * ξ(2) "
θ
ξ
( 2 )
δ
i
b u t
X / j x 1 0 f o r a l l x € K then
ψ
Θ
θ
+
F0(θ(2))
1
(D * x{i)
>
°
( 1 )
"ε f o r
s o m e
x
e
κ
t h e n
X/.% > 0 f o r some x e C; and
(9)
£(θ, ξ) 1
θ(
2 )
ξ
( 2 )
- ψ θ 2
ψ
<
| F
(i)
(θ{2)) + ηαθ(1) χ
X(1}
(D
IF0
-ε < 0
f o r α > 0 and some e > 0 s u f f i c i e n t l y s m a l l , s i n c e ψ F ( θ / 2 \ ) 1 ° b u t sup
{ ψ
| F
(θ(2j):
l|θ(2)||
o r ( 9 ) a p p l y so t h a t
1 δ j } < « . I f | | ζ | | < α a n d θ jf Q o n e o f ( 7 ) , ( 8 ) ,
202
STATISTICAL EXPONENTIAL FAMILIES
(10)
A(θ, ξ) £
-ε <
0
.
On the other hand, there i s a σ > 0 s u f f i c i e n t l y large so that by (2) or ( 3 ) , (11)
Hoey
ξ)
=
σe χ
ξ - ψ(σe χ )
^
σe][
ξ - ε/3
>_ -2ε/3 for ||ξ|| < α £•—• . It follows from (10) and (11) that if ||ξ|| < α, ξ € K°, then if θ (έ Q &(θ, ξ) £ Hence θ f θ ( ξ ) .
-ε
<
-2ε/3 £
il(θ(ξ), ξ ) .
I t follows that θ(ξ) € Q.
We have thus proved that given σ, 6 there is an α > 0 such that ||ξ|| < α,
ξ € K°, implies θ(ζ) € Q(σ, 6 ) .
completes the proof of the theorem.
| |
As previously noted, this
THE DUAL TO THE MLE
203
EXERCISES 6.6.1 Assume φ i s r e g u l a r l y s t r i c t l y convex.
Verify
6.6(3).
6.7.1 For φ regularly s t r i c t l y convex, when does d. = φ? 6.9.1 Generalize Theorem 3.9 to apply to steep, regularly convex functions v
φ [i.e.; write φ = V Φ
v
and consider the map θ -M ' . Show this map is λ (2) Φ(2)(θ)^ ;
1 - 1 and continuous on N° with range ξ^AN°)
x φ/ 2 x(M°) = K,.* x φ/ 2 \(W°)].
6.18.1 (i)
Show t h a t Kj F f F i n the following example:
X = ( 1 , -1) u { ( ( i 2 - l ) J V i , (ii)
1/i);
1 =1 , 2 , . . . }
,
F = K n H ( ( l , 0 ) , 1).
Construct an example of the same phenomenon i n R where X is
a discrete set ( i . e . X has no accumulation points i n R ).
[Construct X so
that the set X i n ( i ) i s i t s p r o j e c t i o n on the space spanned by the f i r s t two coordinate axes.] 6.19.1 Show that the following three families are aggregate exponential families: (i) (ii) (iii)
Binomial (n, p ) , Poisson ( λ ) ,
0 <_ p <_ 1
λ >. 0
Multinomial (N, p),
0 < p., " Ί
k Σ p. = 1 . i=lΊ
6.19.2 Suppose the d i s t r i b u t i o n o f X family { q ί Ί ' b ,
i = l , 2 , and X ^ ,
X ^
form an aggregate exponential
are independent.
Show t h a t the d i s t r i b u -
tions of ( X ^ , X^ 2 M form a ( k j + k 2 parameter) aggregate exponential
family.
204
STATISTICAL EXPONENTIAL FAMILIES
6.20.1 Construct an example in which 6.18(3) holds but F f conhuil X. [Let X1 be the set in 6.18.1(i) and define X € R 3 by X = ίx: ( x Γ x 2 ) E X 1 ,
x 3 = ±(1 - x 2 )} u (1,0,1) U (1,0,-1).]
6.21.1 Let X be the set defined in 6.20.1 with the additional point (1,0,0). Show (i) 6.18(3) fails at x = (1,0,0). (ii) {q
:
The maximum l i k e l i h o o d estimate f o r the aggregate family
ξ e F} f a i l s to e x i s t ( i . e . i s the empty s e t ) when X = ( 1 , 0 , 0 ) ,
which occurs with p o s i t i v e p r o b a b i l i t y . (iii)
The f a i l u r e i n ( i i ) can be r e c t i f i e d i n a natural way by
l e t t i n g G = conhull { ( 1 , 0 , - 1 ) , q
ξ(θ)|G
=
p
θ|G
t 0
(iv) for
t h e
f a m i l y
{ q
( 1 , 0 , 1 ) } and adding the d e n s i t i e s ξ:
ξ
€
F}
Addition o f the d e n s i t i e s q ζ j G i s " n a t u r a l " i n the sense t h a t
each ξ € G there i s a sequence θ. 6 U° such t h a t q ε .ς;(x)
=
^
]im Pθ ( x ) .
i-*» i
[This sequence cannot be chosen to be of the form θ = θ 1 + iv for fixed v € R , θ 1 e W° as was the case in the proof of Theorem 6.21.] 6.21.2 Let v be linear measure on the perimeter SS, of the unit square, S.
This measure does not have a countable supporting set.
describe i t s
Nevertheless,
"natural aggregate family", having parameter space S and
satisfying the conclusion of Theorem 6.21 for each x € S.
6.21.3 (i) Let v be uniform measure on the perimeter ∂S, say, of the unit circle S. Thus, {p_θ} is the family of von Mises distributions (Example 3.8). Show there can be no possible way of constructing a family of densities {q_ξ} which contains {p_θ} such that the maximum likelihood estimate for {q_ξ} exists with probability one. [lim p_θ(x) = ∞ for each x ∈ ∂S.]

(ii) Note that if X̄_n is the sample mean from a sample of size n, n ≥ 2, having the above distribution, then the maximum likelihood estimate does exist with probability one.

(iii) Construct a measure v for which {p_θ} is a regular exponential family but there does not exist an n for which it is possible to construct an "aggregate family" of densities {q_ξ}, containing the densities of X̄_n under θ, such that the maximum likelihood estimator exists with probability one. [There exists such a measure v having K = {x ∈ R³: x₁² + x₂² ≤ x₃², 0 ≤ x₃ ≤ 1}, and v({0}) > 0.]

6.22.1 Show that 6.22(1) (including the polyhedral nature of K) implies 6.20(1). [The polyhedrality of K guarantees that for every x ∈ ∂K there is a face F of K such that x ∈ ri F.]
6.22.2 Prove that 6.22(1) and 6.22(3) are satisfied whenever v is a product measure on a countable set X = Π_{j=1}^k X_j, X_j ⊂ R. [The faces F = H(v, α) ∩ X of X are determined uniquely by (sgn v₁, ..., sgn v_k).]
6.22.3 (i) Prove that

(1)    N|F = Proj_{csp F}(N|F) × (csp F)^⊥ ,

and

(2)    Proj_{csp F}(N) ⊂ Proj_{csp F}(N|F) .

(ii) Give an example in which X = {0, 1, ...}², F = {(0, 0), (1, 0), ...}, N = (-∞, 0)²,

(3)    Proj_{csp F}(N) = (-∞, 0) × 0 ≠ R × 0 = Proj_{csp F}(N|F) ,

and

(4)    ξ|F((0, 0)) = (1, 0) ∈ X .

(Thus 6.22(3) is not valid here.)

(iii) In the example (ii) show that q_ξ((x₁, 0)), x₁ = 0, 1, ..., is not continuous at ξ = (ξ₁, 0), ξ₁ > 1. [If θ_i is chosen so that θ_{i1} → ∞ somewhat slowly and θ_{i2} → -∞ then ξ(θ_i) → (ξ₁, 0) but q_{ξ(θ_i)}(x) ↛ q_{(ξ₁,0)}(x).]

6.23.1 Prove versions of Theorems 5.7, 5.8 and 5.12 valid for aggregate exponential families. [Make the assumptions in Theorem 6.23.]

6.23.2 Show that q_ξ(x) is not jointly continuous in (ξ, x) at any point with ξ = x ∈ ∂K.

6.23.3 Are the analogs to Theorems 6.12 and 6.13 valid for aggregate exponential families under the assumptions of Theorem 6.23?

6.23.4 Suppose X = (0, 0) ∪ {x ∈ R²: x_i = 1, 2, ..., i = 1, 2}. Note that Assumption 6.22(1) is not satisfied. Show that, nonetheless, q_ξ(x) is continuous at every ξ ∈ conhull X = F. (If one defines q_ξ(x) = q_0(x) for ξ ∈ K - conhull X then it is even true that q_·(x) is continuous on K.)

6.23.5 Let X = {((i² - 1)^{1/2}/i, 1/i): i = 1, ...} ∪ (1, 0). For x = ((i² - 1)^{1/2}/i, 1/i) ∈ X let v({x}) = 1/2^i, and let v({(1, 0)}) = 1. Note that 6.22(1) is not satisfied. Show that q_ξ((1, 0)) is not continuous at ξ = (1, 0). [q_{(1,0)}((1, 0)) = 1. Let 0 < c < 1. For i sufficiently large let θ_i = ρ_i x_i with ρ_i chosen so that p_{θ_i}((1, 0)) = c. ({ρ_i} is a swiftly increasing
sequence.) Then ξ(θ_i) → (1, 0) but q_{ξ(θ_i)}((1, 0)) → c ≠ 1.] (In this example q_ξ((1, 0)) is, however, upper semicontinuous; so that, for example, the conclusion of Theorem 6.23 remains valid. Exercise 6.23.6 shows this need not be the case.)

6.23.6 For x = x^{(i,j)} = ((i² - 1)^{1/2}/i, 1/i, j), i = 1, ..., j = ±1, let v({x^{(i,j)}}) = (4 + 3j)/2^i. For x = x^{(j)} = (1, 0, j), j = -1, 0, +1, let v({x}) = 2^{-|j|}. Otherwise v({x}) = 0. Construct {θ_i} in a manner similar to 6.23.5 with (θ_i)₃ = 0 so that P_{θ_i}({x^{(j)}: j = 0, ±1}) → 1/3 and (ξ(θ_i))₁ → 1. Verify that ξ(θ_i) → (1, 0, 1/2) and P_{θ_i}(x^{(-1)}) → 1/12, but q_{(1,0,1/2)}(x^{(-1)}) = (1/4)² < 1/12. Hence q_ξ(x^{(-1)}) is not continuous at ξ = (1, 0, 1/2), or even upper semicontinuous. If E ⊂ K is the closed set {ξ(θ_i): i = 1, ...} ∪ (1, 0, 1/2) then the maximum likelihood estimator of ξ over the family {q_ξ: ξ ∈ E} fails to exist at the possible observation x^{(-1)}.
CHAPTER 7. TAIL PROBABILITIES
In exponential families the probability under θ of a set generally falls off exponentially fast as the distance of the set from ξ(θ) increases. This section contains several results of this form. The first of these will be improved later, but it is included here because of its simplicity of statement and proof.

Throughout this chapter let {p_θ} be a steep canonical exponential family. (Most of the results hold with possibly minor modifications for non-minimal families, and many also hold for non-steep families.)
FIXED PARAMETER (Via Chebyshev's Inequality)

7.1 Theorem

Fix θ₀ ∈ N°. Choose ε so that {θ: ||θ - θ₀|| ≤ ε} ⊂ N°. Then there exists a constant c < ∞ such that

(1)    Pr_{θ₀}(H⁺(v, α)) ≤ c exp(-εα)
for all v ∈ R^k with ||v|| = 1 and all α ∈ R.

Proof. Let

(2)    c = exp(sup {ψ(θ) - ψ(θ₀): ||θ - θ₀|| = ε})

and let θ_ε = θ₀ + εv. Then

Pr_{θ₀}(H⁺(v, α)) = ∫_{H⁺(v,α)} exp(θ₀·x - ψ(θ₀)) v(dx)

= ∫_{H⁺(v,α)} exp(θ₀·x + (εv)·x - (εv)·x - ψ(θ₀)) v(dx)

≤ (∫ exp(θ_ε·x - ψ(θ_ε)) v(dx)) exp(ψ(θ_ε) - ψ(θ₀) - εα)

≤ c exp(-εα) .    ||
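As a quick numerical illustration (an editorial addition, not part of the original text), the bound (1) with the constant (2) can be checked for a concrete one-parameter family. The sketch below uses the Poisson family with ψ(θ) = e^θ, θ₀ = 0 (mean 1), v = 1 and ε = 0.5; the use of scipy and the particular values of α are incidental choices.

```python
# Numerical check of Theorem 7.1 for the Poisson family:
# psi(theta) = exp(theta), theta_0 = 0 (mean 1), v = 1, eps = 0.5.
# Bound: Pr_{theta_0}(X >= alpha) <= c * exp(-eps * alpha), with
# c = exp(sup{psi(theta) - psi(theta_0): |theta - theta_0| = eps}).
import numpy as np
from scipy.stats import poisson

theta0, eps = 0.0, 0.5
psi = np.exp                      # cumulant function of the Poisson family
c = np.exp(max(psi(theta0 + eps), psi(theta0 - eps)) - psi(theta0))

for alpha in [2, 5, 10, 20]:
    tail = poisson.sf(alpha - 1, mu=np.exp(theta0))   # Pr(X >= alpha)
    bound = c * np.exp(-eps * alpha)
    print(f"alpha={alpha:3d}  tail={tail:.3e}  bound={bound:.3e}")
    assert tail <= bound
```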
Note that (2) provides a specific formula for the constant appearing in ( 1 ) . In specific situations the bound provided in Theorem 7.1 can be improved in various ways.
However the following converse result shows that
Theorem 7.1 always comes within an arbitrarily small amount of yielding the best exponential rate of decrease for tail probabilities.

7.2 Proposition

Let θ₀ ∈ N°. Suppose there exists a c < ∞ and ε > 0 such that 7.1(1) is valid for all v ∈ R^k with ||v|| = 1 and all α > 0. Then {θ: ||θ - θ₀|| < ε} ⊂ N°. (Thus, if for some ε > 0, c < ∞, a bound of the form 7.1(1) is valid for all v with ||v|| = 1 and all α > 0, then Theorem 7.1 will verify such a bound for any ε' < ε.)

Proof. We leave the proof as an exercise.    ||
When ε = inf {||θ - θ₀||: θ ∉ N} then 7.1(1) may or may not be valid for all α, v. The following example demonstrates this.

7.3 Example

Relative to Lebesgue measure, let

(1)    f_{η,k}(y) = y^{k-1} e^{-y/η}/(Γ(k) η^k) ,   y > 0 ;    f_{η,k}(y) = 0 ,   y ≤ 0 .

This is the gamma density with scale parameter η and shape parameter k.
Let x₁ = y, x₂ = ln y, θ₁ = -η⁻¹, θ₂ = k - 1, and let v be the measure induced by the map y → x when y has Lebesgue measure on (0, ∞). One then has a standard exponential family of order 2 with

ψ(θ) = ln Γ(θ₂ + 1) - (θ₂ + 1) ln(-θ₁)

and

(2)    N = (-∞, 0) × (-1, ∞) ,    K = {(x₁, x₂): x₁ ≥ 0, x₂ ≤ ln x₁} .

When k = 1 (i.e. θ₂ = 0) the resulting one-parameter exponential family is that of exponential distributions with intensity |θ₁|. For this family

Pr_{θ₁ = -1}{X₁ > α} = e^{-α}    for all α > 0

so that 7.1(1) holds with v = 1 and ε = 1 = inf {||θ - θ₀||: θ ∉ N}. On the other hand, for θ₂ = 1 the resulting one-parameter gamma family has

Pr_{θ₁ = -1}{X₁ > α} = (α + 1)e^{-α}    for all α > 0 .

Thus here 7.1(1) fails to hold when v = 1 and ε = 1 = inf {||θ - θ₀||: θ ∉ N}.
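A quick numerical companion to the two tail formulas above (an editorial sketch; the use of scipy is incidental): for the exponential case the rate e^{-α} is attained exactly, while for the shape-2 gamma the extra factor (α + 1) eventually exceeds any bound of the form c·e^{-α}.

```python
# Tails of the gamma family of Example 7.3 at theta_1 = -1 (eta = 1):
#   shape k = 1 (exponential):  Pr(X1 > a) = exp(-a)
#   shape k = 2:                Pr(X1 > a) = (a + 1) * exp(-a)
import numpy as np
from scipy.stats import gamma

for a in [1.0, 5.0, 10.0]:
    print(a,
          gamma.sf(a, 1), np.exp(-a),            # k = 1
          gamma.sf(a, 2), (a + 1) * np.exp(-a))  # k = 2
```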
When N = R^k Theorem 7.1 says only that Pr_{θ₀}{H⁺(u, α)} = O(e^{-kα}) for all k > 0. However, much smaller bounds may be valid for these tail probabilities. Consider for example the following well known facts:

(3)    ∫_α^∞ e^{-t²/2} dt ≤ e^{-α²/2}/α    for α > 0 ,

and

(4)    ∫_α^∞ e^{-t²/2} dt ~ e^{-α²/2}/α    as α → ∞ .

Thus, suppose X is normal, mean 0, variance 1. Then, from (3),

(5)    Pr{X > α} ≤ e^{-α²/2}/(α(2π)^{1/2})    for α > 0 .

It can be seen from (4) that this bound is asymptotically accurate as α → ∞. Theorem 7.5 contains a bound which easily yields the statement

(6)    Pr{X > α} ≤ e^{-α²/2}

for this situation. This is much better than what is available from 7.1(1) but is still inferior to (5).
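For concreteness (an editorial addition), the exact tail and the two bounds (5) and (6) can be tabulated side by side:

```python
# Compare the exact standard normal tail with the bounds (5) and (6).
import numpy as np
from scipy.stats import norm

for alpha in [1.0, 2.0, 3.0, 5.0]:
    exact = norm.sf(alpha)                                   # Pr{X > alpha}
    bound5 = np.exp(-alpha**2 / 2) / (alpha * np.sqrt(2 * np.pi))
    bound6 = np.exp(-alpha**2 / 2)
    print(f"alpha={alpha:.1f}  exact={exact:.3e}  (5)={bound5:.3e}  (6)={bound6:.3e}")
```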
Theorem 7.1 applies to probabilities of large deviations defined by half spaces but can easily be converted to a statement about any shape of set, as follows.
7.4 Corollary

Consider a standard exponential family. Fix θ₀ ∈ N°. Let x₀ ∈ R^k. Let S be any set. Let ρ = inf {||x - x₀||: x ∉ S}, and define ε as in Theorem 7.1. Then there is a c < ∞ such that

(1)    Pr_{θ₀}({(X - x₀)/α ∉ S}) ≤ c exp(-ερα) ,    α ∈ R .

Proof. It suffices to prove the corollary for x₀ = 0 and S the open sphere of radius ρ about the origin. There exist ρ' < ρ and ε' < inf {||θ - θ₀||: θ ∉ N} such that ε'ρ' = ερ. There exists a finite set of unit vectors {a_i: i = 1, ..., n} such that ∩_{i=1}^n {x: x·a_i < ρ'} ⊂ S. Thus

Pr_{θ₀}{X/α ∉ S} ≤ Σ_{i=1}^n Pr_{θ₀}{X·a_i > αρ'} ≤ Σ_{i=1}^n c exp(-αρ'ε') ≤ c exp(-ερα)

by Theorem 7.1, where c < ∞ is an appropriate constant.    ||
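To make the covering step of the proof concrete (an editorial sketch with made-up numbers), in R² one can count how many unit vectors a_i are needed so that ∩_i {x: x·a_i < ρ'} lies inside the ball of radius ρ: it suffices that ρ' ≤ ρ cos(π/n).

```python
# Covering step of Corollary 7.4 in R^2: with a_i at angles 2*pi*i/n,
# the intersection of {x . a_i < rho_prime} lies in the open ball of
# radius rho as soon as rho_prime <= rho * cos(pi / n).
import numpy as np

rho, rho_prime = 1.0, 0.9
n = 1
while rho_prime > rho * np.cos(np.pi / n):
    n += 1
print("n =", n)                  # smallest admissible number of vectors

# spot-check: every point on the sphere of radius rho violates some constraint
a = np.array([[np.cos(2 * np.pi * i / n), np.sin(2 * np.pi * i / n)] for i in range(n)])
for ang in np.linspace(0, 2 * np.pi, 200):
    x = rho * np.array([np.cos(ang), np.sin(ang)])
    assert (a @ x).max() >= rho_prime - 1e-12
```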
FIXED PARAMETER (Via Kullback-Leibler Information)

It is possible to use the Kullback-Leibler information number (i.e. entropy) to improve the preceding bound. See the exercises for some applications of this bound to asymptotic theory.
7.5 Theorem

Let θ₀ ∈ N° and H⁺ = H⁺(v, α). Then

(1)    P_{θ₀}(H⁺) ≤ exp(-K(H⁺, ξ(θ₀))) .

Proof. Suppose first that

(2)    H⁺ ∩ K° ≠ φ .

Let ξ = ξ_{H⁺}(θ₀). Note that ξ ∈ H⁺ ∩ K° by Theorem 6.13. Hence θ = θ(ξ) ∈ N°. (This is precisely the situation pictured in Figure 6.14(1).) Now,

(3)    k = K(H⁺, ξ(θ₀)) = (θ - θ₀)·ξ - ψ(θ) + ψ(θ₀) ≤ (θ - θ₀)·x - ψ(θ) + ψ(θ₀)    for all x ∈ H⁺

by definition and by 6.13(2). This yields

P_{θ₀}(H⁺) = ∫_{H⁺} p_{θ₀}(x) v(dx)

= ∫_{H⁺} (p_{θ₀}(x)/p_θ(x)) p_θ(x) v(dx)

= ∫_{H⁺} exp((θ₀ - θ)·x - ψ(θ₀) + ψ(θ)) p_θ(x) v(dx)

≤ ∫_{H⁺} exp(-k) p_θ(x) v(dx) ≤ e^{-k} ,

which is the desired result. Now suppose H⁺ ∩ K ≠ φ but H⁺ ∩ K° = φ. Then
(4)    lim_{ε↓0} K(H⁺(v, α - ε), ξ(θ₀)) = K(H⁺(v, α), ξ(θ₀)) < ∞

since K(·, ξ(θ₀)) is lower semi-continuous (by definition), satisfies lim_{||ξ||→∞} K(ξ, ξ(θ₀)) = ∞ (by 6.5(5)), and since K(H⁺(v, α), ξ(θ₀)) ≥ K(H⁺(v, α - ε), ξ(θ₀)) for all ε > 0. Hence

(5)    P_{θ₀}(H⁺) = lim_{ε↓0} P_{θ₀}(H⁺(v, α - ε)) ≤ lim_{ε↓0} exp(-K(H⁺(v, α - ε), ξ(θ₀))) = exp(-K(H⁺, ξ(θ₀))) .    ||

(We leave as an exercise to verify that

(6)    K(H⁺, ξ(θ₀)) = ∞ if and only if P_{θ₀}(H⁺) = 0 .)
Note that the Kullback-Leibler information enters into the above only as a convenient way of identifying the sup {(θ - θ₀)·x - ψ(θ) + ψ(θ₀): x ∈ H⁺}. Various other interpretations of K, such as the probabilistic Definition 6.1, do not enter into the above argument. The connection between Theorem 7.5 and 7.1 is provided by the following lemma.

7.6 Lemma

Let θ₀ ∈ N° and H⁺ = H⁺(v, α). Suppose θ = θ₀ + εv ∈ N°. Then

(1)    K(H⁺, ξ(θ₀)) ≥ ψ(θ₀) - ψ(θ) + εα .

Proof. Let ξ = ξ_{H⁺}(θ₀) as in Theorem 7.5. Then

K(H⁺, ξ(θ₀)) = (θ(ξ) - θ₀)·ξ + ψ(θ₀) - ψ(θ(ξ)) ≥ (θ - θ₀)·ξ + ψ(θ₀) - ψ(θ)

since θ(ξ) = θ_N(ξ) maximizes ℓ(·, ξ). Hence
K(H⁺, ξ(θ₀)) ≥ εv·ξ + ψ(θ₀) - ψ(θ) = εα + ψ(θ₀) - ψ(θ) .    ||

Applying the bound (1) in the formula 7.5(1) yields the earlier formulae, 7.1(1) and (2), of Theorem 7.1. Note also that in the normal example of Example 7.3, K(ξ, 0) = ξ²/2, and thus 7.5(1) yields 7.3(6).
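To see how much the entropy bound 7.5(1) can improve on 7.1(1) in another concrete case (an editorial illustration, not from the text), take again the Poisson family with ψ(θ) = e^θ and θ₀ = 0. For H⁺ = [α, ∞) with α ≥ 1 one has K(H⁺, ξ(θ₀)) = α ln α - α + 1, and the two bounds can be tabulated against the exact tail:

```python
# Compare the Chernoff-type bound 7.1(1) with the entropy bound 7.5(1)
# for the Poisson family, psi(theta) = exp(theta), theta_0 = 0 (mean 1).
import numpy as np
from scipy.stats import poisson

def kl_bound(alpha):
    # exp(-K(H^+, xi(theta_0))) with K = alpha*log(alpha) - alpha + 1, alpha >= 1
    return np.exp(-(alpha * np.log(alpha) - alpha + 1))

eps = 0.5
c = np.exp(np.exp(eps) - 1)          # constant (2) of Theorem 7.1

for alpha in [5, 10, 20]:
    exact = poisson.sf(alpha - 1, mu=1)          # Pr(X >= alpha)
    print(f"alpha={alpha:3d}  exact={exact:.3e}  "
          f"7.1(1)={c * np.exp(-eps * alpha):.3e}  7.5(1)={kl_bound(alpha):.3e}")
```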
FIXED REFERENCE SET The preceding results concern the nature of probabilities of large deviations when the parameter is fixed and the reference set for calculating the probability proceeds to i n f i n i t y .
There is another class of results.
These
concern the situation when the reference set is fixed and the parameter proceeds to i n f i n i t y in an appropriate direction.
These theorems were exploited
in a s t a t i s t i c a l setting by Birnbaum (1955) and then Stein (1956).
Giri (1977)
surveys several further applications of this theory.
7.7 Theorem

Let v ∈ R^k, α ∈ R. Let K ⊂ N be compact. Let S₁, S₂ ⊂ R^k with

(1)    S₂ ⊂ H⁻(v, α) ,

(2)    v(S₁ ∩ H⁺(v, α)) > 0 .

Then there exist constants c and ε > 0 such that

(3)    ∫_{S₂} e^{θ·x} v(dx) / ∫_{S₁} e^{θ·x} v(dx) ≤ c exp(-ρε)

for all θ ∈ N of the form θ = η + ρv with η ∈ K, ρ > 0.
Proof. Let S₁(ε) = S₁ ∩ H⁺(v, α + ε). There is an ε > 0 such that v(S₁(ε)) > ε > 0. Then,

∫_{S₂} e^{θ·x} v(dx) / ∫_{S₁} e^{θ·x} v(dx)

≤ ∫_{S₂} exp(ρ(v·x - α) + ρα + η·x) v(dx) / ∫_{S₁(ε)} exp(ρ(v·x - α) + ρα + η·x) v(dx)

≤ e^{-ρε} ∫_{S₂} e^{η·x} v(dx) / ∫_{S₁(ε)} e^{η·x} v(dx)

≤ c exp(-ρε)

where

(4)    c = sup_{η∈K} (∫_{S₂} e^{η·x} v(dx) / ∫_{S₁(ε)} e^{η·x} v(dx)) < ∞ .

Here is why c < ∞: K is compact and v(S₁(ε)) > 0 so that inf_{η∈K} ∫_{S₁(ε)} e^{η·x} v(dx) > 0. Also, ∫_{S₂} e^{η·x} v(dx) is upper semicontinuous on K by Fatou's lemma, and is finite on K since K ⊂ N. Thus sup_{η∈K} ∫_{S₂} e^{η·x} v(dx) < ∞.    ||
The preceding theorem really concerns the relationship of probabilities for the sets S₂ and S₁(0) = S₁ ∩ H⁺(v, α) contained in separate half spaces. Note again the dual relationship, connecting θ ∈ N and H ⊂ K, in Theorem 7.7. Because of this relationship it is often revealing in such contexts to superimpose both the sample space and parameter space on a single plot. This is done in Example 7.12(1). Here are some corollaries to the Theorem, the second of which will be used in the example. The first of these corollaries may be instructively compared to Theorem 7.1.
STATISTICAL EXPONENTIAL FAMILIES
7.8
Corollary k Let v € R ,
KcWbe
compact, and S c H ( a , α ) . Suppose
v(H + (v, α ) )
(1)
> 0
.
Then there e x i s t constants c and ε > 0 such that Pr θ (S) for
<_ c exp(-pε)
a l l θ € W of the form θ = η + pv with η € K,
any sequence {θ.. € hi: θ.. = p..v + η ^ ,
p. -+«>,
lim Pr f i (S) θ i-*» i Proof.
In p a r t i c u l a r ,
for
. e K} one has
η>
= 0
.
Let S 2 = H (v, α ) . Then by Theorem 7.7 PrAS)
< c exp(-pε) J
s
=
7.9
p > 0.
θ
χ
e
c exp(-pε)Prθ(S2)
" ψ ( θ ) v(dx)
<_ c exp(-pε)
.
||
Corollary Again, l e t v € R ,
K c W ° be compact, and v(S) > 0 ; and l e t {θ..}
be any sequence of the form θ. = p.v + η. with p. -> » and η. € K.
(1)
lim Eft (v i^o θ i
X)
=
sup{α:
Then
v(H + (v, α ) ) > 0} < «, .
(Note that here we assume K c A/°; not merely K c hi.) Proof. it
Let α n denote the supremum on the right of ( 1 ) . Since Eft (v X) <_ α n
is only necessary to prove lim i n f EA (v X) 2 l α n ϋ u i-*χ> i
T° this end, l e t
α < α 1 < α Q and S 2 = H"(v, α 1 ) . Let ξ 2 ( θ ) = EΘ(X|X € S 2 ) . result is t r i v i a l .
Hence, suppose v(S 2 ) > 0.
continuous for a l l θ € N°.
Hence 3 = i n f ί v
I f v(S 2 ) = 0
the
Note that ξ 2 ( θ ) exists and is ζ 2 ( η ) : η € K} > -°°.
Note that
TAIL PROBABILITIES
217
3<α'. Apply Corollary 2.5 to the conditional exponential family given X € S9 (generated by v| c ) to find ά
|S2 E θ (v
X|X € S 2 )
for all θ = η + pv with p _> 0. E θ (v
X)
>_ E η (v
X|X € S 2 )
>.
3
Then for such θ,
= Pr θ (X € S 2 )
E θ (v
X|X € S 2 )
+ Pr θ (X ί S 2 )
E θ (v
X|X € S - S 2 )
>. ( c e " ε p / ( l + c e " ε p ) ) 3 + ( 1 / ( 1 + c e ' ε p ) ) α ' by Theorem 7.7.
p sufficiently large. arbitrary.
X) > α (since α < α 1 ) for θ as above for a l l
Hence E (v
This implies lim inf EA (v
X) > α n , since α < α n was
|| Note the placement of the hyperplane H in the statement of Theorem
7.7.
I f S 2 cz H" and v(S. n H+) > 0, but v(S, n H+) = 0, then only a much weaker
conclusion is valid. 7.10
This conclusion is contained in the following corollary.
Corollary Let v € R , α € R.
(1)
Suppose S2
c
H"(v, α)
and v(Sj n H + (v, α))
(2)
Let K c N be compact. = PΊ v + nΊ with n i e K,
(3)
> 0
.
Let { Θ . J c W b e a sequence of the form
p1 + ».
Then
lim—! θi
= 1
0
.
218 Proof.
STATISTICAL EXPONENTIAL FAMILIES Apply Theorem 7 . 7 t o f i n d P θ ( S 2 ΓΊ H " ( v , α - ε ) )
(4)
lim θ.
for a l l ε > 0.
1
Furthermore, i f p.. > 0
Pθ (S 2 n H + (v, α - ε ) ) (5)
P
(S 2 ίl H + (v, α - ε ) )
1
!
as ε -• 0
uniformly for η. € K
(The inequality in (5) follows after applying Corollary 2.23 to the functions
hc(x) = X s -cχ s . π H + ( v ,α-ε) W Ί t h C c h o s e n so t h a t E η . ( M X ^
=
°t0 find
that
E ϋ. (hC (X)) —> Eη.j(hC (X)) for all c and p. > 0.) Combining (4) and (5) yields Ω i the conclusion of the corollary.
||
7.11 Example 2 Consider the usual sufficient statistics X, S derived from a normal (μ, σ ) sample. As explained in Example 1.2 the statistics X, = X, 7 7 X2 = S + X are the canonical statistics for a two-parameter exponential 2 2 family with canonical parameters θ, = nμ/σ , θ 2 - -n/2σ . Note that 2 K = { ( x , , x 2 ) : x2 1 x p tor some c > 1 consider the conditioning set Q = { ( x r x2):
x2 1
cx
i>
=
t(x»
s 2
)
:
x 2 /s 2 1 V(c
"
ι
)}'
(™
s
i s
t h e
s e t
on
which the usual two-sided t - t e s t (based on t = /rvT x/s) with n - 1 degrees of freedom accepts at the appropriate level determined by c.) σ
2
+ 0.
Then ( θ ^ θ 2 ) = (π/σ ) ( μ Q 5 -h).
ray with slope - % i n as σ -* 0.
Fix μ = μ Q and l e t
Thus ( θ ^ θ 2 ) proceeds down the
Both X and Θ are displayed on the plot in
Figure 7.11(1), which shows also K, Q, and this l i n e . Corollary 7.9 applied to the conditional exponential family given X € Q (generated by the measure v restricted to Q) yields
TAIL PROBABILITIES
219
(1)
sup { μ ^ - X g / Z K x j , Note that E(μ Q X 1 - X £ /2 |X € Q) = ( μ Q , -h) S
X 2 ) | χ € Q) € Q.
sequence ί ( x l Ί »
X
Q
i f
(χϋs
a n d
the tangent to Q a t the point (\ιQ/c, ( n / σ 2 ) ( μ 0 , -H).)
(2)
μQ/2c2
=
μ
WQ/C) is perpendicular
X
2
2 ^
#
c
(Note that
to the ray
Thus
11m
E
(
μ
σ
2)<ίχi
X
2)IX _
€
Q
>
=
(μ0/c>
μ
0
/ c ) =
e
0
( s a y )
p
In terms o f the t r a d i t i o n a l v a r i a b l e s X, S , and t = /n-1 x/s t h i s y i e l d s 2 (3)
lim σ2->0
Ef ^ °'
2
2 )
((X, S )| l t l < τ ) '
Example 7 . 1 1 ( 1 ) :
.
X 2 )|X e Q) and t h a t
Furthermore since Q is s t r i c t l y convex 2 2 /2c μ 0 / 2 c = s u p ' 0
c
2 Ί )^
E((Xr
x 2 ) € Q}
= ( 9 ^τ + n - 1
f
(τ +
Picture for Example 7.12
9_ n - 1
COMPLETE CLASS THEOREMS FOR TESTS (Separated Hypotheses) The preceding results can be used to prove admissibility of many conventional test procedures in univariate and multivariate analysis of variance and in many other testing situations involving exponential families. When combined with the continuity theory for Laplace transforms of Section 2.17 these results yield useful complete class characterizations for certain classes of problems.
In many of these cases the characterization precisely describes
the minimal complete class.
The general theory, as well as a very few specific
applications, is described in the remainder of this chapter. cations can be found in the cited references.
Many more appli-
The results to follow should be
compared to the results in the same s p i r i t for estimation which appear in Chapter 4. 7.12
Setting and Definitions Throughout the remainder of this chapter {p Q : θ€Θ}
is a standard
Ό
exponential f a m i l y .
The parameter space Θ is divided i n t o non-empty n u l l
and a l t e r n a t i v e spaces ΘQ, Θ-; so t h a t Θ = ΘQ U Θ..
In the customary fashion,
a t e s t of Θg versus Θ, is uniquely s p e c i f i e d by i t s c r i t i c a l f u n c t i o n , φ, where Φ(x)
= P ( t e s t r e j e c t s ΘQ|X = x ) .
Φ 1 i s as good as a t e s t Φ 2
V
(1)
π
The power of ψ i s π ( θ ) = E θ ( ψ ) .
A test
if
θ ) (θ)
e >
θ€θ
>. πφ (θ)
θ eΘ
*• V
I t is better i f there is s t r i c t inequality for some θ € Θ.
(Here, and in what
follows, we write, "a test φ" in place of the more precise but cumbersome phrase, "a test with c r i t i c a l function φ".) no better test.
A test is admissible i f there is
The decision-theoretic formulation with a two-point action
space A = {a^, a.} and a loss function of the form L(θ,
a.) = A(θ) > 0 J
if
θ d Θ., j
=0
if
θ € Θ. , J
yields the same ordering among tests, and hence the same collection of
admissible tests. Let (2)
Ur
= ϋ Γ (Θ, θ 0 )
=
(u: I lul I = 1, 3 θ € 0 3 I Iθl I > r, I
and
u = j ^ J^ \ , llθ - θ o ll J
r £θ and let (3)
U(0, θ 0 ) u
=
n U ( θ , θQ) u r >0 r
and
U*(0, θ n ) υ
=
Π 0 ( 0 , θn) ϋ r^O r
Note that i f 0 is a closed cone then U = U*; more generally U c U*.
.
I t is
possible that U = φ but U* f ψ. If S cR (4)
is a convex set l e t α(u)
= ou(u)
= sup {x u:
x € S}
.
This function is defined for u € R , , although we will mainly be interested its values for | | u | | = 1. (5)
As is well known,
S =
n
FΓ(u, α s ( u ) )
.
I t is clear from the definition (4) that α( ) is lower semi continuous. The following lemma is a key result which leads directly to the f i r s t main theorem.
A result of this type was f i r s t proved and used by
Birnbaum (1955) in the case of testing for a normal mean.
A general result
similar to the following lemma was then proved and applied in Stein (1956b). 7.13
Lemma Fix θ 2 € Rk.
(1)
where U* = U*(0 1> θ 2 ) .
Let S =
n ίΓ(u, α ς ( u ) ) b u€U*
Assume further either that
in
222
STATISTICAL EXPONENTIAL FAMILIES
(2)
S =
n FΓ(u, α ς (u)) , S U€U
(U = U(Θ, θ J ) , ά
or ou(u) i s continuous a t u f o r a l l u e U* - U.
Let φ ^ x ) = 1 f o r a l l x ί S.
Suppose Φ2 is as good as φ j . Then Φ 2 (x) = 1 f o r x i S, a . e . ( v ) . (Note: v{x:
x ί S ,
Proof.
A more formal way to s t a t e the conclusion of the lemma is
φ 2 ( x ) < 1} = 0 . ) Assume f o r convenience θ 2 = 0.
is f a l s e .
Suppose the conclusion of the lemma
Then there is an ε Q > 0 , u Q e U* such that
(3)
CQ
=
{x:
Φ 2 (x) < l - e
l
o
satisfies v(C Q n H + ( u Q , α ( u Q ) ) )
(4) Assume u Q € U. {p.uQ:
i=l,...}
cz 0 .
Theorem 7.7 y i e l d s
\~ f 2
.
Then there is a sequence { p . } with p. -> °° such that
>-yCΊV 1 - π
> 0
(ρ u 0 ) 1
V ( ε
)
) e9'X π
n
v(dx)
X
/
\
e J υ(dx) + C o nH (u o ,α(u o ))
<_ CQ exp (-p.ε ) + 0
as
i+«
Hence π. (p.un) > π. (pn u n ) for i sufficiently large, which shows that φ 9 is φ
1 U
φ
c.
1 U
not b e t t e r than φ-. Now assume u Q ί
U but ou(u) is continuous a t u Q e U* - U.
Then
ε Q > 0 i n ( 3 ) can be chosen small enough so t h a t (6)
v(Cn Π H+(u, α(u))) u
f o r a l l ||u||=l with | | u - u o | | < ε Q .
>
εn u
Theorem 7 . 7 , including formula 7 . 7 ( 4 )
the constant c appearing i n 7 . 7 ( 3 ) , now y i e l d s , f o r θ = pu € M,
for
TAIL PROBABILITIES
1 - 7τ Φ
(7)
(pu) *
<
1 - πA (pu) n
1 for
||u|| = 1 with ||U-UQ|| < ε Q .
epu'x
v(dx)
/ epu*x C o nH + (u,α(u))
v(dx)
/ fl~(u,ct(u))
~
223
£
(l/εo)e-P o
uQ € U * ( Θ 1 )
implies there e x i s t s a sequence
θ.
e Θ χ with
| | θ . | | ->oo such t h a t θ . / d l θ . H ) -• u Q .
π
(θj) > π
( θ Ί) for i sufficiently large.
than φ-.
It
I t follows from ( 7 ) t h a t
Consequently φ« i s not b e t t e r
follows from the two cases t r e a t e d above t h a t φ ? b e t t e r
than φ, implies Φ 2 (x) = 1 for (a.e.) x ί S.
Lemma 7.13 leads directly to a criterion which can often be used to prove admissibility of conventional tests for appropriate testing problems. 7.14 Corollary Let {p : θ e 0 } , θ = 0Q U θ j be a standard exponential family, as in 7.12.
Let θ 2 € Rk and
(1)
S
=
n H"(u, α ς ( u ) ) b u€U*
where U* = U * ^ , θ 2 ) , as in 7.13(1). Assume (also as in 7.13) that 7.13(2) is satisfied or that ou(u) is continuous at u for all u € U* - U. Let φ(x) = 1 - χ s (x) Proof.
(= 0 if x € S, =1 if x £ S ) . Then φ is an admissible test.
Suppose φ 1 is any test as good as φ. Then, φ'(x) = φ(x) = 1 for
a.e.(v) x € S by Lemma 7.13. But then, π.,(θ 0 ) i ^ ( Θ Q ) implies φ'(x) = φ(x) = 0 for a.e.(v) x 6 S. Thus, φ 1 = φ a.e.(v). admissible. Remark.
It follows that φ is
|| It follows from Corollary 7.14 that if θ^ is a bounded null hypothe-
sis and Θ = R k then any nonraήdomized test with convex acceptance region is
224
STATISTICAL EXPONENTIAL
admissible.
FAMILIES
When ΘQ = { θ Q } is simple and v is dominated by Lebesgue measure
such t e s t s i n f a c t form a minimal
complete class — i . e . a t e s t is
admissible
i f and only i f i t is nonrandomized and has convex acceptance region
(a.e.(v)).
This i s the fundamental r e s u l t which was proved by Birnbaum ( 1 9 5 5 ) .
See
Exercise
7.15
7.14.3.
Application
( U n i v a r i a t e general l i n e a r model)
Here is a customary canonical form f o r the normal theory general Y € Rp has the normal N(μ, σ I ) d i s t r i b u t i o n , μ s + 1 = . . . = μ
l i n e a r model: σ
2
1 £
> 0 , and the null Γ
£
s
£ P
hypothesis to be t e s t e d is t h a t μ, = . . . = μ
(See, e . g . Lehmann (1959, Chapter 7 ) . )
= 0,
= 0,
This can be reduced
via s u f f i c i e n c y and change o f v a r i a b l e s to a t e s t i n g question o f the form P 2 considered above.
Let X. = Y.,
butions of X = ( X 1 » . . . , X S + J
i=l,...,s,
X$+1 =
i =l , . . . , s ,
hypothesis i s , t h e r e f o r e , Θ Q = {θ € N:θ. = 0 , I
θ $ + 1 = -l/2σ
Then the d i s t r i -
The F-test when
.
The null
i = l , . . . , r } , so t h a t
Qd. > 0 } , where of course W = {θ € R s + I :
Figure 7 . 1 5 ( 1 ) :
.
form a minimal standard exponential f a m i l y with 2 2
canonical parameters θ . = μ Ί /σ ,
Θ χ = {θ € N:
Σ Y
r = l
θ g + 1 < 0}
= s,
p = 2
.
TAIL PROBABILITIES
225
The usual likelihood ratio F-test accepts if (and only if)

(1)    F = (Σ_{i=1}^r Y_i²/r) / (Σ_{j=s+1}^p Y_j²/(p - s)) ≤ F_α ,

as determined from tables of the F-distribution. In terms of the canonical variables this region is

(2)    (Σ_{j=1}^r X_j²/r) / ((X_{s+1} - Σ_{j=1}^s X_j²)/(p - s)) ≤ F_α ,

or

(3)    K Σ_{j=1}^r X_j² + Σ_{j=r+1}^s X_j² ≤ X_{s+1} ,    where K = 1 + (p - s)/(r F_α) > 1.
(The simple situation for r = 1 = s, p = 2 is illustrated in Figure 7.15(1), above, which shows K in the upper half-space and N in the lower half.
Compare
Figures 7.11(1) and Figure 7.12.3.) Consider a point z in the boundary of the acceptance region (3). s r 2 ? Thus, K Σ z. + Σ z = z,.,-,. The outward normal at z is v = (2Kz Ί ,... ,2Kz , J J s+l I r Ί r+1 Γ 2 Z Γ + Ί , . . . , 2 z $ , - 1 ) . Except for the (s + 1 - r) dimensional set having Σ Z . = 0 all positive multiples of this vector lie in Θ J. It follows that 7.13(1) and 7.13(2) are satisfied (for any choice of θ Q € Θ Q ) . Thus the F-test (1) (or (2)) is admissible. Note that the test remains admissible by the same r 2 2 reasoning if e , is restricted by Σ y. > aσ since then r
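As a small numerical companion (an editorial sketch, not from the text), the constant K in (3) can be computed directly from the tabled F critical value; the level α and the use of scipy are illustrative choices.

```python
# Compute K = 1 + (p - s)/(r * F_alpha) for the acceptance region (3),
# where F_alpha is the upper-alpha critical value of F with (r, p - s) d.f.
from scipy.stats import f

def region_constant(r, s, p, alpha=0.05):
    F_alpha = f.ppf(1 - alpha, dfn=r, dfd=p - s)   # tabled critical value
    return 1 + (p - s) / (r * F_alpha)

# Example: the simple case r = s = 1, p = 2 pictured in Figure 7.15(1).
print(region_constant(r=1, s=1, p=2))   # a value K > 1
```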
Θ,I = {θ € W:
2 Σ θI > -2 a θs+i c+1} i=1
The same style of reasoning can be used to prove admissibility of a wide variety of tests involving the univariate and multivariate general linear model.
It was used in Stein (1956b) to prove admissibility of
226
STATISTICAL EXPONENTIAL FAMILIES
Hotelling's T
test; Giri (1977) contains a compilation of other results
provable by this method, and further references. 7.16
Discussion I f a test is shown to be admissible by virtue of Theorem 7.14 this
does not, in i t s e l f , constitute a strong recommendation in favor of the test. In principle the following situation may exist:
there may be another test φ1
with π , (θ) <_ τr.(θ) for a l l θ € ΘQ and with π.,(θ) >. π (θ) for "most" θ € Qy I t might occur that π , (θ^) > π ( θ . ) for θ € Θ-except when both π , and π are \/ery nearly one.
In such a case φ' would dominate φ for a l l practical purposes.
Of course, a procedure whose admissibility can be proved by Theorem 7.14 may also be a desirable one. this.
The F-test of 7.15 is a good example of
I t is admissible from several perspectives in addition to that of
Theorem 7.14. The most surprising of these properties is undoubtedly the fact that i t is a Bayes test.
See Kiefer and Schwartz (1965) and Exercise
7.16.2. The F-test is also locally optimal (D-optimality) in the sense that i t maximizes (among level-α tests) ?
(1)
min σ y(EΘ0
Γ a2 ? Σ - \ π. (μ, σ )
i = l d/.
.
φ
See Giri and Kiefer (1964) or Giri (1977) and Exercise 7.16.3. When r = s the F-test, φr> is also optimal in the sense that for any constant c > 0 and any level-α test Φ (2)
r Σ μ 2 /σ 2 = c 2 } i=i i 2 ^ 2 2 2 > min {π.(μ, σ ): Σ μ./σ = c} Ί Φ i =i
min {π. (μ, σ 2 ) : ΦF
with equality only if φ = φ_. Note that the left side of (2) is a constant. See Brown and Fox (1974b). Brown and Fox (1974a) yields the same result for s + 1 = r. For r £ s + 2 it is only known that the (minimax) inequality (2) is valid without the (admissiblity) assertion of equality only if φ = φp. This
TAIL PROBABILITIES
227
(minimax) assertion follows from the Hunt-Stein theorem as stated in Lehmann (1959). The next lemma is needed for the complete class theorems which follow i t . 7.17
The lemma can be viewed as an elaboration of Theorem 2.17.
Lemma Let ω be a sequence of ( l o c a l l y f i n i t e ) measures concentrated on
QcR
,
Then there exists a subsequence ω ,, a closed convex set S, and a
( l o c a l l y f i n i t e ) measure
ω concentrated on Θ such that
λ ω ( (b) * »
,
b (£ S .
If ω i, ω, and S are as in (1) and θ 2 £ R then (2)
S =
where U* = U*(Θ, θ 2 ) .
Proof.
n R"(u, α ς (u)) b u€U*
,
(This is s i m i l a r to 7 . 1 3 ( 1 ) . )
The f i r s t part of the lemma is a direct consequence of Theorem 2.17.
To prove (2) l e t T = n H"(u, α Q ( u ) ) and suppose y € T°. b U€U*
there is an x(u) € S such that u
Then for eyery u € U*
x(u) > u y.
Define N(u) by (3)
N(u) = {v: I|v|l = 1, v
x(u) > v
y}
.
N(u) is a r e l a t i v e l y open subset of the unit sphere and u € N(u).
Hence
U N(u) 3 U*, and there is a f i n i t e subset u Γ . . . , u c U* such that u€U* r (4) N = U N(u.) => U* . Ί i =l For convenience l e t x i = x ί u ^ . (5)
sup { | | θ | | :
Now,
θ €0
,
228
STATISTICAL EXPONENTIAL FAMILIES
otherwise there would be a sequence v. ί S with v^ -> v (v £ tj since M is open) and a sequence p. •* °° such that p.v. € Θ, contradi cti on.
(6)
i=l,...
but then v 6 U* c |, a
Then
/eθ-yω.(dθ)<
eBHy|l
||Θ||
{{θ:
£ /e0#xi
ω.(dθ)
zxω (x.) by (3), (4), (5) and the simple fact that ω n ι(ίθ:
B11x Ί 11 θ x Ί
I |θ| I <_ B>) <_ e
• fe
I t follows from (6) and (1) that y € S.
Hence T° c s.
closed and convex this implies T = S.
||
ω n '(dθ) .
Since T and S are
Here is the complete class theorem from Farrell (1968). to situations where ΘQ is compact and ΘQ and Θ1 are separated sets. Theorem 7.19 for a partial converse.
I t applies See
Results like Theorem 7.18 and 7.19
have been proved in contexts somewhat more general than ordinary exponential families.
See Schwartz (1967), Oosterhoff (1969), Ghia (1976), Perlman (1980),
and Marden (1982a, 1982b), for such extensions and various applications.
In
k the following statement Θj denotes the closure in R , not merely the closure relative to W. 7.18
Theorem Let ΘQ c M be compact and assume θ Q n 0
admissible test.
= φ.
Let φ1 be an
Then there exists an equivalent test φ ( i . e . π.,(θ) = π.(θ)»
θ e ΘQ U Θ j ) , a convex set S satisfying 7.17(2), and a (locally f i n i t e ) measure H. on Θ•,
i = 0 , l , such that λn (x) < °° for x e S° and
TAIL PROBABILITIES
1
if
229
x ί S λ H (x)
(1)
Φ(x)
=
1
if
0
a.e.(v).
if
x € S° ,
—! λ H (x) λH (x) —1 λμH ( x ) 0
x € S° ,
>
1
<
1
,
I f (Θ Q u Θ j ) 0 f φ then Φ = Φ 1 ;
( λ H ( x ) is f i n i t e since H Q (Θ 0 ) < °°.)
and hence a l l admissible tests are of the form φ i n ( 1 ) . I f φ1 is admissible then according to Theorem 4A.10 there exists an
Proof.
equivalent test φ and a sequence of a priori
distributions G (concentrated
on f i n i t e subsets of Θ) whose Bayes procedures Φn (say), converge to φ in the topology of 4A.2. weak*
By Exercise 4A.2.1 this convergence means that φ + φ
— i .e.
(2)
/ (Φ n (x) - φ(x)) g(x)v(dx)
for every v integrable function g.
-
0
A consequence of (2) is that i f a subse-
quence of Φ n (x) converges pointwise on some (measurable) subset T c K (say Φ n .(x) •* λ ( x ) ,
x € T) then the l i m i t must be φ ( i . e . , φ(x) = λ ( x ) ,
x € T,
a.e. (v)) Let (3)
Tn(dθ)
H
Note that
=
e
"orW
-ψ(θ) G_(dθ)// e -Ψ(θ)
= 1.
Let ω = HQ
Φn
i = 0, 1
Then
1 (4)
θ € Θ ,
n (dθ)
G
λ
H
(x)
>
1
<
1
(x)
+ Hχ .
0
Let ω ,, ω, S be as in Lemma 7.17. Let H.. = u).Q
i=0, 1, so that H i n , -> HΊ ,
i=0, 1, as n1 ->«.
,
Then H Q ( Θ ) = H Q (Θ 0 ) = 1 since
230
STATISTICAL EXPONENTIAL FAMILIES
H Q n (Θ 0 ) = 1 and ΘQ is compact.
The assertions in (1) follow from this along
with Lemma 7.17, ( 4 ) , and the decision theoretic facts in the f i r s t paragraph of the proof. I f φ1 and φ are equivalent and (ΘQ U Θ j ° by completeness (Theorem 2.12).
t φ then φ1 = φ ( a . e . ( v ) )
||
Many of the tests produced by the recipe 7.18(1) are admissible. In certain s t a t i s t i c a l situations, i t can even be concluded that a l l of them are admissible.
Then Theorem 7.18 describes the minimal complete class.
following converse to Theorem 7.18 contains statements of these facts.
The It
is not entirely satisfactory but i t is the best general result we have been able to devise. (1)
Θ*
For the purpose of this theorem define =
{θj € Qy
θ1 € W
+ p,θ1 € Qι
θ 2 € β1 3 (1 - ρ)θ 2
or there is a
for
0 <_ p < 1}
(See Exercise 7.19.3 for an extension of ( 1 ) . ) 7.19
Theorem
Consider the testing problem described in Theorem 7.18. Suppose φ satisfies 7.18(1) where H, is concentrated on Θ* and S satisfies all the assumptions of Lemma 7.13 relative to some θ 2 € R . Suppose also that
(2)
φ(x)
1 )
if
. x€ S
\ and
( X )
>
1
λHμ (x) < 1 , 0
(This is a mild extension of the l a t t e r part of 7.18(1).)
a.e.(v)
Then any c r i t i c a l
function as good as φ must also satisfy (2) and 7.18(1) with the same values of S, H o , H r
I f also either
(31)
v({x:
λ H (x) —i
=
1
and
φ(x) < 1})
= 0
,
TAIL PROBABILITIES
231
or λ H (x) (3")
v({x:
—ΐ λ H (x)
=
1
and
φ(x) > 0})
= 0
,
or (Supp (H o + H^)0
(4)
f
φ
then φ is admissible; and i f η is as good as φ then η = φ a . e . ( v ) . I f v is dominated by Lebesgue measure, U(Θ, θ 2 ) = U*(Θ, θ 2 ) for some θ 2 € Rk, and Θj c Θ* then the collection of tests of the form 7.18(1) is a minimal complete class. Proof.
Suppose φ is defined by (2) and 7.18(1) where S s a t i s f i e s the
assumptions of Lemma 7.13. Suppose ηis another c r i t i c a l function as good as φ.
Then η(x) = 1 i f x f. S by Lemma 7.13. I f θ € Θ- then
0 £ /(η(x) - φ(x))e θ # x v(dx)
(5)
since 0 £ π (θ) - π (θ). By continuity (5) also holds if θ € § 1 n M. Now, suppose ζ = (1 - p)θ«+ pθ, € Θ, for 0 <_ p < 1. Then (5) holds at θ = ζ and ζ x /(η(x) - φ(x))e p v(dx) is continuous in p as p t 1 by Exercise 1.13.l(ii)It follows that (5) holds whenever θ € 0, . The opposite inequality to (5) holds when θ € Θ Q , and H Q is finite since Θ Q <= W is compact. Hence (6)
0 < / (/(η(x) - Φ(x))e θ # x vίdxJJίHjtdθ) - H Q (θ)) .
Notice that η(x) - φ(x) < 0 whenever λ u (x) > λn (x), so that —
Πj
n0
+
/ (η(x) - Φ(x)) λ μ (x) v(dx) £ /(η(x) - φ(x)) + λ H (x) v(dx) o
/ / e θ ' x v(dx) HQ(dθ)
232
STATISTICAL EXPONENTIAL FAMILIES
Furthermore, as already noted, H Q is a f i n i t e measure. i n t e g r a t i o n i n ( 6 ) can be reversed, y i e l d i n g
(7)
0
£
Hence the order of
that
/ (η(x) - Φ ( x ) ) ( λ H (x) - λH (x))v(dx)
<
-
with the i n t e g r a l extending only over the region x € S since η ( x ) = φ(x) f o r x ί S.
Because φ s a t i s f i e s ( 2 ) , the integrand i n ( 7 ) is non-positive; hence
η(x) also s a t i s f i e s If
( 2 ) , f o r otherwise the i n t e g r a l would be negative.
in addition φ s a t i s f i e s
( 3 1 ) then ^ . ( θ ^ > TΓ ( θ ^ ,
(a c o n t r a d i c t i o n ) unless η ( x ) = Φ(x) a . e . ( v ) .
Similarly i f
( 3 " ) is
s a t i s f i e d n ( x ) = Φ(x) a . e . ( v ) ; f o r otherwise π φ ( θ Q ) < ^ ( Θ Q ) , F i n a l l y , suppose ( 4 ) is s a t i s f i e d i n place o f ( 3 1 ) or ( 3 " ) . reasoning f o l l o w i n g
Θ Q e ΘQ. Note t h a t the
( 7 ) shows t h a t e q u a l i t y holds i n ( 7 ) and hence i n ( 6 ) .
From t h i s i t follows t h a t / ( n ( x ) - Φ ( x ) ) e θ " x v ( d x ) = 0
a . e . HQ + H χ
t h i s i n t e g r a l is non-negative on θ * and non-positive on Θ Q .
since
( 4 ) then implies
η ( x ) = Φ(x) a . e . (v) by completeness and hence φ is admissible. the
ΘJ e Θy
This completes
proof of a l l assertions i n the middle paragraph of the theorem. If
v is dominated by Lebesgue measure and also s a t i s f i e s the
remaining assumptions of the l a s t paragraph of the theorem then λH (x) v{x:
xήxΓ H
= 1}
=
°
o 1
so that any test, φ, of the form 7.18(1) is also of the form (2), and (3 ) (and (3")) is satisfied, and H- is concentrated on Θ- <= 0* and S satisfies assumption 7.13(2) of Lemma 7.13. It follows that φ is admissible.
||
COMPLETE CLASS THEOREMS FOR TESTS (Contiguous Hypotheses) 7.20 Definitions: It is necessary to characterize the local structure of Θ, near ΘQ. Let Θ Q = {θ Q } and Θj be given and let
TAIL PROBABILITIES (1)
233
J(ε) = { J : J is a f i n i t e non-negative measure on {θ: θ ε Θ r ||θ-Θ o || < ε } , J J(dθ) < 1 , f
/
< »,
^ τ J ( d θ ) | | < 1}
I|Θ-Θ O || 2
Then let (2)
Δ(ε) = {(v,M): V = /
^J(dθ),
IIΘ-Θ 0 II (Θ-Θ M
=j
2
)(Θ-Θ )*
Q
o
ιiβ-θoιr
j
J(dθ)>
ε
J ( ε ) κ
Also, let Δ = Γ\ Δ(ε). Note that v ε R and M is a positive semidefinite ε>0 k x k matrix, and Δ and Δ(ε) are compact, convex sets. In various typical statistical problems it is not hard to explicitly describe Δ. For example, if
Θ Q = 0 and Θ = θ Q U "Θ, is a closed conical
set then Δ is the convex hull of points of the form (v,0): v ε Θ, ||v|| <_ 1 , and (0,M): M = vv 1 3 v ε Θ, -v ε Θ, ||v|| < 1 . (See Exercise 7.20.1.) As another example, suppose
Θ is a twice different-
iable curved exponential family at Θ Q . This means that there are two orthogonal vectors u ., ιu ε R , with ||ui|| = 1 such that for θ ε Θ (4)
θ-θ Q = ((θ-θ o )-u 1 )u 1 + ||θ-θo||2u2 + o(||θ-θ 0 || 2 ).
(Note in (4) that
KΘ-ΘQJ U ^
2
= ||θ-θo|| + o( ||θ-θ o || ), and also that u 2 = 0
is a possible value of u 2 ) Then Δ is the convex hull of (u-j,O), (-u-j,O) and
(u^sUnU-j). (See Exercise 7.20.2.) As with earlier results the full complete class characterization is not
directly as a generalized Bayes test but involves an extension of this notion. As part of this extension the kernel
θ
x
e ' is replaced by
(5)    ω(θ, x) = (e^{θ·x} - 1 - θ·x)/||θ||² .
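A brief check of the behavior of this kernel near θ = 0 (an editorial note): along θ = t·u with ||u|| = 1, ω(θ, x) → (u·x)²/2 as t → 0, which is where the quadratic term (x - x₀)'M(x - x₀)/2 of Theorem 7.21 comes from (compare the expansion used in its proof). A numerical sketch:

```python
# omega(theta, x) = (exp(theta.x) - 1 - theta.x) / ||theta||^2
# As theta = t*u with t -> 0, omega -> (u.x)**2 / 2.
import numpy as np

def omega(theta, x):
    t = np.dot(theta, x)
    return (np.exp(t) - 1 - t) / np.dot(theta, theta)

u = np.array([0.6, 0.8])          # a unit vector
x = np.array([1.0, -2.0])
for t in [1.0, 0.1, 0.01, 0.001]:
    print(t, omega(t * u, x), (np.dot(u, x) ** 2) / 2)
```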
A converse result which sometimes yields a characterization of the minimal complete class is given in Theorem 7.22. As with earlier results both of the following theorems can be profitably extended beyond the exponential family context in which they are proved below. See Marden and Perlman (1980), Marden (1981, 1982b), Cohen and Marden (1985), Brown and Sackrowitz (1984, Theorem 6.1), and Brown and Marden (in preparation). 7.21
Theorem Let Θ Q = {θ Q } be a simple null hypothesis. Let φ 1 be an admissible
test of Θ Q versus 0^. Then there exists an equivalent test φ and a closed convex set S satisfying 7.17 (2) such that (1)
Φ(x) = 1
x Φ S.
Further, for every x Q € S° there is a finite non-negative measure H on ^ - {θ Q } θ (x-x n ) with S° c {x: e H(dθ) < «>}, a constant C ε R, an M ε Δ 2 , and a v ε R k satisfying (3), below, with at least one of C, H, v, M being non-zero, such that for all x ε S° 1
< if C
0
/ ω(θ,x-x Q )e
θ xn
°H(dθ)+v (x-x 0 )
+ (x-xo)'M(x-xo)/2
>
If θn ^ ψ then Φ = Φ 1 a.e. (v). Define θ-θ~ ||θ-θo||>ε Then there is a sequence
(3)
ε i -> 0
such that
»" ~ 0 " 1imvε
= vQ
(say) exists, and
(v Q ,M) ε Δ. (Note that if
written as
J||θ||
H(dθ) < «
the extreme right side of (2) can be re-
TAIL PROBABILITIES 0")
/5
235
θ (x-x 0 ) 'Lsl H(dθ) + v o (x-xo)'M(x-xo)/Z .
In particular, lim vε = vυ n ε->0
exists.)
Proof. The assertion just after (2) follows from completeness, as in Theorem 7.17. Now, suppose Φ' is admissible. Then by Theorem 4A.10 there is an equivalent <J> and a sequence of prior distributions G. concentrated on finite subsets of Θ such that the Bayes procedures, ψ. = Φ G , converge to Φ in the topology of 4A.2. (See the proof of Theorem 7.18 for further remarks.) Without loss of generality let θ Q = 0 and
Thus Φ J ( X ) = ί 0 } according to whether (4)
e θ ' x G!(dθ)
/ Θ
>
1 .
l
As in 7.17, it is possible to reduce {G!} to a subsequence (if necessary) such that now for some closed S satisfying,7.17 (2), lim Je θ ' x G!(dθ) = « 0 <_ lim Je θ ' x G!(dθ) = q(x) < » where G! -> G1
x i S x e S° ,
and q(χ) = Je θ ' x G' (dθ). Clearly, (1) is satisfied.
Assume without loss of generality that x Q = 0 € S°. Rewrite (4) as (5)
/ ω(θ,x)||θ||2G!(dθ) + J (θ χ)G!(dθ) > cϊ . Θ Θ l l
Let d. = J||θ||2G!(dθ)+||/θG!(dθ)|| + |c!| and H.(dθ) = dT11| θ||2G!(dθ). Substituting in (5) and multiplying through by dT
yields
236 (6)
STATISTICAL EXPONENTIAL FAMILIES /
ω(θ,x)H 1 (dθ) + / θ
e
- ^ 2 ||θ||
H.(dθ) >
C 1
where jH^dθ) + || /(θ/||θ||2)Hi(dθ)|| + Ic^ = 1. Reduce { H ^ to a subsequence (if necessary) so that /(θ/1|θ||2)Hi(dθ) •*• v, c i + C.
(7)
H i -• H 1 since G\ ->• G 1 . Furthermore /H'(dθ) + | |v| | + C = 1 since x Q = 0 € S° Let H = H',- o } . Let ε > 0 such that H({θ:||θ|| =e}) = 0. (AΠ but a countable set of e's satisfy this.) For each x ε S (8)
/ ||θ||>ε
ω(θ,x)H.(dθ) + / ω(θ,x)H(dθ), 1 ||θ||>ε
1/
(ω(θ,x) . 2 L L § Θ ^ ) H (dθ)| = 0(ε)
and (9)
iiθiiiε
2iiβir
Ί
since (e* - 1 - t - t 2 /2)/t 2 = 0(t) and jH^dθ) <. 1.
Another subsequence may
now be taken, if necessary, so that the following limits exist: (10)
v = 11m f ε
(11)
— 5 - y H.(dθ) = v - /
1— ιiθiι<e Heir
1
- ^ - T H(dθ)
ιiβiι>ε
iieir
M = lim / - ^ H.(dθ). ε
i - llθll 2
Ί
By definition, (v ,M ) ε Δ(ε). Δ(e) is compact in the obvious topology. Hence there is a subsequence ε. + 0 so that (v ,M£ ) -> (v Q ,M) ε Δ. If j
J
necessary another subsequence of {H.} may be extracted using a diagonalization argument so that (10) and (11) hold for each ε . It follows from (5), J (7), (8), (9), and (11) that for x ε S° 1 < ΦΊ (x) + if C Jω(θ,x)H(dθ) + v x + x'Mx/2 .
TAIL PROBABILITIES
237
Note that Tr M = H'({0}). Hence Tr M + H(Φ" J) + | |v| | + C = 1 so that at least one of M, H, ||v||, C are non-zero. It follows from (10) that (3) is satisfied. Since Φ Ί -* ψ in the topology of 4A.2 this yields (2).
||
7.22 Theorem Consider the testing problem described in Theorem 7.21. Suppose θ Q ε W° and φ satisfies 7.21(1), (2), and (3) where "S satisfies all the assumptions of Lemma 7.13 and H is concentrated on Θ*, as defined in 7.19(1). Suppose ψ(x) is also given by 7.21(2) for x e S - S°. Then any critical function as good as ψ must also satisfy 7.21(1) for x i S" and 7.21(2) for x ε S (a.e. (v)) with the same values of S, H, v, M, C, x Q . If also either (1)
v{x: ω(θ,x-xQ)H(dθ) + v'-(x-x Q ) + (x-xQ)'M(x-xo)/2 = C, φ(x) < 1} = 0
or (I 1 ) v{x: ω(θ,x-xo)H(dθ) + v .(x-xQ) + (x-xQ)'M(x-xQ)/2 = C, φ(x) > 0} = 0 or (2)
(Supp H)° f φ
then φ is admissible; and if η is as good as φ then η = Φ a.e. (v). If v is dominated by Lebesgue measure, U ( Θ , Θ 2 ) = U * ( Θ , Θ 2 ) for some θ 2 ε R k , and e^ = Θ* then the collection of tests of the form 7.21(1), (2) is a minimal complete class. Proof. Much of the proof resembles that of Theorem 7.19 (as does much of the statement of the theorem). Assume with no loss of generality that θ Q = 0 and x n = 0. Let ε. + 0 and v U
j
be as in 7.21(3) and let J. ε J(ε.)» be measures ε
J
supported on finite subsets such that vε (3)
j /
- /
eθ
'
θ 2
11 Θ 11 ?
c
JΛdθ) - 0 J
J.(dθ) -> M.
J
238
L β t
STATISTICAL EXPONENTIAL FAMILIES
H
li=H|{θ:||θ||>εi>+
is b e t t e r than
(4)
For each
(5)
O
ψ
4
J
then
η
J
i'HOi({O}) satisfies
( n ( x ) - φ ( x ) ) Je θ
χ
=
C
+
/llθll'2ji
7.21 ( 1 ) and
(Hli(dθ)-Hoi(dθ))v(dx),
x ε S Jeθ'x(H,.(dθ) - HQ.(dθ)) = /
ω(θ.χ)H(dθ) + v x + x ' M x / 2 - C
v
J
^
)x
-x'Mx/2).
Lemma 2.1 implies that the dominated convergence theorem can be invoked in (4), (5) as i -> 00 (6)
since 0 ε hi6 and ω(θ x) = 0 ( e θ ' x + l ) .
Hence
0 4 J (η(x)-φ(x))(/ω(θ x)H(dθ)+v x+x'Mx/2-C)v(dx).
It follows that η satisfies 7.21 (2). The remaining assertions of the theorem are proved just as the analogous assertions in Theorem 7.19.
||
TAIL PROBABILITIES
239
EXERCISES 7.2.1 Prove proposition v({||x|| > α}) = 0 ( e " ε α )
7.2.
[ I f v is a f i n i t e measure and
then E v ( e
ε Ί | x | !
)
< » for a l l 0 < ε1 < ε . ]
7.4.1 (i)
L e t S be a convex s e t w i t h p = i n f { | | x | | : x (. S } .
some ε > 0 , c < °°
Suppose f o r
7 . 4 ( 1 ) holds i . e .
(1)
P ft ( { X / α (. S } ) < θ
c exp(-εpα)
V
α € R .
o
Show that {θ: | |θ - Θ Q || < ε 1 }
N° for all ε 1 < ε.
(ii) Give an example of
a nonconvex set with v{x: ||x|| < p, x (. S} > 0 and in which (1) holds but {θ: I|θ - ΘQ|I < ε 1 } φ N° for any ε 1 < ε. 7.5.1 Let Θ Q e W° and H + = H + (v, α ) . Show lim (n" 1 log P θ (X n € H + )) <_ -K(H + , ξ(θ Q )) .
(1)
[Use Theorem 7.5 and Proposition 5.15.] 7.5.2 In 7.5.1 suppose H + n K° ϊ φ. Show (1)
lim (n" 1 log Pft (X n € H + )) = -K(H + , ζ(θ n )) .
[For one d i r e c t i o n use 7 . 5 . 1 ( 1 ) .
For the other l e t P l n ^ denote the d i s t r i b u -
t i o n of S n under θ = θ ( ξ H + ( θ ) ) . ] (2)
P Q (X € H o )
>
e x p [ - n ( K ( θ , ΘQ) + ε ) ] P ^ n ) ( { S :
•> e x p [ - n ( K ( θ , θ 0 ) + ε ) ] by the Central L i m i t Theorem (Exercise
5.15.1).]
| ( θ Q - θ)
\\ < e } )
240
STATISTICAL EXPONENTIAL FAMILIES
7.5.3 Let Θ Q = θ(ξ Q ) € W°. Let Q be a closed subset of R k . Show lim (n" 1 log P 0 (*n € Q)) < -K(Q, ξ Q ) . [Let ε > 0.
Show Q c
Σ H + ( v . , a . ) where K ( H + ( v , α . ) ) > K ( Q , ξ n ) - ε .
.=1
i
i
When k > 2 this requires some care.)
i
Apply
i
-
0
7.5.2.]
7.5.4 Let ΘQ = θ ( ξ Q ) .
Let Q c Rk be a set such that
K(Q°, ξ 0 ) = K(Q, ξ Q ) = k (say). (1)
Then
lim n" 1 log P fl (X n € Q)
=
-k
.
o
[Reason as in 7 . 5 . 2 and use 7 . 5 . 3 . ] 7.5.5 Let X , , . . . be i . i . d . random variables on R with d i s t r i b u t i o n F. Let h: X -+ Rk be measurable and Q c R k .
Let ζ(Q) = i n f { ζ p ( x ) : x € Q}
where ξp(x) denotes the entropy as defined in 6 . 1 6 ( 1 ) . and E(exp ( ε | |X| I ) ) < «> for some ε > 0. lim n" 1 log P(X
Suppose ξ(Q°) = ξ(Q)
Then e Q)
=
E(Q)
.
7.5.6 ( i ) Show that K( , ξ Q ) is r e l a t i v e l y continuous on {x: K(x, ζ Q ) < «} i f v(K - K°) = 0 , i f k = 1 , or i f v is concentrated on a countable number of points satisfying Assumptions in Theorem 6.23. K(Q, ξ Q ) = K(Q, ξ Q ) as required in 7.5.4.
I f so, then for Q an open set
( i i ) Given an example where Q is open
and K(Q, ξ Q ) t K(Q, ξ Q ) . [Let v be Lebesgue measure on the f i r s t quadrant of 2 R plus a unit mass at the o r i g i n . ] 7.7.1
Hwang (1983) raises the following question: Let X - N(θ, I ) , k k k θ e R . Does there exist an estimator 6: R -> R for which
TAIL PROBABILITIES (1)
P θ (||δ(X) - θ|| < B )
> PΘ(||X - θ|| < B)
with s t r i c t inequality for some B, θ? dominate" δ Q (x) = x.
241 v
B>0,
,
( I f so, δ would be said to "stochastically
Note that for fixed B > 0 there exists an estimator 6
dominating δQ in the sense of satisfying (1) for a l l θ € Rk. ait.)
θ € Rk
See Hwang (op.
I t can be shown that 6 f δQ exists
and references cited therein.)
satisfying (1) i f and only i f there exists a continuous spherically symmetric function δ f δQ satisfying ( 1 ) . Show that no such function exists. [Suppose I I
I I
ll^vXo/ll
<
I I
x
ιι oM
Let θ p = ρx Q and B
k f o r
s o m e
x
o
€
R
(and
= (p - 1 ) | | X Q | | .
hence
f o r
a
neighborhood o f
xQ).
Show t h a t f o r some ε > 0 , s u f f i c i e n t l y
small,
(i)
- ^
pθp(!lχ-θpn >Bp) Pθ
(x e H ( x 0 ,
I | χ o | | 2 ) , δ(X) e H " ( X Q ,
||XQ||
_P eεpPθ ( X € H - ( θP P
V
||χ o || 2 )
Use t h e m u l t i v a r i a t e g e n e r a l i z a t i o n o f 7 . 3 ( 3 ) t o e s t i m a t e t h e denominator on the l e f t o f ( 1 ) ; then use 7 . 7 ( 3 ) f o r t h e a s y m p t o t i c a s s e r t i o n i n ( 1 ) . A s i m i l a r argument, w i t h d i f f e r e n t θ and B , a p p l i e s when | | δ ( x Q ) | | > | | x Q | | f o r some x Q e R . See Brown and Hwang ( i n p r e p a r a t i o n ) . ] 7.9.1 Consider the e s t i m a t i o n problem described i n Exercise 4 . 2 4 . 3 . t h a t the e s t i m t o r 4 . 2 4 . 3 ( 1 ) i s admissible. to
[Use Theorem 7.7 and C o r o l l a r y 7.9
show t h a t i f δ 1 i s b e t t e r than 6 then ό ' ( x ) = 0 ,
x = 1 , and symmetrically f o r x >_ 2.
Show
Among a l l
x < 1 , and ό ' ( x ) <_ ^ ,
such estimators 6 minimizes the
risk at θ = 0 . ]
7.9.2
(A uniform Let
version of C o r o l l a r y 7.9)
Vj c V 2 be subsets o f the u n i t sphere i n R
V 2 r e l a t i v e l y open i n the u n i t sphere.
Let
w i t h V- closed and
242
STATISTICAL EXPONENTIAL FAMILIES α(v) = sup {α: K n H + (v, α) f φ} .
(1) Assume α ( v ) < «
V v € V2
Then
α ( v ) is continuous
f o r v € V2
(ii)
V ε > 0
3
δ > 0
3
(ill)
V ε > 0
3
rQ
(i)
(2)
v
v(H+(v, α(v) - ε ) ) > δ ,
v € Vχ
3
ξ(rv)
>
α(v) - ε
V
v € Vχ,
r > rQ
*
7.9.3 Consider a steep exponential f a m i l y . and l e t K be s t r i c t l y convex. t h a t ζ{Q.) for a l l
-> y .
i > I.
Let y € 3K,
Then, ( i ) 3 I < «,
Let K c { x :
< ε,
y < -2ε,
v
ε > 0,
v € V^
v(H+(v, e ) )
(1)
> δ
0 e K,
Let θ. € N ° , 1 = 1 , . . . f such 1 θ. δ > 0 such t h a t v ( H + ( i i Λ ι , ε ) ) > δ
Hence, ( i i ) ψ ( θ . ) >_ ε| | θ | | + I n δ f o r a l l
v € V2;
£ 0},
y ^ 0.
l i m ψ ( θ . ) = °° . η i-**> [There e x i s t V., \L as i n 7 . 9 . 2 and ε > 0 , α(v)
x
i > I , and
(iii)
δ > 0, satisfying
and
V v t V2
.
p (Draw p i c t u r e s i n R to help see why the above i s t r u e . is important h e r e . )
Now, l l θ . l l -•»
.
The s t r i c t convexity θ. Hence, ,, 1 ., f. VΊ for i
(Why?)
s u f f i c i e n t l y l a r g e , by 7 . 9 . 2 ( 2 ) . ] 7.9.4 Consider a steep exponential f a m i l y . closed i n N and assume K i s s t r i c t l y convex. x ί
(ζ(Θ n W°))~.
Show t h a t θ ( x ) t .
Let Θ c hi be r e l a t i v e l y
Suppose x € dK but
(This r e s u l t complements Theorem 5 . 7 .
I b e l i e v e i t should be possible to prove i t by showing the above hypotheses imply t h a t 5 . 7 ( 1 ) i s s a t i s f i e d .
However, the h i n t below i n d i c a t e s a d i f f e r e n t
argument. [Assume x = 0 € K c {x: x χ <. 0} lim ψ(θ) = °° . llθll-χ»,θ€Θ
(w.l.o.g.).
Apply 7 . 9 . 3 to show
Now proceed as i n the proof of Theorem 5 . 7 , f o l l o w i n g
TAIL PROBABILITIES
243
5.7(2).]
7.9.5 Consider a standard exponential
family with natural
parameter
space
k + Let v € R and α Q = sup { α : v(H ( v , α ) ) > 0 } . Let θ^ = ρ Ί v + r^. as i n
N.
Corollary 7 . 9 . Then (1)
11m v
Vψ(θ.)
=
αΩ
.
Hence, there exist a c > -°° such that (2)
ψ(θ.) >. -c + αp. ,
and, consequently, (3)
p θ (x) -* 0
V x e H"(v, α Q )
[The key a s s e r t i o n , ( 1 ) , is a uniform version of Theorem 3 . 9 , since for η. Ξ η i t follows immediately from t h a t theorem.
However, i t seems
easier to prove ( 1 ) as a consequence o f Corollary 7 . 9 . ( A l t e r n a t i v e l y , one may also derive the above, as well as 7 . 9 , through an a p p l i c a t i o n of convex d u a l i t y , since K° = R, e t c . ) ] 7.11.1 In the s i t u a t i o n i n Corollary
7 . 1 1 l e t p ( θ , ) = Pfl ( S 9 ) / ( P f l ( S j ) .
C o n s t r u c t examples ( i ) i n which p ( θ . ) ~ | | θ . | |" α ,
α > 0 ; ( i i ) i n which
p f θ j ) •* 0 b u t I | θ i I | α ρ ( θ Ί . ) + oo f o r a l l α > 0 ; and ( i i i ) i n which p ( θ Ί ) = 0 ( 1 |Θ 1 | Γ
α
) f o r a l l α > 0 but e '
α ll θ
i ' l p ( θ 1 ) •> «,
for a l l α >0.
[ ( 1 ) L e t k = 1 , v ( { 0 } ) = 1 and v ( d x ) = x 0 1 " 1 dx on x > 0 . ]
7.12.1 Consider a t e s t i n g problem, as i n 7.12 with ΘQ = H(v, α) n A/, O
= M_ 0
anc
j 0
z^ 1 ) e H(v, α ) ,
n Wβ ^ φ.
z ^
For z € R k , l e t z = z ^
+ z" ' where
= pv l H(v, α ) . Assume ( w . l . o . g . ) v ( R k ) = 1 . Show
244
STATISTICAL EXPONENTIAL FAMILIES (i) If φ 1 is better than φ then
(1)
φ(x)v(dx|x (1) = y) = /
/
Φ'(x)v(dx|x (1) = y)
y € H(v, α) a.e.(v) and / x ( 2 ) φ(x)v(dx|x ( 1 ) = y) = / x ( 2 ) φ'(x)v(dx|x (1) = y)
(2)
y € H(v, α ) , a.e.(v) (ii)
Show that φ is admissible i f and only i f for some measurable func-
tions CΊ , γ . ,
(3)
i = l,2,
ψ(x)
=
1
if
x(1)
> C2(x(2))
γ2(x(2))
if
x(1)
= C2(x(2))
0
if
Cλ[x{2))
•«
x
-
if
xu;
< C^x^M
/
Ύi \X
1
< x(1) I,.. vX
< C2(x(2))
/
.
[This is a continuation of 2.12.1 and 2.21.2.] (Matthes and Truax (1967).) 7.12.2 Prove that i f φ is an admissible t e s t and Q c X with v(Q) > 0 then φ must also be admissible f o r the same problem with dominating measure V . Q . 7.12.3
Let Xχ = X and X2 = S2 + X2 be the canonical statistics for the two2 parameter exponential family generated by a N(μ, σ ) random sample. (See Example 1.2.) μ
0xl " X2^2
=
Consider Figure 7.12.3. ^
SUC Ί
'
^
a t
V
(R)
=
V
(S).
( i ) Show that this is possible,
Draw the broken line parallel to (v is defined in Example 1.2.) ( i i ) Let Φ1 be the c r i t i c a l function for
the test with acceptance region Q1 + R - S, and let φQ be the c r i t i c a l function for the usual one-sided t - t e s t , which has acceptance region Q1 = {x, < 0 or
TAIL PROBABILITIES x
? -
cx
245
i}
(μ,σ 2 ) ( φ l } (μ,σ 2 ) ( φ l }
(μ,σ 2 )( φ O }
Hence Φ1 is a better test than φQ of (2)
HQ:
μ
[E(Φ 1 - Φ o ) = E ( χ s - χ R ) . (1984).
<
0
versus
μ
Now use Corollary 2 . 2 3 . ]
μQ
^
(See Brown and Sackrowitz
See also Exercise 7 . 1 4 . 6 . )
A Figure 7.12.3: Diagram for Exercise 7.12.3 7.13.1 Here is an example which shows that something more than 7.13(1) p
is needed for validity of the conclusion of Lemma 7.13. Let X € R be bivariate N(θ, I). Consider the problem of testing Θ Q = {0} versus = {θ: θ1 > 0, θ 2 = -θj}. Let S = {x e R 2 :
1 0}.
(i) Show that U = φ but U* = (0, -1). (ii) Verify that S satisfies 7.13(1) but not the remaining hypotheses of Lemma 7.13. (iii) Let φ.ίx) = 1 if x ί S, = 0 otherwise. Show the conclusion of Lemma 7.13 does not apply to φ,. [Let Φp(x) = 1 if x-i ^
< ε or
246
STATISTICAL EXPONENTIAL FAMILIES
x. < 0, x« < -ε. Show for ε > 0 sufficiently small φp dominates φ,.] 7.13.2 The additional assumptions of 7.13 are stronger than necessary. Let X ~ N(θ, I ) , Θo = {0}, S be as in 7.13.1. θ 1 = {(μ, y 4 ) : μ > 0}.
But now let
Note that S satisfies 7.13(1) but does not satisfy
either of the other two assumptions of Lemma 7.13. as φ then φ'(x) = 1 for all x (. S.
Show that i f φ' is as good
Conclude that φ is admissible. [Show
directly that i f Q is an open set in Sc then P (u u * ) i Q ) lim JU!iLJ 50 μ-* P/(μ>μ M )( S )
= oo
.]
7.14.1 A t e s t φ is said to have a nearly convex acceptance region there is a closed convex set A such that φ(x) = 0 , for x (. A.
if
x e A° and φ(x) = 1
(Thus, i f v is dominated by Lebesgue measure any test with nearly
convex acceptance region is equivalent to one with a (closed) convex acceptance region.
See the Remark following Corollary 4.17.)
ΘQ = {ΘQ} is simple in the s e t t i n g of 7.12.
Suppose
Show t h a t any Bayes t e s t has
nearly convex acceptance region. 7.14.2
Let φ. be a sequence of critical functions with nearly convex acceptance regions. Suppose φ. -* φ weak* on L TO . (See 4A.2(1) for the definition of weak* convergence.) Then φ has a nearly convex acceptance region.
[Assume v(R ) < °°. To each φ. there corresponds an A.. Let {u.}
be a countable dense subset of {u:
Mull = 1}. Choose a subsequence {i'}
such that α Δ (u.) converges for each u , say, α Λ ", i
J
A = nFΓ(u., α , ) . Then φ(x) = 0 , J J j
J
M. i
(u. ) -*α . Let J
J
x e A° and =1 for x ft A.]
7.14.3 Suppose ΘQ = { Θ Q } is simple in the setting of 7.12.
TAIL PROBABILITIES (i)
247
Show t h a t the tests with nearly convex acceptance regions form a
complete class. (ii)
Suppose, a l s o , Θ1 = R - { θ Q } and v is dominated by Lebesgue measure.
Show t h a t the tests with convex acceptance regions form a minimal complete class.
[Use Theorem 4.14, 7 . 1 4 . 1 , 7.14.2, and, f o r ( i i ) , Theorem 7.14.]
7.14.4 Suppose the support of v i s a f i n i t e s e t , X. R .
(i)
Let 0 Q = {Θ Q } € hi =
Prove that φ is admissible i f and only i f there i s a closed
convex set A such t h a t φ(x) = 1 i f X (. A, f o r some face F of A.
(ii)
= 0 i f x € A0 or i f x € r . i . F
Can you formulate an analogous complete class
statement v a l i d when X i s countable and the assumptions of Theorem 6.23 are satisfied?
[ ( i ) Use Theorem 7.14, Corollary 7.10, and 7.12.2.
( i i ) Be
c a r e f u l ; the characterization i n ( i ) i s not v a l i d here, even when X = { 0 , 1 , . . . } k , and so w i l l need to be modified.] 7.14.5 Consider a 2χ2 contingency table.
(See Exercise 1.8.1.)
Two
common tests for independence of row and column effects are the likelihood 2 ratio test and the χ test, based on the values of
=
(i) (ii) (iii)
N Σ
Use Theorem 7.14 to show that the χ
test is admissible,
Is the likelihood ratio test also admissible via Theorem 7.14? Use 7.12.1 to prove both tests are admissible.
7.14.6 Show that the test with critical function φ- in Exercise 7.12.3 is admissible.
248
STATISTICAL EXPONENTIAL FAMILIES
7.16.1 Let X € Rk be N(θ, I ) . Suppose ΘQ = 0 and Θj = { θ : |θ i I > c φ
l
( x )
1
i=l,...,k}.
Consider level α tests of the form
χ
= - {t:|til
Note that φ, i s admissible.
k}
( x )
a n d
(x
=
*2 >
l
X
" {||t|Ka2}
( x )
Adjust k, c, α to provide an example where Φ2
dominates φ Ί except where π
i s extremely small.
7.16.2 Consider t h e u n i v a r i a t e l i n e a r model, as i n 7 . 1 5 . Show t h a t t h e [ L e t η € R s . L e t σ = 1/(1 + l l η l l ) and 2
usual F t e s t , 7 . 1 5 ( 1 ) , i s Bayes. μΊ = r ^ / U + l l η l l 2 ) , i=l,...,r.
i = r+l,...,s.
2
Under θ 1 a l s o l e t μ . = r)./{l
+ ||η||2),
Under ΘQ ( r e s p . Θ,) l e t η have d e n s i t y p r o p o r t i o n a l t o
(1 + U n l l 2 Γ p / 2 e x p ((
2
2 ( 1 + I In 11 ) 2 (resp.,
2
(1 + M n l l Γ
p / 2
exp( Σ ?-) r+1 2 ( 1+ l l η l Γ )
).]
(Kiefer and Schwartz (1965).) 7.16.3 Verify when r = 2 that the F test has the local optimality property described in 7.16(1).
(This is called D-optimality.)
2 2-π(0, σ ) Φ μ?
[Write
)dy ) 8μ
μ=0
and use a general form o f t h e Neyman-Pearson Lemma o r Theorem 2 . 2 1 . ] 7.16.4 Let X-,...,X. be independent gamma variables with known indices α , , . . . , α h and unknown scale parameters σ , , . . . , σ . .
Consider the problem o f
t e s t i n g the n u l l hypothesis H Q : σ . = . . . = σ. . ( I n the special case where 2 the X, /CL are χ variables r e s u l t i n g from a normal sample then t h i s i s the problem of t e s t i n g homogeneity
of variance.
(In t h i s notation the variances
TAIL PROBABILITIES
249
are α,,... ,σ. .)) Show (i) The likelihood ratio test for this problem has acceptance region
α
(Σχ.) o S = {x: ^
(1)
<_ C} ,
where
αn
=
π x^ (ii)
k Σ α, Ίι
When these distributions are written as a canonical exponential
family the null hypothesis is linear in both parameter space and expectation space.
Nevertheless, for k _> 3, the acceptance region for the likelihood
ratio test is not convex.
(Hence there is no hope of proving i t s admissibi-
l i t y via Theorem 7.14.) [(ii)
Consider k = 3 and α. Ξ α.
Consider points of the form
x = (z, z, 1) on the boundary of the acceptance region S.
Let
πx. f(x) =
^ - T - C so that f(x) = 0 for x € as. Show that for z sufficiently (Σx-jΓ large (Vf(x z ))' (D 2 f(x z ))(vf(x z )) < 0.] (iii)
The likelihood ratio test is unique Bayes, hence admissible.
H1 l e t θ. = 1/σ density |η.| Ί"
Under
= (1 + η.) where η. e R are independent variables with (1 + η?)~ α i .
has density | η p α ° " ^ ( l + η 2 ) " α ° .
Under HQ, ΘΊ = 1/σ. Ξ (1 + η 2 ) where η e R (This result is another one of many
contained in Kiefer and Schwartz (1965).) Note: admissible.
I t is not always true that a likelihood ratio test is
For an interesting counter-example see Lehmann (1959, p.338)
or Kiefer and Schwartz (1965, p.767). 7.17.1 2 Let x G R be b i v a r i a t e normal,
N(θ, I ) .
of t e s t i n g ΘQ = {0} versus Θχ = { θ : θ ^ :> 0 ,
Consider the problem
Show t h a t the 2 non-randomized level α = .05 t e s t with acceptance region {x: ||x|| <_ 5.991] is inadmissible.
| |θ| | >_ 1} .
(Can you also f i n d a better test?)
(Compare t h i s r e s u l t
250
STATISTICAL EXPONENTIAL FAMILIES
with 7.22.2 in which this test is admissible.) 7.17.2 Exercise 2.10.1 indicates a n o n t r i v i a l testing problem where ΘQ and Θ1 are contiguous and a l l tests are admissible. the
Here is an example of
same phenomenon in which the null and a l t e r n a t i v e hypotheses are sepa-
rated:
Let 1 £ m < k and l e t X = {x e R : x i = 0
or
1,
i = l , . . . , k , ΣxΊ = m}.
Let v be counting measure on X, with ί p θ ) the exponential family generated by v.
Θχ = {θ: ||θ|| 2 >_ 1}.
Let ΘQ = { 0 } ,
d e f i n i t i o n s of Θ, w i l l also s u f f i c e . ) test.
(Other more r e s t r i c t i v e
Let φ be any (possibly randomized)
Then φ is admissible. [It
is possible to use Lemma 7.13 f o r t h i s , but here i s an
easier argument. {qΓ:
The aggregate family generated by { p θ l contains k 1 where q Γ ( ) = χ Γ ( * ) and also q Γ (•) = ( ) where
ξ € X}
ξ Q = ξ(0) (jj ) l . for
I f Φ is inadmissible there exists a test Φ' better than φ Then (by c o n t i n u i t y ) φ1 must be as good as φ for
testing ΘQ versus Θ-.
testing q Γ
and (m V
1
versus { q r : ξ € X}
Σ φ'(x) < ( V
x€X
1
.
This implies φ'(x) >_ φ(x),
x € X,
Σ φ(x).]
m
xex
7.18.1 Let X,, Xp be independent gamma variables Γ ( α . , λ . ) , variables with α , , α 2 known.
i=l,2,
Consider the problem of t e s t i n g H Q : λ
versus the a l t e r n a t i v e H-:
max |1 - λ . | > ε for some given ε > 0. Ί i=l,2 any " i n t e r s e c t i o n " test with acceptance region —
= λp = 1 Show that
1
(1)
φ(x) = 0
is inadmissible.
iff
a
n
< xΊ < a i 2 ,
(See also 7.21.1.)
i = l , 2,
(0 < a . j < a i 2 < « )
[No admissible t e s t can have an
acceptance region with a sharp corner at (x^, x^) = ( a 1 2 , a 2 2 ) l i k e (1) has. See Example 2.10.]
—
TAIL PROBABILITIES
251
7.19.1 In Theorem 7.19 replace Θ* by (1)
θί* = {θΊ ε Θ", : θ, ε N or there is a set {θ' : j = 1,.. . , J } ^ N and a sequence {ζ }czQ*
with
ζ. -> θ . and
{ζ.}cz conhull ({θU U {θ,})} . [Use 1.13.2.] 7.20.1 Prove the assertion in 7.20(3). [The extreme points of {J:J ε J(ε), JθJ(dθ) = V Q } , V Q ε Θ, are the distributions in this set which are concentrated on a single point; similarly the extreme points of {J:J ε J(ε), JθJ(dθ) = 0, / ||θ||2J(dθ) = α} are two-point distributions. The extreme points of ϊ(ε) are thus points
(v, M) satisfying 7.19(2) with J
either a one- or two-point distribution, as above. The extreme points of Δ are (contained in) the set of limits as ε + 0 of these points.] 7.20.2 Prove the assertion following 7.20(4). [Let J be either a one- or twopoint distribution.] 7.20.3 Generalize the assertion following 7.20(4) to apply to the situation where Θ is a twice differentiate manifold at θ Q . [First generalize 7.20(5)!] 7.21.1 In the setting of 7.18.1 consider the problem of testing H Q : λ^ = λ 2 = 1 versus the complementary alternative H-.: λ, f 1 or λ^ t 1. Show that the intersection test 7.18.1(1) is still inadmissible.
252
STATISTICAL EXPONENTIAL FAMILIES
7.21.2 Consider the curved exponential family of Example 3.14 and 5.14. Let ΘQ = {θ Q } and Θ 1 = Θ - Θ Q - TO be specific take θ Q = θ(λ Q ) = (-1,0); i.e., λ Q = 1. One easily constructed test of θ Q |λ-λ o | > c n with c n
is that which rejects when
chosen to give the desired level of significance.
(Such a test can be constructed for any curved exponential family, and has certain asymptotic optimality properties as n -> «>.) Show that for moderately large n and the usual levels of significance this test is inadmissible; although for every n there exists a (possibly very small) level of significance for which the test of this form is admissible. [Use 5.14 and Theorem 7.21. Except for small values of n or large values of c n
the acceptance
region has a convex, but not strictly convex, form. Theorem 7.21 allows only very special admissible acceptance regions which are not strictly convex; and for appropriate values of n, c
the above acceptance region is not of this
special form.] 7.22.1 Let X 1 9 ...,X n
be independent normal variables, Xj ~ N(μ,l+μ ). Con-
sider the problem of testing HQ:μ = 0. Let ΦΊ = 1 if |X"| > 1.96...//n , = 0 otherwise; and π,(μ) = E (Φ-.). Show (i) φ j has level α = .05 and is locally unbiased (i.e., π.j(O) = 0, πη (0) > 0 ) . (Is ψ, also globally unbiased; i.e., π ^ μ ) ^ .05??) (ii) ψ, is inadmissible. [Use 7.20(5) and Theorem 7.21. Note that θ 2 = --^2 = -(2(l+μ 2 ))" 1 4-I/2 2σ H = 0.]
to show ψ ]
cannot satisfy 7.21(2) unless
(iii) Find a locally best locally unbiased level α test; i.e., the test which maximizes π"(μ) subject to π(0) = α, π'(0) = 0. Use Theorem 7.22 to verify this test is admissible. [Admissibility actually follows directly from the fact that this test is the unique locally best locally unbiased level α test, but it may be instructive to note how this test can be written in the form 7.21(2) with H = 0.] Call this test Φ 2
TAIL PROBABILITIES ((iv) Is Φ 2 unbiased??
Is Φ 2
better than ψ^??
253 If not, what is??)
(v) Generalize (i)-(iii) to arbitrary curved exponential families: Show that the locally unbiased test with parallel boundaries for the acceptance region is not locally best among locally unbiased tests unless u 2 = 0 in 7.20(5). State (convenient, frequently satisfied) conditions under which this parallel boundary test is inadmissible. 7.22.2 Let X be bivariate normal with mean θ and covariance
1. Consider the
problem of testing Θ Q = 0 versus Θ-j = {(θ^.θg): θ ^ > 0}. Consider tests 2 (χ)» a >b> c > 0 (These tests are ( ) ' ( ) ^ symmetric in (xpX^).) Show that such a test is admissible if and only if 2 2 a _> b. The same result holds if Θj = {(θ^ ,θ 2 ): θ-jθ2 > 0, θ^+θ 2 4 1}.
of the form ψ(x) = χ
2
:
APPENDIX TO CHAPTER
H. POINTWISE LIMITS OF BAYES PROCEDURES
This appendix contains a proof of Theorem 4.14, which was used to establish the complete class Theorems 4.16 and 4.24, and w i l l be used again in Chapter 7.
As already noted, this theorem has nothing in particular to
do with exponential families, but i t s proof is included here since i t is not readily accessible elsewhere.
We w i l l state and prove i t below in a convenie
form which is more general than that stated in Theorem 4.14. 4A.1
Setting Let ί p Q ( x ) : θ e 0} be any family of probability densities relativ
to
a σ-finite measure v on a measure space X,B.
(1)
p θ (x)
>0
Assume θ €Θ
x€X,
(This assumption is a c t u a l l y used only i n Proposition 4A.11 and Theorem 4A.12 Let
the a c t i o n space, A, be a closed convex subset of Euclidean space.
loss function is L: Θ x A -> [ 0 , <»). function f o r each θ e Θ.
(2)
The
Assume L ( θ , •) i s a lower semi continuous
Assume also t h a t
lim
L(θ, a )
=
°°
,
Θ6Θ
,
lla||-*» (If
A is a bounded set this is t r i v i a l l y satisfied.)
A* = A; i f cation of A. (3)
I f A is bounded l e t
A is unbounded l e t A* = A u {1} denote the one-point compactifiExtend the function L(θ, •) to A* by defining L(θ, I)
= -a
.
A randomized decision procedure on A* i s . a kernel 6( | ) f o r which 254
APPENDIX
δ(t|x)
is a Borel measure on
δ(A| )
is
255
A*,
for
x € X
(4) 8 measurable f o r each measurable s e t A ^ A *
A nonrandomized procedure i s one f o r which δ( p o i n t , δ ( x ) , f o r almost every ( v ) , x € X. kernel δ(
|x) is concentrated on a s i n g l e
Note we use the symbol 6 both f o r
| ) and f o r the r e l a t e d function δ( )
o f a l l randomized decision procedures. to A c A*, and l e t V
.
the
Let V* denote the c o l l e c t i o n
Let V c V* denote those g i v i n g mass 1
c V denote the nonrandomized procedures i n V.
As usual, the r i s k of any procedure is
(5)
R(θ, 6)
=
// L ( θ , a) 6 ( d a | x ) p ( x ) v ( d x ) θ X A*
Note t h a t R(θ, 6) may take the value °°.
(6)
R(θ, 6 ' )
<
R(θ, δ)
V
.
A procedure 6 is admissible
θ € Θ => R(θ, δ 1 )
=
R(θ, δ)
V
if
θ € Θ
.
The proof o f the main r e s u l t o f the appendix is broken down i n t o s i x main p r e l i m i n a r y steps as f o l l o w s : (i) (ii)
V* i s compact i n an appropriate topology; R(θ, •) is lower semi continuous on V*;
(iii)
δ. + δ n with δ. e 0 . I = 0 , 1 , . . . ,
(iv)
the mini max Theorem f o r f i n i t e Θ;
implies δ ; -> δ n i n measure
(v);
(v) (vi)
the closure o f the Bayes procedures i s a complete c l a s s ; and V
is a complete class when L ( θ , •) i s s t r i c t l y convex.
Formal statements of a l l
these r e s u l t s and some c o r o l l a r i e s are given below.
Complete proofs are also given f o r a l l but ( i ) f o r which the reader can consult the references c i t e d below. 4A.2
Definitions We now define the topology on V*.
the Banach space o f v i n t e g r a b l e f u n c t i o n s .
Let lχ
= L^X,
B χ , v) denote
Let C* denote the (Banach) space
256
STATISTICAL EXPONENTIAL FAMILIES
of continuous, real-valued functions on the compact set A*. For every δ € P*, f e L,, c € C* there is a number S6(f, c) = //c(a)δ(da|x) f(x)v(dx)
.
Define the topology on V* according to the convergence criterion δ •*• δ i f (2)
βΛ ( f , c)
-> β Λ ( f , c)
f ε Lv
C o
c ε C*
.
1
α
(This is a "weak" topology. The collection of sets of the following form comprise a basis for this topology: {δ € V : 3χ(f, , c •) - 3χ. (f, > c ) I < ε, δ0 € P*f
f
€ l
v
1 < i < I, 1 < j < J,
c . 6 C*,
i =l 9 . . . , 1 ,
j=l,...,J,
ε > 0 } .)
4A.3
Theorem V* is compact in the topology defined above.
Proof. above.
This theorem appears in Le Cam (1955) in a form similar to the In a somewhat more primitive form the result appears already in Wald
(1950). I t is interesting to note that this theorem is actually a special case of a result in abstract functional analysis. Theorems V.8.6 and IV.6.3
I t follows directly from
(the Riesz representation theorem) of the classic
treatise of Dunford and Schwartz (1966). For a complete, detailed proof see Farrell (1966, Appendix).
||
As has already been noted in the text, Wald's book, and the paper of Le Cam, both cited above, continue from their versions of Theorem 4A.3 and prove results similar to most of those below; but they do not e x p l i c i t l y state a version of Theorem 4A.12 which is our ultimate goal.
See especially Wald
(1950, Sections 3.5 and 3.6) and Le Cam (1955, Theorem 3.4).
APPENDIX 4A.4
257
Proposition The map R ( θ , •): V* -> [ 0 , « ] i s lower semi-continuous.
I n other
words, i f ό + 6 Q then (1)
Ίim i n f R ( θ , δj
Proof.
Let 6 α -> ό Q .
>_ R ( θ , ό Q ) ,
θ eθ
Let θ e 0 and l e t c β ( ) = m i n ( L ( θ , •)> B ) . Then
c β € C* and, f o r any 6 £ V*9 3 6 ( p θ , c β ) t R ( θ , δ) as B t °°. (2)
limα i n f R(θ, δα)
(1) follows directly from (2).
>
Thus,
limα i n f 3fi ( p 0 ,c β )
||
We will apply this proposition in roughly the following form: 4A.5 Corollary Let { Θ 1 , . . . , θ m } c Θ.
L e t Γ f <= Rm be the s e t o f a v a i l a b l e f i n i t e
r i s k points -- i . e . (1)
ff
=
{ r € Rm:
3 6 G P*,
R(θj,ό)=rj,
j=l,...,m}
.
Let rf c Rm be the s e t o f points dominated by r f -- i . e . (2)
Γf
= { r e Rm:
where (as usual) s £ r means s . <. r . ,
3 s € Γf, j=l,...,m.
s <_ r } Then f
f
i s a non-empty,
m
donvex, closed subset o f R . Notation:
I n the current context, when R(θ > ό) < <», j = l , . . . , m , we w r i t e
R ( . , δ) to denote the point r € Rm with r. = R ( θ , , 6 ) , j = l , . . . , m . Proof. Tf t Φ and r
R(θ, a Q ) = L ( θ , a Q ) < «,, Θ € Θ, so Γ f ϊ φ, and consequently Γf i s convex, so also Γ f i s convex. -> r.
Suppose r^ € Γ f , 1 = 1 9 . . . 9
Then there e x i s t δ ; e V* with R( , 6 y ) <_ r ,
1=1
Since V*
258
STATISTICAL EXPONENTIAL FAMILIES
is compact there must exist a subsequence {V} such that 6., is convergent; δ., -*- δ. Then (3)
(r)j = Hm(r^,)j > lim inf R(θj, 6^,)
by Proposition 4A.4. It follows that r € Γ.. This proves that r^ is closed.
11 Here is another useful consequence of Theorem 4A.3 and Proposition
4A.4. 4A.6 Corollary The set of admissible procedures forms a minimal complete class. ?H.ooi.
We give a proof only for the case where Θ = {θ,,...,θ } is finite.
The corollary will be applied in this form in the proof of Theorem 4A.10. (The proof for general Θ is basically similar, but involves some form of Zorn's lemma. See, e.g. Brown (1977).) Let δ Q € V*. To each 6 € V* associate the point r(δ) = r € [0, ~ ]
m
for which r. = R(θ , δ ) , j = l,...,m. Let j
(1)
J
f = {r € [0, oo] m :
r
= r (δ) for some 6 6 £>*} .
(This is the same as 4A.5(1), except for the fact that here r. = » is possible j
so that r(δ) is defined for all δ € V*9 not merely for those δ having finite risk.) Let r 6 r be a minimal point of f which dominates r(δ Q ). (2)
That is,
r <_ r(δ Q ); r <_ r and r ί r => r ί f .
(It is shown in Lemma 4A.8 that r. < °°, j = l,...,m, but that fact is not ~J
essential here.) Such a point, r, can be constructed as the limit of a sequence of points r(ό .) ξ r.
APPENDIX
259
By Theorern 4A.3 the sequence δ. has an accumulation point in V*, say δ. By Proposition 4A.4 r(δ) ± r <_ r(δ Q ).
Since r was minimal it follows
that 6 is admissible. It has thus been shown that any procedure, δ Q , is dominated by an admissible procedure, as asserted by the corollary.
||
4A.7 Proposition Let δ € P , o = 1,..., and suppose δ + δ Q in P* with δ Q € V^. Then δ (•)•»• δn( ) in measure (v). Thus there is a subsequence V for which δ^.ί ) •> θ o ( ) a.e.(v). PΛ.OOI5.
Suppose δ^ -> δ Q in P* but δ α / δ Q in measure (v). If δ Q € P*-P then
there is an a n € A, an e > 0, and a set S with v(S) > 0 such that (1)
|δ Q (x) - a Q | < ε for all
x € S
and (2)
lim sup v({x € S:
(To verify (1) and (2) is a standard
|6 (x) - a Q | > 2ε}) > 0 . but nontrivial exercise in measure theory
which uses the fact that A is a separable metric space. If δ Q € V*
then
(1) may need to be replaced by |θ Q (x)| > 1/ε and, correspondingly, the statement |δ (x)| < l/2ε would need to be substituted in (2). Similar substitutions would α
then need to be made in what follows.) Let c € C* satisfy 0 <_ c( •) ± 1, and 1
|a - a Q | i ε
0
I a - a n | _> 2ε
c(a)
and let f( ) = x s ( ) € L ] . Then (3)
β, (f, c) = v(S),
260
STATISTICAL EXPONENTIAL FAMILIES
but 3 6 (f, c) < v({x: |δ α (x) - a Q | < 2ε}) = v(S) - v({x € S:
|δ(x) - a Q | >. 2ε})
so that (4)
lim inf 3. (f,c) <_ v(S) - lim sup v({x € S: |δ (x) - a π | >. 2ε}) α α α α u < v(S) .
Taken together (3) and (4) contradict the assumption that δ ^ 6 Q in V*. contradiction shows that δ -> δQ in Ό* implies 6 + δQ in measure ( v ) .
This The
second conclusion of the proposition is a standard consequence of this. We now come to the minimax theorem.
||
In preparation for this
theorem we prove a simple lemma. 4A.8
Lemma Let Θ be f i n i t e .
Then the set of procedures having f i n i t e risks
is a complete class of procedures in Ό*.
(In other words, for every δ € V*
t h e r e i s a p r o c e d u r e δ 1 € V* w i t h R ( θ , δ 1 ) <_ R ( θ , 6 ) a n d R ( θ , δ 1 ) < °°, θ € Θ . ) Proof.
L e t a Q € A a n d A . = max { L ( θ , a j : θ e θ } <
B = { a € A: Assumption
m i n { L ( θ , a ) : θ € 0 } <_ A - } .
4A.1(2). Ap
°°.
B i s a bounded s e t b e c a u s e o f
Hence =
max { L ( θ , a ) : θ € Θ ,
a € B } < «
.
Define δ1 to s a t i s f y (1)
δ ({ao}|x)
= δ({aQ}|x) + δ(Bc|x)
δ'(A|x)
δ(A|x)
=
,
{ a Q } t A,
A c B
.
( I n words, δ 1 takes a c t i o n a Q whenever δ takes an a c t i o n o u t s i d e B.)
Then
APPENDIX ό'(B|x) a 1 ; hence R(θ, ό 1 ) £ A 2 < «. (2)
Also, by c o n s t r u c t i o n ,
<. / L ( θ , a ) δ ( d a | x ) + L ( θ , a Q ) δ ( B c | x ) B
/ L(θ, a)ό'(da|x)
± Hence R(θ, ό 1 ) £ R(θ, δ).
261
JL(Θ, a)6(da|x)
.
||
In the language of Corollary 4A.5, used below, the preceding can be interpreted as saying that the set of procedures with risk points in Γ f is a complete class. 4A.9
Theorem Let Θ be f i n i t e .
Let δQ € V* be any procedure for which
R( , δ Q ) € Γ f , and such that (1)
R( , δ 0 ) - ε
for every ε > 0.
j = l , . . . , m such that
(2)
m Σ τr.R(θ., δ n ) J U j=l J
Remark.
J
Γf
Then δQ is Bayes — i . e . there exists a prior G giving
mass π. to θ. e Θ, J
£
<
m i n f * Σ π.R(θ., δ) J 6€0 j = l J
The minimax risk — M = i n f * max {R(θ, δ):
f i n i t e by Lemma 4A.8.
.
θ € 0}
~ must be
(Also, as a consequence of Corollary 4A.5 there must
exist a minimax procedure.)
I f δQ is any minimax procedure then i t must
satisfy (1) and hence must be Bayes.
This does not yet prove that the
resulting prior 6 is least favorable -- i . e . Σ π.R(θ.* δ) >. M for a l l J
δ € Ό*.
Indeed, this need not be the case.
J
To get a least favorable prior
apply the proof of the theorem to the point with coordinates r. Ξ M, ϋ
j=l,...,m. This point need not correspond to any procedure in V*, but it is in f, and the proof of the theorem applies directly to yield {π.} such that M =
m Σ π.M £ i n f * j=l J P
m Σ π.R(θ., δ ) . J
J
This {π.} corresponds to the l e a s t J
262
STATISTICAL EXPONENTIAL FAMLIES
favorable distribution. Proof.
Γ f is a closed convex subset of R m by Corollary 4A.5. Condition (1)
implies that the point r Q = R(θ, δ Q ) lies on the boundary of Γ-. Hence there exists a nonzero vector {α.} which defines a supporting hyperplane to Γ* at r Q - i.e. (3)
Σ αj(r o )j = 1nf {ΣαjΓjΓ
r € Γf} .
Since r Q € Γ f , so also is r Q + ae^ for any unit vector e^, and a_> 0. Thus (3) yields
- J ^ o ^ ' 3 * aι{roh>
(4)
I t follows t h a t α . _> 0 , ^ = l , . . . , m .
(5)
a
^°
Let
\
=
m Σ α
Then m Σ π.(rn).
(6)
=
i n f {Σπ r :
r € Γf}
.
Furthermore, by Lemma 4A.8, f o r every δ € V* there is an r € Γ f such that r. < R ( θ . , δ ) ,
j = l , . . . , m ; so that
J "~ 3 (2), now follows from (6).
m Σ π.r. < Σπ.R(θ., δ ) .
j_2 J J ~
The desired r e s u l t ,
3 3
||
4A.10 Theorem Let BQ denote the set of Bayes procedures for priors concentrated on f i n i t e subsets of Θ.
Then BQ, the closure of BQ in P*, is an essentially
complete class. Proof.
(Note:
the following proof is written in the language of directed
sets, nets, and subnets.
See, e.g. Dunford and Schwartz (1966).
The reader
unfamiliar with these concepts, or the equivalent concept'of f i l t e r s and
APPENDIX
263
u l t r a f i l t e r s can understand the essence of the proof by considering the case where Θ is countable, for then the nets and subnets can be cooverted to ordinary sequences and subsequences.
I f X, B is Euclidean space -- as in the
exponential family situation — i t can be shown by an auxiliary argument that sequences and subsequences also can suffice for the proof, since the topology of V* has a countable basis.)
Let <5Q be any procedure.
Let A denote the collection of a l l f i n i t e subsets of Θ formed into a directed set under the obvious p a r t i a l ordering: Consider a fixed α € A;
α c 0.
α- £ ou i f α- c o u .
Consider the s t a t i s t i c a l problem
with parameter space just the f i n i t e set α.
There must exist a procedure,
call i t 6 , which is admissible in this r e s t r i c t e d problem and is a t least as good as δ Q — i . e . (1)
R(θ, ό α )
Since δ
R(θ, δ 0 )
θ Gα
is admissible in the r e s t r i c t e d problem i t s a t i s f i e s
4A.9(1) there. 6
<
condition
(The existence of 6^ is guaranteed by Corollary 4A.6.)
Hence
is Bayes with respect to a prior G concentrated on the f i n i t e set α c 0 . Let A1 = { α 1 } be a ( c o - f i n a l ) subnet of A and l e t δ e V* be such
that δ , •> δ.
(The existence of A' and δ follows from Theorem 4A.4 by standard
topological arguments.) out in A 1 .
Let ΘQ € 0.
Then α 1 => {Θ Q } for every α 1 f a r enough
Hence R(θ Q ,δ ,) <_ R(θ Q J δ Q ) for any such α 1 and, by Proposition
4A.5, R(ΘQ, δ)
<_ lim i n f R(θ Q , δ α . )
£
R(θ Q , δ Q )
.
Since ΘQ £ θ is a r b i t r a r y , this proves that δ € BQ is as good as δ Q .
Since
δ Q € V* is also a r b i t r a r y this proves BQ is an essentially complete class.
||
So far we have not assumed that L(θ, •) is s t r i c t l y convex, as is the case in the applications in Chapter 4.
We now add this assumption, which
is required for the desired complete class theorem.
264 4A.11
STATISTICAL EXPONENTIAL FAMILIES Proposition Assume
(1)
L(θ, •)
is s t r i c t l y convex on
Let δ € V*9
for each
θ € Θ
Then there is a δ1 € £>n such that
δ (. Όn.
R(θ, δ 1 )
(2)
A
<
R(θ, δ) ,
with s t r i c t inequality for some θ Q € Θ.
θ € Θ
In particular, the procedures in V
are a complete class. Proof.
I f δ € V* but δ £ V then v ( { x : δ ( U } | χ ) } ) > 0 .
θ e Θ, by 4 A . 1 ( 1 ) .
L e t a Q € A and l e t δ 1 be d e f i n e d by <S'(x) = a Q .
R(θ, δ1)
(3)
Hence R ( θ , δ) Ξ « ,
=
L ( θ , a Q ) < oo =
R(Θ,
δ) ,
Then
θ eΘ
Now, suppose 6 e V but 6 f. V . If R(θ, δ) = «> then, again, δ'(x) Ξ a Q satisfies (3). So, assume R(θ o ,δ') < » for some θ Q € Θ. Condition (1) and 4A.1(2) guarantees that for some ε > 0, A Q >^ 0 (4)
L(θ Q , a) > ε||a|| - A Q
(We leave this as an exercise on convex functions.
.
A \/ery similar r e s u l t is
proved in 5.3(3) and ( 5 ) ; see 5 . 3 ( 4 ' ) . ) Hence (5)
co >
R(Θ Q , δ)
=
/ ( / L ( θ , a ) δ ( d a | x ) ) p θ Q ( x ) v(dx)
|| δ ( d a | x ) ) p θ (x)v(dx) - AQ I t follows t h a t
(6)
f\ |a| I δ ( d a | x )
since p A (x) > 0 Define
a . e . ( v ) by 4A.1(1).
< <»
a.e.(v)
.
APPENDIX /aό(da|x)
if
265 /| |a| |ό(da|x) < °°
ό'(x) = {
(6)
aQ
otherwise
Then /L(θ, a)δ(da|x)
1
L(θ,ό'(x))
a.e.(v)
with strict inequality whenever 6( |x) is not concentrated on the point ό'(x). Since 6 (. V^ this occurs with positive probability under v — and hence by 4A.1(1) under P Q . θ o Consequently R(θ, ό 1 )
(7)
with s t r i c t inequality for ΘQ e Θ. whenever R(θ,ό) < °° .)
£
R(θ, 6)
(In fact, there is s t r i c t inequality in (7)
| |
The desired result now follows as an easy consequence. 4A.12
Theorem Assume 4A.11(1).
Then the set of pointwise limits of sequences of
procedures in BQ is a complete class. Proof.
(B
is defined in 4A.10.)
As a consequence of 4A.11(1), Jensen's inequality and 4A.1(1)
eyery procedure in BQ is non-randomized.
Also, there cannot be two non-
equivalent admissible procedures with equal risk functions, for i f δ 1 f 6 2 then (1)
(R(θ, 6λ) + R(θ, 6 2 ) ) / 2
^
R(θ,
( 6 χ + 62)/2)
with s t r i c t inequality whenever the right-hand side is f i n i t e . The theorem now follows as a direct consequence of Corollary 4A.6, Proposition 4A.11, Theorem 4A.10, and Proposition 4A.7. Here's how: Because of Corollary 4A.6 there is a unique minimal complete class.
I t is contained in V by Proposition 4A.11 and in BQ by Theorem 4A.10
and ( 1 ) , above.
I f δQ is in this minimal complete class ( i . e . , i f 6 Q is
266
STATISTICAL EXPONENTIAL
admissible)
Then, by Proposition 4A.7, δ α ( ) •* δ Q ( ) a . e . ( v )
which is the desired condition. 4A.13
€ BQ = V^ such that 6 α + δ Q € V^
there is therefore a net 6
in the topology on V*.
FAMILIES
II
Generalizations (i)
Assumption 4A.1(1) and the s t r i c t convexity assumption 4A.11(1)
are used in the proof of Theorem 4A.12 for only two purposes; namely, to guarantee that (1)
δ € BQ
=> δ € Vn ,
and that (2)
(δy
δ2
admissible;
R(θ, δ χ ) = R(θ, δ 2 ) ,
θ e Θ)
=* δ χ
=
δ2
.
I f (1) and (2) can be established separately, as is the case in some of the applications in Section 7 to the theory of hypotheses t e s t s , then the conclusion of Theorem 4A.12 remains valid without 4A.1(1) and 4A.11(1). (ii)
There is not much hope for something l i k e the conclusion of
Theorem 4A.12 unless (1) and (2) are s a t i s f i e d .
However, a l l the e a r l i e r
results of this appendix, through Theorem 4A.10, remain valid without the assumptions 4A.1(1) and 4A.11(1) (or (1) and ( 2 ) ) . (iii)
The remaining assumption which can be relaxed without major
alterations in the theory is the assumption 4A.1(2) on the loss function.
If
this assumption is replaced by (3)
lim L(θ, a) a-κc
=
sup { L ( θ , a ) :
a e A}
and (4)
sup {L(θ, a ) :
a e A}
<
«
then a l l results through Theorem 4A.10 remain valid with only a simple modification needed in the statement and proof of Lemma 4.8 to establish that the procedures in V are a complete class.
APPENDIX
(iv)
267
I f (3) is valid but not (4) or 4A.1(2), then a peculiar
situation may arise.
The results through Corollary 4A.7 remain valid, but i t is
then possible that there may exist admissible procedures having R(θ, 6) = °° for some θ € Θ.
When this peculiarity occurs the minimax theorem is not valid in
the strong form of Theorem 4A.9.
(There may exist admissible minimax procedures
satisfying 4A.9(1) for which no prior exists satisfying 4A.9(2).) of Theorem 4A.9 i s , however, valid. v
sequence of priors defined by {π £
A
A weaker form
Its conclusion is that there exists a , £=1,...} and corresponding Bayes procedures
ό' ) such that R(θ, δ ' ' ) + R(θ, 6 Q ) , θ £ Θ.
(The most convenient proof I know
of this fact proceeds in a somewhat roundabout fashion using a device found in Wald (1950).) (v)
Brown (1977) contains versions of Theorem 4A.3 and Proposition
4A.4 valid for some situations where i t is useful to compactify A in some fashion other than the one point compactification, A*, used above; or where the loss L depends on the observed x € X, as well as on Θ, A; or where the decision rules are restricted a priori
to l i e in some proper subset of V.
In many of
these situations i t is possible to proceed further and also establish the conclusion of Theorem 4A.10. I t is also possible to derive some satisfactory results in the (unusual) situation where A is not a Borel subset of Euclidean space, nor imbeddable as such a subset.
Such an extension involves intricacies not present
in the preceding treatment of the Euclidean case.
268
STATISTICAL EXPONENTIAL FAMILIES
Exercises 4A.2.1 Suppose A = { a Q , a - } , corresponding to a hypothesis testing problem. Φ( χ )
(a Q = "accept", a. = "reject".)
== Φ δ (x) = δ(ίa,|x})
For any procedure 6 l e t
denote the c r i t i c a l function of the test.
Then,
6η •*• 6 in the topology on V* i f and only i f φ^ •* Φδ in the weak* topology on L 00
(i.e. V
(i) f o r every f e L..
yiΦδ Cx) - Φδ(χ)|f(χ)v(dx)
- o
(See e . g . Lehmann (1959, Section A 4 ) . )
REFERENCES
AMARI, S. (1982). Differential geometry of curved exponential families -curvature and information loss. Ann. Statist. 10, 357-385. ARNOLD, S.F. (1981). The Theory of Linear Models. Wiley: New York. BAHADUR, R.R. and ZABELL, S.L. (1979). Large deviations of the sample mean in general vector spaces. Ann. Prob. 7, 587-621. BAR-LEV, S.K. (1983). A characterization of certain statistics in exponential models whose distribution depends on a sub-vector of parameters only. Ann. Statist. 11, 746-752. BAR-LEV, S.K., and ENIS, P. (1984). Reproducibility and natural exponential families with power variance functions. Preprint. Dept. of Statistics, S.U.N.Y., Buffalo. BAR-LEV, S.K. and REISER, B. (1982). An exponential subfamily which admits UMPU tests based on a single test statistic. Ann. Statist. 10, 979-989. BARNDORFF-NIELSEN, 0. (1978). Information and Exponential Families in Statistical Theory. Wiley: New York. BARNDORFF-NIELSEN, 0. and BLAESILL, P. (1983a). Exponential models with affine dual foliations. Ann. Statist. 11, 753-769. BARNDORFF-NIELSEN, 0., AND BLAESILL, P. (1983b). Reproductive exponential families. Ann. Statist. 11, 770-782. BARNDORFF-NIELSEN, 0. and COX, D.R. (1984). The effect of sampling rules on likelihood statistics. Inter. Statist. Rev. 52. To appear. BARNDORFF-NIELSEN, 0. and COX, D.R. (1979). Edgeworth and saddle point approximation with statistical applications (with discussion). J. Roy. Statist. Soc. B 41, 279-312. 269
270
STATISTICAL EXPONENTIAL FAMILIES
BASU, D. (1955). On statistics independent of a complete sufficient statistic. Sankhya 15, 377-380. BASU, D. (1958). On statistics independent of a complete sufficient statistic,
Sankhya 20, 223-226.
BERAN, R.J. (1979). Exponential models for discrete data. Ann. Statist. 7, 1162-1178. BERGER, J.O. (1982).
Selecting a minimax estimator of a multivariate normal
mean. Ann. Statist. 10, 81-92. BERGER, J.O. (1980a). Statistical Decision Theory:
Foundations, Concepts,
Methods. Springer-Verlag: New York. BERGER, J.O. (1980b).
Improving on inadmissible estimators in continuous
exponential families with applications to simultaneous estimation of Gamma scale parameters. Ann. Statist. 8, 545-571. BERGER, J.O. (1976). Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Ann. Statist. 4, 223-226. BERGER, J.O. (1975). Minimax estimation of location vectors for a wide class of densities. Ann. Statist. 3, 1318-1328. BERGER, J.O. and HAFF, L.R. (1981). A class of minimax estimators of a normal mean vector for arbitrary quadratic loss and unknown covariance matrix. Mimeograph Series, Purdue University. BERGER, J.O. and SRINIVASAN , C. (1978). Generalized Bayes estimators in multivariate problems. Ann. Statist. 6, 783-801. BERK, R.H. (1972). Consistency and asymptotic normality of maximum likelihood estimates for exponential models. Ann. Math. Statist. 43, 193-204. BHATTACHARYA, R.N. and RAO, R.R. (1976). Normal Approximations and Asymptotic Expansions. Wiley: New York. BIRNBAUM, A. (1955). Characterizations of complete classes of tests of some multiparameter hypotheses with applications to likelihood ratio tests. Ann. Math. Statist. 26, 21-36. BISHOP, Y.M., FEINBERG, S.E., and HOLLAND, P.W. (1975). Discrete Multivariate
REFERENCES
271
Analysis. M.I.T. Press, Cambridge. BLAESILD, P. (1978). A generalization of the exponential distribution to convex sets in R k . Scand. Jour. Statist. J5, 189-194. BROWN, L.D. (1986). Information inequalities for the Bayes risk. Statistics Center Preprint, Cornell University. BROWN, L.D. (1981). A complete class theorem for statistical problems with finite sample spaces. /\nn. Statist. ^, 1289-1300. BROWN, L.D. (1980). Examples of Berger's phenomenon in the estimation of independent normal means. Ann. Statist. J3, 572-585. BROWN, L.D. (1979). A heuristic method for determining admissibility of estimators — with applications. Ann. Statist. 7_9 960-994. BROWN, L.D. (1977). Closure theorems for sequential design processes. Statistical Decision Theory and Related Topics II, Academic Press, 57-91. BROWN, L.D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Statist. j42, 855-904. BROWN, L.D. (1968). Inadmissibility of the usual estimators of scale parameters in problems with unknown location and scale parameters. Ann. Math. Statist. ^39, 29-48. BROWN, L.D. (1966). On the admissibility of invariant estimators of one or more location parameters. Ann. Math. Statist. 37, 1087-1136. BROWN, L.D. (1965). Optimal policies for a sequential decision process. J_. Soc. Indust. Appi. ^3, 37-46. BROWN, L.D., COHEN, A., and STRAWDERMAN, W. (1976). A complete class theorem for strict monotone likelihood ratio with applications. Ann. Statist. 4, 712-722. BROWN, L.D. and FARRELL, R. (1984). Complete class theorems for estimation of multivariate Poisson means and related problems. /\nn. Statist., to appear. BROWN, L.D. and FOX, M. (1974a). Admissibility of procedures in twodimensional location parameter problems. Ann. Statist. 2, 248-266.
272
STATISTICAL EXPONENTIAL FAMILIES
BROWN, L.D. and FOX, M. (1974b). Admissibility in statistical problems involving a location or scale parameter. Ann. Statist. J2, 807-814. BROWN, L.D. and HWANG, J.T. (1982). A unified admissibility proof. Stat. Dec. Theory and Related Topics, III. 1, 205-230. Academic Press: New York. BROWN, L.D., JOHNSTONE, I.M., and MACGIBBON, K.B. (1981).
Variation
diminishing transformations: a direct approach to total positivity and its statistical applications. Jour. Amer. Statist. Assoc., 76, 824-832. BROWN, L.D. and RINOTT, Y. (1985). Stochastic order relations for multivariate infinitely divisible distributions. Statistics Center Preprint, Cornell University. BROWN, L.D. and SACKROWITZ, H. (1984). An alternative to Student's t test for problems with indifference zones. Ann. Statist. ]_2, 451-469. CHERNOFF, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23, 493-507. CLEVINSON, M. and ZIDEK, J. (1977). Simultaneous estimation of the mean of independent Poisson laws. Jl. Amer. Statist. Assoc. 70, 698-705. COURANT, R. and HILBERT, D. (1953). Methods of Mathematical Physics (Translated and revised from the German original). Interscience: New York. COX, D.R. (1975). Partial likelihood. Biometrika, 62, 269-276. DARROCH, J.N., LAURITZEN, S.L., and SPEED, T.P. (1980). Markov fields and log linear interaction models for contingency tables. Ann. Statist. S9 522-539. DARMOIS, G. (1935). Sur les lois de probability a estimation exhaustive. Compt. Rend. Acad. Sci., Paris 260, 1265-1266. DE GROOT, M.H. (1970). Optimal Statistical Decisions. McGraw-Hill: New York. DEY, D.K., GHOSH, M., and SRINIVASAN, C. (1983). Simultaneous estimation of parameters under Stein's loss. Preprint.
REFERENCES
273
DIACONIS, P. and YLVISAKER, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7., 269-281. DUNFORD, N., and SCHWARTZ, J.T. (1966). Linear Operators, Part I. Interscience: New York. DYNKIN, E.B. (1951). Necessary and sufficient conditions for a family of probability distributions. Select. Transl. Math. Statist, and Prob. U 23-41. EATON, M.L. (1982). A review of selected topics in multivariate probability inequalities. Ann. Statist. Π), 11-43. EFRON, B. (1978). The geometry of exponential families. Ann. Statist. 69 362-376. EFRON, B. (1975). Defining the curvature of a statistical problem (with applications to second order efficiency); with discussion. Ann. Statist. 3, 1189-1242. EFRON, B. and TRUAX, D. (1968). Large deviations theory in exponential families. Ann. Math. Statist. 39, 1402-1424. EQUELI, S. (1983). Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Statist. _Π, 793-803. ELLIS, R.S. (1984a). Large deviations for a general class of random vectors. Ann. Prob. U, 1-12. ELLIS, R.S. (1984). Entropy, Large Deviations, and Statistical Mechanics Springer-Verlag. FARRELL, R.H. (1968). Towards a theory of generalized Bayes tests. _Ann. Math. Statist. 38, 1-22. FARRELL, R.H. (1966). Weak limits of sequences of Bayes procedures in estimation theory. Proc. Fifth Berk. Symp. Math. Statist. Prob. _1, 83-111. FEIGIN, P.D. (1981). Conditional exponential families and a representation theorem for asymptotic inference. Ann. Statist. £, 597-603. FELLER, W. (1966). An Introduction to Probability Theory and Its Applications Vol. II. Wiley: New York.
274
STATISTICAL EXPONENTIAL FAMILIES
FERGUSON, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. I, 209-230. FLEISS, J.L. (1981). Statistical Methods for Rates and Proportions, 2nd Edition. Wiley: New York. GHIA, G.D. (1976). Truncated generalized Bayes tests. Ph.D. thesis. Yale University. GHOSH, M., HWANG, J.T., and TSUI, K.W. (1983). Construction of improved estimators in multiparameter estimation for discrete exponential families (with discussion). Ann. Statist. Π_, 351-374. GHOSH, M., and MEEDEN, G. (1977). Admissibility of linear estimators in the one-parameter exponential family. Ann. Statist. J5, 772-778. GIRI, N.C. (1977). Multivariate Statistical Inference. Academic Press: New York. GIRI, N.C. and KIEFER, J. (1964). Local and asymptotic minimax properties of multivariate tests. Ann. Math. Statist. 315, 21-35. HABERMAN, S.J. (1979). Analysis of Qualitative Data, Volumes I, _Π. Academic Press: New York. HABERMAN, S.J. (1974). The Analysis of Frequency Data. University of Chicago Press: Chicago, Illinois. HAFF, L.R. (1982). Solutions of the Euler-Lagrange equations for certain multivariate normal estimation problems. Preprint. HERR, D.G. (1967). Asymptotically optimal tests for multivariate normal distributions. Ann. Math. Statist. 38, 1829-1844. HIPP, C. (1974). Sufficient statistics and exponential famlies. Ann. Statist. 2, 1283-1292. HODGES, J.L. JR. and LEHMANN, E.L. (1951). Some applications of the Cramer-Rao inequality. Proc. 2nd Berkeley Symp. Math. Statist. Prob., 13-22. HOEFFDING, W. (1965). Asymptotically optimal tests for multinomial distributions. Ann. Math. Statist. 36, 369-408.
REFERENCES
275
HUDSON, H.M. (1978). A natural identity for exponential families with applications in multiparameter estimation. Ann. Statist, j), 473-484. HWANG, J.T. (1983). Universal domination and stochastic domination-estimation simultaneously under a broad class of loss functions. Cornell Statistics Center Technical Report. HWANG, J.T. (1982). Improving upon standard estimators in discrete exponential families with applications to Poisson and negative binomial cases. Ann. Statist. 10, 857-867. JAMES, W. and STEIN, C. (1961). Estimation with quadratic loss. Proc. Fourth Berkeley Symposium Math. Statist, and Prob. l_f 311-319. JOAG-DEV, K., PERLMAN, M.D. and PITT, L.D. (1983). Association of normal random variables and Slepian's inequality. Ann. Prob. Π_, 451-455. JOHANSEN, S. (1979). Introduction to the Theory of Regular Exponential Families. University of Copenhagen, Institute of Mathematical Statistics, Lecture Notes. JOHNSON, B.R. and TRUAX, D.R. (1978). Asymptotic behavior of Bayes procedures for testing simple hypotheses in multiparameter exponential families. Ann. Statist. 6, 346-361. JOSHI, V.M. (1976). On the attainment of the Cramer-Rao lower bound. Ann. Statist. 4, 998-1002. JOSHI, V.M. (1969). On a theorem of Karl in regarding admissible estimates for exponential populations. Ann. Math. Statist. 40, 216-223. KALLENBERG, W.C.M. (1978). Asymptotic Optimaiity of Likelihood Ratio Tests in Exponential Families. Mathematical Centre Tracts, Amsterdam. KARLIN, S. (1968). Total Positivity (Volume I). Stanford University Press, Stanford, California. KARLIN, S. (1958). Admissibility for estimation with quadratic loss. Ann. Math. Statist. 29, 406-436.
276
STATISTICAL EXPONENTIAL FAMILIES
KARLIN, S. and RINOTT, Y. (1981). Total positivity properties of absolute value multinomial variables with applications to confidence interval estimates and related probabilistic inequalities. Ann. Statist, j), 1035-1049. KOOPMAN, B.O. (1936). On distributions admitting a sufficient statistic. Trans. Amer. Math. Soc. ^9> 399-409. KOUROUKLIS, S. (1984). A large deviation result for the likelihood ratio statistic in exponential families. Ami. Statist., to appear. KOZIOL, J.A. and PERLMAN, M.D. (1978). Combining independent chi-squared tests. Jour. Amer. Statist. Assoc. 7^3, 753-763. KULLBACK, S. (1959). Information Theory and Statistics. Wiley: New York. KULLBACK, S. and LEIBLER, R.A. (1951). On information and sufficiency. Ann. Math. Statist. 22, 79-86. LAURITZEN, S.L. (1984). Extreme point models in statistics. Scand. _J. Statist. Π_, 65-91. LEHAMNN, E.L. (1983). Theory of Point Estimation Wiley: New York. LEHMANN, E.L. (1959). Testing Statistical Hypotheses. Wiley: New York. LE CAM, L. (1955). An extension of Wald's theory of statistical decision functions. Ami. Math. Statist. 2(>, 69-81. LINDSAY, B.G. (1983). The geometry of mixture likelihoods, part II: the exponential family. -Ann. Statist. jΛ, 783-792. LINNIK, Y.V. (1968). Statistical problems with nuisance parameters. Amer. Math. Soc. Trans!, erf Math. Monographs, ^0. MANDELBAUM, A. (1984). Linear estimators and measurable linear transformations on a Hubert space. 1. Wahr. verw. Gebiet. 65, 385-397. MARDEN, J.I. (1983). Admissibility of invariant tests in the general multivariate analysis of variance problem. Ann. Statist. 11, 1086-1099. MARDEN, J.I. (1982a). Combining independent noncentral chi-squared or F-tests. Ann. Statist. JLO, 266-277. MARDEN, J.I. (1982b). Minimal complete classes of tests of hypotheses with multivariate one-sided alternatives. Ann. Statist. 10, 962-970.
REFERENCES
277
MARDEN, J.I. (19ζl). Invariant tests on covariance matrices. Ann. Statist, j}, 1258-1266. MARDEN, J.I. and PERLMAN, M.D. (1981). The minimal complete class of procedures when combining independent non-central F-tests. Proc. Third Purdue Symp. Dec. Theory and Related Topics. MARDEN, J.I. and PERLMAN, M.D. (1980). Invariant tests for means with covariates. Ann. Statist. S9 25-63. MARDIA, K.V. (1975). Statistics of directional data (with discussion). jJ. Roy. Statist. Soc. EL 27, 343-349. MARDIA, K.V. (1972). Statistics of Directional Data. Academic Press: London. MARDIA, K.V. (1970). Families of Bivariate Data. Griffin: London. MARSHALL, A.W. and OLKIN, I. (1979). Inequalities: Theory of Marjorization and Its Applications. Academic Press: New York. MATTHES, T.K. and TRUAX, D.R. (1967). Tests of composite hypotheses for the multivariate exponential family. Ann. Math. Statist. _38, 681-697. MEEDEN, G. (1976). A special property of linear estimates of the mean. Ann. Statist. 4, 649-650. MEEDEN, G., GHOSH, M. and VARDEMAN, S. (1984). Some admissible nonparametric and related finite population sampling estimators. To appear in Ann. Statist. MORRIS, C.N. (1983). Natural exponential families with quadratic variance functions: statistical theory. Ann. Statist. U9
515-529.
MORRIS, C.N. (1982). Natural exponential families with quadratic variance functions. Ann. Statist. K), 65-80. MUIRHEAD, R.J. (1982). Aspects of Multivariate Statistical Theory. Wiley: New York. NEVEU, J. (1965). The Mathematical Foundations of the Calculus of Probability. Holden-Day: San Francisco. NEY, P. (1983). Dominating points and the asymptotics of large deviations for random walk on R d . Ann. Prob. 11, 158-167.
278
STATISTICAL EXPONENTIAL FAMILIES
NEYMAN, J. (1938). On statistics the distribution of which is independent of the parameters involved in the original probability law of the observed variables. Stat. Res. Mem. _Π, 58-59. OOSTERHOFF, J. (1969). Combination of One-Sided Statistical Tests. Mathematisch Centrum, Amsterdam. PATIL, G.P. (1963). A characterization of the exponential type distributions. Biometrika 50, 205-207. PING, C. (1964). Minimax estimates of parameters of distributions belonging to the exponential family. Chinese Math. 5, 277-299. PITMAN, E.J.G. (1936). Sufficient statistics and intrinsic accuracy. Proc. Camb. Phil. Soc. 32, 567-579. RAIFFA, H. and SCHLAIFER, R. (1961). Applied Statistical Decision Theory. Harvard University: Boston. ROCKAFELLAR, R.T. (1970). Convex Analysis. Princeton University Press: Princeton, New Jersey. SACKS, J. (1963). Generalized Bayes solutions in estimation problems. Ann. Math. Statist. 3[4, 751-768. SAW, J.G. (1977). On inequalities in constrained random variables. Comm. Statist. Theor. Meth. A6(13), 1301-1304. SCHWARTZ, R. (1967). Admissible tests in multivariate analysis of variance. Ann. Math. Statist. 38, 698-710. SIMON, G. (1973). Additivity of information in exponential family probability laws. ^. Amer. Statist. Assoc. j>8, 478-482. SIMONS, G. (1980). Sequential estimators and the Cramer-Rao lower bound. J!. Statist. Planning and Inference 4, 67-74. SOLER, J.L. (1977). Infinite dimensional-type statistical spaces (Generalized exponential families). Recent Developments in Statistics (edited by J.R. Barra, et al.). North Holland Publishing Co.: Amsterdam. SRINIVASAN, C. (1984). On estimation of parameters in a curved exponential family with applications to the Galton-Watson process. Technical report, University of Kentucky.
REFERENCES
279
SRINIVASAN, C. (1981). Admissible generalized Bayes estimators and exterior boundary value problems. Sankhya 43, 1-25. STEIN, C. (1973). Estimation of the mean of a multivariate normal distribution. Proc. Prague Symp. on Asymptotic Statist., 345-381. STEIN, C. (1956a). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Third Berkeley Symp. Math. Statist. Prob. 19 197-206. STEIN, C. (1956b). The admissibility of Hotellings T2-test. Ann. Math. Statist. 27., 616-623. SUNDBERG, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scand. Jk Statist. _1, 49-58. VAN ZWET, W.R., and OOSTERHOFF, J. (1967). On the combination of independent test statistics. Ann. Math. Statist. 38, 659-680. WALD, A. (1950). Statistical Decision Functions. Wiley: New York. WIJSMAN, R.A. (1973). On the attainment of the Cramer-Rao lower bound. Ann. Statist. 1, 538-542. WIJSMAN, R.A. (1958). Incomplete sufficient statistics and similar tests. Ann. Math. Statist. .29, 1028-1045. WOODROOFE, M. (1978). Large deviations of likelihood ratio statistics with applications to sequential testing. Ami. Statist. (6, 72-84.
INDEX
Absolute continuity 99, 102
Chebyshev's inequality
Admissibility 96, 169, 220, 223-226, 231-232, 237, 247, 256
Chi-squared distribution
208 28
(see also gamma distribution) Affine: projection 10 Complete class 107, 110, 220, 224, 22ί transformation 7 230-231, 237, 256, 259 Aggregate exponential family 191-194, 203-206, 250 Completeness 42, 96, 230, 232 Analyticity
38
(see also bounded completeness)
Asymptotic: normality 172-173 optimality 252
Conditional: distribution 21, 30 dominating measure 25
Bayes: acceptance region 40 procedure 262, 263 risk 97 test 69, 226, 246
Conical set 233, 251
(see also generalized Bayes) Behrens-Fisher problem
68
Bessel function 77 Beta distribution
60, 150
Bhattacharya inequality Binomial distribution 136, 203
126 60, 76, 135,
Bounded completeness 61 Canonical exponential family (see standard exponential family) Cauchy-Schwarz inequality 91, 124 Censored exponential distribution 83-84, 163-165
Conjugate prior 90, 106, 112, 116, 161 Contingency table 27, 30, 67, 158, 17" 247 Contiguous alternative 232, 250 Continuity Theorem for Laplace transforms 48-53 Convex: dual 178 hull 2 polytope 50 support 2 Convolution 61 Cramer-Rao inequality inequality)
(see informatioi
Critical function 220, 269 Cumulant generating function 1, 31, 38, 71, 145
Characteristic function 42 280
INDEX
281
Curved exponential family 81, 83, Finite: parameter space 256, 261, 86-89, T26, 165, 173, 233, 252, 262 253 sample space 193 support 107, 149, 263, 266 (see also differentiate subfamily) Fisher information 169 Difference operator 105 Different!able: manifold 30, 160 subfamily 81, 89, 92 165, 172, 173 (see also curved exponential family)
Fisher-Von Mises distribution 76-78, 150, 204 Fourier-Stieltjes transform 42 F-test (Snedecor) 225, 226, 248
Dirichlet: distribution 64, 167, 168 process 167, 168
Fundamental equation 160, 186-187
Discrete exponential family 105, 136, 203
Gamma distribution 18, 47, 60, 68, 76, 132, 133, 136, 210, 248, 250
D-optimality 226, 248
(see also matrix gamma distribution)
E-M algorithm 171
Gal ton-Watson process 89
Entropy 174, 190, 212, 240
Generalized: Bayes estimator 40, 90, 105-107, 110, 112, 140, 141 least square estimate 170
Equicontinuity 52 Essentially smooth 71 Expectation parameter 112, 120, 124, 137-138, 141, 142
General linear model 29, 30, 158, 170, 224, 225 Geometric distribution 27
Exponential distribution (see censored exponential distribution, gamma distribution)
(see also negative binomial distribution)
Exponential family: canonical parameter Green's Theorem
101, 112
(see natural parameter)
Hardy-Weinberg distribution 12, 157, 188
convexity property of 19 dimension of 6 full 2, 16, 80 order of 16
Holder's inequality 19
(see also curved exponential family, differentiate subfamily, discrete exponential family, mean value parameter, natural parameter, regular exponential family, standard exponential family, steep family, and specific families of distributions such as normal distribution, Poisson distribution, etc.) Exponentially fast 208 Face of convex set 192 Fatou's lemma 20, 182, 215
Homogeneity of variance 248 Homomorphism 74, 86 Hotelling's T 2 test 226 Hunt-Stein Theorem 227 Hyperbolic secant distribution 61 Inadmissibility 112, 135, 137, 142, 244, 249, 251, 253
282
INDEX
Information: matrix 93, 124 inequality 90, 94, 97, 105, 124, 125, 130 (see also Fisher information, Kυllback-Lei bier information)
Martingale 88 Matrix: gamma distribution 30, 62, 64 normal distribution 29 Maximum likelihood estimator (M.L.E.) 70, 135, 144, 152, 172-173, 177, 186, 195
Independence: in contingency tables 27, 30, 67, 171, 247 mutual 44, 63
McNemar's test 67
Independent observations
Mean value parameter 70, 75, 150
17, 166
Method-of-moments
Infinite divisibility 61 Inverse Gaussian distribution
72, 85
149
Minimax 103, 169, 256, 260
James-Stein estimator 40, 90, 103, 112, Minimal: entropy parameter 184 complete class (see 132 complete class) Karlin's Theorem 90, 95, 112, 127, 139, exponential family 2, 5, 72, 142 74, 79, 84, 145, 149, 161 Kronecker product 30
Mixed parametrization
79, 243
Kullback-Leibler information 174, 175, Moments 34-38, 50 177, 185, 190, 212 Monotone likelihood ratio 58 (see also entropy) Monotonicity 134 Large deviations 211, 214, 239-240 Multinomial distribution 4, 11, 27, Legendre transformation 179 168, 203 Likelihood ratio test 255, 247, 249 Linear estimator 90, 95 Locally finite measure 48 Local optimality
226
Log linear model
11, 171
(see also contingency table, HardyWeinberg distribution, multinomial distribution) Log Laplace transform
157, 160
(see also cumulant generating function) Lower semi-continuity 19, 75, 145, 179, 184, 215, 256, 258 Marginal distribution
8, 64, 170
Markov chain 28 Markov stopping time 88
(see also binomial distribution, loglinear models) Multivariate: beta distribution 64 linear model (see general linear model) normal distribution (see normal distribution) Natural parameter
1, 26, 45, 76, 106
Nearly convex 246, 247 Negative binomial distribution
27, 60, 106
Newton Raphson algorithm 171 Neyman-Pearson lemma 248 Normal distribution 36, 47, 60, 76, 108, 116, 132, 134, 138, 170, 218, 244, 245, 249, 252 Odds ratio 31, 135
INDEX
283
Partial order 57
Unbiased test 61
Poisson: distribution 60, 76, 106, 135, 136, 137, 141, 203 process 88
Uniform continuity 49
Polyhedral convex set 197, 205
Uniform distribution 77, 169 Uniqueness (for Laplace transform) 42, 63
Quadratic variance function (QVF) 60 Upper semi-continuity Random effects model
148
171
Von Mises distribution (see FisherRegular exponential family 2, 22, 70, Von Mises distribution) 79 Weak convergence 48, 51, 257, 269 Regularly strictly convex 145, 147, 179, 182, 203 Wishart distribution 30 Relative interior 192
(see also Matrix gamma distribution)
Schur convexity 59 Sign change preserving
55, 66
Similar test 61 Slepian's inequality 69 Squared error loss 95, 97, 103, 109, 134 Stable distribution 72 Standard exponential family 1, 35, 42, 43, 92, 166, 223 Statistical curvature 82, 86, 88 Steep family 70, 71, 75, 79, 145, 147, 149, 161, 169, 175, 180, 190, 208 Stein's unbiased estimate 90 Stratum 87, 88, 139-140, 172 Strongly reproductive
18
Student's t-test 218, 244 Sufficient statistic
13, 17, 27, 185
Support (of measure) 2, 191 Tight (sequence of measures) 49 Total positivity 53, 55 Truncated (loss function) 97-99