Institute of Mathematical Statistics LECTURE NOTES-MONOGRAPH SERIES
Shanti S. Gupta, Series Editor
Volume 9

Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory

Lawrence D. Brown
Cornell University

Institute of Mathematical Statistics
Hayward, California

Institute of Mathematical Statistics Lecture Notes-Monograph Series
Series Editor, Shanti S. Gupta, Purdue University

The production of the IMS Lecture Notes-Monograph Series is managed by the IMS Business Office: Nicholas P. Jewell, Treasurer, and Jose L. Gonzalez, Business Manager.

Library of Congress Catalog Card Number: 87-80020
International Standard Book Number 0-940600-10-2
Copyright © 1986 Institute of Mathematical Statistics
All rights reserved
Printed in the United States of America
To my family for their love and understanding
PREFACE
I first met exponential families as a beginning graduate student. The previous summer I had written a short research report under the direction of Richard Bellman at the RAND Corporation. That report was about a dynamic programming problem concerning sequential observation of binomial variables. Jack Kiefer read that report. He conjectured that the properties of the binomial distribution used there were properties shared by all "Koopman-Darmois" distributions. (This is a name sometimes used for exponential families, in honor of the authors of two of the pioneering papers on the topic. See Koopman (1936), and Darmois (1935), and also Pitman (1936).) Jack suggested that I recast the paper into the Koopman-Darmois setting. That suggestion had two objectives. One was the hope that viewing the problem from this general perspective would lead to a clearer understanding of its structure and perhaps a simpler and better proof. The other objective was the hope of generalizing the result from the binomial to other classes of distributions, for example the Poisson and the gamma. (The resulting manuscript appeared as Brown (1965).) These two objectives of clearer understanding and of possible generalization in statistical applications are the motivation for this monograph. Many if not most of the successful mathematical formulations of statistical questions involve specific exponential families of distributions such as the normal, the exponential and gamma, the beta, the binomial and the multinomial, the geometric and the negative binomial, and the Poisson among others. It is often informative and advantageous to view these mathematical formulations
from the perspective of general exponential families. These notes provide a systematic treatment of the analytic and probabilistic properties of exponential families. This treatment is constructed with a variety of statistical applications in mind. This basic theory appears in Chapters 1-3, 5, 6 and the first part of Chapter 7 (through Section 7.11). Chapter 4, the latter part of Chapter 7, and many of the examples and exercises elsewhere in the text develop selected statistical applications of the basic theory. Almost all the specific statistical applications presented here are within the area of statistical decision theory.
However, as suggested above,
the scope of application of exponential families is much wider yet. They are, for further example, a valuable tool in asymptotic statistical theory. The presentation of the basic theory here was designed to be also suitable for applications in this area. Exercises 2.19.1, 5.15.1-5.15.4 and 7.5.1-7.5.5 provide further background for some of these applications. Efron (1975) gives an elegant example of what can be done in this area. Some earlier treatments of the general topic have proved helpful to me and have influenced my presentation, both consciously and unconsciously. The most important of these is Barndorff-Nielsen (1978). The latter half of that book treats many of the same topics as the current monograph, although they are arranged differently and presented from a different point-of-view. Lehmann (1959) contains an early definitive treatment of some fundamental results such as Theorems 1.13, 2.2, 2.7 and 2.12. Rockafellar (1970) treats in great detail the duality theory which appears in Chapters 5 and 6. I found Johansen (1979) also to be useful, particularly in the preparation of Chapter 1. The first version of this monograph was prepared during a year's leave at the Technion, Haifa, and the second was prepared during a temporary appointment at the Hebrew University, Jerusalem.
I wish to express my gratitude to both
those institutions and especially to my colleagues in both departments for their hospitality, interest, and encouragement.
I also want to acknowledge
the support from the National Science Foundation which I received throughout the preparation of this manuscript. I am grateful to all the colleagues and students who have heard me lecture on the contents or have read versions of this monograph. Nearly all have made measurable, positive contributions. Among these I want to specially thank Richard Ellis, Jiunn Hwang, Iain Johnstone, John Marden, and Yossi Rinott who have particularly influenced specific portions of the text, Jim Berger who made numerous valuable suggestions, and above all Roger Farrell who carefully read and critically and constructively commented on the entire manuscript. The draft version of the index was prepared by Fu-Hsieng Hsieh. Finally, I want to thank the editor of this series, Shanti Gupta, for his gentle but persistent encouragement which made an important contribution to the completion of this monograph.
TABLE OF CONTENTS

CHAPTER 1. BASIC PROPERTIES
   Standard Exponential Families
   Marginal Distributions
   Reduction to a Minimal Family
   Random Samples
   Convexity Property
   Conditional Distributions
   Exercises

CHAPTER 2. ANALYTIC PROPERTIES
   Differentiability and Moments
   Formulas for Moments
   Analyticity
   Completeness
   Mutual Independence
   Continuity Theorem
   Total Positivity
   Partial Order Properties
   Exercises

CHAPTER 3. PARAMETRIZATIONS
   Steep Families
   Mean Value Parametrization
   Mixed Parametrization
   Differentiable Subfamilies
   Exercises

CHAPTER 4. APPLICATIONS
   Information Inequality
   Unbiased Estimates of the Risk
   Generalized Bayes Estimators of Canonical Parameters
   Generalized Bayes Estimators of Expectation Parameters; Conjugate Priors
   Exercises

CHAPTER 5. MAXIMUM LIKELIHOOD ESTIMATION
   Full Families
   Non-Full Families
   Convex Parameter Space
   Fundamental Equation
   Exercises

CHAPTER 6. THE DUAL TO THE MAXIMUM LIKELIHOOD ESTIMATOR
   Convex Duality
   Minimum Entropy Parameter
   Aggregate Exponential Families
   Exercises

CHAPTER 7. TAIL PROBABILITIES
   Fixed Parameter (Via Chebyshev's Inequality)
   Fixed Parameter (Via Kullback-Leibler Information)
   Fixed Reference Set
   Complete Class Theorems for Tests (Separated Hypotheses)
   Complete Class Theorems for Tests (Contiguous Hypotheses)
   Exercises

APPENDIX TO CHAPTER 4. POINTWISE LIMITS OF BAYES PROCEDURES

REFERENCES

INDEX
CHAPTER 1. BASIC PROPERTIES
STANDARD EXPONENTIAL FAMILIES

1.1 Definitions (Standard Exponential Family): Let ν be a σ-finite measure on the Borel subsets of R^k. Let

(1)  N = N_ν = {θ : ∫ e^{θ·x} ν(dx) < ∞} .

Let

(2)  λ(θ) = ∫ e^{θ·x} ν(dx)

(Define λ(θ) = ∞ if the integral in (2) is infinite.) Let ψ(θ) = log λ(θ), and define

(3)  p_θ(x) = exp(θ·x - ψ(θ)) ,   θ ∈ N .

Let Θ ⊆ N. The family of probability densities {p_θ : θ ∈ Θ} is called a k-dimensional standard exponential family (of probability densities). The associated distributions

P_θ(A) = ∫_A p_θ(x) ν(dx) ,   θ ∈ Θ

are also referred to as a standard exponential family (of probability distributions). N is called the natural parameter space. ψ has many names. We will call it the log Laplace transform (of ν) or the cumulant generating function. θ ∈ Θ is sometimes referred to as a canonical parameter, and
x ∈ X is sometimes called a canonical observation, or value of a canonical statistic. The family is called full if Θ = N. It is called regular if N is open, i.e. if N = N°, where N° denotes the interior of N, defined as int N = ∪{Q : Q ⊆ N, Q is open}. As customary, let the support of ν, supp ν, denote the minimal closed set S ⊆ R^k for which ν(S^comp) = 0. Let

(4)  H = convex hull (supp ν) = conhull (supp ν) ,

and let K = K_ν = H̄ (the closure of H). K is called the convex support of ν. (The convex hull of a set S ⊆ R^k is the set {y : ∃ {x_i} ⊆ S, {α_i}, 0 ≤ α_i, Σα_i = 1, with y = Σ α_i x_i}.)

For S ⊆ R^k the dimension of S, dim S, is the dimension of the linear space spanned by the set of vectors {(x_1 - x_2) : x_1, x_2 ∈ S}. A k-dimensional standard family is called minimal if

(5)  dim N = dim K = k .

Note that if K is compact then N = R^k, so that the family is regular.

(The exponential families described above can be called finite dimensional exponential families. Various writers have recently begun to investigate infinite dimensional generalizations. See Soler (1977), Mandelbaum (1983), and Lauritzen (1984) for some results and references.)

Standard exponential families abound in statistical applications. Often a reduction by sufficiency and reparametrization is, however, needed in order to recognize the standard exponential family hidden in specific settings. Here are two of the most fruitful examples.
1.2 Example (Normal samples): Let Y_1,...,Y_n be independent identically distributed normal variables with mean μ and variance σ². Thus, each Y_i has
density (relative to Lebesgue measure)

(1)  φ_{μ,σ²}(y) = (2πσ²)^{-1/2} exp(-(y - μ)²/2σ²)

and cumulative distribution function Φ_{μ,σ²}. Consider the statistics

Ȳ = n^{-1} Σ Y_i ,   S² = n^{-1} Σ (Y_i - Ȳ)² ,
X_1 = Ȳ ,   X_2 = n^{-1} Σ Y_i² = S² + Ȳ² .

The joint density of Y = (Y_1,...,Y_n) can be written in two distinct revealing ways, as

(2)  f_{μ,σ²}(y) = (2πσ²)^{-n/2} exp(-n s²/2σ² - n(ȳ - μ)²/2σ²) ,

or as

(3)  f_{μ,σ²}(y) = (2πσ²)^{-n/2} exp((nμ/σ²)x_1 + (-n/2σ²)x_2) exp(-nμ²/2σ²) .

From the first of these one sees that Ȳ and S² are sufficient statistics. (One can also derive from this expression that Ȳ and S² are independent (see Sections 2.14 - 2.15) with Ȳ being normal mean μ, variance σ²/n, and V = S² being (σ²/n)χ²_{n-1} -- i.e. having density

(4)  f(v) = (n/2σ²)^{m/2} (Γ(m/2))^{-1} v^{(m/2 - 1)} exp(-nv/2σ²) χ_{(0,∞)}(v)

with m = n-1.)

X = (X_1, X_2) is also sufficient. This can be seen from the factorization (3), or from the fact that X is a 1-1 function of (Ȳ, S²). Let ν denote the marginal measure on R² corresponding to X -- i.e. ν(A) = (2π)^{-n/2} ∫_{X^{-1}(A)} dy_1 ... dy_n. (It can be checked that when n ≥ 2, ν(dx) = n^{n/2} (2^{n/2} π^{1/2} Γ((n-1)/2))^{-1} (x_2 - x_1²)^{(n-3)/2} dx over the region K = {(x_1, x_2) : x_1² ≤ x_2}; when n = 1, ν is supported on the curve {(x_1, x_2) : x_1² = x_2}.) Then the density of X relative to ν is

(5)  p_{θ_1,θ_2}(x) = exp(θ_1 x_1 + θ_2 x_2 - ψ(θ))

with

θ_1 = nμ/σ² ,   θ_2 = -n/2σ² ,

and ψ(θ) = -θ_1²/4θ_2 - (n/2) log(-2θ_2/n). Thus the distributions of the sufficient statistic form a 2 dimensional exponential family with canonical parameters (θ_1, θ_2) related to the original parameters as above. This family is minimal. The natural parameter space is

N = {(θ_1, θ_2) : θ_1 ∈ R, θ_2 < 0} .
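The reparametrization in Example 1.2 is easy to verify numerically. The following short Python sketch is not part of the original text; the function names are ours. It checks that, for a fixed sample, log f_{μ,σ²}(y) - (θ·x - ψ(θ)) does not depend on (μ, σ²), which is exactly the exponential-family factorization (3)-(5) with θ = (nμ/σ², -n/2σ²).

```python
import numpy as np

def canonical_params(mu, sigma2, n):
    # theta1 = n*mu/sigma^2, theta2 = -n/(2*sigma^2), as in Example 1.2
    return np.array([n * mu / sigma2, -n / (2.0 * sigma2)])

def psi(theta, n):
    # psi(theta) = -theta1^2/(4*theta2) - (n/2)*log(-2*theta2/n)
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) - (n / 2.0) * np.log(-2.0 * t2 / n)

def log_joint_normal(y, mu, sigma2):
    # log of the joint N(mu, sigma2) density of the whole sample
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=10)
n = y.size
x = np.array([y.mean(), np.mean(y**2)])     # sufficient statistic X = (X1, X2)

for mu, sigma2 in [(0.0, 1.0), (1.5, 0.7), (-2.0, 3.0)]:
    theta = canonical_params(mu, sigma2, n)
    remainder = log_joint_normal(y, mu, sigma2) - (theta @ x - psi(theta, n))
    print(round(float(remainder), 10))      # same value for every (mu, sigma2)
```

The common remainder, -(n/2) log 2π, is the part of the joint density that does not involve the parameters and is absorbed into the dominating measure ν.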
The above can of course be generalized to multivariate normal distributions. See Example 1.14.

1.3 Example (Multinomial distribution): Let X = (X_1,...,X_k) be multinomial (N, π) -- that is

Pr{X = x} = (N!/(x_1! ⋯ x_k!)) Π π_i^{x_i} .

Let ν be the measure concentrated on the set {x : x_i ≥ 0 integers, i = 1,...,k, Σ_{i=1}^k x_i = N}, and given by

(1)  ν({x}) = N!/(x_1! ⋯ x_k!) .

Then the density of X relative to ν is

(2)  p_θ(x) = exp(Σ_{i=1}^k θ_i x_i - ψ(θ))

where

(3)  θ_i = log π_i ,   i = 1,...,k ,

and

(4)  ψ(θ) = N log(Σ_{i=1}^k e^{θ_i}) .

This is a k dimensional exponential family with canonical statistic X. Its canonical parameter is related to π by (3). It has parameter space

(5)  Θ = {(log π_i) : 0 < π_i, Σπ_i = 1} .
Note that this exponential family is not full. The full family has densities {p_θ} as above with Θ = N = R^k. (For Θ as in (5), ψ(θ) ≡ 0; however ψ as defined in (4), rather than ψ ≡ 0, is the appropriate cumulant generating function, as defined in 1.1(3), for the full family.) However

(6)  p_θ = p_{θ + a1}   for all a ∈ R ,

where 1' = (1,...,1). Hence expanding this family to be a full family does not introduce any new distributions.

The above phenomenon is related to the fact that the above family is not minimal, since dim K = k-1 < k. To reduce to a minimal family let X* ∈ R^{k-1} be given by X* = (X_1,...,X_{k-1}). Then X* is sufficient. (In fact, it is essentially equivalent to X since X_k = N - Σ_{i=1}^{k-1} X*_i a.e.(ν).) Let θ* ∈ R^{k-1} be given by θ*_i = θ_i - θ_k, and let ν*({x*}) = N!/(x*_1! ⋯ x*_{k-1}! (N - Σ_{i=1}^{k-1} x*_i)!). Then the density of X* relative to ν* is

(7)  p*_{θ*}(x*) = exp(Σ_{i=1}^{k-1} θ*_i x*_i - ψ*(θ*))

where

(8)  ψ*(θ*) = N log(1 + Σ_{i=1}^{k-1} e^{θ*_i}) .

This is a full minimal standard exponential family with N = R^{k-1}. Note that

(9)  π_i = exp(θ*_i)/(1 + Σ_{j=1}^{k-1} exp(θ*_j)) ,   i = 1,...,k-1 ,
     π_k = 1/(1 + Σ_{j=1}^{k-1} exp(θ*_j)) .

Here, each different θ* ∈ R^{k-1} = N corresponds to a different distribution.
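For the reduced multinomial family it is often convenient to pass back and forth between θ* and π using 1.3(9). Below is a minimal Python sketch (ours, not the book's) of the two maps, together with a check of the identity ψ*(θ*) = -N log π_k implied by (8) and (9).

```python
import numpy as np

def pi_from_theta_star(theta_star):
    # 1.3(9): pi_i = exp(theta*_i)/(1 + sum_j exp(theta*_j)); pi_k = 1/(1 + sum_j ...)
    e = np.exp(theta_star)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)

def theta_star_from_pi(pi):
    # inverse map, from 1.3(3): theta*_i = theta_i - theta_k = log(pi_i) - log(pi_k)
    return np.log(pi[:-1]) - np.log(pi[-1])

def psi_star(theta_star, N):
    # 1.3(8): psi*(theta*) = N * log(1 + sum_i exp(theta*_i))
    return N * np.log(1.0 + np.exp(theta_star).sum())

theta_star = np.array([0.5, -1.0, 2.0])        # k - 1 = 3 coordinates, so k = 4 cells
pi = pi_from_theta_star(theta_star)
print(pi, pi.sum())                            # a probability vector summing to 1
print(theta_star_from_pi(pi))                  # recovers theta_star exactly
print(psi_star(theta_star, N=10), -10 * np.log(pi[-1]))   # the two agree
```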
Reductions by reparametrization and sufficiency like those in the above examples are frequent in statistical applications. Together with proper choice of the dominating measure, ν, they lead to the representation of problems involving exponential families in terms of problems involving standard exponential families. This is formally explained in the next few paragraphs.

1.4 Definition: Let {F_ω : ω ∈ Ω} be a family of distributions on a probability space (Y, B). Suppose F_ω ≪ μ, ω ∈ Ω. Suppose there exist functions

C : Ω → (0, ∞)
R : Ω → R^k
T : Y → R^k     (Borel measurable)
h : Y → [0, ∞)     (Borel measurable)

such that

(1)  f_ω(y) = dF_ω/dμ (y) = C(ω) h(y) exp(R(ω) · T(y)) .

Then {F_ω} (or, {f_ω}) is called a k dimensional exponential family of distributions (or, of densities).
1.5 Proposition: Any k dimensional exponential family (1.4(1)) can be reduced by sufficiency, reparametrization, and proper choice of ν to a k dimensional standard exponential family (1.1(3)). The sufficient statistic is X = T(Y), and its distributions form an exponential family with canonical parameter θ = R(ω).

Proof: X = T(Y) is sufficient by virtue of 1.4(1) and the Neyman factorization theorem. (See e.g. Lehmann (1959), Chapter 2, Theorem 8.) Let μ*(dy) = h(y)μ(dy) and let ν(A) = μ*(T^{-1}(A)) for Borel measurable sets A ⊆ R^k. Then the induced densities of X with respect to ν exist and have the desired form 1.1(3) with θ = R(ω) and ψ(θ) = -log C(R^{-1}(θ)). (Note that if R(ω_1) = R(ω_2), then f_{ω_1} = f_{ω_2} and hence C(ω_1) = C(ω_2), so that ψ is well defined.)  ||
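As a concrete illustration of Proposition 1.5 we add the following sketch (it is not from the original text; the helper names are ours). The Binomial(n, p) family is already in the form 1.4(1) with T(y) = y, R(p) = log(p/(1-p)), h(y) = C(n, y) and C(p) = (1-p)^n, and the reduction produces the standard family with ν({x}) = C(n, x) and ψ(θ) = n log(1 + e^θ).

```python
from math import comb, exp, log

n = 7  # number of Bernoulli trials (fixed)

def binom_pmf(y, p):
    # Pr(Y = y) for Y ~ Binomial(n, p)
    return comb(n, y) * p**y * (1 - p) ** (n - y)

def standard_density(x, theta):
    # p_theta(x) = exp(theta*x - psi(theta)) relative to nu({x}) = comb(n, x)
    psi = n * log(1 + exp(theta))       # psi(theta) = -log C(p), with theta = logit(p)
    return exp(theta * x - psi)

p = 0.3
theta = log(p / (1 - p))                # theta = R(p), the canonical parameter
for y in range(n + 1):
    assert abs(standard_density(y, theta) * comb(n, y) - binom_pmf(y, p)) < 1e-12
print("binomial pmf recovered from the standard-form density")
```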
In spite of appearances the above reduction process is not really unique. Any standard exponential family can be transformed to a different, but equivalent, form by linearly transforming X and Θ with linked nonsingular affine transformations. This is described in the following proposition.

1.6 Proposition: Let {p_θ} be a k-dimensional standard exponential family. Let M be a non-singular k×k matrix and let

(1)  Z = MX + z_0 ,   φ = (M')^{-1} θ + φ_0 .

Then the distributions of Z also form a k-dimensional standard exponential family which is equivalent to the original family.

Proof: The equivalency assertion is immediate since the transformations (1) are 1-1. Furthermore, the density of Z relative to the measure ν_2 defined by ν_2(A) = ν(M^{-1}(A - z_0)) is

(2)  exp(θ'·x(z) - ψ(θ)) = exp((φ - φ_0)' M M^{-1} (z - z_0) - ψ(M'(φ - φ_0)))
     = exp(φ'·z - ψ(M'(φ - φ_0)) - φ'·z_0 - φ_0'·z + φ_0'·z_0) .

(By definition A - z_0 = {x : ∃ z ∈ A, x = z - z_0}.) Let ν_1(dz) = exp(-φ_0'·z) ν_2(dz) and ψ_1(φ) = ψ(M'(φ - φ_0)) + φ'·z_0 - φ_0'·z_0. The densities of Z relative to ν_1 are

(3)  exp{φ'·z - ψ_1(φ)} ,

which, as claimed, form a k parameter exponential family. The natural parameter space for this family is (M')^{-1} N + φ_0 and the cumulant generating function is ψ_1.  ||
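The key algebraic fact behind Proposition 1.6 is that θ'·x = (φ - φ_0)'·(z - z_0) when X and θ are transformed by the linked maps 1.6(1). A quick numerical check (again ours, with arbitrary test values) in Python:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
M = rng.normal(size=(k, k))        # a generic (almost surely non-singular) k x k matrix
z0, phi0 = rng.normal(size=k), rng.normal(size=k)
x, theta = rng.normal(size=k), rng.normal(size=k)

# linked affine transformations 1.6(1)
z = M @ x + z0
phi = np.linalg.inv(M.T) @ theta + phi0

print(np.isclose(theta @ x, (phi - phi0) @ (z - z0)))   # True: the exponents match
```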
Proposition 1.6 shows that one may apply an arbitrary affine transformation either to Θ or to X.
In this way one may assume without loss of
generality that Θ (or X) lies in a convenient position in R^k.
One application
of this process w i l l be discussed at some length in Section 3.11, and such transformations w i l l be used wherever convenient.
MARGINAL DISTRIBUTIONS The proof of Proposition 1.6 yields a statement about marginal distributions generated under linear projections by standard exponential families. The result is important in its own right, and useful in the proof of Theorem 1.8, as well. Some preliminary remarks will be helpful. Let M- : R
-». R m be a
linear map. M, is represented by an (mxk) matrix, M-., of rank m. There is then -* R " m which is orthogonally complementary to M 1 --
a linear map M 2 : R
that is, the rows of the corresponding ((k-m)χk) matrix, Mp, of rank (k-m) are orthogonal to those of M- . (The rows of M 2 can be chosen to be orthonormal, but that is not necessary here.) Let M denote the (kxk) nonsingular matrix M
Z
l kκ l mm M = (*) . If x e R then Z = Mx can be written as (/) with 1 € Λ M2 Z2 1 R , Z
Q
ς- pIΠ~K V_
Γ\
.
M
Let M = ( M ) as defined above. M
Then M
written as
(1)
1
M" = (M~, M")
where M" is (kxm), M^ is (kχ(k-m)) and
exists and can be
BASIC PROPERTIES (M") 1 M^ = 0
(2)
since M 1 and M 2 are orthogonally complementary. Let θ € Rk and φ = ( M " 1 ) ^ = ( * )θ = Γ 1 ) . Then /M"\ Φo Y (M 2 ) 2 (3)
θ'x =
θ'M^Mx
" φiZl+φ2Z2 by (2). For typographical reasons let M!" = (NT) 1 . The special case where M 1 (x 1 > ...,x k ) = (*i» ••»*,„) i s noting.
worth
Here
<4>
M
l " < W °mχ(k-m))=
M
2
= <0(k-m)χm
M
Γ
^k-mMk-m)*
=
M
2~
Somewhat more generally, if the rows of M. and M 2 are orthonormal then M
l =MΓ
M2
1.7
=
M2~
Theorem: Consider a standard exponential family. Φ Ί
Let M, : R
•%
R m and
I
θ = M'( ) as described above.
Fix φ° e M^"(W) c R k " m . Consider the
family of distributions of Z^ = M-X over the parameter space Φ.o = M'"({θ € Θ : M'"θ = φϊ}) . These form an m dimensional standard Φp * ^ ^ exponential family generated by the marginal measure defined by (5)
v 0 (A) Φ
= / , MiΔ^
exp(φϊ M x)v(dx) . t c
Z 1 The natural parameter space for this family is W Φo
= M'"({θ e N : Mi θ = φi}
The statistic 1, is sufficient for the family of densities {p θ (x) : M 2 - e = φ°} .
10
STATISTICAL EXPONENTIAL FAMILIES
Proof:
A direct proof is as easy as an appeal to Proposition 1.6.
The density
of Z relative to the appropriate dominating measure v(M~ •) is (6)
exp(θ'x - ψ(θ))
= exp(φχ
When Φ2 = Φ2 the factor exp(φ° measure, y i e l d i n g v
zχ + Φ2
z 2 - ψ(M'φ)) .
z 2 ) can be absorbed i n t o the dominating
(•) as defined in ( 5 ) .
The r e s u l t i n g family of densities
Φ
2 is the standard exponential family claimed in the statement of the theorem. (Note that (6) also provides a formula for the cumulant generating function of this family.) The assertions concerning W,o and sufficiency follow from ( 6 ) , Φ 2 with Φ2 = Φp > and the Neyman f a c t o r i z a t i o n theorem. || For the special case where IIL is as described i n ( 4 ) , one sees that for fixed exponential
θ
'θm
u+i»
tlΊe
d i s t r i b u t i o n s of Z, = ( X 1 5 . . . , X ι J form an
family. Note that the theorem does not say that the family of d i s t r i b u t i o n s
of 1. = HΛ form a standard exponential family with natural parameter φ, i f the parameter θ ranges over all
of 0 .
In f a c t such a claim is generally
false unless Θ is of dimension <m and s a t i s f i e s (7)
Θ c
{θ : M^'θ = φ°}
for some φ° € Rk~m ,
as w i l l be the case in Theorem 1.9; or (8)
Z,
and
Z2
are independent for some
θ € Θ .
( I t w i l l be seen i n the next chapter that (8) implies independence of Z, and Z 2 for a l l θ € Θ .) (8)
Remark.
The preceding theorem may be given an a l t e r n a t i v e i n t e r p r e t a t i o n .
Let L be a linear variety i n R -- that is L = x Q + V f o r some m dimensional k k linear subspace V c R . Let P : R -> L be any a f f i n e projection onto L -2 that i s , P is a f f i n e , P = P, and P is the i d e n t i t y on L. Let Q denote the orthogonal projection onto V
= {w € R
:
v'w = 0
v
v € V} .
Let
BASIC PROPERTIES
θ
(2)
€ V
Then
" '
{θ € N :
the
famll
of
y
11
d i s t r i b u t i o n s of P(X) as θ ranges over
Qθ = θ/o\ί forms an exponential
family.
To v e r i f y the above, note that there are l i n e a r isometries Sχ
:
Rm £1
L
Rk'm
Sz :
onto
V1
^
.
onto
The theorem applies to the maps M. = S"
° P ,
ϊlL = S i
statement concerning the d i s t r i b u t i o n s of M.,(X) .
° Q , and y i e l d s a
This converts d i r e c t l y to
the above statement about the d i s t r i b u t i o n s of P(X) = S ^ M ^ X ) ) over the appropriate parameter space since S, is a l i n e a r isometry, and S-. and S« are orthogonal, e t c . 1.8
EXAMPLE
(Log-linear models):
described i n Example 1.3.
Consider a multinomial (N, π) variable as
Consider the family of d i s t r i b u t i o n s f o r which the
natural parameter 1.3(3) s a t i s f i e s (1)
θ =
3 e Rm
BB + ΘQ ,
where B is a specified kxm matrix of rank m. (2)
B =
Assume, i n a d d i t i o n , that
( l k , B(2))
where 1/ = ( 1 , . . . , 1 ) and B/2\ is k x (m-1) of rank (m-1). multinomial
model.
This is a
log-linear
The name derives from the f a c t that the l i n e a r constraint
(1) can also be w r i t t e n i n the form log π = B$ where (log π). = log π. , i=l,...,k
.
Condition (2) is imposed because PQ = P f t . a 1 , as noted in 1.3(6). u+a i
Ό
Because of (2) f o r every $\^\
=
e
^2""' rJ
t h e r e
i s
d
u n i c
1ue ^
=
^1^(2)^
such that k Σ π.
(3)
Ί
=
k θ, Σ e Ί
=
1 .
Let NL = B1 and l e t M = ( M ) as i n 1.7. 1 Np Z
(l)
=
M
1
X=
B
'X
i s
a
s u f f Ί c i e n t
statistic.
Theorem 1.7 y i e l d s that
The d i s t r i b u t i o n s of Z , ^
m-dimensional exponential family with corresponding natural parameter
form an
12
STATISTICAL EXPONENTIAL FAMILIES
Mj'θ = β + B"Θ Q .
This family is not minimal since U ^ N ^ = N w.p. 1 .
As i n Example 1.3 one may reduce to an equivalent minimal family having dimension (m-1) and canonical s t a t i s t i c Z*-x = B/o\X
=
( ^ / , \ o9'"9^(l)
m^' '
Here is a famous l o g - l i n e a r model a r i s i n g i n genetics.
Suppose a
parent population contains a l l e l e s G,g a t a c e r t a i n locus, with frequency p,q = 1-p , respectively.
Under the assumptions of random mating and no
selection a generation of N offspring w i l l have genotypes GG, Gg, gg according to a multinomial d i s t r i b u t i o n with π given by
(4)
TΓj
=
p
,
π2
=
π3 = q 2 .
2pq ,
Such a multinomial d i s t r i b u t i o n is called a Hardy-Weinberg
distribution.
This corresponds to a l o g - l i n e a r model with
l
(5)
Thus, Z/.x
B =
= (2
N +
1
2
{ ° \
\
1
θ
) is a sufficient
log-linear family, and z?,x
=
statistic
log
2)
.
for the distributions of this
= 2x, + x ? is a minimal sufficient
statistic.
(This log-linear family can be imbedded in a useful way in the original multinomial family as follows: Let
(6)
Then
1
M"
/5/12 = I 1/6 \-l/12
-1/12 1/6 5/12
-l/2\ 1 ) = (M~, M") . -1/2/
Let ΦQ = ( 0 , 0 , -In2) and z^ = ( 0 , 0 , !j). Z = MX + z 0 is the canonical s t a t i s t i c
According to Proposition 1.6 for an exponential family with
corresponding canonical parameter φ = (M~ )'θ + φQ .
In terms of the original
BASIC PROPERTIES
variables z 1 = 2x^ + x 2 , etc.
13
z 2 = 2x 3 + x^9 z^ = x 2 ,
and Φ3 = (^)log(π|/4π-π 3 ) f
The log-linear family described above is therefore the family of marginal
distributions of ( z . , z«) under the r e s t r i c t i o n φ~ = 0 .
The family of
distributions corresponding to the restriction Φ3 = φ° t 0 also has a natural genetic interpretation as the distribution of a population a f t e r variable selection of genotypes.
See Barndorff-Nielsen (1978, p.123); the
generalization of this model to a m u l t i a l l e l i c locus is also described there.)
REDUCTION TO A MINIMAL FAMILY Any exponential family which is not minimal can be reduced ϊo a minimal standard family through sufficiency, reparametrization, and proper choice of v. This involves only a minor extension of the process used above in Proposition 1.5 and Theorem 1.7. This reduction is unique up to the appearance of linked affine transformations as in Proposition 1.6. Here are the details. 1.9 Theorem Any k dimensional exponential family can be reduced by sufficiency, reparametrization, and proper choice of v to an m dimensional minimal standard exponential family, for some m
Let X,θ and Z,φ denote the canonical
s t a t i s t i c and parameter for two such reductions to an m. and an m2 dimensional minimal family, respectively.
Then m, = m2 and (X,θ), (Z,φ) are related as in
1.6(1). Proof.
The reduction to a minimal standard family w i l l be performed in three
steps.
F i r s t , one may apply Proposition 1.5 to reduce to a standard k
dimensional
family. Suppose for this family that dim Θ = m1 < k.
where V is an m'-dimensional
linear subspace.
Thus θ c θ
Q
+V
One may l e t P be the orthogonal
projection on V and M,, M2 the corresponding orthonormal matrices described above in Theorem 1.7.
Then M2Θ = φ£ , a constant vector.
By Theorem 1.7,
Z 1 = M-X is s u f f i c i e n t , and i t s distributions form a standard exponential family, whose parameter space has dimension m1.
14
STATISTICAL EXPONENTIAL FAMILIES Thus it now suffices to consider the case of a standard m 1 dimension-
al exponential family whose parameter space also has dimension m 1 . Suppose for this family that dim K = m < m' . Then K c x + V , similar to the previous situation. as above.
Let P be the orthogonal projection on V, and NL, NL
Observe that
(1)
θ x = Θ'MJ M χ x + Θ ' M £ M 2 x a.e.v
I t follows that Z. = NLX is a sufficient statistic whose distributions form a standard exponential family with natural parameter M,θ.
(Actually Z is not
merely sufficient, but is actually equivalent to X under v.)
Since
dim (M..K) = dim (M-Θ) = m this family is the desired minimal family formed from the original family through reduction by sufficiency and reparametrization. Suppose {p
: ω € Ω} is a standard k dimensional exponential family
relative to v, and (X,θ), (Z,φ) denote the canonical statistics and parameters for two reductions of {p } to a minimal standard exponential family.
For the
next step l e t P^ ' , P; ' denote their respective probability distributions σ ψ with dimensions m, and nu respectively, etc..
Let ω Q € Ω. Since X and Z
are each sufficient
dP
^
9\
Jtf θϊ ) p
Jf
o
Now, HD(D
= exp(((θ(ω) - θ(ω Q )) and similarly for p' ' .
Hence (4) yields
x-
ae(v)
BASIC PROPERTIES (5)
(θ(ω) - θ(ω Q J)
15
x(y) - U ( 1 ) (θ(ω))
= (Φ(ω) - φ(ω Q ))
z(y) - U ( 2 ) (φ(ω))
a.e. (v)
for all ω € Ω. Suppose m = m 1 < m,,. Since dim {φ(ω) : ω e Ω} = m 2 > m there m+1 exist values α. € R, ω. e Ω, i = l,...,m+l, such that 0 = Σ α.(θ(ω ) - θ(ω n )) m+1 and φ* = Σ α.(φ(ω.) - Φ(ω Q )) f 0. It follows from (5) that (6) But,
φ* (6) implies /C c {z : φ*
z(y) = const
a.e. (v)
z = const} so that dim K^ < m^.
This
contradicts the f a c t that the d i s t r i b u t i o n s of Z form a minimal standard family of dimension nip.
Hence πu = πu = m.
Now choose ω 1 , . . . , ω m so that {θ(ω.j) - θ(ωQ) : m
R.
The preceding argument shows that {φ(ω.j) - φ(ωQ) : m
also span R . φίω^
i =l , . . . , m }
i = l , . . . , m } must
Let M, non-singular, be chosen so that - φ(ωQ)
=
( M ' J ' ^ θ ί ω ^ - θ(ω Q ))
i=l,...,m
.
Then, as i n 1.6(3),
(7)
(θ(ω.) - θ(ω Q ))
x(y) - U ( θ ( ω i ) )
=
(Φ(ω.) - φ(ω Q ))
Mx(y) - U(φ(ω-))
=
(Φ(ω.) - φ(ω Q ))
z(y) - U(φ(ω.))
Let y Q e K be a value for which (7) is valid for i=l,...,m. yields
span
a.e.
(v)
Then (7)
16
(8)
STATISTICAL EXPONENTIAL FAMILIES
(φ(ω.) - φ(ω Q )) =
M(x(y) - x(y Q ))
(φ(ωΊ.) - φ(ω Q ))
(z(y) - z(y Q ))
a.e. (v) i = l,...,m
This implies M(x(y) - x(y Q )) = z(y) - z(y Q ); which verifies 1.6(1) with zo=z(yQ).
1.10
||
Definition Let {p_} be a k-dimensional exponential family. u
Theorem 1.9
shows that there is a unique value, m, such that {pθ> can be reduced to a minimal exponential family of dimension m. This value is called the order of the family p. I f {p Ω } is a standard family i t is clear that its order m Ό
satisfies (1)
m <_ min(dim Θ, dim K)
In most cases equality holds in (1); however, i t is possible to have inequality, even when {p n } is f u l l .
u In view of Theorem 1.9 there is no loss of generality in confining oneself to the study of minimal standard exponential families. A full minimal standard exponential family is also called a canonical exponential family. RANDOM SAMPLES A nearly trivial but very important application of the first part of Theorem 1.9 involves independent identically distributed (i.i.d.) observations from an exponential family.
BASIC PROPERTIES
17
1.11 Theorem Let X l s . . . , X n be i . i . d . observations from some k-dimensional standard exponential family with natural parameter space M and convex support K. Then S =
n Σ X. is a s u f f i c i e n t s t a t i s t i c . i =l Ί
standard k-dimensional
The distributions of S form a
family with natural parameter space W and convex support
n/( = {s : 3 x e K, s = nx} .
The order of the families corresponding to S and
to X. are equal. Proof:
The j o i n t density of X . , . . . , X n = exp( . Σ (θ =1
Pϋf l ( Xii . - . - . x nn)
= Hence X , , . . . , X
n exp( Σ ( θ . i=l Ί
with respect to vx ... xv is x.i - ψ ( θ ) ) )
x. - ψ ( θ . ) ) ) Ί
Ί
with
θ. M Ί
are canonical s t a t i s t i c s from an nk-dimensional
.
exponential
nk family whose parameter space s a t i s f i e s Θ = { ( θ 1 , . . . , θ n ) € R
k : θ^ • θ € R } .
Applying Theorem 1.7 yields that S is s u f f i c i e n t and comes from a standard k-dimensional family with natural parameter space N and convex support nK. (All this is also obvious from the f a c t that n pβ(xΊ,...,xn) = exp(θ Σ x_. - nψ(θ)) u
1
Π
_ i
.)
I
It is easily checked that any linear map which transforms the distributions of X. to a minimal family also transforms those of S to one, and conversely. This yields the assertion concerning the order of the families corresponding to S and X. .
||
Note that the cumulant generating function for the exponential family generated by S is (1)
nψ(θ) . The sufficient statistic X = n S also has distributions from an
exponential family. (Apply Theorem 1.6.) Here, the natural parameter space
18
STATISTICAL EXPONENTIAL FAMILIES
is nW and the convex support is K. The cumulant generating function for X corresponding to the point Φ = nθ in its natural parameter space is (2)
nψ(Φ/n) . (Under appropriate additional conditions a family of distributions
for which there is a nontrivial sufficient statistic based on a sample of size n must be an exponential family. See Dynkin (1951) and Hipp (1974).) 1.12 Examples Example 1.2 displays an instance of t h i s theorem.
I f Y is normal
with mean μ and variance σ 2 then X = (Y, Y ) i s the canonical s t a t i s t i c of a minimal standard exponential family having canonical parameter θ = (μ/σ 2 , -l/2σ 2 ) .
Thus i f one has i . i . d . observations Yp
Ύη
t h e n
n n n p S = Σ X. = ( Σ Y., Σ YT) is a sufficient statistic; and its distributions Ί Ί i= l i = l i = lΊ form a minimal standard exponential family. As another example, suppose Y is a member of the gamma family with unknown index, α, and scale, σ. The density of Y relative to Lebesque measure on (0, °°) is
(l)
f(y) = (^ΓίcOΓV 01 " 11 e" y/σ ,
We w i l l use the notation Y ~ Γ(α, σ) .
y >o .
Note that Γ(m/2, 2 ) = x * .
These
d i s t r i b u t i o n s form a two-dimensional exponential family with canonical s t a t i s t i c (Y, In Y) and canonical parameters (-1/σ, α ) . w i t h density (1) then S1Ί =
n Σ Y
1=1
1
and
Sά9 =
n Σ In Y. Ί
i=l
If
γ
i>
>γn
a r e
i i d.
form a two-dimensional
exponential family. It is interesting to note that the marginal distribution of Sj/n also has a density of the form (1) with index nα and scale nσ . (Here, as well as in the preceding normal example, S. is strongly reproductive in the terminology of Barndorff-Nielsen and Blaesild (1983b). For more details see Theorem 2.14 and Example 2.15.) Another example of interest is provided by the Poisson distribution; where Y has probability function
BASIC PROPERTIES (2)
y
Pr{Y = y }
λ
= λ e~ /y!
We w i l l use the notation Y ~ P(λ) .
19 y=O,ls...
Then X = Y comes from a one-dimensional
exponential family with canonical parameter θ = In λ. n S = Σ Y. Ί i l
The d i s t r i b u t i o n of
is i t s e l f Poisson with natural parameter θ+ In n = In nλ .
CONVEXITY PROPERTY Here is an important fundamental fact about exponential families. 1.13 Theorem (i) (ii) (iii)
N is a convex set and ψ is convex on M. ψ i s lower semi-continuous on R and i s continuous on N°. PQ θ
(1)
= PΩ Θ
l
i f and only i f 2
Ψ(αθ1 + (1 - α ) θ 2 )
for some 0 < α < 1. (iv)
= oίφίθj) + (1 - α)φ(θ 2 )
In this case (1) is then valid for a l l 0 <_ α <_ 1.
I f dim K = k (in particular, i f {p Q } is minimal) then ψ is
s t r i c t l y convex on W, and PΩ θ
l
t V Θ
for any θΊ f θ 9 € W. 2
0 < α < 1.
ά
'
Woof:
Let θ,, θ 2 e W,
Then by Holder's inequality
(2)
exp(ψ(αθ1 + (1 - α)θ 2 )) = /expίtαθj + (1 - α)θ 2 )
x)v(dx)
= /(exp Θ.
x) α
1
x)v(dx)) α (/exp(θ2
(/exp(θ1
(exp θ 2
x ) ( 1 " α ) v(dx) x)v(dx)) ( 1 " α )
ί θ ^ + (1 - α)ψ(θ 2 )) This proves the convexity of ψ, and the convexity of hi follows easily. There is s t r i c t inequality in (1) unless (3)
θj
x
= θ2
x + K
(a.e. (v))
for some constant K; in which case there is equality.
(3) is equivalent to
20 θ e
STATISTICAL EXPONENTIAL FAMILIES i
# x
=
κ e
θ e
2#x
a . e . ( v ) which i s equivalent to the assertion Pfi
1
= PB
2
.
If (3) holds for some θ, ί θ 2 then dim K <_ k - 1. Hence dim K = k implies P Q t P Q for any Θ Ί + θ 9 e W. θ
Θ
l
1
2
ά
Finally, for the continuity assertions, note f i r s t that λ(θ)
= /exp(θ x)v(dx) is lower semi-continuous by Fatou's lemma.
lower semi-continuous.
Any convex function defined and f i n i t e on a convex set
W of R must be continuous on W°. sets.)
Hence ψ is
(We leave this as an exercise on convex
|| Be careful about the above result -- the fact that ψ is s t r i c t l y
convex on hi does not imply that hi is s t r i c t l y convex; for a simple example, see Example 1.2 which involves a minimal family for which
W = ί(θΊ> θ 9 ) : Θ Ί € R, θ 9 < 0} XL.
X
L.
Usually ψ is continuous on all of hi. However examples can be constructed when k >_ 2 where this is not the case. This simple theorem has an interesting direct application. 1.14
Example Let Y be m-variate normal with mean μ and covariance matrix %.
We w i l l use the notation Y - N(μ, t).
Also, ό , = 1 i f i = j and = 0 i f ij^j . •u
The density of Y with respect to Lebesgue measure is (1)
Φ y j Z ( y ) = (2πΓ m / 2 |2Γ*exp(tr(-*- Ί (y - v){y - y)72)) =
(2π)" m ^ 2 |?|"^exp((?" 1 μ)
y + tr((-£ /2)(yy')) - μ'jΓ μ/2)
It follows that the distributions of Y form an (m + m(m+l)/2) dimensional exponential family with canonical statistics Yj,
,Y m , {YΊ-Y ./(I + δ..): i <_ j}
and corresponding canonical parameters (?" μ),,...,(?" μ) > ί(-Z" ).Ί : i £ j} . For the following it is convenient to label these statistics X..,...,X , {X . : i <. j} and the corresponding parameters as (θ,,...,θ , {θ. : i <_ j )}. Write θ = (θ,,...,θ ), S£= (θ. Ί ). x
Πl
Ignoring the factor (2π)" m / ' 2 , which can be
1J
absorbed into the measure v, the cumulant generating function is
BASIC PROPERTIES
(2)
Ψ( ) =
(-^JioglZ"1) + (μ'Z"1μ)/2
=
21
(-*s)log( |-£|) - θ'βθ/2
.
Note that M = {(θ, {θ.. : i <_ j}) : -£ is positive definite} . It is easy to check that N is open, so that this family is regular. (3)
By Theorem 1.12
ψ(0, {θ j : i < j}) =
is s t r i c t l y convex in the variables { θ . . : i <_ j } over the set where Q. is positive definite.
To reinterpret this result s l i g h t l y , let B = -Q.
then (3) yields that (4)
log |B|
is s t r i c t l y concave
as a function of the variables {b. . : i <_ j } over the region where B is *J
positive definite.
(4) yields IB' 1 ]
(5)
=
|BI" 1
is s t r i c t l y convex
((4) can also be proven by directly calculating k+1 the r e s u l t i n g ( « )
. log|B|, and showing
k+1 x
( o ) matrix i s p o s i t i v e d e f i n i t e .
The above proof
is much simpler !) CONDITIONAL DISTRIBUTIONS k Let v be a given σ - f i n i t e measure on the Borel subsets of R , and P «v
a p r o b a b i l i t y measure with density p.
generality) that 0 € W so that v is f i n i t e . M^x) = M-x.
Then the conditional
w i l l be denoted by v( |NLX = z j
Assume (without loss of
measure of v given z, = M,(X) e x i s t s .
or v( | z j .
given M,(X) exists and has density proportional over the set {x : ΛMX) = z-} . is any Borel measurable f u n c t i o n .
Mj : Rk -> Rm be l i n e a r ,
Let
The conditional
distribution
to p( ) r e l a t i v e to vί
lzj)
(More generally these facts are true i f Λ^1 See, f o r example, Neveu
(1965).)
The above s i t u a t i o n resembles that described i n 1.7. M2 : Rk -> Rk"m be an orthogonal complement of My M2 :
It
ίx : Mj(x) = z χ } -* R k " m
Then
Let
of P
22
STATISTICAL EXPONENTIAL FAMILIES
is 1 - 1. We will also use the symbol v( |z.) for the equivalent conditional distribution of M 2 (X) given M χ (X) = Zy
As before,
Φ = M'-lθ = ( M M θ - (φl) It is always possible to choose M 2 to be "orthonormal" so that Ml = M' ,
and so
Mλ" = M« .
To do so simplifies somewhat the resulting formulae. 1.15 Theorem The d i s t r i b u t i o n of Z ? = NLX given Z- = M.X depends only on φ,p\
= M'"θ .
For fixed Z. = z. these d i s t r i b u t i o n s form the (k-m) dimensional
exponential family generated by the measure defined by v( |z,) . Let W z
family.
denote the natural parameter space of t h i s conditional
i
Then Φ2 € M«"W implies Φ
(1)
€ M
2
a e
MX
(v)
Furthermore, if {p } is regular then θ
(2)
Proof:
M^~N c
hlM
a.e.(v)
χ
.
The conditional density of Z2 given Z1 = z- is proportional to '
z
l
+
Φ
2 '
Z
2 "
Hence the density of Zp given Z1 = z, r e l a t i v e to v( |z..) can be w r i t t e n as (3)
pφ(z2)
= exp(φ 2
z2 - ψz ( φ 2 ) )
where (4)
ψ z (Φ 2 )
=
ln(/exp(φ 2
The natural parameter space W
z
i
z2)v(dz2|z1))
.
is the s e t { φ ? } , f o r which the ά
integral on the right of (4) is finite. Let Φ 2 € M 2 "W . There is thus a θ €
BASIC PROPERTIES for which φ 2 = M2~θ . v*(A) = v ( M ^ ( A ) ) .
°° > /exp(θ
23
Let v* denote the marginal measure on Rm defined by
Then
x)v(dx) = /{/exptφj
zχ + φ 2
z 2 )v(dz 2 |z 1 )}v*(dz 1 ) .
oo > /exp(φ 2
z 2 )v(dz 2 |z 1 )
Hence
for
almost every z..(v*) .
This v e r i f i e s ( 1 ) .
Suppose {p Q } is regular. dense subset of W.
Let { θ i : i = l , . . . , } cW
{M2~ θ Ί : i = l , . . . } is dense i n M'~N .
b e a countable
Nl'~ is a l i n e a r map.
Hence M2"M is convex and open since W is convex (by Theorem 1.13) and open (by assumption). I t follows t h a t (5)
conhull
{M£~ θ. : i = l , . . .
}
=
MιfN
.
(We leave (5) as an exercise on convex sets.) Since { θ . } is countable i t follows from (1) that M2~ θ..
c
WM
χ
for a l l
i =l , . . . ,
a.e.(v)
.
Thus M£"N since NL
χ
=
conhull ίM£" θ. : i = l , . . . } c
is convex; which proves ( 2 ) .
^
χ
a.e.(v) ,
||
The above r e s u l t can be given an alternate i n t e r p r e t a t i o n under which the conditional d i s t r i b u t i o n s of X given X ε L form an exponential f a m i l y , for
L a given linear variety i n R .
See 1.7(8).
We omit the d e t a i l s .
Here are two important simple applications of the above ideas. 1.16
Example Let X 1 9 ...,X. be independent Poisson variables with expectations
λ.
.
See 1.12(2).
Then X = ( X 1 f . . . , X j is the canonical s t a t i s t i c of a
standard exponential family with natural parameter θ: θ Ί = In λ^ k The dominating measure has v ( { x } ) = 1/ Π x , ! . i =l Ί
Let
i =l , . . . 9 k .
N > 0 be an integer.
24
STATISTICAL EXPONENTIAL FAMILIES
k Then the distributions of X given Σ1 X. = N form a standard exponential family i=l with dominating measure k
(1)
n
k
v ( { x } | Σ x . = N) = 1 / Π x i !,
for
Σ x. = N .
This measure is proportional to the measure 1.3(1) which generates the multinomial distribution.
Hence the conditional distribution is multinomial (N,π).
The value of π can be easily computed as follows:
orthogonally
project onto {θ: Σθ. = 0} which is the linear subspace parallel to {x : Σxi = N} .
This yields (θ - θl)
(where θ = k"
1
Σ θ 1 ) as the natural
parameter of the conditional multinomial distribution. Thus
with c = ( Σ e Ί'~ )
.
Substituting θ. = In λ
(2) 1.17
k = λ./ Σ λ Ί
πΊ
yields
.
Example Let X be k-variate normal with mean μ and covariance %, For t
given the distributions of X form a standard exponential family with natural parameter θ = ϊ~ μ.
(This can easily be checked directly or derived
from Example 1.14 by using Theorem 1.7.)
The dominating measure for this
family is proportional to v(dx) = exp(-x'Z~ x/2)dx. Let z, = ( x , , . . . , x ),
z 2 = (x + , , . . . , x . ).
The conditional
distributions of Zp given Z, = z, form an exponential family. parameter for this family is just Φ2 = (θ
+1
, . . . , θ . )'.
Partition t as
(1) Then
t
= Q11
12 t
)
with
i n ( m x m) , etc.
The natural
BASIC PROPERTIES
, 1
(2)
I'
Z
Z:
Z
/ * 1 1 ' 12 22 21^ = ( - l - l -l -^tZ2-t2ltntl2) tntn
25
Z
Z
Z
Z
2:
Z:
~ 11 12^ '22" 21 11 12^ > -i .1 ) Z Z Z; (?22" 21 11 12^
((2) is a general formula for block symmetric positive definite matrices. Note 12
that I
1
= -?ϊ}?12(222 " V ' l l ^ '
1
=
Z
Z
(Z
Z
Z
Ϊ
)
]
" 22 21 11 " 1 2 2 2 2 1 ^ '
N o t e
t h a t
the natural parameter can be written as
where
Consider the case where z. = 0. The conditional dominating measure is v(dz2|0)
= c e x p ( - z ^ 2 2 z2/2)
and is thus a normal density with mean 0, variance-covariance ( Z 2 2 ) " 1 = Z 2 2 - ^21^11^12
= Z
* 's a y '
l t
f o l 1 o w s
t h a t
t h e
conditional density
of Z2 given Z, = 0 is normal with this covariance matrix and with mean μ* given by t*~\* = Φ2 , since φ ? must be the value of the natural parameter for both the unconditional and conditional family. (3)
μ* = 2*φ2
Hence
= 2*(221μ(l) + Z22μ(2))
= ^21Z"lίμ(l)
+ μ
(2)
'
For z 1 t 0 i t i s convenient to use the location invariance o f the normal family.
The conditional d i s t r i b u t i o n under (μ,Z) o f l,^) given l,^ =
^(1) " z f 1 ) is the same as the conditional d i s t r i b u t i o n under ( ( ,, h t) o f Z(Z) m μ (2) given l,,\ = 0. By the preceding this is normal with covariance matrix Z* = ( Z 2 2 ) " 1 and mean μ ^ - ^ l ^ ϊ l ^ d )
" Z ( l ) ^'
26
STATISTICAL EXPONENTIAL FAMILIES
EXERCISES 1.1.1
(a) Let C be any closed convex set in R . Show that there oo
exists a standard exponential family with M = C.
[C = n { θ : v. Ί i =l
θ < c,} Ί
with ||v.|| = 1 . Let v. denote Lebesgue measure on the ray {x: x = α v . , α > 0} 1 1 00
and l e t v =
.
Λ
Σ 2" 1 exp(c.v. Ί Ί i=l
x)v./(l+||x|I )• The result is also t r u e , but Ί
harder to prove, i f C is an open convex s e t . ] (b)
Let C = {(Qv
θ 2 ) : ||θ|| 2 < 1} U { ( 0 , 1)}
and show there
exists an exponential family with hi = C. 1,2.1
Verify 1.2(5) (including the formula for v which precedes i t ) . Note
that when n = 1 the measure v can be described by the relations x ? = x-, and v(dx 2 ) = dXj/ZZ? . 1.7.1
(i)
Let Z = MX as in Theorem 1.7.
Show that Z 1 is independent of
Zp for some θ e Θ i f and only i f 1, is independent of Zp for a l l θ € Θ. (ii)
Give an example to show that the assertion is false i f Z., 1^
are non-linear transformations of X. [ ( i ) Assume independence at θ = 0. (ii)
Let X be bivariate normal with mean μ and covariance I , and Z, = ||x||,
Z2 = t a r f ^ x g / x ^ . ] 1.7.2 {?'• *1
Consider the s i t u a t i o n of Theorem 1.7. θ 6 W
e
φ
M'φ° = θ.
i s
is f u l l and minimal. f u l l #
l t
i s
mΊrΊΊmal
i f
Suppose the original family
Then the family of distributions of 1Λ for a n d
onΊ
y
i f
there is a θ e i n t W with
[For a situation where the family of distributions of 1, is not
minimal use Exercise 1 . 1 . l ( b ) , l e t M be as i n ( 4 ) , and l e t φ£ = 1.] 1.7.3
(a)
Show that i f
1.7(7) or (8) are s a t i s f i e d then the d i s t r i -
butions of Z, = M-X form a standard exponential family with natural parameter
ΦΓ (b)
Give an example to show that the distributions of Z. = M-X may
form a standard exponential family with natural parameter d i f f e r e n t from φeven when 1.7(7) and (8) f a i l .
[Consider the d i s t r i b u t i o n of X. when X is
BASIC PROPERTIES
27
multinomial of dimension k :> 3, or equivalently, of X* with X* as i n Example 1.3. There are also some other i n t e r e s t i n g instances of t h i s phenomenon.] 1.8.1
(Contingency table under independence).
Consider a 2x2 contingency
table i n which the observations are Y. ., 1 <_ i , j <_ 2, and have a multinomial (N, p) d i s t r i b u t i o n with p = {p. ., 1 <_ i , j £ 2}. independence p... = p^ +
p . where P i + = Σ P Ί j> e t c . .
Under the model of Write t h i s independence
J
model as a log-linear model in a fashion so that the coordinates of the natural (minimal) sufficient s t a t i s t i c are independent binomial variables. to the model of independence in an r*c contingency table.
Generalize
(For further log-
linear models in contingency tables, see Haberman (1974), (1979).) 1.10.1
Show that in any standard exponential family of dimension k and
order m,
m + k >_ dim K + dim Θ .
m < min(dim Θ, dim K).
Give an example in which
[The simplest example has m = 0, dim Θ = dim K = 1,
k = 2.] 1.12.1
From many points of view the negative binomial distributions are
the discrete analog to the gamma distributions.
The negative binomial,
NB(α, p), distribution has probability function
Show that f o r f i x e d α the family N8(α, •) is a one parameter exponential f a m i l y , but that -- unlike the Γ(α, σ) s i t u a t i o n -- the family N8(α, p) α>0,
0 < p < 1 i s not an exponential
1.12.2 c R
family.
Let v denote counting measure on { ( 0 , 0 ) , ( 1 , 1 ) , ( 2 , 0 ) , (3,1) , ( 4 , 0 ) , . . . } .
Show that the exponential family generated by v has the following
properties:
XΊ has a geometric distribution,
Ge(p,) = M 8 ( l , p ) ;
X2
n a s
a
binomial d i s t r i b u t i o n , B ( p 2 ) ; (X χ - X 2 )/2 has a geometric d i s t r i b u t i o n Ge(pJ and (X- - X 2 )/2 is independent of X2 natural parameters θ^, Θ2
Write p ^ p 2 , P 3 i n terms of the
28
STATISTICAL EXPONENTIAL FAMILIES Let Z 1 ,...,Z m be i.i.d. N(μ,σ 2 ).
1.12.3
Let X = Σ Z 2 . Then X has a Ί
i =l scaled non-central χ
2
distribution with m degrees of freedom, non-centrality
2 2 2 parameter 6 = mμ /σ , and scale parameter σ . Denote this distribution by 2 2 X (6, σ ). ( i ) This distribution has density 2
„ (1)
g(χ)
=
Σ^Γ-—
k=0 where λ = 6/2.
_ λ (2^)(m/2)+k-l e -x/(2σ )
k
k !
— k 2 k σ Γ(k + ψ 2
+
,
/? m / 2
,
(From the form of (1) i t i s evident that K = k ~ P(λ) and 2
2
X|K ~ Γ(k + m/2, σ ) ; thus (X/σ ) K i s central χ (ii)
x>0
2
with k + SJ degrees of freedom.)
The d i s t r i b u t i o n s of X can also be represented as the marginal d i s t r i b u t i o n
of X- from a canonical two parameter exponential family generated by a measure v supported on { ( x ^ x^): x^ > 0, x 2 = 0 , 1 , . . . } . and series expansion prove (1) f o r the case m=l.
[(i)
(1) f o r general m then follows
from facts about sums of Poisson and gamma variables, 2 measure generated by (1) with σ = 1 , λ = 1.] 1.13.1
(i)
By change of variables
(ii)
Let v be the
Show that when k = 1 then ψ must be continuous on N. [Use
1.13 and convexity of W.] (ii)
More generally, l e t ΘQ.Θ- € N and θ = (1 - ρ)θ Q + pθ 1 and
show ψ(θ ) i s continuous i n p f o r 0 <_ p <_ 1.
[Reason as i n ( i ) , or use
Theorem 1.7 and ( i ) . ] (iii) continuous on W. 1.13.2
Give an example of an exponential family i n which ψ is not [Exercise 1.1.l(b) provides an example.]
Generalize 1 . 1 3 . 1 ( i i ) as f o l l o w s :
Let τ . € c o n h u l l ί θ , θ , , . . . ^ , } , 1
1
i =l , . . .
l e t θ € W and θ. € Λ/, j = l , . . . , J .
and τ . •* θ.
u
1
[Write τ i = Σα. . [ ( 1 - p ^ θ . + p..θ] with α. . >_ 0,
Then ψ ( τ . ) •* ψ(θ). I
Σα. . = 1, and p1 f 1. j
J
Use 1 . 1 3 . 1 ( i i ) a n d t h e f a c t t h a t ψ ( τ . ) < Σ α . . Ψ ( ( l - P Ί ) θ . + p . . θ ) . ] j
1.13.3
Let Y = (YQ = 1, Y , . . . Y n ) ' be the i n i t i a l state and n f u r t h e r
observations from an S-state Markov chain with t r a n s i t i o n matrix P p γ
(o
= J'|Y0
i = Ί ') = P. ,
1 < i , j < S, £ = l , . . . , n ) .
(i.e.,
Let N denote the sample
BASIC PROPERTIES
29
t r a n s i t i o n matrix, N = { n . . } ,
(1)
n.j
=
Σ X{Ui}(1^v
Y£)
Suppose p., > 0, 1 <_ i,j <_ S. Show that the distributions of Y form an S 2 dimensional exponential family with canonical statistic N = {n..} and canonical parameters {log p. .} . Show that if n >_ 3 the family has order p
S - 1. [Let E.. denote the matrix with i,j-th entry 1 and all other entries 0. Show that for given 1 < i, j < K there exist sample points N-, ΓL having positive probability and that N- + (E.^ - E. ) = N« and (other) points N-, N ? such that N 1 + ( E ^ - E n ) = N 2 ] 1.14.1
Univariate General Linear Model (G.L.M.).
Let Y be m-variate
normal, Y ~ N(μ, σ I), μ € R , σ > 0. (a) Show that this is an m+1 dimensional exponential family,
(b) In the G.L.M. μ is restricted by μ = Bβ ,
β e Rr
with B a known mxr matrix. Assume (for convenience) B has rank r. Show that this is a full (r+1) dimensional exponential family.
[Use Example 1.14 and
Theorem 1.7.] 1.14.2
Matrix normal distribution. Let μ = {μ. .} be an mxq matrix and
let Γ = {γ. .} and % ={σ. .} be mxm and qxq positive definite matrices, respectively.
Let Y = {Y..} be an mxq random matrix whose entries have a
multivariate normal distribution with
This is the matrix normal d i s t r i b u t i o n , denoted by Y ~ N(μ, Γ, Z ) . (a)
Show that Y has density ( r e l a t i v e to Lebesgue measure on Rmq)
f(y)
= (2πΓ
mq/2
|rΓ
m/2
exp t r ( -
[See Arnold (1981, Theorem 17.4).] (b)
Reduce this to an mq + m ^ m * 1 ] ( ^ q + 1 ^
dimensional minimal exponential
family with canonical parameters θ. . = Γ~ μZ" , 1 <_ i <_ m,
1 £ j £ q, and
'ϋ
θii.
... = γ
Ί 1
'σ
j j
',
l < i < i ' < ι n ,
1 < j < j ' < q , where r "
1
= {γ
1 J
} ,
30
(c)
STATISTICAL EXPONENTIAL FAMILIES
Show t h a t i f m _> 2 and q ^> 2 t h i s i s not a f u l l exponential f a m i l y .
Rather, Θ i s an mq + m(m+l)/2 + q ( q + l ) / 2 - 1 dimensional d i f f e r e n t ! a b l e manifold i n s i d e o f W. (An a l t e r n a t e n o t a t i o n involves w r i t i n g Y = ( γ ( i \ > d e f i n i n g (vec Y) 1 = ( γ ( i ) '
>γ(α\)
τ h e n
γ
~ N^
Γ
>γ(α\)
anc
'
> 2) i s the same as
vec Y ~ N(vec μ, I θ Γ) where θ denotes the Kronecker product. 1.14.3
M u l t i v a r i a t e Linear Model (M.L.M.).
Here Y ~ N(μ, I , %) w i t h %
positive definite and = Bβ
μ
with B a known mxr matrix and 3 an (rxq) matrix of parameters. convenience) B has rank r.
Assume (for
Show that this can be reduced to a f u l l minimal
regular exponential family of dimension rq + q(q+l)/2. 1.14.4
Wishart d i s t r i b u t i o n .
mxm positive definite matrices.
Let X = ( x . . ) and t = (σ..) be symmetric lJ iJ The matrix r ( α , t) d i s t r i b u t i o n has
density (l)
p ^(X)
=
—'
where
Γm(α)
= Z1^-1)/4
Π
Γ(α - ( i -
Show this is an exponential family, and describe the natural observations, natural parameters, and cumulant generating function. ( I f Y.,
i = l , . . . , n , are independent N(0, ϊ)
vectors then
n Σ Y.Yj = X has the Γ(3, 2t) distribution. This is also called the wishart (n, t) distribution and denoted by W(n, %). See e.g. Arnold (1981). Also Σ (Y. - Ϋ)(Y. - Ϋ ) 1 ~W(n-l, %) .) 1.15.1
Consider a 2x2 contingency table (see Exercise 1.8.1). Find the 2 2 conditional distribution of Y.. given Y. = Σ Y.. and Y . = Σ Y.. . Show that
BASIC PROPERTIES
31
these conditional d i s t r i b u t i o n s depend only on the given values Y . + , Y + . and on the odds ratio
PiiP22^Pi2^21
and
^orm
a one
" P a r a m e t e r exponential family.
[Under the independence model the d i s t r i b u t i o n is hypergeometric and independent of p.]
CHAPTER 2. ANALYTIC PROPERTIES
DIFFERENTIABILITY AND MOMENTS
The cumulant generating function has several nice properties. Among these are the fact that its defining expression may be differentiated under the integral sign.
In this manner one obtains the moments of X from
the derivatives of ψ. One needs first to establish a simple bound. 2.1
Lemma I,
Let B = conhull {b^ : i=l,...,I} c R . Let C c B° be compact and let b Q e C.
Then there are constants Kρ (depending on C,B) £=0,1,... such th I b. X k 1 eb'x < Σ e η v b e C , x e R
INI
(1)
L
Also, e
(2)
b x lib -
b o .χ
- e
b o ιι
Let ε > 0.
Proof.
1
<- KK, l £
L
I b. b.1 xx ΣΣ e 1 , i1-1 =
k
x € RK
b € C ,
Note that there exists a K
< »
such that
Jc 9 ε
I r I Λ <_ K
eεlΓl
v
r e R
since lim
|r|A/eε|r|
=
0
Let {e. : i=l,...,k} denote the elementary (orthogonal) unit vectors in R . Then X
<
K
L
X. I
^_
No
^ εlx^ I Ί iΣ ee
32
ς
<
% K'
^ εe ϊ x -εe ; x. Σ M(e + e ), e
r
ANALYTIC PROPERTIES where K!
= k^~2^/2K0
33
. Choose ε > 0 such that (b + € e . ) € B, i = l,...,k,
for all b € C. See Figure 2.1(1). By convexity
since e a # x is convex i n a € R
^.) € B = conhuli {b..}. Then
and
k Σ (e
'εε 1=1
+ e
Figure 2 . 1 ( 1 ) :
( b - ε e ΊΊ ) x
)
<
2k K'
x»9ε
B, C, and p . + = b
max(e
+ εei
b.1 x% )
I
<
2k K
Σ e
b
» - j =i
f o r the proof of Lemma 2 . 1 .
STATISTICAL EXPONENTIAL FAMILIES
34
This proves (1), with K£ = 2k K^ Note that (2) may also be written 1
(bΊ- b π ) x KΣe Ί ° 1 1= 1
<
b - br
bn X H e n c e i ts u f f i c e s t op r o v e ( 2 ) i n t h e c a s e w h e r e b π = 0 , ( s o t h a t e ^ = 1 ) Note that rer -e r,+1 >_ 0 and also that for
and we make this assumption below.
r £ 0 l-e r < |r|. Using the f i r s t inequality when b x > 0 and the second when b x < 0 yields b
eb'x
'x bx b
x
^ max(b x e b'' x , I b x j ) Γb^Γl ' Hence lx||eb'x+ by (1) since b £ C and 0 £ C.
llxl
b. x
| |
FORMULAS FOR MOMENTS Let £. >_ 0 be non-negative integers with
k Σ ί,. = I.
Formal
calculation yields (1)
Ί
_ i i -λ(β)
k θ< x = / ( _ π x ^ ) e v(dx)
.
i=i9θi In particular (2)
Vλ(θ)
-
X /xe θ * v(dx)
These calculations are j u s t i f i e d by the following theorem. 2.2
Theorem Suppose θQ e M° .
Then a l l derivatives of λ and of ψ exist at
ANALYTIC PROPERTIES ΘQ.
They are given by the above expressions
d i f f e r e n t i a t i n g under the i n t e g r a l Proof.
We prove only ( 2 ) .
35
( 1 ) , (2) derived by f o r m a l l y
sign.
(The proof of the general formula (1) i s
and proceeds by i n d u c t i o n on I.
See Exercise 2 . 2 . 1 . )
there is a B = c o n h u l l ί θ ^ : i = l , . . . , I } c W °
Let ΘQ 6 N°.
similar Then
and C c B°, C compact, w i t h
ΘQ € C . Let θ
(3)
d ( θ , x)
=
e
'x
-
- (θ-θn)
e
xe
9
.
llθ - θ o l l By Lemma 2.1 (4)
sup |d(θ, x ) | Θ€C
<_ 2K
i
Σe
Ί
Also |d(θ, x)|
(5)
s i n c e Ve
#x Λ
_Q
= xe
.
-> 0
as
θ -> ΘQ
Hence
π /d(θ,
x)v(dx)
-> 0
as
θ -> ΘQ
by t h e dominated convergence theorem, so t h a t
λ(θ) - λ(θ n ) - (θ - θ n ) 9 Q
(6)
IIθ
which proves ( 1 ) .
-
/xe °
v(dx)
ΘQII
II
Theorem 2.2 immediately yields the following fundamental formulae. 2
k 9f For f : R -> R introduce the notation D^f for the kxk matrix ( - — r — ) . An alternate expression is V'Vf since V1 converts each element of the (column) vector I — into the row vector (a( τ^-)/3χ. : j = l,...,k), and hence D 9 f = V'Vf. dX.
2.3
σX.j
J
c-
Corollary Consider a standard exponential family.
Let θ £ W°.
Then
36
STATISTICAL EXPONENTIAL FAMILIES
(1)
E Θ (X) = Vψ(θ)
(2)
cov X = D ? ψ(θ) = V'Vψ(θ)
Notation.
In the sequel we frequently use the notation
(I 1 )
ξ(θ) = Vψ(θ) = EΘ(X)
and (2 1 )
ϊ(θ) = D2ψ(θ) = ZΘ(X) .
Proof. Calculating formally, Vψ(θ)
= /xe θ ' x v(dx)//e θ ' x v(dx)
=
E
θ< X >
The c a l c u l a t i o n is j u s t i f i e d by Theorem 2.2. (1)
is s i m i l a r .
2.4
Examples
This proves ( 2 ) .
The proof of
II
The reader is invited to use Corollary 2.3 to calculate the familiar formulae for mean and variance in the classic exponential families such as the (univariate) normal, multinomial, Poisson, gamma, negative binomial, etc. For the multivariate normal distribution Corollary 2.3 provides a benefit in the reverse direction.

Let Y be m-variate normal (μ, Σ), as in Example 1.14. Fix μ = 0. Direct calculation (not using Corollary 2.3) yields the familiar result

(1)    E(Y_i Y_j)  =  σ_ij  =  −(Θ^{-1})_ij  =  −θ^{ij}

when μ = 0, where Θ^{-1} = (θ^{ij}). Calculation using Corollary 2.3 and the formula 1.14(3) for the cumulant generating function thus yields, for i ≤ j,

(2)    ∂ψ(θ)/∂θ_ij  =  σ_ij / (1 + δ_ij)  =  −θ^{ij} / (1 + δ_ij) ,

since the corresponding canonical statistics are Y_i Y_j / (1 + δ_ij). Let B = −Θ. Then (2) shows that for any positive definite symmetric matrix B,

(3)    (∂/∂b_ij) log|B|  =  2 b^{ij} / (1 + δ_ij) ,   where   B^{-1} = (b^{ij}) .

Hence, also,

(4)    (∂/∂b_ij) |B|  =  2 b^{ij} |B| / (1 + δ_ij) .
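Corollary 2.3 is easy to verify numerically whenever ψ is available in closed form. The following sketch is an editorial illustration, not part of the original text; it assumes Python with NumPy and uses the Poisson family, for which ψ(θ) = e^θ and the mean and variance both equal e^θ.

```python
import numpy as np

# Poisson family in canonical form: psi(theta) = exp(theta), mean = var = exp(theta).
psi = np.exp

def mean_and_var_from_psi(psi, theta, h=1e-5):
    """Finite-difference versions of Corollary 2.3:
    E_theta(X) = psi'(theta),  Var_theta(X) = psi''(theta)."""
    mean = (psi(theta + h) - psi(theta - h)) / (2 * h)
    var = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h ** 2
    return mean, var

theta = 0.7
mean, var = mean_and_var_from_psi(psi, theta)
print(mean, var, np.exp(theta))   # all three agree to roughly 1e-6
```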
The convexity of ψ together with Theorem 2.2 yields the following useful result.

2.5 Corollary
Let θ_1, θ_2 ∈ N°. Then

(1)    (θ_1 − θ_2)·(ξ(θ_1) − ξ(θ_2))  ≥  0 .

Equality holds in (1) if and only if P_{θ_1} = P_{θ_2}. Consequently ξ(θ_1) = ξ(θ_2) if and only if P_{θ_1} = P_{θ_2}. (If {p_θ} is minimal this happens only when θ_1 = θ_2.)

Proof. ψ is convex. Hence the directional derivative of ψ in direction θ_1 − θ_2 is non-decreasing as one moves along the line from θ_2 to θ_1. That is,

(2)    (θ_1 − θ_2)·∇ψ(θ_2 + ρ(θ_1 − θ_2))  =  (θ_1 − θ_2)·ξ(θ_2 + ρ(θ_1 − θ_2))

is non-decreasing in ρ. This yields (1). If P_{θ_1} ≠ P_{θ_2} then ψ is strictly convex on the line joining θ_2 and θ_1. Hence (2) is strictly increasing for ρ ∈ (0,1). This yields the remaining assertions of the corollary. (The parenthetical assertion is contained in Theorem 1.13.)  ||
The final corollary to Theorem 2.2 establishes the possibility of differentiating inside the integral sign for expectations involving exponential families. The result is stated only for real valued statistics, but obviously generalizes to higher dimensional statistics.

2.6 Corollary
Let T : R^k → R. Let

(1)    N(T)  =  {θ : ∫ |T(x)| e^{θ·x} v(dx) < ∞} .

Then N(T) is convex. Define

(2)    h(θ)  =  ∫ T(x) e^{θ·x} v(dx)  =  e^{ψ(θ)} E_θ(T(X))

for θ ∈ N(T). Then all derivatives of h exist at every θ ∈ N°(T), and they may be computed under the integral sign. In particular

(3)    ∇E_θ(T(X))  =  ∫ (x − ξ(θ)) T(x) exp(θ·x − ψ(θ)) v(dx) .

Proof. Suppose T(x) ≥ 0. Applying Theorem 2.2 to the measure ω(dx) = T(x)v(dx) yields the desired results. For general T the corollary follows upon using the above to separately treat T^+ and T^−.  ||

Note that if T and |T|^{-1} are bounded then N(T) ⊃ N.
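Formula (3) identifies ∇E_θ(T(X)) with the covariance Cov_θ(X, T(X)). A small numerical check is sketched below; it is illustrative only (the Poisson family and the choice T(x) = x² are assumptions of the sketch, not from the text), and it assumes Python with NumPy and SciPy.

```python
import numpy as np
from scipy.stats import poisson

def E_theta(theta, T):
    """E_theta(T(X)) for the Poisson family in canonical form, lam = exp(theta)."""
    lam = np.exp(theta)
    x = np.arange(0, 200)              # truncation is adequate for moderate lam
    return np.sum(T(x) * poisson.pmf(x, lam))

T = lambda x: x ** 2
theta, h = 0.3, 1e-5

# Left side of 2.6(3): d/dtheta E_theta(T(X)), by central differences.
lhs = (E_theta(theta + h, T) - E_theta(theta - h, T)) / (2 * h)

# Right side: Cov_theta(X, T(X)) = E(X T(X)) - E(X) E(T(X)).
rhs = E_theta(theta, lambda x: x * T(x)) - E_theta(theta, lambda x: x) * E_theta(theta, T)
print(lhs, rhs)   # the two values agree to about 1e-5
```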
ANALYTICITY

The moment generating function is analytic. This fact is implicit in the proof of Theorem 2.2. As a preliminary we extend the definition of λ and ψ to the complex domain. Let λ : C^k → C be defined by the same expression as previously, i.e.

(1)    λ(θ)  =  ∫ exp(θ·x) v(dx) .

For θ ∈ C^k let Re θ denote the vector with coordinates (Re θ_1, ..., Re θ_k). Note that for x ∈ R^k

(2)    |e^{θ·x}|  =  e^{(Re θ)·x} .

Hence λ(θ) exists for Re θ ∈ N.

2.7 Theorem
λ(θ) is analytic on {θ ∈ C^k : Re θ ∈ N°}.

Proof. Lemma 2.1 (and its proof) apply for b ∈ C^k, x ∈ R^k. Similarly the proof of Theorem 2.2(2) is valid verbatim for θ ∈ C^k. Thus ∇λ(θ) exists for Re θ ∈ N° (and has the expression 2.2(2)). This implies that λ is analytic on this domain.  ||
Two important properties of analytic functions are: (i) they can be expanded in a Taylor series; and (ii) they are analytic in each variable separately. Thus, for a fixed value of (θ_2,...,θ_k), λ((·,θ_2,...,θ_k)) is analytic. λ((·,θ_2,...,θ_k)) is therefore determined by its values on any subset having an accumulation point. This is the basis for the following result.

2.8 Lemma
Let T : R^k → R, and let

(1)    h(θ)  =  ∫ T(x) e^{θ·x} v(dx) ,   for   Re θ ∈ N(T),

as defined in 2.6(1). Then h is analytic on {θ ∈ C^k : Re θ ∈ N°(T)}. Let L be a line in R^k, and let B ⊂ L ∩ N(T) be any subset of L ∩ N(T) having an accumulation point in N°(T). Then

(2)    h(θ)  =  0    ∀ θ ∈ B

implies h(θ) = 0 for all θ ∈ L ∩ N°(T).

Proof. The first assertion follows upon applying Theorem 2.7 to T^+(x)v(dx) and T^−(x)v(dx). Next, one may apply linked affine transformations as in Proposition 1.6. Because of this it suffices to consider the case where L = {θ ∈ R^k : θ_2 = ... = θ_k = 0}. Then h((θ_1,0,...,0)) is an analytic function of θ_1 ∈ C, as already noted. Hence (2) implies h((θ_1,0,...,0)) ≡ 0 on its domain of analyticity, which is {(θ_1,0,...,0) : (Re θ_1, 0,...,0) ∈ L ∩ N°(T)}. This proves the second assertion.  ||
Note that, more generally, if B is as above then the values of h on B uniquely determine, by analytic continuation, its value on all of L ∩ N°(T). (Straight lines play a special role in the above lemma. However we note that there is a valid generalization of the above lemma in which L can be replaced by a suitable one dimensional curve determined as the locus of points satisfying (k − 1) simultaneous analytic equations (C. Earle (1980), personal communication). For example L may be taken to be the curve x_1² + x_2² = 1, x_3 = ... = x_k = 0.)

2.9 Example
A question which arises in statistical estimation theory is whether the positive part James–Stein estimator for an unknown normal mean,

    δ(x)  =  (1 − (k−2)||x||^{-2})^+ x ,    x ∈ R^k,

can possibly be generalized Bayes for squared error loss. This is equivalent to asking whether δ(·) can be the gradient of a cumulant generating function for some measure v(dθ) having N = R^k. (Note the interchange of the roles of θ and x.) See Theorem 4.16. The answer is, "No." To see this note that δ(x) ≡ 0 for ||x|| ≤ 1. Hence if δ(x) = ∇ψ(x) = ∇λ(x)/λ(x) for ||x|| < 1 it follows by analyticity that ∇ψ(x) = 0 on its domain of analyticity, which in this case is R^k. This implies δ(x) ≡ 0, a contradiction.  ||
2.10 Example
The question arises in the theory of hypothesis tests as to whether the unit square,

    S  =  {x ∈ R^k : |x_i| ≤ 1},    k ≥ 2,

can be a Bayes acceptance region for testing the mean of a normal distribution. Placed in a general context, the question is whether there exist two distinct non-zero finite measures G_0 and G_1 (concentrated on disjoint sets Θ_0 and Θ_1 ⊂ R^k) such that

(1)    d(x)  =  ∫ e^{θ·x − ||θ||²/2} (G_0(dθ) − G_1(dθ))  >  0   if x ∈ S,

and d(x) < 0 if x ∉ S. The answer is, "No."

Proof. Let μ_i(dθ) = e^{−||θ||²/2} G_i(dθ), i = 0,1. Then d(x) = λ_0(x) − λ_1(x), where λ_i is the moment generating function of μ_i. Note that N_{μ_i} = R^k, i = 0,1. Hence d(·) is analytic on R^k. For convenience consider only the case k = 2. Expand d in a Taylor series about (1,1) as

    d((1,1) + (y_1, y_2))  =  Σ_{i=0}^∞ Σ_{j=0}^i a_{j,i−j} y_1^j y_2^{i−j} .

(a_{0,0} = 0 since d((1,1)) = 0.) Let i' be the smallest index for which Σ_{j=0}^{i'} |a_{j,i'−j}| > 0. i' exists since d ≢ 0 if (1) is valid.

Suppose i' is even. Then for y_i ≥ 0, i = 1,2,

(2)    Σ_{j=0}^{i'} a_{j,i'−j} y_1^j y_2^{i'−j}  =  Σ_{j=0}^{i'} a_{j,i'−j} (−y_1)^j (−y_2)^{i'−j} .

There are values (y_1, y_2) in the first quadrant for which (2) ≠ 0, since (2) is a non-zero homogeneous polynomial. Suppose (y_1^0, y_2^0) is such a value. Then

    |ρ|^{−i'} d((1,1) + (ρ y_1^0, ρ y_2^0))  =  Σ_{j=0}^{i'} a_{j,i'−j} (y_1^0)^j (y_2^0)^{i'−j} + o(1)  =  c + o(1)   as |ρ| → 0,

with c ≠ 0. If c > 0 it follows that d((1,1) + (ρy_1^0, ρy_2^0)) > 0 for ρ > 0 sufficiently small, where the point lies outside S; and this would contradict (1). If c < 0 it follows that d((1,1) + (ρy_1^0, ρy_2^0)) < 0 for ρ < 0 sufficiently small, where the point lies inside S; and this would also contradict (1).

If i' is odd, analogous reasoning yields

    ||y||^{−i'} d((1,1) + (y_1, −y_2))  =  −||y||^{−i'} d((1,1) + (−y_1, y_2)) + o(1)   as ||y|| → 0,

and there are values of (y_1, −y_2), y_1 > 0, y_2 > 0, for which the limit of ||y||^{−i'} d((1,1) + ρ(y_1, −y_2)) as ρ ↓ 0 is not 0. It follows that there are values of y in either the fourth quadrant or the second quadrant, arbitrarily near 0, for which d((1,1) + y) > 0; such points lie outside S. This again contradicts (1). Hence (1) is impossible.  ||
COMPLETENESS

2.11 Remarks
A family {F_θ : θ ∈ Θ} of probability distributions (or their associated densities, if these exist) is called statistically complete if T : R^k → R with

(1)    ∫ T(x) F_θ(dx)  =  0    ∀ θ ∈ Θ

implies

(2)    T(x)  =  0   a.e. (F_θ)    ∀ θ ∈ Θ .

(Implicit in (1) is the condition that ∫ |T(x)| F_θ(dx) < ∞ ∀ θ ∈ Θ.) Standard exponential families are complete if the parameter space is large enough. This result, which is equivalent to the uniqueness theorem for Laplace transforms, is proved in Theorem 2.12. (The uniqueness theorem for Laplace transforms states that if N_μ° ∩ N_v° ≠ φ then λ_μ = λ_v if and only if μ = v.)

The most convenient way to prove this theorem seems to be to invoke the uniqueness theorem for Fourier–Stieltjes transforms (equals characteristic functions), which is described in the next paragraph. Let Im = {bi ∈ C : b ∈ R} denote the pure imaginary numbers. Let F be a finite (non-negative) measure on R^k. The function κ_F : R^k → C defined by

    κ_F(b)  =  λ_F(bi) ,    b ∈ R^k,

is the Fourier–Stieltjes transform (or, Fourier transform, or, characteristic function) of F. Hence λ_F restricted to the domain (Im)^k is equivalent to κ_F. Note that κ_F always exists (i.e. Re((Im)^k) = {0} ⊂ N). The uniqueness theorem for Fourier transforms is as follows.

Theorem. Let F and G be two finite non-negative measures on R^k. Then F = G if and only if κ_F ≡ κ_G (i.e. λ_F(bi) = λ_G(bi) ∀ b ∈ R^k).

Proof. This is a standard result in the theory of characteristic functions. Proofs abound. A quick proof may be found in Feller (1966, XV.3). (This proof is explicitly for R, but generalizes immediately to R^k.)  ||
Here is the classic result on completeness of exponential families.

2.12 Theorem
Let {p_θ : θ ∈ Θ} be a standard exponential family. Suppose Θ° ≠ φ. Then {p_θ} is complete.

Proof. Let θ_0 ∈ Θ°. One may translate coordinates using Proposition 1.6 so that θ_0 = 0. There is thus no loss of generality in assuming θ_0 = 0. Suppose ∫ T(x) p_θ(x) v(dx) = 0 ∀ θ ∈ Θ. Then, letting T = T^+ − T^−,

(1)    ∫ T^+(x) e^{θ·x} v(dx)  =  ∫ T^−(x) e^{θ·x} v(dx)    ∀ θ ∈ Θ .

Let F(dx) = T^+(x)v(dx), G(dx) = T^−(x)v(dx). Then (1) becomes

(2)    λ_F(θ)  =  ∫ e^{θ·x} F(dx)  =  ∫ e^{θ·x} G(dx)  =  λ_G(θ) ,    θ ∈ Θ .

Both λ_F(·) and λ_G(·) are analytic on the domain {z ∈ C^k : Re z ∈ Θ°}, and (2) states that they agree on Θ. Hence λ_F(z) = λ_G(z) for all z such that Re z ∈ Θ°. (This follows directly from analyticity. Alternately one may apply the second half of Lemma 2.8 to all lines which intersect Θ°.) In particular

(3)    λ_F(0 + bi)  =  λ_G(0 + bi)    ∀ b ∈ R^k

since 0 ∈ Θ°. Thus F = G by Theorem 2.11. This says that T^+(x)v(dx) = T^−(x)v(dx), which implies T^+ = T^− a.e.(v), which implies T = 0 a.e.(v). Hence {p_θ} is complete.  ||

Note from the above that any canonical family is complete. From this we derive:
2.13 Corollary
A standard family with N° ≠ φ is uniquely determined by its Laplace transform (or by its cumulant generating function).

Note that the corollary applies to all minimal families since they always have N° ≠ φ.

Proof. Consider the standard families in R^k generated by the measures μ and v. Suppose N_μ° ≠ φ and ψ_μ = ψ_v. Let ω = (μ + v)/2. Then ω generates an exponential family with λ_ω = (λ_μ + λ_v)/2, and hence N_ω = N_μ = N_v and ψ_ω = ψ_μ = ψ_v. Let T = dμ/dω − dv/dω. Then

    ∫ T(x) e^{θ·x − ψ_ω(θ)} ω(dx)  =  ∫ e^{θ·x − ψ_μ(θ)} μ(dx) − ∫ e^{θ·x − ψ_v(θ)} v(dx)  =  1 − 1  =  0 .

Hence T = 0 a.e.(ω) by Theorem 2.12; which implies μ = v.  ||
Theorem 2.12 has many other important applications in statistics. It plays an important role, for example, in the theory of unbiased estimates and in the construction of unbiased tests. Some aspects of this role are described in the exercises and in succeeding chapters.

MUTUAL INDEPENDENCE

Lehmann (1959, p. 162-163) describes a nice proof of the independence of X̄ and S² in a normal sample. A different but related proof is a special instance of an argument which applies in several important exponential families. (See Example 2.15.) The basic parts of the argument are due to Neyman (1938) and Basu (1955), but the full result in Theorem 2.14, below, was only recently proved by Bar-Lev (1983) and by Barndorff-Nielsen and Blaesild (1983). The proof below follows that in the second of these papers. See the exercises for an additional related result of Bar-Lev and for several applications of this theorem. Through most of this subsection we consider the situation where
θ and x are, respectively, partitioned as θ' = (θ_(1)', θ_(2)') and x' = (x_(1)', x_(2)'). As in Sections 1.7 and 1.15, problems not in this form can sometimes be reduced to this form through use of linked linear transformations on θ and x. Where convenient, we write ψ(θ) = ψ(θ_(1), θ_(2)). We use the notation Y ~ Expf(θ) to mean that the distributions of Y form a standard exponential family with natural parameter θ. We also use the notation X ⊥ Y to mean that X and Y are independent.

2.14 Theorem
Let X ~ Expf(θ) with θ° ∈ Θ°. Let X' = (X_(1)', X_(2)') where X_(i) is k_i dimensional, and let h(X_(1)) be a k_2 dimensional statistic. Let

(1)    p_1(θ_(1), θ_(2))  =  log E_{θ°}(exp((θ_(1) − θ°_(1))·X_(1) + (θ_(2) − θ°_(2))·h(X_(1))))

       p_2(θ_(2))  =  log E_{θ°}(exp((θ_(2) − θ°_(2))·(X_(2) − h(X_(1))))) .

Then the following conditions are equivalent:

(2)     X_(1) ⊥ (X_(2) − h(X_(1)))   under θ°

(2')    X_(1) ⊥ (X_(2) − h(X_(1)))   for all θ ∈ Θ

(3)     ψ(θ_(1), θ_(2))  =  p_1(θ_(1), θ_(2)) + p_2(θ_(2))    ∀ θ ∈ Θ

(4)     (X_(2) − h(X_(1)))  ~  Expf(θ_(2))

(5)     (X_(1), h(X_(1)))  ~  Expf(θ_(1), θ_(2)) .
Proof. For convenience, assume without loss of generality that θ° = 0. (See Proposition 1.6.) Let ω denote the joint distribution under θ = 0 of V = (X_(1), h(X_(1)), X_(2) − h(X_(1))). Consider the standard exponential family generated by ω, with natural parameter space N_V. Note that, in general,

    {X_(1) ⊥ (X_(2) − h(X_(1)))}  ⇔  {(X_(1), h(X_(1))) ⊥ (X_(2) − h(X_(1)))} .

The equivalence of (2) and (2') is seen in this fashion to be a special case of Exercise 1.7.1.

(2) ⇒ (3) follows from a direct calculation.

(3) ⇒ (2): Let ω_1 denote the distribution under θ° = 0 of (V_(1), V_(2)) = (X_(1), h(X_(1))) and ω_2 that of V_(3) = X_(2) − h(X_(1)). Let ω* = ω_1 × ω_2. Then the cumulant generating function ψ* of ω* satisfies

    ψ*(θ_(1), θ_(2), θ_(2))  =  p_1(θ_(1), θ_(2)) + p_2(θ_(2)) ,    (θ_(1), θ_(2)) ∈ Θ .

Furthermore, the cumulant generating function of the linear function (V_(1), V_(2) + V_(3)) is ψ** given by

    ψ**(θ_(1), θ_(2))  =  ψ*(θ_(1), θ_(2), θ_(2))  =  p_1(θ_(1), θ_(2)) + p_2(θ_(2))  =  ψ(θ_(1), θ_(2)) ,    θ ∈ Θ .

It follows from Corollary 2.13, since Θ° ≠ φ, that (V_(1), V_(2) + V_(3)) has the same distribution under θ° as (X_(1), X_(2)). Thus (X_(1), X_(2) − h(X_(1))) has the same joint distribution under θ° as (V_(1), V_(2) + V_(3) − h(V_(1))). But V_(2) + V_(3) − h(V_(1)) = V_(3). Hence X_(1) ⊥ (X_(2) − h(X_(1))) under θ° since V_(1) ⊥ V_(3).

(2) ⇒ (4) and (5), as can be seen by direct calculation of the marginal distributions involved via the standard formulae (6) and (8), below.

(4) ⇒ (2): The marginal density of V_(3) = X_(2) − h(X_(1)) relative to the marginal distribution ω_2 is
(6)    q_θ(v_(3))  =  ∫ exp(θ_(1)·v_(1) + θ_(2)·h(v_(1)) + θ_(2)·v_(3) − ψ(θ)) ω(dv_(1) | v_(3))   (a.e.),

where ω(·|·) denotes the indicated conditional distribution. By (4)

    q_θ(v_(3))  =  exp(θ_(2)·v_(3) − p_2(θ_(2)))   (a.e.).

Setting θ_(2) = 0 yields

(7)    exp(ψ(θ_(1), 0) − p_2(0))  =  ∫ exp(θ_(1)·v_(1)) ω(dv_(1) | v_(3)) ,    (θ_(1), 0) ∈ Θ , (a.e.) .

Here the Laplace transform of ω(·|v_(3)) exists on an open set and is independent of v_(3) (a.e.). It follows from another application of Corollary 2.13 that ω(·|v_(3)) is independent of v_(3) (a.e.). So, V_(1) is independent of V_(3). This verifies (2).

The proof that (5) ⇒ (2) is similar. The marginal joint density of (V_(1), V_(2)) is

(8)    q'_θ(v_(1), v_(2))  =  ∫ exp(θ_(1)·v_(1) + θ_(2)·h(v_(1)) + θ_(2)·v_(3) − ψ(θ)) ω'(dv_(3) | v_(1))   (a.e.) .

Setting θ_(1) = 0 and cancelling terms in (5) implies

    exp(ψ(0, θ_(2)) − p_1(0, θ_(2)))  =  ∫ exp(θ_(2)·v_(3)) ω'(dv_(3) | v_(1))   (a.e.).

Hence, as before, ω'(·|v_(1)) is independent of v_(1) (a.e.), which yields (2).  ||
2.15 Examples
(i) Let Y_1,...,Y_n be independent N(μ, σ²) variables. Then (Example 1.12) (Y_i, Y_i²) ~ Expf(μ/σ², −1/2σ²), and hence (ΣY_i, ΣY_i²) ~ Expf(μ/σ², −1/2σ²). Also (ΣY_i, (ΣY_i)²/n) ~ Expf(μ/σ², −1/2σ²). This verifies 2.14(5). Hence ΣY_i² − (ΣY_i)²/n = Σ(Y_i − Ȳ)² ~ Expf(−1/2σ²) and is independent of ΣY_i, by 2.14(4) and 2.14(2').

(ii) Similarly, let X_1,...,X_n be independent Γ(α, σ). Then (Example 1.12) (ΣX_i, (1/n)Σ ln X_i) ~ Expf(−1/σ, nα). The marginal distribution of ΣX_i is also Γ(nα, σ); hence (ΣX_i, ln ΣX_i) ~ Expf(−1/σ, nα). Again, Theorem 2.14 yields that ((1/n)Σ ln X_i − ln ΣX_i) ⊥ ΣX_i. This is often re-expressed in the form X̃/X̄ ⊥ X̄, where here X̃ = (Π_{i=1}^n X_i)^{1/n} denotes the geometric mean of the observations. Also, ln(X̃/X̄) ~ Expf(nα). See the Exercises for a double extension of this conclusion.

There are further applications of this theorem. For some of these see the exercises and the references cited above. In particular there are several applications to problems involving the inverse Gaussian distribution. See Chapter 3.
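The independence assertions of Example 2.15 are easy to examine by simulation. The sketch below is an editorial illustration (it assumes Python with NumPy, and uses the sample correlation only as a crude diagnostic): it draws gamma samples and confirms that the ratio of geometric to arithmetic mean is essentially uncorrelated with the sum ΣX_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, sigma, reps = 5, 2.0, 1.5, 20000

x = rng.gamma(shape=alpha, scale=sigma, size=(reps, n))
total = x.sum(axis=1)                                       # canonical statistic sum(X_i)
ratio = np.exp(np.log(x).mean(axis=1)) / x.mean(axis=1)     # geometric / arithmetic mean

# Under Example 2.15(ii) the ratio is independent of the sum, so the sample
# correlation should be near zero (a necessary condition only, of course).
print(np.corrcoef(ratio, total)[0, 1])
```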
CONTINUITY THEOREM The continuity theorem for Laplace transforms refers to the limiting behavior of a sequence of measures and the associated Laplace transforms. We f i r s t need a standard definition and some related remarks. 2.16
Definition Consider R .
functions on R .
Let C denote the space of continuous (real-valued)
Let CQ c C denote the subspace of continuous functions with
compact support - - i . e . c(x)
= 0
for
11 x EI > r ,
some r < oo .
A (non-negative) measure v is called locally v({x : I Ixl I < r } ) < oo v
finite
r € R. Except where specifically noted, a l l measures
are assumed to be locally f i n i t e , σ-finite, and non-negative. sequence of measures.
Let {v } be a
We say v
(1)
if
-* v
(weak*)
/ c(x)v n (dx) -> /c(x)v(dx)
if V c € CQ .
Here are several important facts concerning weak* convergence. For v finite let V^ denote the cumulative distribution
function:
V v ( t ) = v ( { x : xΊ. < t . , i = l , . . . , k } ) . (i) (2)
Then v n + v i f and only i f
Vv ( t ) + V v (t) (ii)
V t € Rk
at which Vv( ) is continuous.
k k S u p p o s e v + v . T h e n l i m i n f v ( R ) > v ( R ) . S u p p o s e there n ~ n —
is a c £ C, c >_ 0, with (3)
lim c(x) = °° llxll-*»
such that (3 1 )
lim sup /c(x)v (dx) < «> n-χ»
Then lim v (R k )
(4)
= v(R k ) < «
n-χ»
(iii) Furthermore, (4) implies (5)
/c(x)v n (dx) - /c(x)v(dx)
for all bounded c ε C. (Condition (3), (3 1 ) is sometimes referred to by saying the sequence is tight.) (iv) If v > 0 is any bounded sequence (i.e. lim sup v (R ) < «>) n-*» then there is a subsequence {vn } and a finite measure v such that v -> v . n π i i For a proof of these facts see Neveu (1965). 2.17
Theorem Let S <= Rk and l e t B = conhull S.
sequence of measures on R
(1)
oo
v
b e S .
Then there e x i s t s a subsequence {n^} and a l o c a l l y
f i n i t e measure v such t h a t
(2)
ebo'x v
n
(dx)
->
eb°#xv(dx)
i
and (3)
Let v n be a
such t h a t
lim i n f sup λ (b) < v beS n Let b Q e B°.
Suppose B° * φ.
λ
(b) - λ v (b)
V b € B° .
The convergence in (3) is uniform on compact subsets of B°,
STATISTICAL EXPONENTIAL FAMILIES (Condition (3) is of course equivalent to
ψ
(b) + ψ (b)
v b e B°
n
i Condition (1) implies the measures v are locally finite.) Remark.
Lemma 2.1 together with (3) shows that / x e b # x v n ( d x ) - /xe b ' x v(dx),
(4)
b ε B° ,
and similarly for higher moments of x. Hence (5)
V λ v (b) + vλ v (b), n
b € B°
i
and similarly for higher order partial derivatives of λ. See Exercise 2.17.1. Similar reasoning also shows that (6)
e θ # b v n (dθ) - e θ # b v(dθ)
weak*
v b e BQ .
Hence the measure v in (2) does not depend on the choice of b Q € BQ . Proof.
We exploit Proposition 1.6 and assume without loss of generality
that bg = 0 £ B°. It also suffices to assume that B is a convex polytope (i.e. B = conhull {b. : i=l,...,m}) since the interior of any convex set is a countable union of such polytopes, and a compact subset of the interior will be contained in one of them. Now, m
lim
Σ
e
bΓx
=
by Lemma 2.1. Thus, for some subsequence {n1.} j m
(7)
b- x lim sup /( Σ e Ί )v.(dx) < co n n-« i =l j
by (1). Hence, the sequence {v ,} is tight, and there exists a further J subsequence {v } and a limiting measure v such that v -• v . This immediately implies that also e b # x v
n
(dx) -> e b ' x v (dx) for any b € R k . i
lim (Σe7 / e b " x ) = °°, again by Lemma 2.1.
Let b € B°. Then As i n (7)
bΓx
lim sup / ^ T - e b ' X v D x n-χ» e
n n
i
(dx) < ».
Hence the sequence e # X v (dx) is also tight. This implies / e b # x v n (dx) + /e b ' x v(dx), which yields (3). Let C c B° be compact. Then ||x||e b χ by Lemma 2 . 1 .
< KΣebi*X
This yields
lim sup sup | |Vλv (b)|| Ί-XΌ beC r\.
m b <_ lim sup K / Σ e Ί i^oo i =i
x v
(dx)
n Ί
m <_ lim sup K Σ λ(b.) < <» . The f u n c t i o n s λ
(•) are thus u n i f o r m l y ( i n { n . } ) u n i f o r m l y c o n t i n u o u s on C.
The convergence i n (3) i s t h e r e f o r e u n i f o r m on C.
||
2.18 Uniform Convergence Theorem 2.17 shows that if A..
(b) -> ψ Λ ( b )
for all
b G B° f φ
A
then
\>. + v i
There is a useful uniform version of this statement. (1)
{v α n : n=l,...,α € A}
be a family of sequences of measures and {v All of these are assumed locally f i n i t e . v when for each c 6 CΛ
Let
-** v
: α € A} be a family of measures.
We say
(weak*) uniformly in α
(2)
/c(x)v
(dx) n
uniformly over α € A. V
α =\
/c(x)v (dx)
->oo
For notational convenience in the following, let
etc
Proposition.
Suppose the family of cumulative distribution functions
{V : α € A} is equicontinuous at e^ery x e R .
Then v
-+v
uniformly in
α i f and only i f (3) Proof.
V α n •*• V α
uniformly for
α €A
The necessity of (3) is proved by applying (2) to continuous
functions c satisfying
c(x)
1
x
i ^X0i - 6
0
X
i
for a l l
= >X
0i
+ 6
f o r some
and then choosing 6 sufficiently small. Conversely, (3) implies /g(x)d(V α n (x) - V α (x)) = • ^ V α n ^ " v α ( χ ) ) d 9 ( χ ) "*• ° uniformly in α for each differentiate g € C Q . If c € C Q and ε > 0 there is a differentiable g € C Q with |g - c| < ε . Then |/(c(x) - g(x)) d(V α n (x) - V α (x))| < 2ε uniformly for all α € A and all n. Combining these facts yields the uniform convergence of v
to v .
||
Extra care in the proof of the above proposition will show that if the ίV α : α e A) are equicontinuous uniformly over x € S and v^ -> v uniformly in α then (3) holds uniformly for α € A, x € S. 2.19
Theorem Let {v n ) and {v } be as in 2.18(1).
and B° ϊ φ. Let λ = λ , etc. α α (1)
Suppose
λ (b) -> λ (b) nκ»
uniformly over α e A, and suppose
Suppose B = conhull S,
v bε S
ANALYTIC PROPERTIES (2)
53
sup sup λ (b) < b€S α α
Then v
-»• v uniformly over α € A. If v -h v uniformly over α € A, there is a c € C n and a sequence an a
Proof. a such that
lim |/c(θ)(v α n (dθ)-v α
(3)
n-χ»
In view of (3) there exists a subsequence vt t vΐ such that if we write v (4)
ω. •* v* , l
1
= ω. and v
= ω. then
λ M (b) + λ *(b) , ω.
n. and limiting measures
b E B ;
v1
and (5) ω. -> v* , λ- β (b) -> λ v *(b) b €B by Theorem 2.17. (To establish (4) we exploit (2) to guarantee condition 2.17(1) for the sequence {ωn } .) i Assumption (1) implies λ *(b) = λ *(b), b e B, which implies v V l 2 vϊ = Vo - This is a contradiction. α € A.
It follows that v
-*• v uniformly over
||
TOTAL POSITIVITY 2.20
Definitions Let S cz R and h : S -> R .
{x
€ S :
Let { X Q < . . . < X Π } <= S.
i = O , l , . . . , n } is called a s t r i c t l y changing sequence f o r h having
order n i f
(1)
The sequence
(sgn h(χ._ 1 ))(sgn h(x.)) = -1
i=l s ...,n
The number S (h) -- the number of strict sign changes of h -- is the maximal order of a sequence of strict sign changes of h. Clearly 0 <_ S (h) <_«> . Let S"(h) = n < °° and let {x. € S : i=0,...,n} be a strictly changing sequence for h having order n. Then the (strict) initial sign of h is (2)
IS"(h) = sgn h(x Q )
.
(It is easy to check that this definition is well-formulated -- i.e. does not depend on the chosen strictly changing sequence for h.) Similarly a sequence {x. € S : i=0,...,n} is called a weakly changing sequence for h having order n if (3)
(sgn h(x 2i ))(sgn h(x 2 j + 1 )) < 0
for
i=0,...,[n/2],
j=0,...,t(n-l)/2]
This means that zeros of the sequence {sgn h(χ.)
: i = 0 , l , . . . , n } can be
reassigned as e i t h e r a (+1) or a (-1) i n a manner so t h a t the r e s u l t i n g sequence of ± Γ s alternates i n sign. of such a sequence.
The number S (h) is the maximal order
Clearly, 0 <_ S (h) <_ °°, and S + (h)
(4)
> S"(h)
.
Let S + (h) = n < °° and l e t {x. e S : i = 0 , . . . , n } be a weakly changing sequence for
(5)
h of order n.
IS"(h)
=
Then +1
if
h(x2 ) > 0
f o r some
0
if
h(χ.) = o
i=0,...,n
-1
if
h(x2i)<0
for some
i=0,...,[n/2]
i=0,...,[n/2] .
It can be checked that this definition is well formulated. 2,21 Theorem Let {p n } be a standard one parameter exponential family. Let u g : R -> R such that v{χ : g(x) ^ 0} > 0 . Let
ANALYTIC PROPERTIES (1)
h(θ) = E θ (g(x)) ,
55
θ € A/°(g) .
Then S + (h) < S"(g) .
(2)
I f equality holds in (2) then IS+(h)
(3) Remark.
= is"(g)
.
The domain of h i n (1) is r e s t r i c t e d to W°(g).
true i f the domain of h i s a l l of N(g).
The theorem remains
We leave this generalization as an
exercise. The sign-change-preserving properties ( 2 ) , (3) are equivalent to "Total P o s i t i v i t y of {p Q } of order « . " standard reference on t h i s t o p i c .
Karlin (1968) is a very u s e f u l ,
See also Brown, Johnstone, and MacGibbon
(1981).
Proof.
Let g(θ)
= /e θ x g(x)dx = e ψ ( θ ) h ( θ )
.
It suffices to prove g has the properties of h in (2), (3), The proof is by induction on n = S~(g). Assume without loss of generality that IS"(g) = +1. When n = 0 the result is trivial since then g ^ 0 and v({χ : g(x) > 0}) > 0 so that g(θ) > 0 for all θ € W(h), as claimed in (2). Assume the theorem is true for n <_ N. Suppose n = N + 1. Let ξ, = infix : g(x) < 0}. u(θ) =
ξ- > -» since IS"(g) = +1. Let
d_( e " θ ζ i3(θ))
= /(x - ξ l )g(x)e θ x v(dx)
.
Now, S"((x - ξi)g(x)) 1 N = n - 1, as can easily be checked from the definition of ζ r
Hence S+(u) <_ N by the induction hypothesis. Integration yields that
S+(U) <_ N + 1 where (4)
U(θ) = / θ u(t)dt
= e " ξ i θ g(θ)
.
(2) follows from (4). (3) may be verified by concentrating the above argument on the case where S + (u) = N and S + (U) = N + 1, and using the induction hypothesis
to keep track of IS (u) and consequently of IS (U).
||
The above property for n = 1 is equivalent to the strict monotone likelihood ratio property. The following is an important consequence of this. 2.22
Corollary Let { p . } be a standard one parameter exponential family. u
g : R + R is non-decreasing and not e s s e n t i a l l y a constant ( v ) .
Suppose
Then E θ (g)
is s t r i c t l y increasing on M°(g). [Remark.
Again, the r e s u l t is true on the f u l l domain, W(g), but we leave
v e r i f i c a t i o n of t h i s as an exercise.) Proof.
Let
ess i n f g( ) < c < ess sup g( ) ί then g( ) - c s a t i s f i e s the
hypotheses of Theorem 2.21 with S~(g-c) = 1. for
θ € W°(g) whenever θ > QΛc)
s t r i c t l y increasing on W°(g).
Hence E J g ) - c > 0
(whenever θ < θ..(c)).
(or < 0)
I t follows that g i s
||
I t is possible to derive from the above some results concerning sign changes f o r multidimensional f a m i l i e s .
In general, these results appear
yery weak by comparison with t h e i r univariate cousins.
Here is an example of
such a r e s u l t which w i l l be useful l a t e r . 2.23
Corollary Let {p Q } be a standard k parameter exponential family.
ΘQ € N and v € Rk.
Suppose g : Rk ->• R s a t i s f i e s
Let θ = ΘQ + pv.
g(x)
£
0
v
x £ α
^
0
v
x ^ α
(1)
for
some α € R.
Let h(ρ)
+
Then S (h) < 1 .
+
=
E +
(g(X)) P
I f S (h) = 1 then IS (h) = - 1 .
.
Let
Proof.
Apply Theorem 2.22 to the one parameter exponential family {pΘ Q } P of densities of v X. Observe that E θ (g(x)|v x = t) = g*(t) is independent o f P by Theorem 1.7, and (1) guarantees that S"(g*) £ 1.
These observations enable the desired application o f the theorem.
II
PARTIAL ORDER PROPERTIES The preceding multidimensional r e s u l t i s not very s a t i s f a c t o r y ; the hypotheses on h are too r e s t r i c t i v e .
Better results may be obtained by
considering p a r t i a l orderings and imposing suitable r e s t r i c t i o n s on the exponential family.
We give one simple r e s u l t as an appetizer f o r what may
be obtained. For t h i s r e s u l t define the p a r t i a l ordering, <* , on R by x « y i f x
i 1 y-j»
i =l , . . . , k .
A function h : R -> R i s non-decreasing r e l a t i v e to t h i s
ordering i f x <χ y implies h(x) <_ h ( y ) . The following preparatory lemma i s also of independent i n t e r e s t . 2.24
Lemma Let X have coordinates X.,...,X. which are independent random
variables with d i s t r i b u t i o n s F 1 ,...,F| ζ i respectively. decreasing r e l a t i v e to the p a r t i a l ordering * .
(1) Proof. well known. of (1) as
Suppose hy h^ are non-
Then
E(h 1 (X)h 2 (X)) > EίhjWJEίhgίX)) . The proof is by induction on k.
Note that for k = 1 the result is
This observation enables one to rewrite and reduce the l e f t side
58
STATISTICAL EXPONENTIAL FAMILIES /.../ hΛx)h ?(x) 1 ά
k Π ΊF.(dχ.) Ί i=l
k-l = / . . . / ( / h 1 (x)h 2 (x)F k (dx | < )) Π F.(dx.) k-l >_ /.../ [/h 1 (x)F k (dx k )][/ h 2 (x)F k (dx k )] n F.(dχ.) . Each function in square brackets is clearly non-decreasing in (xΊ,...,x, - ) . Hence, by induction, (1) is valid.
||
Here is the application to exponential families. 2.25 Theorem Consider a minimal standard exponential family for which the canonical coordinate variables X 1 9 ...,X k are independent. Let h be nondecreasing relative to the partial ordering «. Then E (h) is a non-decreasing function of θ on M°(h). Proof.
(This result may be extended to all of W(h).)
Write
Note that both x. - ξ.(θ) and h(x) are non-decreasing functions of x. Hence J
J
g|τE θ (h) > E (Xj - ξj(θ))Eθ(h(X)) = 0 J
by Lemma 2.24. It follows that E Λ (h) is non-decreasing in each coordinate of θ and hence (equivalently) is non-decreasing relative to «.
||
The preceding theorem is merely a sample of the available results. Other assumptions may replace the independence assumption, above. Notably, the conclusion of Lemma 2.24 remains valid if the joint distribution, F, of X has a density f with respect to Lebesgue measure which is monotone likelihood ratio in each pair of coordinates when the others are held fixed. (Exercise.) (There is also a lattice variable version of this fact.) Such densities are called multivariate totally positive of order 2 (= MTPp). Suppose {pa> is a minimal standard exponential family whose dominating measure, v, is MTPp
It follows by the proof of the theorem above that then h non-
decreasing implies E Q (h) non-decreasing in θ. Under suitable conditions it is also possible to derive analogous "order preserving" results for other partial orderings. For example, one may consider the partial ordering induced by a convex cone C c R , under which xα
c
y if y - x e c.
A rather different but very fruitful partial ordering is that leading k k the notion of Schur convexity. Define x « Qb y if Σ x.Ί = Σ y. and if Ί i=l i=l k1 Σk' X MLΊJI 1 Σ V ΓLΊJ- T 1 <_ k' < k, where x r LΊJ . Ί »i = l,...,k, denote the coordinates i= l i =i of x written in decreasing order, etc. Then h is called Schur convex if it is non-decreasing relative to the ordering <*
(Obviously any such function
must be a symmetric function of x , , . . . ^ . ) For further information about these and other partial orderings, consult Marshall and Olkin (1979), Karlin and Rinott (1981), Eaton (1982), and references cited in these works.
EXERCISES 2.2.1
Generalize 2.1(2) to I eθ#x - θo"x
I
b.
Thus (iΐx ^ (2)
) (
e
θ
χ
-eθo
χ
||Θ
-
(θ - θ0)
-
θ
xeVx)
oll
Use this to prove 2. 2(1) by induction on I. 2.3. 1
Consider a one-dimensional standard exponent!
K c [ 0 , oo).
Show that
(1)
(E0[(l - a)X])2
X >EO[(1 - 2a) ],
0 < a <
and VarQX < oo imply (2)
E 0 (X)
>:
VarQ X
.
[Let e θ = (1 - a) and show by d i f f e r e n t i a t i n g at θ = 0" that (1) implies ψ'(0~) > ψ"(θ").
The finiteness of Var X guarantees that
ψ"(0~) = VargX < «>, e t c . , S. Zamir (personal communication).]
( I t is not known
i f (1) implies (2) without the assumption that VarQX < 00.) 2.4.1
Canonical one-parameter exponential families for which Var Q (X) is
u
a quadratic function of E Q (X) are called quadratic variance function families (= QVF). See Morris (1982, 1983). Verify that the following six families have the QVF property: (1) (2)
N(μ, σ 2 )
μ known
P(λ)
(3)
r(α, σ)
α known
(4)
Bin (r, p)
(5)
Neg. Bin. (r, p)
r known r known
(6) v has density f(x) = (2 cosh(^))" 1 , -00 < x < «> , relative to Lebesgue measure. (X = π log(Y/(l - Y)) where Y ~ Beta [h, h).)
ANALYTIC PROPERTIES [ I n (6) ψ(θ) = - log(cos θ ) . distribution.
61
This is called the hyperbolic secant
The generalized hyperbolic secant distributions are produced
from these by i n f i n i t e d i v i s i b i l i t y and convolution. only QVF families (Morris, 1982). 2.5.1
These families are the
See also Bar-Lev and Enis ( 1 9 8 5 ) . ]
Let {p } be a canonical one-dimensional exponential family.
Then N° = ( θ ^ θ 2 ) , -oo £ ξ
< ξ
£ co.
ξ(W°) = ( ξ ^ ξ 2 ) for some -°° < θ χ < θ 2 < «, and i f K = [x-,°o) then ξ. = x..
(Theorem 3.6 is a m u l t i v a r i a t e
generalization of this r e s u l t . ) 2.10.1
Let {p θ > be a two-dimensional canonical exponential family.
Find
a convex subset of W such that h bounded and E Q (h) = 0 for a l l θ € 3W implies h = 0
(Hence, the family {p Λ : θ € dhl}
Let π.(θ) = E A ( φ ) . Φ θ
Θ 6 0 , , implies π segments.
is "boundedly
Conclude that e\/ery test of ΘQ versus Θ, = hi - ΘQ is "admissible".
complete".) (i.e.
a.e.(v).
Then π. (θ) <. π ( θ ) , θ € Θ Π , and π. (θ) ^ π ( Θ ) , ψ^ — Φ2 u φ^ — Φ2
(θ) Ξ π
( θ ) .)
[3W contains an i n f i n i t e number of l i n e
See F a r r e l l ( 1 9 6 8 ) . ]
Similar Tests and Unbiased Tests 2.12.0
Let ΘΊ c Θ,
i=0,l.
A c r i t i c a l t e s t function φ,
called level α unbiased i f E Q (φ) < α, called similar
(level α) i f EQ(Φ)
Ξ
ot,
0 £ φ <_ 1 , is
θ € Θ n , and E Q (φ) > α , θ € Θ Ί . θ 6 0 Q n 0 . n W.
I t is
The following
problems consider the common case where ΘQ U Θ, = hi so that 8ΘQ n hi = 0 Q n § 1 n hi.
Exercises 2.21.3, 2.21.4 and 2.21.5 contain further applications
of these concepts.
See also 7 . 1 2 . 1 .
2.12.1
Let {p } be a regular canonical family and l e t θ 1 = ( θ | i ) »
X1 = ( X | . j ,
x
(2))
essential here.) (i) (1)
be
p a r t i t i o n e d vectors.
Let L = {θ : θ / ^ = 0} .
θ
(2)^9
(Regularity is convenient but not Assume L Π W° f φ.
Show that a c r i t i c a l function φ is similar on L i f and only i f α
= /Φ(x)v(dxj1j|X(2)^
a
'e
(v)
(Tests with property (1) are said to have Neyman structure. Note that the
right side of (1) is E Θ (Φ|X( 2 )) f o r (ii) (2)
v
θ
€ L
)
Show that φ is similar on L and satisfies VE Λ (φ) = 0
V v € L
1
θ = 0
= { v : v
VΘ6L}
Ό
if and only if φ satisfies (1) and (3)
/ x^jjφίxjvίdx^jlx^j) = 0
a.e. (v) .
(Note that (2) is a necessary condition for a test of HQ: θ € L versus Hy θ f L to be unbiased. for all θ € L,
v € L ,
(3) expresses the fact that v x^x.
=
°
See Lehmann (1959) for many applications of
(1) and (3) to the construction of U.M.P.U. 2.12.2
VE (Φ|X/2^)
( i ) Let X - N(θ, I) in Rk,
tests.)
k >_ 2.
Show there does not exist a
non-constant level α similar test of ΘQ = {θ : θ. <_ 0 for some i } . [Use Example 2.10.] (ii)
Show there exists a non-constant similar test of
ΘQ = {θ : θ = 0 for some i } , but there does not exist a non-constant unbiased test of this hypothesis. 2.12.3
Let X <Ξ Rk,
X. ~ P(λ.), independent.
Show there exists a non-
t r i v i a l similar test of {λ : λ. < 1 V i } but there does not exist a non-trivial unbiased test of this hypothesis. 2.13.1
Let X = (X..) be a matrix Γ(α, I) variable.
(See Exercise 1.14.4.) m Observe that log |X| has the same Laplace transform as Σ log Y. where Y. J
are independent Γ(α - (i-l)/2, 1) variables. Hence |X| has the same distrim bution as Π Y.. Reinterpret this result to show equality of the distribution i = lΊ of the determinant of a Wishart (n, I) matrix and a product of independent χ 2 -variables. 2.13.2 V F Let F, G be two distributions on R . Let μ^ Ί f i
k Ί
'i . = E( Π X. J ) and ϊΊ k j=l J
p
similarly for y .
(1)
Suppose
,
\ζ
= μ?
,
i, = 0,1,...
j = l,...,k ,
and
(2) w h e r e
lim sup £n/f m
2
j,2n = E d X j l " ) .
n
< »
,
j = i,...,k
(Note,nj)2n = μ 0 ) _ ) 2 n ) _
^
.)
Then F = G.
(Condition (2) is slightly weaker than the necessary and sufficient condition, ( \ Σ m*l/
(3)
J
n=l
'^
= - ,
j = l,...,k
,
n
for (1) to imply equality of F and G.
See Feller (1966, Sections XV4 and
VI13) and references cited therein.) [Use Stirling's formula to show that
θ n /n! converges absolutely
Σm. J )Π
for |θ| < ε , j = l,...,k, and hence that λp = λp on an open set in R .] 2.14.1 (Bar-Lev (1983).) Let X - Expf (θ) with Θ° f φ. Let t ( X ( 2 ) I X M ) ) d e n o t e
the
indicated
conditional covariance matrix. Show that 2 θ (X/o)l x /i\) depends only on θ if and only if X / ^ 1 (X/2x - hίX/^)) for some (measurable) function h. [Integrate Z θ ( χ (2)l x (i)) o n
θ startin
9 at 0 e Θ° to find that the
conditional cumulant generating function of X/p\ under P Q is (2)
Ψ(θ|x,,\) = p(θ/ 2 \) + θ/ 2 j
"(X/!))
^o^ some functions p , h .
Show that (2) implies X/ 2 x - n ( χ M ) ) J- x (j) under P Q .] 2.14.2 Suppose X - Expf ( θ ) with Θ° f φ.
Then the following are
equivalent: (1)
X, x l
(2)
Ψ(θ(1), θ(2j)
(3)
X ( i ) ~ Expf ( θ ( i ) )
(4)
cov θ
X(2x
(χ(i)'
for some θ° € Θ, or for a l l
X
= Ψ1(θ(1))
(2)}
=
+
Ψ2(θ(2))
f o r
for i = 1 and 2 ,
°
V θ
e θ
θ e Θ, s o m e
f u n c t i o n s
Ψi
a n d
Ψ
[For (1) - (3) apply Theorem 2.14 with h H 0 and check Ψi = p i ,
1=1,2.
For (4) =* (2) use 2.3(2) and i n t e g r a t e . ] 2.14.3
( P a t i l (1965), Barndorff-Nielsen
and Blaesild
(1983).)
Let P = {P θ : θ € 0} be a family of d i s t r i b u t i o n s on V, 8. k
k
(measurable), 0 <= R with 0° t φ.
X : V -> R
In E Q exp((3 - θ) f o r some function p( ).
X(Y))
=
Suppose
p(β) - p ( θ ) ,
Then X ~ Expf ( θ ) .
Let
β,θ e 0
[Use Corollary 2.13.]
2.14.4 L e t X have a k-dimensional m u l t i n o m i a l ( N , π) d i s t r i b u t i o n . Write X|jj
= ( X ^ . . . ^
)',
X j 2 j = (Xfe
+ 1
,...,Xk)\
Show t h a t the marginal
d i s t r i b u t i o n s o f both X,-x
and X/^x
f o r m an e x p o n e n t i a l f a m i l y , b u t
i s not independent o f X/ ? \
as one m i g h t e x p e c t f r o m Theorem 2 . 1 4 ( 2 ) .
X/jx Why not?
[The f a c t t h a t X i s n o t a minimal f a m i l y i s i r r e l e v a n t ; f o r k >_ 3, k.. < k-2 the
same phenomenon occurs i n t h e minimal model d e f i n e d as i n
1.2(7).]
2.15.1
Let the independent symmetric mxm matrices, X., i=l,...,n, have matrix r(α., t) distributions. 1
Z = Z 1 9 ...,Z n 1
(See Exercise 1.14.4). Show that n n with Z. = IX.J I/I Σ xΊ I is independent of Σ x. . Show that the J i=i i=l 1
distributions of In Z = {In Z. : j = l , . . . , n } form an exponential family, and identify the canonical s t a t i s t i c and parameter for this distribution. generalizes Example 2.15(ii). multivariate beta distribution.
(This
The distributions of Z form the so-called See, e.g., Muirhead (1982).
When m = 1
the X. have ordinary Γ distributions and the distribution of Z is a Dirichlet distribution.
See Exercise 5.6.2.
2.16.1
Suppose v -»• v with v(R ) < <». Then (1)
lim sup v n ( R k )
i f and o n l y i f the sequence { v n } i s t i g h t .
< v(R k ) [ L e t c(x) = i i f rΊ
<_ ||x|| <_
and choose r. 3 sup v ({||x|| >.r.}) <_ 1/i , 1=1,2,... .] Hence a convergent n sequence of probability measures has a probability measure as its limit if and only if it is tight. 2.17.1 Verify 2.17(4),(5). (1)
[From Lemma 2.1(1)
I |/ xe bχ v^dx)! I < Σ II
^
ί
^
^
and the quantity in braces in (1) is 0(1/(1 + llxll)). Now use 2.16(1).] 2.17.2 Let
S c R
k
k
measures on R 0 € B°.
and B = conhull S.
Let v
be a bounded sequence of
( v n ( R ) < K. < » ) with λ^ ( b ) < »,
Define P p
b
b € s, n = l , . . .
Suppose
by dPn
.
- g j ^ = exp(b x - ψv (b))
(1)
.
.
n n Suppose f o r each b € S there is a K = K(b) such t h a t (2)
l i m sup P
.({llxll
< K})
>
0
.
Then there i s a subsequence { n 1 } c {n} and a non-zero l i m i t i n g measure v such t h a t f o r a l l b € B° eb#xv.(dx) n
(3)
[As
-
e b # x v(dx)
,
λ
(b) vn,
-> λ ( b ) v
i n the proof of Theorem 2.17 i t s u f f i c e s to consider the case
where S is f i n i t e .
Then K = maχ{K(b) : b € S} < ~ .
I f b € S,
then 0 < ε < / e b " x v . (dx)/λ (b) < K Ί e K o K / λ (b). n V V " ||χll
s a t i s f i e d on S n {b : I Ibl I < K Q } .
I Ibl I < KQ
Hence 2 . 1 7 ( 1 )
v f 0 since 0 € B° ]
2.18.1 Let {v on X = { 0 , 1 , . . .
}.
: α e A}, Show t h a t v^
n=l,2,...
be a f a m i l y of sequences of measures
•+ v^ uniformly i n α i f and only
if
v ({x}) -• v ({x}) uniformly in α for each x € X. 2.19.1 Let {puQ> be an exponential family with supp v c {0,1,...} and v(0) > 0, v(l) > 0. Let X-,»...,X be a random sample and, as usual, let n S nn = .Σ X.. Define θn (λ) by =1 i (1)
ξ(θ n (λ)) = λ/n
Let FΛΛ , ndenote the distribution of Snnunder the parameter θ n(λ). Show that F i n - * p U )a n d t h ^ convergence is uniform in λ over λ e [a,b] for Λ ,Π
0 < a < b < oo. [0,
b].)
(A s l i g h t elaboration of the argument y i e l d s uniformity over
Generalize t h i s r e s u l t to the case where pQ is a k-dimensional
exponential f a m i l y .
[Show Ψ"(θ (λ)) -+0 as n -> «> since θ n ( λ ) -> -«>, uniformly nc
p:>n
n
= λ ( e p - 1) + o ( l ) as n -> «> uniformly
for
λ € [a, b].
Hence log EQ / , ^ e
for
λ € [a, b].
Then apply Theorem 2.19.
case the l i m i t d i s t r i b u t i o n
In the non-degenerate
is the product of independent Poisson
k-dimensional variables.]
(A special case of the above is the well known r e s u l t Bin ( n , λ/n) -> P(λ). The general form of the above statement was pointed out to me by I . Johnstone.) 2.21.1 Let X be non-central χ 2 with m degrees of freedom and noncentral i t y parameter θ .
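The special case Bin(n, λ/n) → P(λ) mentioned above is easy to examine numerically. The sketch below is illustrative only (it assumes Python with NumPy and SciPy) and tracks an approximate total variation distance as n grows.

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 2.0
for n in (10, 100, 1000):
    x = np.arange(0, n + 1)
    # Total variation distance between Bin(n, lam/n) and P(lam), restricted to 0..n
    # (the neglected Poisson tail beyond n is negligible here).
    tv = 0.5 * np.sum(np.abs(binom.pmf(x, n, lam / n) - poisson.pmf(x, lam)))
    print(n, tv)    # decreases roughly like 1/n
```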
Show that the d i s t r i b u t i o n s of X have the sign-change
preserving properties 2.21(2), ( 3 ) .
[Use Exercise 1.12.1(1).
Write
E θ (h(X)) = E θ (E(h(X)|K)) . ] 2.21.2
Let X be a one-dimensional exponential family and ΘQ € N°. (i) Show that the (essentially unique) level α test of the form
(1)
1 φ(x) = γ 0
x >x Q x =x 0 x <x Q
is the U.M.P. level α test of H Q : θ <_ Θ Q versus H*: θ > Θ Q . (ii) Similarly, show that the (essentially unique) level α test of the form
(2)
Φ(χ)
=
1
X
>
x2
γ
X
=
x
0
i
X
x < xχ
i
< X
l
or < x2
satisfying (3)
E
is the U.M.P.U. level
θ
(χφ(χ))
=
0
0
test of H Q : θ = Θ Q versus Hy θ f Θ Q .
[(i) Let Φ 1 be any different level α test. Then S"(φ - φ 1 ) = 1. E Ω (φ - φ 1 ) = 0 by definition. Now use Theorem 2.18. (ii) Condition (3) is ϋ
o
the one-dimensional version of 2.12.1(3).
Again use Theorem 2.18.]
( I t is
also possible to show by a c o n t i n u i t y argument that level α tests of the form (1) and ( 2 ) , (3) always e x i s t . )
2.21.3
Consider a 2χ2 contingency table. (See Exercise 1.8.1.) Describe the general form of the U.M.P.U. level α tests of the following null hypotheses.
In each case the alternative is the complement of H Q . (i) H Q : P n P 2 2 / P 1 2 P 2 1 (ϋ)
Ho: P 1 1 P 2 2 / P 1 2 P 2 i
(111) H Q : p
π
ί
l
= 1
< p12
(iv) H Q : p 1 2 = p 2 1
.
(This corresponds to the exact form of McNemar's test. See, e.g. Fleiss (1981).)
[Use Exercise 2.21.2 and, for (i), (ii), Exercise 1.15.1. See
Lehmann (1959).] 2.21.4 Consider a 2χ2 contingency table.
Let c > 0,
exist non-trivial similar tests of the null hypothesis
c f 1. Show there
68
H
O
STATISTICAL EXPONENTIAL FAMILIES
:
Pll^Pll
+
Pi2^
=
C
P21^P21
+
^22^ °^
conc
*"""tional p r o b a b i l i t i e s
given p r o p o r t i o n , even though t h i s i s not a l o g - l i n e a r hypothesis. randomized t e s t s . under which Y*,
[Use
Consider the conditional d i s t r i b u t i o n given Y. + ,
and Yp . are independent binomials.
on i t s own m e r i t s . )
in a
Consider the special case Y, +
i = l,2
(This case is of i n t e r e s t = 1 = Y~+
f o r which the
condition f o r s i m i l a r i t y reduces to four l i n e a r equations i n the four variables φ(y) f o r the four c o n d i t i o n a l l y possible outcomes, y .
This t e s t is
unbiased f o r the one-sided version of H Q , but not f o r HQ as defined above. Is t h e r e , i n general, an unbiased t e s t of HQ?
Is t h e r e , i n general, a U.M.P.U.
t e s t of e i t h e r the one- or two-sided hypothesis i n e i t h e r the o r i g i n a l model or the conditional (independent binomial) model?
The somewhat
analogous question of the existence of s i m i l a r and of unbiased tests f o r the Behrens-Fisher problem of equality of means f o r two normal samples with unknown variances is solved i n Wijsman (1958) and i n Linnik (1968).] 2.21.5
Let X 1$ ...,X be a sequence of independent failure times, assumed to have a Γ(α, σ) distribution. Describe the U.M.P.U. tests of H Q : α = 1 versus H,: α > 1 and H': α f 1. [Use Exercise 2.21.2 and Example 2.15.] 2.25.1 Suppose v has density f with respect to Lebesgue measure on R and f is MTP 2 (i.e. has monotone likelihood ratio) in each pair of coordinates. Prove the conclusions of Lemma 2.24 and Theorem 2.25. Prove these also for the case where f, as above, is a density with respect to counting measure on the lattice of points with integer coordinates. [If h(xj,... ,x k ) is nonΊ s
decreasing then, under v, E(h(Xχ,... >\_y \)\ \ = \ ) ' also nondecreasing.] 2.25.2 Let {p Q } be a canonical
k-parameter exponential family with
ANALYTIC PROPERTIES ΘQ G W°.
Let H Q : θ <_ ΘQ and Hy
θ > ΘQ.
69
( i ) Show t h a t any Bayes or
generalized Bayes t e s t , α, of H Q versus Hj has the strong monotonocity property Φ(x)
> 0
y > x
=> φ(y)
=
1
Φ(x)
< 1
y < x
=> Φ(y)
=
0
(1)
Assume ΘQ = 0 and consider V / p ^ x H G ^ d θ )
- GQ(dθ)]
( g e n e r a l i z e d ) p r i o r measure r e s t r i c t e d to H . ] measure v is MTP2
where G i denotes the
( i i ) Suppose the dominating
Show t h a t any ( g e n e r a l i z e d ) Bayes t e s t i s unbiased.
[Use the above and Exercise 2 . 2 5 . 1 . ]
2.25.3
(Slepian's
Inequality)
Let X, Y be k-dimensional
normal v a r i a b l e s w i t h mean 0 and non-
s i n g u l a r covariance matrices A, B, r e s p e c t i v e l y .
Suppose
Then, f o r any C e R k ,
(1)
Pr{X < C }
>
Pr{Y <. C}
.
[ I f Z ( p ) ~ N ( 0 , A + p(B - A ) ) then
aXp( 8p
(2)
z ( p )
-< C)
=
Σ i W
πj ~3 -θ iPj ( Z
( p )
Q i i
< C)
where each α. . >_Q. Note that for i ^ j (3) by 2 . 4 ( 2 ) .
τ j £ - = θ., exp(-ln|*|/2)
= θ,, λ
Hence 92pfl(Z)
(4)
— 9 θ
from C o r o l l a r y 2 . 1 3 .
= ij
θ. . p (Z) ΊJ
θ
^ i
"""j
Combine ( 2 ) and ( 4 ) to y i e l d ( 1 ) . ]
proof of Slepian's i n e q u a l i t y see Saw ( 1 9 7 7 ) .
(For an a l t e r n a t e
For g e n e r a l i z a t i o n s see Joag-
Dev, Perlman, and P i t t (1983) and Brown and R i n o t t
(1986).)
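A quick numerical illustration of Slepian's inequality in Exercise 2.25.3 is sketched below; it is an editorial illustration only (it assumes Python with SciPy's multivariate normal CDF, and the particular matrices A, B and point C are assumptions of the sketch).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Slepian's inequality: A_ij >= B_ij entrywise with equal diagonals implies
# Pr{X <= C} >= Pr{Y <= C} for X ~ N(0, A), Y ~ N(0, B).
A = np.array([[1.0, 0.6], [0.6, 1.0]])
B = np.array([[1.0, 0.1], [0.1, 1.0]])
C = np.array([0.5, 0.3])

p_A = multivariate_normal(mean=[0, 0], cov=A).cdf(C)
p_B = multivariate_normal(mean=[0, 0], cov=B).cdf(C)
print(p_A, p_B, p_A >= p_B)   # True: larger correlation gives the larger orthant probability
```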
CHAPTER 3. PARAMETRIZATIONS
In regular exponential families maximum likelihood estimation is closely related to the so-called mean value parametrization. This parametrization will be described after some brief preliminaries. The relation to maximum likelihood is pursued in Chapter 5. 3.1
Notation For v ε R , α € R let H(v, α) denote the hyperplane H(v, α) = {x € Rk : v
x = α}
Let H (a, α) and H~(a, α) be the open half spaces H + (v, α) = ίx ε R k : v
x > α}
H"(v, α) = {x ε R k : v
x < α}
When (v, α) are clear from the context they will be omitted from the notation Note that the closure of H~ is written H" and, of course, satisfies ΪΓ= H U H 1 . STEEP FAMILIES Most exponential families occurring in practice are regular (i.e. W is open). However, for technical reasons which will become clear in Chapter 6, it is \/ery useful to prove the parametrization Theorem 3.6 for steep families as well. 70
71
PARAMETRIZATIONS 3.2
Definition L e t φ: R
+ (-°°,
« ] be convex.
Assume Φ i s c o n t i n u o u s l y d i f f e r e n t i a t e on N°.
Let
W = {θ € R k : φ ( θ ) < °°}
Let θ . e W -
Λ/°f θ o e W °
,
and l e t θ
= ΘQ + ρ ( θ 1 - Θ Q ) , 0 < p < 1 , denote p o i n t s on t h e l i n e
joining
ΘQ t o Qy
Then, φ i s c a l l e d steep
,
i f f o r a l l θ j € N - W°, ΘQ € A/° Q
(1)
lim (θj -
Vφ(θ ) = oo
ΘQ)
Note that (1) is the same as (I1)
limf-φ(θ) = Λ 1
dp
p
Figure 3.2(1): An i l l u s t r a t i o n of the definition of steepness
A standard exponential family is called steep i f i t s cumulant generating function, ψ, is steep.
(A steep convex function is sometimes
referred to as an "essentially smooth" convex function.) exponential family is regular then i t is a fortiori
Note that i f the
steep since N - W° = φ
STATISTICAL EXPONENTIAL FAMILIES Here is a convenient necessary and sufficient condition for
steepness. 3.3
Proposition A minimal standard exponential family is steep i f and only i f
(1)
EJllxll)
= oo
for a l l
θ € W - W°
o Proof.
Suppose the family is steep. (θ 1 - Θ Q )
Vψ(θ p )
= (θχ - ΘQ)
This i m p l i e s EQ ( ( θ j - Θ Q )
X) -* °°
P
9
Then ξ(θp) + ° o
as p t 1
which i m p l i e s ( 1 ) .
The converse seems not to be easy to prove without further preparation. We postpone the proof to Chapter 6. It appears after the proof of Lemma 6.8. 3.4
||
Example There is one classic example of a steep non-regular family which
occurs in a variety of applications.
I t is the family of densities defined by
(1)    (2π)^{-1/2} z^{-3/2} exp(θ_1 z + θ_2(1/z) − ψ(θ)) ,    ψ(θ)  =  −((−2θ_1)(−2θ_2))^{1/2} − (1/2) ln(−2θ_2) ,

relative to Lebesgue measure on z ∈ (0, ∞). The canonical statistics are (x_1, x_2) = (z, 1/z) and the natural parameter space is

(2)    N  =  (−∞, 0] × (−∞, 0) .

Thus the family is not regular but is steep since E_{(0,θ_2)}(x_1) = ∞ for all θ_2 ∈ (−∞, 0). These densities are referred to as inverse Gaussian. They arise, for example, as the distribution of the first time (x_1) that a standard Brownian motion crosses the line ℓ(t) = (−2θ_2)^{1/2} − (−2θ_1)^{1/2} t. Note that these densities with θ_1 = 0 are the scale family of stable densities on (0, ∞) with index ½. See Feller (1966). For some other steep non-regular families see Bar-Lev and Enis (1984).
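The normalization in 3.4(1), and the steepness just noted, are easy to check numerically. The sketch below is an editorial illustration (it assumes Python with NumPy and SciPy, and uses the density in the form reconstructed above).

```python
import numpy as np
from scipy.integrate import quad

def invgauss_density(z, th1, th2):
    """Density 3.4(1): theta1 <= 0, theta2 < 0, z > 0."""
    return ((2 * np.pi) ** -0.5 * z ** -1.5
            * np.exp(th1 * z + th2 / z
                     + np.sqrt((-2 * th1) * (-2 * th2))
                     + 0.5 * np.log(-2 * th2)))

th1, th2 = -0.5, -2.0
total, _ = quad(invgauss_density, 0, np.inf, args=(th1, th2))
print(total)          # ~1.0: the density is properly normalized

# At theta1 = 0 the density is still proper (the boundary point of N),
# but E(Z) is infinite since z * f(z) ~ z^{-1/2} at infinity: steepness.
total0, _ = quad(invgauss_density, 0, np.inf, args=(0.0, th2))
print(total0)         # ~1.0 as well
```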
MEAN VALUE PARAMETRIZATION We begin with a useful lemma which involves a natural relation between parameter space (Θ) and sample space (X). Similar relations will reoccur several times and we have found it useful to draw pictures to illustrate the geometric relationships involved. Figure 3.5.1, below, is a simple example of such a picture which illustrates the hypotheses of Lemma 3.5.
JN
Figure 3.5.1:
3.5
Illustrating the hypotheses of Lemma 3.5 when k = 2.
Lemma Let v € R k , α € R. Let K <= R k be compact.
v(H (v, α)) > 0.
Then there exists a constant c > 0 such that λ(θ
(1)
+ pv)
> ce pα
(Note that (1) is equivalent (I1)
ψ(θ
Suppose
+ pv)
>^ pα + log c
V θ € K,
p > 0
to V θ € K,
I f θ + pv I A/ then λ(θ + pv) = °° so that (1) is
p _> 0 trivial.)
74
STATISTICAL EXPONENTIAL FAMILIES
Proof. (2)
λ(θ + pv)
= /e(θ+pv)
χ
v(dx)
> e p α Γ e θ χ v(dx) H
> ce p α
where c = inf / e θ # x v(dx) Θ€K H
(3)
> 0
.
((2) shows that i f c = °° here then λ(θ + pv) = °° for a l l θ € K and all p>0.)
|| Note that (3) provides an explicit formula for the constant c
appearing in formula (1).
Exercise 3.5.1 contains a converse to this lemma.
Here is the main result. 3.6 Theorem
Let {p 0 } be a minimal steep standard exponential family. Then ζ(θ) = E Θ (X) defines a homeomorphism of M° and K° continuous, 1-1, and onto.
(i.e., ξ: H° + K° is
Of course, if {p Q } is regular then ξ: U -> K°
since M = M°). Proof.
ξ is continuous on W° by Theorem 2.2 and Corollary 2.3. It is
1-1 by Corollary 2.5.
It remains to prove that ξ(W°) = K°, that is, to show
(1)
x € K° => x e ξ(Λ/)
It suffices to prove (1) for x = 0, for then the desired result for arbitrary x e K° follows upon translating the origin, which is justified by Proposition 1.6. So, assume 0 € K°. Let Sj = ίv e R k : llvll = 1}. Since 0 € K° there is an ε > 0 such that (2) for all v € S..
v(H + (v, ε)) > c > 0 (If not, there would be sequences v. € S 1 with
PARAMETRIZATIONS
v. -> v e Sι
75
and ε^ -> 0 f o r which v ( H + ( v Ί . , ε . ) ) -> 0.
v ( f l + ( v , 0 ) ) = 0 which contradicts 0 € K ° . )
This would imply
Now apply Lemma 3.2
(with
v = θ / | | θ | I and p = I | θ | | ) including the expression 3 . 2 ( 3 ) f o r the constant appearing in the lemma to get
(3)
ψ(θ)
w i t h c as i n ( 2 ) .
>
I I θ l l ε + log c
Thus
(4)
lim
ψ(θ)
= oo
l l θ l IHOO
(See
Exercise 3 . 6 . 2 and Lemma 5 . 3 ( 3 ) f o r restatements of ( 3 ) , ( 4 ) . ) Any lower semi-continuous function (such as ψ) defined on a closed
set
and which also s a t i s f i e s ( 4 ) must assume i t s minimum.
Ψ(ΘΊ.) = i n f ί ψ ( θ ) :
θ e Rk}.
I I Θ . M -> « i s
To see t h i s , l e t
impossible by ( 4 ) .
So, there
i s a convergent subsequence, θ . , -> θ * , and ψ ( θ * ) = i n f ί ψ ( θ ) : θ € R } by lower s e m i - c o n t i n u i t y . )
This minimum is assumed a t a point θ * € W.
Suppose θ * € N - W°. ψ(θ p
Then, f o r some 0 < p1 < 1 ,
,) < ψ ( θ * ) = l i m ψ ( θ n + p ( θ * - θ n ) ) by v i r t u e of 3 . 2 ( 1 ' ) o f the d e f i n i t i o n U U p+1
of steepness.
Hence no θ * G W - W° can be the minimum point f o r ψ.
I t follows
t h a t θ * € W°. Hence ξ(θ*)
=
Vψ(θ*)
=
0
since ψ is differentiate on a neighborhood of θ*. (Here we use Theorem 2.2, Corollary 2.3, and the fact that θ* € W° an open set.) This proves (1) for x = 0 and, as noted, completes the proof of the theorem. 3.7
||
Interpretation Theorem 3.6 shows that a minimal, steep family with parameter
space N° can be parametrized by ξ = ξ(θ), and the range of this parameter is K°. This is the mean value yavametvization.
In this parametrization the
resulting family is an exponential family, but of course is no longer a
standard exponential family (except when ζ( ) is a f f i n e ) . (1)
θ(x)
=
ξ-1(x)
=
(θ :
Write
ξ(θ) = x)
The exponential family parametrized by ξ then has densities Pr(x) = exp(θ(ξ)
x - ψ(θ(ξ))).
For a number of applications t h i s parametri-
zation is more convenient than the "natural" parametrization described by the canonical parameter θ.
I f {p f i } is regular then W = N° and the mean value
parametrization reparametrizes the f u l l
family.
Minimality was used i n Theorem 3.6 only to guarantee that the map i s 1-1.
Even without minimality the map ξ discriminates between d i f f e r e n t
d i s t r i b u t i o n s i n {?'.
θ C N].
Hence one can s t i l l use the mean-value
parametrization to conveniently index {P A : θ e N°}, and the range of the mean
u value parameter is the relative interior of K. (Equivalently, one may reduce to a minimal family by Theorem 1.9 and then apply Theorem 3.3.) If the family is not steep then ξ(W°) c K°. We leave this fact — relatively unimportant for statistical application -- as an exercise. In this case it is even possible to have ξ(W°) not convex. See Exercise 3.7.1 for an example due to Efron (1978). 3.8 Example
(Fisher-VonMises Distribution) For a number of common exponential families the mean value
parametrization is the familiar parametrization, or nearly so. For example, for the Binomial (N, π) family the expectation parameter is Nπ, for the Poisson (λ) family the expectation parameter is λ, and for the exponential distributions (gamma distributions with index α = 1 and unknown scale, σ) the expectation parameter is σ. For the multivariate normal (μ, I) family the 1
expectation parameters are μ and μμ + I (corresponding to the canonical statistics of 1.14). The mean value parameters are not always so convenient. Nevertheless it is necessary to consider this parametrization in order to construct maximum likelihood estimators. See especially Theorem 5.5.
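In practice the mean value map ξ = ∇ψ of Theorem 3.6 is inverted numerically, for instance by Newton's method, since ψ'' = Var_θ(X) > 0 on N°. The sketch below is an editorial illustration (Python with NumPy is assumed; the binomial (N, π) family with ψ(θ) = N log(1 + e^θ) is used as the example, not taken from the text).

```python
import numpy as np

N = 10.0
xi  = lambda th: N * np.exp(th) / (1 + np.exp(th))          # xi(theta) = psi'(theta)
dxi = lambda th: N * np.exp(th) / (1 + np.exp(th)) ** 2     # psi''(theta) = Var_theta(X) > 0

def theta_of(x, tol=1e-12):
    """Solve xi(theta) = x for x in K° = (0, N) by Newton's method."""
    th = 0.0
    while abs(xi(th) - x) > tol:
        th -= (xi(th) - x) / dxi(th)
    return th

x = 3.7
print(theta_of(x), np.log(x / (N - x)))   # agrees with the closed form log(x/(N-x))
```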
Accordingly, we now discuss the mean value parametrization f o r the FisherVonMises d i s t r i b u t i o n . Let v be uniform measure on the sphere of radius one i n R . Consider the exponential family generated by v. VonMises family.
When k = 3 i t is the Fisher
When k = 2 t h i s i s the
family
of d i s t r i b u t i o n s .
These
d i s t r i b u t i o n s appear o f t e n i n a p p l i c a t i o n s , w i t h a v a r i e t y o f parametrizations, to model angular data i n R .
Consult Mardia (1972) f o r an extended treatment
o f these f a m i l i e s ; see also Beran (1979).
(Frequently one considers a sample
of n observations from one of these d i s t r i b u t i o n s .
The sample mean, X , is
then also said to have a VonMises or Fisher d i s t r i b u t i o n . parametrization f o r the family of d i s t r i b u t i o n s of X to t h a t below since E Q ( X j = E Q (X). u n u
The mean value
i s , of course, i d e n t i c a l
See also 5 . 5 ( 3 ) . )
The Laplace transform of v i s
where I (•) denotes the modified Bessel function of order s. When k is odd these functions have a convenient representation in terms of hyperbolic functions; for example I 1 / 2 (r) = (2/πr) 1 / 2 sinh r
(2)
I 3 / 2 (r) = (2/πr) 1/2 (cosh r - (sinh r)/r) (See, for example, Courant and H u b e r t (1953).)
These functions also have
nice recurrence r e l a t i o n s ; i n p a r t i c u l a r (3)
I^(r)
=
I$+1(r) + sls(r)/r ,
s >_ 0,
r>0
By symmetry, or by calculation, it follows that ξ(θ) lies in the same direction as θ, that is (4)
ξ(θ)/||ξ(θ)||
=
θ/||θ|| ,
θ t 0,
and
ξ(0) = 0
78
STATISTICAL EXPONENTIAL FAMILIES
It remains therefore to give a formula for ||ξ(θ)||. For this purpose it suffices to consider the case where θ = (r,0,...,0), and to calculate —: In λ v (θ r ).
For the Fisher distribution (k = 3) one gets from (1) - (3)
that ||ξ(θ)|| = coth ||θ|| - M θ l Γ 1
(5)
For the Von Mises distribution (k = 2) one gets only the less convenient expression (6)
l|ξ(θ)|| = iχ(I |Θ||)/io(||θ|I)
.
Although (6) is less convenient that (5), it can be used in conjunction with series expansions or tables of the modified Bessel function to provide numerical values for ||ξ(θ)||, and other information about ||ξ(θ)||.
MIXED PARAMETRIZATION We refer to the type of situation discussed in 1.7. a partitioned kxk non-singular matrix with M,M* = 0. M.x
= z.
i = 1, 2
(MT)'Θ
= φi
i = 1, 2
M M = (MM ) is 2
Write
(1)
Φi
-i
(Thus (. ) = (M )'θ .) Where convenient we write φ. = φ. (θ) to emphasize the dependence on θ, etc.) Note that (2)
M.ξ(θ) = E ^ M ^ ) = E Θ (Z.) = ζ.(θ)
(say)
i = 1, 2
.
Recall also that one may without loss of generality visualize only the case where M = I. In this case φ] = (θj. . .θ ), z' = (x + 1 , . . . f x k ) , ζ
2 = ^nH-l-
^
etc
PARAMETRIZATIONS
79
The following result is valid for steep families but for simplicity we state and prove it here only for regular families. See Exercise 3.9.1. 3.9 Theorem Let {p Q } be minimal and regular. Then the map /ζΊ(θ)\ e - (1 )
(3)
is 1 - 1 and continuous on N° (=W) with range ζ^W 0 ) x Φ2(W°)
(4)
Proof.
= K°{1) x φ2(W°)
.
Fix φ 2 € Φ 2 (W) and refer to Theorem 1.7. The distributions of
Zj given Φ 2 (θ) = Φ 2 form the minimal regular standard exponential family generated by v0 . According to Theorem 3.6 this family can be parametrized (in a 1 - 1 manner) by ζ Ί (θ) = E Q (Z n ). The range of this map is int (conhull (supp v 0 )) = K° (say) . *2 Φ2 The formula for v is given in 1.7(5), but all that needs to be noted is that Φ K T h e m a2 p i n l° =Kll) ( 3 )i s t h e r e f o r e ι - 1 with range as in (4). Continuity of the map in (3) is immediate from continuity of ζ. 3.10
||
Interpretation The above theorem has an interpretation like that of Theorem 3.6.
Any minimal regular exponential family can be parametrized by parameters of the form 3.9(3), above. This parametrization is called the mixed parametrization. Consider a mixed parametrization with parameter (. ), as above. ζj
Q
Then the family of densities corresponding to the parameters {( ) : Φ 2 = Φ 2 )
80
STATISTICAL EXPONENTIAL FAMILIES
forms a f u l l standard exponential family of order m.
(See Theorem 1.7.)
However, i f one fixes the expectation coordinate and looks at the family ζ
l corresponding to the parameters { ( . ) : ζ
0 = ζ-} then one gets i n general
only some non-full standard family of dimension and order k, whose parameter space is a (k - m) dimensional manifold in hi.
Here is an example.
Consider the parametrization of the three dimensional multinomial (N, π) family discussed following 1.8(6).
A mixed parametrization for t h i s
family involves
4 ζ
'
Z
2π
4
+
and Φ3
=
(h) log
Note that the range of (_ ) is ζ
2 2N}
h <
independent of the value of φ~ € (-«>, °°), as claimed by Theorem 3.9. Φo
=
For fixed
Z 0 l Φo the distributions of ( 7 ) form a 2 dimensional exponential family
(of order 1) having expectation parameter ( r ). ζ
(In the genetic interpretation
2
for this parametrization the parameter Φ^ measures the strength of selection in favor of the heterozygote character Gg.) On the other hand the family of distributions corresponding to fixed ( ζ ) is not so convenient. I t is the non-linear subfamily of the usual 2 f u l l standard family described by (1)
Θ = {θ :
2e
θl
+e
02
=
(ζ^^Σe^h
( I f one reduces the usual standard exponential family to a minimal family of
PARAMETRIZATIONS
81
dimension 2, then the parameter set becomes a smooth one-dimensional curve 2 w i t h i n R . This provides an example cof a curved exponential f a m i l y , as defined below.
See Exercise 3.11.2.)
DIFFERENTIABLE SUBFAMILIES 3.11
Description A differentiate
subfamily
i s a standard exponential family w i t h
parameter space Θ an m-dimensional d i f f e r e n t i a t e manifold i n N.
An
e s p e c i a l l y convenient s i t u a t i o n occurs when Θ i s a one-dimensional manifold -i . e . a d i f f e r e n t i a t e curve.
Such a family i s c a l l e d a curved
family.
i t i s often convenient to assume t h a t the
(A technical p o i n t :
exponential
parameter space i s smoother than being merely d i f f e r e n t ! a b l e -- f o r example, to assume i t possesses second d e r i v a t i v e s .
such an assumption implicit
Whenever convenient we consider
in the definition of a d i f f e r e n t i a t e subfamily,
writing formulae for relevant second or higher derivatives (as in (3) below) carries with i t the assumption that these derivatives exist.) In a d i f f e r e n t i a t e subfamily the parameter space can be written locally as {θ(t) : t e N} where N is a neighborhood in Rm and θ( ) is differentiable and one to one.
Properties of such a family around some
ΘQ € Θ can often be most conveniently studied after invoking Proposition 1.6 to rewrite the family in a more convenient form.
For example in a curved
exponential family m = 1 and the proper choice of ΦQ, zQ and M in that proposition transforms the problem into one in which θQ (1)
ξ(θ 0 )
= 0 = θ(t Q ) = Eθo(X) = 0 Z(θ n )
=
I
STATISTICAL EXPONENTIAL FAMILIES
82
Γ • ϊt
θ (
V
•
(2) a2b a 2 /p 0
θ(t Q ) =
0 (The value p = «> is possible.) Furthermore, one can linearly reparametrize the curve so that Θ Q = θ(0) (i.e. so that t Q = 0) and so that a = 1 and (2) becomes
(3)
θ(0)
=
1/p 0
ό In this form p is the radius of curvature of the curve θ(t) at t = 0. The value of 1/p is sometimes referred to as the statistical curvature of the family at Θ Q . Its magnitude is uniquely determined by the above reduction process. Alternately, in an arbitrary curved exponential family it has the formula
= (Bit
(4) where
A ' ΫfV
with θ = θ ( t Q ) ,
θ = θ(tQ),
Remark on Notation.
% = 2(θ(tQ)).
See Efron (1975).
The general functional notation θ( ) was introduced i n
3.7(1) as θ(x) = ξ" ( x ) .
We w i l l continue to use t h i s general notation i n
PARAMETRIZATIONS
83
contexts not involving s p e c i f i c d i f f e r e n t i a t e subfamilies.
In contexts
involving d i f f e r e n t ! a b l e subfamilies the notation θ( ) w i l l usually r e f e r to a ( l o c a l ) parametrization of the subfamily; i f so, t h i s f a c t w i l l be e x p l i c i t l y noted.
Although t h i s means that the \/ery convenient notation θ( ) can hence-
f o r t h have e i t h e r of two meanings we hope there w i l l be no confusion -simply remember that θ( ) is defined by 3.7(1) except where e x p l i c i t l y stated otherwise. 3.12
Example
Let Z have exponential density, " M z ) = e"z χ/ Q ^ ( z h relative to Lebesgue measure. Let T > 0 be a fixed constant. Let Y be the p
truncated variable Y = min (Z, T) and X(y) € R be (y, 0)
if
y
(y, l)
if
y =T
x(y) =
For λ € (0, o°) the distribution of X form a standard curved exponential family. The dominating measure v is composed of linear Lebesgue measure on the line ((0, T) x 0) plus a point mass on (T, 1). The parameter space for this family is (1)
0 = {θ € R 2 : θ χ = -λ, θ 2 = -In λ, λ € (0, °°)}
and (2)
ψ(θ) = log [^- ( e θ l T - 1) + e θ l
+ θ z
]
(The natural parameter space is R , since v has bounded support.) Figure 1 displays both Θ and K on a single plot.
84
STATISTICAL EXPONENTIAL FAMILIES
Figure 3,12(1): Θ and K for Example 3.12.
We return to this example in Chapter 5.
PARAMETRIZATIONS
85
EXERCISES 3.4.1 Let X ^ X 2 ,...,X n be a sample from a population with the inverse Gaussian distribution 3.4(1).
(i) Show that S = Σ X. also has an inverse i =l
Ί
2
tS Gaussian d i s t r i b u t i o n w i t h parameters θ , , n θ~ [Examine E(e ) . ] ( i i ) Show t h a t S and ( X T 1 - X " 1 ) a r e i n d e p e n d e n t , [ ( i ) shows t h a t 2 ( S , ^ - ) *υ Expf ( θ j , θ 2 ) . Now use Theorem 2 . 1 4 . ] 3.5.1
(Converse to Lemma 3 . 5 ) Let v € R k ,
α € R.
L e t K cz A/ be compact.
I f v ( H " + ( v , a))
= 0
then Ίim sup λ ( θ + p v ) / e p α θ€K
(1)
Also, i f
=
0
v ( H + ( v , α ) ) = 0 then l i m sup λ ( θ + ρ v ) / e p α Θ€K
(2)
<
(Be c a r e f u l , these r e s u l t s may be f a l s e i f K £ A/.) In particular, (3)
for θ e N
ψ ( θ + pv)
-> -~
as
p
i f and o n l y i f v ( H + ( v , 0 ) ) = 0 . 3.5.2
Let Z e K°. Let ε 1 = inf {| |x - Z| |: x ί K} > 0 . (1)
lim
Θ
Show
Z
- ' ) =
[Translate to the case where Z = 0, using 1.6(3) with φ Q = 0 , Z Q = Z. Then this result is a minor variation of 3.6(3), and could also have been used to establish 3.6(4).]
86
STATISTICAL EXPONENTIAL FAMILIES
3.6.1 Is the following assertion a v a l i d converse to Theorem 3.6: Let {p Q } be a minimal standard exponential family. homeomorphism i f and only i f {p Q } is steep. (?)
Then ξ : A/° -> K° is a
[ I f k = 1 t h i s is easy to
prove.] 3.7.1 D e f i n e the measure v on { ( x , , Xo) xλ = 0 , x 2 > 0 }
- 0 0 < Xi < °° »
x2 = 0 or
by
e
-|t|
v((A,
0)) = J c n ^ — r d t , A u l+tH
A d i - , - ) ,
v((0,
A))
= / e - t dt
A c (0, co)
v((R, 0 ) ) =
1
,
(1)
(i) Show the exponential family generated by v has N = {θ: -1 <_ θ- <_ 1, θ 2 < 1} and is not steep,
(ii) Show that ξ(M° ξ(M°) c K° = {x : x 2 ^ 0} and furthermore
that ξ(A/°) is not even convex.
[Show 1 .
for appropriate c, k.]
_J
2
2
See Efron (1978).
3.9.1 Prove the conclusion of Theorem 3.9 i f { p Λ } is minimal and steep.
u
[In
the proof of Theorem 3.9 l e t Φ2 € N° and show (using D e f i n i t i o n 3.2) that
v
is steep. For ease of proof assume (w.l.o.g.) that M = I.]
of of Theorem 3.9 let Φ 2 € N° and show (using Defini
Φ
3.11.1 Verify the formula 3.11(4) for the s t a t i s t i c a l curvature of a curved exponential
family.
PARAMETRIZATIONS
87
3.11.2
( i ) Verify 3.10(1).
( i i ) Reduce the three-dimensional multi-
nomial family to a two-dimensional minimal family and show that 3.10(1) now corresponds to a curved exponential family.
( i i i ) Fix ζ j and calculate the
statistical curvature of the resulting family as a function of the remaining parameter, φ~.
(iv) For what value(s) of ζ,,
Why? 3.11.3
Consider an m-dimensional d i f f e r e n t i a t e subfamily inside a k parameter exponential family.
Write a canonical form for this family analogous
to that in 4.14(1) - (3). [The case m = 1 required two canonical parameters -b,p
— in 4.14(3).
The general case requires
m + m(m + l)/2 parameters.]
3.12.1
Let {p } be a canonical k parameter exponential family. Ό
Let and {p
inf ίψ(θ):
θ e W} < C < sup ίψ(θ):
l e t 0 = {θ € A/°: ψ(θ) = C } . θ e hi] .
( i ) Show t h a t
θ £ W}
{ p Q : θ e 0 } can be c a l l e d a stratum { p Q : θ € 0 } i s a (k - 1) dimensional 1
θ
( i i ) Let θ
i s (k - 1) x 1 and θ / 2 ) i s l x l .
L e t θ ( t ) be any ( l o c a l ) p a r a m e t r i z a t i o n o f
Θ £ W with t e ί c
Rk"1. 8θ
(i)
ξ(1)(θ(t))
(2)^'
differen-
w n e r e
t i a b l e s u b f a m i l y o f {p : θ € N } .
{pΛ:
= (θn\»
Θ
of
(χ)
Then
m
(t)
-ill—
Bθ?(t) +
ξ(2)(θ
( t ) ) -£-
- o
.
J
(iii) Let θ° e 0 be any point with ζ(2)( θ °) ^ ° a
τhen on a
neighborhood of
θ° in 0 one may write θ / 2 \ s a function of θ / ^ — i.e. Θ / 2 N = θ ( 2 ) ^ θ ( i ) ^ "" and
88
STATISTICAL EXPONENTIAL FAMILIES
3.12.2 Show t h a t the d i s t r i b u t i o n s o f X described below can be represented as s t r a t a of canonical exponential f a m i l i e s (See 3.12.1 f o r d e f i n i t i o n . ) (1)
X ~N(Θ, I) ,
(ii)
||θ||
2
= C .
The d i s t r i b u t i o n s o f X - ( 0 , l ) with X defined in Example 3 . 1 2 .
(iii)
Let Y.., Y p , . . .
be i . i . d .
from a canonical regular
exponential f a m i l y , { p . } . Let N be any Markov stopping time {y:
N(y) = n} i s measurable with respect to Y 1 5 . . . , Y j .
Let
X = (S*., N) = (X/i\» X/o\)»
P.(N < °°) = 1 .
anc
(i.e.
Let S n =
n Σ Y. .
* consider only values of φ such t h a t
[ L e t θ = ( Φ - ψ ( θ ) ) where ψ(φ) i s the cumulant generating
function f o r the o r i g i n a l
family { p ώ ) ]
3.12.3 In 3.12.2 ( i i i ) show that 3.12.1(2) is identical to the following conclusion also derivable from the martingale stopping theorem:
(1)
E(SN)
=
E(Y) E(N) .
((S n - n E(Y) is a martingale and so (1) also follows from the stopping theorem applied to this martingale.) 3.12.4 (i) For the family in 3.12.1(i) show the statistical curvature is the constant l/fC.
(ii) Calculate the statistical curvature for the families
described in Example 3.12 and Exercise 3.12.1(ii). 3.12.5 A Poisson process on [0, 1] with intensity function ρ(t) _> 0 may be characterized by the property that the number of observations in any b interval (a, b) c [ 0 , 1] has P( / ρ ( t ) d t ) distribution, and the number of a observations in disjoint intervals are independent random variables.
Let
PARAMETRIZATIONS
89
T, <...< T γ denote the observations from a Poisson process on [0, 1], Suppose (1)
p(t) =
m Ία. Π p (t) i= l Ί
where p. > 0 are known (measurable) functions on [0, 1] and α. are unknown parameters.
Show that the distributions of (T-,...,T γ , Y) form a differen-
t i a t e subfamily of dimension m in an (m + 1) parameter exponential family. Identify the canonical statistics and observations for this family. family a stratum of the original family?
Is this
[The conditional distribution of
T,,...T γ given Y is that of an ordered sample of Y independent observations m α* from a distribution on [0, 1] with density proportional to Π p. ( t ) . ] i =l Ί 3.12.6 Let Z.. be independent i d e n t i c a l l y d i s t r i b u t e d variables with a power s e r i e s (1)
distribution: P ( Z . j = z)
L e t YQ .. YQ = 1 and d e f i n e Y p . ...
=
C(λ) h ( z ) λ Z ,
i n d u c t i v e l y as
Y. =
inductively as Y. =
z=0,l,.. ,
YΊ - 1 ΊΣ Z^.
.
Y ,
λ > 0.
Y,,...
is
Σ Z^. . YQQ , Y,,... i
J " ~ •!•
called the Galton-Watson process with generating d i s t r i b u t i o n ( 1 ) . 2 <^ n < °°.
Show that the d i s t r i b u t i o n s of YQ,
exponential
family with natural s t a t i s t i c s
Y-,...,Y
n-1 ( Σ Y.J
0
,
form a curved
n Σ Y.) J
0
Fix
and t h i s curved
exponential family is a stratum of the corresponding full exponential family.
CHAPTΈRΊ.
APPLICATIONS
This chapter describes three different general applications of the
theory developed so far.
The f i r s t part of the chapter contains a proof
of the information inequality and a proof based on this inequality of Karl in 1 theorem on admissibility of linear estimators. The second part of the chapter describes Stein's unbiased estimat of the risk and proves the minimaxity of the James-Stein estimator as a specific application of this unbiased estimate. The third part of the chapter describes generalized Bayes estimat and contains two principle theorems describing situations in which a l l admiss ble estimators are generalized Bayes -- or at least have a representation similar to that of a generalized Bayes procedure. deals with two basic situations.
This part of the chapter
The f i r s t is estimation of the natural
parameter under squared error loss, and the second is estimation of the expectation parameter under squared error loss.
The so-called conjugate pric
play a natural role in this second situation. The exercises at the end of the chapter contain a non-systematic selection of some of the specific results derivable from the more general development in the body of the chapter.
INFORMATION INEQUALITY The information inequality -- also known as the Cramer-Rao inequality -- is an easy consequence of Corollary 2.6. The version to be proved below applies to vector-valued as well a real-valued s t a t i s t i c s .
For vector-valued statistics one needs the multi-
90
APPLICATIONS variate Cauchy-Schwarz
91
inequality, as described in the following theorem.
If A,B are symmetric (mxm) matrices, write A :> B to mean that A - B is positive semi-definite. 4.1
Theorem Let Tj, T 2 be, respectively (£χ 1) and (mx
variables on some probability space. B
ll "'
E(τ
B
12 -1
E(τ
B
22
=
Let (λ X
i
*>
(l x m)
T i 2>
= E(T 2 V)
(m x m)
and suppose B,, exists and B 2 2 exists and is non-singular.
(1)
B n > B 1 2 B-J B 2 1
Remarks.
Then
.
If I = m = 1 this is the usual Cauchy-Schwarz inequality:
(2)
E 2 (T χ T 2 )
E(T*)E(T*) >
If B 2 2 is singular the inequality (1) remains true with generalized inverses in place of true inverses. See Exercise 4.1.1. If 4.1(1) is applied to the random vectors Tj - E(T ), T 2 - E(T 2 ) it yields the covariance form of the inequality: (3)
ϊn
Proof.
> %ι2 tZ2
Z21
Consider the ((£ + m) x 1) random vector / 11
<
B
/ Let W =( 0
" B 12 B 22v _Ί B 22
Then
1£\
E(U U') = ( 21
B
22
U = ( τ ) . Then '2
92
STATISTICAL EXPONENTIAL FAMILIES 0 < E(WUU'W') = W E(UU')W
- B12B22B21
/ 11
It follows that 0 <_ B
n
Lά LC LI
-B i 2 B 2 2 B 2 Γ
as desired
\
II
One further preparatory lemma is needed for the form of the information inequality which appears below. 4.2 Proposition Let {p Q } be a standard k-parameter exponential family. Let T be Ό
Q
a statistic taking values in R . Suppose Θ Q € N° and the covariance matrix £ β (T) of T exists at θ n . Then E Q (T) exists on a neighborhood of θ n . 0 g
U
0
U
(θ eW°(11T11) i n the notation of 2.6.) Proof.
For some ε > 0,
I|θ - θ o || < ε/2 .
(1)
| |θ - θ J | < ε
implies θ € N.
Let
Then, by the ordinary Cauchy-Schwarz inequality,
EΘ(||T||) = /||T(x)|| exp(θ =
/||T(x)|| exp((θ -ΘQ)
1
[/||T(x)|| 2 exp(θ 0
/ exp(2(θ - θ 0 )
x - φ(θ))v(dx) x - ψ(θ) + ψ(θ 0 )) exp(θ 0
x - ψ(θ Q ))v(dx)
x - ψ(θ Q )) v(dx)
x - 2ψ(θ) + 2ψ(θo))exp(θo
x - ψ(θQ))v(dx)]h
= Eg2 (I |T(x) I |2)[exp ψ(2(θ - θ Q ) + Θ Q ) - 2ψ(θ) + ψ(θ Q )]^
2 since Eθ (||T(x)|| ) < °° by assumption and since 2(θ - ΘQ) + ΘQ 6 W.
4.3
||
Setting The following version of the information inequality applies to
d i f f e r e n t i a t e exponential subfamilies, as defined at the end of Chapter 3.
APPLICATIONS
93
Let {p 0 : θ € 0} be such a family with Θ m-dimensional. m
Let θQ G 0.
For
k
N a neighborhood in R let θ : N + 0 c R , with θ(ρQ) = ΘQ be a parametrization of 0 in a neighborhood of ΘQ.
By definition Vθ(p) is the mxk matrix with
elements
(1)
^
3J7
The parametrization can always be chosen so that Vθ(p) is of rank m, and we assume this is so. Define the information matrix (2)
J(p Q )
J(p) at p = PQ by
= (Vθ(p o ))(2(θ o )(Vθ(p o ))
I f {p_} is a minimal exponential family then 2(θ n ) is non-singular, and so u
U
J(PQ) is then a positive definite mxm symmetric matrix.
The chain rule and
the basic differentiation formula 2.3(2) yield two alternate expressions for J; namely
(3)
1J (JίPnίίn
°
/3 log p θ , p v(X) d log p θ , Θ = θΛ Efl( ^ ^
The f i r s t expression of (3) i s , of course, the usual definition of J in contexts more general than d i f f e r e n t i a t e subfamilies. I f T is a statistic taking values in R let (4)
e(p)
= e τ (p)
= E θ ( p ) (T)
.
Suppose Θ Q e N°(||T||). Then E Θ (T) and its derivatives exists at Θ Q by Corollary 2.6. The chain rule then yields
94
STATISTICAL EXPONENTIAL FAMILIES
(5)
Ve(p Q ) = (Vθ(p Q ))(vE θo (T)) (The preceding formulation of course includes the case where
{pθ> is a full exponential family. Simply set p = θ so that θ(p) Ξ θ. In that case J(p Q ) = Z(θ Q ) and Ve(p Q ) = V E Q (T) .) 4.4 Theorem
(Information inequality) Let {p.: θ e 0} be a differentiate subfamily of a canonical Ό
exponential family with θ Q = θ(p Q ), as above. Let T be an ^-dimensional statistic. Suppose 2 (T) exists. Then e(ρ) = E , J T ) exists and is differentiable on a neighborhood of ρ Q , and the covariance matrix of T satisfies Z θ (T) > (ve(p 0 ))' J" 1 (p 0 )(ve(p 0 ))
(1) Proof.
θ Q £ W°(||T||) by Proposition 4 . 2 .
.
Now apply the Cauchy-Schwarz
i n e q u a l i t y 4 . 1 ( 1 ) with T, = T - EΩ (T) and 1 θ o (2) Then B n (3)
T2(X) = ^ ( T )
=
V In p θ ( p
}
(X)
=
( V θ ( p 0 ) ) (X - ξ ( θ Q ) )
.
,
B22
=
E(T2 T p
=
(Vθ(p0)) 2(θo)(Vθ(po))'
B12
=
E ( T 1 T£)
=
(Vθ(po))(vE
=
J(pQ) ,
and (4)
by 2 . 6 ( 3 ) and 4 . 3 ( 5 ) .
(T))
=
Ve(pQ)
The Cauchy-Schwarz i n e q u a l i t y says B ^ >_ B 1 2 B 2 2 B 2 1
which i s the same as ( 1 ) .
||
A useful f e a t u r e of the form of Theorem 4.4 is the absence of any r e g u l a r i t y condition on T other than the existence of la
θ
(T).
Many other
o
versions of the information inequality contain further assumptions about T (See e.g. Lehmann (1983, Theorem 7.3).) but these are superfluous here.
APPLICATIONS
95
An information inequality l i k e Theorem 4.4 is needed f o r applications of the following type. 4.5
Application
( K a r l i n ' s Theorem on A d m i s s i b i l i t y of Linear Estimates)
The information inequality can sometimes be used to prove admissibility.
In these situations other, more f l e x i b l e , proofs can also
be used, but the information inequality proof is nevertheless easy and revealing.
The following r e s u l t is due to Karlin (1958).
inequality proof,
The information
due to Ping (1964), is a generalization of the f i r s t
proof of t h i s sort in Hodges and Lehmann (1951).
See Lehmann (1983, p.271) f o r
f u r t h e r references and d e t a i l s of the proof. Theorem.
Let { p Λ } be a f u l l regular one-dimensional exponential family with u
N = (θ, θ),
-°° <_ θ < θ <_ °°.
under squared error loss.
Consider the problem of estimating ξ(θ) = EQ(X)
The r i s k of any (non-randomized) estimator δ i s
thus R(θ, δ) = E θ ((δ(x) - ξ ( θ ) ) 2 ) . (1)
δ
Then the linear estimator
(x)
=
αx + p
α,p
is admissible if 0 < α <_ 1 and if (2)
/ exp(-γθ + λψ(θ)) dθ
diverges at both θ and θ, where γ,λ are defined by α
(3) Proof.
= ΓTT'
β
= Γ^T
We consider here only the case p = 0 = γ . (See Exercise 4.5.1.)
Fix α. Let δ be any estimator with finite risk. Let b(θ) = E θ (δ(X)) - αξ(θ). The information inequality yields
(4)
R ( θ ,δ ) >
[(αζ(θ)
+
b
W y f
+ (ξ(θ)(l - α ) -b(θ))2
ξ'(θ) >
α2ξ'(θ)
+ 2αb'(θ)
+ (ξ(θ)(l
- α) -
b(θ))2
96
STATISTICAL EXPONENTIAL FAMILIES
since ξ(θ) = EΘ(X) and ξ'(θ) = J(θ) = Var θ X . For δ^Q 2
(5)
2
R(θ, δ α > 0 ) = Λ ' ( θ ) + (1 - α ) ξ (θ) .
Hence, if (6)
R(θ, δ) < R(θ, 6 n )
then 2b'(θ) - 2λξ(θ) b(θ) + (1 + λ) b 2 (θ) < 0
(7) Let
K(θ) = e λ ψ ( θ ) b(θ)
.
Then (7) becomes (8)
2K'(Θ) + (1 + λ) K 2 ( θ ) e λ ψ ( θ ) < 0
Now, let ΘQ € (a, b) and make the change of variables t(θ) = Θ J exp(λψ(t))dt. Correspondingly, define k(t) by k(t(θ)) = K(θ), so that (8) becomes (9)
2k'(t) + (1 + λ) k2(t) £ 0
where -°° < t < °° by ( 2 ) .
The only s o l u t i o n of (9) f o r t € (-°°, °°) is k = 0
since i n t e g r a t i o n of (9) shows that for t > t k'^t)
- k"1(t1)
k i s non-increasing and
>. (1 + λ ) ( t - t χ ) / 2
and hence k(tj) < 0 is impossible.
A similar inequality for t < t- shows
that k(t χ ) > 0 is also impossible.
It follows that (6) implies b Ξ 0 , which
in turn implies 6 = 6 Q (a.e.(v)) by completeness. This proves admissibility Ofδ
α,0 It is generally conjectured that the condition 4.5(2) is necessary
APPLICATIONS as well as sufficient for admissibility of 6 o. are known in this connection.
97 However only partial results
See Joshi (1969) and also Exercises 4.5.4,
4.5.5. 4.6 Further Developments It is useful in considering asymptotic theory to have available a few further results concerning the information
inequality.
These results are sketched below; the proofs are left for exercises. These results have nothing to do specifically with exponential families but only require a setting in which the information inequality is valid.
Nevertheless,
for precision assume below the setting of Theorem 4.4, and let S c R m denote a (possibly large) open set on which Σ Θ ( O ) ( T ) exists. For convenience we consider below only estimation of p under the quadratic type loss function (1)
L(p, δ) = (6 - p)' J(p)(ό - p) ,
and under a truncated version of this loss. (See (3) below.) For proof of the following assertions see Exercises 4.6.1 - 4.6.7 and Brown (1986). Let h be an absolutely continuous probability density on S , supported on a compact subset H c S , (2)
/ R(p, δ)h( P )dp
Then the expected risk satisfies
> m -/
Note that the right side of this inequality is independent of 6, and thus provides a lower bound for the Bayes risk under the prior density h. A natural truncation of the loss (1) is the function min(L(p, ό ) , K). Generalizations of the information inequality and of (2), like those to be described below, can be stated for this natural truncation; however the statements and proofs are easier under a different truncation which is equally useful in asymptotics.
This truncation will now be described.
Let K > 0. For v e R define
98
STATISTICAL EXPONENTIAL FAMILIES
vκ
-K = Ivl K
v < -K v £K v >K .
For v € R
d e f i n e v u t o be the v e c t o r w i t h c o o r d i n a t e s
i=l,....k.
Now l e t
i\
(v J . i\
L κ (p, δ) = (6 - p)£ J~l(p)(δ - p ) κ
(3)
= (v. )„
i
l
ι\
,
.
Let R κ denote the risk function corresponding to this truncated loss function. If δ is an estimator of p, let ό
(4)
(K)(χ; P)
=
P
κ
and b
(5)
(κ)^
=
E
θ^6(K)^X' P ) )" P
=
e
( K ) ^ p ^ "p
Let λ,(p) >: ... >_ λ (p) > 0 denote the ordered eigenvalues of J(p). Let α be any number satisfying 0 < α < 1. Then (6)
fl +
2
) R (p, ό )
(1 - α)λ m K 2 ^
^ >
α Tr(J(p)(ve(κ)(p))' J^ίpJίVe^jίp)))
(Note: values of p.
K
+ Tr(J(p)b(κ)(p)b|κ)(p))
Ve/^x exists except possibly f o r a countable number of
At these values i n t e r p r e t the r i g h t side of (6) as i t s lim sup;
or use r i g h t (or l e f t ) p a r t i a l derivatives i n place of Ve/^x, f o r these always e x i s t . ) This i n e q u a l i t y becomes more i n t e r e s t i n g as K gets large r e l a t i v e to
1/λ m , f o r then
α
can be chosen near 1 but so that T\—?ΓΊ72 ^l-α;λ m κ
is small,
The i n e q u a l i t y (6) leads to an i n e q u a l i t y concerning the Bayes r i s k j u s t as the usual information i n e q u a l i t y leads to ( 2 ) .
With h as i n (2)
APPLICATIONS
(7)
(l + ^
2 \ j R (p, ( 1 - α ) λ K2/ H K
6
99
)h(p)dp
H The above bound, unlike ( 6 ) , does not involve 6 (through e / ^ J . UNBIASED ESTIMATES OF THE RISK An unbiased estimate of the risk as a tool for proving inadmissib i ϋ t y of estimators f i r s t appears in Stein (1973), and has been widely exploited since then.
The basic technique is embarassingly simple.
It
involves merely an integration by parts which succeeds because of the term θ x e appearing in the exponential density. few of the easier applications.
Here we describe the method and a
For further (more complex) applications, see,
for example, Berger (1980b), Berger and Haff (1981), and Haff (1983).
Here
is the heart of the method. A function t : R •> R is called absolutely
continuous
if
t ( x , , . . , x j , is absolutely continuous in x.., i = l , . . . , k , when a l l Xj, j ^ i are held f i x e d . 4.7
Let t !
Theorem Let s : R •> R be absolutely continuous.
(1) (2)
/ |s(x)|e θ # x dx < θ
/|s'(x)|e '
x
dx
,
Assume
and
< »f
i = 1
k.
Then (3) Proof. (4)
ΘΊ / s ( x ) e θ ' x dx Set i = 1 for convenience. / |s(κ 1
=
-/ s ! ( x ) e θ ' x dx
For almost every (
x 2 , . . . , x k ) | e θ # x dx χ
<
and Pi
J ls Λxi>
x
\ ι ft X
?»
»Λi/ / l e
U Λ
i
^
-
100
STATISTICAL EXPONENTIAL FAMILIES
because of (1), (2). For any such (x 2 5 ...,x k ) integration by parts yields (6)
θjj s(x l s x 2 ,...,x k )e θ # x dx 1 = lim θ / s(x ,x9,...,x. )e
R
{ -/
fl
dx Ί
y
s^(x ls x 2 ,...,x k )e
Γ
fl
x11
dx 1 + s(x 1 ,x 2J ...,x |< )e
> B
-/ s (x 1 ,x 2 ,...,x k )e θ " x dx 1 + lim inf f [|s( s ( xX ir,x x 2 ,9 ,...,xJe . . . ,x k )evθ.χl *xj B-χ»
= -/ s^(x 1 ,x 2 ,...,x k )e θ # X dx by (2) and then (1). Integration over Xp»...,xk then yields (3).
||
The assumptions (1) and (2) are slightly more stringent than necessary, and also can be given alternate forms. For example the assumption (5) together with lim s(x Ί , x ? ,...,x.)e θ ' x = 0
(7)
,
for
X
£
Is,
almost eyery x 2 , . . . , x k implies ( 4 ) , and hence (3) when i = 1.
Or, f o r
example, when k = 1 a p o t e n t i a l l y useful r e s u l t is the equality
J°°θs(x)eθx dx = -f°s'(x)e θ x -s(0 + )
(8) for
a b s o l u t e l y continuous f u n c t i o n s s having / | s ' ( x ) | e
dx < °° and
Ay
lim s(x)e
= 0. However, the version of the theorem given above suffices
for the usual applications. Theorem 4.6 can be expressed in other forms which are more suggestive of its applications, as in the following two corollaries.
APPLICATIONS
101
4.8 Corollary Let p θ (x) be a p r o b a b i l i t y density on R
( r e l a t i v e to Lebesgue
measure) of the form (1)
P θ (x)
=
h(x) exp(θ
x - ψ(θ))
where h >^ 0 is absolutely continuous. Let t : R -> R be absolutely Let t.1 = - 3 1 .
continuous.
i
9
χ
Then
h! θ. E θ (t) = -Eθ((t! + ^ "
(2)
provided both expectations in (2) exist. k k Let t : R •* R be absolutely continuous. Then (3)
E
Θ
k where V
t =
Σ
8x.j
(Θ • t )
s
=
-EΘ(V
+ 2JL. t )
, provided that
(4)
i
and
E θ (| ^
t |) < - ,
1 =1
k.
(In expressions (2), (3), (4) and similar expressions below define ~ = 0 if h = 0 .) Proof.
For (2) note that ~
(th) = (tj + jp t)h and apply Theorem 4.7.
For (3) apply (2) with i=l,...,k and sum. Remarks. (5)
||
Expression (2) immediately yields ΘE Q (t) = -Eθ(Vt + t ^ )
provided the expectations e x i s t .
(3) can also be derived d i r e c t l y from Green's
theorem which implies (under suitable conditions) that
(6)
/s(x)(Ve θ#x )dx
= - /(V s(x))e θ ' x dx
102
STATISTICAL EXPONENTIAL FAMILIES It can also be worthwhile to apply Theorem 4.7 repeatedly, as in
the next proposition which is needed for Theorem 4.10. 4.9
Proposition Let p be as in Corollary 4.8. Assume that h! is also absolutely
continous, and that
and (2) (where h1.111. = -¥- h). Then 8x? (3) k
2
(where V h = Σ h1Ί1.1. ). Proof.
Apply Theorem 4.6 twice for each i=l,...,k and sum over i.
||
Combining the preceding results yields the following unbiased estimator of risk for squared error loss. 4.10 Theorem Let {puΩ } be an exponential family whose densities are of the form 4.8(1) with h satisfying 4.9(1), (2). Let 6: R k -> R k be any absolutely continuous estimator of θ. Suppose 2
(1)
and (2)
E θ (||δ|| )
<
h! Eθ(|δ! + ίpδl) < - ,
oo
1 = 1,....k
.
Then (3)
2
2
E θ (||δ-Θ|| ) = E θ (||δ|| - 2(V
δ +^
δ) + ^ )
APPLICATIONS
103
Note t h a t
Proof.
E θ ( | | δ - θ | | 2 ) = E θ ( | | δ | | 2 - 2Θ Now u s e 4 . 8 ( 3 ) a n d 4 . 9 ( 3 ) t o a r r i v e a t ( 3 ) .
Remarks.
δ+
||
The l e f t s i d e o f ( 3 ) i s t h e r i s k f u n c t i o n f o r squared e r r o r l o s s .
As p r e v i o u s l y , we f r e q u e n t l y use t h e n o t a t i o n R ( θ , δ) f o r a r i s k
function
when the l o s s f u n c t i o n ( h e r e ||δ - θ|| ) i s c l e a r from the c o n t e x t .
The
i n t e g r a n d o f the r i g h t s i d e o f (3) i s f r e e o f θ; hence t h i s i n t e g r a n d i s an unbiased e s t i m a t e o f R ( θ , δ ) .
For most a p p l i c a t i o n s o f (3) one a c t u a l l y needs
o n l y an unbiased e s t i m a t e o f R ( θ , δ j estimators.
- R ( θ , δp) where δ. and 6? are two g i v e n
I n t h a t c a s e , t h e term | | θ | | , l e a d i n g t o - r —
i n ( 3 ) , cancels.
Assumption 4 . 9 ( 2 ) i s t h e r e f o r e not needed t o a r r i v e a t an unbiased e s t i m a t e o f the
(4)
form
R(θ, δ χ )
4.11 Application
- R(θ, δ2)
=
E^MδJI
2
- | | δ 2 | | 2 + 2(V
(δχ - ό2)
(James-Stein estimator)
The neatest application of Theorem 4.10 is to prove the minimaxity of the James-Stein estimator for a multivariate normal mean.
(The
original result in James and Stein (1961) uses a different method of proof.) Let X be k-variate normal, k >^ 3, with mean ξ(θ) = θ and covariance I. Consider the problem of estimating ζ under squared error loss.
The usual
estimator δQ(x) = x is minimax.
However, when k :> 3 i t is not admissible.
(1)
=
δ(x)
( l ί ϋ M l i ) llxll2
where r is absolutely continuous, non-decreasing, and (2)
0 < r( )
< 2(k - 2)
Let
104
STATISTICAL EXPONENTIAL FAMILIES
Then (3)
R(θ, 6) £ R(θ, ό 0 ) = k
Strict inequality holds in (3) except when r Ξ 0 or when r Ξ 2(k - 2), as can be seen from (5) below. The normal density is of the form 4.8(1) and -r- = -x. With ό as in (1)
so that 4.10(4) yields (4)
R(θ, 6 0 ) - R(θ. 6) = E Θ (2V
IIXIΓ
IIXI I
(It remains to check the regularity conditions needed for 4.10(4), and these will be discussed below.) —-—~ = k"2 ? . Hence (4) yields llxll llxll
Observe that V (5)
R(θ, δ n ) - R(θ, δ) = E ^ " * 1 ' ) (2(k-2) - r(llXll)) + 2 Γ ' ( " x " ) ) υ ϋ iixir iixii
The unbiased estimator of the risk which appears on the right of (5) is nonnegative because of (2); hence (3) follows. The first estimator of James and Stein was of the form (1) with r = k - 2, which is the best possible constant value of r. However, a better estimator (as also noted by James and Stein) is +
(6)
δ (x) = (1 - ^ ^ ) llxll which corresponds to the choice r(t)
+
x
= min(t 2 , k-2)
See Exercise 4.11.1. See also Exercises 4.11.5, 4.17.5, and 4.17.6 for generalizations. (It is also of interest to note that in general if
APPLICATIONS
δ
i
=
δ
Oi
+
Ύ
i '
i = 1
k
" -> >
t h e n
4
1 0
105
4
( ) yields k
(7)
R(θ, δ 0 ) - R(θ, δ)
=
E
θ
[ Π ^ γ
Γ
γ { ]
The integrand is formally the same as the Cramer-Rao lower bound (in which b( ) replaces γ( ))
See 4.5(7) (with λ = 0) and Exercise 4.5.6.
Hence the
fact that the inequality
Σ 2 -£- γ. - γ? > 0
(8)
i =l
a x
Ί
i
Ί
"
has a non-trivial solution i f and only i f k _> 3 leads to the proof of the fact that 6Q(X) = x is inadmissible i f and only i f k _> 3.) The regularity conditions stated in Theorem 4.10 are not always satisfied by an estimator of the form (1). δ is not continuous at ||x|| = 0.) a supplementary argument: specified r( ) (9)
Justification of (4) therefore requires
suppose 6 is an estimator of the form (1) with a
Let 6 be the estimator with r( ) replaced by r e (||x||)
Then δ
( I f , for example, r(x) = k-2 then
= min(||x|| 2 /ε ,
r(||x||))
.
satisfieds the conditions of Theorem 4.10 so that (4) holds for
6 .
Passing to the l i m i t as ε Ψ 0 yields that (4) also holds for 6. There is a yjery extensive l i t e r a t u r e concerning the problem of estimating a multivariate normal mean.
For an introduction and some references
consult Lehmann (1983, Chapter 4). 4.12
Remark For discrete exponential families there is an analog of the
unbiased estimates in 4.8 and 4.10 which involves difference operators instead of partial derivatives.
These results are based on the deceptively simple
equality oo
(1)
Σ λh(x)λX x=0
oo
=
Σ h(x - l ) λ X x=l
106
STATISTICAL EXPONENTIAL FAMILIES
They have been particularly useful for certain problems involving Poisson or negative binomial variables. See Hudson (1978), Hwang (1982), and Ghosh, Hwang, and Tsui (1983) for some theory and applications.
GENERALIZED BAYES ESTIMATORS OF CANONICAL PARAMETERS We first define the concept of a generalized Bayes estimator in the current context and state some foundational results. Then we discuss estimation of the canonical parameter of an exponential family. Later in this chapter we discuss estimation of the expectation parameter, including the topic of conjugate priors for exponential families. 4.13
Definition Let {p Q : θ € 0} be an exponential family of densities. Let Ό
ζ: Θ -> R be measurable. Let G be a non-negative (σ-finite) measure on Θ, locally finite at every θ € Θ. G is called a prior measure on Θ. Let S c R . Then 6: S -> R is generalized Bayes on S (for estimating ζ under squared error loss) if / ζ(θ)pft(x)G(dθ) (1) ό(x) = , x € S , / P θ (x)G(dθ) where both numerator and denominator exist for all x € S. We say δ is generalized Bayes if it is generalized Bayes on S where v(S C ) = 0. We will use the symbol δ β to denote the generalized Bayes procedure for G, when this exists. If the loss is squared error loss -(2)
L(θ, a) =
11 a - ζ(θ)|| 2
for estimating ζ(θ) and if the Bayes risk, (3)
B(G) = inf B(G, 6') = inf / R(θ, δ 1 )G(dθ) δ ό1 = inf/E fl(L(θ, δ'(X))G(dθ), δ1 θ
APPLICATIONS
107
satisfies B(G) < ». Then by Fubini's theorem any Bayes estimator for G (i.e. one which minimizes B(G, 6)) must also be generalized Bayes for G. One of the topics in which we shall be interested below is that of characterizing complete classes of procedures under squared error loss (2). Since L is strictly convex the nonrandomized procedures are a complete class. The following theorem is our main tool for proving complete class theorems. (In the current context a complete class is a set of procedures which contains all admissible procedures.) 4.14
Theorem With {p Q } and L as above ewery admissible procedure must be a
limit of Bayes estimators for priors with finite support. More precisely, to eyery admissible procedure corresponds a sequence G. of prior distributions supported on a finite set (and hence having finite Bayes risk) such that (1)
6 G β (x) - 6(x)
a.e.(v)
where (as above) δn denotes the Bayes estimator for G.. Proof.
This theorem is apparently "well known". Its proof is outside the
intended scope of our manuscript. However, I do not know any adequate published reference for it, so a proof is given in the appendix to the monograph.
See Theorem A12. Theorems 3.18 and 3.19 of Wald (1950) come close
to the above theorem as do some comments in Sacks (1963) and in Le Cam (1955).
II We now concentrate on estimation of the canonical parameter.
In
this case generalized Bayes estimators have a particularly convenient form, as described in the next theorem. 4.15 Theorem Let {p f l } be a canonical exponential family and l e t G be a prior measure on Θ for which the generalized Bayes procedure, δG for estimating θ
108
STATISTICAL EXPONENTIAL FAMILIES
exists. Define the measure H by H(dθ) = e " ψ ( θ ) G(dθ)
(1)
θ x and (as usual) l e t λ..(x) = / e H(dθ) denote i t s Laplace transform.
Then δ fi
satisfies
(2)
δ G (x)
=
V In λ H (x)
( I f v(8K) = 0 then, of course,
=
VψH(x) ,
x e Γ .
(2) completely defines δQ since
v((K°)Cmp)
= v(3fC) = 0.)
Proof.
By d e f i n i t i o n the generalized Bayes procedure i s
(3)
δG(x) G
=
/ θ e θ ' x H(dθ) g / e θ x H(dθ)
a.e. (v)
By assumption the i n t e g r a l s on the r i g h t o f ( 3 ) e x i s t a . e . ( v ) ; hence N H 3 K° . the
The denominator e x i s t s on WH, by d e f i n i t i o n , and by Theorem 2 . 2 ,
numerator e x i s t s on N° and i s given by V λ u ( x ) . π π
This proves ( 2 ) .
II
If δ is only generalized Bayes on S c K relative to G one clearly has an analogous representation of δ on S°, namely (4)
δ(x)
= Vψ H (x) ,
x € S° .
An interesting special consequence of the above is that if k = 1, and |δ(x) - x| is bounded, and λδ(x) is generalized Bayes on K° for 0 < λ £ 1 then δ(x) = x + b. See Meeden (1976). The foundation for the following major theorem has been laid above and in Section 2.17. The first theorem of this type was proved by J. Sacks (1963) for dimension k = 1. Indeed Sacks claimed, but did not prove, validity of the result for arbitrary dimension. Brown (1971) proved the result for arbitrary dimensions when {p Q } is a normal location family; and that Ό
proof was extended to arbitrary exponential families by Berger and Srinivasan
APPLICATIONS (1978).
109
The proof below follows Brown and Berger-Srinivasan.
The proof of
Theorem 4.24 is somewhat more l i k e Sacks' original proof. 4.16
Theorem
Let {p Q } be a canonical k parameter exponential family. Then 6 is admissible under squared error loss for estimating θ only if there is a measure H on θ c W such that
(1)
/ θ e θ # x H(dθ) 6(x) = Q-^ = Vψ H (x) , / e H(dθ)
Remarks.
for
x e K°
a.e.(v)
.
The expression (1) i m p l i c i t l y includes the condition N,, 3 K°, so
that both numerator and denominator in (1) are well defined for a l l x € K°. I f H(Θ - Θ) = 0 so that 0 = § c W 5 then one may define (2)
= e ψ ( θ ) H(dθ)
G(dθ)
and rewrite (1) as / θp f l (x)G(dθ) (3)
6(x)
9
=
/
,
x € K°
.
Pθ(x)G(dθ)
Thus 6 is generalized Bayes on K° relative to G. This observation leads to Corollary 4.17 and to further remarks which appear after the corollary. Proof.
Let 6 be admissible. By Theorem 4.14 there is a sequence of prior
measures G., having finite support, such that $ G (x) -> δ G (x) a.e.(v). Let x Q € K° such that 6 Q (xQ) •+ ό(x Q ). Since G 1 has finite support ; e
θ Xo"Ψ(θ)
(2)
fi.(dθ)
=
This is a normalized version of 4 . 1 5 ( 1 ) , so, l e t t i n g ψ. = ψr;
,
110
STATISTICAL EXPONENTIAL FAMILIES
(3)
δ G (x) = Vψ^x)
Since / e
H.(dθ) = 1 we assume w i t h o u t loss o f g e n e r a l i t y the existence o f a
l i m i t i n g measure H, f o r which H. -> H weak*.
(Apply 2 . 1 6 ( i v ) to the measure
e X o # θ H i t o get e X o ' θ ί^ -> H*, say, and l e t H = e " X o # θ H* such t h a t 4.14(1) holds a t x 1 .
Then thbre i s a f i n i t e set S c K ° such t h a t
4.14(1) holds on S and such t h a t B = conhull S s a t i s f i e s x1 e B ° . (4)
Let x e S.
ΨΊ (x) - ψ . ( x 0 )
=
J 1 (x - x 0 )
V Φ ^ X Q + p(x - x o ) ) d p
Vψ^x) i
||x - XQ11
by C o r o l l a r y 2 . 5 . ( N o t e t h a t Ψ ^ X Q ) Ξ 0 . ) I t f o l l o w s Ί i m sup s u p ψ . ( x )
i-**>
xQ € B ° ,
Then
<. ( x - x Q )
(5)
Let x 1 € K°
.)
x€S Ί
=
sup | | ό ( x ) | |
Mδ^x)!!
that
||x-xn||
<
°°
u
x€S
This is the principle assumption of Theorem 2.17, which now implies the existence of a subsequence H., and a limiting measure, which must be H, such that ψΊ.(x) + Ψ H (x),
x € B°, and also Vψ Ί (x) ^ Vψ H (x),
x € B°, by 2.17(5).
Since vψ.(x') = ό.(x') ->• ό(x') we have (4)
δ(x') = Vψ H (x')
This proves (1) since x1 is an arbitrary point of and since 4.14(1) is satisfied 4.17
a.e.(v).
K° satisfying 4.14(1),
||
Corollary Suppose Θ is closed in R and
(1)
v(3K)
= 0
Then the generalized Bayes procedures form a complete class.
APPLICATIONS Proof.
HI
As noted the admissible procedures are a (minimal) complete class.
If 6 is admissible then for some prior measure H on Θ = Θ
δ(x) = 'θ e Π H W
(2)
a.e.(v)
/ e θ # x H(dθ) by 4.16(1) and (1), above. Let G(dθ) = e ψ ( θ ) H(dθ) as in 4.16(2) to get the desired representation, (3) Remarks.
δ(x) =
θpA(x)G(dθ) 2 ( Pθ(x)G(dθ)
If v is dominated by Lebesgue measure then (1) holds since the
Lebesgue measure of the boundary of any convex subset of R is zero. (To see this note that if C is bounded and convex with 0 € intC then 9C = n [(1 + Ί|)C - (1 -Ί j)C] = Π ΊC. , say, where (as usual) i=l i=l aC = {x: By € C, x = ay}. See e.g. Rockafeller (1970). Then / dx = a/dx aC C so that / dx = lim rf dx = lim(^j-)/ dx = 0. If C is unbounded apply the άΛ 8C S" C result for bounded C to C n {x: llxll < b} and let b -> «>.) If v{dK) f 0 then there are, in general, admissible procedures which are not generalized Bayes. See Exercise 4.17.1. Similarly, if Θ is not closed in R there will again be admissible procedures which are not generalized Bayes, even when v(9K) = 0. See Exercise 4.17.2. When Θ = W and the exponential family is regular then Θ is closed if and only if H = R . Hence when Θ + R one cannot assert that all admissible procedures are generalized Bayes. However, the representation 4.16(1) remains valid. This representation is qualitatively similar to a generalized Bayes representation and is generally as useful as one. Not all estimators which can be represented in the form 4.17(3) or 4.16(1) are admissible. In fact, many are not. Nevertheless, representations of this form are valuable stepping-off points for general admissibility
112
STATISTICAL EXPONENTIAL FAMILIES
proofs. See Brown (1971, 1979). The most conspicuous example of an inadmissible generalized Bayes estimator occurs in the problem of estimating a multivariate normal mean already discussed in 4.11. The usual estimator ό(x) = x is generalized Bayes, but when k >^ 3 it is not admissible. When k >_ 3 the positive part JamesStein estimator, defined in 4.11(6), dominates δ(x) = x. However, the positive part James-Stein estimator cannot be generalized Bayes
(see Example 2.9);
hence is itself inadmissible. So far as I know the problem of finding an (admissible) estimator which dominates 4.11(6) remains open. However, theoretical and numerical evidence indicates that such an estimator cannot have a much smaller risk at any parameter point; hence 4.11(6) remains one of the many reasonable alternatives to ό(x) = x when k >_ 3. (See e.g. Berger (1982).)
GENERALIZED BAYES ESTIMATORS OF EXPECTATION PARAMETERS CONJUGATE PRIORS The statistical problem of estimating the expectation parameter ξ(θ), is more often of interest than that considered previously, of estimating the natural parameter.
(Of course for normal location families the two problems
are identical.) In this case, too, there is a representation theorem for generalized Bayes procedures and a complete class theorem based on a representation similar to that of generalized Bayes. (In some (not fully developed) sense the generalized Bayes representation available here is dual to that in the preceding section -- the differentiation operator is with respect to θ and appears inside the integral sign instead of being with respect to x and appearing outside it.) Both these main results are somewhat more limited than those for estimating θ; but are nevertheless useful. A new feature of considerable statistical interest appears here. The linear estimators are (generalized) Bayes for the conjugate (generalized) priors. This result is presented first; the conjugate priors are defined in
APPLICATIONS
113
4.18 and the existence and linearity of their (generalized) Bayes procedures is proved in Theorem 4.19. 4.18
Definition Prior measures having densities relative to Lebesgue measure of
the form g(θ) = C e θ ' Ύ - λ ψ ( θ )
(1)
γ € Rk ,
λ >0
are called conjugate prior measures. Note that if the prior is of the form (1) then the posterior distribution, calculating formally, has the same general form, with new parameters γ + x and λ + 1. For a sample of size n the n parameters become γ + s n = γ + Σ x and λ + n. (Note in (1) that g = 0 if n
1= 1
Ί
θ ί W since then ψ(θ) = °° .) Arguments resembling those in the following proof show that the conjugate prior measure is f i n i t e , and hence can be normalized to be a prior probability distribution i f and only i f (2)
λ > 0
and
γ/λ
€ K°
See Exercise 4.18.1. For estimating ζ(θ) = E_(X), under squared error loss, the Bayes u procedures for conjugate priors are linear in x. This fact (often under extraneous regularity conditions) has been known for decades. See, for example, De Groot (1970, Chapter 9) and Raiffa and Schlaiffer (1961). The following precise statement and its converse first appeared in Diaconis and Ylvisaker (1979). 4.19
(See Exercise 4.19.1 for a statement of the converse.)
Theorem Let {p Q } be a regular canonical exponential family and let g(θ) be θ
a conjugate prior density as defined by 4.18(1). Then the generalized Bayes procedure for estimating ξ(θ) exists on the set
114
STATISTICAL EXPONENTIAL FAMILIES
(1)
S = {x : δ(x) = ^ ^ € λ +1
and has the l i n e a r form
(2)
+
δ(x) = J-J-J
=
γ?Γϊ
αx +
P
I f v(S C ) = 0 then δ i s generalized Bayes.
Remarks.
occurs for γ = 0, v(8K) = 0.
λ > 0.
I t occurs f o r γ = 0,
x
>
€ S
I f 0 € K t h i s always
λ = 0 i f (and only i f )
I t can occur for other values of γ,λ as w e l l . I f x ί S then the generalized Bayes procedure does not e x i s t at x
since /
θ e
*x~ψ^
g(θ)dθ = °°.
See Exercise 4 . 1 9 . 1 .
For the r e l a t i o n between the condition that v(S c ) = 0, so that 6 is generalized Bayes, and K a r l i n ' s c o n d i t i o n , 4 . 5 ( 2 ) , see Exercise 4.19.2. Proof.
Let x € S.
The generalized Bayes procedure at x, i f i t
exists,
has the form
(3)
δ ( χ )
/ (Vφ(θ)) exp((x+γ) - θ - (λ+l)ψ(θ))dθ / exp((x+γ) θ - (λ+l)ψ(θ))dθ
=
because of the form of g and of p Q , and because ξ(θ) = Vψ(θ) on W and g(θ) = 0 for θ (. hi. I f the integrals i n the numerator and denominator of (3) e x i s t then Green's theorem i n the form of 4.7(3) y i e l d s (4)
(x + γ ) / exp((x + γ)
=
θ - (λ + l)ψ(θ))dθ
(λ + 1) / (Vψ(θ)) exp((x + γ)
Rearranging terms in (4) y i e l d s ( 2 ) .
I t remains only to v e r i f y that the
numerator and denominator of (3) e x i s t . L e t z = £J2. . Hence
θ - (λ + l)ψ(θ))dθ
z £ K ° s i n c e x € S.
APPLICATIONS
(5)
θ
liminf
^ > -
IIΘMHOO
θ
'
z
115
> 0
llθll
by 3.5.2(1) (or by 3.6(3) and t r a n s l a t i o n of the o r i g i n ) .
I t follows that
f o r some ε > 0 (6)
exp ((x + γ)
θ - (λ + l)ψ(θ))
=
0(e"
ε||θ|1
)
This proves existence of the integral i n the denominator of ( 3 ) . Now consider ξ, = -r^1
let
ξ 1 (θ) = 0 i f θ f. M.
i n θ 1 for θ € W. for θ 1 < q
(7)
and
du -j
on hi.
For s i m p l i c i t y of notation
Fix Θ 2 , . . . , θ k .
ξ j t θ ^ θ g . . . . »θ k ) is monotone
Thus f o r some q = q ( θ 2 , . . . , θ k ) € R , ξJθyθ^,... ξ j ( θ j , θ 2 , . . . ,θk) ^ 0
f o r θ^ > q.
/ | ξ 1 ( θ r θ 2 , . . . , θ | < ) | exp((x+γ)
below,
, θ k ) <_ 0
Hence
θ - (λ+l)ψ(θ))dθ 1
q l i m / - ξ 1 ( θ 1 , θ ? , . . . , θ j exp((x+γ) B*x> B
θ - (λ+l)ψ(θ))dθ 1
B + lim / ξ 1 ( θ 1 , θ ? , . . . , θ | f ) exp((x+γ) . θ - (λ+l)ψ(θ))dθ Ί l d κ L B-^> q The function
e x p ( - ( λ + l ) ψ ( θ i s θ 2 , . . . , θ k ) ) is absolutely continuous i n θ,
since {p θ > is regular.
( I f { p 0 } were not regular there could be a discon-
t i n u i t y at the boundary of W.)
Let θ = ( q ( θ 2 . . . . , θ k ) , Θ 2 , . . . , θ k ) .
Ordinary
i n t e g r a t i o n by parts y i e l d s (8)
q l i m / - ξ 1 ( θ 1 , θ 2 . . . . . θ k ) exp((x+ γ ) . θ - ( λ + l ) ψ ( θ ) ) d θ 1 Bκ B =
lim j-tXi+γj)
B-χ»
I
/ exp((x+γ)
-B
θ - (λ+l)ψ(θ))dθ 1 q
+ [exp((x+γ)
θ - (λ+l)ψ(θ))]
Ί
>
q = - ( X I + Ύ J / exp((x+γ) * θ - (λ+l)ψ(θ))dθ, + exp((x+γ) θ l i '
*•
q
- (λ+l)ψ(θ )) q
116
STATISTICAL EXPONENTIAL FAMILIES
by (6). Note that (again by (6)) (9)
k ? Θ Q - (λ+l)ψ(θ )) = 0(exp(-ε Σ θί)) q M 3=2 J
exp((x+γ)
Reasoning similarly for the second integral on the right of (7), integrating both integrals over θp»...,θ. , and using (9) yields (10)
/.κ |ζ Ίi(θ) I exp((x+γ) R
Finally, the identical reasoning on ζ., ) exp((x+γ)
θ - (λ+l)φ(θ))dθ
< -
i = l , 2 , . . . , k , shows that θ - (λ+l)ψ(θ))dθ < -
which verifies that the numerator of (3) exists. As noted previously, this completes the proof.
||
4.20 Application For a given k-parameter exponential family {p Q } the conjugate prior distributions, {g
} , say, form a (k+1)-parameter exponential family
with canonical statistics Θ 1S ...,Θ., -ψ(θ). This (k+1)-parameter family is minimal except when ψ(θ) is a linear function of θ. This linearity occurs when p n is the Γ(α, σ) family with known σ, and in certain multivariate u generalizations of this univariate example. Many familiar exponential families are the conjugate families of prior distributions for other familiar exponential families of distributions. (Conjugate prior measures which are not finite then appear as limits of these distributions.) For example, the N(γ, λ I) distributions are conjugate to the N(μ, I) family. The proper conjugate prior distributions for the Γ(α, TZQJ) family (α known, θ < 0) are those of -Θ where Θ ^ Γ(λα, - γ ) , γ < 0, λ > 0. The proper conjugate priors for the P(e ) family have density (i)
g Y λ (θ) = e γ θ - λ e ,
γ < o,
λ >o
with respect to Lebesgue measure on (-«>, °°). Thus the density of ξ = e is
APPLICATIONS
117
Γ(-γ, 1/λ). See also Exercise 5.6.3. The basic representation theorem for generalized Bayes procedures is a simple consequence of Green's Theorem 4.7(3), and is an obvious extension of 4.19(4) in the proof of Theorem 4.19. The regularity conditions in the following statement may be modified as noted in the remark following the theorem, 4.21 Theorem Let {pθ> be a regular canonical exponential family and let G be a prior measure on Θ. Suppose G has a density, g, with respect to Lebesgue measure. Suppose g(θ)e~^ ' is absolutely continuous on R . Assume for x e S (1)
/ e θ χ -Ψ( θ > g(θ)dθ
(2)
/ ||vg(θ)|| e θ ' χ - ψ ( θ ) dθ <
and / ||Vψ(θ)|| g ( θ ) e θ * χ - ψ ( θ ) dθ
(3)
Then the generalized Bayes procedure, 6, for estimating ξ(θ) under squared error loss, exists on S and is given by the formula (9())
Remarks.
de
c
If v(S ) = 0 then, of course, the unrestricted generalized Bayes
procedure exists and is given by (4). Conditions (1) and (2) are of course necessary for the representation (4) to make sense. Condition (3) is necessary in order that the generalized Bayes estimator be well defined. However it can often be deduced as a consequence of (2) and so then need not be checked directly. Suppose (5) for some function h(θ) satisfying
118
STATISTICAL EXPONENTIAL FAMILIES tk~l
h(t)dt
Then (1) is satisfied, and condition (2) implies condition (3). See Exercise 4.21.1. The representation (4) is exploited in Brown and Hwang (1982) as the starting point for a proof of admissibility of generalized Bayes estimators under certain (important) extra regularity conditions. Proof.
Conditions (1), (2), and (3) justify use of the integration by
parts formula 4.7(3), which yields (6)
/ x(g(θ)e" ψ ( θ ) )e θ ' x dθ = / (-Vg(θ) + g(θ)Vψ(θ))e θ ' x " ψ ( θ ) dθ
Rearranging terms (each of which exists by (1), (2), (3)) yields (4).
||
We now turn to the complete class theorem comparable to Theorem 4.16. The result proved below applies only to one parameter exponential families. It appears to us that there exists a satisfactory multiparameter analog of this result which, however, is somewhat more complex to state (and to prove). We hope to present this multiparameter extension in a future manuscript. As with Theorem 4.16 the representation of admissible procedures involves a ratio of integral expressions similar to the formula for a generalized Bayes estimator. Again, under certain additional conditions, this representation reduces to precisely that of a generalized Bayes procedure. A new complication appears in the integral representation below. only on an interval I. whose definition involves ό( ) itself.
It applies (See 4.24(1).)
However, as explained in the remarks following the theorem, the values of δ(x) for x ί ί are uniquely specified by monotonicity considerations. Hence the theorem actually describes exactly the values of ό(x) except for at most two points -- the endpoints of I. . In this sense the complication presented by the presence of I. is just a minor nuisance.
APPLICATIONS
119
We begin with a technical lemmά. 4.22
Lemma Let v
be a sequence of probability measures on R .
Suppose for
some ζ > 0 (1)
lim i n f v n ( { x > K})
>
ζ
> 0
n+ o°
f o r a l l K < oo.
Let ε > 0.
Suppose λ
n=l,...
.
Then
e ε x vn(dx)
ί (2)
( ε ) < °° ,
v
lim n-**>
λ
(ε) n
for all K < «>. Remarks.
The negation of (1) is the condition
(3)
l i m l i m i n f v n ( { | x | > K})
K**>
n
n*»
= 0
This is the usual necessary and s u f f i c i e n t condition for there to exist a subsequence n 1 and a non-zero limiting measure v such that v , •> v. The conclusion (2) can be paraphrased by saying that the sequence of probability measures
e ε x v (dx)/λ
( ε ) sends a l l i t s mass out to -κ». n
Proof.
Let K < oo, 1 < m < «.
/" e
(4)
I
ε x
v(dx)
/ e ε x v n (dx)
/" e
Then
ε x
v(dx)
2 f
/ e e x v n (dx)
—oo
—oo
>
e
e ( m
Now l e t n •»• » and m -»• °° t o f i n d
"
1 ) κ
v n ( { x > mK})
.
120
STATISTICAL EXPONENTIAL FAMILIES
/ e ε x v n (dx) (5)
lim inf
which proves ( 2 ) . 4.23
K
f e ε x v n (dx)
||
Theorem
Let {p Q : θ € Θ} be a regular exponential family on R . Consider u
the problem of estimating the expectation parameter, ξ ( θ ) , under squared e r r o r loss.
Let 6 be an admissible estimator.
function. (1)
Then,
6( ) must be a non-decreasing
Let
I 6 = {x: v ( { y : y >x, ό(y) € K°}) > 0
and v ( { y : y < x, ό(y) € K°}) > 0 } .
Then there exists a f i n i t e measure V on Θ such t h a t f o r a l l x G I
(2)
Remarks.
δ(x)
=
r ζ(θ) θx J li + \l ζr( U )\e θ ) l
In (2) the functions ..
*
|1/Q\ , and Λ
,Γ/Q\. have the obvious
i n t e r p r e t a t i o n on the boundary of M. ( I n other words, i f N = ( a , b) then
= -1 , etc., since
lim ξ(θ) = » , lim ξ(θ) = — . )
By monotonicity of 6, I must be an open i n t e r v a l . -oo £ i < T £ oo.
Suppose K° = ( k , E ) , -» £ k < k <_ ».
Say I = ( i , T ) ,
Then k £ i
( ΐ <^ k,
respectively) and, by monotonicity and the d e f i n i t i o n of I , ό(x) = k f o r k <_ x < i
(ό(x) = R for
T < x <_ k ) .
For i < x < ϊ ,
ό ( x ) i s defined by ( 2 ) .
Thus, the theorem f a i l s to define 6 ( x ) only f o r x = i i f -°° < k < i
or f o r
x = k i f - o o < k = i , and, i f k < «>, f o r x = T or k depending on whether T < R or T = k.
I f v, the dominating measure f o r { p A } , is continuous then these
APPLICATIONS
121
two points have measure 0 and the theorem completely describes δ. Similarly, if K = (-», <*>) then irrespective of v the theorem completely describes 6 If V(N - N) = V(ίa, b}) = 0 then (2) can be rewritten as (3)
Jξ(θ) Pfl(x)G(dθ) § , ί Pθ(x)G(dθ)
δ(x) =
x€Γ
,
where β
ψ(θ)
Thus, δ is then generalized Bayes on T in the ordinary sense. of course, occur i f W = R .)
(This must,
When W f R there may exist admissible procedures
having representation (2) but not (3).
See Exercise 4.24
Finally, note as with Theorem 4.16 that there are many inadmissible procedures satisfying (2). Proof.
See for example Exercise 4.5.4.
I f G is a prior density then the Bayes procedure (assuming i t is
well defined for x € K) is given by the formula
m
* t^
-
f ζ ( θ ) e θ x ' ψ ( θ ) G(dθ)
_
/ ζ ( θ ) e θ x H(dθ)
/eθxH(dθ) where H(dθ) = ce" ψ ^ θ ^ G(dθ). x
θx
e //e H(dθ)
ζ(θ) is monotone on W.
The family of densities
is an exponential family (with parameter x) relative to the
dominating measure H.
In particular, i t has monotone likelihood ratio.
Hence, ό r is monotone non-decreasing by Corollary 2.22.
(6^ is actually
s t r i c t l y increasing unless G is concentrated on a single point.)
All admissi-
ble procedures are (a.e.(v)) limits of Bayes procedures by Theorem 4.14, and limits of monotone functions are monotone. must be monotone non-decreasing.
Hence a l l admissible procedures
(A different proof of a better result is
contained in Brown, Cohen and Strawderman (1976).) Let 6 be admissible and l e t όp Theorem 4.14 having δp
+6
a.e.(v).
be the sequence promised in i Since all δp are monotone non-
122
STATISTICAL EXPONENTIAL FAMILIES
decreasing t h e r e i s no loss o f g e n e r a l i t y i n assuming δg ( x ) -*• 6 ( x ) f o r a l l n x f Γ , and we do so below. Assume 0 e l c Γ ,
(5)
V (dθ)
Define the p r o b a b i l i t y measures,
(1 + | ξ ( θ ) l ) e " ψ ( θ ) G ( d θ ) ^ ^ / ( I + I ξ ( θ ) l ) e " ψ ( θ ) 6n(dθ)
=
n
Let
ε > 0 such t h a t ε € K°.
Then
εθ
(6)
δG(ε)
J
-
e
1 +
V (dθ)
^ Vn(dθ)
i Λ Suppose for some ζ > 0 (7)
Let
Tim inf V ({θ > K}) > ζ > 0
for all
K<
ΘQ be the unique value such t h a t ξ ( θ Q ) = 0 , and l e t K > Θ Q .
1 + iε(θ) 1
1<s
i n c r e a s i n
9
f o r
θ
>
θ
o
The f u n c t i o n
A
PP"ly Lemma 4.23 to get
J e ε θ Vn(dθ)
ξ(K)
Similarly, ,+ tela) ι
1S
decreasing for θ > Θ Q so that
J e ^ V n (d θ ) 1 + ξ(K)
Substitute
( 8 ) and ( 9 ) i n t o
•
t h e f o r m u l a , ( 6 ) , f o r or
( ε ) and l e t K -*• °° t o n
find
APPLICATIONS (10)
123
δ(ε) = lim δ Q (ε) >_ lim ξ(K) . n-*» n κ-χ»
This holds for all e > 0 with ε e K°. It follows from (10) that O i l , contrary to assumption.
Hence (7) must be false. A symmetric argument shows
that lim inf V ({θ < -K}) > ζ > 0 is also impossible. n-*» (11)
Hence
lim lim inf V ({|θl > K}) = 0 .
By translating the origin the same argument can be applied at any x € I c K°. The conclusion is that x e I implies e θ x V n (dθ) (12)
lim lim inf
This i s s l i g h t l y more than i s needed to apply the c o r o l l a r y o f Theorem 2.17 stated i n Exercise 2 . 1 7 . 2 . x interchanged so t h a t P n
( ( 1 2 ) implies 2 . 1 7 . 2 ( 2 ) with the roles o f v
(dθ) = eθxVn(dθ)/ eθxVn(dθ).)
Π j X
fI
θ and
The conclusion of
Π
this exercise is that there exists a subsequence {n 1 } and a limiting measure V on Θ such that (13)
eθxVn,(dθ)
+ e θ x V(dθ)
Note that V(Rk) = λ v ( 0 ) = lim λy
and
(0) = 1.
λ v § (x) - λ γ ( x ) ,
x € 1° .
Since both
and
χ
+^(3))
n
are bounded continuous functions on §, (13) and (6) yield directly 1 + lξ(θjl that for x e 7
e
δ(x)
=
lim
δG
(x)
This verifies ( 2 ) , and completes the proof.
||
v ( d θ )
124
STATISTICAL EXPONENTIAL FAMILIES
EXERCISES 4.1.1 ( i ) Prove the Cauchy-Schwarz inequality (4.1(1)) with B ^ in place of Bpo when Bpo is singular.(ii) Show the inequality remains valid when T,, T^ are respectively Uxs) and (mxs) matrix valued random variables, [ ( i ) Reproduce the proof of Theorem 4.1 with Bio in place of BZi; or rotate coordinates so that B2o is diagonal with diagonal entries d.
> 0, 1 < i < r, and d
= 0,
r+1 £ i <_ m, and apply 4.1(1) for the f i r s t r coordinates of T 2 J 4.2.1 Let v be a measure on R and l e t T be a real valued s t a t i s t i c . 2 Suppose 0 € W° and E(T ) < °°. Show f o r every ε > 0 there i s a polynomial 2 p(x) such t h a t E((T - p) ) < ε .
2 ( I n other words, the monomials x - , . . . , x . , x , ,
Xj x 2 , . . . form a complete basis f o r L 2 ( v ) . ) ^0
=
*' ^1 =
in L9(v).
a
ll
x
+
a
1 0 ' ^2 =
Let α. = E ( T f . ) .
T-g€L2(v),
a
22x
2 +
a
21x
+
[For k = 1 ( f o r s i m p l i c i t y ) l e t a
2 0 ' # * # ^ e o^thonormal f u n c t i o n s
Then Σα? < ET2 so t h a t g = Σ α . f . € L 9 ( v ) .
0 e W°(T - g) and λ ^ ( 0 ) = 0 ,
j=0,l,...
.]
4.3.1 V e r i f y formulae 4.3(3) and 4 . 3 ( 5 ) . 4.3.2 Let p Q be a f u l l canonical exponential f a m i l y and l e t ξ = ξ ( θ ) denote the expectation parameter.
Show t h a t r e l a t i v e t o t h i s parameter the
information matrix i s J ( ζ ) = Σ - 1 ( θ ( ζ ) ) . 4.4.1 Let M be a f i x e d £χ£ p o s i t i v e s e m i - d e f i n i t e symmetric m a t r i x . 1
Write the i n f o r m a t i o n i n e q u a l i t y f o r EQ ( ( T - μ ) M(T - μ ) ) where T i s an Jl-dimensional s t a t i s t i c w i t h mean μ and f i n i t e covariance a t ΘQ.
[This i s
immediate from Theorem 4.4 and (T - μ ) 1 M(T - μ) = Tr(M(T - μ ) ( T - μ ) 1 ) . ]
APPLICATIONS
125
4.4.2 Show that the information inequality 4.4(1) is an equality i f and only i f for some matrix A and vector b (a)
T(x)
= A(Vθ(p Q ))X + b
.
[Show the Cauchy-Schwarz inequality is an equality i f and only i f Tp is an affine transformation of T-.] 4.4.3
Let {p θ : θ e 0} be a different!able subfamily and T an £dimensional statistic. Suppose %a (T) exists for some θ n € 0. Then the w0
u
information inequality is an equality for a l l θ € 0 i f and only i f 0 is an affine subspace of W and T is an affine function of the canonical minimal sufficient s t a t i s t i c for the exponential family {p Q : θ e 0 } .
(That such a
characterization holds under mild regularity conditions for a general family
{p 0 }
was proved in
Wijsman (1973) and Joshi (1976).)
[Use
Exercise 4.4.2.] 4.4.4 Suppose {p Q } is a canonical one-parameter exponential family.
Show
that when the information inequality is not an equality i t can be improved to an inequality of the form:
(1)
Var 0 T > ε'(θ o )M(θ o )ε(θ o )
where ε(θ) is the jχl vector with (2)
J
ε(θ), = - ^ e f θ ) 3Θ
1 =1
J
and M(θ) is an appropriate j x j symmetric matrix, not depending on T. In -1 fact, M(θ) is the covariance matrix at θ of the vector with coordinates
(3) (The inequality (1) with M
3Θ. as in (3) is called a Bhattacharya
inequality.
126
STATISTICAL EXPONENTIAL FAMILIES
Such inequalities are valid also for full k parameter exponential families and for ^-dimensional statistics, as well as for differentiate subfamilies (p replaces θ in (1) - (3)). See e.g. Lehmann (1983, p.129).
[A direct proof
is possible which also yields the formula (3). An alternate proof assumes ΘQ = 0, ψ(θ 0 ) = 1 (w.l.o.g.) and uses Exercise 4.2.1 to write / (T(x) - oυu ) 2 v(dx) >_ Σ α? = Σ /T(x)f. (x)v(dx). 1=1Ί 1=1 Ί
]
4.4.5 Suppose X...... are i . i . d . observations from a d i f f e r e n t i a t e exponential subfamily.
(1)
Let N be a stopping time with PQ
Eθ (exp(ε N))
Let S n =
<
»
for some
ε >0
n Σ X. and l e t T(S M , N) be a s t a t i s t i c for which la
(2)
ZΘQ(T)
where e(ρ) =
E
Θ(D)(T(SN>
>
N
(E θ ( N ) ) " 1 ( V e ( p 0 ) ) '
) )
(N < °°) = 1 and
(T) < «.
Then
yHpQ)(Mp0))
[Prove directly or use Exercise 3.12.2 (iii)
and Theorem 4.4. The regularity condition (1) can be considerably relaxed or modified, but some condition on N is needed in general. See Simons (1980).] 4.4.6 ( i ) When {p Q } is a f u l l canonical exponential family and p
EQ (T ) < «>, the Bhattacharya i n e q u a l i t i e s 4 . 4 . 4 ( 1 ) l i m i t as j + °°.
( i i ) I f {ΌA i s an m-dimensional
tend to e q u a l i t y i n the
d i f f e r e n t i a t e subfamily
w i t h m < k then there a r e s t a t i s t i c s T f o r which the appropriate Bhattacharya i n e q u a l i t i e s do not tend to e q u a l i t y as j + «.
[ ( i ) Use Exercise 4 . 2 . 1 and
proceed from the proof sketched i n the h i n t i n Exercise 4 . 4 . 4 . a curved exponential
f a m i l y i n the canonical
( i i ) Consider
version 3 . 1 1 ( 1 ) , and l e t
T(x) = x2 - x 2 . ] 4.5.1
Prove the assertion in 4.5 when 3 ^ 0 .
[Let Y = X - γ. Apply 4.5
APPLICATIONS
127
to yield αY as an admissible estimator of ξ(θ) - γ. Hence αY + γ is admissible for ξ(θ).] 4.5.2 Show the condition 4.5(2) implies 6 Jx) = (αx + β) € K a.e.(v). α,p
[The theorem would be false otherwise! theorem i s also of i n t e r e s t .
But a d i r e c t proof not involving the
Use Lemma 3 . 5 . ]
4.5.3
Suppose (λ, γ) satisfies condition 4.5(2), λ 1 < λ, and either γ € K° or v is a discrete measure. Then (λ 1 , γ) satisfies condition 4.5(2). If γ € 8K = K - K°, and (λy γ ) , (λ 2 , γ) both satisfy 4.5(2), and λ χ < λ < λ 2 then (λ, γ) satisfies 4.5(2). 4.5.4 Let X ~ Γ(a, σ ) , a known, and consider the problem of estimating σ = E(X) under squared error loss, δ
o
( i ) Using K a r l i n ' s theorem v e r i f y that
( x ) =αx + β is admissible i f α = -^γ
, 3 = 0 or i f α < - ~
α"fΊ
Ot, p
, β > 0.
— a+ l
(ii) Show that if α,β do not satisfy these conditions then 6
Q
is inadmissible
Ot>P
since there i s an admissible l i n e a r estimator which i s b e t t e r . 4.5.5 Consider the one-parameter exponential family defined by 3.4(1) with θo
=
-1 and θ = θi € (-°°> 0 ) .
under squared error loss.
Let δ
Consider the problem of estimating ξ(θ) o
be a l i n e a r estimator as i n 4 . 5 ( 1 ) .
Oί»p
Observe that condition 4.5(2) of K a r l i n ' s theorem i s not s a t i s f i e d at θ = 0. Sh.ow that 6
o
is inadmissible.
[For the case α = 1, β = 0 l e t
OUp
(1)
δ'(x) L
X
X £ C
c + (x-c)/2
x >c
=
Then R(θ, δ 1 ) < R(θ, δ Ί n ) for ξ(θ) £ c and, for ξ(θ) >_ c, a crude bound yields
128
STATISTICAL EXPONENTIAL FAMILIES R(θ, δ') < [h) Var θ (X) + (*)(ς(θ) - c ) 2 + ξ 2 (θ)
(2)
=
3
2
1
Hence f o r c s u f f i c i e n t l y large R(θ, 6 ) < R(θ, δ Q ^ ξ(θ)
2
ξ ( θ ) / 8 + (ξ(θ) - c ) / 4 + ξ ( θ ) 3
= ξ ( θ ) / 2 also when
> c]
4.5.6
Let {pθ> be as in 4.5. Suppose it is desired to estimate g(θ) = ξ(θ) + W'(θ) under squared error loss. Show the estimator δ Q is Otjp
admissible if (1)
/ exp(λψ(θ) + (1 + λ)W(θ) - γλ(θ)dθ
diverges at both θ and θ. [Define b( ) as in 4.5.
4.5(7) becomes
2b'(θ) - 2(λξ(θ) + (1 + λ)W(θ))b(θ) + (1 + λ)b 2 (θ) ± 0 .]
(2)
(See Ghosh and Meeden (1977). Although an estimator δ
o
may be admissible
α,p
here, it is not clear that it is desirable, whereas for the case W = 0 of 4.5 these estimators are yery natural.) 4.5.7 Let {p Q } be a canonical two dimensional exponential family with Ό
o
W = R . Consider the problem of estimating ξ(θ) with squared error loss (so 2
that R(θ, δ) = E0(||δ(X) - ξ(θ)|| )). Show that the estimator δ(x) = x is admissible. Apply this result when (X-, X«) are independent normal, independent Poisson, independent binomial, or the sample means from Von-Mises variables. [Using the bivariate information inequality leads to replacement of 4.5(7) by (1) where V
2v
b(θ) +
2
2 ab.(θ) b(θ) = Σ —9\θ τ — . If b satisfies (1) so does 1=1 i
APPLICATIONS
129
B(θ) = (2π)" 1 / 2 π Q" 1 b(Q.Θ)dφ 0 Φ Φ
(2) where
/cos φ *
-sin φχ
^sin φ
cos φ'
b is spherically symmetric; hence can be written as b(θ) = β(||θ||)θ . Let t = I |θ| |. (1) becomes (3)
2kβ(t) + 2t3'(t) + t V ( t )
< 0 .
Now let K(t) = t 2 β(t) to get 2K'(t) + K2(t)/t £ 0
(4) in place of 4.5(8).
(Note how the argument fails if k > 2!)] (Stein (1956),
Brown and Hwang (1982, Corollary 4.1).) 4.5.8 Let X ~ Γ(α, σ), problem of estimating σ = -r
α > 0 a specified constant. under the loss function
L(σ, a) = i - l n φ - 1
(1)
.
(See Chapter 5 for a natural interpretation of this loss. 4.11.3 and 4.11.4.) estimator.
e(θ)
(i)
Show that
(ii)
Let 6 Q (x) = £ and l e t 6(x) = (1 + Φ(x))ό Q (x) be any
= Eβ(φ)
and
W(t)
R(θ, 6) - R(θ, δ 0 )
= t - ln(l + t) ,
>
- M i M +
t > -1
.
W(e(θ))
Use (3) to show that <5Q is admissible among a l l estimators having
e(θ) £ B for a l l θ € (-°°, 0 ) . on δ.
See also Exercises
Let
(2)
(3)
Consider the
See Brown (1966).)
(6Q is actually admissible with no restriction
130
STATISTICAL EXPONENTIAL FAMILIES [(i) R(θ, δ) - R(θ, δ Q ) = E θ (-ΘXφ(X) -
= -θe'(θ)/α + E θ (φ(X) - ln(l + φ(X))). For (ii) follow the pattern of the proof of Theorem 4.5. (It is also possible to use (3) to prove δg is admissible with no restriction on δ.)] 4.6.1 Prove 4.6(2).
[Use the information inequality to write
/ h(p)R(p, δ) dp >_ m + /{2h(p) Tr(Vb(p)) + h(p)Tr(J(p)b(p)b' (p))}dp . Integrate by parts the f i r s t term in the integrand in order to get an integral whose integrand is a quadratic in b(p) for each fixed p. grand for each p to get 4.5(2).]
Minimize this inte-
See Exercise 5.8.1 for a statistical
application of 4.5(2). 4.6.2
In preparation for the proof of 4.6(6) prove the following facts: (i) For each K, V e (κ)(p) exists for all but at most a countable number of points, p. Fix
PQ, K
for which Ve, K )(p 0 ) exists. Let δ*(x) =
e*(p) = E θ (p)(<5*( χ )K and D = (d..) = Ve*(p Q ) - V e ( κ ) ( p Q ) . (ii) d.. = 0 , 1
ί ") Let
d
[Since all
ρQ),
Show
i f j , and
p
(
χ
θ
κ
l ϋl± θ(p0) l i - i l> ) |D| = ( | d . . | ) w i t h d . . as above and l e t J = J ( p Q ) be symmetric
p o s i t i v e d e f i n i t e w i t h eigenvalues (iv)
<5/ K N(X;
T r ( J D J " D)
< "
| d . | <_ 1 the e i g e n v a l u e s
have magnitude a t most -r^ . λ m
λ , >. . . . τ-^Tr|D| < λ m
>_ λ
> 0.
Show
——^-^ λmK2
- - and hence diagonal elements - - o f JDJ Then Rv = T r ( J K
E) > T— Tr E " λl
where
-1
APPLICATIONS
131
4.6.3 Also in preparation for the proof of 4.6(6) prove the following matrix inequalities (i)
TrίJA'J^A) _> 0 for any (kxk) positive definite symmetric J
and any (kx k) matrix A. (ii) [(i)
Tr(J(A' + B'JJ'^A + B)) > α TrίJA'J^A) - - j ^
Diagonalize J (and J " ) and then w r i t e out Tr( ) as a sum of individual
terms,
( i i ) follows from ( i ) . ]
4.6.4
Now prove 4.6(6).
[Write the information inequality for δ*.
Substitute Ve*(p Q ) = Ve, κ x(p Q ) + D and use 4.6.2(iv) and 4.6.3(ii).
(Note
that both these inequalities are nearly trivial when k = 1, so in that case the overall proof is much simpler to follow.)] 4.6.5 The inequality 4.6(6) is never sharp (except sometimes in the limit as K •+ °°). To examine how far from sharp the inequality is compare R κ and the best lower bound from 4.6(6) in the case where k = 1, L is ordinary squared error loss, X
N(θ, 1), p = θ, and δ(x) = ax (0 < a <_ 1).
[For a = 1, K = 1, I get R κ = .516 >_ .250 = best lower bound. For a = 1, K = 3 I get R κ = .991 >_ .5625 and for a = 1, K = 10 R κ = .999+^ .891 .] 4.6.6 Prove 4.6(7).
[See 4.6.1.]
4.6.7 Investigate the sharpness of (7) by comparing the Bayes risk for L^ and the bound on the right of 4.6(7) when k = 1, L is ordinary squared 2
error loss, X ~ N(θ, 1), p = θ, and h is a normal (0, σ ) density. (Note: h does not have compact support, but it can be shown (Exercise !) that the tails of h decrease fast enough so that 4.6(7) is still valid.) [When K = «
132
STATISTICAL EXPONENTIAL
FAMILIES
2 2 so t h a t 4 . 6 ( 7 ) reduces to 4 . 6 ( 2 ) the Bayes r i s k i s σ / ( I + σ ) and the lower 2 2 bound i s (σ - l ) / σ . Thus even when K = °° the bound i s not sharp, although 2 i t i s asymptotically sharp as σ ->• °° a l s o . ]
4.11.1 Let δ denote the James-Stein estimator 4 . 1 1 ( 1 ) w i t h r Ξ k - 2 and let 6
denote the corresponding " p o s i t i v e p a r t " estimator 4 . 1 1 ( 6 ) . +
R(θ, δ ) < R ( θ , δ ) .
+
2
[ W r i t e R ( θ , δ) - R ( θ , δ ) = E Q ( g ( | |X| | ) ) . 2
and I S " ( g ) = - 1 , and ( t r i v i a l l y ) E g ( g ( | | x | | ) ) > 0 .
Show t h a t
Note S"(g) = 1
Use Exercise 2 . 2 1 . 1 . ]
4.11.2 Suppose X - N(μ, σ 2 l )
(X € R k ) and, independently,
is desired to estimate μ with squared error loss —
V/σ 2 - x ^ .
2 σ is unknown.
It
Let
2
k :> 3. Let σ = V/m and 6(x
, .Λ.
ιiχiι2/δ2
where 0 £ s( ) <_ 2(k-2)m/(m + 2) and s( , σ ) i s d i f f e r e n t i a t e and nondecreasing f o r each value o f σ . [Assume ( w . l . o . g . ) t h a t σ
= 1.
Show t h a t δ ( x ) i s b e t t e r than δ Q ( x ) = x. Condition on σ
apply 4 . 1 1 ( 5 ) with
r ( ) = σ s( , σ ) ; and take the expectation over σ . Λ
(A f r e q u e n t l y recommend-
Λ
2 2 2 2 ed choice f o r s i s s ( | | x | | , σ ) = m i n ( | | x | | / σ , (k-2)m/(m + 2 ) ) corresponding to 4 . 1 1 ( 6 ) . ] 4.11.3 Let
. be independent r ( α . , σ.) v a r i a b l e s w i t h α. known,
i=l,...,k.
Consider the problem o f estimating σ = ( σ - , . . . , σ k ) with loss 2 function L(σ, a ) = Σ σ . ( l - a . / σ . ) . The best l i n e a r estimator f o r t h i s problem i s δ Q with 6 Q i ( x ) = x^/(a. [Use Theorem 4 . 5 . ]
+ 1).
( i ) When k = 1 t h i s estimator i s admissible.
( i i ) f o r k > 2 define δ by k
(1)
δ.(x)
q
= χ./(α Ί . + 1) + (k-l)α. + 1/ Σ (oj + IΓ/XJ
.
APPLICATIONS
133
Show that R(σ, 6), < R(θ, δQ). (This is the easiest of several interesting related results in Berger (1980b).) [Let φ ^ x ) = (^ + Using Corollary 4.7 show (2)
R(σ, 6 0 ) - R(σ, δ) =
-EtΣ (
L-2 L (α. + 1 Γ
2
since σ..(l - a/σ^
^^ X i
φίX)
^———
α
1
+ —LJ (α. + 1) Z
*
3x
i
φ 1
(
2
= (a/ -θ. -
1// -θ..) .
Then show the expectand on the
(Use the fact that -^r"Φ Ί ( χ ) < 0 to eliminate the
right of (2) is negative.
σX
1
terms i n v o l v i n g φ. -r— φ. . ) ] I
σX
1
4.11.4
Let X. ~ Γ(α., σ ^ , α^ > 0 specified constants, i = l,...,k, as in Exercise 4.11.3. Consider the loss function L(σ, a) =
(1)
Define 6 n by 6 .(x) = x / α . . Mx)
= (1 + Φ i ( χ ) )
δ Λ 1
Σ (a /σ. - ln(a./σ.) - 1) . Ί Ί i= l 1 Ί
(See Exercise 4.5.7.)
Let k >_ 3 and define 6 by
( χ ) where cα.
(2)
φ.(x) 1
=
In x. ]
1 + Σ(α. In
?
x^Γ
with 0 < c <_ 1. Show that R(σ, 6) < R(σ, 6 Q ) , σ > 0. [The unbiased estimator of R(σ, ό) - R(σ, δQ) is x. 3φ i
(The following algebra can be simplified by changing variables in (3) to y. = α In x ,
i=l,...,k.)
Then show this is always positive, using the
facts that |Φi | ± c/2 and t - In (1 + t) <_ 2t 2 /3
for |t| <_h.
(You w i l l see
134
STATISTICAL EXPONENTIAL FAMILIES
that values of c somewhat larger than 1 can also be used in (2).)] See Dey, Ghosh, and Srinivasan (1983). (Change variables in Exercise 4.5.7(3) to σ = -| = ξ(θ), and compare with the i -th term in brackets in (3), above. This identity of expressions is analogous to that which occurs in the estimation of normal means with squared error loss. See 4.11(8).) 4.11.5 Let X ~N(θ, I). Consider the problem of estimating θ € R ε>0
under squared error loss. Suppose for some C < «>, (1) Then δΛx)
(δjU) - x) is
x > 2-k+ε
for
llxll > C .
inadmissible.
[Let 6 9 ( x ) = ό Ί ( x ) - ε [ ( | | x | | - C ) + Λ 1] ι
and use 4.10(4).]
*-*-
llxll
(Note that this generalizes Example 4.11 since δΛx) = x
satisfies (1) when k >_ 3.) 4.15.1 (i)
Show t h a t f o r estimating the natural parameter the corres-
pondence between p r i o r measures and t h e i r generalized Bayes procedures is oneone i f Supp v has a non-empty i n t e r i o r ( i . e . show 6 G = 6 H G = H).
[Use Theorem 4.15 and Corollary 2 . 1 3 . ]
show t h a t t h i s u n i c i t y may f a i l i f (Supp v ) °
a.e.(v)
implies
( i i ) Give an example to
= φ.
4.15.2
Show that every admissible estimator of θ under squared error loss satisfies the monotonicity condition (1)
(x 2 - x χ )
(ό(x 2 ) - 6( X l )) i 0
a.e.(v * v) .
[Use 4 . 1 4 , 4 . 1 5 , and 2 . 5 . (Do not use 4 . 1 6 ( 1 ) f o r t h i s would not y i e l d ( 1 ) f o r x Ί e 9/C.)]
APPLICATIONS
135
4.16.1 Let X - P(λ). Let c Q ± 0. Show that the estimator 6(0) = c Q , ό(x) = In x, x=l,2,... , is not an admissible estimator of the natural parameter θ = In λ under squared error loss, (ό is the "maximum likelihood estimate" of θ; see Chapter 5. Also, the squared error loss function L(θ, a) = (a - θ) can be justified in its own right, or one can transform to λ = e θ
and let b = e a . The loss then takes the form (In b - In λ ) 2 2 = (In (b/λ)) = L*(λ, b). The inadmissibility result, above, then says also that ό*(x) = x is an inadmissible estimator of λ under loss L*. Losses of the form L* appear naturally in scale invariant problems; see Brown (1968).) [Use Theorem 4.16. If 6 is of the form 4.16(1) then, by monotonicity, (1)
In [x] <
λi(x) χ ^ H
<
In ([x] + 1) ,
x > 1 .
Hence λ H (x) -»«as x -> <» but λ..(x) = o(e ε x ) as x -* <», v ε > 0. This is impossible by Lemma 3.5 and Exercise 3.5.1.] 4.17.1 Let X ~ Bin(n, p ) , n >_ 3, and consider the problem of estimating the natural parameter θ = In (p/(l - p)) under squared error loss. Show that the procedure -1 ό(x) = 0 1
x = 0 1 £ x <_ n-1 x =n
is admissible. But, 6 is not generalized Bayes. (Note that Corollary 4.17 is not valid here because 4.17(1) is not satisfied. Of course, Theorem 4.16 is satisfied with H giving unit mass to the point θ = 0.) [Let ό 1 be another estimator. Suppose ό'(0) = -1 + α, α > 0. Then lim
I θ Γ ^ R ί θ , ό 1 ) - R(θ, 6)) = α > 0 .
136
STATISTICAL EXPONENTIAL FAMILIES
Hence R(θ, δ 1 ) ± R(θ, δ), V θ, implies
(i) δ'(0) <_ -1. Similarly (ii)
ό'(n) >_ 1. Among all procedures satisfying (i), (ii) 6 uniquely minimizes R(0, 6). Hence δ is admissible. If 6 were generalized Bayes the prior G would have to have support {0} by 2.5; but this would imply 6(0) = 0 = ό(n).] 4.17.2 Let Z ^ r(α,σ) as in 4.17.3, below, (i) Show that the estimator δ Q (x) = 0 cannot be represented as a generalized Bayes estimator of θ = 1/σ. (ii) For α < 2 show δ Q is admissible, [(ii) If δ M
o
(iii) For α > 2 show δ Q is inadmissible,
then, for some € > 0, R(θ,<5) > £ θ α as θ + 0. (iii) Let
δ(x) = (α-2)/x.]
4.17.3 Let Z ~ r(α, σ ) , α known. Then the distributions of X = -Z form an exponential family with natural parameter θ = 1/σ. Consider the problem of estimating θ with squared error loss. Show that δ(x) = b e x
(1) can It
be represented i n the form 4 . 1 6 ( 1 ) .
(= be" z ) [ L e t H be a Poisson d i s t r i b u t i o n !
can f u r t h e r be shown t h a t δ is admissible when α <_ 2 since i t
minimizes (2)
uniquely
m
H ( { 0 } ) l i m s u pp R(θ, ό ) e ψ ( θ ) θ+00
+
Σ
H({i})R(i, δ ) e ψ ( i )
.]
i i=l
4.17.4 Let {p_} be any exponential family with K compact (Binomial, Multinomial, Fisher, Von Mises, etc.). Show that δ(x) = x is an admissible estimator of θ under squared error loss. [Show that δ is Bayes for the prior 2
distribution, G, with density c exp(ψ(θ) - ||θ|| /2) and that B(G) < «> . Admissibility then follows from basic decision theoretic results. See, e.g. Lehmann (1983, Theorem 3.1). Use Exercise 3.4.1 to verify that B(G) is
APPLICATIONS
137
finite (and also that G is finite).] Caution! δ(x) = x is not a very natural estimator of θ, in spite of its admissibility. Hence its use in this problem is not necessarily recommended (unless the prior G is indeed as above). If Supp v is finite then <5(x) = x is a natural estimator of ξ(θ), and is admissible under squared error loss for estimating ξ(θ). See Exercise 4.5.5 and also Brown (1981b). 4.17.5 Let X ~ P(λ). Consider the problem of estimating
λ under loss
function L(λ, a) = (ln(a/λ)) 2 .
(1)
Show that estimator δ ^ x ) = e x is generalized Bayes, but not admissible.
[The
question is equivalent to asking whether the estimator 6(x) = x is generalized Bayes, or admissible for estimating the canonical parameter, θ, under squared error loss. Reason as in Exercise 4.17.4 to show 6(x) = x is generalized fiayes. However, for estimating θ, direct calculation shows that δ'(x) = bx, e"~ <_ b < 1 is better than δ(x). This inadmissibility result shows that the general result of Exercise 4.17.4 does not extend to problems with K not compact, even when k = 1. (All estimators of the form 6(x) = bx, 0 < b <_ 1, are generalized Bayes for estimating θ. We conjecture that none of them are admissible.)] 4.17.6 1/
Let X ~ N(θ, I). Consider the problem of estimating θ€ R under squared error loss, (i) Let G be a generalized prior density. Show that the generalized Bayes estimator (if it exists) can be written in the form
CD
6G(x) = x
where (2)
g*(x) = / pfl(x) G(dθ)
138
STATISTICAL EXPONENTIAL FAMILIES
( i i ) Consider the l i n e a r p a r t i a l d i f f e r e n t i a l inequality (3)
V
(g*(x) vu(x))
< 0
||x|| > 1
subject to the condition that u is continuous on ||x|| _> 1, and (4)
u(x)
=
1
for
||x||
=
1,
u(x)
£
1
for
||x||>l
.
Show that i f ( 3 ) , (4) have a non-constant solution which also s a t i s f i e s
Mllljll2)
(5)
< °° »
θ € Rk ,
then όg i s inadmissible. [(ii)
Let 6(x) = όg(x) + η j .
Use (5) and Green's theorem to
j u s t i f y an expression l i k e 4.10(4) f o r R(θ, όg) - R(θ, 6) but with an extra term involving a surface integral over {x: ||x|| = 1).
This extra term i s
non-negative because of ( 4 ) , and the remainder of the expression i s nonnegative because of ( 3 ) . the above.
(Note that Exercise 4.11.5 is a special case of
Brown (1971) proves that s o l u b i l i t y of ( 3 ) , ( 4 ) implies inadmissi-
b i l i t y of 6Q
(condition (5) is not r e q u i r e d ) , and conversely
if
is bounded -- and somewhat more generally -- then i n s o l u b i l i t y of ( 3 ) , (4) implies a d m i s s i b i l i t y of 6Q. See also Srinivasan (1981).] 4.17.7
(Berger and Srinivasan (1978).) ( i ) Again l e t X ~ N(θ, I ) and consider the problem of estimating
θ € R under squared error loss.
6(χ)
CD
Suppose +
= x i +
o(—ί-Λ
for two constant kxk matrices B and M. Show that ό is inadmissible unless B = cM for some c € R. [Theorem 4.17 and the representation 4.17.5(1) imply V(ln g*(x))
= ^
l
By considering l i n e integrals over closed paths show t h i s is
impossible
APPLICATIONS
unless B = cM.
139
The calculations are easier i f B and M are simultaneously
diagonalized ( w . l . o . g . ) .
Then when k = 2 the only paths that need be considered
are those bounding sets of the form {x:
x 1 >_ 0, χ« _> 0,
r <_ | |x| | <_ r + ε } . ]
( i i ) Suppose, instead, that X ~ N(θ, I) with % known ( p o s i t i v e d e f i n i t e ) ; and 6 is given by ( 1 ) . M f o r a d m i s s i b l i t y of 6.
Now w r i t e a necessary condition on B and
Does the condition involve %Ί What i f the loss
function i s L(θ, a) = (a - θ )
1
D(a - θ)
f o r some (known p o s i t i v e d e f i n i t e
matrix D? 4.18.1
Verify the assertion in 4.18(2).
[Use Lemma 3.5 and Exercise
3.5.1(2).] 4.20.1 If x ί S as defined in 4.20(1) then / Pθ(x)g(θ)dθ = °°, so that the generalized Bayes procedure for the conjugate prior does not exist at x. [See Exercise 4.19.1.] 4.20.2 Show that K a r l i n ' s condition 4.5(2) implies that S => K°.
(Hence,
i f v(3K) = 0 i t implies that the estimator 4.20(2) = 4.5(1) is generalized Bayes.)
( i i ) Give an example where 4.5(2) is s a t i s f i e d but 4.20(2) = 4.5(1)
is not generalized Bayes. 4.20.3 Let { p Q : θ € 0}
be a stratum of an exponential f a m i l y , as
ϋ defined in Exercise 3.12.1. Suppose it is desired to estimate η
(θ)
r
(θ)
= d). . ζ(2)(θ)
un
der squared error loss.
(Note that in the sequential
setting of 3.12.2(iii) and 3.12.3, η(θ) = EΘ(Y) is a \jery natural quantity to estimate.) State general conditions to j u s t i f y the formal manipulation —
140
STATISTICAL EXPONENTIAL FAMILIES
χ (1)
m
(2)
/ η(θ)e θ * X dθ θ
x
/e * dθ
( 1 )
which says that <5(x) = -^- is generalized Bayes on (K/ ,\)° relative to the x (1) (2) prior measure dθ,-v on 0* = {(θ/.j, θ ( 2 ) ^ θ ( l ) ^ € θ
: θ
(l)€ ^ ( 1 ) ^ *
(The conclusion is justified in the situation of 3.12.2(ii) and in that of 3.12.2(iii) if p Q (N <_ N Q ) = 1, and somewhat more generally.) 4.2Q.4 Generalize 4.20.3(1) to obtain a representation for certain estimators xm +a of the form x u ; +. b• . (2) 4.21.1
Show that 4.21(5) and 4.21(2) imply 4.21(1) and 4.21(3). [4.21(1) is trivial from 4.21(5'). For 4.21(3) reason as in the proof of Theorem 4.19. The key fact is that, with q as defined there,
-B
x
x
u
x
-B
9 Θ
1
-B
, etc.
Now integrate over Θ 2 >...»θ k and let B -» «>. The first part of the expression is bounded because of 4.21(1), (2), and the second part because of 4.21(5').] 4.21.2
(Converse to Theorem 4.19.) Let G be a prior measure whose Bayes procedure for estimating ξ(θ)
exists on S and satisfies δ(x) = αx + β. Suppose S° f φ. Assume further that G possesses a density g satisfying 4.21(1), (2), (3). Then G is a conjugate prior measure, and its conjugate prior density, 4.18(1), has α = l/(λ + 1) and 3 = γ/(λ + 1).)
Apply 4.7(3) to the last integral of the
APPLICATIONS
141
equality (λ+l)/(Vψ(θ))g(θ)eθ χ - Ψ ( θ ) d θ
(1)
= γ/g(θ)eθ χ -ψ( θ )dθ + x/g(θ)e θ * x "*< θ >dθ ,
rearrange terms and invoke completeness to find (2)
vg(θ) = (γ - λVψ(θ))g(θ)
.]
(Diaconis and Ylvisaker (1979) show that this statement is true without this "further" assumption that G possess a density.) (A question of interest is whether this unicity result extends to non-linear generalized Bayes estimators.
To be more precise suppose the
generalized Bayes procedures for estimating ξ(θ) under priors G and H exist and are equal everywhere on S with S° t φ.
Does this imply G = H?
In the
case of the normal distributions or the Poisson distribution the answer is yes.
See 4.15.1 for the normal distribution and Johnstone (1982) for the
Poisson distribution.) 4.24.1
Suppose δ( ) is admissible for estimating ξ under squared error loss. Then v{χ : 6(x) (. K} = 0. [Define δ'(x) as the projection of 6(x) on K. If v{x : 6(x) t ό'(x)} t 0 then R(θ, ό') < R(θ, δ) whenever R(θ, δ) < «> . (If δ is admissible there must exist some θ for which R(θ, δ) < «>.)] 4.24.2 (i) Verify that the conclusion of Theorem 4.24 remains valid when {p Q } is a steep exponential family and Θ c W°. (ii) Even more generally, it Ό
is valid for any one-parameter exponential family i f (1)
Θ c
{θ : EΘ(X) = ξ(θ) € R}
and i f the definition 4.24(1) is modified to
142
STATISTICAL EXPONENTIAL FAMILIES
(2)
I' = {x : v({y: y > x, δ(y) € ξ(N°)°}) > 0 and
ξ({y: y < x, δ(y) e ζ(W°)°}) > 0}
( i i i ) Extend Theorem 4.24 to the problem of estimating ρ(θ) under squared error loss where p: N° -> R is a non-decreasing f u n c t i o n .
[The formulation
and proof are i d e n t i c a l to ( i i ) , above.] 4.24.3
Let v = v^. + Vp where v, is Lebesgue measure on (0, 3) and v^ gives mass 1 to each of the points x = 1,2. Consider the estimator 6 of ζ (under squared error loss) given by
(1)
0 h \h 7h 3
δ(x)
x <1 x =1 1 < x <2 x =2 x >2
(i) Show that δ has the representation 4.24(2) on I = (1,2), but (ii) this representation cannot be extended to the points x = 1,2 even though δ(x) e K° for these points.
(iii) Show that δ is a pointwise limit of a
sequence of Bayes procedures,
(δ is also admissible. See Exercise 7.9.1.)
4.24.4 Let X have the geometric d i s t r i b u t i o n w i t h parameter p ( G e ( p ) ) , under which (1)
PKX = x}
=
p(l - p)x
x=0,l,...
( i ) Show t h a t δ ( x ) = x/2 i s an admissible e s t i m a t o r o f E (X) = ( l - p ) / p under squared e r r o r l o s s .
[Use K a r l i n ' s Theorem 4 . 5 .
Note also t h a t the e s t i m a t o r s
δ ( x ) = ex w i t h c > h f a i l t o s a t i s f y 4 . 5 ( 2 ) and are not a d m i s s i b l e . ]
(ii)
Suppose i t i s known i n a d d i t i o n t h a t p <_ ^ , so t h a t E (X) >^ 1 . , Using Theorem 4.24 show t h a t the t r u n c a t e d v e r s i o n o f δ - - namely
δ
' ( x ) = m a x ( δ ( x ) , 1) —
APPLICATIONS is inadmissible. , ( i i i ) than ό1 ??)
Can you find an (admissible) estimator better
143
CHAPTER 5. MAXIMUM LIKELIHOOD ESTIMATION
5.1 Definition Let φ : R k -> [0, «>] be convex. Define % : R k x R k -• [-o°,«>] by (1)
A(θ, x) = £ φ (θ, x) = θ
x - φ(θ)
For S c N let (2)
£(S, x) = sup U ( θ , x) : θ € S}
and let θ s (x) = {θ e S : £(θ, x) = A(S, x)}
(3)
Note that according to this definition θ~ is a subset of S. We will often abuse the notation slightly by letting θ also denote an element of this set. If φ = ψ is the cumulant generating function for an exponential family then £ φ (θ, x) = log p θ (x)
θ €H
is the log likelihood function on N. (Of course, A.(θ, x) = -«> for θ I N in accordance with the natural convention that ψ(θ) = ~ for θ f. N .) θ € θς(x) is then called a maximum likelihood estimate at x relative to ScW,
A function 6 : K + Θ for which δ(x) € Θ 0 (x) a.e.(v) is called the
(a) maximum likelihood estimator.
This terminology is not always properly
used in the literature; and we will also abuse it, at least to the extent of also referring to the set valued function θ (•) as the maximum likelihood estimator. 144
MAXIMUM LIKELIHOOD ESTIMATION
145
5.2 Assumptions , The main results of this section concern the existence and construction of maximum likelihood estimators, θ. The proofs of these results are based on the fact that ψ is a convex function satisfying certain additional properties, and not otherwise on the fact that ψ is a cumulant generating function.
In Chapter 6 we will want to apply these same existence and
construction results to convex functions, φ, which are not cumulant generating functions. To prepare for this application we now make explicit the conditions on φ which are needed in the proofs of the main results of this section. Let φ : R •*(-<»,«>] be a lower semi continuous convex function. Let N = W. = {θ : φ(θ) < »} .
Such a function is called regularly strictly
convex if it is strictly convex and twice different!able on N% and
Φ (1)
Dpφ
is positive definite on
N°
In the following results we will assume φ is regularly strictly convex.
In some of the following we also assume φ is steep. Note that if
ψ is the cumulant generating function of a steep exponential family then it satisfies these assumptions. Here are some useful facts. Let I = I be defined by 5.1(1), and let the mapping ξ : N -> R , be defined by ξ(θ) = Vφ(θ). Then, ζ is continuous and 1 - 1 since φ is strictly convex. (1) says that the Hessian of ξ = vψ is positive definite. Hence ξ(M°) is an open set; call it R, or R φ .
ξ"Ί( ) is continuous on R.
Theorem 3.6 establishes that (2)
R = K°
when φ = ψ is the cumulant generating function of a minimal steep exponential family. (3)
In particular, in this case R is convex
146
STATISTICAL EXPONENTIAL FAMILIES It will be shown in Proposition 6.7 that (3) is always valid under
the above general assumptions on φ including steepness of ψ. As previously, let θ( ) = ξ ' ^ ). i.e. ζ(θ(x)) = x. (The assumption above of the existence of second derivatives and of (1) is convenient, but can be dispensed with. The other assumptions are required for the following development.) We emphasize again: the following results about i, and maximum likelihood estimation concern the general situation where φ is as assumed above. These results therefore apply in particular to maximum likelihood estimation from minimal 5.3
steep standard exponential families.
Lemma
Assume φ is regularly strictly convex. Then, &( , x) is concave k k and upper semi continuous on R for all x € R . It is strictly concave on N . If Θ Q e N° then (1)
V£( , x ) , θ
(2)
D 2 £( , x ) . θ
where ( Z ί θ g ) ) ^
39 9θ. ' J
φ(θ
Q)
= x - ξ(θ Q ) = -D 2 φ(θ Q ) = -2(θ 0 )
1s
P°sitive
d e f i n i t e
I f x € R (= K°) then
lim £(θ, x) = -«
(3)
IIΘIIHOQ
Proof.
The first assertions are immediate from Assumption 5.2. Equations
(1) and (2) are a direct calculation. The positive definiteness of 2 ( Θ Q ) is a consequence of 5.2(1). Assertion (3) has been proved in 3.6(4) for the case where φ = ψ is the cumulant generating function of a minimal steep exponential family. This proof was needed in order to show that R = K° in such a situation. However we now want a proof valid for arbitrary convex functions, φ, satisfying
MAXIMUM LIKELIHOOD ESTIMATION 5.2(1).
147
This i s easily supplied. Assume x e R, then θ(x) € A/°.
Note using ( 1 ) , (2) that
VA(θ(x), x) = 0, and D 2 A(θ(x), x) is negative d e f i n i t e . 6 > 0,
ε >0
(4)
£ ( θ ,x ) = A ( θ ( x ) ,
<
x ) - ( θ- θ(x))
A ( θ ( x ) , x) - ε
1
Hence f o r some
Z(θ- θ(x))/2 +o(||θ -
for
I|θ - θ ( x ) | |
=
2
θ(x)|| )
δ
I t f o l l o w s t h a t when ||θ - θ ( x ) 11 > δ
(5)
£ ( θ , x)
< A ( θ ( x ) , x)
-
11θ - θ n | | δ
5_ε
by (4) since A(θ(x) + (δ/(||θ - θ(x)||))(θ - θ(x))) < (1 - δ/||θ - θ(x)||)£(θ(x), x) + (δ/||θ -θ(x)||H(θ, x) by convexity.
(5) implies (3).
||
(We note that the positive definiteness of I is not really needed to establish (3). It is only necessary that the conclusion of (4) be valid -i.e. for some 6 > 0, ε > 0 (4 1 )
Jt(θ, x) < λ(θ(x), x) - ε
for
||θ - θ(x)|| = δ
This condition follows whenever λ( , x) is a strictly concave function which assumes its maximum at θ(x).) It is useful to now prove the following lemma. This result is used in Theorem 5.5 to show that θ 0 c W° when 0 is convex. 5.4
Lemma Assume φ is steep and regularly strictly convex. Let
θ χ € N - W°, Θ Q € M°. Let θ = Θ Q + p(θ χ - Θ Q ) , 0 < p < 1. Then
148
STATISTICAL EXPONENTIAL FAMILIES
(1)
11m ( |^ £(θ p , x))
Hence there is a p1 < 1 such that (2)
Proof.
a(θ p I , x) > ϋ(Qv x)
From 5.3(1) | ^ A ( θ p f x)
=
as p t 1 because ψ i s steep.
(ΘQ - θ χ )
(x - ξ ( θ p ) )
+
- -
This proves (1) from which (2) is immediate.
(In case ψ i s regular, i . e . N = W°, then l i m Λ(θ , x) = -°° c o n t i n u i t y , which can also be used to prove ( 2 ) . )
by upper semi-
||
FULL FAMILIES Here is a fundamental result concerning maximum likelihood estimation. It follows easily from the above. 5.5 Theorem Let φ be steep and regularly s t r i c t l y convex. (I)
θ w (x)
=
{θ(x)}
c
I f x € R then
A/°
In other words, θ.,(x) consists of the unique point θ = θ(x) satisfying N I (I ) ξ(θ) = x € R I f x £ R then θ.,(x) is empty.
(Recall that i f Φ = Ψ is the cumulant generating
function of a steep canonical exponential family then R = K°.) Proof.
For any x, {Qjχ)}c
Λ/°
by v i r t u e of Lemma 5.4.
Any maximum
l i k e l i h o o d estimator must thus be a local maxima of £( , x) and hence must satisfy
MAXIMUM LIKELIHOOD ESTIMATION
149
V£( , x ) , ~ = 0 This implies ( Γ ) by 5.3(1). Furthermore, the solution to (I 1 ) is unique if it exists, and it exists if and only if x £ R = ζ(W°). Remarks.
||
Maximum likelihood estimation is defined in statistical theory for
a general parametric family of densities {f Q : θ € Θ} by θ(x) = {θ € Θ : f Ω (x) = sup f (x)}. Note that this definition is invariant θ
α
α
under reparametrization. Thus, if ξ = ξ(θ) is a 1 - 1 map on 0 the maximum likelihood estimate of the parameter ξ € ξ(θ) is ξ(θ). Accordingly, Theorem 5.5 says that for minimal steep exponential families x = ξ(θ(x)) is the unique maximum likelihood estimator of the mean value parameter ξ = ξ(θ) at x £ K°. To emphasize, in terms of the mean value parametrization the maximum likelihood estimator is determined by the trivial equation (1")
ξ(x) = x ,
xe r
For the present, (1") is valid if and only if x € K°. This set of course contains almost every x(v) if and only if (2)
v(K - O
= 0
.
Note that (2) is satisfied i f v is absolutely continuous with respect to Lebesgue measure.
I t is never satisfied i f v has f i n i t e support or,
more generally, has countable support and K ^ R . In the last part of Chapter 6 we expand such exponential families so that (1") usually remains valid for a.e.x (v). (Since ζ = EΩ(x) equation (1") also defines ζ(x) = x as the classical method-of-moments estimator. Thus for the mean value parametrization the maximum likelihood and method-of-moments estimators agree.) Suppose that X-,...,X
are independent identically distributed
random variables from the exponential family { p θ ) . Then, as noted in 1.11(2),
150
STATISTICAL EXPONENTIAL FAMILIES
the distributions of the sufficient statistic X = n -1 Σ X
also form an
exponential family with natural parameter α = nθ and cumulant generating function nψ(α/n).
It follows that α(x) = nθ(x). So, the maximum likelihood
estimator of α based on X is nθ(X ) and the maximum likelihood estimator θ/ i of θ = α/n based on X is (3)
θ(n)
5.6 Examples
= Sc/n = θ(x n )
.
(Beta Distribution)
For a variety of common full families the above remarks lead to easy calculation of the maximum likelihood estimator. These are situations such as those mentioned in 3.8 where the mean value parametrization has a convenient form. For example if Y,, Y2>
>Y n are i.i.d. multivariate normal
(μ, %) random variables then the maximum likelihood estimators for μ and -in
n
Ί
μμ 1 + t are, respectively, Y = n
Σ Y, and n Σ Y.Y! . This leads to the i=l Ί i=l Ί Ί conventional maximum likelihood estimates
(1) t = S = n" 1 Σ(Y i - Ϋ)(Y. - Y ) 1 For the Fisher - Von Mises distributions the result of Theorem 5.5 is not so easy to implement. See 3.8. Another not so convenient, but important, family is the beta family, which will now be discussed. Consider the family of densities (2)
f Ay)
= B'^α, β)ya'l(l - y ) 3 " 1 ,
0 < x < 1,
α > 0,
3 > 0 .
ot,p
realtive to Lebesgue measure on (0, 1), where B = B(α, β) denotes the beta function, (3)
B(α. 3) =
T
Γ(α + β)
MAXIMUM LIKELIHOOD ESTIMATION
151
This is a two parameter exponential family with canonical parameters (α, 3) e N = (0, °°) x (0, °°). The corresponding canonical statistics are (4)
xχ
= log y
x2
= log (1 - y)
In this case the canonical parameters themselves have a convenient statistical interpretation since (5)
E(Y) = α / ( α + β ) ,
E(l - Y) = β/(α + β)
Var(Y) = αβ/(α + 3) 2 (α + β + 1) = Var (1 - Y)
.
The mean value parameters are somewhat less convenient. One has (6)
ξ 2 (β. α) = ξ,(α, β) = B " 1 ^ , .β) /(In y)ya'l(l ά ι 0 T - (In B(α, g j
=
3α
M k
=0
α+3+k;
- y)*'ldy
Γ'(α)
Γ'(α + 3)
Γ(α)
Γ(α + β)
- -ΐ i w :
k Q
B
i
(α+k)(α+3+k)
and 3-1 ξ Ί (α, 3) = - Σ 1 (See e.g. Courant and Hubert (1953, (7)
, -irif α+κ p.499)).
3 = 1,2,...
Suppose Y-,...,Y are i.i.d. beta variables, and X.., X^^ are defined from Y. through (3), i=l,...,n . Then the maximum likelihood estimates of (α, 3) can be found numerically by solving (8)
ζj(α, 3) = Xj
i = 1. 2
n
from (6), where X. = n~ 1 Σ X... An exact solution appears to be unavailable, J1 J i=l except when α,3 turn out to be integers so that (7) applies. According to Theorem 5.5, the solution to (8) exists if and only if X € K°. Now,
152
STATISTICAL EXPONENTIAL FAMILIES K = conhull ίln y, In (1 - y) : y e (0, «)}
Since {In y, In (1 - y) : y € (0, 1)} is strictly convex in R this solution n 2 therefore exists if and only if n > 2 and Σ (Y. - Ϋ) > 0. The event 1= 1 2 " Σ (Y. - Y) = 0 occurs with zero probability when n >_ 2; hence the maximum i =l Ί likelihood estimate exists with probability one when n >_ 2. NON-FULL FAMILIES We now proceed to discuss the existence and construction of maximum likelihood estimators when Θ c W. Here is an existence theorem. 5.7
Theorem Let φ be steep and regularly strictly convex. Let 0 c N be a non-
empty relatively closed subset of A/. Suppose x e R. Then θ Q (x) is non-empty. Suppose x € R - R. Suppose there are values x. e R, i=l,...,I, and constants β. < °° such that (1)
Θ c
I y H~((X - x.), β ^
.
Then Θ 0 (x) is non-empty. Remark.
See Exercises 5.7.1-2, 7.9.1-3, and Theorem 5.8 for more infor-
mation about the theorem.
In particular, (1) implies x (. (ζ(Θ))~. See
Figure 5.7(1) for an illustration of 5.7(1). Proof.
Let x e R.
£( , x) is upper semi-continuous and satisfies 5.3(3).
Hence &( , x) assumes its supremum over Θ. But £(θ, x) = -°° for θ € (δ - Θ) C H - M.
It follows that Θ 0 (x) is non-empty.
Suppose x € R - R and (1) is valid. Then for each θ € Θ there is an index i for which θ € ίΓ (x - x., β.). For this index (2)
£(θ, x) = θ
(x - xΊ ) + θ
x Ί - ψ(θ) i
β 1 + θ • x i - ψ(θ)
.
MAXIMUM LIKELIHOOD ESTIMATION
153
0
Figure 5.7(1): An Illustration of 5.7(1) showing R, x € R - R, Θ
2 u H~((x - xΊ. ) , β.) Ί i= l
and ξ(Θ) . It follows that Λ,(θ» x) £
sup {$. + θ
x Ί - ψ(θ) : 1 < l < 1} ->
a
s | |θ| |
θ € Θ by 5.3(3).
The second assertion of the theorem follows from (2) as did the
first from 5.3(3).
||
CONVEX PARAMETER SPACE When Θ is convex one gets a better result, including a fundamental equation defining the maximum likelihood estimator.
154
STATISTICAL EXPONENTIAL FAMILIES
5.8 Theorem Assume φ is as above. Suppose Θ is a relatively closed convex subset of W with Θ n N° f φ. Then θ (x) is non-empty if and only if x 6 R (= K°) or x € R - R and Θ <= H" (x - x r β χ )
(1) for some Xj € R, pj € R.
((1) is the same as 5.7(1) with I = 1.)
If θ 0 (x) is non-empty then it consists of a single point. This is the unique point, θ e Θ n N° satisfying (2)
(x - ξ(θ))
(θ - θ) >_ 0
V θ €Θ
(An alternate form of (2) when x - ξ(θ) t 0 is (2 1 )
Θ
c H" (x - ξ(θ), (x - ζ(θ))
θ)
.)
(Note t h a t i f θ ( x ) € Θ t h e n , o f c o u r s e , θ ( x ) = { θ ( x ) } and θ = θ(x) trivially satisfies Proof.
( 2 ) . See 5 . 9 f o r i l l u s t r a t i o n s
£( , x) i s s t r i c t l y concave on hi and hence can assume i t s maximum
a t o n l y one p o i n t o f the convex s e t Θ. Suppose (2) i s s a t i s f i e d . (3)
£ ( θ , x) - £ ( θ , x)
=
(θ - θ)
+ (θ - θ)
=
(θ - θ)
w i t h e q u a l i t y i f and o n l y i f θ = θ. Λ
o f( 2 ) . )
/S
Furthermore, θ f l c w° by Lemma 5 . 4 . Then f o r θ € Θ (x - ξ ( θ ) )
ξ ( θ ) - (ψ(θ) - ψ ( θ ) )
(x - ξ ( θ ) ) + A ( θ f ζ ( θ ) ) - £ ( θ , ξ ( θ ) )
U ( θ , ξ(θ)) - £(θ, ξ(θ)) > 0
when
Λ.
θ t θ since θ = θ(ξ(θ)) is the unique maximum likelihood estimator over N corresponding to the observation ξ(θ) .) Hence (2) implies that θ Q (x) = {θ}.
MAXIMUM LIKELIHOOD ESTIMATION
155
On the,other hand, suppose (4)
(x - ξ(θ Q ))
(Θ Q - θj) < 0
for some θ Q , θ χ € 0
.
Then θp
= Θ Q + p(θ 1 - Θ Q ) € θ
for 0 _< p £ 1 since Θ is convex. Then ^*(θp, x)|p=Q
= ( x - ζ(θ 0 ))
(θ 1 - Θ Q ) > 0 .
Hence £(θ , x) > £(Θ Q , x) for p > 0 s u f f i c i e n t l y small; and ΘQ cannot be the unique maximum l i k e l i h o o d estimator.
I t follows that the unique maximum
l i k e l i h o o d estimator i f i t e x i s t s , must s a t i s f y ( 2 ) . F i n a l l y , i f x € R or (1) is s a t i s f i e d then θ β i s non-empty by Theorem 5.7. Conversely, i f θ € θ ξ(θ)
is non-empty then θ s a t i s f i e s ( 2 ) . Hence ξ =
€ R and
(x - ζ)
θ
<_ (x - ξ)
by (2) so that (1) is satisfied with x χ = ξ. 5.9
θ
||
Construction The criterion 5.8(1) is particularly easy to apply if
0 = (ΘQ + L) Π W for some linear subspace, L. This is because the vectors {(θ - θ) : θ € L} will then span L. Thus, by (1), in order to find θ one need only search for the unique point θ* 6 0 for which x - ζ(θ*) l L. This process can be viewed from two slightly different perspectives. Because of its importance we illustrate both these perspectives in the simplest case where Θ Q + L is a hyperplane. Thus, consider the case where 0 = H n W with H a hyperplane, say H = H(a, α ) . Let x e R. (The same construction also works for x € R - R if 5.7(1) is satisfied.) To find θ (x) one may proceed from θ(x) along the curve {θ(x + pa) : p 6 R} until the unique point at which θ(x + pa) € 0. This point is θ. The process is illustrated in Figure 5.9(1).
156
STATISTICAL EXPONENTIAL FAMILIES An alternative procedure is to map Θ n A/° into R as ξ(Θ Π A/°).
Then proceed along the line {x + pa : p € R} until the unique point at which x + pa € ξ(Θ n N°). This point is x = ξ(θ). This process is illustrated in Figure 5.9(2).
x+pa
Figure 5.9(1):
First construction of θ
Θ
Figure 5.9(2):
Second construction of θ
There are useful paradigms available also for the case where θ is an arbitrary relatively closed convex set.
These are described in 5.13.
The entire process illustrated above may also be viewed from a different perspective.
Θ is contained in a proper linear subset of R .
Hence the densities {p θ : θ e 0} form an exponential family which is not minimal.
This non-minimal family can be reduced by sufficiency and reparame-
trization to a minimal family of dimension k1 < k.
Let (φ-,. - - ,φ. ,) and
MAXIMUM LIKELIHOOD ESTIMATION
(yi>
J Y ^ I ) denote the natural parameters and corresponding observations
this family. H(a,
157
in
(They are formed by p r o j e c t i n g θ and x, r e s p e c t i v e l y , onto
α) or any t r a n s l a t e H ( a , $) . )
This family w i l l
have log-Laplace
transform ψ*(φ) = Ψ ( θ ( φ ) ) , and the m . l . e . , φ, s a t i s f i e s 5 . 5 ( 1 ) —
Φ(y)
=
i.e.
Φ(y)
where φ(y) is the inverse to ξ*(φ) = Vψ*(φ)
Thus
θ(x) = Φ(y(x)) . These remarks can be used to yield a very simple proof of Theorem 5.8 in the special case where Θ = ( θ Q + L) n W.
They also provide a method of
easily constructing the maximum likelihood estimate in many such cases.
Here
are two examples. 5.10a
Example Consider the classical Hardy-Weinberg situation described in
Example 1.8.
( X j , X2> X^) is multinomial (N, ζ) with expectation
2 2 ξ = N(p , 2pq, q ) ,
0 < p = 1-q < 1. This is a three-dimensional exponential
family with two dimensional parameter space Θ = {θ:
= β j d . l . l ) + B 2 (2,l,0) + ( 0 , In 2, 0)} = H ( ( l , - 2 , 1 ) , -2 In 2 ) .
(This family is not minimal.
This fact affects but does not hinder the
reasoning which follows.) Reduction to a minimal exponential family yields a one-parameter exponential family with parameter φ = 2θ 1 + θ 2 and natural observation y = 2x 1 + Xp.
(Θ is two-dimensional but yields a family of only order one
since the original family was not minimal.) (1)
E(Y)
=
Note that
N(2p2 + 2pq)
= 2pN
Hence 2x
(2)
P
=
2N
=
+ x?
2N
°
< y
< 2N
.
158
STATISTICAL EXPONENTIAL FAMILIES
Correspondingly, ξ = N(p , 2pq, q ) and θ can be defined from θ.. = ^ 3, £ R.
+ In ξ..,
(Note that θ is a line rather than a single point because the original
representation of the multinomial family was not minimal.) The simplicity of (1) is the special fact which enables the preceding construction to proceed so smoothly. behave similarly.
Many other multinomial log-linear models
Classes of such models are discussed in Darroch, Lauritzen,
and Speed (1980) and in Haberman (1974).
Here is a useful example.
5.10b Example Consider a 2χ2χ2 contingency table. denoted by y-jj k >
i , j , k = 0 , 1 . They are multinomial (N) variables with
respective probabilities π. ., . such a table.
The observations w i l l be
There are various useful log-linear models for
The derivation of maximum likelihood estimates for such models
provides a useful and illuminating application of the preceding theory.
Here
we consider the model in which responses in the f i r s t category (corresponding to index i ) are conditionally independent of those in the third category given the level of response in the second category.
This model illustrates
several characteristic phenomena, and allows for direct and explicit maximum likelihood estimates of the parameters TΓ. .. . In order to write the model in customary vector-matrix notation, let π
£
z% = y. =
π
iik'
k
where 1= 1 + i + 2j + 4k
*"et ^ ° 9
π
^ (^eno'te ^
e
(1 <_ I <_ 8 ) , and, similarly,
v e c t 0 Γ
with coordinates log π. , λ = l , . . . , 8 .
Let 1 1 D1
=
-1 1
1 -1
-1 -1
1 1
-1 1
1 -1
-1 -1
1
1
1
1
-1
-1
-1
-1
1
-1
-1
1
1
-1
-1
1
1
1
-1
-1
-1
-1
1
1
1
1
1
1
1
1
1
1
The log-linear model of interest here has (2)
θ* = (log π) = DB,
3 e R6 .
MAXIMUM LIKELIHOOD ESTIMATION
159
In order to normalize π one must choose 3g so that
8 Σπ
(3)
= 1
.
The r e s u l t i n g multinomial family i s an 8-parameter exponential family. canonical s t a t i s t i c can be reduced via s u f f i c i e n c y . θ* z = 3'D'z = β'x*.
Its
Let x* = D'z so that
Furthermore x£ = N with p r o b a b i l i t y one.
Hence x € R
with ( x 1 , . . . , x 5 ) = ( X j , . . . , x £ ) is a s u f f i c i e n t , canonical s t a t i s t i c .
The
corresponding canonical parameter is θ € R with ( θ * , . . . , θ g ) = ( β , , . . . , β r ) . I t can be checked t h a t t h i s l o g - l i n e a r family i s characterized by the conditional independence of responses i n categories 1 and 3 given level of response i n category 2. noting that i f i t Γ ,
l
n
The conditional independence can be checked by k ϊ k1 then (2) y i e l d s
V j V
+
l n τ τ
ijk
From t h i s i t follows that π . + π..^
=
l n
= π+
k
π
i'jk
+
Ί n
π
ijk'
ττΊ ή+» which implies the desired
conditional independence. By Theorem 4.5 x is the maximum l i k e l i h o o d estimate of E(X) = ξ ( θ ) . Thus, (log TT) = D $ ( x ) i s the maximum l i k e l i h o o d estimate of (log π) = θ * w i t h β(x)
= ( e 1 ( x ) . - . . » B 5 ( x ) . 3 6 ( β ) ) ' where 3 6 ( ) i s determined by ( 3 ) . The r e l a t i o n between ξ(θ) and π(θ) is easy to determine via simple
calculations such as ξ χ = EfX^
= Σ ( - l ) 1 E ( y i j k ) , etc.
These y i e l d
(4)
Thus
(5)
Σ(-l) Ί 'y i j k
- xj
= NΣί-l) 1 ^.,
,
etc.
160
STATISTICAL EXPONENTIAL FAMILIES From these relationships and the structure of D it is possible in
this case to give explicit expressions for ί ^ ^ } in terms of ί y ^ } -
Let
a "+" replacing a subscript denote addition over that subscript. Thus, τr1 +u + = Σ π.. . Simple manipulation based on (3) and (5) yields 1Jk Nπ1++ (6)
The conditional independence properties yield
£ Hence Nπ
ijk
FUNDAMENTAL EQUATION
5.11
Definition For ΘΛ 6 0 c R define VQ(ΘQ)> the set of (outward) normals to 1/
Θ a t ΘQ e Θ, t o be the s e t o f a l l δ e R (1)
δ
( θ 0 - θ)
satisfying
>. 0 + o ( | | θ Q - θ||)
v
θ eQ
V is obviously a convex cone, and can easily be shown to be closed. Note that if Θ Q e int Θ then V Θ (θ Q ) = {0}. If Θ Q is an isolated point of Θ then V 0 (θ Q ) = R . If Θ is a different!able manifold with tangent space T at Θ Q then V Q ( Θ Q ) is the orthogonal complement of T — i.e., V 0 (θ o ) = {δ: δ
k
τ = 0 V τ e T}. Here V 0 (θ Q ) is a linear subspace of R .
If Θ is convex and θ Q 6 bd 0 then V 0 (θ) = {δ :Θ c fi" (δ,δ θ Q )} .
MAXIMUM LIKELIHOOD ESTIMATION 5.12
Theorem Assume φ is steep and regularly s t r i c t l y convex.
relatively closed subset of N. (1) Let θ € θ (x) c Λ/°.
(2)
V0(θ)
€
Note that
vθU(θ, ζ(θ))|θ=§ = 0
and x - ξ ( θ ) = 0 when θ = θ. 0
< A(θ,x) - £(θ,x)
.
Hence, as i n 5 . 8 ( 3 ) =
(θ - θ)
( x - ξ(θ)) + A(θ, ξ(θ)) - A(θ, ξ(θ))
=
(θ - θ)
( x - ξ ( θ ) ) + o ( | | θ -θ | | )
Thus, by definition, (1) is satisfied.
||
Note that the theorem does not require x € R (= 5.13
Let Θ be a
Then for any θ € Θ 9 (x) n N° x - ξ(θ)
Proof.
(3)
161
K°).
Construction
The fundamental equation, 5.8(1) or 5.12(1), can be used to picture the process of finding a maximum likelihood estimator, by an extension of the process pictured in 5.9. Fix x e R k . Suppose it is desired to locate Θ 0 (x).
If 0 n W° / φ
one should first check to see whether x € ξ(Θ n N°). If so, then θ(x) = θ β (x). /\ If not, then Θ 0 (x) c bd Θ. To see whether a given θ Q e bd Θ n W° can be an /\ element of θ first locate Θ, Θ Q , X , and x Q = ζ(θ Q ) on their respective graphs. Then carry a vector δ pointing in the direction of x - XQ over to ΘQ in order to check whether δ is an outward normal to 0 at θg. If so, then Θ Q is a candidate for θ. In fact, if Θ is convex {θ Q } = Θ 0 (x).
If Θ is
not convex one must search over bd Θ for all such candidates, then examine β(θ, x) at each of them to eliminate those which are not global maxima. (If φ is not regular and Θ is not convex one needs also to search over
STATISTICAL EXPONENTIAL FAMILIES
162
Θ n (W - W 0 ).) The process is illustrated in Figure 5.13(1).
Figure 5.13(1):
θ Q and
are candidates for Θ 0 (x).
θ 2 is not.
If bd Θ is a curve as in Figure 5.13(1) then this process is relatively convenient. Otherwise, it is usually less convenient to search over all of bd Θ for the set of candidates. An alternate picture can also be constructed.
In this picture one
constructs for each θ € Θ the collection of points in X space for which θ can possibly be the maximum likelihood estimator. In order to construct this picture one locates θ € bd Θ and draws the unit outward normal(s), 6, to θ. One then maps θ to ξ(θ) and carries the vector(s) 6 directly over to X space. The corresponding line or cone with vertex located at ξ(θ) θ
χ
is the locus of values of x for which θ e β ( )
ΊS
a
possibility. Again,
if x falls in more than one such locus then &(θ, x) must be separately examined at all such θ. This process is illustrated in Figure 5.13(2).
MAXIMUM LIKELIHOOD ESTIMATION
Figure 5.13(2):
163
C. is the locus of points, x, for which θ. can possibly fall in Θ 0 (x).
5.14
Example The curved exponential f a m i l y described i n Example 3 . 1 2 provides a
p a r t i c u l a r l y elegant instance o f the above c o n s t r u c t i o n .
The family i s a two-
parameter standard exponential family with θ ( λ ) = ( - λ , - I n λ ) 1 , and Θ = { θ ( λ ) : λ > 0} c M =
(-co, o) x R, and
ψ(θ) = l n [ ( e θ l T - l ) / θ 1 + e θ i T + θ 2 ] .
K = conhull { ( 0 , 0 ) , ( T , 0 ) , ( T , 1)} .
Then,
ζ(θ(λ))
1
K a n d ξ ( θ ) on a s i n g l e p l o t .
p"λT
λT , e"λl).
Figure 5 . 1 4 ( 1 ) shows both Θ and
T h e r e i s no o v e r l a p s i n c e Θ cz { ( θ , , θ p ) : θ , < 0 }
a n d K cz { { xy y x 22): ) : Xj >_0}.
The tangent space to θ(λ) is spanned by (-1, -1/λ) 1 .
Hence
V 0 (θ(λ)) is s the line {p(l, - λ ) : p e R}. The locus, C(λ), of points p x for which θ(λ) can be the maximum likelihood estimator is the line
STATISTICAL EXPONENTIAL FAMILIES
164
, --λT (2)
C(λ)
=
ίξ(θ(λ)) + p ( l , -λ):
{(0,
1) + σ ( l , - λ ) :
as can be seen by letting σ =
p e R} =
+ p,
σ € R} - e
-λT + p.
Formula ( 2 ) reveals that the
loci C(λ) are s t r a i g h t lines through the point ( 0 , 1 ) . It or ( T , 1 ) . (0,
e " λ T -λp): p € R}
Again, see Figure ( 1 ) .
can be seen from Theorem 5.7 that θ ( x ) f φ unless x € K is ( 0 , 0)
(Applying 5 . 7 ( 1 )
for points on the i n t e r i o r of the l i n e j o i n i n g
0) to ( T , 1) requires the choice 1 = 2 .
Of course, these points occur with
p r o b a b i l i t y zero, so i t ' s not worth the e f f o r t ! )
Since the loci C(λ) i n t e r s e c t
only a t ( 0 , 1) f. K i t follows from ( 2 ) that i f x t ( 0 , 0) or (T, 1) then θ Θ (.x) is the single p o i n t , θ ( λ ) , f o r which x € C ( λ ) . If
x = ( 0 , 0) or (T, 1) then θ ( x ) = φ since neither of these points /\ l i e s i n U C ( λ ) . (That θ ( x ) = φ i n this case can also be seen by applying the λ€R f i n a l part of Theorem 5.8 to the parameter set consisting of the convex hull of θ).
Figure 5.14(1): Illustrating the construction of θ(x) via construction of loci C.
MAXIMUM LIKELIHOOD ESTIMATION
165
The original description of this example involves a single observation, X, which can take only values in (0 x [0, T]) u {(T, 1)}. -in
However, if one observes X n = n n
Σ X_ where X . are n i.i.d. variables each Ί i =lΊ
with the given distribution, then X p can take values over more of K. This problem has natural parameter θ* = nθ and log Laplace transform ψ*(θ*) = nψ(θ*/n).
It follows that N9K and ξ(θ) are as before. Θ undergoes a simple
transformation.
It is easy to check that the above picture applies equally
well to this problem, for which various values of x € K° are possible. See also Proposition 5.15. From (2) one sees that the maximum likelihood estimator of λ is (3)
λ = (1 - x 2 )/x χ . In terms of the original motivation for this problem the parameter
1/λ is the mean value (= mean lifetime) of the exponential variable Z.
-x 2 )
Thus,
nxΊ
ι
n In this problem nx1η = Σ Y. = "total time on test", and n(l - xά 9 ) = (number of i = lΊ observations < T) = "number of objects failing before truncation". This supplies the familiar expression for this problem: (3")
/\ (1/λ) =
total time on test number of objects failing before truncation
Note that the value of T does not appear in (3"). This fact has been commented on and exploited by Cox (1975) and many others. It has been noted that the differentiate subfamily treated in this example is a stratum within the full two parameter family.
It is really this
fact which explains the elegance of the above construction and of Figure 5.14(1). See Exercise 5.14.1 - 5.14.3. In general the maximum likelihood estimate for an i.i.d. sample
166
STATISTICAL EXPONENTIAL FAMILIES
is determined exactly as that from a single observation. The latter part of Example 5.14 mentions one special case of this. It is worthwhile to formally note this fact. 5.15 Proposition Let X-,... ,X be i.i.d. random variables from a standard exponential family {p n : θ € 0} . Let θ^ denote the set of maximum likelihood estimators of θ € 0 on the basis of a single observation. The maximum likelihood estimator of θ € 0 based on the sample 1
n
X Ίl>...,X nn is a function of the sufficient statistic, Xn = n .Σ X. . Let =1 l θ ^ ( ) denote this function of X n
θ< n ) (ϊ) = θΘ("x)
(1) Proof.
Then .
The cumulant generating function for the sufficient statistic
S = nX is nψ(θ). The proposition follows from the fact that £ n ψ (θ, s) = θ
s - nψ(θ)
= n(θ s/n - ψ(θ)) = n£ ψ (θ, s/n)
,
since this shows that &niκ( > s) is maximized i f and only i f I ( , s/n) is maximized.
II
MAXIMUM LIKELIHOOD ESTIMATION
167
EXERCISES 5.6.1 Verify formula 5.6(6). 5.6.2 The multivariate generalization of the beta distribution is the Diriohlet θ
0
=
V(a), d e f i n e d a s f o l l o w s : k-1 1=l. ..k; γk = Ί " Σ V
distribution,
k .Σ V
γ
i
>
°>
k _> 2 ; t h e
θ
> 0, i = l , . . . , k ,
d i s t r i b u t i o n has
density with respect to Lebesgue measure over the allowable { ( y ^ . ^ y , , Ί ) }
(l)
fθ(y)
Γ(θ n )
=
k
°
k
(Θ.-1) Ί
πy
π Γ(Θ.) 1 i=l
> i Ί=1
.
This is a k-parameter exponential family with canonical s t a t i s t i c X. = In Y. . (1) (ii)
E(Y ) l
Describe K. Verify the standard formulae:
= θ./θ n i
Var(Y.)
ϋ
l
(θ π -θ.)θ. = —-—•—— 2
(2) θ.θ. Cov (Y Ί , Y.)
(iii)
=
-
Ί
j
2
D e r i v e f o r m u l a e f o r E(X.)
analogous t o 5 . 6 ( 6 ) , ( 7 ) . s
(iv)
j=l,...,£.
Let 1 = s n < . . . < s
= k and d e f i n e z . =
=
i ζ
j Σ
+1
Y
,
Show that Z has a P(θ') distribution, and describe θ 1 in terms of
θ. (v) variables.
Let Y ' 1 ^ , i = l , . . . , n be independent, k-dimensional ^ ( θ ^ 1 ^ )
-1 n ( i ) Verify that the d i s t r i b u t i o n of n Σ Yv '
is
168
STATISTICAL EXPONENTIAL FAMILIES
5.6.3 Let XΊ , i = l , . . . , k, be independent Γ(α., 3) variables. Describe k the conditional distribution of the variable (X Ί S ..-,X b ) given Σ X. as a 1 κ i=l Ί multiple of an appropriate Dirichlet variable.
(Note the partial analogy
between the situation here and that in Example 1.16. Note also that the situation here was described from another perspective in Exercise 2.15.1.) 5.6.4 The following is a valid statement:
the k-dimensional Dirichlet
distributions form the family of (proper) conjugate priors for the parameter ( p . , . . . ,p. _-) of a k-dimensional multinomial distribution.
Relate this
statement to the general theory of Sections 4.18-4.20, and describe (in terms of the Dirichlet parameters) the posterior expectation of p given the multiθ.
nomial. observation.
k-1
θ
[Let p. = e / ( I + Σ e ), e t c . ] 1 j=l
(This conjugate relation between Dirichlet and multinomial distributions has an i n f i n i t e dimensional generalization in which the Dirichlet distribution is replaced by a "Dirichlet process" and the multinomial distribution is replaced by a distribution over the family of cumulative distribution functions on [ 0 , 1]. See Ferguson (1973) and Ghosh and Meeden (1984).) 5.7.1
(i)
Show that 5.7(1) implies x £ (ξ(θ))~.
(ii) Show the converse is not valid by constructing an example in which φ = ψ, R = K° is not strictly convex, x £ (ξ(Θ))~, and 5.7(1) fails.
(I believe no example exists when R is strictly convex. See
Exercise 7.9.2 which shows that when R = K° is strictly convex and x £ (ξ(θ))~ then θ(x) t φ.) [(i) x £ (ξ(H"((x - x Ί ), £.)))" for x. € R, x e R - R. (ii) Let v give mass 1 to each of the four points (+ 1, +1). Let x = (1, 0)
MAXIMUM LIKELIHOOD ESTIMATION
169
and Θ = {(t, 2): t € R} .] 5.7.2 Construct examples in which φ = ψ is steep, R = K°, x e (ξ(θ))~, and (i) θ(x) = φ, (ii) θ(x) f φ. [For both examples let v be the uniform 7 2 distribution on the ball {x: (xΊ - 1) + Σ x.} plus a point mass at 0. For 1
(i)
i=2
l e t 0 = ίθ: θ = (α, 0 , . . . , 0 ) } .
every
Ί
For ( i i ) l e t 0 ={θ : ψ(θ) = 3 } .
u n i t vector v + e , there i s a unique η ( v ) > 0
As v -» e - 9 η ( v ) -* « and hence ξ ( η ( v ) v ) -> 0 .
For
such t h a t ψ ( η ( v ) v ) = 3.
Hence 0 £ ( ξ ( θ ) ) " . ]
5.8.1 Let {p . θ e 0} be a standard one-parameter exponential family. u Suppose ξ ( 0 ) is an unbounded i n t e r v a l — i . e . ξ ( 0 ) => ( ζ Q , ξ j or ξ, = +oo.
with ξ Q = -°°
For ξg < A < ξ 1 suppose e i t h e r
(1)
ζQ
=
- oo and
ξ-
=
oo
A J J(ξ)dξ
= co
or and
1
with J " 1 ^ )
= θ'(ξ) =
estimating ξ.
V a r θ
(ε)(χ)>
is minimax; and
the
t n a t
J(ζ)dξ
J
=
»
denotes the Fisher information for
Consider the problem of estimating ξ under the loss 4 . 6 ( 1 )
i . e . L ( ξ , 6) = J ( ξ ) ( ό - ξ ) 2 .
admissible,
s o
/ A
Show t h a t :
( i ) the maximum l i k e l i h o o d
estimator
( i i ) i f 0 9 W then the maximum l i k e l i h o o d estimator is not
( i i i ) Give examples when 0 = N and ξ ( 0 ) is unbounded in which
maximum l i k e l i h o o d estimator is not minimax, is minimax but not admissible,
is both minimax and admissible,
( i v ) Can you generalize ( i ) to a k-parameter
family? [Let
(2)
—
αn Ψ ξQ,
h^2 ( ξ )
=
3n + ξχ
m1n( /
and
J(t)dt,
Kp,
/
J(t)dt)
170
STATISTICAL EXPONENTIAL FAMILIES
where K is chosen so that h is a probability density. Show K n -+ 0 because of (1). Then use 4.6(2). For (ii) use Theorem 4.24.] 5.9.1 Consider the general l i n e a r model as defined i n 1.14.1.
(a) Verify
that the usual least squares estimators of ξ are also the maximum l i k e l i h o o d estimators ( i . e . μ = Bξ). (b) What is the maximum l i k e l i h o o d estimator of 2 σ ? Is i t unbiased? (Assume m ^ r + 1.) (c) Generalize the preceding p
questions to the situation where Y ~ N(μ, σ ΐ) with μ = Bξ as in 1.14.1 and t a known positive matrix.
[The maximum likelihood estimates are the usual
generalized least squares estimates.] 5.9.2 Generalize 5.9.1 to the multivariate linear model defined in 1.14.3. 5.9.3 Let (X., X2) be the canonical statistics from a normal sample with mean μ. and variance σ?> and let (Z,, Z2) be from an independent normal sample with mean μ 2 and variance σ|.
Suppose μ, <_ μ 2 , but the parameters are
otherwise unrestricted. Show that ( μ . , μ2) = U p z,) i f x, <_ z, and otherwise /\
/\
/\
μ, = μ 2 = μ is the unique solution to
(Assume x 2 > x? and z 2 > z^, which occurs with probability one.) 5.9.4 Let ξ be a normally distributed vector with mean 0 and covariance matrix I.
Given ξ let Y be distributed according to the general linear
model 1.14.1.
(Assume m •> r + 1.)
Suppose B'B is diagonal and t <= P, a
relatively closed convex subset of positive definite diagonal matrices, (a) Show that the (marginal) distributions of Y form an exponential family
MAXIMUM LIKELIHOOD ESTIMATION with Θ a r e l a t i v e l y closed proper convex subset of N. are minimal s u f f i c i e n t s t a t i s t i c s . ]
171 [ ζ and |(Y - Bξ) |
(b) When V is a l l p o s i t i v e d e f i n i t e
diagonal matrices describe the maximum l i k e l i h o o d estimates of 2, σ . (c) Extend (b) to include other suitable subsets, P.
(d) The preceding is a
canonical form f o r a class of random e f f e c t s models (see, e . g . , Arnold (1981)). To see t h i s convert the usual balanced one-way or two-way random e f f e c t s models to a model of this form by applying suitable l i n e a r transformations to the usual parameters.
[For the one-way model having E(Y..) = μ + α . , 1J
μ ~ N ( 0 , σ*)9 μ
a. ~ N(0, σ*), i
i =l , . . . , I ,
j=l,...,J
Ί
l e t ζ. = Iμ +
u
I Σ α,
J.
*
i
and ( ξ 2 i - . . » ξ j ) = ( α ^ . . ^ j j M where M is a I x ( I - 1) matrix whose columns are orthonormal and orthogonal to 1.] [The following three exercises concern the 2χ2χ2 contingency t a b l e . ] 5.10.1 Consider the model under which the f i r s t category and t h i r d category are (marginally) independent ( i . e . , π..+k
= ττi++π++k).
Show t h i s is a
l o g - l i n e a r model and f i n d an e x p l i c i t expression f o r the maximum l i k e l i h o o d estimator. 5.10.2 Consider the l o g - l i n e a r model described by the r e s t r i c t i o n 0 = φ 1 + φ- + φ β + φ 7 - (φ 2 + Φ3 + Φ5 + Φg)
(This is the model described by
the phrase, "no t h i r d - o r d e r i n t e r a c t i o n s . " )
Write the equation(s) determining
the maximum l i k e l i h o o d estimator.
Determine that these equations do not have
a closed form s o l u t i o n , such as 5.10(7). (1980).
(See Darroch, Lauritzen, and Speed
In such a case the l i k e l i h o o d equations must be solved numerically.
The usual methods are the E-M algorithm or the Newton-Raphson algorithm. Bishop, Feinberg, and Holland (1975) and Haberman (1974).) 5.12.1 Consider the model described by h = TΓQ++ = π 1 + +
= π+Q+ = π+1+
See
172
STATISTICAL EXPONENTIAL FAMILIES
Show this corresponds to a d i f f e r e n t i a b l e subfamily within the f u l l exponential family, but is not a log-linear model. for
Find the maximum likelihood estimator
this d i f f e r e n t i a b l e subfamily [TΓQQ+ = π--+
.]
5.14.1 Let
{ p θ : θ € 0} be astratumof
family, as defined in Exercise 3 . 1 2 . 1 .
regular (or a steep)
exponential
(a) Show that for x € R the maximum
likelihood estimator exists and satisfies
(i) ζ
(2)
X
(2)
(b) Discuss the situation when x € R - R.
(c) Show (by example) that there
can be two solutions to (1); but there can never be more than two. Is it possible for both of these solutions to be maximum likelihood estimators? [Suppose the family is defined by ψ(θ) = ΨQ. Note that the set {θ: ψ(θ) £ Ψ Q } is convex and apply Theorem 4.8.] 5.14.2 Show how the result of Exercise 5.14.1 directly yields 5.14(3'). [Translate x 2 ] 5.14.3 Apply 5.14.1 to describe the maximum likelihood estimator for the other examples discussed in 3.12.2. 5.15.1 Let X 1 9 . . . , X exponential family.
be i . i . d . with distribution pQ from a canonical
Let K <= W° be compact.
Then x n is uniformly asympto-
t i c a l l y normal over θ € K with mean ξ ( θ ) and covariance matrix 1 = = n" 1D 2 ψ(θ).
[Apply Theorem 2.19.]
5.15.2 Consider the setting of 5.15.1:
(a) The maximum likelihood
MAXIMUM LIKELIHOOD ESTIMATION
173
estimators θ n and ξ n e x i s t with p r o b a b i l i t y approaching 1 as n •* » uniformly over θ € K.
(b) They are asymptotically normal uniformly over θ € K
with means θ and ζ and covariances n C(a)
£
( θ ) and n
£ ( θ ) , respectively.
P ( x n (. R) converges to 0 ( e x p o n e n t i a l l y f a s t ) , uniformly on K.
g ( t ) = g ( t Q ) + ( h ( t o ) ) ' ( t - t Q ) + o ( | |t - t Q | |)
(b)
if
then g ( x R ) i s asymptotically
normal with mean ξ ( θ ) covariance h' ( ξ ( θ ) ) £ ( θ ) h ( ξ ( θ ) ) s
uniformly f o r θ € K.]
5.15.3
Let X , , . . . , X
be i . i . d . with d i s t r i b u t i o n p
exponential subfamily {p Q : θ € Θ}.
Let K c θ be compact,
uniformly asymptotically normal over θ e K with mean θ .
from a d i f f e r e n t i a t e ( a ) Then ΘM is (b) For a curved
exponential family with θ = θ ( t ) the maximum l i k e l i h o o d estimator t uniformly asymptotically normal a t θ ( t ) G K with mean t . asymptotic variance o f t
of t is
( c ) Write the
as a function o f 2 ( θ ( t ) ) , θ ' ( t ) , and the s t a t i s t i c a l
curvature a t t of the curved exponential f a m i l y .
[See Theorem 5 . 1 2 , the h i n t
to 5 . 1 5 . 2 ( b ) , and Section 3 . 1 1 . For ( c ) , and f o r a geometric i n t e r p r e t a t i o n o f ( a ) and (b) note t h a t */FΓl Iς
- ξ 11 -> 0 i n p r o b a b i l i t y where ξ
p r o j e c t i o n i n the inner product <s, t > = s 1 Σ a t θ to Θ.
( θ ) t of x
denotes the
on the tangent l i n e
I f the problem is w r i t t e n i n the canonical form of Section 3.11
the asymptotic variance is
I.]
5.15.4 1
Let {p Q : θ e 0} be a curved exponential family. Let θ E W but θ 1 ί 0. Assume (w.l.o.g.) that the family has been written in the canonical form 3.11(1) - (4) with 0 = i 0 (9') = θ Θ (ξ(θ')). Show θ' = (0,α,...,0) with 1
α £ p. Let X..,...,X be i.i.d. observations under θ from this family and let t be the maximum likelihood estimator of t. Show that if α
CHAPTER
6. THE DUAL TO THE MAXIMUM LIKELIHOOD ESTIMATOR
KULLBACK-LEIBLER INFORMATION (ENTROPY) Before turning to the dual of the maximum l i k e l i h o o d estimator we define the Kullback-Leibler information, and prove a few of i t s simple properties.
The goal of t h i s detour is to provide a natural p r o b a b i l i s t i c
i n t e r p r e t a t i o n f o r t h i s dual as the minimum entropy expectation parameter. 6.1
Definitions Suppose F, G are two p r o b a b i l i t y d i s t r i b u t i o n s with densities f , g
r e l a t i v e to some dominating information
σ - f i n i t e measure v.
The
Kullbaok-Leibler
of G at F is
(1)
K(F, G)
with the convention that °°
=
EF(ln(f(x)/g(x)))
0 = 0,
0 / 0 = 1 , and y/0 = °° f o r y > 0.
K is als
referred to as the entropy of G a t F. I t can easily be v e r i f i e d that K(F, G) is independent of the choice of dominating measure v.
The existence of K w i l l be established i n
Lemma 6.2 where i t is shown that 0 <_ K £«>. In exponential families i t is convenient to w r i t e (2)
K(Θ Q , θ j )
=
K(PΘ , Pθ ) , 0
For (3)
ΘQ, θ j e N
1
S c H let K(S, θ χ )
=
inf{K(θQ, θ j ) :
etc. 174
ΘQ € S}
,
THE DUAL TO THE MLE
175
K( , ) as defined i n (2) has domain A/χA/.
I t is convenient to
also transfer this d e f i n i t i o n to the expectation parameter space.
Accordingly,
define K(ξ Q , ζ χ ) by (4) for
K(ξ Q , ξ χ ) ( ξ g . ξ j ) € ξ(W°)
x ξ(M°).
=
K(θ(ξ Q ), θ ( ξ 1 ) )
I f the family is steep this d e f i n i t i o n is
valid
on K° x K°. I t is also sometimes convenient to extend the κ
d e f i n i t i o n of K( , ζ,) to a l l of R , by lower semi continuity. for (5)
a minimal steep family, and for ξ Q € R - K°, K(ξ 0 , ξ 1 )
=
For ξ f. K9
lim i n f { K ( ξ , ξ χ ) : eΨO
Accordingly,
ξ>1 € K°, define
ξ € K°9 | |ξ - ξ Q | | < ε}
ξ 1 € K° define K(ξ, ξ j )
(6)
=
-
I t is to be emphasized that this is a formal, analytic extension of the d e f i n i t i o n .
κ
(ξn» ξ i ) f ° r £n f- ^° does not necessarily have a
p r o b a b i l i s t i c interpretation l i k e ( 1 ) .
(Sections 6.18+ give a p r o b a b i l i s t i c
interpretation of K, valid under some auxiliary conditions.) K is often called the Kullback-Leibler "distance" from ΘQ to θy but
i t is not a metric in the topological sense.
general -- not symmetric.
There i s , however, one yery important special case
where K is symmetric and ( K ) 2 is a metric: {P } = {ΦΛ - : θ e R θ θ,2 s t a t i s t i c t* x (7)
,
In p a r t i c u l a r , i t is -- i n
the normal location family,
forms a standard exponential family with canonical
(see Example 1.14), and has K(ΘQ, θj)
=
(θj - θo)iZ"1(θ1 -θQ)/2
The following proposition has already been mentioned above.
176
STATISTICAL EXPONENTIAL FAMILIES
6.2
Proposition For any two distributions K(F, G) exists and satisfies
(1) K(F,
0 <_ K(F, G) £ oo G) = 0 i f and only i f F = G.
Proof.
E F (1n(f(X)/g(X))) = E F ( - l n ( g ( X ) / f ( X ) ) )
>
-In E F (g(X)/f(X))
=
-In 1 = 0
by Jensen's inequality, with equality i f and only i f f = g a . e . ( v ) .
||
For exponential families K has an especially simple and appealing form. 6.3
Proposition Let {pθ> be a standard exponential family.
I f ΘQ € W°, θ j € N
then K(θ Q , θ χ )
( 1)
{Bemark. K(θ Q ,
=
(θQ - θχ)
ξ ( θ Q ) - (ψ(θ 0 ) - ψ ί θ j ) )
=
log (p θ ( ξ ( θ o ) ) / p θ
(ξ(θQ)))
Suppose { p θ ) is steep and ΘQ € N - N°, θ 1 € W°.
θ χ ) = « = lim
K(η, θ χ ) for {ηΊ.} c hl° by steepness.
Then Since the only
^i^o sensible interpretation for ( θ Q - θ j
? ( 6 Q ) is « here, (1) may be considered
valid for a l l ΘQ € hi for regular or steep Proof.
Note
families.)
that
ln(pθ (x)/pθ (x)) and E θ (X) = ξ ( θ Q ) .
||
=
( θ j - ΘQ)
x - (ψtθj) - ψ(θ0))
THE DUAL TO THE MLE 6.4
177
Remark The second part of 6.3(1) shows how the Kullback-Leibler
tion is related to maximum likelihood estimation.
For S c N l e t
(1)
θ χ € S}
K(ΘQ, S)
=
inf{K(θQ, θ j ) :
informa-
Then, by 6 . 3 ( 1 ) , i f ΘQ e A/° (2)
K(ΘQ, S)
=
K(ΘQ, θ)
for θ € S i f and only i f θ e θ s ( ξ ( θ Q ) ) . In other words, for steep families, for Θ = S, and for an observation x € K° the maximum likelihood estimator is the closest point in S to θ(x) in the Kullback-Leibler sense.
(For observations x € K - K° such
an interpretation requires an extension of the definition of K l i k e that to be provided in
Sections 6.18+.)
Note also that K(ΘQ, θ χ )
(3)
= £ ( θ 0 , ξ ( θ Q ) ) - 1{QV
ζ(θQ))
The fact that the quantity on the right is positive (for ΘQ e M°,
Q- f θ Q )
has already been used in 5.8(3) and 5.12(3). 6.5
Theorem
Let {p } be a standard exponential family. θ i n f i n i t e l y d i f f e r e n t i a t e on W° x W°. On W° (1)
VK(ΘQ, •)
(2)
D 2 K(Θ Q , •)
=
ξ( ) - ξ ( θ Q )
= D2ψ( ) = Z( ) ,
If {p_} is minimal and steep then on K° Ό
(3)
Then K( , ) is
VK( , ξ χ ) = θ( ) - θίζj)
Θ Q € H°
178
STATISTICAL EXPONENTIAL FAMILIES
(4)
D2K( , ξj) = Γ ^ θ t )) , Consequently,
(5)
If
K(ξ, ξ j )
ξj € /C°
given ξ , e K° and ε- > 0 there is an ε« > 0 such t h a t
>. ε 2 | | ξ - ζ 1 | |
whenever
llξ-ξjll
>
εj
s c K° is compact then a value ε ? > 0 can be chosen so t h a t ( 5 ) is
valid
uniformly f o r a l l ξ , 6 S.
Proof.
Formulae ( 1 ) - ( 3 ) a r e s t r a i g h t f o r w a r d from 6 . 3 ( 1 ) .
(Note a l s o
t h a t ( 1 ) , ( 2 ) a r e merely a r e s t a t e m e n t o f 5 . 3 ( 1 ) , ( 2 ) . ) ( 4 ) f o l l o w s from ( 3 ) by t h e i n v e r s e f u n c t i o n theorem s i n c e θ ( ) = ξ
( • ) and V ξ ( ) = Σ( )
Formula ( 5 ) f o l l o w s from ( 3 ) , ( 4 ) as d i d t h e analogous c o n c l u s i o n 5 . 3 ( 3 ) , and 5 . 3 ( 5 ) o f Lemma 5 . 3 f o l l o w from 5 . 3 ( 1 ) , ( 2 ) . The a s s e r t e d u n i f o r m i t y o f ( 5 ) over ζ 1 € S i s easy t o check i n t h a t p r o o f . (Note:
||
i f p Q i s not minimal 6 . 5 ( 3 ) u
v a l i d w i t h %" i n t e r p r e t e d as a g e n e r a l i z e d
is s t i l l
v a l i d and 6 . 5 ( 4 )
is
inverse.)
CONVEX DUALITY 6.6 Definition Let φ: R -> (-«>,<»] be convex.
The convex
dual
of φ is the function
k
d : R -> [-oo, oo] d e f i n e d by d φ ( x ) = s u p U φ ( θ , x ) : θ e Rk}
(1) (Recall, JL(θf x) = θ
x - φ(θ).)
We w i l l be i n t e r e s t e d i n t h e s i t u a t i o n when φ i s r e g u l a r l y s t r i c t l y convex and s t e e p . l( (2)
9
(See D e f i n i t i o n 5 . 2 . )
Then i f x e R = ξ ( N φ h
x ) i s s t r i c t l y concave on hi. and V£( , x ) ι θ / x ) = 0 . d φ ( x ) = £ φ ( θ ( x ) , x)
for
x € R
Thus
= ξ(W°)
(In such cases, and somewhat more generally, the pair (d., R ) is called the
THE DUAL TO THE MLE L e g e n d r e t r a n s f o r m o f ( Φ , Λ/ φ ).
179
I t i s e a s y t o check f r o m ( 2 ) a n d Theorem 6 . 5
that (3)
dd (θ) = φ(θ) Φ
for
θeW°
It can be shown that (3) actually holds for all θ € R , but we do not need this fact in what follows.) Suppose ψ is the cumulant generating function of a steep exponential family. Then (4)
dψ(xQ)
= K ( x 0 ,X ; L ) + θ ( X l )
xQ
x Q
If the coordinate system and dominating measure are chosen so that ψ(0) = 0 = ξ(0) then (4) becomes (4 1 )
d φ (x Q ) = K(x 0 , 0)
x € K°
This provides a p r o b a b i l i s t i c i n t e r p r e t a t i o n f o r d(x) on K°.
I t w i l l be
seen l a t e r t h a t d( ) is the maximal lower semi continuous extension o f (d(x):
x € K°) to a l l of R k , and ( 4 ) is v a l i d f o r a l l
xQ € R k .
Lemmas 6.7 and 6 . 8 and Theorem 6.9 present some important basic facts about convex d u a l i t y .
They are j u s t the t i p of a r i c h theory.
We w i l l
not f u r t h e r develop t h i s theory as an a b s t r a c t u n i t ; although other important features of the theory are i m p l i c t in r e s u l t s we s t a t e elsewhere ( e . g . Theorem 5 . 5 ) .
A u n i f i e d presentation of the theory appears i n R o c k a f e l l e r
( 1 9 7 0 ) , and many elements of i t are i n B a r n d o r f f - N i e l s e n ( 1 9 7 8 , e s p e c i a l l y Chapters 5 and 9 ) .
6.7
Lemma
The convex dual d is a lower semi continuous convex function. Hence,N. is convex. Suppose φ is regularly strictly convex. Then d is strictly convex and twice different!able on R. On R
180
STATISTICAL EXPONENTIAL FAMILIES
(1)
Vd(x) = θ(x) ,
and D2d(x) = ( D ^ ) " 1 (θ(x)) .
(2)
Proof.
Since d is the supremum of linear functions i t is lower semi-
continuous and convex. For x € R,
d(x) = x
θ(x) - ψ(θ(x)).
the same computation that yielded 6.5(3), ( 4 ) . since D2d is positive definite.
Hence ( 1 ) , (2) hold, by
d is s t r i c t l y convex on R
( I t is possible to also directly establish
s t r i c t convexity without requiring that φ be twice d i f f e r e n t i a t e . )
||
I t is now convenient to consider
£ d (x, θ)
= x
θ - d(x)
.
Under the conditions of Lemma 6.7 Vd(x) = θ(x) so that for Θ 6 W ° &Λ('>
Θ
)
ΊS
uniquely maximized at the value x for which θ(x) = θ.
is precisely ξ ( θ ) .
This value
This interpretation is developed further below, especially
in Definition 6.10. The following equivalent expression for steepness is a fundamental building block in the proof of Theorem 6.9, and has other uses.
6.8
Lemma
Let φ be regularly strictly convex. Then φ is steep if and only if
implies (2)
l|Vφ(θ.)|| - -
.
THE DUAL TO THE MLE Proof.
181
Assume (1) implies (2). Let θ0 n e ™ N°,' θ, e W - N°, l c
θ p = ΘQ + p ( e j - θ 0 ) .
σ
Then
' θo)
=
d ξ θ
( ( P ) ) -ξ ( θ p )
• «<ΘP>
(Θ
θ 0
P' V -* ( V
d is s t r i c t l y convex and twice d i f f e r e n t i a t e on the open set R with (D ? d) nonsingular on R.
Hence
(4)
for
lim
£ d ( x , θ)
every θ € θ(R) = N° by Lemma 5 . 3 ( 3 ) .
(5)
ξ(θp)
Since θ . e A/, 1
(6)
Since | | ξ ( θ p ) | | + », by ( 2 ) , we have
( θ p - ΘQ) - φ ( θ p ) =
l i m φ(θ ) = Φ ( θ Ί ) i s f i n i t e . p L p+1
ξ(θp)
( θ 1 - ΘQ)
=
ξ(θp)
= -oo
-Ad(ξ(θp),
ΘQ)
-«.
This implies
( θ p - θ Q ) / p •> -
as
p t l
By d e f i n i t i o n , Φ i s steep. Conversely, suppose there i s a sequence s a t i s f y i n g ( 1 ) f o r which (2)
fails.
The sequence can be chosen so t h a t
sup 11Vφ(θ i )11
=
B
<
-
This means that ξ(θ.) = Vφ(θ ), i=l,... is a bounded sequence, thus, without loss of generality, the original sequence {θ..} can be assumed to have been chosen to satisfy ξ(θ.j) -> x*. Hence, for any θ 1 € Rk
,
182
STATISTICAL EXPONENTIAL FAMILIES
(7)
θ
x* - φ(θ) = Tim (θ.
ξ(θ.) - φ(θΊ.))
>_ Tim sup (θ 1 ξ(θΊ.) - Φ(θ')) =
θ
1
x* - φ ( θ ' )
It follows that (8)
d(x*)
= θ
x* - φ(θ) < °°
This means t h a t θ f. hi° s a t i s f i e s θ € θ ( x * ) . impossible i f φ i s steep.
Hence
Proof of Proposition 3.3.
By Theorem 5.5 t h i s is
φ i s not steep.
||
I t is now easy to prove the converse assertion
in Proposition 3.3, namely that a minimal exponential family satisfying (9)
E 0 ( | | x || )
= oo
for
θ G W - W°
is steep. By Fatou's lemma i f 11m ||Vψ(θ.)|| Hence (2) is s a t i s f i e d . 6.9
=
{θ } s a t i s f i e s (1) then
l i m ||E θ .(x)||
> l i m E 0 .( | |x| |)
= «
Thus ψ i s steep, which is the desired r e s u l t .
. ||
Theorem
Assume φ is steep and regularly strictly convex. Then d. is also, and (1) Proof.
«"d = % φ Let x Q e R,
v e Rk.
Note that p > 0 since R i s open. and x p = x Q + p (
X l
- χQ).
- ξ(N ) .
Let p y = i n f {p > 0:
x Q + pv £ R} .
Assume p < °° and l e t x, = x Q + p v
Note that x 1 ί R.
Suppose i t were true that
THE DUAL TO THE MLE (2)
183
lim inf | |θ(x ) 11 < co . p pfl
Then there would be a sequence p.. t 1 with θ ( x p . ) + θ * , say. X j f. R = ξ ( W ° ) .
θ * (. A/° since
But then, since φ is steep, t h i s would imply
I Up.11 = l l ξ ( θ ( χ p . ) ) l l
- -
by Lemma 6 . 8 , which is a c o n t r a d i c t i o n since x p . -> x - .
Hence ( 2 ) is f a l s e ;
so t h a t a c t u a l l y
(3)
lim
||θ(xj||
=
oα .
p
ptl
The argument i n the f i r s t p a r t of the proof of Lemma 6 . 8 applies to y i e l d the dual to 6 . 8 ( 6 ) , namely (4)
θ(xp)
(xχ - x0)
-> oo
as
p t l
.
(Technically, the lemma as stated cannot be directly quoted since we have not yet established that R = M . so that d is regularly strictly convex. But, d has the desired convexity and differentiability properties on R c w, by Lemma 6.7. It is then easy to check that the first part of Lemma 6.8 indeed applies since p
} c R and yields (4) as the dual of 6.8 (6).) i
d is therefore a convex function with
(5)
^ d ( x
+ p(xj - x 0 ) ) +
oo
as
p
t
l
.
This implies t h a t
(6)
d(xQ + p ( x χ - x 0 ) )
= «
for
p > 1
Since the above argument applies f o r a l l
v € R , i t yields
(7)
for
Thus R
d(x)
=> W..
d(x) = θ ( x )
This y i e l d s
=«,
x £ R
(1) s i n c e , a l s o , R c W d
x - Φ ( θ ( x ) ) < oo on R.
because
.
that
184
STATISTICAL EXPONENTIAL
FAMILIES
I t now follows that d is regularly s t r i c t l y convex since i t has the desired smoothness properties, e t c . , on R = N°. by Lemma 6.7. d is steep since (5) applies to any xQ e R, Remark.
x 1 e R - R.
And, f i n a l l y ,
| |
Since d i s convex, lower semi continuous, and d ( x ) = °° f o r x f. R
i t must be t h a t d( ) on R i s the maximal lower semi continuous extension o f d(x):
x e R
(= K°) to a l l o f R k . d(xj
=
1
That i s , f o r x χ € R - R
lim inf {d(x): εΨO
x € R, I | x - x . I I < ε }
I t follows that i f {p Q } is a steep exponential family. between d(x Q ) and K(x Q , x χ ) is valid for a l l xQ € Rk,
L
The relation 6.6(4) x χ € K°.
MINIMUM ENTROPY PARAMETER The path has been prepared for the definition of the dual to maximum likelihood estimation,
and for the basic existence and construction
theorems. 6.10
Definition I,
Let d: Let S
R - * ( - « , °°] be convex and lower semi continuous.
Define
ξs(θ)
=
{ξ € S: £ d ( ξ , θ) = A d (S, θ) = i n f { λ d ( x , θ ) : x € S}} .
Obviously ξ^ is related to I.
in the same fashion as θ, the maximum likelihood
estimator for an exponential family, is related to the log likelihood £,..
function
( I t would therefore seem logical to adopt the notation ξ ς rather than ξς.
However for reasons of convenience and tradition we wish to reserve the notation ξ s for the set of maximum likelihood estimates of expectation parameters.
That i s , ξς(x) = ξ ( θ s ( x ) ) .) The function ξL has been given a variety of f a i r l y
appelations.
inconvenient
For example, values in ξς(θ) can be called minimum entropy
THE DUAL TO THE ME
185
(expectation) parameters r e l a t i v e to the set S c K°.
Barndorff-Nielsen (1978)
refers to values θς(x) = θ ( ξ ς ( . θ ( x ) ) ) , x e K°9 as maΰsimum likelihood
predictors.
(Note however t h a t ξ $ ( θ ) n (K - K°) t φ is possible even i f {p θ > is regular as long as S is not convex
(see Theorem 6.13).
Hence values i n ξ need not always
be expectation parameters.) Another i n t e r p r e t a t i o n is provided by the Kullback-Lei b i e r information.
Consider a steep minimal exponential f a m i l y .
I f ξ e ζς(θ) Π K°
then K(ξ, ξ ( θ ) )
=
i n f {K(x, ξ ( θ ) ) :
Thus, θ € θ ( ξ ς ( θ j )
x € S n K°}
.
is a parameter i n Θ(S) whose Kullback-
Leibler distance to θ, is a minimum over a l l parameters i n θ(S). Suppose { p f i } i s a minimal, steep standard exponential f a m i l y . Then Theorem 6.9 establishes that d, is steep and r e g u l a r l y s t r i c t l y convex with R = ζ(W°) θ i n Chapter 5.
= K°.
Consequently ξ possesses the properties established f o r
The main properties are formally stated below; t h e i r proofs
consist only of reference to the appropriate results i n Chapter 5. Convention.
In the f o l l o w i n g statements {p Q } is a minimal steep standard
exponential f a m i l y . 6.11
Note that R = K° c Wd c K.
Theorem If θ €
then
UΘ)
(1)
I f θ e N - N° then ξ w (θ) i s empty. Proof.
This i s the dual statement to Theorem 5 . 5 .
||
Note t h a t (2)
θ(ζw(θ(x)))
= θ w (x)
,
etc.
In other words, f o r a f u l l exponential family the maximum l i k e l i h o o d p r e d i c t o r
186
STATISTICAL EXPONENTIAL FAMILIES
is the same as the maximum likelihood estimator.
However (2) does not extend
to non-full families. 6.12
Theorem Let S cW.be a non-empty, r e l a t i v e l y closed subset of W^. Suppose
θ e N°.
Then ζ ( θ ) is non-empty. Suppose θ € W - W° and there are values θ i € W°,
i=l,...,I
and
constants $.. < » such that
I (1)
S c
y
H " ( θ - θ . , (3.) .
Then ξ(θ) is non-empty. For any ξ € ξ s (θ) n K° (2) Proof.
θ - θ(ξ) € V s (ξ) . Invoke Theorem 5.7 and Theorem 5.12.
||
6.13 Theorem Suppose S Π W, is a relatively closed convex subset of W^ with S n K° non-empty. Then ξ s (θ) is non-empty if and only if θ € W° or θ e W - W° and (1)
S c H"(θ - θ χ , Bj)
for some θ e W°, 3, € R. If ζ s (θ) is non-empty then it consists of the unique point ξ € S Π K ° satisfying (2) Proof.
(θ - θ(ξ))
(ξ - ξ) > 0
Invoke Theorem 5.8.
v
ξeS
.
||
6.14 Construction Theorems 6.12(2) and 6.13 have a geometrical interpretation which looks exactly like that of their counterparts in Chapter 5. For example,
THE DUAL TO THE MLE
187
suppose S = H n K with H the hyperplane H(a, α ) , and H n K°is non-empty. Then in order to find ζς(θ) one need only search for the unique point ζ* € H for which θ - θ(ζ*) = pa for some p € R. The process can be pictured from two different perspectives. Both of these are shown in Figure 6.14(1). (i) One may proceed from ξ(θ) along the curve {ζ(θ + pa): p € R} until the unique point at which ζ(θ + pa) € H. (ii) Alternatively one may map S n K° back into Θ as θ(S n K°) and then proceed along the line {θ + pa: p € R} until the unique point at which θ + pa ε θ(S n κ°).
e
Θ(S)
Figure 6 . 1 4 ( 1 ) :
Construction of ξ s ( θ ) when S = H ( a , α) n K
There is an important s t a t i s t i c a l d i f f e r e n c e between the s i t u a t i o n p i c t u r e d here and the dual s i t u a t i o n . d i s p l a y e d i n 5 . 9 . In Construction 5.9
Θ = H n N and the problem considered was to
find θ .
In t h a t case one could proceed via the geometrical dual to Figure
6.14(1).
See Figures 5 . 9 ( 1 ) and 5 . 9 ( 2 ) .
However, one could also reduce by
s u f f i c i e n c y to a minimal exponential family with parameter space Θ. then be found by applying Theorem 5.5 to t h i s minimal f a m i l y .
θ 0 could
A corresponding
188
STATISTICAL EXPONENTIAL FAMILIES
statistical interpretation is not available for the dual problem of finding ζ
HnK' Furthermore, i f Θ = H n N and S = ξ(Θ) the maximum likelihood
predictor relative to S cannot legally be found by f i r s t reducing by sufficiency.
This very undesirable property of a s t a t i s t i c a l estimator is
displayed in the following example. 6.15
Example Consider the Hardy-Weinberg problem discussed earlier in
Examples 1.8 and 5.10. Let S = ξ(Θ) and consider the problem of finding ξς. Rather than provide a general formula for ξ (a messy exercise) we discuss a special case, and some implications. Suppose N = 18 and x = ( 3 , 6 , 9 ) .
P =
(1)
2x
*
+x
2
=g
θ(ξ(x))
We have already seen that
Thus ξ ( x ) = 18(J, J , | ) = ( 2 , 8 , 8 ) , and
= θ(x)
=
=
ί p ( l , l , l ) + (In 1, In 4, In 4)}
{ β j ί l . l . l ) - (In 2 ) ( 2 , l , 0 ) + (0, In 2, 0)} c
θ
Note also that (2)
θ(x)
=
{ p d . l . D + (In 1, In 2, In 3)}
.
Of course θ(x) n θ = Φ Since ς(p) = ( p 2 , 2pq, q 2 ) = ( p 2 , 2p(l-p). (1-p) 2 ) space to S = ί ξ ( p ) :
0 < p < 1} can be found by taking 4 - ξ ( P ) .
p = = this tangent space, T, is spanned by the vector τ
By definition v s ( ξ ) = {v:
=
(2p,
"
(
,2 3»
v
2 - 4p, 2 4> 3* - 3 }
τ = 0} .
-2 + 2p)
the tangent Evaluated at
THE DUAL TO THE NILE
189
Now, from (1) and (2) θ(x) - θ(ζ) = {p'(l,l,l) + (0, In 2 - In 4, In 3 - In 4): p 1 € R} . Thus (θ(x) - θ(ξ})
(3)
τ = (2/3) In (1/2) - (4/3)ln (3/4) f 0 .
The implication of (3) is that θ(x) - θ(ζ) £ V $ (ξ).
It follows
from Theorem 6.12(2) that (4)
θ(x) n θ(x) = φ
,
or, in other words, (4')
ξ(x) t ξ(x)
.
Finally, suppose instead that the sample point is x* = (2,8,8). Note that x* = ξ(x) with x = (3,6,9), as above.
In this case ξ(x*) = x*
and hence (5 1 )
ξ(x*) = ξ(x*) = x*
and (5)
θ(x*) = θ(x*) = θ(x*) . Recall from the discussion in Example 5.10 that,over the domain
K°, ξ(x) coincides with the minimal sufficient s t a t i s t i c . (5)
Thus, from (4) and
(or (4 1 ) and (5 1 )) i t can be seen that here the "estimator"
θ(x) = θ(ξ(θ(x))) is not a function of the minimal sufficient
statistic.
is a very undesirable property for a statistical estimator.
Indeed, we
This
emphasize, the primary statistical use of θ does not l i e in i t s use as a statistical estimator, but rather in i t s use in the theory of large deviations. See, for example, 7.5 and Exercises 7.5.1 - 7.5.6.
190
STATISTICAL EXPONENTIAL FAMILIES
ENTROPY 6.16
Discussion In statistical mechanics and elsewhere the term entropy appears
and has a definition whose connection with the quantity K(θ Q , θ,) for exponential families is not at first obvious. See Ellis (1984a; 1984b). k k Let F be a probability distribution on R . Let x e R and define the entropy of x under F as (1)
E F (x) = inf {K(G, F ) : E Q (X) = x} . There is, as yet, no exponential family apparent in this definition.
However, there is indeed an intimate connection between ξ and K, as revealed in the following theorem. The theorem is proved only for the case where F satisfies certain mild assumptions and x € κl or x t Kp
We leave it to the
reader to develop the appropriate results when F does not satisfy these assumptions. The situation where x € K - K° can sometimes be treated using the methods at the end of this chapter. 6.17 Theorem Suppose the exponential family generated by F is a steep minimal family with 0 € int N.
Let ξ Q = ξ(0) = E R (X).
Let K denote the usual
Kullback-Leibler function, 6 . 1 ( 4 ) , for this exponential family. Then (1) i f y € K°.
(2)
Proof. (3)
E F (y)
= K(y, ξ Q )
If y £ K
»
=
Ef(y)
=
K(y, ξ Q )
Suppose y € K°, it is obviously true that E F (y) < K(y, ξ Q )
since the distribution G(dx) = p 0 / %(x)F(dx) = p θ ( y ) ( d χ ) satisfies E Q (X) = y
THE DUAL TO THE MLE
191
and K(G, F) = K(y, ξ Q ) . Suppose K(G, F) < «> and (4)
E G (X) = y = It must be that G <*« F, for otherwise K(G, F) = ». Let g = 4S-
and p = P θ ( y ) (5)
Then
K(G, F) - K ( P θ ( y ) 5
F)
=
/ [ g ( χ ) I n g(χ) -
=
/ g ( x ) ( l n g ( x ) - In p ( x ) ) F ( d x )
p
(χ)
Ί n
p(χ)]
+ /(g(x) - P(x))(ln
=
K(G. P θ ( y ) )
>
G = F θ /y\
I t follows
from ( 3 ) and ( 5 ) t h a t ( 1 ) holds.
is the unique d i s t r i b u t i o n
K(G, F) = Ef(y)
satisfying
p(x))F(dx)
0
since / ( g ( χ ) - p ( x ) ) ( l n p ( x ) ) F ( d x ) = / ( g ( x ) - p ( x ) ) ( θ by ( 4 ) .
F(dx)
x - φ(θ))F(dx) = 0 (Also, note t h a t
( 4 ) and y i e l d i n g
.)
If y £ K then Eg(X) = y implies G « F and hence K(G, F) = - = κ(y, ξ Q ) .
||
AGGREGATE EXPONENTIAL FAMILIES If {p Q } is a full canonical exponential family and x € dK then θ(x) = φ. (See Theorem 5.5.)
If v(8K) > 0 then this means that with
positive probability the maximum likelihood estimator fails to exist. This occurs most commonly when v has countable support.
In most such
cases the family of distributions {p Q : θ € N] can be augmented in a natural way so that the maximum likelihood estimator is always defined over this new, larger family of distributions. The augmented family will be called an aggregate exponential family.
192
STATISTICAL EXPONENTIAL FAMILIES Aggregate exponential families can also be satisfactorily defined
in a few special cases where v does not have countable support, but v(8K) > 0 nevertheless.
However, such situations are rare in applications and the
general theory involves d i f f i c u l t i e s not present in the countable case; hence we do not treat such situations below.
For similar reasons of convenience we
avoid non-regular exponential families. Special cases of the theory are extremely familiar — for example the aggregate family of binomial distributions, which is just B(n, p ) , 0 < p£l.
The general theory for the case where v has f i n i t e support
appears in Barndorff-Nielsen (1978, p.154-158), along with some observations about generalizations. 6.18
Definitions Let v be a measure concentrated on the countable subset
X = {χv
x 2 , . . . } c Rk.
(1)
Thus
v(ίχ.})
>
0
v(X c )
1=1,2,... ,
Consider the closed convex set K = K .
The faces
of K
= 0
.
are the non-empty sets
of the form (2)
F
=
K n H(v, α)
where
K c H~(v, α)
By convention the set K is i t s e l f a face of K (corresponding to v = 0 , α = 0 ) . A f a c e , F, is i t s e l f a closed convex subset, which has dimension s,
0 <_ s <_ k.
interior Rs.
(Only the face F = K can have dimension k.)
The
relative
of F, denoted r i ( F ) is the i n t e r i o r of F considered as a subset of
An a n a l y t i c c h a r a c t e r i z a t i o n of r i ( F ) is t h a t x e r i (F) i f x € F and
f o r every
if
hyperplane H 6 Rk such t h a t x € H but F £ H then both F n H + ϊ
,
and F n H~ f φ. Let F be a face o f K.
I f v ( F ) > 0 then the r e s t r i c t i o n of v to F,
v,c is uniquely defined and non-zero.
We use the notation K. c =K
.
Note
t h a t while i t is usually t r u e t h a t K,r = F t h i s need not always be the case.
THE DUAL TO THE MLE
193
See Exercise 6.18.1. The f i r s t main theorem involves the following structural assumption on X: For e\ίβry ξ G X there is a face F of K such that K ι c = F I F
(3)
and ξ € r i ( F ) . I f X is f i n i t e then (3) is clearly satisfied.
Another important
case where (3) is satisfied is when X = { 0 , 1 , . . . } , as for example when Xj,...,X|^ are independent Poisson or independent negative binomial variables. Assumption 6.22(1) provides an easily verified structural condition which implies ( 3 ) . 6.19
Definition
(Aggregate family)
Let X and v be as in 6.18. family of densities generated by v.
Let {p 0 } be the canonical exponential
Assume the family is regular.
As shown
in Chapter 3 this family can be reparametrized by the expectation parameter ξ = ξ ( θ ) . Let q
(1)
ζ(θ)(x)
Then, { q ξ : ξ € K°} = {p 0 :
=
P
θ(x)
θ
€ W
θ € W} .
Now, for each face, F, of K with v(F) > 0 l e t ψ r = ψ v
and
define the family of densities exp(θ Pθ,F(χ) θ l h
relative to the measure v. measure v . f .
x - ψ.r(θ))
x € F
IF
= 0
x j£ F
This is an exponential family relative to the
Assume this family is regular.
Let ξ.p denote i t s expectation
parameter, and l e t (2)
=
PθlF(x)
Thus ζ ranges over the set r i K. F as θ ranges over N^ = Wvjp
Note that the
194
STATISTICAL EXPONENTIAL FAMILIES
family {p Q jF: θ € N.p} is not minimal. Hence the map θ •*• ξ,p(θ) is not 1 - 1 . However, q>. >p = q^ ι p if and only if ξ, = ξp> by virtue of Theorems 1.9 and 3.6. Let (3)
F = {x: 3 face
F of K 3 v. F t 0 and x e ri(F)} .
Lemma 6.20, below, establishes that for each ξ € F there is a unique F such that ξ € ri(F) and a unique density q^.p corresponding to the pair ξ, F. This density has (4)
E q
(X)
= ξ
.
ξ|F
We denote this density as q f .
The aggregate family of densities
generated by v with parameter space F is the family (5)
{ q ξ : ξ € F}
.
Note that P ξ (X)
(6) 6.20
= 1
V ζ € F
.
Lemma Make the assumptions in 6.18 and 6.19.
is a unique F such that ξ € r i ( F ) .
The density q
Then for each ξ € F there = q^. p satisfies 6.19(4).
I t i s , in f a c t , the unique density of the form q , ( F« having expectation ξ. Proof. K c ίΓ(v\
Suppose ξ e r i ( F ) and a l s o ξ € F 1 = H ( v ' , α ' ) n K where α1).
Then e i t h e r
(i) FcH(v'.a')
or
( i i ) F n H+(v', a 1 ) t φ
and F n H " ( v ' , α 1 ) t φ. I n case ( i i ) H ( v ' , α 1 ) i s n o t a s u p p o r t i n g h y p e r plane, a contradiction.
Hence ( i ) h o l d s , and so F 1 D F.
Reversing t h e roles
o f F, F 1 i n t h e above now shows t h a t ξ € r i ( F ) and ξ e r i ( F ' ) i m p l i e s F = F 1 . By Theorem 3 . 6 , {En (x): θ € N } = r i ( K l t : ) = r i ( F ) by q V I F ξ(θ)IF IF 6 . 1 8 ( 3 ) since v i exists.
p
generates a regular family.
Thus q ξ i p s a t i s f y i n g 6 . 1 9 ( 4 )
THE DUAL TO THE MLE
195
For every ξ € X the preceding shows that ζ = E (X) € ri(F) where q
F is the unique face of K with ξ € r i ( F ) .
Hence ξ = E
q
(X) = E = q£l.
ξ§IF'
(X)
||
i f the conclusion of
6.18(3) holds for a l l ξ € conhull X then F = conhull X. occur that Fcconhull X.
q
ξ|F
implies F = F 1 , and thus, as previously noted, implies q Assumption 6.18(3) guarantees that F 3 X.
ζ
Otherwise i t may
Exercise 6 . 2 0 . 1 sketches an example.
I f Assumption
6.22(1) is s a t i s f i e d then (1)
F =
conhull X
=
K
.
Here is the f i r s t main theorem providing the extension of Theorem 5.5. 6 .21
Theorem Make the assumptions in 6.18 and 6.19.
Then for x € F 3 X the
maximum likelihood estimator, ξ ( x ) , is uniquely determined by the t r i v i a l equation (1)
ξ(x)
Proof.
= x
L e t x e r i ( F ) f o r some f a c e F = H ( v , α ) n K o f K.
I f ξ1 € r i ( F ' )
and x £ F 1 then q ζ , ( x ) = 0 . Now suppose ξ ' e r i ( F ' ) , Lemma 6 . 2 0 ) t h a t F 1 3 F.
x e F 1 , b u t F 1 t F.
I t follows
The argument now t a k e s p l a c e i n F 1 .
Hence we can
assume f o r c o n v e n i e n c e , and w i t h o u t l o s s o f g e n e r a l i t y , t h a t F 1 = R and ξ 1 e K°.
(2) and
l
= θ + p e
OK
We may f u r t h e r assume t h a t x = 0 , K c ί Π e ^ 0 ) , and 0 e r i ( F )
w i t h F = H ( e l f 0 ) n K. θ
(as i n
1
, ρ > 0 .
Then, ξ 1 = ξ ( θ ' ) f o r some θ ' e W ° c Then q
ξ ( θ
j ( 0 ) = exp(-ψ(θp))
Rk.
Let
196
STATISTICAL EXPONENTIAL FAMILIES eΦ(θp)=
(3)
eθ'
/
0 +
x
χ +
Pχ1v(dx)+
e θ ' # x v(dx)
/
f°
eθ"
/
χ
v(dx)
O
= ψ|F(θ )
'
by the monotone convergence theorem and the d e f i n i t i o n of ψ.r. (2)
I t follows from
and (3) t h a t
(4)
qξ,(0)
< q ξ ( θ )(0)
< qξl,,F(0) ,
0 < p < »
where ζ" is the unique point i n r i ( F ) defined by ξ" = ξ ( p ( θ ' ) .
Finally, if ξ 1 " € ri(F) then applying Theorem 5.5 to the measure v | F yields qξ,..|F(0)
(5)
with equality only i f ξ" 1 = 0.
< qQ|F(0)
Combining ( 4 ) , ( 5 ) , and the f i r s t comment
in the proof y i e l d s (6)
ζ(0)
= 0
.
This verifies (1) when ξ = 0 , and completes the proof. Remark.
||
As noted in the remark preceding the theorem it is usually true
that F => conhull X. Assume so and assume the hypotheses of the theorem. Let X,,...,X be i.i.d. random variables with density q f , ξ € F. As usual, let n Xn = .Σ X./n. =1 l
Then Xn € conhull X c F with probability one.
The family of
distributions of the sufficient statistic Xn is then also an aggregate family f i t t i n g the specifications of the theorem. estimator of ξ € F based on X-,...,X (6)
Hence the maximum likelihood
satisfies the t r i v i a l equation
£(Xr...,Xn)
=
Xn
.
The preceding theorem yields the existence of maximum likelihood
THE DUAL TO THE MLE estimates when the parameter space is F.
197
In order to guarantee existence of
these estimates when the parameter space is a proper closed subset of K i t suffices
to establish continuity in ξ of q Λ x ) , x € X.
useful for other purposes as well.
This continuity is
Somewhat unfortunately, the assumptions of
Theorem 6.21 do not imply that q^x) is continuous in ξ (see Exercises 6.23.5-6) and the following theorems demand stronger assumptions. Sufficient assumptions are described below. There is a further, aesthetic, reason for wanting to know that q^(x) is continuous in ξ. family { q ζ ( x ) :
The definition given in 6.19 of the aggregate
θ € F} is structurally natural.
But there is also an analy-
t i c a l l y natural definition for the family of distributions generated from {p n : u
θ G N] -- namely, the set of a l l probability distributions on X which
are limits of sequences of distributions in ί p θ h
These two definitions
coincide when q^(x) is continuous in ξ. 6.22 Assumptions K is called a polyhedral convex set i f i t can be written as the intersection of a f i n i t e number of half spaces (see Rockafellar (1970)). Assume that K is a polyhedral convex set and that for every one of the f i n i t e number of faces, F, of /C (1)
F = K|F
.
As previously noted in 6.20(1), this implies F = K = conhull X. For any convex set S € R define the centered span of S to be the subspace spanned by vectors of the form x - y , subspace by csp S. (2)
Note that i f xQ € ri S then csp S =
span {x - x Q :
Assume that for eyery face F of K
(3)
x,y € X.
Pr
°JCspFW
"
x € S}
Denote this
198
STATISTICAL EXPONENTIAL FAMILIES
Note t h a t i f X is f i n i t e then (1) i s s a t i s f i e d , and ( 3 ) is s a t i s f i e d since A/ |F = R
for a l l
faces F ( i n c l u d i n g F = K).
measure then (1) and (3) are again s a t i s f i e d .
6.23
Theorem
x € K,
Then f o r
The proof involves an i n d u c t i o n on the dimension, k.
r e s u l t is n e a r l y obvious.
assume K c ( - « , ξ Q ] . θ. -> W, and Q. - > « . for x f ξ Q ,
every
ζ 6 K.
q (x) is continuous f o r
Proof.
and
I f v i s a product
See Exercise 6 . 2 2 . 2 .
Make the assumptions in 6 . 1 8 , 6 . 1 9 , and 6 . 2 2 .
the
trivially
Suppose ξ Q e 3K.
For k = 1
Without loss o f g e n e r a l i t y
Then ξ i -• ζ Q w i t h ζ i t ξ Q ,
i =l,...
i m p l i e s ξ Ί = ξ(θ Ί .)»
I t follows t h a t q ξ . U 0 ) = P θ . ( ? 0 ) -* v ^ } ) "
1
= qζ ( ξ Q ) ,
q ξ . ( x ) -• 0 = q ζ ( x ) .
For arbitrary k, including k=l, if ξ Q € K° then q ξ (x) =
pQ,^Λx)
is continous on a neighborhood of ξ Q . This completes the proof for k = 1. We now turn to the case k >_ 2. We need to prove continuity of q>. at ξ Q € dK. Let ξ. -* ξ Q . We need consider only the case where {ζ.} c F with F some face of K, since K has only a finite number of faces. If this F is a proper face of K then q r -> q Γ by the induction hypothesis. Hence we need consider only the case where each ξ. = ζ(θ ), θ^ e A/. There is a unique face F Q of K such that ξ Q € ri F Q = ri ACjp . o K c R " ( e - , 0 ) , - σ e , € K° f o r some
loss o f g e n e r a l i t y
σ > 0,
F Q = H ( e 1 9 0 ) n K a n d c s p F Q = {w € R k :
( 0 <_ s < k - 1 ) . w
(2)
G RS<
Let S
Further,
assume
ξQ = 0,
Without
= csp F Q .
w = (0, ω), ω € RS},
F o r w e Rk w r i t e
assume 0 e N . p » ψ
F
(0) = 0,
ψ|p ( θ ) i s a f u n c t i o n o f θ / 2 ) > a n d so we w i l l
w1 = ( w L , W / ξ
(0) = 0.
( F
write ψ j
F
(θ/2\),
2
J with
Note t h a t where
convenient. We h a v e a l r e a d y assumed 0 e M , r . f o r some 6 Q > 0 . σ(θ),
I t then follows
s a y , such t h a t θ + σ e 1 € W,
from 6 . 2 2 ( 3 ) θ ^ σ ( θ ) .
Hence { θ € S :
| | θ | | <_ ό Q } c W.p
t h a t f o r e a c h such θ t h e r e i s a Since { θ € S:
||θ||<_ό0}
is
THE DUAL TO THE MLE compact, with {θ € S:
| | θ | | <• ό Q ,
θ + oeι
199
€ N} as a r e l a t i v e l y open s u b s e t ,
there must, f u r t h e r , e x i s t s a σQ >_ 0 such t h a t θ + σe, e N f o r a l l σ >_ σ Q , θ e S,
| | θ | | £ <50. For 6 <_ όg, σ _> σQ define
(1)
Q
=
Q(σ, 6)
{θ € Rk: | | θ
=
( 2 )
||
< 6,
v
Note t h a t θ ^ j
x^j
- σgθj
x^j
±
x
(-σ + σ Q ) | | X ( ^ | \< 0 ,
Vx
€ K .
Hence for θ € Q λ(θ) as i n 6 . 2 1 ( 4 ) .
<
λ(σ Q e χ )
< co
I t follows t h a t Q c N.
Now assume f o r convenience, and without loss o f g e n e r a l i t y , t h a t σQ = 0 .
Then f o r θ € Q λ(θ)
(2)
=
Je
θ
Ψ/ e (
χ
v(dx)
2
) (
2
)
as σ •> °°, uniformly f o r θ ^ ) 1 6Q. (3)
sup {|ψ(θ) I:
as σ -> ~, (4)
6 •> 0.
sup
v
/e"
σ l l X
(l)
M + θ
(
2
)
X
(
2
) v(dx)
(dx)
In p a r t i c u l a r
θ £ Q(σ, δ ) }
•+ ψ , F ( 0 ) 1 Γ
= 0
o
I t follows that
ί|pθ(x) - qQ(x)|:
for each x € K.
<
θeQ(σ, 6)}
[For x € FQ the convergence
^ 0
as
σ -^ °°,
6 •> o
in (4) is uniform over
subsets of F o ; however i f x £ FQ then as σ -> «, 6 » 0 ,
p Q (x) =
θ # x e
compact "ψ(θ)
~ eθ#x
-> 0 = q Ω ( x ) , but the convergence is not uniform over a r b i t r a r y compact subsets of K. x € X -
( I t is uniform over bounded subsets of X i f e, Fo.)]
x < -ε < 0 for a l l
200
STATISTICAL EXPONENTIAL FAMILIES I t remains to show t h a t f o r given σ >^ a Q ,
α > 0 such that ||ξ|| < α,
δ <_ δ Q there is an
ξ € K° , implies θ(ξ) € Q(σ, δ ) .
Once t h i s has
been done i t follows from ( 4 ) , and the induction hypothesis, that q^-ίx) i s continuous in ξ ε K f o r each x € K. For convenience we show below only that there is an α > 0 such that ||ξ|| < α implies θ(ξ) € Q(0, δ ) .
The proof f o r a r b i t r a r y α > 0,
in place of σ = 0, requires only minor a l t e r a t i o n s of the constants appearing i n the proof.
In the following α, ε are generic p o s i t i v e constants whose
numerical value may decrease as the proof progresses. is an α > 0 such that ||θ/2%11 > δ
Since 0 € W. F
there
implies ψ J F ( θ / 2 \ ) >. 201 |θ/2% ||.
Let
C c X be a f i n i t e subset of X such that C n FQ t φ and F n C f φ f o r eyery face F of K which properly contains FQ.
The existence of C is guaranteed by
6.22(1).
Suppose I I Θ ( 2 ) M max ξ/-x
{ΘQX
X/^:
> δ and
x € C} > 0 .
θ
(i)*x m
>
for
°
some
x
e κ
τhen
| | ξ | | < α and α i s s u f f i c i e n t l y small
If
i s i n the convex h u l l o f ί x / i \ :
x € C} U { 0 } .
then
Hence t h e r e i s an η ε R
such t h a t
(5)
θ^j
f o r a l l I | ξ |I < α .
ζ^x
<_ ηα max ίθ.^x
L e t p = max { | | X / 2 Λ I I :
x ^
x € C}
x € C},
vQ = min { v ( { x } ) :
x € C}.
Then A(θ, ξ )
= θ
=
θ
ξ -ψ ( θ )
(
2
)
ξ
( 2 )
- β||θ
( 2 )
||
+θ
ξ
( 1 )
( 1 )
- ln(e
(
2
)
λ(θ)).
Now, (6)
λ ( θ ) >_ λ
>.
j
F
(θ(2j)
+ v
0
e x p (231| θ ( 2 j | | )
exp( θ
+ v
0
( 1 )
x
exp ( θ
( 1 )
( 1 )
+ θ
x
( 2 )
( 1 )
x
( 2 )
)
- P||Θ(2)||)
.
THE DUAL TO THE MLE For n o t a t i o n a l s i m p l i c i t y l e t t = θ / , x (7)
A(θ, ξ ) 1 θ
( 2 )
ξ
( 2 )
201
x ^ x > 0 . Then f o r α <_ 3 / 2
- 3 | | θ ( 2 ) | | + η α t- I n (
3 l l θ e
2
( )
M
+
exp ( t - p I I θ / o \ I I ~ 31 I θ / p \ I I /
VQ
£
- ε + η α t- ( 3 | | θ ( 2 ) | |
1
-ε
V ( t - (p + 3 ) | | θ ( 2 ) | | + I n v Q ) )
for α > 0 sufficiently small, since 3||θ
3t p + 23 + a
| | V ( t - ( p + 3 ) | | θ ( 2 ) | |1 1 - a ό ) > (2)ir v^; -
f o r I | θ / 2 x 11 > δ , a _> 0 . If
ι ( θ .ξ )
(8)
I | θ # 2 j | I > δ butΘ Q J
< θ
θ
1
(l)
( 2 )
- ΨlFo(θ(2))
(2) * ξ(2) "
θ
ξ
( 2 )
δ
i
b u t
X / j x 1 0 f o r a l l x € K then
ψ
Θ
θ
+
F0(θ(2))
1
(D * x{i)
>
°
( 1 )
"ε f o r
s o m e
x
e
κ
t h e n
X/.% > 0 f o r some x e C; and
(9)
£(θ, ξ) 1
θ(
2 )
ξ
( 2 )
- ψ θ 2
ψ
<
| F
(i)
(θ{2)) + ηαθ(1) χ
X(1}
(D
IF0
-ε < 0
f o r α > 0 and some e > 0 s u f f i c i e n t l y s m a l l , s i n c e ψ F ( θ / 2 \ ) 1 ° b u t sup
{ ψ
| F
(θ(2j):
l|θ(2)||
o r ( 9 ) a p p l y so t h a t
1 δ j } < « . I f | | ζ | | < α a n d θ jf Q o n e o f ( 7 ) , ( 8 ) ,
202
STATISTICAL EXPONENTIAL FAMILIES
(10)
A(θ, ξ) £
-ε <
0
.
On the other hand, there i s a σ > 0 s u f f i c i e n t l y large so that by (2) or ( 3 ) , (11)
Hoey
ξ)
=
σe χ
ξ - ψ(σe χ )
^
σe][
ξ - ε/3
>_ -2ε/3 for ||ξ|| < α £•—• . It follows from (10) and (11) that if ||ξ|| < α, ξ € K°, then if θ (έ Q &(θ, ξ) £ Hence θ f θ ( ξ ) .
-ε
<
-2ε/3 £
il(θ(ξ), ξ ) .
I t follows that θ(ξ) € Q.
We have thus proved that given σ, 6 there is an α > 0 such that ||ξ|| < α,
ξ € K°, implies θ(ζ) € Q(σ, 6 ) .
completes the proof of the theorem.
| |
As previously noted, this
THE DUAL TO THE MLE
203
EXERCISES 6.6.1 Assume φ i s r e g u l a r l y s t r i c t l y convex.
Verify
6.6(3).
6.7.1 For φ regularly s t r i c t l y convex, when does d. = φ? 6.9.1 Generalize Theorem 3.9 to apply to steep, regularly convex functions v
φ [i.e.; write φ = V Φ
v
and consider the map θ -M ' . Show this map is λ (2) Φ(2)(θ)^ ;
1 - 1 and continuous on N° with range ξ^AN°)
x φ/ 2 x(M°) = K,.* x φ/ 2 \(W°)].
6.18.1 (i)
Show t h a t Kj F f F i n the following example:
X = ( 1 , -1) u { ( ( i 2 - l ) J V i , (ii)
1/i);
1 =1 , 2 , . . . }
,
F = K n H ( ( l , 0 ) , 1).
Construct an example of the same phenomenon i n R where X is
a discrete set ( i . e . X has no accumulation points i n R ).
[Construct X so
that the set X i n ( i ) i s i t s p r o j e c t i o n on the space spanned by the f i r s t two coordinate axes.] 6.19.1 Show that the following three families are aggregate exponential families: (i) (ii) (iii)
Binomial (n, p ) , Poisson ( λ ) ,
0 <_ p <_ 1
λ >. 0
Multinomial (N, p),
0 < p., " Ί
k Σ p. = 1 . i=lΊ
6.19.2 Suppose the d i s t r i b u t i o n o f X family { q ί Ί ' b ,
i = l , 2 , and X ^ ,
X ^
form an aggregate exponential
are independent.
Show t h a t the d i s t r i b u -
tions of ( X ^ , X^ 2 M form a ( k j + k 2 parameter) aggregate exponential
family.
204
STATISTICAL EXPONENTIAL FAMILIES
6.20.1 Construct an example in which 6.18(3) holds but F f conhuil X. [Let X1 be the set in 6.18.1(i) and define X € R 3 by X = ίx: ( x Γ x 2 ) E X 1 ,
x 3 = ±(1 - x 2 )} u (1,0,1) U (1,0,-1).]
6.21.1 Let X be the set defined in 6.20.1 with the additional point (1,0,0). Show (i) 6.18(3) fails at x = (1,0,0). (ii) {q
:
The maximum l i k e l i h o o d estimate f o r the aggregate family
ξ e F} f a i l s to e x i s t ( i . e . i s the empty s e t ) when X = ( 1 , 0 , 0 ) ,
which occurs with p o s i t i v e p r o b a b i l i t y . (iii)
The f a i l u r e i n ( i i ) can be r e c t i f i e d i n a natural way by
l e t t i n g G = conhull { ( 1 , 0 , - 1 ) , q
ξ(θ)|G
=
p
θ|G
t 0
(iv) for
t h e
f a m i l y
{ q
( 1 , 0 , 1 ) } and adding the d e n s i t i e s ξ:
ξ
€
F}
Addition o f the d e n s i t i e s q ζ j G i s " n a t u r a l " i n the sense t h a t
each ξ € G there i s a sequence θ. 6 U° such t h a t q ε .ς;(x)
=
^
]im Pθ ( x ) .
i-*» i
[This sequence cannot be chosen to be of the form θ = θ 1 + iv for fixed v € R , θ 1 e W° as was the case in the proof of Theorem 6.21.] 6.21.2 Let v be linear measure on the perimeter SS, of the unit square, S.
This measure does not have a countable supporting set.
describe i t s
Nevertheless,
"natural aggregate family", having parameter space S and
satisfying the conclusion of Theorem 6.21 for each x € S.
6.21.3 (i) Let v be uniform measure on the perimeter ∂S, say, of the unit circle S. Thus, {p_θ} is the family of von Mises distributions (Example 3.8). Show there can be no possible way of constructing a family of densities {q_ξ} which contains {p_θ} such that the maximum likelihood estimate for {q_ξ} exists with probability one. [lim p_θ(x) = ∞ for each x ∈ ∂S.]

(ii) Note that if X̄_n is the sample mean from a sample of size n, n ≥ 2, having the above distribution, then the maximum likelihood estimate does exist with probability one.

(iii) Construct a measure v for which {p_θ} is a regular exponential family but there does not exist an n for which it is possible to construct an "aggregate family" of densities {q_ξ}, containing the densities of X̄_n under θ, such that the maximum likelihood estimator exists with probability one. [There exists such a measure v having K = {x ∈ R³: x₁² + x₂² ≤ x₃², 0 ≤ x₃ ≤ 1}, and v({0}) > 0.]

6.22.1 Show that 6.22(1) (including the polyhedral nature of K) implies 6.20(1). [The polyhedrality of K guarantees that for every x ∈ ∂K there is a face F of K such that x ∈ ri F.]
6.22.2 Prove that 6.22(1) and 6.22(3) are satisfied whenever v is a product measure on a countable set X = Π_{j=1}^k X_j, X_j ⊂ R. [The faces F = H(v, α) ∩ X of X are determined uniquely by (sgn v₁, ..., sgn v_k).]
6.22.3 (i) Prove that

(1)    N|F = Proj_{csp F}(N|F) × (csp F)^⊥ ,

and

(2)    Proj_{csp F}(N) ⊂ Proj_{csp F}(N|F) .

(ii) Give an example in which X = {0, 1, ...}², F = {(0, 0), (1, 0), ...}, N = (-∞, 0)²,

(3)    Proj_{csp F}(N) = (-∞, 0) × 0 ≠ R × 0 = Proj_{csp F}(N|F) ,

and

(4)    ξ|F((0, 0)) = (1, 0) ∈ X .

(Thus 6.22(3) is not valid here.)

(iii) In the example (ii) show that q_ξ((x₁, 0)), x₁ = 0, 1, ..., is not continuous at ξ = (ξ₁, 0), ξ₁ > 1. [If θ_i is chosen so that θ_{i1} → ∞ somewhat slowly and θ_{i2} → -∞ then ξ(θ_i) → (ξ₁, 0) but q_{ξ(θ_i)}(x) ↛ q_{(ξ₁,0)}(x).]

6.23.1 Prove versions of Theorems 5.7, 5.8 and 5.12 valid for aggregate exponential families. [Make the assumptions in Theorem 6.23.]

6.23.2 Show that q_ξ(x) is not jointly continuous in (ξ, x) at any point with ξ = x ∈ ∂K.

6.23.3 Are the analogs to Theorems 6.12 and 6.13 valid for aggregate exponential families under the assumptions of Theorem 6.23?

6.23.4 Suppose X = (0, 0) ∪ {x ∈ R²: x_i = 1, 2, ..., i = 1, 2}. Note that Assumption 6.22(1) is not satisfied. Show that, nonetheless, q_ξ(x) is continuous at every ξ ∈ conhull X = F. (If one defines q_ξ(x) = q_0(x) for ξ ∈ K - conhull X then it is even true that q_·(x) is continuous on K.)

6.23.5 Let X = {((i² - 1)^{1/2}/i, 1/i): i = 1, ...} ∪ (1, 0). For x = ((i² - 1)^{1/2}/i, 1/i) ∈ X let v({x}) = 1/2^i, and let v({(1, 0)}) = 1. Note that 6.22(1) is not satisfied. Show that q_ξ((1, 0)) is not continuous at ξ = (1, 0). [q_{(1,0)}((1, 0)) = 1. Let 0 < c < 1. For i sufficiently large let θ_i = ρ_i x_i with ρ_i chosen so that p_{θ_i}((1, 0)) = c. ({ρ_i} is a swiftly increasing
sequence.) Then ξ(θ_i) → (1, 0) but q_{ξ(θ_i)}((1, 0)) → c ≠ 1.] (In this example q_ξ((1, 0)) is, however, upper semicontinuous; so that, for example, the conclusion of Theorem 6.23 remains valid. Exercise 6.23.6 shows this need not be the case.)

6.23.6 For x = x^{(i,j)} = ((i² - 1)^{1/2}/i, 1/i, j), i = 1, ..., j = ±1, let v({x^{(i,j)}}) = (4 + 3j)/2^i. For x = x^{(j)} = (1, 0, j), j = -1, 0, +1, let v({x}) = 2^{-|j|}. Otherwise v({x}) = 0. Construct {θ_i} in a manner similar to 6.23.5 with (θ_i)₃ = 0 so that P_{θ_i}({x^{(j)}: j = 0, ±1}) → 1/3 and (ξ(θ_i))₁ → 1. Verify that ξ(θ_i) → (1, 0, 1/2) and P_{θ_i}(x^{(-1)}) → 1/12, but q_{(1,0,1/2)}(x^{(-1)}) = (1/4)² < 1/12. Hence q_ξ(x^{(-1)}) is not continuous at ξ = (1, 0, 1/2), or even upper semicontinuous. If E ⊂ K is the closed set {ξ(θ_i): i = 1, ...} ∪ (1, 0, 1/2) then the maximum likelihood estimator of ξ over the family {q_ξ: ξ ∈ E} fails to exist at the possible observation x^{(-1)}.
CHAPTER 7. TAIL PROBABILITIES
In exponential families the probability under θ of a set generally falls off exponentially fast as the distance of the set from ξ(θ) increases. This section contains several results of this form. The first of these will be improved later, but it is included here because of its simplicity of statement and proof.

Throughout this chapter let {p_θ} be a steep canonical exponential family. (Most of the results hold with possibly minor modifications for non-minimal families, and many also hold for non-steep families.)
FIXED PARAMETER (Via Chebyshev's Inequality)

7.1 Theorem

Fix θ₀ ∈ N°. Choose ε so that {θ: ||θ - θ₀|| ≤ ε} ⊂ N°. Then there exists a constant c < ∞ such that

(1)    Pr_{θ₀}(H⁺(v, α)) ≤ c exp(-εα)
for all v ∈ R^k with ||v|| = 1 and all α ∈ R.

Proof. Let

(2)    c = exp(sup {ψ(θ) - ψ(θ₀): ||θ - θ₀|| = ε})

and let θ_ε = θ₀ + εv. Then

Pr_{θ₀}(H⁺(v, α)) = ∫_{H⁺(v,α)} exp(θ₀·x - ψ(θ₀)) v(dx)

= ∫_{H⁺(v,α)} exp(θ₀·x + (εv)·x - (εv)·x - ψ(θ₀)) v(dx)

≤ (∫ exp(θ_ε·x - ψ(θ_ε)) v(dx)) exp(ψ(θ_ε) - ψ(θ₀) - εα)

≤ c exp(-εα) .    ||
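As a quick numerical illustration (an editorial addition, not part of the original text), the bound (1) with the constant (2) can be checked for a concrete one-parameter family. The sketch below uses the Poisson family with ψ(θ) = e^θ, θ₀ = 0 (mean 1), v = 1 and ε = 0.5; the use of scipy and the particular values of α are incidental choices.

```python
# Numerical check of Theorem 7.1 for the Poisson family:
# psi(theta) = exp(theta), theta_0 = 0 (mean 1), v = 1, eps = 0.5.
# Bound: Pr_{theta_0}(X >= alpha) <= c * exp(-eps * alpha), with
# c = exp(sup{psi(theta) - psi(theta_0): |theta - theta_0| = eps}).
import numpy as np
from scipy.stats import poisson

theta0, eps = 0.0, 0.5
psi = np.exp                      # cumulant function of the Poisson family
c = np.exp(max(psi(theta0 + eps), psi(theta0 - eps)) - psi(theta0))

for alpha in [2, 5, 10, 20]:
    tail = poisson.sf(alpha - 1, mu=np.exp(theta0))   # Pr(X >= alpha)
    bound = c * np.exp(-eps * alpha)
    print(f"alpha={alpha:3d}  tail={tail:.3e}  bound={bound:.3e}")
    assert tail <= bound
```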
Note that (2) provides a specific formula for the constant appearing in ( 1 ) . In specific situations the bound provided in Theorem 7.1 can be improved in various ways.
However the following converse result shows that
Theorem 7.1 always comes within an arbitrarily small amount of yielding the best exponential rate of decrease for tail probabilities.

7.2 Proposition

Let θ₀ ∈ N°. Suppose there exists a c < ∞ and ε > 0 such that 7.1(1) is valid for all v ∈ R^k with ||v|| = 1 and all α > 0. Then {θ: ||θ - θ₀|| < ε} ⊂ N°. (Thus, if for some ε > 0, c < ∞, a bound of the form 7.1(1) is valid for all v with ||v|| = 1 and all α > 0, then Theorem 7.1 will verify such a bound for any ε' < ε.)

Proof. We leave the proof as an exercise.    ||
When ε = inf {||θ - θ₀||: θ ∉ N} then 7.1(1) may or may not be valid for all α, v. The following example demonstrates this.

7.3 Example

Relative to Lebesgue measure, let

(1)    f_{η,k}(y) = y^{k-1} e^{-y/η}/(Γ(k) η^k) ,   y > 0 ;    f_{η,k}(y) = 0 ,   y ≤ 0 .

This is the gamma density with scale parameter η and shape parameter k.
Let x₁ = y, x₂ = ln y, θ₁ = -η⁻¹, θ₂ = k - 1, and let v be the measure induced by the map y → x when y has Lebesgue measure on (0, ∞). One then has a standard exponential family of order 2 with

ψ(θ) = ln Γ(θ₂ + 1) - (θ₂ + 1) ln(-θ₁)

and

(2)    N = (-∞, 0) × (-1, ∞) ,    K = {(x₁, x₂): x₁ ≥ 0, x₂ ≤ ln x₁} .

When k = 1 (i.e. θ₂ = 0) the resulting one-parameter exponential family is that of exponential distributions with intensity |θ₁|. For this family

Pr_{θ₁ = -1}{X₁ > α} = e^{-α}    for all α > 0

so that 7.1(1) holds with v = 1 and ε = 1 = inf {||θ - θ₀||: θ ∉ N}. On the other hand, for θ₂ = 1 the resulting one-parameter gamma family has

Pr_{θ₁ = -1}{X₁ > α} = (α + 1)e^{-α}    for all α > 0 .

Thus here 7.1(1) fails to hold when v = 1 and ε = 1 = inf {||θ - θ₀||: θ ∉ N}.
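A quick numerical companion to the two tail formulas above (an editorial sketch; the use of scipy is incidental): for the exponential case the rate e^{-α} is attained exactly, while for the shape-2 gamma the extra factor (α + 1) eventually exceeds any bound of the form c·e^{-α}.

```python
# Tails of the gamma family of Example 7.3 at theta_1 = -1 (eta = 1):
#   shape k = 1 (exponential):  Pr(X1 > a) = exp(-a)
#   shape k = 2:                Pr(X1 > a) = (a + 1) * exp(-a)
import numpy as np
from scipy.stats import gamma

for a in [1.0, 5.0, 10.0]:
    print(a,
          gamma.sf(a, 1), np.exp(-a),            # k = 1
          gamma.sf(a, 2), (a + 1) * np.exp(-a))  # k = 2
```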
When N = R^k Theorem 7.1 says only that Pr_{θ₀}{H⁺(u, α)} = O(e^{-kα}) for all k > 0. However, much smaller bounds may be valid for these tail probabilities. Consider for example the following well known facts:

(3)    ∫_α^∞ e^{-t²/2} dt ≤ e^{-α²/2}/α    for α > 0 ,

and

(4)    ∫_α^∞ e^{-t²/2} dt ~ e^{-α²/2}/α    as α → ∞ .

Thus, suppose X is normal, mean 0, variance 1. Then, from (3),

(5)    Pr{X > α} ≤ e^{-α²/2}/(α(2π)^{1/2})    for α > 0 .

It can be seen from (4) that this bound is asymptotically accurate as α → ∞. Theorem 7.5 contains a bound which easily yields the statement

(6)    Pr{X > α} ≤ e^{-α²/2}

for this situation. This is much better than what is available from 7.1(1) but is still inferior to (5).
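For concreteness (an editorial addition), the exact tail and the two bounds (5) and (6) can be tabulated side by side:

```python
# Compare the exact standard normal tail with the bounds (5) and (6).
import numpy as np
from scipy.stats import norm

for alpha in [1.0, 2.0, 3.0, 5.0]:
    exact = norm.sf(alpha)                                   # Pr{X > alpha}
    bound5 = np.exp(-alpha**2 / 2) / (alpha * np.sqrt(2 * np.pi))
    bound6 = np.exp(-alpha**2 / 2)
    print(f"alpha={alpha:.1f}  exact={exact:.3e}  (5)={bound5:.3e}  (6)={bound6:.3e}")
```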
Theorem 7.1 applies to probabilities of large deviations defined by half spaces but can easily be converted to a statement about any shape of set, as follows.
7.4 Corollary

Consider a standard exponential family. Fix θ₀ ∈ N°. Let x₀ ∈ R^k. Let S be any set. Let ρ = inf {||x - x₀||: x ∉ S}, and define ε as in Theorem 7.1. Then there is a c < ∞ such that

(1)    Pr_{θ₀}({(X - x₀)/α ∉ S}) ≤ c exp(-ερα) ,    α ∈ R .

Proof. It suffices to prove the corollary for x₀ = 0 and S the open sphere of radius ρ about the origin. There exist ρ' < ρ and ε' < inf {||θ - θ₀||: θ ∉ N} such that ε'ρ' = ερ. There exists a finite set of unit vectors {a_i: i = 1, ..., n} such that ∩_{i=1}^n {x: x·a_i < ρ'} ⊂ S. Thus

Pr_{θ₀}{X/α ∉ S} ≤ Σ_{i=1}^n Pr_{θ₀}{X·a_i > αρ'} ≤ Σ_{i=1}^n c exp(-αρ'ε') ≤ c exp(-ερα)

by Theorem 7.1, where c < ∞ is an appropriate constant.    ||
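To make the covering step of the proof concrete (an editorial sketch with made-up numbers), in R² one can count how many unit vectors a_i are needed so that ∩_i {x: x·a_i < ρ'} lies inside the ball of radius ρ: it suffices that ρ' ≤ ρ cos(π/n).

```python
# Covering step of Corollary 7.4 in R^2: with a_i at angles 2*pi*i/n,
# the intersection of {x . a_i < rho_prime} lies in the open ball of
# radius rho as soon as rho_prime <= rho * cos(pi / n).
import numpy as np

rho, rho_prime = 1.0, 0.9
n = 1
while rho_prime > rho * np.cos(np.pi / n):
    n += 1
print("n =", n)                  # smallest admissible number of vectors

# spot-check: every point on the sphere of radius rho violates some constraint
a = np.array([[np.cos(2 * np.pi * i / n), np.sin(2 * np.pi * i / n)] for i in range(n)])
for ang in np.linspace(0, 2 * np.pi, 200):
    x = rho * np.array([np.cos(ang), np.sin(ang)])
    assert (a @ x).max() >= rho_prime - 1e-12
```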
FIXED PARAMETER (Via Kullback-Leibler Information)

It is possible to use the Kullback-Leibler information number (i.e. entropy) to improve the preceding bound. See the exercises for some applications of this bound to asymptotic theory.
7.5 Theorem

Let θ₀ ∈ N° and H⁺ = H⁺(v, α). Then

(1)    P_{θ₀}(H⁺) ≤ exp(-K(H⁺, ξ(θ₀))) .

Proof. Suppose first that

(2)    H⁺ ∩ K° ≠ φ .

Let ξ = ξ_{H⁺}(θ₀). Note that ξ ∈ H⁺ ∩ K° by Theorem 6.13. Hence θ = θ(ξ) ∈ N°. (This is precisely the situation pictured in Figure 6.14(1).) Now,

(3)    k = K(H⁺, ξ(θ₀)) = (θ - θ₀)·ξ - ψ(θ) + ψ(θ₀) ≤ (θ - θ₀)·x - ψ(θ) + ψ(θ₀)    for all x ∈ H⁺

by definition and by 6.13(2). This yields

P_{θ₀}(H⁺) = ∫_{H⁺} p_{θ₀}(x) v(dx)

= ∫_{H⁺} (p_{θ₀}(x)/p_θ(x)) p_θ(x) v(dx)

= ∫_{H⁺} exp((θ₀ - θ)·x - ψ(θ₀) + ψ(θ)) p_θ(x) v(dx)

≤ ∫_{H⁺} exp(-k) p_θ(x) v(dx) ≤ e^{-k} ,

which is the desired result. Now suppose H⁺ ∩ K ≠ φ but H⁺ ∩ K° = φ. Then
(4)    lim_{ε↓0} K(H⁺(v, α - ε), ξ(θ₀)) = K(H⁺(v, α), ξ(θ₀)) < ∞

since K(·, ξ(θ₀)) is lower semi-continuous (by definition), satisfies lim_{||ξ||→∞} K(ξ, ξ(θ₀)) = ∞ (by 6.5(5)), and since K(H⁺(v, α), ξ(θ₀)) ≥ K(H⁺(v, α - ε), ξ(θ₀)) for all ε > 0. Hence

(5)    P_{θ₀}(H⁺) = lim_{ε↓0} P_{θ₀}(H⁺(v, α - ε)) ≤ lim_{ε↓0} exp(-K(H⁺(v, α - ε), ξ(θ₀))) = exp(-K(H⁺, ξ(θ₀))) .    ||

(We leave as an exercise to verify that

(6)    K(H⁺, ξ(θ₀)) = ∞ if and only if P_{θ₀}(H⁺) = 0 .)
Note that the Kullback-Leibler information enters into the above only as a convenient way of identifying the sup {(θ - θ₀)·x - ψ(θ) + ψ(θ₀): x ∈ H⁺}. Various other interpretations of K, such as the probabilistic Definition 6.1, do not enter into the above argument. The connection between Theorem 7.5 and 7.1 is provided by the following lemma.

7.6 Lemma

Let θ₀ ∈ N° and H⁺ = H⁺(v, α). Suppose θ = θ₀ + εv ∈ N°. Then

(1)    K(H⁺, ξ(θ₀)) ≥ ψ(θ₀) - ψ(θ) + εα .

Proof. Let ξ = ξ_{H⁺}(θ₀) as in Theorem 7.5. Then

K(H⁺, ξ(θ₀)) = (θ(ξ) - θ₀)·ξ + ψ(θ₀) - ψ(θ(ξ)) ≥ (θ - θ₀)·ξ + ψ(θ₀) - ψ(θ)

since θ(ξ) = θ_N(ξ) maximizes ℓ(·, ξ). Hence
K(H⁺, ξ(θ₀)) ≥ εv·ξ + ψ(θ₀) - ψ(θ) = εα + ψ(θ₀) - ψ(θ) .    ||

Applying the bound (1) in the formula 7.5(1) yields the earlier formulae, 7.1(1) and (2), of Theorem 7.1. Note also that in the normal example of Example 7.3, K(ξ, 0) = ξ²/2, and thus 7.5(1) yields 7.3(6).
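To see how much the entropy bound 7.5(1) can improve on 7.1(1) in another concrete case (an editorial illustration, not from the text), take again the Poisson family with ψ(θ) = e^θ and θ₀ = 0. For H⁺ = [α, ∞) with α ≥ 1 one has K(H⁺, ξ(θ₀)) = α ln α - α + 1, and the two bounds can be tabulated against the exact tail:

```python
# Compare the Chernoff-type bound 7.1(1) with the entropy bound 7.5(1)
# for the Poisson family, psi(theta) = exp(theta), theta_0 = 0 (mean 1).
import numpy as np
from scipy.stats import poisson

def kl_bound(alpha):
    # exp(-K(H^+, xi(theta_0))) with K = alpha*log(alpha) - alpha + 1, alpha >= 1
    return np.exp(-(alpha * np.log(alpha) - alpha + 1))

eps = 0.5
c = np.exp(np.exp(eps) - 1)          # constant (2) of Theorem 7.1

for alpha in [5, 10, 20]:
    exact = poisson.sf(alpha - 1, mu=1)          # Pr(X >= alpha)
    print(f"alpha={alpha:3d}  exact={exact:.3e}  "
          f"7.1(1)={c * np.exp(-eps * alpha):.3e}  7.5(1)={kl_bound(alpha):.3e}")
```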
FIXED REFERENCE SET The preceding results concern the nature of probabilities of large deviations when the parameter is fixed and the reference set for calculating the probability proceeds to i n f i n i t y .
There is another class of results.
These
concern the situation when the reference set is fixed and the parameter proceeds to i n f i n i t y in an appropriate direction.
These theorems were exploited
in a s t a t i s t i c a l setting by Birnbaum (1955) and then Stein (1956).
Giri (1977)
surveys several further applications of this theory.
7.7 Theorem

Let v ∈ R^k, α ∈ R. Let K ⊂ N be compact. Let S₁, S₂ ⊂ R^k with

(1)    S₂ ⊂ H⁻(v, α) ,

(2)    v(S₁ ∩ H⁺(v, α)) > 0 .

Then there exist constants c and ε > 0 such that

(3)    ∫_{S₂} e^{θ·x} v(dx) / ∫_{S₁} e^{θ·x} v(dx) ≤ c exp(-ρε)

for all θ ∈ N of the form θ = η + ρv with η ∈ K, ρ > 0.
Proof. Let S₁(ε) = S₁ ∩ H⁺(v, α + ε). There is an ε > 0 such that v(S₁(ε)) > ε > 0. Then,

∫_{S₂} e^{θ·x} v(dx) / ∫_{S₁} e^{θ·x} v(dx)

≤ ∫_{S₂} exp(ρ(v·x - α) + ρα + η·x) v(dx) / ∫_{S₁(ε)} exp(ρ(v·x - α) + ρα + η·x) v(dx)

≤ e^{-ρε} ∫_{S₂} e^{η·x} v(dx) / ∫_{S₁(ε)} e^{η·x} v(dx)

≤ c exp(-ρε)

where

(4)    c = sup_{η∈K} (∫_{S₂} e^{η·x} v(dx) / ∫_{S₁(ε)} e^{η·x} v(dx)) < ∞ .

Here is why c < ∞: K is compact and v(S₁(ε)) > 0 so that inf_{η∈K} ∫_{S₁(ε)} e^{η·x} v(dx) > 0. Also, ∫_{S₂} e^{η·x} v(dx) is upper semicontinuous on K by Fatou's lemma, and is finite on K since K ⊂ N. Thus sup_{η∈K} ∫_{S₂} e^{η·x} v(dx) < ∞.    ||
The preceding theorem really concerns the relationship of probabilities for the sets S₂ and S₁(0) = S₁ ∩ H⁺(v, α) contained in separate half spaces. Note again the dual relationship, connecting θ ∈ N and H ⊂ K, in Theorem 7.7. Because of this relationship it is often revealing in such contexts to superimpose both the sample space and parameter space on a single plot. This is done in Example 7.12(1). Here are some corollaries to the Theorem, the second of which will be used in the example. The first of these corollaries may be instructively compared to Theorem 7.1.
STATISTICAL EXPONENTIAL FAMILIES
7.8
Corollary k Let v € R ,
KcWbe
compact, and S c H ( a , α ) . Suppose
v(H + (v, α ) )
(1)
> 0
.
Then there e x i s t constants c and ε > 0 such that Pr θ (S) for
<_ c exp(-pε)
a l l θ € W of the form θ = η + pv with η € K,
any sequence {θ.. € hi: θ.. = p..v + η ^ ,
p. -+«>,
lim Pr f i (S) θ i-*» i Proof.
In p a r t i c u l a r ,
for
. e K} one has
η>
= 0
.
Let S 2 = H (v, α ) . Then by Theorem 7.7 PrAS)
< c exp(-pε) J
s
=
7.9
p > 0.
θ
χ
e
c exp(-pε)Prθ(S2)
" ψ ( θ ) v(dx)
<_ c exp(-pε)
.
||
Corollary Again, l e t v € R ,
K c W ° be compact, and v(S) > 0 ; and l e t {θ..}
be any sequence of the form θ. = p.v + η. with p. -> » and η. € K.
(1)
lim Eft (v i^o θ i
X)
=
sup{α:
Then
v(H + (v, α ) ) > 0} < «, .
(Note that here we assume K c A/°; not merely K c hi.) Proof. it
Let α n denote the supremum on the right of ( 1 ) . Since Eft (v X) <_ α n
is only necessary to prove lim i n f EA (v X) 2 l α n ϋ u i-*χ> i
T° this end, l e t
α < α 1 < α Q and S 2 = H"(v, α 1 ) . Let ξ 2 ( θ ) = EΘ(X|X € S 2 ) . result is t r i v i a l .
Hence, suppose v(S 2 ) > 0.
continuous for a l l θ € N°.
Hence 3 = i n f ί v
I f v(S 2 ) = 0
the
Note that ξ 2 ( θ ) exists and is ζ 2 ( η ) : η € K} > -°°.
Note that
TAIL PROBABILITIES
217
3<α'. Apply Corollary 2.5 to the conditional exponential family given X € S9 (generated by v| c ) to find ά
|S2 E θ (v
X|X € S 2 )
for all θ = η + pv with p _> 0. E θ (v
X)
>_ E η (v
X|X € S 2 )
>.
3
Then for such θ,
= Pr θ (X € S 2 )
E θ (v
X|X € S 2 )
+ Pr θ (X ί S 2 )
E θ (v
X|X € S - S 2 )
>. ( c e " ε p / ( l + c e " ε p ) ) 3 + ( 1 / ( 1 + c e ' ε p ) ) α ' by Theorem 7.7.
p sufficiently large. arbitrary.
X) > α (since α < α 1 ) for θ as above for a l l
Hence E (v
This implies lim inf EA (v
X) > α n , since α < α n was
|| Note the placement of the hyperplane H in the statement of Theorem
7.7.
I f S 2 cz H" and v(S. n H+) > 0, but v(S, n H+) = 0, then only a much weaker
conclusion is valid. 7.10
This conclusion is contained in the following corollary.
Corollary Let v € R , α € R.
(1)
Suppose S2
c
H"(v, α)
and v(Sj n H + (v, α))
(2)
Let K c N be compact. = PΊ v + nΊ with n i e K,
(3)
> 0
.
Let { Θ . J c W b e a sequence of the form
p1 + ».
Then
lim—! θi
= 1
0
.
218 Proof.
STATISTICAL EXPONENTIAL FAMILIES Apply Theorem 7 . 7 t o f i n d P θ ( S 2 ΓΊ H " ( v , α - ε ) )
(4)
lim θ.
for a l l ε > 0.
1
Furthermore, i f p.. > 0
Pθ (S 2 n H + (v, α - ε ) ) (5)
P
(S 2 ίl H + (v, α - ε ) )
1
!
as ε -• 0
uniformly for η. € K
(The inequality in (5) follows after applying Corollary 2.23 to the functions
hc(x) = X s -cχ s . π H + ( v ,α-ε) W Ί t h C c h o s e n so t h a t E η . ( M X ^
=
°t0 find
that
E ϋ. (hC (X)) —> Eη.j(hC (X)) for all c and p. > 0.) Combining (4) and (5) yields Ω i the conclusion of the corollary.
||
7.11 Example 2 Consider the usual sufficient statistics X, S derived from a normal (μ, σ ) sample. As explained in Example 1.2 the statistics X, = X, 7 7 X2 = S + X are the canonical statistics for a two-parameter exponential 2 2 family with canonical parameters θ, = nμ/σ , θ 2 - -n/2σ . Note that 2 K = { ( x , , x 2 ) : x2 1 x p tor some c > 1 consider the conditioning set Q = { ( x r x2):
x2 1
cx
i>
=
t(x»
s 2
)
:
x 2 /s 2 1 V(c
"
ι
)}'
(™
s
i s
t h e
s e t
on
which the usual two-sided t - t e s t (based on t = /rvT x/s) with n - 1 degrees of freedom accepts at the appropriate level determined by c.) σ
2
+ 0.
Then ( θ ^ θ 2 ) = (π/σ ) ( μ Q 5 -h).
ray with slope - % i n as σ -* 0.
Fix μ = μ Q and l e t
Thus ( θ ^ θ 2 ) proceeds down the
Both X and Θ are displayed on the plot in
Figure 7.11(1), which shows also K, Q, and this l i n e . Corollary 7.9 applied to the conditional exponential family given X € Q (generated by the measure v restricted to Q) yields
TAIL PROBABILITIES
219
(1)
sup { μ ^ - X g / Z K x j , Note that E(μ Q X 1 - X £ /2 |X € Q) = ( μ Q , -h) S
X 2 ) | χ € Q) € Q.
sequence ί ( x l Ί »
X
Q
i f
(χϋs
a n d
the tangent to Q a t the point (\ιQ/c, ( n / σ 2 ) ( μ 0 , -H).)
(2)
μQ/2c2
=
μ
WQ/C) is perpendicular
X
2
2 ^
#
c
(Note that
to the ray
Thus
11m
E
(
μ
σ
2)<ίχi
X
2)IX _
€
Q
>
=
(μ0/c>
μ
0
/ c ) =
e
0
( s a y )
p
In terms o f the t r a d i t i o n a l v a r i a b l e s X, S , and t = /n-1 x/s t h i s y i e l d s 2 (3)
lim σ2->0
Ef ^ °'
2
2 )
((X, S )| l t l < τ ) '
Example 7 . 1 1 ( 1 ) :
.
X 2 )|X e Q) and t h a t
Furthermore since Q is s t r i c t l y convex 2 2 /2c μ 0 / 2 c = s u p ' 0
c
2 Ί )^
E((Xr
x 2 ) € Q}
= ( 9 ^τ + n - 1
f
(τ +
Picture for Example 7.12
9_ n - 1
COMPLETE CLASS THEOREMS FOR TESTS (Separated Hypotheses) The preceding results can be used to prove admissibility of many conventional test procedures in univariate and multivariate analysis of variance and in many other testing situations involving exponential families. When combined with the continuity theory for Laplace transforms of Section 2.17 these results yield useful complete class characterizations for certain classes of problems.
In many of these cases the characterization precisely describes
the minimal complete class.
The general theory, as well as a very few specific
applications, is described in the remainder of this chapter. cations can be found in the cited references.
Many more appli-
The results to follow should be
compared to the results in the same s p i r i t for estimation which appear in Chapter 4. 7.12
Setting and Definitions Throughout the remainder of this chapter {p Q : θ€Θ}
is a standard
Ό
exponential f a m i l y .
The parameter space Θ is divided i n t o non-empty n u l l
and a l t e r n a t i v e spaces ΘQ, Θ-; so t h a t Θ = ΘQ U Θ..
In the customary fashion,
a t e s t of Θg versus Θ, is uniquely s p e c i f i e d by i t s c r i t i c a l f u n c t i o n , φ, where Φ(x)
= P ( t e s t r e j e c t s ΘQ|X = x ) .
Φ 1 i s as good as a t e s t Φ 2
V
(1)
π
The power of ψ i s π ( θ ) = E θ ( ψ ) .
A test
if
θ ) (θ)
e >
θ€θ
>. πφ (θ)
θ eΘ
*• V
I t is better i f there is s t r i c t inequality for some θ € Θ.
(Here, and in what
follows, we write, "a test φ" in place of the more precise but cumbersome phrase, "a test with c r i t i c a l function φ".) no better test.
A test is admissible i f there is
The decision-theoretic formulation with a two-point action
space A = {a^, a.} and a loss function of the form L(θ,
a.) = A(θ) > 0 J
if
θ d Θ., j
=0
if
θ € Θ. , J
yields the same ordering among tests, and hence the same collection of
admissible tests. Let (2)
Ur
= ϋ Γ (Θ, θ 0 )
=
(u: I lul I = 1, 3 θ € 0 3 I Iθl I > r, I
and
u = j ^ J^ \ , llθ - θ o ll J
r £θ and let (3)
U(0, θ 0 ) u
=
n U ( θ , θQ) u r >0 r
and
U*(0, θ n ) υ
=
Π 0 ( 0 , θn) ϋ r^O r
Note that i f 0 is a closed cone then U = U*; more generally U c U*.
.
I t is
possible that U = φ but U* f ψ. If S cR (4)
is a convex set l e t α(u)
= ou(u)
= sup {x u:
x € S}
.
This function is defined for u € R , , although we will mainly be interested its values for | | u | | = 1. (5)
As is well known,
S =
n
FΓ(u, α s ( u ) )
.
I t is clear from the definition (4) that α( ) is lower semi continuous. The following lemma is a key result which leads directly to the f i r s t main theorem.
A result of this type was f i r s t proved and used by
Birnbaum (1955) in the case of testing for a normal mean.
A general result
similar to the following lemma was then proved and applied in Stein (1956b). 7.13
Lemma Fix θ 2 € Rk.
(1)
where U* = U*(0 1> θ 2 ) .
Let S =
n ίΓ(u, α ς ( u ) ) b u€U*
Assume further either that
in
222
STATISTICAL EXPONENTIAL FAMILIES
(2)
S =
n FΓ(u, α ς (u)) , S U€U
(U = U(Θ, θ J ) , ά
or ou(u) i s continuous a t u f o r a l l u e U* - U.
Let φ ^ x ) = 1 f o r a l l x ί S.
Suppose Φ2 is as good as φ j . Then Φ 2 (x) = 1 f o r x i S, a . e . ( v ) . (Note: v{x:
x ί S ,
Proof.
A more formal way to s t a t e the conclusion of the lemma is
φ 2 ( x ) < 1} = 0 . ) Assume f o r convenience θ 2 = 0.
is f a l s e .
Suppose the conclusion of the lemma
Then there is an ε Q > 0 , u Q e U* such that
(3)
CQ
=
{x:
Φ 2 (x) < l - e
l
o
satisfies v(C Q n H + ( u Q , α ( u Q ) ) )
(4) Assume u Q € U. {p.uQ:
i=l,...}
cz 0 .
Theorem 7.7 y i e l d s
\~ f 2
.
Then there is a sequence { p . } with p. -> °° such that
>-yCΊV 1 - π
> 0
(ρ u 0 ) 1
V ( ε
)
) e9'X π
n
v(dx)
X
/
\
e J υ(dx) + C o nH (u o ,α(u o ))
<_ CQ exp (-p.ε ) + 0
as
i+«
Hence π. (p.un) > π. (pn u n ) for i sufficiently large, which shows that φ 9 is φ
1 U
φ
c.
1 U
not b e t t e r than φ-. Now assume u Q ί
U but ou(u) is continuous a t u Q e U* - U.
Then
ε Q > 0 i n ( 3 ) can be chosen small enough so t h a t (6)
v(Cn Π H+(u, α(u))) u
f o r a l l ||u||=l with | | u - u o | | < ε Q .
>
εn u
Theorem 7 . 7 , including formula 7 . 7 ( 4 )
the constant c appearing i n 7 . 7 ( 3 ) , now y i e l d s , f o r θ = pu € M,
for
TAIL PROBABILITIES
1 - 7τ Φ
(7)
(pu) *
<
1 - πA (pu) n
1 for
||u|| = 1 with ||U-UQ|| < ε Q .
epu'x
v(dx)
/ epu*x C o nH + (u,α(u))
v(dx)
/ fl~(u,ct(u))
~
223
£
(l/εo)e-P o
uQ € U * ( Θ 1 )
implies there e x i s t s a sequence
θ.
e Θ χ with
| | θ . | | ->oo such t h a t θ . / d l θ . H ) -• u Q .
π
(θj) > π
( θ Ί) for i sufficiently large.
than φ-.
It
I t follows from ( 7 ) t h a t
Consequently φ« i s not b e t t e r
follows from the two cases t r e a t e d above t h a t φ ? b e t t e r
than φ, implies Φ 2 (x) = 1 for (a.e.) x ί S.
Lemma 7.13 leads directly to a criterion which can often be used to prove admissibility of conventional tests for appropriate testing problems. 7.14 Corollary Let {p : θ e 0 } , θ = 0Q U θ j be a standard exponential family, as in 7.12.
Let θ 2 € Rk and
(1)
S
=
n H"(u, α ς ( u ) ) b u€U*
where U* = U * ^ , θ 2 ) , as in 7.13(1). Assume (also as in 7.13) that 7.13(2) is satisfied or that ou(u) is continuous at u for all u € U* - U. Let φ(x) = 1 - χ s (x) Proof.
(= 0 if x € S, =1 if x £ S ) . Then φ is an admissible test.
Suppose φ 1 is any test as good as φ. Then, φ'(x) = φ(x) = 1 for
a.e.(v) x € S by Lemma 7.13. But then, π.,(θ 0 ) i ^ ( Θ Q ) implies φ'(x) = φ(x) = 0 for a.e.(v) x 6 S. Thus, φ 1 = φ a.e.(v). admissible. Remark.
It follows that φ is
|| It follows from Corollary 7.14 that if θ^ is a bounded null hypothe-
sis and Θ = R k then any nonraήdomized test with convex acceptance region is
224
STATISTICAL EXPONENTIAL
admissible.
FAMILIES
When ΘQ = { θ Q } is simple and v is dominated by Lebesgue measure
such t e s t s i n f a c t form a minimal
complete class — i . e . a t e s t is
admissible
i f and only i f i t is nonrandomized and has convex acceptance region
(a.e.(v)).
This i s the fundamental r e s u l t which was proved by Birnbaum ( 1 9 5 5 ) .
See
Exercise
7.15
7.14.3.
Application
( U n i v a r i a t e general l i n e a r model)
Here is a customary canonical form f o r the normal theory general Y € Rp has the normal N(μ, σ I ) d i s t r i b u t i o n , μ s + 1 = . . . = μ
l i n e a r model: σ
2
1 £
> 0 , and the null Γ
£
s
£ P
hypothesis to be t e s t e d is t h a t μ, = . . . = μ
(See, e . g . Lehmann (1959, Chapter 7 ) . )
= 0,
= 0,
This can be reduced
via s u f f i c i e n c y and change o f v a r i a b l e s to a t e s t i n g question o f the form P 2 considered above.
Let X. = Y.,
butions of X = ( X 1 » . . . , X S + J
i=l,...,s,
X$+1 =
i =l , . . . , s ,
hypothesis i s , t h e r e f o r e , Θ Q = {θ € N:θ. = 0 , I
θ $ + 1 = -l/2σ
Then the d i s t r i -
The F-test when
.
The null
i = l , . . . , r } , so t h a t
Qd. > 0 } , where of course W = {θ € R s + I :
Figure 7 . 1 5 ( 1 ) :
.
form a minimal standard exponential f a m i l y with 2 2
canonical parameters θ . = μ Ί /σ ,
Θ χ = {θ € N:
Σ Y
r = l
θ g + 1 < 0}
= s,
p = 2
.
TAIL PROBABILITIES
225
The usual likelihood ratio F-test accepts if (and only if)

(1)    F = (Σ_{i=1}^r Y_i²/r) / (Σ_{j=s+1}^p Y_j²/(p - s)) ≤ F_α ,

as determined from tables of the F-distribution. In terms of the canonical variables this region is

(2)    (Σ_{j=1}^r X_j²/r) / ((X_{s+1} - Σ_{j=1}^s X_j²)/(p - s)) ≤ F_α ,

or

(3)    K Σ_{j=1}^r X_j² + Σ_{j=r+1}^s X_j² ≤ X_{s+1} ,    where K = 1 + (p - s)/(r F_α) > 1.
(The simple situation for r = 1 = s, p = 2 is illustrated in Figure 7.15(1), above, which shows K in the upper half-space and N in the lower half.
Compare
Figures 7.11(1) and Figure 7.12.3.) Consider a point z in the boundary of the acceptance region (3). s r 2 ? Thus, K Σ z. + Σ z = z,.,-,. The outward normal at z is v = (2Kz Ί ,... ,2Kz , J J s+l I r Ί r+1 Γ 2 Z Γ + Ί , . . . , 2 z $ , - 1 ) . Except for the (s + 1 - r) dimensional set having Σ Z . = 0 all positive multiples of this vector lie in Θ J. It follows that 7.13(1) and 7.13(2) are satisfied (for any choice of θ Q € Θ Q ) . Thus the F-test (1) (or (2)) is admissible. Note that the test remains admissible by the same r 2 2 reasoning if e , is restricted by Σ y. > aσ since then r
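As a small numerical companion (an editorial sketch, not from the text), the constant K in (3) can be computed directly from the tabled F critical value; the level α and the use of scipy are illustrative choices.

```python
# Compute K = 1 + (p - s)/(r * F_alpha) for the acceptance region (3),
# where F_alpha is the upper-alpha critical value of F with (r, p - s) d.f.
from scipy.stats import f

def region_constant(r, s, p, alpha=0.05):
    F_alpha = f.ppf(1 - alpha, dfn=r, dfd=p - s)   # tabled critical value
    return 1 + (p - s) / (r * F_alpha)

# Example: the simple case r = s = 1, p = 2 pictured in Figure 7.15(1).
print(region_constant(r=1, s=1, p=2))   # a value K > 1
```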
Θ,I = {θ € W:
2 Σ θI > -2 a θs+i c+1} i=1
The same style of reasoning can be used to prove admissibility of a wide variety of tests involving the univariate and multivariate general linear model.
It was used in Stein (1956b) to prove admissibility of
226
STATISTICAL EXPONENTIAL FAMILIES
Hotelling's T
test; Giri (1977) contains a compilation of other results
provable by this method, and further references. 7.16
Discussion I f a test is shown to be admissible by virtue of Theorem 7.14 this
does not, in i t s e l f , constitute a strong recommendation in favor of the test. In principle the following situation may exist:
there may be another test φ1
with π , (θ) <_ τr.(θ) for a l l θ € ΘQ and with π.,(θ) >. π (θ) for "most" θ € Qy I t might occur that π , (θ^) > π ( θ . ) for θ € Θ-except when both π , and π are \/ery nearly one.
In such a case φ' would dominate φ for a l l practical purposes.
Of course, a procedure whose admissibility can be proved by Theorem 7.14 may also be a desirable one. this.
The F-test of 7.15 is a good example of
I t is admissible from several perspectives in addition to that of
Theorem 7.14. The most surprising of these properties is undoubtedly the fact that i t is a Bayes test.
See Kiefer and Schwartz (1965) and Exercise
7.16.2. The F-test is also locally optimal (D-optimality) in the sense that i t maximizes (among level-α tests) ?
(1)
min σ y(EΘ0
Γ a2 ? Σ - \ π. (μ, σ )
i = l d/.
.
φ
See Giri and Kiefer (1964) or Giri (1977) and Exercise 7.16.3. When r = s the F-test, φr> is also optimal in the sense that for any constant c > 0 and any level-α test Φ (2)
r Σ μ 2 /σ 2 = c 2 } i=i i 2 ^ 2 2 2 > min {π.(μ, σ ): Σ μ./σ = c} Ί Φ i =i
min {π. (μ, σ 2 ) : ΦF
with equality only if φ = φ_. Note that the left side of (2) is a constant. See Brown and Fox (1974b). Brown and Fox (1974a) yields the same result for s + 1 = r. For r £ s + 2 it is only known that the (minimax) inequality (2) is valid without the (admissiblity) assertion of equality only if φ = φp. This
TAIL PROBABILITIES
227
(minimax) assertion follows from the Hunt-Stein theorem as stated in Lehmann (1959). The next lemma is needed for the complete class theorems which follow i t . 7.17
The lemma can be viewed as an elaboration of Theorem 2.17.
Lemma Let ω be a sequence of ( l o c a l l y f i n i t e ) measures concentrated on
QcR
,
Then there exists a subsequence ω ,, a closed convex set S, and a
( l o c a l l y f i n i t e ) measure
ω concentrated on Θ such that
λ ω ( (b) * »
,
b (£ S .
If ω i, ω, and S are as in (1) and θ 2 £ R then (2)
S =
where U* = U*(Θ, θ 2 ) .
Proof.
n R"(u, α ς (u)) b u€U*
,
(This is s i m i l a r to 7 . 1 3 ( 1 ) . )
The f i r s t part of the lemma is a direct consequence of Theorem 2.17.
To prove (2) l e t T = n H"(u, α Q ( u ) ) and suppose y € T°. b U€U*
there is an x(u) € S such that u
Then for eyery u € U*
x(u) > u y.
Define N(u) by (3)
N(u) = {v: I|v|l = 1, v
x(u) > v
y}
.
N(u) is a r e l a t i v e l y open subset of the unit sphere and u € N(u).
Hence
U N(u) 3 U*, and there is a f i n i t e subset u Γ . . . , u c U* such that u€U* r (4) N = U N(u.) => U* . Ί i =l For convenience l e t x i = x ί u ^ . (5)
sup { | | θ | | :
Now,
θ €0
,
228
STATISTICAL EXPONENTIAL FAMILIES
otherwise there would be a sequence v. ί S with v^ -> v (v £ tj since M is open) and a sequence p. •* °° such that p.v. € Θ, contradi cti on.
(6)
i=l,...
but then v 6 U* c |, a
Then
/eθ-yω.(dθ)<
eBHy|l
||Θ||
{{θ:
£ /e0#xi
ω.(dθ)
zxω (x.) by (3), (4), (5) and the simple fact that ω n ι(ίθ:
B11x Ί 11 θ x Ί
I |θ| I <_ B>) <_ e
• fe
I t follows from (6) and (1) that y € S.
Hence T° c s.
closed and convex this implies T = S.
||
ω n '(dθ) .
Since T and S are
Here is the complete class theorem from Farrell (1968). to situations where ΘQ is compact and ΘQ and Θ1 are separated sets. Theorem 7.19 for a partial converse.
I t applies See
Results like Theorem 7.18 and 7.19
have been proved in contexts somewhat more general than ordinary exponential families.
See Schwartz (1967), Oosterhoff (1969), Ghia (1976), Perlman (1980),
and Marden (1982a, 1982b), for such extensions and various applications.
In
k the following statement Θj denotes the closure in R , not merely the closure relative to W. 7.18
Theorem Let ΘQ c M be compact and assume θ Q n 0
admissible test.
= φ.
Let φ1 be an
Then there exists an equivalent test φ ( i . e . π.,(θ) = π.(θ)»
θ e ΘQ U Θ j ) , a convex set S satisfying 7.17(2), and a (locally f i n i t e ) measure H. on Θ•,
i = 0 , l , such that λn (x) < °° for x e S° and
TAIL PROBABILITIES
1
if
229
x ί S λ H (x)
(1)
Φ(x)
=
1
if
0
a.e.(v).
if
x € S° ,
—! λ H (x) λH (x) —1 λμH ( x ) 0
x € S° ,
>
1
<
1
,
I f (Θ Q u Θ j ) 0 f φ then Φ = Φ 1 ;
( λ H ( x ) is f i n i t e since H Q (Θ 0 ) < °°.)
and hence a l l admissible tests are of the form φ i n ( 1 ) . I f φ1 is admissible then according to Theorem 4A.10 there exists an
Proof.
equivalent test φ and a sequence of a priori
distributions G (concentrated
on f i n i t e subsets of Θ) whose Bayes procedures Φn (say), converge to φ in the topology of 4A.2. weak*
By Exercise 4A.2.1 this convergence means that φ + φ
— i .e.
(2)
/ (Φ n (x) - φ(x)) g(x)v(dx)
for every v integrable function g.
-
0
A consequence of (2) is that i f a subse-
quence of Φ n (x) converges pointwise on some (measurable) subset T c K (say Φ n .(x) •* λ ( x ) ,
x € T) then the l i m i t must be φ ( i . e . , φ(x) = λ ( x ) ,
x € T,
a.e. (v)) Let (3)
Tn(dθ)
H
Note that
=
e
"orW
-ψ(θ) G_(dθ)// e -Ψ(θ)
= 1.
Let ω = HQ
Φn
i = 0, 1
Then
1 (4)
θ € Θ ,
n (dθ)
G
λ
H
(x)
>
1
<
1
(x)
+ Hχ .
0
Let ω ,, ω, S be as in Lemma 7.17. Let H.. = u).Q
i=0, 1, so that H i n , -> HΊ ,
i=0, 1, as n1 ->«.
,
Then H Q ( Θ ) = H Q (Θ 0 ) = 1 since
230
STATISTICAL EXPONENTIAL FAMILIES
H Q n (Θ 0 ) = 1 and ΘQ is compact.
The assertions in (1) follow from this along
with Lemma 7.17, ( 4 ) , and the decision theoretic facts in the f i r s t paragraph of the proof. I f φ1 and φ are equivalent and (ΘQ U Θ j ° by completeness (Theorem 2.12).
t φ then φ1 = φ ( a . e . ( v ) )
||
Many of the tests produced by the recipe 7.18(1) are admissible. In certain s t a t i s t i c a l situations, i t can even be concluded that a l l of them are admissible.
Then Theorem 7.18 describes the minimal complete class.
following converse to Theorem 7.18 contains statements of these facts.
The It
is not entirely satisfactory but i t is the best general result we have been able to devise. (1)
Θ*
For the purpose of this theorem define =
{θj € Qy
θ1 € W
+ p,θ1 € Qι
θ 2 € β1 3 (1 - ρ)θ 2
or there is a
for
0 <_ p < 1}
(See Exercise 7.19.3 for an extension of ( 1 ) . ) 7.19
Theorem
Consider the testing problem described in Theorem 7.18. Suppose φ satisfies 7.18(1) where H, is concentrated on Θ* and S satisfies all the assumptions of Lemma 7.13 relative to some θ 2 € R . Suppose also that
(2)
φ(x)
1 )
if
. x€ S
\ and
( X )
>
1
λHμ (x) < 1 , 0
(This is a mild extension of the l a t t e r part of 7.18(1).)
a.e.(v)
Then any c r i t i c a l
function as good as φ must also satisfy (2) and 7.18(1) with the same values of S, H o , H r
I f also either
(31)
v({x:
λ H (x) —i
=
1
and
φ(x) < 1})
= 0
,
TAIL PROBABILITIES
231
or λ H (x) (3")
v({x:
—ΐ λ H (x)
=
1
and
φ(x) > 0})
= 0
,
or (Supp (H o + H^)0
(4)
f
φ
then φ is admissible; and i f η is as good as φ then η = φ a . e . ( v ) . I f v is dominated by Lebesgue measure, U(Θ, θ 2 ) = U*(Θ, θ 2 ) for some θ 2 € Rk, and Θj c Θ* then the collection of tests of the form 7.18(1) is a minimal complete class. Proof.
Suppose φ is defined by (2) and 7.18(1) where S s a t i s f i e s the
assumptions of Lemma 7.13. Suppose ηis another c r i t i c a l function as good as φ.
Then η(x) = 1 i f x f. S by Lemma 7.13. I f θ € Θ- then
0 £ /(η(x) - φ(x))e θ # x v(dx)
(5)
since 0 £ π (θ) - π (θ). By continuity (5) also holds if θ € § 1 n M. Now, suppose ζ = (1 - p)θ«+ pθ, € Θ, for 0 <_ p < 1. Then (5) holds at θ = ζ and ζ x /(η(x) - φ(x))e p v(dx) is continuous in p as p t 1 by Exercise 1.13.l(ii)It follows that (5) holds whenever θ € 0, . The opposite inequality to (5) holds when θ € Θ Q , and H Q is finite since Θ Q <= W is compact. Hence (6)
0 < / (/(η(x) - Φ(x))e θ # x vίdxJJίHjtdθ) - H Q (θ)) .
Notice that η(x) - φ(x) < 0 whenever λ u (x) > λn (x), so that —
Πj
n0
+
/ (η(x) - Φ(x)) λ μ (x) v(dx) £ /(η(x) - φ(x)) + λ H (x) v(dx) o
/ / e θ ' x v(dx) HQ(dθ)
232
STATISTICAL EXPONENTIAL FAMILIES
Furthermore, as already noted, H Q is a f i n i t e measure. i n t e g r a t i o n i n ( 6 ) can be reversed, y i e l d i n g
(7)
0
£
Hence the order of
that
/ (η(x) - Φ ( x ) ) ( λ H (x) - λH (x))v(dx)
<
-
with the i n t e g r a l extending only over the region x € S since η ( x ) = φ(x) f o r x ί S.
Because φ s a t i s f i e s ( 2 ) , the integrand i n ( 7 ) is non-positive; hence
η(x) also s a t i s f i e s If
( 2 ) , f o r otherwise the i n t e g r a l would be negative.
in addition φ s a t i s f i e s
( 3 1 ) then ^ . ( θ ^ > TΓ ( θ ^ ,
(a c o n t r a d i c t i o n ) unless η ( x ) = Φ(x) a . e . ( v ) .
Similarly i f
( 3 " ) is
s a t i s f i e d n ( x ) = Φ(x) a . e . ( v ) ; f o r otherwise π φ ( θ Q ) < ^ ( Θ Q ) , F i n a l l y , suppose ( 4 ) is s a t i s f i e d i n place o f ( 3 1 ) or ( 3 " ) . reasoning f o l l o w i n g
Θ Q e ΘQ. Note t h a t the
( 7 ) shows t h a t e q u a l i t y holds i n ( 7 ) and hence i n ( 6 ) .
From t h i s i t follows t h a t / ( n ( x ) - Φ ( x ) ) e θ " x v ( d x ) = 0
a . e . HQ + H χ
t h i s i n t e g r a l is non-negative on θ * and non-positive on Θ Q .
since
( 4 ) then implies
η ( x ) = Φ(x) a . e . (v) by completeness and hence φ is admissible. the
ΘJ e Θy
This completes
proof of a l l assertions i n the middle paragraph of the theorem. If
v is dominated by Lebesgue measure and also s a t i s f i e s the
remaining assumptions of the l a s t paragraph of the theorem then λH (x) v{x:
xήxΓ H
= 1}
=
°
o 1
so that any test, φ, of the form 7.18(1) is also of the form (2), and (3 ) (and (3")) is satisfied, and H- is concentrated on Θ- <= 0* and S satisfies assumption 7.13(2) of Lemma 7.13. It follows that φ is admissible.
||
COMPLETE CLASS THEOREMS FOR TESTS (Contiguous Hypotheses) 7.20 Definitions: It is necessary to characterize the local structure of Θ, near ΘQ. Let Θ Q = {θ Q } and Θj be given and let
TAIL PROBABILITIES (1)
233
J(ε) = { J : J is a f i n i t e non-negative measure on {θ: θ ε Θ r ||θ-Θ o || < ε } , J J(dθ) < 1 , f
/
< »,
^ τ J ( d θ ) | | < 1}
I|Θ-Θ O || 2
Then let (2)
Δ(ε) = {(v,M): V = /
^J(dθ),
IIΘ-Θ 0 II (Θ-Θ M
=j
2
)(Θ-Θ )*
Q
o
ιiβ-θoιr
j
J(dθ)>
ε
J ( ε ) κ
Also, let Δ = Γ\ Δ(ε). Note that v ε R and M is a positive semidefinite ε>0 k x k matrix, and Δ and Δ(ε) are compact, convex sets. In various typical statistical problems it is not hard to explicitly describe Δ. For example, if
Θ Q = 0 and Θ = θ Q U "Θ, is a closed conical
set then Δ is the convex hull of points of the form (v,0): v ε Θ, ||v|| <_ 1 , and (0,M): M = vv 1 3 v ε Θ, -v ε Θ, ||v|| < 1 . (See Exercise 7.20.1.) As another example, suppose
Θ is a twice different-
iable curved exponential family at Θ Q . This means that there are two orthogonal vectors u ., ιu ε R , with ||ui|| = 1 such that for θ ε Θ (4)
θ-θ Q = ((θ-θ o )-u 1 )u 1 + ||θ-θo||2u2 + o(||θ-θ 0 || 2 ).
(Note in (4) that
KΘ-ΘQJ U ^
2
= ||θ-θo|| + o( ||θ-θ o || ), and also that u 2 = 0
is a possible value of u 2 ) Then Δ is the convex hull of (u-j,O), (-u-j,O) and
(u^sUnU-j). (See Exercise 7.20.2.) As with earlier results the full complete class characterization is not
directly as a generalized Bayes test but involves an extension of this notion. As part of this extension the kernel
θ
x
e ' is replaced by
(5)    ω(θ, x) = (e^{θ·x} - 1 - θ·x)/||θ||² .
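A brief check of the behavior of this kernel near θ = 0 (an editorial note): along θ = t·u with ||u|| = 1, ω(θ, x) → (u·x)²/2 as t → 0, which is where the quadratic term (x - x₀)'M(x - x₀)/2 of Theorem 7.21 comes from (compare the expansion used in its proof). A numerical sketch:

```python
# omega(theta, x) = (exp(theta.x) - 1 - theta.x) / ||theta||^2
# As theta = t*u with t -> 0, omega -> (u.x)**2 / 2.
import numpy as np

def omega(theta, x):
    t = np.dot(theta, x)
    return (np.exp(t) - 1 - t) / np.dot(theta, theta)

u = np.array([0.6, 0.8])          # a unit vector
x = np.array([1.0, -2.0])
for t in [1.0, 0.1, 0.01, 0.001]:
    print(t, omega(t * u, x), (np.dot(u, x) ** 2) / 2)
```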
A converse result which sometimes yields a characterization of the minimal complete class is given in Theorem 7.22. As with earlier results both of the following theorems can be profitably extended beyond the exponential family context in which they are proved below. See Marden and Perlman (1980), Marden (1981, 1982b), Cohen and Marden (1985), Brown and Sackrowitz (1984, Theorem 6.1), and Brown and Marden (in preparation). 7.21
Theorem Let Θ Q = {θ Q } be a simple null hypothesis. Let φ 1 be an admissible
test of Θ Q versus 0^. Then there exists an equivalent test φ and a closed convex set S satisfying 7.17 (2) such that (1)
Φ(x) = 1
x Φ S.
Further, for every x Q € S° there is a finite non-negative measure H on ^ - {θ Q } θ (x-x n ) with S° c {x: e H(dθ) < «>}, a constant C ε R, an M ε Δ 2 , and a v ε R k satisfying (3), below, with at least one of C, H, v, M being non-zero, such that for all x ε S° 1
< if C
0
/ ω(θ,x-x Q )e
θ xn
°H(dθ)+v (x-x 0 )
+ (x-xo)'M(x-xo)/2
>
If θn ^ ψ then Φ = Φ 1 a.e. (v). Define θ-θ~ ||θ-θo||>ε Then there is a sequence
(3)
ε i -> 0
such that
»" ~ 0 " 1imvε
= vQ
(say) exists, and
(v Q ,M) ε Δ. (Note that if
written as
J||θ||
H(dθ) < «
the extreme right side of (2) can be re-
TAIL PROBABILITIES 0")
/5
235
θ (x-x 0 ) 'Lsl H(dθ) + v o (x-xo)'M(x-xo)/Z .
In particular, lim vε = vυ n ε->0
exists.)
Proof. The assertion just after (2) follows from completeness, as in Theorem 7.17. Now, suppose Φ' is admissible. Then by Theorem 4A.10 there is an equivalent <J> and a sequence of prior distributions G. concentrated on finite subsets of Θ such that the Bayes procedures, ψ. = Φ G , converge to Φ in the topology of 4A.2. (See the proof of Theorem 7.18 for further remarks.) Without loss of generality let θ Q = 0 and
Thus Φ J ( X ) = ί 0 } according to whether (4)
e θ ' x G!(dθ)
/ Θ
>
1 .
l
As in 7.17, it is possible to reduce {G!} to a subsequence (if necessary) such that now for some closed S satisfying,7.17 (2), lim Je θ ' x G!(dθ) = « 0 <_ lim Je θ ' x G!(dθ) = q(x) < » where G! -> G1
x i S x e S° ,
and q(χ) = Je θ ' x G' (dθ). Clearly, (1) is satisfied.
Assume without loss of generality that x Q = 0 € S°. Rewrite (4) as (5)
/ ω(θ,x)||θ||2G!(dθ) + J (θ χ)G!(dθ) > cϊ . Θ Θ l l
Let d. = J||θ||2G!(dθ)+||/θG!(dθ)|| + |c!| and H.(dθ) = dT11| θ||2G!(dθ). Substituting in (5) and multiplying through by dT
yields
236 (6)
STATISTICAL EXPONENTIAL FAMILIES /
ω(θ,x)H 1 (dθ) + / θ
e
- ^ 2 ||θ||
H.(dθ) >
C 1
where jH^dθ) + || /(θ/||θ||2)Hi(dθ)|| + Ic^ = 1. Reduce { H ^ to a subsequence (if necessary) so that /(θ/1|θ||2)Hi(dθ) •*• v, c i + C.
(7)
H i -• H 1 since G\ ->• G 1 . Furthermore /H'(dθ) + | |v| | + C = 1 since x Q = 0 € S° Let H = H',- o } . Let ε > 0 such that H({θ:||θ|| =e}) = 0. (AΠ but a countable set of e's satisfy this.) For each x ε S (8)
/ ||θ||>ε
ω(θ,x)H.(dθ) + / ω(θ,x)H(dθ), 1 ||θ||>ε
1/
(ω(θ,x) . 2 L L § Θ ^ ) H (dθ)| = 0(ε)
and (9)
iiθiiiε
2iiβir
Ί
since (e* - 1 - t - t 2 /2)/t 2 = 0(t) and jH^dθ) <. 1.
Another subsequence may
now be taken, if necessary, so that the following limits exist: (10)
v = 11m f ε
(11)
— 5 - y H.(dθ) = v - /
1— ιiθiι<e Heir
1
- ^ - T H(dθ)
ιiβiι>ε
iieir
M = lim / - ^ H.(dθ). ε
i - llθll 2
Ί
By definition, (v ,M ) ε Δ(ε). Δ(e) is compact in the obvious topology. Hence there is a subsequence ε. + 0 so that (v ,M£ ) -> (v Q ,M) ε Δ. If j
J
necessary another subsequence of {H.} may be extracted using a diagonalization argument so that (10) and (11) hold for each ε . It follows from (5), J (7), (8), (9), and (11) that for x ε S° 1 < ΦΊ (x) + if C Jω(θ,x)H(dθ) + v x + x'Mx/2 .
TAIL PROBABILITIES
237
Note that Tr M = H'({0}). Hence Tr M + H(Φ" J) + | |v| | + C = 1 so that at least one of M, H, ||v||, C are non-zero. It follows from (10) that (3) is satisfied. Since Φ Ί -* ψ in the topology of 4A.2 this yields (2).
||
7.22 Theorem Consider the testing problem described in Theorem 7.21. Suppose θ Q ε W° and φ satisfies 7.21(1), (2), and (3) where "S satisfies all the assumptions of Lemma 7.13 and H is concentrated on Θ*, as defined in 7.19(1). Suppose ψ(x) is also given by 7.21(2) for x e S - S°. Then any critical function as good as ψ must also satisfy 7.21(1) for x i S" and 7.21(2) for x ε S (a.e. (v)) with the same values of S, H, v, M, C, x Q . If also either (1)
v{x: ω(θ,x-xQ)H(dθ) + v'-(x-x Q ) + (x-xQ)'M(x-xo)/2 = C, φ(x) < 1} = 0
or (I 1 ) v{x: ω(θ,x-xo)H(dθ) + v .(x-xQ) + (x-xQ)'M(x-xQ)/2 = C, φ(x) > 0} = 0 or (2)
(Supp H)° f φ
then φ is admissible; and if η is as good as φ then η = Φ a.e. (v). If v is dominated by Lebesgue measure, U ( Θ , Θ 2 ) = U * ( Θ , Θ 2 ) for some θ 2 ε R k , and e^ = Θ* then the collection of tests of the form 7.21(1), (2) is a minimal complete class. Proof. Much of the proof resembles that of Theorem 7.19 (as does much of the statement of the theorem). Assume with no loss of generality that θ Q = 0 and x n = 0. Let ε. + 0 and v U
j
be as in 7.21(3) and let J. ε J(ε.)» be measures ε
J
supported on finite subsets such that vε (3)
j /
- /
eθ
'
θ 2
11 Θ 11 ?
c
JΛdθ) - 0 J
J.(dθ) -> M.
J
238
L β t
STATISTICAL EXPONENTIAL FAMILIES
H
li=H|{θ:||θ||>εi>+
is b e t t e r than
(4)
For each
(5)
O
ψ
4
J
then
η
J
i'HOi({O}) satisfies
( n ( x ) - φ ( x ) ) Je θ
χ
=
C
+
/llθll'2ji
7.21 ( 1 ) and
(Hli(dθ)-Hoi(dθ))v(dx),
x ε S Jeθ'x(H,.(dθ) - HQ.(dθ)) = /
ω(θ.χ)H(dθ) + v x + x ' M x / 2 - C
v
J
^
)x
-x'Mx/2).
Lemma 2.1 implies that the dominated convergence theorem can be invoked in (4), (5) as i -> 00 (6)
since 0 ε hi6 and ω(θ x) = 0 ( e θ ' x + l ) .
Hence
0 4 J (η(x)-φ(x))(/ω(θ x)H(dθ)+v x+x'Mx/2-C)v(dx).
It follows that η satisfies 7.21 (2). The remaining assertions of the theorem are proved just as the analogous assertions in Theorem 7.19.
||
TAIL PROBABILITIES
239
EXERCISES 7.2.1 Prove proposition v({||x|| > α}) = 0 ( e " ε α )
7.2.
[ I f v is a f i n i t e measure and
then E v ( e
ε Ί | x | !
)
< » for a l l 0 < ε1 < ε . ]
7.4.1 (i)
L e t S be a convex s e t w i t h p = i n f { | | x | | : x (. S } .
some ε > 0 , c < °°
Suppose f o r
7 . 4 ( 1 ) holds i . e .
(1)
P ft ( { X / α (. S } ) < θ
c exp(-εpα)
V
α € R .
o
Show that {θ: | |θ - Θ Q || < ε 1 }
N° for all ε 1 < ε.
(ii) Give an example of
a nonconvex set with v{x: ||x|| < p, x (. S} > 0 and in which (1) holds but {θ: I|θ - ΘQ|I < ε 1 } φ N° for any ε 1 < ε. 7.5.1 Let Θ Q e W° and H + = H + (v, α ) . Show lim (n" 1 log P θ (X n € H + )) <_ -K(H + , ξ(θ Q )) .
(1)
[Use Theorem 7.5 and Proposition 5.15.] 7.5.2 In 7.5.1 suppose H + n K° ϊ φ. Show (1)
lim (n" 1 log Pft (X n € H + )) = -K(H + , ζ(θ n )) .
[For one d i r e c t i o n use 7 . 5 . 1 ( 1 ) .
For the other l e t P l n ^ denote the d i s t r i b u -
t i o n of S n under θ = θ ( ξ H + ( θ ) ) . ] (2)
P Q (X € H o )
>
e x p [ - n ( K ( θ , ΘQ) + ε ) ] P ^ n ) ( { S :
•> e x p [ - n ( K ( θ , θ 0 ) + ε ) ] by the Central L i m i t Theorem (Exercise
5.15.1).]
| ( θ Q - θ)
\\ < e } )
240
STATISTICAL EXPONENTIAL FAMILIES
7.5.3 Let Θ Q = θ(ξ Q ) € W°. Let Q be a closed subset of R k . Show lim (n" 1 log P 0 (*n € Q)) < -K(Q, ξ Q ) . [Let ε > 0.
Show Q c
Σ H + ( v . , a . ) where K ( H + ( v , α . ) ) > K ( Q , ξ n ) - ε .
.=1
i
i
When k > 2 this requires some care.)
i
Apply
i
-
0
7.5.2.]
7.5.4 Let ΘQ = θ ( ξ Q ) .
Let Q c Rk be a set such that
K(Q°, ξ 0 ) = K(Q, ξ Q ) = k (say). (1)
Then
lim n" 1 log P fl (X n € Q)
=
-k
.
o
[Reason as in 7 . 5 . 2 and use 7 . 5 . 3 . ] 7.5.5 Let X , , . . . be i . i . d . random variables on R with d i s t r i b u t i o n F. Let h: X -+ Rk be measurable and Q c R k .
Let ζ(Q) = i n f { ζ p ( x ) : x € Q}
where ξp(x) denotes the entropy as defined in 6 . 1 6 ( 1 ) . and E(exp ( ε | |X| I ) ) < «> for some ε > 0. lim n" 1 log P(X
Suppose ξ(Q°) = ξ(Q)
Then e Q)
=
E(Q)
.
7.5.6 ( i ) Show that K( , ξ Q ) is r e l a t i v e l y continuous on {x: K(x, ζ Q ) < «} i f v(K - K°) = 0 , i f k = 1 , or i f v is concentrated on a countable number of points satisfying Assumptions in Theorem 6.23. K(Q, ξ Q ) = K(Q, ξ Q ) as required in 7.5.4.
I f so, then for Q an open set
( i i ) Given an example where Q is open
and K(Q, ξ Q ) t K(Q, ξ Q ) . [Let v be Lebesgue measure on the f i r s t quadrant of 2 R plus a unit mass at the o r i g i n . ] 7.7.1
Hwang (1983) raises the following question: Let X - N(θ, I ) , k k k θ e R . Does there exist an estimator 6: R -> R for which
TAIL PROBABILITIES (1)
P θ (||δ(X) - θ|| < B )
> PΘ(||X - θ|| < B)
with s t r i c t inequality for some B, θ? dominate" δ Q (x) = x.
241 v
B>0,
,
( I f so, δ would be said to "stochastically
Note that for fixed B > 0 there exists an estimator 6
dominating δQ in the sense of satisfying (1) for a l l θ € Rk. ait.)
θ € Rk
See Hwang (op.
I t can be shown that 6 f δQ exists
and references cited therein.)
satisfying (1) i f and only i f there exists a continuous spherically symmetric function δ f δQ satisfying ( 1 ) . Show that no such function exists. [Suppose I I
I I
ll^vXo/ll
<
I I
x
ιι oM
Let θ p = ρx Q and B
k f o r
s o m e
x
o
€
R
(and
= (p - 1 ) | | X Q | | .
hence
f o r
a
neighborhood o f
xQ).
Show t h a t f o r some ε > 0 , s u f f i c i e n t l y
small,
(i)
- ^
pθp(!lχ-θpn >Bp) Pθ
(x e H ( x 0 ,
I | χ o | | 2 ) , δ(X) e H " ( X Q ,
||XQ||
_P eεpPθ ( X € H - ( θP P
V
||χ o || 2 )
Use t h e m u l t i v a r i a t e g e n e r a l i z a t i o n o f 7 . 3 ( 3 ) t o e s t i m a t e t h e denominator on the l e f t o f ( 1 ) ; then use 7 . 7 ( 3 ) f o r t h e a s y m p t o t i c a s s e r t i o n i n ( 1 ) . A s i m i l a r argument, w i t h d i f f e r e n t θ and B , a p p l i e s when | | δ ( x Q ) | | > | | x Q | | f o r some x Q e R . See Brown and Hwang ( i n p r e p a r a t i o n ) . ] 7.9.1 Consider the e s t i m a t i o n problem described i n Exercise 4 . 2 4 . 3 . t h a t the e s t i m t o r 4 . 2 4 . 3 ( 1 ) i s admissible. to
[Use Theorem 7.7 and C o r o l l a r y 7.9
show t h a t i f δ 1 i s b e t t e r than 6 then ό ' ( x ) = 0 ,
x = 1 , and symmetrically f o r x >_ 2.
Show
Among a l l
x < 1 , and ό ' ( x ) <_ ^ ,
such estimators 6 minimizes the
risk at θ = 0 . ]
7.9.2
(A uniform Let
version of C o r o l l a r y 7.9)
Vj c V 2 be subsets o f the u n i t sphere i n R
V 2 r e l a t i v e l y open i n the u n i t sphere.
Let
w i t h V- closed and
242
STATISTICAL EXPONENTIAL FAMILIES α(v) = sup {α: K n H + (v, α) f φ} .
(1) Assume α ( v ) < «
V v € V2
Then
α ( v ) is continuous
f o r v € V2
(ii)
V ε > 0
3
δ > 0
3
(ill)
V ε > 0
3
rQ
(i)
(2)
v
v(H+(v, α(v) - ε ) ) > δ ,
v € Vχ
3
ξ(rv)
>
α(v) - ε
V
v € Vχ,
r > rQ
*
7.9.3 Consider a steep exponential f a m i l y . and l e t K be s t r i c t l y convex. t h a t ζ{Q.) for a l l
-> y .
i > I.
Let y € 3K,
Then, ( i ) 3 I < «,
Let K c { x :
< ε,
y < -2ε,
v
ε > 0,
v € V^
v(H+(v, e ) )
(1)
> δ
0 e K,
Let θ. € N ° , 1 = 1 , . . . f such 1 θ. δ > 0 such t h a t v ( H + ( i i Λ ι , ε ) ) > δ
Hence, ( i i ) ψ ( θ . ) >_ ε| | θ | | + I n δ f o r a l l
v € V2;
£ 0},
y ^ 0.
l i m ψ ( θ . ) = °° . η i-**> [There e x i s t V., \L as i n 7 . 9 . 2 and ε > 0 , α(v)
x
i > I , and
(iii)
δ > 0, satisfying
and
V v t V2
.
p (Draw p i c t u r e s i n R to help see why the above i s t r u e . is important h e r e . )
Now, l l θ . l l -•»
.
The s t r i c t convexity θ. Hence, ,, 1 ., f. VΊ for i
(Why?)
s u f f i c i e n t l y l a r g e , by 7 . 9 . 2 ( 2 ) . ] 7.9.4 Consider a steep exponential f a m i l y . closed i n N and assume K i s s t r i c t l y convex. x ί
(ζ(Θ n W°))~.
Show t h a t θ ( x ) t .
Let Θ c hi be r e l a t i v e l y
Suppose x € dK but
(This r e s u l t complements Theorem 5 . 7 .
I b e l i e v e i t should be possible to prove i t by showing the above hypotheses imply t h a t 5 . 7 ( 1 ) i s s a t i s f i e d .
However, the h i n t below i n d i c a t e s a d i f f e r e n t
argument. [Assume x = 0 € K c {x: x χ <. 0} lim ψ(θ) = °° . llθll-χ»,θ€Θ
(w.l.o.g.).
Apply 7 . 9 . 3 to show
Now proceed as i n the proof of Theorem 5 . 7 , f o l l o w i n g
TAIL PROBABILITIES
243
5.7(2).]
7.9.5 Consider a standard exponential
family with natural
parameter
space
k + Let v € R and α Q = sup { α : v(H ( v , α ) ) > 0 } . Let θ^ = ρ Ί v + r^. as i n
N.
Corollary 7 . 9 . Then (1)
11m v
Vψ(θ.)
=
αΩ
.
Hence, there exist a c > -°° such that (2)
ψ(θ.) >. -c + αp. ,
and, consequently, (3)
p θ (x) -* 0
V x e H"(v, α Q )
[The key a s s e r t i o n , ( 1 ) , is a uniform version of Theorem 3 . 9 , since for η. Ξ η i t follows immediately from t h a t theorem.
However, i t seems
easier to prove ( 1 ) as a consequence o f Corollary 7 . 9 . ( A l t e r n a t i v e l y , one may also derive the above, as well as 7 . 9 , through an a p p l i c a t i o n of convex d u a l i t y , since K° = R, e t c . ) ] 7.11.1 In the s i t u a t i o n i n Corollary
7 . 1 1 l e t p ( θ , ) = Pfl ( S 9 ) / ( P f l ( S j ) .
C o n s t r u c t examples ( i ) i n which p ( θ . ) ~ | | θ . | |" α ,
α > 0 ; ( i i ) i n which
p f θ j ) •* 0 b u t I | θ i I | α ρ ( θ Ί . ) + oo f o r a l l α > 0 ; and ( i i i ) i n which p ( θ Ί ) = 0 ( 1 |Θ 1 | Γ
α
) f o r a l l α > 0 but e '
α ll θ
i ' l p ( θ 1 ) •> «,
for a l l α >0.
[ ( 1 ) L e t k = 1 , v ( { 0 } ) = 1 and v ( d x ) = x 0 1 " 1 dx on x > 0 . ]
7.12.1 Consider a t e s t i n g problem, as i n 7.12 with ΘQ = H(v, α) n A/, O
= M_ 0
anc
j 0
z^ 1 ) e H(v, α ) ,
n Wβ ^ φ.
z ^
For z € R k , l e t z = z ^
+ z" ' where
= pv l H(v, α ) . Assume ( w . l . o . g . ) v ( R k ) = 1 . Show
244
STATISTICAL EXPONENTIAL FAMILIES (i) If φ 1 is better than φ then
(1)
φ(x)v(dx|x (1) = y) = /
/
Φ'(x)v(dx|x (1) = y)
y € H(v, α) a.e.(v) and / x ( 2 ) φ(x)v(dx|x ( 1 ) = y) = / x ( 2 ) φ'(x)v(dx|x (1) = y)
(2)
y € H(v, α ) , a.e.(v) (ii)
Show that φ is admissible i f and only i f for some measurable func-
tions CΊ , γ . ,
(3)
i = l,2,
ψ(x)
=
1
if
x(1)
> C2(x(2))
γ2(x(2))
if
x(1)
= C2(x(2))
0
if
Cλ[x{2))
•«
x
-
if
xu;
< C^x^M
/
Ύi \X
1
< x(1) I,.. vX
< C2(x(2))
/
.
[This is a continuation of 2.12.1 and 2.21.2.] (Matthes and Truax (1967).) 7.12.2 Prove that i f φ is an admissible t e s t and Q c X with v(Q) > 0 then φ must also be admissible f o r the same problem with dominating measure V . Q . 7.12.3
Let Xχ = X and X2 = S2 + X2 be the canonical statistics for the two2 parameter exponential family generated by a N(μ, σ ) random sample. (See Example 1.2.) μ
0xl " X2^2
=
Consider Figure 7.12.3. ^
SUC Ί
'
^
a t
V
(R)
=
V
(S).
( i ) Show that this is possible,
Draw the broken line parallel to (v is defined in Example 1.2.) ( i i ) Let Φ1 be the c r i t i c a l function for
the test with acceptance region Q1 + R - S, and let φQ be the c r i t i c a l function for the usual one-sided t - t e s t , which has acceptance region Q1 = {x, < 0 or
TAIL PROBABILITIES x
? -
cx
245
i}
(μ,σ 2 ) ( φ l } (μ,σ 2 ) ( φ l }
(μ,σ 2 )( φ O }
Hence Φ1 is a better test than φQ of (2)
HQ:
μ
[E(Φ 1 - Φ o ) = E ( χ s - χ R ) . (1984).
<
0
versus
μ
Now use Corollary 2 . 2 3 . ]
μQ
^
(See Brown and Sackrowitz
See also Exercise 7 . 1 4 . 6 . )
A Figure 7.12.3: Diagram for Exercise 7.12.3 7.13.1 Here is an example which shows that something more than 7.13(1) p
is needed for validity of the conclusion of Lemma 7.13. Let X € R be bivariate N(θ, I). Consider the problem of testing Θ Q = {0} versus = {θ: θ1 > 0, θ 2 = -θj}. Let S = {x e R 2 :
1 0}.
(i) Show that U = φ but U* = (0, -1). (ii) Verify that S satisfies 7.13(1) but not the remaining hypotheses of Lemma 7.13. (iii) Let φ.ίx) = 1 if x ί S, = 0 otherwise. Show the conclusion of Lemma 7.13 does not apply to φ,. [Let Φp(x) = 1 if x-i ^
< ε or
246
STATISTICAL EXPONENTIAL FAMILIES
x. < 0, x« < -ε. Show for ε > 0 sufficiently small φp dominates φ,.] 7.13.2 The additional assumptions of 7.13 are stronger than necessary. Let X ~ N(θ, I ) , Θo = {0}, S be as in 7.13.1. θ 1 = {(μ, y 4 ) : μ > 0}.
But now let
Note that S satisfies 7.13(1) but does not satisfy
either of the other two assumptions of Lemma 7.13. as φ then φ'(x) = 1 for all x (. S.
Show that i f φ' is as good
Conclude that φ is admissible. [Show
directly that i f Q is an open set in Sc then P (u u * ) i Q ) lim JU!iLJ 50 μ-* P/(μ>μ M )( S )
= oo
.]
7.14.1 A t e s t φ is said to have a nearly convex acceptance region there is a closed convex set A such that φ(x) = 0 , for x (. A.
if
x e A° and φ(x) = 1
(Thus, i f v is dominated by Lebesgue measure any test with nearly
convex acceptance region is equivalent to one with a (closed) convex acceptance region.
See the Remark following Corollary 4.17.)
ΘQ = {ΘQ} is simple in the s e t t i n g of 7.12.
Suppose
Show t h a t any Bayes t e s t has
nearly convex acceptance region. 7.14.2
Let φ. be a sequence of critical functions with nearly convex acceptance regions. Suppose φ. -* φ weak* on L TO . (See 4A.2(1) for the definition of weak* convergence.) Then φ has a nearly convex acceptance region.
[Assume v(R ) < °°. To each φ. there corresponds an A.. Let {u.}
be a countable dense subset of {u:
Mull = 1}. Choose a subsequence {i'}
such that α Δ (u.) converges for each u , say, α Λ ", i
J
A = nFΓ(u., α , ) . Then φ(x) = 0 , J J j
J
M. i
(u. ) -*α . Let J
J
x e A° and =1 for x ft A.]
7.14.3 Suppose ΘQ = { Θ Q } is simple in the setting of 7.12.
TAIL PROBABILITIES (i)
247
Show t h a t the tests with nearly convex acceptance regions form a
complete class. (ii)
Suppose, a l s o , Θ1 = R - { θ Q } and v is dominated by Lebesgue measure.
Show t h a t the tests with convex acceptance regions form a minimal complete class.
[Use Theorem 4.14, 7 . 1 4 . 1 , 7.14.2, and, f o r ( i i ) , Theorem 7.14.]
7.14.4 Suppose the support of v i s a f i n i t e s e t , X. R .
(i)
Let 0 Q = {Θ Q } € hi =
Prove that φ is admissible i f and only i f there i s a closed
convex set A such t h a t φ(x) = 1 i f X (. A, f o r some face F of A.
(ii)
= 0 i f x € A0 or i f x € r . i . F
Can you formulate an analogous complete class
statement v a l i d when X i s countable and the assumptions of Theorem 6.23 are satisfied?
[ ( i ) Use Theorem 7.14, Corollary 7.10, and 7.12.2.
( i i ) Be
c a r e f u l ; the characterization i n ( i ) i s not v a l i d here, even when X = { 0 , 1 , . . . } k , and so w i l l need to be modified.] 7.14.5 Consider a 2χ2 contingency table.
(See Exercise 1.8.1.)
Two
common tests for independence of row and column effects are the likelihood 2 ratio test and the χ test, based on the values of
=
(i) (ii) (iii)
N Σ
Use Theorem 7.14 to show that the χ
test is admissible,
Is the likelihood ratio test also admissible via Theorem 7.14? Use 7.12.1 to prove both tests are admissible.
7.14.6 Show that the test with critical function φ- in Exercise 7.12.3 is admissible.
248
STATISTICAL EXPONENTIAL FAMILIES
7.16.1 Let X € Rk be N(θ, I ) . Suppose ΘQ = 0 and Θj = { θ : |θ i I > c φ
l
( x )
1
i=l,...,k}.
Consider level α tests of the form
χ
= - {t:|til
Note that φ, i s admissible.
k}
( x )
a n d
(x
=
*2 >
l
X
" {||t|Ka2}
( x )
Adjust k, c, α to provide an example where Φ2
dominates φ Ί except where π
i s extremely small.
7.16.2 Consider t h e u n i v a r i a t e l i n e a r model, as i n 7 . 1 5 . Show t h a t t h e [ L e t η € R s . L e t σ = 1/(1 + l l η l l ) and 2
usual F t e s t , 7 . 1 5 ( 1 ) , i s Bayes. μΊ = r ^ / U + l l η l l 2 ) , i=l,...,r.
i = r+l,...,s.
2
Under θ 1 a l s o l e t μ . = r)./{l
+ ||η||2),
Under ΘQ ( r e s p . Θ,) l e t η have d e n s i t y p r o p o r t i o n a l t o
(1 + U n l l 2 Γ p / 2 e x p ((
2
2 ( 1 + I In 11 ) 2 (resp.,
2
(1 + M n l l Γ
p / 2
exp( Σ ?-) r+1 2 ( 1+ l l η l Γ )
).]
(Kiefer and Schwartz (1965).) 7.16.3 Verify when r = 2 that the F test has the local optimality property described in 7.16(1).
(This is called D-optimality.)
2 2-π(0, σ ) Φ μ?
[Write
)dy ) 8μ
μ=0
and use a general form o f t h e Neyman-Pearson Lemma o r Theorem 2 . 2 1 . ] 7.16.4 Let X-,...,X. be independent gamma variables with known indices α , , . . . , α h and unknown scale parameters σ , , . . . , σ . .
Consider the problem o f
t e s t i n g the n u l l hypothesis H Q : σ . = . . . = σ. . ( I n the special case where 2 the X, /CL are χ variables r e s u l t i n g from a normal sample then t h i s i s the problem of t e s t i n g homogeneity
of variance.
(In t h i s notation the variances
TAIL PROBABILITIES
249
are α,,... ,σ. .)) Show (i) The likelihood ratio test for this problem has acceptance region
α
(Σχ.) o S = {x: ^
(1)
<_ C} ,
where
αn
=
π x^ (ii)
k Σ α, Ίι
When these distributions are written as a canonical exponential
family the null hypothesis is linear in both parameter space and expectation space.
Nevertheless, for k _> 3, the acceptance region for the likelihood
ratio test is not convex.
(Hence there is no hope of proving i t s admissibi-
l i t y via Theorem 7.14.) [(ii)
Consider k = 3 and α. Ξ α.
Consider points of the form
x = (z, z, 1) on the boundary of the acceptance region S.
Let
πx. f(x) =
^ - T - C so that f(x) = 0 for x € as. Show that for z sufficiently (Σx-jΓ large (Vf(x z ))' (D 2 f(x z ))(vf(x z )) < 0.] (iii)
The likelihood ratio test is unique Bayes, hence admissible.
H1 l e t θ. = 1/σ density |η.| Ί"
Under
= (1 + η.) where η. e R are independent variables with (1 + η?)~ α i .
has density | η p α ° " ^ ( l + η 2 ) " α ° .
Under HQ, ΘΊ = 1/σ. Ξ (1 + η 2 ) where η e R (This result is another one of many
contained in Kiefer and Schwartz (1965).) Note: admissible.
I t is not always true that a likelihood ratio test is
For an interesting counter-example see Lehmann (1959, p.338)
or Kiefer and Schwartz (1965, p.767). 7.17.1 2 Let x G R be b i v a r i a t e normal,
N(θ, I ) .
of t e s t i n g ΘQ = {0} versus Θχ = { θ : θ ^ :> 0 ,
Consider the problem
Show t h a t the 2 non-randomized level α = .05 t e s t with acceptance region {x: ||x|| <_ 5.991] is inadmissible.
| |θ| | >_ 1} .
(Can you also f i n d a better test?)
(Compare t h i s r e s u l t
250
STATISTICAL EXPONENTIAL FAMILIES
with 7.22.2 in which this test is admissible.) 7.17.2 Exercise 2.10.1 indicates a n o n t r i v i a l testing problem where ΘQ and Θ1 are contiguous and a l l tests are admissible. the
Here is an example of
same phenomenon in which the null and a l t e r n a t i v e hypotheses are sepa-
rated:
Let 1 £ m < k and l e t X = {x e R : x i = 0
or
1,
i = l , . . . , k , ΣxΊ = m}.
Let v be counting measure on X, with ί p θ ) the exponential family generated by v.
Θχ = {θ: ||θ|| 2 >_ 1}.
Let ΘQ = { 0 } ,
d e f i n i t i o n s of Θ, w i l l also s u f f i c e . ) test.
(Other more r e s t r i c t i v e
Let φ be any (possibly randomized)
Then φ is admissible. [It
is possible to use Lemma 7.13 f o r t h i s , but here i s an
easier argument. {qΓ:
The aggregate family generated by { p θ l contains k 1 where q Γ ( ) = χ Γ ( * ) and also q Γ (•) = ( ) where
ξ € X}
ξ Q = ξ(0) (jj ) l . for
I f Φ is inadmissible there exists a test Φ' better than φ Then (by c o n t i n u i t y ) φ1 must be as good as φ for
testing ΘQ versus Θ-.
testing q Γ
and (m V
1
versus { q r : ξ € X}
Σ φ'(x) < ( V
x€X
1
.
This implies φ'(x) >_ φ(x),
x € X,
Σ φ(x).]
m
xex
7.18.1 Let X,, Xp be independent gamma variables Γ ( α . , λ . ) , variables with α , , α 2 known.
i=l,2,
Consider the problem of t e s t i n g H Q : λ
versus the a l t e r n a t i v e H-:
max |1 - λ . | > ε for some given ε > 0. Ί i=l,2 any " i n t e r s e c t i o n " test with acceptance region —
= λp = 1 Show that
1
(1)
φ(x) = 0
is inadmissible.
iff
a
n
< xΊ < a i 2 ,
(See also 7.21.1.)
i = l , 2,
(0 < a . j < a i 2 < « )
[No admissible t e s t can have an
acceptance region with a sharp corner at (x^, x^) = ( a 1 2 , a 2 2 ) l i k e (1) has. See Example 2.10.]
—
TAIL PROBABILITIES
251
7.19.1 In Theorem 7.19 replace Θ* by (1)
θί* = {θΊ ε Θ", : θ, ε N or there is a set {θ' : j = 1,.. . , J } ^ N and a sequence {ζ }czQ*
with
ζ. -> θ . and
{ζ.}cz conhull ({θU U {θ,})} . [Use 1.13.2.] 7.20.1 Prove the assertion in 7.20(3). [The extreme points of {J:J ε J(ε), JθJ(dθ) = V Q } , V Q ε Θ, are the distributions in this set which are concentrated on a single point; similarly the extreme points of {J:J ε J(ε), JθJ(dθ) = 0, / ||θ||2J(dθ) = α} are two-point distributions. The extreme points of ϊ(ε) are thus points
(v, M) satisfying 7.19(2) with J
either a one- or two-point distribution, as above. The extreme points of Δ are (contained in) the set of limits as ε + 0 of these points.] 7.20.2 Prove the assertion following 7.20(4). [Let J be either a one- or twopoint distribution.] 7.20.3 Generalize the assertion following 7.20(4) to apply to the situation where Θ is a twice differentiate manifold at θ Q . [First generalize 7.20(5)!] 7.21.1 In the setting of 7.18.1 consider the problem of testing H Q : λ^ = λ 2 = 1 versus the complementary alternative H-.: λ, f 1 or λ^ t 1. Show that the intersection test 7.18.1(1) is still inadmissible.
252
STATISTICAL EXPONENTIAL FAMILIES
7.21.2 Consider the curved exponential family of Example 3.14 and 5.14. Let ΘQ = {θ Q } and Θ 1 = Θ - Θ Q - TO be specific take θ Q = θ(λ Q ) = (-1,0); i.e., λ Q = 1. One easily constructed test of θ Q |λ-λ o | > c n with c n
is that which rejects when
chosen to give the desired level of significance.
(Such a test can be constructed for any curved exponential family, and has certain asymptotic optimality properties as n -> «>.) Show that for moderately large n and the usual levels of significance this test is inadmissible; although for every n there exists a (possibly very small) level of significance for which the test of this form is admissible. [Use 5.14 and Theorem 7.21. Except for small values of n or large values of c n
the acceptance
region has a convex, but not strictly convex, form. Theorem 7.21 allows only very special admissible acceptance regions which are not strictly convex; and for appropriate values of n, c
the above acceptance region is not of this
special form.] 7.22.1 Let X 1 9 ...,X n
be independent normal variables, Xj ~ N(μ,l+μ ). Con-
sider the problem of testing HQ:μ = 0. Let ΦΊ = 1 if |X"| > 1.96...//n , = 0 otherwise; and π,(μ) = E (Φ-.). Show (i) φ j has level α = .05 and is locally unbiased (i.e., π.j(O) = 0, πη (0) > 0 ) . (Is ψ, also globally unbiased; i.e., π ^ μ ) ^ .05??) (ii) ψ, is inadmissible. [Use 7.20(5) and Theorem 7.21. Note that θ 2 = --^2 = -(2(l+μ 2 ))" 1 4-I/2 2σ H = 0.]
to show ψ ]
cannot satisfy 7.21(2) unless
(iii) Find a locally best locally unbiased level α test; i.e., the test which maximizes π"(μ) subject to π(0) = α, π'(0) = 0. Use Theorem 7.22 to verify this test is admissible. [Admissibility actually follows directly from the fact that this test is the unique locally best locally unbiased level α test, but it may be instructive to note how this test can be written in the form 7.21(2) with H = 0.] Call this test Φ 2
TAIL PROBABILITIES ((iv) Is Φ 2 unbiased??
Is Φ 2
better than ψ^??
253 If not, what is??)
(v) Generalize (i)-(iii) to arbitrary curved exponential families: Show that the locally unbiased test with parallel boundaries for the acceptance region is not locally best among locally unbiased tests unless u 2 = 0 in 7.20(5). State (convenient, frequently satisfied) conditions under which this parallel boundary test is inadmissible. 7.22.2 Let X be bivariate normal with mean θ and covariance
1. Consider the
problem of testing Θ Q = 0 versus Θ-j = {(θ^.θg): θ ^ > 0}. Consider tests 2 (χ)» a >b> c > 0 (These tests are ( ) ' ( ) ^ symmetric in (xpX^).) Show that such a test is admissible if and only if 2 2 a _> b. The same result holds if Θj = {(θ^ ,θ 2 ): θ-jθ2 > 0, θ^+θ 2 4 1}.
of the form ψ(x) = χ
2
:
APPENDIX TO CHAPTER
H. POINTWISE LIMITS OF BAYES PROCEDURES
This appendix contains a proof of Theorem 4.14, which was used to establish the complete class Theorems 4.16 and 4.24, and w i l l be used again in Chapter 7.
As already noted, this theorem has nothing in particular to
do with exponential families, but i t s proof is included here since i t is not readily accessible elsewhere.
We w i l l state and prove i t below in a convenie
form which is more general than that stated in Theorem 4.14. 4A.1
Setting Let ί p Q ( x ) : θ e 0} be any family of probability densities relativ
to
a σ-finite measure v on a measure space X,B.
(1)
p θ (x)
>0
Assume θ €Θ
x€X,
(This assumption is a c t u a l l y used only i n Proposition 4A.11 and Theorem 4A.12 Let
the a c t i o n space, A, be a closed convex subset of Euclidean space.
loss function is L: Θ x A -> [ 0 , <»). function f o r each θ e Θ.
(2)
The
Assume L ( θ , •) i s a lower semi continuous
Assume also t h a t
lim
L(θ, a )
=
°°
,
Θ6Θ
,
lla||-*» (If
A is a bounded set this is t r i v i a l l y satisfied.)
A* = A; i f cation of A. (3)
I f A is bounded l e t
A is unbounded l e t A* = A u {1} denote the one-point compactifiExtend the function L(θ, •) to A* by defining L(θ, I)
= -a
.
A randomized decision procedure on A* i s . a kernel 6( | ) f o r which 254
APPENDIX
δ(t|x)
is a Borel measure on
δ(A| )
is
255
A*,
for
x € X
(4) 8 measurable f o r each measurable s e t A ^ A *
A nonrandomized procedure i s one f o r which δ( p o i n t , δ ( x ) , f o r almost every ( v ) , x € X. kernel δ(
|x) is concentrated on a s i n g l e
Note we use the symbol 6 both f o r
| ) and f o r the r e l a t e d function δ( )
o f a l l randomized decision procedures. to A c A*, and l e t V
.
the
Let V* denote the c o l l e c t i o n
Let V c V* denote those g i v i n g mass 1
c V denote the nonrandomized procedures i n V.
As usual, the r i s k of any procedure is
(5)
R(θ, 6)
=
// L ( θ , a) 6 ( d a | x ) p ( x ) v ( d x ) θ X A*
Note t h a t R(θ, 6) may take the value °°.
(6)
R(θ, 6 ' )
<
R(θ, δ)
V
.
A procedure 6 is admissible
θ € Θ => R(θ, δ 1 )
=
R(θ, δ)
V
if
θ € Θ
.
The proof o f the main r e s u l t o f the appendix is broken down i n t o s i x main p r e l i m i n a r y steps as f o l l o w s : (i) (ii)
V* i s compact i n an appropriate topology; R(θ, •) is lower semi continuous on V*;
(iii)
δ. + δ n with δ. e 0 . I = 0 , 1 , . . . ,
(iv)
the mini max Theorem f o r f i n i t e Θ;
implies δ ; -> δ n i n measure
(v);
(v) (vi)
the closure o f the Bayes procedures i s a complete c l a s s ; and V
is a complete class when L ( θ , •) i s s t r i c t l y convex.
Formal statements of a l l
these r e s u l t s and some c o r o l l a r i e s are given below.
Complete proofs are also given f o r a l l but ( i ) f o r which the reader can consult the references c i t e d below. 4A.2
Definitions We now define the topology on V*.
the Banach space o f v i n t e g r a b l e f u n c t i o n s .
Let lχ
= L^X,
B χ , v) denote
Let C* denote the (Banach) space
256
STATISTICAL EXPONENTIAL FAMILIES
of continuous, real-valued functions on the compact set A*. For every δ € P*, f e L,, c € C* there is a number S6(f, c) = //c(a)δ(da|x) f(x)v(dx)
.
Define the topology on V* according to the convergence criterion δ •*• δ i f (2)
βΛ ( f , c)
-> β Λ ( f , c)
f ε Lv
C o
c ε C*
.
1
α
(This is a "weak" topology. The collection of sets of the following form comprise a basis for this topology: {δ € V : 3χ(f, , c •) - 3χ. (f, > c ) I < ε, δ0 € P*f
f
€ l
v
1 < i < I, 1 < j < J,
c . 6 C*,
i =l 9 . . . , 1 ,
j=l,...,J,
ε > 0 } .)
4A.3
Theorem V* is compact in the topology defined above.
Proof. above.
This theorem appears in Le Cam (1955) in a form similar to the In a somewhat more primitive form the result appears already in Wald
(1950). I t is interesting to note that this theorem is actually a special case of a result in abstract functional analysis. Theorems V.8.6 and IV.6.3
I t follows directly from
(the Riesz representation theorem) of the classic
treatise of Dunford and Schwartz (1966). For a complete, detailed proof see Farrell (1966, Appendix).
||
As has already been noted in the text, Wald's book, and the paper of Le Cam, both cited above, continue from their versions of Theorem 4A.3 and prove results similar to most of those below; but they do not e x p l i c i t l y state a version of Theorem 4A.12 which is our ultimate goal.
See especially Wald
(1950, Sections 3.5 and 3.6) and Le Cam (1955, Theorem 3.4).
APPENDIX 4A.4
257
Proposition The map R ( θ , •): V* -> [ 0 , « ] i s lower semi-continuous.
I n other
words, i f ό + 6 Q then (1)
Ίim i n f R ( θ , δj
Proof.
Let 6 α -> ό Q .
>_ R ( θ , ό Q ) ,
θ eθ
Let θ e 0 and l e t c β ( ) = m i n ( L ( θ , •)> B ) . Then
c β € C* and, f o r any 6 £ V*9 3 6 ( p θ , c β ) t R ( θ , δ) as B t °°. (2)
limα i n f R(θ, δα)
(1) follows directly from (2).
>
Thus,
limα i n f 3fi ( p 0 ,c β )
||
We will apply this proposition in roughly the following form: 4A.5 Corollary Let { Θ 1 , . . . , θ m } c Θ.
L e t Γ f <= Rm be the s e t o f a v a i l a b l e f i n i t e
r i s k points -- i . e . (1)
ff
=
{ r € Rm:
3 6 G P*,
R(θj,ό)=rj,
j=l,...,m}
.
Let rf c Rm be the s e t o f points dominated by r f -- i . e . (2)
Γf
= { r e Rm:
where (as usual) s £ r means s . <. r . ,
3 s € Γf, j=l,...,m.
s <_ r } Then f
f
i s a non-empty,
m
donvex, closed subset o f R . Notation:
I n the current context, when R(θ > ό) < <», j = l , . . . , m , we w r i t e
R ( . , δ) to denote the point r € Rm with r. = R ( θ , , 6 ) , j = l , . . . , m . Proof. Tf t Φ and r
R(θ, a Q ) = L ( θ , a Q ) < «,, Θ € Θ, so Γ f ϊ φ, and consequently Γf i s convex, so also Γ f i s convex. -> r.
Suppose r^ € Γ f , 1 = 1 9 . . . 9
Then there e x i s t δ ; e V* with R( , 6 y ) <_ r ,
1=1
Since V*
258
STATISTICAL EXPONENTIAL FAMILIES
is compact there must exist a subsequence {V} such that 6., is convergent; δ., -*- δ. Then (3)
(r)j = Hm(r^,)j > lim inf R(θj, 6^,)
by Proposition 4A.4. It follows that r € Γ.. This proves that r^ is closed.
11 Here is another useful consequence of Theorem 4A.3 and Proposition
4A.4. 4A.6 Corollary The set of admissible procedures forms a minimal complete class. ?H.ooi.
We give a proof only for the case where Θ = {θ,,...,θ } is finite.
The corollary will be applied in this form in the proof of Theorem 4A.10. (The proof for general Θ is basically similar, but involves some form of Zorn's lemma. See, e.g. Brown (1977).) Let δ Q € V*. To each 6 € V* associate the point r(δ) = r € [0, ~ ]
m
for which r. = R(θ , δ ) , j = l,...,m. Let j
(1)
J
f = {r € [0, oo] m :
r
= r (δ) for some 6 6 £>*} .
(This is the same as 4A.5(1), except for the fact that here r. = » is possible j
so that r(δ) is defined for all δ € V*9 not merely for those δ having finite risk.) Let r 6 r be a minimal point of f which dominates r(δ Q ). (2)
That is,
r <_ r(δ Q ); r <_ r and r ί r => r ί f .
(It is shown in Lemma 4A.8 that r. < °°, j = l,...,m, but that fact is not ~J
essential here.) Such a point, r, can be constructed as the limit of a sequence of points r(ό .) ξ r.
APPENDIX
259
By Theorern 4A.3 the sequence δ. has an accumulation point in V*, say δ. By Proposition 4A.4 r(δ) ± r <_ r(δ Q ).
Since r was minimal it follows
that 6 is admissible. It has thus been shown that any procedure, δ Q , is dominated by an admissible procedure, as asserted by the corollary.
||
4A.7 Proposition Let δ € P , o = 1,..., and suppose δ + δ Q in P* with δ Q € V^. Then δ (•)•»• δn( ) in measure (v). Thus there is a subsequence V for which δ^.ί ) •> θ o ( ) a.e.(v). PΛ.OOI5.
Suppose δ^ -> δ Q in P* but δ α / δ Q in measure (v). If δ Q € P*-P then
there is an a n € A, an e > 0, and a set S with v(S) > 0 such that (1)
|δ Q (x) - a Q | < ε for all
x € S
and (2)
lim sup v({x € S:
(To verify (1) and (2) is a standard
|6 (x) - a Q | > 2ε}) > 0 . but nontrivial exercise in measure theory
which uses the fact that A is a separable metric space. If δ Q € V*
then
(1) may need to be replaced by |θ Q (x)| > 1/ε and, correspondingly, the statement |δ (x)| < l/2ε would need to be substituted in (2). Similar substitutions would α
then need to be made in what follows.) Let c € C* satisfy 0 <_ c( •) ± 1, and 1
|a - a Q | i ε
0
I a - a n | _> 2ε
c(a)
and let f( ) = x s ( ) € L ] . Then (3)
β, (f, c) = v(S),
260
STATISTICAL EXPONENTIAL FAMILIES
but 3 6 (f, c) < v({x: |δ α (x) - a Q | < 2ε}) = v(S) - v({x € S:
|δ(x) - a Q | >. 2ε})
so that (4)
lim inf 3. (f,c) <_ v(S) - lim sup v({x € S: |δ (x) - a π | >. 2ε}) α α α α u < v(S) .
Taken together (3) and (4) contradict the assumption that δ ^ 6 Q in V*. contradiction shows that δ -> δQ in Ό* implies 6 + δQ in measure ( v ) .
This The
second conclusion of the proposition is a standard consequence of this. We now come to the minimax theorem.
||
In preparation for this
theorem we prove a simple lemma. 4A.8
Lemma Let Θ be f i n i t e .
Then the set of procedures having f i n i t e risks
is a complete class of procedures in Ό*.
(In other words, for every δ € V*
t h e r e i s a p r o c e d u r e δ 1 € V* w i t h R ( θ , δ 1 ) <_ R ( θ , 6 ) a n d R ( θ , δ 1 ) < °°, θ € Θ . ) Proof.
L e t a Q € A a n d A . = max { L ( θ , a j : θ e θ } <
B = { a € A: Assumption
m i n { L ( θ , a ) : θ € 0 } <_ A - } .
4A.1(2). Ap
°°.
B i s a bounded s e t b e c a u s e o f
Hence =
max { L ( θ , a ) : θ € Θ ,
a € B } < «
.
Define δ1 to s a t i s f y (1)
δ ({ao}|x)
= δ({aQ}|x) + δ(Bc|x)
δ'(A|x)
δ(A|x)
=
,
{ a Q } t A,
A c B
.
( I n words, δ 1 takes a c t i o n a Q whenever δ takes an a c t i o n o u t s i d e B.)
Then
APPENDIX ό'(B|x) a 1 ; hence R(θ, ό 1 ) £ A 2 < «. (2)
Also, by c o n s t r u c t i o n ,
<. / L ( θ , a ) δ ( d a | x ) + L ( θ , a Q ) δ ( B c | x ) B
/ L(θ, a)ό'(da|x)
± Hence R(θ, ό 1 ) £ R(θ, δ).
261
JL(Θ, a)6(da|x)
.
||
In the language of Corollary 4A.5, used below, the preceding can be interpreted as saying that the set of procedures with risk points in Γ f is a complete class. 4A.9
Theorem Let Θ be f i n i t e .
Let δQ € V* be any procedure for which
R( , δ Q ) € Γ f , and such that (1)
R( , δ 0 ) - ε
for every ε > 0.
j = l , . . . , m such that
(2)
m Σ τr.R(θ., δ n ) J U j=l J
Remark.
J
Γf
Then δQ is Bayes — i . e . there exists a prior G giving
mass π. to θ. e Θ, J
£
<
m i n f * Σ π.R(θ., δ) J 6€0 j = l J
The minimax risk — M = i n f * max {R(θ, δ):
f i n i t e by Lemma 4A.8.
.
θ € 0}
~ must be
(Also, as a consequence of Corollary 4A.5 there must
exist a minimax procedure.)
I f δQ is any minimax procedure then i t must
satisfy (1) and hence must be Bayes.
This does not yet prove that the
resulting prior 6 is least favorable -- i . e . Σ π.R(θ.* δ) >. M for a l l J
δ € Ό*.
Indeed, this need not be the case.
J
To get a least favorable prior
apply the proof of the theorem to the point with coordinates r. Ξ M, ϋ
j=l,...,m. This point need not correspond to any procedure in V*, but it is in f, and the proof of the theorem applies directly to yield {π.} such that M =
m Σ π.M £ i n f * j=l J P
m Σ π.R(θ., δ ) . J
J
This {π.} corresponds to the l e a s t J
262
STATISTICAL EXPONENTIAL FAMLIES
favorable distribution. Proof.
Γ f is a closed convex subset of R m by Corollary 4A.5. Condition (1)
implies that the point r Q = R(θ, δ Q ) lies on the boundary of Γ-. Hence there exists a nonzero vector {α.} which defines a supporting hyperplane to Γ* at r Q - i.e. (3)
Σ αj(r o )j = 1nf {ΣαjΓjΓ
r € Γf} .
Since r Q € Γ f , so also is r Q + ae^ for any unit vector e^, and a_> 0. Thus (3) yields
- J ^ o ^ ' 3 * aι{roh>
(4)
I t follows t h a t α . _> 0 , ^ = l , . . . , m .
(5)
a
^°
Let
\
=
m Σ α
Then m Σ π.(rn).
(6)
=
i n f {Σπ r :
r € Γf}
.
Furthermore, by Lemma 4A.8, f o r every δ € V* there is an r € Γ f such that r. < R ( θ . , δ ) ,
j = l , . . . , m ; so that
J "~ 3 (2), now follows from (6).
m Σ π.r. < Σπ.R(θ., δ ) .
j_2 J J ~
The desired r e s u l t ,
3 3
||
4A.10 Theorem Let BQ denote the set of Bayes procedures for priors concentrated on f i n i t e subsets of Θ.
Then BQ, the closure of BQ in P*, is an essentially
complete class. Proof.
(Note:
the following proof is written in the language of directed
sets, nets, and subnets.
See, e.g. Dunford and Schwartz (1966).
The reader
unfamiliar with these concepts, or the equivalent concept'of f i l t e r s and
APPENDIX
263
u l t r a f i l t e r s can understand the essence of the proof by considering the case where Θ is countable, for then the nets and subnets can be cooverted to ordinary sequences and subsequences.
I f X, B is Euclidean space -- as in the
exponential family situation — i t can be shown by an auxiliary argument that sequences and subsequences also can suffice for the proof, since the topology of V* has a countable basis.)
Let <5Q be any procedure.
Let A denote the collection of a l l f i n i t e subsets of Θ formed into a directed set under the obvious p a r t i a l ordering: Consider a fixed α € A;
α c 0.
α- £ ou i f α- c o u .
Consider the s t a t i s t i c a l problem
with parameter space just the f i n i t e set α.
There must exist a procedure,
call i t 6 , which is admissible in this r e s t r i c t e d problem and is a t least as good as δ Q — i . e . (1)
R(θ, ό α )
Since δ
R(θ, δ 0 )
θ Gα
is admissible in the r e s t r i c t e d problem i t s a t i s f i e s
4A.9(1) there. 6
<
condition
(The existence of 6^ is guaranteed by Corollary 4A.6.)
Hence
is Bayes with respect to a prior G concentrated on the f i n i t e set α c 0 . Let A1 = { α 1 } be a ( c o - f i n a l ) subnet of A and l e t δ e V* be such
that δ , •> δ.
(The existence of A' and δ follows from Theorem 4A.4 by standard
topological arguments.) out in A 1 .
Let ΘQ € 0.
Then α 1 => {Θ Q } for every α 1 f a r enough
Hence R(θ Q ,δ ,) <_ R(θ Q J δ Q ) for any such α 1 and, by Proposition
4A.5, R(ΘQ, δ)
<_ lim i n f R(θ Q , δ α . )
£
R(θ Q , δ Q )
.
Since ΘQ £ θ is a r b i t r a r y , this proves that δ € BQ is as good as δ Q .
Since
δ Q € V* is also a r b i t r a r y this proves BQ is an essentially complete class.
||
So far we have not assumed that L(θ, •) is s t r i c t l y convex, as is the case in the applications in Chapter 4.
We now add this assumption, which
is required for the desired complete class theorem.
264 4A.11
STATISTICAL EXPONENTIAL FAMILIES Proposition Assume
(1)
L(θ, •)
is s t r i c t l y convex on
Let δ € V*9
for each
θ € Θ
Then there is a δ1 € £>n such that
δ (. Όn.
R(θ, δ 1 )
(2)
A
<
R(θ, δ) ,
with s t r i c t inequality for some θ Q € Θ.
θ € Θ
In particular, the procedures in V
are a complete class. Proof.
I f δ € V* but δ £ V then v ( { x : δ ( U } | χ ) } ) > 0 .
θ e Θ, by 4 A . 1 ( 1 ) .
L e t a Q € A and l e t δ 1 be d e f i n e d by <S'(x) = a Q .
R(θ, δ1)
(3)
Hence R ( θ , δ) Ξ « ,
=
L ( θ , a Q ) < oo =
R(Θ,
δ) ,
Then
θ eΘ
Now, suppose 6 e V but 6 f. V . If R(θ, δ) = «> then, again, δ'(x) Ξ a Q satisfies (3). So, assume R(θ o ,δ') < » for some θ Q € Θ. Condition (1) and 4A.1(2) guarantees that for some ε > 0, A Q >^ 0 (4)
L(θ Q , a) > ε||a|| - A Q
(We leave this as an exercise on convex functions.
.
A \/ery similar r e s u l t is
proved in 5.3(3) and ( 5 ) ; see 5 . 3 ( 4 ' ) . ) Hence (5)
co >
R(Θ Q , δ)
=
/ ( / L ( θ , a ) δ ( d a | x ) ) p θ Q ( x ) v(dx)
|| δ ( d a | x ) ) p θ (x)v(dx) - AQ I t follows t h a t
(6)
f\ |a| I δ ( d a | x )
since p A (x) > 0 Define
a . e . ( v ) by 4A.1(1).
< <»
a.e.(v)
.
APPENDIX /aό(da|x)
if
265 /| |a| |ό(da|x) < °°
ό'(x) = {
(6)
aQ
otherwise
Then /L(θ, a)δ(da|x)
1
L(θ,ό'(x))
a.e.(v)
with strict inequality whenever 6( |x) is not concentrated on the point ό'(x). Since 6 (. V^ this occurs with positive probability under v — and hence by 4A.1(1) under P Q . θ o Consequently R(θ, ό 1 )
(7)
with s t r i c t inequality for ΘQ e Θ. whenever R(θ,ό) < °° .)
£
R(θ, 6)
(In fact, there is s t r i c t inequality in (7)
| |
The desired result now follows as an easy consequence. 4A.12
Theorem Assume 4A.11(1).
Then the set of pointwise limits of sequences of
procedures in BQ is a complete class. Proof.
(B
is defined in 4A.10.)
As a consequence of 4A.11(1), Jensen's inequality and 4A.1(1)
eyery procedure in BQ is non-randomized.
Also, there cannot be two non-
equivalent admissible procedures with equal risk functions, for i f δ 1 f 6 2 then (1)
(R(θ, 6λ) + R(θ, 6 2 ) ) / 2
^
R(θ,
( 6 χ + 62)/2)
with s t r i c t inequality whenever the right-hand side is f i n i t e . The theorem now follows as a direct consequence of Corollary 4A.6, Proposition 4A.11, Theorem 4A.10, and Proposition 4A.7. Here's how: Because of Corollary 4A.6 there is a unique minimal complete class.
I t is contained in V by Proposition 4A.11 and in BQ by Theorem 4A.10
and ( 1 ) , above.
I f δQ is in this minimal complete class ( i . e . , i f 6 Q is
266
STATISTICAL EXPONENTIAL
admissible)
Then, by Proposition 4A.7, δ α ( ) •* δ Q ( ) a . e . ( v )
which is the desired condition. 4A.13
€ BQ = V^ such that 6 α + δ Q € V^
there is therefore a net 6
in the topology on V*.
FAMILIES
II
Generalizations (i)
Assumption 4A.1(1) and the s t r i c t convexity assumption 4A.11(1)
are used in the proof of Theorem 4A.12 for only two purposes; namely, to guarantee that (1)
δ € BQ
=> δ € Vn ,
and that (2)
(δy
δ2
admissible;
R(θ, δ χ ) = R(θ, δ 2 ) ,
θ e Θ)
=* δ χ
=
δ2
.
I f (1) and (2) can be established separately, as is the case in some of the applications in Section 7 to the theory of hypotheses t e s t s , then the conclusion of Theorem 4A.12 remains valid without 4A.1(1) and 4A.11(1). (ii)
There is not much hope for something l i k e the conclusion of
Theorem 4A.12 unless (1) and (2) are s a t i s f i e d .
However, a l l the e a r l i e r
results of this appendix, through Theorem 4A.10, remain valid without the assumptions 4A.1(1) and 4A.11(1) (or (1) and ( 2 ) ) . (iii)
The remaining assumption which can be relaxed without major
alterations in the theory is the assumption 4A.1(2) on the loss function.
If
this assumption is replaced by (3)
lim L(θ, a) a-κc
=
sup { L ( θ , a ) :
a e A}
and (4)
sup {L(θ, a ) :
a e A}
<
«
then a l l results through Theorem 4A.10 remain valid with only a simple modification needed in the statement and proof of Lemma 4.8 to establish that the procedures in V are a complete class.
APPENDIX
(iv)
267
I f (3) is valid but not (4) or 4A.1(2), then a peculiar
situation may arise.
The results through Corollary 4A.7 remain valid, but i t is
then possible that there may exist admissible procedures having R(θ, 6) = °° for some θ € Θ.
When this peculiarity occurs the minimax theorem is not valid in
the strong form of Theorem 4A.9.
(There may exist admissible minimax procedures
satisfying 4A.9(1) for which no prior exists satisfying 4A.9(2).) of Theorem 4A.9 i s , however, valid. v
sequence of priors defined by {π £
A
A weaker form
Its conclusion is that there exists a , £=1,...} and corresponding Bayes procedures
ό' ) such that R(θ, δ ' ' ) + R(θ, 6 Q ) , θ £ Θ.
(The most convenient proof I know
of this fact proceeds in a somewhat roundabout fashion using a device found in Wald (1950).) (v)
Brown (1977) contains versions of Theorem 4A.3 and Proposition
4A.4 valid for some situations where i t is useful to compactify A in some fashion other than the one point compactification, A*, used above; or where the loss L depends on the observed x € X, as well as on Θ, A; or where the decision rules are restricted a priori
to l i e in some proper subset of V.
In many of
these situations i t is possible to proceed further and also establish the conclusion of Theorem 4A.10. I t is also possible to derive some satisfactory results in the (unusual) situation where A is not a Borel subset of Euclidean space, nor imbeddable as such a subset.
Such an extension involves intricacies not present
in the preceding treatment of the Euclidean case.
268
STATISTICAL EXPONENTIAL FAMILIES
Exercises 4A.2.1 Suppose A = { a Q , a - } , corresponding to a hypothesis testing problem. Φ( χ )
(a Q = "accept", a. = "reject".)
== Φ δ (x) = δ(ίa,|x})
For any procedure 6 l e t
denote the c r i t i c a l function of the test.
Then,
6η •*• 6 in the topology on V* i f and only i f φ^ •* Φδ in the weak* topology on L 00
(i.e. V
(i) f o r every f e L..
yiΦδ Cx) - Φδ(χ)|f(χ)v(dx)
- o
(See e . g . Lehmann (1959, Section A 4 ) . )
REFERENCES
AMARI, S. (1982). Differential geometry of curved exponential families -curvature and information loss. Ann. Statist. 10, 357-385. ARNOLD, S.F. (1981). The Theory of Linear Models. Wiley: New York. BAHADUR, R.R. and ZABELL, S.L. (1979). Large deviations of the sample mean in general vector spaces. Ann. Prob. 7, 587-621. BAR-LEV, S.K. (1983). A characterization of certain statistics in exponential models whose distribution depends on a sub-vector of parameters only. Ann. Statist. 11, 746-752. BAR-LEV, S.K., and ENIS, P. (1984). Reproducibility and natural exponential families with power variance functions. Preprint. Dept. of Statistics, S.U.N.Y., Buffalo. BAR-LEV, S.K. and REISER, B. (1982). An exponential subfamily which admits UMPU tests based on a single test statistic. Ann. Statist. 10, 979-989. BARNDORFF-NIELSEN, 0. (1978). Information and Exponential Families in Statistical Theory. Wiley: New York. BARNDORFF-NIELSEN, 0. and BLAESILL, P. (1983a). Exponential models with affine dual foliations. Ann. Statist. 11, 753-769. BARNDORFF-NIELSEN, 0., AND BLAESILL, P. (1983b). Reproductive exponential families. Ann. Statist. 11, 770-782. BARNDORFF-NIELSEN, 0. and COX, D.R. (1984). The effect of sampling rules on likelihood statistics. Inter. Statist. Rev. 52. To appear. BARNDORFF-NIELSEN, 0. and COX, D.R. (1979). Edgeworth and saddle point approximation with statistical applications (with discussion). J. Roy. Statist. Soc. B 41, 279-312. 269
270
STATISTICAL EXPONENTIAL FAMILIES
BASU, D. (1955). On statistics independent of a complete sufficient statistic. Sankhya 15, 377-380. BASU, D. (1958). On statistics independent of a complete sufficient statistic,
Sankhya 20, 223-226.
BERAN, R.J. (1979). Exponential models for discrete data. Ann. Statist. 7, 1162-1178. BERGER, J.O. (1982).
Selecting a minimax estimator of a multivariate normal
mean. Ann. Statist. 10, 81-92. BERGER, J.O. (1980a). Statistical Decision Theory:
Foundations, Concepts,
Methods. Springer-Verlag: New York. BERGER, J.O. (1980b).
Improving on inadmissible estimators in continuous
exponential families with applications to simultaneous estimation of Gamma scale parameters. Ann. Statist. 8, 545-571. BERGER, J.O. (1976). Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Ann. Statist. 4, 223-226. BERGER, J.O. (1975). Minimax estimation of location vectors for a wide class of densities. Ann. Statist. 3, 1318-1328. BERGER, J.O. and HAFF, L.R. (1981). A class of minimax estimators of a normal mean vector for arbitrary quadratic loss and unknown covariance matrix. Mimeograph Series, Purdue University. BERGER, J.O. and SRINIVASAN , C. (1978). Generalized Bayes estimators in multivariate problems. Ann. Statist. 6, 783-801. BERK, R.H. (1972). Consistency and asymptotic normality of maximum likelihood estimates for exponential models. Ann. Math. Statist. 43, 193-204. BHATTACHARYA, R.N. and RAO, R.R. (1976). Normal Approximations and Asymptotic Expansions. Wiley: New York. BIRNBAUM, A. (1955). Characterizations of complete classes of tests of some multiparameter hypotheses with applications to likelihood ratio tests. Ann. Math. Statist. 26, 21-36. BISHOP, Y.M., FEINBERG, S.E., and HOLLAND, P.W. (1975). Discrete Multivariate
REFERENCES
271
Analysis. M.I.T. Press, Cambridge. BLAESILD, P. (1978). A generalization of the exponential distribution to convex sets in R k . Scand. Jour. Statist. J5, 189-194. BROWN, L.D. (1986). Information inequalities for the Bayes risk. Statistics Center Preprint, Cornell University. BROWN, L.D. (1981). A complete class theorem for statistical problems with finite sample spaces. /\nn. Statist. ^, 1289-1300. BROWN, L.D. (1980). Examples of Berger's phenomenon in the estimation of independent normal means. Ann. Statist. J3, 572-585. BROWN, L.D. (1979). A heuristic method for determining admissibility of estimators — with applications. Ann. Statist. 7_9 960-994. BROWN, L.D. (1977). Closure theorems for sequential design processes. Statistical Decision Theory and Related Topics II, Academic Press, 57-91. BROWN, L.D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Statist. j42, 855-904. BROWN, L.D. (1968). Inadmissibility of the usual estimators of scale parameters in problems with unknown location and scale parameters. Ann. Math. Statist. ^39, 29-48. BROWN, L.D. (1966). On the admissibility of invariant estimators of one or more location parameters. Ann. Math. Statist. 37, 1087-1136. BROWN, L.D. (1965). Optimal policies for a sequential decision process. J_. Soc. Indust. Appi. ^3, 37-46. BROWN, L.D., COHEN, A., and STRAWDERMAN, W. (1976). A complete class theorem for strict monotone likelihood ratio with applications. Ann. Statist. 4, 712-722. BROWN, L.D. and FARRELL, R. (1984). Complete class theorems for estimation of multivariate Poisson means and related problems. /\nn. Statist., to appear. BROWN, L.D. and FOX, M. (1974a). Admissibility of procedures in twodimensional location parameter problems. Ann. Statist. 2, 248-266.
272
STATISTICAL EXPONENTIAL FAMILIES
BROWN, L.D. and FOX, M. (1974b). Admissibility in statistical problems involving a location or scale parameter. Ann. Statist. J2, 807-814. BROWN, L.D. and HWANG, J.T. (1982). A unified admissibility proof. Stat. Dec. Theory and Related Topics, III. 1, 205-230. Academic Press: New York. BROWN, L.D., JOHNSTONE, I.M., and MACGIBBON, K.B. (1981).
Variation
diminishing transformations: a direct approach to total positivity and its statistical applications. Jour. Amer. Statist. Assoc., 76, 824-832. BROWN, L.D. and RINOTT, Y. (1985). Stochastic order relations for multivariate infinitely divisible distributions. Statistics Center Preprint, Cornell University. BROWN, L.D. and SACKROWITZ, H. (1984). An alternative to Student's t test for problems with indifference zones. Ann. Statist. ]_2, 451-469. CHERNOFF, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23, 493-507. CLEVINSON, M. and ZIDEK, J. (1977). Simultaneous estimation of the mean of independent Poisson laws. Jl. Amer. Statist. Assoc. 70, 698-705. COURANT, R. and HILBERT, D. (1953). Methods of Mathematical Physics (Translated and revised from the German original). Interscience: New York. COX, D.R. (1975). Partial likelihood. Biometrika, 62, 269-276. DARROCH, J.N., LAURITZEN, S.L., and SPEED, T.P. (1980). Markov fields and log linear interaction models for contingency tables. Ann. Statist. S9 522-539. DARMOIS, G. (1935). Sur les lois de probability a estimation exhaustive. Compt. Rend. Acad. Sci., Paris 260, 1265-1266. DE GROOT, M.H. (1970). Optimal Statistical Decisions. McGraw-Hill: New York. DEY, D.K., GHOSH, M., and SRINIVASAN, C. (1983). Simultaneous estimation of parameters under Stein's loss. Preprint.
REFERENCES
273
DIACONIS, P. and YLVISAKER, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7., 269-281. DUNFORD, N., and SCHWARTZ, J.T. (1966). Linear Operators, Part I. Interscience: New York. DYNKIN, E.B. (1951). Necessary and sufficient conditions for a family of probability distributions. Select. Transl. Math. Statist, and Prob. U 23-41. EATON, M.L. (1982). A review of selected topics in multivariate probability inequalities. Ann. Statist. Π), 11-43. EFRON, B. (1978). The geometry of exponential families. Ann. Statist. 69 362-376. EFRON, B. (1975). Defining the curvature of a statistical problem (with applications to second order efficiency); with discussion. Ann. Statist. 3, 1189-1242. EFRON, B. and TRUAX, D. (1968). Large deviations theory in exponential families. Ann. Math. Statist. 39, 1402-1424. EQUELI, S. (1983). Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Statist. _Π, 793-803. ELLIS, R.S. (1984a). Large deviations for a general class of random vectors. Ann. Prob. U, 1-12. ELLIS, R.S. (1984). Entropy, Large Deviations, and Statistical Mechanics Springer-Verlag. FARRELL, R.H. (1968). Towards a theory of generalized Bayes tests. _Ann. Math. Statist. 38, 1-22. FARRELL, R.H. (1966). Weak limits of sequences of Bayes procedures in estimation theory. Proc. Fifth Berk. Symp. Math. Statist. Prob. _1, 83-111. FEIGIN, P.D. (1981). Conditional exponential families and a representation theorem for asymptotic inference. Ann. Statist. £, 597-603. FELLER, W. (1966). An Introduction to Probability Theory and Its Applications Vol. II. Wiley: New York.
274
STATISTICAL EXPONENTIAL FAMILIES
FERGUSON, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. I, 209-230. FLEISS, J.L. (1981). Statistical Methods for Rates and Proportions, 2nd Edition. Wiley: New York. GHIA, G.D. (1976). Truncated generalized Bayes tests. Ph.D. thesis. Yale University. GHOSH, M., HWANG, J.T., and TSUI, K.W. (1983). Construction of improved estimators in multiparameter estimation for discrete exponential families (with discussion). Ann. Statist. Π_, 351-374. GHOSH, M., and MEEDEN, G. (1977). Admissibility of linear estimators in the one-parameter exponential family. Ann. Statist. J5, 772-778. GIRI, N.C. (1977). Multivariate Statistical Inference. Academic Press: New York. GIRI, N.C. and KIEFER, J. (1964). Local and asymptotic minimax properties of multivariate tests. Ann. Math. Statist. 315, 21-35. HABERMAN, S.J. (1979). Analysis of Qualitative Data, Volumes I, _Π. Academic Press: New York. HABERMAN, S.J. (1974). The Analysis of Frequency Data. University of Chicago Press: Chicago, Illinois. HAFF, L.R. (1982). Solutions of the Euler-Lagrange equations for certain multivariate normal estimation problems. Preprint. HERR, D.G. (1967). Asymptotically optimal tests for multivariate normal distributions. Ann. Math. Statist. 38, 1829-1844. HIPP, C. (1974). Sufficient statistics and exponential famlies. Ann. Statist. 2, 1283-1292. HODGES, J.L. JR. and LEHMANN, E.L. (1951). Some applications of the Cramer-Rao inequality. Proc. 2nd Berkeley Symp. Math. Statist. Prob., 13-22. HOEFFDING, W. (1965). Asymptotically optimal tests for multinomial distributions. Ann. Math. Statist. 36, 369-408.
REFERENCES
275
HUDSON, H.M. (1978). A natural identity for exponential families with applications in multiparameter estimation. Ann. Statist, j), 473-484. HWANG, J.T. (1983). Universal domination and stochastic domination-estimation simultaneously under a broad class of loss functions. Cornell Statistics Center Technical Report. HWANG, J.T. (1982). Improving upon standard estimators in discrete exponential families with applications to Poisson and negative binomial cases. Ann. Statist. 10, 857-867. JAMES, W. and STEIN, C. (1961). Estimation with quadratic loss. Proc. Fourth Berkeley Symposium Math. Statist, and Prob. l_f 311-319. JOAG-DEV, K., PERLMAN, M.D. and PITT, L.D. (1983). Association of normal random variables and Slepian's inequality. Ann. Prob. Π_, 451-455. JOHANSEN, S. (1979). Introduction to the Theory of Regular Exponential Families. University of Copenhagen, Institute of Mathematical Statistics, Lecture Notes. JOHNSON, B.R. and TRUAX, D.R. (1978). Asymptotic behavior of Bayes procedures for testing simple hypotheses in multiparameter exponential families. Ann. Statist. 6, 346-361. JOSHI, V.M. (1976). On the attainment of the Cramer-Rao lower bound. Ann. Statist. 4, 998-1002. JOSHI, V.M. (1969). On a theorem of Karl in regarding admissible estimates for exponential populations. Ann. Math. Statist. 40, 216-223. KALLENBERG, W.C.M. (1978). Asymptotic Optimaiity of Likelihood Ratio Tests in Exponential Families. Mathematical Centre Tracts, Amsterdam. KARLIN, S. (1968). Total Positivity (Volume I). Stanford University Press, Stanford, California. KARLIN, S. (1958). Admissibility for estimation with quadratic loss. Ann. Math. Statist. 29, 406-436.
276
STATISTICAL EXPONENTIAL FAMILIES
KARLIN, S. and RINOTT, Y. (1981). Total positivity properties of absolute value multinomial variables with applications to confidence interval estimates and related probabilistic inequalities. Ann. Statist, j), 1035-1049. KOOPMAN, B.O. (1936). On distributions admitting a sufficient statistic. Trans. Amer. Math. Soc. ^9> 399-409. KOUROUKLIS, S. (1984). A large deviation result for the likelihood ratio statistic in exponential families. Ami. Statist., to appear. KOZIOL, J.A. and PERLMAN, M.D. (1978). Combining independent chi-squared tests. Jour. Amer. Statist. Assoc. 7^3, 753-763. KULLBACK, S. (1959). Information Theory and Statistics. Wiley: New York. KULLBACK, S. and LEIBLER, R.A. (1951). On information and sufficiency. Ann. Math. Statist. 22, 79-86. LAURITZEN, S.L. (1984). Extreme point models in statistics. Scand. _J. Statist. Π_, 65-91. LEHAMNN, E.L. (1983). Theory of Point Estimation Wiley: New York. LEHMANN, E.L. (1959). Testing Statistical Hypotheses. Wiley: New York. LE CAM, L. (1955). An extension of Wald's theory of statistical decision functions. Ami. Math. Statist. 2(>, 69-81. LINDSAY, B.G. (1983). The geometry of mixture likelihoods, part II: the exponential family. -Ann. Statist. jΛ, 783-792. LINNIK, Y.V. (1968). Statistical problems with nuisance parameters. Amer. Math. Soc. Trans!, erf Math. Monographs, ^0. MANDELBAUM, A. (1984). Linear estimators and measurable linear transformations on a Hubert space. 1. Wahr. verw. Gebiet. 65, 385-397. MARDEN, J.I. (1983). Admissibility of invariant tests in the general multivariate analysis of variance problem. Ann. Statist. 11, 1086-1099. MARDEN, J.I. (1982a). Combining independent noncentral chi-squared or F-tests. Ann. Statist. JLO, 266-277. MARDEN, J.I. (1982b). Minimal complete classes of tests of hypotheses with multivariate one-sided alternatives. Ann. Statist. 10, 962-970.
REFERENCES
277
MARDEN, J.I. (19ζl). Invariant tests on covariance matrices. Ann. Statist, j}, 1258-1266. MARDEN, J.I. and PERLMAN, M.D. (1981). The minimal complete class of procedures when combining independent non-central F-tests. Proc. Third Purdue Symp. Dec. Theory and Related Topics. MARDEN, J.I. and PERLMAN, M.D. (1980). Invariant tests for means with covariates. Ann. Statist. S9 25-63. MARDIA, K.V. (1975). Statistics of directional data (with discussion). jJ. Roy. Statist. Soc. EL 27, 343-349. MARDIA, K.V. (1972). Statistics of Directional Data. Academic Press: London. MARDIA, K.V. (1970). Families of Bivariate Data. Griffin: London. MARSHALL, A.W. and OLKIN, I. (1979). Inequalities: Theory of Marjorization and Its Applications. Academic Press: New York. MATTHES, T.K. and TRUAX, D.R. (1967). Tests of composite hypotheses for the multivariate exponential family. Ann. Math. Statist. _38, 681-697. MEEDEN, G. (1976). A special property of linear estimates of the mean. Ann. Statist. 4, 649-650. MEEDEN, G., GHOSH, M. and VARDEMAN, S. (1984). Some admissible nonparametric and related finite population sampling estimators. To appear in Ann. Statist. MORRIS, C.N. (1983). Natural exponential families with quadratic variance functions: statistical theory. Ann. Statist. U9
515-529.
MORRIS, C.N. (1982). Natural exponential families with quadratic variance functions. Ann. Statist. K), 65-80. MUIRHEAD, R.J. (1982). Aspects of Multivariate Statistical Theory. Wiley: New York. NEVEU, J. (1965). The Mathematical Foundations of the Calculus of Probability. Holden-Day: San Francisco. NEY, P. (1983). Dominating points and the asymptotics of large deviations for random walk on R d . Ann. Prob. 11, 158-167.
278
STATISTICAL EXPONENTIAL FAMILIES
NEYMAN, J. (1938). On statistics the distribution of which is independent of the parameters involved in the original probability law of the observed variables. Stat. Res. Mem. _Π, 58-59. OOSTERHOFF, J. (1969). Combination of One-Sided Statistical Tests. Mathematisch Centrum, Amsterdam. PATIL, G.P. (1963). A characterization of the exponential type distributions. Biometrika 50, 205-207. PING, C. (1964). Minimax estimates of parameters of distributions belonging to the exponential family. Chinese Math. 5, 277-299. PITMAN, E.J.G. (1936). Sufficient statistics and intrinsic accuracy. Proc. Camb. Phil. Soc. 32, 567-579. RAIFFA, H. and SCHLAIFER, R. (1961). Applied Statistical Decision Theory. Harvard University: Boston. ROCKAFELLAR, R.T. (1970). Convex Analysis. Princeton University Press: Princeton, New Jersey. SACKS, J. (1963). Generalized Bayes solutions in estimation problems. Ann. Math. Statist. 3[4, 751-768. SAW, J.G. (1977). On inequalities in constrained random variables. Comm. Statist. Theor. Meth. A6(13), 1301-1304. SCHWARTZ, R. (1967). Admissible tests in multivariate analysis of variance. Ann. Math. Statist. 38, 698-710. SIMON, G. (1973). Additivity of information in exponential family probability laws. ^. Amer. Statist. Assoc. j>8, 478-482. SIMONS, G. (1980). Sequential estimators and the Cramer-Rao lower bound. J!. Statist. Planning and Inference 4, 67-74. SOLER, J.L. (1977). Infinite dimensional-type statistical spaces (Generalized exponential families). Recent Developments in Statistics (edited by J.R. Barra, et al.). North Holland Publishing Co.: Amsterdam. SRINIVASAN, C. (1984). On estimation of parameters in a curved exponential family with applications to the Galton-Watson process. Technical report, University of Kentucky.
REFERENCES
279
SRINIVASAN, C. (1981). Admissible generalized Bayes estimators and exterior boundary value problems. Sankhya 43, 1-25. STEIN, C. (1973). Estimation of the mean of a multivariate normal distribution. Proc. Prague Symp. on Asymptotic Statist., 345-381. STEIN, C. (1956a). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Third Berkeley Symp. Math. Statist. Prob. 19 197-206. STEIN, C. (1956b). The admissibility of Hotellings T2-test. Ann. Math. Statist. 27., 616-623. SUNDBERG, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scand. Jk Statist. _1, 49-58. VAN ZWET, W.R., and OOSTERHOFF, J. (1967). On the combination of independent test statistics. Ann. Math. Statist. 38, 659-680. WALD, A. (1950). Statistical Decision Functions. Wiley: New York. WIJSMAN, R.A. (1973). On the attainment of the Cramer-Rao lower bound. Ann. Statist. 1, 538-542. WIJSMAN, R.A. (1958). Incomplete sufficient statistics and similar tests. Ann. Math. Statist. .29, 1028-1045. WOODROOFE, M. (1978). Large deviations of likelihood ratio statistics with applications to sequential testing. Ami. Statist. (6, 72-84.
INDEX
Absolute continuity 99, 102
Chebyshev's inequality
Admissibility 96, 169, 220, 223-226, 231-232, 237, 247, 256
Chi-squared distribution
208 28
(see also gamma distribution) Affine: projection 10 Complete class 107, 110, 220, 224, 22ί transformation 7 230-231, 237, 256, 259 Aggregate exponential family 191-194, 203-206, 250 Completeness 42, 96, 230, 232 Analyticity
38
(see also bounded completeness)
Asymptotic: normality 172-173 optimality 252
Conditional: distribution 21, 30 dominating measure 25
Bayes: acceptance region 40 procedure 262, 263 risk 97 test 69, 226, 246
Conical set 233, 251
(see also generalized Bayes) Behrens-Fisher problem
68
Bessel function 77 Beta distribution
60, 150
Bhattacharya inequality Binomial distribution 136, 203
126 60, 76, 135,
Bounded completeness 61 Canonical exponential family (see standard exponential family) Cauchy-Schwarz inequality 91, 124 Censored exponential distribution 83-84, 163-165
Conjugate prior 90, 106, 112, 116, 161 Contingency table 27, 30, 67, 158, 17" 247 Contiguous alternative 232, 250 Continuity Theorem for Laplace transforms 48-53 Convex: dual 178 hull 2 polytope 50 support 2 Convolution 61 Cramer-Rao inequality inequality)
(see informatioi
Critical function 220, 269 Cumulant generating function 1, 31, 38, 71, 145
Characteristic function 42 280
INDEX
281
Curved exponential family 81, 83, Finite: parameter space 256, 261, 86-89, T26, 165, 173, 233, 252, 262 253 sample space 193 support 107, 149, 263, 266 (see also differentiate subfamily) Fisher information 169 Difference operator 105 Different!able: manifold 30, 160 subfamily 81, 89, 92 165, 172, 173 (see also curved exponential family)
Fisher-Von Mises distribution 76-78, 150, 204 Fourier-Stieltjes transform 42 F-test (Snedecor) 225, 226, 248
Dirichlet: distribution 64, 167, 168 process 167, 168
Fundamental equation 160, 186-187
Discrete exponential family 105, 136, 203
Gamma distribution 18, 47, 60, 68, 76, 132, 133, 136, 210, 248, 250
D-optimality 226, 248
(see also matrix gamma distribution)
E-M algorithm 171
Gal ton-Watson process 89
Entropy 174, 190, 212, 240
Generalized: Bayes estimator 40, 90, 105-107, 110, 112, 140, 141 least square estimate 170
Equicontinuity 52 Essentially smooth 71 Expectation parameter 112, 120, 124, 137-138, 141, 142
General linear model 29, 30, 158, 170, 224, 225 Geometric distribution 27
Exponential distribution (see censored exponential distribution, gamma distribution)
(see also negative binomial distribution)
Exponential family: canonical parameter Green's Theorem
101, 112
(see natural parameter)
Hardy-Weinberg distribution 12, 157, 188
convexity property of 19 dimension of 6 full 2, 16, 80 order of 16
Holder's inequality 19
(see also curved exponential family, differentiate subfamily, discrete exponential family, mean value parameter, natural parameter, regular exponential family, standard exponential family, steep family, and specific families of distributions such as normal distribution, Poisson distribution, etc.) Exponentially fast 208 Face of convex set 192 Fatou's lemma 20, 182, 215
Homogeneity of variance 248 Homomorphism 74, 86 Hotelling's T 2 test 226 Hunt-Stein Theorem 227 Hyperbolic secant distribution 61 Inadmissibility 112, 135, 137, 142, 244, 249, 251, 253
282
INDEX
Information: matrix 93, 124 inequality 90, 94, 97, 105, 124, 125, 130 (see also Fisher information, Kυllback-Lei bier information)
Martingale 88 Matrix: gamma distribution 30, 62, 64 normal distribution 29 Maximum likelihood estimator (M.L.E.) 70, 135, 144, 152, 172-173, 177, 186, 195
Independence: in contingency tables 27, 30, 67, 171, 247 mutual 44, 63
McNemar's test 67
Independent observations
Mean value parameter 70, 75, 150
17, 166
Method-of-moments
Infinite divisibility 61 Inverse Gaussian distribution
72, 85
149
Minimax 103, 169, 256, 260
James-Stein estimator 40, 90, 103, 112, Minimal: entropy parameter 184 complete class (see 132 complete class) Karlin's Theorem 90, 95, 112, 127, 139, exponential family 2, 5, 72, 142 74, 79, 84, 145, 149, 161 Kronecker product 30
Mixed parametrization
79, 243
Kullback-Leibler information 174, 175, Moments 34-38, 50 177, 185, 190, 212 Monotone likelihood ratio 58 (see also entropy) Monotonicity 134 Large deviations 211, 214, 239-240 Multinomial distribution 4, 11, 27, Legendre transformation 179 168, 203 Likelihood ratio test 255, 247, 249 Linear estimator 90, 95 Locally finite measure 48 Local optimality
226
Log linear model
11, 171
(see also contingency table, HardyWeinberg distribution, multinomial distribution) Log Laplace transform
157, 160
(see also cumulant generating function) Lower semi-continuity 19, 75, 145, 179, 184, 215, 256, 258 Marginal distribution
8, 64, 170
Markov chain 28 Markov stopping time 88
(see also binomial distribution, loglinear models) Multivariate: beta distribution 64 linear model (see general linear model) normal distribution (see normal distribution) Natural parameter
1, 26, 45, 76, 106
Nearly convex 246, 247 Negative binomial distribution
27, 60, 106
Newton Raphson algorithm 171 Neyman-Pearson lemma 248 Normal distribution 36, 47, 60, 76, 108, 116, 132, 134, 138, 170, 218, 244, 245, 249, 252 Odds ratio 31, 135
INDEX
283
Partial order 57
Unbiased test 61
Poisson: distribution 60, 76, 106, 135, 136, 137, 141, 203 process 88
Uniform continuity 49
Polyhedral convex set 197, 205
Uniform distribution 77, 169 Uniqueness (for Laplace transform) 42, 63
Quadratic variance function (QVF) 60 Upper semi-continuity Random effects model
148
171
Von Mises distribution (see FisherRegular exponential family 2, 22, 70, Von Mises distribution) 79 Weak convergence 48, 51, 257, 269 Regularly strictly convex 145, 147, 179, 182, 203 Wishart distribution 30 Relative interior 192
(see also Matrix gamma distribution)
Schur convexity 59 Sign change preserving
55, 66
Similar test 61 Slepian's inequality 69 Squared error loss 95, 97, 103, 109, 134 Stable distribution 72 Standard exponential family 1, 35, 42, 43, 92, 166, 223 Statistical curvature 82, 86, 88 Steep family 70, 71, 75, 79, 145, 147, 149, 161, 169, 175, 180, 190, 208 Stein's unbiased estimate 90 Stratum 87, 88, 139-140, 172 Strongly reproductive
18
Student's t-test 218, 244 Sufficient statistic
13, 17, 27, 185
Support (of measure) 2, 191 Tight (sequence of measures) 49 Total positivity 53, 55 Truncated (loss function) 97-99