R(θ, δ_{r,s}) =  b Φ̄(√n(s − θ)),          θ < 0,
                 c [Φ(√n r) + Φ̄(√n s)],   θ = 0,
                 b Φ(√n(r − θ)),           θ > 0,

where Φ̄ = 1 − Φ, and Φ is the N(0, 1) distribution function.

(b) Plot the risk function when b = c = 1, n = 1 and (i) r = −s = −1, (ii) r = −s = −1/2. For what values of θ does the procedure with r = −s = −1 have smaller risk than the procedure with r = −s = −1/2?

4. Stratified sampling. We want to estimate the mean μ = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e.g., geographic
locations or age groups). Within the jth stratum we have a sample of i.i.d. random variables X_{1j}, ..., X_{n_j j}, j = 1, ..., s, and a stratum sample mean X̄_j, j = 1, ..., s. We assume that the s samples from different strata are independent. Suppose that the jth stratum has 100 p_j % of the population and that the jth stratum population mean and variance are μ_j and σ_j². Let N = Σ_{j=1}^s n_j and consider the two estimators

μ̂₁ = N⁻¹ Σ_{j=1}^s Σ_{i=1}^{n_j} X_{ij},    μ̂₂ = Σ_{j=1}^s p_j X̄_j,

where we assume that p_j, 1 ≤ j ≤ s, are known.

(a) Compute the biases, variances, and MSEs of μ̂₁ and μ̂₂. How should n_j, 1 ≤ j ≤ s, be chosen to make μ̂₁ unbiased?

(b) Neyman allocation. Assume that 0 < σ_j < ∞, 1 ≤ j ≤ s, are known (estimates will be used in a later chapter). Show that the strata sample sizes that minimize MSE(μ̂₂) are given by

n_k = (p_k σ_k / Σ_{j=1}^s p_j σ_j) N,   1 ≤ k ≤ s.                (1.7.3)

Hint: You may use a Lagrange multiplier.

(c) Show that MSE(μ̂₁) with n_k = p_k N minus MSE(μ̂₂) with n_k given by (1.7.3) is N⁻¹ Σ_{j=1}^s p_j (σ_j − σ̄)², where σ̄ = Σ_{j=1}^s p_j σ_j.
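The following is a minimal numerical sketch of (b) and (c), assuming Python with NumPy; the stratum proportions, standard deviations, and total sample size below are hypothetical illustrations, not values from the text.

```python
import numpy as np

# Hypothetical strata: proportions p_j, SDs sigma_j, total sample size N.
p = np.array([0.3, 0.5, 0.2])
sigma = np.array([2.0, 1.0, 4.0])
N = 100

# Neyman allocation (1.7.3) versus proportional allocation n_k = p_k * N.
n_neyman = N * p * sigma / np.sum(p * sigma)
n_prop = N * p

# mu_hat_2 = sum_j p_j * Xbar_j is unbiased, so its MSE is sum_j p_j^2 sigma_j^2 / n_j.
def mse_mu2(n):
    return np.sum(p**2 * sigma**2 / n)

print("Neyman:      ", n_neyman, " MSE =", mse_mu2(n_neyman))
print("Proportional:", n_prop,   " MSE =", mse_mu2(n_prop))

# Part (c): the difference should equal N^{-1} sum_j p_j (sigma_j - sigma_bar)^2,
# with sigma_bar = sum_j p_j sigma_j.
sigma_bar = np.sum(p * sigma)
print("claimed:", np.sum(p * (sigma - sigma_bar)**2) / N,
      " actual:", mse_mu2(n_prop) - mse_mu2(n_neyman))
```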
5. Let X̄_b and X̂_b denote the sample mean and the sample median of the sample X₁ − b, ..., X_n − b. If the parameters of interest are the population mean and median of X_i − b, respectively, show that MSE(X̄_b) and MSE(X̂_b) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift).

6. Suppose that X₁, ..., X_n are i.i.d. F, that X̂ is the median of the sample, and that n is odd. We want to estimate "the" median ν of F, where ν is defined as a value satisfying P(X ≤ ν) ≥ ½ and P(X ≥ ν) ≥ ½.

(a) Find the MSE of X̂ when
(i) F is discrete with P(X = a) = P(X = c) = p, P(X = b) = 1 − 2p, 0 < p < ½, a < b < c. Hint: Use Problem 1.3.5. The answer is

MSE(X̂) = [(a − b)² + (c − b)²] P(S ≥ k)

where k = .5(n + 1) and S ~ B(n, p).
(ii) F is uniform, U(0, 1). Hint: See Problem B.2.9.
(iii) F is normal, N(0, 1), n = 1, 5, 25, 75. Hint: See Problem B.2.13. Use a numerical integration package.

(b) Compute the relative risk RR = MSE(X̂)/MSE(X̄) in question (i) when b = 0, a = −Δ, c = Δ, p = .20, .40 and n = 1, 5, 15.

(c) Same as (b) except when n = 15, plot RR for p = .1, .2, .3, .4, .45.

(d) Find E|X̂ − b| for the situation in (i). Also find E|X̄ − b| when n = 1 and 2 and compare it to E|X̂ − b|.

(e) Compute the relative risks MSE(X̂)/MSE(X̄) in questions (ii) and (iii).

7. Let X₁, ..., X_n be a sample from a population with values θ − 2Δ, θ − Δ, θ, θ + Δ, θ + 2Δ; Δ > 0. Each value has probability .2. Let X̄ and X̂ denote the sample mean and median. Suppose that n is odd.

(a) Find MSE(X̂) and the relative risk RR = MSE(X̂)/MSE(X̄).

(b) Evaluate RR when n = 1, 3, 5. Hint: By Problem 1.3.5, set θ = 0 without loss of generality. Next note that the distribution of X̂ involves Bernoulli and multinomial trials.
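For Problem 6(a)(iii), the text only asks for "a numerical integration package"; a minimal sketch assuming Python with NumPy/SciPy (the library choice is ours) integrates x² against the exact density of the middle order statistic of a N(0, 1) sample.

```python
import numpy as np
from math import comb
from scipy import stats, integrate

# MSE of the sample median of n (odd) i.i.d. N(0,1) observations: the median is the
# k-th order statistic with k = (n+1)/2, whose density is k*C(n,k)*F^{k-1}(1-F)^{n-k} f.
def median_mse_normal(n):
    k = (n + 1) // 2
    c = k * comb(n, k)                      # = n! / ((k-1)! (n-k)!)
    def integrand(x):
        F, f = stats.norm.cdf(x), stats.norm.pdf(x)
        return x**2 * c * F**(k - 1) * (1 - F)**(n - k) * f
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

for n in [1, 5, 25, 75]:
    # For comparison, MSE of the sample mean is 1/n; for large n, MSE(median) is about pi/(2n).
    print(n, median_mse_normal(n), 1 / n)
```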
8. Let X₁, ..., X_n be a sample from a population with variance σ², 0 < σ² < ∞.

(a) Show that s² = (n − 1)⁻¹ Σ_{i=1}^n (X_i − X̄)² is an unbiased estimator of σ². Hint: Write (X_i − X̄)² = ([X_i − μ] − [X̄ − μ])², then expand (X_i − X̄)² keeping the square brackets intact.

(b) Suppose X_i ~ N(μ, σ²).
(i) Show that MSE(s²) = 2(n − 1)⁻¹ σ⁴.
(ii) Let σ̂²_c = c Σ_{i=1}^n (X_i − X̄)². Show that the value of c that minimizes MSE(σ̂²_c) is (n + 1)⁻¹.

Hint for question (b): Recall (Theorem B.3.3) that (n − 1)s²/σ² has a χ²_{n−1} distribution. You may use the fact that E(X_i − μ)⁴ = 3σ⁴.
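A small sketch of part (b)(ii), assuming Python with NumPy; the sample size n = 10, σ² = 1, and the Monte Carlo design are hypothetical choices, not part of the problem.

```python
import numpy as np

# Exact MSE of c * S with S = sum (X_i - Xbar)^2 and S/sigma^2 ~ chi^2_{n-1}:
# MSE(c*S) = sigma^4 * ((n^2 - 1) c^2 - 2 (n - 1) c + 1), minimized at c = 1/(n+1).
def mse(c, n, sigma2=1.0):
    return sigma2**2 * ((n**2 - 1) * c**2 - 2 * (n - 1) * c + 1)

n = 10
for label, c in [("1/(n-1)", 1 / (n - 1)), ("1/n", 1 / n), ("1/(n+1)", 1 / (n + 1))]:
    print(label, mse(c, n))

# Monte Carlo confirmation of the same ordering (200000 hypothetical replicates).
rng = np.random.default_rng(0)
x = rng.normal(size=(200_000, n))
S = ((x - x.mean(axis=1, keepdims=True))**2).sum(axis=1)
for label, c in [("1/(n-1)", 1 / (n - 1)), ("1/(n+1)", 1 / (n + 1))]:
    print(label, np.mean((c * S - 1.0)**2))
```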
9. Let θ denote the proportion of people working in a company who have a certain characteristic (e.g., being left-handed). It is known that in the state where the company is located, 10% have the characteristic. A person in charge of ordering equipment needs to estimate θ and uses

θ̂ = (.2)(.10) + (.8)p̂

where p̂ = X/n is the proportion with the characteristic in a sample of size n from the company. Find MSE(θ̂) and MSE(p̂). If the true θ is θ₀, for what θ₀ is MSE(θ̂)/MSE(p̂) < 1? Give the answer for n = 25 and n = 100.

10. In Problem 1.3.3(a) with b = c = 1 and n = 1, suppose θ is discrete with frequency function π(0) = π(−½) = π(½) = ⅓. Compute the Bayes risk of δ_{r,s} when (a) r = −s = −1, (b) r = −s = −1/2. Which one of the rules is the better one from the Bayes point of view?
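Returning to Problem 9, a minimal sketch of the requested MSE comparison, assuming Python with NumPy; the grid of θ₀ values is an arbitrary choice.

```python
import numpy as np

# theta_hat = (.2)(.10) + (.8) p_hat with p_hat = X/n, X ~ Binomial(n, theta).
# Bias(theta_hat) = .02 - .2*theta, Var(theta_hat) = .64*theta*(1-theta)/n.
def mse_theta_hat(theta, n):
    return (0.02 - 0.2 * theta)**2 + 0.64 * theta * (1 - theta) / n

def mse_p_hat(theta, n):
    return theta * (1 - theta) / n

for n in (25, 100):
    theta = np.linspace(0.01, 0.99, 99)
    ratio = mse_theta_hat(theta, n) / mse_p_hat(theta, n)
    good = theta[ratio < 1]
    print(n, "ratio < 1 roughly for theta0 in [%.2f, %.2f]" % (good.min(), good.max()))
```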
11. A decision rule δ is said to be unbiased if

E_θ(l(θ, δ(X))) ≤ E_θ(l(θ′, δ(X)))

for all θ, θ′ ∈ Θ.

(a) Show that if θ is real and l(θ, a) = (θ − a)², then this definition coincides with the definition of an unbiased estimate of θ.

(b) Show that if we use the 0 − 1 loss function in testing, then a test function is unbiased in this sense if, and only if, the power function, defined by β(θ, δ) = E_θ(δ(X)), satisfies

β(θ′, δ) ≥ sup{β(θ, δ) : θ ∈ Θ₀}

for all θ′ ∈ Θ₁.
12. In Problem 1.3.3, show that if c ≤ b, r ≤ 0, and s > 0, then δ_{r,s} is unbiased.
13. A (behavioral) randomized test of a hypothesis H is defined as any statistic φ(X) such that 0 ≤ φ(X) ≤ 1. Suppose that U ~ U(0, 1) and is independent of X. Consider the following randomized test δ: Observe U; if U = u, use the test δ_u, where δ_u(X) = 1 if φ(X) ≥ u and δ_u(X) = 0 otherwise. Show that δ agrees with φ in the sense that

P_θ[δ(X) = 1] = 1 − P_θ[δ(X) = 0] = E_θ(φ(X)).
(b) Show directly that the median ν of Y minimizes E(|Y − c|) as a function of c.

9. If Y and Z are any two random variables, exhibit a best predictor of Y given Z for mean absolute prediction error.

10. Suppose that Z has a density p, which is symmetric about c, that is, p(c + z) = p(c − z) for all z. Show that c is a median of Z.
11. Show that if (Z, Y) has a bivariate normal distribution, the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error.

12. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. Suppose that Z, Y are measurements on such a variable for a randomly selected father and son. Let Z′, Z″, Y′, Y″ be the corresponding genetic and environmental components, Z = Z′ + Z″, Y = Y′ + Y″, where (Z′, Y′) have a N(μ, μ, σ², σ², ρ) distribution and Z″, Y″ are N(ν, τ²) variables independent of each other and of (Z′, Y′).

(a) Show that the relation between Z and Y is weaker than that between Z′ and Y′; that is, |Corr(Z, Y)| < |ρ|.

(b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z′ to predict Y′.

13. Suppose that Z has a density p, which is symmetric about c and which is unimodal; that is, p(z) is nonincreasing for z > c.

(a) Show that P[|Z − t| ≤ s] is maximized as a function of t for each s > 0 by t = c.

(b) Suppose (Z, Y) has a bivariate normal distribution. Suppose that if we observe Z = z and predict μ(z) for Y, our loss is 1 unit if |μ(z) − Y| > s, and 0 otherwise. Show that the predictor that minimizes our expected loss is again the best MSPE predictor.
14. Let Z₁ and Z₂ be independent and have exponential distributions with density λe^{−λz}, z > 0. Define Z = Z₂ and Y = Z₁ + Z₁Z₂. Find

(a) The best MSPE predictor E(Y | Z = z) of Y given Z = z
(b) E(E(Y | Z))
(c) Var(E(Y | Z))
(d) Var(Y | Z = z)
(e) E(Var(Y | Z))
(f) The best linear MSPE predictor of Y based on Z = z.

Hint: Recall that E(Z₁) = E(Z₂) = 1/λ and Var(Z₁) = Var(Z₂) = 1/λ².

15. Let μ(z) = E(Y | Z = z). Show that

Var(μ(Z))/Var(Y) = Corr²(Y, μ(Z)) = max_g Corr²(Y, g(Z))

where g(Z) stands for any predictor.
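A quick Monte Carlo check of the identities behind Problems 14 and 15 — E(E(Y | Z)) = E(Y) and Var(Y) = E(Var(Y | Z)) + Var(E(Y | Z)) — assuming Python with NumPy; λ = 2 and the replication count are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
z1 = rng.exponential(1 / lam, size=500_000)
z2 = rng.exponential(1 / lam, size=500_000)
z, y = z2, z1 * (1 + z2)                      # Z = Z2, Y = Z1 + Z1*Z2 = Z1*(1 + Z2)

mu_z = (1 + z) / lam                          # E(Y | Z = z) = (1 + z)/lambda
var_z = (1 + z)**2 / lam**2                   # Var(Y | Z = z) = (1 + z)^2/lambda^2

print("E(Y)   =", y.mean(), "   E(E(Y|Z)) =", mu_z.mean())
print("Var(Y) =", y.var(),  "   E(Var(Y|Z)) + Var(E(Y|Z)) =", var_z.mean() + mu_z.var())
```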
16. Show that ρ²_{ZY} = Corr²(Y, μ_L(Z)) = max_{g∈L} Corr²(Y, g(Z)), where L is the set of linear predictors.
17. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio η²_{ZY}; that is,

η²_{ZY} = 1 − E[Y − μ(Z)]²/Var(Y) = Var(μ(Z))/Var(Y).

(See Pearson, 1905, and Doksum and Samarov, 1995, on estimation of η²_{ZY}.)

(a) Show that η²_{ZY} ≥ ρ²_{ZY}, where ρ²_{ZY} is the population multiple correlation coefficient of Remark 1.4.3. Hint: See Problem 1.4.15.

(b) Show that if Z is one-dimensional and h is a 1-1 increasing transformation of Z, then η²_{h(Z)Y} = η²_{ZY}. That is, η² is invariant under such h.
(c) Let ε_L = Y − μ_L(Z) be the linear prediction error. Show that, in the linear model of Remark 1.4.4, ε_L is uncorrelated with μ_L(Z) and η²_{ZY} = ρ²_{ZY}.

18. Predicting the past from the present. Consider a subject who walks into a clinic today, at time t, and is diagnosed with a certain disease. At the same time t a diagnostic indicator Z₀ of the severity of the disease (e.g., a blood cell or viral load measurement) is obtained. Let S be the unknown date in the past when the subject was infected. We are interested in the time Y₀ = t − S from infection until detection. Assume that the conditional density of Z₀ (the present) given Y₀ = y₀ (the past) is N(μ + βy₀, σ²), where μ and σ² are the mean and variance of the severity indicator Z₀ in the population of people without the disease. Here βy₀ gives the mean increase of Z₀ for infected subjects over the time period y₀; β > 0, y₀ > 0. It will be convenient to rescale the problem by introducing Z = (Z₀ − μ)/σ and Y = βY₀/σ.

(a) Show that the conditional density f(z | y) of Z given Y = y is N(y, 1).

(b) Suppose that Y has the exponential density

π(y) = λ exp{−λy},   λ > 0, y > 0.
Show that the conditional distribution of Y (the past) given Z = z (the present) has density

π(y | z) = (2π)^{−1/2} c⁻¹ exp{−½[y − (z − λ)]²},   y > 0,

where c = Φ(z − λ). This density is called the truncated (at zero) normal, N(z − λ, 1), density. Hint: Use Bayes rule.
(c) Find the conditional density π₀(y₀ | z₀) of Y₀ given Z₀ = z₀.

(d) Find the best predictor of Y₀ given Z₀ = z₀ using mean absolute prediction error E|Y₀ − g(Z₀)|. Hint: See Problems 1.4.7 and 1.4.9.

(e) Show that the best MSPE predictor of Y given Z = z is

E(Y | Z = z) = c⁻¹ φ(λ − z) − (λ − z).

(In practice, all the unknowns, including the "prior" π, need to be estimated from cohort studies; see Berman, 1990, and Normand and Doksum, 2000.)

19. Establish (1.4.14) by setting the derivatives of R(a, b) equal to zero, solving for (a, b), and checking convexity.
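A short numerical check of the formula in Problem 18(e), assuming Python with SciPy; the values of λ and z are arbitrary.

```python
import numpy as np
from scipy import stats, integrate

# Compare E(Y | Z = z) = Phi(z - lam)^{-1} * phi(lam - z) - (lam - z) with direct
# numerical integration against the N(z - lam, 1) density truncated to y > 0.
lam, z = 1.5, 2.0
m = z - lam                                # mean of the untruncated normal
c = stats.norm.cdf(m)                      # normalizing constant P(N(m,1) > 0)

formula = stats.norm.pdf(lam - z) / c - (lam - z)
numeric, _ = integrate.quad(lambda y: y * stats.norm.pdf(y - m) / c, 0, np.inf)
print(formula, numeric)                    # the two values should agree
```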
20. Let Y be the number of heads showing when X fair coins are tossed, where X is the number of spots showing when a fair die is rolled. Find (a) The mean and variance of Y.
(b) The MSPE of the optimal predictor of Y based on X.
(c) The optimal predictor of Y given X = x, x = 1, ..., 6.
21. Let Y be a vector and let r(Y) and s(Y) be real valued. Write Cov[r(Y), s(Y) | z] for the covariance between r(Y) and s(Y) in the conditional distribution of (r(Y), s(Y)) given Z = z.

(a) Show that if Cov[r(Y), s(Y)] < ∞, then

Cov[r(Y), s(Y)] = E{Cov[r(Y), s(Y) | Z]} + Cov{E[r(Y) | Z], E[s(Y) | Z]}.

(b) Show that (a) is equivalent to (1.4.6) when r = s.

(c) Show that if Z is real, Cov[r(Y), Z] = Cov{E[r(Y) | Z], Z}.

(d) Suppose Y₁ = a₁ + b₁Z₁ + W and Y₂ = a₂ + b₂Z₂ + W, where Y₁ and Y₂ are responses of subjects 1 and 2 with common influence W and separate influences Z₁ and Z₂, where Z₁, Z₂ and W are independent with finite variances. Find Corr(Y₁, Y₂) using (a).
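A Monte Carlo sketch for part (d), assuming Python with NumPy; all coefficients and variances below are hypothetical, and the normal distributions are chosen only for convenience of simulation.

```python
import numpy as np

# By (a)/(c), Cov(Y1, Y2) = Var(W) since Z1, Z2, W are independent, so
# Corr(Y1, Y2) = Var(W) / sqrt(Var(Y1) * Var(Y2)).
rng = np.random.default_rng(2)
a1, b1, a2, b2 = 1.0, 2.0, -1.0, 0.5        # hypothetical coefficients
v_z1, v_z2, v_w = 1.0, 3.0, 2.0             # hypothetical variances of Z1, Z2, W

z1 = rng.normal(0, np.sqrt(v_z1), 10**6)
z2 = rng.normal(0, np.sqrt(v_z2), 10**6)
w = rng.normal(0, np.sqrt(v_w), 10**6)
y1, y2 = a1 + b1 * z1 + w, a2 + b2 * z2 + w

theory = v_w / np.sqrt((b1**2 * v_z1 + v_w) * (b2**2 * v_z2 + v_w))
print(theory, np.corrcoef(y1, y2)[0, 1])
```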
(e) In the preceding model (d), if b₁ = b₂ and Z₁, Z₂ and W have the same variance σ², we say that there is a 50% overlap between Y₁ and Y₂. In this case what is Corr(Y₁, Y₂)?

(f) In model (d), suppose that Z₁ and Z₂ are N(μ, σ²) and W ~ N(μ₀, σ₀²). Find the optimal predictor of Y₂ given (Y₁, Z₁, Z₂).
22. In Example 1.4.3, show that the MSPE of the optimal predictor is σ²_Y(1 − ρ²_{ZY}).

23. Verify that solving (1.4.15) yields (1.4.14).

24. (a) Let w(y, z) be a positive real-valued function. Then [y − g(z)]²/w(y, z) = δ_w(y, g(z)) is called weighted squared prediction error. Show that the mean weighted squared prediction error is minimized by μ₀(Z) = E₀(Y | Z), where E₀ is computed under the density p₀(y, z) = c p(y, z)/w(y, z) and c is the constant that makes p₀ a density. Assume that E δ_w(Y, g(Z)) < ∞ for some g and that p₀ is a density.

(b) Suppose that given Z = z, Y ~ B(n, z), n ≥ 2, and suppose that Z has the beta, β(r, s), density. Find μ₀(Z) when (i) w(y, z) = 1, and (ii) w(y, z) = z(1 − z), 0 < z < 1.
25. Show that EY² < ∞ if and only if E(Y − c)² < ∞ for all c.
Hint: Whatever be Y and c,

½Y² − c² ≤ (Y − c)² = Y² − 2cY + c² ≤ 2(Y² + c²).
Problems for Section 1.5

1. Let X₁, ..., X_n be a sample from a Poisson, P(θ), population where θ > 0.
(a) Show directly that Σ_{i=1}^n X_i is sufficient for θ.
(b) Establish the same result using the factorization theorem.

2. Let n items be drawn in order without replacement from a shipment of N items of which Nθ are bad. Let X_i = 1 if the ith item drawn is bad, and = 0 otherwise. Show that Σ_{i=1}^n X_i is sufficient for θ directly and by the factorization theorem.

3. Suppose X₁, ..., X_n is a sample from a population with one of the following densities.
(a) p(x, θ) = θx^{θ−1}, 0 < x < 1, θ > 0. This is the beta, β(θ, 1), density.
(b) p(x, θ) = θa x^{a−1} exp(−θx^a), x > 0, θ > 0, a > 0. This is known as the Weibull density.
(c) p(x, θ) = θa^θ/x^{θ+1}, x > a, θ > 0, a > 0. This is known as the Pareto density.
In each case, find a real-valued sufficient statistic for θ, a fixed.
4. (a) Show that T1 and T2 are equivalent statistics if, and only if, we can write T2 = H (T1) for some 1-1 transformation H of the range of T1 into the range of T2. Which of the following statistics are equivalent? (Prove or disprove. )
n� 1 Xt and I:� 1 log Xi , Xi > 0 (c) I:� 1 xi and I:� 1 log Xi, Xi > 0 (d) (l:� 1 xi , l:� 1 x;) and (l:� 1 xi, I:� 1 (xi - x?) (e) (I:� 1 Xi, 2:� 1 xl) and (2:� 1 Xi , 2:� 1 (xi X ) 3). 5. Let B = (B" 82) be a bivariate parameter. Suppose that T1 (X) is sufficient for 81 whenever 82 is fixed and known, whereas T2(X) is sufficient for fh whenever 81 is fixed (b)
-
and known. Assume that (}h B2 vary independently, lh E 81, 82 E 82 and that the set S {x : p(x, B) > 0} does not depend on B. =
(a) Show that if T1 and T2 do not depend one, and 81 respectively, then (T1 (X), T2(X))
is sufficient for e.
(T1 (X) , T,(X)) is sufficient for B, T1 (X) is sufficient for 81 whenever 8z is fixed and known, but Tz(X) is not sufficient for 8z, when (}1 is fixed (b) Exhibit an example in which
and known. 6. Let X take on the specified values v1 , vk with probabilities 81, . . . , 8k, respectively. Suppose that X1 , . . . , Xn are independently and identically distributed as X. Suppose that IJ = (81, , Bk) is unknown and may range over the set 8 = { (B, . . . , Bk) : e, > 0, 1 < i < k, L� 1 8i = 1 }. Let Nj be the number of xi which equal Vj· .
.
.
.
.
1
.
(a) What is the distribution of (Nt, . . . , Nk)? (b) Show that N = (N1, . . . , Nk-t) is sufficient for B.
7. Let X1,
. • . 1
Xn be a sample from a population with density p( x , 8) given by -
p(x, B)
0 otherwise. Here e
=
(!",a) with -oo <
11-
< oo, a > 0.
(a) Show that min (X1 , , Xn) is sufficient for fl when a is fixed. (b) Find a one-dimensional sufficient statistic for a when Jl. is fixed. . • .
(c) Exhibit a two-dimensional sufficient statistic for 8.
8. Let X1, . . , Xn be a sample from some continuous distribution Fwith density J, which is unknown. Treating f as a parameter, show that the order statistics X(t ) • . . . , X(n) (cf. Problem B.2.8) are sufficient for f. .
'
''
I !
86
9.
Statistical Models, Goals, and Performance Criteria Let
X1 ,
.
.
•
,
Chapter 1
Xn be a sample from a population with density
fo (x)
a(O) h (x)
if O I
< x < o,
0 othetwise where h(x)
>
0, 0
= (OI , O, ) with -oo
<
OI
< 02 <
oo, and a(O) =
I [J:,' h(x)dx] �
is assumed to exist. Find a two-dimensional sufficient statistic for this problem and apply
02] family of distribution s. 10. Suppose X1, , . . , Xn are U.d. with density f(x, 8) = �e-l.x-S]_ Show that (X{l)• . . . , X(n)). the order statistics, are minimal sufficient. Hint: t, Lx (O) = - L� I sgn(X, - 0), 0 rt {XI, . . . , Xn }, which determines X( I ) , · · · , X(n)• 11. Let X1 , X2 , . . . , Xn be a sample from the unifonn, U(O,B), distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for 0. 12. Dynkin, Lehmann, Schejfi 's Theorem. Let P = {Po : () E 8} where Po is discrete concentrated on X = {xi, x,, . . . }. Let p(x, 0) = Pe [X = x] = Lx(O) > 0 on X. Show your result to the U[81,
that
fxx(��) is minimial sufficient.
Hint: 13.
Apply the factorization theorem.
Suppose that X =
bution function
(X1,
• . .
F(x). If F(x)
, Xn) is a sample from a population with continuous distri is N(p, u2 ) , T(X) = (X, u2 ), where u2 = n� I l.:(X,
X)2, is sufficient, and S(X) = (X(1 1 , . . . , X(n1 ), where X(, 1 = (X(i) - X)ju, is "irrel evant" (ancillary) for (J.L, a2 ). However, S(X) is exactly what is needed to estimate the
.
!
'
"shape" of
'
class F = iff
F(x) when F(x) is unknown . The shape ofF is represented by the equivalence {F((· - a)fb) b > 0, a E R). Thus a distribution G has the same shape as F
G E F.
function
:
For instance, one "estimator" of this shape is the scaled empirical distribution
�
F,(x)
jfn,
x(n :S x < x(i+l)• j
=
1, .
..,n-1
x < x(1 ) 1, X > x(n)· � Show that for fixed x, F,((x - x )fu) converges in probability to F(x). Here we are using F to represent F because every member ofF can be obtained from F. 0,
:I
'I
14. Kolmogorov's Theorem.
"
'. •
9,
We are given a regular model with 9 finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on the posterior distribution of
9 depends on x only through T(x).
Show that T(X) is
sufficient. (b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the
posterior distribution depends on x only through T
(x) .
i
j l
'
Section 1.7
Problems
87
and Complements
Hint: Apply the factorization theorem. 15. Let Xh . , Xn be a sample from f(x 0), () E R. Show that the order statistics arc minimal sufficient when f is the density Cauchy f(t) � 1 /Jr( l + t2). 16. Let X Xrn; Y1 , . . , Y,l be independently distributed according to N(p,, 0"2) and . .
1,
.
· -
.
•
,
.
N(TJ, r2), respectively. Find minimal sufficient statistics for the following three cases: (i) p,, TJ, (ii)
p,,
17 < oo,
0 < a, T.
a = T and p,, 17, O" arc arbitrary.
(iii) p, =
17.
a, T are arbitrary: -oo < 1] and p,, a, T are arbitrary.
In Example 1.5.4, express t1 as a function of Lx(O,
I)
and
Lx(l, 1).
Problems to Section 1.6 1. Prove the assertions of Table 1 . 6. 1. 2.
Suppose
X1 ,
• • •
, Xn is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show
that the distribution of X forms a one-parameter exponential family. Identify ,.,, B, T, and
h.
3.
Let X be the number of failures before the first success in a sequence of Bernoulli trials
with probability of success 9. Then Pe [X � k] �
the
geometric distribution (9 (B)).
(I - 9)'9, k � 0, I, 2, .
.
. This is called
(a) Show that the family of geometric distributions is a one-parameter exponential fam
ily with
(b)
T(x)
�
x.
Deduce from Theorem 1.6.1 that if X�o
. . . , Xn is a sample from 9(9), then the
2.:� 1 Xi form a one-parameter exponential family. (c) Show that E� xi in part (b) has a negative binomial distribution with parameters
distributions of
l
(n,9) defined by Pe iL:7 1 X, = k] �
( � ) n
+
- l
( 1 - 9)'9", k = 0 , 1 , 2 , . . . (The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success
0.) Hint: By Theorem 1.6.1, Pe [L� 1 X, = k] = c,(l - 9)'0", 0 < 9 < I. If =
"\" � c,w• =
k=O
I
(1
-w
)n
, 0 < w < I,
then
Ck �
I d' n w ) (I k! fiw k w=O
Which of the following families of distributions are exponential families? (Prove or disprove.)
4.
(a) The U(O, 9) fumily
88
Statistical Models, Goals, and Performance Criteria
Chapter 1
(b)p(." , O) = {exp[- 2 log 0 + log(2x)]}l[x E (0,0)] (c) p(x, O) = �. x E { 0. 1 +0, . . . , 0.9 + 0) (d) The N(O, 02) family, 0 > 0 (e) p( x, O ) = 2(x + 0)/(1 + 20), 0 < x < I, 0 > 0
(f) p(x, 9) is the conditional frequency function of a binomial, B(n, 0), variable X,
given that X > 0.
5. Show that the following families of distributions are two-parameter exponential families and identify the functions 1], B, T, and h.
(a) The beta family.
(b) The gamma family.
6. Let X have the Dirichlet distribution, D( a) , of Problem 1.2.15. Show the distribution of X form an r-parameter exponential family and identify fJ, B, T, and h. 7. Let X = ( (X 1 , Y1 ), . . . , (X., Yn)) be a sample from a bivariate normal population. Show that the distributions of X form a five-parameter exponential family and identify 'TJ, B, T, and h. 8. Show that the family of distributions of Example 1.5.3 is not a one parameter CX(Xlnential family. Hint: If it were, there would be a set A such that p(x, 0) > 0 on A for all 0. 9. Prove the analogue of Theorem 1.6.1 for discrete k-parameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (X1, X2 ) is a sample of size 2 from f (·, 0), then X1 + X2 is sufficient for B. Show that /(·, B) corresponds to a one-arameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x�, x2) such that log f(x�, 0) + log j(x2, 0) = g(x1 + x2, 0) + h(x1, x2). Fix Oo and let r(x, 0) = log f(x, 0) - log f(x, Oo), q(x, 0) = g(x,O) - g(x,Oo). Then, q(x 1 + x2,0) = r(x, , O) + r(x,, O), and hence, [r(x� , O) r(O, 0)] + [r(x2, 0) - r(O, 0)] = r(x1 + x2, 0) - r(O, 0).
' '
I
11. Use Theorems 1.6.2 and 1.6.3 to obtain moment-generating functions for the sufficient statistics when sampling from the following distributions.
(a) normal, () = (p, a2) (b) gamma, r(p, e = >., p fixed
>.),
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3) () = (p, (0 gamma, r(p,
>.),
.
-
>.) .
�---
------
II
'
Section 1. 7
89
Problems and Complements
12. Show directly using the definition of the rank of an ex}X)nential family that the multi nomial distribution, M(n; B,, . . . , B.) , O < B; < 1, 1 < j < k. L:�" 1 B; = 1, is of rank
k - 1.
13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any ( k I)-dimensional hyperplane . �
•
14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of t:. Show that if k = 1 and &0 # 0 and A, A are defined on all of t:, then Theorem 1.6.3 continues to hold. 15. Let P = {Po : B E 8} where Po is discrete and concentrated on X = {x 1 , x2, . . . ) , and let p(x, B) = Po [X = x]. Show that if P is a (discrete) canonical exponential family generated b(, (T, h) and &0 # 0, then T is minimal sufficient. Hint: �;,• Lx(TJ) = T; (X) - E'IT;(X). Use Problem 1.5.12. 16. Life testing. Let X1, . . . , Xn be independently distributed with exponential density < < < 2 1 the > ordered X's be denoted by Y1 Y 2 0 for 0, and let · · · e-xl (28)x Y It is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r a-particles have been emitted. Show that ..
(i) The joint distribution of Y1 ,
1
•
[ p
n!
(28)' (n - r)! ex (ii) The distribution of II: :
l
.
.
, Yr is an exponential family with density
L::� r Yi + (n - r)yr 28
-
]
< < < , 0Yl - . . . - Yr ·
2 + (n r)Yrl/B is x with 2r degrees of freedom. Y;
(iii) Let Yi , Y2 , denote the time required until the first, second, . . . event occurs in a Poisson process with parameter 1/28' (see A.16). Then Zr = Y1/8', Z2 = (Y2 Y1 )/8', z, = (Y3 - Y2)/8', . . . are independently distributed as x2 with 2 degrees , Yr is an exponential family with density of freedom, and the joint density of Y1 , •
.
.
•
1
(28')
r) ( Y ' exp 28'
•
.
'
0<
Yl
< ... <
Yr·
The distribution of Yr/B' is again x2 with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burned-out tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
'
. 90
Statistical Models, Goals, and Performance Criteria
Chapter 1
[(ii): The random variables z, � (n - i + l)(Y; - l� � t )/9 (i � 1, . . . . 1· ) are inde pendently distributed as x2 with 2 degrees of freedom, and [L� 1 Yi + (n - 1')Yr]/B =
I: :� t Z,.]
17. Suppose that (Tk X l , h) generate a canonical exponential family P with parameter 1J kxl and E = Rk. Let (a) Show that Q is the exponential family generated by IIL T and h exp{ cTT}, where lh is the projection matrix of T onto L � { '1 : '1 � BIJ + c } .
(b) Show that ifP has full rank k and B is of rank l, then Hint: If B is of rank l, you may assume
Q has full rank I.
18. Suppose Yt . . . , Yn are independent with Yi "' N(/31 + /32 Zi, a 2 ), where z1 , . . . , Zn are covariate values not all equal. (See Example 1.6.6.) Show that the family has rank 3. Give the mean vector and the variance matrix of T. ,
19. Logistic Regression. We observe
(z1, Y1 ) , . . . , (zn, Yn) where the Y1, , Yn are inde pendent, Yi "-' B(�, .Xi). The success probability .Xi depends on the characteristics zi of .
.
•
the ith subject, for example, on the covariate vector zi = (age, height, blood pressure)T. The function l(u) � log[u/(1 - u)[ is called the logit function. In the logistic linear re gression model it is assumed that l ( .\,) � zf(3 where (3 ((31 , . . . , f3d )T and z, is d x 1. Show that Y (Y1, Yn) T follow an exponential model with rank d iff z1, . . . , zd are not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). =
• . •
=
,
20. (a) In part II of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by (171 - 'lo)TT and that �(ii) =�(i). (b) Fill in the details of part Ill of the proof of Theorem 1.6.4.
21. Find JJ.('I)
I
EryT(X) for the gamma, r(a, .\), distribution, where 9 � (a, ,\) .
�
Xn be a sample from the k-parameter exponential family distribution 22. Let X 1 , (1.6.10). Let T � (I;� 1 T1(X,), . . . , I;� 1 Tk(Xi)) and let •
I
•
•
,
s
�
{(ry1(1J), . . . , ryk(IJ)) , e E 8).
Show that if S contains a subset of k + 1 vectors v0, . . . , Vk+l so that vi - v0, 1 < i < k, are not collinear (linearly independent), then T is minimally sufficient for 8.
I
. i I, '
,
:
23. Using (1.6.20), find a conjugate family of distributions for the gamma and beta fami lies. (a) With one parameter fixed.
(b) With both parameters free.
Section 1.7
91
Problems and Complements
24. Using (1 .6.20), find a conjugate family of distributions for the normal family using as parameter 0 (01 , 02) where 01 = Eo (X), Oz = 1/(VaroX) (cf. Problem !.2.12). =
25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with cr z known. Find a conjugate family of prior distributions for (/31 , /32) T.
26. Using (1 .6.20), find a conjugate family of distributions for the multinomial distribution. See Problem !.2.15.
27. Let 'P denote the canonical exponential family genrated by T and h. For any TJo E £, set h0(x) = q(x, 'lo ) where q is given by (!.6.9). Show that P is also the canonical exponential family generated by T and h0.
28. Exponentialfamilies are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by
h(f) = E(- log f(X))
=
- J: [logf(x)Jf(x)dx.
This quantity arises naturally in information in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S = {x : f(x) > 0}. {a) Show that the canonical k-parameter exponential family density
f(x, 'I) = exp
•
1/0 + L 'l;r; (x) - A( 'I) j:=l
, xES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is f(x)dx = 1, Is f(x)r;(x) = a; , 1 < j < k,
where r-,o, . . , TJk are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. .
(b) Find the maximum entropy densities when r;(x) = xi and (i) S = (0, oo), k I, <>t > 0; (ii) S R, k = 2 , at E R, az > 0; (iii) S R, k 3, a1 E R, o:z > 0, a, E R. =
=
=
=
29. As in Example 1.6.11, suppose that Y 1 , . , Y are i.i.d. Nv(f.L, E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y (Yt, . . . Yn ) is the p(p + 3) /2 canonical exponential family generated by h = l and the p(p + 3)/2 statistics .
n
.
=
,
n
n
i=l
i=I
. . . , Y;p). Show that [ is open and that this family is of rank p(p + 3) /2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = p(p + 3) /2 statistics T; (Y) = Yj, 1 < j < p, and T;1(Y) Y;Yi, 1 < j < l < p,
where Y;
=
(Yil
,
=
92
Statistical Models, Goals, and Performance Criteria
Chapter 1
generate Nv(J.l, E). As E ranges over all p x p symmetric positive definite matrices. so does E-1• Next establish that for symmetric matrices M,
J exp{-uTMu}du < oo iff M is positive definite
by using the spectral decomposition (see B.10. 1 .2)
p
M = L AjejeJ for e 1 , . . . , ep orthogonal, )..J E R.
j=l
To show that the family has full rank m, use induction on p to show that if Z1, . . . , Zv are i.i.d. N(O, and if Bpxp = (b; is symmetric, then
l)
1)
p P I: a;Z; + L b;1 Z;Z1 =
j,l
j., l
c
= P(aTZ + ZTBz = c) = o
a = 0, B = 0, = 0. Next recall (Appendix B.6) that since Y � Np(/', E), then Y = SZ for some nonsingular p x p matrix S. 30. Show that if X 1 , . ,Xn are i.i.d. Nv(8,E0) given 6 where �0 is known, then the Np(A, f) family is conjugate to N (B Eo), where A varies freely in RP and r ranges over c
unless
.
.
.
,
all p x p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(I';, r; ) : j k} be a given collection of pairs with l'i E R, > 0. Let (J.I., be a random pair with >.; = P((!', u = (!';, r; ) , 0 2::��1 >.; = Let 8 be a random variable whose conditional distribution given (JL, u = (P,J, Tj) is normal, N(p,j, rJ). Consider the model X = 8 + t:, where 8 and e are independent and € N(O, a3) , a� known. Note that 8 has the prior density
1< <
.
•
'
•
' •
I
)
)
< >.; < 1,)
Tj
u)
1.
"'
•
k
1r(B) = L A;'Pr, (B - !'; ) j=l where 'l'r denotes the N(O, uon. •
i� ''' ' ' ': ' •
(1 .7.4)
T2) density. Also note that (X I B) has the N(B, u5) distribu-
(a) Find the posterior k
"
I
•
1r(B I x) = LP((tt,u) (!'j, Tj ) I x)1r(B I (!';,r;),x) j=l =
and write it in the fonn k
L >.;(x)'Pr,(x)(B - !'j(x)) j=l
Section
1.7
93
Problems ;3nd Complements
for appropriate .\J (x ) , TJ (x) and fLJ (x ). This shows that ( 1. 7 .4) defines a conjugate prior for the N(B, 176), distribution.
(b) Let Xi = 8 + Ei, l < i <
where 8 is as previously and E t , . . . , En are i.i.d. N(O, 17�). Find the posterior 7r (B I x , , . . . , Xn ) , and show that it belongs to class (1.7 .4). Hint: Consider the sufficient statistic for p(x I B). n,
32. A Hierarchical Binomial-Beta Model. Let { (r1 , sJ) : 1 < j < k} be a given collection of pairs with r; > 0, s1 > 0, let (R, S) be a random pair with P(R = r1 , S = s;) = >.;,
0 < .\1 < 1, E7= l .\1 = 1, and let 8 be a random variable whose conditional density given R = r, S = s is beta, (J(r, s ) . Consider the model in which (X the binomial, B(n1 e), distribution. Note that e has the prior density
1r(B, r, s)
1r(B) =
I B) has
k
L -\;1r(B, r1 , s1 ) .
( I .7.5)
j=l
Find the posterior
k
1r(B I x) = L P(R = r; , S = s; I x)1r(B I (r; , s;),x) j=l
and show that it can be written in the form L: >-1 (x)1r(B,r;(x),s;(x)) for appropriate >-;(x), r;(x) and s;(x). This shows that (1.7.5) defines a class of conjugate priors for the B( n, B) distribution.
33. Let p(x, ry) be a one parameter canonical exponential family generated by T(x) = x and h(x), x E X C R, and let ,P(x) be a nonconstant, nondecreasing function. Show that E,,P(X) is strictly increasing in
Hint:
ry.
Cov,(,P(X), X)
� E{(X - X')i,P(X) - ,P(X')]) where X and X' are independent identically distributed as 34. Let (Xt 1
•
•
•
, Xn) be a stationary Markov chain with two states 0 and
P[Xi = Ei 1 x1 = c1, . . . , Xi - 1 where (i) (ii)
( P10 Poo
Pol Pn
)
=
1.
That is,
Ei-d = P[Xi = ci 1 xi-1 = Ei-d = p.,._l.,.
is the matrix of transition probabilities. Suppose further that ·
Poo = PII = p, so that, Pw = Pol
P[X1
X (see A.l l.12).
= OJ = P(X1
=
1] = �·
= 1 - p.
94
Statistical
Models,
Goals, and
Performance C riteria
Chapter 1
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N00 + N11 where N1j the number of transitions from i to j. For example, 01011 has No1 = 2, N11 = 1, Noo = 0, N10 = I . (b) Show that E(T) = (n - 1)p (by the method of indicators or otherwise). 35. A Conjugate Priorfor the Two-Sample Problem. Suppose that X1 , . . . , Xn and Y1 , . . . , Yn are independent N'(fl1, a2) and N(�J 2 , a2) samples, respectively. Consider the prior 7r for which for some r > 0, k > 0, ro-2 has a x� distribution and given a2, p,1 and fL2 are independent with N(�1 , a2 Ikt) and N(6, a2 I k,) distributions, respectively, where �j E R, kj > 0, j = 1, 2. Show that 1r is a conjugate prior. 36. The inverse Gaussian density, IG(J..t, A), is j(X,Jl., .\)
=
1 [.\I27T] i2x-3i 2 exp{ -.\(x - J1.)2/2J1.2x}, x > 0,
J1.
> 0, .\ > 0.
Show that this is an exponential family generated by T(X) = - ; (X, x- 1 )T and h(x) = (27r) -lf2x- 3/2 (b) Show that the canonical parameters TJt , ry are given by ry1 = fL-2 A, 1J2 = .\, and 2 that A( q1 , 172 ) = (� log(q2 ) + JihijZ) , £ = [O, oo) x (O,oo). (c) Fwd the moment-generating function of T and show that E(X) = Jl., Var(X) Jl.-3 .\, E(x-') = Jl._, + .\_ ,, Var(x-') = (.\Jl.)-' + 2.\ -2. (d) Suppose J1. = Jl.o is known. Show that the gamma family, f(u,,B), is a conjugate pnor. (e) Suppose that .\ = .\0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to fL. That is, n defined in (1.6.19) is empty. (f) Suppose that J1. and .\ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, f! defined in (1.6.19) is empty. 37. Let X1, . . . , Xn be i.i.d. as X � Np(O, �0) where �0 is known. Show that the conjugate prior generated by (1.6.20) is the Afv (q0 , 761) family, where 7Jo varies freely in RP, T6 > 0 and I is the p x p identity matrix. 38, Let X; = (Z;, Y;)T be i.i.d. as X = (Z, Y)T, 1 < i < n, where X has the density of Example 1.6.3. Write the density of X1 , . , Xn as a canonical exponential family and identify T, h, A, and £. Find the expected value and variance of the sufficient statistic. 39. Suppose that Yt , . . . , Yn are independent, Yi N(fLi, a2 ), n > 4. (a) Write the distribution of Y1 , , Yn in canonical exponential family form. Identify T, h, 1), A, and E. (b) Next suppose that fLi depends on the value Zi of some covariate and consider the submodel defined by the map 11 : (11 1 , 112 , II3)T � (p7, a2JT where 11 is determined by /Li = cxp{Bt + Bzzi}, Zt < z2 < · · · < Zn; a 2 = 83 (a)
-
•
'
'P'· ',,' '
.. . ' ..
.
'
H '
......,
, ,, ,.
' ' ,
"
I I' I'
.
.
:
.
•
,I '' '. •
i
I'
··---
------
'
: : •
Section 1.8
95
Notes
where 91 E R, 02 E R, 03 > 0. This model is sometimes used when Jli is restricted to be positive. Show that p(y, 0) as given by ( is a curved exponential family model with l 3.
1. 6.12)
=
40. Suppose Y1 , . . . , Y;.l are independent exponentially, E' ( ,\i), distributed survival times, n > 3. (a) Write the distribution of Y1 , . . . , Yn in canonical exponential family form. Identify T, h, '1. A, and E.
(b) Recall that J..li = E(Yi) = Ai 1 . Suppose /Ji depends on the value Zi of a covariate. Because fti > 0, J.Li is sometimes modeled as
J.Li = cxp{Or + fhzi}, i =
1, . . .
, n
where not all the z's are equal. Show that p(y, fi) as given by nential family model with l =
2.
1.8
(1.6.12) is a curved expo
NOTES
Note for Section 1.1
(1)all dominated For the measure theoreticaHy minded we can assume more generally that the Po are p(x, 9) denotes df;11 , the Radon Nikodym by a a finite measure and derivative.
ft
that
Notes for Section 1,3
� (I) More natural in the sense of measuring the Euclidean distance�between the estimate (} and the "truth" 0. Squared error gives much more weight to those (} that are far away from (} than those close to (}. (2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r. Note for Section 1.4 (I) Source: Hodges, Jr., J. L., D. Kretch, and R. S. Crutchfield. Statlab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975. Notes for Section 1.6 (I) Exponential families arose much earlier in the work of Boltzmann in statistical mechan ics as laws for the distribution of the states of systems of particles-see Feynman ( 1 963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory-see Cover and Thomas ( 1991 ).
(2) The restriction that's x E Rq and that these families be discrete or continuous is artifi cial. In general if J.L is a a finite measure on the sample space X, p(x, (}) as given by
(1.6.1)
'
'
96
Statistical Models, Goals, and Performance Criteria
can be taken to be the density of
X with respect to J-L-see Lehmann (1997), for instance.
Chapter 1
This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.
Note for Section 1.7 ( I ) uT Mu > 0 for all p x
1.9
1 vectors u # 0.
REFERENCES
BERGER, J. 0., Statistical Decision Theory and Bayesian Analysis New York:
Springer, 1985.
BERMAN, S. M., ''A Stochastic Model for the Distribution ofHIV Latency Time Based on T4 Counts," Biometika. 77, 733-74 1 (1990).
BICKEL, P. J., "Using Residuals Robustly 6, 266--291 (1978).
1: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist.
BLACKWELL, D. AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York:
Wiley,
1954.
Box, G. E. P., ''Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discus sion)," J. Royal Statist. Soc. A 143, 383-430 (1979).
BRowN, L., Fundamentals of Statistical Exponential Families with Applications
in Statistical Deci
sion Theory, IMS Lecture Notes-Monograph Series, Hayward, 1986.
CARROLL, R. J. AND D. RuPPERT, Transformation and Weighting in Regression New York:
Chapman
and Hall, 1988.
CoVER, T.
M. AND J. A. THOMAS, Elements of Information Theory New York: Wiley,
1991.
DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1969.
DoKSuM,
K A
AND A. SAMAROV, "Nonparametric Estimation
of Global Functionals and a Measure
of the Explanatory Power of Covariates in Regression,'' Ann. Statist. 23, 1443-1473 ( 1995).
FERGUSON, T. FEYNMAN,
S., Mathematical Statistics New York: Academic Press, 1967.
R. P., The Feynmo.n Lectures on Physics, v. I , R. P. Feynman, R. B. Leighton, and M.
Sands, Eels., Ch. 40 Statistical Mechanics of Physics Reading, MA: Addison-Wesley, 1963.
GRENANDER, U. AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York:
Wi
ley, 1957.
HorXJES, JR., 1. L., D. KRETcH AND R. S. CRUTCHFIELD, Statlab: An Empirical introduction to Statis tics New York: McGraw-Hill, 1975.
KENDALL, M. G. AND A. STUART, The Advanced Theory of Statistics, Vols. II, ffi New York:
Hafner
Publishing Co., 1961, 1966.
L6HMANN, E. L., "A Theory of Some Multiple Decision Problems, 1-25, 547-572 (1957).
L6HMANN, E. L., "Model Specification: Statist. Science 5, 160-168 (1990).
LatMANN, E. L., .. . . .' •
'
I
I and II," Ann. Math Statist. 22,
The Views of Fisher and Neyman, and Later Developments,"
Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
Section 1. 9
97
References
D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
LINDLEY,
MANDEL, J., The Statistical Analysis of Experimental Data
New York: J. Wiley & Sons, 1964.
A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. E. Ahmed and N. Reid. New York: Springer, Lecture Notes in Statistics, 2000.
NORMAND, S-L. AND K.
K., "On the General Theory of Skew Correlation and Nonlinear Regression," Proc. Roy. Soc. London 71, 303 (1 905). (Draper's Research Memoirs, Dulan & Co, Biometrics Series II.)
PEARSON,
RAIFFA,
H. AND R. SCHLAIFFER, Applied Statistical Decision Theory, Division of Research, Graduate
School of Business Administration, Harvard University, Boston, 1961. SAVAGE., L. J., The Foundations of Statistics, J. Wiley & Sons,
New York, 1954.
SAVAGE., L. J. ET AL., The Foundation of Statistical Inference London: SNEDECOR, G. W. AND W. G.
Press, 1989. B. and Hall, 1986.
WErHERILL, G.
AND K.
Methuen & Co., 1962.
COCHRAN, Statistical Methods, 8th Ed. Ames, lA: Iowa State University
D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman
I I
'
'
Chapter 2
METHODS OF ESTIMATION
2.1 2.1.1
BASIC HEURISTICS O F ESTIMATION Minimum Contrast Estimates; Estimating Equations
P E P, usually parametrized as P = Our basic framework is as before, X E X, X {Po : 8 E 8}. In this parametric case, how do we select reasonable estimates for 8 itself? """
-
That is, how do we find a function B(X) of the vector observation X that in some sense "is close" to the unknown 8? The fundamental heuristic is typically the following. We consider a function that we shall call a contrastfunction
p : X x 8 ---> R and define
E8,P(X, 8). As a function of 8, D(80) 9) measures the (population) discrepancy between 8 and the true value 80 of the parameter. In order for p to be a contrast function we require that D(80 1 9) is uniquely minimized for 8 = 60• That is, if P90 were true and we knew D(Bo, 8) as a function of 8, we could obtain 80 as the minimizer. Of course, we don't know the truth so this is inoperable, but in a very weak sense (unbiasedncss), p(X, 8) is an estimate of D(80, 8). So it is natural to consider 8(X) minimizing p(X, 8). This is the most general form of the minimum contrast estimate we shall consider in the next section. Now suppose 8 is Euclidean c Rd, the true 80 is an interior point of 8, and 8 D(Bo, 8) is smooth. Then we expect D(8o, 8)
-----Jo
v8 D(80, 8)
=
o
(2. 1 . 1 )
where V denotes the gradient,
Arguing heuristically again we are led to estimates B that solve
v8p(X, e)
=
o.
(2.1.2)
The equations (2.1.2) define a special form of estimating equations.
99
100
Chapter 2
Methods of Estimation
More generally, suppose we are given a function W and define V(l1o, 11) � Suppose V (80, 8) =
:
XxR d - Rd, W - ('ljl1,
E11, w(X, 11).
.
.
•
, '1/Jd)T
(2. 1 .3)
�
0 has 80 as its unique solution for all 80 E 8. Then we say 8 solving �
w(X,11)
�o
(2.1 .4)
is an estimating equation estimate. Evidently, there is a substantial overlap between the two classes of estimates. Here is an example to be pursued later.
Example 2.1.1. Least Squares. Consider the parametric version of the regression model of Example 1.1.4 with l'(z) = g({3, z), {3 E Rd, where the function g js known. Here the data function n} where Yi , . . . , Yn are independent. A natural are X = {(zi, Yi) : 1 ,$
(I)
i<
p(X, /3) to consider is the squared Euclidean distance between the vector Y of observed , g({3, Zn))T That is, we take Yi and the vector expectation of Y, JL(z) = (g({3, z 1), •
.
.
n
p(X, {3) = IY - �tl2 =
�)Yi - g({3, z;))'.
i=l
(2. 1 .5)
Strictly speaking P is not fully defined here and this is a point we shall explore later. But, for convenience, suppose we postulate that the Ei of Example 1.1.4 are i.i.d. N(O, Then {3 parametrizes the model and we can compute (see Problem 2.1.16),
a5).
E(3 ,P(X, {3)
D (f3o. {3)
n
no-;l + 2)g({30, z,) - g({3 , z, )]2 ,
(2.1.6)
i=l
J
!
which is indeed minimized at {3 = {30 and uniquely so if and only if the parametrization is � identifiable. An estimate {3 that minimizes p(X, {3) exists if g({3, z) is continuous and
' •
lim{)g(fl, z) l : lf31 �
(Problem 2.1.10). The estimate {3 is called the least squares estimate. � If, further, g({3, z) is differentiable in {3, then {3 satisfies the equation alently the system of estimating eq!.lations,
� � 8g � � 8g � ({3, z,)g(/3, z,), ({3, z;)Y; = LL813 i=l 813 i=l J
I'
oo} = oo
3
In the important linear case,
(2.1.2) or equiv(2.1 .7)
I
I
I
d
g({3, zi) = 'L, ziJ,BJ and zi = (Zit, . . . , Zid)T j=l '
• •
.i
•
•
I
•
Section 2.1
Basic Heuristics o f Estimation
101
the system becomes n
d
i=l
=I
I.>'iY; = L k
n
� L z-t;-z-k t
(2.1.8)
i=l
the normal equations. These equations are commonly written in matrix form (2.1 .9)
where Zv = ll ziJIInx d is the design matrix. Least squares, thus, provides a firs t example of both minimum contrast and estimating equation methods. We return to the remark that this estimating method is well defined even if the ci are not i.i.d. a5). In fact, once defined we have a method of computing a statistic {!J from the n}, which can be judged on its merits whatever the true P data X = {(zi 1 Yi)1 governing X is. This very important example is pursued further in Section 2.2 and Chapter 0 6.
N(O,
1
Here is another basic estimating equation example. Example 2.1.2. Method of Moments (MOM). Suppose X1, ,Xn are i.i.d. as X � P8 , 8 E Rd and 8 is identifiable. Suppose that l't ( 8), . . . , I'd (8) are the first moments of the population we are sampling from. Thus, we assume the existence of . • .
d
Define the jth sample moment /11 by,
1 <j
< d.
To apply the method of moments to the problem of estimating 9, we need to be able to express 9 as a continuous function g of the first moments. Thus, suppose
d
8 � (1'1(8), . . , !'d (O)) .
is from Rd to Rd . The method of moments prescribes that we estimate 9 by the solution of
1-1
�
Pi = l'i (O) , 1 :O j
if it exists. The motivation of this simplest estimating equation example is the law of large numbers: For X ,..._, Po, fiJ converges in probability to flJ(fJ). More generally, if we want to estimate a Rk-valued function q(B) of 9, we obtain a MOM estimate of q( 8) by expressing q(8) as a function of any of the first moments 1'1, . . . , I'd of X, say q(O) = h(l'l, . . . , I'd), > k, and then using h(j.t�, . , /id) as the estimate of q(8).
d
d
.
.
Methods of Estimation
102
Chapter 2
For instance, consider a study in which the survival time X is modeled to have a gamma distribution, f(u, A), with density [.\"/r (a)]x"- 1 exp{ -.\x}, x > O; cx > O, .\ > 0. In this case f) � (u, .\), l'l � E(X) � n /.\, and 1'2 � E(X2) � a(l + u) /.\2 Solving
for (} gives
"�
(M•/a)2 ,
.\ � i'•/a2,
=
iii �
(X/<1 )2 ; ), � X !a'
where o-2 p,2 -p,i and &2 = n -lEX[ - X2• In this example, the method of moment es timator is not unique. We can, for instance, express (} as a function of J..t t and /-L3 = E(X3 ) D and obtain a method of moment estimator based on /11 and ji3 (Problem
2.1.11).
Algorithmic issues
We note that, in general, neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn. There are many algorithms for optimization and root finding that can be employed. An algorithm for estimating equations frequently is quick and M used when computation of M(X, ·) = D'I'(X, ·) - J�$' (X, ·JJ dxd J is nonsingular with high probability is the Newton-Raphson algorithm. It is defined by initializing with 90, then setting (2. 1 . 1 0)
.!
• • •
This algorithm and others will be discussed more extensively in Section 2.4 and in Chap ter 6, in particular Problem
6.6.10.
2.1.2
The Plug-In and Extension Principles
We can view the method of moments as an example of what we call the plug-in (or substitu tion) and extension principles, two other basic heuristics particularly applicable in the i.i.d. case. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments. Example 2.1.3. Frequency Plug-in(2) and Extension. Suppose we observe multinomial trials in which the values v1, Vk of the population being sampled are known, but their respective probabilities p1, . . , Pk are completely unknown. If we let X1, , Xn be i.i.d. as X and Ni = number of indices j such that Xi = Vi, then the natural estimate of Pi = P[X = vi] suggested by the law of large numbers is Njn. the proportion of sample values equal to vi. As an illustration consider a population of men whose occupations fall in one of five different job categories, 3, 4, or 5. Here k 5, vi = i, i = . , 5, Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category. Here is some job category data (Mosteller, I 968). .
.
=
1, ..
•
.
,
...
1, 2,
'
I
I
•
'
Section 2.1
103
Basic Heuristics of Estimation
Job Category l
23
2 84
3 289
4 217
5 95
0.03
0.12
0.41
0.31
0.13
I
•
Ni �
Pi
-
n = 2::,�1 Ni
=
Lio)Pi = j
708
for Danish men whose fathers were in category 3, together with the estimates Pi = Next consider the more general problem of estimating a continuous function
Ni fn.
q(p1,
.
•
•
,
Pk) of the population proportions. The frequency plug-in principle simply proposes to re place the unknown population frequencies p1, . . . , Pk by the observable sample frequencies
Ntfn, . . . , Nkjn.
That is, use
(2. 1 . 1 1 ) to estimate q(p 1 ,
. . .
,Pk) ·
For instance, suppose that in the previous job category table,
categories 4 and 5 correspond to blue-collar jobs, whereas categories white-collar jobs. We would be interested
q(p,, · · · ,Ps)
=
2 and 3 correspond to
in estimating
(p4 + Ps) - (P2
+ P3) ,
the difference in the proportions of blue-collar and white-collar workers. If we use the frequency substitution principle, the estimate is
0.44 - 0.53
=
-0.09. Equivalently, let P denote p = (p1, . .
which in our case is
,pk) with Pi
=
=
·v,], I < i < k, and think of this model as P = {all probability distributions P on {v1 , . . . , vk} }. Then q ( p ) can be identified with a parameter v : P � R, that is, v(P) = (p4 + p5) - (p, + p,), and the frequency plug-in principle simply says to replace P = (p,, . . . , Pk) in v(P) by D P = ( �� , . . . , �,. ) the multinomial empirical distribution of X1, , Xn. .
P[X
,
.
Now suppose that the proportions
p1, . . . , Pk
functions of some d-dimensional parameter 8 =
a component of 8 or more generally a function
.
•
do not vary freely but are continuous
(fh, .
q(8) .
. , (}d) and that we want to estimate
.
Many of the models arising in the
analysis of discrete data discussed in Chapter 6 are of this type. Example
2.1.4.
Hardy-Weinberg Equilibrium.
Consider a sample from a population in
genetic equilibrium with respect to a single gene with two alleles. If we assume the three different genotypes are identifiable, we are led to suppose that there are three types of individuals whose frequencies
P1 If Ni
=
82, P2
are
=
given by the so-called
28(1 - 8),
p, = ( I
8) 2 , 0 < 8 < L
(2.LI2)
i in the sample of size n, then (Nt , N2 , N3) parameters (n,p1 ,P2, p3) given by (2.1.12). Suppose
is the number of individuals of type
has a multinomial distribution with
-
Hardy-Weinberg proportions
104
Methods of Estimation
Chapter 2
$1. we can use we want to estimate fJ, the frequency of one of the alleles. Because fJ the principle we have introduced and estimate by JN1 jn. Note, however, that we can also 0 write 0 = I ..fii3 and, thus, 1 - .jNJTi is also a plausible estimate of 0. =
-
In general, suppose that we want to estimate a continuous Rl-valued function q of 8. If p1, . ,Pk are continuous functions of 8, we can usually express q(8) as a continuous function of PI , . . . , Pk, that is, .
.
q(IJ) = h(p,(IJ), . . . ,p.(IJ)), with h defined and continuous on P=
.
(p,, . . ,p.) : p; > O,
(2. 1.13)
k
L p; = 1
j
.
'
i= 1
Given h we can apply the extension principle to estimate q( 8) as,
T(X1, . . . , Xn)
=
h ( Nn, , . . , Nn• ) .
.
.
(2 1 . 14)
As we saw in the Hardy-Weinberg case, the representation (2.1.13) and estimate (2.1.14) are not unique. We shall consider in Chapters 3 (Example 3.4.4) and 5 how to choose among such estimates. We can think of the extension principle alternatively as follows. Let
Po = {Po = (p, (IJ), . . . ,p. (IJ)) : e E e} be a submodel of P. Now q(IJ) can be identified if IJ is identifiable by a parameter v : Po � R given by v(Pe ) = q(O). Then (2.1.13) defines an extension of v from Po to P via v : P --> R where v(P) - h(p) and v(P) = v(P) for P E Po . The plug-in and extension principles can be abstractly stated as follows:
T Plug-in principle. If we have an estimate P of P E P such that P E P and v : P is a parameter. then v(P) is the plug-in estimate of v. In particular, in the i.i.d. case if P is the space of all distributions of X and X�, . . . , Xn are i.i.d. as X f'V P. the empirical distribution P of X given by -
-
-
---+
-
1
n
P[X E A] = n L I (X; E A) -
(2.1 . 15)
. t=l
�
is, by the law of large numbers, a natural estimate of P and v(P) is a plug-in estimate of v(P) in this nonparametric context For instance, if X is real and F(x) = P(X :S x) is the distribution function (d.f.), consider va (P) = i [F- 1 (a) + F.J 1 (a)], where a E (0, I) and
F-1 (a) = inf{x : F(x) > a}, Fu 1 (a) = sup{x : F(x) < a},
-
-
�
-
(2. 1.16)
Section 2.1
105
Basic Heuristics of Estimation
v0 ( P) is the ath population quantile X Here x � = v.!. (P) is called the population median. A natural estimate is the ath sample quantile then
'
a.
(2.1.17) where
F is the empirical d.f. Here X!. �
is called the sample median.
'
For a second example, if X is real and P is the class of distributions with EjXj1 < oo, then the plug-in estimate of the jth moment v(P) = f..t1 = in this nonparametric �
context is the jth sample moment v(P) =
E(XJ )
�
J x;dF(x)
=
.
n- 1 I:� 1 Xf. �
Extension principle. Suppose Po is a submod.el of P and P is an element of P but not necessarily Po and suppose v : Po � T is a parameter. If v : P � T is an extension of v � in the sense that v(P) = v(P) on Po. then v(P) is an extension (and plug-in) estimate of v(P). With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plug-in estimates for multinomial trials because
1';(8)
k
=
L vfp,(8) = h(p(ll)) i=l
where
h(p)
=
=
v(Po)
k
L v!P< = v(P) ,
i =l
�
and P is the empirical distribution. This reasoning extends to the general i.i.d. case (Prob lem 2.1.12) and to more general method of moment estimates (Problem 2.1.13). As stated, these principles are general. However, they are mainly applied in the i.i.d. case--but see Problem 2.1.14.
Remark 2.1.1. The plug-in and extension principles are used when Pe, v, and v are contin uous. For instance, in the multinomial examples 2.1.3 and 2.1.4, Pe as given by the Hardy Weinberg p(O), is a continuous map from e = [0, 1] to P, v(Pe ) = q(8) = h(p(8)) is a continuous map from 8 to R and v( P) = h(p) is a continuous map from P to R. Remark 2.1.2. The plug-in and extension principles must be calibrated with the target parameter. For instance, let Po be the class of distributions of X = B + �: where B E R
and the distribution of t: ranges over the class of symmetric distributions with mean zero. Let v(P) be the mean of let P be the class of distributions of = (I + < where B E R and the distribution off ranges over the class of distributions with mean zero. In this case both v1(P) = Ep(X) and v2(P) = "median of P" satisfy v(P) = v(P), P E Po,
X and
but only
�
v1(P)
-
=
X is a sensible� estimate of v(P), P
symmetric, the sample median
X
¢ P0, because when P is not
v2(P) does not converge in probability to Ep(X).
106
Methods of Estimation
Chapter 2
Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.
Example 2.1.5. Suppose that X_1, ..., X_n is a N(μ, σ²) sample as in Example 1.1.2 with assumptions (1)-(4) holding. The method of moments estimates of μ and σ² are X̄ and σ̂² = n^{-1} Σ_{i=1}^n (X_i - X̄)². □
Example 2.1.6. Suppose X_1, ..., X_n are the indicators of a set of Bernoulli trials with probability of success θ. Because μ_1(θ) = θ, the method of moments leads to the natural estimate of θ, X̄, the frequency of successes. To estimate the population variance θ(1 - θ) we are led by the first moment to the estimate X̄(1 - X̄). Because we are dealing with (unrestricted) Bernoulli trials, these are the frequency plug-in (substitution) estimates (see Problem 2.1.1). □
Example 2.1.7. Estimating the Size of a Population (continued). In Example 1.5.3, where X_1, ..., X_n are i.i.d. U{1, 2, ..., θ}, we find μ = E_θ(X_1) = (θ + 1)/2. Thus, θ = 2μ - 1 and 2X̄ - 1 is a method of moments estimate of θ. This is clearly a foolish estimate if X_(n) = max X_i > 2X̄ - 1 because in this model θ is always at least as large as X_(n). □
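A quick simulation makes the point of Example 2.1.7 concrete (illustrative Python; the uniform-on-{1,...,θ} model and the comparison are the example's, but the particular θ, sample size, and number of replications are made up).

```python
import random

def fraction_foolish(theta=100, n=20, reps=2000, seed=0):
    """Estimate how often 2*Xbar - 1 falls below the observed maximum X_(n)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(reps):
        xs = [rng.randint(1, theta) for _ in range(n)]
        mom = 2 * sum(xs) / n - 1      # method of moments estimate
        if max(xs) > mom:              # estimate is smaller than an observed value
            bad += 1
    return bad / reps

print(fraction_foolish())
```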
As we have seen, there are often several method of moments estimates for the same q(θ). For example, if we are sampling from a Poisson population with parameter θ, then θ is both the population mean and the population variance. The method of moments can lead to either the sample mean or the sample variance. Moreover, because p_0 = P(X = 0) = exp(-θ), a frequency plug-in estimate of θ is -log p̂_0, where p̂_0 is n^{-1}[#X_i = 0]. We will make a selection among such procedures in Chapter 3.

Remark 2.1.3. What are the good points of the method of moments and frequency plug-in?
(a) They generally lead to procedures that are easy to compute and are, therefore, valuable as preliminary estimates in algorithms that search for more efficient estimates. See Section 2.4.

(b) If the sample size is large, these estimates are likely to be close to the value estimated (consistency). This minimal property is discussed in Section 5.2.

It does turn out that there are "best" frequency plug-in estimates, those obtained by the method of maximum likelihood, a special type of minimum contrast and estimating equation method. Unfortunately, as we shall see in Section 2.4, they are often difficult to compute. Algorithms for their computation will be introduced in Section 2.4.
Discussion. When we consider optimality principles, we may arrive at different types of estimates than those discussed in this section. For instance, as we shall see in Chapter 3, estimation of θ real with quadratic loss and Bayes priors lead to procedures that are data-weighted averages of θ values rather than minimizers of functions ρ(θ, X). Plug-in is not the optimal way to go for the Bayes, minimax, or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. However, a saving grace becomes apparent in Chapters 5 and 6. If the model fits, for large amounts of data, optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions, the plug-in principle is justified, and there are best extensions.
Summary. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. For the model {P_θ : θ ∈ Θ} a contrast ρ is a function from X × Θ to R such that the discrepancy

D(θ_0, θ) = E_{θ_0} ρ(X, θ), θ ∈ Θ ⊂ R^d,

is uniquely minimized at the true value θ = θ_0 of the parameter. A minimum contrast estimator is a minimizer of ρ(X, θ), and the contrast estimating equations are ∇_θ ρ(X, θ) = 0. For data {(z_i, Y_i) : 1 ≤ i ≤ n} with Y_i independent and E(Y_i) = g(β, z_i), 1 ≤ i ≤ n, where g is a known function and β ∈ R^d is a vector of unknown regression coefficients, a least squares estimate of β is a minimizer of

ρ(X, β) = Σ_{i=1}^n [Y_i - g(β, z_i)]².

For this contrast, when g(β, z) = z^T β, the associated estimating equations are called the normal equations and are given by Z_D^T Y = Z_D^T Z_D β, where Z_D = ||z_ij||_{n×d} is called the design matrix.

Suppose X ~ P. The plug-in estimate (PIE) for a vector parameter ν = ν(P) is obtained by setting ν̂ = ν(P̂) where P̂ is an estimate of P. When P̂ is the empirical probability distribution P̂_E defined by P̂_E(A) = n^{-1} Σ_{i=1}^n 1[X_i ∈ A], then ν̂ is called the empirical PIE. If P = P_θ, θ ∈ Θ, is parametric and a vector q(θ) is to be estimated, we find a parameter ν such that ν(P_θ) = q(θ) and call ν(P̂) a plug-in estimator of q(θ). Method of moment estimates are empirical PIEs based on ν(P) = (μ_1, ..., μ_d)^T where μ_j = E(X^j), 1 ≤ j ≤ d. In the multinomial case the frequency plug-in estimators are empirical PIEs based on ν(P) = (p_1, ..., p_k), where p_j is the probability of the jth category, 1 ≤ j ≤ k. Let P_0 and P be two statistical models for X with P_0 ⊂ P. An extension ν̄ of ν from P_0 to P is a parameter satisfying ν̄(P) = ν(P), P ∈ P_0. If P̂ is an estimate of P with P̂ ∈ P, ν̄(P̂) is called the extension plug-in estimate of ν(P). The general principles are shown to be related to each other.
2.2 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS

2.2.1 Least Squares and Weighted Least Squares
Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. It is of great importance in many areas of statistics such as the analysis of variance and regression theory. In this section we shall introduce the approach and give a few examples, leaving detailed development to Chapter 6.
In Example 2.1.1 we considered the nonlinear (and linear) Gaussian model P_0 given by

Y_i = g(β, z_i) + ε_i, 1 ≤ i ≤ n,    (2.2.1)

where the ε_i are i.i.d. N(0, σ_0²) and β ranges over R^d or an open subset. The contrast

ρ(X, β) = Σ_{i=1}^n [Y_i - g(β, z_i)]²

led to the least squares estimates (LSEs) β̂ of β. Suppose that we enlarge P_0 to P where we retain the independence of the Y_i but only require

μ(z_i) = E(Y_i) = g(β, z_i),  E(ε_i) = 0.    (2.2.2)

Are least squares estimates still reasonable? For P in the semiparametric model P, which satisfies (but is not fully specified by) (2.2.2), we can still compute
D_P(β_0, β) = E_P ρ(X, β) = Σ_{i=1}^n Var_P(ε_i) + Σ_{i=1}^n [g(β_0, z_i) - g(β, z_i)]²,    (2.2.3)

which is again minimized as a function of β by β = β_0, and uniquely so if the map β → (g(β, z_1), ..., g(β, z_n))^T is 1-1. The estimates continue to be reasonable under the Gauss-Markov assumptions,
E(ε_i) = 0, 1 ≤ i ≤ n,    (2.2.4)
Var(ε_i) = σ² > 0, 1 ≤ i ≤ n,    (2.2.5)
Cov(ε_i, ε_j) = 0, 1 ≤ i < j ≤ n,    (2.2.6)
because (2.2.3) continues to be valid. Note that the joint distribution H of (ε_1, ..., ε_n) is any distribution satisfying the Gauss-Markov assumptions. That is, the model is semiparametric with β, σ² and H unknown. The least squares method of estimation applies only to the parameters β_1, ..., β_d and is often applied in situations in which specification of the model beyond (2.2.4)-(2.2.6) is difficult.

Sometimes z can be viewed as the realization of a population variable Z, that is, (Z_i, Y_i), 1 ≤ i ≤ n, is modeled as a sample from a joint distribution. This is frequently the case for studies in the social and biological sciences. For instance, Z could be educational level and Y income, or Z could be height and Y log weight. Then we can write the conditional model of Y given Z_j = z_j, 1 ≤ j ≤ n, as in (a) of Example 1.1.4 with ε_j simply defined as Y_j - E(Y_j | Z_j = z_j). If we consider this model, (Z_i, Y_i) i.i.d. as (Z, Y) ~ P ∈ P = {all joint distributions of (Z, Y) such that E(Y | Z = z) = g(β, z), β ∈ R^d}, and β → (g(β, z_1), ..., g(β, z_n))^T is 1-1, then β has an interpretation as a parameter on P, that is, β = β(P) is the minimizer of E(Y - g(β, Z))². This follows
from Theorem 1.4.1. In this case we recognize the LSE β̂ as simply being the usual plug-in estimate β(P̂), where P̂ is the empirical distribution assigning mass n^{-1} to each of the n pairs (Z_i, Y_i).

As we noted in Example 2.1.1, the most commonly used g in these models is g(β, z) = z^T β, which, in conjunction with (2.2.1), (2.2.4), (2.2.5) and (2.2.6), is called the linear (multiple) regression model. For the data {(z_i, y_i); i = 1, ..., n} we write this model in matrix form as

Y = Z_D β + ε

where Z_D = ||z_ij|| is the design matrix. We continue our discussion for this important special case for which explicit formulae and theory have been derived. For nonlinear cases we can use numerical methods to solve the estimating equations (2.1.7). See also Problem 2.2.41, Sections 6.4.3 and 6.5, and Seber and Wild (1989). The linear model is often the default model for a number of reasons:

(1) If the range of the z's is relatively small and μ(z) is smooth, we can approximate μ(z) by
μ(z) ≈ μ(z_0) + Σ_{j=1}^d (∂μ/∂z_j)(z_0)(z_j - z_{0j})

for z_0 an interior point of the domain. We can then treat μ(z_0) - Σ_{j=1}^d (∂μ/∂z_j)(z_0) z_{0j} as an unknown β_0 and identify (∂μ/∂z_j)(z_0) with β_j to give an approximate (d + 1)-dimensional linear model with z_0 ≡ 1 and z_j as before and

μ(z) = Σ_{j=0}^d β_j z_j.
This type of approximation is the basis for nonlinear regression analysis based on local polynomials; see Ruppert and Wand (1994), Fan and Gijbels (1996), and Volume II.

(2) If, as we discussed earlier, we are in a situation in which it is plausible to assume that (Z_i, Y_i) are a sample from a (d + 1)-dimensional distribution and the covariates that are the coordinates of Z are continuous, a further modeling step is often taken and it is assumed that (Z_1, ..., Z_d, Y)^T has a nondegenerate multivariate Gaussian distribution N_{d+1}(μ, Σ). In that case, as we have seen in Section 1.4, E(Y | Z = z) can be written as

μ(z) = β_0 + Σ_{j=1}^d β_j z_j

where

β_0 = μ_Y - Σ_{j=1}^d β_j μ_j,  (β_1, ..., β_d)^T = Σ_{ZZ}^{-1} Σ_{ZY}.    (2.2.9)
Furthermore, ε = Y - μ(Z) is independent of Z and has a N(0, σ²) distribution where

σ² = σ_YY - Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZY}.

Therefore, given Z_i = z_i, 1 ≤ i ≤ n, we have a Gaussian linear regression model for Y_i.
Estimation of β in the linear regression model. We have already argued in Example 2.1.1 that, if the parametrization β → Z_D β is identifiable, the least squares estimate β̂ exists and is unique and satisfies the normal equations (2.1.8). The parametrization is identifiable if and only if Z_D is of rank d or, equivalently, if Z_D^T Z_D is of full rank d; see Problem 2.2.25. In that case, necessarily, the solution of the normal equations can be given "explicitly" by

β̂ = (Z_D^T Z_D)^{-1} Z_D^T Y.    (2.2.10)
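In matrix form, (2.2.10) translates directly into code. The sketch below (plain Python with NumPy, not part of the text; the straight-line data are made up) forms and solves the normal equations; numpy.linalg.lstsq computes the same estimate more stably.

```python
import numpy as np

def least_squares(ZD, Y):
    """Solve the normal equations ZD' ZD beta = ZD' Y, assuming ZD has full rank d."""
    return np.linalg.solve(ZD.T @ ZD, ZD.T @ Y)

# toy illustration: straight-line fit with columns (1, z)
z = np.array([0.0, 1.0, 2.0, 3.0])
ZD = np.column_stack([np.ones_like(z), z])
Y = np.array([1.1, 2.9, 5.2, 6.8])
print(least_squares(ZD, Y))                      # intercept and slope
print(np.linalg.lstsq(ZD, Y, rcond=None)[0])     # same estimate via lstsq
```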
Here are some examples.
Example 2.2.1. In the measurement model in which Y_i is the determination of a constant β_1, d = 1, g(z, β_1) = β_1 and the normal equation is Σ_{i=1}^n (y_i - β_1) = 0, whose solution is β̂_1 = (1/n) Σ_{i=1}^n y_i = ȳ. □
Example 2.2.2. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. For certain chemicals and plants, the relationship between z and y can be approximated well by a linear equation y = β_1 + β_2 z provided z is restricted to a reasonably small interval. If we run several experiments with the same z using plants and soils that are as nearly identical as possible, we will find that the values of y will not be the same. For this reason, we assume that for a given z, Y is random with a distribution P(y | z). Following are the results of an experiment to which a regression model can be applied (Snedecor and Cochran, 1967, p. 139). Nine samples of soil were treated with different amounts z of phosphorus. Y is the amount of phosphorus found in corn plants grown for 38 days in the different samples of soil.

z_i:  1   4   5   9  11  13  23  23  28
y_i: 64  71  54  81  76  93  77  95 109
The points (z_i, y_i) and an estimate of the line β_1 + β_2 z are plotted in Figure 2.2.1. We want to estimate β_1 and β_2. The normal equations are

Σ_{i=1}^n (y_i - β_1 - β_2 z_i) = 0,  Σ_{i=1}^n z_i(y_i - β_1 - β_2 z_i) = 0.    (2.2.11)

When the z_i's are not all equal, we get the solutions

β̂_2 = Σ_{i=1}^n (z_i - z̄)(y_i - ȳ) / Σ_{i=1}^n (z_i - z̄)²    (2.2.12)
[Figure 2.2.1. Scatter plot {(z_i, y_i); i = 1, ..., n} and sample regression line for the phosphorus data. ε̂_5 is the residual for (z_5, y_5).]
and

β̂_1 = ȳ - β̂_2 z̄    (2.2.13)

where z̄ = (1/n) Σ_{i=1}^n z_i and ȳ = (1/n) Σ_{i=1}^n y_i. The line y = β̂_1 + β̂_2 z is known as the sample regression line or line of best fit of y_1, ..., y_n on z_1, ..., z_n. Geometrically, if we measure the distance between a point (z_i, y_i) and a line y = a + bz vertically by d_i = |y_i - (a + bz_i)|, then the regression line minimizes the sum of the squared distances to the n points (z_1, y_1), ..., (z_n, y_n). The vertical distances ε̂_i = [y_i - (β̂_1 + β̂_2 z_i)], i = 1, ..., n, are called the residuals of the fit. The line y = β̂_1 + β̂_2 z is an estimate of the best linear MSPE predictor a_1 + b_1 z of Theorem 1.4.3. This connection to prediction explains the use of vertical distances in regression. The regression line for the phosphorus data is given in Figure 2.2.1. Here β̂_1 = 61.58 and β̂_2 = 1.42. □
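The least squares line for the phosphorus data can be reproduced in a few lines of code (plain Python sketch; the data are those of Example 2.2.2).

```python
z = [1, 4, 5, 9, 11, 13, 23, 23, 28]
y = [64, 71, 54, 81, 76, 93, 77, 95, 109]
n = len(z)
zbar, ybar = sum(z) / n, sum(y) / n
b2 = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / \
     sum((zi - zbar) ** 2 for zi in z)      # slope, formula (2.2.12)
b1 = ybar - b2 * zbar                        # intercept, formula (2.2.13)
print(round(b1, 2), round(b2, 2))            # 61.58 1.42
```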
Remark 2.2.1. The linear regression model is considerably more general than appears at first sight. For instance, suppose we select p real-valued functions of z, g_1, ..., g_p, p ≥ d, and postulate that μ(z) is a linear combination of g_1(z), ..., g_p(z); that is,

μ(z) = Σ_{j=1}^p θ_j g_j(z).

Then we are still dealing with a linear regression model because we can define w_{p×1} = (g_1(z), ..., g_p(z))^T as our covariate and consider the linear model

Y_i = Σ_{j=1}^p θ_j w_ij + ε_i, 1 ≤ i ≤ n,

where w_ij = g_j(z_i). For instance, if d = 1 and we take g_j(z) = z^j, 0 ≤ j ≤ 2, we arrive at quadratic regression, Y_i = θ_0 + θ_1 z_i + θ_2 z_i² + ε_i; see Problem 2.2.24 for more on polynomial regression.
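Remark 2.2.1 amounts to building a new design matrix whose columns are g_1(z), ..., g_p(z). A minimal sketch (illustrative Python/NumPy, not from the text; the basis and data are made up) for quadratic regression:

```python
import numpy as np

def basis_design(z, funcs):
    """Columns are g_j(z_i); with funcs = (1, z, z^2) this gives quadratic regression."""
    return np.column_stack([f(z) for f in funcs])

z = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = 1.0 + 2.0 * z - 0.5 * z ** 2 + np.array([0.05, -0.02, 0.03, -0.04, 0.01, 0.02])
W = basis_design(z, (lambda t: np.ones_like(t), lambda t: t, lambda t: t ** 2))
theta_hat = np.linalg.lstsq(W, y, rcond=None)[0]
print(theta_hat)   # close to (1, 2, -0.5)
```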
Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. We return to this in Volume II.
Weighted least squares. In Example 2.2.2 and many similar situations it may not be reasonable to assume that the variances of the errors ε_i are the same for all levels z_i of the covariate variable. However, we may be able to characterize the dependence of Var(ε_i) on z_i at least up to a multiplicative constant. That is, we can write

Var(ε_i) = w_i σ², 1 ≤ i ≤ n,    (2.2.14)

where σ² is unknown as before, but the w_i are known weights. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). The method of least squares may not be appropriate because (2.2.5) fails. Note that the variables

Ỹ_i = Y_i/√w_i = g(β, z_i)/√w_i + ε_i/√w_i, 1 ≤ i ≤ n,

are sufficient for the Y_i and that Var(ε_i/√w_i) = σ². Thus, if we set g̃(β, z_i) = g(β, z_i)/√w_i, ε̃_i = ε_i/√w_i, and

Ỹ_i = g̃(β, z_i) + ε̃_i, 1 ≤ i ≤ n,    (2.2.15)

then the Ỹ_i satisfy the assumption (2.2.5). The weighted least squares estimate of β is now the value β̂ which for given ỹ_i = y_i/√w_i minimizes

Σ_{i=1}^n [ỹ_i - g̃(β, z_i)]² = Σ_{i=1}^n w_i^{-1}[y_i - g(β, z_i)]²    (2.2.16)

as a function of β.
Example 2.2.3. Weighted Linear Regression. Consider the case in which d = 2, z_i1 = 1, z_i2 = z_i, and g(β, z_i) = β_1 + β_2 z_i, i = 1, ..., n. We need to find the values β̂_1 and β̂_2 of β_1 and β_2 that minimize

Σ_{i=1}^n v_i[y_i - (β_1 + β_2 z_i)]²    (2.2.17)
where v_i = 1/w_i. This problem may be solved by setting up analogues to the normal equations (2.1.8). We can also use the results on prediction in Section 1.4 as follows. Let (Z*, Y*) denote a pair of discrete random variables with possible values (z_1, y_1), ..., (z_n, y_n) and probability distribution given by

P[(Z*, Y*) = (z_i, y_i)] = u_i, i = 1, ..., n,

where

u_i = v_i / Σ_{j=1}^n v_j, i = 1, ..., n.

If μ_L(Z*) = β_1 + β_2 Z* denotes a linear predictor of Y* based on Z*, then its MSPE is given by

E[Y* - μ_L(Z*)]² = Σ_{i=1}^n u_i[y_i - (β_1 + β_2 z_i)]².

It follows that the problem of minimizing (2.2.17) is equivalent to finding the best linear MSPE predictor of Y*. Thus, using Theorem 1.4.3,

β̂_2 = Cov(Z*, Y*)/Var(Z*) = [Σ_{i=1}^n u_i z_i y_i - (Σ_{i=1}^n u_i y_i)(Σ_{i=1}^n u_i z_i)] / [Σ_{i=1}^n u_i z_i² - (Σ_{i=1}^n u_i z_i)²]    (2.2.18)

and

β̂_1 = E(Y*) - β̂_2 E(Z*) = Σ_{i=1}^n u_i y_i - β̂_2 Σ_{i=1}^n u_i z_i.

This computation suggests, as we make precise in Problem 2.2.26, that weighted least squares estimates are also plug-in estimates. □
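The prediction argument gives closed-form weighted estimates directly from the u_i. The sketch below is illustrative Python (the data and weights are made up); it evaluates (2.2.18) and the companion intercept formula.

```python
def weighted_line(z, y, w):
    """Weighted least squares fit of y = b1 + b2*z when Var(eps_i) = w_i * sigma^2."""
    v = [1.0 / wi for wi in w]                   # v_i = 1/w_i
    total = sum(v)
    u = [vi / total for vi in v]                 # u_i = v_i / sum_j v_j
    Ez = sum(ui * zi for ui, zi in zip(u, z))
    Ey = sum(ui * yi for ui, yi in zip(u, y))
    Ezy = sum(ui * zi * yi for ui, zi, yi in zip(u, z, y))
    Ezz = sum(ui * zi * zi for ui, zi in zip(u, z))
    b2 = (Ezy - Ey * Ez) / (Ezz - Ez ** 2)       # (2.2.18)
    b1 = Ey - b2 * Ez
    return b1, b2

print(weighted_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1], [1, 1, 4, 4]))
```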
Next consider finding the β that minimizes (2.2.16) for g(β, z_i) = z_i^T β and for general d. By following the steps of Example 2.2.1 leading to (2.1.7), we find (Problem 2.2.27) that β̂ satisfies the weighted least squares normal equations

Z_D^T W^{-1} Y = Z_D^T W^{-1} Z_D β    (2.2.19)

where W = diag(w_1, ..., w_n) and Z_D = ||z_ij||_{n×d} is the design matrix. When Z_D has rank d and w_i > 0, 1 ≤ i ≤ n, we can write

β̂ = (Z_D^T W^{-1} Z_D)^{-1} Z_D^T W^{-1} Y.    (2.2.20)
Remark 2.2.2. More generally, we may allow for correlation between the errors {ε_i}. That is, suppose Var(ε) = σ² W for some invertible matrix W_{n×n}. Then it can be shown (Problem 2.2.28) that the model Y = Z_D β + ε can be transformed to one satisfying (2.2.1) and (2.2.4)-(2.2.6). Moreover, when g(β, z) = z^T β, the β̂ minimizing the least squares contrast in this transformed model is given by (2.2.19) and (2.2.20).
Remark 2.2.3. Here are some applications of weighted least squares: When the ith response Y_i is an average of n_i equally variable observations, then Var(Y_i) = σ²/n_i and w_i = n_i^{-1}. If Y_i is the sum of n_i equally variable observations, then w_i = n_i. If the variance of Y_i is proportional to some covariate, say z_1, then Var(Y_i) = z_i1 σ² and w_i = z_i1. In time series and repeated measures, a covariance structure is often specified for ε (see Problems 2.2.29 and 2.2.42).

2.2.2 Maximum Likelihood
The method of maximum likelihood was first proposed by the German mathematician C. F. Gauss in 1821. However, the approach is usually credited to the English statistician R. A. Fisher (1922) who rediscovered the idea and first investigated the properties of the method. In the form we shall give, this approach makes sense only in regular parametric models. Suppose that p(x, θ) is the frequency or density function of X if θ is true and that Θ is a subset of d-dimensional space. Recall L_x(θ), the likelihood function of θ, defined in Section 1.5, which is just p(x, θ) considered as a function of θ for fixed x. Thus, if X is discrete, then for each θ, L_x(θ) gives the probability of observing x. If Θ is finite and π is the uniform prior distribution on Θ, then the posterior probability of θ given X = x satisfies π(θ | x) ∝ L_x(θ), where the proportionality is up to a function of x. Thus, we can think of L_x(θ) as a measure of how "likely" θ is to have produced the observed x. A similar interpretation applies to the continuous case (see A.7.10). The method of maximum likelihood consists of finding that value θ̂(x) of the parameter that is "most likely" to have produced the data. That is, if X = x, we seek θ̂(x) that satisfies

L_x(θ̂(x)) = p(x, θ̂(x)) = max{p(x, θ) : θ ∈ Θ} = max{L_x(θ) : θ ∈ Θ}.
By our previous remarks, if Θ is finite and π is uniform, or, more generally, the prior density π on Θ is constant, such a θ̂(x) is a mode of the posterior distribution. If such a θ̂ exists, we estimate any function q(θ) by q(θ̂(x)). The estimate q(θ̂(x)) is called the maximum likelihood estimate (MLE) of q(θ). This definition of the MLE of q(θ) is consistent. That is, suppose q is 1-1 from Θ to Ω; set ω = q(θ) and write the density of X as p_0(x, ω) = p(x, q^{-1}(ω)). If ω̂ maximizes p_0(x, ω), then ω̂ = q(θ̂) (Problem 2.2.16(a)). If q is not 1-1, the MLE of ω = q(θ) is still q(θ̂) (Problem 2.2.16(b)).

Here is a simple numerical example. Suppose θ = 0 or 1/2 and p(x, θ) is given by the following table.

x\θ     0      1/2
 1      0      0.10
 2      1.0    0.90

Then θ̂(1) = 1/2, θ̂(2) = 0. If X = 1, the only reasonable estimate of θ is 1/2 because the value 1 could not have been produced when θ = 0.
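For a finite parameter set the definition can be applied literally: tabulate p(x, θ) and take the argmax. A tiny Python illustration using the table above:

```python
# p[x][theta] for the two-point parameter set {0, 1/2}
p = {1: {0.0: 0.0, 0.5: 0.10},
     2: {0.0: 1.0, 0.5: 0.90}}

def mle(x):
    """theta maximizing the likelihood for the observed x."""
    return max(p[x], key=p[x].get)

print(mle(1), mle(2))   # 0.5 0.0
```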
Maximum likelihood estimates need neither exist nor be unique (Problems 2.2.14 and 2.2.13). In the rest of this section we identify them as of minimum contrast and estimating equation type, relate them to the plug-in and extension principles and some notions in information theory, and compute them in some important special cases in which they exist, are unique, and are expressible in closed form. In the rest of this chapter we study more detailed conditions for existence and uniqueness and algorithms for calculation of MLEs when closed forms are not available. When θ is real, MLEs can often be obtained by inspection, as we see in a pair of important examples.
Example 2.2.4. The Normal Distribution with Known Variance. Suppose X ~ N(θ, σ²), where σ² is known, and let
> 0, μ_1, μ_2 ∈ R and φ_σ(s) = σ^{-1} φ(s/σ). It is not obvious that this falls under our scheme but let

X_i = (Δ_i, S_i), 1 ≤ i ≤ n,    (2.4.6)

where the Δ_i are independent identically distributed with P_θ[Δ_i = 1] = λ = 1 - P_θ[Δ_i = 0]. Suppose that given Δ = (Δ_1, ..., Δ_n), the S_i are independent with

L_θ(S_i | Δ) = L_θ(S_i | Δ_i) = N(Δ_i μ_1 + (1 - Δ_i) μ_2, Δ_i σ_1² + (1 - Δ_i) σ_2²).

That is, Δ_i tells us whether to sample from N(μ_1, σ_1²) or N(μ_2, σ_2²). It is easy to see (Problem 2.4.11) that under θ, S has the marginal distribution given previously. Thus, we can think of S as S(X) where X is given by (2.4.6). This five-parameter model is very rich, permitting up to two modes and scales. The log likelihood similarly can have a number of local maxima and can tend to ∞ as θ tends to the boundary of the parameter space (Problem 2.4.12). Although MLEs do not exist in these models, a local maximum close to the true θ_0 turns out to be a good "proxy" for the nonexistent MLE. The EM algorithm can lead to such a local maximum. □
Let

J(θ | θ_0) = E_{θ_0}( log [p(X, θ)/p(X, θ_0)] | S(X) = s )    (2.4.7)

where we suppress dependence on s. Initialize with θ_old = θ_0. The first (E) step of the algorithm is to compute J(θ | θ_old) for as many values of θ as needed. If this is difficult, the EM algorithm is probably not suitable. The second (M) step is to maximize J(θ | θ_old) as a function of θ. Again, if this step is difficult, EM is not particularly appropriate. Then we set θ_new = arg max J(θ | θ_old), reset θ_old = θ_new and repeat the process. As we shall see in important situations, including the examples we have given, the M step is easy and the E step doable. The rationale behind the algorithm lies in the following formulas, which we give for θ real and which can be justified easily in the case that X is finite (Problem 2.4.12):
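For the Gaussian mixture model just described, both steps are available in closed form and the E/M alternation is a short loop. The sketch below is illustrative Python, not part of the text; for brevity it fixes σ_1 = σ_2 = 1 (an assumption) and uses made-up data, updating only (λ, μ_1, μ_2) from the posterior membership probabilities.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def em_mixture(s, lam, mu1, mu2, iters=50):
    """EM for s_i ~ lam*N(mu1,1) + (1-lam)*N(mu2,1), unit variances assumed."""
    for _ in range(iters):
        # E-step: posterior probability that Delta_i = 1 given S_i = s_i
        g = [lam * phi(x - mu1) /
             (lam * phi(x - mu1) + (1 - lam) * phi(x - mu2)) for x in s]
        # M-step: maximize J(theta | theta_old)
        lam = sum(g) / len(s)
        mu1 = sum(gi * x for gi, x in zip(g, s)) / sum(g)
        mu2 = sum((1 - gi) * x for gi, x in zip(g, s)) / sum(1 - gi for gi in g)
    return lam, mu1, mu2

s = [-2.1, -1.8, -2.4, -1.9, 1.7, 2.2, 2.5, 1.9, 2.1]
print(em_mixture(s, lam=0.5, mu1=-1.0, mu2=1.0))
```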
q(s, θ)/q(s, θ_0) = E_{θ_0}( p(X, θ)/p(X, θ_0) | S(X) = s )    (2.4.8)

and

(∂/∂θ) log q(s, θ) |_{θ=θ_0} = E_{θ_0}( (∂/∂θ) log p(X, θ) | S(X) = s ) |_{θ=θ_0}    (2.4.9)

for all θ_0 (under suitable regularity conditions). Note that (2.4.9) follows from (2.4.8) by taking logs in (2.4.8), differentiating, and exchanging E_{θ_0} and differentiation with respect to θ at θ_0. Because, formally,

∂J(θ | θ_0)/∂θ = E_{θ_0}( (∂/∂θ) log p(X, θ) | S(X) = s )    (2.4.10)

and, hence,

∂J(θ | θ_0)/∂θ |_{θ=θ_0} = (∂/∂θ) log q(s, θ) |_{θ=θ_0},    (2.4.11)

it follows that a fixed point θ̂ of the algorithm satisfies the likelihood equation

(∂/∂θ) log q(s, θ) = 0.    (2.4.12)
The main reason the algorithm behaves well follows.
Lemma 2.4.1. If θ_new, θ_old are as defined earlier and S(X) = s, then

q(s, θ_new) ≥ q(s, θ_old).    (2.4.13)

Equality holds in (2.4.13) iff the conditional distribution of X given S(X) = s is the same for θ_new as for θ_old and θ_new maximizes J(θ | θ_old).

Proof. We give the proof in the discrete case. However, the result holds whenever the quantities in J(θ | θ_0) can be defined in a reasonable fashion. In the discrete case we appeal to the product rule. For x ∈ X, S(x) = s,

p(x, θ) = q(s, θ) r(x | s, θ)    (2.4.14)

where r(· | ·, θ) is the conditional frequency function of X given S(X) = s. Then

J(θ | θ_0) = log [q(s, θ)/q(s, θ_0)] + E_{θ_0}{ log [r(X | s, θ)/r(X | s, θ_0)] | S(X) = s }.    (2.4.15)

If θ_0 = θ_old, θ = θ_new,

J(θ_new | θ_old) = log [q(s, θ_new)/q(s, θ_old)] + E_{θ_old}{ log [r(X | s, θ_new)/r(X | s, θ_old)] | S(X) = s }.    (2.4.16)

Now, J(θ_new | θ_old) ≥ J(θ_old | θ_old) = 0 by definition of θ_new. On the other hand,

-E_{θ_old}{ log [r(X | s, θ_new)/r(X | s, θ_old)] | S(X) = s } ≥ 0    (2.4.17)

by Shannon's inequality, Lemma 2.2.1. □
The most important and revealing special case of this lemma follows.

Theorem 2.4.3. Suppose {P_θ : θ ∈ Θ} is a canonical exponential family generated by (T, h) satisfying the conditions of Theorem 2.3.1. Let S(X) be any statistic; then
(a) The EM algorithm consists of the alternation

Ȧ(θ_new) = E_{θ_old}(T(X) | S(X) = s),    (2.4.18)
θ_old = θ_new.    (2.4.19)

If a solution of (2.4.18) exists, it is necessarily unique.

(b) If the sequence of iterates {θ̂_m} so obtained is bounded and the equation

Ȧ(θ) = E_θ(T(X) | S(X) = s)    (2.4.20)

has a unique solution, then it converges to a limit θ̂*, which is necessarily a local maximum of q(s, θ).

Proof. In this case,

J(θ | θ_0) = E_{θ_0}{ (θ - θ_0)^T T(X) - (A(θ) - A(θ_0)) | S(X) = s }
           = (θ - θ_0)^T E_{θ_0}(T(X) | S(X) = s) - (A(θ) - A(θ_0)).    (2.4.21)

Part (a) follows. Part (b) is more difficult. A proof due to Wu (1983) is sketched in Problem 2.4.16. □
Example 2.4.4 (continued). X is distributed according to the exponential family

p(x, θ) = exp{η(2N_{1n}(x) + N_{2n}(x)) - A(η)} h(x)    (2.4.22)

where

η = log(θ/(1 - θ)),  h(x) = 2^{N_{2n}(x)},  A(η) = 2n log(1 + e^η)    (2.4.23)

and N_{jn} = Σ_{i=1}^n ε_{ij}(x_i), 1 ≤ j ≤ 3. Now,

E_θ(2N_{1n} + N_{2n} | S) = 2N_{1m} + N_{2m} + E_θ( Σ_{i=m+1}^n (2ε_{i1} + ε_{i2}) | ε_{i1} + ε_{i2}, m + 1 ≤ i ≤ n ).    (2.4.24)

Under the assumption that the process that causes lumping is independent of the values of the ε_{ij},

P_θ[ε_{i1} = 1 | ε_{i1} + ε_{i2} = 1] = θ² / (θ² + 2θ(1 - θ)) = θ² / (1 - (1 - θ)²) = θ/(2 - θ),

so that E_θ(2ε_{i1} + ε_{i2} | ε_{i1} + ε_{i2} = 1) = 2/(2 - θ) and

E_{θ_old}(2N_{1n} + N_{2n} | S) = 2N_{1m} + N_{2m} + [2/(2 - θ_old)] M_n    (2.4.25)

where M_n = Σ_{i=m+1}^n (ε_{i1} + ε_{i2}). Because A'(η) = 2nθ, the M-step (2.4.18) reduces to 2n θ_new = E_{θ_old}(2N_{1n} + N_{2n} | S); thus, the EM iteration is

θ_new = (2N_{1m} + N_{2m})/(2n) + M_n/[n(2 - θ_old)].    (2.4.26)

It may be shown directly (Problem 2.4.12) that if 2N_{1m} + N_{2m} > 0 and M_n > 0, then θ̂_m converges to the unique root in (0, 1) of the fixed-point equation obtained by setting θ_new = θ_old = θ in (2.4.26), that is, of

2nθ(2 - θ) = (2N_{1m} + N_{2m})(2 - θ) + 2M_n,

which is indeed the MLE when S is observed. □
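The iteration (2.4.26) is a one-line update; the sketch below (illustrative Python, with made-up counts) runs it to convergence.

```python
def em_hardy_weinberg(N1m, N2m, Mn, n, theta=0.5, tol=1e-10):
    """EM iteration (2.4.26): theta_new = (2*N1m + N2m)/(2n) + Mn/(n*(2 - theta_old))."""
    while True:
        theta_new = (2 * N1m + N2m) / (2 * n) + Mn / (n * (2 - theta))
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new

# m = 30 fully classified trials (10 of type 1, 20 of type 2) and
# 70 lumped trials of which 50 are not of type 3 (hypothetical counts)
print(em_hardy_weinberg(N1m=10, N2m=20, Mn=50, n=100))
```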
Example 2.4.6. Let (Z_1, Y_1), ..., (Z_n, Y_n) be i.i.d. as (Z, Y), where (Z, Y) ~ N(μ_1, μ_2, σ_1², σ_2², ρ). Suppose that some of the Z_i and some of the Y_i are missing as follows: For 1 ≤ i ≤ n_1 we observe both Z_i and Y_i, for n_1 + 1 ≤ i ≤ n_2 we observe only Z_i, and for n_2 + 1 ≤ i ≤ n we observe only Y_i. In this case a set of sufficient statistics is

T_1 = (1/n) Σ_{i=1}^n Z_i,  T_2 = (1/n) Σ_{i=1}^n Y_i,  T_3 = (1/n) Σ_{i=1}^n Z_i²,  T_4 = (1/n) Σ_{i=1}^n Y_i²,  T_5 = (1/n) Σ_{i=1}^n Z_i Y_i.

The observed data are S = {(Z_i, Y_i) : 1 ≤ i ≤ n_1} ∪ {Z_i : n_1 + 1 ≤ i ≤ n_2} ∪ {Y_i : n_2 + 1 ≤ i ≤ n}.

To compute E_θ(T | S = s), where θ = (μ_1, μ_2, σ_1², σ_2², ρ), we note that for the cases with Z_i and/or Y_i observed, the conditional expected values equal their observed values. For the other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4) to conclude

E_θ(Y_i | Z_i) = μ_2 + ρ(σ_2/σ_1)(Z_i - μ_1),  E_θ(Y_i² | Z_i) = [E_θ(Y_i | Z_i)]² + σ_2²(1 - ρ²),  E_θ(Z_i Y_i | Z_i) = Z_i E_θ(Y_i | Z_i),

and similarly with Z and Y interchanged. Moreover,

Ȧ(θ) = E_θ T = (μ_1, μ_2, σ_1² + μ_1², σ_2² + μ_2², σ_1 σ_2 ρ + μ_1 μ_2).

We take θ̂_0 = θ̂_MOM, where θ̂_MOM is the method of moments estimate (μ̃_1, μ̃_2, σ̃_1², σ̃_2², r̃) (Problem 2.1.8) of θ based on the observed data, and find (Problem 2.4.1) that the M-step produces

μ̂_{1,new} = T_1(θ_old),  μ̂_{2,new} = T_2(θ_old),  σ̂²_{1,new} = T_3(θ_old) - T̂_1²,  σ̂²_{2,new} = T_4(θ_old) - T̂_2²,
ρ̂_new = [T_5(θ_old) - T̂_1 T̂_2] / {[T_3(θ_old) - T̂_1²][T_4(θ_old) - T̂_2²]}^{1/2},    (2.4.27)
where T_j(θ_old) denotes T_j with missing values replaced by the values computed in the E-step and T̂_j = T_j(θ_old), j = 1, 2. Now the process is repeated with θ̂_MOM replaced by θ̂_new. □
Because the E-step, in the context of Example 2.4.6, involves imputing missing values, the EM algorithm is often called multiple imputation.
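A compact sketch of this E-step/M-step cycle follows (illustrative Python, not from the text; the data and the starting value are made up, and in practice the start would be the method of moments estimate computed from the observed data).

```python
import math

def em_bivariate_normal(pairs, z_only, y_only, theta, iters=100):
    """EM for bivariate normal data with some Z's and some Y's missing.
    theta = (mu1, mu2, sig1sq, sig2sq, rho)."""
    n = len(pairs) + len(z_only) + len(y_only)
    for _ in range(iters):
        mu1, mu2, s1, s2, rho = theta
        T = [0.0] * 5                                   # running sums for T1..T5

        def add(z, y, zz, yy, zy):
            for k, v in enumerate((z, y, zz, yy, zy)):
                T[k] += v

        for z, y in pairs:                              # fully observed cases
            add(z, y, z * z, y * y, z * y)
        for z in z_only:                                # E-step: impute Y given Z
            ey = mu2 + rho * math.sqrt(s2 / s1) * (z - mu1)
            add(z, ey, z * z, ey * ey + s2 * (1 - rho ** 2), z * ey)
        for y in y_only:                                # E-step: impute Z given Y
            ez = mu1 + rho * math.sqrt(s1 / s2) * (y - mu2)
            add(ez, y, ez * ez + s1 * (1 - rho ** 2), y * y, ez * y)

        T = [t / n for t in T]
        v1, v2 = T[2] - T[0] ** 2, T[3] - T[1] ** 2     # M-step (2.4.27)
        theta = (T[0], T[1], v1, v2, (T[4] - T[0] * T[1]) / math.sqrt(v1 * v2))
    return theta

pairs = [(0.1, 0.3), (1.2, 1.0), (-0.5, -0.8), (2.0, 1.7), (0.4, 0.6)]
print(em_bivariate_normal(pairs, z_only=[1.5, -1.0], y_only=[0.9],
                          theta=(0.5, 0.5, 1.0, 1.0, 0.5)))
```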
Remark 2.4.1. Note that if S(X) = X, then J(θ | θ_0) is log[p(X, θ)/p(X, θ_0)], which as a function of θ is maximized where the contrast -log p(X, θ) is minimized. Also note that, in general, -E_{θ_0}[J(θ | θ_0)] is the Kullback-Leibler divergence (2.2.23).
Summary. The basic bisection algorithm for finding roots of monotone functions is devel oped and shown to yield a rapid way of computing the MLE in all one-parameter canonical exponential families with E open (when it exists). We then, in Section 2.4.2, use this algorithm as a building block for the general coordinate ascent algorithm, which yields with certainty the MLEs in k-parameter canonical exponential families with E open when it exists. Important variants of and alternatives to this algorithm, including the Newton Raphson method, are discussed and introduced in Section 2.4.3 and the problems. Finally in Section 2.4.4 we derive and discuss the important EM algorithm and its basic properties.
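As a small illustration of the bisection step mentioned in the summary (Python sketch, not from the text; the binomial family and the bracketing interval are choices made for the example), the MLE in a one-parameter canonical exponential family solves Ȧ(η) = T(x), and Ȧ is strictly increasing, so bisection applies directly.

```python
import math

def bisect_mle(A_dot, t, lo, hi, tol=1e-10):
    """Solve A_dot(eta) = t by bisection; A_dot must be increasing with A_dot(lo) < t < A_dot(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if A_dot(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Binomial(n, theta) in canonical form: eta = log(theta/(1-theta)),
# A(eta) = n*log(1 + e^eta), so A_dot(eta) = n*e^eta/(1 + e^eta).
n, successes = 20, 7
eta_hat = bisect_mle(lambda e: n * math.exp(e) / (1 + math.exp(e)), successes, -20, 20)
print(1 / (1 + math.exp(-eta_hat)))   # recovers theta_hat = 7/20 = 0.35
```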
2.5
PROBLEMS AND COMPLEMENTS
Problems for Section 2.1 1. Consider a population made up of three different types of individuals occurring in the 2 Hardy-Weinberg proportions 02, 28(1 - 0) and ( l - 8) , respectively, where 0 < B < I. (a) Show that T3
' "
I'
=
NI/n + N2/2n is a frequency substitution estimate of e.
(b) Using the estimate of (a), what is a frequency substitution estimate of the odds ratio B/(1 - B)?
(c) Suppose X takes the values -1, 0, 1 with respective probabilities PI, p2, P3 given by the Hardy-Weinberg proportions. By considering the first moment of X, show that T3 is a method of moment estimate of (). 2. Consider n systems with failure times X1 , , Xn assumed to be independent and identically distributed with exponential, &( >.) , distributions. .
.
•
(a) Find the method of moments estimate of >.. based on the first moment. (b) Find the method of moments estimate of >.. based on the second moment. (c) Combine your answers to (a) and (b) to get a method of moment estimate of >.. based on the first two moments. (d) Find the method of moments estimate of the probability P(X1 2: 1) that one system will last at least a month.
3. Suppose that i.i.d. X1 , . , Xn have a beta, {3(a1, a2 ) distribution. Find the method of moments estimates of a = (a 1 , a2 ) based on the first two moments. .
.
; '
''
!
139
Problems and Complements
Section 2.5
Hint: See Problem 8.2.5. 4. Let X1 ,
. . .
, X11 be the indicators of n Bernoulli trials with probability of success 8.
(a) Show that X is a method of moments estimate of 8. -
(b) Exhibit method of moments estimates for VaroX 8(1 8)/n first using only the first moment and then using only the second moment of the population. Show that these estimates coincide. =
(c) Argue that in this case all frequency substitution estimates of q(8) must agree with
q(X).
5. Let Xt, . . . , Xn be a sample from a population with distribution � function F and � frequency function or density p. The empirical distribution function F is defined by F(x) = [No. of X; < x]/n. If q(8) can be written in the form q(8) � s(F) for some � function s of F we define the empirical substitution principle estimate of q(B) to be s ( F). (a) Show that in the finite discrete case, empirical substitution estimates coincides with
frequency substitution estimates. � Hint: Express F in terms of p and F in tenns of
(
p X) �
�
No. of X; � x n
.
�
(b) Show that in the continuous case X "' F means that X = Xi with probability 1/n.
(c) Show that the empirical substitution estimate of the jth moment /1j is the jth sample �
moment J..LJ . Hint: Write m;
�
J== xi dF(x) or m1
�
Ep(Xi) where X - F.
� � (d) For t1 < · · · < t., find the joint frequency function ofF(t,), . . . , F(tk). 1 Hint: Consider nF(t1), N2 n(F(t2 ) - F(t,)), . . . , � (N1, . . . , Nk+ I ) where N Nk+I = n(l - F(tk)). �
�
6. Let X( l ) < < X(n) be the order statistics of a sample X1, . . . , Xn. (See Problem 8.2.8.) There is a one-to-one correspondence between the empirical distribution function F and the order statistics in the sense that, given the order statistics we may construct F and given F, we know the order statistics. Give the details of this correspondence. ·
·
·
�
�
�
The jth cumulant 0· of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant cj. Give the first three sample cumulants. See A.l2.
7.
8. Let ( Z1 , Y1 ) , ( Z2 , Y2 ), . . . , ( Zn, Yn) be a set of independent and identically distributed F. The natural estimate of F(s t) is random vectors with common distribution function � the bivariate empirical distribution function F(s, t), which we define by ,
� F( s,t )
�
Number of vectors (Z;, Y;) such that Z; < s and Y; < t
n
.
I ;
140
Methods of Estimation �
Chapter
2
�
(a) Show that F(·, · ) is the distribution function of a probability P on R2 assigning mass 1/n to each point (Zi, Yi ) . (b) Define the sample product moment of order (i, j), the sample covariance, the sample correlation, and so on, as the corresponding characteristics of the distribution F. Show that the sample product moment of order (i,j) is given by �
The sample covariance is given by
IL n
n k=l '
' '
(Zk - Z)(Yk - Y) = -
n -I ...- L., zk yk - ZY, n k=l
where Z, Y are the sample means of the Z1, . . . , Zn and Y1 , sample correlation coefficient is given by
.
.
•
, Yn, respectively. The
'
All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. (See Problem 2.1.17.) Note that it follows from (A.11.19) that - I < r < 1.
,I
'
9. Suppose X = (X,, , . . , Xn) where the X, are independent N(O, a2).
'
(a) Find an estimate of cr2 based on the second moment. (b) Construct an estimate of a using the estimate of part (a) and the equation a
=
!
#.
i •
i'
'
(c) Use the empirical substitution principle to construct an estimate of a using the relation E(IX, I) = a.,J21r.
10. In Example 2. 1.1, suppose that g(/3, z) is continuous in {3 and that lg(/3, z) I tends to oo as \.81 tends to oo. Show that the least squares estimate exists. Hint: Set c = p(X, 0). There exists a compact set K such that for f3 in the complement of K, p(X, {3) > c. Since p(X, {3) is continuous on K, the result follows. 11. In Example 2.1.2 with X ji, and M3 · Hint: See Problem B.2.4.
�
f(a, >.), find the method of moments estimate based on
12. Let X1, , Xn be i.i.d. as X � Po, (} E 8 C Rd, with (} identifiable. Suppose X has possible values v1 , . . . , vk and that q(9) can be written as
'
' '
i ' :
J l
• . .
q(O) = h(!'1(6), . . . ,!'r(O))
I ' '
'
'
I '
Section 2.5
141
":_ ' "::: P '' Problems and Com:c::: :m :: :: ::
_ _ _ _ _
for some Rk -valued function h. Show that the method of moments estimate (j = h(/11, , fir) can be written as a frequency plug-in estimate. 1 l . Suppose X estimates( ) Xn are i.i.d. as X P0 , General method of moment 1, 13. with (J E 8 C Rd and (J identifiable. Let 91 , . . . , 9r be given linearly independent functions and write • • •
.
p;(O) = Eo(g; ( X )), /i;
=
.
rv
.
n
n - r L9;( X;), j = 1,
.
.
. , r.
i=l
Suppose that X has possible values v1 , . . . , vk and that q(O) = h(pr(O), . . . , Pr(O)) for some Rk-valued function h.
(a) Show that the method of moments estimate (j = h(/11, . . . , ilr) is a frequency plug
in estimate. (b) Suppose {Po : () E 8} is the k-parameter exponential family given by (1.6.10). Let g,(X) = T;(X ) , 1 < j < k. In the following cases, find the method of moments esumates •
(i) Beta, 1)(1, 8) (ii) Beta, /)(8, 1) (iii) Raleigh, p(x,B) = (xj82) exp(-x2/282) , x > 0. 8 > 0 (iv) Gamma, f(p, 8), p fixed (v) Inverse Gaussian, IG(p, -") , () = (p, A). See Problem 1.6.36. 14.
Hint: Use Corollary 1.6.1.
When the data are not i.i.d., it may stiJl be possible to express parameters as functions of moments and then use estimates based on replacing population moments with ..sample" moments. Consider the Gaussian AR( I) model of Example 1.1.5. (a) Use E(X;) to give a method of moments estimate of p.
(b) Suppose p = po and /) = b are fixed. Use E(U{), where
U, = (X; - pa) to
1/2
,
give a method of moments estimate of u2• (c) If J..l and u2
are fixed, can you give a method of moments estimate of {3?
142
Methods of Estimation
Chapter 2
15. Hardy-Weinberg with six genotypes. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S, I, and F at one locus resulting in six genotypes labeled SS, II, FF, Sf, SF, and IF. Let 01, 02 , and 83 denote the probabilities of S, I, and F, respectively, where ()i = 1. The Hardy-Weinberg model specifies that the six genotypes have probabilities
L7=I I
Genotype Genotype
ss e,
Probability
2
3
4
5
6
II
FF
Sf
SF
IF
e,
e3
ze,e,
ze,e3
ze,e3
Let Nj be the number of plants of genotype j in a sample of n independent plants, 1 < j < 6 and let p1 � N; /n. Show that
'
I'
are frequency ' '
plug-in estimates of 81,
16. Establish (2.1.6).
Hint: [Y, - g(/3,z;)] =
02 , and (}3·
[Y; - g(/30, z;)] + [g(/3o , z;) - g(/3,zi )] .
17. Multivariate method of moments. For a vector X =
(X1, . . . , Xq). of observations, let
the moments be
mikrs
=
E(XtX�),
For independent identically distributed empirical or sample moment to be �
ffijkrs
n
j
Xi = (Xt1, . . . , Xiq ) , i
1 "' x; x• - - L....J i is• n . t=l
r
> 0, k > 0; r, s = 1, . . . , q.
J > o' k > - o·' r, s ·
-
=
1, . . . , n, we define the
1, . . . ' q .
IfB = {01�, , Om) can be expressed as a function of the moments, the method of moments estimate 8 of 8 is obtained by replacing mJkrs by fiiikrs· Let X = ( Z, Y) and 8 = (a1, b1), where (Z, Y) and (a,, bt) are as in Theorem 1.4.3. Show that method of moments estimators of the parameters b1 and a1 in the best linear predictor are .
•
•
Problems for Section 2.2
1. An object of unit mass is placed in a force field of unknown constant intensity 8. Read ings Y1 , , Yn are taken at times t1 , . . . , tn on the position of the object. The reading Yi .
.
•
I
Section 2.5
143
Problems and Complements
differs from the true position ( 8/2)tf by a mndom error f1• We suppose the t:i to have mean 0 and be uncorrelated with constant variance. Find the LSE of 8. 2. Show that the fonnulae of Example 2.2.2 may be derived from Theorem 1.4.3, if we con sider the distribution assigning mass 1/n to each of the points
(zr , YI ) , . . . , (zn , Yn) ·
3. Suppose that observations Y1 , . . . , Yn have been taken at times z1 , . . , Zn and that the linear regression model holds. A new observation Yn-1-1 is to be taken at time Zn+I · What is the least squares estimate based on Y1 , . . . , Y;t of the best (MSPE) predictor of Yn+ I ? .
4. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (zi,Yi). i = 1, . . , n, in fact, all lie on a line. Hint: Write the lines in the fonn .
�(y - y) � � ·· p
(z - z) "
7
�
.
5. The regression line minimizes the sum of the squared vertical distances from the points (z1 , y1 ), , (zn, Yn)- Find the line that minimizes the sum of the squared perpendicular distance to the same points. Hint: The quantity to be minimized is . • .
I:?-r (y, - 8r - 8,z,)2 1 + 8,i 6. (a) Let Y1 , . . . , Yn be independent random variables with equal variances such that E(Yi) = o:z3 where the z3 are known constants. Find the least squares estimate of o:.
(b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1.4. Show that the least squares estimate is always defined and satisfies the equations (2.1.5) provided that g is differentiable with respect to (3,, 1 < i < d, the range {g(zr, /3), . . . , g (zn , /3), i3 E Rd } is closed, and /3 ranges over R'7.
8. Find the least squares estimates for the model Yi = 81 + 82zi + €i with t:i {2.2.4)-(2 . 2.6) under the restrictions 81 > 0, 8, :;; 0.
a�
given by
9. Suppose Yi = 81 + t:i, i = 1, . . . , n1 and Yi = 82 + l':i, i = n1 + 1, . . . , n1 + n2 , where t: 1 , . . . , €n1+n2 are independent N(O, a2) variables. Find the least squares estimates of ()1 and 8,. 10. Let X1, , Xn denote a sample from a population with one of the following densities or frequency functions. Find the MLE of 8. .
•
•
(a) f(x, 8) � 8e-Ox, X > 0; 8 > 0. (exponential density)
(b) f(x, 8) = 8c9x-(O+ I ), X > c; c constant > 0; 8 > 0. (Pareto density)
144
Methods of Estimation
(c) f(3c,8)
Chapter 2
c8".r-lc+l), :r > 8; c constant > 0: 8 > 0. (Pareto density) (d) f(x, 8) = v'ex v'0- 1 , 0 < x < 1, 8 > 0. (beta, il( VB, 1), density) (e) f(x, 8) = (x/82 ) exp{ -x2/282 }, x > 0; 8 > 0. (Rayleigh density) =
(f) f(x, 8) = 8cx'--1 exp{ -8x'}, x
> 0; c constant > 0; 8 > 0. (Wei bull density)
, Xn. n > 2, is a sample from a N(Jl, a2) distribution. (a) Show that if J1 and a2 are unknown, J1 E R, a2 > 0, then the unique MLEs are /i = X and a2 = n-1 2.:� 1 (X, - X ) 2 (b) Suppose p and a 2 are both known to be nonnegative but otherwise unspecified. Find maximum likelihood estimates of J1 and a2 . 11. Suppose that X1 ,
•
.
.
12. Let X1 , . . . , Xn, n > 2, be independently and identically distributed with density f(x, 8)
=
I
-
0'
exp { -(x - Jl) /0'}, x 2 Jl ,
where 8 = (Jl, 0'2 ), -oo < J1 < oo, 0' 2 > 0.
(a) Find maximum likelihood estimates of J.t and a2 .
(b) Find the maximum likelihood estimate of Pe [X1 > t] for t > 11· Hint: You may use Problem 2.2. 16(b ).
13. Let X1 ' . . . ' Xn be a sample from a u [8 - i 8 + i I distribution. Show that any T such that X(n) i < T < X(l) + i is a maximum likelihood estimate of 8. (We write U[a, b] to make p(a) = p(b) = (b - a) 1 rather than 0.) '
-
14, If n extsts. •
=
-
I in Example 2.1.5 show that no maximum likelihood estimate of 8 = (Jl, 0'2 ) �
�
15, Suppose that T(X) is sufficient for 8 and that 8(X) is an MLE of 8. Show that 8 � depends on X through T(X) only provided that 8 is unique. Hint: Use the factorization theorem (Theorem 1.5.1). 16. (a) Let X
Pe, 8
�
E
8 and let 8 denote the MLE of 8. Suppose that h is a one-toone function from 8 onto h(8). Define ry = h(8) and let f(x, ry) denote the density or frequency function�of X in terms of T} (i.e., reparametrize the model using ry). Show that the MLE of ry is h (8) (i.e., MLEs are unaffected by reparametrization, they are equivariant under one-to-one transformations). -
9 E 8}, 8 c RP, p > I , be a family of models for X E X c R"� k Let q be a map from 8 onto !1, !1 c R , I � k < p. Show that if 9 is a MLE of 9, then q(O) is an MLE of w = q(O). = {9 E 8 : q(O) = w }, then {8(w) : w E !1} is a partition of 8, and 8(w) Let Hint: � 9 belongs to only one member of this partition, say 8(w). Because q is onto !1, for each w E !1 there is 9 E 8 such that w = q( 9). Thus, the MLE of w is by definition (b) Let 1' = {Po :
WMLE
=
arg sup sup{Lx(O) : 9 E 8(w)}. WEO
'
I
I
I
i
Section 2.5
145
Problems and Complements
Now show that WMLE � w � q(O). 17. Censored Geometric Waiting Times. If time is measured in discrete periods, a model that is often used for the time X to failure of an item is
P, [x
�
k] � Bk -1(1 - B), k � 1, 2, . . .
where 0 < 8 < 1. Suppose that we only record the time of failure, if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods. Thus, we observe Y1 , . . , Yn which are independent, identically distributed, and have common frequency function, f( k,B) � Bk- 1 ( 1 - B), k � 1 , . . . , r
.
f(r + 1 , B) � 1 - Po [X < r] � 1 - L Bk-1 (1 - B) � B'. k=I (We denote by "r + 1" survival for at least (r + 1) periods.) Let M = number of indices i such that Yi = r + 1. Show that the maximum Jikelihood estimate of 8 based on Y1 , , Yn
.lS
•
-
B(Y) �
""
•
.
y: - n
L...��� . ' M. Li�l Y, -
18. Derive maximum likelihood estimates in the following models. (a) The observations are indicators of Bernoulli trials with probability of success 8. We want to estimate B and VaroX1 = B(1 - B). (b) The observations are X1 = the number of failures before the first success, X2
=
the number of failures between the first and second successes, and so on, in a sequence of binomial trials with probability of success fJ. We want to estimate 8.
.
19. Let X1 , . . , Xn be independently distributed with Xi having a N( ei, 1) distribution, 1 < i < n. (a) Find maximum likelihood estimates of the fJi under the assumption that these quan tities vary freely. (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B,. A general solution of this and related problems may be found in the book by Barlow, Bartholomew, Bremner, and Brunk (1972).
20. In the "life testing" problem 1.6. 16(i), find the MLE of B. 21. (Kiefer-Wolfowitz) Suppose (X,, . . . , Xn) is a sample from a population with density
f(x, B) �
9
lOa
(X - p)
5)]/P( S � 5)
�
.0262
if S = 5. Such randomized tests are not used in practice. They are only used to show that with randomization, likelihood ratio tests are unbeatable no matter what the size a is. Theorem 4.2.1. (Neyman-Pearson Lemma). (a) If a > 0 and i{)k is a size a likelihood ratio test, then <{Jk is MP in the class of level a tests. (b) For each 0 < a :5: 1 there exists an MP size a likelihood ratio test provided that randomization is permitted, 0 < r.p(x) < l,for some x. (c) Ifr.p is an MP level o: test, then it must be a level a likelihood ratio test; that is, there exists k such that
P, [
0. To this end consider
Et[cpk(X) -
0. Note that a > 0 implies k < oo and, thus, 'Pk(x) � 1 if p(x, 80) � 0. It follows that II > 0. Finally, using (4.2.3), we have shown that k and have 0} D and 0 � Oo. It follows from the Neyman-Pearson lemma that an MP test has power at least as large as its level; that is, a with equality iffp(·, Oo) � f3(0, 0) + .u/ + (1 - >.JaN((! ) density. Find the efficiency ep( X, X) as defined in (f).- If and a = 1, = 4, evaluate the efficiency for £ = .05, 0.10, 0.15 and 0.20 and note that X is more efficient than X for these gross error cases. (h) X Suppose that X1 has the Cauchy density f(x) = 1/:>r(l + x 2), x E R. Show that ep(X , ) = O. P is c} are monotone increasing in c. Finally, to obtain I
Et[
2
kEo (
(4.2.4)
'
'
I
'
i
I
1 '
'
Section 4.2
Choosing a
Test Statistic:
225 -------= =cc
"--
The Neyman-Pearson lemma
(b) If a � 0, k = oo makes 'Pk MP size a. If a = 1, k = 0 makes E1 'Pk(X) � 1 and I.Pk is MP size a. Next consider 0 < a < 1. Let P1 denote Po, , i = 0, 1. Because Po [L(X, Oo, OJ) = oo] = 0, then there exists k < oo such that
Po[L(X, 00,01) > k] < a and Po[T,(X, 00, OJ) > k] > a. If P0[L(X, Oo, 01) � k] � 0, then 'Pk is MP size a. If not, define
a - Po[L(X, o,,O,) > k] 'Pk (x) � P0[L(X, Oo, OJ) = k]
on the set {x : L(x, Oo, O,) = k ) . Now 'Pk is MP size a. (c) Let x E {x : p(x, OJ) > 0}, then to have equality in (4.2.4) we need to have
Corollary 4.2.1. /f
p(-,o,).
Proof. See Problem 4.2. 7.
Remark 4.2.1. Let Jr denote the prior probability of 00 so that (1-Jr) is the prior probability of fh. Then the posterior probability of (}1 is
(1 - JT)p(x, O,) 7r(0 1 I X) (1 - JT)p(x, O,) + JTp(x, Oo)
-
_
(1 - JT)L(x,00,01) (1 - JT)L(x, Oo, 0, ) + Jr .
(4.2.5)
If o, denotes the Bayes procedure of Example 3.2.2(b), then, when Jr � k/(k + 1), o, � 'Pk· Moreover, we conclude from (4.2.5) thatthis o, decides 01 or 00 according as Jr(01 I x) is larger than or smaller than 1/2 . Part (a) of the lemma can, for 0 < a < 1, also be easily argued from this Bayes property of 'Pk (Problem 4.2.10). Here is an example illustrating calculation of the most powerful level a test I.Pk·
Example 4.2.1. Consider Example 3.3.2 where X = (X1, , Xn ) is a sample of n N(J.L,u2) random variables with u2 known and we test H : J.L = 0 versus K : I" = v, where v is a known signal. We found •
•
•
nv2 v n L(X, O,v) = exp 2 L X; u i= l 2u2 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. Therefore,
X T(X) � Vn-;; =
a
v.,fii
[log L(X,O,v) + nv2 ] 2"2
226
Testing and Confidence Regions
Chapter 4
is also optimal for this problem. But T is the test statistic we proposed in Example 4.1.4. From our discussion there we know that for any specified a, the test that rejects if, and only if. T > z(l - a)
(4.2.6)
has probability of type I error o:. The power of this test is, by (4.1 .2), (z(a) + (v../ii/ (J)). By the Neyman-Pearson lemma this is the largest power available with a level et test. Thus, if we want the probability of detecting a signal v to be at least a preassigned value j3 (say, .90 or .95), then we solve ( z (a)+ ( v../ii/(J) ) = !3 for n and find that we need to take n = ( (J /v )2[z(l -a) + z(/3)f D This is the smallest possible n for any size a test. An interesting feature of the preceding example is that the test defined by (4.2.6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > 0. Such a test is called uniformly most powerful (UMP). We will discuss the phenomenon further in the next section. The following important example illustrates, among other things, that the UMP test phenomenon is largely a feature of one-dimensional parameter problems. Example 4.2.2. Simple Hypothesis Against Simple Alternative for the Multivariate Nor mal: Fisher's Discriminant Flmction. Suppose X � N(p1, E1), 81 = (p1, E1), j = 0, 1. The likelihood ratio test for H : 9 = Bo versus K : (J = fh is based on '
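The sample size formula n = (σ/v)²[z(1 - α) + z(β)]² and the power expression from (4.1.2) are easy to check numerically (Python sketch, not from the text; the particular α, β, σ, and v values are made up for illustration).

```python
from statistics import NormalDist

def sample_size(alpha, beta, sigma, v):
    """Smallest n giving power >= beta at mu = v for the level-alpha test (4.2.6)."""
    z = NormalDist().inv_cdf
    return (sigma / v) ** 2 * (z(1 - alpha) + z(beta)) ** 2

def power(n, alpha, sigma, v):
    """beta(v) = Phi(z(alpha) + v*sqrt(n)/sigma)."""
    nd = NormalDist()
    return nd.cdf(nd.inv_cdf(alpha) + v * n ** 0.5 / sigma)

n = sample_size(alpha=0.05, beta=0.90, sigma=1.0, v=0.5)
print(n, power(n, 0.05, 1.0, 0.5))   # about 34.3 and 0.90
```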
.
•
'
'
'' •
t '
Rejecting H for L large is equivalent to rejecting for
I
1
•
.. • •
•
•
large. Particularly important is the case Eo = Et when "Q large" is equivalent to "F = (tt 1 - p.0)EQ1X large." The function F is known as the Fisher discriminant function. It is used in a classification context in which 9o, 91 correspond to two known populations and we desire to classify a new observation X as belonging to one or the other. We return to this in Volume II. Note that in general the test statistic L depends intrinsically on ito, JJ 1. However if, say, J.£1 = p,0 + Ado, ).. > 0 and Et = Eo, then, if ito• Llo, Eo are known, a UMP (for all ).. ) test exists and is given by: Reject if
'
i
•
•
I'
,
'
(4.2.7)
where c = z(l - a)[�6E0 1 �0]! (Problem 4.2.8). If �0 = (1, 0, . . . , O)T and Eo = I, then this test rule is to reject H if X1 is large; however, if E0 =j:. I, this is no longer the case (Problem 4.2.9). In this example we bave assumed that IJo and IJ, for the two populations are known. If this is not the case, they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population covanances. •
I
'
Section 4.3
UnifOfmly Most Powerful Tests and Monotone Likelihood Ratio Models
227
Summary. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : () = Bo versus the simple alternative K : 0 01• The Neyman-Pearson lemma, which states that the size a SLR test is uniquely most powerful (MP) in the class of level a tests, is established. We note the connection of the MP test to the Bayes procedure of Section 3.2 for de ciding between Oo and 81. Two examples in which the MP test does not depend on ()1 are given. Such tests are said to be UMP (unifonnly most powerful). =
4.3
U N I FORMLY MOST POWERFUL TESTS ANO MONOTONE L I K E LIHOOD RATIO MODELS
We saw in the two Gaussian examples of Section 4.2 that UMP tests for one-dimensional parameter problems exist. This phenomenon is not restricted to the Gaussian case as the next example illustrates. Before we give the example, here is the general definition of UMP: Definition 4.3.1. A level a test
(3 (0,
:
0 E 60 (4.3.1)
for any other level a test 1.p.
Example 4.3.1. Testing for a Multinomial Vector. Suppose that ( N1, . . . , Nk) has a multi nomial M(n , Bt, . . . , Bk) distribution with frequency function,
""' " n! n1 1. . . . nk.1 171 • · · 17k
P(nt, · · · , nk , B) -
"
'
where n1, . . . , nk are integers summing ton. With such data we often want to test a simple hypothesis H : B1 = Ow, . . . , Ok = ()kO· For instance, if a match in a genetic breeding ex periment can result in k types, n offspring are observed, and Ni is the number of offspring of type i, then (N1 , . . . , Nk) - M(n, e,, . . . , Ok)· The simple hypothesis would corre spond to the theory that the expected proportion of offspring of types 1 , . . . , k are given by Ow, Oko · Usually the alternative to H is composite. However, there is sometimes a simple alternative theory K : B1 = B11, . . . , (}k = Ok1· In this case, the likelihood ratio L • . .
,
.
IS
rr (:'')N' tO .
L�
i=l
Here is an interesting special case: Suppose Ojo > 0 for ally", 0 < integer l with 1 :0 l < k
€
< 1 and for some fixed (4.3.2)
where
228
Testmg and Confidence Regions
Chapter 4
That is, under the alternative, type l is less frequent than under H and the conditional probabilities of the other types given that type l has not occurred are the same under K as they are under H. Then =
pn-N,f_N/
pn ( Ejp)N' .
L Because t:. < 1 implies that p > E, we conclude that the MP test rejects H, if and only if, =
Critical values for level a are easily determined because Nl B( n, 810) under H. Moreover, for a = P( N, < c) , this test is UMP for testing H versus K : 0 E 8 1 = { 1:1 : 1:1 is of the form (4.3.2) with 0 < E < 1 }. Note that because l can be any of the integers 1 , . . . , k, we get radically different best tests depending on which (}i we assume to be Ow 0 under H. N1 <
c.
.......,
Typically the MP test of H : 0 = Oo versus K : (} = 01 depends on (h and the test is not UMP. However, we have seen three models where, in the case of a real parameter, there is a statistic T such that the test with critical region { x : T(x) > c} is UMP. This is part of a general phenomena we now describe.
Definition 4.3.2. The family of models { P, : 0 E 8} with 8 c R is said to be a monotone likelihood ratio (MLR) family if for fit < 02 the distributions Po1 and Po2 are distinct and D the ratio p(:r:, 02 )jp(x, 0 1 ) is an increasing function ofT(x).
Example 4.3.2 (Example 4.1.3 continued). In this i.i.d. Bernoulli case, set s then p(x, 0) = 0 ( 1 - 0)"-' = (I - 0)"[0/(1 - 0)]"
'
' '
'
and the model is by (4.2.1) MLR in s.
=
2:� 1 x;,
D
Example 4.3.3. Consider the one-parameter exponential family mode!
p(x,O) = h(x) exp{ry (O)T(x) - B (O)} .
'
•
"
•
.
If TJ(O) is strictly increasing in () E 6, then this family is MLR. Example 4.2.1 is of this D form with T(x) = ..fiixfu and ry(p.) = (..fiiu)p., where u is known. Define the Neyman-Pearson (NP) test function
. • •
I ifT(x) > t 6t (x ) 0 if T(x) < t
' • '
'
_
(4.3.3)
with 61(x) any value in (0, 1) if T(x) = t. Consider the problem of testing H : 0 = 00 versus K : 0 = O, with Oo < 8,. If {Po : 0 E 8}, 8 C R, is an MLR family in T(x), then L(x, Oo, 01) = h(T(x)) for some increasing function h. Thus, 6, equals the likelihood ratio test
"
:
'
' :·.
'
Theorem 4.3.1. Suppose {Po : 8 E 8 }, 8 c R, is an MLRfamily in T(x). (I) For each t E (0, oo), the powerfunction (3(1:1) = Eo 61 (X ) is increasing in 0. (2) If Eo061(X) = "' > 0, then 6, is UMP level "' for testing H : 8 < 80 versus
K : O > O,.
•
'
'
I
Section 4.3
Uniformly Most Po�rful Tests and Monotone likelihood Ratio Models
229
Proof (I) follows from <51 = I.Ph(t) and Corollary 4.2.1 by noting that for any fh < ()2 , c5t is MP at level Eo, o1(X) for testing H : 0 = 01 versus [( : 0 = o,. To show (2), recall that we have seen that c5t maximizes the power for testing II : () = Oo versus K : () > Oo among the class of tests with level a = Eo,61 (X). If 0 < Oo, then by (1), Eoo1 (X) < a and br is of level o: for H : () :< 00. Because the class of tests with level a for H : () < ()0 is contained in the class of tests with level a for H : () = 00, and because dt maximizes the 0 power over this larger class, bt is UMP for H : () < Oo versus K : () > Oa. The following useful result follows immediately.
Corollary 4.3.1. Suppose {Po : 0 E 6}, e c R, is an MLRfamily in T(r). If the distributionfunction Fa ofT( X) under X ,....., Pe0 is continuous and ift(l -a) is a solution of Fo (t) = 1 - a, then the test that rejects H if and only ifT(r) > t(I ex) is /JMP level a for testing H : () < Oo versus K : () > Oo. -
Example 4.3.4. Testing Precision. Suppose X_1, . . . , X_n is a sample from a N(μ, σ²) population, where μ is a known standard, and we are interested in the precision σ^{-1} of the measurements X_1, . . . , X_n. For instance, we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. Because the most serious error is to judge the precision adequate when it is not, we test H : σ ≥ σ_0 versus K : σ < σ_0, where σ_0^{-1} represents the minimum tolerable precision. Let S = Σ_{i=1}^n (X_i − μ)²; then
p(x, σ²) = exp{ −(1/2σ²) S − (n/2) log(2πσ²) }.
This is a one-parameter exponential family and is MLR in T = −S. The UMP level α test rejects H if and only if S ≤ s(α), where s(α) is such that P_{σ0}(S ≤ s(α)) = α. If we write
S/σ_0² = Σ_{i=1}^n [(X_i − μ)/σ_0]²,
we see that S/σ_0² has a χ²_n distribution under H. Thus, the critical constant s(α) is σ_0² x_n(α), where x_n(α) is the αth quantile of the χ²_n distribution. □
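For instance, the critical constant s(α) = σ_0² x_n(α) and the power of this test at an alternative σ < σ_0 can be computed numerically. The following is a minimal Python sketch; the settings n = 20, σ_0 = 1, α = 0.05 and the alternative σ = 0.7 are illustrative, not taken from the text.

```python
# Sketch: critical constant and power for the UMP test of H: sigma >= sigma0
# versus K: sigma < sigma0 in the precision example (illustrative values).
from scipy.stats import chi2

n, sigma0, alpha = 20, 1.0, 0.05                 # hypothetical settings
s_alpha = sigma0**2 * chi2.ppf(alpha, df=n)      # s(alpha) = sigma0^2 * x_n(alpha)

def power(sigma):
    # Under sigma, S / sigma^2 ~ chi-square_n, so
    # P_sigma(S <= s(alpha)) = P(chi2_n <= s(alpha) / sigma^2).
    return chi2.cdf(s_alpha / sigma**2, df=n)

print(s_alpha)        # rejection threshold for S
print(power(0.7))     # power at the alternative sigma = 0.7
```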
Example 4.3.5. Quality Control. Suppose that, as in Example 1.1.1, X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives, where b = Nθ. If the inspector making the test considers lots with b_0 = Nθ_0 defectives or more unsatisfactory, she formulates the hypothesis H as θ ≥ θ_0, the alternative K as θ < θ_0, and specifies an α such that the probability of rejecting H (keeping a bad lot) is at most α. If α is a value taken on by the distribution of X, we now show that the test δ* with "reject H if, and only if, X < h(α)," where h(α) is the αth quantile of the hypergeometric, H(Nθ_0, N, n), distribution, is UMP level α. For simplicity suppose that b_0 ≥ n and N − b_0 ≥ n. Then, if Nθ_1 = b_1 < b_0 and 0 ≤ x ≤ b_1, (1.1.1) yields
L(x, θ_0, θ_1) = [b_1(b_1 − 1) · · · (b_1 − x + 1)(N − b_1) · · · (N − b_1 − n + x + 1)] / [b_0(b_0 − 1) · · · (b_0 − x + 1)(N − b_0) · · · (N − b_0 − n + x + 1)].
Note that L(x, θ_0, θ_1) = 0 for b_1 < x ≤ n. Thus, for 0 ≤ x ≤ b_1 − 1,
L(x + 1, θ_0, θ_1)/L(x, θ_0, θ_1) = [(b_1 − x)((N − n + 1) − (b_0 − x))] / [(b_0 − x)((N − n + 1) − (b_1 − x))] < 1.
Therefore, L is decreasing in x and the hypergeometric model is an MLR family in T(x) = −x. It follows that δ* is UMP level α. The critical values for the hypergeometric distribution are available on statistical calculators and software. □
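As an illustration of how such critical values can be obtained with software, the following Python sketch finds the largest c with P_{θ0}(X ≤ c) ≤ α for a hypergeometric lot; the lot size, sample size, θ_0 and α below are hypothetical.

```python
# Sketch: critical value for the UMP quality-control test.  Reject
# H: theta >= theta0 when X <= c, where c is the largest value with
# P_{theta0}(X <= c) <= alpha.  The numbers below are illustrative only.
from scipy.stats import hypergeom

N_lot, n_sample, theta0, alpha = 1000, 50, 0.10, 0.05   # hypothetical
b0 = int(N_lot * theta0)                                # defectives under H
lot = hypergeom(M=N_lot, n=b0, N=n_sample)              # scipy's parametrization

c = -1
while lot.cdf(c + 1) <= alpha:
    c += 1

print(c, lot.cdf(c))   # critical value and the exact size it attains
```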
Power and Sample Size

In the Neyman-Pearson framework we choose the test whose size is small. That is, we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. On the other hand, we would also like large power β(θ) when θ ∈ Θ_1; that is, we want the probability of correctly detecting an alternative K to be large. However, as seen in Figure 4.1.1 and formula (4.1.2), this is, in general, not possible for all parameters in the alternative Θ_1. In both these cases, H and K are of the form H : θ ≤ θ_0 and K : θ > θ_0, and the powers are continuous increasing functions with lim_{θ↓θ0} β(θ) = α. By Corollary 4.3.1, this is a general phenomenon in MLR family models with p(x, θ) continuous in θ.

This continuity of the power shows that not too much significance can be attached to acceptance of H, if all points in the alternative are of equal significance: We can find θ > θ_0 sufficiently close to θ_0 so that β(θ) is arbitrarily close to β(θ_0) = α. For such θ the probability of falsely accepting H is almost 1 − α. This is not serious in practice if we have an indifference region. This is a subset of the alternative on which we are willing to tolerate low power. In our normal Example 4.1.4 we might be uninterested in values of μ in (0, Δ) for some small Δ > 0 because such improvements are negligible. Thus, (0, Δ) would be our indifference region. Off the indifference region, we want guaranteed power as well as an upper bound on the probability of type I error. In our example this means that in addition to the indifference region and level α, we specify β close to 1 and would like to have β(μ) ≥ β for all μ ≥ Δ. This is possible for arbitrary β < 1 only by making the sample size n large enough. In Example 4.1.4, because β(μ) is increasing, the appropriate n is obtained by solving
β(Δ) = Φ(z(α) + √n Δ/σ) = β
for sample size n. This equation is equivalent to
z(α) + √n Δ/σ = z(β),
whose solution is
n = (σ/Δ)² [z(β) − z(α)]² = (σ/Δ)² [z(1 − α) + z(β)]².
Note that a small signal-to-noise ratio Δ/σ will require a large sample size n.
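A minimal numerical sketch of this sample-size formula in Python follows; the values α = 0.05, β = 0.90 and Δ/σ = 0.5 are hypothetical choices, not taken from the text.

```python
# Sketch: sample size for the one-sided normal test so that the power at
# mu = Delta is at least beta (setting of Example 4.1.4); illustrative values.
from math import ceil
from scipy.stats import norm

alpha, beta = 0.05, 0.90
delta_over_sigma = 0.5                      # hypothetical signal-to-noise ratio

n = ((norm.ppf(1 - alpha) + norm.ppf(beta)) / delta_over_sigma) ** 2
print(ceil(n))   # smallest integer sample size meeting the power requirement
```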
Dual to the problem of not having enough power is that of having too much. It is natural to associate statistical significance with practical significance so that a very low p-value is interpreted as evidence that the alternative that holds is physically significant, that is, far from the hypothesis. Formula (4.1.2) shows that, if n is very large and/or α is small, we can have very great power for alternatives very close to θ_0. This problem arises particularly in goodness-of-fit tests (see Example 4.1.5), when we test the hypothesis that a very large sample comes from a particular distribution. Such hypotheses are often rejected even though for practical purposes "the fit is good enough." The reason is that n is so large that unimportant small discrepancies are picked up. There are various ways of dealing with this problem. They often reduce to adjusting the critical value so that the probability of rejection for parameter values at the boundary of some indifference region is α. In Example 4.1.4 this would mean rejecting H if, and only if,
X̄ ≥ (σ/√n) z(1 − α) + Δ.
As a further example and precursor to Section 5.4.4, we next show how to find the sample size that will "approximately" achieve desired power β for the size α test in the binomial example.
Example 4.3.6 (Example 4.1.3 continued). Our discussion uses the classical normal approximation to the binomial distribution. First, to achieve approximate size α, we solve β(θ_0) = P_{θ0}(S ≥ s) = α for s and find the approximate critical value
s_0 = nθ_0 + 1/2 + z(1 − α)[nθ_0(1 − θ_0)]^{1/2}.
Again using the normal approximation, we find
β(θ) = P_θ(S ≥ s_0) ≈ Φ( (nθ + 1/2 − s_0) / [nθ(1 − θ)]^{1/2} ).
Now consider the indifference region (θ_0, θ_1), where θ_1 = θ_0 + Δ, Δ > 0. We solve β(θ_1) = β for n and find the approximate solution
n = Δ^{-2} { z(1 − α)[θ_0(1 − θ_0)]^{1/2} + z(β)[θ_1(1 − θ_1)]^{1/2} }².
For instance, if α = .05, β = .90, θ_0 = 0.3, and θ_1 = 0.35, we need
n = (0.05)^{-2} {1.645 [0.3(0.7)]^{1/2} + 1.282 [0.35(0.65)]^{1/2}}² = 162.4.
Thus, the size .05 binomial test of H : θ = 0.3 requires approximately 163 observations to have probability .90 of detecting the 17% increase in θ from 0.3 to 0.35. The power achievable (exactly, using the SPLUS package) for the level .05 test for θ = 0.35 and n = 163 is 0.86.
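Exact power calculations of this kind can also be done directly from the binomial distribution. The following Python sketch computes the exact critical value and power of the one-sided binomial test; the values n = 200, θ_0 = 0.3, θ_1 = 0.35 are illustrative choices, not the text's.

```python
# Sketch: exact critical value and power for the one-sided binomial test
# H: theta = theta0 versus K: theta > theta0 (setting of Example 4.3.6).
from scipy.stats import binom

def critical_value(n, theta0, alpha):
    # Smallest s0 with P_{theta0}(S >= s0) <= alpha.
    s0 = 0
    while binom.sf(s0 - 1, n, theta0) > alpha:   # sf(s0 - 1) = P(S >= s0)
        s0 += 1
    return s0

n, theta0, theta1, alpha = 200, 0.30, 0.35, 0.05   # hypothetical settings
s0 = critical_value(n, theta0, alpha)
size = binom.sf(s0 - 1, n, theta0)       # exact size of the test
power = binom.sf(s0 - 1, n, theta1)      # exact power at theta1
print(s0, size, power)
```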
Our discussion can be generalized. Suppose θ is a vector. Often there is a function q(θ) such that H and K can be formulated as H : q(θ) ≤ q_0 and K : q(θ) > q_0. Now let
q_1 > q_0 be a value such that we want to have power β(θ) at least β when q(θ) ≥ q_1. The set {θ : q_0 < q(θ) < q_1} is our indifference region. For each n suppose we have a level α test for H versus K based on a suitable test statistic T_n. Suppose that β(θ) depends on θ only through q(θ) and is a continuous increasing function of q(θ), and also increases to 1 for fixed θ ∈ Θ_1 as n → ∞. To achieve level α and power at least β, first let c_0 be the smallest number c such that
P_{θ0}[T_n ≥ c] ≤ α.
Then let n be the smallest integer such that
P_{θ1}[T_n ≥ c_0] ≥ β,
where θ_0 is such that q(θ_0) = q_0 and θ_1 is such that q(θ_1) = q_1. This procedure can be applied, for instance, to the F test of the linear model in Section 6.1 by taking q(θ) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. Implicit in this calculation is the assumption that P_{θ1}[T_n ≥ c_0] is an increasing function of n.
We have seen in Example 4.1.5 that a particular test statistic can have a fixed distribution L_0 under the hypothesis. It may also happen that the distribution of T_n as θ ranges over Θ_1 is determined by a one-dimensional parameter λ(θ) so that Θ_0 = {θ : λ(θ) = 0} and Θ_1 = {θ : λ(θ) > 0} and L_θ(T_n) = L_{λ(θ)}(T_n) for all θ. The theory we have developed demonstrates that if L_λ(T_n) is an MLR family, then rejecting for large values of T_n is UMP among all tests based on T_n. Reducing the problem to choosing among such tests comes from invariance considerations that we do not enter into until Volume II. However, we illustrate what can happen with a simple example.
Example 4.3.7. Testing Precision Continued. Suppose that in the Gaussian model of Example 4.3.4, μ is unknown. Then the MLE of σ² is σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)², as in Example 2.2.9. Although H : σ = σ_0 is now composite, the distribution of T_n = nσ̂²/σ_0² is χ²_{n−1}, independent of μ. Thus, the critical value for testing H : σ = σ_0 versus K : σ < σ_0, and rejecting H if T_n is small, is the α quantile of χ²_{n−1}. It is evident from the argument of Example 4.3.3 that this test is UMP for H : σ ≥ σ_0 versus K : σ < σ_0 among all tests depending on σ̂² only. □

Complete Families of Tests

The Neyman-Pearson framework is based on using the 0-1 loss function. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(θ, a), a ∈ A = {0, 1}, θ ∈ Θ, that are not 0-1. For instance, for Θ_1 = (θ_0, ∞), we may consider l(θ, 0) = (θ − θ_0), θ ∈ Θ_1. In general, when testing H : θ ≤ θ_0 versus K : θ > θ_0, a reasonable class of loss functions are those that satisfy
l(θ, 1) − l(θ, 0) > 0 for θ < θ_0,
l(θ, 1) − l(θ, 0) < 0 for θ > θ_0.    (4.3.4)
The class D of decision procedures is said to be complete^{(1),(2)} if for any decision rule φ there exists δ ∈ D such that
R(θ, δ) ≤ R(θ, φ) for all θ ∈ Θ.    (4.3.5)
That is, if the model is correct and the loss function is appropriate, then any procedure not in the complete class can be matched or improved at all θ by one in the complete class. Thus, it isn't worthwhile to look outside of complete classes. In the following the decision procedures are test functions.

Theorem 4.3.2. Suppose {P_θ : θ ∈ Θ}, Θ ⊂ R, is an MLR family in T(X) and suppose the loss function l(θ, a) satisfies (4.3.4); then the class of tests of the form (4.3.3) with E δ_t(X) = α, 0 ≤ α ≤ 1, is complete.

Proof. The risk function of any test rule φ is
R(θ, φ) = E_θ{φ(X)l(θ, 1) + [1 − φ(X)]l(θ, 0)} = E_θ{l(θ, 0) + [l(θ, 1) − l(θ, 0)]φ(X)}.
Let δ_t(X) be such that, for some θ_0, E_{θ0}δ_t(X) = E_{θ0}φ(X) > 0. If E_θφ(X) = 0 for all θ, then δ_∞(X) clearly satisfies (4.3.5). Now δ_t is UMP for H : θ ≤ θ_0 versus K : θ > θ_0 by Theorem 4.3.1 and, hence,
R(θ, δ_t) − R(θ, φ) = [l(θ, 1) − l(θ, 0)][E_θ(δ_t(X)) − E_θ(φ(X))] ≤ 0 for θ > θ_0.    (4.3.6)
But 1 − δ_t is similarly UMP for H : θ ≥ θ_0 versus K : θ < θ_0 (Problem 4.3.12) and, hence, E_θ(1 − δ_t(X)) = 1 − E_θδ_t(X) ≥ 1 − E_θφ(X) for θ < θ_0. Thus, (4.3.5) holds for all θ. □

Summary. We consider models {P_θ : θ ∈ Θ} for which there exist tests that are most powerful for every θ in a composite alternative Θ_1 (UMP tests). For θ real, a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing θ_0 versus θ_1 is an increasing function of a statistic T(x) for every θ_0 < θ_1. For MLR models, the test that rejects H : θ ≤ θ_0 for large values of T(x) is UMP for K : θ > θ_0. In such situations we show how the sample size can be chosen to guarantee minimum power for alternatives a given distance from H. We also show how, when UMP tests do not exist, locally most powerful (LMP) tests can in some cases be found. Finally, we show that for MLR models, the class of NP tests is complete in the sense that for loss functions other than the 0-1 loss function, the risk of any procedure can be matched or improved by an NP test.
4.4 CONFIDENCE BOUNDS, INTERVALS, AND REGIONS
We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter θ is a
member of a specified set Θ_0. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α. As an illustration consider Example 4.1.4 where X_1, . . . , X_n are i.i.d. N(μ, σ²) with σ² known.
Suppose that μ represents the mean increase in sleep among patients administered a drug.
Then we can use the experimental outcome X = (X_1, . . . , X_n) to establish a lower bound μ̲(X) for μ with a prescribed probability (1 − α) of being correct. In the non-Bayesian framework, μ is a constant, and we look for a statistic μ̲(X) that satisfies P(μ̲(X) ≤ μ) = 1 − α with 1 − α equal to .95 or some other desired level of confidence. In our example this is achieved by writing
P( √n(X̄ − μ)/σ ≤ z(1 − α) ) = 1 − α.
By solving the inequality inside the probability for μ, we find
P( X̄ − σ z(1 − α)/√n ≤ μ ) = 1 − α,
and
μ̲(X) = X̄ − σ z(1 − α)/√n
is a lower bound with P(μ̲(X) ≤ μ) = 1 − α. We say that μ̲(X) is a lower confidence bound with confidence level 1 − α. Similarly, as in (1.3.8), we may be interested in an upper bound on a parameter. In the N(μ, σ²) example this means finding a statistic μ̄(X) such that P(μ̄(X) ≥ μ) = 1 − α, and a solution is
μ̄(X) = X̄ + σ z(1 − α)/√n.
Here μ̄(X) is called an upper level (1 − α) confidence bound for μ.
Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. We find such an interval by noting
P( |√n(X̄ − μ)/σ| ≤ z(1 − ½α) ) = 1 − α
and solving the inequality inside the probability for μ. This gives
P( μ_−(X) ≤ μ ≤ μ_+(X) ) = 1 − α,
where
μ_±(X) = X̄ ± σ z(1 − ½α)/√n.
We say that [μ_−(X), μ_+(X)] is a level (1 − α) confidence interval for μ.
In general, if ν = ν(P), P ∈ P, is a parameter, and X ∼ P, X ∈ R^q, it may not be possible for a bound or interval to achieve exactly probability (1 − α) for a prescribed (1 − α) such as .95. In this case, we settle for a probability at least (1 − α). That is,
Definition 4.4.1. A statistic ν̲(X) is called a level (1 − α) lower confidence bound for ν if for every P ∈ P,
P[ν̲(X) ≤ ν] ≥ 1 − α.
Similarly, ν̄(X) is called a level (1 − α) upper confidence bound for ν if for every P ∈ P,
P[ν̄(X) ≥ ν] ≥ 1 − α.
Moreover, the random interval [ν̲(X), ν̄(X)] formed by a pair of statistics ν̲(X), ν̄(X) is a level (1 − α) or a 100(1 − α)% confidence interval for ν if, for all P ∈ P,
P[ν̲(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α.
The quantities on the left are called the probabilities of coverage and (1 − α) is called a confidence level. For a given bound or interval, the confidence level is clearly not unique because any number (1 − α′) ≤ (1 − α) will be a confidence level if (1 − α) is. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just inf{P[ν̲(X) ≤ ν ≤ ν̄(X)] : P ∈ P} (i.e., the minimum probability of coverage). For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let X_1, . . . , X_n be a sample from a N(μ, σ²) population, and assume initially that σ² is known. In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to construct bounds and intervals for μ. When σ² is unknown, a natural approach is to replace σ by its estimate s, where
s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².
That is, we will need the distribution of T(μ) = √n(X̄ − μ)/s. Now Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution, it is independent of V = (n − 1)s²/σ², and V has a χ²_{n−1} distribution; hence, by the definition of the (Student) t distribution,
T(μ) = √n(X̄ − μ)/s = Z(μ)/√(V/(n − 1))
has the t distribution T_{n−1} whatever be μ and σ². Let t_k(p) denote the pth quantile of the T_k distribution. Then
P( −t_{n−1}(1 − ½α) ≤ T(μ) ≤ t_{n−1}(1 − ½α) ) = 1 − α.
Solving the inequality inside the probability for μ, we find
P[ X̄ − s t_{n−1}(1 − ½α)/√n ≤ μ ≤ X̄ + s t_{n−1}(1 − ½α)/√n ] = 1 − α.
The shortest level (1 − α) confidence interval of the type X̄ ± sc/√n is, thus,
[X̄ − s t_{n−1}(1 − ½α)/√n, X̄ + s t_{n−1}(1 − ½α)/√n].    (4.4.1)
Similarly, X̄ − s t_{n−1}(1 − α)/√n and X̄ + s t_{n−1}(1 − α)/√n are natural lower and upper confidence bounds with confidence coefficients (1 − α).
To calculate the coefficients t_{n−1}(1 − ½α) and t_{n−1}(1 − α), we use a calculator, computer software, or Tables I and II. For instance, if n = 9 and α = 0.01, we enter Table II to find that the probability that a T_{n−1} variable exceeds 3.355 is .005. Hence, t_{n−1}(1 − ½α) = 3.355 and
[X̄ − 3.355 s/3, X̄ + 3.355 s/3] is the desired level 0.99 confidence interval.
From the results of Section B.7 (see Problem B.7.12), we see that as n → ∞ the T_{n−1} distribution converges in law to the standard normal distribution. For the usual values of α, we can reasonably replace t_{n−1}(p) by the standard normal quantile z(p) for n > 120.
Up to this point, we have assumed that X_1, . . . , X_n are i.i.d. N(μ, σ²). It turns out that the distribution of the pivot T(μ) is fairly close to the T_{n−1} distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal. In this case the interval (4.4.1) has confidence coefficient close to 1 − α. On the other hand, for very skew distributions such as the χ² with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the confidence coefficient of (4.4.1) can be much larger than 1 − α. The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. See Figure 5.3.1. If we assume σ² < ∞, the interval will have probability (1 − α) in the limit as n → ∞. □
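Computing (4.4.1) with software is straightforward. The following Python sketch uses simulated data for illustration; the sample and the choice n = 9, α = 0.01 are hypothetical.

```python
# Sketch: the level (1 - alpha) Student t interval (4.4.1) for mu;
# the data are simulated only to illustrate the computation.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=9)      # hypothetical sample, n = 9
n, alpha = len(x), 0.01

xbar, s = x.mean(), x.std(ddof=1)               # ddof=1 gives s^2 with divisor n - 1
tq = t.ppf(1 - alpha / 2, df=n - 1)             # t_{n-1}(1 - alpha/2)
lower = xbar - s * tq / np.sqrt(n)
upper = xbar + s * tq / np.sqrt(n)
print((lower, upper))
```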
Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that X_1, . . . , X_n is a sample from a N(μ, σ²) population. By Theorem B.3.1, V(σ²) = (n − 1)s²/σ² has a χ²_{n−1} distribution and can be used as a pivot. Thus, if we let x_{n−1}(p) denote the pth quantile of the χ²_{n−1} distribution, and if α_1 + α_2 = α, then
P( x(α_1) ≤ V(σ²) ≤ x(1 − α_2) ) = 1 − α.
By solving the inequality inside the probability for σ², we find that
[(n − 1)s²/x(1 − α_2), (n − 1)s²/x(α_1)]    (4.4.2)
is a confidence interval with confidence coefficient (1 − α).
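For instance, the equal-tailed version of (4.4.2) can be computed as in the following Python sketch; the simulated data, n = 30 and α = 0.05 are illustrative assumptions.

```python
# Sketch: the equal-tailed version of interval (4.4.2) for sigma^2
# (alpha_1 = alpha_2 = alpha / 2); simulated data, illustrative only.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
x = rng.normal(size=30)
n, alpha = len(x), 0.05
s2 = x.var(ddof=1)

lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
print((lower, upper))
```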
The length of this interval is random. There is a unique choice of α_1 and α_2 that uniformly minimizes expected length among all intervals of this type. It may be shown that for n large, taking α_1 = α_2 = ½α is not far from optimal (Tate and Klett, 1959). The pivot V(σ²) similarly yields the respective lower and upper confidence bounds (n − 1)s²/x(1 − α) and (n − 1)s²/x(α).
In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. In Problem 1.1.16 we give an interval with correct limiting coverage probability. □
The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in Bernoulli Trials. If X_1, . . . , X_n are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre-Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write
P( |√n(X̄ − θ)| / √(θ(1 − θ)) ≤ z(1 − ½α) ) ≈ 1 − α.
Let k_α = z(1 − ½α) and observe that this is equivalent to
P[ (X̄ − θ)² ≤ (k_α²/n) θ(1 − θ) ] = P( g(θ, X̄) ≤ 0 ) ≈ 1 − α,
where
g(θ, X̄) = (1 + k_α²/n)θ² − (2X̄ + k_α²/n)θ + X̄².
For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots. In terms of S = nX̄, they are^{(1)}
θ̲(X̄) = { S + ½k_α² − k_α √[S(n − S)/n + k_α²/4] } / (n + k_α²),
θ̄(X̄) = { S + ½k_α² + k_α √[S(n − S)/n + k_α²/4] } / (n + k_α²).    (4.4.3)
Because the coefficient of θ² in g(θ, X̄) is greater than zero,
{θ : g(θ, X̄) ≤ 0} = {θ : θ̲(X̄) ≤ θ ≤ θ̄(X̄)},    (4.4.4)
so that [θ̲(X̄), θ̄(X̄)] is an approximate level (1 − α) confidence interval for θ. We can similarly show that the endpoints of the level (1 − 2α) interval are approximate upper and lower level (1 − α) confidence bounds. These intervals and bounds are satisfactory in practice for the usual levels, if the smaller of nθ, n(1 − θ) is at least 6. For small n, it is better to use the exact level (1 − α) procedure developed in Section 4.5. A discussion is given in Brown, Cai, and DasGupta (2000).
Note that in this example we can determine the sample size needed for desired accuracy. For instance, consider the market researcher whose interest is the proportion θ of a population that will buy a product. He draws a sample of n potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that (4.4.4) has length 0.02 and is a confidence interval with confidence coefficient approximately 0.95. To see this, note that the length, say l, of the interval is
l = 2k_α { √[S(n − S)/n + k_α²/4] } (n + k_α²)^{-1}.
Now use the fact that
S(n − S)/n = n/4 − n^{-1}(S − ½n)² ≤ n/4    (4.4.5)
to conclude that
l ≤ k_α(n + k_α²)^{-1/2}.    (4.4.6)
Thus, to bound l above by l_0 = 0.02, we choose n so that k_α(n + k_α²)^{-1/2} = l_0. That is, we choose
n = (k_α/l_0)² − k_α².
In this case, 1 − ½α = 0.975, k_α = z(1 − ½α) = 1.96, and we can achieve the desired length 0.02 by choosing n so that
n = (1.96/0.02)² − (1.96)² = 9,600.16, or n = 9,601.
This formula for the sample size is very crude because (4.4.5) is used and it is only good when θ is near 1/2. Better results can be obtained if one has upper or lower bounds on θ such as θ ≤ θ_0 < ½ or θ ≥ θ_1 > ½. See Problem 4.4.4.
Another approximate pivot for this example is √n(X̄ − θ)/√(X̄(1 − X̄)). This leads to the simple interval
X̄ ± z(1 − ½α) √(X̄(1 − X̄))/√n.    (4.4.7)
See Brown, Cai, and DasGupta (2000) for a discussion. □
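Both intervals are easy to compute; the following Python sketch evaluates (4.4.3) and (4.4.7) for an illustrative count of 37 successes in 100 trials (these numbers are assumptions, not from the text).

```python
# Sketch: the approximate intervals (4.4.3) and (4.4.7) for a binomial
# proportion; the counts below are illustrative.
from math import sqrt
from scipy.stats import norm

n, s, alpha = 100, 37, 0.05          # hypothetical: 37 successes in 100 trials
k = norm.ppf(1 - alpha / 2)          # k_alpha = z(1 - alpha/2)
xbar = s / n

# Interval (4.4.3): the two roots of g(theta, xbar) <= 0.
half = k * sqrt(s * (n - s) / n + k**2 / 4)
lo_443 = (s + k**2 / 2 - half) / (n + k**2)
hi_443 = (s + k**2 / 2 + half) / (n + k**2)

# Interval (4.4.7): xbar +/- k * sqrt(xbar * (1 - xbar) / n).
lo_447 = xbar - k * sqrt(xbar * (1 - xbar) / n)
hi_447 = xbar + k * sqrt(xbar * (1 - xbar) / n)

print((lo_443, hi_443), (lo_447, hi_447))
```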
Confidence Regions for Functions of Parameters
We can define confidence regions for a function q(θ) as random subsets of the range of q that cover the true value of q(θ) with probability at least (1 − α). Note that if C(X) is a level (1 − α) confidence region for θ, then
q(C(X)) = {q(θ) : θ ∈ C(X)}
is a level (1 − α) confidence region for q(θ).

Example 4.4.4. Let X_1, . . . , X_n denote the number of hours a sample of Internet subscribers spend per week on the Internet. Suppose X_1, . . . , X_n is modeled as a sample from an exponential, E(θ^{-1}), distribution, and suppose we want a confidence interval for the population proportion P(X ≥ x) of subscribers that spend at least x hours per week on the Internet. Here q(θ) = 1 − F(x) = exp{−x/θ}. By Problem B.3.4, 2nX̄/θ has a chi-square, χ²_{2n}, distribution. By using 2nX̄/θ as a pivot we find the (1 − α) confidence interval
2nX̄/x(1 − ½α) ≤ θ ≤ 2nX̄/x(½α),
where x(β) denotes the βth quantile of the χ²_{2n} distribution. Let θ̲ and θ̄ denote the lower and upper boundaries of this interval; then
exp{−x/θ̲} ≤ q(θ) ≤ exp{−x/θ̄}
is a confidence interval for q(θ) with confidence coefficient (1 − α). If q is not 1-1, this technique is typically wasteful. That is, we can find confidence regions for q(θ) entirely contained in q(C(X)) with confidence level (1 − α). For instance, we will later give confidence regions C(X) for pairs θ = (θ_1, θ_2)^T. In this case, if q(θ) = θ_1, q(C(X)) is larger than the confidence set obtained by focusing on θ_1 alone. □
Confidence Regions of Higher Dimension
We can extend the notion of a confidence interval for one-dimensional functions q(B) to r-dimensional vectors q( 0) � (q1 ( 0), . . . , q" ( 0) ) . Suppose q (X) and q1 (X) are reai J valued. Then the r-dimensional random rectangle -
! (X) � { q(O) • q; (X) < q; (O) < q; (X) , j � l, . . . , r ) is said to be a level (I - a) confidence region, if the probability that it covers the unknown but fixed true ( q1 ( 0), . . . , qr( 0)) is at least (! - a). We write this as
P[q(O) E /(X)] 2 I
-
o.
Note that if I; (X) � [q . , q; J is a level (1 - a; ) confidence interval for q; ( 0) and if the -} I. (X) X . . · x pairs (T 1 T,), . . , (Ir, Tr) are independent, then the rectangle I (X) Ir(X) has level ,
.
=
(4.4.8)
Testing and Confidence Regions
240
Chapter 4
Thus, an r-dimensional confidence rectangle is in this case automaticalll obtained from the one-dimensional intervals. Moreover, if we choose o = 1 - ( 1 r , then I(X) has
j
1 - a:.
confidence level
- a)
IJ are not independent is to use Bonferroni's
An approach that works even if the
in
equality (A.2.7). According to this inequality,
P[q(O ) E I(X)] > Thus, if we choose
a,
=
ajr, j
1-
=
1, .
c
L P[qj(O ) "' Ij(X)] > j=l .
. , r, then
1-
c
:Laj· j= l
I(X) has confidence level (1 - a).
Example 4.4.5. Confidence Rectangle for the Parameters ofa Normal Distribution. Sup pose X1 , Xn is a N(p, u2 ) sample and we want a confidence rectangle for (t.t, a2 ). .
.
•
,
From Example 4.4.1
h (X )
=
X ± stn-1 (1 - !a )/ Vn
( - !a). From Example 4.4.2,
is a confidence interval for J.L with confidence coefficient 1
I2 (X) _ -
(n - l ) s2
Xn-1 ( 1 -
is a reasonable confidence interval for
I, (X)
is a level ( 1
1 ( Xn-1 4 a)
with confidence coefficient
'' " '
:
'
ij · I ,
,
,
!2 (X)
-
Thus,
.
The method of pivots can also be applied to oo-dimensional parameters such as
F.
X1. . . , Xn are i.i.d. as X P, and we are interested in the function F(t) F(·). We assume that F is P(X < t); that is, v(P)
Example 4.4.6.
Suppose
rv
.
' '
distribution
' ''
continuous, in which case (Proposition 4.1.1) the distribution of
"
(1 - �a).
a) confidence rectangle for (p, u2 ) It is possible to show D that the exact confidence coefficient is ( 1 - �a) 2 • See Problem 4.4.15.
'
, 'I
x
a2
14a , )
(n - l)s2
=
=
Dn(F)
=
su
p I F(t) - F(t) i
tER
Dn (F) is a pivot. Let da be - a. Then by solving Dn(F) < d. for F, we
does not depend on F and is known (Example 4.1 .5). That is,
PF(Dn(F) < dn) 1 find that a simultaneous in t size 1 - a confidence region C(x)(·) is the confidence band which, for each t E R, consists of the interval chosen such that
=
C(x)(t)
(max{ O, F(t) - d0}, �
=
min { !,
�
F(t) + d.}).
We have shown
P(C(X)(t) for all
::J
F(t) for all t E R)
P E P = set of P with P( -oo, t]
continuous in
t.
=
1 -
a 0
, ,
i ' J •
l
Section 4.5
241
The Duality Between Conf1dence Regions and Tests
We can apply the notions studied in Examples 4.4.4 and 4.4.5 to give confidence regions for scalar or vector parameters in nonparametric models. Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Vari able. Suppose X, , . . . , Xn are i.i.d. as X and that X has a density f( t) = F' (t), which is zero for t < 0 and nonzero for t > 0. By integration by parts. if J.t = IL(F) = f0= tj(t)dt exists, then Let F-(t) and F+ (t) be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then a ( 1 a) lower confidence bound for p. is p. given by -
(4.4.9)
-
because for C(X) as in Example 4.4.6, J.t = inf { J.t( F) : F E C(X)} = J.t(F+ ) and sup{J.t(F) : F E C(X)} = J.t(F-) = oo-see Problem 4.4.19. Intervals for the case F supported on an interval (see Problem 4.4.18) arise in ac counting practice (see Bickel, 1992) where such bounds are discussed and shown to be D asymptotically strictly conservative. -
Summary. We define lower and upper confidence bounds (LCBs and UCBs), confidence
intervals, and more generally confidence regions. In a parametric model {Pe : B E e}, a Ievel l - a: confidence region for a parameter q(B) is a set C(x) depending only on the data x such that the probability under Po that C(X) covers q(8) is at least 1 - a: for all 0 E 8. For a nonpararnetric class P = {P) and parameter v = v(P), we similarly require P(C(X) ::> v) > 1 - a for all P E P. We derive the (Student) t interval for I' in the N(/1, a2 ) model with a2 unknown, and we derive an exact confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function F( t) and the mean of a positive variable X.
4.5
THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS
Confidence regions are random subsets of the parameter space that contain the true param eter with probability at least 1 a. Acceptance regions of statistical tests are. for a given hypothesis H, subsets of the sample space with probability of accepting H at least 1 - a: when H is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. We begin by illustrating the duality in the following example. -
Example 4.5.1. 1\vo-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value f-LO for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant n times obtaining
242
Testing and Confidence Regions
measurements
Xi
X1 .
.
.
..
Xn.
Chapter 4
Knowledge of his instruments leads him to assume that the
are independent and identically distributed normal random variables with mean {L and
a2. If any value of Jl other than /LO is a possible alternative, then it is reasonable to formulate the problem as that of testing H : f-l = J-lo versus K : f-l I- f-lO· We can base a size o test on the level (1 - a ) confidence interval ( 4.4.1) we constructed for Jl as follows. We accept H. if and only if, the postulated value Jlo is a member of the level ( 1 - o ) confidence interval variance
[X - stn-1 (1 - �a) /vn. X + stn-1 (1 - la) / vn].
(4.5, [ )
= y'n(X - /1-o) / s, then our test accepts H, if and only if, -tn -l ( 1 - �a) < T < tn-1 (t - la) . Because P" [IT I = tn-1 (1 - )a)] = 0 the test is equivalently characterized by rejecting H when I TI > tn-1 (1 - �a). This test is called two-sided If we let T
because it rejects for both large and small values of the statistic T. In contrast to the tests of Example 4. I .4, it has power against parameter values on either side of /1-0·
Because the same interval (4.5 . 1 ) is used for every J.Lo we see that we have, in fact, generated a family of level
a tests {J(X, 11Jl where
1 if vnY�;� 1 > tn -1 (1 - �a) 0 otherwise.
(4.5.2)
These tests correspond to different hypotheses, O{X, J.lo ) being of size pothesis
H : Jl = flo-
a only for the hy
Conversely, by starting with the test (4.5.2) we obtain the confidence interval (4.5.1)
by finding the set of J1 where J(X, Jl)
� 0.
We achieve a similar effect, generating a family of level
a tests, if we start out with
(1 - a) LCB X - tn_ 1 (1 - o.)sj y1i and define J•(x, Jl) to equal ! if, and only if, X - tn_1(1 - a)s/vn 2 Jl· Evidently, (say) the level
=
1 - a. D
These are examples of a general phenomenon. Consider the general framework where the random vector X takes values in the sample space
I I
I
I I
r
E P.
Let
I
X has
distribution
,
(1 - a), that is
J
!
,
P[v E S( X ) ] > 1 - a,
''
and
v = v(P) be a parameter that takes values in the set N. For instance, in Example 4.4. 1 , J1 = Jl(P) takes values in N = (-oo, oo) in Example 4.4.2, "2 = "2 ( P) takes values in N = (0, oo ), and in Example 4.4.5, (f', "2) takes values in N = ( -oo, oo) x (0, oo). For a function space example, consider v(P) = F, as in Example 4.4.6, where F is the distribution function of Xi. Here an example of N is the class of all continuous distribution functions. Let S = S( X ) be a map from X to subsets of N, then S is a (1 - a) confidence region for v if the probability that S(X) contains v is at least P
'
X C Rq
all P
E P. . •
,
• ,
Section 4.5
243
The Duality Between Conf1dence Regions and Tests
H = Hv0 : v = v0 test J(X, v0) with level a. Then the
Next consider the testing framework where we test the hypothesis for some specified value
vo.
Suppose we have a
acceptance regiOn
A (vo) � {x : J(x, vo) = 0 }
1 - a. For some specified vo, H may be accepted, for other specified vo, H may be rejected. Consider the set of v0 for which H.,.0 is accepted; this is a random set contained in N with probaPihty at least 1 - a of containing the true value of v(P) whatever be P. Conversely, if S(X) is a Ievel l - a confidence region for v, then the test that accepts H.,.0 if and only if v0 is in S(X), is a level o test for H.,.0 • Formally, let Pv0 = {P : v(P) � v0 : vo E V}. We have the following. Duality Theorem. Let S(X) = {v0 E N : X E A(vo)). then is a subset of X with probability at least
P[X E A(v0)) > 1 - a for all P E Pv, ifand only if S(X) is a 1 - a confidence region for v. We next apply the duality theorem to MLR families:
Theorem 4.5.1. Suppose X � Po where {Po : 8 E 6} is MLR in T = T(X) and suppose that the distribution/unction Fo(t) ofT under Po is continuous in each of the variables t and B when the other is fixed. If the equation Fo(t) � 1 - a has a solution li.(t) in e. then Ba(T) is a lower confidence boundfor () with confidence coefficient 1 - a. Similarly, any solution B.(T) of Fo(T) = a with Ba E 6 is an upper confidence bound for 8 with coefficient ( 1 - a). Moreover, if a1 + a2 < 1, then [.Qa1 , Ba2] is confidence intervalfor () with confidence coefficient 1 - (o t + a2 ).
Proof. By Corollary 4.3 . 1 , the acceptance versus K : B > Bo can be written
region of the
UMP size a test of H : B
=
Bo
A(Bo) = {x : T(x) < to0(1 - a) } where
to0 (1 - a ) is the 1 - a quantile of Fo0• By the duality theorem, if s(t) = {8 E 6 : t < to(l - a ) } ,
then S(T) is a we find
1 - a confidence region for B. By applying Fo to hnth sides of t 5 to(1-a), S(t) � {B E 6 : Fo(t) < 1 - a}.
By
Theorem
4.3.1,
the power function
Po(T > t) = 1 - Fo(t)
for
a
test with critical
t is increasing in (). That i�, Fe (t) is decreasing in (). It follows that Fo (t) < 1 - a iff B > lia(t) and S(t) = [lia, oo). The proofs for the upper confid�nce hnund and interval
constant
fotlow by the same type of argument,
0
We next give connections between confidence bounds, acceptance regions, and p-values for MLR families: Let t denote the observed value
t = T(x) ofT(X) for the datum x, let
244
Testing and Confidence Regions
Chapter 4
o:(t, Oo ) denote the p-value for the UMP size a test of H : (} = Oo versus K : () > Oo, and
let
A• (B) = T(A(B)) = {T(x) : x E A(B)).
Corollary 4.5.1. Under the conditions of Theorem 4.4.1,
{t : a(t,B) > a) = (-oo, to(1 - a)] {B : a(t,B) > a) = [�(t), oo).
A•(B) S (t) Proof. The p-value is
a(t, B) = Po (T > t) = 1 - Fo(t). We have seen in the proof of Theorem 4.3.1 that 1 - Fo(t) is increasing in 0. Because D Fe(t) is a distribution function, 1 - Fo(t) is decreasing in t. The result follows. In general, let a( t, v0) denote the p-value of a test b (T, vo ) = 1 [T > c] of H : v = vo based on a statistic T = T( X ) with observed value t = T(x). Then the set C = {(t,v) : a(t,v) < a} = {(t,v) : J(t,v) = 0} (t, v)
t, v
t
where, for the given gives the pairs will be accepted; and for the given v, is plane, in the acceptance region. We call C the set of compatible points. In the vertical sections of C are the confidence regions S( whereas horizontal sections are the acoeptance regions • = 0). We illustrate these ideas using the example , Xn are i.i.d. N(f-L, cr2 ) with a2 known. Let T = X, of testing H : fL = fLO when X1 , then C = { ( , p : [t - p[ S <7z
t)
A (v) = {t : J(t, v) .
.
(t, B)
(t, B)
•
t )
(1 - ia)/ vn} . Figure 4.5.1 shows the set C, a confidence region S(t0), and an acceptance set A• (po ) for this example.
Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let X1, . . . , Xn be the indicators of n binomial trials with probability E (0, 1), we seek reasonable exact level ( 1 - �) upper and lower of success 6. For To find a lower confidence bound for (J confidence bounds and confidence intervals for o E (0, 1). our preceding discussion leads us to consider level tests for H : (} < denote the critical Let We shall use some of the results derived in Example constant of a level (1 confidence region is test of H. The correspouding level given by C(Xt , . . , Xn) = : S < - 1 },
a
. .
'
, " '1 .
•
•
.
,
,
8.
a)
.
xi.
{B
a 4.1.3.
k(B, a)
where s = Ef I To analyze the structure of the region we need to examine
,,
(i)
'.,, '
(ii)
,, •
>I ,
�,
''i
•
·!
k(B,a) is nondecreasing in B. k(B,a) � k(Bo, a) ifB i Bo.
k(Oo, a) (1 - a)
Oo. B
k(8, a). We claim that
l 1
Section 4.5
245
The Duality Between Confidence Regions and Tests
Figure 4.5.1. The shaded region is the compatibility set C for the two-sided test of Hp.0 : J.L = J.to in the normal model. S(to) is a confidence interval for f.l for a given value to ofT, whereas A* (f.lo) is the acceptance region for Hp.o.
(iii)
k(fJ, a) increases by exactly
(iv)
k(O,a) � I
and k( l
,
a)
�
1
at its points of discontinuity.
n + 1.
To prove (i) note that it was shown in Theorem 4.3.1 (i) that Pe [S > j] is nondecreasing 02 and in 6 for fixed j. Clearly, it is also nonincreasing in j for fixed e. Therefore, 01 k(B1, a) > k(B2 , a) would imply that
<
a > Po, [S 2: k(B2 ,a) ] > Po, [S > k(B2 , a) - 1]
2:
Po, [S > k(B1 ,a) - 1] > a,
a contradiction. The assertion (ii) is a consequence of the following remarks. If Bo is a discontinuity point o k(B, <>), let j be the limit of k(B, <>) as 0 t Bo. Then Po [S > j] a for all 0 < Bo and, hence. Po, [S > j] a. On the other hand, if 0 > 00, Po [S 2: j] > a. Therefore, Po, [S > j] � a and j = k(B0, <>). The claims (iii) and (iv) are left as exercises. From (i), (ii), (iii), and (iv) we see that, if we define
<
<
B(S) = inf{B : k(B,a) � S
+ 1},
then
C(X) �
l] { (B(S), 0, 1 [
]
if s > 0 ifS = 0
246
Testing and Confidence Regions
Chapter 4
I
and �(S) is the desired level (I - a) LCB for 0 (2l Figure 4.5.2 portrays the situation. From our discussion, when S > 0, then k(O(S) , a) = S and, therefore, we find 8(S) as the unique solution of the equation,
When S = 0, 8(S) = 0. Similarly, we define
O(S) = sup(O : j(O,a) = S - 1} where j ( 8, a) is given by, •
Then 0 (S) is a level
(I - a) UCB for 0 and when S < n, 0(S) is the unique solution of
�( � ) s
or(! - o)n -r = a.
When S = n, O(S) = I. Putting the bounds f!(S), O(S) together we get the confidence interval (O(S), O(S)] oflevel (1- 2a). These intervals can be obtained from computer pack ages that use algorithms based on the preceding considerations. As might be expected, if n is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3.
, •
1 •
•
I
'
' •
I
l I ' i • •
•
l
I ' •
4
k(O,O.i6)
3
'�
I
I i
l l
I i ,
Figure 4.5.2. Plot of k(O, 0.16) for n = 2.
,I
'I
,. . .
• •
j
Section 4.5
247
The Duality Between Confidence Regions and Tests
Applications of Confidence Intervals to Comparisons and Selections
We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if H B = 80 is rejected in favor of H : () -1- Bo, we usually want to know whether H : () > Bo or H : B < Bo. For instance, suppose B is the expected difference in blood pressure when two treat ments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test H : B = 0 versus K : 8 -1- 0. If H is rejected, it is natural to carry the comparison of A and B further by asking whether 8 < 0 orB > 0. If we decide () < 0, then we select A as the better treatment, and vice versa. The problem of deciding whether B = 80, 8 < 80, or B > Bo is an example of a three decision problem and is a special case of the decision problems in Section 1.4, and 3.1-3.3. Here we consider the simple solution suggested by the level ( 1 - a) confidence interval !: :
1.
2.
3.
Make no judgment as w whether 0 < 80 or 8 > 80 if I contains Bo; Decide B < Bo if I i s entirely to the left of Bo; and Decide 8 > 80 if I is entirely to the right of 80.
(4.5.3)
, Xn are i.i.d. N(J-l, a2 ) with u2 known. In Section 4.4 we considered the level { 1 - a) confidence interval X ± uz{l - � a)/ fo for f-t. Using this interval and ( 4.5.3) we obtain the following three decision rule based on T = .JTi(X a: o)/ ' l Do not reject H : I' � l'c if ITI < z(l - !a). Decide I' < l'o ifT < -z(t - ia). Decide I' > l'c ifT > z(l - ia). Example 4.5.2. Suppose X1 , . . .
Thus, the two-sided test can be regarded as the first step in the decision procedure where if H is not rejected, we make no claims of significance, but if H is rejected, we decide whether this is because () is smaller or larger than Bo. For this three-decision rule, the probability of falsely claiming significance of either 8 < Bo or () > Bo is bounded above by �a. To see this consider first the case 8 > 00. Then the wrong decision "11 < �-to" is made when T < -z(l - � a). This event has probability
(
)
( l' - l'o) < P[T < -z(l - la)J � -z(l - la) - .Jii a ( -z(l - la)) � la. Similarly, when f-t < �-to. the probability of the wrong decision is at most � a. Therefore, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the a of the parent test or confidence intervaL We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions. Summary. We explore the connection between tests of statistical hypotheses and confi
dence regions. If J{x, vo} is a level a test of H
:
v = v0 , then the set S(x) of v0 where
248
Testing and Confidence Regions
o ("'' v0)
�
0 is a level (1
- o ) confidence region for
fidence region for v. then the test that accepts
H
;
v
va.
If S(x) is a level
( 1 - a) con·
E S(:r)
= v0 when vo
Chapter 4
is a level
a:
test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter 8 is (}0, less than
4.6
80, or larger than 90, where 80
is a specified value.
'
'
' '
'
UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS
In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. We next show that for a certain notion of accuracy of confidence bounds, which is connected to the power of the associated one-sided tests, optimality of the tests translates into accuracy of the bounds. If (} and (}* are two competing level both very likely to fall below the true
B.
(1 - a ) lower confidence bounds for
But we also want the bounds to be close to
we say that the bound with the smaller probability of being far below Formally, for
(}, they are
X E X c Rq, the following is true.
B.
Thus,
(} is more accurate.
Definition 4.6.1. A level ( 1 - a) LCB 8* of(} is said to be more accurate than a competing level ( 1 - a) LCB 8 if, and only if, for any fixed 8 and all 8' < B. P, �· (X) < 8'] < Po ]B(X) :S 8']. - a ) UCB 8 8' > e.
Similarly, a level ( 1 any fixed e and all
-·
(4.6. 1)
is more accurate than a competitor 8 if, and only if, for
(4.6.2)
Lower confidence bounds B* satisfying (4.6.1) for all competitors are called uniformly
most accurate as are upper confidence bounds satisfying (4.6.2) for all competitors. Note
(1 - a) uniformly most accurate level ( 1 - a) UCB for -8. that
'
8*
is a uniformly most accurate level
LCB for B, if and only if
-8*
is a
•
.
'
I
.
i
1 ' '
' ,
.
.
:1;1
.
Example 4.6.1 (Examples 3.3.2 and 4.2.1 continued). Suppose X = (X,, . . . , Xn) is a sample of a N(J.L, a-2 ) random variables with a2 known. A level a test of H : J.L = J.Lo vs K ' J.1. > J.l.o rejects H when ..fii( X - J.l.o )/u > z(1 - a) . The dual lower confidence bound is J.l., (X) = X - z(1 - a)u/ ..fii. Using Problem 4.5.6, we find that a competing lower confidence bound is J.1. (X) = X(k)• where X(l) :S X(2) :S · · · :S X(n denotes ) 2 the ordered X,, . . . , Xn and k is defined by P(S > k) � 1 - a for a binomial, B(n, � ),
J.L (X) is random variable S. Which lower bound is more accurate? It does tum out that _,
more accurate than e2(X) and is, in fact, uniformly most accurate in the N(J.L,
.
'
L
I!
I
a-2 ) model.
This is a consequence of the following theorem, which reveals that (4.6.1) is nothing more than a comparison of (X)Wer functions.
�------ --
I ' I
•
I '
'
'
I
'
-
P, [e• (X) < B'] < Po [B(X) < 8']. '
'
0
!
I
'
'
i !
I
l
'
I
'
Section 4. 6
249
Uniformly Most Accurate Confidence Bounds
Theorem 4.6.1. Let fJ* be a level {1 - n) LCB for B, a real parameter, such that for each (}0 the associated test whose critical function O* (x. Bo) is given by
I ifO"(x) > Oo
o"(x, Oo)
0 otherwise is UMP level a for H : (} level ( I - a).
=
Bo versus K (} > Bo. Then 0_* is uniformly most accurate at :
Proof Let 0 be a competing level ( I - a ) LCB
Oo. Defined o (x, 00) by
o(x, 00) � 0 if, and only if, O(x) < 00.
Then o(X,Oo) is a level a test for H : 0 � Oo versus K : 0 > Oo. Because J"(X, 00) is UMP level a for H : (} = Bo versus K : B > Bo, for fJ 1 > (}0 we must have
Ee. (o(X, Oo)) < Ee, (o"(X, Oo)) or
Pe, [O(X) > Oo] < Pe, [O" (X) > Oo]. Identify follows.
Bo
with (}' and 81 with (} in the statement of Definition 4.4.2 and the result o
If we apply the result and Example 4.2.1 to Example 4.6. 1 , we find that x - z(l o:)a / JTi is uniformly most accurate. However, X(k) does have the advantage that we don't have to know a or even the shape of the density f of Xi to apply it. Also, the robustness considerations of Section 3.5 favor X{k) (see Example 3.5.2). Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see Problem 4.6. 7 for the proof), they have the smallest expected "distance" to 0:
Corollary 4.6.1. Suppose �·(X) is UMA level (I - a ) lower confidence boundfor 0. Let O(X) be any other (I - a ) lower confidence bound, then for all 0 where a+
�
a,
if a > 0, and 0 otherwise.
We can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define q* to be a uniformly most accurate level (1 - a ) LCB for q(0) if, and only if, for any other level ( I - a) LCB q,
Pe[_f < q(O')] < Pe [q < q(O') ] whenever q((}') < q( 8). Most accurate upper confidence bounds are defined similarly.
Example 4.6.2. Boundsfor the Probability ofEarly Failure ofEquipment. Let X 1 , , Xn be the times to failure of n pieces of equipment where we assume that the Xi are indepen dent £(A) variables. We want a uniformly most accurate level {1 - a) upper confidence bound q* for q( A) = 1 - e->.to, the probability of early failure of a piece of equipment. .
•
.
Testing and Confidence Regi ons
250
C h a pter 4
We begin by finding a uniformly most accurate level ( 1 - a) UCB ). for A. To find ). we invert the family of UMP level a tests of H : A � Ao versus K : >. < A 0 . By Problem 4.6.8, the UMP test accepts H if •
•
n
L Xi < X2n ( 1 - a) /2>-o
(4.6.3)
i=l
or equivalently if
A0
X2 n ( 1 - a ) n
< 2 "l<' L.. t = l X'
where X2 n ( 1 - a) is the ( 1 - a) quantile of the X � n distribution. Therefore, the confidence region corresponding to this test is ( 0, 5. ) where 5. is by Theorem 4.6. 1 , a uniformly most accurate level (1 - a ) UCB for A and, because q is strictly increasing in A, it follows that q(,\ ) is a uniformly most accurate level ( 1 - a ) UCB for the probability of early D failure. *
*
*
Discussion We have only considered confidence bounds. The situation with confidence intervals i s more complicated. Considerations of accuracy lead u s to ask that, subject to the require ment that the confidence level is ( 1 a) , the confidence interval be as short as possible. Of course, the length T 'I. is random and it can be shown that in most situations there is no confidence interval of level (1 - a) that has uniformly minimum length among all such intervals. There are, however, some large sample results in this direction (see Wilks, 1 962, pp. 374-376). If we tum to the expected length Ee (T - 'I.) as a measure of precision, the situation is still unsatisfactory because, in general, there does not exist a member of the class of level ( 1 a) intervals that has minimum expected length for all B. However, as in the estimation problem, we can restrict attention to certain reasonable subclasses of level ( 1 - a) intervals for which members with uniformly smallest expected length exist. Thus, Neyman defines unbiased confidence intervals of level ( 1 a) by the property that -
-
-
-
Pe ['I. � q(B) � f] � Pe ['I. � q(B' ) � T] for every B, B' . That is, the interval must be at least as likely to cover the true value of q( B) as any other value. Pratt ( 1961) showed that in many of the classical problems of estimation . there exist level ( 1 - a) confidence intervals that have uniformly minimum expected length among all level ( 1 - a) unbiased confidence intervals . In particular, the intervals developed in Example 4 . 5 . 1 have this property. Confidence intervals obtained from two-sided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. These topics are discussed in Lehmann ( 1 997).
Summary. By using the duality between one-sided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level ( 1 a) in the sense that, in the case of lower bounds, they are less likely than other level (1 - a) lower confidence bounds to fall below any value B' below the true B . -
._
r t ·
i
't
;
Section 4 . 7
4.7
251
Freq uen tist a n d Bayesian Form u lations
F R E Q U E N T I S T AN D BAY E S I A N FO R M U lAT I O N S
We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X E
X
c
Rq
are random while the parameters are fixed but unknown.
A consequence of this approach is that once a numerical interval has been computed from
experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a
100( 1· - a)%
confidence interval is that if we repeated an experiment
indefinitely each time computing a
100 ( 1 - a)%
confidence interval, then
100 ( 1 - a)%
of
the intervals would contain the true unknown parameter value. In the B ayesian formulation of Sections 1 . 2 and 1 . 6 . 3 , what are called level
(1
...:...
a)
credible bounds and intervals are subsets of the parameter space which are given probability at least
( 1 - a)
by the posterior distribution of the parameter given the data. Suppose that,
given e, X has distribution Pe , e E
8
c
R,
and that e has the prior probability distribution
II. Definition 4.7.1.
Let
II( · f x ) denote the posterior probability distribution of e given X x , (1 - a) lower and upper credible bounds for e if they respectively =
then ft. and B are level sati sfy
II( fi. :S: e j x ) ;:::: 1 - a , II ( e :S: B !x ) � 1 - a.
Turning to Bayesian credible intervals and regions, it is natural to consider the collec tion of e that is "most likely" under the distribution
Definition 4.7.2.
Let
n ( - f x)
denote the density of e given
ck is called a If n
II ( e j x ) .
=
{e : n( - l x)
Thus ,
X = x,
then
� k}
level ( 1 - a) credible region for e i f II( Ck f x ) :2: 1 - a .
(e f x)
is unimodal, then
Ck
will be an interval of the form
[fi., B] .
We next give such
an example .
Suppose that given f..l , X1 , . . . , Xn are i.i.d. N(f..l , 0"6) with 0"6 known, N(f..lo , TJ ) , with f..l o and T;s known. Then, from Example 1 . 1 . 1 2, the posterior distribution of f..l given xl , . . . ' Xn is N(/I B , a� ) , with
Example 4.7. 1. and that f..l
rv
n _x + i f..lo 2 1 -;;:g f..l B = n + l , O"B = n + l U5 -;;:g U5 -;;:g �
It follows that the level
1
-
U5
�
a lower and upper credible bounds f..l f..l - = B - Zl -o
O"o Vn 1 + nr0
(
::"A)
1
2
for
f..l are
Testing a n d Confidence Reg ions
252
(1 -
while the level
a:) credible interval is [p,- , p, + ] with /1
±
2
Compared to the frequentis t interval
1 .3.4
a-o
�
/111 ± Z l - -"'
=
interval is pulled in the direction of Example
Cha pter 4
1
( �) nT0 2
vn 1 +
•
z1 _ � a-olfo, the center fiB of the Bayesian p,0 is a prior guess of the value of p,. See guesses. Note that as To ----> oo, the B ayesian
X ±
p,0,
where
for sources of such prior
interval tends to the frequentist interval; however, the interpretations of the intervals are D
different.
Example 4.7.2.
known. Let >-
where
a
> 0,
=
b
Suppose that given a-2, xl , · · · , Xn are i.i .d. N(p,o, a - 2 and suppose >- has the gamma f( density
�a, � b)
X� +n
credible bound for
A.
distribution, where
t
distribution, then .6_
= =
and
ij�
=
where /1 0 is
1 . 2. 1 2, given x1 , , Xn, l: (xi - p,0 ) 2 . Let X a + n ( a: ) denote the a:th X a + n ( a: ) I ( t + b) is a level ( 1 - a: ) lower
> 0 are known parameters . Then, by Problem
( t + b ) >- has a X�+n
quantile of the
a-2 )
.
.
•
( ( n - 1)s 2 + b)lxa+n (a:)
( 1 - a: ) upper credible bound for a-2 . Compared to the frequentist bound ( n 1) s 2 I Xn- l ( a: ) of Example 4.4.2, a� is shifted in the direction of the reciprocal bI a of the
i s a level
mean of 1r(A).
We shall analyze B ayesian credible regions further i n Chapter
Summary.
5.
(1
I n the Bayesian framework w e define bounds and intervals, called level
-
a: ) credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least
the data
:c.
(1
- a: ) by the posterior distribution of the parameter
In the case of a normal prior
1r ( B)
and normal model
(}
p ( x I B), the level (1
given - a: )
credible interval is similar to the frequentist interval except it is pulled in the direction
p,0
of the prior mean and it i s a little narrower. However, the interpretations are different: In
the frequentist confidence interval, the probability of coverage is computed with the data X random and
B fi xed, X
is computed with
4.8
whereas in the Bayesian credible interval, the probability of coverage
=
x fixed and (} random with probability distribution II ( B I X
=
x) .
P R E D I CT I O N I NT E RVA LS
I n Section variable
Y.
1.4
we discussed situations in which we want t o predict the value o f a random
In addition to point prediction of
that contains the unknown value
Y
Y,
it is desirable to give an interval
with prescribed probability
(1
- a .
)
[Y, Y]
For instance, a
doctor administering a treatment with delayed effect will give patients a time interval
[1:: , f']
in which the treatment is likely to take effect. S imilarly, we may want an interval for the
Section 4 . 8
P rediction Interva ls
253
future GPA of a student or a future value of a portfolio. We define a level ( 1 - n ) prediction interval as an interval [Y, Y] based on data X such that P ( Y ::; Y ::; Y) ?: 1 n . The problem of finding prediction intervals is similar to finding confidence intervals using a pivot: Example 4.8.1. The (Student) t Prediction Interval. As in Example 4.4. 1 , let X1 , . . . , Xn be i.i.d. as X "' N(p,, cr 2 ) . We want a prediction interval for Y Xn+l • which is assumed to be also N(p,, cr2 ) and independent of X1 , . . . , X11 • Let Y Y (X) denote a predictor based on X (X1 , . . . , "� ) . Then Y and Y are independent and the mean squared prediction error (MSPE) of Y is =
=
=
Note that Y can be regarded as both a predictor of Y and as an estimate of p,, and when we do so, AISP E(Y) MSE(Y) + cr 2 , where MSE denotes the estimation theory mean squared error. It follows that, in this case, the optimal estimator when it exists is also the optimal predictor. In Example 3.4.8, we found that in the class of unbiased estimators, X is the optimal estimator. We define a predictor Y* to be prediction unbiased for Y if E(Y* - Y) 0 , and can c�nclu� e that in the class of prediction unbiased predictors, the optimal MSPE predictor is Y X . � We next use the prediction error Y - Y to construct a pivot that can be used to give a prediction interval. Note that =
=
=
Y-Y
=
X
-
1 Xn+l "' N(O, [n - + 1]cr 2 ) .
-1
Moreover, s2 (n - 1 ) l:: � (Xi - X) is independent of X by Theorem independent of Xn+1 by assumption. It follows that =
B .3 . 3
and
Z
1 +y1cr p (Y) - vny -
has a N(O, 1) distribution and is independent of V (n - 1)s2 /cr2 , which has a x;,_ 1 distribution. Thus, by the definition of the (Student) t distribution in Section B . 3 , =
�
=
p (y ) -
Z(Y)
IV
v�
Y-Y
=
vn- 1 + 1s
has the t distribution, Yn-1 · By solving -tn- 1 (1 Y, we find the ( 1 - a) prediction interval
�a) ::; Tp (Y) ::; tn - 1 (1 - � a) for
Y = X ± Jn - 1 + 1stn - 1 (1 - ! a ) .
(4. 8 . 1 )
Note that T_p(Y) acts as a prediction interval pivot in the same way that T(μ) acts as a confidence interval pivot in Example 4.4.1. Also note that the prediction interval is much wider than the confidence interval (4.4.1). In fact, it can be shown using the methods of Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate n^{-1/2}, whereas the width of the prediction interval tends to 2σz(1 - α/2). Moreover, the confidence level of (4.4.1) is approximately correct for large n even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not (1 - α) in the limit as n → ∞ for samples from non-Gaussian distributions. □
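The computation in (4.8.1) is simple to carry out numerically. The following sketch is an illustration added here (not part of the original text); it assumes NumPy and SciPy are available, and the sample values are made up.

import numpy as np
from scipy import stats

def t_prediction_interval(x, alpha=0.05):
    """Level (1 - alpha) Student t prediction interval (4.8.1) for a future
    observation from the same normal population as the sample x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    s = x.std(ddof=1)                      # s^2 = (n-1)^{-1} sum (x_i - xbar)^2
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half_width = np.sqrt(1.0 / n + 1.0) * s * t_crit
    return xbar - half_width, xbar + half_width

# Illustrative (made-up) sample of n = 10 measurements.
sample = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.7, 10.0, 10.4, 9.6, 10.3])
print(t_prediction_interval(sample, alpha=0.05))

Note the factor √(1/n + 1), as opposed to the factor 1/√n in the confidence interval (4.4.1); this is the numerical counterpart of the width comparison just discussed.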
We next give a prediction interval that is valid for samples from any population with a continuous distribution.
Example 4.8.2. Suppose X1, . . . , Xn are i.i.d. as X ~ F, where F is a continuous distribution function with positive density f on (a, b), -∞ ≤ a < b ≤ ∞. Let X_(1) < ··· < X_(n) denote the order statistics of X1, . . . , Xn. We want a prediction interval for Y = X_{n+1} ~ F, where X_{n+1} is independent of the data X1, . . . , Xn. Set U_i = F(X_i), i = 1, . . . , n + 1; then, by Problem B.2.12, U1, . . . , U_{n+1} are i.i.d. uniform, U(0, 1). Let U_(1) < ··· < U_(n) be U1, . . . , Un ordered; then

P(U_(j) ≤ U_{n+1} ≤ U_(k)) = ∫ P(u ≤ U_{n+1} ≤ v | U_(j) = u, U_(k) = v) dH(u, v)
 = ∫ (v - u) dH(u, v) = E(U_(k)) - E(U_(j))

where H is the joint distribution of U_(j) and U_(k). By Problem B.2.9, E(U_(i)) = i/(n + 1); thus,

P(X_(j) ≤ X_{n+1} ≤ X_(k)) = (k - j)/(n + 1).    (4.8.2)

It follows that [X_(j), X_(k)] with k = n + 1 - j is a level 1 - α = (n + 1 - 2j)/(n + 1) prediction interval for X_{n+1}. This interval is a distribution-free prediction interval. See Problem 4.8.5 for a simpler proof of (4.8.2). □
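A sketch of this order-statistic interval (my own illustration, not from the text; it assumes NumPy) picks j so that the exact coverage (n + 1 - 2j)/(n + 1) is at least the requested level.

import numpy as np

def distribution_free_pi(x, level=0.90):
    """Distribution-free prediction interval [X_(j), X_(n+1-j)] of (4.8.2).
    Returns the interval and its exact coverage (n + 1 - 2j)/(n + 1)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # largest j >= 1 with coverage (n + 1 - 2j)/(n + 1) >= level
    j = int(np.floor((n + 1) * (1 - level) / 2))
    if j < 1:
        raise ValueError("n is too small for the requested level")
    coverage = (n + 1 - 2 * j) / (n + 1)
    return (x[j - 1], x[n - j]), coverage   # X_(j), X_(n+1-j) in 0-based indexing

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=30)   # works for any continuous F
print(distribution_free_pi(sample, level=0.90))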
Bayesian Predictive Distributions

Suppose that θ is random with θ ~ π and that given θ, X1, . . . , X_{n+1} are i.i.d. with density p(x | θ). Here X1, . . . , Xn are observable and X_{n+1} is to be predicted. The posterior predictive distribution Q(· | x) of X_{n+1} is defined as the conditional distribution of X_{n+1} given x = (x1, . . . , xn); that is, Q(· | x) has in the continuous case density

q(x_{n+1} | x) = ∫ Π_{i=1}^{n+1} p(x_i | θ) π(θ) dθ / ∫ Π_{i=1}^{n} p(x_i | θ) π(θ) dθ

with a sum replacing the integral in the discrete case. Now [Y̲_B, Ȳ_B] is said to be a level (1 - α) Bayesian prediction interval for Y = X_{n+1} if

Q([Y̲_B, Ȳ_B] | x) ≥ 1 - α.
Example 4.8.3. Consider Example 3.2.1 where (X_i | θ) ~ N(θ, σ₀²), σ₀² known, and π(θ) is N(η₀, τ²), τ² known. A sufficient statistic based on the observables X1, . . . , Xn is T = X̄ = n⁻¹ Σ_{i=1}^n X_i, and it is enough to derive the marginal distribution of Y = X_{n+1} from the joint distribution of X̄, X_{n+1} and θ, where X̄ and X_{n+1} are independent. Note that

E[(X_{n+1} - θ)θ] = E{E[(X_{n+1} - θ)θ | θ]} = 0.

Thus, X_{n+1} - θ and θ are uncorrelated and, by Theorem B.4.1, independent. To obtain the predictive distribution, note that given X̄ = t, X_{n+1} - θ and θ are still uncorrelated and independent. Thus,

L{X_{n+1} | X̄ = t} = L{(X_{n+1} - θ) + θ | X̄ = t} = N(μ̂_B, σ₀² + σ̂_B²)

where, from Example 4.7.1,

σ̂_B² = 1 / (n/σ₀² + 1/τ²),   μ̂_B = (σ̂_B²/τ²) η₀ + (n σ̂_B²/σ₀²) x̄.

It follows that a level (1 - α) Bayesian prediction interval for Y is [Y̲_B, Ȳ_B] with

[Y̲_B, Ȳ_B] = μ̂_B ± z(1 - α/2) √(σ₀² + σ̂_B²).    (4.8.3)

To consider the frequentist properties of the Bayesian prediction interval (4.8.3) we compute its probability limit under the assumption that X1, . . . , Xn are i.i.d. N(θ, σ₀²). Because σ̂_B² → 0, (n σ̂_B²/σ₀²) → 1, and X̄ converges in probability to θ as n → ∞, we find that the interval (4.8.3) converges in probability to θ ± z(1 - α/2) σ₀ as n → ∞. This is the same as the probability limit of the frequentist interval (4.8.1). □
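As a numerical illustration (mine, not the book's; the values of σ₀, η₀, and τ below are assumptions), the interval (4.8.3) follows directly from the posterior quantities of Example 4.7.1.

import numpy as np
from scipy import stats

def bayes_prediction_interval(x, sigma0, eta0, tau, alpha=0.05):
    """Level (1 - alpha) Bayesian prediction interval (4.8.3) for X_{n+1}
    under the N(theta, sigma0^2) model with N(eta0, tau^2) prior on theta."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    var_b = 1.0 / (n / sigma0**2 + 1.0 / tau**2)           # posterior variance of theta
    mu_b = var_b * (eta0 / tau**2 + n * xbar / sigma0**2)   # posterior mean of theta
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(sigma0**2 + var_b)                   # predictive sd of X_{n+1}
    return mu_b - half, mu_b + half

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=20)              # sigma0 = 1.5 treated as known
print(bayes_prediction_interval(data, sigma0=1.5, eta0=0.0, tau=3.0))

For large n the posterior variance term var_b is negligible and the interval is close to x̄ ± z(1 - α/2)σ₀, in line with the probability limit described above.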
The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1 - α). In the case of a normal sample of size n + 1 with only n variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution, which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.

4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction
Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However, even in the case in which θ is one-dimensional, optimal procedures may not exist. For instance, if X1, . . . , Xn is a sample from a N(μ, σ²) population with σ² known, there is no UMP test for testing H: μ = μ₀ vs K: μ ≠ μ₀. To see this, note that it follows from Example 4.2.1 that if μ₁ > μ₀, the MP level α test δ_α(X) rejects H for T ≥ z(1 - α), where T = √n(X̄ - μ₀)/σ. On the other hand, if μ₁ < μ₀, the MP level α test φ_α(X) rejects H if T ≤ z(α). Because δ_α(x) ≠ φ_α(x), by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of H: μ = μ₀ vs K: μ ≠ μ₀.
In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. We start with a generalization of the Neyman-Pearson statistic p(x, θ₁)/p(x, θ₀). Suppose that X = (X1, . . . , Xn) has density or frequency function p(x, θ) and we wish to test H: θ ∈ Θ₀ vs K: θ ∈ Θ₁. The test statistic we want to consider is the likelihood ratio given by

L(x) = sup{p(x, θ): θ ∈ Θ₁} / sup{p(x, θ): θ ∈ Θ₀}.

Tests that reject H for large values of L(x) are called likelihood ratio tests. To see that this is a plausible statistic, recall from Section 2.2.2 that we think of the likelihood function L(θ, x) = p(x, θ) as a measure of how well θ "explains" the given sample x = (x1, . . . , xn). So, if sup{p(x, θ): θ ∈ Θ₁} is large compared to sup{p(x, θ): θ ∈ Θ₀}, then the observed sample is best explained by some θ ∈ Θ₁, and conversely. Also note that L(x) coincides with the optimal test statistic p(x, θ₁)/p(x, θ₀) when Θ₀ = {θ₀}, Θ₁ = {θ₁}. In particular cases, and for large samples, likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6.

In the cases we shall consider, p(x, θ) is a continuous function of θ and Θ₀ is of smaller dimension than Θ = Θ₀ ∪ Θ₁, so that the likelihood ratio equals the test statistic

λ(x) = sup{p(x, θ): θ ∈ Θ} / sup{p(x, θ): θ ∈ Θ₀}    (4.9.1)

whose computation is often simple. Note that in general λ(x) = max(L(x), 1).
We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same.

1. Calculate the MLE θ̂ of θ.

2. Calculate the MLE θ̂₀ of θ where θ may vary only over Θ₀.

3. Form λ(x) = p(x, θ̂)/p(x, θ̂₀).

4. Find a function h that is strictly increasing on the range of λ such that h(λ(X)) has a simple form and a tabled distribution under H. Because h(λ(X)) is equivalent to λ(X), we specify the size α likelihood ratio test through the test statistic h(λ(X)) and its (1 - α)th quantile obtained from the table.
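These steps translate directly into a numerical recipe whenever the two maximizations can be carried out. The sketch below is my own illustration (not from the text); the function names are assumptions, and the example evaluates 2 log λ for the one-sample normal problem of Section 4.9.2 using the matched-pair differences from the data example given there.

import numpy as np
from scipy import stats

def log_lambda(loglik, x, theta_hat, theta0_hat):
    """log of the likelihood ratio statistic (4.9.1):
    log lambda = l(theta_hat; x) - l(theta0_hat; x)."""
    return loglik(theta_hat, x) - loglik(theta0_hat, x)

# Example: N(mu, sigma^2), H: mu = mu0 with sigma^2 unknown.
def loglik(theta, x):
    mu, sigma2 = theta
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

x = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])
mu0 = 0.0
theta_hat = (x.mean(), x.var())                 # step 1: unrestricted MLE (ddof=0)
theta0_hat = (mu0, np.mean((x - mu0) ** 2))     # step 2: MLE over Theta_0
print(2 * log_lambda(loglik, x, theta_hat, theta0_hat))   # step 3: 2 log lambda

Step 4 then replaces 2 log λ by an equivalent statistic with a tabled distribution, which for this problem is |T_n| as derived in Section 4.9.2.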
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions, bounds, and so on. For instance, we can invert the family of size α likelihood ratio tests of the point hypothesis H: θ = θ₀ and obtain the level (1 - α) confidence region

C(x) = {θ: p(x, θ) ≥ [c(θ)]⁻¹ sup_Θ p(x, θ)}    (4.9.2)

where sup_Θ denotes the sup over θ ∈ Θ and the critical constant c(θ₀) satisfies

P_{θ₀}[ sup_Θ p(X, θ) / p(X, θ₀) ≥ c(θ₀) ] = α.

It is often approximately true (see Chapter 6) that c(θ) is independent of θ. In that case, C(x) is just the set of all θ whose likelihood is on or above some fixed value dependent on the data. An example is discussed in Section 4.9.2.
This section includes situations in which θ = (θ₁, θ₂) where θ₁ is the parameter of interest and θ₂ is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form H: θ₁ = θ₁₀, which are composite because θ₂ can vary freely. The family of such level α likelihood ratio tests obtained by varying θ₁₀ can also be inverted and yields confidence regions for θ₁. To see how the process works we refer to the specific examples in Sections 4.9.2-4.9.5.

4.9.2 Tests for the Mean of a Normal Distribution - Matched Pair Experiments
Suppose X1, . . . , Xn form a sample from a N(μ, σ²) population in which both μ and σ² are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Here are some examples. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. We are interested in expected differences in responses due to the treatment effect. In order to reduce differences due to the extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. We can regard twins as being matched pairs. After the matching, the experiment proceeds as follows. In the ith pair one patient is picked at random (i.e., with probability 1/2) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair. Studies in which subjects serve as their own control can also be thought of as matched pair experiments. That is, we measure the response of a subject when under treatment and when not under treatment. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on.
Let X_i denote the difference between the treated and control responses for the ith pair. If the treatment and placebo have the same effect, the difference X_i has a distribution that is symmetric about zero. To this we have added the normality assumption. The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the normality assumption is not satisfied. Let μ = E(X₁) denote the mean difference between the response of the treated and control subjects. We think of μ as representing the treatment effect. Our null hypothesis of no treatment effect is then H: μ = 0. However, for the purpose of referring to the duality between testing and confidence procedures, we test H: μ = μ₀, where we think of μ₀ as an established standard for an old treatment.
Two-Sided Tests

We begin by considering K: μ ≠ μ₀. This corresponds to the alternative "The treatment has some effect, good or bad." However, as discussed in Section 4.5, the test can be modified into a three-decision rule that decides whether there is a significant positive or negative effect.
Form of the Two-Sided Tests

Let θ = (μ, σ²), Θ₀ = {(μ, σ²): μ = μ₀}. Under our assumptions, the problem of finding the supremum of p(x, θ) was solved in Example 3.3.6. We found that

sup{p(x, θ): θ ∈ Θ} = p(x, θ̂),

where

θ̂ = (x̄, σ̂²) = ( n⁻¹ Σ_{i=1}^n x_i , n⁻¹ Σ_{i=1}^n (x_i - x̄)² )

is the maximum likelihood estimate of θ.
Finding sup{p(x, θ): θ ∈ Θ₀} boils down to finding the maximum likelihood estimate σ̂₀² of σ² when μ = μ₀ is known and then evaluating the likelihood p(x, θ) at (μ₀, σ̂₀²). The likelihood equation is

∂/∂σ² log p(x, θ) = (1/2)[ σ⁻⁴ Σ_{i=1}^n (x_i - μ₀)² - n σ⁻² ] = 0,

which has the immediate solution

σ̂₀² = n⁻¹ Σ_{i=1}^n (x_i - μ₀)².
By Theorem 2.3.1, σ̂₀² gives the maximum of p(x, θ) for θ ∈ Θ₀. The test statistic λ(x) is equivalent to log λ(x), which thus equals

log λ(x) = log p(x, θ̂) - log p(x, (μ₀, σ̂₀²))
 = {-(n/2)[(log 2π) + (log σ̂²)] - n/2} - {-(n/2)[(log 2π) + (log σ̂₀²)] - n/2}
 = (n/2) log(σ̂₀²/σ̂²).

Our test rule, therefore, rejects H for large values of σ̂₀²/σ̂². To simplify the rule further we use the following equation, which can be established by expanding both sides:

n⁻¹ Σ_{i=1}^n (x_i - μ₀)² = n⁻¹ Σ_{i=1}^n (x_i - x̄)² + (x̄ - μ₀)².

Therefore,

σ̂₀²/σ̂² = 1 + (x̄ - μ₀)²/σ̂².

Because s² = (n - 1)⁻¹ Σ(x_i - x̄)² = nσ̂²/(n - 1), σ̂₀²/σ̂² is a monotone increasing function of |T_n| where

T_n = √n (x̄ - μ₀)/s.
Therefore, the likelihood ratio tests reject for large values of |T_n|. Because T_n has a T_{n-1} distribution under H (see Example 4.4.1), the size α critical value is t_{n-1}(1 - α/2) and we can use calculators or software that gives quantiles of the t distribution, or Table III, to find the critical value. For instance, suppose n = 25 and we want α = 0.05. Then we would reject H if, and only if, |T_n| ≥ 2.064.
One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement. Thus, the testing problem H: μ ≤ μ₀ versus K: μ > μ₀ (with μ₀ = 0) is suggested. The statistic T_n is equivalent to the likelihood ratio statistic λ for this problem. A proof is sketched in Problem 4.9.2. In Problem 4.9.11 we argue that P_θ[T_n ≥ t] is increasing in δ, where δ = (μ - μ₀)/σ. Therefore, the test that rejects H for T_n ≥ t_{n-1}(1 - α) is of size α for H: μ ≤ μ₀. Similarly, the size α likelihood ratio test for H: μ ≥ μ₀ versus K: μ < μ₀ rejects H if, and only if, T_n ≤ -t_{n-1}(1 - α).
Power Functions

To discuss the power of these tests, we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter δ. This distribution, denoted by T_{k,δ}, is by definition the distribution of Z/√(V/k) where Z and V are independent and have N(δ, 1) and χ²_k distributions, respectively. The density of Z/√(V/k) is given in Problem 4.9.12. To derive the distribution of T_n, note that from Section B.3, we know that √n(X̄ - μ)/σ and (n - 1)s²/σ² are independent and that (n - 1)s²/σ² has a χ²_{n-1} distribution. Because E[√n(X̄ - μ₀)/σ] = √n(μ - μ₀)/σ and Var(√n(X̄ - μ₀)/σ) = 1, √n(X̄ - μ₀)/σ has a N(δ, 1) distribution, with δ = √n(μ - μ₀)/σ. Thus, the ratio

T_n = [√n(X̄ - μ₀)/σ] / √[ ((n - 1)s²/σ²) / (n - 1) ]

has a T_{n-1,δ} distribution, and the power can be obtained from computer software or tables of the noncentral t distribution. Note that the distribution of T_n depends on θ = (μ, σ²) only through δ. The power functions of the one-sided tests are monotone in δ (Problem 4.9.11) just as the power functions of the corresponding tests of Example 4.2.1 are monotone in √n μ/σ. We can control both probabilities of error by selecting the sample size n large provided we consider alternatives of the form |δ| ≥ δ₁ > 0 in the two-sided case and δ ≥ δ₁ or δ ≤ δ₁ in the one-sided cases. Computer software will compute n. If we consider alternatives of the form (μ - μ₀) ≥ Δ, say, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be n, by making σ sufficiently large we can force the noncentrality parameter δ = √n(μ - μ₀)/σ as close to 0 as we please and, thus, bring the power arbitrarily close to α. We have met similar difficulties (Problem 4.4.7) when discussing confidence intervals. A Stein solution, in which we estimate σ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with |μ - μ₀| ≥ Δ, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.
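The power at noncentrality δ = √n(μ - μ₀)/σ and the sample size needed for a target power are easy to compute from the noncentral t distribution. The sketch below is my own illustration (not from the text) and assumes SciPy's noncentral t implementation scipy.stats.nct.

from scipy import stats

def t_test_power(n, delta1, alpha=0.05):
    """Power of the one-sided level-alpha t test of H: mu <= mu0 at
    delta = sqrt(n) * delta1, where delta1 = (mu - mu0)/sigma."""
    crit = stats.t.ppf(1 - alpha, df=n - 1)
    return stats.nct.sf(crit, df=n - 1, nc=n ** 0.5 * delta1)

def sample_size(delta1, power=0.90, alpha=0.05):
    """Smallest n whose power against delta1 = (mu - mu0)/sigma is at least `power`."""
    n = 2
    while t_test_power(n, delta1, alpha) < power:
        n += 1
    return n

print(t_test_power(25, 0.5), sample_size(0.5, power=0.90))

As the discussion above indicates, power can be controlled by n only when the alternative is specified on the δ = (μ - μ₀)/σ scale, not on the raw (μ - μ₀) scale.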
Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

C(X) = {μ: |√n(X̄ - μ)/s| ≤ t_{n-1}(1 - α/2)}.

We recognize C(X) as the confidence interval of Example 4.4.1. Similarly the one-sided tests lead to the lower and upper confidence bounds of Example 4.4.1.
Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the difference B - A in sleep gained using drugs A and B on 10 patients. This is a matched pair experiment with each subject serving as its own control.

Patient i    1     2     3     4     5     6     7     8     9    10
A          0.7  -1.6  -0.2  -1.2  -0.1   3.4   3.7   0.8   0.0   2.0
B          1.9   0.8   1.1   0.1  -0.1   4.4   5.5   1.6   4.6   3.4
B - A      1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6   1.4

If we denote the differences as x's, then x̄ = 1.58, s² = 1.513, and |T_n| = 4.06. Because t₉(0.995) = 3.25, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference μ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different but in fact B is better than A because no hypothesis μ = μ' < 0 is accepted at this level. (See also (4.5.3).)
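The quantities quoted above can be reproduced directly from the B - A column; a minimal sketch (mine, assuming NumPy and SciPy) follows.

import numpy as np
from scipy import stats

d = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])  # B - A
n = len(d)
xbar, s = d.mean(), d.std(ddof=1)
T = np.sqrt(n) * xbar / s                       # one-sample t statistic, mu0 = 0
t_crit = stats.t.ppf(0.995, df=n - 1)           # two-sided 1% test
ci = (xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))
print(round(xbar, 2), round(s**2, 3), round(T, 2), np.round(ci, 2))
# Reproduces: 1.58, 1.513, 4.06, and the 0.99 interval [0.32, 2.84].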
4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distributions F and G on the basis of two independent samples X1, . . . , X_{n1} and Y1, . . . , Y_{n2}, one from each population. For instance, suppose we wanted to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then X1, . . . , X_{n1} could be blood pressure measurements on a sample of patients given a placebo, while Y1, . . . , Y_{n2} are the measurements on a sample given the drug. For quantitative measurements such as blood pressure, height, weight, length, volume, temperature, and so forth, it is usually assumed that X1, . . . , X_{n1} and Y1, . . . , Y_{n2} are independent samples from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. The preceding assumptions were discussed in Example 1.1.3. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.
Tests

We first consider the problem of testing H: μ₁ = μ₂ versus K: μ₁ ≠ μ₂. In the control versus treatment example, this is the problem of determining whether the treatment has any effect. Let θ = (μ₁, μ₂, σ²). Then Θ₀ = {θ: μ₁ = μ₂} and Θ₁ = {θ: μ₁ ≠ μ₂}. The log of the likelihood of (X, Y) = (X1, . . . , X_{n1}, Y1, . . . , Y_{n2}) is

log p(x, y, θ) = -(n/2) log(2πσ²) - (1/2σ²) [ Σ_{i=1}^{n1} (x_i - μ₁)² + Σ_{j=1}^{n2} (y_j - μ₂)² ]

where n = n1 + n2. As in Section 4.9.2 the likelihood function and its log are maximized over Θ by the maximum likelihood estimate θ̂. In Problem 4.9.6, it is shown that
θ̂ = (X̄, Ȳ, σ̂²), where

σ̂² = n⁻¹ [ Σ_{i=1}^{n1} (X_i - X̄)² + Σ_{j=1}^{n2} (Y_j - Ȳ)² ].

When μ₁ = μ₂ = μ, our model reduces to the one-sample model of Section 4.9.2. Thus, the maximum of p over Θ₀ is obtained for θ̂₀ = (μ̂, μ̂, σ̂₀²), where

μ̂ = n⁻¹ [ Σ_{i=1}^{n1} X_i + Σ_{j=1}^{n2} Y_j ] = (n1 X̄ + n2 Ȳ)/n

and

σ̂₀² = n⁻¹ [ Σ_{i=1}^{n1} (X_i - μ̂)² + Σ_{j=1}^{n2} (Y_j - μ̂)² ].

If we use the identities

n⁻¹ Σ_{i=1}^{n1} (X_i - μ̂)² = n⁻¹ [ Σ_{i=1}^{n1} (X_i - X̄)² + n1(X̄ - μ̂)² ]
n⁻¹ Σ_{j=1}^{n2} (Y_j - μ̂)² = n⁻¹ [ Σ_{j=1}^{n2} (Y_j - Ȳ)² + n2(Ȳ - μ̂)² ],

obtained by writing [X_i - μ̂]² = [(X_i - X̄) + (X̄ - μ̂)]² and expanding, we find that the log likelihood ratio statistic is equivalent to the test statistic |T| where

T = √(n1 n2 / n) (Ȳ - X̄)/s

and

s² = (n - 2)⁻¹ [ Σ_{i=1}^{n1} (X_i - X̄)² + Σ_{j=1}^{n2} (Y_j - Ȳ)² ].
To complete specification of the size α likelihood ratio test we show that T has a T_{n-2} distribution when μ₁ = μ₂. By Theorem B.3.3,

(1/σ)X̄,  (1/σ)Ȳ,  (1/σ²) Σ_{i=1}^{n1} (X_i - X̄)²,  and  (1/σ²) Σ_{j=1}^{n2} (Y_j - Ȳ)²

are independent and distributed as N(μ₁/σ, 1/n1), N(μ₂/σ, 1/n2), χ²_{n1-1}, χ²_{n2-1}, respectively. We conclude from this remark and the additive property of the χ² distribution that (n - 2)s²/σ² has a χ²_{n-2} distribution, and that √(n1 n2 / n)(Ȳ - X̄)/σ has a N(0, 1) distribution and is independent of (n - 2)s²/σ². Therefore, by definition, T ~ T_{n-2} under H and the resulting two-sided, two-sample t test rejects if, and only if,

|T| ≥ t_{n-2}(1 - α/2).

As usual, corresponding to the two-sided test, there are two one-sided tests with critical regions, T ≥ t_{n-2}(1 - α) for H: μ₂ ≤ μ₁ and T ≤ -t_{n-2}(1 - α) for H: μ₁ ≤ μ₂. We can show that these tests are likelihood ratio tests for these hypotheses. It is also true that these procedures are of size α for their respective hypotheses. As in the one sample case, this follows from the fact that, if μ₁ ≠ μ₂, T has a noncentral t distribution with noncentrality parameter

δ = √(n1 n2 / n) (μ₂ - μ₁)/σ.
Confidence Intervals

To obtain confidence intervals for μ₂ - μ₁ we naturally look at likelihood ratio tests for the family of testing problems H: μ₂ - μ₁ = Δ versus K: μ₂ - μ₁ ≠ Δ. As for the special case Δ = 0, we find a simple equivalent statistic |T(Δ)| where

T(Δ) = √(n1 n2 / n) (Ȳ - X̄ - Δ)/s.

If μ₂ - μ₁ = Δ, T(Δ) has a T_{n-2} distribution and inversion of the tests leads to the interval

Ȳ - X̄ ± s t_{n-2}(1 - α/2) √(n/(n1 n2))    (4.9.3)

for μ₂ - μ₁. Similarly, one-sided tests lead to the upper and lower endpoints of the interval as 1 - α/2 likelihood confidence bounds.
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1)   1.845   1.790   2.042
y (machine 2)   1.583   1.627   1.282

We test the hypothesis H of no difference in expected log permeability. H is rejected if |T| ≥ t_{n-2}(1 - α/2). Here ȳ - x̄ = -0.395, s² = 0.0264, and T = -2.977. Because t₄(0.975) = 2.776, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is -0.395 ± 0.368.

On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level (1 - α) confidence interval has probability at most α/2 of making the wrong selection.
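These figures are easy to verify; a short sketch (mine, assuming NumPy and SciPy) computes T, its critical value, and the interval (4.9.3) for the permeability data.

import numpy as np
from scipy import stats

x = np.array([1.845, 1.790, 2.042])   # machine 1 (log permeability)
y = np.array([1.583, 1.627, 1.282])   # machine 2
n1, n2 = len(x), len(y)
n = n1 + n2
s2 = (np.sum((x - x.mean())**2) + np.sum((y - y.mean())**2)) / (n - 2)  # pooled variance
T = np.sqrt(n1 * n2 / n) * (y.mean() - x.mean()) / np.sqrt(s2)
t_crit = stats.t.ppf(0.975, df=n - 2)
half = np.sqrt(s2) * t_crit * np.sqrt(n / (n1 * n2))
print(round(T, 3), round(t_crit, 3), round(y.mean() - x.mean(), 3), round(half, 3))
# Reproduces: T = -2.977, t_4(0.975) = 2.776, difference -0.395 +- 0.368.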
4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may increase the variance of the responses. If normality holds, we are led to a model where X1, . . . , X_{n1}; Y1, . . . , Y_{n2} are two independent N(μ₁, σ₁²) and N(μ₂, σ₂²) samples, respectively. As a first step we may still want to compare mean responses for the X and Y populations. This is the Behrens-Fisher problem.

Suppose first that σ₁² and σ₂² are known. The log likelihood, except for an additive constant, is

-(1/2σ₁²) Σ_{i=1}^{n1} (x_i - μ₁)² - (1/2σ₂²) Σ_{j=1}^{n2} (y_j - μ₂)².

The MLEs of μ₁ and μ₂ are, thus, μ̂₁ = x̄ and μ̂₂ = ȳ for (μ₁, μ₂) ∈ R × R. When μ₁ = μ₂ = μ, setting the derivative of the log likelihood equal to zero yields the MLE

μ̂ = (n1 x̄ + γ n2 ȳ) / (n1 + γ n2)
where γ = σ₁²/σ₂². By writing

Σ_{i=1}^{n1} (x_i - μ̂)² = Σ_{i=1}^{n1} (x_i - x̄)² + n1(μ̂ - x̄)²
Σ_{j=1}^{n2} (y_j - μ̂)² = Σ_{j=1}^{n2} (y_j - ȳ)² + n2(μ̂ - ȳ)²

we obtain

λ(x, y) = exp{ (n1/2σ₁²)(μ̂ - x̄)² + (n2/2σ₂²)(μ̂ - ȳ)² }.

Next we compute

μ̂ - x̄ = γ n2(ȳ - x̄) / (n1 + γ n2).

Similarly, μ̂ - ȳ = n1(x̄ - ȳ)/(n1 + γ n2). It follows that the likelihood ratio test is equivalent to the statistic |D|/σ_D, where D = Ȳ - X̄ and σ_D² is the variance of D, that is,

σ_D² = Var(X̄) + Var(Ȳ) = σ₁²/n1 + σ₂²/n2.

Thus, (D - Δ)/σ_D has a N(0, 1) distribution, where Δ = μ₂ - μ₁. Because σ_D is unknown, it must be estimated. An unbiased estimate is

s_D² = s₁²/n1 + s₂²/n2.

It is natural to try and use D/s_D as a test statistic for the one-sided hypothesis H: μ₂ ≤ μ₁, |D|/s_D as a test statistic for H: μ₁ = μ₂, and more generally (D - Δ)/s_D to generate confidence procedures. Unfortunately the distribution of (D - Δ)/s_D depends on σ₁²/σ₂² for fixed n1, n2. For large n1, n2, by Slutsky's theorem and the central limit theorem, (D - Δ)/s_D has approximately a standard normal distribution (Problem 5.3.28). For small and moderate n1, n2 an approximation to the distribution of (D - Δ)/s_D due to Welch (1949) works well.
Let c = (s₁²/n1)/s_D². Then Welch's approximation is T_k where

k = [ c²/(n1 - 1) + (1 - c)²/(n2 - 1) ]⁻¹.

When k is not an integer, the critical value is obtained by linear interpolation in the t tables or using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens-Fisher problem. Wang (1971) has shown the approximation to be very good for α = 0.05 and α = 0.01, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or n1 = n2, can unfortunately be very misleading if σ₁² ≠ σ₂² and n1 ≠ n2. See Figure 5.3.3 and Problem 5.3.28.
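A sketch of Welch's approximate test (mine, assuming NumPy and SciPy; the example data are made up) computes D/s_D and the approximate degrees of freedom k.

import numpy as np
from scipy import stats

def welch_test(x, y):
    """Two-sided Welch test of H: mu1 = mu2; returns (statistic, df, p-value)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2
    sd = np.sqrt(v1 + v2)                  # s_D
    c = v1 / (v1 + v2)
    k = 1.0 / (c**2 / (n1 - 1) + (1 - c)**2 / (n2 - 1))   # Welch degrees of freedom
    stat = (y.mean() - x.mean()) / sd      # D / s_D
    p = 2 * stats.t.sf(abs(stat), df=k)    # SciPy accepts non-integer df
    return stat, k, p

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=8)
y = rng.normal(0.5, 3.0, size=20)          # unequal variances, unequal sample sizes
print(welch_test(x, y))
# Should agree with scipy.stats.ttest_ind(y, x, equal_var=False).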
4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If n subjects (persons, machines, fields, mice, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample (X₁, Y₁), . . . , (Xn, Yn). Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X, Y) have a joint bivariate normal distribution, N(μ₁, μ₂, σ₁², σ₂², ρ), with σ₁² > 0, σ₂² > 0.
Testing Independence, Confidence Intervals for ρ

The question "Are two random variables X and Y independent?" arises in many statistical studies. Some familiar examples are: X = weight, Y = blood pressure; X = test score on a mathematics exam, Y = test score on an English exam; X = percentage of fat in diet, Y = cholesterol level in blood; X = average cigarette consumption per day in grams, Y = age at death. If we have a sample as before and assume the bivariate normal model for (X, Y), our problem becomes that of testing H: ρ = 0.
Two-Sided Tests

If we are interested in all departures from H equally, it is natural to consider the two-sided alternative K: ρ ≠ 0. We derive the likelihood ratio test for this problem. Let θ = (μ₁, μ₂, σ₁², σ₂², ρ). Then Θ₀ = {θ: ρ = 0} and Θ₁ = {θ: ρ ≠ 0}. From (B.4.9), the log of the likelihood function of X = ((X₁, Y₁), . . . , (Xn, Yn)) is

log p(x, θ) = -n[ log(2πσ₁σ₂) + (1/2) log(1 - ρ²) ]
 - [1/2(1 - ρ²)] [ Σ_{i=1}^n (x_i - μ₁)²/σ₁² - 2ρ Σ_{i=1}^n (x_i - μ₁)(y_i - μ₂)/σ₁σ₂ + Σ_{i=1}^n (y_i - μ₂)²/σ₂² ].
The unrestricted maximum likelihood estimate θ̂ = (x̄, ȳ, σ̂₁², σ̂₂², ρ̂) was given in Problem 2.3.13 as

σ̂₁² = n⁻¹ Σ_{i=1}^n (x_i - x̄)²,   σ̂₂² = n⁻¹ Σ_{i=1}^n (y_i - ȳ)²,
ρ̂ = [ n⁻¹ Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) ] / σ̂₁σ̂₂.

ρ̂ is called the sample correlation coefficient and satisfies -1 ≤ ρ̂ ≤ 1 (Problem 2.1.8). When ρ = 0 we have two independent samples, and θ̂₀ can be obtained by separately maximizing the likelihood of X₁, . . . , Xn and that of Y₁, . . . , Yn. We have θ̂₀ = (x̄, ȳ, σ̂₁², σ̂₂², 0) and the log of the likelihood ratio statistic becomes

log λ(x) = -(n/2) log(1 - ρ̂²).    (4.9.4)

Thus, log λ(x) is an increasing function of ρ̂², and the likelihood ratio tests reject H for large values of |ρ̂|. To obtain critical values we need the distribution of ρ̂ or an equivalent statistic under H. Now,
ρ̂ = Σ_{i=1}^n (U_i - Ū)(V_i - V̄) / { [ Σ_{i=1}^n (U_i - Ū)² ]^{1/2} [ Σ_{i=1}^n (V_i - V̄)² ]^{1/2} }

where U_i = (X_i - μ₁)/σ₁, V_i = (Y_i - μ₂)/σ₂. Because (U₁, V₁), . . . , (Un, Vn) is a sample from the N(0, 0, 1, 1, ρ) distribution, the distribution of ρ̂ depends on ρ only. If ρ = 0, then by Problem B.4.9,

T_n = √(n - 2) ρ̂ / √(1 - ρ̂²)    (4.9.5)

has a T_{n-2} distribution. Because |T_n| is an increasing function of |ρ̂|, the two-sided likelihood ratio tests can be based on |T_n| and the critical values obtained from Table II. When ρ = 0, the distribution of ρ̂ is available in computer packages. There is no simple form for the distribution of ρ̂ (or T_n) when ρ ≠ 0. A normal approximation is available. See Example 5.3.5. Qualitatively, for any α, the power function of the LR test is symmetric about ρ = 0 and increases continuously from α to 1 as |ρ| goes from 0 to 1. Therefore, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
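The two-sided test based on (4.9.5) is straightforward to compute; the following sketch (mine, assuming SciPy, with made-up data) evaluates ρ̂, T_n, and the p-value, and cross-checks against scipy.stats.pearsonr.

import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """Two-sided test of H: rho = 0 via T_n = sqrt(n-2)*rhohat/sqrt(1 - rhohat^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    rhohat = np.corrcoef(x, y)[0, 1]
    T = np.sqrt(n - 2) * rhohat / np.sqrt(1 - rhohat**2)
    p = 2 * stats.t.sf(abs(T), df=n - 2)
    return rhohat, T, p

rng = np.random.default_rng(3)
x = rng.normal(size=9)
y = 0.2 * x + rng.normal(size=9)
print(correlation_t_test(x, y))
print(stats.pearsonr(x, y))   # gives the same two-sided p-value under the normal model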
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test H: ρ = 0 versus K: ρ > 0 or H: ρ ≤ 0 versus K: ρ > 0. It can be shown that ρ̂ is equivalent to the likelihood ratio statistic for testing H: ρ ≤ 0 versus K: ρ > 0 and similarly that -ρ̂ corresponds to the likelihood ratio statistic for H: ρ ≥ 0 versus K: ρ < 0. We can show that P_θ[ρ̂ ≥ c] is an increasing function of ρ for fixed c (Problem 4.9.15). Therefore, we obtain size α tests for each of these hypotheses by setting the critical value so that the probability of type I error is α when ρ = 0. The power functions of these tests are monotone.

Confidence Bounds and Intervals
Usually testing independence is not enough and we want bounds and intervals for ρ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size α likelihood ratio tests of H: ρ = ρ₀ versus K: ρ > ρ₀. These tests can be shown to be of the form "Accept if, and only if, ρ̂ ≤ c(ρ₀)" where P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = 1 - α. We obtain c(ρ) either from computer software or by the approximation of Chapter 5. Because c can be shown to be monotone increasing in ρ, inversion of this family of tests leads to level 1 - α lower confidence bounds. We can similarly obtain 1 - α upper confidence bounds and, by putting two level 1 - α/2 bounds together, we obtain a commonly used confidence interval for ρ. These intervals do not correspond to the inversion of the size α LR tests of H: ρ = ρ₀ versus K: ρ ≠ ρ₀ but rather of the "equal-tailed" test that rejects if, and only if, ρ̂ ≥ d(ρ₀) or ρ̂ ≤ c(ρ₀), where P_{ρ₀}[ρ̂ ≥ d(ρ₀)] = P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = α/2. However, for large n the equal-tails and LR confidence intervals approximately coincide with each other.
Data Example
As an illustration, consider the following bivariate sample of weights x_i of young rats at a certain age and the weight increases y_i during the following week. We want to know whether there is a correlation between the initial weights and the weight increases and formulate the hypothesis H: ρ = 0.

x (initial weight)    370.7   . . .
y (weight increase)    18.1   . . .

Here ρ̂ = 0.18 and T_n = 0.48. Thus, by using the two-sided test and referring to the T₇ tables, we find that there is no evidence of correlation: the p-value is bigger than 0.75.

Summary. The likelihood ratio test statistic λ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:
(1) Matched pair experiments in which differences are modeled as N(μ, σ²) and we test the hypothesis that the mean difference μ is zero. The likelihood ratio test is equivalent to the one-sample (Student) t test.

(2) Two-sample experiments in which two independent samples are modeled as coming from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) t test.

(3) Two-sample experiments in which two independent samples are modeled as coming from N(μ₁, σ₁²) and N(μ₂, σ₂²) populations, respectively. When σ₁² and σ₂² are known, the likelihood ratio test is equivalent to the test based on |Ȳ - X̄|. When σ₁² and σ₂² are unknown, we use (Ȳ - X̄)/s_D, where s_D is an estimate of the standard deviation of D = Ȳ - X̄. Approximate critical values are obtained using Welch's t distribution approximation.

(4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases. We test the hypothesis that X and Y are independent and find that the likelihood ratio test is equivalent to the test based on |ρ̂|, where ρ̂ is the sample correlation coefficient. We also find that the likelihood ratio statistic is equivalent to a t statistic with n - 2 degrees of freedom.
4.10 PROBLEMS AND COMPLEMENTS
Problems for Section 4.1 1. Suppose that X 1 , uniform distribution
, Xn are independently and identically distributed according to the U(O, fJ). Let Mn max (X1 , . . . , Xn) and let
• . .
=
1 if Mn � C 0 otherwise.
=
(a) Compute the power function of be and show that it is a monotone increasing function of e.
(b) In testing H : f) ::; � 0.05 ?
versus K
:
f)
>
! . what choice of
c
would make be have size
exactly
(c) Draw a rough graph of the power function of be (d) How large should n be so that the be (e) If in a sample of size n
=
20, Mn
=
specified in (b) when
specified in (b) has power
n
=
0. 98 for f)
20. =
�?
0.48, what is the p-value?
2. Let X1 , , Xn denote the times in days to failure of n similar pieces of equipment. Assume the model where X = (X1 , . . . , Xn ) is an £ ( >. ) sample. Consider the hypothesis H that the mean life 1/ >. p, ::; f..L o . . • •
=
. i
270
Testing and Confidence Regions
Chapter 4
(a) Use the result of Problem B.3.4 to show that the test with critical region
[X > !"ox( 1
-
<>) /2n] ,
where x(l - a) is the (1 - o:)th quantile of the X�n distribution, is a size a test.
(b) Give an expression of the power in terms of the X� n distribution.
(c) Use the central limit theorem to show that ) /!") + vn(l" - l"o) j !"] is an approximation to the power of the test in part (a). Draw a graph of the approximate power function. Hint: Approximate the critical region by [X > l"o(l + z(1 - a) / vn)]
;
i •
;
.
(d) The following are days until failure of air monitors at a nuclear plant. If f.Lo = 25, give a normal approximation to the significance probability. Days until failure: 3 150 4034 3237 34 2 3 1 6 5 14 150 27 4 6 27 10 3037 Is H rejected at level a � 0.05? 3. Let X1, . . .
, Xn be a P(O) sample.
(a) Use the MLE X of 0 to construct a level a test for H : 0 � Oo versus K : 0 > 00.
(b) Show that the power function of your test is increasing in 8. (c) Give an approximate expression for the critical value if n is large and 8 not too close to 0 or oo. (Use the central limit theorem.)
4. Let X1 ,
• • •
,
c • •
Xn be a sample from a population with the Rayleigh density •
f(x,O)
=
(xj0 2 ) exp{-x2j202 }, x > 0 0 > 0. ,
(a) Construct a test of H : B = 1 versus K : B > with approximate size o: using a complete sufficient statistic for this model. Hint: Use the central limit theorem for the critical value.
1
(b) Check that your test statistic has greater expected value under K than under H. 5. Show that if H is simple and the test statistic T has a <;ontinuous distribution, then the p-value a(T) has a unifonn, U(O, 1 ) , distribution.
• • •
1
I
i I
I
Hint: See Problem B.2.12.
6. Suppose that T1 , . . . , Tr are independent test statistics for the same simple H and that each Ti has a continuous distribution, j = 1, . . , r. Let o:(TJ) denote the p-value for 1j. j = 1 1 . . , r. .
.
(a) Show that, under H, T
Hint: See Problem B.3.4. 7. Establish
=
-2 L.:;�1 log <>(T;) has a X�r distribution.
(4.1.3). Assume that Fo and F are continuous.
I , ' •
i
Section 4.10
271
Problems and Complements
8. (a) Show that the power Pp[Dn > ka) of the Kolmogorov test is bounded below by
�
sup Fp[[F(x) X
�
·
Fo(x)l > ko[
Dn > [ F(x) - Fo(x)[ for each .T. (b) Suppose Fo isN(O, I) and F(x) � (l+exp(->•/r)) - 1 where r � J3/rr is chosen so that J.0000 x2dF(x) = 1. (This is the logistic distribution with mean zero and variance 1.) Evaluate the bound Pp(IF (x) - Fo(x)l > ka ) for a � 0.10, n � 80 and x � 0.5, � 1, and 1.5 using the normal approximation to the binomial distribution of nF(x) and the Hint:
approximate critical value in Example 4.1.5.
(c) Show that if F and Fo are continuous and mogorov test tends to 1 as n ---+ oo.
F
"I
Fo,
then the power of the Kol
9. Let X1 , . . . , Xn be i.i.d. with distribution function F and consider H : F = F0. Suppose that the distribution £o of the statistic T = T(X) is continuous under H and that H is rejected for large values of T. Let T(l), . . . , y( B ) be B independent Monte Carlo simulated values of T. (In practice these can be obtained by drawing B indepen dent samples X(ll, . . . , X( B) from Fo on the computer and computing T(j) = T(X Ul ) , j = 1 , . . . , B. Here, to get X with distribution .Fo, generate a U(O, 1) variable on the 1 computer and set X � F0- (U) as in Problem B.2.12(b).) Next let T(l), . . . , r<8 + 1) de note T, y{t), . . . , y(B) ordered. Show that the test rejects H iff T > T( B+ I-m) has level a � m/(B + !). Hint: If H is true T(X), T(X< 11), . . . , T(X<81) is a sample of size B + I from Co. Use the fact that T(X) is equally likely to be any particular order statistic. 10. (a) Show that the statistic Tn of Example 4.1.6 is invariant under location and scale. That is, if x; � (X, - a)jb, b > 0, then Tn(X') � Tn(X).
(b) Use part (a) to conclude that CN( ",u')(Tn) = CN(o,l) (Tn).
ll. In Example 4.1.5, let ¢( u) be a function from (0, I) to (0, oo ) , and let a the statistics � sup ,P(Fo(x))[F(x) - Fo(xW S¢,a
�
�
T¢,a
sup¢(F(x))[F(x) - Fo (x) l"
U.p,a -
j ,P(Fo(x))IF(x) - Fo (x) ["dFo(x) j ,P(F(x))[F(x) - Fo(x) l "dF(x).
V¢,a
Fo .
X
X
> 0. Define
(a) For each of these statistics show that the distribution under H does not depend on
(b) When '1/J(u) = 1 and a: = 2. Vw,o: is called the Cramer-von Mises statistic. Express
the Cramer-von Mises statistic as a sum.
272
Testing and Confidence Regions
Chapter 4
(c) Are any of the four statistics in (a) invariant under location and scale. (See Problem 4.1.10.)
!, , •
I
•
12. Expected p-values. Consider a test with critical region of the form {T > c} for testing H : 8 = (}0 versus I< : (} > (}0• Without loss of generality, take 80 = 0. Suppose that T has a continuous distribution Fe. then the p-value is
U = 1 - F0 (T).
• '
,
!
;i
(a) Show that if the test has level a, the power is
•
• •
(3(8) = P(U < a) = 1 - Fe(F0 1 (1 - a)) 1
1 where F0- (u) = inf{t : Fo (t) 2: u). (b) Define the expected p-value as EPV(8) = EeU. Let T0 denote a random variable with distribution F0, which is independent ofT. Show that EPV(8) = P(To > T). Hint: P(To > T) = J P(To 2: t I T = t)J.(t)dt where fe(t) is the density of Fe(t).
(c) Suppose that for each a E (0, 1), the UMP test is of the form 1 {T > c}. Show that the EPV(8) for 1{T > c) is uniformly minimal in 8 > 0 when compared to the EPV(8) for any other test. Hint: P(T < to I To = to) is 1 minus the power of a test with critical value to . (d) Consider the problem of testing H : Ji, = Ji,o versus K : Ji, > J.to on the basis of the N(p., "2 ) sample X1, . . . , Xn, where " is known. Let T = X p.0 and 8 p. - p.0. Show that EPV(8) = if!( -.Jii8/ ..;2" ) , where if! denotes the standard normal distribution. (For a recent review of expected p values see Sackrowitz and Samuel-Cabo, 1999.) -
' :l
=
Problems for Section 4.2
1. Consider Examples 3.3.2 and 4.2.1. You want to buy one of two systems. One has The first system costs $106, the signal-to-noise ratio vfao = 2, the other has vfao = other $105• One second of transmission on either system costs $103 each. Whichever system you buy during the year, you intend to test the satellite 100 times. If each time you test, you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0.05, which system is cheaper on the basis of a year's operation?
1.
2. Consider a population with three kinds of individuals labeled 1, 2, and 3 occuring in the Hardy-Weinberg proportions /(1,8) = 82 , !(2,8) = 28(1 - 8), f(3,8) = (1 - 8) 2 • For a sample X1 , . . . , Xn from this population, let N1 N2, and N3 denote the number of Xj equal to I, 2, and 3, respectively. Let 0 < 8o < 8, < 1 . .
(a) Show that L(x, 8o, 8,) is an increasing function of 2N,
N2 . (b) Show that if c > 0 and a E (0, 1) satisfy Pe, [2N1 + N2 > c] = a, then the test that rejects H if, and only if, 2N, + N2 > c is MP for testing H : 8 = 80 versus K : 8 = 81. +
3. A gambler observing a game in which a single die is tossed repeatedly gets the impres
sion that 6 comes up about 18% of the time, 5 about 14% of the time, whereas the other ,
I
i !
, '
j
•
I
l j
'
1
i
I
1 i •
I !
I
1
�:-- :_:::_:��-=c==-==-==='---Section 4.10
273
Problems and Complements
· -
-
four numbers are equally likely to occur (i.e., with probability 0.17). Upon being asked to play, the gambler asks that he first be allowed to test his hypothesis by tossing the die n times.
(a) What test statistic should he use if the only alternative he considers is that the die is fair?
(b) Show that if n = 2 the most powerful level 0.0196 test rejects if, and only if, two 5's are obtained.
(c) Using the fact that if (N₁, . . . , N_k) ~ M(n, θ₁, . . . , θ_k), then a₁N₁ + ··· + a_kN_k has approximately a N(nμ, nσ²) distribution, where μ = Σ_{i=1}^k a_iθ_i and σ² = Σ_{i=1}^k θ_i(a_i - μ)², find an approximation to the critical value of the MP level α test for this problem.
4. A formulation of goodness of tests specifies that a test is best if the maximum probability
of error (of either type) is as small as possible. (a) Show that if in testing H : {} = {}0 versus K : fJ = fJ1 there exists a critical value c such that P0, [L(X, 80, 81) > c] = I - Po, [L(X, Bo, 81) > c] then the likelihood ratio test with critical value c is best in this sense. (b)
Find the test that is best in this sense for Example 4.2.1.
5. A newly discovered skull has cranial measurements (X, Y) known
to
be distributed either (as in population 0) according to N(O, 0, 1, 1 , 0.6) or (as in population 1) according to N(l, I, I, I, 0.6) where all parameters are known. Find a statistic T(X, Y) and a critical value c such that if we use the classification rule, (X, Y) belongs to population 1 if T > c, and to population 0 ifT < c, then the maximum of the two probabilities ofmisclassification Po [T > c], P!(T < c] is as small as possible. Hint: Use Problem 4.2.4 and recall (Proposition B.4.2) that linear combinations of bivariate normal random variables are normally distributed. 6. Show that if randomization is permitted, MP-sized a: likelihood ratio tests with 0
1 have power nondecreasing in the sample size.
< a: <
7. Prove Corollary 4.2.1. Hint: The MP test has power at least that of the test with test function d(x) = a:. 8. In Exarnle 4.2.2, derive the UMP test defined by (4.2. 7). 9. In Example 4.2.2, if .
a
(1, 0, . . . , of and �0 # I, find the MP test for testing
< 1 , prove Theorem 4.2. 1(a) using the connection between likelihood ratio
tests and Bayes tests given in Remark 4.2. 1.
i
274
Testing and Confidence Regions
Chapter 4
'
' '
'
Problems for Section 4.3
1. Let X1 be the number of arrivals at a service counter on the ith of a sequence of n days. A possible model for these data is to assume that customers arrive according to a homo geneous Poisson process and, hence, that the Xi are a sample from a Poisson distribution with parameter 8, the expected number of arrivals per day. Suppose that if 8 < 80 it is not worth keeping the counter open.
(a) Exhibit the optimal (UMP) test statistic for H : 8 < 00 versus K
:
0 > 00 .
(b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP
test? 2. Consider the foregoing situation of Problem 4.3.1. You want to ensure that if the arrival rate is < 10, the probability of your deciding to stay open is < 0.01, but if the arrival rate is > 15, the probability of your deciding to close is also < 0.01. How many days must you observe to ensure that the UMP test of Problem 4.3.1 achieves this? (Use the normal approximation.) 3. In Example 4.3.4, show that the power of the UMP test can be written as
2 ( Gn( n(<>)/ <J) <J�X (J = <J ) where Gn denotes the X�n distribution function. ' .
•
'
. •
4. Let X1, . . . , Xn be the times in months until failure of n similar pieces of equipment. lf the equipment is subject to wear. a model often used (see Barlow and Proschan, 1965) is the one where X1, . . . , Xn is a sample from a Weibull distribution with density f(x, A) = .\cxc -te->.xc. x > 0. Here c is a known positive constant and .\ > 0 is the parameter of mterest. •
(a) Show that K:
1/>. > 1/>.o.
L� 1 Xf is an optimal test statistic for testing H : 1/.\ < 1/ .\o versus
(b) Show that the critical value for the size a test with critical region [L�-1 Xf > k] is k = X2 n ( 1 - <>) /2>.o where X2n ( 1 - <>) is the ( 1 <>)th quantile of the X�n distribution and that the power function of the UMP level a test is given by -
1 - G,n(>.x,n(1 - <>)/ >.o) where G2 n denotes the X�n distribution function. Hint: Show that X[ &(>.) . -
1/ >.o = 12. Find the sample size needed for a level 0.01 test to have power at least 0.95 at the alternative value 1/>.1 = 1 5 Use the normal approximation to the (c) Suppose
" •
! j
J
!
l
'
i
l i
.
critical value and the probability of rejection.
5. Show that if X1, . . . , Xn is a sample from a truncated binomial distribution with I
p(x, 0) =
(:)
'
' '
I .
n x /[1 Ox(l - 0)" (1 - O) ), X = 1, . . . , n,
,
i
Section 4.10
275
Problems and Complements
then E7 1 Xi is an optimal test statistic for testing H : 8
=
8o versus I\ ; 8 > 80.
6. Let X1, . . . , Xn denote the incomes of n persons chosen at random from a certain population. Suppose that each Xi has the Pareto density 1 +BI -( f(x,O) = c8Bx , x>c where 8 > 1 and c > 0.
(a) Express mean income J.L in terms of 8.
:
(b) Find the optimal test statistic for testing H
J.L
=
J.Lo versus
K : J.L > po.
(c) Use the central limit theorem to find a normal approximation to the critical value of test in part (b). Hint: Use the results of Theorem 1.6.2 to find the mean and variance of the optimal test statistic. 7. In the goodness-of-fit Example 4.1.5, suppose that F0 (x) has a nonzero density on some interval (a, b), -oo < a < b < oo, and consider the alternative with distribution function F(x, B) = Fi!(x), 0 < B < 1. Show that the UMP test for testing H : B > 1 versus K : B < 1 rejects H if -2E log F0(X,) 2: x1_a, where Xt-a is the (1 - u)th quantile of the X�n distribution. (See Problem 4. 1 .6.) It follows that Fisher's method for cgmbining p-values (see 4.1.6) is UMP for testing that the p-values are uniformly distributed against F(u) = u8, 0 < B < l . 8. Let the distribution of smvival times of patients receiving a standard treatment be the known distribution Fo, and let Y1, . . . , Yn be the i.i.d. survival times of a sample of patients receiving an experimental treatment. (a) Lehmann Alte�arive. In Problem 1.1. 1 2, we derived the model G(y, Ll.) = 1 - [1 - F0(y)J", y > 0, Ll. > 0. To test whether the new treatment is beneficial we test H : � < 1 versus K : .6. > 1. Assume that Fo has � density fo. Find the UMP test. Show how to find critical values. (b) Nabeya-Miura Alternative. For the purpose of modeling, imagine a sequence X1 , X2, of i.i.d. survival tjmes with distribution F0 • Let N be a zero-truncated Poisson, P()..), random variable, which is independent of X1 J X2, .
•
•
.
P(Y < y)
=
e
,
-
-
1
.
•
C(max{X1 ,
(i) Show that if we model the distribution of Y as .>..F e o (y)
•
1
•
•
.
, XN ) ) then ,
, y > 0, A > 0.
(ii) Show that if we model the distribotion of Y as C(min{Xt, . . . , XN )) , then P(Y < y)
=
e-.XFo(Y) e
,
-
_
1
1 ,
y > 0, A > 0.
276
Testing and Confidence Regions
Chapter 4
(iii) Consider the model G(y,
efiFo (Y) _ 1
0)
e• - 1
j
, O f' O
Fo(Y), o � o. To see whether the new treatment is beneficial, we test H
Fo has a density fo(y). I:� 1 Fo( l:i).
Assume that
:
{} < 0 versus K : 8 > 0.
Show that the UMP test is based on the statistic
Xn be i.i.d. with distribution function F(x). We want to test whether F is exponential, F(x) � 1 - exp( -x), x > 0, or Weibull, F( x ) � 1 - exp( -x 9), x > 0, B > 0. Find the MP test for testing H : {} = 1 versus K : B = 81 > 1. Show that the test is 9.
Let
X1,
•
.
.
,
not UMP.
10. Show that under the assumptions of Theorem
complete.
Hint: �
Consider the class of all
4.3.2 the class
Bayes tests of H : ()
1r{Oo} 1 - 1r{O,} varies between 0 and l.
=
/.00 e,
p(x, O)d1r(O)j
The left-hand side equals
j"'
-oo
of all Bayes tests is =
lh
where
loss, every Bayes test for
>
.
f6";' L(x, 0, 00)d1r(O) f_':x, L(x, 0, Oo)d1r(O)
.
Section 4-4 •
, Xn
be a sample from a normal population with unknown mean J.L and
cr2. Using a pivot based on Er 1 (Xi - X)2• (a) Show how to construct level {1 - a) confidence intervals of fixed finite length for
unknown variance log a2•
1 (X; - X) 2 � 16.52, n � a) UCB for u2? announce as your level {1
(b)
Suppose that Ef
2,
a
�
0.01.
Wbat would you
-
(B/2)t� + €i, i = 1, . . , n, where the €i are independent normal random variables with mean 0 and known variance u2 (cf. Problem 2.2.1).
2. I '
'
•
Let
Xi
=
.
'
;
ll '
l
' I '
I
I
1 l
J '
l I i
T(x), the denominator decreasing. 12. Show that under the assumptions of Theorem 4.3.2, 1 - 6t is UMP for testing H : 8 > Bo versus K : 8 < Bo.
.
' •
•
The numerator is an increasing function of
1. Let X1 ,
i
p (x, O)d1r(O) ( <) L
6
Problems for
I
•
80 versus K : B
11. Show that under the assumptions of Theorem 4.3.1 and 0-1 H : B < 80 versus K : B > 81 is of the form Ot for some t. Hint: A Bayes test rejects (accepts) H if
ll
oS�ec�t:�io�o�4.,.l=O�P�m�b�le�m�n�d�C�o �,�· �m�p21e=m2e=''='�---(a) Using a pivot based on the MLE
------
----7 --�27
(2L:r<-- l t;X!)/�i 1 ti ofB, find a fixed length level
(1 - a) confidence interval for B. (b) ItO < ti < 1, i = 1, . . . , n, but we may otherwise choose the t! freely, what values should we use for the tt so as to make our interval as short as possible for given a? 3. Let X1, . . . , Xn be as in Problem 4.4.1. Suppose that an experimenter thinking he knows the value of a2 uses a lower confidence bound for 1-l of the form p;(X) = X - c, where c is chosen so that the confidence level under the assumed value of a-2 is 1 - a. What is the actual confidence coefficient of J-t, if a2 can take on all positive values? 4. Suppose that in Example 4.4.3 we know that 8 < 0.1. (a) Justify the interval [il, min( 8, 0.1 )] if 8 < 0.1, [0. 1, 0.1] if 8 > 0.1, where 8, 8 are given by (4.4.3). (b) Calculate the smallest
n needed to bound the length of the 95% interval of part (a)
Compare your result to the n needed for
by
0.02.
5.
Show that if q(X) is a level
then
[q(X), q(X)]
is a level
interval arbitrarily if q
Hint:
6.
Use (A.2.7) .
Show that if
(4.4.3).
(1 - a,) LCB and q(X) is a level (1 - a,) UCB for q(8), (1 - (a, + a,) ) confidence interval for q(8). (Define the
> q.)
X1, . • • , Xn are i.i.d. N(J-t, u2 ) and O:t + 0:2 < a:, then the shortest level
( 1 - a) interval of the form
is obtained by taking
[x- z(1 - a,) fo' X + z(1 - a,) ;,;]
a:1 = a:2 = aj2 (assume a-2 known). Hint: Reduce to a:1 +a:2 = a: by showing that if 0:1 + 0:2 < a:, there is a shorter interval with a:1 + a:2 = a:. Use calculus. 7. Suppose we want to select a sample size N such that the interval (4.4.1) based on n = N observations has length at most l for some preassigned length l = 2d Stein's (1945) two stage procedure is the following. Begin by taking a fixed number no and calculate
X0 = ( 1/n0) I:�" 1 X, and
.
> 2 of observations
s5 = (no - 1)-ti:�" 1 (X, - Xo)2•
Then take
N - n0 further observations, with N being the smallest integer greater than no
and greater than or equal to
2
[sot,._, ( 1 - i a) /d] . Show that, although
N is random, ../N(X - p)/so, with X = I:f 1 X
distribution. It follows that
[x - sotn,- 1 (1 - �a)/ .JN, X + Sotno- 1 (1 - )a)j .JN]
'
i' I
278
Testing and Confidence Regions
(1 - a
is a confidence interval with confidence coefficient
) N,
Chapter 4
for Jt of length at most
(The sticky point of this approach is that we have no control over
'
2d.
and, if (}' is large, we
may very likely be forced to take a prohibitively large number of observations. The reader interested in pursuing the study of sequential procedures such as this one is referred to the
1986, and the fundamental monograph of Wald, 1947.) + 1 x,. By Theorem 8.3.3, Sn , is + = k, X has a N(Jl, (}'2 jk) depends only on Sno • given
book of Wetherill and Glazebrook,
Hint:
Note that -
X (no/N)Xn, (1/NJL.':' N N(O, <72) ../N(X =
independent of Xno · Because distribution. Hence,
- !') has a
(a) Show that in Problem
8.
of length at most
2d wheri (}'2
4.4.6, in
n
n
N (1
-
distribution and is independent of sn ,·
order to have a level
�
a)
is known, it is necessary to take at least
observations.
Hint: (b)
confidence interval
z2 (
(c)
'
'
'
.
•
-
�a) (}'2 jd2
Set up an inequality for the length and solve for n.
What would
be the minimum sample
size in part (a) if
a
=
0.001, (}'2
0.05? '
1
Suppose that (}'2 is not known exactly, but we are sure that (}'2
z2 {I - �a) (J'r/d2 B(n, 0) X = Sjn.
< (J'f.
=
5, d
=
Show that
observations are necessary to achieve the aim of part (a).
n
>
9.
Let S �
(1
(a) Use (A. l4.18) to show that sin - 1 ( - a) confidence interval for sin- 1 ( VO).
and
v'X)±z (1 - 1a) / 2fo
= X = 0.1, X�, ... , Xn1 Yl l . . . , Yn2 be
(b) If n
is an approximate level
use the result in part (a) to compute an approximate level
100 and
0.95 confidence interval for ().
10. Let and N(TJ, .,-2) populations, respectively.
(a) If all parameters are unknown, find two quadruples are each sufficient.
(b) Exhibit
a level (1 -
a) confidence
2)
two independent samples from N(Jl, (}'
ML estimates of /-L. v, a2, r2 • interval for 12
/2 (}'
If
( TJ - f.') .
cr 2 , r2
are known, exhibit a fixed length level (1
�
and
Show that these
using a pivot based on the
a:)
confidence interval for
Such two sample problems arise In comparing the precision of two instruments and in
determining the effect of a treatment.
11. Show that the endpoints of the approximate level ( 1 - a) interval defined by
are indeed approximate level
Hint: 12.
,
., '
'
'
[O(X) < OJ = [fo(X- 0)/[0(1 - O)Jl < B(n, 0). 0 X± J3z ( 1 - l a) / 8.
Let S �
(a)
{1 - �a:) upper and lower bounds.
Show that
interval for
Suppose that it is known that
z
:S
(4.4.3)
(1 - 1a) ].
!.
4fo is an approximate level ( 1 -
i i
i
'
I
'
statistics of part (a). Indicate what tables you wdllld need to calculate the interval.
(c)
I
a) confidence
Section 4.10
279
Problems and Complements
(b) What sample size is needed to guarantee that this interval has length at most 0.02? 13. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures 8(64, 8), give a 95% confidence interval for the true proportion of are observed. If S cures 8 using (a) (4.4.3), and (b) (4.4.7). "'
14. Suppose that 25 measurements on the breaking strength of a certain alloy yield X = 11.1 and ,<; = 3.4. Assuming that the sample is from a /V(p,, cr2) population, find (a) A level 0.9 confidence interval for Jl. (b) A level 0.9 confidence interval for cr.
(c) A level 0.9 confidence region for (Jl, o-). (d) A level 0.9 confidence interval for p, + cr. 15. Show that the confidence coefficient of the rectangle of Example 4.4.5 is ( 1 - � o:) 2 . Hint: Let t = tn�l (1 - !n), x 1 = Xn�I ( !a), and x 2 = Xn- 1 (1 - !o:), then
P[(JL, u2 ) E J.{X) x !,{X)] By the proof of Theorem B.3.3, (n - 1)s2/u2 X�- 1 = r (� (n - 1J, D and n(X p,)2 Icr2 xt = r ( � ' �) are independent. Now the result follows from Theorem 8.2.3. �
"'
16. In Example 4.4.2,
(a) Show that x(α₁) and x(1 − α₂) can be approximated by x(α₁) ≈ (n − 1) + √(2(n − 1)) z(α₁) and x(1 − α₂) ≈ (n − 1) + √(2(n − 1)) z(1 − α₂).
Hint: By B.3.1, V ∼ χ²_{n−1} can be written as a sum of squares of n − 1 independent N(0, 1) random variables, Σ_{i=1}^{n−1} Zᵢ². Now use the central limit theorem.
(b) Suppose that Xᵢ does not necessarily have a normal distribution, but assume that μ₄ = E(Xᵢ − μ)⁴ < ∞ and that κ = Var[((Xᵢ − μ)/σ)²] = (μ₄/σ⁴) − 1 is known. Find the limit of the distribution of n^{−1/2}{[(n − 1)s²/σ²] − n} and use this distribution to find an approximate 1 − α confidence interval for σ². (In practice, κ is replaced by its MOM estimate. See Problem 5.3.30.)
Hint: (n − 1)s² = Σ_{i=1}^{n} (Xᵢ − μ)² − n(X̄ − μ)². Now use the law of large numbers, Slutsky's theorem, and the central limit theorem as given in Appendix A.
(c) Suppose Xᵢ has a χ²_k distribution. Compute the (κ known) confidence intervals of part (b) when k = 1, 10, 100, and 10,000. Compare them to the approximate interval given in part (a). κ − 2 is known as the kurtosis coefficient. In the case where Xᵢ is normal, it equals 0. See A.11.11.
Hint: Use Problem B.2.4 and the fact that χ²_k = Γ(½k, ½).
17. Consider Example 4.4.6 with x fixed. That is, we want a level (1 − α) confidence interval for F(x). In this case nF̂(x) = #[Xᵢ ≤ x] has a binomial distribution and

√n[F̂(x) − F(x)]/√(F(x)[1 − F(x)])

is the approximate pivot given in Example 4.4.3 for deriving a confidence interval for θ = F(x).
(a) For 0 < a < b < 1, define

Aₙ(F) = sup{ √n|F̂(x) − F(x)|/√(F(x)[1 − F(x)]) : F⁻¹(a) ≤ x ≤ F⁻¹(b) }.

Typical choices of a and b are .05 and .95. Show that for F continuous

P_F(Aₙ(F) ≤ t) = P_U(Aₙ(U) ≤ t)

where U denotes the uniform, U(0, 1), distribution function. It follows that the binomial confidence intervals for θ in Example 4.4.3 can be turned into simultaneous confidence intervals for F(x) by replacing z(1 − ½α) by the value u_α determined by P_U(Aₙ(U) ≤ u_α) = 1 − α.
(b) For 0 < a < b < 1, define

Bₙ(F) = sup{ √n|F̂(x) − F(x)|/√(F̂(x)[1 − F̂(x)]) : F̂⁻¹(a) ≤ x ≤ F̂⁻¹(b) }.

Show that for F continuous, P_F(Bₙ(F) ≤ t) = P_U(Bₙ(U) ≤ t).
(c) For testing H₀ : F = F₀ with F₀ continuous, indicate how critical values u_α for Aₙ(F₀) and Bₙ(F₀) can be obtained using the Monte Carlo method of Section 4.1.

18. Suppose X₁, …, Xₙ are i.i.d. as X ∼ F and that F has density f(t) = F′(t). Assume that f(t) > 0 iff t ∈ (a, b) for some −∞ ≤ a < b ≤ ∞.
(a) Show that μ = E(X) = −∫_{−∞}^{0} F(x)dx + ∫_{0}^{∞} [1 − F(x)]dx.
(b) Using Example 4.1.6, find a level (1 − α) confidence interval for μ.

19. In Example 4.4.7, verify the lower boundary μ̲ given by (4.4.9) and the upper boundary μ̄ = ∞.
Problems for Section 4.5

1. Let X₁, …, X_{n₁} and Y₁, …, Y_{n₂} be independent exponential E(θ) and E(λ) samples, respectively, and let Δ = θ/λ.
(a) If f(α) denotes the αth quantile of the F_{2n₁,2n₂} distribution, show that [Ȳ f(½α)/X̄, Ȳ f(1 − ½α)/X̄] is a confidence interval for Δ with confidence coefficient 1 − α.
Hint: Use the results of Problems B.3.4 and B.3.5.
(b) Show that the test with acceptance region [f(½α) ≤ X̄/Ȳ ≤ f(1 − ½α)] has size α for testing H : Δ = 1 versus K : Δ ≠ 1.
(c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. Experience has shown that the exponential assumption is warranted. Give a 90% confidence interval for the ratio Δ of mean life times.
x
y
3 1 5040 34 32 37 34 2 31 6 5 14 150 27 4 6 27 10 30 37 8 26 10 8 29 20 10
Is H : Δ = 1 rejected at level α = 0.10?
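A short sketch of the interval in Problem 1(a), written in Python (an addition to the text). The sample values below are placeholders only, not the air-monitor data, whose split between the x and y rows is not reproduced here.

```python
from scipy.stats import f

def ratio_ci(x, y, alpha=0.10):
    """Confidence interval for Delta = theta/lambda based on scaling
    Y-bar/X-bar by quantiles of the F distribution with (2*n1, 2*n2) d.f."""
    n1, n2 = len(x), len(y)
    xbar, ybar = sum(x) / n1, sum(y) / n2
    lo = (ybar / xbar) * f.ppf(alpha / 2, 2 * n1, 2 * n2)
    hi = (ybar / xbar) * f.ppf(1 - alpha / 2, 2 * n1, 2 * n2)
    return lo, hi

# placeholder samples, only to show the call
x = [3.0, 150.0, 40.0, 34.0, 32.0]
y = [27.0, 10.0, 30.0, 37.0, 8.0]
print(ratio_ci(x, y))
```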
2. Show that if θ̄(X) is a level (1 − α) UCB for θ, then the test that accepts, if and only if θ̄(X) ≥ θ₀, is of level α for testing H : θ ≥ θ₀.
Hint: If θ ≥ θ₀, [θ̄(X) < θ] ⊃ [θ̄(X) < θ₀].

3. (a) Deduce from Problem 4.5.2 that the tests of H : σ² = σ₀² based on the level (1 − α) UCBs of Example 4.4.2 are level α for H : σ² ≥ σ₀².
(b) Give explicitly the power function of the test of part (a) in terms of the χ²_{n−1} distribution function.
(c) Suppose that n = 16, α = 0.05, σ₀² = 1. How small must an alternative σ² be before the size α test given in part (a) has power 0.90?

4. (a) Find c such that δ_c of Problem 4.1.1 has size α for H : θ ≤ θ₀.
(b) Derive the level (1 − α) LCB corresponding to δ_c of part (a).
(c) Similarly derive the level (1 − α) UCB for this problem and exhibit the confidence intervals obtained by putting two such bounds of level (1 − α₁) and (1 − α₂) together.
(d) Show that [Mₙ, Mₙ/α^{1/n}] is the shortest such confidence interval.

5. Let X₁, X₂ be independent N(θ₁, σ²), N(θ₂, σ²), respectively, and consider the problem of testing H : θ₁ = θ₂ = 0 versus K : θ₁² + θ₂² > 0 when σ² is known.
(a) Let δ_c(X₁, X₂) = 1 if and only if X₁² + X₂² > c. What value of c gives size α?
(b) Using Problems B.3.12 and B.3.13 show that the power β(θ₁, θ₂) is an increasing function of θ₁² + θ₂².
(c) Modify the test of part (a) to obtain a procedure that is level α for H : θ₁ = θ₁⁰, θ₂ = θ₂⁰ and exhibit the corresponding family of confidence circles for (θ₁, θ₂).
Hint: (c) X₁ − θ₁⁰, X₂ − θ₂⁰ are independent N(θ₁ − θ₁⁰, σ²), N(θ₂ − θ₂⁰, σ²), respectively.

6. Let X₁, …, Xₙ be a sample from a population with density f(t − θ) where θ and f are unknown, but f(t) = f(−t) for all t, and f is continuous and positive. Thus, we have a location parameter family.
(a) Show that testing H : θ ≤ 0 versus K : θ > 0 is equivalent to testing H′ : P[X₁ > 0] ≤ ½ versus K′ : P[X₁ > 0] > ½.
(b) The sign test of H versus K is given by

δ_k = 1 if Σ_{i=1}^{n} 1[Xᵢ > 0] > k, and δ_k = 0 otherwise.

Determine the smallest value k = k(α) such that δ_{k(α)} is level α for H and show that for n large, k(α) ≈ ½n + ½z(1 − α)√n. A numerical sketch of this comparison is given after this problem.
(c) Show that δ_{k(α)}(X₁ − θ₀, …, Xₙ − θ₀) is a level α test of H : θ ≤ θ₀ versus K : θ > θ₀.
(d) Deduce that X_{(n−k(α)+1)} (where X_{(j)} is the jth order statistic of the sample) is a level (1 − α) LCB for θ whatever be f satisfying our conditions.
(e) Show directly that P_θ[X_{(j)} ≤ θ] and P_θ[X_{(i)} ≤ θ ≤ X_{(j)}] do not depend on f or θ.
(f) Suppose that α = 2^{−(n−1)} Σ_{j=0}^{k−1} (n choose j). Show that P[X_{(k)} ≤ θ ≤ X_{(n−k+1)}] = 1 − α.
(g) Suppose that we drop the assumption that f(t) = f(−t) for all t and replace θ by ν = the median of F. Show that the conclusions of (a)–(f) still hold.
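For part (b) of Problem 6, a small sketch comparing the exact binomial choice of k(α) with the normal approximation ½n + ½z(1 − α)√n. Python, and the choices n = 25 and α = 0.05, are illustrative assumptions added here.

```python
from math import sqrt
from scipy.stats import binom, norm

def k_alpha(n, alpha):
    """Smallest k with P(S > k) <= alpha when S ~ B(n, 1/2),
    i.e., the critical value k(alpha) of the sign test."""
    for k in range(n + 1):
        if binom.sf(k, n, 0.5) <= alpha:   # sf(k) = P(S > k)
            return k
    return n

n, alpha = 25, 0.05
print("exact k(alpha):", k_alpha(n, alpha))
print("normal approx :", 0.5 * n + 0.5 * norm.ppf(1 - alpha) * sqrt(n))
```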
7. Suppose θ = (η, τ) where η is a parameter of interest and τ is a nuisance parameter. We are given for each possible value η₀ of η a level α test δ(X, η₀) of the composite hypothesis H : η = η₀. Let C(X) = {η : δ(X, η) = 0}.
(a) Show that C(X) is a level (1 − α) confidence region for the parameter η and conversely that any level (1 − α) confidence region for η is equivalent to a family of level α tests of these composite hypotheses.
(b) Find the family of tests corresponding to the level (1 − α) confidence interval for μ of Example 4.4.1 when σ² is unknown.
8. Suppose X, Y are independent and X ∼ N(ν, 1), Y ∼ N(η, 1). Let ρ = ν/η, θ = (ρ, η). Define

δ(X, Y, ρ) = 0 if |X − ρY| ≤ (1 + ρ²)^{1/2} z(1 − ½α), and δ(X, Y, ρ) = 1 otherwise.
9. Let x � N(8, 1) and q(8)
= 82
(a) Show that the lower confidence bound for q( B) obtained from the image under q of the ray (X - z(1 - <>) , oo) is q(X)
=
(X - z(1 - <>))2 if X > z(1 - <>) 0 if X < z(l - <>).
(b) Show that Po(�(X) < 82] and, hence. that sup8 P6 [q(X) < 82]
-
1 - a if 8 > 0 >l>(z(1 - a) - 28) if 8 < 0
= 1.
10. Let a (S, 8o) denote the p-value of the test of H : 8 = 8o versus K : 8 > 8o in Example 4.1.3 and let (8(S), 8( S)] be the exact level (1 - 2u) confidence interval for 8 of Example 4.5.2. Show that as 8 ranges from 8(S) to 8(S), n(S, 8) ranges from " to a value no smaller than 1 - u. Thus, if 8o < O(S) (S is inconsistent with H : 8 = 8o), the quantity � = B(S) - 00 indicates how far we have to go from 80 before the value S is not at all surprising under H . 11. Establish (iii) and (iv) of Example 4.5.2.
12. Let 1J denote a parameter of interest, let T denote a nuisance parameter, and let (} = (ry, -r ) . Then the level (1 - <>) confidence interval [ry(x), ij(x)] for ry is said to be unbiased confidence interval if
Pe[ry( X) < ry' < ry(X)] < 1 - " for all ry' ,P ry, all II.
That is, the interval is unbiased if it has larger probability of covering the true value 1J than the wrong value 17'. Show that the Student t interval (4.5.1) is unbiased. Hint: You may use the result of Problem 4.5.7.
284
Testing
and Confidence Regions
Chapter 4
Confidence Regionsfor Qua11tiles. Let X 1, . . . . Xn be a sample from a population with 1 continuous distribution F. Let Xp = 1 [F-1 (p) + Fij (P)], 0 < p < I, be the pth quantile of F. (See Section 3.5.) Suppose that p is specified. Thus, 100x.95 could be the 95th percentile of the salaries in a certain profession1 or lOOx o5 could be the fifth percentile of 13.
.
the duration time for a certain disease.
''
: Xp < 0 versus I< : Xp > 0 is equivalent to testing H' : P( X > 0) < (1 - p) versus K' : P( X 2: 0) > (I - p). (b) The quantile sign test Ok of H versus K has critical region {x : L.:� 1 [Xi > OJ > k ) . Determine the smallest value k = k(a) such that Jk(o) has level a for H 1and show that for n large, k(a) - h(a), where (a) Show that testing H
h (a) C': n(l - p) + Z1 -u )np(! - p).
'
I
'
(c) Let x• be a specified number with 0 < F(x') < I. Show that J•(X1 -x•, . . . , Xn x*) is a level a test for testing H : Xp < x* versus K : xP > x*.
.. . .
(d) Deduce that X(n-k(u)+1 ) (Xc;) is the jth order statistic (1 - a) LCB for Xp whatever be f satisfying our conditions.
of the sample) is a level
(e) Let S denote a B(n,p) variable and choose k and l such that I - a = P(k < S < n - l + I) = 2:7 k+ 1 pi(! -p)"-i. Show that P(X(k) < Xp < X(n-1)) = 1 - <>. That is, (X(k)' X(n-1)) is a level (1 - a) confidence interval for Xp whatever be F satisfying our
conditions. That is, it is distribution free. (f) Show that
'
(g) Let F(x)
•
l .
h (�a) and h (I - �a) where
denote the empirical distribution. Show that the interval in parts (e) and
!
k
and l in part (e) can be approximated by
(0 can be derived from the pivot
T(xp) Hint:
.
!
h(a) is given in part (b). -
l
'
Note that F(xp)
•
' '
' ' '
•
=
vn[F(xp) - F(xp)J . jF(xp) [l - F(xp)]
= p. Construct the interval using F-1 and Fu 1•
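The distribution-free interval of part (e) of Problem 13 can be computed directly from binomial probabilities. The following Python sketch is an addition to the text; the equal-tail choice of the two order-statistic indices and the simulated exponential sample are illustrative assumptions, not the book's notation.

```python
import numpy as np
from scipy.stats import binom

def quantile_ci(x, p, alpha=0.05):
    """Distribution-free CI for the p-th quantile using order statistics.
    With S ~ B(n, p), the coverage of [X_(k), X_(m)] is P(k <= S <= m - 1);
    here k and m are chosen so that each binomial tail is at most alpha/2."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = max(int(binom.ppf(alpha / 2, n, p)), 1)            # lower order-statistic index
    m = min(int(binom.ppf(1 - alpha / 2, n, p)) + 1, n)    # upper order-statistic index
    coverage = binom.cdf(m - 1, n, p) - binom.cdf(k - 1, n, p)
    return x[k - 1], x[m - 1], coverage

rng = np.random.default_rng(0)
sample = rng.exponential(size=50)
low, high, coverage = quantile_ci(sample, p=0.5, alpha=0.05)
print(low, high, coverage)   # interval endpoints and exact coverage probability
```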
14. Simultaneous Confidence Regions for Quantiles.
In Problem
13
preceding we gave a
Xp for p fixed. Suppose we want 0 < p < 1. We can proceed as follows. Let F, F-(x), and F+ (x) be as in Examples 4.4.6 and 4.4.7. Then
disstribution-free confidence interval for the pth quantile a distribution-free confidence region for Xp valid for all -
P(F -(x) < F(x) < F'+(x)) for all x E (a, b) = 1 - a. (a) Show that this statement is equivalent to
P(xp < Xp < Xp for all p E (0, 1))
Section
4.10
Problems and Complements
285
where x, � sup{.r : a < .r < b, F f (>·) < p} and .r" � inf{.r : a < x < b, fr - (x ) > p}. That is, the desired confidence region is the band consisting of the collection of intervals {[;z;,, x,] : o < p < 1 } .
(b) Express x11 and :rp in terms of the critical value of the Kolmogorov statistic and the order statistics. (c) Show how the statistic An(F) of Problem 4.1.17(a) and (c) can be used to give
another distribution-free simultaneous confidence band for xp. Express the band in terms of critical values for An(F) and the order statistics. Note the similarity to the interval in Problem 4.4. 13(g) preceding. 15.
Suppose X denotes the difference between responses after a subject has been given treatments A and B, where A is a placebo. Suppose that X has the continuous distribution F. We will write Fx for F when we need to distinguish it from the distribution F-x of -X. The hypothesis that A and B are equally effective can be expressed as H : F�x (t) = Fx (t) for all t E R. Tbe altemative is that F.x(t) ! F(t) for some t E R. Let Fx and F-X be the empirical distributions based on the i.i.d. X1 , . . . , Xn and -X1 , . . . , -Xn. �
�
(a) Consider the test statistic D(Fx , F x)
�
max{ [Fx(t) - F_ x(t)[ : t E R}. �
Show that if Fx is continuous and H holds, then D(Fx, F_x) has the same distribution as D(Fu, F1-u ) where Fu and F1-u are the empirical distributions of U and 1 - U with �
�
�
�
......
,
U
F(X) �� U(O, 1). n Hint: nFx (x) = 2:,� 1 l[Fx (X,) < Fx (x)]
�
n F_x(x)
=
n
L 1 [-X, < x] =
i==l nF1 - u{F-x (x))
�
�
nFu(F(x)) and
n L 1 [F_x( -Xi) < F_ x(x) ] i=l
�
nFr - u(F(x)) under H.
See also Example 4. 1.5. (b) Suppose we measure the difference between the effects of A and B by � the dif ference between the quantiles of X and -X, that is, vp(p) = � [xP + x1_ PJ, where p = F(x). Give a distribntion-free level (1 - <>) simultaneous confidence band for the curve {vF{p) : 0 < p < 1} . Hint: Let t.(x) = F � (Fx(x)) - x, then nF-x (x + l:. (x))
=
n L 1 [-X, < x + l:.(x)] i=l n L l [F_x ( -X,) < Fx (x)] i=l
=
nFr-u (Fx(x)) .
'
I'
''
'I
'
'•
286
Testing and Confidence Regions
· -
' '
'
'
Chapter 4
�
Moreover,
nFx (x)
2..:� 1 1 [Fx (Xi)
�
Fx(,·)] = nFu(Fx(x)). It follows that if --C, then D(Fx . F�x.") = D(Fu,F1_u), and by <
�
�
F- x(': + t.(x)), solving D( Fx, F- x,t::. ) < do. for �. where da. is the nth quantile of the distribution of D(Fu) F1 _u ) we get a distribution-free level (1 - a) simultaneous confidence band for t.(x) = F_J:(Fx(x)) - x = -2vF (F(x)). Properties of this and other bands are given we set �
F�x."(x) = �
�
�
,
by Doksum, Fenstad and Aaberge ( 1 977).
(c) A Distdbution and Parameter-Free Confidence Interval. Let 8(.) : :F
--t
R, where
:F is the class of distribution functions with finite support, be a location parameter as defined
in Problem
3.5.17. Let
VF =
vt = sup vt (P) inf vF(p), O
[vF (p), lit (p)] is the band in part (b). Show that for given F E F, the probability is (1 - a) that the interval [vF , DtJ contains the location set Lp = {B(F) : B(·) is a location parameter} of all location parameter values at F. Hint: Define H by H-1(p) = �[Fx 1 (p) - Fx 1 (1 - p) J = � [Fx 1 (p) + F_J: (v)]. Then H is symmetric about zero. Also note that 1 x = H-1 (F(x)) - 2 t.(x) = H-1 (F(x)) + vp(F(x)). It follows that X is stochastically between Xs-v F and Xs +vp where Xs = H- 1 (F(X)) has the symmetric distribution H. The result now follows from the properties of B(· ). 16. As in Example 1.1.3, let X1 , . Xn be i.i.d. treatment A (placebo) responses and let Y1 , . . . , Yn be i.i.d. treatment B responses. We assume that the X's and Y's are in� dependent and that they have respective continuous distributions Fx and Fy. To test the hypothesis H that the two treatments are equally effective, we test H : Fx(t) = Fy (t) for all t versus K : Fx (t) of Fy (t) for some t E R. Let Fx and Fy denote the X and Y where
. .
1 '
'
'
1
�
�
empirical distributions and consider the test statistic
'
II
I I
D(Fx , Fy ) = max [Fy (t) - Fx(t) [ . tER
(a) Show that if H holds, -
then
-..
""
-..
""
D(Fx , Fy) has the same distribution as D(Fu , Fv ),
Fu and Fv are independent U(O,l) empirical distributions. Hint: nFx(t) = 2..:� 1[Fx{X;) < Fx (t)] = nFu(Fx(t)); nFy(t) = 1 ) 1 < F v (xp)] = nFv (Fx (t)) under H . 2..:� 1 [Fy {Y; (b) Consider the parameter bp( Fx Fy) = Yp - xp. where Xp and Yp are the pth quan� tiles of Fx and Fy. Give a distribution-free level (1 - a) simultaneous confidence band - [&; ,&;i : 0 < p < !] for the curve {&p(Fx, Fv) : 0 < p < 1 } . Hint: Let t.(x) = Fy 1 (Fx (x)) - x, then
where
1
n n nFy(x + t.(x)) = I; I [Y; < Fy 1 (Fx(x))] = I; 1[Fy {Yi) < Fx(x)] = nFv(Fx(x)). i=l
i=l
l i t
'
I
'
Section 4.10
287
Problems and Complements
Moreover. nFx (":)
=
Fu (Fx (x)) . It follows that if we set F1� ,(x)
=
Fv (x + t.(x)), " " -� � £ then D (Fx , F;,, ) = D(Fu, Fv ) . Let d0 denote a size a; critical value for D (Fu , Fv ), � then by solving D(Fx , FY, .6.) < do: for Ll, we find a distribution-free level (1 a: ) simul1 confidence band for t.(. T ,) taneous = Fx 1 (P) - Fy (p) = a,(Fx, Fy ) . Properties of n
•
-
this and other bands are given by Doksum and Sievers ( 1 976).
(c) A parameter 0 = 0 ( ) ; :F x :F ---7 R, where :F is the class of distributions with finite support, is called a shift parameter if O(Fx, Fx +a) = 8(Fx Fx ) = a and · ,
8t
·
-a,
8t
Y1 > Y, X1 > X
o>
O(Fx , Fy) < 8(Fx, Fv, ) , O (Fx , Fy ) > O(Fx, , Fy ) .
Let 6 = mino
(d) Show that E(Y) - E(X), a,(·, ·), 0 < p < 1, 6, and 8 are shift parameters.
(e) A Distribution and Parameter-Free Confidence Interval. Letb" - = min o
-
-
-
�
�
Problems for Section 4.6
1. Suppose X1 , . . . , Xn is a sample from a r (p, ! ) distribution, where p is known and 0 is unknown. Exhibit the UMA level (1 - a;) UCB for 0.
2. (a) Consider the model of Problem 4.4.2. Show that n
n
i=l
i=l
e· = (2 2 �A X; )/ L tt
-
2z(1
n
) [L tt l - �
- a; .,-
i=l
is a uniformly most accurate lower confidence bound for 8. (b) Consider the unbiased estimate of e, T = (2 2:7 1 X;) I 2:7 1 if. Show that n
n
e = (2 z= x,J; z= ti - 2uvnz(1 i=l i=l
-
n
n)/ z= ti i=l
is also a level (1 - a:) confidence bound for O. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E� 1 t; Xi)/ E� 1 t[ has uniformly smaller variance than T. Hint: Both 0 and 0* are normally distributed. 3. Show that for the model of Problem 4.3.4, if I' = 1 />., then
a uniformly most accurate level 1
-
a:
UCB for J.L.
I' = 2 L� 1 Xf!x,n(<>) is
288
Testing and Confidence Regions
Chapter 4
- o UpPer and lower confidence bounds J1 in the model of Problem 4.3.6 for c fixed, n = 1. 4.
Construct unifom1ly most accurate level 1
for
5. (1
Establish the following result due to Pratt
( 1961 ) . Suppose [8', 8'], [8, 8] are two level
- a) confidence intervals such that
Po[B' < 8' < 8'] < Pe[B < 8' < OJ
for al!
B oJ 8. '
(8, 8), (8', 8') have joint densities, then Eo(8' - 8') :::; Eo (8 - 8). = = Hint: Eo(B - 8) 1 = ]00= du p(s, t)dsdt = J = Po [B < u < B]du, where
Show that if
(1: )
=
p(s , t) is the joint density of ( 8, 8).
F, G corresponding to densities /, g, respec 1 I so are well defined and that p- ,G tively, satisfying the conditions of Problem 8.2.12 strictly increasing. Show that if F(x) < G(x) for all x and E(U),E(V) are finite, then E(U ) > E(V). Hint: By Problem B.2.12(b), E(U) = f; p-1 (t)dt. 6.
' ' '
7.
1
Let
U, V
be random variables with d.f.'s
Suppose that ll_' is a uniformly most accurate level ( 1 -
- a. Prove Corollary 4.6.1.
Hint:
Apply Problem 4.6.6 to
a) LCB such that ?6[8' < 8]
=
V = (8 - 8')+, U = (8 - B)+.
j,
i
J '
' •
• • •
l '
' •
8. In Example 4.6.2, establish that the UMP test has acceptance region (4.6.3). Hint: Use Examples 4.3.3 and 4.4.4.
j '
l •
'
�· ;
•
Problems for Section 4.7
•
1. (a) Show that if 8 has a beta, /3( r, s). distribution with .>. = s6/r(l - 6) has the F distribution F2r,2, .
Hint:
r
and
s positive integers, then
8,
r, s) distribution with r and s integers. Show how the quantiles of the F distribution 0.
2. Suppose that given ..\ = ...\., X1, , Xn are i.i.d. Poisson, P(>.) and that ..\ is distributed as V/s0, where so is some constant and V Let T = L:� 1 Xi· .
•
.
"'
(a) Show that (.X J T
m = k + 2t.
x2 distribution
upper and lower credible bounds for ...\..
3.
Suppose that given () =
Pareto,
Pa(c, s) , density
x�-
= t) is distributed as W/ s, where s = so+ 2n and W � x;;, with
(b) Show how quantiles of the
8,
can be used to determine level
(1 - a)
X1, . . . , Xn are i.i.d. uniform, U(O,B), and that () has the
1r(t) = sc' ft'-1, t
> c, s > 0, c > 0.
'
' •
•
X has a binomial, 13(n,8), distribution and that (} has
can be used to find upper and lower credible bounds for ...\. and for
•
'
See Sections 8.2 and 8.3.
(b) Suppose that given () = beta, /3(
: J
•
Section 4.10
Problems
289
and Complements
ma.x{X1, . . . , Xn)· Show that {8 1 M = m) � Pa(c' , s') with c' = (a) Let M max{c, m} and s' = s + n. =
(b) Find level {1 - u ) upper and lower credible bounds for B.
(c) Give a level ( 1 - a: ) confidence interval for B.
(d) Compare the level { 1 - u) upper and lower credible bounds for B to the level { 1 - a) upper and lower confidence bounds for B. In particular consider the credible bounds as n --+ oo.
4. Suppose that given 8 = (tt1 , J-1.2 , r ) = (111, /.l2, r), Xr , . . . , Xm. and Yr , . . . , Yn are two independent N(f.li , r) and N(f.l2, r) samples, respectively. Suppose 8 has the improper prior 1r{B) = ljr, T > 0.
(a) Let so = E(x; proportional to
- x)2 + E(yj
-
y )2 Show formally thatthe posterior ?r(B •
I x,y) is
1r(r I so)?r(/11 I r,x)?r(/12 1 r,y) where 1r{r I so) is the density of so/V with V � Xm+n -2 , ?r(/11 I r,x) is a N(x, rjm) density and 1r{112 I r, y) is a N(y, r/n) density. Hint: p(B I x, y) is proportional to p( 8)p(x I /11 , r)p (y I /12 , r). (b) Show that given T, 1'1 and /12 are independent in the posterior distribution p(B x, y) and that the joint density of .1. = J.ll /12 and r is �
1r {Ll.,r I x,y) = 1r(r I so)?r(Ll. l x - y , r) where 1r{ Ll. I x - y,
is (Student) t with m + n - 2 degrees of freedom. Hint: 1r{Ll. I x, y) is obtained by integrating out r in 1r{Ll., r I x, y). (d) Use part (c) to give level {1-a) credible bounds and a level (1-a) credible interval
for Ll..
Problems for Section 4.8 1. Let Xr, . . . , Xn+I be i.i.d. as X N(J-.l, a5), where u5 is known. Here X1 , observable and Xn+I is to be predicted. ,..,_,
(a) Give a level (1
� a:
) prediction interval for Xn-t l ·
• . •
, Xn is
290
Testing and Confidence Regions
Chapter 4
•
•
(
(b) Compare the interval in part (a) to the Bayesian prediction interval 4.8.3) by doing a frequentist computation of the probability of coverage. That is, suppose X1, . . . , Xn are i.i.d. N(Jl, 0'5). Take "5 � r2 I, .05. Then the level of the lOll, �o 10, and frequentist interval is 95%. Find the probability that the Bayesian interval covers the true mean J.t for M = 5, 8, 9, 9.5, 10, 10.5, 1 1 , 12, 15. Present the results in a table and a graph.
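A simulation sketch for this coverage computation. Everything below is an assumption-laden addition: Python is used, the Bayesian interval is taken to be the usual conjugate-normal predictive interval (centered at the posterior mean of μ, with variance equal to the posterior variance plus σ₀²), which may differ in detail from (4.8.3), and the parameter values σ₀² = τ² = 1, η₀ = 10, n = 100, α = .05 are the ones suggested by the problem as read here.

```python
import numpy as np
from scipy.stats import norm

def bayes_coverage(mu, n=100, sig0=1.0, tau=1.0, eta0=10.0, alpha=0.05, B=100_000, seed=0):
    """Frequentist coverage of the conjugate-normal Bayesian prediction interval
    for X_{n+1} when the data are really i.i.d. N(mu, sig0^2)."""
    rng = np.random.default_rng(seed)
    z = norm.ppf(1 - alpha / 2)
    xbar = rng.normal(mu, sig0 / np.sqrt(n), size=B)   # simulated sample means
    xnew = rng.normal(mu, sig0, size=B)                # simulated future observations
    w = tau**2 / (tau**2 + sig0**2 / n)                # posterior weight on xbar
    post_var = 1.0 / (1.0 / tau**2 + n / sig0**2)      # posterior variance of mu
    center = w * xbar + (1 - w) * eta0
    half = z * np.sqrt(post_var + sig0**2)             # predictive standard deviation
    return np.mean(np.abs(xnew - center) <= half)

for mu in [5, 8, 9, 9.5, 10, 10.5, 11, 12, 15]:
    print(mu, bayes_coverage(mu))
```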
F,
where X1, . . . , Xn are observable and Xn+l is to , Xn+ l be i.i.d. as X be predicted. A level (1 - a: ) lower (upper) prediction bound on Y = Xn+l is defined to be a function Y(Y) of X1, . . . , Xn such that P(Y < Y) > 1 - a (P(Y < Y) > I - a). 2. Let X1,
•
.
,.....,
•
(a) If F is N(Jl, a5) with "5 known, give level (I - a) lower and upper prediction bounds for Xn+l· (b) If F is N(p,, a2) with a2 unknown, give level (1 bounds for Xn+l ·
-
a:) lower and upper prediction
(c) IfF is continuous with a positive density f on (a, b), -oo < a < b < oo, give level (I - a) distribution free lower and upper prediction bounds for Xn+ I ·
3. Suppose X1 , . . . , Xn+l are i.i.d. as X where X has the exponential distribution
F(x I 0) � I - e-x/O,
.. •
x
> 0, 0 > 0.
B(n, 0),
X is a binomial, random variable, and that (} 4. Suppose that given (} � has a beta, {J(r, s), distribution. Suppose that Y, which is not observable, has a distribution given (J = 8. Show that the conditional (predictive) distribution of Y given
B(m, 0)
X = x is
i
'
'
'
i '
Suppose X1 , . . . , Xn are observable and we want to predict Xn+t· Give a level (1 - a) prediction interval for Xn+l · Hint: XdB has a � distribution and nXn+t! E� 1 xi has an F2,2n distribution.
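A minimal sketch of the interval suggested by this hint: since nX_{n+1}/ΣXᵢ is an F_{2,2n} pivot, a level (1 − α) prediction interval is [X̄ f_{2,2n}(½α), X̄ f_{2,2n}(1 − ½α)]. The Python code and the simulated sample below are illustrative additions to the text.

```python
import numpy as np
from scipy.stats import f

def exp_prediction_interval(x, alpha=0.05):
    """Level (1 - alpha) prediction interval for a future exponential observation,
    based on the pivot n * X_{n+1} / sum(X_i) ~ F(2, 2n)."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    return xbar * f.ppf(alpha / 2, 2, 2 * n), xbar * f.ppf(1 - alpha / 2, 2, 2 * n)

rng = np.random.default_rng(1)
print(exp_prediction_interval(rng.exponential(scale=2.0, size=30)))
```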
0,
i'
! '
'
1 l
i
•
q(y l x) � ( ; ) B(r +x+y, s + n - x + m - y)/B(r +x, s + n - x) where B( ·, ·) denotes the beta function. (This q(y I x) is sometimes called the P61ya distribution.) Hint: First show that
J '
i
'
•
q(y I x) � Jp(y I 0)1r(O I x)dO. 5. In Example 4.8.2, let U( I) < · · - < u
.
•
,
u C! ), . . . , U(n+l) .
Problems for Section 4.9 I
'.
B(n, 8), distribution. Show that the likelihood ratio statistic for 0 i � is equivalent I2X - nl.
1. Let X have a binomial, testing H : � versus K :
0�
to
,
Section 4.10
291
Problems and Complements
Hint: Show -'(n - x).
that for
x < �n, A(x)
is an increasing function of
-(2x - n) and A(x)
=
Jn Problems 2-4, let X1 ) . . . , Xn be a N(f-L, a2) sample with both J-L and a2 unknown. 2.
In testing
H : JL < Jlo
versus K
:
J-L
> Jlo
show that the one-sided, one-sample t test is
the likelihood ratio test (for o: < �). Hint: Note that /io � X if X < Jl.o and � Jl.o otherwise. Thus, log A(x) and � (nl2) log(! + r;l(n - !)) for Tn > 0, where Tn is the statistic.
t
We want to test
3. One-Sided Testsfor Scale.
�
0, ifTn < 0
H : a2 < a� versus K : a2 > a5. Show that
(a) Likelihood ratio tests are of the form: Reject if, and only if,
Hint: log -'( x) � 0,
if ii2 lv-5
<
'
1 and � (nl2)[ii2 Io-5
(b) To obtain size o: for H we should take c = Xn-I ( 1 Hint: Recall Theorem B.3.3.
-
-
I
-
log(ii2Io-5)1 otherwise.
o: .
)
(c) These tests coincide with the testS obtained by inverting the family of level
lower confidence bounds for
q2 .
4. Two-Sided Testsfor Scale.
We want to test
.
H:a
=
{1
-
o:
}
ao versus K : a i= a0.
(a) Show that the size o: likelihood ratio test accepts if, and only if,
n
CJ < (i) (ii)
F(cz) - F(c1)
I 2 '"' (Xi - X)
L..,. o CJ i=I
�
-
,
< Cz where c1
and
c2
satisfy,
1 - a, where F is the d.f. of the X�-1 distribution.
Ct - cz = n logct/cz.
(b) Use the
normal approximatioh to check that
C1n
Czn
V2nz(1 - �a) n + V2nz(1 - �a) n-
approximately satisfy (i) and also (ii) in the sense that the ratio
C1n - CZn --,=--=.:;-"--n log cl n / C z n
---�'
1 as n
---�'
oo .
(c) Deduce that the critical values of the commonly used equal-tailed test, Xn-l ( � o:)
Xn-1 (1 -
J a) also approximately satisfy (i) and (ii) of part (a).
,
i
292
Testing and Confidence Regions
Chapter 4
The following blood pressures were obtained in a sample of size n = 5 from a certain population: 1 24, 1 10, 1 14, 100, 190. Assume the one-sample normal model. 5.
(a) Using the size a: = 0.05 one-sample t test, can we conclude that the mean blood pressure in the population is significantly larger than l 00? (b) Compute a level 0.95 confidence interval for a2 corresponding to inversion of the equal-tailed tests of Problem 4.9.4.
! l
l
,
I
•
(c) Compute a level 0.90 confidence interval for the mean bloC
6. Let X1, . . . , Xn1 and Y1 , . . . , Yn2 be two independent N(J.L t ,
�
(I'!, 1'2 , a2 ) is (X, y, 0'2 ), where a' is as defined in
(b) Consider the problem of testing H : J..L I < J-L2 versus K : ;..t 1 > J1. 2 · Assume a < � Show that the likelihood ratio statistic is equivalent to the two-sample t statistic T.
(c) Using the normal approximation
and (1' 1 - 1'2 )/a � l7. The following data are from an experiment to study the relationship between forage pro duction in the spring and mulch left on the ground the previous fall. The control measure ments (x's) correspond to 0 pounds of mulch per acre, whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre. Forage production is also measured in pounds per acre. X
y
794 2012
1800 2477
576 3498
41 1 2092
897 1808
.
•
'
I I
•
I .I
l
I
I
'
,
l
•
I
'
!
I
'
I
'
Assume the two-sample normal model with equal variances. (a) Find a level 0.95 confidence interval for p.z - I'!·
•
•
•
(b) Can we conclude that leaving the indicated amount of mulch on the ground signifi cantly improves forage production? Use ct = 0.05.
'
•
(c) Find a level 0.90 confidence interval for a by using the pivot s2 ja 2•
8. Suppose X has density p(x, 8), 8 E 6, and that A(X, 6o, e,) depends on X only through T.
T is sufficient for 8. Show that
9. The nonnally distributed random variables X1, . . . , Xn are said to be serially correlated or to follow an autoregressive model if we can write
Xi = (}Xi-I + €i, i = 1 , . . , n, .
where X0
=
0 and £1, . . . , €n are independent
•
N(O, a2) random variables.
, I
'
•
293
Problems and Complements
Section 4.10
(a) Show that the density of X = (X1, . . . , Xn) is p(x, 8)
=
2 (21r
for -oo < Xi < oo, i = 1, . . . , n,
i= 1
xo = 0.
- 8x,_1)2}
(b) Show that the likelihood ratio statistic of = 0 (independence) versus K 0 (serial correlation) is equivalent to -(L� 2 xixi_ 1 )2 1 L:� 1 x;.
H:0
:0
i:-
a �
10. (An example due to C. Stein). Consider the following model. Fix 0 < < and a/[2(1 - ")] < c < Let 8 consist of the point -I and the interval [0, 1[ . Define the
a.
frequency functions p(x , 8) by the following table. X
8
-2
-I
0
I
2"
- - Q
Q
l2 - a
la 2
Be
(;::�) (� - a)
(;::�) ( i - a)
( 1 - B)c
'
-1 i -I
1 2
) ( 1
'
'
a
Q
(a) What is the size a likelihood ratio test for testing
H
:
2
B = - 1 versus K : B 7'= -1?
a
(b) Show that the test that rejects if, and only if, X = 0, has level and is strictly more powerful whatever be B. 11. The power functions of one- and two-sided t tests. Suppose that T has a noncentral t, 7k,6• distribution. Show that, (a) P, [T >
t] is an increasing function of J. (b) P, [IT[ > t] is an increasing function of ]J].
Hint: Let Z and V be independent and have N(O, 1), xZ distributions respectively.
Then, for each v > 0, P0 [Z > t yV/k] is increasing in J, P6 []ZJ > in [J[. Condition on V and apply the double expectation theorem.
t yV/k] is increasing
12. Show that the noncentral t distribution, 7k,6• has density
1 {= x U•-'I e-H x+C tVxfk -ol'ldx. J. ' ,(t) = h'k(�k)zlCk+l) lo Hint: Let Z and V be as in the preceding hint. From the joint distribution of Z and V, get the joint distribution of Y1 = Z/ ,jV7k and Y2 = V. Then use py, (y1) = f py,y, (y,, Y2 ) dy,. 13. The F Test for Equality of Scale. Let X1, , Xn1 , Y1 , - . . , Yn2 be two independent samples from N(Jli, £Ti), N(p.2 , 0'�). respectively, with all parameters assumed unknown.
...
294
Testing and Confidence Regions
Chapter 4
(a) Show that the LR test of H : af = a� versus K : ai > af is of the form: Reject if, and only if, F � [(n1 - 1)/(n2 - l)]E(}j - Y) 2/E(X; - X)2 > C.
(b) Show that (afja�)F has an Fnz-I,n1 -I distribution and that critical values can be obtained from the F table. -
(c) Justify the two-sided F test: Reject H if, and only if, F > /(1 a/2) or F < f(n/2 ) , where f( t) is the tth quantile of the Fnz-I,nt -I distribution. as an approximation to the LR test of H : a1 = (7z versus K : a1 -=1 a2. Argue as in Prol;>lem 4.9.4.
''' '
' '
(d) Relate the two-sided test of part (c) to the confidence intervals for a5faf obtained
in Problem 4.4.10.
14. The following data are the blood cholesterol levels (x's) and weight/height ratios (y's) of 10 men involved in a heart study.

x: 254 240 279 284 315 250 298 384 310 337
y: 2.71 2.96 2.62 2.19 2.68 2.64 2.37 2.61 2.12 1.94
Using the likelihood ratio test for the bivariate normal model, can you conclude at the 10% level of significance that blood cholesterol level is correlated with weight/height ratio?
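A direct computation for Problem 14 (a Python addition to the text): the likelihood ratio test here is equivalent to |r| (see Problem 15), and the standard equivalent form uses the statistic T = √(n − 2) r/√(1 − r²), which has a t_{n−2} distribution under ρ = 0.

```python
import numpy as np
from scipy.stats import t

x = np.array([254, 240, 279, 284, 315, 250, 298, 384, 310, 337], dtype=float)
y = np.array([2.71, 2.96, 2.62, 2.19, 2.68, 2.64, 2.37, 2.61, 2.12, 1.94])

n = len(x)
r = np.corrcoef(x, y)[0, 1]                  # sample correlation coefficient
T = np.sqrt(n - 2) * r / np.sqrt(1 - r**2)   # t_{n-2} under rho = 0
pval = 2 * t.sf(abs(T), df=n - 2)
print(r, T, pval)                            # compare pval with 0.10
```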
15, Let ( X1 , Y1 ), . . . , (Xn, Yn) be a sampj� from a bivariate.N"(O, 0, <1?, <1�, p) distribution. Consider the problem of testing H : p � 0 versus K : p -1 0. (a) Show that the likelihood ratio �tatistic is equivalent to lrl where
i=l
j=l
(b) Show that if we have a sample from a bivariate N(P.I , p.z, ai, a?, p) distribution, then P[P > c] is an increasing function of p for fixed c. Hint: Use the transformations and Problem B.4.7 to conclude that ,O has the same dis tribution as 812/8182, where
'''
=
n
n
n
2 = L l � L s; u,8 Vi , 812 � L U;V, i=2 i=2 i=2 and (U2, V,), . . . , (Un, Vn ) is a sample from a .N"(O, 0, 1 , 1, p) distribution. Let R � 812 /81 82, T � vn - 2R(V1 R2, and using the arguments of Problems B.4.7 and B.4.8, show that given Uz = u2, . . . , Un = Un, T has a noncentral 'Fn -2 distribution
with noncentrality parameter p. Because this conditional distribution does not depend on (u2, • . . , Un ), the continuous versiop of (B.1.24) implies that this is also the unconditional distribution. Finally, note that ,0 has the same distribution as R, that T is an increasing function of R, and use Probl�m 4.8.ll(a).
,
,
..
·;'
16, Let A(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p -1 0 in the bivariate normal model. Show, using (4.9.4) and (4.9.5) that 2 log >.(X) !:. V, where
•
.
' '
' '
1 '
l i
'
v has a xi distribution.
'
!
section 4. 11
295
Notes
17, Consider the bioequivalence example in Problem 3.2.9.
(a) Find the level " LR test for testing H :
8 E [-<, <] versus K : 8 i/c [-<, <].
(b) Compare your solution to the Bayesian solution based on a continuous loss function oo, and r;o = 0, n oo. given in Problem 3.2.9. Consider the cases rro = 0, TJ .......
4.11
.......
NOTES
Notes for Section 4.1 (1) The point of view usually taken in science is that of Karl Popper [1968]. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. Rejection is more definitive.
(2) We ignore at this time some real-life inadequacies of this experiment such as the placebo effect (see Example 1 . 1 .3). (3) A good approximation (Durbin, !973; Stephens, 1974) to the critical value is c,.(t) = tj( ..fii - 0.01 + 0.85/ ..fii) where t = 1 .035, 0.895 and t = 0.819 for " = 0.01, .05 and 0.10, respectively. Notes for Section 4.3 (1) Such a class is sometimes called essentially complete. The term complete is then re served for the class where strict inequality in ( 4.3.3) holds for some 8 if
(2) In using 8(5) as a confidence hound we are using the region [8(5), 1]. Because the region contains C(X), it also has confidence level (I <>) . -
4.12
REFERENCES
BARLOW, R. AND F. PROSCHAN, Mathematical Theory of Reliability. New York: J. Wiley & Sons, 1965.
BICKEL, P., E. HAMMEL, AND J. W. O'CONNELL, "Is there a sex bias in graduate admissions?" Science, 187, 398-404 (1975).
BOX, G. E. P., Apology for Ecumenism in Statistics and Scientific Inference, Data Analysis and Robustness, G. E. P. Box, T. Leonard, and C. F. Wu, Editors. New York: Academic Press, 1983.
BROWN, L. D., T. CAI, AND A. DASGUPTA, "Interval estimation for a binomial proportion," The American Statistician, 54 (2000).
DOKSUM, K. A. AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).
DOKSUM, K. A., G. FENSTAD, AND R. AABERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).
DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).
FERGUSON, T., Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press, 1967.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.
HALD, A., Statistical Theory with Engineering Applications. New York: J. Wiley & Sons, 1952.
HEDGES, L. V. AND I. OLKIN, Statistical Methods for Meta-Analysis. Orlando, FL: Academic Press, 1985.
JEFFREYS, H., The Theory of Probability. Oxford: Oxford University Press, 1961.
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
POPPER, K. R., Conjectures and Refutations; the Growth of Scientific Knowledge, 3rd ed. New York: Harper and Row, 1968.
PRATT, J., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).
SACKROWITZ, H. AND E. SAMUEL-CAHN, "P values as random variables - Expected P values," The American Statistician, 53, 326-331 (1999).
STEPHENS, M., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).
STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).
TATE, R. F. AND G. W. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).
VAN ZWET, W. R. AND J. OOSTERHOFF, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).
WALD, A., Sequential Analysis. New York: Wiley, 1947.
WALD, A., Statistical Decision Functions. New York: Wiley, 1950.
WANG, Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).
WELCH, B., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).
WETHERILL, G. B. AND K. D. GLAZEBROOK, Sequential Methods in Statistics. New York: Chapman and Hall, 1986.
WILKS, S. S., Mathematical Statistics. New York: J. Wiley & Sons, 1962.
Chapter 5
ASYMPTOTIC APPROXIMATIONS
5.1
INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS
Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific P by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample X₁, …, Xₙ from a distribution F, our setting for this section and most of this chapter. If we want to estimate μ(F) = E_F X₁ and use X̄ we can write,

MSE_F(X̄) = E_F(X̄ − μ(F))² = σ²(F)/n, where σ²(F) = Var_F(X₁).   (5.1.1)

This is a highly informative formula, telling us exactly how the MSE behaves as a function of n, and calculable for any F and all n by a single one-dimensional integration. However, consider med(X₁, …, Xₙ) as an estimate of the population median ν(F). If n is odd, ν(F) = F⁻¹(½), and F has density f we can write

MSE_F(med(X₁, …, Xₙ)) = ∫_{−∞}^{∞} (x − F⁻¹(½))² gₙ(x) dx   (5.1.2)

where, from Problem (B.2.13), if n = 2k + 1,

gₙ(x) = n (2k choose k) Fᵏ(x)(1 − F(x))ᵏ f(x).   (5.1.3)
i
I
298
I
!' '
Asymptotic Approximations
Chapter 5
'' '
(Problem 5.1.2) and its qualitative properties are reasonably transparent. But suppose F is not Gaussian. It seems impossible to determine explicitly what happens to the power function because the distribution of foX/ S requires the joint distribution of (X, S) and in general this is only representable as an n-dirnensional integral;
'
'' '
where
A� There are two complementary approaches to these difficulties. The first, which occupies us for most of this chapter, is to approximate the risk function under study
i'
'
'
;
'
'
by a qualitatively simpler to understand and easier to compute function, Rn (F). The other, which we explore further in later chapters, is to use the Monte Carlo method. In its simplest form, Monte Carlo is described as follows. Draw B independent "samples" of size n, {XIj, . . . , Xn1 }, 1 < j < B from F using a random number generator and an explicit form for F. Approximately evaluate Rn(F) by
1 B
-
(5. 1.4) Rs = B Ll(F, o(X11, . . . , XnJ)). j=l By the law of large numbers B ---+ oo, fis � Rn (F ). Thus, save for the possibility of a very unlikely event, just as in numerical integration, we can approximate Rn(F) arbitrarily as
closely. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5.3.3. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or, more specifically, of distributions of statistics, based on observing n i.i.d. observations X1, . . . , Xn as n --+ oo. We shall see later that the scope of asymptotics is much greater, but for the time being let's stick to this case as we have until now. Asymptotics, in this context, always refers to a sequence of statistics
{Tn(XJ, . . . , Xn)}n>l , for instance the sequence of means {Xn}n>t. where Xn = of medians, or it refers to the sequence of their distributions
! :E� 1 Xi, or the sequence
Asymptotic statements are always statements about the sequence. The classical examples p are, Xn -+ EF(XI) or -
LF( ;fn(Xn - EF (XI))) -+ N(O, VarF(XI)).
i i
'
I I
' ' ! ,
j
Section 5 .1
Introduction: The Meaning and Uses of Asymptotics
Tn(X . , Xn)
Tn (X . , Xn) Tn(X1, ... ,Xn)
299
In theory these limits say nothing about any particular but in practice we 1, . . act as if they do because the we consider are closely related as functions of 1, . . so that we expect the limit to approximate or (in an appropriate sense). For instance, the weak law of large num bers tells us that, if oo, then
nCp(Tn (X1, . . . , Xn)) Ep]Xd <
(5.1 .5) That is, (see A.l4.1 )
h[] Xn -JL ] > <[
�
o
(5.1.6)
for all £ > 0. We interpret this as saying that, for n sufficiently large, X is approximately equal to its expectation. The trouble is that for any specified degree of approximation, say, £ = .01, (5.1.6) does not tell us how large n has to be for the chance of the approximation not holding to this degree (the !eft-hand side of (5.1.6)) to fall, say, below .01. Is n > 100 enough or does it have to be n > 100, 000? Similarly, the central limit theorem tells us that oo, ,, is as above and then if n
•
Ep]Xf] <
a2 - Varp (XI), n ( Pp [vn X a- JL) < z] il>(z )
(5.1 .7)
�
where 4:1 is the standard normal d. f. As an approximation, this reads
( (x I') ) . PF[Xn < x] "' il> vn a
(5.1.8)
n,
Again we are faced with the questions of how good the approximation is for given x, and What we in principle prefer are bounds, which are available in the classical situations of (5.1.6) and (5. 1 . 7). Thus, by Chebychev's inequality, if oo,
Pp.
a' Pp[]Xn - I'] > <] < nE2 .
EpXf <
As a bound this is typically far too conservative. For instance, if more delicate Hoeffding bound (B.9.6) gives
(5 .1 .9 )
jX 11 < 1, the much
Pp [IXn -JL] > e] ::; 2exp { - �ne2} . (5.1 .10) 1 implies that a2 < 1 with a2 = 1 possible (Problem 5.1.3), the right Because ] X !] hand side of (5.1.9) when a2 is unknown be>omes 1jn e2 For ' = .1, 400, (5.1.9) is .25 whereas (5.1. 10) is .14. ::;
n =
Further qualitative features of these bounds and relations to approximation (5.1.8) are given in Problem 5.1.4. Similarly, the celebrated Berry-Esseen bound (A. 1 5 . 1 1 ) states that if oo,
EF]Xd' <
sup X
PF [
-a !') <- x] - i!>(x) -< 0Eap]3nX1/2d3
r,; (Xn
V"
(5. ! . I I)
300
Asymptotic Approximations
Chapter 5
where C is a universal constant known to be < 33/4. Although giving us some idea of how much (5.1.8) differs from the truth, (5. 1.11) is again much too consetvative generally. (1) The approximation (5.1.8) is typically much better than (5.1.11) suggests. Bounds for the goodness of approximations have been available for Xn and its distri bution to a much greater extent than for nonlinear statistics such as the median. Yet, as we have seen, even here they are not a very reliable guide. Practically one proceeds as follows: (a) Asymptotic approximations are derived. (b) Their validity for the given n and T11 for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good. Asymptotics has another important function beyond suggesting numerical approxima tions for specific n and F. If they are simple, asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. For instance, (5.1.7) says that the behavior of the distribution of Xn is for large n governed (approxi mately) only by f-L and cr 2 in a precise way, although the actual distribution depends on Pp in a complicated way. It suggests that qualitatively the risk of Xn as an estimate of Jl, for any loss function of the form l(F,d) = -'(]!' - dl) where -'(0) = 0, A'(O) > 0, behaves like -' (0)( <1/ y'n)( v'27i') (Problem 5.1.5) and quite generally that risk increases with <1 and decreases with n, which is reasonable. As we shall see, quite generally, good estimates Bn of parameters B(F) will behave like Xn does in relation to Jl · The estimates On will be consistent, Bn � O (F) for all F in the model, and asymptotically normal, '
,
fo[Bn - B(F)] <1(B,F)
-->
(5.1.12)
N(O, 1)
'
'
•
•
l '
•
-
where CT(B, F) typically is the standard deviation (SD) of foOn or an approximation to this SD. Consistency will be pursued in Section 5.2 and asymptotic normality via the delta method in Section 5.3. The qualitative implications of results such as are very impor tant when we consider comparisons between competing procedures. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. We now turn to specifics. As we mentioned, Section 5.2 deals with consistency of various estimates including maximum likelihood. The arguments apply to vector-valued estimates of Euclidean parameters. In particular, consistency is proved for the estimates of canonical parameters in exponential families. Section 5.3 begins with asymptotic computa tion of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for one-parameter exponential families. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17
j
i
•
•
'
j
l :
Section 5.2
30 1
Consistency
in exponential families among other results. Section 5.4 deals with optimality results for likelihood-based procedures in one-dimensional parameter models. Finally in Section 5.5 we examine the asymptotic behavior of Bayes procedures. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A l 4, A. IS, and B.7. We will recall relevant definitions from that appendix as we need them, but we shall use results we need from A.14, A. IS, and 8.7 without further discussion. .
Summary. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to ∞. In practice, asymptotics are methods of approximating risks, distributions, and other statistical quantities that are not realistically computable in closed form, by quantities that can be so computed. Most asymptotic theory we consider leads to approximations that in the i.i.d. case become increasingly valid as the sample size increases. We also introduce Monte Carlo methods and discuss the interaction of asymptotics, Monte Carlo, and probability bounds.
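As a concrete instance of the Monte Carlo recipe (5.1.4), the following Python sketch (an addition to the text) approximates the MSE of the sample median, the risk discussed around (5.1.2), under a N(0, 1) population. The choices of B, of n, and of squared-error loss are illustrative assumptions.

```python
import numpy as np

def mc_mse_median(n, B=20_000, seed=0):
    """Monte Carlo approximation, as in (5.1.4), of R_n(F) = MSE_F(median)
    for F = N(0, 1), whose population median is 0."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((B, n))      # B independent samples of size n
    medians = np.median(samples, axis=1)
    return np.mean(medians ** 2)               # average squared error over the B draws

for n in [11, 51, 101]:
    # compare with the large-sample approximation pi/(2n) for the normal median
    print(n, mc_mse_median(n), np.pi / (2 * n))
```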
5.2
CONSISTENCY Plug-In Estimates and MlEs in Exponential Family Models
5.2.1
Suppose that we have a sample X1, . ,Xn from Po where 6 E 6 and want to estimate a real or vector q(B). The least we can ask of our estimate iln (X1, , Xn) is that as .
.
•
.
.
n � oo, fin I] q(8) for all 8. That is, in accordance with (A.l4.1) and (B.7. 1), for all 8 E e. , > o. (5.2. 1 )
·
where I j denotes Euclidean distance. A stronger requirement is (5.2.2)
Bounds b(n, <) for sup9 P9 1/
p
X � p(P) = E(X!) �
-
�
and p(P) = X, where P is the empirical distribution, is a consistent estimate of p(P). For P this large it is not uniformly consistent. (See Problem 5.2.2.) However, if, for
'
'
302 instance,
=
P
{P : EpX'f < M <
by Chebyshev's inequality, for all
X
oo }, then
Asymptotic Approximations
Chapter 5
is uniformly consistent over
P because
P E P.
0
, Xn be the indicators of binomial trials Example 5.2.2. Binomial Variance. Let X1 , with P[X1 = 1] = p. Then N = L; X; has a B(n,p) distribution, 0 < p < 1, and P = X = N/n is a uniformly consistent estimate of p. But further, consider the plug-in estimate p(1 -p)jn of the variance of p, which is � q(ji), where q(p) = p(1-p). Evidently, by A.I4.6, q(jj) is consistent. Other moments of X1 can be consistently estimated in the .
�
•
.
-
0
same way.
To some extent the plug-in method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plug-in estimates.
Theorem 5.2.1. Suppose thnt P = S = { (Ph . . . , Pk) : 0 < Pi < 1, 1 < j < k, L;J�1 p, = 1}, the k-dimensional simplex, where Pi = P[X1 = xi], 1 < j < k, and { x1, , xk} is the range of X,. Let Ni = L;� 1 1(X; = x1) and PJ - NJfn, Pn = (ih, . . . , fA) E S be the empirical distribution. Suppose that q : S RP is continuous. Then (jn = q(P:n ) is a unifonnly consistent estimate of q(p ). .
•
.
---+
Proof. By the weak law of large numbers for all p,
6>0
Pp [IPn - PI > 6] � 0. Because q is continuous and S is compact, it is uniformly continuous on S. Thus, for every <
> 0, there exists 6 (<) > 0 such that p, p' E S, I P' -pi < 6(<), implies l q(p')-q(p) I <
Then
Pp [liln - q l > <] :S Pp[[Pn - PI > 6(<)] But, sup{ Pp[li\n -pi > 6] : p E S} < kj4n62 (Problem 5.2.1) and the result follows. In fact, in this case, we can go further. Suppose the is defined by
0
modulus of continuity of q, w( q, 0)
·
w(q, 8)
= sup{ l q(p) - q(p') I : IP - P'l :S 8},
Evidently, w(q, ·) is increasing in
w(q,6) !
<.
O as 6 !
0. Let w-1 : [a, b] < 1 w- (<)
It easily follows (Problem
0
and has the range (a, b) say. If
(5.2.3) q is
continuous
R+ be defined as the inver�e ofw,
= inf{6
: w (q , 6)
2: <}
(5.2.4)
i
5.2.3) that (5.2.5)
A simple and important result for the case in which X�, . . . , Xn are i.i.d. is the following:
I
with Xi
EX
I
l
••
'
'
•
Section 5.2
303
Consistency
(g,, . . . , 9d) map X onto Y C Rd Suppose Proposition 5.2.1. Let g Eoi9J(Xl ) l < oo, I < j < d, for all 8; let mj(B) = EogJ (Xl), I < j < d, and let RP. Then, if h is continuous, q(B) = h(m(B)), where h : Y �
ij
h(g)
=
I n "' g(X, ) h , L... i=l
is a consistent estimate of q(8). More generally if v(P) = h{Epg(X1 ) ) and P = {P : Ep lg(X1) 1 < oo }, then v( P) = h(g) , where P is the empirical distribution, is consistent for v{P). Proof. We need only apply the general weak law of large numbers (for vectors) to conclude that (5.2.6) if Eplg(Xl)l < oo. For consistency of h(g) apply Proposition B . 7. 1: Un .!:.. U implies 0 that h(U n) !'. h (U) for all continuous h.
Example 5.2.3. Variances and Correlations. Let Xi (Ui, Vi), 1 < i < n be i.i.d. N2 (J-Lt ,J-Lz , aT,a�,p), af > 0, IPI < 1. Let g(u , v ) = (u, v, u2 , v2 , uv) so that E� 1 g(Ui, Vi) is the statistic generating this 5-parameter exponential family. If we let 8 = (J-Lt, J-l2, a'f, ag,p), then =
If h = m- 1 , then
which is well defined and continuous at all points of the range of m. We may, thus, conclude by Proposition 5.2.1 that the empirical means, variances, and correlation coefficient are all consistent. Questions of uniform consistency and consistency when P = {distributions such that EU₁² < ∞, EV₁² < ∞, Var(U₁) > 0, Var(V₁) > 0, |Corr(U₁, V₁)| < 1} are discussed in Problem 5.2.4.
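A small simulation illustrating the consistency asserted in Example 5.2.3 for the empirical correlation coefficient. Python, the bivariate normal parameters, and the sample sizes are illustrative choices added here, not part of the text.

```python
import numpy as np

def empirical_corr(n, rho=0.6, seed=0):
    """Sample correlation of n i.i.d. draws from a bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    u, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return np.corrcoef(u, v)[0, 1]

for n in [50, 500, 5000, 50000]:
    print(n, empirical_corr(n))   # should approach rho = 0.6 as n grows
```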
Here is a general consequence of Proposition 5.2.1 and Theorem 2.3.1. Theorem 5.2.2. Suppose P is a canonical exponentialfamily of rank d generated by T. Let q, [ and A(·) correspond to P as in Section 1.6. Suppose [ is open. Then, zf X1, . . . , Xn are a sample from Pr, E P.
(i) P'1 [The MLE ij exists] � I.
(ii) ij is consistent.
'
�I '
' '
'
304
Asymptotic Approximations
Chapter 5
'
' '
-
Proof Recall from Corollary 2.3.1 to Theorem 2.3.1 that i)(X�> - - - , Xn) exists iff � L� 1 T(Xi) = Tn belongs to the interior C.J. of the convex support of the distribution of Tn - Note that, if no is true, En, ( T(X!)) must by Theorem 2.3.1 belong to the interior
A(n) = to,
t
A(n0) = En, T{X1),
where o = of the convex support because the equation is solved by '7o· By definition of the interior of the convex support there exists a ball < 6} c CT- By the law of large numbers, S, =
{t : [t - En, T(Xt)l
I
n
n
L T(X,) i=l
Hence,
1
P,
jo E'r,, T(X!).
n
Pn, [ n L T(X, ) E CT) � 1. i=l But Tj, which solves
(5.2.7)
n 1 A(n) = n- L T(X,), 1
•= .
exists iff the event -in (5.2.7) occurs and (i) follows. We showed in Theorem 2.3.1 that on C.J. the map 17 ..--j. A(11) is 1-1 and continuous on£. By a classical result, see, for example, : A(t:) � t: is continuous on S, and the result follows from Rudin (1987), the inverse D Proposition 5.2.1.
A- 1
i i
\ '
-
'
5.2.2
! ! '
Consistency of Minimum Contrast Estimates
'
'
The argument of the the previous subsection in which a minimum contrast estimate, the MLE, is a continuous function of a mean of i.i.d. vectors evidently used exponential family properties. A more general argument is given in the following simple theorem whose con be i.i.d. Po, B E e c Rd . Let 0 be a minimum ditions are hard to check. Let contrast estimate that minimizes
XI' . . . , Xn
n 1 X , II) Pn(X, II) = n- L p( , .
where, as usual, D(
z=l
Theorem 5.2.3. Suppose
n P�' o 1 0 sup{ [ - L[p(X, , Ii) - D{lio , li)JI : II E e} n i=l
(5.2.8)
and -
Then 8 is consistent.
-
lio
l > <} > D(lio,lio)
'
1
I '
'
I
i
' '
i
110 , II) = Eo,P(Xt. II) is uniquely minimized at lio for all lio E e.
inf{D{Ii, lio) : Ill
I
1
for l!llery e
> 0.
(5.2.9)
j
I I
I
I I
I
I
Section
305
5.2 Consistency
Proof Note that,
n 1 ] < Pe,[inf{-n L[p(X, , II) - - p(X., IIo)] : ]11 - 1101 > <} < O]
-
<
P11, [19 - llol >
By hypothesis, for all
i= l
(5.2.10)
o > 0,
1
Pll, [inf{- L(p(X, , II) - p(X,,IIo)) : [11 - llol > <} n i= l - inf{D(IIo, II) - D(ll0, llo)) : 111 -- llol > <} < -o] � 0 because the event in
n
(5.2. 1 1 ) implies that
1 n
sup{[- L[p(X,, II) - D(llo, 11)]1
n i=l
which has probability tending to
o� Then
(5.2.11)
0 : II E e) > -,
(5.2.12)
2
0 by (5.2.8). But for c > 0 let
� inf{D(II,IIo) - D(llo,llo) : [11 - llol > o).
(5.2.11) implies that the right-hand side of (5.2.10) tends to 0.
0
A simple and important special case is given by the following.
{ll1 , . . . , lid). E11, ] logp(X" II) [ < oo and the parameterization is identifiable, then, if II is the MLE, P11 [8 ¥ 0; ] � Ofor all j. CoroUary 5.2.1. lf 6 is finite, e
�
-
-
'
Proof Note that for some < > 0,
(5.2.13) By Shannon's Lemma 2.2.1 we need only check that (5.2.8) and (5.2.9) hold for
logp(x, II). But because 6 is finite, (5.2.8) follows from the WLLN and P11, [max{] � l:� 1 (p(X, , II;) - D(llo, II;)) : 1 < j < d) > <] < d max{ P11, [I� l:� 1 (p(X,, II; ) - D(llo, II;) )I > <] : 1 < j S d) and
(5.2.9) follows from Shannon's lemma. can
often fail
see
�
Problem
p(x, II) � 0. 0
5.2.5. An alternative condition that is readily seen to work more widely is the replacement of (5.2.8) by Condition
(5.2.8)
(i) For all compact K
1 n
c
6,
Pll, 0. sup - L ]p(X , 0) - D(llo, II) I : II E K n i=l ,
(ii) For some compact
1 n
K c 8,
II) - p(X,, 110)) : II E K' P11, inf - L(p(X,, _ n �=1
(5.2.14) >0
�
1.
!
•
'
306
Asymptotic Approximations
Chapter 5
•
We shall see examples in which this modification works in the problems. Unfortunately checking conditions such as (5.2.8) and (5.2.14) is in general difficult. A general approach due to Wald and a similar approach for consistency of generalized estimating equation so lutions are left to the problems. When the observations are independent but not identically distributed, consistency of the MLE may fail if the number of parameters tends to infinity, see Problem 5.3.33.
Summary. We introduce the minimal property we require of any estimate (strictly speak ing, sequence of estimates) consistency. If Bn is an estimate of B(P), we require that Bn � � B(P) as n � oo. Unifonn consistency for 'P requires more, that sup{P[IBn - B(P)] > Ej : 0 for all € > 0. We show how consistency holds for continuous functions of P E P} ---t
vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. We conclude by studying consis tency of the MLE and more generally MC estimates in the case 8 finite and 8 Euclidean. Sufficient conditions are explored in the problems. 5,3
FIRST- AND H IGHER-ORDER ASYMPTOTICS: THE DELTA M ETHOD WITH APPLICATIONS
We have argued in Section 5.1 that the principal use of asymptotics is to provide quantita tively or qualitatively useful approximations to risk. 5.3.1
The Delta Method for Moments
We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders. We then sketch the extension to functions of vector means. As usual let $X_1, \ldots, X_n$ be i.i.d. $\mathcal{X}$-valued and for the moment take $\mathcal{X} = R$. Let $h : R \to R$, let $\|g\|_\infty = \sup\{|g(t)| : t \in R\}$ denote the sup norm, and assume

(i) $h$ is $m$ times differentiable on $R$, $m \ge 2$. We denote the $j$th derivative of $h$ by $h^{(j)}$ and assume $\|h^{(m)}\|_\infty = \sup_x |h^{(m)}(x)| \le M < \infty$.

(ii) $E|X_1|^m < \infty$.

Let $E(X_1) = \mu$, $\mathrm{Var}(X_1) = \sigma^2$. We have the following.

Theorem 5.3.1. If (i) and (ii) hold, then
\[ Eh(\bar X) = h(\mu) + \sum_{j=1}^{m-1} \frac{h^{(j)}(\mu)}{j!}\, E(\bar X - \mu)^j + R_m \tag{5.3.1} \]
where $|R_m| \le \dfrac{\|h^{(m)}\|_\infty}{m!}\, E|\bar X - \mu|^m$.
The proof is an immediate consequence of Taylor's expansion,
\[ h(\bar X) = h(\mu) + \sum_{j=1}^{m-1} \frac{h^{(j)}(\mu)}{j!}(\bar X - \mu)^j + \frac{h^{(m)}(X^*)}{m!}(\bar X - \mu)^m \tag{5.3.2} \]
where $|X^* - \mu| \le |\bar X - \mu|$, and the following lemma.

Lemma 5.3.1. If $E|X_1|^j < \infty$, $j \ge 2$, then there are constants $C_j > 0$ and $D_j > 0$ such that
\[ E|\bar X - \mu|^j \le C_j\, n^{-j/2}, \tag{5.3.3} \]
\[ |E(\bar X - \mu)^j| \le D_j\, n^{-[(j+1)/2]}. \tag{5.3.4} \]
Note that for $j$ even, $E|\bar X - \mu|^j = E(\bar X - \mu)^j$.

Proof. We give the proof of (5.3.4) for all $j$ and of (5.3.3) for $j$ even. The more difficult argument needed for (5.3.3) and $j$ odd is given in Problem 5.3.2. Let $\mu = E(X_1) = 0$; then

(a) $\displaystyle E(\bar X^j) = n^{-j} E\Big(\sum_{i=1}^n X_i\Big)^j = n^{-j}\sum_{i_1,\ldots,i_j} E(X_{i_1}\cdots X_{i_j})$.

But $E(X_{i_1}\cdots X_{i_j}) = 0$ unless each integer that appears among $\{i_1,\ldots,i_j\}$ appears at least twice. Moreover,

(b) $\displaystyle \sup_{i_1,\ldots,i_j} |E(X_{i_1}\cdots X_{i_j})| \le E|X_1|^j$

by Problem 5.3.5, so the number of nonzero terms in (a) is at most

(c) $\displaystyle \sum_{r=1}^{[j/2]}\binom{n}{r}\sum\Big\{\frac{j!}{i_1!\cdots i_r!} : i_1 + \cdots + i_r = j,\ i_k \ge 2 \text{ for all } k\Big\}$,

where $\binom{n}{r} = \dfrac{n!}{r!(n-r)!}$ and $[t]$ denotes the greatest integer $\le t$. The expression in (c) is, for $j \le n/2$, bounded by

(d) $\displaystyle \frac{c_j}{[j/2]!}\, n(n-1)\cdots(n - [j/2] + 1)$

where
\[ c_j = \max\Big\{\sum \frac{j!}{i_1!\cdots i_r!} : i_1 + \cdots + i_r = j,\ i_k \ge 2,\ 1 \le k \le r\Big\}, \]
the maximum being taken over $1 \le r \le [j/2]$.
But

(e) $n^{-j}\, n(n-1)\cdots(n - [j/2] + 1) \le n^{[j/2] - j}$

and (c), (d), and (e) applied to (a) imply (5.3.4) for $j$ odd and (5.3.3) for $j$ even, if $\mu = 0$. In general, by considering $X_i - \mu$ as our basic variables we obtain the lemma but with $E|X_1|^j$ replaced by $E|X_1 - \mu|^j$. By Problem 5.3.6, $E|X_1 - \mu|^j \le 2^j E|X_1|^j$ and the lemma follows. $\Box$

The two most important corollaries of Theorem 5.3.1, respectively, give approximations to the bias of $h(\bar X)$ as an estimate of $h(\mu)$ and to its variance and MSE.

Corollary 5.3.1. (a) If $E|X_1|^3 < \infty$ and $\|h^{(3)}\|_\infty < \infty$, then
\[ Eh(\bar X) = h(\mu) + \frac{h^{(2)}(\mu)\sigma^2}{2n} + O(n^{-3/2}). \tag{5.3.5} \]
(b) If $E(X_1^4) < \infty$ and $\|h^{(4)}\|_\infty < \infty$, then $O(n^{-3/2})$ in (5.3.5) can be replaced by $O(n^{-2})$.

Proof. For (5.3.5) apply Theorem 5.3.1 with $m = 3$. Because $E(\bar X - \mu)^2 = \sigma^2/n$, (5.3.5) follows. If the conditions of (b) hold, apply Theorem 5.3.1 with $m = 4$. Then $R_m = O(n^{-2})$ and also $E(\bar X - \mu)^3 = O(n^{-2})$ by (5.3.4). $\Box$
Corollary 5.3.2. (a) If $\|h^{(j)}\|_\infty < \infty$, $1 \le j \le 3$, and $E|X_1|^3 < \infty$, then
\[ \mathrm{Var}\,h(\bar X) = \frac{\sigma^2}{n}\,[h^{(1)}(\mu)]^2 + O(n^{-3/2}). \tag{5.3.6} \]
(b) If $\|h^{(j)}\|_\infty < \infty$, $1 \le j \le 3$, and $EX_1^4 < \infty$, then $O(n^{-3/2})$ in (5.3.6) can be replaced by $O(n^{-2})$.

Proof. (a) Write
\[ \begin{aligned} Eh^2(\bar X) &= h^2(\mu) + 2h(\mu)h^{(1)}(\mu)E(\bar X - \mu) + \{h^{(2)}(\mu)h(\mu) + [h^{(1)}(\mu)]^2\}E(\bar X - \mu)^2 + \tfrac{1}{6}E[h^2]^{(3)}(X^*)(\bar X - \mu)^3 \\ &= h^2(\mu) + \{h^{(2)}(\mu)h(\mu) + [h^{(1)}(\mu)]^2\}\frac{\sigma^2}{n} + O(n^{-3/2}). \end{aligned} \tag{a} \]
(b) Next, using Corollary 5.3.1,
\[ [Eh(\bar X)]^2 = h^2(\mu) + h(\mu)h^{(2)}(\mu)\frac{\sigma^2}{n} + O(n^{-3/2}). \tag{b} \]
Subtracting (b) from (a) we get (5.3.6). To get part (b) we need to expand $Eh^2(\bar X)$ to four terms and similarly apply the appropriate form of (5.3.5). $\Box$

Clearly the statements of the corollaries can also be turned into expansions as in Theorem 5.3.1 with bounds on the remainders. Note an important qualitative feature revealed by these approximations. If $h(\bar X)$ is viewed, as we normally would, as the plug-in estimate of the parameter $h(\mu)$, then, for large $n$, the bias of $h(\bar X)$, defined by $Eh(\bar X) - h(\mu)$, is $O(n^{-1})$, which is negligible compared to the standard deviation of $h(\bar X)$, which is $O(n^{-1/2})$ unless $h^{(1)}(\mu) = 0$. A qualitatively simple explanation of this important phenomenon will be given in Theorem 5.3.3.
Example 5.3.1. If $X_1, \ldots, X_n$ are i.i.d. $\mathcal{E}(\lambda)$, the MLE of $\lambda$ is $\bar X^{-1}$. If the $X_i$ represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours, then we may be interested in the warranty failure probability
\[ \theta = P_\lambda[X_1 \le 2] = 1 - \exp(-2\lambda). \tag{5.3.7} \]
If $h(t) = 1 - \exp(-2/t)$, then $h(\bar X)$ is the MLE of $1 - \exp(-2\lambda) = h(\mu)$, where $\mu = E_\lambda X_1 = 1/\lambda$. We can use the two corollaries to compute asymptotic approximations to the mean and variance of $h(\bar X)$. Thus, by Corollary 5.3.1,
\[ \mathrm{Bias}_\lambda(h(\bar X)) = E_\lambda(h(\bar X) - h(\mu)) = \frac{h^{(2)}(\mu)\sigma^2}{2n} + O(n^{-2}) = \frac{2\lambda(1-\lambda)e^{-2\lambda}}{n} + O(n^{-2}) \tag{5.3.8} \]
because $h^{(2)}(t) = 4(t^{-3} - t^{-4})\exp(-2/t)$ and $\sigma^2 = 1/\lambda^2$, and, by Corollary 5.3.2 (Problem 5.3.1),
\[ \mathrm{Var}_\lambda(h(\bar X)) = \frac{4\lambda^2 e^{-4\lambda}}{n} + O(n^{-3/2}). \tag{5.3.9} \Box \]
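The following is a minimal illustrative sketch, not part of the original text: it checks the delta-method approximations (5.3.8) and (5.3.9) for this example by simulation. The rate $\lambda$, the sample size, and the number of replications are arbitrary choices.

```python
# Sketch (not from the text): Monte Carlo check of the delta-method bias and
# variance approximations for h(Xbar) = 1 - exp(-2/Xbar) with Exponential(lam) data.
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 1.5, 50, 200_000

xbar = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)
h = 1 - np.exp(-2 / xbar)                     # plug-in estimate of 1 - exp(-2*lam)

theta = 1 - np.exp(-2 * lam)                  # true warranty failure probability
bias_mc, var_mc = h.mean() - theta, h.var()

# Delta-method approximations: Bias ~ h''(mu) sigma^2 / (2n), Var ~ [h'(mu)]^2 sigma^2 / n
mu, sigma2 = 1 / lam, 1 / lam**2
h1 = -2 * mu**-2 * np.exp(-2 / mu)            # h'(mu)
h2 = 4 * (mu**-3 - mu**-4) * np.exp(-2 / mu)  # h''(mu)
print(f"bias: MC {bias_mc:.5f}  delta {h2 * sigma2 / (2 * n):.5f}")
print(f"var : MC {var_mc:.6f}  delta {h1**2 * sigma2 / n:.6f}")
```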
Further expansion can be done to increase precision of the approximation to $\mathrm{Var}\,h(\bar X)$ for large $n$. Thus, by expanding $Eh^2(\bar X)$ and $Eh(\bar X)$ to six terms we obtain the approximation
\[ \mathrm{Var}(h(\bar X)) = \frac{\sigma^2}{n}[h^{(1)}(\mu)]^2 + \frac{1}{n^2}\Big\{h^{(1)}(\mu)h^{(2)}(\mu)\mu_3 + \tfrac{1}{2}[h^{(2)}(\mu)]^2\sigma^4 + h^{(1)}(\mu)h^{(3)}(\mu)\sigma^4\Big\} + R_n' \tag{5.3.10} \]
with $R_n'$ tending to zero at the rate $1/n^3$. Here $\mu_k$ denotes the $k$th central moment of $X_i$ and we have used the facts that (see Problem 5.3.4)
\[ E(\bar X - \mu)^3 = \frac{\mu_3}{n^2}, \qquad E(\bar X - \mu)^4 = \frac{3\sigma^4}{n^2} + O(n^{-3}). \tag{5.3.11} \]
Example 5.3.2. Bias and Variance of the MLE of the Binomial Variance. We will compare $E(h(\bar X))$ and $\mathrm{Var}\,h(\bar X)$ with their approximations, when $h(t) = t(1-t)$ and $X_i \sim \mathcal{B}(1,p)$, and will illustrate how accurate (5.3.10) is in a situation in which the approximation can be checked. First calculate
\[ Eh(\bar X) = E(\bar X) - E(\bar X^2) = p - [\mathrm{Var}(\bar X) + (E\bar X)^2] = p(1-p) - \frac{1}{n}p(1-p) = \frac{n-1}{n}\,p(1-p). \]
Because $h^{(1)}(t) = 1 - 2t$ and $h^{(2)}(t) = -2$, (5.3.5) yields
\[ E(h(\bar X)) = p(1-p) - \frac{1}{n}p(1-p) \]
and in this case (5.3.5) is exact, as it should be. Next, an exact computation using the central moments of the binomial gives
\[ \mathrm{Var}\,h(\bar X) = \frac{p(1-p)}{n}\Big\{(1-2p)^2 + \frac{1}{n}\big[2p(1-p) - 2(1-2p)^2\big] + \frac{1}{n^2}\big[1 - 6p(1-p)\big]\Big\}. \]
Because $\mu_3 = p(1-p)(1-2p)$ and $h^{(3)} \equiv 0$, (5.3.10) yields
\[ \begin{aligned} \mathrm{Var}\,h(\bar X) &= \frac{1}{n}(1-2p)^2 p(1-p) + \frac{1}{n^2}\big\{-2(1-2p)^2 p(1-p) + 2p^2(1-p)^2\big\} + R_n' \\ &= \frac{p(1-p)}{n}\Big\{(1-2p)^2 + \frac{1}{n}\big[2p(1-p) - 2(1-2p)^2\big]\Big\} + R_n'. \end{aligned} \]
Thus, the error of approximation is
\[ \frac{p(1-p)}{n^3}\big[1 - 6p(1-p)\big] = O(n^{-3}). \ \Box \]
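The following is a minimal illustrative sketch, not part of the original text: it compares the exact mean and variance just derived with the approximations (5.3.5) and (5.3.10) for a few arbitrary choices of $p$ and $n$.

```python
# Sketch (not from the text): exact vs. delta-method moments for the MLE of the
# binomial variance, h(Xbar) = Xbar(1 - Xbar) with X_i ~ Bernoulli(p).
def exact_and_approx(p, n):
    q = 1 - p
    mean_exact = (n - 1) / n * p * q
    var_exact = (p * q / n) * ((1 - 2 * p) ** 2
                               + (2 * p * q - 2 * (1 - 2 * p) ** 2) / n
                               + (1 - 6 * p * q) / n ** 2)
    # approximations (5.3.5) and (5.3.10): h'(t) = 1 - 2t, h''(t) = -2, mu3 = pq(1 - 2p)
    mean_approx = p * q - p * q / n
    var_approx = (p * q / n) * ((1 - 2 * p) ** 2
                                + (2 * p * q - 2 * (1 - 2 * p) ** 2) / n)
    return mean_exact, mean_approx, var_exact, var_approx

for n in (10, 50, 250):
    me, ma, ve, va = exact_and_approx(p=0.3, n=n)
    print(f"n={n:4d}  E exact {me:.6f} approx {ma:.6f}   Var exact {ve:.8f} approx {va:.8f}")
```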
The generalization of this approach to approximation of moments for functions of vector means is formally the same but computationally not much used for $d$ larger than 2.

Theorem 5.3.2. Suppose $g : \mathcal{X} \to R^d$ and let $Y_i = g(X_i) = (g_1(X_i), \ldots, g_d(X_i))^T$. Let $h : R^d \to R$, assume that $h$ has continuous partial derivatives of order up to $m$, and that

(i) $\|D^m(h)\|_\infty < \infty$, where $D^m h(x)$ is the array (tensor)
\[ \Big(\frac{\partial^m h}{\partial x_{i_1}\cdots\partial x_{i_m}}(x) : 1 \le i_1, \ldots, i_m \le d\Big); \]

(ii) $E|Y_{1j}|^m < \infty$, $1 \le j \le d$, where $Y_{ij} = g_j(X_i)$.

Then, if $\bar Y_k = \frac{1}{n}\sum_{i=1}^n Y_{ik}$, $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$, and $\mu = EY_1$,
\[ Eh(\bar Y) = h(\mu) + \sum_{j=1}^{m-1}\frac{1}{j!}\sum_{i_1,\ldots,i_j=1}^d \frac{\partial^j h}{\partial y_{i_1}\cdots\partial y_{i_j}}(\mu)\, E\big[(\bar Y_{i_1} - \mu_{i_1})\cdots(\bar Y_{i_j} - \mu_{i_j})\big] + R_m. \tag{5.3.12} \]

This is a consequence of Taylor's expansion in $d$ variables, B.8.11, and the appropriate generalization of Lemma 5.3.1. The proof is outlined in Problem 5.3.4. The most interesting application, as for the case $d = 1$, is to $m = 3$. We get, for $d = 2$ and $E|Y_1|^3 < \infty$,
\[ Eh(\bar Y) = h(\mu) + \frac{1}{2n}\Big\{\frac{\partial^2 h}{\partial y_1^2}(\mu)\,\mathrm{Var}(Y_{11}) + 2\frac{\partial^2 h}{\partial y_1 \partial y_2}(\mu)\,\mathrm{Cov}(Y_{11}, Y_{12}) + \frac{\partial^2 h}{\partial y_2^2}(\mu)\,\mathrm{Var}(Y_{12})\Big\} + O(n^{-3/2}). \tag{5.3.13} \]
Moreover, by (5.3.3), if $E|Y_1|^4 < \infty$, then $O(n^{-3/2})$ in (5.3.13) can be replaced by $O(n^{-2})$. Similarly, under appropriate conditions (Problem 5.3.12),
\[ \mathrm{Var}\,h(\bar Y) = \frac{1}{n}\Big[\Big(\frac{\partial h}{\partial y_1}(\mu)\Big)^2\mathrm{Var}(Y_{11}) + 2\frac{\partial h}{\partial y_1}(\mu)\frac{\partial h}{\partial y_2}(\mu)\,\mathrm{Cov}(Y_{11}, Y_{12}) + \Big(\frac{\partial h}{\partial y_2}(\mu)\Big)^2\mathrm{Var}(Y_{12})\Big] + O(n^{-2}). \tag{5.3.14} \]
Approximations (5.3.5), (5.3.6), (5.3.13), and (5.3.14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d - h(J.L))). The results in the next subsection go much further and "explain" the form of the approximations we already have. 5.3.2
The Delta Method for In law Approximations
As usual we begin with d = 1.
Theorem 5.3.3. Suppose that X = R. h : R � R, EXt < oo and h is differentiable at J.L = E(X 1 ). Then
.C( Vn(h(X) - h(J.L)))
�
N(O,
<72 (h))
(5.3,15)
where and
<72 = Var(X,).
The result follows from the more generaJly useful lemma.
Lemma 5.3.2. Suppose {Un} are real random variables and tlultfor a sequence {an } of constants with an --+ oo as oo,
n
--+
312
Chapter 5
Asymptotic Approximations
(i) an(Un - u ) .!:.. V for some constant u.
• '
' '
(ii) g : R ---> R is d(fferentiable at u with derivat;ve g< 1 l ( u) . Then (5.3. 16) '
'
Proof.
By definition of the derivative, for every £
> 0 there exists a 8 > 0 such that
lv - ul < li '* lg(v) - g(u) - g<'>(u)(v - u)l <
(a) Note that (i) '*
an(Un - u)
(b)
=
Op(l)
'
'
• •
'
' '
Un - U = Op(a;;: 1 ) = Op (1).
(c) '
..
Using
'
'
' ' •
l
(c), for every li > 0
'
'
P[IUn - ul < li] � l
(d) and, hence, from (a}, for every
£
>
0,
'
i
P ]lg(Un ) - g(u) - g < 'l(u)( Un - u)l :0:
(e) But (e) implies
(f)
'J
from (b). Therefore, I
(g)
i'
But, by hypothesis,
I I
''
\ '
' '
I
'
an(Un - u) !:. V and the result follows.
The theorem follows from the central limit theorem letting
v-
N(o, a2 ).
'
Un = X, an = n112, u
Note that (5.3.15) "explains" Lemma 5.3.1. Formally we expect that if $V_n \xrightarrow{\mathcal{L}} V$, then $EV_n^j \to EV^j$ (although this need not be true; see Problems 5.3.32 and B.7.8). Consider $V_n = \sqrt{n}(\bar X - \mu) \xrightarrow{\mathcal{L}} V \sim N(0, \sigma^2)$. Thus, we expect
\[ n^{j/2}E(\bar X - \mu)^j = EV_n^j \to \sigma^j EZ^j \tag{5.3.17} \]
where $Z \sim N(0,1)$. But $EZ^j > 0$ if $j$ is even, whereas $EZ^j = 0$ if $j$ is odd. Then (5.3.17) yields $E(\bar X - \mu)^j = O(n^{-j/2})$ for $j$ even and $E(\bar X - \mu)^j = o(n^{-j/2})$ for $j$ odd.
Example 5.3.3. "t" Statistics.

(a) The One-Sample Case. Let $X_1, \ldots, X_n$ be i.i.d. $F \in \mathcal{F}$ where $E_F(X_1) = \mu$, $\mathrm{Var}_F(X_1) = \sigma^2 < \infty$. A statistic for testing the hypothesis $H : \mu = 0$ versus $K : \mu > 0$ is
\[ T_n = \frac{\sqrt{n}\,\bar X}{s}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2. \]
If $\mathcal{F} = \{$Gaussian distributions$\}$, we can obtain the critical value $t_{n-1}(1-\alpha)$ for $T_n$ from the $\mathcal{T}_{n-1}$ distribution. In general we claim that if $F \in \mathcal{F}$ and $H$ is true, then
\[ T_n \xrightarrow{\mathcal{L}} N(0, 1). \tag{5.3.18} \]
In particular this implies not only that $t_{n-1}(1-\alpha) \to z_{1-\alpha}$ but that the $t_{n-1}(1-\alpha)$ critical value (or $z_{1-\alpha}$) is approximately correct if $H$ is true and $F$ is not Gaussian. For the proof note that
\[ U_n = \sqrt{n}\,\frac{(\bar X - \mu)}{\sigma} \xrightarrow{\mathcal{L}} N(0,1) \]
by the central limit theorem, and
\[ \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 \xrightarrow{P} \sigma^2 \]
by Theorem 5.2.2 and Slutsky's theorem. Now Slutsky's theorem yields (5.3.18) because $T_n = U_n/(s_n/\sigma) = g(U_n, s_n/\sigma)$, where $g(u, v) = u/v$.

(b) The Two-Sample Case. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be two independent samples with $\mu_1 = E(X_1)$, $\sigma_1^2 = \mathrm{Var}(X_1)$, $\mu_2 = E(Y_1)$ and $\sigma_2^2 = \mathrm{Var}(Y_1)$. Consider testing $H : \mu_1 = \mu_2$ versus $K : \mu_2 > \mu_1$. In Example 4.9.3 we saw that the two-sample $t$ statistic
\[ S_n = \sqrt{\frac{n_1 n_2}{n}}\,\frac{(\bar Y - \bar X)}{s}, \qquad n = n_1 + n_2, \]
has a $\mathcal{T}_{n-2}$ distribution under $H$ when the X's and Y's are normal with $\sigma_1^2 = \sigma_2^2$. Using the central limit theorem, Slutsky's theorem, and the foregoing arguments, we find (Problem 5.3.28) that if $n_1/n \to \lambda$, $0 < \lambda < 1$, then
\[ S_n \xrightarrow{\mathcal{L}} N\Big(0, \frac{(1-\lambda)\sigma_1^2 + \lambda\sigma_2^2}{\lambda\sigma_1^2 + (1-\lambda)\sigma_2^2}\Big). \]
It follows that if $n_1 = n_2$ or $\sigma_1^2 = \sigma_2^2$, then the critical value $t_{n-2}(1-\alpha)$ for $S_n$ is approximately correct if $H$ is true and the X's and Y's are not normal.
Monte Carlo Simulation

As mentioned in Section 5.1, approximations based on asymptotic results should be checked by Monte Carlo simulations. We illustrate such simulations for the preceding $t$ tests by generating data from the $\chi^2_d$ distribution $M$ times independently, each time computing the value of the $t$ statistic and then giving the proportion of times out of $M$ that the $t$ statistic exceeds the critical value from the $t$ table. Here we use the $\chi^2_d$ distribution because for small to moderate $d$ it is quite different from the normal distribution. Other distributions should also be tried. Figure 5.3.1 shows that for the one-sample $t$ test, when $\alpha = 0.05$, the asymptotic result gives a good approximation when $n \ge 10^{1.5} \approx 32$ and the true distribution $F$ is $\chi^2_d$ with $d \ge 10$. The $\chi^2_2$ distribution is extremely skew, and in this case the $t_{n-1}(0.95)$ approximation is only good for $n \ge 10^{2.5} \approx 316$.

[Figure 5.3.1. Each plotted point represents the results of 10,000 one-sample $t$ tests using $\chi^2_d$ data, where $d$ is either 2, 10, 20, or 50, as indicated in the plot. The simulations are repeated for different sample sizes and the observed significance levels are plotted against $\log_{10}$ sample size.]
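The following is a minimal illustrative sketch, not part of the original text, in the spirit of Figure 5.3.1: it estimates the true level of the nominal 5% one-sample $t$ test when the data are centered $\chi^2_d$ rather than Gaussian. The values of $M$, $d$, and the sample sizes are arbitrary choices (the book's figures use $M = 10{,}000$).

```python
# Sketch (not from the text): Monte Carlo estimate of the observed level of the
# one-sample t test under chi-square_d data (centered so that H: mu = 0 holds).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M, alpha = 10_000, 0.05

for d in (2, 10, 50):
    for n in (10, 32, 100, 316):
        x = rng.chisquare(d, size=(M, n)) - d              # E(X_i) = 0 under H
        t = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
        level = np.mean(t > stats.t.ppf(1 - alpha, df=n - 1))
        print(f"d={d:3d} n={n:4d}  observed level {level:.3f}")
```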
I
''
I '
'.
'
.
l
''
'
II
For the two-sample $t$ tests, Figure 5.3.2 shows that when $\sigma_1^2 = \sigma_2^2$ and $n_1 = n_2$, the $t_{n-2}(1-\alpha)$ critical value is a very good approximation even for small $n$ and for $X, Y \sim \chi^2_d$. This is because, in this case, $\bar Y - \bar X = \frac{1}{n_1}\sum_{i=1}^{n_1}(Y_i - X_i)$, and the $Y_i - X_i$ have a symmetric distribution. Other Monte Carlo runs (not shown) with $\sigma_1^2 \ne \sigma_2^2$ show that as long as $n_1 = n_2$, the $t_{n-2}(0.95)$ approximation is good for $n_1 \ge 100$, even when the X's and Y's have different $\chi^2_d$ distributions, scaled to have the same means, and $\sigma_2^2 = 12\sigma_1^2$. Moreover, the $t_{n-2}(1-\alpha)$ approximation is good when $n_1 \ne n_2$ and $\sigma_1^2 = \sigma_2^2$. However, as we see from the limiting law of $S_n$ and Figure 5.3.3, when both $n_1 \ne n_2$ and $\sigma_1^2 \ne \sigma_2^2$, the two-sample $t$ tests with critical region $1\{S_n \ge t_{n-2}(1-\alpha)\}$ do not have approximate level $\alpha$. In this case Monte Carlo studies have shown that the test in Section 4.9.4 based on Welch's approximation works well.

[Figure 5.3.2. Each plotted point represents the results of 10,000 two-sample $t$ tests. For each simulation the two samples are the same size (the size indicated on the $x$-axis), $\sigma_1^2 = \sigma_2^2$, and the data are $\chi^2_d$ where $d$ is one of 2, 10, or 50.]
[Figure 5.3.3. Each plotted point represents the results of 10,000 two-sample $t$ tests. For each simulation the two samples differ in size: the second sample is twice the size of the first. The $x$-axis denotes the size of the smaller of the two samples. The data in the first sample are $N(0,1)$ and in the second they are $N(0,\sigma^2)$ where $\sigma^2$ takes on the values 1, 3, 6, and 9, as indicated in the plot.]

Next, in the one-sample situation, let $h(\bar X)$ be an estimate of $h(\mu)$ where $h$ is continuously differentiable at $\mu$ and $h^{(1)}(\mu) \ne 0$. By Theorem 5.3.3, $\sqrt{n}[h(\bar X) - h(\mu)] \xrightarrow{\mathcal{L}} N(0, \sigma^2[h^{(1)}(\mu)]^2)$, and a natural statistic for testing $H : h(\mu) = h_0$ is
\[ T_n = \frac{\sqrt{n}\,[h(\bar X) - h_0]}{s\,|h^{(1)}(\bar X)|}. \]
Combining Theorem 5.3.3 and Slutsky's theorem, we see that here, too, if $H$ is true then $T_n \xrightarrow{\mathcal{L}} N(0,1)$
so that $z_{1-\alpha}$ is the asymptotic critical value.

Variance Stabilizing Transformations

Example 5.3.4. In Appendices A and B we encounter several important families of distributions, such as the binomial, Poisson, gamma, and beta, which are indexed by one or more parameters. If we take a sample from a member of one of these families, then the sample mean $\bar X$ will be approximately normally distributed with variance $\sigma^2/n$ depending on the parameters indexing the family considered. We have seen that smooth transformations $h(\bar X)$ are also approximately normally distributed. It turns out to be useful to know transformations $h$, called variance stabilizing, such that $\mathrm{Var}\,h(\bar X)$ is approximately independent of the parameters indexing the family we are considering. From (5.3.6) and (5.3.13) we see that a first approximation to the variance of $h(\bar X)$ is $[h^{(1)}(\mu)]^2\sigma^2/n$. Thus, finding a variance stabilizing transformation is equivalent to finding a function $h$ such that
\[ [h^{(1)}(\mu)]^2\sigma^2 \equiv c > 0 \tag{5.3.19} \]
for all $\mu$ and $\sigma^2$ appropriate to our family. Such a function can usually be found if $\sigma$ depends only on $\mu$, which varies freely. In this case (5.3.19) is an ordinary differential equation.

As an example, suppose that $X_1, \ldots, X_n$ is a sample from a $\mathcal{P}(\lambda)$ family. In this case $\sigma^2 = \lambda$ and $\mathrm{Var}(\bar X) = \lambda/n$. To have $\mathrm{Var}\,h(\bar X)$ approximately constant in $\lambda$, $h$ must satisfy the differential equation $[h^{(1)}(\lambda)]^2\lambda = c > 0$ for some arbitrary $c > 0$. If we require that $h$ is increasing, this leads to $h^{(1)}(\lambda) = \sqrt{c/\lambda}$, $\lambda > 0$, which has as its solution $h(\lambda) = 2\sqrt{c\lambda} + d$, where $d$ is arbitrary. Thus, $h(t) = \sqrt{t}$ is a variance stabilizing transformation of $\bar X$ for the Poisson family of distributions. Substituting in (5.3.6) we find $\mathrm{Var}(\sqrt{\bar X}) \approx 1/4n$ and $\sqrt{n}\big((\bar X)^{1/2} - \lambda^{1/2}\big)$ has approximately a $N(0, 1/4)$ distribution. $\Box$

One application of variance stabilizing transformations, by their definition, is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. Thus, in the preceding $\mathcal{P}(\lambda)$ case,
\[ \sqrt{\bar X} \pm \frac{z(1 - \tfrac{1}{2}\alpha)}{2\sqrt{n}} \]
is an approximate $1 - \alpha$ confidence interval for $\sqrt{\lambda}$.

A second application occurs for models where the families of distributions for which variance stabilizing transformations exist are used as building blocks of larger models. Major examples are the generalized linear models of Section 6.5. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. Some further examples of variance stabilizing transformations are given in the problems.

The notion of such transformations can be extended to the following situation. Suppose $\hat\gamma_n = \hat\gamma_n(X_1, \ldots, X_n)$ is an estimate of a real parameter $\gamma$ indexing a family of distributions from which $X_1, \ldots, X_n$ are an i.i.d. sample, and suppose that $\sqrt{n}(\hat\gamma_n - \gamma) \xrightarrow{\mathcal{L}} N(0, \sigma^2(\gamma))$. Then again, a variance stabilizing transformation $h$ is such that
\[ \sqrt{n}\big(h(\hat\gamma_n) - h(\gamma)\big) \xrightarrow{\mathcal{L}} N(0, c) \]
for all $\gamma$. See Example 5.3.6. Also closely related, but different, are so-called normalizing transformations; see Problems 5.3.15 and 5.3.16.
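The following is a minimal illustrative sketch, not part of the original text: it checks numerically that the square-root transformation stabilizes the variance of the Poisson sample mean and that the fixed-width interval for $\sqrt{\lambda}$ has approximately the nominal coverage. The values of $\lambda$, $n$, and the number of replications are arbitrary choices.

```python
# Sketch (not from the text): sqrt(Xbar) as a variance stabilizer for Poisson data,
# and coverage of the fixed-width interval sqrt(Xbar) +/- z/(2 sqrt(n)) for sqrt(lambda).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, alpha = 40, 100_000, 0.05
z = stats.norm.ppf(1 - alpha / 2)

for lam in (0.5, 2.0, 8.0):
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    print(f"lam={lam:4.1f}  Var(Xbar)={xbar.var():.4f}  "
          f"Var(sqrt(Xbar))={np.sqrt(xbar).var():.4f}  (1/4n = {1/(4*n):.4f})")
    cover = np.mean(np.abs(np.sqrt(xbar) - np.sqrt(lam)) <= z / (2 * np.sqrt(n)))
    print(f"           coverage of the interval for sqrt(lam): {cover:.3f}")
```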
Edgeworth Approximations

The normal approximation to the distribution of $\bar X$ utilizes only the first two moments of $\bar X$. Under general conditions (Bhattacharya and Rao, 1976, p. 538) one can improve on the normal approximation by utilizing the third and fourth moments. Let $F_n$ denote the distribution of $T_n = \sqrt{n}(\bar X - \mu)/\sigma$ and let $\gamma_{1n}$ and $\gamma_{2n}$ denote the coefficients of skewness and (excess) kurtosis of $T_n$. Then under some conditions,$^{(1)}$
\[ F_n(x) = \Phi(x) - \varphi(x)\Big[\frac{\gamma_{1n}}{6}H_2(x) + \frac{\gamma_{2n}}{24}H_3(x) + \frac{\gamma_{1n}^2}{72}H_5(x)\Big] + r_n \tag{5.3.20} \]
where $r_n$ tends to zero at a rate faster than $1/n$ and $H_2$, $H_3$, and $H_5$ are Hermite polynomials defined by
\[ H_2(x) = x^2 - 1, \quad H_3(x) = x^3 - 3x, \quad H_5(x) = x^5 - 10x^3 + 15x. \tag{5.3.21} \]
The expansion (5.3.20) is called the Edgeworth expansion for $F_n$.

Example 5.3.5. Edgeworth Approximations to the $\chi^2$ Distribution. Suppose $V \sim \chi^2_n$. According to Theorem B.3.1, $V$ has the same distribution as $\sum_{i=1}^n X_i^2$, where the $X_i$ are independent and $X_i \sim N(0,1)$, $i = 1, \ldots, n$. It follows from the central limit theorem that $T_n = \big(\sum_{i=1}^n X_i^2 - n\big)/\sqrt{2n} = (V - n)/\sqrt{2n}$ has approximately a $N(0,1)$ distribution. To improve on this approximation, we need only compute $\gamma_{1n}$ and $\gamma_{2n}$. We can use Problem B.2.4 to compute
\[ \gamma_{1n} = \frac{E(V - n)^3}{(2n)^{3/2}} = \frac{2\sqrt{2}}{\sqrt{n}}, \qquad \gamma_{2n} = \frac{E(V - n)^4}{(2n)^2} - 3 = \frac{12}{n}. \]
Therefore,
\[ F_n(x) \approx \Phi(x) - \varphi(x)\Big[\frac{\sqrt{2}}{3\sqrt{n}}(x^2 - 1) + \frac{1}{2n}(x^3 - 3x) + \frac{1}{9n}(x^5 - 10x^3 + 15x)\Big]. \]
Table 5.3.1 gives this approximation together with the exact distribution and the normal approximation when $n = 10$. $\Box$

[Table 5.3.1. Edgeworth (EA) and normal (NA) approximations to the $\chi^2_{10}$ distribution, $P(T_n \le x)$, where $T_n$ is a standardized $\chi^2_{10}$ random variable. The numerical entries are not reproduced here.]
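The following is a minimal illustrative sketch, not part of the original text: it evaluates the Edgeworth approximation of Example 5.3.5 for the standardized $\chi^2_n$ distribution and compares it with the exact cdf and the plain normal approximation. The degrees of freedom and the $x$ grid are arbitrary choices.

```python
# Sketch (not from the text): two-term Edgeworth approximation (5.3.20) for the
# standardized chi-square_n distribution, compared with exact and normal values.
import numpy as np
from scipy import stats

def edgeworth_chi2(x, n):
    """Approximate P(T_n <= x) for T_n = (V - n)/sqrt(2n), V ~ chi2_n."""
    g1 = 2 * np.sqrt(2.0 / n)      # skewness of T_n
    g2 = 12.0 / n                  # excess kurtosis of T_n
    H2, H3, H5 = x**2 - 1, x**3 - 3*x, x**5 - 10*x**3 + 15*x
    return stats.norm.cdf(x) - stats.norm.pdf(x) * (g1/6*H2 + g2/24*H3 + g1**2/72*H5)

n = 10
for x in (-1.5, -0.5, 0.5, 1.5, 2.5):
    exact = stats.chi2.cdf(n + x * np.sqrt(2 * n), df=n)
    print(f"x={x:5.2f}  exact {exact:.4f}  Edgeworth {edgeworth_chi2(x, n):.4f}  "
          f"normal {stats.norm.cdf(x):.4f}")
```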
The Multivariate Case
Lemma 5.3.2 extends to the d-variate case.
Lemma 5.3.3. Suppose $\{U_n\}$ are $d$-dimensional random vectors and that for some sequence of constants $\{a_n\}$ with $a_n \to \infty$ as $n \to \infty$,

(i) $a_n(U_n - u) \xrightarrow{\mathcal{L}} V_{d\times 1}$ for some $d \times 1$ vector of constants $u$;

(ii) $g : R^d \to R^p$ has a differential $g^{(1)}_{p\times d}(u)$ at $u$.

Then
\[ a_n\big(g(U_n) - g(u)\big) \xrightarrow{\mathcal{L}} g^{(1)}(u)\,V. \]

Proof. The proof follows from the arguments of the proof of Lemma 5.3.2. $\Box$

Example 5.3.6. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. as $(X, Y)$ where $0 < EX^4 < \infty$ and $0 < EY^4 < \infty$. Let $\rho^2 = \mathrm{Cov}^2(X,Y)/\sigma_1^2\sigma_2^2$ where $\sigma_1^2 = \mathrm{Var}\,X$, $\sigma_2^2 = \mathrm{Var}\,Y$, and let $r^2 = \hat C^2/\hat\sigma_1^2\hat\sigma_2^2$ where $\hat C$, $\hat\sigma_1^2$, $\hat\sigma_2^2$ are the sample covariance and variances. Recall from Section 4.9.5 that in the bivariate normal case the sample correlation coefficient $r$ is the MLE of the population correlation coefficient $\rho$ and that the likelihood ratio test of $H : \rho = 0$ is based on $|r|$. We can write $r^2 = g(\hat C, \hat\sigma_1^2, \hat\sigma_2^2)$ where $g(u_1, u_2, u_3) = u_1^2/u_2 u_3 : R^3 \to R$. Because of the location and scale invariance of $\rho$ and $r$, we can use the transformations $\tilde X_i = (X_i - \mu_1)/\sigma_1$ and $\tilde Y_j = (Y_j - \mu_2)/\sigma_2$ to conclude that without loss of generality we may assume $\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 1$, $\rho = E(XY)$. Using the central limit and Slutsky's theorems, we can show (Problem 5.3.9) that $\sqrt{n}(\hat C - \rho)$, $\sqrt{n}(\hat\sigma_1^2 - 1)$ and $\sqrt{n}(\hat\sigma_2^2 - 1)$ jointly have the same asymptotic distribution as $\sqrt{n}(U_n - u)$ where
\[ U_n = \Big(n^{-1}\sum X_iY_i,\ n^{-1}\sum X_i^2,\ n^{-1}\sum Y_i^2\Big) \]
and $u = (\rho, 1, 1)$. Let $\tau^2_{k,j} = \mathrm{Var}(X^kY^j)$ and $\lambda_{k,j,m,l} = \mathrm{Cov}(X^kY^j, X^mY^l)$; then by the central limit theorem
\[ \sqrt{n}(U_n - u) \xrightarrow{\mathcal{L}} N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} \tau^2_{1,1} & \lambda_{1,1,2,0} & \lambda_{1,1,0,2} \\ \lambda_{1,1,2,0} & \tau^2_{2,0} & \lambda_{2,0,0,2} \\ \lambda_{1,1,0,2} & \lambda_{2,0,0,2} & \tau^2_{0,2} \end{pmatrix}. \tag{5.3.22} \]
Next we compute
\[ g^{(1)}(u) = \big(2u_1/u_2u_3,\ -u_1^2/u_2^2u_3,\ -u_1^2/u_2u_3^2\big) = (2\rho, -\rho^2, -\rho^2). \]
It follows from Lemma 5.3.3 and (B.5.6) that $\sqrt{n}(r^2 - \rho^2)$ is asymptotically normal, $N(0, \sigma_0^2)$, with
\[ \sigma_0^2 = 4\rho^2\tau^2_{1,1} + \rho^4\tau^2_{2,0} + \rho^4\tau^2_{0,2} + 2\big\{-2\rho^3\lambda_{1,1,2,0} - 2\rho^3\lambda_{1,1,0,2} + \rho^4\lambda_{2,0,0,2}\big\}. \]
When $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, then $\sigma_0^2 = 4\rho^2(1-\rho^2)^2$, and (Problem 5.3.9)
\[ \sqrt{n}(r - \rho) \xrightarrow{\mathcal{L}} N(0, (1-\rho^2)^2). \]
Referring to (5.3.19), we see (Problem 5.3.10) that in the bivariate normal case a variance stabilizing transformation $h(r)$ with $\sqrt{n}[h(r) - h(\rho)] \xrightarrow{\mathcal{L}} N(0,1)$ is achieved by choosing
\[ h(\rho) = \frac{1}{2}\log\Big(\frac{1+\rho}{1-\rho}\Big). \]
The approximation based on this transformation, which is called Fisher's $z$, has been studied extensively and it has been shown (e.g., David, 1938) that
\[ \mathcal{L}\big(\sqrt{n-3}\,(h(r) - h(\rho))\big) \]
is closely approximated by the $N(0,1)$ distribution; that is, $P(r \le c) \approx \Phi\big(\sqrt{n-3}\,(h(c) - h(\rho))\big)$. The resulting approximate $1-\alpha$ confidence interval for $\rho$ has endpoints
\[ \rho = \tanh\big\{h(r) \pm z(1 - \tfrac{1}{2}\alpha)/\sqrt{n-3}\big\} \]
where tanh is the hyperbolic tangent. $\Box$
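The following is a minimal illustrative sketch, not part of the original text: it computes Fisher's $z$ confidence interval for a bivariate normal correlation. The data-generating $\rho$, the sample size, and the nominal level are arbitrary choices.

```python
# Sketch (not from the text): Fisher's z interval for rho, as in Example 5.3.6.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n, alpha = 0.6, 30, 0.05

cov = np.array([[1.0, rho], [rho, 1.0]])
x, y = rng.multivariate_normal([0, 0], cov, size=n).T
r = np.corrcoef(x, y)[0, 1]

z = np.arctanh(r)                                   # h(r) = (1/2) log((1+r)/(1-r))
half = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
lo, hi = np.tanh(z - half), np.tanh(z + half)
print(f"r = {r:.3f}, approximate {100*(1-alpha):.0f}% CI for rho: ({lo:.3f}, {hi:.3f})")
```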
.
Here is an extension of Theorem 5.3.3.

Theorem 5.3.4. Suppose $Y_1, \ldots, Y_n$ are independent identically distributed $d$-vectors with $E|Y_1|^2 < \infty$, $EY_1 = m$, $\mathrm{Var}\,Y_1 = \Sigma$, and $h : \mathcal{O} \to R^p$ where $\mathcal{O}$ is an open subset of $R^d$, $h = (h_1, \ldots, h_p)$, and $h$ has a total differential $h^{(1)}(m) = \big\|\frac{\partial h_i}{\partial m_j}(m)\big\|_{p\times d}$. Then
\[ h(\bar Y) = h(m) + h^{(1)}(m)(\bar Y - m) + o_p(n^{-1/2}) \tag{5.3.23} \]
and
\[ \sqrt{n}\big(h(\bar Y) - h(m)\big) \xrightarrow{\mathcal{L}} N\big(0, h^{(1)}(m)\,\Sigma\,[h^{(1)}(m)]^T\big). \tag{5.3.24} \]

Proof. Argue as before using B.8.5:

(a) $h(y) = h(m) + h^{(1)}(m)(y - m) + o(|y - m|)$,

(b) $\sqrt{n}(\bar Y - m) \xrightarrow{\mathcal{L}} N(0, \Sigma)$,

so that

(c) $\sqrt{n}\big(h(\bar Y) - h(m)\big) = \sqrt{n}\,h^{(1)}(m)(\bar Y - m) + o_p(1)$. $\Box$
Example 5.3.7. $\chi^2$ and Normal Approximation to the Distribution of F Statistics. Suppose that $X_1, \ldots, X_n$ is a sample from a $N(0,1)$ distribution. Then according to Corollary B.3.1, the F statistic
\[ T_{k,m} = \frac{(1/k)\sum_{i=1}^k X_i^2}{(1/m)\sum_{i=k+1}^{k+m} X_i^2} \tag{5.3.25} \]
has an $\mathcal{F}_{k,m}$ distribution, where $k + m = n$. Suppose that $n \ge 60$ so that Table IV cannot be used for the distribution of $T_{k,m}$. When $k$ is fixed and $m$ (or equivalently $n = k + m$) is large, we can use Slutsky's theorem (A.14.9) to find an approximation to the distribution of $T_{k,m}$. To show this, we first note that $(1/m)\sum_{i=k+1}^{k+m} X_i^2$ is the average of $m$ independent $\chi^2_1$ random variables. By Theorem B.3.1, the mean of a $\chi^2_1$ variable is $E(Z^2)$, where $Z \sim N(0,1)$. But $E(Z^2) = \mathrm{Var}(Z) = 1$. Now the weak law of large numbers (A.15.7) implies that, as $m \to \infty$,
\[ \frac{1}{m}\sum_{i=k+1}^{k+m} X_i^2 \xrightarrow{P} 1. \]
Using the (b) part of Slutsky's theorem, we conclude that for fixed $k$,
\[ T_{k,m} \xrightarrow{\mathcal{L}} \frac{1}{k}\sum_{i=1}^k X_i^2 \]
as $m \to \infty$. By Theorem B.3.1, $\sum_{i=1}^k X_i^2$ has a $\chi^2_k$ distribution. Thus, when the number of degrees of freedom in the denominator is large, the $\mathcal{F}_{k,m}$ distribution can be approximated by the distribution of $V/k$, where $V \sim \chi^2_k$.

To get an idea of the accuracy of this approximation, check the entries of Table IV against the last row. This row, which is labeled $m = \infty$, gives the quantiles of the distribution of $V/k$. For instance, if $k = 5$ and $m = 60$, then $P[T_{5,60} \ge 2.37] = P[(V/k) \ge 2.21] = 0.05$, so the respective upper 0.05 critical values are 2.37 for the $\mathcal{F}_{5,60}$ distribution and 2.21 for the distribution of $V/k$. See also Figure B.3.1 in which the density of $V/k$, when $k = 10$, is given as the $\mathcal{F}_{10,\infty}$ density.

Next we turn to the normal approximation to the distribution of $T_{k,m}$. Suppose for simplicity that $k = m$ and $k \to \infty$. We write $T_k$ for $T_{k,k}$. The case $m = \lambda k$ for some $\lambda > 0$ is left to the problems. We do not require the $X_i$ to be normal, only that they be i.i.d. with $EX_1 = 0$, $EX_1^2 > 0$ and $EX_1^4 < \infty$. Then, if $\sigma^2 = \mathrm{Var}(X_1)$, we can write
\[ T_k = \frac{k^{-1}\sum_{i=1}^k Y_{i1}}{k^{-1}\sum_{i=1}^k Y_{i2}} = h(\bar Y) \tag{5.3.26} \]
where $Y_{i1} = X_i^2/\sigma^2$ and $Y_{i2} = X_{k+i}^2/\sigma^2$, $i = 1, \ldots, k$. Equivalently $T_k = h(\bar Y)$ with $Y_i = (Y_{i1}, Y_{i2})^T$, $E(Y_i) = (1,1)^T$ and $h(u,v) = u/v$. By Theorem 5.3.4,
\[ \sqrt{k}\,(T_k - 1) = \sqrt{k}\,\big(h(\bar Y) - h(\mathbf{1})\big) \xrightarrow{\mathcal{L}} N\big(0,\ h^{(1)}(\mathbf{1})\,\Sigma\,[h^{(1)}(\mathbf{1})]^T\big) \tag{5.3.27} \]
where $\mathbf{1} = (1,1)^T$, $h^{(1)}(u,v) = (1/v, -u/v^2)$ and $\Sigma = \mathrm{Var}(Y_{11})\,J$, where $J$ is the $2\times 2$ identity. We conclude that
\[ \sqrt{k}\,(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 2\,\mathrm{Var}(Y_{11})). \]
In particular if $X_1 \sim N(0, \sigma^2)$, then $\mathrm{Var}(Y_{11}) = 2$ and, as $k \to \infty$,
\[ \sqrt{k}\,(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 4). \]
In general, when $\min\{k, m\} \to \infty$, the distribution of $\sqrt{mk/(m+k)}\,(T_{k,m} - 1)$ can be approximated by a $N(0, 2)$ distribution. Thus (Problem 5.3.7), when $X_i \sim N(0, \sigma^2)$,
\[ P[T_{k,m} \le t] = P\Big[\sqrt{\tfrac{mk}{m+k}}\,(T_{k,m} - 1) \le \sqrt{\tfrac{mk}{m+k}}\,(t - 1)\Big] \approx \Phi\Big(\sqrt{\tfrac{mk}{m+k}}\,(t-1)/\sqrt{2}\Big). \tag{5.3.28} \]
An interesting and important point (noted by Box, 1953) is that, unlike the $t$ test, the F test for equality of variances (Problem 5.3.8(a)) does not have robustness of level. Specifically, if $\mathrm{Var}(X_1^2) \ne 2\sigma^4$, the upper $\mathcal{F}_{k,m}$ critical value $f_{k,m}(1-\alpha)$, which by (5.3.28) satisfies
\[ \sqrt{\tfrac{mk}{m+k}}\,\big(f_{k,m}(1-\alpha) - 1\big)/\sqrt{2} \approx z_{1-\alpha}, \quad \text{or} \quad f_{k,m}(1-\alpha) \approx 1 + z_{1-\alpha}\sqrt{\tfrac{2(m+k)}{mk}}, \]
is asymptotically incorrect. In general (Problem 5.3.8(c)) one has to use the critical value
\[ c_{k,m} = 1 + z_{1-\alpha}\sqrt{\frac{\kappa(m+k)}{mk}} \tag{5.3.29} \]
where $\kappa = \mathrm{Var}\big[(X_1 - \mu_1)/\sigma_1\big]^2$, $\mu_1 = E(X_1)$, and $\sigma_1^2 = \mathrm{Var}(X_1)$. When $\kappa$ is unknown, it can be estimated by the method of moments (Problem 5.3.8(d)). $\Box$
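The following is a minimal illustrative sketch, not part of the original text: it compares the exact $\mathcal{F}_{k,m}$ cdf with the $\chi^2_k/k$ approximation and with the normal approximation (5.3.28) at a few arbitrary values of $(k, m, t)$.

```python
# Sketch (not from the text): chi-square/k and normal approximations to F_{k,m}.
import numpy as np
from scipy import stats

for k, m, t in [(5, 60, 2.37), (10, 200, 1.5), (40, 40, 1.3)]:
    exact = stats.f.cdf(t, k, m)
    chi2_over_k = stats.chi2.cdf(k * t, df=k)                                # large-m limit
    normal = stats.norm.cdf(np.sqrt(m * k / (m + k)) * (t - 1) / np.sqrt(2))  # (5.3.28)
    print(f"k={k:3d} m={m:3d} t={t:4.2f}  exact {exact:.4f}  "
          f"chi2/k {chi2_over_k:.4f}  normal {normal:.4f}")
```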
5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
Our final application of the delta method follows.

Theorem 5.3.5. Suppose $\mathcal{P}$ is a canonical exponential family of rank $d$ generated by $T$ with $\mathcal{E}$ open. Then if $X_1, \ldots, X_n$ are a sample from $P_\eta \in \mathcal{P}$ and $\hat{\boldsymbol\eta}$ is defined as the MLE if it exists and equal to $c$ (some fixed value) otherwise,

(i) $\hat{\boldsymbol\eta} = \boldsymbol\eta + \dfrac{1}{n}\displaystyle\sum_{i=1}^n \ddot A^{-1}(\boldsymbol\eta)\big(T(X_i) - \dot A(\boldsymbol\eta)\big) + o_{P_{\boldsymbol\eta}}(n^{-1/2})$,

(ii) $\mathcal{L}_{\boldsymbol\eta}\big(\sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta)\big) \to N_d\big(0, \ddot A^{-1}(\boldsymbol\eta)\big)$.

Proof. The result is a consequence of Theorems 5.2.2 and 5.3.4. We showed in the proof of Theorem 5.2.2 that, if $\bar T = \frac{1}{n}\sum_{i=1}^n T(X_i)$, then $P_{\boldsymbol\eta}[\bar T \in \dot A(\mathcal{E})] \to 1$ and, hence, $P_{\boldsymbol\eta}[\hat{\boldsymbol\eta} = \dot A^{-1}(\bar T)] \to 1$. Identify $h$ in Theorem 5.3.4 with $\dot A^{-1}$ and $m$ with $\dot A(\boldsymbol\eta)$. Note that by B.8.14, if $t = \dot A(\boldsymbol\eta)$,
\[ D\dot A^{-1}(t) = \big[D\dot A(\boldsymbol\eta)\big]^{-1}. \tag{5.3.30} \]
But $D\dot A = \ddot A$ by definition and, thus, in our case,
\[ h^{(1)}\big(\dot A(\boldsymbol\eta)\big) = \ddot A^{-1}(\boldsymbol\eta). \tag{5.3.31} \]
Thus, (i) follows from (5.3.23). For (ii) simply note that, in our case, by Corollary 1.6.1,
\[ \Sigma = \mathrm{Var}\big(T(X_1)\big) = \ddot A(\boldsymbol\eta) \]
and, therefore,
\[ h^{(1)}(m)\,\Sigma\,[h^{(1)}(m)]^T = \ddot A^{-1}(\boldsymbol\eta)\,\ddot A(\boldsymbol\eta)\,\ddot A^{-1}(\boldsymbol\eta) = \ddot A^{-1}(\boldsymbol\eta). \tag{5.3.32} \]
Hence, (ii) follows from (5.3.24). $\Box$

Remark 5.3.1. Recall that $\ddot A(\boldsymbol\eta) = \mathrm{Var}_{\boldsymbol\eta}(T) = I(\boldsymbol\eta)$ is the Fisher information. Thus, the asymptotic variance matrix $I^{-1}(\boldsymbol\eta)$ of $\sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta)$ equals the lower bound (3.4.38) on the variance matrix of $\sqrt{n}(\tilde{\boldsymbol\eta} - \boldsymbol\eta)$ for any unbiased estimator $\tilde{\boldsymbol\eta}$. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.2.1.
Example 5.3.8. Let $X_1, \ldots, X_n$ be i.i.d. as $X$ with $X \sim N(\mu, \sigma^2)$. Then $T_1 = \bar X$ and $T_2 = n^{-1}\sum X_i^2$ are sufficient statistics in the canonical model. Now
\[ \sqrt{n}\big[T_1 - \mu,\ T_2 - (\mu^2 + \sigma^2)\big] \xrightarrow{\mathcal{L}} N\big(0, 0, \Sigma(\theta)\big) \tag{5.3.33} \]
where, by Example 2.3.4,
\[ \Sigma(\theta) = \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 4\mu^2\sigma^2 + 2\sigma^4 \end{pmatrix}. \]
Here $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/2\sigma^2$, $\hat\eta_1 = \bar X/\hat\sigma^2$, and $\hat\eta_2 = -1/2\hat\sigma^2$, where $\hat\sigma^2 = T_2 - T_1^2$. By Theorem 5.3.5,
\[ \sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta) \xrightarrow{\mathcal{L}} N\big(0, \ddot A^{-1}(\boldsymbol\eta)\big). \]
Because $\bar X = T_1$ and $\hat\sigma^2 = T_2 - (T_1)^2$, we can use (5.3.33) and Theorem 5.3.4 to find (Problem 5.3.26)
\[ \sqrt{n}\big(\bar X - \mu,\ \hat\sigma^2 - \sigma^2\big) \xrightarrow{\mathcal{L}} N(0, 0, \Sigma_0) \]
where $\Sigma_0 = \mathrm{diag}(\sigma^2, 2\sigma^4)$. $\Box$
Summary. Consistency is 0th-order asymptotics. First-order asymptotics provides approximations to the difference between a quantity tending to a limit and the limit, for instance, the difference between a consistent estimate and the parameter it estimates. Second-order asymptotics provides approximations to the difference between the error and its first-order approximation, and so on. We begin in Section 5.3.1 by studying approximations to moments and central moments of estimates. Fundamental asymptotic formulae are derived for the bias and variance of an estimate, first for a smooth function of a scalar mean and then of a vector mean. These "delta method" approximations, based on Taylor's formula and elementary results about moments of means of i.i.d. variables, are explained in terms of similar stochastic approximations to $h(\bar Y) - h(\mu)$ where $Y_1, \ldots, Y_n$ are i.i.d. as $Y$, $EY = \mu$, and $h$ is smooth. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. The moment and in-law approximations lead to the definition of variance stabilizing transformations for classical one-dimensional exponential families. Higher-order approximations to distributions (Edgeworth series) are discussed briefly. Finally, stochastic approximations in the case of vector statistics and parameters are developed, which lead to a result on the asymptotic normality of the MLE in multiparameter exponential families.

5.4
ASYMPTOTIC THEORY I N O N E DI MENSION
' l
.
'
'
l
i•
! ' •
•
I
.
,,
I
i
'
In this section we define and study asymptotic optimality for estimation, testing, and con fidence bounds, under i.i.d. sampling, when we are dealing with one-dimensional smooth parametric models. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. In Chapter 6 we sketch how these ideas can be extended to multi-dimensional parametric families.
j !
'
'
'
; '
'
' ..
;• '
I·
5.4.1
Estimation: The Multinomial Case
Following Fisher (1958),$^{(1)}$ we develop the theory first for the case that $X_1, \ldots, X_n$ are i.i.d. taking values in $\{x_0, \ldots, x_k\}$ only, so that $P$ is defined by $\mathbf{p} = (p_0, \ldots, p_k)$ where
\[ p_j = P[X_1 = x_j], \quad 0 \le j \le k, \tag{5.4.1} \]
and $\mathbf{p} \in \mathcal{S}$, the $(k+1)$-dimensional simplex (see Example 1.6.7). Thus, $N = (N_0, \ldots, N_k)$ where $N_j = \sum_{i=1}^n 1(X_i = x_j)$ is sufficient. We consider one-dimensional parametric submodels of $\mathcal{S}$ defined by $\mathcal{P} = \{(p(x_0, \theta), \ldots, p(x_k, \theta)) : \theta \in \Theta\}$, $\Theta$ open $\subset R$ (e.g., see Example 2.1.4 and Problem 2.1.15). We focus first on estimation of $\theta$. Assume

A: $\theta \mapsto p(x_j, \theta)$, $0 < p_j < 1$, is twice differentiable for $0 \le j \le k$.
i I
�
J
-------
,
II , ,
•
Section 5.4
325
Asymptotic Theory in One Dimension
Note that A implies that k
Iogp (X1, 8) � 2.:)ogp (x;, 8)1 (X1 � x;)
l(X1 ,8) is twice differentiable and
J=oO
(5.4.2)
g� (X1 , 8) is a well�defined, bounded random variable (5.4.3)
Furthermore (Section 3.4.2), (5.4.4)
and
g;� (X
1,
8) is similarly bounded and well defined with (5.4.5)
As usual we call 1(8) the Fisher infonnation. Next suppose we are given a plug-in estimator h ( r;:) (see (2.1.11 ) ) of (J where h: S�
R
satisfies
h(p(8)) � 8 for all 8 E e
(5.4.6)
where p(8) � (p(x0, 8), . . . , p (x., B))r. Many such h exist if k 2.1.4, for instance. Assume
> 1. Consider Example
H : h is differentiable. Then we have the following theorem. Theorem 5.4.1. Under H,Jor all 8. (5.4.7) where 172 (8, h ) is given by (5.4.11). Moreover, if A also holds,
"2 (8, h) > r' (B)
(5.4.8)
with eqUality if and only if, M & (p(8))
P,
_,
p(9)
�I
m
.
(B) &B (x;,B), 0 < J < k.
(5.4.9)
Asymptotic Approximations
326 Proof.
Apply Theorem 5.3.2 noting that
( (N) ) vn h -;; - h(p (B))
Note that, using the definition of
( N ) 8 p( h + op( l) . x;,8) vn &p; (p( )) : k &h
�
Nj ,
&h ( N ) h &p; (p (8)) -:;/: - p(x; , 8) n-1 kh &p; (p(B))( l (X, k &h
n
but also its asymptotic variance is
"'(8, h )
t
( ;;') - h(p (8))} asymptotically normal with mean 0,
'
Note that by differentiating (5.4.6), we obtain
j=O
k &h &p 2: 8 (p(8)) 88 (x;,8) � 1 j=D PJ
Cove
• . •
•
'
- p(x, , 8)). (5.4.10)
&h k &h ) ( h &p; (p(8)) p(x; , 8) - L a'PJ_ (p (8))p(x; , 8)
or equivalently, by noting
I t.
� x;)
�
k
'
k
�
Thus, by (5.4.10), not only is vn { h
I
Chapter 5
(5.4. 1 1 )
2 •
(5.4.12)
� (x;, 8) � [f, l (x; , B) ] p( x;, B), k &h &l & (p(B))l(X, � x;), 8ii (X� o B) j =O PJ
L
�
1.
(5.4.13)
By (5.4. 13), using the correlation inequality (A. l 1 . 16) as in the proof of the information inequality
(3.4.12), we obtain
with equality iff,
& 2 (B,h)Var9 l < " 1 &B (X1 ,8) � "' (B,h)I (B)
&h &l L & . (p(B))(l(Xt = x; ) - p(x;,B)) � a(B) &B (XI> 8) + b(B) k
j=O
for some
PJ
a(B) f' 0 and some b(B) with prob�bility 1.
which implies (5.4.9).
(5.4.14)
, '
, 1 , '
(5.4.15)
a (B), whil• �ir common
a2 (B)I(B) � "' ( B, h), we s�� tliat equality in (5.4.8) gives a2 (B)I2 (B) = 1, •
•
Taking expectations we get b(B) = 0.
Noting that the covariance of the right-· anp lett-hand sides i s variance is
•
(5.4.16) 0
, •
I
Section 5.4
_.3::2:..:_7
n Asymptotic Theory in One Dlm io:;_ e "::':: :.c::
_ _ _ _ _ _ _ _ _ _ _ _ _ _
We shall see in Section 5.4.3 that the information bound (5.4.8) is, if it exists and under regularity conditions, achieved by if = h ( r;; ) , the MLE of B where h is defined implicitly by: h(p) is the value of 0, which �
(i) maximizes L::7�o Nj logp(xj, B) and
(ii) solves L:7�0 N, gi (xj , B) = O.
Example 5.4.1. One-Parameter Discrete Exponential Families. Suppose p (x, 8) = exp{BT(x) - A(B)}h(x) where h(x) = l(x E {xo , . , xk }). B E e. is a canonical one-parameter exponential family (supported on {xo, . . . , Xk }) and e is open. Then Theorem 5.3.5 applies to the MLE B and
..
�
C8( /ii ( B- B)) � N
(o, I(�))
(5.4.17)
with the asymptotic variance achieving the information bound J-1(B). Note that because 'C'k T(x N then, by ( -1 n (X ) 2.3.3) T = n L;� 1 T ; = L..,j�o 1 ) ";- . (5.4.18) and h(p)
=
k
[A_]- 1 L T(:Ej )Pj y';, Q
(5.4.19)
The binomial (n, p) and Hardy-Weinberg models can both be put into this framework with D canonical parameters such as B = log 1 P in the first case.
( P)
Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena. In the next two subsections we consider some more general situations.
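The following is a minimal illustrative sketch, not part of the original text: it checks the normal approximation of the type in (5.4.17) for the MLE of the canonical parameter in a one-parameter exponential family. The Poisson family with canonical parameter $\theta = \log\lambda$, for which $I(\theta) = A''(\theta) = \lambda$, is an arbitrary illustration; the values of $\lambda$, $n$, and the number of replications are also arbitrary.

```python
# Sketch (not from the text): normal approximation for the MLE of theta = log(lambda)
# in the Poisson family, where I(theta) = exp(theta) = lambda.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam, n, reps = 3.0, 100, 50_000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
theta_hat = np.log(xbar)                              # MLE of theta = log(lambda)
z = np.sqrt(n * lam) * (theta_hat - np.log(lam))      # sqrt(n I(theta)) (theta_hat - theta)

for q in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"q={q:.2f}  simulated {np.quantile(z, q):6.3f}   N(0,1) {stats.norm.ppf(q):6.3f}")
```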
5.4.2
Asymptotic Normality of Minimum Contrast and M-Estimates
We begin with an asymptotic normality theorem for minimum contrast estimates. As in Theorem 5.2.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check. Suppose i.i.d. X1 , , Xn are tentatively modeled to be distributed according to Po. B E 8 open c R and corresponding density/frequency functions p (· , B). Write P = { P, : B E E> ) . Let p : X X e � R where . • .
D(B,B o ) = Es, ( p(X , , O ) - p(X, ,Ilo ))
328
Asymptotic Approximations
Chapter 5
•
; •
•
is uniquely minimized at Bo. Let On be the minimum contrast estimate
-
0. � Suppose
AO: .p � Then
•
1 n argrnin - _L p(X,, O). n . t=l
U is well defined. 1 n ,P(X;, On) -n L . t=l
�
0.
(5.4.20)
In what follows we let P, rather than Pe, denote the distribution of Xi. This is because, as pointed out later in Remark 5.4.3, under regularity conditions the properties developed in this section are valid for P � { Pe : 0 E 8}. We need only that 0(P) is a parameter as defined in Section 1.1. As we saw in Section 2.1, parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for. Suppose AI: The parameter O(P) given by the solution of
j ,P(x, O)dP(x)
'·
0
(5.4.21)
is well defined on P. That is,
J I.P (x, O) jdP(x) < co, 0 E e, p E p
and O(P) is the unique solution of(5.4.21) and, hence, O(Pe)
=
i •
• • •
'
'
0.
Ep.j;2(X1,0(P)) < oo for all P E P. A3: ,P( · 9) is differentiable, � (X1, 9) has a finite expectation and
i !
A2:
'
,
a.p
'!
I·
�
Ep 89 (X, 9(P)) f 0. A4: sup1 AS:
'
•
{ � I:;� 1 (� (X; , t) - � (X; , O(P))) I : jt - 9(P)I <
9. £. 9(P). That is, 9. is consistent on P = {Pe
0.
: 9 E 8}.
.
' '
i
Theorem 5.4.2. Under AO-A5,
'
I '
•
(5.4.22) where
-¢(x,P)
=
/( .j;(x,9(P)) -Ep 09 (Xt,9(P)) a.p
J • •
1
'
•
•
• •
•
(5.4.23)
Section 5.4
329
Asymptotic Theory in One Dimension
Hence, where
"2 (1/J, P)
=
Ep
(5.4.24)
Proof Claim (5.4.24) follows from the central limit theorem and Slutsky's theorem, ap plied to (5.4.22) because
n ../Ti(Bn - B(P)) = n-112 L ;ft(X;, P) + op(l) i=l and
Ep;f(x,, P) while
Ep
0
Ep;f2 (X1,P) =
-
by A l , A2, and A3. Next we show that (5.4.22) follows by a Taylor expansion of the equations (5.4.20) and (5.4.21). Let Bn = B (P) where P denotes the empirical probability. By expanding n-• I:7 1 <J;(X;, Bn) around B(P) , we obtain, using (5.4.20), -
-
l n &.p l n - n L <J;(X;, B(P)) = n L 88 (X;,O�)(Bn - B (P)) i=l t=l
where
IB� - B(P) I < IBn - B(P)I. Apply A5 and A4 to cooclude that n n I l thj; * a , = )) ;, ) + op ( l) (P (X ; B (X L L O <J; 8 8 8 8 n n n i=I t=l
(5.4.25)
(5.4.26)
and A3 and the WLLN to conclude that (5.4.27) Combining (5.425)-(5.4.27) we get,
(On -
8¢ ( B(P)) -Ep 88 (X1 , B (P))
But by the central limit theorem and A l ,
+
n ) l op(!) n t; <J;(X;, B (P)). =
n 2 1 n -n L = Op( - 1 ). B(P)) <J;(X;, . t=l 1
(5.4.28)
(5.4.29)
330
Asymptotic Approximations
Chapter 5
Dividing by the second factor in (5.4.28) we finally obtain
I n
On - O(P) = n L ;f(X;, O(P)) + Op
i= l
I n
!
- '<' 1/J(X;, O(P)) n�
l=l
and (5.4.22) follows from the foregoing and (5.4.29).
0
Remark 5.4.1. An additional assumption A6 gives a slightly different formula for Ep 'iilf (X" O(P)) if P = Po.
A6: Suppose P = Po so that O(P) = 0, and that the model P is regular and let l(x, 0) = logp{x, B) where p(·, (}) is as usual a density or frequency function. Suppose l is differen· tiable and assume that
&l -Eo (X, , O),P(X1 , 0) 80
-Covo (:�(Xt , O) , ,P (X1, 0) ) .
i
J 1/J(x, O)p(x, O)dJL(x)
=
0
(5.4.30)
' '
· ,
'
l i
(5.4.31)
=
{Po :
I
l l
•
I
I
l '
l •
'
; ,
• • '
•
or more generally, for M-estimates, =
I •
• •
argmin Epp(X1, B)
(2) B(P) solves Ep1/J(X1 , B)
!
•
Remark 5.4.2. Solutions to (5.4.20) are called M-estimates. Our arguments apply to M estimates. Nothing in the arguments require that On be a minimum contrast as well as an M-estimate (i.e., that 1/J = � for some p).
=
•
•
i
-
(I) O(P)
. ,
'
for all 0. If an unbiased estimate S (X!) of 0 exists and we let 1/J (x, 0) = 6 ( x) -0, it is easy to see that A6 is the same as ( 3.4.8). If further X1 takes on a finite set of values, { xo, . . . , Xk }, and we define h(p) = I:7=o 6(x1 ) Pi• we see that A6 corresponds to (5.4.12). Identity (5.4.30) suggests that ifP is regular the conclusion of Theorem 5.4.2 may hold even if 1/J is not differentiable provided that Eo 'iilf ( X1 , 0) is replaced by Covo(1/J(X" 0). Z� (X1 , B)) and a suitable replacement for A3, A4 is found. This is in fact true-see Prob lem 5.4.1.
Remark 5.4.3. Our arguments apply even if X,, . . . , Xn are i.i.d. P but P
• '
•
Note that (5.4.30) is formally obtained by differentiating the equation (5.4.2 1), written as
!
0.
Theorem 5.4.2 is valid with B(P) as in (I) or (2). This extension will he pursued in Vol ume 2. We conclude by stating some sufficient conditions, essentially due to Cramer (1946), for A4 and A6. Conditions AO, Al, A2, and A3 are readily checkable whereas we have given conditions for AS in Section 5.2.
; ;
Section 5.4
A4': (a) 8 --+
33 1
Asymptotic Theory in One Dimension
�� (Xt , O ) is a continuous function of 8 for all x.
{
(b) There exists J(O) > 0 such that sup
EhJ; iJ
0') , (X t O
where Eo M(X1 ,
Dv')
iJ (Xt, O
0)
0) < oo.
A6': �t (x, 0') is defined for all x, /0'- 0/
< J(
0) and J:+: J I !Ji:' (x, s) dJL(x)ds < oo for
some J � J(O) > 0, where JL(x) is the dominating measure for P(x) defined in (A. l0.13).
That is. "dP(x)
�
p(x) dJL(x) ."
Details of how A4' (with A0-A3) iiT!plies A4 and A6' implies A6 are given in the problems. We also indicate by example in the problems that some conditions are needed (Problem 5.4.4) but A4' and A6' are not necessary (Problem 5.4.1).
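The following is a minimal illustrative sketch, not part of the original text: it computes a location M-estimate with Huber's $\psi$ and the plug-in version of the asymptotic variance formula $\sigma^2(\psi, P) = E\psi^2(X_1, \theta(P))/\big[E\frac{\partial\psi}{\partial\theta}(X_1, \theta(P))\big]^2$ from (5.4.23)-(5.4.24). The data-generating distribution and the tuning constant are arbitrary choices.

```python
# Sketch (not from the text): Huber location M-estimate and its plug-in asymptotic SE.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=500)        # heavy-tailed sample, true center 0
c = 1.345                                 # a common Huber tuning constant

psi = lambda u: np.clip(u, -c, c)                          # psi(x, theta) = clip(x - theta)
dpsi_dtheta = lambda u: -(np.abs(u) <= c).astype(float)    # d/dtheta of psi(x - theta)

theta_hat = optimize.brentq(lambda t: psi(x - t).sum(), x.min(), x.max())

n = len(x)
num = np.mean(psi(x - theta_hat) ** 2)
den = np.mean(dpsi_dtheta(x - theta_hat)) ** 2
se = np.sqrt(num / den / n)
print(f"theta_hat = {theta_hat:.4f}, estimated asymptotic SE = {se:.4f}")
```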
5.4.3
Asymptotic Normality and Efficiency of the MLE
The most important special case of (5.4.20) occurs when and 1/>(x,
0)
- Dl
�
ijO(x, 0)
-
p(x, 0)
�
l (x, 0) -.
=
logp(x,
0)
obeys A0-A6. In this case On is the MLE On and we obtain an
identity of Fishef's,
21 & - Eo 002 (Xt , 0) where
(5.4.32)
/(8) is t�e Fisher information introduq;:d in Section 3.4.
We can now state the basic
result on asymptotic normality and e�ciency of the MLE. Theorem
5.4.3.
If AO-A6
apply to p(x, 0)
satisfies
�
l(x, 0) and P
�
Pe,
-
then the MLE On (5.4.33)
so that (5.4.34)
Furthermore, A0-A6, then
if Bn
is a minimum contrast estimate whose corresponding p and '1jJ satisfy
2 (1/>, Po) >
<7
with equality iff 1/> � a ( 0) � for some a o1 0.
1 I(O)
(5.4.35)
r •
• '
332
Asymptotic Approximations
Chapter 5
Proof Claims (5.4.33) and (5.4.34) follow directly by Theorem 5.4.2. By (5.4.30) and
(5.4.35). claim (5.4.35) is equivalent to
(5.4.36) =
0, cross multiplication shows that (5.4.36) is just the correlation Because Eo'!j;(X1 , B) inequality and the theorem follows because equality holds iff 'ljJ is a nonzero multiple a ( B) o of 3i; (X1 , B).
I
Note that Theorem 5.4.3 generalizes Example 5.4.1 once we identify '1/J(x, B) with T(x) - A'(B). The optimality part of Theorem 5.4.3 is not valid without some conditions on the esti· mates being considered. l'l Example 5.4.2. Hodges's Example. Let X1 , . . . , Xn be i.i.d. .N(B, 1). Then X is the MLE of B and it is trivial to calculate I(B)_ 1. Consider the following competitor to X: -
Bn
I
-
0 if lXI < n-1;4 X if l X I > n-1/4
(5.4.37)
We can interpret this estimate as first testing H : () 0 using the test "Reject iff JXI > n-1/4" and using X as our estimate if the test rejects and 0 as our estimate otherwise. We next compute the limiting distribution of .,fii (Bn - B). Let Z .N(O, 1). Then =
�
P[IZ + ..fiiBI < n1i4]
(5.4.38)
___,
where
-
5.4.4
1
1
' •
•
' ' • •
1
l
�
• •
:
.
' • •
;
'
; ' '
l
•
•
(5.4.39) =
•
•
0 because n114 - ,ftiB ___, -oo, and, thus, Therefore, if B # 0, P•[IXI < n- 1 14[ PB [Bn = X[ ___, 1. If B 0, Po [IXI < n1i4] ___, 1, and PB[iin = OJ I. Therefore, ___,
•
Testing
The major testing problem if (} is one-dimensional is H : (} < Bo versus K : (} > 9o. If p(·, B) is an MLR family in T(X), we know that all likelihood ratio tests for simple B1
Section 5.4
�3:::3:..::3 _;cc"_:O._o:c'-=O-=;m_:'cc":c';cco"'----����... Asymptotic Th�o:_c"f.-�
versus simple ()2 , ()1 < ()2 , as well as the likelihood ratio test for H versus K, are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error a at ()0. If p(-, ()) is a one-parameter exponential family in B generated by T(X) , this test can also be interpreted as a test of H : >.. < >..o versus K : >.. > .\0, where ).. = A(()) because A is strictly increasing. The test is then precisely, "Reject H for large values of the MLE T(X) of >...'1 It seems natural in general to study the behavior of the test, "Reject H if Bn > c(a, Bo) " where P&, [Bn > c(a, Bo )] = a and Bn is the MLE of e. We will use asymptotic theory to study the behavior of this test when we observe i.i.d. X1 , . . . , Xn distributed according to Po. B E (a, b), a < eo < b, derive an optimality property, and then directly and through problems exhibit other tests with the same behavior. � Let en (a, ()0) denote the critical value of the test using the MLE On based on n observations. •
•
Theorem 5.4.4. Suppose the model P = {Po : 8 E 8} is such thot the conditions of Theorem 5.4.2 apply to '0 = g� and Bn. the MLE. That is,
£,(-fii(Bn - 8)) � N(o, r 1 (8))
(5.4.40)
where 1(8) > Ofor all 8. Then
c,(a,Bo) = Bo + Zt-a/Vn1(8o) + o(n- 112) where Zt- a is the 1 - a quantile of the N(O, 1) distribution. Suppose (A4') holds as well as (A6) and 1(8) < oofor all 8. Then lf8 > Bo. Po[Bn > c,(a,Oo)] � 1. lf8 < Bo, Po[Bn > en(<>, Oo)] � 0. �
�
(5.4.41)
(5.4.42) (5.4.43)
Property (5.4.42) is sometimes called consistency of the test against a fixed alternative. Proof. The proof is straightforward:
Po, [Vn1(8o)(Bn - Bo) > z] � l - 1>(z) by (5.4.40). Thus,
Po,[Bn > Bo + Zt-a/Vnf(Oo)] = Po,[Vn1(8o)(Bn - Bo) > Zt-a] � a.
(5.4.44)
But Polya's theorem (A.l4.22) guarantees that su p
which implies that other hand,
IPo,[.fii(Bn - Bo) > z] - (1 - 1>(z))l � 0,
Jn1(80)(c,(a,Oo) - Bo) - z1_0
(5.4.45)
� 0, and (5.4.41) follows. On the
Po[Bn > Cn(<>, Bo)] = Poh/;;/(B)(Bn - 8) > Vn/(ij(cn (<>, Bo) - 8)].
(5.4.46)
i
•
334
Asymptotic Approximations
Chapter 5
l
By (5.4.41),
.jni(8)(c.,(a, 80 ) - 8) and ___, oo if 8 <
foi(B)(8o - 8 + Z1-a / /ni(80) + o(n -112 )) .jnl(8)(8o - 8) + 0(1) -oo if 8 > 80
80. Claims (5.4.42) and (5.4.43) follow.
___,
0
Theorem 5.4.4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to a from the left at 00 and continues rising steeply to 1 to the right of 80. Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ')' = vn( 8 - 8o). (3)
Theorem 5.4.5. Suppose the conditions of Theorem 5.4.2 and (5.4.40) hold uniformly for fJ in a neighborhood of Oo. That is, assume
sup{IP,[fol(Bj(Bn - 8) S z] - (1 - of?(z))[ : [8 8o [ < <(8o) } forsome <(8o) > 0. LetQ1 = Po, 'l' Jri(8 - 8o), then -
___,
0,
(5.4.47)
�
I '
(5.4.48)
j
,
.
'
•
uniformly in "f· Furthennore, if
•
(5.4.49)
•
'
'
j i !
then
'
1 - 1?(z1-a - 'l'y'I{8o)) if'Y > 0 > 1 - 1?(z1-a - 'l'y'I{8o)) if'Y < 0. <
'
(5.4.50)
Note that (5.4.48) and (5.4.50) can be interpreted as saying that among all tests that are asymptotically level a (obey (5.4.49)) the test based on rejecting for large values of Bn is asymptotically uniformly most powerful (obey (5.4.50)) and has asymptotically smallest probability of type I error for B < Bo. In fact, these statements can only be interpreted as valid in a small neighborhood of 80 because "' fixed means () 80. On the other hand, if Jri(B - B0) tends to zero, then by (5.4.50), the power of tests with asymptotic level a tend to a. If Jri(B - Bo) tends to infinity, the power of the test based on Bn tends to I by (5.4.48). In either case, the test based on Bn is still asymptotically MP. -
-----+
�
�
Proof. Write
Po[Bn > cn(a, Bo)] - P,[fol(Bj(Bn - B) > fol(Bj(c., (a, Bo) - B)] Po[/nl(B)(Bn - B) > .jni(B)(Bo - B + Z1-o/v'nl(Bo) +o(n- 1/2))]. (5.4.5 1)
!
.i
'
i '
Section 5.4
335
Asymptotic Theory in One Dimension
..fii (e - eo) is fixed, I(e) � I (eo + },; ) � !(eo) because our uniformity assumption implies that () ---+ I(O) is continuous (Problem 5.4.7). Thus, If 7
�
1 - .P(Z!-a(l + o(l)) + .jn(I(e0) + o( l))(eo - e ) + o( l ) ) I - .P ( zl a - -y�) + o( l ) )
-
(5.4.52)
and (5.4.48) follows. To prove (5.4.50) note that by the Neyman-Pearson lemma, if 1 > 0,
n
p (xi , eo + Jn) > dn (a,eo) < Pe, + )n I; log p(X e0 ) i=l 1,
-l-
n
v (xi , eo + ;?;; ) I; log -·· p(X e ) � dn(o, eo) , 1l 0 i=l (5.4.53)
where p(x, 0) denotes the density of Xi and dn, tn are uniquely chosen so that the right hand side of (5.4.53) is a if 7 is 0. Further Taylor expansion and probabilistic arguments of the type we have used show that the right-hand side of (5.4.53) tends to the right-hand side of (5.4.50) for all -y. The D details are in Problem 5.4.5.
The asymptotic results we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n. The test l [en > en (a, eo)] of Theorems 5.4.4 and 5.4.5 in the future will be referred to as a Wald test. There are two other types of test that have the same asymptotic behavior. These are the likelihood ratio test and the score or Rao test. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form �
n
"Reject if
L log[p(X,}n)/p(Xi, eo)]l (en > eo) > kn(eo, a)." i=l
It may be shown (Problem 5.4.8) that, for a < �. kn(e0, a) � zf_. + o(l) and that if own (X1 , . . ) Xn) is the critical function of the Wald test and OL n (X1 ' . . . ) Xn) is the critical function of the LR test then, for all "(,
.
(5.4.54)
Assertion (5.4.54) establishes that the test O.in yields equality in (5.4.50) and, hence, is asymptotically most powerful as well. Finally, note that the Neyman Pearson LR test for H : () = ()0 versus K : Oo + t, € > 0 rejects for large values of 1 - [log pn( X1 , . . , Xn, eo E
.
+ <) - logpn (Xl , . . . , Xn, eo)]
•
•
'
•• .
.
•
336
Asymptotic Approximations
Chapter 5
where Pn(Xt, . . . , Xn, B) is the joint density of Xt , . . . , Xn . For e small, n fixed, this is approximately the same as rejecting for large values of 8�0 log pn(X 1 1 , Xn, Bo). , Xn are i.i.d. with The preceding argument doesn't depend on the fact that X1, common density or frequency function p(x, 8) and the test that rejects H for large values of a�o log Pn (X 1 , , Xn, 90) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming •
.
•
•
.
•
.
I
•
.
'
'
•
•
"Reject H iff
t 0�0 log (X , 9 ) p
·�
. 1
;
•
I
'
o
>
!I
rn(a,9o)."
'
It is easy to see (Problem 5.4.15) that and that again if JR.n (X1, .
rn(a, 9o) = Z t - a /nl(90) + o (n1 12 ) .
•
.
(5.4.55)
, Xn) is the critical function of the Rao test then
(X1 ' . . . ' Xn) = Ow n (X , ' . . . ' Xn)] � 1 ' Po,+-'Wf.n ,;n
'
• • ..
-
'
(5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(9o), which d' may require numerical integration, can be replaced by n 1 do2 ln (Bn) (Problem 5.4. 10).
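The following is a minimal illustrative sketch, not part of the original text: it computes the Wald, Rao (score), and signed likelihood ratio statistics for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$ in a simple model in which they are available in closed form. The Poisson($\theta$) model, with $\hat\theta = \bar X$ and $I(\theta) = 1/\theta$, and the particular data and $\theta_0$ are arbitrary choices.

```python
# Sketch (not from the text): Wald, score, and signed LR statistics in a Poisson model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta0, n = 2.0, 80
x = rng.poisson(2.4, size=n)
xbar = x.mean()

wald = np.sqrt(n / xbar) * (xbar - theta0)       # sqrt(n I(theta_hat)) (theta_hat - theta0)
score = np.sqrt(n / theta0) * (xbar - theta0)    # score statistic / sqrt(n I(theta0))
loglik = lambda t: np.sum(x * np.log(t) - t)     # log-likelihood up to constants
lr_signed = np.sign(xbar - theta0) * np.sqrt(2 * max(loglik(xbar) - loglik(theta0), 0.0))

z = stats.norm.ppf(0.95)
for name, stat in [("Wald", wald), ("score", score), ("signed LR", lr_signed)]:
    print(f"{name:10s} {stat:6.3f}   reject at 5%: {stat > z}")
```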
•
• ' • • .
-
' • •
•
• •
5.4.5
'
'
•
•
Confidence Bounds
We define an asymptotic Level l - a lower confidence bound (LCB) ()n by the requirement that
'
•
(5.4.57) for all () and similarly define asymptotic Ievel l - a VCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
I' I.
I II '
(i) By using a natural pivot.
(
'
'
'1
(ii) By inverting the testing regions derived in Section 5.4.4.
"'
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (AO)-(A6), (A4'), and !(9) finite for all 8, it follows (Problem 5.4.9) that
.r
Co( ni (6 ) ( Bn - B)) � N(o, 1)
(5.4.58)
9� = (in - Zt-a / Vnl(Bn ) .
(5.4.59)
•
'. •
'
I
I I
I! ••
'
V
n
for all () and, hence, an asymptotic Ievel l - a lower confidence bound is given by Turning tto method (ii), inversion of 8Wn gives formally
e�, = inf{ 9 : c,(a, 9)
>
-
9n}
(5.4.60)
'
Section 5.5
337
Asymptotic Behavior and Optim ality of the Posterior Distribution
or if we use the approximation
Cn ( Cl', 9) � e + Zt-o/Vni(iJ), (5.4.41 ), (5,4,61)
In fact neither 8�1, or 8�2 properly inverts the tests unless cn (et, 8) and C,. (a, 8) are increasing in 8. The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, 8�1 is preferable because this bound is not only
approximately but genuinely Ievel l - a. But computationally it is often hard to implement
because cn (et, 0) needs, in general, to be computed by simulation for a grid of B values. Typically, (5.4.59) or some equivalent alternatives (Problem 5,4, 10) are preferred but can
be quite inadequate (Problem 5.4.1 1), These bounds 8�, (}�1, 8� 2• are in fact asymptotically equivalent and optimal in a suit
able sense (Problems
Summary.
5.4,12 and 5.4.13).
We have defined asymptotic optimality for estimates
in one-parameter models.
In particular, we developed an asymptotic analogue of the information inequalily of Chap ter
3
for estimates of () in a one-dimensional subfamily of the multinomial distributions,
showed that the MLE fonnally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section theory of minimum contrast and
5.4.2 we developed the
M-estimates, generalizations of the MLE, along the lines
of Huber (1967), The asymptotic formulae we derived are applied to the MLE both under the mOOel that led to it and tmder an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimal ity results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson ( 1996), Le Cam and Yang
(1990), Lehmann (1999), Rao (1973), and Serfling (1980), 5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as
n --+ oo
in a sense we now describe. The
framework we consider is the one considered in Sections from a regular model in which e is open c
able.
5.2
and
5.4,
i.i.d. observations
R or e � {91, , , , , e,} finite, and e is identifi
Most of the questions we address and answer are under the assumption that arbitrary specified value, or in frequentist tenns, that
e is true.
fJ = (}, an
Consistency The first natural question is whether the Bayes posterior distribution as
n ------+ oo con
centrates all mass more and more tightly around B. Intuitively this means that the data that
are coming from Po eventually wipe out any prior belief that parameter values not close to
e are likely,
Formalizing this statement about the posterior distribution, II(· I X1 , is a function-valued statistic, is somewhat subtle in general. But for e
•
.
•
, Xn). which
= {(}I, . . . ) (}k}
it is
1. i
338
Asymptotic Approximations
Chapter 5
straightforward. Let
tr(e i X, , . . . , Xn) = P[9 = e I X,, . . . , Xn]· Then we say that II( . I x,, . , Xn) is consistent iff for all e E e, Po [ltr(e I Xt, . . . , Xn) - 1 1 > <] � 0 .
(5.5. 1 )
.
for all E > 0. There is a slightly stronger definition: II(- j X1 , iff for all e E 9,
•
•
•
(5.5.2)
, Xn) is a.s. consistent
tr(e I X,, . . . , Xn) � 1 a.s. Po .
(5.5.3)
General a.s. consistency is not hard to formulate: tr ( · I Xt , . . . , Xn)
=>
b{o) a.s.
Po
(5.5. 4)
where ::::} denotes convergence in law and <5{o} is point mass at B. There is a completely satisfactory result for e finite.
Theorem 5.5.1. Let 1r1 P(8 = Bi], j = 1, . . , k denote the prior distribution of8. Then II(· I X1, , Xn) is consistent (a.s. consistent) iff 'lfj > 0for j = 1, . . . , k.
.
. '
.
-
' ,
. • .
Proof. Let p( ·,B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1fj = 0 for some j implies that 1r( Bi I X1, . . . , Xn) = 0 for all X, , . . . , Xn because, by (1.2.8),
P[IJ = e; I Xt, . . . , Xnl "; rr�, p(x,, e; )
.
:i j'
k
'
'
., ' '
•
,
I
1 1 .[f-. 1 p(X;, e.) "• - og + - L.., og '--;-�-,::.;, n i=I n p(Xi, Bj) 11"j
•
•
·
Intuitively, no amount of data can convince a Bayesian who has decided a priori that (JJ is impossible. On the other hand, suppose all 1ri are positive. If the true (J is (Ji or equivalently 8 = (Ji , then
.
.
n
L::.�t "• 0,�, p(X;, e.)
•
(5.5.5)
,
'
-
By the weak (respectively strong) LLN, under Po;• 1 .[f-. 1 p(X; , e. ) og L p(X; , e;) n ,�
1
----+
I ( o;
E
(
p(X, , e.) og p(X1 , e; )
•
'
i
,
)
I
.
::; ) < 0, by Shannon's inequality, if
in probability (respectively a.s.). But Eo; tog :�: ) B =I Bi. Therefore, a
I
I X,, . . . , Xn) tr(e. og tr(e; I Xt, " . , Xn)
in the appropriate sense, and the theorem follows. 'I
� -00 ,
0
i i
,
i
'
•
I
'
' ,' •
•
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
Remark 5.5.1. We have proved more than is stated. (} I X1 , . , Xn] 0 exponentially. .
.
339
Namely. that for each
(J E 8, Po[O =/:
-
D
As this proof suggests, consistency of the posterior distribution is very much akin to
consistency of the
MLE. The
appropriate analogues of Theorem 5.2.3 are valid. Next we
give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution

Under conditions A0–A6 for ρ(x, θ) = −l(x, θ) ≡ −log p(x, θ), we showed in Section 5.4 that if θ̂ is the MLE,

L_θ(√n(θ̂ − θ)) → N(0, I⁻¹(θ)).   (5.5.6)

Consider L(√n(ϑ − θ̂) | X₁, ..., Xₙ), the posterior probability distribution of √n(ϑ − θ̂(X₁, ..., Xₙ)), where we emphasize that θ̂ depends only on the data and is a constant given X₁, ..., Xₙ. For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in P_θ probability by convergence a.s. P_θ. We also add,

A7: For all θ and all δ > 0 there exists ε(δ, θ) > 0 such that

P_θ[ sup{ (1/n) Σ_{i=1}^n [l(X_i, θ′) − l(X_i, θ)] : |θ′ − θ| ≥ δ } ≤ −ε(δ, θ) ] → 1.

A8: The prior distribution has a density π(·) on Θ such that π(·) is continuous and positive at all θ.
Remarkably,

Theorem 5.5.2 ("Bernstein/von Mises"). If conditions A0–A3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold, then

L(√n(ϑ − θ̂) | X₁, ..., Xₙ) → N(0, I⁻¹(θ))   (5.5.7)

a.s. under P_θ for all θ. We can rewrite (5.5.7) more usefully as

sup_x | P[√n(ϑ − θ̂) ≤ x | X₁, ..., Xₙ] − Φ(x √I(θ)) | → 0   (5.5.8)

for all θ a.s. P_θ and, of course, the statement holds for our usual and weaker convergence in P_θ probability also. From this restatement we obtain the important corollary.

Corollary 5.5.1. Under the conditions of Theorem 5.5.2,

sup_x | P[√n(ϑ − θ̂) ≤ x | X₁, ..., Xₙ] − Φ(x √I(θ̂)) | → 0   (5.5.9)

a.s. P_θ for all θ.
Remarks. (1) Statements (5.5.4) and (5.5.7)–(5.5.9) are, in fact, frequentist statements about the asymptotic behavior of certain function-valued statistics.

(2) Claims (5.5.8) and (5.5.9) hold with a.s. convergence replaced by convergence in P_θ probability if A4 and A5 are used rather than their strong forms—see Problem 5.5.7.

(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and identifiability guarantees consistency of θ̂ in a regular model.
Proof. We compute the posterior density of √n(ϑ − θ̂) as

q_n(t) = c_n⁻¹ π(θ̂ + t/√n) ∏_{i=1}^n p(X_i, θ̂ + t/√n)   (5.5.10)

where c_n = c_n(X₁, ..., Xₙ) is given by

c_n = ∫ π(θ̂ + s/√n) ∏_{i=1}^n p(X_i, θ̂ + s/√n) ds.

Divide top and bottom of (5.5.10) by ∏_{i=1}^n p(X_i, θ̂) to obtain

q_n(t) = d_n⁻¹ π(θ̂ + t/√n) exp{ Σ_{i=1}^n [l(X_i, θ̂ + t/√n) − l(X_i, θ̂)] }   (5.5.11)

where l(x, θ) = log p(x, θ) and

d_n = ∫ π(θ̂ + s/√n) exp{ Σ_{i=1}^n [l(X_i, θ̂ + s/√n) − l(X_i, θ̂)] } ds.

We claim that

d_n q_n(t) → π(θ) exp{−t² I(θ)/2} a.s. P_θ   (5.5.12)

for all θ. To establish this note that

(a) sup{ |π(θ̂ + t/√n) − π(θ)| : |t| ≤ M } → 0 a.s. for all M, because θ̂ is a.s. consistent and π is continuous.

(b) Expanding,

Σ_{i=1}^n [l(X_i, θ̂ + t/√n) − l(X_i, θ̂)] = (t²/2) (1/n) Σ_{i=1}^n (∂²l/∂θ²)(X_i, θ*(t))   (5.5.13)

where |θ*(t) − θ̂| ≤ |t|/√n; we use Σ_{i=1}^n (∂l/∂θ)(X_i, θ̂) = 0 here. By A4(a.s.) and A5(a.s.),

sup{ | (1/n) Σ_{i=1}^n (∂²l/∂θ²)(X_i, θ*(t)) − (1/n) Σ_{i=1}^n (∂²l/∂θ²)(X_i, θ) | : |t| ≤ M } → 0 a.s. P_θ

for all M. Using (5.5.13), the strong law of large numbers (SLLN), and A8, we obtain (Problem 5.5.3)

d_n q_n(t) → π(θ) exp{ E_θ[(∂²l/∂θ²)(X₁, θ)] t²/2 } a.s. P_θ   (5.5.14)

for all t. Using A6 we obtain (5.5.12).

Now consider

d_n = ∫_{|s| ≤ δ√n} d_n q_n(s) ds + ∫_{|s| > δ√n} d_n q_n(s) ds.   (5.5.15)

By A8 and A7,

∫_{|s| > δ√n} d_n q_n(s) ds ≤ √n sup{ exp{ Σ_{i=1}^n [l(X_i, t) − l(X_i, θ̂)] } : |t − θ̂| ≥ δ } ≤ √n e^{−nε(δ, θ)}   (5.5.16)

a.s. P_θ for all δ > 0, so that the second term in (5.5.15) is bounded by √n e^{−nε(δ, θ)} → 0. Finally note that (Problem 5.5.4), by arguing as for (5.5.14), there exists δ(θ) > 0 such that

P_θ[ d_n q_n(t) ≤ 2π(θ) exp{ E_θ[(∂²l/∂θ²)(X₁, θ)] t²/4 } for all |t| ≤ δ(θ)√n ] → 1.   (5.5.17)

Finally, apply the dominated convergence theorem, Theorem B.7.5, to ∫ d_n q_n(s) 1(|s| ≤ δ(θ)√n) ds, using (5.5.14) and (5.5.17), to conclude that, a.s. P_θ,

d_n → π(θ) ∫ exp{−s² I(θ)/2} ds = π(θ) √(2π) / √I(θ).   (5.5.18)

Hence, a.s. P_θ,

q_n(t) → √I(θ) φ(t √I(θ))   (5.5.19)

where φ is the standard Gaussian density, and the theorem follows from Scheffé's Theorem B.7.6 and Proposition B.7.2. □
Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N(θ, σ²) distribution with σ² known and we put a N(η, τ²) prior on θ. Then the posterior distribution of ϑ is N( w₁ₙ η + w₂ₙ X̄, (n/σ² + 1/τ²)⁻¹ ), where

w₁ₙ = σ² / (nτ² + σ²),  w₂ₙ = 1 − w₁ₙ.   (5.5.20)

Evidently, as n → ∞, w₁ₙ → 0, X̄ → θ a.s. if ϑ = θ, and (n/σ² + 1/τ²)⁻¹ → 0. That is, the posterior distribution has mean approximately θ and variance approximately 0, for n large, or equivalently the posterior is close to point mass at θ, as we expect from Theorem 5.5.1. Because θ̂ = X̄, √n(ϑ − θ̂) has posterior distribution

N( √n w₁ₙ(η − X̄), n(n/σ² + 1/τ²)⁻¹ ).

Now, √n w₁ₙ = O(n^{−1/2}) = o(1) and n(n/σ² + 1/τ²)⁻¹ → σ² = I⁻¹(θ), and we have directly established the conclusion of Theorem 5.5.2. □
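A quick numerical check of this example (a sketch; σ = 1, η = 1, τ = 2, and θ = 0.3 are arbitrary illustrative choices): the exact posterior mean and variance from (5.5.20) are compared with the Bernstein–von Mises approximation, under which the posterior is approximately N(X̄, σ²/n).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, eta, tau = 1.0, 1.0, 2.0     # known sigma, N(eta, tau^2) prior (illustrative values)
theta_true = 0.3

for n in [10, 100, 1000]:
    x = rng.normal(theta_true, sigma, size=n)
    xbar = x.mean()
    w1 = sigma**2 / (n * tau**2 + sigma**2)          # w_{1n}
    post_mean = w1 * eta + (1 - w1) * xbar           # exact posterior mean
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)   # exact posterior variance
    # Bernstein-von Mises approximation: posterior approx N(xbar, sigma^2 / n)
    print(n, round(post_mean, 4), round(xbar, 4), round(post_var, 6), round(sigma**2 / n, 6))
```

The two means and the two variances agree to order n^{−1/2} and n^{−1}, respectively, as the example asserts.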
Example 5.5.2. Posterior Behavior in the Binomial-Beta Model. (Example 3.2.3 continued). If we observe Sₙ with a binomial, B(n, θ), distribution, or equivalently we observe X₁, ..., Xₙ i.i.d. Bernoulli(θ), and put a beta, β(r, s), prior on θ, then, as in Example 3.2.3, ϑ has posterior β(Sₙ + r, n + s − Sₙ). We have shown in Problem 5.3.20 that if U_{a,b} has a β(a, b) distribution, then as a → ∞, b → ∞,

[ (a + b)³ / (ab) ]^{1/2} ( U_{a,b} − a/(a + b) ) → N(0, 1) in law.   (5.5.21)

If 0 < θ < 1 is true, Sₙ/n → θ a.s., so that Sₙ + r → ∞ and n + s − Sₙ → ∞ a.s. P_θ. By identifying a with Sₙ + r and b with n + s − Sₙ we conclude after some algebra that, because θ̂ = X̄,

√n(ϑ − X̄) → N(0, θ(1 − θ)) in law

a.s. P_θ, as claimed by Theorem 5.5.2. □
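The same conclusion can be checked by Monte Carlo for the beta posterior (a sketch; the β(2, 3) prior and θ = 0.4 are arbitrary illustrative choices): the standard deviation of posterior draws of √n(ϑ − X̄) is compared with √(θ(1 − θ)).

```python
import numpy as np

rng = np.random.default_rng(2)
r, s = 2.0, 3.0          # beta(r, s) prior (illustrative)
theta_true = 0.4

for n in [20, 200, 2000]:
    sn = rng.binomial(n, theta_true)
    xbar = sn / n
    draws = rng.beta(sn + r, n + s - sn, size=100000)  # posterior draws of theta
    z = np.sqrt(n) * (draws - xbar)                    # posterior draws of sqrt(n)(theta - xbar)
    print(n, round(z.std(), 3), round(np.sqrt(theta_true * (1 - theta_true)), 3))
```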
Bayesian optimality of optimal frequentist procedures and frequentist optimality of Bayesian procedures

Theorem 5.5.2 has two surprising consequences.

(a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.

(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions.

As an example of this phenomenon consider the following.
Theorem 5.5.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let θ̂ be the MLE of θ and let θ* be the median of the posterior distribution of ϑ. Then

(i)

√n(θ* − θ̂) → 0   (5.5.22)

a.s. P_θ for all θ. Consequently,

θ* = θ + (1/n) Σ_{i=1}^n I⁻¹(θ) (∂l/∂θ)(X_i, θ) + o_{P_θ}(n^{−1/2})   (5.5.23)

and L_θ(√n(θ* − θ)) → N(0, I⁻¹(θ)).

(ii)

E( √n(|ϑ − θ̂| − |ϑ − θ*|) | X₁, ..., Xₙ ) → 0 a.s. P   (5.5.24)

and

E( √n(|ϑ − θ̂| − |ϑ|) | X₁, ..., Xₙ ) = min_d E( √n(|ϑ − d| − |ϑ|) | X₁, ..., Xₙ ) + o_P(1).   (5.5.25)

Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions l_n(θ, d) = √n(|θ − d| − |θ|). But the Bayes estimates for l_n and for l(θ, d) = |θ − d| must agree whenever E(|ϑ| | X₁, ..., Xₙ) < ∞. (Note that if E(|ϑ| | X₁, ..., Xₙ) = ∞, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.

Proof. By Theorem 5.5.2 and Polya's theorem (A.14.22),

sup_x | P[√n(ϑ − θ̂) ≤ x | X₁, ..., Xₙ] − Φ(x √I(θ)) | → 0 a.s. P_θ.   (5.5.26)

But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.11). Thus, any median of the posterior distribution of √n(ϑ − θ̂) tends to 0, the median of N(0, I⁻¹(θ)), a.s. P_θ. But the median of the posterior of √n(ϑ − θ̂) is √n(θ* − θ̂), and (5.5.22) follows. To prove (5.5.24) note that

|ϑ − θ̂| − |ϑ − θ*| ≤ |θ̂ − θ*|

and, hence, that

E( √n(|ϑ − θ̂| − |ϑ − θ*|) | X₁, ..., Xₙ ) ≤ √n |θ̂ − θ*| → 0   (5.5.27)

a.s. P_θ, for all θ. Because a.s. convergence P_θ for all θ implies a.s. convergence P (B.7), claim (5.5.24) follows and, hence,

E( √n(|ϑ − θ̂| − |ϑ|) | X₁, ..., Xₙ ) = E( √n(|ϑ − θ*| − |ϑ|) | X₁, ..., Xₙ ) + o_P(1).   (5.5.28)

Because, by Problem 1.4.7 and Proposition 3.2.1, θ* is the Bayes estimate for l_n(θ, d), (5.5.25) and the theorem follow. □
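Claim (i) of the theorem is easy to illustrate numerically in the beta-binomial setting of Example 5.5.2 (a sketch; the β(2, 2) prior and θ = 0.3 are arbitrary illustrative choices): the posterior median θ* and the MLE X̄ differ by o(n^{−1/2}).

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
r, s = 2.0, 2.0          # beta(r, s) prior (illustrative)
theta_true = 0.3

for n in [20, 200, 2000, 20000]:
    sn = rng.binomial(n, theta_true)
    mle = sn / n
    post_median = beta.ppf(0.5, sn + r, n + s - sn)   # theta* = posterior median
    print(n, round(np.sqrt(n) * (post_median - mle), 4))
```

The printed values of √n(θ* − X̄) shrink toward 0 as n grows, in line with (5.5.22).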
Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).

Bayes credible regions

There is another result illustrating that the frequentist inferential procedures based on θ̂ agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions of Theorem 5.5.2 are satisfied. Let

C_n(X₁, ..., Xₙ) = { θ : π(θ | X₁, ..., Xₙ) ≥ c_n },

where c_n is chosen so that Π(C_n | X₁, ..., Xₙ) = 1 − α, be the Bayes credible region defined in Section 4.7. Let I_n(γ) be the asymptotically level 1 − γ optimal interval based on θ̂, given by

I_n(γ) = [ θ̂ − d_n(γ), θ̂ + d_n(γ) ],

where d_n(γ) = z(1 − γ/2)/√(n I(θ̂)). Then, for every ε > 0 and every θ,

P_θ[ I_n(α + ε) ⊂ C_n(X₁, ..., Xₙ) ⊂ I_n(α − ε) ] → 1.   (5.5.29)

The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of √n(ϑ − θ̂) converges to the N(0, I⁻¹(θ)) density uniformly over compact neighborhoods of θ̂ for each fixed θ, is sketched in Problem 5.5.6. The message of the theorem should be clear: Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis, both in this case and in estimation, reveals that any approximation to Bayes procedures on a scale finer than n^{−1/2} does involve the prior. A particular choice, the Jeffreys prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n^{−1} order (see Schervish, 1995).
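A small sketch of (5.5.29) in the Bernoulli case (the uniform β(1, 1) prior is an arbitrary choice, and an equal-tailed credible interval is used as a stand-in for the credible region of the theorem): for moderate n the credible interval and the interval X̄ ± z(1 − α/2)/√(n I(θ̂)), with I(θ) = 1/[θ(1 − θ)], nearly coincide.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(4)
r, s, alpha = 1.0, 1.0, 0.05          # uniform prior (illustrative), level 0.95
theta_true = 0.35

for n in [50, 500, 5000]:
    sn = rng.binomial(n, theta_true)
    theta_hat = sn / n
    # equal-tailed Bayes credible interval from the beta(Sn + r, n + s - Sn) posterior
    lo, hi = beta.ppf([alpha / 2, 1 - alpha / 2], sn + r, n + s - sn)
    # Wald-type interval based on the MLE and estimated Fisher information
    half = norm.ppf(1 - alpha / 2) * np.sqrt(theta_hat * (1 - theta_hat) / n)
    print(n, (round(lo, 4), round(hi, 4)), (round(theta_hat - half, 4), round(theta_hat + half, 4)))
```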
Testing

Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of θ₀ given X₁, ..., Xₙ if H is false is of a different magnitude than the p-value for the same data. For more on this so-called Lindley paradox see Berger (1985) and Schervish (1995). However, if instead of considering hypotheses specifying the one point θ₀ we consider indifference regions where H specifies [θ₀, θ₀ + Δ) or (θ₀ − Δ, θ₀ + Δ), then Bayes and frequentist testing procedures agree in the limit. See Problem 5.5.2.
Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a priori possible. Second, we established the so-called Bernstein–von Mises theorem, actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the so-called Bernstein–von Mises theorem and frequentist confidence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1

1. Suppose X₁, ..., Xₙ are i.i.d. as X ~ F, where F has median F⁻¹(½) and a continuous case density.

(a) Show that, if n = 2k + 1,

E_F[ med(X₁, ..., Xₙ) ] = n (2k choose k) ∫₀¹ F⁻¹(t) t^k (1 − t)^k dt,
E_F[ med²(X₁, ..., Xₙ) ] = n (2k choose k) ∫₀¹ [F⁻¹(t)]² t^k (1 − t)^k dt.

(b) Suppose F is uniform, U(0, 1). Find the MSE of the sample median for n = 1, 3, and 5.

2. Suppose Z ~ N(μ, 1) and V is independent of Z with distribution χ²_m. Then T = Z / (V/m)^{1/2} is said to have a noncentral t distribution with noncentrality μ and m degrees of freedom. See Section 4.9.2.

(a) Show that

P[T ≤ t] = ∫₀^∞ Φ( t √(w/m) − μ ) f_m(w) dw,

where f_m(w) is the χ²_m density.
)
nX (b) If X1, . . . , X,. are i.i d N(p, u2 show that y' .
.
/
( ,.�1 l:: (X, - XJ2 ) i has a
noncentral t distribution with noncentrality parameter .fiiJL/IT and n dom.
-
1 degrees of free
(c) Show that T2 in (a) has a noncentral :Ft, m distribution with noncentrality parameter J12. Deduce that the density ofT is p(t)
=
00
2 L P[R = i] . !2i+l(f)[
where R is given in Problem B.3.12.
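A simulation sketch of the noncentral t construction in this problem (not part of the problem itself; μ = 1.5 and m = 10 are arbitrary illustrative choices): draws of T = Z/√(V/m) are compared with scipy's noncentral t distribution.

```python
import numpy as np
from scipy.stats import nct

rng = np.random.default_rng(5)
mu, m = 1.5, 10                       # noncentrality and degrees of freedom (illustrative)

z = rng.normal(mu, 1.0, size=200000)
v = rng.chisquare(m, size=200000)
t = z / np.sqrt(v / m)                # noncentral t draws built from the definition

qs = [0.1, 0.5, 0.9]
print(np.quantile(t, qs).round(3))            # Monte Carlo quantiles
print(nct.ppf(qs, df=m, nc=mu).round(3))      # reference quantiles from scipy
```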
' "
I
1.
\
Hint: Condition on IT I.
3. Show that if P[|X| ≤ 1] = 1, then Var(X) ≤ 1, with equality iff X = ±1 with probability ½ each.
Hint: Var(X) ≤ EX² ≤ 1.
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through foe.
(a) Show that the ratio of the Hoeffding function h( foE) to the Chebychev function c( Jii<) tends to 0 as Jii< � oo so that h(-) is arbitrarily better than c( - ) in the tails.
'I·.. 'i
( V:() - 1 gives lower results than h in
(b) Show that the normal approximation 24>
the tails if PIIXI < 1] = 1 because, if a2 < 1 , 1 - .P(t) -
•
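A numerical sketch related to part (a) (not part of the problem; the forms used below are the standard Hoeffding bound 2 exp(−t²/2) for variables bounded by 1 and the Chebychev bound 1/t² with σ² ≤ 1, written as functions of t = √n ε — check them against the definitions of h and c in Section 5.1): the ratio h/c tends to 0 in the tails.

```python
import numpy as np

# Hypothetical forms of the two bounds as functions of t = sqrt(n) * eps:
#   Hoeffding for |X_i| <= 1:      P(|Xbar| >= eps) <= 2 exp(-t^2 / 2)
#   Chebychev with sigma^2 <= 1:   P(|Xbar| >= eps) <= 1 / t^2
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = 2 * np.exp(-t**2 / 2)
c = 1 / t**2
print(np.column_stack([t, h, c, h / c]))   # the ratio h/c tends to 0 in the tails
```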
5. Suppose .\ : R � R has .\(0)
� oo.
0, is bounded, and has a hounded second derivative .\". Show that if X1, . . . , Xn are i.i.d., EX1 = f-1. and Var X1 = a2 < oo, then
! •
=
E.\( X - p.) = >.'(0)
;,/!; (!) +
0
as
n � oo.
Hint: JiiE(.\ ( I X - l•ll - .\( 0)) = E>.'(O)Jii i X - P.l + E
IX - 1'1 < IX - 1'1· The last term is .\'(O)a J== lzl
('·;· (X - p.)(X - t•?)
:S s upx l.\''(x) la2 jn and the first tends to
Problems for Section 5.2 1. Using the notation of Theorem 5.2.1, show that
2. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²). Show that for all n ≥ 1 and all ε > 0,

sup_{(μ,σ)} P_{(μ,σ)}[ |X̄ − μ| ≥ ε ] = 1.

Hint: Let σ → ∞.
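A Monte Carlo sketch of the point of this problem (not part of the text; n = 25 and ε = 0.1 are arbitrary illustrative choices): for fixed n, P_{(μ,σ)}[|X̄ − μ| ≥ ε] approaches 1 as σ grows, so the convergence X̄ → μ is not uniform over the parameter space.

```python
import numpy as np

rng = np.random.default_rng(8)
n, mu, eps = 25, 0.0, 0.1

for sigma in [1.0, 10.0, 100.0, 1000.0]:
    xbar = rng.normal(mu, sigma, size=(20000, n)).mean(axis=1)
    print(sigma, round(np.mean(np.abs(xbar - mu) >= eps), 3))
```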
3. Establish (5.2.5).
1
Hint: liJn - q(p)l > < => IPn - P I > W- ( <).
4. Let (Uᵢ, Vᵢ), 1 ≤ i ≤ n, be i.i.d. with distribution P ∈ P.

(a) Let γ(P) = P[U₁ > 0, V₁ > 0]. Show that if P = N(0, 0, 1, 1, ρ), then

ρ = sin( 2π( γ(P) − ¼ ) ).

(b) Deduce that if P is the bivariate normal distribution, then

ρ̃ = sin( 2π( n⁻¹ Σ_{i=1}^n 1(Xᵢ > X̄) 1(Yᵢ > Ȳ) − ¼ ) )

is a consistent estimate of ρ.

(c) Suppose ρ(P) is defined generally as Cov_P(U, V)/√(Var_P U · Var_P V) for P ∈ P ≡ {P : E_P U² + E_P V² < ∞, Var_P U · Var_P V > 0}. Show that the sample correlation coefficient continues to be a consistent estimate of ρ(P) but ρ̃ is no longer consistent.
.
5. Suppose X1 , . . , X
n
are
i.i.d.
N(IJ, o-5) where ao is known.
(a) Show that condition ( 5.2.8) fails even in this simplest case in which X
Hint: sup
u. r
n "' 1.. n Lt= 1
((Xi-J.l)2 - (1 (J.t-J.to)2) ) +
,.2 0
a-1 0
=
o:J.
.& IJ is clear.
(b) Show that condition (5.2. 14)(i),(ii) holds. Hint: K can be taken as [-A, A], where A is an arbitrruy positive and finite constant.
6. Prove that (5.2.14)(i) artd (ii) suffice for consistency. 7. (Wald) Suppose B
�
p( X, B) is continuous, B E R and
(i) For some <(Bo) > 0
Eo, sup {ip(X,B') - p(X,B)i : IB - B' I < <(Bo)} < oo. (ii) Eo, inf{p(X, B) - p(X, Bo) : I B - Bol > A } > 0 for some A
< oo.
�
Show that the maximum contrast estimate B is consistent. Hint: From continuity of p, (i) and the dominated convergence theorem, .
i m Eo, sup{ ip(X,B') - p(X,B) i : B' E S'(B, o)} � o l5-o where S( B, o) is the o ball about B. Therefore, by the basic property of maximum contrast estimates, for each B f> B0, and < > 0 there is o(B) > 0 such that
Eo, inf{p(X, B') - p(X, B0) : B' E S(B, i (B))) > <.
By compactness there is a finite number 81 , . . , Or of sphere centers such that .
K n {B : IB - Bol > .\}
c
T
U S(B, J (B, )) j= 1
Now
inf
1 n
-n L {p(X. , B) - p(X;, Bo)} : B E K n {I : 18 - Bol > .\ . ·�
1
'' •
'
'
' '
348
Asymptotic Approximations
2: min
l<j
�n .
n
I; inf{p(X, , O' ) - p(X, Oo)} :
t= l
Chapter 5
B' E S(81, o(8,))
For r fixed apply the law of large numbers.
8. The condition of Problem 7(ii) can also fail. Let X� be i.i.d. N(J.", a-2 ). Compact sets [( can be taken of the form {11'1 < A, ' :;; a :;; 1/<, ' > 0}. Show that the log likelihood tends to oo as a ---+ 0 and the condition fails.
'
9. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consis tency on K. •
l
10. Extend the result of Problem 7 to the case 8 E RP, p >
1.
"
Problems for Section 5.3
I. Establish (5.3.9 ) in the exponential model of Example 5.3.1. 2. Establish (5.3.3) for j odd as follows: (i) Suppose Xf, . . . , X� are i.i.d. with the same distribution as X1, 1 pendent of them, and let X' = n- EX;. Then EIX - 1'11 < EI X .
(ii)
,
•
Xn but inde
- X' 11, If are i.i.d. and take the values ±1 with probability � . and if c1, , Cn are con stants, then by Jensen's inequality, for some constants Mi, ti
•
n
E L ci€i i=l (iii) Condition on IX; (iv)
.
'
<
n
EJ�I L "''' i=l
j+l < -
M· 3
•
•
t
n
Lc� i=l
•
- X; i , i = 1, . . . , n, in (i) and apply (ii) to get .
t
E II:� 1 (X; - X:JI' < M; E [2::� 1 (X; - X:J2] ' < M1n� E [,; 2::�-1 (X; - X;) 2] > < M;n1 E (,; L:iX; - X;!l) l'lj.
<
M;n > EIX1 -
3. Establish (5.3.11). Hint: See part (a) of the proof of Lemma 5.3.1. 4. Establish Theorem 5.3.2. Hint: Taylor expand and note that if i1 + · · · + id = m
d E II (Yk - JJ.d' k=l
d
<
<
1 m m + L ElY• - l'•lm k=l Cmn-k/2_
'
'
Suppose ad >
0, 1 < j < m, "L,�= I ii = m then a�1 ,
•
•
•
m
m
, a�d < [max(a1, . . . , ad)] m <
L aj j =l
m
< mm - I L aJ. J=l 5. Let X1 ,
•
•
•
, Xn be i.i.d. R valued with sup { IE( X;, . . .
6. Show that if E IXd i < oo, j
EX1 = 0. Show that
, X;, ) I : i,, . . >
.
, i1; j
2, then E I X1
�
Hint: By the iterated expectation theorem
E I Xt
�
= 1, . . . , n} = E I Xt l ·
!Lii < 2i E I X1IJ.
P·lj = E{I X, � ILl' I I Xd > IILI}P( IXtl > ll•ll +E{I X, l"lj I I Xt l < IILI}P(I Xt l < ll"l l · �
7. Establish
5.3.28.
8. Let XJ , . . . , Xn1 be i.i.d. F and Y]. , . . . , Yn2 be i.i.d. C, and suppose the X's and Y's are independent.
(a) Show that if F and G are N(/Lt afl and N(!L2 , aD, respectively, then the LR test of H : uf = a� versus K : af # a� is based on the statistic sT /s�, where sT = (nt 1)- t 2::7' 1 (X; X) " s� = (n2 1)- 1 2::7' 1 (1-j Y)" . ,
�
�
,
�
(b) Show that when F and .Fk,m distribution with k = n1
G are nonnal as in part (a), then (si/af)f(s�fa�) has an
�
(c) Now suppose that F and
and that 0
Ck,m = 1
�
1 and rn =
n2 - 1.
G are not necessarily nonnal but that
< Var( Xf) < oo. Show that if m = .\k for some .\ > 0 and +
�<(k + m) ZI -a , K, = Var [( X1 km
-
/Jl ) /0"1] 2 ,
Then, under H : Var(Xt) = Var(Y1), P (s[! s� :;
(d) Let Ck ,m be ck,m with
K.,
/JI
ck,m) � 1
= E(XI ), a21 = Var (Xt ) �
a
as k �
.
oo.
replaced by its method of moments estimate. Show that under the assumptions of part (c), if 0 < Exr < oo, PH (st/s� < Ck ,=) --+ 1 - a as k ----t 00.
i
:;
350
Asymptotic Approximations
Chapter 5
(e) Next drop the assumption that G E 9. Instead assume that 0 < Var(Y?) < :x: . Under the assumptions of part (c), use a normal approximation to find an approximate critical value qk,m (depending on �· , � Var[ (X 1 - JL 1 )/ CJr]2 and ><2 = Var[( X2 - JL2)/CJ2)2 such that PH(si/s� < Qk,m ) ----+ 1 - a as k _____. oo .
•
(0 Let iik,m
K:t
and K2 replaced by their method of moment estimates. Show that under the assumptions of part (e), if 0 < EX� < oo and 0 < EY18 < oo, then P(si/s� :::; fik,m) ----+ 1 - a as k -----> oo. '' '
•
' �•
' •
' •
' '
be Qk,m with
9. In Example 5.3.6, show that (a) If l't = 1'2 � 0, CJ1 = o-2 = I, vn((C - p), (iJ/ - 1), (ui - l))T has the same asymptotic distribution as n1 [n- 1 EX1Y;: - p, n- 1 EX[ - 1 , n-I EY? - lJT . (b) If (X, Y) � N(Jtl, /l2 , "1, ":f, p), then jil(r 2 - p2) _s N(O, 4p2 (I - p2 ) 2 ) and, if p ol 0, then jn(r - p) � N(O, (1 - p' J2). (c) Show that if p = 0, jil(r - p) � N(O, 1). Hint: Use the central limit theorem and Slutsky's theorem. Without loss of generality, JL 1 = JL2 = 0, cri = cri = 1.
(���)
10. Show that � log is the variance stabilizing transformation for the correlation coefficient in Example 5.3.6. I + I · l - l H'mt.· Wnte (1 p)2 2 1-p l+P . _
(
)
11. In survey sampling the model-based approach postulates that the population {xi , · . . , xN} or {(ui , xi), . . . , (uN, XN) } we are interested in is itself a sample from a superpopulation that is known up to parameters; that is, there exists TI , . , TN i.i.d. Po, () E 8 such that Ti = ti where ti = (ui,xi), i = 1, N In particular, suppose in the context of Example 3.4.1 that we use Ti 1 Tin, which we have sampled at random from {t 1 , tN}, to estimate x = � E� 1 Xi. Without loss of generality, suppose ij = j, 1 < j < n. Consider as estimates .
. . . 1
1
1
•
•
(1.) X = � -
•
•
•
.
.
1
•
x,+ n+X. when 7: -= ( ..
�
t
" Uz,
t
X )·
-
(ii) X R = bopt (U - u) as in Example 3.4.1, Show that. if � then
----+
,\ as N
----+
oo, 0 < ...\ < 1, and if EX'f < oo (in the supermodel),
(a) jii(X - x) ..S N(o, 72 (1 - .>,)) where 7 2 = Var(XI ). (b) Suppose Po is such that X I = bUi+t::i , i = 1, . . . , N where the Ei are i.i.d., Et::i = 0, Var(<;) = o-2 < oo and Var(U) > 0. Show that jii(XR -x) ..S N(O, (l-A)o-2 ) , "2 < Hint: (a) X - X = (1 - � ) (X - Xc) wliere xc = N�n Et n+I xi . � (b) Use the delta method for the multivariate case and note (bopt - b)(U - u) -
72
-
(
op n
- >'
).
' •
�
]
'
l '
Section 5.6
351
Problems and Complements
12. (a) Suppose that
EJYd3 < oo. Show that JE(Y. - l'a)(Y, - l'b)(Y, - /'c ) J < Mn-2
(b) Deduce formula (5. 3 .1 4 ) Hint: If U is independent of (V, W), EU .
13. Let Sn have a
X� distribution.
(a) Show that if n is large,
..;s;:
-
=
0, E(WV) <
oo,
then E(UVW )
.Jii has approximately a N(0,
=
0.
�) distribution. This
is known as Fisher's approximation. (b) From (a) deduce the approximation P[Sn < x] "' 1>( J2X y'2ri} (c) Compare the approximation of (b) with the central limit approximation P[Sn < x] = .P((x - n)/ffn) and the exact values of P[Sn < x] from the x' table for x = xo.go, X = XQ.99, n = 5, 10, 25. Here Xq denotes the qth quantile of the X; distribution.
X1 , . . . , Xn is a sample from a population with mean Jl-, variance a2 , and third central moment Jl-3 . Justify fonnally 3 E [h(X ) - E(h(X ))J' = __!, [h'(l') ] 311a + h"(I'Jih' (l' ) ] 2u4 + O(n-3 ). n2 n 14. Suppose
Hint: Use (5.3.12). 15. It can be shown (under suitable conditions) that the nonnal approximation to the distri bution of h( X) improves as the coefficient of skewness 'YI n of h(X) diminishes.
(a) Use this fact and Problem 5.3.14 to explain the numerical results of Problem 5.3. 13(c). (b) Let Sn X� · The following approximation to the distribution of Sn (due to Wilson and Hilferty, 1931) is found to be excellent ,......
Use (5.3 .6 ) to explain why. 16. Normalizing Transformation for the Poisson Distribution. Suppose X1 , . . , Xn is a sample from a P(.\) distribution. .
(a) Show that the only transformations h that make E[h(X) - E(h(X))J' up to order 1/n2 for all ,\ > 0 are of the form h(t) = ct2i3 + d.
=
0 to terms
(b) Use (a) to justify the approximation
•
17. Suppose X�, . . . , Xn are independent, each with Hardy-Weinberg frequency function f given by
352
Asymptotic Approximations
X
J(x)
0 B
I 2B(I - B)
Chapter 5
2 (I - B)2
where 0 < e < I. (a) Find an approximation to P[X < tJ in terms of fJ and t.
(b) Find an approximation to P[VX < t) in terms of B andt. (c) What is the approximate distribution of y'ii( X -- Jl.) + X2 , where J1. = E(X1 )?
18. Variance Stabilizing Transfo111U1tion for the Binomial Distribution. Let X1 , Xn be the indicators of n binomial trials with probability of success B. Show that the only variance stabilizing transformation h such that h(O) = 0, h( l ) = 1, and h'(t) > 0 for all t, is given by h(t) = ( 2/7r)sin- 1 (vt). .
.
•
1
19. Justify formally the following expressions for the moments of h(X, Y) where (X, , Yi ), . . . , (Xn, Yn) is a sample from a bivariate population with E(X) = I' I · E(Y) = J1.2, Var(X) = uf, Var(Y) = u�, Cov (X, Y) = pu1u2 . (a)
1 ) ( Y)) O(n = E(h(X , h J1.1, Jt)2 + - ).
(b) Var(h(X, Y)) "' �{[h, (Jl.t, 1'2)) 20'7 n 2 2 (n + +2h, (Jt" 1'2 lh2 (J1.,, J1.2)pu, "'2 + [h2 (J1." 1'2ll ,.n o - ) where
a a h1(x, y) = &,h(x,y), h2(x,y) = ayh(x,y) .
Hint: h(X, Y) - h(Jl.t, !'2 ) = h, (Jl.t, J1.2)(X - Jl.t ) + h2 (J1.1, J1.2)(Y - J1.2 ) + 0(n - t ) . 20. Let Bm,n have a beta distribution with parameters m and n, which are integers. Show that if m and n are both tending to oo in such a way that m/ (m + n) ---> a, 0 < a < I,
then
(Bm,n - mj (m + n)) x v' < m+n P y'a(l - a)
___,
'li (x) .
. - - -1 where X,, . . . , Xm, Y1 , . . . , Yn - Hmt: Use Bm,n = (mX/nYJII + (mXjnY)) independent standard exJX>nentials.
are
21. Show directly using Problem B.2.5 that under the conditions of the previous problem, if m/(m + n) - a tends to zero at the rate 1/(m + n) 2 , then <>(1 - a) m E(Bm 'n) = , Var Bm 'n = + Rm 'n n + m m+n
where Rm,n tends to zero at the rate 1/(m + n) 2 .
< <
i
Section 5.6 Problems and Complements
353
22. Let Sn ,..,_, X�· Usc Stirling's approximation and Problem 8.2.4 to give a direct justifi cation of where Rn / fo -+ 0 as in n -+
oo.
Recall Stirling's approximation:
(It may be shown but is not required that I foR, I is bounded.) n
23. Suppose that X1 , . . . , X is a sample from a population and that h is a real-valued func tion of X whose derivatives of order k are denoted by h(k), k > I. Suppose IM41(x)l < M for all x and some constant M and suppose that 114 is finite. Show that Eh(X) h(Jl.) + ih(21 (Jl.) + : + Rn where I Rn l < M31 (Jl.)IJ1.3I/6n2 + M(J1.4 + 30'2) /24n2 Hint: "
Therefore,
24. Let
X 1 , • . • , Xn be a sample from a population with mean J1 and variance a2 Suppose h has a second derivative h<2) continuous at fL and that h(l)(Jl) = 0.
< oo.
(a) Show that fo[h(X)-h(Jl.)] � 0 while n[h(X -h(Jl.)] is asymptotically distributed as iM'I(J1.)0'2V where V � XI·
(b) Use part (a) to show that when J1. = �. n[X(l - X) - Jl.(l - Jl.)] !:. -0"2V with V xi . Give an approximation to the distribution of X ( 1 - X) in tenns of the xi distribution function when J1 = �. "'
25. Let XI> . . . , Xn be a sample from a population with 0'2 = Var(X) < oo, J1. = E(X) and let T = X2 be an estimate of p,2• (a) When J1. of 0, find the asymptotic distribution of fo(T-Jl.') using the delta method.
(b) When J1. = 0, find the asymptotic distriution of nT using P(nT < t) = P(-Vt < foX S Vt). Compare your answer to the answer in part (a). (c) Find the limiting laws of fo(X - Jl.)' and n(X - J1.)2.
354 26.
Asymptotic Approximations Show that if
xl ' . .
Chapter 5
, XII are i.i.d. N(p, cr2), then
.
£ N(O, O, Eo) vc:n(X - p,(f2 - a2 ) � where E0 = diag (a2 , 2a4 ) . -
Hint:
27.
Use
(5.3.33) and Theorem 5.3.4.
(X1 , Yt ), . . . , (Xn, Yn ) are n sets of control and treatment responses in a
Suppose
matched pair experiment. Assume that the observations have a common N(JJ.1 , J.t2, CYi, a�, p) distribution. We want to obtain confidence intervals on
t
IL2 - J-LI = �-
Yi - Xi we treat X1 ,
of using the one-sample intervals based on the differences . . . , Yn as separate samples and use the two-sample t intervals
Y1 ,
n
Analysis for fixed
is difficult because
n - oo and (a) Show that
no longer has a
is given by
(4.9.3), then limn P[t. E In] > I -
<>
where the right-hand side is the limit of .JTi times the length of the one-sample based on the differences.
Hint:
i t
28.
Suppose
Xt, . . . , Xn1
E(XI ), a/
=
and
Yt,
.
.
ni (n � >., 0 <
' '
(b)
, Yn2
Deduce that if ,\
<
=
4.9.3
are as in Section
E(YI ), and u� =
T(Ll.)
,\ < I .
(a) Show that P[T(t.) '
.
Var(XI), /"2
the behavior of the two-sample pivot
'
.
, Xn ,
<>.
)
t interval
(a), (c) Apply Slutsky's theorem.
with I'I =
I
•
is the length of the interval In,
-
i
.
(4.9.3). What happen s? '12n-2 distribution. Let
Jlnl vnllnl � 2vu1 + a�z(1 �") > 2( Jar + a� - 2pawz)z(1 �
(c) Show that if
I
.
(t [1 - 1�',"�fJ l).
P[T(t.) < t] �
(b) Deduce that if p > 0 and In
T(�)
Suppose that instead
Var(YJ ). We want to study
of Example 4.9.3. if n� , n2 ---+ oo. so that
t) �
= �
or
a1
=
independent samples
- >.)a/ + .\a�)]:).
az, the intervals (4.9.3) have correct asymptotic
probability of coverage.
(c) Show that if a� >
of coverage < •
'
I
'I !
'
'
'I ' '
1
-
a,
a/ and >. > 1 - ,\, the interval (4 9 3) has asymptotic probability ,
.
whereas the situation is reversed if the sample size inequalities and
variance inequalities agree.
(d) Make a comparison of the asymptotic length of (4.9.3) the pivot jD - t.j(so where D and so are as in Section 4.9.4.
29. Let T = (D - t.) /so
that
E(Xf) <
(a) Show n2 ---+ oo.
where
D, t. and so
and the intervals based on
are as defined in Section
4.9.4.
Suppose
oo and E(Y/) < oo.
that
T has
asymptotically a standard normal distribution as
n1 ---+
oo and
'
,
l .
I
I I
'
l
'
,
!
'
l
Section 5.6
355
Problems and Complements
(b) Let k be the Welch degrees of freedom defined in Section 4.9.4. Show that k � ex: as n 1 ----.) oo and n2 oo. ---+
P,2 in favor of (c) Show using parts (a) and (b) that the tests that reject H : Ji.I K : f..L2 > p,1 when T > tk( l - a) , where tk (l - a) is the critical value using the Welch =
approximation, has asymptotic level a.
(d) Find or write a computer program that carries out the Welch test. Carry out a Monte Carlo study such as the one that led to Figure 5.3.3 using the Welch test based on T rather than the two-sample t test based on Sn. Plot your results. 30. Generalize Lemma 5.3.3 by showing thai if Y 1 , . . . , Y E Rd are i.i.d. vectors and EIYt l k < oo, where I · I is the Euclidean norm, then for all integers k: n
where C depends on d, ElY 1 [ k and k only. Hint: If [x[1 � L;:� l [xj [, x � (x 1 , . . . , xdf and [xl is Euclidean distance, then there exist universal constants 0 < Cd < cd < 00 Such that cdlx l l < l xl < Cd l x! J .
31. Let X1 , . . . , Xn be i.i.d. as X � F and let I' = E(X),
has a X� - I distribution when F is theN(JJ, (j2) distribution. (a) Suppose E(X4) <
oo.
(b) Let Xn-1 (a) be the ath quantile of x;_J• Firld approximations to P(Vn < Xn-1 (a)) and P(Vn :'> Xn-l (I - a)) and evaluate the approximations when F is Ts . Hint: See Problems B.3.9 and 4.4.16. . (c) Let R be the method of moment estimaie of"" and let
Va � (n - 1) + y'i<(n - l)z (a).
Show that ifO
<
EX8 < oo, then P(Vn < Va,) ----.) a as n ---+ oo.
32. It may be shown that if Tn is any sequence of random variables such that Tn £ T and
if the variances ofT and Tn exist, then lim infn Var(Tn)
>
Var(T). Let
where X is uniform, U( - 1 , 1). Show that as n ---+ oo, Tn £ X, but Var(Tn) --+ 00. 33. Let X;j (i
�
I, . . . ,p; j
�
I,..
.
, k) be independent with X,,
�
N(J';, u2).
(a) Show that the MLEs of l'i and <12 are
k
Iii
=
p
k
k- 1 L X;j and iT2 = (kp)- 1 L L (X,j - /1; )2
j= l
i=l j=l
356
Asymptotic Approximations
Chapter 5
(b) Show that if k is fixed and p � oo, then <72 !. (k - 1)cr2(k. That is the MLE jl2 is not consistent (Neyman and Scott, 1948).
'
! •
(c) Give a consistent estimate of a2 . Problems for Section 5.4
1. Let X1, !/! : R� R
.
. , Xn
.
be
i.i.d. random variables distributed according to P E P. Suppose
(i) is monotone nondecreasing (ii) ¢(-oo)
<
0 < .P(oo)
'
(iii) I.PJ(x) < M < oo for all x. (a)
Show that (i), (ii), and (iii) imply that O(P) defined (not uniquely) by
i
Ep,P(X1 - O(P)) 2 0 > Ep!/J(Xt - II ) , ali i/ > O(P)
is finite. (b) Suppose that for all P E P, O(P) is the unique solution of Ep,P(X1 - 0) = 0. Let On = O(P), where P is the empirical distribution of X1, . . . , Xn. Show that Bn is consistent for O(P) over 'P. Deduce that the sample median is a consistent estimate of the population median if the latter is unique. (Use ,P(x) = sgn(x).) Hint: Show that Ep1jJ(X1 - B) is nonincreasing in B. Use the bounded convergence .......
.....
-..
-..
theorem applied to ,P (Xt - 0) !, !/!( -oo) as 0 � oo.
(c) Assume the conditions in (a) and (b). Set (.(0) - Ep>jJ(Xt - 0) and T2(0) Varp¢(X 1 - 0). Assmne that ,\'(0) < 0 exists and that
I'
1 VnT
I ! ' •
i
-
I
'
.
•
i
• •
!•
I
[!/! (X; - On) - A(On)] !:, N(O, 1) � (O)
c
..fti(On - 0) � N Hint: P( ..jii(On - 0)) < t) = P(O < On)
(d) Assume part (c) and A6. A'(O) = Cov(,P(X, - 0), f(X,)).
(0,
T2(0) [A'(O)J'
)
I
1
i
•
•
•
'
i'
'
I I
, I
for every sequence {On} with On = 0 + t(..jii for t E R. Show that
'
I
n
!
'
I
'
'
.
'
n = P ( - 2::. �1 ,P(X; - On) < 0).
Show that if f(x)
(e) Suppose that the d.f. F(x) of X, is continuous and that f (O)
l
F' (x)
exists, then
•
\
' •
=
F'(O) exists. Let
X denote the sample median. Show that, under the conditions of (c), ..jii(X - 0) !:.
N(O, 1/4/'(0)).
� -----------------------------------------------------.... •
Problems and Complements
Section 5.6
�
.
(0 For two estimates 01 and
..-.
357 ..-.
.
c
.
J
B2 With .,jii(B1 - B) � N(O, cr]). = I, 2, the ..-.
..-.
of 81 with respect to 82 is defined as ep(81, 82 ) = then = 1r/2. ..-.
relative efficiency
..-.
-
a?fo.Y.
asymptotic
Show that if
N(J", cr2) , ep(X, X ) (g) Suppose X1 has the gross error density j,(x - B) (see Section 3.5) where f,(x) = ( I c)
-
r
2. Show that assumption A4' of this section coupled with A0-A3 implies assumption A4.
Hint:
( f!,P f!, P X X Es B(p) ( : [ t -B(p)[ < en ,, ( ,,t) L ;:; f!B f!.p ) f! f! { , P < Esup f!B (X�, t) - f!B,P (X,, B(p)) : [ t - B(p)[ < En } . up
I
"
t= l
Apply A4 and the dominated convergence theorem B.7.4. 3, Show that A6' implies A6. Hint:
b<
oo,
ff0 f.p(x,B)p(x,O)dl"(x) = J g,(.P(x, B)p(x,B))dJ"(x) if for all -oo < a <
[ J -fo (,P(x, B)p(x,B)d!"(x)) J ,P(x,b)p(x,b)dJ"(x) - J .p(x,a)p(x,a)dJ"(x). =
Condition A6' permits interchange of the order of integration by Fubini's theorem (Billings ley, 1979) which you may assume.
X, , . . . , Xn be i.i.d. U(O , B), B > 0. (a) Show that g� (x, B) = i for 0 > x and is undefined for B < x. Conclude that g�(X,B) is defined with Pe probability I but f!l I f!l Ee De(X, B) = - 7" 0, Yare (X , B) = 0. B f!B (b) Show that if B = max(X�o ... ,Xn) is the MLE, then .Ce(n(B - B)) � [ (!/B). Thus, not only does asymptotic nonnality not hold but 8 converges B faster than at rate with I(B) = oo, not 0! n-112• This is compatible ( P [n(B B) < x[ = I - ! - ,:'e ) � I - exp(-x/B ) . 4. Let
-
Hint:
e
-
"
358
Asymptotic Approximations
Chapter 5
5. (a) Show that in Theorem 5.4.5
p (x., eo + -;in) log � p(X;, Oo ) n
i.
�
'Y
n
& fo � &e log p(X , B ) ,
•
0
•
' •
,
'
' '
'
'
under Pe0 , and conclude that
'
' '
"
(b) Show that
i '
I
p (X;, Oo + -;)n) log � p(X;,Oo)
(
2 p (X;, Oo + Jn ) 2l(O) . log tl(00),"f � N � p(X;,Oo)
n
£,,+7.<
)
'
!
' ;
1i
(c) Prove (5.4.50). Hint: (b) Expand as in (a) but around 00 (d) Show that P,
o
[ + Vn -"-
n L:,·-t log -
+ -;)n. p( X.,Oo + ;l;;n )
using (b) and Polya's theorem (A.l4.22).
p(X.,'o )
= Cn
]
___,
0 for any sequence { Cn } by
6. Show that the conclusions of Prqf:>le� 5.4.5 continue to hold if ' '
' ' •
'
I j
'
;
is replaced by the likelihood ratio statistic
'
I
1
p(X;, Oo + -;)n) � L., log ---,-,-;-;,-i'-''p(X;, Bo) i�l
I
I
'
.
I
I
'
I '
'
7. Suppose A4', A2, and A6 hold for .p � &l(&B so that I(B) < oo. Show that B 1(0) is continuous. Hint: () ---+ g;� (X, 8) is continuous and �
sup
E,g;!(X,O)
-!(B)
and
{ :;; (X,B') • JO - B'J <
if
I
�
------
Section 5.6 Problems and Complements
359
8. Establish (5.4.54) and (5.4.56). Hint: Use Problems 5.4.5 and 5.4.6.
9. (a) Establish (5.4.58). Hint: Use Problem 5.4.7, A.l4.6 and Slutsky's theorem. (b) Suppose the conditions of Theorem 5.4.5 hold. Let B� be as in (5.4.59). Then (5.4.57) can be strengthened to: For each Bo E 8, there is a neighborhood V(Bo) of 80 1 - a. such that limn sup{Po[B� < BJ : B E V(Bo)} �
10. Let
�
n
&2 1 � 1 I � - - L ' (X;, B). &B n i=l
(a) Show that under assumptions (AO)-(A6) for 1j! consistent estimate of I ( fJ).
�
at all B and (A4 ), f is a '
(b) Deduce that
is an asymptotic lower confidence bound for fJ. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5.4.59) coincide. 11. Consider Example 4.4.3, setting a lower confidence bound for binomial (}. -
�·
(a) Show that, if X = 0 or 1 , the bound (5.4.59), which is just (4.4.7), gives B � X. Compare with the exact bound of Example 4.5.2. (b) Compare the bounds in (a) with the bound (5.4.61), which agrees with (4.4.3), and give the behavior of (5.4.61) for X � 0 and 1 . 12. (a) Show that under assumptions (AO)-(A6) for all B and (A4 ) '
.
_ 8• + Op n- II ' B• ( ) -nj - -n
for j
�
1 , 2.
-
13. Let Bn 1 , Bn2 be two asymptotic Ievel l - a lower confidence bounds. We say that Bn 1 is asymptotically at least as good as Bn2 if, for all f > 0 -
-
-
Show that 8�1 and, hence, all the B�i are at least as good as any competitors. Hint: Compare Theorem 4.4.2.
360
Asymptotic Approximations
Xn are i.i.d. inverse Gaussian with parameters J.L and A, where Jl is known. That is, each Xi has density 1/ ,\x 2 ,\ x exp - 21-1 + Jl. ; > 0; J-l > 0; ,\ > 0. 2rrx3 2x 2 14. Suppose that
X1,
Chapter 5
. • .
,
( )
.\ .\ }
{
(a) Find the Neyman-Pearson (NP) test for testing H : ,\ (b) Show that the NP test is UMP for testing H
:
=
Ao versus K : A
<
A0 .
.\ > >.o versus K >. < >.o. :
(c) Find the approximate critical value of the Neyman-Pearson test using a normal approximation. (d) Find the Wald test for testing H : ,\ = .\o versus K
:
.\
.\o.
<
(e) Find the Rao score test for testing H : ,\ = ..\o versus K
:
..\ <
1
Ao.
15. Establish (5.4.55). Hint: By (3.4.10) and (3.4.11 ) , the test statistic is a sum of i.i.d. variables with mean zero and variance 1(0). Now use the central limit theorem. Problems for Section 5.5 1. Consider testing H : J.t = 0 versus K : 11- =j:. 0 given X1, . . . , Xn i.i.d. Consider the Bayes test when p, is distributed according to 1r such that
N(Jt, 1).
1 > rr({O}) = .\ > 0, rr(p 7" 0) = 1 - .\ and given I' 7" 0, I' has aN(O, T2) distribution.
!
(a) Show that the posterior probability of {0} is
1 n where mn(.fii,X) = r(l + r2)- 12rp
( (I+��;)l/2).
Hint: Use Examples 3.2.1 and 3.2.2.
i I
-
(b) Suppose that I' = 0. By Problem 4.1.5, the p-value fJ = U(O, 1) distribution. Show that (J I'. 1.
'
I
2[1 -
(c) Suppose that Jl = 8 > 0. Show that jj;(J I'. oo. That is, if H is false, the evidence against H as measured by the smallness of the p-value is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "para dox"). 2. Let X" . . . , Xn be i.i.d. N(Jl, 1) . Consider the problem of testing H K ; J1. > �. where .6. is a given number.
:
Jl
I
•
E
[0, LJ.] versus
(a) Show that the test that rejects H for large values of y'n(X - LJ.) has p-value p =
'
'
'
Section 5.6
361
Problems and Complements
(b) Suppose that 1-L has a N(O, 1 ) prior. Show that the posterior probability of H is where an = nf (n + 1).
-
�. -yn(anX - �)/ Ja,; (Lindley's "paradox" of Problem 5.1.1 is not in effect.) (c) Show that when
Jl =
c
�
N(O, 1 ) and p
c
�
U(O, 1).
(d) Compute plimn�oo p/pfor p. fo �. (e) Verify the following table giving posterior probabilities of [0, �] when ynX 1 .645 and p = 0.05. 10 .029 .058
n 1'> = 0. 1 "' = 1.0
20 .034 .054
50 .042 .052
=
100
.046 .050
3. Establish (5.5.14). Hint: By (5.5.13) and the SLLN,
logdnqn(t) t2 -2
=
(
(�
n &2/ &2/ ) t ) I e I(O) - f1 88, (x,, • (t)) - 8 , (X,,B0) + logn 0 + Vn n 0
•
Apply the argument used for Theorem 5.4.2 and the continuity of n (B). 4. Extablish (5.5.17). Hint:
<
n l
n
&' { f1 sup &B2 l(X,, B') : iB' - B[ < o}
�
if [t[ < oyn. Apply the SLLN and o ous at 0 = 0.
E9sup
{ ;;,z(X,,B')
: [B - B'l < o} continu
5, Suppose that in addition to the conditions of Theorem 5.5.2, J B'n(B)dB < yn(E(9 I X) - B) � 0 a.s. Pe. Hint: In view of Theorem 5.5.2 it is equivalent to show that dn
J tqn(t)dt
�
oo.
Then
0
M a.s. P,. By Theorem 5.5.2, J_ M tqn(t)dt � 0 a.s. for all M < oo. By (5.5.17), :J'lfo J:J'lfo [t [ exp -iJ(B)'; dt < ' for M(<) sufficiently large, all [ < [t qn(t)dt J
{
}
l
Asymptotic Approximations
362
Chapter 5
'
<>
0. Finally, n
l)l(X,, t) - l(X,})) 7r(t)dt. i=l
,,
'
'
Apply (5.5.16) noting that
I
foc"'''·0 1
6. (a) Show that sup([qn (t)
-
� 0 and
I• (O)
'
:
J [t[1r(t)dt < co. [t[ <
M} � 0 a.s. for all e.
(b) Deduce (5.5.29). Hint: {t : ,jl(Bj
we must have Cn =
I
',!.
c(z1_ � [l(B)n)-112)(! + op(!)) by Theorem 5.5.1.
and A5. Show that (5.5.8) and in
d.
�
7. Suppose that in Theorem 5.5.2 we replace the assumptions A4(a.s.)
•
/ in
and A5(a.s.) by A4
(5.5.9) hold with a.s. convergence replaced by convergence
Po probability.
5.7
NOTES
Notes for Section 5.1

(1) The bound is actually known to be essentially attained for Xᵢ = 0 with probability pₙ and 1 with probability 1 − pₙ, where pₙ → 0 or 1. For n large these do not correspond to distributions one typically faces. See Bhattacharya and Ranga Rao (1976) for further discussion.
Notes for Section 5.3

(1) If the right-hand side is negative for some x, F̃ₙ(x) is taken to be 0.
'
!' '
I' :
,I ' '
.
l
' '
•
'
' ' '
•
Notes for Section 5.4

(1) This result was first stated by R. A. Fisher (1925). A proof was given by Cramér (1946).

(2) Computed by Winston Chow.

Notes for Section 5.5

(1) This famous result appears in Laplace's work and was rediscovered by S. Bernstein and R. von Mises—see Stigler (1986) and Le Cam and Yang (1990).
-------
'
I
i
i
' •
Section 5.8
5.8
363
References
REFERENCES
BERGER, J., Statistical Decision Theory and Bayesian Analysis New York: Springer-Verlag, 1985. BHATTACHARYA, R. H. AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions New
York: Wiley, 1976. BILLINGSLEY, P., Probability and Measure New York: Wiley, 1979. Box, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 3 1 8-324 (1953).
CRAMER, H., Mathematical Methods of Statistics Princeton, NJ: Princeton University Press, 1946.
DAVID, F. N., Tables of the Correlation Coefficient, Cambridge University Press, reprinted in Biometrika Tablesfor Statisticians ( 1 966), Vol. I, 3rd ed., H. 0. Hartley and E. S. Pearson, Editors Cambridge: Cambridge University Press, 1938.
FERGUSON, T. S., A Course in Large Sample Theory New York: Chapman and Hall, 1996. FisHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).
FISHER, R.
A., Statistical Inference and Scientific Method, Vth Berkeley Symposium, 1958.
HAMMERSLEY, J. M. AND D. C. HANSCOMB, Monte Carlo Methods London: Methuen & Co., 1964.
HoEFFDING, W., ..Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-80 (1963).
HUBER,
Proc.
P. 1., The Behavior of the Maximum Likelihood Estimator Under Non-Standard Conditions, Vth Berk. Syrup. Math. Statist. Prob., VoL I Berkeley, CA: University of California Press,
1967. LE CAM,
1990.
L. AND G. L. YANG, Asymptotics in Statistics, Some Basic Concepts New York:
Springer,
LEHMANN, E. L., Elements of Large-Sample Theory New York: Springer-Verlag, 1999.
LEHMANN, E. L. AND G. CASELLA, Theory of Point Estimation New York Springer-Verlag, 1998.
NEYMAN, J. AND E. L. Scorr, "Consistent estimates based on partially consistent observations," Econometrica, !6, 1-32 (1948).
RA.o, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973. RUDIN, W., Mathematical Analysis, 3rd ed. New York: McGraw Hill, 1987.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
SERFLING, R. J., Approximation Theorems of Mathematical Statistics New York: J. Wiley & Sons,
1980.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty BefOre 1900 Cambridge, MA: Harvard University Press, 1986. WILSON, E. B. AND M. M. HiLPERTY, 'The distribution of chi square," Proc. Nat. Acad. Sci., U.S.A., /7, p. 684 (1931).
I
I '
'
'
'
'
!
!"I
I
i
'
'
'
I
'
'
' '
'
'
1 '
Chapter 6
=======
INFERENCE IN THE MULTIPARAMETER CASE

6.1
INFERENCE FOR GAUSSIAN LINEAR MODELS
Most modern statistical questions involve large data sets, the modeling of whose stochastic structure involves complex models governed by several, often many, real parameters and frequently even more semi- or nonparametric models. In this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates, tests, and confidence regions in regular one-dimensional parametric models for d-dimensional models {P_θ : θ ∈ Θ}, Θ ⊂ R^d. We have presented several such models already, for instance, the multinomial (Examples 1.6.7, 2.3.3), multiple regression models (Examples 1.1.4, 1.4.3, 2.1.1), and more generally have studied the theory of multiparameter exponential families (Sections 1.6.2, 2.2, 2.3). However, with the exception of Theorems 5.2.2 and 5.3.5, in which we looked at asymptotic theory for the MLE in multiparameter exponential families, we have not considered asymptotic inference, testing, confidence regions, and prediction in such situations. We begin our study with a thorough analysis of the Gaussian linear model with known variance, in which exact calculations are possible. We shall show how the exact behavior of likelihood procedures in this model corresponds to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular d-dimensional parametric models, and shall illustrate our results with a number of important examples. This chapter is a lead-in to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non- and semiparametric models. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for function-valued statistics, the properties of nonparametric MLEs, curve estimates, the bootstrap, and efficiency in semiparametric models. There is, however, an important aspect of practical situations that is not touched by the approximation: the fact that d, the number of parameters, and n, the number of observations, are often both large and commensurate or nearly so. The inequalities of Vapnik–Chervonenkis, Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.
Chapter 6
Notational Convention: In this chapter we will, when there is no ambiguity, let expressions such as θ refer to both column and row vectors.

6.1.1
The Classical Gaussian Linear Model
Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Y_i among n independent observations has a distribution that depends on known constants z_{i1}, ..., z_{ip}. In the classical Gaussian (normal) linear model this dependence takes the form

Y_i = Σ_{j=1}^p z_{ij} β_j + ε_i,  i = 1, ..., n   (6.1.1)

where ε₁, ..., εₙ are i.i.d. N(0, σ²). In vector and matrix notation, we write

Y_i = z_iᵀ β + ε_i,  i = 1, ..., n   (6.1.2)

and

Y = Zβ + ε,  ε ~ N(0, σ²J)   (6.1.3)

where z_i = (z_{i1}, ..., z_{ip})ᵀ, Z = (z_{ij})_{n×p}, and J is the n × n identity matrix. Here Y_i is called the response variable, the z_{ij} are called the design values, and Z is called the design matrix. In this section we will derive exact statistical procedures under the assumptions of the model (6.1.3). These are among the most commonly used statistical techniques. In Section 6.6 we will investigate the sensitivity of these procedures to the assumptions of the model. It turns out that these techniques are sensible and useful outside the narrow framework of model (6.1.3). Here is Example 1.1.2(4) in this framework.
. •
Example 6.1.1. The One-Sample Location Problem. We have n independent measurements Y₁, ..., Yₙ from a population with mean β₁ = E(Y). The model is

Y_i = β₁ + ε_i,  i = 1, ..., n   (6.1.4)

where ε₁, ..., εₙ are i.i.d. N(0, σ²). Here p = 1 and Z_{n×1} = (1, ..., 1)ᵀ. □
The regression framework of Examples 1.1.4 and 2.1.1 is also of the form (6.1.3):
, Zip · We are interested in relating the mean of covariate measurements denoted by Zi2 the response to the covariate values. The normal linear regression model is • . . .
'
I
p }i = f3t + L Zij/3j + €i1 j=2
i = 1, . . , n .
(6. 1 .5)
. l
Section 6.1
367
Inference for Gaussian Linear Models
where (31 is called the regression intercept and /3z, . . . /]p are called the regression coeffi cients. If we set zil 1, -i 1, . , n, then the notation (6.1.2) and (6.1.3) applies. We treat the covariate values Zij as fixed (nonrandom). In this case, (6.1.5) is called the fixed design normal linear regression model. The random design Gaussian linear regression model is given in Example 1.4.3. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional dis D tribution of Y given a set of observed covariate values. .
=
=
.
.
Example 6.1.3. The p-Sample Problem or One- Way Layout. In Example 1.1.3 and Sec tion 4.9.3 we considered experiments involving the comparisons of two population means when we had available two independent samples, one from each population. 1\vo-sample models apply when the design values represent a qualitative factor taking on only two val ues. Frequently, we are interested in qualitative factors taking on several values. If we are comparing pollution levels, we want to do so for a variety of locations; we often have more than two competing drugs to compare, and so on. To fix ideas suppose we are interested in comparing the performance of p > 2 treat ments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k, 1 < k < p, n1 + · · · + np = n. If the control and treatment responses are independent and nonnally distributed with the same variance a2, we arrive at the one-way layout or p-sample model, (6.1.6)
where Ykl is the response of the lth subject in the group obtaining the kth treatment, /3k is the mean response to the kth treatment, and the €kl are independent N(O, � ) random variables. To see that this is a linear model we relabel the observations as Y1, Yn. where Yn1+n2 to Y1 , . . . , Yn1 correspond to the group receiving the first treatment, Yn1 + 1 , that getting the second, and so on. Then for 1 $" j < p, if no = 0, the design matrix has elements: .
•
j- 1
1 if L nk + 1 < i <
k= I
•
.
.
•
,
,
j
L nk
k=l
0 otherwise and
II 0 Z=
0
,,
.
•
.. •
•
0 0
• • •
•
0
0
•
•
•
•
where Ij is a column vector of nj ones and the 0 in the "row" whose jth member is Ij is a column vector of ni zeros. The model (6. 1 . 6) is an example of what is often called analysis of variance models. Generally, this terminology is commonly used when the design values are qualitative. 1 The model (6.1.6) is often reparametrized by introducing a = p- 2::� �1 f3k and Ok /3k - 0: because then ok represents the difference between the kth and average treatment =
368
Inference in the Multiparameter Case
effects, k model is
1 , . . . ,p. In terms of the new parameter {3"' =
Y = Z'(3' + <,
<
Chapter 6
(a, ISh . . , 15v)T, the linear .
� N(O, a2J)
where z;1 (p+ I ) = (1, Z) and l n x 1 is the vector with n ones. Note that Z* is of rank p and that {3* is not identifiable for {3* E RP+ 1 . However, {3* is identifiable in the p-dimensional linear subspace {(3' E RP+ I : L:��I {,k = 0} of w+' obtained by adding the linear restriction E�=I J"k = 0 forced by the definition of the Jk 's. This type oflinear model with the number of columns d of the design matrix larger than its rank r, and with the parameters identifiable only once d r additional linear restrictions have been specified, is common in analysis of variance models. D x
-
!
Even if {3 is not a parameter (is unidentifiable), the vector of means p = (.UI, . . . , JLn ) T of Y always is. It is given by
•
p
'
I
I' =
Z(3 =
L /3jCj
; •
j=l where the Cj are the columns of the design matrix,
!
(Zlj, • • · , Znj)T, j = 1, . . . ,p.
Cj =
The parameter set for f3 is RP and the parameter set for It is w = {P- = Z(3; (3
E RP} .
Note that w is the linear space spanned by the columns Cj, j = 1, . . . , n, of the design matrix. Let r denote the number of linearly independent Cj, j = 1, . . . ,p, then r is the rank of Z and w has dimension r. It follows that the parametrization ({3,a2) is identifiable if and only ifr = p (Problem 6.1.17). We assume that n > r .
' •
The Canonical Fonn of the Gaussian Linear Model
• • '.
I
,.
i
'
r ''
The linear model can be analyzed easily using some geometry. Because dimw = r, there exists (e.g., by the Gram-Schmidt process) (see Section B.3.2), an orthonormal basis VI ' . . . ' Vn for Rn such that VI' . . ' v span w. Recall that orthonormal means vr vj = 0 for i f. j and v[vi = 1. When vrvj = o. we call Vi and Vj orthogonal. Note that any t E Jl:l can be written .
'
'
t = I;(vit)v;
I
I
r
n
'
(6.1.7)
i=l
and that
t E w ¢:> t =
T
L:(vft)vi ¢:> v'[t = 0, i = i=l
r
+ 1, . . . , n.
We now introduce the canonical variables and means '
' • •
'
I
;
Ui = vfY, T/i = E(Ui) = v'[Jl., i = 1, . . .
, n.
Inference for
Section 6.1
G:::'c:"cc"c:''_c"--L:::i--ne:::'::_'_::M.:.:oc:d:::ei'-'-----'3:C6:::C9
Theorem 6.1.1. The U1 are independent and Ui ,......, N(rJi, a2 ), i = 1, . . . 11i
while
(TJt , .
.
, n,
where
r + 1 , . . . , n,
= 0, i =
. , 1Jr )T varies freely over Rr.
Proof. Let Anxn be the orthogonal matrix with rows v'[ , . . . , v�. Then we can write U = AY, 7J = AJ.t, and by Theorem B.3.2, U1, . . , Un are independent nonnal with D variance and E(Ui) = v[J.L = 0 for i = r + 1, . . . , n because p, E w.
a2
.
Note that
1
Y � A- U. So, observing U and Y is the same thing. 1-'
J.t and 71
(6. 1.8)
are equivalently related,
� A - I 'I·
(6.1.9)
whereas Var(Y) � Var(U)
�
cr2Jnx n·
(6.1.10)
It will be convenient to obtain our statistical procedures for the canonical variables U, and then translate them which are sufficient for (J.J., using the parametrization (17, to procedures for �-'· (3, and based on Y using (6.1 .8)-(6. 1.10). We start by considering the log likelihood C(TJ, u) based on U
u2)r,
a2) cr2
n 1 � - 2 L.,(u ; - r]i) 2" t=l
C(TJ,U)
2
-
� I � 1 2 - 2 L ui + � � 2a i=l
i=l
6. 1.2
2
n 2 tog(27rcr ) 1JiUi
� r(f -
� i=l
2u2
-
n log(27r0'" 2) .
(6. 1 . 1 1 )
2
Estimation
a2 known case, which is the guide to asymptotic inference in general. Theorem 6.1.2. In the canonical form of the Gaussian linear model with a2 known We first consider the
(i) T � ( U1 , . . . , U,)T is sufficientfor rJ.
(ii) UJ, . . . , Ur is the MLE of1JI , · · · · 1lr· (iii)
Ui is the UMVU estimate ofryi, i = 1, . .
.
, r.
' c,. are constants, then the MLE of0: = (iv) If Ct ' also UMVUfor a. •
.
.
(v) The MLE of p. is il =
ui =
E;
z:; 1 Ci1Ji
is a = E� 1
and j),i is UMVUfor J.ti, i = vTil. making il and u equivalent. 1
vi Ui
1,
. . . , n.
ciui. a is Moreover,
I' '
370
Inference in the Multiparameter Case
Chapter 6
Proof. (i) By observation, (6.1.11) is an exponential family with sufficient statistic T. (ii) U1 , . , U1. are the MLEs of 111, . . , Tfr because, by observation, (6. 1.11) is a function of 1}1 , . . . , 1lr only through L�- 1 (ui - "7i )2 and is minimized by setting 'tli = ui. (We could also apply Theorem 2.3.1.) 1 , . . . , r. (iii) By Theorem 3.4.4 and Example 3.4 . 6, U; is UMVU for E(U;) = 7);, (iv) By the invariance of the MLE (Section 2.2.2), the MLE of q(9) = 2:: � 1 c;ry; is q(8) = L� 1 <;Ui· If all thee's are zero, 5 is UMVU. Assume that at least one c is different from zero. By Problem 3.4.10, we can assume without loss of generality that L:·= 1 c'f = 1. By Gram-Schmidt orthogonalization, there exists an orthonormal basis V I , . . . , Vn of Rn with VI = c = ( c , ' . ' Cr, 0, . . . ' O)T E Rn . Let wi = vru. �i = Vi'fJ, i = 1 ' . . . ' n, then W N(�,a2J) by Theorem B.3.2, where J is the n x n identity matrix. The distribution of W is an exponential family, W1 = Q is sufficient for 6 = a, and is UMVU for its expectation E(W,) = a. 0 (v) Follows from (iv). .
.. •
i
.
.
i=
-
:! I ' I' I
•
'
.
' ' '
I i
.
-
I I
Next we consider the case in which a2 is unknown and assume n
> r + 1.
•
I
!
'
i
•
i
Theorem 6.1.3. In the canonical Gaussian linear model with a2 unknown,
(i) T = (ii) (iii)
Ur, L7 r+l un T is sufficientfor ('TJ1, . . . , 1]r, a2)T. The MLE ofa 2 is n 1 L� r+1 u; . s2 = ( n r) - 1 L� r+l u? is an unbiased estimator of a2• (U1,
. . . 1
-
-
(iv) The conclusions of Theorem 6 . 1 2 (ii), . . . ,(v) are still valid. .
'
'
Proof. By {6.1.11), (UI , · · · , Ur , L� I is sufficient. But because L� 1 Ui2 L� I Ui2 + L7 r+l Ul, this statistic is equivalent to T and (i) follows. To show (ii), recall that the maximum of ( 6. 1.11) has 'Tfi = Ui . That is, we need to maximize
' '
II i
,,
I
'
I
•
i J •
. '
'
: .
"
as a function of a2. The maximizer is easily seen to be n - I L� r+ l U? (Problem 6.1.1 ). (iii) is clear because EUl = a2, i > r+ 1. To show (iv), apply Theorem 3.4.3 and Example 3.4.6 to the canonical exponential family obtained from (6.1.11) by setting T; = U;, Bj 'TJi /a2, j = 1, . . , r, Tr+1 = L� 1 U? and Br+ I = 1/2u2
=
.
-
I
•
I
.
Projections -
We next express j1 in terms of Y, obtain the MLE (3 of {3, and give a geometric interpretation of ji, {3, and s2• To this end, define the norm It! of a vector t E Rn by
l tl' = I:� 1 fl.
'
I
•
I
'
'
'
unT
•
i
'
'i
•
I
I '
Section 6.1
371
Inference for Gaussian linear Models
Definition 6.1.1. The projection y0 = 1f(Y I £...').. of a point y E Rn on w is the point
Yo = arg min{IY - W : t E w}. -
The maximum likelihood estimate f3 of f3 maximizes
or, equivalently,
13 = arg min{ IY - Z/312 : !3 E W}.
That is, the MLE of {3 equal� the least squares estimate (LSE) of {3 defined in Example 2.1.1 and Section 2.2.1. We have Theorem 6.1.4. In the Gaussian linear model (i) jl is the unique projection o.fY on
u..J
(ii) jl is orthogonal to Y - ji.. (iii)
s2
=
and is given by
ii = zl3
IY - /il 2 /(n - r)
(6.L12)
(6.Ll3)
(iv) lfp = r, then !3 is identifmble, !3 = (zrz )-1 zr !'. the MLE = LSE of!3 is unique and given by (6.1.14) -
(v) f3; is the UMVU estimate of /3;. j i = I , . . . , n.
Proof.
=
1, . . . , p, and Jii is the UMVU estimate of J.ti,
(i) is clear because Z/3, f3 E RP, spans w. (ii) and (iii) are also clear from Theorem
2:� ViUi and Y - /i = L7 r+l v;U;. To show (iv), note that 1 ,.. = Z/3 and ( 6. 1 . 12) implies zr,.. = zTZ!3 and zTji = zrzl3 and, because z has full rank, zrz is nonsingular, and !3 = (Zrz)-1 zrp, 13 = (zrz) -1Zrji. To show f3 = (ZTz)-lzTy, note that the space w.l. of vectors s orthogonal to w can be written as w.L = {s E R" : sr(Z/3) = O for all !3 E W}. 6.1 .3 because /i
=
It follows that !3T (ZT s) = 0 for all !3 E RP, which implies zr s = 0 for all s E wj_. Thus, zr (Y - ji) = 0 and the second equality in (6.1.14) follows. f3; and Jii are UMVU because, by (6.1.9), any linear combination ofY's is also a linear combination of U's, · and by Theorems 6.1.2(iv) and 6 . 1.3(iv), any linear combination of D U's is a UMVU estimate of its expectation .
372
Inference in the Multiparameter Case
Chapter 6
Note that in Example 2.1.1 we give an alternative derivation of (6.1.14) and the normal equations (ZTZ)f3 = zry. The estimate fi = Z{3 of tt is called thefitted value and € Y - ji is called the residual from this fit. The goodness of the fit is measured by the residual sum of squares (RSS) IY - fll2 = L7 1 Ef. Example 2.2.2 illustrates this terminology in the context of Example 6.1.2 with p = 2. There the points jli = /31 + fhzi. i = 1, . . , n lie on the regression line � fitted to the data { ( z;, y;), i = 1 , . . , n}; moreover, the residuals €; = (y, (f3t + /3z z,)] are the vertical distances from the points to the fitted line. Suppose we are given a value of the covariate z at which a value Y following the linear model (6.1.3) is to be taken. By Theorem 1.4.1, the best MSPE predictor of Y if f3 is known well as z is E(Y) = zT{3 and its best (UMVU) estimate not knowing f3 is as � � Y = zT{3. Taking z = zi. 1 < i < n, we obtain P,i = zi/3. 1 < i < n. In this method of "prediction" � of Yi, it is common to write Yi for j]i, the ith component of the fitted value j],. That is, Y = j.i. Note that by (6.1.12) and (6.1.14), when p = r, �
=
�
�
.
.
�
It follows from this and (B.5.3) that if J = Jnxn is the identity matrix, then (6.1.15)
'
'
Next note that the residuals can be written as €=
I
•
I
'
.
.i '
J
I,
'
I
Y - Y = (J - H)Y.
The residuals are the projection of Y on the orthocomplement of w and
Var(€) = a2 (J - H). Corollary 6.1.1. In the Gaussian linear model �
(i) the fitted values Y = jl and the residual € are independent.
I '
�
We can now conclude the following.
I
'
•
•
'
•
i
'
'
I, I
•
I
HT = H H2 = H.
'
I
'
'
The matrix H is the projection matrix mapping Rn into w, see also Section B.lO. In statistics it is also called the hat matrix because it "puts the hat on Y." As a projection matrix H is necessarily symmetric and idempotent,
I
i
1
H = Z(ZTz)-tzT
•
'
•
where
,
·J
•
Y = HY
I
'
-
�
'
1
�
�
'
, •
(ii) Y � N(J.t, a2H),
(iii) € � N(O, a2 (J - H)), and
(iv) ifp = r, (:i � N({3, a2(zrz)- 1 ).
(6.1.16)
1 .
•
Section
6. 1 Inference for Gaussian linear Models
373
Proof (Y, E) is a linear transformation of U and, hence. joint Gaussian. The independence �
follows from the identification of j1 and € in tenns of the a2(zrz)- 1 follows from (B.5.3).
�
Ui in the theorem. Var(/3)
=
o
We now return to our examples.
Example 6.1.1.
One Sample (continued). Here J-i = _81 and j1 = {3 1 Y. Moreover, the unbiased estimator s2 of o-2 is L:� 1 (Yi - Yf j(n - 1), which we have seen before in 0 Problem 1.3.8 and (3.4.2). �
-
Example 6.1.2. Regression (continued). If the design matrix Z has rank p, then the MLE LSE estimate is j3 = (zrzr- 1 zry as seen before in Example 2.1.1 and Section 2.2.1. We now see that the MLE of Jt is /l = Z/3 and that {3j and Iii are UMVU for !3i and /-Li respectively, j = 1 , . . . ,p, i = 1, . . . , n. The variances of ,8,ji, = Y, and € = Y - Y are given in Corollary 6.1.1. In the Gaussian case ,8, Y, and € are normally distributed with Y and € independent. The error variance o-2 = Var(El) can be unbiasedly estimated by s2 = (n - p)- 1 IY Il l '0
=
�
�
�
�
�
�
�
�
-
Example 6.1.3.
The One-Way Layout (continued).
(ZTZ)(3 = ZY become
In this example the normal equations
n,
nk(Jk = L ykl , k = 1, . . . ,p. 1=1 At this point we introduce an important notational convention in statistics. If {Cijk . . . } is a multiple-indexed sequence of numbers or variables, then replacement of a subscript by a dot indicates that we are considering the average over that subscript. Thus,
where n = n 1 + · · · + np and we can write the least squares estimates as
f3k = Yk. , k = 1 , . . . ,p. By Theorem 6.1 .3, in the Gaussian model, th� UMVU estimate of the average effect of all the treatments, a = {3., is
1 p a = - L Y•. (not Y. in general) p k=l and the UMVU estimate of the incremental effect Ok = f3k - a of the kth treatment is
P• = Y•. - [i, k = 1, . . . ,p. �
0
j: '
' !;
'.
''
374
Inference in the Multiparameter Case
Chapter 6
Remark 6.1.1. An alternative approach to the MLEs for the normal model and the associ ated LSEs of this section is an approach based on MLEs for the model in which the errors , En in ( 6.1.1) have the Laplace distribution with density c1, . . .
_l_exp { -� ltl } 2,;
<7
and the estimates of f3 and J.L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L� 1 IYi - zT.BI. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEs-see Stigler (! 986). The LSEs are preferred because of ease of computation and their geometric properties. However, the LADEs are obtained fairly quickly by modem computing methods; see Koenker and D'Orey ( !987) and Portnoy and Koenker ( !997). For more on LADEs, see Problems 1.4. 7 and 2.2.31.
6.1.3 '
• •
'
j
'
j
1
1 '
' 1
,j
'
'
'
'
The most important hypothesis-testing questions in the context of a linear model corre spond to restriction of the vector of means J-L to a linear subspace of the space w, which together with a2 specifies the model. For instance, in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider, in the context of Example 6.1.2, a regression equation of the form
lzijllnx3
.\ (y)
=
sup{p(y, JL) : JL E w } sup{p(y, JL) : JL E wo }
'
'
' '
(6. l.l7)
where zi2 is the dose level of the drug given the ith patient, ZiJ is the age of the ith patient, and the matrix with Zil = 1 has rank 3. Now we would test H : f3z = 0 versus K : {32 =I 0. Thus, under H, {JL : J-ti = (31 + f33zi3, i = 1 , . . . , n} is a two dimensional linear subspace of the full model's three-dimensional linear subspace of Rn given by (6.1.17). Next consider the p-sample model of Example 1.6.3 with f3k representing the mean reSJXlnse for the kth population. The first inferential question is typically "Are the means equal or not?" Thus we test H : (31 = · · · = (3p = (3 for some (3 E R versus K: "the {3's are not all equal." Now. under H, the mean vector is an element of the space {JL : J.Li = (3 E R, i = 1, . . . , p}, which is a one-dimensional subspace of Rn. whereas for the full model JL is in a p-dimensional subspace of Rn . In general, we let w correspond to the full model with dimension r and let Wo be a q-dimensional linear subspace over which JL can range under the null hypothesis H; 1 < q < r. We first consider the a2 known case and consider the likelihood ratio statistic
'
I
l
Tests and Confidence Intervals
mean response = /31 + f32 zi2 + fJJZiJ,
!,
'
: '
'
'
•
' '
'
Section
6_1
375
I nference for Gaussian Linear Models
for testing H : fL E w0 versus f{
:
J-L
E w - w0. Because (6.1. 18)
then, by Theorem 6. 1.1, .\(Y)
�
exp
{- 2t2IY -
)}I' - IY - Po I'
}
where j1 and flo arc the projections of Y on w and w0, respectively. But if we let An x n be an orthogonal matrix with rows vf , . . . , v� such that V J , . . . , vq span wo and v 1 , . . . , Vr span w and set (6. 1 . 19) then, by Theorem 6.1.2(v), .\(Y)
�
exp
1 2<7'
i=q+ l
u'
(6. 1 .20)
•
It follows that r
2 1og.\(Y) �
L (U./<7)2 i=q+I
Note that (U;/
distribution with
e ' � <7- 2
a2 known, 2 log .\(Y) has a x;_q(02)
r
" L 11i' = a-'1 {L - Ito I'
(6.1.2 1)
where fLo is the projection of fL on w0. In particular, when H holds, 2 log .\(Y) ""' x; -q.
Proof We only need to establish the second equality in ( 6.1.21). Write 'li is as defined in (6.1.19), then
�
AJL where A
r
L '1? � IJL - JLol 2 i=q+ I 0
376
Inference in the Multiparameter Case Next consider the case in which
MLEs of a2 for Jt
a2
E w and tt E wo are
a -2
Chapter 6
is unknown. We know from Problem 6 . 1 . 1 that the
a0 = -I IY - t-to I IY - J.L -12 and _, - I' , n n respectively. Substituting jJ.., ji0, 82, and 86 into the likelihood ratio stati stic, we obtain =
-
{ �(Y· E · a:.;) }" � r �� ; ' a Y - J.L P Y, JLo, o
n
.A ( y) =
=
where p(y, /L,
The resulting test is intuitive.
It consists of rejecting H when the fit, as measured by the
res idual sum of squares under the model specified by H, is poor compared to the fit under the general model. For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >.. ( Y),
n-_:=_r IY - /iol2 - IY - /il2 IY - Iii' r-q
T Because T
=
(r - q) -'1/i - /io l2 (n - r)- 'IY - /il2 •
(n - r)(r - q)-1{[.A(Y)]2fn - !}, T is an increasing function of .A(Y)
and the two test statistics are equivalent. T is called
'
i
the
hypothesis.
>
(6.1 .22)
F statistic for the general linear
We have seen in PmJX1Sition 6.1.1 that a-211}- itol2 have a x;_ q (82) distribution with
'
:
02 =
'
T
•
=
(noncentral (central
= =
x:-• variable)!df
X�-r variable)!dj
with the numerator and denominator independent. The distribution of such a variable is
noncentral :F distribution with noncentrality parameter 82 and r - q and n - r degrees offreedom (see Problem B.3.14). We write :F.,m (B2) for this distribution where k = r - q and m = n - r. We have shown the following. called the
Proposition 6.1.2. In the Gaussian linear model the F statistic defined by ( 6.1.22), which is equivalent to the likelihood ratio statistic for H : tt E wo for K : J.t E w - wo. has the noncentral F distribution Fr-q,n -r(fP) where 02 = a-2IJL - J.tol2. In particular, when H Jwlds, T has the (central) Fr-q,n-r distribution. Remark 6.1.2. In Proposition 6 . 1 . 1 suppose the assumption ..a2 is known" is replaced by ••a2 is the same under H and K and estimated by the MLE 0'2 for /.I. E w." In this case, it can be shown (Problem 6.1.5) that if we introduce the
variance equal likelihood ratio
statistic IL E w} max{p(y, /L,iT2 : ) :\(y ) max{p (y,/L, iT2) : IL E wo} _
(6.1.23)
l '
•
Section 6.1
Inference for Gaussian ---Linear Models
377 ------ -
then A(Y) equals the likelihood ratio statistic for the o-2 known case with 0'2• It follows that 2 log .\(Y) =
r
�
-- q -T = noncenlral x?.
-
rr2
q
(6.1 .24)
central X�-c/n
(n - r)/n
replaced by
where T is the F statistic (6.1.22). Remark 6.1.3. The canonical representation (6.1.19) made it possible identity
IY - i'ol 2 = IY - iL l' + II' - i'o l2 ,
to
recognize the (6.1.25)
which we exploited in the preceding derivations. This is the Pythagorean identity. See Figure 6.1.1 and Section B.lO.
Y3 y
IY - i'o l2
y,
I Y - iL l'
Y1 = Y2
Figure 6.1.1. The projections jl and jl0 of Y on w and wo; and the Pythagorean identity.
We next return to our examples. Example 6.1.1. One Sample (continued). We test H : (31 = i'o versus K : (3 'f I'D· In this case Wo = {!'o}, q = 0, r = 1 and
T= which we recognize as 4.9.2.
(n
-
(Y - 1'0)' 1 ) 1 I: (Y, Y)2 ' -
-
t2 /n, where t is the one-sample Student t statistic of Section 0
378
Inference in the Multiparameter Case
Chapter 6
Example 6.1.2. Regression (continued). We consider the possibility that a subset of p q covariates does not affect the mean response. Without loss of generality we ask whether the last p - q covariates in multiple regression have an effect after fitting the first q. To formulate this question, we partition the design matrix Z by writing it as Z = (Z1, Z2) where Z1 is n x q and Z2 is n x (p - q ) , and we partition {3 as {3T = ({3f, !3J) where {32 is a (P - q) x 1 vector of main (e.g., treatment) effect coefficients and {31 is a q x 1 vector of"nuisance" (e.g., age, economic status) coefficients. Now the linear model can be written as --
i
'
,,
i
i
I ' '
' '
'
t zry and {30 0 versus K {32 i' Q_ ln this case {3 (zrz){3 2 (Zfz t )- 1 ZjY are the ML& under the full model (6,L26) and H, respectively, Using (6.1.22) we can write the F statistic version of the likelihood ratio test in the intuitive form We test H :
=
-
:
F=
-
=
•
I
6.2.1.
•
•
.
.
'
1 '
Example 6.1.3. The One-Way Layout (continued). Recall that the least squares estimates of {31, , (Jp are Y1., Yp.. As we indicated earlier, we want to test H : (31 = · · · = (Jp. Under H all the observations have the same mean so that, •
'
' '
0
•
'
'
'
(RSSH - RSSF)/(dfH - dfp) RSSF/dfF
1 2 2 (6, 1 27) a0 = (p - q)- {3r {ZfZ2 - ZfZt (Zfz,)-1 ZfZ,}/32 , In the special case that Z'fZ2 = 0 so the variables in Z1 are orthogonal to the variables in Z2 , 02 simplifies to a-2 (p - q)f3f(ZfZ2 )f32 , which only depends on the second set of variables and coefficients. However, in general 82 depends on the sample correlations between the variables in Z1 and those in Z2. This issue is discussed further in Example
'
•
'
'
where RSSF = IY - iW and RSSH = IY- i}0 1 2 are the residual sums of squares under the full model and H, respectively; and dfF = n-p and dfH = n-q are the corresponding degrees of freedom. The F test rejects H ifF is large when compared to the ath quantile of the Fp-q,n-p distribution. Under the alternative F has a noncentral Fp-q ,n-p(fP) distribution with noncentrality parameter (Problem 6, L 7)
'
i
,
i
.
'
l '
l ;
'
i
'
! '
' '
' I '
'
Thus,
p nk
p
lc=l l=I
k=l
2 l lit - tto = L L(Y• - Y. )2 = L nk (Yk- - YY •
Substituting in (6.1.22) we obtain the F statistic for the hypothesis H in the one-way layout
L��t nk (Yk, - Y .)2 Tp - I L��· L:�',(Ykt - Y.-) 2 ' _
n-P
. I
•
• '
I
'
------
Section 6.1
Inference for Gaussian Linear Models
379
When H holds, '1' has a Fp- l,n - p distribution. If the (J, are not all equal, T has a noncentral Fp-l,n-p distribution with noncentrality parameter (6.! .28)
where {3 = n- I :2::=�"" 1 ni/3i. To derive 1F, compute a - 2 1 M - J.Lo 12 for the vector Jt = (fh, . . . , (31,(3, . , (3,, . . . , (3p, . . . , {3p)T and its projection 1-'o = ((3, . . . , (J)T There is an interesting way of looking at the pieces of information summarized by the F statistic. The sum of squares in the numerator, p SSe = L nk(Yk. - Y.)2 .
.
k=l
is a measure of variation between the p samples Y11 , . . . , Yi sum of squares in the denominator,
SSw
=
p
n1 ,
.
• •
, Yy1 ,
• • .
, Ypnp· The
"'
L L(Yk, - Y•. ) ' , k=1 l=1
measures variation within the samples. [f we define the total sum of squares as
SSr
1'
=
nk
L L(Yk, - Y.)2 , k=1 l= l
which measures the variability of the pooled samples, then by the Pythagorean identity (6.!.25) (6.! .29) SSr = SSe + SSw. Thus, we have a decomposition of the variability of the whole set of data, SST , the total sum of squares, into two constituent components, SSs, the between groups (or treatment) sum of squares and SSw. the within groups (or residual) sum of squares. SSTfa2 is a (noncentral) x2 variable with (n - 1) degrees of freedom and noncentrality parameter f?. Because SSB /a2 and SSw/a2 are independent x' variables with (p - 1) and (n - p) degrees of freedom, respectively, we see that the decomposition (6.1.30) can also be viewed stochastically, identifying IP and (p - 1) degrees of freedom as "coming" from SSafa 2 and the remaining (n - p) of the (n - 1} degrees of freedom of SSr ja2 as "conting" from SSwfa 2 • This information as well as SSa/(P - 1) and SSw j(n - p), the unbiased estimates of 02 and a2, and the F statistic, which is their ratio, are often summarized in what is known as an analysis ofvariance (ANOVA) table. See Tables 6.1.1 and 6.1.3. As an illustration, consider the following data( l ) giving blood cholesterol levels of men in three different socioeconomic groups labeled I, H, and III with I being the "high" end. We assume the one-way layout is valid. Note that this implies the possibly unrealistic
'I ,
'
'
i'I
'
'� '
380
Inference in the Multiparameter Case
Chapter 6
'
TABLE 6.1.1. ANOVA table for the one-way layout
Between samples
Within sample�
ss B
SSw
-
Sum of squares
- I:
[ ,, '
1 SST = �
Total
k- 1 I' '
P 1< - 1
�
c
111< '
"
( \ "k- - ) 1"2 k 0? - \" · '
1
L " �<
1-l
ki
(}'kl
-
d.f.
·
k
I'
1;- I'
p-
Mean squares
I
n - p Tl -
1
-
SS8
A/Sa
AISw
=
s� Hi
F-value i\/Sa 1\IS
•
.,_
TABLE 6.1.2. Blood cholesterol levels
403 312 403
I II
Ill
311 222 244
336 420 235
269 302 353
259 420 319
386 260
353
210
286
290
I assumption that the variance of the measurement is the same in the three groups (not to speak of normality ). But see Section
6.6 for ''robustness.. to these assumptions.
p = 3, n1 = 5, n2
compute
=
1 0, n3 = 6, n = 21,
' ' • '
and we ;
' ' •
•
TABLE 6.1.3. ANOVA table for the cholesterol data
�
" • •
ss Between groups Within groups Total
1202.5 85, 750.5 86,953.0
d.f.
2 18 20
MS
F�value
601.2 1 0.126 4763.9
From :F tables, we find that the p-value corresponding to the F-value
0.126
is
0.88.
Thus, there is no evidence to indicate that mean blood cholesterol is different for the three SOCIOCCOnormc groups. •
. •
We want to test whether there is a significant difference among the mean blood choles terol of the three groups. Here
' '
0
•
Remark 6.1.4. Decompositions such as {6.1.29) of the response total sum of squares SST into a variety of sums of squares measuring variability in the observations corresponding
analysis of variance. They can be formulated in any linear model including regression models. See Scheffe (1959, pp. 42-45) and Weis berg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to variation of covariates are referred to as
to establish the distribution theory of the components via a device known as Cochran's the orem (Graybill,
1961, p. 86).
Their principal use now is in the motivation of the convenient
summaries of information we call ANOVA tables.
Section 6.1
381
Inference for Gaussian linear Models
Confidence Intervals and Regions
We next use our distributional results and the method of pivots to find confidence inter vals for J.li, 1 < i < n, /3j. 1 < j < p, and in general, any linear combination
n
of the J.t'S. If we set 'lj; = 2::7 1 aiJii = aT 11 and �
�
�
where H is the hat matrix, then (> - ,P)/a{,P) has a N(O, I ) distribution. Moreover,
(n - r)s2/a2
=
[Y
··
2
iil fa2
n
=
L (Uja2)2
i=r+I
�
has a X�- r distribution and is independent of '¢. Let
be an estimate of the standard deviation a ('¢ ) of 1};, This estimated standard deviation is called the standard error of'¢. By referring to the definition of the t distribution, we find that the pivot �
�
�
has a Tn_,. distribution. Let tn-r ( 1 - �a) denote the 1 - �a quantile of the Tn-r distri bution, then by solving [T( ,P )[ < tn-• (1 - �a) for ,P, we find that
is, in the Gaussian linear model, a 100(1 - a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider 1/J = Jl. We obtain the interval
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = r. First consider 1jJ = f3J for some specified �gression coefficient /3j . The 100(1 - a)% confidence interval for f3J is
(Jj = {Jj ± tn-p ( I - 1 a) s{ I {ZTZ) -1 1jj }l' �
2
382
Chapter
6
[(ZTz)�1]h is the jth diagonal element of (ZTZ)� 1 Computer software computes (zr z)�' and labels s{[(ZTz)�1 ]11) l as the standard error of the (estimated) jth regres
where
sion coefficient. Next consider ljJ level o:) confidence interval is
=
(1 -
J..Li
=
J1.i
=
mean response for the ith case,
Jii ± tn -p
1 < i < n. The
(1 - �a) sA,
hii
where is the ith diagonal element of the hat matrix H. Here s� is called the standard error of the (estimated) mean of the ith case. 2 and Next consider the special case in which
p
=
Yi = !3r + /32Zi2 + Ei, i
=
11
.
. . , n.
If we use the identity n
�)z;z - z.z)(li - Y) - �)z,2 - z.2)li - YL(zi2 - z.z) i=l - �)z" - z.z)li, we obtain from Example 2.2.2 that
fh
Because •
�
Var(Yi) a2 , we obtain
L��� ( zi2 - z.z)li
(6.1.30)
· 2 L� 1 (zi2 - z.2)
=
n
2 "' � / Var(!lz) (J L)zi2 - z.z) i=l and the 100( I - a)% confidence interval for .62 has the form !lz = iJz ± tn-p ( 1 - la) s/ V))zi2 - z z)2 . �
I
i.
I
'
Inference in the Multiparameter Case
• •
;
'
,
The confidence interval for /31 is given in Problem 6.1.10. 2 case, it is straightforward (Problem 6.1. 10) to compute Similarly, in the
p�
h
ii -
_
2 I (zi2 z.z) - + "'" n
L...i. =l (Zi2 - Z.2 ) and the confidence interval for th� mean response Jl.i of the ith case has a simple explicit
'
f3k. 1 < k < p. Because
I
'
o
fonn.
Example 6.1.3. One-Way Layout (continued). We consider 'ljJ
ilk � Yk. N(!lk, (J2/nk). we find the 100(1 - a)% confidence interval ilk � i3. ± tn-p (1 - �a) s/Jiik where s2 � SSw/(n-p). The intervals for p � f). and the incremental effect ok � ilk - p 0 given in Problem 6.1.11. �
�
are
i ; '
I
• •
=
I
•
1 ' •
J
·] •
I
i ' •
l
Section 6.2
383
Asymptotic Estimation Theory in p Dimensions
Joint Confidence Regions
We have seen how to find confidence intervals for each individual (3j. 1 < j < p. We next consider the problem of finding a confidence region C in RP that covers the vector f3 with prescribed probability (1 - a ) . This can be done by inverting the likelihood ratio test or equivalently the F test That is, we let C be the collection of {30 that is accepted when the level (I - a) F test is used to test H j3 � j30. Under H, I' � l'o � Zj30; and the numerator of the F statistic (6.1.22) is based on 2 Iii M l � IZ/3 - Zf3ol � (/3 - f3o)T (ZTZ)(/3 - /30). Thus, using (6.1.22), the simultaneous confidence region for {3 is the ellipse � � T T o (Z Z) ( f3o ) ) < fc,n-c (1 - �a ) f3 : (3 2 ((3 o C � f3 (6. 1.31) rs where Jr,n-r (1 -- �a) is the 1 - �a quantile of the :Fr,n-r distribution. Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) T = (/3[, f3f) , where {32 is a vector of main effect coefficients and ZI /3 write Z = ( , Z2 ) -. -.T .-...T "'T and {31-is a vector of ..nuisance" coefficients. Similarly, we partition {3 as {3 = {/31 , {32 ) where /3 1 is q x 1 and- j32 is (p - q) x 1. By Corollary 6.1.1, 0"2 (zrz) is the variancecovariance matrix of {3. It follows that if we let S denote the lower right (P - q) x (P q) comer of ( zrz) -1, then a2S is the variance-covariance matrix of j3 2 . Thus, a joint 100(1 - a}% confidence region for {32 is the p - q dimensional ellipse :
-
o '
< fp -q, n-p (1 - �a)
f3o, :
0
We consider the classical Gaussian linear model in which the resonse Yi for the ith case in an experiment is expressed as a linear combination Jlt = Lj=1 (3i Zij of covariates plus an error fi, where ci, . . . are i.i.d. N(0, a2 ) . By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {f3j}. the resJX)nse means {Jli }, and linear combinations of these. Summary.
, tn
6.2
ASYM PTOTIC ESTIMATION TH EORY IN p DIM ENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic prop erties of the MLE and related tests and confidence bounds for one-dimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generaliz ing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: i.i.d. P where P E Q, a model containing P = {Po : 6 E 6} such that (i)
,
e open c RP.
I
(ii) Densities of Po are p(-, 6), 9 E 6.
•
J
The following result gives the general asymptotic behavior of the solution of estimating equations.
AO. 'I'
=
(,P1 ,
•
.
.
,
1/!p)T where ,P1 � 1
gf. is well defined and
n
'
-L 'I'(X,, On ) � 0. .
n t= l
(6.2.1)
•
A solution to (6.2.1) is called an estimating equation estimate or an M-estimate. AI. The parameter 8( P) given by the solution of (the nonlinear system of p equations in p unknowns):
J 'l'(x, 6)dP(x) � 0
1
(6.2.2)
is well defined on Q so that 6(P) is the unique solution of (6.2.2). Necessarily 6(Po) � 6 because Q :::> P. A2. Epi'I'(X�o 6(P))I 2
'
< oo where I · I is the Euclidean norm.
A3. 1/Ji{·, 8), 1 < i < p, have first-order partials with respect to all coordinates and using the notation of Section B.S,
I! :
where
is nonsingular. A4. sup .
{I� 2:� 1 (Dw(X;, t) - Dw(X;, 6(P)))i
p
:
It - O(P)I <
AS. On � O(P) for all P E Q.
Theorem 6.2.1. Under AO-AS of this section n 1 On � O(P) + - L ii'!(X;, O(P)) + t=l where
n.
' •
.
'
.
op(n· 112)
iii (x,O(P)) � -[EpDw(X� o O(P))]" 1 w(x, O(P)).
(6 2.3 ) .
(6.2.4)
Section 6.2
Asymptotic Estimation The ry in
o
p
Dimensions
385
Hence, (6.2.5)
where E(>T>, P)
�
J(6. P)E>T>>T>r(X1, 6(P))F (0. P)
(6.2.6)
and F 1 (6, P) = - EpD>l>(X, , 6(P))
=
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
1
-n
n
1 n
I: >f>(X,, 6(P)) n I: D>f>(X,, 6�)(6n - 6(P)). i=l i=l =
(6.2.7)
Note that the left-hand side of (6.2. 7) is a p x 1 vector, the right is the product of a p x p matrix and a p x 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p x p matrices, when viewed as vectors, is an open subset of RP , representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to con clude that A3 and A4 guarantee that with probability tending to 1, � 2::� 1 DW(Xi, 6�) is nonsingular. ,
Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of On is motivated by P, the behavior in (6.2.3) is guaranteed for P E Q, which can include P � P. In fact, typically Q is essentially the set of P's for which 6(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to:
A6. If l(-, 6) is differentiable -E6 w(X,, 6)Dl(Xt. 6) -Cov6 (w(X1, 6), Dl(X1, 6))
(6.2.8)
defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A61 extend to the multivariate case readily. Note that consistency of On is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate 0� will find a solution On of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
" i
i!
Inference in the Multiparameter Case
Chapter 6
'
'
,I
''
386
'
'
'
6.2.2
Asymptotic Normality and Efficiency of the MLE
[fwe take p(x O ) comes ,
=
l(x,O)
=
log p(x,O), and >I> (x , O) obeys A0-A6, then (6.2.8) heEoDl (X 1 , O)DT l( X1 , 0)) VaroDl(X1, 0)
(6.2.9)
where
is the Fisher information matrix 1(8) introduced in Section 3.4. If p : 8 ---+ R, 8 C Rd, is a scalar function, the matrix 8��eoJ {8) is known as the Hessian or curvature matrix of the surface p. Thus, (6.2.9) states that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
))
Theorem 6.2.2. If AO-A6 holdfor p(x, 0)
=
. •
'
I' '
'
'
-
log p(x, 0), then the MLE On satisfies
I n 11 op(n + l:; r' (O)Dl(X; , O) · ') On = 0 + n i =l
(6.2. 10)
i '
'
,
so that (6.2 . 1 1 )
If 8n is a minimum contrast estimate with p and '¢ satisfying AO--A6 and corresponding asymptotic variance matrix E(W, P9). then E(w,P0 ) > r ' (OJ in the sense of Theorem 3.4.4 with equality in (6.2.12) for 0 -On = On + Op(n - 1/2 ).
(6.2.12) =
Oo iff, unckr 00, (6.2. !3)
The proofs of (6.2.10) and (6.2.11) parallel those of {5.4.33) and (5.4.34) exactly. The proof of (6.2. 12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by {6.2.6) and {6.2.8) E(>I> , Po) = Cov0 1 {U, V)Varo (U)Cov0 1 (V, U) (6.2.14) Proof.
where U = >I>(X1 , 0), V = Dl(X 1 , 0). But by (B.l0.8), for any U, V with Var(Ur, VT)T nonsingular (6.2.15) Var(V) > Cov(U, V)Var- 1 {U)Cov{V, U). Taking inverses of both sides yields I- 1 (0) = Var0 1 (V) < E(w, O). (6.2.16)
'
•
.
'
Section 6.2
Asymptotic Estimation Theory in
p
387
Dimensions
Equality holds in (6.2.15) by (B . I 0.2.3) iff for some b = b(O) U = b + Cov ( U , V ) Var- 1 ( V ) V
(6.2. 1 7)
with probability 1. This means in view of Eo 'I' = EoDl = 0 that w(X1 , 0)
=
b(O)Dl(X, , O).
In the case of identity in (6.2.16) we must have -[EoD>�'(X,, OW1>li(X, , 0) = r 1 (0)Dl(X1, 0).
(6.2.18)
Hence, from (6.2.3) and (6.2. 10) we conclude that (6.2.13) holds.
0 �
We see that, by the theorem, the MLE is efficient in the sense that for any ap l aT(J n has asymptotic bias o(n-112) and asymptotic variance n-1aT J-1(8)a, which is no larger than that of any competing minimum contrast estimate. Further any competitor Bn�such that aT(Jn has the same asymptotic behavior as aTBn for all a in fact agrees with On to order n-112 A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6 on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example. x
.
-
Example 6.2.1. The Linear Model with Stochastic Covariates. Let Xi = (Zf, Yi)T, 1 < i ::S n, be i.i.d. as X = (ZT, Y) T where Z is a p x 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways: (i) (6.2.19) where ' is distributed as N(O, a2) independent of Z and E(Z) = 0. That is, given Z, Y has a N(a + zr[3, a2) distribution. (ii) The distribution Ho of Z is known with density h0 and E(ZZT) is nonsingular. The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of [3 is given by (with probability 1)
� - Z �T(nJ Y . f3-.. = [Z-T(nJ Z(nJI I
(6.2.20)
Here z(n is the n X p matrix IIZij Z.j II where z.j = � LZ 1 Zij· We used subscripts ) (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Z(n) = (Z1, . Zn)T is referred lo as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of a and a2 are �
.
.
1
p --" a=Y � Z ; f3; ,
.J =l
I
� 2 a 2 = -[Y - (Ci + Z(nJ/3) [ . n
(6.2.21)
I
388
Inference in the Multiparameter Case
Chapter 6
�
Note that although given Z 1 , . . . , Zn, (3 is Gaussian, this is not true of the marginal distribution of {3. It is not hard to show that AO-A6 hold in this case because if Ho has density ho and if 8 denotes then �
(a,f3r,a2)T,
I 2 2 + log h0(z) rr) log + 2 ] (1oga zr,:3) + (a [Y 2 2 2oI
l(X,IJ)
Z; ; ( ,
Dl(X, IJ)
and
,
,
1(1:1)
1
,
2"
4 (<2 - I) a- 2
=
0 0
)
I
i
'
I
••
'
0
0 0
o--2 E(ZZT)
'
(6.2.23)
1
2u4
0
• .
l
<
,
•
N(O, diag (o-2 ,o-2[E ( ZZrW 1 , 2o-4 )). (6.2.24) This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of Ho known plays no role in the limiting result for a,/3,U2• Of course, these will only be the MLEs if H0 depends only on parameters other than (a, {3,u2). In this case we can estimate and give approximate confidence intervals for {3j. j = 1, . ,p. E( ZZT) by � L� 1 An interesting feature of (6.2.23) is that because 1(1:1) is a block diagonal matrix so is 1 -1 (6) and, consequently, {3 and 0'2 are asymptotically independent. In the classical linear model of Section 6.1 where we perform inference conditionally given Zi = Zi, 1 < i < n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew a-2, the MLE would still be {3 and its asymptotic varianc� optimal for this model. If we knew a and /3, U2 would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, &2 would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = { P(9,") : 8 E 8, '7 E £} we say we can estimate B adaptively at 'TJO if the asymptotic variance of the MLE B (or more generally, an efficient estimate of 8) in the pair (0, iiJ is the same as that of O(r;o), the efficient estimate for 'P'I1a = {P(o,f1a) : 8 E 8}. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating {31 in the presence of a, (!32 . . . , /3p) with (i) (3,, . . , f3v known. (ii) ,:3 arbitrary. {3p 0. Let Z, In case (i), we take, without loss of generality, a = fJ2 (Zi l , . . . , Zip) T, then the efficient estimate in case (i) is �
zizr
.
.
�
�
�
a,
j '
•
•
so that by Theorem 6.2.2
L(y'n(ii - a, {j - ,:3,82 - o-2))
(6.2.22)
j j
�
.
.
.
.
=
=
(6.2.25)
1 •
j
'
'
l
l
l
,
I
•
1
• •
.
:
'
'
Section 6.2
Asymptotic Estimation Theory in p Dimensions
389
with asymptotic variance a2[Ezfj-- 1 . On the other hand, (31 is the first coordinate of {3 given by {6.2.20). Its asymptotic variance is the {1, 1) element of 0"2 [EzzT]- 1 , which is strictly bigger than , [Ezf] -• unless [EZZTJ- 1 is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate f3t adaptively if f3z, . . , {3P are regarded as nuisance -
,.
-
.
parameters. What is happening can be seen by a representation of [Zfn) Z( n)]-1 Z(n ) Y and ! 1 1 (11) where J-1(11) = llf''{ll) lf. We claim that
(3
2:�-JZ.t z,C I J )Y; (6.2.26) I= 1 zC 1 2 (Z ) tl L.... t=l t where Z( l ) is the regression of (Zu, . . . , Z1n)T on the linear space spanned by (Zj 1 , . . . , Zjn)T, 2 < j < Similarly, 2 " (6.2.27) Z . I (II) = "'I E(Z" !1(Zn 1 Z21 , . . , p,) ) where II(Ztl I Zz l ' . . . ) Zpl) is the projection of zll on the linear span of z21 ) . . Zp l (Problem 6.2.1 1). Thus, Il{Zu I z, " . . . 'Zpl) = l:j=, a; z,, where (a;, . . . ' a;) min imizes E(Zu - l:P_2 aj Zj 1) 2 over ( . . , ap) E RP-l (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing [3,, . . , {3p when the variables Z2, . . . , Zp are in any way correlated with Z1 and the price is measured by [E(Zil - Il(Zu I Z21, . . . 'Zp1 )2]- 1 = (1 - E(Il(Zu I z21,2 . . . , Zp,))2 ) I . '-'--:..:..._ -'-""""f;c;. ,.(7 '----"---"'"-''Z E(Z11 ) ( ) E 11 -
"'n
-
p.
-
'
a,,
'
.
.
-
(6.2.28)
In the extreme case of perfect collinearity the price is oo as it should be because (31 then becomes unidentifiable. Thus, adaptation corresponds to the case where have linearly (see Section 1.4). Correspondingly no value in predicting in the Gaussian linear model (6.1.3) conditional on the Zi, i = 1, . . , n, is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if
(Z2, . . , , Zp)
Z1
fJ1
.
E(Zu - Il(Zu I z21, . . . , Zp,))2 = 0.
0
Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the €i in (6.2.19) are i.i.d. but not necessarily Gaussian with density !Jo (� ) for instance,
,
fo (x) = ( 1
e-x
)
+ e-x ' '
the logistic density. Such error densities have the often more realistic, heavier tails(1) than the Gaussian density. The estimates {30 1 O'o now solve -
p 1 z Y; L f3.o z,. L ,,.,p k=l i=l p n 1 L a x a I Yi - L/3koZik k= l i= l n
and
0"
-
-
-
=0
=0
, ,
390
Inference in the Multiparameter Case
=
f'
(yj_Q10 (y) + l ) , (30 = (fito, . . . ,f]po) T �
�
where 'lj;
=
Theorem
6.2.2 may be shown to hold (Problem 6.2.9) if
- J;; . x(y)
-
�
Chapter 6
The assumptions of
fo is strictly concave, i.e., fo is strictly decreasing. (ii) (log fo)" exists and is bounded. Then, if further fo is symmetric about 0,
'
(i) log
•
r:r- 2 I((3r, I )
I(IJ) =
2
,
r:r- 2
( c1E(tZ) � )
•
(6.2.29) •
, ,
fo(x)dx, c, = J (xfo (x) + 1 ) 2 fo(x)dx. Thus, ,60, iTo are opti mal estimates of {3 and a in the sense of Theorem � 6.2.2 if fo is true. Now suppose /o generating the estimates f3o and (T� is symmetric and satisfies (i) and (ii) but the true error distribution has density f possibly different from fo. Under suitable
l
conditions we can apply Theorem
i
where c1
=
J (fo(x))
6.2.1 with
' •
'
' •
i
•
•
� 'lj; ( y - L:l ZkiJk ) , I
where
'
i
;
1f;, (z, y, (3, r:r)
> ( y L�=l Zk,6k )
'
<
j
<
p
(6.2.30)
-
'
I
I •
,
-
•
' • '
• •
where {30, ao solve • '
(6.2.31)
' '
I I:
!!
II I I'' iI
<Jo
and E(>l>, P) is as in (6.2.6). What is the relation between (30, and (3, <J given in the Gaussian model (6.2.19)? If is symmetric about 0 and the solution of (6.2.31) is unique,
fo � then (30 = (3. But <Jo = c(fo)o- for some, c(fo) typically different from one. Thus, (30 can .
be used for estimating (3 althougl) if tiJI1 true distribution of the •• is N(O, r:r2 ) it should
perform less well than (3. On the qt)ler hand, iT0 is an estimate of r:r only if normalized by a constant depending on (See Problem 6.2.5.) These are issues of robustness, that is, to have a bounded sensitivity curve (Section 3�5, Problem 3.5.8), we may well wish to use a nonlinear bounded '.II = ('f/;1 , . . . , ,Pp )T to estimate {3 even though it is suboptimal when € N(O, u2), and to use a suita�ly normalized version of 0:0 for the same purpose. One effective choice of '¢1 is the Hu�er function defined in Problem 3.5.8. We will discuss D these issues further in Section 6.6 and Volume II. .
�
fo.
,....,
1
l
Co( vn(i3o f3o)) � N(O, :!:: ( >1>, P)) C( y'n(iT - r:ro) � N(O, r:r2(P))
I
j
i
to conclude that
•
ll
Section 6.2
391
Asymptotic Estimation Theory in p Dimensions
Testing and Confidence Bounds
There are three principal approaches to testing hypotheses in multiparameter models, the likelihood ratio principle, Wald tests (a generalization of pivots), and Rao's tests. All of these will be developed in Section 6.3. The three approaches coincide asymptotically but differ substantially, in perfonnance and computationally, for fixed n. Confidence regions that parallel the tests will also be developed in Section 6.3. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : 81 < 0 versus K : fh > 0 where fJz, . . . , Bp vary freely.
The Posterior Distribution in the Multiparameter Case
6.2.3
The asymptotic theory of the posterior distribution parallels that in the one dimensional case exactly. We simply make 8 a vector, and interpret I I as the Euclidean norm in conditions A7 and A8. Using multivariate expansions as in B.8 we obtain ·
Theorem 6.2.3. Ifthe multivariate versions of A0-A3, A4(a.s.), A5(a.s.) and A6-A8 hold �
then. if8 denotes the MLE,
(6.2.32) a.s. under Pe for al/ 8. The consequences of Theorem 6.2.3 are the same as those of Theorem 5.5.2, the equiv alence of Bayesian and frequentist optimality asymptotically. Again the two approaches differ at the second order when the prior begins to make a difference. See Schervish (1995) for some of the relevant calculations. A new major issue that arises is computation. Although it is easy to write down the pos terior density of 8, 1r(8) IT� 1 p(Xi, 8), up to the proportionality constant fe rr(t) ll� 1 p(X,, t)dt, the latter can pose a formidable problem if p > 2, say. The prob lem arises also when, as is usually the case, we are interested in the posterior distribution of some of the parameters, say ( 01, 82), because we then need to integrate out (83 , . . . , Bp). The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5.5.19) (Laplace's method). We have implicitly done this in the calculations leading up to (5.5.19). This approach is refined in Kass, K.adane, and Tier ney ( 1989). However, typically there is an attempt at "exact" calculation. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. These methods are beyond the scope of this volume but will be discussed briefly in Volume II. We defined minimum contrast (MC) and M-estimates in the case of p dimensional parameters and established their convergence in law to a normal distribu tion. When the estimating equations defining the M-estimates coincide with the likelihood
Summary.
392
Inference in the Multiparameter Case
Chapter 6
equations, this result gives the asymptotic distribution of the MLE. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AJ-estimate if we know the correct model P = {Po : B E e} and use the MLE for this model. We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {Po,11 : () E 8, ry E £} if the asymptotic distribution of fo( () - B) has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {Po,110 : () E 8}. ryo specified. In linear regression, adaptive estimation of is possible iff Z1 is uncorrelated with every linear function of Z2) . . . ) Zp. Another example deals with M-estimates based on estimating equations generated by linear models with non-Gaussian error distribution. Finally we show that in the Bayesian framework where given 0, X1 ) Xn-are i.i.d. Po, if 0 denotes the MLE for Po, then the posterior distribution of yfii(O - 0) converges a.s. under Po to the N(O, r 1 (9)) distribution.
/31
•
•
•
• •
)
. '
6 .3
LARGE SAMPLE TESTS AND CONFIDENCE REGIONS
In Section 6.1 we developed exact tests and confidence regions that are appropriate in re gression and anaysis of variance (ANOVA) situations when the responses are normally distributed. We shall show (see Section 6.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. However, we need methods for situations in which, as in the linear model, co variates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations. In these cases exact methods are typically not available, and we tum to asymptotic approximations to construct tests, confidence regions, and other methods of inference. We present three procedures that are used frequently: likelihood ratio, Wald and Rao large sample tests, and confidence procedures. These were treated for f) real in Section 5.4.4. In this section we will use the results of Section 6.2 to extend some of the results of Section 5.4.4 to vector-valued parameters.
,
'
i
• '
•
1 l •
·�
'
I '
j
. •
6.3.1
Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
In Sections 4.9 and 6.1 we considered the likelihood ratio test statistic,
sup{p(x,9) : 9 E 8} = A(x) sup{p(x, 9) : IJ E 8o} for testing H : IJ E 80 versus K : IJ E 8,, 8, = 8 - 8o, and showed that in several statistical models involving normal distributions, .\(x) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions. However, in many experimental situations in which the likelihood ratio test can be used to address important questions, the exact critical value is not available analytically. In such
'
'
I '
I •
I
'
'
Section
6.3
393
Large Sample Tests and Confidence Regions
cases we can turn to an approximation to the distribution of >.(X) based on asymptotic theory, which is usually referred to as Wilk.s '.s theorem or approximation. Other approxi mations that will be explored in Volume II are based on Monte Carlo and bootstrap simu lations. Here is an example in which Wilks's approximation to C(>.(X)) is useful:
Example 6.3.1. Suppose X�, X2 , . . . , Xn are i.i.d. as X where X has the gamma. r(a, /)), distribution with density
p(x; B)
�
1 iJ"x"- exp{-!lx}/f(a); x > 0; a > 0, il > 0.
(Ei, il), exists and in Example 2.4.2 we In Example 2.3.2 we showed that the MLE, B showed how to find B as a nonexplicit solution of likelihood equations. Thus, the numerator of .\(x) is available as p(x, B) � IT� 1 p(x;, B). Suppose we want to test H : a � 1 (exponential distribution) versus K : a i- 1. The MLE of f3 under H is readily seen from (2.3.5) to be ilo � 1/x and p(x; 1, ilo) is the denominator of the likelihood ratio statistic. It remains to find the critical value. This is not available analytically. D �
�
�
�
�
�
�
�
The approximation we shall give is based on the result "2log >.(X) � x�" for degrees of freedom d to be specified later. We next give an example that can be viewed as the limiting situation for which the approximation is exact:
Example 6.3.2. The Gaussian Linear Model with Known Variance. Let Y1 , . . . , Yn be independent with Yi rv N(Pi1 <1�) where Uo is known. As in Section 6.1.3 we test whether = ( 11 . . J.l. p, . , J.tn)T is a member of a q-dimensional linear subspace of Ir, w0, versus the alternative that J.l. E w - w0 where w is an r-dimensional linear subspace of Rn and w ::> w0; and we transform to canonical form by setting
vq span w0 where An x n is an orthogonal matrix with rows vf . . . v'[; such that V J , and v 1 , . . . , Vr span w. Set Bi T]i/Uo, i l, . . . , r and Xi Uduo, i l, . . . , n. Then Xi rv N(Bi, l), i 1 , . . . , r and Xi N(O, l), i r+l, . . . , n. Moreover. the hypothesis H is equivalent to H : Oq+ l = · · · = Or = 0. UsingSection 6.1.3, we conclude that under H. =
=
=
rv
• . . 1
1
1
=
=
=
2 Iog .\(Y)
r
�
L x; x;i=q+l �
•.
Wilks's theorem states that, under regularity conditions, when testing whether a parameter vector is restricted to an open subset of Rq or Rr, q < r, the X�-q distribution is an approximation to £(2 log .\(Y)). In this u2 known example, Wilks's approximation is D exact. We illustrate the remarkable fact that X�-q holds as an approximation to the null distri bution of 2 log A quite generally when the hypothesis is a nice q-dimensional submanifoJd of an r-dimensional parameter space with the following.
I [ j
i
'
394
Inference in the Multiparameter Case
Chapter 6
Example 6.3.3. The Gaussian Linear Model with Unknown Variance. If Yi are as in
Example 6.3.2 but CT2 is unknown then 8 = (Jl., u2 ) ranges over an r + 1-dimensional manifold whereas under H, 8 ranges over a q + l�dimensional manifold. In Section 6.1.3, we derived 2 " L... i =q+l xi ) n 2 log>.(Y � log I +
' '
''
2:: :• • +I X'f
•
•
Apply Example 5.3.7 to Vn = l:�-q+ 1 X[jn-1 _L:� r+ I X'f and conclude that Vn _£, x;_ , . Finally apply Lemma 5.3. 2 with g(t) � log(! + t), an � n, c � 0 and conclude that 2log ..\(Y) £ x;-q also in the a-2 unknown case. Note that for A (Y) defined in Remark 6.1.2, 2 log A (Y) �
Vn ". x;_,
as
o
welL
Consider the general i.i.d. case with X1 , . . . , Xn a sample from X c R', and () E 8 c W. Write the log likelihood as
p(x, B),
where x E
n ln(8) � I; log p(X B). ,,
i=l
We first consider the simple hypothesis H
:
'
1 •
!
6 = Bo.
'
Theorem 6.3.1. Suppose the assumptions of Theorem 6.2.2 are satisfied. Then, under H : () � 8o, � c 2 log>.(X ) � 2[ln(8n) - ln(8o)] � x;. �
Proof Because On solves the likelihood equation Doln(O) = 0, where Do is the derivative
�
with respect to 8, an expansion of ln(O) about On evaluated at 0 = 00 gives ,....
,....
,....
2[ln(8n) - ln(8o)] � n(8n - 8o) In(8n)(8n - 8o) � � for some ()� with [8� - 8n[ < [8n - 8o[. Here I
"
8
T
8
•
In(()) � - n L 88 88 log p(X , 8) k J i=l -
By Theorem 6.2.2, mation matrix. Because
,
(6.3.1)
! j '
•
• •
..
'"
fo(On - 80) ", N(O, J-1 (8o)), where I.xr(8) is the Fisher infor
l
'
�-
'
'
[8� - 8o[ < [8� - On [ + [On - 8o[ :0 2[0n - 8o[, we can conclude arguing from A.3 and A.4 that that In(8�) !. Eln(80) � 1(80). Hence, """
C
T
1 J(8o)V, V � N(O, I (8o)).
2[ln(8n) - ln(8o)] � V The result follows because, by Corollary B.6.2, yrJ(8o)V � x�.
(6.3.2) 0
•
•
•
•
' •
As a consequence of the theorem, the test that rejects 2 log>.(X) where Xr(l and
395
Large Sample Tests and Confidence Regions
Section 6.3
)
- o
is the
H : ()
= 80 when
> x, (1 - a),
1 - o quantile of the X; distribution, has approximately Ievel l - a, � {eo : 2[ln(en) - ln(eo)] < x,. ( l
-
(6.3.3)
oe) }
is a confidence region for 9 with approximat� coverage probability 1
-
a
.
H : 8 E 8o. where 8 is open and 8o is the Set Of e E 8 with 8j = 8o,j, j = q + 1 , . . . , T , and {8o ,j} are specified values. 2 Examples 6.3.1 and 6.3.2 illustrate such 8o. We set d = r - q, e T = (e (ll, e < l ) , e <1l = (81, . . . , 8,)r, e< 'l = (Bo+h - . . , 8, ) r, e6' l = (8o,,+1· · . . Bo , r )T Next we tum to the more general hypothesis
,
Theorem 6.3.2. Suppose that the assumptions of Theorem 6.2.2 hold for p(x, e), e E 8. Let Po be the model {Po : 8 E 8o} with corresponding parametn'zation 9( 1 ) = �( 1 ) �(1) (1) ( 81 , . , , 8,). Suppose that e0 is the MLE of e under H and that e0 satisfies A6for �( � T Po. Let eo ,n = (1101) , e0(2) ) . Then under H : e E 8o,
Proof.
Let
e0 E 8o and write 2 log >.(X)
= 2 [/n(iin) - In( eo) ] - 2 [1,. (iio,n) - ln (eo)].
It is easy to see that A0-A6 for P imply A0-A5 for Po. By
(6.3.4)
(6.2.10) and (6.3.1) applied to
� � (! ) � Bn and the corresponding argument applied to 00 , 9o,n and (6.3.4), 2 log >.(X)
= sT (eo)r 1 (eo)S(eo) - s; (eo)IQ' 1 (eo)S1 (eo) + Op(1)
where
(6.3.5)
n
n - 1 /2 = L Dl(X, , e) S(eo)
i=l
and S
= (S�, S2)T where 8 1 is the first q coordinates ofS.
Furthermore,
lo(eo) = Vare,S1 (eo). Make a change of parameter, for given true 80 in
8o,
TJ = M (e - eo) where, dropping the dependence on
Do. M = p J 1 /2
(6.3.6)
396
Inference in the M u!tiparameter Case
and P is an orthogonal matrix such that, if A0 MA0
=
{1J :
_
{8
-
Chapter 6
:
Oo 8 E 80} I
= · · · = 11, = 0, 1J E M8).
1/q+l
•
• •
•
Such P exists by the argument given in Example 6.2.1 because 111 2 Ao is the intersection of a q dimensional linear subspace of Rr with J1 12 {8 - 80 : () E 8}. Now write De for differentiation with respect to (} and D11 for differentiation with respect to TJ. Note that, by definition, >. is invariant under reparametrization =
.l.(X)
(6. 3. 7)
-y(X)
. ,: •
!
1 1
.
•
1 • •
where
-y(X) = sup{p(x, 110 1J
+ M-11))} / sup{p(x, llo + M-1'1) : llo + M- 11) E Bo}
and from (8.8.13) D1Ji(x, 110
+ M-11J) = [M-1JTDIII(x, ll).
(6.3.8)
We deduce from (6.3.6) and (6.3.8) that if T(1J)
=
.
'
n
n - 1 /2 L D1Jl (X ,, llo + M- 11)),
•
I
,
i=l
• •
then Moreover, because in terms of 1), H is {1J E M8 applying (6.3.5) to -r(X) we obtain, 2 log-r(X)
:
1Jq+ 1
= ..
·
=
(6.3.9)
1Jr
= 0}, then by
TT(O)T(O) - Tf{O)T1(0) + op(l)
q - L T,'(o) - L T,'(o) + ov(l) r
i=l r
L
i=q+ l
'
T,2 (0) + Op(l),
which has a limiting X�-q distribution by Slutsky's theorem because T(O) has a limiting Nr (O, J) distribution by (6.3.9). The result follows from (6.3. 7). D
I• ' •
(6.3.10)
i=l
•
'
Note that this argument is simply an asymptotic version of the one given in Example 6.3.2. Thus, under the conditions of Theorem 6.3.2, rejecting if .\(X) > X q ( l - a) is an asymptotically level a test of H : 8 E 9o. Of equal importance is that we obtain an asymptotic confidence region for (8q+ 1 , , Br ) a piece of 8. with 8 , . . . , Bq acting as 1 nuisance parameters. This asymptotic level 1 o: confidence region is r
• • .
�
�
�
.
-
{(Oq+l> . . . , llr) : 2[1n(11n) - ln(Oo,!, . . . , Oo,,, oq+ l , . . . , Or)] '
.
,. ' •
I
<
Xr-q(l - a)) (6.3.1 1 )
'
R:.= l •:"'g ; o:_ f;:_":.= 6:: i e ..C d_C de Te ":_:.: e'C e:: Se . 3___c::: ':: ' ;.:o o" o:e:cS::: c_:: ':c m C'p:C :.:o ' ":.= :_ o::.c "':__ '.c g.c :. =':: :::: -
cc 397
_ _ _ _ _ _ _ _ _ _ _ _
-
where Bo. 1 , . . . , Bo.q are the MLEs. themselves depending on 8q+l, . . . . 8,., of B 1 , , 8q assuming that Bq+ I , . . . , Br are known. More complicated linear hypotheses such as H : 6- Bo E w0 where w0 is a linear space of dimension q are also covered. We only need note that if wo is a linear space spanned by an orthogonal basis v 1 , . . . , Vq and Vq+ t , . . . , Vr are orthogonal tow0 and v 1 , . . . , Vr span Rr then, .
WO �
T {8 : 9 Vj � 0,
q
+ 1 < j < r}.
.
•
(6.3.12)
The extension of Theorem 6.3.2 to this situation is easy and given in Problem 6.3.2. The formulation of Theorem 6.3.2 is still inadequate for most applications. It can be extended as follows. Suppose H is specified by: There exist d functions, 9i ; e - R, q + 1 < j < r written as a vector g, such that Dg(8) exists and is of rank r - q at all 8 E e. Define H : 8 E 80 with
eo � {8 E e : g(8)
=
o}.
(6.3.13)
Evidently, Theorem 6.3.2 falls under this schema with 9;(8) � 8j - 8o,j. q + 1 < j S r. Examples such as testing for independence in contingency tables, which require the following general theorem, will appear in the next section. Theorem 6.3.3. Suppose the assumptions of Theorem 6.3.2 and the previously conditions on g. Suppose the MLE hold 9o,n under H is consistent for all (J E 80. Then, if A(X) is the likelihood ratio statistic for H : 8 E 8o given in (6.3.13), 2 log .X(X) ". under H.
X�-•
The proof is sketched in Problems (6.3.2)-(6.3.3). The essential idea is that, if 8o is true, .X(X) behaves asymptotically like a test for H : 8 E 800 where
eoo
�
{8 E e : Dg(8o)(8 - 8o )
�
0}
(6.3.14)
a hypothesis
of the form (6.3.13). Wilks's theorem depends critically on the fact that not only is 8 open but that if 60 given in (6.3.13) then the set { (8 1 . . . , 8q)T : 8 E e } is open in R•. We need both properties because we need to analyze both the numerator and denominator of A(X). As an example ofwhatcan go wrong, let (X; I , Xi2) be i.i.d. N(BI, 82, J), where J is the 2 x 2 identity matrix and 80 = {8 : 81 + 82 < 1 }. If 8I + 82 � 1, >
(j
0
�
( (XI + X2 )
2
+
� 1 _ (X I + X2) ) 2' 2
2
and 2 log .X(X) � xt but if 8 I + 82 < 1 clearly 2 log .X(X) � Op(1). Here the dimension of 80 and 8 is the same but the boundary of 8o has lower dimension. More sophisticated examples are given in Problems 6.3.5 and 6.3.6.
398
Inference in the Multi parameter Case
Chapter 6
' '
' '
6.3.2
Wald's and Rao's Large Sample Tests The Wald Test
Suppose that the assumptions of Theorem 6.2.2 hold. Then
� L ,fii(IJ - IJ) � N(o, r 1 (1J)) as n
� oo.
(6.3.15)
Because I( IJ) is continuous in IJ (Problem 6.3. 10), it follows from Proposition B.7.1(a) that
� p I(IJn) � I(IJ) as n � oo.
(6.3.16)
By Slutsky's theorem B.7.2, (6.3.15) and (6.3.16),
n(iin - IJ)T I(On)(iin - IJ) !:., yrI(IJ)V, V - Afr(O, r1 (1J)) where, according to Corollary B.6.2, yrI(IJ)V - x;. It follows that the Wald test that rejects H : 6 = Oo in favor of K : (} i= Oo when
� � T Wn (IJo) � n(IJn - IJo) I(IJo)(IJn - IJo)
>
'
Xr (1 - o)
has asymptotic level a. More generally I(Bo) can be replaced by any consistent estimate
� � of I( IJo). in particular - � D2ln (IJo) or I ( IJn) or -! D2ln (IJn). The last Hessian choice is � favored because it is usually computed automatically with the MLE. It and I(Bn) also have the advantage that the confidence region one generates {6 : Wn(O) < xp(l - a)} is an ellipsoid in W easily interpretable and computable see (6.1.31). For the more general hypothesis H : (} E 8o we write the MLE for 8 E 8 as 6n = � (1) �(2) � � � � �(2) ( 1) (IJn , IJn ) where IJn � (8" . . , 8, ) and IJn � (B,+, . . . , Br ) and define the Wald statistic as ) 2) (6.3.17) Wn (IJ� ) � n(ii�' - IJ�'ll[J"(iin Jr1 (ii�) - IJ�2))
'
r'.
"
" .
'
i'
�
i
�
.
I
where I22(IJ) is the lower diagonal block of I-1 ( IJ) written as
I -1 (IJ)
_
-
(
I " (IJ) I12(1J) I21(1J) J22(1J)
•
)
. •
i
''
.
! I
i
I
with diagonal blocks of dimension q X q and d X d, respectively. More generally, 122(iin) is replaceable by any consistent estimate of 122(8), for instance, the lower diagonal block
� of the inverse of - � D2ln( IJn). the Hessian (Problem 6.3.9).
j
Theorem 6.3.4. Under the conditions of Theorem 6.2.2, zf H is true,
Wn(IJ�2) ) !:., x;_,.
(6.3.18)
Proof. 1(8) continuous implies that I-1(8) is continuous and, hence, !22 is continuous. �(2 c (2) ) But by Theorem 6.2.2, ,fii(!Jn - IJ0 ) � Afd(O, J22( 1Jo )) if IJo E 60 holds. Slutsky's 0 theorem completes the proof. '
f I
�
'
'
,:
I
I :
.'
I !
�----------------------------------------
i l
'
•
'
Section 6.3
Large Sample Tests and Confidence Regions
399
2)
The Wald test, which rejects iff Hin ( Bb ) > X,-- q (1 level a . What is not as evident is that, under H,
) is, therefore. asymptotically
- a ,
�( 2 (6.3.19) Wn (90 ) ) = 2 log ,\(X) + ap(l) where A(X) is the LR statistic for H : (} C: 80. The argument is sketched in Problem 6.3.9.
Thus, the two tests are equivalent asymptotically. The Wald test leads to the Wald confidence regions for ( Bq + 1 , . Br) T given by { 8(2 ) 2 Wn (9 ( 1 ) < Xr-q(l - u)}. These regions are ellipsoids in R"- Although, as (6.3.19) indicates, the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n, in practice they can be very different. .
.
�
:
The Rao Score Test

For the simple hypothesis $H : \theta = \theta_0$, Rao's score test is based on the observation that, by the central limit theorem,
$$\sqrt{n}\,\psi_n(\theta_0) \xrightarrow{\mathcal{L}} N(0, I(\theta_0))$$
where $\psi_n = n^{-1} D l_n(\theta_0)$ is the likelihood score vector. It follows from this and Corollary B.6.2 that under $H$, as $n \to \infty$,
$$R_n(\theta_0) = n\,\psi_n^T(\theta_0) I^{-1}(\theta_0)\psi_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r. \qquad (6.3.20)$$
The test that rejects $H$ when $R_n(\theta_0) > x_r(1-\alpha)$ is called the Rao score test. This test has the advantage that it can be carried out without computing the MLE, and the convergence $R_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r$ requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.

The extension of the Rao test to $H : \theta \in \Theta_0$ runs as follows. Let
$$\psi_n(\theta) = n^{-1} D_2 l_n(\theta)$$
where $D_1 l_n$ represents the $q \times 1$ gradient with respect to the first $q$ coordinates and $D_2 l_n$ the $d \times 1$ gradient with respect to the last $d$. The Rao test is based on the statistic
$$R_n(\theta_0^{(2)}) = n\,\psi_n^T(\hat{\theta}_{0,n})\,\hat{\Sigma}^{-1}\,\psi_n(\hat{\theta}_{0,n})$$
where $\hat{\Sigma}$ is a consistent estimate of $\Sigma(\theta_0)$, the asymptotic variance of $\sqrt{n}\,\psi_n(\hat{\theta}_{0,n})$ under $H$. It can be shown that (Problem 6.3.8)
$$\Sigma(\theta_0) = I_{22}(\theta_0) - I_{21}(\theta_0) I_{11}^{-1}(\theta_0) I_{12}(\theta_0) \qquad (6.3.21)$$
where $I_{11}$ is the upper left $q \times q$ block of the $r \times r$ information matrix $I(\theta_0)$, $I_{12}$ is the upper right $q \times d$ block, and so on. Furthermore (Problem 6.3.9), under A0--A6 and consistency of $\hat{\theta}_{0,n}$ under $H$, a consistent estimate of $\Sigma^{-1}(\theta_0)$ is
$$n^{-1}\left[-D_2^2 l_n(\hat{\theta}_{0,n}) + D_{21} l_n(\hat{\theta}_{0,n})[D_1^2 l_n(\hat{\theta}_{0,n})]^{-1} D_{12} l_n(\hat{\theta}_{0,n})\right] \qquad (6.3.22)$$
where $D_2^2$ is the $d \times d$ matrix of second partials of $l_n$ with respect to $\theta^{(2)}$, $D_{21}$ is the matrix of mixed second partials, and so on. Under $H$,
$$R_n(\theta_0^{(2)}) \xrightarrow{\mathcal{L}} \chi^2_d.$$
The Rao large sample critical and confidence regions are $\{R_n(\theta_0^{(2)}) \ge x_d(1-\alpha)\}$ and $\{\theta^{(2)} : R_n(\theta^{(2)}) < x_d(1-\alpha)\}$. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under $H$. On the other hand, it shares the disadvantage of the Wald test that matrices need to be computed and inverted.

Power Behavior of the LR, Rao, and Wald Tests

It is possible, as in the one-dimensional case, to derive the asymptotic power for these tests for alternatives of the form $\theta_n = \theta_0 + \Delta/\sqrt{n}$ where $\theta_0 \in \Theta_0$. The analysis for $\Theta_0 = \{\theta_0\}$ is relatively easy. For instance, for the Wald test
$$n(\hat{\theta}_n - \theta_0)^T I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0) = (\sqrt{n}(\hat{\theta}_n - \theta_n) + \Delta)^T I(\hat{\theta}_n)(\sqrt{n}(\hat{\theta}_n - \theta_n) + \Delta) \xrightarrow{\mathcal{L}} \chi^2_r(\Delta^T I(\theta_0)\Delta)$$
where $\chi^2_m(\gamma^2)$ is the noncentral chi square distribution with $m$ degrees of freedom and noncentrality parameter $\gamma^2$. It may be shown that the equivalence (6.3.19) holds under $\theta_n$ and that the power behavior is unaffected and applies to all three tests. Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score tests; see Rao (1973) for more on this.

Summary. We considered the problem of testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta - \Theta_0$ where $\Theta$ is an open subset of $R^r$ and $\Theta_0$ is the collection of $\theta \in \Theta$ with the last $r - q$ coordinates $\theta^{(2)}$ specified. We established Wilks's theorem, which states that if $\lambda(\mathbf{X})$ is the LR statistic, then, under regularity conditions, $2\log\lambda(\mathbf{X})$ has an asymptotic $\chi^2_{r-q}$ distribution under $H$. We also considered a quadratic form, called the Wald statistic, which measures the distance between the hypothesized value of $\theta^{(2)}$ and its MLE, and showed that this quadratic form has limiting distribution $\chi^2_{r-q}$. Finally, we introduced the Rao score test, which is based on a quadratic form in the gradient of the log likelihood. The asymptotic distribution of this quadratic form is also $\chi^2_{r-q}$.
6.4 LARGE SAMPLE METHODS FOR DISCRETE DATA
In this section we give a number of important applications of the general methods we have developed to inference for discrete data. In particular we shall discuss problems of
goodness-of-fit and special cases of log linear and generalized linear models (GLM), treated in more detail in Section 6.5.
6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's $\chi^2$ Test
As in Examples 1.6.7, 2.2.8, and 2.3.3, consider i.i.d. trials in which $X_i = j$ if the $i$th trial produces a result in the $j$th category, $j = 1, \ldots, k$. Let $\theta_j = P(X_i = j)$ be the probability of the $j$th category. Because $\theta_k = 1 - \sum_{j=1}^{k-1}\theta_j$, we consider the parameter $\theta = (\theta_1, \ldots, \theta_{k-1})^T$ and test the hypothesis $H : \theta_j = \theta_{0j}$ for specified $\theta_{0j}$, $j = 1, \ldots, k-1$. Thus, we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution, or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory. In Example 2.2.8 we found the MLE $\hat{\theta}_j = N_j/n$, where $N_j = \sum_{i=1}^{n} 1\{X_i = j\}$. It follows that the large sample LR rejection region is
$$2\log\lambda(\mathbf{X}) = 2\sum_{j=1}^{k} N_j\log(N_j/n\theta_{0j}) > x_{k-1}(1-\alpha).$$
To find the Wald test, we need the information matrix $I = \|I_{ij}\|$. For $i, j = 1, \ldots, k-1$, we find using (2.2.33) and (3.4.32) that
$$I_{ij}(\theta) = \begin{cases} \dfrac{1}{\theta_i} + \dfrac{1}{\theta_k} & \text{if } i = j, \\[2mm] \dfrac{1}{\theta_k} & \text{if } i \ne j. \end{cases}$$
Thus, with $\theta_{0k} = 1 - \sum_{j=1}^{k-1}\theta_{0j}$, the Wald statistic is
$$W_n(\theta_0) = n\sum_{j=1}^{k-1}[\hat{\theta}_j - \theta_{0j}]^2/\theta_{0j} + n\sum_{i=1}^{k-1}\sum_{j=1}^{k-1}(\hat{\theta}_i - \theta_{0i})(\hat{\theta}_j - \theta_{0j})/\theta_{0k}.$$
The second term on the right is
$$n\Big[\sum_{j=1}^{k-1}(\hat{\theta}_j - \theta_{0j})\Big]^2\Big/\theta_{0k} = n(\hat{\theta}_k - \theta_{0k})^2/\theta_{0k}.$$
Thus,
$$W_n(\theta_0) = \sum_{j=1}^{k}(N_j - n\theta_{0j})^2/n\theta_{0j}.$$
The term on the right is called Pearson's chi-square ($\chi^2$) statistic and is the statistic that is typically used for this multinomial testing problem. It is easily remembered as
$$\chi^2 = \text{SUM}\left[\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}\right] \qquad (6.4.1)$$
where the sum is over categories and "expected" refers to the expected frequency $E_H(N_j)$. The general form (6.4.1) of Pearson's $\chi^2$ will reappear in other multinomial applications in this section.
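The following Python sketch computes both the LR statistic above and Pearson's $\chi^2$ of (6.4.1) for a multinomial sample; the counts and null probabilities are invented for illustration.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

# Observed category counts N_j and hypothesized probabilities theta_0j (hypothetical values).
N = np.array([18, 55, 27])
theta0 = np.array([0.25, 0.50, 0.25])
n = N.sum()
expected = n * theta0

# LR statistic 2 sum N_j log(N_j / (n theta_0j)); xlogy handles any zero counts.
lr = 2 * np.sum(xlogy(N, N / expected))

# Pearson's chi-square: sum (Observed - Expected)^2 / Expected, as in (6.4.1).
pearson = np.sum((N - expected) ** 2 / expected)

df = len(N) - 1
print(lr, pearson, chi2.sf(pearson, df))  # both statistics are referred to the chi^2_{k-1} table
```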
with
To find J- 1 , we could invert I or note that by (6.2. 1 1) , J-1 (9) � with and, by � A. N (N1 , . . . , Nk _I ) T l3 . 1 5, llaij ll( k-1) x (k-l) =
=
=
Var ( N ) , where
=
Thus, the Rao statistic is
-
(
n
' .
(
k-1 k-1 . () L L . =1 z=1 . ()Oi J
.
. .
-
:
.�
_..._
_..._
_ z
.. "'
()k 8ok
·";;
)(
)
_..._
_..._
[ (!!J_ �) ]
The second term on the right is
k-1 L 8oj j=1 8oj 8ok _..._
-n
_..._
2
[� ] _..._
= -n
8ok
To simplify the first term on the right of (6.4.2), we write _..._
()j 8oj
-
_..._
-
()k 8ok
-
-
2
1
-1
{[8ok (()j - 8oj) ] - [8oj (()k - 8ok )] } 8oj 8ok , _..._
=
)
()k 8oi8oj . 8J· - _ 8o1· 8ok .
,.._
and expand the square keeping the square brackets intact. Then, because
k-1 L (Oj - 8oj ) j=1
=
- (Ok - 8ok),
(6.4.2)
403
Large Sam ple M ethods for Discrete Data
Section 6 . 4
the first term on the right of ( 6 . 4 . 2 ) becomes
It follows that the Rao statistic equals Pearson 's x2 .
Example 6.4. 1. Testing
a
Genetic Theory. In experiments on pea breeding, Mendel ob
served the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Possible types of progeny were:
(2)
wrinkled yellow;
(3)
round green; and
(1)
round yellow;
wrinkled green. If we assume the seeds are
(4)
produced independently, we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered occurrence
e4
B1 , (h, 83, 84.
1 , 2, 3, 4
as above and associated probabilities of
Mendel's theory predicted that
81
=
9/16,
82
1 / 1 6, and we want to test whether the di stribution of types in the n
=
= =
83
performed (seeds he observed) is consistent with his theory. Mendel observed
n2 nB4o
101,
= =
n3
34 . 75,
k
= =
X2
108, 4 =
n4
=
(2.25) 2 3 1 2 . 75
32.
+
Then,
( 3 . 2 5) 2 104 . 2 5
+
nB10
=
( 3 . 75 ) 2 104 . 2 5
3 1 2 . 75,
+
nB2o
( 2 . 75 ) 2 34 . 75
=
=
=
3 / 1 6,
556 trials he
n830
n1
=
=
3 1 5,
104.25,
0 . 47 '
which has a p-value of 0.9 when referred to a $\chi^2_3$ table. There is insufficient evidence to reject Mendel's hypothesis. For comparison, $2\log\lambda = 0.48$ in this case. However, this value may be too small! See Note.
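A short Python check of the arithmetic in Example 6.4.1 (the counts and the 9:3:3:1 probabilities are Mendel's, as quoted above):

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([315, 101, 108, 32])
theta0 = np.array([9, 3, 3, 1]) / 16
expected = observed.sum() * theta0                        # 312.75, 104.25, 104.25, 34.75

pearson = np.sum((observed - expected) ** 2 / expected)   # about 0.47
lr = 2 * np.sum(observed * np.log(observed / expected))   # about 0.48

print(round(pearson, 2), round(lr, 2), round(chi2.sf(pearson, df=3), 2))  # p-value about 0.9
```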
6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables

Suppose $\mathbf{N} = (N_1, \ldots, N_k)^T$ has a multinomial, $\mathcal{M}(n, \theta)$, distribution. We will investigate how to test $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a composite "smooth" subset of the $(k-1)$-dimensional parameter space
$$\Theta = \Big\{\theta : \theta_i \ge 0,\ 1 \le i \le k,\ \sum_{i=1}^{k}\theta_i = 1\Big\}.$$
For example, in the Hardy-Weinberg model (Example 2.1.4),
$$\Theta_0 = \{(\eta^2,\ 2\eta(1-\eta),\ (1-\eta)^2) : 0 \le \eta \le 1\},$$
which is a one-dimensional curve in the two-dimensional parameter space $\Theta$. Here testing the adequacy of the Hardy-Weinberg model means testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$ where $\Theta_1 = \Theta - \Theta_0$. Other examples, which will be pursued further later in this section, involve restrictions on the $\theta_i$ obtained by specifying independence assumptions on classifications of cases into different categories. We suppose that we can describe $\Theta_0$ parametrically as
$$\Theta_0 = \{(\theta_1(\eta), \ldots, \theta_k(\eta))^T : \eta \in \mathcal{E}\}$$
where $\eta = (\eta_1, \ldots, \eta_q)^T$, $\mathcal{E}$ is a subset of $q$-dimensional space, and the map $\eta \to (\theta_1(\eta), \ldots, \theta_k(\eta))^T$ takes $\mathcal{E}$ into $\Theta_0$. To avoid trivialities we assume $q < k - 1$. Consider the likelihood ratio test for $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$. Let $p(n_1, \ldots, n_k, \theta)$ denote the frequency function of $\mathbf{N}$. Maximizing $p(n_1, \ldots, n_k, \theta)$ for $\theta \in \Theta_0$ is the same as maximizing $p(n_1, \ldots, n_k, \theta(\eta))$ for $\eta \in \mathcal{E}$. If a maximizing value $\hat{\eta} = (\hat{\eta}_1, \ldots, \hat{\eta}_q)$ exists, the log likelihood ratio is given by
$$\log\lambda(n_1, \ldots, n_k) = \sum_{i=1}^{k} n_i\left[\log(n_i/n) - \log\theta_i(\hat{\eta})\right].$$
If $\eta \to \theta(\eta)$ is differentiable in each coordinate, $\mathcal{E}$ is open, and $\hat{\eta}$ exists, then it must solve the likelihood equation for the model $\{p(\cdot, \theta(\eta)) : \eta \in \mathcal{E}\}$. That is, $\hat{\eta}$ satisfies
$$\frac{\partial}{\partial\eta_j}\log p(n_1, \ldots, n_k, \theta(\eta)) = 0, \quad 1 \le j \le q, \qquad (6.4.3)$$
or
$$\sum_{i=1}^{k}\frac{n_i}{\theta_i(\eta)}\frac{\partial}{\partial\eta_j}\theta_i(\eta) = 0, \quad 1 \le j \le q.$$
If $\mathcal{E}$ is not open, sometimes the closure of $\mathcal{E}$ will contain a solution of (6.4.3). To apply the results of Section 6.3 and conclude that, under $H$, $2\log\lambda$ approximately has a $\chi^2_{k-1-q}$ distribution for large $n$ (here $r = k-1$), we define $\theta_j' = g_j(\theta)$, $j = 1, \ldots, r$, where $g_j$ is chosen so that $H$ becomes equivalent to "$(\theta_1', \ldots, \theta_q')^T$ ranges over an open subset of $R^q$ and $\theta_j' = \theta_{0j}'$, $j = q+1, \ldots, r$, for specified $\theta_{0j}'$." For instance, to test the Hardy-Weinberg model we set $\theta_1' = \theta_1$, $\theta_2' = \theta_2 - 2\sqrt{\theta_1}(1 - \sqrt{\theta_1})$ and test $H : \theta_2' = 0$. Then we can conclude from Theorem 6.3.3 that $2\log\lambda$ approximately has a $\chi^2_{r-q}$ distribution under $H$. The Rao statistic is also invariant under reparametrization and, thus, approximately $\chi^2_{k-1-q}$. Moreover, we obtain the Rao statistic for the composite multinomial hypothesis by replacing $\theta_{0j}$ in (6.4.2) by $\theta_j(\hat{\eta})$. The algebra showing $R_n(\theta_0) = \chi^2$ in Section 6.4.1 now leads to the Rao statistic
$$R_n(\theta(\hat{\eta})) = \sum_{j=1}^{k}\frac{[N_j - n\theta_j(\hat{\eta})]^2}{n\theta_j(\hat{\eta})} = \chi^2$$
where the right-hand side is Pearson's $\chi^2$ as defined in general by (6.4.1). The Wald statistic is only asymptotically invariant under reparametrization. However, the Wald statistic based on the parametrization $\theta(\eta)$, obtained by replacing $\theta_{0j}$ by $\theta_j(\hat{\eta})$, is, by the algebra of Section 6.4.1, also equal to Pearson's $\chi^2$.
Example 6.4.4. Hardy-Weinberg. We found in Example 2.2.6 that if $\Theta_0$ holds, $\hat{\eta} = (2n_1 + n_2)/2n$. Thus, $H$ is rejected if $\chi^2 \ge x_1(1-\alpha)$ with
$$\theta(\hat{\eta}) = \left(\left(\frac{2n_1 + n_2}{2n}\right)^2,\ \frac{(2n_1 + n_2)(2n_3 + n_2)}{2n^2},\ \left(\frac{2n_3 + n_2}{2n}\right)^2\right)^T. \qquad \square$$
Example 6.4.5. The Fisher Linkage Model. A self-crossing of maize heterozygous on two characteristics (starchy versus sugary; green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white; (2) sugary-green; (3) starchy-white; (4) starchy-green. If $N_i$ is the number of offspring of type $i$ among a total of $n$ offspring, then $(N_1, \ldots, N_4)$ has a $\mathcal{M}(n, \theta_1, \ldots, \theta_4)$ distribution. A linkage model (Fisher, 1958, p. 301) specifies that
$$\theta_1 = \tfrac{1}{4}(2 + \eta), \quad \theta_2 = \theta_3 = \tfrac{1}{4}(1 - \eta), \quad \theta_4 = \tfrac{1}{4}\eta$$
where $\eta$ is an unknown number between 0 and 1. To test the validity of the linkage model we would take $\Theta_0 = \{(\tfrac{1}{4}(2+\eta), \tfrac{1}{4}(1-\eta), \tfrac{1}{4}(1-\eta), \tfrac{1}{4}\eta) : 0 \le \eta \le 1\}$, a "one-dimensional curve" of the three-dimensional parameter space $\Theta$. The likelihood equation (6.4.3) becomes
$$\frac{n_1}{2+\eta} - \frac{n_2 + n_3}{1-\eta} + \frac{n_4}{\eta} = 0, \qquad (6.4.4)$$
which reduces to a quadratic equation in $\hat{\eta}$. The only root of this equation in $[0, 1]$ is the desired estimate (see Problem 6.4.1). Because $q = 1$, $k = 4$, we obtain critical values from the $\chi^2_2$ tables. $\square$
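A minimal numerical sketch of Example 6.4.5: clearing denominators in (6.4.4) gives the quadratic $n\eta^2 - (n_1 - 2n_2 - 2n_3 - n_4)\eta - 2n_4 = 0$, whose unique root in $[0,1]$ is $\hat{\eta}$. The offspring counts used below are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

counts = np.array([58, 14, 11, 17])      # hypothetical (n1, n2, n3, n4)
n1, n2, n3, n4 = counts
n = counts.sum()

# Quadratic obtained from the likelihood equation (6.4.4).
a, b, c = n, -(n1 - 2 * n2 - 2 * n3 - n4), -2 * n4
roots = np.roots([a, b, c])
eta_hat = float(roots[(roots >= 0) & (roots <= 1)][0])

theta_hat = np.array([(2 + eta_hat) / 4, (1 - eta_hat) / 4, (1 - eta_hat) / 4, eta_hat / 4])
expected = n * theta_hat

pearson = np.sum((counts - expected) ** 2 / expected)
print(eta_hat, pearson, chi2.sf(pearson, df=2))   # df = k - 1 - q = 2
```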
Testing Independence of Classifications in Contingency Tables

Many important characteristics have only two categories. An individual either is or is not inoculated against a disease; is or is not a smoker; is male or female; and so on. We often want to know whether such characteristics are linked or are independent. For instance, do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic $A$ and $\bar{A}$ and of the second $B$ and $\bar{B}$. Then a randomly selected individual from the population can be one of four types $AB$, $A\bar{B}$, $\bar{A}B$, $\bar{A}\bar{B}$. Denote the probabilities of these types by $\theta_{11}$, $\theta_{12}$, $\theta_{21}$, $\theta_{22}$, respectively. Independent classification then means that the events [being an $A$] and [being a $B$] are independent or, in terms of the $\theta_{ij}$,
$$\theta_{ij} = (\theta_{i1} + \theta_{i2})(\theta_{1j} + \theta_{2j}).$$
To study the relation between the two characteristics we take a random sample of size $n$ from the population. The results are assembled in what is called a $2 \times 2$ contingency table such as the one shown.
            B        B̄
  A       N_11     N_12
  Ā       N_21     N_22
The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. Thus, for example, $N_{12}$ is the number of sampled individuals who fall in category $A$ of the first characteristic and category $\bar{B}$ of the second characteristic. Then, if $\mathbf{N} = (N_{11}, N_{12}, N_{21}, N_{22})^T$, we have $\mathbf{N} \sim \mathcal{M}(n, \theta_{11}, \theta_{12}, \theta_{21}, \theta_{22})$. We test the hypothesis $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a two-dimensional subset of $\Theta$ given by
$$\Theta_0 = \{(\eta_1\eta_2,\ \eta_1(1-\eta_2),\ \eta_2(1-\eta_1),\ (1-\eta_1)(1-\eta_2)) : 0 \le \eta_1 \le 1,\ 0 \le \eta_2 \le 1\}.$$
Here we have relabeled $\theta_{11} + \theta_{12}$, $\theta_{11} + \theta_{21}$ as $\eta_1$, $\eta_2$ to indicate that these are parameters which vary freely. For $\theta \in \Theta_0$, the likelihood equations (6.4.3) become
$$\frac{n_{11} + n_{12}}{\hat{\eta}_1} = \frac{n_{21} + n_{22}}{1 - \hat{\eta}_1}, \qquad \frac{n_{11} + n_{21}}{\hat{\eta}_2} = \frac{n_{12} + n_{22}}{1 - \hat{\eta}_2}, \qquad (6.4.5)$$
whose solutions are
$$\hat{\eta}_1 = (n_{11} + n_{12})/n, \qquad \hat{\eta}_2 = (n_{11} + n_{21})/n, \qquad (6.4.6)$$
the proportions of individuals of type $A$ and type $B$, respectively. These solutions are the maximum likelihood estimates. Pearson's statistic is then easily seen to be
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j/n} \qquad (6.4.7)$$
where $R_i = N_{i1} + N_{i2}$ is the $i$th row sum and $C_j = N_{1j} + N_{2j}$ is the $j$th column sum. By our theory, if $H$ is true, because $k = 4$, $q = 2$, $\chi^2$ has approximately a $\chi^2_1$ distribution. This suggests that $\chi^2$ may be written as the square of a single (approximately) standard normal variable. In fact (Problem 6.4.2), the $(N_{ij} - R_iC_j/n)$ are all the same in absolute value and $\chi^2 = Z^2$, where
$$Z = \frac{N_{11} - R_1C_1/n}{\sqrt{R_1R_2C_1C_2/n^3}}.$$
An important alternative form for $Z$ is given by
$$Z = \sqrt{n}\,\frac{N_{11}N_{22} - N_{12}N_{21}}{\sqrt{R_1R_2C_1C_2}}. \qquad (6.4.8)$$
Thus,
$$Z = \sqrt{n}\left[\hat{P}(A \mid B) - \hat{P}(A \mid \bar{B})\right]\left[\frac{\hat{P}(B)\hat{P}(\bar{B})}{\hat{P}(A)\hat{P}(\bar{A})}\right]^{1/2}$$
where $\hat{P}$ is the empirical distribution and where we use $A$, $\bar{A}$, $B$, $\bar{B}$ to denote the event that a randomly selected individual has characteristic $A$, $\bar{A}$, $B$, $\bar{B}$. Thus, if $\chi^2$ measures deviations from independence, $Z$ indicates what directions these deviations take. Positive values of $Z$ indicate that $A$ and $B$ are positively associated (i.e., that $A$ is more likely to occur in the presence of $B$ than it would in the presence of $\bar{B}$). It may be shown (Problem 6.4.3) that if $A$ and $B$ are independent, that is, $P(A \mid B) = P(A \mid \bar{B})$, then $Z$ is approximately distributed as $N(0, 1)$. Therefore, it is reasonable to use the test that rejects, if and only if,
$$Z \ge z(1-\alpha)$$
as a level $\alpha$ one-sided test of $H : P(A \mid B) = P(A \mid \bar{B})$ (or $P(A \mid B) \le P(A \mid \bar{B})$) versus $K : P(A \mid B) > P(A \mid \bar{B})$. The $\chi^2$ test is equivalent to rejecting (two-sidedly) if, and only if,
$$\chi^2 = Z^2 \ge x_1(1-\alpha).$$
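Here is a small Python sketch for a hypothetical $2 \times 2$ table, computing Pearson's $\chi^2$ of (6.4.7), the statistic $Z$ of (6.4.8), and checking that $\chi^2 = Z^2$.

```python
import numpy as np
from scipy.stats import norm, chi2

# A hypothetical 2x2 table of counts: rows A, A-bar; columns B, B-bar.
N = np.array([[42, 18],
              [28, 32]], dtype=float)
n = N.sum()
R = N.sum(axis=1)                     # row sums R_1, R_2
C = N.sum(axis=0)                     # column sums C_1, C_2
expected = np.outer(R, C) / n

pearson = np.sum((N - expected) ** 2 / expected)                              # (6.4.7)
Z = np.sqrt(n) * (N[0, 0] * N[1, 1] - N[0, 1] * N[1, 0]) / np.sqrt(R.prod() * C.prod())

print(np.isclose(pearson, Z ** 2))    # chi^2 = Z^2
print(chi2.sf(pearson, df=1))         # two-sided p-value
print(norm.sf(Z))                     # one-sided p-value for K: P(A|B) > P(A|B-bar)
```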
Next we consider contingency tables for two nonnumerical characteristics having $a$ and $b$ states, respectively, $a, b \ge 2$ (e.g., eye color, hair color). If we take a sample of size $n$ from a population and classify them according to each characteristic we obtain a vector $N_{ij}$, $i = 1, \ldots, a$, $j = 1, \ldots, b$, where $N_{ij}$ is the number of individuals of type $i$ for characteristic 1 and $j$ for characteristic 2. If $\theta_{ij} = P[\text{A randomly selected individual is of type } i \text{ for 1 and } j \text{ for 2}]$, then
$$\{N_{ij} : 1 \le i \le a,\ 1 \le j \le b\} \sim \mathcal{M}(n, \theta_{ij} : 1 \le i \le a,\ 1 \le j \le b).$$
The hypothesis that the characteristics are assigned independently becomes $H : \theta_{ij} = \eta_{i1}\eta_{j2}$ for $1 \le i \le a$, $1 \le j \le b$, where the $\eta_{i1}$, $\eta_{j2}$ are nonnegative and $\sum_{i=1}^{a}\eta_{i1} = \sum_{j=1}^{b}\eta_{j2} = 1$. The $N_{ij}$ can be arranged in an $a \times b$ contingency table,

           1       2      ...      b
   1     N_11    N_12     ...    N_1b     R_1
   .       .       .                .       .
   a     N_a1    N_a2     ...    N_ab     R_a
         C_1     C_2      ...    C_b      n
with row and column sums as indicated. Maximum likelihood and dimensionality calculations similar to those for the $2 \times 2$ table show that Pearson's $\chi^2$ for the hypothesis of independence is given by
$$\chi^2 = \sum_{i=1}^{a}\sum_{j=1}^{b}\frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j/n} \qquad (6.4.9)$$
which has approximately a $\chi^2_{(a-1)(b-1)}$ distribution under $H$. The argument is left to the problems, as are some numerical applications.
6.4.3 Logistic Regression for Binary Responses
In Section 6.1 we considered linear models that are appropriate for analyzing continuous responses $\{Y_i\}$ that are, perhaps after a transformation, approximately normally distributed and whose means are modeled as $\mu_i = \sum_{j=1}^{p} z_{ij}\beta_j = \mathbf{z}_i^T\beta$ for known constants $\{z_{ij}\}$ and unknown parameters $\beta_1, \ldots, \beta_p$. In this section we will consider Bernoulli responses $Y$ that can only take on the values 0 and 1. Examples are (1) medical trials where at the end of the trial the patient has either recovered ($Y = 1$) or has not recovered ($Y = 0$), (2) election polls where a voter either supports a proposition ($Y = 1$) or does not ($Y = 0$), or (3) market research where a potential customer either desires a new product ($Y = 1$) or does not ($Y = 0$). As is typical, we call $Y = 1$ a "success" and $Y = 0$ a "failure."

We assume that the distribution of the response $Y$ depends on the known covariate vector $\mathbf{z}$. In this section we assume that the data are grouped or replicated so that for each fixed $i$ we observe the number of successes $X_i = \sum_{j=1}^{m_i} Y_{ij}$, where $Y_{ij}$ is the response on the $j$th of the $m_i$ trials in block $i$, $1 \le i \le k$. Thus, we observe independent $X_1, \ldots, X_k$ with $X_i$ binomial, $B(m_i, \pi_i)$, where $\pi_i = \pi(\mathbf{z}_i)$ is the probability of success for a case with covariate vector $\mathbf{z}_i$.

Next we choose a parametric model for $\pi(\mathbf{z})$ that will generate useful procedures for analyzing experiments with binary responses. Because $\pi(\mathbf{z})$ varies between 0 and 1, a simple linear representation $\mathbf{z}^T\beta$ for $\pi(\mathbf{z})$ over the whole range of $\mathbf{z}$ is impossible. Instead we turn to the logistic transform $g(\pi)$, usually called the logit, which we introduced in Example 1.6.8 as the canonical parameter
$$\eta = g(\pi) = \log[\pi/(1-\pi)]. \qquad (6.4.10)$$
Other transforms, such as the probit $g_1(\pi) = \Phi^{-1}(\pi)$, where $\Phi$ is the $N(0,1)$ d.f., and the log-log transform $g_2(\pi) = \log[-\log(1-\pi)]$, are also used in practice. The log likelihood of $\pi = (\pi_1, \ldots, \pi_k)^T$ based on $\mathbf{X} = (X_1, \ldots, X_k)^T$ is
$$\sum_{i=1}^{k}\left[X_i\log\left(\frac{\pi_i}{1-\pi_i}\right) + m_i\log(1-\pi_i)\right] + \sum_{i=1}^{k}\log\binom{m_i}{X_i}. \qquad (6.4.11)$$
When we use the logit transform $g(\pi)$, we obtain what is called the logistic linear regression model, where
$$\eta_i = g(\pi_i) = \mathbf{z}_i^T\beta.$$
The special case $p = 2$, $\mathbf{z}_i = (1, z_i)^T$ is the logistic regression model of Problem 2.3.1. The log likelihood $l_N(\beta)$ of $\beta = (\beta_1, \ldots, \beta_p)^T$ is, if $N = \sum_{i=1}^{k} m_i$,
$$l_N(\beta) = \sum_{j=1}^{p}\beta_jT_j - \sum_{i=1}^{k} m_i\log\left(1 + \exp\{\mathbf{z}_i^T\beta\}\right) + \sum_{i=1}^{k}\log\binom{m_i}{X_i} \qquad (6.4.12)$$
where
$$T_j = \sum_{i=1}^{k} z_{ij}X_i$$
and we make the dependence on $N$ explicit. Note that $l_N(\beta)$ is the log likelihood of a $p$-parameter canonical exponential model with parameter vector $\beta$ and sufficient statistic $\mathbf{T} = (T_1, \ldots, T_p)^T$. It follows that the MLE of $\beta$ solves $E_\beta(T_j) = T_j$, $j = 1, \ldots, p$, or $E_\beta(\mathbf{Z}^T\mathbf{X}) = \mathbf{Z}^T\mathbf{X}$, where $\mathbf{Z} = \|z_{ij}\|_{k \times p}$ is the design matrix. Thus, Theorem 2.3.1 applies and we can conclude that if $0 < X_i < m_i$ and $\mathbf{Z}$ has rank $p$, the solution to this equation exists and gives the unique MLE $\hat{\beta}$ of $\beta$. The condition is sufficient but not necessary for existence; see Problem 2.3.1. We let $\mu_i = E(X_i) = m_i\pi_i$. Then, because $E(T_j) = \sum_{i=1}^{k} z_{ij}\mu_i$, the likelihood equations are just
$$\mathbf{Z}^T(\mathbf{X} - \mu) = 0. \qquad (6.4.13)$$
By Theorem 2.3.1 and by Proposition 3.4.4 the Fisher information matrix is
$$I(\beta) = \mathbf{Z}^TW\mathbf{Z} \qquad (6.4.14)$$
where $W = \mathrm{diag}\{m_i\pi_i(1-\pi_i)\}_{k\times k}$. The coordinate ascent iterative procedure of Section 2.4.2 can be used to compute the MLE of $\beta$.

Alternatively, with a good initial value the Newton-Raphson algorithm can be employed. Although, unlike coordinate ascent, Newton-Raphson need not converge, we can guarantee convergence with probability tending to 1 as $N \to \infty$ as follows. As the initial estimate use
$$\hat{\beta}_0 = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^TV, \quad V_i = \log\left(\frac{X_i + \tfrac{1}{2}}{m_i - X_i + \tfrac{1}{2}}\right), \qquad (6.4.15)$$
the empirical logistic transform. Because
$$\beta = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\eta \quad \text{and} \quad \eta_i = \log[\pi_i(1-\pi_i)^{-1}], \qquad (6.4.16)$$
$\hat{\beta}_0$ is a plug-in estimate of $\beta$ where $\pi_i$ and $(1-\pi_i)$ in $\eta_i$ have been replaced by
$$\pi_i^* = \frac{X_i}{m_i} + \frac{1}{2m_i}, \qquad (1-\pi_i)^* = 1 - \frac{X_i}{m_i} + \frac{1}{2m_i}.$$
Here the adjustment $1/2m_i$ is used to avoid $\log 0$ and $\log\infty$. Similarly, in (6.4.14), $W$ is estimated using $\hat{W}_0 = \mathrm{diag}\{m_i\pi_i^*(1-\pi_i)^*\}$. Using the $\delta$-method, it follows by Theorem 5.3.3 that if $m_i \to \infty$, $\pi_i > 0$ for $1 \le i \le k$,
$$\sqrt{m_i}\left(V_i - \log\frac{\pi_i}{1-\pi_i}\right) \xrightarrow{\mathcal{L}} N\left(0, [\pi_i(1-\pi_i)]^{-1}\right).$$
6
Because
Z has rank p, it fo1lows
(Problem
6.4. 14) that {30
is consistent.
To get expressions for the MLEs of 7r and JL, recall from Example
1 .6.8 that the inverse
of the logit transform g is the logistic distribution function
Thus, the MLE of 1ri is
Jfi
=
g
-1
( L:�=l Xij(jj ) . Testing
In analogy with Section
6.1,
we let w
=
{ 11
:
rJi
=
zT {3, {3
E
RP} and let
r
be the
dimension of w. We want to contrast w to the case where there are no restrictions on 11; that is, we set n
=
R k and consider TJ E fl. In this case the likelihood is a product of
independent binomial densities, and the MLEs of 1ri and J.Li are statistic
2 log .A for testing H :
ji is the MLE of JL for 11
11 E w versus
E w. Thus, from
D (X, ji)
K:
(6.4. 1 1)
Xdmi
and
Xi .
The LR
n - w is denoted by D(y, ji), where
TJ
and
(6.4. 12)
k
2 I )xi log(Xdfli ) + XI log( XI/ flDJ i =l
(6.4. 1 7)
x: mi Xi and fl� mi Mi · D (X, ji) measures the distance between the fit ji based on the model w and the data X. By the multivariate delta method, Theorem 5. 3.4, D(X, fl) has asymptotically a x%-r· distribution for 11 E w as mi --+ oo, i 1, . . . , k < oo-see Problem 6.4. 1 3. As in Section 6. 1 linear subhypotheses are important. If w0 is a q-dimensional linear
where
=
=
subspace of w with q <
K:
11 E w -
r,
then we can form the LR statistic for H : 11 E
w0
versus
wo
2 log .A
= 2 t [xi log ( Ei . ) + XI i =I
/-LOt
log
�
( E, . )]
(6.4. 1 8)
fLo�
ji0 is the MLE of JL under H and M�i mi MOi . In the present case, by Problem 6.4. 1 3, 2 log .,\ has an asymptotic x;-q distribution as mi --+ oo, i 1, . . . , k. Here is a
where
=
-
=
special case.
Example 6.4.1. The Binomial One- Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of mi patients and record ing the number
Xi
of patients that recover, i
pendently and we observe
X1 , . . . , Xk
.
1 , . . , k.
independent with
The samples are collected inde
Xi
rv
B ( 1ri , mi ) ·
For a second
example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number Xi among
mi that has the given attribute.
Section 6 . 5
411
Generalized Linea r Models
This model corresponds to the one-way layout of Section 6.1, and as in that section, an important hypothesis is that the populations are homogenous. Thus, we test H : 1r1 1r 1rk = 1r, 1r E (0, 1 ) , versus the alternative that the 1r's are not all equal. Under 2 H the log likelihood in canonical exponential form is =
{3T - N log(l + exp{{:J}) +
t; log ( �: ) k
where T 1::= 1 Xi , N 2: := 1 mi. and (3 = log [1r / ( 1 - 1r)] . It follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and is the solution of (6.4. 13), where J.L (m 1 1r, . . . , mk7r)T. Using Z as given in the one-way layout in Example 6 . 1 .3, we find that the MLE of 1r under H is 7r TjN. The LR statistic is given by (6.4. 18) with Jioi mi1r. The Pearson statistic k- 1 ( Xi mi "' 1r ) 2 2 X m · 7r 7r ) "'(1 = � i l =
=
_ I:
-
,....
is a Wald statistic and the x2 test is equivalent asymptotically to the LR test (Problem 6.4. 15).
Summary. We used the large sample testing results of Section 6.3 to find tests for impor tant statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's x2 , which equals the sum of standardized squared distances be tween observed frequencies and expected frequencies under H . When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the k !-dimensional pa rameter space 8, the Rao statistic is again of the Pearson x2 form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. Finally, we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. In the special case of testing equality of k binomial parameters, we give explicitly the MLEs and x2 test. "
-
6.5
G E N E RA L I Z E D L I N EA R M O D E LS
In Sections 6.1 and 6.4.3 we considered experiments in which the mean J.Li of a response Yi is expressed as a function of a linear combination
�i
=
zf {3
p
=
L Zij/3j
j= l
of covariate values. In particular, in the case of a Gaussian response, J.Li �i · In Section 1ri g - 1 ( �i) , where g - 1 (y) is the logistic distribution EYiJ , then /.Li 6.4.3, if J.Li =
=
=
6
412
function. More generally, McCullagh and Neider ( 1 983, 1 989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model devel oped by Goodman and Haberman. See Haberman (1974). The generalized linear model with dispersion depending only on the mean
The data consist of an observation ( Z , Y) where Y = (Yb . . . , Yn ) T is n x 1 and ZJx n = (z1 , . . . , Zn ) with Zi = (zib . . . , Zip ) T nonrandom and Y has density p ( y, 17) given by (6.5 . 1 )
where 17 i s not in £, the natural parameter space o f the n-parameter canonical exponential family (6.5. 1 ) , but in a subset of E obtained by restricting TJi to be of the form
where h is a known function. As we know from Corollary 1.6.1, the mean f.t of Y is related to 11 via f.t = A ( 17) . Typically, A ( 17) = A 0 ( T/i ) for some A 0 , in which case J-Li A� (TJi ) . We assume that there is a one-to-one transform g(f.t) of f.t, called the link function, such that p
g (f.t)
= L /1jZ (j ) j=l
Z{3
where z (j ) (z1j , . . . , Znj ) T is the jth column vector of Z. Note that if A is one-one, f.t determines 17 and thereby Var(Y) A( 17) . Typically, g(f.t) is of the form (g(J-L1 ), , g(J-Ln)) T , in which case g is also called the link function. • • •
Canonical links
The most important case corresponds to the link being canonical; that is, g
11
=
or
p
I:l 111 z<j ) = Zf3.
j=
In this case, the GLM is the canonical subfamily of the original exponential family gener ated by ZTY, which is p X 1. Special cases are: (i)
The linear model with known variance.
These Yi are independent Gaussian with known variance 1 and means J-Li The model is GLM with canonical g(f.t) f.t, the identity. (ii)
Log linear models.
= L:�= l Z.;,3 {3;.
Section 6 . 5
413
Genera l ized L i n e a r Models
Suppose (Y1 , . . . , Yp)T is seen in Example 1.6. 7,
M(n, B l , · · · , Bp), Bj T/j
=
0, 1 :::; j :::;
>
log ej ' 1 :::; j :::;
p,
L::�=l B =
1 . Then, as
p
are canonical parameters. If we take g(/1 ) = (log /-ll , . . . , log /-lp) T , the link is canonical. The models we obtain are called log linear-see Haberman (1 974) for an extensive treat ment. Suppose, for example, that Y = IIYiJ I I I:Si:Sa• 1 :::; j :::; b, so that YiJ is the indicator of, say, classification i on characteristic 1 and j on characteristic 2. Then (} =
IIBij ll , 1 :::; i
:::;
1 :::; j :::; b,
a,
and the log linear model corresponding to
l og eij = /3i + /3j ,
where /3i , f3J are free (unidentifiable parameters), is that of independence
eij
=
ei+e+j
where Bi+ = L::� =l Bii• B+J = L:: �=l BiJ· The log linear label is also attached to models obtained by taking the Yi independent Bernoulli ( Bi), 0 < ei < 1 with canonical link g(B) log [B( 1 - B ) ] . This isjust the logistic linear model of Section 6.4.3. See Haberman ( 1 974) for a further discussion. =
Algorithms If the link is canonical, by Theorem 2 . 3 . 1 , if maximum likelihood estimates they necessarily uniquely satisfy the equation z ry
=
{3 exist,
z rEl3 Y = z r A(Z/3)
or (6.5 .2)
It's interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model: the "residual" vector $\mathbf{Y} - \mu(\hat{\beta})$ is orthogonal to the column space of $\mathbf{Z}$. But, in general, $\mu(\hat{\beta})$ is not a member of that space.

The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point $\hat{\beta}_0$ one can achieve faster convergence with the Newton-Raphson algorithm of Section 2.4. In this case, that procedure is just
$$\hat{\beta}_{m+1} = \hat{\beta}_m + [\mathbf{Z}^TW_m\mathbf{Z}]^{-1}\mathbf{Z}^T(\mathbf{Y} - \dot{A}(\mathbf{Z}\hat{\beta}_m)) \qquad (6.5.3)$$
where $W_m = \mathrm{diag}\{A_0''(\mathbf{z}_i^T\hat{\beta}_m)\}$. In this situation, and more generally even for noncanonical links, Newton-Raphson coincides with Fisher's method of scoring described in Problem 6.5.1. If $\hat{\beta}_0 \xrightarrow{P} \beta_0$, the true value of $\beta$, as $n \to \infty$, then with probability tending to 1 the algorithm converges to the MLE if it exists. In this context, the algorithm is also called iterated weighted least squares. This name stems from the following interpretation. Let $\Delta_{m+1} = \hat{\beta}_{m+1} - \hat{\beta}_m$, which satisfies the equation
$$\Delta_{m+1} = [\mathbf{Z}^TW_m\mathbf{Z}]^{-1}\mathbf{Z}^T(\mathbf{Y} - \dot{A}(\mathbf{Z}\hat{\beta}_m)). \qquad (6.5.4)$$
That is, the correction $\Delta_{m+1}$ is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage $m$, the variance covariance matrix is $W_m$, and the regression is on the columns of $W_m\mathbf{Z}$ (Problem 6.5.2).
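As an illustration of the Newton-Raphson/iterated weighted least squares iteration (6.5.3) for a canonical link, here is a minimal Python sketch for a Poisson log-linear GLM, where coordinatewise $\dot{A}(\eta) = A''(\eta) = e^\eta$; the data are simulated and the setup is illustrative.

```python
import numpy as np

# Poisson log-linear GLM with canonical link: Y_i ~ Poisson(exp(z_i^T beta)).
rng = np.random.default_rng(1)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
Y = rng.poisson(np.exp(Z @ beta_true))

# Iterate beta_{m+1} = beta_m + (Z^T W_m Z)^{-1} Z^T (Y - mu_m), with W_m = diag(mu_m).
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(Z @ beta)
    step = np.linalg.solve(Z.T @ (Z * mu[:, None]), Z.T @ (Y - mu))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta)
```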
Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model we can define the biggest possible GLM $\mathcal{M}$ of the form (6.5.1), for which $p = n$. In that case the MLE of $\mu$ is $\hat{\mu}_{\mathcal{M}} = (Y_1, \ldots, Y_n)^T$ (assume that $\mathbf{Y}$ is in the interior of the convex support of $\{\mathbf{y} : p(\mathbf{y}, \eta) > 0\}$). Write $\eta(\cdot)$ for $\dot{A}^{-1}$. We can think of the test statistic
=
• • •
·
2 log .X. = 2 [Z ( Y , 77 ( Y )) - l (Y , "' ( IL o ) ]
for the hypothesis that IL = /L o within M as a "measure" of (squared) distance between Y and /Lo · This quantity, called the deviance between Y and /Lo , (6.5.5) is always 2:: 0. For the Gaussian linear model with known variance a-5 (Problem 6.5.4). The LR statistic for H D( Y , j1,0)
=:
:
IL E
wo is just
i nf { D( Y , IL )
:
1L E
where j1,0 is the MLE of IL in wo . The LR statistic for H : IL with w 1 :J wo is
wo } E
wo versus K : IL
E
w 1 - wo
D(Y , Ji 0) - D(Y, Ji1 ) where j1,1 is the MLE under w 1 . We can then formally write an analysis of deviance analo gous to the analysis of variance of Section 6.1. If w0 c w1 we can write (6.5.6) a decomposition of the deviance between Y and j1,0 as the sum of two nonnegative com ponents, the deviance of Y to j1, 1 and � (j1,0 , j1, 1 ) = D (Y, j1,0) - D(Y, j1,1 ) , each of which can be thought of as a squared distance between their arguments. Unfortunately � =j=. D generally except in the Gaussian case.
Section 6 . 5
415
General ized Linear Models
Formally if w0 is a GLM of dimension p and w1 of dimension q with canonical links, then �(ji,0 , ji, 1 ) is thought of as being asymptotically x; - q . This can be made precise for stochastic GLMs obtained by conditioning on Z1 , . . . , Zn in the sample (Z1 , Y1 ) , . . . , (Zn , Yn ) from the family with density
p (z, y, {3)
=
h(y) qo (z) exp { (zT{3)y - A0 (zT {3) }.
(6.5.7)
More details are discussed in what follows. Asymptotic theory for estimates and tests
If (Z1 , Y1 ) , . . . , (Zn , Yn ) can be viewed as a sample from a population and the link is canonical, the theory of Sections 6.2 and 6.3 applies straightforwardly in view of the gen eral smoothness properties of canonical exponential families. Thus, if we take Zi as having marginal density qo , which we temporarily assume known, then (Z1 , YI ) , . . . , (Zn , Yn ) has density
p (z, y, {3)
=
TI h(y;)qo (z;) { t, [(zf exp
{3) y; - A0(zf {3) ]
}
.
(6.5.8)
This is not unconditionally an exponential family in view of the A0 ( z f{3) term. However, there are easy conditions under which conditions of Theorems 6.2.2, 6.3.2, and 6.3.3 hold (Problem 6.5.3), so that the MLE {3 is unique, asymptotically exists, is consistent with probability 1, and (6.5 .9)
1 What is J- ({3) ? The efficient score function
a�i logp (z , y , {3) is
(Yi - A o (Zf{3) ) Z f and so,
which, in order to obtain approximate confidence procedures, can be estimated by f :EA.(z{j) where :E is the sample variance matrix of the covariates. For instance, if we assume the covariates in logistic regression with canonical link to be stochastic, we obtain =
If we wish to test hypotheses such as H : /31
=
· ·
·
=
/3d
=
0, d
< p,
we can calculate
i= l
where {3 H is the (p x 1 ) MLE for the GLM with f3�x l ( 0, . . . , 0 , !3d+ I, . . . , /3p) . and can conclude that the statistic of ( 6.5. 10 ) is asymptotically x� under H. Similar conclusions =
___
l_
-�P -
I.I
·' 1.'
416
I nference i n the M u ltipara m eter Case
Chapter 6
follow for the Wald and Rao statistics. Note that these tests can be carried out without knowing the density qo of z . 1 These conclusions remain valid for the usual situation in which the Z i are not random but their proof depends on asymptotic theory for independent nonidentically distributed variables, which we postpone to Volume II. The generalized linear model The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6. 5 . 1 ) depend on an additional scalar parameter 7. It is customary to write the model as, for c( 7) > 0, p ( y, 17 , 7) Because J p ( y , 17 , 7)dy
=
=
exp{ c- 1 ( 7 ) ( 11T y - A ( 17 ) ) }h(y, 7) .
(6.5 . 1 1)
1, then
A(17) jc(7) = log
/ exp{c- 1 (7)17TY} h(y, 7)dy.
(6.5. 1 2)
The left-hand side of (6.5. 10) is of product form A( 17 ) [1 / c ( 7) ] whereas the right-hand side cannot always be put in this form. However, when it can, then it is easy to see that
E(Y)
=
A (17 )
Var( Y ) = c(7) A (17 )
(6.5. 13) (6.5. 1 4)
so that the variance can be written as the product of a function of the mean and a general dispersion parameter. Important special cases are the N ( J-L, 0" 2 ) and gamma (p , .A) families. For further dis cussion of this generalization see McCullagp. and Neider (1983, 1989). General link functions Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3, we take g ( J-L) cp - 1 (J-L) so that =
'lri =
we obtain the so-called probit model. Cox ( 1 970) considers the variance stabilizing trans formation
which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range . 1 :::::; J-L :::::; .9 are rather similar. From the point of analysis for fixed n, noncanonical links can cause numerical problems because the models are now curved rather than canonical ex ponential families. Existence of MLEs and convergence of algorithm questions all become more difficult and so canonical links tend to be preferred.
417
Models
Summary. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector Y of responses can be written as a function,
E/3j z(j ) , where the z(j ) are f3 is a vector of regression coefficients. We considered the
called the link function, of a linear predictor of the form observable covariate vectors and
canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. We discussed algorithms for computing MLEs of
/3.
In the random design case, we use the asymptotic results of the previous sections to
develop large sample estimation results, confidence procedures, and tests.
6.6
ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS
As we most recently indicated in Example
6.2.2,
the distributional and implicit structural
assumptions of parametric models are often suspect. In Example
6.2.2,
we studied what
procedures would be appropriate if the linearity of the linear model held but the error dis
j0, which 0, the resulting MLEs for f3 optimal under fo continue to estimate f3 as defined by (6. 2.19) even if the true errors are N(O, a 2 ) or, in fact, had any distribution symmetric around 0 (Problem 6.2.5). That is, roughly speaking, if we consider the semi parametric model, pl {P : (zr, Y) p given by (6.2. 19) with Ei i.i.d. with density f for some f symmetric about 0}, then the LSE /3 of f3 is, under further mild conditions on P, still a consistent asymptotically normal estimate of f3 and, in fact, so is any estimate solving the equations based on (6.2.31) with fo symmetric about 0. There is another semi parametric model p2 {P : (zr, Y)T p that satisfies Ep (Y I Z) zr(3, Epzrz nonsingular, Ep Y2 < oo} . For this model it turns out that the LSE is optimal in a sense to
tribution failed to be Gaussian. We found that if we assume the error distribution is symmetric about
rv
=
rv
=
be discussed in Volume II. Furthermore, if P3 is the nonparametric model where we assume
(Z, Y) has a joint distribution and if we are interested in estimating the best linear Y given Z, the right thing to do if one assumes the (Zi , Yi) are i.i .d., is "act as if the model were the one given in Example 6. 1.2." Of course, for estimating /-1-L(Z) only that
predictor 1-1-L (Z) of
in a submodel o f P3 with
Y where c is independent of is not the best estimate of
=
Z but a0 (Z)
(3.
zr (3 + ao(Z)c is not constant and
(6.6.1)
a0
is assumed known, the LSE
These issues, whose further discussion we postpone to Vol
ume II, have a fixed covariate exact counterpart, the Gauss-Markov theorem, are discussed below. Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem
6.6.8 but otherwise
also
postponed to Volume II.
___ _
__j
418
I n ference in the M u ltipara meter Case
Chapter 6
Robustness in Estimation We drop the assumption that the errors
E1 , . . . , En in the linear model (6. 1 .3) are normal.
Instead we assume the Gauss-Markov linear model where
(6.6.2) where Z is an n of the estimates
x
p matrix of constants, {3 is p
x
1, and Y ,
€
are n
x
1. The optimality
Jii, fji of Section 6 . 1 . 1 and the LSE in general still holds when they are
compared to other linear estimates.
Theorem 6.6.1. Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form
0:
=
2:: � 1 aiJ-Li for some constants a1 ' . . . ' an,
the estimate
a Y1 ,
=
2:: � 1 aiJii . . . , Yn.
has uniformly minimum variance among all unbiased estimates linear in
Proof.
Because
Cov ( Yi ,
Yj )
and a i s unbiased. E( Ei ) 0 , E(a) 2:: � 1 aiJ-Li Ei , Ej ) , and by (B .5 .6), n n VarcM (a) L: a; Var (Ei ) + 2 L aiaj Cov (Ei , EJ ) a2 L: a; i 1 i=1 i <j =
=
=
=
o: ,
Moreover,
Cov (
=
=
=
where VarcM refers to the variance computed under the Gauss-Markov assumptions. Let
a stand for any estimate linear in VarcM (a)
=
Y1 , . . . , Yn .
The preceding computation shows that
Varc (a) , where Varc (a) stands for the variance computed under the Gaus
sian assumption that
E1 , . . . , En are i.i.d.
N(O, a2).
By Theorems 6 . 1 . 2(iv) and 6 . 1 .3(iv),
Varc (a) ::; Varc (a) for all unbiased a. Because Yare
=
VarcM for all linear estimators, 0
the result follows.
Note that the preceding result and proof are similar to Theorem 1 .4.4 where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. In fact, in Example 6 . 1 .2, our current Ji coincides with the empirical plug-in estimate of the optimal linear predictor ( 1 .4. 14) . See Problem 6 . 1 .3. Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case:
Proposition 6.6.1. If we replace the Gaussian assumptions on the errors E1 , , En with the Gauss-Markov assumptions, the conclusions ( 1 ) and (2) of Theorem 6 . 1 .4 are still . • .
jj and j], are still unbiased and Var(Ji) Var (jj) a2 (zrz ) - 1 •
valid; moreover, ifp
= r,
=
a2H, Var(€)
=
a2 (I H), and -
=
Example 6.6.1. One Sample (continued). In Example 6. 1 . 1 ,
J-L
=
{31 , and Ji
=
fj1
=
Y.
The Gauss-Markov theorem shows that Y, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with
EY/
<
oo .
However, for
419 n
large,
Y has a larger variance (Problem 6.6.5) than the nonlinear estimate
median when the density of Y is the Laplace density
1 2A
exp{ -AIY - 11 1 } ,
For this density and a l l symmetric densities,
y E
R, J1 E R,
Y
sample
A > 0.
Y is unbiased (Problem 3.4.12).
Remark 6.6.1. As seen in Example 6 . 6 . 1 , a major weakness of the Gauss-Markov theorem is that it only applies to linear estimates. Another weakness is that it only applies to the homoscedastic case where Var(}i ) is the same for all i. Suppose we have a heteroscedastic version of the linear model where
E( E )
0, but Var ( Ei )
a
} depends on i.
If the
} are
a
known, we can use the Gauss-Markov theorem to conclude that the weighted least squares estimates of Section
2.2 are UMVU. However, when the at are unknown, our estimates are
not optimal even in the class of linear estimates.
Robustness of Tests In Section
5 . 3 we investigated the robustness of the significance levels of t tests for
the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The 6-method implies that if the observations
Xi
come from a distribution
P,
which does not
belong to the model, the asymptotic behavior of the LR, Wald and Rao tests depends criti cally on the asymptotic behavior of the underlying MLEs From the theory developed in Section true we expect that
iin
8 and 80.
6.2 we know that if '11 ( - ,
8)
=
Dl( - , 8)
and P is
8(P) , which is the unique solution of
I w (x, 8)dP(x)
(6.6.3)
0
and
.;ii (e - 8 ( P) ) � N( O, �( w , P) ) :E given by ( 6.2.6 ) . Thus, for instance, consider H :
(6.6.4)
8 80 and the Wald test statis tics Tw n ( ii 80 )T !(80 ) (9 80) . If 8(P) :::/= 80 evidently we have Tw oo . B ut if 8 (P) 80 , then Tw � vr J(80)V where V "' N(o, � ('11 , P) ) . Because � ('II , P ) :::/= J- 1 ( 8 0 ) i n general, the asymptotic distribution of Tw i s not x;. This
with
=
=
=
observation holds for the LR and Rao tests as well-but see Problem
6.6.4. There is an
important special case where all i s well: the linear model we have discussed in Example
6 . 2 . 2. Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose ( Z 1 , Yl ) , . . . , (Zn , Yn ) are i.i.d. as (Z, Y) where Z is a (p x 1 ) vector of random covariates and we model the relationship between
Z and Y as
(6.6.5)
6
420
with the distribution P of (Z , Y) such that E and Z are independent, EpE 0, EpE 2 < oo , and we consider, say, H : (3 0. Then, when E is N(O, u2 ) , (see Examples 6.2.1 and 6.2.2) the LR, Wald, and Rao tests all are equivalent to the F test: Reject if =
Tn where {3 is the LSE, Z( n) s2
_ =
"" T
2 (3 Z T (n) Z (n) f3/ s 2 Jp , n -p ( l --.
- a)
(6.6.6)
(Zb . . , Z n ) T and .
1 � 2 � ( Yi - Yi ) n - p i =l
1
""'
--
=
p
n
--
2
"' IY - Z (n)f3 i .
Now, even if E is not Gaussian, it is still true that if w is given by (6. 2.30) then (3 (P) specified by (6.6.3) equals (3 in (6.6.5) and u2 ( P ) Varp ( E ) . For instance, =
Thus, it is still true by Theorem 6.2.1 that (6.6.7) and (6.6.8) Moreover, by the law of large numbers, 1 n - zr (n) z ( n)
�� � z i z! n i=1
(6.6.9)
1.
so that the confidence procedures based on approximating Var( ,B) by n - 1 z'[n ) Z ( n) s2 / Vii are still asymptotically of correct level. It follows by Slutsky's theo rem that, under H : (3 = 0,
where Wpxp is Gaussian with mean 0 and, hence, the limiting distribution of is x;. Because fp,n-p ( 1 a ) -+ xp ( 1 a ) by Example 5.3.7, the test (6.6.5) has asymptotic level a even if the errors are not Gaussian. This kind of robustness holds for H : (Jq + 1 f3o, q+ 1 , , (Jp f3o,p or more generally (3 E £0 + (30, a q-dimensional affine subspace of RP. It is intimately linked to the fact that even though the parametric model on which the test was based is false, •
. .
(i) the set of P satisfying the hypothesis remains the same and (ii) the (asymptotic) variance of (3 is the same as under the Gaussian model.
If the first of these conditions fails, then it's not clear what H : 8 80 means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples. =
The Two-Sample Scale Problem. Suppose our model is that X 1 2 (X( ) , X( ) ) where X( 1 ) , x < 2 ) are independent E E respectively, the lifetimes of paired pieces of equipment. If we take H : A 1 A2 the standard Wald test (6.3. 17) is 2 ..... (0 1 02 ) > "Reject iff n (6.6. 10) - x 1 ( 1 - a) " a2 where
Example 6.6.3.
(!;), ( j;)
.....
< _!_n � xz j ) L i=1
( 6.6. 1 1 )
and a=
!2n i: < xP ) i=1
+
v
x? ) ) .
(6.6. 1 2)
The conditions of Theorem 6.3.2 clearly hold. However suppose now that X(l) /0 1 and X( 2 ) /02 are identically distributed but not exponential. Then H is still meaningful, X(l) and X( 2 ) are identically distributed. But the test (6.6.10) does not have asymptotic level a in general. To see this, note that, under H, (6.6. 1 3 )
but
J2Ep ( X ( 1 ) ) =f.
a
V2 VarpX{ 1 )
in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. Simply replace <7 2 by
�2
u
=
_!_ � n� z=1
{(
��) - ( .X < 1 ) Xz
+
2
) (X �2) - (.X (l)
.X< 2 ) ) 2
+
z
+
2
)
.X <2 ) ) 2
}
.
(6.6. 1 4) D
Example 6.6.4. The Linear Model with Stochastic Covariates with E and Z Dependent. Suppose E(E I Z) = 0 so that the parameters {3 of (6.6.5) are still meaningful but E and Z
are dependent. For simplicity, let the distribution of E given Z = z be that of a (z ) E ' where E1 is independent of Z. That is, we assume variances are heteroscedastic. Suppose without loss of generality that Var( E1 ) 1. Then ( 6.6. 7) fails and in fact by Theorem 6.2.1 =
Vii(� where
!3)
-+
N(o, Q)
(6.6. 1 5)
422
I n ference in the M u ltipara meter Case
Chapter 6
(Problem 6.6.6) and, in general, the test (6.6.6) does not have the correct level. Special ize to the case Z (1, h A1 , . . . , Id - Ad )T where (!1 , . . . , Id) has a multinomial ( A 1 . . . , Ad ) distribution, 0 < Aj < 1, 1 ::; j ::; d. This is the stochastic version of the d-sample model of Example 6 . 1 .3. It is easy to see that our tests of H : {32 f3d 0 = fail to have correct asymptotic levels in general unless a2 ( Z) is constant or A 1 1 Ad 1/ d (Problem 6.6.6). A solution is to replace Z(n ) Z(n) / s2 in (6.6.6) by Q- , where =
-
,
=
·
· ·
=
=
=
·
· ·
=
Q is a consistent estimate of Q . For d = 2 above this is just the asymptotic solution of the Behrens-Fisher problem discussed in Section 4.9.4, the two-sample problem with unequal 0 sample sizes and variances.
To summarize: If hypotheses remain meaningful when the model is false then, in general, LR, Wald, and Rao tests need to be modified to continue to be valid asymptotically. The simplest solution, at least for Wald tests, is to use as an estimate of $[\mathrm{Var}\,\sqrt{n}\,\hat{\theta}]^{-1}$ not $I(\hat{\theta})$ or $-\frac{1}{n}D^2l_n(\hat{\theta})$ but the so-called "sandwich estimate" (Huber, 1967),
$$\left[\frac{1}{n}D^2l_n(\hat{\theta})\right]\left[\frac{1}{n}\sum_{i=1}^{n} Dl(X_i, \hat{\theta})\,Dl(X_i, \hat{\theta})^T\right]^{-1}\left[\frac{1}{n}D^2l_n(\hat{\theta})\right]. \qquad (6.6.16)$$
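A minimal Python sketch of the sandwich idea in a familiar special case: heteroscedasticity-robust standard errors for least squares in the linear model, compared with the model-based (homoscedastic) ones. The data are simulated and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = 0.5 + np.abs(Z[:, 1])                   # error scale depends on the covariate
Y = Z @ np.array([1.0, 2.0]) + sigma * rng.normal(size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
resid = Y - Z @ beta_hat

A = Z.T @ Z / n                                 # "bread": the Hessian term, up to sign/scale
B = (Z * (resid ** 2)[:, None]).T @ Z / n       # "meat": (1/n) sum of score outer products
sandwich_cov = np.linalg.inv(A) @ B @ np.linalg.inv(A) / n
naive_cov = np.linalg.inv(Z.T @ Z) * (resid @ resid / (n - 2))

print(np.sqrt(np.diag(sandwich_cov)))           # robust standard errors
print(np.sqrt(np.diag(naive_cov)))              # model-based standard errors
```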
Summary. We considered the behavior of estimates and tests when the model that gen erated them does not hold. The Gauss-Markov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.i.d. N(O, a2 ) assumption on the errors is replaced by the assumption that the errors have mean zero, identical vari ances, and are uncorrelated, provided we restrict the class of estimates to linear functions of Y1 , . . . , Yn . In the linear model with a random design matrix, we showed that the MLEs and tests generated by the model where the errors are i.i.d. N(O, a2 ) are still reasonable when the true error distribution is not Gaussian. In particular, the confidence procedures derived from the normal model of Section 6 . 1 are still approximately valid as are the LR, Wald, and Rao tests. We also demonstrated that when either the hypothesis H or the vari ance of the MLE is not preserved when going to the wider model, then the MLE and LR procedures for a specific model will fail asymptotically in the wider model. In this case, the methods need adjustment, and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.
6.7
PROBLEMS AND COMPLEMENTS
Problems for Section 6.1 1. Show that in the canonical exponential model ( 6. 1 . 1 1 ) with both (i) the MLE does not exist if n = r , (ii) if n 2: r + 1, then the MLE
( TJ1 , . . . , TJr , a2 ) T I. S
n ( U1 , . . . , Ur , - 1 "' L... i =r+ n
I
U2i ) T
.
11
and a2 unknown,
(7]1 , . . . , 77n a2 )T of . ---2 - - 1 1Y J-L --- � 2 In partiCU 1ar, a n
-
•
2. For the canonical linear Gaussian model with a2 unknown, use Theorem 3.4.3 to com
pute the information lower bound on the variance of an unbiased estimator of a2 • Compare this bound to the variance of s2 •
Section 6.7
Hint:
423
Problems and Complements
By A. 1 3 .22, Var(U[ )
=
2 o- 4 .
3. Show that fi of Example 6. 1 .2 with p = r coincides with the empirical plug-in estimate of J.lL (J-LLb (Zi2 , , Zip ) T, /-LY /31 ; , /-LLn)T, where JtLi = J-LY + (zi -J-Lz)/3, z; =
=
· · ·
· · .
=
and f3 and J-Lz are as defined in ( 1 .4. 14) . Here the empirical plug-in estimate is based on i.i.d . (Zi , Y1 ) , . . . , (Z� , Yn) where Z i = (Zi2 , . . . , Zip)T . 4. Let Yi denote the response of a subject at time i, i
= 1 , . . . , n.
Suppose that Yi satisfies
the following model Yi
=
B + Ei , i
=
1, . . . , n
where Ei can be written as Ei cei- l + ei for given constant c satisfying 0 � c � 1 , and the ei are independent identically distributed with mean zero and variance o-2, i 1 , . . . , n; e0 0 (the Ei are called moving average errors, see Problem 2.2.29). Let =
=
where
a · = � - c)i L.) i =O 1
( 1 - ( - c)i ) 2 ( 1 - 1( -+cc)j+ l )/ � L 1+c i= l
(a) Show that B is the weighted least squares estimate of B. (b) Show that if ei
rv
N(O, o-2 ), then B is the MLE of B.
(c) Show that Y and B are unbiased.
(d) Show that Var( O) � Var ( Y ).
(e) Show that Var (B) < Var ( Y ) unless
c
=
0.
5. Show that >:(Y) defined in Remark 6 . 1 . 2 coincides with the likelihood ratio statistic
A (Y) for the o-2 known case with o-2 replaced by 0:2.
6. Consider the model (see Example 1 . 1 .5) Yi
B + ei, i
where ei = cei- l + Ei, i = 1 , . . . , n, t:o 0, c are i.i.d. N(O, o-2 ) . Find the MLE B of B.
=
E
1, . . . , n (0, 1] is a known constant and t: 1 , . . . , En
7. Derive the formula (6. 1.28) for the noncentrality parameter B2 in the regression example. 8. Derive the formula (6. 1 . 29) for the noncentrality parameter 82 in the one-way layout. 9. Show that in the regression example with interval for /31 in the Gaussian linear model is
p
=
r
=
2, the 100(1
z. •
-
)• ] .
a ) % confidence
424
Inference in the M u lti para meter Case
10. Show that if p
=
r
=
2 in Example 6.1.2, then the hat matrix H
=
C h a pter 6
( hij) is given by
1 + (zi2 Z.2) (zj2 - Z.2) -__:___-"----:--=__:_ __;;_ n -=----= 2.:(zi2 - z.2) 2 1 1. Show that for the estimates
(a) Var ( a)
;� 2.:::� =1 nk
=
a and J'k in the one-way layout Var ( 8k )
=
( (p�:)2 + Lk=;fi ;k ) .
(b) If n is fixed and divisible by p, then Var( a) is minimized by choosing ni
ni
n is nl2, n2
(c) If =
fixed and divisible by np nl2(p -
=
(d) Give the
·
·
·
=
2(p - 1), 1) .
c =
nIp.
then Var(J'k) is minimized by choosing
100(1 - a)% confidence intervals for a and 6k.
, Yn be a sample from a population with mean 11 and variance a 2 , where n is 1 even. Consider the three estimates T1 Y, T2 ( 1 I 2n) 2.::: 1:1 Yi + ( 3 I 2n) 2.::: �= � n+ 1 }i, 12. Let Y1,
• • .
=
1 - - 2). and T3 = 2 (Y
(a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2 ? (b) Which estimate has the smallest MSE for estimating 0
=
� (/1 2) ?
13. In the one-way layout,
(a) Show that level ( 1 - a) confidence intervals for linear functions of the form {3j - {3;. are given by
and that a level ( 1
a)
confidence interval for a2 is given by
(n - p) s2 lxn -p ( 1 'lj;
(b) Find confidence intervals for
� ( � + tJ3 )
� a ) ::; a2 ::; (n - p) s2 lx n-p (� a ) . 'lj;
t31 ·
14. Assume the linear regression model with p future observation Y to be taken at the pont z .
(a) Find a level 1 + fJ2 z . fJ
(1 - a)
r
=
2. We want to predict the value of a
confidence interval for the best MSPE predictor E(Y)
=
(b) Find a level (1 a) prediction interval for Y (i.e., statistics t(Y1 , . . . , Yn) . t(Yh . . . , Yn ) such that P[t_ ::; Y ::; � 1 - a). Note that Y is independent of Y1 , . . . , Yn .
15. Often a treatment that is beneficial in small doses is harmful in large doses. The following model is useful in such situations. Consider a covariate x , which is the amount
Section 6. 7
425
Probtems a nd Lornol;emEmts
· or dose of a treatment, and a response variable Y, which is yield or production. Suppose a good fit is obtained by the equation
where Yi is observed yield for dose xi . Assume the model log }i = !31 + f32 x i + f33 log xi + Ei , i =
1, . . . , n
where E1 , . . . , En are independent N(O, a2). (a) For the following data (from Hald, 1 952, p. 653), compute {i1 , {i , {i3 and level 0.95 2 confidence intervals for (31 , !3 , (33 . 2 (b) Plot (Xi , Yi) and (Xi , fli) where Yi = e131 e132xi x f3 • Find the value of x that maxi mizes the estimated yield fj = e131 e132xx133 .
3 .77
x (nitrogen) y (yield)
1 67.5
Hint: Do the regression for /-Li = (31 + f32zil + f32zil + f33zi2 where zn = Xi - x, Zi = log xi - * I::1 log xi . You may use x = 0.0289. 2 16. Show that if C is an n x r matrix of rank r, r :=:; n, then the r x r matrix C'C is of rank r and, hence, nonsingular. Hint: Because C' is of rank r, xC' = 0 => x = 0 for any r-vector x. But xC'C = o => ll xC' II 2 = xC'Cx' = o => xC' = o .
17. In the Gaussian linear model show that the parametrization ({3, a2 ) T is identifiable if and only if r = p. Problems for Section 6.2
1. Check AO, . . . ,A6 when 8 log p(x, 8) , and Q = P.
=
(M, a2), P
2. Let 8, P, and p be as in Problem
6.2.1
densities of the form
=
{N({L, a2)
:
1-L E
R, a2
>
0}, p ( x, 8)
and let Q be the class of distributions with
( 1 - E )
a2 < 72 , E :=:;
1 2
where
1
1
=
2
>
0.
(b) Show that the MLE of (M, 7 2 ) does not exist so that A6 doesn't hold.
426
Inference in the Multiparameter Case
(c) Construct a method of moment estimate Bu of 8 moments which are fo consistent.
I I '
=
Chapter 6
(Jt. T2) based on the first two
�
(d) Deduce from Problem 6.2.10 that the estimate (} derived as the limit of NewtonRaphson estimates from 8n is efficient. 3. In Example 6.2.1, show that EZtZ; � O, i > !.
([EzzT]- 1 )(1. 1 1 > [Ezfj- t with equality if and only if
4. In Example 6.2.2, show that the assumptions of Theorem 6.2.2 hold if (i) and (ii) hold. 5. In Example 6.2.2, show that
fo is logistic.
c(fo) = uo/a is 1 if fo is normal and is different from 1 if
6. (a) In Example 6.2.1, derive the MLEs of β, μ, and σ².
Hint: f_X(x) = f_{Y|Z}(y) f_Z(z).
(b) Suppose that the distribution of Z is not known, so that the model is semiparametric: X ~ P₍θ,H₎, P = {P₍θ,H₎ : θ ∈ Θ, H ∈ 𝓗}, Θ Euclidean, 𝓗 abstract. In some cases it is possible to find T(X) such that the distribution of X given T(X) = t is Q_θ, which doesn't depend on H ∈ 𝓗. The MLE of θ based on (X, t) is then called a conditional MLE. Show that if we identify X = (Z⁽ⁿ⁾, Y), T(X) = Z⁽ⁿ⁾, then (β̂, μ̂, σ̂²) are conditional MLEs.
Hint: (a), (b) The MLE minimizes σ⁻²|Y − Z⁽ⁿ⁾β|², up to terms not involving β.
7. Fill in the details of the proof of Theorem 6.2.1.
8. Establish (6.2.24) directly as follows:
(a) Show that if Z̄ₙ = n⁻¹ Σᵢ₌₁ⁿ Zᵢ, then, given Z⁽ⁿ⁾, √n(μ̂ − μ, (β̂ − β)ᵀ)ᵀ has a multivariate normal distribution with mean 0 and variance matrix
σ² diag(1, n[Z⁽ⁿ⁾ᵀ Z⁽ⁿ⁾]⁻¹),
and that σ̂² is independent of the preceding vector with nσ̂²/σ² having a χ²_{n−p} distribution.
(b) Apply the law of large numbers to conclude that
n⁻¹ Z⁽ⁿ⁾ᵀ Z⁽ⁿ⁾ →ᴾ E(ZZᵀ).
(c) Apply Slutsky's theorem to conclude that the same limit law holds unconditionally and, hence, that
(d) (β̂ − β)ᵀ Z⁽ⁿ⁾ᵀ Z⁽ⁿ⁾ (β̂ − β) = oₚ(n⁻¹/²).
(e) Show that σ̂² is unconditionally independent of (μ̂, β̂).
(f) Combine (a)-(e) to establish (6.2.24).
9. Let Y₁, ..., Yₙ be real, independent, identically distributed,
Yᵢ = μ + σεᵢ,
where μ ∈ R, σ > 0 are unknown and ε has known density f > 0 such that if ρ(x) ≡ −log f(x), then ρ'' > 0 and, hence, ρ is strictly convex. Examples are f Gaussian, and f(x) = e⁻ˣ(1 + e⁻ˣ)⁻² (logistic).
(a) Show that if σ = σ₀ is assumed known, a unique MLE for μ exists and uniquely solves
(1/n) Σᵢ₌₁ⁿ ρ'((Yᵢ − μ)/σ₀) = 0.
(b) Write θ₁ = μ/σ, θ₂ = 1/σ. Show that if θ₂ = θ₂⁰ is assumed known, a unique MLE for θ₁ exists and uniquely solves the corresponding likelihood equation.
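As a numerical illustration of Problem 9(a), here is a minimal Python sketch for the logistic case. With f logistic, ρ'(x) = tanh(x/2), so the likelihood equation is monotone in μ and has a unique root; the data and function names below are ours.

```python
# Sketch for Problem 9(a): with sigma = sigma0 known and logistic errors,
# rho'(x) = tanh(x/2), so the MLE of mu uniquely solves
# sum_i tanh((y_i - mu) / (2*sigma0)) = 0, which is strictly decreasing in mu.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
sigma0 = 1.0
y = rng.logistic(loc=2.0, scale=sigma0, size=200)   # simulated sample

def score(mu):
    return np.sum(np.tanh((y - mu) / (2.0 * sigma0)))

mu_hat = brentq(score, y.min(), y.max())             # unique root = MLE
print(mu_hat)
```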
10. Suppose A0-A4 hold and θ̄ₙ is √n consistent; that is, θ̄ₙ = θ₀ + Oₚ(n⁻¹/²).
(a) Let θ̂ₙ be the first iterate of the Newton-Raphson algorithm for solving (6.2.1) starting at θ̄ₙ:
θ̂ₙ = θ̄ₙ − [n⁻¹ Σᵢ₌₁ⁿ Dψ(Xᵢ, θ̄ₙ)]⁻¹ n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ̄ₙ).
Show that θ̂ₙ satisfies (6.2.3).
Hint:
n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ̄ₙ) = n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ₀) − [n⁻¹ Σᵢ₌₁ⁿ Dψ(Xᵢ, θ̄ₙ) + oₚ(1)](θ̄ₙ − θ₀).
(b) Show that under A0-A4 there exists ε > 0 such that, with probability tending to 1, n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ) has a unique 0 in S(θ₀, ε), the ε ball about θ₀.
Hint: You may use a uniform version of the inverse function theorem: If gₙ : Rᵈ → Rᵈ are such that
(i) sup{|Dgₙ(θ) − Dg(θ)| : |θ − θ₀| ≤ ε} → 0,
(ii) gₙ(θ₀) → g(θ₀),
(iii) Dg(θ₀) is nonsingular,
(iv) Dg(θ) is continuous at θ₀,
then, for n sufficiently large, there exist δ > 0, t > 0 such that the gₙ are 1-1 on S(θ₀, δ) and their image contains a ball S(g(θ₀), t).
(c) Conclude that, with probability tending to 1, iteration of the Newton-Raphson algorithm starting at θ̄ₙ converges to the unique root θ̂ₙ described in (b) and that θ̂ₙ satisfies (6.2.3).
Hint: You may use the fact that if the initial value of Newton-Raphson is close enough to a unique solution, then it converges to that solution.
11. Establish (6.2.26) and (6.2.27).
Hint: Write zᵢᵀβ = (Zᵢ₁ − Ẑᵢ⁽¹⁾)θ₁ + Σⱼ₌₂ᵖ (δⱼ + cⱼθ₁)Zᵢⱼ, where Ẑᵢ⁽¹⁾ = Σⱼ₌₂ᵖ cⱼZᵢⱼ and the cⱼ do not depend on β. Thus, minimizing Σᵢ₌₁ⁿ (Yᵢ − zᵢᵀβ)² over all β is the same as minimizing
Σᵢ₌₁ⁿ (Yᵢ − (Zᵢ₁ − Ẑᵢ⁽¹⁾)θ₁ − Σⱼ₌₂ᵖ δⱼZᵢⱼ)².
Differentiate with respect to β₁. Similarly compute the information matrix when the model is written as
Yᵢ = β₁(Zᵢ₁ − π(Zᵢ₁ | Zᵢ₂, ..., Zᵢₚ)) + Σⱼ₌₂ᵖ γⱼZᵢⱼ + εᵢ,
where β₁, γ₂, ..., γₚ range freely and the εᵢ are i.i.d. N(0, σ²).
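To make the one-step construction of Problem 10(a) concrete, here is a minimal Python sketch for the logistic location model of Problem 9, taking the sample median as the √n-consistent starting value; the model choice and names are ours, not part of the problem.

```python
# Sketch for Problem 10(a): one Newton-Raphson step from a root-n consistent start.
# Logistic location model: psi(x, mu) = tanh((x - mu)/2),
# Dpsi(x, mu) = -sech^2((x - mu)/2) / 2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.logistic(loc=0.5, size=500)

mu_bar = np.median(x)                          # root-n consistent starting value
psi = np.tanh((x - mu_bar) / 2.0)
dpsi = -0.5 / np.cosh((x - mu_bar) / 2.0) ** 2

# theta_hat = theta_bar - [n^{-1} sum Dpsi]^{-1} n^{-1} sum psi
mu_hat = mu_bar - np.mean(psi) / np.mean(dpsi)
print(mu_bar, mu_hat)
```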
Problems for Section 6.3
1. Suppose responses Y₁, ..., Yₙ are independent Poisson variables, Yᵢ ~ P(λᵢ), with
log λᵢ = θ₁ + θ₂zᵢ
for given covariate values 0 < z₁ < ··· < zₙ. Find the asymptotic likelihood ratio, Wald, and Rao tests for testing H : θ₂ = 0 versus K : θ₂ ≠ 0.
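As an illustration of Problem 1, here is a minimal Python sketch of the Wald test of θ₂ = 0 in the Poisson log-linear model, with the MLE computed by Newton-Raphson (which here coincides with Fisher scoring). The data are simulated under H and the names are ours.

```python
# Sketch for Problem 1: Wald test of H: theta2 = 0 in the Poisson model
# log(lambda_i) = theta1 + theta2 * z_i, MLE by Newton-Raphson.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
z = np.linspace(0.1, 2.0, 40)
y = rng.poisson(np.exp(0.5 + 0.0 * z))            # data generated under H

X = np.column_stack([np.ones_like(z), z])
theta = np.zeros(2)
for _ in range(25):
    lam = np.exp(X @ theta)
    score = X.T @ (y - lam)                       # score vector
    info = X.T @ (X * lam[:, None])               # Fisher information
    theta = theta + np.linalg.solve(info, score)

lam = np.exp(X @ theta)
var22 = np.linalg.inv(X.T @ (X * lam[:, None]))[1, 1]
wald = theta[1] / np.sqrt(var22)
print(theta, wald, 2 * norm.sf(abs(wald)))        # statistic and p-value
```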
2. Suppose that ω₀ is given by (6.3.12) and the assumptions of Theorem 6.3.2 hold for p(x, θ), θ ∈ Θ. Reparametrize by η(θ) = Σⱼ₌₁ʳ ηⱼ(θ)vⱼ, where ηⱼ(θ) = θᵀvⱼ, {vⱼ} are orthonormal, and q(·, η) = p(·, θ) for η ∈ Ξ with Ξ = {η(θ) : θ ∈ Θ}. Show that if Ξ₀ = {η ∈ Ξ : ηⱼ = 0, q + 1 ≤ j ≤ r}, then λ(X) for the original testing problem is given by
λ(X) = sup{q(X, η) : η ∈ Ξ} / sup{q(X, η) : η ∈ Ξ₀}
and, hence, that Theorem 6.3.2 holds for Θ₀ as given in (6.3.12).
3. Suppose that θ₀ ∈ Θ₀ and the conditions of Theorem 6.3.3 hold. There exist an open ball about θ₀, S(θ₀) ⊂ Θ, and a map η : S(θ₀) → Rʳ which is continuously differentiable such that
(i) ηⱼ(θ) = gⱼ(θ) on S(θ₀), q + 1 ≤ j ≤ r, and, hence,
S(θ₀) ∩ Θ₀ = {θ ∈ S(θ₀) : ηⱼ(θ) = 0, q + 1 ≤ j ≤ r};
(ii) η is 1-1 on S(θ₀) and Dη(θ) is a nonsingular r × r matrix for all θ ∈ S(θ₀).
(Adjoin to g_{q+1}, ..., g_r the functions a₁ᵀθ, ..., a_qᵀθ, where a₁, ..., a_q are orthogonal to the linear span of Dg_{q+1}(θ₀), ..., Dg_r(θ₀).)
Show that if we reparametrize {P_θ : θ ∈ S(θ₀)} by q(·, η(θ)) = p(·, θ), where q(·, η) is uniquely defined on Ξ = {η(θ) : θ ∈ S(θ₀)}, then q(·, η), η̂ₙ = η(θ̂ₙ), and η̂₀,ₙ = η(θ̂₀,ₙ) satisfy the conditions of Problem 6.3.2. Deduce that Theorem 6.3.3 is valid.
4. Testing Simple versus Simple. Let Θ = {θ₀, θ₁}, X₁, ..., Xₙ i.i.d. with density p(·, θ). Consider testing H : θ = θ₀ versus K : θ = θ₁. Assume that P_{θ₁} ≠ P_{θ₀} and that appropriate moment conditions (i) and (ii) on l(X₁, θ₁) − l(X₁, θ₀) hold for some δ > 0.
(a) Let λ(X₁, ..., Xₙ) be the likelihood ratio statistic. Show that under H, 2 log λ(X₁, ..., Xₙ) →ᴾ 0.
(b) Show that asymptotically the critical value of the most powerful (Neyman-Pearson) test with Tₙ = Σᵢ₌₁ⁿ (l(Xᵢ, θ₁) − l(Xᵢ, θ₀)) is −nK(θ₀, θ₁) + z_{1−α} √n σ(θ₀, θ₁), where K(θ₀, θ₁) is the Kullback-Leibler information
K(θ₀, θ₁) = E_{θ₀} log[p(X₁, θ₀)/p(X₁, θ₁)]
and σ²(θ₀, θ₁) = Var_{θ₀}(l(X₁, θ₁) − l(X₁, θ₀)).
5. Let (Xᵢ, Yᵢ), 1 ≤ i ≤ n, be i.i.d. with Xᵢ and Yᵢ independent, N(θ₁, 1), N(θ₂, 1), respectively. Suppose θⱼ ≥ 0, j = 1, 2. Consider testing H : θ₁ = θ₂ = 0 versus K : θ₁ > 0 or θ₂ > 0.
(a) Show that, whatever be n, under H, 2 log λ(Xᵢ, Yᵢ : 1 ≤ i ≤ n) is distributed as a mixture of point mass at 0, χ²₁ and χ²₂ with probabilities ¼, ½, ¼, respectively.
Hint: By sufficiency reduce to n = 1.
(b) Suppose Xᵢ, Yᵢ are as above with the same hypothesis but Θ = {(θ₁, θ₂) : 0 ≤ θ₂ ≤ cθ₁, θ₁ ≥ 0}. Show that 2 log λ(Xᵢ, Yᵢ : 1 ≤ i ≤ n) has a null distribution which is a mixture of point mass at 0, χ²₁ and χ²₂, but with probabilities ½ − Δ/2π, ½ and Δ/2π, where sin Δ = c/√(1 + c²), 0 < Δ < π/2.
(c) Let (X₁, Y₁) have an N₂(θ₁, θ₂, σ²₁₀, σ²₂₀, ρ₀) distribution and (Xᵢ, Yᵢ), 1 ≤ i ≤ n, be i.i.d. Let θ₁, θ₂ ≥ 0 and H be as above. Exhibit the null distribution of 2 log λ(Xᵢ, Yᵢ : 1 ≤ i ≤ n).
Hint: Consider σ²₁₀ = σ²₂₀ = 1 and Z₁ = X₁, Z₂ = (Y₁ − ρ₀X₁)/√(1 − ρ₀²).
6. In the model of Problem 5(a) compute the MLE (θ̂₁, θ̂₂) under the model and show that
(a) If θ₁ > 0, θ₂ > 0, then L(√n(θ̂₁ − θ₁, θ̂₂ − θ₂)) → N(0, 0, 1, 1, 0).
(b) If θ₁ = θ₂ = 0, then √n(θ̂₁, θ̂₂) converges in law to (U*, V*), where U* equals a N(0, 1) variable U with probability ½ and 0 with probability ½, and V* is independent of U* with the same distribution.
(c) Obtain the limit distribution of √n(θ̂₁ − θ₁, θ̂₂ − θ₂) if θ₁ = 0, θ₂ > 0.
(d) Relate the result of (b) to the result of Problem 4(a).
Note: The results of Problems 4 and 5 apply generally to models obeying A0-A6 when we restrict the parameter space to a cone (Robertson, Wright, and Dykstra, 1988). Such restrictions are natural if, for instance, we test the efficacy of a treatment on the basis of two correlated responses per individual.
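The mixture distribution in Problem 5(a) is easy to check by simulation. Here is a minimal Python sketch, assuming unit-variance normal observations under H so that 2 log λ reduces to the sum of squared positive parts of the standardized means; the simulation sizes are arbitrary.

```python
# Sketch for Problem 5(a): under H, 2 log lambda = (sqrt(n) Xbar)_+^2 + (sqrt(n) Ybar)_+^2,
# a 1/4, 1/2, 1/4 mixture of point mass at 0, chi^2_1 and chi^2_2.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
reps, n = 100000, 25
means = rng.normal(size=(reps, 2)) / np.sqrt(n)      # (Xbar, Ybar) under H
stat = n * np.sum(np.maximum(means, 0.0) ** 2, axis=1)

t = 3.0
mixture_tail = 0.5 * chi2.sf(t, 1) + 0.25 * chi2.sf(t, 2)
print(np.mean(stat > t), mixture_tail)               # should roughly agree
```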
7. Show that (6.3.19) holds.
Hint:
(i) Show that I(θ̂ₙ) can be replaced by I(θ).
(ii) Show that Wₙ(θ̂ₙ⁽²⁾) is invariant under affine reparametrizations η = a + Bθ, where B is nonsingular.
(iii) Reparametrize as in Theorem 6.3.2 and compute Wₙ(θ̂ₙ⁽²⁾), showing that its leading term is the same as that obtained in the proof of Theorem 6.3.2 for 2 log λ(X).
8. Show that under A0-A5, and A6 for θ̂ₙ⁽¹⁾, √n(θ̂ₙ⁽¹⁾ − θ₀⁽¹⁾) has a limiting normal distribution with covariance matrix Σ(θ₀), where Σ(θ₀) is given by (6.3.21).
Hint: Write θ̂ₙ⁽¹⁾ in terms of θ̂ₙ and apply Theorem 6.2.2 to θ̂ₙ.
9. Under conditions A0-A6 for (a), and A0-A6 with A6 for θ̂ₙ⁽¹⁾ for (b), establish that
(a) [−(1/n) D²lₙ(θ̂ₙ)]⁻¹ is a consistent estimate of I⁻¹(θ₀);
(b) (6.3.22) is a consistent estimate of Σ⁻¹(θ₀).
Hint: Argue as in Problem 5.3.10.
10. Show that under A2, A3, A6, θ ↦ I(θ) is continuous.

Problems for Section 6.4
1. Exhibit the two solutions of (6.4.4) explicitly and find the one that corresponds to the maximizer of the likelihood.
2. (a) Show that for any 2 × 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero and, hence, is of the form
Δ   −Δ
−Δ    Δ
for some Δ.
(b) Deduce that χ² = Z², where Z is given by (6.4.8).
(c) Derive the alternative form (6.4.8) for Z.
3. In the 2 × 2 contingency table model let Xᵢ = 1 or 0 according as the ith individual sampled is an A or Ā, and Yᵢ = 1 or 0 according as the ith individual sampled is a B or B̄.
(a) Show that the correlation of X₁ and Y₁ is
ρ = [P(A ∩ B) − P(A)P(B)] / √(P(A)(1 − P(A))P(B)(1 − P(B))).
(b) Show that the sample correlation coefficient r studied in Example 5.3.6 is related to Z of (6.4.8) by Z = √n r.
(c) Conclude that if A and B are independent, 0 < P(A) < 1, 0 < P(B) < 1, then Z has a limiting N(0, 1) distribution.
4. (a) Let (N₁₁, N₁₂, N₂₁, N₂₂) ~ M(n, θ₁₁, θ₁₂, θ₂₁, θ₂₂) as in the contingency table. Let Rᵢ = Nᵢ₁ + Nᵢ₂, Cᵢ = N₁ᵢ + N₂ᵢ. Show that given R₁ = r₁, R₂ = r₂ = n − r₁, the counts N₁₁ and N₂₁ are independent B(r₁, θ₁₁/(θ₁₁ + θ₁₂)) and B(r₂, θ₂₁/(θ₂₁ + θ₂₂)).
(b) Show that θ₁₁/(θ₁₁ + θ₁₂) = θ₂₁/(θ₂₁ + θ₂₂) iff R₁ and C₁ are independent.
(c) Show that under independence the conditional distribution of Nᵢᵢ given Rᵢ = rᵢ, Cᵢ = cᵢ, i = 1, 2, is H(cᵢ, n, rᵢ) (the hypergeometric distribution).
5. Fisher's Exact Test. From the result of Problem 6.4.4 deduce that if j(α) (depending on r₁, c₁, n) can be chosen so that
P[N₁₁ ≥ j(α) | R₁ = r₁, C₁ = c₁] ≤ α,
then the test that rejects (conditionally on R₁ = r₁, C₁ = c₁) if N₁₁ ≥ j(α) is exact level α. This is known as Fisher's exact test. It may be shown (see Volume II) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5.4.54).
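A minimal Python sketch of the conditional test just described is given below. It uses the hypergeometric distribution of N₁₁ given the margins; the counts are illustrative (they match the table of Problem 8 as reconstructed here) and the one-sided alternative is an assumption of the example.

```python
# Sketch for Problem 5: given the margins (r1, c1, n), N11 is hypergeometric,
# so an exact one-sided p-value is P[N11 >= observed | margins].
from scipy.stats import hypergeom, fisher_exact

n11, r1, c1, n = 19, 31, 19, 36                 # illustrative counts
p_onesided = hypergeom.sf(n11 - 1, n, c1, r1)   # P[N11 >= n11 | margins]
print(p_onesided)

# scipy's built-in version of the same conditional test:
table = [[19, 12], [0, 5]]
print(fisher_exact(table, alternative="greater"))
```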
6. Let Nᵢⱼ be the entries of an a × b contingency table with associated probabilities θᵢⱼ, and let ηᵢ₁ = Σⱼ₌₁ᵇ θᵢⱼ, ηⱼ₂ = Σᵢ₌₁ᵃ θᵢⱼ. Consider the hypothesis H : θᵢⱼ = ηᵢ₁ηⱼ₂ for all i, j.
(a) Show that the maximum likelihood estimates of ηᵢ₁, ηⱼ₂ are given by
η̂ᵢ₁ = Rᵢ/n, η̂ⱼ₂ = Cⱼ/n,
where Rᵢ = Σⱼ Nᵢⱼ, Cⱼ = Σᵢ Nᵢⱼ.
(b) Deduce that Pearson's χ² is given by (6.4.9) and has approximately a χ²₍ₐ₋₁₎₍ᵦ₋₁₎ distribution under H.
Hint: (a) Consider the likelihood as a function of ηᵢ₁, i = 1, ..., a − 1, and ηⱼ₂, j = 1, ..., b − 1, only.
7. Suppose in Problem 6.4.6 that H is true.
(a) Show that then
P[Nᵢⱼ = nᵢⱼ, all i, j | Rᵢ = rᵢ, Cⱼ = cⱼ, all i, j] = ∏ᵢ₌₁ᵃ (rᵢ; nᵢ₁, ..., nᵢᵦ) / (n; c₁, ..., cᵦ),
where (A; B, C, D, ...) are the multinomial coefficients.
(b) How would you, in principle, use this result to construct a test of H similar to the χ² test with probability of type I error independent of ηᵢ₁, ηⱼ₂?
8. The following table gives the number of applicants to the graduate program of a small department of the University of California, classified by sex and admission status. Would you accept or reject the hypothesis of independence at the 0.05 level
(a) using the χ² test with approximate critical value?
(b) using Fisher's exact test of Problem 6.4.5?

            Admit   Deny
Men           19     12
Women          0      5

Hint: (b) It is easier to work with N₂₂. Argue that the Fisher test is equivalent to rejecting H if N₂₂ ≥ q₂ + n − (r₁ + c₁) or N₂₂ ≤ q₁ + n − (r₁ + c₁), and that under H, N₂₂ is conditionally distributed H(r₂, n, c₂).
9. (a) If A, B, C are three events, consider the assertions
(i) P(A ∩ B | C) = P(A | C)P(B | C)   (A, B INDEPENDENT GIVEN C)
(ii) P(A ∩ B | C̄) = P(A | C̄)P(B | C̄)   (A, B INDEPENDENT GIVEN C̄)
(iii) P(A ∩ B) = P(A)P(B)   (A, B INDEPENDENT)
(C̄ is the complement of C.) Show that (i) and (ii) imply (iii) if A and C are independent or B and C are independent.
(b) Construct an experiment and three events for which (i) and (ii) hold, but (iii) does not.
(c) The following 2 × 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. Test in both cases whether the events [being a man] and [being admitted] are independent. Then combine the two tables into one, and perform the same test on the resulting table. Give p-values for the three cases.

            Admit   Deny
Men          235     35    270
Women         38      7     45
             273     42    n = 315

            Admit   Deny
Men          122     93    215
Women        103     69    172
             225    162    n = 387
(d) Relate your results to the phenomenon discussed in (a), (b).
10. Establish (6.4.14).
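For Problem 9(c), the three χ² tests of independence are routine to compute; here is a minimal Python sketch using the counts in the two tables as reconstructed above.

```python
# Sketch for Problem 9(c): chi-square tests of independence for each department
# and for the pooled table.
import numpy as np
from scipy.stats import chi2_contingency

dept1 = np.array([[235, 35], [38, 7]])     # rows: men, women; cols: admit, deny
dept2 = np.array([[122, 93], [103, 69]])
pooled = dept1 + dept2

for name, tab in [("dept 1", dept1), ("dept 2", dept2), ("pooled", pooled)]:
    stat, p, dof, _ = chi2_contingency(tab, correction=False)
    print(name, round(stat, 3), round(p, 4))
```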
11. Suppose that we know that β₁ = 0 in the logistic model, ηᵢ = β₁ + β₂zᵢ, with the zᵢ not all equal, and that we wish to test H : β₂ ≤ β₂⁰ versus K : β₂ > β₂⁰. Show that, for suitable α, there is a UMP level α test, which rejects, if and only if,
Σᵢ₌₁ᵏ zᵢNᵢ ≥ k*, where P_{β₂⁰}[Σᵢ₌₁ᵏ zᵢNᵢ ≥ k*] = α.
12. Suppose the zᵢ in Problem 6.4.11 are obtained as realizations of i.i.d. Zᵢ and mᵢ = m, so that the (Zᵢ, Xᵢ) are i.i.d. with (Xᵢ | Zᵢ) ~ B(m, π(β₂Zᵢ)).
(a) Compute the Rao test for H : β₂ ≤ β₂⁰ and show that it agrees with the test of Problem 6.4.11.
(b) Suppose that β₁ is unknown. Compute the Rao test statistic for H : β₂ ≤ β₂⁰ in this case.
(c) By conditioning on Σᵢ₌₁ⁿ Xᵢ and using the approach of Problem 6.4.5, construct an exact test (level independent of β₁).
13. Show that if ω₀ ⊂ ω₁ are nested logistic regression models of dimension q < r ≤ k, m₁, ..., mₖ → ∞, and H : β ∈ ω₀ is true, then the law of the statistic of (6.4.18) tends to χ²_{r−q}.
Hint: (Xᵢ − μᵢ)/√(mᵢπᵢ(1 − πᵢ)), 1 ≤ i ≤ k, are independent, asymptotically N(0, 1). Use this to imitate the argument of Theorem 6.3.3, which is valid for the i.i.d. case.
14. Show that, in the logistic regression model, if the design matrix has rank p, then β̂₀ as defined by (6.4.15) is consistent.
15. In the binomial one-way layout show that the LR test is asymptotically equivalent to Pearson's χ² test in the sense that 2 log λ − χ² →ᴾ 0 under H.
16. Let X₁, ..., Xₖ be independent, Xᵢ ~ N(θᵢ, σ²), where either σ² = σ₀² (known) and θ₁, ..., θₖ vary freely, or θᵢ = θᵢ₀ (known), i = 1, ..., k, and σ² is unknown. Show that the likelihood ratio test of H : θ₁ = θ₁₀, ..., θₖ = θₖ₀, σ² = σ₀² is of the form: Reject if (1/σ₀²) Σᵢ₌₁ᵏ (Xᵢ − θᵢ₀)² ≥ k₂ or ≤ k₁. This is an approximation (for large k, n) and simplification of a model under which (N₁, ..., Nₖ) ~ M(n, θ₁₀, ..., θₖ₀) under H, but under K may be either multinomial with θ ≠ θ₀ or have E₀(Nᵢ) = nθᵢ₀ but Var₀(Nᵢ) < nθᵢ₀(1 − θᵢ₀) ("cooked data").
Problems for Section 6.5
1. Fisher's Method of Scoring. The following algorithm for solving likelihood equations was proposed by Fisher; see Rao (1973), for example. Given an initial value θ̂₀, define iterates
θ̂ₘ₊₁ = θ̂ₘ + I⁻¹(θ̂ₘ) Dl(θ̂ₘ).
Show that for GLMs this method coincides with the Newton-Raphson method of Section 2.4.
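The scoring iteration is easy to write out for a particular GLM. Here is a minimal Python sketch for logistic regression, where (canonical link) the expected and observed information agree, so Fisher scoring and Newton-Raphson are the same update; the simulated data and names are ours.

```python
# Sketch for Problem 1: Fisher scoring  theta_{m+1} = theta_m + I(theta_m)^{-1} Dl(theta_m),
# written out for logistic regression (canonical link).
import numpy as np

rng = np.random.default_rng(4)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ beta_true)))

beta = np.zeros(2)
for _ in range(20):
    pi = 1.0 / (1.0 + np.exp(-Z @ beta))
    score = Z.T @ (Y - pi)                        # Dl(beta)
    info = Z.T @ (Z * (pi * (1 - pi))[:, None])   # I(beta), expected information
    beta = beta + np.linalg.solve(info, score)    # one scoring step
print(beta)
```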
2. Verify that (6.5.4) is as claimed by formula (2.2.20) for the regression described after (6.5.4).
3. Suppose that (Z₁, Y₁), ..., (Zₙ, Yₙ) have density as in (6.5.8) and
(a) P[Z₁ ∈ {z⁽¹⁾, ..., z⁽ᵏ⁾}] = 1;
(b) the linear span of {z⁽¹⁾, ..., z⁽ᵏ⁾} is Rᵖ;
(c) P[Z₁ = z⁽ʲ⁾] > 0 for all j.
Show that the conditions A0-A6 hold for P = P_β ∈ P (where q₀ is assumed known).
Hint: Show that if the convex support of the conditional distribution of Y₁ given Z₁ = z⁽ʲ⁾ contains an open interval about μⱼ for j = 1, ..., k, then the convex support of the conditional distribution of Σⱼ₌₁ᵏ λⱼYⱼz⁽ʲ⁾ given Zⱼ = z⁽ʲ⁾, j = 1, ..., k, contains an open ball about Σⱼ₌₁ᵏ λⱼμⱼz⁽ʲ⁾ in Rᵖ.
4. Show that for the Gaussian linear model with known variance σ₀², the deviance is
D(y, μ₀) = |y − μ₀|²/σ₀².
5. Let Y₁, ..., Yₙ be independent responses and suppose the distribution of Yᵢ depends on a covariate vector zᵢ. Assume that there exist functions h(y, τ), b(θ), g(μ) and c(τ) such that the model for Yᵢ can be written as
p(y, θᵢ) = h(y, τ) exp{[θᵢy − b(θᵢ)]/c(τ)},
where τ is known, g(μᵢ) = zᵢᵀβ, and b' and g are monotone. Set ηᵢ = g(μᵢ) and v(μ) = Var(Y)/c(τ) = b''(θ).
(a) Show that the likelihood equations are
Σᵢ₌₁ⁿ (dμᵢ/dηᵢ) (yᵢ − μᵢ) zᵢⱼ / v(μᵢ) = 0, j = 1, ..., p.
Hint: By the chain rule
∂l(y, θ)/∂βⱼ = (∂l/∂θ)(dθ/dμ)(dμ/dη)(∂η/∂βⱼ).
(b) Show that the Fisher information is Z_Dᵀ W Z_D, where Z_D = ‖zᵢⱼ‖ is the design matrix and W = diag(w₁, ..., wₙ), wᵢ = w(μᵢ) = 1/[v(μᵢ)(dηᵢ/dμᵢ)²].
(c) Suppose (Z₁, Y₁), ..., (Zₙ, Yₙ) are i.i.d. as (Z, Y) and that given Z = z, Y follows the model p(y, θ(z)), where θ(z) solves b'(θ) = g⁻¹(zᵀβ). Show that, under appropriate conditions, √n(β̂ − β) has a limiting normal distribution, and identify its asymptotic variance.
(d) Gaussian GLM. Suppose Yᵢ ~ N(μᵢ, σ₀²). Give θ, τ, h(y, τ), b(θ), c(τ), and v(μ). Show that when g is the canonical link, g = (b')⁻¹, the result of (c) coincides with (6.5.9).
(e) Suppose that Yᵢ has the Poisson, P(μᵢ), distribution. Give θ, τ, h(y, τ), b(θ), c(τ), and v(μ). In the random design case, give the asymptotic distribution of √n(β̂ − β). Find the canonical link function and show that when g is the canonical link, your result coincides with (6.5.9).
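The likelihood equations of part (a) and the weight matrix of part (b) can be checked numerically for a specific GLM. Below is a minimal Python sketch for the Poisson case with log (canonical) link, where v(μ) = μ and dμ/dη = μ; data and names are ours.

```python
# Sketch for Problem 5(a)-(b): Poisson GLM with log link.
import numpy as np

rng = np.random.default_rng(5)
n = 150
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.poisson(np.exp(0.3 + 0.7 * Z[:, 1]))

beta = np.zeros(2)
for _ in range(25):                                   # Fisher scoring
    mu = np.exp(Z @ beta)
    beta = beta + np.linalg.solve(Z.T @ (Z * mu[:, None]), Z.T @ (Y - mu))

mu = np.exp(Z @ beta)
dmu_deta, v = mu, mu                                  # canonical (log) link
score = Z.T @ (dmu_deta * (Y - mu) / v)               # equations of part (a), ~ 0 at the MLE
W = np.diag(1.0 / (v * (1.0 / dmu_deta) ** 2))        # w_i = 1/[v(mu_i)(d eta/d mu)^2]
info = Z.T @ W @ Z                                    # Fisher information Z_D' W Z_D
print(np.max(np.abs(score)), info)
```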
Problems for Section 6.6
1. Consider the linear model of Example 6.6.2 and the hypothesis
β_{q+1} = β_{0,q+1}, ..., β_p = β_{0,p}
under the sole assumption that Eε = 0, 0 < Var ε < ∞. Show that the LR, Wald, and Rao tests are still asymptotically equivalent in the sense that if 2 log λₙ, Wₙ, and Rₙ are the corresponding test statistics, then under H,
2 log λₙ = Wₙ + oₚ(1) = Rₙ + oₚ(1).
Note: 2 log λₙ, Wₙ, and Rₙ are computed under the assumption of the Gaussian linear model with σ² known.
Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under the parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation. Apply Theorem 6.2.1.
2. Show that the standard Wald test for the problem of Example 6.6.3 is as given in (6.6.10).
3. Show that σ̄² given in (6.6.14) is a consistent estimate of 2 Var_P X⁽¹⁾ in Example 6.6.3 and, hence, that replacing σ̂² by σ̄² in (6.6.10) creates a valid asymptotic level α test.
4. Consider the Rao test for H : θ = θ₀ for the model P = {P_θ : θ ∈ Θ}, assuming A0-A6 hold. Suppose that the true P does not belong to P, but that if θ(P) is defined by (6.6.3), then θ(P) = θ₀. Show that, if Var_P Dl(X, θ₀) is estimated by I(θ₀), then the Rao test does not in general have the correct asymptotic level, but that if the estimate n⁻¹ Σᵢ₌₁ⁿ [Dl][Dl]ᵀ(Xᵢ, θ₀) is used, then it does.
5. Suppose X₁, ..., Xₙ are i.i.d. P. By Problem 5.3.1, if P has a positive density f at ν(P), the unique median of P, then the sample median X̃ satisfies
√n(X̃ − ν(P)) →ᴸ N(0, σ²(P)),
where σ²(P) = 1/[4f²(ν(P))].
(a) Show that if f is symmetric about μ, then ν(P) = μ.
(b) Show that if f is N(μ, σ²), then σ²(P) > σ² = Var_P(X₁), the information bound and asymptotic variance of √n(X̄ − μ); but if f(x) = ½ exp(−|x − μ|), then σ²(P) < σ², in fact, σ²(P)/σ² = 2/π.
6. Establish (6.6.15) by verifying the conditions of Theorem 6.2.1 under this model and verifying the formula given.
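The comparison in Problem 5(b) is easy to see by Monte Carlo. Here is a minimal Python sketch comparing the mean and median under normal and double-exponential populations; sample sizes and replication counts are arbitrary choices of ours.

```python
# Sketch for Problem 5: Monte Carlo MSEs of the sample mean and sample median
# under a normal and a double-exponential (Laplace) population centered at 0.
import numpy as np

rng = np.random.default_rng(6)
reps, n = 20000, 101

for name, draw in [("normal", lambda: rng.normal(size=(reps, n))),
                   ("laplace", lambda: rng.laplace(size=(reps, n)))]:
    x = draw()
    mse_mean = np.mean(np.mean(x, axis=1) ** 2)
    mse_median = np.mean(np.median(x, axis=1) ** 2)
    print(name, mse_median / mse_mean)   # > 1 for normal, < 1 for Laplace
```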
7. In the binary data regression model of Section 6.4.3, let πᵢ = s(zᵢᵀβ), where s(t) is the continuous distribution function of a random variable symmetric about 0; that is,
s(t) = 1 − s(−t), t ∈ R.   (6.7.1)
(a) Show that π can be written in this form for both the probit and logit models.
(b) Suppose that the zᵢ are realizations of i.i.d. Zᵢ, that Z₁ is bounded with probability 1, and let β̂_L(X⁽ⁿ⁾), where X⁽ⁿ⁾ = {(Yᵢ, Zᵢ) : 1 ≤ i ≤ n}, be the MLE for the logit model. Show that if the correct model has πᵢ given by s as above and β = β₀, then β̂_L is not a consistent estimate of β₀ unless s(t) is the logistic distribution function. But if β_L is defined as the solution of E[Z₁ s(Z₁ᵀβ₀)] = Q(β), where Q(β) = E(Z₁ Λ(Z₁ᵀβ)) is p × 1 and Λ denotes the logistic distribution function, then √n(β̂_L − β_L) has a limiting normal distribution with mean 0 and variance
Q̇⁻¹(β) Var(Z₁(Y₁ − Λ(Z₁ᵀβ))) [Q̇⁻¹(β)]ᵀ,
where Q̇(β) = E(Z₁ Λ'(Z₁ᵀβ) Z₁ᵀ) is p × p and necessarily nonsingular.
Hint: Apply Theorem 6.2.1.
8. (Model Selection) Consider the classical Gaussian linear model (6.1.1), Yᵢ = μᵢ + εᵢ, i = 1, ..., n, where the εᵢ are i.i.d. Gaussian with mean zero and variance σ², μᵢ = zᵢᵀβ, and the zᵢ are d-dimensional vectors of covariate (factor) values. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d − p don't matter; that is, β_{p+1} = ··· = β_d = 0. Let β̂(p) be the LSE under this assumption and Ŷᵢ(p) the corresponding fitted value. A natural goal to entertain is to obtain new observations Y₁*, ..., Yₙ* at z₁, ..., zₙ and evaluate the performance of Ŷ₁⁽ᵖ⁾, ..., Ŷₙ⁽ᵖ⁾ and, hence, the model with β_{p+1} = ··· = β_d = 0, by the (average) expected prediction error
EPE(p) = n⁻¹ Σᵢ₌₁ⁿ E(Yᵢ* − Ŷᵢ⁽ᵖ⁾)².
Here Y₁*, ..., Yₙ* are independent of Y₁, ..., Yₙ and Yᵢ* is distributed as Yᵢ, i = 1, ..., n. Let RSS(p) = Σᵢ(Yᵢ − Ŷᵢ⁽ᵖ⁾)² be the residual sum of squares. Suppose that σ² is known.
(a) Show that EPE(p) = σ²(1 + p/n) + n⁻¹ Σᵢ₌₁ⁿ (μᵢ − μᵢ⁽ᵖ⁾)², where μᵢ⁽ᵖ⁾ = zᵢᵀβ(p) and β(p) = (β₁, ..., βₚ, 0, ..., 0)ᵀ.
(b) Compute E[RSS(p)] and deduce that
(c) EPÊ(p) = n⁻¹ RSS(p) + 2pσ²/n is an unbiased estimate of EPE(p).
(d) Show that (a), (b) continue to hold if we assume the Gauss-Markov model.
(e) Suppose p = 2 and μ(z) = β₁z₁ + β₂z₂. Evaluate EPE for (i) ηᵢ = β₁zᵢ₁ and (ii) ηᵢ = β₁zᵢ₁ + β₂zᵢ₂. Give values of β₁, β₂ and {zᵢ₁, zᵢ₂} such that the EPE in case (i) is smaller than in case (ii) and vice versa. Use σ² = 1 and n = 10.
Hint: (a) Note that
EPE(p) = n⁻¹ Σᵢ₌₁ⁿ E(Yᵢ* − μᵢ)² + n⁻¹ Σᵢ₌₁ⁿ E(μᵢ − Ŷᵢ⁽ᵖ⁾)².
Derive the result for the canonical model.
(b) The result depends only on the mean and covariance structure of the Yᵢ, Yᵢ*, μ̂ᵢ⁽ᵖ⁾, i = 1, ..., n.
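The unbiasedness claim of part (c) can be checked by simulation. Here is a minimal Python sketch under the assumption that the submodel is correct (the last d − p coefficients are truly zero); the design, coefficients, and replication count are arbitrary choices of ours.

```python
# Sketch for Problem 8(c): check that RSS(p)/n + 2*p*sigma^2/n is unbiased for EPE(p).
import numpy as np

rng = np.random.default_rng(7)
n, d, p, sigma = 10, 4, 2, 1.0
Z = rng.normal(size=(n, d))
beta = np.array([1.0, -0.5, 0.0, 0.0])          # last d - p coefficients are zero
mu = Z @ beta

est, epe = [], []
for _ in range(20000):
    y = mu + sigma * rng.normal(size=n)
    y_new = mu + sigma * rng.normal(size=n)      # independent copy at the same z_i
    bhat, *_ = np.linalg.lstsq(Z[:, :p], y, rcond=None)
    fit = Z[:, :p] @ bhat
    est.append(np.sum((y - fit) ** 2) / n + 2 * p * sigma ** 2 / n)
    epe.append(np.mean((y_new - fit) ** 2))
print(np.mean(est), np.mean(epe))                # the two averages should agree
```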
6.8 NOTES
Note for Section 6.1
(1) From the L. A. Heart Study after Dixon and Massey (1969).
Note for Section 6.2
(1) See Problem 3.5.9 for a discussion of densities with heavy tails.
Note for Section 6.4
(1) R. A. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. To guard against such situations he argued that the test should be used in a two-tailed fashion and that we should reject H both for large and for small values of χ². Of course, this makes no sense for the model we discussed in this section, but it is reasonable if we consider alternatives to H which are not multinomial. For instance, we might envision the possibility that an overzealous assistant of Mendel "cooked" the data. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of χ² as well as large ones (Problem 6.4.16). The moral of the story is that practicing statisticians should be on their guard! For more on this theme see Section 6.6.
6.9
REFERENCES
COX, D. R., The Analysis of Binary Data London: Methuen, 1970.
DIXON, W. AND F. MASSEY, Introduction to Statistical Analysis, 3rd ed. New York: McGraw-Hill, 1969.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.
GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I New York: McGraw-Hill, 1961.
HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.
HUBER, P. J., "The behavior of maximum likelihood estimates under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob. I, Univ. of California Press, 221-233 (1967).
KASS, R., J. KADANE, AND L. TIERNEY, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).
KOENKER, R. AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc. Ser. C, 36, 383-393 (1987).
LAPLACE, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, II, 475-558. Gauthier-Villars, Paris) (1789).
MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).
MCCULLAGH, P. AND J. A. NELDER, Generalized Linear Models London: Chapman and Hall, New York, 1983; second edition, 1989.
PORTNOY, S. AND R. KOENKER, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
ROBERTSON, T., F. T. WRIGHT, AND R. L. DYKSTRA, Order Restricted Statistical Inference New York: Wiley, 1988.
SCHEFFÉ, H., The Analysis of Variance New York: Wiley, 1959.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.
WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A
A REVIEW OF BASIC PROBABILITY THEORY
In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modem theory of prob ability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider ba sic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.
A.l
THE BASIC MODEL
Classical mechanics is built around the principle that like causes produce like effects. Prob ability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that_ is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply. The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that to be called an experiment such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not, in addition, require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize. This 441
1
I
l
A Review of Basic Probability Theory
442
Appendix A
long-term relative frequency 11 ..1/ '' where 11,\ is the number of times the possible outcome A occurs in n repetitions, is to many statistician::;. including the authors, the operational interpretation of the mathematical concept of probability. ln this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment." Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. For a discussion of this approach and further references the reader may wish to consult Savage ( 1954), Raiffa and Schlaiffer ( I 961 ), Savage (1962), Lindley ( 1 965), de Groot ( I 970), and Berger ( 1 985). We now turn to the mathematical abstraction of a random experiment, the probability model. In this section and throughout the book, we presunle the reader to be familiar with ele mentary set theory and its notation at the level of Chapter I of Feller ( 1968) or Chapter l of Parzen ( 1960). We shall use the symbols U, n, c, , c for union, intersection, comple mentation, set theoretic difference, and inclusion as is usual in elementary set theory. A random experiment is described mathematically in terms of the following quantities. .
-
A.l.l The sample space is the set of all possible outcomes of a random experiment. We denote it by n. Its complement, the null set or impossible event, is denoted by 0. A.1.2 A sample point is any member of fl and is typically denoted by w.
A.1.3 Subsets of n are called events. We denote events by A, B, and so on or by a descrip tion of their members, as we shall see subsequently. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A." The set operations we have mentioned have interpre tations also. For example, the relation A c B between sets considered as events means that the occurrence of A implies the occurrence of B. If w E f.!, {w} is called an elementary event. If A contains more than one point, it is called a composite event.
i •
I •
I \ '
" '
A.1.4 We will let A denote a class of subsets of n to which we an assign probabilities. For technical mathematical reasons it may not be possible to assign a probability P to ev ery subset of n. However, A is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complemen tation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; and Loeve, 1977). A probabiliry distribution or measure is a nonnegative function P on A having the following properties:
•
' ' '
I
(i) (ii)
P(Q) = l. If A1, A2, . . . are pairwise disjoint sets in A, then p
I
•
'
•
•
i, •
•
u A; t=1
=
L P(A;). i= 1
Recall that Ui 1 Ai is just the collection of points that are in any one of the sets Ai and that two sets are disjoint if they have no points in common .
.�
1
l I
l
l j
' ••
j
l
•
l
\
i
l
1
Section A.2
Elementary Properties of Probability Models
443
A.l.S The three objects !1, A, and P together describe a random experiment mathemati cally. We shall refer to the triple (0, A, P) either as a probability model or identify the model with what it represents as a (random) experiment. For convenience, when we refer to events we shall automatically exclude those that are not members of A. References Gnedenko (1967) Chapter I , Sections 1-3, 6-8 Grimmett and Stirzaker (1992) Sections 1. 1-1.3 Hoe!, Port, and Stone ( 1 97 1 ) Sections Ll , 1 .2 Parzen (1 960) Chapter I, Sections 1-5 Pitman ( 1 993) Sections 1.2 and 1.3 A.2
ELEMENTARY PROPERTIES O F PROBABILITY MO DELS
The following are consequences of the definition of P. A,2,1 lf A c B, then P(B - A) = P(B) - P(A). A.2.2 P(A")
=
1 -
P(A), P(0)
A.2.3 If A c B, P(B) A.2.4 0 < P(A) < I. A.2.5 P (U;;' 1 An) < A.2.6 If A, c A, A.2.7 P
c
(n��r A,)
>
=
0.
P(A).
I;;;'
,
P(A n ) ·
· · · c An . . . , then P (U;:' 1 An) = lim,�= P(An)·
> 1 -
I:��r P(A;) (Bonferroni's inequality).
References Gnedenko, (1967) Chapter I, Section 8 Grimmett and Stirzaker ( 1992) Section 1.3 Hoe!, Port, and Stone (1992) Section 1.3 Par2en (1 960) Chapter I , Sections 4-5 Pitman (1993) Section 1.3 A.J
DISCRETE PROBABILITY MODELS
A.3.1 A probability model is called discrete if !1 is finite or countably infinite and every subset off! is assigned a probability. That is, we can write f! = {w 1 , w2, . . . } and A is the collection of subsets of n. In this case, by axiom (ii) of (A.l .4)., we have for any event A, (A.3.2)
' '
;
444
A Review of Basic Probability Theory
Appendix A
An important special case arises when n has a finite number of elements, say N, all of which are equally likely. Then P( {w )) = 1/N for every w E n. and Number of elements in A P(A ) = :_::_c:c:.:,:_�Ci"'-'-=--"'-'-' . (A.3 .3) N
A.3.4 Suppose that w 1 , .
, WN
are the members of some population (humans, guinea pigs, flowers, machines, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of (A.3.3). Such selection can be carried out if N is small by putting the "names" of the wl in a hopper, shaking well, and drawing. For large N, a random number table or computer can be used. . .
-� l 1
'
1
-
'• •
• '
I
'
l ,
•
References
Gnedenko (1967) Chapter I , Sections 4--5 Parzen (1960) Chapter I , Sections 6-7 Pitman (1993) Section 1.1
A.4
.
l
1
I
CONDITIONAL PROBABI LITY AND INDEPENDENCE
! 1 '
Given an event B such that P( B) > 0 and any other event A, we define the conditional probability of A given B, which we write P(A I B), by P(A n B) P(A I B) (AA. l ) . =
P(B) If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment, then P( A I B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. From a heuristic point of view P(A I B) is the chance we would assign to the event A if we were told that B has occurred. If A1, A2, . are (pairwise) disjoint events and P(B) > 0, then
p U A I B = :L P(A; I B).
In fact, for fixed B as before, the function P(- I B) is a probability measure on (!:l, A) which is referred to as the conditional probability measure given B. Transposition of the denominator in (A.4.1 ) gives the multiplication rule,
P(A n B) = P(B)P(A 1 B).
(A.4.3)
Bn are (pairwise) disjoint events of positive probability whose union is !:l, the identity A = U7 1 (A n B; ). (A. l 4)(ii) and (A.4.3) yield n (A.4.4) P(A) L P(A I B; )P(B; ). .
-�
�
1 '
'
'
(A.4.2)
i=l
i=l
B1 , B2,
1
.
.
If
l j .
.
•
,
.
=
j=l
j ' '
' •
1 •
'
' '
'
'
Section A.4
445
Conditional Probability and Independence
If 1 1( A ) is positive, we can combine (A.4.1 ). ( A.4.3 ), and ( A.4.4) and obtain Bayes rule P(A I B,)P(B,) , P(B, I A) � "" ) L.. j - 1 P(A I B )P(BJ
The conditional probability of A given B1 , defined by
(A4.5)
.I
.
.
.
•
Bu is written P(A B1
for any events A, B1, . . . , Bn such that P(B1 n · · · n Bn) Simple algebra leads to the multipliCation rule,
>
•
•
.
•
, En ) and
0.
P(B1 n · · · n Bn) � P(B1) P(B, I BJ)P(B, I B1 , B,) . . . P(B, I B1 . . . . , Bn _ !) (A.4.7)
whenever P(B1 n · · · n En - d > 0. Two events A and B arc said to be independent if P(A n B) � P(A)P(B).
(A.4.8)
If P( B) > 0, the relation (A.4.8) may be written P(A I B) � P(A).
(A.4.9)
In other words, A and B are independent if knowledge of B does not affect the probability of A. The events A 1 , . . . , An are said to be independent if P(A;, n . . . n A;,)
k
= IT P(A;,)
(A.4.10)
j=l
for any subset {i1, . . . , ik} of the integers {1, . . . , n}. If all the P(Ai) are positive, relation (A.4. l0) is equivalent to requiring that P(A, I A;, . . . , A;,)
�
P(A1)
for any j and {i1, . . . , ik} such that j '/c {i1, . . . , i k }. References Gnedenko ( 1967) Chapter I , Sections 9 Grimmett and Stirzaker (1992) Section 1.4 Hoe!, Port, and Stone (1971) Sections 1 .4, 1 .5 Parzen (1960) Chapter 2, Section 4; Chapter 3, Sections 1.4 Pitman (1993) Section 1 .4
(A.4. l l )
l '
446
A Review of Basic Probability Theory
Appendix A
COMPOUND EXP ERIMENTS
A.5
There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experi ments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first dmw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. InformaUy, a compound experiment is one made up of two or more component ex periments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6 where examples of compound experiments are given . A.S.l Recall that if A 1 , . . . , An are events, the Cartesian product A1 x · · · x An of At, . . . , An is by definition { (w1, , Wn) : w� E Ai, 1 < i < n}. If we are given n ex periments (probability models) t\, . . . , En with respective sample spaces 0 1 , , .On. then the sample space .O ofthe n stage compound experiment is by definition f21 X · · · X .On . The (n stage) compound experiment consists in performing component experiments £1 , , En and recording all n outcomes. The interpretation of the sample space n is that (w1, . . . , wn) is a sample point in n if and only if w 1 is the outcome of £1, w2 is the outcome of £2 and so on. To say that £t has had outcome w? E Ot corresponds to the occurrence of the com pound event (in .O) given by fl1 X · · · X f2i-I X {w?} X f2t+I X · · · X fln = {(Wt, . . , wn) E n : wi = wf}. More generally, if Ai E At. the sigma field corresponding to £i, then Ai corresponds to nl X . . . X ni-l X At X ni+ I X . . . X nn in the compound experiment. If we want to make the £i independent, then intuitively we should have aU classes of events A1, , An with Ai E Ai, independent. This makes sense in the compound experiment. If P is the probability measure defined on the sigma field A of the compound experiment, that is, the subsets A of !1 to which we can assign probability (I), we should have P([A, X n, X X !1n] n [!1, X A, X . . . X !1n] n . . . ) .
.
•
•
'
• • •
.
I
.
' '
•
'
!
l '
i
.
.
'
•
1 1 •
.
.
•
'
.
= P(A, x · · · x An) = P(A, X n, X . . . X !1n)P(fl, X A, X . . . X nn) . . . P(n, X . . . X nn-1 X An). (A.5.2)
j l
If we are given probabilities ?, on (!1, A1), P2 on (!12 , A2), . . . , Pn on (!1. , An). then (A.5.2) defines P for A1 x · · · X An by
i
l j
· .
x
(A.5.3) · · · x An) = P, (A,) . . . Pn(An) · It may be shown (Billingsley, 1995; Chung, 1974; Loeve, 1977) that if P is defined by (A.5.3) for events A1 x · · x An. it can be uniquely extended to the sigma field A spec ified in note ( 1 ) at the end of this appendix. We shall speak of independent experiments £1, . . . , En if the n stage compound experiment has its probability structure specified by (A.5.3). In the discrete case (A.5.3) holds provided that P( {(w, , . . . , wn)}) = P1 ( { w,}) . . . Pn ( {Wn}) for all wi E !1,, 1 < i < n. (A.5.4) P(A, ·
l
I '
''
I:
:1
i
•
L i
--
---------------
�
Section
A6
Specifying
P 1, .
know
447
Bernoulli and Multinomial Trials, Sampling With and Without Replacement
P when the £,
are dependent is more complicated. In the discrete case we E 01, { (w1 . . . , U.:n ) } ) for each once we have specified c,..:11 ) with
wi P( (w1 i = , H. By the multiplication rule (A.4.7) we have, in the discrete case, the following. A.S.S P({(w1 . . , wn ) } ) P(£1 has outcome w l ) P(£2 has outcome w2 1 £1 has out £11_ 1 has outcome W n _ t ) , come W1 ) . . . P(£T, has outcome Wn I £1 has outcome W J , The probability structure is determined by these conditional prob ab ilitie s and conversely. .
•
.
.
.
•
.
.
=
.
.
•
•
•
References Grimmett and Stirzaker ( 1992) Sections Hoe!, Port, and Stone Parzen
A.6
1 .5, 1 .6
( 1 97 1 ) Section 1 .5
(1960) Chapter 3
BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT
A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by S (success) and F (failure). If we assign P( { S) ) p, we shall refer to such =
an experiment as a
Bernoulli trial with probability of success p. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment n times independently, we say we have performed n Bernoulli
trials with success probability p. If 0 is the sample space of the compound experiment, any point w E n is an n-dimensional vector of S's and F's and,
where then
k(w)
P( { w ) ) = Pk(w)(l - p)"- k(w)
is the number of S's appearing in
w.
(A.6.2)
If Ak is the event [exactly
k S's occur], (A.6.3)
( k )=
where
n
The formula
n!
k! (n - k)! '
(A6.3) is known as the binomial probability.
A.6.4 More generally, if an experiment has q possible outcomes w1,
, wq and P( {wi}) = ,pq. If Pi. we refer to such an experi ment as a multinomial trial with probabilities p1 , .
•
•
•
.
•
the experiment is performed n times independently. the compound experiment is called n
multinomial trials with probabilities p1 , . . . , Pq· If n is the sample space of this experiment and w
E !1, then
(A.6.5)
' i' !'
\'
'
,,
Appendix A
k�(w) = number of times wi appears in the sequence w. If AkJ , ... ,kq is the event (ex kt WI 's are observed, exactly kzwz 's are observed, . . . , exactly kqwq 's are observed),
where actly then
(A.6.6)
.
,.
'
A Review of Basic Probability Theory
448
'
' ' '
where the
ki are natural numbers adding up to n.
'
A.6.7
If we perform an experiment given by (.0, A, P) independently
n times,
we shall
sample of size n from the population given by (O,A, P). When .0 is finite the tenn, with replacement is added to sometimes refer to the outcome of the compound experiment as a
distinguish this situation from that described in (A.6.8) as follows.
A.6.8 If we have a finite population of cases n
{w1 , WN } and we select cases wi successively at random n times without replacement, the component experiments are not independent and, for any outcome a = (wi1 , wi., ) of the compound experiment, •
P({a})
•
•
=
•
•
•
1 (N) n
�
I l
,
'
(A.6.9)
.
' '
where
' •
(N) n
N! (N - n)! .
�
' ' •
If the case drawn is replaced before the next drawing, we are sampling and the component experiments are independent and bers of n have a "special" characteristic F and
Ak
= (exactly
P(A k )
�
for max (O, n -
k
N( 1
P( {a}) �
p)
�
with replacement,
1/N". If Np of the mem
' •
'
have the opposite characteristic
• • '
k "special" individuals are obtained in the sample}, then
) ( n
S
and
�
'
(Np)k(N(I - p)) n-k (N) n
(A.6. l0)
�
N(l - p)) < k < min(n,Np),
•
J
• •
and
(A.6.1 0) is known as the hypergeometric probability.
P(Ak)
�
O otherwise. The formula '
References
J
'
j
2, Section I I Hoe!, Port, and Stone (1971) Section 2.4 Gnedenko (1 967) Chapter
.. •
'
1
Parzen ( 1960) Chapter 3, Sections 1 -4
'
I
Pitman (1993) Section 2.1
A.7
l1
PROBABILITIES ON EUCLI DEAN SPACE
Random experiments whose outcomes are real numbers play a central role in
theory and
.1
'
practice. The probability models corresponding to such experiments can all be thought of
'
i
as having a Euclidean space for sample space.
'
i
..J
________________________________
Section A.?
449
Probabilities on Euclidean Space
We shall use the notation Rk of k�dimensional Euclidean space and denote members , xk) ', where ( )' denotes transpose. of Rk by symbols such as x or (x1 , .
•
.
A.7.1 lf (a1 , bt ) , . . . , (a , bk) are k open intervals, we shall can the set (a1obt) x · · · x k (ab bk) = {(x 1, . . , xk) : ai < xi < bi, 1 < i < k} an open k rectangle. .
A.7.2 The Borel field in Rk, which we denote by !3k, is defined to be the smallest sigma field having all open k rectangles as members. Any subset of Rk we might conceivably be interested in turns out to be a member of f3k. We will write R for R1 and !3 for !3 1 . A.7.3 A discrete (probability) distribution on Rk is a probability measure P such that L:� 1 P( { xi }) = 1 for some sequence of points {xi} in Rk. That is, only an xi can occur as an outcome of the experiment. This definition is consistent with (A.3.1) because the study of this model and that of the model that has n = { x1, , Xn, . . . } are equivalent. The frequencyfunction p of a discrete distribution is defined on Rk by •
•
•
p(x) = P( (x )). (A.7.4) Conversely, any nonnegative function p on Rk vanishing except on a sequence {x1, . , Xn, . . . } of vectors and that satisfies L:� 1 p(xi) = 1 defines a unique discrete probability .
.
distribution by the relation
P(A ) =
L p(x; ).
(A.7.5)
x;EA
A.7.6 A nonnegative function p on Rk , which is integrable and which has
r p(x)dx = 1 , JR'
where dx denotes dx1 . . . dxn, is called a density function. Integrals should be interpreted in the sense of Lebesgue. However, for practical purposes, Riemann integrals are adequate. A.7.7 A continuous probability distn'bution on Rk is a probability P that is defined by the relation
P(A)
=
L p(x)dx
=
1
(A.7.8)
for some density function p and all events A. P defined by A.7.8 are usually called abso lutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely. It may be shown that a function P so defined satisfies (A.l.4). Recall that the integral on the right of (A.7.8) is by definition
{ 1A(X)p(x)dx ln•
where 1 A (x ) = 1 if x E A, and 0 otherwise. Geometrically, P(A) is the volume of the "cylinder" with base A and height p(x) at x. An important special case of (A.7.8) is given by (A.7.9)
450
A Review of Basic Probability Theory
Appendix A
Jt turns out that a continuous probability distribution determines the density that generates it "uniquely."( I ) Although in a continuous model P( { x}) = 0 for every x, the density function has an operational interpretation close to that of the frequency function. For instance, if p is a continuous density on R, :r:0 and x1 are in R, and h is close to 0, then by the mean value theorem
P([xo - h, xo + h]) p(xo) "' P([xo - h, xo + h]) "' 2hp(xo) and . (A.7.10) P([Xt - h X I + h]) p(Xt ) The ratio p(xo)/p(xl ) can, thus, be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of xo then one in a neighborhood of x1• 1
A.7.11 The distribution function (d.f.) F is defined by
F(x 1 , . . . ,x•) = P((-oo , x,J
x
. . x (-oo, x.]). ·
(A.7.12)
The d.f. defines P in the sense that if P and Q are two probabilities with the same d.f., then P = Q. When k = 1, F is a function of a real variable characterized by the following properties: '
(A.7. 13)
I I
(A.7.14)
'
• •
•
I' I
;
·.
'
'
"
I
'
x < y =? F(x) < F(y) (Monotone) Xn j x =? F(xn)
•
'
I' '
!
I '
F(x) (Continuous from the right)
limx�oo F(x) = 1 limx�-oo F(x) = 0.
•
'
�
'
!
I '
(A.7.1 5)
i
(A.7. 16)
1
It may be shown that any function F satisfying (A. 7. 13HA.7.16) defines a unique P on the real line. We always have
F(x) - F(x - 0) (2) = P({x)).
(A.7.17)
Thus, F is continuous at x if and only if P( {x}) = 0. References
I
Gnedenko (1967) Chapter 4, Sections 21, 22 Hoe!, Port, and Stone (1971) Sections 3 . 1 , 3.2, 5.1, 5.2 Parzen ( 1960) Chapter 4, Sections 1-4, 7 Pitman (1993) Sections 3.4, 4.1 and 4.5
l !
•
•
i
J
Section A.8
A.B
451
Random Vari<Jbles and Vectors: Transformations
RAN DOM VARIABLES AND VECTORS: TRANSFOR MATIONS
Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the time to breakdown and length of repair time for a randomly chosen machine, the yield per acre of a field of wheat in a given year, the concentration of a certain pollutant in the atmosphere, and so on. In the probability model, these quantities will correspond to random variables and vectors.
A.8.1 A random variable X is a function from f!to R such that the set {w : X(w) E B} x-'(B) is in 0 for every B E B.(l l
=
Xk) T is k-tuple of random variables, or equivalently a function from !1 to Rk such that the set {w , X(w) E: B} � x- 1 (B) is in A for every B E Bk_(I) For k = 1 random vectors are just random variables. The event x- 1 ( B) will usually be written [X E B] and P([X E B]) will be written P[X E B]. The probability distribution of a random vector X is, by definition, the probability measure Px in the model ( R k , 13k, Px ) given by
A.8.2 A random vector X = ( X1,
•
•
.
,
Px(B) � P[X E B[.
(A.8.3)
A.8.4 A random vector is said to have a continuous or discrete distribution (or to be contin
uous or discrete) according to whether its probability distribution is continuous or discrete. Similarly, we will refer to the frequencyfunction, density, dj, and so on of a random vector
when we are, in fact, referring to those features of its probability distribution. The subscript X or X will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to unless the reference is clear from the context in which case they will be omitted. The probability of any event that is expressible purely in tenns of X can be calculated if we know only the probability distribution of X. In the discrete case this means we need only know the frequency function and in the continuous case the density. Thus, from (A.7.5) and (A.7.8) P[X E A]
L P(x),
xE A
if X is discrete
L p(x)dx, if X is continuous.
(A.8.5)
When we are interested in particular random variables or vectors. we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined. The study of real- or vector-valued functions of a random vector X is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Letg be any function from Rk to Rm, k, m > I, such that<21 g-1(B) {y E Rk : g(y) E �
452
A Review of B<1sic Probdbility Theory
£3} E 8/.: for every B E 8"'.
Then the random tran�(ormatirm
g (X) (w) = g(X(w) ).
g( X)
Appendix A
is defined by (A.8.6)
An example of a transformation often used in statistics is
g
(g 1 , g2 /
with
g1 (X)
= X and 92(X) = k- 1 E�· 1 ( Xi - X)2. Another common example is g(X) = (min{ X,), max( X,))'. The probability distribution of g(X) is completely detennined by that of X through 1 (A.8.7) P[g(X) E B] = P[X E g- ( B)].
k- 1 L; 1 Xi
If X is discrete with frequency function
tion
px , then g(X) is discrete and has frequency func
Pg(XJ(t) = Suppose that
X
=
L Px (x) .
(A.8.8)
{x:g(x)=t}
is continuous with density
Px and g is real-valued and one-to-one C3l
S such that P[X E S] = 1. Furthennore, assume that the derivative l of g exists and does not vanish on S. Then g(X) is continuous with density given by on an open set
'
'
(A.8.9)
!
'
'
g(S), and 0 otherwise. This is called the change of variableformula. If g(X) = aX + !'. a # 0, and X is continuous, then
'
for t E
Pg(X) (t) = From (A.8.8) it follows that if (X1
t - I' ) ( PX j';J 1
a
·
•
'
• '
(A.8.10)
Yf is a discrete random vector with frequency func tion P(X, Y)· then the frequency function of X, known as the marginal frequency function,
i
'
'
'
is given by(4) • '
i (
t '
px(x) = LP(X ,YJ(x,y) . y
(A.8 . 1 1 )
(X, Y)T is continuous with density P( X,Y)• it may be shown (as a consequence of (A.8.7) and (A.7.8)) that X is a marginal density function given by
' •
Similarly, if
px(x)
�
These notions generalize to the case
1: P(X,YJ(x, y)dy.<5l
integrating out over y in
P(X,YJ(x, y) .
! •
1 i '
Z = (X, Y), a random vector obtained by putting two
random vectors together. The (marginal) frequency or density of X is found as and (A.8.12) by summing or
(A.8.12)
. • • •
in (A.8. 1 1 )
Discrete random variables may be used to approximate continuous ones arbitrarily
!
• '
;
!
•
i
•
closely and vice versa.
I I !
•
;
' '
j
-
Section A.9
453
Independence of Random Variables and Vectors
In practice, all random variables are discrete because there is no instrument that can measure with perfect accuracy. Nevertheless, it is common in statistics to work with con tinuous distributions, which may be easier to deal with. The justification for this may be theoretical or pragmatic. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical modeL Or else, the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A. l 5 and B.7.
A.8.13 A convention: We shall write X = Y if the probability of the event [X fc Yj is 0. References Gnedenko (1967) Chapter 4, Sections 2 1-24 Grimmett and Stirzaker ( 1992) Section 4.7 Hoe!, Port, and Stone (197 1 ) Sections 3.3, 5.2, 6. 1, 6.4 Parzen (1960) Chapter 7, Sections 1-5, 8, 9 Pitman (1993) Section 4.4
INDEPENDENCE O F RANDOM VARIABLES AND VECTORS
A.9
A.9.1 Two random variables X1 and X2 are said to be independent if and only if for sets A and B in B, the events [X1 E AJ and [X, E BJ are independent. A.9.2 The random variables X1 , , Xn are said to be (mutually) independent if and only if for any sets A1, , An in B, the events [X 1 E A1], . . [Xn E An] are independent. To generalize these definitions to random vectors X 1 , . . , Xn (not necessarily of the same dimensionality) we need only use the events [Xi E Ai] where Ai is a set in the range of xi. A.9.3 By (A.8.7), if X and Y are independent, so are g(X) and h( Y) , whatever be g and h. For example. if (X1, X2 ) and (Yt . Y2 ) are independent, so are X1 + X2 and Yt Y2, (X1,X1X2) and Y2 , and so on. •
.
•
•
•
.
.
1
.
Theorem A.9.1. Suppose X = (X1, . . . , Xn) is either a discrete or continuous random vector. Then the random variables X 1 , . . . , Xn are independent if, and only if, either ofthe following two conditions hold: (A.9.4) Px(x,, .
.
.
, Xn) = Px, (xi) . .
.
px. (xn)
for all x , , . . , Xn. .
(A.9.5)
A.9.61fthe Xi are all continuous and independent, then X = (Xt, . . . , , Xn) is continuous. A.9.7 The preceding equivalences are valid for random vectors XI J . . (X�o . , Xn). .
.
.
1
Xn with X -
A Re iew of Basic Probability Theory
454
i
i
v
Appendix A
Xn are independent identically distributed k-dimensional random vectors with d.f. Fx or density (frequency function) px, then X 1 X11 is called a random sample (?{ si::.e n from a population with d.f. Fx or density (frequency function) Px- In
A.9.8 If X1
•
.
.
.
,
•
.
.
.
,
statistics, such a random sample is often obtained by selecting n members at random in the sense of (A.3.4) from a population and measuring k characteristics on each member. If A is any event, we define the random variable 1 A, the indicator of the event A, by
1 if w E A
0 otherwise.
(A9.9)
If we perfonn n Bernoulli trials with probability of success p and we let Xi be the indicator of the event (success on the ith trial), then the Xi form a sample from a distribution that assigns probability p to 1 and { 1 - p) to 0. Such samples will be referred to as the indicato rs of n Bernoulli rn·ats with probability ofsuccess p. References
• l-
i '
Gnedenko ( 1967) Chapter 4, Sections 23, 24 Grimmett and Stirzaker (1992) Sections 3.2, 4.2 Hoe!, Port, and Stone (1971) Section 3.4 Parzen (1960) Chapter 7, Sections 6, 7 Pitman (1993) Sections 2.5, 5.3
A.lO
THE EXPECTATION O F A RANDOM VARIABLE
Let X be the height of an individual sampled at random from a finite population. Then a reasonable measure of the center of the distribution of X is the average height of an individual in the given population. If x_1, ..., x_q are the only heights present in the population, it follows that this average is given by Σ_{i=1}^q x_i P[X = x_i], where P[X = x_i] is just the proportion of individuals of height x_i in the population. The same quantity arises (approximately) if we use the long-run frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. In line with these ideas we develop the general concept of expectation as follows.

If X is a nonnegative, discrete random variable with possible values {x_1, x_2, ...}, we define the expectation or mean of X, written E(X), by

E(X) = Σ_{i=1}^∞ x_i p_X(x_i).   (A.10.1)

(Infinity is a possible value of E(X). Take x_i = i, p_X(i) = 1/(i(i + 1)), i = 1, 2, ....)

A.10.2 More generally, if X is discrete, decompose {x_1, x_2, ...} into two sets A and B where A consists of all nonnegative x_i and B of all negative x_i. If either Σ_{x_i ∈ A} x_i p_X(x_i) < ∞
or Σ_{x_i ∈ B} (−x_i) p_X(x_i) < ∞, we define E(X) unambiguously by (A.10.1). Otherwise, we leave E(X) undefined.

Here are some properties of the expectation that hold when X is discrete. If X is a constant, X(ω) = c for all ω, then

E(X) = c.   (A.10.3)

If X = 1_A (cf. (A.9.9)), then

E(X) = P(A).   (A.10.4)

If X is an n-dimensional random vector, if g is a real-valued function on R^n, and if E(|g(X)|) < ∞, then it may be shown that

E(g(X)) = Σ_{i=1}^∞ g(x_i) p_X(x_i).   (A.10.5)

As a consequence of this result, we have

E(|X|) = Σ_{i=1}^∞ |x_i| p_X(x_i).   (A.10.6)

Taking g(x) = Σ_{i=1}^n α_i x_i we obtain the fundamental relationship

E(Σ_{i=1}^n α_i X_i) = Σ_{i=1}^n α_i E(X_i)   (A.10.7)

if α_1, ..., α_n are constants and E(|X_i|) < ∞, i = 1, ..., n.

A.10.8 From (A.10.7) it follows that if X ≤ Y and E(X), E(Y) are defined, then E(X) ≤ E(Y).

If X is a continuous random variable, it is natural to attempt a definition of the expectation via approximation from the discrete case. Those familiar with Lebesgue integration will realize that this leads to

E(X) = ∫_{-∞}^∞ x p_X(x) dx   (A.10.9)

as the definition of the expectation or mean of X whenever ∫_0^∞ x p_X(x) dx or ∫_{-∞}^0 (−x) p_X(x) dx is finite. Otherwise, E(X) is left undefined.
A.10.10 A random variable X is said to be integrable if E(|X|) < ∞. It may be shown that if X is a continuous k-dimensional random vector and g(X) is any random variable such that

∫_{R^k} |g(x)| p_X(x) dx < ∞,

then E(g(X)) exists and

E(g(X)) = ∫_{R^k} g(x) p_X(x) dx.   (A.10.11)
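As a numerical illustration of (A.10.11) (a minimal Python sketch, assuming NumPy and SciPy are available), take X ~ U(0, 1) and g(x) = x², for which E(g(X)) = 1/3; the integral and a Monte Carlo average agree.

```python
# Check E(g(X)) = integral of g(x) p_X(x) dx for X ~ U(0,1), g(x) = x^2.
import numpy as np
from scipy import integrate

g = lambda x: x ** 2
p_X = lambda x: 1.0                      # density of U(0, 1) on its support

integral, _ = integrate.quad(lambda x: g(x) * p_X(x), 0.0, 1.0)
rng = np.random.default_rng(0)
monte_carlo = g(rng.uniform(0.0, 1.0, size=1_000_000)).mean()

print(integral, monte_carlo)             # both approximately 1/3
```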
In the continuous case expectation properties (A.10.3), (A.10.4), (A.10.7), and (A.10.8) as well as continuous analogues of (A.10.5) and (A.10.6) hold. It is possible to define the expectation of a random variable in general using discrete approximations. The interested reader may consult an advanced text such as Chung (1974), Chapter 3.

The formulae (A.10.5) and (A.10.11) are both sometimes written as

E(g(X)) = ∫_{R^k} g(x) dF(x) or ∫_{R^k} g(x) dP(x)   (A.10.12)

where F denotes the distribution function of X and P is the probability function of X defined by (A.8.5). A convenient notation is dP(x) = p(x) dμ(x), which means

∫ g(x) dP(x) = Σ_{i=1}^∞ g(x_i) p(x_i) in the discrete case
             = ∫_{-∞}^∞ g(x) p(x) dx in the continuous case.   (A.10.13)

We refer to μ = μ_P as the dominating measure for P. In the discrete case μ assigns weight one to each of the points in {x : p(x) > 0} and it is called counting measure. In the continuous case dμ(x) = dx and μ is called Lebesgue measure. We will often refer to p(x) as the density of X in the discrete case as well as the continuous case.

References
Chung (1974) Chapter 3
Gnedenko (1967) Chapter 5, Section 26
Grimmett and Stirzaker (1992) Sections 3.3, 4.3
Hoel, Port, and Stone (1971) Sections 4.1, 7.1
Parzen (1960) Chapter 5; Chapter 8, Sections 1-4
Pitman (1993) Sections 3.3, 3.4, 4.1
A.11 MOMENTS
A.11.1 If k is any natural number and X is a random variable, the kth moment of X is defined to be the expectation of X^k. We assume that all moments written here exist. By (A.10.5) and (A.10.11),

E(X^k) = Σ_x x^k p_X(x) if X is discrete
       = ∫_{-∞}^∞ x^k p_X(x) dx if X is continuous.   (A.11.2)

In general, the moments depend on the distribution of X only.
A.11.3 The distribution of a random variable is typically uniquely specified by its moments. This is the case, for example, if the random variable possesses a moment generating function (cf. (A.12.1)).
A.11.4 The kth central moment of X is by definition E[(X − E(X))^k], the kth moment of (X − E(X)), and is denoted by μ_k.

A.11.5 The second central moment is called the variance of X and will be written Var X. The nonnegative square root of Var X is called the standard deviation of X. The standard deviation measures the spread of the distribution of X about its expectation. It is also called a measure of scale. Another measure of the same type is E(|X − E(X)|), which is often referred to as the mean deviation. The variance of X is finite if and only if the second moment of X is finite (cf. (A.11.15)). If a and b are constants, then by (A.10.7)

Var(aX + b) = a² Var X.   (A.11.6)

(One side of the equation exists if and only if the other does.)
A.11.7 If X is any random variable with well-defined (finite) mean and variance, the standardized version or Z-score of X is the random variable Z = (X − E(X))/√Var X. By (A.10.7) and (A.11.6) it follows then that

E(Z) = 0 and Var Z = 1.   (A.11.8)

A.11.9 If E(X²) = 0, then X = 0. If Var X = 0, X = E(X) (a constant). These results follow, for instance, from (A.15.2).

A.11.10 The third and fourth central moments are used in the coefficient of skewness γ_1 and the kurtosis γ_2, which are defined by

γ_1 = μ_3/σ³,  γ_2 = μ_4/σ⁴ − 3,

where σ² = Var X. See also Section A.12, where γ_1 and γ_2 are expressed in terms of cumulants. These descriptive measures are useful in comparing the shapes of various frequently used densities.
A.11.11 If Y = a + bX with b > 0, then the coefficient of skewness and the kurtosis of Y are the same as those of X. If X ~ N(μ, σ²), then γ_1 = γ_2 = 0.

A.11.12 It is possible to generalize the notion of moments to random vectors. For simplicity we consider the case k = 2. If X_1 and X_2 are random variables and i, j are natural numbers, then the product moment of order (i, j) of X_1 and X_2 is, by definition, E(X_1^i X_2^j). The central product moment of order (i, j) of X_1 and X_2 is again by definition E[(X_1 − E(X_1))^i (X_2 − E(X_2))^j]. The central product moment of order (1, 1) is
called the covariance of X_1 and X_2 and is written Cov(X_1, X_2). By expanding the product (X_1 − E(X_1))(X_2 − E(X_2)) and using (A.10.3) and (A.10.7), we obtain the relations

Cov(aX_1 + bX_2, cX_3 + dX_4) = ac Cov(X_1, X_3) + bc Cov(X_2, X_3) + ad Cov(X_1, X_4) + bd Cov(X_2, X_4)   (A.11.13)

and

Cov(X_1, X_2) = E(X_1 X_2) − E(X_1)E(X_2).   (A.11.14)

If X_1' and X_2' are distributed as X_1 and X_2 and are independent of each other and of (X_1, X_2), then Cov(X_1, X_2) = E[(X_1 − X_1')(X_2 − X_2')].

If we put X_1 = X_2 = X in (A.11.14), we get the formula

Var X = E(X²) − [E(X)]².   (A.11.15)

The covariance is defined whenever X_1 and X_2 have finite variances and in that case

[Cov(X_1, X_2)]² ≤ Var X_1 Var X_2   (A.11.16)

with equality holding if and only if (1) X_1 or X_2 is a constant or (2) (X_1 − E(X_1)) = [Cov(X_1, X_2)/Var X_2](X_2 − E(X_2)). This is the correlation inequality. It may be obtained from the Cauchy–Schwarz inequality,

[E(Z_1 Z_2)]² ≤ E(Z_1²) E(Z_2²)   (A.11.17)

for any two random variables Z_1, Z_2 such that E(Z_1²) < ∞, E(Z_2²) < ∞. Equality holds if and only if one of Z_1, Z_2 equals 0 or Z_1 = aZ_2 for some constant a. The correlation inequality corresponds to the special case Z_1 = X_1 − E(X_1), Z_2 = X_2 − E(X_2). A proof of the Cauchy–Schwarz inequality is given in Remark 1.4.1.

The correlation of X_1 and X_2, denoted by Corr(X_1, X_2), is defined whenever X_1 and X_2 are not constant and the variances of X_1 and X_2 are finite by

Corr(X_1, X_2) = Cov(X_1, X_2)/√((Var X_1)(Var X_2)).   (A.11.18)

The correlation of X_1 and X_2 is the covariance of the standardized versions of X_1 and X_2. The correlation inequality is equivalent to the statement

|Corr(X_1, X_2)| ≤ 1.   (A.11.19)

Equality holds if and only if X_2 is a linear function (X_2 = a + bX_1, b ≠ 0) of X_1.
If X_1, ..., X_n have finite variances, we obtain as a consequence of (A.11.13) the relation

Var(X_1 + ··· + X_n) = Σ_{i=1}^n Var X_i + 2 Σ_{i<j} Cov(X_i, X_j).   (A.11.20)

If X_1 and X_2 are independent and X_1 and X_2 are integrable, then

E(X_1 X_2) = E(X_1)E(X_2)   (A.11.21)

or, in view of (A.11.14),

Cov(X_1, X_2) = 0, Corr(X_1, X_2) = 0 when Var(X_i) > 0, i = 1, 2.   (A.11.22)

This may be checked directly. It is not true in general that X_1 and X_2 that satisfy (A.11.22) (i.e., are uncorrelated) need be independent. The correlation coefficient roughly measures the amount and sign of linear relationship between X_1 and X_2. It is −1 or 1 in the case of perfect relationship (X_2 = a + bX_1, b < 0 or b > 0, respectively). See also Section 1.4.

As a consequence of (A.11.22) and (A.11.20), we see that if X_1, ..., X_n are independent with finite variances, then

Var(X_1 + ··· + X_n) = Σ_{i=1}^n Var X_i.   (A.11.23)
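As a quick numerical sketch (assuming NumPy; the choice of distributions here is illustrative only), one can check (A.11.15) and the correlation inequality (A.11.19) on simulated data.

```python
# Empirical check of Var X = E(X^2) - [E(X)]^2 and |Corr| <= 1.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.exponential(scale=2.0, size=200_000)
x2 = 0.5 * x1 + rng.normal(size=200_000)          # linearly related plus noise

var_direct = x1.var()
var_via_moments = (x1 ** 2).mean() - x1.mean() ** 2   # (A.11.15)
corr = np.corrcoef(x1, x2)[0, 1]                      # must lie in [-1, 1]

print(var_direct, var_via_moments, corr)
```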
References
Gnedenko (1967) Chapter 5, Sections 27, 28, 30
Hoel, Port, and Stone (1971) Sections 4.2-4.5, 7.3
Parzen (1960) Chapter 5; Chapter 8, Sections 1-4
Pitman (1993) Section 6.4
A.12 MOMENT AND CUMULANT GENERATING FUNCTIONS
A.12.1 If E(e^{s_0|X|}) < ∞ for some s_0 > 0, M_X(s) = E(e^{sX}) is well defined for |s| ≤ s_0 and is called the moment generating function of X. By (A.10.5) and (A.10.11),

M_X(s) = Σ_{i=1}^∞ e^{s x_i} p_X(x_i) if X is discrete
       = ∫_{-∞}^∞ e^{sx} p_X(x) dx if X is continuous.   (A.12.2)

If M_X is well defined in a neighborhood {s : |s| ≤ s_0} of zero, all moments of X are finite and

M_X(s) = Σ_{k=0}^∞ E(X^k) s^k / k!,  |s| ≤ s_0.   (A.12.3)
A.12.4 The moment generating function M_X has derivatives of all orders at s = 0 and

d^k/ds^k M_X(s)|_{s=0} = E(X^k).
A.12.5 If defined, M_X determines the distribution of X uniquely and is itself uniquely determined by the distribution of X. If X_1, ..., X_n are independent random variables with moment generating functions M_{X_1}, ..., M_{X_n}, then X_1 + ··· + X_n has moment generating function given by

M_{X_1 + ··· + X_n}(s) = Π_{i=1}^n M_{X_i}(s).   (A.12.6)
This follows by induction from the definition and (A.11.21). For a generalization of the notion of moment generating function to random vectors, see Section B.5.

The function

K_X(s) = log M_X(s)   (A.12.7)

is called the cumulant generating function of X. If M_X is well defined in some neighborhood of zero, K_X can be represented by the convergent Taylor expansion

K_X(s) = Σ_{j=0}^∞ (c_j / j!) s^j,   (A.12.8)

where

c_j = c_j(X) = d^j/ds^j K_X(s)|_{s=0}   (A.12.9)

is called the jth cumulant of X, j ≥ 1. For j ≥ 2 and any constant a, c_j(X + a) = c_j(X). If X and Y are independent, then c_j(X + Y) = c_j(X) + c_j(Y). The first cumulant c_1 is the mean μ of X, c_2 and c_3 equal the second and third central moments μ_2 and μ_3 of X, and c_4 = μ_4 − 3μ_2². The coefficients of skewness and kurtosis (see (A.11.10)) can be written as γ_1 = c_3/c_2^{3/2} and γ_2 = c_4/c_2². If X is normally distributed, c_j = 0 for j ≥ 3. See Problem B.3.8.
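As an illustrative sketch (assuming NumPy and SciPy), the cumulant relations above can be checked for the Poisson distribution, all of whose cumulants equal λ, so that γ_1 = λ^{-1/2} and γ_2 = 1/λ.

```python
# Compare simulated skewness/kurtosis of Poisson(lam) with the cumulant formulas.
import numpy as np
from scipy import stats

lam = 4.0
x = np.random.default_rng(2).poisson(lam, size=1_000_000)

print(stats.skew(x), lam ** -0.5)        # both about 0.5
print(stats.kurtosis(x), 1.0 / lam)      # Fisher (excess) kurtosis, both about 0.25
```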
References
Hoel, Port, and Stone (1971) Chapter 8, Section 8.1
Parzen (1960) Chapter 5, Section 3; Chapter 8, Sections 2-3
Rao (1973) Section 2b.4
A.13 SOME CLASSICAL DISCRETE AND CONTINUOUS DISTRIBUTIONS
By definition, the probability distribution of a random variable or vector is just a probability measure on a suitable Euclidean space. In this section we introduce certain families of distributions, which arise frequently in probability and statistics, and list some of their properties. Following the name of each distribution we give a shorthand notation that will sometimes be used, as will obvious abbreviations such as "binomial (n, θ)" for "the binomial distribution with parameter (n, θ)". The symbol p as usual stands for a frequency or density function. If anywhere below p is not specified explicitly for some value of x, it shall be assumed that p vanishes at that point. Similarly, if the value of the distribution function F is not specified outside some set, it is assumed to be zero to the "left" of the set and one to the "right" of the set.
I. Discrete Distributions

The binomial distribution with parameters n and θ: B(n, θ).

p(k) = C(n, k) θ^k (1 − θ)^{n−k},  k = 0, 1, ..., n.   (A.13.1)

The parameter n can be any integer > 0 whereas θ may be any number in [0, 1].
A.13.2 If X is the total number of successes obtained in n Bernoulli trials with probability of success θ, then X has a B(n, θ) distribution (see (A.6.3)).

If X has a B(n, θ) distribution, then

E(X) = nθ, Var X = nθ(1 − θ).   (A.13.3)

Higher-order moments may be computed from the moment generating function

M_X(t) = [θe^t + (1 − θ)]^n.   (A.13.4)

A.13.5 If X_1, X_2, ..., X_k are independent random variables distributed as B(n_1, θ), B(n_2, θ), ..., B(n_k, θ), respectively, then X_1 + X_2 + ··· + X_k has a B(n_1 + ··· + n_k, θ) distribution. This result may be derived by using (A.12.5) and (A.12.6) in conjunction with (A.13.4).
p(k)
�
(A.I3.6)
for k a natural number with max(O, n - (N - D)) and n may be any natural nwnbers that
1t(D, N,n).
are
:S
k < min(n, D). The parameters D
less than or equal to the natural number N.
A.13.7 If X is the number of defectives (special objects) in a sample of size n taken without
X has If the sample is taken with replacement, X has
replacement from a population with D defectives and N - D nondefectives, then
H( D, N, n) distribution (see (A.6. I 0)). a B(n DfN) distribution. an
,
' •
462
A Review of Basic Probability Theory
If X has an H(D, N, n) distribution, then
D E(X) = n ' Var X = N
D nN
(
D !N
)
N-n N_ 1.
Appendix A
• •
7
The Poisson distribution with parameter .\ : P(.\).
e-'>.k p(k) = k!
(A.I3.9)
for k = 0, 1, 2, . . . The parameter >.. can be any positive number. If X has a P(.\) distribution, then .
=
Var X
= .\.
(A. 13.10)
The moment generating function of X is given by
1
Mx(t) = e'<,'- ) .
(A. 1 3 . 1 1 )
If X1, X2, . . . , Xn are independent random variables with P(.\1), P( .\2 ) , . . . , P .\n) distributions, respectively, then X1 + X2 + -i-Xn has the P(.\1 + .\2 + · · · + .\n) distribution. This result may be derived in th same manner as the corresponding fact for the binomial distribution.
A.l3.12
(
The multinomial distribution with parameters n, B1,
.
•
•
,
Bq : M(n, 01,
•
•
.
, Oq)•
n! k (A. l 3 . 1 3) p(k l , · · · , kq ) = k ! . . . kq !(}kl l ' ' • (jqq l whenever ki are nonnegative integers such that 'L.? 1 ki = n. The par�meter n is any natural number while (01, , Oq) is any vector in q q, 2:)• = 1 e = (&, . . e,) : e, > o, i=l •
.
.
.
1
.
A.13.14 If X = (Xt, . . . , Xq ) , where Xi is the number of times outcome wi occurs in n '
, Oq). then X has a M(n, ()11
multinomial trials with probabilities (01, . . . tion (see (A.6.6)). If X has a M (n, 01, . . , B,) distribution,
.
.
•
, Bq) distribu
.
If X has a M(n, θ_1, ..., θ_q) distribution, then

E(X_i) = nθ_i, Var X_i = nθ_i(1 − θ_i), Cov(X_i, X_j) = −nθ_iθ_j, i ≠ j, i, j = 1, ..., q.   (A.13.15)

These results may either be derived directly or by a representation such as that discussed in (A.13.8) and an application of formulas (A.10.4), (A.10.7), (A.13.13), and (A.11.20).
A.13.16 If X has a M(n, θ_1, ..., θ_q) distribution, then (X_{i_1}, ..., X_{i_s}, n − Σ_{j=1}^s X_{i_j}) has a M(n, θ_{i_1}, ..., θ_{i_s}, 1 − Σ_{j=1}^s θ_{i_j}) distribution for any set {i_1, ..., i_s} ⊂ {1, ..., q}. Therefore, X_j has a B(n, θ_j) distribution for each j and, more generally, Σ_{j=1}^s X_{i_j} has a B(n, Σ_{j=1}^s θ_{i_j}) distribution if s < q. These remarks follow from the interpretation (A.13.14).
II. Continuous Distributions

Before beginning our listing we introduce some convenient notations: X ~ F and X ~ p will similarly mean that X is a random variable with d.f. F, density or frequency function p.

Let Y be a random variable with d.f. F. Let F_μ be the d.f. of Y + μ. The family F_L = {F_μ : −∞ < μ < ∞} is called a location parameter family, μ is called a location parameter, and we say that Y generates F_L. By definition, for any μ, X ~ F_μ ⇔ X − μ ~ F. Therefore, for any μ, γ,

F_μ(x) = F(x − μ) = F_γ(x − (μ − γ)),

and all calculations involving F_μ can be referred back to F or any other member of the family. Similarly, if Y generates F_L, so does Y + γ for any fixed γ. If Y has a first moment, it follows that we may without loss of generality (as far as generating F_L goes) assume that E(Y) = 0. Then if X ~ F_μ, E(X) = μ.

Similarly, let F_σ^s be the d.f. of σY, σ > 0. The family F_S = {F_σ^s : σ > 0} is called a scale parameter family, σ is a scale parameter, and Y is said to generate F_S. By definition, for any σ > 0, X ~ F_σ^s ⇔ X/σ ~ F. Again all calculations involving one member of the family can be referred back to any other because for any σ, τ > 0,

F_σ^s(x) = F_τ^s(τx/σ).

If Y generates F_S and Y has a first moment different from 0, we may without loss of generality take E(Y) = 1 and, hence, if X ~ F_σ^s, then E(X) = σ. Alternatively, if Y has a second moment, we may select F as being the unique member of the family F_S having Var Y = 1, and then X ~ F_σ^s ⇒ Var X = σ².

Finally, define F_{μ,σ} as the d.f. of σY + μ. The family F_{L,S} = {F_{μ,σ} : −∞ < μ < ∞, σ > 0} is called a location-scale parameter family, μ is called a location parameter and σ a scale parameter, and Y is said to generate F_{L,S}. From

F_{μ,σ}(x) = F((x − μ)/σ) = F_{γ,τ}(τ(x − μ)/σ + γ),

we see as before how to refer calculations involving one member of the family back to any other. Without loss of generality, if Y has a second moment, we may take E(Y) = 0, Var Y = 1. Then if X ~ F_{μ,σ}, we obtain

E(X) = μ, Var X = σ².

Clearly F_μ = F_{μ,1}, F_σ^s = F_{0,σ}. The relation between the density of F_{μ,σ} and that of F is given by (A.8.10). All the families of densities we now define are location-scale or scale families.
(AI3.17)
The parameter J.l can be any real number while a is positive. The normal distribution with 1 is known as the standard normal distribution. Its density will be denoted 11 = 0 and by \P(z) and its d.f. by
a=
A.13.18 The family of N(JJ., <72) distributions is a location-scale family. If Z has a N(O, 1) distribution, then Z + Jl. has a N(JJ., <72 ) distribution, and conversely if has a N(JJ., <72 )
<7 distribution, then (X - JJ.)/<7 has a standard normal distribution. If X has a N(JJ., <72) distribution, then
X
(A.I3.19)
{ <7 t 2 2 Mx(t) = exp pt 2 }
More generally, all moments may be obtained from
+
for -oo <
t
<
oo.
a2 = 1, then k t 2 [ (2 )! l k Mx(t) = L=O z•k! (2k)' k
'
(AI3.20)
In particular if {L = 0,
=
.
,
,
(A I 3 .2 1 )
A.13.23 If X., . . , Xn
0
z•f2(k/2)!
if k > 0 is even.
1
, '
are
(AI3.22)
. • .
j
•
'
= JJ . ; , E(X;) CiXi
independent normal random variables such that has a Var Xi = a�, and c1, , Cn are any constants that are not all 0. then E� 1 N(ctJJ.t + · · · + CnJl.n o ci
i
I
ifk > O is odd k!
'
.
•
and, hence, in this case we can conclude from (A.l2.4) that
E(Xk) E(Xk)
'
I
1
'
'
!
'
,
•
'
The exponential distribution with parameter λ: E(λ).

p(x) = λe^{−λx}, x > 0.   (A.13.24)

The range of λ is (0, ∞). The distribution function corresponding to this p is given by

F(x) = 1 − e^{−λx} for x > 0.   (A.13.25)

A.13.26 If σ = 1/λ, then σ is a scale parameter. E(1) is called the standard exponential distribution.

If X has an E(λ) distribution,

E(X) = 1/λ, Var X = 1/λ².   (A.13.27)

More generally, all moments may be obtained from

M_X(t) = 1/(1 − (t/λ)) = Σ_{k=0}^∞ (k!/λ^k) t^k/k!   (A.13.28)

which is well defined for t < λ. Further information about the exponential distribution may be found in Appendix B.
The uniform distribution on (a, b): U(a, b).

p(x) = 1/(b − a), a < x < b,   (A.13.29)

where (a, b) is any pair of real numbers such that a < b. The corresponding distribution function is given by

F(x) = (x − a)/(b − a) for a < x < b.   (A.13.30)

If X has a U(a, b) distribution, then

E(X) = (a + b)/2, Var X = (b − a)²/12.   (A.13.31)

A.13.32 If we set μ = a, σ = (b − a), then we can check that the U(a, b) family is a location-scale family generated by Y, where Y ~ U(0, 1).
References
Gnedenko (1967) Chapter 4, Sections 21-24; Chapter 5, Sections 26-28, 30
Hoel, Port, and Stone (1971) Sections 3.4.1, 5.3.1, 5.3.2
Parzen (1962) Chapter 4, Sections 4-6; Chapter 5; Chapter 6
Pitman (1993) pages 475-487
A.14 MODES OF CONVERGENCE OF RANDOM VARIABLES AND LIMIT THEOREMS

Much of probability theory can be viewed as variations, extensions, and generalizations of two basic results, the central limit theorem and the law of large numbers. Both of these theorems deal with the limiting behavior of sequences of random variables. The notions of limit that are involved are the subject of this section. All limits in the section are as n → ∞.
A.14.1 We say that the sequence of random variables {Z_n} converges to the random variable Z in probability, and write Z_n →^P Z, if P[|Z_n − Z| ≥ ε] → 0 as n → ∞ for every ε > 0. That is, Z_n →^P Z if the chance that Z_n and Z differ by any given amount is negligible for n large enough.

A.14.2 We say that the sequence {Z_n} converges in law (in distribution) to Z, and write Z_n →^L Z, if F_{Z_n}(t) → F_Z(t) for every point t such that F_Z is continuous at t. (Recall that F_Z is continuous at t if and only if P[Z = t] = 0 (A.7.17).) This is the mode of convergence needed for approximation of one distribution by another.

(A.14.3) If Z_n →^P Z, then Z_n →^L Z.

Because convergence in law requires nothing of the joint distribution of the Z_n and Z whereas convergence in probability does, it is not surprising and easy to show that, in general, convergence in law does not imply convergence in probability (e.g., Chung, 1974), but consider the following.

A.14.4 If Z = z_0 (a constant), convergence in law of {Z_n} to Z implies convergence in probability.

Proof. Note that z_0 ± ε are points of continuity of F_Z for every ε > 0. Then

P[|Z_n − z_0| > ε] = 1 − P(Z_n ≤ z_0 + ε) + P(Z_n < z_0 − ε) ≤ 1 − F_{Z_n}(z_0 + ε/2) + F_{Z_n}(z_0 − ε).   (A.14.5)

By assumption the right-hand side of (A.14.5) converges to (1 − F_Z(z_0 + ε/2)) + F_Z(z_0 − ε) = 0.  □

A.14.6 If Z_n →^P z_0 (a constant) and g is continuous at z_0, then g(Z_n) →^P g(z_0).

Proof. If ε is positive, there exists a δ such that |z − z_0| < δ implies |g(z) − g(z_0)| < ε. Therefore,

P[|g(Z_n) − g(z_0)| < ε] ≥ P[|Z_n − z_0| < δ] ≥ 1 − P[|Z_n − z_0| ≥ δ].   (A.14.7)

Because the right-hand side of (A.14.7) converges to 1, by the definition (A.14.1) the result follows.  □
A more general result is given by the following.

A.14.8 If Z_n →^L Z and g is continuous, then g(Z_n) →^L g(Z).

The following theorem due to Slutsky will be used repeatedly.

Theorem A.14.9 If Z_n →^L Z and U_n →^P u_0 (a constant), then
(a) Z_n + U_n →^L Z + u_0,
(b) U_n Z_n →^L u_0 Z.

Proof. We prove (a). The other claim follows similarly. Begin by writing

F_{(Z_n + U_n)}(t) = P[Z_n + U_n ≤ t, U_n > u_0 − ε] + P[Z_n + U_n ≤ t, U_n ≤ u_0 − ε].   (A.14.10)

Let t be a point of continuity of F_{(Z + u_0)}. Because a distribution function has at most countably many points of discontinuity, we may for any t choose ε positive and arbitrarily small such that t ± ε are both points of continuity of F_{(Z + u_0)}. Now by (A.14.10),

F_{(Z_n + U_n)}(t) ≤ P[Z_n ≤ t − u_0 + ε] + P[|U_n − u_0| > ε].   (A.14.11)

Moreover,

P[Z_n ≤ t − u_0] = F_{(Z_n + u_0)}(t).   (A.14.12)

Because F_{Z_n + u_0}(t) = F_{Z_n}(t − u_0), we must have Z_n + u_0 →^L Z + u_0. Thus,

lim sup_n F_{(Z_n + U_n)}(t) ≤ lim_n F_{(Z_n + u_0)}(t + ε) + lim_n P[|U_n − u_0| > ε] = F_{(Z + u_0)}(t + ε).   (A.14.13)

Similarly,

1 − F_{(Z_n + U_n)}(t) = P[Z_n + U_n > t] ≤ P[Z_n > t − u_0 − ε] + P[|U_n − u_0| > ε],   (A.14.14)

and, hence,

lim inf_n F_{(Z_n + U_n)}(t) ≥ lim_n F_{(Z_n + u_0)}(t − ε) = F_{(Z + u_0)}(t − ε).   (A.14.15)

Therefore,

F_{(Z + u_0)}(t − ε) ≤ lim inf_n F_{(Z_n + U_n)}(t) ≤ lim sup_n F_{(Z_n + U_n)}(t) ≤ F_{(Z + u_0)}(t + ε).   (A.14.16)

Because ε may be permitted to tend to 0 and F_{(Z + u_0)} is continuous at t, the result follows.  □
(A. J4. 1 8)
1
468
Appendix A
A Review of Basic Probability Theory
1
Proof. By Slutsky's theorem
Z,. - b = .2_ [a,.(Z,. - b)] _c, 0 · X = 0. a,.
By (A. l4.4), ] Z,. - b] !':. 0. Now apply the mean value theorem to g(Z,.)
a,.[g(Z,. ) - g(b) ]
(A. l4.19)
- g(b) getting
a,.[g' (Z�) (Z,. - b)]
=
]Z,. - b]. Because ]Z,. - bl !':. 0, so does IZ; - bl and, hence, Z� !':. b. By the continuity of g' and (A.l4.6), g' (Z; ) !':. g' (b). Therefore, applying (A.l4.9) again, o g'(Z�)[a,.(Z,. - b)] _c, g'(b)X. A.14.20 Suppose that { Zn} takes on only natural number values and pz,.. (z) pz (z) for t: all z. Then Zn Z. This is immediate because whatever be z, Fz. (z ) = Lt1 0 Pz. (k) � Lt1 0pz(k) = Fz(z ) where [z] = greatest integer < z. The converse is also true and easy to establish. where
IZ� - bl
<
----+
----+
A.l4.21 (Scheffe) Suppose that { Z,.}, Z are continuous and Pz. (z) � pz(z) for (almost) all z. Then Z,. -". Z (Hajek and Sidak, 1967, p. 64).
::
•
. •
,
l
j
'
References
];
•
Grimmett and Stirzaker ( 1992) Sections 7.1-7.4
A.15 FURTHER LIMIT THEOREMS AND INEQUALITIES
The Bernoulli law of large numbers, which we give next, brings us back to the motivation of our definition of probability. There we noted that in practice the relative frequency of occurrence of an event A in many repetitions of an experiment tends to stabilize and that this "limit" corresponds to what we mean by the probability of A. Now, having defined probability and independence of experiments abstractly, we can prove that, in fact, the relative frequency of occurrence of an event A approaches its probability as the number of repetitions of the experiment increases. Such results and their generalizations are known as laws of large numbers. The first was discovered by Bernoulli in 1713. Its statement is as follows.

Bernoulli's (Weak) Law of Large Numbers

If {S_n} is a sequence of random variables such that S_n has a B(n, p) distribution for n ≥ 1, then

S_n/n →^P p.   (A.15.1)

As in (A.13.2), think of S_n as the number of successes in n binomial trials in which we identify success with occurrence of A and failure with occurrence of A^c. Then S_n/n can be interpreted as the relative frequency of occurrence of A in n independent repetitions of the experiment in which A is an event, and the Bernoulli law is now evidently a statement of the type we wanted.

Bernoulli's proof of this result was rather complicated, and it remained for the Russian mathematician Chebychev to give a two-line argument. His generalization of Bernoulli's result is based on an inequality that has proved to be of the greatest importance in probability and statistics.
Chebychev's Inequality

If X is any random variable, then

P[|X| ≥ a] ≤ E(X²)/a².   (A.15.2)

The Bernoulli law follows readily from (A.15.2) and (A.13.3) via the calculation

P[|S_n/n − p| ≥ ε] ≤ E(S_n/n − p)²/ε² = Var S_n/(n²ε²) = p(1 − p)/(nε²) → 0 as n → ∞.   (A.15.3)
A generalization of (A.15.2), which contains various important and useful inequalities, is the following. Let g be a nonnegative function on R such that g is nondecreasing on the range of a random variable Z. Then

P[Z ≥ a] ≤ E(g(Z))/g(a).   (A.15.4)

If we put Z = |X|, g(t) = t² if t ≥ 0 and 0 otherwise, we get (A.15.2). Other important cases are obtained by taking Z = |X|, g(t) = t if t ≥ 0 and 0 otherwise (Markov's inequality), and Z = X, g(t) = e^{st} for s > 0 and all real t (Bernstein's inequality; see B.8.1 for the binomial case of Bernstein's inequality).

Proof of (A.15.4). Note that by the properties of g,

g(a) 1[Z ≥ a] ≤ g(Z) 1[Z ≥ a] ≤ g(Z).   (A.15.5)

Therefore, by (A.10.8),

g(a) P[Z ≥ a] = E(g(a) 1[Z ≥ a]) ≤ E(g(Z)),   (A.15.6)

which is equivalent to (A.15.4).  □
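A small numerical sketch (assuming NumPy; the exponential population here is only an illustration) of Chebychev's inequality (A.15.2): the empirical frequency of |X| ≥ a never exceeds E(X²)/a².

```python
# Empirical check of P[|X| >= a] <= E(X^2)/a^2.
import numpy as np

x = np.random.default_rng(5).exponential(scale=1.0, size=1_000_000)
second_moment = (x ** 2).mean()

for a in (2.0, 3.0, 5.0):
    print(a, (np.abs(x) >= a).mean(), second_moment / a ** 2)
```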
The following result, which follows from Chebychev's inequality, is a useful generalization of Bernoulli's law.

Khintchin's (Weak) Law of Large Numbers

Let {X_i}, i ≥ 1, be a sequence of independent identically distributed random variables with finite mean μ and define S_n = Σ_{i=1}^n X_i. Then

S_n/n →^P μ.   (A.15.7)

Upon taking the X_i to be indicators of binomial trials, we obtain (A.15.1).
De Moivre-Laplace Theorem

Suppose that {S_n} is a sequence of random variables such that for each n, S_n has a B(n, p) distribution where 0 < p < 1. Then

(S_n − np)/√(np(1 − p)) →^L Z,   (A.15.8)

where Z has a standard normal distribution. That is, the standardized versions of S_n converge in law to a standard normal random variable. If we write

(S_n − np)/√(np(1 − p)) = [√n/√(p(1 − p))] (S_n/n − p)

and use (A.14.9), it is easy to see that (A.15.8) implies (A.15.1).
The De Moivre-Laplace theorem is generalized by the following.

Central Limit Theorem

Let {X_i} be a sequence of independent identically distributed random variables with (common) expectation μ and variance σ² such that 0 < σ² < ∞. Then, if S_n = Σ_{i=1}^n X_i,

(S_n − nμ)/(σ√n) →^L Z,   (A.15.9)

where Z has the standard normal distribution.
The last two results are most commonly used in statistics as approximation theorems. Let k and l be nonnegative integers. The De Moivre-Laplace theorem is used as

P[k ≤ S_n ≤ l] = P[k − ½ ≤ S_n ≤ l + ½]
 = P[(k − np − ½)/√(npq) ≤ (S_n − np)/√(npq) ≤ (l − np + ½)/√(npq)]
 ≈ Φ((l − np + ½)/√(npq)) − Φ((k − np − ½)/√(npq))   (A.15.10)

where q = 1 − p. The ½ appearing in k − ½ and l + ½ is called the continuity correction. We have an excellent idea of how good this approximation is. An illustrative discussion is given in Feller (1968, pp. 187-188). A rule of thumb is that for most purposes the approximation can be used when np and n(1 − p) are both larger than 5. Only when the X_i are integer-valued is the first step of (A.15.10) followed. Otherwise (A.15.9) is applied in the form

P[a ≤ S_n ≤ b] ≈ Φ((b − nμ)/(σ√n)) − Φ((a − nμ)/(σ√n)).
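A numerical sketch (assuming SciPy; the parameters are illustrative) of the normal approximation with the continuity correction in (A.15.10), compared with the exact binomial probability.

```python
# Exact P[k <= S_n <= l] for S_n ~ B(20, 0.4) versus the corrected normal approximation.
from scipy import stats

n, p, k, l = 20, 0.4, 6, 10
q = 1.0 - p
exact = stats.binom.cdf(l, n, p) - stats.binom.cdf(k - 1, n, p)

mu, sd = n * p, (n * p * q) ** 0.5
approx = stats.norm.cdf((l + 0.5 - mu) / sd) - stats.norm.cdf((k - 0.5 - mu) / sd)

print(exact, approx)   # the two values agree closely
```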
The central limit theorem (and some of its generalizations) are also used to justify the assumption that "most" random variables that are measures of numerical characteristics of real populations, such as intelligence, height, weight, and blood pressure, are approximately normally distributed. The argument is that the observed numbers are sums of a large number of small (unobserved) independent factors. That is, each of the characteristic variables is expressible as a sum of a large number of small variables such as influences of particular genes, elements in the diet, and so on. For example, height is a sum of factors corresponding to heredity and environment.

If a bound for E|X_i − μ|³ is known, it is possible to give a theoretical estimate of the error involved in replacing P(S_n ≤ b) by its normal approximation:

Berry-Esseen Theorem

Suppose that X_1, ..., X_n are i.i.d. with mean μ and variance σ² > 0. Then, for all n,

sup_x |P[(S_n − nμ)/(σ√n) ≤ x] − Φ(x)| ≤ C E|X_1 − μ|³/(σ³√n),   (A.15.11)

where C is a universal constant.

For a proof, see Chung (1974, p. 224). In practice, if we need the distribution of S_n we try to calculate it exactly for small values of n and then observe empirically when the approximation can be used with safety. This process of combining a limit theorem with empirical investigations is applicable in many statistical situations where the distributions of transformations g(x) (see A.8.6) of interest become progressively more difficult to compute as the sample size increases and yet tend to stabilize. Examples of this process may be found in Chapter 5.

We conclude this section with two simple limit theorems that lead to approximations of one classical distribution by another. The very simple proofs of these results may, for instance, be found in Gnedenko (1967, p. 53 and p. 105).
A.15.12 The first of these results reflects the intuitively obvious fact that if the populations sampled are large and the samples are comparatively small, sampling with and without replacement leads to approximately the same probability distribution. Specifically, suppose that {X_N} is a sequence of random variables such that X_N has a hypergeometric H(D_N, N, n) distribution where D_N/N → p as N → ∞ and n is fixed. Then

P[X_N = k] → C(n, k) p^k (1 − p)^{n−k}   (A.15.13)

as N → ∞ for k = 0, 1, ..., n. By (A.14.20) we conclude that

X_N →^L X,   (A.15.14)

where X has a B(n, p) distribution. The approximation of the hypergeometric distribution by the binomial distribution indicated by this theorem is rather good. For instance, if N = 50, n = 5, and D = 20, the approximating binomial distribution to H(D, N, n) is B(5, 0.4). If H holds, P[X ≤ 2] = 0.690 while under the approximation, P[X ≤ 2] = 0.683. As indicated in this example, the approximation is reasonable when (n/N) ≤ 0.1.
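The N = 50, n = 5, D = 20 figures above can be reproduced directly (a sketch assuming SciPy).

```python
# Exact hypergeometric P[X <= 2] versus its B(5, 0.4) approximation.
from scipy import stats

# scipy's hypergeom takes (M, n, N) = (population size, number of "defectives", sample size)
exact = stats.hypergeom(50, 20, 5).cdf(2)
approx = stats.binom(5, 0.4).cdf(2)

print(round(exact, 3), round(approx, 3))   # 0.690 and 0.683
```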
The next elementary result, due to Poisson, plays an important role in advanced probability theory.

Poisson's Theorem

Suppose that {X_n} is a sequence of random variables such that X_n has a B(n, p_n) distribution and np_n → λ as n → ∞, where 0 < λ < ∞. Then

P[X_n = k] → e^{−λ} λ^k / k!   (A.15.15)

for k = 0, 1, 2, ... as n → ∞. By (A.14.20) it follows that X_n →^L X, where X has a P(λ) distribution. This theorem suggests that we approximate the B(n, p) distribution by the P(np) distribution. Tables 3 on p. 108 and 2 on p. 154 of Feller (1968) indicate the excellence of the approximation when p is small and np is moderate. It may be shown that the error committed is always bounded by np².
References
Gnedenko (1967) Chapter 2, Section 13; Chapter 6, Section 32; Chapter 8, Section 42
Hoel, Port, and Stone (1971) Chapter 3, Section 3.4.2
Parzen (1960) Chapter 5, Sections 4, 5; Chapter 6, Section 2; Chapter 10, Section 2
A.16 POISSON PROCESS

A.16.1 A Poisson process with parameter λ is a collection of random variables {N(t)}, t ≥ 0, such that

(i) N(t) has a P(λt) distribution for each t.
(ii) N(t + h) − N(t) is independent of N(s) for all s ≤ t, h > 0, and has a P(λh) distribution.
'
i
j '
l •
Poisson processes are frequently applicable when we study phenomena involving events that occur "rarely" in small time intervals. For example, if N(t) is the number of disintegrations of a fixed amount of some radioactive substance in the period from time 0 to time t, then {N(t)} is a Poisson process. The numbers N(t) of "customers" (people, machines, etc.) arriving at a service counter from time 0 to time t are sometimes well approximated by a Poisson process, as is the number of people who visit a Web site from time 0 to t. Many interesting examples are discussed in the books of Feller (1968), Parzen (1962), and Karlin (1969). In each of the preceding examples of a Poisson process, N(t) represents the number of times an "event" (radioactive disintegration, arrival of a customer) has occurred in the time from 0 to t. We use the word event here for lack of a better one because these are not events in terms of the probability model on which the N(t) are defined. If we keep temporarily to this notion of event as a recurrent phenomenon that is randomly determined in some fashion and define N(t) as the number of events occurring between time 0 and time t, we can ask under what circumstances {N(t)} will form a Poisson process.
A.16.2 Formally, let {N(t)}, t ≥ 0, be a collection of natural number valued random variables. It turns out that {N(t)} is a Poisson process with parameter λ if and only if the following conditions hold:

(a) N(t + h) − N(t) is independent of N(s), s ≤ t, for h > 0,
(b) N(t + h) − N(t) has the same distribution as N(h) for h > 0,
(c) P[N(h) = 1] = λh + o(h), and
(d) P[N(h) > 1] = o(h).

(The quantity o(h) is such that o(h)/h → 0 as h → 0.) Physically, these assumptions may be interpreted as follows.

(i) The time of recurrence of the "event" is unaffected by past occurrences.
(ii) The distribution of the number of occurrences of the "event" depends only on the length of the time for which we observe the process.
(iii) and (iv) The chance of any occurrence in a given time period goes to 0 as the period shrinks, and having only one occurrence becomes far more likely than multiple occurrences.

This assertion may be proved as follows. Fix t and divide [0, t] into n intervals [0, t/n], (t/n, 2t/n], ..., ((n − 1)t/n, t]. Let I_{jn} be the indicator of the event [N(jt/n) − N((j − 1)t/n) ≥ 1] and define N_n(t) = Σ_{j=1}^n I_{jn}. Then N_n(t) differs from N(t) only insofar as multiple occurrences in one of the small subintervals are only counted as one occurrence. By (a) and (b), N_n(t) has a B(n, P[N(t/n) ≥ 1]) distribution. From (c) and (d) and Theorem (A.15.15) we see that N_n(t) →^L Z, where Z has a P(λt) distribution. On the other hand,
P[|N_n(t) − N(t)| > ε] ≤ P[N_n(t) ≠ N(t)]
 ≤ P[∪_{j=1}^n {N(jt/n) − N((j − 1)t/n) > 1}]
 ≤ Σ_{j=1}^n P[N(jt/n) − N((j − 1)t/n) > 1]
 = n P[N(t/n) > 1] = n o(t/n) → 0 as n → ∞.   (A.16.3)

The first of the inequalities in (A.16.3) is obvious, the second says that if N_n(t) ≠ N(t) there must have been a multiple occurrence in a small subinterval, the third is just (A.2.5), and the remaining identities follow from (b) and (d). The claim now follows from (A.16.3) and Slutsky's theorem (A.14.9) upon writing
N(t) = N_n(t) + (N(t) − N_n(t)).

A.16.4 Let T_1 be the time at which the "event" first occurs in a Poisson process (the first t such that N(t) = 1), T_2 be the time at which the "event" occurs for the second time, and so on. Then T_1, T_2 − T_1, ..., T_n − T_{n−1}, ... are independent, identically distributed E(λ) random variables.
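A simulation sketch (assuming NumPy) of A.16.4: summing i.i.d. E(λ) interarrival times and counting the arrivals in [0, t] reproduces a P(λt) count.

```python
# Build N(t) from exponential gaps and check its mean and variance equal lam * t.
import numpy as np

rng = np.random.default_rng(6)
lam, t, reps = 2.0, 3.0, 100_000

gaps = rng.exponential(1.0 / lam, size=(reps, 40))   # 40 gaps is ample for t = 3
arrival_times = gaps.cumsum(axis=1)
counts = (arrival_times <= t).sum(axis=1)            # N(t) for each replication

print(counts.mean(), counts.var())                   # both approximately lam * t = 6
```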
References
Gnedenko (1967) Chapter 10, Section 51
Grimmett and Stirzaker (1992) Section 6.8
Hoel, Port, and Stone (1971) Section 9.3
Parzen (1962) Chapter 6, Section 5
Pitman (1993) Sections 3.5, 4.2
A.17 NOTES

Notes for Section A.5
(1) We define A to be the smallest sigma field that has every set of the form A_1 × ··· × A_n with A_i ∈ A_i, 1 ≤ i ≤ n, as a member.

Notes for Section A.7
(1) Strictly speaking, the density is only defined up to a set of Lebesgue measure 0.
(2) We shall use the notation g(x + 0) for lim_{x_n ↓ x} g(x_n) and g(x − 0) for lim_{x_n ↑ x} g(x_n) for a function g of a real variable that possesses such limits.

Notes for Section A.8
(1) The requirement on the sets X^{-1}(B) is purely technical. It is no restriction in the discrete case and is satisfied by any function of interest when Ω is R^k or a subset of R^k. Sets B that are members of B^k are called measurable. When considering subsets of R^k, we will assume automatically that they are measurable.
(2) Such functions g are called measurable. This condition ensures that g(X) satisfies definitions (A.8.1) and (A.8.2). For convenience, when we refer to functions we shall assume automatically that this condition is satisfied.
(3) A function g is said to be one to one if g(x) = g(y) implies x = y.
(4) Strictly speaking, (X, Y) and (x, y) in (A.8.11) and (A.8.12) should be transposed. However, we avoid this awkward notation when the meaning is clear.
(5) The integral in (A.8.12) may only be finite for "almost all" x. In the regular cases we study this will not be a problem.

Notes for Section A.14
(1) It may be shown that one only needs the existence of the derivative g' at b for (A.14.17) to hold. See Theorem 5.3.3.
A.18 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.
BILLINGSLEY, P., Probability and Measure, 3rd ed. New York: J. Wiley & Sons, 1995.
CHUNG, K. L., A Course in Probability Theory New York: Academic Press, 1974.
DEGROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1970.
FELLER, W., An Introduction to Probability Theory and Its Applications, Vol. I, 3rd ed. New York: J. Wiley & Sons, 1968.
GNEDENKO, B. V., The Theory of Probability, 4th ed. New York: Chelsea, 1967.
GRIMMETT, G. R., AND D. R. STIRZAKER, Probability and Random Processes Oxford: Clarendon Press, 1992.
HAJEK, J., AND Z. SIDAK, Theory of Rank Tests New York: Academic Press, 1967.
HOEL, P. G., S. C. PORT, AND C. J. STONE, Introduction to Probability Theory Boston: Houghton Mifflin, 1971.
KARLIN, S., A First Course in Stochastic Processes New York: Academic Press, 1969.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
LOEVE, M., Probability Theory, Vol. I, 4th ed. Berlin: Springer, 1977.
PARZEN, E., Modern Probability Theory and Its Application New York: J. Wiley & Sons, 1960.
PARZEN, E., Stochastic Processes San Francisco: Holden-Day, 1962.
PITMAN, J., Probability New York: Springer, 1993.
RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory, Division of Research, Graduate School of Business Administration, Boston: Harvard University, 1961.
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.
SAVAGE, L. J., The Foundation of Statistical Inference London: Methuen & Co., 1962.
Appendix B

ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS

In this appendix we give some results in probability theory, matrix algebra, and analysis that are essential in our treatment of statistics and that may not be treated in enough detail in more specialized texts. Some of the material in this appendix, as well as extensions, can be found in Anderson (1958), Billingsley (1995), Breiman (1968), Chung (1978), Dempster (1969), Feller (1971), Loeve (1977), and Rao (1973). Measure theory will not be used. We make the blanket assumption that all sets and functions considered are measurable.
B.1 CONDITIONING BY A RANDOM VARIABLE OR VECTOR

The concept of conditioning is important in studying associations between random variables or vectors. In this section we present some results useful for prediction theory, estimation theory, and regression.

B.1.1 The Discrete Case
The reader is already familiar with the notion of the conditional probability of an event A given that another event B has occurred. If Y and Z are discrete random vectors possibly of different dimensions, we want to study the conditional probability structure of Y given that Z has taken on a particular value z. Define the conditional frequency function p(· | z) of Y given Z = z by

p(y | z) = P[Y = y | Z = z] = p(y, z)/p_Z(z)   (B.1.1)

where p and p_Z are the frequency functions of (Y, Z) and Z. The conditional frequency function p is defined only for values of z such that p_Z(z) > 0. With this definition it is
TABLE B.1

            y
  z        0      1      2     p_Z(z)
  0      0.25   0.05   0.05    0.35
  10     0.05   0.15   0.10    0.30
  20     0.05   0.05   0.25    0.35
 p_Y(y)  0.35   0.25   0.40
clear that p(· | z) is the frequency function of a probability distribution because

Σ_y p(y | z) = Σ_y p(y, z)/p_Z(z) = p_Z(z)/p_Z(z) = 1

by (A.8.11). This probability distribution is called the conditional distribution of Y given that Z = z.

Example B.1.1 Let Y = (Y_1, ..., Y_n), where the Y_i are the indicators of a set of n Bernoulli trials with success probability p. Let Z = Σ_{i=1}^n Y_i, the total number of successes. Then Z has a binomial, B(n, p), distribution and

p(y | z) = P[Y = y, Z = z]/P[Z = z] = p^z(1 − p)^{n−z} / [C(n, z) p^z(1 − p)^{n−z}] = 1/C(n, z)   (B.1.2)

if the y_i are all 0 or 1 and Σ y_i = z. Thus, if we are told we obtained k successes in n binomial trials, then these successes are as likely to occur on one set of trials as on any other.  □
•
1 ' •
l l
l
•
-
These figures would indicate an association between heavy smoking and poor health be 0 cause p(2 l 20) is almost twice as large as py(2). The conditional distribution of Y given Z = z is easy to calculate in two special cases.
(i) lfY and Z are independent, then p(y I z) = py(y) and the conditional distribution coincides with the marginal distribution. (ii) If Y is a function of Z, h(Z), then the conditional distribution of Y is degenerate, Y � h(Z) with probability I. ' •
• • '
I
Both of these assertions follow immediately from Definition(B. l . l ).
l •
Section B.l
479
Conditioning by a Random Variable or Vector
Two important formulae follow from (B. I . I ) and (A.4.5). Let q(z j y) denote the conditional frequency function of Z given Y y . Then =
p(y,z) p( y I z) =
=
(B. I . 3)
p(y I z)pz(z)
q(z I y)py(y ) Eyq (z I y)p y(y )
(B . I .4)
Bayes' Rule
whenever the denominator of the right-hand side is positive. Equation (B. t 3) can be used for model construction. For instance, suppose that the number Z of defectives in a lot of N produced by a manufacturing process has a B(N, 0) distribution. Suppose the lot is sampled n times without replacement and let Y be the num ber of defectives found in the sample. We know that given Z = z, Y has a hypergeometric, 'H( z, N, n ) , distribution. We can now use (8.1.3) to write down the joint distribution of Y and Z .
y
where the combinatorial coefficients
(�)
(B.I.5) n
vanish unless a , b are integers with b <
a.
We can also use this model to illustrate (8 . 1 .4). Because we would usually only observe Y, we may want to know what the conditional distribution of Z given Y y is. By (8.1 .4) this is ..,...,
P[Z
IY
=
y] =
c(y)
=
I:
= z
where ,
( � ) B"(l - 0)"-' ( � ) ( � � � ) / c(y)
(�)
W ( l - B)"
'
( � ) ( � �: )
.
This formula simplifies to (see Problem B . l . l l ) the binomial probability, (B.l .6)
8.1.2
Conditional Expectation for Discrete Variables
Suppose that Y is a random variable with E(IYI) < oo. Define the conditional expectation of Y given Z = z, written E(Y I Z = z), by
E(Y I Z
=
z) = Eyyp (y I z) .
(B. I . 7)
480
Additional Topics in Probability and Analysis
Note that by (B . I . l ), if pz(z)
>
0,
Appendix B '
�i n
•
E( ) p,· y E( I Y I I Z = z) = Ey ]y]p(y I z) < Ey ]Yi . (B.I.8) = pz z pz z Thus, when Pz ( z) > 0, the conditional expected value of Y is finite whenever the expected value is finite.
Example B.l.3 Suppose Y and Z have the joint frequency function of Table B.l. We find 5 11 1 . E(Y I z = 20) = 0 . 7 + I + 2 . 7 = 7 = 1.57. 7 I
Similarly, E(Y I Z = 10) = ; = L17 and E(Y I Z = 0) = � = 0.43. Note that in the health versus smoking context, we can think of E(Y I Z = z) as the mean health rating 0 for people who smoke z cigarettes a day. Let g(z) = E(Y I Z = z). The random variable g(Z) is written E(Y I Z) and is called the conditional expectation ofY given Z. ( l ) As an example we calculate E(Y1 I Z) where Y1 and Z are given in Example B. l . l . We have E(Y, I z = i) = P lY, = I I z = i] =
� { 7 ( )
( ��n (7)
_,__,_--"-
1 •
n
•
(B.!.9)
is just the number of ways i successes can occur in n Bernoulli
'
l
;
�
;
1 '
;
•
' •
.
'
..
trials with the first trial being a success. Therefore, z E(Y, I Z) = -. n
B, 1.3
'
!
The first of these equalities holds because Y1 is an indicator. The second follows from (B. l.2) because
<
(B.LJO)
'
l
i
j
'
Properties of Conditional Expected Values
j <
In the context of Section A.4, the conditional distribution of a random vector Y given Z = z corresponds to a single probability measure P� on (0, A). Specifically, define for
A E A,
•
'
J i '
i , •
P. (A) = P(A l iZ = z]) if pz ( z)
>
0.
(B.LJ 1)
This P� is just the conditional probability measure on (0, A) mentioned in (A.4.2). Now the conditional distribution of Y given Z = z is the same as the distribution of Y if P� is the probability measure on (0, A). Therefore, the conditional expectation is an ordinary expectation with respect to the probability measure P�. It follows that all the properties of the expectation given in (A. J0.3HA. 10.8) hold for the conditional expectation given Z = z. Thus, for any real-valued function r(Y) with Elr(Y) I < oo, E(r(Y) I Z = z) = Eyr(y )p(y I z)
' ' •
<
! '
• •
•
j
Section B.l
481
Conditioning by a Random Variable or Vector
and (B.l.12) aE(Y1 I Z = z) + {IE(Y, I Z z) identically in z for any Y" Y, such that E(IYJ ! ), E(IY2 1 ) are finite. Because the identity
E(aY1 + f)Y, I Z
holds for all
z,
=
z)
=
=
we have (B.I.13)
This process can be repeated for each of (A. I 0.3)-(A.l 0.8) to obtain analogous properties of the conditional expectations. In two special cases we can calculate conditional expectations immediately. If Y and Z are independent and E(IYI) < oo, then
E(Y I Z) = E(Y).
(B. l . l 4)
E(h(Z) I Z) = h(Z).
(B. l .15)
This is clear by (i). On the other hand, by (ii) The notion implicit in (B. l . l5) is that given Z = z, Z acts as a constant. If we carry this further, we have a relation that we shall call the substitution theorem for conditional
expectations:
E(q(Y, Z) I Z
=
z) = E(q(Y, z) I Z = z) .
This is valid for all z such that pz ( z) > 0 if E lq (Y, Z) I < tions (B. I . l l ) and (B.l.7) because
oo.
(B. l . 16)
This follows from defini
P[q(Y, Z) = a I Z = z] = P[q(Y,Z) = a, Z = z l Z = z] = P[q(Y,z) = a I Z = z] (B.l.l7) for any a. If we put q(Y , Z)
=
r(Y)h(Z), where Elr(Y)h(Z)I <
oo,
we obtain by (B. I . l 6),
E(r(Y)h(Z) I Z = z) = E(r(Y)h(z) I Z = z) = h(z)E(r(Y) I Z = z).
(B. l . l 8)
Therefore,
E(r(Y)h(Z) I Z)
=
h(Z)E(r(Y) I Z).
(B.l . 19)
Another intuitively reasonable result is that the mean of the conditional means is the mean:
E( E(Y I Z)) = E(Y),
(B.l .20)
whenever Y has a finite expectation. We refer to this as the double or iterated expectation
theorem. To prove (B.l.20) we write, in view of (B.l .7) and (A.I0.5),
E(E(Y I Z)) = Ezpz(z)[Eyyp(y I z)] = Ey,zYP ( Y I z )pz(z) = Ey,zYP(y, z) = E(Y ).
(B.I.21)
-----
'
"
'
482
Additional Topics in Probability and Analysis
The interchange of summation used i.s valid because the finiteness of all sums converge absolutely. As an illustration, we check (B.l .20) for
E(E(Yt I Z)) � E If we apply (B. 1 .20) to Y
E(Y1 I Z)
(n) Z
=
np -;;:
E( [Y[)
Appendix B implies that
given by (B . 1 . 1 0). In this case,
=
p = E(Yt ) .
(B. l .22)
= r(Y)h(Z) and use (B . l . 19), we obtain the product expec
tation fonnula: Theorem B.l.l lf Elr(Y)h(Z)I < oo, then
E( r(Y)h(Z) ) = E(h(Z)E(r(Y) I Z)). Note that we can express the conditional probability that
P[Y E A I Z = zj = E(l[Y E AJ I Z Then by taking r(Y)
I '
'
= z) =
Y E A given Z = z as
EyEAP(Y I z) .
= l[Y E AJ, h = I in Theorem B . 1 . 1 we can express the (uncondi
tional) probability that Y
E A as
P[Y E AJ = E(E(r(Y) I Z)) = E.P[Y E A I Z = zjpz (z)
.' Ii
(B.l .23)
=
E[P(Y E A I Z) J.
••
For example, if Y and
:II
Z are as in (B.l .5), P[Y < yj
!'
where
'
!
Hz
(B . l 24)
=
E.
(�)
8'(1 - 0)"-• H.(y)
is the distribution function of a hypergeometric distribution with parameters
(z, N,n).
�l �;'·; � i
'' 'I ''' i' '
.
Continuous Variables
B. l.4
•
,
,
'
•
Suppose now that
u
(Y, Z)
is a continuous random vector having coordinates that are them
selves vectors and having density function
!i
p(y, z).
between frequency and density functions, the
z = z by ifpz(z) '
i
> 0.
fu nction of Y given
(B . l .25)
,
•
1
1
'
'
'
j •
pz (z),
is given by (A.8.12), it is clear that p(·
I z)
is a density. Because (B. 1 .25) does not differ fonnally from (B. 1 . 1), equations (B.l.3) and (B.I .6) go over verbatim. Expression (B . l .4) becomes
py (y )q(z I y) where q is the conditional density of Z given
'
conditional density( I)
p(y z) P(Y I z ) , ( pz z )
Because the marginal density of Z,
•
We define, following the analogy
,
Y
=
(B.l .26)
y. This is also called Bayes' Rule.
If Y and Z are independent, the conditional distributions equal the marginals as in the discrete case.

Example B.1.4 Let Y_1 and Y_2 be independent and uniformly, U(0, 1), distributed. Let Z = min(Y_1, Y_2), Y = max(Y_1, Y_2). The joint distribution of Z and Y is given by

F(z, y) = 2P[Y_1 ≤ Y_2, Y_1 ≤ z, Y_2 ≤ y] = 2 ∫_0^y ∫_0^{min(y_2, z)} dy_1 dy_2 = 2 ∫_0^y min(y_2, z) dy_2   (B.1.27)

if 0 ≤ z, y ≤ 1. The joint density is, therefore,

p(z, y) = 2 if 0 < z < y < 1, and 0 otherwise.   (B.1.28)

The marginal density of Z is given by

p_Z(z) = ∫_z^1 2 dy = 2(1 − z), 0 < z < 1, and 0 otherwise.   (B.1.29)

We conclude that the conditional density of Y given Z = z is uniform on the interval (z, 1).  □
=
z)
=
1:
r(y)p(y I z)dy.
(8.1.30)
As before, if g(z) = E(r(Y) I Z = z) , we write g(Z) as E(r(Y) I Z) , the conditional expectation of r(Y) given Z. With this definition we can show that formulas 12, 13, 14, 19, 20, 23, and 24 of this section hold in the continuous case also. As an illustration, we next derive B.l.23: Letg(z) = E[r(Y) I ZJ, then, by (A.IO. l l), E(h(Z)E( r(Y) I Z )) =
1:
=
E(h(Z)g( Z ))
h(z)pz(z)
[C
=
1:
h(z)g(z)pz(z)dz
]
r(y)p(y I z)dy dz.
(8.1.31)
484
Additional Topics in Probability and Analysis
Appendix B
By a standard theorem on double integrals, we conclude that the right-hand side of (8 . 1 . 3 1 ) equals
1: 1: 1:1:
r(y) h( z)pz ( z}p(y I z}dydz
�
(B. l .32)
r(y}h (z)p(y, z } dyd z
�
E(r(Y)h( Z }}
j l
l ' I
'
by (A.IO. I l ), and we have established B . l .23. To illustrate these formulae, we calculate E(Y I Z) in Example 8 . 1 .4. Here, E(Y I Z
=
z } � ro ' yp (y I z}dy J
1 �
(1
_
' ydy r }, z}
1+z �
2
, 0 < z < 1,
and, hence,
.
E(Y I Z)
=
1+z 2
'
;
•
'
l
.
•
.
8 . 1 .5
1
'
Comments on the General Case
'
•
•
Clearly the cases (Y, Z) discrete and (Y, Z) continuous do not cover the field. For example, if Y is uniform on (0, 1) and Z = Y², then (Y, Z) neither has a joint frequency function nor a joint density. (The density would have to concentrate on z = y², but then it cannot satisfy ∫_0^1 ∫_0^1 f(y, z) dy dz = 1.) Thus, (Y, Z) is neither discrete nor continuous in our sense. On the other hand, we should have a concept of conditional probability for which P[Y = √u | Z = u] = 1. To cover the general theory of conditioning is beyond the scope of this book. The interested student should refer to the books by Breiman (1968), Loeve (1977), Chung (1974), or Billingsley (1995). We merely note that it is possible to define E(Y | Z = z) and E(Y | Z) in such a way that they coincide with (B.1.7) and (B.1.30) in the discrete and continuous cases and moreover so that equations 15, 16, 20, and 23 of this section hold.

As an illustration, suppose that in Example B.1.4 we want to find the conditional expectation of sin(ZY) given Z = z. By our discussion we can calculate E(sin(ZY) | Z = z) as follows. First, apply (B.1.16) to get

E(sin(ZY) | Z = z) = E(sin(zY) | Z = z).

Because, given Z = z, Y has a U(z, 1) distribution, we can complete the computation by applying (A.10.11) to get

E(sin(zY) | Z = z) = [1/(1 − z)] ∫_z^1 sin(zy) dy = [1/(z(1 − z))] [cos z² − cos z].
B.2 DISTRIBUTION THEORY FOR TRANSFORMATIONS OF RANDOM VECTORS

B.2.1 The Basic Framework
.
•
.
a [)t, Jh (t) �
h, (t)
•
• •
•
•
•
•
•
[)
•
•
[)tk
h, (t)
•
•
•
The principal result of this section, Theorem B.2.2, rests on the change of variable theorem for multiple integrals from calculus. We now state this theorem without proof (see Apostol. 1974, p. 421). Theorem B.2.1 Let h = (h1, . . . , hk)T be a transformation defined on an open subset B of Rk. Suppose that; ( ! ) (i) h has continuousfirst partial derivatives in B. (ii) h is one-to-one on B. (iii) The Jacobian of h does not vanish on B. Let f be a real-valued function (defined and measurable) on the range h(B) { (h1(t), . . . ,hk(t)) : t E B} ofh and suppose f satisfies
-
r lf(x)ldx < 00. jh(B) Then for every (measurable) suhset K of h(B) we have
f J(x)dx jK
�
f
Jh-l(K)
!(h(t)) IJh(t) ldt.
(B.2.1)
- - -·· . -- -
-- -------
486
Additional Topics in Probability and Analysis
Appendix B
In these expressions we write dx for dx1 . . dxk. Moreover, h-I denotes the inverse of the transformation h; that is, h-I (x) = t if, and only if, x = h(t). We also need another result from the calculus (see Apostol, 1974, p. 417),
' '
.
1 Jh • (t) � Jh (h-1(t)) -
(B.2.2) .
'
It follows that a transformation h satisfies the conditions of Theorem B.2.l if, and only if, h- 1 does. We can now derive the density ofY � g(X) � (91 (X), . . . , 9k(X))T when g satisfies the conditions of Theorem B.2.1 and X = (X1, Xk)T is a continuous random vector. •
•
.
I
,
Theorem B.2.2 Let X be continuous and let S be an open subset of Rk such that P(X E S) � 1. /fg � (91> . . . , 9k)T is a transformation from S to R k such that g and S satisfy the conditions of Theorem B.2.1, then the density ofY = g(Y) is given by
' • •
l 1 •
(B.2.3) •
l
for y E g(S).
1 1 j "
Proof. The distribution function ofY is (see (A.7.8))
Fy (y ) �
i
j
J
• •
where Ak � {x E Rk : 9;(x) < y;, i � 1, . . . k } Next we apply 'fheorem B.2.1 1 with h � g- 1 and f � p,. Because h- (Ak) � g(Ak) � {g(x) : 9;(x) < y, , i � 1, . . . , k} � {t : t < y;, i � 1 , . . , k }, we ll�tain ,
.
•
.
i
. '
' '
j
l l '
The result now follows if we recall from Section A 7 that whenever Fy (y) = fy'::o . . . Jy� q(tl! . . . tk)dtl . . . dtk for some nonnegative function q, then q must be the D density of Y, !
Example B.2.1 Suppose X � (X1 , Xz)T where X1 and Xz are independent with N(O, 1) and N(O, 4) distributions, res�tively. What is the joint distribution of Y1 = X1 + X2 and Yz � X1 - Xz? Here (see (A.13.17)), Px (xi>xz ) �
'
4� exp - � [xi + !x�J .
' ' '
•
!
I
x1
In this case, S � R2 Also note that 91 (x) � + Xz, 9z (x) � X1 � (Y1 + Yz) , 9:21 (Y) � 5 (y1 - 1J2 ), that the range g(S) is R2 and that
J·-· (y) =
1 z 1 2
1 2 1 -2
1 2
- -
-
xz,
g!1 (y)
' � ; '
Section 8.2
487
Distribution Theory for Transformations of Random Vectors
Upon substituting these quantities in
(B.2.3), we obtain
�Px G
Pv(YI, Y2)
(Yl + y, ),
r
� (Yl - Y2))
_I_ exp - � � (YI + Y2)2 + _l_(YI - y,)2 8?r 2 4 16 I I " 2 exp - 32 I::>y i + 5y2 + 6YIY2 I . 2 Sn '
]
This is an example of bivariate normal density. Such densities will be considered further in Section
Upon combining
(B,22) and (B,23) we see that for y E g(S) ,
py (y ) If
o
B.4.
X
is a random variable
(k
=
=
1),
Px( g - 1 (Y ) ) ' JJ. (g l (y ))l
(B2A)
the Jacobian of
g
is just its derivative and the
requirements (i) and (iii) that g' be continuous and nonvanishing imply that monotone and, hence, satisfies (ii). In this case
(B.2.4)
g is
strictly
reduces to the familiar formula
(A.8.9). It is possible to give useful generalizations of Theorem not one-to-one (Problem
B.2.2 to
situations where
g is
B.2.7).
Theorem B.2.2 provides one of the instances in which frequency and density functions
g is one-to-one, and Y
g(X) , then py(y ) = p,(g-1 (y) ) . The extra factor in the continuous case appears roughly as follows. If A(y) is a ..small" cube surrounding y and we let V(B) denote the volume of a set B, then V (g 1 (A (Y)"-')-) P[g(XJ E A(y)J P[X E g- 1 (A(y)_ ) ] . :_ py(y) "' ";'��'i': 1 V(g (A(y))) V(A(y)) V(A(y)) V(g-1 (A(y ) ) ) l (g- (y)) . V(A(y )) differ. If X is discrete,
"' P
=
X
Using the fact that
g-1
is approximately linear on
A(y) , it is not hard to show that
V(g-l(A(y))) ! J•-' (y) l . "' V(A(y )) The justification of these approximations is the content of Theorem
B.2.2.
The following generalization of (A.8.10) is very important. For a review of the ele mentary properties of matrices needed in its formulation, we refer the reader to Section
BJO.
g is called an affine transformation of Rk if there exists a k x k matrix A and a k x 1 vector c such that g(x) = Ax+c. If c = 0, g is called a linear transformation. The function g is one-to-one if, and only if, A is nonsingular and then Recall that
g-I(y ) = A - l (y - c), y E Rk, where A -I
is the inverse of A .
(B.2.5)
l
488
Additional Topics In Probability and Analysis
Appendix 8
j
'
' '
B.2.1 Suppose X is continuous and S is such that P(X E S) = 1. If g is a one-to-one affine transformatioll as defined earlier, then Y = g(X) has density Corollary
(8.2.6)
'
for y E g( S), where det A is the determinant of A. The corollary follows from (8.2.4), (8.2.5), and the relation,
' '
•
'
.. '
'
Jg (g-1 (y)) - det A.
'
"
i
(8.2.7)
Example B.2.1 is a special case of the corollary. Further applications appear in the next D section. B.2.2
The Gamma and Beta Distributions
As a consequence of the transformation theorem we obtain basic properties of two impor tant families of distributions, which will also figure in the next section. The first family has densities given by •
(8.2.8) for x > 0, where the parameters p and A are taken to be positive and f(p) denotes the Euler gamma function defined by
r(p) =
l ' = edt. tp1
'
(8.2.9)
b, ,,(x) =
B(r, s)
'
' 1 '
•
(8.2. 10)
The family of distributions with densities given by (B.2.8) is referred to as the gamma family of distributions and we shall write f(p, ..\) for the distribution corresponding to 9p,>.. · The special case p = 1 corresponds to the familiar exponential distribution £(..\) of (A.l3.24). By (A.S.IO), X is distribnted r(p, .>.) if, and only if, >.X is distribnted r(p, 1). Thus, 1 /.>. is a scale parameter for the r(p, .\) family. Let k be a positive integer. In statistics, the gamma density 9p, >.. with p = �k and .\ = � is referred to as the chi squared density with k degrees offreedom and is denoted by x%. The other family of distributions we wish to consider is the beta family, which is in dexed by the positive parameters r and s. Its densities are given by x"-1(1 - x)'-1
•
•
It follows by integration by parts that, for all p > 0,
r(p + 1) = pr(p) and that r(k) = (k - 1)! for positive integers k.
•
(8.2. 1 1)
forO < x < 1, where B(r, s) = [r(r)r(s)]/[r(r+s)] is the betafunction. The distribntion corresponding to br,• will be written (J(r, s) . Figures 8.2.1 and 8.2.2 show some typical members of the two families.
�1 •
'
'
.
i' '
j i
J
, '
j' .
'
' '' '
'
l i
Section 8.2
Distribution Theory for Transformations of Random Vectors
489
Theorem B.2.3 If X1 and X2 are independent random variables with f(p, A) and f'(q, A)
distributions, respectively, then Y1 = X1 + X2 and Y2 = XI/( XI + X2 ) are independent and have, respectively, f(p + q, A) and fJ(p, q) distributions. Proof. If A
=
1, the joint density of X1 and X2 is (B.2.I2)
for x1 > 0, X2 > 0. Let
1.4
p
1.2 ....
1.0
0.8
-<
•
=
.\ = IO
0.6 0.4 0.2 0
I I
0
I
·� - - -
/
.
p=
',
'
'--- ---' .,_ + -
-
1
2
2, .\ ,
=
'
1
..... ..... . . ............. . .. ,
,
. _ _ _
-
-
.
-
- - - -
3
X
Figure B.2.1 The gamma density, gp,>. ( x) ,
for selected p, .\.
Then g is one-to-one on 8 = {(XI. X2)T : X1 > 0, X2 > 0} and its range is 81 { (y, , Y2 ) T : Yt > 0, 0 < y2 < I}. We note that on S, g-1 ( YI , Y2 ) = (YIY2 , YI - YIY2 f·
(B.2. I3)
Therefore, Y2 YI
I - Y2 -y,
=
-y,.
-------·�
(B.2.14)
-
..
-
. .
__
490
Additional Topics in Probability and Analysis
Appendix B
If we now substitute (B 2. 1 3) and (B.2.14) in (B.2A) we get for the density of (Y1 , Y2)T .
g(S 1 . X,).
I
>
�
I e-Y1 ( Y tY2 )11- (Yt - .tJ I Y2) q - YI f (p)f( q)
(B.2. 15)
0. 0 < y2 < L Simplifying (B.2.15) leads to JIY (Yr, y,)
.
'
=
1
py(y,.y,)
for y1
•
� YP+
(y, ) bp. q (Y2).
(B.2.16)
The result is proved for ,\ = 1. If ). i- 1 define x; = /\X1 and Xf = >.Xz. Now xr and X� are independent f(p, 1). r(q, 1) variables respectively. Because X� + Xf D .\(X1 + X2) and x; (X; + x;) - 1 � X1 (X 1 + X2) _ , the theorem follows.
4 •
II II 3 "I I I I r � 5 s = 14 I I I I I 2 I I I I I I I I ·. I I 1 I I I I I
'
1
'
l
'
0
0
I I
•
r =-s = 3 -
-
I I'
'
-
'
'
•
j
•
•
1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
' '
X
•
'
Figure B.2.2 The beta density, br,s(x), for selected
r, s.
By iterating the argument of Theorem B.2.3, we obtain the following general result. Corollary B.2.2 If X 1 , , Xn are independent random variables such that Xi has a f(pi, A) distribution, i = 1 , . . ' n, then E? 1 xi has a f(Ef I Pi, ,\) distribution. • • •
.
Some other properties of the gamma and beta families are given in the problems and in the next section.
! '
j ' '
Section 8.3
B.J
Distribution Theory for Samples from a Normal Population
491
DISTRIBUTION THEORY FOR SAMPLES FROM A NORMAL POPULATION
In this section we introduce some distributions that appear throughout modem statistics. We derive their densities as an illustration of the theory of Section B.2. However, these distrib�tions should be remembered in terms of their definitions and qualitative properties rather than density formulas.
B.J.l
The
x2, F,
and
t
Distributions
Throughout this section we shall suppose that X = (X1, . . . , Xn)T where the Xi form a sample from a N(O, a2) population. Some results for normal populations, whose mean differs from 0, are given in the problems. We begin by investigating the distribution of the Ei 1 X[, the squared distance of X from the origin.
Theorem 8.3.1 The random variable V = Ei 1X[ ja2 has a x� distribution. That is, V has density
for v > O. Proof. Let Zi = Xi/a, i = 1 , . . , n. Then Zi N(O, 1 ) . Because the Z[ are inde pendent, it is enough to prove the theorem for n = 1 and then apply Corollary B.2.2. If T = Zf, then the distribution function ofT is .
"'
(8.3.2) and, thus,
FT (t) = ( Vt) - <1> ( - Vt).
(8.3.3)
Differentiating both sides we get the density ofT
PT (t) = C i cp(Vt) =
l
y'2i;
t- ! e-t/2
(8.3.4)
for t > 0, which agrees with 91 1 up to a multiplicative constant. Because the constant is determined by the requirement21�at PT and g 1 1 are densities, we must have PT = g �·� 1 1 2•2 D and the result follows.
Let V and W be independent and have x� and x?n distributions, respectively, and let S = (Vjk ) (Wjm). The distribution of S is called the F distribution with k and m degrees offreedom. We shall denote it by Fk ,m· Next, we introduce the t distribution with k degrees offreedom, which we shall denote by 7,. By definition T. is the distribution of Q = Zj.jV/k, where Z and V are inde pendent with N(O, 1) and x� distributions, respectively. We can now state the following elementary consequence of Theorem B.3.1.
'
'
I
492
Additional Topics in Probability and Analysis
Appendix B
<
'
' •
Corollary B.3.l The random mriable
tion. The rmzdom mriab/e X
(
111
I k )E7" 1 x; j"i:,�·+;.� 1 x;
has Oil
1 / ( 1/k):E� +} Xt2 has a T�..: disrribwion.
Fk,m distrilm
Proof. For the first assertion we need only note that
'
•
'
(B.3,5) and apply the theorem and the definition of Fk,m· The second assertion follows in the same D way.
I
•
i
.
I
'
I
To make the definitions of the :Fk.m and 7k distributions useful for computation, we need their densities. We assume the S, Q, V, W are as in the definitions of these distribu· tions. To derive the density of S note that, if U � Vj (V + W), then
Vjk � m U (B.3.6) S Wjm k 1 - U · r (�m, �)( l l and V and W are independent, then by Because V r (�k, �). W Theorem B.2.3, U has a beta distribution with parameters �k and �m. To obtain the density of S we need only apply the change of variable formula (A.8.9) to U with g( u) = (mjk)uj(1 - u) . After some calculation we arrive at the Fk,m density (see Figure B.3. 1 ) (k/m)�•sl<•-2l( I + (kjm)s) -l
'""
,....,
'
! •
j
'
li , •
'
'
' ' ,
'
'
l 1 •
•
• • •
'
•
•
•
To get the density of Q we argue as follows. Because -Z has the same distribution as Z, we may conclude that Q and -Q are identically distributed. It follows that
P[O < -Q < q) 1 P[-q < Q < OJ � 2 P[O < Q2 < q2). Differentiating P[O < Q < q), P[-q < Q < OJ and JP[O < Q2 < q2] we get PQ (q) PQ ( -q) qpQ' (q2) if q > 0. P[O < Q < q)
�
�
(B.3.8)
l
(B.3.9)
Now Q2 has by Corollary B.3.1 an Fu distribution. We can, therefore, use (B.3.7) and (B.3.9) to conclnde
PQ (q) _
for -oo
< q < oo.
r (�(k + 1 )) (1 + (q' /k))-l<•+ll
v'1i'kr Gk)
l
I l
j
•
i ' ,
(B.3.10)
.. •
• •
Section 8 3
493
Distribution Theory for Samples from a Normal Population
0.2
. •
-- .. ... . . .. . . . .
0
0
I
2
3
Figure B.3.1 The :Fk,m density for selected (k, m) .
0.45
'
-
-4
-2
-3
The N(O,l) density The T4 density
-I
0
I
'
2
'
-
3
Figure B.3.2 The Student t and standard normal densities. The x2, T, and :F cumulative distribution functions are given in Tables II, III, and IV, respectively. More precisely, these tables give the inverses or quantiles of these distribu tions. For a E (0, 1 ) an ath quantile or 100 ath percentile of the continuous distribution F is by definition any number x(a) such that F(x(a)) = a. Continuity of F guarantees the existence of x(a) for all a. If F is strictly increasing, x(a) is unique for each a. As an illustration, we read from Table III that the (0.95)th quantile or 95th percentile of the '12o distribution is t(0.95) = I. 725. ,
------ ----------- --·-- -···· · ·
...... ......
.
-·--
.
-·-
494
Additional Topics in Probability and Analysis
8.3.2
Appendix B
Orthogonal Transformations
We tum now to orthogonal transformations of normal samples. Let us begin by recalling some classical facts and definitions involving matrices, which may be found in standard texts, for example, Birkhoff and MacLane (1965). Suppose that A is a k x m matrix with entry aij in the ith row and jth column, i = 1, . , k; j = 1, . . . , m. Then the transpose of A, written AT, is the m x k matrix, whose entry in the ith row and jth column is aji · Thus, the transpose of a row vector is a column vector and the transpose of a square matrix is a square matrix. An n x n matrix A is said to be orthogonal if, and only if, .
.
' ' '
J'
•
'
(8.3 . 1 1 ) or equivalently if, and only if, either one of the following two matrix equations is satisfied, (8.3.12)
;
'
(8.3.13) where I is the n xn identity matrix. Equation (B.3.12) requires that the column vectors of A be of length 1 and mutually perpendicular, whereas (B.3.13) imposes the same requirement on the rows. Clearly, (8.3.12) and (8.3.13) are equivalent. Considered as transformations on Jr1' orthogonal matrices are rigid motions, which preserve the distance between points. That is, if a = ( a1 , . . . , an ) T, b = (b, , . , bn)T, Ia - bl = JE� 1 (a, - b;) 2 is the Euclidean distance between a and b, and A is orthogonal, then . .
'
'j
•
'
'
I
I
'
'
I a - bl
=
(8.3. 14)
IAa - Abl.
'
To see this, note that
(a - b)T(a - b) = (a - b)TATA(a - b) [A(a - bW[A(a - b)] = I Aa - Abl2
I a - bl'
(8.3.15)
Finally, we shall use the fact that if A is orthogonal,
'
,
I
I
(8.3.16)
ldet AI = L
'
'
•
'
This follows from
[det AJ2
=
[det A] [det AT] = det[AAT]
=
det l =
L
(8.3.17)
Because X = (X1, . . , Xn)T is a vector of independent identically distributedN(O, a2 ) random variables we can write the density of X as .
-
[v'27i'<7]n exp [d,r<7]n exp [- 2!,1x1'] · 1
1
-
I
I
I'
" ,I
n � xr
20''
(8.3.18)
l
Section B 3
495
Distribution Theory for Sam ples from a Normal Population
We seek the density of Y = g(X) = c � (c 1 , . . , cn) · By Corollary B.2. 1, Y .
py(y)
AX + c, where A is orthogonal, � (Y1, . . . Y,, )T has density
n
x n,
and
.
-
(A (Y - c)) ' ldet AIPx Px (AT(y - c)) 1
(B.3.19)
by (B.3 . 1 1 ) and (B.3.16). If we substitute AT (Y - c) for x in (B.3.18) and apply (B.3.15), we get (B.3.20) Because
} � ( Yi : c;) Px(Y - c) = n { yJ
(B.3.21)
we see that the Yi are independent normal random variables with E(Yi) = Ci and common variance (J2 . If. in particular, c = 0, then Y1 , . . . , Yn are again a sample from a N(O, (J2) population. More generally it follows that if Z � X + d, then Y � g(Z) � A(X + d) + c � AX + (Ad + c) has density
py(y) � Px(y - (Ad -t- c)).
(B.3.22)
Because d � (d1 , . . . ,dn)T is arbitrary and, by definition, E(Z;) i = 1, . . . , n, we see that we have proved the following theorem.
Theorem B.3.2 !fZ
= E(X; + d; )
=
d;,
= (Z1, , Zn)T has independent normally distributed components with the same variance fJ2 and g is an affine transformation defined by the orthogonal matrix A andvector c = (c1, Yn)T has independent , cn)T, then Y = g(Z) = (Y1 , nonnally distributed components with variance fJ2. Furthermore, ifA = ( aij) .
•
.
.
•
•
n E(Y;) = c; + L a;;E(Z;) j =l
. . . 1
(B.3.23)
for i = l , . . . ,n. This fundamental result will be used repeatedly in the sequel. As an application, we shall derive another classical property of samples from a normal population. Theorem B.3.3 Let (Z1,
.
.
•
, Zn)T be a sample from a N(f.L, a2) population. Define n 1 Z = N L Z; . (B.3.24) i"-= 1
Then Z and Ek= t (Zi - Z)2 are independent. Furthermore, Z has a N(f.l, (J2 /n) distribu tion while ( 1 Ia2 ) :Ef I ( zi - Z? is distributed as X� - 1 .
496
Additional Topics in Probability and Analysis
Proof Construct an orthogonal matrix A a, =
=
Appendix 8
( aij) whose first row is
(Jn ' Jn) . . .
This is equivalent to finding one of the many orthogonal bases in Rn whose first mem ber is a1 and may, for instance, be done by the Gram-Schmidt process (Birkhoff and MacLane, 1965, p. 180). An example of such an A is given in Problem B.3.15. Let AZ = (Y,, . . . , Yn JT· By Theorem B.3.2, the V. are independent and normally distributed with variance 0"2 and means
n n E(Y;) = I >ijE( Zj) = I' :�::>ij· j=l j=l Because a1 j = 1/ .Jii, 1 < j < n, and A is orthogonal we see that n n L aij = Vn Laljakj = 0, k = 2, . . . , n. j=l j=l
(B.3.25)
(B.3.26) •
)
•
Therefore,
E(Yt)
= w/n, E(Yk)
•
= 0,
k = 2, . . . , n.
•
(B.3.27)
By Theorem B.3.1, (Ijcr 2 )EL2Y.,' has a X�- I distribution. Because by the definition of A,
- Yt z = ..;n·
(B.3.28)
Now
I
n n n n L(Z• - Z)2 = I:zf - 2Z l: Z. + nZ2 = I: zf - nZ2• k=l k=l k=I k=I
Therefore, by (B.3.28),
•
Finally,
n n '<;""' L.(Zk - Z- )2 = '<;"' L. z.2 - Y,2 . k=l k=l
n n l: Y.' = IYI 2 = IAZ - AOI 2 = IZI 2 = 2:: z� k=l k=l
Assertion (B.3.29) follows.
•
�
'
1
'
'
•
•
1 ., '
1
i
1'.
the theorem will be proved once we establish the identity
n n '<;"' 2 '<;"' Z = L.(Z k - )2 . L_. yk k=l k=2
'
i
.
.
.
(B.3.29)
., '
•
•
, '
.
i
l
.
(B.3.30)
., •
'
,,
!
I
•
' •
(B.3.31 )
(B.3.32) 0
. j
l
1 I
Section 8.4
B.4
497
The Bivariate Normal Distribution
THE BIVAR IATE NORMAL DISTRIBUTION
The normal distribution is the most ubiquitous object in statistics.
It appears in theory
as an approximation to the distribution of sums of independent random variables, of order
statistics, of maximum likelihood estimates, and so on. In practice it turns out that variables
arising in all sorts of situations, such as errors of measurement, height, weight, yields of
chemical and biological processes, and so on,
are
approximately normally distributed.
In the same way, the family of k-variate nonnal distributions arises on theoretical
grounds when we consider the limiting behavior of sums of independent k-vectors of ran dom variables and in practice as an approximation to the joint distribution of k-variables. Examples are given in Section
In this section we focus on the important case k = 2 where all properties can be derived relatively easily without matrix calculus and we can
6.2.
draw pictures. The general k-variate distribution is presented in Section
more thorough introduction to moments of random vectors.
B.6 following
a
Recall that if Z has a standard normal distribution, we obtain the N(Jl, cr2) distribution
as the distribution of g(Z)
= crZ +
f-1· Thus,
Z
generates the location-scale family of
N(J.i,, cr2 ) distributions. The analogue of the standard normal distribution in two dimensions
is the distribution of a random pair with two independent standard normal components,
whereas the generalization of the family of maps g( z)
= cr z + J.i, is the group
of affine
transformations. This suggests the following definition, in which we let the independent
N(O, 1) random variables Z1 and Z2 generate our family of bivariate distributions. A planar vector (X, Y) has a bivariate nonnal distribution if, and only if, there exist constants aiJ • 1 � i,j < 2, J.l,1, J.1,2, and independent standard normal random variables z1 ' z2 such that X M1 + anZ1 + a12Z2 (B.4. 1) y /-L2 + a21 zl + a22z2. In matrix notation, if A � (a,,), J1
�
definition is equivalent to
( !' 1 , /'2)T, X � (X,Y)T, Z
X � AZ + J1·
(B.4.2)
Two important properties follow from the Definition
(B.4.1).
Proposition B.4.1 The marginal distn'butions of the components of a bivariate normal random vector are (univariate) nonnal or degenerate (concentrate on one point). This is a consequence of
Note that
E(X )
�
(A.1 3.23).
The converse is not true. See Problem
1'1 + anE(Z1) + a12E(Z2)
�
1'1 , E(Y)
�
1'2
B.4.10. (B.4.3)
and define
u1 Then
X has a N(/'1, uf )
�
v'var X,
u2
�
vvar Y.
(B.4.4)
and Y a N(/' 2 , u?;) distribution.
Proposition 8.4.2 If we apply an affine transformation g(x) Cx + d to a vector X, which has a bivariate nonnal distribution, then g(X) also has such a distribution. =
-------�-- �---
. --
.
--
.
....,. _.
498
Additional Topics in Probability and Analysis
Appendix B
This is clear because
(B.4.5) We now show that the bivariate normal distribution can be characterized in terms of first- and second-order moments and derive its density. As in Section A.l l, let (X Co::.:v-" :) cc-: • Y� ( .::. ) = Cor X , Y = 0'10"2 if a1 a2 f- If a1 a2 = 0, )it will be convenient to let = We define the variance covariance matrix of (X, Y (or of the distribution of (X, Y)) as the matrix of central second moments CX + d = C(AZ + J") + d = (CA)Z + (CJL + d).
p
p
0.
0.
This symmetric matrix is in many ways the right generalization of the variance to two dimensions. We see this in Theorem B.4.1 and (B.4.21). A general definition and its properties (for k dimensions) are given in Section B.S. Theorem B.4.1 Suppose that ""'' i' 0 and I PI < I. Then
1 Px(x) = exp [-�({x - JLf�- 1 (X - JL))l
(B.4.8)
' ' '
I
I '
l
l '
(B.4.9) i
•
Because (X, Y) is an affine transformation of (Z1 , Z2), we can use Corollary 8.2.1 to obtain the joint density of (X, Y) provided A is nonsingular. We start by showing that AAT = �. Note that Proof.
i.
'i i<
liI
I I
while and
"'1
i l
t l
l
1
'�
(B.4.10) pcr1a2
Cov(auZt + a12Z2, a21Z1 + a22Z2 ) - an":ll Cov(Z1 ,Z, ) +(a,,a21 + a11a,2) Cov(Z1, Z2) +a12a22 Cov(Z2, Z2) aua21 + a12a22·
(B.4.11)
i �-----
Section 8.4
499
The Bivariate Normal Distribution
Therefore, AA T
=
ldct AI
:E and by using elementary properties of determinants we obtain
Vdet Adet AT v'det 1: � crw2 J1 - pz.
J[det A]'
�
�
Vdet AA T
(8.4.12)
Because IPI < 1 and a1a2 f. 0, we see that A is nonsingular and can apply Corollary B.2.1 to obtain the density of X. The density of Z can be written
pz(z) As in (8 .3.19),
Px (x)
�
)
1 T - z z 2
.
- � [A 1( x - J.<Jf [A - 1 (x - J.<)J } { ex p � (x - J.<)T[A -'f A -• (x - J.<) } . { � � AI
21rld t A I 2rrld t
Because
( p ;r,;: ex 1
exp -
(8.4. 13)
(8.4.14)
(8.4.15) we arrive at (B.4.8). Finally (B.4.9) follows because by the formulas for an inverse (8.4.16) 0
From (B.4.7) it is clear that :E is nonsingular if, and only if, a1a2 f. 0 and IPI < 1. Bivariate nonnal distributions with a1 a2 i- 0 and IPI < 1 are refened to as norulegenerate, whereas others are degenerate. If a'f = ai1 + ai2 = 0, then X = J.ti, Y is necessarily distributed as N(J.
(Y - 11-2) a2
�p
(X - J.
.
(8.4. l 7)
Because the marginal distributions of X <¥Id Y are, as we have already noted, N(J.t1, oD and N(J.<2, o-:j) respectively, relation (8.4.1 7) specifies the joint distribution of (X, Y) com pletely. Degenerate distributions do not have densities but correspond to random vectors whose marginal distributions are nonnal or degenerate and are such that (X, Y) falls on a fixed line or a point with probability 1. Note that when p = 0, Px (x) becomes the joint density of two independent normal variables. Thus, in the bivariate normal case, independence is equivalent to correlation zero. This is not true in general. An example is given in Problem B.4. 1 1 . Now, suppose that we are given nonnegative constants a1, a2. a number p such that IPI < 1 and numbers J.t t , J.t2 · Then we can construct a random vector (X, Y) having a
'
II
500
Additional Topics in Probability and Analysis
Appendix 8
••
p(x,y)
\
.
•
'
'
I !
'
y
'
' •
' ' • '
• • •
p(x,y)
•
J
y
Figure 8.4.1 A plot of the bivariate normal density p(x, y) for l'l o-1 = o-2 =
1, p = 0 (top) and p = 0.5 (bottom).
=
1'2
=
2,
bivariate normal distribution with vector of means (Jtl, /.L2 ) and variance-covariance matrix E given by (B.4.7). For example, take (8.4.18) and apply (B.4.10) and (B.4. I I). A bivariate normal distribution with this moment structure will be referred to as N(J<,, 1'2 . at ,a�, p) or N(J<, E). ' '
1 '
'
Section 8.4
501
The Bivariate Normal Distribution
Now suppose that U
= (U1, U2 )T is obtained by an affine trnnsformation, u,
u,
c11X + c12 Y + vi c21X c22 Y v2
(B.4.19)
+
+
from a vector (X, Y) T having a N(f-li, f-£2, uf , a� , p) distribution. By Proposition 8.4.2, U has a bivariate normal distribution. In view of our discussion, this distribution is completely determined by the means, variances, and covariances of U1 and U2, which in tum may be expressed in terms of the f-li, af , p, cij , and Vi. Explicitly, E(UI) Var U1 Var U, Cov(U1, U,)
VI + Cuf-ll C12j.L2, E(U2) V2 + C21J.L1 + C22/l2 2 2 2 c11 a1 + c12a2 cr2C11para2 2 2 2 2 C21 u1 + c22 a2 + c21C22pa1a2 cuc21a i + C12C22a� + (cuc22 + c12C2t ) Pu1a2.
=
+ 2 2+ 2
(B.4.20)
In matrix notation we can write compactly (E(U, ) , E (U,)) E(U)
( v, , v, l
T = v + C!L, + C(tL1,!'2) CECT
(B.4.21)
where E(U) denotes the covariance matrix of U. If the distribution of (X, Y) is nondegenerate and we take l
(Tl
0
( vl , v, ) ' =
e
../l - p2 l 2 .,.2Jl-p
(Tt
(B .4.22)
-C(I'bi'2JY
then ul and u2 are independent and identically distributed standard normal random vari ables. Therefore, starting with any nondegenerate bivariate normal distribution we may by an affine transformation of the vector obtain any other bivariate normal distribution. Another very important property of bivariate normal distributions is that normality is preserved under conditioning. That is, Theorem B.4.2 /f (X, Y) has a nondegenerate N(l'l , 1'2, cr?, cr�,p) distribution, then the conditional distribution ofY given X = x is
Proof. Because X has aN(tLI , a¥ ) distribution and (Y, X ) is nondegenerate, we need only
------
-
- ----
502
Additional Tepics in Probability and Analysis
Appendix 8
calculate
p( y I x)
P(Y,X) ( y, x)
px(x)
{ (x -:.t)2]} p')J
1 ---;;c=7c===cce exp o-2 }21r(1 - p2 )
+[1 - (1 -
1 exp "2 V21r(1 - p')
-
1
2(1 -
(Y - I"z ) 2 [ p2 ) ,-� -
]
2 1 (x 1"1) (Y 1" 2) _ [ p 2(1 - p2) 0"2 "'
. (8.4.23)
This is the density we want.
i
0
Theorem B.4.2 shows that the conditional mean of Y given X = x falls on the line y � 1"2 + p(o-2 /o-!)(x - 1"1). This line is called the regression line. See Figure 8.4.2, which also gives the contour plot Sc = { (x, y) : f(x, y) = c} where c is selected so that P((X, Y) E Sc) � 'Y· 'Y � 0.25, 0.50, and 0.75. Such a contour plot is also called a 1001% probability level curve. See also Problem 8.4.6. B y interchanging the roles of Y and X, we see that the conditional distribution of X 2 p + 2 given Y � y is N(p1 (udo-2)p(y 1" ) , a-i (l - )). With the convention that 0/0 = 0 the theorem holds in the degenerate case as well. More generally, the conditional distribution of any linear combination of X and Y given any other linear combination of X and Y is normal (Problem B .4.10). As we indicated at the beginning of this section the bivariate normal family of distri butions arises naturally in limit theorems for swns of independent random vectors. The main result of this type is the bivariate central limit theorem. We postpone its statement and proof to Section B.6 where we can state it for the k-variate normal.
,
l
,
'
B.S
MOMENTS OF RAN DOM VECTORS AND MATRICES
• •
We generalize univariate notions from Sections A.lO to A.l2 in this section. Let U, re spectively V, denote a random k, respectively l, vector or more generally U = IIUij llkxl , a matrix of random variables. Suppose EIUiJ I < oo for all i, j. Define the expectation of U by
E(U)
B.S. I
�
IIE(U;; ) IIk xl· ,
Basic Properties of Expectations
If Am xk. Bmxl are nonrandom and EU, EV are defined, then
E(AU + BV) � AE(U) + BE(V).
(B.S. I)
Section 8.5
503
Moments of Random Vectors and Matrices
y 3
•
(2,2) . .
2
• • • • • • •
•
1
•
•
0
0
2
1
3
X
Figure B.4.2 25%, 50%, and 75% probability level-curves, the regression line (solid line), and major axis (dotted line) for the N(2, 2, 1, I , 0.5) density.
This is an immediate consequence of the linearity of expectation for random variables and the definitions of matrix multiplication. If U � c with probability 1,
E(V) = c . For a random vector U, suppose EU,' < oo for i = 1, . . . , k or equivalently E(fUf2) < co, where 1 · I denotes Euclidean distance. Define the variance of U, often called the variance-covariance matrix, by
Var(U)
E(V - E(U))(U - E(U)f ff Cov(U, U; )[fkxk,
(B.5.2)
,
a symmetric matrix.
B.5.2 If A is m
Properties of Variance x
k as before, Var(AU)
=
AVar(U)A T
Note that Var(U) is k x k, Var(AU) is m x m.
(B.5.3)
Additional Topics in Probability and An a lysis
504
Appendix B
Let ckx 1 denote a constant vector. Then
Var(U + c) = Var(U ) .
(B.5.4)
Var( c ) = IIO]]kxk ·
(8.5.5)
If ak x 1 is constant we can apply (8.5.3) to obtain
Var( Ej= 1 ap, ) = arvar(U)a = Li,j a, a; Cov(U,, U; ).
Var(aTU)
=
(B.5.6)
Because the variance of any random variable is nonnegative and a is arbitrary, we con clude from (B.5.6) that Var(U) is a nonnegative definite symmetric matrix. The following proposition is important.
Proposition B.S.l /f E]Uj2 < oo, then Var(U ) is positive definite if and only if, for every a t 0, b,
P[aru + b = OJ < !.
(B.5.7)
Proof. By the definition of positive definite, (B.lO.l), Var(U) is not positive definite iff arvar(U)a = 0 for some a f 0. By (B.5.6) that is equivalent to Var( aTU) = 0, which is D equivalent to (B.5.7) by (A.l l .9). If Ukx 1 and Wkxi then
independent random vectors with E]U]2 <
oo,
Var(U + W) = Var(U) + Var(W).
.
E]W]2 < oo, (B.5.8)
This follows by checking the identity element by element. More generally, if E]U]2 < oo, E]V]2 < oo, define the covariance of Uk xr, Vr x 1 by
'
'.
•
•
•
are
Cov(U, V)
•
=
•
E(U - E(U))(V - E(VJJY IICov(U;, Vj )]]kx l·
Then, if U, V are independent
'
Cov(U, V) = o.
(B.5.9)
.
' •
'
•
' ' '
In general Cov(AU + a, BV + b) = ACov(U, V)Br
(B.5.10)
j I
for nonrandom A, a, B, b, and Var(U + V) = Var(U) + 2 Cov(U, V) + Var(V). We leave (B.5.10) and (B.S. l l ) to the problems. Define the moment generatingfunction (rn.g.f.) of Ukxi for t E IF by
M (t)
I
=
Mu (t)
=
E (e' u ) = E( e"�� ,•, u, ) . T
(8.5.1 1) . •
' '
> • >
•
i
•
I
Section 8.5
505
Moments of Random Vectors and Matrices
Note that Af(t) can be (c. f.) of U by,
co
for all
t =!=- 0. In parallet< 1 ) define the characteristic function
T
'l'(t) = E(e" u) = E(cos(tTU)) + iE(sin(tTU)) where i = A. Note that 'I' is defined for all theorems are beyond the scope of this book. Theorem B.5.1 Let S = (a)
t E n• , all U. The proofs of the following
{t : M(t) < oo). Then,
S is convex. (See B.9).
(b) If S
has a nonempty interior S0, (contains a sphere S(O, €), € > 0), then M is analytic on 8° In that case EIUIP < oo for all p. Then, if i1 + · · + ik = p, ·
&PM(O) &t;, 1 .
In particular,
.
•
=
&t;, k
&M
(0) &tJ
E(Ui' . . . U�' ).
(B.5.12)
= E(U)
(B.5.13)
kxl
and (B.5.14)
kxk (c)
If 8° is nonempty, M detennines the distribution ofV uniquely.
Expressions (B.5.12HB.5.14) are valid with
cumulant generating function of U is defined by K(t) = Ku(t) = log M(t). If S ( t) = { t : M (t) < oo } has a nonempty interior, then we define the cumulants as The
&P
'-"ll···tl; = e; l · · ·tl; (U ) = &t;' . & " K(t) . . tk 1 r.
·
.
, i1 + · · · + ik = P·
.
t-O
An important consequence of the definitions and (A.9.3) is that if Ukxl• Vkxl are independent then
Mu+V(t) = Mu (t)Mv(t), Ku+V(t) = Ku(t) + Kv (t)
(B.5.15)
where we use subscripts to indicate the vector to which the m.g.f. belongs. The same type of identity holds for c.f.'s. Other properties of cumulants are explored in the problems. See also Bamdorff-Nielsen and Cox (1989).
..
..
506
Additi:mal Topics in Probability and Analysis
Appendix 8
Example 8.5.1 Tire Bimriare Normal Distribution. N(JtJ , Jl2, af, ai , p) distribution then, it is easy to check that (B.S.
E(U) = I' Var(U) = �
Mu(t) where
t = (t1, t2) T , I'· and
=
exp
{
e·l'
+
16)
(B.S.l7)
� tT �t }
(B.S.l8)
� are defined as in (6 .4.7), (6.4.8). Similarly
'Pu(t) obtained by substituting iti for ti,
=
{
�
exp it TI' - tT I: t
1 < j < k,
in
(8.5.18).
}
(B.S. 19)
The result follows directly from
(A. l3.20) because
(6.5 .20) and by (6 .4.20),
U1 + t2U, has a N(tT/L, tT �t) distribution.
By taking the log in (8. 5 . 1 8) and differentiating we find the first five cumulants
[: '
t1
'
,
'
'
', ·. i'
I
I
0
All other cumulants are zero.
'
• '
I
THE MULTIVARIATE NORMAL DISTRIBUTION
8.6
l ! .
' •
;! '
''
'
8.6.1
Definition and Density
•
'
We define the multivariate normal distribution in two ways and show they are equivalent. From the equivalence we are able to derive the basic properties of this family of distribu� tions rapidly.
Definition 8.6.1 Ukxl
has a
multivariate (k-van"ate) normal distribution
iff
U
can be
i
J l
written as
U = I' + AZ when
Jlk; x l • Akxk
are constant and
Z
=
( Z1 , . . . , Zk ) T
where the Zj are independent
standard normal variables. This is the immediate generalization of our definition of the bivariate normal. We shall show that as in the bivariate case the distribution of U depends on
I'
=
E(U)
and
� - Var(U), only.
Definition B.6.2 Ukx 1 has a multivariate normal distribution dom, aTU = E1=IaiUi has a univariate normal distribution.
iff for every
akx 1
nonran
Theorem B.6.1 Definitions B.6.J and B.6.2 define the same family of distributions.
' •
'
l
Section 8.6
507
The Multivariate Normal Distribution �
[ATa]Tz + aTp, a linear combination of independent normal variables and, hence, normaL Conversely, if X = aTU has a univariate normal distribution, necessarily this is N(E(X)1 Var(X)). But
Proof If U is given by Definition B.6. I, then aTU � aT (AZ + p)
E(X) � aT E(U)
(B.6. 1)
Var(X) = aT Var(U)a
(B.6.2)
from (B.S. I ) and.(B.5.4). Note that the finiteness of E(IUI) and Var(U) is guaranteed by applying Definition B .6.2 to eJU, where ej denotes the k x 1 coordinate vector with 1 in the jth coordinate and 0 elsewhere. Now, by definition, Mu (a)
E(exp(aTU)) = E(ex) exp {aT E(U) + �aT Var(U)a}
(B.6.3)
from (A.1 3.20), for all a. Thus, by Theorem B.5 . 1 the distribution of U under Definition B.6.2 is completely determined by E(U), Var(U). We now appeal to the principal axis theorem (B. I 0.1.1 ) . If :E is nonnegative definite symmetric, there exists Akxk such that where A is nonsingular iff :E is positive definite. Now, given U defined by B.6.2 with E(U) = p, Var(U) = �- consider where A and Z are as in Definition B.6.1. Then E(V) = p, Var(V)
=
AVar(Z)AT
=
AAT
because Var(Z) = Jkxk • the identity matrix. Then, by definition, V satisfies Definition 8.6.1 and, hence, B .6.2 and has the same first and second moments as U. Since first and second moments determine the k-variate normal D distribution uniquely, U and V have the same distribution and the theorem follows. Notice that we have also proved: Corollary 8.6.1. Given arbitrary fLkxt and :E nonnegative definite symmetric, there is a unique k-variate normal distribution with mean vector J..L and variance-covariance matrix �.
We use Nk (P, �) to denote the k-variate normal distribution of Coroiary B.6.1. Argu ing from Corollary B.2.1 we see the following. Theorem 8.6.2 /j:E ispositive definite or equivalently nonsingular, then ifV U has a density given by 1 pu ( x) = (27r) k/2 [det(�)] k/2
I x { x exp - 2 ( - p) � T
-I
}
(x - p) .
""'
Nk(J..L, :E),
(B.6.5)
Additional Topics in Probability and An alysis
508
Appendix B
Proof. Apply Definition 8.6. 1 with A such that E = AAT The converse that U has a density only if E is positive definite and similar more refined results are left to Problem 8.6.2. There is another important result that follows from the spectral decomposition theorem
(8.10. 1 .2). Theorem 8.6.3 If U kx 1 has Nk(J.L, E) distribution, there exists an orthogonal matrix Pkxk such that pTU has an Nk (v, Dkxk) distribution where v = pT/1- and Dk x k is
'
j •
' •
• •
l j
the diagonal matrix whose diagonal entries are the necessarily nonnegative eigenvalues of E. /f:E is of rank l < k, necessarily only l eigenvalues are positive and cowersely.
•
Proof. By the spectral decomposition theorem there exists P orthogonal such that
E = PDPT Then pru has a./1/k(v, D) distribution since Var(PTU) ofP.
. •
=
pTEP = D by orthogonality 0
This result shows that an arbitrary normal random vector can be linearly transformed to a normal random vector with independent coordinates. some possibly degenerate. In the bivariate normal case, (B.4.19) and (B.4.22) transformed an arbitrary nondegenerate bivariate normal pair to an i.i.d. N(O, 1) pair. Note that if rank � = k and we set
( r ' = P (o l r ' p r
E i = PD i PT, E-i = E i .
•
' ,,
. .
I;
;
· .
'
·J
•
'
.1 •
has a N(O, J) distribution, where J is the k x k identity matrix.
'
•
'
' • '
Corollary 8.6.2 /fU has an./1/k (O, E) distribution andE is of rank k, then ur E - 1 U has
8.6.2
;
•
(8 .6.6)
Proof. By (8.6.6). UTE-1U = ZTZ, where z is given by (8 .6.6). But zrz = where Z; are i.i.d. ./1/(o, 1). The result follows from (8.3.1).
l
•
where D! is the diagonal matrix with diagonal entries equal to the square root of the eigenvalues of :E, then
a x� distribution.
• •
L� 1 Zl 0
Basic Properties. Conditional Distributions
If U is./1/k(Jt, E) and Atxk• br x i are nonrandom, then AU + b is .Ni(A�t + b, AEA'). This follows immediately from Definition 8.6.2. In particular, marginal distributions of blocks of coordinates of D are normal. For the next statement we need the following block matrix notation. Given :Ekx k positive definite, write
(8.6.7)
Section 8.6
509
-------------- --��
The Multivariate Normal Distribution
where :E l l is the l X l variance of ( u1 ' . ' ul ) T. which we denote by U(l)' :Ezz the ( k-·l) X (k - I) variance of (U1+ 1 , Uk) 7 denoted by Ul'l, and �" = Cov (U I ' i Ui 21)1xk-t. .
.
:E :.n
=
:Ef2 .
•
Similarly write
.
•
f.J =
ul 'I and Ui2)
(
.
p
.l ' i p Pl
)
,
2 where JL( I l and J.L( ) are the mean vectors of
,
We next show that independence and uncorrelatedness are the same for the k variate normal. Specifically
( ���; )
where U(I) and U(Z) are k and l vectors, Theorem 8.6.4 /f U k +l x = ( ) l respectively, has a k + l variate normal distribution, then UP) and U( 2) are independent iff
.
(8.6.8) Proof. If Ui 1 1 and u1 21 are independent, (8.6.8) follows from (A. 1 1 .22). Let U i 1 1• and 2 Ui 21• be independent Nk(E(Ui 1 1), Var(U ( l l ) ) , Ni (E(UI'I ) , Var(Ui 1 )) . Then we show U lll• 2 l has the same distribution as U and, hence, U( ) , u< > are below that U* U(Z)• independent. To see that U* has the same distribution as U note that _
(
)
EU•
=
2
EU
by definition, and because U(l) and U( ) have E12 Var(U(j)•) = E;;, j = 1, 2, then
Var (U)
=
(8.6.9)
0
and by construction
Var(U')
(8.6.10)
by (8.6.7). Therefore, U and U* must have the same distribution by the determination of the k-variate normal by first and second moments. D
Theorem 8.6.5 If U is distributed as Nk (p,, :E), with :E positive definite as previously, then the conditional distribution of U( I ) given U(Z) = ul2l is N1 (p,(l) + :E 2:E221 ( u<2l 1 2 p,< l), :Eu - :E 2 :E221 :E1 2). Moreover "En - :E12:E 221 :E2 1 is positive definite so there is 1 a conditional density given by (B.6.5) with (:Ell � "E 1 z:E221 :EzJ) substituted for :E and 2 2 1 1 p< 1 ) for 1-'· 1-'( ) - ��2�22' (ui Proof. Identification of the conditional density as normal can be done by direct computa tion from the formula
(8.6.1 I ) after noting that :En is positive definite because the marginal density must exist. To derive this and also obtain the required formula for conditional expectation and variance we proceed as follows. That "Eu, :En are positive definite follows by using aT �a > 0 with a whose last k - l or first l coordinates are 0. Next note that
(8.6.12)
-------------- - - ---··-
510
Additional Topics in Probability and Analysis
because
1 1 Var(U 2 I l)E 22 E21 E12E22 :E12 'Ezz1 'Ezz 'Ezz1 :E 21
Appendix B
(8.6.13)
by (B.5.3). Furthermore, we claim 1 2 Cov(E12E22'Ui'l, ul ' l - E12E22 U( l )
�
0
(8.6.14)
(Problem B.6.4) and, hence, by Theorem B.6.5, U(l) - E12E221Ui2) and Ui 2) are inde pendent Thus, the conditional distribution of U(l) - E 1 2E 221U( 2l given U( 2l = u< 2l is the same as its marginal distribution. By the substitution property of the conditional distribution this is the same as the conditional distribution of uU l - ::E 1 2 ::E221 u< 2l given u<2l = u< 2l. The result now follows by adding E 12 E 2i U(2l and noting that (8.6.15) and
E(U I u, � u,) 1
1 u2 u + E(U1 - E 12E221U2 I , � u,) E 12E22 - E(U1 - E12E221 U2 ) + E12E221 u2 1 1 JL1 - 'Et 2 'E22 1Jz + ::E 12:Ezz U2 1 p.1 + E12E22 (u2 - p.2) .
.
•
0
• ,
Theorem 8.6.6 The Multivariate Central Limit Theorem. Let Xt, Xz, . . . , Xn be independent and identically distributed random k vectors with EIX 1 I2 < oo. Let E(X1 ) = g R, JL, Var{XI) = ::E, and let Sn = Ei 1 Xi. Then, for every continuous function : Rk
(
Jnnp. ) £. g(Z)
g Sn
----->
(8.6.16)
1 ,
"
, •
� ,
•
., ,
where Z � N. (o, E).
As a consequence. if I; is positive definite, we can use Theorem 8.7.1 to conclude that
(B.6. l7) {x : xi < Zi, i = 1, . . . , k} where as usual subscripts for all z E Rk . Here {x : x < z} indicate coordinate labels. A proof of this result may be found in more advanced texts in probability, for instance, Billingsley (1995) and in Chung (1974). An important corollary follows. =
Corollary 8.6.1 1/the X, are as in the statement of Theorem B.6.6, if� is positive definite and if X = � E�=t Xi, then (B.6.l8)
• '
•
l
•
l
; '
'
j i
1 l
·�
' '
j
-1
,
'
l
Section 8.7
Convergence for Random Vectors:
Op
and OJ> Notat ion
Sll
Proof fo(X - p) (Sn - ni')/Jn. Thus, we need only note that the function g(x) = l xT:E � 1 x from Rl.: to R is continuous and that if z xl: N(O, � ) . then zT�- z �
"'
"'
(Corollary B.6.2).
8.7
c
Op
CONVERGENCE FOR RANDOM VECTORS: AND op NOTATION
The notion of convergence in probability and convergence in law for random variables dis cussed in section A . l . 5 generalizes to random vectors and even abstract valued random elements taking their values in metric spaces. We give the required generalizations for ran dom vectors and, hence, random matrices here. We shall also introduce a unique notation that makes many computations easier. In the following, B.7.1 A sequence of random vectors
Z - (Z1, . . . , Zd )T
Zn
iff
=
( Zn1 , . . . Znd )T ,
p
IZ n - Z l � or equivalently
Znj � Zj
for
that depend on n. Thus,
1 < j < d.
Zn � Z iff
converges
Zn
n - 1 L:� 1 Z;. If EIZI <
oo,
then Zn !.:. p
in probability to
are considered under probabilities
for every € > WLLN (the weak law of large numbers).
Euclidean distance.
0
Note that this definition also makes sense if the
Pn
I · I denotes
0. -
Let Z1, . . . , Zn be i.i.d. as Z and let Zn � EZ.
EjZI 2 < oo, the result follows from Chebychev's For a proof in the EIZI < oo case, see Billingsley ( 1995). When
inequality
as
in Appendix A.
The following definition is subtler.
8.7.2 A sequence
L(Zn) � L(Z), iff for all functions
{Zn}
of random vectors
converges in law
to
Z,
written
Zn
c
--+
Z
or
h(Zn) !:, h(Z)
h : Rd--+R, h continuous.
We saw this type of convergence in the central limit theorem (8 .6.6). Note that in the definition of convergence in law, the random vectors
Zn, Z only play
the role of defining marginal distributions. No requirement is put on joint distributions of
{ Zn} , Z.
Thus, if Z1 ,
. . . , Zn
are i.i.d., Zn
An equivalent statement to (B .7.2) is
.£... Z1, but Zn + Z , .
Eg(Z ,.)� Eg(Z) for all
g
:
Rd �R
p
(B.7.3)
continuous and bounded. Note that (B.7.3) implies (A.l4.6).
following stronger statement can be established .
The
512
Additional Topics in Probability and Analysis
Appendix B
' '
'
Theorem B.7.1 Zn _£. Z ijf (B.7.3) holds for every g : Rd_.,.flP such that g is bounded and if Ag = { z : g is continuous at z } then P[Z E Ag] = 1. Here are some further properties.
'
'
'
; '
' ;
' '
Proposition B.7.1 (a) IfZn £. Z and g is continuous from R" to RP, then g(Zn) £. g(Z). (b) The implication in (a) continues
conclusion above.
to
' '
.
'
,
hold if "P" is replaced by "{." in premise and
(c) The conclusion of (a) and (b) continues to hold if continuity of g is replaced by P[Z E Ag] = I where Ag - ( z : g is continuous at z } . 8.7.4
If Zn
p
�
Z then Zn
c
�
Z.
A partial converse follows.
j
L p 8.7.5 IfZn -----+ zo (a constant), then Zn __,. zo. ' '
I; '
•
•
I'
,,
II
i I.
'
'
•
Note that (B.7.4) and (8.7.5) generalize (AJ4.3), (Al4.4).
Theorem 8.7.2 Slutsky's Theorem. Suppose Z?; = (u;:, V!) where Zn is a d vector, Un is b-dimensional, Vn is c = d - b-dimensional and (a) Un (b)
" �
U
Vn � v where v is a constant vector.
(c) g is a continuous function from Jr1
to
Rb.
•
I
'
Then Again continuity of g can be weakened to P[(UT, vT )T E Ag] We next give important special cases of Slutsky's theorem:
=
L
�
Example 8.7.1 (a) d = 2, b = c = !, g(u,v) = au + {Jv. g(u,v) = uv or g(u, v) = ;; and v # 0. This covers (A.I4.9) (b)
'i '
Vn = IIVni;]]bxb,c = b2, g(uT,vT) = vu where v is a b x b matrix. To apply Theorem B.? .2. rearrange V and v as c x 1 vectors with c = b2• n
i
' ,
Section
8.7
Convergence for Random Vectors: O p and Op Notation
'---� -��--_-::_-:::
513
Combining this with b = c = dj2,g(uT,vT) = u + v, we obtain, that if the b x b matrix IIVn l l � llvll and Wn. b x 1, tends in probability to w, a constant vector, and c Un --+ U, then (8.7.6) The proof of Theorem B.7.l and other preceding results comes from the following the orem due to Hammersley (I 952), which relates the two modes of convergence. Skorokhod ( 1956) extended the result to function spaces . Theorem 8.7.3 Hammersley. Suppose vectors Zn � Z in the sense ofDefinition 8.7.2. There exists (on a suitable probability space) a sequence of random vectors { z.;:} and a vector Z"' such that
(i) .C(Z;;) = .C(Zn) for all n, .C ( Z* )
=
.C(Z)
(ii) z;; � Z"'. A
somewhat stronger statement can also be made, namely, that zn"' '::.!.
z"'
where �· refers to almost sure convergence defined by Zn �- Z if P
(n�oo lim Zn = z) = I.
This type of convergence also appears in the following famous law. SLLN (the strong law of large numbers). Let Zh . . . , Zn be i.i.d. as Z and let Zn n-l L:� 1 zi. then Zn �- J.L = EZ iff E I Z I < 00.
-
For a proof, see Billingsley ( 1 995). The proof of Theorem B.7.3 is easy for d = I. (Problem B.7.1) For the general case refer to Skorokhod ( 1 956). Here are the proofs of some of the preceding assertions using Hammersley's theorem and the following. Theorem 8.7.4 If the vector Un converges in probability to U and g P[U E A9] = I, then
is
bounded and
For a proof see Billingsley (1995, p. 209). Evidently, Theorem B.7.4 gives the equiva lence between (8.7.2) and (B.7.3) and establishes Theorem B. 7. I . The proof of Proposition B.7 .I (a) is easy if g is unifonnly continuous; that is. for every £ > 0 there exists 6 > 0 such that
{(z, , z,) : lg(zi) - g (zz ) l > E) C {(z,, zz) : l zl - Zzl > 0). A
stronger result (in view of Theorem B.7.4) is as follows.
. .
5 14
Additional Topics in Probability and Analysis
Appendix B
7.5 Dominated Conl'ergence Theorem. If { H111 } , lV are random l'ariab/es, W" -': W, P[IWn J < jliFJ] I and E]WJ < oo, then EWn EW. Proposition B.7.1 (b) and (c) follow from the (a) part and Hammersley's theorem. Then (8.7 .3) follows from the dominated convergence because if g is bounded by M and uni formly continuous, then for c5 > 0 (B.7.7) ]Epg (Zn) - Epg ( Z ) J < Theorem B.
=
�
sup{]g(z) - g(z')J : ]z - z'J < J} + MP[]Zn - ZJ > J]
Let n-+oo to obtain that lim supn ]Epg(Zn ) Epg(Z)J < sup{]g(z) - g(z')J : Jz - z'J < J} (8.7.8) and let 0-----+0. The general argument is sketched in Problem 8.7.3. For 8.7.5 let h,(z) l(]z - zol > ) Note that Ah, {z : ]z - zo] " c). Evidently if P[Z zo] 1, P[Z E Ah,] I for all > 0. Therefore, by Problem 8.7.4, P [IZn - zo] > •J�P[]Z - zo] > <] 0 because P[Z zo] 1 and the result follows. Finally Slutsky's theorem is easy because by Hammersley's theorem there exist V,:", u;: with the same marginal distributions as Vn, Un and u:;_ � U* , v,: � Then (U:;_, V,;" ) !:..,. (U*,v), whichbyProposition8.7.1 implies that (Un, Vn) .£. (U,v), which by Theorem B.7.1 implies Slutsky's theorem. In deriving asymptotic properties of some statistical methods, it will be convenient to use convergence of densities. We will use the following. -
= =
=
< .
f
=
=
=
=
=
v.
Theorem 8.7.6 Scheffe's Theorem. Suppose Pn(z) and p(z) are densities or frequency functions on Rd such that Pn (z) p( z) as n -----+ oo for all z E Rd. Then ---t
j IPn(z) - p(z)]dz
�
0 as n � oo
in the continuous case with a sum replacing the integral in the discrete case. Proof.
We give the proof in the continuous case. Note that ]Pn(z) - p(z)J
where x+ m =
ax
{ O x}. ,
= Pn(z)
- p(z) + 2 [p(z) - Pn(z)J+
Thus,
'
j IPn(z) - p(z)]dx jIPn (z) - p(z)]dz + 2 j[p(z) - Pn(z)j+dz. =
The first tenn on the right is zero. The second tenn tends to zero by applying the dominated convergence theorem to Un [p(Z) - Pn(Z)]+fp(Z) and g(u) u, u E [0, 1], because =
=
0 [p(z ) - Pn(z)]+ < p(z). Proposition 8.7.2 /fZn and Z have densities orfrequency functions Pn(z) and p(z) with Pn(z) � p(z) as n � oofor all z E Rd, then Zn .£. Z.
' '
'
'
'
1
'
'
'
·:, •
l
Section 8.7
SIS
Convergence for Random Vectors: Op and Op Notation
Proof. We give the proof in the continuous case. Let g bounded, say 191 < M < co . Then
IEg (Zn ) - Eg(Z ) I
=
:
Rd
--+
R be continuous and
j g(z)[pn(z) - p(z)]dz < M j IPn(z) - p(z)ldz D
and the result follows from (B.7.3) and Theorem 8.7.5.
Remark 8.7.1 Theorem 8.7.5 can be strengthened considerably with a suitable back ground in measure theory. Specifically, suppose J-t is a sigma finite measure on X. If 9n and g are measurable functions from X to R such that ( I)
9n
�
g in measure, i.e., I'{ x : l9n (x) - g(x) I > c)
�
0 as n
· ;
oo for all < > 0
and (2) J lun lrdl' � J lg ! dl' as n � oo for some r > l , then J l9n -g ldl' � O as n � oo. D A proof of this result can be found in Billingsley (1979, p. 184). '
L
Theorem 8.7.7 Polya's Theorem. Suppose real-valued Xn --+ X. Let Fn,F be the distribution Junctions of Xn, X, respectively. Suppose F is continuous. Then sup IFn(x ) - F(x)l X
�
0.
�
F(x) and Fn(x - 0) � F(x) for all x . Given �: > 0, choose x, x such that F(x) :S £, 1 - F(x) < £. Because F is uniformly continuous on [x, x], there exists c5 (£ ) > 0 such that for all Jl. < x1, x2 < x, < XK = x be such that lx , - Xzl < J(c) '* IF(x,) - F(x2)1 < f. Let x = xo < x1 ]x; - x; - , J :S J(<) for all j. Then Outline of Proof. By Proposition 8.7.1, Fn (x)
•
•
•
JFn(x) - F(x)l < max{IFn(x; ) - F(x; ) ] , 1Fn(X;+1) - F(x;+I ) ]} sup x3<x <xj+l + sup {max{(Fn(x) - Fn (x;)),Fn (X;+!) - Fn (x) }} x,.<x<x3+l +max{(F(x) - F(x;)), F(x;+I) - F(x) } .
sup IFn(x) - F (x) ] x<x sup IFn (x ) - F(x)l x>X
< Fn (X) <
+ F(x)
( 1 - Fn (x)) + (1 - F(x)).
Conclude that, limn supx IFn(x) - F(x ) l < 3< and the theorem follows. We end this section with some useful notatiOn.
D
Additional Topics in Probability and Analysis
516
t
' '
t
The Op , ;::. p, and
'
Appendix B
op Notation
The following asymptotic order in probability notation is useful.
I '
'
Un � o p(l ) Un Op( l )
' '
iff iff
�
iff iff
Note that
p
Un -------+ 0 \It > 0, 3M < oo such that \In P[ I Un l > M] S t I Un l � {1 ) Op /Vn l IUn l 1 � Op { ) /Vnl
1 O � O (1 O p(1)op(1) � op ), p(1) + op( l) p( ) ,
c
(8.7.9)
and Un � U =? Un � Op(1). Suppose Z1, . . . z are i.i.d. as Z with EIZI < oo Set 11- � E(Z ) , then Z" 1-' + op(1) by the WLLN. If E IZI 2 < oo, then Zn � J.L + Op(n - l ) by the central limit theorem. ,
8.8
"
.
MULTIVARIATE CALCULUS
' •
8.8.1 A function T : Rd _____, R is linear iff
T{ax1
+ f3x2) � aT(xi) + {3T(x2 )
E R, x1 , x2 E Rd. More generally, T : Rd x · · · x Rd _____, R is k linear iff k T(x 1 , x2, . . . , xk ) is linear in each coordinate separately when the others are held fixed.
for all
a, f3
8.8.2 T
-
(T1 ,
•
•
•
, TP ) mapping Rd x · · · x Rd ---+ RP is said to be k linear iff T1 ,
• • •
,
Tp
k
are k linear as in B.8.1.
8.8.3 T is k linear asin B.8.1 iff there exists an array {ai1, ,i�c : 1 < ij such that if x, = (xti , . . , Xtd ) , 1 < t < k, then ..
.
1 < d, < j < k} _
' '
.
T(xt, . - . , xk )
=
d
d
i�c=l
il=l
_L · · · _L aiJ,
.
.
. ,i"
k
IJ XjiJ
j=l
•
(8.8.4)
B.8.5 1f h : 0 � RP, 0 open c Rd, h = (h,, . , hp), then h is Frt!chet differentiable at x E 0 iff there exists a (necessarily unique) linear map Dh(x) : Rd _____, RP such that .
•
.
lh(y) - h(x) - Dh{x)(y - x)l � o(IY - xi)
where I I is the Euclidean norm. If p = 1, Db is the total differentia{. ·
,
(B.8.6)
j •
i
J •
, •
·'
'
.
'
l
-
Section 8.8
517
M ultivariate Calculus
More generally, h is m times Frichet differentiable iff there exist l linear operators D 1 h (x) : Rd x . . . x Rd R", 1 < l < m such that �
I
m Dl h
h(y) - h(x) - L l! (x)(y - x, . . . , y - x) 1=1
=
(8.8.7)
o([y - x[m).
8.8.8 If h is m times Frechet differentiable, then for 1 < j < p, hj has partial derivatives of order < m at x and the jth component of D1h(x) is defined by the array o1hj(x) l . : < 1.j < d, + + f = 1, o < fi :::; , 1
{
8.8.9 h is
"d
m
. }
.
times Fr6chet differentiable at x if hj has partial derivatives of order up to m on 0 that are continuous at x. B.8.10 Taylor's Formula
If hj, 1 < j < p has continuous partial derivatives of order up to m + 1 on 0, then, for all x , y E 0, m D 1 h(x) nm+ 1 h(x') h(y) = h(x) + L 'l. (y - x, . . . , y - x) + m + ) .' (y - x, . . . , y - x) ( ! =1 1 (8.8. 1 1 ) for some x* = x + a*(y - x), 0 < a* < 1. These classical results may be found, for instance, in Dieudonne (1960) and Rudin (1991 ). As a consequence. we obtain the following. B.8.12 Under the conditions of 8.8.1 0,
� D1h(x) (y - x, . . . , y - x) h(y) - h(x) l! � l=l
< ( (m + 1 )!)- 1 sup{ [Dm+'h(x')[ : [x' - x[ < [y - x[)[y - x[ m+I
for all x,y E 0.
B.8.13 Chain Rule. Suppose h : 0 � RP with derivative Dh and g : h(0) derivative Dg. Then the composition g o h : 0 __,. Rq is differentiable and
�
R• with
D(g o h)(x) = Dg(Dh(x) ).
As a consequence. we obtain the following.
p, h be I - l and continuously Frechet differentiable on a neighborhood of x E O, and Dh(x) = zh· (x) be nonsingular. Then h- 1 : h(O) � o is Frechet B.8.14 Let d
=
l
:x:;
differentiable at y = h(x) and
ll
px p
Dh-' (h(x))
=
[Dh(xJr'-
518
Appendix B
Additional Topics in Probability and Analysis
6.9
CONVEXITY AND I N EQUALITIES
Convexity A subset S of
Rk is said to be
ax+ (1- a)y E S.
When
k
convex
if for every
x, y
a
E [0, lj,
1, c(!nvex sets are finite and infinite intervals. When
=
spheres, rectangles, and hyperplanes are convex. The point the convex set S iff for every d
E S, and every
i- 0�
k > 1,
x0 belongs to the interior S0 of (8.9.1)
where 0 denotes the empty set. A function
g from a convex set S to R is said to be convex if
g(ax + (I - a)y)
< ag(x)
+ (I - a)g(y),
all
x, y E S,
a E [0, 1].
(B.9.2)
g is said to strictly convex if(B.9.2) holds with < replaced by < for all x � y, a '/c {0, 1}. Convex functions are continuous on 8°. When k = 1, if g" exists, convexity is equivalent to g"(x) > 0, x E S; strict convexity holds if g"(x) > 0, x E S. For g convex and fixed x, y E S, h(a) = g(ax + (I - f')Y)) is convex in a, a E [0, 1]. When k > I, if 8g2(x)j8xi8xi exists, convexity is equivalent to
L u,u;82g (x)f8x,8x; > 0,
all
u
E
Rk and x E S.
• '
i,j A function convex.
h from a convex set S to R is said to be (strictly) concave if g =
Jensen's Inequa�ty. and
If S c
EU is finite, then EU
Rk is convex and closed, g ij convex on S,
E S,
-h is (strictly)
P[U E S]
Eg(U) exists and
Eg(U) > g(EU) with equality if and only if there are a and
=
I,
(B.9.3)
bkx 1 such that
In particular, if g is strictly convex, equality holds in (B.9.3) if and only if P[U
for some Ckxl·
'
"
=
c]
=
' '
l
1
For a proof see Rockafellar (1970). We next give a useful inequality relating product moments to marginal moments.

Hölder's Inequality. Let r and s be numbers with r, s > 1, r^{-1} + s^{-1} = 1. Then

E|XY| ≤ {E|X|^r}^{1/r} {E|Y|^s}^{1/s}.   (B.9.4)

When r = s = 2, Hölder's inequality becomes the Cauchy–Schwarz inequality (A.11.17). For a proof of (B.9.4), see Billingsley (1995, p. 80) or Problem B.9.3.
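Both inequalities are easy to illustrate by simulation. A minimal Python sketch (NumPy assumed; the gamma, normal, and exponential distributions and the exponents r = 3, s = 3/2 are arbitrary illustrative choices) checks (B.9.3) with the strictly convex g(u) = u² and (B.9.4), using sample averages in place of expectations:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.gamma(shape=2.0, scale=1.5, size=100_000)
x = rng.normal(size=100_000)
y = rng.exponential(size=100_000)

# Jensen (B.9.3): E g(U) >= g(EU) for convex g; here g(u) = u^2.
print("E[U^2]  =", np.mean(u ** 2))
print("(EU)^2  =", np.mean(u) ** 2)          # should be the smaller of the two

# Hoelder (B.9.4) with r = 3, s = 3/2 (so 1/r + 1/s = 1).
r, s = 3.0, 1.5
lhs = np.mean(np.abs(x * y))
rhs = np.mean(np.abs(x) ** r) ** (1 / r) * np.mean(np.abs(y) ** s) ** (1 / s)
print("E|XY|       =", lhs)
print("Hoelder rhs =", rhs)                  # lhs <= rhs up to Monte Carlo error
```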
We conclude with bounds for tails of distributions.

Bernstein Inequality for the Binomial Case. Let S_n ∼ B(n, p). Then

P[|S_n − np| ≥ nε] ≤ 2 exp{−nε²/2}  for ε > 0.   (B.9.5)

That is, the probability that S_n deviates from its expected value np by more than a multiple nε of n tends to zero exponentially fast as n → ∞. For a proof, see Problem B.9.1.

Hoeffding's Inequality. The exponential convergence rate (B.9.5) for the sum of independent Bernoulli variables extends to the sum S_n = Σ_{i=1}^n X_i of i.i.d. bounded variables X_i, |X_i − μ| ≤ c_i, where μ = E(X_1):

P[|S_n − nμ| ≥ x] ≤ 2 exp{ −x² / (2 Σ_{i=1}^n c_i²) }.   (B.9.6)

For a proof, see Grimmett and Stirzaker (1992, p. 449) or Hoeffding (1963).
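A quick Monte Carlo illustration of (B.9.6): the minimal Python sketch below (NumPy assumed; the uniform distribution, sample size, and thresholds are illustrative choices) takes X_i uniform on (0, 1), so μ = 1/2 and c_i = 1/2, and compares the simulated probability P[|S_n − nμ| ≥ x] with the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 50_000
mu, c = 0.5, 0.5                      # U(0,1): mean 1/2, |X_i - mu| <= 1/2

s_n = rng.random((reps, n)).sum(axis=1)

for x in [5.0, 8.0, 12.0]:
    empirical = np.mean(np.abs(s_n - n * mu) >= x)
    bound = 2 * np.exp(-x ** 2 / (2 * n * c ** 2))   # (B.9.6) with c_i = c
    print(f"x={x:5.1f}  P_hat={empirical:.4f}  Hoeffding bound={bound:.4f}")
```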
B.10 TOPICS IN MATRIX THEORY AND ELEMENTARY HILBERT SPACE THEORY

B.10.1 Symmetric Matrices

We establish some of the results on symmetric nonnegative definite matrices used in the text and in Section B.6. Recall that A_{p×p} is symmetric iff A = A^T. A is nonnegative definite (nd) iff x^T A x ≥ 0 for all x, positive definite (pd) if the inequality is strict unless x = 0.

B.10.1.1 The Principal Axis Theorem

(a) A is symmetric nonnegative definite (snd) iff there exists C_{p×p} such that

A = C C^T.   (B.10.1)

(b) A is symmetric positive definite (spd) iff C above is nonsingular.

The "if" part in (a) is trivial because then x^T A x = x^T C C^T x = |C^T x|² ≥ 0. The "only if" part in (b) follows because |C^T x|² > 0 unless x = 0 is equivalent to C^T x ≠ 0 unless x = 0, which is nonsingularity. The "if" part in (b) follows by noting that C is nonsingular iff det(C) ≠ 0 and det(C C^T) = det²(C). Parenthetically we note that if A is positive definite, A is nonsingular (Problem B.10.1). The "only if" part of (a) is deeper and follows from the spectral theorem.
B.10.1.2 Spectral Theorem

(a) A_{p×p} is symmetric iff there exist P orthogonal and D = diag(λ_1, ..., λ_p) such that

A = P D P^T.   (B.10.2)

(b) The λ_j are real, unique up to labeling, and are the eigenvalues of A. That is, there exist vectors e_j, |e_j| = 1, such that

A e_j = λ_j e_j.   (B.10.3)

(c) If A is also snd, all the λ_j are nonnegative. The rank of A is the number of nonzero eigenvalues. Thus, A is positive definite iff all its eigenvalues are positive.

(d) In any case the vectors e_j can be chosen orthonormal and are then unique up to label. Thus, Theorem B.10.1.2 may equivalently be written

A = Σ_{i=1}^p e_i e_i^T λ_i   (B.10.4)

where e_i e_i^T can be interpreted as projection on the one-dimensional space spanned by e_i (Problem B.10.2). (B.10.1) follows easily from (B.10.2) by taking C = P diag(λ_1^{1/2}, ..., λ_p^{1/2}) in (B.10.1). The proof of the spectral theorem is somewhat beyond our scope; see Birkhoff and MacLane (1953, pp. 275–277, 314), for instance.
B.10.1.3 If A is spd, so is A^{-1}.

Proof. A = P diag(λ_1, ..., λ_p) P^T implies A^{-1} = P diag(λ_1^{-1}, ..., λ_p^{-1}) P^T. □

B.10.1.4 If A is spd, then max{x^T A x : x^T x ≤ 1} = max_i λ_i.
B.10.2 Order on Symmetric Matrices

As we defined in the text, for A, B symmetric, A ≤ B iff B − A is nonnegative definite. This is easily seen to be an ordering.

B.10.2.1 If A and B are symmetric and A ≤ B, then for any C,

C A C^T ≤ C B C^T.   (B.10.5)

This follows from the definition of snd or the principal axis theorem because B − A snd means B − A = E E^T, and then C B C^T − C A C^T = C(B − A)C^T = (CE)(CE)^T. Furthermore, if A and B are spd and A ≤ B, then

B^{-1} ≤ A^{-1}.   (B.10.6)

Proof. After Bellman (1960, p. 92, Problems 13, 14). Note first that, if A is spd,

x^T A^{-1} x = max{2x^T y − y^T A y : y ∈ R^p}   (B.10.7)

because y = A^{-1}x maximizes the quadratic form. Then, if A ≤ B,

2x^T y − y^T A y ≥ 2x^T y − y^T B y

for all x, y. By (B.10.7) we obtain x^T A^{-1} x ≥ x^T B^{-1} x for all x and the result follows. □
B.10.2.2 The Generalized Cauchy–Schwarz Inequality

Let Σ = (Σ_{11}, Σ_{12}; Σ_{21}, Σ_{22}) be spd, (p + q) × (p + q), with Σ_{11} p × p and Σ_{22} q × q. Then Σ_{11}, Σ_{22} are spd. Furthermore,

Σ_{12} Σ_{22}^{-1} Σ_{21} ≤ Σ_{11}.   (B.10.8)

Proof. From Section B.6 we have noted that there exist (Gaussian) random vectors U_{p×1}, V_{q×1} such that Σ = Var(U^T, V^T)^T, Σ_{11} = Var(U), Σ_{22} = Var(V), Σ_{12} = Cov(U, V). The argument given in B.6 establishes that

Var(U − Σ_{12} Σ_{22}^{-1} V) = Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21} ≥ 0   (B.10.9)

and the result follows. □

B.10.2.3 We note also, although this is not strictly part of this section, that if U, V are random vectors as previously (not necessarily Gaussian), then equality holds in (B.10.8) iff for some b,

U = Σ_{12} Σ_{22}^{-1} V + b   (B.10.10)

with probability 1. This follows from (B.10.9) since a^T Var(U − Σ_{12} Σ_{22}^{-1} V) a = 0 for all a iff

a^T (U − Σ_{12} Σ_{22}^{-1} V) = a^T b   (B.10.11)

with probability 1 for all a, where b is E(U − Σ_{12} Σ_{22}^{-1} V). But (B.10.11) for all a is equivalent to (B.10.10).
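A numerical sketch of (B.10.8)–(B.10.9) in Python (NumPy assumed; the dimensions and the random positive definite Σ are arbitrary illustrative choices): partition a covariance matrix and check that Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21} is nonnegative definite.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 2, 3
M = rng.normal(size=(p + q, p + q))
Sigma = M @ M.T + 0.1 * np.eye(p + q)            # spd covariance matrix

S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
S21, S22 = Sigma[p:, :p], Sigma[p:, p:]

gap = S11 - S12 @ np.linalg.solve(S22, S21)      # Var(U - S12 S22^{-1} V) in (B.10.9)
eigvals = np.linalg.eigvalsh(gap)
print("eigenvalues of Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21:", eigvals)
print("nonnegative definite:", np.all(eigvals >= -1e-10))    # (B.10.8)
```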
B.10.3 Elementary Hilbert Space Theory

A linear space H over the reals is a Hilbert space iff

(i) It is endowed with an inner product (·, ·) : H × H → R that is bilinear,

(a h_1 + b h_2, c h_3 + d h_4) = ac(h_1, h_3) + ad(h_1, h_4) + bc(h_2, h_3) + bd(h_2, h_4),

symmetric, (h_1, h_2) = (h_2, h_1), and satisfies (h, h) ≥ 0 with equality iff h = 0. It follows that if ‖h‖² = (h, h), then ‖·‖ is a norm. That is,

(a) ‖h‖ = 0 iff h = 0
(b) ‖a h‖ = |a| ‖h‖
(c) ‖h_1 + h_2‖ ≤ ‖h_1‖ + ‖h_2‖.

(ii) H is complete: every Cauchy sequence {h_n}, that is, one with ‖h_n − h_m‖ → 0 as n, m → ∞, converges to some h ∈ H.
B.10.3.1 Orthogonality and Pythagoras's Theorem

h_1 is orthogonal to h_2 iff (h_1, h_2) = 0. This is written h_1 ⊥ h_2. This is the usual notion of orthogonality in Euclidean space. We then have

Pythagoras's Theorem. If h_1 ⊥ h_2, then

‖h_1 + h_2‖² = ‖h_1‖² + ‖h_2‖².   (B.10.12)

An interesting consequence is the inequality, valid for all h_1, h_2,

|(h_1, h_2)| ≤ ‖h_1‖ ‖h_2‖.   (B.10.13)

In R², (B.10.12) is the familiar "square on the hypotenuse" theorem, whereas (B.10.13) says that the cosine of the angle between x_1 and x_2 is at most 1 in absolute value.
B.10.3.2 Projections on Linear Spaces

We naturally define that a sequence h_n ∈ H converges to h iff ‖h_n − h‖ → 0. A linear subspace L of H is closed iff h_n ∈ L for all n and h_n → h imply h ∈ L. Given a closed linear subspace L of H we define the projection operator Π(· | L) : H → L by: Π(h | L) is that h' ∈ L that achieves min{‖h − h'‖ : h' ∈ L}. It may be shown that Π is characterized by the property

h − Π(h | L) ⊥ h' for all h' ∈ L.   (B.10.14)

Furthermore,

(i) Π(h | L) exists and is uniquely defined.
(ii) Π(· | L) is a linear operator: Π(αh_1 + βh_2 | L) = αΠ(h_1 | L) + βΠ(h_2 | L).
(iii) Π is idempotent, Π² = Π.
(iv) Π is norm reducing:

‖Π(h | L)‖ ≤ ‖h‖.   (B.10.15)

In fact, and this follows from (B.10.12),

‖h‖² = ‖Π(h | L)‖² + ‖h − Π(h | L)‖².   (B.10.16)

Here h − Π(h | L) may be interpreted as the projection on L^⊥ = {h : (h, h') = 0 for all h' ∈ L}. Properties (i)–(iii) of Π above are immediate. All of these correspond to geometric results in Euclidean space. If x is a vector in R^p, Π(x | L) is the point of L at which the perpendicular to L from x meets L. (B.10.16) is Pythagoras's theorem again. If L is the column space of a matrix A_{n×p} of rank p ≤ n, then

Π(y | L) = A(A^T A)^{-1} A^T y.   (B.10.17)
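The projection of (B.10.17) and the identities (B.10.15)–(B.10.16) can be checked directly. A minimal Python sketch (NumPy assumed; the random design matrix and response are illustrative choices) forms the projection matrix A(A^T A)^{-1} A^T and verifies idempotence, norm reduction, and the Pythagoras decomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
A = rng.normal(size=(n, p))                      # full column rank with probability 1
y = rng.normal(size=n)

H = A @ np.linalg.solve(A.T @ A, A.T)            # projection onto the column space (B.10.17)
y_hat = H @ y                                    # Pi(y | L): the fitted values
resid = y - y_hat                                # projection onto the orthogonal complement

print(np.allclose(H @ H, H))                                # idempotent: Pi^2 = Pi
print(np.linalg.norm(y_hat) <= np.linalg.norm(y))           # (B.10.15)
print(np.isclose(y @ y, y_hat @ y_hat + resid @ resid))     # (B.10.16), the ANOVA identity
print(np.allclose(A.T @ resid, 0))                          # residual orthogonal to columns of A
```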
This is the formula for obtaining the fitted value vector Ŷ = (Ŷ_1, ..., Ŷ_n)^T by least squares in the linear regression Y = Aβ + ε, and (B.10.16) is the ANOVA identity. The most important Hilbert space other than R^p is L²(P) = {all random variables X on a (separable) probability space such that EX² < ∞}. In this case we define the inner product by

(X, Y) = E(XY)   (B.10.18)

so that

‖X‖ = [E(X²)]^{1/2}.   (B.10.19)
All properties needed for this to be a Hilbert space are immediate save for completeness, which is a theorem of F. Riesz. Maintaining our geometric intuition we see that, if E(X) = E(Y) = 0, orthogonality simply corresponds to uncorrelatedness and Pythagoras's theorem is just the familiar

Var(X + Y) = Var(X) + Var(Y)

if X and Y are uncorrelated. The projection formulation now reveals that what we obtained in Section 1.4 are formulas for projection operators in two situations:

(a) L is the linear span of 1, Z_1, ..., Z_d. Here

Π(Y | L) = E(Y) + (Σ_{ZZ}^{-1} Σ_{ZY})^T (Z − E(Z)).   (B.10.20)

This is just (1.4.14).

(b) L is the space of all X = g(Z) for some g (measurable). This is evidently a linear space that can be shown to be closed. Here,

Π(Y | L) = E(Y | Z).   (B.10.21)

That is what (1.4.4) tells us.
The identities and inequalities of Section 1.4 can readily be seen to be special cases of (B.10.16) and (B.10.15).

For a fuller treatment of these introductory aspects of Hilbert space theory, see Halmos (1951), Royden (1968), Rudin (1991), or more extensive works on functional analysis such as Dunford and Schwartz (1964).

B.11 PROBLEMS AND COMPLEMENTS
Problems for Section B. I 1. An urn contains four red and four black balls. Four balls are drawn at random without
replacement. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. (a) Find the joint distribution of Z and Y and the conditional distribution of Y given Z and Z given Y.
(b) Find E(Y | Z = z) for z = 0, 1, 2.

2. Suppose Y and Z have joint density function p(z, y) = z + y for 0 < z < 1, 0 < y < 1.

(a) Find E(Y | Z).

(b) Compute EY = E(E(Y | Z)) using (a).

3. Suppose Z_1 and Z_2 are independent with exponential E(λ) distributions. Find E(X | Y) when X = Z_1 and Y = Z_1 + Z_2.

Hint: E(Z_1 + Z_2 | Y) = Y.

4. Suppose Y and Z have the joint density p(z, y) = k(k − 1)(z − y)^{k−2} for 0 < y < z < 1, where k ≥ 2 is an integer.

(a) Find E(Y | Z = z).

(b) Find E(Y e^{[Z + (1/Z)]} | Z = z).
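For Problem 3, symmetry gives E(Z_1 | Z_1 + Z_2) = (Z_1 + Z_2)/2, and this is easy to see by simulation. A minimal Python sketch (NumPy assumed; λ and the conditioning values are illustrative choices) bins on Y = Z_1 + Z_2 and compares the average of Z_1 within a narrow bin with y/2:

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 1.3
z1 = rng.exponential(1 / lam, size=500_000)
z2 = rng.exponential(1 / lam, size=500_000)
y = z1 + z2

for y0 in [1.0, 2.0, 3.0]:
    in_bin = np.abs(y - y0) < 0.02               # condition on Y being close to y0
    print(f"y0={y0}:  mean of Z1 given Y near y0 = {z1[in_bin].mean():.3f}   y0/2 = {y0/2:.3f}")
```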
5. Let (X_1, ..., X_n) be a sample from a Poisson P(λ) distribution and let S_m = Σ_{i=1}^m X_i, m ≤ n.

(a) Show that the conditional distribution of X = (X_1, ..., X_n) given S_n = k is multinomial M(k, 1/n, ..., 1/n).

(b) Show that E(S_m | S_n) = (m/n) S_n.
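Problem 5(a) can be checked by simulation. A minimal Python sketch (NumPy assumed; n, λ, and k are illustrative choices) conditions Poisson draws on their total and compares the conditional mean and variance of X_1 with the multinomial values k/n and k(1/n)(1 − 1/n):

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam, k = 4, 2.0, 8
X = rng.poisson(lam, size=(400_000, n))
keep = X[X.sum(axis=1) == k]                     # condition on S_n = k

x1 = keep[:, 0]
print("conditional mean of X1:", x1.mean(), " multinomial value:", k / n)
print("conditional var  of X1:", x1.var(),  " multinomial value:", k * (1 / n) * (1 - 1 / n))
```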
6. A random variable X has a P(λ) distribution. Given X = k, Y has a binomial B(k, p) distribution.

(a) Using the relation E(e^{tY}) = E(E(e^{tY} | X)) and the uniqueness of moment generating functions, show that Y has a P(λp) distribution.

(b) Show that Y and X − Y are independent and find the conditional distribution of X given Y = y.
7. Suppose that X has a normal N(μ, σ²) distribution and that Y = X + Z, where Z is independent of X and has a N(γ, τ²) distribution.

(a) What is the conditional distribution of Y given X = x?

(b) Using Bayes rule find the conditional distribution of X given Y = y.
8. In each of the following examples:

(a) State whether the conditional distribution of Y given Z = z is discrete, continuous, or of neither type.

(b) Give the conditional frequency, density, or distribution function in each case.

(c) Check the identity E[E(Y | Z)] = E(Y).

(i) p_{(Z,Y)}(z, y) = 1/π if z² + y² ≤ 1, and 0 otherwise.

(ii) p_{(Z,Y)}(z, y) = 4zy if 0 < z < 1, 0 < y < 1, and 0 otherwise.

(iii) Z has a uniform U(0, 1) distribution, Y = Z².

(iv) Z has a U(−1, 1) distribution, Y = Z².

(v) Z has a U(−1, 1) distribution, Y = Z² if Z² < 1/4 and Y = 1/4 if Z² ≥ 1/4.
9. (a) Show that if E(X2) and E(Y2) are finite then Cov(X, Y) = Cov(X, E(Y I X)). (b) Deduce that the random variables X and Y in Problem B.l.8(c) (i) have correlation 0 although they are not independent.
. .
10. (a) If X 1 , . , Xn is a sample from any population and Sm
=
E� 1 Xi, m :S n, show
that the joint distribution of (Xi, Sm) does not depend on i, i < m . Hint: Show that the joint distribution of (X., . . . , Xn) is the same as that of (Xip . . . , xi,. ) where (il, . . . ' in ) is any permutation of (1, . . . ' n) .
(b) Assume that if X and Y are any two random variables, then the family of condi tional distributions of X given Y depends only on the joint distribution of (X, Y). Deduce from (a) that E(X, I Sn) = · · · = E(Xn I Sn) and, hence, that E(Sm I Sn) = (m/n)Sn.
11. Suppose that Z has a binomial B(N, θ) distribution and that given Z = z, Y has a hypergeometric H(z, N, n) distribution. Show that

P[Z = z | Y = y] = \binom{N−n}{z−y} θ^{z−y} (1 − θ)^{N−n−(z−y)}

(i.e., the binomial probability of z − y successes in N − n trials).

Hint: P[Z = z | Y = y] = \binom{N−n}{z−y} θ^z (1 − θ)^{N−z} / b(y), where

b(y) = Σ_z \binom{N−n}{z−y} θ^z (1 − θ)^{N−z} = θ^y (1 − θ)^{n−y}.
Problems for Section B.2 l. If B is uniformly distributed on (- "/2, 1r/2) show that Y � tan B has a Cauchy di stri bution whose density is given by p(y) = l/[1r(l + y2 )). -oo < y < oo. Note that this density coincides with the Student t density with one degree of freedom obtainable from
(B.3.10).
2. Suppose X_1 and X_2 are independent exponential E(λ) random variables. Let Y_1 = X_1 − X_2 and Y_2 = X_2.

(a) Find the joint density of Y_1 and Y_2.

(b) Show that Y_1 has density p(y) = ½λe^{−λ|y|}, −∞ < y < ∞. This is known as the double exponential or Laplace density.

3. Let X_1 and X_2 be independent with β(r_1, s_1) and β(r_2, s_2) distributions, respectively. Find the joint density of Y_1 = X_1 and Y_2 = X_2(1 − X_1).
4. Show that if X has a gamma Γ(p, λ) distribution, then

(a) M_X(t) = E(e^{tX}) = [λ/(λ − t)]^p, t < λ.

(b) E(X^r) = Γ(r + p)/[Γ(p) λ^r], r > −p.

(c) E(X) = p/λ, Var(X) = p/λ².

5. Show that if X has a beta β(r, s) distribution, then

(a) E(X) = r/(r + s).

(b) Var X = rs/[(r + s)²(r + s + 1)].

6. Let V_1, ..., V_{n+1} be a sample from a population with an exponential E(1) distribution (see (A.13.24)) and let S_m = Σ_{i=1}^m V_i, m ≤ n + 1.
(a) Show that T = (V_1/S_{n+1}, ..., V_n/S_{n+1})^T has a density given by

p_T(t_1, ..., t_n) = n! for t_i > 0, 1 ≤ i ≤ n, Σ_{i=1}^n t_i < 1, and 0 otherwise.

Hint: Derive first the joint distribution of (V_1/S_{n+1}, ..., V_n/S_{n+1}, S_{n+1}).

(b) Show that U = (S_1/S_{n+1}, ..., S_n/S_{n+1})^T has a density given by

p_U(u_1, ..., u_n) = n! for 0 < u_1 < u_2 < ··· < u_n < 1, and 0 otherwise.
7. Let
=
1. Suppose
si for each i.
(i) g has continuous first partial derivatives in (ii) g is one to one on each Si. (iii) The Jacobian of g does not vanish on each
Si.
Show that if X has density px , Y = g(X) has density given by r
1
1
py (y) = l:Px (g;-1 (y))IJ•• (gi (Y))I- I,(y) for y E g(ur 1 S,)
i=l
where g; is the restriction of g to S; and l;( y) is I ify E g(S; ) and 0 otherwise. (If 1 li(Y) = 0, the whole summand is taken to be O even though g; is in fact undefined.)
Hint: P[g(X) E B] = L� 1 P[g(X ) E B, X E S;) 8. Suppose that X1 , . . . , Xn is a sample from a populati(Jn with density f. The Xi ar ranged in order from smallest to largest are called the order statistics and are denoted by Xcl), . . . , X(n). Show that Y g(X ) = ( X(l), . . . , X(n) J" has density =
n
py (y) = n! IT f(y;) for y1
i=l
< y, < · - · < Yn
Hint: Let
- {(x, . . . , xn) : x1 < · · · < z.,}, {(x1, . , xn) : x, < x, < ·· · < xn}
-
and so on up to
.
.
Snt · Apply the previous problem.
528
Additional Topics in Probability and Analysis
Appendix B
9. Let X1 , . . , X, be a sample from a unifonn U(O, I ) distribution (cf. (A. 13.29)). .
(a) Show that the order statistics of X
density is given in Problem B.2.6(b).
(X 1,
=
. . .
, Xn) have the distribution whose
(b) Deduce that X(k) has a 11(k, n - k + I ) distribution.
(c) Show that EX(k) = k/(n + 1 ) and Var X(k ) = k(n - k + 1)/(n + 1)2(n + 2). Hint: Use Problem B.2.5.
10. Let X1 , . . . , Xn be a sample from a population with density f and d.f. F. (a)
Show that the (X(r+I)• - . . , X(n) )T is
if X(l) <
···
conditional
density
of
(X(l), . . . , X(n))T
gtven •
< X(r) < X(r+l)·
(b) Interpret this result.
11. (a) Show that if the population in Problem B.2. 10 is U(O, 1 ) , then
(
x( l l- , . . . , x<,J == X(r+I ) X(r+l)
I
(b) Deduce th at X(n) • . . ,
'
,
.
)
r
x,.,
and (X(,+l)• . . . , X(n))r are independent. x,. ,,
x(n-1) , X(n-2) , .
. . ,
x,,, . ' are m . dependent m thIS case. x (1)
12. Let the d.f. F have a density f that is continuous and positive on an interval (a, b) such that F(b) - F(a) = 1 , -oo S a < b < oo. (The results are in fact valid ifwe only suppose
•
I
that F is continuous.)
•
. •
(a) Show that if X has density J, then Y = F(X) is uniformly distributed on (0, 1).
I I
(b) Show that if U � U(O, 1 ), then F- 1 (U) has density
f.
(c) Let U(l) < < U(n) be the order statistics of a sample of size n from a U(O, 1) population. Show that then F-1 (U( 1J) < < F- 1 (U(n)) are distributed as the order statistics of a sample of size n from a population with density f. · · ·
· · ·
13. Using Problems B.2.9(b) and B.2.12 show that if X(k) is the kth order statistic of a
sample of size n from a population with density f, then
• ' . .
,
•
1
14. Let X(l)> . . . , X(n) be the order statistics of a sample of size n from an £(1) population.
II '
Show that nX(l)• (n - l)(X(2) - X(l)), (n - 2) (X(3) - X(2J) , . . . , (X(n) - X(n - 1)) are independent and identically distributed according to £(1 ). Hint: Apply Theorem B.2.2 directly to the density given by Problem B.2.8.
l !
Section 8.11
529
Problems and Complements
15. Let Tk be the time of the kth occurrence of an event in a Poisson process as in (A.I6.4).
(a) Show that Tk has a r(k, >,) distribution. (b) From the identity of the events, [N ( l) < k - lI
100 "
9k , l(s)ds =
k-2:1 ..,e--" ).)
,�
. 0
J.
=
[Tk
>
1 I, deduce the identity
.
Problems for Section B.3
1. Let X and Y be independent and identically distributed N(O,
..;X?J+P. Show that 8 is unifonnly distributed on ( - � , }).
(b) Let 8 = sin - 1
(c) Show that X/Y has a Cauchy distrihotion. Hint: Use Problem B.2.1.
2. Suppose that Z r ( � k, � k), k > 0, and that given Z = z, the conditional distribution of Y is N(O, z- I ) Show that Y has a T. distribution. When k = 1, this is an example where E(E(Y I Z)) = 0, while E(Y) does not exist. 3. Show that if Z1 , , Zn are as in the statement of Theorem B.3.3, then ""' .
•
.
.
n
y'ri(Z - !')/ l:(Z; - ZJ 2 /(n - 1) i= 1
has a Tn- 1 distribution.
4. Show that if XI, . . . ' Xn are independent £(>.) random variables, then T = has a X�n distribution. Hint: First show that 2.-\Xi has a r ( 1, �) = X� distribution.
2). E� 1 xi
X1, . . . , Xm; Y1 , , Yn are independent £ ().) random variables, then S = (n/m) (2:;" I X; ) / (2:; I lj) has a F2m,2n distribution.
5. Show that if
. • .
6. Suppose that X1 and X2 are independent with r(p, 1) and r (p+ �. 1) distributions. Show that Y = 2.,jXIX2 has a r(2p, 1) distribution. Suppose X has density p that is symmetric about 0; that is, Show that E(Xk) = 0 if k is odd and the kth moment is finite. 8. Let X � N(I', u2 ). 7.
(a) Show that the rth central moment of X is E (X - I')'
-
r even
- 0,
T odd.
p(x)
=
p( -x) for all x.
Appendix B
'
'
530
I'
Additional Topics in Prob.ability and Analysis
(b) Show the rth cumulant C is xero for r > 3. r Hint: Use Problem B.3.7 for r odd. For r even set m = r/2 and note that because Y = ](X - J,t)/o-]2 has a xi distribution, we can find E(Y'") from Problem 8.2.4. Now use E(X - J,t)' = o-r E(Ym ),
9. Show that if X � 7,., then
for r even and r < k. The moments do not exist for r 2: k, the odd moments are zero when r < k. The mean of X is 0, for k > 1, and Var X = kf(k - 2) for k > 2. Hint: Using the notation of Section 8.3, for r even E(Xr) = E(Qr) = k�rE(Zr)E(V-!r), where 8.3.7.
I '
10. Let X
! '
�
Z
�
N(0, 1) and V
� X� ·
Now use Problems
8.2.4
and
Fk,m • then
provided - � k < r < � m. For other r, E(Xr) does not exist. When m m/(m - 2), and when m > 4, Var X =
2m2 (k + m - 2) k(m - 2)2(m - 4)
>
2, E(X)
=
.
Hint: Using the notation of Section B.3, E(Xr) = E(Q") = (m/kY E(Vr)E(W-r), where V rv x� and W t">.J x!. Now use Problem B.2.4. 11. Let X have a N( 0, 1) distribution.
(a) Show that Y
=
X2 has density
py(y) =
1 2.fii'Y
' e-l(>+9 )(e0v'Y + e-0v'Y),
y > 0.
This density corresponds to the distribution known as noncentral x2
dom and noncentrality parameter fP .
with 1 degree offree
(h) Show that we can write
py (y)
R
where formula.
,......
P ( � 82 )
and fm is the
=
••
;
,
=
� P (R = i)j,i+I(Y) 't=l
x� density.
Give a probabilistic interpretation of this
Hint: Use the Taylor expansions for e0../fi and e-o.,;y in powers of yfij.
!
I ·1''
1
'
i
Section B.ll
531
Problems and Complements
12. Let X1 , . . . , Xn be independent normal random variables each having variance 1 and E(Xi) = Bi , i 1, . . . , n, and let 02 = L:� 1 B[. Show that the density of V = L:� 1 Xl is given by =
Pv (v) = L P(R = i)f2i+n(v), v > 0 i=O where R P (�fP) and frn is the x:n density. The distribution of V is known as the noncentral x2 with n degrees offreedom and (noncentrality) parameter 82. Hint: Use an orthogonal transformation Y = AX such that Y1 = L� 1 (BiXi/8). Now V has the same distribution as L� 1 J!i2 where Y1 , . . . , Yn are independent with vari ances I and E(Y1 ) = 0, E(Y;) = 0, i = 2 , . . . , n. Next use Problem B.3. 1 1 and ""'
roo Pv (v) = n J0
00
L P(R = i)j,i+l (v - s) fn-l (s)ds. i=O
13. Let X1, • . . ,Xn be independentN(O, 1) random variables and let V = (X1 + ej2 + L� 2 Xi2 • Show that for fixed v and n, P(V > v) is a strictly increasing function of 82. Note that V has a noncentral x� distribution with parameter 82. 14. Let V and W be independent with W x_2mand V having a noncentral xi distribution .......,
with noncentrality parameter 02• Show that S = (V/k)/(W/m) has density
Ps (s)
=
00
L P(R = i)fk+2i, m(s) i=O
where R "' P (�02) and /j, m is the density of Fi,m· The distribution of S is known as the noncentral Fk,m distribution with (noncentrality) parameter 02. 15. Let X1, • . . , Xn be independent normal random variables with - common mean and 2 = I:,�/ X; - X(m))2 . vanance. Define X(m) = (1/m) L::,� 1 X;, and Sm -
.
rn
·
rn
(a) Show that
(b ) Let
n-1 n Show that the matrix A defined by Y = AX is orthogonal and, thus, satisfies the require ments of Theorem B.3.2.
(c) Give the joint density of (X(n), �, . . . , S�)T .
'
' ' ' '
532
Additional Topics in Probability and Analysis
Appendix B
16. Show that under the assumptions of Theorem B.3.3, Z and (Z1 - Z, . . . , Zn - Z) are independent. Hint: It suffices to show that Z is independent of (Z2 - Z� . . . , Zn - Z). This provides another proof that Z and L� 1 ( Zi - Z) 2 are independent.
Problems for Section B.4 I. Let (X, Y)
�
N ( 1, 1 , 4, 1 , � ). Find
(a) P(X + 2Y < 4). (b) P(X < 2 I y = 1). (c) The joint distribution of X +
2Y and 3Y - 2X. Let (X, Y) have a N(!'l> p,2, u�, u�, p) distribution in the problems 2--{;, 9 that follow. 2. Let Fh ·, J..l. t , J..L2, cr�, cr�, p) denote the d.f. of (X, Y). Show that
I
•
( X - p,1 Y - p,2 ) , (Tl 0"2
has a N(O, 0, 1, 1, p) distribution and, hence, express F( ·, · , ttl , J.l2, af, cr�, p) in terms of F(·, · , 0, 0, 1 , 1,p) .
3. Show that X + Y and X - Y are independent, if and only if, crf
4. Show that if cr1cr2
=
u�.
> 0, IPI < 1, then
•
has a � distribution. Hint: Consider ( U1 , U2 ) defined by (B.4.19) and (B.4.22).
5. Establish the following relation due to Sheppard.
F(O, 0, 0 , 0, 1, 1, p) =
! + (1/21T) sin -l
p.
Hint: Let U1 and U2 be as defined by (B.4.19) and B.4.22, then
P[X < 0, Y < OJ
P[U1 < O, pU1 + y'1 - p2U2 < OJ p - p u, < 0, u2, > r.�=;; u y'1 - p2 =
6. The geometry of the bivariate normal surface.
c}. Suppose that cri = u� . Show that {S,; c > 0} is a family of ellipses centered at (l't. p,2) with common major axis given by (Y-1'2 ) = (a) Let S,
=
{(x, y) : P( X,Yj(X, y)
=
;
'
i
Section
B.ll
Problems
and
533
Complements
(x -111) if p > 0, (y -1'2) � - (x - 1'1 ) if p < 0. If p � 0, {So} is a family of concentric
circles.
(b) If x
=
(
c,
y) is proportional to a normal density as a function of y. That is, sections of the surface z = px_ (x, y) by planes parallel to the (y, z ) plane are proportional c, Px
to Gaussian (normal) densities. This is in fact true for sections by any plane perpendicular to the y) plane.
(x,
(c) Show that the tangents to So at the two points where the line y � 1'2+p(a2/ cr,)(x J..t d intersects Sc are vertical. See Figure B.4.2.
(X1, YI), . . . , (X., Yn) be a sample from a N(I'I . l-'2 • cr� , aJ, p) � N(l-', !:.) dis2 tribution. Let X � (1/n) 2::� 1 X, , Y � (1/n) 2::� 1 Y; , Sf � 2::� 1 (X, - X) , 2 2 " " "" S2 � L..i�1 (Y; - Y) • S12 � L..i�1 (X, - X)(Y; - Y). (a) Show that n(X - 1'1 . Y - I'2)T !:.-1 (X - I' I. Y - ll-2) has a x� distribution. (b) Show that (X, Y) and (8�, SJ, 812) are independent. 7. Let
Hint: (a): See Problem B.4.4.
(b): Let A be an orthogonal matrix whose first row is
(n- i , . . .
-
,n
�).
Let
U
(X1, . . . , Xn)T and Y � (Y1 , . . . , Ynf· Show that ) form a sample from aN(O, 0, o-i, a-�, p) population. Note that Sf V (U2, V2), . . . , (Un, n n '"' 822 � n '"' n · ;;;; while � 812 X � y n. UI /v'n. u,v;, � Vt/v v,-. �2 Li�2 Li �2 Li u,-, 8. In the model of Proplern B.4.7 let R � 812/8182 and
AX and V
� AY, where X �
=
=
y'(n - 2)R
T � �c==� J1 - R2 . (a) Show that wpen p = 0, T has a Tn_2 distribution. (b) Find the density of R if p � 0. Hint: Without loss of generality, take
cr� � cr� � 1. Let C be an (n - 1) x (n (U2, . . . , Un)/S1· Define (W2, . . . , Wnf �
1) orthogonal matrix whose first row is C(V,, . . and show that T can be written in the form T � L/M where L � 2 f� � and M2 � 2). Argue that given are, T has a Yn 2 distribution. no matter what = = Now use the continuous version of (B.l .24).
.
, Vn)T 812/S1 W2 u2 U2, . . . 'Un
(S�SJ - 8f2)/(n - )S L��3 Wt/(n u2, . . . ) Un Un,
9. Show that the conditional distribution of aX + bY given eX + dY = t is normal. Hint: Without loss of generality take a � d � 1, b � c � 0 because (aX + bY, eX +
dY) also has a bivariate normal distribution. IPI �
Deal directly with the cases
1.
Ut0"2
=
0 and
10. Let p1 denote the N(O, 0, 1 , 1, 0) density and let P2 be the N(O, 0, 1, 1, p) density.
Suppose that
(X, Y) have the joint density 1
1
p(x, y) � 2 pi (x, y) + 2 pz(x, y).
I:I
534
Additional Topics in Probability and Analysis
Show that
X
and only if, p
Appendix B
and Y have normal marginal densities, but that the joint density is normal, if
= 0.
11. Use a construction similar to that of Problem B.4.1 0 to obtain a pair of random variables (X, Y) that (i) have marginal normal distributions. (ii)
are
uncorrelated.
(iii)
are
not independent.
Do these variables have a bivariate normal distribution?
'
,
I
Problems for Section B.5
'
I. Establish (B.S.IO) and (B.S. II). 2. Let akxi and
,
Bkxk be nonrandom . Show that
I and • •
3. Show that if Mu (t) is well defined in a neighborhood of zero then =
I '
Mu (t) = 1 + L
/
II
p= l
where J.Li1 ···i,. =
p, p =
E(U;1
p.
.••
i,. t! 1
· · · U�,. ) and the sum is over all
1, 2, . . . . Moreover,
That is, the Taylor series for
4.
�J.ti1
•
•
•
t�,.
(i 1 , . . , ik) with ij > 0. E;=I ii .
Ku converges in a neighborhood of zero.
Show that the second- and higher-degree cumulants (where p =
�=1 ii
invariant under shift; thus. they depend only on the nioments about the mean.
=
2: 2) are
l I
'
5. Establish (B.5.!6}-{B.5.19). 6. a�
In the bivariate case write =
aoz·
Show that
1J.
•
=
E(U),
u;;
-
E(U, - M,)'(U2 - !-'2)i , ul
•
-
l
I
'
• •
' , '
•
-
i
' •
Section B.ll
-------�-------------�535 �
���--
Problems and Complements
and
(C40, CQ4, C221 C31, C1 3 ) (a40 - 3af, O"o4 - 3a?, a22 - a-fa� - 20'u , a:n - 3trf0'1 1 , 0"13
=
V. W. and Z are independent and that U1
7. Suppose
= Z+
-
V and U2 =
3a�au ). Z+
W.
Show
that
Mu (t) = Mv(t,)Mw(t,)Mz(tr + t,) Ku(t) = Kv(tr) + Kw(t,) + Kz(tr + t,) and show that c;1 (U)
8.
=
c,+1(Z) for i ,<
j; i,j > 0.
Suppose U
(The bivariate log normal distribution) .
N(Jl.I ,Ji-2, af, a� , p)
distribution. Then Y
bivariate log nonnal distribution.
E(Y,' Yi) where
=
= (Yi , f2)T = (eu\eu2)T
9. (a) Suppose Z is N(p., E). P = L�=I ii > 2) are zero.
has a bivariate
is said to have a
Show that
{
�
exp if.l. r + i1J.2 + i2
= 0"1 a2p.
an
= (U1, U2)T
+
ij
�j'O"�}
Show that all cumulants of degree higher than
2
(where
(b) Suppose U 1 , . . . , Un are i.i.d. as U. Let Zn = n- l I;� 1 (U; p.). Show that Kz. (t) = nKu(n-it) - ni p.T t and that all cumulants of degree higher than 2 tend to zero as n --+ oo. -
10.
Suppose
Ukxl
and
vmxl
are independent and
I = {h, . . . , i k } and J = {ik+IJ . . . , ik.,_} unless either I = {0, . . . , 0} or J = {0, . . . , 0}. where
z(k+m) x l
=
(UT, VT)T. Let CI ,J
bea cumulantofZ. Show that CI,J
Problems for Section B.6 I. (a)
Suppose
pendent
U
=
U,
N(O, 0'2)
(U1,
•
.
.
,
= fJ. + aZ; + (JZ,_,
i
=
!, . .
.
, k, where
Z0, . . . , z.
¥0
are inde
random variables. Compute the expectation and covariance matrix of
u. ). Is U k·variate normal?
(b) Perform the same operation and answer the same question for Ui defined as follows: U1 = Z1 , U2 = Z2 + aUl ! U3
=
Z3 + aU2 , - . . , Uk = Zk + o:Uk-1·
2. Let U be as in Definition B.6.1. Show that if :E is not positive definite, then U does not have a density. has positive definite variance :E. Let ugll and u[�� X l be a partition I) 2 of U with variances �11. �22 and covariance :E12 = Cov(U( I ) , u< l)lx(k -l ) · Show that
3.
Suppose
Ukxl
1 2 2 Cov(E12E 22 UC l, uCil - E12E22UC l) = 0.
536
' I'
Add"1tional Topics in Probability and Analysis
•
Problems for Section B.7
ll •
Appendix B
1. Prove Theorem B.7.3 for d = 1 when Z an Zn have continuous distribution functions F
' '
and Fn . Hint: Let U denote a unifonn, U(O, 1 ), random variable. For any d.f. G define the left inverse by Now define Z� = F;;1 (U) and z• = p-1(U ) . = inf{t : G(t) >
••
' •
u}.
a-1(u)
2. Prove Proposition B.?. l (a). 3. Establish (B.7.8).
I
4. Show that if Zn £ zo, then P(IZn - zo l > <)
'
Hint: Extend (A.14.5).
�
P(IZ - zol > <). '
5. The Lp nonn of a random vector X is defined by IXI P = { EIXIP} , , p > I. The sequence of random variables {Zn} is said to converge to Z in L norm if IZn - ZI P � 0 P as
n
-+ oo.
We write Zn
� z. Show that
Lq
Lp
(a) if p < q, then Zn � Z => Zn � Z.
'
I
Hint: Use Jensen's inequality B.9.3.
I
(b) 1f Zn Lp � Z, then Zn � Z. Hint:
il '
p
.
'
E!Zn - ZIP 2: EfiZn - ZIPI{IZn - Zl 2: <}] > EP P(!Zn - Zl 2: E).
!
i i
6. Show that IZn - Zl � 0 is equivalent to Zn; � Z; for I :S j < d. Hint: Use (B.7.3) and note that !Zn; - Z;l2 < IZn - Zl2 U(O, 1) and let U1 = I, U2 = I{U E (0, �)}, Ua = I{U E (�, I)}, U4 = I{U E [o, m. Us = I{U E (hm, . , Un = I{U E (m2 -•, (m + l)T•)},
7.
Let U
�
. .
where n =
8. Let U L,
a.s.
p > m + 2•, 0 < m < 2• and k 0. Show that Un � 0 but Un + 0.
�
U(O, I) and set Un = 2n 1{U E [0, �)}. Show that Un �· 0, Un � 0, but
Un + 0, p > I, where L is defined in Problem B. 7.5. P 9. Establish B.7.9. 10. Show that Theorem B.7.5 implies Theorem B.7.4.
. ••
.'j
.j '
1
.I' ,
'
' ' ·
I ,
•
�
11. Suppose that as in Theorem B.7.6, Fn(x) F(x) for all x, F is continuous, and strictly increasing so that p-l (n) is unique for all 0 < n < 1. Show that
sup{/ F;; 1 (a) - F- 1(a)/ : < < a < I - E} � 0 for all ' > 0. Here F;; 1 (a) = inf{x : Fn(x) > a } . Hint: Argue by contradiction.
I
'
l
Section
B.ll
537
Problems and Complements
Problems for Section B.S 1. If h
: Rd � for izl <
RP and
h(x)
=
h(xo + z) Here the integral is a p
Hint: Let g(u)
11
h : Rd for i z l < <1,
2. If
=
x
Dh(x)
=
is continuous in a sphere
h(xo) +
(1 1
)
h( x0 + uz)zdu zT
d matrix of integrals.
h(x0 + uz). Then by the chain rule, g(u) = h(Xo + uz)z and
h(xo + uz)zdu =
[
•
g(u)du = g(l) - g(O) = h(xo + z) - h(xo).
ii(x) = D2 h(x) is continuous in a sphere { x : l x - xo l < <1}, then
� R and
h(xo + z) = h(xo) + h(xo)z + zT Hint:
{x : lx - Xoi < <1}. then
Apply Problem B.S. I to
[11 11
]
ii(xo + uvz)vdudv z .
h(xo + z) - h(xo) - h(Xo)z . •
3. Apply Problems B.S.! and B.S.2 to obtain special cases of Taylor's Theorem B.8.12.
Problems for Section B.9 1.
State and prove Jensen's inequality for conditional expectations.
2. Use Hoeffding's inequality (B.9.6) to establish Bernstein's inequality (B.9.5). if p = � . the hound can be improved to 2 exp{ -2n/,2 }.
Show that
2. IYr, ; + ! = 1.
3. Derive HOlder's inequality from Jensen's inequality with k = Hint: For (x,y)
4.
Show that if k
R2, consider g(x,y)
E
= 1
=
and g"(x) exists, then
equivalent.
lxr +
g" (x) > 0,
5. Show that convexity is equivalent to the convexity of
a E [0, 1] for all x and y in S. 6.
Use Problem
7.
Show that if
all
x E
S, and convexity are
g (ax + (1 - a)y)) as function of
5 above to generalize Problem 4 above to the case k > 1.
.
/. / g2 (x) exists and the matrix I /.x, a"x,. g2 (x) I is positive definite, then x,
g is strictly convex.
x.,
8. Show that
P(X > a)
<
inf{ e-ta Ee'x
Hint: Use inequality (A. l5.4).
9. Use Problem 8 above to prove Bernstein's inequality.
:
t � 0}.
10. Show that the sum of (strictly) convex functions is (strictly) convex.

Problems for Section B.10

1. Verify that if A is snd, then A is pd iff A is nonsingular.

2. Show that if S is the one-dimensional space S = {ae : a ∈ R} for e orthonormal, then the projection matrix onto S given by (B.10.17) is just ee^T.

3. Establish (B.10.15) and (B.10.16).

4. Show that h − Π(h | L) = Π(h | L^⊥) using (B.10.14).

5. Establish (B.10.17).

B.12 NOTES

Notes for Section B.1.2

(1) We shall follow the convention of also calling E(Y | Z) any variable that is equal to g(Z) with probability 1.
Notes for Section B.1.3

(1) The definition of the conditional density (B.1.25) can be motivated as follows: Suppose that A(x), A(y) are small "cubes" with centers x and y and volumes dx, dy, and p(x, y) is continuous. Then P[X ∈ A(x) | Y ∈ A(y)] = P[X ∈ A(x), Y ∈ A(y)]/P[Y ∈ A(y)]. But P[X ∈ A(x), Y ∈ A(y)] ≈ p(x, y) dx dy, P[Y ∈ A(y)] ≈ p_Y(y) dy, and it is reasonable that we should have p(x | y) ≈ P[X ∈ A(x) | Y ∈ A(y)]/dx ≈ p(x, y)/p_Y(y).
Notes for Section B.2

(1) We do not dwell on the stated conditions of the transformation Theorem B.2.1 because the conditions are too restrictive. It may, however, be shown that (B.2.1) continues to hold even if f is assumed only to be absolutely integrable in the sense of Lebesgue and K is any member of B^k, the Borel σ-field on R^k. Thus, f can be any density function and K any set in R^k that one commonly encounters.
Notes for Section B.3.2

(1) In deriving (B.3.15) and (B.3.17) we are using the standard relations [AB]^T = B^T A^T, det[AB] = det A det B, and det A = det A^T.
Notes for Section B.5

(1) Both m.g.f.'s and c.f.'s are special cases of the Laplace transform ψ of the distribution of U, defined by ψ(z) = E[exp(z^T U)], where z is in the set of k-tuples of complex numbers.
B.13 REFERENCES
ANDERSON, T. W., An Introduction to Multivariate Statistical Analysis. New York: J. Wiley & Sons, 1958.

APOSTOL, T., Mathematical Analysis, 2nd ed. Reading, MA: Addison-Wesley, 1974.

BARNDORFF-NIELSEN, O. E., AND D. R. COX, Asymptotic Techniques for Use in Statistics. New York: Chapman and Hall, 1989.

BILLINGSLEY, P., Probability and Measure, 3rd ed. New York: J. Wiley & Sons, 1979, 1995.

BIRKHOFF, G., AND S. MACLANE, A Survey of Modern Algebra, rev. ed. New York: Macmillan, 1953.

BIRKHOFF, G., AND S. MACLANE, A Survey of Modern Algebra, 3rd ed. New York: Macmillan, 1965.

BREIMAN, L., Probability. Reading, MA: Addison-Wesley, 1968.

CHUNG, K. L., A Course in Probability Theory. New York: Academic Press, 1974.

DEMPSTER, A. P., Elements of Continuous Multivariate Analysis. Reading, MA: Addison-Wesley, 1969.

DIEUDONNÉ, J., Foundations of Modern Analysis, Vol. 1, Pure and Applied Mathematics Series, Vol. 10. New York: Academic Press, 1960.

DUNFORD, N., AND J. T. SCHWARTZ, Linear Operators, Vol. 1, Interscience. New York: J. Wiley & Sons, 1964.

FELLER, W., An Introduction to Probability Theory and Its Applications, Vol. II, 2nd ed. New York: J. Wiley & Sons, 1971.

GRIMMETT, G. R., AND D. R. STIRZAKER, Probability and Random Processes. Oxford: Clarendon Press, 1992.

HALMOS, P. R., An Introduction to Hilbert Space and the Theory of Spectral Multiplicity, 2nd ed. New York: Chelsea, 1951.

HAMMERSLEY, J., "An extension of the Slutsky–Fréchet theorem," Acta Mathematica, 87, 243–247 (1952).

HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13–30 (1963).

LOÈVE, M., Probability Theory, Vol. I, 4th ed. Berlin: Springer, 1977.

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROCKAFELLAR, R. T., Convex Analysis. Princeton, NJ: Princeton University Press, 1970.

ROYDEN, H. L., Real Analysis, 2nd ed. New York: Macmillan, 1968.

RUDIN, W., Functional Analysis, 2nd ed. New York: McGraw-Hill, 1991.

SKOROKHOD, A. V., "Limit theorems for stochastic processes," Th. Prob. Applic., 1, 261–290 (1956).
Appendix C

TABLES

Table I. The standard normal distribution. Entries are Pr(Z ≤ z); the left column gives z to one decimal place and the top row gives the second decimal place of z.
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.0
0.5000
0.5040
0.5080
0.5120
0.5160
0.5199
0.5239
0.5279
0.53!9
0.5359
0. 1
0.5398
0.5438
0.5478
0.5517
0.5557
0.5596
0.5636
0.5675
0.5714
0.5753
0.2
0.5793
0.5832
0.5871
0.5910
0.5948
0.5987
0.6026
0.6064
0.6103
0.6141
0.3
0.6179
0.62 1 7
0.6255
0.6293
0.6331
0.6368
0.6406
0.6443
0.6480
0.6517
0.4
0.6554
0.6591
0.6628
0.6664
0.6700
0.6736
0.6772
0.6808
0.6844
0.6879
0.5
0.6915
0.6950
0.6985
0.7019
0.7054
0.7088
0.7123
0.7157
0.7190
0.7224
0.6
0.7257
0.7291
0.7324
0.7357
0.7389
0.7422
0.7454
0.7486
0.75 1 7
0.7549
0.7
0.7580
0.76 1 1
0.7642
0.7673
0.7704
0.7734
0.7764
0.7794
0. 7823
0.7852
0.8
0.7881
0.7910
0.7939
0. 7967
0.7995
0.8023
0.805 1
0.8078
0.8106
0.8133
0.9
LO
0.8159
0.8186
0.8212
0.8238
0.8264
0.8289
0.8315
0.8340
0.8365
0.8389
0.8413
0.8438
0.8461
0.8485
0.8508
0.8531
0.8554
0.8577
0.8599
0.8621
l.l
0.8643
0.8665
0.8686
0.8708
0.8729
0.8749
0.8770
0.8790
0.8810
0.8830
L2
0.8849
0.8869
0.8888
0.8907
0.892j
0.8944
0.8962
0.8980
0.8997
0.9015
!.3
0.9032
0.9049
0.9066
0.9082
0.9099
0.9 1 1 5
0.9 131
0.9147
0.9162
0.9177
1.4
0.9192
0.9207
0.9222
0.9236
0.9251
0.9265
0.9279
0.9292
0.9306
0.9319
1.5
0.9332
0.9345
0.9357
0.9370
0.9382
0.9394
0.9406
0.9418
0.9429
0.9441
1.6
0.9452
0.9463
0.9474
0.9484
0.9495
0.9505
0.9515
0.9525
0.9535
0.9545
1.7
0.9554
0.9564
0.9573
0.9582
0.9591
0.9599
0.9608
0.9616
0.9625
0.%33
1.8
0.9641
0.9649
0.9656
0.9664
0.9671
0.9678
0.9686
0.9693
0.9699
0.9706
1.9
0.97 !3
0.9719
0.9726
0.9732
0.9738
0.9744
0.9750
0.9756
0.9761
0.9767
2.0
0.9772
0.9778
0.9783
0.9788
0.9793
0.9798
0.9803
0.9808
0.9812
0.9817
2.1
0.9821
0.9826
0.9830
0.9834
0.9838
0.9842
0.9846
0.9850
0.9854
0.9857
2.2
0.986 1
0.9864
0.9868
0.9871
0.9875
0.9878
0.988 1
0.9884
0.9887
0.9890
2.3
0.9893
0.9896
0.9898
0.9901
0.9904
0.9906
0.9909
0.99 1 1
0.99l3
0.9916
2.4
0.9918
0.9920
0.9922
0.9925
0.9927
0.9929
0.9931
0.9932
0.9934
0.9936
2.5
0.9938
0.9940
0.994 1
0.9943
0.9945
0.9946
0.9948
0.9949
0.9951
0.9952
2.6
0.9953
0.9955
0.9956
0.9957
0.9959
0.9960
0.996 1
0.9962
0.9%3
0.9%4
2.7
0.9965
0.9966
0.9967
0.9968
0.9969
0.9970
0.9971
0.9972
0.9973
0.9974
2.8
0.9974
0.9975
0.9976
0.9977
0.9977
0.9978
0.9979
0.9979
0.9980
0.9981
2.9
0.998 1
0.9982
0.9982
0.9983
0.9984
0.9984
0.9985
0.9985
0.9986
0.9986
3.0
0.9987
0.9987
0.9987
0.9988
0.9988
0.9989
0.9989
0.9989
0.9990
0.9990
3.1
0.9990
0.9991
0.9991
0.�99 1
0.9992
0.9992
0.9992
0.9992
0.9993
0.9993
3.2
0.9993
0.9993
0.9994
0.9994
0.9994
0.9994
0.9994
0.9995
0.9995
0.9995
3.3
0.9995
0.9995
0.9995
0.9996
0.99%
0.9996
0.9996
0.9996
0.99%
0.9997
3.4
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9998
Table entry is probability at or below z.
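Entries in this and the following tables can be reproduced with any statistical library. A minimal Python sketch (scipy.stats assumed to be available) computes one of the tabulated normal probabilities and representative t, χ², and F critical values:

```python
from scipy import stats

# Table I: Pr(Z <= z) for the standard normal.
print(round(stats.norm.cdf(1.96), 4))               # 0.9750

# Table II: t critical value with right tail probability p and df degrees of freedom.
print(round(stats.t.ppf(1 - 0.025, df=10), 3))      # 2.228

# Table III: chi-square critical value.
print(round(stats.chi2.ppf(1 - 0.05, df=5), 2))     # 11.07

# Table IV: F critical value with numerator df r1 and denominator df r2.
print(round(stats.f.ppf(1 - 0.05, dfn=3, dfd=10), 2))  # 3.71
```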
Table I'. Auxiliary table of the standard normal distribution. In each pair of rows, the first row gives the upper tail probability Pr(Z ≥ z) and the second row gives the corresponding value of z.

Pr(Z ≥ z):  .50    .45    .40    .35    .30    .25    .20     .15     .10
z:          0      .126   .253   .385   .524   .674   .842    1.036   1.282

Pr(Z ≥ z):  .09    .08    .07    .06    .05    .04    .03     .025
z:          1.341  1.405  1.476  1.555  1.645  1.751  1.881   1.960

Pr(Z ≥ z):  .02    .01    .005   .001   .0005  .0001  .00005  .00001
z:          2.054  2.326  2.576  3.090  3.291  3.719  3.891   4.265

Entries in the top row of each pair are areas to the right of the values in the row below.
[Figure: standard normal density, with central area .8 and area .1 in each tail.]
Table II. t distribution critical values. Entries are the values t with Pr(T ≥ t) = p; the top row gives the right tail probability p and the left column gives the degrees of freedom df.
.25
.10
.05
.025
Right tail probability p .02
.0 I
.005
.0025
.001
.0005
l
3.078
6.314
1 2.71
15.89
3 1 .82
63.66
127.3
318.3
636.6
2
LOOO
0.8 1 6
!.886
2.920
4.303
4.849
6.965
9.925
14.09
22.33
3 1 .60
3
0.765
1 .638
2.353
3.182
3.482
4.541
5.841
7.453
10.21
12.92
1.533
2.!32
2.776
2.999
3.747
4.604
5.598
7.173
8.610
1.476
2.015
2.571
2.757
3.365
4.032
4.773
5.893
6.869
1.943
2.447
2.612
3.143
3.707
4.317
5.208
5.959
5
0.74 I
6
0.718
7
0.7 1 1
L440
1.415
1.895
2.365
2.517
2.998
3.499
4.029
4.785
5.408
8
0.706
!.397
1.860
2.306
2.449
2.896
3.355
3.833
4.501
5.041
9
0.703
1.383
1.833
2.262
2.398
2.821
3.250
3.690
4.297
4.781
10
0.700
1.372
1.812
2.228
2.359
2.764
3.!69
3.581
4.144
4.587
11
0.697
1.363
1 .796
2.201
2.328
2.718
3. !06
3.497
4.025
4.437
12
0.695
1.356
1.782
2.179
2.303
2.681
3.055
3.428
3.930
4.318
13
0.694
1.350
1.771
2.160
2.282
2.650
3.012
3.372
3.852
4.221
14
0.692
1.345
1.761
2.145
2.264
2.624
2.977
3.326
3.787
4.140
15
0.691
1.341
1.753
2.131
2.249
2.602
2.947
3.286
3.733
4.073
16
0.690
1 .337
1 . 746
2.120
2.235
2.583
2.921
3.252
3.686
4.015
17
0.689
1.333
1.740
2.110
2.224
2.567
2.898
3.222
3.646
3.965
18
0.688
1 .330
1.734
2.101
2.214
2.552
2.878
3.197
3.610
3.922
19
0.688
1 .328
1.729
2.093
2.205
2.539
2.861
3.174
3.579
3.883
20
0.687
1.325
1.725
2.086
2.197
2.528
2.845
3.153
3.552
3.850
21
0.686
1.323
1.721
2.080
2.189
2.518
2.831
3.135
3.527
3.819
22
0.686
1.321
1.717
2.074
2.183
2.508
2.819
3.1 19
3.505
3.792
4
0.727
23
0.685
1.319
1.714
2.069
2.177
2.500
2.807
3.104
3.485
3.768
24
0.685
1 .3 1 8
1 .7 1 1
2.064
2.172
2.492
2.797
3.091
3.467
3.745
25
0.684
1.316
1.708
2.060
2.167
2.485
2.787
3.078
3.450
3.725
30
0.683
1.310
1.697
2.042
2.147
2.457
2.750
3.030
3.385
3.646
40
0.681
1.303
1.684
2.02 1
2.123
2.423
2.704
2.971
3.307
3.551
50
0.679
1.299
1 .676
2.009
2.109
2.403
2.678
2.937
3.261
3.496
60
0.679
1 296
1.671
2.000
2.099
2.390
2.660
2.915
3.232
3.460
100
0.677
1 .290
1.660
1.984
2.08 1
2.364
2.626
2.871
3.174
3390
1 000
0.675
1.282
1.646
1.%2
2.056
2.330
2.581
2.813
3.098
3.300
0.674
1.282
1 .645
1 .960
2.054
2326
2.576
2.807
3.090
3.291
50%
80%
90%
95%
96%
98%
99%
99.5%
99.8%
99.9%
00
Confidence level C
The entries in the top row are the probabilities of exceeding the tabled values. The left column gives the degrees of freedom.
•
�-"
•
545
Tables
Appendix C
Table III. χ² distribution critical values. Entries are the values x with Pr(χ² ≥ x) = p; the top row gives the right tail probability p and the left column gives the degrees of freedom df.
.25
.10
.05
.025
Right tail probability p .02
.0 I
.005
.0025
.001
.0005
I
1.32
2.71
3.84
5.02
5.41
6.63
7.88
9.14
!0.83
12.12
2
2. 77
4.61
5.99
7.38
7.82
9.21
10.60
1 1 .98
13.82
15.20
3
4. 1 1
6.25
7.81
9.35
9.84
1 1 .34
12.84
14.32
16.27
17.73
4
5.39
7.78
9.49
1 1 . 14
1 1 .67
13.28
14.86
16.42
18.47
20.00
5
6.63
9.24
1 1 .07
12.83
1 3.39
15.09
16.75
18.39
20.52
22. 1 1
6
7.84
10.64
12.59
14.45
15.03
16.81
18.55
20.25
22.46
24.!0
7
9.04
12.02
14.07
1 6.01
16.62
18.48
20.28
22.04
24.32
26.02
8
10.22
13.36
15.51
17.53
18.17
20.09
21 .95
23.77
26.12
27.87
9
1 1 .39
14.68
16.92
19.02
19.68
2 1 .67
23.59
25.46
27.88
29.67
10
12.55
15.99
18.31
20.48
21.16
II
23.21
25.19
27.1 1
29.59
31 .42
13.70
17.28
19.68
21.92
22.62
24.72
26.76
28.73
31 .26
33.14
12
14.85
18.55
21 .03
23.34
24.05
26.22
28.30
30.32
32.91
34.82
13
15.98
19.81
22.36
24.74
25.47
27.69
29.82
31 .88
34.53
36.48
14
17.12
21.06
23.68
26.12
26.87
29.14
31.32
33.43
36.12
38. 1 1
15
18.25
22.31
25.00
27.49
28.26
30.58
32.80
34.95
37.70
39.72
16
19.37
23.54
26.30
28.85
29.63
32.00
34.27
36.46
39.25
41.31
17
20.49
24.77
27.59
30.19
31.00
33.41
35.72
37.95
40.79
42.88
18
21.60
25.99
28.87
3 1 .53
32.35
34.8 1
37.16
39.42
42.31
44.43
19
22.72
27.20
30.14
32.85
33.69
36.19
38.58
40.88
43.82
45.97
20
23.83
28.4!
31.41
34.17
35.02
37.57
40.00
42.34
45.3 1
47.50
21
24.93
29.62
32.67
35.48
36.34
38.93
41.40
43.78
46.80
49.0 1
22
26.04
30.81
33.92
36.78
37.66
40.29
42.80
45.20
48.27
50.51
23
27.14
32.01
35.17
38.08
38.97
41 .64
44.18
46.62
49.73
52.00
24
28.24
33.20
36.42
39.36
40.27
42.98
45.56
48.03
51.18
53.48
25
29.34
34.38
37.65
40.65
41.57
44.31
46.93
49.44
52.62
54.95
26
30.43
35.56
38.89
41 .92
42.86
45.64
48.29
50.83
56.41
27
3 1 .53
36.74
40. I I
54.05
43.19
44.14
46.96
49.64
52.22
55.48
57.86
28
32.62
37.92
41.34
44.46
45.42
48.28
50.99
53.59
56.89
59.30
29
33.71
39.09
42.56
45.72
46.69
49.59
52.34
54.97
58.30
60.73
30
34.80
40.26
43.77
46.98
47.96
50.89
53.67
56.33
59.70
62.16
40
45.62
51.81
55.76
59.34
60.44
63.69
66.77
69.70
73.40
76.09
50
56.33
63.17
67.50
7 1 .42
72.61
76.15
79.49
82.66
86.66
89.56
60
66.98
74.40
79.08
83.30
84.58
88.38
91 .95
95.34
99.61
102.69
80
88.13
96.58
101.88
106.63
108.07
1 1 2.33
1 1 6.32
120.10
1 24.84
128.26
100
109.14
1 1 8.50
124.34
129.56
l3l.l4
135.81
140.17
144.29
149.45
153.17
The entries in the top row are the probabilities of exceeding the tabled values. p Pr(� > x ) where x is in the body of the table and p is in the top row (margin). dj denotes degrees of freedom and is given in the left column (margin). =
Table IV. F distribution critical values. Entries are the values f with Pr(F ≥ f) equal to the tail probability (0.05, 0.025, or 0.01) shown in the left margin; the columns give the numerator degrees of freedom r_1 and the rows the denominator degrees of freedom r_2.
0.05
,,
I
I
2
3
4
5
161
199
216
225
648
799
864
4052
4999
18.51
r,
6
7
8
10
15
230
234
237
239
242
246
922
937
948
957
969
985
5403
900
5625
5764
5859
5928
5981
6056
6157
19.00
19.16
19.25
19.30
19.33
19.35
19.37
19.40
19.43
38.51
39.00
39.17
39.25
39.30
39.33
39.36
39.37
39.40
39.43
98.50
99.00
99. 1 7
99.25
99.30
99.33
99.36
99.37
99.40
99.43
10.13
9.55
9.28
9.12
9.01
8.94
8.89
8.85
8.79
8.70
17.44
16.04
15.44
34. 12
30.82
7.7 1
0.025 0.01
0.025
O.o!
0.05
2
0.025
O.oJ
0.05
3
14.88
14.73
14.62
14.54
14.42
14.25
29.46
ISJO
28.71
28.24
27.91
27.67
27.49
27.23
26.87
6.94
6.59
6.39
6.26
6.16
6.09
6.04
5.96
5.86
12.22
10.65
9.98
9.60
9.36
9.20
9.07
8.98
8.84
8.66
21 .20
18.00
16.69
15.98
15.52
15.21
14.98
14.80
14.55
14.20
6.61
5.79
5.41
5.1 9
5.05
4.95
4.88
4.82
4.74
4.62
0.025
10.01
8.43
7.76
7.39
7.15
6.98
6.85
6.76
6.62
6.43
0.01
16.26
13.27
12.06
1 1 .39
10.97
10.67
10.46
10.29
10.05
9.72
5.99
5.14
4.76
4.53
4.39
4.28
4.21
4.15
4.06
3.94
8.81
7.26
6.60
6.23
5.99
5.82
5.70
5.60
5.46
527
13.75
10.92
9.78
9. 15
8.75
8.47
8.26
8.10
7.87
7.56
5.59
4.74
4.35
4.12
3.97
3.87
3.79
3.7 3
3.64
3.51
8,07
6.54
5.89
5.52
5.29
5.12
4.99
4.90
4.76
4.57
12.25
9.55
8.45
7.85
7.46
7.19
6.99
6.84
6.62
6.31
5.32
4.46
4.07
3$4
3.69
358
3.50
3.44
3.35
3.22
7.57
6.06
5.42
5.05
4.82
4.65
4.53
4.43
4.30
4.10
1 1 .26
8.65
7.59
7.01
6.63
6.37
6.18
6.03
5.81
5.52
5.12
4.26
3.86
3.63
3.48
3.37
3.29
3.23
3.14
3.01
7.21
5.71
5.08
4.72
4.48
4.32
4.20
4.10
3.96
3.77
10.56
8.02
6.99
6.42
6.06
5.80
5.61
5.47
5.26
4.96
4.96
4.10
3.7 1
3.48
3.33
3.22
3.14
3.07
2.98
2.85
6.94
5.46
4.83
4.47
4.24
4fl7
3.95
3.85
3.72
3.52
10.04
7.56
6.55
5.99
5.64
5.39
5.20
5.06
4.85
4.56
4.75
3.89
3.49
3.26
3. l l
3.00
2.91
2.85
2.75
2.62
0.025
6.55
5.10
4.47
4.12
3.89
3.73
3.61
3.51
3.37
3.18
0.01
9.33
6.93
5.95
5.41
5.06
4.82
4.64
4.50
4.30
4.01
4.54
3.68
3.29
3.06
2.90
2.79
2.71
2.64
2.54
2.40
0.025
6.20
4.77
4.15
3.80
3.58
3.41
3.29
3.20
3.06
2.86
0.01
8.68
6.36
5.42
4.89
4.56
432
4.14
4.00
3.80
3.52
0.025
O.oJ
0.05
Ofl5
0.05
4
5
6
0.025
O.Q I
0.05
7
0.025 0.01 0.05
8
0.025 0.01 0.05
9
0.025 0.01 0.05
10
0.025
O.Di
0.05
0.05
12
15
r_1 = numerator degrees of freedom; r_2 = denominator degrees of freedom.
INDEX

X ∼ F, X is distributed according to F, 463
B(n, θ), binomial distribution with parameters n and θ, 461
E(λ), exponential distribution with parameter λ, 464
H(D, N, n), hypergeometric distribution with parameters D, N, n, 461
M(n, θ_1, ..., θ_q), multinomial distribution with parameters n, θ_1, ..., θ_q, 462
N(μ, Σ), multivariate normal distribution, 507

analysis of variance (ANOVA), 367
    table, 379
M-estimate, estimating equation estimate, 330
N(JJ, u2 ), normal distribution With mean JJ and variance u2 , 464 N(JJt, j.t2, uf, u� , p), bivariate normal dis
of MLE. 33 I , 386
P(.\), Poisson distribution with parame
of sample correlation, 3 1 9
tribution, 492 ter
,\, 462
U (a, b), uniform distribution on the inter val (a, b) , 465
of estimate, 300 of minimum contrast estimate, 327
of posterior, 339, 391 asymptotic order in probability notation, 516 asymptotic relative efficiency, 357 autoregressive model, l l , 292
acceptance, 215 action space, 1 7
Bayes credible bound, 25 I
adaptation, 388
Bayes credible interval, 252
algorithm, l 02, I 27
Bayes credible region, 251
bisection, 127, 2 1 0 coordinate ascent, 129
asympbtic, 344 Bayes estirn•te, 162
EM, l33
Bernoulli trials, 166
Newton-Raphson, 102, 132, 189,210
equivan.ance, 168
for GLM, 4 1 3 proportional fitting, 157 alternative, 215, 2 1 7
Gaussim model, 163
linear, I
Bayes risk, I (i2 547
548
Index
Bayes rule, 27, 162
Bayes rule, 165
Bayes' rule, 445, 479, 482
coefficient of determination, 37
Bayes' theorem, 14
coefficient of skewness, 457
Bayesian models, 12
collinearity, 69, 90
Bayesian prediction interval, 254
comparison, 247
Behrens-Fisher problem, 264
complete families of tests, 232
Bernoulli trials, 447
compound experiment, 446
Bernstein's inequality, 469
concave function, 5 1 8
binomial case, 519 Bernstein-von Mises theorem, 339
Berry-Esseen bound, 299
conditional distribution, 478 for bivariate nonnal case, 501 for multivariate normal case, 509
Berry-Esseen theorem, 471
conditional expectation, 483
beta distribution, 488
confidence band
as prior for Bernoulli trials, 15 moments, 526 beta function, 488 bias, 20, 176 sample variance, 78 binomial distribution, 447, 461 bioequivalence trials, 198 bivariate log normal distribution, 535 bivariate normal distribution, 497 cumulants, 506 geometry, 532 nondegenerate , 499 bivariate normal model, 266
quantiles simultaneous, 284 confidence bound, 23, 234, 235 mean nonparametric, 241
unifonnly most accurate, 248
confidence interval, 24, 234, 235 Bernoulli trials approximate, 23 7 exact,
244
location parameter
nonparametric, 286
median nonpararnebic, 282
Cauchy distribution, 526
one-sample Student t, 235
Cauchy-schwartz inequality, 458
quantile
Cauchy-S chwarz inequality, 39 generalized, 521 center of a population distribution, 7 1 central limit theorem, 470 multivariate, 5 1 0
nonpanuneUic, 284
shift parameter
nonpanuneUic, 287
two-sample Student t, 263 unbiased, 283
chain rule, 517
confidence level, 235
change of variable formula, 452
confidence rectangle
characteristic function, 505 Chebychev bound, 346 Chebychev's inequality, 299, 469 chi-square disUibution, 491 noncentral, 530 chi-square test, 402 chi-squared disUibution, 488 classification
Gaussian model, 240 confidenoe region, 233, 239 distribution function, 240 confidence regions Gaussian linear model, 383 conjugate nonnal mixture distributions, 92
consistency, 30 l
Index
of estimate, 300, 301 of minimum contrast estimates, 304 ofMLE, 305, 347 of posterior, 338 oftest, 333 uniform, 30 I contingency tables, 403 contrast function, 99 control observation, 4 convergence in LP norm, 536 in law, distribution, 466 in law, in distribution for vectors, 5 1 1 in probability, 466 for vectors, 5 1 1 of random variables, 466 convergence of sample quantile, 536 convex function, 518 convex support, 122 convexity, 518 correlation, 267, 458 inequality, 458 multiple, 40 ratio, 82 covariance, 458 of random vectors, 504 covariate, 10 stochastic, 387, 419 Cramer-Rao lower bound, 1 8 1 Cramer-von Mises statistic, 271 critical region, 23, 215 critical value, 216, 217 cumulant, 460 generating function, 460 in normal distribution, 460 cumulant generating function for random vector, 505 curved exponential family, 125 existence of MLE, 125
De Moivre-Laplace theorem, 470 decision rule, 19 admissible, 3 1
549 Bayes, 27, 161, 162 inadmissible, 31 minimax, 28, !70, 171 randomized, 28 unbiased, 78 decision theory, 16 delta method, 306 for distributions, 3 1 1 for moments, 306 density, 456 conditional, 482 density function, 449 design, 366 matrix, 366 random, 387 values, 366 deviance, 414 decomposition, 4 1 4 Dirichlet distribution, 74, !98, 202 distribution function (d.f.), 450 distribution of quadratic form, 533 dominated convergence theorem, 514 double exponential distribution, 526 duality between confidence regions and tests, 241 duality theorem, 243 Dynkin, Lehmann, Scheffe's theorem, 86 Edgeworth approximations, 317 eigenvalues, 520 empirical distribution, I04 empirical distribution function, 8, 139 bivariate, !39 entropy maximum, 91 error, 3 autoregressive, 1 1 estimate, 99 consistent, 301 empirical substitution, 139 estimating equation, 100 frequency plug-in, 103 Hodges-Lehmann, 149 least squares, I 00
550
: '
Index maximum likelihood, 1 1 4
fitted value, 372
method of moments, 10 I
fixed design, 387
minimum contrast, 99
Fr6chet differentiable, 516
plug-in, 104
frequency function, 449
squared error Bayes, 162
conditional, 477 frequency plug-in principle, 103
unbiased, 176 estimating equation estimate asymptotic nonnality, 384
:
'[I '
moments, 526
estimation, 16
gamma function, 488
events, 442
gamma model
independent, 445 '
gamma distribution, 488
expectation, 454, 455 conditional, 479
MLE, 124, 129, 130 Gauss-Marakov linear model, 418 Gauss-Markov assumptions, 108
exponential distribution, 464
Gauss-Markov theorem, 418
exponential family, 49
Gaussian linear model, 366
conjugate prior, 62 convexity, 61
canonical fonn, 368
confidence intervals, 381
curved, 57
confidence regions, 383
identifiability, 60
estimation in, 369
log concavity, 61
identifiability, 3 7 1
MLE, 121
moment generating function, 59 multiparameter, 53 canonical, 54
one-parameter, 49 canonical, 52
likelihood ratio statistic, 374 MLE, 371 testing, 378
UMVU estimate, 3 7 1 Gaussian model Bayes estimate, 163
rank of, 60
existence of MLE, 123
submodel, 56
mixture, 134
supermodel, 58
Gaussian two-sample model, 261
UMVU estimate, 186
generali�ed linear models (GLM), 4 1 1
extension principle, 102, l 04 F distribution, 491 moments, 530 noncentral, 531
F statistic, 376 factorization theorem, 43
geometric �stribution, 72, 87 GLM, 412 . estimate •
asymptotic distributions, 415 Gaussi1111, 435 likelihood ratio asymptotic distribution, 415
Fisher consistent, 158
likelihood ratio test, 414
Fisher information, 1 80
Poisson, 435
matrix, 185 Fisher's discriminant function, 226
• •
I
goodness-of-fit test, 220, 223 gross error models, 190
Fisher's genetic linkage model, 405 Fisher's method of scoring, 434
Holder's inequality, 5 1 8
I I
'
'
_I
Index
551
Hammersley's theorem,
513
Hardy-Weinberg proportions,
103, 403
chi-square test, 405 MLE, 1 1 8, 124 UMVU estimate, 183 hat matrix, 372 hazard rates, 69, 70 heavy tails, 208 Hessian, 386
Kolmogorov statistic, 220 Kolmogorov's theorem, 86 Kullback-Leibler divergence, 1 16 and MLE, 1 1 6 Kullback-Leibler loss function, 169 kurtosis, 279,457
L, norm, 536
Laplace
distribution, 526
hierarchical Bayesian normal model, 92 hierarchical binomial-beta model, 93 Hodges 's example, 332
Laplace distribution, 374 law of large numbers weak
Hod.ges-Lehmann estimate, 207 Hoeffding bound, 299, 346
Bernoulli's, 468
Hoeffding's inequality, 519
Khintchin's, 469
Horvitz-Thompson estimate, 178
least absolute deviation estimates, 149, 374
Huber estimate, 207, 390 hypergeometric distribution, 3, 461
least favorable prior distribution, 170
hypergeometric probability, 448
least squares, 107, 120
hypothesis, 215 composite, 215 null, 215 simple, 215
weighted, 107, 1 12 Lehmann alternative, 275 level (of significance). 217 life testing, 8 9 likelihood equations, 1 1 7
identifiable, 6
likelihood function, 47
independence, 445 , 453
likelihood ratio, 48, 256
independent experiments, 446
asymptotic chi-square distribution,
indifference region, 230
394, 395
influence function, 196
confidence region, 257 asymptotic, 395
information bound, 181
logistic regression, 410
asymptotic variance, 327
test, 256
information inequality, 179, 1 8 1 , 186, 188, 206
bivariate normal, 266
integrable, 455
Gaussian one-sample model, 257
interquartile range (IQR), 196
Gaussian two-sample model, 261
invariance
one-sample scale, 291 two-sample scale, 293
shift, 77 inverse Gaussian distribution, 94
likelihood ratio statistic
IQR, 196
in Gaussian linear model, 376
iterated expectation theorem, 481
simple, 223
Jacobian, 485 theorem, 486
likelihood ratio test, 335 linear model Gaussian, 366
Jensen's inequality, 5 1 8
non-Gaussian, 389
-------
552 I
I
'
'
i
stochastic covariates, 419 Gaussian, 387 heteroscedastic, 421 linear regression model, 1 09 link function, 412 canonical, 4 1 2 location parameter, 209, 463 location-scale parameter family, 463 location-scale regression existence of MLE, 1 27 log linear model, 412 logistic distribution, 57, 132 likelihood equations, 154 neural nets, 1 5 1 Newton-Raphson, 132 logistic linear regression model, 408 logistic regression, 56, 90, 408 logistic transform, 408 empirical, 409 logit, 90, 408 loss function, 18 0 - 1, 19 absolute, 1 8 Euclidean, 1 8 Kullback-Leibler, 169, 202 quadratic, 1 8 M-estimate, 330 asymptotic normality, 384 marginal density, 452 Markov's inequality, 469 matched pair experiment, 257 maximum likelihood, 1 14 maximum likelihood estimate, 1 14. see MLE maximum likelihood estimate (MLE), 1 14 mean, 7 1 , 454, 455 sensitivity curve, 192 mean absolute prediction error, 80, 83 mean squared error (MSE), 20 mean squared prediction error (MSPE), 32 median, 7 1 MSE, 297
  population, 77, 80, 105
  sample, 192
  sensitivity curve, 193
Mendel's genetic model, 214
  chi-square test, 403
meta-analysis, 222
method of moments, 101
minimax estimate
  Bernoulli trials, 173
  distribution function, 202
minimax rule, 28, 170
minimax test, 173
MLE, 114, see maximum likelihood estimate
  as projection, 371
  asymptotic normality
    exponential family, 322
  Cauchy model, 149
  equivariance, 114, 144
  existence, 121
  uniqueness, 121
MLR, 228, see monotone likelihood ratio
model, 1, 5
  AR(1), 11
  Cox, 70
  Gaussian linear regression, 366
  gross error, 190, 210
  Lehmann, 69
  linear, 10
    Gaussian, 10
  location symmetric, 191
  logistic linear, 408
  nonparametric, 6
  one-sample, 3, 366
  parametric, 6
  proportional hazard, 70
  regression, 9
  regular, 9
  scale, 69
  semiparametric, 6
  shift, 4
  symmetric, 68
  two-sample, 4
moment, 456
  central, 457
  of random matrix, 502
moment generating function, 459
  for random vector, 504
monotone likelihood ratio (MLR), 228
Monte Carlo method, 219, 221, 298, 314
MSE, 20, see mean squared error
  sample mean, 21
  sample variance, 78
MSPE, 32, see mean squared prediction error
  bivariate normal, 36
  multivariate normal, 37
MSPE predictor, 83, 372
multinomial distribution, 462
multinomial trials, 55, 447, 462
  consistent estimates, 302
  Dirichlet prior, 198
  estimation
    asymptotic normality, 324
  in contingency tables, 403
  Kullback-Leibler loss
    Bayes estimate, 202
    minimax estimate, 201
  MLE, 119, 124
  Pearson's chi-square test, 401
  UMVU estimate, 187
multiple correlation coefficient, 37
multivariate normal distribution, 506
natural parameter space, 52, 54
natural sufficient statistic, 54
negative binomial distribution, 87
neural net model, 151
Neyman allocation, 76
Neyman-Pearson framework, 23
Neyman-Pearson lemma, 224
Neyman-Pearson test, 165
noncentral t distribution, 260
noncentral F distribution, 376
noncentral chi-square distribution, 375
nonnegative definite, 519
normal distribution, 464, see Gaussian
  central moments, 529
normal equations, 101
  weighted least squares, 113
normalizing transformation, zero skewness, 351
observations, 5
one-way layout, 367
  binomial testing, 410
  confidence intervals, 382
  testing, 378
order statistics, 527
orthogonal, 41
orthogonal matrix, 494
orthogonality, 522
p-sample problem, 367
p-value, 221
parameter, 6
  nuisance, 7
parametrization, 6
Pareto density, 85
Pearson's chi-square, 402
placebo, 4
plug-in principle, 102
Poisson distribution, 462
Poisson process, 472
Poisson's theorem, 472
Polya's theorem, 515
population, 448
population R-squared, 37
population quantile, 105
positive definite, 519
posterior, 13
power function, 78, 217
  asymptotic, 334
  sample size, 230
prediction, 16, 32
  training set, 19
prediction error, 32
  absolute, 33
  squared, 32
  weighted, 84
prediction interval, 252
  Bayesian, 254
  distribution-free, 254
  Student t, 253
predictor, 32
  linear, 38, 39
principal axis theorem, 519
prior, 13
  conjugate, 15
    binomial, 15
    exponential family, 62
    general model, 73
    multinomial, 74
    normal case, 63
    Poisson, 73
  improper, 163
  Jeffrey's, 203
  least favorable, 170
probability
  conditional, 444
  continuous, 449
  discrete, 443, 449
  distribution, 442
  subjective, 442
probability distribution, 451
probit model, 416
product moment, 457
  central, 457
projection, 41, 371
projection matrix, 372
projections on linear spaces, 522
Pythagorean identity, 41, 377
  in Hilbert space, 522
quality control, 229
quantile
  population, 104
  sensitivity curve, 195
random, 441
random design, 387
random effects model, 167
random experiments, 441
random variable, 451
random vector, 451
randomization, 5
randomized test, 79, 224
rank, 48
ranking, 16
Rao score, 399
  confidence region, 400
  statistic, 399
    asymptotic chi-square distribution, 400
  test, 399
    multinomial goodness-of-fit, 402
Rao test, 335, 336
Rayleigh distribution, 53
regression, 9, 366
  confidence intervals, 381
  confidence regions, 383
  heteroscedastic, 58, 153
  homoscedastic, 58
  Laplace model, 149
  linear, 109
  location-scale, 57
  logistic, 56
  Poisson, 204
  polynomial, 146
  testing, 378
  weighted least squares, 147
  weighted, linear, 112
regression line, 502
regression toward the mean, 36
rejection, 215
relative frequency, 441
residual, 48, 111, 372
  sum of squares, 379
response, 10, 366
risk function, 20
  maximum, 28
  testing, 22
risk set, 29
  convexity, 79
robustness, 190, 418
  of level t-statistics, 314
  asymptotic, 419
  of tests, 419
saddle point, 199
sample, 3
  correlation, 140, 267
  covariance, 140
  cumulant, 139
  mean, 8, 45
  median, 105, 149, 192
  quantile, 105
  random, 3
  regression line, 111
  variance, 8, 45
sample of size n, 448
sample space, 5, 442
sampling inspection, 3
scale, 457
scale parameter, 463
Scheffé's theorem, 468
  multivariate case, 514
score test, 335, 336, 399
selecting at random, 444
selection, 75, 247
sensitivity curve, 192
Shannon's lemma, 116
shift and scale equivariant, 209
shift equivariant, 206, 208
signal to noise
  fixed, 126
Slutsky's theorem, 467
  multivariate, 512
spectral theorem, 519
square root matrix, 507
standard deviation, 457
standard error, 381
standard normal distribution, 464
statistic, 8
  ancillary, 48
  equivalent, 43
  sufficient, 42
    Bayes, 46
    minimal, 46
    natural, 52
stochastic ordering, 67, 209
stratified sampling, 76, 205
substitution theorem for conditional expectations, 481
superefficiency, 332
survey sampling, 177
  model based approach, 350
survival functions, 70
symmetric distribution, 68
symmetric matrices, 519
symmetric variable, 68
t distribution, 491
  moments, 530
Taylor expansion, 517
test function, 23
test size, 217
test statistic, 216
testing, 16, 213
  Bayes, 165, 225
testing independence in contingency tables, 405
total differential, 517
transformation
  affine, 487
  k linear, 516
  linear, 487, 516
  orthogonal, 494
trimmed mean, 194, 206
  sensitivity curve, 194
type I error, 23, 216
type II error, 23, 216
UMP, 226, 227, see uniformly most powerful
UMVU, 177, see uniformly minimum variance unbiased
uncorrelated, 459
unidentifiable, 6
uniform distribution, 465
  discrete
    MLE, 115
uniformly minimum variance unbiased (UMVU), 177
uniformly most powerful
  asymptotically, 334
uniformly most powerful (UMP), 226, 227
variance, 457
  of random matrix, 503
  sensitivity curve, 195
variance stabilizing transformation, 316, 317
  for binomial, 352
  for correlation coefficient, 320, 350
  for Poisson, 317
  in GLM, 416
variance-covariance matrix, 498
von Neumann's theorem, 171
Wald confidence regions, 399
Wald statistic, 398
  asymptotic chi-square distribution, 398
Wald test, 335, 399
  multinomial goodness-of-fit, 401
weak law of large numbers
  for vectors, 511
Weibull density, 84
weighted least squares, 113, 147
Wilks's theorem, 393-395, 397
Z-score, 457