...is identical with Bartlett's test at level $\alpha$. This correspondence arises because the sampling distribution of $s_j^2/\sigma_j^2$, with $s_j^2$ a random variable and $\sigma_j^2$ fixed, is identical to the posterior distribution of $\sigma_j^{-2}$, with $s_j^2$ fixed and $\sigma_j^2$ the random variable, under the assumption that $\theta_j$ and $\log\sigma_j$ are uniform a priori. A serious difficulty in the practical use of Bartlett's criterion in the context of sampling theory is its extreme sensitivity to non-Normality [Box (1953a, b)]. For samples from non-Normal populations with kurtosis $\gamma_2 = \kappa_4/\kappa_2^2$, for instance, $M_0$ is asymptotically distributed as $(1 + \tfrac{1}{2}\gamma_2)\chi^2_{k-1}$ and not as $\chi^2_{k-1}$. In this chapter Bayesian procedures are derived under the same assumptions as are customary in the sampling approach. Later, in Chapter 4, we turn to the problem of comparing the spread of distributions when Normality cannot be assumed.
2.12.4 An Example
As an example of the Normal theory procedure, suppose an investigation conducted to compare the spread of three distributions yields

\[ s_1^2 = 52.785, \quad s_2^2 = 34.457, \quad s_3^2 = 66.030, \quad \nu_1 = \nu_2 = \nu_3 = 30, \]

from which $\bar{s}^2 = 51.091$ and $\nu = 90$.
Under the assumptions of Normality and noninformative priors, the joint distribution of the two contrasts $\phi_1 = \log\sigma_1^2 - \log\sigma_3^2$ and $\phi_2 = \log\sigma_2^2 - \log\sigma_3^2$ is, from (2.12.6),

\[ p(\phi_1, \phi_2 \mid \mathbf{y}) \propto T_1^{15}\,T_2^{15}\,(1 + T_1 + T_2)^{-45}, \]

where

\[ T_1 = \frac{\nu_1 s_1^2}{\nu_3 s_3^2}\,e^{-\phi_1} = 0.7994\,e^{-\phi_1}, \qquad T_2 = \frac{\nu_2 s_2^2}{\nu_3 s_3^2}\,e^{-\phi_2} = 0.5218\,e^{-\phi_2}. \]
Using (2.12.7), the mode of the distribution is

\[ \hat{\phi}_1 = \log s_1^2 - \log s_3^2 = -0.2239, \qquad \hat{\phi}_2 = \log s_2^2 - \log s_3^2 = -0.6504. \]
Figure 2.12.1 shows the mode and contours of the approximate 75, 90, and 95 per cent H.P.D. regions. Making use of Bartlett's approximation, the contours were drawn so that $M = -2\log W = (1+A)\,\chi^2(2, \alpha)$ for $\alpha = (0.25, 0.10, 0.05)$, respectively, where, from (2.12.12),

\[ M = -90\log 3 + 90\log\{1 + \exp[-(\phi_1 + 0.2239)] + \exp[-(\phi_2 + 0.6504)]\} + 30[(\phi_1 + 0.2239) + (\phi_2 + 0.6504)]. \]

For this example, we have from (2.12.10)

\[ A = \tfrac{1}{6}\left(\tfrac{3}{30} - \tfrac{1}{90}\right) = 0.0148, \]

and from Table III (at the end of this book), $\chi^2(2, 0.25) = 2.77$, $\chi^2(2, 0.10) = 4.61$, and $\chi^2(2, 0.05) = 5.99$. The contours are nearly elliptical, as might be expected because, for samples of this size, $\phi_1$ and $\phi_2$ will be roughly jointly Normally distributed. Also shown in the figure is the point $\boldsymbol{\phi} = \mathbf{0}$ corresponding to $\sigma_1^2 = \sigma_2^2 = \sigma_3^2$. For this example, $\boldsymbol{\phi} = \mathbf{0}$ is included in the 90 per cent region but excluded from the 75 per cent region. Of course, if we were merely interested in deciding whether $\boldsymbol{\phi} = \mathbf{0}$ lies inside a particular H.P.D. region, we could employ (2.12.13) directly and obtain
\[ -2\log W_0 = -\sum_{i=1}^{3}\nu_i(\log s_i^2 - \log\bar{s}^2) = 3.1434, \]

so that

\[ \frac{-2\log W_0}{1+A} = \frac{3.1434}{1.0148} = 3.098, \]

which falls between $\chi^2(2, 0.25)$ and $\chi^2(2, 0.10)$. Thus, we would know straight away that $\boldsymbol{\phi} = \mathbf{0}$ is included in the 90 per cent region but excluded from the 75 per cent region, without recourse to the diagram. By consulting a table of $\chi^2$ probabilities, we can also determine the approximate content of the H.P.D. region which just covers a particular parameter point.
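As a check on the arithmetic, the whole calculation can be reproduced numerically. The following minimal sketch assumes SciPy is available for the $\chi^2$ quantiles of Table III; it computes $-2\log W_0$, Bartlett's correction $A$, and the reference values.

```python
import numpy as np
from scipy.stats import chi2

# Variances and degrees of freedom from the example of Section 2.12.4
s2 = np.array([52.785, 34.457, 66.030])
nu = np.array([30, 30, 30])
nu_tot = nu.sum()                      # 90
s2_bar = (nu * s2).sum() / nu_tot      # pooled variance, about 51.091

# -2 log W0 at phi = 0, and Bartlett's correction A
m0 = -(nu * (np.log(s2) - np.log(s2_bar))).sum()
A = ((1.0 / nu).sum() - 1.0 / nu_tot) / (3 * (3 - 1))
stat = m0 / (1 + A)                    # approximately chi-square with 2 d.f.

print(stat)                            # about 3.098
for alpha in (0.25, 0.10, 0.05):
    print(alpha, chi2.ppf(1 - alpha, df=2))   # 2.77, 4.61, 5.99
```

The statistic lies between the 75 and 90 per cent quantiles, confirming the conclusion drawn from the diagram.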
'-".~
Standard ;\Jormal Theory Inference Problems
138
2.13
0.35
-0.65
-1.65
~
__________
~
______
- 1.5
~
____
-/.O
~
____
-O.S
L-~
____
~
0
______
0 .5
~
__
¢I~
1.0
Fig.2.12.1 Contours of posteri or distribution of (
2.13 SUMMARIZED CALCULATIONS OF VARIOUS POSTERIOR DISTRIBUTIONS

Table 2.13.1 provides a summary of the Normal theory posterior distributions, based on noninformative priors, for making inferences about various location and scale parameters. The examples in most cases are those discussed in this chapter.
Table 2.13.1 Summarized calculations of various Normal theory posterior distributions

1. Normal mean $\theta$, variance $\sigma^2$ known.

From (2.2.4),

\[ \bar{y} = \frac{1}{n}\sum y_i \qquad \text{and} \qquad \theta \sim N(\bar{y}, \sigma^2/n). \]
The $(1-\alpha)$ H.P.D. interval of $\theta$ is

\[ \bar{y} - u_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \theta < \bar{y} + u_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \]

where $u_{\alpha/2}$ can be obtained from Table I (at the end of this book). Thus, for $(\bar{y} = 50,\ \sigma^2 = 20,\ n = 20)$, then $\theta \sim N(50, 1)$. For $\alpha = 0.05$, $u_{0.025} = 1.96 \doteq 2$ and the 95% H.P.D. interval is $(48 < \theta < 52)$.
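The interval is quickly verified; a minimal sketch, assuming SciPy for the Normal quantile that Table I provides:

```python
from math import sqrt
from scipy.stats import norm

# Case 1 of Table 2.13.1: Normal mean, variance known
ybar, sigma2, n, alpha = 50.0, 20.0, 20, 0.05
u = norm.ppf(1 - alpha / 2)            # 1.96, the Table I entry
half = u * sqrt(sigma2 / n)
print(ybar - half, ybar + half)        # about (48.04, 51.96), i.e. 48 < theta < 52
```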
2. Normal variance $\sigma^2$, $\theta$ known.

From (2.3.6),

\[ s^2 = \frac{1}{n}\sum (y_j - \theta)^2 \qquad \text{and} \qquad \sigma^2 \sim n s^2 \chi_n^{-2}. \]

The standardized $(1-\alpha)$ H.P.D. interval is obtained in terms of the metric $\log\sigma$, and the corresponding interval in $\sigma^2$ is

\[ \frac{n s^2}{\bar{\chi}^2(n, \alpha)} < \sigma^2 < \frac{n s^2}{\underline{\chi}^2(n, \alpha)}, \]

where $\underline{\chi}^2(n, \alpha)$ and $\bar{\chi}^2(n, \alpha)$ are given in Table II (at the end of this book). Thus, suppose $(n = 20,\ ns^2 = 348,\ s^2 = 17.4)$; then $\sigma^2 \sim 348\,\chi_{20}^{-2}$. For $\alpha = 0.05$, $(\underline{\chi}^2 = 9.96,\ \bar{\chi}^2 = 35.23)$ and the 95% H.P.D. interval is $(9.88 < \sigma^2 < 34.94)$.
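Note that the Table II limits are not the usual equal-tail $\chi^2$ points: they satisfy the H.P.D. condition in the $\log\sigma$ metric, namely equal density ordinates at the two ends together with the required coverage. A sketch of how such limits could be computed numerically (assuming SciPy; the function name is illustrative):

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import fsolve

def hpd_chi2_limits(n, alpha):
    """Solve for (lo, hi) so that sigma^2 = n s^2 / x gives an H.P.D.
    interval in the log-sigma metric: the density of x = n s^2 / sigma^2
    in that metric is proportional to x^(n/2) e^(-x/2), so we require
    equal ordinates at the two ends and coverage 1 - alpha."""
    def logord(x):
        return 0.5 * n * np.log(x) - 0.5 * x
    def equations(z):
        lo, hi = z
        return [logord(lo) - logord(hi),
                chi2.cdf(hi, n) - chi2.cdf(lo, n) - (1 - alpha)]
    return fsolve(equations, [chi2.ppf(alpha / 2, n), chi2.ppf(1 - alpha / 2, n)])

lo, hi = hpd_chi2_limits(20, 0.05)
print(lo, hi)              # about 9.96 and 35.23, the Table II entries
print(348 / hi, 348 / lo)  # about (9.88, 34.94), the interval for sigma^2
```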
3. Normal variance $\sigma^2$, $\theta$ unknown.

From (2.4.19),

\[ \sigma^2 \sim \nu s^2 \chi_\nu^{-2}, \qquad \nu = n - 1, \qquad s^2 = \frac{1}{\nu}\sum (y_i - \bar{y})^2. \]

As in the case $\theta$ known, the standardized $(1-\alpha)$ H.P.D. interval is obtained in terms of $\log\sigma$, and the corresponding interval in $\sigma^2$ is

\[ \frac{\nu s^2}{\bar{\chi}^2(\nu, \alpha)} < \sigma^2 < \frac{\nu s^2}{\underline{\chi}^2(\nu, \alpha)}. \]

4. Normal mean $\theta$, $\sigma^2$ unknown.
From (2.4.21),

\[ \theta \sim t(\bar{y},\ s^2/n,\ \nu). \]

The $(1-\alpha)$ H.P.D. interval of $\theta$ is

\[ \bar{y} - t_{\alpha/2}(\nu)\frac{s}{\sqrt{n}} < \theta < \bar{y} + t_{\alpha/2}(\nu)\frac{s}{\sqrt{n}}, \]

where $t_{\alpha/2}(\nu)$ is given in Table IV (at the end of this book). Thus, for $(\bar{y} = 50,\ s/\sqrt{n} = 0.96,\ \nu = 19)$, then $\theta \sim t(50, 0.92, 19)$. For $\alpha = 0.05$, $t_{0.025}(19) = 2.09$ so that the 95% H.P.D. interval is $(47.99 < \theta < 52.01)$.
5. Difference of two Normal means $\gamma = \theta_2 - \theta_1$, when the variances $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal.

From (2.5.3),

\[ s_1^2 = \frac{1}{\nu_1}\sum_j (y_{1j} - \bar{y}_1)^2, \qquad s_2^2 = \frac{1}{\nu_2}\sum_j (y_{2j} - \bar{y}_2)^2, \qquad s^2 = \frac{\nu_1 s_1^2 + \nu_2 s_2^2}{\nu_1 + \nu_2}. \]

The $(1-\alpha)$ H.P.D. interval of $\gamma$ is

\[ \bar{y}_2 - \bar{y}_1 \pm t_{\alpha/2}(n_1 + n_2 - 2)\, s\left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{1/2}. \]

For example, $\bar{y}_2 - \bar{y}_1 = 5$, $n_1 = n_2 = 10$, $s^2 = 4$; then $\gamma \sim t[5, \tfrac{4}{5}, 18]$. For $\alpha = 0.05$, $t_{0.025}(18) = 2.10$ and the 95% H.P.D. interval is $(3.12 < \gamma < 6.88)$.
6. Difference of two Normal means $\gamma = \theta_2 - \theta_1$, when $\sigma_1^2$ and $\sigma_2^2$ are unequal and unknown.

From (2.5.13), $\gamma$ is approximately distributed as $t[\bar{y}_2 - \bar{y}_1,\ a^2(s_1^2/n_1 + s_2^2/n_2),\ b]$, where

\[ b = 4 + \frac{f_1^2}{f_2}, \qquad a^2 = f_1\,\frac{b-2}{b}, \]

with

\[ f_1 = \frac{\nu_2}{\nu_2 - 2}\cos^2\phi + \frac{\nu_1}{\nu_1 - 2}\sin^2\phi, \qquad f_2 = \frac{\nu_2^2\cos^4\phi}{(\nu_2-2)^2(\nu_2-4)} + \frac{\nu_1^2\sin^4\phi}{(\nu_1-2)^2(\nu_1-4)}, \]

and

\[ \tan^2\phi = \frac{s_1^2/n_1}{s_2^2/n_2}. \]

The $(1-\alpha)$ H.P.D. interval is, approximately,

\[ \bar{y}_2 - \bar{y}_1 \pm t_{\alpha/2}(b)\, a\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^{1/2}. \]

Thus, for $\bar{y}_2 - \bar{y}_1 = 5$, $n_1 = 20$, $n_2 = 12$, $s_1^2 = 12$, and $s_2^2 = 40$, we find

\[ f_1 = 1.2063, \qquad f_2 = 0.1552, \qquad b = 13.376, \qquad a^2 = 1.026, \]

so that $\gamma \sim t(5, 4.035, 13.38)$ approximately. For $\alpha = 0.05$, $t_{0.025}(13.38) \doteq 2.15$ and the
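Since several intermediate quantities feed into this approximation, the example is worth recomputing. A sketch, assuming SciPy (whose t quantile accepts fractional degrees of freedom):

```python
from math import sqrt
from scipy.stats import t

# Case 6 of Table 2.13.1: Behrens-Fisher, t approximation
d, n1, n2, s1sq, s2sq = 5.0, 20, 12, 12.0, 40.0
nu1, nu2 = n1 - 1, n2 - 1
v1, v2 = s1sq / n1, s2sq / n2          # 0.6 and 3.333
cos2 = v2 / (v1 + v2)                  # cos^2(phi), about 0.8475
sin2 = 1 - cos2

f1 = nu2 / (nu2 - 2) * cos2 + nu1 / (nu1 - 2) * sin2            # 1.2063
f2 = (nu2**2 * cos2**2 / ((nu2 - 2)**2 * (nu2 - 4))
      + nu1**2 * sin2**2 / ((nu1 - 2)**2 * (nu1 - 4)))          # 0.1552
b = 4 + f1**2 / f2                     # 13.376
a2 = f1 * (b - 2) / b                  # 1.026

scale = sqrt(a2 * (v1 + v2))           # sqrt(4.035)
half = t.ppf(0.975, df=b) * scale
print(d - half, d + half)              # roughly (0.67, 9.33)
```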
95% H.P.D. interval is approximately $(0.67 < \gamma < 9.33)$.

7. The ratio of two Normal variances $\sigma_2^2/\sigma_1^2$, when $\theta_1$ and $\theta_2$ are unknown.

From (2.6.1),

\[ \frac{\sigma_2^2}{\sigma_1^2} \sim \frac{s_2^2}{s_1^2}\,F_{\nu_1, \nu_2}. \]

The standardized $(1-\alpha)$ H.P.D. interval is expressed in terms of $\log\sigma_2 - \log\sigma_1$, and the corresponding interval in $\sigma_2^2/\sigma_1^2$ is

\[ \frac{s_2^2}{s_1^2}\,\underline{F}(\nu_1, \nu_2, \alpha) < \frac{\sigma_2^2}{\sigma_1^2} < \frac{s_2^2}{s_1^2}\,\bar{F}(\nu_1, \nu_2, \alpha), \]

where $\underline{F}(\nu_1, \nu_2, \alpha)$ and $\bar{F}(\nu_1, \nu_2, \alpha)$ are given in Table V (at the end of this book). Thus, for $s_1^2 = 12$, $s_2^2 = 40$, $n_1 = 20$, $n_2 = 12$, then $\sigma_2^2/\sigma_1^2 \sim (3.33)\,F(19, 11)$. For $\alpha = 0.05$, $\underline{F}(19, 11, 0.05) = 0.352$, $\bar{F}(19, 11, 0.05) = 3.15$, and the 95% H.P.D. interval is $(1.172 < \sigma_2^2/\sigma_1^2 < 10.490)$.
8. The regression coefficients $\boldsymbol{\theta}$ of the Normal linear model, with $\sigma^2$ unknown.

First, from (2.7.1) and (2.7.8) compute $\mathbf{X}'\mathbf{X}$, $\mathbf{C} = (\mathbf{X}'\mathbf{X})^{-1}$,

\[ \hat{\boldsymbol{\theta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \qquad \text{and} \qquad s^2 = \frac{(\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}})}{\nu}, \]

where $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\theta}}$ and $\nu = n - k$. From (2.7.20), $\boldsymbol{\theta}$ has the multivariate $t$ distribution $t_k[\hat{\boldsymbol{\theta}}, s^2\mathbf{C}, \nu]$. The $(1-\alpha)$ H.P.D. region of $\boldsymbol{\theta}$ is given by

\[ \frac{(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})'(\mathbf{X}'\mathbf{X})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})}{k s^2} < F(k, \nu, \alpha), \]
where $F(k, \nu, \alpha)$ is given in Table VI (at the end of this book). For inferences about a subset of $\boldsymbol{\theta}$, let

\[ \mathbf{C} = \begin{bmatrix} \mathbf{C}_{11} & \mathbf{C}_{12} \\ \mathbf{C}_{21} & \mathbf{C}_{22} \end{bmatrix}. \]

Then from (2.7.22), the corresponding marginal posterior distribution is again of the multivariate $t$ form. For inferences concerning the $i$th individual element of $\boldsymbol{\theta}$, $i = 1, \ldots, k$,

\[ \theta_i \sim t(\hat{\theta}_i,\ s^2 c_{ii},\ \nu), \]

where $c_{ii}$ is the $i$th diagonal element of $\mathbf{C}$. The $(1-\alpha)$ H.P.D. interval of $\theta_i$ is

\[ \hat{\theta}_i \pm t_{\alpha/2}(\nu)\, s\sqrt{c_{ii}}. \]

Thus, for $k = 2$, $n = 18$,

\[ \mathbf{C} = \begin{bmatrix} 0.168 & -0.074 \\ -0.074 & 0.095 \end{bmatrix}, \qquad \hat{\boldsymbol{\theta}} = \begin{bmatrix} 98.9 \\ 124.4 \end{bmatrix}, \]

and for $\nu = 16$ and $s^2 = 157.8$,

\[ \boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \sim t_2\left( \begin{bmatrix} 98.9 \\ 124.4 \end{bmatrix},\ 157.8\begin{bmatrix} 0.168 & -0.074 \\ -0.074 & 0.095 \end{bmatrix},\ 16 \right). \]
For <J. = 0.05, F(2, 16, 0.05) = 3.6 3 and the 95 ~u H .P. D . region ofe is given by Q({) -2
2s
=
9(0 1
-
98.9)2
+ 14(0 1
9S .9)(Oz - 124.4)
-
+ 16(8z -
2 x 157.8
124.4)2 -
< 3.63.
As an example, for th e point eo = (120, 120),
Q(e o)
3015
2s
315.6
--z = -
-
= 9.56
> 3.63,
and therefore eo is excluded from the 957~ region.
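A sketch of the region check (assuming NumPy and SciPy). Note that the text rounds $\mathbf{X}'\mathbf{X} = \mathbf{C}^{-1}$ to the integer coefficients 9, 14, 16, so the statistic computed from the full inverse differs slightly from the printed 9.56:

```python
import numpy as np
from scipy.stats import f

# Case 8 of Table 2.13.1: H.P.D. region for the regression coefficients
C = np.array([[0.168, -0.074],
              [-0.074, 0.095]])
theta_hat = np.array([98.9, 124.4])
k, nu, s2 = 2, 16, 157.8

XtX = np.linalg.inv(C)                 # roughly [[9, 7], [7, 16]]
theta0 = np.array([120.0, 120.0])
diff = theta0 - theta_hat
Q = diff @ XtX @ diff
print(Q / (k * s2))                    # about 9.6 (9.56 with rounded coefficients)
print(f.ppf(0.95, k, nu))              # 3.63; the point theta0 is excluded
```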
For individual inferences,

\[ \theta_1 \sim t(98.9,\ 157.8 \times 0.168,\ 16), \qquad \theta_2 \sim t(124.4,\ 157.8 \times 0.095,\ 16). \]

For $\alpha = 0.05$, $t_{0.025}(16) = 2.12$ so that the individual 95% intervals are $(88.25 < \theta_1 < 109.55)$ and $(116.41 < \theta_2 < 132.39)$.
9. Comparison of $k$ Normal means $\theta_1, \ldots, \theta_k$ with common unknown variance $\sigma^2$.

From (2.11.10), the $(1-\alpha)$ H.P.D. region of the $(k-1)$ linearly independent contrasts $\phi_i = \theta_i - \bar{\theta}$, $i = 1, \ldots, k-1$, is given by

\[ \frac{\sum_{i=1}^{k} n_i[\phi_i - (\bar{y}_i - \bar{y})]^2}{(k-1)s^2} < F(k-1, \nu, \alpha), \]

where

\[ \bar{\theta} = \frac{1}{n}\sum_j n_j\theta_j, \qquad \bar{y} = \frac{1}{n}\sum_j n_j\bar{y}_j, \qquad s^2 = \frac{1}{\nu}\sum_i\sum_j (y_{ij} - \bar{y}_i)^2, \qquad \nu = n - k. \]

In particular, the point $\phi_i = 0$ (i.e., $\theta_1 = \cdots = \theta_k$) is included in the $(1-\alpha)$ region if

\[ \frac{\sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2}{(k-1)s^2} < F(k-1, \nu, \alpha). \]

For example, suppose $k = 3$, $n_1 = 10$, $n_2 = 10$, $n_3 = 12$, $\bar{y}_1 = 8$, $\bar{y}_2 = 10$, $\bar{y}_3 = 7$, and $s^2 = 9$; then $\bar{y} = 8.25$ and $\sum n_i(\bar{y}_i - \bar{y})^2 = 50$. For $\alpha = 0.05$, $F(2, 29, 0.05) = 3.33$ and $50/(2 \times 9) \doteq 2.8$, so that the point $\phi_1 = \phi_2 = 0$ (i.e., $\theta_1 = \theta_2 = \theta_3$) is included in the 95% region.
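The example is easily recomputed; a minimal sketch assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import f

# Case 9 of Table 2.13.1: comparison of three Normal means
n_i = np.array([10, 10, 12])
ybar_i = np.array([8.0, 10.0, 7.0])
s2, k = 9.0, 3
nu = n_i.sum() - k                     # 29

ybar = (n_i * ybar_i).sum() / n_i.sum()               # 8.25
stat = (n_i * (ybar_i - ybar) ** 2).sum() / ((k - 1) * s2)
print(stat)                            # about 2.78
print(f.ppf(0.95, k - 1, nu))          # 3.33; phi = 0 lies inside the region
```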
10. Comparison of $k$ Normal variances $\sigma_1^2, \ldots, \sigma_k^2$, with unknown means $\theta_1, \ldots, \theta_k$.
From (2.12.11) to (2.12.21), for the $(k-1)$ contrasts

\[ \phi_i = \log\sigma_i^2 - \log\sigma_k^2, \qquad i = 1, \ldots, k-1, \]

the H.P.D. region of content approximately $(1-\alpha)$ is given by

\[ \frac{-2\log W}{1+A} < \chi^2(k-1, \alpha), \]

where

\[ -2\log W = \nu\log\frac{\nu_k}{\nu} + \nu\log\left[1 + \sum_{i=1}^{k-1}\frac{\nu_i}{\nu_k}\exp[-(\phi_i - \hat{\phi}_i)]\right] + \sum_{i=1}^{k-1}\nu_i(\phi_i - \hat{\phi}_i), \]

\[ A = \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}\frac{1}{\nu_i} - \frac{1}{\nu}\right), \qquad \nu = \sum_{i=1}^{k}\nu_i, \qquad \hat{\phi}_i = \log s_i^2 - \log s_k^2, \]
and $\chi^2(k-1, \alpha)$ is given in Table III (at the end of this book). In particular, for $\phi_i = 0$ $(i = 1, \ldots, k-1)$ (i.e., $\sigma_1^2 = \cdots = \sigma_k^2$),

\[ -2\log W_0 = -\sum_{i=1}^{k}\nu_i(\log s_i^2 - \log\bar{s}^2), \]

where

\[ \bar{s}^2 = \nu^{-1}\sum_{i=1}^{k}\nu_i s_i^2. \]
Thus, for $k = 3$, $\nu_1 = \nu_2 = \nu_3 = 30$, $s_1^2 = 52.79$, $s_2^2 = 34.46$, $s_3^2 = 66.03$, we have $\nu = 90$, $\bar{s}^2 = 51.09$, $A = 0.015$, $\hat{\phi}_1 = -0.22$, $\hat{\phi}_2 = -0.65$, and

\[ \frac{-2\log W}{1+A} = \frac{-90\log 3 + 90\log\{1 + \exp[-(\phi_1 + 0.22)] + \exp[-(\phi_2 + 0.65)]\} + 30[(\phi_1 + 0.22) + (\phi_2 + 0.65)]}{1.015}. \]

For $\alpha = 0.05$, $\chi^2(2, 0.05) = 5.99$ so that the 95% H.P.D. region is given by $-2\log W/(1+A) < 5.99$. In particular, for $\phi_1 = \phi_2 = 0$,

\[ \frac{-2\log W_0}{1+A} = \frac{-\sum\nu_i(\log s_i^2 - \log\bar{s}^2)}{1+A} = \frac{3.143}{1.015} = 3.098, \]

so that the point $\phi_1 = \phi_2 = 0$ is included in the 95% region.
APPENDIX A2.1

Some Useful Integrals

We here give several integral formulae which are useful in the derivation of a number of distributions discussed in this book.

The Gamma, Inverted Gamma, and Related Integrals

For $a > 0$, $p > 0$,

\[ \int_0^\infty x^{p-1} e^{-ax}\,dx = a^{-p}\,\Gamma(p), \tag{A2.1.1} \]

\[ \int_0^\infty x^{-(p+1)} e^{-a/x}\,dx = a^{-p}\,\Gamma(p), \tag{A2.1.2} \]

\[ \int_0^\infty x^{p-1} e^{-ax^2}\,dx = \tfrac{1}{2}\,a^{-p/2}\,\Gamma\!\left(\tfrac{p}{2}\right), \tag{A2.1.3} \]

and

\[ \int_0^\infty x^{-(p+1)} e^{-a/x^2}\,dx = \tfrac{1}{2}\,a^{-p/2}\,\Gamma\!\left(\tfrac{p}{2}\right). \tag{A2.1.4} \]
More generally, for a > 0, P > 0, and
(J.
I"
I x p- !e '- ax ' ex
o
I'"
and
X - (p + l)
o
=
e - a·'-'dx
> 0, I -P 11r( -P) , -a ex ex =
(A2. I .S)
I -P1 ar(P) -a -. 'Y. ex
(A2.1.6)
The Dirichlet Integral

For $p_s > 0$, $s = 1, \ldots, n+1$, and $n = 1, 2, \ldots$, the integral

\[ \int_R x_1^{p_1-1} \cdots x_n^{p_n-1}(1 - x_1 - \cdots - x_n)^{p_{n+1}-1}\,dx_1 \cdots dx_n = \frac{\prod_{s=1}^{n+1}\Gamma(p_s)}{\Gamma\!\left(\sum_{s=1}^{n+1} p_s\right)}, \tag{A2.1.7} \]

where $R$: $(x_s > 0,\ \sum_{s=1}^{n} x_s < 1)$, is known as the Dirichlet integral. Alternatively, it can be expressed in the inverted form

\[ \int_0^\infty \cdots \int_0^\infty x_1^{p_1-1} \cdots x_n^{p_n-1}(1 + x_1 + \cdots + x_n)^{-\sum_{s=1}^{n+1} p_s}\,dx_1 \cdots dx_n = \frac{\prod_{s=1}^{n+1}\Gamma(p_s)}{\Gamma\!\left(\sum_{s=1}^{n+1} p_s\right)}. \tag{A2.1.8} \]
The Normal Integrals

For $-\infty < \eta < \infty$, $c > 0$,

\[ \int_{-\infty}^{\infty}\exp\left[-\frac{1}{2}\frac{(x-\eta)^2}{c}\right]dx = \sqrt{2\pi c}. \tag{A2.1.9} \]

Let $\boldsymbol{\eta}$ be a $n \times 1$ vector of constants and $\mathbf{C}$ a $n \times n$ positive definite symmetric matrix. Then

\[ \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\eta})'\mathbf{C}^{-1}(\mathbf{x}-\boldsymbol{\eta})\right]dx_1\cdots dx_n = (\sqrt{2\pi})^n\,|\mathbf{C}|^{1/2}, \tag{A2.1.10} \]

where $\mathbf{x} = (x_1, \ldots, x_n)'$.

The t Integrals
For II > 0,
[I + (X -c '1)2// ]-tI
L
f_
OO . oo
v
1)
rmrc{v) , dx= r[-Hv + I)J 'V vc,
CA2.1.11)
J46
~ormal
Standard
A2.2
Theory Inference Problems
For v > 0,
where '1 and c are defined in (A2.1.9).
'" J'" [ (X-ll)'C-1(X-TJ)]- ;( v+lI) dX J ... I + -
::0
-
=
1 ...
dX n
V
Cf
[fCl)]'T(}v) (v v)"ICl 12 f[1Cv + I1)J '
(A2.1.12)
where 11, C , and x are defined in (A2.1.10) .
APPENDIX A2.2
Stirling's Series To approximate the Gamma function, we often employ an asymptotic series discovered by Stirling, and later investigated by Bayes [see Milne-Thomson (1960)J The expansion involves an interesting class of po lynom ials discovered by D. Bernoulli.
Bernoulli Polynomials The Bernoulli polynomial of degree r, BrV), is generated by the following function,
te X ' e - I
I
'"
-,- =
r=O
t' -Br(x), r!
(A2.2.1)
Upon equating coefficients, we find

\[ B_0(x) = 1, \quad B_1(x) = x - \tfrac{1}{2}, \quad B_2(x) = x^2 - x + \tfrac{1}{6}, \quad B_3(x) = x(x-1)(x-\tfrac{1}{2}), \quad \ldots \tag{A2.2.2} \]
The Bernoulli numbers $B_r$ are generated by setting $x = 0$ in (A2.2.1),

\[ \frac{t}{e^t - 1} = \sum_{r=0}^{\infty}\frac{t^r}{r!}B_r. \tag{A2.2.3} \]

In particular $B_1 = -\tfrac{1}{2}$. Adding $t/2$ to both sides of (A2.2.3), we have

\[ \frac{t(e^t + 1)}{2(e^t - 1)} = 1 + \sum_{r=2}^{\infty}\frac{t^r}{r!}B_r. \tag{A2.2.4} \]

Since the left of (A2.2.4) is an even function of $t$, it follows that

\[ B_{2p+1} = 0, \qquad p = 1, 2, \ldots. \tag{A2.2.5} \]

It can also be shown that

\[ (-1)^{p+1}B_{2p} > 0 \quad \text{(that is, } B_{2p} \text{ alternates in sign)} \tag{A2.2.6} \]

and that $|B_{2p}|$ tends to infinity as $p \to \infty$. The first few Bernoulli numbers are

\[ B_0 = 1, \quad B_1 = -\tfrac{1}{2}, \quad B_2 = \tfrac{1}{6}, \quad B_4 = -\tfrac{1}{30}, \quad B_6 = \tfrac{1}{42}, \quad B_3 = B_5 = B_7 = \cdots = 0. \tag{A2.2.7} \]
Stirling's Series

Stirling's series provides an asymptotic expansion of the logarithm of the Gamma function $\Gamma(p + h)$ which is asymptotic in $p$ for bounded $h$. We have

\[ \log\Gamma(p + h) = \tfrac{1}{2}\log(2\pi) + p\log p - p + B_1(h)\log p - \sum_{r=1}^{n}\frac{(-1)^r B_{r+1}(h)}{r(r+1)p^r} + R_n(p), \tag{A2.2.8} \]
where $B_r(h)$ is the Bernoulli polynomial of degree $r$ given in (A2.2.1). The remainder term $R_n(p)$ is such that

\[ |R_n(p)| \le \frac{C_n}{|p|^{n+1}}, \tag{A2.2.9} \]

where $C_n$ is some positive constant independent of $p$. The series in (A2.2.8) is an asymptotic series in the sense that

\[ \lim_{|p|\to\infty}|R_n(p)\,p^n| = 0 \quad (n \text{ fixed}) \qquad \text{and} \qquad \lim_{n\to\infty}|R_n(p)\,p^n| = \infty \quad (p \text{ fixed}). \tag{A2.2.10} \]
Thus, even though the series in (A2.2.8) diverges for fixed $p$, it can be used to approximate $\log\Gamma(p + h)$ when $p$ is large. Setting $h = 0$, we obtain

\[ \log\Gamma(p) = \tfrac{1}{2}\log(2\pi) + (p - \tfrac{1}{2})\log p - p + \sum_{r=1}^{n}\frac{B_{2r}}{2r(2r-1)p^{2r-1}} + R_n'(p), \tag{A2.2.11} \]

where use is made of the fact that $B_{2r+1} = 0$. It can be shown that the remainder term $R_n'(p)$ satisfies the inequality

\[ |R_n'(p)| \le \frac{|B_{2(n+1)}|}{2(n+1)(2n+1)\,p^{2n+1}}. \tag{A2.2.12} \]
This, in conjunction with the fact that $B_{2(n+1)}$ alternates in sign, implies that, for positive $p$, the value of $\log\Gamma(p)$ always lies between the sum of $n$ terms and the sum of $n+1$ terms of the series, and that the absolute error of the series is less than the first term neglected. Even though (A2.2.11) is an asymptotic series, surprisingly close approximations can be obtained with it, even for small values of $p$. Taking the exponential of the log series, we obtain

\[ \Gamma(p) \doteq \sqrt{2\pi}\,p^{p-1/2}e^{-p}\left(1 + \frac{1}{12p} + \frac{1}{288p^2} - \frac{139}{51840p^3} - \frac{571}{2488320p^4} + \cdots\right), \tag{A2.2.13} \]

which is known as Stirling's series for the Gamma function.
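The closeness of the approximation even for small $p$ is easy to see numerically; a sketch comparing a three-term truncation of (A2.2.11) with the exact Gamma function:

```python
from math import gamma, log, pi, exp

def log_gamma_stirling(p, n_terms=3):
    """Truncated Stirling series (A2.2.11) for log Gamma(p),
    using B2 = 1/6, B4 = -1/30, B6 = 1/42."""
    B = {1: 1.0 / 6, 2: -1.0 / 30, 3: 1.0 / 42}
    s = 0.5 * log(2 * pi) + (p - 0.5) * log(p) - p
    for r in range(1, n_terms + 1):
        s += B[r] / (2 * r * (2 * r - 1) * p ** (2 * r - 1))
    return s

for p in (2.0, 5.0, 10.0):
    print(p, exp(log_gamma_stirling(p)), gamma(p))
# even at p = 2 the three-term series reproduces Gamma(2) = 1
# to within a few parts in 10**5
```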
CHAPTER 3

BAYESIAN ASSESSMENT OF ASSUMPTIONS 1. EFFECT OF NON-NORMALITY ON INFERENCES ABOUT A POPULATION MEAN WITH GENERALIZATIONS

3.1 INTRODUCTION
In Chapter 2 a number of conventional problems in statistical inference were considered. Although interpretation was different, in most cases the Bayesian results closely paralleled those from sampling theory.† Whichever mode of inferential reasoning was adopted, certain assumptions were necessary in deriving these results. Specifically, it was assumed that observations could be regarded as Normally and independently distributed. Although at first sight this assumption seems restrictive, abundant evidence has accumulated over the years that the basic results are of great usefulness in the actual conduct of scientific investigation. Nevertheless, exploration in new directions is desirable but has tended to be confined by the technical limitations of sampling theory. In particular, development along this route becomes unwieldy unless a set of minimal sufficient statistics happens to exist for the parameters considered. The Bayesian approach is not restricted in this way, and a principal object of our book is to explore some of the ways in which this flexibility may be put to use.‡ In particular, the consequences of relaxing conventional assumptions may be studied. In the present chapter the assumption of Normality is relaxed. Specifically, inferential problems about means and regression coefficients are considered for a broader class of distributions which includes the Normal as a special case.

3.1.1 Measures of Distributional Shape, Describing Certain Types of non-Normality
Following Fisher, let $\kappa_j$ be the $j$th cumulant of the distribution under study. Then $\kappa_1 = \mu$ is the mean, and $\kappa_2 = \sigma^2$ is the variance. Since for the Normal distribution the $\kappa_j$ are zero for all $j > 2$, each scale-free quantity

\[ \gamma_{j-2} = \frac{\kappa_j}{\kappa_2^{j/2}}, \qquad j = 3, 4, \ldots, \tag{3.1.1} \]

measures some aspect of non-Normality. The measures $\gamma_1$ and $\gamma_2$, first considered by Karl Pearson, are of special interest.

† The Behrens-Fisher problem was one instance in which the Bayesian result had no sampling parallel.
‡ Mosteller and Wallace (1964) provide other interesting illustrations of the use to which this freedom can be put.
Skewness

The standardized third cumulant,

\[ \gamma_1 = \frac{\kappa_3}{\kappa_2^{3/2}} = \frac{E(y-\mu)^3}{\sigma^3}, \tag{3.1.2} \]

provides a measure of skewness of the distribution. Thus in Fig. 3.1.1, $\gamma_1$ is zero for the symmetric distribution seen at the center of the diagram; it is negative for the distribution to the left, which is said to be negatively skewed; it is positive for the distribution to the right, which is said to be positively skewed.
"II
<0
cO
"II
Negutivdy skl.:w
"II
:· 0
Posilivcly
SYlllllll..:trit:
,k~w
Fig. 3.1.1 SymmetriC and skewed distributions. K urlosis
The standardized fourth cumulant,

\[ \gamma_2 = \frac{\kappa_4}{\kappa_2^2} = \frac{E(y-\mu)^4}{\sigma^4} - 3, \tag{3.1.3} \]

measures a characteristic of the distribution called kurtosis. For the Normal distribution $\gamma_2 = 0$. If $\gamma_2 > 0$, the distribution is said to be leptokurtic. If $\gamma_2 < 0$, it is said to be platykurtic. The Normal distribution is contrasted with a symmetric leptokurtic and a symmetric platykurtic distribution in Fig. 3.1.2. Typically a leptokurtic distribution has less pronounced "shoulders" and heavier tails than the Normal. For instance, Student's $t$ distribution is markedly leptokurtic when the number of degrees of freedom is small. On the other hand, a platykurtic distribution typically has squarer shoulders and lighter tails. As an example, the rectangular distribution is highly platykurtic.
Fig. 3.1.2 The Normal distribution contrasted with a symmetric platykurtic distribution and a symmetric leptokurtic distribution.
Of course, distributions can exhibit kurtosis and skewness at the same time. Thus the $\chi^2$ distribution is leptokurtic and positively skewed (especially when the number of degrees of freedom is small).

3.1.2 Situations where Normality would not be Expected
As we have said in Section 2.1, one can expect the distributions of the observations to be approximately Normal when the experimental conditions are such as to produce a central limit tendency, that is to say, when the errors arise from a variety of independent sources, none of which are dominant. Nevertheless we would expect that certain kinds of measurements would not be Normally distributed. An example is yarn breaking strength. If the yarn is thought of as being made up of a number of links (like a chain), with the break occurring at the weakest link, and if the distribution of the strength of an individual link was Normal, the breaking strength will be distributed like the distribution of the smallest observation from a Normal sample. This extreme value distribution is skewed and highly leptokurtic. However, this does not mean that the Normal assumption is unrealistic for all experiments where the data are breaking strengths. Consider again the experiment described in Section 2.5 in which two different types of spinning machines were compared. The non-Normal error contributed by breaking strength measurement might be a minor contributor to the overall error, for this would usually be dominated by machine differences and sampling errors. The many components associated with these dominant sources would be likely to produce a central limit tendency, and approximate Normality might be expected in this example, even though the data analyzed were breaking strengths. Platykurtic distributions for individual errors can also occur. For example, suppose that in successive runs of a chemical apparatus, temperature varied about a set point. The experimenter might accept minor variations, but when a larger deviation occurred the run might be abandoned and a new run substituted. While the resulting "truncation" could lead to a platykurtic distribution, again, when other sources of error are present, this component could be swamped by the effect of the other errors. The investigator will rarely be in the position where he can be certain of the precise form of the overall error distribution. Rather, his opinion may be described by a distribution of distributions centered about some central distribution. His state of mind will depend on the experimental setup, and situations will sometimes occur where he expects a dominant non-Normal source of variation to determine the overall error distribution. More frequently, the setup will be one where he expects a finite number of sources each to have an important role. While in these circumstances the central limit theorem, which describes an asymptotic property, does not allow him to assume exact Normality, it does provide a basis for thinking of his distribution of distributions as concentrated about the
Normal. Furthermore, he is sometimes concerned only with differences between observations drawn from distributions supposed similar except in location. It may then be assumed that the parent distribution of these differences is a member of a class of symmetric distributions clustered about the Normal. In this chapter we explore the use of a class of symmetric distributions with kurtosis measured by a non-Normality parameter $\beta$ which takes the value zero for the Normal distribution. The state of uncertainty about the parent distribution can then be expressed by giving to $\beta$ an informative unimodal prior probability distribution centered at zero. The sharpness of this prior distribution can be varied so that the effect of varying degrees of uncertainty about Normality can be represented. In particular, when the prior distribution becomes a $\delta$ function at $\beta = 0$, exact Normality is assumed. This is an illustration of the manner in which the Bayesian approach can be used to assess the effect of uncertainty in assumptions.
3.2 CRITERION ROBUSTNESS AND INFERENCE ROBUSTNESS ILLUSTRATED USING DARWIN'S DATA
On sampling theory, once the data-generating model is assumed, criteria appropriate to that assumption can be derived for inferential purposes. For example, suppose the observations $y_1, \ldots, y_n$ are regarded as a random sample from a Normal population $N(\theta, \sigma^2)$ and it is desired to make inferences about the mean $\theta$. Then, if $\sigma$ is unknown, the usual criterion is the $t$ statistic

\[ t = \frac{\bar{y} - \theta}{s/\sqrt{n}}. \]

Apart from the sample size $n$, this criterion involves the data only via the sample mean $\bar{y}$ and the sample standard deviation $s = [(n-1)^{-1}\sum(y - \bar{y})^2]^{1/2}$, which are jointly sufficient for $(\theta, \sigma)$. Inferences about $\theta$ are then based on the sampling distribution of $t$, assuming a Normal parent distribution. It is customary to justify the use of such a Normal theory criterion in the practical circumstance in which Normality cannot be guaranteed, by arguing that the distribution of this $t$ criterion is but little affected by moderate non-Normality in the parent distribution; that is, it is robust under non-Normality. We shall refer to this type of insensitivity as criterion robustness under non-Normality. This argument, however, does not take into account the fact that if the parent distribution really differed from the Normal, the appropriate criterion would no longer be the Normal theory $t$ statistic. For instance, suppose it was known that the parent distribution was rectangular (uniform); then the same sampling theory arguments previously leading to the $t$ criterion would show that inferences were best made using the criterion

\[ W = \frac{m - \theta}{h/(n-1)}, \]

with $m$ the sample midpoint and $h$ the half range, as defined below in (3.2.2a),
which now involves the data only via $y_{(n)}$ and $y_{(1)}$, the largest and the smallest observations in the sample. Thus, on the assumption that the sample comes from a rectangular distribution, inferences about $\theta$ ought to be based not on the distribution of $t$ but on the distribution of $W$. The example which follows shows that although the distribution of the Normal theory $t$ criterion is not changed very much by assuming the parent to have some distribution other than the Normal, the inference to be drawn from a particular sample can be markedly different when we employ a criterion appropriate to this other distribution. To distinguish it from criterion robustness, the property of insensitivity of inferences to departures from assumptions we shall call inference robustness.
Darwin's Data

Consider the analysis of Darwin's data on the difference in heights of self- and cross-fertilized plants quoted by Fisher (1960, p. 37). The data consist of measurements on 15 pairs of plants. Each pair contained a self-fertilized and a cross-fertilized plant grown in the same pot and from the same seed. Following Fisher, we shall treat as our observations the 15 differences $y_i$ ($i = 1, \ldots, 15$), which are set out in Table 3.2.1 and plotted below the horizontal axis in Fig. 3.2.1.

Table 3.2.1 Darwin's data: differences (in eighths of an inch) of heights of 15 pairs of self- and cross-fertilized plants

49  -67  8  16  6  23  28  41  14  29  56  24  75  60  -48
On sampling theory, given that the differences are a random sample from a Normal parent population $N(\theta, \sigma^2)$, one should interpret these data using the paired $t$ test. In particular, for a significance test appropriate to the hypothesis that $\theta = 0$ against the alternative $\theta > 0$, the associated significance level is 2.485%. The curve on the right in Fig. 3.2.1 is an appropriately scaled $t$ distribution with 14 degrees of freedom centered about $\bar{y} = 20.933$, with scale factor $s/\sqrt{n} = 9.746$, where $s^2 = \sum(y_i - \bar{y})^2/(n-1) = 1{,}424.6$. For definiteness, we shall call this distribution the reference distribution for $\theta$. This reference distribution may be variously interpreted. It was regarded by Fisher as the fiducial distribution of $\theta$. It is also a "confidence distribution" simultaneously allowing every confidence interval to be associated with its appropriate confidence coefficient. Finally, if as in Section 2.4, we make the assumption that $\theta$ and $\log\sigma$ are approximately independent and locally uniform a priori, it is the posterior distribution of $\theta$. Now suppose that instead of assuming Normality for the parent distribution, we supposed it to be uniform over some unknown range $\theta - \sqrt{3}\sigma$ to $\theta + \sqrt{3}\sigma$.
Fig. 3.2.1 Distributions of $\theta$ for Darwin's data: Normal and rectangular parent distributions (dots under the horizontal axis are the 15 differences in height recorded by Darwin).
Such a supposition would, of course, be quite ridiculous in the present example. First, we know that many contributing errors arising from genetic differences, soil differences, and so forth, tend to produce a central limit effect, so that we can expect with good reason that the heights themselves and, even more, their differences will be closely Normally distributed. Second, the evidence from the sample itself does not support the uniform assumption. However, for illustration, let us make the assumption of a rectangular instead of a Normal parent and let us consider the effect of this extreme degree of non-Normality on the distribution of the $t$ statistic. This can be approximately calculated using, for example, the work of Geary (1936), Gayen (1949, 1950), or of Box and Andersen (1955). Following these latter authors, it can be shown that, when the parent is non-Normal, the null distribution of $t^2$ is approximated by
an $F$ distribution with $\delta$ and $\delta(n-1)$ degrees of freedom, where

\[ \delta = 1 + \frac{E(b-3)}{n}, \qquad b = \frac{(n+2)\sum y_i^4}{\left(\sum y_i^2\right)^2}, \]

and

\[ E(b-3) \doteq \gamma_2 - \frac{1}{n-1}\left(2\gamma_4 - 3\gamma_2^2 + 11\gamma_2\right) + \frac{2}{(n-1)^2}\left(3\gamma_6 - 16\gamma_4\gamma_2 + 15\gamma_2^3 + 38\gamma_4 - 3\gamma_2^2 + 86\gamma_2\right), \tag{3.2.1} \]

where $\gamma_{r-2} = \kappa_r/\kappa_2^{r/2}$,
$r = 3, 4, \ldots$, and the $\kappa_r$'s are the cumulants of the parent distribution of the differences. In our present example, $\delta$ is found to be 0.913. Thus, $t^2$ is approximately distributed as $F$ with 0.913 and 12.78 (instead of 1 and 14) degrees of freedom. In particular, the significance level associated with the hypothesis that $\theta = 0$ against the alternative $\theta > 0$ is now 2.388% as compared with the previous value of 2.485%. The test of the hypothesis that the true difference is zero, using the $t$ criterion, is thus very little affected by this major departure from Normality. Furthermore, confidence intervals, and hence the confidence distribution, based on the $t$ statistic would be almost unchanged. However, if we really knew that the parent distribution was rectangular, then the largest observation $y_{(n)}$ and the smallest observation $y_{(1)}$ would be jointly sufficient for $(\theta, \sigma)$; and, as mentioned earlier, we would, on classical sampling theory, be led to consider not the $t$ criterion but the function
\[ W = \frac{m - \theta}{h/(n-1)}, \tag{3.2.2a} \]

where $m = \tfrac{1}{2}(y_{(n)} + y_{(1)})$ is the midpoint, and $h = \tfrac{1}{2}(y_{(n)} - y_{(1)})$ is the half range. On the assumption that the parent is rectangular, the variate $W$ is distributed as

\[ p(W) = \frac{1}{2}\left(1 + \frac{|W|}{n-1}\right)^{-n}, \qquad -\infty < W < \infty \tag{3.2.2b} \]

[Neyman and Pearson (1928); Carlton (1946)]. Thus, the variate $|W|$ has the $F$ distribution with 2 and $2(n-1)$ degrees of freedom. The distribution of $W$ in (3.2.2b) may be called a "double $F$ distribution" with $[2, 2(n-1)]$ degrees of freedom, since it consists of two such $F$ distributions standing "back to back."
The curve on the left in Fig. 3.2.1 is an appropriately scaled double $F$ distribution with (2, 28) degrees of freedom centered at the sample midpoint, $m = 4.0$, with scale factor $h/(n-1) = 5.07$, where the sample half range is $h = 71$. Just as the right-hand curve in the figure exemplifies the inferential situation with the Normal assumption, so the curve on the left correspondingly exemplifies the situation with the rectangular assumption. As before, it can be interpreted either as a fiducial distribution, as a confidence distribution, or, on parallel assumptions to those previously used, as a posterior distribution. It is seen to be very different from the one appropriate to the Normal assumption. In particular, the sampling theory significance level associated with the hypothesis that $\theta = 0$ against the alternative $\theta > 0$ is now not 2.485%, but 23.215%. Thus, whichever form of derivation we favor, we see that if we assume a rectangular parent distribution, the inferences to be drawn are very different from those appropriate for a Normal parent. This is so even though the $t$ criterion itself is very little affected by even this large departure from Normality. One reason for the large difference of the distributions in Fig. 3.2.1 is that one curve is centered at the sample mean $\bar{y} = 20.9$, and the other at the sample midpoint $m = 4.0$. The mean and the midpoint for Darwin's data are very different mainly because of two rather large negative differences, and for this data we have an example in which the criterion is robust but the inference is not. As we have explained, it is not seriously suggested that the rectangular distribution is a reasonable choice for the parent. We wish only to emphasize that uncertainty in our knowledge of the parent distribution transmits itself rather forcefully into an uncertainty about the inferences we can make concerning $\theta$, and the difficulty which this presents in our interpretation of the data is not avoided by knowledge of robustness under non-Normality of the criterion. The difficulty can be resolved by making provision for an appropriate state of uncertainty about the parent distribution in the formulation. Possible knowledge about the parent distribution is of two kinds: that coming from the sample itself and a priori knowledge which may come from familiarity with the physical setup. Both of these can be taken account of in an appropriate Bayesian formulation.
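The contrast between the two significance levels can be reproduced directly from the data; a minimal sketch, assuming SciPy for the $t$ tail area, with the double-$F$ tail area computed from (3.2.2b) in closed form:

```python
import numpy as np
from scipy.stats import t

# Darwin's 15 differences in eighths of an inch
y = np.array([49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48])
n = len(y)

# Normal parent: one-sided paired t test of theta = 0
tstat = y.mean() / (y.std(ddof=1) / np.sqrt(n))
print(t.sf(tstat, df=n - 1))           # about 0.0248

# Rectangular parent: tail area of the double F distribution (3.2.2b)
m = (y.max() + y.min()) / 2            # midpoint, 4.0
h = (y.max() - y.min()) / 2            # half range, 71
w0 = m / (h / (n - 1))
print(0.5 * (1 + w0 / (n - 1)) ** (-(n - 1)))   # about 0.232
```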
3.2.1 A Wider Choice of the Parent Distribution
If, in the analysis of Darwin's data, we supposed that the parent distributions of self- and cross-fertilized plants were identical except for location, then the distribution of the differences would certainly be symmetric. We will assume therefore that the parent distribution of differences is a member of a class of symmetric distributions which includes the Normal, together with other distributions on the one hand more leptokurtic, and on the other hand more platykurtic. Now the standardized Normal distribution may be written

\[ p(x) = k\exp(-\tfrac{1}{2}|x|^q) \qquad \text{with} \qquad q = 2. \]
By allowing $q$ to take values other than two we obtain what may be called the class of exponential power distributions. These distributions were considered by Diananda (1949), Box (1953b), and Turner (1960); with $q = 2/(1+\beta)$ they can be written in the general form

\[ p(y \mid \theta, \phi, \beta) = k\,\phi^{-1}\exp\left(-\frac{1}{2}\left|\frac{y-\theta}{\phi}\right|^{2/(1+\beta)}\right), \qquad -\infty < y < \infty, \tag{3.2.3} \]

where $k$ is the appropriate normalizing constant and

\[ -\infty < \theta < \infty, \qquad \phi > 0, \qquad -1 < \beta \le 1. \]

In (3.2.3), $\theta$ is a location parameter and $\phi$ is a scale parameter.† It can be readily shown that

\[ E(y) = \theta, \qquad \mathrm{Var}(y) = \sigma^2 = 2^{(1+\beta)}\,\frac{\Gamma[\tfrac{3}{2}(1+\beta)]}{\Gamma[\tfrac{1}{2}(1+\beta)]}\,\phi^2. \tag{3.2.4} \]
We may alternatively express (3.2.3) as

\[ p(y \mid \theta, \sigma, \beta) = \omega(\beta)\,\sigma^{-1}\exp\left[-c(\beta)\left|\frac{y-\theta}{\sigma}\right|^{2/(1+\beta)}\right], \qquad -\infty < y < \infty, \tag{3.2.5} \]

where

\[ c(\beta) = \left\{\frac{\Gamma[\tfrac{3}{2}(1+\beta)]}{\Gamma[\tfrac{1}{2}(1+\beta)]}\right\}^{1/(1+\beta)}, \qquad \omega(\beta) = \frac{\{\Gamma[\tfrac{3}{2}(1+\beta)]\}^{1/2}}{(1+\beta)\{\Gamma[\tfrac{1}{2}(1+\beta)]\}^{3/2}}, \]

and $\sigma > 0$, $-\infty < \theta < \infty$, $-1 < \beta \le 1$.
The parameters $\theta$ and $\sigma$ are then the mean and the standard deviation of the population, respectively. The parameter $\beta$ can be regarded as a measure of kurtosis indicating the extent of the "non-Normality" of the parent population. In particular, when $\beta = 0$, the distribution is Normal. When $\beta = 1$, the distribution is the double exponential

\[ p(y \mid \theta, \sigma, \beta = 1) = \frac{1}{\sqrt{2}\,\sigma}\exp\left(-\sqrt{2}\,\frac{|y-\theta|}{\sigma}\right), \qquad -\infty < y < \infty. \tag{3.2.6a} \]
† In our earlier work (1962) the form (3.2.3) was employed, but there the symbol $\sigma$ was used for the general scale parameter now called $\phi$.
Finally, when $\beta$ tends to $-1$, it can be shown that the distribution tends to the rectangular distribution,

\[ \lim_{\beta\to-1} p(y \mid \theta, \sigma, \beta) = \frac{1}{2\sqrt{3}\,\sigma}, \qquad \theta - \sqrt{3}\,\sigma < y < \theta + \sqrt{3}\,\sigma. \tag{3.2.6b} \]
Figure 3.2.2 shows the exponential power distribution for various values of $\beta$. The distributions shown have common mean and standard deviation. We see that for $\beta > 0$ the distributions are leptokurtic and for $\beta < 0$ they are platykurtic. To further illustrate the effect of $\beta$ on the shape of the distribution, Table 3.2.2 gives the upper $100\alpha$ percent points in units of $\sigma$ for various choices of $\beta$, with $\theta$ assumed zero. Except for $\beta$ equal to 0 and 1, and the limiting case $\beta \to -1$, the percentage points in the table were calculated by numerical integration on a computer [Tiao and Lund (1970)]. In (3.2.5) we have employed the non-Normality measure $\beta$ which makes the double exponential and the rectangular distribution equally discrepant from the Normal. However, we might have used, for example, the familiar kurtosis measure $\gamma_2 = \kappa_4/\kappa_2^2$ for the class of exponential power distributions. It is readily shown that

\[ \gamma_2 = \frac{\Gamma[\tfrac{5}{2}(1+\beta)]\,\Gamma[\tfrac{1}{2}(1+\beta)]}{\{\Gamma[\tfrac{3}{2}(1+\beta)]\}^2} - 3. \tag{3.2.7} \]

Table 3.2.3 gives the value of $\gamma_2$ for various values of $\beta$. In terms of $\gamma_2$, the double exponential distribution would appear as 3 and the rectangular distribution as $-1.2$. However, whether $\beta$, $\gamma_2$, or any similar measure of non-Normality is adopted, the analysis which follows will be much the same.
Table 3.2.2 Upper $100\alpha$ percent points of the exponential power distribution for various values of $\beta$, in units of standard deviation $\sigma$ ($\theta = 0$)

  beta    a=0.25   0.10    0.05    0.025   0.01    0.005   0.001
 -1.00     0.87    1.39    1.56    1.65    1.70    1.71    1.73
 -0.75     0.84    1.36    1.57    1.71    1.84    1.91    2.05
 -0.50     0.80    1.35    1.61    1.81    2.03    2.16    2.41
 -0.25     0.73    1.31    1.63    1.89    2.18    2.37    2.75
  0.00     0.67    1.28    1.64    1.96    2.33    2.58    3.09
  0.25     0.62    1.25    1.65    2.02    2.46    2.77    3.43
  0.50     0.58    1.22    1.65    2.06    2.58    2.94    3.76
  0.75     0.53    1.18    1.64    2.09    2.68    3.10    4.08
  1.00     0.49    1.14    1.63    2.12    2.77    3.28    4.39
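The numerical integration that Tiao and Lund describe can be sketched as follows, assuming SciPy's quad and brentq; the function names are illustrative. For $\beta = 0$ it reproduces the familiar Normal percentage points of the table.

```python
from math import gamma, exp
from scipy.integrate import quad
from scipy.optimize import brentq

def density(y, beta):
    """Standardized exponential power density (3.2.5), theta = 0, sigma = 1."""
    q = 2.0 / (1.0 + beta)
    c = (gamma(1.5 * (1 + beta)) / gamma(0.5 * (1 + beta))) ** (1.0 / (1 + beta))
    w = gamma(1.5 * (1 + beta)) ** 0.5 / ((1 + beta) * gamma(0.5 * (1 + beta)) ** 1.5)
    return w * exp(-c * abs(y) ** q)

def upper_point(beta, alpha):
    """Solve for x with upper tail area alpha (the tail beyond 20 is negligible)."""
    tail = lambda x: quad(density, x, 20, args=(beta,))[0] - alpha
    return brentq(tail, 0.0, 10.0)

print(upper_point(0.0, 0.05))    # about 1.645, as in Table 3.2.2
print(upper_point(0.5, 0.01))    # about 2.58
```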
Fig. 3.2.2 Exponential power distributions with common standard deviation for various values of $\beta$.
Table 3.2.3 Relationship between $(\beta, \gamma_2)$

  beta:    -1.0   -0.75   -0.50   -0.25    0.0    0.25    0.50    0.75    1.00
  gamma2:  -1.2   -1.07   -0.81   -0.45    0.0    0.55    1.22    2.03    3.00
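Formula (3.2.7) reproduces the table directly; a minimal sketch using only the standard library:

```python
from math import gamma

def gamma2(beta):
    """Kurtosis measure (3.2.7) for the exponential power family."""
    a = 1.0 + beta
    return gamma(2.5 * a) * gamma(0.5 * a) / gamma(1.5 * a) ** 2 - 3.0

for b in (-0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0):
    print(b, round(gamma2(b), 2))
# reproduces Table 3.2.3: -1.07, -0.81, -0.45, 0.0, 0.55, 1.22, 2.03, 3.0
```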
3.2.2 Derivation of the Posterior Distribution of $\theta$ for a Specific Symmetric Parent

Given a sample of $n$ independent observations from a member of the class of distributions in (3.2.5), the likelihood function is

\[ l(\theta, \sigma, \beta \mid \mathbf{y}) \propto [\omega(\beta)]^n\,\sigma^{-n}\exp\left[-c(\beta)\sum_{i=1}^{n}\left|\frac{y_i-\theta}{\sigma}\right|^{2/(1+\beta)}\right]. \tag{3.2.8} \]

Consider now the posterior distribution of $\theta$, assuming that the parameter $\beta$ is known. For a fixed $\beta$, the likelihood function is

\[ l(\theta, \sigma \mid \beta, \mathbf{y}) \propto \sigma^{-n}\exp\left[-c(\beta)\sum_{i=1}^{n}\left|\frac{y_i-\theta}{\sigma}\right|^{2/(1+\beta)}\right]. \tag{3.2.9} \]

In addition, the noninformative reference prior for $\theta$ and $\sigma$ is such that locally

\[ p(\theta, \sigma \mid \beta) \propto \sigma^{-1}. \tag{3.2.10} \]

[See the discussion in Section 1.3 concerning the location-scale family (1.3.57), of which (3.2.5) is a special case, and the discussion leading to (1.3.105).] The joint posterior distribution of $(\theta, \sigma)$ is then

\[ p(\theta, \sigma \mid \beta, \mathbf{y}) \propto \sigma^{-(n+1)}\exp\left[-c(\beta)\sum_{i=1}^{n}\left|\frac{y_i-\theta}{\sigma}\right|^{2/(1+\beta)}\right], \qquad -\infty < \theta < \infty, \quad \sigma > 0. \tag{3.2.11} \]

Employing (A2.1.6) in Appendix A2.1 to integrate out $\sigma$, the posterior distribution of $\theta$ is then

\[ p(\theta \mid \beta, \mathbf{y}) = [J(\beta)]^{-1}[M(\theta)]^{-\frac{1}{2}n(1+\beta)}, \qquad -\infty < \theta < \infty, \tag{3.2.12} \]

where

\[ M(\theta) = \sum_i |y_i - \theta|^{2/(1+\beta)} \]

and $[J(\beta)]^{-1}$ is the appropriate normalizing constant. Thus, for any fixed $\beta$, $p(\theta \mid \beta, \mathbf{y})$ is simply proportional to a power of $M(\theta)/n$, the absolute moment of order $2/(1+\beta)$ of the observations about $\theta$. The constant $[J(\beta)]^{-1}$ is such that

\[ J(\beta) = \int_{-\infty}^{\infty}[M(\theta)]^{-\frac{1}{2}n(1+\beta)}\,d\theta \tag{3.2.13} \]
and is merely a normalizing factor which ensures that the total area under the distribution is unity. This integral cannot usually be expressed as a simple function of the observations; it can, of course, always be computed numerically, and with the availability of electronic computers this presents no particular difficulty. If all that is required is to draw the posterior distribution so that it can be presented to the investigator, it is seldom necessary to bother with the value of the normalizing constant at all. He appreciates that a probability distribution must have unit area, and for most inferential purposes the ordinates need not even be marked. Using (3.2.12), posterior distributions computed from Darwin's data for various values of $\beta$ are shown in Fig. 3.2.3.
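Since (3.2.12) needs only a one-dimensional grid, the curves of Fig. 3.2.3 are easy to recompute; a sketch in NumPy (working in logs to avoid under- and overflow):

```python
import numpy as np

# Darwin's differences again
y = np.array([49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48])
n = len(y)
theta = np.linspace(-40, 70, 2001)

def posterior(beta):
    """p(theta | beta, y) of (3.2.12), normalized numerically."""
    M = (np.abs(y[None, :] - theta[:, None]) ** (2.0 / (1.0 + beta))).sum(axis=1)
    logp = -0.5 * n * (1.0 + beta) * np.log(M)
    p = np.exp(logp - logp.max())
    return p / np.trapz(p, theta)

for beta in (-0.9, 0.0, 0.9):
    post = posterior(beta)
    print(beta, theta[post.argmax()])  # modes near 4.0, 20.9 and 24.0
```

The three modes illustrate the point made below: as $\beta$ moves from near $-1$ to near $1$ the center of the posterior shifts from the midpoint, to the mean, to the median.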
Fig. 3.2.3 Posterior distribution of $\theta$ for various choices of $\beta$: Darwin's data.
3.2.3 Properties of the Posterior Distribution of $\theta$ for a Fixed $\beta$

Since $p(\theta \mid \beta, \mathbf{y})$ is a monotonic function of $M(\theta)$, we find (see Appendix A3.1):

1. $p(\theta \mid \beta, \mathbf{y})$ is continuous, and, for $-1 < \beta < 1$, is differentiable and unimodal, although not necessarily symmetric, the mode being attained in the interval $[y_{(1)}, y_{(n)}]$. The modal value is in fact the maximum likelihood estimate of $\theta$. However, it should be noted that we are not concerned with the distribution of this maximum likelihood estimate; rather, we are considering the distribution of $\theta$ given the data.
2. When $\beta = 0$, $M(\theta) = \sum(y_i - \theta)^2 = (n-1)s^2 + n(\bar{y} - \theta)^2$, and making the necessary substitutions we obtain, for the posterior distribution of $\theta$,

\[ p\left(t = \frac{\theta - \bar{y}}{s/\sqrt{n}}\,\middle|\,\beta = 0, \mathbf{y}\right) = p(t_{n-1}), \tag{3.2.14} \]

where $p(t_{n-1})$ is the density of the $t(0, 1, n-1)$ distribution.

3. When $\beta$ approaches $-1$,

\[ \lim_{\beta\to-1}[M(\theta)]^{\frac{1}{2}(1+\beta)} = h + |m - \theta|, \]

where $m$ and $h$ are as in (3.2.2a). Making the necessary substitutions, we find

\[ \lim_{\beta\to-1} p(\theta \mid \beta, \mathbf{y}) = \frac{1}{2}\left(\frac{h}{n-1}\right)^{-1}\left(1 + \frac{|W|}{n-1}\right)^{-n}, \qquad -\infty < \theta < \infty, \tag{3.2.15} \]

where

\[ W = \frac{\theta - m}{h/(n-1)}. \]

Thus,

\[ \lim_{\beta\to-1} p\left(F = \frac{|\theta - m|}{h/(n-1)}\,\middle|\,\beta, \mathbf{y}\right) = p[F_{2, 2(n-1)}], \qquad F > 0, \tag{3.2.16} \]

where $p[F_{2, 2(n-1)}]$ is the density of an $F$ variable with $[2, 2(n-1)]$ degrees of freedom. For Darwin's data, (3.2.15) is the reference distribution for $\theta$ shown by the curve on the left in Fig. 3.2.1, but now derived as a limiting posterior distribution. Thus, we see that, when the parent is Normal ($\beta = 0$), our expression (3.2.12) yields the $t$ distribution as expected, and when the parent approaches the rectangular ($\beta \to -1$), again as expected, (3.2.12) tends to the double $F$ distribution with 2 and $2(n-1)$ degrees of freedom. In each case, the posterior distribution can be expressed in terms of simple functions of the observations which are minimal sufficient statistics for $\theta$ and $\sigma$.
4. When $\beta$ approaches 1, the distribution in (3.2.12) is not expressible in terms of simple functions of the observations. However, in the limit the mode of the posterior distribution is the median of the observations if $n$ is odd, and is some unique value between the values of the middle two observations if $n$ is even. When $\beta = 1$ and $n$ is even, the density is, in fact, constant for values of $\theta$ between the middle two observations.
5. In certain other cases, it is possible to express the posterior distribution of $\theta$ in terms of a fixed number of functions of the observations. For instance, letting $\beta = (2-q)/q$, then for $q = 2, 4, \ldots$, we have

\[ p(\theta, \sigma \mid \beta, \mathbf{y}) \propto \sigma^{-(n+1)}\exp\left[-c(\beta)\,\sigma^{-q}\sum_{r=0}^{q}(-1)^r\binom{q}{r}\theta^r S_{q-r}\right], \qquad -\infty < \theta < \infty, \quad \sigma > 0, \tag{3.2.17} \]

and

\[ p(\theta \mid \beta, \mathbf{y}) \propto \left[\sum_{r=0}^{q}(-1)^r\binom{q}{r}\theta^r S_{q-r}\right]^{-n/q}, \qquad -\infty < \theta < \infty, \tag{3.2.18} \]

where $S_j = \sum_i y_i^j$. It is readily seen that the set of $q$ functions, $S_1, S_2, \ldots, S_q$, of the observations are jointly sufficient for $\theta$ and $\sigma$. In general, however, the posterior distribution cannot be expressed in terms of a few simple functions of the observations. If we wish to think in terms of sufficiency and information as defined by Fisher (1922, 1925), our posterior distribution always, of course, employs a complete set of sufficient statistics, namely, the observations themselves. Consequently, no matter what is the value of $\beta$, no information is lost.

From the family of distributions for various values of $\beta$, shown in Fig. 3.2.3, we see that very different inferences would be drawn concerning $\theta$, depending upon which value of $\beta$ was assumed. The chief reason for this wide discrepancy is the fact that, in Darwin's data, the center of the posterior distribution changes markedly as $\beta$ is changed. In particular, for this sample, the median, mean, and the midpoint are 24.0, 20.9, 4.0 respectively; and these are the modes of the posterior distributions for the double exponential, Normal, and rectangular parent, respectively.
3.2.4 Posterior Distribution of $\theta$ and $\beta$ when $\beta$ is Regarded as a Random Variable
Because of the wide differences which occur in the posterior distribution of $\theta$ depending on which parent distribution we employ, it might be thought that there would be considerable uncertainty about what could be inferred from this set of data. It turns out that this is not the case when appropriate evidence concerning the value of $\beta$ is put to use. There are two possible sources of information about the value of $\beta$, one from the data itself and the other from knowledge a priori. Both types of evidence can be injected into our analysis by allowing $\beta$ itself to be a random variable associated with a prior distribution $p(\beta)$.

Joint Distribution of $\theta$ and $\beta$

Let us assume tentatively that, a priori, $\beta$ is distributed independently of the mean $\theta$ and the standard deviation $\sigma$, so that

\[ p(\theta, \sigma, \beta) = p(\beta)\,p(\theta, \sigma). \tag{3.2.19} \]
As before, we adopt the noninformative reference prior for $(\theta, \sigma)$,

\[ p(\theta, \sigma) \propto \sigma^{-1}. \tag{3.2.20} \]

Then, the joint posterior distribution of $(\theta, \sigma, \beta)$ is

\[ p(\theta, \sigma, \beta \mid \mathbf{y}) \propto \sigma^{-1}\,p(\beta)\,l(\theta, \sigma, \beta \mid \mathbf{y}), \qquad -\infty < \theta < \infty, \quad \sigma > 0, \quad -1 < \beta \le 1, \tag{3.2.21} \]

where $l(\theta, \sigma, \beta \mid \mathbf{y})$ is the likelihood function given in (3.2.8). After eliminating the standard deviation $\sigma$, we obtain the joint posterior distribution of $(\theta, \beta)$ as

\[ p(\theta, \beta \mid \mathbf{y}) \propto p(\beta)\,[M(\theta)]^{-\frac{1}{2}n(1+\beta)}\,\Gamma[1 + \tfrac{1}{2}n(1+\beta)]\,\{\Gamma[1 + \tfrac{1}{2}(1+\beta)]\}^{-n}, \qquad -\infty < \theta < \infty, \quad -1 < \beta \le 1. \tag{3.2.22} \]
Assumptions about locally uniform priors, and particularly about independence of the component parameters, have to be made with caution. Although it seems reasonable that $\theta$ should be assumed to be independent of $\sigma$ and $\beta$ a priori, we might have second thoughts about the independence of the scale parameter $\sigma$ and the non-Normality parameter $\beta$. The definition of a scale parameter is always arbitrary to the extent of a multiplicative constant, and in particular, if we write the distribution in the form (3.2.3), using $\phi$ as the scale parameter, then $\phi = f(\beta)\sigma$, where $f(\beta)$ is some function of $\beta$. It might be supposed, therefore, that if $\beta$ and $\sigma$ were independent, then $\beta$ and $\phi$ could not be. It would then follow that the marginal posterior distribution $p(\theta, \beta \mid \mathbf{y})$ would be different if it was assumed [as in our earlier work (1962)] that $\log\phi$ was locally uniform and independent of $\beta$, instead of assuming, as we have done
here, that $\log\sigma$ was locally uniform and independent of $\beta$. In fact, the results obtained are approximately the same whatever function $f(\beta)$ is used. To see this, let us suppose that

\[ p(\log\sigma, \beta) = p(\log\sigma)\,p(\beta), \qquad \text{with} \qquad p(\log\sigma) \propto c. \tag{3.2.23} \]

Since $\log\phi = \log f(\beta) + \log\sigma$, it follows that, for given $\beta$, locally

\[ p(\log\phi) \propto c. \tag{3.2.24} \]

Further, we can see from Appendix A3.2 that if $\log\sigma$ is locally uniform and independent of $\beta$, then $\log\phi$ and $\beta$ will be approximately independent.
A Reference Distribution for $\beta$

It will turn out from our subsequent discussion that a useful reference prior distribution for $\beta$ is a uniform distribution over the range $(-1, 1)$. If we denote the posterior distribution of $\theta$ and $\beta$ based on this choice by $p_u(\theta, \beta \mid \mathbf{y})$, then

\[ p_u(\theta, \beta \mid \mathbf{y}) \propto [M(\theta)]^{-\frac{1}{2}n(1+\beta)}\,\Gamma[1 + \tfrac{1}{2}n(1+\beta)]\,\{\Gamma[1 + \tfrac{1}{2}(1+\beta)]\}^{-n} \tag{3.2.25} \]

and

\[ p(\theta, \beta \mid \mathbf{y}) \propto p_u(\theta, \beta \mid \mathbf{y})\,p(\beta), \qquad -\infty < \theta < \infty, \quad -1 < \beta \le 1, \tag{3.2.26} \]

where $p(\beta)$ is any appropriate prior distribution of $\beta$. The joint distribution $p_u(\theta, \beta \mid \mathbf{y})$ actually obtained from the Darwin data is shown in Fig. 3.2.4. The sections of the joint distribution for various fixed values
Fig. 3.2.4 The joint distribution $p_u(\theta, \beta \mid \mathbf{y})$: Darwin's data.
of $\beta$ are, apart from a weighting factor, the conditional distributions $p(\theta \mid \beta, \mathbf{y})$ already sketched in Fig. 3.2.3. The weighting factor is of course $p_u(\beta \mid \mathbf{y})$, and it is evident that this factor attaches little credibility to platykurtic parenthood.
3.2.5 Marginal Distribution of $\beta$

If we integrate (3.2.25) over $\theta$ we obtain the weighting factor mentioned above, which is the posterior marginal distribution of $\beta$ for a uniform reference prior in $\beta$,

\[ p_u(\beta \mid \mathbf{y}) \propto \Gamma[1 + \tfrac{1}{2}n(1+\beta)]\,\{\Gamma[1 + \tfrac{1}{2}(1+\beta)]\}^{-n}\,J(\beta), \qquad -1 < \beta \le 1, \tag{3.2.27} \]

where $J(\beta)$ is given in (3.2.13). This distribution, shown by the solid curve in Fig. 3.2.5 for $a = 1$, allows appropriate inferences about $\beta$ to be drawn in the light of the data for a uniform prior over the range $-1 < \beta \le 1$. Now, in practice, there will be few instances where this prior would actually coincide with the attitude of the investigator; nevertheless, its use as a reference provides a valuable
Fig. 3.2.5 Prior and posterior distributions of $\beta$ for various choices of the parameter "a": Darwin's data.
intermediate step to more realistic choices. For, since

\[ p(\beta \mid \mathbf{y}) \propto p_u(\beta \mid \mathbf{y})\,p(\beta), \tag{3.2.28} \]

we can produce the posterior distribution of $\beta$ for any prior $p(\beta)$ from $p_u(\beta \mid \mathbf{y})$ merely by multiplication and renormalizing.

Choice of a Prior Distribution for $\beta$

In problems like the analysis of Darwin's data, usually some central limit effect would be expected. While this does not warrant an outright assumption of Normality, which can be represented in the present context by making $p(\beta)$ a $\delta$ function at $\beta = 0$, it does mean that the investigator would want to associate high prior probability with distributions in the neighbourhood of the Normal. He could represent this attitude by choosing a prior distribution for $\beta$ having a single mode at $\beta = 0$. A convenient distribution for this purpose is a symmetric beta distribution having mean zero and extending from $-1$ to $+1$, with one adjustable parameter which we call "a". Specifically, we assume that†

\[ p(\beta) = w(1 - \beta^2)^{a-1}, \qquad -1 < \beta \le 1, \tag{3.2.29} \]

where

\[ w = \Gamma(2a)[\Gamma(a)]^{-2}\,2^{-(2a-1)}, \qquad a \ge 1. \]

When $a = 1$, the distribution is uniform. With $a > 1$, it is a symmetric distribution having its mode at the Normal theory value $\beta = 0$, and it becomes more and more concentrated about $\beta = 0$ as "a" is increased. When "a" tends to infinity, $p(\beta)$ approaches a delta function, representing an assumption of exact Normality. The dotted curves in Fig. 3.2.5 show this distribution for $a = 1, 3, 6$, and 10. The corresponding posterior distributions $p(\beta \mid \mathbf{y})$ are shown by the solid curves in Fig. 3.2.5. The figure shows how increasing prior certainty about Normality tends to override the information from the sample. When "a" tends to infinity, $p(\beta \mid \mathbf{y})$ will approach a $\delta$ function at $\beta = 0$ and Normality will be assumed no matter what the information from the sample.
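Since (3.2.29) is a rescaled symmetric beta distribution (see the footnote below), prior probabilities under it are easy to compute; a minimal sketch assuming SciPy, which reproduces the 87 percent figure quoted in Section 3.2.6 for $a = 10$:

```python
from scipy.stats import beta as beta_dist

# Prior (3.2.29) is a Beta(a, a) distribution on x = (1 + beta)/2.
def prior_prob(lo, hi, a):
    """Prior probability that lo < beta < hi under (3.2.29)."""
    return beta_dist.cdf((1 + hi) / 2, a, a) - beta_dist.cdf((1 + lo) / 2, a, a)

print(prior_prob(-0.33, 0.33, 10))   # about 0.87
```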
3.2.6 Marginal Distribution of $\theta$

The posterior distribution of $\theta$ is obtained by integrating out $\beta$ from the joint distribution of $\theta$ and $\beta$, yielding

\[ p(\theta \mid \mathbf{y}) = \int_{-1}^{1} p(\theta, \beta \mid \mathbf{y})\,d\beta, \qquad -\infty < \theta < \infty. \]

† Letting $x = (1+\beta)/2$ and $p = q = a$, we have the usual beta distribution

\[ p(x) = \frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)}\,x^{p-1}(1-x)^{q-1}, \qquad 0 < x < 1. \tag{3.2.30} \]
Also, we may write

\[ p(\theta, \beta \mid \mathbf{y}) = p(\theta \mid \beta, \mathbf{y})\,p(\beta \mid \mathbf{y}). \tag{3.2.31} \]

The posterior distribution of $\theta$,

\[ p(\theta \mid \mathbf{y}) = \int_{-1}^{1} p(\theta \mid \beta, \mathbf{y})\,p(\beta \mid \mathbf{y})\,d\beta, \qquad -\infty < \theta < \infty, \tag{3.2.32} \]

can thus be thought of as a weighted average of the "$t$-like" distributions $p(\theta \mid \beta, \mathbf{y})$ of Fig. 3.2.3, with a weight function $p(\beta \mid \mathbf{y})$ given by a solid curve of Fig. 3.2.5. Since $p(\beta \mid \mathbf{y}) \propto p_u(\beta \mid \mathbf{y})\,p(\beta)$, we have also

\[ p(\theta \mid \mathbf{y}) \propto \int_{-1}^{1} p(\theta \mid \beta, \mathbf{y})\,p_u(\beta \mid \mathbf{y})\,p(\beta)\,d\beta, \qquad -\infty < \theta < \infty. \tag{3.2.33} \]

Thus, we are averaging the $t$-like distributions with a weight function which depends partly on $p(\beta)$, representing prior information, and partly on $p_u(\beta \mid \mathbf{y})$,† representing information which is independent of the prior. Finally we can write

\[ p(\theta \mid \mathbf{y}) \propto \int_{-1}^{1} p_u(\theta, \beta \mid \mathbf{y})\,p(\beta)\,d\beta, \tag{3.2.34} \]

so that $p(\theta \mid \mathbf{y})$ is the marginal distribution obtained when $p_u(\theta, \beta \mid \mathbf{y})$, shown in the three-dimensional diagram of Fig. 3.2.4, is averaged over the weight function $p(\beta)$. The marginal distributions $p(\theta \mid \mathbf{y})$ for $a = 1, 3, 6$ and $\infty$ are drawn in Fig. 3.2.6. They show to what extent inferences about $\theta$ depend on the strength of central limit effect assumed. The curve for $a \to \infty$ is a $t$ distribution corresponding to an outright assumption of parent Normality. In view of the very large differences exhibited by the conditional distributions $p(\theta \mid \beta, \mathbf{y})$, it is remarkable how little the marginal distribution $p(\theta \mid \mathbf{y})$ is affected by choices of "a" covering a range representing exact Normality ($a \to \infty$) to "no central limit effect" ($a = 1$). The reason for this is that widely discrepant conditional distributions generated by parents which approach the uniform ($\beta \to -1$) are almost ruled out by information coming from the sample itself. The curve for $a = 10$ is not shown in Fig. 3.2.6 because it is almost identical to the $t$ distribution obtained when $a \to \infty$. This is interesting because, as seen from Fig. 3.2.5, the central limit effect implied even by $a = 10$ is not an overwhelmingly strong one. For instance, with this distribution the probability a priori that $-0.33 < \beta < 0.33$ (that is, that $q = 2/(1+\beta)$ is between 1.5 and 3) is only 87 percent.
† We are employing noninformative prior distributions for $\theta$ and $\sigma$; a change in the prior distribution $p(\theta, \sigma)$ will of course change $p_u(\beta \mid \mathbf{y})$.
Fig. 3.2.6 Posterior distribution of $\theta$ for various choices of the parameter "a": Darwin's data.
3.2.7 Information Concerning the Nature of the Parent Distribution Coming from the Sample

In the past, the Normality (or otherwise) of a sample has often been assessed by inspecting the empirical distribution, by applying goodness-of-fit tests such as the $\chi^2$ test and the Kolmogorov-Smirnov test (1941), and by checking measures of skewness and kurtosis. In cases such as the present one, where it is reasonable to assume symmetry, the calculation of $p_u(\beta \mid \mathbf{y})$ would provide another way of expressing sample information about the nature of the parent distribution. However, with this approach more can be done than merely "test" the assumption of Normality and then, in the absence of a "significant" result, assume it. The information about $\beta$ coming from the sample is appropriately used in making inferences about $\theta$. In particular, for Darwin's data it plays an important role in virtually eliminating the influence of unlikely parent distributions having extreme negative kurtosis.

3.2.8 Relationship to the General Bayesian Framework for Robustness Studies

It will now be seen how Darwin's example illustrates the general approach introduced in Section 1.6. After integrating out $\sigma$, we have the joint distribution of the mean $\theta$ and the non-Normality parameter $\beta$, which correspond to $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, respectively, in the general discussion. While it is true that we can obtain the marginal posterior distribution of $\theta$ immediately by integrating out $\beta$ from $p(\theta, \beta \mid \mathbf{y})$, it is informative to write the integral in the form

\[ p(\theta \mid \mathbf{y}) = \int_{-1}^{1} p(\theta \mid \beta, \mathbf{y})\,p(\beta \mid \mathbf{y})\,d\beta \tag{3.2.35} \]
and to say that $p(\beta \mid \mathbf{y})$ serves as a weight function acting on the various conditional posterior distributions $p(\theta \mid \beta, \mathbf{y})$. A study of the conditional distribution $p(\theta \mid \beta, \mathbf{y})$ for various $\beta$, together with the weight function $p(\beta \mid \mathbf{y})$, as was done in Sections 3.2.2 through 3.2.5, corresponds precisely to a study of the component distributions $p(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \mathbf{y})$ and $p(\boldsymbol{\theta}_2 \mid \mathbf{y})$ in (1.6.3) of the general approach. Indeed, the attractiveness of this approach is further increased by the fact that if we let $p(\boldsymbol{\theta}_1)$ and $p(\boldsymbol{\theta}_2)$ be the prior distributions for $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, assumed independent, and let $l(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \mathbf{y})$ represent the joint likelihood, we may then write (1.6.3) in the form

\[ p(\boldsymbol{\theta}_1 \mid \mathbf{y}) = c_1\int p(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \mathbf{y})\,p_u(\boldsymbol{\theta}_2 \mid \mathbf{y})\,p(\boldsymbol{\theta}_2)\,d\boldsymbol{\theta}_2, \tag{3.2.36} \]

where

\[ p_u(\boldsymbol{\theta}_2 \mid \mathbf{y}) = c_2\int p(\boldsymbol{\theta}_1)\,l(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \mathbf{y})\,d\boldsymbol{\theta}_1 \tag{3.2.37} \]

is the likelihood integrated over a weight function $p(\boldsymbol{\theta}_1)$, and $c_1$ and $c_2$ are normalizing constants. The marginal distribution $p(\boldsymbol{\theta}_2 \mid \mathbf{y})$, which is proportional to the product
(3.2.38) is separated into the prior distribution of 9 2 and a part which is independent of this prior distribution. The function Pu(9 2 I y) can be regarded as a posterior distribution on the basis of a uniform reference prior for 9 2 , It provides a basic reference posterior density which can be converted into a posterior distribution with an appropriate prior p(a z) by multiplication . It is informative to consider the effect of varying p(9 z) to see how sensitive is the final result to changes in prior assumptions. and also to study p.,(9 2 I y) itself. Thus, for the present example, the weight function p({31 y) in (3.2.35) can be written
p(β | y) ∝ p(β) p_u(β | y),   (3.2.39)

with p(β) the prior distribution of β, corresponding to p(θ₂) in the general framework. Also,

p_u(β | y) ∝ ∫ p_u(β, θ | y) dθ   (3.2.40)

corresponds to p_u(θ₂ | y) in the general formulation and can be thought of as the likelihood integrated over a weight function uniform in θ and log σ.
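Since (3.2.35)-(3.2.38) are the computational core of the robustness framework, a small numerical sketch may help. The following Python fragment (all data, grids, and densities here are invented placeholders, not quantities from the text) discretizes the nuisance parameter and forms the marginal posterior as a weighted average of conditional densities, in the spirit of (3.2.36).

```python
import numpy as np

theta1 = np.linspace(-10.0, 60.0, 501)   # grid for the parameter of interest
theta2 = np.linspace(-0.9, 0.9, 19)      # grid for the nuisance parameter

def conditional_density(t1, t2):
    """Stand-in for p(theta_1 | theta_2, y): a t-like curve whose
    spread depends on the nuisance parameter t2 (placeholder only)."""
    scale = 10.0 * (1.0 + t2)
    return 1.0 / (1.0 + ((t1 - 25.0) / scale) ** 2)

# Discretized weight function p(theta_2 | y); here taken uniform.
weights = np.ones_like(theta2)
weights /= weights.sum()

# Marginal posterior: average the conditionals over the weight function.
marginal = sum(w * conditional_density(theta1, t2)
               for w, t2 in zip(weights, theta2))
marginal /= marginal.sum() * (theta1[1] - theta1[0])   # normalize numerically
```

A sharper weight function concentrates the mixture on fewer conditionals; a flat one, as in the small-sample examples of this chapter, leaves the result dominated by whatever prior is multiplied in.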
3.3 APPROXIMATIONS TO THE POSTERIOR DISTRIBUTION p(θ | β, y)†

In this section, we consider again the posterior distribution of θ for given β, p(θ | β, y) in (3.2.12), and discuss a method which can be used to approximate it.
3.3.1 Motivation for the Approximation

It turns out that over a wide range of values of β (say, from β = −0.75 to β = 0.75) the distribution p(θ | β, y) can be satisfactorily approximated by a t distribution.
† Much of the material in this section and Section 3.4 is taken from D. R. Lund's Ph.D. thesis (1967).
Specifically, we shall demonstrate that

p(θ | β, y) ∝̇ [M(θ̂) + d(θ − θ̂)²]^{−½n(1+β)},   −∞ < θ < ∞,   (3.3.1)

where

M(θ) = Σ_{u=1}^{n} |y_u − θ|^{2/(1+β)},

d is an appropriately chosen constant, and θ̂ is the mode of p(θ | β, y). The symbol ∝̇ means "approximately proportional to." To this degree of approximation, θ is distributed as the t distribution t{θ̂, M(θ̂)/d[n(1+β) − 1], n(1+β) − 1}. This type of approximation has to be justified in somewhat different ways, depending on whether β is negative or positive. In particular, for positive β the approximating process itself can be somewhat tedious. We shall, therefore, make clear why this kind of approximation is important. Certainly, for many purposes of inference all that we really need is to be able to compute the posterior density function of θ and present its graph to the investigator. This computation is best made using the exact form
p(θ | β, y) ∝ [M(θ)]^{−½n(1+β)},   −∞ < θ < ∞,   (3.3.2)
as given in (3.2.12). The approximation (3.3.1) has two uses:

a) In some instances we may wish to calculate probability integrals. Although we can do this directly by integrating the density function (3.3.2) numerically, it is an advantage to be able to make use of the already tabled t integrals.

b) In problems discussed in Chapter 4, we shall often need to integrate out θ from some posterior distribution involving other parameters (for example, the variance σ²). Specifically, we shall be involved with evaluating expressions of the form
∫_{−∞}^{∞} [M(θ)]^{−c₁} dθ / ∫_{−∞}^{∞} [M(θ)]^{−c₂} dθ.   (3.3.3)

Using (3.3.1), we have

∫_{−∞}^{∞} [M(θ)]^{−c₁} dθ / ∫_{−∞}^{∞} [M(θ)]^{−c₂} dθ ≐ ∫_{−∞}^{∞} [M(θ̂) + d(θ − θ̂)²]^{−c₁} dθ / ∫_{−∞}^{∞} [M(θ̂) + d(θ − θ̂)²]^{−c₂} dθ.   (3.3.4)

Note that, for this approximation, actual calculation of the value of d is not necessary since it cancels out on integration. So far as this more important application is concerned, what follows is mainly intended to show that an approximation of the form of (3.3.1) can be quite accurate and that, therefore, the approximate integration (3.3.4) can be used.
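The direct computation mentioned above is straightforward to carry out. The sketch below evaluates the unnormalized ordinates [M(θ)]^{−½n(1+β)} of (3.3.2) on a grid and normalizes numerically; the sample y used is illustrative only, not Darwin's data.

```python
import numpy as np

y = np.array([9.0, 12.0, 15.0, 21.0, 23.0, 28.0, 33.0, 41.0])  # illustrative data
n = len(y)
beta = -0.25

theta = np.linspace(y.min() - 20.0, y.max() + 20.0, 2001)
step = theta[1] - theta[0]

# M(theta) = sum_u |y_u - theta|^{2/(1+beta)}, evaluated on the whole grid
M = np.abs(y[None, :] - theta[:, None]) ** (2.0 / (1.0 + beta))
M = M.sum(axis=1)

# Work on the log scale to avoid under/overflow for larger n
log_density = -0.5 * n * (1.0 + beta) * np.log(M)
density = np.exp(log_density - log_density.max())
density /= density.sum() * step          # numerical normalization

mode = theta[np.argmax(density)]         # crude grid estimate of the mode
```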
3.3.2 Quadratic Approximation to M(θ)

Since p(θ | β, y) is a monotonic decreasing function of M(θ), we shall first conduct the discussion in terms of M(θ). Now, M(θ) is a convex function of θ and, for −1 < β < 1, possesses a unique minimum θ̂ [which is, of course, the mode of p(θ | β, y)]. Further, for β = 0,

M(θ) = M(θ̂) + n(θ − θ̂)².   (3.3.5)

Now if for β ≠ 0 we can use an approximation of the form

M(θ) ≐ M(θ̂) + d(θ − θ̂)²,   (3.3.6)

where d is some constant to be determined, then the corresponding approximate distribution of θ can be expressed in terms of the familiar Student's t form. In what follows, we first discuss the problem of determining the mode and then the question of finding the constant d.
Determination of the Mode θ̂

Two methods may be employed to obtain θ̂. The first of these makes use of the fact that M′(θ) is monotonically increasing in θ, so that M′(θ) = 0 has only one solution, θ̂. Thus, the sign of M′(θ) tells us on which side of θ̂ the point θ lies. By calculating M′(θ) at a point θ, we therefore know in which direction to proceed toward θ̂. We then increase or decrease the value of θ, whichever is required, in steps until M′(θ) changes sign. Repeating this process, but using steps which are halved after each sign change, we can obtain θ̂ to any desired degree of accuracy.

The second method employs Newton's iteration procedure for extracting roots of an equation. Starting with an initial guess value θ₀, we can write

M′(θ) ≐ M′(θ₀) + (θ − θ₀)M″(θ₀),   (3.3.7)

provided the second derivative M″(θ) exists. Setting the right-hand side of (3.3.7) to zero, we obtain a new estimate θ₁ of θ̂, such that

θ₁ = θ₀ − M′(θ₀)/M″(θ₀).   (3.3.8)

Repeating the procedure with θ₁ in the role of θ₀, we obtain a second new estimate θ₂, and so on. We continue the process until |θ_{i+1} − θ_i|/|θ_i| < 0.001, say, where θ_{i+1} is the value obtained in the (i+1)th iteration. Now

M′(θ) = −[2/(1+β)] Σ_{u=1}^{n} |y_u − θ|^{−2β/(1+β)} (y_u − θ)   (3.3.9)

and

M″(θ) = [2(1−β)/(1+β)²] Σ_{u=1}^{n} |y_u − θ|^{−2β/(1+β)},   (3.3.10)
so that

θ_{i+1} = θ_i − M′(θ_i)/M″(θ_i) = [−2β/(1−β)]θ_i + [(1+β)/(1−β)] · {Σ_{u=1}^{n} |y_u − θ_i|^{−2β/(1+β)} y_u / Σ_{u=1}^{n} |y_u − θ_i|^{−2β/(1+β)}}.   (3.3.11)

When the iteration process is terminated, we have, by setting θ̂ = θ_i,

θ̂ = Σ_{u=1}^{n} a_u y_u,   (3.3.12)

where

a_u = |y_u − θ̂|^{−2β/(1+β)} / Σ_{u=1}^{n} |y_u − θ̂|^{−2β/(1+β)}.
Interpretation of Results

It is interesting to note that (3.3.12) says that the modal value θ̂ is in the form of a weighted average of the observations. When β → 0, all of the weights a_u tend to 1/n, as they should, yielding θ̂ = ȳ in the limit. As β decreases below zero, observations corresponding to residuals y_u − θ̂ with large absolute values tend to receive more weight. This is an intuitively pleasing result because for β < 0 the parent distributions have tails shorter than those of the Normal, and one would expect that extreme observations would thus provide more information about the location of the distribution. We have already seen that, as β → −1 in the limit, θ̂ = (y_(1) + y_(n))/2 (the sample midpoint), so that all of the weight is divided equally between the largest and smallest observations, while all of the intermediate observations get zero weight. Indeed, the weights in (3.3.12) tend to this limit. The iteration procedure will, however, converge very slowly for β near −1 because the factor (1+β)/(1−β) in (3.3.11) is small and θ_{i+1} is determined primarily by θ_i. As β increases above zero, more weight is given to observations corresponding to residuals y_u − θ̂ of smaller absolute value. This is again intuitively appealing because the parent then has longer tails than those of the Normal, and one would now expect that the "extreme" observations would provide relatively little information about the location of the parent distribution.

When β < 0, θ_{i+1} in (3.3.11) is a weighted average of two quantities θ_i and θ̃_i, where

θ̃_i = Σ_{u=1}^{n} |y_u − θ_i|^{−2β/(1+β)} y_u / Σ_{u=1}^{n} |y_u − θ_i|^{−2β/(1+β)},   (3.3.13)

with weights equal to −2β/(1−β) and (1+β)/(1−β), respectively. The quantity θ_i is the previous estimate of θ̂ and the quantity θ̃_i is itself a weighted average of the observations. Clearly θ_{i+1} lies between θ_i and θ̃_i. However, when β > 0, θ_{i+1}, instead of lying between θ_i and θ̃_i, becomes a weighted difference of these two quantities and thus could be distinct from either one. Indeed, as β → 1, θ_{i+1} becomes a large multiple of the difference between θ̃_i and θ_i. The iteration procedure (3.3.11) may, therefore, be very erratic and unpredictable for β > 0, especially for β near 1. In practice, it may proceed toward the solution, but accelerate too fast and overshoot. It may then reverse direction and begin to diverge. Thus, the second method should be used for β < 0 only and the first method used when β > 0. In practice, the second method often converges for 0 < β ≤ 0.25, and in our experience the final estimate in such cases has always been the same from both methods. The first method, however, seems more reliable.
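Both methods of locating the mode are easy to program. The sketch below implements the Newton-type recursion (3.3.11) (recommended above for β < 0) and the sign-change bisection of M′(θ) (usable for β > 0); the starting value, tolerance, and iteration cap are arbitrary choices.

```python
import numpy as np

def posterior_mode_newton(y, beta, tol=1e-3, max_iter=200):
    """Recursion (3.3.11); intended for beta < 0, per the discussion above."""
    y = np.asarray(y, dtype=float)
    theta = y.mean()                             # the beta = 0 solution as a start
    for _ in range(max_iter):
        w = np.abs(y - theta) ** (-2.0 * beta / (1.0 + beta))
        theta_tilde = (w * y).sum() / w.sum()    # weighted average (3.3.13)
        theta_new = (-2.0 * beta / (1.0 - beta)) * theta \
                    + ((1.0 + beta) / (1.0 - beta)) * theta_tilde
        if abs(theta_new - theta) <= tol * max(abs(theta), 1e-12):
            return theta_new
        theta = theta_new
    return theta

def posterior_mode_bisect(y, beta, tol=1e-6):
    """First method described above: M'(theta) is monotonically increasing,
    so locate its sign change; usable for beta > 0 as well."""
    y = np.asarray(y, dtype=float)
    q = 2.0 / (1.0 + beta)

    def M_prime(theta):
        return -q * np.sum(np.sign(y - theta) * np.abs(y - theta) ** (q - 1.0))

    lo, hi = y.min(), y.max()                    # the mode lies in [y_(1), y_(n)]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if M_prime(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```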
3.3.3 Approximation of p(θ | β, y)

For β ≤ 0, the second derivative M″(θ) exists for all θ and we can employ Taylor's theorem to write

M(θ) ≐ M(θ̂) + ½M″(θ̂)(θ − θ̂)².   (3.3.14)

This expression is, of course, exact for β = 0. To this degree of approximation, the posterior distribution p(θ | β, y) is

p(θ | β, y) ∝̇ [M(θ̂) + ½M″(θ̂)(θ − θ̂)²]^{−½n(1+β)}.   (3.3.15)

That is, for β < 0, θ is approximately distributed as t{θ̂, 2M(θ̂)/{M″(θ̂)[n(1+β) − 1]}, n(1+β) − 1}. We have found in practice that (3.3.15) gives a very close approximation in the range −0.5 ≤ β ≤ 0. To illustrate, Fig. 3.3.1 shows, for Darwin's data, the exact distribution (solid curve) and the corresponding approximation (broken curve) from (3.3.15) for β = −0.25, −0.50, and −0.75, respectively. In the first two instances the agreement is very close. The approximation becomes somewhat less satisfactory for β = −0.75. This is to be expected, since the limiting distribution of θ as β → −1 is not of the t form [see (3.2.15)].
Fig. 3.3.1 Comparison of exact (solid) and approximate (broken) distributions of θ for several negative values of β (β = −0.25, −0.50, −0.75): Darwin's data.
For β > 0, M″(θ) does not exist if θ = y_u (u = 1, ..., n), so that the approximation in (3.3.14) may break down. In practice, even in cases where θ̂ is not equal to any of the observations, good approximations are only rarely obtained through the use of this method. There is, however, reason to believe that the form (3.3.1), with d appropriately chosen, will produce satisfactory results for β > 0. In our earlier work (1964a), this form was employed to obtain very close approximations to the moments of the variance ratio σ₂²/σ₁² in the problem of comparing the variances of two exponential power distributions (this is discussed in Chapter 4). We now describe a method for determining d which does not depend on the existence of M″(θ̂). This method employs the idea of least squares. For a set of suitably chosen values of θ, say θ₁, θ₂, ..., θ_m, we determine d so as to minimize the quantity
S = Σ_{j=1}^{m} {M(θ_j) − [M(θ̂) + d(θ_j − θ̂)²]}².   (3.3.16)
Solving ∂S/∂d = 0, the solution is

d = Σ_{j=1}^{m} [M(θ_j) − M(θ̂)](θ_j − θ̂)² / Σ_{j=1}^{m} (θ_j − θ̂)⁴.   (3.3.17)
All that need be determined now is where to take the m points θ_j. If m is large, say forty or more, the exact location of the points should not be critical so long as they are spread out over the range for which the density of θ is appreciable. In particular, if m is odd and if the θ_j's are uniformly spaced at intervals of length c, with the center point at θ = θ̂, then (3.3.17) can be simplified to
d = 240{[Σ_{j=1}^{m} M(θ_j)(θ_j − θ̂)²] − M(θ̂)c²(m−1)m(m+1)/12} / [c⁴(m−1)m(m+1)(3m² − 7)].   (3.3.18)

To this degree of approximation, the posterior distribution of θ is

p(θ | β, y) ∝̇ [M(θ̂) + d(θ − θ̂)²]^{−½n(1+β)},   (3.3.19)
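A direct transcription of the least-squares determination of d may make the procedure clearer. The sketch below follows (3.3.17), with the m points centred at the mode and spread over four Normal-theory standard deviations as described later in this section for Darwin's data; the defaults m = 101 and width = 4 are taken from that description.

```python
import numpy as np

def quadratic_coefficient(y, beta, theta_hat, m=101, width=4.0):
    """Least-squares value of d from (3.3.17); an illustrative sketch."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Normal-theory standard deviation of the mean: [sum(y - ybar)^2 / n(n-1)]^(1/2)
    s = np.sqrt(np.sum((y - y.mean()) ** 2) / (n * (n - 1.0)))
    thetas = theta_hat + np.linspace(-width * s, width * s, m)

    def M(theta):
        return np.sum(np.abs(y - theta) ** (2.0 / (1.0 + beta)))

    M_hat = M(theta_hat)
    dev2 = (thetas - theta_hat) ** 2
    resid = np.array([M(t) for t in thetas]) - M_hat
    return np.sum(resid * dev2) / np.sum(dev2 ** 2)
```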
Fig. 3.3.2 Comparison of exact (solid) and approximate (broken) distributions of θ for several positive values of β (β = 0.25, 0.50, 0.75): Darwin's data.
that is, a t{θ̂, M(θ̂)/d[n(1+β) − 1], n(1+β) − 1} distribution. For illustration, Fig. 3.3.2 shows, for β = 0.25, 0.50, and 0.75, a comparison of the actual distribution (solid curve) and the approximation (broken curve) using the present method for Darwin's data. In calculating the approximating form, the value of d was determined by using a set of 101 equally spaced points spread over four Normal-theory standard deviations [Σ(y_i − ȳ)²/n(n−1)]^{1/2} centered at θ = θ̂. The agreement is remarkably close for the cases β = 0.25 and β = 0.50. The result for the case β = 0.75 is less satisfactory. In general, one would not expect the distribution of θ to be well approximated by a t form when β is near unity. Indeed, when β = 1 and n is an even integer, the density will be constant between the middle two observations. Nevertheless, for Darwin's data, the approximation for β = 0.75 still seems to be close enough for practical purposes.

The process described above is admittedly somewhat tedious. However, as we have pointed out, the object of this section is not primarily to provide means of calculating d. Rather, it is to show that an approximation of the form M(θ) ≐ M(θ̂) + d(θ − θ̂)² for some d can be used, and so to provide an easy means of integrating out θ from functions containing M(θ). In our subsequent applications d will cancel and need not be computed.
3.4 GENERALIZATION TO THE LINEAR MODEL

The analysis in Section 3.2 can be readily extended to the general linear model

y = Xθ + ε,   (3.4.1)

where y is an n × 1 vector of observations, X an n × k full-rank matrix of fixed elements, θ a k × 1 vector of unknown regression coefficients, and ε an n × 1 vector of random errors. When this model was considered in Section 2.7, it was assumed that ε had the multivariate spherical Normal distribution N_n(0, Iσ²). We now relax the assumption of Normality and suppose that the elements of ε are independent, each having the exponential power distribution

p(ε | σ, β) = ω(β)σ⁻¹ exp[−c(β)|ε/σ|^{2/(1+β)}],   −∞ < ε < ∞,   (3.4.2)
where ω(β) and c(β) are given in (3.2.5). Our objective will be to study the effect of departures from Normality of this kind on inferences about the regression coefficients θ. The likelihood function of (θ, σ, β) is

l(θ, σ, β | y) ∝ [ω(β)]ⁿ σ⁻ⁿ exp[−c(β) Σ_{i=1}^{n} |(y_i − x′_{(i)}θ)/σ|^{2/(1+β)}],   (3.4.3)
where x′_{(i)} is the ith row of X. Following arguments similar to those leading to (3.2.19) and (3.2.20), suppose a priori that

p(θ, σ, β) = p(β)p(θ, σ)   (3.4.4)

with p(θ, σ) ∝ σ⁻¹ and p(β) temporarily unspecified. Combining (3.4.3) and (3.4.4) and integrating out σ, we obtain the joint posterior distribution of (θ, β), which can be written as the product

p(θ, β | y) = p(θ | β, y)p(β | y).   (3.4.5)

The Conditional Distribution p(θ | β, y)

In (3.4.5), the conditional posterior distribution p(θ | β, y) is

p(θ | β, y) = [J(β)]⁻¹[M(θ)]^{−½n(1+β)},   −∞ < θ_j < ∞,   j = 1, ..., k,   (3.4.6)

where

M(θ) = Σ_{i=1}^{n} |y_i − x′_{(i)}θ|^{2/(1+β)},

J(β) = ∫_R [M(θ)]^{−½n(1+β)} dθ,   (3.4.7)

and R: (−∞ < θ_j < ∞, j = 1, ..., k). If θ consists of a single element and X is a column of ones, the distribution in (3.4.6) reduces to that in (3.2.12), and much of the analysis which follows parallels that for the single mean. In general, by studying the distribution p(θ | β, y) as a function of β, we can determine how sensitive inferences about θ are to departures from Normality of the type postulated. In particular, when β = 0 (exact Normality),

M(θ) = (y − Xθ)′(y − Xθ),   (3.4.8)

so that the distribution in (3.4.6) reduces to the t_k[θ̂, s²(X′X)⁻¹, ν] distribution obtained in (2.7.20). For other values of β, it does not seem possible to express the distribution in terms of simple functions of the observations except when β → −1 and X assumes special forms [see Lund (1967)]. However, when the number of parameters in θ is small, say k = 2 or 3, contours of M(θ) can be plotted. By investigating the changes in the location and shape of the contours for different values of β, one obtains a good appreciation of the effect of changes in β on inferences about θ. An illustrative example will be given later in this section. For −0.5 < β < 0.5, Lund (1967) has demonstrated that the distribution (3.4.6) can be satisfactorily approximated by a multivariate t distribution.
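For k = 2 the contour computation just described reduces to evaluating M(θ) over a grid. The following sketch does this for an invented two-parameter example (the X matrix, data, and grid limits are placeholders); contours of constant M(θ) are contours of constant posterior density, since the density is monotonic in M(θ).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 20)
X = np.column_stack([np.ones_like(x), x])                 # intercept and slope columns
y = -4.0 - 0.2 * x + 0.05 * rng.standard_normal(x.size)   # synthetic observations

beta = -0.5
t1 = np.linspace(-4.1, -3.9, 101)                         # grid for theta_1
t2 = np.linspace(-0.3, -0.1, 101)                         # grid for theta_2
T1, T2 = np.meshgrid(t1, t2)

# Residuals for every grid point, shape (101, 101, 20)
resid = y[None, None, :] - (T1[..., None] + T2[..., None] * x[None, None, :])
M = np.sum(np.abs(resid) ** (2.0 / (1.0 + beta)), axis=-1)

# Unnormalized log posterior ordinates of (3.4.6)
log_post = -0.5 * len(y) * (1.0 + beta) * np.log(M)
```

The array log_post can then be handed to any contouring routine; repeating the computation for several β values shows how the location and shape of the contours move.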
The Marginal Distributions

The posterior distribution p(β | y) in (3.4.5) can be written

p(β | y) ∝ p_u(β | y)p(β),   −1 < β ≤ 1,   (3.4.9)

where

p_u(β | y) ∝ Γ[1 + ½n(1+β)]{Γ[1 + ½(1+β)]}⁻ⁿ J(β)   (3.4.10)

is the posterior distribution of β for a uniform reference prior in the range −1 < β ≤ 1. From (3.4.5), the posterior distribution of θ is

p(θ | y) = ∫_{−1}^{1} p(θ | β, y)p(β | y) dβ.   (3.4.11)

As in the case of a single parameter θ, we are here averaging the conditional distributions p(θ | β, y) for various β by a weight function p(β | y) which is the product of p_u(β | y) and p(β).

3.4.1 An Illustrative Example
The data of Table 3.4.1 refer to an experiment relating the rate K of a chemical reaction to the absolute temperature T at which the experiment was conducted. The dependence of K on T is expected to be represented by the Arrhenius law

log K = log A − (E/R)(1/T),   (3.4.12)

where R is the known gas constant, and A and E are constants to be estimated. The twenty experimental runs were performed in random order, and the reasonable assumption was made that over the range of temperatures employed log K had constant variance. We therefore consider the simple linear model

y_i = θ₁ + θ₂x_i + ε_i,   i = 1, 2, ..., 20,   (3.4.13)

where y_i = log K_i, θ₂ = −E/(50,000R), and the x_i are coded values of 1/T_i (x = −3, ..., 3 corresponding to T = 833, ..., 758). The "Arrhenius plot" of y against x is shown in Fig. 3.4.1. In terms of the linear model in (3.4.1), the (20 × 2) matrix X has two columns: the first is a column of ones and the elements of the second are the values of x as defined above. The two columns are, of course, orthogonal.

Fig. 3.4.1 Plot of reaction rate data with fitted straight lines for β = −0.9, 0.0, 0.9 (β = −0.9: ŷ = −3.996 − 0.218x; β = 0.0: ŷ = −4.013 − 0.203x; β = 0.9: ŷ = −4.010 − 0.203x).

From (3.4.6), the posterior distribution of (θ₁, θ₂) for fixed β is

p(θ₁, θ₂ | β, y) ∝ [M(θ)]^{−10(1+β)},   −∞ < θ₁ < ∞,   −∞ < θ₂ < ∞,

where

M(θ) = Σ_{i=1}^{20} |y_i − θ₁ − x_iθ₂|^{2/(1+β)}.   (3.4.14)
Table 3.4.1 Reaction rate data

T      x     y = log K
833   −3    −3.50, −3.40, −3.36
820   −2    −3.51, −3.62, −3.50
806   −1    −3.86, −3.70, −3.53
794    0    −3.99, −4.18
781    1    −4.02, −4.23, −4.34
769    2    −4.44, −4.62, −4.61
758    3    −4.74, −4.49, −4.62
Table 3.4.2 gives the mode of the distribution, θ̂ = (θ̂₁, θ̂₂), for β = −0.9(0.3)0.9. The fitted straight lines which result from using the θ̂ associated with β = −0.9, 0.0, and 0.9 are shown in Fig. 3.4.1. Figure 3.4.2 shows, for each of the seven values of β considered, three contours A, B, and C of the corresponding posterior distribution of (θ₁, θ₂), together with the mode (θ̂₁, θ̂₂). The density levels of the three contours are:

A: p(θ₁, θ₂ | β, y) = 0.5 p(θ̂₁, θ̂₂ | β, y),
B: p(θ₁, θ₂ | β, y) = 0.25 p(θ̂₁, θ̂₂ | β, y),
C: p(θ₁, θ₂ | β, y) = 0.05 p(θ̂₁, θ̂₂ | β, y).
Table 3.4.2 Values of the mode of θ as a function of β: reaction rate data

β      −0.9     −0.6     −0.3      0.0      0.3      0.6      0.9
θ̂₁    −3.996   −4.017   −4.016   −4.013   −4.011   −4.010   −4.010
θ̂₂    −0.218   −0.205   −0.204   −0.203   −0.203   −0.203   −0.203
Using the bivariate Normal approximation, the contours would very roughly be the boundaries of the 50, 75, and 95 percent H.P.D. regions, respectively. The contours, together with the modal values in Table 3.4.2, show how inferences about (θ₁, θ₂) are affected by changes in β in the range considered. For this example, both the location and the shape of the contours change appreciably as β decreases below −0.3. In contrast, the effect of β is relatively small in the range from −0.3 to 0.9. Since the two columns in X are orthogonal, the parameters θ₁ and θ₂ are a posteriori uncorrelated when β = 0. Figure 3.4.2 shows that this "orthogonality" property is gradually lost as β moves away from zero. In particular, the parameters θ₁ and θ₂ become increasingly negatively correlated and the dispersion tends to increase as β decreases from −0.3, namely, as the parent distribution becomes more and more platykurtic.
Fig. 3.4.2(i)-(vii) Contours of the posterior distribution of (θ₁, θ₂) for β = −0.9, −0.6, −0.3, 0.0, 0.3, 0.6, 0.9: reaction rate data [the density levels are such that p(θ₁, θ₂ | y)/p(θ̂₁, θ̂₂ | y) = (0.5, 0.25, 0.05) for contours (A, B, C), respectively].
The posterior distribution of either parameter, conditional on β, can be obtained from (3.4.14) by integrating out the other,

p(θ_j | β, y) ∝ ∫_{−∞}^{∞} [M(θ)]^{−10(1+β)} dθ_l,   −∞ < θ_j < ∞,   j = 1, 2,   l = 1, 2,   l ≠ j.   (3.4.15)

Figures 3.4.3 and 3.4.4 show respectively the marginal distributions of θ₁ and of θ₂ for the seven values of β considered. As expected, both the location and the spread of the marginal distributions are quite sensitive to changes in β in the range −0.9 ≤ β ≤ −0.3. Also, the sensitivity seems to be somewhat greater for θ₂ than for θ₁.
Fig. 3.4.3 Posterior distribution of θ₁ for various choices of β: reaction rate data.
Thus, so far as inferences about θ₁ and θ₂ are concerned, the assumption of Normality would not, for the present example, lead us much astray if the true parent were leptokurtic. On the other hand, such an assumption could lead to rather erroneous conclusions if the true parent distribution were close to the rectangular. Proceeding as before, overall inferences about θ₁ and θ₂ can be made by averaging the posterior distributions in (3.4.14) and (3.4.15) over the posterior
Fig. 3.4.4 Posterior distribution of θ₂ for various choices of β: reaction rate data.
distribution of β. From (3.4.11), the joint distribution of (θ₁, θ₂) is

p(θ₁, θ₂ | y) = ∫_{−1}^{1} p(θ₁, θ₂ | β, y)p(β | y) dβ   (3.4.16)

and from (3.4.15) the marginal distributions are

p(θ_j | y) ∝ ∫_{−1}^{1} p(θ_j | β, y)p_u(β | y)p(β) dβ,   −∞ < θ_j < ∞,   j = 1, 2,   (3.4.17)

where p_u(β | y) is the posterior distribution of β on the basis of a uniform reference prior for β, and p(β) is the appropriate prior distribution. The distribution

p_u(β | y) ∝ Γ[1 + 10(1+β)]{Γ[1 + ½(1+β)]}⁻²⁰ J(β),   −1 < β ≤ 1,   (3.4.18)

where

J(β) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [Σ_{i=1}^{20} |y_i − θ₁ − θ₂x_i|^{2/(1+β)}]^{−10(1+β)} dθ₁ dθ₂,

is shown in Fig. 3.4.5 (a = 1). We see that the mode is near β = 0, indicating that the Normal distribution is a plausible parent. However, the evidence is not very strong. There is a probability of about 17% that β ≥ 0.6 and a probability of about 7% that β ≤ −0.6. Thus, parents which are quite different from the Normal cannot be ruled out from the evidence provided by p_u(β | y) alone.
Fig. 3.4.5 Prior and posterior distributions of β for various choices of the parameter "a": reaction rate data.

In experiments of this kind a tendency toward Normality is to be expected. We represent this by supposing, as before, that
p(β) ∝ (1 − β)^{a−1}(1 + β)^{a−1},   −1 < β ≤ 1,   (3.4.19)
where the parameter "a" can be adjusted to represent a greater or lesser central limit effect. The result of varying "a" is shown in Fig. 3.4.5, where p(β) and the corresponding posterior distribution of β,

p(β | y) ∝ p_u(β | y)p(β),   −1 < β ≤ 1,   (3.4.20)

are plotted for a = 1, 3, 6, and 9, respectively. For a ≥ 6, the distribution practically reproduces the prior. For any specific value of "a", overall inference about (θ₁, θ₂) can now be made by averaging the conditional distributions p(θ_j | β, y) over the appropriate p(β | y). We have computed the marginal distributions of θ₁ and θ₂ in (3.4.17) for a = 1, 3, 6, and 9, and specimens of the densities are shown in Tables 3.4.3 and 3.4.4, respectively. Also shown in the same tables are densities of the conditional distributions p(θ_j | β = 0, y), which correspond to a → ∞, that is, to the assumption of exact Normality. These distributions are very little affected by the choice of "a".
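The combination of p_u(β | y) with the prior (3.4.19) is a pointwise multiplication followed by renormalization. The sketch below does this for several values of "a"; the p_u ordinates are placeholders (in practice they come from the numerical integration J(β)), and the beta-like form of the prior is that reconstructed in (3.4.19).

```python
import numpy as np

betas = np.linspace(-0.95, 0.95, 39)
p_u = np.ones_like(betas)            # placeholder reference ordinates p_u(beta | y)

for a in (1, 3, 6, 9):
    # Prior of (3.4.19): uniform for a = 1, concentrating at beta = 0 as a grows
    prior = (1.0 - betas) ** (a - 1) * (1.0 + betas) ** (a - 1)
    post = p_u * prior               # (3.4.20), unnormalized
    post /= post.sum() * (betas[1] - betas[0])
    print(a, round(post.max(), 3))
```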
Table 3.4.3 Posterior distribution of θ₁ for various choices of "a": reaction rate data

                              p(θ₁ | y)
θ₁         a = 1    a = 3    a = 6    a = 9    a → ∞
−4.090      0.19     0.19     0.19     0.19     0.19
−4.078      0.55     0.56     0.57     0.57     0.56
−4.066      1.51     1.55     1.56     1.56     1.54
−4.054      3.70     3.78     3.79     3.79     3.76
−4.042      7.66     7.81     7.82     7.81     7.73
−4.030     12.88    13.02    13.00    12.97    12.82
−4.018     16.72    16.71    16.65    16.61    16.49
−4.012     17.10    17.01    16.95    16.94    16.88
−4.006     16.30    16.13    16.10    16.10    16.12
−3.994     12.12    11.87    11.85    11.87    11.98
−3.982      6.80     6.74     6.78     6.81     6.95
−3.970      3.10     3.12     3.16     3.18     3.27
−3.958      1.23     1.25     1.27     1.28     1.31
−3.946      0.44     0.45     0.46     0.46     0.47
−3.934      0.15     0.15     0.15     0.15     0.16
Table 3.4.4 Posterior distribution of θ₂ for various choices of "a": reaction rate data

                              p(θ₂ | y)
θ₂          a = 1    a = 3    a = 6    a = 9    a → ∞
−0.2420      0.18     0.16     0.16     0.16     0.15
−0.2364      0.57     0.53     0.51     0.50     0.49
−0.2308      1.74     1.60     1.56     1.54     1.51
−0.2252      4.77     4.43     4.34     4.31     4.25
−0.2196     11.03    10.62    10.50    10.46    10.36
−0.2140     21.21    21.02    20.92    20.87    20.72
−0.2084     32.61    32.61    32.54    32.49    32.29
−0.2028     37.68    37.83    37.85    37.84    37.75
−0.1972     31.57    32.11    32.33    32.42    32.64
−0.1916     20.01    20.53    20.75    20.85    21.15
−0.1860     10.25    10.41    10.48    10.51    10.66
−0.1804      4.45     4.40     4.38     4.38     4.40
−0.1748      1.72     1.63     1.60     1.59     1.57
−0.1692      0.61     0.55     0.53     0.52     0.51
−0.1636      0.20     0.18     0.17     0.16     0.16
Even for the extreme case a = 1, where p(β | y) = p_u(β | y), the respective marginal distributions of θ_j are quite close to the limiting p(θ_j | β = 0, y) distributions. For this data, then, where weak evidence from the sample would support a supposition of approximate Normality, analysis with wider assumptions confirms the Normal theory analysis.

3.5 FURTHER EXTENSION TO NONLINEAR MODELS
The object of much experimentation is to study the relationship between a response or output variable y, subject to error, and input variables ξ₁, ξ₂, ..., ξ_p, the levels of which are suitably chosen by the experimenter. Suppose the model can be written

y = η(ξ | θ) + ε,   (3.5.1)

where ξ′ = (ξ₁, ..., ξ_p), ε is the error term with zero expectation, and θ′ = (θ₁, ..., θ_k) is the vector of k unknown parameters to be estimated. A model is linear in the parameters θ if

η(ξ | θ) = θ₁x₁ + θ₂x₂ + ... + θ_kx_k,

where x₁, ..., x_k are functions of ξ₁, ..., ξ_p. Otherwise, the model is called "nonlinear" (in the parameters θ). Linear models were considered in the last section. We now show how the approach is extended to include nonlinear models.

In n experiments, let y_i be the ith observation, ξ′_i = (ξ_{i1}, ..., ξ_{ip}) the setting of the p input variables, and ε_i the corresponding error term. We then have

y_i = η(ξ_i | θ) + ε_i,   i = 1, ..., n,   (3.5.2)

where E(ε_i) = 0, or equivalently E(y_i) = η(ξ_i | θ). We shall suppose that the errors (ε₁, ..., ε_n) are independent, each having the exponential power distribution in (3.4.2). Thus, irrespective of whether the model (3.5.2) is linear or nonlinear in the parameters θ, the likelihood function of (θ, σ, β) is

l(θ, σ, β | y) ∝ [ω(β)]ⁿ σ⁻ⁿ exp[−c(β) Σ_{i=1}^{n} |e_i/σ|^{2/(1+β)}],   (3.5.3)

where e_i = y_i − E(y_i) = y_i − η(ξ_i | θ), and ω(β) and c(β) are given in (3.2.5). If we now suppose that θ and log σ are approximately independent and locally uniform a priori, then for given β the joint posterior distribution of θ and σ is

p(θ, σ | β, y) ∝ σ^{−(n+1)} exp[−c(β) Σ_{i=1}^{n} |e_i/σ|^{2/(1+β)}],   σ > 0,   −∞ < θ_j < ∞,   j = 1, ..., k.   (3.5.4)
On integrating out σ we obtain the joint distribution of θ in the simple form

p(θ | β, y) = [J(β)]⁻¹[M(θ)]^{−½n(1+β)},   −∞ < θ_j < ∞,   j = 1, ..., k,   (3.5.5)

where

M(θ) = Σ_{i=1}^{n} |y_i − η(ξ_i | θ)|^{2/(1+β)}

and J(β) is the normalizing integral

J(β) = ∫_R [M(θ)]^{−½n(1+β)} dθ,

with R: (−∞ < θ_j < ∞, j = 1, ..., k).
Also, if β is regarded as a random variable independent of (θ, σ) a priori, the posterior distribution of β for a given prior p(β) will be

p(β | y) ∝ p_u(β | y)p(β),   −1 < β ≤ 1,

where

p_u(β | y) ∝ Γ[1 + ½n(1+β)]{Γ[1 + ½(1+β)]}⁻ⁿ J(β)   (3.5.6)

is the posterior distribution of β on the basis of a uniform reference prior over the range (−1, 1]. Finally, when β is not known, the marginal posterior distribution of θ is

p(θ | y) = ∫_{−1}^{1} p(θ | β, y)p(β | y) dβ ∝ ∫_{−1}^{1} p(θ | β, y)p_u(β | y)p(β) dβ,   −∞ < θ_j < ∞,   j = 1, ..., k.   (3.5.7)
It will be noted that in the above we have not needed to assume either that a) the model was linear in θ, or that b) the error distribution was Normal.

3.5.1 An Illustrative Example
The following simple example, which serves to show the generality of the approach, concerns an experiment in a continuous chemical reactor. The object was to estimate the rate constant K of a particular chemical reaction. The quantity y is the observed concentration of a reactant A in the outlet from the reactor, expressed as a fraction of its concentration at the inlet. The inlet concentration was held fixed throughout the investigation but the flow rate was varied from one
experimental run to another. If the reaction is first order with respect to A, it is possible to show that the outlet concentration is given by η = 1/(1 + θξ), where θ is a known multiple of the rate constant K and ξ is the ratio of the inlet concentration to the flow rate. Our problem then is to make inferences about θ, given data in the form of n pairs of values of y and ξ, and given the model

E(y_i) = η(ξ_i | θ) = 1/(1 + θξ_i),   i = 1, ..., n.   (3.5.8)
In eight runs the following results were obtained:

Table 3.5.1 Chemical reactor data

ξ     50     75     100    125    150    200    250    300
y     0.78   0.69   0.54   0.57   0.54   0.37   0.40   0.32
We shall calculate, to a sufficient approximation, posterior densities of θ, p(θ | β, y), for β → −1 and β = −0.5, 0.0, 0.5, 1.0. The corresponding five parent distributions are, then, the uniform, the fourth-power distribution, the Normal distribution, the 4/3-power distribution, and the double exponential distribution. For clarity we shall set out the calculations in some detail. Table 3.5.2 shows values of the residuals r_i = y_i − η(ξ_i | θ) for a suitable range of values of θ at intervals of 0.025. From this table we calculate, for each value of β, the quantities

|r_L|⁻⁸,   (Σr_i⁴)⁻²,   (Σr_i²)⁻⁴,   (Σ|r_i|^{4/3})⁻⁶,   (Σ|r_i|)⁻⁸

(where r_L is the deviate largest in absolute magnitude). These five quantities provide unstandardized ordinates of the conditional posterior distributions of θ, p(θ | β, y), for β → −1.0 and β = −0.5, 0.0, 0.5, 1.0. For any given value of β, the quantity Ĵ(β) = (sum of these unstandardized ordinates × interval in θ) provides, for our present purpose,† an approximation sufficiently close to the normalizing integral J(β). On dividing the unstandardized ordinates through by Ĵ(β) we obtain the approximate ordinates of the posterior distribution of θ for the five selected values of β. These calculations are shown in Table 3.5.3 (on page 192). Using these values the posterior distributions of θ are plotted in Fig. 3.5.1. For this example, the distributions differ quite considerably from one another, both in their location and shape. Inferences about θ thus depend rather heavily upon the assumed form of the parent population. It is of interest to consider the modal values of θ for various values of β, that is, the values θ̂(β) giving maximum posterior density.

† A considerable improvement can be obtained, of course, by using a simple numerical integration procedure.
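The tabulation just described can be reproduced in a few lines. In the sketch below, the ξ values of Table 3.5.1 are rescaled by 1/100, an assumption inferred from the residuals of Table 3.5.2, which match r_i = y_i − 1/(1 + θξ_i/100).

```python
import numpy as np

xi = np.array([50, 75, 100, 125, 150, 200, 250, 300]) / 100.0  # rescaling assumed
y = np.array([0.78, 0.69, 0.54, 0.57, 0.54, 0.37, 0.40, 0.32])
n = len(y)

theta = np.arange(0.500, 0.851, 0.025)   # grid at intervals of 0.025
step = 0.025
r = y[None, :] - 1.0 / (1.0 + theta[:, None] * xi[None, :])   # residuals

for beta in (-0.5, 0.0, 0.5, 1.0):
    M = np.sum(np.abs(r) ** (2.0 / (1.0 + beta)), axis=1)
    ordinates = M ** (-0.5 * n * (1.0 + beta))
    J_hat = ordinates.sum() * step       # Riemann-sum normalizer J^(beta)
    print(beta, np.round(ordinates / J_hat, 1))

# For beta -> -1 the ordinate is based on the largest absolute residual r_L
ordinates = np.abs(r).max(axis=1) ** (-float(n))
print(-1.0, np.round(ordinates / (ordinates.sum() * step), 1))
```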
Table 3.5.2 Values of residuals r_i = y_i − η(ξ_i | θ) for various values of θ, computed from the chemical reactor data†

ξ\θ     0.500   0.525   0.550   0.575   0.600   0.625   0.650   0.675
50      −2.0    −1.3    −0.4     0.4     1.1     1.9     2.6     3.2
75      −3.7    −2.7    −1.7    −0.8     0.0     0.9     1.8     2.6
100    −12.7   −11.6   −10.5    −9.5    −8.5    −7.6    −6.6    −5.7
125     −4.5    −3.5    −2.3    −1.2    −0.1     0.8     1.8     2.8
150     −3.1    −2.0    −0.8     0.4     1.4     2.4     3.4     4.3
200    −13.0   −11.8   −10.6    −9.6    −8.5    −7.4    −6.5    −5.6
250     −4.4    −3.3    −2.1    −1.0     0.0     1.0     1.9     2.8
300     −8.0    −6.8    −5.7    −4.7    −3.7    −2.8    −1.9    −1.1

ξ\θ     0.700   0.725   0.750   0.775   0.800   0.825   0.850
50       3.9     4.7     5.3     6.0     6.6     7.2     7.8
75       3.4     4.3     5.0     5.8     6.5     7.2     7.8
100     −4.8    −4.0    −3.1    −2.3    −1.6    −0.9    −0.2
125      3.7     4.5     5.4     6.2     7.0     7.8     8.4
150      5.2     6.1     6.9     7.8     8.5     9.2    10.0
200     −4.7    −3.8    −3.0    −2.2    −1.5    −0.8     0.0
250      3.6     4.4     5.2     6.0     6.7     7.4     8.0
300     −0.3     0.6     1.2     1.9     2.6     3.2     3.7

† The entries in the table are values of r_i × 100, where r_i = y_i − (1 + θξ_i)⁻¹.
Reading from the graphs we find approximately:

Parent distribution   Rectangular   Fourth power   Normal   4/3 power   Double exponential
Value of β            −1            −½             0        ½           1
Value of θ̂(β)         0.70          0.69           0.665    0.63        0.595

Thus, the value

θ̂(−1) = 0.70 (rectangular parent)

minimizes the maximum absolute deviation from the fitted curve,

θ̂(0) = 0.665 (Normal parent)

is the least squares value and minimizes the sum of the squared deviations from the fitted curve, and

θ̂(1) = 0.595 (double exponential parent)

minimizes the sum of the absolute deviations from the fitted curve. The original data and fitted functions for these three values of θ are shown in Fig. 3.5.2.
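These three modal values can be checked directly by minimizing the corresponding criteria. The sketch below uses scipy's bounded scalar minimizer; the bounds and the 1/100 rescaling of ξ are assumptions carried over from the earlier sketch, and the non-smooth maximum-deviation criterion is only approximately minimized by this method.

```python
import numpy as np
from scipy.optimize import minimize_scalar

xi = np.array([50, 75, 100, 125, 150, 200, 250, 300]) / 100.0
y = np.array([0.78, 0.69, 0.54, 0.57, 0.54, 0.37, 0.40, 0.32])

def resid(theta):
    return y - 1.0 / (1.0 + theta * xi)

criteria = {
    "least maximum deviation (beta -> -1)": lambda t: np.abs(resid(t)).max(),
    "least squares (beta = 0)":             lambda t: (resid(t) ** 2).sum(),
    "least absolute deviation (beta = 1)":  lambda t: np.abs(resid(t)).sum(),
}
for name, crit in criteria.items():
    res = minimize_scalar(crit, bounds=(0.4, 1.0), method="bounded")
    print(f"{name}: theta = {res.x:.3f}")
```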
Fig. 3.5.1 Posterior distribution of θ for various values of β: chemical reactor data.

Information About β
Approximate unstandardized ordinates for p_u(β | y) can be obtained by replacing J(β) in (3.5.6) with Ĵ(β), so that

p_u(β | y) ∝̇ Γ[1 + ½n(1+β)]{Γ[1 + ½(1+β)]}⁻ⁿ Ĵ(β),

which are shown in the last row of Table 3.5.3 and plotted in Fig. 3.5.3. From this sample of only 8 observations, as is to be expected, there is very little evidence about the form of the parent distribution. In particular, the highest ordinate of p_u(β | y) is only about three times the lowest. To see how slight is this apparent preference for negative values of β (for platykurtic parent distributions), we can temporarily consider p_u(β | y) as a genuine posterior distribution (that is, consider it as if there were a uniform prior for β). Then the appropriate function p_u(β | y), obtained in Fig. 3.5.3 by joining the ordinates with a dotted curve, would indicate a probability of about 1/3 that β was in fact positive (that the parent was leptokurtic).

To obtain the marginal distribution of θ, one must integrate p(θ | β, y) with p(β | y) as the weight function. Now the weight function is given by p(β | y) ∝ p_u(β | y)p(β) and, for this example, it is the prior p(β) which dominates rather than p_u(β | y). Moreover, p(θ | β, y) changes drastically for different β. Here then
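The last row of Table 3.5.3 follows by multiplying the Γ-function factor of (3.5.6) by Ĵ(β). The sketch below reproduces that calculation, using the Ĵ values of Table 3.5.3 with n = 8.

```python
from math import gamma

n = 8
J_hat = {-1.0: 102.0, -0.5: 13.52, 0.0: 0.6045, 0.5: 0.02319, 1.0: 0.000804}

for beta, J in J_hat.items():
    # Gamma-function factor of (3.5.6)
    factor = gamma(1.0 + 0.5 * n * (1.0 + beta)) / gamma(1.0 + 0.5 * (1.0 + beta)) ** n
    print(beta, round(factor * J, 1))   # approximately 102.0, 59.4, 38.1, 32.8, 32.4
```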
Fig. 3.5.2 Fitted functions using modal values of θ: chemical reactor data. [Panels: β = 0 (Normal parent), least squares, θ̂ = 0.665; β = 1 (double exponential parent), least mean deviation, θ̂ = 0.595; β = −1 (uniform parent), least maximum deviate, θ̂ = 0.70.]
we are put on notice by the Bayes analysis that, for this example, the conclusions depend critically on what is assumed a priori about β and hence about the nature of the parent distribution. The data analyst is made aware that he is in a situation where prior assumptions about β (about Normality) must be given very careful attention, the data being too few to supply much information on their own account. If extensive records exist of similar experiments, he will be led to examine these to determine whether or not they suggest approximately Normal error structure. He will also carefully consider whether the hypothesis of many error sources, none of which dominate, is sensible for this experimental set-up. If he decides that it is, it will be sensible to use for p(β) a distribution centered at β = 0, as in Fig. 3.5.4. Then for this data p(β | y) will be much the same as p(β), and averaging p(θ | β, y) with this weight function will be nearly the same as averaging with the prior p(β). Thus the averaging process would lead approximately to the marginal distribution of θ, p(θ | y) ≐ p(θ | β = 0, y), and to Normal theory as an approximation.

The above kind of argument could also be used if his investigation led the data analyst to some central distribution other than the Normal. For example,
Table 3.5.3 Approximate ordinates of the posterior distribution of θ for various values of β: chemical reactor data

              β = −1        β = −½          β = 0          β = ½              β = 1
Value of θ    r_L⁻⁸/Ĵ(−1)   (Σr⁴)⁻²/Ĵ(−½)   (Σr²)⁻⁴/Ĵ(0)   (Σ|r|^{4/3})⁻⁶/Ĵ(½)   (Σ|r|)⁻⁸/Ĵ(1)
0.500          0.0           0.0             0.0            0.0                0.0
0.525          0.0           0.0             0.1            0.1                0.1
0.550          0.1           0.1             0.3            0.6                0.7
0.575          0.1           0.2             0.9            2.3                3.7
0.600          0.4           0.6             2.4            6.7               14.3
0.625          1.1           1.8             5.4            9.0                8.7
0.650          2.7           4.8             8.9            8.2                5.1
0.675          8.8           9.8             8.9            5.8                3.1
0.700         18.3          12.3             6.7            3.6                2.1
0.725          5.1           6.2             3.4            1.8                1.0
0.750          1.9           2.5             1.7            1.0                0.5
0.775          0.8           0.9             0.7            0.5                0.3
0.800          0.4           0.4             0.4            0.2                0.2
0.825          0.2           0.2             0.2            0.1                0.1
0.850          0.1           0.1             0.1            0.1                0.0

Ĵ(β)         102.0          13.52           0.6045         0.02319            0.000804
Γ[1 + ½n(1+β)]/{Γ[1 + ½(1+β)]}ⁿ
               1             4.390          63.07          1,414.4            40,320
Approximate p_u(β | y) (unstandardized)
             102.0          59.4            38.1           32.8               32.4

Fig. 3.5.3 Approximate ordinates of p_u(β | y): chemical reactor data.
Fig. 3.5.4 Combination of the function p_u(β | y) with a moderately informative prior p(β): chemical reactor data.
we have said that breaking-strength test measurements often have a markedly leptokurtic distribution, so that in an experiment to determine the mean breaking strength from a sample of test pieces in which almost the only source of experimental error was the test itself, the analyst might employ a weight function p(β) centered not on zero but on, say, β = 0.8. The resulting distribution p(θ | y) could then be considerably different from that obtained if the Normal were regarded as the most likely parent.

It is an advantage of the Bayesian analysis that when, as in this case, there is little information from the data itself concerning the value of some nuisance parameter such as β, we are put on notice by evidence like that set out in Fig. 3.5.1 that assumptions about β ought not to be made lightly. If we really do have some external evidence about the parent distribution then we should use it; but if we do not, then the uncertainty about θ must be correspondingly increased. This small-sample situation may be contrasted with that which applies when there is a larger number of observations. In that case the sample evidence about β is much stronger and there is correspondingly less need for strong prior evidence.
3.6 SUMMARY AND DISCUSSION

It seems pertinent at this point to summarize and discuss the results obtained in this chapter. The models employed in the analysis of the three sets of data in this chapter are all special cases of

y = η(ξ | θ) + ε   (3.6.1)
given earlier in (3.5.1), with y the observed output, ξ a set of inputs, and θ a set of parameters. Specifically, the models and their special features are as follows:

Darwin's data (Section 3.2): y = θ + ε. The parameter θ is the population mean.
Reaction rate data (Section 3.4): y = θ₁ + θ₂x + ε. The parameters θ₁ and θ₂ appear linearly in regression equations.
Chemical reactor data (Section 3.5): y = 1/(1 + θξ) + ε. The parameter θ appears nonlinearly.
The analysis of such sets of data would often be made on the assumption that the error ε was Normally distributed. Bayesian analysis makes it possible to study the effect of wider distributional assumptions, and to illustrate this, we have supposed that the parent error distribution was a member of the symmetric family

p(ε) = ω(β)σ⁻¹ exp[−c(β)|ε/σ|^{2/(1+β)}]   (3.6.2)

in (3.4.2), which includes the Normal as a special case. By changing β we are able to study the effect of changing the kurtosis of the parent distribution in a particular way.

1. For the general class of models represented by (3.6.2), and on reasonable assumptions which include the idea that there is no appreciable prior knowledge concerning θ or σ, the conditional posterior distribution p(θ | β, y) has the remarkably simple general form

p(θ | β, y) ∝ [M(θ)]^{−½n(1+β)}.   (3.6.3)

2. In all the examples studied, p(θ | β, y) changes quite markedly as β is changed. Consequently, the inferences which would be made about θ could differ materially depending on the value of β. We have contrasted this lack of inference robustness with the criterion robustness enjoyed by the model under certain circumstances [see, for example, Box and Watson (1962)].

3. Inferences about the value of β itself can be made by studying the distribution

p(β | y) ∝ p_u(β | y)p(β),   (3.6.4)

where p_u(β | y) is the distribution of β in relation to a uniform reference prior.

4. For small samples p_u(β | y) will be rather flat. This indicates that, as might be expected, little information about the nature of a parent distribution is contained in a small sample of observations.
5. The marginal posterior distribution

p(θ | y) ∝ ∫ p(θ | β, y)p(β | y) dβ   (3.6.5)

summarizes what overall inferences might be made about θ. In this expression p(β | y) ∝ p_u(β | y)p(β) acts as a weight function.

6. Some caution is needed in interpreting the integration in (3.6.5) if we need to rely heavily upon the form of p(β). However, in many examples where a central limit tendency is expected, the precise form of p(β) will not be of much importance. In the Darwin example, even though p_u(β | y) is far from sharp, it is sufficient to ensure that extreme conditional distributions p(θ | β, y) enter with little weight. Thus, even though the conditional distributions p(θ | β, y) differ markedly over the entire range of β, the inferences to be drawn about θ do not change very much for reasonable choices of p(β). The same effect can be seen for the reaction rate data in Section 3.4.

7. An important advantage of the Bayesian approach is that it makes a deeper analysis of inference problems possible. It does this by showing

a) how sensitive the conditional distribution p(θ | β, y) is to changes in the nuisance parameter β,
b) what the data can tell us about the nuisance parameter β.

Various situations could occur:

a) p(θ | β, y) might be very insensitive to β, in which case it would be apparent that exact knowledge of the nuisance parameter β was not needed.
b) p(θ | β, y) might be sensitive to changes in β, in which case the integration of this conditional distribution with weight function p(β | y) ∝ p_u(β | y)p(β) would need to be studied further.
c) If p_u(β | y) was sharp (as it would be if the number of observations was large), then inferences would rest primarily on the information coming from the sample, as represented by p_u(β | y), and very little on the choice of p(β).
d) If, on the other hand, p_u(β | y) was rather flat, as is particularly true in the third example of the chemical reactor data in Section 3.5, then any integration we make would in this case depend critically on the choice of p(β). In this case, then, where the inference was dependent to a large degree on prior assumptions, this fact would be clearly pointed out to us. Suppose, for instance, that in the third example we made an outright assumption of Normality (corresponding to choosing p(β) to be a δ function at β = 0); then the Bayesian analysis would make it perfectly clear that, for this particular example, the conclusions rested heavily on that choice, and that we ought to look to its justification.
In general, in order to draw conclusions on topics of importance to him, the experimenter needs information about primary parameters (for example, θ) in the context of a model structure about which he is uncertain. Such uncertainties can sometimes be represented as uncertainties about nuisance parameters. Whatever system of inference the investigator embraces, possible information about structure can either appear as prior assumptions or as information from the sample itself. In some instances reliance on prior assumptions could be greatly lessened by making proper use of appropriate information in the sample, as can be done with Bayesian analysis. The investigator is often tempted to make whatever assumptions he thinks he needs to allow him to make inferences about primary parameters. Unless he has some means of keeping track of the consequences of such assumptions, he runs the risk of being misled. Bayesian analysis helps to bring out into the open how much we need to assume and about what. Assumptions may be a) necessary or unnecessary, b) well founded or ill founded, and inference procedures may be c) sensitive or insensitive to assumptions. Bayesian analysis supplies information on (a) and (c) and can draw our attention to the necessity for making up our minds about (b).

3.7 A SUMMARY OF FORMULAS FOR POSTERIOR DISTRIBUTIONS

Table 3.7.1 provides a short summary of the formulas for various prior and posterior distributions discussed in this chapter.
Table 3.7.1 A summary of prior and posterior distributions

1. Suppose the data y′ = (y₁, ..., y_n) are drawn independently from the exponential power distribution in (3.2.5),

p(y | θ, σ, β) = ω(β)σ⁻¹ exp[−c(β)|(y − θ)/σ|^{2/(1+β)}],   −∞ < y < ∞,

where

c(β) = {Γ[(3/2)(1+β)]/Γ[(1/2)(1+β)]}^{1/(1+β)}

and

ω(β) = {Γ[(3/2)(1+β)]}^{1/2} / {(1+β){Γ[(1/2)(1+β)]}^{3/2}}.
2. Suppose further that the prior distribution is, from (3.2.20) and (3.2.29),

p(θ, σ, β) = p(θ, σ)p(β),

with p(θ, σ) ∝ σ⁻¹ and p(β) the prior distribution of β given in (3.2.29).

3. Given β, the conditional posterior distribution of θ is, in (3.2.12),

p(θ | β, y) = [J(β)]⁻¹[M(θ)]^{−½n(1+β)},   −∞ < θ < ∞,

where

M(θ) = Σ_{i=1}^{n} |y_i − θ|^{2/(1+β)}   and   J(β) = ∫_{−∞}^{∞} [M(θ)]^{−½n(1+β)} dθ.

In particular, for β = 0,

(θ − ȳ)/(s/√n) ~ t_{n−1},

where

ȳ = (1/n) Σ y_i   and   s² = [1/(n−1)] Σ (y_i − ȳ)².

Also, for β → −1,

(n−1)|θ − m|/h ~ F(2, 2(n−1)),

where m = ½[y_(1) + y_(n)] is the sample midpoint and h = ½[y_(n) − y_(1)] the half-range.

4. The marginal distribution of β is, in (3.2.28),

p(β | y) ∝ p(β)p_u(β | y),   −1 < β ≤ 1,

where

p_u(β | y) ∝ Γ[1 + ½n(1+β)]{Γ[1 + ½(1+β)]}⁻ⁿ J(β).

5. The marginal distribution of θ is, from (3.2.32),

p(θ | y) ∝ ∫_{−1}^{1} p(θ | β, y)p_u(β | y)p(β) dβ,   −∞ < θ < ∞.
CO .
197
198
3.7
Bayesian Assessment of Assumptions I
Tab le 3.7.1
(OI/ Ii/lll ed
6. The condi tional distribution p(O ! (3, y) , for (3 not close to - I, can be app roximated by the I distribution
+ (3) -
I{e, M(e) /d[II(1 where
&is
IJ , 1/(1
-I-
(3) - I},
the mode and d is a positive constant. Procedurse for computing
eand d
are given in Section 3.3.
•
7. For the linear mode l y = Xe + E in (3.4.1), where X is an II x k matrix, e ' = (0), ... , 8k ), a nd E' = (E J' . .. , E,,), the formulae are of the same form as those given in (2) through (6) above. Th e only substitutions needed are a) replace
8 bye,
b) replace M(tJ) by
L
M(e) =
/Yi -
X;il
e/2/(1 c PI ,
; :-:::: 1
where
X;il
is the ith row of X, and
c) redefine J«(3) as
where R: ( - X
< OJ <
00, j = I, .. , k).
8. For the n o nlinear model in (3.5.2)
Yi = 11(~i / a) where~; = (~iJ'
+ ei' i =
I, .. ,
11,
... , ~ip) · The substit uti ons required for the formulae in (2) through
(6) are a) replace
8
by
a,
b) replace M(8) by M(a)
= L"
/Yi - 11(~, / a) /2/( J "\- Pl,
; :--- 1
c) redefine
where R: (-00
< OJ <
OC, j = 1, . .. ,k).
A3.1
Appendix
199
APPENDIX A3.1 SOME PROPERTIES OF THE POSTERIOR DISTRIBUTION p(() I (J,
y)
In Section 3.2.3 we have asserted certain properties of the posterior distribution in the permissible range of {J. These properties follow from work on the sample median by Jackson (1921). For the class of parent distribu,tions givt:n in (3.2.3), the ma xim um likelihood estimate of 0 for fixed {J, which is also the mod e of p(O I (J, y), has been considered for certain specific choices of fJ by Turner (1960). In our notation , consider the function
pee i (J, y)
n
M(e) =
I
iYi - OI W " P),
CA3.1. J)
i :. . I
For convenience, let us denote q M(O) =
=
+
2/ ( J
I"
(J), so that
el q ,
IYi -
q>1.
(A3. J .2)
;= 1
I.
We first show that
a) N/(e) is convex, continuous, and for q > I , has continuous first derivative; b) for q> J, M(B) has a unique minimum which is attained in the interval [y( 1) ' Y(n) l To see (a), consider i = I , 2, .. . , n.
Clearly gjCB) IS convex and continuous everywhere. Then for e < Yi,
(A3.1.3)
Now, suppose that q> I.
for fJ > Yi' g; (8) = q(O - y;)q
I
and as 8 approaches Yi from both directions lim or Yi
g; (0)
=
lim
g;(8)
=
0,
(A3.1.5)
G.Yi
which implies that g; (Yi) = O. Since q - I > 0, g;(O) exists and is continuous everywhere. Assertion (a) follows since M(8) is the sum of all gi(e). let us now consider M'(e). We see that for q > I, M '(8)
=
q
I" i ~= t
IY(i) -
olq Oi,
(A3.1.6a)
200
Bayesian Assessment of Assumptions 1
where
A3.1
for
8 :::;
for
0> Y(i)
Y(i )
and Y(i)' (i = I, .. ., /1) are the ordered observations. Thus, for 8 < Y(i)' II
L (Yi -
M'(8) = -q
8)q-
<0
(A3.1 .6b)
o.
(A3.1.6c)
I
j :::. 1
and for 8 >
Y(II)' II
M'(8) = q
I
(8 - yJq-
I
>
i= I
Thus, by properties of continuous function, there exists at least one 00' 00 :::; Y(II)' such that M' (8 0 ) = O. Further, it is easy to see from (A3.1 .6a) thal M'(8) is monotonically increasing in 8, so that M'(8) can vanish once (and only once) and that the extreme value of M(8) must be a minimum. This demonstrates assertion (b).
Y(I):::;
2. It has been shown by Jackson that when q approaches 1, in the limit the value of θ which minimizes M(θ) is the median of the y_i's if n is odd and, if n is even, is some unique value between the middle two of the y_i's.

3. When q = 1, it is readily seen that M(θ) is minimized at the median of the y's for n odd, and is constant between the middle two observations for n even.

4. We now show that, when q is arbitrarily large,

lim_{q→∞} [M(θ)]^{1/q} = h + |m − θ|,   (A3.1.7)

where

m = ½[y_(1) + y_(n)]   and   h = ½[y_(n) − y_(1)].
Proof. Consider a finite sequence of monotone increasing positive numbers {a_i} and a number S such that

S = (Σ_{i=1}^{n} a_i^q)^{1/q} = a_n [Σ_{i=1}^{n} (a_i/a_n)^q]^{1/q},   where a_i/a_n ≤ 1 for all i.   (A3.1.8)
Hence

(S/a_n)^q = Σ_{i=1}^{n} (a_i/a_n)^q,   (A3.1.9)

so that

log(S/a_n) = (1/q) log[Σ_{i=1}^{n} (a_i/a_n)^q].   (A3.1.10)

When q → ∞,

lim_{q→∞} log[Σ_{i=1}^{n} (a_i/a_n)^q] = log r,   (A3.1.11)

where 1 ≤ r ≤ n. But this implies that

lim_{q→∞} log(S/a_n) = 0,   (A3.1.12)

and from this lim_{q→∞} S = a_n.
Thus, for any given value of θ, when q is arbitrarily large,

lim_{q→∞} [M(θ)]^{1/q} = max_i |y_i − θ| = max[|θ − y_(1)|, |θ − y_(n)|] = { h + (m − θ) for θ < m;  h for θ = m;  h + (θ − m) for θ > m }.   (A3.1.13)

Hence lim_{q→∞} [M(θ)]^{1/q} = h + |m − θ|, and the assertion follows.
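The limit (A3.1.7) is easy to verify numerically. The sketch below evaluates [M(θ)]^{1/q} on the log scale (to avoid overflow for large q) for an illustrative sample and compares it with h + |m − θ|.

```python
import numpy as np

y = np.array([0.3, 1.1, 2.0, 4.7, 6.5])   # illustrative sample
m = 0.5 * (y.min() + y.max())              # midpoint
h = 0.5 * (y.max() - y.min())              # half-range
theta = 1.5

a = np.log(np.abs(y - theta))              # log |y_i - theta|
for q in (2.0, 10.0, 100.0, 1000.0):
    # stable evaluation of (sum_i |y_i - theta|^q)^(1/q)
    Mq = np.exp(a.max() + np.log(np.exp(q * (a - a.max())).sum()) / q)
    print(q, round(Mq, 4), h + abs(m - theta))
```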
APPENDIX A3.2 A PROPERTY OF LOCALLY UNIFORM DISTRIBUTIONS

Let y₁ and y₂ be two independently distributed random variables. Suppose y₁ has a distribution such that its density is approximately uniform, p(y₁) ≐ c, over a range of y₁ which is wide relative to the range in which the density of y₂ is appreciable. Then the variable X = y₁ + y₂ has a locally uniform distribution and is distributed approximately independently of y₂.
Proof. The conditional density of X given y₂ is

f(X | y₂) = f(y₁ = X − y₂ | y₂) = p(y₁ = X − y₂) ≐ c.

It follows that X is approximately independent of y₂ and its marginal density is f(X) ≐ c.
CHAPTER 4

BAYESIAN ASSESSMENT OF ASSUMPTIONS 2. COMPARISON OF VARIANCES

4.1 INTRODUCTION

In the preceding chapter, we analyzed the problem of making inferences about location parameters when the assumption of Normality was relaxed. We now extend the analysis to the problem of comparing variances. As was shown by Geary (1947), Gayen (1950), Box (1953a), and others, the usual Normal sampling theory tests to compare variances are not criterion-robust under non-Normality. That is to say, the sampling distributions of Normal theory criteria, such as the variance ratio and Bartlett's statistic to compare several variances, can be seriously affected when the parent distribution is moderately non-Normal. The difficulty arises because of "confounding" of the effect of inequality of variances and the effect of kurtosis. In particular, critical regions for tests on variances and tests of kurtosis overlap substantially, so that it is usually difficult to know whether an observed discrepancy results from one or the other, or a mixture of both [see Box (1953b)].

Within the Bayesian framework we are not limited by the necessity to consider "test criteria." However, as we have seen, the question of the sensitivity of the inference does arise. Therefore, following our earlier work (1964a), we now consider the comparison of variances in a wider Bayesian framework in which the parent distributions are permitted to exhibit kurtosis. We again employ as parent distributions the class of exponential power distributions in (3.2.5), which are characterized by the three parameters (β, θ, σ). The densities for samples from k such populations will thus depend in general upon 3k unknown parameters,

(β₁, θ₁, σ₁), (β₂, θ₂, σ₂), ..., (β_k, θ_k, σ_k).

It will often be reasonable to suppose that the parameters β₁, ..., β_k are essentially the same, and we shall make this assumption. In what follows, the case of two populations is considered first, and the analysis is then extended to k ≥ 2 populations.

4.2 COMPARISON OF TWO VARIANCES
For simplicity, the analysis is first carried through on the assumption that the location parameters θ₁ and θ₂ of the two populations to be compared are known.
Later this assumption is relaxed. Proceeding as before, the inference robustness of the variance ratio σ₂²/σ₁² may be studied for any particular sample y by computing the conditional posterior distribution p(σ₂²/σ₁² | β, y) for various choices of β. The marginal posterior distribution, which can be written as the product p(β | y) ∝ p(β)p_u(β | y), indicates the plausibility of the various choices. In particular, a prior distribution p(β), concentrated about β = 0 to a greater or lesser degree, can represent different degrees of central limit tendency, and marginal posterior distributions p(σ₂²/σ₁² | y) appropriate to these different choices of p(β) may be computed. To relax the assumption that the location parameters θ₁ and θ₂ are known involves two further integrations, and this proves laborious even on a fast electronic computer. However, we shall show that a close approximation to the integral is obtained by replacing the unknown means θ₁ and θ₂ by their modal values in the integrand, and changing the "degrees of freedom" by one unit. An example is worked out in detail.

4.2.1 Posterior Distribution of σ₂²/σ₁² for Fixed Values of (θ₁, θ₂, β)
The likelihood function of (σ₁, σ₂, θ₁, θ₂, β), given the two samples y′₁ = (y₁₁, ..., y₁ₙ₁) and y′₂ = (y₂₁, ..., y₂ₙ₂), is

l(σ₁, σ₂, θ₁, θ₂, β | y) ∝ [ω(β)]^{n₁+n₂} σ₁^{−n₁}σ₂^{−n₂} exp{−c(β)[n₁s₁(β, θ₁)/σ₁^{2/(1+β)} + n₂s₂(β, θ₂)/σ₂^{2/(1+β)}]},   (4.2.1)

where

s_i(β, θ_i) = (1/n_i) Σ_{j=1}^{n_i} |y_{ij} − θ_i|^{2/(1+β)},   i = 1, 2,

and c(β) and ω(β) are given in (3.2.5) as

c(β) = {Γ[(3/2)(1+β)]/Γ[(1/2)(1+β)]}^{1/(1+β)},   ω(β) = {Γ[(3/2)(1+β)]}^{1/2} / {(1+β){Γ[(1/2)(1+β)]}^{3/2}}.
We assume, as before, that the means (θ₁, θ₂) and the logarithms of the standard deviations (σ₁, σ₂) are approximately independent and locally uniform a priori, so that

p(θ₁, θ₂, σ₁, σ₂) ∝ (σ₁σ₂)⁻¹,   (4.2.2)

or

p(σ_i) ∝ 1/σ_i,   i = 1, 2.   (4.2.3)
Further, we suppose that these location and scale parameters are distributed independently of the non-Normality parameter β. It follows that for given values of θ = (θ₁, θ₂) the joint posterior distribution of σ₁, σ₂, and β can then be written

$$ p(\sigma_1,\sigma_2,\beta \mid \theta, y) \propto \sigma_1^{-1}\sigma_2^{-1}\, p(\beta)\, l(\sigma_1,\sigma_2,\beta \mid \theta, y), \qquad (4.2.4a) $$

where p(β) is the prior distribution of β and l(σ₁, σ₂, β | θ, y) is the likelihood function in (4.2.1) when (θ₁, θ₂) are regarded as fixed. Now, we may write the joint distribution as the product

$$ p(\sigma_1,\sigma_2,\beta \mid \theta, y) = p(\beta \mid \theta, y)\, p(\sigma_1,\sigma_2 \mid \beta, \theta, y). \qquad (4.2.4b) $$
For any given value of β, the conditional posterior distribution of σ₁ and σ₂ is then

$$ p(\sigma_1,\sigma_2 \mid \beta,\theta,y) = \prod_{i=1}^{2} p(\sigma_i \mid \beta,\theta_i,y_i), \qquad (4.2.5) $$

where

$$ p(\sigma_i \mid \beta,\theta_i,y_i) \propto \sigma_i^{-(n_i+1)} \exp\left[-\frac{c(\beta)\, n_i s_i(\beta,\theta_i)}{\sigma_i^{2/(1+\beta)}}\right], \qquad \sigma_i > 0. $$
This has the form of the product of two independent inverted gamma distributions. By making the transformation V = σ₂²/σ₁² and W = σ₁, and integrating out W, we find that the posterior distribution of σ₂²/σ₁² is

$$ p(V \mid \beta,\theta,y) = w(\beta,\theta)\, V^{\frac{1}{2}n_1-1}\left[1 + \frac{n_1 s_1(\beta,\theta_1)}{n_2 s_2(\beta,\theta_2)}\, V^{1/(1+\beta)}\right]^{-\frac{1}{2}(n_1+n_2)(1+\beta)}, \qquad V > 0, \qquad (4.2.6) $$

where

$$ w(\beta,\theta) = \left(\frac{1}{1+\beta}\right)\frac{\Gamma[\tfrac{1}{2}(n_1+n_2)(1+\beta)]}{\prod_{i=1}^{2}\Gamma[\tfrac{1}{2}n_i(1+\beta)]}\left[\frac{n_1 s_1(\beta,\theta_1)}{n_2 s_2(\beta,\theta_2)}\right]^{\frac{1}{2}n_1(1+\beta)}. $$
It is convenient to consider the quantity [s₁(β, θ₁)/s₂(β, θ₂)] V^{1/(1+β)}, where it is to be remembered that V = σ₂²/σ₁² is a random variable and s₁(β, θ₁)/s₂(β, θ₂) is a constant calculated from the observations. We then have that

$$ p\left[F = \frac{s_1(\beta,\theta_1)}{s_2(\beta,\theta_2)}\, V^{1/(1+\beta)}\ \middle|\ \beta,\theta,y\right] = p\left[F_{n_1(1+\beta),\,n_2(1+\beta)}\right], \qquad (4.2.7) $$

where p[F_{n₁(1+β), n₂(1+β)}] is the density of an F variable with n₁(1+β) and n₂(1+β) degrees of freedom. In particular, when β = 0, the quantity

$$ V\,\frac{\sum_{j=1}^{n_1}(y_{1j}-\theta_1)^2/n_1}{\sum_{j=1}^{n_2}(y_{2j}-\theta_2)^2/n_2} \qquad (4.2.8) $$

is distributed as F with n₁ and n₂ degrees of freedom. Further, when the value of β tends to −1 so that the parent distributions tend to the rectangular form, the quantity

$$ u = V\left(\frac{h_1+|m_1-\theta_1|}{h_2+|m_2-\theta_2|}\right)^2, \qquad (4.2.9) $$

where (h₁, h₂) and (m₁, m₂) are the half-ranges and mid-points of the first and second samples, respectively, has the limiting distribution (see Appendix A4.1)

$$ \lim_{\beta\to-1} p(u \mid \beta,\theta,y) = \begin{cases} \dfrac{n_1 n_2}{2(n_1+n_2)}\, u^{\frac{1}{2}n_1-1}, & 0 < u < 1, \\[1ex] \dfrac{n_1 n_2}{2(n_1+n_2)}\, u^{-\frac{1}{2}n_2-1}, & 1 < u < \infty. \end{cases} \qquad (4.2.10) $$
Thus, for given β not close to −1, probability levels of V can be obtained from the F-table. In particular, the probability a posteriori that the variance ratio V is less than unity is simply

$$ \Pr\{V < 1 \mid \beta,\theta,y\} = \Pr\left\{F_{n_1(1+\beta),\,n_2(1+\beta)} < \frac{s_1(\beta,\theta_1)}{s_2(\beta,\theta_2)}\right\}. \qquad (4.2.11) $$
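In computational terms, (4.2.11) is a single F-distribution evaluation. The following is a sketch of ours (the helper names and the data arrays y1, y2 are placeholders, not from the text):

```python
import numpy as np
from scipy.stats import f as f_dist

def n_s(y, theta, beta):
    # the "power sum" n*s(beta, theta) = sum_j |y_j - theta|^{2/(1+beta)}
    return np.sum(np.abs(np.asarray(y) - theta) ** (2.0 / (1.0 + beta)))

def prob_V_less_than_one(y1, y2, theta1, theta2, beta):
    # Pr{V < 1 | beta, theta, y} = Pr{F_{n1(1+b), n2(1+b)} < s1/s2}, eq. (4.2.11)
    n1, n2 = len(y1), len(y2)
    s1 = n_s(y1, theta1, beta) / n1
    s2 = n_s(y2, theta2, beta) / n2
    return f_dist.cdf(s1 / s2, n1 * (1.0 + beta), n2 * (1.0 + beta))
```

Note that the degrees of freedom nᵢ(1+β) are in general not integers; the F distribution is still well defined for them.

4.2.2 Relationship Between the Posterior Distribution p(V | β, θ, y) and Sampling Theory Procedures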
In this simplified situation where θ₁, θ₂, and β are assumed known, a parallel result can be obtained from sampling theory. It is readily shown that the two power sums n₁s₁(β, θ₁) and n₂s₂(β, θ₂) in (4.2.1), when regarded as functions of the random variables y₁ and y₂, are sufficient statistics for (σ₁, σ₂) and have the joint moment generating function

$$ M_z(t_1,t_2) = \prod_{i=1}^{2}\left[1 - \frac{2t_i}{c(\beta)}\,\sigma_i^{2/(1+\beta)}\right]^{-\frac{1}{2}n_i(1+\beta)}, \qquad (4.2.12) $$

where z_i = n_i s_i(β, θ_i). Thus, letting

$$ z_i^* = \frac{2c(\beta)\,z_i}{\sigma_i^{2/(1+\beta)}}, $$

we obtain

$$ M_{z^*}(t_1,t_2) = \prod_{i=1}^{2}(1-2t_i)^{-\frac{1}{2}n_i(1+\beta)}, \qquad (4.2.13) $$

which is the product of the moment generating functions of two independently distributed χ² variables, with n₁(1+β) and n₂(1+β) degrees of freedom, respectively. Thus, on the hypothesis that σ₂²/σ₁² = 1, the criterion s₂(β, θ₂)/s₁(β, θ₁)
is distributed as F with n₂(1+β) and n₁(1+β) degrees of freedom and, in fact, provides a uniformly most powerful similar test for this hypothesis against the alternative that σ₂²/σ₁² > 1. The significance level associated with the observed s₂(β, θ₂)/s₁(β, θ₁) is

$$ \Pr\left\{F_{n_2(1+\beta),\,n_1(1+\beta)} > \frac{s_2(\beta,\theta_2)}{s_1(\beta,\theta_1)}\right\} \qquad (4.2.14) $$

and is numerically equal to the probability for V < 1 given in (4.2.11). The level of significance associated with the observed ratio s₂(β, θ₂)/s₁(β, θ₁) which can be derived from sampling theory is, therefore, precisely the probability a posteriori that the variance ratio σ₂²/σ₁² is less than unity when log σ₁ and log σ₂ are supposed locally uniform a priori.

Table 4.2.1 Results from analyses of identical samples [y = (per cent of carbon − 4.50) × 100]
Analyst A₁: n₁ = 13 determinations, mean 5.77.
Analyst A₂: n₂ = 20 determinations, mean 6.55.
4.2.3 Inference Robustness on Bayesian Theory and Sampling Theory

The above problem of comparing two variances with the means (θ₁, θ₂) known is an unusual one, because we have the extraordinary situation where sufficient statistics [n₁s₁(β, θ₁), n₂s₂(β, θ₂)] for σ₁² and σ₂² exist for each value of the nuisance parameter β. Inference robustness under changes of β can in this special case, therefore, be studied using sampling theory, and a direct parallel can be drawn with the Bayesian analysis. We now illustrate this phenomenon with an example.
Fig. 4.2.1 Inference robustness: the analyst data. (a) The sampling theory significance level for various choices of β; this is numerically equal to the posterior probability that V is less than unity. (b) Posterior distribution of β for given θ₁ and θ₂ when the prior distribution of β is taken as uniform (a = 1).
The data in Table 4.2.1 were taken from Davies (1949). These data were collected to compare the accuracy of an experienced analyst A₁ and an inexperienced analyst A₂ in their assay of carbon in a mixed powder. We shall let σ₁² and σ₂² be the variances of A₁ and A₂, respectively. The population means θ₁ and θ₂ are, of course, unknown, but for illustration we temporarily assume them known and equal to the sample means 5.77 and 6.55, respectively. In the original Normal sampling theory analysis, a test of the null hypothesis σ₂²/σ₁² = 1 against the alternative σ₂²/σ₁² > 1 yielded a significance level of 7.6%. In Fig. 4.2.1(a), for each value of β, a significance level α(β) based on the uniformly most powerful similar criterion has been calculated. The figure shows how, for the wider class of parent distributions, the significance level changes from about 2.1% for β close to −1 (rectangular parent) to 9.9% when β is close to 1 (double exponential parent). In the sampling theory framework the Normal theory variance ratio lacks criterion robustness to kurtosis, but the inference robustness now considered is of a different character, in which the criterion itself is changed appropriately as the parent distribution is changed. As mentioned earlier, the significance level α(β) has a Bayesian interpretation: it is the posterior probability, given β, that the variance ratio V is less than unity.
4.2.4 Posterior Distribution of σ₂²/σ₁² when β is Regarded as a Random Variable

Using the Bayesian approach, we can write the joint distribution of V = σ₂²/σ₁² and β as the product

$$ p(V,\beta \mid \theta, y) = p(V \mid \beta,\theta,y)\, p(\beta \mid \theta,y). \qquad (4.2.15) $$
The posterior distribution of V is then

$$ p(V \mid \theta, y) = \int_{-1}^{1} p(V \mid \beta,\theta,y)\, p(\beta \mid \theta,y)\, d\beta, \qquad V > 0. \qquad (4.2.16) $$

As before, the marginal distribution of β, which acts as the weight function, can be written

$$ p(\beta \mid \theta, y) \propto p(\beta)\, p_u(\beta \mid \theta, y), \qquad -1 < \beta \le 1. \qquad (4.2.17) $$
From (4.2.4a), we obtain

$$ p_u(\beta \mid \theta, y) \propto \left\{\Gamma\left[1+\tfrac{1}{2}(1+\beta)\right]\right\}^{-(n_1+n_2)}\prod_{i=1}^{2}\Gamma\left[1+\tfrac{1}{2}n_i(1+\beta)\right]\left[n_i s_i(\beta,\theta_i)\right]^{-\frac{1}{2}n_i(1+\beta)}, \qquad (4.2.18) $$

which is the posterior distribution of β for a uniform reference prior. It should be remembered that this reference prior is not necessarily one which represents one's genuine prior opinion; it does, however, provide a useful "waystage," allowing us to compute p_u(β | θ, y) which, when multiplied by any prior p(β) of genuine interest, yields the appropriate posterior distribution.
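A sketch of ours of how (4.2.18) may be evaluated on a grid of β values; it reuses the power-sum helper n_s() and the placeholder data from the earlier sketch, works in logs for numerical stability, and normalizes numerically. It is written for k samples at once, since the same form recurs later in (4.3.20).

```python
from scipy.special import gammaln

def log_pu_beta(beta, samples, thetas):
    # log p_u(beta | theta, y), up to an additive constant, eq. (4.2.18)
    n = sum(len(y) for y in samples)
    lp = -n * gammaln(1.0 + 0.5 * (1.0 + beta))
    for y, th in zip(samples, thetas):
        ni = len(y)
        lp += (gammaln(1.0 + 0.5 * ni * (1.0 + beta))
               - 0.5 * ni * (1.0 + beta) * np.log(n_s(y, th, beta)))
    return lp

# example usage on a grid (y1, y2, theta1, theta2 as in the earlier sketch)
betas = np.linspace(-0.95, 1.0, 40)
lp = np.array([log_pu_beta(b, [y1, y2], [theta1, theta2]) for b in betas])
pu = np.exp(lp - lp.max())
pu /= np.trapz(pu, betas)   # normalize so the grid density integrates to one
```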
The function p_u(β | θ, y) can also be thought of as the likelihood integrated over the weight function (σ₁σ₂)⁻¹, which corresponds to the assumption of a uniform prior for log σ₁ and log σ₂. As was seen earlier, this choice is appropriate to the common situation where there is little prior information about these scale parameters relative to that which will be obtained from the data. It is important to notice, however, that p_u(β | θ, y) is dependent on the choice of priors for σ₁ and σ₂. For the analyst example, the function p_u(β | θ, y) is shown in Fig. 4.2.1(b).
4.2.5 Inferences About V with β Eliminated

Now although sampling theory provides, in this exceptional instance, a basis for determining inference robustness, in the sense that we can interpret Fig. 4.2.1(a) as showing the change in significance level that occurs for different values of β, it does not provide us with any satisfactory basis for using sample information about β. On the other hand, on Bayesian theory the overall probability

$$ \Pr\{V < 1 \mid \theta, y\} = \int_{-1}^{1}\Pr\{V < 1 \mid \beta,\theta,y\}\, p(\beta \mid \theta,y)\, d\beta \qquad (4.2.19) $$

can be obtained by integrating the ordinates Pr{V < 1 | β, θ, y} of Fig. 4.2.1(a) with the weight function p(β | θ, y) ∝ p_u(β | θ, y) p(β). Now p_u(β | θ, y) is given for the example in Fig. 4.2.1(b), so that to obtain Pr{V < 1 | θ, y} we have first to multiply the distribution in Fig. 4.2.1(b) by an appropriate prior p(β), and then use the normalized product as a weight function in the integration of the ordinates of Fig. 4.2.1(a). Certainly, the experimental situation here is one where we might expect a central limit effect, for it is likely that a large number of different components of error, associated with the various manipulations of the analysts, will make contributions of comparable importance to the overall error. Thus, as before in (3.2.29), we introduce the flexible prior distribution for β having a mode at β = 0,

$$ p(\beta) = \omega\,(1-\beta^2)^{a-1}, \qquad -1 < \beta \le 1, \quad a \ge 1, \qquad (4.2.20) $$

where ω is the normalizing constant. The dotted curves of Fig. 4.2.2 show p(β) for various choices of "a." Multiplication of p(β) by p_u(β | θ, y) then produces p(β | θ, y), shown by the solid curves in Fig. 4.2.2. With p(β) uniform (a = 1) we find Pr{V < 1 | θ, y} = 7.91%, which happens to agree fairly closely with the value 7.59% obtained using Normal theory. This is not surprising because the distribution p_u(β | θ, y) is, for these data, roughly centered about the value zero. The averaging of (4.2.19) can, therefore, be expected to yield a result close to that for Normal theory. Any injection of a sharper prior, obtained by setting a > 1 and representing a prior expectation of central limit tendency, could be expected to bring the probability even closer to that for Normal theory.
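Numerically, the averaging in (4.2.19) is a one-dimensional quadrature. A sketch of ours, continuing the grid computation above with the prior (4.2.20):

```python
def prior_beta(beta, a=1.0):
    # p(beta) proportional to (1 - beta^2)^(a-1); the constant cancels on normalization
    return (1.0 - beta ** 2) ** (a - 1.0)

weight = pu * prior_beta(betas, a=1.0)
weight /= np.trapz(weight, betas)          # normalized weight p(beta | theta, y)
ordinates = np.array([prob_V_less_than_one(y1, y2, theta1, theta2, b) for b in betas])
overall = np.trapz(ordinates * weight, betas)   # Pr{V < 1 | theta, y}, eq. (4.2.19)
```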
To facilitate comparison with sampling theory, we have so far calculated only the posterior probability that V < 1, which parallels directly the significance level. In most problems the whole posterior density function would be of interest. This fuller analysis will be illustrated below, when the unrealistic assumption that θ₁ and θ₂ are known a priori is relaxed.
Fig. 4.2.2 Prior and posterior distributions of β for various choices of the parameter "a": the analyst data.
4.2.6 Posterior Distribution of V when θ₁ and θ₂ are Regarded as Random Variables

To concentrate attention on important issues, we supposed initially that the means θ = (θ₁, θ₂) were known. Within the Bayesian formulation no difficulties of principle are associated with the elimination of such nuisance parameters.
With the prior assumptions of (4.2.2), (4.2.3) and (4.2.20), the joint posterior distribution of (θ, σ₁, σ₂, β) is

$$ p(\theta,\sigma_1,\sigma_2,\beta \mid y) \propto \sigma_1^{-1}\sigma_2^{-1}(1-\beta^2)^{a-1}\, l(\sigma_1,\sigma_2,\theta_1,\theta_2,\beta \mid y), \qquad (4.2.21) $$

$$ -\infty < \theta_i < \infty, \quad \sigma_i > 0, \quad i = 1,2, \quad -1 < \beta \le 1, $$

where l(σ₁, σ₂, θ₁, θ₂, β | y) is given in (4.2.1). Upon making the transformation from (θ, σ₁, σ₂, β) to (θ, V, σ₁, β) and integrating out (θ, σ₁), the posterior distribution of β and V can be written as

$$ p(V,\beta \mid y) = p(V \mid \beta, y)\, p(\beta \mid y). \qquad (4.2.22) $$

The conditional posterior distribution of V for a fixed value of β is

$$ p(V \mid \beta, y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(V \mid \beta,\theta,y)\, p(\theta \mid \beta,y)\, d\theta_1\, d\theta_2, \qquad V > 0, \qquad (4.2.23) $$
where p(V | β, θ, y) is the distribution given in (4.2.6) and p(θ | β, y), the joint distribution of θ₁ and θ₂, is

$$ p(\theta \mid \beta, y) = \prod_{i=1}^{2} p(\theta_i \mid \beta, y_i) = \prod_{i=1}^{2}\frac{[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}}{\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\, d\theta_i}, \qquad (4.2.24) $$

$$ -\infty < \theta_1 < \infty, \qquad -\infty < \theta_2 < \infty, $$

with s_i(β, θ_i) (i = 1, 2) given in (4.2.1). That is,

$$ p(V \mid \beta, y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(V \mid \beta,\theta,y)\prod_{i=1}^{2} p(\theta_i \mid \beta,y_i)\, d\theta_1\, d\theta_2, \qquad V > 0. \qquad (4.2.25) $$
When the parents are Normal (β = 0), the quantity

$$ V\,\frac{\sum_j (y_{1j}-\bar y_1)^2/(n_1-1)}{\sum_j (y_{2j}-\bar y_2)^2/(n_2-1)} \qquad (4.2.26) $$

is distributed as F with (n₁ − 1) and (n₂ − 1) degrees of freedom. Also, when the parents tend to the rectangular form (β → −1), the quantity u* = V(h₁/h₂)²,
where, as before, h₁ and h₂ are the half-ranges of the two samples y₁ and y₂, has the limiting distribution (see Appendix A4.1)

$$ \lim_{\beta\to-1} p(u^* \mid \beta, y) = \begin{cases} k\, u^{*\,\frac{1}{2}(n_1-1)-1}\left[(n_1+n_2) - (n_1+n_2-2)\,u^{*\,1/2}\right], & 0 < u^* < 1, \\[1ex] k\, u^{*\,-\frac{1}{2}(n_2-1)-1}\left[(n_1+n_2) - (n_1+n_2-2)\,u^{*\,-1/2}\right], & 1 < u^* < \infty, \end{cases} \qquad (4.2.27) $$

with

$$ k = \frac{n_1 n_2}{2(n_1+n_2)}\cdot\frac{(n_1-1)(n_2-1)}{(n_1+n_2-1)(n_1+n_2-2)}. $$
For other choices of parent, it does not appear possible to express the posterior distribution of V in terms of simple functions. Methods are now derived which yield a close approximation to this distribution.

4.2.7 Computational Procedures for the Posterior Distribution p(V | β, y)

Numerical evaluation of (4.2.25) involves, among other things, computing a double integral for each value of V. This is laborious even on a fast electronic computer. However, a general expression for the rth moment of V (for r < n₂/2) is readily obtained as

$$ E(V^r \mid \beta, y) = k(\beta)\,\frac{\int_{-\infty}^{\infty}[n_1 s_1(\beta,\theta_1)]^{-\frac{1}{2}(n_1+2r)(1+\beta)}\, d\theta_1\int_{-\infty}^{\infty}[n_2 s_2(\beta,\theta_2)]^{-\frac{1}{2}(n_2-2r)(1+\beta)}\, d\theta_2}{\prod_{i=1}^{2}\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\, d\theta_i}, \qquad (4.2.28) $$

with

$$ k(\beta) = \frac{\Gamma[\tfrac{1}{2}(n_1+2r)(1+\beta)]\,\Gamma[\tfrac{1}{2}(n_2-2r)(1+\beta)]}{\prod_{i=1}^{2}\Gamma[\tfrac{1}{2}n_i(1+\beta)]}. $$
Computation of the moments thus only involves the comparatively simple evaluation of one-dimensional integrals. It would now be possible to proceed by evaluating these moments and fitting appropriate forms of distributions suggested by the exact results obtainable when β = 0 and β → −1. However, a simpler and more intuitively satisfying procedure is as follows. Employing the approximating form (3.3.6), we may write

$$ n_i s_i(\beta,\theta_i) \doteq n_i s_i(\beta,\hat\theta_i) + d_i(\theta_i-\hat\theta_i)^2, \qquad i = 1, 2, \qquad (4.2.29) $$

where θ̂ᵢ is the value which minimizes nᵢsᵢ(β, θᵢ) and is, therefore, also the mode of the posterior distribution of θᵢ in (4.2.24). Substituting (4.2.29) into (4.2.28), the integrals in the numerator and the denominator can then be explicitly evaluated.
Specifically, for the integrals in the numerator, we obtain

$$ \int_{-\infty}^{\infty}[n_1 s_1(\beta,\theta_1)]^{-\frac{1}{2}(n_1+2r)(1+\beta)}\, d\theta_1\int_{-\infty}^{\infty}[n_2 s_2(\beta,\theta_2)]^{-\frac{1}{2}(n_2-2r)(1+\beta)}\, d\theta_2 $$
$$ = [n_1 s_1(\beta,\hat\theta_1)]^{-\frac{1}{2}[(n_1+2r)(1+\beta)-1]}\, d_1^{-1/2}\, B\{\tfrac{1}{2},\ \tfrac{1}{2}[(n_1+2r)(1+\beta)-1]\} $$
$$ \times\ [n_2 s_2(\beta,\hat\theta_2)]^{-\frac{1}{2}[(n_2-2r)(1+\beta)-1]}\, d_2^{-1/2}\, B\{\tfrac{1}{2},\ \tfrac{1}{2}[(n_2-2r)(1+\beta)-1]\}, \qquad (4.2.30) $$

and for those in the denominator, we obtain

$$ \prod_{i=1}^{2}\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\, d\theta_i = \prod_{i=1}^{2}[n_i s_i(\beta,\hat\theta_i)]^{-\frac{1}{2}[n_i(1+\beta)-1]}\, d_i^{-1/2}\, B\{\tfrac{1}{2},\ \tfrac{1}{2}[n_i(1+\beta)-1]\}, \qquad (4.2.31) $$

where B(p, q) = Γ(p)Γ(q)/Γ(p+q). By taking the ratio of (4.2.30) and (4.2.31), we see that the quantities d₁ and d₂ cancel. The rth moment, r < ½n₂ − ½(1+β)⁻¹, of V is, approximately,
$$ E(V^r \mid \beta, y) = \frac{\Gamma\{\tfrac{1}{2}[(n_1+2r)(1+\beta)-1]\}\,\Gamma\{\tfrac{1}{2}[(n_2-2r)(1+\beta)-1]\}}{\prod_{i=1}^{2}\Gamma\{\tfrac{1}{2}[n_i(1+\beta)-1]\}}\left[\frac{n_2 s_2(\beta,\hat\theta_2)}{n_1 s_1(\beta,\hat\theta_1)}\right]^{r(1+\beta)}. \qquad (4.2.32) $$

That is, the moments of

$$ V^{1/(1+\beta)}\,\frac{n_1 s_1(\beta,\hat\theta_1)/[n_1(1+\beta)-1]}{n_2 s_2(\beta,\hat\theta_2)/[n_2(1+\beta)-1]} \qquad (4.2.33) $$

are approximately the same as those of an F variable with n₁(1+β) − 1 and n₂(1+β) − 1 degrees of freedom.
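The exact moments require only one-dimensional quadratures. The following sketch of ours mirrors the reconstruction of (4.2.28) above (the finite integration span is an assumption standing in for the infinite range, and n_s() is the earlier helper):

```python
from scipy.integrate import quad
from scipy.special import gammaln

def theta_integral(y, power, beta, lo, hi):
    # integral over theta of [n s(beta, theta)]^(-power)
    return quad(lambda th: n_s(y, th, beta) ** (-power), lo, hi, limit=200)[0]

def moment_V(r, y1, y2, beta, span=100.0):
    # exact rth moment of V as in (4.2.28), by numerical quadrature
    n1, n2 = len(y1), len(y2)
    a1, a2 = 0.5 * (n1 + 2 * r) * (1 + beta), 0.5 * (n2 - 2 * r) * (1 + beta)
    k = np.exp(gammaln(a1) + gammaln(a2)
               - gammaln(0.5 * n1 * (1 + beta)) - gammaln(0.5 * n2 * (1 + beta)))
    num = (theta_integral(y1, a1, beta, -span, span)
           * theta_integral(y2, a2, beta, -span, span))
    den = (theta_integral(y1, 0.5 * n1 * (1 + beta), beta, -span, span)
           * theta_integral(y2, 0.5 * n2 * (1 + beta), beta, -span, span))
    return k * num / den
```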
Table 4.2.2 Comparison of exact and approximate moments of the variance ratio V for the analyst data with various choices of β

  β      μ₁′ exact  μ₁′ approx.   μ₂′ exact  μ₂′ approx.   μ₃′ exact  μ₃′ approx.   μ₄′ exact  μ₄′ approx.
 -0.8      2.16       2.11          5.12       4.99         13.31      13.14         38.04      39.00
 -0.6      2.03       2.02          4.71       4.69         12.50      12.49         38.16      38.44
 -0.4      2.07       2.06          5.13       5.11         15.23      15.18         54.53      54.51
 -0.2      2.20       2.20          6.10       6.08         21.24      21.17         93.54      93.36
  0.2      2.55       2.55          9.00       9.01         44.02      44.07        300.60     301.04
  0.4      2.72       2.72         10.79      10.79         62.14      62.14        525.11     524.67
  0.6      2.89       2.89         12.77      12.74         86.19      85.69        896.08     888.07
  0.8      3.05       3.04         14.97      14.84        117.63     115.96       1497.80    1465.83
Table 4.2.2 shows, for the analyst example, the exact and approximate first four moments of V for various values of β. The exact moments were obtained by direct evaluation of (4.2.28), using Simpson's rule, and the approximate moments were computed using (4.2.32). The moments approximation discussed above suggests that the quantity in (4.2.33) is approximately distributed as an F variable with n₁(1+β) − 1 and n₂(1+β) − 1 degrees of freedom. This means that in (4.2.23) the process of averaging p(V | β, θ, y) over the weight function p(θ | β, y) may be approximated by simply replacing the unknown (θ₁, θ₂) in p(V | β, θ, y) by their modal values and reducing the degrees of freedom by one unit. That is,

$$ p(V \mid \beta, y) \doteq w(\beta,\hat\theta)\, V^{\frac{1}{2}n_1-\frac{1}{2(1+\beta)}-1}\left[1 + \frac{n_1 s_1(\beta,\hat\theta_1)}{n_2 s_2(\beta,\hat\theta_2)}\, V^{1/(1+\beta)}\right]^{-\frac{1}{2}[(n_1+n_2)(1+\beta)-2]}, \qquad V > 0, \qquad (4.2.34) $$

where

$$ w(\beta,\hat\theta) = \left(\frac{1}{1+\beta}\right)\frac{\Gamma\{\tfrac{1}{2}[(n_1+n_2)(1+\beta)-2]\}}{\prod_{i=1}^{2}\Gamma\{\tfrac{1}{2}[n_i(1+\beta)-1]\}}\left[\frac{n_1 s_1(\beta,\hat\theta_1)}{n_2 s_2(\beta,\hat\theta_2)}\right]^{\frac{1}{2}[n_1(1+\beta)-1]}. $$
This approximating distribution can be justified directly by using the expansion (4.2.29) to represent the quantities n₁s₁(β, θ₁) and n₂s₂(β, θ₂) in (4.2.25) and integrating directly. We have followed through the moment approach here because the exact moments can be expressed in rather simple form and, as we pointed out, could be used, if desired, to fit appropriate distributions. It should be noted that in obtaining the approximating moments (4.2.32) and the distribution (4.2.34) through the use of the expansion (4.2.29), it is only necessary to determine the modes θ̂₁ and θ̂₂. The more difficult problem of finding suitable values for d₁ and d₂ (see Section 3.3) does not arise here because of the cancellation in the integration process. Although calculating the modal values (θ̂₁, θ̂₂) still requires the use of numerical methods such as those discussed in Section 3.3, this is a procedure of great simplicity compared with the exact evaluation of multiple integrals for each V in (4.2.25). Table 4.2.3 shows exact and approximate densities of V for several values of β using the analyst data. The exact densities are obtained by direct evaluation of (4.2.25) using Simpson's rule, and the approximate densities are calculated using (4.2.34). Some discrepancies occur when β = −0.8 (that is, as β approaches −1), but for the remainder of the β range the agreement is very close.
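A sketch of ours of the approximate procedure just described: locate each modal value θ̂ᵢ by a one-dimensional minimization and then evaluate (4.2.34) through the equivalent F form (4.2.33); helper names are ours.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import f as f_dist

def theta_mode(y, beta):
    # modal value: the theta minimizing the power sum n s(beta, theta)
    res = minimize_scalar(lambda th: n_s(y, th, beta),
                          bounds=(min(y), max(y)), method="bounded")
    return res.x

def approx_density_V(V, y1, y2, beta):
    # approximate p(V | beta, y) of (4.2.34), via the F form (4.2.33)
    n1, n2 = len(y1), len(y2)
    m1, m2 = n1 * (1 + beta) - 1.0, n2 * (1 + beta) - 1.0
    th1, th2 = theta_mode(y1, beta), theta_mode(y2, beta)
    scale = (n_s(y1, th1, beta) / m1) / (n_s(y2, th2, beta) / m2)
    F = scale * V ** (1.0 / (1.0 + beta))
    dF_dV = scale * V ** (1.0 / (1.0 + beta) - 1.0) / (1.0 + beta)  # Jacobian dF/dV
    return f_dist.pdf(F, m1, m2) * dF_dV
```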
4.2.8 Posterior Distribution of V for Fixed β with θ₁ and θ₂ Eliminated

For the analyst data, the graphs of p(V | β, y) are shown in Fig. 4.2.3 for various choices of β, where, this time, the simplifying assumption that the location parameters θ₁ and θ₂ are known and equal to the sample means was not made. In computing the posterior distributions of V for β = −0.6 and β = 0.99, θ₁ and θ₂ were eliminated by actual integration and also by the approximate procedure mentioned above. In each of these two cases, the curves obtained by the two methods are practically indistinguishable. For the cases β = 0 and β → −1, the exact distributions were calculated directly using (4.2.26) and (4.2.27), respectively.
Table 4.2.3 Specimen probability densities of the variance ratio V for various values of β: the analyst data

  V      β = −0.8          β = −0.6          β = −0.2
        exact   approx.   exact   approx.   exact   approx.
 0.5    0.005   0.020     0.017   0.021     0.048   0.049
 1.0    0.099   0.159     0.229   0.245     0.299   0.301
 1.2    0.209   0.264     0.378   0.388     0.389   0.390
 1.4    0.353   0.388     0.507   0.510     0.444   0.444
 1.6    0.503   0.510     0.583   0.580     0.461   0.461
 1.8    0.619   0.597     0.594   0.588     0.449   0.448
 2.2    0.628   0.579     0.474   0.468     0.372   0.371
 2.6    0.412   0.380     0.301   0.297     0.276   0.275
 3.0    0.206   0.194     0.168   0.167     0.192   0.192
 4.0    0.026   0.028     0.032   0.032     0.070   0.069

  V      β = 0.2           β = 0.6           β = 0.8
        exact   approx.   exact   approx.   exact   approx.
 0.5    0.070   0.069     0.090   0.089     0.100   0.100
 1.0    0.274   0.274     0.258   0.257     0.253   0.253
 1.2    0.329   0.329     0.294   0.293     0.282   0.283
 1.4    0.359   0.359     0.311   0.311     0.294   0.295
 1.6    0.367   0.368     0.313   0.313     0.295   0.296
 1.8    0.359   0.360     0.305   0.306     0.287   0.287
 2.2    0.314   0.315     0.272   0.273     0.256   0.257
 2.6    0.255   0.256     0.231   0.231     0.219   0.219
 3.0    0.200   0.200     0.190   0.190     0.183   0.183
 4.0    0.100   0.100     0.110   0.110     0.111   0.111
We see from these graphs how the posterior distribution of the variance ratio V changes as the value of β is changed. When the parents are rectangular (β → −1), the distribution is sharply concentrated around its modal value. As β is made larger, so that the parent distribution passes through the Normal to become leptokurtic, the spread of the distribution of V increases, the modal value becomes smaller, and the distribution becomes more and more positively skewed. It is evident in this example that inferences concerning the variance ratio depend heavily upon the form of the parent distributions.
Fig. 4.2.3 Posterior distribution of V for various choices of β: the analyst data.
4.2.9 Marginal Distribution of β

The marginal posterior distribution of β, p(β | y), with (θ₁, θ₂, σ₁, σ₂) eliminated, may, as before, be written as the product

$$ p(\beta \mid y) \propto p_u(\beta \mid y)\, p(\beta), \qquad -1 < \beta \le 1, \qquad (4.2.35) $$

where

$$ p_u(\beta \mid y) \propto \left\{\Gamma\left[1+\tfrac{1}{2}(1+\beta)\right]\right\}^{-(n_1+n_2)}\prod_{i=1}^{2}\Gamma\left[1+\tfrac{1}{2}n_i(1+\beta)\right]\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\, d\theta_i $$

is the posterior distribution of β relative to a uniform prior for −1 < β ≤ 1. For the analyst example, the distribution p_u(β | y) shown in Fig. 4.2.4 is very
little different from p_u(β | θ, y) shown in Fig. 4.2.1(b). As before, the mode of the distribution is close to β = 0.

Fig. 4.2.4 Posterior distribution of β for a uniform reference prior: the analyst data.
4.2.10 Marginal Distribution of V

The main object of the analyst investigation was to make inferences about the variance ratio V. This can be done by eliminating β from the posterior distribution of V, integrating the conditional distribution p(V | β, y) with the weight function p(β | y):

$$ p(V \mid y) = \int_{-1}^{+1} p(V \mid \beta, y)\, p(\beta \mid y)\, d\beta, \qquad V > 0. \qquad (4.2.36) $$
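In practice, a sketch of ours: the conditional densities — here taken from the approximation (4.2.34) above — are averaged against a normalized weight on a β grid, where the weight is the θ-eliminated p(β | y) of (4.2.35), assumed precomputed in the same way as the earlier grid weight.

```python
def marginal_density_V(V, y1, y2, betas, weight):
    # p(V | y) of (4.2.36): average p(V | beta, y) over a normalized weight p(beta | y)
    vals = np.array([approx_density_V(V, y1, y2, b) for b in betas])
    return np.trapz(vals * weight, betas)
```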
Figure 4.2.5 shows the distribution of V for a uniform reference prior in β, that is, with the weight function p_u(β | y) in (4.2.35). Also shown, by the dotted curve, is the posterior distribution on the assumption of exact Normality. Any prior on β which is more peaked at β = 0 can be expected to produce even greater agreement with the Normal theory posterior. For this particular set of data, therefore, the Normal theory inference evidently approximates very closely that which would be appropriate for a wide choice of "a" in (4.2.20). This agreement is to be expected because, in this particular instance, the sample evidence expressed through p_u(β | y) (see Fig. 4.2.4) indicates the Normal to be about the most probable parent distribution. However, such agreement might not be found if the sample evidence favored some other form of parent.
Fig. 4.2.5 Posterior distribution of V for a uniform reference prior on β (solid curve) and under the assumption of Normality (dashed curve): the analyst data.
4.3 COMPARISON OF THE VARIANCES OF k DISTRIBUTIONS

We now consider the problem of comparing the variances of k exponential power distributions when k can be greater than two. As mentioned earlier, we shall assume that the k populations have the same non-Normality parameter β, but possibly different values of θ and σ. When independent samples of size n₁, …, nₖ are drawn from the k populations, the likelihood function is

$$ l(\sigma,\theta,\beta \mid y) \propto [\omega(\beta)]^{n}\prod_{i=1}^{k}\sigma_i^{-n_i}\exp\left[-\frac{c(\beta)\, n_i s_i(\beta,\theta_i)}{\sigma_i^{2/(1+\beta)}}\right], \qquad (4.3.1) $$

where

$$ n_i s_i(\beta,\theta_i) = \sum_{j=1}^{n_i}|y_{ij}-\theta_i|^{2/(1+\beta)}, \qquad y_i' = (y_{i1},\ldots,y_{in_i}), \qquad n = \sum_{i=1}^{k} n_i, $$

and c(β) and ω(β) are given in (4.2.1).
4.3.1 Comparison of k Variances for Fixed Values of (θ, β)

At first, we suppose that β and the location parameter θᵢ in each population are known. With

$$ p(\log\sigma_1,\ldots,\log\sigma_k) \propto \text{constant}, \qquad \text{or} \qquad p(\sigma_i) \propto \frac{1}{\sigma_i}, \qquad i = 1,\ldots,k, \qquad (4.3.2) $$

and for given (θ, β), the posterior distribution of σ = (σ₁, …, σₖ) is

$$ p(\sigma \mid \theta,\beta,y) = \prod_{i=1}^{k} p(\sigma_i \mid \theta_i,\beta,y_i), \qquad \sigma_i > 0, \quad i = 1,\ldots,k, \qquad (4.3.3) $$

where

$$ p(\sigma_i \mid \theta_i,\beta,y_i) \propto \sigma_i^{-(n_i+1)}\exp\left[-\frac{c(\beta)\, n_i s_i(\beta,\theta_i)}{\sigma_i^{2/(1+\beta)}}\right]. $$
The joint distribution of (σ₁, …, σₖ) is thus a product of k independent inverted gamma distributions. In a certain academic sense, obtaining this joint distribution solves the inference problem. However, there is still the practical difficulty of appreciating what this k-dimensional distribution has to tell us. One question of interest is whether a particular parameter point is, or is not, included in an H.P.D. region of given probability mass. For instance, we may be interested in considering the plausibility of equality of variances. Following our earlier argument in Section 2.12, we consider the joint distribution of the (k − 1) contrasts φ′ = (φ₁, …, φ_{k−1}) in the logarithms of the standard deviations σ,

$$ \phi_i = \log\sigma_i^2 - \log\sigma_k^2, \qquad i = 1,\ldots,k-1. \qquad (4.3.4) $$

It is straightforward to verify that the posterior distribution of φ is

$$ p(\phi \mid \theta,\beta,y) = c\prod_{i=1}^{k-1} V_i^{\frac{1}{2}n_i(1+\beta)}\left(1+\sum_{i=1}^{k-1} V_i\right)^{-\frac{1}{2}n(1+\beta)}, \qquad -\infty < \phi_i < \infty, \quad i = 1,\ldots,k-1, \qquad (4.3.5) $$

where

$$ V_i = \frac{n_i s_i(\beta,\theta_i)}{n_k s_k(\beta,\theta_k)}\,\exp\left[-\frac{\phi_i}{1+\beta}\right], \qquad c = \Gamma\left[\frac{n}{2}(1+\beta)\right](1+\beta)^{-(k-1)}\left\{\prod_{i=1}^{k}\Gamma\left[\frac{n_i}{2}(1+\beta)\right]\right\}^{-1}. $$

Except for the factor (1+β)^{−(k−1)}, this distribution is of precisely the same form as the distribution in (2.12.6), with Vᵢ, nᵢ(1+β), and sᵢ(β, θᵢ) replacing Tᵢ, νᵢ, and sᵢ², respectively. From the developments in Section 2.12 it follows, in particular, that the distribution (4.3.5) has a unique mode at

$$ \hat\phi_i = (1+\beta)\left[\log s_i(\beta,\theta_i) - \log s_k(\beta,\theta_k)\right], \qquad i = 1,\ldots,k-1. \qquad (4.3.6) $$
Further, in deciding if a particular point φ₀ is included in the H.P.D. region of content (1 − α), we may make use of the fact that p(φ | θ, β, y) is a monotonic decreasing function of M, where

$$ M = -2\log W, \qquad (4.3.7) $$

with

$$ W = \frac{n^{\frac{1}{2}n(1+\beta)}}{\prod_{i=1}^{k} n_i^{\frac{1}{2}n_i(1+\beta)}}\prod_{i=1}^{k-1} V_i^{\frac{1}{2}n_i(1+\beta)}\left(1+\sum_{i=1}^{k-1} V_i\right)^{-\frac{1}{2}n(1+\beta)}. \qquad (4.3.8) $$
The cumulant generating function of M is, for β ≠ −1,

$$ \kappa_M(t) = a - \frac{k-1}{2}\log(1-2t) + \sum_{r=1}^{\infty}\alpha_r(1-2t)^{-(2r-1)}, \qquad (4.3.9) $$

with

$$ \alpha_r = \frac{B_{2r}}{2r(2r-1)}\left(\frac{2}{1+\beta}\right)^{2r-1}\left[\sum_{i=1}^{k} n_i^{-(2r-1)} - n^{-(2r-1)}\right], $$

where the B₂ᵣ are Bernoulli numbers and "a" is a constant. Bartlett's method of approximation discussed in Section 2.12 can be extended here, so that

$$ \Pr\{p(\phi \mid \theta,\beta,y) > p(\phi_0 \mid \theta,\beta,y) \mid y\} = \Pr\{M < -2\log W_0\} \doteq \Pr\left\{\chi^2_{k-1} < \frac{-2\log W_0}{1+A}\right\}, \qquad (4.3.10) $$

where

$$ A = \frac{1}{3(k-1)(1+\beta)}\left(\sum_{i=1}^{k} n_i^{-1} - n^{-1}\right), $$

and W₀ is obtained by inserting the particular parameter point of interest φ₀ in (4.3.8). In particular, if φ₀ = 0, so that σ₁ = σ₂ = ⋯ = σₖ, then −2 log W₀ reduces to

$$ -2\log W_0 = -\sum_{i=1}^{k} n_i(1+\beta)\left[\log s_i(\beta,\theta_i) - \log\bar s(\beta,\theta)\right], \qquad (4.3.11) $$

with

$$ \bar s(\beta,\theta) = \frac{1}{n}\sum_{i=1}^{k} n_i s_i(\beta,\theta_i). $$
_ X(fJ) -
k [S;([3 , Bj ) ] -
1]
/- 1
-([3 9) S ,
):II,(l
+P) .
(4.3.12)
Thus - 210g X(m is given by (4.3.11) and on the null hypothesis, the cumulant generating function of the sampling distribution of - 210g ),([3) is precisely that given by the right-hand side of (4 .3.9). It follows that the complement of the probability (4.3 . 10) is numerically equivalent to the significance level associated with the observed likelihood ratio statistic },([3).
An Example

For illustration, consider the three samples of 30 observations set out in Table 4.3.1. These observations were, in fact, generated from exponential power distributions with common mean θ = 0 and β = −0.5. Their standard deviations σᵢ were chosen to be 0.69, 0.59, and 0.76, respectively. The analysis will be made assuming the means to be known.

Table 4.3.1 Data for three samples generated from the symmetric exponential power distribution with common β = −0.5 and θ = 0
 Observation    One          Two          Three
   number     (σ = 0.69)   (σ = 0.59)   (σ = 0.76)
      1          0.23        -0.84         0.90
      2          0.24         0.61        -0.26
      3         -0.26         0.20        -0.03
      4         -1.13        -0.83         0.04
      5         -0.72         0.46         0.73
      6         -0.52        -0.38        -0.06
      7          0.65         0.13        -0.91
      8         -1.16         0.33        -1.38
      9          0.25        -0.20         1.24
     10         -0.83        -0.45        -0.09
     11          0.90         1.16        -0.40
     12         -0.28        -0.55        -1.32
     13          0.19         0.42        -0.57
     14         -0.43        -0.73        -1.34
     15         -1.12         0.37         0.97
     16         -0.93        -0.37         0.68
     17         -0.58        -0.50         0.70
     18          0.62        -0.50        -0.52
     19          0.61        -0.59         0.34
     20          0.42        -0.77        -1.00
     21         -1.34        -0.99        -1.28
     22         -0.13         0.06         0.21
     23         -0.92         0.46         1.42
     24         -1.24         0.03         0.31
     25         -1.02        -1.01        -0.31
     26         -0.40         0.68        -0.14
     27         -0.70        -0.28         0.99
     28          0.10        -0.62         1.14
     29          0.38        -0.70         0.21
     30         -0.73        -0.32        -0.76
Fig. 4.3.1 Contours of the posterior distribution of (φ₁, φ₂) for β = −0.5, 0, and 0.5: the generated example.
The first object is to study how inferences about the comparative values of σᵢ would be affected by changes in the value of β. For fixed β, the posterior distribution of the two contrasts φ₁ = log σ₁² − log σ₃² and φ₂ = log σ₂² − log σ₃² is that given in (4.3.5) with k = 3. Figure 4.3.1 shows, for β = −0.5, 0.0, and 0.5, the mode and three contours of the posterior distribution of φ₁ and φ₂. These contours, which correspond approximately to the 75, 90, and 95% H.P.D. regions, were drawn such that

$$ M = -2\log W = (1+A)\,\chi^2(2,\alpha), \qquad (4.3.14) $$

for α = 0.25, 0.10, 0.05, respectively. For this example,

$$ M = -90(1+\beta)\log 3 + 90(1+\beta)\log\left(1+\exp\{-(1+\beta)^{-1}[\phi_1-\hat\phi_1(\beta)]\}+\exp\{-(1+\beta)^{-1}[\phi_2-\hat\phi_2(\beta)]\}\right) + 30\{[\phi_1-\hat\phi_1(\beta)]+[\phi_2-\hat\phi_2(\beta)]\} \qquad (4.3.15) $$
and 1 + A = 1 + (1+β)⁻¹(0.0148). Inspection of Fig. 4.3.1 shows that in all cases φ₁ and φ₂ are positively correlated. This might be expected, for although σ₁², σ₂², and σ₃² are independent a posteriori, φ₁ = log σ₁² − log σ₃² and φ₂ = log σ₂² − log σ₃² contain log σ₃² in common. Also, for this particular set of data, the mode of the posterior distribution is not much affected by the choice of β, and the roughly elliptical shape of the contours is similar. The distributions differ markedly, however, in their spread, the dispersion becoming much larger as β is increased. Inferences about the relative values of the variances are for this reason very sensitive to the choice of β. The possibility that the variances are all equal, so that φ = 0, is often of special interest to the investigator. Figure 4.3.1 shows that the parameter point φ = 0 is just excluded by the approximate 95% H.P.D. region for β = −0.5, but lies well inside the 90% region for β = 0, and the 75% region for β = 0.5. To present a more complete picture, we may calculate, as a function of β, the probability associated with the region

$$ p(\phi \mid \theta,\beta,y) < p(\phi = 0 \mid \theta,\beta,y). \qquad (4.3.16) $$

For this example, we have from (4.3.10) and (4.3.11), with k = 3,

$$ \Pr\{p(\phi \mid \theta,\beta,y) < p(\phi = 0 \mid \theta,\beta,y) \mid y\} \doteq \exp\left\{\frac{\log W_0}{1+(1+\beta)^{-1}(0.0148)}\right\}, \qquad (4.3.17) $$

with

$$ \log W_0 = 15(1+\beta)\sum_{i=1}^{3}\left[\log s_i(\beta,\theta_i) - \log\bar s(\beta,\theta)\right]. $$
Figure 4.3.2 shows this probability for various values of β ranging from −0.9 to 1.0. It is less than 1% for β = −0.9, monotonically increasing to 21% for β = 0, and to almost 58% for β = 1. As mentioned earlier, for a fixed value of β, (4.3.17) is numerically identical to the significance level associated with the likelihood ratio criterion for testing the null hypothesis that the variances are equal. Inferences about the equality of the variances are thus very sensitive to changes in β, irrespective of whether we adopt the sampling or the Bayesian theory.
4.3.2 Posterior Distribution of β and φ

As before, the non-Normality parameter β can, in the Bayesian framework, be included in the formulation as a variable parameter. Assuming that β and the standard deviations σ are a priori independent, we may write

$$ p(\beta,\sigma) = p(\beta)\, p(\sigma), \qquad (4.3.18) $$

where p(β) is the prior of β and p(σ) is the distribution in (4.3.2). Combining (4.3.18) with the likelihood function in (4.3.1) and integrating out σ, for fixed θ the posterior distribution of β is obtained as

$$ p(\beta \mid \theta, y) \propto p(\beta)\, p_u(\beta \mid \theta, y), \qquad -1 < \beta \le 1. \qquad (4.3.19) $$

In this expression

$$ p_u(\beta \mid \theta, y) \propto \left\{\Gamma\left[1+\frac{1+\beta}{2}\right]\right\}^{-n}\prod_{i=1}^{k}\Gamma\left[1+\frac{n_i(1+\beta)}{2}\right]\left[n_i s_i(\beta,\theta_i)\right]^{-\frac{1}{2}n_i(1+\beta)} \qquad (4.3.20) $$

is the posterior distribution of β corresponding to a uniform reference prior.
Fig. 4.3.2 The posterior probability Pr{p(φ | θ, β, y) < p(φ = 0 | θ, β, y) | y} as a function of β for the generated example; the probability is numerically equivalent to the sampling theory significance level.

Figure 4.3.3 for a = 1 shows the distribution p_u(β | θ, y) for the data of Table 4.3.1. Its close concentration about the mode β = −0.7 shows it to be quite informative about β, and this is to be expected because of the large number of observations (n = 90). Even after multiplication by a prior distribution p(β) which indicated a rather strong belief in a central limit tendency, inferences about β would still not be much different from those based upon p_u(β | θ, y). In each of the four sets of graphs in Fig. 4.3.3, the dotted curve is the assumed prior and the solid curve the corresponding posterior distribution. The prior distributions were taken from (4.2.20) with a = 1, 3, 6, 10, respectively. Thus with "a" as high as 10, the information from p_u(β | θ, y) still dominates the "prior" information. This figure may be contrasted with Fig. 3.2.5, for which only 15 observations (differences) were available and the information from p_u(β | y) had much less weight.
Fig. 4.3.3 Prior and posterior distributions of β for various choices of the parameter "a": the generated example.
Posterior Distribution of φ and its Approximation

In the Bayesian framework, uncertainty in the inferences about φ due to changes in β can be removed by considering the marginal posterior distribution of φ. From the distribution of φ, given β, in (4.3.5) and that of β in (4.3.19), we may write

$$ p(\phi \mid \theta, y) = \int_{-1}^{1} p(\phi \mid \theta,\beta,y)\, p(\beta \mid \theta,y)\, d\beta, \qquad -\infty < \phi_i < \infty. \qquad (4.3.21) $$

Although both the distribution p(β | θ, y) and the conditional distribution p(φ | θ, β, y) involve only simple functions of the observations, it does not seem possible to express the marginal distribution in a simple closed form. For k ≥ 3, complete evaluation of the distribution would be very burdensome even on a fast computer. However, in situations in which the distribution of β is sharp and nearly symmetrical, we can write, approximately,

$$ p(\phi \mid \theta, y) \doteq p(\phi \mid \theta, \hat\beta, y), \qquad (4.3.22) $$

where β̂ is the mode of p(β | θ, y). For instance, the posterior distribution p_u(β | θ, y) based upon a reference uniform prior in β has its mode close to β = −0.7 for the present example. Using the approximation in (4.3.22), the marginal posterior distribution p(φ | θ, y) is thus nearly p(φ | θ, β = −0.7, y). Contours of this approximate distribution are shown in Fig. 4.3.4.
Fig. 4.3.4 Contours of the approximate posterior distribution of (φ₁, φ₂): the generated example.
Accuracy of the Approximation

To check the accuracy of the modal approximation, the exact marginal distributions may be compared with the approximate marginals implied by (4.3.22). From
(4.3.5), it is straightforward to verify that, given β, the distribution of φᵢ is

$$ p(\phi_i \mid \theta,\beta,y) = \frac{\Gamma[\tfrac{1}{2}(n_i+n_k)(1+\beta)]}{(1+\beta)\,\Gamma[\tfrac{1}{2}n_i(1+\beta)]\,\Gamma[\tfrac{1}{2}n_k(1+\beta)]}\; U_i^{\frac{1}{2}n_i(1+\beta)}(1+U_i)^{-\frac{1}{2}(n_i+n_k)(1+\beta)}, \qquad -\infty < \phi_i < \infty, \qquad (4.3.23) $$

where

$$ U_i = \frac{n_i s_i(\beta,\theta_i)}{n_k s_k(\beta,\theta_k)}\exp\left[-(1+\beta)^{-1}\phi_i\right], \qquad i = 1,\ldots,k-1. $$

Thus, the marginal distribution of φᵢ is

$$ p(\phi_i \mid \theta, y) = \int_{-1}^{1} p(\phi_i \mid \theta,\beta,y)\, p(\beta \mid \theta,y)\, d\beta, \qquad (4.3.24) $$

which, by adopting the same argument leading to (4.3.22), is approximately

$$ p(\phi_i \mid \theta, y) \doteq p(\phi_i \mid \theta, \hat\beta, y). \qquad (4.3.25) $$

Figure 4.3.5 compares the exact distribution of φ₁ in (4.3.24) with the approximation (4.3.25) for the generated example; the agreement is very close.
4.3.3 The Situation When θ₁, …, θₖ Are Not Known

In the above, we have assumed that the location parameters θ₁, …, θₖ are known. The more common situation is when they are not. With the assumption that θ, σ, and β are a priori approximately independent, and that

$$ p(\theta) = \prod_{i=1}^{k} p(\theta_i), \qquad \text{where} \qquad p(\theta_i) \propto \text{constant}, \qquad (4.3.26) $$
Fig. 4.3.5 Posterior distribution of φ₁ (exact and approximate): the generated example.
the posterior distribution of θ, for a given β, is

$$ p(\theta \mid \beta, y) = \prod_{i=1}^{k} p(\theta_i \mid \beta, y_i), \qquad -\infty < \theta_i < \infty, \quad i = 1,\ldots,k, \qquad (4.3.27) $$

where

$$ p(\theta_i \mid \beta, y_i) = \frac{[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}}{\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\, d\theta_i}. $$
Consequently, for a given β, the posterior distribution of the (k − 1) contrasts φ defined in (4.3.4) becomes

$$ p(\phi \mid \beta, y) = \int\!\cdots\!\int_{-\infty}^{\infty} p(\phi \mid \theta,\beta,y)\, p(\theta \mid \beta,y)\, d\theta_1\cdots d\theta_k, \qquad -\infty < \phi_i < \infty, \quad i = 1,\ldots,k-1, \qquad (4.3.28) $$

where the first factor in the integrand is given by (4.3.5). In the special case where the populations are Normal, it is readily verified that the integral can be evaluated exactly, yielding

$$ p(\phi \mid \beta = 0, y) \propto \prod_{i=1}^{k-1} T_i^{\frac{1}{2}\nu_i}\left(1+\sum_{i=1}^{k-1} T_i\right)^{-\frac{1}{2}\nu}, \qquad (4.3.29) $$

where

$$ \nu_i = n_i - 1, \qquad \nu = n - k, \qquad T_i = \frac{\sum_j (y_{ij}-\bar y_i)^2}{\sum_j (y_{kj}-\bar y_k)^2}\, e^{-\phi_i}, $$
which is, of course, the same distribution obtained earlier in (2.12.6). In the more general situation when the parent populations are not necessarily Normal, it does not seem possible to express the integral exactly in terms of simple functions. However, when dealing with the ratio of two variances, it was demonstrated that the effect of integrating over the posterior distribution of (θ₁, θ₂) was essentially to replace the θ's by their corresponding modal values and to reduce the "degrees of freedom," nᵢ(1+β), (i = 1, 2), by one unit. We can extend this approximation to the more general case, where again it is exact if β = 0. As in (3.3.6) we write

$$ n_i s_i(\beta,\theta_i) \doteq n_i s_i(\beta,\hat\theta_i) + d_i(\theta_i-\hat\theta_i)^2, \qquad i = 1,\ldots,k, \qquad (4.3.30) $$

where θ̂ᵢ is the value of θᵢ minimizing nᵢsᵢ(β, θᵢ) and, therefore, is also the mode of the posterior distribution p(θᵢ | β, yᵢ) in (4.3.27). Substituting (4.3.30) into
(4.3.28) and integrating out θ, the posterior distribution p(φ | β, y) is then approximately

$$ p(\phi \mid \beta, y) \doteq c^*\prod_{i=1}^{k-1} V_i^{*\,\frac{1}{2}m_i}\left(1+\sum_{i=1}^{k-1} V_i^*\right)^{-\frac{1}{2}m}, \qquad -\infty < \phi_i < \infty, \quad i = 1,\ldots,k-1, \qquad (4.3.31) $$

where

$$ m_i = n_i(1+\beta)-1, \qquad m = n(1+\beta)-k = \sum_{i=1}^{k} m_i, \qquad c^* = \Gamma\left(\frac{m}{2}\right)(1+\beta)^{-(k-1)}\left\{\prod_{i=1}^{k}\Gamma\left(\frac{m_i}{2}\right)\right\}^{-1}, $$

and

$$ V_i^* = \frac{n_i s_i(\beta,\hat\theta_i)}{n_k s_k(\beta,\hat\theta_k)}\exp\left[-\frac{\phi_i}{1+\beta}\right], \qquad i = 1,\ldots,k-1. $$
As in Section 4.2.7 for the case of the comparison of two variances, in obtaining the approximate distribution (4.3.31) through the use of (4.3.30) it is only necessary to determine the modal values θ̂ᵢ, the quantities dᵢ being cancelled in the integration process. The distribution (4.3.31) is of exactly the same form as that in (4.3.5). Consequently, to this degree of approximation, the cumulant generating function of

$$ M^* = -2\log W^*, \qquad (4.3.32) $$

where

$$ W^* = \frac{m^{\frac{1}{2}m}}{\prod_{i=1}^{k} m_i^{\frac{1}{2}m_i}}\prod_{i=1}^{k-1} V_i^{*\,\frac{1}{2}m_i}\left(1+\sum_{i=1}^{k-1} V_i^*\right)^{-\frac{1}{2}m}, $$

is given by

$$ \kappa_{M^*}(t) = a - \frac{k-1}{2}\log(1-2t) + \sum_{r=1}^{\infty}\alpha_r^*(1-2t)^{-(2r-1)}, \qquad (4.3.33) $$

where

$$ \alpha_r^* = \frac{2^{2r-1} B_{2r}}{2r(2r-1)}\left[\sum_{i=1}^{k} m_i^{-(2r-1)} - m^{-(2r-1)}\right]. $$
Hence the distribution of M* can, as before, be approximated using Bartlett's method. To decide whether a parameter point φ₀ lies inside or outside an H.P.D. region, we may compute

$$ \Pr\{M^* > -2\log W_0^*\} \doteq \Pr\left\{\chi^2_{k-1} > \frac{-2\log W_0^*}{1+A^*}\right\}, \qquad (4.3.34) $$

where W₀* is obtained by substituting φ₀ into (4.3.32) and

$$ A^* = \frac{1}{3(k-1)}\left(\sum_{i=1}^{k} m_i^{-1} - m^{-1}\right). $$
In particular, for φ₀ = 0, −2 log W₀* reduces to

$$ -2\log W_0^* = -\sum_{i=1}^{k} m_i\left[\log\frac{n_i s_i(\beta,\hat\theta_i)}{m_i} - \log\bar s^*(\beta,\hat\theta)\right], \qquad (4.3.35) $$

with

$$ \bar s^*(\beta,\hat\theta) = \frac{1}{m}\sum_{i=1}^{k} n_i s_i(\beta,\hat\theta_i). $$

In the case β = 0, we obtain precisely the Normal theory results already discussed. Using Bartlett's method of approximation, the somewhat remarkable result is obtained, therefore, that for any known value of β (not close to −1), the decision as to whether the point φ₀ = 0 is or is not included in a given H.P.D. region depends on the quantities

$$ \frac{n_i s_i(\beta,\hat\theta_i)}{n_i(1+\beta)-1}, \qquad i = 1,\ldots,k, $$

which play the role of the usual mean squares sᵢ² as estimates of σᵢ^{2/(1+β)} based on nᵢ(1+β) − 1 degrees of freedom. We are thus able to obtain, for each β, the approximate probability content of an H.P.D. region which would just exclude the point φ₀ = 0.
Marginal Posterior Distributions of β and φ

With θ eliminated, the posterior distribution of β is

$$ p(\beta \mid y) \propto p(\beta)\, p_u(\beta \mid y), \qquad -1 < \beta \le 1, \qquad (4.3.36) $$

where

$$ p_u(\beta \mid y) \propto \left\{\Gamma\left[1+\frac{1+\beta}{2}\right]\right\}^{-n}\prod_{i=1}^{k}\Gamma\left[1+\frac{n_i(1+\beta)}{2}\right]\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\, d\theta_i $$

is the posterior distribution of β corresponding to a uniform reference prior distribution. Now, we can write the marginal distribution of φ as

$$ p(\phi \mid y) = \int_{-1}^{1} p(\phi \mid \beta, y)\, p(\beta \mid y)\, d\beta, \qquad (4.3.37) $$
where the first factor of the integrand is given by (4.3.28). In obtaining this distribution, the unknown location parameters θ and the non-Normality parameter β are eliminated by integration. The distribution thus provides the final overall inferences about the linear contrasts of the logarithms of the variances of the k populations. In practice, numerical integration over β and θ, and particularly θ, would be exceedingly burdensome. However, the factor p(φ | β, y) can be approximated by (4.3.31) for values of β not close to −1, and when p(β | y) is concentrated about some modal value β̂ we can employ the further approximation

$$ p(\phi \mid y) \doteq p(\phi \mid \hat\beta, y). \qquad (4.3.38) $$

Thus in appropriate circumstances p(φ | y) may be approximated by the right-hand side of (4.3.31) with β̂ substituted for the unknown β. Although evaluation of β̂ for (4.3.36) still requires numerical integration of one-dimensional integrals as well as numerical determination of a maximum, these are simple processes compared with evaluation of the exact distribution.

4.4 INFERENCE ROBUSTNESS AND CRITERION ROBUSTNESS
Using the notation of Section 1.6, suppose that θ₁ is a set of parameters of interest and θ₂ a set of nuisance parameters measuring discrepancies from "ideal" conditions θ₂₀. Then:

a) robustness of inferences about θ₁ to departures from the ideal conditions may be studied by considering how the conditional posterior distribution p(θ₁ | θ₂, y) changes as the elements of θ₂ are moved away from the values θ₂₀;

b) at the same time, the marginal posterior distribution p(θ₂ | y) measures the plausibility of various choices for θ₂, and integration of p(θ₁ | θ₂, y) with p(θ₂ | y) as weight function yields p(θ₁ | y), which, in suitable circumstances, shows what overall inferences can be made about θ₁.

These ideas have been illustrated, in this chapter and the previous one, with θ₂ = β measuring a departure from Normality and with the elements of θ₁ being in turn means, regression coefficients, and variances. Since in any given instance we never know for certain how elaborate assumptions need to be, the general possibility of embedding a tentative parsimonious subset of assumptions in a more prodigal set and studying the sensitivity of the conditional inference can provide a highly informative technique of preliminary data analysis. Such studies of inference robustness are, as has been said, of a different character from the customary criterion robustness studies of sampling theory. However, inference robustness can be interpreted in terms of sampling theory, although usually the limitations of that theory make it difficult to apply. In the sampling theory study of criterion robustness, an "optimal" criterion C_{θ₂₀}(y) is selected which is appropriate for making inferences about the parameters θ₁ for some fixed θ₂ = θ₂₀ corresponding to the "ideal" assumptions. The change in the sampling distribution of C_{θ₂₀}(y) is then studied as the parameters θ₂ are changed from θ₂₀. By contrast, to study inference robustness in the sampling framework, we need to study the sampling distribution not of C_{θ₂₀}(y) but of C_{θ₂}(y) as θ₂ is changed. It is an extraordinary fact that for a sample drawn from an exponential power distribution of which the mean is assumed known, sufficient statistics exist for the variance over the entire range of the kurtosis parameter β. In this unusual circumstance, therefore, it is possible with sampling theory to study not only criterion robustness but also inference robustness. We find in this exceptional case, where study of inference robustness is accessible to sampling theory, that it parallels exactly the Bayesian result.
4.4.1 The Analyst Example

As a specific illustration, we may again consider the variance comparison data of Table 4.2.1, consisting of 13 independent observations made by an Analyst A₁ and 20 made by an Analyst A₂. Although the population means were unknown, for the purpose of this demonstration we shall assume them known and equal to the sample means. On the Normal theory test of equality of variances, the significance level, that is, the probability of exceeding the observed variance ratio by chance, is 7.6%.
Table 4.4.1 Changes in percentage significance level induced by departures of the population β from the β₀ defining the criterion C_{β₀}(y): the analyst data. Rows index the criterion value β₀, columns the parent-population β; diagonal entries are marked with an asterisk.

 Criterion β₀ \ Parent β    −0.6    −0.4    −0.2     0.0     0.2     0.4     0.6
   −0.6                     *6.0    11.0      —       —       —       —       —
   −0.4                      3.5    *6.5     9.0    14.0      —       —       —
   −0.2                      3.0     4.8    *7.0     9.8    11.5    17.0    22.0
    0.0                      2.8     4.5     6.0    *7.6     9.5    12.0    14.5
    0.2                      2.5     4.0     5.0     6.0    *8.0    10.0    12.0
    0.4                      2.4     3.5     4.8     6.0     8.0    *8.6     9.5
    0.6                      2.4     4.0     4.8     5.0     6.5     7.5    *9.2
Table 4.4.1 shows the percentage significance levels calculated on standard sampling theory for this example under a number of different circumstances. The details of how these calculations were made are given later. In calculating the table, we have assumed the two parent distributions to be members of the class of exponential power distributions defined in (3.2.5), having a common value of β and known means. Then, for any fixed value of β, say β = β₀, the uniformly most powerful similar test of the hypothesis H₀: σ₂²/σ₁² = 1 against the alternative H₁: σ₂²/σ₁² > 1 is provided by the ratio

$$ C_{\beta_0}(y) = \frac{s_2(\beta_0,\theta_2)}{s_1(\beta_0,\theta_1)} = \frac{\sum_j |y_{2j}-\theta_2|^{2/(1+\beta_0)}/n_2}{\sum_j |y_{1j}-\theta_1|^{2/(1+\beta_0)}/n_1}, \qquad (4.4.1) $$

in which the numerator and the denominator are sufficient statistics for the variances σ₂² and σ₁², respectively. In particular, when β₀ = 0 the criterion C₀(y) is the usual F statistic. Consider, for example, the row in the table for β₀ = 0. The entries show how the percentage significance levels for the F criterion change as the β value of the parent population changes. For instance, 7.6% is the F criterion significance level for a Normal parent (β = 0), and the value 9.5% immediately to the right of this shows the significance level for a somewhat more leptokurtic parent distribution (β = 0.2), but with the same Normal theory F criterion C₀(y). Similarly, if we take the values corresponding to the next row, for β₀ = 0.2, we have the corresponding probabilities for the criterion

$$ C_{0.2}(y) = \frac{\sum_j |y_{2j}-\theta_2|^{2/1.2}/n_2}{\sum_j |y_{1j}-\theta_1|^{2/1.2}/n_1}, \qquad (4.4.2) $$

which will provide a uniformly most powerful similar test for a parent with
β = 0.2. In particular, the significance level for this parent population is 8.0%. Now, however, consider the change from 7.6% to 8.0% in the diagonal. This gives a measure of inference robustness. Specifically, it shows how much the significance level changes when the parent distribution and the appropriate criterion are changed together. Thus, while the familiar criterion robustness is measured by the changes occurring horizontally across the table, inference robustness is measured by changes occurring in the diagonal elements, which are marked with asterisks in Table 4.4.1. It is noticeable that the changes which occur horizontally are, for these data, considerably greater than those which occur diagonally. In fact, whereas the probability of the error of the first kind for the Normal theory criterion is changed by a factor of 5 (from 2.8% to 14.5%) in changing from a platykurtic distribution with β = −0.6 to a leptokurtic distribution with β = 0.6, it is changed only by a factor of 1.5 (from 6.0% to 9.2%) when appropriate modification is made in the criterion. While one cannot on this evidence draw any general conclusions, it is true that for these particular data the inferential probabilities about the ratio of variances are much less affected by changes in β than are the probabilities associated with a particular criterion. The above discussion has been conducted so far entirely in terms of sampling theory. It will be recalled from Section 4.2.3 that the diagonal elements of the table are precisely the a posteriori probabilities that the variance ratio σ₂²/σ₁² is less than unity for the corresponding values of β₀ in the parents. The inference robustness study under sampling theory and the Bayesian robustness study thus give precisely parallel results.
4.4.2 Derivation of the Criteria

For any specific value of β, say β = β₀, the uniformly most powerful similar criterion C_{β₀}(y) in (4.4.1) follows an F distribution with n₂(1+β₀) and n₁(1+β₀) degrees of freedom when the hypothesis H₀: σ₂²/σ₁² = 1 is true. Equivalently, the statistic

$$ W(\beta_0) = \frac{\sum^{n_1}|y_{1j}-\theta_1|^{2/(1+\beta_0)}}{\sum^{n_1}|y_{1j}-\theta_1|^{2/(1+\beta_0)} + \sum^{n_2}|y_{2j}-\theta_2|^{2/(1+\beta_0)}} \qquad (4.4.3) $$

has a beta distribution with parameters ½n₁(1+β₀) and ½n₂(1+β₀). Using this result, the exact probabilities in the diagonal of Table 4.4.1, namely

$$ \Pr\{F_{n_2(1+\beta_0),\,n_1(1+\beta_0)} > C_{\beta_0}(y)\}, $$

can be readily calculated. For the off-diagonal elements, we need to find the distribution of the criterion C_{β₀}(y) when the parent β takes some value other than β₀. This can be approximated using permutation theory. For the exponential power distribution (3.2.5), it is readily shown by employing (A2.1.5) in Appendix A2.1 that the variate

$$ X = \left|\frac{y-\theta}{\sigma}\right|^{2/(1+\beta_0)} \qquad (4.4.4) $$

has as its rth moment about the origin

$$ \mu_r' = \frac{\Gamma\{(1+\beta)[\tfrac{1}{2}+r/(1+\beta_0)]\}}{\Gamma[\tfrac{1}{2}(1+\beta)]}\,[c(\beta)]^{-r(1+\beta)/(1+\beta_0)}, \qquad r = 1, 2, 3, \ldots. \qquad (4.4.5) $$
Now write

$$ X_j = \left|\frac{y_{1j}-\theta_1}{\sigma_1}\right|^{2/(1+\beta_0)}, \qquad j = 1,\ldots,n_1, $$
$$ X_j = \left|\frac{y_{2(j-n_1)}-\theta_2}{\sigma_2}\right|^{2/(1+\beta_0)}, \qquad j = n_1+1,\ n_1+2,\ \ldots,\ n_1+n_2. $$

Then, on the hypothesis H₀: σ₂²/σ₁² = 1 and following the method in Box and Andersen (1955), the permutation moments of

$$ W(\beta_0) = \frac{\sum_{j=1}^{n_1} X_j}{\sum_{j=1}^{n} X_j} \qquad (4.4.6) $$

can be written

$$ E_p[W(\beta_0)] = \frac{n_1}{n}, \qquad V_p[W(\beta_0)] = \frac{2n_1 n_2}{n^2(n+2)}\left[1 + \frac{n+2}{2}\left(\frac{b_2-3}{n}\right)\right], \qquad (4.4.7) $$
where b₂ = n ΣⱼXⱼ²/(ΣⱼXⱼ)². By taking the expectation over all samples of the permutation moments, we obtain the ordinary moments of W(β₀) as

$$ E[W(\beta_0)] = \frac{n_1}{n}, \qquad V[W(\beta_0)] = \frac{2n_1 n_2}{n^2(n+2)}\left[1 + \frac{n+2}{2}\left(\frac{E(b_2-3)}{n}\right)\right]. \qquad (4.4.8) $$

Expanding the denominator of b₂ around the mean of X and taking expectations, we find that, to order n⁻²,

$$ E(b_2-3) = \frac{\mu_2'}{\mu_1'^2} - 3 + n^{-1}\left[\frac{3\mu_2'^2}{\mu_1'^4} - \frac{2\mu_3'}{\mu_1'^3} - \frac{\mu_2'}{\mu_1'^2}\right], \qquad (4.4.9) $$

where the μᵣ′ are given in (4.4.5). For a specific value of β, it can be shown that the statistic W(β₀) is approximately distributed as a beta variable with parameters [½n₁(1+β₀)δ, ½n₂(1+β₀)δ], where the factor δ in (4.4.10), obtained by matching the variance of this beta distribution to (4.4.8), represents the modification due to the departure of β from β₀. Equivalently, C_{β₀}(y) is approximately distributed as an F variable with n₂(1+β₀)δ and n₁(1+β₀)δ degrees of freedom, from which the off-diagonal probabilities in Table 4.4.1 can be approximately determined. The result is, of course, exact when β = 0.
4.5 A SUMMARY OF FORMULAE FOR VARIOUS PRIOR AND POSTERIOR DISTRIBUTIONS

For the convenience of the reader, Table 4.5.1 gives a summary of the formulae for the various prior and posterior distributions concerning inferences about the comparison of variances.

Table 4.5.1 A summary of prior and posterior distributions
1. Suppose k samples y₁′ = (y₁₁, …, y₁n₁), …, yₖ′ = (yₖ₁, …, yₖnₖ) are drawn from possibly different members of the exponential power distribution with common β,

$$ p(y \mid \theta,\sigma,\beta) = \omega(\beta)\,\sigma^{-1}\exp\left[-c(\beta)\left|\frac{y-\theta}{\sigma}\right|^{2/(1+\beta)}\right], \qquad -\infty < y < \infty, $$

where

$$ c(\beta) = \left\{\frac{\Gamma[\tfrac{3}{2}(1+\beta)]}{\Gamma[\tfrac{1}{2}(1+\beta)]}\right\}^{1/(1+\beta)}, \qquad \omega(\beta) = \frac{\{\Gamma[\tfrac{3}{2}(1+\beta)]\}^{1/2}}{(1+\beta)\{\Gamma[\tfrac{1}{2}(1+\beta)]\}^{3/2}}. $$

2. Let θ = (θ₁, …, θₖ) and σ = (σ₁, …, σₖ). The prior, from (4.3.2), (4.3.18), and (4.3.26), is

$$ p(\theta,\sigma,\beta) = p(\theta)\,p(\sigma)\,p(\beta), \qquad p(\theta) \propto \text{constant}, \qquad p(\sigma) \propto \prod_{i=1}^{k}\sigma_i^{-1}. $$
The Case k = 2

3. Conditional on (θ, β), the posterior distribution of V = σ₂²/σ₁², p(V | β, θ, y), is given by (4.2.6) and (4.2.7):

$$ \frac{s_1(\beta,\theta_1)}{s_2(\beta,\theta_2)}\,V^{1/(1+\beta)} \sim F_{n_1(1+\beta),\,n_2(1+\beta)}. $$

Thus

$$ \Pr\{V < 1 \mid \beta,\theta,y\} = \Pr\left\{F_{n_1(1+\beta),\,n_2(1+\beta)} < \frac{s_1(\beta,\theta_1)}{s_2(\beta,\theta_2)}\right\}. $$

In particular, for β = 0,

$$ V\,\frac{\sum(y_{1j}-\theta_1)^2/n_1}{\sum(y_{2j}-\theta_2)^2/n_2} \sim F_{n_1,\,n_2}. $$

For β → −1, the distribution of u in (4.2.9) is, from (4.2.10),

$$ p(u \mid \beta\to-1,\theta,y) = \begin{cases} \dfrac{n_1 n_2}{2(n_1+n_2)}\,u^{\frac{1}{2}n_1-1}, & 0 < u < 1, \\[1ex] \dfrac{n_1 n_2}{2(n_1+n_2)}\,u^{-\frac{1}{2}n_2-1}, & 1 < u < \infty, \end{cases} $$

where (h₁, h₂) and (m₁, m₂) are the half-ranges and the mid-points, respectively.
4. Conditional on θ, the posterior distribution of β is, in (4.2.17), p(β | θ, y) ∝ p(β) p_u(β | θ, y), −1 < β ≤ 1, with

$$ p_u(\beta \mid \theta, y) \propto \left\{\Gamma[1+\tfrac{1}{2}(1+\beta)]\right\}^{-(n_1+n_2)}\prod_{i=1}^{2}\Gamma[1+\tfrac{1}{2}n_i(1+\beta)]\,[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}. $$

5. The posterior distribution of V given θ is, from (4.2.16),

$$ p(V \mid \theta, y) = \int_{-1}^{1} p(V \mid \beta,\theta,y)\,p(\beta \mid \theta,y)\,d\beta, \qquad V > 0. $$

6. The posterior distribution of θ given β is, in (4.2.24),

$$ p(\theta \mid \beta, y) = \prod_{i=1}^{2}\frac{[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}}{\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\,d\theta_i}, \qquad -\infty < \theta_i < \infty. $$

7. The posterior distribution of V given β, with θ eliminated, is, in (4.2.23) and (4.2.25),

$$ p(V \mid \beta, y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(V \mid \beta,\theta,y)\,p(\theta \mid \beta,y)\,d\theta_1\,d\theta_2. $$

In particular, for β = 0,

$$ V\,\frac{\sum(y_{1j}-\bar y_1)^2/(n_1-1)}{\sum(y_{2j}-\bar y_2)^2/(n_2-1)} \sim F_{n_1-1,\,n_2-1}, $$

and when β → −1, the distribution of u* = V(h₁/h₂)² is, from (4.2.27),

$$ p(u^* \mid \beta\to-1, y) = \begin{cases} k\,u^{*\,\frac{1}{2}(n_1-1)-1}\left[(n_1+n_2)-(n_1+n_2-2)u^{*\,1/2}\right], & 0 < u^* < 1, \\ k\,u^{*\,-\frac{1}{2}(n_2-1)-1}\left[(n_1+n_2)-(n_1+n_2-2)u^{*\,-1/2}\right], & 1 < u^* < \infty, \end{cases} $$

with

$$ k = \frac{n_1 n_2}{2(n_1+n_2)}\cdot\frac{(n_1-1)(n_2-1)}{(n_1+n_2-1)(n_1+n_2-2)}. $$

For β not close to −1, from (4.2.34), we have the approximation

$$ V^{1/(1+\beta)}\,\frac{n_1 s_1(\beta,\hat\theta_1)/[n_1(1+\beta)-1]}{n_2 s_2(\beta,\hat\theta_2)/[n_2(1+\beta)-1]} \sim F_{n_1(1+\beta)-1,\,n_2(1+\beta)-1}, $$

where (θ̂₁, θ̂₂) is the mode of p(θ | β, y).
8. With θ eliminated, the posterior distribution of β is, in (4.2.35), p(β | y) ∝ p(β) p_u(β | y), −1 < β ≤ 1, with

$$ p_u(\beta \mid y) \propto \left\{\Gamma[1+\tfrac{1}{2}(1+\beta)]\right\}^{-(n_1+n_2)}\prod_{i=1}^{2}\Gamma[1+\tfrac{1}{2}n_i(1+\beta)]\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\,d\theta_i. $$

9. With θ and β eliminated, the final posterior distribution of V is, in (4.2.36),

$$ p(V \mid y) = \int_{-1}^{1} p(V \mid \beta, y)\,p(\beta \mid y)\,d\beta. $$

The General Case k > 2

10. Conditional on (θ, β), from (4.3.5) the posterior distribution of the (k − 1) contrasts φᵢ = log σᵢ² − log σₖ², i = 1, …, k − 1, is

$$ p(\phi \mid \theta,\beta,y) = c\prod_{i=1}^{k-1}V_i^{\frac{1}{2}n_i(1+\beta)}\left(1+\sum_{i=1}^{k-1}V_i\right)^{-\frac{1}{2}n(1+\beta)}, \qquad -\infty < \phi_i < \infty, $$

where n = Σᵢ nᵢ,

$$ V_i = \frac{n_i s_i(\beta,\theta_i)}{n_k s_k(\beta,\theta_k)}\,e^{-\phi_i/(1+\beta)}, \qquad c = \Gamma\left[\frac{n}{2}(1+\beta)\right](1+\beta)^{-(k-1)}\left\{\prod_{i=1}^{k}\Gamma\left[\frac{n_i}{2}(1+\beta)\right]\right\}^{-1}. $$

Thus, from (4.3.7) to (4.3.11), the approximate (1 − α) H.P.D. region of φ is

$$ \frac{-2\log W}{1+A} < \chi^2(k-1,\alpha), \qquad A = \frac{1}{3(k-1)(1+\beta)}\left(\sum_{i=1}^{k}n_i^{-1}-n^{-1}\right), $$

and χ²(k − 1, α) is given in Table III (at the end of this book). In particular, for φᵢ = 0 (i = 1, …, k − 1) (that is, σ₁² = ⋯ = σₖ²),

$$ -2\log W = -\sum_{i=1}^{k}n_i(1+\beta)\left[\log s_i(\beta,\theta_i)-\log\bar s(\beta,\theta)\right], $$

with

$$ \bar s(\beta,\theta) = \frac{1}{n}\sum_{i=1}^{k}n_i s_i(\beta,\theta_i). $$
11. Conditional on θ, the posterior distribution of β is, in (4.3.19) and (4.3.20), p(β | θ, y) ∝ p(β) p_u(β | θ, y), −1 < β ≤ 1, with

$$ p_u(\beta \mid \theta, y) \propto \left\{\Gamma[1+\tfrac{1}{2}(1+\beta)]\right\}^{-n}\prod_{i=1}^{k}\Gamma[1+\tfrac{1}{2}n_i(1+\beta)]\,[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}. $$

12. With β eliminated, the posterior distribution of φ′ = (φ₁, …, φ_{k−1}) is, from (4.3.22), approximately p(φ | θ, y) ≐ p(φ | θ, β̂, y), where β̂ is the mode of p(β | θ, y).

13. Conditional on β, the posterior distribution of θ is, in (4.3.27),

$$ p(\theta \mid \beta, y) = \prod_{i=1}^{k}p(\theta_i \mid \beta, y_i), \qquad -\infty < \theta_i < \infty, \quad i = 1,\ldots,k. $$

14. Conditional on β, with θ eliminated and for β not close to −1, the posterior distribution of φ is approximately, from (4.3.31),

$$ p(\phi \mid \beta, y) \doteq c^*\prod_{i=1}^{k-1}V_i^{*\,\frac{1}{2}m_i}\left(1+\sum_{i=1}^{k-1}V_i^*\right)^{-\frac{1}{2}m}, \qquad -\infty < \phi_i < \infty, $$

where

$$ m = \sum_{i=1}^{k}m_i = n(1+\beta)-k, \qquad m_i = n_i(1+\beta)-1, \qquad i = 1,\ldots,k, $$
$$ V_i^* = \frac{n_i s_i(\beta,\hat\theta_i)}{n_k s_k(\beta,\hat\theta_k)}\,e^{-\phi_i/(1+\beta)}, $$

and (θ̂₁, …, θ̂ₖ) is the mode of p(θ | β, y). Thus, from (4.3.32) to (4.3.35), the approximate (1 − α) H.P.D. region of φ is given by

$$ \frac{-2\log W^*}{1+A^*} < \chi^2(k-1,\alpha), \qquad A^* = \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}m_i^{-1}-m^{-1}\right). $$

In particular, for the point φ = 0 (σ₁² = ⋯ = σₖ²),

$$ -2\log W_0^* = -\sum_{i=1}^{k}m_i\left[\log\frac{n_i s_i(\beta,\hat\theta_i)}{m_i}-\log\bar s^*(\beta,\hat\theta)\right], \qquad \bar s^*(\beta,\hat\theta) = \frac{1}{m}\sum_{i=1}^{k}n_i s_i(\beta,\hat\theta_i). $$

15. With θ eliminated, the posterior distribution of β is, in (4.3.36), p(β | y) ∝ p(β) p_u(β | y), −1 < β ≤ 1, where

$$ p_u(\beta \mid y) \propto \left\{\Gamma[1+\tfrac{1}{2}(1+\beta)]\right\}^{-n}\prod_{i=1}^{k}\Gamma[1+\tfrac{1}{2}n_i(1+\beta)]\int_{-\infty}^{\infty}[n_i s_i(\beta,\theta_i)]^{-\frac{1}{2}n_i(1+\beta)}\,d\theta_i. $$

16. Finally, with θ and β eliminated, the posterior distribution of φ is approximately p(φ | y) ≐ p(φ | β̂, y), where β̂ is the mode of p(β | y).
APPENDIX A4.1 LIMITING DISTRIBUTIONS FOR THE VARIANCE RATIO V WHEN β APPROACHES −1

We here sketch the derivation of the limiting distributions given in (4.2.10) and (4.2.27) for the variance ratio V when β approaches −1. These results follow readily by making use of (A3.1.7) in Appendix A3.1.

Specifically, for the distribution of u in (4.2.10), we first make the transformation in (4.2.9),

u = V {n₁s₁(β, θ₁) / [n₂s₂(β, θ₂)]}^{1+β},   (A4.1.1)

so that

p(u | β, θ, y) = q(β) [r(β, θ₁, θ₂)]^{n₁} u^{½n₁−1} [1 + u^{1/(1+β)} r(β, θ₁, θ₂)^{2/(1+β)}]^{−½(n₁+n₂)(1+β)},   0 < u < ∞,   (A4.1.2)

where

r(β, θ₁, θ₂) = {[n₁s₁(β, θ₁)]^{½(1+β)} / (h₁ + |m₁ − θ₁|)} {(h₂ + |m₂ − θ₂|) / [n₂s₂(β, θ₂)]^{½(1+β)}}

and q(β) is a normalizing function of β alone. As β → −1,

lim_{β→−1} q(β) = n₁n₂ / [2(n₁ + n₂)] = q,   (A4.1.3)

say, and by using (A3.1.7),

lim_{β→−1} r(β, θ₁, θ₂) = 1.   (A4.1.4)

Further, following the argument in (A3.1.8) through (A3.1.12), we have

lim_{β→−1} [1 + u^{1/(1+β)} r(β, θ₁, θ₂)^{2/(1+β)}]^{−½(n₁+n₂)(1+β)} =
  1,   for 0 < u ≤ 1,
  u^{−½(n₁+n₂)},   for 1 < u < ∞.   (A4.1.5)

It follows that

lim_{β→−1} p(u | β, θ, y) =
  q u^{½n₁−1},   for 0 < u ≤ 1,
  q u^{−½n₂−1},   for 1 < u < ∞,   (A4.1.6)

as given in (4.2.10).

To derive the distribution of u* in (4.2.27), we first note that from the result in (3.2.15), as β → −1, the joint distribution of (θ₁, θ₂) in (4.2.24) tends to

lim_{β→−1} p(θ | β, y) = ∏_{i=1}^{2} ½(n_i − 1) h_i^{n_i−1} (h_i + |m_i − θ_i|)^{−n_i},   −∞ < θ_i < ∞,   i = 1, 2.   (A4.1.7)

It is readily verified from (A4.1.7) that the quantity

z = {h₁ (h₂ + |m₂ − θ₂|) / [h₂ (h₁ + |m₁ − θ₁|)]}²   (A4.1.8)

is distributed a posteriori as

lim_{β→−1} p(z | β, y) =
  c z^{½(n₁−1)−1},   for 0 < z ≤ 1,
  c z^{−½(n₂−1)−1},   for 1 < z < ∞,   (A4.1.9)

where

c = (n₁ − 1)(n₂ − 1) / [2(n₁ + n₂ − 2)].

Noting that

u* = V (h₁/h₂)² = u z,   (A4.1.10)

we obtain from (A4.1.6) the conditional posterior distribution of u* given z as

lim_{β→−1} p(u* | β, z, y) =
  (q/z) (u*/z)^{½n₁−1},   for 0 < u* ≤ z,
  (q/z) (u*/z)^{−½n₂−1},   for z < u* < ∞.   (A4.1.11)

It follows from (A4.1.9) and (A4.1.11), on integrating over the distribution of z, that for 0 < u* ≤ 1

lim_{β→−1} p(u* | β, y) = k u*^{½(n₁−1)−1} [(n₁+n₂) − (n₁+n₂−2) u*^{1/2}],

and for 1 < u* < ∞

lim_{β→−1} p(u* | β, y) = k u*^{−½(n₂−1)−1} [(n₁+n₂) − (n₁+n₂−2) u*^{−1/2}],   (A4.1.12)

with

k = 2qc / (n₁ + n₂ − 1) = n₁n₂(n₁−1)(n₂−1) / [2(n₁+n₂)(n₁+n₂−1)(n₁+n₂−2)],

as given in (4.2.27).
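As a check on the algebra (a numerical sketch of ours, not part of the original appendix), the limiting density (A4.1.12), with the constant k given above, can be verified to integrate to one; the sample sizes used below are arbitrary.

```python
# Sketch: numerical check that the limiting density of u* in (4.2.27)
# integrates to one, with k as given in (A4.1.12).
import numpy as np
from scipy.integrate import quad

def u_star_density(u, n1, n2):
    k = (n1 * n2 * (n1 - 1) * (n2 - 1)
         / (2.0 * (n1 + n2) * (n1 + n2 - 1) * (n1 + n2 - 2)))
    s = n1 + n2
    if u <= 1.0:
        return k * u ** (0.5 * (n1 - 1) - 1) * (s - (s - 2) * np.sqrt(u))
    return k * u ** (-0.5 * (n2 - 1) - 1) * (s - (s - 2) / np.sqrt(u))

total = (quad(u_star_density, 0, 1, args=(6, 9))[0]
         + quad(u_star_density, 1, np.inf, args=(6, 9))[0])
print(total)   # should equal 1.0 to quadrature accuracy
```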
CHAPTER 5
RANDOM EFFECT MODELS
5.1 INTRODUCTION
In previous chapters problems about means, variances, and regression coefficients were discussed. We now consider another important class of practical problems concerned with variance components. To illustrate how variance component problems arise, consider a chemical process producing batches of material which are sampled and then analysed for some characteristic such as product yield. When the total variance σ² of the observed yield is large, the efficient operation of the process is hampered, control becomes difficult, and process improvement studies are hindered. So that effort can be effectively directed to reducing σ², it is necessary to discover the relative importance of the various sources from which variation may spring. It would, for example, be fruitless to devote effort to improving the analytical method if in fact an inadequate technique for sampling the material were the main cause of variation. With this in mind a special type of experiment may be conducted using what is called a hierarchical design.

Fig. 5.1.1 Illustration of a hierarchical classification (I = 4 batches, J = 3 samples per batch, K = 2 analyses per sample).
Figure 5.1.1 illustrates a hierarchical design for four batches, three samples per batch, and two analyses per sample. More generally, suppose that I batches of product are randomly taken, each batch is sampled J times, and K analyses are performed on each sample. Then on the assumption that batches, samples, and analyses vary independently, and additively contribute errors e_i, e_{ij}, and e_{ijk}, we have the mathematical model

y_{ijk} = θ + e_i + e_{ij} + e_{ijk},   i = 1, …, I;  j = 1, …, J;  k = 1, …, K,   (5.1.1)

where y_{ijk} are the observations, θ a common location parameter, and e_i, e_{ij}, and e_{ijk}
are independently distributed random variables with zero means and variances Var(e_{ijk}) = σ₁², Var(e_{ij}) = σ₂², and Var(e_i) = σ₃². Thus the total variance of the (ijk)th observation y_{ijk} becomes σ₁² + σ₂² + σ₃², and the quantities (σ₁², σ₂², σ₃²) are called the variance components. Obviously, more than three sources of variation may be involved, and many different applications of this kind of study could be quoted. The random variables e_i, e_{ij}, e_{ijk} are sometimes called random effects, and the model in (5.1.1) is known as a random effect model. This is in contrast to the model in (2.7.7) for the comparison of Normal means considered in Section 2.11. In the sampling theory framework, these means are usually regarded as fixed constants, and the model is therefore commonly called a fixed effect model. The relationship between these two types of models will be considered in more detail later in Chapters 6 and 7.

5.1.1 The Analysis of Variance Table

We have already seen that in the comparison of Normal means, certain calculations are conveniently set out in the form of an analysis of variance table. It so happens that such a table is also of value in analyzing variance components. An appropriate analysis of variance table for the present random effect model is shown in Table 5.1.1.

Table 5.1.1 Analysis of variance of hierarchical classification with three variance components

Source | Sum of squares (S.S.) | Degrees of freedom (d.f.) | Mean square (M.S.) | Sampling expectation of mean square (E.M.S.)
Due to batches | S₃ = JK Σ_i (ȳ_{i..} − ȳ_{...})² | ν₃ = I − 1 | m₃ = S₃/ν₃ | σ₁₂₃² = σ₁² + Kσ₂² + JKσ₃²
Due to samples | S₂ = K Σ_i Σ_j (ȳ_{ij.} − ȳ_{i..})² | ν₂ = I(J − 1) | m₂ = S₂/ν₂ | σ₁₂² = σ₁² + Kσ₂²
Due to analyses | S₁ = Σ_i Σ_j Σ_k (y_{ijk} − ȳ_{ij.})² | ν₁ = IJ(K − 1) | m₁ = S₁/ν₁ | σ₁²
Total | Σ_i Σ_j Σ_k (y_{ijk} − ȳ_{...})² | IJK − 1 | |
In Table 5.1.1 (and hereafter) we adopt the notation by which a dot replacing a subscript indicates an average over that subscript. Thus,

ȳ_{ij.} = (1/K) Σ_k y_{ijk},   ȳ_{i..} = (1/J) Σ_j ȳ_{ij.},   and   ȳ_{...} = (1/I) Σ_i ȳ_{i..}.
From the sampling theory point of view, by pooling the sample variances within the individual samples, we obtain an estimate of σ₁², the analytical variance. Then by pooling the sample variances of the sample means within each batch, we obtain an estimate of σ₁² + Kσ₂², where σ₂² measures the variation due to sampling. Finally, from the variation of the batch means, we can obtain an estimate of σ₁² + Kσ₂² + JKσ₃², where σ₃² is the batch variance. It is then customary to obtain estimates of the variance components (σ₁², σ₂², σ₃²) by solving the equations in the expectations. Thus, from Table 5.1.1 we have the relationships

E(m₁) = σ₁²,   E(m₂) = σ₁² + Kσ₂²,   E(m₃) = σ₁² + Kσ₂² + JKσ₃²,   (5.1.2a)

so that the usual estimates of the variance components are

σ̂₁² = m₁,   σ̂₂² = (m₂ − m₁)/K,   and   σ̂₃² = (m₃ − m₂)/(JK).   (5.1.2b)
5.1.2 Two Examples

We will begin by illustrating the simple case where there are only two components of variance. The model is then

y_{jk} = θ + e_j + e_{jk},   j = 1, …, J;  k = 1, …, K,   (5.1.3)
which is a special case of (5.1.1) when I = 1. The first data set we consider is taken from Davies (1967, p. 105). The object of the experiment was to learn to what extent batch to batch variation in a certain raw material was responsible for variation in the final product yield. Five samples from each of six randomly chosen batches of raw material were taken, and a single laboratory determination of product yield was made for each of the resulting 30 samples. The data are set out in Table 5.1.2, with an analysis of variance in Table 5.1.3. In this example j = 1, …, 6 refers to the batches and, for each j, k = 1, …, 5 refers to the samples. The "within batches" component σ₁² will be referred to as the "analyses" component, although it includes sampling errors as well as chemical analysis errors.

Table 5.1.2 Dyestuff data (yield of dyestuff in grams of standard color)

Batch:           1     2     3     4     5     6
              1545  1540  1595  1445  1595  1520
              1440  1555  1550  1440  1630  1455
              1440  1490  1605  1595  1515  1450
              1520  1560  1510  1465  1635  1480
              1580  1495  1560  1545  1625  1445
ȳ_j.          1505  1528  1564  1498  1600  1470        ȳ.. = 1527.5
The variance component σ₂² associated with "batches" represents the variation associated with raw material quality changes when sampling and analytical errors are fully allowed for. The data are characterized by the fact that the between-batches mean square is large compared with the within-batches mean square, strongly indicating that the component σ₂² associated with batches is nonzero.

Table 5.1.3 Analysis of variance for the dyestuff example

Source | S.S. | d.f. | M.S. | E.M.S.
Between batches | S₂ = 56,357.5 | ν₂ = 5 | m₂ = 11,271.50 | σ₁² + 5σ₂²
Within batches (analyses) | S₁ = 58,830.0 | ν₁ = 24 | m₁ = 2,451.25 | σ₁²
Total | 115,187.5 | 29 | |

σ̂₁² = 2,451.25,   σ̂₂² = (11,271.50 − 2,451.25)/5 = 1,764.05.
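These moment estimates are simple enough to compute directly; the following minimal sketch (ours, using nothing beyond standard Python) reproduces the dyestuff values just given.

```python
# Sketch of the moment estimates (5.1.2) for the two-component model,
# applied to the dyestuff mean squares of Table 5.1.3.
def variance_component_estimates(m1, m2, K):
    """Return sigma1^2_hat = m1 and sigma2^2_hat = (m2 - m1)/K."""
    return m1, (m2 - m1) / K

s1_hat, s2_hat = variance_component_estimates(m1=2451.25, m2=11271.50, K=5)
print(s1_hat, s2_hat)   # 2451.25 and (11271.50 - 2451.25)/5 = 1764.05
```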
The second data set we consider illustrates the case where the between-batches mean square is less than the within-batches mean square. These data had to be constructed, for although examples of this sort undoubtedly occur in practice they seem to be rarely published. The model in (5.1.3) was used to generate six groups of five observations each. The errors e_{jk} were drawn from a table of random Normal deviates with σ₁ = 4, the e_j were drawn from the same table but with σ₂ = 2, and the parameter θ was set equal to five. For convenience we shall refer to the components (σ₁², σ₂²) in this second example as associated with "analyses" and "batches," as we did in the first. The data are set out in Table 5.1.4 and the corresponding analysis of variance in Table 5.1.5.

Table 5.1.4 Data generated from a table of random Normal deviates for the two-component model (with θ = 5, σ₁ = 4, σ₂ = 2)

Batch:           1       2       3       4       5       6
              7.298   2.212   0.282   1.722   5.220   0.110
              3.846   4.852   9.014   4.782   6.556  10.386
              2.434   7.092   4.458   8.106   0.608  13.434
              9.566   9.288   9.446   0.758  11.788   5.510
              7.990   4.980   7.198   3.758  −0.892   8.166
ȳ_j.         6.2268  5.6848  6.0796  3.8252  4.6560  7.5212        ȳ.. = 5.6656
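The construction of such a data set is easily imitated; the following sketch (ours, with an arbitrary seed, assuming numpy) generates six batches of five observations from model (5.1.3) and computes the two mean squares, showing how a negative moment estimate of σ₂² can arise.

```python
# Sketch: simulate data from model (5.1.3) and compute the mean squares.
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed, for illustration
theta, sigma1, sigma2, J, K = 5.0, 4.0, 2.0, 6, 5

e_j = rng.normal(0.0, sigma2, size=J)                    # batch effects
y = theta + e_j[:, None] + rng.normal(0.0, sigma1, (J, K))

ybar_j = y.mean(axis=1)
m1 = ((y - ybar_j[:, None]) ** 2).sum() / (J * (K - 1))  # within batches
m2 = K * ((ybar_j - y.mean()) ** 2).sum() / (J - 1)      # between batches
print(m1, m2, (m2 - m1) / K)    # the last value can well be negative
```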
Table 5.1.5 Analysis of variance for the generated example

Source | S.S. | d.f. | M.S. | E.M.S.
Between batches | S₂ = 41.6816 | ν₂ = 5 | m₂ = 8.3363 | σ₁² + 5σ₂²
Within batches (analyses) | S₁ = 358.7014 | ν₁ = 24 | m₁ = 14.9459 | σ₁²
Total | 400.3830 | 29 | |

σ̂₁² = 14.9459,   σ̂₂² = (8.3363 − 14.9459)/5 = −1.3219.
5.1.3 Difficulties in the Sampling Theory Approach

The sampling theory approach to the problem of variance components encounters a number of snags. These have bothered statisticians for many years, as is evidenced by the great variety of attempts which have been made to resolve the problems.† The difficulties can conveniently be discussed in terms of the examples quoted above.
Negative Estimate of σ₂²

The most commonly used estimate of σ₂² is obtained from the difference of the two mean squares (m₂, m₁), and we have referred to this estimate as σ̂₂². When the between-batches mean square m₂ is smaller than the within-batches mean square m₁, as it is in the second example above, σ̂₂² will be negative. Since σ₂² must be nonnegative, this is generally regarded as objectionable.
Difficulties with Confidence Intervals

Even with the assumption that e_j and e_{jk} are Normally and independently distributed, the sampling distribution of σ̂₂² is complicated and depends on the unknown variance ratio σ₂²/σ₁². The problem of obtaining a confidence interval for σ₂² is complex, and no generally accepted procedure has been found. Also, in estimating the variance ratio σ₂²/σ₁², the commonly used confidence intervals, which are based upon the sampling distribution of the mean square ratio m₂/m₁, may include negative values. Intuitively, one would feel that the part of the interval associated with negative values of σ₂² ought to be discounted in some way, but attempts to do this within the sampling theory framework are difficult to justify.
† See, for example, Daniels (1939), Crump (1946), Eisenhart (1947), Henderson (1953), Moriguti (1954), Tukey (1956), Graybill and Wortham (1956), Bulmer (1957), Searle (1958), Herbach (1959), Gates and Shine (1962), Thompson (1962, 1963), Gower (1962), Williams (1962), Bush and Anderson (1963), Wang (1967), Zacks (1967). For a review of recent work in this area, see Tan (1964) and Ali (1969).
Pooling of Estimates

When m₂ and m₁ are not very different, it has sometimes been argued that a pooled estimate (ν₁m₁ + ν₂m₂)/(ν₁ + ν₂) of σ₁² should be used. But how does one decide when to pool and when not to pool? And how is the sampling distribution of the estimate of σ₁² affected by the fact that pooling will only be practiced when m₂ and m₁ are sufficiently close? Attempts have been made to get around these difficulties and to study the consequences of various proposed procedures, but the situation remains unsatisfactory.
Departures from Assumptions

Additional complications arise when we consider the effect of departures from the assumptions of Normality and independence. It is shown in Scheffé (1959) that non-Normality in e_j and lack of independence in e_{jk} will have serious effects on the distributions of the criteria which one uses to make inferences about (σ₁², σ₂²). This further confuses an already chaotic situation. In summary, then, traditional sampling theory methods have led to worrisome difficulties in the variance component problem to which no generally accepted set of solutions has been obtained. Our aim in this chapter is to reexamine these problems from a Bayesian standpoint, making the standard Normality and independence assumptions. Within the Bayesian framework, no difficulty in principle occurs in relaxing the Normality and independence assumptions. Indeed, Tiao and Tan (1966) and Hill (1967) have studied variance component problems while relaxing the independence assumption for the e_{jk}, and more recently Tiao and Ali (1971a) have considered the effect of departure from Normality of the e_j.
5.2 BAYESIAN ANALYSIS OF HIERARCHICAL CLASSIFICATIONS WITH TWO VARIANCE COMPONENTS

As a preliminary to the Bayesian analysis, a summary of notation and assumptions for the two-component random effect model is given below. With J groups (batches) of K observations (analyses) we employ the model

y_{jk} = θ + e_j + e_{jk},   j = 1, …, J;  k = 1, …, K,   (5.2.1)

where the e's are supposed Normally and independently distributed with zero means and

Var(e_j) = σ₂²   and   Var(e_{jk}) = σ₁².   (5.2.2)

Thus, Var(y_{jk}) = σ₁² + σ₂², and the traditional unbiased estimators for the variance components σ₁² and σ₂² are, respectively,

σ̂₁² = m₁   and   σ̂₂² = (m₂ − m₁)/K,   (5.2.3)
with m₂ and m₁ the "between" and "within" mean squares defined in Table 5.2.1. It is to be noted that an important quantity occurring in the analysis is σ₁₂² = σ₁² + Kσ₂².

Table 5.2.1 Analysis of variance of hierarchical classifications with two components

Source | S.S. | d.f. | M.S. | E.M.S.
Between batches | S₂ = K Σ_j (ȳ_{j.} − ȳ_{..})² | ν₂ = J − 1 | m₂ = S₂/ν₂ | σ₁² + Kσ₂²
Within batches | S₁ = Σ_j Σ_k (y_{jk} − ȳ_{j.})² | ν₁ = J(K − 1) | m₁ = S₁/ν₁ | σ₁²
Total | | JK − 1 | |
5.2.1 The Likelihood Function

To derive the likelihood function for random effect models, we first recall certain useful results summarized in the following theorem.

Theorem 5.2.1 Let x₁, …, x_n be n independent observations from a Normal distribution N(θ, σ²). Let x̄ be the sample mean and (x_i − x̄) (i = 1, …, n) be the residuals. Then

1. x̄ is distributed independently of the x_i − x̄, as N(θ, σ²/n),
2. Σ (x_i − x̄)² ~ σ² χ²_{n−1}, and
3. insofar as the x_i − x̄ are concerned, Σ (x_i − x̄)² is a sufficient statistic for σ².
Turning to the model in (5.2.1), one convenient way to obtain the likelihood function is to work with the group means ȳ_{j.} and the residuals y_{jk} − ȳ_{j.}. Clearly, in terms of (θ, e_j, e_{jk}),

ȳ_{j.} = θ + e_j + ē_{j.}   and   y_{jk} − ȳ_{j.} = e_{jk} − ē_{j.}.   (5.2.4)

It follows from the Normality and independence assumptions of (e_j, e_{jk}) and the results in Theorem 5.2.1 that

a) the ȳ_{j.} are independent, each having a Normal distribution N(θ, σ₁₂²/K),
b) the ȳ_{j.} are distributed independently of the y_{jk} − ȳ_{j.},
c) the quantity Σ_k (y_{jk} − ȳ_{j.})² is distributed as σ₁² χ²_{K−1}, so that the sum Σ_j Σ_k (y_{jk} − ȳ_{j.})² = S₁ = ν₁m₁ is distributed as σ₁² χ²_{ν₁}, and finally
d) insofar as the y_{jk} − ȳ_{j.} are concerned, ν₁m₁ is a sufficient statistic for σ₁².
Thus, the likelihood function is

l(θ, σ₁², σ₁₂² | y) ∝ (σ₁²)^{−½ν₁} (σ₁₂²)^{−½J} exp{−½ [ν₁m₁/σ₁² + (ν₂m₂ + JK(ȳ_{..} − θ)²)/σ₁₂²]},   (5.2.5)

where it is to be noted that σ₁₂² > σ₁². Alternatively, in terms of (θ, σ₁², σ₂²) and noting that

σ₁₂² = σ₁² + Kσ₂²,   (5.2.6)

we have

l(θ, σ₁², σ₂² | y) ∝ (σ₁²)^{−½ν₁} (σ₁² + Kσ₂²)^{−½J} exp{−½ [ν₁m₁/σ₁² + (ν₂m₂ + JK(ȳ_{..} − θ)²)/(σ₁² + Kσ₂²)]}.   (5.2.7)
5.2.2 Prior and Posterior Distribution of (θ, σ₁², σ₂²)

From (5.2.5) the likelihood function can be regarded as having arisen from J independent observations from a population N(θ, σ₁₂²/K) and J(K − 1) further independent observations from a population N(0, σ₁²). Treating the location parameter θ separately from the variances (σ₁², σ₁₂²), we therefore take, as a noninformative reference prior, a distribution with (θ, log σ₁², log σ₁₂²) locally uniform and locally independent. The fact that σ₂² = (σ₁₂² − σ₁²)/K must be positive leads to the restriction σ₁₂² > σ₁² in the parameter space of (σ₁², σ₁₂²). Thus the noninformative prior is defined by

p(θ, σ₁², σ₁₂²) = p(θ) p(σ₁², σ₁₂²),   (5.2.8)

with

p(θ) ∝ c   and   p(σ₁², σ₁₂²) ∝ σ₁^{−2} σ₁₂^{−2},   σ₁₂² > σ₁².

Alternatively, in terms of (θ, σ₁², σ₂²), the prior distribution is

p(θ, σ₁², σ₂²) ∝ σ₁^{−2} (σ₁² + Kσ₂²)^{−1}.   (5.2.9)

This choice of the prior distribution has been criticized by Stone and Springer (1965). A discussion of their criticism is given in Appendix A5.5. An important issue raised by Klotz, Milton and Zacks (1969) and Portnoy (1971) in connection with sampling theory point estimators of (σ₁², σ₂²) based on this prior will be considered in detail in Appendix A5.6.
The noninformative reference prior distribution for σ₁² and σ₂² may also be obtained directly by applying Jeffreys' rule discussed in Section 1.3. It can be readily shown that the information matrix for (σ₁², σ₂²) is

I(σ₁², σ₂²) = ½ [ ν₁/σ₁⁴ + J/σ₁₂⁴    JK/σ₁₂⁴ ]
              [ JK/σ₁₂⁴             JK²/σ₁₂⁴ ].   (5.2.10a)

Thus, the determinant is

|I(σ₁², σ₂²)| = ν₁JK² / (4σ₁⁴σ₁₂⁴),   (5.2.10b)

so that |I|^{1/2} is proportional to the prior in (5.2.9).
By combining the prior in (5.2.9) with the likelihood in (5.2.7), the posterior distribution of (θ, σ₁², σ₂²) is obtained as

p(θ, σ₁², σ₂² | y) ∝ (σ₁²)^{−(½ν₁+1)} (σ₁² + Kσ₂²)^{−[½(ν₂+1)+1]} exp{−½ [ν₁m₁/σ₁² + (ν₂m₂ + JK(ȳ_{..} − θ)²)/(σ₁² + Kσ₂²)]},
−∞ < θ < ∞,   σ₁² > 0,   σ₂² > 0.   (5.2.11)
To obtain the posterior distribution of the variance components (σ₁², σ₂²), (5.2.11) is integrated over θ, yielding

p(σ₁², σ₂² | y) = w (σ₁²)^{−(½ν₁+1)} (σ₁² + Kσ₂²)^{−(½ν₂+1)} exp[−½ (ν₁m₁/σ₁² + ν₂m₂/(σ₁² + Kσ₂²))],
σ₁² > 0,   σ₂² > 0,   (5.2.12)

where w is the appropriate normalizing constant which, as it will transpire in the next section, is given by

w = K (ν₁m₁)^{½ν₁} (ν₂m₂)^{½ν₂} 2^{−½(ν₁+ν₂)} / [Γ(½ν₁) Γ(½ν₂) Pr{F_{ν₂,ν₁} < m₂/m₁}].   (5.2.13)

From the definition of the χ⁻² distribution, we may write the joint distribution of (σ₁², σ₂²) as

p(σ₁², σ₂² | y) = (ν₁m₁)^{−1} p(χ_{ν₁}^{−2} = σ₁²/(ν₁m₁)) K(ν₂m₂)^{−1} p(χ_{ν₂}^{−2} = (σ₁² + Kσ₂²)/(ν₂m₂)) / Pr{F_{ν₂,ν₁} < m₂/m₁},
σ₁² > 0,   σ₂² > 0,   (5.2.14)

where p(χ_ν^{−2} = x) is the density of a χ⁻² variable with ν degrees of freedom evaluated at x.
5.2.3 Posterior Distribution of σ₂²/σ₁² and its Relationship to Sampling Theory Results

Before considering the distributions of σ₁² and σ₂² separately, we discuss the distribution of their ratio. Problems can arise where this is the only quantity of interest. For example, in deciding how to allocate a fixed total effort to the sampling and analyzing of a chemical product, one is concerned only with σ₂²/σ₁² and not with σ₁² and σ₂² individually. For mathematical convenience, we work with σ₁₂²/σ₁² = 1 + K(σ₂²/σ₁²) rather than σ₂²/σ₁² itself. In the joint distribution of (σ₁², σ₂²) in (5.2.12), we make the transformation

W = σ₁₂²/σ₁²,   V = σ₁²,

and integrate over V to obtain

p(W | y) = w K^{−1} W^{−(½ν₂+1)} ∫₀^∞ V^{−[½(ν₁+ν₂)+1]} exp[−(1/2V)(ν₁m₁ + ν₂m₂/W)] dV.   (5.2.15)

For fixed W, the integral in (5.2.15) is in the form of an inverted gamma function and can be evaluated explicitly using (A2.1.2) in Appendix A2.1. Upon normalizing, the expression for w already given in (5.2.13) is obtained. The distribution of W can then be written as

p(W | y) = (m₁/m₂) p(F_{ν₁,ν₂} = (m₁/m₂)W) / Pr{F_{ν₂,ν₁} < m₂/m₁},   W > 1,   (5.2.16)

where p(F_{ν₁,ν₂} = x) is the density of an F variable with (ν₁, ν₂) degrees of freedom evaluated at x. Probability statements about the variance ratio σ₂²/σ₁² can thus be made using an F table. In particular, for 0 < η₁ < η₂, the probability that σ₂²/σ₁² falls in the interval (η₁, η₂) is

Pr{η₁ < σ₂²/σ₁² < η₂ | y} = [Pr{F_{ν₂,ν₁} < m₂/[m₁(1 + Kη₁)]} − Pr{F_{ν₂,ν₁} < m₂/[m₁(1 + Kη₂)]}] / Pr{F_{ν₂,ν₁} < m₂/m₁}.   (5.2.17)

Note that the distribution of W is defined over the range W > 1. Therefore, H.P.D. or other Bayesian posterior intervals will not extend over values of W less than one; that is, they will not cover negative values of σ₂²/σ₁².
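The probability statement (5.2.17) involves only the F integral; the following sketch (ours, assuming scipy) evaluates it for the dyestuff data, with interval endpoints chosen purely for illustration.

```python
# Sketch of (5.2.17): posterior probability that sigma2^2/sigma1^2
# lies in (eta1, eta2), computed from the F integral.
from scipy.stats import f

def ratio_posterior_prob(eta1, eta2, m1, m2, nu1, nu2, K):
    F = f(nu2, nu1)                       # F with (nu2, nu1) d.f.
    upper = F.cdf(m2 / (m1 * (1 + K * eta1)))
    lower = F.cdf(m2 / (m1 * (1 + K * eta2)))
    return (upper - lower) / F.cdf(m2 / m1)

# Dyestuff example: m1 = 2451.25, m2 = 11271.50, nu1 = 24, nu2 = 5, K = 5.
print(ratio_posterior_prob(0.1, 2.0, 2451.25, 11271.50, 24, 5, 5))
```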
Now from (5.2.16) we can write for the distribution of (m₂/m₁)W^{−1}

p((m₂/m₁)W^{−1} | y) = p(F_{ν₂,ν₁} = (m₂/m₁)W^{−1}) / Pr{F_{ν₂,ν₁} < m₂/m₁},   W > 1.   (5.2.18)
(5 .2.18)
This result merits careful study. If O'T 2 and 0' 7 were unconstrained by the inequality 0'~ 2 > Ci~, then using an asterisk to denote unconstrained probability densities, we should have (from the results in Section 2.6), p*
(!!2
W-
I
I Y)
(I-'v .v , = ~ w- t ).
= p*
2
I
In]
W > 0,
ml
(5.2.19)
where from the Bayesian viewpoint W is a random variable and 1112 /1111 is a ratio of fixed sample quantities. This probability distribution also has the interpretation that given the fixed ratio W = O'L/O'~, the sampling distribution of (m 2m l )W- 1 follows the Fy"v , distribution . With this interpretation (5 .2. 19) gives the confidence distribution of W from sampling theory. In contrast to the distributions in (5.2.16) and (5.2.18), the confidence distribution in (5.2.19) is defined over the entire range W > O. Confidence intervals for W based on this distribution could thus cover values of W in the interval 0 < W < l,thatis, _K - t < 0'~/0'7 < O. SinceO'~ /0'7 must be nonnegative. this result is certainly objectionable. In comparing (5.2.18) with (5.2.19), we see that the posterior distribution In (5.2.18) contains an additional factor
From (5 .2.19), Pr ftFv, .y, <
ml} =
Pr* { W > I } = Pr* {O'iz > 0'7}.
(5.2.20)
I11 t
The Bayesian result can then be written p (
m2 W-l
I )=
m)
.
Y
P*
[F
1
y 2 ." ,
= (m 2! m l ) W- ]
Pr* { W > 1}
,
W > l.
(5.2.21)
In other words, the constrained distribution appropriate to the variance component problem is the unconstrained F distribution truncated to the left at W = I and normalized by dividing through by the total area of the admissible part. The truncation automatically rules out values of W less than one which are inadmissible on a priori grounds. If we write the factor Pr {Fv" v, < m 2 m l } as 1 - a, then, on sampling theory , a may be interpreted as the significance level of a test of the null hypothesis that 0'72/aT = 1 against the alternative that ai 2/0'7 > I. However, while the interpre-
5.2
255
Bayesian Analysis of Hierarchical Classifications with Two Variance Components
tation of '1. as a significance level is clear enough, it seems difficult to produce any sampling theory justification for its use in the intuitively sensible manner required by the Bayesian analysis. 5.2.4 Joint Posterior Distribution of
ITi
and O'~
Our a posteriori knowledge about the parameters O'T and O'~ appropriate to the reference prior considered is, of course, contained in their joint distribution given by (5.2.14). From discussio n in the preceding section, we can also regard this distribution as the constrained distribution of the variance components (O'i, IT~), the constraint being given by IT~ 2 > IT~. If the constraint were not present, the posterior distribution of (IT~, lTD would be P*(ITT,
IT~ I y) =
(1',111,)-' p
{x:/ = ~} K(v 2m 2)
-1
p
{X:,2
=
ITT
1',111,
ITi
> 0,
IT~ > - -
K
+ KITi} , v2 m 2
ITT.
(5.2.22)
Figures 5.2.1 and 5.2.2 show contours of the joint posterior distributions for the two examples introduced earlier in Sectio n 5.1.2. Dotted lines indicate those parts of the distributions which have been truncated. The examples respectively illustrate situations in whi'ch e~ has a positive and a negative value. Knowledge of the joint distribution of O'T and O'~ clearly allows a much deeper appreciation of the situation than is possible by mere point estimates. Modal Values of the 10illt Posterior Distribulion
On equating partial derivatives to zero, it is easily shown that the mode of the unconstrained distribution (5.2.22) is at
IT~ o =
(5.2.23)
The mode is thus to the right of the line IT~ = 0 provided that 1112
VI (1'2
1111
V2 (VI
->
+ 2) . + 2)
(5.2.24)
In this case, the modes of the posterior distribution in (5.2.14) and of the unconstrained distribution are at the same point. For the dyestuff example, i112/in, = 4.6 which is much greater that 1'1 (1'2 -'- 2)/[v 2 (v( + 2)J = 1.3 so that from (5.2.23) the mode is at (O'To = 2,262.7,0'~o = 1,277.7), as shown by the point P in Fig. 5.2.l. When the inequality (5.2.24) is reversed, it can be readily shown that the mode of the constrained distribution is on the line O'~ = 0 and is al (5.2.25)
tv
U1
0..
~
'" ;:I
Co
0
3
[%l
I
I
;;r<
I
::
~
I I
,
I I
~
0
c.
001
II>
(;;'
I \
o
0.8
1.6
2.4
40
3.2
a~
x 10
4.8
5.6
6.4
7.2
8.0
3
(The contours are labeled by the level of th e density of (of x 10· 3, O~ x 10' J) cal cu lated from (5.2.14). Dotted contour indicates the inadmissible portion of the un constrained d ist ribution .)
Fig. 5.2.1 Contours of the joint distribution of the variance components (CiT, Ci~): the dyestuff example. Ul
N
Ul
tv 5.2
...,.,
t;I:I
to
'" ,;' :::l
.,t
4.4
Q"
'" ;n' ",
3.6
......
o....,
l:
' ... , ,
~.
.,...
,.,
, ... " ..... ,,"
:r
;'
2.8
:::.
n
~
'"'";; ,.,
2.0
~
0' :::l
'" ~. ;;
...,
12
~
o <:
- 1.0
.,...
,;'
b" I
0.4 I
-0.8
-0.6
0.4
-0.2
o
0.2
0.4
a~ x 10
0 .6
08
1.0
.2
1.4
1. 6
I
u,
2.0
",.,to
n o 3
(The conto urs are labeled by the level of the densi ty of (0; , o~) calculated from (5.2 . 14). Dotted contours indicate the inadmissible portion of the unconstrained distribution.)
Fig. 5.2.2 Contours of the joint distribution of the variance components ((J~, (J~): the generated example,
"0
o
:::l
to
:::l
r;r
N Ul -.J
258
Random Effect Models
5.2
For the generated example m2/m, = 0.56 which is less than 1'1(1'2 + 2) / [1' 2(1' , + 2)J = 1.3 and consequently the mode is at (er~ 0 = 12.133, er~o = 0) as illustrated by the point P in Fig. 5.2.2. ff we ignored the constraint er~ > 0, then from (5.2.23) the mode would be at (eria = 13.796 , er~o = - 1.568) as shown by the point p ' in the same figure. A negative value of &~ always leads to a distribution of (er~, er~) with its mode constrained on the line er~ = O. This is because &~ = 0 corresponds to (m2/m,) < 1 which, with VI > v2 , implies that (m 2 /m,) < vl(V z + 2)/[V 2 (V 1 + 2)]. The mode reflects only a single aspect of the di stribution. Inferences can best be made by plotting density contour~, a task readily accomplished with electronic computation.
5.2.5 Distribution of eri The posterior distribution of er~ is obtained by integrating out er~ from the joint distribution (S.2.14) to give
er~ > 0,
(S.2.26)
where
J(eri)
n.
= Pr{X~2 < v1m 2/ er Pr {F" ,\" < m2 /m,}
We sball later in Section S.2.11 discuss an alternative derivation by the method of constrained distributions discussed in Section 1.5. for the moment we notice that if, as before, we denote an unconstrained distribution by an asterisk, then from (S.2.22)
(S .2.27a) and
PI'
{F'2'"
<
m2} m,
= Pr*
{er~
> 0 I y}
= Pr* {C I y},
(S.2.27b)
where C is the constraint eri2 > er~ (or equivalently er~ > 0). We can thus write (S .2.26) as
(S.2.2/<» where
2 Pr* {C I eri, y} J(er,) = Pr* {C I y} . Consider the first factor (v,mJ)-lp(X:/ = eri/ v1m J ) on the right of (5.2.26). In this expression, eri is a random variable and 111, a fixed sample quantity, but since the sampling distribution of Vtln, for any fixed value of er~ is erix~" the same
expression would also supply the usual confidence distribution for σ₁² within the sampling theory framework. In this instance, therefore, the confidence distribution does not correspond to the posterior distribution, because the latter includes an additional modifying factor f(σ₁²). Because this second factor has no counterpart in sampling theory, its presence in the Bayesian result is of special interest. It expresses the additional information about σ₁² which comes from m₂. That some information of this kind exists is obvious on commonsense grounds. For instance, in the dyestuff example, m₂ = 11,271.5 is an estimate of σ₁² + 5σ₂². This tells us, without any reference to m₁, that values of σ₁² say four times as large as 11,271.5 are unlikely. This intuitive argument is given a precise expression in the Bayesian analysis through the modifying factor f(σ₁²). The numerator of f(σ₁²), given by (5.2.27a), is a function of σ₁² depending only on m₂. It is Pr*{C | σ₁², y}, the probability of the constraint σ₁₂² > σ₁² being true for each specific value of σ₁². The denominator, given by (5.2.27b), is Pr*{C | y}, the probability of the same constraint being true over all values of σ₁². It is independent of σ₁² and is merely a normalizing constant. Figures 5.2.3 and 5.2.4 show the effect of the modifying factor for the two sets of data we have previously discussed. Shown in each case are:
a) the unconstrained distribution p*(σ₁² | y); this is also the confidence distribution of σ₁² which would be obtained from sampling theory,
b) the modifying factor f(σ₁²), and
c) the product of (a) and (b), which is the posterior distribution of σ₁².

The roles played by the two factors in determining the distribution of σ₁² depend critically on the relative size of m₁ and m₂. In the first example, m₂ is large compared with m₁, and we see that over the range in which p*(σ₁² | y) is appreciable, f(σ₁²) is relatively flat. Multiplication then produces a p(σ₁² | y) which is close to p*(σ₁² | y), and the modifying factor has little effect on the distribution. For the second example, however, because m₂ is actually less than m₁, the factor f(σ₁²) falls off quite sharply over the range in which p*(σ₁² | y) is appreciable, and multiplication produces a p(σ₁² | y) which is considerably modified.
Fig. 5.2.3 Decomposition of the posterior distribution of σ₁²: the dyestuff example (m₁ = 2,451.25, m₂ = 11,271.50).

Fig. 5.2.4 Decomposition of the posterior distribution of σ₁²: the generated example (m₁ = 14.95, m₂ = 8.34).
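The decomposition shown in these figures can be reproduced pointwise; the following sketch (ours, assuming scipy) evaluates the unconstrained density p*(σ₁² | y), the modifying factor f(σ₁²), and their product for the dyestuff data.

```python
# Sketch: the decomposition p(sigma1^2|y) = p*(sigma1^2|y) * f(sigma1^2).
import numpy as np
from scipy.stats import chi2, f

def decomposition(sig1sq, m1, m2, nu1, nu2):
    # density of sigma1^2 ~ nu1*m1*chi^{-2}_{nu1} (the unconstrained factor)
    p_star = chi2.pdf(nu1 * m1 / sig1sq, nu1) * nu1 * m1 / sig1sq ** 2
    factor = chi2.cdf(nu2 * m2 / sig1sq, nu2) / f.cdf(m2 / m1, nu2, nu1)
    return p_star, factor, p_star * factor

for s in np.linspace(1500.0, 5500.0, 5):
    print(s, decomposition(s, 2451.25, 11271.50, 24, 5))
```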
The decomposition, illustrated in Figs. 5.2.3 and 5.2.4, of the posterior distribution of σ₁² into its two basic components is of interest for another reason. Examples are occasionally met where m₂ is significantly smaller than m₁. In such cases it has been suggested, for example by Anscombe (1948b) and Nelder (1954), in the sampling theory framework, that the model itself should be regarded as suspect. From the Bayesian point of view we are led to the same conclusion. For, when m₂ is very small compared with m₁, the factor f(σ₁²) will fall sharply and the information about σ₁² coming from the two components will be contradictory.
Effects of this kind can in particular be produced by serial correlation between the errors and, as has been pointed out earlier, Bayesian analysis which takes account of this correlation has been carried out.
Sampling Theory Interpretation of the Various Factors

Once more, while it does not seem possible to justify a similar analysis on sampling theory, the meaning which can be attached to the various factors in (5.2.26), from the sampling theory point of view, is interesting. From (5.2.26) we have, as in (5.2.28), p(σ₁² | y) = p*(σ₁² | y) f(σ₁²). Sampling theory inferences about σ₁² are usually made using the fact that

ν₁m₁/σ₁² ~ χ²_{ν₁},   (5.2.29)

where ν₁m₁ is the random variable and σ₁² is the fixed but unknown parameter. The second factor of (5.2.28) may be written

f(σ₁²) = Pr{χ_{ν₂}² < ν₂m₂/σ₁²} / Pr{F_{ν₂,ν₁} < m₂/m₁} = [1 − α(σ₁²)]/(1 − α).   (5.2.30)

In this expression, the quantity α(σ₁²), which is a function of σ₁², is the significance level associated with the test in which the mean square m₂ is employed to test the hypothesis that σ₁₂² = σ₁² against the alternative σ₁₂² > σ₁² for each specified value of σ₁². As mentioned earlier in Section 5.2.3, the quantity α is the significance level associated with the over-all test, based on the mean square ratio m₂/m₁, of the hypothesis that σ₁₂² = σ₁² against the alternative that σ₁₂² > σ₁², where σ₁² is not specified.
Direct Calculation of p(σ₁² | y)

The distribution of σ₁² in (5.2.26) is equal to the product of two factors, the first being the density of a χ⁻² variable with ν₁ degrees of freedom and the second, the ratio of a χ² probability integral and an F integral. The density function is therefore readily calculated using tables or charts of the χ² density and integral, and of the F integral. The distribution can also be expressed as a weighted series of χ² densities. Details are given in Appendix A5.1.
5.2.6 A Scaled χ⁻² Approximation to the Distribution of σ₁²

Although the density function of σ₁² can be calculated directly from (5.2.26), for many practical purposes, and in particular to study the problem of "pooling of variance estimates," it is useful to use an approximation involving only a single χ² variable.
Writing

z = ν₁m₁/σ₁²   (5.2.31)

and applying the identity (A5.2.1) in Appendix A5.2, we find directly from (5.2.12) that the rth moment of z is

E(z^r) = 2^r [Γ(½ν₁ + r)/Γ(½ν₁)] Pr{F_{ν₂,ν₁+2r} < [(ν₁ + 2r)/ν₁](m₂/m₁)} / Pr{F_{ν₂,ν₁} < m₂/m₁}.   (5.2.32)

Further, the moment generating function is

M_z(t) = (1 − 2t)^{−½ν₁} Pr{F_{ν₂,ν₁} < (m₂/m₁)(1 − 2t)^{−1}} / Pr{F_{ν₂,ν₁} < m₂/m₁},   |t| < ½.   (5.2.33)

Consider the two extreme cases, when m₂/m₁ is very large and when it is close to zero. In the first case, M_z(t) tends in the limit to (1 − 2t)^{−½ν₁}, so that the distribution of z tends to the χ² distribution with ν₁ degrees of freedom. When m₂/m₁ tends to zero, it is easy to verify by applying L'Hospital's rule that

lim M_z(t) = (1 − 2t)^{−½(ν₁+ν₂)},   (5.2.34)

so that the distribution of z again tends to the χ² distribution, but with ν₂ additional degrees of freedom. The preceding discussion suggests that the distribution of z could be well approximated by a scaled χ² variable. By equating the first two moments of z to those of aχ_b², we find

a = ½ [(ν₁ + 2) I_x(½ν₂, ½ν₁ + 2)/I_x(½ν₂, ½ν₁ + 1) − ν₁ I_x(½ν₂, ½ν₁ + 1)/I_x(½ν₂, ½ν₁)],
b = ν₁ I_x(½ν₂, ½ν₁ + 1) / [a I_x(½ν₂, ½ν₁)],   (5.2.35)

where

x = ν₂m₂ / (ν₁m₁ + ν₂m₂)

and I_x(p, q) is the incomplete beta integral. To this degree of approximation,

ν₁m₁/σ₁² ≈ aχ_b²,   (5.2.36)
where, as before, the symbol "≈" means "approximately distributed as." In terms of σ₁² we thus have

σ₁² ≈ (ν₁m₁/a) χ_b^{−2}.   (5.2.37)

It can be verified that 0 < a < 1 and b > ν₁. To illustrate this approximation, we show below in Table 5.2.2 the exact and approximate densities for the two sets
Comparison of the exact density
p(O'i I y) with approximate density obtained from v 1 m 1 /O'i ,..c, ax; Generated example
Dyestuff example
O'f x 10- 3
Exact density x 10 3
Approximate density x 10 3
1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00 5.25 5.50
0.0018 0.0350 0.1652 0.3666 0.5282 0.5854 0.5499 0.4639 0.3646 0.2733 0.1986 0.1413 0.0992 0.0691 0.0479 0.0332 0.0230 0.0160 0.0111
0.0017 0.0345 0.1647 0.3674 0.5300 0.5870 0.5504 0.4633 0.3633 0.2718 0.1973 0.1403 0.0986 0.0688 0.0479 0.0333 0.0232 0.0162 0.0114
O'i x 10 - 1 0.60 0.80 1.00 1.20 1.40 1.60 1.80 . 2.00 2.20 2.40 2.60 2.80 3.00 3.20 3.40 3.60 3.80 4.00
Exact density xl0 0.0064 0.2099 0.7959 1.1605 1.0530 0.7432 0.4573 0.2611 0.1435 0.0775 0.0416 0.0225 0.0122 0.0067 0.0038 0.0021 0.0012
0.0007
Approximate density xlO 0.0062 0.2097 0.7975 1.1610 1.0518 0.7418 0.4566 0.2611 0.1437 0.0778 0.0419 0.0227 0.0124 0.0068 0.0038 0.0022 0.0013 0.0007
It will be recalled from (5.2.26) that if the additional information coming from m₂ were ignored, then for each set of data we would have ν₁m₁/σ₁² ~ χ²₂₄. In fact we find

Dyestuff example:   ν₁m₁/σ₁² ≈ 0.9912 χ²_{24.26},
Generated example:  ν₁m₁/σ₁² ≈ 0.9426 χ²_{28.67}.

As expected, the modification which occurs in the dyestuff example is very slight, whereas that for the generated example is considerably greater.
5.2.7 A Bayesian Solution to the Pooling Dilemma

On sampling theory, when m₁ and m₂ are of about the same size, they are often pooled. The idea is that if it can be assumed that σ₂² is zero, the appropriate estimate of σ₁² is (ν₁m₁ + ν₂m₂)/(ν₁ + ν₂) and not m₁. Thus:

a) if m₂/m₁ is large, the estimate of σ₁² is taken to be m₁, distributed as σ₁²χ²_{ν₁}/ν₁,
b) if m₂/m₁ is close to one, the estimate of σ₁² is taken to be (ν₁m₁ + ν₂m₂)/(ν₁ + ν₂), distributed as σ₁²χ²_{ν₁+ν₂}/(ν₁ + ν₂).

Another way of saying this would be that the estimate is taken to be

(ν₁m₁ + λν₂m₂)/(ν₁ + ων₂),

with the weights λ and ω such that

a) λ = ω = 0 if m₂/m₁ is large,
b) λ = ω = 1 if m₂/m₁ is close to one.

Now, from the Bayesian viewpoint, the distribution of σ₁² may be approximated, as in (5.2.36), by ν₁m₁/σ₁² ≈ aχ_b². This approximation may equally well be written in terms of the pooled estimate as

(ν₁m₁ + λν₂m₂)/σ₁² ≈ χ²_{ν₁+ων₂},   (5.2.38a)

or, in terms of sums of squares, as

(S₁ + λS₂)/σ₁² ≈ χ²_{ν₁+ων₂},   (5.2.38b)

where

λ = ν₁(1 − a)m₁/(aν₂m₂)   and   ω = (b − ν₁)/ν₂.   (5.2.38c)
Since 0 < a < 1 and b > ν₁, the weights λ and ω are both positive. For illustration, consider first the dyestuff example. Here the mean square ratio (m₂/m₁ = 4.6) is quite large, and the corresponding weights (λ = 0.009, ω = 0.052) are therefore small. The contribution of m₂ for this example may thus be ignored for practical purposes, and the posterior distribution and the sampling theory confidence distribution essentially coincide. For the generated example, however, the mean square ratio (m₂/m₁ = 0.56) is small and the corresponding weights (λ = 0.524, ω = 0.934) are no longer negligible. In this case, then, the contribution of m₂ to the posterior distribution of σ₁² is considerable. These two posterior distributions have of course already been shown in Figs. 5.2.3 and 5.2.4 and their derivation discussed. The present excursion merely provides an alternative but interesting means of viewing them. Using (5.2.35) with (5.2.38c), λ and ω may be plotted as functions of m₂/m₁ for any J and K. Figure 5.2.5 shows such graphs for the case J = 6 and K = 5 corresponding to the examples we have considered. As might be expected, for large values of m₂/m₁, λ and ω approach zero, but as m₂/m₁ becomes smaller, larger values of λ and ω, calling for increasing degrees of pooling, occur.
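Given a and b, the pooling weights follow at once from (5.2.38c); a sketch (ours, reusing scaled_chi2_fit from the sketch above):

```python
# Sketch: the pooling weights (5.2.38c) from the fitted (a, b).
def pooling_weights(m1, m2, nu1, nu2):
    a, b = scaled_chi2_fit(m1, m2, nu1, nu2)   # defined in the sketch above
    return nu1 * (1 - a) * m1 / (a * nu2 * m2), (b - nu1) / nu2

print(pooling_weights(2451.25, 11271.50, 24, 5))   # about (0.009, 0.052)
print(pooling_weights(14.9459, 8.3363, 24, 5))     # about (0.524, 0.934)
```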
Fig. 5.2.5 Values of (λ, ω) as functions of m₂/m₁ (J = 6, K = 5).
It will be noticed that ω is larger than λ for all values of m₂/m₁. To see the reason for this, consider the case where m₁ and m₂ are equal, so that m₂/m₁ = 1. While m₁ provides an estimate of σ₁² alone, m₂ is an estimate of σ₁₂² = σ₁² + Kσ₂², and σ₂² must be positive or zero. Thus with m₂/m₁ = 1 the evidence coming from m₂ and m₁ together implies a smaller value for σ₁² than would have been suggested by m₁ alone. This is consonant with the fact that, for m₂/m₁ = 1, ω is larger than λ, and corresponding compensation would be expected, and is found, for other values of m₂/m₁.
Thus, in the Bayesian context there is no dilemma "to pool or not to pool." The distribution of σ₁² always depends on m₂ as well as on m₁, as obviously it should. No "decision" needs to be made; instead there is a steady pooling transition which depends on the evidence supplied by the sample.
5.2.8 Posterior Distribution of σ₂²

We now turn to the problem of making inferences about σ₂², which in the context of sampling theory is accompanied by many difficulties. The posterior distribution of σ₂² is obtained by integrating (5.2.14) over σ₁² to yield

p(σ₂² | y) = [K(ν₁m₁)^{−1}(ν₂m₂)^{−1} / Pr{F_{ν₂,ν₁} < m₂/m₁}] ∫₀^∞ p(χ_{ν₂}^{−2} = (σ₁² + Kσ₂²)/(ν₂m₂)) p(χ_{ν₁}^{−2} = σ₁²/(ν₁m₁)) dσ₁²,
σ₂² > 0.   (5.2.39)
For the noninformative reference prior distribution (5.2.9), the distribution p(σ₂² | y) summarizes all our a posteriori knowledge about σ₂². It is defined over the range (0, ∞), and thus no problem of a "negative variance estimate" will ever arise. The exact posterior distributions of σ₂² for the dyestuff example and for the generated example are given by the solid curves in Figs. 5.2.6 and 5.2.7, respectively.
Some properties of the distribution of σ₂² are discussed in Appendix A5.3. In particular, it is shown that

a) if

m₂/m₁ > [(ν₂ + 2)/ν₂] [ν₁/(ν₁ + 2)],

then the distribution is monotonically increasing in the interval 0 < σ₂² < c₁ and monotonically decreasing in the interval σ₂² > c₂, where c₁ and c₂ are the bounds given in (5.2.40),
so that the mode must lie in the interval c₁ < σ₂² < c₂; and

b) if

m₂/m₁ < [(ν₂ + 2)/ν₂] [ν₁/(ν₁ + ν₂ + 4)],

then the distribution is monotonically decreasing in the entire range σ₂² > 0 and the mode is at the origin.

For the dyestuff example, m₂/m₁ = 4.60, which is greater than [ν₁/(ν₁ + 2)][(ν₂ + 2)/ν₂] = 1.29, so that the mode must lie in the interval c₁ < σ₂² < c₂, where c₁ = 684.08 and c₂ = 1,253.67. Inspection of Fig. 5.2.6 shows that the mode is approximately at σ₂² = 1,100. For the generated example, m₂/m₁ = 0.558, which is less than [(ν₂ + 2)/ν₂][ν₁/(ν₁ + ν₂ + 4)] = 1.02. Thus, as shown in Fig. 5.2.7, the distribution is J-shaped, having its mode at the origin and monotonically decreasing in the entire range σ₂² > 0.

Fig. 5.2.6 Posterior distribution of σ₂²: the dyestuff example (solid curve: exact; dashed curve: approximate).

Fig. 5.2.7 Posterior distribution of σ₂²: the generated example (solid curve: exact; dashed curve: approximate).
A Simple X2 Approximation It is seen from the posterior distribution of (J~ and (J~ in (5.2.14) that if (J~ were known, say (JT = (JT o, then the conditional distribution of CT~ would be in the form of a truncated "inverted" X2 distribution. That is,
(J~
> 0,
(5.2.41)
so that the random variable
(5.2.42) is distributed as X- 2 with V 2 degrees of freedom truncated from (JiO /(V2 m2). Although the marginal posterior distribution p«(J~ I y) cannot be expressed exactly
~-
-
~
~-
-
--
-
5.2
Bayesian Analysis of Hierarchical Classifications with Two Variance Components
269
in terms of such a simple form, nevertheless, the expression of the integral in (5.2.39) does suggest that, for large v u 2, we can write
fo p(X:,2 =(Ji +mK(J~)p (x~/ =~) d(~) a
Y2
vim i
l
ylm l
(5.2.43) where the substitution z = Yl m 1: (2CJi) is made and t = Yr/2 is the value of z which maximizes the factor zi V1 ,2) e- c appearing in the integrand. Making use of (5.2.43) and upon renormalizing, we thus arrive at a simple truncated inverted / approximating form for the posterior distribution of
(JL
u~ > O.
(5.2.44)
or equivalently, -2
+ Kui)
In!
K 21) P( X2 = P Inl ~ Y == ( 2m V 2 Pr f X;,2 > V
l
Y2
ln
m[
+
K(J~
1111
- --->--.
2
~}
V2
m2
V2
m2
(5.2.45)
V 2 ln 2
In words, we have that (mr -+- KuD /(v 2 m 2 ) is approximately distributed as X- 2 with V 2 degrees of freedom truncated from below at i11! / (V 2 lnl)' Or equivalently, v1 m 2/ (m r + Ku~) is distributed approximately as X2 with V2 degrees of freedom truncated from above at V 2 ln 2 :m!. Posterior probabilities for u~ can thus be approximately determined from an ordinary X2 table. In particular, for I'} > 0
Pr {O <
u~ <
Pr I'}
I y} ==
v2m2 {
m!
+ KI'}
Pr
V2 m 2 }
2
< Xv , < - -
{X;'2 <
m]
-
V~~2
(5 .2.46)
}
The posterior densities for d for each of the two examples obtained from this approximation are shown by the dotted curves in Figs. 5.2,6 and 5.2.7. It is seen that this simple approximation gives satisfactory results even for the rather small sample sizes considered.
270
Random Effect Models
5.2
Table 5.2.3 Comparison of exact and approximate posterior density of O"~t a) Dyesluff Example p(d I y) O"~
x 10 3
(1)
(2)
(3)
1°
1-1
{- 2
0.0 0.2 0.4 0.6 0.8
0.0070 0.0591 0.1518 0.2386 0.2948
0 .0246 0.0865 0.1694 0.2455 0.2952
0.0332 0.085J 0.1665 0.2433 0.2938
0.0290 0.0841 0.1667 0.2440 0.2946
1.0 1.1 1.3 1.4 1.5 1.6 1.7 1.8 1.9
0.3200 0.3233 0.3221 0.3173 0.3099 0.3006 0.2899 0.2784 0.2664 0.2543
0.3169 0 .3193 0.3175 0.3123 0.3047 0.2953 0.2847 0.2733 0.26J5 0.2495
0.3161 0.3187 0.3J 70 0.3120 0.3045 0.2952 0.2846 0.2732 0.2615 0.2495
0.3170 0.3196 0.3179 0.3128 0.3053 0.2959 0.2853 0.2740 0.2621 0.2502
2.0 2.2 2.4 2.6 2.8
0.2422 0.2187 0.1967 0.1766 0.1584
0.2376 0.2145 0.1930 0.1732 0.1554
0.2376 0.2146 0.1930 0.1733 0.1555
0.2383 0.2151 0.1935 0.1737 0.1559
3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
0.142J 0.1089 0.0843 0.0661 0.0524 0.0421 0.0342 0.0281 0.0232 0.0194 0.0164
0.1395 0.1070 0.0829 0.0650 0.0516 0.0415 0.0337 0.0277 0.0230 0.0192 0.0162
0.1395 0.1070 0.0829 0.0650 0.0516 0.0415 0.0337 0.0277 0.0230 0 .0192 0.0162
0.1399 0.1073 0.0831 0.0652 0.0518 0.0416 0.0338 0.0278 0.0230 0.0193 0.0162
x 10- 3
1.2
(4) Exact
S.2
Bayesian Analysis of Hierarchical Classifications with T wo Variance Components
271
Table 5.2.3 (continued)
b) Generated Example p((J~
I y) x
10
(J~ X lO-1
(1) to
(2) t -I
(3) t- 2
(4) Exact
0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45
5.3566 3.8069 2.7671 2.0563 j .5595 1.2046 0.9460 0.7538 0.6087 0.4974
5.5170 3.8943 2.7997 2.058 1 1.5463 1.1853 0.9251 0 .7336 0.590J 0.4807
5.4635 3.8829 2.8003 2.0623 1.5514 1.1902 0.9294 0.7373 0 .5932 0.4833
5.4795 3.8897 2.8039 2.0644 1.5527 1.1911 0.9301 0.7379 0.5937 0.4837
0.50 0.60 0 .70 0.80 0.90
0.4107 0.2881 0.2086 0.1552 0.1181
0.3961 0.2769 0.2002 0.1488 0 .1132
0.3982 0.2784 0.2012 0.1495 0.1137
0.3986 0.2787 0.2014 0.1497 0.1138
1.00 1.20 1.40 1.60 l.80 2.00
0.0916 0.0579 0.0386 0 .0268 0.0193 0.0143
0.0878 0.0556 0.037J 0.0258 0.0186 0.0138
0.0882 0.0558 0.0372 0 .0259 0.0187 O.OJ 38
0.0883 0 .0559 0.0373 0.0259 0.0187 0.0138
t Accuracy of vari ous approximations [see expression (/\5.4.7) in Appendix A5.4 for the formulas). (I) Asymptotic expansion with the leading term.
(2) Asymptotic expansion with terms to order (3) AsymptOtic expansion with terms to order (4) Exact evaluation by numerical integration.
I" I
l 2
An Asymptotic Expansion

The argument which led to (5.2.44) can be further exploited to yield an asymptotic expansion of p(σ₂² | y) in powers of t^{−1} = (ν₁/2)^{−1}, following the method discussed, for example, in Jeffreys and Swirles (1956) and De Bruijn (1961). Details of the derivation of the asymptotic series are given in Appendix A5.4. The series is such that the simple approximation in (5.2.44) is the leading term. Table 5.2.3 shows, for the two examples, the degree of improvement that can be obtained by including terms up to order t^{−1} and up to order t^{−2}, respectively.
The Distribution of σ₂² When the Mean Square Ratio is Close to Zero

Finally, we remark here that in the extreme situation when m₂/m₁ approaches zero, the posterior distribution of Kσ₂²/m₂ becomes diffuse, a result noted by Hill (1965). This behavior is reflected by the approximating form (5.2.44), from which it will be seen that as m₂/m₁ tends to zero, the distribution of Kσ₂²/m₂ approaches a horizontal line. In the limit, then, all values of Kσ₂²/m₂ become equally probable and no information about σ₂² is gained. It has been noted in Section 5.2.5 that if m₂/m₁ is close to zero, this will throw doubt on the adequacy of the model. In a Bayesian analysis the contradiction between model and data is evidenced by the conflict between p*(σ₁² | y) and f(σ₁²). The distribution of Kσ₂²/m₂ thus becomes diffuse precisely as overwhelming evidence becomes available that the basis on which this distribution is computed is unsound. If assumptions are persisted in, in spite of evidence that they are false, bizarre conclusions can be expected.
Summarized calculations for approximating the posterior distributions of (uT, (illustrated using dyestuff data) 1.
Use standard analysis of variance (Table 5.2.1) to compute 1111
2.
=
2,451.25,
111 2
K
= 11 ,271.50 ,
=
5.
Use (5.2.35) to compute a = 0.9912,
x = 0.4893, 3.
u~ )
b = 24.26.
Then, using (5.2 .37) the posterior distribution of the "within " component
ai is given
by and for this example ()'~
4.
rV
59 ,352.30X2L 6'
Also from (5.2.45) posterior distribution of the "between" component (J~ is given by Inl
+ Kai
- - - - rV
- 2
XV2
V 2 1112
truncated from below at
For this example, 0.043
truncated from below at 0.043.
+ (0 .89 x
1O-4a~)
rV
XS2
--
5.2
~---
-
Bayesian Analysis of Hierarchical Classifications with Two Variance Components
-
273
5.2.10 A Summary of Approximations to the Posterior Distribution of (σ₁², σ₂²)

Using the dyestuff data for illustration, Table 5.2.4 above provides a summary of the calculations needed to approximate the posterior distributions of σ₁² and σ₂².

5.2.11 Derivation of the Posterior Distribution of Variance Components Using Constraints Directly

It is of some interest to deduce directly the distributions of the variance components by employing the result (1.5.3) on constrained distributions. We have seen that the data can be thought of as consisting of J independent observations from a population N(θ, σ₁₂²/K) and J(K − 1) further independent observations from a population N(0, σ₁²). Suppose that, as before, we assume that the joint prior distribution of (θ, log σ₁², log σ₁₂²) is locally uniform, but now we temporarily omit the constraint C: σ₁₂² > σ₁². Then σ₁² and σ₁₂² would be independent a posteriori. Also, if, as before, p*(x) denotes the unconstrained density of x, then
pee,
ai
ai,
ai 2)
ai>
(5.2.47)
0,
(5 .2.48)
W > 0,
where W
= ai2/ai.
(5.2.49)
It follows that (5.2 .50)
(5.2.5\) and (5 .2.52) Using (1.5.3) the joint posterior dist ribution of (aT, aiz) given the constraint Cis
aiz >
at > o. (5.2.53)
-
--
---~~
Random Effect Models
274
5.2
Noting that given (ai, aiz) such that ai2 > a7, Pr"'{C I a~, aiz, y} = 1, we have
2
p (aj,
= ai / (vjm j)] (v 2m 2)-1 p [X:/
2 I ) _ (vjml)-I p [X;,2 al2 Y -
=
aT2 t (v2 m2)]
Pr {Fy"v, < m2/ml}
ai2 > a~ > 0,
,
(5.2.54)
which is, of course, equivalent to the posterior distribution of (ai, aD in (5.2.14) if we remember that ai2 = d + Ka~. The distribution of the ratio W given C is, by a similar argument,
I y) Pr'" {C I W, y} Pr"'{Cly}
p"'(W p(Wly)=
(m l !m 2 )p(F Y"
Y2
=
(mdm2) W)
Pr {F."y, < m2/ml}
W> I,
(5.2.55)
where Pi'" (C I W, y) = !. This is the same distribution obtained earlier in (5.2.16). Further, the distribution of ai subject to the constraint C is 2
) _
p(ally -
p"'(ai I y) Pr'" {C I ai, y} Pr*{Cly}
(vjm l )
-1 P (XVI 2=
af) Pr {2Xv, < v2 m
2 } -2-
--
Vjmj
0'1
= ------------------------ Pr
{F.
2,vl
<
:~}
which is identical to expression (5.2.26) . distribution of ai 2 given C is p
(
aT
> 0,
(5.2.56)
Finally, by the same argument , the
a2 I ) _ p*(ai21 y) Pr* {C I ai2' y} 12
Y -
Pr* {C I y}
aT
2
> 0.
(5.2.57)
Certain properties of this distribution are discussed in the next section.
5.2.12 Posterior Distribution of ai2 and a Scaled X- 2 Approximation Earlier we saw that not only mj but also m 2 contained information about the component ai. In a similar way we find that both mj and m 2 contain information about ai 2' While this parameter is not usually of direct interest for the twocomponent model, we need to study its distribution and methods of approximating
5.2
Bayesian Analysis of Hierarchical Classifications with Two Variance Components
275
it for more complicated random effect models. From (5.2.57) it is seen that the posterior distribution of a~2 is proportional to the product of two factors. The first factor is a X- l density function while the second factor is the right tail area ofaXl distribution . This is in contrast to the posterior distribution of aT in (5 .2.56), where the probability integral is a left tail probability of a / variable. Using the identity (A5.2.1) in Appendix A5.2, the rth moment of z 1 = v2 m z/uT 2 is
r(V2+r)pr{Fvl+zrV' < E(z~)
=
2r
2
'v
. .
r
(;1)
V
2
z
+ 2r
ml} m 1
(5.2.58)
Pr {Fv"v, < ::}
and the moment generating function is
Mzj(t)
=
(I - 2/)- }v, Pr {F ,v, < (ml/m 1 )(1 - 2t)}, Pr {FV1,v, < m 2/m 1 } V2
It I < -}.
(5.2.59)
When 111 2 /m 1 approaches infinity, Mz,(t) tends to (I - 2t)-),'2 so that the distribution of Z 1 tends to the X2 distribution with V2 degrees of freedom. This suggests that, for moderately large values of m2/ml, the distribution of Zl = V2ml/d1 might well be approximated by that of a scaled / variable, say ex;. By equating the first two moments, we find
Vz e = (2
+
) IxCtVl 1
+ 2, -tv l ) -
IxCi-v2 + l,-!Vl)
V2 IxC-}V1 + 1, i-vI) - - - ----,-2 IxC-tv1, ! v 1 )
(5.2.60)
d = ~ IxC1- vz + 1, -tv 1 ) c IxC fv2,1Vl) , where we recall from (5.2.35) that x of approximation,
=
V2ml'(v1ml
or
+
vlm l ). Thus, to this degree
(5.2.61)
and values of c and d can be readily determined from tables of the incomplete beta function. Jn practice, agreement between the exact distri bution and the approximation is found to be close provided m 2 / 111 1 is not too small. Ta}:>le 5.2.5 gives specimens of the exact and the approximate densities for the dyestuff and the generated examples. The agreement for the dyestuff example is excellent. For the generated example, for which m2/ml = 0.56, the approximation is rather poor. The reason for this can be seen by studying the limiting behavior of the moment generating function of Z 1 in (5.2.59) when m1/m 1 -> O. Applying L ' Hospital's rule, we find lim .'v1 z ,(t) = I, "l2/»IL-0
276
Random Effect Models
5.3
which is far from the moment generating function ofaX2 distribution. Thus , we would not expect the approximation (5 .2.61) to work for small values of m2 I m! .
Table 5.2.5 Comparison of the exact density p(aTzl Y) with approximate density obtained from
v2 m2 /(JTz
rV
ai 2/1000 3.0 5.0 7.0 9.0 11.0 13.0 15.0 17.0 \9.0 21.0 25.0 30.0 35.0
cXJ Dyestuff example Approximate Exact 0.0041 0 .0399 0.0626 0.0640 0.0557 0.0461 0.0372 0.0300 0.0242 0.0196 0.0132 0.0084 0 .0054
0.0047 0.0385 0.0619 0.0639 0.0564 0.0468 0.0379 0.0304 0.0245 0.0198 0.0133 0.0084 0.0053
(Ji2 / 1O 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 3.0 3.5 4.0 4.5
Generated example Approximate Exact 0.0172 0.1233 0.3074 0.4543 0.5101 0.4936 0.4409 0.3776 0.3170 0.1836 0.1198 0.0812 0.0571
0.0396 0.1435 0.2774 0.3852 0.4427 0.4541 0.4336 0.3955 0.3501 0.2217 0.1456 0.0955 0.0633
5.2.13 Use of Features of the Posterior Distribution to Supply Sampling Theory Estimates

In recent years, sampling theorists have used certain features of Bayesian posterior distributions to produce "point estimators" which are then judged on their sampling properties. We briefly discuss this approach in Appendix A5.6.
5.3 A THREE-COMPONENT HIERARCHICAL DESIGN MODEL

The results obtained in the preceding sections may be generalized in various ways. In the remainder of this chapter an extension from the two-component to the three-component hierarchical design model is studied. In Chapters 6 and 7, certain two-way cross classification random effect models and mixed models are considered. The following development of the three-component model is broadly similar to that for two components. However, the reader's attention is particularly directed to the study in Section 5.3.4 of the relative contribution of variance components. Here new features arise, and study of an example leads to somewhat surprising and disturbing conclusions which underline the danger of using point estimates.
For convenient reference we repeat here the specification of the three-component model already given at the beginning of this chapter. It is supposed that

  y_ijk = θ + e_i + e_ij + e_ijk,  i = 1, ..., I;  j = 1, ..., J;  k = 1, ..., K,   (5.3.1)

where y_ijk are the observations, θ is a common location parameter, and e_i, e_ij and e_ijk are three different kinds of random effects. We further assume that the random effects (e_i, e_ij, e_ijk) are all independent and that

  e_i ~ N(0, σ₃²),  e_ij ~ N(0, σ₂²),  and  e_ijk ~ N(0, σ₁²).   (5.3.2)
It follows in particular that Var(y_ijk) = σ₁² + σ₂² + σ₃², so that the parameters (σ₁², σ₂², σ₃²) are the variance components. As previously mentioned, a hierarchical design of this type might be used in an industrial experiment to investigate the variation in the quality of a product from a batch chemical process. One might randomly select I batches of material, take J samples from each batch, and perform K repeated analyses on each sample. A particular result y_ijk would then be subject to three sources of variation: that due to batches, e_i; that due to samples, e_ij; and that due to analyses, e_ijk. The purpose of an investigation of this kind could be (a) to make inferences about the relative contribution of each source of variation to the total variance, or (b) to make inferences about the variance components individually.

5.3.1 The Likelihood Function

To obtain the likelihood function, it is convenient to work with the transformed variables ȳ_i.., ȳ_ij. − ȳ_i.., and y_ijk − ȳ_ij. rather than the observations y_ijk themselves. By repeated application of Theorem 5.2.1 (on page 250) the quantities
  ȳ_i.. = θ + e_i + ē_i. + ē_i..,
  ȳ_ij. − ȳ_i.. = (e_ij − ē_i.) + (ē_ij. − ē_i..),   (5.3.3)
  y_ijk − ȳ_ij. = e_ijk − ē_ij.

are distributed independently and

  ȳ_i.. ~ N(θ, σ₁₂₃²/JK),
  K Σ Σ (ȳ_ij. − ȳ_i..)² ~ σ₁₂² χ²_{I(J−1)},   (5.3.4)
  Σ Σ Σ (y_ijk − ȳ_ij.)² ~ σ₁² χ²_{IJ(K−1)},

where

  σ₁₂² = σ₁² + Kσ₂²  and  σ₁₂₃² = σ₁² + Kσ₂² + JKσ₃²,   (5.3.5)

as defined in Table 5.1.1.
Further, insofar as y_ijk − ȳ_ij. and ȳ_ij. − ȳ_i.. are concerned, Σ Σ Σ (y_ijk − ȳ_ij.)² and K Σ Σ (ȳ_ij. − ȳ_i..)² are sufficient for σ₁² and σ₁₂², respectively. Thus, the likelihood function is

  l(θ, σ₁², σ₁₂², σ₁₂₃² | y) ∝ (σ₁²)^(−ν₁/2) (σ₁₂²)^(−ν₂/2) (σ₁₂₃²)^(−(ν₃+1)/2) exp{ −½ [ν₁m₁/σ₁² + ν₂m₂/σ₁₂² + (ν₃m₃ + IJK(ȳ... − θ)²)/σ₁₂₃²] },   (5.3.6a)

where (m₁, m₂, m₃) and (ν₁, ν₂, ν₃) are the mean squares and degrees of freedom given in Table 5.1.1, that is,

  ν₁ = IJ(K−1),  ν₂ = I(J−1),  ν₃ = I − 1,
  m₁ = S₁/ν₁ = Σ Σ Σ (y_ijk − ȳ_ij.)² / ν₁,
  m₂ = S₂/ν₂ = K Σ Σ (ȳ_ij. − ȳ_i..)² / ν₂,   (5.3.6b)
  m₃ = S₃/ν₃ = JK Σ (ȳ_i.. − ȳ...)² / ν₃.
The likelihood function in (5.3.6a) can be regarded as having arisen from I independent observations from N(θ, σ₁₂₃²/JK), together with I(J−1) independent observations from N(0, σ₁₂²/K) and a further IJ(K−1) independent observations from N(0, σ₁²). From (5.3.5) the parameters (σ₁², σ₁₂², σ₁₂₃²) are subject to the constraint

  σ₁₂₃² > σ₁₂² > σ₁² > 0.   (5.3.7)

On standard sampling theory, the unbiased estimators for (σ₁², σ₂², σ₃²) are

  σ̂₁² = m₁,  σ̂₂² = (m₂ − m₁)/K,  σ̂₃² = (m₃ − m₂)/JK.   (5.3.8)
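As a concrete illustration of (5.3.6b) and (5.3.8), here is a minimal sketch (the function name and array layout are our own) computing the mean squares, degrees of freedom and unbiased estimators from a data array indexed as y[i, j, k]:

```python
import numpy as np

def hierarchical_anova(y):
    """Mean squares and d.f. of (5.3.6b) plus the unbiased estimators (5.3.8)."""
    I, J, K = y.shape
    ybar_ij = y.mean(axis=2)                 # cell means  ybar_{ij.}
    ybar_i = ybar_ij.mean(axis=1)            # batch means ybar_{i..}
    ybar = ybar_i.mean()                     # grand mean
    S1 = ((y - ybar_ij[..., None]) ** 2).sum()
    S2 = K * ((ybar_ij - ybar_i[:, None]) ** 2).sum()
    S3 = J * K * ((ybar_i - ybar) ** 2).sum()
    nu1, nu2, nu3 = I * J * (K - 1), I * (J - 1), I - 1
    m1, m2, m3 = S1 / nu1, S2 / nu2, S3 / nu3
    sig1, sig2, sig3 = m1, (m2 - m1) / K, (m3 - m2) / (J * K)
    return (m1, m2, m3), (nu1, nu2, nu3), (sig1, sig2, sig3)
```

Applied to the data of Table 5.3.1 below, this reproduces the analysis of variance in Table 5.3.2.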
5.3.2 Joint Posterior Distribution of (σ₁², σ₂², σ₃²)

As before, the posterior distribution of the parameters is obtained relative to the noninformative prior distribution p(θ, log σ₁², log σ₁₂², log σ₁₂₃²) ∝ constant. If there were no constraint, then, a posteriori, (σ₁², σ₁₂², σ₁₂₃²) would be independently distributed as (ν₁m₁)χ⁻²_{ν₁}, (ν₂m₂)χ⁻²_{ν₂}, and (ν₃m₃)χ⁻²_{ν₃}, respectively. Using (1.5.3),
with the constraint C: σ₁² < σ₁₂² < σ₁₂₃², the joint posterior distribution is

  p(σ₁², σ₁₂², σ₁₂₃² | y) = p*(σ₁², σ₁₂², σ₁₂₃² | C, y)
    = p*(σ₁² | y) p*(σ₁₂² | y) p*(σ₁₂₃² | y) × Pr*{C | σ₁², σ₁₂², σ₁₂₃², y} / Pr*{C | y},   (5.3.9)

where, as before, an asterisk indicates an unconstrained distribution. Since the factor Pr*{C | σ₁², σ₁₂², σ₁₂₃², y} takes the value unity when the parameters satisfy the constraint and zero otherwise, it serves to restrict the distribution to that region in the parameter space where the constraint is satisfied. Thus,
  p(σ₁², σ₁₂², σ₁₂₃² | y) = p*(σ₁² | y) p*(σ₁₂² | y) p*(σ₁₂₃² | y) / Pr*{C | y},  σ₁₂₃² > σ₁₂² > σ₁² > 0,   (5.3.10)

with

  Pr*{C | y} = Pr{ χ²_{ν₁}/χ²_{ν₂} > ν₁m₁/(ν₂m₂),  χ²_{ν₂}/χ²_{ν₃} > ν₂m₂/(ν₃m₃) },
where in the denominator (χ²_{ν₁}, χ²_{ν₂}, χ²_{ν₃}) are independently distributed χ² variables with (ν₁, ν₂, ν₃) degrees of freedom, respectively. From (5.3.10), making the transformation σ₁₂² = σ₁² + Kσ₂² and σ₁₂₃² = σ₁² + Kσ₂² + JKσ₃² (with Jacobian JK²), the joint posterior distribution of the variance components (σ₁², σ₂², σ₃²) is

  p(σ₁², σ₂², σ₃² | y) = JK² p*(σ₁² | y) p*(σ₁₂² = σ₁² + Kσ₂² | y) p*(σ₁₂₃² = σ₁² + Kσ₂² + JKσ₃² | y) / Pr*{C | y},
  σ₁² > 0, σ₂² > 0, σ₃² > 0.   (5.3.11)
It can be verified that Pr*{C | y} can be expressed in terms of incomplete beta functions,   (5.3.12)

where I_x(p, q) and B(p, q) are the usual incomplete and complete beta functions, respectively. We now discuss various aspects of the variance component distribution.
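The structure of (5.3.10) suggests a simple numerical check by rejection sampling: draw the three variances from their unconstrained scaled inverse-χ² posteriors and retain only the draws satisfying C. A sketch (sample size, seed and names are arbitrary):

```python
import numpy as np
rng = np.random.default_rng(1)

def constrained_draws(nu, m, n=200_000):
    """Draw (sigma1^2, sigma12^2, sigma123^2) from their unconstrained scaled
    inverse chi-square posteriors and keep draws satisfying C of (5.3.10)."""
    s1, s12, s123 = (nui * mi / rng.chisquare(nui, n) for nui, mi in zip(nu, m))
    keep = (s1 < s12) & (s12 < s123)
    return s1[keep], s12[keep], s123[keep], keep.mean()  # last value ~ Pr*{C|y}
```

The retained draws, transformed by σ₂² = (σ₁₂² − σ₁²)/K and σ₃² = (σ₁₂₃² − σ₁₂²)/JK, are (approximately) draws from (5.3.11).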
5.3.3 Posterior Distribution of the Ratio σ₁₂²/σ₁²

In deriving the distributions in (5.3.10) and (5.3.11), use has been made of the fact that the variances (σ₁², σ₁₂², σ₁₂₃²) are subject to the constraint C: σ₁₂₃² > σ₁₂² > σ₁². To illustrate the role played by this constraint, we now obtain the posterior distribution of the ratio σ₁₂²/σ₁² and show its connection with sampling theory results. For the two-component model the posterior distribution of the ratio σ₁₂²/σ₁², which we there called W, was discussed in Section 5.2.3. In the present case the posterior distribution of σ₁₂²/σ₁² can be obtained from (5.3.10) by making a transformation from (σ₁², σ₁₂², σ₁₂₃²) to (σ₁₂²/σ₁², σ₁², σ₁₂₃²) and integrating out σ₁² and σ₁₂₃². Alternatively, since
  p(σ₁₂²/σ₁² | y) = p*(σ₁₂²/σ₁² | y) × Pr*{C | σ₁₂²/σ₁², y} / Pr*{C | y},   (5.3.13)

it follows that

  p(σ₁₂²/σ₁² | y) = (m₁/m₂) p[F_{ν₂,ν₁} = (σ₁₂²/σ₁²)(m₁/m₂)] H(σ₁₂²/σ₁²),   (5.3.14)

where H(σ₁₂²/σ₁²) = Pr*{C | σ₁₂²/σ₁², y} / Pr*{C | y}. The first factor, (m₁/m₂) p[F_{ν₂,ν₁} = (σ₁₂²/σ₁²)(m₁/m₂)], is the unconstrained distribution of σ₁₂²/σ₁², and the factor H(σ₁₂²/σ₁²) represents the effect of the constraint
C. Now, we can regard C as the intersection of the two constraints C₁: σ₁₂² > σ₁² and C₂: σ₁₂₃² > σ₁₂². Thus, we can write

  Pr*{C | σ₁₂²/σ₁², y} / Pr*{C | y} = [Pr*{C₁ | σ₁₂²/σ₁², y} / Pr*{C₁ | y}] × [Pr*{C₂ | C₁, σ₁₂²/σ₁², y} / Pr*{C₂ | C₁, y}],   (5.3.15)

from which it can be verified that

  p(σ₁₂²/σ₁² | y) = (m₁/m₂) p[F_{ν₂,ν₁} = (σ₁₂²/σ₁²)(m₁/m₂)] H₁(σ₁₂²/σ₁²) H₂·₁(σ₁₂²/σ₁²),   (5.3.16)

with H₁ and H₂·₁ the factors induced by C₁ and C₂ respectively, the common denominator being

  Pr{ χ²_{ν₁}/χ²_{ν₂} > ν₁m₁/(ν₂m₂),  χ²_{ν₂}/χ²_{ν₃} > ν₂m₂/(ν₃m₃) }.

As in the two-component model, the effect of the constraint C₁: σ₁² < σ₁₂² is to truncate the unconstrained distribution from below at the point σ₁₂²/σ₁² = 1, and H₁(σ₁₂²/σ₁²) is precisely the normalizing constant induced by the truncation. The distribution in (5.3.16), however, contains a further factor H₂·₁(σ₁₂²/σ₁²) which is a monotonic decreasing function of the ratio σ₁₂²/σ₁². Thus, the "additional" effect of the second constraint C₂: σ₁₂² < σ₁₂₃² is to pull the posterior distribution of σ₁₂²/σ₁² towards the left of the distribution for which C₂ is ignored and to reduce the spread. In other words, for η > 1,
  Pr{ 1 < σ₁₂²/σ₁² < η | y } ≥ Pr{ m₁/m₂ < F_{ν₂,ν₁} < η m₁/m₂ } / Pr{ F_{ν₂,ν₁} > m₁/m₂ }.   (5.3.17)
The extent of the influence of H₁ and H₂·₁ depends of course upon the two mean squares ratios m₂/m₁ and m₃/m₂. In general, when m₂/m₁ is large, the effect of H₁ will be small, and for given m₂/m₁, a large m₃/m₂ will produce a small effect from H₂·₁. On the other hand, when the values of m₂/m₁ and m₃/m₂ are moderate, the combined effect can be appreciable. For illustration, consider the set of data given in Table 5.3.1. The observations were generated from a table of random Normal deviates using the model (5.3.1) with (θ = 0, σ₁² = 1.0, σ₂² = 4.0, σ₃² = 2.25, I = 10, J = K = 2). The relevant sample quantities are summarized in Table 5.3.2.
Table 5.3.1 Data generated from a table of random Normal deviates for the three-component model

  θ = 0, σ₁² = 1.0, σ₂² = 4.0, σ₃² = 2.25, I = 10, J = K = 2,
  so that σ₁₂² = 9.0, σ₁₂₃² = 18,
  σ₁²/(σ₁²+σ₂²+σ₃²) = 13.8%, σ₂²/(σ₁²+σ₂²+σ₃²) = 55.2%, σ₃²/(σ₁²+σ₂²+σ₃²) = 31.0%.

            j = 1                j = 2
   i     k = 1    k = 2      k = 1    k = 2
   1     2.713    2.004      0.603    0.252
   2     4.229    4.342      3.344    3.057
   3    -2.621    0.869     -3.896   -3.696
   4     4.185    3.531      1.722    0.380
   5     4.271    2.579     -2.101    0.651
   6    -1.003   -1.404     -0.775   -2.202
   7    -0.208   -1.676     -9.139   -8.653
   8     2.426    1.670      1.834    1.200
   9     3.527    2.141      0.462    0.665
  10    -0.596   -1.229      4.471    1.606
Table 5.3.2 Analysis of variance for the generated data

     S.S.           d.f.       M.S.
  S₃ = 240.28     ν₃ =  9    m₃ = 26.70
  S₂ = 122.43     ν₂ = 10    m₂ = 12.29
  S₁ =  20.87     ν₁ = 20    m₁ =  1.04

  σ̂₁² = 1.04,  σ̂₂² = 5.62,  σ̂₃² = 3.60,  σ̂₁² + σ̂₂² + σ̂₃² = 10.26,

  1.04/10.26 = 10.1 per cent,  5.62/10.26 = 54.7 per cent,  3.60/10.26 = 35.1 per cent.
The solid curve in Fig. 5.3.1 shows for this example the posterior distribution of σ₁₂²/σ₁² when C₁ and C₂ are both included. The broken curve in the same figure is the unconstrained posterior distribution. For this example, m₂/m₁ is large, so that the effect of C₁ is negligible and the broken curve is essentially also the posterior distribution of σ₁₂²/σ₁² with the constraint C₁. The effect of C₂ is, however, more appreciable. The moderate size of m₃/m₂ implies that σ₁₂²/σ₁² may be slightly smaller than would otherwise be expected.
Fig. 5.3.1 Posterior distribution of σ₁₂²/σ₁² for the generated data: exact (constrained) distribution shown solid, unconstrained distribution shown broken.
We can now relate the above result to the sampling theory solution to this problem. It is well known that the sample quantity m₁/m₂ is distributed as (σ₁²/σ₁₂²)F_{ν₁,ν₂}. Thus, the unconstrained posterior distribution (m₁/m₂) p[F_{ν₂,ν₁} = (σ₁₂²/σ₁²)(m₁/m₂)] is also the confidence distribution of σ₁₂²/σ₁². Inference procedures based upon this confidence distribution are not satisfactory. In the first place the distribution extends from the origin to infinity, so that the lower confidence limit for σ₁₂²/σ₁² = 1 + K(σ₂²/σ₁²) can be smaller than unity and the corresponding limit for σ₂²/σ₁² less than zero. In addition, the sampling procedure fails to take into account the information coming from the distribution of the mean square m₃, which, in the Bayesian approach, is included through the constraint σ₁₂₃² > σ₁₂².
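The defect just noted is easy to exhibit numerically. A sketch of the equal-tail sampling-theory limits (the function and argument names are ours); whenever m₂/m₁ is small enough, the lower limit returned is negative:

```python
from scipy.stats import f

def sampling_conf_limits(alpha, nu1, m1, nu2, m2, K):
    """Equal-tail confidence limits for sigma2^2/sigma1^2 based on
    m2/m1 ~ (sigma12^2/sigma1^2) F(nu2, nu1)."""
    ratio = m2 / m1
    lo = ratio / f.ppf(1 - alpha / 2, nu2, nu1)     # limit for sigma12^2/sigma1^2
    hi = ratio / f.ppf(alpha / 2, nu2, nu1)
    return (lo - 1) / K, (hi - 1) / K               # limits for sigma2^2/sigma1^2
```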
5.3.4 Relative Contribution of Variance Components

One feature of the posterior distribution of (σ₁², σ₂², σ₃²) which is often of interest to the investigator is the relative contribution of the variance components to the
total variance of y_ijk. Inferences can be drawn by considering the joint distribution of any two of the three ratios

  r₁ = σ₁²/σ²,  r₂ = σ₂²/σ²,  r₃ = σ₃²/σ²,  where σ² = σ₁² + σ₂² + σ₃².

Since r₁ = 1 − r₂ − r₃, we make the transformation in (5.3.11) from (σ₁², σ₂², σ₃²) to (σ², r₂, r₃) and integrate out σ² to obtain

  p(r₁, r₂, r₃ | y) = N r₁^(−(ν₁/2+1)) (r₁ + Kr₂)^(−(ν₂/2+1)) (r₁ + Kr₂ + JKr₃)^(−(ν₃/2+1))
      × [ν₁m₁/r₁ + ν₂m₂/(r₁ + Kr₂) + ν₃m₃/(r₁ + Kr₂ + JKr₃)]^(−(ν₁+ν₂+ν₃)/2),   (5.3.19)

  r₁ + r₂ + r₃ = 1,  rᵢ > 0,

where N is the normalizing constant.
This distribution is analytically complicated. However, for a given set of data, contours of the density function can be plotted. Using the example in Table 5.3.1, three contours of the posterior distribution of (r₁, r₂, r₃) are shown in Fig. 5.3.2. To allow for the constraint r₁ + r₂ + r₃ = 1, the contours were drawn on a tri-coordinate diagram. The mode of the distribution is at approximately the point P = (r₁₀ = 0.075, r₂₀ = 0.425, r₃₀ = 0.500). The 50%, 70% and 90% contours in the figure were drawn such that

  50%: log p(r₁₀, r₂₀, r₃₀ | y) − log p(r₁, r₂, r₃ | y) = 0.69 = ½χ²(2, 0.5),
  70%: log p(r₁₀, r₂₀, r₃₀ | y) − log p(r₁, r₂, r₃ | y) = 1.20 = ½χ²(2, 0.3),
  90%: log p(r₁₀, r₂₀, r₃₀ | y) − log p(r₁, r₂, r₃ | y) = 2.30 = ½χ²(2, 0.1),

and such that the density of every point included exceeds that of every point excluded. That is, they are the boundaries of H.P.D. regions with approximate probability contents 50, 70, and 90 percent, respectively.

Fig. 5.3.2 Contours of the posterior distribution of (r₁, r₂, r₃): the generated data.

Figure 5.3.2 provides us with a very illuminating picture of the inferential situation for this example. It is seen that on the one hand rather precise inferences are possible about r₁ (the percentage contribution of σ₁²). In particular, nearly 90% of the probability mass is contained in the interval 0.05 < r₁ < 0.25. On the other hand, however, the experiment has provided remarkably little information
on the relative importance of σ₂² and σ₃² in accounting for the remaining variation. Thus, values of the ratio r₂/r₃ = σ₂²/σ₃² ranging from ½ to 9 are included in the 90 percent H.P.D. region. The diagram clearly shows that while we are on reasonably firm ground in making statements about the relative contributions of σ₁² on the one hand and (σ₂² + σ₃²) on the other to the total variance, we can say very little about the relative contributions of σ₂² and σ₃² to their total contribution (σ₂² + σ₃²). There would be need for additional data before any useful conclusions could be drawn on the relative contribution of these two components. The result is surprising and disturbing because the data come from what we would have considered a reasonably well designed experiment. It underlines the danger we run into if we follow the common practice of considering only point
estimates of variance components. Within the sampling theory framework such estimates of the relative contributions of (σ₁², σ₂², σ₃²) may be obtained by taking the ratios of the unbiased estimators (σ̂₁², σ̂₂², σ̂₃²), r̂ᵢ = σ̂ᵢ²/(σ̂₁² + σ̂₂² + σ̂₃²). The resulting ratios correspond to the point P₁ on the diagram. For large samples, the point P₁ would tend to the maximum likelihood estimate, and the asymptotic confidence regions obtained from Normal theory would tend to the H.P.D. regions in the Bayesian framework. The small sample properties of the estimates (r̂₁, r̂₂, r̂₃) are, however, far from clear.
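The contour construction of Fig. 5.3.2 can be reproduced by evaluating the log posterior density on a grid and thresholding it at ½χ²(2, α) below the modal value. A sketch, assuming the form of (5.3.19) reconstructed above:

```python
import numpy as np
from scipy.stats import chi2

def log_post_r(r1, r2, nu, m, K, JK):
    """Log of the joint density (5.3.19) of (r1, r2, r3), up to a constant;
    r3 = 1 - r1 - r2; nu and m are the (nu_t) and (m_t) triples."""
    r3 = 1.0 - r1 - r2
    s1, s12, s123 = r1, r1 + K * r2, r1 + K * r2 + JK * r3
    lg = (-(0.5 * nu[0] + 1) * np.log(s1)
          - (0.5 * nu[1] + 1) * np.log(s12)
          - (0.5 * nu[2] + 1) * np.log(s123))
    A = nu[0] * m[0] / s1 + nu[1] * m[1] / s12 + nu[2] * m[2] / s123
    return lg - 0.5 * sum(nu) * np.log(A)

# H.P.D. boundaries: points whose log density lies within these drops of the
# modal log density (0.69, 1.20, 2.30, matching the calibration in the text).
drops = {c: 0.5 * chi2.ppf(c, df=2) for c in (0.5, 0.7, 0.9)}
```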
5.3.5 Posterior Distribution of σ₁²

We now begin to study the marginal posterior distributions of the variance components (σ₁², σ₂², σ₃²), from which inferences about the individual components can be drawn. We shall develop an approximation method by which all the distributions involved here can be reduced to the χ⁻² forms discussed earlier for the two-component model. This method not only gives an approximate solution to the present three-component hierarchical model, but can also be applied to the general q-component hierarchical model, where q is any positive integer. For the posterior distribution of σ₁², we have from (5.3.10) or (5.3.11)

  p(σ₁² | y) = ζ⁻¹ (ν₁m₁)⁻¹ p(χ⁻²_{ν₁} = σ₁²/(ν₁m₁)) Pr*{C | σ₁², y},  σ₁² > 0,   (5.3.21)

where ζ = Pr*{C | y} and

  Pr*{C | σ₁², y} = Pr{ ν₂m₂ χ⁻²_{ν₂} > σ₁²,  ν₃m₃ χ⁻²_{ν₃} > ν₂m₂ χ⁻²_{ν₂} }.
The distribution of σ₁² is thus proportional to the product of two factors, the first being a χ⁻² density with ν₁ degrees of freedom and the other a double integral of χ² variables. The first factor, (ν₁m₁)⁻¹ p(χ⁻²_{ν₁} = σ₁²/(ν₁m₁)), is the unconstrained distribution of σ₁², and the integral on the right of (5.3.21) is induced by the constraint C. Because this integral is a monotonic decreasing function of σ₁², the effect of the constraint is, as is to be expected, to pull the distribution towards the left of the unconstrained χ⁻² distribution. Exact evaluation of the distribution is tedious even on an electronic computer. But to approximate the distribution, one may employ the scaled χ² approach by equating the first two moments of
ν₁m₁/σ₁² to those of an aχ²_b variable. Using the identity (A5.2.2) in Appendix A5.2, we find that the rth moment of ν₁m₁/σ₁² is

  E[(ν₁m₁/σ₁²)^r | y] = ζ⁻¹ 2^r [Γ(½ν₁+r)/Γ(½ν₁)] Pr{ χ²_{ν₁+2r}/χ²_{ν₂} > ν₁m₁/(ν₂m₂),  χ²_{ν₂}/χ²_{ν₃} > ν₂m₂/(ν₃m₃) },   (5.3.22)

where (χ²_{ν₁+2r}, χ²_{ν₂}, χ²_{ν₃}) in the numerator are independent χ² variables with the indicated degrees of freedom. From (5.3.12), evaluation of (a, b) would thus involve calculating bivariate Dirichlet integrals. Although this approach simplifies the problem somewhat, existing methods for approximating such integrals (see, for example, Tiao and Guttman (1965)) still seem too complicated for routine practical use.
A Two-stage Scaled χ⁻² Approximation

We now describe a two-stage scaled χ⁻² approximation method. First, consider the joint distribution of (σ₁², σ₁₂²). Integrating out σ₁₂₃² in (5.3.10), we get

  p(σ₁², σ₁₂² | y) = ζ⁻¹ Pr{F_{ν₃,ν₂} < m₃/m₂} (ν₁m₁)⁻¹ p(χ⁻²_{ν₁} = σ₁²/(ν₁m₁)) G(σ₁₂²),  σ₁₂² > σ₁² > 0,   (5.3.23)

where

  G(σ₁₂²) = (ν₂m₂)⁻¹ p(χ⁻²_{ν₂} = σ₁₂²/(ν₂m₂)) Pr{χ²_{ν₃} < ν₃m₃/σ₁₂²} / Pr{F_{ν₃,ν₂} < m₃/m₂}

and ζ is defined in (5.3.21). The quantity G(σ₁₂²) is in precisely the same form as the posterior distribution of σ₁² in (5.2.26) for the two-component model. Regarding the function G(σ₁₂²) as if it were the distribution of σ₁₂², we can then employ the scaled χ⁻² method developed in Section 5.2.6 to approximate it by the distribution of ν₂′m₂′χ⁻²_{ν₂′}, where

  a₁ = [(ν₂+2) I_{x₁}(½ν₃, ½ν₂+2) I_{x₁}(½ν₃, ½ν₂) − ν₂ I_{x₁}(½ν₃, ½ν₂+1)²] / [2 I_{x₁}(½ν₃, ½ν₂+1) I_{x₁}(½ν₃, ½ν₂)],

  ν₂′ = (ν₂/a₁) I_{x₁}(½ν₃, ½ν₂+1) / I_{x₁}(½ν₃, ½ν₂),   (5.3.24)

  m₂′ = m₂ I_{x₁}(½ν₃, ½ν₂) / I_{x₁}(½ν₃, ½ν₂+1),
and

  x₁ = ν₃m₃ / (ν₂m₂ + ν₃m₃).
To this degree of approximation, the distribution in (5.3.23) can be written

  p(σ₁², σ₁₂² | y) ≐ ζ′ (ν₁m₁)⁻¹ p(χ⁻²_{ν₁} = σ₁²/(ν₁m₁)) (ν₂′m₂′)⁻¹ p(χ⁻²_{ν₂′} = σ₁₂²/(ν₂′m₂′)),  σ₁₂² > σ₁² > 0.   (5.3.25)
The effect of the integration is thus essentially to change the mean square m₂ to m₂′ and the degrees of freedom ν₂ to ν₂′. Noting that σ₁₂² = σ₁² + Kσ₂², it follows that

  p(σ₁², σ₂² | y) ≐ ζ″ K (ν₁m₁)⁻¹ p(χ⁻²_{ν₁} = σ₁²/(ν₁m₁)) (ν₂′m₂′)⁻¹ p(χ⁻²_{ν₂′} = (σ₁² + Kσ₂²)/(ν₂′m₂′)) / Pr{F_{ν₂′,ν₁} < m₂′/m₁},
  σ₁² > 0, σ₂² > 0,   (5.3.26)

where ζ″ is a normalizing constant.
If we ignore the constant ζ″, the distribution in (5.3.26) is of exactly the same form as the posterior distribution in (5.2.14). We can therefore apply the results for the two-component model to deduce the marginal posterior distributions of σ₁² and σ₂² for our three-component model. In particular, for the error variance σ₁², we may employ a second scaled χ⁻² approximation so that σ₁² is approximately distributed as ν₁′m₁′χ⁻²_{ν₁′}, where

  a₂ = [(ν₁+2) I_{x₂}(½ν₂′, ½ν₁+2) I_{x₂}(½ν₂′, ½ν₁) − ν₁ I_{x₂}(½ν₂′, ½ν₁+1)²] / [2 I_{x₂}(½ν₂′, ½ν₁+1) I_{x₂}(½ν₂′, ½ν₁)],

  ν₁′ = (ν₁/a₂) I_{x₂}(½ν₂′, ½ν₁+1) / I_{x₂}(½ν₂′, ½ν₁),   (5.3.27)

  m₁′ = m₁ I_{x₂}(½ν₂′, ½ν₁) / I_{x₂}(½ν₂′, ½ν₁+1),

and x₂ = ν₂′m₂′/(ν₁m₁ + ν₂′m₂′). Thus, making use of an incomplete beta function table, the quantities (a₁, ν₂′, m₂′; a₂, ν₁′, m₁′) can be conveniently calculated, from which the posterior distribution of σ₁² is approximately determined. For the generated example, we find (a₁ = 0.865, ν₂′ = 12.35, m₂′ = 11.51; a₂ = 1.0, ν₁′ = 20.0, m₁′ = 1.04), so that a posteriori the variance σ₁² is approximately distributed as (20.8)χ⁻² with 20 degrees of freedom. Since for this example (ν₁ = 20, m₁ = 1.04), the effect of the constraint C: σ₁₂₃² > σ₁₂² > σ₁² is thus negligible, and inference about σ₁² can be based upon
the unconstrained distribution (ν₁m₁)⁻¹ p(χ⁻²_{ν₁} = σ₁²/(ν₁m₁)). This is to be expected because for this example the mean square m₂ is much larger than the mean square m₁.
Pooling of Variance Estimates for the Three-component Model

The two-stage scaled χ⁻² approximation leading to (5.3.27) can alternatively be put in the form

  σ₁² ≈ ν₁′m₁′ χ⁻²_{ν₁′},  with  ν₁′ = ν₁ + w₁ν₂ + w₂ν₃  and  ν₁′m₁′ = m₁(ν₁ + λ₁ν₂ + λ₂ν₃),   (5.3.28)

where the weights (w₁, w₂, λ₁, λ₂) are functions of the mean squares and degrees of freedom through the incomplete beta ratios above. This expression can be thought of as the Bayesian solution to the "pooling" of the three mean squares (m₁, m₂, m₃) in estimating the component σ₁². For illustration, suppose we have an example for which I = 10, J = K = 2 and the three mean squares (m₁, m₂, m₃) are equal. We find λ₁ = 0.48, λ₂ = 0.157, w₁ = 0.754 and w₂ = 0.48. Thus,

  m₁′/m₁ = (ν₁ + 0.48ν₂ + 0.157ν₃) / (ν₁ + 0.754ν₂ + 0.48ν₃) = 0.82,

so that

  σ₁²/m₁ ~ (0.82)(32.06) χ⁻²_{32.06}.

In particular, the mean and variance of σ₁²/m₁ can be read off from this scaled χ⁻² distribution. By contrast, the evidence from m₁ alone would have given σ₁² ~ ν₁m₁χ⁻²_{ν₁} = 20 m₁ χ⁻²₂₀. Thus, the additional evidence about σ₁² coming from m₂ and m₃ indicates that σ₁² is considerably smaller than would have been expected using m₁ alone. In addition, the variance of the distribution of σ₁²/m₁ is seen to be only about half of what it would have been had only the evidence from m₁ been used.
5.3.6 Posterior Distribution of σ₂²

Using the approximation leading to (5.2.44), it follows from the joint distribution of (σ₁², σ₂²) in (5.3.26) that the posterior distribution of σ₂² is approximately

  p(σ₂² | y) ≐ K (ν₂′m₂′)⁻¹ p[χ⁻²_{ν₂′} = (m₁ + Kσ₂²)/(ν₂′m₂′)] / Pr{χ²_{ν₂′} < ν₂′m₂′/m₁},  σ₂² > 0.   (5.3.29)

To this degree of approximation, then, the quantity (m₁ + Kσ₂²)/(ν₂′m₂′) has the χ⁻² distribution with ν₂′ degrees of freedom truncated from below at m₁/(ν₂′m₂′). For the example in Table 5.3.1, the quantity 0.007 + 0.014σ₂² thus behaves like a χ⁻² variable with 12.35 degrees of freedom truncated from below at 0.007. Posterior intervals for σ₂² can then be calculated from a table of χ² probabilities.
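Posterior probabilities from a truncated χ⁻² form such as (5.3.29) reduce to ordinary χ² probabilities. A sketch (the helper name is ours):

```python
from scipy.stats import chi2

def trunc_prob(s, m1, K, nu2p, m2p):
    """Pr{sigma2^2 <= s | y} under (5.3.29): q = (m1 + K*s)/(nu2'*m2') is
    chi^-2 with nu2' d.f., truncated below at lo = m1/(nu2'*m2')."""
    lo = m1 / (nu2p * m2p)
    q = (m1 + K * s) / (nu2p * m2p)
    # Q = 1/chi2 <= q  <=>  chi2 >= 1/q; renormalize by Pr{Q >= lo}
    num = chi2.cdf(1.0 / lo, nu2p) - chi2.cdf(1.0 / q, nu2p)
    return num / chi2.cdf(1.0 / lo, nu2p)
```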
5.3.7 Posterior Distribution of σ₃²

From the joint posterior distribution of (σ₁², σ₂², σ₃²) in (5.3.11), we may integrate out (σ₁², σ₂²) to get the posterior distribution of σ₃²,

  p(σ₃² | y) = ∬ p(σ₁², σ₂², σ₃² | y) dσ₁² dσ₂²,  σ₃² > 0.   (5.3.30)

The density function is a double integral which does not seem expressible in terms of simple forms. To obtain an approximation, we shall first consider the joint posterior distribution of (σ₁₂², σ₁₂₃²) and then deduce the result by making use of the fact that σ₁₂₃² = σ₁₂² + JKσ₃². From (5.3.10), we obtain

  p(σ₁₂², σ₁₂₃² | y) = ζ‴ (ν₃m₃)⁻¹ p(χ⁻²_{ν₃} = σ₁₂₃²/(ν₃m₃)) G₁(σ₁₂²),  σ₁₂₃² > σ₁₂² > 0,   (5.3.31)

where

  G₁(σ₁₂²) = (ν₂m₂)⁻¹ p(χ⁻²_{ν₂} = σ₁₂²/(ν₂m₂)) Pr{χ²_{ν₁} > ν₁m₁/σ₁₂²} / Pr{F_{ν₂,ν₁} < m₂/m₁},

ζ‴ = ζ⁻¹ Pr{F_{ν₂,ν₁} < m₂/m₁}, and ζ is given in (5.3.21). The function G₁(σ₁₂²) is in exactly the same form as the distribution of σ₁₂² in (5.2.57) for the two-component model. Provided m₂/m₁ is
not too small, we can make use of the approximation method in (5.2.60) to get

  G₁(σ₁₂²) ≐ (ν₂″m₂″)⁻¹ p(χ⁻²_{ν₂″} = σ₁₂²/(ν₂″m₂″)),   (5.3.32)

where

  a₃ = [(ν₂+2) I_{x₃}(½ν₂+2, ½ν₁) I_{x₃}(½ν₂, ½ν₁) − ν₂ I_{x₃}(½ν₂+1, ½ν₁)²] / [2 I_{x₃}(½ν₂+1, ½ν₁) I_{x₃}(½ν₂, ½ν₁)],

  ν₂″ = (ν₂/a₃) I_{x₃}(½ν₂+1, ½ν₁) / I_{x₃}(½ν₂, ½ν₁),  m₂″ = m₂ I_{x₃}(½ν₂, ½ν₁) / I_{x₃}(½ν₂+1, ½ν₁),   (5.3.33)

and

  x₃ = ν₂m₂ / (ν₁m₁ + ν₂m₂).
It follows that the posterior distribution of σ₃² is approximately

  p(σ₃² | y) ≐ JK (ν₃m₃)⁻¹ (ν₂″m₂″)⁻¹ [Pr{F_{ν₃,ν₂″} < m₃/m₂″}]⁻¹ ∫₀^∞ p(χ⁻²_{ν₂″} = σ₁₂²/(ν₂″m₂″)) p(χ⁻²_{ν₃} = (σ₁₂² + JKσ₃²)/(ν₃m₃)) dσ₁₂².   (5.3.34)
The reader will note that the distribution in (5.3.34) is in precisely the same form as the posterior distribution of σ₂² in (5.2.39) for the two-component model. Provided ν₂″ is moderately large, we can thus employ (5.2.44) to express the distribution of σ₃² approximately in the form of a truncated χ⁻² distribution,

  p(σ₃² | y) ≐ JK (ν₃m₃)⁻¹ p[χ⁻²_{ν₃} = (m₂″ + JKσ₃²)/(ν₃m₃)] / Pr{χ²_{ν₃} < ν₃m₃/m₂″},  σ₃² > 0.   (5.3.35)
To this degree of approximation, the quantity (m₂″ + JKσ₃²)/(ν₃m₃) is distributed as a χ⁻² variable with ν₃ degrees of freedom truncated from below at m₂″/(ν₃m₃), from which Bayesian intervals for σ₃² can be determined. For the set of data in Table 5.3.1, we find (a₃ = 1.0, ν₂″ = 10.0, m₂″ = 12.29), so that the quantity 0.051 + 0.017σ₃² is approximately distributed as χ⁻² with 9 degrees of freedom truncated from below at the point 0.051.
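The same helper sketched after (5.3.29) serves for (5.3.35), with m₂″, JK and (ν₃, m₃) playing the roles of m₁, K and (ν₂′, m₂′); for instance, for the generated data:

```python
# reuse trunc_prob from the sketch in Section 5.3.6; here "K" stands for J*K
p = trunc_prob(s=5.0, m1=12.29, K=4, nu2p=9, m2p=26.70)  # Pr{sigma3^2 <= 5 | y}
```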
5.3.8 A Summary of the Approximations to the Posterior Distributions of (σ₁², σ₂², σ₃²)

For convenience in calculation, Table 5.3.3 provides a short summary of the quantities needed for approximating the individual posterior distributions of the
variance components (σ₁², σ₂², σ₃²). The numerical values shown are those for the set of data introduced in Table 5.3.1.

Table 5.3.3 Summarized calculations for approximating the posterior distributions of (σ₁², σ₂², σ₃²) (illustrated using the data in Table 5.3.1)

1. Use Table 5.1.1, or expression (5.3.6b), to compute
   ν₁ = 20,  ν₂ = 10,  ν₃ = 9,
   m₁ = 1.04,  m₂ = 12.29,  m₃ = 26.70,
   K = 2,  JK = 4.

2. Use (5.3.24) to determine
   x₁ = 0.662,  a₁ = 0.865,  ν₂′ = 12.35,  m₂′ = 11.51.

3. Use (5.3.27) to determine
   x₂ = 0.872,  a₂ = 1.0,  ν₁′ = 20.0,  m₁′ = 1.04.

4. Use (5.3.33) to determine
   x₃ = 0.855,  a₃ = 1.0,  ν₂″ = 10.0,  m₂″ = 12.29.

5. Then, for inference about σ₁²:
   σ₁² ~ ν₁′m₁′ χ⁻²_{ν₁′},  that is,  σ₁² ~ (20.8) χ⁻²₂₀.

6. For inference about σ₂²:
   (m₁ + Kσ₂²)/(ν₂′m₂′) ~ χ⁻²_{ν₂′} truncated from below at m₁/(ν₂′m₂′),
   that is,  0.007 + 0.014σ₂² ~ χ⁻²_{12.35} truncated from below at 0.007.

7. For inference about σ₃²:
   (m₂″ + JKσ₃²)/(ν₃m₃) ~ χ⁻²_{ν₃} truncated from below at m₂″/(ν₃m₃),
   that is,  0.051 + 0.017σ₃² ~ χ⁻²₉ truncated from below at 0.051.
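Steps 2 and 3 of Table 5.3.3 require only a few incomplete beta evaluations. The sketch below assumes the moment-matching formulas for (5.3.24) and (5.3.27) as reconstructed above; for the generated data it should reproduce approximately (x₁, a₁, ν₂′, m₂′) = (0.662, 0.865, 12.35, 11.51) and (x₂, a₂, ν₁′, m₁′) = (0.872, 1.0, 20.0, 1.04):

```python
from scipy.special import betainc

def table_5_3_3(nu, m):
    """Steps 2-3 of Table 5.3.3; nu = (nu1, nu2, nu3), m = (m1, m2, m3)."""
    nu1, nu2, nu3 = nu
    m1, m2, m3 = m
    # step 2, eq. (5.3.24): absorb (nu3, m3) into (nu2', m2')
    x1 = nu3 * m3 / (nu2 * m2 + nu3 * m3)
    I = [betainc(0.5 * nu3, 0.5 * nu2 + r, x1) for r in range(3)]
    R1, R2 = I[1] / I[0], I[2] / I[0]
    a1 = ((nu2 + 2) * R2 - nu2 * R1**2) / (2 * R1)
    nu2p, m2p = nu2 * R1 / a1, m2 / R1
    # step 3, eq. (5.3.27): pool (nu2', m2') with (nu1, m1) for sigma1^2
    x2 = nu2p * m2p / (nu1 * m1 + nu2p * m2p)
    T = [betainc(0.5 * nu2p, 0.5 * nu1 + r, x2) for r in range(3)]
    T1, T2 = T[1] / T[0], T[2] / T[0]
    a2 = ((nu1 + 2) * T2 - nu1 * T1**2) / (2 * T1)
    nu1p, m1p = nu1 * T1 / a2, m1 / T1
    return (x1, a1, nu2p, m2p), (x2, a2, nu1p, m1p)

# table_5_3_3((20, 10, 9), (1.04, 12.29, 26.70))
```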
5.4 GENERALIZATION TO q-COMPONENT HIERARCHICAL DESIGN MODEL

The preceding analysis of the three-component model, and in particular the approximation methods for the posterior distributions of the individual variance components, can be readily extended to the general q-component hierarchical model

  y_{i_q...i_1} = θ + e_{i_q} + e_{i_q i_{q-1}} + ... + e_{i_q...i_t} + ... + e_{i_q...i_1},
  i_1 = 1, ..., I_1;  ...;  i_q = 1, ..., I_q,   (5.4.1)

where y_{i_q...i_1} are the observations, θ is a common location parameter, and e_{i_q}, ..., e_{i_q...i_t}, ..., e_{i_q...i_1} are q different kinds of random effects. Assuming that these effects are Normally and independently distributed with zero means and variances (σ_q², ..., σ_t², ..., σ₁²), and following the argument in Sections 5.2 and 5.3, it is readily shown that the joint posterior distribution of the variance components is

  p(σ₁², ..., σ_q² | y) = I₁^(q-1) I₂^(q-2) ⋯ I_{q-1} p(σ₁², σ₁₂², ..., σ²_{1...t}, ..., σ²_{1...q} | y),
  σ₁² > 0, ..., σ_q² > 0,   (5.4.2)

where

  σ₁₂² = σ₁² + I₁σ₂², ..., σ²_{1...t} = σ²_{1...(t-1)} + I₁⋯I_{t-1}σ_t², ..., σ²_{1...q} = σ²_{1...(q-1)} + I₁⋯I_{q-1}σ_q²,

and

  p(σ₁², σ₁₂², ..., σ²_{1...q} | y) = [∏_{t=1}^q (ν_t m_t)⁻¹ p(χ⁻²_{ν_t} = σ²_{1...t}/(ν_t m_t))]
      × [Pr{ χ²_{ν₁}/χ²_{ν₂} > ν₁m₁/(ν₂m₂), ..., χ²_{ν_{q-1}}/χ²_{ν_q} > ν_{q-1}m_{q-1}/(ν_q m_q) }]⁻¹,
  0 < σ₁² < σ₁₂² < ... < σ²_{1...t} < ... < σ²_{1...q},   (5.4.3)

where the m_t's and ν_t's are the corresponding mean squares and degrees of freedom. The distributions in (5.4.3) and (5.4.2) parallel exactly those in (5.3.10) and (5.3.11).

In principle, the marginal posterior distribution of a particular variance component, say σ_t², can be obtained from (5.4.2) simply by integrating out σ₁², ..., σ²_{t-1}, σ²_{t+1}, ..., σ_q². In practice, however, this involves calculation of a (q−1)-dimensional integral for each value of σ_t² and would be difficult even on a fast computer. A simple approximation to the distribution of σ_t² can be obtained by first considering the joint distribution p(σ²_{1...(t-1)}, σ²_{1...t} | y). The latter distribution is obtained by integrating (5.4.3) over (σ²_{1...(t+1)}, ..., σ²_{1...q}) and (σ₁², ..., σ²_{1...(t-2)}). It is clear that the set (σ²_{1...(t+1)}, ..., σ²_{1...q}) can be eliminated in reverse order by repeated applications of the aχ² approximation in (5.2.35). The effect of each integration is merely to change the values of the mean square m and the degrees of freedom ν of the succeeding variable. Similarly, the set (σ₁², ..., σ²_{1...(t-2)}) can be approximately
integrated out upon repeated applications of the cχ²_d method in (5.2.60). Thus, the distribution of (σ²_{1...(t-1)}, σ²_{1...t}) takes the approximate form

  p(σ²_{1...(t-1)}, σ²_{1...t} | y) ≐ (ν′_{t-1} m′_{t-1})⁻¹ p(χ⁻²_{ν′_{t-1}} = σ²_{1...(t-1)}/(ν′_{t-1} m′_{t-1}))
      × (ν′_t m′_t)⁻¹ p(χ⁻²_{ν′_t} = σ²_{1...t}/(ν′_t m′_t)) / Pr{F_{ν′_t, ν′_{t-1}} < m′_t/m′_{t-1}},
  σ²_{1...t} > σ²_{1...(t-1)} > 0.   (5.4.4)

Noting that σ²_{1...t} = σ²_{1...(t-1)} + I₁⋯I_{t-1}σ_t², the corresponding posterior distribution of (σ²_{1...(t-1)}, σ_t²) is then of exactly the same form as the posterior distribution of (σ₁², σ₂²) in (5.2.14) for the two-component model. The marginal distribution of σ_t² can then be approximated using the truncated χ⁻² form (5.2.44) or the associated asymptotic formulas. Finally, in the case q = 2, ν′₁ = ν₁ and m′₁ = m₁, and the posterior distribution of the variance σ₁² can be approximated by that of an inverted χ² distribution as in the two-component model.
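The recursion just described lends itself to a simple loop. The sketch below is schematic only: it implements the (5.3.24)-type step, under the moment-matching formulas reconstructed earlier, and chains it downward; the components below the chosen level would be handled by the analogous (5.2.60)-type step.

```python
from scipy.special import betainc

def absorb_above(nu_t, m_t, nu, m):
    """One (5.3.24)-type step: adjust (nu, m) for the constraint that this
    component lies below one with mean square m_t on nu_t d.f."""
    x = nu_t * m_t / (nu * m + nu_t * m_t)
    I = [betainc(0.5 * nu_t, 0.5 * nu + r, x) for r in range(3)]
    R1, R2 = I[1] / I[0], I[2] / I[0]
    a = ((nu + 2) * R2 - nu * R1**2) / (2 * R1)
    return nu * R1 / a, m / R1

def effective_nu_m(nus, ms, t):
    """Absorb all components above level t in reverse order (Section 5.4)."""
    nu_hi, m_hi = nus[-1], ms[-1]
    for j in range(len(nus) - 2, t - 1, -1):
        nu_hi, m_hi = absorb_above(nu_hi, m_hi, nus[j], ms[j])
    return nu_hi, m_hi     # effective (nu', m') at level t
```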
APPENDIX A5.1

The Posterior Density of σ₁² for the Two-component Model Expressed as a χ⁻² Series

We now express the distribution of σ₁² in (5.2.26) as a χ⁻² series. Using the Mittag-Leffler formula [see Milne-Thomson (1960), p. 331], we can write

  Pr{χ²_{ν₂} < ν₂m₂/σ₁²} = exp[−ν₂m₂/(2σ₁²)] Σ_{r=0}^∞ [ν₂m₂/(2σ₁²)]^(½ν₂+r) / Γ(½ν₂+r+1).   (A5.1.1)

Thus, the distribution of σ₁² can alternatively be expressed as a weighted series of the densities of scaled χ⁻² variables,

  p(σ₁² | y) = Σ_{r=0}^∞ w_r (ν₁m₁ + ν₂m₂)⁻¹ p[χ⁻²_{ν₁+ν₂+2r} = σ₁²/(ν₁m₁ + ν₂m₂)],   (A5.1.2)

with the weights w_r given by

  w_r = [ν₂m₂/(ν₁m₁)]^(½ν₂+r) [1 + ν₂m₂/(ν₁m₁)]^(−½(ν₁+ν₂)−r) Γ(½(ν₁+ν₂)+r) / { Γ(½ν₂+r+1) Γ(½ν₁) Pr{F_{ν₂,ν₁} < m₂/m₁} }.

We note that when ν₂ is even, the right-hand side of (A5.1.1) can, of course, be expressed as a finite Poisson series, so that

  p(σ₁² | y) = (ν₁m₁)⁻¹ p[χ⁻²_{ν₁} = σ₁²/(ν₁m₁)] / Pr{F_{ν₂,ν₁} < m₂/m₁}
      × { 1 − exp[−ν₂m₂/(2σ₁²)] Σ_{r=0}^{½ν₂−1} [ν₂m₂/(2σ₁²)]^r / r! }.   (A5.1.3)
In this case, posterior probabilities of σ₁² can be expressed as a finite series of χ² probabilities, that is,

  Pr{d₁ < σ₁² < d₂ | y} = Pr{ν₁m₁/d₂ < χ²_{ν₁} < ν₁m₁/d₁} / Pr{F_{ν₂,ν₁} < m₂/m₁}
      − Σ_{r=0}^{½ν₂−1} φ_r Pr{(ν₁m₁+ν₂m₂)/d₂ < χ²_{ν₁+2r} < (ν₁m₁+ν₂m₂)/d₁},   (A5.1.4)

where

  φ_r = Γ(½ν₁+r) [ν₂m₂/(ν₁m₁)]^r / { r! Γ(½ν₁) [1 + ν₂m₂/(ν₁m₁)]^(½ν₁+r) Pr{F_{ν₂,ν₁} < m₂/m₁} },

so that probability integrals of σ₁² can be evaluated exactly from an ordinary χ² table.
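When ν₂ is even, (A5.1.4) is thus a finite sum of χ² probabilities. A sketch of the computation (the function name is ours, and the series is the reconstruction given above):

```python
from math import exp, lgamma, log
from scipy.stats import chi2, f as fdist

def interval_prob(d1, d2, nu1, m1, nu2, m2):
    """Pr{d1 < sigma1^2 < d2 | y} via the finite series (A5.1.4); nu2 even."""
    c1, c2 = nu1 * m1, nu1 * m1 + nu2 * m2
    norm = fdist.cdf(m2 / m1, nu2, nu1)          # Pr{F(nu2,nu1) < m2/m1}
    total = (chi2.cdf(c1 / d1, nu1) - chi2.cdf(c1 / d2, nu1)) / norm
    theta = nu2 * m2 / (nu1 * m1)
    for r in range(nu2 // 2):
        lphi = (lgamma(0.5 * nu1 + r) - lgamma(0.5 * nu1) - lgamma(r + 1.0)
                + r * log(theta) - (0.5 * nu1 + r) * log(1.0 + theta))
        total -= (exp(lphi) / norm) * (chi2.cdf(c2 / d1, nu1 + 2 * r)
                                       - chi2.cdf(c2 / d2, nu1 + 2 * r))
    return total
```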
APPENDIX A5.2
Some Useful Integral Identities
The derivation of the posterior distribution in (5.2.12) leads to the following useful identity. For p₁ > 0, p₂ > 0, a₁ > 0, a₂ > 0 and c > 0,

  ∫₀^∞ ∫₀^∞ x^(−(p₁+1)) (x + cy)^(−(p₂+1)) exp[−(a₁/x + a₂/(x+cy))] dx dy
      = c⁻¹ Γ(p₁) Γ(p₂) a₁^(−p₁) a₂^(−p₂) I_z(p₂, p₁),  z = a₂/(a₁+a₂),   (A5.2.1)

where I_z(p, q) is the usual incomplete beta function. Similarly, the result in (5.3.10) for the three-component model implies the identity: for pᵢ > 0, aᵢ > 0, i = 1, 2, 3 and c₁ > 0, c₂ > 0,

  ∫₀^∞ ∫₀^∞ ∫₀^∞ x₁^(−(p₁+1)) (x₁ + c₁x₂)^(−(p₂+1)) (x₁ + c₁x₂ + c₂x₃)^(−(p₃+1))
      × exp[−(a₁/x₁ + a₂/(x₁+c₁x₂) + a₃/(x₁+c₁x₂+c₂x₃))] dx₁ dx₂ dx₃
      = (c₁c₂)⁻¹ [∏_{i=1}^3 Γ(pᵢ) aᵢ^(−pᵢ)] Pr{ χ²_{2p₁}/χ²_{2p₂} > a₁/a₂,  χ²_{2p₂}/χ²_{2p₃} > a₂/a₃ },   (A5.2.2)

where χ²_{2p₁}, χ²_{2p₂} and χ²_{2p₃} are independent χ² variables with 2p₁, 2p₂ and 2p₃ degrees of freedom, respectively.
We record here another integral identity, the proof of which can be found in Tiao and Guttman (1965). For a > 0, p > 0 and n a positive integer,

  ∫₀^a x^(n−1) (1 + x)^(−(p+n)) dx = B(n, p) − Σ_{j=0}^{n−1} C(n−1, j) a^j (1+a)^(−(p+j)) B(p+j, n−j),   (A5.2.3)

where C(n−1, j) is the binomial coefficient and B(p, q) is the usual complete beta function.
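The identity (A5.2.3) is easy to confirm numerically; a quick check with arbitrarily chosen (a, p, n):

```python
from math import comb
from scipy.special import beta as B
from scipy.integrate import quad

a, p, n = 1.7, 2.3, 4
lhs, _ = quad(lambda x: x**(n - 1) * (1 + x) ** (-(p + n)), 0, a)
rhs = B(n, p) - sum(comb(n - 1, j) * a**j * (1 + a) ** (-(p + j)) * B(p + j, n - j)
                    for j in range(n))
print(abs(lhs - rhs))   # should be ~0
```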
APPENDIX A5.3
Some Properties of the Posterior Distribution of σ₂² for the Two-component Model

We now discuss some properties of the posterior distribution of σ₂² in (5.2.39).

Moments

By repeated application of the identity (A5.2.1) in Appendix A5.2, it can be verified from (5.2.12) that for ν₁ > ν₂ > 2r, the rth moment of σ₂² is

  μ′_r = (ν₂m₂/(2K))^r Σ_{i=0}^r C(r, i) (−ν₁m₁/(ν₂m₂))^i [Γ(½ν₁−i) Γ(½ν₂−r+i) / (Γ(½ν₁) Γ(½ν₂))]
      × I_x(½ν₂−r+i, ½ν₁−i) / I_x(½ν₂, ½ν₁),   (A5.3.1)

where

  x = ν₂m₂ / (ν₁m₁ + ν₂m₂).
Making use of an incomplete beta function table, the first few moments of σ₂², when they exist, can be conveniently calculated and will, of course, provide the investigator with valuable information about the shape of the distribution.

Mode
To obtain the mode of the distribution of σ₂², it is convenient to make the transformation

  u = 2Kσ₂²/(ν₂m₂),   (A5.3.2)

so that from (5.2.39)

  p(u | y) = M ∫₀^∞ [(φz)⁻¹ + u]^(−(½ν₂+1)) exp{−[(φz)⁻¹ + u]⁻¹} z^(½ν₁−1) exp(−z) dz,   (A5.3.3)

where M is a positive constant, φ = ν₂m₂/(ν₁m₁), and the substitution z = ν₁m₁/(2σ₁²) is made.
Upon differentiating (A5.3.3) with respect to u, we find

  ∂p(u|y)/∂u = M{ [1 − (½ν₂+1)u] ∫₀^∞ g(u,z) dz − (½ν₂+1) φ⁻¹ ∫₀^∞ g(u,z) z⁻¹ dz },   (A5.3.4)

where

  g(u, z) = [(φz)⁻¹ + u]^(−(½ν₂+3)) z^(½ν₁−1) exp(−{z + [(φz)⁻¹ + u]⁻¹}).

Applying integration by parts,

  ∫ x dv = xv − ∫ v dx,   (A5.3.5)

and setting

  dv = z^(½(ν₁+ν₂)+1) dz,  x = (φ⁻¹ + uz)^(−(½ν₂+3)) exp(−{z + [(φz)⁻¹ + u]⁻¹}),   (A5.3.6)

the second integral on the right-hand side of (A5.3.4) can be written

  ∫₀^∞ g(u,z) z⁻¹ dz = [2/(ν₁+ν₂+4)] ∫₀^∞ g(u,z) [1 + N(u,z)] dz,   (A5.3.7)

where

  N(u, z) = (½ν₂+3) u (φ⁻¹ + uz)⁻¹ + φ⁻¹ (φ⁻¹ + uz)⁻².

Substituting the right-hand side of (A5.3.7) into (A5.3.4), we have

  M⁻¹ ∂p(u|y)/∂u = [1 − (½ν₂+1)(u + 2φ⁻¹/(ν₁+ν₂+4))] ∫₀^∞ g(u,z) dz
      − (½ν₂+1) [2φ⁻¹/(ν₁+ν₂+4)] ∫₀^∞ g(u,z) N(u,z) dz.   (A5.3.8)
Since N(u,z) ≥ 0, it follows that

  ∂p(u|y)/∂u < 0  if  u > u*,   (A5.3.9)

where

  u* = 2/(ν₂+2) − 2φ⁻¹/(ν₁+ν₂+4).   (A5.3.10)

Also, since N(u,z) ≤ (½ν₂+3)φu + φ, it follows that

  ∂p(u|y)/∂u > 0  if  u < u_*,   (A5.3.11)

where

  u_* = [1 − 2(½ν₂+1)(1 + φ⁻¹)/(ν₁+ν₂+4)] / { (½ν₂+1) [1 + (ν₂+6)/(ν₁+ν₂+4)] }.

Hence,

a) if u_* > 0, then p(u|y) is monotonically increasing in the interval 0 < u < u_* and monotonically decreasing in the interval u > u*, so that the mode must lie in the interval u_* ≤ u ≤ u*; and
b) if u* < 0, then p(u|y) is monotonically decreasing in the entire range u > 0 and the mode is at the origin.

In terms of σ₂² and the mean squares m₁ and m₂, we may conclude that:

a) If

  m₂/m₁ > ((ν₂+2)/ν₂)(ν₁/(ν₁+2)),

then p(σ₂²|y) is monotonically increasing in the interval 0 < σ₂² < c₁ and monotonically decreasing in the interval σ₂² > c₂, so that the mode lies in the interval c₁ ≤ σ₂² ≤ c₂, where

  c₁ = K⁻¹ [ν₂m₂(ν₁+2) − ν₁m₁(ν₂+2)] / [(ν₂+2)(ν₁+2ν₂+10)],
  c₂ = K⁻¹ [ν₂m₂/(ν₂+2) − ν₁m₁/(ν₁+ν₂+4)].   (A5.3.12)

b) If

  m₂/m₁ < ((ν₂+2)/ν₂)(ν₁/(ν₁+ν₂+4)),

then p(σ₂²|y) is monotonically decreasing in the entire range σ₂² > 0 and the mode is at the origin.
APPENDIX A5.4

An Asymptotic Expansion of the Distribution of σ₂² for the Two-component Model

We here derive an asymptotic expansion of the distribution of σ₂² in (5.2.39) in powers of t⁻¹ = (ν₁/2)⁻¹.† To simplify writing, we shall again work with the distribution of u = 2Kσ₂²/(ν₂m₂) rather than σ₂² itself. Upon making the substitution z = ν₁m₁/(2σ₁²), the distribution of u is

  p(u | y) = M′ ∫₀^∞ h(u, z) z^(t−1) e^(−z) dz,   (A5.4.1)

where

  h(u, z) = [Γ(½ν₂)]⁻¹ [(φz)⁻¹ + u]^(−(½ν₂+1)) exp{−[(φz)⁻¹ + u]⁻¹},  φ = ν₂m₂/(ν₁m₁),   (A5.4.2)

M′ is a normalizing constant, and t = ν₁/2.

† In Tiao and Tan (1965), the distribution of σ₂² was expanded in powers of [(ν₂/2) − 1]⁻¹ with slightly different results.
For fixed u, the function h(u, z) in the integral (A5.4.1) is clearly analytic in 0 < z < ∞. Using Taylor's theorem, we can expand h(u, z) around z = t. By reversing the order of integration, we obtain

  p(u | y) = M′ Γ(t) Σ_{k=0}^∞ c_k(t) h^(k)(u, t),  c_k(t) = [k! Γ(t)]⁻¹ ∫₀^∞ (z−t)^k z^(t−1) e^(−z) dz,   (A5.4.3)

where

  h^(0)(u, t) = [Γ(½ν₂)]⁻¹ (λ⁻¹ + u)^(−(½ν₂+1)) exp[−(λ⁻¹ + u)⁻¹],
  h^(1)(u, t) = −t⁻¹ h^(0)(u, t) R₁(λ, u),
  h^(2)(u, t) = t⁻² h^(0)(u, t) [R₂(λ, u) + 2R₁(λ, u)],
  h^(3)(u, t) = −t⁻³ h^(0)(u, t) [R₃(λ, u) + 6R₂(λ, u) + 6R₁(λ, u)],
  h^(4)(u, t) = t⁻⁴ h^(0)(u, t) [R₄(λ, u) + 12R₃(λ, u) + 36R₂(λ, u) + 24R₁(λ, u)],

and R₁(λ, u), ..., R₄(λ, u) are polynomials in W = λ⁻¹(λ⁻¹ + u)⁻¹ whose coefficients involve ½ν₂; in particular, R₄ contains the term (½ν₂+1)(½ν₂+2)(½ν₂+3)(½ν₂+4) W⁴.
It is to be noted that for fixed λ = φt, h^(1)(u, t) is of order t⁻¹, h^(2)(u, t) is of order t⁻², and in general h^(k)(u, t) is of order t^(−k). From the asymptotic relationship between the gamma distribution and the Normal distribution, one can easily verify that the integral

  [Γ(t)]⁻¹ ∫₀^∞ (z − t)^k z^(t−1) e^(−z) dz   (A5.4.4)

is a polynomial in t of degree [½(k−1)], where [q] is the smallest nonnegative integer greater than or equal to q. Thus, the expression in (A5.4.3) can be written as a power series in t⁻¹. Also, to obtain the coefficient of t⁻¹ we need only evaluate the first three terms of the series in (A5.4.3); to obtain that of t⁻², the first five terms of the series; and so on. Now the quantity Pr{F_{ν₂,ν₁} < m₂/m₁} in (A5.4.1) can be written

  Pr{F_{ν₂,ν₁} < m₂/m₁} = C(t) ∫₀^λ z^(½ν₂−1) e^(−z) w(z, t) dz,   (A5.4.5)

where C(t) is a ratio of gamma functions and w(z, t) collects the remaining factors of the integrand, each expandable in powers of t⁻¹. Applying Stirling's series (see Appendix A2.2) to C(t) and expanding w(z, t) in powers of t⁻¹, we find that for fixed λ,

  Pr{F_{ν₂,ν₁} < m₂/m₁} ≐ G_λ(½ν₂) + t⁻¹ A₁(λ) + t⁻² A₂(λ),   (A5.4.6)

where G_λ(½ν₂) is the cumulative distribution function of a gamma variable with parameter ½ν₂ evaluated at λ, g_λ(½ν₂) is the corresponding density, and A₁(λ) and A₂(λ) are polynomials in λ multiplied by g_λ(½ν₂). From (A5.4.3) and (A5.4.6), we find that for fixed λ the distribution of u can be written
  p(u | y) ≐ [(λ⁻¹ + u)^(−(½ν₂+1)) exp[−(λ⁻¹ + u)⁻¹] / (G_λ(½ν₂) Γ(½ν₂))] { 1 + t⁻¹ B₁(u, λ) + t⁻² B₂(u, λ) + O(t⁻³) },   (A5.4.7)

with

  B₁(u, λ) = ½ R₂(λ, u) + R₁(λ, u) − A₁(λ)/G_λ(½ν₂),

and B₂(u, λ) a similar but lengthier combination of R₁(λ, u), ..., R₄(λ, u), A₁(λ) and A₂(λ), where R₁(λ, u), ..., R₄(λ, u) are given in (A5.4.3) and A₁(λ), A₂(λ) in (A5.4.6). The distribution of u is thus expressed as a series in powers of t⁻¹. The leading term is in the form of a truncated "inverted" gamma distribution and is equivalent to the distribution in (5.2.44). The coefficients B₁(u, λ), B₂(u, λ), etc., are polynomials in (λ⁻¹ + u)⁻¹. It is straightforward to verify that when one integrates the distribution (A5.4.7) over u, all terms except the leading one vanish, so that (A5.4.7) defines a proper density function. By making the transformation χ² = 2(λ⁻¹ + u)⁻¹ one can, of course, alternatively express the distribution in (A5.4.7) as a series of χ² densities. If one is interested in individual posterior probabilities of u, a corresponding asymptotic expansion of the probability integrals in powers of t⁻¹ can be readily obtained. From (A5.4.7), it can be verified that for fixed λ,

  Pr{u > u₀ | y} = P₀(y₀)/G_λ(½ν₂) + t⁻¹ P₁(y₀) + t⁻² P₂(y₀) + O(t⁻³),   (A5.4.8)

where y₀ = (λ⁻¹ + u₀)⁻¹, P₀(y₀) = G_{y₀}(½ν₂), and P₁(y₀) and P₂(y₀) are combinations of the gamma density g_{y₀}(½ν₂) with polynomials in y₀ whose coefficients involve ν₂.

Posterior probabilities of u, and therefore of σ₂², can thus be conveniently calculated using (A5.4.8) in conjunction with a standard incomplete gamma function table or a χ² table. While the asymptotic expansions of p(u|y) and the related probability integrals as obtained above are justified for large t, their usefulness depends, of course, upon how closely they approximate the exact distribution and probabilities for moderate values of t. We have already seen in Figs. 5.2.6 and 5.2.7 that the posterior distribution of σ₂² obtained just by employing the leading term of (A5.4.7) is in close agreement with the exact distribution for both the dyestuff and the generated example. Tables 5.2.3(a) and 5.2.3(b) illustrate how further improvement can be made by employing additional terms in the expansion (A5.4.7). The first columns of the two tables give, respectively, a specimen of the densities of σ₂² for the two examples calculated from (A5.4.7) with just the leading term. The second and third columns of the same tables show densities of σ₂² which include terms to order t⁻¹ and t⁻², respectively, and in the fourth column the corresponding exact densities obtained from (5.2.39) by numerical integration are listed. The values of t for these two examples are not at all large. Further, the dyestuff example has a moderate value of m₂/m₁, while the generated example has a rather small m₂/m₁. Thus, the results for the two examples demonstrate that even with moderate values of t, the expansion formula in (A5.4.7) is useful both for small and for moderate values of m₂/m₁.
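The leading term of (A5.4.8) involves only two incomplete gamma evaluations. A sketch (assuming y₀ = (λ⁻¹ + u₀)⁻¹ as above; the function name is ours):

```python
from scipy.stats import gamma

def tail_prob_leading(u0, lam, nu2):
    """Leading term of (A5.4.8): Pr{u > u0 | y} ~ G_{y0}(nu2/2)/G_lam(nu2/2)."""
    y0 = 1.0 / (1.0 / lam + u0)
    return gamma.cdf(y0, 0.5 * nu2) / gamma.cdf(lam, 0.5 * nu2)
```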
The main practical advantage of the expressions (A5.4.7) and (A5.4.8) is, of course, their computational simplicity. The leading terms and the terms of order t⁻¹ in (A5.4.7) and (A5.4.8) are simple functions of χ² densities and probabilities which can be easily evaluated on a desk calculator in conjunction with a standard χ² table or an incomplete gamma function table. If terms of order t⁻² or higher are included, the use of an electronic computer is then desirable. Nevertheless, the expansion gives an efficient computational method compared with the direct numerical integration of the exact expressions in (5.2.39) and its related probability integrals.

APPENDIX A5.5
A Criticism of the Prior Distribution of (σ₁², σ₁₂²)

We now discuss an interesting theoretical point raised by Stone and Springer (1965) concerning the prior distribution of (σ₁², σ₁₂²) in (5.2.8),

  p(σ₁², σ₁₂²) ∝ (σ₁² σ₁₂²)⁻¹,  σ₁₂² > σ₁² > 0.   (A5.5.1)

They argue that this distribution can be regarded as the limiting distribution of the following sequence of distributions

  p_ε(σ₁², σ₁₂²) ∝ (σ₁²)^(ε−1) (σ₁₂²)^(−ε−1),  σ₁₂² > σ₁² > 0,   (A5.5.2)

as ε → 0. Making the transformation from (σ₁², σ₁₂²) to

  ρ = σ₁²/σ₁₂²  and  σ₁₂²,   (A5.5.3)

they claim that the marginal prior distribution of ρ is

  p(ρ) = ε ρ^(ε−1),  0 < ρ < 1,  ε > 0.   (A5.5.4)

Letting δ > 0, the prior probability that 0 < ρ < δ is

  Pr(0 < ρ < δ) = δ^ε,   (A5.5.5)

which approaches 1 as ε → 0. This means that the probability limit

  plim ρ = 0.   (A5.5.6)

Now, consider the distribution of the mean squares ratio m₁/m₂. Since, conditional on ρ,

  m₁/m₂ ~ ρ F_{ν₁,ν₂},   (A5.5.7)

it follows from (A5.5.6) that, a priori,

  plim (m₁/m₂) = 0.   (A5.5.8)

Thus, the prior distribution in (A5.5.1) implies that, a priori, one would expect m₁/m₂ to be very close to zero, and one would therefore be surprised if a somewhat large value of m₁/m₂ were observed. Consider now the posterior distribution of σ₁² in (5.2.26). If m₁/m₂ were small (that is, m₂/m₁ were very large), then

  lim_{m₁/m₂→0} I(σ₁²) = 1,

where I(σ₁²) denotes the factor induced by the constraint. The implication is, then, that since a priori one would expect m₁/m₂ to be small in any event, it should be unnecessary to consider this factor at all in the posterior inference; and if m₁/m₂ turned out to be somewhat large, it would be in conflict with the implied prior belief. While the above argument sounds interesting, given the way Stone and Springer approach the problem, it has hardly any practical significance so far as the posterior inference is concerned. Consider the two situations ε → 0 and ε = 0.5, and let δ = 0.2. Then, a priori,

  lim_{ε→0} Pr(0 < ρ < 0.2) = 1  and  Pr(0 < ρ < 0.2 | ε = 0.5) = √0.2 = 0.447.

Thus, a large difference exists between the prior for which ε = 0.5 and the prior in (A5.5.1), and in the case ε = 0.5, Stone and Springer's objection would not arise. However, the use of ε = 0.5 will only serve to change, in the posterior distribution of (σ₁², σ₂²) in (5.2.14), ν₁ to ν₁* = ν₁ − 0.5 and m₁ to m₁* = m₁(ν₁/ν₁*). The reader can verify that for either the dyestuff or the generated example, or any other example in which ν₁ is of moderate size, the posterior inferences about (σ₁², σ₂²) would be scarcely affected.
APPENDIX A5.6
"Bayes" Estimators In this book we treat sa mpling theory inference and Bayesian inference as quite distinct concepts. In sampling theory, inferences about parameters, supposed fixed, are made by considering the sampling properties of specifically selected functions of the observations called esLimators. In Bayesian inference, the pa rameters are considered as random variables and inferences are made by considering their distributions condi tional on the fixed data . In recent years, some workers have used a dual approach which , in a sense, goes part of the way along the Bayes route and then switches to sampling theory. SpecificaiJy, the Bayesian posterior distribution is obtained, but not used to make inferences directly. Instead, it is employed to suggest appropriate functions of the data for use as estimators. Inferences are then made by considering the sampling properties of these "Bayes" estimators . Because this dual approach is now quite widespreadsee, for example, Mood and Graybill (1963)- , we feel that we should say something about it here.
The present authors believe that while estimators undoubtedly can serve some useful purposes (for example, in the calculation of residuals in the criticism phase of statistical analysis), we do not believe their sampling properties have much relevance to scientific inference.

Description of Distributions (Including Posterior Distributions)

We have argued in this book that the whole of the appropriate posterior distribution provides the means of making all relevant inferences about a parameter or a set of parameters. We have further argued that, by using a noninformative prior, a posterior distribution is obtained appropriate to the situation where the information supplied by the data is large compared with that available a priori. If for some reason it was demanded that we characterize a continuous posterior distribution without using its mathematical form and without using diagrams, we could try to do this in terms of a few descriptive measures. The principles we would then have to employ, and the difficulties that would then confront us, would be the same as those we would meet in so describing any other continuous distribution, or indeed in representing any infinite set by a finite set.
Fig. A5.6.1 Two hypothetical posterior distributions: (a) a Normal posterior; (b) a bimodal posterior with posterior mean 45.8.
For example, the description of the posterior distribution of Fig. A5.6.1(a) is simple. It is a Normal distribution with mean 22 and standard deviation 3.7. Thus, with knowledge of the Normal form, two quantities suffice to provide an exact description. Description of the posterior distribution in Fig. A5.6.1(b) would be more difficult. We might describe it thus: "It is a bimodal distribution with most of its mass between 10 and 80. The left hand mode at 30 is about half the height of the right hand mode at 57. A local minimum occurs at 43 and is of about one tenth the height of the right hand mode." Alternatively, we could quote, say, 10 suitably spaced densities to give the main features of the distribution. However we proceeded, we would need to quote five or more different numbers to provide even an approximate idea of what we were talking about. Now there is nothing unrealistic about this hypothetical posterior distribution. It would imply that values of the parameter around 57 and around 30 were plausible, but that intermediate values around 43 were much less so. Practical examples of this sort are not uncommon, and one such case will be discussed later in Section 8.2. Posterior distributions met in practice are usually neither so simple as the Normal nor so complex as the bimodal form just considered. But what is clear is that the number and kinds of measures needed to approximately describe the posterior distribution must depend on the type and complexity of that distribution. Questions about how to describe a posterior distribution cannot be (and do not need to be) decided until after that distribution is computed. Furthermore, they are decided simply on the basis of what would best convey the right picture to the investigator's mind.
Point Estimators

Over a very long period, attempts of one sort or another have been made to obtain by some prior recipe a single number calculated from the data which in some way best represents a parameter under study. Historically there has been considerable ambivalence about objectives. Classical writers would, for example, sometimes state the problem as that of finding the most "probable" value and sometimes that of finding a best "average" or "mean" value.

A Measure of Goodness of Estimators

Modern sampling theorists refer to the single representative quantity calculated from a sample y as a point estimator, and they argue as follows. Suppose we are interested in a particular parameter φ. Then any function φ̂(y) of the data which provides some idea of the value of φ may be called an estimator. Usually a very large number of possible estimators can be devised. For example, the variance of a distribution might be estimated by the sample variance, the sample range, the sample mean absolute deviation and so on. Therefore a criterion of goodness is needed; using it, the various estimators can be compared and the "best" selected. It is argued that goodness of an estimator ought to be measured by the average
closeness of its values over all possible samples to the true value of φ. The criterion of average closeness most frequently used nowadays (but of course an arbitrary one) is the mean squared error (M.S.E.). Thus among a class of possible candidate estimators φ̂₁(y), ..., φ̂ᵢ(y), ..., φ̂ₖ(y), the goodness of a particular one, say φ̂ⱼ(y), would be measured by the size of

  M.S.E.(φ̂ⱼ) = E_y[φ̂ⱼ(y) − φ]² = fⱼ(φ),

where the expectation is taken over the sampling distribution of y. The estimator φ̂ᵢ(y) would be "best" in the class considered if

  fᵢ(φ) ≤ fⱼ(φ),  j = 1, ..., k;  j ≠ i,

for all values of φ, and would be called a minimum mean squared error (M.M.S.E.) estimator. At first sight the argument seems plausible, but the M.S.E. criterion is an arbitrary one and is easily shown to be unreliable. A simple example is the estimation of the reciprocal of the mean of the Normal distribution N(θ, 1). From a random sample y, there is surely something to be said for the maximum likelihood estimator ȳ⁻¹, which is, after all, sufficient for θ⁻¹; but it so happens that the M.S.E. of ȳ⁻¹ is infinite.

The real weakness of the argument lies in the fact that it requires a universal measure of goodness. Even granted that some measure of goodness may be desirable, it ought to depend on the probabilistic structure of the estimator as well as on what the estimator is used for. But inherent in the theory is the supposition that a criterion is to be arbitrarily chosen a priori and then applied universally to all estimators.
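The infinite M.S.E. of ȳ⁻¹ shows up in simulation as a sample mean squared error that never stabilizes. A small illustration (the values of θ and n are chosen arbitrarily):

```python
import numpy as np
rng = np.random.default_rng(0)

theta, n = 0.1, 10
for reps in (10**3, 10**5, 10**7):
    ybar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    print(reps, np.mean((1.0 / ybar - 1.0 / theta) ** 2))  # keeps growing
```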
How to Obtain the Estimators
If we were to accept the above argument, we would have a means of ranking a set of estimators once they were presented to us. But how can we cope with the embarrassingly large variety of functions of the data which might have relevance as estimators? Is there some way in which we can simplify the problem so as to pick out, if not one, then at least a limited number of good estimators which may then be compared? One early attempt to make the problem manageable was to arbitrarily limit the estimators to be (a) linear functions of y which were (b) unbiased, and then to find the one which had minimum variance. These requirements had the advantage that the "best" estimators could often be obtained from their definition by a simple minimization process. For example, in a random sample of size n from a population with finite mean φ and finite variance, the sample mean ȳ is the "best" estimator of φ satisfying these requirements. However, after it was realized that biased estimators often had smaller mean squared errors than unbiased ones, the requirement of unbiasedness was usually
dropped and the criterion of minimum M.S.E. substituted for that of minimum variance. As an example, consider the exponential distribution f(y) = φ⁻¹e^(−y/φ), (y > 0, φ > 0). The linear function nȳ/(n+1), although biased, has smaller M.S.E. than that of the sample mean ȳ in estimating φ. When it was further shown that non-linear estimators could be readily produced (from Bayesian sources) having smaller M.S.E. than the "best" linear ones for all values of φ, linear estimators could no longer be wholeheartedly supported.† Possible ways of obtaining non-linear estimators which still might minimize mean squared error were therefore examined. Perhaps then the logical final step will be to abandon the arbitrary criterion of minimum M.S.E. With this step taken, it would seem best to give up the idea of point estimation altogether and use instead the posterior distribution for inferential purposes. One method of finding estimators which are not necessarily linear in the observations is Fisher's method of maximum likelihood. Here the estimator is the mode of the likelihood, and, under certain conditions, asymptotically at least, the estimators have a Normal distribution and smallest M.S.E. However, some investigators have felt that the likelihood method itself was too limited for their purposes. A few years ago, therefore, the Bayesians were surprised by unexpected bedfellows anxious to examine their posterior distributions for interesting features. Whereas it was the mode of the likelihood function which had received almost exclusive attention, it now seemed to be the mean of the posterior distribution which became the main center of interest. Nevertheless the posterior mode and the median have also come in for some examination, and each has been recommended or condemned in one context or another.
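A one-line calculation, under the stated exponential model, verifies the claim about nȳ/(n+1): since E(ȳ) = φ and Var(ȳ) = φ²/n,

  M.S.E.(cȳ) = c² Var(ȳ) + (cφ − φ)² = φ² [c²/n + (c − 1)²],

which is minimized at c = n/(n+1), so the biased estimator nȳ/(n+1) beats ȳ (the case c = 1) for every φ.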
A Decision-theoretic Approach

Justification for this way of choosing estimators is sometimes attempted on formal decision-theoretic grounds. In this argument, let φ̂ be the decision-maker's "best" guess of the unknown φ relative to a loss function L(φ̂, φ). Then the so-called "Bayes estimator" is the guessed value which minimizes the posterior expected loss

  E_{φ|y} L(φ̂, φ) = ∫ L(φ̂, φ) p(φ | y) dφ.

In particular, if the loss function is the squared error

  L₁(φ̂, φ) = (φ̂ − φ)²,   (A5.6.1a)

then, provided it exists, the Bayes estimator is the posterior mean,‡ but by suitably changing the loss function the posterior mode or median can also be produced. Thus, see Pratt, Raiffa and Schlaifer (1965), for

  L₂(φ̂, φ) = 1 for |φ̂ − φ| > ε,  L₂(φ̂, φ) = 0 for |φ̂ − φ| < ε,   (A5.6.1b)

where ε > 0 is an arbitrarily small constant, the Bayes estimator is a mode of the posterior distribution. Also, for

  L₃(φ̂, φ) = |φ̂ − φ|,

the Bayes estimator is the posterior median. Indeed, if the loss function were

  L₄(φ̂, φ) = a₁(φ − φ̂) for φ̂ ≤ φ,  L₄(φ̂, φ) = a₂(φ̂ − φ) for φ̂ > φ,

then by appropriately choosing a₁ and a₂, any fractile of the posterior distribution could be an estimator for φ. In a proper Bayesian decision analysis, the loss function is supposed to represent a realistic economic penalty associated with the available actions. This justification, in our view, only serves to emphasize the weakness of the point estimation position. (a) In the context in which the estimator was to be employed, if the consequences associated with inaccurate estimation were really what the loss function said they were, then the resulting estimator would be best for minimizing this loss, and this would be so whether its M.S.E. in repeated sampling was large or small. (b) It seems very dubious that a general loss function could be chosen in advance to meet all situations. In particular, it is easy to imagine practical situations where the squared error loss (A5.6.1a) would be inappropriate. In summary, it seems to us that this Bayesian decision argument has little to do with point estimation for inferential purposes.

† An example of this kind will be discussed later in Section 7.2.
‡ When authors write of the "Bayes estimator" without further qualification they usually refer to the posterior mean. The naming of "Bayes" estimator seems particularly unfair to the Rev. Thomas Bayes, who already has had much to answer for and should be cleared of all responsibility for fathering this particular foundling.
Are Point Estimates Useful for Inference?

The first question a Bayesian should ask is whether or not point estimates provide a useful tool for making inferences. Certainly, if he were presented with the Normal distribution of Fig. A5.6.1(a) and told he must pick a single number to represent φ, he would have little hesitation in picking 22, which is both the mean and the mode of the distribution. Even here he would regret that he had not been allowed to say anything about the uncertainty associated with the parameter φ. But what single number should he pick to represent φ in Fig. A5.6.1(b)? If he chose the higher mode at 57, he would certainly feel unhappy about not mentioning the lower mode at 30. He would surely not choose the "Bayes" estimator, the posterior mean at 45.8, which is close to a local minimum density. While it is easy to demonstrate examples for which there can be no satisfactory point estimate, yet the idea is very strong among people in general, and some
statisticians in particular, that there is a need for such a quantity. To the idea that people like to have a single number we answer that usually they shouldn't get it. Most people know they live in a statistical world, and common parlance is full of words implying uncertainty. As in the case of weather forecasts, statements about uncertain quantities ought to be made in terms which reflect that uncertainty as nearly as possible.†

Should Inferences be Based on Sampling Distributions of Point Estimators?

Having computed the estimators from the posterior distribution, the sampling theorist often uses their sampling distribution in an attempt to make inferences about the unknown parameter. The bimodal example perhaps makes it easy to see why we are dubious about the logic of this practice. First, we feel that having gone to the trouble of computing the posterior distribution, it may as well be put to use to make appropriate inferences about the parameters. Second, we cannot see what relevance the sampling distribution of, say, the mean of the posterior distribution would have in making inferences about φ.

Why Mean Squared Error and Squared Error Loss?
We have mentioned that the M.S.E. criterion is arbitrary (as are alternative measures of average closeness). We have also said that in the decision-theoretic framework the quadratic loss function leading to the posterior mean is arbitrary (as are alternative loss functions yielding different features of the posterior distribution). The question remains as to why many sampling theorists seem to cling rather tenaciously to the mean squared error criterion and the quadratic loss function. The reasons seem to be that (a) for their theory to work, they must cling to something, arbitrary though it be, and (b) given this, it is best to cling to something that works well for the Normal distribution. The latent belief in universal near-Normality of estimators is detectable in the general public's idea that the "most probable" and the "average" are almost synonymous, implying a belief, if not in Normality, at least that "mean equals mode." Many statisticians familiar with ideas of asymptotic Normality of maximum likelihood and other estimators have a similar tendency. This has led to a curious reaction by some workers when the application of these arbitrary principles led to displeasing results. The Bayesian approach or the Bayesian prior have been blamed, and not the arbitrary features, curious mechanism and dubious relevance of the point estimation method itself.
† The public is ill served when they are supplied with single numbers published in newspapers and elsewhere with no measure of their uncertainty. For example, most figures in a financial statement, balanced down to the last penny, are in fact merely estimates. When asked whether it would be a good idea to publish error limits for those figures, an accounting professor replied, "This is commonly done internally in a company, but it would be too much of a shock for the general public!"
Point Estimators for Variance Components

For illustration we consider some work on the variance component problem tackled from the standpoint of point estimation. Two papers are discussed, one by Klotz, Milton and Zacks (1969) and the other by Portnoy (1971), concerning point estimators of the component σ₂² for the two-component model.
M.S.E. of Various Estimators of σ₂²

These authors computed the M.S.E. of a variety of estimators for (σ₁², σ₂²). Specifically, for the component σ₂², they have considered for various values of J and K the following estimators:

a) the usual unbiased estimator σ̂₂²,
b) the maximum likelihood estimator,
c) the mean of the posterior distribution (5.2.39), see (A5.3.1) in Appendix A5.3,
d) the component for σ₂² of the mode of the joint distribution (5.2.14),
e) the component for σ₂² of the mode of the corresponding joint distribution of (σ₁⁻², σ₂²),

among some others. Their calculation shows that over all values of the ratio σ₂²/(σ₁² + σ₂²):
i) the M.S.E. of the estimators (b), (d), and (e) are comparable and are much smaller than that of (a);

ii) by far the worst estimator is the posterior mean (c). For example, when J = 5 and K = 2, the M.S.E. of (c) is at least 8 times as large as those of (b), (d), and (e).

The fact that the posterior mean has very large M.S.E. seems especially disturbing to these authors, since such a choice is motivated by the squared error loss (A5.6.1a). Thus, they concluded that "The numerical results on the M.S.E. of the formal Bayes estimators strongly indicate that the practice of using posterior means may yield very inefficient estimators. The inefficiency appears due to a large bias caused by posteriors with heavy tails resulting from the quasi prior distributions. Other characteristics of the posterior distributions such as the mode or median can give more efficient point estimators." (J. Amer. Statist. Assoc., 64, p. 1401). The italics in the above are ours, and they draw attention to a point we cannot understand. The authors' conclusions and associated implications may be summarized in part as follows.
Table A5.6.1 Comparison of estimators of σ₂²

Prior                    Implied loss function      Feature of posterior       Estimators   Criterion
                         [see (A5.6.1a,b)]          distribution               judged
Noninformative (5.2.9)   L₁                         mean                       bad          M.S.E.
Noninformative (5.2.9)   L₂                         mode w.r.t. (σ₁², σ₂²)     good         M.S.E.
Noninformative (5.2.9)   L₂                         mode w.r.t. (σ₁⁻², σ₂²)    good         M.S.E.
But since the same prior distribution is assumed in every case, yielding the same posterior, how is the prior to blame for the alleged inefficiency of one of the estimators? Or why, on this reasoning, should it not be praised for the efficiency of the other two? We are neither encouraged by the fact that, based on the reference prior in (5.2.9), some characteristics of the posterior distribution in (5.2.14), such as the mode, have desirable sampling properties, nor are we dismayed by the larger M.S.E. of the corresponding posterior mean of σ₂². One simply should not confuse point estimation with inference. Inferences about a parameter are provided not by some arbitrarily chosen characteristic of the posterior distribution, but by the entire distribution itself. One of the first lessons we learn in Statistics is that the mean is but one possible measure of the location of a distribution. While it can be an informative measure in certain cases such as the Normal, it can be of negligible value for others.

Other Examples of the Inefficiency of the Posterior Mean

For the problem of the Normal mean θ with known variance σ², the posterior distribution of θ based on the uniform reference prior is N(ȳ, σ²/n). As mentioned earlier, this distribution allowed us to make inferences not only about θ but also about any transformation of θ such as e^θ or θ⁻¹. While the posterior mean of θ, namely ȳ, is a useful and important characteristic of the Normal posterior, the posterior mean of e^θ, which is e^{ȳ + σ²/(2n)}, tells us much less about the log-Normal posterior, and the posterior mean of θ⁻¹, since it fails to exist, tells us nothing about the distribution of θ⁻¹! Thus, so far as inference is concerned, it is irrelevant whether some arbitrarily chosen characteristic of the posterior has or does not have some allegedly desirable sampling property. In particular, it is irrelevant that ȳ is the M.M.S.E.
estimator for θ, or that for e^θ, as can readily be shown, the posterior mean e^{ȳ + σ²/(2n)} has larger M.S.E. than the posterior mode e^{ȳ − σ²/n} (which, for that matter, has larger M.S.E. than the estimator e^{ȳ − 1.5σ²/n}). Furthermore, the noninformative prior should not be blamed for causing the heavy right tail of the log-Normal posterior of e^θ, nor should the same prior be praised for producing the symmetric Normal posterior of θ. It would certainly seem illogical to "judge" the appropriateness of the posterior, and therefore of the prior, on such grounds. A small numerical check of these M.S.E. comparisons is sketched below.
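The M.S.E. ordering just asserted is easy to check by simulation. The sketch below is ours, not from the text; the values θ = 0, σ = 1, n = 10 are arbitrary assumptions, and v denotes σ²/n.

```python
import numpy as np

# Monte Carlo check (not from the book): sampling-theory M.S.E. of three
# estimators of exp(theta) based on ybar ~ N(theta, sigma^2/n).
rng = np.random.default_rng(1)
theta, sigma, n, nrep = 0.0, 1.0, 10, 1_000_000
v = sigma**2 / n                       # variance of ybar
ybar = rng.normal(theta, np.sqrt(v), nrep)
target = np.exp(theta)

for label, c in [("posterior mean  exp(ybar + 0.5v)", 0.5),
                 ("posterior mode  exp(ybar - v)   ", -1.0),
                 ("MMSE            exp(ybar - 1.5v)", -1.5)]:
    est = np.exp(ybar + c * v)
    print(label, np.mean((est - target) ** 2))
```

The printed mean squared errors decrease from the posterior mean to the posterior mode to the estimator e^{ȳ − 1.5σ²/n}, as stated above.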
Adjusting Prior to Produce "Efficient" Posterior Mean

In view of all this, we find the analysis given in Portnoy's paper even more perplexing. He proposes the scale invariant loss function

L(φ̂, σ²) = (φ̂ − σ₂²)² / (σ₁² + Kσ₂²)²,    (A5.6.2)
and proceeds to obtain Bayes estimators of σ₂² with respect to the class of priors

p(σ₁², σ₂²) ∝ (σ₁²)^a (σ₁² + Kσ₂²)^b.    (A5.6.3)
For given (a, b), the estimator is the ratio of the posterior expectations

φ̂₍a,b₎ = E{σ₂² (σ₁² + Kσ₂²)⁻² | y} / E{(σ₁² + Kσ₂²)⁻² | y}.

Using "admissibility" proofs similar to James and Stein (1961), he concludes that the best estimator is the one corresponding to (a = −1, b = −1) in (A5.6.3), which is then precisely our noninformative prior. His computation shows that the M.S.E. of φ̂₍₋₁,₋₁₎ is much smaller than that of the corresponding posterior mean. This result is certainly not surprising if we consider the similar but algebraically much simpler problem of the Normal variance σ² with the mean θ known. As we have discussed in Section 2.3, the posterior distribution of σ² based on the noninformative prior p(σ²) ∝ σ⁻² is ns²χₙ⁻², where ns² = Σ(yᵢ − θ)². It is readily seen that the Bayes estimator with respect to the scale invariant loss function

L(σ̂², σ²) = (σ̂² − σ²)² / σ⁴

is then

σ̂² = E(σ⁻² | y) / E(σ⁻⁴ | y) = ns² / (n + 2),

which, as is well known, is the minimum M.S.E. estimator of σ² among all functions of s² having the form cs². In particular, the M.S.E. of σ̂² is smaller than that of the posterior mean ns²/(n − 2). What is extraordinary is the conclusions Portnoy draws from his study. Using our notation, he says that "For the case of the present estimator, φ̂₍₋₁,₋₁₎, we actually find that the prior distribution (for a = −1, b = −1) corresponds exactly to the Jeffreys' prior,
σ₁⁻²(σ₁² + Kσ₂²)⁻¹. However, φ̂₍₋₁,₋₁₎ is not the posterior mean, but the posterior expected value of σ₂²(σ₁² + Kσ₂²)⁻² over the posterior expected value of (σ₁² + Kσ₂²)⁻², the denominator coming from the normalizing factor in the loss function. This is the same as taking the posterior mean for the prior σ₁⁻²(σ₁² + Kσ₂²)⁻³. The reasonable size of the mean squared error of φ̂₍₋₁,₋₁₎ shows that this latter posterior distribution is centred at least as well and has substantially smaller variance (on the average) than the posterior for the Jeffreys' prior. Thus, if one wishes to make inferences based on a posterior distribution, one can seriously recommend using the prior σ₁⁻²(σ₁² + Kσ₂²)⁻³ instead of the Jeffreys' prior. This serves to justify, in my opinion, the use of squared error as loss: by using squared error and by taking an appropriate limit of what might be called Bayes invariant priors, one is assured of finding a posterior distribution which, on the average, is reasonably centered and has reasonably small variance. As this example shows, the Jeffreys' prior can lead to posterior distributions with mean and variance far too large." (Ann. Math. Statist., 42, p. 1394). This seems to be suggesting that, in order to obtain a good estimator with respect to the scale invariant loss function in (A5.6.2), Jeffreys' prior is needed, but that the use of such a prior is, at the same time, undesirable in making inferences because it leads to a posterior mean having large mean squared error. To conclude that this study serves to justify "the use of squared error loss" is indeed difficult to follow. To see where this sort of argument leads, consider again the problem of the Normal variance σ² with known mean. Let the prior distribution be of the form

p(σ²) ∝ (σ²)^{−α}.
Suppose we wish to estimate some power of σ², say σ^{2p}. Now for different values of α, a family of posterior means for σ^{2p} can be generated. If we were to insist on a squared error loss function, and hence on the posterior mean as the "best" point estimator of σ^{2p}, then it is shown in Appendix A5.7 that the estimator which minimizes the M.S.E. is the one for which the prior would be such that

α = 2p + 1,    (A5.6.4)

provided n > −4p.
The implications would then be: (i) if we were interested in σ² (p = 1), the prior should be proportional to (σ²)⁻³, which is similar to the prior suggested by Portnoy for estimating σ₂²; (ii) on the other hand, if p = 0, corresponding to log σ, we would use the noninformative prior σ⁻²; but (iii) if we should be interested in σ⁻¹⁰⁰ (p = −50), then we ought to employ a prior which is proportional to σ¹⁹⁸!
To put the matter in another way, the posterior distribution of σ² corresponding to (A5.6.4) would then be (for n > −4p)

σ² | y ~ n′s′² χ_{n′}⁻²,    (A5.6.5)

with n′ = n + 4p and s′² = ns²/n′, which would be the same as that resulting from Jeffreys' noninformative prior but with n′ = n + 4p observations. Noting that n′s′²/σ² ~ χ²_{n′}, so that (n′s′²)^p/σ^{2p} ~ (χ²_{n′})^p, this would mean, for example, that if we wished to estimate (σ²)¹⁰ from a sample of, say, n = 8 observations, we would have to base our inferences on a posterior distribution which would correspond to a confidence distribution appropriate for 48 observations. On the other hand, if we were interested in estimating (σ²)⁻¹⁰ from a sample of n = 48 observations, the posterior we would use would be numerically equivalent to a confidence distribution relative to only 8 observations.
Conclusions

The principle that the prior, once decided upon, should be consistent under transformation of the parameter seems to us sensible. By contrast, the suggestion that, to accommodate arbitrariness in the methods of obtaining point estimators, this principle be abandoned in favour of freely varying the prior whenever a different function is to be estimated seems much less so.
APPENDIX A5.7 Mean Squared Error of the Posterior Means of σ^{2p}

In this appendix, we prove the result stated in (A5.6.4). That is, suppose we have a random sample of n observations from the Normal population N(θ, σ²) where θ is known, and suppose that the prior distribution of σ² takes the form

p(σ²) ∝ (σ²)^{−α};    (A5.7.1)

then, among the class of estimators of σ^{2p} which are the posterior means of σ^{2p}, the estimator which minimizes the M.S.E. is obtained by setting the prior such that

α = 2p + 1,    (A5.7.2)

provided n > −4p. Corresponding to the prior (A5.7.1), the posterior distribution of σ² is

p(σ² | y) = [Γ(½v)]⁻¹ (½ns²)^{½v} (σ²)^{−(½v + 1)} e^{−ns²/(2σ²)},    (A5.7.3)
where v = n + 2(α − 1) > 0.
Thus, using (A2.1.2) in Appendix A2.1, the posterior mean of σ^{2p} is

E(σ^{2p} | y) = c(ns²)^p,   where   c = [Γ(½v − p)/Γ(½v)] (½)^p,    (A5.7.4)

provided ½v − p > 0. Now, the sampling distribution of ns² is σ²χₙ². It follows that the M.S.E. of c(ns²)^p is, for (½n + 2p) > 0,
M.S.E. = E[c(ns²)^p − σ^{2p}]²
       = σ^{4p} [ 1 + c² 2^{2p} Γ(½n + 2p)/Γ(½n) − 2c 2^p Γ(½n + p)/Γ(½n) ].    (A5.7.5)
Letting

g(v) = [Γ(½v − p)/Γ(½v)]² Γ(½n + 2p) − 2 [Γ(½v − p)/Γ(½v)] Γ(½n + p),    (A5.7.6)

we have

M.S.E. = σ^{4p} [ 1 + g(v)/Γ(½n) ].
Differentiating with respect to v and setting the derivative to zero, we obtain

Γ(½n + 2p) {Γ(½v − p) Γ′(½v − p) − [Γ(½v − p)]² Γ′(½v)/Γ(½v)}
  − Γ(½n + p) [Γ(½v) Γ′(½v − p) − Γ(½v − p) Γ′(½v)] = 0,    (A5.7.7)
where Γ′(x) denotes the derivative of the gamma function (so that Γ′(x)/Γ(x) is the digamma function). The solution is

v = n + 4p,   or   α = 2p + 1.
By differentiating g'(v) it can be verified that this solution minimizes the M.S.E.
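The location of the minimum can also be corroborated numerically. The sketch below is ours, not part of the original text: it evaluates g(v) of (A5.7.6) on a grid, dropping the common positive factor 1/Γ(½n), and checks that the minimizer is v = n + 4p; the particular (n, p) pairs are arbitrary assumptions.

```python
import numpy as np
from scipy.special import gammaln

# Numerical check (ours): g(v) in (A5.7.6), divided through by Gamma(n/2),
# is minimized at v = n + 4p, i.e. alpha = 2p + 1.  Requires n > -4p and
# v/2 > p so that all gamma arguments are positive.
def g(v, n, p):
    R = np.exp(gammaln(v / 2 - p) - gammaln(v / 2))
    return (R**2 * np.exp(gammaln(n / 2 + 2 * p) - gammaln(n / 2))
            - 2 * R * np.exp(gammaln(n / 2 + p) - gammaln(n / 2)))

for n, p in [(8, 1), (20, 2), (48, -10)]:
    v = np.linspace(max(2 * p, 0) + 0.5, 4 * (n + abs(4 * p)), 200_000)
    vstar = v[np.argmin(g(v, n, p))]
    print(n, p, "argmin v ~", round(vstar, 2), " n + 4p =", n + 4 * p)
```

In each case the grid minimizer agrees with v = n + 4p to the grid resolution.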
CHAPTER 6
ANALYSIS OF CROSS CLASSIFICATION DESIGNS
6.1 INTRODUCTION
In the previous chapter we have discussed random effect models for hierarchical designs. In some situations it is appropriate to run cross classification designs and it may be appropriate to assume either a) that all factors have random effects, b) that all factors have fixed effects, or c) that some factors have random effects and some have fixed effects. We begin with a general discussion of the situation when there are just two factors and for illustration we suppose the factors to be car drivers and cars.
6.1.1 A Car-Driver Experiment: Three Different Classifications

Consider an experiment in which the response of interest was the performance, as measured by mileage per gallon of gasoline, achieved in driving a new Volkswagen sedan. Suppose that a) eight drivers and six cars were used in the investigation, b) each driver drove each car twice, and c) the resulting 96 trials were run in random order. The above description of the investigation is, by itself, inadequate to determine an appropriate analysis for the results. To decide this a number of other things must be known. Most important are: a) the objectives of the investigation, and b) the design of the experiment to achieve these objectives, in this example the method of selection of drivers and cars.
Objectives: The Particular or the General?

In one situation the objective may be to obtain information about the mean performance of each of the particular six cars and eight drivers included in the experiment. Comparisons between such means are sometimes called "fixed effects" and the analysis is conducted in terms of a fixed effect model.
In another situation the individual cars and/or drivers in the experiment may be of no interest in themselves. The objective may be to use them only to make inferences about the populations of cars and/or drivers from which they are presumed to be randomly drawn. In particular, the variances (and possibly certain covariances) of these populations will be of special interest. In this situation our analysis is conducted in terms of a random effect model. In still another situation the objective may be to learn about the mean performance of each car as it performs in the hands of a population of drivers. In this case the analysis is conducted in terms of a mixed (random and fixed effect) model. With the two-factor cross classification, then, three situations can be distinguished which are associated with specific models as follows: a) both factors random: two-way random effect model; b) one factor random, one factor fixed: two-way mixed model; and c) both factors fixed: two-way fixed effect model.
Design Considerations

So far we have talked only of objectives, but it is important that the experiment be conducted so that it is possible to achieve these objectives. In the random effect model, interest centers on the characteristics of populations and, in particular, on their variance components. To estimate such characteristics the particular cars or drivers included in the experiment must be random selections from the relevant populations of cars and drivers. For example, the two-way random effect setup would be appropriate if the six VW sedan cars and the eight drivers were randomly drawn from relevant and definable populations. The sampled population of cars might be all new VW sedans available for sale in the State of Wisconsin in a particular week. The sampled population of drivers might be all students at the University of Wisconsin-Madison prepared to take part in the experiment for a payment of $20. It will often happen that the population one would really like to make inferences about is different from the population that can be conveniently sampled. For instance, we may really be interested in the whole population of drivers in the United States, but may settle for the student drivers to produce a feasible experiment. In such a case it must be remembered that the statistical conclusions apply only to the limited population sampled. Any extrapolation to a wider population is based solely on an opinion that, for example, "student drivers are similar to other drivers." Questions concerning the car-driver populations which this kind of experiment might answer are:

1. How large a variance in gas mileage might be experienced by different drivers operating new VW sedan cars?
2. What proportion of this total variance was associated with: (a) variation in cars, (b) variation in drivers, (c) interaction between cars and drivers, and (d) unassignable variation?

By contrast, the fixed effect setup implies that our interest centers on the mean performance of each of the six particular cars or the eight particular drivers, which may or may not be realistically considered as relevant random samples from specific populations. The present discussion uses the familiar sampling-theory terms "random" and "fixed" effects. As we shall see later, in the Bayesian framework, where of course all effects (parameters) are random variables, an appropriate choice of model forms with suitable prior distributions achieves the different inferential objectives.
Classification of the Data in a Two-way Table

Suppose in general that the test is conducted with I cars denoted by (1, ..., i, ..., I) and J drivers denoted by (1, ..., j, ..., J), and that every driver operates every car K times (K ≥ 1). Let y_{ijk} be the kth (k = 1, ..., K) observation for the jth driver on the ith car. The data can be set out in an I × J arrangement with K observations per cell, as in Table 6.1.1. In the example considered above there are I = 6 cars, J = 8 drivers, and K = 2 replicate runs.

Table 6.1.1 Arrangement of observations in a two-way cross classification design: the rows are the cars i = 1, ..., I, the columns are the drivers j = 1, ..., J, and cell (i, j) contains the K observations y_{ij1}, ..., y_{ijK}.
Whichever setup is being considered we can always write

y_{ijk} = θ + δ_{ij} + e_{ijk}.    (6.1.1)

In this equation θ is the overall mean gas mileage, δ_{ij} is the mean increase or decrease in gas mileage achieved when the jth driver is operating the ith car, and e_{ijk} is the experimental error which is responsible for differences between replicate runs performed by the same driver in the same car. It will be supposed throughout that these experimental errors are distributed independently of each other, and of the δ_{ij}, with zero mean and variance σ_e².
6.1.2 Two-way Random Effect Model We now consider the formulation of an appropriate model when both factors are random.
Classification of the Population in a Two-way Table

We can picture the whole population of possible combinations of cars and drivers as given by the cells of the ℐ × 𝒥 table of Table 6.1.2, having all possible columns (drivers) indexed by j = 1, 2, ..., 𝒥 and all possible rows (cars) indexed by i = 1, 2, ..., ℐ.

Table 6.1.2 Two-way population of increments δ_{ij}: the ℐ rows are cars with row means δ_{i·}, and the 𝒥 columns are drivers with column means δ_{·j}.

Then δ_{ij} will be the mean increment associated with the ith car operated by the jth driver, that is, the amount by which the average gas mileage achieved by this car-driver combination exceeds or falls short of the overall population mean θ. It is now convenient to define the row means and column means of the δ_{ij} in this population table as the (random) effects for cars and drivers respectively, and the "nonadditive" increment as the (random) interaction effect. Thus,

r_i = δ_{i·} = random effect for ith row (car),
c_j = δ_{·j} = random effect for jth column (driver),
t_{ij} = δ_{ij} − δ_{i·} − δ_{·j} = random interaction effect (car × driver),

where, as before, we use the notation that a dot replacing a subscript indicates the average over that subscript. The identity

Σ Σ δ²_{ij} = 𝒥 Σ δ²_{i·} + ℐ Σ δ²_{·j} + Σ Σ (δ_{ij} − δ_{i·} − δ_{·j})²
can then be written

Σ Σ δ²_{ij} = 𝒥 Σ r²_i + ℐ Σ c²_j + Σ Σ t²_{ij}.    (6.1.2)

If we now define the variance components as

σ_r² = Σ r²_i / ℐ          row (car) component,
σ_c² = Σ c²_j / 𝒥          column (driver) component,    (6.1.3)
σ_t² = Σ Σ t²_{ij} / (ℐ𝒥)   interaction (car-driver) component,

then, letting σ_δ² = Σ Σ δ²_{ij} / (ℐ𝒥), we obtain

σ_δ² = σ_r² + σ_c² + σ_t².    (6.1.4)
On this formulation, then, we can substitute δ_{ij} = r_i + c_j + t_{ij} in (6.1.1). The model for the actual test data in the 6 × 8 replicated array of Table 6.1.1 is thus

y_{ijk} = θ + r_i + c_j + t_{ij} + e_{ijk}.

We now consider how the r_i, c_j, and t_{ij} in the equation are to be interpreted.
The random effect model assumes that the I = 6 rows (cars) were selected randomly from a population of ℐ rows (cars) and the J = 8 columns (drivers) were similarly selected from a population of 𝒥 columns (drivers). We show in the next section that, assuming the populations to be large, this random sampling ensures that the r_i, c_j, and t_{ij} may all be treated as uncorrelated random variables. Also the errors e_{ijk} are by assumption uncorrelated with the δ_{ij}. Thus finally the two-way random effect model may be written

y_{ijk} = θ + r_i + c_j + t_{ij} + e_{ijk},   i = 1, ..., I;  j = 1, ..., J;  k = 1, ..., K,    (6.1.5)

where r_i, c_j, t_{ij}, e_{ijk} are all uncorrelated random variables with zero means and

E(r²_i) = σ_r²,   E(c²_j) = σ_c²,   E(t²_{ij}) = σ_t²,   E(e²_{ijk}) = σ_e²,    (6.1.6)

so that
Var(y_{ijk}) = σ_r² + σ_c² + σ_t² + σ_e².    (6.1.7)

In the traditional sampling-theory analysis of the problem, this setup would be appropriate for making inferences about the behavior of an infinite population of drivers and cars, but not about the particular drivers and cars tested. The problem might arise, for example, in a general production study by a large manufacturer of motor cars. The conclusions drawn could indicate the degree of variability of gas mileage over the relevant populations of drivers and cars and the relative contributions of cars, drivers, and interaction to that overall variability.

Justification of the Two-way Random Effect Model
To show that the r_i, c_j, and t_{ij} in (6.1.5) are uncorrelated random variables with the properties just mentioned, we regard the particular small sets of I rows (cars) and J columns (drivers) actually under study as random samples (without replacement) from the ℐ rows and 𝒥 columns of Table 6.1.2. The quantities δ_{ij} in (6.1.1) are then random variables. It can be readily shown that

E(δ_{ij} δ_{ij′}) = σ_r² − (σ_c² + σ_t²)/(𝒥 − 1),   j ≠ j′,

E(δ_{ij} δ_{i′j}) = σ_c² − (σ_r² + σ_t²)/(ℐ − 1),   i ≠ i′,    (6.1.8)

and

E(δ_{ij} δ_{i′j′}) = −σ_r²/(ℐ − 1) − σ_c²/(𝒥 − 1) + σ_t²/[(ℐ − 1)(𝒥 − 1)],   i ≠ i′,  j ≠ j′;

i, i′ = 1, ..., I;  j, j′ = 1, ..., J.
On the assumption that the populations are large, we have

E(δ_{ij} δ_{ij′}) → σ_r²,   E(δ_{ij} δ_{i′j}) → σ_c²,   and   E(δ_{ij} δ_{i′j′}) → 0.    (6.1.9)
If I_J is a J × J identity matrix and 1_J is a J × 1 vector of ones, then the IJ × IJ covariance matrix of the quantities δ_{ij} is given by

σ_t² I_I ⊗ I_J + σ_r² I_I ⊗ 1_J 1′_J + σ_c² 1_I 1′_I ⊗ I_J.    (6.1.10)
The form of this matrix confirms the appropriateness of the formulation set out in (6.1.5) through (6.1.7).
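The structure of (6.1.10) can be verified directly. The following sketch is our construction, with arbitrary illustrative values for I, J, and the components; it builds the covariance matrix with Kronecker products and checks it against the element-wise limiting rules of (6.1.9).

```python
import numpy as np

# Sketch (ours, assuming large populations): the IJ x IJ covariance matrix
# (6.1.10) of the delta_ij, built with Kronecker products.  Observations are
# ordered row-major, index = i*J + j.
I, J = 3, 4
s2r, s2c, s2t = 2.0, 1.5, 0.5          # sigma_r^2, sigma_c^2, sigma_t^2
II, IJmat = np.eye(I), np.eye(J)
onesI, onesJ = np.ones((I, 1)), np.ones((J, 1))

V = (s2t * np.kron(II, IJmat)
     + s2r * np.kron(II, onesJ @ onesJ.T)
     + s2c * np.kron(onesI @ onesI.T, IJmat))

# check the four entry types: same cell, same row, same column, neither
assert np.isclose(V[0, 0], s2r + s2c + s2t)   # (i, j) = (i', j')
assert np.isclose(V[0, 1], s2r)               # i = i', j != j'
assert np.isclose(V[0, J], s2c)               # i != i', j = j'
assert np.isclose(V[0, J + 1], 0.0)           # i != i', j != j'
print("covariance structure (6.1.10) verified")
```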
6.1.3 Two-way Mixed Model
The same car-driver data could arise in quite different circumstances. Suppose a car rental agency has a particular set of I = 6 cars, not necessarily all of the same model or make, and it is desired to estimate their individual mean performances when operated by a relevant population of drivers. To do this a test is conducted with J = 8 drivers randomly chosen from the population of interest. The underlying setup is represented by the elements in Table 6.1.3, which can be thought of as a sub-population of Table 6.1.2 in which the I particular rows (cars) of interest are selected.

Table 6.1.3 A sub-population of I particular rows of the population of increments δ_{ij}: the columns are the 𝒥 drivers J = 1, ..., 𝒥, the rows are the I cars of interest with entries δ_{iJ}, and the row means are r_1, ..., r_I.
The behavior of the increments of the I cars with, say, the jth driver is thus represented by an I-variate observation δ′_j = (δ_{1j}, ..., δ_{Ij}) from a large population of 𝒥 drivers with

E(δ_j) = r   and   Cov(δ_j) = V_I = {σ_{ii′}},   i, i′ = 1, ..., I,

where

σ_{ii} = (1/𝒥) Σ_J (δ_{iJ} − r_i)²,   σ_{ii′} = (1/𝒥) Σ_J (δ_{iJ} − r_i)(δ_{i′J} − r_{i′}).    (6.1.11)

Further, it can be verified that when 𝒥 → ∞ the increments associated with two different drivers are uncorrelated, that is,

Cov(δ_{ij}, δ_{i′j′}) = 0,   j ≠ j′.
The model in (6.1.1) can then be written

y_{ijk} = θ + δ_{ij} + e_{ijk},   i = 1, ..., I;  j = 1, ..., J;  k = 1, ..., K.    (6.1.12)

With

θ_i = θ + r_i,    (6.1.13)

we have finally

y_{ijk} = θ_i + (δ_{ij} − r_i) + e_{ijk}.    (6.1.14)

The principal objective here is to make inferences about the mean vector θ which determines the mean performances of the cars. However, the meaning to be attached to the individual elements of the covariance matrix V_I of the δ's is also worth considering. The ith diagonal element σ_{ii} of this matrix represents the variation in performance when the ith car is operated by the whole population of drivers. The covariance σ_{ii′} may be written σ_{ii′} = ρ_{ii′}(σ_{ii} σ_{i′i′})^{1/2}, where ρ_{ii′} measures the correlation between performances achieved on cars i and i′ by the population of drivers. If the two cars are similar, the driver who does well on one will usually do well on the other, so that ρ_{ii′} would be positive. For two cars with widely different characteristics, ρ_{ii′} may be a quite small positive number or even negative. For example, if car i were easily manipulated by right-handed people but not by left-handed people, while car i′ were easily manipulated by left-handed people but not by right-handed people, then ρ_{ii′} could be negative. We shall see
in the next section that by use of this correlation structure we can allow for "interactions" of a much more sophisticated kind than is possible with the customary assumptions about "interaction variance." Indeed, the customary assumptions correspond to a very restricted special case of the general model given here.
6.1.4 Special Cases of Mixed Models

The mixed model set out in (6.1.12) involves, as parameters, I unknown means θ_i, ½I(I − 1) + I unknown elements in the covariance matrix V_I, and an unknown σ_e². Situations occur where simpler models are appropriate. Two special cases may be of interest.

1. Additive Model
What is sometimes called the additive mixed model is obtained if we assume that

V_I = σ_c² 1_I 1′_I.    (6.1.15)

In terms of the entries in Table 6.1.3, this is saying that δ_{ij} − r_i = δ_{i′j} − r_{i′}, so that

σ_{ii} = σ_{ii′} = σ_c²,   all (i, i′).

In this case, the model (6.1.1) may be written in the form

y_{ijk} = θ_i + c_j + e_{ijk},    (6.1.16)

where c_j and e_{ijk} are uncorrelated random variables with zero means and E(c²_j) = σ_c², E(e²_{ijk}) = σ_e², so that E(y_{ijk}) = θ_i and Var(y_{ijk}) = σ_c² + σ_e². If for the moment we ignore the experimental error e_{ijk}, the model can be represented as in Fig. 6.1.1 for I = 2, J = 3. For the driver-car example, c_j is the incremental performance for the jth driver. It measures his ability to get a little more or a little less gas mileage. The
Fig. 6.1.1 Graphical illustration of an additive model: (a) performance of three drivers on car 1; (b) performance of the same three drivers on car 2.
additive model will be appropriate if, apart from experimental error, this incremental performance c_j associated with the jth driver remains the same for all I cars which he drives (that is, c_j = δ_{ij} − r_i = δ_{i′j} − r_{i′}). In choosing a random sample of drivers we will choose a random sample of incremental performances, so that c_1, c_2, ..., c_J can be taken as a random sample from a distribution (for example, the Normal shown in Fig. 6.1.1) with zero mean and fixed variance σ_c². This structure is sometimes adopted in modelling data generated by Fisher's randomized block designs. From this point of view we would here be testing I treatments (cars) in J randomly chosen blocks (drivers).
2. Interaction Model
Another special case arises when the covariance matrix V_I takes the form

V_I = φ² [(1 − ρ) I_I + ρ 1_I 1′_I],    (6.1.17)

where |ρ| < 1. In terms of the entries in Table 6.1.3, this says that the I elements (δ_{1j}, ..., δ_{Ij}) are equi-correlated and have common variance, that is,

(1/𝒥) Σ_J (δ_{iJ} − r_i)² = φ²,   (1/𝒥) Σ_J (δ_{iJ} − r_i)(δ_{i′J} − r_{i′}) = φ²ρ,   i, i′ = 1, ..., I;  i ≠ i′.    (6.1.18)

If we further assume that ρ > 0, the covariance matrix in (6.1.17) can be written

V_I = σ_t² I_I + σ_c² 1_I 1′_I,    (6.1.19)

where σ_c² = φ²ρ and σ_t² = φ²(1 − ρ). The model in (6.1.1) can be equivalently expressed as

y_{ijk} = θ_i + c_j + t_{ij} + e_{ijk},    (6.1.20)

where c_j, t_{ij}, and e_{ijk} are uncorrelated variables with zero means and

E(c²_j) = σ_c²,   E(t²_{ij}) = σ_t²,   E(e²_{ijk}) = σ_e²,

so that E(y_{ijk}) = θ_i and Var(y_{ijk}) = σ_c² + σ_t² + σ_e². This model is sometimes known as the mixed model with interaction. When σ_t² = 0, it reduces to the additive mixed model in (6.1.16). The implications of this interaction model are illustrated in Fig. 6.1.2. Ignoring the measurement error e_{ijk}, the two dots in Fig. 6.1.2(a) show the contributions θ_1 + c_1 + t_{11} and θ_1 + c_2 + t_{12} for drivers 1 and 2 using car 1. The quantities θ_1 and θ_2 are fixed constants, c_1 and c_2 are drawn from a Normal distribution
Fig. 6.1.2 Graphical illustration of an interaction model: (a) performance of two drivers on car 1; (b) performance of the same two drivers on car 2.
having variance σ_c², and t_{11} and t_{12} are random drawings from a Normal distribution having variance σ_t². Figure 6.1.2(b) shows the situation for the same two drivers operating a second car. Note that although a shift in mean from θ_1 to θ_2 has occurred, c_1 and c_2 are assumed to remain the same as before, but a different sample of t's is added. This interaction model is less restrictive than the additive mixed model because it allows the possibility that an increment associated with a given driver will be different on different cars. However, because the additional increments t_{ij} are assumed to be independent and to have the same variance for each driver and each car, it is still too restrictive to represent many real situations. For example, consider two particular cars, say car 1 and car 2, which were included in the experiment. According to the model, the increments associated with the J drivers would be as follows:

        Driver 1        ...   Driver j        ...   Driver J
Car 1   c_1 + t_{11}    ...   c_j + t_{1j}    ...   c_J + t_{1J}
Car 2   c_1 + t_{21}    ...   c_j + t_{2j}    ...   c_J + t_{2J}
Now suppose that the particular cars 1 and 2 were very much alike in some important feature but differed from the remaining cars. Thus, these particular cars might have, say, less leg room for the driver than the remaining I − 2 cars which were tested. Then we would expect that the incremental performance c_j of a short-legged driver j would be enhanced by a factor t_{1j} almost identical to t_{2j}, while the incremental performance of a long-legged driver would be reduced almost equally, the negative contributions t_{1j} and t_{2j} again being almost equal. Therefore, for these almost identical cars we would expect t_{1j} to be the same as t_{2j} for all j. In other words, we would expect t_1 and t_2 to be highly correlated within drivers and not uncorrelated as required by the model. The assumption
of common variance for t_{ij} might also be unrealistic because, for example, the differences between drivers might be emphasized or de-emphasized by particular cars.
6.1.5 Two-way Fixed Effect Model

In some situations, no question would arise of attempting to deduce general properties of either the population of drivers or the population of cars. Interest might center on the performance of a particular set of I cars with a particular group of J drivers. In this case, setting

θ̃ = θ + δ_{··},   α_i = δ_{i·} − δ_{··},   β_j = δ_{·j} − δ_{··},   γ_{ij} = δ_{ij} − δ_{i·} − δ_{·j} + δ_{··},    (6.1.21)

where

δ_{i·} = (1/J) Σ_j δ_{ij},   δ_{·j} = (1/I) Σ_i δ_{ij},   δ_{··} = (1/IJ) Σ_i Σ_j δ_{ij},
the model (6.1.1) becomes

y_{ijk} = θ̃ + α_i + β_j + γ_{ij} + e_{ijk}.    (6.1.22)

This is the two-way classification fixed effect model, appropriate in this context for estimating the parameters α_i, β_j, γ_{ij}. As we have mentioned earlier, the term fixed effect is essentially a sampling theory concept because in this framework the effects α_i, β_j, and γ_{ij} are regarded as fixed but unknown constants. From the Bayesian viewpoint, all parameters are random variables, and the appropriate inference procedure for these parameters depends upon the nature of the prior distribution used to represent the behavior of the factors corresponding to rows and columns. Problems could arise where a noninformative situation could be approximated by allowing these to take locally uniform and independent priors. We should then be dealing with a special case of the linear model discussed in Section 2.7. On the other hand, and in particular for the driver-car example, it might be more realistic to assume that the α's, β's, and γ's were themselves random samples from distributions with zero means and variances σ_α², σ_β², σ_γ², respectively. The problem of making inferences about the parameters α_i, β_j, γ_{ij} would then parallel the approach given later in Section 7.2 for the estimation of means from a one-way classification random effect model. In the above we have classified a number of different problems which can be pertinent in the analysis of cross classification designs. In what follows we shall not attempt a comprehensive treatment. Rather, a few familiar situations will be studied to see what new features come to light with the Bayesian approach. We begin by considering the two-way random effect model in (6.1.5) and then devote the remainder of the chapter to the analysis of the additive mixed model in (6.1.16) and the mixed model with interaction in (6.1.20).
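The decomposition (6.1.21) is just a set of row, column, and residual means. The sketch below is ours, with simulated data standing in for the cell means; it computes the sample analogues of θ̃, α_i, β_j, γ_{ij} and checks the zero-sum constraints they satisfy by construction.

```python
import numpy as np

# Sketch (ours) of the fixed effect decomposition (6.1.21), applied to the
# cell averages y_ij. of simulated data with I = 6 cars, J = 8 drivers.
rng = np.random.default_rng(3)
I, J, K = 6, 8, 2
y = 25 + rng.standard_normal((I, J, K))
cell = y.mean(axis=2)                        # y_ij.
grand = cell.mean()                          # estimate of theta~
alpha = cell.mean(axis=1) - grand            # row (car) effects
beta = cell.mean(axis=0) - grand             # column (driver) effects
gamma = (cell - cell.mean(axis=1, keepdims=True)
              - cell.mean(axis=0, keepdims=True) + grand)

# each set of effects sums to zero over the relevant index
print(alpha.sum().round(10), beta.sum().round(10))
print(gamma.sum(axis=0).round(10), gamma.sum(axis=1).round(10))
```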
6.2 CROSS CLASSIFICATION RANDOM EFFECT MODEL
In this section the problem is considered of making inferences about the variance components in the two-way model

y_{ijk} = θ + r_i + c_j + t_{ij} + e_{ijk},   i = 1, ..., I;  j = 1, ..., J;  k = 1, ..., K,    (6.2.1)

where y_{ijk} are the observations, θ is a location parameter, r_i, c_j, and t_{ij} are three different kinds of random effects, and e_{ijk} are the residual errors. In the context of our previous discussion the random effects r_i, c_j, and t_{ij} could be associated respectively with cars, drivers, and car-driver interaction. The model would arise naturally if the samples under study were randomly drawn from large populations of cars and drivers about which we wished to make inferences. It was shown that the r_i, c_j, t_{ij}, and e_{ijk} could then be taken to be uncorrelated with zero means and variances σ_r², σ_c², σ_t², and σ_e² respectively, so that

Var(y_{ijk}) = σ_r² + σ_c² + σ_t² + σ_e².    (6.2.2)
The form of the model (6.2.1) is closely related to the three-component hierarchical design model discussed in Section 5.3. Indeed, expression (6.2.1) reduces to (5.3.1), with appropriate changes in notation, if either σ_r² or σ_c² is known to be zero. The relevant sample quantities are conveniently arranged in the analysis of variance shown in Table 6.2.1. Using the table, we can write

σ̂_r² = (m_r − m_t)/(JK),   σ̂_c² = (m_c − m_t)/(IK),   σ̂_t² = (m_t − m_e)/K,   σ̂_e² = m_e,

where these sampling theory point estimates of the components σ_r², σ_c², σ_t², and σ_e² are obtained by solving the equations in expectations of mean squares, as in the case of hierarchical design models. The difficulties encountered are similar to those discussed earlier.
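As an illustration of these moment estimates, the sketch below is ours; the simulated data and component values are arbitrary assumptions, and K ≥ 2 is assumed so that m_e is defined. It computes the mean squares of Table 6.2.1 and solves the expected mean square equations.

```python
import numpy as np

# Sketch (ours): ANOVA mean squares for a two-way random effect layout
# y of shape (I, J, K), and the resulting moment estimates.
def variance_components(y):
    I, J, K = y.shape
    gm = y.mean()
    yi, yj, yij = y.mean((1, 2)), y.mean((0, 2)), y.mean(2)
    m_r = J * K * np.sum((yi - gm) ** 2) / (I - 1)
    m_c = I * K * np.sum((yj - gm) ** 2) / (J - 1)
    m_t = (K * np.sum((yij - yi[:, None] - yj[None, :] + gm) ** 2)
           / ((I - 1) * (J - 1)))
    m_e = np.sum((y - yij[:, :, None]) ** 2) / (I * J * (K - 1))
    return {"sigma_r^2": (m_r - m_t) / (J * K),
            "sigma_c^2": (m_c - m_t) / (I * K),
            "sigma_t^2": (m_t - m_e) / K,
            "sigma_e^2": m_e}

rng = np.random.default_rng(0)
I, J, K = 9, 9, 2
y = (25 + 1.5 * rng.standard_normal((I, 1, 1))    # car effects
        + 2.5 * rng.standard_normal((1, J, 1))    # driver effects
        + 0.5 * rng.standard_normal((I, J, 1))    # interaction
        + 1.1 * rng.standard_normal((I, J, K)))   # error
print(variance_components(y))
```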
6.2.1 The Likelihood Function

To obtain the likelihood function, it will be useful to recall first a number of results related to the analysis of two-way classifications, which are summarized in the following theorem.

Theorem 6.2.1 Let x_{ij} (i = 1, ..., I; j = 1, ..., J) be IJ independent observations from a Normal distribution N(0, σ²) arranged in a two-way table of I rows and J columns. Let x_{··}, x_{i·}, x_{·j}, and x_{ij} − x_{i·} − x_{·j} + x_{··} be respectively the grand mean, row means, column means, and residuals. Then:

1. Σ Σ x²_{ij} = IJ x²_{··} + J Σ (x_{i·} − x_{··})² + I Σ (x_{·j} − x_{··})² + Σ Σ (x_{ij} − x_{i·} − x_{·j} + x_{··})².
Table 6.2.1 Analysis of variance of two-way random effect models

Source             S.S.                                                    d.f.                   M.S.            E.M.S.
Grand mean         IJK(ȳ_{···} − θ)²                                       1                                      σ_e² + Kσ_t² + IKσ_c² + JKσ_r²
Main effect "r"    S_r = JK Σ (ȳ_{i··} − ȳ_{···})²                         v_r = I − 1            m_r = S_r/v_r   σ²_{etr} = σ_e² + Kσ_t² + JKσ_r²
Main effect "c"    S_c = IK Σ (ȳ_{·j·} − ȳ_{···})²                         v_c = J − 1            m_c = S_c/v_c   σ²_{etc} = σ_e² + Kσ_t² + IKσ_c²
Interaction r × c  S_t = K Σ Σ (ȳ_{ij·} − ȳ_{i··} − ȳ_{·j·} + ȳ_{···})²    v_t = (I − 1)(J − 1)   m_t = S_t/v_t   σ²_{et} = σ_e² + Kσ_t²
Residual           S_e = Σ Σ Σ (y_{ijk} − ȳ_{ij·})²                        v_e = IJ(K − 1)        m_e = S_e/v_e   σ_e²
Total              Σ Σ Σ (y_{ijk} − θ)²                                    IJK
2. The sets of quantities x_{··}, {x_{i·} − x_{··}}, {x_{·j} − x_{··}}, and {x_{ij} − x_{i·} − x_{·j} + x_{··}} are distributed independently of one another.

3. J Σ (x_{i·} − x_{··})² ~ σ² χ²_{(I−1)},   I Σ (x_{·j} − x_{··})² ~ σ² χ²_{(J−1)},   and   Σ Σ (x_{ij} − x_{i·} − x_{·j} + x_{··})² ~ σ² χ²_{(I−1)(J−1)}.

4. So far as each of the sets {x_{i·} − x_{··}}, {x_{·j} − x_{··}}, and {x_{ij} − x_{i·} − x_{·j} + x_{··}} is concerned, the corresponding sum of squares given in (3) is a sufficient statistic for σ².
In the theorem, (1) is an algebraic identity; (2) can be proved by showing that the four sets of variables are uncorrelated with one another; (3) follows from Cochran's theorem (1934) on the distribution of quadratic forms; and (4) can be seen from inspection of the likelihood function. For the model (6.2.1), we shall make the usual assumption that (r_i, c_j, t_{ij}, e_{ijk}) are Normally distributed. Since the effects are uncorrelated, the assumption of Normality implies that they are also independent. To obtain the likelihood function, it is convenient to work with the set of quantities ȳ_{···}, {ȳ_{i··} − ȳ_{···}}, {ȳ_{·j·} − ȳ_{···}}, {ȳ_{ij·} − ȳ_{i··} − ȳ_{·j·} + ȳ_{···}}, and {y_{ijk} − ȳ_{ij·}}. We have

ȳ_{···} = θ + r̄ + c̄ + t̄ + ē,
ȳ_{i··} − ȳ_{···} = (r_i − r̄) + (t_{i·} − t̄) + (e_{i··} − ē),
ȳ_{·j·} − ȳ_{···} = (c_j − c̄) + (t_{·j} − t̄) + (e_{·j·} − ē),    (6.2.3a)
ȳ_{ij·} − ȳ_{i··} − ȳ_{·j·} + ȳ_{···} = (t_{ij} − t_{i·} − t_{·j} + t̄) + (e_{ij·} − e_{i··} − e_{·j·} + ē).
The likelihood function is, therefore,
X
exp - {-
l}
1!·lJK(0-y .. J2 , "eme ",m, . v,m, . Vel11e 2 2 2 T - 2 - + - 2 - T - 2 - T - 22 aelr + ae'e - a., ac a., a err ae,e
.
(6.2.4)
The likelihood can alternatively be thought of as arising from one observation from a Normal distribution with mean 0 and variance (a;" + a;, e - a;,)j(1JK),
332
Analysis of Cross Classification Designs
6.2
together with (ve> v" v" vc ) further independent observations drawn from Normal distri butions with zero means and variances (u;, U,2, u;'r> u;,J, respectively. The model supplies the additional information that the parameters (u;, u;" u;'r' u;,c) are subject to the constraint
(6.2.5)
6.2.2 Posterior Distribution of (u;,
u~)
u;,
U,2,
Adopting the argument in Section 1.3 and treating the location parameter () separately from the variances (u;, u;". u;,c)' we employ the noninformative reference prior distribution
r;;"
(6.2.6) subject to the constraint C. tion of (u;, u;" u;", u;,c) is
Upon integrating out 0, the joint posterior distribu-
(6.2.7) where W
- 1
=
p.* (C I) I Y
=
pr
f\ -1.2 > 2
Ve
XV,
V 111 e
e
--, V,111,
2 1.e"r
,
vr m
r
2 1.eVe
- 2 <... - - , - 2
XV,
vim,
i
Xv,
<
vC m C \
--J' v,m,
and (X~c' l" X;" x;J are independently distributed variables with (v e, Vn v" vJ degrees of freedom, respectively. It follows that the joint distribution of the variance components (u;, U,2, u;, is
r;;)
6.2
Cross Classification Random Effect \'Iodel
333
As might be expected, this distribution is similar to the distribution in (5.3.11) for the three-component model. The scaled X- 2 approximation techniques developed in Sections 5.2.6 and 5.2.12 can now be employed to obtain marginal . distribution of certain subsets of the components .
6.2.3 Distribution of (cr;, cr,2, cr;) To iJlustrate, suppose we wish to make inferences jointly about (cr;, cr,2, cr;). The corresponding posterior distribution can, of course, be obtained by integrating (6.2 .8) over cr~. Alternatively, we may use (6.2.7) and first obtain the distribution of (cr; , cr;" cr;,r) by integrating out cr;,c yielding
( 6.2.9) where
and Wi
= w Pr
{F
v
v c.
t
me} .
< m,
We see that the function G(cr;,) is in precisely the same form as the posterior distribution of the variance cr ~ for the one-way model in (5.2.26). It follows by using the technique developed in Section 5.2.6 that G( cr:,) is approximately 2 . G(cr e ,) == (v,m,) I
I
-
t
p
( X\: 2 > -
,
l
cr el
= - , -, 'V,m,
where
)
(6.2.10)
'
vIm,
J
ml =--, ' atv,
and , as before, Ix(p, q) is the incomplete beta function. To this degree of approximation , the joint distribution of (cr; , cr,l , cr;) is
2 2 2.=
p(cr e , cr" crr I y)
WI
(Veme)
_IP (-2 Xv. =
(- 2-_ 'cr;
cr; ) (v;m;) -K - I P X\:
-Verne
X(vrmr) -I p (-2 = Xv,
cr;
+ Kcr,2 + JKcr;)
JK
v,m, cr; > 0,
Cl,2
> 0,
+ Kcr;) I'
vrm t
l
1
(J;
'
> 0,
(6 .2. 11)
334
Analysis of Cross Classification Designs
where _ 1
WI
=
Pr
6.2
2
{XV. v.me -2- > -,-, , Xv;
v, m,
which is exactly the same form as the distribution in (5.3.11) and can thus be analyzed as before. It is clear that an analogous argument can be used to derive the joint distribution of (a; , a,2, a;) if desired.
6.2.4 Distribution of (a;, a;) The derivation given above allows us to make inferences about the individual components (a;, a,2, a;, a;) and joint inferences about the sets (a;, (J,2, a;) and (a;, (J,2, a;). In cross classification designs, we are often interested mainly in the "main effect" variances «(J;, (J;). The corresponding joint posterior distribution is, of course, given by integrating (6.2.8) over a; and (J,2,
(6.2.12) It does not seem possible to express the exact distribution in terms of simple functions and direct calculation of the distribution would require numerical evaluation of a double integral for every pair of values «(J;, (J;). We therefore introduce an approximation method which reduces (6.2. t 2) to a one-dimensional integral. First, we obtain the distribution of (a;" (J;'r' (J;, c) from (6.2.7),
(6.2.13) where
and W
" _ w Pr {F v,.v. < meJ m'l . -
O'L
The function H«(J;,) is in the same form as the distribution of in (5.2.57) for the two-component model. It follows from the method in (5.2.60) that
a2ct I\ 2 . 'f -1 .. - 2 _ H(u e,) = (v, m,) p { X,." - -"- ,, J t v, m t II
(6.2.14)
-
- . ,
6.2
where
.
_.
_- _
_ "'~L -~-...,
-
Cross Classification Random Effect Model
"
v
I
V,
J X,(}V,
+ I , -lY e)
= - -=------'-a2
I." ,(1 V" Jv e )
335
vim, m , = - -" ' a 2 v, II
and
Hence,
where
so that
u; > 0,
u~
> O.
(6.2. 15)
Calculation of the approximating distribution in (6.2. 15) thus involves computing only one dimensional integrals, a task which is considerably simpler than exact evaluation of (6.2.12) .
6.2.5 An Illustrative Example To illustrate the analysis presented in this section, Table 6.2.2 gives the results for a randomized experiment. The data consists of 162 observations representing mileages per gallon of gasoline for 9 drivers on 9 cars with each driver making two runs with each car (l = J = 9, K = 2). We shall analyze the example by supposing that the underlying Normality and independence assumptions for the model (6.2.1) are appropriate and shall adopt the noninformative reference prior in (6 .2.6) for making inferences.
Table 6.2.2
2 3 4 5 6 7 8 9
<..>
2
3
4
Dri vers 5
6
7
8
9
26.111 26.941 22.652 22. 139 25 .218 27 .189 25.192 25.081 24.631 25.930 21.809 23.144 22.236 22.245 24.542 23.816 21.472 23.130
29.719 29.218 25.966 26.835 27.682 27.521 25.962 28.715 29.567 28.368 28.030 28 .234 26.467 27.115 27.103 24. 8 17 25.108 27.589
31.915 32.183 27.856 26.241 28.912 28.059 28.717 29.783 27.140 30.818 28.447 27.670 25.716 25.059 25.051 25.293 28.419 25.941
34.582 36 .490 31.090 31.320 33.821 34.462 34.619 34.653 34.553 35.337 31.432 32.203 31.916 31.541 32.282 31.295 31.241 32.459
28.712 27.091 25.956 23.653 29.394 27.859 27.663 25.516 27.746 26.210 26.551 24.538 25.028 26.296 25.096 25.909 27.020 24.445
31.518 30.448 23.375 25.298 28.713 29.302 27.511 30 .906 31.371 21.495 28.073 28 .148 25.238 25.083 27.655 26.207 26.676 25.738
26.513 28.440 25.329 24.098 26.005 27.020 26.145 23.299 27.290 24.689 24.575 23.999 21.607 21.900 25.038 22.951 22.795 23.299
29.573 29.464 28.648 27. 136 31.174 28.003 26.834 29.549 33.239 32.319 28.026 28.820 27.687 29 .357 26.414 26.256 27.638 27.385
Cars 32.431 31.709 26.356 26.225 30.243 31.785 29.830 29.859 33.464 30.307 28.313 27.998 28.294 27.363 29.864 27.363 27.438 27.486
Gas mileage for 9 drivers driving 9 cars (duplicate runs) c..
'"
> ::J
I"
.;;
'"
[;;.
.... ...0rJ 0
~
., rJ
'"'"5i ~
o·
::J
0
~
o'C '
'"'"
The averages for each driver on each car as well as the driver means, car means, and grand mean are given in Table 6.2.3 . The breakdown of the sum of squares and other relevant quant ities for the analysis are summarized in Table 6.2.4 in the fo rm of an analysis of variance table.
T3.ble 6.2.3
Average mileage for 9 drivers on 9 cars
Cars
1
2
3
4
Drivers 5
6
7
8
9
Row means
1 2 3 4 5 6 7 8 9
32.0700 26.2905 31.0140 29.8445 31.8855 28.1555 27.8285 28.6135 27.4620
26.5260 22.3955 26.2035 25 .1365 25.2805 22.4765 22.2405 24.1790 22.3010
29.4685 26.4005 27.6015 27.331>5 28.9675 28.1320 26.7910 25.9600 26.3485
32.0490 27.0485 28.4855 29.2500 28.9790 28.0585 25.3875 25.1720 27.1800
35.5360 31.2050 34.14J5 34.6360 34.9450 31.8175 31 .7285 31.7885 31.8500
27.9015 24.8045 28.6265 26.5895 26.9780 25.5445 25.6620 25.5025 25.7325
30.9830 24.3365 29.0075 29.2085 30.4330 28 .1105 25.1605 26.9310 26.2070
27.4765 24.7135 26.5125 24.7220 25.9895 24.2870 21.7535 23.9945 23.0470-
29.5185 27.8920 29.5885 28.1915 32.7790 28.4230 28.5220 26.33 50 27.5115
30.1699 26. J 207 29.0201 28 .3241 29.5819 27.2228 26.1193 26.4973 26.4044
Column means
29.2404
24.0821
27.4453
27.9567
33.0720
26.3713
27.8197
24.7218
28.7512
27.7178 (Grand mean)
-
...
c;-,
Cross Classification Random Effect Model
6.2
337
Table 6.2.4 Analysis of variance of 9 drivers on 9 cars Source
d.f.
S.S.
M.S.
E.M.S.
8
mr = 45.2623
6;{y = 6; + 26,2 + 186; 6;,c = 6; + 2o} + 186;
Grand mean
162(8 - 27.7178)2
Main effect cars
362.0985
Main effect drivers
1,011.6550
v = 8
c
me = 126.4569
Interaction
109.6123
v,=64
m, =
1.7127
95.2460
ve = 81
me=
1.1759
Residuals
8"; = 1.1759,
Vr =
8",2 = 0.2684,
8"; = 6.9302,
6;, = 6; + 26,2 6 e2
8"; =2.4194
In examples of this kind, our main interests usuall y center on the "main effect" variances (6;, 6;) and the interaction variance 6,2 However it seems clear from Table 6.2.4 [and can be read ily confirmed by a fuller Bayesian analysis using, for example, (6.2 .11)J that, for this example, 6,2 is small compared with 6; and
6; .
If we wish to make inferences about the "main effect" variances (6;,6;) we may use the approximation (6.2.14) to eliminate yielding
6;
X2 =
a2
109.6123 0 3 204.8583 = .5 51,
= 33 x
I x2 (34,40.5) I x2 (33 , 40.5) - 32 x I x ,(33,40.5) I x2 (32 ,40.5) ,
= 33(0.98161) - 32(0.98461) = 0.88561,
v;' = m" t
=
O.8:~61
(0 .98461) = 71.1544,
109.6123 (0.8856 1)(71.1544)
= 1.7395
and
v;'m;'
=
123.7704.
,
338
Analysis of Cross Classification Designs
6.2
From (6.2.15), the approximate posterior distribution of (a;, a;) is then
x P
(
-2
X71.1544
=
2)
a et
123.7704
d2 et ·
a
Figure 6.2.1 shows four contours of the distribution calculated by numerical integration together with the mode (a;o' (J;o) which is at approximately (I.92, 5.52).
20.0
-
10.0
o
2.0
4.0
6.0
a; ( Cars) Fig.6.2.1 Contours of posterior distribution of «(J~, (J~): the car-driver data.
The levels of the contours are respectively 50, 25, 10 and 5 per cent of the density at the mode. Using the asymptotic Normal approximation and since the X2 distribution with two degrees of freedom is an exponential, these contours are very roughly the boundaries of the 50, 75, 90 and 95 per cent H. P.D. regions. It will be seen that for this example both the main effect components are substantial and that they appear to be approximately independently distributed.
Cross Classification Random Effect Model
6.2
339
6.2.6 A Simple Approximation to the Distribution of (a;, a;) When , as in this example, the modified degrees of freedom v;' is large, the last factor in the integrand of (6.2.15) will be sharply concentrated about 111;'. We may adopt an argument similar to the one leading to the modal approximation in (5.2.43) to write
To this degree of approximation, (J; and a; are independent, each distributed like a truncated inverted '/ variable. More specifically, m ;'
+ JKa; v,m,
,.v
X:;. 2
m;'
truncated from below at
v,m,
and
(6 .2.17) m; ' +IK(J; ,.v
Veme
X-:c2
truncated from below at
m;' Vern e
20.0
10.0
o
2.0
4.0
6.0
0; (Cars) Fig. 6.2.2 Contours of the approximate posterio r distribution of ((J~. (J~) : the car-driver data .
Analysis of Cross Classification Designs
340
6.2
Returning to our example, the posterior distribution of the variances (0';, a~) using the present approximation is
22
(-2 =
(-2 =
1.7395 + 180';) 362.0985 P X8
p(a" a c I y) ex: P X8
1.7395 + 18a~) . 1,011.6550
Figure 6.2.2 gives four contours corresponding to the 50, 75 , 90 and 95 per cent H.P.D. regions, together with the mode (8';, a~). They are very close tc{ the contours in Fig. 6.2.1. In fact, by overlaying the two figures one finds that they almost completely coincide, showing that for this example (6.2 .16) is an excellent approximation to (6.2.15).
pro;
pro:
y)
0.40
y)
0.40 Cars
0 30
0.30
020
0.20 Driv ers
0.10
0.10
0; 0
10
Fig. 6.2.3 Approximate posterior distribution of the car-driver data.
a;:
L.-L-_L-_ _L-_ _L--==-t=~_..Lo2 -,
-?
15
10
0
15
20
25
c
Fig. 6.2.4 Approximate posterior distribution of a~: the car-driver data.
a;
Using (6.2.17), individual inferences about can be made by referring the quantity (1.7395 + 18a;)/362.0985 to an inverted X2 with 8 degrees of freedom truncated from below at 0.0048 or equivalently 362.0985/(1.7395 + 180';) to a X2 with 8 degrees of freedom truncated from above at 208.33. ·Similarly, inferences about a~ can be made by referring (1.7395 + 18a~).1,101.655 to xi 2 truncated from below at 0.0016 or 1,101.655/(1.7395 + 18a~) to X~ truncated from above at 633.3. In both cases, the effect of truncation is negligible. The approximate posterior distributions of and a~ are shown respectively in Figs 6.2.3 and 6.2.4.
a;
The Additi\'c .:Vlixed Model
6.3
341
6.3 THE ADDITIVE MIXED MODEL
We consider in this section the model i=I, ... ,I;
J=I, ... ,J
(6.3.1)
which is the additive mixed model in (6.l.16) with k = 1. The random effects cj and the errors eij are assumed to independently follow Normal distributions with zero means and variances 0'; and 0'; , respectively. In terms of the previous example we would have I particular cars of interest tested on a random sample of J drivers. The assumption of additivity implies that the expected increment of performance cj associated with a particular driver J is the same whichever of the I cars he drives. Though restrictive this assumption is adequate in some contexts. Suppose it was appropriate in the case of the car rental agency interested in the mean performances 8 i of I = 6 cars tested by a random sample of drivers. Then the variance component 0',2 would measure the variation ascribable to the differences in drivers, and 0'; would measure the experimental error variation. 6.3.1 The Likelihood Function For the likelihood function, it is convenient to transform the I J observations into { Yi.}, { Y.j - y,} and {Yij - Y; - Y.j + y}. We have
Y.j - Y .. = c j Yij - Yi. - Y.j
+ Y ..
=
-
c
+ e. j
eij - ei . - e j
-
(6.3.2)
e.,
+
e
Using Theorem 5.2. 1 (on page 250) and Theorem 6.2.1 (on page 329), we may conclude that: l. the three sets of variables {Y i,}, {Y.j - y,}, and {Yij - Y; - Y j independent one of another;
+ y}
are
2. so far as the set {Y .j - Y J is concerned, the sum of squares I I (Yj - Y is a sufficient statistic for CO'; + 10';) and is distributed as (0'; + 100;)X~J-l );
r
3. so far as the set {Y ij - Yi. - Y.j + Y. J is concerned, the sum of squares I I (Yij - Yi - Y.j + Y. Y is a sufficient statistic for 0'; and is distri buted as h;X(I-I)(J-l) ; and 4. the vector of variables N 1(8, V) where
y: =
(YI., ... , Y1
and Noting that
J is distributed
as the I-variate Normal (6.3.3)
342
Analysis of Cross Classification Designs
6.3
the likelihood function is 1(0.0"; . 0"; I y) ex:. (0";)-[JU- I )f2]
xexp
{ 2I [1 -_
.
I (y . - y)2 2')
0".
~.
+ /O" c
CO"; +
+I
[0";)-J /2
I (y .. - y. - y .
')
"2
.)
+ Y . .)2 +Q(O) J} ,
O"e
•
(6.3.4)
where
Q(9) = (y - 9)'V- I (y. - 9). In the analysis of additive mixed models, interest is usually centered on the comparative values of the location parameters 8 j rather than 8 j themselves. We may, therefore, work with i = I, .. . ,1;
(6.3.5)
Since v-I=J
-2[1_ a
O"e
2
e
a~ 1 2 11'] + 1 I O" c
= JO";2(I-+IJ1;.)
+
j/-I(O";
+i(JD- I l l l;,
(6.3.6)
we may express the quadratic form Q(O) as
The various sums of squares appearing in the likelihood function can be conveniently arranged in Table 6.3.1 in the form of an analysis of variance. In Table 6.3.1 , «>' = (
and
It is clear from the above expression that I[ (e. (J;e I y) can be regarded as the likelihood function of a sample consisting of one observation y. drawn from a Normal distribution N(B. O";e/lJ) and Vc independent observations from a Normal population N(O , O"~e) ' Similarly 12 (<(>, (J; I y) can be taken as that of a sample of
""~
Table 6.3.1
Analysis of variance for the additive mixed model
Source
Grand mean Fixed effect (cars)
d.r.
S.S.
IJ(rJ-yj
IJ({j-y/ S(Q»=J~[¢i-(Yi
M.S.
_yJJ 2
E.M .S.
(J;e=(J;
" <1> =(1-1)
m(Q» = S(Q»/v
(J;
(J~C'
Random effect (drivers)
S C =l~ ( y .J._y . . )2
Ve =(J-l)
me = Se/ve
Error
S e =~~( y. _ y I . _y+y )2 I) .J .'
v. =(I-I)(1-I)
Ille
= Sp/Ve
(J 2
e
+ 1(J;
..>-
>-l
:r
0.
.~. ..,.~ :;
0.
~
o
Co
~
~
W
Analysis of Cross Classification Oesigns
344
6.3
one observation from a (I - 1) dimensional Normal distribution with means (cPl. ··"cPT - I) and covariance matrix (O"; , J)[I(/_l) - (l/l)l u -J)l;JiI)J and of Vc further observations from a Normal population N(O, a;) . Since = +I the parameters (o';e, a;) are thus subject to the constraint
a;. a;
a;,
( 6.3.9) As before, it is convenient to proceed first ignoring the constraint C. The posterior distribution of the various parameters appropriate to the model (6.3.1) are then conveniently derived from the unconstrained distributions using the result (1.5.3).
6.3.2 Posterior Distributions of (e, q"
a;, O";e) Ignoring the Constraint C
With the noninformative reference prior distribution+ (6.3.10) the unconstrained posterior distribution of (e , q"
a;, O"~e I y) cc (0";) -
p*(8,q"
- 00
<
e < oc,
-00
Hy.,-t v.. )-I
< cPi <
00,
a;, o~e) is
(O"~e)- -}(Yc - I)-l
.i = I, ... , (1- J),
0"; >
0,
0";. >
0,
(6.3.11)
from which for the unconstrained situation we deduce the following solutions which parallel known sampling results, Given 0";, the conditional posterior distri bution of q, is the (J - I) dimensional multivariate Normal
0"; E)
yJ
(6.3.1 La)
In particular, given 0";, the conditional distribution of the difference b = i - i = cPi - cP i between two particular means of interest is Normal ,V(Yi . - Yi,' 20";/J).
(6.3.l2b)
The marginal posterior distribution of q, is the (I - 1) dimensional multivariate t distri bution t(l _ 1)(cf>, me E,. ve )·
(6.3.13a)
N(I - 1)(cf>,
with
:j,' =
(YI. - Y .. , ·" 'YU-l) . -
and
e e
When the number of means I is large, the assumption that the 8's are locally uniform can be an inadequate approximation, and a more appropriate analysis may follow the lines given later in Section 7.2.
6.3
The Addith'e Mixed Model
345
Hence the difference b is distributed as ((Yi. - y;., 2Iil o /J, Ve so that the quantity r = [6 - (Yi - y;)] (2m eiJ)1 ' 2 is distributed as teO, I, vel.
(6.3.13b)
Also, the quantity V = m(
(6.3.14)
The quantity (J; l vom,. has the X,:.2 distribution.
( 6.3. 15)
Given <1>, the conditional posterior distribution of (J; is such that (J; /[vem e + v¢m(
2
+
(6,3,16a)
r2)] is distri(6.3.16b)
distribution with
Ve
degrees (6.3.17)
The ratio mc(J; /(me(J;e) has the F distribution with degrees of freedom.
Vc
and
I',.
(6 .3.18)
Given <1>, the conditional distribution of the ratio (J;/(J;" is such tha t the qua n ti ty
u; (J [v oln e
;0
is distributed as F with
Ve
n1c
+ v,p~m-(--'--)]-/-(v-e-+-v-",)
and v,,
+
v,p degrees of freedom.
(6.3.19a)
In particular, if just c5 is given, the quanti.t)'
(J;
l11e
+ r 2 )/(v e + I) Vc and Ve + I degrees of freedom.
(J:e me(v" follows the F distribution with
6.3.3 Posterior Distribution of
u;
and
(6.3.19b)
u;
The unconstrained posterior distributions given above are, of course, not themselves the appropriate distributions for the model (6.3.1), but they provide a useful stepping stone from which the appropriate ones may be reached. When interest centers on the variance components $(\sigma_e^2, \sigma_c^2)$, we may first employ the result (1.5.3) to obtain from (6.3.15) and (6.3.17) the distribution of $(\sigma_e^2, \sigma_{ce}^2)$:
$$p(\sigma_e^2, \sigma_{ce}^2 \mid y) = \frac{\Pr{}^*\{C \mid \sigma_e^2, \sigma_{ce}^2, y\}}{\Pr{}^*\{C \mid y\}}\,(\nu_e m_e)^{-1}\,p\!\left(\chi_{\nu_e}^{-2} = \frac{\sigma_e^2}{\nu_e m_e}\right)(\nu_c m_c)^{-1}\,p\!\left(\chi_{\nu_c}^{-2} = \frac{\sigma_{ce}^2}{\nu_c m_c}\right). \qquad (6.3.20a)$$
Using (6.3.18), the overall probability of the constraint $C: \sigma_e^2 < \sigma_{ce}^2$ is
$$\Pr{}^*\{C \mid y\} = \Pr\{F_{\nu_c,\nu_e} < m_c/m_e\}. \qquad (6.3.20b)$$
In addition, given $\sigma_e^2$ and $\sigma_{ce}^2$, $\Pr{}^*\{C \mid \sigma_e^2, \sigma_{ce}^2, y\}$ is clearly unity for $\sigma_e^2 < \sigma_{ce}^2$ and zero otherwise. Thus, the posterior distribution of $(\sigma_e^2, \sigma_{ce}^2)$ is
$$p(\sigma_e^2, \sigma_{ce}^2 \mid y) = \frac{(\nu_e m_e)^{-1}(\nu_c m_c)^{-1}}{\Pr\{F_{\nu_c,\nu_e} < m_c/m_e\}}\,p\!\left(\chi_{\nu_e}^{-2} = \frac{\sigma_e^2}{\nu_e m_e}\right)p\!\left(\chi_{\nu_c}^{-2} = \frac{\sigma_{ce}^2}{\nu_c m_c}\right), \qquad 0 < \sigma_e^2 < \sigma_{ce}^2. \qquad (6.3.21)$$
The reader will note that if we set $I = K$, $\nu_1 = \nu_e$, $\nu_2 = \nu_c$, $m_1 = m_e$, and $m_2 = m_c$, the distribution in (6.3.21) is precisely the same as that obtained in (5.2.14). Inferences about $(\sigma_e^2, \sigma_c^2)$ can thus be made using the results obtained earlier.
6.3.4 Inferences About Fixed Effects for the Mixed Model

In many problems, the items of principal interest are not the variance components $(\sigma_e^2, \sigma_c^2)$ but comparisons among the fixed effects $\theta_1, \ldots, \theta_I$.
6.3.5 Comparison of Two Means
Consider the problem of comparing two particular means, say those for the $i$th and the $i'$th treatment. To make inferences about $\delta = \theta_i - \theta_{i'}$, it is convenient to work with the quantity
$$\tau = \frac{\delta - (\bar y_{i.} - \bar y_{i'.})}{(2m_e/J)^{1/2}}. \qquad (6.3.22)$$
From the unconstrained distribution of $\tau$ in (6.3.13b), we again employ the result (1.5.3) to obtain the posterior distribution of $\tau$ as
$$p(\tau \mid y) = p(t_{\nu_e} = \tau)\,\frac{\Pr{}^*\{C \mid \tau, y\}}{\Pr{}^*\{C \mid y\}}, \qquad -\infty < \tau < \infty, \qquad (6.3.23)$$
where $\Pr{}^*\{C \mid y\}$ is given in (6.3.20b). From (6.3.19b), the conditional posterior probability of $C$, given $\tau$, is
$$\Pr{}^*\{C \mid \tau, y\} = \Pr\left\{F_{\nu_c,\,\nu_e+1} < \frac{m_c(\nu_e+1)}{m_e(\nu_e+\tau^2)}\right\}. \qquad (6.3.24)$$
The posterior distribution of $\tau$ is, therefore,
$$p(\tau \mid y) = p(t_{\nu_e} = \tau)\,g(\tau^2 \mid m_c/m_e), \qquad -\infty < \tau < \infty, \qquad (6.3.25)$$
a Student's $t$ distribution with $\nu_e$ degrees of freedom modified by a factor $g(\tau^2 \mid m_c/m_e)$, which is the ratio of two $F$ probability integrals,
$$g(\tau^2 \mid m_c/m_e) = \frac{\Pr\{F_{\nu_c,\,\nu_e+1} < (m_c/m_e)(\nu_e+1)(\nu_e+\tau^2)^{-1}\}}{\Pr\{F_{\nu_c,\nu_e} < m_c/m_e\}}. \qquad (6.3.26)$$
We now discuss some properties of this modified $t$ distribution. Since both $p(t_{\nu_e} = \tau)$ and $g(\tau^2 \mid m_c/m_e)$ are functions of $\tau^2$, the distribution is symmetric about the origin. It follows that all odd moments, when they exist, are zero. Using the identity (A5.2.1) in Appendix A5.2, the $2r$th ($r < \nu_e/2$) moment of $\tau$ is
$$E(\tau^{2r} \mid y) = \nu_e^{\,r}\,\frac{B(\tfrac{1}{2}\nu_e - r,\;\tfrac{1}{2}+r)}{B(\tfrac{1}{2}\nu_e,\;\tfrac{1}{2})}\cdot\frac{\Pr\{F_{\nu_c,\,\nu_e-2r} < (m_c/m_e)(\nu_e - 2r)/\nu_e\}}{\Pr\{F_{\nu_c,\nu_e} < m_c/m_e\}}, \qquad (6.3.27)$$
where, as before, $B(p, q)$ is the complete beta function. The function $g(\tau^2 \mid m_c/m_e)$ is monotonically decreasing in $\tau^2$. This implies that the distribution $p(\tau \mid y)$ is uniformly more concentrated about the origin than $p(t_{\nu_e} = \tau)$. That is, for an arbitrary constant $d > 0$,
$$\Pr\{|\tau| < d \mid y\} \geq \Pr\{|t_{\nu_e}| < d\}. \qquad (6.3.28)$$
The above result is indeed a sensible one. For the distribution $p(\tau \mid y)$ is obtained from the distribution $p(t_{\nu_e} = \tau)$ by imposing the constraint $C: \sigma_e^2 < \sigma_{ce}^2$, and we see in (6.3.12b) that $\sigma_e^2$, if known, is proportional to the conditional variance of $\delta$. Thus, it is not at all surprising that the posterior distribution of $\tau$ with the constraint $C$ is more concentrated about the origin than the distribution of $\tau$ without the constraint. We have here, in fact, a further example of Bayesian pooling, which is discussed in more detail in Section 6.3.7.
An Example

In the particular case $I = 2$, the randomized block arrangement results in pairs of observations. On standard sampling theory, the comparison of the two means is usually carried out by considering the differences $z_j = y_{2j} - y_{1j}$ $(j = 1, \ldots, J)$ between the pairs and making use of the fact that the sampling distribution of
$$t = \frac{\delta - (\bar y_{2.} - \bar y_{1.})}{(2m_e/J)^{1/2}}, \qquad (6.3.29)$$
Table 6.3.2
Measurement of percentage of ammonia by two analysts

             Days (blocks)
              1    2    3    4    5    6    7    8
  Analyst 1  37   35   43   34   36   48   33   33
  Analyst 2  37   38   36   47   48   57   28   42

  (Entries are (% ammonia − 12) × 100.)
where $\bar z = J^{-1}\sum z_j$, is the $t(0, 1, \nu_e)$ distribution. In this analysis, however, any "inter-block information" provided by $m_c$ about $\sigma_e^2$ is not taken account of. We consider this "paired $t$" problem in some detail, using for illustration an example quoted in Davies (1967, p. 57). The data shown in Table 6.3.2 are determinations made by two analysts of the percentage of ammonia in a plant gas, made on eight different days. The primary purpose of the experiment was to make inferences about a possible systematic mean difference $\delta$, that is, a bias between results from the two analysts. The example illustrates the use of the additive mixed model (6.3.1) with $I = 2$, $J = 8$. It should be borne in mind that we assume that (i) day-to-day variations of the true ammonia percentage follow a Normal distribution having variance $\sigma_c^2$, (ii) the eight particular daily samples considered may be regarded as a random sample from this distribution, (iii) there is no interaction between analysts and samples, and (iv) the Normally distributed analytical error has the same variance $\sigma_e^2$ for both analysts. The sample quantities needed in the analysis are given in Table 6.3.3 in the form of an analysis of variance.

Table 6.3.3
Analysis of variance for the analyst data

  Source                                 S.S.                                                         d.f.          M.S.
  Fixed effect (bias between analysts)   $S(\delta) = \tfrac{1}{2}J(\delta - \bar z)^2 = 4(\delta - 4.25)^2$   1    $4(\delta - 4.25)^2$
  Random block effect (days)             $S_c = 553.0$                                                $\nu_c = 7$   $m_c = 79.0$
  Error                                  $S_e = 206.75$                                               $\nu_e = 7$   $m_e = 29.54$

  Average for analyst 1: $\bar y_{1.} = 37.38$; average for analyst 2: $\bar y_{2.} = 41.63$; $m_c/m_e = 2.67$.
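The entries of Table 6.3.3 follow directly from the raw data of Table 6.3.2. A minimal sketch (NumPy assumed; the variable names are ours):

```python
# Reproduce the Table 6.3.3 quantities from the raw analyst data of Table 6.3.2.
import numpy as np

y = np.array([[37, 35, 43, 34, 36, 48, 33, 33],    # analyst 1
              [37, 38, 36, 47, 48, 57, 28, 42]])   # analyst 2
I, J = y.shape
row_means = y.mean(axis=1)          # analyst averages
col_means = y.mean(axis=0)          # day averages
grand = y.mean()

S_c = I * ((col_means - grand) ** 2).sum()                    # days S.S.
resid = y - row_means[:, None] - col_means[None, :] + grand
S_e = (resid ** 2).sum()                                      # error S.S.
nu_c, nu_e = J - 1, (I - 1) * (J - 1)

print(S_c, S_c / nu_c)              # 553.0, 79.0
print(S_e, round(S_e / nu_e, 2))    # 206.75, 29.54
print(row_means[1] - row_means[0])  # mean difference z-bar = 4.25
```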
The solid curve in Fig. 6.3.1 is the posterior distribution of the bias $\delta = \theta_2 - \theta_1$ calculated from expression (6.3.25), corresponding to a noninformative reference prior. This distribution is seen to be more concentrated about its mean, $\bar z = \bar y_{2.} - \bar y_{1.} = 4.25$, than the unconstrained posterior distribution of $\delta$ shown by the dotted curve in the same figure. This latter curve (which is also the confidence distribution of $\delta$) is a scaled $t$ distribution centered at the same mean with 7 degrees of freedom. In particular, an H.P.D. interval for $\delta$ is shown with content 91.2%. This would have the somewhat smaller content 90.0% if the unconstrained distribution were used. The broken curve shown in the figure is the posterior distribution of $\delta$ obtained by taking $\sigma_c^2 = 0$ in (6.3.1), the implication of which will be discussed later in this section.
Fig. 6.3.1 Posterior distribution of the analytical bias $\delta$. Solid curve: posterior distribution of $\delta$ (Bayesian "partial pooling"). Dotted curve: unconstrained posterior distribution of $\delta$, which is also the confidence distribution given by the "paired $t$" (no pooling). Broken curve: posterior distribution of $\delta$ obtained by taking $\sigma_c^2 = 0$, which is also the confidence distribution given by the "unpaired $t$" (complete pooling). A 91.2% H.P.D. interval is marked. Note that Bayesian "partial pooling" yields the sharpest distribution.
Effect of $m_c/m_e$ on the Distribution of $\delta$

In the example considered, the posterior distribution of the bias $\delta$ coming from the Bayesian analysis is not very different from the unconstrained distribution which parallels the sampling results. This is because the ratio $m_c/m_e = 2.67$ is fairly large, so that the influence of the factor $g(\tau^2 \mid m_c/m_e)$ is relatively mild. However, when $m_c/m_e$ is close to or less than unity, the effect of $g(\tau^2 \mid m_c/m_e)$ will be much greater. For illustration, suppose that in the analyst example the pair (48, 57) is excluded. One then obtains the analysis of variance presented in Table 6.3.4.
Fig. 6.3.2 Posterior distribution of the analytical bias $\delta$ for the data excluding the pair (48, 57). Solid curve: posterior distribution of $\delta$ (Bayesian "partial pooling"). Dotted curve: unconstrained posterior distribution of $\delta$, the confidence distribution given by the "paired $t$" (no pooling). Broken curve: posterior distribution of $\delta$ obtained by taking $\sigma_c^2 = 0$, the confidence distribution given by the "unpaired $t$" (complete pooling). Note that Bayesian "partial pooling" yields the sharpest distribution.
Here the ratio $m_c/m_e = 0.86$ is less than unity and, as shown in Fig. 6.3.2, the posterior distribution of $\delta$ is markedly different from the unconstrained distribution. It will be seen later in Section 6.3.7 that $g(\tau^2 \mid m_c/m_e)$, in fact, reflects the effect of pooling the variance estimates $m_c$ and $m_e$ in the estimation of $\sigma_e^2$.
Table 6.3.4
Analysis of variance for the analyst data excluding the pair (48, 57)

  Source                                 S.S.                                          d.f.          M.S.
  Fixed effect (bias between analysts)   $S(\delta) = \tfrac{1}{2}J(\delta - \bar z)^2 = 3.5(\delta - 3.57)^2$   1   $3.5(\delta - 3.57)^2$
  Random block effect (days)             $S_c = 166.71$                                $\nu_c = 6$   $m_c = 27.79$
  Error                                  $S_e = 193.86$                                $\nu_e = 6$   $m_e = 32.31$

  $\bar y_{1.} = 35.86$, $\bar y_{2.} = 39.43$, $m_c/m_e = 0.86$.
For given $\tau$, the modifying factor $g(\tau^2 \mid m_c/m_e)$ is a function of $m_c/m_e$. It is instructive to consider the behavior of $p(\tau \mid y)$ in the extreme cases $m_c/m_e \to \infty$ and $m_c/m_e \to 0$. When $m_c/m_e \to \infty$, $g(\tau^2 \mid m_c/m_e)$ approaches unity for any fixed $\tau$, so that
the distribution $p(\tau \mid y)$ tends in the limit to the unconstrained distribution $p(t_{\nu_e} = \tau)$. On the other hand, when $m_c/m_e \to 0$, applying L'Hospital's rule one finds that, in the limit, the distribution of the random variable $[(\nu_e+\nu_c)/\nu_e]^{1/2}\tau$ approaches the $t(0, 1, \nu_e+\nu_c)$ distribution. (6.3.30)

Further, it can be verified that the mixed derivative
$$\frac{\partial^2 \log g(\tau^2 \mid m_c/m_e)}{\partial \tau^2\,\partial(m_c/m_e)} \qquad (6.3.31)$$
can be expressed in terms of a non-negative function $H(m_c/m_e)$. Clearly $H(0) = 0$ and, upon differentiating, we find that $H'(a) > 0$. Thus
$$\frac{\partial^2 \log g(\tau^2 \mid m_c/m_e)}{\partial \tau^2\,\partial(m_c/m_e)} \leq 0 \qquad (6.3.32)$$
for all values of $m_c/m_e$ and $\tau^2$. This implies that the posterior distribution of $\tau^2$ has the "monotone likelihood ratio" property; see, e.g., Lehmann (1959). It follows that for $d > 0$, the probability $\Pr\{|\tau| < d \mid y\}$ is a monotonically decreasing function of $m_c/m_e$. This, together with (6.3.28) and (6.3.30), shows that, in obvious notation,
$$\Pr\{|t_{\nu_e}| < d\} \leq \Pr\{|\tau| < d \mid y\} \leq \Pr\{|t_{\nu_e+\nu_c}| < [(\nu_e+\nu_c)/\nu_e]^{1/2} d\}, \qquad (6.3.33)$$
which provides an upper and a lower bound for the probability $\Pr\{|\tau| < d \mid y\}$.
6.3.6 Approximations to the Distribution $p(\tau \mid y)$

In this section, we discuss certain methods which can be used to evaluate the posterior distribution of $\tau$.
When $\nu_c = (J-1)$ is a positive even integer, we can use the identity (A5.2.3) in Appendix A5.2 to expand the numerator of $g(\tau^2 \mid m_c/m_e)$ into a finite sum, (6.3.34), in terms of a quantity $x$ defined by the mean squares. Substituting (6.3.34) into (6.3.25), we obtain
$$p(\tau \mid y) = \sum_{j=0}^{\frac{1}{2}\nu_c - 1} w_j\,\gamma_j\,p(t_{\nu_e+2j} = \gamma_j\tau), \qquad -\infty < \tau < \infty, \qquad (6.3.35)$$
where the weights $w_j$ and scale factors $\gamma_j$ follow from the expansion. It follows that for $d > 0$,
$$\Pr\{|\tau| < d \mid y\} = \sum_{j=0}^{\frac{1}{2}\nu_c - 1} w_j\,\Pr\{|t_{\nu_e+2j}| < \gamma_j d\}, \qquad (6.3.36)$$
which can be used to calculate probabilities to any desired degree of accuracy, using a standard $t$ table in conjunction with an incomplete beta function table. When $\nu_c$ is large, evaluation of the probability of $\tau$ from (6.3.36) would be rather laborious, and for odd values of $\nu_c$ this formula is not applicable. When appropriate, it can, however, be used to check the usefulness of simpler approximations.
A Scaled $t$ Approximation

We now show that a scaled $t$ distribution can provide a simple and, overall, satisfactory approximation to the posterior distribution of $\tau$. This result is to be expected because we have seen that in the two extreme cases, $m_c/m_e \to \infty$ and $m_c/m_e \to 0$, $\tau$ follows a scaled $t$ distribution exactly. We can write (6.3.25) as
$$p(\tau \mid y) = \mathop{E}_{\sigma_e^2 \mid y}\,p(\tau \mid \sigma_e^2, y), \qquad -\infty < \tau < \infty. \qquad (6.3.37)$$
From (6.3.12b), the unconstrained conditional distribution of $\tau$, given $\sigma_e^2$, is Normal $N(0, \sigma_e^2/m_e)$. Once $\sigma_e^2$ is given, the constraint $C: \sigma_e^2 < \sigma_{ce}^2$ has no effect on the distribution of $\tau$. It follows that the first factor in the integrand, $p(\tau \mid \sigma_e^2, y)$, is the Normal distribution $N(0, \sigma_e^2/m_e)$. Now, from (6.3.21), the posterior distribution $p(\sigma_e^2 \mid y)$ in the integrand is
$$p(\sigma_e^2 \mid y) = \frac{\Pr\{\chi_{\nu_c}^2 < \nu_c m_c/\sigma_e^2\}}{\Pr\{F_{\nu_c,\nu_e} < m_c/m_e\}}\,(\nu_e m_e)^{-1}\,p\!\left(\chi_{\nu_e}^{-2} = \frac{\sigma_e^2}{\nu_e m_e}\right), \qquad \sigma_e^2 > 0. \qquad (6.3.38)$$
This distribution is of exactly the same form as the distribution of $\sigma_1^2$ in (5.2.26) for the two-component random effect model. Thus, employing the method developed in Section 5.2.6, the distribution of $\sigma_e^2$ can be closely approximated by that of a scaled $\chi^{-2}$ variable. Using the resulting approximation, we obtain
$$\sigma_e^2 \approx \nu^* m^* \chi_{\nu^*}^{-2}, \qquad (6.3.39)$$
where $\nu^*$ and $m^*$ are the degrees of freedom and scale factor given by the method of Section 5.2.6 (their explicit expressions involve ratios of incomplete beta functions evaluated at a quantity $x$ depending on the mean squares). That is, the quantity
$$\frac{\delta - (\bar y_{i.} - \bar y_{i'.})}{(2m^*/J)^{1/2}} \qquad (6.3.40)$$
is approximately distributed as $t(0, 1, \nu^*)$; or, equivalently, $\delta$ is approximately distributed as $t(\bar y_{i.} - \bar y_{i'.},\; 2m^*/J,\; \nu^*)$. For the complete analyst data we have $(\nu^* = 8.57, m^* = 27.55)$; excluding the pair (48, 57), one would get $(\nu^* = 10.48, m^* = 23.80)$. Tables 6.3.5(a) and 6.3.5(b) give, respectively, for these two cases specimens of the posterior densities of $\delta$ obtained from exact evaluation of (6.3.25) and from the above approximation. The agreement is very close.

Table 6.3.5
Comparison of exact and approximate density of $\delta$

(a) Data of Table 6.3.2
   $\delta$   Exact      Approximate
   4.25    0.147659   0.147830
   4.45    0.147353   0.147181
   4.65    0.145759   0.145932
   4.85    0.143600   0.143425
   5.05    0.140232   0.140411
   5.25    0.136437   0.136253
   5.45    0.134182   0.133995
   5.65    0.126292   0.126489
   5.85    0.120718   0.120512
   6.05    0.114559   0.114343
   6.25    0.108120   0.107894
   6.65    0.094818   0.094571
   7.05    0.081560   0.081299
   7.45    0.068940   0.068675
   7.85    0.057370   0.057115
   8.25    0.047087   0.046858
   8.65    0.038185   0.037995
   9.05    0.030645   0.030505
   9.45    0.024376   0.024292
   9.85    0.019244   0.019216
  10.25    0.015097   0.015122

(b) Data of Table 6.3.2 excluding the pair (48, 57)
   $\delta$   Exact      Approximate
   3.57    0.149453   0.149380
   3.77    0.148973   0.148900
   3.97    0.147543   0.147470
   4.17    0.145197   0.145122
   4.37    0.141985   0.141908
   4.57    0.137977   0.137898
   4.77    0.133258   0.133177
   4.97    0.127925   0.127839
   5.17    0.122079   0.121990
   5.37    0.115830   0.115736
   5.57    0.109286   0.109187
   5.97    0.095729   0.095621
   6.37    0.082174   0.082060
   6.77    0.069237   0.069124
   7.17    0.057359   0.057253
   7.57    0.046800   0.046711
   7.97    0.037672   0.037604
   8.37    0.029966   0.029924
   8.77    0.023591   0.023576
   9.17    0.018407   0.018419
   9.57    0.014256   0.014290
   9.97    0.010971   0.011024

6.3.7 Relationship to Some Sampling Theory Results and the Problem of Pooling

The standard sampling theory "paired $t$" analysis depends on the fact that the sampling distribution of $\tau = [\delta - (\bar y_{i.} - \bar y_{i'.})]/(2m_e/J)^{1/2}$ is the $t(0, 1, \nu_e)$ distribution. The resulting confidence distribution of $\delta$ is numerically equivalent to the unconstrained posterior distribution of $\delta$. Now $E(m_e) = \sigma_e^2$ and $E(m_c) = \sigma_e^2 + 2\sigma_c^2$. In obtaining this confidence distribution, therefore, one may feel intuitively that some information about the variance component $\sigma_e^2$ is lost by not taking $m_c$ into account, because its sampling distribution also involves $\sigma_e^2$. What has sometimes been done is to use the ratio $m_c/m_e$ to test the hypothesis $\sigma_c^2 = 0$
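The comparison in Table 6.3.5(a) can be reproduced approximately as follows; a minimal self-contained sketch (SciPy assumed), with $\nu^* = 8.57$ and $m^* = 27.55$ taken from the text.

```python
# Compare the exact density of delta from (6.3.25)-(6.3.26) with the scaled t
# approximation t(4.25, 2m*/J, nu*) for the complete analyst data.
import numpy as np
from scipy import stats

m_e, m_c, nu_e, nu_c, J = 29.54, 79.0, 7, 7, 8
nu_star, m_star, mean = 8.57, 27.55, 4.25

def exact_density(delta):
    """Exact posterior density of delta via (6.3.25)-(6.3.26)."""
    s = np.sqrt(2 * m_e / J)
    tau = (delta - mean) / s
    g = (stats.f.cdf((m_c / m_e) * (nu_e + 1) / (nu_e + tau**2), nu_c, nu_e + 1)
         / stats.f.cdf(m_c / m_e, nu_c, nu_e))
    return stats.t.pdf(tau, nu_e) * g / s

def approx_density(delta):
    """Scaled t approximation t(mean, 2m*/J, nu*)."""
    s = np.sqrt(2 * m_star / J)
    return stats.t.pdf((delta - mean) / s, nu_star) / s

for d in (4.25, 6.25, 8.25, 10.25):
    print(d, round(exact_density(d), 6), round(approx_density(d), 6))
```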
and, in the absence of a significant result, to run an "unpaired $t$" analysis; that is, to employ the quantity $(\nu_e m_e + \nu_c m_c)/(\nu_e + \nu_c)$ as a pooled estimator of $\sigma_e^2$ with $(\nu_e + \nu_c)$ degrees of freedom. In this case, inferences about $\delta$ are made by referring the quantity
$$\tau_1 = \frac{\delta - (\bar y_{i.} - \bar y_{i'.})}{\{2(\nu_e m_e + \nu_c m_c)/[J(\nu_e + \nu_c)]\}^{1/2}} \qquad (6.3.41)$$
to the $t(0, 1, \nu_e + \nu_c)$ distribution. From the Bayesian point of view, this parallels the case when $\sigma_c^2$ is known to be equal to zero. To see this, suppose we now take the prior distribution of $(\theta, \sigma_e^2)$ to be, locally,
$$p(\theta, \sigma_e^2) \propto \sigma_e^{-2}. \qquad (6.3.42)$$
Then, on combining this with the likelihood in (6.3.4) conditional on $\sigma_c^2 = 0$, it can be verified that a posteriori the quantity $\tau_1$ follows the same $t(0, 1, \nu_e + \nu_c)$ distribution. Note that
$$\tau = \frac{\delta - (\bar y_{2.} - \bar y_{1.})}{\{\sum(z_j - \bar z)^2/[J(J-1)]\}^{1/2}} \quad \text{and} \quad \tau_1 = \frac{\delta - (\bar y_{2.} - \bar y_{1.})}{\{\sum\sum(y_{ij} - \bar y_{i.})^2/[J(J-1)]\}^{1/2}}, \qquad (6.3.43)$$
where, as before in (6.3.29), $z_j = y_{2j} - y_{1j}$ and $\bar z = (1/J)\sum z_j$. The problem is thus the familiar one of deciding whether an unpaired $t$ test or a paired $t$ test should be adopted in analyzing paired data when it is felt that there might or might not be variation from pair to pair. For illustration, consider again the complete set of analyst data. On sampling theory, testing the hypothesis that $\sigma_c^2 = 0$ against the alternative $\sigma_c^2 > 0$ (or, equivalently, testing $\sigma_{ce}^2/\sigma_e^2 = 1$ against $\sigma_{ce}^2/\sigma_e^2 > 1$) involves calculating the mean square ratio $m_c/m_e = 79.0/29.54 = 2.67$, which is referred to the $F$ distribution with $\nu_c = 7$ and $\nu_e = 7$ degrees of freedom.
The result is thus not quite significant at the 10% level. The confidence distribution of $\delta$ shown by the dotted curve in Fig. 6.3.1, which is obtained from $\tau$ (paired $t$), is, however, appreciably different from that given by the broken curve in the same figure when $\tau_1$ (unpaired $t$) is used. As mentioned earlier, from the Bayesian point of view these two curves correspond, respectively, to the posterior distribution of $\delta$ when the constraint $C: \sigma_e^2 < \sigma_{ce}^2$ is ignored, and to that of $\delta$ when $\sigma_c^2$ is assumed zero. Both of these curves are different from the solid curve, which is the appropriate posterior distribution of $\delta$. In obtaining the latter distribution,
we do not make the outright assumption that $\sigma_c^2 = 0$, nor do we ignore the constraint $\sigma_e^2 < \sigma_{ce}^2$ given by the model. The difficulty in the sampling theory approach in deciding whether to use $\tau$ or $\tau_1$ is another example of the "pooling" dilemma discussed in Section 5.2.7. The posterior distribution $p(\tau \mid y)$ can be analyzed in this light. The discrepancy between $p(\tau \mid y)$ with the constraint $C$ and the unconstrained distribution $p(t_{\nu_e} = \tau)$ can, in fact, be regarded as a direct consequence of "Bayesian pooling" of $m_e$ and $m_c$ in obtaining the former distribution. As seen in (6.3.37), the posterior distribution of $\tau$ can be written
$$p(\tau \mid y) = \int_0^\infty p(\tau \mid \sigma_e^2, y)\,p(\sigma_e^2 \mid y)\,d\sigma_e^2, \qquad -\infty < \tau < \infty.$$
Thus, the departure of the posterior distribution of $\tau$ from the $t(0, 1, \nu_e)$ distribution depends entirely upon the departure of the distribution of $\sigma_e^2$ from that of $(\nu_e m_e)\chi_{\nu_e}^{-2}$. For the complete analyst data, $(\nu_e = 7, m_e = 29.54)$ and $(\nu^* = 8.57, m^* = 27.55)$; thus, the posterior distribution of $\tau$ (or, equivalently, of $\delta$) was not much different from the unconstrained distribution of $\tau$. In terms of the pooling discussion in Section 5.2.7, we can write
$$\sigma_e^2 \approx (\nu_e m_e + \lambda\nu_c m_c)\,\chi_{\nu_e + w\nu_c}^{-2}, \qquad (6.3.44)$$
where $\lambda$ and $w$ are given by (5.2.38). For this example, $\lambda = 0.05$ and $w = 0.22$, so that
$$\nu_e m_e + \lambda\nu_c m_c = 206.75 + 29.8 = 236.55 \qquad \text{and} \qquad \nu_e + w\nu_c = 7 + 1.6 = 8.6.$$
Thus, approximately,
$$\frac{206.8 + 29.8}{\sigma_e^2} \approx \chi_{7+1.6}^2.$$
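A quick check of this pooling arithmetic, using the rounded $\lambda$ and $w$ quoted above (the small discrepancies from 236.55 and 8.6 reflect the rounding of $\lambda$ and $w$ to two decimals):

```python
# Bayesian pooling arithmetic of (6.3.44) for the complete analyst data,
# using the rounded lambda = 0.05 and w = 0.22 quoted in the text.
nu_e, m_e, nu_c, m_c = 7, 29.54, 7, 79.0
lam, w = 0.05, 0.22

pooled_ss = nu_e * m_e + lam * nu_c * m_c   # close to 236.55 in the text
pooled_df = nu_e + w * nu_c                 # close to 8.6 in the text
print(round(pooled_ss, 1), round(pooled_df, 1))
```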
The combined effect of the addition of 29.8 to the sum of squares and of 1.6 to the degrees of freedom is small, and the distribution of $\sigma_e^2$ is nearly the same as
for the unconstrained situation, which would give
$$\frac{206.8}{\sigma_e^2} \sim \chi_7^2.$$
The much larger effect when the pair (48, 57) is excluded (Fig. 6.3.2) arises from the much greater degree of pooling which occurs here. We have $\nu_c = 6$, $m_c = 27.79$, $\nu_e = 6$, $m_e = 32.31$, $\nu^* = 10.48$, $m^* = 23.80$, so that
$$\lambda = 0.33, \qquad w = 0.75.$$
Thus,
$$\frac{193.9 + 0.33 \times 166.7}{\sigma_e^2} \approx \chi_{6 + 0.75 \times 6}^2, \qquad \text{or} \qquad \frac{193.9 + 55.7}{\sigma_e^2} \approx \chi_{10.5}^2,$$
as compared with the unconstrained distribution
$$\frac{193.9}{\sigma_e^2} \sim \chi_6^2.$$
193.9+ 166.7 .
2
0'.
-
2 X6-6'
defines the confidence distribution of 0'; for this set of data. The correspon ding confidence distribution of 0 which results from this complete pooling and which, from the Bayesian viewpoi nt, is the posterior distribution of 0 on the assumption that O'~ = 0, has been shown in Fig. 6.3.2. As in the analysis of the complete data (Fig. 6.3.1), the posteri or distribution of 0 corresponding to a Bayesian 'partial pooling' is sharper than \I'ilh "no pooling" or "complete pooling". 6.3.8 Comparison of I Means
When $I$ ($I \geq 2$) means are to be compared, it is convenient to consider the $I-1$ linearly independent contrasts $\phi_i = \theta_i - \bar\theta$, $i = 1, \ldots, (I-1)$. The sample quantities needed in the analysis were conveniently summarized in Table 6.3.1, the analysis of variance table for the present model. As we have already indicated, the model can be appropriate in the analysis of randomized block data in those cases where it is sensible to represent block contributions as random effects.
The posterior distribution of $\phi' = (\phi_1, \ldots, \phi_{I-1})$ may be obtained from (6.3.13a), (6.3.18), and (6.3.19a) by applying the result (1.5.3) concerning constrained distributions. We thus obtain
$$p(\phi \mid y) = p(t_{(I-1)} = \phi - \hat\phi \mid m_e\Sigma, \nu_e)\,g(\phi), \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1, \qquad (6.3.45)$$
where the factor $p(t_{(I-1)} = \phi - \hat\phi \mid m_e\Sigma, \nu_e)$ is the density of a $t_{(I-1)}(\hat\phi, m_e\Sigma, \nu_e)$ distribution, which is the unconstrained distribution of $\phi$, and the modifying factor $g(\phi)$, representing the effect of the constraint, is
$$g(\phi) = \frac{\Pr{}^*\{C \mid \phi, y\}}{\Pr{}^*\{C \mid y\}} = \frac{\Pr\{F_{\nu_c,\,\nu_e+\nu_\phi} < m_c/\bar m_e(\phi)\}}{\Pr\{F_{\nu_c,\nu_e} < m_c/m_e\}}, \qquad (6.3.46)$$
with
$$\bar m_e(\phi) = \frac{\nu_e m_e + S(\phi)}{\nu_e + \nu_\phi}.$$
Two "F" ratios occur in the modifying factor g( I y) has the same ellipsoidal contours as the unconstrained multivariate distribution . However, since g( I y) than for the unconstrained distribution. To study this effect more formally we notice that since p( I y) is a function of Seq,) only, the probability content of any given contour may be obtained by considering the distribution of S(
v=
m(
=
5( Se/ve
(6.3.47)
From (6.3.14), (6.3.18), and (6.3.19a) we have
$$p(V \mid y) = p(F_{\nu_\phi,\nu_e} = V)\,g(V \mid m_c/m_e), \qquad V > 0. \qquad (6.3.48)$$
The first factor is the ordinary $F$ density with $(\nu_\phi, \nu_e)$ degrees of freedom, which would define the posterior distribution of $V$ if there were no constraint. The modifying factor $g(V \mid m_c/m_e)$ is the same as $g(\phi)$ in (6.3.46), expressed as a function of $V$; its effect is to concentrate the distribution of $V$ towards the origin relative to the unconstrained $F_{\nu_\phi,\nu_e}$ variable. That is, for $V_0 > 0$,
$$\Pr\{V < V_0 \mid y\} \geq \Pr\{F_{\nu_\phi,\nu_e} < V_0\}. \qquad (6.3.49)$$
To see this, since $p(V \mid y)$ is a probability density, we have
$$1 = \int_0^\infty p(V \mid y)\,dV = \int_0^\infty p(F_{\nu_\phi,\nu_e} = V)\,g(V \mid m_c/m_e)\,dV = E\,g(V \mid m_c/m_e), \qquad (6.3.50)$$
where the expectation $E$ on the extreme right is taken over the unconstrained $F_{\nu_\phi,\nu_e}$ distribution. Now $g(V \mid m_c/m_e)$ is monotonically decreasing in $V$, so that there exists a value $V'$ such that
$$g(V \mid m_c/m_e) > 1 \quad \text{for } V < V', \qquad g(V \mid m_c/m_e) < 1 \quad \text{for } V > V'. \qquad (6.3.51)$$
Consider the difference
$$I(V_0) = \Pr\{V < V_0 \mid y\} - \Pr\{F_{\nu_\phi,\nu_e} < V_0\}. \qquad (6.3.52)$$
Clearly $I(0) = I(\infty) = 0$. Upon differentiating, we have
$$I'(V_0) = p(F_{\nu_\phi,\nu_e} = V_0)\,[g(V_0 \mid m_c/m_e) - 1]. \qquad (6.3.53)$$
Expressions (6.3.51) and (6.3.53) together imply that $I(V_0)$ can never be negative, and (6.3.49) follows at once. As in the case of comparing two means, when $m_c/m_e$ tends to infinity, $g(V \mid m_c/m_e)$ approaches unity for all $V$, and the posterior distribution $p(V \mid y)$ tends to the unconstrained $F_{\nu_\phi,\nu_e}$ distribution. It can be verified that the mixed derivative
$$\frac{\partial^2 \log g(V \mid m_c/m_e)}{\partial V\,\partial(m_c/m_e)} \qquad (6.3.54)$$
has constant sign for all $V$ and $m_c/m_e$, so that for $V_0 > 0$ the probability $\Pr\{V < V_0 \mid y\}$ is monotonically decreasing in $m_c/m_e$. Further, as $m_c/m_e \to 0$, we obtain the limiting distribution in which $[(\nu_e+\nu_c)/\nu_e]V$ is distributed as $F_{\nu_\phi,\,\nu_e+\nu_c}$. (6.3.55) Thus, corresponding to (6.3.33),
$$\Pr\{F_{\nu_\phi,\nu_e} < V_0\} \leq \Pr\{V < V_0 \mid y\} \leq \Pr\{F_{\nu_\phi,\,\nu_e+\nu_c} < [(\nu_e+\nu_c)/\nu_e]\,V_0\}, \qquad (6.3.56)$$
which provides an upper and a lower bound for $\Pr\{V < V_0 \mid y\}$.
The relationship between these results and those of sampling theory is similar to that discussed earlier for the comparison of two means. In particular, the ellipsoid defined by $m(\phi)/m_e = F(\nu_\phi, \nu_e, \alpha)$, which encloses the smallest $(1-\alpha)$ confidence region, will also be the $(1-\alpha)$ H.P.D. region for the unconstrained distribution
of $\phi$. The same ellipsoid will define an H.P.D. region for the posterior distribution of $\phi$, but it is clear from (6.3.49) that the probability content will be greater than $(1-\alpha)$.
6.3.9 Approximating the Posterior Distribution of $V = m(\phi)/m_e$

When $\nu_c$ is a positive even integer, the distribution of $V$ in (6.3.48) can again be expanded into a finite weighted sum of scaled $F$ densities, $V > 0$, where the $w_j$ and $\gamma_j$ are given in (6.3.35). (6.3.57) It follows that the probability integral $\Pr\{V < V_0 \mid y\}$ is the corresponding weighted sum of $F$ probability integrals, (6.3.58), which can be calculated using an $F$ table or an incomplete beta function table. This expression is, however, not applicable for odd values of $\nu_c$ and becomes rather complicated, computationally, for large $\nu_c$.
A Scaled $F$ Approximation

Adopting an argument similar to that leading to the scaled $t$ approximation for the individual comparison, we write
$$p(V \mid y) = \int_0^\infty p(V \mid \sigma_e^2, y)\,p(\sigma_e^2 \mid y)\,d\sigma_e^2, \qquad V > 0. \qquad (6.3.59)$$
From the unconstrained joint distribution of $\phi$ in (6.3.12a), it is clear that, given $\sigma_e^2$, the quantity $V = m(\phi)/m_e$ is distributed as $(\sigma_e^2/m_e)\chi_{\nu_\phi}^2/\nu_\phi$. Further, once $\sigma_e^2$ is given, the constraint $C$ does not affect the distribution of $V$. Thus the first factor in the integrand of (6.3.59) is, in fact, a $(\sigma_e^2/m_e)\chi_{\nu_\phi}^2/\nu_\phi$ distribution. The second factor is, of course, the same posterior distribution of $\sigma_e^2$ given in (6.3.38). Employing the scaled $\chi^{-2}$ approximation implied by (6.3.39) to the distribution of $\sigma_e^2$, we obtain
$$V \approx \frac{m^*}{m_e}\,F_{\nu_\phi,\,\nu^*}. \qquad (6.3.60)$$
The probability integral of $V$ can thus be approximately determined by using an $F$ table or an incomplete beta function table.
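A minimal sketch of the scaled $F$ approximation (6.3.60), assuming SciPy; the illustrative arguments are the analyst-data values quoted in the text.

```python
# Approximate Pr{V < V0 | y} using the scaled F approximation (6.3.60):
# V is approximately (m*/m_e) F(nu_phi, nu*).
from scipy import stats

def pr_V_less_than(V0, m_e, m_star, nu_phi, nu_star):
    """Approximate posterior probability that V = m(phi)/m_e is below V0."""
    return stats.f.cdf(V0 * m_e / m_star, nu_phi, nu_star)

# Analyst data: nu_phi = 1, m_e = 29.54, m* = 27.55, nu* = 8.57
print(round(pr_V_less_than(5.0, 29.54, 27.55, 1, 8.57), 3))
```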
6.3.10 Summarized Calculations

For the complete set of analyst data introduced in Table 6.3.2, Table 6.3.6 provides a summary of the approximating posterior distributions of the various parameters of interest for the additive mixed model.

Table 6.3.6
Summarized calculations for the various approximate posterior distributions for the additive mixed model (applied to the analyst data of Table 6.3.2)

1. Use Tables 6.3.1 and 6.3.3 to obtain
   $\bar y_{1.} = 37.38$, $\bar y_{2.} = 41.63$, $\bar y_{..} = 39.51$, $m_c = 79.0$, $m_e = 29.54$,
   $m(\phi) = 8[(\phi_1 + 2.13)^2 + (\phi_2 - 2.13)^2]$, $\nu_\phi = 1$, $I = 2$, $J = 8$.

2. Use (6.3.39) to calculate $x = 0.728$, $\nu^* = 8.57$, $m^* = 27.55$.

3. Then, for making inferences about $\sigma_e^2$:
   $\sigma_e^2 \approx \nu^* m^* \chi_{\nu^*}^{-2}$.

4. For making inferences about $\sigma_{ce}^2$:
   $553.0/\sigma_{ce}^2 \approx \chi_7^2$, truncated from below at 0.053.

5. For comparison of two means, $\delta = \theta_2 - \theta_1$:
   $\dfrac{\delta - (\bar y_{2.} - \bar y_{1.})}{(2m^*/J)^{1/2}} = \dfrac{\delta - 4.25}{2.62} \approx t(0, 1, 8.6)$.

6. For general comparison of the $I$ means (in this case $I = 2$), from (6.3.60):
   $V = m(\phi)/m_e \approx (m^*/m_e)\,F_{\nu_\phi,\,\nu^*} = (27.55/29.54)\,F_{1,\,8.6}$.
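Item 5 of Table 6.3.6 translates directly into an approximate posterior interval for $\delta$; a minimal sketch (SciPy assumed; the 95% level is our illustrative choice):

```python
# Approximate H.P.D. interval for delta: delta ~ t(4.25, 2m*/J, nu*),
# so the interval is mean +/- t-quantile times the scale.
import numpy as np
from scipy import stats

mean, m_star, nu_star, J = 4.25, 27.55, 8.57, 8
scale = np.sqrt(2 * m_star / J)          # approx 2.62
q = stats.t.ppf(0.975, nu_star)
print(round(mean - q * scale, 2), round(mean + q * scale, 2))
```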
6.4 THE INTERACTION MODEL

The method of analysis in the previous section can be easily extended to the so-called mixed model with interaction in (6.1.20),
$$y_{ijk} = \theta_i + c_j + t_{ij} + e_{ijk}, \qquad i = 1, \ldots, I; \quad j = 1, \ldots, J; \quad k = 1, \ldots, K, \qquad (6.4.1)$$
where the $y_{ijk}$ are the observations, the $\theta_i$ are location parameters, the $c_j$ are random effects, the $t_{ij}$ are random "interaction" effects, and the $e_{ijk}$ are random errors. The $IJK$ observations can be arranged in a two-way table with $I$ rows, $J$ columns, and $K$ observations per cell as in Table 6.1.1. Following our previous discussion, we assume that the $c_j$, $t_{ij}$, and $e_{ijk}$ are all independent and that
$$c_j \sim N(0, \sigma_c^2), \qquad t_{ij} \sim N(0, \sigma_t^2), \qquad e_{ijk} \sim N(0, \sigma_e^2). \qquad (6.4.2)$$
6.4.1 The Likelihood Function

To derive the likelihood function, it is convenient to define the cell means $\bar y_{ij.} = K^{-1}\sum_k y_{ijk}$ and write
$$y_{ijk} = (\theta_i + c_j + t_{ij} + \bar e_{ij.}) + (e_{ijk} - \bar e_{ij.}), \qquad (6.4.3)$$
$$\bar y_{ij.} = \theta_i + c_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} = t_{ij} + \bar e_{ij.}. \qquad (6.4.4)$$
From Theorem 5.2.1 (on page 250), the deviations $y_{ijk} - \bar y_{ij.}$ are distributed independently of the cell means $\bar y_{ij.}$, and also $S_e = \sum\sum\sum(y_{ijk} - \bar y_{ij.})^2$ has the $\sigma_e^2\chi^2$ distribution with $IJ(K-1)$ degrees of freedom. Further, the model in (6.4.4) for the cell means $\bar y_{ij.}$ is of exactly the same form as the additive model discussed in the preceding section, with $\varepsilon_{ij}$ having the Normal distribution $N(0, \sigma_t^2 + K^{-1}\sigma_e^2)$. It follows that the likelihood function factors accordingly. (6.4.5)

The quantities appearing in the likelihood function can be conveniently arranged in analysis of variance form as in Table 6.4.1. In Table 6.4.1, $\bar\theta = (1/I)\sum\theta_i$, $\phi' = (\phi_1, \ldots, \phi_{I-1})$, $\phi_i = \theta_i - \bar\theta$, and
$$\phi_I = -\sum_{i=1}^{I-1}\phi_i.$$
From the definitions of $(\sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2)$, we see that these parameters are subject to the constraint
$$C: \sigma_e^2 < \sigma_{te}^2 < \sigma_{tce}^2. \qquad (6.4.6)$$
Table 6.4.1
Analysis of variance for the interaction model

  Source          S.S.                                                                       d.f.                   M.S.                          E.M.S.
  Grand mean      $IJK(\bar\theta - \bar y_{...})^2$                                         1                      --                            $\sigma_{tce}^2 = \sigma_e^2 + K\sigma_t^2 + IK\sigma_c^2$
  Fixed effects   $S(\phi) = JK\sum_i[\phi_i - (\bar y_{i..} - \bar y_{...})]^2$             $\nu_\phi = I-1$       $m(\phi) = S(\phi)/\nu_\phi$  $\sigma_{te}^2 = \sigma_e^2 + K\sigma_t^2$
  Random effects  $S_c = IK\sum_j(\bar y_{.j.} - \bar y_{...})^2$                            $\nu_c = J-1$          $m_c = S_c/\nu_c$             $\sigma_{tce}^2 = \sigma_e^2 + K\sigma_t^2 + IK\sigma_c^2$
  Interaction     $S_t = K\sum\sum(\bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...})^2$  $\nu_t = (I-1)(J-1)$  $m_t = S_t/\nu_t$         $\sigma_{te}^2 = \sigma_e^2 + K\sigma_t^2$
  Error           $S_e = \sum\sum\sum(y_{ijk} - \bar y_{ij.})^2$                             $\nu_e = IJ(K-1)$      $m_e = S_e/\nu_e$             $\sigma_e^2$
6.4.2 Posterior Distribution of $(\theta, \phi, \sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2)$
Adopting an argument similar to that given in the previous section, we employ the noninformative reference prior distribution
$$p(\theta, \phi, \sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2) \propto \sigma_e^{-2}\sigma_{te}^{-2}\sigma_{tce}^{-2}, \qquad (6.4.7)$$
subject to the constraint $C$, from which it follows that the posterior distribution can be written
$$p(\theta, \phi, \sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2 \mid y) = p(\theta \mid \sigma_{tce}^2, y)\,p(\phi \mid \sigma_{te}^2, y)\,p(\sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2 \mid y),$$
$$-\infty < \theta < \infty, \quad -\infty < \phi_i < \infty, \quad \sigma_{tce}^2 > \sigma_{te}^2 > \sigma_e^2 > 0. \qquad (6.4.8)$$
In (6.4.8), (i) the conditional distribution of $\theta$ given $\sigma_{tce}^2$, $p(\theta \mid \sigma_{tce}^2, y)$, is the Normal distribution $N(\bar y_{...},\,\sigma_{tce}^2/IJK)$; (ii) the conditional distribution of $\phi$ given $\sigma_{te}^2$, $p(\phi \mid \sigma_{te}^2, y)$, is the $(I-1)$-dimensional multivariate Normal
$$N_{(I-1)}(\hat\phi,\;\sigma_{te}^2\,\Sigma), \qquad \Sigma = \frac{1}{JK}\left[\mathbf I_{(I-1)} - \frac{1}{I}\mathbf 1_{(I-1)}\mathbf 1'_{(I-1)}\right]^{-1}, \qquad (6.4.9)$$
with $\hat\phi' = (\bar y_{1..} - \bar y_{...}, \ldots, \bar y_{(I-1)..} - \bar y_{...})$; and (iii) the marginal distribution of $(\sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2)$ is of the form (6.4.10), where $(\chi_{\nu_e}^2, \chi_{\nu_t}^2, \chi_{\nu_c}^2)$ are independent $\chi^2$ variables with $(\nu_e, \nu_t, \nu_c)$ degrees of freedom, respectively. Expression (6.4.10) is of precisely the same form as the distribution in (5.3.10) for the three-component hierarchical design model. Inferences about the variances $(\sigma_e^2, \sigma_t^2, \sigma_c^2)$ can thus be made using the corresponding results obtained earlier.
6.4.3 Comparison of I Means

Integrating out $(\theta, \sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2)$ from (6.4.8), the posterior distribution of the $(I-1)$ contrasts $\phi$ is
$$p(\phi \mid y) = p(t_{(I-1)} = \phi - \hat\phi \mid m_t\Sigma, \nu_t)\,g(\phi), \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1, \qquad (6.4.11)$$
whose first factor is the density of the multivariate $t$ distribution $t_{(I-1)}(\hat\phi, m_t\Sigma, \nu_t)$, which would be the distribution of $\phi$ if the constraint $C$ in (6.4.6) were ignored. The modifying factor $g(\phi)$ in (6.4.11), which represents the effect of the constraint $C$ in (6.4.6), is proportional to
$$\Pr\left\{\frac{\nu_c m_c}{\chi_{\nu_c}^2} > \frac{\nu_t m_t + S(\phi)}{\chi_{\nu_t+\nu_\phi}^2} > \frac{\nu_e m_e}{\chi_{\nu_e}^2}\right\}, \qquad (6.4.12)$$
where $\chi_{\nu_t+\nu_\phi}^2$ is a $\chi^2$ variable with $\nu_t+\nu_\phi$ degrees of freedom, independent of $\chi_{\nu_e}^2$ and $\chi_{\nu_c}^2$. Since $g(\phi)$ is a function of $S(\phi)$ only, it follows that the center of the distribution in (6.4.11) remains at $\hat\phi$ and that the density contours of $\phi$ are ellipsoidal. It is easy to see that $\hat\phi$ is the vector of posterior means of $\phi$ and is also the unique mode of the distribution. These properties are very similar to those of the distribution of $\phi$ in (6.3.45) for the additive mixed model. However, unlike the latter distribution, $g(\phi)$ is no longer monotonically decreasing in $S(\phi)$, so that $g$ does not necessarily make the distribution of $\phi$ more concentrated about $\hat\phi$. The reason is that the quantity $\sigma_{te}^2$ appearing in the covariance matrix of $\phi$ for the conditional distribution $p(\phi \mid \sigma_{te}^2, y)$ in (6.4.9) can be regarded as subject to two constraints, $\sigma_e^2 < \sigma_{te}^2$ and $\sigma_{te}^2 < \sigma_{tce}^2$. While the former constraint tends to reduce the spread of the distribution of $\phi$, the latter tends to increase it. The combined effect of these two constraints depends upon the two mean square ratios $m_e/m_t$ and $m_t/m_c$; in general, the smaller these ratios, the larger will be the reduction in the spread of the distribution of $\phi$.
6.4.4 Approximating the Distribution $p(\phi \mid y)$

For a given set of data, the specific effect of the constraint $C$ can be approximately determined by the procedure we now develop. The posterior distribution of $\phi$ can be written as the integral
$$p(\phi \mid y) = \int_0^\infty p(\phi \mid \sigma_{te}^2, y)\,p(\sigma_{te}^2 \mid y)\,d\sigma_{te}^2, \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1. \qquad (6.4.13)$$
In the integrand, the weight function $p(\sigma_{te}^2 \mid y)$ is
$$p(\sigma_{te}^2 \mid y) = \int_0^{\sigma_{te}^2}\!\!\int_{\sigma_{te}^2}^\infty p(\sigma_e^2, \sigma_{te}^2, \sigma_{tce}^2 \mid y)\,d\sigma_{tce}^2\,d\sigma_e^2. \qquad (6.4.14)$$
Adopting the scaled $\chi^{-2}$ approximation techniques developed in Sections 5.2.6 and 5.2.12 to first integrate out $\sigma_{tce}^2$ and then eliminate $\sigma_e^2$, we find
$$\sigma_{te}^2 \approx \nu_t^*\,m_t^*\,\chi_{\nu_t^*}^{-2}, \qquad (6.4.15)$$
where $\nu_t^*$ and $m_t^*$ are determined from ratios of incomplete beta functions (through intermediate quantities $x$, $a_1$, and $a_2$, illustrated in the example of Section 6.4.5).
To this degree of approximation,
$$p(\phi \mid y) \doteq p(t_{(I-1)} = \phi - \hat\phi \mid m_t^*\Sigma, \nu_t^*), \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1, \qquad (6.4.16)$$
i.e., a $t_{(I-1)}(\hat\phi, m_t^*\Sigma, \nu_t^*)$ distribution. It follows, in particular, that the quantity
$$V = \frac{m(\phi)}{m_t^*} = \frac{S(\phi)}{\nu_\phi\,m_t^*} \qquad (6.4.17)$$
is approximately distributed as $F$ with $(\nu_\phi, \nu_t^*)$ degrees of freedom.
6.4.5 An Example

Consider again the car-driver data introduced in Table 6.2.2, but now suppose that we are interested in comparing the performance of the 9 particular cars. The relevant sample quantities are:

  $m_c = 126.4569$, $m_t = 1.7127$, $m_e = 1.1759$;
  $\nu_c = 8$, $\nu_t = 64$, $\nu_e = 81$; $I = 9$, $J = 9$, $K = 2$.

  Car   $\hat\phi_i = \bar y_{i..} - \bar y_{...}$
   1     2.4521
   2    −1.5971
   3     1.3023
   4     0.6063
   5     1.8641
   6    −0.4950
   7    −1.5985
   8    −1.2205
   9    −1.3134
On the basis of the noninformative reference prior distribution in (6.4.7), the posterior distribution of the $I - 1 = 8$ independent contrasts $\phi = (\phi_1, \ldots, \phi_8)'$ is that given in (6.4.11). To approximate this distribution, we find, using (6.4.15),
$$x = \frac{1{,}011.655}{1{,}121.267} = 0.902, \qquad a_2 = \frac{I_{0.902}(4, 34)}{I_{0.902}(4, 33)} \approx \frac{I_{0.902}(4, 33)}{I_{0.902}(4, 32)} \approx 1.0.$$
Thus, with
$$x_1 = \frac{109.6123}{204.858} = 0.5351,$$
and using incomplete beta function values such as $I_{x_1}(33, 40.5) = 0.9816$ and $I_{x_1}(32, 40.5) = 0.9846$, we obtain
$$\nu_t^* = 71.15 \qquad \text{and} \qquad m_t^* = \frac{109.6123}{0.8856 \times 71.15} = 1.74.$$
It follows that $\phi$ is approximately distributed as $t_8(\hat\phi,\,1.74\,\Sigma,\,71.15)$, where, from (6.4.9),
$$\Sigma = \frac{1}{18}\left[\mathbf I_8 - \frac{1}{9}\mathbf 1_8\mathbf 1'_8\right]^{-1} = \frac{1}{18}\left(\mathbf I_8 + \mathbf 1_8\mathbf 1'_8\right).$$
In particular, suppose we wish to compare the means of cars 1 and 2. Then $\delta = \theta_1 - \theta_2 = \phi_1 - \phi_2$ is approximately distributed as $t(4.05, 0.1933, 71.15)$. Using the Normal approximation, the limits of the 95% H.P.D. interval are $(3.19, 4.91)$. For an overall comparison of the means of all nine cars, the $(1-\alpha)$ H.P.D. region of $\phi$ is given by
$$\frac{m(\phi)}{1.74} \leq F(8,\,71.15,\,\alpha), \qquad \text{where} \qquad m(\phi) = \frac{18}{8}\sum_{j=1}^{9}(\phi_j - \hat\phi_j)^2.$$
In particular, one may wish to decide whether the parameter point $\theta_1 = \cdots = \theta_9$, that is, $\phi = 0$, is included in a given H.P.D. region. We have
$$\frac{m(\mathbf 0)}{1.74} = \frac{45.26}{1.74} = 26.01 \qquad \text{and} \qquad F(8, 71.15, 0.05) \doteq 2.14,$$
so that the point $\phi = 0$ lies far outside the 95% H.P.D. region.
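The overall comparison just made can be verified numerically from the $\hat\phi_i$ listed above; a minimal sketch (NumPy/SciPy assumed). A computed $F$ point may differ slightly from the tabled value 2.14.

```python
# Overall comparison for the car-driver example: m(0)/m_t* against the
# 5% point of F(8, 71.15).
import numpy as np
from scipy import stats

phi_hat = np.array([2.4521, -1.5971, 1.3023, 0.6063, 1.8641,
                    -0.4950, -1.5985, -1.2205, -1.3134])
J, K, nu_phi = 9, 2, 8
m_t_star, nu_t_star = 1.74, 71.15

m_at_zero = (J * K / nu_phi) * (phi_hat ** 2).sum()    # m(phi) at phi = 0
print(round(m_at_zero, 2), round(m_at_zero / m_t_star, 2))   # 45.26, 26.01
print(stats.f.ppf(0.95, nu_phi, nu_t_star))  # 5% point; text quotes 2.14
```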
CHAPTER 7

INFERENCE ABOUT MEANS WITH INFORMATION FROM MORE THAN ONE SOURCE: ONE-WAY CLASSIFICATION AND BLOCK DESIGNS

7.1 INTRODUCTION

Much scientific work has as its object the detection and measurement of possible changes in some observable response associated with qualitative change in the experimental conditions. To make it possible to assess such influences, experiments are frequently arranged so that the data can be classified in some convenient manner. Thus, in our earlier discussion, we have considered data coming from one-way classification arrangements, cross classifications, and hierarchical designs. With sampling theory, when the object was to compare means, the analysis was ordinarily conducted using a "fixed effect" model, but when variance components were of interest the analysis employed a "random effect" model. Mixed models were employed when some factors were thought of as contributing fixed and others random effects. As we have seen in Chapters 5 and 6, a number of difficulties were associated with the analysis of these models in the sampling theory approach. By considering two other important problems in this general area we hope to show further, in the present chapter, how the Bayes approach illuminates and resolves difficulties. The two selected problems concern the comparison of means (i) for the one-way classification design and (ii) for balanced incomplete block designs. For problem (i), the sampling theory difficulties are evidenced, for example, by the results of James and Stein (1961), which show the usual averages to be inadmissible estimators of the group means. For problem (ii), in sampling theory terms, one difficulty is that of appropriately "recovering inter-block information." An inadequacy of the framework outlined above is that it does not provide for the common situation in which we wish to compare means which are nevertheless "random effects." Similarly, block effects are often best treated as random. These inadequacies are easily remedied in the Bayesian approach, and when this is used, it becomes clear that the problems encountered with sampling theory concern once more the difficulty, on that theory, of appropriate pooling of information from more than one source.
7.2 INFERENCES ABOUT MEANS FOR THE ONE-WAY RANDOM EFFECT MODEL

Suppose we have data $y_{jk}$ arranged in a one-way classification with $J$ groups and $K$ observations per group. For example, consider again the dyestuff data [taken from Davies (1967, p. 105)] for $K = 5$ laboratory determinations made on samples from each of $J = 6$ batches. These data were previously considered in Sections 5.1 and 5.2. The observations and the batch averages are shown in Table 7.2.1.
Table 7.2.1
Yield of dyestuff in grams of standard color

                             Batch
                   1     2     3     4     5     6
  Individual      145   140   195    45   195   120
  observations     40   155   150    40   230    55
  (yield − 1400)   40    90   205   195   115    50
                  120   160   110    65   235    80
                  180    95   160   145   225    45
  Averages        105   128   164    98   200    70
7.2.1 Various Models Used in the Analysis of One-way Classification

In Section 2.11 we considered the analysis of data arranged in this way in relation to a "fixed effect" model
$$y_{jk} = \theta_j + e_{jk}, \qquad j = 1, 2, \ldots, J; \quad k = 1, 2, \ldots, K, \qquad (7.2.1)$$
with the errors distributed independently and Normally such that $e_{jk} \sim N(0, \sigma_1^2)$. In that formulation, the $\theta_j$ were supposed to have locally uniform reference priors. This would seem appropriate if the means $\theta_j$ were expected to bear no strong relationship one to another. Certainly, problems occasionally do occur where this is so. Thus, in comparing the laboratory yields for several different methods of making a particular chemical product, we could have a situation approximated by the supposition that any one of the methods could give yields anywhere within a wide range, independently of the others. Given this supposition, we have seen that a posteriori the $\theta_j$ would have a multivariate $t$ distribution, and consideration as to whether a particular parameter point $\theta$ was or was not included in a particular H.P.D. region could be decided by the use of the $F$ distribution. These results all have exact parallels in the standard sampling theory analysis. It is apparent, however, that locally uniform priors for the $\theta_j$ would be totally inappropriate for the dyestuff data. A model more likely to fit these circumstances would be one where the batch means $\theta_j$ were regarded as independent drawings
from a distribution. These data have already been studied in connection with such a model in (5.2.1). Specifically, writing $\theta_j = \theta + \epsilon_j$, it was assumed that
$$\epsilon_j \sim N(0, \sigma_2^2), \qquad \text{that is,} \qquad \theta_j \sim N(\theta, \sigma_2^2), \qquad (7.2.2)$$
but it was supposed that we wished to learn about $\sigma_2^2$, the variance of the distribution of the $\theta_j$. This "random effect" model could, however, equally well be appropriate when interest centered on the individual batch means $\theta_1, \ldots, \theta_6$ rather than on the variance $\sigma_2^2$. As a further example, if the performance of six Volkswagen cars bought in the same year was tested on five different days, the main objective could be to compare the mean performance of the particular six cars being tested, even though the cars could reasonably be regarded as random drawings from a population of Volkswagens. As Lindley has pointed out in his discussion of the work of Stein (1962), it is common for a random effect model to be appropriate, and yet to wish to make inferences about means. In what follows, then, the random effect model of Section 5.2 is used, but the object now is to make inferences about the individual batch means $\theta_j$. We later consider in some detail the contrast between this random effect analysis of the means and the corresponding fixed effect analysis, and also its relation to certain results in sampling theory.
7.2.2 Use of the Random Effect Model in the Study of Means

Assuming, then, the random effect model defined in (7.2.1) and (7.2.2), the usual associated analysis of variance is given in Table 7.2.2.

Table 7.2.2
Analysis of variance for one-way classification random effect model

  Source                    S.S.                                            d.f.              M.S.               Sampling expectation of M.S.
  Between groups (batches)  $S_2 = K\sum_j(\bar y_{j.} - \bar y_{..})^2$    $\nu_2 = J-1$     $m_2 = S_2/\nu_2$  $\sigma_{12}^2 = \sigma_1^2 + K\sigma_2^2$
  Within groups             $S_1 = \sum\sum(y_{jk} - \bar y_{j.})^2$        $\nu_1 = J(K-1)$  $m_1 = S_1/\nu_1$  $\sigma_1^2$
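The Table 7.2.2 quantities for the dyestuff data can be computed directly from Table 7.2.1; a minimal sketch (NumPy assumed):

```python
# Compute the one-way ANOVA mean squares for the dyestuff data (columns of the
# array are the five observations within each batch).
import numpy as np

y = np.array([[145,  40,  40, 120, 180],
              [140, 155,  90, 160,  95],
              [195, 150, 205, 110, 160],
              [ 45,  40, 195,  65, 145],
              [195, 230, 115, 235, 225],
              [120,  55,  50,  80,  45]], dtype=float)
J, K = y.shape
group_means = y.mean(axis=1)

S2 = K * ((group_means - y.mean()) ** 2).sum()   # between batches
S1 = ((y - group_means[:, None]) ** 2).sum()     # within batches
nu2, nu1 = J - 1, J * (K - 1)
print(S2 / nu2, S1 / nu1)                        # 11271.5, 2451.25
```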
As before, we have adopted the notation that a subscript replaced by a dot means the average over that subscript.

Prior and Posterior Distributions of $(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta)$

Let $\boldsymbol\theta = (\theta_1, \theta_2, \ldots, \theta_J)'$ be the vector of group means. This vector must be distinguished from the scalar $\theta = E(\theta_j)$, $j = 1, 2, \ldots, J$. The joint distribution of the $JK$ observations $y$ and the unknown parameters
$(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta)$ may be written
$$p(y, \boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta) = p(\theta, \sigma_1^2, \sigma_2^2)\,p(\boldsymbol\theta \mid \sigma_2^2, \theta)\,p(y \mid \boldsymbol\theta, \sigma_1^2), \qquad (7.2.3)$$
where from (7.2.1)
$$p(y \mid \boldsymbol\theta, \sigma_1^2) \propto (\sigma_1^2)^{-\frac{1}{2}JK}\exp\left\{-\frac{1}{2\sigma_1^2}\left[\nu_1 m_1 + K\sum_j(\theta_j - \bar y_{j.})^2\right]\right\}, \qquad (7.2.4)$$
from (7.2.2)
$$p(\boldsymbol\theta \mid \sigma_2^2, \theta) \propto (\sigma_2^2)^{-\frac{1}{2}J}\exp\left[-\frac{1}{2\sigma_2^2}\sum_j(\theta_j - \theta)^2\right], \qquad -\infty < \theta_j < \infty, \quad j = 1, \ldots, J, \qquad (7.2.5)$$
and $p(\theta, \sigma_1^2, \sigma_2^2)$ is the prior distribution of $(\theta, \sigma_1^2, \sigma_2^2)$. For a noninformative reference prior of $(\theta, \sigma_1^2, \sigma_2^2)$, note that by combining (7.2.4) and (7.2.5) and integrating out $\boldsymbol\theta$, the joint distribution of $y$ given $(\theta, \sigma_1^2, \sigma_2^2)$ is precisely proportional to the likelihood function in (5.2.7). Thus, given $y$, the likelihood function of $(\theta, \sigma_1^2, \sigma_2^2)$ is
$$l(\theta, \sigma_1^2, \sigma_2^2 \mid y) \propto p(y \mid \theta, \sigma_1^2, \sigma_2^2). \qquad (7.2.6)$$
Thus, following the previous approach to variance component models, inferences are considered against the background of the noninformative reference prior
$$p(\theta, \sigma_1^2, \sigma_2^2) \propto \sigma_1^{-2}\sigma_{12}^{-2}, \qquad (7.2.7)$$
in which $\log\sigma_1^2$, $\log\sigma_{12}^2$, and $\theta$ are supposed locally uniform, and the fact that $\sigma_{12}^2 = \sigma_1^2 + K\sigma_2^2$ implies the constraint $\sigma_{12}^2 > \sigma_1^2$. Given the sample $y$, the joint posterior distribution of $(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta)$ is then
$$p(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta \mid y) \propto \sigma_1^{-2}\sigma_{12}^{-2}\,p(\boldsymbol\theta \mid \sigma_2^2, \theta)\,p(y \mid \boldsymbol\theta, \sigma_1^2) \propto \sigma_{12}^{-2}(\sigma_1^2)^{-(\frac{1}{2}JK+1)}(\sigma_2^2)^{-\frac{1}{2}J}\exp\left\{-\frac{\nu_1 m_1 + K\sum(\theta_j - \bar y_{j.})^2}{2\sigma_1^2} - \frac{\sum(\theta_j - \theta)^2}{2\sigma_2^2}\right\},$$
$$\sigma_1^2 > 0, \quad \sigma_2^2 > 0, \quad -\infty < \theta_j < \infty, \quad j = 1, \ldots, J. \qquad (7.2.8)$$

7.2.3 Posterior Distribution of $\boldsymbol\theta$ via Intermediate Distributions

Before obtaining the posterior distribution of $\boldsymbol\theta$, various intermediate distributions of $(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta)$ are first derived. This approach facilitates subsequent comparison of the random and fixed effect models.
Conditional Posterior Distribution of $\boldsymbol\theta$ Given $(\theta, \sigma_1^2, \sigma_2^2)$
The joint distribution in (7.2.8) can be written as the product
$$p(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta \mid y) = p(\boldsymbol\theta \mid \sigma_1^2, \sigma_2^2, \theta, y)\,p(\sigma_1^2, \sigma_2^2, \theta \mid y). \qquad (7.2.9)$$
Using (A1.1.5) in Appendix A1.1, we can write
$$\frac{K}{\sigma_1^2}(\theta_j - \bar y_{j.})^2 + \frac{1}{\sigma_2^2}(\theta_j - \theta)^2 = \frac{K}{\sigma_1^2(1-z)}(\theta_j - \hat\theta_j)^2 + \frac{K}{\sigma_{12}^2}(\theta - \bar y_{j.})^2, \qquad (7.2.10)$$
where
$$\hat\theta_j = (1-z)\bar y_{j.} + z\theta \qquad \text{and} \qquad z = \sigma_1^2/\sigma_{12}^2. \qquad (7.2.11)$$
Note that $z$ is the reciprocal of $W = \sigma_{12}^2/\sigma_1^2$, the quantity considered in (5.2.15). It follows that, conditional on $(\theta, \sigma_1^2, \sigma_2^2)$, the $\theta_j$ are independently distributed a posteriori as
$$\theta_j \sim N\!\left(\hat\theta_j,\;\frac{\sigma_1^2}{K}(1-z)\right). \qquad (7.2.12)$$
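Equation (7.2.11) is a shrinkage formula: given $z$, each group mean is pulled toward $\theta$ by the fraction $z$. A minimal sketch; the illustrative $z$ value anticipates the posterior mean of $z$ computed later in this section.

```python
# Shrinkage of the group means toward theta, eq. (7.2.11).
import numpy as np

def shrunken_means(group_means, theta, z):
    """theta_hat_j = (1 - z) * ybar_j + z * theta."""
    group_means = np.asarray(group_means, dtype=float)
    return (1.0 - z) * group_means + z * theta

# Dyestuff batch averages shrunk toward the grand mean, with z = 0.233
print(shrunken_means([105, 128, 164, 98, 200, 70], 127.5, 0.233))
```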
Posterior Distribution of $(\theta, \sigma_1^2, \sigma_2^2)$

Also, the posterior distribution of $(\theta, \sigma_1^2, \sigma_2^2)$ is of the form given in (7.2.13), $\sigma_1^2 > 0$, $\sigma_2^2 > 0$, $-\infty < \theta < \infty$, which is equivalent to the distribution in (5.2.11). This implies that, in particular, the marginal posterior distribution of $\sigma_{12}^2$, (7.2.14),
is a truncated $\chi^{-2}$ distribution.

Posterior Distribution of $(\boldsymbol\theta, z)$

Using (7.2.10),
$$\frac{1}{2}\left[\frac{\nu_1 m_1}{\sigma_1^2} + \frac{K}{\sigma_1^2}\sum_j(\theta_j - \bar y_{j.})^2 + \frac{1}{\sigma_2^2}\sum_j(\theta_j - \theta)^2\right] = \frac{1}{2}\left[\frac{\nu_1 m_1}{\sigma_1^2} + \frac{KzJ}{\sigma_1^2(1-z)}(\bar\theta - \theta)^2 + \frac{K}{\sigma_1^2}S(\boldsymbol\theta, z)\right], \qquad (7.2.15)$$
where
$$S(\boldsymbol\theta, z) = \sum_j(\theta_j - \bar y_{j.})^2 + \frac{z}{1-z}\sum_j(\theta_j - \bar\theta)^2 \qquad \text{and} \qquad \bar\theta = J^{-1}\sum_j\theta_j.$$
Making the transformation from $(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta)$ to $(\boldsymbol\theta, \sigma_1^2, z, \theta)$ and integrating out $\theta$ and $\sigma_1^2$ in (7.2.8), we obtain
$$p(\boldsymbol\theta, z \mid y) \propto z^{\frac{1}{2}\nu_2-1}(1-z)^{-\frac{1}{2}\nu_2}\left[\nu_1 m_1 + K\,S(\boldsymbol\theta, z)\right]^{-\frac{1}{2}(\nu_1+\nu_2+J)}, \qquad 0 < z < 1, \quad -\infty < \theta_j < \infty, \quad j = 1, \ldots, J. \qquad (7.2.16)$$
Now the quantity $S(\boldsymbol\theta, z)$ may be written
$$S(\boldsymbol\theta, z) = (\boldsymbol\theta - \bar{\mathbf y}.)'(\boldsymbol\theta - \bar{\mathbf y}.) + \frac{z}{1-z}\,\boldsymbol\theta'(\mathbf I - J^{-1}\mathbf 1\mathbf 1')\boldsymbol\theta, \qquad (7.2.17)$$
where $\bar{\mathbf y}.' = (\bar y_{1.}, \ldots, \bar y_{J.})$ and $\mathbf 1$ is a $J \times 1$ vector of ones. By making use of the identity (A7.1.1) in Appendix A7.1, we find
$$K\,S(\boldsymbol\theta, z) = z\nu_2 m_2 + (1-z)^{-1}Q(\boldsymbol\theta, z), \qquad (7.2.18)$$
where
$$Q(\boldsymbol\theta, z) = K[\boldsymbol\theta - \hat{\boldsymbol\theta}(z)]'(\mathbf I - zJ^{-1}\mathbf 1\mathbf 1')[\boldsymbol\theta - \hat{\boldsymbol\theta}(z)],$$
$$\hat{\boldsymbol\theta}'(z) = [\hat\theta_1(z), \ldots, \hat\theta_J(z)], \qquad \hat\theta_j(z) = \bar y_{j.} - z(\bar y_{j.} - \bar y_{..}) = (1-z)\bar y_{j.} + z\bar y_{..}, \qquad j = 1, \ldots, J.$$
Thus,
$$p(\boldsymbol\theta, z \mid y) \propto z^{\frac{1}{2}\nu_2-1}(1-z)^{-\frac{1}{2}\nu_2}\left[\nu_1 m_1 + z\nu_2 m_2 + (1-z)^{-1}Q(\boldsymbol\theta, z)\right]^{-\frac{1}{2}(\nu_1+\nu_2+J)}, \qquad 0 < z < 1, \quad -\infty < \theta_j < \infty, \quad j = 1, \ldots, J. \qquad (7.2.19)$$
Posterior Distribution of $\boldsymbol\theta$ Conditional on $z$

From (7.2.19) it follows at once that, conditional on $z$:

a) the distribution of $\boldsymbol\theta$ is the $J$-dimensional $t$ distribution
$$t_J\!\left[\hat{\boldsymbol\theta}(z),\;s^2(z)K^{-1}(\mathbf I - zJ^{-1}\mathbf 1\mathbf 1')^{-1},\;\nu_1+\nu_2\right], \qquad (7.2.20)$$
where
$$\hat{\boldsymbol\theta}'(z) = [\hat\theta_1(z), \ldots, \hat\theta_J(z)], \qquad \hat\theta_j(z) = (1-z)\bar y_{j.} + z\bar y_{..},$$
$$s^2(z) = \frac{\nu_1 m_1 + z\nu_2 m_2}{\nu_1+\nu_2}, \qquad (\mathbf I - zJ^{-1}\mathbf 1\mathbf 1')^{-1} = \mathbf I + \left(\frac{z}{1-z}\right)J^{-1}\mathbf 1\mathbf 1',$$
i.e., $-\infty < \theta_j < \infty$, $j = 1, \ldots, J$;

b) the $\theta_j$ have means $\hat\theta_j(z)$, $j = 1, \ldots, J$, and they have common variance and covariance; in particular,
$$\mathrm{Var}(\theta_j \mid z) = \frac{\nu_1+\nu_2}{K(\nu_1+\nu_2-2)}\left[1 + \frac{z}{J(1-z)}\right]s^2(z); \qquad (7.2.21)$$
c) the marginal distribution of $\theta_j$ is a $t$ distribution having $\nu_1+\nu_2$ degrees of freedom; specifically, the quantity
$$\frac{\theta_j - \hat\theta_j(z)}{\left[\dfrac{\nu_1+\nu_2-2}{\nu_1+\nu_2}\,\mathrm{Var}(\theta_j \mid z)\right]^{1/2}} \qquad (7.2.22)$$
is distributed as $t(0, 1, \nu_1+\nu_2)$;
d) more generally, the marginal distribution of a linear function of the means $\eta = \mathbf l'\boldsymbol\theta$, where $\mathbf l' = (l_1, \ldots, l_J)$, is such that the quantity
$$\frac{\eta - \hat\eta(z)}{s_\eta(z)}, \qquad \text{with} \quad \hat\eta(z) = \mathbf l'\hat{\boldsymbol\theta}(z) \quad \text{and} \quad s_\eta^2(z) = \frac{s^2(z)}{K}\,\mathbf l'\left[\mathbf I + \left(\frac{z}{1-z}\right)J^{-1}\mathbf 1\mathbf 1'\right]\mathbf l,$$
is distributed as $t(0, 1, \nu_1+\nu_2)$; (7.2.23)

e) in particular, the marginal distribution of a particular difference $\theta_j - \theta_{j'}$ is a $t$ distribution having $\nu_1+\nu_2$ degrees of freedom; namely, the quantity
$$\frac{(\theta_j - \theta_{j'}) - (1-z)(\bar y_{j.} - \bar y_{j'.})}{[2K^{-1}s^2(z)]^{1/2}} \qquad (7.2.24)$$
has the $t(0, 1, \nu_1+\nu_2)$ distribution;
n,
376
Inference about Means with Information from more than one Source
dimensional t distribution having p( I z y) oc II I
VI
+
\/2
degrees of freedom defined by
K Lj x I [
+
+
(VI
+ ...
T
yJ]2}
- { (VI +V,+ 1-1 ),
\/2)S2(Z)
-OO«Pj<XJ, where
7.2
j=I, ... ,J-J ,
(7.2,25)
Posterior Distribution of $z$

Now from (5.2.16), and remembering that $z = W^{-1}$, the marginal distribution of $z = \sigma_1^2/\sigma_{12}^2$ is given by
$$p(z \mid y) = \frac{(m_2/m_1)\,p[F_{\nu_2,\nu_1} = (m_2/m_1)z]}{\Pr\{F_{\nu_2,\nu_1} < m_2/m_1\}}, \qquad 0 < z < 1, \qquad (7.2.26)$$
where, as before, $p(F_{\nu_2,\nu_1} = c)$ is the density of an $F$ variable with $(\nu_2, \nu_1)$ degrees of freedom evaluated at $c$. The $r$th moment ($r < \frac{1}{2}\nu_1$) of $z$ is, therefore,
$$\mu_r' = E(z^r) = \left(\frac{\nu_1 m_1}{\nu_2 m_2}\right)^{r}\frac{B(\frac{1}{2}\nu_2 + r,\;\frac{1}{2}\nu_1 - r)}{B(\frac{1}{2}\nu_2,\;\frac{1}{2}\nu_1)}\cdot\frac{I_x(\frac{1}{2}\nu_2 + r,\;\frac{1}{2}\nu_1 - r)}{I_x(\frac{1}{2}\nu_2,\;\frac{1}{2}\nu_1)}, \qquad (7.2.27)$$
where
$$x = \frac{\nu_2 m_2}{\nu_1 m_1 + \nu_2 m_2}$$
and $B(p, q)$ and $I_x(p, q)$ are, respectively, the complete and the incomplete beta functions.

Posterior Distribution of $\boldsymbol\theta$

The unconditional distribution of the means $\boldsymbol\theta$ may now be obtained by integrating the conditional distribution of $\boldsymbol\theta$ for given $z$ in (7.2.20), with $p(z \mid y)$ as weight function. Thus
$$p(\boldsymbol\theta \mid y) = \int_0^1 p(\boldsymbol\theta \mid z, y)\,p(z \mid y)\,dz. \qquad (7.2.28)$$
In particular, the unconditional distribution of any linear function $\eta = \mathbf l'\boldsymbol\theta$ of the elements of $\boldsymbol\theta$ can be similarly obtained by integrating the conditional distribution of $\eta$ given $z$ in (7.2.23) over the distribution $p(z \mid y)$. Also, to obtain the moments of $\theta_j$, we can take the expected values over $z$ of the conditional moments. In particular, we find
$$E(\theta_j \mid y) = \hat\theta_j = E_z[\hat\theta_j(z)] = \bar y_{j.} - \mu_1'(\bar y_{j.} - \bar y_{..}) = (1-\mu_1')\bar y_{j.} + \mu_1'\bar y_{..}, \qquad (7.2.29)$$
$$\mathrm{Var}(\theta_j \mid y) = E_z\,\mathrm{Var}(\theta_j \mid z) + E_z[\hat\theta_j(z) - \hat\theta_j]^2 = \frac{J(\nu_1 m_1 + \mu_1'\nu_2 m_2) - (J-1)(\mu_1'\nu_1 m_1 + \mu_2'\nu_2 m_2)}{JK(\nu_1+\nu_2-2)} + (\bar y_{j.} - \bar y_{..})^2\mu_2, \qquad (7.2.30)$$
and
$$\mathrm{Cov}(\theta_j, \theta_{j'} \mid y) = E_z\,\mathrm{Cov}(\theta_j, \theta_{j'} \mid z) + E_z[\hat\theta_j(z) - \hat\theta_j][\hat\theta_{j'}(z) - \hat\theta_{j'}], \qquad (7.2.31)$$
where $\mu_2 = \mu_2' - \mu_1'^2$ is the posterior variance of $z$.
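The moments $\mu_1'$ and $\mu_2$ can equally be obtained by numerical integration of (7.2.26) rather than from the beta-function formula (7.2.27); a minimal sketch (NumPy/SciPy assumed), which also reproduces the dyestuff entries of Table 7.2.3 below:

```python
# Posterior moments of z by quadrature of (7.2.26), then the posterior
# means and variances of theta_j from (7.2.29)-(7.2.30), dyestuff data.
import numpy as np
from scipy import stats
from scipy.integrate import quad

nu1, nu2, m1, m2, J, K = 24, 5, 2451.25, 11271.50, 6, 5
c = m2 / m1

def p_z(z):
    """Truncated, rescaled F density of z, eq. (7.2.26)."""
    return c * stats.f.pdf(c * z, nu2, nu1) / stats.f.cdf(c, nu2, nu1)

mu1 = quad(lambda z: z * p_z(z), 0, 1)[0]           # approx 0.233
mu2p = quad(lambda z: z**2 * p_z(z), 0, 1)[0]
mu2 = mu2p - mu1**2                                 # approx 0.026

ybar = np.array([105., 128., 164., 98., 200., 70.])
grand = ybar.mean()                                 # 127.5
post_mean = ybar - mu1 * (ybar - grand)             # (7.2.29)
post_var = ((J * (nu1*m1 + mu1*nu2*m2)
             - (J-1) * (mu1*nu1*m1 + mu2p*nu2*m2))
            / (J * K * (nu1 + nu2 - 2))
            + mu2 * (ybar - grand)**2)              # (7.2.30)

print(round(mu1, 3), round(mu2, 3))
for j in np.argsort(ybar):
    print(round(ybar[j]), round(post_mean[j]), round(np.sqrt(post_var[j]), 1))
```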
7.2.4 Random Effect and Fixed Effect Models for Inferences about Means

Inferences about the means $\theta_j$ resulting from the random effect formulation are different from those for the fixed effect formulation discussed earlier in Section 2.11. Specifically, in contrast to the results in (7.2.28) through (7.2.31), we recall from (2.11.1) that for the fixed effect model, a posteriori,
$$\boldsymbol\theta \sim t_J(\bar{\mathbf y}.,\;m_1 K^{-1}\mathbf I,\;\nu_1), \qquad (7.2.32)$$
where $\bar{\mathbf y}.' = (\bar y_{1.}, \ldots, \bar y_{J.})$. In particular, the marginal distribution of a linear function $\eta = \mathbf l'\boldsymbol\theta$ is
$$\eta \sim t(\mathbf l'\bar{\mathbf y}.,\;m_1 K^{-1}\mathbf l'\mathbf l,\;\nu_1). \qquad (7.2.33)$$
Also, the means, variances and covariances of the $\theta_j$ are
$$E(\theta_j \mid y) = \bar y_{j.}, \qquad \mathrm{Var}(\theta_j \mid y) = \frac{\nu_1}{\nu_1-2}\,\frac{m_1}{K}, \qquad \mathrm{Cov}(\theta_j, \theta_{j'} \mid y) = 0. \qquad (7.2.34)$$

An Illustrative Example

For illustration, consider the dyestuff data quoted in Table 7.2.1. In this example, $\theta_j$ is the mean yield for the $j$th batch of dyestuff, and
$$J = 6, \quad K = 5, \quad \nu_1 = 24, \quad \nu_2 = 5, \quad m_1 = 2{,}451.25, \quad m_2 = 11{,}271.50,$$
$$\bar y_{..} = 127.5, \qquad \mu_1' = 0.233, \qquad \mu_2 = 0.026.$$
Figure 7.2.1 contrasts the posterior distributions of the $\theta_j$ from the random and fixed effect models. The greater clustering of the distributions about $\bar y_{..}$ which occurs with the random effect model is clearly seen. Table 7.2.3 contrasts the means and standard deviations of the distributions obtained from these two models. To facilitate comparison, we have arranged and numbered the groups in order of magnitude of the sample means $\bar y_{j.}$. Inspection of the table shows the clustering of the posterior means and also the slight reduction in standard deviation of the distributions with the random effect model. Figure 7.2.2 shows the distribution of $\theta_6 - \theta_1$ for the random effect model together with the corresponding distribution appropriate to the fixed effect model. In this extreme case, a very large difference is seen between the two distributions.
Fig. 7.2.1 Posterior distributions of the $\theta_j$ for the dyestuff data: random effect model (top) and fixed effect model (bottom); open circles mark sample means, filled circles posterior means.
Table 7.2.3 Means and standard deviations Gf 6j for random effect and fixed effect models: the dyestuff data Standard deviations [Var (e j I y)]1/2 Group Posterior means E(e j I y) (ordered by Random effect Fixed effect magniiude Random effect Fixed effect y J.. . [420.4+0.026(Yj. - y./]1,2 (534.8)1/2 of mean) Yj . -O.233(Yj. - Y.)
2 3 4 5 6
83 105 110 128 156 183
70 98 105 128 164 200
22.5 21.1 20.8 20.5 21.3 23.6
e
23 .1 23 .1 23.1 23.1 23.1 23.1
Although for the fixed effect model the $\theta_j$ are uncorrelated, this is not so for the random effect model. The rather slight correlations that occur in the present example are shown in Table 7.2.4.
Fig. 7.2.2 Posterior distributions of $\theta_6 - \theta_1$ for the dyestuff data: random effect model (solid curve) and fixed effect model (broken curve).
Table 7.2.4
Correlation matrix of the $\theta_j$ for the random effect model: the dyestuff data

$\rho_{ij} = \mathrm{Cov}(\theta_i, \theta_j \mid y)/[\mathrm{Var}(\theta_i \mid y)\,\mathrm{Var}(\theta_j \mid y)]^{1/2}$

  Group     2       3       4       5       6
  1        0.14    0.12    0.05   −0.07   −0.16
  2                0.09    0.05   −0.01   −0.07
  3                        0.05   +0.00   −0.04
  4                                0.05    0.05
  5                                        0.18
7.2.5 Random Effect Prior versus Fixed Effect Prior

In the Bayes framework there can, properly speaking, be no "fixed" effects, since the parameters $\theta_j$ are regarded as having probability distributions in any case. The terminologies "fixed effect" and "random effect" are sampling theory concepts which we have retained to make certain analogies clear. The distinctions arise from the different prior distributions appropriate in different circumstances. Specifically, corresponding to the "fixed effect" model, the appropriate noninformative prior is one where $p(\boldsymbol\theta)$ is taken to be locally uniform, so that
$$p(\boldsymbol\theta, \sigma_1^2) = p(\boldsymbol\theta)\,p(\sigma_1^2), \qquad \text{with} \quad p(\boldsymbol\theta) \propto c \quad \text{and} \quad p(\sigma_1^2) \propto \sigma_1^{-2}. \qquad (7.2.35)$$
This will be called the "fixed effect" prior.
On the other hand, corresponding to the "random effect" model, the appropriate prior is
$$p(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta) = p(\theta, \sigma_1^2, \sigma_2^2)\,p(\boldsymbol\theta \mid \sigma_2^2, \theta), \qquad (7.2.36)$$
where
$$p(\theta, \sigma_1^2, \sigma_2^2) \propto \sigma_1^{-2}\sigma_{12}^{-2}$$
is our noninformative prior for $(\theta, \sigma_1^2, \sigma_2^2)$ and $p(\boldsymbol\theta \mid \sigma_2^2, \theta)$ is supposed Normal as in (7.2.5). This we call the "random effect" prior. In either case, the posterior distribution of $\boldsymbol\theta$ is obtained by combining the appropriate prior with the same likelihood
$$l(\boldsymbol\theta, \sigma_1^2 \mid y) \propto p(y \mid \boldsymbol\theta, \sigma_1^2), \qquad (7.2.37)$$
where $p(y \mid \boldsymbol\theta, \sigma_1^2)$ is given in (7.2.4). Thus, the posterior distributions of $\boldsymbol\theta$ are different because the priors are. For purposes of comparison, both can be thought of in terms of a general model in which the prior for $(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta)$ is written in the form
$$p(\boldsymbol\theta, \sigma_1^2, \sigma_2^2, \theta) = p(\theta)\,p(\sigma_1^2)\,p(\boldsymbol\theta, \sigma_2^2 \mid \theta, \sigma_1^2), \qquad (7.2.38)$$
where
$$p(\theta) \propto c, \qquad p(\sigma_1^2) \propto \sigma_1^{-2}, \qquad (7.2.39)$$
and
$$p(\boldsymbol\theta, \sigma_2^2 \mid \theta, \sigma_1^2) = p(\sigma_2^2 \mid \sigma_1^2)\,p(\boldsymbol\theta \mid \sigma_2^2, \theta) \propto p(\sigma_2^2 \mid \sigma_1^2)\,\sigma_2^{-J}\exp\left[-\frac{1}{2\sigma_2^2}\sum_j(\theta_j - \theta)^2\right]. \qquad (7.2.40)$$
In both cases, the $\theta_j$ may be supposed to be a random sample from a Normal distribution $N(\theta, \sigma_2^2)$. The crucial question concerns the choice of $p(\sigma_2^2 \mid \sigma_1^2)$. The assumption of a locally uniform prior for the $\theta_j$ in the fixed effect formulation amounts to postulating that they are distributed about some unknown $\theta$ with a large variance $\sigma_{20}^2$. We can accommodate this assumption within the general framework by making $p(\sigma_2^2 \mid \sigma_1^2)$ in (7.2.40) a delta function at $\sigma_2^2 = \sigma_{20}^2$. On the other hand, by employing the noninformative prior $p(\sigma_2^2 \mid \sigma_1^2) \propto \sigma_{12}^{-2}$, we are led to the random effect analysis where, by allowing $\sigma_2^2$ to become a random variable, we let the data $y$ determine what can be said about the spread of the $\theta_j$.
Uniform Prior and Multiparameter Problems of High Dimension

We have seen that in the fixed effect formulation, the noninformative prior for $\boldsymbol\theta$ is supposedly locally uniform. It is natural to ask what prior distribution for the $\theta_j$ is implied by the random effect prior of (7.2.36). For simplicity, suppose $\sigma_1^2$ is known. We have
$$p(\boldsymbol\theta \mid \theta) = \int_0^\infty p(\boldsymbol\theta \mid \sigma_2^2, \theta)\,p(\sigma_2^2 \mid \sigma_1^2)\,d\sigma_2^2, \qquad (7.2.41)$$
where, from (7.2.36) and (7.2.40),
$$p(\boldsymbol\theta \mid \sigma_2^2, \theta) \propto (\sigma_2^2)^{-\frac{1}{2}J}\exp\left[-\frac{\sum_j(\theta_j - \theta)^2}{2\sigma_2^2}\right] \qquad (7.2.42)$$
and the weight function $p(\sigma_2^2 \mid \sigma_1^2)$ is that given in (7.2.43). Thus, for large $J$, the integral takes the approximate form (7.2.44) and, since the second factor dominates the first, we obtain the approximate prior (7.2.45). This distribution is very close to the prior for $\boldsymbol\theta$ suggested by Stein (1962) and by Anscombe (1963). Specifically, they proposed that an appropriate prior for a high-dimensional $\boldsymbol\theta$ is the multivariate $t$ distribution (7.2.46), where $\nu$ is an arbitrarily small positive constant. The prior of $\boldsymbol\theta$ in (7.2.45) has its probability mass spread over a very wide region in the space of $\boldsymbol\theta$; in particular, it can be shown that the variance of any linear function of $\boldsymbol\theta$ is infinite. To see the distinction between (7.2.45) and the locally uniform prior $p(\boldsymbol\theta) \propto c$, consider the quantity
$$\bar\sigma_2^2 = J^{-1}\sum_j(\theta_j - \theta)^2, \qquad (7.2.47)$$
which is, in a sense, an "estimate" of $\sigma_2^2$. In particular, consider what prior distribution for $\bar\sigma_2^2$ is implied by the random effect prior for $\boldsymbol\theta$ in (7.2.45). First, conditional on $\sigma_2^2$,
we obtain from (7.2.42) that
$$p(\bar\sigma_2^2 \mid \sigma_2^2, \theta) \propto \sigma_2^{-J}\,(\bar\sigma_2^2)^{\frac{1}{2}J-1}\exp\left(-\frac{J\bar\sigma_2^2}{2\sigma_2^2}\right). \qquad (7.2.48)$$
By integrating over the distribution of $\sigma_2^2$ in (7.2.43), we then find that, for large $J$, the resulting prior (7.2.49) is a decreasing function of $\bar\sigma_2^2$ and, as might be expected, is of the same form as the noninformative prior for $\sigma_{12}^2$. By contrast, consider what the prior of $\bar\sigma_2^2$ would be if the $\theta_j$ were strictly uniformly distributed over some region $R$ in the parameter space of $\boldsymbol\theta$. Now $\bar\sigma_2^2 = \lambda$ defines a hyperspherical surface with radius proportional to $\lambda^{1/2}$, centered at $\theta\mathbf 1$, where $\mathbf 1$ is a $J \times 1$ vector of ones. Suppose this surface lies within $R$. Then the probability that $\bar\sigma_2^2$ lies between $\lambda$ and $\lambda + \delta$, where $\delta$ is a small positive constant, is proportional to $\lambda^{\frac{1}{2}J-1}$. Thus, a strictly uniform prior for $\boldsymbol\theta$ implies that
$$p(\bar\sigma_2^2 \mid \theta) \propto (\bar\sigma_2^2)^{\frac{1}{2}J-1}, \qquad (7.2.50)$$
which, for $J > 2$, is an increasing function of $\bar\sigma_2^2$ and asserts that $\bar\sigma_2^2$ is large with high probability. This is not surprising if we remember that a flat prior for $\boldsymbol\theta$ can be produced by allowing $\sigma_2^2$ to become large in the prior distribution (7.2.42). Thus, we obtain from (7.2.48) the same form as in (7.2.50), so that, for the "fixed effect" prior, $\bar\sigma_2^2$ is large because $\sigma_2^2$ is tacitly assumed to be large. One is thus led to contrast the two results:
$$\text{random effect (approximate result):}\qquad p(\bar\sigma_2^2 \mid \theta) \propto (\sigma_1^2 + K\bar\sigma_2^2)^{-1};$$
$$\text{strictly uniform prior (or prior with } \sigma_2^2 \to \infty\text{):}\qquad p(\bar\sigma_2^2 \mid \theta) \propto (\bar\sigma_2^2)^{\frac{1}{2}J-1};$$
and we see that these two expressions become more and more discordant as the dimension $J$ increases. As was pointed out by Anscombe, these results should lead one to approach with some caution the choice of noninformative prior distributions when the number of parameters is large. One might wonder how we can justify two different priors for $\boldsymbol\theta$, both of which are supposed to be noninformative. The starting point for deriving a noninformative prior in Section 1.3 was that we wished to express the state of knowing little about the parameters in a given model relative to what the data would have to tell us. This implies, as we have seen already, that different priors will express this state when we contemplate different models. For this example, the "fixed effect" prior expresses directly the idea that we know very little about any one of the parameters $\theta_j$. By contrast, the "random effect" model says that the $\theta_j$ are random drawings from a Normal distribution $N(\theta, \sigma_2^2)$, and the prior expresses the fact that little is known about the mean and variance of that distribution. The implication here is that the means do cluster together, and this correctly implies a different prior for the $\theta_j$.
e
Alternative Assumptions About the Prior

To discuss the choice of prior further, it is useful to reiterate the general role of assumptions in statistical methods. In practice, we know that all assumptions are false. For example, there never was an exactly straight line nor an exact Normal distribution. Logically, then, in selecting assumptions we should not ask (a) "Is this assumption true?" nor only (b) "Does this assumption approximate the truth concerning the experimental setup?" Rather we have to ask (c) "Does the use of this assumption lead to an approximately correct result?" This is so because the answer to (a) must always be "No," and so the question is irrelevant, while the answer to (b) can be "Yes" when the answer to (c) is "No," and vice versa. The nature of the experimental setup should provide some guide, though not necessarily a decisive one, to what ought to be assumed. Therefore, in considering the choice of a prior distribution for a group of J means, we shall consider both questions (b) and (c) above.
Priors for Different Experimental Set-ups

For the one-way classification model being considered, in rare instances it might be realistic to analyse the data against a noninformative reference prior which supposed that the treatment means θⱼ were, approximately, independently and uniformly distributed over a wide range. Strictly speaking, this prior is only truly representative of a situation where the experimenter

i) knows little about any of the treatments,
ii) has no reason to believe any of the treatments will produce similar results.

In practice, however, (ii) will rarely represent his state of mind. Usually some of the treatments would be modifications of others and could be expected a priori to behave similarly. In particular, (ii) is clearly not appropriate in the example we have considered, where the "treatments" are "batches." There, the alternative supposition is much closer to reality, whereby the θⱼ are independent drawings from a Normal distribution N(θ̄, σ₂²), with θ̄ and σ₂² only vaguely known and hence approximately represented by noninformative priors. Obviously the two alternative possibilities considered in this chapter are by no means exhaustive. When a large number of treatments is under consideration, the experimenter may, for example, be able to divide them into subgroups within which he expects similarities. Alternatively, even if the "treatments" were different batches, he might not believe that the θⱼ were independent but rather that they formed a time series represented perhaps by an autoregressive process.† In this case the prior would be defined in terms of the parameters of this process. The Bayes approach has the advantage that any possibility of interest can be modelled, and the appropriately chosen prior will invite the data to comment on the relevant uncertain aspects (for instance on the variance of the population of the θⱼ in the random effect model, or on the value of the autoregressive parameter if the θⱼ are represented as a time series).
7.2.6 Effect of Different Prior Assumptions on the Posterior Distribution of θ

We have said that, with data occurring in the form of a one-way classification, there are many prior assumptions that in different circumstances could make sense. The two priors for θ examined here are the "fixed effect" uniform prior and the "random effect" prior. As illustrated by the dyestuff example, the random effect prior results in a greater clustering of the posterior means about the grand average ȳ.., as well as an overall increase in the precision of the posterior distributions of the θⱼ.
A Simpler Situation

A better intuitive understanding of how this happens is obtained by considering the simpler situation in which σ₁², σ₂², and θ̄ are all supposed known.

† See, for example, Tiao and Ali (1971b).
Then, for the random effect model, we would have two sources of information about a particular batch mean θⱼ. First, we would know from (7.2.5) that θⱼ was drawn from a Normal distribution with mean θ̄ and variance σ₂². Second, we would know from (7.2.4) that the sample mean ȳⱼ. was distributed Normally about θⱼ with variance σ₁²/K. Combining these facts, we see from (7.2.12) that a posteriori the θⱼ would be Normally distributed with mean

$$\bar\theta_j = E(\theta_j \mid \sigma_1^2, \sigma_{12}^2, \bar\theta, \mathbf{y}_j) = (1 - z)\bar{y}_{j.} + z\bar\theta, \qquad (7.2.51)$$

where z = σ₁²/σ₁₂² and σ₁₂² = σ₁² + Kσ₂². Thus, the posterior mean would be a linear interpolation between the sample mean ȳⱼ. and the prior mean θ̄. Also, the variance would be

$$\mathrm{Var}(\theta_j \mid \sigma_1^2, \sigma_{12}^2, \bar\theta, \mathbf{y}_j) = \frac{\sigma_1^2}{K}(1 - z). \qquad (7.2.52)$$
Further, the θⱼ would be distributed independently of one another. Note that z⁻¹ = σ₁₂²/σ₁² = 1 + K(σ₂²/σ₁²) measures the relative variation of the group means θⱼ compared with the within group variation of the data. On the other hand, for the fixed effect model with σ₁² assumed known, we would have only the second source of information, from (7.2.4), so that a posteriori the means θⱼ would be

$$\bar\theta_j = \bar{y}_{j.} \qquad \text{and} \qquad \mathrm{Var}(\theta_j \mid \sigma_1^2, \mathbf{y}_j) = \sigma_1^2/K, \qquad j = 1, \ldots, J. \qquad (7.2.53)$$
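To make the interpolation concrete, the following is a minimal numerical sketch of (7.2.51)-(7.2.53); all values (σ₁², σ₂², θ̄, K, and the sample means) are hypothetical, and the code simply evaluates the formulas above.

```python
# A minimal sketch of (7.2.51)-(7.2.53); all numbers are hypothetical.
sigma1_sq = 1.0        # within-group variance sigma_1^2 (assumed known)
sigma2_sq = 2.25       # between-group variance sigma_2^2 (assumed known)
theta_bar = 5.0        # prior mean of the theta_j (assumed known)
K = 5                  # observations per group
ybar = [3.2, 5.9, 4.4] # sample means y_j.

sigma12_sq = sigma1_sq + K * sigma2_sq   # sigma_12^2
z = sigma1_sq / sigma12_sq               # weight z in (7.2.51)

for y in ybar:
    post_mean = (1 - z) * y + z * theta_bar       # (7.2.51)
    post_var = (sigma1_sq / K) * (1 - z)          # (7.2.52)
    fixed_mean, fixed_var = y, sigma1_sq / K      # (7.2.53)
    print(post_mean, post_var, fixed_mean, fixed_var)
```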
Thus, the effect of the information from the random effect prior is to "pull" the posterior mean of θⱼ towards the prior mean θ̄ and to decrease the variance by the factor (1 − z) = 1 − σ₁²/σ₁₂². When σ₁², σ₂², and θ̄ are not known, the posterior mean of θⱼ for the random effect model becomes that in (7.2.29), that is,
$$E(\theta_j \mid \mathbf{y}) = \hat\theta_j = (1 - \bar{z})\bar{y}_{j.} + \bar{z}\,\bar{y}_{..}, \qquad \bar{z} = E(z \mid \mathbf{y}),$$

which is an interpolation between the sample mean ȳⱼ. and the grand average ȳ... Compared with (7.2.51), we are thus replacing θ̄ by ȳ.. and z by its posterior expectation.

Situation When J is Large
Now, if the number of groups J is large, so that ν₁ and ν₂ are both large, then in the integrand of (7.2.28) the conditional multivariate t distribution p(θ | z, y) given in (7.2.20) approaches a J-dimensional multivariate Normal distribution. Also, the distribution p(z | y) in (7.2.26) will become sharply concentrated about
the reciprocal of the mean square ratio,

$$\hat{z} = \frac{1}{F} = \frac{m_1}{m_2}, \qquad (7.2.54)$$

provided F > 1. In this case, using the approximations E(z | y) ≐ 1/F and Var(z | y) ≐ 0 in (7.2.29) through (7.2.31), and noting that the integration process in (7.2.28) is then essentially equivalent to replacing the unknown z in p(θ | z, y) by ẑ, the θⱼ are distributed approximately Normally and independently with

$$E(\theta_j \mid \mathbf{y}) \doteq \Big(1 - \frac{1}{F}\Big)\bar{y}_{j.} + \frac{1}{F}\bar{y}_{..} \qquad \text{and} \qquad \mathrm{Var}(\theta_j \mid \mathbf{y}) \doteq \frac{m_1}{K}\Big(1 - \frac{1}{F}\Big). \qquad (7.2.55)$$
By contrast, for the fixed effect prior, it is clear from (7.2.32) that, for large J, the θⱼ are approximately Normally and independently distributed such that

$$E(\theta_j \mid \mathbf{y}) \doteq \bar{y}_{j.} \qquad \text{and} \qquad \mathrm{Var}(\theta_j \mid \mathbf{y}) \doteq \frac{m_1}{K}. \qquad (7.2.56)$$

Thus, by using the random effect prior (7.2.36):

a) the posterior means cluster more closely about ȳ..;
b) the variance of the distribution of θⱼ tends to be reduced by a factor which, for large J, is approximately 1 − 1/F.

It follows that when F is large, so that 1/F is small, the clustering and increase in precision are negligible, and the random effect prior yields what is essentially the fixed effect solution. On the other hand, when F is not large but takes a value such as 2 or 3, considerable modifications in the posterior means and variances occur, leading to modifications in the inferences about θ. The dyestuff data of Section 7.2.4 represent an intermediate situation. As we have seen, for these data, inferences about individual elements of θ′ = (θ₁, ..., θ₆) are not very different for the random and the fixed effect formulations. More generally, if F = m₂/m₁ is fairly large (say F > 10), the random effect analysis will not differ very markedly from the fixed effect analysis. Now F = m₂/m₁ is a sample measure of
$$\frac{\sigma_{12}^2}{\sigma_1^2} = 1 + K\,\frac{\sigma_2^2}{\sigma_1^2},$$

indicating how large σ₂² is compared with σ₁²/K. In the random effect analysis, then, we are, as it were, using the data (as reflected by the spread of the sample means ȳⱼ. in relation to the within group variance m₁/K) to comment on the spread of the means, and hence to tell us to what extent the locally uniform prior assumption on the θⱼ is justified.
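A short sketch of the large-J approximation just described, using hypothetical mean squares m₁, m₂ and sample means; it applies the single shrinkage factor 1 − 1/F of (7.2.54)-(7.2.56).

```python
import numpy as np

# Hypothetical one-way summaries: m1 = within-group mean square,
# m2 = between-group mean square, K = group size.
m1, m2, K = 4.0, 12.0, 5
ybar = np.array([9.2, 11.5, 8.7, 10.9, 12.3, 9.8])
grand = ybar.mean()

F = m2 / m1                    # mean square ratio
z_hat = 1.0 / F                # (7.2.54), valid when F > 1
post_mean = (1 - z_hat) * ybar + z_hat * grand   # (7.2.55)
post_var = (m1 / K) * (1 - z_hat)                # (7.2.55)
fixed_var = m1 / K                               # (7.2.56)
print(post_mean, post_var, fixed_var)
```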
Size of H.P.D. Regions

From (7.2.55) and (7.2.56), we see that for any individual mean θⱼ, when J is large,

$$\frac{\text{length of } (1-\alpha)\text{ H.P.D. interval for random effect model}}{\text{length of } (1-\alpha)\text{ H.P.D. interval for fixed effect model}} \doteq \Big(1 - \frac{1}{F}\Big)^{1/2}. \qquad (7.2.57)$$

On the other hand, for the θⱼ considered jointly,

$$\frac{\text{volume of } (1-\alpha)\text{ H.P.D. region for random effect model}}{\text{volume of } (1-\alpha)\text{ H.P.D. region for fixed effect model}} \doteq \Big(1 - \frac{1}{F}\Big)^{J/2}. \qquad (7.2.58)$$

For large J, (1 − 1/F)^{J/2} could be small even when F was large and (1 − 1/F)^{1/2} was close to unity, so that it might seem at first that, for large J, a great increase in precision is to be expected from the use of internal evidence about σ₂². To set this result in proper perspective, however, it should be remembered that the situation is exactly as if the standard deviation σ₁ of the errors eⱼₖ in (7.2.1) had been decreased by a factor (1 − 1/F)^{1/2}. This would also result in the volume of the H.P.D. region being reduced by the factor (1 − 1/F)^{J/2}.
Overall Analysis of Contrasts

In the sampling theory analysis of the one-way classification, a useful preliminary to more detailed study is the overall test of the hypothesis of equality of the group means θ′ = (θ₁, ..., θ_J). This is accomplished by referring the ratio of mean squares m₂/m₁ to a table of the F distribution. In Section 2.11 we discussed the Bayesian analogue of this procedure. Given φ′ = (φ₁, ..., φ_{J−1}), a set of J − 1 linearly independent contrasts among the J parameters θ, it was shown that the point φ = 0 was not included in the (1 − α) H.P.D. region for φ if

$$\frac{m_2}{m_1} > F(J-1, \nu_1, \alpha) \qquad (7.2.59)$$

or, for ν₁ large, if

$$\frac{m_2}{m_1} > \frac{\chi^2(J-1, \alpha)}{J-1},$$

where, as before, F(J−1, ν₁, α) is the upper α percentage point of an F distribution with [(J−1), ν₁] degrees of freedom and χ²(J−1, α) is that of a χ² distribution with (J−1) degrees of freedom. This analysis was based on the assumption of a fixed effect prior. Consider now the corresponding result for the random effect prior of (7.2.36), and for the moment let us take φ₁ = θ₁ − θ̄, ..., φ_{J−1} = θ_{J−1} − θ̄. Using the result (7.2.25), conditional on z, we see that the quantity
(7.2.60)

where φ̂ⱼ = (1 − z)(ȳⱼ. − ȳ..), is distributed as F with [(J−1), (ν₁ + ν₂)] degrees of freedom. Thus, when J is large, ẑ converges in probability to 1/F = m₁/m₂, provided F > 1, so that

(7.2.61)

and, approximately,

$$\frac{K\sum_{j=1}^{J}[\phi_j - (1 - 1/F)(\bar{y}_{j.} - \bar{y}_{..})]^2}{(J-1)\,m_1(1 - 1/F)} \qquad (7.2.62)$$

is distributed as F with [(J−1), (ν₁ + ν₂)] degrees of freedom. Thus, for the point φ = 0 not to be included in the (1 − α) H.P.D. region, we require approximately that
(7.2.63)

or, to a further approximation,

$$\frac{m_2}{m_1} - 1 > \frac{\chi^2(J-1, \alpha)}{J-1}.$$
Thus, for large J, the effect on inferences about the value φ = 0 of employing the random effect prior is approximately represented by subtracting one from the mean square ratio before referring to the appropriate tables. The situation is illustrated in Fig. 7.2.3. An H.P.D. region for two orthogonal contrasts† φ* = (φ₁*, φ₂*)′ is shown for the fixed effect model, centered about φ̂*. The region has radius r. For the random effect model, the corresponding H.P.D. region is centered about (1 − 1/F)φ̂*, while the radius of the region is reduced by the factor (1 − 1/F)^{1/2}. In the situation illustrated, the point φ = 0 is just outside the fixed effect H.P.D. region but is inside the corresponding random effect region.
† The set of linearly independent contrasts φ₁ = θ₁ − θ̄, ..., φ_{J−1} = θ_{J−1} − θ̄ can always be transformed into a set of (J − 1) orthogonal linear contrasts having spherical H.P.D. regions.
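The fixed and random effect criteria can be compared directly in code; the sketch below (hypothetical m₁, m₂, J, K) uses scipy's F and χ² quantile functions to implement (7.2.59) and the "subtract one" rule of (7.2.63).

```python
from scipy import stats

m1, m2, J, K = 4.0, 12.0, 10, 5   # hypothetical mean squares and layout
nu1 = J * (K - 1)                 # within-group degrees of freedom
alpha = 0.05

# Fixed effect criterion (7.2.59): exclude phi = 0 from the H.P.D. region if
fixed_reject = m2 / m1 > stats.f.ppf(1 - alpha, J - 1, nu1)
# Random effect criterion, large-J approximation (7.2.63):
random_reject = (m2 / m1 - 1) > stats.chi2.ppf(1 - alpha, J - 1) / (J - 1)
print(fixed_reject, random_reject)
```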
Fig. 7.2.3 H.P.D. regions for orthogonal contrasts φ₁* and φ₂* with "fixed-effect" and "random-effect" priors.
7.2.7 Relationship to Sampling Theory Results

We have seen earlier, in Section 2.11, that for the fixed-effect model the standard sampling theory results parallel those obtained from a Bayesian analysis with a noninformative prior which is locally uniform in θ. By contrast, in this chapter a random effect Bayesian analysis of the means θⱼ has been explored and has led to intuitively sensible results. It is interesting, therefore, that in this case also, parallel sampling theory results exist. The "parallel" sampling theory comes out of the work of James and Stein (1961), who considered the problem of estimating a set of means θ₁, ..., θ_J of the one-way model (7.2.1) so as to minimize mean squared error. In general, let θ̃′ = (θ̃₁, ..., θ̃_J) be a set of point estimators for θ′ = (θ₁, ..., θ_J), and let Σⱼ(θ̃ⱼ − θⱼ)² be the appropriate squared error loss function. Then, as discussed in Appendix A5.6, the mean squared error of θ̃ is the expectation
$$\mathrm{M.S.E.}(\tilde{\boldsymbol\theta}) = \frac{1}{J}E\sum_{j=1}^{J}(\tilde\theta_j - \theta_j)^2, \qquad (7.2.64)$$

where E is taken with respect to the distribution of the data y. For simplicity, let us suppose that σ₁² is known. Then the analysis proceeds as follows. If consideration is limited to estimators θ̃ which are linear functions of the data, then it is well known that the minimum mean squared error point estimators are the sample means ȳ′. = (ȳ₁., ..., ȳ_J.). Specifically,

$$\mathrm{M.S.E.}(\bar{\mathbf{y}}_.) = \frac{1}{J}E\sum_{j=1}^{J}(\bar{y}_{j.} - \theta_j)^2 = \frac{\sigma_1^2}{K}. \qquad (7.2.65)$$
However, these authors relaxed the unnecessarily restrictive assumption of linearity and considered instead, for J > 2, estimators θ̃ⱼ which are non-linear functions of the data, that is,

$$\tilde\theta_j = \Big(1 - \frac{1}{F'}\Big)\bar{y}_{j.} + \frac{1}{F'}\theta_0, \qquad j = 1, \ldots, J, \qquad (7.2.66)$$

where

$$F' = \frac{K\sum_j(\bar{y}_{j.} - \theta_0)^2}{(J-2)\sigma_1^2}$$

and θ₀ is any arbitrary constant. They then showed that

$$\mathrm{M.S.E.}(\tilde{\boldsymbol\theta}) = \frac{1}{J}E\sum_{j=1}^{J}(\tilde\theta_j - \theta_j)^2 = \frac{\sigma_1^2}{K}\Big[1 - E\Big(\frac{1}{F''}\Big)\Big], \qquad (7.2.67)$$

where 1/F″ is a positive quantity such that

$$\lim_{\sigma_2^2 \to \infty} E\Big(\frac{1}{F''}\Big) = 0.$$
It was thus demonstrated that the non-linear estimators (7.2.66) have the remarkable sampling property that, whatever the true value of σ₂², M.S.E.(θ̃) can never exceed M.S.E.(ȳ.). Note that σ₂² is a measure of the spread of the means θⱼ. Thus, the M.S.E. of θ̃ will approach the M.S.E. of ȳ. when the means are spread out. For comparison, we see that, with the simplifying assumption that σ₁² and θ̄ are known, the posterior distribution of θ is obtained by integrating the conditional Normal distribution for θ in (7.2.12) over the distribution of σ₂² in (7.2.14). Now for large J, the distribution of σ₂² is sharply concentrated about its mode σ̂₂², where

$$\sigma_1^2 + K\hat\sigma_2^2 = \frac{K\sum_j(\bar{y}_{j.} - \bar\theta)^2}{J + 2}, \qquad (7.2.68)$$

provided σ̂₂² > 0. It follows that, for large J, the θⱼ are approximately Normally distributed, independently of one another, with posterior means θ̂ⱼ such that

$$\hat\theta_j - \bar\theta = \Big[1 - \frac{(J+2)\sigma_1^2}{K\sum_j(\bar{y}_{j.} - \bar\theta)^2}\Big](\bar{y}_{j.} - \bar\theta),$$

that is,

$$\hat\theta_j = \Big(1 - \frac{1}{F^*}\Big)\bar{y}_{j.} + \frac{1}{F^*}\bar\theta, \qquad \text{where} \quad F^* = \frac{K\sum_j(\bar{y}_{j.} - \bar\theta)^2}{(J+2)\sigma_1^2}, \qquad (7.2.69)$$
and with common variance

$$\frac{\sigma_1^2}{K}\Big(1 - \frac{1}{F^*}\Big). \qquad (7.2.70)$$
Thus, the estimators in (7.2.66), which have remarkable sampling properties, are of precisely the same form as the posterior means in (7.2.69) resulting from the Bayesian random effect analysis; the only difference is that J − 2 replaces J + 2 (that is, F′ = [(J+2)/(J−2)]F*). Indeed, Stein (1962) proposed that the appropriate non-linear functions to be used as estimators should be obtained from the posterior distribution with "suitably chosen" priors. For reasons explained in Appendix A5.6, we would not employ this approach but would feel that, having come so far towards the Bayesian position, one more step might as well be taken and the posterior distribution itself, rather than some arbitrary feature of it, be made the basis for inference.
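A simulation sketch of the James-Stein estimator (7.2.66), taking θ₀ = 0 and σ₁² known, with hypothetical parameter settings; it illustrates the dominance property that M.S.E.(θ̃) does not exceed M.S.E.(ȳ.).

```python
import numpy as np

rng = np.random.default_rng(0)
J, K, sigma1 = 20, 5, 1.0
theta = rng.normal(0.0, 1.5, size=J)   # true group means (hypothetical)
theta0 = 0.0                           # arbitrary constant in (7.2.66)

mse_y, mse_js, n_rep = 0.0, 0.0, 2000
for _ in range(n_rep):
    ybar = theta + rng.normal(0.0, sigma1 / np.sqrt(K), size=J)
    F_prime = K * np.sum((ybar - theta0) ** 2) / ((J - 2) * sigma1 ** 2)
    js = (1 - 1 / F_prime) * ybar + (1 / F_prime) * theta0   # (7.2.66)
    mse_y += np.mean((ybar - theta) ** 2)
    mse_js += np.mean((js - theta) ** 2)
print(mse_y / n_rep, mse_js / n_rep)   # mse_js should not exceed mse_y
```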
7.2.8 Conclusions

1) Given an appropriate metric, a state of relative ignorance about a parameter can frequently be represented, to an adequate approximation, by supposing that, over the region where the likelihood is appreciable, the prior is constant. However, careful thought must be given to selecting the appropriate form of the model, and to deciding which of its characteristics it is desired to express ignorance about in this way. For example, suppose there are a number of parameters θⱼ which (i) are expected to be random drawings from a distribution, or (ii) are expected to be successive values of a time series. Then it would be the parameters of (i) the distribution or (ii) the time series about which we would wish to express ignorance. It would not be appropriate in these cases to express ignorance about the θⱼ directly, for this would not take account of the available information (i) that the θⱼ cluster in a distribution or (ii) that the θⱼ are members of a time series.

2) In this chapter we have discussed in some detail a situation of the first kind, where the assumption of a uniform prior distribution applied directly to the θⱼ might not be appropriate. Instead it was supposed a priori that the θⱼ were randomly drawn from a Normal distribution with unknown mean θ̄ and unknown variance σ₂². By employing noninformative priors for these parameters, the data themselves are induced to provide evidence about the location and dispersion of the θⱼ. These ideas are of more general application; they are akin to those employed by Mosteller and Wallace (1964) and provide a link with the empirical Bayes procedures of Robbins (1955, 1964).

3) Analysis in terms of a random effect prior is, of course, not necessarily appropriate (although, since clustering of means is so often to be expected, it is an important point of departure). Applications have been mentioned in which the means, arranged in appropriate sequence, might be expected to form a time series. In another situation, two or more distinct clusters might be expected. By suitably
choosing the structure of the model and introducing suitable parameters associated with noninformative priors, the data are invited to comment on relevant aspects of the problem.

4) An important practical question is "How much does a correct choice of prior matter?" or, equivalently, "How sensitive (or robust) is the posterior distribution of the θⱼ to changes in the prior distribution?" The Bayes analysis makes it clear that the problem is one of combining, in a relevant way, information coming from the data in terms of comparisons between groups with information from within groups. Now suppose there are J groups of K observations. The different priors reflect different relations between the J groups. Thus, if J is small compared with K, one can expect that inferences will be robust with respect to the choice of prior. However, when J is large compared with K, between-group information will become of relative importance, and the precise choice of prior will be critical.
5) In general, Bayes' theorem says that

$$p(\boldsymbol\theta \mid \mathbf{y}) \propto p(\boldsymbol\theta)\,l(\boldsymbol\theta \mid \mathbf{y}),$$

where y represents a set of observations and θ a set of unknown parameters. Thus, given the data y, inferences about θ depend upon what we assume about the prior distribution p(θ), as well as what is assumed about the data generating model leading to the likelihood function l(θ | y). In Chapters 3 and 4, we studied the robustness of inferences about θ by varying l(θ | y) to determine to what extent such commonly made assumptions as Normality can be justified. We can equally well, as has been done here, study the effect on the posterior distribution of varying the prior distribution p(θ). Examples of the latter type of inference robustness study have in fact been given earlier, in Section 1.2.3 and in Section 2.4.5.

7.3 INFERENCES ABOUT MEANS FOR MODELS WITH TWO RANDOM COMPONENTS
In Section 6.3 we considered the analysis of cross classification designs, and in particular the analysis of a two-way layout using a model in which the row means were regarded as fixed effects and the column means as random effects. It is often appropriate to use such a two-way model in the analysis of randomized block designs, where the blocks are associated with the random column effects and the treatments with the fixed row effects. Such a model contains two random components: one associated with the "within block error" and one with the "between block error." In Section 6.3, the error variance component and the between columns variance component were denoted by σₑ² and σ_c², respectively. In what follows, where we associate the random column component with blocks, we shall use σₑ² and σ_b², respectively, for the error variance component and the between blocks variance component. In practice, the blocks correspond to portions of experimental material which are expected to be more homogeneous than the whole aggregate of material.
Thus, in an animal experiment the blocks might be groups of animals from the same litter. The randomized block design ensures that comparisons between the treatments can be made within blocks (for example, between litter mates within a litter) and so are not subject to the larger block to block errors. When the available block size is smaller than the number of treatments, incomplete block designs may be employed; see, for example, Cochran and Cox (1950) and Kempthorne (1952). Thus, six pairs of identical twins might be used to compare four different methods of reading instruction in the balanced arrangement shown in Table 7.3.1.
Table 7.3.1 A balanced incomplete block design

                                      BLOCKS (twin pair)
                                      1   2   3   4   5   6
TREATMENTS                       A    x   x   x
(method of reading instruction)  B    x           x   x
                                 C        x       x       x
                                 D            x       x   x

The design is such that treatments A and B are randomly assigned to the first pair of twins, A and C to the second, and so on. In general, in a balanced incomplete block design (BIBD), I treatments are examined in J blocks of equal size K. Each treatment appears r times and occurs λ times in a block with every other treatment. In the reading instruction example, I = 4, J = 6, K = 2, r = 3, and λ = 1. Designs of this kind supply two different kinds of information about the treatment contrasts. The first uses within (intra) block comparisons and the second uses between (inter) block comparisons. Thus, one within-block comparison is the difference in the scores of the first pair of twins, which yields information about the difference between treatments A and B. The difference in scores of the second pair of twins similarly yields information about the difference between A and C, and so on. All comparisons of this sort are affected only by the within block variance σₑ². Analysis of incomplete block designs has sometimes been conducted as if these within block comparisons were the only source of information about the treatment contrasts. However, as was pointed out by Yates (1940), a second source of information about the treatment contrasts is supplied by comparisons between the block averages. Thus, the average score for the first pair of twins supplies an estimate of the average of the effects of treatments A and B; the average score for the second block supplies an estimate of the average of the effects of treatments A and C, and so on. Comparison of these block average scores, which have an error variance σ_b² + σₑ²/K, thus supplies further information about the treatment contrasts.
On sampling theory, estimation of the treatment contrasts using either (i) within block information only or (ii) between block information only is readily achieved by a standard application of least squares. It is, however, far from clear how these estimates ought to be combined and the resulting inferences drawn.

7.3.1 A General Linear Model with Two Random Components
There is no particular reason to limit the initial discussion to a particular kind of design, and we now consider a general linear model appropriate to any design having J blocks with fixed block size K. Specifically, we consider the linear model

$$\mathbf{y}_j = \mathbf{A}_j\boldsymbol\theta + b_j\mathbf{1} + \mathbf{e}_j, \qquad j = 1, \ldots, J. \qquad (7.3.1)$$

In this model, yⱼ = (y₁ⱼ, ..., y_Kⱼ)′ is a K × 1 vector (or block) of observations, and Aⱼ is a K × I matrix of fixed elements. The quantity θ = (θ₁, ..., θ_I)′ is an I × 1 vector of regression coefficients, bⱼ is the jth block (random) effect having mean zero and variance σ_b², 1 is a K × 1 vector of ones, and eⱼ = (e₁ⱼ, ..., e_Kⱼ)′ is a K × 1 vector of random errors. In this model the vector Aⱼθ is a "fixed" component of the vector of observations yⱼ, and the vectors bⱼ1 and eⱼ are two distinct random components. For an incomplete block design, θ is the I × 1 vector of treatment means and Aⱼ consists of 1's and 0's, each row of which contains a single 1 indicating which treatments are included in the jth block. Thus, for the reading instruction design,

$$\mathbf{A}_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \quad \ldots, \quad \mathbf{A}_6 = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
7.3.2 The Likelihood Function

We assume that bⱼ and the elements of eⱼ, j = 1, ..., J, are all independent and that

$$b_j \sim N(0, \sigma_b^2), \qquad \mathbf{e}_j \sim N_K(\mathbf{0}, \mathbf{I}\sigma_e^2), \qquad (7.3.2)$$

where 0 is a null vector and I is an identity matrix, both of size K. To derive the likelihood function of the parameters (θ, σₑ², σ_b²), it is convenient to work with the block averages and the within block residuals

$$\bar{y}_{.j} = K^{-1}\sum_{k=1}^{K}y_{kj} \qquad \text{and} \qquad (\mathbf{I} - \mathbf{R})\mathbf{y}_j, \qquad \text{where} \quad \mathbf{R} = K^{-1}\mathbf{1}\mathbf{1}'. \qquad (7.3.3)$$

The model in (7.3.1) can thus be alternatively expressed as

$$\bar{y}_{.j} = K^{-1}\mathbf{1}'\mathbf{A}_j\boldsymbol\theta + b_j + \bar{e}_{.j} \qquad (7.3.4)$$
where

$$\bar{e}_{.j} = K^{-1}\mathbf{1}'\mathbf{e}_j = K^{-1}\sum_{k=1}^{K}e_{kj}.$$
It follows from the Normality and independence assumptions on bⱼ and eⱼ that:

a) the two sets of random variables {ȳ.ⱼ} and {(I − R)yⱼ} are independent of each other;
b) the ȳ.ⱼ are independently distributed as Normal N(K⁻¹1′Aⱼθ, σ_b² + K⁻¹σₑ²);
c) the (I − R)yⱼ are independently Normally distributed with mean vector (I − R)Aⱼθ and a singular covariance matrix σₑ²(I − R).

Thus, writing σ_be² = σₑ² + Kσ_b², the likelihood function can be expressed as the product

$$l(\boldsymbol\theta, \sigma_e^2, \sigma_b^2 \mid \mathbf{y}) = l_b(\boldsymbol\theta, \sigma_{be}^2 \mid \bar{y}_{.j},\, j = 1, \ldots, J)\; l_e(\boldsymbol\theta, \sigma_e^2 \mid (\mathbf{I}-\mathbf{R})\mathbf{y}_j,\, j = 1, \ldots, J), \qquad (7.3.5)$$
where the first factor is

$$l_b(\boldsymbol\theta, \sigma_{be}^2 \mid \bar{y}_{.j},\, j = 1, \ldots, J) \propto p(\bar{y}_{.j},\, j = 1, \ldots, J \mid \boldsymbol\theta, \sigma_{be}^2) \qquad (7.3.6)$$

with

$$l_b \propto (\sigma_{be}^2)^{-\frac{1}{2}J}\exp\Big[-\frac{S_b(\boldsymbol\theta)}{2\sigma_{be}^2}\Big] \qquad (7.3.7)$$

and

$$S_b(\boldsymbol\theta) = K\sum_{j=1}^{J}(\bar{y}_{.j} - K^{-1}\mathbf{1}'\mathbf{A}_j\boldsymbol\theta)^2. \qquad (7.3.8)$$
For the second factor in (7.3.5), we first observe that (I − R)1 = 0 and that the matrix (I − R) is idempotent of rank K − 1. Thus there exists a K × K orthogonal matrix

$$\mathbf{P} = [\mathbf{P}^* \;\vdots\; K^{-\frac{1}{2}}\mathbf{1}]$$

such that

$$\mathbf{P}'(\mathbf{I}-\mathbf{R})\mathbf{P} = \begin{bmatrix} \mathbf{I}_{K-1} & \mathbf{0} \\ \mathbf{0}' & 0 \end{bmatrix}. \qquad (7.3.9)$$

Let

$$\mathbf{x}_j = \mathbf{P}^{*\prime}(\mathbf{y}_j - \mathbf{A}_j\boldsymbol\theta). \qquad (7.3.10)$$
Then the (K−1) × 1 vector xⱼ is Normally distributed as N_{K−1}(0, I_{K−1}σₑ²). Noting that

$$\mathbf{x}_j'\mathbf{x}_j = (\mathbf{y}_j - \mathbf{A}_j\boldsymbol\theta)'(\mathbf{I}-\mathbf{R})(\mathbf{y}_j - \mathbf{A}_j\boldsymbol\theta), \qquad (7.3.11)$$

the second factor of the likelihood (7.3.5) is found to be

$$l_e(\boldsymbol\theta, \sigma_e^2 \mid (\mathbf{I}-\mathbf{R})\mathbf{y}_j,\, j = 1, \ldots, J) \propto p(\mathbf{x}_j,\, j = 1, \ldots, J \mid \sigma_e^2) \propto (\sigma_e^2)^{-\frac{1}{2}J(K-1)}\exp\Big[-\frac{S_e(\boldsymbol\theta)}{2\sigma_e^2}\Big], \qquad (7.3.12)$$

where

$$S_e(\boldsymbol\theta) = \sum_{j=1}^{J}(\mathbf{y}_j - \mathbf{A}_j\boldsymbol\theta)'(\mathbf{I}-\mathbf{R})(\mathbf{y}_j - \mathbf{A}_j\boldsymbol\theta). \qquad (7.3.13)$$
Now, we can write

$$S_b(\boldsymbol\theta) = S_b + Q_b(\boldsymbol\theta), \qquad (7.3.14)$$

where

$$Q_b(\boldsymbol\theta) = (\boldsymbol\theta - \tilde{\boldsymbol\theta})'\Big(\sum_{j=1}^{J}\mathbf{A}_j'\mathbf{R}\mathbf{A}_j\Big)(\boldsymbol\theta - \tilde{\boldsymbol\theta}), \qquad S_b = S_b(\tilde{\boldsymbol\theta}),$$

and θ̃ satisfies the normal equations

$$\Big(\sum_{j=1}^{J}\mathbf{A}_j'\mathbf{R}\mathbf{A}_j\Big)\tilde{\boldsymbol\theta} = \sum_{j=1}^{J}\mathbf{A}_j'\mathbf{1}\bar{y}_{.j}. \qquad (7.3.15)$$
Similarly, we can write

$$S_e(\boldsymbol\theta) = S_e + Q_e(\boldsymbol\theta), \qquad (7.3.16)$$

where

$$Q_e(\boldsymbol\theta) = (\boldsymbol\theta - \hat{\boldsymbol\theta})'\Big[\sum_{j=1}^{J}\mathbf{A}_j'(\mathbf{I}-\mathbf{R})\mathbf{A}_j\Big](\boldsymbol\theta - \hat{\boldsymbol\theta}), \qquad S_e = S_e(\hat{\boldsymbol\theta}),$$

and θ̂ satisfies the normal equations

$$\Big[\sum_{j=1}^{J}\mathbf{A}_j'(\mathbf{I}-\mathbf{R})\mathbf{A}_j\Big]\hat{\boldsymbol\theta} = \sum_{j=1}^{J}\mathbf{A}_j'(\mathbf{I}-\mathbf{R})\mathbf{y}_j. \qquad (7.3.17)$$
It is convenient to arrange the quantities S_b, Q_b(θ), S_e, and Q_e(θ), and to indicate their sampling distributional properties, in the familiar form of an analysis of variance, as in Table 7.3.2.

Table 7.3.2 Analysis of variance of the linear model with two random components

Sources                   S.S.      d.f.              Sampling distribution of S.S. (all independent)
Between (inter) blocks    Q_b(θ)    q_b               σ²_be χ²_{q_b}
                          S_b       J − q_b           σ²_be χ²_{J−q_b}
Within (intra) blocks     Q_e(θ)    q_e               σ²_e χ²_{q_e}
                          S_e       J(K−1) − q_e      σ²_e χ²_{J(K−1)−q_e}

In Table 7.3.2, q_b and q_e are, respectively, the ranks of the matrices

$$\sum_{j=1}^{J}\mathbf{A}_j'\mathbf{R}\mathbf{A}_j \qquad \text{and} \qquad \sum_{j=1}^{J}\mathbf{A}_j'(\mathbf{I}-\mathbf{R})\mathbf{A}_j.$$
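The quantities of Table 7.3.2 can be assembled mechanically from the block design matrices; the sketch below is a generic implementation of (7.3.8), (7.3.13), (7.3.15), and (7.3.17) under the notation of this section (the pseudoinverse is used since Σⱼ Aⱼ′(I−R)Aⱼ is typically singular, so θ̂ is one particular solution).

```python
import numpy as np

def anova_components(y_blocks, A_blocks):
    """y_blocks: list of K-vectors; A_blocks: list of K x I matrices.
    Returns (S_b, q_b, theta_tilde, S_e, q_e, theta_hat) of Section 7.3."""
    K = len(y_blocks[0])
    R = np.full((K, K), 1.0 / K)               # R = K^{-1} 1 1'
    M = np.eye(K) - R                          # I - R
    # Between-block (inter) quantities, based on block averages:
    Gb = sum(A.T @ R @ A for A in A_blocks)
    gb = sum(A.T @ R @ y for A, y in zip(A_blocks, y_blocks))
    theta_tilde = np.linalg.pinv(Gb) @ gb      # normal equations (7.3.15)
    Sb = K * sum((y.mean() - A.mean(axis=0) @ theta_tilde) ** 2
                 for A, y in zip(A_blocks, y_blocks))     # (7.3.8) at theta_tilde
    # Within-block (intra) quantities:
    Ge = sum(A.T @ M @ A for A in A_blocks)
    ge = sum(A.T @ M @ y for A, y in zip(A_blocks, y_blocks))
    theta_hat = np.linalg.pinv(Ge) @ ge        # a particular solution of (7.3.17)
    Se = sum((y - A @ theta_hat) @ M @ (y - A @ theta_hat)
             for A, y in zip(A_blocks, y_blocks))         # (7.3.13) at theta_hat
    qb, qe = np.linalg.matrix_rank(Gb), np.linalg.matrix_rank(Ge)
    return Sb, qb, theta_tilde, Se, qe, theta_hat
```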
7.3.3 Sampling Theory Estimation Problems Associated with the Model

In the sampling theory framework, the usual difficulties are associated with making inferences about the variance components σₑ² and σ_b². The unbiased estimator of σ_b² which has often been used,

$$\hat\sigma_b^2 = \frac{1}{K}\Big[\frac{S_b}{J - q_b} - \frac{S_e}{J(K-1) - q_e}\Big], \qquad (7.3.18)$$

is intuitively unsatisfactory because it can take negative values. In addition, the sampling distribution of σ̂_b² involves the nuisance parameter σ_b²/σₑ², and this leads to difficulties in constructing confidence intervals for σ_b². Finally, while inferences about σₑ² are often based upon S_e alone, if σ_b²/σₑ² is not large, S_b can also contribute appreciable information about σₑ². Attempts to deal with this problem by "pooling variances" are accompanied by the familiar problems previously discussed in Chapters 5 and 6. Usually the parameters of principal interest are the elements of the vector θ or linear functions of them. From Table 7.3.2 we see that, if the inter-block information were ignored, confidence regions for θ could be constructed by referring the quantity

$$\frac{Q_e(\boldsymbol\theta)/q_e}{S_e/[J(K-1) - q_e]} \qquad (7.3.19)$$

to an F distribution with [q_e, J(K−1) − q_e] degrees of freedom. Similarly, if the intra-block information were ignored, a different set of confidence regions for θ could be obtained by referring the quantity

$$\frac{Q_b(\boldsymbol\theta)/q_b}{S_b/(J - q_b)} \qquad (7.3.20)$$

to an F distribution with (q_b, J − q_b) degrees of freedom. In attempts to combine intra- and inter-block information, the major problem for sampling theory is the difficulty of eliminating the nuisance parameter σ_b²/σₑ²; see, for example, Yates (1940), Scheffé (1959, pp. 170-178), Graybill and Weeks (1959), and Seshadri (1966). We now consider the problems from a Bayesian viewpoint and discuss in detail the analysis of balanced incomplete block designs.
7.3.4 Prior and Posterior Distributions of (σₑ², σ_be², θ)

From (7.3.5), (7.3.6), and (7.3.12), we see that the information contained in the likelihood function can be regarded as coming from J independent observations ȳ.ⱼ from Normal populations with variance σ_be²/K, and from J(K−1) independent observations drawn from Normal populations with variance σₑ². The means of these observations are linear functions of the elements of the vector θ.
As in Chapters 5 and 6, we carry through the analysis using the noninformative reference prior distribution

$$p(\boldsymbol\theta, \sigma_e^2, \sigma_{be}^2) \propto (\sigma_e^2)^{-1}(\sigma_{be}^2)^{-1}, \qquad \sigma_{be}^2 > \sigma_e^2 > 0, \qquad (7.3.21)$$

where, since σ_be² = σₑ² + Kσ_b², the constraint σ_be² > σₑ² is implied. Combining (7.3.5) with (7.3.21), we obtain the posterior distribution

$$p(\sigma_e^2, \sigma_{be}^2, \boldsymbol\theta \mid \mathbf{y}) \propto (\sigma_{be}^2)^{-(\frac{1}{2}J+1)}(\sigma_e^2)^{-[\frac{1}{2}J(K-1)+1]}\exp\Big\{-\frac{1}{2}\Big[\frac{S_b + Q_b(\boldsymbol\theta)}{\sigma_{be}^2} + \frac{S_e + Q_e(\boldsymbol\theta)}{\sigma_e^2}\Big]\Big\},$$
$$\sigma_{be}^2 > \sigma_e^2 > 0, \qquad -\infty < \theta_i < \infty, \quad i = 1, \ldots, I. \qquad (7.3.22)$$

We shall not proceed further with the general model, but consider the especially important case of balanced incomplete block designs.

7.4 ANALYSIS OF BALANCED INCOMPLETE BLOCK DESIGNS
As we have noted earlier, in a balanced incomplete block design (BIBD), I treatments are examined in J blocks of equal size K. Each treatment appears r times and occurs λ times in a block with every other treatment. For any such design, we must have

$$J \geq I, \qquad JK = Ir, \qquad \lambda(I-1) = r(K-1). \qquad (7.4.1)$$

In terms of the general model (7.3.1), Aⱼ is a K × I matrix of 1's and 0's, each row of which contains a single 1 in the column corresponding to the treatment applied. Clearly, Aⱼ consists of orthogonal rows. The I × J matrix

$$\mathbf{N} = [\mathbf{A}_1'\mathbf{1}, \ldots, \mathbf{A}_J'\mathbf{1}] \qquad (7.4.2)$$

is known as the incidence matrix or the design matrix, whose elements consist of 1's and 0's indicating the presence and absence of the I treatments in the J blocks. For the instruction design in Table 7.3.1, the incidence matrix is

                        Block
                   1  2  3  4  5  6
              1  [ 1  1  1  0  0  0 ]
Treatment     2  [ 1  0  0  1  1  0 ]
(I = 4)       3  [ 0  1  0  1  0  1 ]
              4  [ 0  0  1  0  1  1 ]          (J = 6).
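A sketch that constructs the Aⱼ matrices and the incidence matrix for the reading instruction design and checks the BIBD conditions (7.4.1); the use of all K-subsets of the treatments as blocks is specific to this small example.

```python
import numpy as np
from itertools import combinations

I, K = 4, 2
blocks = list(combinations(range(I), K))   # all pairs: the design of Table 7.3.1
J = len(blocks)                            # J = 6

A = [np.eye(I)[list(b)] for b in blocks]   # each A_j is K x I of 0's and 1's
N = np.array([a.sum(axis=0) for a in A]).T # I x J incidence matrix (7.4.2)

r = int(N.sum(axis=1)[0])                  # replications per treatment (r = 3)
lam = int((N @ N.T)[0, 1])                 # pairwise concurrences (lambda = 1)
assert J * K == I * r and lam * (I - 1) == r * (K - 1)   # conditions (7.4.1)
print(N, r, lam)
```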
7.4.1 Properties of the BIBD Model

The BIBD model has the following properties:

$$\sum_{j=1}^{J}\mathbf{A}_j'\mathbf{A}_j = r\mathbf{I}_I, \qquad (7.4.3)$$

$$\sum_{j=1}^{J}\mathbf{A}_j'\mathbf{R}\mathbf{A}_j = K^{-1}\mathbf{N}\mathbf{N}', \qquad (7.4.4)$$

$$\mathbf{N}\mathbf{N}' = (r-\lambda)\mathbf{I}_I + \lambda\mathbf{1}_I\mathbf{1}_I', \qquad (7.4.5)$$

$$\sum_{j=1}^{J}\mathbf{A}_j'(\mathbf{I}-\mathbf{R})\mathbf{A}_j = \frac{\lambda I}{K}\big(\mathbf{I}_I - I^{-1}\mathbf{1}_I\mathbf{1}_I'\big), \qquad (7.4.6)$$

$$\sum_{j=1}^{J}\mathbf{A}_j'\mathbf{1}\bar{y}_{.j} = r\mathbf{B}, \qquad (7.4.7)$$

$$\sum_{j=1}^{J}\mathbf{A}_j'(\mathbf{I}-\mathbf{R})\mathbf{y}_j = r(\mathbf{T}-\mathbf{B}), \qquad (7.4.8)$$

$$\mathbf{1}_I'\mathbf{T} = \mathbf{1}_I'\mathbf{B} = I\bar{y}_{..}, \qquad (7.4.9)$$
where I_I is an I × I identity matrix, 1_I is an I × 1 vector of ones, B′ = (B₁, ..., B_I), Bᵢ is the average of ȳ.ⱼ for the blocks j in which the ith treatment appears, T′ = (T₁, ..., T_I) is the vector of treatment averages, and ȳ.. is the grand average.

Breakdown of S_b(θ)

Using (7.4.4), (7.4.5), (7.4.7), and (7.4.9), the quantity Q_b(θ) in (7.3.14) can be written

$$Q_b(\boldsymbol\theta) = \frac{r-\lambda}{K}\sum_{i=1}^{I}(\theta_i - \tilde\theta_i)^2 + \frac{\lambda}{K}\Big[\sum_{i=1}^{I}(\theta_i - \tilde\theta_i)\Big]^2, \qquad (7.4.10)$$

where

$$\tilde\theta_i = (r-\lambda)^{-1}(rKB_i - \lambda I\bar{y}_{..}), \qquad r > \lambda. \qquad (7.4.11)$$
In a BIBD, it is the comparative values of the θᵢ which are of primary interest. In this connection it is convenient to work with the contrasts

$$\phi_i = \theta_i - \bar\theta, \qquad i = 1, \ldots, I, \qquad \text{where} \quad \bar\theta = I^{-1}\sum_{i=1}^{I}\theta_i, \qquad (7.4.12)$$

so that Σᵢφᵢ = 0, and to write

$$Q_b(\boldsymbol\theta) = (1-f)r\sum_{i=1}^{I}(\phi_i - \tilde\phi_i)^2 + rI(\bar\theta - \bar{y}_{..})^2, \qquad (7.4.13)$$

with

$$\tilde\phi_i = \frac{rK}{r-\lambda}(B_i - \bar{y}_{..}), \qquad r > \lambda. \qquad (7.4.14)$$
Since S_b(θ) = S_b + Q_b(θ) is true for all θ, by setting θ = 0 (which implies φ = 0 and θ̄ = 0), we find

$$S_b = K\sum_{j=1}^{J}\bar{y}_{.j}^2 - rI\bar{y}_{..}^2 - \frac{r-\lambda}{K}\sum_{i=1}^{I}\tilde\phi_i^2 = K\sum_{j=1}^{J}(\bar{y}_{.j} - \bar{y}_{..})^2 - \frac{r-\lambda}{K}\sum_{i=1}^{I}\tilde\phi_i^2. \qquad (7.4.15)$$
Breakdown of S_e(θ)

From (7.4.6), the quantity Q_e(θ) in (7.3.16) can be written

$$Q_e(\boldsymbol\theta) = \frac{\lambda I}{K}\Big\{\sum_{i=1}^{I}(\theta_i - \hat\theta_i)^2 - I^{-1}\Big[\sum_{i=1}^{I}(\theta_i - \hat\theta_i)\Big]^2\Big\} = \frac{\lambda I}{K}\sum_{i=1}^{I}(\phi_i - \hat\phi_i)^2, \qquad (7.4.16)$$

where φᵢ is defined in (7.4.12),

$$\hat\phi_i = \hat\theta_i - I^{-1}\sum_{l=1}^{I}\hat\theta_l, \qquad (7.4.17)$$

and θ̂ = (θ̂₁, ..., θ̂_I)′ satisfies the normal equations (7.3.17). It follows from (7.4.6), (7.4.8), and (7.4.9) that

$$\hat{\boldsymbol\theta} = \frac{rK}{\lambda I}(\mathbf{T}-\mathbf{B}) \qquad (7.4.18)$$

is a particular solution of (7.3.17) for the BIBD, and that

$$\hat\phi_i = \hat\theta_i = \frac{rK}{\lambda I}(T_i - B_i), \qquad i = 1, \ldots, I. \qquad (7.4.19)$$
Consequently, we may write S_e(θ) in (7.3.16) as

$$S_e(\boldsymbol\theta) = S_e + \frac{\lambda I}{K}\sum_{i=1}^{I}(\phi_i - \hat\phi_i)^2, \qquad (7.4.20)$$

with

$$S_e = \sum_{j=1}^{J}\sum_{k=1}^{K}(y_{kj} - \bar{y}_{.j})^2 - \frac{\lambda I}{K}\sum_{i=1}^{I}\hat\phi_i^2. \qquad (7.4.21)$$

The expressions for φ̃ᵢ, φ̂ᵢ, S_b, and S_e given above are, of course, well known. We have sketched the development in order to relate the present results for the BIBD to those for the general linear model set out in Section 7.3.
The Analysis of Variance

The breakdowns of S_b(θ) and S_e(θ) are summarized in Table 7.4.1 in the usual analysis of variance form.

Table 7.4.1 Analysis of variance of a BIBD model

Sources                   S.S.                          d.f.               Sampling distribution of S.S. (all independent)
Between (inter) blocks    (1−f)r Σᵢ(φᵢ − φ̃ᵢ)²           δ(I−1)             σ²_be χ²_{δ(I−1)}
                          S_b                           (J−1) − δ(I−1)     σ²_be χ²_{(J−1)−δ(I−1)}
Within (intra) blocks     f r Σᵢ(φᵢ − φ̂ᵢ)²              I−1                σ²_e χ²_{I−1}
                          S_e                           J(K−1) − (I−1)     σ²_e χ²_{J(K−1)−(I−1)}

In Table 7.4.1, δ = 1 if r > λ and δ = 0 if r = λ.
7.4.2 A Comparison of the One-way Classification, RCBD and BIBD Models

It is instructive to relate these results to those for the additive mixed model (Section 6.3) and those for the one-way classification (Sections 2.11 and 7.2). If, as we shall suppose, blocks may be represented as a "random effect," then the additive mixed model is appropriate for the analysis of the randomized block design. We shall call the latter the randomized complete block design (RCBD), to distinguish it from the BIBD. For a BIBD, the model in (7.3.1) may be written

$$y_{kj} = E(y_{kj}) + b_j + e_{kj}, \qquad j = 1, \ldots, J; \quad k = 1, \ldots, K, \qquad (7.4.22)$$

where yₖⱼ is the kth observation in the jth block, bⱼ the block effect, and eₖⱼ the within block residual error. The expectation E(yₖⱼ) is

$$E(y_{kj}) = \theta_i \qquad (7.4.23)$$

if the ith treatment, i = 1, ..., I, is applied to yₖⱼ. Alternatively, we may write the model as

$$y_{(im)} = \theta_i + \varepsilon_{(im)}, \qquad i = 1, \ldots, I; \quad m = 1, \ldots, r, \qquad (7.4.24)$$

where y₍ᵢₘ₎ is the mth observation receiving the ith treatment and ε₍ᵢₘ₎ is the error term such that

$$\varepsilon_{(im)} = b_j + e_{kj} \qquad (7.4.25)$$
if y₍ᵢₘ₎ happens to be in the kth position of the jth block. In other words, (7.4.22) and (7.4.24) show that we can classify the JK = rI observations in two different ways, one according to the blocks and the other according to the treatments. If there is no block effect, bⱼ ≡ 0, then the model in (7.4.24) is the one-way classification model discussed earlier, with the appropriate change in notation. On the other hand, if the size of a block equals the number of treatments, K = I, and every treatment appears in every block, so that J = r = λ, then the model in (7.4.22) is

$$y_{ij} = \theta_i + b_j + e_{ij}, \qquad i = 1, \ldots, I; \quad j = 1, \ldots, J, \qquad (7.4.26)$$

which is the RCBD model given in (6.3.1) upon substituting bⱼ for cⱼ. Now, the sample averages needed to compute φ̃ᵢ, φ̂ᵢ and the sums of squares in Table 7.4.1 are:

a) ȳ.. = ȳ₍..₎, the grand average;
b) Tᵢ = ȳ₍ᵢ.₎, the treatment averages;
c) ȳ.ⱼ, the block averages;
d) Bᵢ, the average of the block averages for the blocks in which the ith treatment appears.

Note that when bⱼ ≡ 0 or K = I, Bᵢ ≡ ȳ..; otherwise I⁻¹ΣᵢBᵢ = ȳ... Table 7.4.2a compares the point estimates of the contrasts φᵢ = θᵢ − θ̄, i = 1, ..., I, for the one-way classification, RCBD and BIBD models. While the estimate Tᵢ − ȳ.. is the same for both the one-way classification and the RCBD, it is regarded as an "intra-block" estimate for the latter because
$$T_i - \bar{y}_{..} = \frac{1}{J}\sum_{j}(y_{ij} - \bar{y}_{.j}), \qquad (7.4.27)$$
that is, because it is the average of J within block contrasts among the observations. For the BIBD, there are two sources of information about φᵢ: the "intra-block" estimate φ̂ᵢ and the "inter-block" estimate φ̃ᵢ. Let

$$\bar{y}_{.j(im)}, \qquad m = 1, \ldots, r, \qquad (7.4.28)$$

denote the r block averages for the blocks in which the ith treatment appears. Then the "intra-block" estimate φ̂ᵢ of φᵢ is proportional to

$$T_i - B_i = \frac{1}{r}\sum_{m}\big(y_{(im)} - \bar{y}_{.j(im)}\big), \qquad (7.4.29)$$

which is made up of r within block contrasts among the observations. On the other hand, the inter-block estimate φ̃ᵢ is proportional to Bᵢ − ȳ.., representing a contrast among the block averages.
The constants of proportionality f⁻¹ and (1−f)⁻¹ are such that

$$f = \frac{\lambda I}{rK}, \qquad 1 - f = \frac{rK - \lambda I}{rK} = \frac{r-\lambda}{rK}, \qquad 0 < f \leq 1. \qquad (7.4.30)$$

The quantity f is sometimes called the "efficiency factor" of a BIBD. When K = I, then r = λ = J and the BIBD reduces to the RCBD. In this case

$$f = 1 \qquad \text{and} \qquad \hat\phi_i = T_i - \bar{y}_{..}, \qquad (7.4.31)$$

so that, as far as sampling theory is concerned, there is only one source of information about φᵢ.

Table 7.4.2 Comparison of one-way classification, RCBD and BIBD models

a) Point estimates of the contrast φᵢ = θᵢ − θ̄

                                 Intra-block                    Inter-block
One-way classification & RCBD    Tᵢ − ȳ..                       ---
BIBD                             φ̂ᵢ = f⁻¹(Tᵢ − Bᵢ)              φ̃ᵢ = (1−f)⁻¹(Bᵢ − ȳ..)

b) Decomposition of the sum of squares

Source                  One-way classification     RCBD                                 BIBD
Grand mean              rI(θ̄ − ȳ..)²              rI(θ̄ − ȳ..)²                          rI(θ̄ − ȳ..)²
Treatment: intrablock   rΣᵢ[φᵢ − (Tᵢ − ȳ..)]²      rΣᵢ[φᵢ − (Tᵢ − ȳ..)]²                 f r Σᵢ(φᵢ − φ̂ᵢ)²
Treatment: interblock   ---                        ---                                  (1−f) r Σᵢ(φᵢ − φ̃ᵢ)²
Blocks: interblock      ---                        K Σⱼ(ȳ.ⱼ − ȳ..)²                      K Σⱼ(ȳ.ⱼ − ȳ..)² − (1−f) r Σᵢφ̃ᵢ²
Residual: intrablock    ΣΣ(y₍ᵢₘ₎ − Tᵢ)²            ΣΣ(yₖⱼ − ȳ.ⱼ)² − rΣᵢ(Tᵢ − ȳ..)²       ΣΣ(yₖⱼ − ȳ.ⱼ)² − f r Σᵢφ̂ᵢ²
Table 7.4.2b gives a corresponding comparison of the decomposition of the sum of squares for the three models. The reader will have no difficulty in associating the entries in the table with the corresponding ones in Table 2.11.1 for the one-way classification, in Table 6.3.1 for the RCBD, and in Table 7.4.1 for the BIBD.

7.4.3 Posterior Distribution of φ′ = (φ₁, ..., φ_{I−1}) for a BIBD
The quantities of principal interest are contrasts between the treatment parameters. If in (7.3.22) we make the transformation from θ₁, ..., θ_I to θ̄ and the I−1 linearly independent contrasts φᵢ = θᵢ − θ̄, i = 1, ..., I−1, then, upon substituting for Q_b(θ) and Q_e(θ) into (7.3.22) from Table 7.4.1 and integrating out θ̄, we obtain

$$p(\sigma_e^2, \sigma_{be}^2, \boldsymbol\phi \mid \mathbf{y}) \propto (\sigma_{be}^2)^{-[\frac{1}{2}(J-1)+1]}(\sigma_e^2)^{-[\frac{1}{2}J(K-1)+1]}\exp\Big\{-\frac{1}{2}\Big[\frac{S_b + (1-f)r\sum_i(\phi_i - \tilde\phi_i)^2}{\sigma_{be}^2} + \frac{S_e + fr\sum_i(\phi_i - \hat\phi_i)^2}{\sigma_e^2}\Big]\Big\},$$
$$-\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1; \qquad 0 < \sigma_e^2 < \sigma_{be}^2 < \infty, \qquad (7.4.32)$$

where, here and in what follows, it is to be remembered that φ_I = −(φ₁ + ... + φ_{I−1}). In (7.4.32), use is made of the definition of the efficiency factor f = λI/(rK) in (7.4.30). Applying the identity (A5.2.1) in Appendix A5.2 to integrate out σₑ² and σ_be², we have

$$p(\boldsymbol\phi \mid \mathbf{y}) \propto \Big[S_e + fr\sum_{i=1}^{I}(\phi_i - \hat\phi_i)^2\Big]^{-\frac{1}{2}J(K-1)}\Big[S_b + (1-f)r\sum_{i=1}^{I}(\phi_i - \tilde\phi_i)^2\Big]^{-\frac{1}{2}(J-1)} I_{u(\boldsymbol\phi)}\big[\tfrac{1}{2}(J-1), \tfrac{1}{2}J(K-1)\big],$$
$$-\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1, \qquad (7.4.33)$$
where

$$u(\boldsymbol\phi) = \frac{S_b + (1-f)r\sum_{i=1}^{I}(\phi_i - \tilde\phi_i)^2}{S_b + (1-f)r\sum_{i=1}^{I}(\phi_i - \tilde\phi_i)^2 + S_e + fr\sum_{i=1}^{I}(\phi_i - \hat\phi_i)^2}$$

and, as before, I_x(p, q) is the incomplete beta function. The posterior distribution is thus the product of three factors. The first of these is in the form of a (I−1)-dimensional multivariate t distribution centered at φ̂ = (φ̂₁, ..., φ̂_{I−1})′, with covariance matrix proportional to (I_{I−1} − I⁻¹1_{I−1}1′_{I−1}). For r > λ, the second factor again takes the form of an (I−1)-dimensional multivariate t distribution, centered at φ̃ = (φ̃₁, ..., φ̃_{I−1})′, with covariance matrix also proportional to (I_{I−1} − I⁻¹1_{I−1}1′_{I−1}). The third factor is an incomplete beta integral whose upper limit u(φ) depends on φ. We note that the distribution p(φ | y) is a multivariate generalization of a form of distribution obtained in Tiao and Tan (1966, Eqn. 7.2).
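For a given data set, (7.4.33) is straightforward to evaluate pointwise; a sketch using scipy's regularized incomplete beta function, illustrated with the summaries of the example of Section 7.4.4 below.

```python
import numpy as np
from scipy.special import betainc

def log_post_phi(phi, phi_hat, phi_til, Se, Sb, I, J, K, r, f):
    """Unnormalized log of (7.4.33); phi holds the I-1 free contrasts."""
    full = np.append(phi, -np.sum(phi))    # phi_I = -(phi_1 + ... + phi_{I-1})
    Qe = f * r * np.sum((full - phi_hat) ** 2)
    Qb = (1 - f) * r * np.sum((full - phi_til) ** 2)
    a, b = Sb + Qb, Se + Qe
    u = a / (a + b)                        # upper limit u(phi)
    return (-0.5 * J * (K - 1) * np.log(b)
            - 0.5 * (J - 1) * np.log(a)
            + np.log(betainc(0.5 * (J - 1), 0.5 * J * (K - 1), u)))

# Example with the summaries of Table 7.4.3 (I = 3, so phi is 2-dimensional):
phi_hat = np.array([3.4406, -1.754, -1.6866])
phi_til = np.array([1.962, 0.374, -2.336])
print(log_post_phi(np.array([3.26, -1.50]), phi_hat, phi_til,
                   29.1841, 49.04448, 3, 15, 2, 10, 0.75))
```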
Relationship to Previous Results

The first factor in (7.4.33) is numerically equivalent to the confidence distribution of φ in the sampling theory framework when only the intra-block information is employed. It would also be the appropriate posterior distribution of φ if the inter-block information were ignored. To see how the three factors combine, consider the behavior of p(φ | y) along an arbitrary straight line in the space of φ passing through the two centers φ̂ and φ̃, the points on the line being indexed by a scalar ψ. Substitution of such a line into (7.4.33) yields the form

$$p(\psi \mid \mathbf{y}) \propto [a_e + d_e(\psi - c_e)^2]^{-\frac{1}{2}J(K-1)}[a_b + d_b(\psi - c_b)^2]^{-\frac{1}{2}(J-1)} I_{u(\psi)}\big[\tfrac{1}{2}(J-1), \tfrac{1}{2}J(K-1)\big], \qquad -\infty < \psi < \infty, \qquad (7.4.34)$$

where (a_e, d_e, c_e) and (a_b, d_b, c_b) are constants determined by the line, c_e and c_b being the points on the line closest to φ̂ and φ̃, respectively. Remembering that the incomplete beta function is monotonically increasing in u(ψ), it can be verified that the effect of this third factor is to give more weight to the first factor in determining the center of the complete distribution p(ψ | y). Since the line chosen is arbitrary, it follows that the function I_{u(φ)}[½(J−1), ½J(K−1)] gives more weight to the first factor of p(φ | y) than to the second. Finally, in the special case r = λ, namely a randomized complete block design model, the distribution in (7.4.33) degenerates to that given in (6.3.45)†

† Other results of the same kind will be considered further in Chapters 8 and 9, when combining information about common regression coefficients from several sources in cases where no variance restrictions exist.
405
with appropriate changes in notation. In this case, the second factor is a constant and the third factor, which is now monotonically decreasing in L!= 1 (>j - ({>Y, tends to make the distribution more concentrated about .j) than it would be if only the first factor alone were considered. 7.4.4 An Illustrative Example To illustrate the theory we consider an example involving I = 3 treatments, J = 15 blocks of size K = 2, each treatment replicated r = 10 times; thus A = 5. Table 7.4.3 shows the data with some peripheral calculations. For convenience in presentation, we have associated the blocks (random effects) with rows and the treatment (fixed effects) with columns. The data was generated from a table of random Normal deviates using a BIBD model with 8 1 = 5, 8 2 = 8) = 2, (J"; = 2.25, and (J"; = 1. For this example the distribution in (7.4.33) is, of Table 7.4.3 Data from a BIBD (1= 3,J= 15, K= 2, r= 10, A= 5,/=0.75) Treatment 2 I
Block
2 3 4 5 6 7 8 9 10 11 12 13 14 IS
10.05 10.10 5.52 9.90 9.99 5.93 6.76 7.94 11.44 9.74
3
y . j.
0.24 3.80 2.90 2.10 3.94 4.83 4.41 6.47 5.30 4.18
7.485 7.795 4.245 6.810 7.835 3.085 5.280 5.420 6.770 6.840 2.810 5.250 6.125 4.260 4.980
4.92 5.49 2.97 3.72 5.68
0.79 6.09 5.78 3.22 5.78
T;
8.737
4.444
3.817
Bj
6.1565
5.7595
5.082
({>j
3.4406
;Pj
1.962
y .. =
5.666
S e = 29.1841 (13 d.f.)
- 1.754
-1.6866 Sb = 49.04448 (12 d.f.)
0.374
-2.336
For this example, the distribution in (7.4.33) is, of course, a bivariate one. Note that

$$\sum_{i=1}^{3}(\phi_i - \hat\phi_i)^2 = 2[(\phi_1 - 3.4406)^2 + (\phi_2 + 1.754)^2 + (\phi_1 - 3.4406)(\phi_2 + 1.754)]$$

and

$$\sum_{i=1}^{3}(\phi_i - \tilde\phi_i)^2 = 2[(\phi_1 - 1.962)^2 + (\phi_2 - 0.374)^2 + (\phi_1 - 1.962)(\phi_2 - 0.374)],$$

where φ₃ = −(φ₁ + φ₂).
Figure 7.4.1 shows a number of contours related to the distribution (7.4.33).

1. The contour (1), centered at P₁ = (3.44, −1.75), is the 95 per cent H.P.D. region derived from the first factor alone. It was calculated using the fact that, if the first factor alone were the posterior distribution, the quantity

$$\frac{fr\sum_{i=1}^{I}(\phi_i - \hat\phi_i)^2/(I-1)}{S_e/[J(K-1)-(I-1)]}$$

would be referred to an F distribution with (I−1) and J(K−1)−(I−1) degrees of freedom.

2. The contour (2), centered at P₂ = (1.96, 0.37), is the corresponding region derived from the second factor alone, obtained by referring

$$\frac{(1-f)r\sum_{i=1}^{I}(\phi_i - \tilde\phi_i)^2/(I-1)}{S_b/[(J-1)-(I-1)]}$$

to an F distribution with (I−1) and (J−1)−(I−1) degrees of freedom.

3. The broken lines (3) are contours of the third factor, with the contour values shown.

4. The contour (4), centered at P = (3.26, −1.50), defines the approximate 95 per cent H.P.D. region derived from the complete distribution (7.4.33). It was calculated from the formula

$$2[\log p(\phi_1^*, \phi_2^* \mid \mathbf{y}) - \log p(\phi_1, \phi_2 \mid \mathbf{y})] = 5.99 = \chi^2(2, 0.05),$$

where (φ₁*, φ₂*) are the coordinates of P, the center and maximum point. This is equivalent to p(φ₁, φ₂ | y) = 0.05 p(φ₁*, φ₂* | y).
We can see from Fig. 7.4.1 that, as expected, the overall distribution contour, curve (4), is located between the contours (1) and (2) from the first and second factors, respectively. In fact, the center P is practically collinear with the other centers P₁ and P₂. Contours (1) and (2) are elliptical and have parallel axes, since the covariance matrices of the two corresponding bivariate t distributions are proportional. Contour (2) is much larger than contour (1), essentially because, for this example,

$$\frac{S_b}{(J-1)(1-f)} \gg \frac{S_e}{f[J(K-1)-(I-1)]},$$
implying that the first factor will have a dominating role in determining the overall distribution. The lines (3) show that the third factor is increasing in a south-easterly direction, confirming that this factor tends to pull the final distribution of φ towards the bivariate t distribution given by the first factor. However, for this example, the effect is quite small, as is seen from the slight changes in the value of the function over the region in which the dominating first factor is appreciable. The shape of contour (4) is very nearly elliptical, which suggests (as turns out to be the case) that the distribution might be approximated by a bivariate t distribution.
In the figure:
1. 95 per cent contour of φ based on the first factor of (7.4.33)
2. 95 per cent contour of φ based on the second factor of (7.4.33)
3. Contours of I_{u(φ)}[½(J−1), ½J(K−1)] for various values, as marked
4. Approximate 95 per cent contour of the entire distribution of φ in (7.4.33)

Fig. 7.4.1 Contours for the components of the posterior distribution of (φ₁, φ₂) for the BIBD data.
7.4.5 Posterior Distribution of σ_be²/σₑ²

In this section we digress from the problem of making inferences about the contrasts φᵢ to consider the posterior distribution of the variance ratio

$$w = \frac{\sigma_{be}^2}{\sigma_e^2}, \qquad (7.4.35)$$

which plays an important role in approximating the distribution of φ.
In the joint distribution p(σₑ², σ_be², φ | y) in (7.4.32), we may use the identity (A7.1.1) in Appendix A7.1 to write

$$\sigma_{be}^{-2}(1-f)r\sum_{i=1}^{I}(\phi_i - \tilde\phi_i)^2 + \sigma_e^{-2}fr\sum_{i=1}^{I}(\phi_i - \hat\phi_i)^2 = \sigma_e^{-2}\Big\{fr\Big(1 + \frac{1-f}{fw}\Big)\sum_{i=1}^{I}[\phi_i - \bar\phi_i(w)]^2 + \Big(w + \frac{1-f}{f}\Big)^{-1}S_f\Big\}, \qquad (7.4.36)$$

where

$$S_f = (1-f)r\sum_{i=1}^{I}(\hat\phi_i - \tilde\phi_i)^2 \qquad (7.4.37)$$

and

$$\bar\phi_i(w) = \Big(1 + \frac{1-f}{fw}\Big)^{-1}\Big[\hat\phi_i + \frac{1-f}{fw}\tilde\phi_i\Big]. \qquad (7.4.38)$$
Making the transformation from (σ_be², σₑ²) to (w, σₑ²) and integrating out σₑ², the posterior distribution of (w, φ) is

$$p(w, \boldsymbol\phi \mid \mathbf{y}) \propto w^{-[\frac{1}{2}(J-1)+1]}\Big\{S_e + w^{-1}S_b + \Big(w + \frac{1-f}{f}\Big)^{-1}S_f + fr\Big(1 + \frac{1-f}{fw}\Big)\sum_{i=1}^{I}[\phi_i - \bar\phi_i(w)]^2\Big\}^{-\frac{1}{2}(JK-1)},$$
$$-\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1; \qquad w > 1. \qquad (7.4.39)$$

Upon eliminating φ by integration, and recalling from Table 7.4.1 that δ = 1 if r > λ and δ = 0 if r = λ, the posterior distribution of w is

$$p(w \mid \mathbf{y}) \propto H_1(w)H_2(w), \qquad w > 1, \qquad (7.4.40)$$

with

$$H_1(w) = w^{\frac{1}{2}[J(K-1)-(I-1)]-1}\Big(1 + \frac{S_e}{S_b}w\Big)^{-\frac{1}{2}\{[J(K-1)-(I-1)]+[(J-1)-\delta(I-1)]\}}$$

and

$$H_2(w) = [G(w)]^{-\frac{1}{2}\delta(I-1)}\Big[1 + \frac{S_f}{G(w)}\Big]^{-\frac{1}{2}I(r-1)}, \qquad G(w) = \Big(w + \frac{1-f}{f}\Big)\Big(S_e + \frac{S_b}{w}\Big).$$
.
This distribution is constrained in the interval (l, OJ) and is proportional to the product of two factors HI (IV) and H 2(11'). The constraint w > 1 arises from the inequality O'~e > If there were no constraint, the factor HI (w) would be proportional to an F distribution with J(K - I) - (1 - I) and (J - I) - b(I - I) degrees of freedom and would represent a quantity proportional to the confidence distribution of w based on Se and Sb' It would also be proportional to the posterior distribution of w based upon the portion of the likelihood relating to
0';.
7.4
Analysis of Balanced Incomplete Block Designs
Sb and Se alone in conjunction with a uniform reference prior log
In
log
409
IJ be
and
IJ e'
For r > λ, the second factor H₂(w) occurs because the quantities φ̂ᵢ and φ̃ᵢ differ, so that, through S_f, the comparison of the intra- and inter-block estimates supplies additional information about w.
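Since H₁ and H₂ involve only scalar operations, p(w | y) is easy to tabulate numerically; a sketch for the summaries of the Section 7.4.4 example, normalizing over w > 1 (the located mode of log w should be close to the value ŵ = 1.99 obtained in Section 7.4.6).

```python
import numpy as np

# Posterior of w for the Table 7.4.3 summaries; a numerical sketch of (7.4.40).
Se, Sb, Sf = 29.1841, 49.04448, 17.77
I, J, K, r, f, delta = 3, 15, 2, 10, 0.75, 1
a = (1 - f) / f
nu_e = J * (K - 1) - (I - 1)           # 13
nu_b = (J - 1) - delta * (I - 1)       # 12

def log_pw(w):
    G = (w + a) * (Se + Sb / w)
    h1 = (0.5 * nu_e - 1) * np.log(w) \
         - 0.5 * (nu_e + nu_b) * np.log(1 + (Se / Sb) * w)
    h2 = -0.5 * delta * (I - 1) * np.log(G) \
         - 0.5 * I * (r - 1) * np.log(1 + Sf / G)
    return h1 + h2                      # log H1(w) + log H2(w), up to a constant

w = np.linspace(1.0001, 15.0, 5000)
lp = log_pw(w)
p = np.exp(lp - lp.max())
p /= p.sum() * (w[1] - w[0])            # normalize over w > 1
print(w[np.argmax(lp + np.log(w))])     # mode of log w: about 1.99 here
```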
7.4.6 Further Properties of, and Approximations to, p(φ | y)

We now return to our discussion of the posterior distribution p(φ | y) in (7.4.33).

In the figure:
1. Posterior distribution of w in (7.4.40)
2. Posterior distribution of w based on H₁(w) alone
3. Posterior distribution of w based on H₁(w) alone, ignoring the constraint w > 1 (also a confidence distribution of w)

Fig. 7.4.2 Posterior distributions of w for the BIBD data.

In the analysis of data from a balanced incomplete block design, we are often
concerned with the problem of making inferences about one or more linear contrasts of the treatment means θᵢ. Since the φᵢ are themselves linear contrasts, a contrast in the θᵢ can be expressed as a contrast of φ₁, ..., φ_I. We are thus led to the problem of finding the marginal distributions of linear contrasts of φ₁, ..., φ_I. From the joint distribution (7.4.33) it is clear that the conditional distribution of a subset of φ₁, ..., φ_{I−1}, given the remainder, is of exactly the same form as the original distribution. The marginal distribution of a subset, however, is not. To obtain the marginal distributions, it is helpful to write the joint distribution of φ in the alternative form

$$p(\boldsymbol\phi \mid \mathbf{y}) = \int_{1}^{\infty} p(\boldsymbol\phi \mid w, \mathbf{y})\,p(w \mid \mathbf{y})\,dw, \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1, \qquad (7.4.41)$$
where p(w | y) is given in (7.4.40). From the joint distribution p(w, φ | y) in (7.4.39), it is readily seen that, given w, the conditional distribution p(φ | w, y) is the (I−1)-dimensional multivariate t distribution

$$t_{I-1}\big[\bar{\boldsymbol\phi}(w),\; s^2(w)(\mathbf{I}_{I-1} - I^{-1}\mathbf{1}_{I-1}\mathbf{1}_{I-1}'),\; I(r-1)\big], \qquad (7.4.42)$$

where φ̄(w) = [φ̄₁(w), ..., φ̄_{I−1}(w)]′ is given by (7.4.38) and

$$s^2(w) = \frac{S_b + wS_e + w\big(w + \frac{1-f}{f}\big)^{-1}S_f}{I(r-1)\,r[fw + (1-f)]}. \qquad (7.4.43)$$

Note that, from (7.4.38), the conditional posterior mean of φᵢ, given w, is a weighted average of the intra- and inter-block estimates (φ̂ᵢ, φ̃ᵢ), with weights proportional to f and (1−f)/w, respectively. From (7.4.41) and the properties of the multivariate t distribution, the marginal distribution of any set of p linearly independent contrasts η = Lφ⁺, where φ⁺ = (φ′ ⋮ φ_I)′ and L is a p × I matrix of rank p ≤ (I−1) such that L1 = 0, is

$$p(\boldsymbol\eta \mid \mathbf{y}) = \int_{1}^{\infty} p(\boldsymbol\eta \mid w, \mathbf{y})\,p(w \mid \mathbf{y})\,dw, \qquad -\infty < \eta_i < \infty, \quad i = 1, \ldots, p, \qquad (7.4.44)$$

where p(η | w, y) is the t_p[η̄(w), s²(w)LL′, I(r−1)] distribution with η̄(w) = L[φ̄₁(w), ..., φ̄_I(w)]′. In particular, the posterior distribution of a single contrast η = l′φ⁺ is

$$p(\eta \mid \mathbf{y}) = \int_{1}^{\infty} p(\eta \mid w, \mathbf{y})\,p(w \mid \mathbf{y})\,dw, \qquad -\infty < \eta < \infty, \qquad (7.4.45)$$
where, for given w, the quantity

$$\frac{\eta - \bar\eta(w)}{s(w)(\mathbf{l}'\mathbf{l})^{1/2}} \qquad (7.4.46)$$

follows the t[0, 1, I(r−1)] distribution. Although the conditional distribution of η given w is of the multivariate t form, it does not seem possible to express the unconditional distribution in terms of simple functions when p < (I−1). However, for given data, the unconditional distribution can be obtained by a one-dimensional numerical integration. The solid curves (1) in Figs. 7.4.3 and 7.4.4 show, respectively, the marginal distributions p(φ₁ | y) and p(φ₂ | y) for the data in Table 7.4.3. The distribution of φ₁ is derived from (7.4.45) by setting l′ = (⅔, −⅓, −⅓), so that η = φ₁. The distribution of φ₂ is obtained similarly. These distributions are nearly symmetrical, centered about φ₁ = 3.25 and φ₂ = −1.50, respectively. Also shown in the same figures, by the broken curves (2) centered at φ̂₁ = 3.44 and φ̂₂ = −1.75, are the distributions of φ₁ and φ₂ based upon the first (intra-block) factor of (7.4.33) alone. These curves (2) are numerically equivalent to the corresponding confidence distributions of φ₁ and φ₂ in the sampling theory framework when inter-block information is ignored.
In the figure:
1. Posterior distribution of φ₁ obtained from (7.4.45)
2. Posterior distribution of φ₁ based on the first factor of (7.4.33), also a confidence distribution of φ₁
3. Approximate distribution of φ₁, from (7.4.52)

Fig. 7.4.3 Posterior distributions of φ₁ for the BIBD data.
In the figure:
1. Posterior distribution of φ₂ obtained from (7.4.45)
2. Posterior distribution of φ₂ based on the first factor of (7.4.33), also a confidence distribution of φ₂
3. Approximate distribution of φ₂, from (7.4.52)

Fig. 7.4.4 Posterior distributions of φ₂ for the BIBD data.
By comparing the curves labeled (1) and (2) in each figure, one sees how the inter-block information modifies the marginal inferences about the contrasts; for the present example the modification is appreciable. The dotted curves (3) are approximations discussed below.

Approximations
In obtaining the marginal distributions of φ₁ and φ₂ shown in Figs. 7.4.3-7.4.4, numerical integration on a computer was necessary in both cases. However, both distributions are almost symmetrical and resemble Student's t distributions. In fact, a close approximation to p(φ | y) can be obtained by use of the multivariate t distribution. In general, we can write

$$p(\boldsymbol\phi \mid \mathbf{y}) = \mathop{E}_{g(w)}\, p(\boldsymbol\phi \mid w, \mathbf{y}), \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1, \qquad (7.4.47)$$

where p(φ | w, y) is given in (7.4.42) and the expectation is taken over the posterior distribution of g(w), some monotonic function of w. If the conditional distribution p(φ | w, y) is changing gently in the region of g(w) where its density is appreciable, then

$$p(\boldsymbol\phi \mid \mathbf{y}) \doteq p(\boldsymbol\phi \mid \bar{w}, \mathbf{y}), \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1, \qquad (7.4.48)$$
where g(w̄) = E g(w). From the distribution of w in (7.4.40), we see that evaluation of the expectation E g(w) would necessitate numerical integration, irrespective of the choice of g. Alternatively, we can write

$$p(\boldsymbol\phi \mid \mathbf{y}) \doteq p(\boldsymbol\phi \mid \hat{w}, \mathbf{y}) + h\,E[g(w) - g(\hat{w})], \qquad (7.4.49)$$

where g(ŵ) is the mode of the distribution of g(w) and h is the first derivative of p(φ | w, y) with respect to g(w), evaluated at ŵ. If g(w) is chosen so that its distribution is nearly symmetrical, then

$$p(\boldsymbol\phi \mid \mathbf{y}) \doteq p(\boldsymbol\phi \mid \hat{w}, \mathbf{y}), \qquad -\infty < \phi_i < \infty, \quad i = 1, \ldots, I-1; \qquad (7.4.50)$$

that is, p(φ | y) is approximately the multivariate t distribution

$$t_{I-1}\big[\bar{\boldsymbol\phi}(\hat{w}),\; s^2(\hat{w})(\mathbf{I}_{I-1} - I^{-1}\mathbf{1}_{I-1}\mathbf{1}_{I-1}'),\; I(r-1)\big],$$

where φ̄(ŵ) and s²(ŵ) are obtained simply by substituting ŵ into (7.4.42) and (7.4.43). To decide whether a particular parameter point φ = φ₀ lies inside or outside the (1−α) H.P.D. region, we may thus compare the quantity

$$\frac{\sum_{i=1}^{I}[\phi_{i0} - \bar\phi_i(\hat{w})]^2}{(I-1)\,s^2(\hat{w})} \qquad (7.4.51)$$

with the upper 100α percentage point of an F variable with (I−1) and I(r−1) degrees of freedom. Making use of the properties of the multivariate t distribution, any set of p ≤ (I−1) linearly independent contrasts will have an approximate p-dimensional multivariate t distribution. In particular, the quantity

$$\frac{\eta - \bar\eta(\hat{w})}{s(\hat{w})(\mathbf{l}'\mathbf{l})^{1/2}} \qquad (7.4.52)$$

is approximately distributed as t[0, 1, I(r−1)]. Now, for the data of Table 7.4.3, the distribution of w, as shown in Fig. 7.4.2, is clearly not symmetrical. An F-like distribution with a long tail to the right (such as the one shown) can, however, be transformed into a more nearly symmetrical distribution by the log transformation g(w) = log w. Noting that p(log w | y) = w p(w | y), it can be readily verified from (7.4.40) that, for r > λ, the mode ŵ of log w is a root of the cubic equation
$$\hat{w}^3 - d_2\hat{w}^2 - d_1\hat{w} - d_0 = 0, \qquad (7.4.53)$$

where the coefficients d₀, d₁, and d₂ are functions of (I, J, K, f) and of the ratios S_b/S_e and S_f/S_e; in particular,

$$d_0 = \frac{J(K-1)}{J-1}\Big(\frac{1-f}{f}\Big)^2\frac{S_b}{S_e}.$$
For the example, (7.4.53) becomes

$$\hat{w}^3 - 1.50910\,\hat{w}^2 - 0.850443\,\hat{w} - 0.200062 = 0,$$

and the appropriate root is ŵ = 1.99 to two decimal places. Using this value in (7.4.52), we find that, to this degree of approximation,

$$\frac{\phi_1 - 3.23}{0.42} \qquad \text{and} \qquad \frac{\phi_2 + 1.45}{0.42}$$

are distributed marginally as t(0, 1, 27). The curves marked (3) in Figs. 7.4.3 and 7.4.4 represent the approximating distributions. In both cases, the agreement between the exact distribution obtained by numerical integration [curve (1)] and the approximating distribution seems to be sufficiently close for any practical inferential purpose. To give further illustration of this approximation, Table 7.4.4 shows specimens of the exact and the approximate densities of φ₁ = θ₁ − θ̄ using the data in Federer (1955, pp. 419-422). In this example, J = 15, I = 10, K = 4, r = 6, λ = 2,
$$S_e = 17.878935, \qquad S_b = 1.379463, \qquad S_t = 26.996222, \qquad \hat{\phi}_1 = 3.3450,$$
and the cubic equation in (7.4.53) becomes

$$w^3 - 3.809690\, w^2 + 0.032859\, w - 0.009920 = 0,$$
of which the appropriate root is $\tilde{w} = 3.80$. The agreement between the exact and the approximate distributions is again close. While we have illustrated the use of the log modal approximation only in the one-dimensional case, i.e., for a single contrast, the close agreement in the examples shown, as well as the near elliptical shape of the contours of $p(\boldsymbol{\phi} \mid y)$ shown in Fig. 7.4.1, suggests that such approximations will also be useful in higher dimensions.
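The log-modal approximation lends itself to a short computational sketch. The following is our own minimal illustration (not part of the original text), assuming NumPy is available; the helper name is ours, and the cubic coefficients are those of the example of Table 7.4.3.

```python
# Sketch of the log-modal approximation: find the admissible root of (7.4.53),
# then use it to standardize the contrasts as in (7.4.52).
import numpy as np

def mode_of_log_w(d2, d1, d0):
    """Return the admissible root w > 1 of w**3 + d2*w**2 + d1*w + d0 = 0."""
    roots = np.roots([1.0, d2, d1, d0])
    real = roots[np.isreal(roots)].real
    return real[real > 1.0].min()      # w = 1 + K*sigma_b^2/sigma_e^2 >= 1

# Example of Table 7.4.3:  w**3 - 1.509101 w**2 - 0.850443 w - 0.200062 = 0
w_tilde = mode_of_log_w(-1.509101, -0.850443, -0.200062)
print(round(w_tilde, 2))               # -> 1.99, as in the text

# Given w_tilde, each contrast is then treated approximately as
#   (phi_i - phi_hat_i(w_tilde)) / [s^2(w_tilde)*(I-1)/I]**0.5 ~ t(0, 1, I*(r-1)),
# e.g. (phi_1 - 3.23)/0.42 ~ t(0, 1, 27) for this example.
```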
Table 7.4.4 Specimens of the exact and approximate densities of $\phi_1$: Federer's example

  φ₁       Exact       Approximate
 2.37     0.02027       0.01580
 2.47     0.04505       0.03770
 2.57     0.09397       0.08332
 2.67     0.18237       0.16919
 2.77     0.32640       0.31316
 2.87     0.53407       0.52445
 2.97     0.79236       0.78959
 3.07     1.05818       1.06303
 3.17     1.26442       1.27463
 3.27     1.34595†      1.35779†
 3.37     1.27319       1.28383
 3.47     1.06983       1.07832
 3.57     0.79989       0.80649
 3.67     0.53412       0.53922
 3.77     0.32028       0.32400
 3.87     0.17367       0.17608
 3.97     0.08586       0.08718
 4.07     0.03904       0.03964
 4.17     0.01648       0.01668

† Density at the common mode.
7.4.7 Summarized Calculations for Approximating the Posterior Distributions of the $\phi_i$

Table 7.4.5 provides a summary of the calculations needed for approximating the posterior distributions of the contrasts $\phi_1, \ldots, \phi_I$. The numerical values shown are those associated with the data in Table 7.4.3.
Table 7.4.5 Summarized calculations for approximating the posterior distribution of $\boldsymbol{\phi}$ (data of Table 7.4.3)

1. Design constants: $J = 15$, $K = 2$, $I = 3$, $r = 10$, $\lambda = 5$.

2. Averages: grand average $\bar{y}_{..} = 5.67$; averages of the observations corresponding to treatment $i$, $i = 1, \ldots, I$: $T_1 = 8.74$, $T_2 = 4.44$, $T_3 = 3.82$; averages of the block averages $\bar{y}_{.j}$ over the $r$ blocks in which the $i$th treatment appears: $B_1 = 6.16$, $B_2 = 5.67$, $B_3 = 5.08$.

3. Contrast estimates: intra-block [use (7.4.14)]: $\hat{\phi}_1 = 3.44$, $\hat{\phi}_2 = -1.75$, $\hat{\phi}_3 = -1.69$; inter-block [use (7.4.17-18)]: $\tilde{\phi}_1 = 1.96$, $\tilde{\phi}_2 = 0.37$, $\tilde{\phi}_3 = -2.34$.

4. Sums of squares: $S_t = 29.18$ [use (7.4.21)], $S_b = 49.04$ [use (7.4.15)], $S_e = 17.77$ [use (7.4.37)].

5. Modal value: $\tilde{w} = 1.99$ [use (7.4.53)].

6.-7. Substitute $\tilde{w}$ to obtain $\hat{\phi}_i(\tilde{w})$ [use (7.4.38)] and $s^2(\tilde{w})$ [use (7.4.43)].

8. Marginal posterior distribution of $\phi_i$:
$$\frac{\phi_i - \hat{\phi}_i(\tilde{w})}{[s^2(\tilde{w})(I-1)/I]^{1/2}} \sim t[0, 1, I(r-1)]; \quad \text{here} \quad \frac{\phi_i - \hat{\phi}_i(\tilde{w})}{0.42} \sim t(0, 1, 27), \quad i = 1, 2, 3.$$

9. Comparison of two treatment means, $\eta = \theta_i - \theta_{i'} = \phi_i - \phi_{i'}$:
$$\frac{\eta - [\hat{\phi}_i(\tilde{w}) - \hat{\phi}_{i'}(\tilde{w})]}{[2 s^2(\tilde{w})]^{1/2}} \sim t[0, 1, I(r-1)]; \quad \text{e.g., for } i = 1,\ i' = 2: \quad \frac{\eta - 4.68}{0.71} \sim t(0, 1, 27).$$

10. Overall comparison of the $I$ treatment means: compare the quantity (7.4.51) with the upper $100\alpha$ per cent point of $F$ with $(I-1)$ and $I(r-1)$ degrees of freedom.
7.4.8 Recovery of Interblock Information in Sampling Theory
We now consider to what extent the results in the preceding sections may be related to those of sampling theory. If the variance ratio $w = 1 + K\sigma_b^2/\sigma_e^2$ were known, then, on sampling theory, the correspondingly scaled quadratic form in the contrast estimates would be distributed as $\chi^2_{I-1}$.
(7.4.54)

Clearly, $w$ can be estimated in many different ways. In particular, the estimators of $w$ proposed by Yates and by Graybill and Weeks are based upon separate unbiased estimators of $\sigma_b^2$ and $\sigma_e^2$, which are themselves linear combinations of $(S_e, S_b, S_t)$. The appropriateness of these estimators seems questionable, since it is the ratio $w$, and not the individual components, on which inferences about the $\phi_i$ depend. Further, granted that separate unbiased estimators of $\sigma_b^2$ and $\sigma_e^2$ are of interest, presumably one should use those with minimum variance. However, it is readily verified that, among linear functions of $(S_e, S_b, S_t)$, the UMV estimators of $(\sigma_b^2, \sigma_e^2)$ depend upon the unknown $w$. Another approach would be to estimate $w$ by the method of maximum likelihood. By substituting for $S_b(\boldsymbol{\theta})$ in (7.3.6) and $S_e(\boldsymbol{\theta})$ in (7.3.12) from (7.4.13), (7.4.15) and (7.4.20), (7.4.21), respectively, and by making use of (7.4.36), it can be verified that, for $w > 1$,
$$l(w \mid y) = \max_{\sigma_e^2,\, \boldsymbol{\theta},\, \boldsymbol{\phi} \mid w} l(\sigma_e^2, \boldsymbol{\theta}, \boldsymbol{\phi}, w \mid y) \propto w^{-J/2} \left[ S_e + \frac{S_b}{w} + \left( w + \frac{\lambda I}{r} \right)^{-1} S_t \right]^{-JK/2}. \qquad (7.4.55)$$
It follows that the maximum likelihood estimator of $w$ is the root of a cubic equation, unless the maximum of $l(w \mid y)$ occurs at $w = 1$. Whether an estimator of $w$ is obtained by using separate unbiased estimators of $\sigma_b^2$ and $\sigma_e^2$ or by maximum likelihood, the resulting distributional properties of the corresponding estimator of $\phi_i$ are unknown.
In the Bayesian approach, as we have seen, an estimator $\tilde{w}$ of $w$ may be used to obtain approximations to the distributions of the $\phi_i$. However, no point estimate of $w$ is actually necessary since, given the data, the actual distribution of the $\phi_i$ can always be obtained through a single integration as in (7.4.41). The integration in (7.4.41) averages the conditional distribution $p(\boldsymbol{\phi} \mid w, y)$ over a weight function $p(w \mid y)$ which reflects the information about $w$ in the light of the data $y$ and the prior assumption. We have discussed in detail the problem of combining inter- and intra-block information about linear contrasts of treatment means $\theta_i$ only for the BIBD model. The methods can, however, be readily extended to the general linear model of (7.3.1).† In the above, the prior distribution of the $\theta_i$ was supposed locally uniform. It is often more appropriate to apply the random-effect prior of Section 7.2 to the present problem. For details of this analysis, see Afonja (1970).

† See Tiao and Draper (1968).
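The single integration just described is easy to carry out numerically. The following is a schematic sketch of our own (not from the text), assuming SciPy is available; `cond_density` and `weight` stand for the densities defined by (7.4.42) and (7.4.40), which the caller supplies.

```python
# (7.4.41)-style averaging: p(phi | y) = integral of p(phi | w, y) p(w | y) dw,
# where the weight p(w | y) is known only up to a normalizing constant.
from scipy.integrate import quad

def marginal_density(phi, cond_density, weight, w_lo=1.0, w_hi=60.0):
    z, _ = quad(weight, w_lo, w_hi)                       # normalizing constant
    val, _ = quad(lambda w: cond_density(phi, w) * weight(w), w_lo, w_hi)
    return val / z
```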
APPENDIX A7.1

Some Useful Results in Combining Quadratic Forms
We here give two useful lemmas for combining quadratic forms.

Lemma 1. Let $\mathbf{x}$, $\mathbf{a}$ and $\mathbf{b}$ be $k \times 1$ vectors, and $\mathbf{A}$ and $\mathbf{B}$ be $k \times k$ symmetric matrices such that the inverse $(\mathbf{A} + \mathbf{B})^{-1}$ exists. Then,

$$(\mathbf{x} - \mathbf{a})'\mathbf{A}(\mathbf{x} - \mathbf{a}) + (\mathbf{x} - \mathbf{b})'\mathbf{B}(\mathbf{x} - \mathbf{b}) = (\mathbf{x} - \mathbf{c})'(\mathbf{A} + \mathbf{B})(\mathbf{x} - \mathbf{c}) + (\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}), \qquad (A7.1.1)$$

where $\mathbf{c} = (\mathbf{A} + \mathbf{B})^{-1}(\mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b})$.
Proof:
$$(\mathbf{x} - \mathbf{a})'\mathbf{A}(\mathbf{x} - \mathbf{a}) + (\mathbf{x} - \mathbf{b})'\mathbf{B}(\mathbf{x} - \mathbf{b}) = \mathbf{x}'(\mathbf{A} + \mathbf{B})\mathbf{x} - 2\mathbf{x}'(\mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b}) + \mathbf{a}'\mathbf{A}\mathbf{a} + \mathbf{b}'\mathbf{B}\mathbf{b}$$
$$= \mathbf{x}'(\mathbf{A} + \mathbf{B})\mathbf{x} - 2\mathbf{x}'(\mathbf{A} + \mathbf{B})\mathbf{c} + \mathbf{c}'(\mathbf{A} + \mathbf{B})\mathbf{c} + d = (\mathbf{x} - \mathbf{c})'(\mathbf{A} + \mathbf{B})(\mathbf{x} - \mathbf{c}) + d, \qquad (A7.1.2)$$
where
$$d = \mathbf{a}'\mathbf{A}\mathbf{a} + \mathbf{b}'\mathbf{B}\mathbf{b} - \mathbf{c}'(\mathbf{A} + \mathbf{B})\mathbf{c}. \qquad (A7.1.3)$$
Now,
$$\mathbf{c}'(\mathbf{A} + \mathbf{B})\mathbf{c} = (\mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b})'(\mathbf{A} + \mathbf{B})^{-1}(\mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b}) = [\mathbf{A}(\mathbf{a} - \mathbf{b}) + (\mathbf{A} + \mathbf{B})\mathbf{b}]'(\mathbf{A} + \mathbf{B})^{-1}[(\mathbf{A} + \mathbf{B})\mathbf{a} - \mathbf{B}(\mathbf{a} - \mathbf{b})]$$
$$= -(\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}) + \mathbf{a}'\mathbf{A}\mathbf{a} + \mathbf{b}'\mathbf{B}\mathbf{b}. \qquad (A7.1.4)$$
Substituting (A7.1.4) into (A7.1.3), the lemma follows at once. Note that if both $\mathbf{A}$ and $\mathbf{B}$ have inverses, then

$$\mathbf{A}(\mathbf{A} + \mathbf{B})^{-1}\mathbf{B} = (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}. \qquad (A7.1.5)$$

It sometimes happens that we need to combine two quadratic forms for which the matrix $(\mathbf{A} + \mathbf{B})$ has no inverse. In this case, Lemma 1 may be modified as follows:

Lemma 2. Let $\mathbf{x}$, $\mathbf{a}$ and $\mathbf{b}$ be $k \times 1$ vectors and $\mathbf{A}$ and $\mathbf{B}$ be two $k \times k$ positive semidefinite symmetric matrices. Suppose the rank of the matrix $\mathbf{A} + \mathbf{B}$ is $q\,(< k)$. Then, subject to the constraints $\mathbf{G}\mathbf{x} = \mathbf{0}$,

$$(\mathbf{x} - \mathbf{a})'\mathbf{A}(\mathbf{x} - \mathbf{a}) + (\mathbf{x} - \mathbf{b})'\mathbf{B}(\mathbf{x} - \mathbf{b}) = (\mathbf{x} - \mathbf{c}^*)'(\mathbf{A} + \mathbf{B} + \mathbf{M})(\mathbf{x} - \mathbf{c}^*) + (\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}), \qquad (A7.1.6)$$

where $\mathbf{G}$ is any $(k - q) \times k$ matrix of rank $k - q$ such that the rows of $\mathbf{G}$ are linearly independent of the rows of $\mathbf{A} + \mathbf{B}$, $\mathbf{M} = \mathbf{G}'\mathbf{G}$, and $\mathbf{c}^* = (\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b})$.
Proof: We shall first prove that

$$\mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{A} = \mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B} = \mathbf{0}. \qquad (A7.1.7)$$

Since $(\mathbf{A} + \mathbf{B})$ is of rank $q$, $\mathbf{G}$ is of rank $k - q$, and the rows of $\mathbf{G}$ are linearly independent of the rows of $(\mathbf{A} + \mathbf{B})$, there exists a $(k - q) \times k$ matrix $\mathbf{U}$ of rank $(k - q)$ such that $\mathbf{U}\mathbf{G}'$ is non-singular and

$$\mathbf{U}(\mathbf{A} + \mathbf{B}) = \mathbf{0}. \qquad (A7.1.8)$$

Now, $\mathbf{U}(\mathbf{A} + \mathbf{B} + \mathbf{M})(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1} = \mathbf{U}$, so that, from (A7.1.8),

$$\mathbf{U}\mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1} = \mathbf{U}.$$

Postmultiplying both sides by $(\mathbf{A} + \mathbf{B})$, we get

$$\mathbf{U}\mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A} + \mathbf{B}) = \mathbf{0}. \qquad (A7.1.9)$$

Since $\mathbf{M} = \mathbf{G}'\mathbf{G}$ and $\mathbf{U}\mathbf{G}'$ is non-singular, (A7.1.9) implies that $\mathbf{G}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A} + \mathbf{B}) = \mathbf{0}$, so that

$$\mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A} + \mathbf{B}) = \mathbf{0}. \qquad (A7.1.10a)$$

Postmultiplying both sides of (A7.1.10a) by $(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{M}$, we obtain

$$\mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{A}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{M} + \mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{M} = \mathbf{0}. \qquad (A7.1.10b)$$
Since $\mathbf{A}$ and $\mathbf{B}$ are assumed positive semidefinite, it follows that both terms on the left of (A7.1.10b) are positive semidefinite matrices. Thus the equality implies that both must be null matrices. Writing $\mathbf{A} = \mathbf{C}'\mathbf{C}$ and $\mathbf{B} = \mathbf{D}'\mathbf{D}$, where $\mathbf{C}$ and $\mathbf{D}$ are $k \times k$ matrices, we must then have $\mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{C}' = \mathbf{M}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{D}' = \mathbf{0}$, and the assertion (A7.1.7) follows. We now prove (A7.1.6). Using the fact that $\mathbf{M}\mathbf{x} = \mathbf{0}$, we have
$$(\mathbf{x} - \mathbf{a})'\mathbf{A}(\mathbf{x} - \mathbf{a}) + (\mathbf{x} - \mathbf{b})'\mathbf{B}(\mathbf{x} - \mathbf{b}) = \mathbf{x}'(\mathbf{A} + \mathbf{B} + \mathbf{M})\mathbf{x} - 2\mathbf{x}'(\mathbf{A} + \mathbf{B} + \mathbf{M})\mathbf{c}^* + \mathbf{c}^{*\prime}(\mathbf{A} + \mathbf{B} + \mathbf{M})\mathbf{c}^* + d_1$$
$$= (\mathbf{x} - \mathbf{c}^*)'(\mathbf{A} + \mathbf{B} + \mathbf{M})(\mathbf{x} - \mathbf{c}^*) + d_1, \qquad (A7.1.11)$$

where
$$d_1 = \mathbf{a}'\mathbf{A}\mathbf{a} + \mathbf{b}'\mathbf{B}\mathbf{b} - \mathbf{c}^{*\prime}(\mathbf{A} + \mathbf{B} + \mathbf{M})\mathbf{c}^*. \qquad (A7.1.12)$$
Now,
$$\mathbf{c}^{*\prime}(\mathbf{A} + \mathbf{B} + \mathbf{M})\mathbf{c}^* = (\mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b})'(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b}) = [\mathbf{A}(\mathbf{a} - \mathbf{b}) + (\mathbf{A} + \mathbf{B})\mathbf{b}]'(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}[(\mathbf{A} + \mathbf{B})\mathbf{a} - \mathbf{B}(\mathbf{a} - \mathbf{b})]$$
$$= -(\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}) + \mathbf{b}'(\mathbf{A} + \mathbf{B})(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A} + \mathbf{B})\mathbf{a}$$
$$\quad + (\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A} + \mathbf{B})\mathbf{a} - \mathbf{b}'(\mathbf{A} + \mathbf{B})(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}). \qquad (A7.1.13)$$
Using (A7.1.10a), the second term on the extreme right of (A7.1.13) becomes
$$\mathbf{b}'(\mathbf{A} + \mathbf{B})(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A} + \mathbf{B})\mathbf{a} = \mathbf{b}'(\mathbf{A} + \mathbf{B})\mathbf{a}. \qquad (A7.1.14)$$

Applying (A7.1.7), the third and fourth terms are, respectively,
$$(\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}(\mathbf{A} + \mathbf{B})\mathbf{a} = (\mathbf{a} - \mathbf{b})'\mathbf{A}\mathbf{a}, \qquad (A7.1.15)$$
$$-\mathbf{b}'(\mathbf{A} + \mathbf{B})(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}) = -(\mathbf{a} - \mathbf{b})'\mathbf{B}\mathbf{b}. \qquad (A7.1.16)$$
Substituting (A7.1.14-16) into (A7.1.13), we get
$$\mathbf{c}^{*\prime}(\mathbf{A} + \mathbf{B} + \mathbf{M})\mathbf{c}^* = -(\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}) + \mathbf{a}'\mathbf{A}\mathbf{a} + \mathbf{b}'\mathbf{B}\mathbf{b},$$
so that
$$d_1 = (\mathbf{a} - \mathbf{b})'\mathbf{A}(\mathbf{A} + \mathbf{B} + \mathbf{M})^{-1}\mathbf{B}(\mathbf{a} - \mathbf{b}),$$
and the lemma is proved.
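Because Lemma 1 is purely algebraic, it is easily checked numerically. The following quick verification is our own (not part of the text), assuming NumPy is available.

```python
# Numerical check of Lemma 1, (A7.1.1), for random positive definite A and B.
import numpy as np

rng = np.random.default_rng(0)
k = 4
A = (lambda M: M @ M.T + k * np.eye(k))(rng.standard_normal((k, k)))
B = (lambda M: M @ M.T + k * np.eye(k))(rng.standard_normal((k, k)))
a, b, x = rng.standard_normal((3, k))

c = np.linalg.solve(A + B, A @ a + B @ b)               # c = (A+B)^{-1}(Aa + Bb)
lhs = (x - a) @ A @ (x - a) + (x - b) @ B @ (x - b)
rhs = (x - c) @ (A + B) @ (x - c) \
      + (a - b) @ A @ np.linalg.solve(A + B, B @ (a - b))
assert np.isclose(lhs, rhs)                             # (A7.1.1) holds
```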
CHAPTER 8

SOME ASPECTS OF MULTIVARIATE ANALYSIS

8.1 INTRODUCTION
In all the problems we have so far considered, observations are made of a single unidimensional response or output $y$. The inference problems that result are called univariate problems. In this and the next chapter, we shall consider problems which arise when the output is multidimensional. Thus, in the study of a chemical process, at each experimental setting one might observe yield $y_1$, density $y_2$, and color $y_3$ of the product. Similarly, in a study of consumer behavior, for each household one might record spending on food $y_1$, spending on durables $y_2$, and spending on travel and entertainment $y_3$. We would then say that a three-dimensional output or response is observed. Inference problems which arise in the analysis of such data are called multivariate. In this chapter, we shall begin by reviewing some univariate problems in a general setting which can be easily extended to the multivariate case.

8.1.1 A General Univariate Model
It is often desired to make inferences about parameters $\theta_1, \ldots, \theta_k$ contained in the relationship between a single observed output variable or response $y$, subject to error, and $p$ input variables $\xi_1, \ldots, \xi_p$ whose values are assumed exactly known. It should be understood that the inputs could include qualitative as well as quantitative variables. For example, $\xi_i$ might take values of 0 or 1 depending on whether some particular quality was absent or present, in which case $\xi_i$ is called an indicator variable or, less appropriately, a dummy variable.

The Design Matrix
Suppose, in an investigation, $n$ experimental "runs" are made, and the $u$th run consists of making an observation $y_u$ at some fixed set of input conditions $\boldsymbol{\xi}_u' = (\xi_{u1}, \xi_{u2}, \ldots, \xi_{up})$. The $n \times p$ design matrix

$$\boldsymbol{\xi} = \begin{bmatrix} \boldsymbol{\xi}_1' \\ \vdots \\ \boldsymbol{\xi}_u' \\ \vdots \\ \boldsymbol{\xi}_n' \end{bmatrix} = \begin{bmatrix} \xi_{11} & \xi_{12} & \cdots & \xi_{1p} \\ \vdots & \vdots & & \vdots \\ \xi_{u1} & \xi_{u2} & \cdots & \xi_{up} \\ \vdots & \vdots & & \vdots \\ \xi_{n1} & \xi_{n2} & \cdots & \xi_{np} \end{bmatrix} \qquad (8.1.1)$$

lists the $p$ input conditions to be used in each of the $n$ projected runs, and the $u$th row of $\boldsymbol{\xi}$ is the vector $\boldsymbol{\xi}_u'$. The phraseology "experimental run", "experimental design" is most natural in a situation in which a scientific experiment is being conducted and in which the levels of the inputs are at our choice. In some applications, however, and particularly in economic studies, it is often impossible to choose the experimental conditions. We have only historical data generated for us in circumstances beyond our control and often in a manner we would not choose. It is convenient here to extend the terminologies "experimental run" and "experimental design" to include experiments designed by nature, but we must, of course, bear in mind the limitations of such historical data. To obtain a mathematical model for our set-up we need to link the $n$ observations $\mathbf{y}' = (y_1, \ldots, y_n)$ with the inputs $\boldsymbol{\xi}$. This we do by defining two functions, called respectively the expectation function and the error function.
The Expectation Function

The expected value $E(y_u)$ of the output from the $u$th run is assumed to be a known function $\eta_u$ of the $p$ fixed inputs $\boldsymbol{\xi}_u$ employed during that run, involving $k$ unknown parameters $\boldsymbol{\theta}' = (\theta_1, \ldots, \theta_k)$,

$$\eta_u = \eta(\boldsymbol{\xi}_u, \boldsymbol{\theta}). \qquad (8.1.2)$$

The vector valued function $\boldsymbol{\eta} = \boldsymbol{\eta}(\boldsymbol{\xi}, \boldsymbol{\theta})$, $\boldsymbol{\eta}' = (\eta_1, \ldots, \eta_u, \ldots, \eta_n)$, is called the expectation function.
The Error Function

The expectation function links $E(y_u)$ to $\boldsymbol{\xi}_u$ and $\boldsymbol{\theta}$. We now have to link $y_u$ to $E(y_u) = \eta_u$. This is done by means of an error distribution function in $\boldsymbol{\varepsilon}' = (\varepsilon_1, \ldots, \varepsilon_n)$. The $n$ experimental errors $\boldsymbol{\varepsilon} = \mathbf{y} - \boldsymbol{\eta}$ which occur in making the runs are assumed to be random variables having zero means but in general not necessarily independently or Normally distributed. We denote the density function of the $n$ errors by $p(\boldsymbol{\varepsilon} \mid \boldsymbol{\pi})$, where $\boldsymbol{\pi}$ is a set of error distribution parameters whose values are in general unknown. Finally, then, the output in the form of the $n$ observations $\mathbf{y}$ and the input in the form of the $n$ sets of conditions $\boldsymbol{\xi}$ are linked together by a mathematical model containing the error function and the expectation function as follows:

$$\boldsymbol{\xi} \;\longrightarrow\; \boldsymbol{\eta} = \boldsymbol{\eta}(\boldsymbol{\xi}, \boldsymbol{\theta}) \;\longrightarrow\; p(\mathbf{y} - \boldsymbol{\eta} \mid \boldsymbol{\pi}) \;\longrightarrow\; \mathbf{y}. \qquad (8.1.3)$$

This model involves a function

$$f(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\pi}, \boldsymbol{\xi}) \qquad (8.1.4)$$

of the observations $\mathbf{y}$, the parameters $\boldsymbol{\theta}$ of the expectation function, the parameters $\boldsymbol{\pi}$ of the error distribution, and the design $\boldsymbol{\xi}$.
Data Generation Model

If we knew $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$ and the design $\boldsymbol{\xi}$, we could use the function (8.1.4) to calculate the probability density associated with any particular set of data $\mathbf{y}$. This data generation model (which might, for example, be directly useful for simulation and Monte-Carlo studies) is the function $f(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\pi}, \boldsymbol{\xi})$ with $\boldsymbol{\theta}$, $\boldsymbol{\pi}$ and $\boldsymbol{\xi}$ held fixed, and we denote it by

$$p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\pi}, \boldsymbol{\xi}) = f(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\pi}, \boldsymbol{\xi}), \qquad (8.1.5)$$

which emphasizes that the density is a function of $\mathbf{y}$ alone for fixed $\boldsymbol{\theta}$, $\boldsymbol{\pi}$, and $\boldsymbol{\xi}$.
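To make the simulation use of (8.1.5) concrete, here is a small sketch of our own (not from the text), with invented names, assuming NumPy is available and taking — purely for illustration — independent Normal errors with standard deviation sigma, so that $\boldsymbol{\pi} = \sigma$.

```python
# Drawing simulated data sets y from the data-generation model (8.1.5),
# given a fixed expectation function eta, parameters theta, and design xi.
import numpy as np

def simulate(eta, theta, xi, sigma, n_draws, rng=None):
    rng = rng or np.random.default_rng()
    mean = eta(xi, theta)                     # expectation function, eq. (8.1.2)
    return mean + sigma * rng.standard_normal((n_draws, mean.size))

# e.g. an exponential-decay expectation function at six design points:
ys = simulate(lambda xi, th: np.exp(-th * xi), 0.2,
              np.array([0.5, 1, 2, 4, 8, 16]), sigma=0.02, n_draws=1000)
```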
The Likelihood Function and the Posterior Distribution

In ordinary statistical practice, we are not directly interested in probabilities associated with various sets of data, given fixed values of the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$. On the contrary, we are concerned with the probabilities associated with various sets of parameter values, given a fixed set of data which is known to have occurred. After an experiment has been performed, $\mathbf{y}$ is known and fixed (as is $\boldsymbol{\xi}$), but $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$ are unknown. The likelihood function has the same form as (8.1.4), but in it $\mathbf{y}$ and $\boldsymbol{\xi}$ are fixed and $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$ are now regarded as the variables. Thus, the likelihood may be written

$$l(\boldsymbol{\theta}, \boldsymbol{\pi} \mid \mathbf{y}, \boldsymbol{\xi}) = f(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\pi}, \boldsymbol{\xi}). \qquad (8.1.6)$$

In what follows we usually omit specific note of the dependence on $\boldsymbol{\xi}$ and write $l(\boldsymbol{\theta}, \boldsymbol{\pi} \mid \mathbf{y}, \boldsymbol{\xi})$ as $l(\boldsymbol{\theta}, \boldsymbol{\pi} \mid \mathbf{y})$.
In the Bayesian framework, inferences about $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$ can be made by suitable study of the posterior distribution $p(\boldsymbol{\theta}, \boldsymbol{\pi} \mid \mathbf{y})$ of $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$, obtained by combining the likelihood with the appropriate prior distribution $p(\boldsymbol{\theta}, \boldsymbol{\pi})$,

$$p(\boldsymbol{\theta}, \boldsymbol{\pi} \mid \mathbf{y}) \propto l(\boldsymbol{\theta}, \boldsymbol{\pi} \mid \mathbf{y})\, p(\boldsymbol{\theta}, \boldsymbol{\pi}). \qquad (8.1.7)$$
An example in which the expectation function is nonlinear and the error distribution is non-Normal was given in Section 3.5. In this chapter, we shall from now on assume Normality but will extend our general model to cover multivariate problems.

8.2 A GENERAL MULTIVARIATE NORMAL MODEL
Suppose now that a number of output responses are measured in each experimental run. Thus, in a chemical experiment, at each setting of the process conditions $\xi_1$ = temperature and $\xi_2$ = concentration, observations might be made on the output responses $y_1$ = yield of product A, $y_2$ = yield of product B, and $y_3$ = yield of product C. In general, then, from each experimental run the $m$-variate observation $\mathbf{y}_{(u)}' = (y_{u1}, \ldots, y_{um})$ would be available. There would now be $m$ expectation functions

$$\eta_{ui} = \eta_i(\boldsymbol{\xi}_{ui}, \boldsymbol{\theta}_i), \qquad i = 1, \ldots, m, \qquad (8.2.1)$$

where $\boldsymbol{\xi}_{ui}$ would contain $p_i$ elements $(\xi_{u1i}, \ldots, \xi_{usi}, \ldots, \xi_{up_i i})$ and $\boldsymbol{\theta}_i$ would contain $k_i$ elements $(\theta_{1i}, \ldots, \theta_{gi}, \ldots, \theta_{k_i i})$. The expectation functions $\eta_{ui}$ might be linear or nonlinear both in the parameters $\boldsymbol{\theta}_i$ and in the inputs $\boldsymbol{\xi}_{ui}$. Also, depending on the problem, some or all of the $p_i$ elements of $\boldsymbol{\xi}_{ui}$ might be the same as those of $\boldsymbol{\xi}_{uj}$, and some or all of the elements of $\boldsymbol{\theta}_i$ might be the same as those of $\boldsymbol{\theta}_j$. That is to say, a given output would involve certain inputs and certain parameters which might or might not be shared by other outputs.
8.2.1 The Likelihood Function

Let us now consider the problem of making inferences about the $\boldsymbol{\theta}_i$ for a set of $n$ $m$-variate observations. We assume that the error vector

$$\boldsymbol{\varepsilon}_{(u)}' = (\varepsilon_{u1}, \ldots, \varepsilon_{um}), \qquad u = 1, \ldots, n, \qquad (8.2.2)$$

is, for given $\boldsymbol{\theta}$ and $\boldsymbol{\Sigma}$, distributed as the $m$-variate Normal $N_m(\mathbf{0}, \boldsymbol{\Sigma})$, and that the runs are made in such a way that it can be assumed that from run to run the observations are independent. Thus, in terms of the general framework of (8.1.4), $\boldsymbol{\Sigma} = \boldsymbol{\pi}$ are the parameters of an error distribution which is multivariate Normal. We first derive some very general results which apply to any model of this type, and then consider in detail the various important special cases that emerge if the expectation functions are supposed linear in the parameters $\boldsymbol{\theta}_i$. The joint distribution of the $n$ vectors of errors $\boldsymbol{\varepsilon} = (\boldsymbol{\varepsilon}_{(1)}, \ldots, \boldsymbol{\varepsilon}_{(u)}, \ldots, \boldsymbol{\varepsilon}_{(n)})'$ is

$$p(\boldsymbol{\varepsilon} \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}) = \prod_{u=1}^{n} p(\boldsymbol{\varepsilon}_{(u)} \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}) = (2\pi)^{-mn/2}\, |\boldsymbol{\Sigma}|^{-n/2} \exp\Bigl( -\tfrac{1}{2} \sum_{u=1}^{n} \boldsymbol{\varepsilon}_{(u)}' \boldsymbol{\Sigma}^{-1} \boldsymbol{\varepsilon}_{(u)} \Bigr),$$
$$-\infty < \varepsilon_{ui} < \infty, \quad i = 1, \ldots, m, \quad u = 1, \ldots, n, \qquad (8.2.3)$$
where $\boldsymbol{\Sigma} = \{\sigma_{ij}\}$ is the $m \times m$ covariance matrix, $\boldsymbol{\Sigma}^{-1} = \{\sigma^{ij}\}$ its inverse, and $\boldsymbol{\theta}$ refers to the complete set of all the $(k_1 + \cdots + k_m)$ parameters $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_m$. Denoting by $\mathbf{S}(\boldsymbol{\theta})$ the $m \times m$ symmetric matrix with elements

$$S_{ij}(\boldsymbol{\theta}_i, \boldsymbol{\theta}_j) = \sum_{u=1}^{n} \varepsilon_{ui}\, \varepsilon_{uj} = \sum_{u=1}^{n} [y_{ui} - \eta_i(\boldsymbol{\xi}_{ui}, \boldsymbol{\theta}_i)]\,[y_{uj} - \eta_j(\boldsymbol{\xi}_{uj}, \boldsymbol{\theta}_j)], \qquad i, j = 1, \ldots, m, \qquad (8.2.4)$$

the exponent in (8.2.3) can then be expressed as

$$\sum_{u=1}^{n} \boldsymbol{\varepsilon}_{(u)}' \boldsymbol{\Sigma}^{-1} \boldsymbol{\varepsilon}_{(u)} = \operatorname{tr} \mathbf{S}(\boldsymbol{\theta}) \boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{m} \sum_{j=1}^{m} \sigma^{ij}\, S_{ij}(\boldsymbol{\theta}_i, \boldsymbol{\theta}_j), \qquad (8.2.5)$$

where $\operatorname{tr} \mathbf{A}$ means the trace of the matrix $\mathbf{A}$. Given the observations, the likelihood function can thus be written

$$l(\boldsymbol{\theta}, \boldsymbol{\Sigma} \mid \mathbf{y}) \propto p(\boldsymbol{\varepsilon} \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}) \propto |\boldsymbol{\Sigma}|^{-n/2} \exp\bigl[ -\tfrac{1}{2} \operatorname{tr} \boldsymbol{\Sigma}^{-1} \mathbf{S}(\boldsymbol{\theta}) \bigr]. \qquad (8.2.6)$$

To clarify the notation, we emphasize that $\mathbf{y}$ refers to the $n \times m$ matrix of observations

$$\mathbf{y} = \begin{bmatrix} y_{11} & \cdots & y_{1i} & \cdots & y_{1m} \\ \vdots & & \vdots & & \vdots \\ y_{u1} & \cdots & y_{ui} & \cdots & y_{um} \\ \vdots & & \vdots & & \vdots \\ y_{n1} & \cdots & y_{ni} & \cdots & y_{nm} \end{bmatrix} = [\mathbf{y}_1, \ldots, \mathbf{y}_i, \ldots, \mathbf{y}_m] = \begin{bmatrix} \mathbf{y}_{(1)}' \\ \vdots \\ \mathbf{y}_{(u)}' \\ \vdots \\ \mathbf{y}_{(n)}' \end{bmatrix},$$

where $\mathbf{y}_i = (y_{1i}, \ldots, y_{ni})'$ is the vector of $n$ observations corresponding to the $i$th response and $\mathbf{y}_{(u)} = (y_{u1}, \ldots, y_{um})'$ is the vector of $m$ observations of the $u$th experimental run. Similarly, $\boldsymbol{\varepsilon}$ refers to the $n \times m$ matrix of errors

$$\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_{11} & \cdots & \varepsilon_{1i} & \cdots & \varepsilon_{1m} \\ \vdots & & \vdots & & \vdots \\ \varepsilon_{u1} & \cdots & \varepsilon_{ui} & \cdots & \varepsilon_{um} \\ \vdots & & \vdots & & \vdots \\ \varepsilon_{n1} & \cdots & \varepsilon_{ni} & \cdots & \varepsilon_{nm} \end{bmatrix} = [\boldsymbol{\varepsilon}_1, \ldots, \boldsymbol{\varepsilon}_i, \ldots, \boldsymbol{\varepsilon}_m] = \begin{bmatrix} \boldsymbol{\varepsilon}_{(1)}' \\ \vdots \\ \boldsymbol{\varepsilon}_{(u)}' \\ \vdots \\ \boldsymbol{\varepsilon}_{(n)}' \end{bmatrix}.$$
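In this matrix notation, (8.2.4)-(8.2.6) become one-line computations. The following direct transcription is ours (not from the text), assuming NumPy is available; `Y` and `ETA` denote the $n \times m$ matrices of observations and expectations.

```python
# S(theta) = E'E with E = Y - ETA, eq. (8.2.4); the likelihood exponent of
# (8.2.6) is tr[Sigma^{-1} S(theta)].
import numpy as np

def S_matrix(Y, ETA):
    E = Y - ETA                   # n x m matrix of errors
    return E.T @ E                # m x m; element (i, j) is S_ij of (8.2.4)

def exponent(Y, ETA, Sigma):
    return np.trace(np.linalg.solve(Sigma, S_matrix(Y, ETA)))
```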
8.2.2 Prior Distribution of $(\boldsymbol{\theta}, \boldsymbol{\Sigma})$

For the prior distribution of the parameters $(\boldsymbol{\theta}, \boldsymbol{\Sigma})$, we shall first of all assume that $\boldsymbol{\theta}$ and $\boldsymbol{\Sigma}$ are approximately independent, so that

$$p(\boldsymbol{\theta}, \boldsymbol{\Sigma}) \doteq p(\boldsymbol{\theta})\, p(\boldsymbol{\Sigma}). \qquad (8.2.7)$$
We shall further suppose that the parameterization in terms of $\boldsymbol{\theta}$ is so chosen that it is appropriate to take $\boldsymbol{\theta}$ as locally uniform,†

$$p(\boldsymbol{\theta}) \propto \text{constant}. \qquad (8.2.8)$$

For the prior distribution of the $\tfrac{1}{2}m(m+1)$ distinct elements of $\boldsymbol{\Sigma}$, application of the argument in Section 1.3 for the multiparameter situation leads to the noninformative reference prior

$$p(\boldsymbol{\Sigma}^{-1}) \propto |\boldsymbol{\Sigma}^{-1}|^{-\frac{1}{2}(m+1)}. \qquad (8.2.9)$$

Now,

$$p(\boldsymbol{\Sigma}) = p(\boldsymbol{\Sigma}^{-1})\, J, \qquad (8.2.10)$$

where

$$J = \left| \frac{\partial \boldsymbol{\Sigma}^{-1}}{\partial \boldsymbol{\Sigma}} \right| \qquad (8.2.11)$$

is the Jacobian of the transformation from the elements $\sigma^{ij}$ of $\boldsymbol{\Sigma}^{-1}$ to the elements $\sigma_{ij}$ of $\boldsymbol{\Sigma}$. It is shown in Appendix A8.2 that

$$|\boldsymbol{\Sigma}^{-1}|^{-\frac{1}{2}(m+1)} \propto |\boldsymbol{\Sigma}|^{\frac{1}{2}(m+1)} \qquad (8.2.12)$$

and that

$$\left| \frac{\partial \boldsymbol{\Sigma}^{-1}}{\partial \boldsymbol{\Sigma}} \right| = |\boldsymbol{\Sigma}|^{-(m+1)}. \qquad (8.2.13)$$

Thus,

$$p(\boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-\frac{1}{2}(m+1)}. \qquad (8.2.14)$$

In the special case $m = 1$, (8.2.14) reduces to

$$p(\sigma_{11}) \propto \frac{1}{\sigma_{11}}, \qquad (8.2.15)$$

which coincides with the usual assumption concerning a noninformative prior distribution for a single variance. Another special case of interest is when $\sigma_{ij} = 0$ if $i \neq j$. In this case, the errors $(\varepsilon_{u1}, \ldots, \varepsilon_{um})$ are uncorrelated, and the same argument leads to

$$p(\boldsymbol{\Sigma} \mid \sigma_{ij} = 0,\, i \neq j) = p(\sigma_{11}, \ldots, \sigma_{mm}) \propto \prod_{i=1}^{m} \sigma_{ii}^{-1}. \qquad (8.2.16)$$
† As we have mentioned earlier, when the parameter space is of high dimension, the use of the locally uniform prior may be inappropriate, and more careful consideration should be given to the structure of the model in selecting a noninformative prior.
8.2.3 Posterior Distribution of $(\boldsymbol{\theta}, \boldsymbol{\Sigma})$

Using (8.2.6), (8.2.8), and (8.2.14), the joint posterior distribution of $(\boldsymbol{\theta}, \boldsymbol{\Sigma})$ is

$$p(\boldsymbol{\theta}, \boldsymbol{\Sigma} \mid \mathbf{y}) \propto |\boldsymbol{\Sigma}|^{-\frac{1}{2}(n+m+1)} \exp\bigl[ -\tfrac{1}{2} \operatorname{tr} \boldsymbol{\Sigma}^{-1} \mathbf{S}(\boldsymbol{\theta}) \bigr], \qquad -\infty < \boldsymbol{\theta} < \infty, \quad \boldsymbol{\Sigma} > 0, \qquad (8.2.17)$$

where the notation $-\infty < \boldsymbol{\theta} < \infty$ means that each element of the set of parameters $\boldsymbol{\theta}$ can vary from $-\infty$ to $\infty$, and the notation $\boldsymbol{\Sigma} > 0$ means that the $\tfrac{1}{2}m(m+1)$ elements $\sigma_{ij}$ are such that the random matrix $\boldsymbol{\Sigma}$ is positive definite. It is sometimes convenient to work with the elements of $\boldsymbol{\Sigma}^{-1} = \{\sigma^{ij}\}$ rather than the elements of $\boldsymbol{\Sigma}$. Since

$$p(\boldsymbol{\theta}, \boldsymbol{\Sigma}^{-1} \mid \mathbf{y}) = p(\boldsymbol{\theta}, \boldsymbol{\Sigma} \mid \mathbf{y}) \left| \frac{\partial \boldsymbol{\Sigma}}{\partial \boldsymbol{\Sigma}^{-1}} \right|, \qquad (8.2.18)$$

it follows from (8.2.13) that the posterior distribution of $(\boldsymbol{\theta}, \boldsymbol{\Sigma}^{-1})$ is

$$p(\boldsymbol{\theta}, \boldsymbol{\Sigma}^{-1} \mid \mathbf{y}) \propto |\boldsymbol{\Sigma}^{-1}|^{\frac{1}{2}(n-m-1)} \exp\bigl[ -\tfrac{1}{2} \operatorname{tr} \boldsymbol{\Sigma}^{-1} \mathbf{S}(\boldsymbol{\theta}) \bigr], \qquad -\infty < \boldsymbol{\theta} < \infty, \quad \boldsymbol{\Sigma}^{-1} > 0. \qquad (8.2.19)$$

8.2.4 The Wishart Distribution

We now introduce a distribution which is basic in Normal theory multivariate problems. Let $\mathbf{Z}$ be an $m \times m$ positive definite symmetric random matrix which consists of $\tfrac{1}{2}m(m+1)$ distinct random variables $z_{ij}$ ($i, j = 1, \ldots, m$; $i \geqslant j$). Let $q > 0$, and let $\mathbf{B}$ be an $m \times m$ positive definite symmetric matrix of fixed constants. The distribution of the $z_{ij}$,

$$p(\mathbf{Z}) \propto |\mathbf{Z}|^{\frac{1}{2}q - 1} \exp\bigl( -\tfrac{1}{2} \operatorname{tr} \mathbf{Z}\mathbf{B} \bigr), \qquad \mathbf{Z} > 0, \qquad (8.2.20)$$
f
IZI ~q-I exp (-1 tr ZB) dZ = IBI -l (q , no -
I)
2t m(q +m-l)
r",
(q -t- n~ -
I)
z >o
(8.2 .21)
where fp(b) is the generalized gamma function, Siegel (\935) p-l h> - - . (8.2.22) 2
We shall denote the distribution (8.2 .20) by w'n (B- t , q) and say that Z is distributed as Wishart with q degrees of freedom and parameter matrix B I. For a discussion of the properties of the Wishart distribution, see for example Anderson (195H). Note carefully that the parameterization used in (8.2.20) is different from the one used in Anderson in one respect. In his notation, the
428
Some Aspects of Multh'ariate Analysis
8.2
distribution in (8.2 .20) is denoted as W (B- 1 , v) where v = q + m - 1 is said to be the degrees of freedom. As an application of the Wishart distribution, we see in (8.2.19) that, given 9, ~ - t is distributed as Wm[S -1 (9), 11 - m + I J provided 11 ~ m . 8.2.5 Posterior Distribution of 9
Using the identity (8.2 .21), we immediately obtain from (8.2.19) the marginal posterior distribution of 9 as p(91 y) ex IS(9)I-n/2,
-
< 9<
00
00,
(8.2.23)
provided 11 ~ m. This extremely simple result is remarkable because of its generality. It will be noted that to reach it we have not had to assume either : a) that any of the input variables ~II; were or were not common to more than one output, or b) that the parameters 9; were or were not common to more than One output, or c) that the expectation functions were linear or were nonlinear in the parameters. This generality may be contrasted with the specification needed to obtain "nice" sampling theory results. For example, a common formulation assumes that the SII; are common, that the 9; are 110t. and that the expectation functions are all linear in the parameters. In the special case in which there is only one output response y, (8 .2.23) reduces to p(91 y) ex [S(9)rnl,
-
00
< 9<
00,
(8.2.24)
with S(9) = L:7. I [YII - IJ(S", 9)J2. As we have seen , this result can be regarded as supplying a Bayesian justification of least squares, since the modal values of 9 (those associated with maximum posterior density) are those which minimize S. The general result (8.2.23) supplies then, among other things, an appropriate Bayesian multivariate generalization of least squares . The "most probable" values of 9 being simply those which minimi ze the determinant IS(9)1. Finally, in the special case (J;j = 0, i =1=), combining (8.2.16) with (8.2.6) and integrating out (Jll' .. . , (JIMIn yields m
p(91 y) ex
fl
[S;;(9;)r n ' 2,
-
00
<9<
00.
(8.2.25)
8.2.6 Estimation of Common Parameters in a Nonlinear Multivariate Model
We now illustrate the general applicability of the result (8 .2.23) by considering an example in which:
a) certain of the $\theta$'s are common to more than one output, and
b) the expectation functions are nonlinear in the parameters.
Fig. 8.2.1 Diagrammatic representation of a system A → B → C.
Suppose we have the consecutive system indicated in Fig. 8.2.1, which shows water running from a tank A, via a tap opened an amount $\phi_1$, into a tank B, which in turn runs into a tank C via a tap opened an amount $\phi_2$. If $\eta_1$, $\eta_2$ and $\eta_3$ are the proportions of A, B, and C present at time $\xi$, with initial conditions $(\eta_1 = 1, \eta_2 = 0, \eta_3 = 0)$, the system can be described by the differential equations
$$\frac{d\eta_1}{d\xi} = -\phi_1 \eta_1, \qquad \frac{d\eta_2}{d\xi} = \phi_1 \eta_1 - \phi_2 \eta_2, \qquad \frac{d\eta_3}{d\xi} = \phi_2 \eta_2. \qquad (8.2.26)$$
Systems of this kind have many applications in engineering and in the physical and biological sciences. In particular, equation (8.2.26) could represent a consecutive first-order chemical reaction in which a substance A decomposes to form B, which in turn decomposes to form C. The responses $\eta_1$, $\eta_2$, $\eta_3$ would then be the mole fractions of A, B, and C present at time $\xi$, and the quantities $\phi_1$ and $\phi_2$ would be rate constants associated with the first and second decompositions, which would normally have to be estimated from data. If we denote by $y_1$, $y_2$, and $y_3$ the observed values of $\eta_1$, $\eta_2$, and $\eta_3$, then, on integration of (8.2.26), we have the expectation functions
$$E(y_1) = \eta_1 = e^{-\phi_1 \xi}, \qquad (8.2.27a)$$
$$E(y_2) = \eta_2 = \frac{\phi_1 \bigl( e^{-\phi_1 \xi} - e^{-\phi_2 \xi} \bigr)}{\phi_2 - \phi_1}, \qquad (8.2.27b)$$
$$E(y_3) = \eta_3 = 1 + \frac{-\phi_2 e^{-\phi_1 \xi} + \phi_1 e^{-\phi_2 \xi}}{\phi_2 - \phi_1}, \qquad (8.2.27c)$$
and it is to be noted that, for all $\xi$,
$$\eta_1 + \eta_2 + \eta_3 \equiv 1. \qquad (8.2.27d)$$
Observations on $y_1$ could yield information only on $\phi_1$, but observations on $y_2$ and $y_3$ could each provide information on both $\phi_1$ and $\phi_2$. If measurements of more than one of the quantities $(y_1, y_2, y_3)$ were available, we should certainly expect to be able to estimate the parameters more precisely. The Bayesian approach allows us to pool the information from $(y_1, y_2, y_3)$ and makes it easy
Table 8.2.1 Observations on the yield of three substances in a chemical reaction

 Time    Yield of A   Yield of B   Yield of C
  ξ_u      y_{1u}       y_{2u}       y_{3u}
  ½        0.959        0.025        0.028
  ½        0.914        0.061        0.000
  1        0.855        0.152        0.068
  1        0.785        0.197        0.096
  2        0.628        0.130        0.090
  2        0.617        0.249        0.118
  4        0.480        0.184        0.374
  4        0.423        0.298        0.358
  8†       0.166        0.147        0.651
  8†       0.205        0.050        0.684
 16†       0.034        0.000        0.899
 16†       0.054        0.047        0.991

† These four runs are omitted in the second analysis.
to appreciate the contribution from each of the three responses. In this example $\xi$ is the only input variable and is the elapsed time since the start of the reaction. We denote by $\mathbf{y}_{(u)}' = (y_{u1}, y_{u2}, y_{u3})$ a set of $m = 3$ observations made on $\eta_{1u}$, $\eta_{2u}$, $\eta_{3u}$ at time $\xi_u$. A typical set of such observations is shown in Table 8.2.1. In some cases observations may not be available on all three of the outputs. Thus only the concentration $y_2$ of the product B might be observable, or $y_2$ and $y_3$ might be known, but there might be no independently measured observation $y_1$ of the concentration of A.† We suppose that the observations of Table 8.2.1 may be treated as having arisen from 12 independent experimental runs, as might be appropriate if the runs were carried out in random order in sealed tubes, each reaction being terminated at the appropriate time by sudden cooling. Furthermore, we suppose that $(y_1, y_2, y_3)$ are functionally independent, so that the $3 \times 3$ matrix $\boldsymbol{\Sigma}$ may be assumed to be positive definite and contains three variances and three covariances, all unknown. It is perhaps most natural for the experimenter to think in terms of the logarithms $\theta_1 = \log \phi_1$ and $\theta_2 = \log \phi_2$ of the rate constants and to regard these as locally uniformly distributed a priori.‡ We shall, therefore, choose as our reference priors for $(\theta_1, \theta_2)$ and $\boldsymbol{\Sigma}$ the distributions in (8.2.8) and (8.2.14), respectively.
† When the chemist has difficulty in determining one of the products, he sometimes makes use of relations like (8.2.27d) to "obtain it by calculation." Thus he might "obtain" $y_1$ from the relation $y_1 = 1 - y_2 - y_3$. For the resulting data set, the $3 \times 3$ covariance matrix $\boldsymbol{\Sigma}$ will of course not be positive definite, and the analysis in terms of three-dimensional responses will be inappropriate. In particular, the determinant of the sums of squares and products which appears in (8.2.23) will be zero whatever the values of the parameters. The difficulty is of course overcome very simply. The quantity $y_1$ is not an observation, and the data has two dimensions, not three. The analysis should be carried through with $y_2$ and $y_3$, which have actually been measured. For a fuller treatment of problems of this kind arising because of data dependence or near dependence, see Box, Erjavec, Hunter and MacGregor (1972).

‡ Suppose that (a) the expectation functions were linear in $\theta_1(\boldsymbol{\phi})$ and $\theta_2(\boldsymbol{\phi})$, where
$\boldsymbol{\phi} = (\phi_1, \phi_2)$, (b) little was known a priori about either parameter compared with the information supplied by the data, and (c) any prior information about one parameter would supply essentially none about the other. Then, arguing as in Section 1.3, a noninformative reference prior for $\boldsymbol{\theta}$ should be locally uniform. Conditions (b) and (c) are likely to be applicable to this problem, at least as approximations, but condition (a) is not, because the expectation functions are non-linear in $\phi_1$ and $\phi_2$ and no general linearizing transformation exists. However [see for example Beale (1960), and Guttman and Meeter (1965)], the expectation functions are more "nearly linear" in $\theta_1 = \log \phi_1$ and $\theta_2 = \log \phi_2$. Thus, the assumption that $\theta_1$ and $\theta_2$ are locally uniform provides a better approximation to a noninformative prior for the rate constants. For reasons we have discussed earlier, the assumption is not critical and if, for example, we assume $\phi_1$ and $\phi_2$ themselves to be locally uniform, the posterior distribution is not altered appreciably.
Expression (8.2.23) makes it possible to compute the posterior density for the parameters assuming observations are available on some or all of the products A, B, and C. Thus, we may consider the posterior distribution of $\boldsymbol{\theta} = (\theta_1, \theta_2)$:

a) if only yields $\mathbf{y}_2$ of product B are available,
$$p(\boldsymbol{\theta} \mid \mathbf{y}_2) \propto [S_{22}(\boldsymbol{\theta})]^{-n/2}, \qquad -\infty < \boldsymbol{\theta} < \infty; \qquad (8.2.28a)$$
b) if only yields $\mathbf{y}_3$ of product C are available,
$$p(\boldsymbol{\theta} \mid \mathbf{y}_3) \propto [S_{33}(\boldsymbol{\theta})]^{-n/2}, \qquad -\infty < \boldsymbol{\theta} < \infty; \qquad (8.2.28b)$$
c) if only yields $\mathbf{y}_2$ and $\mathbf{y}_3$ of B and C are available,
$$p(\boldsymbol{\theta} \mid \mathbf{y}_2, \mathbf{y}_3) \propto \begin{vmatrix} S_{22}(\boldsymbol{\theta}) & S_{23}(\boldsymbol{\theta}) \\ S_{23}(\boldsymbol{\theta}) & S_{33}(\boldsymbol{\theta}) \end{vmatrix}^{-n/2}, \qquad -\infty < \boldsymbol{\theta} < \infty; \qquad (8.2.28c)$$
d) if yields $\mathbf{y}_1$, $\mathbf{y}_2$ and $\mathbf{y}_3$ of the products A, B and C are all available,
$$p(\boldsymbol{\theta} \mid \mathbf{y}) \propto |\mathbf{S}(\boldsymbol{\theta})|^{-n/2}, \qquad -\infty < \boldsymbol{\theta} < \infty, \qquad (8.2.28d)$$
where $\mathbf{S}(\boldsymbol{\theta}) = \{S_{ij}(\boldsymbol{\theta})\}$, $i, j = 1, 2, 3$.
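The four posteriors (8.2.28a-d) are easily evaluated on a grid of $(\theta_1, \theta_2)$. The following sketch is ours (not from the text), assuming NumPy is available; `Y` is the $n \times 3$ matrix of observations from Table 8.2.1 and `xi` the vector of reaction times.

```python
# Grid evaluation of the log posteriors (8.2.28a-d) for the kinetic example,
# with theta1 = log(phi1), theta2 = log(phi2).
import numpy as np

def etas(phi1, phi2, xi):
    e1 = np.exp(-phi1 * xi)
    e2 = (np.exp(-phi1 * xi) - np.exp(-phi2 * xi)) * phi1 / (phi2 - phi1)
    return e1, e2, 1.0 - e1 - e2                # (8.2.27a-d); phi1 != phi2 assumed

def log_post(theta1, theta2, xi, Y, use, n):
    """log p(theta | data) up to a constant, via |S(theta)|^(-n/2).
    'use' selects outputs: (1,) for (8.2.28a), (2,) for b, (1, 2) for c,
    (0, 1, 2) for d."""
    eta = etas(np.exp(theta1), np.exp(theta2), xi)
    E = np.column_stack([Y[:, i] - eta[i] for i in use])
    sign, logdet = np.linalg.slogdet(E.T @ E)
    return -0.5 * n * logdet
```

Superimposing contours of these four surfaces reproduces the kind of comparison displayed in Fig. 8.2.2.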
Fig. 8.2.2 99.75% H.P.D. regions for $\theta_1$ and $\theta_2$ for the chemical reaction data.
Since there are only two parameters, $\theta_1$ and $\theta_2$, the posterior distributions can be represented by contour diagrams which may be superimposed to show the contributions made by the various output responses. Single contours are shown in Fig. 8.2.2 for the posterior distributions of $\theta_1$ and $\theta_2$ for (a) $\mathbf{y}_2$ alone, (b) $\mathbf{y}_3$ alone, (c) $\mathbf{y}_2$ and $\mathbf{y}_3$ jointly, and (d) $\mathbf{y}_1$, $\mathbf{y}_2$, and $\mathbf{y}_3$ jointly. The contours actually shown are those which should correspond to an H.P.D. region containing approximately 99.75 per cent of the probability mass, calculated from

$$\log p(\boldsymbol{\theta} \mid \cdot) - \log p(\hat{\boldsymbol{\theta}} \mid \cdot) = -\tfrac{1}{2} \chi^2(2, \alpha), \qquad \alpha = 0.0025,$$
where $p(\boldsymbol{\theta} \mid \cdot)$ refers to the appropriate distribution in (8.2.28a-d) and $\hat{\boldsymbol{\theta}}$ to the corresponding modal values of $\boldsymbol{\theta}$. In this example it is apparent, particularly for $\mathbf{y}_3$, that the posterior distributions are non-Normal. Nevertheless, the above very crude approximation will suffice for the purpose of the present discussion. In studying Fig. 8.2.2, we first consider the moon-shaped contour obtained from observations $\mathbf{y}_3$ on the end product C alone. In any sequential reaction A → B → C → ⋯, we should expect that observation of only the end product (C in this case) could provide little or no information about the individual parameters, but only about some aggregate of the rate constants. A diagonally attenuated, ridge-like surface is therefore to be expected. However, it should further be noted that, since in this specific instance $\eta_3$ is symmetric in $\phi_1$ and $\phi_2$ [see expression (8.2.27c)], the posterior surface is completely symmetric about the line $\theta_1 = \theta_2$. In particular, if $(\hat{\theta}_1, \hat{\theta}_2)$ is a point of maximum density, the point $(\hat{\theta}_2, \hat{\theta}_1)$ will also give the same maximum density. In general the surface will be bimodal and have two peaks of equal height symmetrically situated about the equi-angular line. Marginal distributions will thus display precisely the kind of behaviour shown in Fig. A5.6.1. Figure 8.2.2 shows how, for these data, the inevitable ambiguity arising when only observations $\mathbf{y}_3$ on product C are utilized is resolved as soon as the additional information supplied by values $\mathbf{y}_2$ on the intermediate product B is considered. As can be expected, the evidence that the intermediate product $\mathbf{y}_2$ contributes is preferentially concerned with the difference of the parameters. This is evidenced by the tendency of the region to be obliquely oriented, approximately at right angles to that for $\mathbf{y}_3$. By combining information from the two sources we obtain a much smaller region contained within the intersection of the individual regions. Finally, information from $\mathbf{y}_1$, which casts further light on the value of $\theta_1$, causes the region to be further reduced. Data of this kind sometimes occur in which the available observations trace only part of the reaction. To demonstrate the effect of this kind of inadequacy in the experimental design, the analysis is repeated omitting the last four observations in Table 8.2.1. As shown in Fig. 8.2.3, over the ranges studied, the contours for $\mathbf{y}_2$ alone and $\mathbf{y}_3$ alone do not now close. Nevertheless, quite precise estimation is possible using $\mathbf{y}_2$ and $\mathbf{y}_3$ together, and the addition of $\mathbf{y}_1$ improves the estimation further.
Fig. 8.2.3 99.75% H.P.D. regions for $\theta_1$ and $\theta_2$, excluding the last four observations.
Precautions in the Estimation of Common Parameters

Even in cases where only a single response is being considered, caution is needed in the fitting of functions. As explained in Section 1.1.4, fitting should be regarded as merely one element in the iterative model building process. The appropriate attitude is that when the model is initially fitted it is tentatively entertained rather than assumed. Careful checks on residuals are applied in a process of model criticism to see whether there is reason to doubt its applicability to the situation under consideration. The importance of such precaution is even greater when several responses are considered. In multivariate problems, not only should each response model be checked individually, but the models must also be checked for overall consistency. The investigator should in practice not revert immediately to a joint analysis of the responses. He should:
1) check the individual fit of each response,
2) compare posterior distributions to appraise the consistency of the information from the various responses (an aspect discussed in more detail in Chapter 9).

Only in those cases where he is satisfied with the individual fits and with their consistency should he revert to the joint analysis.

8.3 LINEAR MULTIVARIATE MODELS
In discussing the general $m$-variate Normal model above, we have not needed to assume anything specific about the form of the $m$ expectation functions $\boldsymbol{\eta}$. In particular, they need not be linear in the parameters,† nor does it matter whether or not some parameters appear in more than one of the expectation functions. Many interesting and informative special cases arise if we suppose the expectation functions to be linear in the $\theta$'s. Moreover, as will be seen, the linear results can sometimes supply adequate local approximations for models non-linear in the parameters. From now on, then, we assume that
$$\eta_{ui} = \mathbf{x}_{(ui)}' \boldsymbol{\theta}_i, \qquad i = 1, \ldots, m, \quad u = 1, \ldots, n, \qquad (8.3.1)$$

where $\mathbf{x}_{(ui)}' = (x_{u1i}, \ldots, x_{uk_i i})$ is a vector of fixed elements and $\boldsymbol{\theta}_i = (\theta_{1i}, \ldots, \theta_{k_i i})'$, with $\mathbf{x}_{(ui)}$
independent of all the $\theta$'s. The $n \times k_i$ matrix $\mathbf{X}_i$, whose $u$th row is $\mathbf{x}_{(ui)}'$, will be called the derivative matrix for the $i$th response. Our linear $m$-variate model may now be written as

$$\mathbf{y}_i = \mathbf{X}_i \boldsymbol{\theta}_i + \boldsymbol{\varepsilon}_i, \qquad i = 1, \ldots, m. \qquad (8.3.2)$$
Certain characteristics of this mode of writing the model should be noted. In particular, it is clear that while the elements of $\mathbf{x}_{(ui)}$ will be functions of the
† Although, so that a uniform density can represent an approximately noninformative prior, and also to assist local linear approximation, parameter transformations in terms of which the expectation function is more nearly linear will often be employed.
elements of the vector of input variables $\boldsymbol{\xi}_{ui}$, they will in general not be proportional to the elements of $\boldsymbol{\xi}_{ui}$ themselves. Thus if, for instance, $\eta_{ui} = \theta_{1i} \log \xi_{u1i} + \theta_{2i}\, \xi_{u2i}\, \xi_{u3i}$, then $\mathbf{x}_{(ui)}' = (\log \xi_{u1i},\; \xi_{u2i}\, \xi_{u3i})$.
8.3.1 The Use of Linear Theory Approximations when the Expectation Function is Nonlinear in the Parameters

The specific form of posterior distribution which we shall obtain for the linear case will often provide reasonably close approximations even when the expectation function is nonlinear in $\boldsymbol{\theta}$. This is because we need only that the expectation functions be approximately linear in the region of the parameter space covered by most of the posterior distribution, say within the 95% H.P.D. region.† For moderate $n$, this can happen with functions that are highly nonlinear in the parameters when considered over their whole range. Then, in the region where the posterior probability mass is concentrated (say the 95% H.P.D. region), we may expand the expectation function around the mode $\hat{\boldsymbol{\theta}}_i$,
$$E(y_{ui}) = \eta_{ui} \doteq \eta_i(\boldsymbol{\xi}_{ui}, \hat{\boldsymbol{\theta}}_i) + \sum_{g=1}^{k_i} x_{ugi} (\theta_{gi} - \hat{\theta}_{gi}), \qquad (8.3.3)$$

where

$$x_{ugi} = \left. \frac{\partial \eta_i(\boldsymbol{\xi}_{ui}, \boldsymbol{\theta}_i)}{\partial \theta_{gi}} \right|_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i},$$
which is, approximately, in the form of a linear model. Thus, the posterior distributions found from linear theory can, in many cases, provide close approximations to the true distributions. For example, in the univariate case ($m = 1$) with a single parameter $\theta$, the posterior distribution in (8.2.24) would be approximately

$$p(\theta \mid \mathbf{y}) \;\dot{\propto}\; \bigl[ \nu s^2 + (\theta - \hat{\theta})^2 \textstyle\sum x_u^2 \bigr]^{-n/2}, \qquad (8.3.4)$$

where $\nu = n - 1$,

$$s^2 = \frac{1}{\nu} \sum \bigl[ y_u - \eta(\boldsymbol{\xi}_u, \hat{\theta}) \bigr]^2 \qquad \text{and} \qquad x_u = \left. \frac{\partial \eta(\boldsymbol{\xi}_u, \theta)}{\partial \theta} \right|_{\theta = \hat{\theta}},$$

so that the quantity

$$\frac{(\theta - \hat{\theta}) \bigl( \sum x_u^2 \bigr)^{1/2}}{s} \qquad (8.3.5)$$

would be approximately distributed as $t(0, 1, \nu)$.

† A possibility that can be checked a posteriori for any specific case.
When, as in the case in which multivariate or univariate least squares is appropriate, a convenient method of calculating the $\hat{\theta}$'s is available for the linear case but not for the corresponding nonlinear situation, the linearization may be used iteratively to find the $\hat{\theta}$'s for the nonlinear situation. For example, in the univariate model containing a single parameter $\theta$, with a first guess $\theta^0$ we can write, approximately,

$$z_u^0 = y_u - \eta(\boldsymbol{\xi}_u, \theta^0) \doteq x_u^0 (\theta - \theta^0) + \varepsilon_u, \qquad (8.3.6)$$

where

$$x_u^0 = \left. \frac{\partial \eta(\boldsymbol{\xi}_u, \theta)}{\partial \theta} \right|_{\theta = \theta^0}.$$

Applying ordinary least squares to this model, we obtain an estimate of the correction $\theta - \theta^0$ and hence, hopefully, an improved "guess" $\theta^1$ from

$$\theta^1 = \theta^0 + \frac{\sum z_u^0 x_u^0}{\sum (x_u^0)^2}. \qquad (8.3.7)$$

This is the well-known Newton-Gauss method of iteration for nonlinear least squares, Box (1957, 1960), Hartley (1961), Marquardt (1963), and under favorable conditions the successive iterants will converge to $\hat{\theta}$.
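The single-parameter iteration (8.3.6)-(8.3.7) is short enough to write out in full. The following sketch is ours (not from the text), assuming NumPy is available and that the caller supplies the derivative analytically.

```python
# Newton-Gauss iteration for a single parameter, eq. (8.3.7).
import numpy as np

def gauss_newton(y, xi, eta, deta, theta0, n_iter=20):
    theta = theta0
    for _ in range(n_iter):
        z = y - eta(xi, theta)               # residuals at the current guess
        x = deta(xi, theta)                  # x_u = d eta / d theta at theta
        theta = theta + (z @ x) / (x @ x)    # correction step (8.3.7)
    return theta

# e.g. fitting eta = exp(-theta * xi):
# theta_hat = gauss_newton(y, xi, lambda t, th: np.exp(-th * t),
#                          lambda t, th: -t * np.exp(-th * t), theta0=0.1)
```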
8.3.2 Special Cases of the General Linear Multivariate Model

In general, the joint distribution of $\boldsymbol{\theta}$ and $\boldsymbol{\Sigma}$ is given by (8.2.17) and the marginal distribution of $\boldsymbol{\theta}$ is that in (8.2.23), quite independently of whether $\eta_i(\boldsymbol{\xi}_{ui}, \boldsymbol{\theta}_i)$ is linear in $\boldsymbol{\theta}_i$ or not. For practical purposes, however, it is of interest to consider a number of special cases. For orientation we reconsider for a moment the linear univariate situation discussed earlier in Section 2.7,
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}, \qquad (8.3.8)$$
where $\mathbf{y}$ is an $n \times 1$ vector of observations, $\mathbf{X}$ an $n \times k$ matrix of fixed elements, $\boldsymbol{\theta}$ a $k \times 1$ vector of parameters, and $\boldsymbol{\varepsilon}$ an $n \times 1$ vector of errors. In this case,

$$p(\boldsymbol{\theta}, \sigma^2 \mid \mathbf{y}) \propto (\sigma^2)^{-(\frac{n}{2}+1)} \exp\left[ -\frac{S(\boldsymbol{\theta})}{2\sigma^2} \right], \qquad \sigma^2 > 0, \quad -\infty < \boldsymbol{\theta} < \infty, \qquad (8.3.9)$$

and
IX.
[s(e)r n ' 2 ,
-
00
(8.3.10)
The determinant $|\mathbf{S}(\boldsymbol{\theta})|$ in (8.2.23) becomes the single sum of squares
$$S(\boldsymbol{\theta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}). \qquad (8.3.11)$$
In this linear case, we may write
$$S(\boldsymbol{\theta}) = (n - k)s^2 + (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})' \mathbf{X}'\mathbf{X} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}), \qquad (8.3.12)$$
where
$$(n - k)s^2 = (\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}}) = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})$$
and, assuming $\mathbf{X}$ is of rank $k$,
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},$$
so that, writing $\nu = n - k$,
$$p(\boldsymbol{\theta} \mid \mathbf{y}) \propto \left[ 1 + \frac{(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})' \mathbf{X}'\mathbf{X} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})}{\nu s^2} \right]^{-\frac{1}{2}(\nu + k)}, \qquad -\infty < \boldsymbol{\theta} < \infty. \qquad (8.3.13)$$
The posterior distribution of $\boldsymbol{\theta}$ is thus the $k$-dimensional $t_k[\hat{\boldsymbol{\theta}}, s^2(\mathbf{X}'\mathbf{X})^{-1}, \nu]$ distribution. Further, integrating out $\boldsymbol{\theta}$ from (8.3.9) yields the distribution of $\sigma^2$,

$$p(\sigma^2 \mid \mathbf{y}) \propto (\sigma^2)^{-(\frac{\nu}{2}+1)} \exp\left( -\frac{\nu s^2}{2\sigma^2} \right), \qquad (8.3.14)$$
".
=
= X ", = X are common but the
0111 but the matrices XI ' "" Xm are not, and
c) when 9 1 = ... = Om al1d XI
=
."
= XIII '
In the remaining part of this chapter, we shall discuss case (a), The problem of estimating common parameters which includes (b) and (c) will be treated in the next chapter.
8.4
I~FERE~CES
ABOUT 9 FOR THE CASE OF A COMMO]'; DERIVATIVE
"'lATRIX X
The model for which Xl = ", = XIII = X (so that kl = , = k m = k) and 9 1 i= ." i= Om has received most attention in the sampling theory framework ·see, for example, Anderson (J 958). From the Bayesia n point of view, the problem has been studied by Savage (1961 a), Geisser and Cornfield (1963), Geisser (1965a), Ando and Kaufman () 965), and others, In general, the
8.4
fnferences about
e for
the Case of a Common Derivative Matri>: X
439
multivariate model in (8.3.2) can now be written (8.4.1 )
y=XO+r. [yJ=[X] I7xm
Ilxk
[OJ+[EJ kxm nxm
where the notation beneath the matrices indicates that Y IS an n x 111 matrix of m-variate observations, 0 is a k x 111 matrix of parameters and f: an 11 x m matrix of errors. The model would be appropriate for example if sa y a 2P factorial experiment had been conducted on a chemical process and the output )' 1 = product yield, Yz = product purity, Y3 = product density had been measured. The elements of each column of the common matrix X would then be an appropriate sequence of + l's and - I's corresponding to the experimental conditions and the "effect" parameters 0i would be different for each output. In econometrics, the model (8.4.1) is frequently encountered in the analysis of the reduced form of simultaneous equation systems. We note that the k x m matrix of parameters
(8.4.2)
0=
can be written in the two alternative forms
0= [0 1 ,
...
(8.4.3)
,Oi, ... ,O",J
where 0i is the ith column vector a nd O;g) is the gth row vector of O. For simplicity, we shall assume throughout the chapter that the rank of X is k .
8.4.1 Distribution of 0 Consider the elements of the m x m matrix S(O) = {Sij(Oi, O) } of (8.2.4). XI = ... = Xm = X, we can write
When
Sij(Oi,O) = (Yi - X OJ' (Yj - X OJ) = (Yi - X 9Y (Yj - X 9) + (Oi where
ei=(X'X)-lX'Yi
ey
X'X(Oj - 0)
(8.4.4)
is the least squares estimates of Oi,i=l , ... ,m.
440
Some Aspects of Multivariate Analysis
8.4
Consequently,
S(9) = A where
6 is
+
(9 - 6)' X'X(S - 0),
(8.4.5)
the k x m matrix of least squares estimates
(8.4.6)
and A is the m x m matrix with i,j = I, ... ,111,
(8.4.7)
that is, A is proportional to the sample covariance matrix. For simplicity, we shall assume tbat A is positive definite. From the general result in (8.2.23), the poslerior distribution of 9 is then p(SI y) cc IA
+
(S - 6)' X'X(S -
OW"/2,
-
00
< 9<
00.
(8.4.8)
As mentioned earlier, when there is a single output (m = I), (8.4.8) is in the form of a k-dimensional multivariate f distribution. The distribution in (8.4.8) is a matrie-variate generalization of the t distribution. It was first obtained by Kahirsagar (J960). A comprehensive discussion of its properties has been given by Dickey (1967b).
8.4.2 Posterior Distribution of the Vleans from a m-dimensional Normal Distribution In the case Ie = I where each 9; consists of a single element and X is a 11 x J vector of ones , expression (8.4.8) is the joint posterior distribution of the m means when sampling from an m-dimensional multivariate Normal distribution N",(S,1:.). In this case
II
X' X = n
and
a ij =
" - ) (Yuj - Yj - ), L (Ylli - Y; u= 1
where
I Yi =
n·
II
I
u~l
Ylli·
(8.4.9)
Inferences about 9 for the Case of a Common Derivative Matrix X
8.4
441
The posterior distribution of 0 can be written p(OI y) oc IA
oc II
+ n(O
- 9)' (0 -
+ n A -!
9W n /2
(0 - 9)' (0 - 9)1-,,/2,
-
00
< 0<
(8.4.10)
00.
We now make use of the fundamental identity (8.4.11)
Ilk - P QI = II, - Q PI,
where Ik and I, are, respectively, k x k and a I x I identity matrices, P is a k x I matrix and Q is a I k matrix. Noting that (0 - 9) is a 1 x m vector, we immediately obtain
x
p(OI y) oc [I
+ n(O
- 9) A -
I
(0 - 9)'] -11/ 2,
-
00
< tJ i <
00 ,
i
= 1, ... , m, (8.4.12)
which is a rn-dimensional tm [9', n -I (n - m) -1 A, n - mJ distribution, a result first published by Geisser and Cornfield (1963). Thus, by comparing (8.4. 12) with (8.3.13), we see that both when m = 1 and when k = I, the distribution In (8.4.8) can be put in the multivariate t form.
8.4.3 Some Properties of the Posterior Matric-variate When neither distribution of distribution in t distribution .
t
Distribution of 0
rn nor k is equal to one, it is not possible to express the 0 as a multivariate t distribution. As we have mentioned , the (8.4.8) can be thought of as a matric-variate extension of the We now discuss some properties of this distribution.
TII'o equivalent Representations of the Distribution of 0
It is shown in Appendix
J"
II",
A~U
+
that, for v > 0,
A-I (0 - 9)' X'X(O - 9)I- t (v+k + m-1) dO
-00<0 < 00
(8.4.13) where
+ rn) + 1, r [ _l( v + In - J)J _ '_"_2_ _ _ _ __ l,JHv + k + m - I)J
v = n - (k c(rn, k, v)
= [r(})]mk
(8.4.14)
and 1 pCb) is the generalized Gamma function defined in (8.2.22). Thus, p(OI y)
=
(c(rn, k, v)r I IX'Xl m, 2 1A - Ilk/21 1m+ A - I (0 - 9)' x' X(O _ 9)1 -"H , r H m- I), -
00
<0<
00.
(8.4.15)
442
Some Aspects of Multivariate Analysis
8.4
and we shall say that the k x m matrix of parameters 6 is distributed as Note th at by applying the identity (8.4.11), we can write
tkm [9, (X ' X)-I, A, v]. 11m
+
+
A -1(0 - 9)' X'X(8 - 9)1 = Ilk
(X' X) (8 - 9)A -1 (0 - 9)'1
(8.4. 16)
so that in terms of the m x k matrix e' the role s of m and k on the one hand and of the matrices (X'X)-I and A on the other are simultaneously interchanged. Thus, we may conclude that if 6 ~ tkm[9, (X'X)-I, A, v], then (8.4 .17)
It follows from these two equivalent representations that
c(k , 111 , v) == c(m, k , v),
that is,
1k [ -HI' + k - I)J = ~m [ H I' + m - I )] 1k [HI' + k + m - 1)] - 1m [1(v + k + m - I)J'
(8.4.18)
Marginal and Conditional Distributions of Subsets of Co/umns of 8
We now show that the marginal and conditional distributions of subsets of the = m l + mz and
m columns of 8 are also matric-variate t distributions. Let m partition the matrices 6, 9, and A into
(8.4.19)
Then: a) conditional on 6 1 *, the subset 6H is distributed as (8.4 .20)
where H- 1
=
(X'X)-l
+
(8 1*
6z* = 9a + A zz ' l
-
(6 1 * -
= A2l -
91*)A~/ (6 1 *
91*)A~II
- 91*)',
A 12 ,
A ZI A~II Au ·
b) 6 1 * is distributed as 8 1 * ~ t km , [91*' (X'X)-I , All , vl
(8.4.21 )
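The quantities in (8.4.20) are straightforward to compute. The following helper is ours (not from the text), assuming NumPy; it simply evaluates the formulas above for a column partition.

```python
# Parameters of the conditional distribution (8.4.20) for a column partition
# theta = [theta_1* : theta_2*] of the matric-variate t posterior.
import numpy as np

def conditional_params(theta1, theta1_hat, theta2_hat, XtX_inv, A, m1):
    A11, A12, A22 = A[:m1, :m1], A[:m1, m1:], A[m1:, m1:]
    D = theta1 - theta1_hat                       # k x m1
    A11_inv = np.linalg.inv(A11)
    loc = theta2_hat + D @ A11_inv @ A12          # tilde theta_2*
    H_inv = XtX_inv + D @ A11_inv @ D.T           # H^{-1}
    A22_1 = A22 - A12.T @ A11_inv @ A12           # A_22.1
    return loc, H_inv, A22_1    # theta_2* | theta_1* ~ t[loc, H_inv, A22_1, nu+m1]
```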
To prove these results, we can write

$$(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \mathbf{A}^{-1} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})' = (\boldsymbol{\theta}_{1*} - \hat{\boldsymbol{\theta}}_{1*}) \mathbf{A}_{11}^{-1} (\boldsymbol{\theta}_{1*} - \hat{\boldsymbol{\theta}}_{1*})' + (\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*}) \mathbf{A}_{22 \cdot 1}^{-1} (\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*})'. \qquad (8.4.22)$$
The determinant on the right-hand side of (8.4.16) can now be written

$$|\mathbf{I}_k + (\mathbf{X}'\mathbf{X})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \mathbf{A}^{-1} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})'| = |\mathbf{I}_k + (\mathbf{X}'\mathbf{X})(\boldsymbol{\theta}_{1*} - \hat{\boldsymbol{\theta}}_{1*}) \mathbf{A}_{11}^{-1} (\boldsymbol{\theta}_{1*} - \hat{\boldsymbol{\theta}}_{1*})'| \times |\mathbf{I}_k + \mathbf{H}(\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*}) \mathbf{A}_{22 \cdot 1}^{-1} (\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*})'|. \qquad (8.4.23)$$
Substituting (8.4.23) into (8.4.15), we see that, given $\boldsymbol{\theta}_{1*}$, the conditional distribution of $\boldsymbol{\theta}_{2*}$ is

$$p(\boldsymbol{\theta}_{2*} \mid \boldsymbol{\theta}_{1*}, \mathbf{y}) \propto |\mathbf{I}_k + \mathbf{H}(\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*}) \mathbf{A}_{22 \cdot 1}^{-1} (\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*})'|^{-\frac{1}{2}(\nu + k + m - 1)}$$
$$\propto |\mathbf{I}_{m_2} + \mathbf{A}_{22 \cdot 1}^{-1} (\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*})' \mathbf{H} (\boldsymbol{\theta}_{2*} - \tilde{\boldsymbol{\theta}}_{2*})|^{-\frac{1}{2}[(\nu + m_1) + k + m_2 - 1]}. \qquad (8.4.24)$$

From (8.4.13), the normalizing constant is

$$[c(m_2, k, \nu + m_1)]^{-1}\, |\mathbf{H}|^{m_2/2}\, |\mathbf{A}_{22 \cdot 1}|^{-k/2}. \qquad (8.4.25)$$
91* )1- ·~[(v+ml)+k+m2 - ll (8.4.24)
From (8.4.13), the normalizing constant is [c(m2,k , v
+ ml)rIIHlm,'2IAi}1Ik 2.
(8.4 .25)
Thus. given 9 1 *, the k x m 2 matrix of parameters 9 2 * is distributed as A l2 . I .V + m l ). For the marginal distribution of 6 1 *. since
fkm1
(OH' H -
I,
use of (8.4.23) through (8.4.25) yields p(9 1 * I y)
oc IHI-m 2l 2 Ilk OC
IIm,
+
+ (X'X) (9 1* -
81*)A~11 (9 1* _ 8 1*)'I -}( v+k+m-l)
- I (9 9- )'X' X(9 9- )I-:(v+k+m,-I) A II 1* - 1* 1 * - 1* '
-
That is, the k x
Inl
< 9u <
00
(8.4.26)
00 .
matrix of parameters 91* is distributed as f km ,
[8 1*, (X'X)-I, All' v].
MarRinal Distribution of a Particular Column of 9 In particular, by setting 1111 = I in (8.4.21), the marginal distribution of 9 1 * = 9 1 is the k-dimensional multivariate I distribution -
00
< 61 <
00,
(8.4.27) where a l l = All is now a scalar, that is, 6 1 ~ Ik [8 1 , V-I a l I (X'X)-l, v]. By mere relabeling we may conclude that the marginal distribution of the ith column of 6 in (8.4.2) is -
00
< 9j <
00.
(8.4.28)
It will be noted that this distribution is identical to that obtained in (8.3.13) when only a single output was considered, except that in (8.3.13)
444
Some Aspects of Multivariate Analysis
8.4
v = n - k, but in (8.4.28) v = 11 - k - (m - I). In a certain sense, the reduction of the degrees of freedom by 111 - I is not surprising. In adopting the multivariate framework, m(m - I)/2 additional parameters (Jij (i #- j) are introduced. A part of the information from the sample is therefore utilized to estimate these parameters and (/11 - I) of them «(Jil, ... , (Ji(i-l), (Ji(i+ I)' ... , (Jim) are connected with y;. We may say that 'one degree of freedom is lost' for each of the (m - I) additional parameters. On the other hand, it is somewhat puzzling that if we ignored the multivariate structure of the problem and treated y; as the output of a univariate response, then on the basis of the noninformative prior p(8;, (J;;) cc (Jii j . we would obtain a posterior multivariate t distribution for 8i with (m - I) additional degrees of freedom. This would seem to imply that. by ignoring the information from the other (m - I) responses Yj, ···,Yi-j,y;+\, ... ,Ym, more precise inference about 8; could be made than when all the m responses were considered jointly. This phenomenon is related to the "paradox" pointed out by Dempster (1963) and the criticisms of the prior in (8.2.14) by Stone (1964). The above implication is admittedly perplexing, and further research is needed to clarify the situation. We feel, however, that the multivariate results presented in this chapter are of considerable interest and certainly provide a sensible basis for inference in the common practical situation when (n - k) is large relative to m.
Distribution oj 8 Expressed as a Product oj Multivariate t Distributions
We note that for the partition in (8.4.19), if we set mj = m - J and in2 = I, then from (8.4.20) the conditional distribution of 9Z* = 9"" given 91* = [81> ... ,9",_,], is (8.4.29) where a,... 12 ... (m-l)
=
A 22 · 1 ·
From the marginal distribution of 8 1 * = [9 1 , .. , 8m - 1 J in (8.4.21) if we partition 9 1 * into [8 s * :8m - 1 ] where 8s * = [9 1 ,· ··,9",-2J, it is clear that the conditional distribution of 8m - I , given 8 s * , is again a k-dimensional multivariate t distribution. It follows by repeating the process m - I times that , if we express p(8 I y) as the product (8.4.30) then each factor on the right-hand side is a k-dimensional multivariate t distribution.
Marginal and Conditional Distributions oj ROll'S oj 9
Results very similar to those given above for column decomposition of 8 can
8.4
Inferences about
e for the Case of a Common Derivative Matrix X
now be obtained for the rows of 8. k,
8'
=
Consider the partitions
k2
k!
[8( 1)* : 8(2)*J
(X'X) - l =
e
=
445
m'
k2
8' = [8( 1)* ~ 9(2)*J
(8.4.31 )
/II
[~:~ : ~::.]:>
where it is to be remembered that 6; 1)* are the first k I rows and 9;2)* the remaining k2 rows of 8. Since the m x k matrix 9' is distributed as Imk[9', A, (X'X)-I, v] , it can be readily shown that a) given 6( I )*,
(8.4 .32) where e 22 ' 1 = e 22 - e 21 e~ll e 12 , G(2)* = 8(2)*
(9(1)* - 8(1)*) e~11 elZ '
+
and b) Marginally ,
(8.4.33) or equivalently, 9(1)* -
c) The gth row of
e,
Cgg
[9;1)*' ell' A, v].
6(0)' is distributed as 9(g) -
where
[kim
1m
, - I [6(g ) , V Cgg
A , vJ
(8.4.34)
is the (gg )th element of C.
d) The distribution of 6 can alternatively be expressed as the product
(8.4.35) where, parallel to (8.4.30), each factor on the right-hand side is an tn-dimensional multivariate I distribution. Comparing expression (8.4.34) with the result in (8.4.12), we see that, as was the case with a column vector of 9, the two distributions are of the same form except for the difference in the "degrees of freedom" They now differ by (k - I), simply because an additional (Ie - 1) "input variables" are included in the model.
Marginal Dislribution of a Block of Elements of 9 Finally consider now the partitions ni,
. ]
k I,
k,
.]
kI k,
•
(8.4.36)
446
SOllie Aspects of Multivariate Analysis
8.4
It follows from (8.4.21) and (8.4.33) that the k I x 1111 matrix of parameters 0 1 , is distributed as (8.4.37) The marginal distributions of 9 12 • 9 2 " 022, and indeed that of any block of elements of 0 can be similarly obtained, and are left to the reaJer. In the above we have shown that the marginal and the conditional distributions of certain subsets of 0 are of the matric-variate 'form. This is, however, not true in general. For example, one can show that neither the marginal distribution of (9", ( 22 ) nor the conditional distribution of (9'2, 9 2 ,) given (9 ll • ( 22 ) is a matric-variate t distribution. The problem of obtaining explicit expressions for the marginal and the conditional distributions in general is quite complex, and certain special cases have recently been considered by Dreze and Morales (1970) and Tiao. Tan, and Chang (1970). Means and Covariance Matrix orO
From (8.4.28), the matrix of means of the posterior distribution of 9 is £(9) =
9
(8.4.38)
and the covariance matrix of 9j is a·
Cov (9J = --"2- (X ' Xr-1, v-
i
=
I, ... , m.
(8.4.39)
For the covariance matrix of 9 j and 9i , with no loss In generality we consider the case i = 1 and j = 2. Now £(9 1
-
9,) (9 2
-
( 2 )'
= £ (9 1
-
0,
( 1)
£ (9 2
O,la,
-
( 2 )"
If we set
111, = 2 in (8.4.21) and perform a column decomposition of the k x 2 matrix 9,* = [9,,9 2 ], it is then clear from (8.4.20) that
£ (9 2 - (2)
=
a~11 a'2 (9, -
9,)
O2 \9,
so that, as might be expected , (8.4.40)
= _ _1_
v-2
[all a12 ] ® (X'X)-l a'2 a 22
(8.4.41)
8.4
Inferences about (} for the Case of a Common Derivative .Vlatrix X
where Q9 denotes the Kronecker product- see Appendix A8.4. we write the elements of 0 and as
e
=
0'
where 0 and
0
0'
(0'[, _._ ,0;,,),
447
In general , if
= (8'1' ... ,6;,,),
(8.4.42a)
are kin x 1 vectors, rhen ~
I
~
Cov (0) = £(0 - 0) (0 - 0)' = - - A Q9 (X'X)-l .
(8.4.42b)
2
V -
Bya similar a rgument, if we write 0~ =
(0; I I'
... ,
(8.4.43a)
0rk») ,
then I v - 2
Cov (0*) = - - (X' X) - l Q9 A.
(8.4.43 b)
Linear Transformation of θ

Let P be a k1 × k (k1 ≤ k) matrix of rank k1 and Q an m × m1 (m1 ≤ m) matrix of rank m1. Suppose φ is the k1 × m1 matrix of random variables obtained from the linear transformation

φ = PθQ.  (8.4.44)

Then φ is distributed as t_{k1 m1}[Pθ̂Q, P(X'X)^{-1}P', Q'AQ, ν]. The proof is left as an exercise for the reader.
Asymptotic Distribution of θ

When ν tends to infinity, the distribution of θ approaches a km-dimensional multivariate Normal distribution,

p(θ | y) ∝ exp[−½ tr Σ̂^{-1}(θ − θ̂)' X'X (θ − θ̂)],  −∞ < θ < ∞,  (8.4.45)

where Σ̂ = ν^{-1}A, and we shall say that, asymptotically, θ ~ N_{mk}[θ̂, Σ̂ ⊗ (X'X)^{-1}].

To see this, in (8.4.15) let

Q = ν A^{-1}(θ − θ̂)' X'X (θ − θ̂) = Σ̂^{-1}(θ − θ̂)' X'X (θ − θ̂).

Then we may write

|I_m + ν^{-1}Q| = ∏_{i=1}^{m} (1 + ν^{-1}λ_i),  (8.4.46)

where (λ1, ..., λm) are the latent roots of Q. Thus, as ν → ∞,

|I_m + ν^{-1}Q|^{−ν/2} → exp(−½ tr Q).  (8.4.47)

Since

tr Q = (θ − θ̂)' Σ̂^{-1} ⊗ (X'X) (θ − θ̂),

where θ and θ̂ are the km × 1 vectors defined in (8.4.42a), and noting that the normalizing constant tends to the corresponding Normal constant, the desired result follows at once. It follows that, asymptotically,

E(θ) = θ̂,  Cov(θ) = Σ̂ ⊗ (X'X)^{-1},  (8.4.48a)

or, alternatively,

E(θ*) = θ̂*,  Cov(θ*) = (X'X)^{-1} ⊗ Σ̂,  (8.4.48b)

where (θ, θ̂) and (θ*, θ̂*) are defined in (8.4.42a) and (8.4.43a), respectively.
8.4.4 H.P.D. Regions of θ

Expressions (8.4.28) and (8.4.34) allow us to make inferences about a specific column or row of θ. Using properties of the multivariate t distribution, H.P.D. regions of the elements of a row or a column can be easily determined. We now discuss a procedure for the complete set of parameters θ which makes it possible to decide whether a general point θ = θ0 is or is not included in an H.P.D. region of approximate content (1 − α). It is seen in expression (8.4.15) that the posterior distribution of θ is a monotonic increasing function of the quantity U(θ), where

U(θ) = |A| / |A + (θ − θ̂)' X'X (θ − θ̂)| = |I_m + A^{-1}(θ − θ̂)' X'X (θ − θ̂)|^{-1}.  (8.4.49)

Consequently, the parameter point θ = θ0 lies inside the (1 − α) H.P.D. region if and only if

Pr{U(θ) > U(θ0) | y} ≤ (1 − α).  (8.4.50)
8.4.5 Distribution of U(θ)

To obtain the posterior distribution of U(θ) so that (8.4.50) can be calculated, we first derive the moments of U(θ). Applying the integral identity (8.4.13), the hth moment of U(θ) is found to be

E[U^h(θ) | y] = c(m, k, ν + 2h) / c(m, k, ν) = ∏_{s=1}^{m} {Γ[½(ν − 1 + s) + h] Γ[½(ν − 1 + k + s)]} / {Γ[½(ν − 1 + s)] Γ[½(ν − 1 + k + s) + h]}.  (8.4.51)

From (8.4.46) and (8.4.49) it follows that U = U(θ) is a random variable defined on the interval (0, 1), so that the distribution of U is uniquely determined by its moments. Further, expression (8.4.51) shows that the distribution of U is a function of (m, k, ν). Adopting the symbol U_{(m,k,ν)} to mean a random variable whose probability distribution is that implied by the moments in (8.4.51), we have the following general result.

Theorem 8.4.1 Let θ̂ be a k × m matrix of constants, X'X and A be, respectively, a k × k and an m × m positive definite symmetric matrix of constants, and ν > 0. If the k × m matrix of random variables θ is distributed as t_{km}[θ̂, (X'X)^{-1}, A, ν], then

U(θ) ~ U_{(m,k,ν)},

where

U(θ) = |I_m + A^{-1}(θ − θ̂)' X'X (θ − θ̂)|^{-1}.

As noted by Geisser (1965a), expression (8.4.51) corresponds exactly to that for the hth sampling moment of U(θ0) in the sampling theory framework when θ is regarded as fixed and y as random variables. Thus, the Bayesian probability Pr{U(θ) > U(θ0) | y} is numerically equivalent to the significance level associated with the null hypothesis θ = θ0 against the alternative θ ≠ θ0.
Some Distributional Properties of U_{(m,k,ν)}

Following the development, for example, in Anderson (1958), we now discuss some properties of the distribution of U_{(m,k,ν)}. It will be noted that the notation U_{(m,k,ν)} here is slightly different from the one used in Anderson; specifically, in his notation, ν is replaced by ν + m − 1.

a) Since from (8.4.18) c(k, m, ν) = c(m, k, ν), the hth moment in (8.4.51) can alternatively be expressed as

E[U^h(θ) | y] = ∏_{t=1}^{k} {Γ[½(ν − 1 + t) + h] Γ[½(ν − 1 + m + t)]} / {Γ[½(ν − 1 + t)] Γ[½(ν − 1 + m + t) + h]}.  (8.4.52)

By comparing (8.4.51) with (8.4.52), we see that the roles played by m and k can be interchanged. That is, the distribution of U_{(m,k,ν)} is identical to that of U_{(k,m,ν)}. In other words, the distribution of U = U(θ) arising from a multivariate model with m output variables and k regression coefficients for each output is identical to that from a multivariate model with k output variables and m regression coefficients for each output. With no loss in generality, we shall proceed with the m-output model, i.e., the U_{(m,k,ν)} distribution.

b) Now (8.4.51) can be written

E(U^h | y) = ∏_{s=1}^{m} B[½(ν − 1 + s) + h, ½k] / B[½(ν − 1 + s), ½k],  (8.4.53)

where B(p, q) is the complete beta function. The right-hand side is the hth moment of the product of m independent variables x1, ..., xm having beta distributions with parameters [½(ν − 1 + s), ½k], s = 1, ..., m. It follows that U is distributed as the product x1 ⋯ xm.
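The beta-product representation is easy to check by simulation. The following hedged sketch, assuming NumPy and SciPy are available, draws the m independent beta factors of (8.4.53) and compares the simulated moments of U with the exact moments (8.4.51); the function name is illustrative only.

```python
import numpy as np
from scipy.special import gammaln

# Monte Carlo check of U ~ x_1 ... x_m with x_s ~ Beta[(nu-1+s)/2, k/2]
# (equation 8.4.53), against the exact moments (8.4.51).
rng = np.random.default_rng(0)
nu, m, k = 9, 2, 2
N = 200_000

U = np.ones(N)
for s in range(1, m + 1):
    U *= rng.beta((nu - 1 + s) / 2.0, k / 2.0, size=N)

def exact_moment(h):
    log_mom = sum(gammaln((nu - 1 + s) / 2 + h) + gammaln((nu - 1 + k + s) / 2)
                  - gammaln((nu - 1 + s) / 2) - gammaln((nu - 1 + k + s) / 2 + h)
                  for s in range(1, m + 1))
    return np.exp(log_mom)

print(U.mean(), exact_moment(1))      # should agree to Monte Carlo error
print((U**2).mean(), exact_moment(2))
```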
c) Suppose m is even. Then we can write (8.4.53) as

E(U^h | y) = ∏_{l=1}^{m/2} {B[½(ν + 2(l − 1)) + h, ½k] B[½(ν − 1 + 2l) + h, ½k]} / {B[½(ν + 2(l − 1)), ½k] B[½(ν − 1 + 2l), ½k]}.  (8.4.54)

Using the duplication formula

Γ(2p) = (2π)^{−1/2} 2^{2p − ½} Γ(p) Γ(p + ½),  (8.4.55)

so that

B(p + ½, q) B(p, q) = 2^{2q} B(2p, 2q) B(q, q),  (8.4.56)

we obtain

E(U^h | y) = ∏_{l=1}^{m/2} B[ν + 2(l − 1) + 2h, k] / B[ν + 2(l − 1), k],  (8.4.57)

which is the (2h)th moment of the product of m/2 independent random variables z1, ..., z_{m/2} having beta distributions with parameters [ν + 2(l − 1), k], l = 1, ..., m/2, respectively. Thus, U is distributed as the product z1² ⋯ z²_{m/2}.

The Case m = 1
When m = 1, so that ν = n − k, it follows from (8.4.53) that U has the beta distribution with parameters [½(n − k), ½k], so that the quantity (n − k)(1 − U)/(kU) is distributed as F with (k, n − k) degrees of freedom. This result is of course to be expected since, for m = 1, we have |A| = (n − k)s², where

s² = (n − k)^{-1}(y − ŷ)'(y − ŷ),

so that

[(1 − U)/U] [(n − k)/k] = (θ − θ̂)' X'X (θ − θ̂) / (k s²),  (8.4.58)

which, from (2.7.21), has the F distribution with (k, n − k) degrees of freedom.
The Case m = 2

When m = 2, ν = n − k − 1 and from (8.4.57) U^{1/2} is distributed as a beta variable with parameters (n − k − 1, k). Thus, the quantity

[(1 − U^{1/2})/U^{1/2}] [(n − k − 1)/k] = {|I_2 + A^{-1}(θ − θ̂)' X'X (θ − θ̂)|^{1/2} − 1} [(n − k − 1)/k]  (8.4.59)

has the F distribution with [2k, 2(n − k − 1)] degrees of freedom.
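Since U^{1/2} is a beta variable when m = 2, the tail probability Pr{U > U0} can be computed directly. A minimal sketch, assuming SciPy's beta distribution, is given below; prob_U_exceeds is an illustrative helper, not a library routine.

```python
from scipy import stats

# For m = 2, the exact tail probability of U follows from
# U**0.5 ~ Beta(n - k - 1, k)  (equation 8.4.59).
def prob_U_exceeds(U0, n, k):
    # Pr{U > U0} = Pr{sqrt(U) > sqrt(U0)} = 1 - I_sqrt(U0)(n-k-1, k)
    return stats.beta.sf(U0 ** 0.5, n - k - 1, k)

# Values anticipated from the example of Section 8.4.8: n = 12, k = 2.
print(prob_U_exceeds(0.1720, 12, 2))   # about 0.9978
```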
8.4.6 An Approximation to the Distribution of U for General m

For m ≥ 3, the exact distribution of U is complicated; see, e.g., Schatzoff (1966) and Pillai and Gupta (1969). We now give an approximation method following Bartlett (1938) and Box (1949). In expression (8.4.51), we make the substitutions

M = −φν log U,  t = −h/(φν),  x = ½ν,  a_s = ½(s + k − 1),  b_s = ½(s − 1),  (8.4.60)

where φ is some arbitrary positive number, so that

E(U^h | y) = E(e^{tM} | y) = ∏_{s=1}^{m} {Γ(x + a_s) Γ[φx(1 − 2t) + x(1 − φ) + b_s]} / {Γ(x + b_s) Γ[φx(1 − 2t) + x(1 − φ) + a_s]}.  (8.4.61)

In terms of the random variable M, (8.4.61) is then its moment generating function. Taking logarithms and employing Stirling's series (see Appendix A2.2), we obtain the cumulant generating function of M as

log E(e^{tM} | y) = −½mk log(1 − 2t) + Σ_{r=1}^{∞} ω_r [(1 − 2t)^{−r} − 1],  (8.4.62)

where

ω_r = {(−1)^r / [r(r + 1)(φx)^r]} Σ_{s=1}^{m} {B_{r+1}[x(1 − φ) + b_s] − B_{r+1}[x(1 − φ) + a_s]}

and B_r(z) is the Bernoulli polynomial of degree r and order one. The asymptotic expansion in (8.4.62) is valid provided φ is so chosen that x(1 − φ) is bounded; in this case, ω_r is of order O[(φx)^{−r}] in magnitude. The series in (8.4.62) is of the same type as the one obtained in (2.12.15) for the comparison of the spread of k Normal populations. In particular, the distribution of M = −φν log U can be expressed as a weighted series of χ² densities, the leading term having mk degrees of freedom.
The Choice of φ

It follows that, if we take the leading term alone, then to order O[(φx)^{−1}] = O[(½φν)^{−1}] the quantity

M = −φν log U  (8.4.63)

is distributed as χ²_{mk}, provided x(1 − φ) is bounded. In particular, if we take φ = 1, we then have that M = −ν log U is distributed approximately as χ²_{mk}. For moderate values of ν, the accuracy of the χ² approximation can be improved by suitably choosing φ so that ω1 = 0; this is because, when ω1 = 0, the quantity M will be distributed as χ²_{mk} to order O[(½φν)^{−2}]. Using the fact that

B_2(z) = z² − z + 1/6,  (8.4.64)

it is straightforward to verify that for ω1 = 0 we require

φ = 1 + (m + k − 3)/(2ν).  (8.4.65)
This choice of φ gives very close approximations in practice. An example with ν = 9, m = k = 2 will be given later in Section 8.4.8 to compare the approximation with the exact distribution. It follows from the above discussion that, to order O[(½φν)^{−2}],

Pr{U(θ) > U(θ0) | y} ≐ Pr{χ²_{mk} < −φν log U(θ0)},  (8.4.66)

with

φ = 1 + (m + k − 3)/(2ν),  log U(θ0) = −log |I_m + A^{-1}(θ0 − θ̂)' X'X (θ0 − θ̂)|,

which can be employed to decide whether the parameter point θ = θ0 lies approximately inside or outside the (1 − α) H.P.D. region.
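A compact sketch of the decision rule (8.4.66), under the assumption that NumPy and SciPy are available; outside_hpd is an illustrative name, not a library routine.

```python
import numpy as np
from scipy import stats

# Chi-square criterion (8.4.66): is theta_0 outside the (1 - alpha)
# H.P.D. region of theta?
def outside_hpd(theta0, theta_hat, XtX, A, nu, alpha):
    m = A.shape[0]
    k = XtX.shape[0]
    D = (theta0 - theta_hat).T @ XtX @ (theta0 - theta_hat)
    U0 = 1.0 / np.linalg.det(np.eye(m) + np.linalg.solve(A, D))
    phi = 1.0 + (m + k - 3.0) / (2.0 * nu)
    M = -phi * nu * np.log(U0)
    return M > stats.chi2.ppf(1.0 - alpha, df=m * k)
```

With the quantities of the example in Section 8.4.8, −φν log U(θ0) = −9.5 log 0.172 ≐ 16.7, which exceeds χ²(4, 0.05) = 9.49, so that θ0 there falls outside the 95 per cent region.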
8.4.7 Inferences about a General Parameter Point of a Block Submatrix of θ
In the above, we have discussed inference procedures for (a) a specific column of θ, (b) a specific row of θ, and (c) a parameter point θ0 for the complete set of θ. In some problems, we may be interested in making inferences about the parameters belonging to a certain block submatrix of θ. Without loss of generality we consider only the problem for the k1 × m1 matrix θ11 defined in (8.4.36). From (8.4.37) and Theorem 8.4.1 (on page 449), it follows that the quantity

U(θ11) = |I_{m1} + A11^{-1}(θ11 − θ̂11)' C11^{-1}(θ11 − θ̂11)|^{-1}  (8.4.67)

is distributed as U_{(m1,k1,ν)}. This distribution would then allow us to decide whether a particular value of θ11 lies inside or outside a desired H.P.D. region. In particular, for m1 > 2 and k1 > 2, we may then make use of the approximation

−φ1 ν log U(θ11) ≐ χ²_{m1 k1},  (8.4.68)

where

φ1 = 1 + (m1 + k1 − 3)/(2ν),

so that the parameter point θ11,0 lies inside the (1 − α) H.P.D. region if and only if

−φ1 ν log U(θ11,0) < χ²(m1 k1, α).
8.4.8 An Illustrative Example

An experiment was conducted to study the effect of temperature on the yield of the product y1 and the by-product y2 of a chemical process. Twelve runs were made at different temperature settings ranging from 161.3°F to 195.7°F. The data are given in Table 8.4.1. The average temperature employed is T̄ = 177.86.

Table 8.4.1 Yield of product and by-product of a chemical process

Temp °F   Product y1   By-product y2
161.3     63.7         20.3
164.0     59.5         24.2
165.7     67.9         18.0
170.1     68.8         20.5
173.9     66.1         20.1
176.2     70.4         17.5
177.6     70.0         18.2
181.7     73.7         15.4
185.6     74.1         17.8
189.0     79.6         13.3
193.5     77.1         16.7
195.7     82.8         14.8

We suppose a model to be entertained whereby, over the range of temperature explored, the relationships between product yield and temperature and by-product yield and temperature were nearly linear, so that to an adequate approximation

E(y_u1) = θ11 + θ21 x_u,  E(y_u2) = θ12 + θ22 x_u,  u = 1, ..., 12,  (8.4.70)

where x_u = (T_u − T̄)/100, the divisor 100 being introduced for convenience in calculation. The parameters θ11 and θ12 will thus determine the locations of the yield-temperature lines at the average temperature T̄, while θ21 and θ22 will represent the slopes of these lines. The experimental runs were set up independently and we should therefore expect experimental errors to be independent from run to run. However, in any particular run, we should expect the error in y1 to be correlated with that in y2, since slight aberrations in reaction
conditions or in analytical procedures could simultaneously affect observations of both product and by-product yields. Finally, then, the tentative model was

y_u1 = θ11 x_u1 + θ21 x_u2 + e_u1,
y_u2 = θ12 x_u1 + θ22 x_u2 + e_u2,  (8.4.71)

where x_u2 = x_u and x_u1 = 1 is a dummy variable introduced to "carry" the parameters θ11 and θ12. It was supposed that (e_u1, e_u2) followed the bivariate Normal distribution N_2(0, Σ). Given this setup, we apply the results arrived at earlier in this section to make inferences about the regression coefficients

θ = [θ11 θ12; θ21 θ22]  (8.4.72)
against the background of a noninformative reference prior distribution for θ and Σ. The relevant sample quantities are summarized below:

n = 12,  m = k = 2,  ν = 9,

C = (X'X)^{-1} = [0.0833 0; 0 6.8850],

A = [61.4084 −38.4823; −38.4823 36.7369],  A^{-1} = [0.0474 0.0496; 0.0496 0.0792],  (8.4.73)

and

θ̂ = [θ̂1, θ̂2] = [θ̂(1); θ̂(2)] = [71.1417 18.0666; 54.4355 −20.0933].
The fitted lines

ŷ1 = θ̂11 + θ̂21(T − T̄) × 10^{-2},  ŷ2 = θ̂12 + θ̂22(T − T̄) × 10^{-2},

together with the data are shown in Fig. 8.4.1. As explained earlier, in a real data analysis we should pause at this point to critically examine the conditional inference by study of residuals. We shall here proceed with further analysis supposing that such checks have proved satisfactory.
[Fig. 8.4.1 Scatter diagram of the product y1 and the by-product y2 together with the best fitting lines.]
Inferences about θ1 = (θ11, θ21)'

When interest centers primarily on the parameters θ1 = (θ11, θ21)' for the product y1, we have from (8.4.28) that

p(θ1 | y) ∝ [61.4084 + (θ1 − θ̂1)' X'X (θ1 − θ̂1)]^{−11/2},  (8.4.74)

that is, a bivariate t2[θ̂1, 6.825(X'X)^{-1}, 9] distribution. Since the matrix X'X is diagonal, θ11 and θ21 are uncorrelated (but of course not independent).
Figure 8.4.2a shows contours of the 50, 75 and 95 per cent H.P.D. regions together with the mode θ̂1, from which overall conclusions about θ1 may be drawn.

[Fig. 8.4.2a Contours of the posterior distribution of θ1, the parameters of the product straight line.]

[Fig. 8.4.2b Contours of the posterior distribution of θ2, the parameters of the by-product straight line.]
Inferences about θ2 = (θ12, θ22)'

Similarly, from (8.4.28), the posterior distribution of θ2 for the by-product y2 is

p(θ2 | y) ∝ [36.7369 + (θ2 − θ̂2)' X'X (θ2 − θ̂2)]^{−11/2},  (8.4.75)

which is a t2[θ̂2, 4.08(X'X)^{-1}, 9] distribution. Again, the parameters θ12 and θ22 are uncorrelated. The 50, 75 and 95 per cent H.P.D. contours together with the mode θ̂2 for this distribution are shown in Fig. 8.4.2b. The contours have exactly the same shape and orientation as those in Fig. 8.4.2a because the same X'X matrix is employed; the spread for θ2 is, however, smaller than that for θ1 since the sample variance from y2 is less than that from y1.
Inferences about θ(2) = (θ21, θ22)

In problems of the kind considered, interest often centers on θ(2) = (θ21, θ22), which measure respectively the slopes of the yield-temperature lines for the product and by-product. From (8.4.34), the posterior distribution of θ(2) is

p(θ(2) | y) ∝ [6.875 + (θ(2) − θ̂(2))' A^{-1} (θ(2) − θ̂(2))]^{−11/2},  (8.4.76)

that is, a t2[θ̂(2), 0.764A, 9] distribution. Figure 8.4.3 shows the 50, 75 and 95 per cent contours together with the mode θ̂(2). Also shown in the same figure are the marginal distributions t(θ̂21, 46.90, 9) and t(θ̂22, 28.05, 9) and the corresponding 95 per cent H.P.D. intervals for θ21 and θ22.

[Fig. 8.4.3 Contours of the posterior distribution of θ(2) and the associated marginal distributions for the product and by-product data.]
Figure 8.4.3 summarizes, then, the information about the slopes (θ21, θ22) coming from the data on the basis of a noninformative reference prior. The parameters are negatively correlated; the correlation between (θ21, θ22) is in fact the sample correlation between the errors (e_u1, e_u2),

r = a12/(a11 a22)^{1/2} = −0.81.  (8.4.77)

It is clear from the figure that care must be taken to distinguish between individual and joint inferences about (θ21, θ22). It could be exceedingly misleading to make inferences from the individual H.P.D. intervals about the parameters jointly (see the discussion in Section 2.7).
Joint Inferences about θ

To make overall inferences about the parameters [θ1, θ2], we need to calculate the distribution of the quantity U(θ) defined in (8.4.49). For instance, suppose we wish to decide whether or not the parameter point

θ0 = [70 17; 65 −30]  (8.4.78)

lies inside the 95 per cent H.P.D. region for θ. We have

(θ0 − θ̂)' X'X (θ0 − θ̂) = [31.8761 −0.6108; −0.6108 27.9272],  (8.4.79)

so that

U(θ0) = |A| / |A + (θ0 − θ̂)' X'X (θ0 − θ̂)| = 775.0668/4503.8878 = 0.1720.  (8.4.80)

Since m = 2, we may use the result in (8.4.59) to calculate the exact probability that U(θ) exceeds U(θ0). We obtain

Pr{U(θ) > U(θ0) | y} = 1 − I_{√0.1720}(9, 2) = 1 − 0.0022 = 0.9978.  (8.4.81)
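These numbers can be reproduced from the Table 8.4.1 data. The following sketch recomputes (8.4.79), (8.4.80) and (8.4.81); small discrepancies with the printed values reflect rounding in the book.

```python
import numpy as np
from scipy import stats

# Reproduce the joint-inference computation (8.4.79)-(8.4.81).
T = np.array([161.3, 164.0, 165.7, 170.1, 173.9, 176.2,
              177.6, 181.7, 185.6, 189.0, 193.5, 195.7])
x = (T - T.mean()) / 100.0
X = np.column_stack([np.ones_like(x), x])   # x_u1 = 1 carries the intercepts
XtX = X.T @ X

theta_hat = np.array([[71.1417, 18.0666],
                      [54.4355, -20.0933]])
theta0 = np.array([[70.0, 17.0],
                   [65.0, -30.0]])
A = np.array([[61.4084, -38.4823],
              [-38.4823, 36.7369]])

D = (theta0 - theta_hat).T @ XtX @ (theta0 - theta_hat)
U0 = np.linalg.det(A) / np.linalg.det(A + D)
print(D.round(4))                          # compare with (8.4.79)
print(U0)                                  # about 0.1720
print(stats.beta.sf(np.sqrt(U0), 9, 2))   # about 0.9978, cf. (8.4.81)
```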
From (8.4.50), we conclude that the point θ0 lies outside the 95 per cent H.P.D. region of θ. Note that while the point θ0 = (θ10, θ20) is excluded from the 95 per cent region, Figs. 8.4.2a,b show that both the points θ10 and θ20 are included in the corresponding marginal 95 per cent H.P.D. regions. This serves to illustrate once more the distinction between joint inferences and marginal inferences.

Approximating Distribution of U

It is informative to compare the exact distribution of U(θ) with the approximation in (8.4.63), using the present example (ν = 9, m = k = 2). From (8.4.59), the exact distribution of U is found to be

p(U) = [1/(2B(9, 2))] U^{3.5}(1 − U^{1/2}),  0 < U < 1.  (8.4.82)

Using the approximation given in (8.4.63) to (8.4.65), we find

φ = 19/18,  φν = 9.5,  M = −9.5 log U,

and

p(M) ≐ ¼ M exp(−½M),  0 < M < ∞.  (8.4.83)

This implies that the distribution of U is approximately

p(U) ≐ (22.5625)(−log U) U^{3.75},  0 < U < 1.  (8.4.84)

Table 8.4.2 gives a specimen of the exact and the approximate densities of U calculated from (8.4.82) and (8.4.84). Although the sample size is only 12, the agreement is very close.

Table 8.4.2 Comparison of the exact and the approximate distributions of U for n = 12 and m = k = 2

                 p(U)
U        Exact      Approximate
0.05     0.00098    0.00089
0.10     0.00973    0.00924
0.20     0.08900    0.08688
0.30     0.30098    0.29731
0.40     0.66947    0.66548
0.50     1.16498    1.16239
0.60     1.69708    1.69718
0.70     2.10935    2.11244
0.80     2.17561    2.18045
0.90     1.59706    1.60130
0.95     0.95220    0.95478
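The two columns of Table 8.4.2 follow directly from (8.4.82) and (8.4.84); a short sketch evaluating both, assuming SciPy for the beta function, is given below.

```python
import numpy as np
from scipy.special import beta as B

# Exact density (8.4.82) versus the chi-square-based approximation
# (8.4.84) for nu = 9, m = k = 2; reproduces Table 8.4.2.
def p_exact(u):
    return u**3.5 * (1.0 - np.sqrt(u)) / (2.0 * B(9, 2))

def p_approx(u):
    return 22.5625 * (-np.log(u)) * u**3.75

for u in [0.05, 0.10, 0.20, 0.50, 0.80, 0.95]:
    print(f"{u:4.2f}  {p_exact(u):.5f}  {p_approx(u):.5f}")
```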
8.5 SOME ASPECTS OF THE DISTRIBUTION OF Σ FOR THE CASE OF A COMMON DERIVATIVE MATRIX X

We discuss in this section certain results pertaining to the posterior distribution of the elements of the covariance matrix Σ = {σ_ij}.†

† For the important problem of making inferences about the latent roots and vectors of Σ, which is not discussed in this book, see Geisser (1965a) and Tiao and Fienberg (1969).

8.5.1 Joint Distribution of (θ, Σ)

When the X_i's are common, the joint posterior distribution of (θ, Σ) in (8.2.17) can be written

p(θ, Σ | y) ∝ |Σ|^{−½(ν+k+2m)} exp{−½ tr Σ^{-1}[A + (θ − θ̂)' X'X (θ − θ̂)]},  −∞ < θ < ∞,  Σ > 0,  (8.5.1)

where ν = n − (k + m) + 1 and use is made of (8.4.4) and (8.4.5). The distribution can be written as the product

p(θ, Σ | y) = p(θ | Σ, y) p(Σ | y).
Conditional Distribution of θ given Σ

Given Σ, we have that

p(θ | Σ, y) ∝ exp[−½ tr Σ^{-1}(θ − θ̂)' X'X (θ − θ̂)],  −∞ < θ < ∞,  (8.5.2)

which, by comparison with (8.4.45), is the km-dimensional Normal distribution

θ | Σ, y ~ N_{km}[θ̂, Σ ⊗ (X'X)^{-1}].  (8.5.3)

Marginal Distribution of Σ

Thus, the marginal posterior distribution of Σ is

p(Σ | y) ∝ |Σ|^{−½(ν+2m)} exp(−½ tr Σ^{-1}A),  Σ > 0.  (8.5.4)

From the Jacobian in (8.2.13), the distribution of Σ^{-1} is

p(Σ^{-1} | y) ∝ |Σ^{-1}|^{½(ν−2)} exp(−½ tr Σ^{-1}A),  Σ^{-1} > 0,  (8.5.5)

which, by comparing with (8.2.20), is recognized as the Wishart distribution W_m(A^{-1}, ν) provided ν > 0. The distribution of Σ in (8.5.4) may thus be called an m-dimensional "inverted" Wishart distribution with ν degrees of freedom, and be denoted by W_m^{-1}(A, ν). From (8.2.21) the normalizing constant for the distributions of Σ and Σ^{-1} is

|A|^{½(ν+m−1)} / {2^{½m(ν+m−1)} Γ_m[½(ν + m − 1)]}.  (8.5.6)

Note that when m = 1, so that ν = n − k, the distribution in (8.5.4) reduces to an inverted χ² distribution,

p(σ11 | y) ∝ σ11^{−½(n−k+2)} exp(−a11/(2σ11)),  σ11 > 0,  (8.5.7)

which is the posterior distribution of σ² = σ11 with data from a univariate multiple regression problem (see Section 2.7). From the results in (8.5.1), (8.5.3), (8.5.4) and (8.4.15), we have the following useful theorem.
Theorem 8.5.1 Let θ̂ be a k × m matrix of constants, X'X and A be, respectively, a k × k and an m × m positive definite symmetric matrix of constants, and ν > 0. If the joint distribution of the elements of the k × m matrix θ and the m × m positive definite symmetric matrix Σ is

p(θ, Σ | y) ∝ |Σ|^{−½(ν+k+2m)} exp{−½ tr Σ^{-1}[A + (θ − θ̂)' X'X (θ − θ̂)]},

then

a) given Σ,  θ ~ N_{mk}[θ̂, Σ ⊗ (X'X)^{-1}],
b) Σ ~ W_m^{-1}(A, ν),  and
c) θ ~ t_{km}[θ̂, (X'X)^{-1}, A, ν].

8.5.2 Some Properties of the Distribution of Σ

Consider the partition
and
It is to be remembered that 1: 11 is the covariance matrix of the m 1 errors (e u1 , .. . , ellm ,), nand T are, respecti vely, the covariance matrix and the m 1 x m 2 matrix of "regression coefficients" for the conditional dis tribution of the remaining m2 errors (eU("'j + 1)' . . . , ell"') given' (e lll , •.. , eumJ Denoting
A
=
[1~: \ 1::'] :: ,
(8. 5. 10)
and it can be readily shown that a) 1:11 is distributed independently of (T,n), b)
1:ll~ W,~/(All 'V),
(8.5.11) (8 .5. 12)
c) d)
T
~ t lnlm2
+ In l ) , TrAil (T -
-I (T, All ' A 22 . I ,
U(T) = 11m2 + AZ/l (T -
(8.5.13)
V
1')1-
1
(8.5 .14)
The above results may be verified as follows. Since Σ is positive definite, we can express the determinant and the inverse of Σ as

|Σ| = |Σ11||Ω|,  Σ^{-1} = [Σ11^{-1} 0; 0 0] + M,  (8.5.15)

where

M = [TΩ^{-1}T' −TΩ^{-1}; −Ω^{-1}T' Ω^{-1}].

Expression (8.5.15) may be verified by showing Σ^{-1}Σ = I. Thus, the distribution in (8.5.4) can be written

p(Σ | y) ∝ (|Σ11||Ω|)^{−½(ν+2m)} exp[−½(tr Σ11^{-1}A11 + tr MA)],  Σ > 0.  (8.5.16)

For fixed Σ11, it is readily seen by making use of (A8.1.1) in Appendix A8.1 that the Jacobian of the transformation from (Σ12, Σ22) to (T, Ω) is

J = |∂(Σ12, Σ22)/∂(T, Ω)| = |Σ11|^{m2}.  (8.5.17)

Noting that M does not depend on Σ11, it follows from (8.5.16) and (8.5.17) that Σ11 is independent of (T, Ω), so that

p(Σ11, T, Ω | y) = p(T, Ω | y) p(Σ11 | y),  (8.5.18)

where

p(T, Ω | y) ∝ |Ω|^{−½(ν+2m)} exp(−½ tr MA),  Ω > 0,  −∞ < T < ∞,  (8.5.19)

and

p(Σ11 | y) ∝ |Σ11|^{−½(ν+2m1)} exp(−½ tr Σ11^{-1}A11),  Σ11 > 0.  (8.5.20)

Thus, Σ11 is distributed as the inverted Wishart W_{m1}^{-1}(A11, ν) given in (8.5.11).

From (8.5.10) and (8.5.15), we may express the exponent of the distribution in (8.5.19) as

tr MA = tr(TΩ^{-1}T'A11 − TΩ^{-1}A21 − Ω^{-1}T'A12 + Ω^{-1}A22) = tr Ω^{-1}[A22·1 + (T − T̂)' A11 (T − T̂)].  (8.5.21)

In obtaining the extreme right of (8.5.21), repeated use is made of the fact that tr AB = tr BA. Thus,

p(T, Ω | y) ∝ |Ω|^{−½(ν'+m1+2m2)} exp{−½ tr Ω^{-1}[A22·1 + (T − T̂)' A11 (T − T̂)]},  −∞ < T < ∞,  Ω > 0,

where ν' = ν + m1, which is in exactly the same form as the joint distribution of (θ, Σ) in (8.5.1). The results in (8.5.12) and (8.5.13) follow by application of Theorem 8.5.1 (on p. 461). Further, by using Theorem 8.4.1 (on p. 449), we obtain (8.5.14).
Distribution of σ11

By setting m1 = 1 in (8.5.11), the distribution of σ11 is

p(σ11 | y) ∝ σ11^{−½(ν+2)} exp(−a11/(2σ11)),  σ11 > 0,  (8.5.22)

an inverted χ² distribution with ν = n − k − (m − 1) degrees of freedom. Comparing with the univariate case in (8.5.7), we see that the two distributions are identical except, as expected, that the degrees of freedom differ by (m − 1). The difference is of minor importance when m is small relative to n − k but can be appreciable otherwise; see the discussion about the posterior distribution of θ_i in (8.4.28). It is clear that the distribution of σ_ii, the ith diagonal element of Σ, is given by simply replacing the subscripts (1,1) in (8.5.22) by (i, i).
e
The Two Regression Matrices :E;11:E 12 and :E221 :E21 The m 1 x m2 matrix of "regression coefficients" T = :E;/:E12 measures the dependence of the conditional expectation of the errors (81/(111,+1)' ... ,81/"') on (e Il1 , ... , 8 11m J From the posterior distribution of T in (8.5.13) the plausibility of different values of the measure T may be compared in terms of e.g. the H.P.D. regions. In deciding whether a parameter point T = To lies inside or outside a desired H.P.D. region, from (8.5.14) and (8.4.66) one may then calculate Pr {VeT) > VeTo) I y}
== Pr {XI~""'2 < - cPt (v + mJ log VeTo)},
(8.5.23)
where
cPt
=
I
+
m 1 + Jn 2 - 3 2(v+m 1 )
.
In particular, if To = 0 which corresponds to :E12 are independent of (81/(111,+1)' ... , 8 um ), then
= 0, that IS,
(8 u1 , ... , 8 um ,)
(8.5.24) Consider now the m 2 x m 1 matrix of "regression coefficients"
Z
= :E221 :E21
(8.5.25)
It is clear from the development leading to (8.5.13) and (8.5.14) that by interchanging the roles of m 1 and m2' we have (8.5.26) where and (8.5.27)
464
:;,5
Some A;pects of Multivariate Analysis
T hus, in deciding whether the parameter point LO is included in a desired H.P.D. region, we calculate
(8.5.28) where
cp"= In particular, if Zo
1
m -+- 1112 - 3 2(v + m 2 )
+ -l - - - -
= 0 which again corresponds to U(Zo = 0) =
IAI IAllll Ad
1:12
= 0, then (8.5.29)
Although U(Zo = 0) = ["(To = 0) the probabilities on the right-hand sides of (8.5 .23) and (8.5.28) will be different whenever Inl =I- m2 ' This is not a surprising result. For, it can be seen from (8.5.18) that the distribution of T is in fact proportional to the conditional distribution of 1: 12 , given 1;11' Inferences about the parameter point 1:12 = 0 in terms of the probabililY Pr {X.;' ,1II 2 < - cp' (v + m l ) lo g U(To = O)} can thus be interpreted as made conditionally for fixed 1: 11 , That is to say that we are comparing the plausibility of 1: 12 = 0 with other values of 1: 12 in relation to a fixed 1: I I' On the other hand, in term s of Pr {X.;""" < - cpu (v + 1112) log U(Zo = O)}, inferences about 1:12 = 0 can be regarded as conditional on fixed 1: 22 , Thus, one would certainly not expect that, in general, the two types of conditional inferences about 1:12 = 0 will be identical.
8.5.3 An Example

For illustration, consider again the product and by-product data in Table 8.4.1. The relevant sample quantities are given in (8.4.73). When interest centers on the variance σ11 of the error e_u1 corresponding to the product y1, we have from (8.5.22)

σ11 ~ 61.41 χ_9^{-2}.  (8.5.30)

Using Table II (at the end of this book), limits of the 95 per cent H.P.D. interval of log σ11, expressed in terms of σ11, are (3.02, 20.79). Similarly, the posterior distribution of σ22 is such that

σ22 ~ 36.74 χ_9^{-2},  (8.5.31)

and the limits of the corresponding 95 per cent H.P.D. interval are (1.81, 12.44).

From (8.5.13), and since m1 = m2 = 1, the posterior distribution of T = σ11^{-1}σ12 is the univariate t(T̂, s_T², ν1) distribution, where

T̂ = a12/a11 = −0.627,  ν1 = ν + m1 = 10,  s_T² = a22·1/(ν1 a11) = (0.143)².

Thus,

(T + 0.627)/0.143 ~ t10,  (8.5.32)

so that from Table IV at the end of this book, limits of the 95 per cent H.P.D. interval are (−0.95, −0.31). In particular, the parameter point σ11^{-1}σ12 = 0 (corresponding to σ12 = 0) is excluded from the 95 per cent interval.

Finally, from (8.5.26), Z = σ22^{-1}σ12 is distributed as t(Ẑ, s_Z², ν2), where

Ẑ = a12/a22 = −1.05,  ν2 = ν + m2 = 10,  s_Z² = a11·2/(ν2 a22) = (a11 − a12²/a22)/(ν2 a22) = 0.0574.

Thus,

(Z + 1.05)/0.240 ~ t10.  (8.5.33)

Limits of the 95 per cent H.P.D. interval are (−1.58, −0.52) and the point σ12 = 0 is again excluded. Further, from (8.5.14), (8.5.27), (8.5.32) and (8.5.33),

Pr{U(T) > U(0) | y} = Pr{U(Z) > U(0) | y} = Pr{|t10| > 4.37},  (8.5.34)

so that inferences about σ12 = 0 in terms of either T or Z are identical. This is of course to be expected since, for this example, m1 = m2 = 1.
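The t summaries above can be verified numerically. A sketch, using the a_ij of (8.4.73) and assuming SciPy for the t quantile, follows; the s_T and s_Z formulas are the univariate specializations of (8.5.13) and (8.5.26).

```python
import numpy as np
from scipy import stats

# Posterior t summaries for T = sigma12/sigma11 and Z = sigma12/sigma22
# in the m1 = m2 = 1 case (equations 8.5.32-8.5.34).
nu = 9
a11, a22, a12 = 61.4084, 36.7369, -38.4823

T_hat = a12 / a11                                        # -0.627
sT = np.sqrt((a22 - a12**2 / a11) / ((nu + 1) * a11))    # 0.143
Z_hat = a12 / a22                                        # -1.05
sZ = np.sqrt((a11 - a12**2 / a22) / ((nu + 1) * a22))    # 0.240

# 95 per cent intervals (equal-tail, which is H.P.D. here by symmetry):
q = stats.t.ppf(0.975, df=nu + 1)
print(T_hat - q * sT, T_hat + q * sT)     # about (-0.95, -0.31)
print(Z_hat - q * sZ, Z_hat + q * sZ)     # about (-1.58, -0.52)
print(abs(T_hat) / sT, abs(Z_hat) / sZ)   # both about 4.37, cf. (8.5.34)
```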
8.5.4 Distribution of the Correlation Coefficient ρ12

The two regression matrices Σ11^{-1}Σ12 and Σ22^{-1}Σ21 are measures of the dependence (or association) between the two sets of responses (y_u1, ..., y_um1) and (y_u(m1+1), ..., y_um). When interest centers on the association between two specific responses y_ui and y_uj, the most natural measure of association is the correlation coefficient ρ_ij = σ_ij/(σ_ii σ_jj)^{1/2}. Without loss of generality, we now consider how inferences may be made about ρ12. By setting m1 = 2 in the distribution of Σ11 in (8.5.11), we can follow the development in Jeffreys (1961, p. 174) to obtain the posterior distribution of the correlation coefficient ρ12 as

p(ρ | y) = c (1 − ρ²)^{½(ν−2)} ∫_0^∞ ω^{-1}(ω + ω^{-1} − 2ρr)^{−(ν+1)} dω,  −1 < ρ < 1,  (8.5.35)

where ρ = ρ12,

r = r12 = a12/(a11 a22)^{1/2}

is the sample correlation coefficient, and c is the normalizing constant. It is noted that this distribution depends only upon the sample correlation coefficient r.

To see this, for m1 = 2, the posterior distribution of the elements (σ11, σ22, σ12) of Σ11 in (8.5.11) is

p(σ11, σ22, σ12 | y) ∝ [σ11σ22(1 − ρ²)]^{−½(ν+4)} exp{−[a11/σ11 + a22/σ22 − 2ρr(a11a22/σ11σ22)^{1/2}] / [2(1 − ρ²)]},  (8.5.36)

where, from (8.5.6), the normalizing constant is

(a11a22 − a12²)^{½(ν+1)} / {2^{ν+1} π^{1/2} ∏_{i=1}^{2} Γ[½(ν + 2 − i)]}.

We now make the transformation, due to Fisher (1915),

σ11 = a11 ωx,  σ22 = a22 x/ω,  σ12 = ρ(σ11σ22)^{1/2}.  (8.5.37)

The Jacobian is

∂(σ11, σ22, σ12)/∂(x, ω, ρ) = 2(a11a22)^{3/2} x²/ω,  (8.5.38)

so that the distribution of (x, ω, ρ) is

p(x, ω, ρ | y) ∝ (1 − ρ²)^{−½(ν+4)} ω^{-1} x^{−(ν+2)} exp[−(ω + ω^{-1} − 2ρr) / (2(1 − ρ²)x)],  ω > 0,  x > 0,  −1 < ρ < 1.  (8.5.39)

Upon integrating out x,

p(ω, ρ | y) ∝ (1 − ρ²)^{½(ν−2)} ω^{-1}(ω + ω^{-1} − 2ρr)^{−(ν+1)},  ω > 0,  −1 < ρ < 1,  (8.5.40)

from which we obtain the distribution of ρ given in (8.5.35).
The Special Case r = 0

When r = 0, the distribution in (8.5.35) reduces to

p(ρ | y) ∝ (1 − ρ²)^{½(ν−2)},  −1 < ρ < 1,  (8.5.41)

which is symmetric about ρ = 0, and is identical in form to the sampling distribution of r on the null hypothesis that ρ = 0. In this case, if we make the transformation

t = ν^{1/2} ρ / (1 − ρ²)^{1/2},  (8.5.42)

then the distribution of t is that of a Student variable, −∞ < t < ∞, so that the quantity t is distributed as t(0, 1, ν).
The General Case r ≠ 0

In general, the density function (8.5.35) cannot be expressed in terms of elementary functions. With the availability of a computer, it can always be evaluated by numerical integration, however. To illustrate, consider again the bivariate product and by-product data introduced in Table 8.4.1. Figure 8.5.1 shows the posterior distribution of ρ calculated from (8.5.35). For this example, ν = 9 and r = −0.81. The distribution is skewed to the right and concentrated rather sharply about its mode at ρ ≐ −0.87; it practically rules out values of ρ exceeding −0.3.

[Fig. 8.5.1 Posterior distribution of ρ for the product and by-product data.]
Series Expansions of p(ρ | y)

The distribution in (8.5.35) can be put in various different forms. In particular, it can be expressed as

p(ρ | y) = c (1 − ρ²)^{½(ν−2)} (1 − ρr)^{−(ν+½)} S_ν(ρ, r),  −1 < ρ < 1,  (8.5.43)

where

S_ν(ρ, r) = 1 + Σ_{l=1}^{∞} (1/l!) [(1 + ρr)/8]^{l} ∏_{s=1}^{l} (2s − 1)²/(ν + s + ½)

is a hypergeometric series, and c is the normalizing constant.

To see this, the integral in (8.5.35) can be written

∫_0^∞ ω^{-1}[ω + ω^{-1} − 2ρr]^{−(ν+1)} dω = 2 ∫_1^∞ ω^{-1}[ω + ω^{-1} − 2ρr]^{−(ν+1)} dω.  (8.5.44)

On the right-hand side of (8.5.44), we may make the substitution, again due to Fisher,

u = 1 − (1 − ρr)/[½(ω + ω^{-1}) − ρr] = [½(ω + ω^{-1}) − 1]/[½(ω + ω^{-1}) − ρr].  (8.5.45)

Noting that

∂u/∂ω = ω^{-1}(1 − u)(2u)^{1/2}[1 − ½(1 + ρr)u]^{1/2}(1 − ρr)^{−1/2},  (8.5.46)

we have

p(ρ | y) ∝ (1 − ρ²)^{½(ν−2)}(1 − ρr)^{−(ν+½)} ∫_0^1 (1 − u)^{ν}(2u)^{−1/2}[1 − ½(1 + ρr)u]^{−1/2} du,  −1 < ρ < 1.

Expanding the last term in the integrand in powers of u and integrating term by term, each term being a complete beta function, we obtain the result in (8.5.43).
We remark here that an alternative series representation of the distribution of ρ can be obtained as follows. In the integral of (8.5.35), since

ω + ω^{-1} − 2ρr = (ω^{1/2} − ω^{-1/2})² + 2(1 − ρr),  (8.5.48)

we can write the distribution in (8.5.35) as

p(ρ | y) ∝ (1 − ρ²)^{½(ν−2)} [2(1 − ρr)]^{−(ν+1)} ∫_0^∞ ω^{-1}[1 + (ω^{1/2} − ω^{-1/2})²/(2(1 − ρr))]^{−(ν+1)} dω.  (8.5.49)

Upon repeated integration by parts, the above integral can be expressed as a finite series involving powers of [(1 − ρr)/(1 + ρr)]^{1/2} and Student's t integrals. The density function of ρ can thus be calculated from a table of the t distribution. This process becomes very tedious when ν is moderately large, so its practical usefulness is limited.
The series S_ν(ρ, r) in (8.5.43) has its leading term equal to one, followed by terms of order ν^{−l}, l = 1, 2, .... When ν is moderately large, we may simply take the first term, so that approximately

p(ρ | y) ≐ c (1 − ρ²)^{½(ν−2)} (1 − ρr)^{−(ν+½)},  −1 < ρ < 1,  (8.5.50)

where c is the normalizing constant,

c^{-1} = ∫_{-1}^{1} (1 − ρ²)^{½(ν−2)} (1 − ρr)^{−(ν+½)} dρ.

Although evaluation of c would still require the use of numerical methods, it is much simpler to calculate the distribution of ρ using (8.5.50) than to evaluate the integral in (8.5.35) for every value of ρ. Table 8.5.1 compares the exact distribution with the approximation using the data in Table 8.4.1. In spite of the fact that ν is only 9, the agreement is very close.

It is easily seen that the density function (8.5.50) is greatest when ρ is near r. However, except when r = 0, the distribution is asymmetrical. The asymmetry can be reduced by making the transformation

ζ = tanh^{-1} ρ = ½ log[(1 + ρ)/(1 − ρ)],  (8.5.51)

due to Fisher (1921). Following his argument, it is found that ζ is approximately Normal,

ζ ~ N[tanh^{-1} r − 5r/(2(ν + 1)), (ν + 1)^{-1}].
Setting m = 2, k = 1, so that ν = n − 2, the distribution in (8.5.43) is identical to that given by Jeffreys for the case of sampling from a bivariate Normal population. Finally, we note that while we have obtained above the distribution of the specific correlation coefficient ρ = ρ12, it is clear that the distribution of any correlation ρ_ij, i ≠ j, is given simply by setting r = r_ij = a_ij/(a_ii a_jj)^{1/2} in (8.5.35) and its associated expressions.
Table 8.5.1 Comparison of the exact and the approximate distributions of ρ for ν = 9 and r = −0.81

                p(ρ | y)
ρ        Exact     Approximate
−0.98    0.2286    0.2283
−0.96    1.2164    1.2150
−0.94    2.4867    2.4844
−0.92    3.5159    3.5134
−0.90    4.1200    4.1180
−0.88    4.3281    4.3271
−0.86    4.2448    4.2448
−0.84    3.9782    3.9790
−0.82    3.6141    3.6157
−0.80    3.2127    3.2149
−0.70    1.5182    1.5210
−0.60    0.6584    0.6603
−0.50    0.2853    0.2865
−0.40    0.1260    0.1267
−0.30    0.0569    0.0572
−0.20    0.0261    0.0263
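Both columns of Table 8.5.1 can be regenerated by numerical integration. The sketch below, assuming SciPy's quad and trapezoid routines, evaluates the exact kernel of (8.5.35) and the one-term approximation (8.5.50) and normalizes each on a grid.

```python
import numpy as np
from scipy import integrate

# Exact posterior density of rho from (8.5.35) by numerical integration,
# against the one-term approximation (8.5.50); nu = 9, r = -0.81.
nu, r = 9, -0.81

def kernel_exact(rho):
    f = lambda w: (w + 1.0 / w - 2.0 * rho * r) ** (-(nu + 1)) / w
    val, _ = integrate.quad(f, 0.0, np.inf)
    return (1.0 - rho**2) ** ((nu - 2) / 2.0) * val

def kernel_approx(rho):
    return (1.0 - rho**2) ** ((nu - 2) / 2.0) * (1.0 - rho * r) ** (-(nu + 0.5))

grid = np.linspace(-0.999, 0.999, 801)
for kern in (kernel_exact, kernel_approx):
    vals = np.array([kern(x) for x in grid])
    vals /= integrate.trapezoid(vals, grid)          # normalize numerically
    print(np.interp([-0.88, -0.50], grid, vals))     # compare with Table 8.5.1
```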
8.6 A SUMMARY OF FORMULAE AND CALCULATIONS FOR MAKING INFERENCES ABOUT (θ, Σ)

Using the product, by-product data in Table 8.4.1 for illustration, Table 8.6.1 below provides a short summary of the formulae and calculations required for making inferences about the elements of (θ, Σ) for the linear model with common derivative matrix defined in (8.4.1). Specifically, the model is

Y = Xθ + ε,  (8.6.1)

where Y = [y1, ..., ym] is an n × m matrix of observations, X is an n × k matrix of fixed elements with rank k, θ = [θ1, ..., θm] is a k × m matrix of parameters, and ε = [ε(1), ..., ε(n)]' is an n × m matrix of errors. It is assumed that ε(u), u = 1, ..., n, are independently distributed as N_m(0, Σ).
Table 8.6.1 Summarized calculations for the linear model Y = Xθ + ε

1. From (8.4.1), (8.4.4), (8.4.7) and (8.4.13), obtain

n = 12,  m = k = 2,  ν = n − (m + k) + 1 = 9,

X'X = [12 0; 0 0.15],  C = (X'X)^{-1} = [0.08 0; 0 6.88],

θ̂ = (X'X)^{-1} X'Y = [71.14 18.07; 54.44 −20.09],

a_ij = (y_i − Xθ̂_i)'(y_j − Xθ̂_j),  i, j = 1, ..., m,  A = {a_ij} = [61.41 −38.48; −38.48 36.74].
2. Inferences about a specific column or row of θ: Writing

θ = [θ1, ..., θm] = [θ(1)'; ...; θ(k)']',

then from (8.4.28) and (8.4.34),

θ_i ~ t_k[θ̂_i, ν^{-1} a_ii C, ν],  i = 1, ..., m,  and  θ(g)' ~ t_m[θ̂(g)', ν^{-1} c_gg A, ν],  g = 1, ..., k.

Thus, for i = 1 and g = 2,

θ1 ~ t2[(71.14, 54.44)', 6.83 × [0.08 0; 0 6.88], 9],

θ(2)' ~ t2[(54.44, −20.09)', 0.76 × [61.41 −38.48; −38.48 36.74], 9].
3. H.P.D. regions of θ: To decide if a general parameter point θ0 lies inside or outside the (1 − α) H.P.D. region, from (8.4.63) and (8.4.66), use the approximation

−φν log U ~ χ²_{mk},

where

U = U(θ) = |A| |A + (θ − θ̂)' X'X (θ − θ̂)|^{-1}  and  φ = 1 + (m + k − 3)/(2ν),

so that θ0 lies inside the (1 − α) region if

−φν log U(θ0) < χ²(mk, α).

For the example, φ = 19/18. Thus, if

θ0 = [70 17; 65 −30],

then
the point lies inside the (1 − α) region if −9.5 log 0.17 = 16.7 < χ²(4, α).

4. H.P.D. regions for a block submatrix of θ: Let

θ = [θ11 θ12; θ21 θ22],  θ̂ = [θ̂11 θ̂12; θ̂21 θ̂22]  (k1, k2 rows; m1, m2 columns),

A = [A11 A12; A21 A22]  (m1, m2),  C = [C11 C12; C21 C22]  (k1, k2).

To decide, for example, if the parameter point θ11,0 lies inside or outside the (1 − α) H.P.D. region for θ11, use the approximation in (8.4.68),

−φ1 ν log U(θ11) ≐ χ²_{m1k1},

where

U(θ11) = |A11| |A11 + (θ11 − θ̂11)' C11^{-1}(θ11 − θ̂11)|^{-1}  and  φ1 = 1 + (1/2ν)(m1 + k1 − 3),

so that θ11,0 lies inside the (1 − α) region if

−φ1 ν log U(θ11,0) < χ²(m1k1, α).

Thus, if m1 = k1 = 1 and θ11,0 = 70, then φ1 = 17/18 and

U(θ11,0 = 70) = 61.41 / [61.41 + (70 − 71.14)²/0.08] = 0.79,

so that θ11,0 lies inside the (1 − α) region if −8.5 log 0.79 = 2.0 < χ²(1, α).
Note that since m1 = k1 = 1, exact results could be obtained and the above is for illustration only.

5. Inferences about the diagonal elements of Σ: From (8.5.22),

σ_ii ~ a_ii χ_ν^{-2},  i = 1, ..., m.

Thus, for i = 1, σ11 ~ 61.41 χ_9^{-2}.
6. H.P.D. regions for the "regression matrix" T: Let

T = Σ11^{-1}Σ12,  T̂ = A11^{-1}A12,  A22·1 = A22 − A21A11^{-1}A12.

To decide if a parameter point T0 lies inside or outside the (1 − α) H.P.D. region, from (8.5.23), use the approximation

−φ'(ν + m1) log U(T) ≐ χ²_{m1m2},

where

φ' = 1 + (m1 + m2 − 3)/[2(ν + m1)]  and  U(T) = |A22·1| |A22·1 + (T − T̂)' A11 (T − T̂)|^{-1},

so that T0 lies inside the region if

−φ'(ν + m1) log U(T0) < χ²(m1m2, α).

Thus, for m1 = m2 = 1, φ'(ν + m1) = 9.5. If T0 = 0, then A22·1 = a22·1 = 12.63 and T̂ = −0.63,

U(T0 = 0) = 12.63 / [12.63 + (0.63)² × 61.41] = 0.34,

and the point T0 = 0 will lie inside the (1 − α) region if −9.5 log 0.34 = 10.3 < χ²(1, α).

Note that (i) U(0) = |A| / (|A11||A22|) and (ii) since m1 = m2 = 1, exact results are, of course, available.

7. Inferences about the correlation coefficient ρ_ij: Use the approximating distribution in (8.5.50),

p(ρ | y) = c (1 − ρ²)^{½(ν−2)} (1 − ρ r_ij)^{−(ν+½)},  −1 < ρ < 1,

where r_ij = a_ij/(a_ii a_jj)^{1/2}, i, j = 1, ..., m. Thus,

p(ρ | y) ∝ (1 − ρ²)^{3.5} / (1 + 0.81ρ)^{9.5},  −1 < ρ < 1.
The normalizing constant of the distribution can be obtained by numerical integration when desired.
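The quantities in Table 8.6.1 can be computed from the raw data of Table 8.4.1 in a few lines. The following sketch is illustrative only and covers items 1 and 6; the results agree with the table up to rounding.

```python
import numpy as np

# Summary quantities of Table 8.6.1 from the raw data of Table 8.4.1.
T = np.array([161.3, 164.0, 165.7, 170.1, 173.9, 176.2,
              177.6, 181.7, 185.6, 189.0, 193.5, 195.7])
Y = np.array([[63.7, 20.3], [59.5, 24.2], [67.9, 18.0], [68.8, 20.5],
              [66.1, 20.1], [70.4, 17.5], [70.0, 18.2], [73.7, 15.4],
              [74.1, 17.8], [79.6, 13.3], [77.1, 16.7], [82.8, 14.8]])

n, m = Y.shape
X = np.column_stack([np.ones(n), (T - T.mean()) / 100.0])
k = X.shape[1]
nu = n - (m + k) + 1

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # item 1: theta-hat
resid = Y - X @ theta_hat
A = resid.T @ resid                             # item 1: matrix A
print(theta_hat.round(2))   # [[71.14, 18.07], [54.44, -20.09]]
print(A.round(2))           # [[61.41, -38.48], [-38.48, 36.74]]

# item 6: test T0 = 0 for T = sigma11^{-1} sigma12 (m1 = m2 = 1)
a11, a12, a22 = A[0, 0], A[0, 1], A[1, 1]
A22_1 = a22 - a12**2 / a11
U0 = A22_1 / (A22_1 + (a12 / a11)**2 * a11)
print(-9.5 * np.log(U0))    # about 10, cf. the value 10.3 in Table 8.6.1
```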
APPENDIX A8.1

The Jacobians of Some Matrix Transformations

We here give the Jacobians of some matrix transformations useful in multivariate problems. In what follows the signs of the Jacobians are ignored.

a) Let X be a k × m matrix of km distinct random variables. Let A and B be, respectively, k × k and m × m matrices of fixed elements. If

Z1 = AX,  Z2 = XB,  Z3 = AXB,

then the Jacobians are, respectively,

|∂Z1/∂X| = |A|^m,  |∂Z2/∂X| = |B|^k,  |∂Z3/∂X| = |A|^m |B|^k.  (A8.1.1)

b) Let X be an m × m symmetric matrix consisting of ½m(m + 1) distinct elements and let C be an m × m matrix of fixed elements. If

Y = CXC',

then

|∂Y/∂X| = |C|^{m+1}.  (A8.1.2)

For proofs, see Deemer and Olkin (1951), based on results given by P. L. Hsu; also Anderson (1958, p. 162).

c) Let X be an m × m nonsingular matrix of random variables and Y = X^{-1}. Then,

∂Y/∂z = −Y (∂X/∂z) Y,  (A8.1.3)

where z is any function of the elements of X.

Proof: Since XY = I, it follows that

∂(XY)/∂z = (∂X/∂z) Y + X (∂Y/∂z) = 0.

Hence

∂Y/∂z = −X^{-1}(∂X/∂z) Y = −Y (∂X/∂z) Y.

The Jacobians of two special cases of X are of particular interest.

a) If X has m² distinct random variables, then from (A8.1.1)

|∂Y/∂X| = |Y|^{2m}.  (A8.1.4)

b) If X is symmetric and consists of ½m(m + 1) distinct random variables, then from (A8.1.2)

|∂Y/∂X| = |Y|^{m+1}.  (A8.1.5)

APPENDIX A8.2

The Determinant of the Information Matrix of Σ^{-1}

We now obtain the determinant of the information matrix of Σ^{-1} for the m-dimensional Normal distribution N_m(μ, Σ). The density is

p(y | μ, Σ) = (2π)^{−m/2} |Σ^{-1}|^{1/2} exp[−½ tr Σ^{-1}(y − μ)(y − μ)'],  −∞ < y < ∞,  (A8.2.1)

where y = (y1, ..., ym)', μ = (μ1, ..., μm)', Σ = {σ_ij} and Σ^{-1} = {σ^{ij}}, i, j = 1, ..., m. We assume that Σ (and Σ^{-1}) consists of ½m(m + 1) distinct elements. Taking logarithms of the density function, we have

log p = −(m/2) log(2π) + ½ log|Σ^{-1}| − ½ tr Σ^{-1}(y − μ)(y − μ)'.  (A8.2.2)

Differentiating log p with respect to σ^{ij}, i, j = 1, ..., m, i ≤ j,

∂ log p/∂σ^{ij} = ½ |Σ^{-1}|^{-1} ∂|Σ^{-1}|/∂σ^{ij} + (a term not involving Σ^{-1}).  (A8.2.3)

Since ∂|Σ^{-1}|/∂σ^{ij} = Σ^{ij}, where Σ^{ij} is the cofactor of σ^{ij}, it follows that the first term on the right-hand side of (A8.2.3) is simply ½σ_ij. Thus, the second derivatives are

∂² log p / (∂σ^{ij} ∂σ^{kl}) = ½ ∂σ_ij/∂σ^{kl},  i, j = 1, ..., m (i ≤ j);  k, l = 1, ..., m (k ≤ l).  (A8.2.4)

Consequently, the determinant of the information matrix is proportional to

|J(Σ^{-1})| = |−E{∂² log p / ∂σ^{ij} ∂σ^{kl}}| ∝ |∂Σ/∂Σ^{-1}|.  (A8.2.5)

From (A8.1.5), we have that

|∂Σ/∂Σ^{-1}| = |Σ|^{m+1}.  (A8.2.6)

APPENDIX A8.3

The Normalizing Constant of the t_{km}[θ̂, (X'X)^{-1}, A, ν] Distribution

Let θ be a k × m matrix of variables. We now show that

∫_{−∞ < θ < ∞} |I_m + A^{-1}(θ − θ̂)' X'X (θ − θ̂)|^{−½(ν+k+m−1)} dθ = c(m, k, ν) |X'X|^{−m/2} |A|^{k/2},  (A8.3.1)

where ν > 0, θ̂ is a k × m matrix, X'X and A^{-1} are, respectively, a k × k and an m × m positive definite symmetric matrix,

c(m, k, ν) = [Γ(½)]^{mk} Γ_m[½(ν + m − 1)] / Γ_m[½(ν + k + m − 1)],

and Γ_p(b) is the generalized Gamma function defined in (8.2.22).

Since X'X and A are assumed positive definite, there exist a k × k and an m × m nonsingular matrices G and H such that

X'X = G'G and A^{-1} = HH'.  (A8.3.2)

Let T be a k × m matrix such that

T = G(θ − θ̂)H.  (A8.3.3)

Using the identity (8.4.11), we can write

|I_m + A^{-1}(θ − θ̂)' X'X (θ − θ̂)| = |I_m + H'(θ − θ̂)' G'G (θ − θ̂)H| = |I_m + T'T| = |I_k + TT'|.  (A8.3.4)

From (A8.1.1) the Jacobian of the transformation (A8.3.3) is

|∂T/∂θ| = |G|^m |H|^k = |X'X|^{m/2} |A^{-1}|^{k/2}.  (A8.3.5)

Thus, the integral on the left-hand side of (A8.3.1) is

|X'X|^{−m/2} |A^{-1}|^{−k/2} Q_m,  (A8.3.6)

where

Q_m = ∫_{−∞ < T < ∞} |I_k + TT'|^{−½(ν+k+m−1)} dT.  (A8.3.7)

Let T = [t1, ..., tm] = [T1, tm], where t_i is a k × 1 vector, i = 1, ..., m. Then

|I_k + TT'| = |I_k + T1T1' + tm tm'| = |I_k + T1T1'| [1 + tm'(I_k + T1T1')^{-1} tm].  (A8.3.8)

It follows that

Q_m = ∫ |I_k + T1T1'|^{−½(ν+k+m−1)} q_m dT1,  (A8.3.9)

where

q_m = ∫_{−∞ < tm < ∞} [1 + tm'(I_k + T1T1')^{-1} tm]^{−½(ν+m−1+k)} dtm.

From (A2.1.12) in Appendix A2.1,

q_m = [Γ(½)]^k {Γ[½(ν + m − 1)] / Γ[½(ν + k + m − 1)]} |I_k + T1T1'|^{1/2}.

Thus,

Q_m = [Γ(½)]^k {Γ[½(ν + m − 1)] / Γ[½(ν + k + m − 1)]} Q_{m−1},

where

Q_{m−1} = ∫ |I_k + T1T1'|^{−½(ν+m−2+k)} dT1.

The result in (A8.3.1) follows by repeating the process m − 1 times.

APPENDIX A8.4

The Kronecker Product of Two Matrices

We summarize in this appendix some properties of the Kronecker product of two matrices.

Definition: If A = {a_ij} is an m × m matrix and B is an n × n matrix, then the Kronecker product of A and B, in that order, is the (mn) × (mn) matrix A ⊗ B = {a_ij B}.

Properties:

i) (A ⊗ B)' = A' ⊗ B'.

ii) If A and B are symmetric, then A ⊗ B is symmetric.

iii) When A and B are nonsingular, then (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}.

iv) tr A ⊗ B = (tr A)(tr B).

v) If C is an m × m matrix and D is an n × n matrix, then

(A + C) ⊗ B = A ⊗ B + C ⊗ B and (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).
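The Kronecker product identities above are conveniently spot-checked numerically; the sketch below, assuming NumPy, verifies properties (i), (iii), (iv) and (v) on random matrices.

```python
import numpy as np

# Numerical spot-check of the Kronecker product properties in Appendix A8.4.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(2, 2))
C = rng.normal(size=(3, 3))
D = rng.normal(size=(2, 2))

assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))                    # (i)
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))           # (iii)
assert np.allclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B))    # (iv)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))  # (v)
print("all Kronecker identities verified")
```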
CHAPTER 9

ESTIMATION OF COMMON REGRESSION COEFFICIENTS

9.1 INTRODUCTION: PRACTICAL IMPORTANCE OF COMBINING INFORMATION FROM DIFFERENT SOURCES

Problems occur, notably in engineering, economics, and business, where it is difficult to obtain sufficiently reliable estimates of critical parameters. One way to improve estimates is to seek more data of the same kind. But in engineering this may mean running further costly experiments, while in economics and business further data from the same source may not be available. Another way in which estimates can sometimes be improved is by combining information from different kinds of data. Thus, in Section 8.2.6 we considered a chemical system producing three different products where the yield of each product depended on one or both of two chemical rate parameters. It was demonstrated that, by appropriately utilizing observations on all three products, much more precise inferences could be drawn than would have been possible with only one.

As a further illustration, in a study of the effect of interest rates on the demand for durables, an economist might wish to appropriately combine demand data from m different communities. Let y_i = (y_{1i}, ..., y_{ni})' be an n × 1 vector, representing demand over n specific periods, for the ith community. Suppose that

y_i = E(y_i) + ε_i,  i = 1, ..., m,  (9.1.1)

where θ_i is a parameter for the ith community, measuring the effect on demand of a change in the interest rate, ξ_i is the corresponding n × 1 vector of interest rates, and ε_i = (ε_{1i}, ..., ε_{ni})' is the associated n × 1 vector of errors. To complete the model we must make specific assumptions about (i) the relationship among (θ1, ..., θm), (ii) the nature of the expectation functions η_i(ξ_i, θ_i), and (iii) the form of the probability distribution of ε_i. We discuss these assumptions below:

(i) If there is reason to believe that the parameters (θ1, ..., θm) for the m different communities resemble one another, then, irrespective of whether or not there are correlations among the m vectors of errors (ε1, ..., εm), information about θ_i from the ith response vector y_i ought to contribute to knowledge about θ_j. A formulation which takes account of one kind of resemblance supposes that θ1, ..., θm
are m independent observations from some population distributed as f(θ). In particular, the random effect model discussed in Section 7.2 was of this kind. If the variance of f(θ) was expected to be small, the m vectors of expectation functions η_i might then be regarded as containing a common parameter θ1 = ⋯ = θm = θ_c, and data from all m communities could be used to make inferences about θ_c. We have already considered a multivariate setup of this kind in Section 8.2.6. It was there possible to show that, even in the general situation where the expectation functions were nonlinear in the parameters, exact results could be obtained yielding the posterior distribution of the common parameters in a form which was mathematically simple.

ii) Mathematical simplicity does not invariably lead to tractable computation, however, and we now consider the problem with the added simplification that the expectation functions are linear in the parameters θ_c. As usual, consideration of the more manageable linear case has more than one motivation. On the one hand, practical problems do occur where the appropriate expectation functions are indeed linear in the parameters. On the other hand, for nonlinear problems, we rarely need to concern ourselves with nonlinearity over the whole permissible region of variation of θ_c, but only in some smaller sub-region of the parameter space, for example, over a (1 − α) H.P.D. region for some small α. A function which is "globally" highly nonlinear in θ can be, in this sense, "locally" nearly linear. In such a case, as is indicated in Section 8.3.1, linear distribution theory can provide useful approximations. In the Bayesian formulation, we are concerned only with approximate linearity in relation to the one sample of observations which has in fact occurred and not with some hypothetical "repeated samples from the same population" which have not. It follows that we can, for any given data sample, check the adequacy of the assumption of local linearity by direct numerical computation as part of the statistical analysis.

iii) We shall suppose that the n vectors ε(u) = (ε_{u1}, ..., ε_{um}), u = 1, ..., n, are a sample of independent observations from the m-dimensional Normal population N_m(0, Σ), where Σ = {σ_ij}.

In general, the relationships connecting demand with interest rate will contain, not one, but k common parameters θ_c. Thus, with the above assumptions we are led to a linear, Normal, common parameter model, which can be written

y_i = X_i θ_c + ε_i,  i = 1, ..., m,  (9.1.2)

with X_i an n × k matrix appropriate to the ith community and θ_c a k × 1 vector of parameters common to all communities. This model (9.1.2) will be recognized as a special case of the general linear model set out in (8.3.2),

y_i = X_i θ_i + ε_i,  i = 1, ..., m,  (9.1.3)

appropriate to the circumstance that the regression coefficients θ1, ..., θm associated with the m vectors of responses are common, so that

θ1 = θ2 = ⋯ = θm = θ_c.  (9.1.4)

In what follows we shall first discuss the situation where the m response vectors y1, ..., ym are supposed independent of one another, so that Σ is diagonal. The posterior distribution of θ_c is then a product of m multivariate t distributions. A useful and simple approximation procedure is developed for making inferences about subsets of θ_c for the case of two responses. The more general situation is next considered where the m responses are correlated. Except when the derivative matrices X1, ..., Xm are identical, the resulting posterior distribution of θ_c is rather complicated analytically. A useful approximation procedure will, however, be developed for m = 2.

As we have pointed out in Section 8.2.6, care must be exercised in the pooling of information to estimate common parameters. A natural question that arises in the investigator's mind will be whether the evidence from the data is compatible with the supposition that θ1 = ⋯ = θm = θ_c. In the following section, this question is considered for the important special case of estimating the assumed common mean of two Normal populations which have possibly unequal variances, a problem which has an interesting history.

9.2 THE CASE OF m INDEPENDENT RESPONSES
Suppose there are m independent responses related to the same parameters θ_c = (θ1, ..., θk)' by possibly different linear models. Suppose further that the number of observations for each response is not necessarily the same. Then the likelihood function is

l(θ_c, σ11, ..., σmm | y) ∝ ∏_{i=1}^{m} σ_ii^{−n_i/2} exp[−(1/(2σ_ii))(y_i − X_iθ_c)'(y_i − X_iθ_c)],  (9.2.1)

where y_i is an n_i × 1 vector of observations and X_i an n_i × k matrix of fixed elements. Relative to the noninformative reference prior

p(θ_c, σ11, ..., σmm) ∝ ∏_{i=1}^{m} σ_ii^{-1},

the posterior distribution is

p(θ_c, σ11, ..., σmm | y) ∝ ∏_{i=1}^{m} σ_ii^{−(n_i/2 + 1)} exp[−(1/(2σ_ii))(y_i − X_iθ_c)'(y_i − X_iθ_c)],  (9.2.2)

where

θ̂_i = (X_i'X_i)^{-1} X_i' y_i and a_ii = (y_i − X_iθ̂_i)'(y_i − X_iθ̂_i).

Integrating out σ11, ..., σmm, the posterior distribution of the common parameters θ_c is then proportional to the product of m multivariate t distributions,

p(θ_c | y) ∝ ∏_{i=1}^{m} [a_ii + (θ_c − θ̂_i)' X_i'X_i (θ_c − θ̂_i)]^{−n_i/2},  −∞ < θ_c < ∞.  (9.2.3)

Properties of this distribution have been considered by Dickey (1967a, 1968) and others.
9.2.1 The Weighted Mean Problem

The simplest problem of this kind has come to be called the problem of the weighted mean, Yates (1939). Suppose that m = 2, k = 1 and (X1 = 1_{n1}, X2 = 1_{n2}) are columns of ones. Typically we have two sets of Normally distributed observations y1 = (y11, ..., y_{n1 1})' and y2 = (y12, ..., y_{n2 2})', independent of one another and having different variances σ1² = σ11 and σ2² = σ22, but both assumed to have the same mean θ. The posterior distribution of θ is then proportional to the product of two t distributions,

p(θ | y) ∝ ∏_{i=1}^{2} [1 + n_i(θ − ȳ_i)²/(ν_i s_i²)]^{−½(ν_i+1)},  −∞ < θ < ∞,  (9.2.4)

where

ν_i = n_i − 1,  ȳ_i = n_i^{-1} Σ_u y_{ui},  ν_i s_i² = Σ_u (y_{ui} − ȳ_i)²,  i = 1, 2,
a distribution given by Jeffreys (1961). Simultaneously Fisher (1961a, b) obtained an identical distribution using a fiducial argument. Except for the normalizing constant, the density (9.2.4) can be easily calculated from a table of the density function of Student's t; see, e.g., Bracken and Schleifer (1964).

An Example

The following results were obtained in estimating a physical constant by two independent methods:

        Method 1   Method 2
ȳ       107.9      109.5
s²      12.1       39.4
n       11         13
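The combined posterior (9.2.4) for these data can be evaluated on a grid; a minimal sketch, assuming NumPy, follows. The grid limits are arbitrary choices for this example.

```python
import numpy as np

# Weighted-mean posterior (9.2.4): the product of two Student t densities,
# evaluated on a grid for the physical-constant data.
ybar = np.array([107.9, 109.5])
s2 = np.array([12.1, 39.4])
n = np.array([11, 13])
nu = n - 1

theta = np.linspace(104.0, 114.0, 4001)
logp = np.zeros_like(theta)
for i in range(2):
    logp -= 0.5 * (nu[i] + 1) * np.log(1 + n[i] * (theta - ybar[i])**2
                                       / (nu[i] * s2[i]))
p = np.exp(logp - logp.max())
p /= p.sum() * (theta[1] - theta[0])   # normalize on the grid
print(theta[np.argmax(p)])             # posterior mode, about 108.3
```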
[Fig. 9.2.1 Individual posterior distributions of θ: estimation of a common physical constant.]
[Fig. 9.2.2 Combined posterior distribution of θ, with the broken line showing (∂/∂θ) log p(θ | y): estimation of a common physical constant.]
If θ1 and θ2 are the means for the two methods, then crucial to what follows is the assumption that θ1 = θ2 = θ. We shall entertain this assumption for the moment and later discuss it further in Section 9.2.3. With this assumption, then, the two t distributions shown in Fig. 9.2.1, centered at 107.9 and 109.5 with scale factors s1/√n1 = 1.03 and s2/√n2 = 1.74, are the individual posterior distributions p(θ | y1) and p(θ | y2) obtained respectively from the results of the two methods. The unnormalized posterior distribution of θ, which combines information from both samples, is obtained by multiplying the t densities and is shown by the "t-like" distribution in Fig. 9.2.2. As might be expected, the combined distribution is "located" between the individual t distributions. For this example, p(θ | y1) has a narrower spread than p(θ | y2) and consequently exerts a stronger influence on the final posterior distribution.
9.2.2 Some Properties of p(θ | y)

Although the distribution p(θ | y) in (9.2.4) is the product of two t distributions which are symmetric, the distribution itself will not be symmetric unless either ȳ1 = ȳ2, or n1 = n2 and s1² = s2². We see for example in Fig. 9.2.2 that the distribution is slightly skewed to the right. Upon differentiating the logarithm of p(θ | y) with respect to θ, it is easy to see that for ȳ1 < ȳ2,
∂/∂θ log p(θ | y) ≥ 0 for θ < ȳ1 and ∂/∂θ log p(θ | y) ≤ 0 for θ > ȳ2.  (9.2.5)

Indeed,

∂/∂θ log p(θ | y) = −Σ_{i=1}^{2} (ν_i + 1) n_i(θ − ȳ_i) / [ν_i s_i² + n_i(θ − ȳ_i)²],  (9.2.6)

so that the distribution is increasing in (−∞, ȳ1) and decreasing in (ȳ2, ∞), and consequently the mode(s) of the distribution will lie between (ȳ1, ȳ2).
so that the distribution is increasing in (- 00, .vI) and decreasing in (.Y2, (0) and consequently the mode(s) of the distribution will lie between (.Y I ' Y2)'
Bimodality When .vI and .vz are wide enough apart, the distribution becomes bimodal. By setting (0/8e) logp(e I y) = 0, it can be verified that when n l = n2 = n (so that VI = V2 = v) a sufficient and necessary condition for bimodality is (9.2.7)
where
To consider the implications of this condition, suppose for the moment that the sample
484
Estimation of Common Regression Coefficients
9.2
si
variances are nearly equal, so that == s~ =" S2 and the second term on the left can be ignored, then (9.2.7) implies that the distribution will be bimodal only if (9.2.8) i.e., if where s.d. =
;nJv ~
2
is the standard deviation of the individual posterior distributions pee I )'1) and J I this means that 5'1 and 5'2 must be at least 6 standard deviations apart before the distribution becomes bimodal. The practical implication of this analysis is now considered. A question mentioned already and to be discussed in more detail in Section 9.2.3 is the appropriateness of the assumption of compatibility of the means. That is, the assumption that, to a sufficient approximation, e 1 = e 2 = e. One might at first think that bimodality in the posterior distribution of or its absence could provide a guide as to whether the compatibility assumption was reasonable. Expression (9 .2.8) makes it clear that while bimodality would normally imply incompatibility, lack of bimodality certainly need not imply the converse. To see why this is so, we have only to remember that for sufficiently large v the component I distributions will approach the Normal form. Thus, for large sample sizes the right-hand side of (9.2.4) approaches the product of two Normal densities. But the product of two Normal densities, whatever their separation, is proportional to another Normal density, which is of course unimodal. It is not surprising then that, relative to their spread, the separation of the two component posterior density functions necessary to induce bimodality increases without limit as v is increased.
pee I Y2)' For example, when v =
e
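The separation rule just described can be checked numerically by counting the modes of the product of two t densities; the sketch below, assuming NumPy and taking the scale factor s/√n as the unit, is illustrative only.

```python
import numpy as np

# Numerical check of bimodality: with n1 = n2 = n and equal sample
# variances, scan the separation at which the product of the two t
# densities develops two modes (roughly 2*sqrt(nu) scale units).
def n_modes(half_sep, nu, grid=20001):
    t = np.linspace(-half_sep - 10, half_sep + 10, grid)
    logp = (-(nu + 1) / 2) * (np.log(1 + (t - half_sep)**2 / nu)
                              + np.log(1 + (t + half_sep)**2 / nu))
    interior = (logp[1:-1] > logp[:-2]) & (logp[1:-1] > logp[2:])
    return interior.sum()

nu = 11
for half_sep in [2.0, 3.0, 3.4, 4.0]:          # separation = 2 * half_sep
    print(2 * half_sep, n_modes(half_sep, nu))
# Two modes appear once the separation exceeds about 2*sqrt(11) = 6.6.
```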
Normal Approximation to pee I Y) When the two sample means Yl and Y2 are not widely discrepant , the combined posterior distribution (9 .2.4) will be unimodal and located between the two individual t distributions. Although the unnormalized density function of can be easily determined , it does not seem possible to express the posterior probability integrals in a simple form. Thus , for example , exact evaluation of the probability content of a specific H.P.D. interval would require the use of numerical methods. However, we see that when both VI and V 2 are large the distribution in (9.2.4) is approximately
e
p(GI y) :X exp [where
1/1' (e
- 8)2J,
(9.2.9)
As shown by Fisher (1961b), the above expression can be made the leading term of an asymptotic expansion of the distribution in (9.2.4) in powers of $\nu_1^{-1}$ and $\nu_2^{-1}$. The "correction" terms in the asymptotic series are polynomials in $\theta$. In Appendix A9.1 the asymptotic procedure is given for the more general situation when $m = 2$ but $\theta_c$ consists of more than one element. An alternative and simpler approximation to the distribution of $\theta_c$ based upon the multivariate $t$ distribution will be developed in Section 9.3.

Returning to the distribution shown in Fig. 9.2.2 for the physical constant example, we see that the distribution is only slightly asymmetrical. Using the Normal approximation in (9.2.9), we have

$$\hat{\theta} = \left(\frac{11}{12.1} + \frac{13}{39.4}\right)^{-1}\left[\frac{11}{12.1}(107.9) + \frac{13}{39.4}(109.5)\right] = 108.34,$$

which nearly coincides with the mode of the distribution in the figure. This itself, of course, does not imply that the distribution can be closely approximated by a Normal distribution without "correction" terms. A convenient device advocated by Barnard† for checking Normality is to graph the first derivative of the logarithm of the posterior density function against the parameter value. A unique characteristic of the Normal distribution is that this plot produces a straight line. A plot of the derivative in (9.2.6) in the range where the density is appreciable is shown by the broken line in Fig. 9.2.2. It is nearly linear, so that the posterior distribution $p(\theta \mid y)$ may be treated approximately as $N(108.34,\, 0.807)$.

† Personal communication.
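Barnard's device is straightforward to apply numerically: differentiate the log of the unnormalized posterior on a grid and inspect the plot of the derivative for straightness. A sketch, assuming the posterior is the product of two scaled $t$ factors as in (9.2.4); the degrees of freedom and scales chosen below are illustrative assumptions, not values fixed by the text.

```python
import numpy as np
from scipy import stats

# Summaries in the spirit of the physical-constant example.
ybar = [107.9, 109.5]; s2 = [12.1, 39.4]; n = [11, 13]; nu = [10, 12]

def log_post(theta):
    return sum(stats.t.logpdf(theta, nu[i], loc=ybar[i],
                              scale=np.sqrt(s2[i] / n[i]))
               for i in range(2))

theta = np.linspace(105, 112, 801)
dlogp = np.gradient(log_post(theta), theta)  # d/dtheta log p(theta | y)
# For an exactly Normal posterior this derivative is a straight line;
# curvature in the plot of (theta, dlogp) flags non-Normality.
slope, intercept = np.polyfit(theta, dlogp, 1)
print("approximate precision psi =", -slope)
```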
9.2.3 Compatibility of the Means

Before information from $m$ sets of measurements obtained by $m$ different methods is combined, one should first consider whether the methods are measuring the same thing. In particular, it is helpful to compare individual posterior distributions, as was done for the numerical example in Fig. 9.2.1. More formally, questions of compatibility may be studied by considering the distribution of contrasts among the means derived from the joint "unconstrained" distribution of $\theta_1, \ldots, \theta_m$, when these parameters are not regarded as common. We illustrate with the case $m = 2$ and, as before, use an asterisk $*$ to denote unconstrained distributions, the constraint being that $\theta_1 = \theta_2$. As we have seen in Section 2.5, relative to the noninformative reference prior distribution

$$p^*(\theta_1, \theta_2, \sigma_1, \sigma_2) \propto (\sigma_1\sigma_2)^{-1}, \tag{9.2.10}$$

the joint posterior distribution of $\theta_1$ and $\theta_2$ is

$$p^*(\theta_1, \theta_2 \mid y) \propto \left[1 + \frac{n_1(\theta_1 - \bar{y}_1)^2}{\nu_1 s_1^2}\right]^{-\frac{1}{2}(\nu_1+1)}\left[1 + \frac{n_2(\theta_2 - \bar{y}_2)^2}{\nu_2 s_2^2}\right]^{-\frac{1}{2}(\nu_2+1)}, \tag{9.2.11}$$

$$-\infty < \theta_1 < \infty, \qquad -\infty < \theta_2 < \infty.$$
Figure 9.2.3 shows the contours of $p^*(\theta_1, \theta_2 \mid y)$ as well as various other aspects of the distribution.
[Fig. 9.2.3 Various aspects of the unconstrained posterior distribution of $(\theta_1, \theta_2)$.]
To obtain the distribution of the difference in means, we write

$$\delta = \theta_2 - \theta_1 \qquad \text{and} \qquad \theta_w = w_1\theta_1 + w_2\theta_2, \tag{9.2.12}$$

where $w_1$ and $w_2$ are weights with $w_1 + w_2 = 1$. Then the joint distribution of $\delta$ and $\theta_w$ is

$$p^*(\delta, \theta_w \mid y) \propto \left[1 + \frac{n_1(\theta_w - w_2\delta - \bar{y}_1)^2}{\nu_1 s_1^2}\right]^{-\frac{1}{2}(\nu_1+1)}\left[1 + \frac{n_2(\theta_w + w_1\delta - \bar{y}_2)^2}{\nu_2 s_2^2}\right]^{-\frac{1}{2}(\nu_2+1)}, \tag{9.2.13}$$

$$-\infty < \delta < \infty, \qquad -\infty < \theta_w < \infty.$$

The distribution can be written as the product

$$p^*(\delta, \theta_w \mid y) = p^*(\delta \mid y)\, p^*(\theta_w \mid \delta, y). \tag{9.2.14}$$
Whatever the choice of the weights $w_1$ and $w_2$, the density $p^*(\delta \mid y)$ will be proportional to the Behrens-Fisher distribution in (2.5.12) appropriate for making inferences about the difference $\delta = \theta_2 - \theta_1$. The distribution $p(\theta \mid y)$ of the common mean in (9.2.4) is $p^*(\theta_w \mid \delta = 0, y)$, the conditional distribution of $\theta_w$ given that it is known that $\delta = \theta_2 - \theta_1 = 0$. To shed light on the question of whether a supposition that $\delta$ was close to zero was, or was not, tenable in the light of the data, one could inspect the distribution $p^*(\delta \mid y)$, the appropriately scaled Behrens-Fisher distribution for the difference in means. In a case like that illustrated in Fig. 9.2.4, the relevance of the distribution $p(\theta \mid y)$ in (9.2.4) would be thrown in doubt. Such evidence might suggest that, instead of attempting to pool incompatible information, the investigator ought to direct his efforts to explaining possible bias in the methods.
[Fig. 9.2.4 Posterior distribution of the difference $\delta$ in relation to the value $\delta = 0$.]
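The Behrens-Fisher density $p^*(\delta \mid y)$ needed for this check can be computed without special tables: a posteriori, $\theta_1$ and $\theta_2$ are independent scaled $t$ variables, so the density of $\delta = \theta_2 - \theta_1$ is a one-dimensional convolution. A sketch using the same illustrative summaries as above:

```python
import numpy as np
from scipy import stats

ybar = [107.9, 109.5]; s2 = [12.1, 39.4]; n = [11, 13]; nu = [10, 12]

# Posterior densities of theta_1 and theta_2 on a common grid.
x = np.linspace(95, 125, 6001)
p1 = stats.t.pdf(x, nu[0], loc=ybar[0], scale=np.sqrt(s2[0] / n[0]))
p2 = stats.t.pdf(x, nu[1], loc=ybar[1], scale=np.sqrt(s2[1] / n[1]))

# delta = theta_2 - theta_1: convolve p2 with the reflection of p1.
dx = x[1] - x[0]
pd = np.convolve(p2, p1[::-1]) * dx            # Behrens-Fisher density, numerically
d = (np.arange(pd.size) - (x.size - 1)) * dx   # grid of delta values
pd /= np.trapz(pd, d)
print("P(delta > 0) approx:", np.trapz(pd[d > 0], d[d > 0]))
```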
9.2.4 The Fiducial Distributions of $\theta$ Derived by Yates and by Fisher

Yates (1939), in his discussion of the problem of the weighted mean, considered the fiducial distribution of $\theta_w = w_1\theta_1 + w_2\theta_2$ in the special case where the weights are proportional to the sample precisions,

$$\frac{w_1}{w_2} = \frac{n_1/s_1^2}{n_2/s_2^2}, \tag{9.2.15}$$

and found that this quantity had a Behrens-Fisher distribution. However, Yates' fiducial distribution is not the same as the fiducial distribution subsequently derived by Fisher. This, at first sight perplexing, situation is clarified by the Bayes analysis. We have already discussed Fisher's result corresponding to (9.2.4). We now consider Yates' result from this same viewpoint. The marginal posterior distribution of
$\theta_w = w_1\theta_1 + w_2\theta_2$ is obtained in general by integrating out $\delta$ in (9.2.14),

$$p^*(\theta_w \mid y) = \int_{-\infty}^{\infty} p^*(\theta_w \mid \delta, y)\, p^*(\delta \mid y)\, d\delta. \tag{9.2.16}$$
To see that this yields a density function numerically identical to Yates' fiducial distribution, in (9.2.11) write

$$t_1 = \frac{\theta_1 - \bar{y}_1}{s_1/\sqrt{n_1}}, \qquad t_2 = \frac{\theta_2 - \bar{y}_2}{s_2/\sqrt{n_2}}, \tag{9.2.17}$$

and

$$\tau = \frac{\theta_w - \bar{y}_w}{s_T}, \qquad z = t_2\sin\phi - t_1\cos\phi, \tag{9.2.18}$$

where

$$\bar{y}_w = w_1\bar{y}_1 + w_2\bar{y}_2, \qquad s_T^2 = \frac{w_1^2 s_1^2}{n_1} + \frac{w_2^2 s_2^2}{n_2}, \qquad \tan\phi = \frac{w_1 s_1/\sqrt{n_1}}{w_2 s_2/\sqrt{n_2}}.$$

Then, whatever weights $w_1$ and $w_2 = 1 - w_1$ are employed, the distribution of $\tau$ is

$$p(\tau \mid \nu_1, \nu_2, \phi) \propto \int_{-\infty}^{\infty}\left[1 + \frac{(\tau\sin\phi - z\cos\phi)^2}{\nu_1}\right]^{-\frac{1}{2}(\nu_1+1)}\left[1 + \frac{(z\sin\phi + \tau\cos\phi)^2}{\nu_2}\right]^{-\frac{1}{2}(\nu_2+1)} dz, \tag{9.2.19}$$

$$-\infty < \tau < \infty,$$
which is a Behrens-Fisher distribution. Yates' result is obtained by using the particular weighting of (9.2.15).

The basic difference between the two results $p^*(\theta_w \mid \delta = 0, y)$ and $p^*(\theta_w \mid y)$ (see Fig. 9.2.3) is now readily seen. Fisher's result for the distribution of the common mean, which corresponds with Jeffreys' Bayesian solution, is the conditional distribution of $\theta_w$ (or $\theta_1$ or $\theta_2$), given that $\theta_1 = \theta_2$, that is to say, given that $\delta = 0$. On the other hand, Yates' result corresponds not to any conditional distribution but to a marginal distribution, that of $\theta_w = w_1\theta_1 + w_2\theta_2$.

The relevance of Yates' result is clearly open to question. For, if it can be assumed that $\theta_1 = \theta_2$, and hence that $\delta = 0$, one should employ the conditional distribution (9.2.4). Conversely, if $\theta_1$ and $\theta_2$ cannot be assumed equal (implying that one or both contain unknown systematic error), there seems to be no basis for the weighting

$$\frac{w_1}{w_2} = \frac{n_1/s_1^2}{n_2/s_2^2},$$

which is based only on sampling variation. There will be cases where it cannot be plausibly asserted that $\theta_1 = \theta_2$ exactly, but nevertheless we may feel fairly sure that the difference $\delta = \theta_2 - \theta_1$ is small.
In this case the Fisher-Jeffreys solution may be used as an approximation.
Specifically, if an informative prior distribution $p(\delta)$ is introduced, which might for example be a Normal distribution centered at $\delta = 0$ with standard deviation $\sigma_\delta$, then, on the same assumptions as before, the posterior distribution of $\theta_w$ will be

$$p^*(\theta_w \mid y) \propto \int_{-\infty}^{\infty} p^*(\theta_w \mid \delta, y)\, p^*(\delta \mid y)\, p(\delta)\, d\delta. \tag{9.2.20}$$

Now (i) if we make $\sigma_\delta$ very large, corresponding to the assertion that little is known about $\delta$, we obtain the marginal distribution in (9.2.16), which however will depend on the particular weights $w_1$ and $w_2$ adopted; (ii) on the other hand, if we make $\sigma_\delta$ small, so that $p(\delta)$ approaches a delta function at $\delta = 0$, then the distribution of $\theta_w$ tends to the Fisher-Jeffreys solution (9.2.4) whatever the values of the weights $w_1$ and $w_2$.
9.3 SOME PROPERTIES OF THE DISTRIBUTION OF THE COMMON PARAMETERS FOR TWO INDEPENDENT RESPONSES

Until further notice, we shall use the symbol $\theta$ to denote the $k \times 1$ vector of regression coefficients common to all $m$ responses, so that $\theta' = \theta_c' = (\theta_1, \ldots, \theta_k)$.
When there are two independent responses, from (9.2.3) the posterior distribution of $\theta$ is proportional to the product of two multivariate $t$ distributions with different degrees of freedom,

$$p(\theta \mid y) \propto \prod_{i=1}^{2}\left[1 + \frac{(\theta - \hat{\theta}_i)'\,X_i'X_i\,(\theta - \hat{\theta}_i)}{\nu_i s_i^2}\right]^{-\frac{1}{2}(\nu_i + k)}, \qquad -\infty < \theta < \infty, \tag{9.3.1}$$

where

$$\nu_i = n_i - k, \qquad \nu_i s_i^2 = (y_i - X_i\hat{\theta}_i)'(y_i - X_i\hat{\theta}_i),$$

and $\hat{\theta}_i$ satisfies the normal equations $(X_i'X_i)\hat{\theta}_i = X_i'y_i$. To obtain the marginal distribution of a subset of $\theta$, say $\theta_l' = (\theta_1, \ldots, \theta_l)$, directly, it would be necessary to evaluate, for each value of $\theta_l$, a $(k - l)$-dimensional integral. The difficulty may be avoided by writing the distribution in a different form which, in turn, leads to a simple approximation.
9.3.1 An Alternative Form for $p(\theta \mid y)$
Writing $\sigma_1^2 = \sigma_{11}$ and $\sigma_2^2 = \sigma_{22}$ in (9.2.2), with $m = 2$ the posterior distribution of $(\sigma_1^2, \sigma_2^2, \theta)$ is

$$p(\sigma_1^2, \sigma_2^2, \theta \mid y) \propto (\sigma_1^2)^{-(\frac{1}{2}n_1+1)}(\sigma_2^2)^{-(\frac{1}{2}n_2+1)}\exp\left\{-\frac{1}{2}\left[\frac{\nu_1 s_1^2 + R_1(\theta)}{\sigma_1^2} + \frac{\nu_2 s_2^2 + R_2(\theta)}{\sigma_2^2}\right]\right\}, \tag{9.3.2}$$

$$\sigma_1^2 > 0, \qquad \sigma_2^2 > 0, \qquad -\infty < \theta < \infty,$$

where

$$R_i(\theta) = (\theta - \hat{\theta}_i)'\,X_i'X_i\,(\theta - \hat{\theta}_i), \qquad i = 1, 2.$$
If we integrate out $(\sigma_1^2, \sigma_2^2)$ directly from (9.3.2), we obtain the form (9.3.1). Alternatively, we may make the transformation from $(\sigma_1^2, \sigma_2^2, \theta)$ to $(\sigma_1^2, w, \theta)$, where

$$w = \frac{\sigma_1^2}{\sigma_2^2}, \tag{9.3.3}$$

and integrate out $\sigma_1^2$ to obtain

$$p(\theta, w \mid y) \propto w^{\frac{1}{2}n_2-1}\left\{\nu_1 s_1^2 + R_1(\theta) + w\left[\nu_2 s_2^2 + R_2(\theta)\right]\right\}^{-\frac{1}{2}(n_1+n_2)}, \tag{9.3.4}$$

$$0 < w < \infty, \qquad -\infty < \theta < \infty.$$

This distribution can be written as the product

$$p(\theta, w \mid y) = p(\theta \mid w, y)\, p(w \mid y). \tag{9.3.5}$$
Assuming that at least one of the matrices $X_1$ and $X_2$ is of full rank, the identity (A7.1.1) in Appendix A7.1 may be used to combine the two quadratic forms $R_1(\theta)$ and $wR_2(\theta)$ in (9.3.4) into

$$R_1(\theta) + wR_2(\theta) = R(\theta \mid w) + S_c(w), \tag{9.3.6}$$

where

$$R(\theta \mid w) = [\theta - \bar{\theta}(w)]'(X_1'X_1 + wX_2'X_2)[\theta - \bar{\theta}(w)],$$

$$S_c(w) = w(\hat{\theta}_1 - \hat{\theta}_2)'\,X_1'X_1(X_1'X_1 + wX_2'X_2)^{-1}X_2'X_2\,(\hat{\theta}_1 - \hat{\theta}_2),$$

and

$$\bar{\theta}(w) = (X_1'X_1 + wX_2'X_2)^{-1}(X_1'X_1\hat{\theta}_1 + wX_2'X_2\hat{\theta}_2).$$

It follows that the conditional distribution of $\theta$ given $w$ is

$$p(\theta \mid w, y) \propto \left[(n_1+n_2-k)s^2(w) + R(\theta \mid w)\right]^{-\frac{1}{2}(n_1+n_2)}, \qquad -\infty < \theta < \infty, \tag{9.3.7}$$

where

$$s^2(w) = \frac{1}{n_1+n_2-k}\left[\nu_1 s_1^2 + w\nu_2 s_2^2 + S_c(w)\right],$$

which is the $k$-dimensional multivariate $t$ distribution $t_k[\bar{\theta}(w),\, s^2(w)(X_1'X_1 + wX_2'X_2)^{-1},\, n_1+n_2-k]$. To obtain the marginal distribution $p(w \mid y)$, we substitute (9.3.6) into (9.3.4) and integrate out $\theta$, yielding

$$p(w \mid y) = d\,|X_1'X_1 + wX_2'X_2|^{-1/2}\, w^{\frac{1}{2}n_2-1}\left[s^2(w)\right]^{-\frac{1}{2}(n_1+n_2-k)}, \qquad 0 < w < \infty, \tag{9.3.8}$$

where $d$ is the appropriate normalizing constant such that

$$d^{-1} = \int_0^\infty |X_1'X_1 + wX_2'X_2|^{-1/2}\, w^{\frac{1}{2}n_2-1}\left[s^2(w)\right]^{-\frac{1}{2}(n_1+n_2-k)}\, dw.$$
An alternative form for the posterior distribution of $\theta$ is therefore

$$p(\theta \mid y) = \int_0^\infty p(\theta \mid w, y)\, p(w \mid y)\, dw, \qquad -\infty < \theta < \infty, \tag{9.3.9}$$

where $p(\theta \mid w, y)$ and $p(w \mid y)$ are given in (9.3.7) and (9.3.8), respectively.

By comparing (9.3.9) with (9.3.1), it can be verified that the normalizing constant for the expression (9.3.1) is given by (9.3.10),
where $d^{-1}$ is the integral given in (9.3.8). Using properties of the multivariate $t$ distribution, and partitioning

$$C(w) = (X_1'X_1 + wX_2'X_2)^{-1} = \begin{bmatrix} C_{ll}(w) & C_{lp}(w) \\ C_{pl}(w) & C_{pp}(w) \end{bmatrix}, \qquad p = k - l, \tag{9.3.11}$$

the marginal distribution of the subset $\theta_l' = (\theta_1, \ldots, \theta_l)$, $(l \leq k)$, is

$$p(\theta_l \mid y) = \int_0^\infty p(\theta_l \mid w, y)\, p(w \mid y)\, dw, \qquad -\infty < \theta_l < \infty, \tag{9.3.12}$$

where $p(\theta_l \mid w, y)$ is the $t_l[\bar{\theta}_l(w),\, s^2(w)C_{ll}(w),\, n_1+n_2-k]$ distribution. A similar expression can be obtained for the remaining elements $\theta_p' = (\theta_{l+1}, \ldots, \theta_k)$. Although it does not seem possible to express (9.3.12) in closed form, its evaluation only requires numerical integration of one-dimensional integrals instead of the $(k - l)$-dimensional integration implied by (9.3.1), a very significant simplification.

9.3.2 Further Simplifications for the Distribution of $\theta$
In evaluating the posterior distribution of $\theta$ and its associated marginals for subsets of $\theta$, if we were to employ directly the expressions in (9.3.9) and (9.3.12), it would be necessary to calculate the determinant and the inverse of the matrix $(X_1'X_1 + wX_2'X_2)$ for every $w$; a direct numerical sketch of that one-dimensional integration is given below. We now show how this problem can be made more manageable.
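Before turning to those simplifications, note that the joint kernel (9.3.4) already reduces the evaluation of $p(\theta \mid y)$ to a one-dimensional quadrature over $w$. A minimal sketch (all inputs are user-supplied; the grid is an arbitrary choice), done in log space for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def log_posterior_theta(theta, XtX1, XtX2, th1, th2, nu1s1, nu2s2,
                        n1, n2, wgrid):
    """Unnormalized log p(theta | y): integrate the joint kernel (9.3.4)
    over a grid of w = sigma_1^2 / sigma_2^2 by the trapezoid rule."""
    R1 = (theta - th1) @ XtX1 @ (theta - th1)   # R_1(theta)
    R2 = (theta - th2) @ XtX2 @ (theta - th2)   # R_2(theta)
    logf = ((0.5 * n2 - 1.0) * np.log(wgrid)
            - 0.5 * (n1 + n2) * np.log(nu1s1 + R1 + wgrid * (nu2s2 + R2)))
    dw = np.diff(wgrid)
    return logsumexp(np.log(dw) + 0.5 * (logf[:-1] + logf[1:]))
```

Evaluating this for each $\theta$ of interest reproduces (9.3.9), but it recomputes the bracketed quantity at every grid point for every $\theta$; the developments below remove exactly that repeated cost.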
The special case where $X_1'X_1$ and $X_2'X_2$ are proportional

Consider first the special case where

$$X_2'X_2 = a\,X_1'X_1. \tag{9.3.13}$$

The expressions in (9.3.6) then become

$$R(\theta \mid w) = (1 + aw)[\theta - \bar{\theta}(w)]'\,X_1'X_1\,[\theta - \bar{\theta}(w)], \qquad S_c(w) = \frac{aw}{1+aw}(\hat{\theta}_1 - \hat{\theta}_2)'\,X_1'X_1\,(\hat{\theta}_1 - \hat{\theta}_2), \tag{9.3.14}$$

with

$$\bar{\theta}(w) = \frac{1}{1+aw}\left(\hat{\theta}_1 + aw\,\hat{\theta}_2\right),$$
so that (a) the conditional distribution of $\theta$ given $w$, $p(\theta \mid w, y)$, is the

$$t_k\left[\bar{\theta}(w),\; \frac{s^2(w)}{1+aw}(X_1'X_1)^{-1},\; n_1+n_2-k\right], \qquad -\infty < \theta < \infty, \tag{9.3.15}$$

distribution, where

$$(n_1+n_2-k)\,s^2(w) = \nu_1 s_1^2 + w\nu_2 s_2^2 + \frac{aw}{1+aw}(\hat{\theta}_1 - \hat{\theta}_2)'\,X_1'X_1\,(\hat{\theta}_1 - \hat{\theta}_2);$$
(b) the conditional distribution for the subset $\theta_l$, $p(\theta_l \mid w, y)$, is

$$p(\theta_l \mid w, y) \propto \left\{(n_1+n_2-k)s^2(w) + (1+aw)[\theta_l - \bar{\theta}_l(w)]'\,D_{ll}^{-1}\,[\theta_l - \bar{\theta}_l(w)]\right\}^{-\frac{1}{2}(n_1+n_2-k+l)}, \tag{9.3.16}$$

$$-\infty < \theta_l < \infty,$$

where $D_{ll}$ is the upper $l \times l$ submatrix in the partition

$$(X_1'X_1)^{-1} = \begin{bmatrix} D_{ll} & D_{lp} \\ D_{pl} & D_{pp} \end{bmatrix}, \qquad p = k - l;$$
and (c) the marginal distribution of $w$, $p(w \mid y)$, reduces to

$$p(w \mid y) = d\,|X_1'X_1|^{-1/2}(1+aw)^{-k/2}\, w^{\frac{1}{2}n_2-1}\left[s^2(w)\right]^{-\frac{1}{2}(n_1+n_2-k)}, \qquad 0 < w < \infty. \tag{9.3.17}$$
Employing (9.3.15)-(9.3.17), the posterior distribution of $\theta$ in (9.3.9) and its associated marginal distributions for subsets of $\theta$ in (9.3.12) can now be conveniently evaluated on a computer with little numerical complication. The posterior distribution of $\theta$ thus obtained is of very much the same form as the posterior distribution of $\phi$ in (7.4.33) and (7.4.41) for the BIBD model. In both cases we are combining information about regression coefficients from two sources where the variance ratio is not assumed known. The main distinction lies in the fact that for the BIBD model we have the added constraint $w = \sigma_{be}^2/\sigma_e^2 > 1$ on the variance ratio, while no such restriction exists here.
The general case when $X_1'X_1$ and $X_2'X_2$ are not proportional

In obtaining (9.3.6), we have assumed that at least one of the two matrices $X_1'X_1$ and $X_2'X_2$ is positive definite. To be specific, suppose that $X_1'X_1$ is positive definite. Then there exists a non-singular $k \times k$ matrix $P$ such that

$$X_1'X_1 = (P')^{-1}P^{-1} \qquad \text{and} \qquad X_2'X_2 = (P')^{-1}\Lambda P^{-1}, \tag{9.3.18}$$

where $\Lambda = \{\lambda_{jj}\}$ is a $k \times k$ diagonal matrix with non-negative diagonal elements. It can then be shown that, in the integrals of (9.3.9) and (9.3.12),

(a) the conditional distribution $p(\theta \mid w, y)$ becomes

$$p(\theta \mid w, y) \propto \left\{(n_1+n_2-k)s^2(w) + [\theta - \bar{\theta}(w)]'(P')^{-1}(I + w\Lambda)P^{-1}[\theta - \bar{\theta}(w)]\right\}^{-\frac{1}{2}(n_1+n_2)}, \tag{9.3.19}$$

$$-\infty < \theta < \infty,$$

with

$$\bar{\theta}(w) = P(I + w\Lambda)^{-1}(\phi_1 + w\Lambda\phi_2), \qquad \phi_i = P^{-1}\hat{\theta}_i,$$

and

$$(n_1+n_2-k)\,s^2(w) = \nu_1 s_1^2 + w\nu_2 s_2^2 + \sum_{j=1}^{k}\frac{w\lambda_{jj}}{1+w\lambda_{jj}}(\phi_{1j} - \phi_{2j})^2;$$
(b) the conditional distribution $p(\theta_l \mid w, y)$ of the subset $\theta_l$ is

$$p(\theta_l \mid w, y) \propto \left\{(n_1+n_2-k)s^2(w) + [\theta_l - \bar{\theta}_l(w)]'\,C_{ll}^{-1}(w)\,[\theta_l - \bar{\theta}_l(w)]\right\}^{-\frac{1}{2}(n_1+n_2-k+l)}, \tag{9.3.20}$$

$$-\infty < \theta_l < \infty,$$

where $\bar{\theta}_l(w)$ consists of the first $l$ elements of $\bar{\theta}(w)$ and $C_{ll}(w)$ is the upper $l \times l$ submatrix of $C(w) = P(I + w\Lambda)^{-1}P'$;
and (c) the marginal distribution $p(w \mid y)$ reduces to

$$p(w \mid y) = d\,|X_1'X_1|^{-1/2}\left[\prod_{j=1}^{k}(1 + w\lambda_{jj})\right]^{-1/2} w^{\frac{1}{2}n_2-1}\left[s^2(w)\right]^{-\frac{1}{2}(n_1+n_2-k)}, \qquad 0 < w < \infty. \tag{9.3.21}$$

To obtain these results, we may write

$$X_1'X_1 = (P')^{-1}P^{-1} \qquad \text{and} \qquad X_2'X_2 = (P')^{-1}\Lambda P^{-1}, \tag{9.3.22}$$

so that

$$X_1'X_1 + wX_2'X_2 = (P')^{-1}(I + w\Lambda)P^{-1}. \tag{9.3.23}$$
It follows that for the expressions of $\bar{\theta}(w)$ and $S_c(w)$ in (9.3.6) we now have

$$\bar{\theta}(w) = P(I + w\Lambda)^{-1}(P^{-1}\hat{\theta}_1 + w\Lambda P^{-1}\hat{\theta}_2) = P(I + w\Lambda)^{-1}(\phi_1 + w\Lambda\phi_2), \tag{9.3.24}$$

with

$$\phi_i = P^{-1}\hat{\theta}_i, \qquad i = 1, 2,$$

and

$$S_c(w) = w(\hat{\theta}_1 - \hat{\theta}_2)'(P')^{-1}P^{-1}\,P(I + w\Lambda)^{-1}P'\,(P')^{-1}\Lambda P^{-1}(\hat{\theta}_1 - \hat{\theta}_2) = w(\phi_1 - \phi_2)'(I + w\Lambda)^{-1}\Lambda(\phi_1 - \phi_2). \tag{9.3.25}$$

Further, the determinant $|X_1'X_1 + wX_2'X_2|$ is

$$|X_1'X_1 + wX_2'X_2| = |(P')^{-1}P^{-1} + w(P')^{-1}\Lambda P^{-1}| = |P'|^{-1}|P^{-1}|\,|I + w\Lambda| = |X_1'X_1|\prod_{j=1}^{k}(1 + w\lambda_{jj}). \tag{9.3.26}$$
Substituting (9.3.24)-(9.3.26) into (9.3.9) and (9.3.12), we obtain the desired results. In computing the posterior distributions in (9.3.9) and (9.3.12), standard numerical methods can be used to obtain first the matrices $P$ and $\Lambda$. The distribution of $\theta$ and the marginals of subsets of the elements of $\theta$ can then be conveniently calculated using (9.3.19)-(9.3.21) with the appropriate normalizing constants inserted. All that is required is numerical evaluation of one-dimensional integrals involving simple functions. With the availability of a computer, this presents no problem.
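Finding $P$ and $\Lambda$ is a generalized symmetric eigenproblem: the $\lambda_{jj}$ and the columns of $P$ satisfy $X_2'X_2\,p_j = \lambda_{jj}\,X_1'X_1\,p_j$, normalized so that $P'(X_1'X_1)P = I$. A sketch using SciPy's generalized eigensolver, with the cross-product matrices quoted for the investment example of Section 9.3.4; if the digitized entries are faithful, the recovered $\Lambda$ should be near $(0.027, 0.305)$:

```python
import numpy as np
from scipy.linalg import eigh

# Cross-product matrices X_i'X_i from the investment example (Section 9.3.4).
X1tX1 = np.array([[3.254, 0.233],
                  [0.233, 1.193]]) * 1e6
X2tX2 = np.array([[0.940, 0.195],
                  [0.195, 0.074]]) * 1e6

# Solve X2'X2 p = lambda X1'X1 p; eigh returns eigenvectors normalized
# so that P'(X1'X1)P = I, hence P'(X2'X2)P = diag(lambda) as in (9.3.18).
lam, P = eigh(X2tX2, X1tX1)

assert np.allclose(P.T @ X1tX1 @ P, np.eye(2), atol=1e-6)
assert np.allclose(P.T @ X2tX2 @ P, np.diag(lam), atol=1e-6)
print("Lambda diagonal:", lam)
```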
9.3.3 Approximations to the Distribution of $\theta$

We have shown in (7.4.50) that the distribution of $\phi$ for the BIBD model can be closely approximated by setting

$$p(\phi \mid y) \doteq p(\phi \mid \hat{w}, y),$$

where $\hat{w}$ is the mode of the posterior distribution of $\log w$. A similar argument can be applied to the distribution of $\theta$ in (9.3.9). We can write

$$p(\theta \mid y) = \underset{\log w}{E}\, p(\theta \mid w, y) \doteq p(\theta \mid \hat{w}, y), \tag{9.3.27}$$

where $p(\theta \mid \hat{w}, y)$ is obtained by inserting $\hat{w}$ for $w$ in (9.3.19), and $\hat{w}$ is the mode of the density

$$p(\log w \mid y) \propto w^{\frac{1}{2}n_2}\left[\prod_{j=1}^{k}(1 + w\lambda_{jj})\right]^{-1/2}\left[s^2(w)\right]^{-\frac{1}{2}(n_1+n_2-k)}, \qquad -\infty < \log w < \infty, \tag{9.3.28}$$
which is derived from (9.3.21) by inserting the Jacobian $dw/d(\log w) = w$. Upon taking logarithms of (9.3.28) and differentiating, it can be verified that $\hat{w}$ is the appropriate root of the equation

$$h(w) = \frac{n_2}{2w} - \frac{1}{2}\sum_{j=1}^{k}\frac{\lambda_{jj}}{1 + w\lambda_{jj}} - \frac{1}{2s^2(w)}\left[\nu_2 s_2^2 + \sum_{j=1}^{k}\frac{\lambda_{jj}(\phi_{1j} - \phi_{2j})^2}{(1+w\lambda_{jj})^2}\right] = 0, \tag{9.3.29}$$

which maximizes (9.3.28). To obtain the root $\hat{w}$, it seems most convenient to employ the standard Newton-Raphson iteration procedure. That is, with an initial guessed value $w_0$, expand

$$h(w) \doteq h(w_0) + (w - w_0)\,h'(w_0), \tag{9.3.30}$$

where $h'(w)$ is the derivative of $h(w)$, obtained by straightforward differentiation of the expression above. Thus, we find a new guessed value $w_1$ such that

$$w_1 = w_0 - \frac{h(w_0)}{h'(w_0)}, \tag{9.3.31}$$

which can now be used as the next guessed value, and the process repeated until convergence occurs. A convenient choice for the initial value is $w_0 = s_1^2/s_2^2$. To this degree of approximation, then, $\theta$ is distributed as the $k$-dimensional multivariate $t$ distribution

$$t_k\left[\bar{\theta}(\hat{w}),\; s^2(\hat{w})(X_1'X_1 + \hat{w}X_2'X_2)^{-1},\; n_1+n_2-k\right],$$

from which corresponding approximations to marginal distributions of subsets of $\theta$ can also be obtained. The accuracy of this approximation will be illustrated by an example in the next section. An alternative approximation employing an asymptotic expansion of the distribution in (9.3.1) in powers of $(\nu_1^{-1}, \nu_2^{-1})$ is given in Appendix A9.1.
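A minimal sketch of the mode search follows. Rather than coding $h'(w)$ analytically, it applies Newton's method with a central-difference derivative, which is usually adequate for a smooth one-dimensional root; the inputs ($\lambda_{jj}$, $\phi_i$, $\nu_i s_i^2$, $n_i$, $k$) are as defined above, and the tolerance is an arbitrary choice.

```python
import numpy as np

def find_w_hat(lam, phi1, phi2, nu1s1, nu2s2, n1, n2, k,
               w0, tol=1e-10, max_iter=50):
    """Newton-Raphson for the root of h(w) = 0 in (9.3.29)."""
    dsq = (phi1 - phi2) ** 2

    def s2(w):
        return (nu1s1 + w * nu2s2
                + np.sum(w * lam / (1 + w * lam) * dsq)) / (n1 + n2 - k)

    def h(w):
        ds2 = nu2s2 + np.sum(lam / (1 + w * lam) ** 2 * dsq)
        return (0.5 * n2 / w - 0.5 * np.sum(lam / (1 + w * lam))
                - ds2 / (2 * s2(w)))

    w = w0
    for _ in range(max_iter):
        eps = 1e-6 * max(w, 1.0)
        hp = (h(w + eps) - h(w - eps)) / (2 * eps)   # numerical h'(w)
        step = h(w) / hp
        w -= step
        if abs(step) < tol * max(w, 1.0):
            break
    return w
```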
9.3.4 An Econometric Example

For illustration, we analyse a simple econometric investment model with annual time series data, 1935-54, taken from Boot and De Witt (1960), relating to two large corporations, General Electric and Westinghouse. In this model, price-deflated gross investment is assumed to be a linear function of expected profitability and beginning-of-year real capital stock. Following Grunfeld (1958), the value of outstanding shares at the beginning of the year is taken as a measure of a firm's expected profitability. The two investment relations are

$$y_{u1} = \theta_{01} + \theta_1 x_{u11} + \theta_2 x_{u21} + \varepsilon_{u1}, \qquad y_{u2} = \theta_{02} + \theta_1 x_{u12} + \theta_2 x_{u22} + \varepsilon_{u2}, \tag{9.3.32}$$

where $u$ denotes the value of a variable in year $u$, $u = 1, 2, \ldots, 20$, and

    Variable                                  General Electric   Westinghouse
    Annual real gross investment                  y_{u1}             y_{u2}
    Value of shares at beginning of year          x_{u11}            x_{u12}
    Real capital stock at beginning of year       x_{u21}            x_{u22}
    Error term                                    e_{u1}             e_{u2}
The parameters $\theta_1$ and $\theta_2$ in (9.3.32) are taken to be the same for the two firms; however, $\theta_{01}$ and $\theta_{02}$ are assumed to be different to allow for certain possible differences in the investment behavior of the two firms. Further, $\varepsilon_{u1}$ and $\varepsilon_{u2}$ are assumed to be independently and Normally distributed for all $u$ as $N(0, \sigma_1^2)$ and $N(0, \sigma_2^2)$, respectively. We may write the model in (9.3.32) in the form

$$y_i = T_i\gamma_i + \varepsilon_i, \qquad i = 1, 2, \tag{9.3.33}$$

where

$$\gamma_i = \begin{bmatrix}\theta_{0i} \\ \theta\end{bmatrix}, \qquad \theta = \begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix}, \qquad T_i = [1 : X_i],$$

$X_i$ is the $20 \times 2$ matrix whose $u$th row is $(x_{u1i}, x_{u2i})$, and

$$y_i' = (y_{1i}, \ldots, y_{ui}, \ldots, y_{ni}), \qquad \varepsilon_i' = (\varepsilon_{1i}, \ldots, \varepsilon_{ui}, \ldots, \varepsilon_{ni}), \qquad n = 20.$$
Thus, the two vectors of responses $(y_1, y_2)$ are independently Normally distributed and contain some (but not all) common parameters $\theta$. On the assumption that $(\theta_{01}, \theta_{02}, \theta)$, $\log\sigma_1$, and $\log\sigma_2$ are locally uniform and independent a priori, the posterior distribution of the parameters is

$$p(\theta_{01}, \theta_{02}, \theta, \sigma_1^2, \sigma_2^2 \mid y) \propto \prod_{i=1}^{2}(\sigma_i^2)^{-(\frac{1}{2}n+1)}\exp\left\{-\frac{1}{2\sigma_i^2}\left[\nu s_i^2 + (\gamma_i - \hat{\gamma}_i)'\,T_i'T_i\,(\gamma_i - \hat{\gamma}_i)\right]\right\}, \tag{9.3.34}$$

$$\sigma_i^2 > 0, \qquad -\infty < \theta_{0i} < \infty, \qquad -\infty < \theta < \infty,$$

where

$$\nu = n - 3 \qquad \text{and} \qquad \nu s_i^2 = (y_i - T_i\hat{\gamma}_i)'(y_i - T_i\hat{\gamma}_i).$$

Integrating out $(\theta_{01}, \theta_{02})$, we obtain

$$p(\theta, \sigma_1^2, \sigma_2^2 \mid y) \propto \prod_{i=1}^{2}(\sigma_i^2)^{-[\frac{1}{2}(n-1)+1]}\exp\left\{-\frac{1}{2\sigma_i^2}\left[\nu s_i^2 + (\theta - \hat{\theta}_i)'\,X_i'X_i\,(\theta - \hat{\theta}_i)\right]\right\}, \tag{9.3.35}$$

$$\sigma_i^2 > 0, \qquad -\infty < \theta < \infty,$$

where the elements of $X_i$ are now measured from their column averages,

$$\bar{x}_{ji} = \frac{1}{n}\sum_{u=1}^{n} x_{uji}, \qquad j = 1, 2.$$
This distribution is of exactly the same form as that in (9.3.2) with $n_i = n - 1$, $\nu_i = \nu$ and $k = 2$. Eliminating $\sigma_1^2$ and $\sigma_2^2$ by integration, the distribution of $\theta$ then takes the form of the product of two bivariate $t$ distributions as shown in (9.3.1), or alternatively can be expressed in the integral form (9.3.9). Numerical values for sample quantities needed in our subsequent analysis are given below.

General Electric:

$$X_1'X_1 = 10^6\begin{bmatrix}3.254 & 0.233 \\ 0.233 & 1.193\end{bmatrix}, \qquad \hat{\theta}_1 = \begin{bmatrix}\hat{\theta}_{11} \\ \hat{\theta}_{21}\end{bmatrix} = \begin{bmatrix}0.02655 \\ 0.15170\end{bmatrix}, \qquad s_1^2 = 0.777 \times 10^3, \qquad \nu_1 = 17, \qquad n_1 = 19.$$

Westinghouse:

$$X_2'X_2 = 10^6\begin{bmatrix}0.940 & 0.195 \\ 0.195 & 0.074\end{bmatrix}, \qquad \hat{\theta}_2 = \begin{bmatrix}\hat{\theta}_{12} \\ \hat{\theta}_{22}\end{bmatrix} = \begin{bmatrix}0.05289 \\ 0.09241\end{bmatrix}, \qquad s_2^2 = 0.104 \times 10^3, \qquad \nu_2 = 17, \qquad n_2 = 19.$$
A plot of the contours of the distribution of $\theta$ is shown in Fig. 9.3.1. The three contours shown by the solid curves are drawn such that

$$p(\theta \mid y) = c\, p(\hat{\theta} \mid y) \qquad \text{for} \qquad c = 0.50,\ 0.25,\ 0.05,$$

where $\hat{\theta} \doteq (0.037, 0.145)$ is the mode of the distribution. In other words, they are boundaries of the H.P.D. regions containing approximately 50, 75 and 95 per cent of the probability mass, respectively. Also shown in the same figure are the 95 per cent contours (labelled G.E. and W.H.) of the two factors of the distribution of $\theta$, together with the corresponding centers $(\hat{\theta}_1, \hat{\theta}_2)$. As expected, the distribution of $\theta$ is located "between" the two individual bivariate $t$ distributions, and the spread of the combined distribution is smaller than either of its components. Further, the influence of the first factor, G.E., is seen to be greater in determining the overall distribution.
The solid contours in Fig. 9.3.1 are nearly elliptical, suggesting that the distribution of $\theta$ might well be approximated by the bivariate $t$ distribution in (9.3.27). Since comparison of exact and approximate contours is rather difficult to make, the exact and approximate marginal distributions of $\theta_1$ and $\theta_2$ will be compared.
[Fig. 9.3.1 Contours of the posterior distribution of $\theta$ and its component factors: the investment data.]
The solid curves in Figs. 9.3.2a and 9.3.2b show, respectively, the marginal distributions of $\theta_1$ and $\theta_2$ calculated from (9.3.12). The two factors in the integrand were determined from the expressions in (9.3.20) and (9.3.21).

[Fig. 9.3.2 Comparison of the exact and the approximate posterior distributions of $\theta_1$ and $\theta_2$: the investment data.]

The matrices $P$ and $\Lambda$ in (9.3.18) and the vectors $\phi_1 = P^{-1}\hat{\theta}_1$ and $\phi_2 = P^{-1}\hat{\theta}_2$ are
$$P = 10^{-3}\begin{bmatrix}-0.19814 & 0.52196 \\ 0.89469 & 0.22306\end{bmatrix}, \qquad \Lambda = \begin{bmatrix}0.02703 & 0 \\ 0 & 0.30513\end{bmatrix},$$

$$\phi_1 = P^{-1}\hat{\theta}_1 = 10^3\begin{bmatrix}0.14331 \\ 0.10527\end{bmatrix}, \qquad \phi_2 = P^{-1}\hat{\theta}_2 = 10^3\begin{bmatrix}0.07128 \\ 0.12839\end{bmatrix}.$$
The approximating distributions shown by the broken curves in Figs. 9.3.2 and 9.3.3 were determined using the result in (9.3.27). The root $\hat{w}$ of the equation $h(w) = 0$ in (9.3.29) was calculated by employing the iterative process in (9.3.30) and (9.3.31). Starting with the initial value $w_0 = s_1^2/s_2^2 \doteq 7.47$, we found after two iterations $\hat{w} \doteq 7.349$. Thus, $\theta$ is approximately distributed as $t_2[\bar{\theta}(\hat{w}),\, s^2(\hat{w})P(I + \hat{w}\Lambda)^{-1}P',\, 36]$, where

$$\bar{\theta}(\hat{w}) = \begin{bmatrix}0.03726 \\ 0.14460\end{bmatrix} \qquad \text{and} \qquad s^2(\hat{w})P(I + \hat{w}\Lambda)^{-1}P' = 10^{-4}\begin{bmatrix}0.8898 & -0.8535 \\ -0.8535 & 5.2060\end{bmatrix}.$$
It follows that, to this degree of approximation, the quantities

$$\frac{\theta_1 - 0.03726}{0.00943} \qquad \text{and} \qquad \frac{\theta_2 - 0.1446}{0.0228}$$

are individually distributed as $t(0, 1, 36)$, from which the broken curves were drawn. For this example, the agreement between the exact distributions obtained by numerical integration and the approximating distributions is sufficiently close for most practical purposes. This, together with the near elliptical shape of the contours in Fig. 9.3.1, suggests that the approximation in (9.3.27) will be useful for higher dimensional distributions.
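The chain (9.3.18)-(9.3.31) can be checked end to end from the quantities quoted above. The sketch below assumes the digitized matrices are faithful to the source and reuses the `find_w_hat` routine sketched after (9.3.31); it should land near $\hat{w} \doteq 7.35$ and $\bar{\theta}(\hat{w}) \doteq (0.0373,\ 0.1446)$.

```python
import numpy as np
from scipy.linalg import eigh

X1tX1 = np.array([[3.254, 0.233], [0.233, 1.193]]) * 1e6
X2tX2 = np.array([[0.940, 0.195], [0.195, 0.074]]) * 1e6
th1 = np.array([0.02655, 0.15170]); th2 = np.array([0.05289, 0.09241])
s1sq, s2sq, nu, n1, n2, k = 0.777e3, 0.104e3, 17, 19, 19, 2

lam, P = eigh(X2tX2, X1tX1)             # (9.3.18): P'(X1'X1)P = I
phi1 = np.linalg.solve(P, th1)          # phi_i = P^{-1} theta_hat_i
phi2 = np.linalg.solve(P, th2)

w_hat = find_w_hat(lam, phi1, phi2, nu * s1sq, nu * s2sq,
                   n1, n2, k, w0=s1sq / s2sq)

Sc = w_hat * np.sum(lam / (1 + w_hat * lam) * (phi1 - phi2) ** 2)
s2w = (nu * s1sq + w_hat * nu * s2sq + Sc) / (n1 + n2 - k)     # s^2(w_hat)
theta_bar = P @ ((phi1 + w_hat * lam * phi2) / (1 + w_hat * lam))
cov = s2w * P @ np.diag(1 / (1 + w_hat * lam)) @ P.T
print(w_hat, theta_bar)                 # expect approx 7.35, (0.0373, 0.1446)
```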
9.4 INFERENCES ABOUT COMMON PARAMETERS FOR THE GENERAL LINEAR MODEL WITH A COMMON DERIVATIVE MATRIX

Returning to the linear multivariate model in (9.1.2), we now consider the case where not only the parameters but also the derivative matrices are assumed common, that is, $\theta_1 = \cdots = \theta_m = \theta_c$ and $X_1 = \cdots = X_m = X$. The elements of the multivariate observation vectors $\varepsilon_{(u)} = (\varepsilon_{u1}, \ldots, \varepsilon_{um})'$, $u = 1, \ldots, n$, are, however, not assumed independent. For example, suppose observations were made by $m = 3$ different observers of a temperature which was linearly increasing with time. An appropriate model for the observations $y_{u1}, y_{u2}, y_{u3}$ made at time $t_u$ might be

$$y_{u1} = \theta_1 + \theta_2 t_u + \varepsilon_{u1}, \qquad y_{u2} = \theta_1 + \theta_2 t_u + \varepsilon_{u2}, \qquad y_{u3} = \theta_1 + \theta_2 t_u + \varepsilon_{u3}. \tag{9.4.1}$$

If observations were made at $n$ distinct times, we should have a model of the common parameter, common derivative form with $\theta_c = (\theta_1, \theta_2)'$ and with $X$ the $n \times 2$ matrix whose $u$th row is $(1, t_u)$, $u = 1, \ldots, n$. In general, the model may be written in the form of equation (8.4.1) with the $k \times m$ matrix $\Theta$ of (8.4.2) having common parameters within every row, so that

$$\Theta = \theta_c 1_m'. \tag{9.4.2}$$

From (8.4.8), it is readily shown that the posterior distribution of $\theta_c$ is

$$p(\theta_c \mid y) \propto \left[1 + d(\theta_c - \hat{\theta}_c)'\,V^{-1}(\theta_c - \hat{\theta}_c)\right]^{-\frac{1}{2}n}, \qquad -\infty < \theta_c < \infty, \tag{9.4.3}$$

where

$$d = 1_m'A^{-1}1_m, \qquad \hat{\theta}_c = \frac{1}{d}\hat{\Theta}A^{-1}1_m,$$

and

$$V = (X'X)^{-1} + \hat{\Theta}\left[A^{-1} - \frac{1}{d}A^{-1}1_m 1_m'A^{-1}\right]\hat{\Theta}',$$
which is the $k$-dimensional multivariate $t$ distribution $t_k\left[\hat{\theta}_c,\; V/\{(n-k)d\},\; n-k\right]$.
To show this, from (8.4.8) the posterior distribution of $\theta_c$ is

$$p(\theta_c \mid y) \propto \left|A + (\theta_c 1_m' - \hat{\Theta})'\,X'X\,(\theta_c 1_m' - \hat{\Theta})\right|^{-\frac{1}{2}n}, \qquad -\infty < \theta_c < \infty, \tag{9.4.4}$$

where it is to be remembered that $\hat{\Theta}$ is a $k \times m$ matrix. Now, using (8.4.11),

$$\left|A + (\theta_c 1_m' - \hat{\Theta})'\,X'X\,(\theta_c 1_m' - \hat{\Theta})\right| = |A|\,|X'X|\,\left|(X'X)^{-1} + (\theta_c 1_m' - \hat{\Theta})A^{-1}(\theta_c 1_m' - \hat{\Theta})'\right|. \tag{9.4.5}$$

We can write

$$(\theta_c 1_m' - \hat{\Theta})A^{-1}(\theta_c 1_m' - \hat{\Theta})' = d(\theta_c - \hat{\theta}_c)(\theta_c - \hat{\theta}_c)' + \hat{\Theta}B\hat{\Theta}', \tag{9.4.6}$$

where

$$d = 1_m'A^{-1}1_m, \qquad \hat{\theta}_c = \frac{1}{d}\hat{\Theta}A^{-1}1_m, \qquad B = A^{-1} - \frac{1}{d}A^{-1}1_m 1_m'A^{-1}.$$

Making use of (9.4.6) and (8.4.11), we see that the third factor on the right-hand side of (9.4.5) is proportional to

$$\left[1 + d(\theta_c - \hat{\theta}_c)'\,V^{-1}(\theta_c - \hat{\theta}_c)\right], \tag{9.4.7}$$

where

$$V = (X'X)^{-1} + \hat{\Theta}B\hat{\Theta}',$$

and the desired result follows.
The special case for which $k = 1$ and $X$ is an $n \times 1$ column of ones was discussed by Geisser (1965b). The result for the situation when we have common parameters within rows of a block submatrix of $\Theta$ is similar to (9.4.3) and is left to the reader.
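Given $\hat{\Theta}$, $A$, and $X'X$, the parameters of the posterior $t$ distribution in (9.4.3) take only a few lines of linear algebra. A minimal sketch (the function name and argument layout are, of course, our own):

```python
import numpy as np

def common_parameter_posterior(Theta_hat, A, XtX, n):
    """Location, scale matrix and d.f. of t_k[theta_c_hat, V/((n-k)d), n-k]."""
    k, m = Theta_hat.shape
    Ainv1 = np.linalg.solve(A, np.ones(m))        # A^{-1} 1_m
    d = np.ones(m) @ Ainv1                        # d = 1' A^{-1} 1
    theta_c = Theta_hat @ Ainv1 / d               # (1/d) Theta_hat A^{-1} 1_m
    B = np.linalg.inv(A) - np.outer(Ainv1, Ainv1) / d
    V = np.linalg.inv(XtX) + Theta_hat @ B @ Theta_hat.T
    return theta_c, V / ((n - k) * d), n - k
```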
9.5 GENERAL LINEAR MODEL WITH COMMON PARAMETERS

We now discuss certain aspects of the problem of making inferences about the common parameters $\theta_c$ in the general situation when the derivative matrices $X_i$ are not common and the responses are correlated. For simplicity in notation, we shall again suppress the subscript $c$ and denote $\theta = \theta_c$. By setting $\theta_1 = \cdots = \theta_m = \theta$ in (8.2.23), the posterior distribution of the common parameters $\theta$ is then

$$p(\theta \mid y) \propto |S(\theta)|^{-\frac{1}{2}n}, \qquad -\infty < \theta < \infty, \tag{9.5.1}$$

where

$$S(\theta) = \{S_{ij}(\theta)\}, \qquad S_{ij}(\theta) = (y_i - X_i\theta)'(y_j - X_j\theta), \qquad i, j = 1, \ldots, m,$$

and $\theta' = (\theta_1, \ldots, \theta_k)$.
Since $X_i \neq X_j$, we cannot in general express $S_{ij}(\theta)$ in (9.5.1) in the form given in (8.4.4), and thus it is no longer possible to reduce the distribution of $\theta$ to the much simpler form in (9.4.3). When the vector $\theta$ consists of only one or two elements, whether or not the model is linear in the parameters, the distribution can be plotted and the contributions from the individual vectors of responses $y_i$ assessed, as was done for example in Section 8.2.6. As always, however, as the number of parameters increases, the situation becomes more and more complicated. Further, to obtain the marginal distribution of a subset, say $\theta_l' = (\theta_1, \ldots, \theta_l)$, of $\theta$, it is necessary to evaluate a $(k - l)$-dimensional integral for every value of $\theta_l$. In what follows we shall discuss the important special case $m = 2$ for the linear model and develop an approximation procedure for the distribution of $\theta$.
9.5.1 The Case m = 2

When $m = 2$, the distribution in (9.5.1) can be written

$$p(\theta \mid y) \propto [S_{11}(\theta)]^{-\frac{1}{2}n}\left[S_{22}(\theta) - \frac{S_{12}^2(\theta)}{S_{11}(\theta)}\right]^{-\frac{1}{2}n}, \qquad -\infty < \theta < \infty. \tag{9.5.2}$$

The first factor $[S_{11}(\theta)]^{-\frac{1}{2}n}$ is clearly in the form of the multivariate $t$ distribution $t_k[\hat{\theta}_1,\, s_1^2(X_1'X_1)^{-1},\, \nu]$ where, assuming $X_1$ is of full rank,

$$\hat{\theta}_1 = (X_1'X_1)^{-1}X_1'y_1, \qquad s_1^2 = \frac{(y_1 - X_1\hat{\theta}_1)'(y_1 - X_1\hat{\theta}_1)}{\nu}, \tag{9.5.3}$$

and $\nu = n - k$. This factor can be thought of as representing the contribution from the first response vector $y_1$. The nature of the second factor, $[S_{22}(\theta) - S_{12}^2(\theta)/S_{11}(\theta)]^{-\frac{1}{2}n}$, which provides the "extra information" from $y_2$, is not easily recognizable. We now proceed to develop an alternative representation of the distribution of $\theta$. It will be recalled that if $(y_1, y_2)$ are jointly Normal $N_2(\mu, \Sigma)$, the distribution can be written

$$p(y_1, y_2) = p(y_1 \mid \mu_1, \sigma_{11})\, p(y_2 \mid y_1; \mu_2, \beta, \sigma_{22\cdot1}), \tag{9.5.4}$$

where

$$p(y_1 \mid \mu_1, \sigma_{11}) = \frac{1}{\sqrt{2\pi\sigma_{11}}}\exp\left[-\frac{(y_1 - \mu_1)^2}{2\sigma_{11}}\right], \qquad -\infty < y_1 < \infty,$$

and the conditional distribution of $y_2$ given $y_1$ is Normal with mean $\mu_2 + \beta(y_1 - \mu_1)$ and variance $\sigma_{22\cdot1}$, with

$$\beta = \frac{\sigma_{12}}{\sigma_{11}}, \qquad \sigma_{22\cdot1} = \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}.$$
For $m = 2$ and with common parameters $\theta$, the likelihood function of $\theta$ and $(\sigma_{11}, \sigma_{22\cdot1}, \beta)$ is

$$l(\theta, \sigma_{11}, \sigma_{22\cdot1}, \beta \mid y) \propto \sigma_{11}^{-\frac{1}{2}n}\,\sigma_{22\cdot1}^{-\frac{1}{2}n}\exp\left\{-\frac{1}{2\sigma_{11}}(y_1 - X_1\theta)'(y_1 - X_1\theta)\right.$$
$$\left. -\frac{1}{2\sigma_{22\cdot1}}[y_2 - X_2\theta - \beta(y_1 - X_1\theta)]'[y_2 - X_2\theta - \beta(y_1 - X_1\theta)]\right\}. \tag{9.5.5}$$

From the noninformative reference prior distribution of $\theta$ and $\Sigma$ in (8.2.8) and (8.2.14), we make the transformation from $(\sigma_{11}, \sigma_{22}, \sigma_{12})$ to $(\sigma_{11}, \sigma_{22\cdot1}, \beta)$ to obtain the prior

$$p(\theta, \sigma_{11}, \sigma_{22\cdot1}, \beta) \propto \sigma_{11}^{-1/2}\,\sigma_{22\cdot1}^{-3/2}. \tag{9.5.6}$$

Consequently, the posterior distribution of $(\theta, \sigma_{11}, \sigma_{22\cdot1}, \beta)$ is

$$p(\theta, \sigma_{11}, \sigma_{22\cdot1}, \beta \mid y) \propto \sigma_{11}^{-[\frac{1}{2}(n-1)+1]}\,\sigma_{22\cdot1}^{-[\frac{1}{2}(n+1)+1]}\exp\left[-\frac{S_{11}(\theta)}{2\sigma_{11}} - \frac{S_{22\cdot1}(\theta, \beta)}{2\sigma_{22\cdot1}}\right], \tag{9.5.7}$$

$$\sigma_{11} > 0, \qquad \sigma_{22\cdot1} > 0, \qquad -\infty < \beta < \infty, \qquad -\infty < \theta < \infty,$$

where

$$S_{11}(\theta) = (y_1 - X_1\theta)'(y_1 - X_1\theta),$$
$$S_{22\cdot1}(\theta, \beta) = [y_2 - X_2\theta - \beta(y_1 - X_1\theta)]'[y_2 - X_2\theta - \beta(y_1 - X_1\theta)].$$

If we were to integrate out $(\sigma_{11}, \sigma_{22\cdot1}, \beta)$ from (9.5.7), we would of course obtain the form given in (9.5.2). Alternatively, we may make the transformation from $(\sigma_{11}, \sigma_{22\cdot1})$ to $(\sigma_{11}, w)$, where
$$w = \frac{\sigma_{11}}{\sigma_{22\cdot1}}, \tag{9.5.8}$$

to obtain

$$p(\theta, w, \beta \mid y) \propto w^{\frac{1}{2}(n+1)-1}\left[S_{11}(\theta) + wS_{22\cdot1}(\theta, \beta)\right]^{-n}, \tag{9.5.9}$$

$$0 < w < \infty, \qquad -\infty < \beta < \infty, \qquad -\infty < \theta < \infty.$$

This distribution can be written as

$$p(\theta, w, \beta \mid y) = p(\theta \mid w, \beta, y)\, p(w, \beta \mid y). \tag{9.5.10}$$
Now, we may write

$$S_{11}(\theta) + wS_{22\cdot1}(\theta, \beta) = (2n - k)s^2(w, \beta) + R(\theta \mid w, \beta), \tag{9.5.11}$$

where

$$R(\theta \mid w, \beta) = [\theta - \bar{\theta}(w, \beta)]'\left[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right][\theta - \bar{\theta}(w, \beta)],$$

$$(2n - k)\,s^2(w, \beta) = [y_1 - X_1\bar{\theta}(w, \beta)]'[y_1 - X_1\bar{\theta}(w, \beta)] + w[y_{2\cdot1}(\beta) - X_{2\cdot1}(\beta)\bar{\theta}(w, \beta)]'[y_{2\cdot1}(\beta) - X_{2\cdot1}(\beta)\bar{\theta}(w, \beta)],$$
with

$$\bar{\theta}(w, \beta) = \left[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right]^{-1}\left[X_1'y_1 + wX_{2\cdot1}'(\beta)y_{2\cdot1}(\beta)\right],$$

where

$$X_{2\cdot1}(\beta) = X_2 - \beta X_1 \qquad \text{and} \qquad y_{2\cdot1}(\beta) = y_2 - \beta y_1.$$

Making use of (9.5.11), we see that

(a) the conditional distribution of $\theta$, given $(w, \beta)$, is

$$p(\theta \mid w, \beta, y) \propto \left[(2n - k)s^2(w, \beta) + R(\theta \mid w, \beta)\right]^{-n}, \qquad -\infty < \theta < \infty, \tag{9.5.12}$$

which is the $t_k\left[\bar{\theta}(w, \beta),\; s^2(w, \beta)\left[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right]^{-1},\; 2n - k\right]$ distribution;

(b) partitioning

$$\theta = \begin{bmatrix}\theta_l \\ \theta_p\end{bmatrix} \qquad \text{and} \qquad \left[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right]^{-1} = \begin{bmatrix}C_{ll}(w, \beta) & C_{lp}(w, \beta) \\ C_{pl}(w, \beta) & C_{pp}(w, \beta)\end{bmatrix}, \tag{9.5.13}$$

the conditional distribution of the subset $\theta_l$, given $(w, \beta)$, is

$$p(\theta_l \mid w, \beta, y) \propto \left\{(2n - k)s^2(w, \beta) + [\theta_l - \bar{\theta}_l(w, \beta)]'\,C_{ll}^{-1}(w, \beta)\,[\theta_l - \bar{\theta}_l(w, \beta)]\right\}^{-\frac{1}{2}(2n-k+l)}, \tag{9.5.14}$$

$$-\infty < \theta_l < \infty,$$

which is a $t_l\left[\bar{\theta}_l(w, \beta),\; s^2(w, \beta)C_{ll}(w, \beta),\; 2n - k\right]$ distribution; and

(c) the posterior distribution of $(w, \beta)$ is

$$p(w, \beta \mid y) \propto \left|X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right|^{-1/2} w^{\frac{1}{2}(n+1)-1}\left[s^2(w, \beta)\right]^{-\frac{1}{2}(2n-k)}, \tag{9.5.15}$$

$$0 < w < \infty, \qquad -\infty < \beta < \infty.$$
Thus, we may write

$$p(\theta \mid y) = \int_0^\infty\int_{-\infty}^{\infty} p(\theta \mid w, \beta, y)\, p(w, \beta \mid y)\, d\beta\, dw, \qquad -\infty < \theta < \infty, \tag{9.5.16}$$

which is an alternative representation of the distribution in (9.5.2). The usefulness of this representation lies chiefly in the associated form for a subset $\theta_l$,

$$p(\theta_l \mid y) = \int_0^\infty\int_{-\infty}^{\infty} p(\theta_l \mid w, \beta, y)\, p(w, \beta \mid y)\, d\beta\, dw, \qquad -\infty < \theta_l < \infty. \tag{9.5.17}$$

If exact evaluation of the marginal distribution of $\theta_l$ is desired, it will only be necessary to calculate the double integral in (9.5.17), whatever the value of $k$, instead of a $(k - l)$-dimensional integral implied by the form in (9.5.2). This is particularly useful whenever $k - l$ is greater than two. In the situation where the responses were uncorrelated, (9.3.12) could be simplified according to (9.3.20) and (9.3.21). It does not seem possible in general here to find a matrix $P$ to similarly diagonalize $[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)]$ for all values of $(\beta, w)$, and consequently it is necessary to evaluate its determinant and inverse as functions of $\beta$ and $w$.
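Since (9.5.17) is only a double integral whatever $k$ may be, a straightforward product-grid quadrature is feasible. A minimal sketch of the joint kernel of $(w, \beta)$ from (9.5.15), with the data matrices user-supplied; tabulating it on a two-dimensional grid and combining it with the conditional $t$ densities gives (9.5.16) and (9.5.17) directly.

```python
import numpy as np

def log_joint_w_beta(w, beta, X1, X2, y1, y2, n, k):
    """Unnormalized log p(w, beta | y) of (9.5.15)."""
    X21 = X2 - beta * X1                      # X_{2.1}(beta)
    y21 = y2 - beta * y1                      # y_{2.1}(beta)
    A = X1.T @ X1 + w * (X21.T @ X21)
    thbar = np.linalg.solve(A, X1.T @ y1 + w * (X21.T @ y21))
    r1 = y1 - X1 @ thbar
    r2 = y21 - X21 @ thbar
    s2 = (r1 @ r1 + w * (r2 @ r2)) / (2 * n - k)   # s^2(w, beta)
    return (-0.5 * np.linalg.slogdet(A)[1]
            + (0.5 * (n + 1) - 1) * np.log(w)
            - 0.5 * (2 * n - k) * np.log(s2))
```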
9.5.2 Approximations to the Distribution of $\theta$ for m = 2

From (9.5.16), one might attempt to approximate the distribution of $\theta$ by writing

$$p(\theta \mid y) = \underset{f(w,\beta)}{E}\, p(\theta \mid w, \beta, y) \doteq p(\theta \mid \hat{w}, \hat{\beta}, y), \tag{9.5.18}$$

where $(\hat{w}, \hat{\beta})$ is the mode of the distribution of some functions $f(w, \beta)$ of $(w, \beta)$. However, this modal value $(\hat{w}, \hat{\beta})$ is not easy to determine, since the distribution $p(w, \beta \mid y)$ in (9.5.15) is a rather complicated function of $(w, \beta)$. We now develop an alternative approach. The distribution in (9.5.9) may be written as

$$p(\theta, \beta, w \mid y) = p(\theta, \beta \mid w, y)\, p(w \mid y), \tag{9.5.19}$$

where the first factor is

$$p(\theta, \beta \mid w, y) \propto \left[S_{11}(\theta) + wS_{22\cdot1}(\theta, \beta)\right]^{-n}, \qquad -\infty < \beta < \infty, \qquad -\infty < \theta < \infty. \tag{9.5.20}$$
For a given value of $w$, let $(\bar{\theta}_w, \bar{\beta}_w)$ be the mode of the conditional distribution (9.5.20). Employing Taylor's theorem, we may expand $S_{11}(\theta) + wS_{22\cdot1}(\theta, \beta)$ into

$$S_{11}(\theta) + wS_{22\cdot1}(\theta, \beta) \doteq (2n - k - 1)s^2(w) + Q(\theta, \beta \mid w), \tag{9.5.21}$$

where

$$Q(\theta, \beta \mid w) = \begin{bmatrix}\theta - \bar{\theta}_w \\ \beta - \bar{\beta}_w\end{bmatrix}' H(w)\begin{bmatrix}\theta - \bar{\theta}_w \\ \beta - \bar{\beta}_w\end{bmatrix}, \qquad H(w) = \begin{bmatrix}H_{11}(w) & h_{12}(w) \\ h_{12}'(w) & h_{22}(w)\end{bmatrix},$$

with

$$H_{11}(w) = X_1'X_1 + wX_{2\cdot1}'(\bar{\beta}_w)X_{2\cdot1}(\bar{\beta}_w),$$
$$h_{22}(w) = w(y_1 - X_1\bar{\theta}_w)'(y_1 - X_1\bar{\theta}_w),$$
$$h_{12}(w) = w\left\{X_{2\cdot1}'(\bar{\beta}_w)(y_1 - X_1\bar{\theta}_w) + X_1'\left[y_{2\cdot1}(\bar{\beta}_w) - X_{2\cdot1}(\bar{\beta}_w)\bar{\theta}_w\right]\right\}.$$
To this degree of approximation, the conditional distribution of $(\theta, \beta)$ given $w$ is in the form of a $(k+1)$-dimensional $t$ distribution. Integrating out $\beta$ from (9.5.20) using the approximating form (9.5.21), we have that

$$p(\theta \mid w, y) \propto \left[(2n - k - 1)s^2(w) + Q(\theta \mid w)\right]^{-\frac{1}{2}(2n-1)}, \tag{9.5.22}$$

where

$$Q(\theta \mid w) = (\theta - \bar{\theta}_w)'\,G(w)\,(\theta - \bar{\theta}_w), \qquad G(w) = H_{11}(w) - \frac{h_{12}(w)h_{12}'(w)}{h_{22}(w)},$$

which is the $t_k[\bar{\theta}_w,\, s^2(w)G^{-1}(w),\, 2n - k - 1]$ distribution. Further, substituting (9.5.21) into (9.5.9) and integrating out $(\theta, \beta)$, the posterior distribution of $w$ is, approximately,

$$p(w \mid y) \propto w^{\frac{1}{2}(n+1)-1}\,|H(w)|^{-1/2}\left[s^2(w)\right]^{-\frac{1}{2}(2n-k-1)}, \qquad 0 < w < \infty. \tag{9.5.23}$$
We may now adopt an argument similar to that in (9.3.27) to approximate the distribution of $\theta$ as

$$p(\theta \mid y) = \underset{\log w}{E}\, p(\theta \mid w, y) \doteq p(\theta \mid \hat{w}, y), \tag{9.5.24}$$

where $p(\theta \mid \hat{w}, y)$ is obtained by inserting $\hat{w}$ for $w$ in (9.5.22), and $\hat{w}$ is the mode of the density

$$p(\log w \mid y) \propto w^{\frac{1}{2}(n+1)}\,|H(w)|^{-1/2}\left[s^2(w)\right]^{-\frac{1}{2}(2n-k-1)}, \qquad -\infty < \log w < \infty. \tag{9.5.25}$$

Calculation of $(\bar{\theta}_w, \bar{\beta}_w, \hat{w})$.
In obtaining the approximating form (9.5.22), the basic problem is to find, for a given value of $w$, the conditional mode $(\bar{\theta}_w, \bar{\beta}_w)$ of (9.5.20). From (9.5.7), we see that the quantity $S_{11}(\theta) + wS_{22\cdot1}(\theta, \beta)$ is quadratic in either $\beta$ or $\theta$ given the other. Thus, conditional on $\theta$, $S_{11}(\theta) + wS_{22\cdot1}(\theta, \beta)$ is minimized when

$$\beta = \hat{\beta}(\theta) = \frac{(y_1 - X_1\theta)'(y_2 - X_2\theta)}{(y_1 - X_1\theta)'(y_1 - X_1\theta)}, \tag{9.5.26}$$

and, conditional on $\beta$, it is minimized for

$$\theta = \bar{\theta}(w, \beta), \tag{9.5.27}$$

as given in (9.5.11). Thus, given an initial choice of $\theta$, say $\theta = \theta_0$, expression (9.5.26) can be employed to obtain $\beta_0 = \hat{\beta}(\theta_0)$, which in turn can be used to calculate $\bar{\theta}(w, \beta_0)$ from (9.5.27), and so on. A convenient choice for $\theta_0$ is the center of $S_{11}(\theta)$, namely $\theta_0 = (X_1'X_1)^{-1}X_1'y_1$. This iterative procedure will converge under favorable circumstances. Once the conditional mode $(\bar{\theta}_w, \bar{\beta}_w)$ is determined for a given value of $w$, the corresponding density of $\log w$ in (9.5.25) for that value can be readily obtained. The
mode $\hat{w}$ may thus be found from (9.5.25) by employing standard numerical search procedures on a computer. A convenient preliminary estimate of $\hat{w}$ is given by

$$w_0 = \frac{s_1^2}{s_{2\cdot1}^2}, \tag{9.5.28}$$

where $s_1^2$ is given in (9.5.3), and

$$s_{2\cdot1}^2 = s_2^2 - \frac{s_{12}^2}{s_1^2}, \tag{9.5.29}$$

with

$$s_{ij} = \frac{(y_i - X_i\hat{\theta}_i)'(y_j - X_j\hat{\theta}_j)}{\nu} \qquad \text{and} \qquad \nu = n - k.$$

The accuracy of the approximation will be illustrated by an example in the next section; a sketch of the alternating minimization appears below.
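A minimal sketch of the conditional-mode search of (9.5.26)-(9.5.27): alternate the two closed-form minimizations until $(\bar{\theta}_w, \bar{\beta}_w)$ stabilizes. The tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

def conditional_mode(w, X1, X2, y1, y2, tol=1e-10, max_iter=200):
    """Alternating minimization of S11(theta) + w * S22.1(theta, beta)."""
    theta = np.linalg.lstsq(X1, y1, rcond=None)[0]   # theta_0, center of S11
    beta = 0.0
    for _ in range(max_iter):
        r1 = y1 - X1 @ theta
        beta_new = r1 @ (y2 - X2 @ theta) / (r1 @ r1)      # (9.5.26)
        X21 = X2 - beta_new * X1                           # X_{2.1}(beta)
        y21 = y2 - beta_new * y1
        A = X1.T @ X1 + w * (X21.T @ X21)
        theta_new = np.linalg.solve(A, X1.T @ y1 + w * (X21.T @ y21))  # (9.5.27)
        done = abs(beta_new - beta) + np.max(np.abs(theta_new - theta)) < tol
        theta, beta = theta_new, beta_new
        if done:
            break
    return theta, beta
```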
9.5.3 An Example

To illustrate the theory developed above, we consider the following example. An experiment was conducted to assess the effect of a change in pressure from 30 psi to 50 psi on a chemical process yield, known to be subject to large errors. Twelve batches of raw material were randomly selected and, from each batch, a pair of samples was taken by two chemists, A and B, for separate experimentation. Each chemist ran half of his experiments at high pressure and half at low pressure, and they arranged their pressure settings such that each level of pressure run by chemist A was paired with each level of pressure run by chemist B an equal number of times. The data are given in Table 9.5.1.
Table 9.5.1 Pressure effect independently assessed by two chemists

    Batch no.   Chemist A: Pressure   Yield y_u1   Chemist B: Pressure   Yield y_u2
        1               50               79.5              30               73.2
        2               50               76.4              50               85.0
        3               50               77.2              30               60.3
        4               50               76.0              50               72.9
        5               50               78.6              30               66.5
        6               50               82.9              50               80.0
        7               30               61.4              30               63.6
        8               30               67.4              50               78.3
        9               30               63.5              30               73.3
       10               30               69.1              50               81.3
       11               30               61.2              30               54.2
       12               30               69.7              50               76.3
It is assumed that, in the region of the experimentation, the following model is appropriate:

$$y_{u1} = \theta_{01} + \theta x_{u1} + \varepsilon_{u1}, \qquad y_{u2} = \theta_{02} + \theta x_{u2} + \varepsilon_{u2}, \tag{9.5.30}$$

where

$$x_{ui} = \frac{\text{Pressure} - 40}{10},$$

so that the $x_{ui}$ assume only two values $(1, -1)$. The pressure effect $\theta$ is assumed common for both chemists, but the parameters $\theta_{01}$ and $\theta_{02}$ are taken to be different to allow for possible systematic differences in mean yield. The error terms $(\varepsilon_{u1}, \varepsilon_{u2})$ are assumed correlated because each pair of observations $(y_{u1}, y_{u2})$ is made from the same batch of material. In terms of the linear model in (8.3.2) we have

$$y_i = X_i\delta_i + \varepsilon_i, \qquad i = 1, 2, \tag{9.5.31}$$

where

$$\delta_i = \begin{bmatrix}\theta_{0i} \\ \theta\end{bmatrix}, \qquad X_i = [1 : x_i], \qquad n = 12,$$

and $1$ is a $12 \times 1$ vector of ones. We are thus in a situation in which some of the parameters in $\delta_i$ and part of the derivative matrix $X_i$ are common between the two responses. Assuming that $(\varepsilon_{u1}, \varepsilon_{u2})$ has the bivariate Normal distribution $N_2(0, \Sigma)$, and on the basis of the reference prior distribution in (9.5.6), the posterior distribution of $(\theta_{01}, \theta_{02}, \theta, \sigma_{11}, \sigma_{22\cdot1}, \beta)$ is

$$p(\theta_{01}, \theta_{02}, \theta, \sigma_{11}, \sigma_{22\cdot1}, \beta \mid y) \propto g_1(\theta_{01}, \theta, \sigma_{11})\, g_2(\theta_{02}, \theta, \sigma_{22\cdot1}, \beta), \tag{9.5.32}$$

$$-\infty < \theta_{01}, \theta_{02}, \theta < \infty, \qquad \sigma_{11} > 0, \qquad \sigma_{22\cdot1} > 0, \qquad -\infty < \beta < \infty,$$

where

$$g_1(\theta_{01}, \theta, \sigma_{11}) = \sigma_{11}^{-\frac{1}{2}(n+1)}\exp\left[-\frac{1}{2\sigma_{11}}(y_1 - X_1\delta_1)'(y_1 - X_1\delta_1)\right]$$

and

$$g_2(\theta_{02}, \theta, \sigma_{22\cdot1}, \beta) = \sigma_{22\cdot1}^{-\frac{1}{2}(n+3)}\exp\left\{-\frac{1}{2\sigma_{22\cdot1}}[y_2 - X_2\delta_2 - \beta(y_1 - X_1\delta_1)]'[y_2 - X_2\delta_2 - \beta(y_1 - X_1\delta_1)]\right\}.$$
e,
the parameters
ce
ol
.8 02 )
(9.5.33) where
1 Yi = Yi - y;l,
Yi
= -
n
n
L Yui' u= 1
i = J, 2,
9.:;
509
General Linear Model with Common Parameters
1I
and noting that l'Yi = 1'Xi = 0, we have that
(Yt - X t{lt)'(Yl - X 1{l1)
(Yt -
=
X t B)'(Yl
- xIB)
+ 12(Gol
-
(9.5.34)
Yl)2
and
[Y2 - X 2 {l2 - /3(Yl - X 1{l1)]'[Y2 - X 2 {J2 - /3(Yl - X t{lI)J =
[Y2 - x 2 B - /3(Yl - x I 8)]'[Y2 - x 2 8 - /3(Yl - x 1 G)J
+
12[8 02 - Y2 - /3(8 01 - Yl)J 2.
Substituting (9.5.34) into (9.5.32), (BOl' ( 02 ) can be easily integrated out, yielding
p(8, () 1 I, () 221' /3 I y) ex.
-co <
gi (B, () 11) gi(8, () 22-1, /3),
e<
-co < /3 < co,
co,
()11
> 0, ()22.l > 0,
(9.5.35)
where
and
which is of the form in (9.5.7) with n = II, and the results in Sections 9.5.1-2 are now applicable. Figure 9.5.1 shows the posterior distribution of B calculated from (9.5.2). The normalizing constant was computed by numerical integration. Also shown in the same figure are the normalized curves of the two component factors
$[S_{11}(\theta)]^{-\frac{1}{2}n}$ and $[S_{22}(\theta) - S_{12}^2(\theta)/S_{11}(\theta)]^{-\frac{1}{2}n}$, of the distribution.

[Fig. 9.5.1 Posterior distribution of $\theta$ and its component factors: the pressure data.]

The first factor, which is a univariate $t$ distribution centered at $\theta = 6.53$, represents information about $\theta$ coming from the results of chemist A, and the second factor, having its mode at $\theta = 5.49$ and a long tail toward the right, represents the "extra information" provided by the results of chemist B. As expected, the overall distribution is located between the two component factors. It is a $t$-like distribution centered at $\theta = 6.22$ and skewed slightly to the right.
Although for this example it is scarcely necessary to approximate the distribution of $\theta$, for illustrative purposes we have obtained the approximate distribution from (9.5.24). Starting with the preliminary estimate

$$w_0 = \frac{s_1^2}{s_{2\cdot1}^2} = 0.312,$$

we found by trial and error that $\hat{w} = 0.377$. The corresponding $\bar{\beta}_w$ and $\bar{\theta}_w$ are

$$\bar{\beta}_w = 0.6948, \qquad \bar{\theta}_w = 6.189,$$

from which we obtained

$$p(\theta \mid y) \;\dot{\propto}\; \left[227.9735 + 16.4171(\theta - 6.189)^2\right]^{-21/2}, \qquad -\infty < \theta < \infty.$$

That is, the quantity

$$\frac{\theta - 6.189}{0.833}$$

is approximately distributed as $t(0, 1, 20)$. Figure 9.5.2 compares the accuracy of the approximation for this example. The agreement between the exact and the approximate distributions is very close.
[Fig. 9.5.2 Comparison of the exact and the approximate posterior distributions of $\theta$: the pressure data.]
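Because Table 9.5.1 gives the raw data, the exact curve of Fig. 9.5.1 can be reproduced directly: center each response, form the $S_{ij}(\theta)$ on a grid, and evaluate $|S(\theta)|^{-n/2}$ with $n = 11$ as in (9.5.35). A sketch (the grid range is an arbitrary choice); if the digitized data are faithful, its mode should fall near $\theta = 6.22$.

```python
import numpy as np

yA = np.array([79.5, 76.4, 77.2, 76.0, 78.6, 82.9,
               61.4, 67.4, 63.5, 69.1, 61.2, 69.7])
yB = np.array([73.2, 85.0, 60.3, 72.9, 66.5, 80.0,
               63.6, 78.3, 73.3, 81.3, 54.2, 76.3])
xA = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])  # (pressure-40)/10
xB = np.array([-1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1])

# Center the responses (the x's already have zero mean), then apply (9.5.2).
y1, y2 = yA - yA.mean(), yB - yB.mean()
theta = np.linspace(3, 10, 1401)
r1 = y1[:, None] - np.outer(xA, theta)     # residuals for each theta
r2 = y2[:, None] - np.outer(xB, theta)
S11, S22, S12 = (r1 * r1).sum(0), (r2 * r2).sum(0), (r1 * r2).sum(0)
n = 11                                     # effective n after removing intercepts
post = (S11 * S22 - S12 ** 2) ** (-n / 2)
post /= np.trapz(post, theta)
print("posterior mode near", theta[np.argmax(post)])
```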
9.6 A SUMMARY OF VARIOUS POSTERIOR DISTRIBUTIONS FOR COMMON PARAMETERS $\theta$

In Table 9.6.1, we provide a summary of the various posterior distributions for making inferences about the common parameters $\theta' = (\theta_1, \ldots, \theta_k)$ discussed in the preceding sections.

Table 9.6.1 A summary of various posterior distributions for the common parameters $\theta$
Independent Responses:

1. The model is

$$y_i = X_i\theta + \varepsilon_i, \qquad i = 1, \ldots, m,$$

where $y_i$ is an $n_i \times 1$ vector of observations, $\theta$ is a $k \times 1$ vector of parameters, $X_i$ is an $n_i \times k$ matrix of fixed elements, and $\varepsilon_i$ is an $n_i \times 1$ vector of errors distributed as Normal $N_{n_i}(0, \sigma_{ii}I)$. The $\varepsilon_i$'s are assumed to be independent of one another.

2. The distribution of $\theta$ is, from (9.2.3),

$$p(\theta \mid y) \propto \prod_{i=1}^{m}\left[\nu_i s_i^2 + (\theta - \hat{\theta}_i)'\,X_i'X_i\,(\theta - \hat{\theta}_i)\right]^{-\frac{1}{2}n_i}, \qquad -\infty < \theta < \infty,$$

where

$$\hat{\theta}_i = (X_i'X_i)^{-1}X_i'y_i, \qquad \nu_i s_i^2 = (y_i - X_i\hat{\theta}_i)'(y_i - X_i\hat{\theta}_i), \qquad \nu_i = n_i - k.$$

3. For $m = 2$, with $X_1'X_1$ positive definite, an alternative expression for $p(\theta \mid y)$ is given in (9.3.9),

$$p(\theta \mid y) = \int_0^\infty p(\theta \mid w, y)\, p(w \mid y)\, dw,$$

where, from (9.3.8),

$$p(w \mid y) \propto |X_1'X_1 + wX_2'X_2|^{-1/2}\, w^{\frac{1}{2}n_2-1}\left[s^2(w)\right]^{-\frac{1}{2}(n_1+n_2-k)}, \qquad 0 < w < \infty,$$

and, from (9.3.7), the conditional distribution $p(\theta \mid w, y)$ is the multivariate $t$ distribution $t_k[\bar{\theta}(w),\, s^2(w)(X_1'X_1 + wX_2'X_2)^{-1},\, n_1+n_2-k]$ with

$$\bar{\theta}(w) = (X_1'X_1 + wX_2'X_2)^{-1}(X_1'X_1\hat{\theta}_1 + wX_2'X_2\hat{\theta}_2),$$

$$s^2(w) = \frac{1}{n_1+n_2-k}\left[\nu_1 s_1^2 + w\nu_2 s_2^2 + S_c(w)\right],$$

$$S_c(w) = w(\hat{\theta}_1 - \hat{\theta}_2)'\,X_1'X_1(X_1'X_1 + wX_2'X_2)^{-1}X_2'X_2\,(\hat{\theta}_1 - \hat{\theta}_2), \qquad \nu_i = n_i - k, \quad i = 1, 2.$$
Table 9.6.1 Continued

4. In particular, partitioning

$$\theta = \begin{bmatrix}\theta_l \\ \theta_p\end{bmatrix} \qquad \text{and} \qquad C(w) = (X_1'X_1 + wX_2'X_2)^{-1} = \begin{bmatrix}C_{ll}(w) & C_{lp}(w) \\ C_{pl}(w) & C_{pp}(w)\end{bmatrix}, \qquad p = k - l,$$

the marginal distribution of $\theta_l$ is

$$p(\theta_l \mid y) = \int_0^\infty p(\theta_l \mid w, y)\, p(w \mid y)\, dw,$$

where $p(\theta_l \mid w, y)$ is the multivariate $t$ distribution $t_l[\bar{\theta}_l(w),\, s^2(w)C_{ll}(w),\, n_1+n_2-k]$.

5. From (9.3.18) to (9.3.21), computation of $p(\theta \mid y)$ and $p(\theta_l \mid y)$ can be simplified by first finding a $k \times k$ non-singular matrix $P$ and a $k \times k$ diagonal matrix $\Lambda = \{\lambda_{jj}\}$, $j = 1, \ldots, k$, such that

$$X_1'X_1 = (P')^{-1}P^{-1} \qquad \text{and} \qquad X_2'X_2 = (P')^{-1}\Lambda P^{-1},$$

so that, with $\phi_i = P^{-1}\hat{\theta}_i$, $i = 1, 2$,

$$|X_1'X_1 + wX_2'X_2| = |X_1'X_1|\prod_{j=1}^{k}(1 + w\lambda_{jj}),$$

$$\bar{\theta}(w) = P(I + w\Lambda)^{-1}(\phi_1 + w\Lambda\phi_2), \qquad S_c(w) = w(\phi_1 - \phi_2)'(I + w\Lambda)^{-1}\Lambda(\phi_1 - \phi_2).$$

6. A simple approximation to $p(\theta \mid y)$ is given by (9.3.27),

$$p(\theta \mid y) \doteq p(\theta \mid \hat{w}, y), \qquad \text{i.e.} \qquad \theta \sim t_k\left[\bar{\theta}(\hat{w}),\, s^2(\hat{w})(X_1'X_1 + \hat{w}X_2'X_2)^{-1},\, n_1+n_2-k\right],$$

where, from (9.3.28) and (9.3.29), $\hat{w}$ is the root of the equation $h(w) = 0$ maximizing $w\,p(w \mid y)$. An iterative procedure for finding $\hat{w}$ is described in (9.3.30) and (9.3.31).
Table 9.6.1 Continued

Correlated Responses with Common Derivative Matrix X:

7. The model is

$$y_i = X\theta_c + \varepsilon_i, \qquad i = 1, \ldots, m,$$

where $y_i$ is an $n \times 1$ vector of observations, $\theta_c$ a $k \times 1$ vector of parameters, $X$ an $n \times k$ matrix of fixed elements with rank $k$, and $\varepsilon_i = (\varepsilon_{1i}, \ldots, \varepsilon_{ni})'$ an $n \times 1$ vector of errors. It is assumed that $\varepsilon_{(u)} = (\varepsilon_{u1}, \ldots, \varepsilon_{um})'$, $u = 1, \ldots, n$, are independently distributed as $N_m(0, \Sigma)$.

8. Let

$$\hat{\Theta} = [\hat{\theta}_1, \ldots, \hat{\theta}_m], \qquad \hat{\theta}_i = (X'X)^{-1}X'y_i, \qquad i = 1, \ldots, m,$$

$$A = \{a_{ij}\}, \qquad a_{ij} = (y_i - X\hat{\theta}_i)'(y_j - X\hat{\theta}_j), \qquad i, j = 1, \ldots, m,$$

and let $1_m$ be an $m \times 1$ vector of ones. Then, from (9.4.3), the posterior distribution of $\theta_c$ is the multivariate $t$ distribution

$$t_k\left[\hat{\theta}_c,\; \frac{V}{(n-k)d},\; n-k\right],$$

where

$$\hat{\theta}_c = \frac{1}{d}\hat{\Theta}A^{-1}1_m, \qquad d = 1_m'A^{-1}1_m, \qquad V = (X'X)^{-1} + \hat{\Theta}\left[A^{-1} - \frac{1}{d}A^{-1}1_m 1_m'A^{-1}\right]\hat{\Theta}'.$$

General Linear Model:

9. The model is

$$y_i = X_i\theta + \varepsilon_i, \qquad i = 1, \ldots, m,$$

where $y_i$ is an $n \times 1$ vector of observations, $X_i$ an $n \times k$ matrix of fixed elements, $\theta$ a $k \times 1$ vector of parameters, and $\varepsilon_i = (\varepsilon_{1i}, \ldots, \varepsilon_{ni})'$ an $n \times 1$ vector of errors. It is assumed that $\varepsilon_{(u)} = (\varepsilon_{u1}, \ldots, \varepsilon_{um})'$, $u = 1, \ldots, n$, are independently distributed as $N_m(0, \Sigma)$.

10. The posterior distribution of $\theta$ is, from (9.5.1),

$$p(\theta \mid y) \propto |S(\theta)|^{-\frac{1}{2}n}, \qquad -\infty < \theta < \infty,$$

where

$$S(\theta) = \{S_{ij}(\theta)\}, \qquad S_{ij}(\theta) = (y_i - X_i\theta)'(y_j - X_j\theta), \qquad i, j = 1, \ldots, m.$$
Table 9.6.1 Continued

11. For $m = 2$, an alternative expression of the posterior distribution of $\theta$ is given in (9.5.16),

$$p(\theta \mid y) = \int_0^\infty\int_{-\infty}^{\infty} p(\theta \mid w, \beta, y)\, p(w, \beta \mid y)\, d\beta\, dw, \qquad -\infty < \theta < \infty,$$

where, from (9.5.15),

$$p(w, \beta \mid y) \propto |X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)|^{-1/2}\, w^{\frac{1}{2}(n+1)-1}\left[s^2(w, \beta)\right]^{-\frac{1}{2}(2n-k)},$$

$$0 < w < \infty, \qquad -\infty < \beta < \infty,$$

and, from (9.5.12), the conditional posterior distribution $p(\theta \mid w, \beta, y)$ is the multivariate $t$ distribution

$$t_k\left[\bar{\theta}(w, \beta),\; s^2(w, \beta)\left[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right]^{-1},\; 2n-k\right],$$

with

$$\bar{\theta}(w, \beta) = \left[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right]^{-1}\left[X_1'y_1 + wX_{2\cdot1}'(\beta)y_{2\cdot1}(\beta)\right],$$

$$X_{2\cdot1}(\beta) = X_2 - \beta X_1, \qquad y_{2\cdot1}(\beta) = y_2 - \beta y_1,$$

and

$$(2n-k)\,s^2(w, \beta) = [y_1 - X_1\bar{\theta}(w, \beta)]'[y_1 - X_1\bar{\theta}(w, \beta)] + w[y_{2\cdot1}(\beta) - X_{2\cdot1}(\beta)\bar{\theta}(w, \beta)]'[y_{2\cdot1}(\beta) - X_{2\cdot1}(\beta)\bar{\theta}(w, \beta)].$$

12. By partitioning

$$\theta = \begin{bmatrix}\theta_l \\ \theta_p\end{bmatrix} \qquad \text{and} \qquad \left[X_1'X_1 + wX_{2\cdot1}'(\beta)X_{2\cdot1}(\beta)\right]^{-1} = \begin{bmatrix}C_{ll}(w, \beta) & C_{lp}(w, \beta) \\ C_{pl}(w, \beta) & C_{pp}(w, \beta)\end{bmatrix},$$

the marginal posterior distribution of $\theta_l$ is, as in (9.5.17),

$$p(\theta_l \mid y) = \int_0^\infty\int_{-\infty}^{\infty} p(\theta_l \mid w, \beta, y)\, p(w, \beta \mid y)\, d\beta\, dw, \qquad -\infty < \theta_l < \infty,$$

where $p(\theta_l \mid w, \beta, y)$ is the multivariate $t$ distribution $t_l[\bar{\theta}_l(w, \beta),\, s^2(w, \beta)C_{ll}(w, \beta),\, 2n-k]$.

13. A procedure for approximating the distribution $p(\theta \mid y)$ is developed in Section 9.5.2.
APPENDIX A9.1

Asymptotic Expansion of the Posterior Distribution of $\theta$ for Two Independent Responses

We now develop an asymptotic expansion for the distribution of $\theta$ in (9.3.1),

$$p(\theta \mid y) \propto \prod_{i=1}^{2}\left[1 + \frac{(\theta - \hat{\theta}_i)'\,X_i'X_i\,(\theta - \hat{\theta}_i)}{\nu_i s_i^2}\right]^{-\frac{1}{2}(\nu_i+k)}, \qquad -\infty < \theta < \infty. \tag{A9.1.1}$$

The results can be used to approximate the distribution as well as the associated marginal distributions of subsets of $\theta$. The procedure is a generalization of Fisher's result (1961b) for the case of the weighted mean discussed in Section 9.2.1. For simplicity in notation, we write

$$Q_i = (\theta - \hat{\theta}_i)'\,M_i\,(\theta - \hat{\theta}_i), \qquad i = 1, 2, \tag{A9.1.2}$$

where

$$M_i = \frac{X_i'X_i}{s_i^2}.$$

Expression (A9.1.1) then becomes

$$p(\theta \mid y) = c^{-1}\, g(Q_1, Q_2), \qquad -\infty < \theta < \infty,$$

where

$$g(Q_1, Q_2) = \left(1 + \frac{Q_1}{\nu_1}\right)^{-\frac{1}{2}(\nu_1+k)}\left(1 + \frac{Q_2}{\nu_2}\right)^{-\frac{1}{2}(\nu_2+k)} \tag{A9.1.3}$$

and

$$c = \int_{-\infty < \theta < \infty} g(Q_1, Q_2)\, d\theta.$$

The expression $(1 + Q_1/\nu_1)^{-\frac{1}{2}(\nu_1+k)}$ can be written

$$\left(1 + \frac{Q_1}{\nu_1}\right)^{-\frac{1}{2}(\nu_1+k)} = \exp\left(-\tfrac{1}{2}Q_1\right)\exp\left[\tfrac{1}{2}Q_1 - \frac{\nu_1+k}{2}\log\left(1 + \frac{Q_1}{\nu_1}\right)\right].$$

Expanding the second factor on the right in powers of $\nu_1^{-1}$, we obtain

$$\left(1 + \frac{Q_1}{\nu_1}\right)^{-\frac{1}{2}(\nu_1+k)} = \exp\left(-\tfrac{1}{2}Q_1\right)\sum_{i=0}^{\infty} p_i\,\nu_1^{-i}, \tag{A9.1.4}$$

where

$$p_0 = 1, \qquad p_1 = \tfrac{1}{4}(Q_1^2 - 2kQ_1), \qquad p_2 = \tfrac{1}{96}\left[3Q_1^4 - 4(3k+4)Q_1^3 + 12k(k+2)Q_1^2\right], \qquad \text{etc.}$$

Similarly, we have that

$$\left(1 + \frac{Q_2}{\nu_2}\right)^{-\frac{1}{2}(\nu_2+k)} = \exp\left(-\tfrac{1}{2}Q_2\right)\sum_{i=0}^{\infty} q_i\,\nu_2^{-i}, \tag{A9.1.5}$$

where

$$q_0 = 1, \qquad q_1 = \tfrac{1}{4}(Q_2^2 - 2kQ_2), \qquad q_2 = \tfrac{1}{96}\left[3Q_2^4 - 4(3k+4)Q_2^3 + 12k(k+2)Q_2^2\right], \qquad \text{etc.}$$
Substituting (A9.1.4) and (A9.1.5) into (A9.1.3) and after a little reduction, we can express the posterior distribution as

$$p(\theta \mid y) = \tilde{w}^{-1}\, h(\theta), \qquad -\infty < \theta < \infty, \tag{A9.1.6}$$

where

$$h(\theta) = \frac{|M|^{1/2}}{(2\pi)^{k/2}}\exp\left(-\tfrac{1}{2}Q\right)\sum_{i=0}^{\infty}\sum_{j=0}^{\infty} p_i q_j\,\nu_1^{-i}\nu_2^{-j}, \qquad Q = (\theta - \tilde{\theta})'\,M\,(\theta - \tilde{\theta}),$$

$$M = M_1 + M_2, \qquad \tilde{\theta} = M^{-1}(M_1\hat{\theta}_1 + M_2\hat{\theta}_2),$$

and

$$\tilde{w} = \int_{-\infty < \theta < \infty} h(\theta)\, d\theta. \tag{A9.1.7}$$

The integral $\tilde{w}$ in (A9.1.7) can be evaluated term by term. From (A9.1.4) and (A9.1.5), we see that each term is a bivariate polynomial in the mixed moments of the quadratic forms $Q_1$ and $Q_2$, where the variables $\theta$ have a multivariate Normal distribution $N_k(\tilde{\theta}, M^{-1})$. For this problem, it appears much simpler to obtain the mixed moments indirectly by first finding the mixed cumulants. It is straightforward to verify that the joint cumulant generating function of $Q_1$ and $Q_2$ is

$$K(t_1, t_2) = \log\frac{|M|^{1/2}}{(2\pi)^{k/2}}\int_{-\infty < \theta < \infty}\exp\left(t_1 Q_1 + t_2 Q_2 - \tfrac{1}{2}Q\right)\, d\theta$$

$$= -\tfrac{1}{2}\log\left|I - 2M^{-1}(t_1 M_1 + t_2 M_2)\right| + t_1\eta_1'M_1\eta_1 + t_2\eta_2'M_2\eta_2$$
$$\quad + 2(t_1 M_1\eta_1 + t_2 M_2\eta_2)'(M - 2t_1 M_1 - 2t_2 M_2)^{-1}(t_1 M_1\eta_1 + t_2 M_2\eta_2), \tag{A9.1.8}$$

where

$$\eta_1 = \tilde{\theta} - \hat{\theta}_1 \qquad \text{and} \qquad \eta_2 = \tilde{\theta} - \hat{\theta}_2.$$

Upon differentiating (A9.1.8) and after some algebraic reduction, we find (see Appendix A9.2)

$$\kappa_{10} = \mathrm{tr}\,M^{-1}M_1 + \eta_1'M_1\eta_1, \qquad \kappa_{01} = \mathrm{tr}\,M^{-1}M_2 + \eta_2'M_2\eta_2,$$

$$\kappa_{rs} = 2^{r+s-1}(r+s-2)!\left[(r+s-1)\,\mathrm{tr}\,M^{-1}G^{rs} + (r\eta_1 + s\eta_2)'G^{rs}(r\eta_1 + s\eta_2) - r\eta_1'G^{rs}\eta_1 - s\eta_2'G^{rs}\eta_2\right], \qquad r+s \geq 2, \tag{A9.1.9}$$

where

$$G^{rs} = M(M^{-1}M_1)^r(M^{-1}M_2)^s.$$
Employing the bivariate moment-cumulant inversion formulae as given by Cook (1951), the integral $\tilde{w}$ in (A9.1.7) can be written as

$$\tilde{w} = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} b_{ij}\,\nu_1^{-i}\nu_2^{-j}, \tag{A9.1.10}$$

where

$$b_{00} = 1, \qquad b_{10} = \tfrac{1}{4}(\kappa_{20} + \kappa_{10}^2 - 2k\kappa_{10}), \qquad b_{01} = \tfrac{1}{4}(\kappa_{02} + \kappa_{01}^2 - 2k\kappa_{01}),$$

$$b_{11} = \tfrac{1}{16}\left[\kappa_{22} + \kappa_{20}\kappa_{02} + 2\kappa_{11}^2 + 2\kappa_{21}\kappa_{01} + 2\kappa_{12}\kappa_{10} + \kappa_{02}\kappa_{10}^2 + \kappa_{20}\kappa_{01}^2 + 4\kappa_{11}\kappa_{10}\kappa_{01} + \kappa_{10}^2\kappa_{01}^2\right.$$
$$\left.\quad - 2k(\kappa_{12} + \kappa_{21} + 2\kappa_{11}\kappa_{01} + 2\kappa_{11}\kappa_{10} + \kappa_{02}\kappa_{10} + \kappa_{20}\kappa_{01} + \kappa_{10}\kappa_{01}^2 + \kappa_{01}\kappa_{10}^2) + 4k^2(\kappa_{11} + \kappa_{01}\kappa_{10})\right],$$

$$b_{20} = \tfrac{1}{96}\left[3(\kappa_{40} + 3\kappa_{20}^2 + 4\kappa_{30}\kappa_{10} + 6\kappa_{20}\kappa_{10}^2 + \kappa_{10}^4) - 4(3k+4)(\kappa_{30} + 3\kappa_{20}\kappa_{10} + \kappa_{10}^3) + 12k(k+2)(\kappa_{20} + \kappa_{10}^2)\right],$$

with $b_{02}$ given by the corresponding expression in which the first and second subscripts of the $\kappa$'s are interchanged. Substituting the results in (A9.1.10) into (A9.1.6), we obtain the following asymptotic expression for the posterior distribution of $\theta$,

$$p(\theta \mid y) = \frac{|M|^{1/2}}{(2\pi)^{k/2}}\exp\left[-\tfrac{1}{2}(\theta - \tilde{\theta})'M(\theta - \tilde{\theta})\right]\sum_{i=0}^{\infty}\sum_{j=0}^{\infty} d_{ij}\,\nu_1^{-i}\nu_2^{-j}, \qquad -\infty < \theta < \infty, \tag{A9.1.11}$$

where

$$d_{00} = 1, \qquad d_{10} = p_1 - b_{10}, \qquad d_{01} = q_1 - b_{01},$$

$\tilde{\theta}$ and $M$ are given in (A9.1.7), and expressions for additional terms $d_{12}$, $d_{21}$, $d_{22}$, etc., can similarly be found if desired. The posterior distribution is thus expressed in the form of a multivariate Normal distribution multiplied by a power series in $\nu_1^{-1}$ and $\nu_2^{-1}$. When both $\nu_1$ and $\nu_2$ tend to infinity, all terms of the power series except the leading one vanish so that, in the limit, the posterior distribution is multivariate Normal $N_k(\tilde{\theta}, M^{-1})$. For finite values of $\nu_1$ and $\nu_2$, the terms in the power series can be regarded as "corrections" to a Normal approximation to the distribution in (A9.1.1). From (A9.1.4), (A9.1.5) and (A9.1.9), we see that numerical evaluation of the coefficients in the power series involves merely matrix inversions and multiplications, operations which are easily performed on an electronic computer. We note that when the posterior distribution is a univariate distribution as in (9.2.4), the results in (A9.1.11) are in exact agreement with those obtained by Fisher (1961b). In Fisher's derivation, each term of the integral $\tilde{w}$ in (A9.1.7) was expressed in terms of the moments of a univariate Normal distribution. It can therefore be evaluated directly without making use of the mixed-cumulant formulae given in (A9.1.9), which seem more convenient for the multivariate case considered here. For the univariate case, posterior probabilities can be calculated using the formulae given in Fisher's paper. When $k > 1$, the corresponding formulae for the evaluation of joint probabilities become exceedingly cumbersome and are not given here.
The Marginal Posterior Distribution

When interest centers on a subset of the elements of $\theta$, say $\theta_l' = (\theta_1, \ldots, \theta_l)$, an asymptotic expression for the corresponding marginal posterior distribution can be obtained by integrating out the remaining elements, $\theta_r' = (\theta_{l+1}, \ldots, \theta_k)$, from the joint distribution in (A9.1.11). We have that

$$p(\theta_l \mid y) = \frac{|M|^{1/2}}{(2\pi)^{k/2}}\int_{-\infty < \theta_r < \infty}\exp\left(-\tfrac{1}{2}Q\right)\sum_{i=0}^{\infty}\sum_{j=0}^{\infty} d_{ij}\,\nu_1^{-i}\nu_2^{-j}\, d\theta_r. \tag{A9.1.12}$$

Partitioning $\theta$ into $\theta' = (\theta_l' : \theta_r')$ and the matrices $M$ and $M^{-1}$ conformably, with $V_{ll}$ the upper $l \times l$ submatrix of $M^{-1}$, we can write the marginal posterior distribution as

$$p(\theta_l \mid y) = \frac{|V_{ll}|^{-1/2}}{(2\pi)^{l/2}}\exp\left[-\tfrac{1}{2}(\theta_l - \tilde{\theta}_l)'\,V_{ll}^{-1}(\theta_l - \tilde{\theta}_l)\right] f(\theta_l), \qquad -\infty < \theta_l < \infty, \tag{A9.1.13}$$

where $f(\theta_l)$ collects the correction terms. From the expression for $d_{ij}$ given in (A9.1.11), we see that each term in the integral $f(\theta_l)$ is a bivariate polynomial in the quadratic forms $Q_1$ and $Q_2$, where $\theta_l$ is now considered fixed and $\theta_r$ has a multivariate Normal distribution with covariance matrix $M_{rr}^{-1}$, its mean being the conditional mean of $\theta_r$ given $\theta_l$ under $N_k(\tilde{\theta}, M^{-1})$. Adopting the same procedure as that described in the preceding section, and introducing in (A9.1.14) the shifted vectors which here play the roles of $\eta_1$ and $\eta_2$, one obtains in (A9.1.15) the mixed cumulants $\omega_{rs}$ of $Q_1$ and $Q_2$ in this conditional distribution; they have the same structure as the $\kappa_{rs}$ in (A9.1.9), with the $r \times r$ partitions of $M$, $M_1$ and $M_2$ replacing the full matrices, and with additional quadratic terms in $(\theta_l - \hat{\theta}_{1l})$ and $(\theta_l - \hat{\theta}_{2l})$. Using the results in (A9.1.15), we can express the marginal posterior distribution of $\theta_l$ as

$$p(\theta_l \mid y) = \frac{|V_{ll}|^{-1/2}}{(2\pi)^{l/2}}\exp\left[-\tfrac{1}{2}(\theta_l - \tilde{\theta}_l)'\,V_{ll}^{-1}(\theta_l - \tilde{\theta}_l)\right]\sum_{i=0}^{\infty}\sum_{j=0}^{\infty} \tilde{b}_{ij}\,\nu_1^{-i}\nu_2^{-j}, \tag{A9.1.16}$$

with $\tilde{b}_{00} = 1$; the quantities $\tilde{b}_{ij}$ are functions of the mixed cumulants $\omega_{ij}$, with the functional relationships exactly the same as those between the $b_{ij}$ and the $\kappa_{ij}$ shown in (A9.1.10). It is seen that the leading term in the expansion is the multivariate Normal distribution $N_l(\tilde{\theta}_l, V_{ll})$. When $\theta_l$ consists of only one variable, the quantities $\tilde{b}_{ij}$ in (A9.1.16) are simply polynomials in that variable. Employing the well-known expression for the moments of a Normal variable, one can easily derive an asymptotic expression for the moments. In addition, probability integrals can also be approximated using methods given in Fisher's previously cited paper (1961b).
APPE"IDIX A9.2
Mixed Cumulants of the Two Quadratic Forms Ql and Q2 From the joint cumulant generating function of the quadratic forms Q\ and Q2 given in (A9.1.8), we now derive the expressions for the mixed cumulants shown in (A9.1.9). In our development, we shall make use of the following lemma.
Lemma Let PI be a 11 x 11 positive definite symmetric matrix and P2 be a 11 x n nonnegative definite symmetric matrix. Then, for sufficiently small i, we have ir
X;
log
II -
I -r
iP 1P 2 1 = -
tr (P 1 P 2 Y·
(A9.2.1)
r= 1
Employing the above lemma and for sufficiently small values of 11 and 12, we can expand the first term on the right of (A9.1.8) into
--i
10giI - 2M- l (tl M I
+ 12MJ I =
I
r=1
2r - l --tr(tI M - 1M I -i- 12M- 1M 2)', r
In (A9.1.8), the quadratic form 1111'IMl111 can be w'ritten I
(A9.2.2)
[ll'[M[ll1 = tlTJ'IM1(M - 21[M[ - 21 2M 2 ) - I(M - 211Ml - 21 2 M 2 )TJl = 11TJ'[M l (M
- 2/[M[ - 2/ 2 M 2) - IMTJI - 2t711'tMl(M - 211Ml
-2t 2 M 2)-IM l11[ - 21\1 211'[M I (M - 2tlMI - 21 2MJ- I M 2TJ1' (A9.2.3) Similarly, 1211~M2TJ2 = 1211~M2(M - 21[Ml - 212M2) - IMli12 - 2dTJ~M2(M - 211Ml
(A9.2.4) Thus, the expression in (A9.1.8) becomes ro
1((1[,/ 2 ) =
I
r= 1
2,'-1 - - t r ( t I M- 1M I r
+ 11 lVr l M 2Y +
I 1TJ'I
M l(I - 2/ 1 M - 1 M[
-2t2M-IM2)-1111 -I- 1211;M 2(I - 21 1 M - 1 M I - 2t2M-1M2)-1 112 - 21[12( tl1 - TJ2)'M[CI - 2t l l\,r - 1 M\ - 2t 2 M-- 1 M 2) - 1 M- I M 2(111 -112). (A9.2.5)
/
· '.
---.-
Appendix
A9. 1
Since $M = M_1 + M_2$, it is easy to see that the matrix $M_1 M^{-1} M_2$ is symmetric. By virtue of this property, we have

$$(t_1 M^{-1}M_1 + t_2 M^{-1}M_2)^r = \sum_{i=0}^{r}\binom{r}{i}\, t_1^i\, t_2^{r-i}\,(M^{-1}M_1)^i(M^{-1}M_2)^{r-i} \tag{A9.2.6}$$

and, for sufficiently small values of $t_1$ and $t_2$,

$$(I - 2t_1 M^{-1}M_1 - 2t_2 M^{-1}M_2)^{-1} = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} 2^{i+j}\, t_1^i\, t_2^j\binom{i+j}{i}(M^{-1}M_1)^i(M^{-1}M_2)^j. \tag{A9.2.7}$$

Substituting (A9.2.6) and (A9.2.7) into (A9.2.5) and after a little rearrangement, we find

$$K(t_1, t_2) = \sum_{r=1}^{\infty} t_1^r\, 2^{r-1}\left[\frac{1}{r}\,\mathrm{tr}\,(M^{-1}M_1)^r + \eta_1'M(M^{-1}M_1)^r\eta_1\right]$$
$$\quad + \sum_{s=1}^{\infty} t_2^s\, 2^{s-1}\left[\frac{1}{s}\,\mathrm{tr}\,(M^{-1}M_2)^s + \eta_2'M(M^{-1}M_2)^s\eta_2\right]$$
$$\quad + \sum_{r=1}^{\infty}\sum_{s=1}^{\infty} 2^{r+s-1}\, t_1^r\, t_2^s\,\frac{(r+s-2)!}{r!\,s!}\left[(r+s-1)\,\mathrm{tr}\,M^{-1}G^{rs} + (r\eta_1 + s\eta_2)'G^{rs}(r\eta_1 + s\eta_2)\right.$$
$$\left.\quad - r\eta_1'G^{rs}\eta_1 - s\eta_2'G^{rs}\eta_2\right], \tag{A9.2.8}$$

where

$$G^{rs} = M(M^{-1}M_1)^r(M^{-1}M_2)^s.$$

Upon differentiating (A9.2.8), we obtain

$$\kappa_{r0} = 2^{r-1}(r-1)!\left[\mathrm{tr}\,(M^{-1}M_1)^r + r\eta_1'M(M^{-1}M_1)^r\eta_1\right], \tag{A9.2.9}$$

$$\kappa_{0s} = 2^{s-1}(s-1)!\left[\mathrm{tr}\,(M^{-1}M_2)^s + s\eta_2'M(M^{-1}M_2)^s\eta_2\right], \tag{A9.2.10}$$

and

$$\kappa_{rs} = 2^{r+s-1}(r+s-2)!\left[(r+s-1)\,\mathrm{tr}\,M^{-1}G^{rs} + (r\eta_1 + s\eta_2)'G^{rs}(r\eta_1 + s\eta_2) - r\eta_1'G^{rs}\eta_1 - s\eta_2'G^{rs}\eta_2\right], \qquad r, s \geq 1, \tag{A9.2.11}$$

which can then be combined into the expressions given in (A9.1.9).
CHAPTER 10

TRANSFORMATION OF DATA

10.1 INTRODUCTION

In this chapter we discuss the problem of data transformation in relation to the linear model

$$y = E(y) + \varepsilon, \qquad E(y) = X\theta, \tag{10.1.1}$$

where $y$ is an $n \times 1$ vector of observations, $X$ an $n \times k$ matrix of fixed elements, $\theta$ a $k \times 1$ vector of regression coefficients, and $\varepsilon$ an $n \times 1$ vector of errors. In Section 2.7, we discussed in detail the linear model (10.1.1) under the assumption that $\varepsilon$ is distributed as $N_n(0, \sigma^2 I)$. This Normal theory linear model is of great practical importance and has been used in a wide variety of applications. These include, for example, regression analysis and the analysis of designed experiments such as $k$-way classification designs, factorials, randomized blocks, latin squares, incomplete blocks, and response surface designs. Normal theory analysis, whether from a Bayesian or sampling point of view, is attractive because of its simplicity and transparency. The possibility of data transformation greatly widens its realm of adequate validity.

In using the Normal linear model, we make assumptions not only about (i) the adequacy of the expectation function $E(y) = X\theta$ but also about the suitability of the spherical Normal error function to represent the probability distribution of $\varepsilon$. Specifically, we assume (ii) constancy of error variance from one observation to another, (iii) Normality of the distributions of the observations, and (iv) independence of these distributions.† The last assumption (iv) is perhaps the most important of all. Its violation leads to dramatic consequences. Its relaxation leads to the consideration of a whole new range of important time series and dynamic models, which, however, are not treated here. Nevertheless, at least for data generated by designed experiments, the independence assumption is one whose applicability can to some extent be assured by the physical conduct of the experiment, and we shall here suppose it to be valid. Assumptions (i), (ii), and (iii) are usually not under the experimenter's control, but it happens rather frequently that, although these assumptions are not true for the observations in the original metric, they are reasonably well satisfied when the observations are suitably transformed. We now consider why this is so.

† We are, of course, also assuming that there are no "aberrant" observations. In practice, due to possible anomalies in the experimental setup, some of the observations might have been generated, not from the postulated model, but from an alternative model which has a large bias or a much larger variance. For a Bayesian analysis of this problem, see Box and Tiao (1968b).
10.1.1 A Factorial Experiment on Grinding

Consider an experiment on the grinding of particles which are approximately spherical. Suppose that observations on the final size of the particles were taken after standard material had been fed to three different grinding machines for four different grinding periods, using a 3 × 4 factorial design, and that the primary purpose of the experiment was to determine how different machines and different grinding periods affected the final size of the particles. Now suppose it happened to be true that if "size" were measured by particle radius y, then, to an adequate approximation,† the Normal linear model (10.1.1) could represent the data and the effects of machines and of periods of grinding would behave additively. Specifically, suppose that, if "particle size" was measured in terms of radius, then, to a sufficient approximation, a) there would be no interaction between machines and grinding periods, b) the error variances would be constant, and c) the error distribution would be Normal. Now it might be inconvenient to measure particle size in terms of the radius y and it might not occur to the investigator to do so. Instead, he might measure the area of the circular section of the particle seen under the microscope (proportional to y²) or he might weigh the particle and so effectively work with the volume (proportional to y³). In either case he would probably report his results in terms of what was actually measured. But if, in terms of the y's, the effects were additive and the errors spherically Normal, this would certainly not be true, for example, of the effects and errors in terms of the y²'s. Specifically, suppose that

† Since the particle radius could not be negative, the Normal assumption could not be true exactly.
y_ij = θ_i + φ_j + ε_ij,    i = 1, 2, 3;  j = 1, …, 4,    (10.1.2)

where E(y_ij) = θ_i + φ_j is the average response for the ith machine and the jth period of time, and ε_ij is distributed as N(0, σ²). Then, for the particle size measured in terms of area, we should have the model

y²_ij = E(y²_ij) + e_ij,    (10.1.3)

where

E(y²_ij) = (θ_i + φ_j)² + σ² = θ_i² + φ_j² + 2θ_iφ_j + σ²,    (10.1.4)

and

e_ij = 2(θ_i + φ_j)ε_ij + (ε²_ij − σ²),    (10.1.5)

with

E(e_ij) = 0,    Var(e_ij) = 4(θ_i + φ_j)²σ² + 2σ⁴.    (10.1.6)

We note that for the response y²_ij:

a) a nonadditive (interaction) term 2θ_iφ_j now appears in the expectation function E(y²_ij) in (10.1.4),

b) the error e_ij in (10.1.5) has a variance which is not constant but depends directly on E(y_ij) = θ_i + φ_j, and

c) since ε_ij has a Normal distribution, the distribution of e_ij must certainly be non-Normal.

Now in spite of this it is nevertheless true that, if the investigator made an appropriate transformation (in this case the square root transformation) of his "particle area" data, the simple Normal linear theory analysis would apply. In general, there is often surprisingly little to be said in favour of the original metric in which data happen to be taken. For example, temperature can be measured in terms of °C when it is related to the expansion of a thread of mercury in a thermometer. However it could equally well be measured in terms of molecular velocity V ∝ (T + 273)^{1/2}, which would be of more immediate relevance in some contexts. Bearing these arguments in mind, it is perhaps not surprising that numerous examples occur where the assumptions underlying the Normal theory linear model, although not true for observations in the original metric, provide an adequate representation after suitable transformation. In some instances appropriate transformations have been arrived at from theoretical considerations alone, in some cases by analysis of the data alone, and in some instances by a mixture of both. We shall here consider estimation of transformations from the data. It is not of course suggested that interaction, heterogeneity of variance and non-Normality can always be eliminated by suitable transformation of the data. In fact, cases of interaction, inhomogeneity of variance and non-Normality can each be divided into two classes: transformable and non-transformable. In the first class the phenomena of interaction, of variance inequality or of non-Normality are anomalies arising only because of unsuitable choice of metric. In the second class the phenomenon is of more fundamental character which transformation cannot eliminate. Finally it will not necessarily be true that the same transformation of the data will simultaneously eliminate all discrepancies from assumption. Cases can arise where, for example, approximate additivity can be achieved by a particular transformation but not simultaneously with variance homogeneity.
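The mechanics of (10.1.2)-(10.1.6) are easy to verify numerically. The following sketch is ours, not part of the original text; the machine effects θ, period effects φ, and error standard deviation σ are arbitrary illustrative values. It simulates an additive Normal model for the radius and checks that the error of the squared response has mean zero and the variance given in (10.1.6):

```python
# A minimal simulation sketch of the grinding example; theta, phi and sigma
# are arbitrary assumed values, not taken from any real experiment.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([5.0, 7.0, 9.0])            # machine effects theta_i
phi = np.array([1.0, 2.0, 3.0, 4.0])         # grinding-period effects phi_j
sigma = 0.2

mu = theta[:, None] + phi[None, :]           # additive expectation of the radius
y = mu + sigma * rng.standard_normal((100000, 3, 4))   # replicated radii

e = y**2 - (mu**2 + sigma**2)                # error of the squared response, (10.1.5)
print(np.abs(e.mean(axis=0)).max())          # approximately 0, as in (10.1.6)
print(e.var(axis=0).round(2))                # variance grows with mu ...
print((4 * mu**2 * sigma**2 + 2 * sigma**4).round(2))  # ... matching (10.1.6)
```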
In summary, we should like to employ the model (10.1.1) with

a) the expectation function E(y) = Xθ having the simplest possible form,

b) the individual errors ε_i, i = 1, …, n, having the same variance σ², and

c) the individual errors ε_i independently and Normally distributed.

If a measurement of y does not possess these properties, a suitable nonlinear transformation of y may improve the situation. It is often the case that the logarithm, the reciprocal, the square root, or some other transformation of y will make possible an analysis which strains assumptions less and in terms of which a simpler representation is possible.

10.2 ANALYSIS OF THE BIOLOGICAL AND THE TEXTILE DATA
We now introduce two sets of data and consider in some detail the question of choice of appropriate models, following joint work with D. R. Cox.
10.2.1 Some Biological Data

Table 10.2.1 shows the survival times of n = 48 animals exposed to three different poisons and subject to four different treatments. The experiment was set out in a 3 × 4 factorial design with a fourfold replication (four animals per group). We shall analyze the data using the linear expectation function E(y) = Xθ and we now discuss precisely what this involves and the justification for doing so.

Table 10.2.1 Survival time (in 10-hr units) of animals in a 3 × 4 factorial experiment

                                      Treatment
Poison    A                      B                      C                      D
  I       0.31 0.45 0.46 0.43    0.82 1.10 0.88 0.72    0.43 0.45 0.63 0.76    0.45 0.71 0.66 0.62
  II      0.36 0.29 0.40 0.23    0.92 0.61 0.49 1.24    0.44 0.35 0.31 0.40    0.56 1.02 0.71 0.38
  III     0.22 0.21 0.18 0.23    0.30 0.37 0.38 0.29    0.23 0.25 0.24 0.22    0.30 0.36 0.31 0.33
The use of the linear expectation function E(y) = Xθ for this biological data and for most other examples is an admitted approximation to the truth. In most cases there undoubtedly exists some true underlying functional relationship

E(y) = f(ξ, φ),    (10.2.1)

which is probably nonlinear in the parameters φ and which describes in mechanistic terms how E(y) is affected by some set of basic input variables ξ, involving some set of parameters φ. We have tacitly decided to replace it with the empirical linear model (10.1.1). This may be either because, in the present state of the art, this "true" relationship is unknown and would be too difficult to find out, or because the known mechanism is too complicated to use. For this biological example, it might be possible to write differential equations which represented the absorption of poison into the blood stream of the animals and to represent mechanistically by other equations the effect of the treatments. If this could be done, one might run appropriate experiments to test the model and to obtain estimates of the basic constants it contained. In the absence of such fundamental knowledge, we would have to set our sights lower. We could, for example, simply set out to estimate the "effects" of the treatments in terms of the increase in survival time they produced. This could be done in terms of a purely empirical linear model. Specifically, we might postulate that, associated with the tth treatment (t = 1, 2, 3, 4) and the pth poison (p = 1, 2, 3), there was a mean survival time θ_tp. If we were principally interested in differences associated with the various poison-treatment combinations, we could write the model in the form
E(y_i) = θ₀x₀ᵢ + (θ₁₁ − θ₀)x₁₁ᵢ + (θ₁₂ − θ₀)x₁₂ᵢ + ⋯ + (θ₄₃ − θ₀)x₄₃ᵢ,    i = 1, …, 48,    (10.2.2)

where x₀ᵢ = 1 and the indicator variable x_tpi is 1 if the ith animal received the pth poison and the tth treatment, and zero otherwise. One way of dealing with the problem that the resulting 48 × 13 derivative matrix would not be of full rank would be to omit, say, the final term (θ₄₃ − θ₀)x₄₃ᵢ. The resulting empirical model, which would be of the form of (10.1.1), would contain 12 functionally independent parameters. In some circumstances it would be reasonable to expect that a simpler empirical model containing fewer parameters could be found. In particular, it might be true, to an adequate approximation, that the effects of poisons and treatments were additive. Specifically, if θ_t. − θ₀ was the mean change in survival time produced by the tth treatment and θ._p − θ₀ the mean change produced by the pth poison, then the model could be written

E(y_i) = θ₀x₀ᵢ + (θ₁. − θ₀)u₁ᵢ + ⋯ + (θ₄. − θ₀)u₄ᵢ + (θ.₁ − θ₀)w₁ᵢ + ⋯ + (θ.₃ − θ₀)w₃ᵢ,    (10.2.3)
where u_ti is 1 or 0 depending on whether the ith animal had or did not have the tth treatment, and w_pi is 1 or 0 depending on whether the ith animal had the pth poison. Again we could omit, say, (θ₄. − θ₀)u₄ᵢ and (θ.₃ − θ₀)w₃ᵢ to retain an indicator matrix X of full rank. We shall call the general model in (10.2.2) the interaction model and the more specialized model in (10.2.3) the additive model. It is often convenient to write the more general interaction model in the form of the additive model with additional terms specifically carrying parameter combinations which measure interaction. For the present example, we could write (10.2.2) alternatively as

E(y_i) = θ₀x₀ᵢ + Σ_{t=1}^{3} (θ_t. − θ₀)u_ti + Σ_{p=1}^{2} (θ._p − θ₀)w_pi + Σ_{t=1}^{3} Σ_{p=1}^{2} (θ_tp − θ_t. − θ._p + θ₀)x_tpi.    (10.2.4)

By omitting the last summation containing the six independent interaction parameters, we would have the nonsingular form of the additive model of (10.2.3). Whereas the interaction model contains twelve functionally independent parameters, the additive model contains only six. In the interest of simplicity we naturally would wish to use the additive model if this were adequate. While the additive model might not be appropriate in the original metric, it might become so if a suitable data transformation were employed. Representation in terms of the smallest possible number of parameters we call "parsimonious parametrization." Clearly if by, say, a power transformation from y to y^λ we could validate a simple additive model, we would have served the interest of parsimony. That is, by including one extra parameter λ we would have made it possible to eliminate six interaction parameters. We now consider Normality and constancy of variance. Inspection of the sample variances in the 12 cells of Table 10.2.1 shows that they differ very markedly and that they tend to increase as the cell mean increases. It might be that a transformation could make it more plausible that population variances were equal. Again, while one would not expect on this limited amount of data that there would be a great deal of information about its Normality or otherwise, it might very well be that some transformation of the data could improve the Normality of the distributions of the errors.
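As a computational footnote (our sketch, not part of the original discussion), the indicator parametrizations above are straightforward to construct. The layout of treatment and poison labels below is one convenient ordering of the 48 animals, not necessarily the order of Table 10.2.1; it builds the six-parameter additive matrix of (10.2.3) and the twelve-parameter interaction matrix of (10.2.4):

```python
# Sketch: full-rank additive (48 x 6) and interaction (48 x 12) design
# matrices for the 3 x 4 factorial with four replicates per cell.
import numpy as np

treat = np.repeat(np.arange(4), 12)                  # t = 0..3, 12 animals each
poison = np.tile(np.repeat(np.arange(3), 4), 4)      # p = 0..2 within each treatment

def additive_X(treat, poison):
    n = len(treat)
    X = [np.ones(n)]                                 # x0 = 1
    X += [(treat == t).astype(float) for t in range(3)]   # omit t = 3
    X += [(poison == p).astype(float) for p in range(2)]  # omit p = 2
    return np.column_stack(X)                        # 48 x 6, full rank

def interaction_X(treat, poison):
    Xa = additive_X(treat, poison)
    cross = [((treat == t) & (poison == p)).astype(float)
             for t in range(3) for p in range(2)]    # six interaction columns
    return np.column_stack([Xa] + cross)             # 48 x 12, full rank

print(np.linalg.matrix_rank(additive_X(treat, poison)),
      np.linalg.matrix_rank(interaction_X(treat, poison)))   # prints: 6 12
```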
10.2.2 The Textile Data

As a further example, we consider the data of Table 10.2.2. These data came originally from an unpublished report to the Technical Committee, International Wool Textile Organization, in which Drs. A. Barella and A. Sust described some experiments on the behaviour of worsted yarn under cycles of repeated loading.
Table 10.2.2 gives the numbers of cycles to failure, y, obtained in a 3³ experiment in which the factors were

ξ₁: length of test specimen (250, 300, 350 mm),
ξ₂: amplitude of loading cycle (8, 9, 10 mm),
ξ₃: load (40, 45, 50 g).

Table 10.2.2 Cycles to failure of worsted yarn in a 3³ factorial experiment

 x₁  x₂  x₃      y        x₁  x₂  x₃      y        x₁  x₂  x₃      y
 −1  −1  −1     674        0  −1  −1   1,414       +1  −1  −1   3,636
 −1  −1   0     370        0  −1   0   1,198       +1  −1   0   3,184
 −1  −1  +1     292        0  −1  +1     634       +1  −1  +1   2,000
 −1   0  −1     338        0   0  −1   1,022       +1   0  −1   1,568
 −1   0   0     266        0   0   0     620       +1   0   0   1,070
 −1   0  +1     210        0   0  +1     438       +1   0  +1     566
 −1  +1  −1     170        0  +1  −1     442       +1  +1  −1   1,140
 −1  +1   0     118        0  +1   0     332       +1  +1   0     884
 −1  +1  +1      90        0  +1  +1     220       +1  +1  +1     360

where x₁ = (ξ₁ − 300)/50, x₂ = ξ₂ − 9, and x₃ = (ξ₃ − 45)/5.
Although the form E(y) = f(ξ, φ) was unknown, one might hope to locally represent this function by expanding it about the average levels ξ₀ in a Taylor series. We would then have

E(y) = f₀ + Σ_{t=1}^{3} f_t (ξ_t − ξ_{t0}) + Σ_{t=1}^{3} Σ_{j=1}^{3} f_{tj} (ξ_t − ξ_{t0})(ξ_j − ξ_{j0}) + higher-order terms,    (10.2.5)

where

f_t = ∂f/∂ξ_t evaluated at ξ = ξ₀.

If it was assumed, for example, that third- and higher-order terms could be neglected, then we could employ for the expectation function the second-degree polynomial in ξ₁, ξ₂, ξ₃, which could be written in the form

E(y) = θ₀x₀ + Σ_{t=1}^{3} θ_t x_t + Σ_{t=1}^{3} Σ_{j=t}^{3} θ_{tj} x_t x_j,    (10.2.6)

where x₀ = 1 and the coefficients θ_t and θ_tj are proportional to the corresponding derivatives f_t and f_tj. The expectation of the ith observation y_i would then be linear in the parameters (θ₀, θ₁, θ₂, θ₃, θ₁₁, θ₁₂, θ₁₃, θ₂₂, θ₂₃, θ₃₃) and depend on the values of (x₁, x₂, x₃) as set out in Table 10.2.2. It seems highly probable that the full second-degree polynomial model containing ten terms could represent the data fairly well. However, inspection of the data shows that the response appears to be monotonic in x₁, x₂ and x₃. It is possible, therefore, that after some suitable data transformation a simpler model linear in ξ = (ξ₁, ξ₂, ξ₃), omitting the six second-degree terms (θ₁₁, θ₂₂, θ₃₃, θ₁₂, θ₁₃, θ₂₃), might be suitable, and improvement in variance homogeneity and in Normality of the error distribution might be achieved also. In summary, the task of the data analyst is to make explicit the model which underlies a given body of data. In attempting to relate data to any kind of model the analyst runs two kinds of risks. Obviously, he may force the data to a model which is inadequate. Alternatively, he may so encumber the model with unnecessary parameters that the corresponding analysis is too complex to be worthy of the name. A complex analysis is justifiable when the phenomenon itself causes the complexity. Often complexity is introduced unnecessarily.
For example, we shall show that the textile data are better represented by a transformed model linear in the experimental variables ξ and containing only six parameters (θ₀, θ₁, θ₂, θ₃, σ, λ) than they are by a model in which an untransformed y is represented as a quadratic function of ξ containing eleven parameters (θ₀, θ₁, θ₂, θ₃, θ₁₁, θ₂₂, θ₃₃, θ₁₂, θ₁₃, θ₂₃, σ). In this situation we can say that the complexity has not arisen from the physical situation but rather from failure to work with the appropriate metric. Again, in the biological example, it will be shown that the use of a simple transformation avoids the necessity for interaction parameters and also for postulating different variances in the groups. The analyst's problem is how to allow diversity without producing chaos. The principle of parsimony, aptly christened by Tukey (1961), says that parameters should be introduced sparingly and in such a way that the maximum amount of resolution is achieved for each parameter introduced. Parsimonious parametrization is frequently made possible by suitable data transformation.
10.3 ESTIMATION OF THE TRANSFORMATION

We shall use the notation y^{(λ)} to define a parametric family of nonlinear transformations of y, where λ may be a scalar or a vector containing r elements, and where y^{(λ)} is a monotonic function of y over the admissible range. Two families of particular usefulness are

y^{(λ)} = (y^λ − 1)/λ   (λ ≠ 0);    y^{(λ)} = log y   (λ = 0),    y > 0,    (10.3.1)

and

y^{(λ)} = [(y + λ₂)^{λ₁} − 1]/λ₁   (λ₁ ≠ 0);    y^{(λ)} = log (y + λ₂)   (λ₁ = 0),    y > −λ₂,    (10.3.2)

which are employed in these forms rather than as y^λ and (y + λ₂)^{λ₁} because they are then continuous at λ = 0 and λ₁ = 0, respectively. This class of transformations includes as special cases the logarithmic transformation, the reciprocal transformation, and the square root transformation. Now suppose that a suitable family of transformations y^{(λ)} has been selected and assume that, for some chosen λ, to a sufficient approximation,

y^{(λ)} = Xθ + ε,    (10.3.3)

where y^{(λ)} = (y₁^{(λ)}, …, y_n^{(λ)})′, y = (y₁, …, y_n)′ is an n × 1 vector of observations, X is an n × k matrix of fixed elements having rank k, the first column of which consists of n ones, θ is a k × 1 vector of parameters, and ε is an n × 1 vector of errors having the spherical Normal distribution N_n(0, σ²I).
10.3.1 Prior and Posterior Distributions of λ

The probability density function of the original untransformed observations y is

p(y | λ, θ, σ²) = (2π)^{-n/2} σ^{-n} exp[−(y^{(λ)} − Xθ)′(y^{(λ)} − Xθ)/(2σ²)] J(λ; y),    −∞ < y^{(λ)} < ∞,    (10.3.4)

where the Jacobian J(λ; y) is

J(λ; y) = Π_{i=1}^{n} |dy_i^{(λ)}/dy_i|.    (10.3.5)

Writing

θ̂_λ = (X′X)⁻¹X′y^{(λ)}    with    S_λ = (y^{(λ)} − Xθ̂_λ)′(y^{(λ)} − Xθ̂_λ),    (10.3.6)

the joint posterior distribution is

p(θ, log σ, λ | y) ∝ σ^{-n} exp{−[S_λ + (θ − θ̂_λ)′X′X(θ − θ̂_λ)]/(2σ²)} J(λ; y) p(θ, log σ, λ),
    −∞ < θ < ∞,   −∞ < log σ < ∞,   −∞ < λ < ∞,    (10.3.7)

where p(θ, log σ, λ) is the prior distribution.
Choice of the Prior Distribution p(θ, log σ, λ)

We now consider what to choose for the prior p(θ, log σ, λ). Writing

p(θ, log σ, λ) = p(λ) p(θ, log σ | λ),    (10.3.8)

we shall suppose that, for any specific λ, p(θ, log σ | λ) is locally uniform. On the other hand, p(θ, log σ | λ) must clearly be dependent on λ, for a change in λ will magnify or diminish all the data and will hence change the value of this locally uniform density. We conclude then that

p(θ, log σ | λ) ∝ g(λ).    (10.3.9)
To decide the form of the function g(λ), we employ the following argument. Let us assume that over the range of the data we can write approximately

y_i^{(λ)} ≐ a_λ + l_λ y_i,    i = 1, 2, …, n,    (10.3.10)

where l_λ is some representative value of the gradient dy^{(λ)}/dy. Taking expectations, we have

E(y_i^{(λ)}) ≐ a_λ + l_λ E(y_i).    (10.3.11)

Since

θ = (X′X)⁻¹X′E[y^{(λ)}],    (10.3.12)
it follows that

θ ≐ a_λ e₁ + l_λ θ_y,    σ ≐ |l_λ| σ_y,    (10.3.13)-(10.3.15)

where e₁ = (1, 0, …, 0)′ and θ_y = (X′X)⁻¹X′E(y). Thus, except for a change in location, the effect of changing λ is to multiply (θ, σ) by the scale factor l_λ, and for the locally uniform prior of (θ, log σ) to remain consistent under such changes of scale we must take

p(θ, log σ | λ) ∝ |l_λ|^{-k}.    (10.3.16)

The value we shall use for |l_λ| is the geometric mean of the absolute values of the derivatives dy^{(λ)}/dy at the actual data points, that is,

|l_λ| = [Π_{i=1}^{n} |dy_i^{(λ)}/dy_i|]^{1/n} = J(λ; y)^{1/n} = J_λ^{1/n}.    (10.3.17)

Thus,

p(θ, log σ | λ) ∝ g(λ) ∝ J_λ^{-k/n},    (10.3.18)

and finally

p(θ, log σ, λ) ∝ J_λ^{-k/n} p(λ).    (10.3.19)
Posterior Distribution of λ

With this prior distribution, the joint posterior distribution for θ, log σ, and λ is

p(θ, log σ, λ | y) ∝ σ^{-n} exp{−[S_λ + (θ − θ̂_λ)′X′X(θ − θ̂_λ)]/(2σ²)} J_λ^{(n−k)/n} p(λ),
    −∞ < θ < ∞,   −∞ < log σ < ∞,   −∞ < λ < ∞,    (10.3.20)

where it should be noted that J_λ involves the observations but not the parameters θ and σ. Integrating out θ and log σ, we obtain

p(λ | y) ∝ p_u(λ | y) p(λ),    −∞ < λ < ∞,    (10.3.21)

where p_u(λ | y) ∝ (S_λ J_λ^{-2/n})^{-(n−k)/2} is the posterior distribution for a uniform reference prior for λ. Alternatively, if we work with the "normalized" data

z^{(λ)} = y^{(λ)}/J_λ^{1/n},    (10.3.22)

then, writing S(λ, z) = (z^{(λ)} − ẑ^{(λ)})′(z^{(λ)} − ẑ^{(λ)}), where ẑ^{(λ)} = X(X′X)⁻¹X′z^{(λ)} and z^{(λ)} = (z₁^{(λ)}, …, z_n^{(λ)})′, we have

p_u(λ | y) ∝ [S(λ, z)]^{-(n−k)/2},    −∞ < λ < ∞.    (10.3.23)
In practice, in choosing a transformation that may hopefully result in parsimonious parametrization, constancy of variance, and Normality, we can compute the residual sum of squares S(λ, z) for a suitable range of values of λ, and hence [S(λ, z)]^{-(n−k)/2}. Normalization of [S(λ, z)]^{-(n−k)/2} will then produce p_u(λ | y). If desired, p_u(λ | y) may then be combined with any prior weight function p(λ). It should be noted that, by regarding J_λ as a constant, (10.3.22) implies that the logarithm of the standard deviation of z^{(λ)} can be written

log σ(z^{(λ)}) ≐ −(1/n) log J_λ + log σ.    (10.3.24)

Using the approximation (10.3.15) with |l_λ| = J_λ^{1/n}, we have

log σ ≐ (1/n) log J_λ + log σ_y.    (10.3.25)

It follows that

log σ(z^{(λ)}) ≐ log σ_y,    (10.3.26)

so that, to the degree of approximation (10.3.15), the standard deviation of z^{(λ)} is the same for any λ.
10.3.2 The Simple Power Transformation

In practice, the power transformation in (10.3.1) is of particular interest. In the z form we have, with λ a scalar,

z^{(λ)} = (y^λ − 1)/(λ ẏ^{λ−1})   (λ ≠ 0);    z^{(λ)} = ẏ log y   (λ = 0),    (10.3.27)

where ẏ = (Π_{i=1}^{n} y_i)^{1/n} is the geometric mean of the data. In obtaining the posterior distribution of λ, the chief labor is in calculating the residual sum of squares S(λ, z) for a range of values of λ. Computer programs are now rather generally available for analyzing linear models and in particular for producing, via an analysis of variance table, the residual sum of squares. Writing w = y/ẏ, we have for the power transformations

z^{(λ)} = ẏ λ⁻¹ w^λ + c_λ   (λ ≠ 0);    z^{(λ)} = ẏ log w + c_λ   (λ = 0),    (10.3.28)

where c_λ is a constant depending on λ. Now the size of the constant c_λ will have no effect on any element in the analysis of variance table except the "correction for the mean." Consequently, when the simple power transformations are being considered, S(λ, z) may be obtained simply by computing w_i = y_i/ẏ, i = 1, 2, …, n, and performing an analysis of variance on the variates

ẏ λ⁻¹ w^λ   (λ ≠ 0);    ẏ log w   (λ = 0)    (10.3.29)

(note that ẏ λ⁻¹ w^λ = y when λ = 1). So far as the calculation of p_u(λ | y) is concerned, we can work equally well with the variates λ⁻¹w^λ and log w, since the fixed multiplicative constant ẏ will have no effect on p_u(λ | y), which is in any case normalized so that ∫ p_u(λ | y) dλ = 1.
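The computation just described is easily mechanized. The following sketch is ours (the function names are arbitrary): it evaluates the z form (10.3.27), obtains S(λ, z) as the residual sum of squares after fitting a given design matrix X, and normalizes [S(λ, z)]^{-(n−k)/2} over a grid of λ by numerical integration, as in (10.3.23):

```python
# Sketch of the grid computation of p_u(lambda | y); assumes y > 0 and X of
# full column rank with a leading column of ones.
import numpy as np

def z_form(y, lam):
    """z transform (10.3.27) of the simple power family."""
    gm = np.exp(np.log(y).mean())            # geometric mean of the data
    if lam == 0.0:
        return gm * np.log(y)
    return (y**lam - 1.0) / (lam * gm**(lam - 1.0))

def rss(v, X):
    """Residual sum of squares of v after a least-squares fit on X."""
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    r = v - X @ beta
    return float(r @ r)

def posterior_lambda(y, X, grid):
    """Normalized p_u(lambda | y) of (10.3.23) on the given grid."""
    n, k = X.shape
    S = np.array([rss(z_form(y, lam), X) for lam in grid])
    dens = S ** (-(n - k) / 2.0)
    return dens / np.trapz(dens, grid)       # integrates to one over the grid
```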
10.3.3 The Biological Example For the biological data in Table 10.2.1 we would like, if possible, to find a transformation for which: a) the "six parameter" additive model rather than the "twelve parameter" interaction model is applicable. b) the cell variances are equal and c) the errors are Normal. These properties seem unlikely to apply for the original data. In particular simple plotting demonstrates a strong tendency for the within-cell variances 1.0 increase as the within-cell means increase. Furthermore, the analysis of variance in Table 10.3.1 for the untransformed data suggests the possibility of interaction. Interpreting this table from the Bayes point of view discussed in Section 2.7, we can say that in the space of the six interaction parameters the point (0,0,0,0,0,0) barely lies within the 90 per cent H .P. D. region . (The appropriate mean square ratio is 41.7/22.2 = 1.88 while the 10 per cent point f6r F with 6 and 36 degrees of freedom is 1.95.)
Table 10.3.1 Analyses of variance of the biological data Mean squares x 1000 Degrees of freedom Poisons Treatments PxT 'within groups
2
3 6 36
Untransformed
Degrees of freedom
Reciprocal transformation (z form)
516.3 307.1 41.7 22.2
2 3 6 36(35)
568.7 221.9 8.5 7.8(8.0)
Since it is hoped that after transformation the additive model will be adequate, the appropriate residual sum of squares to use in calculating p_u(λ | y) is that obtained after fitting the additive model and is based on 42 degrees of freedom. We denote it by S₄₂(λ, z).

Table 10.3.2 Values of S₄₂(λ, z) and of p_u(λ | y) over a range of λ where the density is appreciable: the biological data

   λ      S₄₂(λ, z)         λ      p_u(λ | y)
  1.0      1.0509           0.0       0.01
  0.5      0.6345          −0.1       0.02
  0.0      0.4239          −0.2       0.08
 −0.2      0.3752          −0.3       0.26
 −0.4      0.3431          −0.4       0.49
 −0.6      0.3258          −0.5       0.94
 −0.8      0.3225          −0.6       1.46
 −1.0      0.3331          −0.7       1.82
 −1.2      0.3586          −0.8       1.82
 −1.4      0.4007          −0.9       1.42
 −1.6      0.4625          −1.0       0.92
 −2.0      0.6639          −1.1       0.47
 −2.5      1.1331          −1.2       0.19
 −3.0      2.0489          −1.3       0.07
                           −1.5       0.01
Table 10.3.2 shows values of S₄₂(λ, z) and of the resulting p_u(λ | y) over a range of λ in which the density is appreciable. Upon normalizing [S₄₂(λ, z)]^{-21} by numerical integration, we find

p_u(λ | y) = c [S₄₂(λ, z)]^{-21},    −∞ < λ < ∞,    (10.3.30)

where c is the normalizing constant. The posterior distribution p_u(λ | y) shown in Fig. 10.3.1 is approximately Normal with mean −0.75 and standard deviation 0.22. The 95 per cent H.P.D. interval extends from about −1.18 to about −0.32. We notice in particular that λ = 1 (no transformation) and λ = 0 (log transformation) are untenable for these data. On the other hand, the reciprocal y⁻¹, which has a natural appeal for the analysis of mortality time data because it can be interpreted as the "rate of dying," possesses most of the advantages obtainable from transformation. A plot of within-cell sample variances against within-cell means for the transformed data now shows no dependence. Furthermore there is now no suspicion of lack of additivity. The analysis of variance table for the untransformed data and for the reciprocal transformation (in the z form) is shown in Table 10.3.1.
Fig. 10.3.1 The posterior distribution p_u(λ | y) for the biological data. (Arrows show approximate 95 per cent H.P.D. interval.)
The Normal assumptions are, of course, much more appropriate for the transformed data and this results in a much greater sensitivity of the analysis. The within-groups mean square is reduced to about a third of its previous value relative to the poison and treatment mean squares, which remain about the same size. This implies, for example, that the spread of the individual marginal posterior distributions of the effects will be reduced by a factor of about √3 when transformed back to the original metric.
10.3.4 The Textile Example

Consider now the textile data of Table 10.2.2. In the original analysis, a second-degree polynomial in x₁, x₂ and x₃ was employed to represent E(y). However, since inspection of the data suggests that y is monotonic in x₁, x₂ and x₃, there is a possibility that a first-degree polynomial might provide adequate representation for E(y^{(λ)}), with λ suitably chosen. We therefore work with the transformed variate (10.3.27), hoping that after such transformation:

a) the expected value of the transformed response can be adequately represented by a model linear in the x's,
b) the error variance will be constant, and
c) the observations will be Normally distributed.

The appropriate residual sum of squares to be considered is that after fitting a linear function of the x's. The sum of squares has (27 − 4) = 23 degrees
Table 10.3.3 Values of S₂₃(λ, z) and of p_u(λ | y) over a range of λ where the density is appreciable: the textile data

   λ       S₂₃(λ, z)          λ       p_u(λ | y)
  1.00      5.4810            0.20       0.02
  0.80      2.9978            0.15       0.09
  0.60      1.5968            0.10       0.42
  0.40      0.8178            0.05       1.58
  0.20      0.4115            0.00       4.18
  0.00      0.2519           −0.05       5.64
 −0.20      0.2920           −0.10       4.66
 −0.40      0.5378           −0.15       2.36
 −0.60      1.1035           −0.20       0.77
 −0.80      2.1396           −0.25       0.19
 −1.00      3.9955           −0.30       0.04
                             −0.35       0.01
of freedom. Table 10.3.3 shows the values of S₂₃(λ, z) and of the resulting p_u(λ | y) over a range of λ in which the density is appreciable. Upon normalizing [S₂₃(λ, z)]^{-11.5} by numerical integration, we find

p_u(λ | y) = c [S₂₃(λ, z)]^{-11.5},    −∞ < λ < ∞,    (10.3.31)

where c is the normalizing constant. The distribution p_u(λ | y) is plotted in Fig. 10.3.2. It has its mean at −0.06, and the 95 per cent H.P.D. region extends from −0.20 to 0.08, determining the appropriate transformation very closely. The log transformation (λ = 0) has considerable practical advantages for this example and is strongly supported by the data. The analysis of variance for the untransformed data and for the logarithmically transformed data taken in the z form is shown in Table 10.3.4.

Table 10.3.4 Analyses of variance of the textile data

                                  Mean squares × 1000
              Degrees of                       Degrees of    Logarithmic transformation
              freedom        Untransformed     freedom       (z form)
Linear            3            4,916.2             3             2,374.4
Quadratic         6              704.1             6                 8.1
Residual         17               73.9            17(16)            11.9(12.6)
It is seen that transformation eliminates the need for second-order terms in the equation, while the use of the more appropriate model greatly increases the sensitivity of the analysis.
Fig. 10.3.2 The posterior distribution p_u(λ | y) for the textile data. (Arrows show approximate 95 per cent H.P.D. interval.)
10.3.5 Two Parameter Transformation

To illustrate the estimation of a two-parameter transformation, we apply to the textile data the transformation (10.3.2), which, writing gm for "geometric mean," can be written in the z form (10.3.22) as

z^{(λ)} = [(y + λ₂)^{λ₁} − 1]/{λ₁ [gm (y + λ₂)]^{λ₁−1}}   (λ₁ ≠ 0);    z^{(λ)} = gm (y + λ₂) log (y + λ₂)   (λ₁ = 0),    (10.3.32)

where gm (y + λ₂) = [Π_{i=1}^{n} (y_i + λ₂)]^{1/n}. Figure 10.3.3 shows the contours of the joint posterior distribution p_u(λ₁, λ₂ | y) obtained from a 7 × 11 grid of values of

log p_u(λ₁, λ₂ | y) = const − (11.5) log S(λ₁, λ₂, z).    (10.3.33)
Fig. 10.3.3 Contours of the posterior distribution p_u(λ₁, λ₂ | y) for the transformation (y + λ₂)^{λ₁}: the textile data. (Contours labeled enclose approximate H.P.D. regions.)
As before, if the r-dimensional posterior distribution p(λ | y) were multivariate Normal, then a (1 − α) H.P.D. region for the parameters λ would be defined by

log p(λ̂ | y) − log p(λ | y) < ½ χ²(r, α).    (10.3.34)

Although, for this example, the contours show that the Normal approximation is likely to be a poor one, we have nevertheless, as a rough guide, labeled H.P.D. regions with per cent probabilities using (10.3.34). It is evident that, for this example, there is likely to be no particular advantage in including a nonzero value for λ₂.

10.4 ANALYSIS OF THE EFFECTS θ AFTER TRANSFORMATION
So far we have considered only the marginal posterior distribution p(λ | y). Although there will be some circumstances in which inferences about λ will be the major objective, in many cases the chief interest of the analysis would be in making inferences about θ. The status of λ will then be that of a (very valuable) vector of nuisance parameters. As usual, we can eliminate λ by integrating p(θ | λ, y) with weight function p(λ | y) in accordance with

p(θ | y) = ∫ p(θ, λ | y) dλ = ∫ p(θ | λ, y) p(λ | y) dλ.    (10.4.1)

Again in the spirit of Section 1.6, we should be cautious in practical problems of relying solely on this integration, for if p(θ | λ, y) were changing drastically over the range in which p(λ | y) was appreciable, it would be important to know about it. Thus, in principle, we should always make a study of p(θ | λ, y) for a number of values of λ which are of interest.
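Numerically, (10.4.1) is just a weighted average of conditional densities over a grid of λ. A minimal sketch (ours; cond_pdf stands for whatever conditional density p(θ_j | λ, y) is of interest, evaluated pointwise, and lam_dens could come from posterior_lambda above):

```python
# Mixture form of (10.4.1) for a scalar quantity of interest.
import numpy as np

def marginal_density(theta_vals, cond_pdf, grid, lam_dens):
    """p(theta | y) = integral of p(theta | lambda, y) p(lambda | y) d lambda."""
    w = lam_dens / np.trapz(lam_dens, grid)          # normalized p(lambda | y)
    cond = np.array([cond_pdf(theta_vals, lam) for lam in grid])
    return np.trapz(w[:, None] * cond, grid, axis=0) # integrate over lambda
```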
From the distribution in (10.3.20) with p(λ) assumed uniform, the joint distribution p(θ, λ | y) in z^{(λ)} form is

p(θ, λ | y) ∝ J_λ^{-k/n} [S(λ, z) + Q(θ, λ)]^{-n/2},    −∞ < θ < ∞,   −∞ < λ < ∞,    (10.4.2)

where

Q(θ, λ) = J_λ^{-2/n} (θ − θ̂_λ)′X′X(θ − θ̂_λ).

Thus, the conditional distribution p(θ | λ, y) is

p(θ | λ, y) ∝ [1 + Q(θ, λ)/(ν s²(λ))]^{-(ν+k)/2},    −∞ < θ < ∞,    (10.4.3)

where ν = n − k and ν s²(λ) = S(λ, z), which is the k-dimensional multivariate t distribution t_k[θ̂_λ, s²(λ) J_λ^{2/n} (X′X)⁻¹, ν], and the marginal distribution of θ is

p(θ | y) ∝ ∫ J_λ^{-k/n} [S(λ, z) + Q(θ, λ)]^{-n/2} dλ,    −∞ < θ < ∞.    (10.4.4)

From (10.3.23), the mode λ̂ of the marginal distribution of λ is the value minimizing S(λ, z). We may employ Taylor's theorem to write

S(λ, z) + Q(θ, λ) ≐ S(λ̂, z) + (λ − λ̂)′G(λ − λ̂) + Q(θ, λ̂) + g(θ, λ),    (10.4.5)

where 2G is the matrix of the second derivatives of S(λ, z) with respect to λ evaluated at λ̂, and g(θ, λ) denotes the remainder.    (10.4.6)

For moderate n, the influence of g(θ, λ) will be small and can be ignored. Expression (10.4.4) can thus be approximated to yield

p(θ | y) ≐ [S(λ̂, z) + Q(θ, λ̂)]^{-(ν−r+k)/2} E(J_λ^{-k/n}),    (10.4.7)

where the expectation is taken over the distribution

p₁(λ) ∝ [1 + (λ − λ̂)′G(λ − λ̂)/(S(λ̂, z) + Q(θ, λ̂))]^{-n/2},    −∞ < λ < ∞.    (10.4.8)

Now for moderate n,

E(J_λ^{-k/n}) ≐ J_λ̂^{-k/n},    (10.4.9)
which does not involve θ, so that finally we can express the marginal distribution of θ approximately as

p(θ | y) ∝ [1 + Q(θ, λ̂)/(s²_{ν−r}(λ̂)(ν − r))]^{-(ν−r+k)/2},    −∞ < θ < ∞,    (10.4.10)

where (ν − r) s²_{ν−r}(λ̂) = S(λ̂, z), which is the k-dimensional multivariate t distribution

t_k[θ̂_λ̂, s²_{ν−r}(λ̂) J_λ̂^{2/n} (X′X)⁻¹, ν − r].

Thus, approximately, the marginal posterior distribution of θ is obtained by analysing z^{(λ̂)} as if λ̂ were a known fixed parameter and reducing the residual degrees of freedom by r, the number of transformation parameters. Having obtained λ̂, we can thus approximately justify performing a standard analysis with the transformed variate z^{(λ̂)}, which for fixed λ̂ differs from y^{(λ̂)} only by a constant scale factor. For the biological and the textile examples, the posterior distributions p_u(λ | y) shown in Figs. 10.3.1 and 10.3.2 are sharply peaked and when substituted in (10.4.1) would justify approximately an analysis in terms of reciprocal and logarithmic transformations respectively. Finally, then, in the actual analysis of these examples, an appropriate transformation λ₀ close to λ̂ was made and a standard analysis performed with the transformed variate y^{(λ₀)}, but with the number of residual degrees of freedom reduced by one. From a Bayesian viewpoint, this analysis rests on the approximate use of the multivariate t distribution p(θ | λ₀, y) with reduced degrees of freedom to represent p(θ | y). Bayesian justification of the analysis of variance table and calculation of the posterior distributions of particular subsets, differences, and other linear functions of the elements of θ only make use of particular aspects of this basic multivariate t distribution. The bracketed items in Tables 10.3.1 and 10.3.4 show the modifications to the degrees of freedom and mean squares which would be appropriate in the analysis of variance table on the above argument.
10.5 FURTHER ANALYSIS OF THE BIOLOGICAL DATA

It is an inherent assumption in the foregoing analysis that a transformation exists which simultaneously achieves simplicity of the model, homogeneity of variance, and Normality. This is clearly a much less restricted assumption than the more usual one of supposing that these requirements are all met without transformation. However, it is of interest to perform further analysis which in some way separates the issues of

a) simplicity of the linear model,
b) homogeneity of variance,
c) Normality,

and which makes it possible to see to what extent these may be achieved with the same transformation. Consider again the biological data. We have in our previous analysis tacitly supposed that there existed a transformed variable y^{(λ)} in (10.3.1) for which simultaneously

a) y_i^{(λ)} had a Normal distribution N{E[y_i^{(λ)}], σ_i²},
b) σ_i² = σ², and
c) E[y_i^{(λ)}] was adequately represented by additive row and column parameters, i = 1, …, n.

Further light is shed on the situation by supposing first that λ was chosen to satisfy (a) only, then considering how the situation is changed by adding requirement (b), and finally considering the effect of adding requirement (c).
If we merely suppose that a transformation y^{(λ)} of the form (10.3.1) exists which simultaneously induces Normality in all cells, but not necessarily constancy of cell variances or additivity, then, for each cell, we have

y_j^{(λ)} = θ_j 1 + ε_j,    j = 1, …, k,    (10.5.1)

where, for the jth cell corresponding to the tth treatment and pth poison, ε_j is spherical Normal N_{n_j}(0, σ_j² I) (for this example, k = 12, n_j = 4). Following the argument in Section 10.3.1, we have the prior distribution

p(θ, log σ, λ) ∝ J_λ^{-k/n} p(λ),    (10.5.2)

where

J_λ = Π_{i=1}^{n} |dy_i^{(λ)}/dy_i|.
Writing p_u(λ | N) to indicate the posterior distribution under Normality (N) only and with p(λ) assumed locally uniform, it can readily be shown that

p_u(λ | N) ∝ Π_{j=1}^{k} [S_{ν_j}(λ, z)]^{-ν_j/2},    −∞ < λ < ∞,    (10.5.3)

where S_{ν_j}(λ, z) is the sum of squares of deviations from the cell mean in terms of z^{(λ)} for the jth cell, having ν_j = n_j − 1 degrees of freedom (ν_j = 3 for the present example). The ordinates of p_u(λ | N) are shown in the fourth column of Table 10.5.1 and the distribution is plotted in Fig. 10.5.1.
Fig. 10.5.1 Posterior distributions of λ for different models: the biological data.
It is well known that small samples are not able to tell us much about the Normality or otherwise of a distribution, so that it is scarcely surprising that p_u(λ | N) is found to cover an extremely wide range of λ.

Table 10.5.1 Ordinates of posterior distributions of λ for different models: the biological data

   λ       p_u(λ | A, H, N)    p_u(λ | H, N)    p_u(λ | N)
  1.0                                              0.335
  0.5                             0.006            0.398
  0.0         0.006               0.021            0.342
 −0.1         0.023               0.055            0.324
 −0.2         0.076               0.127            0.304
 −0.3         0.257               0.261            0.283
 −0.4         0.492               0.471            0.261
 −0.5         0.942               0.754            0.240
 −0.6         1.462               1.059            0.218
 −0.7         1.823               1.320            0.196
 −0.8         1.823               1.430            0.173
 −0.9         1.419               1.360            0.153
 −1.0         0.923               1.136            0.134
 −1.1         0.468               0.850            0.116
 −1.2         0.194               0.558            0.099
 −1.3         0.067               0.329            0.083
 −1.4         0.019               0.170            0.069
 −1.5         0.005               0.078            0.058
 −1.6         0.001               0.032            0.050
 −1.7                             0.009
We now consider the effect of assuming that a transformation y^{(λ)} exists in terms of which not only Normality but also constancy of variance is obtained (but not necessarily additivity). Using the model in (10.5.1), but now with σ₁² = σ₂² = ⋯ = σ_k², the within-cells sum of squares S_w(λ, z) = Σ_{j=1}^{k} S_{ν_j}(λ, z) is pooled from the k = 12 groups of four animals and has ν_w = Σ_{j=1}^{k} ν_j = 36 degrees of freedom. The resulting posterior distribution of λ,

p_u(λ | H, N) ∝ [S_w(λ, z)]^{-ν_w/2},    (10.5.4)

shown in Fig. 10.5.1, is now very much sharper. Finally, if we assume that additivity can also be achieved by transformation, we have the probability distribution p_u(λ | A, H, N) already given in Fig. 10.3.1 and also shown in Fig. 10.5.1, which is even sharper than p_u(λ | H, N).
If we denote by S_I(λ, z) the sum of squares for interaction, which has ν_I = 6 degrees of freedom, then the residual sum of squares S(λ, z) originally considered is, in our new notation, S(λ, z) = S_w(λ, z) + S_I(λ, z), and

p_u(λ | A, H, N) ∝ [S_w(λ, z) + S_I(λ, z)]^{-(ν_w+ν_I)/2},    −∞ < λ < ∞.    (10.5.5)

Since in each case we obtain the posterior distribution of λ conditional on the truth of a model containing given restrictions, we need some practical answer to the question of whether a constraint is justified or not.
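The three posteriors (10.5.3)-(10.5.5) can be put on a common grid with a few lines of code. The sketch below is ours; it reuses z_form and rss from the earlier sketches and assumes cells is a list of the k = 12 index arrays, X_add the additive and X_int the interaction design matrix, so that rss(z, X_int) equals the pooled within-cell sum of squares S_w:

```python
# p_u(lambda | N), p_u(lambda | H, N) and p_u(lambda | A, H, N) on one grid;
# log densities are used to avoid underflow before normalizing.
import numpy as np

def three_posteriors(y, cells, X_add, X_int, grid):
    logs = {"N": [], "HN": [], "AHN": []}
    for lam in grid:
        z = z_form(y, lam)
        Svj = [np.sum((z[c] - z[c].mean())**2) for c in cells]  # cell SS, 3 df each
        Sw = rss(z, X_int)      # within-cells SS, 36 df (interaction-model fit)
        S42 = rss(z, X_add)     # residual SS after the additive fit, 42 df
        logs["N"].append(-1.5 * np.sum(np.log(Svj)))   # product of Svj^(-3/2)
        logs["HN"].append(-18.0 * np.log(Sw))          # Sw^(-36/2)
        logs["AHN"].append(-21.0 * np.log(S42))        # (Sw + SI)^(-42/2)
    out = {}
    for key, lp in logs.items():
        d = np.exp(np.array(lp) - np.max(lp))
        out[key] = d / np.trapz(d, grid)
    return out
```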
10.5.1 Successive Constraints

Let C denote a particular constraint. Then, as we have seen earlier in (1.5.5), we can write

p(λ | C) = p(λ) × p(C | λ)/p(C),    (10.5.6)

where p(C) = E_λ[p(C | λ)] is a constant independent of λ.
We apply this to the present example, with p_u(λ | N) the "unconstrained" density for which it is only assumed that Normality can be achieved for some λ, and with p_u(λ | H, N) the "constrained" density for which it is assumed that homogeneity of variance can also be achieved. With the dependence on Normality (N) understood, we can write

p_u(λ | H) = p_u(λ) p_u(H | λ)/p_u(H) ∝ p_u(λ) p_u(H | λ).    (10.5.7)
Thus, the distribution p_u(λ | H) is decomposed into two factors, both of which are functions of λ: the unconstrained "prior" distribution p_u(λ) and the "likelihood" factor p_u(H | λ) representing the effect of the constraint H. Now, we can obtain p_u(H | λ) from the relationship

p_u(H | λ) ∝ p_u(λ | H)/p_u(λ) ∝ [W_H(λ, z)]^{1/2},    (10.5.8)

where, from (10.5.3) and (10.5.4),

W_H(λ, z) = Π_{j=1}^{k} [S_{ν_j}(λ, z)/ν_j]^{ν_j} / [S_w(λ, z)/ν_w]^{ν_w}.    (10.5.9)

In other words,

p_u(λ | σ₁² = ⋯ = σ_k²) ∝ p_u(λ) p_u(σ₁² = ⋯ = σ_k² | λ),    (10.5.10)

where

p_u(σ₁² = ⋯ = σ_k² | λ) ∝ [W_H(λ, z)]^{1/2}.    (10.5.11)
Fig. 10.5.2 Values of M(λ, z) and F(λ, z) as functions of λ: the biological data.
Now, recalling our previous discussion of the comparison of variances in Section 2.12, if we consider the k − 1 linear contrasts in log σ_j, the quantity

M(λ, z) {1 + [Σ_{j=1}^{k} ν_j⁻¹ − ν_w⁻¹]/[3(k − 1)]}⁻¹,    where M(λ, z) = −2 log W_H(λ, z),

may be referred approximately to χ²(k − 1, α). Thus, in the biological problem, for a particular λ, the point φ₀ = 0 would lie outside a 95 per cent H.P.D. region if M(λ, z) > 21.8. The values of M(λ, z) for various values of λ are plotted in Fig. 10.5.2. From this we see that the constraint σ₁² = σ₂² = ⋯ = σ_k² is, in fact, compatible with values of λ in a range which includes, for example, the reciprocal transformation.
Thus, in the biological problem, for a particular ;., the point <1>0 = 0 would lie outside a 95 per cent H.P.D. region if M(l, z) > 21.8. The values for M(A, z) for various values of A. are plotted in Fig. 10.5 .2. From this we see that the constraint O'i = O'i = .. . = O'~ is, in fact, compatible with values of A in a range which includes, for example, the reciprocal transformation . We can now proceed in a similar way to determine the appropriateness of the further constraint of additivity. With dependence on Normality (N) understood, we can write Pu(J,IA,H)=PII( A IH)
Pu(A I H,2) ( ) cc Pu(..l. I H)PII(AIH, A).
Pu AI H
(l0.5.12)
The "likelihood" factor Pu (A I H, A) can be obtained from
p" (J, I A, H) p,,(AIH,A)CC p"(},IH)
X
. 12 [WA V,z)] , ,
(10.5,13)
where, from (10.5.4) and (10.5 .5),
W AZ = A (
,)
[S w (A, z)r"' [Sw ()" z) + SI (J" Z)]"whl .
(10 .5.14)
Writing θ_I to denote the ν_I interaction parameters, we thus have

p_u(λ | θ_I = 0, σ₁² = ⋯ = σ_k²) ∝ p_u(λ | σ₁² = ⋯ = σ_k²) p_u(θ_I = 0 | σ₁² = ⋯ = σ_k², λ),    (10.5.15)

where the second factor on the right,

p_u(θ_I = 0 | σ₁² = ⋯ = σ_k², λ) ∝ [W_A(λ, z)]^{1/2},    (10.5.16)

is the density of θ_I at θ_I = 0, conditional on the choice of λ and given that the cell variances are constant. From (10.4.3), it is readily seen that θ_I is distributed a posteriori as t_{ν_I}[θ̂_I(λ), s_w²(λ) C, ν_w], where θ̂_I(λ) is the ν_I × 1 sub-vector of θ̂_λ corresponding to θ_I, ν_w s_w²(λ) = S_w(λ, z), and C is proportional to the covariance matrix of θ̂_I. The interaction sum of squares S_I(λ, z) is in fact

S_I(λ, z) = θ̂_I′ C⁻¹ θ̂_I,    (10.5.17)

and, with the dependence on λ and z understood, expression (10.5.16) can be written

p_u(θ_I = 0 | σ₁² = ⋯ = σ_k², λ) ∝ [1 + S_I/S_w]^{-(ν_w+ν_I)/2}.    (10.5.18)

The question of whether for a particular value of λ the constraint A is acceptable or not may be resolved by considering whether the parameter point θ_I = 0 falls inside or outside an appropriate H.P.D. region. Since θ_I follows a multivariate t distribution, the question is answered by referring the quantity

F(λ, z) = (θ̂_I′ C⁻¹ θ̂_I/ν_I)/s_w² = (S_I/ν_I)/(S_w/ν_w)    (10.5.19)

to an F table with ν_I and ν_w degrees of freedom. For the present example, F has 6 and 36 degrees of freedom. Thus, the point θ_I = 0 will lie outside a 95 per cent H.P.D. region if F(λ, z) > 2.36. The plot of F(λ, z) as a function of λ given in Fig. 10.5.2 shows that the further constraint is acceptable over a fairly wide range, which includes the interesting region close to the reciprocal transformation.
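Both diagnostics are simple functions of sums of squares already computed in the earlier sketches. A minimal sketch (ours, reusing z_form, rss and the same cells/X_add/X_int quantities as before):

```python
# M(lambda, z) of the homogeneity check and F(lambda, z) of (10.5.19).
import numpy as np

def M_and_F(y, cells, X_add, X_int, lam):
    z = z_form(y, lam)
    Svj = np.array([np.sum((z[c] - z[c].mean())**2) for c in cells])
    nuj = np.array([len(c) - 1 for c in cells])      # cell degrees of freedom
    Sw, nuw = rss(z, X_int), int(nuj.sum())          # pooled within-cell SS
    SI, nuI = rss(z, X_add) - Sw, 6                  # interaction SS, 6 df
    logWH = np.sum(nuj * np.log(Svj / nuj)) - nuw * np.log(Sw / nuw)  # (10.5.9)
    M = -2.0 * logWH                                 # M(lambda, z)
    F = (SI / nuI) / (Sw / nuw)                      # F(lambda, z) of (10.5.19)
    return M, F
```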
10.5.2 Summary of Analysis

Finally, then, the overall posterior distribution p_u(λ | A, H, N) shown in Fig. 10.3.1 can be decomposed into three factors,

p_u(λ | A, H, N) ∝ p_u(λ | N) p_u(H | N, λ) p_u(A | H, N, λ).    (10.5.20)

The first factor is the posterior distribution of λ, supposing only that a transformation y^{(λ)} exists which induces Normality simultaneously in all the cells. Its product with a second factor, measuring the probability density associated with variance homogeneity for each value of λ, is proportional to p_u(λ | H, N), the posterior distribution of λ supposing a transformation y^{(λ)} exists which induces both Normality and homogeneity of variance. Finally, the product with a third factor, measuring the probability density associated with zero interaction for each value of λ, is proportional to p_u(λ | A, H, N), the posterior distribution supposing that a transformation y^{(λ)} exists which induces Normality, homogeneity of variance, and additivity. The plausibility of the constraints of homogeneity of variance and of additivity can be assessed for every contemplated λ by considering whether the constraint defines a parameter point inside or outside a suitable H.P.D. region. For the present data the constraints appear justified. However, situations commonly occur where, for example, nonadditivity is not reducible by a given family of transformations. One would be warned of this possibility if the F plot of Fig. 10.5.2 indicated that for no value of λ was the point θ_I = 0 included in, say, a 95 per cent H.P.D. region. Similarly, situations are common in which variance homogeneity cannot be induced by transformation. This possibility would be assessed by the M plot of Fig. 10.5.2.
10.6 FURTHER ANALYSIS OF THE TEXTILE DATA

A similar decomposition is possible for the textile data analysis. In our original analysis, it was assumed that a transformation y^{(λ)} was possible which induced linearity in the response surface as well as spherical Normality. It could be true, of course, that no such transformation was possible, in which case we could fall back on the possibility of inducing spherical Normality with the original quadratic model. The introduction of additivity which we have previously discussed is a special case of the induction of parsimony in the parametrization of the linear model. In general, we have

y^{(λ)} = X₁θ₁ + X₂θ₂ + ε,    (10.6.1)

in which θ₂ are the parameters which hopefully may not be needed. In the particular example of the textile data, the elements of θ₂ are the coefficients of the second-degree terms. We have

p_u(λ | θ₂ = 0) ∝ p_u(λ) p_u(θ₂ = 0 | λ).    (10.6.2)
Fig. 10.6.1 Posterior distributions of λ for different models: the textile data.
The "constrained" and "unconstrained" probability density of .A. for the textile data are shown in Fig. 10.6.1. Writing (SI> S2, SR' VI, v 2 , vR ) for the sums of squares and degrees of freedom associated with linear terms, quadratic terms, and residuals, we have (S
R+
S )-
S"£< '2
= S-vR/2 R R (SR +-S-2-)<:-\·-"-:-+-,,,-;-),"""2 .
(10 .6.3)
We find by the previous argument that the second factor on the right is proportional to the ordinate at 9 2 = 0 of the multivariate t distribution for 9 2 , The appropriateness of the constraint 9 2 = 0 for any value of ), may thus be judged by considering whether O2 = 0 is included in an appropriate (I - a) H.P. D. region , which in turn is equivalent to referring
to the C( significance point of the F-distribution with V 2 and V R degrees of freedom. A plot of F(.A., z) in Fig. 10.6.2 shows that, in this case, for values of ), in the region around }, = 0, there is no reason to question the applicability of the constraint that the surface in the transformed response could be planar.
Fig. 10.6.2 Values of F(λ, z) as a function of λ: the textile data. (The broken line corresponds to F = 2.70, the 5 per cent significance point of F with 6 and 17 degrees of freedom.)
10.7 A SUMMARY OF FORMULAE FOR VARIOUS PRIOR AND POSTERIOR DISTRIBUTIONS

In this chapter the frequently made assumption that the Normal linear model is adequate in the original metric is relaxed, and instead it is assumed only that there exists some transformation y^{(λ)} of the observation y for which the linear model is appropriate. Specifically, the model is given in (10.3.3) as

y^{(λ)} = Xθ + ε,

where y^{(λ)} = (y₁^{(λ)}, …, y_n^{(λ)})′, y = (y₁, …, y_n)′ is an n × 1 vector of observations, X is an n × k matrix of fixed elements, θ is a k × 1 vector of parameters, and ε is an n × 1 vector of errors distributed as N_n(0, σ²I). Table 10.7.1 provides a short summary of the formulae for the prior and posterior distributions of (θ, σ², λ).

Table 10.7.1 A summary of various prior and posterior distributions

1. The prior is, from (10.3.8) to (10.3.19),

p(θ, log σ, λ) = p(λ) p(θ, log σ | λ),

where p(λ) is some prior for λ and

p(θ, log σ | λ) ∝ J_λ^{-k/n},    J_λ = Π_{i=1}^{n} |dy_i^{(λ)}/dy_i|.
2.

p(θ, log σ | λ, y) ∝ σ^{-n} exp{−[S_λ + (θ − θ̂_λ)′X′X(θ − θ̂_λ)]/(2σ²)},    −∞ < θ < ∞,   σ > 0,

where

θ̂_λ = (X′X)⁻¹X′y^{(λ)}    and    S_λ = (y^{(λ)} − Xθ̂_λ)′(y^{(λ)} − Xθ̂_λ).

3. Working with the normalized data

z^{(λ)} = y^{(λ)}/J_λ^{1/n},

the posterior distribution of λ with p(λ) a uniform reference prior is, from (10.3.23),

p_u(λ | y) ∝ [S(λ, z)]^{-(n−k)/2},    −∞ < λ < ∞,

where

S(λ, z) = (z^{(λ)} − ẑ^{(λ)})′(z^{(λ)} − ẑ^{(λ)})    and    ẑ^{(λ)} = X(X′X)⁻¹X′z^{(λ)}.

4. In particular, if λ is a scalar, then for the simple power transformation in (10.3.1),

y^{(λ)} = (y^λ − 1)/λ   (λ ≠ 0);    y^{(λ)} = log y   (λ = 0),    y > 0,

we have, from (10.3.27),

z^{(λ)} = (y^λ − 1)/(λ ẏ^{λ−1})   (λ ≠ 0);    z^{(λ)} = ẏ log y   (λ = 0),

where ẏ = (Π_{i=1}^{n} y_i)^{1/n}. For the two-parameter transformation in (10.3.2),

y^{(λ)} = [(y + λ₂)^{λ₁} − 1]/λ₁   (λ₁ ≠ 0);    y^{(λ)} = log (y + λ₂)   (λ₁ = 0),

we have, from (10.3.32),

z^{(λ)} = [(y + λ₂)^{λ₁} − 1]/{λ₁ [gm (y + λ₂)]^{λ₁−1}}   (λ₁ ≠ 0);    z^{(λ)} = gm (y + λ₂) log (y + λ₂)   (λ₁ = 0),

where gm (y + λ₂) = [Π_{i=1}^{n} (y_i + λ₂)]^{1/n}.
5. The posterior distribution of θ with p(λ) a uniform reference prior is, from (10.4.4),

p(θ | y) ∝ ∫_{−∞}^{∞} J_λ^{-k/n} [S(λ, z) + Q(θ, λ)]^{-n/2} dλ,    −∞ < θ < ∞,

where

Q(θ, λ) = J_λ^{-2/n} (θ − θ̂_λ)′X′X(θ − θ̂_λ).

This distribution can be approximated by (10.4.10),

p(θ | y) ∝ [1 + Q(θ, λ̂)/(s²_{ν−r}(λ̂)(ν − r))]^{-(ν−r+k)/2},    −∞ < θ < ∞,

that is, a k-dimensional multivariate t distribution

t_k[θ̂_λ̂, s²_{ν−r}(λ̂) J_λ̂^{2/n} (X′X)⁻¹, ν − r],

where ν = n − k, r is the number of elements in λ, λ̂ is the value minimizing S(λ, z), and (ν − r) s²_{ν−r}(λ̂) = S(λ̂, z).
REFERENCES

PRINCIPAL SOURCE REFERENCES

The following papers formed the original basis for portions of the chapters indicated, and are not specifically referenced in the text.

Chapter 2
Box, G. E. P., and Tiao, G. C. (1965), "Multiparameter Problems from a Bayesian Viewpoint," Ann. Math. Statist. 36, 1468

Chapter 3
Box, G. E. P., and Tiao, G. C. (1962), "A Further Look at Robustness via Bayes's Theorem," Biometrika 49, 419

Chapter 4
Box, G. E. P., and Tiao, G. C. (1964a), "A Bayesian Approach to the Importance of Assumptions Applied to the Comparison of Variances," Biometrika 51, 153
Box, G. E. P., and Tiao, G. C. (1964b), "A Note on Criterion Robustness and Inference Robustness," Biometrika 51, 169

Chapter 5
Tiao, G. C., and Tan, W. Y. (1965), "Bayesian Analysis of Random-Effect Models in the Analysis of Variance. I. Posterior Distribution of Variance Components," Biometrika 52, 37
Tiao, G. C., and Box, G. E. P. (1967), "Bayesian Analysis of a Three-Component Hierarchical Design Model," Biometrika 54, 109

Chapter 6
Tiao, G. C. (1966), "Bayesian Comparison of Means of a Mixed Model with Application to Regression Analysis," Biometrika 53, 11

Chapter 7
Box, G. E. P., and Tiao, G. C. (1968a), "Bayesian Estimation of Means for the Random-Effect Model," J. Amer. Statist. Assoc. 63, 174
Tiao, G. C., and Draper, N. R. (1968), "Bayesian Analysis of Linear Models with Two Random Components, with Special Reference to the Balanced Incomplete Block Design," Biometrika 55, 101

Chapter 8
Tiao, G. C., and Zellner, A. (1964a), "On the Bayesian Estimation of Multivariate Regression," J. Roy. Statist. Soc., Series B 26, 277
Box, G. E. P., and Draper, N. R. (1965), "The Bayesian Estimation of Common Parameters from Several Responses," Biometrika 52, 355

Chapter 9
Tiao, G. C., and Zellner, A. (1964b), "Bayes's Theorem and the Use of Prior Knowledge in Regression Analysis," Biometrika 51, 219

Chapter 10
Box, G. E. P., and Cox, D. R. (1964), "An Analysis of Transformations," J. Roy. Statist. Soc., Series B 26, 211

GENERAL REFERENCES

Afonja, B. (1970), "Some Bayesian Considerations of the Analysis and Choice of a Class of Designs," Ph.D. thesis, the University of Wisconsin, Madison
Ali, M. M. (1969), "Some Aspects of the One-Way Random-Effects Model and the Linear Regression Model with Two Random Components," Ph.D. thesis, the University of Wisconsin, Madison
Anderson, T. W. (1958), An Introduction to Multivariate Statistical Analysis, New York: Wiley
Ando, A., and Kaufman, G. M. (1965), "Bayesian Analysis of the Independent Multinormal Process: Neither Mean nor Precision Known," J. Amer. Statist. Assoc. 60, 347
Anscombe, F. J. (1948a), "The Transformation of Poisson, Binomial and Negative-Binomial Data," Biometrika 35, 246
Anscombe, F. J. (1948b), "Contributions to the Discussion on D. G. Champernowne's Sampling Theory Applied to Autoregressive Sequences," J. Roy. Statist. Soc., Series B 10, 239
Anscombe, F. J. (1961), "Examination of Residuals," Proc. 4th Berkeley Symp. Math. Statist. Prob. 1, 1
Anscombe, F. J. (1963), "Bayesian Inference Concerning Many Parameters with Reference to Supersaturated Designs," Bulletin Int. Stat. Inst. 40, 721
Anscombe, F. J., and Tukey, J. W. (1963), "The Examination and Analysis of Residuals," Technometrics 5, 141
Barnard, G. A. (1947), "Significance Tests for 2 × 2 Tables," Biometrika 34, 123
Barnard, G. A. (1949), "Statistical Inference," J. Roy. Statist. Soc., Series B 11, 115
Barnard, G. A. (1954), "Sampling Inspection and Statistical Decisions," J. Roy. Statist. Soc., Series B 16, 151
Barnard, G. A., Jenkins, G. M., and Winsten, C. B. (1962), "Likelihood Inference and Time Series," J. Roy. Statist. Soc., Series A 125, 321
Bartlett, M. S. (1936), "The Square Root Transformation in Analysis of Variance," Suppl. J. Roy. Statist. Soc. 3, 68
Bartlett, M. S. (1937), "Properties of Sufficiency and Statistical Tests," Proc. Roy. Soc., Series A 160, 268
Bartlett, M. S. (1938), "Further Aspects of the Theory of Multiple Regression," Proc. Camb. Phil. Soc. 34, 33
Bayes, T. R. (1763), "An Essay Towards Solving a Problem in the Doctrine of Chances," Phil. Trans. Roy. Soc. London 53, 370 (reprinted in Biometrika (1958), 45, 293)
Beale, E. M. L. (1960), "Confidence Regions in Nonlinear Estimation," J. Roy. Statist. Soc., Series B 22, 41
Behrens, W. V. (1929), "Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen," Landw. Jb. 68, 807
Birnbaum, A. (1962), "On the Foundation of Statistical Inference," J. Amer. Statist. Assoc. 57, 269
Boot, J. C. G., and De Witt, G. M. (1960), "Investment Demand: An Empirical Contribution to the Aggregation Problem," Intern. Econ. Review 1, 3
Box, G. E. P. (1949), "A General Distribution Theory for a Class of Likelihood Criteria," Biometrika 36, 317
Box, G. E. P. (1953a), "Non-normality and Tests on Variances," Biometrika 40, 318
Box, G. E. P. (1953b), "A Note on Regions for Tests of Kurtosis," Biometrika 40, 465
Box, G. E. P. (1954), "Some Theorems on Quadratic Forms Applied in the Study of Analysis of Variance Problems: II. Effects of Inequality of Variance and of Correlation Between Errors in the Two-Way Classification," Ann. Math. Statist. 25, 484
Box, G. E. P. (1957), "Use of Statistical Methods in the Elucidation of Basic Mechanisms," Bull. Int. Stat. Inst. 36, 215
Box, G. E. P. (1960), "Fitting Empirical Data," Ann. New York Academy of Sciences 86, 792
Box, G. E. P., and Andersen, S. L. (1955), "Permutation Theory in the Derivation of Robust Criteria and the Study of Departures from Assumptions," J. Roy. Statist. Soc., Series B 17, 1
Box, G. E. P., and Jenkins, G. M. (1970), Time Series Analysis, Forecasting and Control, San Francisco: Holden-Day
Box, G. E. P., Erjavec, J., Hunter, W. G., and MacGregor, J. F. (1972), "Some Problems Associated with the Analysis of Multiresponse Data" (to appear in Technometrics)
Box, G. E. P., and Tiao, G. C. (1968b), "A Bayesian Approach to Some Outlier Problems," Biometrika 55, 119
Box, G. E. P., and Watson, G. S. (1962), "Robustness to Non-Normality of Regression Tests," Biometrika 49, 93
Bracken, J., and Schleifer, A. (1964), Tables for Normal Sampling with Unknown Variance, Cambridge: Harvard University Press
Brookner, R. J., and Wald, A. (1941), "On the Distribution of Wilks' Statistic for Testing the Independence of Several Groups of Variates," Ann. Math. Statist. 12, 137
Bulmer, M. G. (1957), "Approximate Confidence Limits for Components of Variance," Biometrika 44, 159
Bush, N., and Anderson, R. L. (1963), "A Comparison of Three Different Procedures for Estimating Variance Components," Technometrics 5, 421
Carlton, G. A. (1946), "Estimating the Parameters of a Rectangular Distribution," Ann. Math. Statist. 17, 355
Cochran, W. G. (1934), "The Distribution of Quadratic Forms in a Normal System, with Applications to the Analysis of Covariance," Proc. Camb. Phil. Soc. 30, 178
Cochran, W. G., and Cox, G. M. (1950), Experimental Designs, second edition, New York: Wiley
Cook, M. B. (1951), "Bivariate K-Statistics and Cumulants of their Joint Sampling Distribution," Biometrika 38, 179
Cornish, E. A. (1954), "The Multivariate t-Distribution Associated with a Set of Normal Sample Deviates," Aust. J. Phys. 7, 531
Crump, S. L. (1946), "Estimation of Variance Components in the Analysis of Variance," Biometrics 2, 7
Daniel, C. (1959), "Use of Half Normal Plots in Interpreting Factorial Experiments," Technometrics 1, 311
Daniels, H. E. (1939), "The Estimation of Components of Variance," J. Roy. Statist. Soc., Supplement 6, 186
Davies, O. L. (editor) (1949), Statistical Methods in Research and Production, second edition, London: Oliver and Boyd
Davies, O. L. (editor) (1967), Statistical Methods in Research and Production, third edition, London: Oliver and Boyd
De Bruijn, N. G. (1961), Asymptotic Methods in Analysis, Amsterdam: North-Holland
Deemer, W. L., and Olkin, I. (1951), "The Jacobians of Certain Matrix Transformations Useful in Multivariate Analysis, Based on Lectures by P. L. Hsu," Biometrika 38, 345
De Finetti, B. (1937), "La Prévision: Ses Lois Logiques, ses Sources Subjectives," Ann. Inst. H. Poincaré 7, 1. English translation in Studies in Subjective Probability, H. E. Kyburg, Jr. and H. G. Smokler (editors), 1964, New York: Wiley
De Groot, M. H. (1970), Optimal Statistical Decisions, New York: McGraw-Hill
Dempster, A. P. (1963), "On a Paradox Concerning Inference about a Covariance Matrix," Ann. Math. Statist. 34, 1414
Diananda, P. H. (1949), "Note on Some Properties of Maximum Likelihood Estimates," Proc. Camb. Phil. Soc. 45, 536
Dickey, J. M. (1967a), "Expansions of t Densities and Related Complete Integrals," Ann. Math. Statist. 38, 503
Dickey, J. M. (1967b), "Matric-Variate Generalizations of the Multivariate t Distribution and the Inverted Multivariate t Distribution," Ann. Math. Statist. 38, 511
Dickey, J. M. (1968), "Three Multidimensional-Integral Identities with Bayesian Applications," Ann. Math. Statist. 39, 1615
Drèze, J. H., and Morales, J. A. (1970), "Bayesian Full Information Analysis of the Simultaneous Equation Model," Center for Operations Research and Econometrics, Discussion Paper No. 7031, Université Catholique de Louvain
Dunnett, C. W., and Sobel, M. (1954), "A Bivariate Generalization of Student's t-Distribution with Tables for Certain Special Cases," Biometrika 41, 153
Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Rev. 70, 193
Eisenhart, C. (1947), "The Assumptions Underlying the Analysis of Variance," Biometrics 3, 1
Federer, W. T. (1955), Experimental Design, New York: Macmillan
Fisher, R. A. (1915), "Frequency Distribution of the Value of the Correlation Coefficient in Samples from an Indefinitely Large Population," Biometrika 10, 507
Fisher, R. A. (1921), "On the 'Probable Error' of a Coefficient of Correlation Deduced from a Small Sample," Metron 1, 3
Fisher, R. A. (1922), "On the Mathematical Foundations of Theoretical Statistics," Phil. Trans. Roy. Soc., Series A 222, 309
Fisher, R. A. (1924), "On a Distribution Yielding the Error Functions of Several Well-Known Statistics," Proc. Int. Math. Congress, Toronto, 805
Fisher, R. A. (1925), "Theory of Statistical Estimation," Proc. Camb. Phil. Soc. 22, 700
Fisher, R. A. (1930), "Inverse Probability," Proc. Camb. Phil. Soc. 26, 528
Fisher, R. A. (1935), "The Fiducial Argument in Statistical Inference," Ann. Eugen. 6, 391
Fisher, R. A. (1939), "The Comparison of Samples with Possibly Unequal Variances," Ann. Eugen. 9, 174
Fisher, R. A. (1959), Statistical Methods and Scientific Inference, second edition, London: Oliver and Boyd
Fisher, R. A. (1960), The Design of Experiments, seventh edition, New York: Hafner
Fisher, R. A. (1961a), "Sampling the Reference Set," Sankhyā, Series A 23, 3
Fisher, R. A. (1961b), "Weighted Mean of Two Samples with Unknown Variance Ratio," Sankhyā, Series A 23, 103
Fraser, D. A. S. (1968), The Structure of Inference, New York: Wiley
Fraser, D. A. S., and Haq, M. S. (1969), "Structural Probability and Prediction for the Multivariate Model," J. Roy. Statist. Soc., Series B 31, 317
Gates, C. E., and Shine, C. J. (1962), "The Analysis of Variance of the S-Stage Hierarchical Classification," Biometrics 18, 529
Gayen, A. K. (1949), "The Distribution of 'Student's' t in Random Samples of any Size Drawn from Non-Normal Universes," Biometrika 36, 353
Gayen, A. K. (1950), "The Distribution of the Variance Ratio in Random Samples of any Size Drawn from a Non-Normal Universe," Biometrika 37, 236
Geary, R. C. (1936), "The Distribution of 'Student's' Ratio for Non-Normal Samples," J. Roy. Statist. Soc., Supplement 3, 178
Geary, R. C. (1947), "Testing for Normality," Biometrika 34, 209
Geisser, S. (1965a), "Bayesian Estimation in Multivariate Analysis," Ann. Math. Statist. 36, 150
Geisser, S. (1965b), "A Bayes Approach for Combining Correlated Estimates," J. Amer. Statist. Assoc. 60, 602
Geisser, S., and Cornfield, J. (1963), "Posterior Distributions for Multivariate Normal Parameters," J. Roy. Statist. Soc., Series B 25, 368
Gosset, W. S. ["Student"] (1908), "The Probable Error of a Mean," Biometrika 6, 1
Gower, J. C. (1962), "Variance Component Estimation for Unbalanced Hierarchical Classification," Biometrics 18, 537
Graybill, F. A., and Weeks, D. L. (1959), "Combining Inter-Block and Intra-Block Information in Balanced Incomplete Blocks," Ann. Math. Statist. 30, 799
Graybill, F. A., and Wortham, A. W. (1956), "A Note on Uniformly Best Unbiased Estimators for Variance Components," J. Amer. Statist. Assoc. 51, 266
Grunfeld, Y. (1958), "The Determinants of Corporate Investment," Ph.D. thesis, University of Chicago
Guttman, I., and Meeter, D. A. (1965), "On Beale's Measure of Nonlinearity," Technometrics 7, 623
Hartigan, J. A. (1964), "Invariant Prior Distributions," Ann. Math. Statist. 35, 836
Hartigan, J. A. (1965), "The Asymptotically Unbiased Prior Distribution," Ann. Math. Statist. 36, 1137
Hartley, H. O. (1940), "Testing the Homogeneity of a Set of Variances," Biometrika 31, 249
Hartley, H. O. (1961), "The Modified Gauss-Newton Method for the Fitting of Nonlinear Regression Functions by Least Squares," Technometrics 3, 269
Henderson, C. R. (1953), "Estimation of Variance and Covariance Components," Biometrics 9, 226
Herbach, L. H. (1959), "Properties of Model II-type Analysis of Variance Tests," Ann. Math. Statist. 30, 939
Hildreth, C. (1963), "Bayesian Statisticians and Remote Clients," Econometrica 32, 422
Hill, B. M. (1965), "Inference About Variance Components in the One-Way Model," J. Amer. Statist. Assoc. 60, 806
Hill, B. M. (1967), "Correlated Errors in the Random Model," J. Amer. Statist. Assoc. 62, 1387
Hogg, R. V., and Craig, A. T. (1970), Introduction to Mathematical Statistics, second edition, New York: Macmillan
Huzurbazar, V. S. (1955), "Exact Forms of Some Invariants for Distributions Admitting Sufficient Statistics," Biometrika 43, 533
Jackson, D. (1921), "Note on the Median of a Set of Numbers," Bull. Amer. Math. Soc. 27, 160
James, W., and Stein, C. M. (1961), "Estimation with Quadratic Loss Function," Proc. 4th Berkeley Symp. Math. Statist. Prob. 1, 361
Jaynes, E. T. (1968), "Prior Probabilities," IEEE Trans. Systems Science and Cybernetics SSC-4, 227
Jeffreys, H. (1961), Theory of Probability, third edition, Oxford: Clarendon Press
Jeffreys, H., and Swirles, B. (1956), Methods of Mathematical Physics, Cambridge: Cambridge University Press
Johnson, R. A. (1967), "An Asymptotic Expansion for Posterior Distributions," Ann. Math. Statist. 38, 1899
Johnson, R. A. (1970), "Asymptotic Expansions Associated with Posterior Distributions," Ann. Math. Statist. 41, 851
Kempthorne, O. (1952), The Design and Analysis of Experiments, New York: Wiley
Kendall, M. G., and Stuart, A. (1961), The Advanced Theory of Statistics, Volume 2, New York: Hafner
Klotz, J. H., Milton, R. C., and Zacks, S. (1969), "Mean Square Efficiency of Estimators of Variance Components," J. Amer. Statist. Assoc. 64, 1383
Kolmogoroff, A. (1941), "Confidence Limits for an Unknown Distribution Function," Ann. Math. Statist. 12, 461
Kshirsagar, A. M. (1960), "Some Extensions of the Multivariate t-Distribution and the Multivariate Generalization of the Distribution of the Regression Coefficients," Proc. Camb. Phil. Soc. 57, 80
Lehmann, E. L. (1959), Testing Statistical Hypotheses, New York: Wiley
Lindley, D. V. (1965), Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 2, Inference, Cambridge: Cambridge University Press
Lindley, D. V. (1971), "Bayesian Statistics, a Review," Regional Conference Series in Applied Mathematics, SIAM
Lund, D. R. (1967), "Parameter Estimation in a Class of Power Distributions," Ph.D. thesis, University of Wisconsin, Madison
Marquardt, D. W. (1963), "An Algorithm for Least Squares Estimation of Nonlinear Parameters," J. Soc. Ind. Appl. Math. 11, 431
Milne-Thomson, L. M. (1960), The Calculus of Finite Differences, New York: Macmillan
Mood, A. M., and Graybill, F. A. (1963), Introduction to the Theory of Statistics, New York: McGraw-Hill
Moriguti, S. (1954), "Confidence Limits for a Variance Component," Rep. Stat. Appl. Res. JUSE 3, 29
Mosteller, F., and Wallace, D. L. (1964), Inference and Disputed Authorship: The Federalist, Reading, Mass.: Addison-Wesley
Nelder, J. A. (1954), "The Interpretation of Negative Components of Variance," Biometrika 41, 544
Neyman, J., and Pearson, E. S. (1928), "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I," Biometrika 20A, 175
Novick, M. R. (1969), "Multiparameter Bayesian Indifference Procedures," J. Roy. Statist. Soc., Series B 31, 29
Novick, M. R., and Hall, W. J. (1965), "A Bayesian Indifference Procedure," J. Amer. Statist. Assoc. 60, 1104
Patil, V. H. (1964), "The Behrens-Fisher Problem and its Bayesian Solution," J. Indian Statist. Assoc. 2, 21
Patnaik, P. B. (1949), "The Non-Central χ²- and F-Distributions and Their Applications," Biometrika 36, 202
Pearson, K. (1934), Tables of the Incomplete Beta-Function, Cambridge: Cambridge University Press
Perks, F. J. A. (1947), "Some Observations on Inverse Probability, Including a New Indifference Rule," J. Inst. Actuaries 73, 285
Pillai, K. C. S., and Gupta, A. K. (1969), "On the Exact Distribution of Wilks' Criterion," Biometrika 56, 109
Pitman, E. J. G. (1936), "Sufficient Statistics and Intrinsic Accuracy," Proc. Camb. Phil. Soc. 32, 567
Portnoy, S. (1971), "Formal Bayes Estimation with Application to a Random Effect Model," Ann. Math. Statist. 42, 1379
Pratt, J. W., Raiffa, H., and Schlaifer, R. (1965), Introduction to Statistical Decision Theory, New York: McGraw-Hill
Raiffa, H., and Schlaifer, R. (1961), Applied Statistical Decision Theory, Cambridge: Harvard University Press
Ramsey, F. P. (1931), The Foundations of Mathematics and Other Logical Essays, London: Routledge and Kegan Paul
Robbins, H. (1955), "An Empirical Bayes Approach to Statistics," Proc. 3rd Berkeley Symp. Math. Statist. Prob. 1, 157
Robbins, H. (1964), "The Empirical Bayes Approach to Statistical Problems," Ann. Math. Statist. 35, 1
Savage, L. J. (1954), The Foundations of Statistics, New York: Wiley
Savage, L. J. (1961a), "The Subjective Basis of Statistical Practice," unpublished manuscript, the University of Michigan
Savage, L. J. (1961b), "The Foundations of Statistics Reconsidered," Proc. 4th Berkeley Symp. 1, 575
Savage, L. J., et al. (1962), The Foundations of Statistical Inference, London: Methuen
Schatzoff, M. (1966), "Exact Distributions of Wilks' Likelihood Ratio Criterion," Biometrika 53, 347
Scheffé, H. (1959), The Analysis of Variance, New York: Wiley
Schlaifer, R. (1959), Probability and Statistics for Business Decisions, New York: McGraw-Hill
Searle, S. R. (1958), "Sampling Variances of Estimates of Components of Variance," Ann. Math. Statist. 29, 167
Seshadri, V. (1966), "Comparison of Combined Estimators in Balanced Incomplete Blocks," Ann. Math. Statist. 37, 1832
Siegel, C. L. (1935), "Ueber die Analytische Theorie der Quadratischen Formen," Ann. Math. 36, 527
Stein, C. M. (1962), "Confidence Sets for the Mean of a Multivariate Normal Distribution," J. Roy. Statist. Soc., Series B 24, 265
Stone, M. (1964), "Comments on a Posterior Distribution of Geisser and Cornfield," J. Roy. Statist. Soc., Series B 26, 274
Stone, M., and Springer, B. G. F. (1965), "A Paradox Involving Quasi Prior Distributions," Biometrika 52, 623
Sukhatme, P. V. (1938), "On Fisher and Behrens' Test of Significance for the Difference in Means of Two Normal Samples," Sankhyā 4, 39
Tan, W. Y. (1964), "Bayesian Analysis of Random Effect Models," Ph.D. thesis, the University of Wisconsin, Madison
Thompson, W. A., Jr. (1962), "The Problem of Negative Estimates of Variance Components," Ann. Math. Statist. 33, 273
Thompson, W. A., Jr. (1963), "Non-Negative Estimates of Variance Components," Technometrics 5, 441
Tiao, G. C., and Ali, M. M. (1971a), "Effect of Non-Normality on Inferences About Variance Components," Technometrics 13, 635
Tiao, G. C., and Ali, M. M. (1971b), "Analysis of Correlated Random Effects: Linear Model with Two Random Components," Biometrika 58, 37
Tiao, G. C., and Fienberg, S. (1969), "Bayesian Estimation of Latent Roots and Vectors with Special Reference to the Bivariate Normal Distribution," Biometrika 56, 97
Tiao, G. C., and Guttman, I. (1965), "The Inverted Dirichlet Distribution with Applications," J. Amer. Statist. Assoc. 60, 793
Tiao, G. C., and Lund, D. R. (1970), "The Use of OLUMV Estimators in Inference Robustness Studies of the Location Parameter of a Class of Symmetric Distributions," J. Amer. Statist. Assoc. 65, 370
Tiao, G. C., and Tan, W. Y. (1966), "Bayesian Analysis of Random-Effect Models in the Analysis of Variance. II. Effect of Autocorrelated Errors," Biometrika 53, 477
Tiao, G. C., Tan, W. Y., and Chang, Y. C. (1970), "A Bayesian Approach to Multivariate Regression Subject to Linear Constraints," paper presented to the Second World Congress of the Econometric Society, Cambridge, England
Tukey, J. W. (1956), "Variances of Variance Components: I. Balanced Designs," Ann. Math. Statist. 27, 722
Tukey, J. W. (1961), "Discussion Emphasizing the Connection Between Analysis of Variance and Spectrum Analysis," Technometrics 3, 191
Turner, M. C. (1960), "On Heuristic Estimation Methods," Biometrics 16, 299
Wang, Y. Y. (1967), "A Comparison of Several Variance Component Estimators," Biometrika 54, 301
Welch, B. L. (1938), "The Significance of the Difference Between Two Means When the Population Variances are Unequal," Biometrika 29, 350
Welch, B. L. (1947), "The Generalization of 'Student's' Problem when Several Different Population Variances are Involved," Biometrika 34, 28
Welch, B. L., and Peers, H. W. (1963), "On Formulae for Confidence Points Based on Integrals of Weighted Likelihood," J. Roy. Statist. Soc., Series B 25, 318
Wilks, S. S. (1962), Mathematical Statistics, New York: Wiley
Williams, J. S. (1962), "A Confidence Interval for Variance Components," Biometrika 49, 278
Wishart, J. (1928), "The Generalized Product Moment Distribution in Samples from a Normal Multivariate Population," Biometrika 20A, 32
Yates, F. (1939), "An Apparent Inconsistency Arising from Tests of Significance Based on Fiducial Distributions of Unknown Parameters," Proc. Camb. Phil. Soc. 35, 579
Yates, F. (1940), "The Recovery of Inter-Block Information in Balanced Incomplete Block Designs," Ann. Eugen. 10, 317
Zacks, S. (1967), "More Efficient Estimators of Variance Components," Technical Report No. 4, Department of Statistics, Kansas State University, Manhattan, Kansas
Zellner, A. (1962), "An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias," J. Amer. Statist. Assoc. 57, 348
Zellner, A. (1963), "Estimators for Seemingly Unrelated Regression Equations: Some Finite Sample Results," J. Amer. Statist. Assoc. 58, 977
Zellner, A., and Tiao, G. C. (1964), "Bayesian Analysis of the Regression Model with Autocorrelated Errors," J. Amer. Statist. Assoc. 59, 763
INDEXES
AUTHOR INDEX

Afonja, B., 418, 572
Ali, M. M., 248, 249, 383, 572, 578
Andersen, S. L., 154, 235, 573
Anderson, R. L., 248, 573
Anderson, T. W., 427, 438, 449, 474, 572
Ando, A., 438, 572
Anscombe, F. J., 1, 8, 38, 260, 381, 382, 572
Barnard, G. A., 1, 14, 71, 72, 73, 122, 485, 572
Bartlett, M. S., 38, 134, 135, 203, 221, 230, 451, 572
Bayes, T. R., 23, 24, 308, 572
Beale, E. M. L., 431, 572
Behrens, W. V., 106, 573
Birnbaum, A., 73, 573
Boot, J. C. G., 495, 573
Box, G. E. P., 7, 81, 136, 154, 157, 194, 203, 235, 431, 437, 451, 522, 571, 572, 573
Bracken, J., 481, 573
Brookner, R. J., 136, 573
Bulmer, M. G., 248, 573
Bush, N., 248, 573
Carlton, G. A., 155, 573
Chang, Y. C., 446, 579
Cochran, W. G., 331, 392, 573
Cook, M. B., 517, 573
Cornfield, J., 438, 441, 575
Cornish, E. A., 117, 574
Cox, D. R., 525, 572
Cox, G. M., 392, 573
Craig, A. T., 3, 576
Crump, S. L., 248, 574
Daniel, C., 8, 574
Daniels, H. E., 248, 574
Davies, O. L., 209, 246, 348, 370, 574
De Bruijn, N. G., 271, 574
Deemer, W. L., 474, 575
De Finetti, B., 1, 14, 574
De Groot, M. H., 19, 574
Dempster, A. P., 444, 574
De Witt, G. M., 495, 573
Diananda, P. H., 157, 574
Dickey, J. M., 440, 481, 574
Draper, N. R., 418, 571, 572
Drèze, J. H., 446, 574
Dunnett, C. W., 117, 574
East, D. A., 555
Edwards, W., 22, 84, 574
Eisenhart, C., 248, 574
Erjavec, J., 431, 573
Federer, W. T., 414, 574
Fienberg, S., 459, 578
Fisher, R. A., 1, 12, 13, 24, 35, 38, 42, 60, 73, 105, 106, 111, 149, 153, 163, 308, 326, 466, 468, 469, 481, 485, 487, 488, 489, 515, 518, 520, 574, 575
Fraser, D. A. S., 73, 575
Gates, C. E., 248, 575
Gayen, A. K., 154, 203, 575
Geary, R. C., 154, 203, 575
Geisser, S., 438, 441, 449, 459, 501, 575
Gosset, W. S. ["Student"], 97, 575
Gower, J. C., 248, 575
Graybill, F. A., 3, 248, 305, 396, 417, 575, 577
Grunfeld, Y., 496, 575
Gupta, A. K., 451, 577
Guttman, I., 287, 296, 431, 575, 578
Hall, W. J., 60, 577
Hamilton, P. A., 555
Haq, M. S., 73, 575
Hartigan, J. A., 60, 575
Hartley, H. O., 136, 437, 556, 570, 576
Henderson, C. R., 248, 576
Herbach, L. H., 248, 576
Hildreth, C., 124, 576
Hill, B. M., 249, 272, 576
Hogg, R. V., 3, 576
Hsu, P. L., 474
Hunter, W. G., 431, 573
Huzurbazar, V. S., 576
Jackson, D., 199, 200, 576
James, W., 313, 369, 388, 576
Jaynes, E. T., 60, 576
Jeffreys, H., 1, 41, 42, 44, 54, 56, 58, 60, 72, 80, 94, 106, 271, 481, 488, 489, 576
Jenkins, G. M., 7, 73, 81, 572, 573
Johnson, R. A., 36, 576
Kaufman, G. M., 438, 572
Kempthorne, O., 392, 576
Kendall, M. G., 62, 576
Klotz, J. H., 251, 311, 576
Kolmogoroff, A., 169, 576
Kshirsagar, A. M., 440, 576
Lehmann, E. L., 351, 576
Lindley, D. V., 1, 80, 84, 371, 555, 576, 577
Lindman, H., 22, 84, 574
Lochner, R. H., 565
Lund, D. R., 158, 170, 177, 577, 578
MacGregor, J., 431, 573
Marquardt, D. W., 437, 577
Meeter, D. A., 431, 575
Merrington, M., 557
Milne-Thomson, L. M., 146, 294, 577
Milton, R. C., 251, 311, 576
Mood, A. M., 3, 305, 577
Morales, J. A., 446, 574
Moriguti, S., 248, 577
Mosteller, F., 25, 149, 390, 577
Nelder, J. A., 260, 577
Neyman, J., 72, 155, 577
Novick, M. R., 60, 577
Olkin, I., 474, 574
Patil, V. H., 107, 577
Patnaik, P. B., 577
Pearson, E. S., 72, 155, 556, 570, 577
Pearson, K., 149, 263, 577
Peers, H. W., 60, 579
Perks, F. J. A., 60, 577
Pillai, K. C. S., 451, 577
Pitman, E. J. G., 577
Portnoy, S., 251, 311, 313, 314, 577
Pratt, J. W., 309, 577
Raiffa, H., 19, 309, 577
Ramsey, F. P., 1, 14, 577
Robbins, H., 390, 577
Savage, L. J., 1, 14, 22, 25, 73, 80, 438, 574, 578
Schatzoff, M., 451, 578
Scheffé, H., 249, 396, 578
Schlaifer, R., 19, 309, 577, 578
Schleifer, A., 481, 573
Searle, S. R., 248, 578
Seshadri, V., 396, 578
Shine, C. J., 248, 575
Siegel, C. L., 578
Smirnoff, 169
Sobel, M., 117, 574
Springer, B. G. F., 251, 303, 304
Stein, C. M., 1, 313, 369, 371, 381, 388, 390, 576, 578
Stone, M., 251, 303, 304, 444, 578
Stuart, A., 62, 576
Sukhatme, P. V., 106, 107, 578
Swirles, B., 271, 576
Tan, W. Y., 248, 249, 298, 403, 446, 571, 578, 579
Thompson, W. A., Jr., 248, 578
Tiao, G. C., 81, 157, 158, 249, 287, 296, 298, 383, 403, 418, 446, 459, 522, 565, 571, 572, 573, 578, 579
Tukey, J. W., 8, 248, 530, 572, 579
Turner, M. C., 157, 199, 579
Wald, A., 136, 573
Wallace, D. L., 25, 149, 390, 577
Wang, Y. Y., 248, 579
Watson, G. S., 194, 573
Weeks, D. L., 396, 417, 575
Welch, B. L., 60, 73, 109, 579
Wilks, S. S., 62, 116, 579
Williams, J. S., 248, 579
Winsten, C. B., 73, 572
Wishart, J., 427, 579
Wortham, A. W., 248, 575
Yates, F., 392, 396, 417, 481, 487, 488, 579
Zacks, S., 248, 251, 311, 576, 579
Zellner, A., 81, 438, 571, 572, 579
SUBJECT INDEX

Analysis of variance tables
  additive mixed model, 343
  balanced incomplete blocks, 400
  comparison of Normal means, 131
  interaction model, 363
  linear model, two random components, 395
  Normal linear model, 127
  three-component hierarchical, 245
  two-component random effect, 250, 371
  two-way random effect, 330
Balanced incomplete block design (BIBD), 392, 397
  efficiency factor, 402
  recovery of inter-block information, 417
Bartlett's test, variance homogeneity (see Comparison of variances)
Bayes' theorem
  applied to frequencies, 12
  applied to scientific inference, 20
  applied to subjective probabilities, 14
  definition, 10
  sequential nature, 11
Bayesian inference, appropriateness, 9
Behrens-Fisher distribution, 106
  approximations, 107
Bernoulli polynomials, 146
Beta distribution, 36
Binomial distribution, 34
  parameter inference, 36
Cauchy distribution, 64
Central limit conditions, 78
Chi (χ) distribution, 88
  inverted, 88
Chi-square (χ²) distribution, 87
  inverted, 88
  log χ², 88
Common parameters, inference
  independent responses, 480
  two responses, 489, 515
  linear bivariate model, 502
  linear multivariate, 501
    common derivative matrix, 500
  nonlinear multivariate, 428
Comparison of Normal means
  additive mixed model, 357
    scaled F approximation, 360
  balanced incomplete blocks, 403, 409
  fixed effect model (k populations), 130
  interaction model, 365
    scaled t approximation, 366
  random effect model, 376, 386
  two means
    additive mixed model, 346, 352
    random effect model, 375
    variances equal, 103
    variances unequal, 104
Comparison of variances
  k samples
    Bartlett's approximation, 135, 221, 230
    non-Normal parents, 220, 227, 229, 231
    Normal parents, 133
  2 samples
    non-Normal (exponential power) parents, 205, 209, 212, 214, 218
    Normal parents, 110
Confidence distribution, 79
Confidence interval, 6
Contrasts, linear, 128
  comparison of location, 131
  comparison of spread, 132
Correlation coefficient (Normal), inference, 465
Covariance matrix (Normal), inference, 460, 461
  latent roots and vectors, 459
Cross-classification design, 319
Decision theory, 19
Derivative matrix, 113
Design matrix, 421
Digamma function, 316
Double exponential distribution, 157
Double F distribution, 155
Estimators
  Bayes, 304
    mean-squared error, 307
  James-Stein problem, 388
  variance components, 311, 313
Expectation function, 422
  multivariate linear, 435
  multivariate nonlinear, 424
Exponential power distribution, 157
  non-Normality (kurtosis) measure, 157
  upper per cent points, 158
F distribution, 110
Fiducial inference, 73
Fixed effect models
  cross classification, 318, 328
  one-way classification, 370
Gamma function, generalized, 427
Hierarchical design, 244
H.P.D. (highest posterior density) region
  general, 122
  interval, 85
  interval, standardized, 89
Indicator variable, 113
Information matrix
  definition, 53
  Normal covariance matrix, 475
  variance components, 252
Information measure, single parameter, 42
Integral identities
  Dirichlet, 145
  gamma, 144
  Normal, multivariate, 145
  t, matric-variate, 475
  t, multivariate, 145
  variance component distributions, 295
Jacobians, matrix transformations, 473
Jeffreys' rule
  single parameter, 42
  multiparameter, 54
Kronecker product, 477
Kurtosis, measure, 150
Kurtosis parameter (exponential power distribution), inference
  prior distribution, 167
  1 sample, 166, 167, 183, 187
  k samples, unequal variances, 209, 217, 225, 231
Likelihood, data translated, 32
  approximate, 36, 41
  multiparameter, 48
Likelihood function
  definition, 10
  role in Bayes' theorem, 10
  standardized likelihood, 11
Likelihood inference, 73
Likelihood ratio criterion, comparison of k non-Normal variances, 221
Linear model
  non-Normal, 176
  Normal, 46, 113
  multivariate, common derivative matrix, 438
  multivariate, general, 435
  two random components, 393
Location parameter, definition, 18
Location-scale family distributions, 43
  parameter inference, 43, 44, 58
Loss function, 308, 309, 313
Matrix of independent variables, 113
Mean, definition, 76
Median, definition, 76
Mixed models
  general, 324
  additive, 325, 341
  interaction, 326, 362
Monotone likelihood ratio, 351
Multinomial distribution, 55
  parameter inference, 55
Nonlinear model
  non-Normal, 186
  Normal, 422
    linear approximation, 436
  multivariate, 423
Non-Normal means (exponential power parents), inference
  1 sample, 160, 162, 167, 171
  k samples, unequal variances, 212, 229
Normal distribution, 80
  matric-variate, 447
  multivariate, 80, 115, 116
Normal means, inference
  fixed effect model (k populations), 128, 377
  multivariate model, 440
  random effect model, 372, 376
  random vs. fixed effect prior in high dimension, 379, 383
  single mean, 28, 51, 82, 93, 95, 97
    informative prior, 99
Normal prior, 18, 74
Normal standard deviation, inference, 31, 87, 89, 93, 96
  informative prior, 99
  metric for locally uniform prior, 101
Normal variance, inference (see Normal standard deviation)
Nuisance parameter, 70
  applied to robustness studies, 70, 169
Observation, aberrant, 522
Pascal distribution, 45
  parameter inference, 45
"Paired t" problem, 348
Poisson distribution, 39
  parameter inference, 40
Pooling variance estimates, 246, 249, 289, 353
Prior distribution
  definition, 10
  improper, 21
  locally uniform, 23
  noninformative, definition, 32
    dependence on model, 44
    location and scale parameters, 56
    multiparameter, general, 54
    summary of examples, 59
  reference, 23
Posterior distribution
  definition, 10
  parameters under constraints, 67
Probability, subjective, 14
Quadratic forms
  combination of two, 418
  mixed cumulants, 520
Random effect models
  hierarchical design, 244, 249, 277, 293
  cross-classification design, 318, 322, 329
  one-way classification, 371
Randomized (complete) block designs (see Mixed models, additive), 326
Regression coefficients, inference
  linear model, non-Normal, 177, 178
  linear model, Normal, 48, 51, 115, 117, 125, 437
    joint vs. marginal, 122
  multivariate, common derivative matrix, 440, 441, 448, 452
  two random components, 396
  nonlinear model, non-Normal, 187
  nonlinear model, Normal
    multivariate, general, 427, 428
    independent responses, 428
  transformation of data, 531, 539
Robustness
  criterion, 152
    inference, 153
    variance ratio, 208
  inference vs. criterion
    mean, 153
    variances, 232
Sampling theory inference, 5, 72
Scale parameter, definition, 49
Scientific investigation, iterative process, 4
Skewness, measure, 150
Spread parameter, definition, 77
Stable estimation, 22
Statistical analysis, in scientific investigation, 5
Statistical inference, general, 5
Stirling's series, 147
Student's t distribution (see t distribution)
Sufficient statistics
  additive mixed model, 341
  definition, 61
  exponential power distribution, variance, 206
  hierarchical random effect models, 250, 278
  Normal distribution, 60
  Normal linear model, 115
  relevance in Bayesian inference, 63
  two-way random effect model, 331
  uniform (rectangular) distribution, 155
t distribution, 97
  matric-variate, 441
  multivariate, 117
  product of multivariate, 481, 489
    asymptotic expansion, 515
  product of 2 univariate, 481
t² criterion, non-Normal parent, 154
Transformation of data
  general model, 530
  inference, 531, 539
  logarithmic (textile data), 527, 536, 548
  reciprocal (biological data), 525, 534, 542
  square root (grinding experiment), 523
U distribution, 449
  approximations, 451
Uniformly most powerful similar test, two non-Normal variances, 207
Variance components, definition, 245
Variance components, inference
  additive mixed model, 346
  balanced incomplete blocks model, 407
  interaction model, 364
  q-component hierarchical model, 293
  2-component hierarchical model
    asymptotic expansion, 271, 298
    individual components, 255, 258, 266, 272, 294, 296
    noninformative prior, 251, 303
    ratio, 253
    sampling theory difficulties, 248
  3-component hierarchical model
    individual components, 279, 286, 290
    ratio, 280
    relative contribution, 283
  scaled χ⁻² approximation, 261, 274, 287, 291, 335, 353, 366
  two-way random effect model, 332
Weighted mean problem, 481
Wishart distribution, 427
  inverted, 460