Proceedings of Statistics 2001 Canada The 4 t h Conference in Applied Statistics
R E C E N T ADVANCES IN
STATISTICAL
METHODS
ixlited by Yogendra P. Chaubey
Imperial College Press
RECENT ADVANCES IN STATISTICAL M E T H O D S
Editorial Board: Chief Editor P. CHAUBEY, Concordia University, Department of Mathematics and Statistics, Montreal, QC H4B 1R6, Canada YOGENDRA
Coordinating Editors G A R R I D O , Concordia University, Department of Mathematics and Statistics, Montreal, QC H4B 1R6, Canada JOSE
FASSIL N E B E B E , Concordia University, Department of Decision Sciences and MIS, Montreal, QC H3G 1M8, Canada Associate Editors B.ABRAHAM,
R.
University of Waterloo
Southern Methodist University
J.-F.
A.
ANGERS,
University de Montreal A.
BOSE,
Carleton University A.
CANTY,
McMaster A.H.
University
EL-SIIAARAWI,
National Water Research Institute
NATARAJAN, OKTAC,
UQAM N.G.N. PRASAD, University of Alberta S.N. R A I , St. Jude Children Research Hospital J. RAMSAY,
Concordia University
McGill University R. R O Y , Universite de Montreal A. SEN, Oakland University
C.
D. STANFORD,
R.
FERLAND,
UQAM G.
FISHER, GENEST,
Universite Laval
and University of Western Ontario
R.D.
B. SUTRADHAR ,
GUPTA,
University of New Brunswick
Memorial University of Newfoundland
Z. KHALIL,
K.
Concordia University
McGill University
WORSLEY,
Managing Editor VINCENT G O U L E T , Concordia University, Department of Mathematics and Statistics, Montreal, QC H4B 1R6, Canada
edited by Y o g e n d r a P. Chaubey Concordia University, Canada
RECENT ADVANCES IN
STATISTICAL METHODS
Proceedings of S t a t i s t i c s 2 0 0 1 Canada The 4 t h Conference in Applied S t a t i s t i c s Montreal, Canada
6 - 8 July 2001
Imperial College Press
Published by Imperial College Press 57 Shelton Street Covent Garden London WC2H 9HE Distributed by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202,1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
RECENT ADVANCES IN STATISTICAL METHODS Proceedings of Statistics 2001 Canada: The 4th Conference in Applied Statistics Copyright © 2002 by Imperial College Press All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 1-86094-333-0
This book is printed on acid-free paper. Printed in Singapore by Mainland Press
PREFACE The present volume consists of a selection of 31 papers which were presented at the conference Statistics 2001 Canada: Fourth Canadian Conference in Applied Statistics held at Concordia University, Montreal, Quebec, Canada during July 6-8, 2001. The Conference attracted approximately 250 participants from all over the globe and featured 144 speakers in seven plenary sessions and 42 invited and contributed papers sessions. This conference was held as a sequel to the ones previously held at Concordia University in 1971, 1981 and 1991 as the first, second and third Canadian conferences in applied Statistics, all under the chairship of Professor T.D. Dwivedi. The 1971 conference was initiated with local Statistics groups in Montreal and Ottawa, and led to the creation of the Statistical Society of Canada. The next two conferences were held under the joint sponsorship of the two departments: Mathematics &: Statistics and DS & MIS. The Conference brought eminent academics and practitioners in the discipline of Statistics from all over the world, a majority mainly from Canada and USA. The tradition of these conferences has become a permanent feature in the minds of the Statistics community on the Canadian scene - to hold such conferences every 10 years in order to assess the existing techniques and new directions in Applied Statistics. The success of the Conference was summarized by one of the Plenary Speakers, in an email to me: "This is just a note to express my pleasure in attending the highly successful conference you have arranged. You indeed did a splendid job." Here, "you" must be understood in a plural sense. The task of this magnitude required help from many people, and the organizers are thankful to all those who helped. The success of the Conference is a pride for the team on the Organizing Committee, Scientific Committee and Student Volunteers. The papers included in this volume have gone through serious refereeing process. The editorial contribution provided by the members of the Editorial Board and many referees is highly appreciated. Furthermore, I am very thankful to all the authors for submitting their papers and continued cooperation. I would also like to thank Mr. Anthony Doyle of World Scientific Publishing (UK) Ltd. for enthusiastically supporting this project. I sincerely hope that this volume will prove to be an important research resource for the scientific community. Yogendra P. Chaubey
Concordia University Montreal, May 2002
v
This page is intentionally left blank
CONTENTS
Preface
v
On the Estimation of Size and Mean Value of a Stigmatized Characteristic of a Hidden Gang in a Finite Population R. Amah and S. Singh
1
Unemployment, Search and the Gender Wage Gap: A Structural Model C. Belzil and X. Zhang
12
Kullback-Leibler Optimization of Density Estimates A. Berlinet and E. Brunei
31
The Asymptotic Distribution of Spacings of Order Statistics M. G. Bickis
42
Theoretical and Computational Issues in Bayesian Analysis of Multivariate Ordinal Categorical Data with Reference to an Ophthalmologic Study A. Biswas
50
Second-Order Moments and Mutual Information in the Analysis of Time Series D. R. Brillinger
64
Spatial Association Between Community Air Pollution and Heart Disease: Analysis of Correlated Data S. Cakmak, R. Burnett, M. Jerrett, M. S. Goldberg, Arden Pope III and R. Ma
77
On the Robustness of Relative Surprise Inferences to the Choice of Prior M. Evans and T. Zou
91
VII
VIII
Using Survival Analysis in Pretern Birth Study C. Y. Fu and S. H. Liu
107
Asymptotic Forms and Bounds for Tails of Convolutions of Compound Geometric Distributions, with Applications J. Cai and J. Garrido
114
Improved Finite-Sample Inference in Overidentified Models with Weak Instruments N. Gospodinov
132
Probability of Correct Selection of Gamma versus GE or Weibull versus GE based on Likelihood Ratio Test R. D. Gupta, D. Kundu and A. Manglick
147
Survey Methodology for Studying Substance Use Prevention Programs in Schools S. M. Jones, B. C. Sutton and K. E. Boyle
157
One-Step Estimation for the Partially Linear Proportional Hazards Model X. Lu and R. S. Singh
169
A Nested Frailty Survival Model for Recurrent Events R. Ma, J. D. Willms and R. T. Burnett
186
Multiple Comparison Procedures for Linear Models under Order Restrictions H. Mansouri and R. Paige
198
Testing Goodness-of-Fit of the Gamma Models C. E. Marchetti, G. S. Mudholkar and G. E. Wilding
207
On Frailty Models and Copulas D. Oakes
218
Surrogate Data and Fractional Brownian Motion P. Rabinovitch
225
IX
Analysis of Occult Tumour Trial Data with Varying Lethality S. N. Rai, J. Sun and D. Hunt
233
Nonlinear Mixed Effects Models: Recent Developments P. S. R. S. Rao and N. Zaino
252
The Sampling Bias of Heckman's Sample Bias Estimator P. Rilstone and A. Ullah
263
Computational Sequence Analysis: Genomics and Statistical Controversies P. K. Sen
274
Universal Optimality of Completely Randomized Designs K. R. Shah and B. K. Sinha
290
A Nonparametric Comparison of Tumor Incidences with Intermediate Lethality and Different Death Rates J. Sun, Q. Zhao and S. N. Rai
296
Generalized Smoothed Estimating Functions with Censored Observations A. Thavaneswaran and J. Singh
304
Large Sample Asymptotic Properties of Ordinary Least Squares, Two Stage Least Squares and Limited Information Maximum Likelihood Estimators in Simultaneous Equations Models R. Tiwari and V. K. Srivastava
312
Relative Stability of Weighted Maxima of Bounded i.i.d. Random Variables R. J. Tomkins
324
Transient Analysis of Some Infinite Server Queues G. E. Willmot and S. Drekic
329
X
Empirical Likelihood Method for Finite Populations C. Wu
339
Second Order Estimating Equations for Clustered Longitudinal Binary Data with Missing Observations G. Y. Yi and R. J. Cook
352
O N T H E E S T I M A T I O N OF SIZE A N D M E A N VALUE OF A STIGMATIZED C H A R A C T E R I S T I C OF A H I D D E N G A N G I N A FINITE POPULATION
Department of Statistics,
School of Mathematics
RAGHUNATH ARNAB University of Durban-Westvile, Africa
Durban-4000,
SARJINDER SINGH and Statistics, Carleton University, Ottawa, Canada E-mail:
[email protected]
South
Ontario,
Arnab and Singh 1 developed theoretical formulae for estimation of size and the mean value of a sensitive character of a sub group (such as hidden gang) of a finite population using a randomized response survey in a unified set-up. In this article expressions for the estimators of the population characteristics and their variances are obtained for various sampling strategies used in practice. Performance of the proposed estimators are compared using numerical studies.
1
Introduction
In the collection of data of a sensitive nature, like induced abortion, drug addiction, suffering from AIDS etc. directly from the respondents, the respondents very often report untrue value and even refuse to answer. Warner 8 introduced an ingenious method known as randomized response (RR) technique for collection of data of a stigmatized nature by protecting confidentiality of respondents and produced an unbiased estimator for the proportion of persons belonging to a certain sensitive group under simple random sampling with replacement (SRSWR). This technique was extended for quantitative characteristics and or other sampling designs by various authors: see Chaudhuri and Mukherjee 3 and Arnab and Singh x . Singh, Horn and Chowdhury 7 were the first who used RR technique to estimate population parameters for both the qualitative and quantitative characters. Arnab and Singh 1 extended the method of Singh, Horn and Chowdhury 7 for SRSWR for an arbitrary sampling design. In the present investigation we have studied the performance of a few well-known sampling strategies which may be used in estimating parameters of two population characteristics in practice. Efficiencies of the proposed strategies are compared on the basis of simulation studies and a real survey data.
1
2 2
Formulation of the Problem
Suppose that a finite population P consists of N (known) identifiable units. Let NQ (unknown) denote the total number of persons belonging to some sensitive group (Gang) G(G P) and Xi be the value of the stigmatized quantitative character X under study for the i—th person in the Gang G. Arnab and Singh * considered the following sampling strategies for estimation of proportion -K = 2j£- and the mean px = ^2i=1 Xi/N of the sensitive character^ for elements of the population P belonging to the subgroup G in a unified setup. From the population P , two independent samples s\ and «2 of sizes ni and ri2 are selected by using some suitable sampling design p\ and P2 respectively. If the respondent labeled i, selected in the sample sk(k = 1,2) belongs to the sensitive group G then he or she has to disclose the true value Xi. On the other hand if the respondent does not belong to the sensitive group G the respondent has to perform certain randomized response trial using a suitable randomized device. It is expected that the chosen randomized device Dk should produce a value similar in range to the confidential character X for generating more co-operation from the respondents. Thus each of respondents selected in the sample will supply either the true value of X or a number obtained by randomized device Dk. The confidentiality of the respondent is maintained since the interviewer will not know whether the respondent is supplying true value or a value generated by the randomized device. The mean and variance of the randomized device Dk are assumed to be known to the interviewer and will be denoted by 9k and <J\ respectively. Thus for the i-th respondent, included in the sample sk, k — 1, 2, we obtain a randomized response:
_ J xit iiieG -\Rki, MiiG
Zki
. W
where Xi and Rki denote respectively the true value of the stigmatized character X and the response obtained by the randomized device. The above response can be written as zki = xJi + (1 - h)Rki
= xJi + I[Rki
(2)
where *
f 1, if i € G \0, MiiG
(3)
3
with I\ = \ — I{. Denoting for ER(VR) to the randomize device, we get ER(ZH)
as expectation (variance) with respect
= xJi + I'iOk = 7i (say)
(4)
and VR(zki) = I'MRki)
= I-al
(5)
Conditional on the randomized device Dk, let the population mean of z-values be given Zk = 53»=i Zhi/N. By using the standard results in finite population survey sampling, it may be estimated by a linear homogeneous unbiased estimator given by Tk = Y, KiZki/N
(6)
where bSki are known constants satisfying the following design unbiasedness condition Ylsh3i bSkiPk(sk) = 1 with pk{sk) being the probability of selection of the sample sk for the design pk. It is easy to see from (4) as in Arnab and Singh 1: E(Tk)=nnx
+ (l-n)0k,k
= l,2.
(7)
This gives rise to unbiased estimation of n and fix as given in the following theorems. T h e o r e m 2 . 1 . An unbiased estimator of the proportion 7r is given by * = *
-
^
(8
>
with the variance = [ )
V(Ti) + V(T2) (01-02)2
( )
T h e o r e m 2.2. An approximately unbiased estimator of \ix is given by A* = j -
(10)
where di = T20i - Ti62 and d2 = (T2 - 02) - {Ti - #1). An approximate expression of the variance of the above estimator is given as V{fix) = (6, - fix)2V(T2)
+ (92 - /ix) 2 V(T!)
(11)
4
Remark: Note that the variances of the unbiased estimators in the above theorems may be obtained by replaing V{T\) and V{T2) by their respective unbiased estimators. The following theorem is useful in this regard. Theorem 2.3. The variance of Tfc and its unbiased estimator are respectively given as
V{Tk) =
2 {al/N') ^ a i ( f c ) j : + ^ 7 t ( f c ) { a i ( f c ) - l }
(12) An unbiased estimator of the variance V(Tk) is given by V(Tk) = Vk + ^(l-n(k))
(13)
where a k
i( ) = X! 6 ^P f c ( S f c )' a y ( f c ) = J2 KibskjPk{sk), Sk.Bi
(14)
sk3i,j
«i - I P E ;§)<"*> - » + E E ^ c o - »)
(«)
and
*<*>-£E;$o-
(••)
Now any of the standard sampling strategies, such as Horvitz-Thompson , SRSWR, SRSWOR, Hansen-Hurwitz 4 , and Rao-Hartley-Cochran 6 can be used to obtain Tk in obtaining the estimator of n and fix. The results are summarized in the following section. 5
3
Estimators under Standard Sampling Strategies
The results given below under different strategies can be obtained by using different choices of T\ and Ti in the general estimator of 7r and \ix described in section 2.
3.1
Horvitz- Thompson Estimator Based on an Arbitrary Sampling Scheme
The Horvitz-Thompson 5 %HT of 7r, based on arbitrary probability sampling scheme can be obtained by choosing bSki — | , , and is given by T
* = JvE^=T^
(17)
The variance of TkHT is obtained by using Theorem 2.1 and is given in the following theorem. T h e o r e m 3 . 1 . The variance of TkHT is given by
(18)
+
^(E(Ay- 1 )+EE^( fc ))+ 2 ^(E x » E M*))] (19) ieG
^
}
i& jeG
ieG
jfreG
where G is the complement of the set G and Pij(k) = ^ !^rfc|fc) — 1 for k = 1, 2. Using standard results in sampling with unequal probabilities, we can set two unbiased estimators for VkHT gibem below:
*& - ^E 4 < 4 -1)+E E m-/"m] 4 c - *' (20) and (2) V,fcHT
_
1 y ^ y ^ 7ri(fc)7Tj(fc) - 7Tij(fc) , Zfcj
-N*1Z£
^m
{
Zfcj ^ 2 , akn
W)~^m
]
+
^
^\
(01\
(1_7r) (21)
The first estimator is obtained from (13) and the second one is an alternative estimator. 3.2
SRSWOR Sampling Scheme
The results given above may be specialized to the SRSWOR sampling by substituting, 7r;(fc) = ^ and ^ ( f c ) = n^h^Sw • The expression for Tk,
6
V(Tk) and VkHT for SRSWOR are obtained as folows: Tk = — V Zki = z(k) = TkwoR
V(TkWOR) = ^ — ^ nk
+ ( - - i ) # T ^ - ^ nk
(22)
+ 7T(1 - 7r)(^ - 0fc
(23) and
^ = ^->'W where (JVG - l)s* = £ Ei6Sfe(^-^( 5.5
fc
))
^
+
^1-^
_ (^.^xtf/No
(24)
and (nfc - l)^(fc)
=
2
Hansen-Hurwitz Estimator Based on PPSWR
Here we assume that the sample sk of size nk is selected by PPSWR method of sampling with pi as a probability of selection for the i—th unit of the population at every draw. Using the Hansen-Hurwitz 4 estimator of mean for PPSWR, an unbiased estimator of IT(IX + (1 — ir)6k is obtained by choosing bSki = rh^fL and it is given by
where ni{sk) denotes the number of times the ith unit appears in the fcth sample Sk and Zki denotes the average of the randomized responses from the sample sk. For the PPSWR sampling scheme we have: T h e o r e m 3.2. The variance of Tknu is given by
7V2nfc
-/v2")= r-t Pi
7 34
SRSWR Sampling Scheme:
Specializing the above result for the SRSWR sampling, an unbiased estimator of TT)IX + (1 — ir)0k is obtained by putting rii(sk) = ^ - in T£ and is given by
TkHH = — 5 3 Zki
= z fc =
( )
(27)
TkW
R-
Tlk i&Sk .,_
The variance of TWR is obtained by putting pi — jf in the expression for an V(THH) d is given by V{TkWR)
= — [ncrl + TT(1 - 7r)(Mx - 6xf + (1 - TT)^]
(28)
where a =
x
*
3.5
Rao-Hartley-Cochran
NGiJzY, i€G
i->
Strategy:
Following the standard terminology of Rao, Hartley and Cochran strategy, an unbiased estimator of TT/J,X + (1 — ir)Ok is given by
6
(RHC)
(29)
N
'tz*
for k = 1,2 where Pi(k) denotes the sum of p;'s belonging to the ith group Qi(k)(i = 1,2, ...,rifc) that was formed in selection of sample sk by RHC sampling scheme. For the RHC sampling scheme we have: T h e o r e m 3.3. 2
N(n
- 1)
k V{TkRHc) = j ^ n (N ZH\^T'+ k -1)4^ *
N-nk
+ N2{N-1) 4
N Uk
~ V k n f c (JV-l)4-Pi
?H*>
(30)
Relative Efficiency of Estimators
We next numerically compute the relative efficiency of various strategies. The measure for this is called Percent Relative Efficiency (PRE) of ei relative to
8 Table 1. Percent Relative Efficiency of SRSWOR vs. SRSWR for estimating proportion
•K
N
nx
"2
100 500 1000 1000 5000 10000 10000
20 100 100 200 500 2000 2000
20 200 200 200 500 2000 3000
0.1 109.4 114.8 106.9 109.3 104.4 109.3 112.3
0.3 115.9 129.7 112.9 116.2 107.4 116.2 123.2
0.4 117.5 133.8 114.4 117.9 108.2 117.9 126.1
0.6 119.6 138.7 115.5 119.1 108.8 119.2 128.1
0.7 120.5 139.9 116.6 121.0 109.5 121.1 130.7
0.8 0.9 121.4 122.6 140.3 139.5 116.7 116.5 121.9 122.9 109.9 110.3 121.9 123.0 131.5 131.9
e
SRSWR vs.
V r e2
"( \ Varei
(31)
SRSWOR
In order to study the performance of the proposed estimator under SRSWOR sampling with respect to SRSWR sampling design, we considered here a few fixed values of the parameters of the sensitive character and that of the randomization devices. We considered a hypothetical situation with 6\ = 20.5, <J\ = 2.50,02 = 25.6,a| = 2.54,^ = 20 and S* = 2.50. PRE of SRSWOR sampling with respect to SRSWR for estimating proportion for a selection of values of N, ni, ni and n is summarized in Table 1. It has been observed that for moderate sample sizes and population sizes the relative efficiency of SRSWOR sampling over SRSW^R sampling remains appreciable. Similar values are obtained for estimating fix, which are not displayed here. 4.2
PPSWR vs. SRSWR and SRSWOR
We consider here the parameters from the following small artificial population, which was used for checking the behaviour of the estimators under SRSWOR: We observed that it is possible to make PPSWR sampling more efficient than SRSWR and SRSWOR sampling designs based on the choice of selection probabilities. For illustration we consider 9X = 63.5,of = 2.50,02 = 65.6 and a\ = 2.54. From the artificial population we have \ix = 65,5^ = 9,iVG = 3
9 Table 2. Values in an artificial population
65 10
68 25
62 39
10 25
14 36
25 47
36 45
23 24
18 20 25 36
Table 3. Percent Relative Efficiency of P P S W R vs. SRSWR and SRSWOR
m = n2 = 3 TV
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
WR 419.4 207.1 168.6 152.3 143.2 137.3 133.1 129.9 127.3
WOR 269.1 152.4 131.2 122.1 117.0 113.6 111.2 109.3 107.8
=5 WR 422.7 207.9 169.2 152.9 143.7 137.8 133.6 130.4 127.9 TH
n2 = 7 WOR 249.1 133.8 112.9 104.1 99.1 95.8 93.5 91.7 90.3
and N = 20. We consider Pi = 0.07,P2 = 0.08,P3 = 0.06 and P 4 = Ps = ••• = P20 = 0.79/17 = 0.046471. Here 7r = 3/20 = 0.15 and the percent relative efficiency of PPSWR sampling over SRSWOR sampling is 256.7 and 179.7 percent Respectively, for n\ = 7i2 = 3. Similar experiments have been done by considering different values of n (except the value of NQ ) and changing the above artificial population accordingly, then the relative efficiency of PPSWR sampling over SRSWR sampling and SRSWOR sampling, respectively, for different choices of sample sizes and values of true proportion is given in Table 3. In case, the sample sizes are large relative to the population size, the loss in efficiency of PPSWR vs. WOR is not surprising. This may happen because in case of large sample sizes, WOR samples become more representative of the population and hence it may produce gains in efficiency with respect to any WR sampling design. It is more interesting to note that for small values of the true proportion of the gang, the PPSWR sampling shows efficient results than both designs for moderate sample sizes in comparison to population size. Similar results hold for efficiency of estimators of fj,x.
10 Table 4. Percent Relative Efficiency of RHC vs. P P S W R for estimation of w IT
m = Tl2 = 3
Til = 5, Tl2 = 7
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
102.5 103.6 104.2 105.6 107.4 105.4 103.9 103.1 102.8
101.3 102.6 103.6 104.5 106.6 104.6 102.4 102.2 101.9
Table 5. Percent Relative Efficiency of RHC vs. P P S W R for estimation of fj.x
•K
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
4.3
RHC Scheme vs.
rii = ri2 = 3 110.2 111.4 113.6 115.3 116.3 114.6 113.2 110.2 108.2
n\ = 5, n-i = 7 109.7 110.4 111.5 112.5 114.6 112.5 110.2 109.8 106.2
PPSWR
Percent Relative Efficiencies of RHC scheme vs. PPSWR have been computed for the same values of parameters as given in the earlier section and reported in Table 4, for estimating proportion. It is interesting to note that under RHC scheme there is slight but non-ignorable gain in efficiency over the PPSWR sampling for estimating proportion. However, for estimating fix, there may be considerable gain in efficiency under RHC scheme, as seen from Table 5.
11
Acknowledgments The authors are thankful to the referees for constructive comments and the editor, Yogendra P. Chaubey for bringing the original manuscript in the present form. References 1. R. Arnab and S. Singh, Ann. Inst. Math. Stat. (To appear, 2002). 2. P. Bratley, B. L. Fox and L.E. Scharge, A Guide to Simulation (SpringerVerlag, New York, 1983). 3. A. Chaudhuri and R. Mukherjee, Randomized Response: Theory and Techniques (Marcel Dekker, New York, 1988). 4. M.H. Hansen and W.N. Hurwitz, Ann. Math. Stat. 14, 333 (1943). 5. D.G. Horvitz and D.J. Thompson, J. Amer. Statist. Assoc. 47, 663 (1952). 6. J.N.K. Rao, H.O. Hartley and W.G. Cochran, J. Roy. Statist. Soc. Ser. B 24, 482 (1962). 7. S. Singh, S. Horn and S. Chowdhury, Austral. & New Zealand J. Statist. 40, 291 (1998). 8. S.L. Warner, J. Amer. Statist. Assoc. 60, 63 (1965).
U N E M P L O Y M E N T , S E A R C H A N D T H E G E N D E R WAGE G A P : A S T R U C T U R A L MODEL CHRISTIAN BELZIL AND XUELIN ZHANG Department of Economics, Concordia University, HSG 1M8 and Statistics Canada E-mail: belzilc @vax2. concordia. ca Using a structural model in which the decision to search is indogenous, we analyze how various parameters such as the mean wage offer, the offer probability and the value of non-market time can explain the gender wage gap. The model, implemented on a sample of young Canadian men and women who suffered a permanent job displacement, is able to explain both the higher incidence of right-censored unemployment spells and longer completed unemployment duration for females. The structural parameters imply that around 30% of the gender re-employment wage gap is explained by the presence of young children.
1
Introduction
For several decades, labor economists have tried to explain the existence of the gender wage gap. As women have constantly increased their share of the labor force, interest in the persistence of a significant difference in wages paid to female versus male workers (given identical observable characteristics) has grown steadily. To date, economists have used two fundamental economic theories to explain the gender wage gap: human capital theory and statistical discrimination. In the human capital approach, which dates back to Mincer and Polachek 5 , the gender wage gap is explained by the fact that females are relatively more productive in household activities than males. For this reason, women tend to invest less in labor market human capital or tend to work in occupations which do not require heavy human capital investments. The gender wage gap is therefore a result of discontinuous work pattern expectations. The literature on discrimination has, on the other hand, focused on the differential treatment of male and females workers who are otherwise identical. The notion of discrimination, dating back to Becker 1, is based on the fact that employers, facing uncertainty about individual productivity or individual labor force attachment, must sometimes focus on observed differences between males and females when hiring new workers. As a result, women may systematically receive lower wages or may be excluded from various occupations. The approach to the gender wage gap suggested in this paper is quite
12
13
different from most previous work. We use a partial equilibrium job search framework in order to investigate the gender differences in job search outcomes which take place following a permanent job displacement and analyze how much of the gender re-employment wage gap can be explained by the presence of young children. The model has several distinct features. First, the decision to search is endogenous. Using data on the willingness to search for a new job upon job displacement (participation data), we specify a model where both males and females may decide to drop out of the labour force. By allowing the decision to search to be endogenous, we avoid self-selection bias introduced if we were to sample only those women searching and work with duration and wage data (such as is typically done in the literature). Second, we examine the possibility that the information on job search activities provided by individuals over-estimates the fraction of displaced workers who decided to search. Third, we do not rely on homogeneity assumptions. We use observable characteristics such as age, education, marital status and child status (number of young children) to parameterize three important aspects of the search process; the value of non-market time, the mean wage offer, the offer probability. Reservation wages are treated as a function of unknown parameters and exogenous regressors; this must be solved using dynamic programming principles. Fourth, we also introduce measurement errors in observed re-employment wages. We believe that investigation of the gender wage gap using a structural model is particularly promising. First, the imposition of all the restrictions imposed by dynamic programming allows us to obtain separate estimates for all parameters of the mean wage offer and the reservation wage function. This means that we can actually compute how various regressors such as the number of young children, marital status, and education, effect on males and females differently. This can be achieved without having to impose exclusion restrictions such as those needed in reduced-form analysis of female wage functions. In other words, our model allows us to distinguish between supply side versus demand side factors affecting the gender wage gap. As a consequence, the structural estimates can be used to investigate gender differences in offered wages, reservation wages and re-employment wages, unemployment duration and on the incidence of non-participation following job displacement. The model is estimated from a sample extracted from the Canadian Labour Market Activity Survey. We use a sample of men and women who have experienced a permanent job displacement. The likelihood function is based on information on the decision to search or not to upon displacement (we refer to this as participation data), duration data and re-employment wages.
14
The paper is arranged as follows. In Section 2, the theoretical model is presented in detail. Section 3 is devoted to a discussion of econometric issues and a presentation of the likelihood function. The data set is presented in Section 4 while Section 5 is devoted to a discussion of the main results while an analysis of the gender wage gap resulting from the structural parameter estimates is performed in Section 6. The conclusions are summarized in Section 7. 2
A Model with Endogenous Job Search
To investigate gender differences in job search outcome, we specify a stationary search model similar to the one estimated by Belzil and Zhang 2 to investigate child care and search costs. The model is applied to full-time male and female workers who are affected by a permanent job displacement. We disregard temporary layoffs. The model is constructed around the following assumptions. 1. Expected lifetime earnings are maximized over an infinite horizon and discounted at rate 6 — T^— 2. Individuals receive at most one offer per period and the probability of receiving an offer is given by £. Search is costless. 3. The unemployed receive unemployment benefit b for each period of unemployment. 4. For those who decide not to search, the value of non-market time per period is given by •d(K) = -
exp(rK)
T
in which ~&(K) may be interpreted as the monetary value of the output produced at home, K denoting the number of young children at the time of the displacement. 5. We assume that job offers are indexed by an hourly wage rate and that, upon acceptance, a job is held forever. Wage offers are normally distributed with mean /x and variance a\. Using the previous assumptions, it is straightforward to derive the value functions associated with each state. These are
15
Vu = b + (3E[V]
(2)
V
(3)
" = r7p
(lexP(rK)^
in which Ve(w) is the value of accepting employment at wage w, Vu is the value of unemployment (search) and Vn is the value of leaving the labour force to involve in household activities. The value of following the optimal policy in the future, E[V], is given by E[V] = — ^ [(1 - Ow* + £P(i» < w*)w* + £P{w > w*)E(w\w > w*)} (4) where w* denotes the re-employment reservation wage (for those who decide to search and remain in the labour force) . In the case where wage follow a normal distribution with density >(-^Iii) and cdf $ ( ^ l i i ) , E[V] can be re-expressed as
E[V\ = - L - L + K -
M)*(^) + <^(^)1 +
(1 - 0
(-^)
(5) The necessary and sufficient condition to remain in the labour force and search (following displacement) is given by
- ^ 3
( 1 exp(Tlo) < b + PE[V] = ^
K )
(6)
E c o n o m e t r i c Specification
In this section, we present the estimation strategy used to investigate the gender wage gap and other related issues. 3.1
Parameterization
To take into account individual unobserved heterogeneity in the value of non market time, we allow the value of non-market time to incorporate a stochastic element: •d(K) =a + -
exp(rK)
T
in which a~N(0,(T2n) In order to introduce observed heterogeneity, we parametrize the offer probability and the mean wage offer. Initially, the job offer probability is allowed
16
to depend on child status only. This might be relevant if for instance, those women with young children searching for a new job are less flexible or have less time to devote to search activities." To take this into account, we specify the offer probability as
£ = exp(£o + £i^) We parametrize the mean wage offer distribution as a function of education binary variables (primary, secondary and university) indicating the highest degree attained by a given individual, and age dummies: agei (16-24, inclusive), age2 (25-34, inclusive) and age3 (35-44, inclusive). Note that the reference groups are those who have a secondary education and are aged between 25-34. M = fioj + Mi ' primary + ^2 • university + /x3 • age1 + (14 • age 3
(7)
where fioj, {j = 1,2) is an individual effect intended to capture unobserved heterogeneity in the mean wage offer. In order to estimate the discount factor and the sample proportion for the unobserved heterogeneity term, we use the following transformations: a
exp(6/3) 1 + exp(6/3) exp(My) 1 + exp(/i p )'
3.2
Likelihood Functions
Let Si = 1 for those women who decided to search and 0 for those who dropped out. Using condition (6), the probability that a woman i will search is given by Pr( S i = 1) = Pr [a < h(w*)] = $
W)
(8)
where h(w*) = w* - - exp{rK)
(9)
The probability that a woman decides not to search, Pr(si = 0), follows trivially from equations 6 and 7. As it is the case in most economic surveys of a
Of course, it is impossible to distinguish this hypothesis from the hypothesis that employers are less likely to offer employment opportunities to females with children.
17
the unemployed, some of the individual reporting that they are searching are still unemployed by the end of the survey time. For women, we only observe a censored unemployment duration. For these women who completed their unemployment spell, we observe a completed duration ti and a re-employment hourly wage Wi. Completed observations are indexed by the binary variable Ci = 1 while censored observations are indexed by Cj = 0 . It is well known that wage data is often subject to measurement errors. We assume that observed wages, wt are given by i.i.d.
wt =wt+Et,
(10)
withet sim N(0,cr*)
in which i.i.d.
wt~n
+ et,
(11)
withe t sim N(0,
Assuming that it and et are independent, then (12) Ot=et + et i.i.d. N(0,CT2) where erf = a\ + o\. It is easy to see that the joint probability of receiving an acceptable offer and observing a re-employment wage wt, is given by w* - n - ^t(wt - fx) Pr(wt > w* ,wt) = l - * ( -
(^(i-V))*
-) • 7 ^ ) OQ
d3)
Og
where (14)
s/rt + ^I
In the paper, we consider two distinct likelihood functions which differ according to whether or not participation data is judged reliable or not. 1. Model with Participation Data (Model 1). When we rely on the answer provided by each individual to the question (s) pertaining to their job search status, the log likelihood function for the entire sample is given by
Ll= X > S 1
.$
MO
h(wT) '
log [(1 - ^(w*))^-1^.
i-.Si=l
Pr(w > w*,w)]
+ (15)
18
Model without Participation Data (Model 2). As reported in Section 4, we have found that quite a large number of individuals mistakenly reported a willingness to search, For example, some individuals who reported they were not willing to work and were not searching for job after being laid-off were reemployed before the end of the survey, while a large number of individuals, reporting that they were searching, remained unemployed for a long period. We consider that the probability that an individual did not find re-employment by the end of the survey is the sum of the probability that he(she) was not searching and the probability that he(she) was indeed searching weighted by the probability that no acceptable offer had been found. The log likelihood is then given by L2 = J2
log | $ fl^ll)
(l _ ^ ( a , ; ) ) * ' - ^ . Pr(«, > w*, w)
i:,Ci=l
\ r ^(.-.(^>)+.(*£!)<.-««>>"•
(16)
i:Ci=0
It is easy to see that the following parameters are identifiable (and hence estimable); the standard deviation of a, (an), the parameter of the function representing productivity at home (T), the wage offer distribution parameter {(i) and cru, the standard deviation of the measurement error term
The Canadian Labour Market Activity Survey
The sample analyzed in this paper is drawn from the 1986-1987 Canadian labour Market Activity Survey (LMAS). The LMAS was designed as a replacement for the Annual Work Patterns Survey in order to provide measures Many authors set the discount rate /3 to a constant. To do so, we make use of the optimality condition and apply a Newton-Raphson algorithm to obtain estimates of the reservation wage. c
19
of labor market dynamics. The data have been collected by Statistics Canada and Employment Immigration Canada. 4-1
Description
The LMAS is based on a stratified sample of Canadians aged between 16 and 69. The sample contains workers who held a full time job and experienced a job displacement. Workers are defined as full-time workers if they worked at least 120 hours per month. Job displacement information is based on question 34 of the LMAS: "What was the main reason for stopping work?". The reasons corresponding to a permanent separation are non-seasonal economic/business conditions, company moving or going out of business, end of a temporary (non-seasonal) job and dismissal by the employer. The question is therefore consistent with the notion of displacement used in the empirical literature. Given these restrictions, we started with 1910 observations: 1091 females and 819 males. Since we were interested in young and prime-age male and female workers, we eliminated all those older than 45 years old as well as those working in primary sector occupations such as farming, fishing, hunting and other occupations of this type. The resulting sample was 794 females and 494 males (1288 observations in total). In order to estimate the model, a measure of unemployment duration (perhaps censored) is typically needed. Jones and Riddell 4 point out that the LMAS is unique in that the design of the survey attempts to distinguish between individuals who search throughout the entire non-employment spell and those who did not (those out of the labor force and those who have a "marginal attachment" to the labor force). The distinction is made according to the number of consecutive weeks of search reported by every individual and the number of weeks of non-employment in which the individual reported whether he/she was willing to accept any job. A "marginal attachment" applies when an individual reports not searching while also reporting that available work would have been accepted. The post-displacement status can actually be a complex sequence of states and, as documented by Jones and Riddell 4 , the exact nature of the non-employment spell is not clear. To get around these problems, we estimate our model with fixed nonmarket time value using the information provided in Question 21 (Did you look for work at any time during this period ?) and Question 24 (Did you want to work at any time during this period ?). We define a displaced worker as a non-participant if the individual did not look or did not want to work during the period. The duration of unemployment is computed from the difference between the starting week of the new job (when applicable) and the week of
20
termination of the previous job. The LMAS has information on hourly wages as well as hours per week for all jobs sampled. The hourly wage rate used in this study is derived from information on hours worked per day, hours worked per week, total weeks worked and annual earnings in a given job. Using information on whether one has received some unemployment benefit or not, we can construct a measure of the amount of unemployment benefit received for those who report having received UI benefit. To do so, we use the fact that, in 1986, the maximum insurable earnings were $495 per week and the replacement ratio was 60%. Apart from the information on the decision to search, non-employment duration and accepted wages, the LMAS reports information on the number of young children; this can identify those women who had children when displacement took place. We also observe the marital status (married, divorced, single or cohabitating) so that lone mothers and married mothers can be identified. Education level is reported by class variables (there are 5 classes). For the purpose of this study, we constructed three (3) education dummies. The first group contains those with secondary schooling, the second one contains individuals with post-secondary schooling while those with university training are in the third group. Age is also reported as a class variable. For our sample, we construct three (3) age dummies corresponding to 16-24, 25-34, and 35-44. Variable definitions and sample statistics can be found in the Appendix.
4-2
Some Features of the Data
One of the most striking features of the data is the difference between male and female unemployment duration. Although the overall averages are quite close (20 weeks for males and 23 weeks for females), censoring appears more important for females. Among 794 women aged below 45, only 264 (33%) had found a new job by the end of the survey. The remaining 558 women are split between those who were still searching at the end of the survey and those who report not searching after the loss of their previous job. The difference in sample averages between male and female unemployment duration is, however, not explained by censoring alone. Belzil and Zhang 2 also report that simple non-parametric rank tests indicate clearly that females experience longer unemployment spells than males. It is interesting to note that despite the large number of censored non-employment spells, only 6% (48/794) report not being available to search. Among 494 males, more than 50% had found a new job before the end of the survey. Interestingly, although preunemployment wages and re-employment wages for both males and females are relatively similar (10.09$ vs 9.61$ for males and 6.28$ vs 6.95$ for fe-
21
males), we found female average pre-unemployment wages for those who have no children and those with children to be almost the same (6.30$ vs 6.35$). However, the average re-employment wage of females with no children (7.00$) exceeds the average re-employment wage of those with children (6.26$) by more than 11%. 5
Empirical Results
As discussed earlier, we worked with two versions of the model. The first version is estimated under the assumption that the reported participation decision information is reliable (Section 5.1) while the second version uses the likelihood function which does not use participation data (Section 5.2). Section 5.3 is devoted to the introduction of marital status while, in Section 5.4, we investigate the initial condition problem that could arise if permanent unobserved heterogeneity in the labor market is correlated with child status. 5.1
Model with Participation Data
The results obtained under the assumption that the reported participation decision is reliable are in Table 1-A (Model I). A summary of the value of nonmarket time and the offer probability implied by the structural parameters are in Table 1-B (Model I). Estimates for the variance of the true wage offer (1.36 and 0.09) and the variance of the measurement error (2.69 and 2.13) illustrate the importance of measurement error. Unobserved heterogeneity in mean wage offer is also found to be important; about 10% of male and female workers are at the high end of mean wage offer ($16 for males and $15 for females).d Although the gender gap in mean wage offer appears to be very small, the differences observed in the value of non-market time, offer probabilities and the discount factors reveal differences in job search behavior between males and females. The parameter estimates of the effect of young children on the offer probability (£i) indicates that men with children receive more offers than those without children while it is the reverse for females. For those without children, these estimates imply that males face a probability of 0.05 per week of receiving an offer while females would have a probability of 0.31.With one child, female workers still receive offers at higher rate than male workers (0.15 vs. 0.13). These estimates do not seem to fit the data very well because, in the sample, females experience longer spells of unemployment. "We started to allow for three mass points in the mean wage offer. However, we found that two of them are fairly close to each other. This lead us to specify only two mass points for
22
The estimates for bp imply a very large a difference in discounting behavior between males and females as the annual discount rates are 29% for males and 0.1% for females. Given that our sample consists of relatively homogeneous individuals (in age and education, for example), it is quite unlikely that such a difference is the case. The estimates for the value of non-market time (Table 1-B, Model I) indicate that males and females tend to have similar home productivities (a questionable result) and that home productivity is increasing with child status (as expected). 5.2
Model without Participation Data
Table 1-A (Model II) contains the result for the model estimated under the assumption that the reported participation decision may not be reliable. The estimates of cre for males and females (2.44 and 1.21 respectively) indicate again the importance of measurement error in wage data while the two constant terms ^oi a n d M02 (15.33 vs. 7.03 for males) and (11.38 vs. 3.77 for females) indicate the importance of unobserved heterogeneity. Other things equal, university education raises the wage rate by $1.32 for male and $1.80 for female workers when compared to the reference group (high school education). Although the sample contains only those who were below 45, we still find a positive effect of age on expected wages although it is insignificant for females. This is perhaps a reflection of the fact that females have flatter age-earnings profiles than males. From Table 1-B (Model II), we see that offer probabilities increase with child status for males but decrease with child status for females. Female displaced workers with no children receive offers at a higher rate than males with no children (0.20 vs. 0.07). With one child, male and female workers receive job offers at a similar rate (about 0.1); with two children, the probability a male worker receives job offer per period of time (one week in the current study) becomes 0.15, but that for a female worker becomes 0.05. The discount factors estimated are 0.965 for male workers and 0.94 for female workers. These estimates are close to those obtained by Christensen and Kiefer 3 . Although the results show that male job searchers put a relatively higher value for future offers than their female counterparts, we no longer observe a huge gender differences in discount rates as obtained with Model l. e The most striking difference between Models I and II lies in the value of home productivity. When we let the data determine the probability that a e
In a companion paper (Belzil and Zhang 2 , 1996), we estimate of structural search model with child care costs in which the discount factor in treated as an individual effect. We find a particularly large dispersion in individual discount factors amongst women.
23
Table 1-A: Estimation Results for Models I and II
0"u
Mp
Moi M02 Ml M2 M3 M4
£o T
b/3
logl
Male Model I Model II 1.3571( 2.38) 1.7748( 4.23) 2.6860(12.60) 2.4400(11.09) -2.3626(-7.94) -1.8875(-6.57) 16.0007(15.12) 15.3292(17.42) 7.2211( 7.50) 7.0281(10.75) -0.8517(-1.36) -1.1636(-1.36) 0.8228( 1.65) 1.3208( 2.14) -0.8546(-1.97) -1.2983(-2.29) 0.0953( 0.22) 1.2450( 1.64) -2.9462(-6.47) -2.6618(-14.7) 0.8731( 2.67) 0.3829( 2.63) 0.3972( 5.51) 0.5303( 2.76) 2.8689( 7.27) 10.3291( 2.99) 5.1807( 8.85) 3.3061( 7.80) -1851.46 -1832.16
Female Model I Model II 0.0882( 3.72) 2.6149( 4.15) 2.1324(18.99) 1.2058( 8.44) -2.5443(-7.75) -2.8972(-5.30) 15.0150(20.42) 11.3806( 4.26) 7.0468(68.98) 3.7692( 1.86) -0.3807(-2.28) -I.2674(-1.54) 0.3431( 3.54) 1.8027( 4.69) -1.2255(-10.9) -0.9423(-1.64) -0.3014(-3.62) 0.4177( 0.75) -1.1577(-10.9) -1.6048(-2.86) -0.7581(-3.34) -0.7103(-3.16) 0.4418( 8.94) 0.1628( 5.24) 2.8087(12.95) 9.3314( 1.85) 10.7921(17.08) 2.7810(10.48) -1897.55 -1892.86
Note: Asymptotic t-ratios are in parantheses Table 1-B: Home Time Value and Offer Probability
Female
Male
K K K K K K
= = = = = =
0 1 2 0 1 2
Model I Hm. Value Offer Prob. 2.26 0.31 3.52 0.15 5.48 0.07 2.52 0.05 3.75 0.13 5.57 0.30
Model II Hm. Value Offer Prob. 6.14 0.20 7.23 0.10 14.46 0.05 1.89 0.07 3.21 0.10 5.45 0.15
given individual is searching, the household productivity parameter r estimates are 0.53 for males and 0.16 for females. The implied home time values, shown in Table 1-B (Model II) indicate that females are more productive at home than males. This seems to indicate that answers provided by individuals on their job search status tend to over-estimate search intensity and under-estimate home productivity. Overall, we believe that the estimates obtained without relying on participation data yield more convincing results. However, on the matter of which
24
model fits the data better, the log likelihood values do not yield a clear indication. 5.3
The Effects of Marital Status
The estimates presented in Table 1-A were obtained from model specifications in which marital status played no part. As most static and dynamic models of female labor market behavior are based on the fact that married and single females behave differently, it is reasonable to investigate whether marital status has an impact on job search outcomes for female displaced workers. To do so, we allow the offer probability and home time value to be a function of marital status. The parameterization for £ and r are £ = exp(£o+l;iK + £2M) and r = exp(ro + T\M) respectively where M is equal to 1 for those who are married or live in a permanent union and 0 otherwise. Marital status is introduced in the offer probability function in order to reflect the possibility that married women have less (or more) time to devote to search activities. We retained a specification without participation data (such as in Model 2). The results (in Table 2-A, under Model III) show that marital status has basically no effect on male offer probability and male home time value, though it does raise female home productivity. The level of significance is however relatively low (t-ratio of 1.48). Although being married seems to raise the offer probability for males (0.35) and females (0.15), standard errors are also large. Introducing marital status does not seem to lead to any major improvement. 5.4
The Initial Condition Problem: Child Status and Unobserved Heterogeneity
The estimates obtained in Section 5.1, 5.2 and 5.3 were generated under the assumption that unobserved ability in the labor market production is independent from any individual characteristic, including child status. It is possible that the distribution of unobserved labor market ability among women who have children is different than for women with no children. For instance, one could imagine that females who were working full-time in the presence of young children have relatively higher labor market ability than those women with no children. In what follows,we extend the model to take into account correlation between unobserved heterogeneity in the mean wage offer and female child status. The probability of belonging to the high level, px, is given by exp(^ p0 + VpiK) 1 + exp(/iipo + fipiK)
25
The results (Table 2-A, Model IV and Table 2-B) do not support the hypothesis that permanent unobserved heterogeneity in labor market ability is correlated with child status. Although the probability of belonging to the high ability class (/(Xoi = 15.3) increases with K, the very low t-ratio (0.04) implies that sorting is insignificant. 6
Investigating the Gender and the Fertility Wage Gap
In this section, we use the structural parameters obtained for Model II (where participation data are not necessarily reliable) to calculate the mean wage offer, reservation wage and observed re-employment wage rate (all measured in $ terms on an hourly basis)/ The gender gap (1 minus the ratio of female reemployment wage to male re-employment wage) can be calculated at different levels of education, age and number of children. Controlling for age and education, we can define a gross gender wage gap that reflects gender as well as child status. This gross gender wage gap is then decomposed into a portion that is due to child status and a residual part. The mean wage offer n is calculated according to equation 8, while the reservation wage is obtained through the Newton-Raphson procedure using the estimates of the structural parameters. The expected re-employment wage can be calculated using, E[w\w > w*] = fi + a, Table 3 presents the expected mean wage offer, mean reservation wage and average expected re-employment wage for both male and female workers at different levels of age, education, and child status (mean wage offer is independent of child status). Table 4 contains the corresponding gender wage gap. The gender wage gap in mean wage offers ranges between 60% (for low education workers) and 40% for high education workers. We note that the reservation wage gap is smaller than the wage offer gap. This implies that female hazard rates will be smaller than male hazard rates. Note also that the reservation wage gap rises significantly as the number of children increases; it goes from around 20% (for those without children) to around 50% for those with 2 children. Both the reservation wage gap and the re-employment wage gap are widening with age and shrinking with respect to education level. More importantly, the gender wage gap becomes larger as the number of children increases. For younger females, the gender wage gap goes from 15% to 20% •f All values are averaged over the two support points in the mean wage offer.
26 Table 2-A: Estimation Results for Models III and IV
Female
Male
Mp
Model III 1.8470( 4.32) 2.4122(11.35) -1.9204(-6.82)
MpO Mpl Moi M02 Ml M2 M3
m &
15.2750(17.03) 6.8664( 9.43) -1.1565(-1.38) 1.3364( 2.18) -1.1204(-1.90) 1.0686( 1.36) -2.8432(-12.6) 0.3281 ( 2.00) 0.3458( 1.24)
Model IV 1.7688( 4.25) 2.4370(11.15) -1.9904(-0.72) 0.2244( 0.04) 15.3283(17.53) 7.0213(10.85) -1.0958(-1.29) 1.3650( 2.21) -1.2926(-2.29) 1.2206( 1.62) -2.6669(-14.9) 0.3843( 2.65)
Model III 2.6141( 4.68) 1.2085( 8.59) -2.9910(-5.39)
11.7045( 4.58) 3.9000( 2.09) -1.2362(-1.49) 1.7563( 4.66) -1.0650(-1.86) 0.1950( 0.34) -1.6698(-3.62) -0.6615(-3.06) 0.1506( 0.56)
0.5274( 2.75)
T To Tl
<7n b/3
Logl
-0.0461(-0.73) -0.9884(-1.31) 8.6643( 2.38) 3.3617( 7.50) -1831.14
10.2342( 3.01) 3.3077( 7.82) -1832.16
Model RA 2.6265( 4.13) 1.2041( 8.46) -2.9026(-4.87) 0.0296( 0.04) 11.3294( 4.23) 3.7276( 1.83) -1.2580(-1.53) 1.8044( 4.70) -0.9287(-1.63) 0.4181( 0.75) -1.5964(-2.81) -0.7064(-3.17) 0.1637( 5.42)
-1.6430(-8.01) -0.2838(-1.48) 8.6094( 2.12) 2.7884(10.06) -1891.70
9.1749( 1.92) 2.7833(10.58) -1892.86
Note: Asymptotic t-ratios are in parantheses
Table 2-B. Marital, Offer Prob. and Hm. Value
Female
Male
K K K K K K
=0 =1 = 2 =0 =1 =2
Married Home Value Offer Prob. 6.87 0.22 7.95 0.11 9.19 0.06 2.81 0.08 4.02 0.11 5.73 0.16
Single Home Value Offer Prob. 5.17 0.19 6.27 0.10 7.61 0.05 1.05 0.06 2.72 0.08 7.07 0.11
Table 3: Mean Wage, Reservation Wage and Re-employment Wage K=0 Mean Wage Offer (/x) Primary Agei=l Secondary University Primary Age 2 =l Secondary University Primary Age3=l Secondary University Reservation Wage Primary Agei=l Secondary University Primary Age2=l Secondary University Primary Age3=l Secondary University Re-employment Wage Primary Agei=l Secondary University Primary Age 2 =1 Secondary University Primary Age3=l Secondary University
Male K=l
K=2
5.66 6.82 8.14 6.96 8.12 9.44 8.20 9.37 10.69
K=0
Female K=l K=2
1.96 3.23 5.03 2.90 4.17 5.97 3.32 4.59 6.39
4.92 5.57 6.35 5.65 6.34 7.15 6.39 7.11 7.94
5.30 6.03 6.89 6.11 6.88 7.78 6.93 7.73 8.65
5.68 6.47 7.40 6.56 7.39 8.36 7.45 8.30 9.29
3.86 4.45 5.43 4.29 4.94 6.00 4.49 5.18 6.25
3.42 3.87 4.64 3.74 4.25 5.09 3.90 4.43 5.30
3.10 3.39 3.94 3.31 3.66 4.27 3.42 3.79 4.42
6.73 7.63 8.72 7.74 8.70 9.85 8.77 9.78 10.97
6.92 7.83 8.92 7.94 8.91 10.04 8.98 9.98 11.15
7.13 8.06 9.16 8.17 9.14 10.28 9.21 10.21 11.38
5.40 6.17 7.40 5.96 6.79 8.09 6.23 7.08 8.41
5.10 5.77 6.91 5.57 6.34 7.56 5.82 6.61 7.87
4.88 5.48 6.53 5.31 6.00 7.16 5.53 6.25 7.46
28
(with no children) up to 30% (with 2 children). For the older age group in the sample, the wage gap goes from 25% (with no children) to 35% to 40% (with 2 children). Table 4: Predicted Gender Wage Gap (%)
Agei=l
Age2=l
Age3=l
Primary Secondary University Primary Secondary University Primary Secondary University
Mean* (M) 65 53 38 58 49 37 59 51 40
Reserv. Wage K=0 K = l K=2 22 36 45 20 36 48 15 33 47 24 39 45 22 38 51 16 35 49 30 44 54 27 43 54 21 39 52
Re-empl. Wage K=0 K = l K=2 20 26 32 19 26 32 15 22 29 23 30 45 22 29 34 18 25 30 29 35 40 28 37 39 22 29 34
Our measurement of the contribution of child status to the gender wage gap is based on three expected wage values; one for a male worker with no children, one for a female worker who has no children and the third value is from another female worker who has one child. The difference in expected wage between a male worker who has no children and a female worker who has one child becomes the gross gender wage gap (reflecting the effects of gender and child status) while the gap between the two female workers is referred to as the fertility wage gap. We normalize the gross gender wage gap to be unity, and calculate the percentage contribution by fertility. Table 5 presents the calculated fertility wage gaps for different age and education groups. In the first column, we compare a male worker who has no children and two female workers, one has no children, the other has 1 child. In the second column, the fertility wage gap is computed using a male and a female worker with no children, and a female worker with two children. On average, more than 20% of the gender wage gap observed for females with one child is due to the presence of this child. For female workers who have two children, the presence of young children contributes about 30% to the gross gender gap in expected wage. 7
Conclusion
The fact that males tend to earn more than females is a stylized fact present in all industrialized countries. Several explanations, reviewed in the Introduction, have been proposed by labor economists. In this paper, we have
29
proposed a new approach to the issue; namely, we have investigated how the job search process may differ across males and females. Using a structural job search model with endogenous search, we have estimated how differences in structural parameters can affect the discounted expected lifetime earnings of unemployed females and found that the presence of young children plays an important role in the setting of the optimal reservation wage. More precisely, we found that female workers with young children have a substantially lower probability of receiving an offer while the opposite is true for males. In other words, the presence of young children translates into a significant efficiency loss in their of job search. We also found that the value of nonmarket time for females with young children is actually higher than for males. Overall,the self-reported search status in the Labour Market Activity Survey seems to over-estimate the number of female displaced workers who are actually searching. This is consistent with the higher incidence of censoring among women than among men. Table 5: Fertility Wage Gap (%)
Agei=l
Age 2 =l
Age 3 =l
Primary Secondary University Primary Secondary University Primary Secondary University
K=l 18 22 27 18 19 23 13 14 17
K=2 28 32 40 27 29 35 22 24 27
One of the advantages of obtaining structural estimates is that we can distinguish supply side effects (non-market time) from demand side effects (offer probability) and impute a fraction of the gender wage gap to the efficiency loss in job search explained by the presence of young children. Our estimates imply that females with no young children face parameters much the same as male workers. When females with no children are compared with males with no children, the gender wage gap is between 15% to 25%. We found the gender wage gap to be increasing with the number of young children (up to 35% for females with no children). However, the results indicate that around 30% of the gender wage gap is actually accounted for by the effect of young children on the valuation of job search activities (through the offer probability). As well, our results indicated that the gender gap in the mean wage offer
30
received tends to exceed the gender reservation wage gap and is therefore consistent with the fact that females experience longer spells of unemployment than males. Finally, the results presented are consistent with the claim that the gender wage gap is typically small when males and females enter the labor market but tends to increase with age or experience. Appendix A: Variable Definition and Sample Mean Variable Marital Kid Prim Second Univ Agei Age2 Age 3 Wagel Wage2 U-Durat UIB Censor Parti N
Male 0.5992 0.3765 0.1174 0.6336 0.2490 0.2794 0.4413 0.2794 10.0923 9.6082 20.5243 3.9212 0.5041 0.9575 494
Female 0.5781 0.3325 0.0781 0.5982 0.3237 0.4181 0.3967 0.1851 6.2843 6.9574 22.6373 1.7465 0.3325 0.9396 794
Definition Marital Status, 1 if married Number of children under 6 =1 for elementary or lower education = 1 for high school or post-secondary =1 for university education =1 if age falls between 16-24 =1 if age falls between 25-34 =1 if age falls between 35-44 Hourly wage rate before separation ($) Hourly re-employment wage rate ($) Unemployment duration in weeks Hourly UI benefit ($) =1 for censored unemployment spell =1 for labour market participant Sample size
References 1. Becker, Gary (1971) The Economics of Discrimination , University of Chicago Press, Chicago. 2. C. Belzil and X. Zhang, Young Children and the Search Costs of Unemployed Females (Working paper, European University Institute, San Domenico di Fiesole, Italy, 1996). 3. B. J. Christensen and N. Kiefer, Econometric Theory 7, 464 (1991). 4. S. Jones and W. C. Riddell, Journal of Labor Economics 13, 351 (1995). 5. J. Mincer and S. Polachek, Journal of Political Economy 82, ( (1)974). 6. N. Stokey and R. E. Lucas, Recursive Methods in Economic Dynamics (Harvard University Press, Cambridge, MA, 1989). 7. P. Swaim and M. Podgursky, Journal of Labor Economics 12, 640 (1994). 8. K. Wolpin, Econometrica 55, 801 (1987).
KULLBACK-LEIBLER OPTIMIZATION OF D E N S I T Y ESTIMATES A. BERLINET AND E. BRUNEL Laboratoire de Probabilities et Statistique, Universite Montpellier II, CC 051, Place Eugene Bataillon, F-34095 Montpellier Cedex 5, Prance E-mail:
[email protected] We analyze the asymptotic behavior of the expected value of the Kullback-Leibler information divergence between the true density and modified histograms. This provides an asymptotically optimal value for the smoothing parameter which can be used in plug-in methods.
1
Introduction
The Kullback-Leibler divergence is a basic tool in decision and information theory. It has a lot of attractive features such as its additivity property used in projection pursuit density estimation and it defines a strong notion of convergence (stronger than L\-convergence) in the set of probability densities. Also in some parametric models the Kullback-Leibler divergence can be expressed as simple function of the parameters. However, with this error criterion, difficulties appear when standard kernel estimates or histograms are considered to estimate an unknown probability density from a sample of observations. On the one hand tail properties of the kernel dramatically influence the behavior of estimates (Hall 9 ) . On the other hand empty cells, occuring with high probability, make the Kullback-Leibler divergence between the true density and the histogram equal to infinity. Recently, modified histograms circumventing the problem of empty cells have been shown to have nice properties with respect to information divergences (Barron, Gyorfi and van der Meulen 2 , Berlinet, Gyorfi and van der Meulen 5 , Berlinet, Vajda and van der Meulen 6 , Gyorfi, Liese, Vajda and van der Meulen 7 ) . As usual in such a non parametric density estimation framework the crucial issue is about the number of cells to take into account. From upper bounds given by Barron and Sheu 3 and Barron et al. 2 for densities with bounded support and finite Fisher information it follows that the number of cells should be of order n 1 / 3 , where n is the number of observations. Here we carefully analyze the behavior of the expected value of the Kullback-Leibler divergence between the true density and the modified histogram and give the exact constants in its asymptotic expansion. This provides an asymptotically optimal value for the number of cells which can be used in plug-in methods.
31
32
2
Kullback-Leibler divergence and modified histograms
First recall that if / and g are two probability densities on R, the KullbackLeibler information divergence of / with respect to g is defined by J f(x)
i(f,g) =
log(f(x)/g(x))
d\(x)
if / «
g (1)
oo
otherwise,
where A is the Lebesgue measure. Integrals without limits are taken over the whole real line. Let us now turn to the estimation problem and the definition of estimates. Let (Xi) j>i be a sequence of independent real random variables with the same unknown distribution fx on R. The probability measure fi is supposed to have a density / with respect to A. Define a sequence of integers ( m n ) n e N . such that 0 < mn < n and put hn = l/mn. The integer m„ will be the number of cells for n observations X\,..., Xn and hn is called the smoothing parameter. We suppose that we know g, a density which can be seen as a reference density, satisfying l(f, g) < oo and denote by v the probability measure on (R, S(R)) with density g. In practice g will correspond to some a priori idea on the unknown density. For instance one can suspect that the true density / is not far from some known parametric density g. Contamination and more generally mixture models are examples of such situations. When the entropy of / and E{\og \X\)+ are finite a density constant on (—1,1) and behaving like constant/x2 if |x| > 1 is such that X(f,g) < oo (Gyorfi and van der Meulen 8 ). Define finite partitions Vn = {An>i, • • •, An<mn} of R such that the An
1. The density estimate introduced by Barron l is then defined as Vh,Xx) = an [l + niJ,n(An(x))]
g(x)
fj,n(An{x)) (1 - o„)
+ an 9(x) (2)
where An(x) = An^ if x £ An^ and /i n stands for the empirical measure associated with the sample X\,... ,Xn. As shown by Barron et al. 2 (who study the case with general ( a „ ) n 6 N . ) , under the assumption l(f,g) < oo, if hn —> 0 and nh„ —> oo as n —> oo, then Z(f,PhJ
-> 0 a.s.
and
E ( l ( / , p h j ) -» 0.
(3)
33
Moreover, Berlinet et al (1997) showed, with the additional assumption v^S^ — Sfj) = 0 where S^ = {x : f(x) > 0}, that
X(/,PO-E(/(/,PO)
£-> M^viS,)).
(4)
The condition T(f, g) < oo is minimal to ensure the finiteness of X(/,pV„) and of its expectation. Now, the choice of g can easily meet the condition on the boundary of the support of jj,. It is worth noting that (4) is independent of the dimensionality of the problem (see Berlinet et al (1997)). As u(SM) < 1, one gets from (4), for any e > 0, lim P ( n v ^ P X / . P f c J - E [ I ( / , p f c J ] | > e) < 2 $ ( - e ) ,
(5)
where $(•) stands for the cumulative distribution function of a standard gaussian random variable. The upper bound in (5) is independent of both / and g. For other properties of modified histograms see the above mentioned references and the recent paper by Berlinet and Biau 4 . 3
Asymptotics
Let us write T(f,phn)
= B(hn) + V(hn),
where
5 ^ ) = //(«) log ( ^ j j ) ^ ) and
V(M =//(*) l°g f ^x # W ) .
J \ Phn( ) J The terms B(hn) and V(hn) are respectively named bias and variance components by analogy with the theory of quadratic loss. But balance between these two components cannot be achieved so easily than between analogues in L 2 -theory. As Theorem 1 shows, the order of convergence of the bias component depends crucially on tail properties of / and g whereas the variance component (in which g simplifies and disappears) is typically of order {nhn)~l without almost no restriction. From the very definition of p^n (x) it appears that a strange fact can happen with that estimate. We can be very lucky and have g = / , the unknown density. Then the choice hn = mn = 1 leads for any x to Phjx)
= f{x).
34
This is unbeatable whatever the error criterion one can consider! In the following we will remove that case by considering densities such that the derivative of (f/g) cannot be identically zero. Assumptions Let /o = f/g be the density of fi with respect to the probability measure v, denote by G the distribution function of v and by ip = f0 o G~1 the density of G(X). Let Stl = {x : f(x) > 0}, S^ be the closure of S^, and for e G (0,1) let Afn = {xeS^ : E \phn(x)} < (1 - e) f(x)}. We consider the following assumptions (Ao) I(f, g) < oo; n —> oo; nhn -> oo; (Ai)
K^-5M)=0;
(-A2) V admits first and second order bounded derivatives; (.A3) there exists 7 > 0 so that for all x G S^ , fo(x) > 7. Comments on assumptions The only assumptions required for the analysis of the variance component are (Ao) which is the minimal condition for consistency of Barron estimates and condition (A\) already given by Berlinet et al (1997). This last condition is not too much restrictive since it is satisfied for all measures v for which the boundary of the support of \i is negligible. Although / is unknown, it is often the case that its support S^ is known. Then one can choose v such that v{Sp) = 1. Otherwise ^(S^) has to be estimated, for instance by j/(min; Xi, max; Xi). Here we only consider the case where v(S,j,) is known. The two following weaker conditions could be given in place of (.A3).
(A'3) J / ( x ) I 0 g ( E ^ ) ] ) dX(x) = o(hl); (A'A) sup ^f\
d\(x) = O ( n 2 ^ ) and f
£&
dX(x) = o (n2h4n);
Under (A3) the ratio E \ph,Xx)] /f(x) converges uniformly to 1, thus for n large enough, the set J\fn = {x € 5M : E [p/i„(a;)] /f(x) < 1 — e} is empty and assumption (A'3) follows directly. Under (A'3) and (A'4), the set W„ (which can be intuitively understood as a kind of "non-uniform convergence set of the ratio") is asymptotically negligible in the sense that X(Afn) tends to 0. Conditions (A3) and (A^) are less easy to interpret than the condition (.A3)
35
but they allow more flexibility in the choice of g. Indeed, under (.A3) and (.44), f/g is allowed to tend to 0 with slow rate of convergence. Also conditions (.A'3) and (A4) involve the expectation of the estimate E [ph„(a;)], so (.A3) can be preferred. Although it is stronger, it only involves / and g. So it appears as a more clear and understandable condition on the model. As well known conditions on tail behavior must appear when the accuracy of estimates is evaluated by means of Kullback-Leibler divergence. The following theorem gives the asymptotic behavior of the bias component and of the expectation of the variance component of the Kullback-Leibler divergence of modified histograms. Theorem 1 Under the assumptions (Ao) and (Ai),
E
(6)
^^+°(£)-
/ / moreover the assumptions (A2) and (A3) are satisfied, if f'21
0 < [ kMd\(x) J f(x)
< 00
and nh\ —> 00, as n —> 00, then f'2
**->-&/^<*>+<,(*s+;s:)-
(7>
Corollary 1 Under the assumptions required in the second part of Theorem 1 the asymptotically optimal smoothing parameter is given by /„
i/3 h: = (Gv(s»)) LJ{j^d\(x)
-1/3
\
For this value /i*, '2<^
+ o (n~2^
\
1 / J
/
o„
N-2/3
.
As already mentioned the choice of g can often be made such that u(S^) = 1. Then a pilot estimator can be used to compute an approximated value of h*n. Our results are in accordance with the upper bounds given by Barron and Sheu 3 and Barron et al. 2 . Up to our knowledge no minimax results are available in the present setting. In the proof of Theorem 1 the following lemma is used several times. This
36
lemma extends a technical result given by Berlinet et al (1997). Its proof, which is omitted, relies on the application of the Lebesgue dominated convergence theorem to suitable functions. Lemma 1 Let fi and v be two probability measures with densities f and g with respect to the Lebesgue measure on (R,B(R))- Let (An>i,..., -A„]mn) be a partition o / R satisfying v{Anik) = hn = \/mn. The measure fi is supposed to be absolutely continous with respect to v and to satisfy v^S^ — S^) = 0, where 5M = {x : f(x) > 0} and S^ is the closure of 5^. If hn —* 0 and nhn —> oo as n —-> oo then, for any couple (p, q) of positive integers, we have (n/a(An,k))P
"^ n
^°°
fc=i
( ! + n/j,(Anik)j
(0
if
0
[ i/(Sp) if
p=q
and Jirn^ hn ^
[nKAn,k)J
log (l + n(i,(Anikj)
exp [-C (n - /«)/i(An,fc)] = 0,
where C > 0 and K G R are any fixed constants. Proof of Theorem 1. Let Rn(x) = (phn(x) - E \phn{x)\)/E E [V{hn)} = - J f(x)E
[log ( l + E „ ( z ) ) ]
\phn(x)].
d\(x).
Elementary analysis shows that Vr>-1,
r2 |log(l + r ) - r - + y | < | r | 3 - l o g ( l + r ) I ( r < _ 1 / 2 ) ,
so that E [V{hn)\ -±J
f(x)V(Rn(x))
< J f(x) { E \Rn(x)\3
d\(a
+ E (| log (1 + Rn(x)) | I(Rn{x)<-1/2))
} d\(x).
By Holder inequality, we get E |ii„(x)| 3 < E (i?„(z) 4 ) 3 / 4 and since
Rn{x)
1
simple calculations give
+n/j.(An(x))
(8)
\nn{An{x))
-
n{An{x))},
(9)
37 314 E (R (xY) { '
= / 3 n 2 M ^ " W ) 2 + nM(^n(a:))\ 3 / 4 { (l + n^An(x)))4 ) •
Consequently,
nhn
ff(x)E\Rn(x)\3d\(x)<2:i/2hny
^
A
^
,
Hence by Lemma 1, it follows that
f(x)E\Rn(x)\3d\(x)=0
lim nhn '—°°
Now for all x S Anik
, Rn(x) S
|log (l + Rn{x)j
(10)
J
1 + n/j,(Anik) ' 1 +
nfi(Antk)
so that
I( fi „( x )<_i/2) < log ( l + n/i(A„, fc )) •
Applying Bernstein inequality, we obtain the following bound for the second term in the right hand side of (9),
nhn
/ f(x) E (J log (1 + Rn(x)) | I(fl n(:c )<-i/2))
< nhn ^
fi{Anik) log (1 + nn(An>k))
exp ( - —
d\(x) nfj,(Anik)).
fc=i
Lemma 1 applies and we can conclude that
\m^nhn
f f(x)
{ E ( | l o g ( l + ^ n ( a ; ) ) | l ( f l n ( : c ) < _ 1 / 2 ) ) } d\(x) = 0.
From this and (10), we get
E [V(hn)} - \ J f(x) V(Rn(x)) d\(x) °(zir) nhr.
(11)
38
Now by Lemma (1), it is easily seen that lim ^
[f(x)V(Rn(x))d\(x)
= hm
nhn ^ n/u(Ara,fc)2 I ^ g -
+ nn(Antk)y
2 To prove (7), we decompose the pointwise bias as follows E[p f c „(i)] - /(a:) = (1 - an)
~ n{An{x)) h„
- /oW g(x) + an[g(x) - f(x)}.
Now, changing variable in the integrated bias, we obtain
B(hn) = -J
where, for y £ [0,1], ^An(G-l(y))) Un(y) = (1 - an)
hn
v(y)
v{y)
+ a.
Iviy)
Thanks to assumption (A3), putting Mn = {y£ G - 1 ^ ) n [0,1] : Un{y) < - e } , where 0 < e < 1, we have /
ip(y) log ( l + Un(y))
I V(y)\og(l Jo
d\{y)
+ Un(y))dX(y)
= I v{y) log (1 + un(y)) i(Un(v)>-e) dKy) + o [h2n). Jo Applying Taylor expansion with integral remainder, we can see that
39 B(hn)
+ Rn{y) dX(y) + o{hl)
/ ip(y) Un(y) - ^Un{yf J[0,l]/Mn
= -
with
I
Rn(y)
UM
(t-Un(y))2 (l+t)3
d\{t),
for Un(y) > —e. Simple arguments lead to lim
17 I
n - o o h^
f(y)\Rn(y)\d\(y)=0
J[0,l]/Mn
and by checking that 0< /
'[0,1]/Mn J\0,l]/M
JMn
which is upper bounded by /
= o(hl),
we get finally 2
B{hn) = \f Z
¥>(y)Un(y)
dX(y)
J[0,l]/Mn
+ o< (hi).
So, it remains to compute the integral in the above expression. For this, we write by Taylor expansion, for y € \(k — l)hn ; khn] M(A n (G- x (j/))) = hMv)
+ v'{y)hn (khn ~Y~y)+
°( /l «)
and that gives /f,.\2
v{v) Un(y? = (l - a „v'iv) ) 2 ^ f (khn
h-n
-'-f-y)
ill
+ 2 a „ ( l - a „ ) ^ (1 -
+<
+ o(h2n).
- y)
(12)
40
By using the assumption n/i 2 —> oo, the second and the third terms in the right-hand side are proved to be negligible in front of h\. Therefore,
/ J[0,1]/M„
is equal to
J[o,i]/Mn f{y)
2
Now, apply the mean formula. Since
I TW{khn~^-y) dX{v) m
-
"
Jit
\2
rkh"
u
2^"" ¥>(&) 19 12 ^fc=l
From this result and by checking that fM •M-Mn) does, we obtain
1
^>'2{y)/f{y) d\(y) tends to zero as
f^VfjU^
l
f/oW 2 ,w , n
Acknowledgments We thank the referees for interesting questions and remarks.
41
References 1. A. R. Barron, in Proc. IEEE Int. Symp Inform. Theory, Kobe, Japan, (1988). 2. A. R. Barron, L. Gyorfi, and E. C. van der Meulen, IEEE Trans, on Information Theory 38, 1437 (1992). 3. A. R. Barron and C. Sheu, Ann. Statist. 19, 1347 (1991). 4. A. Berlinet G. and Biau, Iterated modified histograms (Submitted for publication, 2001). 5. A. Berlinet, L. Gyorfi snd E. C. van der Meulen, Publications de I'ISUP. 41, 3 (1997). 6. A. Berlinet, I. Vajda and E. C. van der Meulen, IEEE Trans, on Inform. Theory 44, 999 (1998). 7. L. Gyorfi, F. Liese, I. Vajda and E. C. van der Meulen, Statistics 32, 31 (1998). 8. L. Gyorfi, and E. C. van der Meulen, in Transactions of the Twelfth Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, 88 (1994). 9. P. Hall, Ann. Statist. 15, 1491 (1987).
T H E A S Y M P T O T I C D I S T R I B U T I O N OF SPACINGS OF O R D E R STATISTICS M I K E L I S G. B I C K I S Mathematical Sciences Group, Department of Computer Science, Saskatchewan, 106 Wiggins Road, Saskatoon, SK SIN 5E6, E-Mail: [email protected]
University Canada
of
It is well known that for an i.i.d. sample from a uniform distribution, the spacings between the order statistics are asymptotically distributed like i.i.d. exponential random variables. Although the spacings between order statistics from an arbitrary distribution are not in general identically distributed, one can still consider the distribution function to which the empirical cdf of such spacings from an i.i.d. sample would converge as the sample size goes to infinity. The limit of this empirical cdf is called the asymptotic spacing distribution. If X is a random variable with density / , then the asymptotic spacing distribution is 1 — L, where L is the Laplace transform of the distribution of f(X). Although the explicit form for L is not tractable in general, one can use Tauberian theorems to relate the tail behaviour of L to the order of contact of / to the z-axis: The more gradual the approach of / to the axis, the heavier the tail of the asymptotic spacing distribution.
1
Introduction
Imagine a large number of marathon runners crossing a finish line, and consider the set of time intervals between their arrivals. What would be the distribution of the lengths of these intervals? The question is ill-posed, since in general these lengths would not have the same distribution. Nonetheless, one can calculate the empirical distribution of such a collection and ask the question "Is there a probability distribution to which this empirical distribution converges as the sample size goes to infinity?" It was noted by Katz l that the empirical distribution appears to follow a "power law". More precisely, one could say that the tail of the distribution function behaves like a negative power of its argument (although, of course, integrability requirements prohibit such behaviour extending to the origin). This observation led Katz to speculate the spacings in an ordered Normal sample would have such a "power law" distribution. In this paper, we study the relationship between a parent distribution and the asymptotic distribution of spacings between its order statistics. We characterize the distributions for which the spacings distribution has a negative power tail, and discover that these do not, in fact, include the Normal distribution.
42
43
2
Empirical distribution of spacings
Let Xi,...,
Xn be i.i.d. from a distribution F, and define Yi = X(i+1) --X"(i),
i=
l,...,n-l
to be the spacings between consecutive order statistics. We will assume in what follows that F is absolutely continuous. It is well known that if F is the uniform distribution on the unit interval then nYi, i = 1 , . . . ,n — 1 are asymptotically i.i.d. exponential, and thus it makes sense to talk about the asymptotic distribution of the spacings. In general, however, the Yi's will not be identically distributed, even asymptotically, so care is needed in defining what is meant by "the" asymptotic distribution. Let 1
Sn(t) =
n—1
- ^ / ( ^ W )
(1)
n — 1 *-^ i=\
be the empirical survivor function of the scaled spacings, where /(t )00 ) (') represents the indicator function of the interval (t, oo). (Formulas look simpler in this context if we deal with survivor functions rather than cumulative distribution functions. It should cause no confusion if we refer to a distribution by its survivor function rather than its cdf.) Suppose that there is a distribution S such that with probability 1, Sn converges weakly to S. Then we will call S the asymptotic spacings distribution (ASD) of the distribution F. Pyke 2 considered the limit of (1), and presented an elegant proof that oo
/
e-tf(x)f(x)dx
in probability,
-oo
where / = F' is the probability density function of the original sample. He added, without proof, the stronger statement that the convergence is, in fact, uniform with probability 1, referring to a paper by Blum and Weiss 3 . Monte-Carlo investigation led the present author to speculate that the empirical distribution of spacings is, moreover, asymptotically equivalent to the empirical distribution of an appropriate i.i.d. sample, as suggested by the following heuristic argument, under the additional assumption that the density / is continuous. Conjecture: Let X\,..., Xn,... be i.i.d. from a distribution F with a continuous density / , and let Z\,..., Z„,... be i.i.d. from an exponential distribution with mean 1, independent of the X;'s. Then the empirical spacings distribution S„(t) converges to the same limiting distribution as the empirical distribution from the i.i.d. sample Z\jf{X{),... ,Zn/f(Xn).
44
Heuristic Argument: By the mean value theorem F(X(i+1)) for some X^
= F(X(i))
< Vi < X(i+iy
+ f(Vi)(X(i+1)
-
X(i))
Thus
F ( X ( j + 1 ) ) - F ( X ( i ) ) = /(V5)r i .
(2)
However, since F{Xi) has a uniform distribution, the left side of (2) gives the spacings of an i.i.d. sample from the uniform distribution. Thus, the joint distribution of n/(y1)y1,...,n/(y„)yn will approach that of Z\,..., Zn. For large samples, the order statistics will approximate the appropriate quantiles, as will the Vi's sandwiched between them. Thus, the joint distribution of nY\,... ,nYn will approximate that of Zi/f{Xw),..., Zn/f{X(n)). The random function
5;c) = ^iEJ(t.oo)(2r*//(x w )) > i=l
on the other hand, will have the same distribution as 1
71 — 1
Sn(t) =
£
J(tl0o) (Zi/fiXi))
,
(3)
»=i
which is the empirical survivor function Z1/f(X1),...,Zn/f(Xn).
from the i.i.d. sample
A rigorous proof of the conjecture will depend first on a careful specification of the nature of the putative asymptotic equivalence between (1) and (3). A strong result would obtain if it could be shown that the stochastic process TJU (§n(t) — S(t)) converges to the same limit as y/n (Sn(t) — S(t) J. The asymptotic spacings distribution oo
/
e-tfWf(x)dz
(4)
-oo
is just the distribution of the random variable Z/f(X), which, by conditioning on X, can be viewed as a mixture of exponential distributions.
45
Example 1. If F is exponential, then the spacings themselves are exponential with E{Yi) = l/(ra — i) and the distribution of Sn can be calculated explicitly for finite n. We have n-l
E (sn(i)) = (n - I)""1 Y,
Pr
{ ^ > *M
i=l n-l
(n-l)-1Y,e~in~i)t/n
=
i=l
_ 1 - e x p ( - ( l - l/n)t) (n-l)(e'/n-l) from which we can see
that E (§n(tj) -y (1 — e ')/£ as required by (4).
Example 2. Consider a random variable with density function k
/(*) =
0 < a; < fc + 1
:+ l
where k > 0 is a parameter. (These distributions are members of the Beta family, rescaled for ease of computation.) Making the change of variable v-t
fc+1
we can calculate rfc+i
S(t) = / Jo
exp - t
_ 1 + 1/fc ti+i/k
r(2 +
fc + 1
fc + 1
Te-V
~vv^k dx Jo i/k)G1+1/k(t)
tl+l/k
where G 1+1 /fc(t) is the Gamma cdf with shape parameter 1 + 1/fc. These two examples illustrate ASD's with tails behaving like negative powers. Such distributions have at most a finite number of moments. In
46
particular, the expected asymptotic spacing length will be
JO
r
/>oo
roo
/
S(t)dt=
/ JO
e-tfWf{x)dxdt
/ Jf(x)>0
= /
f(x)/f(x)dx
Jf(x)>0 — 00
unless the support of / is bounded. Thus the distribution of an unbounded random variable, such as the Normal, cannot have an ASD with a tail 0(t~a) with a > 1 for then its expectation would be finite. 3
Characterization of asymptotic spacings distributions
Note that the ASD is in fact a Laplace transform. If X has density / , consider the random variable W = f(X). Then S{t) = E{e~tw). If / is monotone, then the density of W is given by /°/_1(w) | / ' o / - » |
w =
| / ' o / - » |
provided the density is differentiable. (If / is not monotone, then the density of W must include contributions from the components of f~1(w)). Thus, for example, if X is Exponential, then W is uniform with Laplace transform (1 — e~')/i as seen previously. If X is Uniform, W is degenerate at 1, with Laplace transform e~l. The representation of the ASD as a Laplace transform allows one to use Tauberian theorems 4 which relate the tail behaviour of the Laplace transform with the behaviour of the distribution near the origin. Specifically, if L is the Laplace transform of a distribution F, then as F(t)~L(l/t)/T(p+l) where [mlog(F(tx)/F(t))
i->0
_
log X
We see that what controls the tail behaviour of the ASD is the behaviour near the origin of W, which is determined by the way that the density / approaches the axis. It can readily be seen that if f(x) behaves like 0(xk) as x -> 0, then W will have a density like 0{w1/k) as w - • 0. The cdf of W will then behave like 0{wl>k+l) and the ASD will have a tail like 0 ( H 1 + 1 / f c ) ) .
47
Quantiles of asymptotic spacing distributions < O
"
CM
"
"^^ r i ^ 1 1 **«-^
~ ——
< o
^--^^^^x
3? CD Q. Q. Z>
r
o
—
Normal Exponential
-—
Beta(1,2 ' Uniform
1
CM
< O
CO
0.005 0.010
0.050 0.100 Probability
0.500 1.000
Figure 1. Upper quantiles of asymptotic spacings distributions from selected parent distributions
Obviously, the distribution of W will be unaffected by location shifts or reflections, and will be reciprocally scaled by rescaling of X. Thus the tail behaviour of the ASD is the same for all distributions in a location-scale family. If X is Normal, then the density of W is messy, but after some calculation can be seen to be approximately proportional to (— logw) - 1 / 2 near the origin. This does not behave like a power of w, indicating that the ASD will not have a negative-power tail. By making the change of variable s = — logw, we can see that the cdf of W will be approximately proportional to -!/2 f
/ .— log w
'da.
It follows that the tail of the ASD will be like /•OO
/
'
-l/2„-s
ds,
J log*
i.e., like the tail of exp(C) where C has a xl distribution.
48
0.01
0.10
1.00 10.00 Normal spacings quantiles
100.00
Figure 2. Q-Q-plots of Normal spacings
4
Numerical results and simulations
The quantiles for the ASD corresponding to given parent distributions can be computed numerically. These are plotted in Figure 1 for distributions with different tail behaviours. One can examine the suitability of the ASD for describing the empirical distribution of spacings for large but finite sample sizes. Figures 2 and 3 present Q-Q-plots for spacings of samples 1000 i.i.d. observations from the Normal and Exponential distributions, respectively, plotted against the respective ASD. These plots demonstrate a reasonably good fit. The plot is presented on a log-log scale to accommodate the heavy-tailed distribution. Note that on such a scale, a well-fitting plot should lie on a straight line with unit slope. Such a line is superimposed on the points, constrained to pass through the empirical upper 10 percentile.
49
0.01
0.10 1.00 10.00 Exponential spacings quantile
100.00
1000.00
Figure 3. Q-Q-plots of Exponential spacings
Acknowledgments I want to thank Leon Katz for raising the question that spawned this paper. I am indebted to Richard Lockhart and Michael Stephens for useful discussions on this topic. This research was supported by a grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada. References 1. 2. 3. 4.
L. Katz, Personal communication, 2000. R. Pyke, J. Roy. Stat. Soc. 27B, 395 (1965). J. R. Blum and L. Weiss, Ann. Math. Stat. 28, 242 (1957). W. Feller, An Introduction to Probability Theory and its Applications v. 2, 2nd ed. (Wiley, New York, 1971).
T H E O R E T I C A L A N D COMPUTATIONAL ISSUES I N B A Y E S I A N ANALYSIS OF MULTIVARIATE O R D I N A L CATEGORICAL DATA W I T H R E F E R E N C E TO A N OPHTHALMOLOGIC S T U D Y A T A N U BISWAS Applied
Statistics
Unit, Indian Kolkata E-mail:
Statistical Institute, 700 108, India [email protected]
203 B. T.
Road,
Bivariate or multivariate ordinal categorical data is quite common in different real life situations. Several frequentist's approaches are available for the analysis of such data. In almost all those approaches, the computational burden is tremendous. As an alternative, some Bayesian latent variable based approach were suggested by some authors. In such cases also, the computation is a key issue. The computational burden of such Bayesian analyzes can be remarkably reduced by using the software WinBUGS. The present paper discusses one real life dataset on an ophthalmologic study, called the Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR), which provides bivariate ordinal categorical d a t a with several eye-specific and subject-specific covariates. The techniques of the available Bayesian analyzes with reference to the WESDR data are discussed in the present paper. Some results on Bayesian model selection for this data is also discussed. By the help of some exploratory d a t a analysis an appropriate model is selected, and it is then analyzed in the Bayesian semiparametric way taking a Dirichlet process prior for the random effect (in the frequentist's sense). The final result on the analysis of WESDR data is presented. Several computational issues in this context are also discussed. Finally, the applicability of the present approach to some more general situation are also pointed.
1
Introduction
In several social, psychological and biomedical studies the response variable is ordinal categorical in nature, sometimes they are ordinal due to the absence of well-defined non-invasive direct measurements {e.g., mild, moderate, severe, etc.). If the response is of multivariate nature and each component of the response is ordinal categorical, we have multivariate ordinal categorical data to deal with. This kind of data are available in many real life situations. Ever since Dale 7 proposed the analysis of bivariate ordinal categorical data, a lot of subsequent studies were carried out in this interesting and important research area. Much of the early works were done on the frequentist's view point. Molenberghs and Lesaffre 27 used a multivariate Placket distribution as an extension of Dale's 7 model. These likelihood methods are, of course, computationally extensive. As an alternative, Williamson et al. 37 considered
50
51
the generalized estimating equations (GEE) approach. They fitted cumulative probit margins and a global odds ratio association model. Williamson et al. 38 discussed the applicability and usefulness of their computer programs GEECAT (generalized estimating equations for categorical data) and GEEGOR (generalized estimating equations using global odds ratio). Kim 19 carried out a latent variable based estimation technique for a bivariate ordinal data of an ophthalmologic study called the Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR). Kim 19 used the Newton-Raphson iteration technique to find the estimates of the underlying parameters including the typically unknown cut-off points. But he admitted the computational difficulty of the technique in a more general set up. Williamson and Kim 3 6 , also with reference to the WESDR dataset, considered bivariate latent variable regression model using global odds ratio, which required no specific choice of underlying latent distribution except its continuity, and no specific structure of the correlation. A quasi-Newton method in a full maximum likelihood procedure was employed for the estimation of the model parameters. Kim et al. 20 discussed regression models for bivariate ordered categorical data from ophthalmologic studies. Some statisticians wanted to look at the multivariate ordinal categorical data from a Bayesian philosophy and tried to apply the Bayesian theory for analyzing such a multivariate ordinal categorical data with covariates. It is observed that, besides the difference in philosophical perspective, the Bayesian computation becomes more tractable and one may arrive at good solutions after some easy computational effort. In the present paper we discuss some of the available approaches and discuss some possible new directions, mainly with reference to a real life bivariate ordinal dataset obtained from the WESDR study. We discuss the WESDR study and the nature of the resulting bivariate ordinal data briefly in the next section. In Section 3, two available analyzes on WESDR data are briefly indicated. A discussion on appropriate model selection is provided in Section 4. In Section 5, a Bayesian analysis of the WESDR data using the selected model from Section 4 is carried out. The results are briefly discussed. Some important computational issues are discussed in Section 6. Finally, Section 7 ends with some concluding remarks.
52
2 2.1
WESDR Description of the WESDR
It was a population-based study conducted in Southern Wisconsin between 1980 and 1982 by Dr. Ronald Klein and his medical colleagues at the University of Wisconsin and supported by the National Eye Institute, NIH. In the baseline study, a total of 996 insulin-taking, younger onset diabetic patients were examined using standard protocols to determine the prevalence and severity of diabetic retinopathy and associated risk variables. The population of the study consisted of a probability sample selected from 10135 diabetic persons who received primary care in an 11-county area in southern Wisconsin from 1979 to 1980. A detailed description of the population can be available in Klein et al. 2 1 . Of the younger-onset persons (less than 30 years of age), 996 participated in the baseline examination (1980 to 1982). Subsequently there were two follow-up examinations on the same population after 4 and 10 years (cf. Klein et al. 24 ; Klein et al. 2 2 ) , with some missing data in this course of study. The baseline and the two follow-up examinations were performed in a mobile examination van in or near the city where the participants lived. The ocular and physical examinations included taking stereoscopic color fundus photographs of seven standard fields. The basic goals of the study (cf. Klein et al. 25 ) were • To find the associated risk factors which are important in planning a well-coordinated approach to the public health problem posed by the complications of diabetes (cf. Hamman 17 ; Rand 3 0 ) . • Identifying the possible patients at high risk level of severe retinopathy was quite important for advising ophthalmologic care. • Planning future studies such as controlled clinical trials of treatment of diabetes and diabetic retinopathy (cf. Rand 30 ; Palmberg et al. 2 8 ) . • Progression of diabetic retinopathy over time and associated factors was also an important aspect of study (cf. Wahba et al. 3 3 ). 2.2
Data description
WESDR data is a bivariate ordinal data. The retinopathy scale (RS) provided in the dataset is a more current one than the one used in some earlier works. Both the right and left eye retinopathy severity levels are recorded as two components of the bivariate response. The possible values of the retinopathy severity levels are 10, 21, 31, 37, 43, 47, 53, 60, 61, 65, 71, 75 and 85, corresponding to increasing levels of severity of retinopathy within an eye. A commonly used grouping is 10, 21-37, 43-53 and 60-85, which correspond to
53
no retinopathy (category 0), mild nonproliferative retinopathy (category 1), moderate to severe nonproliferative retinopathy (category 2), and proliferative retinopathy (category 3), respectively. Such a grouping is usually done for easy interpretation of the data, and it reduces the computational burden to a great extent. Most of the available works are done using this grouping. Three eye-specific covariates are recorded separately for each eye. The first one is the presence or absence of macular edema (ME), which is the effusion of serious fluid into the intersects of cells in tissues. The right and left eye refractive error (RE) in diopters, which is the ability of the eye to refract light which enters into it so as to form an image on the retina, is also recorded. The values can be negative or positive, negative values represent myopia (nearsightedness), and positive values represent hyperopia (farsightedness). In addition, the right and left eye intraocular pressure (IOP) in mmHg. is also measured. In addition there are 11 person-specific covariates. The first two are age at diagnosis (AgD) of diabetes in years and duration of diabetes (DuD) in years. To get current age at the time of examination, one has to add AgD and DuD. The other person-specific covariates are glycosylated hemoglobin (GH) in percent (a measure of control of blood sugar where lower values are considered better), systolic and diastolic blood pressures (SBP & DBP) in mmHg., body mass index (BMI) in kilograms per meter squared (using weight and height), pulse rate (PR) in beats per 30 seconds, sex, urine protein (UP) (present/absent), doses of insulin (DI) per day and area of residence (AR) (urban/rural). 3 3.1
Bayesian Analyses Bayesian Analysis of Baseline Data
The first major Bayesian analysis of the WESDR (baseline) data is due to Biswas and Das 2 . They considered 691 observations with full response and covariate information for their analysis. Suppose yn and ym be the responses from the left and right eye, respectively, for the it\\ individual. Both y^i and ym can take values 0, 1, 2 and 3. It is assumed that there are some underlying latent variables which are unobservable, but in effect we observe these y^ and ym- Underlying latent variables y*Li and y*Ri are postulated, which are responsible for observing y^ and ym in the following way: Vhi = 3 if y*Li S (7ij,7ij+i], 3 = 0,1,2,3, VRI =HfyRi G (72i,72/+i], Z = 0,1,2,3,
54
where 71/s and 72*'s are typically unknown cut-off points such that 7so < 7*1 < • • • < 7s4 f° r 5 = 1,2. Some mild conditions on these 71/s and 72;'s, such as (1) 7 i o = 720 = - 0 0 ,
(2) 711 = 721 =
a
714 = 724 = 00,
known constant,
are needed for identifiability (see Chen and Shao 3 , for details). The latent vector y* = {y*Li,y*m)T is assumed to follow a bivariate normal distribution N2{XiP, S). The joint distribution of yt = (yLi, VRi)T, yVs a r e written as 7r(i/?, j/i, i = !,•••, ^1)8, E, 7) N
n
5 3 lioi x J ( » £ i
e
(7ij.7ij+i],I/m G (721,721+1])
3,l€S
xN2(Xip,i:),
(1)
where N = 691, the number of individuals under study; 7 is the collection of all unknown 7^'s and 7 2i 's; S = {{j, I) : j = 0,1, 2, 3; I = 0,1, 2, 3}; I(X e A) = 1 or 0 according as X € A or not; and 1J, = 1 or 0 according as y* = (j, l)T or not. Without sufficient prior knowledge, noninformative priors for /?, S and 7 are taken. The conditional posteriors of the parameters are of known form. The computations can be carried out using Markov Chain Monte Carlo (MCMC) technique. In particular, posterior summary statistics of all the parameters are obtained using the software WinBUGS. The analysis shows that the severity of retinopathy among the younger onset diabetic persons is directly affected by DuD and DBP. Also GH was one vital covariate for retinopathy. 3.2
Analysis of the Baseline and 4-year Follow-up Data
Das and Biswas 9 considered the analysis of bivariate ordinal data repeated over time. They took the WESDR data for two time points only (baseline and 4-year follow-up data). Some notational adjustments are to be needed from the subsection 3.1. In place of yLi, yRi, y*Li, y*Ri, 1},, Xt we just write yLit, ymt, yliti y*mv !}*> -Kit to indicate that they are for time point t, t = 0,1. For the baseline data we write t = 0 and for the 4-year follow-up data we write are t = 1. The joint distribution of yit = {yLit,VRit)T, y*t = (ylit,y*Rit)T,s now written as T(j/*t. Va,i = 1, • • •, n, t = 0, l|/3, S , 7 )
55 n
1
= IIII
£
J
i*
x /
(»£«
G
(7ii. 7 y + i ] , J/ffit G (72J, 721+iD
i = i t = o |_j,ies
xA^X^ + Z^E),
(2)
where the sample size n is now 548, as some of the individuals of the baseline study were missing in the follow-up study. A random effect model (in the frequentist's sense) is assumed for y*t as y*t = Xit(3 + Zitbi + eit, where &;, a ^-dimensional vector, is the ith cluster (subject) effect, which is responsible for the longitudinal dependence. Here e;t is assumed to be bivariate normal with mean vector 0 and dispersion matrix E. For likelihood identifiability, E needs to be a correlation matrix (see Chib and Greenberg 5 ; Chen and Shao 3 ) . The prior for (3 is taken as Np(Po, Eo). Without sufficient prior information a relatively vague prior is chosen by setting /30 = 0 and Eo = 10 4 x identity matrix. This ensures the posterior to be driven by data. A noninformative prior for 7 is taken. A prior proportional to the normal density (with known mean po and known variance T _ 1 ) in the domain ( — 1,1) is taken for p, the only correlation parameter in the 2 x 2 correlation matrix E. A Dirichlet Process (DP) prior for the unknown distribution G of 6,'s are taken (see Ferguson 12 ; Kleinman and Ibrahim 2 6 ) . The software WinBUGS is used for the Markov Chain Monte Carlo (MCMC) computations. Metropolis-Hastings algorithm (Hastings 18 ) was employed for sampling from the posterior of p, as it becomes non-standard; for other parameters the popular Gibbs sampler (see Geman and Geman 16 ; Gelfand et al. 13 ) is used. Note that in case E is a proper dispersion matrix, a suitable inverted Wishart prior could be taken for E. In that case the Bayesian MCMC computations could be same, possibly requiring some smaller time as the conditional posterior of E would be inverted Wishart also. 3.3
Present Analysis Using all the Data
The WESDR data provides information of only baseline values (and not of the 4-year follow-up) on N — n = 143 individuals. Das and Biswas 9 ignored this information. Now this data on 143 individuals can be successfully used to get more information. The likelihood thus becomes n(y*t,yit, i = 1, • • •, n, t = 0,1; y*0,yio, i = n +!,•••, N\0, E, 7)
56 n
1
nn
J2
*n
S
l
%
^
x
^Lit
X J
^
G (7y.7y+i],»fl« G (72(,72i+i])
i 0
G
(Ty'7ii+i],yfliO
G
(72i,72i+il)
i=n+l
xAr2(XiO/3 + Z i 0 &i,S)},
(3)
after modifying (2). Our ultimate goal is to analyze the baseline and 4-year follow-up data by making full use of the data after suitable prior elicitation and model selection in terms of covariate selection. These issues are discussed in the next section. 4
Prior Elicitation and Model Selection
Most of the priors considered so far are vague priors. An empirical Bayes procedure could be carried out if some past data were available. For example, Angers and Biswas l considered the analysis of the 4-year follow-up data only. They used the baseline data to construct priors for different parameters. Some sensitivity analyses were also carried out by Angers and Biswas 1. But, in the absence of such past information, the vague priors are appropriate. In such a case, the posterior will be driven by the data. In fact, we have a quite large dataset, and the data should dominate the posterior in any case. In the context of the 4-year data of the WESDR, model selection in terms of covariate inclusion was carried out by Angers and Biswas l . They computed the standardized Bayes estimators and ordered them. Then the model with highest marginal probability is selected. It was observed in their study that the baseline values, DBP and GH are important covariate in different situations. They included these covariates in their model to analyze the data. But if we carry out our analysis simultaneously for the baseline and 4-year follow-up data, there is no question of taking the baseline values as covariates. The covariates which are not important can be excluded from considerations to ease the computational burden. Wahba et al. 3 3 carried out their analysis on a subgroup of the younger onset population, consisting of 669 subjects with no or nonproliferative retinopathy. They considered retinopathy scale (RS) as the response variable. Thus they reduced the bivariate ordinal data problem to a simpler univariate ordinal data problem. Some exploratory GLIM modeling using the SAS procedure LOGISTIC (SAS Institute 31 ) were carried out and after some exploratory
57
considerations they took the model. Wahba et al. 3 3 observed that the effect of GH was very strong and fairly linear in the logit and that the effects of AgD, DuD and BMI were strong and nonlinear. We have done a similar detailed exploratory study using SAS with the baseline and 4-year data. We observe that the effect of DuD in a log-scale explains the data in a better way. Note that Wahba et al. 3 3 also observed that DuD has a nonlinear effect. We also find that the effects of GH and DBP are significant in the context of retinopathy levels. In fact, Klein et al. 23 reported that GH is a strong predictor of progression of diabetic retinopathy in the younger onset group. Angers and Biswas * also observed the same scenario. DBP came out as an important covariate in the study of Kim 19 and Biswas and Das 2 . In the present study we observe possible interaction between GH and DBP. Thus the model has to be taken with care. This is discussed in Section 5. 5
Present Bayesian Analysis W i t h the Selected Model
We now carry out a Bayesian semiparametric analysis of the baseline and 4year follow-up data of the WESDR. We take the selected covariates (discussed in Section 4) into the model, suitable transformations and interaction effects are also considered following our exploratory study described in the previous section. This seems to be a correct model which can explain the retinopathy levels in terms of different covariates. We use all the available data, i.e., the likelihood is given by (3). We impose an additional restriction 71 j = ^2j for all j , which seems logical as the two eyes should behave in the same way. The random effect model (in the frequentist's sense) is justified from intuitive feelings. Besides the significant covariates taken in the model (after excluding the insignificant ones) there must be several unobserved (most of which are unobservable) factors that can be responsible for retinopathy and also the association between the retinopathy levels between two eyes. As the two eyes of the same human being are considered, several nerves, tissues, same reading habit and work habit, same environmental exposure, etc. are responsible for similar effects on both the eyes. These unobservable factors are expressed in terms of the uncertainty 6j's. The model is then taken as y*kit = constant
+ \og(DuD) x j3DuD + GH x (3GH + DBP x (3DBp
+ (GH x DBP)
x PGHXDBP
+ bki + ekit,
k =
L,R.
A Bayesian semiparametric study as in Das and Biswas 9 was carried out with different choices of the prior. The prior for f3 is taken as a vague prior. This is taken by setting (3 ~ Np(0, EQ), where So = 10 4 x identity matrix, i.e., by
58
setting the variances to be large. Some noninformative prior proportional to unity is set for 7. The prior for p is the same discussed in subsection 3.1. A Dirichlet process prior is taken for bi = (&Lz!&JRi)T's such that bi ~ G ,
G~DP{a,G0), with Go ~ iV m (0,r), with some known T. When G 0 is known, i.e., G has a prior probability model with known hyperparameters, the posterior distributions can be easily obtained as in Wilks et al. 35 . Here we have set F = identity matrix. The posteriors are observed to be dominated by the data. We primarily take priors of the form discussed in Section 3. Now we discuss the nature of the conditional posteriors without providing the exact mathematical expressions. The conditional posterior of /3 is multivariate normal. The conditional posterior of p is non-standard, to sample from it one has to use the Metropolis-Hastings algorithm (see Hastings 1 8 ) . With some probability a(p, p ) (easily computable) we move to a candidate value p from the present value p, and with probability 1 — a(p, p ) we stick to p. We take p = p + h, where h is a random zero mean increment. The variance of h is usually taken to be of order O(^). The conditional posterior of 71J is found to be uniform over some domain. The conditional posterior of y*t is truncated bivariate normal, truncated in some rectangular region. Finally, the conditional posterior of bi is a mixture where one piece is normal and the others are point masses. That is, with some probability we choose bi = bj and otherwise we sample from the normal density. For detailed mathematical forms of the conditional posteriors, one can see Das and Biswas 9 . The derivation of the present case is similar to Das and Biswas 9 with a slight possible change in some expressions. The posterior summary statistics of the relevant parameters are then obtained. From our final analysis, the posterior mean of PDUD is 0.78, that for POH is 0.52 and for PDBP is 0.32. The posterior mean of PGHXDBP is 0.27. Thus all of DuD, GH and DBP have positive effect to the retinopathy on both the eyes in the sense that presence of any of them increases the retinopathy levels. The interaction of GH and DBP is also affecting the retinopathy in a positive way. The posterior mean of p, the polychoric correlation (correlation between two categorical random variables) is observed to be 0.862. Thus, quite a high correlation between the two eyes is present. This justifies the need for such an analysis of multivariate categorical data - one would loose much information if the analysis were carried out marginally for each response, separately. It is also observed that the posterior standard deviations of the regression coefficients are low for most of the cases. The Monte Carlo
59 error (MC error) is also quite small, these are less than 0.001 for most of the parameters. In turn we have tried the same analysis using logistic distributions for cm and emt (see Qui et al. 29 , for such logistic error distributions). But the results are almost same as that for normal errors. Hence we are not discussing the figures separately. 6
Computational Issues
In such analyses, computations play a major role. The main objection to the earlier frequentist's solutions (without going to the philosophical issues) was their tremendous computational burden. In fact, in some more complicated scenarios, most of those approaches become computationally intractable. Thus, along with the philosophical view point, we look at the computational burden of the Bayesian approaches with great interest. Qui et al. 29 discussed some such computational issues for multivariate ordinal data in some other situations. Their case was much simpler in the sense that they did not consider complicated random effect distributions like the Dirichlet process semiparametric model. Moreover the prior for p in our case yielded a non-standard conditional posterior, requiring Metropolis-Hastings algorithm to be carried out. All the computations of Biswas and Das 2 and Das and Biswas 9 were carried out using MCMC technique which avoids the evaluation of high dimensional numerical integration. The computations in Angers and Biswas 1 was done using a computer program in S-Plus. It took hours for the computation for model selection. The MCMC requires intensive computation and careful assessment of convergence. For computations, the freely available software WinBUGS is used and it eased the computational burden greatly. WinBUGS is a window version of the software BUGS (Bayesian analysis Using Gibbs Sampler), developed by MRC, Biostatistics Group, Institute of Public Health. For details see http://www.mrc-bsu.cam.ac.uk/bugs/ or the manual of WinBUGS 1.3 (see also Spiegelhalter et al. 3 2 ) . To operate the necessary computations, the software requires the likelihood, prior and data, although we have mentioned the distribution of the conditional posteriors only to visualize what is going on behind the tandom. It generates random samples from a series of conditional posterior distributions specified in the Bayesian model according to an MCMC algorithm. The analysis of the baseline data in Biswas and Das 2 was initially programmed in C. But it took more than 24 hours for the computation. By WinBUGS the computations were done in less than 4 hours. For the DP
60 prior with baseline and 4-year follow-up data in Das and Biswas 9 , WinBUGS took about 16 hours for the computation. After selecting the model and exclusion of several insignificant covariates, for the present analysis WinBUGS took about 12 hours for each computation. Standard approaches can be considered for assessing convergence (Cowles and Carlin 6 ) and for model checking (Gelman et al. 14 ) and comparisons (Chib 4 , Diciccio et al. n ) . In the present paper the convergence of Gibbs sampler is ensured by the basic approach of Gelman and Rubin 15 . Starting from some initial values of the parameters we generate 4 chains of values, each chain being generated starting from an over-dispersed distribution and with a sample size of 8000. We delete 4000 replications as "burning" samples to minimize the effect of initial values and retain the values of the next 4000 replications to approximate the posterior distribution. To monitor the convergence we focus our attention to the parameters of interest, namely DuD, GH, DBP and GH x DBP. Following Gelman and Rubin 15 , we compute the between and within chain mean squares of the retained values, say B and W respectively, for each of the parameters. Then we find s2 = (4000 - l)W/4000 + 5/4000, i/ = s2 + (4x 4000) _1 S, and finally the potential scale reduction factor r = v/W. The potential scale reduction factors are nearly 1, and this suggests that the desired convergence is achieved in the Gibbs sampler. A 4000 updates were burn in as the initial samples, followed by a further 4000 samples which were used to obtain the posterior summary statistics like the posterior mean, median, standard deviation, MC error, 95% probability interval for each of the parameters under consideration. 7
7 Concluding Remarks
• It could be of interest to analyze the 10-year follow-up data on the same occasion, but we could not access the data. • In addition to the probit link, a logit link was also tried, with little difference in the results. This was somewhat expected, as there is not much difference in the tail behavior of the normal and the logistic distributions. With the help of WinBUGS the computation is straightforward, requiring only a slight change in the program. • A nonparametric technique based on smoothing spline ANOVA was employed by Wahba et al. 33. We are now trying to employ a suitable
wavelet approach. This is under consideration and will be pursued in a future communication. • If we had to deal with multivariate data, the number of correlation components in the correlation matrix would be greater than one, namely ρ_12, ρ_13, .... In this case, writing ρ = (ρ_12, ρ_13, ...)^T, the prior of the k-dimensional vector ρ (say) can be taken to be proportional to a k-variate normal density (with known mean vector and dispersion matrix) truncated to a set C, a convex solid body and subset of [−1, 1]^k, that leads to a proper correlation matrix. See Chib and Greenberg 5 for details (a sampling sketch for such a prior follows these remarks). One can set the variances large to make it effectively a uniform prior. If, instead, Σ were taken as a proper (unknown) dispersion matrix, an inverted Wishart prior could be appropriate; the conditional posterior would then also be inverted Wishart. • One relevant point may be the case where some component(s) of the multivariate categorical response is (are) not ordinal. In that case one can possibly use the Rasch model for those component(s). The problem of combining it with our present set-up is a challenge, as the latent variable approach will not be applicable for those components. The theory is under study with some ordinal components and some non-ordinal, purely categorical components; the details will be pursued in a separate communication. Another interesting situation is the case where some of the components of y_i, say y_{1i}, ..., y_{si}, are ordinal categorical, and the remaining components, say y_{s+1,i}, ..., y_{ki}, are continuous. See Das et al. 10 for a Bayesian approach in such a situation. • If, in the multivariate case, some of the cells of the k-way ordered classification (two-way, for the bivariate case) are empty, the analysis becomes more complicated. In a bivariate case, such a situation was observed and analyzed by Weiss 34 in the frequentist set-up. But Weiss 34 observed that the log-likelihood function is not globally concave, resulting in serious difficulty in estimation. In the Bayesian set-up, the likelihood is simply as in (1) with an indicator multiplied in for each individual i, which is 1 if the bivariate observation vector belongs to the non-empty cells and 0 otherwise. A Bayesian analysis in this direction has been done by Das and Biswas 8, which shows that the Bayesian solution does not share the drawbacks of the frequentist approach. • Finally, we would like to provide some motivation for the random effect (in the frequentist sense) semiparametric model. We are, in fact, observing some covariates. But there is reason to believe that numerous physiological factors, such as several nerves, affect the retinopathy of an individual, and most of them are unobservable. To model their effects on the retinopathy level, one has to bring some random component (in the fre-
quentist's sense) under consideration. See Das and Biswas 9 for a detailed discussion in this context.
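As an illustration of the truncated multivariate normal prior on the correlation vector mentioned in the remarks above, here is a minimal rejection-sampling sketch in Python. The mean, dispersion and dimension are hypothetical; this is not the code used in the paper.

import numpy as np

def sample_valid_correlations(mean, cov, dim, n_draws, rng):
    # Rejection sampler: keep a draw of rho = (rho_12, rho_13, ...) only when
    # every entry is in (-1, 1) and the implied correlation matrix is
    # positive definite (the convex set C of the remarks above).
    draws = []
    while len(draws) < n_draws:
        rho = rng.multivariate_normal(mean, cov)
        if np.any(np.abs(rho) >= 1.0):
            continue
        R = np.eye(dim)
        R[np.triu_indices(dim, k=1)] = rho   # fill the upper triangle
        R = R + R.T - np.eye(dim)            # symmetrize, diagonal = 1
        if np.all(np.linalg.eigvalsh(R) > 0):
            draws.append(rho)
    return np.array(draws)

rng = np.random.default_rng(1)
k = 3  # number of correlations for a 3-dimensional response (hypothetical)
samples = sample_valid_correlations(np.zeros(k), 4.0 * np.eye(k), 3, 100, rng)

Making the prior variances very large, as suggested above, gives an effectively uniform prior over C, at the cost of a lower acceptance rate for this sampler.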
Acknowledgments

The author wishes to thank the two referees for their careful reading and suggestions, which led to improvements over an earlier version of the manuscript. The author also wishes to thank Drs. Ronald Klein and Barbara E. K. Klein of the University of Wisconsin for providing both the baseline and 4-year follow-up data of the WESDR study. The WESDR project was originally supported in part by grant EY 03083 (R. Klein) from the National Eye Institute, NIH.

References

1. J.-F. Angers and A. Biswas, Technical Report (University of Montreal, CRM 2757, 2001).
2. A. Biswas and K. Das, Statistics in Medicine 21, 549 (2002).
3. M.-H. Chen and Q.-M. Shao, Journal of Multivariate Analysis 71, 277 (1999).
4. S. Chib, Journal of the American Statistical Association 90, 1313 (1995).
5. S. Chib and E. Greenberg, Biometrika 85, 347 (1998).
6. M.K. Cowles and B.P. Carlin, Journal of the American Statistical Association 89, 883 (1994).
7. J.R. Dale, Biometrics 42, 909 (1986).
8. K. Das and A. Biswas, Technical Report (Applied Statistics Division, Indian Statistical Institute, 2000).
9. K. Das and A. Biswas, Technical Report (Applied Statistics Division, Indian Statistical Institute, 2001).
10. K. Das et al., Calcutta Statistical Association Bulletin 49, 255 (1999).
11. T.J. Diciccio et al., Journal of the American Statistical Association 92, 903 (1997).
12. T.S. Ferguson, Ann. Statist. 1, 209 (1973).
13. A.E. Gelfand et al., Journal of the American Statistical Association 85, 972 (1990).
14. A. Gelman et al., Statistica Sinica 6, 733 (1996).
15. A. Gelman and D.B. Rubin, Statistical Science 7, 457 (1992).
16. S. Geman and D. Geman, IEEE Trans. on Pattern Analysis and Machine Intelligence 6, 721 (1984).
17. R.F. Hamman, Proceedings of the Diabetes Control Conference, Atlanta, Centers for Disease Control, 32 (1982).
18. W.K. Hastings, Biometrika 57, 97 (1970).
19. K. Kim, Statistics in Medicine 14, 1341 (1995).
20. K. Kim et al., in Collected Papers in Honor of the Retirement of Professor Chung Han Yung, eds. J.Y. Park et al. (Seoul National University, Department of Computer Science and Statistics Alumni Association, 1996), p. 36.
21. R. Klein et al., American Journal of Epidemiology 118, 228 (1983).
22. R. Klein et al., Arch. Ophthalmol. 112, 1217 (1994).
23. R. Klein et al., Journal of the American Medical Association 260, 2864 (1988).
24. R. Klein et al., Arch. Ophthalmol. 107, 237 (1989).
25. R. Klein et al., Am. J. Epidemiol. 119, 54 (1984).
26. K.P. Kleinman and J.G. Ibrahim, Biometrics 54, 921 (1998).
27. G. Molenberghs and E. Lesaffre, Journal of the American Statistical Association 89, 633 (1994).
28. P. Palmberg et al., Ophthalmology 88, 613 (1981).
29. Z. Qui, Journal of Biopharmaceutical Statistics 12 (2002).
30. L.I. Rand, Am. J. Med. 70, 595 (1981).
31. SAS Institute, SAS/STAT User's Guide, Version 6, 4th ed. (SAS Institute, Inc., Cary, North Carolina, 1989).
32. D.J. Spiegelhalter et al., Technical Report (University of Cambridge, MRC Biostatistics Unit, 1994).
33. G. Wahba et al., Ann. Statist. 23, 1865 (1995).
34. A.A. Weiss, Applied Statistics 42, 487 (1993).
35. W.R. Gilks et al., Biometrics 49, 441 (1993).
36. J.M. Williamson and K. Kim, Statistics in Medicine 15, 1507 (1996).
37. J.M. Williamson et al., Journal of the American Statistical Association 90, 1432 (1995).
38. J.M. Williamson et al., Computer Methods and Programs in Biomedicine 58, 25 (1999).
SECOND-ORDER MOMENTS AND MUTUAL INFORMATION IN THE ANALYSIS OF TIME SERIES

DAVID R. BRILLINGER
Statistics Department, University of California, Berkeley, CA 94720-3860
E-mail: [email protected]

A statistical network is a collection of nodes representing random variables and a set of edges that connect the nodes. A probabilistic model for such a network is called a statistical graphical model. These models, graphs and networks are particularly useful for examining statistical dependencies amongst quantities via conditioning. In this article the nodal random variables are time series. Basic to the study of statistical networks is some measure of the strength of (possibly directed) connections between the nodes. The use of ordinary and partial coherences and of mutual information is considered as a basis for inference concerning statistical graphical models. The focus of this article is simple networks. The article includes an example from hydrology.
1 Introduction
Science concerns relationships. The question that usually arises is: what is the form of a relationship? A lesser question is: how strong is the relationship? The work presented considers the use of partial coherency and of coefficients of mutual information as measures of the strength of association of connections. An example involving river flows measured at a succession of dams along the Mississippi River is presented. Here the nodes are in series and the edges are directed. The locations of the dams are provided in Figure 1. Basic books discussing statistical graphical models include Cox and Wermuth 7, Whittaker 19, Edwards 8 and Lauritzen 16. The paper has the following sections: Mutual Information, Networks, Results, Discussion and Extensions.
2 Mutual Information

2.1 Continuous Case
The field of information theory provides some concepts of broad use in statistics. One of these is mutual information. It is a generalization of the coefficient of determination, corr{X, Y}², and it unifies a variety of problems. For a bivariate random variable (X, Y) with density function p(x, y) the
mutual information (MI) is defined as

    I_XY = ∫_S p(x, y) log [ p(x, y) / (p_X(x) p_Y(y)) ] dx dy,    (1)

where S is the region on which p(x, y) > 0. As an example, for the bivariate normal the MI is given by

    I_XY = −(1/2) log(1 − ρ²_XY),
where ρ_XY is corr{X, Y}. The coefficient I_XY has the following properties:
1) Invariance: I_XY = I_UV if the transformation (X, Y) → (U, V) has the form U = f(X), V = g(Y) with f and g each differentiable one-to-one transforms.
2) Non-negativity: I_XY ≥ 0.
3) Measuring independence, in the sense that I_XY = 0 if and only if X and Y are statistically independent.
4) Providing a measure of the strength of dependence, in the senses that i) I_XY = ∞ if Y = g(X), and ii) I_XZ ≤ I_XY if X is independent of Z given Y.
The property 3), that I_XY = 0 only if X and Y are independent, stands in strong contrast to the much weaker corresponding property of the correlation ρ_XY.

The estimation of entropy

There are several methods that have been used.

Nonparametric estimate
Suppose one is considering the bivariate random variable (X, Y). Supposing further that p̂(x, y) is an estimate of the density p(x, y), for example a kernel estimate, then a direct estimate of the entropy is

    δ² Σ_{i,j} p̂(iδ, jδ) log p̂(iδ, jδ) ≈ Ê{log p(X, Y)}

for δ small.
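A minimal Python sketch of this plug-in estimate, using Gaussian kernel density estimates evaluated on a grid of spacing δ; the simulated data, grid size and threshold are arbitrary choices for illustration.

import numpy as np
from scipy.stats import gaussian_kde

def mutual_information_kde(x, y, grid_size=100):
    # Plug-in MI estimate: Riemann sum of p log(p / (pX pY)) over a grid
    joint = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    gx = np.linspace(x.min(), x.max(), grid_size)
    gy = np.linspace(y.min(), y.max(), grid_size)
    dx, dy = gx[1] - gx[0], gy[1] - gy[0]
    GX, GY = np.meshgrid(gx, gy)
    p = joint(np.vstack([GX.ravel(), GY.ravel()])).reshape(GX.shape)
    pxg = px(GX.ravel()).reshape(GX.shape)
    pyg = py(GY.ravel()).reshape(GY.shape)
    mask = p > 1e-12                       # the region S where p(x, y) > 0
    return np.sum(p[mask] * np.log(p[mask] / (pxg[mask] * pyg[mask]))) * dx * dy

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
y = 0.6 * x + 0.8 * rng.normal(size=2000)   # true corr 0.6
print(mutual_information_kde(x, y), -0.5 * np.log(1.0 - 0.6 ** 2))

As noted below, such estimates can behave poorly where the marginal densities are small, and the grid-based form does not scale to higher dimensions.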
In the same way Ê{log p_X(X)} and Ê{log p_Y(Y)} may be estimated, and one can proceed to an estimate of the mutual information via expression (1). References to the type of entropy estimate just described and some statistical
properties include: Joe 15, Hall and Morton 13, Fernandes 9, Hong and White 14, and Granger et al. 12. Difficulties with this form of estimate can arise when p_X(·), p_Y(·) are small. The nonparametric form also runs into difficulty when one moves to higher dimensions. A sieve type of estimate is presently being investigated for this situation, in particular an orthogonal function expansion employing shrunken coefficient estimates.

Parametric estimates of entropy

If the density p(x, y | θ) depends on a parameter θ that may be estimated reasonably, then an immediate estimate of the entropy is provided by

    ∫ p(x, y | θ̂) log p(x, y | θ̂) dx dy.
Another form of estimate is based on the likelihood function. Suppose one has a model for the random variable (X, Y) including the parameter θ (of dimension ν). Suppose the model has the property that X and Y are independent when θ = 0. When there are n independent observations the log-likelihood ratio for the hypothesis θ = 0 is

    Σ_i log [ p(x_i, y_i) / (p_X(x_i) p_Y(y_i)) ]

with expected value n I_XY. This suggests the use of the log-likelihood ratio statistic divided by n as an estimate of I_XY. A further aspect of the use of this statistic is that its distribution will be approximately proportional to χ²_ν, where ν is the dimension of θ, when X and Y are independent.

Partial analysis

When networks are being considered the conditional mutual information is also of use. For a trivariate random variable (X, Y, Z) one can consider

    I_XY|Z = ∫∫∫ p(x, y, z) log [ p(x, y, z) p(z) / (p(x, z) p(y, z)) ] dx dy dz.
Its value for the trivariate normal is

    −(1/2) log(1 − ρ²_XY|Z),

with ρ_XY|Z the partial correlation of X and Y having removed the linear effects of Z.
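For the Gaussian case the formulas above can be evaluated directly from sample correlations and partial correlations. A small Python sketch, for illustration only:

import numpy as np

def gaussian_mi(x, y):
    # MI under a fitted bivariate normal: -(1/2) log(1 - rho^2)
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

def gaussian_conditional_mi(x, y, z):
    # I_XY|Z under a trivariate normal, via the partial correlation of
    # X and Y having removed the linear effect of Z
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    partial = (rxy - rxz * ryz) / np.sqrt((1.0 - rxz ** 2) * (1.0 - ryz ** 2))
    return -0.5 * np.log(1.0 - partial ** 2)

# n * gaussian_mi(x, y) plays the role of the log-likelihood ratio statistic
# for independence discussed above.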
2.2 Processes
A disadvantage of MI as introduced above is that it is simply a scalar. As consideration turns to the process case, i.e. functions, it seems pertinent to seek to decompose its value somehow.

1. Time-side approach

The entropy of a process is defined by a suitable passage to the limit, for example as

    lim_{T→∞} E{log p(X_1, X_2, ..., X_T)},

where p(x_1, ..., x_T) denotes the density of order T. To begin, one can simply consider the mutual information of the values Y(t + u), Y(t) or of the values Y(t + u), X(t). This leads to a consideration of the coefficients I_YY(u) and I_YX(u),
i.e. mutual information as a function of lag u. References to this idea include Li 17 and Granger and Lin 11.

2. Frequency-side approach

Similarly it seems worth considering the mutual information at frequency λ of two components of a bivariate stationary time series. This could be defined as the mutual information of the spectral increments dZ_X(λ) and dZ_Y(λ) of the Cramér representation. Because these variates are complex-valued, a 4-variate random variable is involved. In the Gaussian case the MI at frequency λ is

    −log(1 − |R_XY(λ)|²),
where |R_XY(λ)|² is the coherence of X and Y at frequency λ. The overall information rate is

    −∫_{−π}^{π} log(1 − |R(λ)|²) dλ

(see Granger and Hatanaka 10).
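One rough way to compute these quantities from data is to estimate the coherence by averaging periodograms over blocks and then apply the Gaussian formula at each frequency. A Python sketch, with an arbitrary block length, no tapering and crude smoothing, so indicative only:

import numpy as np

def coherence_and_mi(x, y, block=256):
    # Block-averaged periodogram estimate of |R_XY(lambda)|^2 and the
    # implied Gaussian MI at each frequency, -log(1 - |R|^2)
    nblocks = min(len(x), len(y)) // block
    sxx = np.zeros(block // 2 + 1)
    syy = np.zeros(block // 2 + 1)
    sxy = np.zeros(block // 2 + 1, dtype=complex)
    for b in range(nblocks):
        xb = x[b * block:(b + 1) * block]
        yb = y[b * block:(b + 1) * block]
        dx = np.fft.rfft(xb - xb.mean())
        dy = np.fft.rfft(yb - yb.mean())
        sxx += np.abs(dx) ** 2
        syy += np.abs(dy) ** 2
        sxy += dx * np.conj(dy)
    coh = np.clip(np.abs(sxy) ** 2 / (sxx * syy + 1e-300), 0.0, 1.0 - 1e-12)
    freq = np.fft.rfftfreq(block)   # cycles per observation (cycles/day here)
    return freq, coh, -np.log(1.0 - coh)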
[Figure 1 appears here: a map of the upper Mississippi River region (Minnesota, Wisconsin, Iowa), with locks and dams, reservoirs and stream gages marked, from St Anthony Falls southward.]
Figure 1. The locations of the 10 dams along the Mississippi River, some of whose flow rates are studied.
In the general case, for each frequency one might construct an estimate, Î_XY(λ), based on kernel estimates of the densities, taking empirical FT-values near λ as the data. One way to estimate the MI, suggested above, is to fit a parametric model and then to use the log-likelihood ratio test statistic for a test of independence. A novel way, also being pursued, is to use first recurrence time estimates of entropy (see Ornstein and Weiss 18 and Wyner 20).
3 Networks
In crude terms a network is a box (or node) and line (or edge) diagram, in which some of the lines may be directed. In our work a box corresponds to a random entity: to a random variable, to a time series or to a point process. In studying such models the methods of statistical graphical models provide pertinent methodology. Typically these models are based on conditional distributions. See the books by Edwards 8, Whittaker 19, Lauritzen 16 and Cox and Wermuth 7.
If A, B, C represent nodes, a question may be as simple as: is the structure A → B → C appropriate, or is it better described as (A, B) → C? On the other hand the question may be as complicated as: what is the wiring diagram of the brain? Figure 1 shows the locations of dams of a network along the Mississippi River. Since the bulk of the water flows south, an elementary graphical model for this situation is: Dam 1 → Dam 2 → ... → Dam 10. Of course there are other sources of water, such as entering rivers and rainfall, to be taken note of. The figure and the data to be analyzed are taken from www.mvp-wc.usace.army.mil/projects/lockjdam.html. One reference to the approach of this paper is Brillinger 3.
4 Results

4.1 Mississippi River Flow
The waters of the Mississippi River flow from Minnesota in the north of the United States to the Gulf of Mexico. Flooding along the river has long been a concern, and the U.S. Army Corps of Engineers has constructed a series of locks for flood control and as an aid to navigation. The waters flowing may be viewed as a system added to by precipitation and by flow from entering streams and runoff and reduced by evaporation, absorption and diversion. The basic data employed in the study to be described are the daily water
[Figure 2 appears here: four panels showing the estimated coherence of Dams 2 and 4, of Dams 2 and 5, and of Dams 4 and 5, and the estimated partial coherence 25|4, each plotted against frequency (cycles/day).]
Figure 2. Estimated partial coherence of log(flow rate) at Dams 2 and 5 given those at Dam 4. The horizontal line gives the approximate upper 95% null level.
[Figure 3 appears here: three panels showing the estimated partial coherences 35|4, 24|3 and 25|34, each plotted against frequency (cycles/day).]
Figure 3. The estimated partial coherence of log(flow rate) at Dams 3 and 5 given Dam 4, Dams 2 and 4 given Dam 3, and Dams 2 and 5 given those at Dams 3 and 4. The horizontal lines give the approximate upper 95% null level.
flow rates as recorded at a succession of dams along the river. The data are daily from 1960 on and were obtained from the WWW site mentioned above. Figure 1, taken from that web site, shows the locations of the dams. Entering streams may be seen. Consider for example the logarithms of the flow rates, Y_2(t), Y_4(t), Y_5(t), at Dams 2, 4, 5. Their locations may be seen in Figure 1. One sees the bulk of the waters passing from Dam 2 to Dam 4 and then on to Dam 5. One sees the Zumbro River entering between Dams 4 and 5 and three other rivers entering between Dam 2 and Dam 4. The logarithms of the flow rates are taken as the basic variables. This situation will be studied as providing a useful test bed for studying the effectiveness of the partial coherence and mutual information parameters. Figure 2 presents the results of a partial coherence analysis focusing on Dams 2, 4, 5. The top right panel provides the estimated coherence function of Dams 2 and 5. The horizontal line gives the approximate upper 95% null level under the null hypothesis of zero coherence. One sees high coherence at the low frequencies. The bottom right plot is the estimated partial coherence function of Dams 2 and 5 having removed the linear time invariant effects of Dam 4. Once again the horizontal line gives the approximate upper 95% null level under the null hypothesis of zero coherence. One looks for more than about 5% of the values lying above these lines. In the partial coherence case one sees not too much activity. In this situation the partial coherence could have been anticipated to be negligible because Dam 4 is so close to Dam 5, i.e. highly effective in blocking off the direct effects of Dam 2. As an aid to understanding this analysis one can consider the model

    Y_5(t) = ∫ a(t − u) Y_4(u) du + ε(t)

    Y_4(t) = ∫ b(t − u) Y_2(u) du + η(t)
with ε and η noise processes. The partial coherence estimated is then the coherence of the processes ε and η. The analysis is now extended to 4 series. Figure 3 provides the estimated partial coherences of Dams 3 and 5 having removed the effects of Dam 4, of Dams 2 and 4 having removed the effects of Dam 3, and of Dams 2 and 5 having removed the effects of Dams 3 and 4. The required formulas may be found in Brillinger 2. The networks contemplated in the analysis are the parallel and series ones. Logically the series structure is appropriate. All three of the partial coherences appear weak. This is consistent with the series structure of the graph, as was anticipated. Had there been parallel links
[Figure 4 appears here: two panels showing the estimated mutual information of the real parts and of the imaginary parts, each plotted against frequency (cycles/day).]
Figure 4. The top panel provides an estimate of the MI of the two real parts of the Mississippi river flows at Dams 2 and 5. The bottom panel similarly refers to the two imaginary parts. The dashed line gives the approximate upper 95% null level.
through Dams 3 and 4 instead of the serial ones, then the partial coherences 35|4 and 24|3 would not be expected to be near 0 generally. As a further example, the MI of series 2 and 5 was estimated as a function of frequency λ, actually the MIs of the two real parts of the frequency components and of the two imaginary parts. These estimates are graphed in Figure 4. Approximate 2.326 s.e. limits are obtained by randomly altering the phases of the empirical FT components, following the procedure of Braun and Kulperger 1. In the plots one notes some apparent association at the lower frequencies. This could arise, for example, from the occurrence of snow or rain storms affecting the segments of the river at the same time. An estimate of the full MI is currently being developed following the discussion of Section 2. A point process analysis of this situation is developed in Brillinger 4. These data are also considered in Brillinger 5.
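The phase-randomization device of Braun and Kulperger 1 referred to above can be sketched as follows; this is an illustrative Python fragment, not the authors' code.

import numpy as np

def phase_randomize(x, rng):
    # Surrogate series with the same amplitude spectrum as x but random phases
    fx = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, size=fx.shape)
    phases[0] = 0.0       # keep the zero-frequency term real
    phases[-1] = 0.0      # and the Nyquist term, for even-length series
    return np.fft.irfft(np.abs(fx) * np.exp(1j * phases), n=len(x))

# Approximate null limits: recompute the MI estimate over many surrogates
# and take, say, mean + 2.326 standard deviations at each frequency.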
5 Discussion and Extensions
The estimated partial coherences of the dams' flow rates had the forms anticipated given the physical knowledge of the situation. The coefficient of mutual information is a unifying concept extending second-order quantities that have restricted applicability. Its being 0 actually implies that the quantities involved are statistically independent. Another important advantage is that the MI pays no real attention to the values of the process: they can be non-negative, integers or proportions, for example. The MI is useful when one wishes to make inferences stronger than "the hypothesis of independence is rejected" and more of the character "the strength of connection is I." During the work, plots of the function I_XY(λ) appeared more useful than the simple scalar I_XY. Both parametric model-based estimates and nonparametric estimates of mutual information have been mentioned and computed. A number of extensions are available and some work is in progress. One can consider the cases of point processes, spatial-temporal data, local estimates, of learning, of change, of trend, and of comparative experiments. Brillinger 6 contains related ideas and examples from neurophysiology. One needs to develop the statistical properties of other estimates of MI, such as the estimate based on waiting times and the sieve estimates.
Acknowledgments

Dr. Partha Mitra made some stimulating remarks concerning the use of mutual information. The Referees' remarks led to clarifications and reductions; I thank them. Part of the material was presented as the Opening Lecture at the XXXIème Journées de Statistique in Grenoble in 1999. This research was supported by the NSF grants DMS-9704739 and DMS-9971309.

References

1. J. Braun and R. Kulperger, Commun. Statist.-Theory Meth. 26, 1329 (1997).
2. D. R. Brillinger, Time Series: Data Analysis and Theory (Holt-Rinehart, New York, 1975; republished as a SIAM Classic in Applied Mathematics, 2001).
3. D. R. Brillinger, Brazilian Review of Econometrics 16, 1 (1996).
4. D. R. Brillinger, Technometrics 43, 266 (2001).
5. D. R. Brillinger, Proc. Interface 2001.
6. D. R. Brillinger, Revista Investigacion Operacional (to appear, 2002).
7. D. R. Cox and N. Wermuth, Multivariate Dependencies: Models, Analysis, and Interpretation (Chapman & Hall, London, 1998).
8. D. Edwards, Introduction to Graphical Modelling (Springer, New York, 1995).
9. M. Fernandes, Nonparametric entropy-based tests of independence between stochastic processes (Preprint, Fundacao Getulio Vargas, Rio de Janeiro, 2000).
10. C. W. J. Granger and M. Hatanaka, Spectral Analysis of Economic Time Series (Princeton University Press, Princeton, 1964).
11. C. W. J. Granger and J.-L. Lin, J. Time Series Anal. 15, 371 (1994).
12. C. W. J. Granger, E. Maasoumi and J. Racine, A dependence metric for nonlinear time series (Preprint, Dept. of Economics, UCSD, 2000).
13. P. Hall and S. C. Morton, Ann. Inst. Statist. Math. 45, 69 (1993).
14. Y. Hong and H. White, Asymptotic distribution theory for nonparametric entropy measures of serial dependence (Preprint, Economics Department, UCSD, 2000).
15. H. Joe, Ann. Inst. Statist. Math. 41, 683 (1989).
16. S. L. Lauritzen, Graphical Models (Oxford University Press, Oxford, 1996).
17. W. Li, J. Statistical Physics 60, 823 (1990).
18. D. S. Ornstein and B. Weiss, IEEE Trans. Inf. Theory 39, 78 (1993).
19. J. Whittaker, Graphical Models in Applied Multivariate Statistics (Wiley, New York, 1990).
20. A. J. Wyner, Ann. App. Prob. 9, 780 (1999).
SPATIAL ASSOCIATION BETWEEN COMMUNITY AIR POLLUTION AND HEART DISEASE: ANALYSIS OF CORRELATED DATA
SABIT CAKMAK, RICK BURNETT
Health Canada, PL 3104E, Ottawa, On., K1A 1B6, Canada
E-mail: [email protected]

MICHAEL JERRETT
School of Geography and Geology, McMaster University, Hamilton, Canada

MARK S. GOLDBERG
Departments of Epidemiology, Biostatistics, and Occupational Health, McGill University, Canada

ARDEN POPE III
Economics Department, Brigham Young University, Provo, Utah, USA

RENJUN MA
Department of Mathematics and Statistics, University of New Brunswick, Canada

DANIEL KREWSKI
Institute of Population Health, University of Ottawa, Canada
Cohort study designs are often used to assess the association between community-based ambient air pollution concentrations and health outcomes, such as mortality, development and prevalence of disease, and pulmonary function. Typically, a large number of subjects are enrolled in the study in each of a small number of communities. Fixed-site monitors are used to determine long-term exposure to ambient pollution. The association between community average pollution levels and health is determined after controlling for risk factors of the health outcome measured at the individual level (e.g., smoking). We present a new spatial regression model linking spatial variation in ambient air pollution to health. Health outcomes can be measured as continuous variables (pulmonary function), binary variables (prevalence of disease), or time to event data (survival or development of disease). The model incorporates risk factors measured at the individual level, such as smoking, and at the community level, such as air pollution. We demonstrate that the spatial autocorrelation in community health outcomes, an indication of not having fully characterized potentially confounding risk factors to the air pollution-health association, can be accounted for through the inclusion of location in the deterministic component of the model assessing the effects of air pollution on health. We present a statistical approach that can be implemented for very large cohort studies. Our methods are illustrated with an analysis of the American Cancer Society cohort to determine whether the prevalence of heart disease is associated with concentrations of sulfate particles.
1 Introduction
Cohort study designs are often used to assess the association between community-based ambient air pollution concentrations and health outcomes, such as mortality, development and prevalence of disease, and pulmonary function. Typically, a large number of subjects are enrolled in the study in each of a small number of communities. Practical considerations of implementation motivate this epidemiologic design, in that it is relatively easy to recruit additional subjects in a community compared to identifying communities with appropriate longitudinal pollution monitoring records. Fixed-site monitors are used to determine long-term exposure to ambient pollution. The association between community average pollution levels and health is determined after controlling for risk factors of the health outcome measured at the individual level (e.g., smoking). Standard statistical computing software programs (e.g., SAS 10) can be used for analysis if the assumption of statistical independence between subjects is appropriate. Health responses, however, often cluster by community, indicating that responses of subjects within the same community are more similar than responses of subjects in different communities. This implies that community itself poses some risk to health. Community-level variables, such as measures of socio-economic status of the community, can be used to model this unexplained risk in addition to individual-level risk factors. Failure to account for all the variation between community health outcomes, even after controlling for individual- and community-level risk factors, can lead to downward biased estimates of the uncertainty in the community-level risk factors, including air pollution (see Ware and Stram 13). Additional bias in the uncertainty of the risk estimates can occur if the community average health outcomes display spatial autocorrelation. That is, health responses for communities close together are more similar than responses for communities farther apart, thus invalidating the use of standard statistical models such as linear, logistic or hazard models. Autocorrelation in the residuals of these models could be due to missing or systematically mis-measured risk factors that are spatially autocorrelated. Failure to account for spatial autocorrelation can yield downward biased estimates of uncertainty in the community-level risk factors and may suggest incomplete control for potentially confounding community-level factors associated with the variables of primary interest, such as air pollution (Miron 8). We present a regression model in which the residual community health responses are characterized by community-based stochastic variables called "random effects", after controlling for individual and community-level risk factors. The variance of the random effects represents the residual variation
in response between communities. Broader spatial trends in residual health outcomes are modeled by non-parametric smoothers of location in the deterministic component of the model. This approach is analogous to that used in time series studies in which temporal trends in health series are jointly modeled with air pollution (Cakmak et al. 2). Our modeling framework can accommodate continuous, binary, and time to event health data. We present an approach to statistical inference that can accommodate the very large datasets typically encountered in this area. Our methods are illustrated using data obtained from the American Cancer Society (ACS) study of particulate air pollution and health (Pope III et al. 9). We examined the association between the prevalence of heart disease and ambient particulate sulfate concentrations in 144 metropolitan areas in the United States.
2 Spatial Model
Three basic regression models are considered here: linear, logistic, and time to event. First, we start with time to event models. Suppose that the cohort of interest is stratified on the basis of one or more relevant covariates. Let the instantaneous probability of an event at time t, or hazard function, for an individual i residing in area s and a member of stratum m be given by h_i^m(t). We propose a space-time model to relate spatial risk factors to longevity. The hazard h_i^m(t) for our model is determined by

    h_i^m(t) = h_0^m(t) exp{ζ(s) + ρ^T X_i(t) + β^T Z(s) + η(s)}    (1)
where h_0^m(t) is the baseline hazard function for the mth stratum, ζ(s) is the two-dimensional term to account for residual spatial variability, ρ is a vector of unknown regression coefficients linking individual risk factors to the hazard function, and β is a vector of unknown regression coefficients linking the spatial-level risk factors to the hazard function. Covariate information modulates the baseline hazard function, with the regression parameters ρ and β representing the logarithm of the relative risk of death per unit change in the individual and spatial covariates, respectively. The spatial random effects, η(s), or frailties, are shared by all individuals in area s. These random effects reflect the difference between the observed hazard function and the hazard function predicted from a statistical model. We assume that the spatial process has zero expectation, variance θ > 0 and correlation matrix Ω with dimension S. Second, we consider the logistic model. Let us assume that the response probability is given by π. The odds of a positive response, π/(1 − π), for an individual
residing in area s and a member of stratum m is defined by

    π/(1 − π) = exp{ζ(s) + ρ^T X_i(t) + β^T Z(s) + η(s)}    (2)
where ζ(s) is the two-dimensional term to account for residual spatial variability, ρ is a vector of unknown regression coefficients linking individual risk factors to the odds of a positive response, β is a vector of unknown regression coefficients linking the spatial-level risk factors to the odds of a positive response, and the spatial random effects, η(s), are shared by all individuals in area s. Third, we consider the linear model. The model assumes that the dependence of health responses on covariates occurs through the linear combination

    ζ(s) + ρ^T X_i(t) + β^T Z(s) + η(s)    (3)
where ζ(s) is the two-dimensional term to account for residual spatial variability, ρ is a vector of unknown regression coefficients linking individual risk factors to the health responses, β is a vector of unknown regression coefficients linking the spatial-level risk factors to the health responses, and the spatial random effects, η(s), are shared by all individuals in area s.
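Models (1)-(3) share the same linear predictor and differ only in the link. A small illustrative Python fragment (all numbers hypothetical):

import numpy as np

def linear_predictor(zeta_s, rho, x_i, beta, z_s, eta_s):
    # zeta(s) + rho'X_i(t) + beta'Z(s) + eta(s), shared by models (1)-(3)
    return zeta_s + rho @ x_i + beta @ z_s + eta_s

lp = linear_predictor(0.1, np.array([0.2, -0.1]), np.array([1.0, 3.0]),
                      np.array([0.01]), np.array([6.4]), 0.05)
hazard_multiplier = np.exp(lp)   # model (1): multiplies h_0^m(t)
odds = np.exp(lp)                # model (2): pi / (1 - pi)
prevalence = odds / (1.0 + odds)
mean_response = lp               # model (3)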
3 Statistical Estimation
We divide the estimation procedure into two stages. In "Stage One", we work in the time domain. Health outcome data are modeled by covariates at the individual level and indicator functions for each community. Community-level covariates, such as air pollution, are not included at this stage. For our hazard model (1), for example, in the time domain we consider

    h_i^m(t) = h_0^m(t) exp{ Σ_{s=1}^{S−1} δ(s) I(s) + ρ^T X_i(t) }    (4)
where I(s), s = 1, ..., S − 1, are indicator variables taking the value 1 if the subject resides in area s and zero otherwise. One area (S) is arbitrarily assigned as a reference. The unknown parameters δ(s) represent the logarithm of the relative risk of death for those subjects living in area s compared to those subjects in the reference area S, after controlling for the individual risk factors X_i(t). In the time domain the corresponding equations for linear and logistic regression models are straightforward. Estimates of the community-specific health responses are determined using standard computer software for linear, logistic and Cox proportional hazards survival models. We used SAS 10 procedures because they can accommodate very large sample sizes. For linear and logistic regression models, we do not specify an intercept term. The estimates of the indicator functions can be
interpreted as the average response in a specific community after adjusting for individual-level covariates. Output from stage one is the set of community-specific adjusted health responses, denoted by {δ̂(s), s = 1, ..., S}, where s denotes a zero-dimensional point in Cartesian (x, y) space representing the location of one of the communities under study. Additional output from this stage is the variance-covariance matrix of the δ̂(s), denoted by V, which describes the uncertainty in the estimates of the community-specific adjusted health responses. The Cox proportional hazards survival model incorporates a baseline hazard function that is interpreted as the instantaneous probability of an event given that all covariates in the model are assigned a zero value. In this case, indicator functions for community are defined with respect to a reference community. A limitation of this procedure is that the uncertainty of the estimate for the reference area is not defined. Because these values are based on comparisons with the same reference area, they are correlated. This correlation increases the estimated uncertainty in the location-specific log-relative risks {δ̂(s)}. The induced correlation can be removed by methods developed by Easton et al. 4. This procedure eliminates the covariance between the {δ̂(s)}, and defines an associated estimate of uncertainty for the assigned value of zero for δ̂(S). In "Stage Two", we work in the space domain. Estimates of community health responses are related to risk factors defined at the community level using a linear random effects regression model as follows:

    δ̂(s) = ζ(s) + β^T Z(s) + η(s) + ε(s)    (5)
where ε(s) is a random process with zero expectation and variance-covariance matrix V, independent of the spatial random effects process η(s). ζ(s) is the two-dimensional trend term to account for residual spatial variability, and β is a vector of unknown regression coefficients linking the vector of spatial-level risk factors, Z(s), to the community-specific health responses. The spatial random effects, η(s), are shared by all individuals in area s. These random effects reflect the difference between the observed and predicted values from our statistical model. We assume that the spatial process η(s) is stationary (i.e., its expectation does not vary in space), has zero expectation, variance θ > 0, and correlation matrix Ω with dimension S. The correlation of the random effects between two areas can be modeled by their distance apart or some other characteristic of their locations. Autocorrelation models, or correlation between the same response at different locations, typically assume that closer locations will have attribute values, in this case random effects, that are more similar than values in locations farther apart. Thus, these models are often characterized by functions that decrease monotonically with
distance (Matheron 7). Here δ̂(s) has expectation μ(s) = ζ(s) + β^T Z(s) and variance-covariance matrix Σ = θΩ + V. If the number of subjects and events in each community is large, as is assumed here, the {δ̂(s)} have approximately a multivariate normal distribution with mean vector μ and variance-covariance matrix Σ. The unknown parameter vector β and the variance of the random effects, θ, can be estimated by maximum likelihood methods, with the log-likelihood function minimized by Fisher's scoring algorithm (see Burnett et al. 1). The elements of V are assumed to be known in this stage. The random effects will be non-stationary if there is a trend in space. Spatial trends can be accounted for by the surface ζ(s). We consider non-parametric smoothed estimates of ζ using the robust locally-weighted regression (LOESS) smoothers (Cleveland and Devlin 3) within the generalized additive model (GAM) framework (Hastie and Tibshirani 5). The complexity of the surface is controlled by a "span" parameter, which is the proportion of the data used for the local regression. The larger the span, the smoother the estimated surface. We have developed a simple method to judiciously select the appropriate span in the LOESS smoother so as to minimize the autocorrelation structure of the random effects. We do this by plotting the correlation of the standardized residuals versus the distance between communities. The span is decreased until there is no apparent association with distance. The GAM procedure in S-Plus requires that the data be uncorrelated. Within our iterative estimation procedure, we run the GAM with a weight function given by the inverse of the diagonal elements of Σ. The GAM regression model consists of a LOESS surface of (x, y) and spatial or community-level covariates, Z(s). We then capture the marginal values of this surface in the x and y dimensions. These values are used as two new covariates in the random effects linear regression model, in which estimates of the covariate parameters, β, and the dispersion parameter, θ, are obtained by maximum likelihood methods. We iterate between the GAM step and the maximum likelihood step until little relative change in consecutive estimates of θ is observed (in this case < 10⁻⁴).
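A minimal Python sketch of the Stage Two fit for model (5) without the location surface: for fixed θ, β is the generalized least squares estimate, and θ is chosen to maximize the profile likelihood. The paper uses Fisher scoring (Burnett et al. 1); this sketch substitutes a simple bounded one-dimensional search and is illustrative only.

import numpy as np
from scipy.optimize import minimize_scalar

def stage_two_fit(delta, Z, V, Omega):
    # delta: (S,) adjusted responses; Z: (S, p) covariates;
    # V, Omega: (S, S) matrices; Sigma = theta * Omega + V
    def neg_profile_loglik(theta):
        Sigma = theta * Omega + V
        Sinv = np.linalg.inv(Sigma)
        beta = np.linalg.solve(Z.T @ Sinv @ Z, Z.T @ Sinv @ delta)
        resid = delta - Z @ beta
        sign, logdet = np.linalg.slogdet(Sigma)
        return 0.5 * (logdet + resid @ Sinv @ resid)
    theta = minimize_scalar(neg_profile_loglik, bounds=(1e-8, 1.0),
                            method="bounded").x
    Sinv = np.linalg.inv(theta * Omega + V)
    beta = np.linalg.solve(Z.T @ Sinv @ Z, Z.T @ Sinv @ delta)
    return beta, theta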
The American Cancer Society Study
Volunteers of the ACS enrolled over 1.2 million people in September of 1982 throughout the United States. Information on history of disease, longevity, and demographic characteristics were obtained. We obtained information on particulate sulfate levels from the Aerometric Information Retrieval System
83
(AIRS) and the Inhalable Particle Network (IPN) for 1980 and 1981 for 144 Metropolitan Statistical Areas (MSAs) in which ACS subjects were enrolled. Sulfates are secondarily formed particulate aerosols originating from sulfur dioxide emissions and are a major component of fine particulate matter. The sulfate data from AIRS was collected using glass fiber filters, which react in the presence of sulfur dioxide and artifactually innate the sulfate concentration. The sulfate data obtained from the IPN used teflon niters which are not subject to this artifact problem. Both monitoring networks were operating in 41 MSAs. We calibrated the AIRS sulfate data to the IPN sulfate data using six linear regression models with separate calibrations for three regions of the county and two time periods [April-September and October to March] 6 . Estimates of exposure were obtained by averaging all available sulfate data from all monitors located in a MSA for the years 1980 and 1981, inclusive. The mean concentration of sulfate particles adjusted for the sulfur dioxide artifact across all 144 cities was 6.4 ng/m3 , with a minimum value of 1.4 fig/m3, an interquartile range of 4.2 fig/m3, and a maximum value of 15.6 /j.g/m3.
5
Results
We examined the association between concentrations of sulfate particles and the prevalence of doctor diagnosed heart disease (8.6%). The mean age at enrollment for the 540,679 subjects was 56.6 years, 5% of subjects were younger than 40 years, 5% were older than 75 years, 56.6% of subjects were women, and 94% were white. There were 3,755 subjects per community on average, with a mean of 323 subjects with heart disease per community. BeaumontPort Arthur, Texas provided the least number of subjects (61) and the fewest with heart disease (7), while the most subjects were recruited in Los Angeles (23,151) with the highest number with heart disease (1,970). The first step in the analysis was to use the logistic regression procedure in SAS to identify all relevant individual risk factors that were associated with heart disease. This model also included indicator functions for the 144 communities to account for any extra between community variation not account for by the individual-level covariates. Thirty-seven risk factors were selected including variables representing age (a cubic polynomial to account for potential non-linear association with disease), gender, age-gender interaction, race (white versus other), quadratic polynomial of the number of cigarettes smoked and years smoking for current and former smokers, age starting smoking for current and former smokers, consumption of beer, wine, and liquor, quadratic polynomial of body mass index (square of height divided by weight), edu-
84
cational attainment (less than high school, high school, or more than high school), martial status (single, married, other), passive exposure to tobacco smoke, and regular exposure to some air toxics in work or daily life (asbestos, chemicals/acids/solvents, coal or stone dusts, coal tar/pitch/asphalt, diesel engine exhaust, or formaldehyde). In the next step, we visualized the spatial association between the prevalence of heart disease and sulfate particles using our stage two linear random effects regression model. Here, we regressed the {(5(s)} onto the (x,y) geographic coordinates defined by longitude and latitude of the 144 MSAs with a non-parametric smoothed spatial surface £ (Figure 1, right panel), excluding spatial covariate information such as air pollution (i.e. Z(s)=0 ). We used the GAM procedure in S-Plus 12 assuming no autocorrelation (i.e. U = 0) and a diagonal form for V. [A diagonal covariance matrix is required to use the GAM in S-Plus 12.] We use latitude and longitude for this visualization step because these coordinate definitions are more easily interpretable than the Cartesian (x,y) coordinate specification. We use the Cartesian coordinates in all other formal statistical analyses as the examination of spatial autocorrelation usually relies on Euclidian rather than angular distance measures. This procedure produced a three-dimensional surface of {^(s)} after adjusting for all individual level risk factors. We found that adjusted prevalence of heart disease was elevated in the Ohio Valley region south of Lake Erie and in the southeast, diminished in the west and south, and moderately elevated in the mountain states. We also spatially modeled concentrations of sulfate particles using a GAM assuming these values were uncorrelated (i.e. E = a21, where a2 is the residual variance and / is the identity matrix). Modeled sulfate values centered by their mean concentration are portrayed in left panel of Figure 1. A corresponding elevation in concentrations of sulfate particles exits in the Ohio Valley region and the southeast, with much lower concentrations in the west. Heart disease was elevated in the mountain states, a pattern not observed for sulfates. This visualization suggests a positive association between the two surfaces. We then fit our spatial linear random effects model with no spatial predictors, no location surface, assuming that the data are uncorrelated (S = 0 / -I- V) and determined the standardized residuals from this model. The association between the autocorrelation of these standardized residuals and distance is graphically presented in Figure 2 (panel a) using the correlogram function in the Spatial Module of S-Plus u . Correlograms measure autocorrelation as a function of distance between the observations within binned distance groups. This procedure generates average autocorrelations
85
Figure 1. Non-parametric smoothed surface of the prevalence of heart disease by latitude and longitude, adjusted for individual level covariates in American Cancer Society Study with smoothing parameter of 20 percent (left panel). Non-parametric smoothed surface of particulate sulfate concentrations by latitude and longitude with a smoothing parameter of 20 percent (right panel). Note, z-axis represents residuals from generalized additive model.
at 20 roughly equal distances. Autocorrelation peaks at a value of 0.45 for near communities, and is positive but declining for distances under 500kms.. A cyclic pattern is evident in the autocorrelations for distances greater than 500kms.. This pattern could be due to the several regions of elevated prevalence of heart disease (see Figure 1, right panel). The association between autocorrelation and distance is reduced slightly by the inclusion of sulfates (Figure 2, panel b). We also examined the autocorrelation structure in the standardized residuals for models with a LOESS location surface with spans of 80, 60, 40, and 20% (Figure 2, Panels c-f respectively). A span of 20% was required to eliminate the association between the standardized residuals and distance. The sensitivity of the sulfate effect, /3, and the random effects variance, 0 , to the model specification is given in Table 1. The sulfate effect based
86 b) Sulfate
a) No Spatial Predictors
O O
500
1000
Q_
1500
1000
1500
2000
distance (km)
distance (km)
c) Sulfate + Location (span=80%)
d) Sulfate + Location (span=60%)
o o o 500
1000
1500
2000
1000
1500
2000
distance (km)
distance (km)
e) Sulfate + Location (span=40%)
f) Sulfate + Location (span=20%)
f a o u 6 o « 500
° ° o u
< > " . . „ « »
1000 distance (km)
1500
2000
500
1000
1500
2000
distance (km)
Figure 2. Correlation of standardized residuals by distance between locations for the spatial random effects model under six model specifications: a) no spatial predictors and no location surface, b) sulfates as the spatial covariate with no location surface, c)-f) model includes sulfate and a location surface estimated with a LOESS span of 80-20% respectively. Horizontal lines indicate zero.
on the standard logistic regression model (Model 1), assuming subjects are independent (β̂ = 0.0100), was similar to the corresponding estimate based on our "Two-Stage" spatial random effects model (Model 2) (β̂ = 0.0103) under the identical assumption of independence (i.e. θ = 0), indicating that the two-stage estimation approach gave results similar to the logistic regression model for this example. This was most likely due to the large number of subjects and cases of heart disease per MSA. The estimates of the standard errors of the sulfate effect for the two approaches were also similar (Table 1). There existed strong statistical evidence to support the assumption of
Table 1. Sulfate effects and random effects variance by model type and span of LOESS smoother of location surface.

Model Type (Model No.) | Span (%) | Sulfate Effect (β) (st. error) | Relative Risk* (95% Conf. Int.) | Random Effects Variance (θ)
Logistic (1)           | NA       | 0.0100 (0.0018)                | 1.043 (1.027, 1.058)            | NA
Spatial (2)            | NA       | 0.0103 (0.0019)                | 1.044 (1.028, 1.060)            | 0
Spatial (3)            | NA       | 0.0123 (0.0033)                | 1.053 (1.026, 1.080)            | 0.0076 (0.00041)
Spatial (4)            | 80       | 0.0132 (0.0036)                | 1.057 (1.025, 1.089)            | 0.0062 (0.00041)
Spatial (5)            | 60       | 0.0132 (0.0035)                | 1.057 (1.026, 1.089)            | 0.0055 (0.00015)
Spatial (6)            | 40       | 0.0131 (0.0032)                | 1.057 (1.028, 1.085)            | 0.0043 (0.00012)
Spatial (7)            | 20       | 0.0110 (0.0031)                | 1.047 (1.02, 1.07)              | 0.0022 (0.00025)

*: Relative risk evaluated at the interquartile range of sulfate levels (4.2 μg/m³).
NA: not applicable.
additional variation in the adjusted prevalence of heart disease between communities (i.e., θ > 0), based on the likelihood ratio test comparing Models 2 and 3 (p < 0.0001). Increasing the complexity of the location surface (i.e., decreasing the span) was also justified in terms of improving the model's fit to the data (p < 0.0001) (likelihood ratio tests comparing Model 3 to Models 4 to 7, respectively). The inclusion of a random effect for location (Model 3) increased the estimate of the sulfate effect (β̂ = 0.0123) but almost doubled the estimated standard error (0.0033) compared to the error obtained from a model assuming independence among subjects (i.e., 0.0019). This suggests that there was more variation in heart disease between communities (θ̂ = 0.0076) than can be fully explained by the within-community estimation error, V, and sulfate particles. The sulfate effect estimate was insensitive to the inclusion of location surfaces (Models 3-7). The unexplained between-community variation, however, was much lower for Model 7 (θ̂ = 0.0022) compared to Model 3, in which no surface
Figure 3. Cubic-spline representation of the association between particulate sulfate concentrations (μg/m³) and the prevalence of heart disease, adjusted for individual-level covariates and a smoothed representation of location (LOESS span of 20 percent). Dashed lines represent 95 percent confidence intervals, with tick marks indicating sulfate values.
modeling was undertaken (θ̂ = 0.0076), reflecting the ability of the location surface to explain between-community variation not accounted for by sulfate particles. The form of the relation between sulfate concentrations and the prevalence of heart disease is illustrated in Figure 3. Here, we used a spline function representation of sulfates in the GAM after adjusting for a LOESS surface of heart disease using a span of 20 percent. There was little statistical evidence for a departure from a linear association (p = 0.4429).
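One simple way to reproduce a fit of the kind shown in Figure 3 is to build a truncated-power cubic spline basis in sulfate and apply weighted least squares to the adjusted community responses. A Python sketch with hypothetical knots; the paper's GAM spline is not necessarily of this exact form.

import numpy as np

def cubic_spline_basis(x, knots):
    # Truncated-power basis: 1, x, x^2, x^3, plus (x - k)_+^3 per knot
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

# Weighted least squares with weights w = 1 / diag(Sigma): scale rows by
# the square roots of the weights, e.g.
# B = cubic_spline_basis(sulfate, knots=[4.0, 6.5, 9.0])
# sw = np.sqrt(w)
# coef, *_ = np.linalg.lstsq(B * sw[:, None], delta * sw, rcond=None)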
6 Discussion
We have identified two related issues concerning the analysis and interpretation of studies linking spatial variation in ambient air pollution and health. First, if the assumption of statistical independence is not valid, the uncertainty in the estimates of effect may be understated. Second, autocorrelation in the residuals may suggest that there exist missing or systematically mis-measured risk factors that may be correlated with air pollution and consequently could confound the pollution-health association. On the first issue, our model provides more accurate estimates of the uncertainty of estimates of effect. Based on the analysis of the ACS data, while our model gave sulfate-heart disease estimates similar to those from the standard logistic model, the standard errors of these estimates were higher (Table 1). With regard to the second issue, we observed a pattern of spatial autocorrelation in the prevalence of heart disease that cannot be fully explained by ambient particulate sulfate concentrations, even after controlling for a host of risk factors measured at the individual level. Inclusion of a location surface eliminated this spatial correlation pattern and slightly reduced the uncertainty in the sulfate effect estimate, from 0.0030 for a LOESS location surface estimated using an 80% span (Model 4) to 0.0029 for a span of 20% (Model 7). However, the sulfate effect was insensitive to adjustment for spatial trends in heart disease, suggesting that the association between heart disease and sulfate pollution was strong enough to effectively compete with location in predicting the prevalence of heart disease. We have extended our modeling approach beyond that reported in our reanalysis 6. There, spatial autocorrelation was modeled by including regional indicator variables in the random effects model. This procedure was ad hoc, in that there is no unique method for defining regions. Our new approach of using location surface models allows the data to determine the form and extent of the spatial adjustment. We also removed spatial autocorrelation by pre-filtering both the δ̂(s) and the sulfate data for spatial trends using a spatial moving average function. The radius of data inclusion for the moving average term was based on the distance beyond which no evidence of spatial autocorrelation was graphically apparent. Furthermore, the number of communities comprising the moving spatial average varies in space, yielding spatially filtered estimates with varying uncertainty. Finally, all evidence of associations between air pollution and mortality at the regional scale is removed by the pre-filtering approach. Our new modeling method allows air pollution to compete with location to predict health responses. Evidence of regional-scale associations between health responses and air pollution will be
captured with this new approach. Our spatial models can also be used to assist with the selection of appropriate spatial or community-level risk factors. The visualization stage suggested that the upper midwest had an elevated prevalence of heart disease and that this pattern was not observed in the sulfate data. This would indicate the need to include additional risk factors that cluster in this region. By repeating the visualization and spatial dependence analysis after adding new variables, our method can help to identify the variables most likely to explain significant between-community variation.

References

1. R. T. Burnett, W. H. Ross and D. Krewski, Environmetrics 6, 85 (1995).
2. S. Cakmak, R. Burnett and D. Krewski, J. Expos. Anal. Environ. Epidemiol. 8, 129 (1998).
3. W. S. Cleveland and S. J. Devlin, J. Am. Statist. Assoc. 74, 829 (1988).
4. D. F. Easton, J. Peto and G. A. G. Babiker, Statist. in Med. 10, 1025 (1991).
5. T. Hastie and R. Tibshirani, Generalized Additive Models (Chapman and Hall, London, 1990).
6. Health Effects Institute, Reanalysis of the Harvard Six Cities Study and the American Cancer Society Study of Particulate Air Pollution and Mortality: A Special Report of the Institute's Particle Epidemiology Reanalysis Project (Health Effects Institute, Cambridge, MA, 2000).
7. G. Matheron, Eco. Geology 58, 1246 (1963).
8. J. Miron, in Spatial Statistics and Models, eds. G.L. Gaile and C.J. Willmott (D. Reidel Publishing Company, Boston, 1984).
9. C. A. Pope III, M. J. Thun, M. M. Namboodiri, D. W. Dockery, J. S. Evans, F. E. Speizer and C. W. Heath, Am. J. Respir. & Crit. Care Med. 151, 669 (1995).
10. SAS/STAT Software, Changes and Enhancements through Release 6.12 (SAS Institute Inc., Cary, NC, 1997).
11. S+SpatialStats, User's Manual for Windows and UNIX (MathSoft, Data Analysis Products Division, Seattle, WA, 1997).
12. S-PLUS, Programmer's Guide (MathSoft, Data Analysis Products Division, Seattle, WA, 2000).
13. J. H. Ware and D. O. Stram, Can. J. Statist. 16, 5 (1988).
ON THE ROBUSTNESS OF RELATIVE SURPRISE INFERENCES TO THE CHOICE OF PRIOR

MICHAEL EVANS
Department of Statistics, University of Toronto, Toronto, Ontario M5S 3G3
E-mail: [email protected]

TIANLI ZOU
Department of Statistics, University of Toronto, Toronto, Ontario M5S 3G3
E-mail: [email protected]
Relative surprise inferences are derived in the context of a sampling model for the data and a prior model for the parameter by examining how beliefs change from a priori to a posteriori. Relative surprise inferences are often different from those that arise from frequency methods or from some standard a posteriori Bayesian methods. Given that relative surprise inferences are determined by how the data have caused beliefs to change, we would expect these inferences to be more robust to the choice of prior than inferences that depend solely on the posterior. This paper is concerned with demonstrating this and quantifying the extent to which robustness is enhanced by taking the relative surprise approach. The connection with Bayes factors is also explored.
1
Introduction
Suppose that we have model {/# : 6 € fi} for data x € X, where each f$ is a density with respect to support measure /x, and a prior density n with respect to support measure v for 6. Suppose further that our interest is in making inferences about the marginal parameter r = T (6) where T : ft —> T. Conditional probability (Bayes Theorem) then prescribes that inferences about T should be based on the conditional distribution of r given x; i.e. the posterior distribution of r. We will denote the prior density of r with respect to some support measure vj- on T by 7r-f and the posterior density of T with respect to vj- by 7TT (• | x). Throughout this paper all priors are proper. The actual choice of the densities is important for what follows and this cannot be done arbitrarily; e.g. changing the value of a density on a set of measure zero. For the purposes of this paper we will always take our densities to be limits. So, for example, if P is a probability measure on a space S, absolutely continuous with respect to support measure A, then the density at s will be taken to be equal to dP_ ,g) d\
=
P{An) lim n-too A (An)
91
92 where An C S is a sequence of sets that converge "nicely" to s. That we can define densities in this way in very general situations follows from definitions found, for example, in Rudin 8 and, as can be seen from Evans, Eraser and Monette 2 , this does not restrict our discussion in any meaningful way. Further this restriction on the definition of densities establishes an important connection between the relative surprise approach to inference and the use of Bayes factors as shown at the end of Section 2. In Section 2 we review some discussion found in Evans : concerning the use of surprise and relative surprise for deriving inferences. In addition we establish a relationship between the relative surprise approach to inference and the use of Bayes factors and consider an important example whose robustness properties are subsequently analyzed. Inferences based on what is called the observed surprise are seen to correspond to some standard Bayesian inferences. These inferences suffer from a lack of invariance; i.e. the inferences depend intrinsically on the parameterization chosen for the model. There may be reasons in a particular application why a parameterization must be fixed; e.g. a loss function is prescribed, but, in general, it seems reasonable to ask that our inferences not be parameterization dependent. Relative surprise inference, as defined in Evans 1, is one example of inference that is independent of the parameterization. This type of inference is also seen to have a very natural interpretation as it is based on how the observed data has changed our beliefs from a priori to a posteriori. Given that relative surprise inferences are more data driven that some standard Bayesian inferences one might expect that they are more robust to the selection of the prior. This is the topic of Section 3 and is the main content of the paper. We develop the relevant mathematical theorems for characterizing the local robustness properties of relative surprise and surprise inferences and compare these. Although the arguments are somewhat subtle the results confirm our suspicions that relative surprise inferences exhibit superior robustness properties when considering sensitivity to the choice of prior. Our approach to studying the robustness properties of our inferences is similar to that taken in Gustafson and Wasserman 6 and Gustafson 4 ' 5 . One key difference, however, is that these authors are concerned with developing robustness diagnostics for applications of Bayesian inference; i.e. determining whether or not a particular model, prior and data combination is robust. While such diagnostics are highly relevant in applications, however, this is not our concern here. We simply want to compare the robustness properties of two different approaches to inference. In Section 4 we draw some general conclusions and discuss future research on this topic.
93 2
Surprise and Relative Surprise
While conditional probability prescribes that inferences should be based on the posterior distribution it says nothing about what form these inferences should take. For example, if we wish to quote an estimate of r then perhaps we might think of using the posterior mean as natural choice. Of course this may not make sense in general as r need not take its values in a Euclidean space, or the relevant expectation may not exist or the posterior distribution may be strongly skewed or multimodal which could render its mean an inappropriate choice as a representative value. One way to prescribe an inference in such a situation is to specify a loss function and to use a Bayes rule; e.g. posterior expectations are the optimal choice when using squared error loss. Such an approach places a premium on minimizing expected losses over any other considerations. Another approach is to look for a principle or basic idea that leads necessarily to the form of the inference much as the principle of conditional probability leads to the use of the posterior. If the principle, or basic idea, is compelling then we feel confident that the inferences derived make sense. Of course the assessment of the principle involves examining many applications to see whether or not its application leads to what are generally considered to be sensible inferences. One principle, or basic idea, is that of surprise. In essence we want to measure whether or not a specified value TQ € T is surprising in light of the data x. While there are a number of different ways of measuring surprise one fairly natural method for this is given by the observed surprise (OS) n ( 7 r T ( T ( 0 ) \x)>*r{TQ\x)\x);
(2)
namely the posterior probability that the posterior density at r = T (9) is greater than at the specified value TQ. If (2) is near 1 then To is in a region where the posterior density is relatively small and r 0 is then said to be surprising in light of the data. As an example consider the following. Example 1. Hypothesis testing. Suppose that T = IH0 (the indicator function for the set H0) where H0 C fi is such that II (Ho) > 0 and r 0 = 1. So in such a case (2) provides an assessment of whether or not the hypothesis Ho is surprising in light of the data. Taking 7rT (11 x) — II (H0 | x) and 7TT (01 x) = II (H£ | a;) we have that
94
the observed surprise equals II (TTT (T (9) \X)>TTT
_/0
{T0
\X)\X)
]SU{H0\x)>U{HS\x) c
(
c
, {
~ \ n (#0 | re) if n (H0 i x) < n (#0 j x) ' and we have evidence against Ho when II (HQ | X) is large. It is easy to see that (3) is a Bayes rule when using 0-1 loss so that the OS approach agrees with the standard Bayesian decision-theoretic answer for this problem. Sometimes it is argued that posterior probabilities are to be preferred to P-values in assessing a hypothesis but (3) makes it clear that assessing hypotheses using posterior probabilities can also be thought of as falling within the P-value approach. The virtue of a P-value approach as exemplified by (2) is that it does not require II (Ho) > 0 to provide a measure of surprise, although the assessment is different than simply looking at T = i # 0 , so that continuous priors can be used. This has some consequences for the avoidance of Lindley's paradox, as discussed in Evans 1, but this issue does not concern us here. If we want to estimate r then Good's principle of least surprise naturally leads us to select the least surprising value of To and this is a value which minimizes (2). It is obvious then that the least surprise estimator is given by a mode of the posterior density iry (• | x). Further if we wanted to construct a region containing the true value of r then perhaps a natural choice is to specify 7 € [0,1] and use the 7-surprise region {T0 I n (TTT (T (0) \x)>nr
(r 0 | x) \ x) < 7} ;
i.e. the set of values for To that are not surprising at the level 7. It is immediate that such a region corresponds to a highest posterior density (HPD) region for r. The above shows that measuring surprise via (2) leads to some standard Bayesian inferences and that the derivation of these does not require the specification of a loss function. It has been pointed out previously in Evans 1 , and it may be obvious to the reader, that there is a fundamental difficulty with inferences derived via (2). In particular these inferences will depend upon how we specify the posterior density 7TT (• | x), or equivalently, how we specify the support measure v-r- Different choices will lead to very different values for (2), very different estimators and very different 7-surprise regions. In general there does not seem to be an argument that leads us to a canonical choice of support measure. So surprise as a justification for these inferences seems untenable and perhaps we are even lead to doubt the validity of such inferences even though posterior modes and HPD regions are commonly used.
95 As a solution to this problem Evans l proposed to base the derivation of inferences instead on the observed relative surprise (ORS) given by JJ | 7TT (Y (0) | x) ^
nr(T(6))
7Tr(T0\x)
(4)
7TT (To)
Here 7TT (TQ I x)
(5)
7TT ( r 0 )
is measuring the change in belief in TQ from a priori to a posteriori and (4) is the posterior probability of a change in belief larger than that observed at T 0 . Again if this number is near 1 we have that To is a surprising value. In essence a value To is surprising when the data has lead to a bigger increase in belief at other values of r = T (6) compared to the increase (or decrease) at T0. A least relative surprise estimate is then obtained by choosing a value of r 0 that maximizes (4) or equivalently maximizes (5). As above we can also obtain 7-relative surprise regions. A virtue of all the relative surprise inferences is that they are not dependent on the choice of the prior or posterior densities or equivalently the choice of the support measure vj-- These inferences are invariant under smooth reparameterizations. As shown in Evans 1 these inferences can also be quite different than standard Bayesian inferences although often they are similar. We consider the context of Example 1. E x a m p l e 2. Hypothesis testing (Example 1 continued). We consider again the situation where T = IH0 , II (Ho) > 0 and we take 7TT (1) = II (Ho) and 7Tx (0) = II (HQ) then the observed relative surprise is given by n
J' 7TT ( T () I x) 7TT ( T (9))
>
7TT ( T 0
I a;)
71"T ( r 0 )
if BFHo > 1 Ho
-{n {H§\x)i£BF
(6)
where
BF,Ho
U(H0\x) l-U(H0\x)fl
n {H0
n(H0)
(7)
is the Bayes factor in favor of Ho- So we get something similar to what is commonly recommended in this context. We will see, however, that the robustness properties of (6) are very different than those of (3). Further (4) has the virtue of providing a measure of surprise even when II (Ho) = 0 and again, see Evans 1 , this has implications for the avoidance of Lindley's paradox.
96
Sometimes just the Bayes factor is recommended in hypothesis testing with small values of (7) treated as evidence against Ho- The interpretation of the value a Bayes factors takes is somewhat ambiguous, however, although various calibrations have been suggested; see, for example Kass and Raferty 7 . Perhaps the most direct way of calibrating a Bayes factor is to proceed as we have with the OS and ORS computations and compute the posterior probability of obtaining a value larger than (7) with large values of this probability indicating that Ho is indeed surprising. For the context of Example 1 we obtain
= {0u
n{BFr(9)>BFHo\x)
if BFHo > 1 (H£ | x) if BFHo < 1 ;
i.e. this measure of surprise agrees exactly with the ORS. So in a sense we can think of the ORS as a generalization of the Bayes factor when we choose to calibrate the value of a Bayes factor using the tail probability. Suppose more generally we have that T = { r i , . . . , r*} with prior probabilities 7Ti,..., n-* respectively, and we want to assess the plausibility of the value To. Then U({T}\X) T
n({r}) i-u({T}\x)'i-n({T})
and possibly we could assess the surprise at To by computing n (BFr(e)
> BFT0 | re)
but when k > 3 this is generally not equal to the ORS at To- Effectively this approach is saying that the change in belief in r from a priori to a posteriori should be measured by BFT and this seems somewhat less direct than using 7TT (r | x) /-7TT (T) as in the ORS. Further it is not at all clear how to extend this approach to the continuous case. Notice, however, that when II (An) > 0 for all n, and the An converge nicely to {r} with II ({r}) = 0, then _ n (An I X) IVT (An) A
"
l-U(An\x)
II ( H ) '
/vr
l-U(An)
(An)
7TT (T I x) 7TT(T)
as n ->• oo. So in general it is very natural to think of the relative surprise approach as a modification and generalization of the Bayes factor approach. Of course the Bayes factor is restricted to hypothesis testing problems where the parameter space is partitioned into ft = H0UHQ with the prior probability of H0 satisfying 0 < II (H0) < 1. The relative surprise inferences are more general in their application than this, however, both with respect to hypothesis testing, and estimation, prediction and model checking problems; see Evans
97 3
Robustness to the Prior Under Linear Perturbations
In an obvious sense relative surprise inferences are more data driven than inferences based on surprise. Observe also that when T (6) = 9 then (4) equals n {fe{x)>
fe0{x)\x);
i.e. the observed relative surprise is the posterior probability that the likelihood at 8 is greater than the likelihood at 9$. In particular, a least relative surprise estimate of 6 is given by a maximum likelihood estimate while relative surprise regions for the full parameter 9 correspond to likelihood regions and only depend on the prior through their posterior probability contents. So in a general sense we expect relative surprise inferences to be less dependent on the prior than other Bayesian inferences. It is the purpose of this section to examine this issue more closely and in particular to examine the marginal parameter case; i.e. T (8) ^ 9. Our interest then is in how the inferences vary as we perturb the prior II. Basically we will say one inference method is more robust than another if the rate of change in the inference is smaller at II for the first inference method than for the second. For example, it is clear immediately that when our goal is to estimate 9 then a least relative surprise estimator is more robust to the choice of prior than any other Bayesian inference because it only depends on the likelihood function. Actually this example is somewhat unusual as even the estimation problem becomes more difficult as soon as interest is in a marginal parameter r. In that case a LRSE will depend on the prior and the extent of the dependence is not obvious. Still the result for the full parameter leaves us hopeful that something similar can be obtained for a marginal parameter as well. Computational and mathematical issues prevent us from arbitrarily choosing the directions in which we perturb the prior so that definitive comparisons; i.e. results that hold for all directions, seem impossible to achieve at this point. Still we can obtain results for several choices of directions and the results give us considerable insight into the relative robustness of the various inferences. Further, since all the inferences under consideration are based on the observed surprise (2) or observed relative surprise (4), we will compare the robustness of these quantities. Recall that these quantities can be used for assessing whether or not ro is a plausible value for r. Perhaps the most commonly used directions are the contamination directions given by prior probability measures of the form
IP = (1 - e) n + eQ
98 where e > 0 and Q is a probability measure on Q. The prior density of 9 with respect to v is then given by 7re = (1 - e) 7T + eq
when Q is also absolutely continuous with respect to v. Actually, because our interest is in the marginal parameter r, we want to consider only perturbations to the marginal prior distribution so that the conditional distribution of the full parameter given the marginal remains fixed. This kind of perturbation is easily expressed in terms of densities as Tre (9) = (1 - e) 7TT (r)
TT
(0 I r) + eqr (r)
IT
(9 | r)
= [(l-e)7rT(T)+«fr(T)]7r(0|T)
(8)
where TT(6\T) is the conditional density of 9 given T (8) = T with respect to support measure vT on {9 : T (9) = r } . Then the marginal posterior density, when the prior density is given by (8), is Try (T I x) = (1 - ex) TTT (r | x) + exqr (r | x) where £x
em
=
« (*) (1 - e) m^x) + emq (x)'
(9)
™*(x) = /" / 6 (K)ir(e)u(de) is the prior predictive density for the data when the prior is 7r with a similar definition for mq (x), 7Tr (T I x) —
Qy
(T
I x)
TT (r)/,*(*)
mn (x) 9r(r)f;(x) mq(x)
and we write the conditional density with respect to /j, of x given that T (9) = T as JT-HT}
Now letting G denote the posterior distribution function of ny (r | a;) when T ~ ITx (• | x) then the following result, as proved in Evans and Zou 3 , yields expressions for the upper and lower Gateaux derivatives of the observed surprise.
99 Theorem 1. If the posterior distribution of ny (T\X) is continuous when T ~ n T (• | a;) or when T ~ Qy (• | x), G is differentiable at -Ky (TO \ x) with derivative g (ny (r 0 | x)) and <7T (• | x) is continuous and bounded, then the lower Gateaux derivative (and the upper Gateaux derivative) of the observed surprise at r 0 , under linear perturbations of the marginal parameter r = T (0), takes the form mq (x) {qr (r» \x)-qr (T 0 | X)} g (ny (r 0 | a;)) + mn(x) mq{x) | <3T [TTT (r I x) > TTT (r 0 | a;) | x] - 1 m^x) \ n T [TTT (r | a;) > TTT (T 0 | X) \ x] J
._
for some T* Of course when the Gateaux derivative of the observed surprise exists it takes the form given in (10) as well. The expression (10) is not very useful when we need to compute the Gateaux derivative as r* is unspecified. In specific examples we are still required to evaluate the limit corresponding to the first part of (10). We will see, however, that (10) is of great value when comparing the robustness of the OS with the robustness of the ORS. We note that the second term in (10) is always bounded above by S
-^jJlM mn[x)
{l _ n T [TTT (r | x) > TTT (TO I x) I a:]} .
The following result, as proved in Evans and Zou 3 , suggests that the first term in (10) can be arbitrarily large in absolute value. Lemma 2. If the density function g of Try (r | a;) is continuous from the left at 7Tx (TO I x), where To is the unique mode of 7TY Ola:), and 7TT (• | x) is continuously differentiable at TQ then g(Try (TQ \ x)) = oo. So Lemma 2 indicates that the first term in (10) could be infinite whenever To is a posterior mode. That this happens is confirmed by the following example. Example 3. Gateaux derivative of the OS can be infinite. Suppose that x ~ Bernoulli^ 1 / 2 ) where 8 G [0,1] is unknown, we take II to be the Beta(l/2,1) distribution and the support measure v is Lebesgue measure. Then the posterior density, when we observe x — 1, is given by 1; i.e. the posterior distribution is uniform. Suppose we take Q to be a Beta(3/2,1) distribution which leads to the posterior density 28 when we observe x — 1. Now we explicitly evaluate the first term in Gateaux derivative of the observed
100
surprise at 60. First note that II [IT (6 \ x) > it (90 | x) \x] = U [1 > 11 x] = 0 and (1 - ex) + 2ex6 > (1 - ex) + 2ex60 if and only if 0 > 60. Therefore
lim - ( n (1 - cx) + 26*19 > e-+0 e L
(1 - ex) + 2ex60
n
[TT (6 |
x) > n (d0 | x) | a;]
= lim - (1 - 60) = oo e->0 e
and we see that the first term in the Gateaux derivative for the observed surprise is infinite for every #oIn Evans and Zou 3 the following result is established for the Gateaux derivative of the observed relative surprise. Theorem 3. The Gateaux derivative of the observed relative surprise at To under linear perturbations of the marginal parameter r = T (6) is given by
^ { Q T [ / ; ( * ) > / ; ( * ) ! x ] - n r [/;(*)>/;»|*]}.
(n)
lil-n^X)
Note that this result implies that the Gateaux derivative of the observed relative surprise is always finite. In particular consider an application of this result to Example 3. Example 4. Gateaux derivative of the ORS in Example 3. We have that fg (x) = 61/2 and therefore Q [/; (x) > f*T0 (x)\ x] =Qr[6>
do I x]=
f
29d0 =
l-el
n[/;(x) > /;o(x)\ x] = u[e>e0\ x] = [ de = i-e0 J00
while mn(l) = 1/2 and mq (1) = 3/4. Therefore the Gateaux derivative of the observed relative surprise at #o is given by, using (11), | (1 - 0g - 1 - (?o) = |«o (1 -
101
of Q measures that converge to a measure degenerate at a LRSE then (11) converges to
s
-^im{1_Ur[rAx)>f:Ax)\x]}
(12)
and this is bounded whenever sup r / * (x) < oo. It clear then that (12) is the supremum of (11), when the supremum is taken over Q. We now establish a result that formally confirms our intuition, at least in the continuous case, that inferences based on the ORS are more robust to the choice of prior than inferences based on the OS. For this we consider the supremum of the absolute values of the Gateaux derivatives over all possible absolutely continuous perturbation measures Q and all possible reparameterizations. By a reparameterization we mean a 1-1 bicontinuously differentiable map defined on T. Theorem 4. If r has an absolutely continuous prior distribution on an open subset of a fc-dimensional Euclidean space, then the supremum of the absolute value of (10) is greater than or equal to the supremum of the absolute value of (11) when the supremum is taken over all possible absolutely continuous measures Q and all possible reparameterizations. Proof: When ir is absolutely continuous on an open subset of Rk we can reparameterize the problem so that the prior distribution for r is uniform on [0,1] . For example, we could use the probability transform based on ir to do this. With this parameterization, if To is a LRSE then it is also a posterior mode so that (11) is 0 and so is the second term of (10). From Lemma 2, however, we have that the first term of (10) is either 0 or infinite in absolute value. So in the case that To is a LRSE the supremum, over all possible parameterizations, of the absolute value of (10) is always greater than or equal to the absolute value of (11). When ro is not a LRSE then reparameterizing so that r is uniform on [0,1] establishes that the second term of (10) equals (11). Further if we take a measure Q that assigns 0 probability to a neighborhood of To then
102
view of robustness one would want to ensure that such a selection did not lead to inferences that had terrible robustness at least when using some traditional Bayesian inferences. Of course this is not an issue when using relative surprise inferences as these are not dependent on the parameterization chosen. We interpret Theorem 4 as strong support for the increased robustness of the ORS over the OS. Theorem 4 is restricted to contexts where the prior is absolutely continuous. Discrete contexts seem much more difficult to analyze. We consider now, however, an important problem where the marginal prior is discrete. Example 5. Hypothesis testing (Examples 1 and 2 continued). Consider the problem discussed in Section 1 where we want to test the null hypothesis HQ and we have II (HQ) — no > 0. Therefore the posterior probability of HQ is given by V"O/H
n(H0\x)=
(X)
° J(1n o- _ TO) ^ f*m ^OIHO ( — ) + X
(x)'
Let Q be the measure degenerate at HQ. Under a small perturbation e of the prior probability the posterior probability becomes W(H0\x)~
( U - < ) *o+ 0/*„(*) ((1 - e) TTO + e) rHo (x) + (1 - c) (1 -
/ £ . (x)'
TTO)
Now suppose that II (H0 \x) > U (HQ \X) . Then, by (3), and the fact that IF (Ho \x) > IP (HQ | X) for all e small enough, the observed surprise is 0 when e is small enough and so the Gateaux derivative of the observed surprise is also 0. If n (H0 | x) = n (H§ | x) then 7r 0 /£ 0 (x) = (1 - TT0) / £ g (x) and so IT _
(H0\x)
((l-e)Tro + e) fHoW ((1 - e) no + e) f*Hg (x) + (1 - e) (1 - n0) f*H< (x) ni-xpU
I
( (l-e)7ro
+e )
TTO
j\(l-e)(l-no)j
f
(l-e)(l-*o)rHS(x)
]
\ ((1 - e) TTO + e) f*Ho (x) + (1 - c) (1 - 7T0) f*H, (x) j 1 /
e \ f 7^0 +
TTO V
1-e/
(l-e)(l-7r0)/£c(i)
1 ((l-e)7ro+e)/^0(x) + (l-e)(l-7ro)/^(a;)
103
(l-e)(l-7ro)/&.(s) > ( ( i - c ) *•„ + e ) fa (X) + (1 _ e ) (1 -
=
TTO)
fa, (x)
W(HZ\x)
and the Gateaux derivative is again 0 by (3). If U(H0\x) < II (#0C | x) then, since IP (H0 \x) < IF (#0C | x) for all e small enough, the Gateaux derivative of the observed surprise is given by I f
((1-<0*O+C)/HO(S)
~ i ™ 7 \ ((l- £ )7r 0 +e)/^ o ( a : )+(l- e )(l-7ro)/^(x) = - (*0fk
=
TO/£„(*)
7ro/£ 0 (z)+(l-7ro)/£ s
(X) + (1 - TTo) / £ « (*)) _ 2 (1 - TTo) f*Ho (x) ft- (X)
-—n(H0\x)Il(HZ\x) TTO
(1 - TTo) BFHo (l-7r0+7ro5F.ff0)
(13)
2-
Now observe that BFfj = jBF/fo for all e; i.e. the Bayes factor in favor of HQ is independent of e. This implies that the Gateaux derivative of the observed relative surprise is given by 0 when BFH0 > 1 and is given by (13) when BFHa < 1. It would seem then that the behavior of the two measures of surprise is very similar. Consider, however, the situation where BFH0 > 1, n (Ho \ x) < II (HQ I x) and U(H0) — 7r0 is very small. In this situation the Gateaux derivative of the ORS is 0 while for the OS (13) is approximately equal to BFH0 • So in situations where the data have lead to a large value of BFHo, but the posterior probability of Ho is still less than 1/2, the observed surprise is very sensitive to perturbations in the prior belief assigned to Ho. Now consider the situation where BFJJ0 < 1 while II (H0 \ x) > II (HQ | X) so that the Gateaux derivative of the OS is 0 and the Gateaux derivative of the ORS is (13). We have that (13) is bounded above by BFHo 1 - 7T0 + TT0BFHo
and, because BFJJ0 < 1, this is an increasing function of -KQ which goes to the value 1 as 7To -> 1. So we have shown that the worst case behavior of the Gateaux derivative of the ORS in this problem is given by the upper bound 1 while the worst case behavior of the Gateaux derivative of the OS, using the relation BFJJ0 =
104
/ H 0 ( X ) If HI (X) > i s s i v e n b y
sup ZkM. Based on this we see that the ORS has superior robustness properties to the typical Bayesian inference II (HQ \ x) for testing HQ versus HQ. As mentioned previously sometimes BFH0 is used in these contexts and this has the appearance of being fully robust because BFjjo = BFH0 for all e. Recall, however, the need to calibrate the value of a BFH0 through an equation such as (7). This leads to n(H5\x)=(l
+ Y^BFHo>)
.
Therefore a value of BFH0 = 20 when 7r0/ (1 - 7To) = 1 implies II (HQ \ x) = 1/21 = .048. On the other hand BFHo = 20 when TT0/ (1 - 7r0) = 1/10 implies II (HQ I x) = .5. The point here is that the interpretation of the value BFH0 does inherently depend on the values of the prior probability -KQ and so is not robust to this choice. The key concept in determining relative surprise inferences is the ORS and so it makes sense to concentrate on assessing its robustness properties. Ultimately, however, we should look at the robustness properties of the specific inference. With hypothesis testing this is the ORS but with estimation, for example, we need to look at the LRSE. We note that *?(T\X)=
-Ky (T)
f;(x)
(1 - e) m7r(x) + em, (a;)
for every e > 0 immediately implies the following result. Theorem 5. The LRSE of r is constant under perturbations that affect only the marginal prior of r and so the Gateaux derivative of the LRSE is always 0. Of course this is not true of the posterior mode which arises from the observed surprise and the principle of least surprise. Example 6. Suppose that x ~Bernoulli(0 1 / 2 ) , we take II to be the Beta(l,3/2) distribution and the support measure v is Lebesgue measure. Then the posterior distribution, when we observe x — 1, is given by the Beta(3/2,3/2) distribution and this has its mode at 1/2. Suppose we take Q to be a Beta(l, 1/2)
105
distribution and this implies that the posterior distribution is Beta(3/2,1/2) when we observe x = 1. Note that
el (1 e) 2+
e 2 (1 er 2
-' <» i*>=o - -» m m " - " "mm " - ' and this always has its mode at 1. This is an extreme example of course but it illustrates an important point. In particular the Gateaux derivative of the LRSE is 0 while, in this case the Gateaux derivative of the posterior mode is infinity. 4
Conclusions
In Evans x relative surprise inferences were advocated for Bayesian contexts where a loss function was not prescribed. One of the advantages of these inferences is that they are invariant under reparamerizations and this is not the case for more traditional approaches to deriving inferences in Bayesian inference problems. Further the relative surprise inferences are seen to be primarily data driven in that they are based on how the data changes beliefs from a priori to a posteriori rather than being based on the posterior alone. It might be argued that for a specific application a particular parameterization is paramount but reasons must be supplied for this and these do not seem available in many applications. Further it might be argued that if one has a strong belief in a particular prior then the relative surprise approach weakens the amount of input that the prior has in determining an inference. While this is true, the results of this paper point to the negative side of that argument, namely, the lack of robustness to the choice of the prior for some traditional inferences. The development here has excluded consideration of improper priors. Strictly speaking improper priors are excluded from the relative surprise formulation as they do not marginalize appropriately. Still we can consider many improper priors as limits of sequences of proper priors and consider the limiting surprise and relative surprise inferences as was done in Evans l . In such a case we could consider perturbations to this sequence and study the relative robustness properties of the limiting inferences. This is something we are currently investigating. This paper has only considered linear perturbations to the marginal prior. More generally we will study other types of perturbations. Further there is the question of the effect of perturbing the conditional prior distribution of the full parameter given the marginal parameter. It seems reasonable to separate out the effect of perturbations to the marginal and conditional. We will consider these questions in further research work.
106
Acknowledgments The authors thank the referees for some helpful comments. The authors were partially supported by the Natural Sciences and Engineering Research Council of Canada. References 1. M. Evans, Communications in Statistics - Theory and Methodology 26(5), 1125 (1997). 2. M. Evans, D. A. S. Eraser and G. Monette, Canadian Journal of Statistics 13, 137 (1985). 3. M. Evans and T. Zou, On the robustness of relative surprise inferences to the choice of prior, (Technical Report No. 0201, Dept. of Statistics, University of Toronto, Toronto, Canada, 2002). 4. P. Gustafson, Annals of Statistics 24, 174 (1996). 5. P. Gustafson, Journal of the American Statistical Association 9 1 , 774 (1996). 6. P. Gustafson, and L. Wasserman, Annals of Statistics 23, 2153 (1995). 7. R. E. Kass and A. E. Raferty, Journal of the American Statistical Association 90, 773 (1995). 8. W. Rudin, Real and Complex Analysis, Second Edition ( McGraw-Hill, New York, 1974).
U S I N G SURVIVAL ANALYSIS IN P R E T E R M BIRTH S T U D Y CHONG YAU FU The Institute of Public Health, National Yang Ming University, 155 Li-Long St., Sec. 2, Shih-Pai, Taipei, Taiwan, 112 E-mail: chong@ym. edu. tw SHIH HUA LIU Department of Humanities and Science, National Yulin University of Science and Technology, 123 University Road, Touliu, Yunlin, Taiwan, 640 In conventional preterm birth studies the outcome variable was a binary data, namely whether or not the gestational age was less than 37 weeks. An absolute yes or no classification was used. Based on the progressing sense of gestational age, this study investigated the risk of preterm birth among different age groups using survival analysis. Using the key concept of the rank ordering of time, KaplanMeier estimate and the Cox model were applied to preterm birth studies. The applied results show that the relative risk of preterm birth for the 12-17, 18-19 and 20-24 year old age groups, varied with gestational age, and was 1.6-8.69, 1.2-5.76 and 0.92-1.52 respectively when compared with 25-29 years old. For the older age groups, namely 30-34, 35-39 and 404- years old, the relative risks of preterm birth were 1.2, 1.54 and 1.74, respectively, constant in gestation age. Hence, applying survival analysis to preterm birth studies can provide more informative results than other approaches.
1
Introduction
Preterm birth is the leading cause of neonatal death and of morbidity in surviving infants ( Rush, Davey and Segall 10 ). The development of a complement system in preterm births was closely related to gestational age (Wolach et al. u ) . Hence, the duration of gestational age was important for preterm birth. Previous studies of preterm birth, namely Fu et al. 5 , Meis et al. 9 , Rush et al. 10 etc., always used a binary data approach, merely considering whether or not gestational age was less than 37 weeks. Naturally, the study results thus failed to reflect the relationship between risk of preterm birth and gestational age. Chiang 1 constructed an antenatal life table for fetal death and live birth in a prospective pregnancy study. The birth data was actually a history cohort data with the special feature of specific birthdays, the gestational age was a duration time of a fetus and the gestational age of less than 37 weeks was treated as an event. Hence, it should be straightforward to use survival analysis for a preterm birth study.
107
108 The censoring of data was a major issue in survival analysis. This study explores the performance of partial likelihood for the Cox model through censored data defined in this preterm birth study. The analytical results were expected to be more informative than those of other studies. 2
Statistical M e t h o d
Kaplan-Meier 7 estimate is an empirical method for estimating survival function. The Cox 2 model is commonly used for survival model incorporating covariates based on the partial likelihood function. Crowley and Hu 3 employ time-varying covariates in the Cox model. Both Kaplan-Meier method and Cox model are based on the concept of the rank ordering of time rather than on time itself. According to this concept, the units of time employed might be days, months, hours, gestational age, and so on. In this work, gestational age was equivalent to time, and birth at 37 weeks or less was an event. Births at 37 weeks or later were treated as censored data. The preterm birth data were then analyzed through survival analysis. Furthermore, graphs explored the original data and presented the modeled results. This study used the software, SAS 6.12 and S-plus (only for drawing plots). 3
Preterm Birth Study
Preterm birth is the cause of neonatal death, and of morbidity. Wolach et al. (1997) reported that the development of a complement system in preterm births is closely related to the gestational age . Hence, the conditions of preterm births varied with gestational age. And, the methodology of survival analysis is able to keep this information. Meis et al. 9 reported that the risk for preterm birth in different age groups displayed a U-shape, with younger and older mothers being high risks groups for preterm birth. In this study population, comparing with the 25-29 year age group, the relative risk for preterm birth in different age groups would be estimated. The data in this study came from the Computerized Medical Birth Registry (Lu et al. 8 ) , a data bank collected prospectively from ten Taiwanese hospitals from February 1993 to February 1995. The recruitment criteria included single birth, first parity, gestational age > 24 weeks, and birth weight < 500g. Totally, 17958 births met these criteria. Gestational age were estimated from the last menstrual period, and births with gestational age > 24 weeks and < 37 weeks were considered preterm
109 births and denned as event data. Those normal births, births with gestational age > 37 weeks, were treated as censored data. Age groups were classified as 12-16, 17-19, 20-24, 25-29(reference group), 30-34, 35-39 and 40+ years old. From a previous study (Fu et al. 5 ) , we realized that the risk factors for preterm births differed among younger and older age groups. Hence, the data set was divided into two parts, corresponding to younger and older age groups, both sharing a common reference group (25-29 years old). This approach increased the homogeneity each data set and did not introduce extra covariates into the model. However, the estimated coefficient values for the main risk factors (younger and older age) changed very slightly when additional covariates were brought into the models. 4
Results
Table 1 lists the number of preterm births and censored values for 24 — 37, 38 + weeks. Most of the births occurred in the 20 — 34 year old age group. Almost all censored data occurred at 38 + weeks. Figure 1 show the smoothed curves for the hazard of preterm birth over different age groups. These smoothed curves can be divided into three sections, for 24 — 28, 28 — 33, and 33 — 37, weeks in both plots of Fig.l. In the 24 — 28 week section, all curves display a constant hazard, except for those of the 12 — 17 and > 40 years age groups. Across all age groups, the hazard of preterm birth remained roughly constant at 28-33 weeks, but increased tremendously for gestational age > 33 weeks. Table l.The Frequency of Fails (preterm birth) and Censoring, (#failed/#censored), in Gestational Age 24 — 37 Weeks by age groups.
Weeks 24~ 24 25 26 27 28 29 30 31 32 33 34 35 36 37 36 +
12 - 17 yrs. 71 = 163 0/0 3/1 0/1 0/0 0/0 I/O 0/0 3/0 1/0 2/1 3/0 2/0 6/0 8/0 12/0 0/119
18 - 19 yrs. n = 384 0/0 0/0 5/0 0/0 2/0 1/0 2/0 1/0 4/0 4/0 4/0 13/0 7/0 8/0 31/0 0/302
20 - 24 yrs. n = 3, 545
25 - 29 yrs. n = 8,597
30 - 34 yrs. n = 4, 188
35 - 39 yrs. i% = 909
0/0 7/0 3/1 2/0 2/0 9/0 8/2 6/2 12/2 13/0 25/2 30/1 51/2 78/2 220/0 0/3069
0/0 6/4 7/2 7/0 12/0 12/1 10/0 16/4 32/4 34/4 53/1 54/1 85/2 213/0 586/3 0/7448
0/0 2/0 4/1 5/0 4/0 9/2 11/0 13/2 9/1 15/1 21/0 35/1 65/0 120/0 336/1 0/3529
0/0 1/0 1/0 1/0 2/0 1/0 3/0 4/0 6/0 8/0 5/0 10/0 14/0 40/0 83/1 0/729
TI
40+ yrs. = 172
0/0 1/0 0/0 1/0 0/0 0/0 1/0 0/0 0/0 2/0 1/0 1/0 5/0 5/0 21/1 0/133
Tables 2 and 3, and Fig. 2, summarize the results of Cox model. In table 2, model 2 provided significant time dependent results for younger age groups. Compared to the 25 — 29 year old age group, the relative risk of the younger 12 —17,18 —19 and 20 — 24 age groups varied with gestational age, and ranged
110 Table 2. Estimated Cox Models for Gestational Age in preterm births; Age Groups: 12-17, 18-19, 20-24, 25-29(reference group); Standard Errors in Parentheses
Covariates A l : 1 2 - 17 vs. 2 5 - 2 9
Model 1 0.768 (0.159)
Model 2 5.281 (1.498)
A2: 18 - 19 vs. 25 - 29
0.516 (0.116)
4.631 (1.164)
A3: 20 - 24 vs. 25 - 29
0.005 (0.055)
1.293 (0.711)
Al*Gestational Age
NA
-0.130 (0.044)
A2*Gestational Age
NA
-0.118 0.034
A3*Gestational Age
NA
-0.037 (0.020)
Total sample size: 12, 675; Events: 1,712; Censored: 10,963 (Censoring Rate = 86.5%)
Table 3. Estimated Cox Models for Gestational Age in preterm births; Age Groups: 25 — 29(reference group), 30 — 34,35 — 39,40+ ; Standard Errors in Parentheses
Covariates A5: 3 0 - 3 4 vs. 2 5 - 2 9
Model 3 0.179 (0.049)
Model 4 -0.119 (0.686)
A6: 35 - 39 vs. 25 - 29
0.441 (0.081)
1.018 (1.050)
A7: 40+ vs. 25 - 29
0.567 (0.165)
0.833 (2.184)
A5*Gestational Age
NA
0.008 (0.019)
A6*Gestational Age
NA
-0.016 0.030
A7*Gestational Age
NA
-0.008 (0.062)
Total sample size: 13, 849; Events: 1,991; Censored: 11, 858 (Censoring Rate = 85.6%)
111
Oil
0.11 12<=AGB=17 18<=AGB=19 20<=AGB=24 25<=AGB=29
ore
25<=ACB=29 30<=AGB=34 35<=AGB=3S 40<=ACE
DM
Off? -
0B7
•p
ra ^ 1155
016
ore
om
Ofll
-001
0D1
-T
24
1
1
oni
r-
26 28 30
32 34 36
38
— i
1
1
24 26 28 30
Gestational age (weeks)
1
1
1
32 34 36 38
Gestational age (weeks)
Figure 1: Smooth Hazard Functions for different age groups
12-17 y 18-19 y 20-24 y 25-29 ^(reference group) X-34y 35-38 y 40+ y
Ges&ional age (weeks)
Figure 2: Results from Cox models; age specific R.R. (reference group 25-29 yrs.) for preterm birth in gestational age 24-37 weeks.
112
from 1.6 — 8.67,1.2— 5.76 and 0.92 — 1.52, respectively. However, the relative risk for the older 30 — 34,35 — 39 and 40 + age groups did not vary significantly with gestational age (model 4 of table 3), and their relative risks were 1.2,1.54 and 1.74, respectively. Figure 2 displays the relative risk in younger and older age groups comparing with the 25 — 29 year old group. Comparing the empirical results (Fig. 1) with the modeled result (Fig. 2), the common impact of risk of preterm birth were the risk for younger age groups varying with gestational age while for older age groups this risk was constant. Meanwhile, the estimated relative risks were very significant for both models. 5
Discussion
The main methods employed in this application study were the Kaplan-Meier estimate and Cox models. Besides the key issue of censored data, the basic issues included time rank ordering and risk set. The appropriateness or otherwise of the application of the model can be judged based on these issues. This study is an original work applying survival analysis to investigate preterm birth. Gestational ages > 24 weeks and < 37 weeks were considered event data, and their ranked order was analyzed. Meanwhile, a gestational age of greater than 37 weeks was censored data, contributing to the risk set of denominators in parameter estimation, occurring in the survival estimator of Kaplan-Meier or the partial likelihood of the Cox-model. Consequently, the estimated parameters are based on the entire study population. Based on 17,958 births, the results of log-rank trend testing (Collett 4 ) indicated that ranked by age group, the risk for preterm birth was ordered as follows: (12 - 17) > (18 - 19) > (40+) > (35 - 39) > (20 - 24) > (30 - 34) > (25 — 29), with p — value < 0.0001. This testing was based on the sense of average risk in weeks 24 — 37. However, the relative risks of preterm birth varied with gestational age from Cox models for younger age groups. Hence, compared to the results of Cox models and the logrank test, Cox models not only dealt with the time dependent problem, but also estimated their relative risks. The risk factors for preterm births included maternal age, marital status, education, parity, prenatal care, and so on. Model building process revealed that maternal age was a very strong risk factor for preterm birth. Furthermore, with the very large sample size herein, whether or not other risk factors were included, the coefficient estimations of maternal age made very little difference. Consequently, this investigation only included the maternal age in Cox models.
113
However, the risk factors for preterm birth differed between the younger and older age groups. Hence, a sufficiently large sample size would offer the benefit of homogeneity in dealing with younger and older age groups separately. Finally, from two separate Cox models, the relative risk of preterm birth varied significantly with gestational age in the younger age group, but not in the older age group. Figure 1 does note this same situation. Reviewing previous studies, including Meis et al. 9 , Fraser et al. 6 , and Wessel et al. 12 preterm birth was defined as less than 37 weeks, thus classifying it as a binary response data. This work considered each gestational age using a survival analysis approach and thus incorporated a sense of progressive risk. Hence, these results were more informative than those that used a binary response. Acknowledgments I would like to thank Dr. Jen Her Lu for providing data from Computerized Medical Birth Registry and for useful discussion. References 1. C.L. Chiang , In The life table and its applications (Krieger, Florida, 1984). 2. D.R. Cox, Journal of the Royal Statistical Society, Ser. B 74, 187 (1972). 3. J. Crowley and M. Hu, Journal of American Statistical Association 78, 27 (1977). 4. D. Collett, In Modeling Survival Data in Medical Research, eds.: C. Chatfield and J.V. Zidek (Chapman and Hall ,London, 1994). 5. C. Y. Fu et al., Chinese J. of Public Health (Taipei) 18(3), 228 (1999). 6. A.M. Fraser, J.E. Brockert and R.H. Ward, The New England Journal of Medicine 332, 1113 (1995). 7. E.L. Kaplan and P. Meier, Journal of American Statistical Association 53, 457 (1958). 8. J.H. Lu et al., Medical Information 19(4), 323 (1994). 9. P.J. Meis et al., American Journal of Obstetrics and Gynecology 173, 590 (1995). 10. R.W. Rush , D.A. Davey and M.L. Segall, British Journal of Obstetrics and Gynecology 85, 806 (1978). 11. B. Wolach et al., Acta Padiatrica 86(5), 523 (1997). 12. H. Wessele£ al., Acta Obstetricia Gynecologica Scandinavica 75, 360 (1996).
A S Y M P T O T I C FORMS A N D B O U N D S FOR TAILS OF C O N V O L U T I O N S OF C O M P O U N D G E O M E T R I C D I S T R I B U T I O N S , W I T H APPLICATIONS JUN CAI Department of Statistics and Actuarial Sciences, University of Waterloo, Canada N2L 3G1 E-mail: [email protected] JOSE GARRIDO Department of Mathematics and Statistics, Concordia University, Quebec, Canada H^B 1R6 E-mail: garrido @vax2. concordia. ca
Ontario,
Montreal,
In this paper, we derive exponential and subexponential asymptotic forms for tails of convolutions of compound geometric distributions. General lower and upper bounds for the tails are given, which can be used in some cases to determine a closer tail approximation. Applications of these results are given to the ruin probability in the classical risk process perturbed by a diffusion; previous results are easily derived and a theorem of Veraverbeke 26 is generalized. In addition, twosided bounds for the ruin probability with large claim sizes are given, extending the bounds of Dickson 10 to the diffusion risk model.
1
Introduction
Let {Xi,i > 1} be a sequence of i.i.d. non-negative random variables with common distribution F and F(0) = 0. Further, let iV be a geometric random variable with Vr{N — n} = qpn, n = 0 , 1 , 2 , . . . and p = 1 - q, for 0 < q < 1, which is independent of {Xi,i > 1}. SN = J2i=i %i 1S sa*d to be compound geometric, where SN = 0 if N = 0. Its distribution function is denoted by H{x) = Pr{SV < x}, x > 0. Suppose that Y is another non-negative random variable with distribution G and (7(0) = 0, where Y,N and {Xi,i > 1} are independent. Then, the convolution H * G of the compound geometric distribution H and distribution G, i.e. the distribution of SN + Y, arises in many applied probability models, such as regenerative processes (Cohen 8 , Kalashnikov 18 and Keilson 19 ) risk theory (Dufresne and Gerber n , Sundt and Teugels 23 and Veraverbeke 26 ) and queueing theory (Asmussen 1 , van Hoorn 25 and Szekli 24 ). Many distributions of interest in these works can be expressed in the form H*G of the distribution of SN+Y. _ Denote by W(x) = H * G(x), consider W(x) = 1 - W(x), the tail or sur-
114
115
vival function of W(x). It is well-known (Renyi's theorem) that if F has a finite mean, then for any x > 0, H(x)/ /0°° H(y) dy ->• e~x as q -> 0 or equivalently £?(AT) —• oo. Furthermore, Keilson 19 (see also Kalashnikov 18 ) showed that if F has a finite second moment and G has a finite mean, then a similar limit theorem holds for W(x), namely for any x > 0, W(x)/ J0°° W(y) dy —> e~x as E(N) —> oo. However, in many applications, we are interested in the asymptotic behavior and bounds for W(x). The asymptotic behavior and bounds for the tail H(x) of the compound geometric distribution function H(x) is well-known. The purpose of this paper is to give a more complete description of the asymptotic behavior for W(x), obtain the asymptotic estimates for W(x) under various situations, derive bounds for W(x) in heavy tailed cases and consider the applications of these results in risk theory (for bounds see e.g. Cai and Garrido 4 ) . The paper is organized as follows: in Section 2, we derive an exponential asymptotic form for W(x) in terms of Lundberg's coefficient using the key renewal theorem. In Section 3, we consider subexponential asymptotic forms for W{x). Here, general lower and upper bounds for W{x) are given first, the bounds indicate the possible asymptotic form for W(x) and are also used to determine a closer approximation for W(x) in some cases. Section 4 discusses the asymptotic forms for W(x) in the intermediate case, when exponential moments exist but Lundberg's coefficient does not. In Section 5, as an application of the results for W(x), we consider the ruin probability in the classical process perturbed by a diffusion (see Rolski et al. 2 2 ) . The asymptotic estimates of the ruin probability derived by Dufresne and Gerber u , Gerber 16 and Veraverbeke 26 ) are easily obtained. A theorem of Veraverbeke 26 is also generalized. In addition, two-sided bounds for the ruin probability with large claims are given, thus extending the bound of Dickson 10 to diffusion risk models.
2
Light-tail asymptotics
In general, W(x) does not admit an exponential asymptotic form, for example, see Remark 3.5 of this paper. But if conditions similar to those in CramerLundberg's theorem hold, then there exists an exponential asymptotic form for W(x), which is stated in the following theorem. Throughout this paper, f(x) ~ g(x) means that f(x)/g(x) -> 1 as x -> oo and f(x) = o(g(x)) means that f(x)/g(x)-*0 as x—>oo while me^s) = /0°° esx dB{x) denotes the moment generating function of B on [0, oo).
116
Theorem 2.1. Let F be non-lattice and K a constant such that /•OO
eKx dF(x) = 1/p .
/
(1)
If mG(K) = /0°° e Kx dG(z) < oo then | ' i ) e - f a , if /? = / ° ° ie K » dF(a:) < oo , (2) P«P Jo = o(e-Kx), if ,9 = oo, (3) where, K; in (1) is called Lundberg's coefficient. Proof. See Willmot and Lin 28 , Corollary 9.3.2., pp. 175-6. • Remark 2.1. When Y = 0, G is degenerate at zero, and thus m o ^ ) = 1, W(x) = H(x) and Theorem 2.1 is reduced to Cramer-Lundberg's theorem, i.e. under the conditions of Theorem 2.1 about F H(x)~^e-Kx, if^
W
= o(e-Kx),
3
if/3 = oo.
(5)
Heavy-tail asymptotics
In this Section, we consider the case when F or G are heavy-tailed, in particular, subexponential distributions. First, the two following theorems give general lower and upper bounds for W(x), which indicate possible asymptotic forms used to determine, in some cases, a closer approximation for W(x). Theorem 3.1. For any x > 0,
ww >_ " F ( ; ' + f w .
(6)
pF(x) + q Proof. Since W{x) = PT{SN + Y > x} is the survival function of the random variable Sjy+Y, W(x) is decreasing. By the usual renewal argument, we have W(x) > qG(x)+pF(x)+PW{x)
f dF(y) ,
x >0,
Jo
= qG(x) +PF(x)+pW{x)F(x) this implies that (6) holds.
,
•
117
Remark 3.1. Take Y = 0 in Theorem 3.1, we get a lower bound for the tail H(x) of the compound geometric distribution function H(x), namely H(x)
>
f f ^ , for any* > 0 . (7) pF{x) +q This gives a derivation of a result known for the ruin probability in the classical risk process [e.g. see Theorem 3.1 of De Vylder and Goovaerts (1984)]. Theorem 3.2. If F has a finite mean E{X\) and G(x) has a decreasing density function, then for any x > 0 W(x)
< Pn*)+jG(x) pF(x)+q
+
6(x)
where S(x) = pG(x){x~1 E(Xi) - F(x)}-±0 as z->-oo. Proof. Since W(x) = H * G(x) = Pr{S;v + Y < x}, by conditioning on the value of Y, we get that for any x > 0, W(x) = G(x) + [X H(x - y) dG(y) Jo
(9)
[XH(x-y)G'(y)dy. (10) Jo But, we know that if f(x) and g(x) are integrable functions with different monotonicity on [a, b], a < b, then = G(x)+
pb
-i
f(x)g(x)dx Ja
rb
< —— / "
a
rb
f(x) dx
Ja
g(x)dx.
(11)
Ja
Thus, by (10), (11) and the fact that H(x — y) is increasing in y over [0, x], we get that for any x > 0, W{x) < G(x) + - T ff (a: - i/) dy [X G'(y) dy x Jo Jo - G(i) + ^ X
(12)
rI Hit), Hit) dt Jo /"OO
= G ( x ) + G W E(SN) - / (13) ^(0 X Jx Since 5jv is compound geometric, the distribution H of SV is a New Worse than Used (NWU) distribution [Lemma 2.1 of Brown (1990)], which implies that if is a NWUE distribution, i.e. />oo
L
H(t) dt > E(SN)H(x),
x>0.
(14)
118
Thus, by (13), (14) and (7), we get that for any x > 0 W{x) < G(x) + x-1G(x)E(SN)[l
- H(x)]
?3S«.
(15)
pF(x)+q But E(SN) = E(£%=i Xi) = E(N)E(Xi) = pE(X{)lq. Hence qE{SN) = pE(Xi). This, together with (15), implies that (8) holds. • R e m a r k 3.2. The requirement in Theorem 3.2 that G has a decreasing density function is not restrictive. In fact, this condition can be satisfied by many distributions such as the class of the equilibrium distribution function Fe(x) = J0 F(y)dy/ J0 F(y)dy with decreasing density function f(x) = F{x)/ Jg00 F{y) dy, which often arises in risk theory, reliability, queueing and renewal theory. In addition, the class of decreasing failure rate (DFR) distributions have decreasing density functions since f(x) = r(x) F(x), where r(x) > 0 is the failure rate function of F, which is decreasing in this case. Corollary 3 . 1 . Combining Theorems 3.1 and 3.2 we get that under the conditions and notation of Theorem 3.2, for any x > 0 PF(x)+qG(x) PF(x)
+q
_
<
pF(x)+jG(x)+S(x) pF(x)+q '
<
~
~
^
;
Since S(x) -> 0 as x -> oo, (16) indicates that PF{x)
+ gG(x) pF(x)+q
may be an asymptotic form for W(x) as x -+ oo. Indeed, we show below that under various situations, this is precisely the asymptotic form of W{x). Corollary 3.2. Under Theorem 3.2, if lim xG(x) = oo then x-+oo
pF{x)+q Proof. By (16), we get for any x > 0,
! <
IT(X)
W(x) v J
pF(x)
+
J
+ qG{x)
< 1+ ~
PGMiEm-xFjx)} PxF{x)
+ qxG{x)
v
119
Since F has a finite mean, x F(x) -t 0 as x -> oo. Thus, (18) and x G(x) -> oo as x —> co imply that W(x)
~ PFW + qG(x) pF(x)+q D
R e m a r k 3.3. There exist classes of distributions that satisfy the conditions of Corollary 3.2, for example the Pareto distribution with density function q(x) — -
aca —— ,
x >0,
where 0 < a < 1 , c > 0,
or more generally, the Burr distribution with distribution function F(x) = 1 - ( T
)
,
x >0,
where A > 0 , 0 < a < l , 0 < r < l .
But, it should be pointed out that the condition xG(x) -^ oo as a:-»oo is restrictive, since it implies that G has no finite mean. On the other hand, under the conditions of Corollary 3.2, pF(x)+gG(x) PF(x)+q gives a closer approximation to W{x) than G(x) does, as seen by Theorem 3.1 and the fact that for any x > 0,
pnx)+iG(x) PF(x)
+q
>
~
We know that if h is an asymptotic form of an unknown function T, i.e. T(x) ~ h(x), then any function g satisfying g(x) ~ h(x) is also an asymptotic form of T. Hence, it is interesting to find a closer asymptotic form among known approximations. As shown above, a closer approximation can be derived by combining bounds and asymptotic forms. Furthermore, we notice that the only relation between F and G required in Corollary 3.2 is that G(x) dominate asymptotically F{x). If other relations between them are assumed and subexponentiality (as defined below) is further imposed on F or G, we can derive additional results for W(x) as follows. Definition 3 . 1 . A distribution B on [0, oo) is said to be subexponential, denoted by B 6 S, \iB~$){x) ~ 2~B{x). Subexponential distributions are heavy tailed; typical examples are the Pareto or Lognormal distributions. The following Lemma combines Proposition 1 of Embrechts et al. 14 and Theorem 2 of Chistyakov 5 .
120
Lemma 3.1. Suppose that F\ and F 2 are two distributions on [0, 00). (i) If F 2 G S and Fx{x) = o{F2(x)), then Fi*F 2 G S and Fi * F2(x) ~ F 2 (x). (ii) If Fi * F2 G 5 and F^x) Fi*F 2 (ar).
= o(Fi *F 2 (x)) then F 2 G 5 and F2(x)
(iii) If F 2 G 5 , then for any e > 0, lim eSxft(x)
~
= 00, i.e. e~£x = o(Fj(a;)).
Theorem 3.3. Suppose that G(a;) = o(F(x)). The three following assertions are equivalent: (i) F 6 S, (ii) W (iii) W(x)
eS, ~ { p f ( a ; ) + g G ( x ) } / { p F ( x ) + 9} ~
pF(x)/q.
Proof. First, it is clear that G{x) = o(F(x)) implies pF(x)+qG(x) pF(x)+q
^
By Corollary 3.2 of Embrechts et al. conditions are equivalent:
14
pF(x) q
, we know that the following three
(a) FeS, (b) H G S, (c) i/fa) ~ pF{x)/q . Hence, if F G S, by (b), (c) and G{x) = o(F(a;)), we get that H G S and
£W=EwxM40 H(x)
F{x)
as
^00j
H{x)
i.e. G(:r) = o(H(x)), hence (c) and (i) of Lemma 3.1 imply that W = G*H G S and W(x) ~ i? (x) ~ pF(x)/q. Conversely, if W = G * H G S, by Theorem 3.1 and G(x) = o(F(a;)) : G_(x) < { ^ ( x ) + g g f r ) - H ' ( i ) - pF(i)+?G(i) <
=
{pF(x) + j } G ( z ) / F » p + 9 G(a:)/F(a;)
_^
Q
this is to say that G(x) = o(W(x)), thus, (ii) of Lemma 3.1 and (c) imply that H G 5 and W(x) ~ # (a:) ~ pF(x)/q. So, we have shown that F G <S 44> W e S. In addition, the above proof also showed that F G S =$• W(x) ~ pF(x)/q. Thus, in order to complete the proof of Theorem 3.3, we still need to prove that W(x)
~ -F(x)
=>• FeS.
(20)
121
By definition, W{x) = Pr{5jv + Y <x} = £ ~ = 0 qPnG * F^(x),
and hence
^qpnG*F(n){x)
W{x) = n=0
which implies that G*FW(x)
= —; w{x)-Y^qpnG*F^{x) qp' n^2
(21)
Since Y is a non-negative random variable, for any integer k > 1, G * FW(x) = Pv{Y + X1 + --- +
Xk>x} F(k)(x)
> Pr{X 2 + • • • + Xk > x} = >Pr{max(X1;...
,Xk)>x} k-i
k
F(x)Y,[F(x)]n
= l-[F(x)] =
(22)
n=0
(22) implies that hm mf
_
x - > oo
F(x)
w N
> lim inf
_ / /
z ->• oo
> k.
(23)
F(x)
Clearly, (22) and (23) are also true for k = 0, thus by (21) and (22) FW(x) F(x)
G*F(V(x) ~ F\x)
1 - qp^
Thus by (24), (23) and W(x) ~ pF(x)/q,
W(x)
^
qp
W)-^2
F(x)
(24)
we get that
FW(x) 1 hmsup - = — — < —r-~2^
y
n^2
<
q
*—'
qp'
+ 2
= 2 , (25)
n=0
hence, (23) and (25) imply that Iim^ -> oo i ^ 2 ' (x)f F(x) = 2, i.e. F e S. a It is interesting to note that (20) holds and is independent of the condition that G(x) — o(F(x)). In addition, if G is an exponential distribution, using Lemma 3.1(iii) and following the proof of Theorem 3.3, we get directly the following corollary.
122
Corollary 3.3. If G is an exponential distribution, then Theorem 3.3 holds. Theorem 3.4. Suppose that F(x) — o(G(x)) and F E S. The two following assertions are equivalent: (i) GeS,
(ii) W G S,
and either one of them implies that: (iii) W{x)
~
{PF{x)+qG{x)}/{pF{x)+q}
~
G(x).
Proof. By (c) in the proof of Theorem 3.3, we know that H(x) = ^ -
H(x)
F(x)
= =7-7- x = ^
-)• 0
as
x -> oo ,
G{x) F(x) G(x) i.e. H(x) = o(G(x)), hence if G G S, by Lemma 3.1(i), we get that W = # * G G S and W(x) ~ G(z). But, F(z) = o(G(x)) implies that pF(z) + g G ( i ) m pF(x)+q Conversely, if W = H * G G S, by Theorem 3.1 and the proof of Theorem 3.3(c), we get that < £ ( * ) < {pF(x) + g f f l z ) - W(a;) ~ p f ( x ) + ? G ( i )
=
{pF(x) + q)E{x)IG{x) g F(x)/G(x)+
_^
&g
i.e. #(:r) = o(W(a;)). Then Lemma 3.1(h) =>• G G 5 and W(x) ~ G(z). Remark 3.4. We know that pF(a;) + oG(a;) pF(x) „ , , , 9^, . „ r _ F * V ; > £—i-^ if and only if g2 G(x) > \pF(x)}2. pF{x)+q q
•
Thus, in view of Theorem 3.3(iii),_if F G S^x) = o(F(x)) and q2G(x) > 2 3 2 \pF{x)} as a;->oo, (for example, G(x) = [F(x)] / ), then by Theorem 3.1, we know that {pF(x) + qG(x)}/{pF(x) + q) is a closer approximation for W{x) than pF{x)/q is. By an argument similar to that in Remark 3.3, we also know that under the conditions of Theorem 3.4 and if G G <S, then {pF{x) +qG(x)}/{pF(x)+q} is a closer approximation to W(x) than G{x). Theorem 3.5. If G(x) ~ T{x) and F G S, then w
W (x) ~
PF(x)
+ qG(x) =—— pF{x)+q
F(x) ~
G(x) ~
q
. q
123
Proof. F E S implies that H e S and H(x) ~ pF(x)/q. G(x) H(x)
G(x) F(x)
Fix) H(x)
Thus, by Theorem 1 of Cline Kliippelberg 17 ) we know that
6
q p
(see also Theorem 4.3 of Goldie and
q H(x) ~ (l + l)H(x) = p p On the other hand, F(x) ~ G(x) implies that W(x)=G*H(x)
PF(x)+qG{x)
F{x) q
pF(x)+q
Hence,
F(x)
G(x) q
so Theorem 3.5 holds. • Remark 3.5. The results of this Section also show that the lower bound in Theorem 3.1 is asymptotically exact for W(x) as x -» oo, in the subexponential case. Bounds for W(x) were considered in Kalashnikov 18 and Willmot and Lin 2 7 ) . The bounds for W(x) in (5.1) and (5.2) of Kalashnikov 18 are asymptotically exact as E{N) -> oo. The bounds of Willmot and Lin 27 are applicable to the tail of convolutions of more general compound distributions; these are based on a generalized Lundberg's coefficient and NWU distributions (see Willmot and Lin 28 for more details). Also, we point out that W{x) cannot admit exponential asymptotic forms and exponential upper bounds if F or G is subexponential, i.e. no constant c > 0 and e > 0 such that W(x) ~ ce~£x or W(x) < ce~£x, for all x > 0. For example, if F is subexponential, then we know that H € S and e£x H(x) —> oo as x -> oo, hence lim e£x W(x) = oo, since W(x) = H * G(x) > H(x). x—>oo
4
Medium-tail asymptotics
We consider here the intermediate case, i.e. when mF(s) < oo for some s > 0, but Lundberg's coefficient does not exist, as Tnp(s) = 1/p can not be satisfied but mpis) < 1/p holds (e.g. in the inverse Gaussian case). First, recall the definition of the S(a) class and its properties. Definition 4.1. A distribution B on [0, oo) is said to belong the S(a) class for a > 0, denoted by B S S(a), if (i) \imx^00BW(x)/B(x) (ii) limx _> oo B{x - y)/^(x)
=
2mB{a)
124
Clearly, B € 5(0) <£• B G S. A class of distributions in S(a) is the class of generalized inverse Gaussian distribution N~1(a,b,c) with a < 0, b > 0 and c > 0, since N~1(a,b,c) G <S(c/2) (see Embrechts 12 for details). The following proposition recalls some properties of <S(a), which will be used here. Its proof is in Lemma 2.4 and Theorem 2.7 of Embrechts and Goldie 13 and page 268 of Kliippelberg 21 (for details see Rolski et al. 2 2 ) . Proposition 4.1. Suppose that B G £(7). (i) For any e > 0, e 7 + e B(x) —> 00 as a; —• 00, (ii) If L is a distribution on [0, 00) and limx —> 00 L{x)/ B{x) = c, where 0 < c < 00, then L € 5(7), (iii) If 7 > 0, B has a finite mean m, and B\ is the ladder height distribution of B [i.e. B,{x) = i | 0 X B(j/)dy], then B x G «S(7) and ^ ( z ) ~ g £ } . Theorem 4 . 1 . Suppose that F G 5(7) for some 7 > 0 and that 771^(7) = | 0 °° e<xdF{x) < 1/p. If G(x)/F{x)-+a as z - ^ o o , then W G 5(7) and F(») ~ g W 7 ) + "[l-pmF(7)]} [l-pmF(7)J2 Proof. Since 0 < prnpij) < 1, there exists some e > 0 such that 0 < p[mF('y) +e] < 1, thus £^L 0
n 1
where c = Y^Li nqp [mF('y)] -
(27) 2
= pq[l - puip^)]' .
G{x) G(x) F(x) a —, ' = =i—^ x _ -> — as #(x) F(a:) H(i) c By Theorem 1 of Cline 6 , we get that W{x) hm = 7 ~ f = 1 -> 00 f f ( x )
,. H*G(x) hm x->oo
;
Thus, x -> 00 .
. , a = mG 7 + - * " g 7 •
H(x)
C
But, m^(7) = £ ( e ^ ) =
^ ^
n
E
afZ7=iXi
n=0
= X > " [ m F ( 7 ) ] " = q[l - p m F ( 7 ) ] - 1 n=0
(28)
125
Hence, by (27) and (28), we get that Q<
W(x) ~ [TTIGCT) H—wjf (7)] # ( z ) ~ [email) + amH(7)] 2
F(a;)
-1
= {p?[l-pmF(7)]- mG(7) + a 9 [ l - P " ^ F ( 7 ) ] } ^ ( Z ) . i.e. (26) holds. W £ S(j) follows from (26) and (ii) of Proposition 4.1. 5
•
Ruin probabilities in a diffusion risk model
To illustrate a possible application of the above results, consider the classical risk process perturbed by a Wiener process, i.e. the surplus R(t) at time t is R(t) = x + ct- S{t) + Z(t) , where x > 0 is the initial risk reserve, c > 0 is the premium rate, S(t) is the compound Poisson process representing the total claims at t [with rate 1/d > 0, and the independent individual claim sizes with common distribution function B and B(0) = 0], and {Z(t)} is a Wiener process, independent of {5(t)}, with infinitesimal drift 0 and infinitesimal variance 2D > 0. Assume c > A/d, where A = /0°° B{x) dx is the expected claim size, then the relative security loading q = 1 — \/(cd) is such that 0 < q < 1. Let ip(x) denote the probability of ultimate ruin, starting with initial reserve x : V>(a:) = Pr{infiJ(i) < 0} . Denote tp(x) = 1 — i(>(x). Dufresne and Gerber have shown that for any x > 0,
n
(see also Veraverbeke
26
)
00
tp(x) = Y, qpnF{n) * G(x) =H* G(x) , n=0
or, equivalently, 00
i(j(x) = J2 1Pn F(n) * G(x) = IlTG(x)
,
(29)
n=0
where H{x) = £ £ L p q p n F ^ ( x ) , G{x) = 1 - e~cx/D, F(x) = G * B1 (x), Bx [x) = \ $* B(y)dy, for x > 0, and p = A/(od), q=l-p=j=ft. Suppose that ^ and r) are independent random variables with distributions G and B\, respect., then £ + r\ has distribution F' = G*B\ and for any s £ l , />oo
mF(s) = / Jo
esxdF(x)
= £ [ e ^ + " > ] = E(esi)E(es»)
= mG(s)mBl
(s). (30)
126
Thus, if there exists a constant R such that mF(R) = mG(R)mB1{R)
(31)
=
then (31) implies that R < c/D , r
mG(R) = E(e"t) = -J
rc
Rx
e
— ^-
i
e o dx
c-RD
(32)
< oo
and xe
e
D
D mG{R) . c-RD
cD (c - RD)2
dx =
(33)
By (31) and (32), we get that E(eR*) = mBl(R)
=
1 pmG{R)
c-RD pc
Hence />00
/? = / Jo = 5.1
xeRxdF{x) mG(R)
pc
= E[{£ + V)em+rt)
A J0
= E{^)E{eRr])
+ E(e fi «)S(r/e fi ")
xeHxB(x)dx
(34)
Asymptotics for the diffusion risk model
We know that if G and B\ have density functions, so does F = G * B\. This implies that F is non-lattice. Thus by Theorem 2.1, we get the following exponential formula for the ruin probability tp(x). Corollary 5.1. Suppose that mp{R) = 1/p. If J0°° xeRx B(x) dx < oo then
«*>-=$?'-*- and if JQ xeRx B{x)dx
*E+*L r c
cd J0
Rx xe
B{x)dx
-l
-Rx
= oo, then ip{x) = o{e~Rx) .
(35)
This Corollary includes Theorem 4.1 of Gerber 16 , the results in Section 7 of Dufresne and Gerber u and in Section 3 of Veraverbeke 26 .
127
Since G{x) = 1 - e~cxlD is an exponential distribution function, by (i) and (ii) of Lemma 3.1, we know that B\ 6 S <S> F = G*Bi £ S. Furthermore, by Corollary 3.3 and Lemma 3.1, we know that Bi e S =S> 1>(x) ~ - F(x) = —^— T(x) q cd — X cd — A
cd-
In addition, using the fact F{x) = G * Bx(x) > Bi(x) and following the proof of (20), it is clear that %l>{x) ~ ? 57(3;) = _ ^ _ (J
B~1[X)
=> Bi e <S ,
CO — A
thus, Corollary 3.3 implies the following theorem. Theorem 5.1. For the classical risk process perturbed by a diffusion, the following conditions are equivalent: (i) Bx £ S,
(ii) v e S,
(iii) 4(x) ~ ^
/x°° B(y) dy.
Remark 5.1. Theorem 5.1 generalizes Theorem 1 of Veraverbeke 26 , where it is shown that (i) and (ii) are equivalent and that either one of them implies (iii). Conditions enabling B\ £ S can be found, expressed in terms of B, in Embrechts and Omey 15 and Kliippelberg 20 . Finally, for the intermediate case, suppose that for some 7 > 0, BE S(J) and 771^(7) = W G ( 7 ) J « B 1 ( 7 ) < 1/p, then by Proposition 4.1, we get that B\ £ S("f) and B\(x) ~ B(x)/(/y\), in addition, (1) rnG{l) = (c/D) J0°° e?xercxlD
dx = c/{c - D7) < 00,
(2) mBl (7) = Jo00 e 7 * dBt (x) = (1/A) J0°° e^ B~[(x) dx < 00 and 7 < c/D, hence, there exists some e > 0 such that 7 + e < c/£>, thus m g ( 7 + e) = c / [ c ~ (7 + e)JC] < °°) a n d by Proposition 4.1, we get that G(a;) e-y+£G(x) . = z=~ jBi(a:)
e-y+£Bi(x)
> 0 as a; -> 00 .
Thus, by Theorem 1 of Cline 6 and (ii) of Proposition 4.1, we get F(x) G*B1(x) = — = —= -> man) Bx{x) B1{x)
as
z-^oo
128
and F € S ( 7 ) and F(x)
~
mG^)B~[(x)
~
T
^^-B(x)
Thus, G(x) = o(F(a;)) and Theorem 4.1 imply that ? 6 5(7) and
*(*) ~ n W m G ( / \ 1 2? F ^
(36)
2
[l-pmF(j)] pg[m G (7)] 2 A7[l-pmF(7)]
= cdj — ( 1 -cd - ) l -
2
elyB{y)dy
-A c cd J0 thus, (38) yields Theorem 2 of Veraverbeke 26 . 5.2
B(x) ,
(38)
Ruin probability bounds for the diffusion risk model with heavy tails
Under condition (31), the exponential bounds for the ruin probability ip(x) have been derived by Dufresne and Gerber n . Here, we use a generalized condition of Dickson 10 to derive bounds for if)(x) with heavy claim size tails. General upper and lower bounds for 4>(x) follow directly by Theorems 3.1 and 3.2 since G is exponential and has a decreasing density. Given t > 0, suppose that Rt satisfies mG(Rt)
f eR'ydB1(y) Jo
= -, P
(39)
or equivalently,
I
* e*<*dBM = —— • (40) /o cp L e m m a 5.1. For any claim size distribution B with B(0) = 0, there exists a unique solution Rt € (0, c/D) to equation (39). Proof. Let h(x) = mG{x) /0* exydBi(y) - ±. Since H(0) = Bi{t)-±<0 and lim moix) = lim xtc/D
xfc/D
— = 00 , C — xD
this implies that limx1-c/£) h(x) = 00. Thus, the existence of the unique root of h follows from the fact that h is continuous and strictly increasing. • Since B\ is continuous in here, condition (39) is clearly equivalent to mam
J
^ydBt(y) = mry
(41)
129
where
mix) =
{BMmt)so^x
(42)
Let ^ W = ^ftPl»FfW*G(x)
(43)
n=0
or equivalently, %l)t {x) = l - tpt (i) = ^2qtPt
^ ( n ) * G{x) ,
(44)
n=l
where p t — pB\{t), qt = 1 - Pt and Ft — Bt* G. That is to say, Vt(^) i s the ruin probability in the diffusion risk model with corresponding parameters Pt, qt, Ft and G. By induction we get that for any 0 < x < t, B(tn)(x) = B[n)(x)/[Bi(t)]n, hence for any 0 < x < t,
J2^ptBtn)*Gin)*G(x)
n^) = n=0
oo
= - J2 QPnF(n) * G{x) = -
Q
Thus, for any 0 < x < t, 1>t{x) = 1 - ft{x) = 1 - - ipd{x) = 1 - ^ [1 - tl>d(x)] . This implies the following property. Lemma 5.2. For any 0 < x < t, ^
'
q + pBrit)
{
q + pB^t)
'
Using Lemma 5.2 above, we can prove the following result. Theorem 5.2. Suppose Rt satisfies (39), then for any 0 < x < t,
p*JfL- < ^
q+pB^t) ~ In particular, for any x > 0, P 1(X)
*
-
q + pBi{x)
< JML.
~ q+pBiit)
< iK«) <
P 1 X)
*A
q+pBi{x)
+
+
J!£±L..
(46)
q+pBi(t)
qe X
'-
q+pB^x)
.
(47)
130
Proof. The lower bound in (46) follows from ipt(x) > 0 and (45). On the other hand, if a constant R satisfies 1
f°°
mG(R) then Dufresne and Gerber
16
/ Jo
eR"dB^y)
= - , P
(48)
show that for any x > 0, ip(x) < e-Rx
.
(49)
Thus, apply (49) to ipt{x) with condition (41), to get for any x > 0, tPt(x) < e~R'x . This, together with (45), implies that the upper bound in (46) holds. Taking x = t in (46), gives (47). • Remark 5.2. (a) Since pB^x) q+pBx{x)
pB^x) q
by Theorem 5.1, we know that the lower bound in (47) is asymptotically exact, for large x, if £?i is a subexponential distribution. (b) If D = 0, the diffusion risk model is reduced to the compound Poisson risk model, thus the bound of Dickson 10 is derived as a special case of Theorem 5.2 and is improved upon. Other applications of the results in sections 2-4 have been investigated, notably to M/G/k queues (see for example Asmussen 2 ). Acknowledgments We are grateful to the anonymous referees for their constructive comments. We would also like to acknowledge the funding obtained for this research by the Natural Sciences and Engineering Council of Canada (NSERC) operating grant OGP0036860. References 1. S. Asmussen, Applied Probability and Queues (John Wiley & Sons, New York, 1987). 2. S. Asmussen, Ruin Probabilities (World Scientific, Singapore, 2000). 3. M. Brown, An. Prob. 18, 1388 (1990). 4. J. Cai and J. Garrido, J. App. Prob. 36, 1058 (1999). 5. V.P. Chistyakov, Th. Prob. App. 9, 640 (1964).
131
6. D.B.H. Cline, Prob. Th. Rel. Fields 72, 529 (1986). 7. D.B.H. Cline, J. Aus. Math. Soc. A43, 347 (1987). 8. J.W. Cohen, On Regenerative Processes in Queueing Theory (SpringerVerlag, Berlin, 1976). 9. F. De Vylder and M.J. Goovaerts, Ins.: Math. Econ. 3, 121 (1984). 10. D.C.M. Dickson, Scand. Act. J. , 131 (1994). 11. F. Dufresne and H. Gerber, Ins.: Math. Econ.10 51 1991. 12. P. Embrechts, J. App. Prob. 20, 537 (1983). 13. P. Embrechts and C M . Goldie, Stoc. Proc. App. 13, 263 (1982). 14. P. Embrechts, C M . Goldie and N. Veraverbeke, Z. Wahr. Verw. Geb. 49, 335 (1979). 15. P. Embrechts and E. Omey, J. App. Prob. 2 1 , 80 (1984). 16. H. Gerber, Skand. Akt. 2051970. 17. C M . Goldie and C Kluppelberg, in A Practical Guide to Heavy Tails: Statistical Techniques and Applications, ed. R. Adler, R. Feldman and M.S. Taqqu (Birkhauser, Boston, 1998). 18. V. Kalashnikov, Topics on Regenerative Processes, CRC Press, Boca Raton, Florida, 1994). 19. J. Keilson, Ann. Math. Stat. 37, 886 (1966). 20. C Kluppelberg, J. App. Prob. 25, 132 (1988). 21. C. Kliippelberg, Prob. Th. Rel. Fields 82, 259 (1989). 22. T.Rolski, H.Schmidli, V. Schmidt and J. Teugels, Stochastic Processes for Insurance and Finance (John Wiley & Sons, New York, 1999). 23. B. Sundt and J.L. Teugels, Ins.: Math. Econ. 16, 7 (1995). 24. R. Szekli, Stochastic Ordering and Dependence in Applied Probability, Lect. Notes Statist. 97 (Springer-Verlag, New York 1995). 25. M.H. van Hoorn, Algorithms and approximations for queueing systems, CWI Tract# 8 (CWI, Amsterdam, 1984). 26. N. Veraverbeke, Ins.: Math. Econ., 13, 57 (1993). 27. G.E. Willmot and X. Lin, Ins.: Math. Econ., 18, 29 (1996). 28. G.E. Willmot and X. Lin, Lundberg Approximations for Compound Distributions with Insurance Applications, Lee. Notes Stat. 156 (SpringerVerlag, New York, 2001).
I M P R O V E D FINITE-SAMPLE I N F E R E N C E IN O V E R I D E N T I F I E D MODELS W I T H W E A K I N S T R U M E N T S NIKOLAY GOSPODINOV Department of Economics, Concordia University, 1455 de Maisonneuve Blvd. West Montreal, Quebec, H3G 1M8 Canada E-mail: gospodin@vax2. concordia. ca This paper investigates the finite-sample properties of the class of generalized empirical likelihood estimators in possibly overidentified models with weakly identified parameters. These nonparametric likelihood estimators satisfy exactly the moment conditions and automatically remove the bias that arises from a lack of centering of the moment conditions. The inference procedure suggested in the paper does not involve any explicit estimation of the variance-covariance matrix. The confidence sets for the parameters of interest are constructed by inverting the x 2 acceptance region of the criterion test.
1
Introduction
Moment condition models arise naturally from dynamic economic theory with optimizing agents. Since the seminal paper by Hansen 13 , the generalized method of moments (GMM) has become the predominant framework for estimating the structural parameters of these models. Under some general regularity conditions, the GMM estimator is consistent, asymptotically normal and efficient for the given set of moment conditions. Unfortunately, it has been found that the small-sample properties of the conventional GMM estimators (in particular, the two-step GMM) are rather poor. In this paper, we investigate the properties of the class of generalized empirical likelihood estimators of moment condition models. Members of this class are the empirical likelihood-based GMM of Qin and Lawless 27 , Imbens 15 and Imbens, Spady and Jonhson 16 , and the maximum entropy-based GMM of Kitamura and Stutzer 17 and Imbens, Spady and Johnson 16 . This class also includes the continuously-updated GMM as a special case (Imbens, Spady and Johnson 16 ; Newey and Smith 22 ) which explains the superior small-sample performance of this estimator over the traditional two-step GMM found in Hansen, Heaton and Yaron 14 . These nonparametric likelihood estimators minimize the distance between the empirical distribution function and a distribution function that exactly satisfies the moment conditions. One of the most attractive properties of nonparametric likelihood estimators is that they tend to remove some important sources of bias that give rise to poor finite-sample properties of the GMM estimator and GMM-based
132
133
test statistics. The first source of bias arises from the fact that the first-order conditions of the standard two-step GMM estimator (Hansen 1 3 ) , evaluated at the true values of the parameters, are non-zero. This bias is exacerbated if the number of instruments increases (Kocherlakota 1 8 ). Altonji and Segal 1 and Angrist, Imbens and Krueger 3 proposed some ad hoc methods for reducing the magnitude of the bias. Donald and Newey 8 showed that, for the continuously-updated GMM of Hansen, Heaton and Yaron 14 , the first-order conditions are exactly satisfied at the true values of the parameters and this source of bias is automatically removed. In fact, for all members of the class of the generalized empirical estimators, the moment conditions are exactly centered at zero by construction. Second, the estimation of the weighting matrix can be another important source of bias due to the non-zero finite sample correlation between the elements of the variance-covariance matrix of the parameters and the errors (Altonji and Segal 1 ) . This source of bias is present for both the two-step and continuously-updated GMM estimators but disappears for the empirical likelihood (EL) estimator of Qin and Lawless 2T . Newey and Smith 22 showed that the bias of the empirical likelihood estimator of Owen 24 and Qin and Lawless 27 is the same as the bias for the infeasible optimal GMM where the optimal linear combination coefficients do not have to be estimated. Third, the small-sample properties of the GMM estimators and test statistics can be seriously affected by the choice of instruments that are only weakly correlated with the endogenous variables (Stock and Wright 2 9 ) . In this case, the finite sample distributions of the GMM estimators and the test statistics may depart substantially from their asymptotic distributions. Stock and Wright 29 proposed an alternative reparameterization of the moment conditions and obtained asymptotic representations with improved finite sample properties. In their framework, however, the weakly identified parameters are not consistently estimable. Fortunately, one could still conduct asymptotically valid inference by inverting criterion-based tests since their limiting X 2 -distribution at the true values of the parameters is preserved. In this paper, we show that the nonparametric likelihood estimators are robust in the presence of weakly identified parameters. Most importantly, the criterion-based inference procedure does not involve any explicit estimation of variance-covariance matrices. Unlike the Wald test, the confidence sets constructed by inverting the criterion test, also satisfy the requirement of infinite expected volume in the completely unidentified model (Dufour 9 ) . Finally, the class of generalized empirical likelihood estimators is transformation invariant and the obtained confidence sets are transformation respecting. The rest of the paper is structured as follows. Section 2 discusses two ap-
134
proaches to estimating moment condition models that give rise to the GMM and the nonparametric likelihood estimators. The asymptotic validity of the confidence interval construction by criterion test inversion is shown in Section 3. The Monte Carlo experiment in Section 4 studies the finite-sample properties of the different estimators and their corresponding confidence intervals in a linear instrumental variable model with weakly identified parameters. In Section 5, nonparametric likelihood estimators are applied to estimating the return to education. Section 6 summarizes the conclusions. 2
General Approach to Estimating Moment Condition Models
Let E[g(x,9)\F] = J g(x,6)dF = 0 be an m x 1 vector of population moment conditions implied by economic theory, where (xi, xi,...) are independent random vectors in R p with unknown continuous distribution function F, 8 is a k x 1 vector of unknown parameters from 0 and g(.) is a given function {g(x,0) : R p x R l - » R m } with m > k. Suppose that we restrict the family of possible distribution functions to the space of multinomial distributions with finite support on the observed data, denoted by $. Also, let Fn denote the empirical measure of the sample {xi}"=1 from F that places probability mass n _ 1 on each data point and P„ be another probability measure that assigns multinomial weights Pi,P2, ••••,Pn to each of the observations. Below, we consider two versions of the analogy principle discussed in Manski 2 0 . The first version selects an estimator that minimizes the distance of the moment conditions from zero (GMM estimators). The second version selects an estimator that minimizes the distance between the empirical measure and a measure Pn that satisfies exactly the moment conditions (nonparametric likelihood estimators). 2.1
GMM Estimators
The conventional GMM estimator minimizes the distance of the sample counterparts of these moment conditions from zero using the quadratic form Qn(0) = 9n(6)'Wn(6)gn(6),
(1)
where gn(0) = E[g(x,6)\Fn) = f g(x,9)dFn and Wn(9) is a positive definite weighting matrix. Then, 9 = argmme<E@Qn(6). The properties of the GMM estimator depend crucially on the choice of the weighting matrix. The optimal GMM estimator sets Wn{9) = [£ £ t " = i & ( % # ) ' ] ~*, where gi{9) = g{Xi, 6). If a preliminary consistent (but not necessary efficient) estimator 6 of 9 is
135
used in the estimation of Wn, we have the two-step GMM estimator Oistep = argmin ( ? € 0
• (2) «=1
J
L
i=l
.
.
«=1
If we substitute 82step in the weighting matrix and repeat this until convergence of both 9 and Wn, the obtained estimator is the iterated GMM estimator (Hansen, Heaton and Yaron 14 ) described by the equations -1
2_ V^ / d9i(&igmm) 56"
n
i=l
/ j 9i \yigmm)9i
("igmm )
9n\Pigmm)
— Ufc.
(3) Finally, the continuously-updated GMM estimator proposed by Hansen, Heaton and Yaron 14 does not require a preliminary estimate of 6 and directly minimizes the criterion function 6CU = argmin0 € O
• j=i
J L i=i
J
(4)
L «=i
The estimator is the solution to a (typically nonlinear) system of k first-order conditions I ^ j g H
I W J ^ J
5 n
( e
c
0 - 5 „ ^ c „ ) ' W „ ( ^ n ) ^ ^ W n ( ^ ) 5 n ( ^ )
= 0
or
J r
^
-v
L I-i
t=l ^
where A = - [ £ " = i 5i(^cu)3i(^cu)'J 9n(0cu)Although these estimators are asymptotically equivalent, their finite sample properties may differ (see for example Hansen, Heaton and Yaron 1 4 ). 2.2
Nonparametric Likelihood Estimators
A second approach is to obtain a value of 6 that minimizes a distance between probability measures rather than the distance of the moment conditions from zero. This data driven approach selects from the set of distributions that satisfy exactly the moment conditions a probability measure Pn closest to the
136
empirical measure Fn defined by the Cressie and Read criterion DP{Fn,Pn)
6
power divergence
= -^-j)YjPi[{nPiy-l],
(6)
where p is a fixed scalar parameter which determines the shape of the criterion function. Cressie and Read 6 proposed the family of power divergence statistics as goodness-of-fit tests. Here, we use the Cressie-Read divergence criterion for estimation purposes. The estimator is defined as the solution to minpne*,eeeDp(Fn,Pn) subject to E[g(x,6)\Pn]
(7)
= / g(x,6)dPn
= 0.
(8)
This form of the analogy principle maps the empirical distribution function onto the space of feasible distribution functions and chooses the probability measure that is most likely to have generated the observed data, subject to the moment conditions (Manski 2 0 ) . The solution to the above constrained optimization problem is a straightforward application of the Lagrange multiplier principle. This framework embeds several interesting special cases (see Kitamura and Stutzer 17 ; and Imbens, Spady and Johnson 1 6 ). The first two cases can also be interpreted as discrete versions of the forward and backward KullbackLeibler discrepancy between the empirical measure and Pn. If we let p approach 0, the estimator is the solution to the problem 2 " min — > In npi p,e
(9)
n *-^
n
n
subject to ^2pig(xi,6)
= 0 and ^ P i = 1.
(10)
This is the empirical likelihood estimator of Owen 24>25>26 and Qin and Lawless 27 obtained as the root of the system of equations
(\Y,U9i{eEL)l(i
+ \'9i{eELj)
\ J'm-\-k^
[h££=i A' ( ^ H J / (i + \'9i(eEL))
J
where A is a vector of Lagrange multipliers on the moment conditions.
137
The (m + k) x 1 parameter vector (6EL, A)' can then be used to compute the vector of probability weights PJ = n f 1 + \'g0EL)) for i = 1,2, ...,n. This gives an efficient estimate of the distribution function F given the set of moment conditions. One drawback of this method is that the form of the loss function resists heavy downweighting of specific data points which may be a problem in the presence of outliers. If p -> — 1, we obtain the maximum entropy, exponential tilting or KLIC (Kullback-Leibler information criterion) estimator of Efron 10 and DiCiccio and Romano 7 and discussed in Kitamura and Stutzer 17 , Imbens 15 and Imbens, Spady and Johnson 1 6 ). It is computed as the minimizer of the KLIC function min 2 V , Pi In npi p
'
(11)
«=i
subject to (10). The maximum entropy estimator solves the system of firstorder conditions \
/^Efeiftfcjexpp'ftfe))
= o.m+k-
l^£r=1A' 26
9gi(0BT)
exp (Xgi(0ET))
Finally, if p -> - 2 , we obtain the Euclidean likelihood estimator of Owen given by the argument that minimizes i
n „2
min P,0
(12)
i)
n i=l
subject to (10). The solution is obtained from \
= o.m+k,
v ^Er=1A'(^#^)[i+A%fe)] where 90EU)
= 90EU)
~ £ E"=i 90 EU)- From the first m equations, A =
" [ £ E ? = i f f i f c ) f f i ( W ] - 1 [^Y:U90EU)] A and j>i = n
E
- (
Pi
_1
1 + \'g~0Eu)
dg0Eu) 89'
• Then, by substituting for
in the last k equations, we get
i n -Y,90EU)-90EU)'
1
ri n -Y,90EU) i=l
=0*.
138
It is interesting to see that this system of first-order conditions is almost identical to (5) for the continuously-updated GMM estimator and very similar to (3) for the iterated GMM estimator. Hence, the Euclidean likelihood estimator can be interpreted as a continuously-updated GMM estimator and an optimally weighted iterated GMM estimator. Note also that the objective functions (9) and (11) implicitly impose the constraint pt > 0 which validates the interpretation of pi as probability weights. For the Euclidean estimator, we either have to inspect the positivity of the probability weights each time or introduce a nonnegativity constraint explicitly into the minimization problem. Owen 26 argues that in small samples, the negativity of the estimated weights may be advantageous for confidence interval construction. 3
Inference in Moment Condition Models
Consider the estimators discussed in Section 2. Let 6Q denote the true value of the parameter vector and suppose that the following regularity conditions are satisfied. Assumption Al. Assume that Wn A- W, where W is a nonstochastic symmetric positive definite matrix; g(xi,9) is continuous in 9; E [sup f l e e |<7(£;,#)|] < oo; supj E [g(xi,0)g(xi, 9)'] < oo for all 9 and 0 is a compact subset of Rk. Assumption A2. There is a unique 9Q such that E[g(xi,90)] E[g(xi,0)] ^ 0 for all 9^90eQ.
= 0 and
Assumption A3. Assume that M = E (d9{Z'/o))
da(
continuous in 9 and E 0o,
su
P0eAT(0o)
is of full rank k;
ee1' \\ < ° ° for
S0me
^;e)
is
neighborhood of
N(90).
Theorem 1. Under Assumptions
A1-A2,
9p -^ #o as n —• oo
v^(£p-0o)4iv(o,n), 1
where Cl = (M'V^M)Proof. See Imbens
1S
and V = E (g(Xi, 90)g(Xi, 90)'). , Qin and Lawless 2 7 , and Newey and Smith
22
.
Theorem 2. Let 9 = (a,/3)', where a € 0 i and j3 € 02 are p x 1 and (k — p) x 1 vectors, respectively. Then, under Assumptions A1-A3,
139 (i) test for overidentifying
restrictions
nDp{9p) 4 x\m-k) (ii) test ofH0:6
as n ->• oo
= 60
n [DP{00) - Dp(9p) ->• x\k) as n -> oo (iii) test for a subset a with Ho : a = a 0 n \Dp(ao,0~p) - Dp(ap,Pp)]
4 x\p) as n ^ oo
where DP{9Q) for p = —2, —1 and 0 is the criterion function defined in (12), (11) and (9), 9p = (ap,j3p) is the unrestricted nonparametric likelihood estimator that minimizes Dp(6) and /3P is the minimizer of Dp(a,j3) subject to a = aoProof. See Imbens
15
, Qin and Lawless
27
, and Newey and Smith
22
.
The results in Theorems 1 and 2 show that we can conduct asymptotically valid inference such as testing for overidentifying restrictions and constructing confidence intervals by inverting the \ 2 acceptance region of the criterion test. The IOOT/% confidence set for the parameter of interest 8 is then given by the set of values of 9 satisfying Cn{x) = {9eQ: h
Dp{9) < qn},
where qn is the lQQrf quantile of the distribution of Dp(9). Equivalently, Cr,(x) = {9 G © : x e A(9)}, where A{9) is the acceptance region of the test Dp{9). The endpoints of the confidence set are the infimum and the supremum over Cv{x), respectively. In particular, the two-sided, equal-tailed confidence interval with nominal coverage n is given by Cn(x) — [9L,9U], where the confidence limits are defined to satisfy 6L — inf{0 6 0 : Vr(Dp(9) < q^Ho) > n} and 9V = sup{9 G 6 : Pr(Dp(9) < qv\H0) > T)}For the weak instrument case, Stock and Wright 29 parameterized the moment condition as a function of the sample size and developed an alternative limiting theory which yields a better approximation to the finite-sample distributions of the estimator and corresponding test statistics. In particular, Stock and Wright 29 replace Assumption 2 with the assumption that E[g(xi,9)] — n _ 1 / 2 m(0) uniformly in 8 G 0 , where m(9) is continuous in 9 and bounded on 0 with m(0o) = 0. Under this assumption, the GMM and nonparametric likelihood estimators are no longer consistent (9P—9Q — Op(l))
140
although the x2 asymptotic approximation for the distributions of the test for overidentifying restrictions and a test of a subvector of weakly identifiable parameters is still valid. 4
Monte Carlo Study
The poor small sample performance of the two-step GMM estimator in linear homoscedastic instrumental variables models with weakly identified parameters has been well documented in Nelson and Startz 21 , Maddala and Jeong 19 , Bound, Jaeger and Baker 4 and Staiger and Stock 28 among others. In this section, we assess the robustness of the nonparametric likelihood estimators in the presence of weak instruments. The structure of the Monte Carlo experiment is similar to the one considered by Angrist, Imbens and Krueger 3 . It is designed to study the finite sample bias of the different estimators and the size properties of hypothesis tests and the test for overidentifying restrictions with a large number of irrelevant instruments. The data are generated from the model Vi = #0 + 6iXi + e»,
(13) z
u
Xi - 70 + L j = l 7j ij + i> where z ~ N(0,I),
(fy
= chol{T,)^ fc ~ iid(0,I),
S = f ^ j j ^fj
,
c/io/(£) denotes Cholesky decomposition of S, 6Q = 0, #i = 1, 70 = 0, 71 = 0.15 and 7/ = 0 for I = 2, ...,m - 1. The optimal two-step GMM estimator in this setting is asymptotically equivalent to the two-stage least squares (2SLS) estimator with Wn = (Z'Z)'1
e2steP = {x'pzxyl
(x'pzy),
where X = (l,x), Z = (l,z) and Pz = Z{Z'Z)~lZ. We also consider the limited information maximum likelihood (LIML) estimator which is given by OLIML
= (X'(I - kMz)Xyl
(X'(I -
kMz)y),
where Mz = I — Pz, k is the smallest characteristic root of (Y Y)(Y MzY)~l and Y = (y X). The confidence intervals for the OLS, 2SLS and LIML estimators are constructed by inverting the Wald test using the x 2 critical values. The confidence intervals for the empirical likelihood (EL) and the Euclidean likelihood
141
(Euclid) estimators are obtained by inverting the \ 2 acceptance regions of the corresponding criterion-based tests discussed above. The results for the exponential tilting (KLIC) estimator are very similar to the results for the EL estimator and are not reported. The number of Monte Carlo replications is 10,000. Table 1. Monte Carlo results for model (13) with T=500 and Gaussian errors. estimator
quantiles around ((?i — 0i) 0.25 0.50 0.75 m — k = l , 7 i = -15 72 = 0 0.6991 0.7154 OLS 0.7336 0.7528 -0.1826 -0.0693 2SLS 0.0348 0.1224 LIML -0.2297 -0.1028 0.0165 0.1184 EL -0.2362 -0.1120 0.0005 0.0940 Euclid -0.2354 -0.1124 0.0011 0.0939 m — k —5,7i = .15 72 = - = 76 = 0 OLS 0.6987 0.7157 0.7339 0.7526 2SLS -0.1098 -0.0148 0.0830 0.1640 LIML -0.2260 -0.0975 0.0193 0.1190 EL -0.2427 -0.1175 -0.0011 0.0959 -0.2402 -0.1179 -0.0022 0.0951 Euclid m — k —10,7i = .15,72 = .. = 7 n = 0 0.7149 OLS 0.6985 0.7338 0.7520 0.1484 2SLS -0.0145 0.0663 0.2220 LIML -0.2291 -0.1020 0.0186 0.1253 -0.2394 -0.1154 -0.0002 0.1009 EL Euclid -0.2408 -0.1161 -0.0010 0.1002 0.10
0.90
test for OIR
coverage rate of CI
0.7695 0.1919 0.2045 0.1655 0.1656
0.1147 0.1219 0.1055 0.1019
0.0000 0.8809 0.8592 0.8909 0.8927
0.7692 0.2297 0.2037 0.1732 0.1720
0.1443 0.1178 0.1118 0.1039
0.0000 0.8080 0.8617 0.8791 0.8830
0.7690 0.2822 0.2104 0.1769 0.1770
0.1837 0.1132 0.1256 0.1059
0.0000 0.6287 0.8470 0.8745 0.8854
First, we assess the effect of increasing the number of redundant moment restrictions on the magnitude of the bias of the estimators and the size properties of the corresponding criterion tests. Tables 1 reports the 0.10, 0.25, 0.50, 0.75 and 0.90 quantiles of the distribution of (#i — 9\) as well as the empirical size of the test for overidentifying restrictions (OIR) with nominal level 0.1 and the coverage properties of the 90% confidence intervals for 6\ in a model with Gaussian errors. Newey and Smith 22 showed that in model (13) with symmetric errors bias(dnML) = bias(8EL) = bias(8Eu) — —fi/n> where S = Claex/a^. Thus, the LIML, EL and Euclidean estimators are higher-order asymptotically
142
equivalent and their bias does not depend on the number of instruments. By contrast, bias(Q2SteP) = (m — 2)6jn which increases linearly with the number of instruments. To investigate the finite-sample sensitivity of the bias of the estimators with respect to the number of irrelevant instruments, we consider three cases of overidentification: m — 1 = 2 (1 overidentifying restriction), m — 1 = 6 (5 overidentifying restrictions) and m — 1 = 11 (10 overidentifying restrictions). Tables 1 contains the results for a sample size of 500 which is commonly encountered in economic applications. As expected, the OLS estimator is severely upward biased. The 2SLS is slightly biased when m — k = 1, but its bias starts to approach the bias of the OLS estimator as m increases. Also, the size properties of the test based on the 2SLS deteriorate significantly as the number of instruments gets large. The magnitude of the bias of the LIML estimator is small and insensitive to the degree of overidentification of the model which is consistent with the theoretical results. The bias of the nonparametric likelihood methods is negligible and it is practically unchanged as the number of the overidentifying restrictions increases. The confidence intervals based on the nonparametric likelihood estimators slightly undercover with the coverage rate of the Euclidean likelihood being closest to the nominal level. Similar results were obtained for sample size T = 150 but these results are not reported due to space limitations. It is interesting to note that the dominance of the Euclidean over the EL estimator in terms of coverage rates is more pronounced for the smaller sample size. This requires further theoretical investigation of the higher-order properties of these tests using Edgeworth expansions. The higher-order asymptotic equivalence of the LIML, EL and Euclidean estimators derived by Newey and Smith 22 is valid only for models with symmetric errors. Newey and Smith 23 show that all members of the class of nonparametric likelihood estimators except EL have an additional bias term coming from the estimation of the variance-covariance matrix fi. To investigate the sensitivity of the results to fat tailed and asymmetric distributions, we also report results from weakly identified models with i-distributed errors with 4 degrees of freedom and x2 -distributed errors with 1 degree of freedom. In addition, we vary the correlation of the endogenous explanatory variable with the instruments. The simulation results in Table 2 show that the nonparametric likelihood estimators provide reliable inference in the presence of weak instruments regardless of the distribution of the errors. Similar to the results in Table 1, the Euclidean estimator dominates in terms of coverage rate with empiri-
143
cal levels within 3 percentage points from the nominal level. Although the bias of the empirical and Euclidean likelihood estimators could be significant for asymmetric errors (for instance, the bottom left corner of Table 2), it is still considerably smaller than the 2SLS and LIML estimators. In summary, the nonparametric likelihood estimators appear to possess good finite-sample properties in overidentified models with weak instruments. Table 2. Monte Carlo results for model (13) with T = 500 and m - k = 1. ^f\
estimator
=
.05,^ 2
=
-02
7i = .05,72 = .05
7! = .05,72 = -1
median
test
cov.rate
median
test
cov.rate
median
test
cov.rate
bias
OIR
of C I
bias
OIR
of C I
bias
OIR
of CI
6~w(o,J) OLS 2SLS LIML EL Euclid
.7909 .0000 .1300 .1352 .8182 .0458 .0995 .8552 .0149 .0859 .8946 .0140 .0852 .8946
.7844 .0877 .0232 .0085 .0077
.0000 .1266 .8405 .1022 .8645 .0967 .8893 .0963 .8903
.7621 .0285 .0286 -.0040 -.0037
.0000 .1075 .8854 .1319 .8462 .0957 .8891 .0950 .8903
.7900 .0000 .1332 .1390 .8190 .0443 .1006 .8502 .0105 .0930 .8842 .0083 .0890 .8943
.7841 .0799 .0184 .0018 .0009
.1261 .1108 .1010 .0957
.0000 .8479 .8715 .8836 .8924
.7602 .0333 .0337 .0031 .0026
.0000 .1124 .8870 .1396 .8344 .1064 .8825 .1016 .8907
.7922 .1595 .1008 .0318 .0287
.1444 .1693 .0955 .0882
.0000 .8043 .7799 .8644 .8776
.7795 .0618 .0333 .0032 .0027
.1168 .1463 .1019 .0940
6 ~ *(4) OLS 2SLS LIML EL Euclid
&~Xa(l) OLS 2SLS LIML EL Euclid
.7957 .2661 .2136 .1126 .1027
.1527 .1720 .0893 .0825
.0000 .7453 .7314 .8545 .8702
.0000 .8559 .8495 .8757 .8908
The finite-sample properties of the constructed confidence intervals for the nonparametric likelihood methods can be further improved by bootstrap methods. For the efficient bootstrap suggested by Brown and Newey 5 and Hall and Presnell 12 , the data can be resampled using the implied probability weights pi from the estimation problem rather than the empirical measure (pi = n~1 for all i) as in the conventional bootstrap. Also, the asymptotic validity of the conventional bootstrap requires explicit recentering of the moment conditions (Hall and Horowitz n ) whereas for the efficient bootstrap the moment conditions, evaluated at the true parameter vector, are centered at zero by construction.
144
5
Empirical Illustration: Return to Education
Estimating the return to education is of central interest to labour economists. It shows the predicted percentage increase in wage for an additional year of education. Since education is believed to be endogenous, Angrist and Krueger 2 suggested the quarter of birth as an instrument for education. However, Bound, Jaeger and Baker 4 challenged the results obtained by Angrist and Krueger 2 arguing that the two-step GMM could be severely biased in the presence of a large number of weak instruments. Here we use the Angrist-Krueger data set which consists of a random sample from 1980 census of 329,500 men who were born between 1930 and 1939. See Angrist and Krueger 2 for a detailed description of the data and model specification. Following Angrist and Krueger 2 , 30 instruments are constructed by interacting quarter and year of birth. Then we draw random subsamples of 500 observations from the original sample without replacement. The results in Table 3 are obtained from 5,000 repetitions and report the median of the parameter estimates and their corresponding standard errors. Table 3. Estimation results for the return of education. EL Euclid OLS 2SLS parameter estimate 0.0708 0.0714 0.0723 0.0653 standard error 0.0087 0.0437 0.0404 0.0373 The estimated return to schooling for all methods is in the range of 6.5% and 7.3%. This is a bit surprising since the weak instruments employed in the estimation are expected to produce a large upward bias in the OLS and 2SLS estimates. This does not seem to be the case and all the estimates do not appear significantly different from one another. It is also interesting to note the smaller standard errors for the nonparametric likelihood estimators compared to the two-step GMM estimator. 6
Concluding Remarks
This paper shows the robustness of nonparametric likelihood estimators of moment condition models to the presence of weak instruments and nonnormal errors. The computational procedure does not involve any explicit bias correction or estimation of variance-covariance matrices. The confidence intervals are obtained directly from the criterion function by inverting its asymptotic acceptance region.
145
One interesting finding that emerges from the study is the existence of noticeable differences in the coverage properties within the class of generalized empirical likelihood estimators. Since the criterion-based test statistics are asymptotically equivalent, higher-order expansions are necessary to appraise the statistical significance of these results. Acknowledgments I would like to thank Gordon Fisher (the Associate Editor) and an anonymous referee for helpful comments and suggestions. Financial support from the FRDP of Concordia University is gratefully acknowledged. References 1. J.G. Altonji and L.M. Segal, Journal of Business and Economic Statistics 14, 353 (1996). 2. J.D. Angrist and A.B. Krueger, Quarterly Journal of Economics 106, 979 (1991). 3. J.D. Angrist, G.W. Imbens and A.B. Krueger, Journal of Applied Econometrics 14, 57 (1999). 4. J. Bound, D. Jaeger and R. Baker, Journal of American Statistical Association 90, 443 (1995). 5. B.W. Brown and W.K. Newey, GMM, Efficient bootstrapping and improved m/erence(Mimeo, Department of Economics, Rice University, 2001). 6. N. Cressie and T. Read, Journal of Royal Statistical Society B 46, 440 (1984). 7. T. DiCiccio and J. Romano, International Statistical Review 58, 59 (1990). 8. S.G. Donald and W.K. Newey, Economics Letters 67, 239 (2000). 9. J.-M. Dufour, Econometrica 65, 1365 (1997). 10. B. Efron, Canadian Journal of Statistics 9, 139 (1981). 11. P. Hall and J.L. Horowitz, Econometrica 64, 891 (1996). 12. P. Hall and B. Presnell, Journal of the Royal Statistical Society B 61, 143 (1999). 13. L.P. Hansen, Econometrica 50, 1029 (1982). 14. L.P. Hansen, J. Heaton and A. Yaron, Journal of Business and Economic Statistics 14, 262 (1996). 15. G.W. Imbens, Review of Economic Studies 64, 359 (1997). 16. G.W. Imbens, R.H. Spady and P. Johnson, Econometrica 66, 333 (1998).
146
17. 18. 19. 20. 21. 22. 23.
24. 25. 26. 27. 28. 29.
Y. Kitamura and M. Stutzer, Econometrica 65, 861 (1997). N.R. Kocherlakota, Journal of Monetary Economics 26, 285 (1990). G.S. Maddala and J. Jeong, Econometrica 60, 181 (1992). C.F. Manski, Analog Estimation Methods in Econometrics (Chapman and Hall, New York, 1988). C.R. Nelson and R. Startz, Econometrica 58, 967 (1990). W.K. Newey and R.J. Smith, Asymptotic bias and equivalence of GMM and GEL estimators (Mimeo, Department of Economics, MIT, 2000). W.K. Newey and R.J. Smith, Higher order properties of GMM and generalized empirical likelihood estimators (Mimeo, Department of Economics, MIT, 2001). A. Owen, Biometrika 75, 237 (1988). A. Owen, Annals of Statistics 18, 90 (1990). A. Owen, Annals of Statistics 19, 1725 (1991). J. Qin and J. Lawless, Annals of Statistics 22, 300 (1994). D. Staiger and J.H. Stock, Econometrica 65, 557 (1997). J.H. Stock and J. Wright, Econometrica 68, 1055 (2000).
P R O B A B I L I T Y OF C O R R E C T SELECTION OF G A M M A V E R S U S G E OR WEIBULL V E R S U S GE B A S E D O N LIKELIHOOD RATIO STATISTIC RAMESHWAR D. GUPTA Department of Applied Statistics and Computer Science The University of New Brunswick, Saint John Canada, E2L 4L5 E-mail: [email protected]
DEBASIS KUNDU AND ANUBHAV MANGLICK Department of Mathematics Indian Institute of Technology Kanpur Pin 208016 India. E-mail: [email protected] This paper proposes the use of likelihood ratio statistic in choosing between gamma and GE models or between Weibull and GE models. Probability of correct selections are obtained using Monte Carlo simulations for various sample sizes and for various model parameters. Simulation results indicate that it is easier to distinguish between GE and Weibull models than between GE and gamma models. Two real life data sets are analyzed. Interestingly, in both cases although in the literature gamma or Weibull model was used but based on the maximum likelihood values we select GE as the 'best' fitted model among these three distributions.
1
Introduction
Recently a new two-parameter distribution named as Generalized Exponential (GE) distribution or Exponentiated Exponential distribution has been introduced and studied quite extensively by two of the authors Gupta and Kundu 6,7,8,9,10 rpj^ Qg f a m iiy h ^ the following distribution function FGE(x;a,\)
= (l-e-Xx)a;
a,\>0,
density function fGE(x;a,\)=a\{l-e-Xx)a-1e-Xx;
a,X>0,
survival function SGB(x;a,\)
= l-(l-e-Xx)a;
a,X>0,
and the hazard function . , ., aA(l - e-x*)"-1e-Xx hGE(x;a,\) = 1_(1_e_Ax)a
147
^
. n a,A>0.
(1)
148
Here a and A are the shape and scale parameters respectively. It is observed in Gupta and Kundu 6 ' 7 that the two-parameter GE distribution can be used quite effectively in analyzing many lifetime data particularly in place of twoparameter gamma or two-parameter Weibull distribution. The two-parameter GE distribution can have increasing, decreasing and constant failure rates depending on the shape parameter (Gupta and Kundu 6 ) . It is also observed by Gupta and Kundu 9 that the behavior of the two-parameter GE distribution is very much similar to a gamma distribution in many respects and in many cases the two-parameter GE model provides a better fit than the two-parameter gamma model or the two-parameter Weibull model in terms of maximum likelihood or minimum chi-square. Therefore, it is quite important to choose between a Weibull and GE models or between a gamma and GE models to analyze real life data sets. It is found to be very difficult to discriminate between these three models because all these three models are quite flexible and they overlap with each other in the sense that exponential distribution is a special case to all of them. Although, these three models may provide similar data fit for moderate sample sizes but it is still desirable to select the correct or more nearly correct model, since the inferences based on the model will often involve tail probabilities where the affect of the model assumption will be more critical. Therefore, even if large sample sizes are not available, it is still important to make best possible decision based on whatever data are available. In this paper we propose to use the logarithm of the ratio of the maximum likelihood functions in choosing two overlapping distributions. The idea was originally proposed by Cox 3 | 4 in discriminating between two models and Bain and Englehardt 1 used it in choosing between gamma and Weibull models. It is observed that the probability of correct selection (PCS) depends only on the shape parameter of the distribution from which the data are coming. Since it is not possible to obtain the exact distributions of the likelihood ratio statistics, we obtain PCSs by using extensive Monte Carlo simulations for various sample sizes and for various model parameters. We use two real data sets to illustrate how the proposed methods can be used in practice. Rest of the paper is organized as follows. In section 2, we provide the selection method based on likelihood ratio statistic. The PCSs are presented in section 3. We analyze two real data sets in section 4 and finally we conclude the paper in section 5.
149
2
Likelihood Ratio Statistic
Suppose X\,..., Xn are independent and identically distributed random variables from any one of the three distributions. The density function of a GE random variable with the shape parameter a and the scale parameter A will be denoted by (1). For both gamma and Weibull distributions we denote the shape parameter by /? and the scale parameter by 9. The density function of a gamma random variable with the shape parameter (3 and the scale parameter 9 will be denoted by JGA{X) as /GA(X)
= j ^ ^ e - ' * ;
x,/3,9>
0,
(2)
and similarly the density function of a Weibull random variable with the shape parameter parameter (3 and the scale parameter 9 will be denoted by fWE(x)
=
P0
Let us define n
n
LGE(a, A) = J J fGE{xi),
LGA(/3,9) = J J fGA(xi)
i=l
i=l
and n
=
LWE(P,0)
Y[fwE(xi).
Now first consider choosing between gamma and GE models. The natural logarithm of the likelihood ratio statistic T\ = ln(Li), where _
LGE(a,X)
1_
LGA0,9)
and 7\ = n \n(a\X)
- p ln(X0) - ^ . ^ + ln(r(/?)) - X(X - 9) a
(3)
Here X = ^ Y^Ji=\ -^»> -^ = (I"Ii=i -^i)" • Moreover, & and A are maximum likelihood estimators (MLEs) of a and A if the data are assumed to come from a GE distribution and in this case (Gupta and Kundu 7 ) they have the following relation a—
ElLiln(l-e-^)
150
Similarly /? and 9 are MLEs of /3 and 9 if the data are assumed to come from a gamma distribution and in this case they satisfy the following equation
'4The proposed procedure is as follows. Choose a GE model if Ti > 0, otherwise choose a gamma model. It is clear from (3) that if the data are coming from a GE population, the distribution of T\ is free of A and depends only on a and similarly if the data are coming from a gamma population, the distribution of T\ is independent of 6 and it depends only on /?. The above assertions can be proved very easily. First let us assume that the data are coming from GE(a, A). It implies that the distribution of XXi is independent of A. Moreover it follows from the Theorem 7.8.5 of Bain and Englehardt 2 that the distributions of j and j are independent of A. Since the distributions of a and $ are independent of A, therefore the result follows immediately from the expression of Ti. Similar argument can be given when the data are coming from a gamma distribution then the distribution of T\ is independent of 9 and depends only on /?. It implies that in the first case PCS is free from A and depends only on a, whereas in the second case PCS is free from 9 and depends only on 0. Now we consider the likelihood ratio statistic for discriminating between GE and Weibull models. Similarly as before, the natural logarithm of the likelihood ratio T2 = ln(L 2 ), where _
LGE{OIA)
~
LWE0,§)
2
and To. = n ln(^\-^l-XX-pin(9X)
+ l
(4)
Here X and X are same as defined before and a and A are MLEs of a and A for GE distribution and similarly /? and 9 are MLEs of /? and 9 for Weibull distribution. In the case of Weibull distribution, 9 and J3 satisfy the following relation 9=(
n
En
yf.
We use the similar discrimination procedure as before, i.e. if T2 > 0, choose GE distribution, otherwise choose Weibull distribution. In this case also it
151 Table 1. Probability of correct selection between GE and gamma distributions and when the data are generated from a GE distribution
n | a —>
20 40 60 80 100 200
0.50 0.6610 0.6342 0.6282 0.6134 0.6134 0.6033
0.75 0.5798 0.5870 0.5847 0.5822 0.5813 0.5760
1.0
2.0
4.0
8.0
0.4942 0.5004 0.5039 0.5047 0.5059 0.4945
0.4633 0.4812 0.5100 0.5194 0.5321 0.5699
0.5081 0.5563 0.5841 0.6059 0.6242 0.6947
0.5503 0.6155 0.6472 0.6740 0.7045 0.7860
16.0 0.6012 0.6566 0.7031 0.7421 0.7641 0.8601
can be shown as before that if the data come from a GE population then the distribution of Ti depends on a and independent of A and if the data come from a Weibull population, then the distribution of T
3
Probability of Correct Selections
In this section we use Monte Carlo simulations to compute the PCSs for different shape parameters and for different sample sizes. We consider different shape parameters namely 0.50, 0.75, 1.0, 2.0, 4.0, 8.0, 16.0 and also different sample sizes namely 20, 40, 60, 80, 100, 200. All the probabilities are calculated based on 10,000 replications. The exact details are provided below: Case 1: Between GE and Gamma First we generate a sample of size n from a GE(a, 1). From the given sample we obtain MLEs of a and A (GE parameters), similarly we obtain MLEs of /? and 9 (gamma parameters). We compute T\ and observe whether it is negative or positive. We replicate the process 10,000 times and compute the percentage of times T\ is positive and that provides the probability of the correct selection. The same way we estimate the PCS between gamma and GE models when the data are coming from a Gamma(/3, 1). We generate the sample from a Gamma(/3, 1) and obtain the MLEs of a, A, /3 and 9. In this case we compute the percentage of times Ti is negative out of 10,000 replications. Results are reported in Tables 1 and 2.
152 Table 2. Probability of correct selection between GE and gamma distributions when the data are generated from a gamma distribution
n i p -> 20 40 60 80 100 200
0.50 0.3655 0.4086 0.4370 0.4555 0.4519 0.5039
0.75 0.4246 0.4317 0.4398 0.4498 0.4597 0.4855
1.0
2.0
4.0
8.0
0.4977 0.5104 0.4999 0.4964 0.4998 0.4991
0.5829 0.5948 0.5942 0.6051 0.6081 0.6319
0.6104 0.6281 0.6607 0.6745 0.6943 0.7534
0.6347 0.6900 0.7197 0.7544 0.7775 0.8598
16.0 0.6653 0.7365 0.7849 0.8276 0.8542 0.9317
Table 3. Probability of correct selection between GE and Weibull distributions when the data are Generated from a GE distribution.
n 4- a —> 0.50 20 0.6610 40 0.7258 60 0.7663 80 0.7856 100 0.8128 200 0.8849
0.75 0.5643 0.5982 0.6157 0.6309 0.6430 0.7012
1.0
2.0
4.0
8.0
0.5025 0.4976 0.4849 0.5042 0.5097 0.4933
0.5731 0.6316 0.6829 0.7139 0.7382 0.8295
0.6675 0.7520 0.8154 0.8582 0.8825 0.9614
0.7283 0.8357 0.8929 0.9224 0.9496 0.9911
16.0 0.7699 0.8811 0.9301 0.9582 0.9752 0.9981
Case 2: Between GE and Weibull In this case first we compute the PCS between GE and Weibull distributions and when the data are coming from a GE distribution. We generate a sample of size n from a GE(a, 1) and compute MLEs of a, A (GE parameters) and /3, 6 (Weibull parameters). Replicate the process 10,000 times and observe the percentage of times T^ is positive. Exactly the same way, we compute the PCSs between GE and Weibull distributions and when the data are coming from a Weibull distribution. The results are reported in Tables 3 and 4 a. Some of the points are quite clear from the Tables 1,2,3 and 4. First of all in all four cases when a = 1.0, the PCS is close to .5 as it should be for all sample sizes. Because when a = 1.0, all three distributions become exponential distribution therefore, between any two distributions the probability of choosing any particular one is 0.5. It is also observed that in all cases (except for a = 0.50 or 0.75 in Table 1) as sample size n increases the PCS increases "When /3 = 4, the Weibull distribution becomes almost symmetric. In this case it is too time consuming to compute MLEs of the GE parameters. In Table 4, for /3 = 4, the results are based on 1000 replications.
153 Table 4. Probability of correct selection between GE and Weibull distributions when the data are generated from a Weibull distribution.
n I 0 -> 20 40 60 80 100 200
0.50 0.6655 0.7159 0.7768 0.8233 0.8653 0.9492
0.75 0.5155 0.5844 0.6169 0.6457 0.6700 0.7536
1.0
2.0
4.0
8.0
0.5037 0.5045 0.5046 0.5069 0.5047 0.5085
0.7061 0.7823 0.8395 0.8757 0.9006 0.9660
0.8142 0.9185 0.9573 0.9768 0.9874 0.9992
0.8738 0.9624 0.9868 0.9947 0.9979 1.0000
16.0 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
for all values of a ^ 1. Moreover as |Q — 1| increases the PCS increases from 0.5 in most of the cases. Surprisingly in Table 1, when the sample size 20 or 40 and the a = 2.0, the PCSs are less than .5. It indicates that when the data come from GE and the shape parameter is greater than one but not very far from one then the likelihood ratio statistic can not distinguish properly whether the data are coming from a GE distribution or from a gamma distribution if the sample size is not very large. Similar pattern is also observed in Table 2, when the shape parameter is 0.50 or 0.75. It indicates that for certain ranges of the shape parameter it is really very difficult to distinguish between gamma and GE distributions. In Table 1, when a = 0.50 or 0.75, the PCSs decrease as n increases. In Table 2, when a = 0.50 and 0.75, the PCS are less than 0.5. Both these findings are quite counter intuitive and we can not justify them. We feel it might be due to the estimation procedures. It is known that when the shape parameters is less than one then the parameter estimations are quite difficult for both GE and gamma distributions. The likelihood functions are very flat and therefore the maximum likelihood estimators are quite unstable. The results might be the reflection of that. One of the referees has mentioned that it might be due to discretization of the cumulative distribution function. Now comparing the results of Tables 1 and 2, it clear that the PCSs of Table 2 are more than the corresponding PCSs of Table 1 when the shape parameter is more than one. It clearly indicates that for the shape parameter greater than one, the likelihood ratio statistic can distinguish better between gamma and GE distributions if the data are coming from a gamma distribution than vice verse. It can be justified as follows. As the shape parameter increases the gamma density becomes symmetric but the GE density remains skewed as the shape parameter goes to infinity. Therefore, if the data are coming from a gamma distribution and the shape parameter is large then
154
naturally PCS becomes higher. Comparing the results between Tables 3 and 4, it is also observed that PCSs of Table 4 are more than the corresponding PCSs of Table 3. Therefore, between GE and Weibull distributions also the likelihood ratio statistic can distinguish better if the data are coming from a Weibull distribution than the other way. The same reason as above holds here also. Finally comparing the results between Tables 1, 2 and Tables 3, 4, it is clear that using the likelihood ratio statistic it is much easier to distinguish between Weibull and GE models than between gamma and GE models. It might be due to the fact that the Weibull density goes to symmetry much faster than the gamma density as the shape parameter increases. We believe that it should be true for any other statistics also and more work is needed in this direction. 4
D a t a Analysis
In this section we apply the above procedure in two data sets. Both these data sets have been used by several authors in the literature. D a t a Set 1: Lieblein and Zelen n provided the following sample of size n = 23 to illustrate the use of Weibull model. These data indicate the endurance in millions of revolutions of deep-groove ball bearings. Thoman, Bain and Antle 12 mentioned that Weibull distribution should be used where as Bain and Englehardt * proposed to use gamma distribution. The data are as follows: 17.88, 28.92, 33.00, 41.52, 42.12, 45.60, 48.80, 51.84, 51.96, 54.12, 55.56, 67.80, 68.64, 68.64, 68.88^84.12, 93.12, 98.64, 105.12, 105.84, 127.92, 128.04, 173.40. For this data X = 72.12 and X = 63.46. The MLEs of the GE parameters are a = 5.2825, A = 0.0323, gamma parameters are ft = 4.0309, 9 = 0.0558 and Weibull parameters are /3 = 2.1033, § - 0.0122. Also ln{LGE(a,X) = -112.9762, ln{LGA(/3,§) = -113.0272 and ln(LWE0J) = -113.6886. Therefore, Tx = 113.0272-112.9762 = .0510 and T2 = 113.6886-112.9762 = 0.7124. Since both T\ and T? are positive, therefore GE is preferable than gamma or Weibull distributions based on the likelihood ratio statistic. Moreover, from the tables it can be said that the PCS between gamma and GE is between 51% to 55% and the PCS between Weibull and GE is between 67%-73%. D a t a Set 2: The following data represents the relief times of 20 patients receiving an analgesic. The data is obtained from Gross and Clark 5 (page; 105), where they fitted gamma distribution. The data are as follows: 1.1, 1.4, 1.3, 1.7, 1.9, 1.8, 1.6, 2.2, 1.7, 2.7, 4.1, 1.8, 1.5, 1.2, 1.4, 3.0, 1.7, 2.3, 1.6, 2.0. For this data set X = 1.90 and X = 1.80. The MLEs of the GE parameters are as follows: a = 36.6437, A = 2.2348 and ln(LGE(a,X)) = -16.2605.
155
The MLEs of the gamma parameters are (1 = 9.6630 and § = 5.0864 and \n(LGA0,6)) - -17.8186. Similarly we obtain the MLEs of the Weibull parameters as 0 = 2.7875 and 6 = 0.4695 and \n{LWE{P,6)) = -20.5864. Therefore, Ti = -16.2605 + 17.8186= 1.5581 and T2 = -16.2605 + 20.5864 = 4.3259. Since T\ and T-i both are positive therefore in both cases we prefer GE models. The PCS between gamma and GE is more than 60% and the PCS between Weibull and GE is more than 77%. 5
Conclusions
In this paper we consider the likelihood ratio statistics to choose between GE and gamma model or between GE and Weibull models. It is very easy to use in practice. We could not obtain the exact distributions of the likelihood ratio statistics so we use Monte Carlo simulations to compute the probability of correct selections. The likelihood ratio statistics depend only on the shape parameters. It is observed that as the shape parameter deviates from one it becomes easier usually to distinguish between two distributions. It is also observed that it is much easier to distinguish between GE and Weibull distributions than between GE and gamma distributions. Two data sets are analyzed. Interestingly, in both cases although in the literature gamma or Weibull model was used but we select GE as the 'best' fitted model based on the maximum likelihood values. Another interesting question is to determine the minimum sample size n to discriminate between two distribution functions (either GE and gamma or GE and Weibull) for a given PCS. Work is in progress in that direction and it will be published elsewhere. Acknowledgments Part of the work of the first author was supported by a grant from the Natural Sciences and Engineering Research Council. The authors would like to thank two anonymous referees for some constructive suggestions. References 1. L.J. Bain and M. Englehardt, Communications in Statistics Theory and Methods 9, 375 (1980). 2. L.J. Bain and M. Englehardt, Statistical Analysis of Reliability and Lifetesting Models; Theory and Methods (Marcel and Dekker Inc., New York, 1991)
156
3. D.R. Cox, Proceedings of the Fourth Berkeley Symposium in Mathematical Statistics and Probability 5, 105 (1961). 4. D.R. Cox, Journal of the Royal Statistical Society Ser. B 24, 406 (1962). 5. A.J. Gross and V.A. Clark Survival Distributions; Reliability Applications in the Biomedical Sciences (John Wiley and Sons, New York, 1975). 6. R.D. Gupta and D. Kundu, Australian and New Zealand Journal of Statistics 4 1 , 173 (1999). 7. R.D. Gupta and D. Kundu, Biometrical Journal 43, 117 (2001). 8. R.D. Gupta and D. Kundu, Jour. Stat. Comp. Simul 69, 315 (2001). 9. R.D. Gupta and D. Kundu, Closeness of gamma and generalized exponential distributions (Technical Report, Department of Mathematics, Statistics and Computer Science, The University of New Brunswick, New Brunswick, Canada, 2001). 10. R.D. Gupta and D. Kundu, Jour. Appl. Stat. Sc. (Accepted, 2002). 11. J. Lieblein and M. Zelen, Jour. Res. Nat. Bur. Stand. 47, 273 (1956). 12. D.R. Thoman, L.J. Bain and C.J. Antle, Technometrics 11, 445 (1969).
SURVEY M E T H O D O L O G Y FOR S T U D Y I N G S U B S T A N C E U S E P R E V E N T I O N P R O G R A M S IN SCHOOLS S H E L T O N M. J O N E S , B R I A N C. S U T T O N A N D K E R R I E E . B O Y L E
E-mail:
RTI International, P. O. Box 12194, Research Triangle Park, NC 27709-2194, USA [email protected]; [email protected]; [email protected]
According to the 1998 National Household Survey On Drug Abuse, almost 10 percent of youths in the United States between ages 12-17 used illicit drugs SAMSHA (Substance Abuse and Mental Health Services Administration n ) . In addition, 18 percent of 12-17 year olds were current cigarette users, over eight percent were current users of marijuana, and almost one percent of youths were reported to be current users of cocaine. These percentages demonstrate a need for continuous substance use prevention programs. Although such programs have existed in U. S. school systems for some time, their extent and prevalence have been unclear (Ringwalt et al. 1 0 ; Kann et al. 7 ) . The School-Based Substance Use Prevention Programs Study (SSUPPS) is designed to study substance use prevention activities currently available in middle schools. In addition to a school sample component, SSUPPS includes a public school district component for those districts associated with the public school sample. This paper presents survey design methodology for obtaining statistically valid national estimates of substance use prevention programs among the conventional public and private middle schools in the United States. Also, presented is information on sample selection and sample weighting procedures.
1
Introduction
Most researchers tend to agree that the best way to prevent substance use among adolescents is to reduce demand (Dusenbury and Falco 3 ; Eigen and Rowden 4 ). Substance abuse prevention programs for many adolescents are sponsored and maintained by schools with Federal assistance from the United States Safe and Drug-Free Schools and Communities Act (SDFSCA). These programs vary widely, and include both classroom curricula and nonclassroom activities. The School-Based Substance Use Prevention Programs Study (SSUPPS) is designed (1) to determine if these prevention programs differ across schools with middle school grades by school type, enrollment size, urbanicity, and students socio-economic status, and (2) to determine if the established programs reflect evidence of effectiveness provided by prevention research. This paper presents the sample design established to provide descriptive information about the prevalence and characteristics of school-based programs. Probability sampling of middle schools in the U.S. is the basis for our stratified systematic sampling design (see Cochran x ) of both public and
157
158
private schools. For the list of acronyms used, please see the appendix. 2
Defining the Target Population
SSUPPS is a survey of conventional schools with middle school grades and public school districts for the 1998/1999 academic year. Middle schools are defined as conventional schools with one of the following grade combinations: (1) only fifth and sixth grades, (2) only a sixth grade, or (3) a seventh or an eighth grade. Specifically, this study provides information on substance use prevention programs available for students approximately 10-13 years of age. Excluded from the target population are non-conventional schools or educational units (such as special education schools, alternative education schools, public middle schools with fewer than 20 students enrolled, State Department of Education units, charter schools, Bureau of Native-American Affairs units, adult education schools, Department of Defense units, vocational technical schools, and middle schools with no eligible student enrollment for the 1998/1999 academic year). A public school district is an educational unit of authority that operates one or more public schools. 3
Sampling Frame Considerations
A sampling frame is defined as the list or mechanism used to identify population elements for the selection of a sample. There are three conventional sources of sampling frames for school data. The Common Core of Data or CCD (U.S. Department of Education 15 ) is the primary public school data source of the National Center for Education Statistics (NCES). Two other sources are owned by Market Data Retrieval or MDR (Dun & Bradstreet Corporation 2 ) and QED (Quality Education Data, Inc. 9 ). The MDR includes all educational levels from preschool through college in the United States and Canada. Both the CCD and the MDR have proven to be reliable sources of information on public school data (Hamann 5 ) . A single data source was preferred to reduce school multiplicity and therefore eliminate the need to merge multiple files to capture the entire target population. The QED includes information on urbanicity and poverty that can be beneficial for stratification. In addition, QED provides educational data useful for all types of schools: public, Catholic, and non-Catholic schools from pre-kindergarten through 12th grade. Therefore, due to the need to survey both public and private schools, as well as the need for stratification, the QED database was chosen as the sampling frame. It should be noted that since the sampling frame for this study was constructed, there have been important developments in school
159
lists. NCES now maintains, in addition to the CCD, a database of non-public schools by means of the Private School Universe Survey (PSS). Both CCD and PSS are accessible on the NCES website (http://nces.ed.gov). Therefore, if this research was conducted in the future, we would recommend a combination of CCD and PSS. Once finalized, the sampling frame contained a total of 37,841 eligible conventional middle schools (because the classifications of ineligibles are not mutually exclusive, the ineligible counts are not additive). This frame comprised 22,891 public middle schools, 6,031 Catholic schools, and 8,919 non-Catholic private middle schools (Table 1). 3.1
Explicit and Implicit
Stratification
The sampling frame was stratified explicitly and implicitly to control the distribution of the sample and to reduce sampling variation among survey statistics (see Kish 8 , pp. 75-112). Explicit stratification was used to provide domain estimation for public schools at different levels of urbanicity, school size, and poverty index. Urbanicity was defined at three levels provided by QED: urban (area within the central city), suburban (area surrounding the central city), and rural (area outside any metropolitan area). Three levels of size were defined based on the estimated total number of eligible students enrolled in middle school grades: small (fewer than 200 eligible students), medium (200-600 eligible students), and large (more than 600 eligible students) . The poverty index for public schools was determined from the Orshansky percentage, which is the percent of students who fall below the federal governments poverty guidelines. The Orshansky percentage is defined at the district level and is a relative indicator of community wealth/poverty when compared to other school districts. Three poverty index levels were formed: low (schools with an Orshansky percentage 10% or less), medium (11-40 %), and high (greater than 40 %). The various combinations of strata for public schools formed a total of 27 levels of explicit stratification (Table 1). Due to the small population of non-public schools and because of secondary interest, non-public schools were only stratified by Catholic and non-Catholic to assure that the non-public school population received some representation. Additionally, state identification was used to implicitly stratify the public and non-public school sampling frames. 4
Sample Selection
A school sample component was designed to survey a probability sample of teachers responsible for teaching substance use prevention programs at the
160
school. In addition, a separate survey was conducted of Safe and Drug-Free School District Coordinators. The district survey was conducted only among those districts that oversaw schools in the sample. The sample design consisted of selecting a stratified systematic sample of middle schools within each explicit design stratum as described by Cochran J (pp. 226). A sample of schools or teachers and district coordinators was selected to participate in a mail survey on substance use prevention programs at the school or district. The sample design is illustrated in Figure 1. 4-1
The Middle School Sample
The initial middle school sample of 3,430 schools accommodated two data collection waves. Wave 1 consisted of a random subsample of 2,852 school selections for possible participation in the study. The remaining schools comprising Wave 2 were not surveyed due to time constraints relative to data collection. Although strict proportional allocation was not used, adequate representation in each stratum made it possible for the public school sample to cover a broad range of different types of middle schools. The non-public school sample of 472 schools was proportionally allocated to Catholic and other private schools for a sample of 189 Catholics and 283 non-Catholics. 4-2
Under-Coverage or
Under-Representation
When the frame was constructed, QED was one of the most reliable data sources for public and non-public schools. Public schools are updated four times per year, and both public and non-public schools are re-verified every two years. However, the list of non-Catholic private schools is not as complete as that of public schools. This is primarily because of the difficulty in securing information on non-Catholic private schools, since many tend to be small and difficult to track from year to year. Because of this limitation, nonCatholic private schools are under-represented beyond the under-sampling of non-public schools mentioned above. 4-3
Inclusion of Public School Districts
This study primarily surveyed teachers who had knowledge of substance use prevention programs within schools with middle school grades. Due to secondary interest in district level inference, the survey design was expanded to accommodate a comprehensive survey of public school districts. Each district was included in the sample based on the number of schools associated with the district and selected from the stratum.
161
5
Sample Weighting and Results
Sample weights for schools were computed as the inverse of the school sample selection probabilities. The district weights were determined as a function of a multiplicity adjustment estimator (Sirken 1 4 ). Some sample schools and district coordinators refused to participate in the study. This type of nonresponse is referred to as unit nonresponse because information is missing for the entire sampling unit. The initial sampling weights of the responding units were adjusted to compensate for the missing data arising from the nonrespondents. A common weighting procedure referred to as sample-based adjustment cell weighting is described by Jones and Chromy 6 . With the school sample, adjustment cell weighting was employed to partition the school respondents into adjustment cells or weighting classes. Within each class or stratum, responding teachers were weighted up in an attempt to compensate for the nonresponding teachers within each stratum. These respondents weights were adjusted by the sum of the stratum weights for the respondents plus the nonrespondents divided by the sum of the respondents weights. The nonresponse adjusted weights are recommended for use in all weighted statistical analyses with software packages such as, SUDAAN (Shah, Barnwell, and Bieler 13 ) and SAS (SAS Institute Inc. 1 2 ) . 5.1
School Weights
For our design, each school within each explicit stratum was selected with the same probability. Let Nh, denote the number of schools on the frame within stratum h, and nh denote the number of schools selected within stratum h (h = 1,2, ...,29) based on frame information, which occasionally is not accurate. It follows that the school sample selection probability {ITM) a n d the initial sampling weight (whi) for school i within stratum h are respectively given by:
162
37,841 Middle Schools on Sampling Frame
—>
—> 3,430 Middle Schools Selected (Waves 1 & 2) 1
2,380 Public Middle Schools Selected
6,031 Catholic Schools
8,991 Other Private Schools
2,858 Public Schools Selected (Waves 1 & 2)
229 Catholic Schools Selected (Waves 1 & 2)
343 Other Private Schools Selected (Waves 1 & 2)
i —>
i
22,891 Public Schools
2,380 Publics (Wave 1)
Metro Type Urban
Subarban
Rural
Total
1 Orshansky Score2 Low Medium High Total Low Medium High Total Low Medium High Total Low Medium High Total
i 189 Catholics (Wave 1)
I 283 Privates (Wave 1)
Eligible Grade Size3 Total Small Medium Large 82 10 33 39 352 134 178 39 222 119 20 83 655 69 250 336 332 152 138 42 442 202 170 70 167 75 57 35 941 147 429 365 139 14 57 68 468 322 125 21 177 41 130 6 784 41 520 223 553 242 120 191 1,261 369 431 461 566 182 199 185 742 2,380 902 736
1 The initial sample included a subsample of 2,852 schools in Wave 1 and 578 schools in Wave 2. Wave 1 was surveyed, but due to time constraints, wave 2 was not. 2 The Orshansky scores are categorized as: Low - less than or equal to 10%, Medium - 11 to 40%, and High - 4 1 % and higher. 3 Eligible grade sizes are estimates of the total number of students enrolled at the school for all Middle School specific grades (5-8). The size distinctions are: Small - fewer than 200 students, Medium - 200 to fewer than 600 students, and Large - 600 and more students.
Figure 1. T h e S S U P P S S a m p l e
163
5.2 District Weights Since schools were sampled within strata based on school level characteristics, it was possible that a district contained multiple schools across different strata. Thus, it was sometimes impossible to match each district with a unique stratification level. Therefore, district weights were created by expanding or replicating the districtlevel file so that the number of observations equaled the number of sampled eligible schools. These weights were post-stratified to the sampling frame totals by three Orshansky levels (less than 10%, 11% to 40%, and greater than 41%). Finally, the weights for responding districts were adjusted for district level nonresponse. Given the definitions in Section 5.1, let Mj be the number of eligible schools on the frame within district j , and rtihj denote the number of times district j was included within stratum h, then the basic sampling weight (w(p)j) for district j is:
Denoting the number of sample districts in Orshansky level o, (o = 1,2,3) by n0 and the number of eligible schools selected in district j by rij, equation (1) provides the sampling weight (tU(„)j-) for weighted estimation as: w^j = w^j/rij. The post-stratified weight (w?*>.) for district j is: w?V = W(„)j(n 0 / VJ0 W(„)j), where S o w (")j = s u m °f initial district weights for point and variance estimation by Orshansky levels (o = 1,2,3). The non-response adjusted weight (u>(v)j) for district j is :
y; we,vs »?-M = <W I ^y Tw7^ 1 • where TJ o w^f. = sum of post-stratified weights for eligible districts, and y j 0 w^f sum of post-stratified weights for responding districts. 5.3
Sampling
P) =
Results
Table 1 shows various unweighted eligibility and response rates from school sample. Rates are provided for 27 different levels of explicit public school strata plus 2 levelest for non-public schools, including three domain variables of interest (urbanicity, Orshansky percent, and schools size). Individual Orshansky-level eligibility and response rates are also given in Tables 3 and 4. Eligibility rates for both the public and Catholic schools (95.5 and 92.1) proved significantly higher than rates from non-Catholic private schools (71.0). Response rates range from a low of 46.7
164 Table 1. Frame Count with Number Selected, Eligibles, and Respondents by Strata for SSUPPS Frame Count Design S t r a t u m (urbanicity, Orshansky , size )
of
Schools
Number of Selected Schools
Eligible Schools
Respond i n g Schoo s
Number
Percent
Number
Percent
448 53
10 33 39 39 134 178 20 83 119 42 152 138 70 202 170 35 75 57 68 57 14 322 125 21 130 41 6
8 33 38 33 127 172 15 78 114 37 148 135 60 192 165 32 70 56 65 56 14 314 123 20 124 39 5
80.0 100.0 97.4 84.6 94.8 96.6 75.0 94.0 95.8 88.1 97.4 97.8 85.7 95.0 97.1 91.4 93.3 98.2 95.6 98.2 100.0 97.5 98.4 95.2 95.4 95.1 83.3
6 17 32 17 89 133 7 60 82 27 102 99 46 148 121 22 51 39 49 41 11 232 96 14 83 29 3
75.0 51.5 84.2 51.5 70.1 77.3 46.7 76.9 71.9 73.0 68.9 73.3 76.7 77.1 73.3 68.8 72.9 69.6 75.4 73.2 78.6 73.9 78.0 70.0 66.9 74.4 60.0
Urban Suburban Rural
4,816 9,720 8,355
655 941 784
618 895 760
94.4 95.1 96.9
443 655 558
71.7 73.2 73.4
Orshansky (low) Orshansky ( m e d i u m ) Orshansky (high)
4,764 12,700 5,427
553
534
1261
1206
566
533
96.6 95.6 94.2
384 896 376
71.9 74.3 70.5
Size (small) Size (medium) Size (large)
7,603 8,649 6,639
736 902 742
688 866 719
93.5 96.0 96.9
489 633 534
71.1 73.1 74.3
Public School
22,981
2,380
2,273
95.5
1,656
72.9
30 - Catholic 40 - Other Private
6,031 8,919
189 283
174 201
92.1 71.0
129 120
74.1 59.7
Public Schools 1 - urban, low, small 2 - urban, low, medium 3 - urban, low, large 4 - urban, medium, small 5 - urban, m e d i u m , m e d i u m 6 - urban, medium, large 7 - urban, high, small 8 - urban, high, medium 9 - urban, high, large 10 - s u b u r b a n , low, small 11 - s u b u r b a n , low, medium 12 - s u b u r b a n , low, large 13 - s u b u r b a n , medium, small 14 - s u b u r b a n , medium, medium 15 - s u b u r b a n , medium, large 16 - s u b u r b a n , high, small 17 - s u b u r b a n , high, m e d i u m 18 - s u b u r b a n , high, large 19 - rural, low, small 20 - rural, low, medium 21 - rural, low, large 22 - rural, m e d i u m , small 23 - rural, m e d i u m , m e d i u m 24 - rural, medium, large 25 - rural, high, small 26 - rural, high, m e d i u m 27 - rural, high, large
57 236 280 262
1,007 1,333 140 615 886 374
1,381 1,219 776
2,242 1,883 383 830 632 592 505 120
3,579 1,385 233
1,440
Non-Public School
14,950
472
375
79.4
249
66.4
All Schools
37,841
2,852
2,648
92.8
1,905
71.9
T h e Orshansky scores are categorized as: Low - less than or equal to 10%, Medium - 11 to 40%, and High 4 1 % and higher. Eligible grade sizes are estimates of the t o t a l number o students enrolled at school for all middle school specific grades (5-8). T h e size distinctions are: Small - fewer than 200 s t u d e n t s , Medium - 200 to fewer t h a n 600 s t u d e n t s , and Large - 600 and more s t u d e n t s .
165 Table 2. Public School District Level Eligibility and Response Rates Design Stratum (Orshansky) Low Medium High Total
Eligible Districts Number Percent 518 98.5 1,051 98.3 98.1 415 98.3 1,984
Number of Districts Included 526 1,069 423 2,018
Responding Districts Number Percent 403 77.8 81.4 856 80.5 334 80.3 1,593
Table 3. School Eligibility Percentage by School Type School Type Percentage
Public 95.5
Catholic 92.1
Other Private 71.0
Total 92.8
Table 4. School Responding Percentage by School Type School Type Percentage
Public 72.9
Catholic 74.1
Other Private 60.2
Total 72.0
percent (in the urban/high Orshansky/small stratum) to a high of 84.2 percent (in the urban/low Orshansky/large stratum) (see Table 1). Table 4 shows that response rates for public and Catholic schools (72.9 and 74.1%) are highly significant when compared to response rates from other private schools (60.2%). The combined response rate for all schools was 72.0 percent. The resulting overall unequal weighting effect (UWE) for nT responding schools was 1.45, where
UWE n
~ (z^^r
Table 2 provides public school district-level eligibility and response rates for the three Orshansky levels. This table shows fairly equal eligibility rates across all levels with a slightly lower response rate (77.8%) for the lowest Orshansky level. The overall eligibility rate was 98.3% and the overall response rate was 80.3%. The overall non-response adjusted unequal weighting effect for school districts was 1.60.
5-4 Data Collection Experience Table 5 shows some weighted questionnaire findings from the teacher data collection activities. Approximately 64 percent of all public schools have a coordinator or a
166 Table 5. School Weighted Percentage of Teachers Responsible for Substance Use Prevention Education
School Type Percentage
Public 64.1
Catholic 42.2
Other Private 40.0
Total 56.3
Table 6. Weighted Percentage of Districts Responding Middle Schools with a Substance Use Prevention Education Coordinator
All Schools 67.4
Some Schools 9.2
No Schools 23.5
specific teacher who is primarily responsible for substance use prevention education at the school. This prevalence rate is significantly higher than that for Catholic and non-Catholic private schools (42.2 and 40.0 percent). Table 6 shows the weighted questionnaire findings for public school districts. Comparable to school findings, approximately 67 percent of public school districts indicated that a coordinator or a specific teacher was present in all schools within the district with the responsibility for substance use prevention activities in the school.
6
Conclusions
One of the most important sample design issues for this study relates to the identification of the sampling frame. When the SSUPPS study was designed, the data sources known were CCD, MDR, and QED. According to Hamann 5 , all three sources are adequate for public schools, but QED is the only source containing both public and private schools. This feature led to our use of QED. However, since this survey was implemented, NCES has made available on the NCES website (http://nces.ed.gov) a detailed data source for private schools known as the Private School Universe Survey (PSS). If the surveys were implemented today, we would recommend a combination of CCD and PSS for the sampling frame.
Appendix: List of Acronyms Used 1. SAMHSA - Substance Abuse and Mental Health Services Administration 2. SSUPPS - School-Based Substance Use Prevention Programs Study 3. SDFSCA - Safe and Drug-Free Schools and Communities Act
167 4. CCD - Common Core of Data 5. NCES - National Center for Education Statistics 6. MDR - Market Data Retrieval 7. QED - Quality Education Data 8. PSS - Private School Universe Survey
Acknowledgments The authors thank the supporter for this research: National Institute on Drug Abuse, Dr. Elizabeth Robertson (Project Officer), NIDA Grant 5R01 DA1149202. Thanks to members of the School-Based Substance Use Prevention Programs Study project team for comments concerning this document: Allison Burns, Susan T. Ennett, Matthew C. Farrelly, Ruby E. Johnson, Kurt Ribisl, Christopher L. Ringwalt, Luanne Rohrbach, Eva W. Silber, Ashley P. Simons-Rudolph, Judy M. Thorne, Amy A. Vincus, and Dana L. Wenter. In addition, the authors thank James R. Chromy for his suggestions and technical review, David Kellogg for his editorial review, and Brenda E. Gurley for her clerical contributions.
References 1. W.G. Cochran, Sampling Techniques, Third Edition (John Wiley & Sons Inc., 1977). 2. Dun &: Bradstreet Corporation, Market Data Retrieval Database(School Year 1994-1995). 3. L. Dusenbury and M. Falco, Journal of School Health 65, 420 (1995). 4. L.D. Eigen and D.W. Rowden, Making the Case for Prevention: A Discussion Paper on Preventing Alcohol, Tobacco, and Other Drug Problems (Department of Health and Human Services, Public Health Service, Washington, DC, 1993). 5. T.A. Hamann, Evaluating the Coverage of the U.S. National Center for Education Statistics Public Elementary/Secondary School Frame (Presented at the International Conference on Establishment Surveys - II. Governments Division, U.S. Census Bureau, 2000). 6. Jones, S.M., and J.R. Chromy, In Proceedings of the Section on Survey Research Methods (American Statistical Association, 1982). 7. L. Kann, J.L. Collins, B.C. Pateman, M.L. Small, J.G. Russ and L.T. Kolbe, Journal of School Health 65, 291 (1995). 8. L. Kish, Survey Sampling (John Wiley & Sons, New York, 1995). 9. Quality Education Data Inc., QED National Education Database: Data Users Guide, Version 4-6 Quality Education Data, Inc., Denver, CO, 1998).
168 10. C.L. Ringwalt, J.M. Greene, S.T. Ennett, R. Iachan, R.R. Clayton and C.G. Leukefeld, Past and Future Directions of the D.A.R.E. Program: An Evaluation Review (Prepared for the National Institute of Justice, 1994). 11. Substance Abuse and Mental Health Services Administration, Summary of Findings from the 1998 National Household Survey on Drug Abuse (Office of Applied Studies, U.S. Department of Health and Human Services, 1999).. 12. SAS Institute Inc., SAS Procedures Guide, Version 6, Third Edition (1990). 13. B.V. Shah, B.G. Barnwell and G.S. Bieler, SUDAAN Users Manual (Research Triangle Institute, Research Triangle Park, NC, 1996). 14. M.G. Sirken, Journal of the American Statistical Association 67, 224 (1972). 15. U.S. Department of Education, Public Elementary/Secondary Universe Survey (National Center for Education Statistics, School Year 1994-95).
O N E - S T E P ESTIMATION FOR THE PARTIALLY LINEAR P R O P O R T I O N A L H A Z A R D S MODEL XUEWEN LU Department of Mathematics and Statistics, University of Calgary, 2500 University Drive, Calgary, Alberta, T2N 1N4, Canada R.S. SINGH Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario, NIG 2W1, Canada E-mail: [email protected] The proportional hazards regression model usually assumes that the covariate has a log-linear effect on the hazard function. In this paper, we consider a semiparametric survival model with flexible covariate effects. We assume the baseline hazard function can be parameterized, while the risk function associated with covariates is modeled in a semiparametric way. A "one-step algorithm" is used to estimate the nonparametric components and the parametric components by maximizing the local and the global likelihood, respectively. Using the local linear method, the estimates of the unknown parameters and the unknown covariate function are determined, and their asymptotic distributions are obtained.
1
Introduction
The Cox regression model (see Cox 7 ) is an important model in survival analysis, in which the conditional hazard of failure at time t given the covariate value z is modeled as h(z\t) =
h0(t)ex.p(zT/3).
Here, the baseline hazard ho(t) is nonparametric, while the dependency on the covariate z is parametric. The partial likelihood principle provides efficient estimates of (3 in the semiparametric model (see Andersen and Gill 1 ) . The Cox proportional hazards model has had profound impact on clinical trials for quantifying the effects of covariates on the survival time and for controlling confounding by means of a mathematical model which takes the outcome under consideration as the dependent variable, and includes both the postulated causal factor and confounding factors in the equation (see Fleming and Lin 13 and Elwood 1 0 ) . Despite the advantages of the Cox model, many authors have considered nonparametric and semiparametric modeling of covariate effects on the censored failure time. For example, Beran 3 , Dabrowska 8 , McKeague and
169
170
Utikal 2 3 and Nielsen and Linton 25 study the fully nonparametric hazard model where h(z, t) is unspecified. Fan, et al. n treat the multiplicative nonparametric model where h(z,t) = h0(t)X(z) in which neither h0(-) nor A(-) is parametrically specified. Sasieni 29 ' 30 examines a partially linear covariate effect in a model where h(z,t) = ho(t) exp{zf(3 + Xfa)}- Dabrowska 9 discusses a generalized Cox regression model with baseline intensity dependent on a covariate in which h(z,t) = h0(t, zi)exp{zj/?}. Hastie and Tibshirani 15,16 consider a fully nonparametric additive Cox model. Huang 17 studies a partly linear additive Cox model. Grambsch, Therneau and Fleming 14 and Fleming and Harrington 12 (Section 4.5, pages 163-168) propose to use smoothed martingale residuals to explore the functional form of the covariate effect in the Cox model. A survey of other regression models for censored survival data can be found in Andersen et al. 2. In this paper, we study the following semiparametric survival model : h(t\z, x) = h0(t; a) exp{zT/3 + \{x)},
(1)
where ho is a baseline hazard function with an unknown parameter vector a. (zT,x)T e R p x R is a covariate vector. The variable x is continuous with values in a compact set X £ R, and z is discrete or continuous with values in R p . 0 = (/3T, aT)T is an unknown q + p parameter vector, A() is an unknown smooth univariate function. In this model, the baseline hazard is modeled parametrically; the covariate effects are modeled semiparametrically. In many applications the shape of the baseline hazard is well understood: for example, the Gompertz-Makeham hazard is used extensively in insurance problems (Jordan 19 , page 21). A linear approximation to the logarithm of the baseline hazard is successful in a number of chronic diseases problems (Meshalkin and Kagan 2 4 ) . Also, the Weibull baseline is widely used in both biostatistical and reliability applications, e.g. Lawless 21 . The covariate effect, however, is rarely specified by a completely parameterized model. For instance, in a clinical trial study, when Z is a treatment covariate and X is a covariate describing characteristic of the patients, B can be interpreted as a measure of the treatment effect after adjusting for the effect of X. The effect of X can be modeled in a nonparametric way. Fleming and Lin 13 provide an overall review for the developments of survival analysis in clinical trials. Hence, this model allows flexible modeling of the covariate effect in many applications including clinical trials. We assume that the covariates are time-independent due to the fact that this type of data arises often in practice and the technical details involved in time-dependent covariate models are hard to tackle. Nielsen et al. 26 study model (1) without the linear part zTf3 and they assume the covariate x is time-dependent. As they mention in their paper,
171
the partial likelihood principle fails to estimate the parametric part in model (1). A generally applicable approach for this type of semiparametric model is that of profile likelihood, which is discussed extensively in Severini and Wong 31 and Bickel et al. 4 . When X(x) has a parametric form, for example, \(x) = xTj, it is well known that one can use maximum likelihood methods to analyze the model. Kalbfleisch and Prentice 20 and Lawless 21 give a detailed introduction to the use of maximum likelihood methods in the analysis of parametric regression models in the context of survival analysis. Most statistical software, for example, SAS and SPLUS, can analyze the Cox model, but they lack the ability to estimate the parameters in the baseline hazard function when the covariates are modeled in a nonparametric or semiparametric manner. That is, when the regression function A(a;) is not parameterized as a linear form or is an unknown regression function. This paper discusses how to estimate both the parametric and nonparametric parts in model (1). The paper is organized as follows. Section 2 of the paper introduces the likelihood for the survival model under right-censoring. Section 3 describes the local likelihood and local linear fit method. Section 4 proposes the onestep estimation method. Section 5 studies the asymptotic distributions for both parametric and nonparametric parts. Section 6 gives some conclusion remarks on the proposed method. Finally, the Appendix gives the technical proofs. 2
Likelihood Function for A Parametric Survival Model
Let f(t\z, x) denote the conditional density function of lifetime T given (Z,X) = (z,x), and let S(t\z,x) = P{T > t\(Z,X) = (z,x)} be its conditional survival function. The conditional distribution function of a random censoring variable C given {Z,X) = (z,x) is denoted by G(t\z,x). Then under independent and noninformative censoring (i.e., G(t\z, x) does not involve the unknown parameters), the conditional likelihood function is given by
L^UfWZuXjUsiYilZuXi), u
(2)
c
where Y[u a n d Ylc denote, respectively, the products involving the uncensored and the censored observations. Given that Ho(t;a) is the cumulative baseline hazard function and that \(x) is parameterized as A(a;) = A(a;;7), the conditional survivor function for T at (Z, X) = (z, x), under the survival model (1), is of the form S(t\z, x; a, (3,7) = e x p [ - # 0 ( i ; a) exp{zT/3 + X(x; 7)}].
172
The hazard function and cumulative hazard functions are given by h(t\z,x;a,p,j) = h0(t;a)exp{zTP + X(x;j)} and H(t\z,x;
L{0,l) =
HihiYilZuXi^SiYilZuXi), 1
remembering that 6 = (/?T, <jT)T. Under the survival model (1), we have the log-likelihood for the sample
i„(«.7) = S{«iiog/(yi|zilxo + (i-«i)iogS(yi|zi,xi)} i
= ] > > { l o g h0(Yi; a) + {Z?P + A(X i; 7))} i
+ exp{Z?P +
\(Xi;1)}logSo(Yi;cj))
= ^ [ « i { l o g h0(Yi; a) + {Zf (3 + A(X i; 7))} i
-exp{Z?P
+ \(Xi;i)}Ho{Yi;iT)],
(3)
using Ho(t;a) = — logSo(t; a). Maximization of (3) leads to the maximum likelihood estimators of a, (3 and 7. 3
Local Likelihood
Suppose that the form of X(x) is not specified, and that the first order derivative of X(x) at the point x exists. If we let ao — X(x), 01 = X'(x), then, by Taylor's expansion, X(X)aa0+a1{X-x), for X in a neighborhood of x. Let b be the bandwidth parameter that controls the size of the local neighborhood and let W be a kernel function. Using this local model, one would find ao, a^ and 6 = (/3T,aT)T to maximize the local (log) likelihood n
ln(a0,a1,(3,a)=Y,k{
+ ao +
a1{Xi-x),(Yi,6i)}Wb{Xi-x),(4)
173
where h{a, Zj(3 + a0 + ax{Xi - x), (Y{, Si)} = &[logh0{Yi; a) + {ZT/3 + a0 + - x))] - Ho{Yi;a)eXp\ZT(3 + (Zf'/? + a0 + ax{Xi - x))}, Wb{t) = ai{Xi b-lW(t/b). But the estimates of the true parameters 9Q = (f3j, &£)T from this local likelihood are not efficient since we use only a fraction of all data points. Lu et al. 22 consider model (1) using the generalized profile likelihood method with local constant fit and obtained an efficient estimate of (/?o, 0"o) which achieves the semiparametric information bound. But that approach requires a fully iterated algorithm and is not computationally convenient. In this paper, we provide the one-step algorithm with local linear fit. Application of the local linear fit enables us to reduce the bias of the estimator for the nonparametric component and to avoid boundary effects. In some important applications, we show that one-step algorithm achieves the same efficiency. 4
The One Step Estimation Method
Under model (1), the primary interest is to estimate the true parameters
Y.lifaZj'P
+ XiXi-MAYuSi)}.
(5)
i=l
Step 3: Fix 9 at its estimated value from Step 2. The final estimate of A0(-) is A(a;; b, 9) = a 0 where (So, &o) maximizes n
Y^ k{°, ZU + ao + ai(Xi - x), (Yi,6i)}Wb(Xi i=l
- x).
(6)
174
At this final step we take b to be an estimate of the bandwidth that is optimal for estimation of Ao(-) when 90 is known. To compare the one-step algorithm with the generalized profile likelihood method or the fully iterated algorithm, we also give the fully iterated algorithm as follows: Step 0: Fit a parametric linear model to obtain the initial estimates 9. Step 1: Find A(a:; b, a, /3) by maximizing the local likelihood (4). We take b to be an estimate of the bandwidth that is optimal for estimation of 9Q. Step 2: Update 9 by maximizing n
J2 hW, ZTf3 + \(x; b, a, 0), (Yh Si)}.
(7)
Step 3: Fix 6 at its estimated value from Step 2. The final estimate of Ao(-) is \(x; b, 9) = a$ where (&o, oi) maximizes (6). At this final step we take b to be an estimate of the bandwidth that is optimal for estimation of Ao() when 6Q is known. Instead of maximizing (7), an alternative approach is to update the current 6 by maximizing n
Y.idczTp
+
Xfab&JiiYiji)}.
i=l
We conjecture that the estimators resulting from this type of updating are equivalent to those obtained by maximizing (7); this would be interesting to verify. 5 5.1
Distribution Theory One-Step Estimator of the Nonparametric Part
In this subsection, we investigate properties of the estimators of the nonparametric part Ao(-) of (1) under two cases: (1) the one-step approach when 9Q is estimated locally as in (4); and (2) the final estimate as in (6) when 9Q is estimated globally. In case (1), we are using only 0(nh) data points to estimate both 90 and Ao(-). In case (2), #o is estimated at parametric rates, and thus AQ(-) can be estimated asymptotically as well as if #o were known.
175 We consider case (1) first. Let /(•) be the density of X. Define
X = x
£(*)
mi = Ao(Xi) + Z?P;
(
{Si - HQ(Yi; (To) exp(mi)}Ui {Si - H0(Yi; do) exp(mj)}Zi 6ih'0{Yi;CT0)- H'0(Yi;CT0)exp(m;) y
i>(:r) = first diagonal element of the matrix S _1 (a;); ho(t;
and ft,'0(t;(T) — dho(t;cr)/d<J.
Theorem 5.1. Under condition A given in the appendix, as n —> oo, b —> 0 and nb —> oo and n&5 is bounded, for the the maximizer of the local likelihood (4), we have (nb) 1/2
\(x) - A0(x)
P~Po a -
«2
-^A^E-1^
|A" = a;
(T0
N 0, Hence (^) 1 / 2 {A(x) - A0(x) - ^A 0 '(:r)& 2 } ^ N(Q, ^o-£?v{x)).
m
5.2
(8)
Fully Iterated Estimator of the Nonparmetric Part
One-step estimation of the parameters
176
Theorem 5.2.
(nb)^{\(x)~X0(x)-^b2}^N
0,
f(x)E(S\X = x)
The asymptotic variance in this result coincides with the univariate result given in Nielsen et al. 2 6 and Lu et al. 22 when covariates are time-independent, but the bias here is different, because we use the local linear fit. This result suggests bandwidth estimators in the spirit of that of Ruppert, Sheather and Wand 2 8 . Following the discussions made in Carroll et al. 5 ' 6 , for example, consider estimation of Ao(-) at the final step, for a given function w(-) with compact support, minimizing the asymptotic weighted mean squared error with weight f{-)w(-) yields the optimal global bandwidth bopt = C(W)n
/ v*(x)w(x) dx "J f\%(x)2f{x)w{x)dx
-1/5 | .
where C(W) = ( ^ 0 K ^ 2 ) 1 / 5 . For the one-step algorithm in Step 1, a relatively ad hoc choice of b is bopt x n 1 / 5 x n " 1 / 3 = bopt x n _ 2 / 1 5 . On the other hand, for the fully iterated algorithm (7), we have shown that ordinary bandwidth rates are permissible (Lu et al., 22). Severini and Staniswallis 3 2 , Hunsberger 18 , Severini and Wong 31 show the same thing for generalized partially linear model. 5.3
One-Step Estimator for the Parametric Part
We now study the global estimators for the parametric components
(CTOJPO)-
Theorem 5.3. Let J3 and a be the one-step estimates which maximizes the likelihood (5). Under conditions A given in the appendix, as n —• oo, nb4 —» 0 and nb2/ log(l/6) —> oo,
nW^-M^o.ir^ir1), \ a - aa J where B = Ex=B
+
E(6VVT), E{1(X)1T(X)ejE-1(X)eJ},
-y(x) = E(6V\X = x), V = {ZT, h'Q(Y;cr0)T)T, the first position.
ex is the unit vector with 1 in
177
Therefore, one iteration leads already to a root-n consistent estimator of 0o = (/3 0 Vo T ) T We have derived the results using local constant fit for the fully iterated estimator defined by (7) (see Theorem 5.2, Lu et al., 2 2 ) . The results also hold for the local linear fit in this paper. For the sake of completeness, we cite the result as follows: Theorem 5.4. Let 6 = (/3T, aT)T be the fully iterated estimates which maximizes the likelihood (7). Under Condition I, S and A given in Lu et al. 2 2 , assume ig0 is positive definite and finite, then
where ig0 is the marginal Fisher information matrix for 6Q given by dl,„ ieo=E0^—(60,\0)
,
v
+
81
x
—(e0,\0)(v*)
®2
v* = X'go is the least favorable direction. ig0 can be consistently estimated by ig0 has the following simple expression: ie0 = E0 j ^ | g r ( t , Z, X)h0(t; a0) exp[Z T /3 0 + X0(X)]I(t)
dt
-E0J^^(t,Z,X)dN(t) =
Eo[6{d-^dJ^(Y,Z,X)}}
where n8{t,z,x) = log[/i o (i;0-)exp{z T /3 + Xg(x)}}, Xe(x) = X0(x) + rBo(x) rg(x) and rg{x) = \ogE{exp(ZT(3)Ho(Y;cr)\X = x}. We have proved ig0 is the semiparametric information bound. R e m a r k 5.1. The fully iterated estimator is uniformly as or more efficient than the one-step estimator. However, when X and Z are independent and 6 = 1 (there is no censoring), the one-step estimator is as efficient as the fully iterated estimator. Therefore, iteration is not necessary when X and Z are weakly correlated and the censoring is not heavy. In fact, if we set E(Z) = 0 without loss of generality, then, when X and Z are independent and 6 = 1, we have E(SZ\X) = E(Z) = 0, and the asymptotic variance for the one-step estimator of /? equal to {E(ZZT)}~1, which is also the asymptotic variance for the fully iterated estimator given in (7).
178
6
Conclusions
For the generalized partially linear single-index models, the asymptotic distribution theory for the one-step and the fully iterated estimator, based on the local linear fit, has been established by Carroll et al. 5 . In this paper, we formulate our approach in the frame work of proportional hazards regression models. We can draw some similar conclusions from our study: (i) estimation of the nonparametric part Ao(-) of model (1) can be done just as well as if the parameters (<70, /?o) were known; (ii) the parametric parts can be estimated at the usual parametric rate of convergence; (iii) the fully iterative estimator can have a smaller variance-covariance matrix than the one-step estimator, but for the case without censoring and with weakly correlated X and Z both estimators share the same variancecovariance matrix; (iv) estimation of the nonparametric part in the model has an effect on the distribution of the estimates for the parametric parts; (v) best estimation of the parametric parts requires undersmoothing of the nonparametric part. Our model (1) is restricted on univariate in the nonparametric part. When there are several covariates in the nonparametric part, dimensionality reducing strategies such as additive and index models can be used. For instance, one could fit the following partially linear single-index model: h{(t; z, x) = h0(t, a) exp{zT(3 + A(x T 7 )}, where A(-) is of unknown form and, a, /3 and 7 are unknown parameters. Further development of this issue needs to be investigated. Appendix: Technical Proofs The following conditions will be needed: Condition A: (i). 6Q = (
179
(iii). There exists £ > 0 such that E[{H0(Y; <70) exp(Z r A,)} 2 +«|X],
E{\H0(Y; a0)
exp(ZTf30)Z\2+^\X}
and E{\h'0(Y;ao)\2+t\X},
E{\H'0(Y;a0)exp(ZTPo)\2+i\X}
are finite and continuous at the point X log h0(Y;a0).
= x, where ho(Y;(To) =
(iv). The functions E{H0(Y; a0) exp{ZT/30)\X}, E{H0(Y; <x0) exp{ZTPo)\X}, E{H'0(Y; a 0 ) exp(Z T po)Z\X), E{H0(Y; a0) exp(ZT/30)ZZT\X}, T E{H'Q'(Y;a0)exp(Z p0)\X}, E{fh'0{Y;a0)\X} and E{6h%(Y;a0)\X} are all continuous at the point X = x. The function E(5\X) is positive and has a continuous 2nd derivative around the point x . (v). There exists a function M(y), with E{M(Y)} 3
d
daAdakda{
< oo, such that
3
h0{y,(j)
< M(y),
d
dcrjdffkdcTi
•H0{y;c)
< M(y),
for all y, and for all cr in a neighborhood of OQ. (vi). The kernel function W > 0 is a bounded density supported on [—1,1] symmetric about zero. (vii). The function XQ(X) has a continuous 2nd derivative around the point (viii). The density /(•) of X has a continuous 2nd derivative around the point x and f(x) > 0. Let go — ho(Y;ao),
define
S0(x) = E < {
0 K2 0 0 X Z 0 ZZT Zgl \<7o 0 g0ZT gf2 )
and
Si(x)
=E<
( v0 0 v0Z V go
0 VQZT v0go \ v2 0 vo0 0 vaZZT v^Zgl 0 <70ZT ^ o ^ 2 /
X =, > .
180
Proof of Theorem 5.1: Let cn = (n/i)" 1 / 2 , 1 \ / c-^a-Xoix)} X! = \(Xi - x)/b ) , /3* = I c~\{& - A'0(o:)}
, Ui=(
l
)
and let a* = cn {o - o-o), Denote A{ = A^o;) = Ao(a;) + Zf(30 + X'0(x)(Xi — x). If (ao, a\,6)T mizes (4) then 9* maximizes
maxi-
5 3 Z;(cra<x* + (T0, cnp*TX* + Xi, (Yit 6i))Wb(Xi - x) with respect to 0*. Consider the normalized function n
/„(r) = ^53[zi(c„(r*+fT0,cni3*Tx;+Ai,(yi,<5i))-/i((To,Ai,(yi,^))]wb(xi-x), t=i
which is maximized by 6*. ln(9*) is a concave function, we can apply the Convexity Lemma (see Pollard 27 ) to it. By a Taylor expansion of the function h{-, -, {Yi, Si)) we obtain that
in(9*) = wis* - ~e*TAne*{i +
0p(i)},
0)
where n Wn = hcn ^
/{6iHo{Yi;a0)exp(Xi)}Ui {6i - H0(Yi; a0) exp(Ai)}^
i=i \6ih'0(Yi;
_
] Wb(Xi - x)
- if 0 (y i; CTo)exp(Ai)
and _iHo(n •>0)ZiVj i = 1
\ e ^ ^ ( V ,
c * i H ( | ( y i ; » o ) 2 i 2 j ' oXi ZiHla(Yi;"Si)r \°0)ZT -6ih'^Yi;<,0) +
) Wb(Xi
~ x).
Hl^Yi;^0)ex
It is shown that An = f(x)S0(x)
+ o p (l) = A + o p (l).
Therefore, by (9), ln(G*) = WZe*-^e*TA6*
+ op(l).
(10)
181
By applying the Convexity Lemma, we obtain that 0*=A-1Wn
+ op(l).
Hence, the asymptotic normality of 9* will follow from that of Wn. Since Wn is a sum of i.i.d. random vectors, we only need to compute the first two moments and check Liapounov's conditions. It can be shown that E{Wn) = Vri{f(x)^\'0\x)b2 Var(Wn)=f(x)S1(x)
+ o(b2)}, + o(l).
Therefore, we have Wn-EWn^N{0,f(x)S1(x)}. Proof of Theorem 5.3: We need two lemmas, Lemma A.l and Lemma A.2, given by Carroll et al. 5 in order to prove Theorem 5.2. By Lemma A.2, each element in An converges uniformly to its corresponding element in A. Hence, expression (9) holds uniformly in x G X. By the Convexity Lemma, it also holds uniformly in 6* G C and x G X for any compact set C. Lemma A.l then yields sup\6*{x) - A~lWn{x)\ xex
^ Q,
(11)
where 9*(x) and Wn(x) are defined in the proof of Theorem 5.1, both depend on x. By considering the first element of the vectors in (11), we have 1
— V WiWb(Xi - x) = sup X(x) - Ao(ar) - -nf(x) ^ xex where W; is the first element of the vector E-1^)
op(cn),
f{6i-H0(Yi;a0)exp(Xi)}Ui {6i-Ho(Yi;
By a result of Carroll et al. 5 , the following stronger result holds: L - £ WiWb(Xi - x] sup X(x) - \0(x) - — nf(x) t = i xex Let §! = nll2C$ - /?o), k = n^2(a rhi = \(Xi) + Zfpo,
0P{b2cn
+
c2n\og^2(l/b)}.
- a0), and 9 = (Of, 6%)T. Denote by
and
rm = A0 {X{) + Zj (30 •
182
Then, 9 maximizes n
UO) = ^ [ ^ ( " " 1 / 2 ^ i + °o, n-l'2Zf92
+ rhu (Yu S{)) - h(a0, mu (Yu Si))],
By Taylor's expansion, we have ln(9) = n-1'2
£
(Yi> 5 ^ ~ \tTBj,
tffc,
(12)
yhere <*, ( V . i{miAxl,oi}}
{6i-H0(Yi;a0)exp(mi)}Zi - F0(yi;a0)exp(mi)
K \ \ - {
- \s.^Yi;
where b\x = exp(m; + n 1^2Zi9ii)H0(Yi;ao + n 1/292i)ZiZ?, b[2 = exp(rhi + n-^ZiO^ZiH'oiYi; a0 + n~1/292i)T, b21 = exp(m i + n-1^Zi91i)H'0(Yi; a0 + n~V292i)Zj and b\2 = - & / # ( * ; ;
Bn
E{SZh'0(Y;a0)T} E{S(h'0(Y;a0))®2})
\ +
° ^ 1 } = ~B + °p{1>-
As for the first term in (12), we have n
n~1/2 £
n
i{mi, (Yh Si)) = n-1'2 £
1=1
^(rm,
(YuSi))
1=1 n
+
n-ll2Y,^J{mi,{Yi,Si)){\{Xi)-\0(Xi)} t=i 1 2
+ Op(n / ||A-A 0 ||^ o ), where Mrnu (Yi, Si)) = ^(rm, dmC
(Yi, Si)) = - ( Hf^ f * ) . \ H0(Yi;
183 The second term in the above expression can be expressed as n
n
n -3/2 £
^
^
{Yu
1 - Xi) ( 5 i ) ) / ( x i ) - Y, WjWbiXj
t=l
1=1 l 2 2
+0P{n / c Define z^- = v(Xj,Zj,
(
l o g ^ l / f c ) } = Tnl + 0P{n^cl
n
log^l/fc)}.
(Yj,6j)) as the first element of
Sj - HQ(YJ;
Using the definition of Xj(Xi), therefore n
Zj
dM(t).
\h0(t;a0)J
we obtain A, — rrij = 0((Xj
— Xi)2)
and
n
Tnl = n- 3 / 2 5 3 ^ 2 Vf K , (YuSiWiXiy^MXj
- Xi) + 0P{nl'2h2)
i=lj=l
By a similar procedure used in Carroll et al. 5 , it can be shown that T„ 2 - T„ 3 £ 0,
(13)
where
j=i
with
IV i?o(yi;cro)exp(??2i) y 1
j
It is seen that
Hrni^YiA))
= j
dMi{t)
Combining (12)-(13) we obtain that n 1 2
Zn(0) - n " / ^ J2 tyXi, (Y, Si), Zi) - -6TB9 + oP(l), t=i
184
where
n(xu (Yh Si), Zi) = Mm, W, st)) - 7(*iH By the Convexity Lemma we have n
from which it follows that 6^
N(Q,B~llliB-1).
This establishes the result in Theorem 5.3. Acknowledgments Research of R.S. Singh is supported in part by the National Science and Engineering Research Council Grant # A4631. References 1. P.K. Andersen and R.D. Gill, Ann. Statist. 10, 1100 (1982). 2. P.K. Andersen, O. Borgan, R.D. Gill and N. Keiding, Statistical Models Based on Counting Processes (Springer-Verlag, New York, 1993). 3. R.J. Beran, Nonparametric regression with randomly censored survival data (Technical report, Univ. California, Berkeley, 1981). 4. P.J. Bickel, C.A.J. Klaassen, Y. Ritov and J.A. Wellner, Efficient and Adaptive Inference in Semiparametric Models (Johns Hopkins Univ. Press, 1993). 5. R.J. Carroll, J. Fan, I. Gijbels, and M.P. Wand, Discussion Paper #9506 (Institute of Statistics, Catholic University of Louvain, Louvain-la-Neuve, Belgium, 1995). 6. R.J. Carroll, J. Fan, I. Gijbels, and M.P. Wand, J. Amer. Statist. Assoc. 92, 477 (1997). 7. D.R. Cox, J. Royal Statist. Soc. 34, 187 (1972). 8. D.M. Dabrowska, Scand. J. Statist. 14, 181 (1987). 9. D.M. Dabrowska, Ann. Statist. 25, 1510 (1997). 10. J.M. Elwood, Critical appraisal of epidemiological studies and clinical trials, Second Edition (Oxford University Press, London, 1998). 11. J. Fan, I. Gijbels and M. King, Ann. Statist. 25, 1661 (1997). 12. T.R. Fleming and D.R. Harrington, Counting Processes and Survival Analysis (John Wiley & Sons, New York, 1991).
185
13. T.R. Fleming and D.Y. Lin, Biometrics 56, 971 (2000). 14. P.M. Grambsch, T.M. Therneau and T.R. Fleming, Biometrika 77, 147 (1990). 15. T. Hastie and R. Tibshirani, Generalized Additive Models (Chapman and Hall, London, 1990). 16. T. Hastie and R. Tibshirani, Biometrics 46, 1005 (1990). 17. J. Huang, Ann. Statist. 27, 1536 (1999). 18. S. Hunsberger, J. Amer. Statist. Assoc. 89, 1354 (1994). 19. C.W. Jordan, Life Contingencies (The Society of Actuaries, Chicago, 1975). 20. J.D. Kalbfleisch and R.L. Prentice, The Statistical Analysis of Failure Time Data (John Wiley and Sons, New York, 1980). 21. J.F. Lawless, Statistical Models and Methods for Lifetime Data (John Wiley & Sons, New York,1982). 22. X. Lu, R.S. Singh and A.F. Desmond, Journal of Statistical Planning and Inference 98, 119 (2001). 23. I.W. McKeague and K.J. Utikal, Ann. Statist. 18, 1172 (1991). 24. L.D. Meshalkin and A.R. Kagan, Discussion of "Regression models and life tables" by D.R. Cox. J. Roy. Statist. Soc. Ser. B 34, 213 (1972). 25. J.P. Nielsen, O.B. Linton, Ann. Statist. 23, 1735 (1995). 26. J.P. Nielsen, O.B. Linton and P.J. Bickel, Ann. Statist. 26, 215 (1998). 27. D. Pollard, Econometric Theory 7, 186 (1991). 28. D. Ruppert, S.J. Sheather and M.P. Wand, J. Amer. Statist. Assoc. 90, 1257 (1995). 29. P. Sasieni, Scand. J. Statist. 19, 215 (1992). 30. P. Sasieni, J. Roy. Statist. Soc. Ser. B 54, 617 (1992). 31. T.A. Severini and W.H. Wong, Ann. Statist. 20, 1768 (1992). 32. T.A. Severini and J.G. Staniswalis, J. Amer. Statist. Assoc. 89, 501 (1994).
A N E S T E D FRAILTY SURVIVAL MODEL FOR R E C U R R E N T EVENTS RENJUN MA Department of Mathematics and Statistics, and Canadian Research Institute for Social Policy, University of New Brunswick, Fredericton, Canada, E3B 5A3 E-mail: [email protected] J. DOUGLAS WILLMS Canadian Research Institute for Social Policy, University of New Fredericton, Canada, E3B 5A3 E-mail: [email protected]
Environmental
Brunswick,
RICHARD T. BURNETT Health Directorate, Health Canada, Ottawa, Canada, KlA E-mail:[email protected]
0L2
We consider a nested frailty Cox proportional hazards model for recurrent events. We adopt a Poisson modelling approach and the principal results depend only on the first and second moments of the unobserved frailties. The use of the proposed methods is illustrated through the reanalysis of chronic granulotomous disease data previously reported by Fleming and Harrington 5 .
1
Introduction
Recurrent events often arise naturally in practice. Examples of such events include disease onsets and emergency room visits in medical studies, process stoppages in reliability studies, as well as criminal offences and unemployment in social studies. Such recurrent events are often clustered by subjects and their residential areas or other groupings. Until recently, the recurrent events are usually modelled by incorporating a single level of subject frailties, or random effects, into the survival models, while ignoring other level of clustering effects (Yau 1 6 ) . Furthermore, the existing approaches to frailty survival models generally rely on specific distributional assumption of frailties such as gamma, inverse Gaussian and log-normal (Sastry 13 , Yau 1 6 ) . In this paper, we consider a nested frailty Cox proportional hazards model for recurrent events where the nested frailties are specified only up to the first and second moments. The treatment of ties and stratification is explicitly incorporated in our approach in the same way as in the standard Cox model. Our estimation procedure is based on a characterisation of this frailty Cox model as an auxiliary random effects Poisson regression model (Ma et al. u ) .
186
187
This approach gives optimal and consistent parameter estimators based on the orthodox best linear unbiased predictor approach to the auxiliary random effects Poisson models. For a single level of gamma frailty, given the frailty parameter, our approach coincides with the hierarchical likelihood approach (Ha et al. 6 ) and the EM algorithm approach (Klein 10 , Nielsen et al. 12 and Andersen et al. 2 ) . We introduce the nested frailty Cox proportional hazards model and its auxiliary random effects Poisson models in sections 2 and 3, respectively. In section 4, we discuss the estimation of the nested frailty Cox models, using the orthodox best linear unbiased predictor approach to the auxiliary random effects Poisson models. An application of our approach to the chronic granulotomous disease data is illustrated in section 5. Some alternative frailty models are discussed in section 6. Some comparisons of our approach with others in the literature are discussed in section 7. 2
Nested Frailty Cox Model
In this section, we consider a Cox model with two levels of nested Suppose that the cohort of interest is composed of m independent indexed by i. Within the ith cluster, there are Ji correlated subjects by (i,j). Specifically, we assume that the cluster-level frailties Ui,..., positive, independent and identically distributed with E(Ui) = 1 and var(Ui) -a2,
frailties. clusters indexed Um are
i - l,...,m.
(1)
We also assume that there is a subject frailty Uij associated with the subject j from the ith cluster, j — 1,..., J;; i = 1 , . . . , m. We further assume that, given the cluster-level frailties U* = u* = (u\,... ,um), the subject-level frailties Wij},j = 1, ••., Ji',i = l , . . . , m are positive and conditionally independent, and that the conditional distribution of Uij, given {/* = u*, depends on Ui only, with E(Uij\U* = u«) = Ui and var(Uij\U* = «») = u2Ui,
(2)
j = l,...,Ji and i = l , . . . , m . Furthermore, assume that there are n^ recurrent events within the subject (i,j) where the A;th recurrent event time of the jthe subject in the ith cluster is given by y
_ J time to infection/censoring (^ time to infection/censoring
first episode (k = 1) since latest event (k > 1)
Let the hazard function for the fcth recurrent event time of the jth subject in the ith cluster at time t be denoted by Xijk{t). Given both the cluster
188
and subject level frailties U = u, we assume that the hazard functions are conditionally independent, with A«*(t) = ^k)(t)uij
exp{pTxijk(t)},
(3)
where k = 1,2,..., a with a being the maximum number of observed recurrences of the event, where AQ (t) denotes the unspecified baseline hazard function (Cox and Oakes 4 ) and Xijk{t) denotes the time-dependent covariate vector for the kth recurrent event time of subject i, j at time t. The components of this time-dependent covariate vector Xijk (t) will be approximated by piecewise constants described in the next section. The distribution of frailties is assumed not to depend on the regression parameter /?. Clearly, the recurrent event times, either observed or censored, within the same cluster are correlated. An implication of this model is that the subject riij is not at risk of (k + l)th event until this subject has experienced the fcth event. A Cox proportional hazards model with a single level of frailties is obtained as a special case of the Cox model with two levels of frailties by setting v2 — 0 and Ji — 1 for all i. Our model only requires the specification of the first two moments of frailties since it is generally impractical to assume that the random mechanism by which the unobserved frailties were generated is entirely known. On the other hand, our assumptions (1) and (2) on random effects do cover a wide range of frailty distributions including gamma, inverse Gaussian and log-normal. 3
Auxiliary Random Effects Poisson Models
Let Sk be the set such that (i, j) € Sk if and only if the subject (i,j) has experienced his (fc-l)th recurrent event time. Let TH < . . . < Tkgh denote the distinct fcth recurrent event times in Sk, with rrikh indicating the multiplicity of fcth recurrent events occurring at time 7>/, (k — 1 , . . . , a). The risk set at time Tkh is a subset of Sk such that TZ{Tkh) — {(i,j) '• Ujk > Tkh}, where Ujk is the feth observed recurrent event time for the jth subject in the ith cluster. In addition, for any (i,j) € 5^, let Yy/b,/, be 1 if the fcth recurrent event occurs for subject (i,j) at time Tkh and 0 otherwise. Let Y and U denote the vectors of the Yijk,h and the random effects (frailties) Uij, respectively. Given the random effects U = u, Peto's version of the conditional partial likelihood (Cox & Oakes, 1984, p. 103) is / (B-Y\u) - 1 7 W
ft
' ^ MM
n ^ ^ ^ ^ ^ ^ i e x p ^ ^ ^ ) ) } ^ , , ^ ! )
iZiiJWr^eM^iATkh))}^
' (4)
189
where the time-dependent covariate Xijk(t) is approximated by a piecewise constant function Xijk(t) - xijk(Tkh) t e [rk{h-i), Tkh] for h = 1 , . . . , qk with Tfco = 0. The value of Xijk{Tkh) can be taken as any reasonable quantity such as the median of Xijk(t) over the interval (^^-1),^^]We now define an auxiliary random effects Poisson regression model. Assume that the components of Y are conditionally independent, given random effects U = u, with Yijk,h ~ Poisson{itij exp{akh + prxijk(Tkh))}
{i,j) & Tl{Tkh).
(5)
This auxiliary random effects Poisson model extends that of Whitehead (1980) for standard Cox models to frailty Cox models. Given the random effects, the conditional likelihood for the random effects Poisson model is a
qk
£li M
eX
P{E(i,i)eK(r fch ) UH ex P( a fch + PTXijk(Tkh))}
'
(6)
It has been shown by Ma et al. 1X that
t i Uii
mkh
-
J
where the term in parentheses on the right-hand side does not depend on the parameters of interest. This demonstrates that the maximum joint Poisson likelihood estimators for the regression parameter vector /3 from (6) are the maximum joint partial likelihood estimators for the regression parameter vector P from (4). We may therefore make inferences on the frailty Cox models by fitting random effects Poisson models. In the remainder of this paper, we focus on the nested frailty Cox proportional hazards models specified by (1), (2) and (3) by way of fitting the auxiliary nested random effects Poisson models specified by (1), (2) and (5). 4 4-1
Orthodox Best Linear Unbiased Predictor Approach Prediction of random effects
We predict the random effects by the orthodox best linear unbiased predictor of U given Y. If we let U and Y be random vectors with finite second moments, the orthodox best linear unbiased predictor of U given Y is
u = E(U) + Cov(t/,y)(Cov(r))-1 {Y - E(Y)} .
190
This is the linear unbiased predictor of U given Y which minimises the mean squared distance between the random effects U and their predictor within the class of linear functions of Y. The cluster random effects predictor can be expressed as fT _ i
1 + CT
-1 _i
Sfc=l Eft=l S(i,j)€TC(rth) 2 V^
a
V^^
fc
w Y
ij ijk,h
V"*
'
^ '
where (i, j) runs over the risk set TZ(Tkh) for any given i. Here, Vijk,h = exp (akh +
PTxijk)
= exp {(aT,0T)xijk,h}
,
and, for fixed (i, j),
(
a
1+
qk
^ 5353 k=i h=i
53
Vijk,h
(i,j)en(Tkh)
The right-hand side of the equation (7) clearly involves unknown quantities 7, a2 and v2. These quantities together with random effects predictors will be iteratively estimated. At each iteration, the Ui will be updated by evaluating those unknown quantities on the right-hand sides at their current values. The detailed discussion on the iterative algorithm will be presented in section 4.4. The unknown quantities involved in the following equations will be understood similarly. The subject random effects predictors are a
Qk
2
Uij =WijUi + V WiJY^'%2 k=l h=l
4-2
53
Y
ijk,h-
(8)
(i,j)e1Z(rkh)
Estimation of regression parameters
Consider first estimation of the regression parameters in the case of known dispersion parameters. Estimation of unknown dispersion parameters will be discussed in section 4.3. Differentiating the joint log-likelihood of the auxiliary model for the data and random effects yields the joint score function. Replacing the random effects with their predictors, we have an unbiased estimating function for the regression parameters 7 = ( a T , / ? T ) T : a
9k
^(7) = 53 5 3 fc=l h = l m
53 (i,j)€K(Tkh)
X
W<* {YHk,h - Uii{l)»ijk,h{l)\
191
This estimating function 1/1(7) is then a vector function of the same dimension as that of 7. The second equality is simply a rearrangement of terms under summation by clusters where a
«k
*/>i(l) = E E
E
Xi k h Yi k h
i> \ i,
- Uij(l)Hijk,h(l)}
corresponds to the ith cluster. For each i, it is clear from (7) and (8) that the random effect predictor Uij (7) is unbiased and involves only the induced observations, yijk,h, within the ith cluster, and so is the estimating function ipi(y). Therefore the unbiased estimating functions ipi{'y),---,il>m('y) are independent. The solutions of the estimating equation ^2^L1 V'JIT) = 0 provide estimators of the regression parameters. Noting that the dimension of parameter 7 increases with the number of clusters, the standard estimating equations theory based on the fixed number of parameters would apply if the dimension of parameter 7 is bounded. Otherwise, it can be shown that, under some regularity conditions, the component-wise asymptotically normality of parameter estimator, 7, remains valid if the dimension of parameter 7 grows slowly (He and Shao 7 ) . Specifically we have, for any scalar vector b of dimension of 7 such that bTb = 1, 6 T 7 is asymptotically normal with asymptotic mean bTj and asymptotic variance given by & T 5(7) - 1 V(-y)S('y)~rb. Here, the sensitivity matrix 5(7) and the variability matrix V(y) are given by
*r)=i><7)=i:s,{^}, i=l
i=l m
vh) = E i=l
K
!
J
m
v
^ = E ^ {^(7)^(7)} • t=l
It has been verified that S(^) — — V{p() for the nested random effects Poisson model (Ma et al. u ) ; therefore the asymptotic variance of bTj is simply - 6 T 5 ( 7 ) _ 1 6 . If the dimension of parameter 7 is bounded, the regression parameter estimator, 7, itself can be shown to be asymptotically normal with asymptotic mean 7 and asymptotic variance given by — -S'(7)_1 under mild regularity conditions (Artes and J0rgensen 3 ) . The estimating function Y^iLi V'iM c a n easily be shown to be optimal in the sense that it attains the minimum asymptotic covariance for the estimator bTj among a certain class of linear functions of Y. This estimating equation Y^T=i "4>i{l) — 0 c a n ' De solved by the Newton scoring algorithm introduced by J0rgensen et al. 9 : 7*=7-^1(7M7),
192
where 7* denotes the updated value for 7. Here negative sensitivity matrix plays a role similar to that of the Fisher information matrix in the Fisher scoring algorithm. In fact, the negative sensitivity matrix here corresponds to the so-called Godambe information matrix. The exact expression of the sensitivity matrix is given by m
«=1 j = l
i-\
qk
a —
Ji
m
^,2-f
Z-(
Viik,hXiik,h{Xijk,h)
,
(9)
fc=i />=i (i,j)en(Thh) where
(
a
qk
5Z ] C
\
w
ijtJ'ijk,hXijk,h
5Z
,
*=1 h=l (ij)€7e(rfch)
/«J
=
I 2 J Z_/
X/
(10)
/
V>ijk,hXijk,h J •
\fc=i h=i (i,j)en{Tkh)
J
Here, the index (i,j) in (10) runs over the risk set IZfah) for fixed i, and (i, j) runs freely over the risk set Tl(Tkh) in the last term of (9). In addition, c(i) denotes the mean squared distances between the random effect Ui and its predictor, specifically a = E(Ui - Uif a2 1+
a
2_)fc=l 2J/i=l z2(i,j)<EK(Tkh) wijixiik,h
where (i,j) runs over the risk set TZ(rkh) for fixed i. An analog of Wald's test is available for testing the hypothesis Ho : /3(!) = 0, where f}(i) is a subvector of /3. The test statistic is ^ = %{J11(7)}"1^(i), where J11 (7) is the block of the asymptotic covariance matrix of 7 corresponding to /?(!). Asymptotically, W follows a x2(fc) distribution, where k is the size of the subvector /?(i).
193
4-3
Estimation of dispersion parameters
We now suppose that the dispersion parameters are unknown. By analogy with generalized linear models, we adopt the following adjusted Pearson estimator for the dispersion parameter a2: m
i
the first term being the Pearson estimator and the second term being a bias correction term. The corresponding adjusted Pearson estimator for v2 is
v1 =—^2 y y^ {0ij - Oi)2+a,+ct- iciWij i. m
i=l
Ji
j=l
l
}
Again, the first term is the Pearson estimator and the remaining terms are bias correction terms where Cy denotes the mean squared distance between the random effect Uij and its predictor, specifically dj = EiUij - Uij)2 = Wij (j/2 + CiWij) .
These dispersion parameter estimators can also be shown to be consistent as m —> oo. Unlike previous approaches in the literature, the asymptotic variance of our regression parameter estimator is not affected by variability in the dispersion parameter estimators. 4-4
Computational procedures
Due to the lack of closed-form solutions for those unknown parameters, the computational algorithms for random effects modeling are inevitably iterative. Initial values for the regression parameters are taken as the regression parameter estimates obtained from standard Poisson regression techniques assuming independent responses, Yijk,h- Initial random effects predictions Ui and Uij are given by the average of the responses within cluster i divided by the average of all responses, and the average of the responses within subject (i,j) divided by the average of all responses, respectively. The initial dispersion parameter estimates are calculated from the adjusted Pearson estimators, omitting the bias-correction terms. After initial values are given, the algorithm then iterates between updating the regression parameter estimates via the Newton scoring algorithm,
194
updating random effect predictors via the orthodox best linear unbiased predictor, and updating dispersion parameter estimates via the adjusted Pearson estimators until the algorithm converges. At each iteration, the left-hand sides in the above equations are updated by evaluating those unknown quantities on the right-hand sides at their current values.
5
An example
Fleming and Harrington 5 presented the chronic granulotomous disease (CGD) data of 128 patients from 14 medical clusters. These patients were randomized to either interferon gamma (rIFN-g, t r t = l ) or placebo (trt=0). Other variables such as age and sex were also recorded. The disorders were characterized by recurrent infections and the maximum number of infections observed was 7. Fourteen of the 63 patients in the treatment group had at least one infection and the total number of infections in this group was 20. On the other hand, 30 of the 65 patients in the placebo group had at least one infection and the total number of infections in the placebo group was 56. The question is if the treatment is effective in reducing the frequency of serious infections in patients. This problem is complicated by the possible intra-subject correlation and clustering effects of medical centres on these patients. To account for possible centre and subject effects, we consider four versions of the Cox models with both centre and subject level frailties specified by (1), (2) and (3). The treatment effect was taken as the only covariate. These four models are the standard Cox model without frailties (a2 = v2 = 0), centre frailty Cox model (i/2 = 0), subject frailty Cox model (a2 = 0) and the Cox model with both centre and subject frailties, respectively. The parameter estimates corresponding to these four models are displayed in the first, second, third and last row of table 1. This table shows that the dispersion parameter estimates for both centre and subject frailties in the fourth model are essentially zero. That is, there is little centre or subject effect. This conclusion has also been confirmed by the results based on the Cox models with a single level of frailties at either centre or subject level. This result is not surprising. In the literature, Therneau and Grambsch 14 found that there is little subject effect based on the Cox model with a single level of gamma frailties and Yau 16 found that there is little centre effect, but some subject effect based on the Cox model with both centre and subject level lognormal frailties.
195 Table 1. Parameter estimates for the chronic granulotomous disease data. Cox model Standard Center level Subject level Two levels
6
Treatment
fi
s.e.
-0.8759 -0.8759 -0.8759 -0.8771
0.2782 0.2782 0.2782 0.2785
Dispersion parameter Center Subject 0 0.0013
0 0.0009
Alternative Models
We analyzed the CGD data based on our model specified by (1), (2) and (3) where recurrent events times are stratified on the basis of event type, that is, the number of recurrences. More specifically, all the first event times form the first stratum, all the second event times form the second stratum, and so on. It implies that a patient's risk of the next infection might have been changed after he experienced his latest infection. If the occurrence of each infection has permanently compromised the immune system, the risk of the subsequent infections will be increased. However, some clinical scientists believe that the risk of recurrent CGD infection remains unchanged regardless of the number of previous infections (Therneau and Grambsch 1 4 ). The data can then be modeled by assuming that all the event times share the same baseline hazard as follows (Andersen and Gill 1): Kjk(t) = Xo(t)uij exp{{3xijk(t)},
(11)
where the centre and subject frailties are still specified by (1) and (2). We fit the CGD data with the above model specified by (1), (2) and (11) with treatment as the sole covariate. The centre and subject level dispersion parameter estimates are 0.097 and 0.545, respectively. This result indicates some subject effects and relatively small centre effects. The regression parameter estimates standard error for treatment effect are now —1.1183 and 0.30465 in comparison with -1.0829 and 0.26763 for the standard Cox model without frailty. The standard error has increased slightly. The difference between (3) and (11) is that the recurrent event times are stratified on the basis of the number of recurrences in the former model, but not in the latter one. So the latter model can be mathematically viewed as a special case of (3) by setting a = 1. However, there is some conceptual difference between these two models in terms of risk sets. Suppose a patient has experienced his (k + l)th recurrent event at time r , then every patient who is under observation at time r is at risk at that time in the latter model. But in the former model, among all subjects who are under observation at
196
time r, only those who have experienced their feth recurrent event, but have not yet experienced their (k + l)th recurrent event are at risk at time T. The extension of both models to include extra stratification based on certain external variables is straightforward. Suppose these models are extended to include extra stratification based on, say gender; therefore the former model is still in the form of (3), but with 2a strata based on both gender and event type being indexed by k = 1,... ,2a. On the other hand, the latter model is also in the form of (3), but with only 2 strata based on gender being indexed byfc = l,2. 7
Discussion
Yau 16 has recently proposed a best linear unbiased predictor approach to the Cox model with nested log-normal frailties. His approach was a generalization of the linear mixed model equations of Henderson 8 where the estimating equations are obtained by differentiating the joint partial likelihood with respect to the regression parameters and the frailties. It is an analytically tractable analog of best linear unbiased predictors; therefore the corresponding frailties predictors are still called best linear unbiased predictors (Yau 1 6 ). However, these frailty predictors actually correspond to the conditional mode of the frailty given the data, and are thus neither linear nor unbiased in general for non-normal distributions. In fact, these pseudo best linear unbiased predictors often lead to biased estimating equations and inconsistent parameter estimators. This approach remains controversial, especially for cases where the number of centres increases with the sample size (Ha et al. 6 ) . On the other hand, our approach to the nested frailty Cox model for recurrent events is based on a truly best linear unbiased predictor approach to the random effects Poisson models. Our approach gives unbiased estimating equations and leads to optimal and consistent parameter estimators. The extension of our approach to the Cox models with more levels of nesting effects is also straightforward. Acknowledgments This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and in part by the Geomatics for Informed Decisions network of Canada. The authors gratefully acknowledge the computing support of Dr. Edward Hughes. The authors are also very grateful to Drs. Xuming He and Qiman Shao for their further clarification on their asymptotic results on parameters of increasing dimension.
197
References 1. P.K. Andersen and R.D. Gill, Ann. Statist. 10, 1100 (1982). 2. P.K. Andersen, J.P. Klein, K.M. Knudsen and P.R. Tabaneray, Biometrics 53, 1475 (1997). 3. R. Artes and B. J0rgensen, Scand. J. Statist. 27, 321 (2000). 4. D.R. Cox and D. Oakes, Analysis of Survival Data (Chapman and Hall, New York, 1984). 5. T.R. Fleming and D.P. Harrington, Counting Processes and Survival Analysis (Wiley, New York, 1991). 6. I.D. Ha, Y. Lee and J.K. Song, Biometrika 88, 233 (2001). 7. X. He and Q. Shao, J. Multi. Anal. 73, 120 (2000). 8. C.R. Henderson, Biometrics 3 1 , 423 (1975). 9. B. J0rgensen, S. Lundbye-Christensen, X.-K. Song and L. Sun, Statist. Med. 15, 823 (1996). 10. J.P. Klein, Biometrics 48, 795 (1992). 11. R. Ma, D. Krewski and R. Burnett, Random effects Cox model: a Poisson modelling approach (Tech. Rep. No 338. Laboratory for Research in Statistics and Probability, Carleton University, Canada, 2000). 12. G.G. Nielsen, R.D. Gill, P.K. Andersen and T.I.A. S0rensen, Scand. J. Statist. 19, 25 (1992). 13. N. Sastry, J. Am. Statist. Assoc. 92, 426 (1997). 14. T.M. Therneau, and P.M. Grambsch, Modeling Survival Data: Extending the Cox Model (Springer-Verlag, New York, 2000). 15. J. Whitehead, Appl. Statist. 29, 268 (1980). 16. K.K.W. Yau, Biometrics 57, 96 (2001).
MULTIPLE C O M P A R I S O N P R O C E D U R E S FOR LINEAR MODELS U N D E R O R D E R R E S T R I C T I O N S HOSSEIN MANSOURI AND ROBERT PAIGE Department
of Mathematics and Statistics, Texas Tech Lubbock, TX 79409-1042 E-mail: [email protected]
University
Multiple comparison procedures are some of the most frequently use statistical methods. Although, there exists an extensive amount of literature on multiple comparison procedures, few articles have addressed the problem of multiple comparisons under order restrictions. Currently available procedures are mainly developed for the one-way layouts and hence are not applicable to more complex designs such as unbalanced factorial designs or analysis of covariance models. To achieve these extensions, the focus of the present investigation is to develop simultaneous multiple comparison procedures for linear models under order restrictions of the parameters.
1
Introduction
Consider the linear model Y = X/3 + e,
(1)
where Y is an nx 1 observation vector, J i s a n n x p known matrix, (3 is apX 1 vector of unknown parameters, and e is an n x 1 vector of independent and identically distributed random variables having a normal distribution with mean 0 and variance a2 > 0. A common problem of interest is to make simultaneous inferences about components of a linear function C/3, where C is a r x p matrix with known elements. Focusing on simultaneous testing of the constrained parameters, suppose that we are interested in testing the hypothesis H0:C/3 = d
(2)
against the one-sided alternatives Hi — Ho, where Hi:CP>d
(3)
Without loss of generality it can be assumed that d = 0. Depending on dimensionality of C, we are dealing with a different class of tests. For instance, for testing subhypotheses against ordered alternatives in regression models or testing for main effects in factorial analysis of variance models, then r < p, and for simultaneous one-sided multiple comparison
198
199 procedures r > p. A third class of tests which is the main focus of this investigation, is the problem of simultaneous multiple comparisons under the assumption that the parameter space is constrained a priori. Although the first two problems are well developed, at least theoretically, not much is available on the third problem. Simultaneous tests for the order restricted (one-sided) hypotheses are well developed. Well-known tests are the likelihood ratio test, Wald's test, and Rao's efficient scores test. These are discussed in detail in Robertson et al. 10 . Shapiro 12 presented a unified approach by assuming the parameter space to be a closed convex cone and the observations follow a multivariate normal distribution with a known dispersion matrix. Mansouri 4 proposed a test based on the aligned ranking approach. One-sided simultaneous multiple comparison procedures for linear models, although not discussed explicitly can be derived from the results in Hochberg and Tamhane 3 and is outlined in Westfall et al. 13 . Simultaneous multiple comparison procedures under order restrictions of the parameters has not received much attention in the literature. The existing techniques are mainly developed for the one-way layouts under the assumption of simple order restriction of the means. Marcus and Peritz 8 proposed onesided Scheffe-type simultaneous confidence intervals for monotone contrasts in the means of normal distributions, Williams 14 proposed a Tukey-type range test for monotone contrasts for the balanced one-way layouts, Marcus 7 proposed a two-sided Scheffe-type simultaneous confidence intervals for monotone contrasts for the means of normal distributions. Hayter 2 argued that simultaneous inference for monotone contrasts does not extend to simultaneous pairwise comparisons and developed a one-sided studentized range technique under the assumption of simple order without the restriction on monotonicity of the contrasts. Recently, Mansouri and Shaw 6 proposed distribution-free simultaneous multiple comparison tests under the assumption of simple order for the balanced complete and incomplete randomized blocks. Mansouri 5 extended the latter results for repeated measures designs with incomplete observations. The main focus of this article is to develop a technique for simultaneous multiple comparisons in linear models. This technique provides a procedure that is applicable to balanced as well as unbalanced designs in analysis of variance and analysis of covariance models. In addition, we consider both the one-sided inferences without assuming monotonicity of the contrasts. In Section2, a brief summary of simultaneous one-sided inference in linear models is presented. In Section 3, simultaneous multiple comparison procedures are proposed and their distributions are discussed.
200
2
One-Sided Tests
In this section, a brief discussion of one-sided tests for subhypotheses, (q = r < p) and simultaneous one-sided multiple comparisons, (r > q,q < p) will be presented. 2.1
One-Sided Test for Subhypotheses
Let X = (Xi,X2) and P = {P'x, /3'2)', where Xx is n x q, X2 is n x p - q, j3\ and P2 are g x 1 and (p — q) x 1, respectively. Then the linear model in (1), can be written as Y = X1p1+X2p2
+e
(4)
Without loss of generality, the linear hypotheses in (2) and (3) can be written as H0 : f31 =0 and Hi : /31 > 0. Let
3 = (3'i, #>)' be the unrestricted maximum likelihood estimator (MLE) of /3. Let S=(Y-
X0)'(Y
- Xp) + ( 3 - P)'X'X(J3 - p)
Then the likelihood ratio statistic is equivalent to min S — min S PeH0 PeHi =
min(3 -/31)'Ar11(3i-/5i) PeH0 1 min(31-j31)'An1(31-/31)
-
/3€H,
= 3'1An13i - 0i - fiY*uX@i - Fx) ^Pl'A^Pl where P\ is the MLE of Pi under Hi and A n = (-X1-X1 — XXX2{X2X2)~ 2
Under Ho and known a , the test statistic X2=o-2pl'A^P*i has a ^-distribution given by
X2Xi)
.
201 9
where the weights Uj are nonnegative satisfying X^LQWJ = ^ arK * Xo = 0Computation of these weights depend on q, A n and the alternative hypothesis Hi. More details on o/j-'s are provided in Shapiro 12 sec. 5. For the one-way layouts, uij 's are called level probabilities and are discussed in detail in Barlow et al. 1. 2.2
One-Sided Multiple Comparisons
Again consider the linear model (4), and assume that we are interested in testing Hio and Hu where Hi0 : ctfi = 0 and Ha : d£x > 0, i = 1, • • • , r > « Let /3 = (JSJ, /3 2 )' be the unrestricted MLE of/3 and let S2 be an unbiased estimator of a2. Define the test statistics Ti = t@1/S(dikllCi)1/2,
i = l,---,r.
Then we reject H,o in favor of Hu — H^ if
where ca is upper a-th quantile of the distribution of max; T;. We note that Ti, • • • , T r have a joint r-dimensional t-distribution with dispersion matrix R = D _ 1 / 2 C ' A n C - D ~ 1 / 2 , where D = diag^AnCi, • • • , c^Ancv), see Westfall et al. (1999), chapter 5. 3
Multiple Comparisons Under Order Restrictions
Consider the linear model in (4) and assume further that the parameter space is subject to the constraint that (31 s K., where /C is a closed convex cone. The simple order restriction, considered in the cited literature in Section 1 is a member of /C. Suppose that we are interested in testing H^ against Hu where Hi0 : ci/31 = 0 and Hn : ci/31 > 0, i = 1 , . . . , r > q
202
Let P{ denote the orthogonal projection of /31 onto /C, i.e. /3j is the solution of min ( 3 i - ^ i )
An(3i-/3,).
Define the test statistics T* = cfijo
(c-AnCi)
, i = l,...,r.
Using the Union-Intersection principle of Roy (1953), we reject HIQ in favor of HiX - iifjo if T*>ca,
i = !,•••
,r
where ca is upper a-th quantile of the distribution of max; T*. By converting the simultaneous testing problem to simultaneous confidence region, we obtain simultaneous upper confidence limits with family confidence coefficient 1 — a given by c^^c^i-c^KAnC;)1/2,
i = l , . . . ,r
To find the critical values of the sampling distribution of max; T*, in most cases one has to resort to simulation techniques. This is a common practice in testing under order restrictions as well as situations where simultaneous multiple comparison techniques are used. This is the case because in both inferential approaches, except for the balanced analysis of variance designs, the sampling distribution of the resulting statistic is intractable, hence there is no choice but to resort to simulation. Similarly, if we are testing HM against the alternative Hii-.difa^O,
i = l,---
,r>q
we reject Hi0 if \T*\ > da/2, where da is the upper a-th quantile of the distribution of max;|T*|. Finally, a simultaneous confidence region with a family confidence coefficient 1 — a is given by cjfr € {<$[ ± ^ ^ ( ^ n c ) 1 / 2 } ,
i=
l,---,q.
To simulate the null distribution of the test statistics, the following algorithm may be used. 1. Generate E\, • • • ,en from a standard normal distribution.
203
2. Generate Y\, • • • ,Yn using the model Y = model under the Ho — f]t #;o-
which is the reduced
XR/3R+£
3. Calculate Tf, • • • , Tr* (\T;\, ••• , \T*\). 4. Calculate maxi<j< r maxi<;< r \T*\T* for the one-sided (maxi<;< r \T*\ for the two-sided) simultaneous confidence interval. 5. Repeat the process M times. 6. Find ca(da/2) the upper a-th (a/2-th)quantile m&xi
of
Example
Our dataset comes from Montgomery 9 ( pp. 170). Here, an engineer is studying the effect of cutting speed on the rate of metal removal in a machining operation. However, the rate of metal removal is also related to the hardness of the test specimen. Five observations are taken at each cutting speed. The amount of metal removed (y) and the hardness of the specimen (x) are shown in the following table: 1000
Cutting Speed (rpm) 1200 1400
X
V
X
68 90 98 77 88
120 140 150 125 136
112 94 65 74 85
X
y
165 140 120 125 133
118 82 73 92 80
y
175 132 124 141 130
W e assumed analysis of covariance model y{j = (j, + Ti + 7 (xij -x..)
+ Eij for i = 1, 2,3 and j = 1 , . . . ,5
where e^ ~ N (0,CT2). Since the rate of metal removal can be expected to be monotonically increasing in cutting speed it is conceivable that Ti <
T2
< T3.
We then considered the following sets of hypotheses: #10 : r 2 - ri = 0 #20 : r 3 - n = 0 #30 : T 3 - r 2 = 0
#11 : r 2 - n > 0 #21 : r 3 - ri > 0 #31 : r 3 - T 2 > 0
(5)
204
1.6-
0.5 -
0.0
J
/
f^
1
1
1
1
0.0
0.6
1.0
1.5
2.0
'
Figure 1: Density Plot of max; T* Next, identifiability constraint ^ ; T; = 0 was imposed. Under this, constraint our omnibus null hypothesis was H0:TI=T2=T3=
0.
For the reduced model, under the omnibus null, we obtained the following parameter estimates: 7 = 0.9299 fi = y = 86.40 a = 2.744 We simulated from the estimated reduced model. Statistics Z\*, T% and T£, associated with following hypotheses sets (Hw, Hn), (H20, #21) and (H30, #31) respectively, were computed and the quantity max;T* was obtained. This process was repeated 9,999 times to yield a total of 10,000 simulated max; T* values. Next, ca, an estimate for ca, the upper a-th quantile of the max;T*, was obtained by first ordering the max;T/ values from smallest to largest and taking our estimate to be the realization of the 10,000 (1 — a)-th order statistic. Note that, for our example a was set at 0.05 which guaranteed an integral 10,000 (1 — a) value. Our simulated 0.05-th quantile of the max; T* turned out to be 1.1652. Figure 1 shows the S-Plus density plot of the simulated max; T* values.
205
For our example, the observed values of the T*s were as follows:
i
2?
1 2 3
0.4157 0.4237 0.2638
resulting in failure to reject Hio for i = 1,2,3. 5
Conclusion
In this article, we have proposed a method for making simultaneous one-sided and two-sided multiple comparisons for linear models, when the parameters of interest belong to a closed convex cone. The exact distributions for the resulting statistics are derived. However, because of complexity of the distributions for all except the balanced analysis of variance designs, one must obtain the critical values by simulation. An algorithm for generating the approximate values of these quantiles is proposed. Lastly, we applied this procedure to a data set. References 1. R.E. Barlow, D.J. Bartholomew, J.M. Bremner and H. Brunk, Statistical inference under order restrictions (John Wiley, New York, 1972). 2. A.J. Hayter, Journal of the American Statistical Association 85, 778 (1990). 3. Y. Hochberg and A.C. Tamhane, Multiple comparison procedures (John Wiley, New York, 1987). 4. H. Mansouri, Journal of Statistical Planning and Inference 24, 107 (1990). 5. H. Mansouri, Distribution-free multiple comparisons for incomplete data (Submitted for publication, 2000). 6. H. Mansouri and C. Shaw, Nonparametric multiple comparison procedures for ordered parameters in balanced incomplete blocks (Submitted for publication, 1999). 7. R. Marcus, Communications in Statistics-Theory and Methods 1 1 , 615 (1982). 8. R. Marcus and E. Peritz, Journal of the Royal Statistical Society, Ser. B 38, 157 (1976). 9. D. Montgomery, Design and Analysis of Experiments, 4th edition (John Wiley, New York, 1997).
206
10. T. Robertson, F.T. Wright and R.L. Dykstra, Order restricted statistical inference (John Wiley, New York, 1988). 11. S.N. Roy, Annals Of Mathematical Statistics 24, 220 (1953). 12. A. Shapiro, International Statistical Review 56, 49 (1988). 13. P.H. Westfall, R.D. Tobias, D. Rom, R.D. Wolfinger and Y. Hochberg, Multiple comparisons and multiple tests using the SAS system (SAS Institute, Cary, NC, USA 1999). 14. D.A. Williams, Biometrika 64, 9 (1977).
TESTING GOODNESS-OF-FIT OF THE G A M M A MODELS
C A R O L E. M A R C H E T T I Department
of Mathematics and Statistics, Rochester Institute Rochester, NY 14623, USA E-mail: [email protected]
of
Technology,
G O V I N D S. M U D H O L K A R A N D G R E G O R Y E . W I L D I N G Department
of Biostatistics, University of Rochester, Rochester, NY 14642, E-mail: [email protected]; [email protected]
USA
Gamma models are extensively employed in statistical analyses, especially for reliability and lifetime data. Assessing the appropriateness of these models is of obvious importance. While characterization theorems in probability and statistics are widely appreciated for their role in clarifying the structures of families of probability distributions, they are not as well recognized for being natural, logical and effective starting points for constructing tests for goodness-of-fit problems. It is well known that two independent identically distributed random variables ( X i , X2) have a gamma distribution if, and only if, X i + X% and X1/X2 are independent. This characterization is used by Locke 1 8 to propose a method for testing the composite gamma hypothesis. It is based upon the random creation of n pairs from a sample of size 2n, and using tests of bivariate independence, such as those based on the rank correlation coefficient or Kendall's tau, for testing independence of the sums and the ratios. However, the tests of bivariate independence he considered are consistent against only some dependence alternatives. Instead, we propose employing the classical rank test due to Hoeffding 9 or its asymptotic equivalent due to Blum, Kiefer and Rosenblatt 2 , which are known to be consistent against all dependence alternatives. The resulting goodness-of-fit tests are consistent against all non-gamma alternatives and, in moderate size samples, offer substantial power advantages. Additionally, an alternative approach, which does not suffer from the caprice of chance implicit in the random pairing, is also considered.
1
Introduction and Summary
Characterization theorems in probability and statistics are generally well appreciated for their aesthetic appeal, mathematical completeness and the light they shed on the structures of the probability distributions. Although logically self evident, but not well recognized, is the fact that they can be natural and effective bases for constructing goodness-of-fit (GOF) tests needed for assessing the validity of models based on parametric families, such as normal, exponential, and inverse Gaussian (IG), commonly used in statistical practice. The earliest explicit use of a characterization theorem for constructing a goodness-of-fit test is by Vasicek 37 , who used Shannon's maximum entropy characterization to construct a test for the composite hypothesis of normality.
207
208
Now there exists a substantial body of literature on the goodness-of-fit tests based on characterization theorems. Maximum entropy characterizations have been used by Gokhale 7 and Mudholkar and Lin 27 to construct goodness-offit tests for exponentiality, and more recently by Mudholkar and Tian 30 to construct a GOF test for the inverse Gaussian model. Characterization results based on the statistical independence of sample statistics have been used by Lin and Mudholkar 15 and Mudholkar, Marchetti and Lin 2 8 to construct GOF tests for normality, and by Mudholkar, Natarajan and Chaubey 29 to test inverse Gaussian goodness-of-fit. For an overview of characterizations and goodness-of-fit, see Marchetti and Mudholkar 20 . The gamma family of unimodal, right-skewed distributions with nonnegative support, has been used to model a wide variety of applications in diverse fields such as geology (McCullagh and Lang 2 5 ) , ecology (Matis, Rubink, and Makela 2 3 ) , inventory control and queuing (Yeh 4 0 ) , economics (McDonald and Jensen 2 6 ) , meteorology (Bougeault 3 ) , reliability (Reiser and Rocke 3 5 ) , biomedical studies (Tan 36 ) and genetics (Yang 3 9 ) . Testing the appropriateness of the gamma models for use in these applications is vital. Empirical distribution function (EDF) tests for the composite gamma hypothesis have been developed by Pettitt and Stephens 34 and Lockhart and Stephens 16 . For an account of these, see D'Agostino and Stephens 4 . It is well known that the distribution of two IID random variables (X\,X2) is gamma if and only if X\ + X2 and X\jX2 are independent (see Lukacs, 1 9 ). Locke 17,Locke76 proposed testing the composite gamma hypothesis by creating n pairs from a sample of size 2n and testing the independence using procedures such as the one based on Kendall's tau. Since the rank tests of bivariate independence he considered are consistent against only some dependence alternatives, the resulting gamma GOF tests lack consistency against all non-gamma alternatives. The existing goodness-of-fit tests of the composite gamma hypotheses are reviewed in Section 2. A modification of one of these tests, together with a comparative study of its power function is presented in Section 3. The modified test, which is consistent against all non-gamma alternatives, offers a power advantage with moderate sample sizes. In Section 4, an alternative approach is considered. Section 5 is given to conclusions. 2
The Gamma Goodness-of-Fit Tests
The literature on testing goodness-of-fit for the gamma models is rather scant, and most of the proposed solutions lack simplicity for use in practice. The tests available, which may be grouped as the EDF tests and the tests based on characterization results, are now reviewed.
209 2.1
The Empirical Distribution Function Tests
The empirical distribution function for a sample Xi,X%, ...,X„ is defined by Fn(x) = # ( X j < x)/n. To test that the sample is from a population with a distribution function F(x) = F0(x), where F0(x) is fully specified, one may measure the discrepancy between the Fn(x) and F0(x). Kolmogorov 13 introduced the first EDF statistic, D, by defining D+ = sup{Fn{x) - F0{x)},
(1)
X
£>_ = sup{F0(a;) - Fn(x)},
and
(2)
X
D = sup \Fn{x) - F0(x)\ = max(D + , DJ).
(3)
X
The Cramer-von Mises family is a class of statistics of the form oo
/
{Fn(x)-F0{x)}21>{x)dF[x).
(4)
-oo
When ip(x) = 1, the Cramer-von Mises W2 statistic is obtained; when i[>(x) = [F(x)(l — F(x))}~1, one obtains the Anderson-Darling * statistic A2. The EDF tests have been adapted for the composite goodness-of-fit hypothesis F(x) = F0(x, 6), where F0 is known but 6 is not, by plugging in estimated 0 for 6. Pettitt and Stephens 34 considered the problem of testing the composite gamma hypothesis in which the shape parameter a is known and scale parameter /? is unknown. They provide asymptotic percentiles for three EDF statistics for the problem. Lockhart and Stephens 16 examined the situation of a gamma distribution with unknown shape parameter and scale parameter either known or unknown. They suggest using a maximum likelihood estimator of a and give asymptotic percentiles for the resulting EDF statistics. D'Agostino and Stephens 4 provide an excellent overview of goodness-of-fit for the gamma and other distributions. 2.2
Characterizations and Gamma GOF Tests
A nice summary of characterizations of the gamma family, from the earliest results due to Nabeya 3 3 , Goodman 8 , Mauldon 2 4 and Laha 14 , appears in Johnson, Kotz and Balakrishnan 12 . The simplest of these, the following result due to Lukacs 19 , is the basis of Locke's 18 test. Proposition 2.1: Two independent random variables (X\,X2) have a common gamma distribution iff X\ + X-i and X\jXi are independent.
210
Extensions, applications and reviews of the above characterization may be found in Findeisen 6 , Marsaglia 21,22 and Wang 38 . Since this characterization is asymmetric in Xt and X2, Locke 17 considers the dependence in terms of U = Xi + X2 and V = ma.x(X1/X2, X2/X1). Before discussing the specifics of Locke's test we note a general maximum entropy result, from which characterizations of several members of the exponential family have been fruitfully used to develop consistent goodnessof-fit tests. The examples include the maximum entropy tests of the composite hypotheses of normality (Vasicek 3 7 ) , uniformity (Dudewicz and van der Meulen 5 ) , exponentiality (Gokhale 7 , Mudholkar and Lin 27 ) and the inverse Gaussian distribution (Mudholkar and Tian 30>31). Gokhale 7 presents a general discussion of construction and suggests a GOF test for the gamma model, but offers no details. The following is a relatively recent characterization due to Hwang and Hu n which will be considered in Section 4. Proposition 2.2: The mean x and coefficient of variation s/x of a random sample are independent iff the population is gamma. 2.3
Locke's Tests
Specifically, Locke 18 proposed testing the composite gamma hypothesis by randomly pairing observations from a sample of size 2n, to obtain n pairs (Yi,Zi), where Yi = X2i + X2i-i,
Zi = m&x{X2i/X2i-i,X2i-i/X2i},
(5)
i = 1,2, ...,n. Then testing the composite gamma hypothesis is equivalent to testing the independence of Y and Z. Locke employs the quadrant tests and tests based on the rank correlation coefficient and Kendall's tau for this purpose. An obvious drawback to this method is that with random pairing, the same data set can produce different results. However, more importantly the rank tests that Locke considers are consistent against only some dependence alternatives. In the next section, we propose a modification of Locke's procedure that addresses the consistency issue. 3
A Consistent Modification of Locke's Procedure
Locke's 18 approach of randomly pairing the observations in the sample can be combined with use of the consistent tests of bivariate independence due to Hoeffding 9 and Blum, Kiefer and Rosenblatt 2 to construct gamma GOF
211
tests which would be consistent against all non-gamma alternatives. These modifications would be considered improvements over Locke's test provided they show power advantages with moderate size samples. In this section, we first describe the consistent tests of bivariate independence, the modification of Locke's procedure, and then present the results of a study of the power functions. 3.1
Two Consistent Tests of Bivariate Independence
Hoeffding's
9
notes that D(x, y) = FX,Y ( * , » ) - FX (x) FY (y) = 0,
(6)
if and only if X and Y are independent. Nonparameteric estimation of the quantity J D2(x, y)dF(x, y) results in the statistic,
_Q-2(n-2)R+(n-2)(n-3)S n(n-l)(n-2)(n-3)(n-4)
'
^'
in which n
Q = J2(Ri - !)(*i - 2)(Si ~ !)(£ - 2),
(8)
1=1 71
22 = 5^(i2i-2)(5 i -2)c i ,
s
i=i n
= I > < - 1)Ci-
(9)
(10)
t=i
where Ri and Si are the respective ranks of Xi among the A"'s and Yi among the y ' s , and Ci is the number of bivariate observations (Xj,Yj) for which Xj < Xi and Yj < Yi. Hoeffding 9 provides the exact null distribution of nDn for sample sizes n = 5,6,7 which is extended to n = 8,9 by Hollander and Wolfe 10 . However, Hoeffding 9 was unable to obtain the asymptotic distribution of nDn. Blum, Kiefer and Rosenblatt 2 consider a computationally simpler statistic, n
Bn = n- 5 J2 Wi(i)N4(i) - N2(i)N3(i)f,
(11)
t=i
where N\(i), N2(i), Ns(i), and JV4(i) are the number of points lying in the four quadrants determined by the vertical and horizontal lines through the
212
bivariate point (Xi, Yj). Independence is rejected for large values of the test statistic. Blum, Kiefer and Rosenblatt 2 show that nBn is asymptotically equivalent to Hoeffding's 9 statistic and provide the asymptotic percentiles. Mudholkar and Wilding 32 obtain the empirical distribution of the two statistics for small and moderate size samples. A selection of the 10%, 5% and 1% points of the nDn and nBn statistics appears in Table 1. A Gaussian approximation for the nBn statistic, which can be used to obtain both the associated percentiles and probabilities, is currently under development by Mudholkar and Wilding. 3.2
The Modified Test
The tests of bivariate independence due to Hoeffding 9 and due to Blum, Kiefer and Rosenblatt 2 may be used for testing the independence of Y and Z. Since the two rank tests are consistent against all dependence alternatives, it is obvious that the modified versions of Locke's procedure using these tests will produce goodness-of-fit tests which are consistent against all non-gamma alternatives. Let Xi,X2,..., Xin be a random sample from a population with d.f. F and consider the problem of testing the composite hypothesis that F(-) is the d.f. of a gamma random variable. If the sample size is odd, as in Locke 18 , delete an observation at random. We propose testing the composite hypothesis by randomly pairing observations from the sample, to obtain n pairs (Yi, Zi), as given in (5). Testing the composite gamma hypothesis is equivalent to testing the independence of Y and Z. The rank tests due to Hoeffding 9 and to Blum, Kiefer and Rosenblatt 2 are employed for this purpose. 3.3
A Monte Carlo Experiment
A Monte Carlo experiment was conducted using 100,000 simulated samples of size n = 20,30,50,100 from a variety of non-negative populations. Each sample was subjected to the above tests of the composite gamma hypothesis which are based on Locke's premise and use tests of bivariate independence due to Kendall, Spearman, Hoeffding and Blum et al. A selection of the results of the simulation experiment are shown in Table 2. In the population column, GTL stands for Generalized Tukey-Lambda family. Where parameters are unspecified, the standard distribution is used. The Spearman, Hoeffding and Blum et al. tests have excellent Type I Error control, but the Kendall test is not so accurate in this respect for small or even moderate sample sizes. For most of the populations, but not all, the tests
H
h-* 03 os
o
•d
o o o "o o o
o w 3
to
a
to
s &
0> B
o •o
T)
3
a
3 CO 0. e> T3
3 b
a
o
X 0
-J
OS
to
X p
_G1
j»
»
•<
V
X
s 3 s II II II
o>
rem
00 OS to
o
00
o cn
OS CO
O
Cn tO
o
OS
o
O CO
O O t—i l—11—'
O O O O O
O O O O O O O O O O C n c n e n e n Cn h-> l-» tO tO 4^.
O O O O O O
O O O O O O 4s. 4*. 4^ 4^ en cn
O O O O O O
O O O O O
OOOOtOOO •vltOOSi—'4^
ososos-a—i
O O O O O
O O O O O
eocnoco-4
O O O O O Cncn en e n e n 4^. 4^- e n e n e n
O O O O O O O O O O O O O O O O O O O h - » » - * O O O O O O to to1to to to to tOoS t- Jot tOoOoC oO O O O O O (-> i- to co *>- en osoocotoooo oo to en os co 4^4^enos-a t0 4^-J*>-tO
4^4^-cnasoo co to os en co
O O O O O OS OS OS OS OS
O O O O O
ootototooo to o co os i—i - j t o o s t o o o o
O O O O O O OS OS OS OS OS OS I—' I—'I—' t o t O CO INS CO - q h-" - J - a
~q~a-q--j-q
O O O O O
p p o o o
O O O O O
cocoeoi^ii^ OOtOCOOO toeoash-'cn
O O O O O
OSCOl—'4S*^I
O O O O O - 1 0 0 OOOOOO Q 0 O M W O 1
O O O O O
ossow
O O O l — l I—> 0 0 0 0 t O HJ t O
O O O O O
l-'l-'tOCO^ I—'OSrfi.1—'O
O O O O O
O O O O O
Cn en en c n c n
O O O O O o o o o o
^]I->COCOO
O O O O O CO CO CO C O O
O O O O O
tOOO-JOSCn
O O O O O
O O O O O
O O O O O M M H t O M O M U O f f l t O i ^ W O O ) CnCOCOO—1
—a I — ' O S O O S
t—» h-> t-J h-> t O Co o s - q t o n - 1 - J CO t o C n 0 0
O O O O O
OtOH-iCOtO
O O O O O -a—i-i-aoo cnos-qooo O O O O O
O O O O O
Otf^toenoo
en^jh-icoo
to toco 4*. 4*-
O O O O O
00 00 00 00 tO tO*--vI-q-
o o o o o
O O O O O O O O O O O O O O O en en os os os OS OS OS — 1 — I OOtOOI-'tO co j i . - q i - ' t o to os 4^- co os t O t O C n O O O O O O O
ostotooso torf^tooen
O O O O O O O O O I - " 00 00 to t O O
enosoooto enoscnoto
O O O O O OOOl—'I—» enososoos
to os oo -q i-> en oo co c o o
O O t P O H M
t o t o co co eo
O O O O O
O O O O O
Ji.COtOl-'O
rfi. 4*. 4 ^ 4 ^ 4^-4i.rf*.4^cncn h-> h-> IO C O ^ l-'00 4*.COCO
O O O O O O O O O O
cn c n - q - ^ i as oo to !-• co 4^. os *». to i—' en 00i->0ocni>o tot—' e n c n t o
O O O O O OSOSOS-
O O O O O
tOlfc-OI—'Cn
>j^cnos-
O O O O O CO CO CO CO CO
O O O O O O OS OS OS OS OS OS t o CO CO £>. ifc. OS e n CO »£. 4*- 0 0 1 - 1
O O O O O O
O O O O O O CO CO CO CO CO CO t O t O CO CO CO £ t OS OS O d ^ - • ! c n
O O O O O O
>ti. os-q o to ~q
O I - CO 0 0 i ^ CO
o o o o o O O O O O O O O O O to to t o to t o to to t o to t o to to to t o t o t o1 t o t o co rf^ rf». 4*- C n e n e n os os -a -a oo
to to
O O O O O
t o to to to t o to
O O O O O O
O O O O O
tOOO^ICSCn
O O O O O O
O O O O O
*kCOtOI->0
to to t o to to
O
h l i - 4 i . CO CJ t o O I O O I O O T
O
o
I—> CO 0 0 - 4 OS Cn O O O O O O
o o o
o o
bico
O O
I—'00
* > > - ' bs** **tO
oo
CO 0 0 00 ~J Cn l—'
O O
0
3
71
p o 0 0 bobs bo OS e n * - C n
Cn CO e n CO O O 0 0
Cn
i-
co t o 0 0 COCO COCO - q c o
0
0
en co
e n CO 0 0 0
£
CD
3
CD
93
en-a
o o o p pp pp 0 0 pp co to bs to c o t o bobs bs** * * t o t—'h-' * * O S 00 en ~3CS C O O t O * 0 5 0 >->0 coco **co ~ q c o C n e n
e n 0 0 1—'en
coco
po
**io
pp
en co bo*- ob>
cobs o'o
0 0 0 0 O O 0 0 0 0 ent—' e n CO h - ' 0 0 CO 0 0 0 0 0 0 - J CO c o t o c n e n e n h—t 0 0 - q O C n C 0 I - ' 0 0 t o O O
-<JCO
pp
O O 0 0 * * t o en t o CO 0 0 rf-»oo h-' C n c n e n CO t O OS CO O O 00 to
pp o p p p pp 0 0 p p pp o p ^ C n bsco * * t o - ~ J * . c o t o toiOSCn * * 0 0 0 0 COOS 0 0
a a. s>
CD
s
a
0
CM
3"
0 CD
TO
Blum
COrfa. C O - q co t o COlt^
t—» h - '
O O
coco
cn co <=><=>
~h - '
1-1
o o pp p p p o o p o p p p pp p o 0 0 O t - ' t o n - " bsco c o t o e n CO coi- 1 bocn '<=>'<=> co to 0^ 0C *n. C1—'CO t O * - O - J COCO t o t o !-> 0 0 CO 0 0 ***coo CO** c o o c o c o 0 0 1 e n - 4 CO e n O O ) CO 0 0 c o c o coo
0 0 1 ^ j i - tot-" O r CO C © ^ 0 0 CO * - t O OS 0 0 C O J ^ 0 0 CO C O ^ J 4 ^ CO o - < i coi—» C O O C O l - ' c o t o rf^to
po pp
f-iO
o o o p p p p o o o - 1 bs j ^ b s ^ bsco CC SOC^O t- o4 t OS COOi I U O o - * i* t O - 4 C O oo-a o * > COCO
t—»t—*
0 0
to
t—'
J°
"as
J-;
e n CO e n c o C n CO 0 0 O O 0 0
H
"1
_en
1
1—»
o o o o o p p o o p o p o p 0 0 p o 0 0 o p pp c o t o C n CO bs j ^ e n j ^ co bo t O M ^ C n c o t o bobs ct oO Mt o t* *OoMo bs** O S C n COl—' ~ j * - 0 1 0 0 1 — ' 1 — ' t o t o - a 0 1 0 0 0 1 t o os rf^rf^ 0 0 t O t o c o O O ! c o t — ' C n o s co en #*-^ en»—» 0 0 0 1 0 0 ~ J e n o
o o
en co
to
"1
f
t^ 50
0
0
Spe;arman
o o
0 0 1—' h - ' rf^O COl—'
N3 0 1
!—'•-' l*^H^
o o
Uni form
o o
Par to(5
i n CO On CO O r CO Cn CO O O O O O O
ormal
•o
-ull( lue
CnCO
Joh son
cnco O O
~h—'
(2.5
Cd
a
1—'
i3
F(3
"co
0Q 3
O H tr^ Bet
W W
Pea
o a <
en
O
O
Ext
*-l cn
t—t
Gam
co
S*
1
o
Populati
o
Cd
3
pv o
•<;
3
0
I
•o
M
0 01
10
5?
II 0
0
esi 0
p-
0 0
CD
3"
95
3 rr t0.
CD
a c-ta" 0
95
3" 3
c-t-
CO
xs cr 0 95 3 CD m
3
00
g
95
3
a
CD
0=1
2.
co 3
a. c
a S
sa o-
0 - 95 95 a
0
-
ffi
l-t>
0
CO
3 CD 3" O
0
CD
3" 3"
a m rt-
0
ent sed
215
4
A n Alternative Approach
To remove the caprice inherent in the random pairing of observations, another approach to testing goodness-of-fit for the gamma distribution is in progress. It is based upon the following recent characterization of the gamma distribution due to Hwang and Hu u , given in Proposition 2.2. For developing the test we use the technique introduced by Lin and Mudholkar 15 in their test for normality. Given a random sample x\, X2, • ••, xn, we create n pairs (£_i, c_;) by removing one observation at a time from the sample. The product moment correlation between the pairs is used to measure dependence between x and c. We therefore consider the statistic r =
EiU(5-i-g)(c-i-c) > / £ " = i ( 2 - i - S) 2 E L i ( c - i - c) 2
( 12 x
where c = n _ 1 £ " = 1 c_j. The asymptotic null distribution of r, however, depends on the shape parameter of the gamma distribution a. y/nr-+N(o,3+™).
(13)
One can compute the maximum likelihood estimate & and substitute to obtain a measure of the standard error, but such estimation requires solving a nonlinear equation involving the di-gamma function. There is also the question of use with small samples. Investigation of these issues is in progress. 5
Conclusion
In this paper we have examined the use of Hoeffding's 9 test of bivariate independence and its asymptotic equivalent due to Blum, Kiefer and Rosenblatt 2 , to obtain consistent modifications of Locke's test of the composite gamma hypothesis. It is seen that the modified tests have substantial power advantages over the original. References 1. T.W. Anderson and D.A. Darling, Journal of the American Statistical Association 49, 765 (1954). 2. J. R. Blum, J. Kiefer and M. Rosenblatt, The Annals of Mathematical Statistics 32, 485 (1961). 3. P. Bougeault, Journal of the Atmospheric Sciences 39, 2691 (1982).
216
4. R.B. D'Agostino and M.A. Stephens, Goodness-of-fit Techniques ( Marcel Dekker, New York, 1986). 5. E. Dudewicz and E.C. van der Meulen, Journal of the American Statistical Association 76, 967 (1981). 6. P. Findeisen, Annals of Statistics 6, 1165 (1978). 7. D.V. Gokhale, Computational Statistics and Data Analysis 96, 157 (1983). 8. L.A. Goodman, Annals of the Institute of Statistical Mathematics 3, 123 (1952). 9. W. Hoeffding, The Annals of Mathematical Statistics 19, 546 (1948). 10. M. Hollander and D.A. Wolfe, Nonparametric Statistical Methods (John Wiley, New York, 1999). 11. T.Y. Hwang and C.Y. Hu, Annals of the Institute of Statistical Mathematics 5 1 , 749 (1999). 12. N.L. Johnson, S. Kotz and N. Balakrishnan, Continuous Univariate Distributions (John Wiley, New York, 1994). 13. A. Kolmogorov, Gior. 1st. Ital. Attuari 4, 83 (1933). 14. R.G. Laha, Annals of Mathematical Statistics 25, 784 (1954). 15. C.T. Lin and G.S. Mudholkar, Biometrika 67, 455 (1980). 16. R.A. Lockhart, and M.A. Stephens, Goodness-of-fit tests for the gamma distribution. (Technical Report, Department of Mathematics and Statistics, Simon Fraser University, 1985). 17. C. Locke, Communications in Statistics - Theory and Methods 4, 357 (1975). 18. C. Locke, Communications in Statistics - Theory and Methods 5, 351 (1976). 19. E. Lukacs, The Annals of Mathematical Statistics 26, 319 (1955). 20. C.E. Marchetti and G.S. Mudholkar, Goodness-of-Fit Tests and Model Validity (Birkhauser, Boston, 2001). 21. G. Marsaglia, Proceedings of the Symposium on Statistics and Related Topics (Carleton Univ., Ottawa, Ont., 1974). 22. G. Marsaglia, In Contributions to Probability and Statistics. Essays in Honor of Ingram Olkin (Springer-Verlag, New York, 1989). 23. J. H. Matis, W. L. Rubink and M. Makela, Environmental Entomology 21, 436 (1992). 24. J.G. Mauldon, Quarterly Journal of Mathematics 27, 155 (1956). 25. P. McCullagh and P. Lang, Journal of the Royal Statistical Society B 46, 344 (1984). 26. J.B. McDonald and B.C. Jensen, Journal of the American Statistical Association 74, 856 (1979).
217
27. G.S. Mudholkar and C.T. Lin, Colloquia Mathematica Societatis, Janos Bolyai 45, 395 (1984). 28. G.S. Mudholkar, C.E. Marchetti and C.T. Lin, Jounal of Statistical Planning and Inference (To appear, 2002). 29. G.S. Mudholkar, R. Natarajan and Y. P. Chaubey, Sankhya Ser. B 63, 362 (2002). 30. G.S. Mudholkar and L. Tian, in Jounal of Statistical Planning and Inference (To appear, 2002). 31. G.S. Mudholkar and L. Tian, Communications in Statistics - Theory and Methods (To appear, 2002). 32. G.S. Mudholkar and G.E. Wilding, On the conventional wisdom regarding two consistent tests of bivariate independence (Technical Report, Department of Biostatistics, University of Rochester, 2001). 33. S. Nabeya, Annals of the Institute of Statistical Mathematics 2, 13 (1950). 34. A.N. Pettitt and M.A.Stephens, EDF Statistics for Testing the Gamma Distribution (Technical Report, Department of Statistics, Stanford University, 1983). 35. B. Reiser, and D.M. Rocke, IAPQR Transactions, Journal of the Indian Association for Productivity, Quality & Reliability 18, 1 (1993). 36. W. Y. Tan, Biometrical Journal 37, 319 (1995). 37. O. Vasicek, Journal of the Royal Statistical Society B 38, 54 (1976). 38. Y.H. Wang, Analytical Methods in Probability Theory, Lecture Notes in Math. 861, 166 (Springer, New York, 1981). 39. Z. Yang, Journal of Molecular Evolution 39, 105 (1994). 40. Q. Yeh, Production and Inventory Management 38, 51 (1997).
O N FRAILTY MODELS A N D C O P U L A S DAVID OAKES Department of Bio statistics, University of Rochester Medical Center 601 Elmwood Avenue Box 630, Rochester NY 14642, USA E-mail: [email protected] We review the connection between archimedean copula models and frailty models for bivariate failure-time data, emphasizing applications in biostatistics, reliability and extreme value theory. A new method for semiparametric estimation of the dependence parameter in an archimedean copula model is proposed.
1
Bivariate Frailty Models
Many models have been proposed for multivariate data failure-time data (Ti,T 2 ) arising in biostatistics, reliability and other applications. The classical model of Marshall and Olkin 16 has a discontinuity along the diagonal Ti = Ti which may be appropriate if the Tj represent times to failure of a machine component and failure of one component results in increased stress on the other component. However this model would not be appropriate for the data reported in Table 1 of Oakes 21 from Lawless 12 (p. 477) in which Ti is the time to appearance of a fracture in a component and T2 is the subsequent time to failure. A plot of the data, shown in Oakes 21 (Figure 2) suggests that Ti and T2 are not independent so that a model is needed which allows the bivariate survivor function pr(Ti > t\,Ti > £2) = S{t\,t-i) to be continuous over the whole positive quadrant while still allowing dependence between T\ and T 2 . Oakes 22 analyzed an example of Gumbel and MustafA 8 from extreme value theory. Here T\ is the height in inches of the flood of the Fox River at Berlin, WI (upstream) and T2 the height at Wrightstown WI (downstream). Of the many possible models we focus on one class, bivariate frailty models, which allow for very flexible modeling of the marginal distributions of T\ and T2 and, separately, of the dependence between them. We assume that there is an unobserved random variable W, called here a frailty, common to Ti and T 2 , and such that Ti and T2 are conditionally independent given W. In the reliability example above, W might represent the susceptibility or inherent weakness of the component. In studies of human populations W might represent the influence on mortality of unmeasured genetic or environmental factors common to two related individuals. Each of Ti and T2 follows a proportional hazards model in W, so that Pv(Tj>tj\W
= w) =
218
Bj(tJr,
219
where the Bj(tj) are baseline survivor functions, corresponding to a component with unit frailty W = 1. The law of iterated expectations gives S(tut2)
= E{pr(T! > tuT2 > h\W)}
=
E{B1{tl)wB2{t2)w)
= p[-log{fli(ii)}-log{B 2 (i 2 )}]. where p{-) = E { e x p ( - • W)} is the Laplace transform of the distribution of W. Often the Bj(tj) will not be observable. However if we write q(-) for the inverse function to p(-) we obtain the expression S(h,t2)
=p[q{S1(t1)}
+q{S2(t2)}}
for S(ti,t2) in terms of its marginal survivor functions Si(ti) = S(ti,0) and 52(42) = S(0,t2). Genest and MacKay 5 described these as "archimedean copula models". When W follows a gamma distribution with unit scale parameter and index K, we have p{s) = (1 + s)~K and
S{tut2)
~ [l-log{Bi(*i)}-log{B2(t2)}J .5i(«i)(- 1 /*) +S2(t2)(-i/*)
-1_ '
When Bj(tj) = exp(-pjtj) we obtain the bivariate Pareto distribution S(h,t2) = (1 4- pih + p2t2)~K. The form of S{ti,t2) with exponential marginals is mentioned in Johnson and Kotz u (p. 288), but the general form appears to have been first proposed by Clayton 2 , who also gave an alternative derivation via local odds ratios that is described below. The model was later given independently by Lindley and Singpurwalla 14 who viewed the W as the equivalent of a "random environment". By allowing W to have the so-called "positive stable distribution" (Feller, 3 Chapter XII) with Laplace transform p(s) — exp(—s a ), for some 0 < a < 1, we obtain a family considered by Gumbel, 7 with S(tut2) Hougaard 23
9
= exp{-([-log{5 1 (t 1 )}]( 1 / a ) +
{-\og{S2(t2)}]^r}.
gave the derivation via frailties. See also Oakes and Manatunga
220
2
Local Odds ratios
The question arises as to whether the distribution of W is determined by 5(^1,^2)- Oakes 22 answered this question in the affirmative, by considering the local odds ratio
Here the superscripts denote order of differentiation in h and t 2 so that the terms in the numerator are the joint density and survivor functions of (Ti, T2) and those in the denominator are the two mixed survivor-density functions. The function 9*(ti,t2) may be thought of as the odds ratio of the 2 x 2 table formed by dichotomizing each side of the region above and to the right of the point (ti,t2) at ti + Ai and t2 + A 2 respectively, where the Aj are small. It also equals the ratio of the conditional hazard functions at i 2 of T2 given Ti = *i and of T2 given Ti > h. It is easily seen that
p'{s)z where s = q{Si(ti)} + q{S2(t2)} = q{S(h,t2)}, and so depends on (ti,t2) only through the value v say of S(t\,t2). That is 6*{t-\_,t2) — 0{S{t\,ti)} for some function 9(v). Moreover, since s = q(v), the inverse function theorem gives
q'(v) which can be integrated twice, showing that 9(v) determines q(v) up to an arbitrary multiplicative factor. Hence p(s) is determined up to a scale factor by 9(v). For Clayton's model, with a gamma frailty distribution, we find that 9(v) = 1/K + 1, i.e. free of (^1,^2), while for Hougaard's model, 9{v) = 1 +
1 a ~ a log(w)'
which declines from infinity to unity as v declines from v — 1 to v = 0. Hougaard's model seems more realistic in most applications.
221 3
The Kendall Distribution
An alternative approach to characterizing archimedean copula models, which include frailty models, was proposed by Genest and Rivest 6 . They considered the (univariate) distribution function K(v) and density k(v) of the random variable V = S(T\,T2), analogous to the usual univariate probability integral transform. Here the distribution of V is not uniform, but has V
'
«'(«)'
V
'
q'(v)2
The similarity of the functional forms of 9(s) in terms of p(s) and k(v) in terms of q(v) is purely coincidental. It is clear however that either K{v) or k(v) determines q(v) up to a constant factor. For Clayton's model we find that K(V)
= (K +
1)V-KV1+1'K,
and for Hougaard's model K(v) = v(l — alogv). For Hougaard's model it turns out that Z — — log(F) has the simple mixed gamma density f(z) = (1 - a + az)exp(-z), a result originally given by Lee 13 . An immediate corollary of Lee's result is that if Z has the mixed gamma density f(z) and U is uniform (0,1), with Z and U independent, then T = UaZ has a unit exponential distribution. This amusing result can be shown directly from the facts that E{Tn) = E(Una)E{Zn)
= —^—-{(1 - a)n! + a(n + 1)!} = n\, na + 1 and that the exponential distribution is determined by its moments (Feller, 3 p. 234). To see the reason for the name Kendall distribution, note that the population analog r of Kendall's coefficient of concordance f can be written T = 4pr(Tj' > TUT^ > T2) - 1 where the pairs (TUT2) and (T{,T£) are drawn independently from S(ti,t2). We have r = 4£{pr(T 1 ' > TUT!, > T2)\TUT2)
- 1 = 4£S(T 1 ,T 2 ) - 1 = 4E(V) - 1.
For bivariate frailty models r = 4 J sp"(s)p(s) r = 1/(1 + 2K), while Hougaard's model has r = Given a random sample {(T{1>,Tj,1'), i = ate distribution, a simple analog estimator K(v)
- 1. Clayton's model has 1 - a. l , . . . , n } from the bivariof K(v) can be calculated
222
as the empirical distribution function K(v) = n _ 1 # { i : Vi < v], where Vi = (n + l)-x#{j : T^ > T^,TP > T 2 W }. Plots of K(v) against v or possibly K(v)/v - 1 against v may be useful diagnostics. Barbe et al. x prove weak convergence of a suitably normalized version of Kendall's process {K(v), 0 < J) < 1} to a Gaussian process and give an expression for the covariance function. Note that the limiting covariance exhibits long-range dependence. A smoother plot may be obtained from risk sets formed from the componentwise minima of two pairs of observations, i.e. from the n(n - l ) / 2 pairs ( T W ) i r W ) ) w h e r e T™ = m i n ( T ^ T ^ ) and T^'j) = min(T 2 W ,T 2 0 ) ). Let Rij denote the size of the corresponding bivariate risk set, i.e. Rij =#{k:
T[k) > T^j\T^]
> Tfj)}.
and
Aij=sign{(T^-T^)(T^-T^)}, the indicator of concordance or discordance for the pair (i,j). that nr./A
_ i |T(«.J) _
prCA^-lIT,
f
-tl,T2
T(i,3)
_ 4.
D
_
r
\ _
It is easily seen
0*(*1,*2)
-h,B*j-T)-r_l
+
e.{hM).
This approach relates closely to the original proposal of Clayton (1978) for estimation of 6 in his model: here the conditional probability that a pair is concordant given the size of the bivariate risk set is simply p
p r ( A y = l\Rtj =r) = r _
1 + e-
For general archimedean copula models pr(Ay = l\Rij = r) Oakes 4
22
6(r/n) r - 1 + 6(r/n)'
gave the exact probability for the positive stable frailty model.
Estimation of a Dependence Parameter
Suppose that a copula model is specified up to a single parameter a indexing the dependence and the marginal distributions are either parameterized fully or are kept arbitrary. Then the following strategies are possible for estimating Q.
223
(a) (Semi)parametric maximum likelihood or an equivalent technique. Results of Murphy 17 ' 18 and Parner 24 for special frailty models and of Murphy and van der Vaart 19 for general semiparametric profile likelihood functions suggest that this approach can be expected to be consistent, asymptotically normal, and asymptotically fully efficient. However there are some questions about its performance in small samples. The EM algorithm can sometimes be used to assist computation. (b) Marginal profile ("pseudo-")likelihood estimation. Fit each marginal separately, either by parametric or nonparametric maximum likelihood, and then assume that each marginal is fixed at its estimated value and maximize the likelihood in a. See, for example, Shih and Louis, 25 Genest, Ghoudi and Rivest 4 and Hougaard 1 0 ) . This approach appears to work well in practice, has a relatively simple asymptotic theory, but is not fully efficient. (c) Use an analog estimator based on Kendall's r or another similar summary measure of correlation (Oakes, 20 Manatunga and Oakes 15 and Genest and Rivest 6 ) . (d) Derive estimating equations based on bivariate risk sets, i.e. consider pr(Ay = l\Rij) - see the discussion at the end of the previous section. (e) Fit the Kendall density k(v) to the distribution of the empirical estimates Vi defined in Section 3. Asymptotic theory needs to be modified to account for the correlations among the Vi. Properties of this estimator are being studied by A. Wang and D.O. It seems to work quite well. Details will be given elsewhere. We used this approach to compute estimates K, and d and the corresponding values f of Kendall's tau from fitting the gamma and positive stable frailty models to the data for the two examples mentioned in Section 1. For the cable insulation data (with f = 0.46) the estimates from the gamma model and positive stable model are respectively Gamma Model: h = 0.46 f = 0.52, Stable Model: a = 0.45 f = 0.55 . For the Fox river data (with f = 0.52) we follow Gumbel and Mustafi 8 and Oakes 22 in applying our models to the inverted ranks, which gives, Gamma Model: k = 0.38 f = 0.57, Stable Model: a - 0.43 f = 0.57 .
Acknowledgments This work was supported by grant R01 52572 from the (US) National Cancer Institute.
224
References 1. P. Barbe, C. Genest, K. Ghoudi and B. Remillard, J. Mult. Anal. 58, 197 (1996). 2. D. G. Clayton, Biometrika 65, 141 (1978). 3. W. Feller, An Introduction to Probability Theory and its Applications, Vol. II, 2nd Ed. (John Wiley, New York, 1971). 4. C. Genest, K. Ghoudi and L.-P. Rivest, Biometrika 82, 543 (1995). 5. C. Genest and R. J. MacKay, Can. J. Statist. 14, 145 (1986). 6. C. Genest and L.-P. Rivest, J. Am. Statist. Assoc. 88, 1034 (1993). 7. E. J. Gumbel, J. Am. Statist. Assoc. 55, 698 (1960). 8. E. J. Gumbel and C.K. Mustafi, J. Am. Statist. Assoc. 62, 569 (1967). 9. P. Hougaard, Biometrika 73, 671 (1986). 10. P. Hougaard,Analysis of Multivariate Survival Data (Springer-Verlag, New York, 2000). 11. N. L. Johnson and S. Kotz, Continuous Univariate Distributions (John Wiley, New York, 1972). 12. J. Lawless, Statistical Models and Methods for Lifetime Data (John Wiley, New York, 1982). 13. L. Lee, J. Mult. Anal. 9, 266 (1979). 14. D.V. Lindley and N.D. Singpurwalla, J. Appl. Prob. 23, 418 (1986). 15. A.K. Manatunga and D. Oakes, J. Mult. Anal. 56, 60 (1996). 16. A. W. Marshall and I. Olkin, J. Am. Statist. Assoc. 62, 30 (1967). 17. S. A. Murphy, Ann. Statist. 22, 712 (1994). 18. S. A. Murphy, Ann. Statist. 23, 182 (1995). 19. S. A. Murphy and A. W. van der Vaart, J. Am. Statist. Assoc. 95, 449 (2000). 20. D. Oakes, J. R. Statist. Soc. B 44, 414 (1982). 21. D. Oakes, In Modern Statistical Methods in Chronic Disease Epidemiology, eds. R. L. Prentice and S.H. Moolgavkar (John Wiley, New York, 1986). 22. D. Oakes, J. Am. Statist. Assoc. 84, 487 (1989). 23. D. Oakes and A. K. Manatunga, Biometrika 79, 827 (1992). 24. E. Parner, E. Ann. Statist. 26, 183 (1998). 25. J. Shih and T.A. Louis, Biometrics 5 1 , 1384 (1995).
S U R R O G A T E DATA A N D F R A C T I O N A L B R O W N I A N MOTION PETER RABINOVITCH Alcatel Research and Innovation 600 March Rd. Ottawa ON K2K 2E6, Canada Email: Peter. Rabinovitch@Alcatel, com Statistical analysis of fractional Brownian motion is difficult because it exhibits long range dependence. In this paper, we describe experiments that show that the method of surrogate data can help with this problem.
1
Introduction
Fractional Brownian motion (Mandelbrot and Van Ness 7 ) has become one of the standard models of telecommunications traffic due to three facts. When appropriate, it is a parsimonious model, as it is parameterized by three parameters; it has Gaussian marginal distributions; and it exhibits long range dependence, as many real traffic streams do. The long range dependence, while necessary for accurate modeling, presents statistical difficulties. For example, the standard way of proving the Central Limit Theorem (CLT) for dependent random variables (see Resnick 11 ) is to find an integer m such that random variables further apart than m are independent. Then, in essence, one applies the regular, independent CLT and estimates the error term introduced by leaving out blocks of m random variables, and shows that this error term can be made as small as desired. However, in the long range dependent setting, finding such an m may not be possible. One could also appeal to the generalizations of the bootstrap to dependent data, such as the moving blocks bootstrap. However, it is a known result (Lahiri 6 ) that the moving blocks bootstrap does not necessarily work with long range dependent data. The solution to this problem comes from an unlikely source, nonlinear dynamics. Surrogate data (also known as "phase scrambling", "Fourier bootstrap methods", etc.) was developed for hypothesis testing in nonlinear dynamics in the early 1990's. At that time, many papers were published claiming that some observational set of data had a particular fractional correlation dimension, and therefore that the data was chaotic. The surrogate data method was developed (Theiler et al. 13 ) to be able to test this claim, by generating many linear, Gaussian synthetic data sets that had the same first and second
225
226
order statistics as the original data. One could then compare a statistic, such as correlation dimension, of the original data set to a histogram of the same statistic of the synthetic data sets, and if they were significantly different, one could reject the null hypothesis that the original data could be generated by a linear, Gaussian process. We will present evidence, both by referring to the research literature, and by numerical experiments, that the surrogate data method captures the relevant characteristics of fractional Brownian motion, including long range dependence. Pointer to S+ source code is provided so that the readers can perform their own experiments. 2
Fractional Brownian Motion
Definition 2.1. A stochastic process {X(t),t Brownian motion with Hurst parameter H if
> 0} is said to be a fractional
1. X(t) has stationary increments for t > 0; 2. X(t) is normally distributed with mean 0; 3. X(0) = 0 almost surely; and 4. The increments of X(t),Z(j) Pz(k)
:= X(j + 1) - X(j),j
= ±{\k + l\2H +
= 0,1,... satisfy,
\k-l\2H-2h?h}
where pz denotes the autocorrelation function of Z. Remark 2.1. Let {X(t), t > 0} be a fractional Brownian motion with Hurst parameter H > 1/2. Then the increments of X(t) are long range dependent. There are many ways to simulate fractional Brownian motion, such as random mid-point displacement; superposition of heavy tailed on/off processes, wavelet methods, and Fourier methods. In this paper we use the last method, as described in Paxson 8 . A common model for long range dependent network traffic is W(t) = Xt +
aXH(t),
where W(t) represents the total amount of traffic that has arrived by time t, A and a are parameters representing the mean and standard deviation of the arrivals, and Xn(t) is a fractional Brownian motion with Hurst parameter H.
227
3
Surrogate Data
Theiler et al. 13 proposed a method of generating synthetic data that preserves the mean and the autocorrelation of the original series. The mean is preserved by subtracting it from the original series, and then adding it back to the synthetic series. This is standard, as it is frequently also a step in moving blocks bootstrap, or even classical time series analysis [ARMA modeling, etc.). For the autocorrelation, they proposed taking the Fourier transform of the series, and then scrambling the phases of the transformed data, while ensuring that the phases remain symmetric so that when the data is transformed back it is real valued. The phases can be either sampled with replacement from the original phases (i.e. bootstrapped), or they can be sampled uniformly on the interval. Here, we use the latter choice for its simplicity. Now, because the autocorrelation function of a time series is determined by the power of its Fourier transform [i.e. the square of the coefficients), and that the power of the Fourier transform does not depend on the phase of the data, the output of this procedure has the same autocorrelation as the input does. Furthermore, this procedure requires only that the data be a realization of a linear Gaussian process, and a particular such process does not have to be specified. Here is how we generate a single Surrogate Data Sample. Algorithm : To Generate a Surrogate Data Sample with Gaussian Marginals • 1. Input a time series y[t],t = 1...N. • 2. Compute the discrete Fourier transform of the data: z\t] := DFT[y[t\). Note that z[t] has both real and imaginary parts. • 3. Randomize the phases: z'[t] := z ^ e ^ ' l where [t] where is uniformly distributed on (0,2w]. • 4. Symmetrize the phases to ensure that z'[t] := z"[t]. The mechanics of this will depend on the way in which your software stores the components of the FFT. • 5. Invert the discrete Fourier transform: y'[t] :— DFT~1z"[t\. . Note that because of the symmetry of the phases, the resulting time series y'[t] is real. • 6. Output y'[i\.
228
Suppose we want to generate a surrogate data set for the time series (0.707,1.0,0.707,0.0, -0.707, -1.0, -0.707,0.0,0.707,1.0,0.707,0.0, -0.707, -1.0, -0.707,0.0). The DFT of this series is approximately (0,0,1.414+1.414i, 0,0, 0,0,0, 0,0, 0, 0,0,0,1.414-1.414i, 0). Random, symmetric phases are (0.0, 1.433,4.543,1.996,0.878,4.724,2.792,0.736,0.0, -0.736,-2.792, -4.724, -0.878, — 1.996,-4.543,-1.433). Our data to inverse transform is then (0.0,0.0,1.155 - 1.632J, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.155 + 1.632i,0.0). Finally, our inverse DFT'd data, i.e. our surrogate data set, is (0.577,0.985,0.816,0.169,-0.577,-0.985,-0.816, -0.169,0.577,0.985,0.816,0.169, -0.577, -0.985, -0.816, -0.169). In order to calculate a statistic, one would normally generate many surrogate data sets of the original, calculate the statistic of each, assemble them into a histogram, and proceed as in the bootstrap methods. This method is frequently used in nonlinear time series analysis to perform hypothesis tests, however little theoretical work has been done to assess it's limits. The following two results are notable exceptions. Theorem 3.1. (Braun and Kulperger x) Suppose the autocovariance function of the original process is absolutely summable. Let
then
a s n -> co, where / is the spectral density of the original process (which determines the autocorrelation of the process), and Y* is a surrogate data version of the original process. In other words, the surrogate data has the same Fourier spectrum as the original data, asymptotically. The next result says that (roughly) the marginal distribution of the surrogate data converges in distribution to a Gaussian distribution. Theorem 3.2. (Braun and Kulperger 1 ) If {Y/j, h > 1} is a stationary ergodic sequence such that {Yfi, h > l } is ergodic with E\Y2) = a2, then then Y* % JV(0, o-2)almost surely for j = 1,2,... Remark 3.1. The surrogate data method generates synthetic data that has the same sample autocorrelation structure as the original series, not the same autocorrelation as the population, i. e. we may be reproducing artifacts of the sample, and not the true distribution we are seeking.
229
Remark 3.2. The autocorrelation we are talking about is the circular autocorrelation, defined by which in the limit of infinite sample size approaches the population autocorrelation function, but in the case of finite sample size is only approximately equal to the population autocorrelation function, defined by 1
/N-t
C
N X x
P (T) •= -jy ( 5Z t *+T + \t=l
53
x
tXt+T-N
t=N-T+l
Remark 3.3. The documentation for the source code that we use for the surrogate data method (Davison and Hinkley 4 ) says "The types of statistic for which this method produces reasonable results is very limited and the other methods seem to do better in most situations." Although this may be true, surrogate data seems to work well in this case. In Percival, Sardi and Davison, 9 the following two claims about this method are made. Remark 3.4. "Unfortunately this resampling scheme and its variants apply to a very limited range of statistics, because they mimic only second-order properties of the original data." This is true, but fractional Brownian motion is determined by second-order properties, thus in this case, it is not a limitation. Remark 3.5. "Moreover variability is underestimated because this resampling scheme fixes the periodogram, unlike for the original series whose periodogram is random, and statistics that can be computed from the periodigram such as , display no variation across samples." This fixed periodigram problem may be able to be overcome by i) resampling the phases of the data, or ii) applying more sophisticated methods, such as those of Schreiber and Schmitz 12
Also, it is the purpose of this note to encourage research into this method, and thus the above criticisms of the method should be seen as challenges and opportunities for further research. 4
Fractional Brownian Motion and Surrogate Data
The fact that the surrogate data method generates synthetic data sets that are linear and have Gaussian marginals is not as restrictive as it seems, as "a long-memory process can always be approximated by an ARMA(p, q) process" (Brockwell and Davis, 2 p 520), i.e. a sufficiently complicated linear Gaussian process. Also, a linear, Gaussian time series is determined by its autocorrelation function, so we have the following chain of reasoning.
230
• A fractional Brownian motion (long memory process) can be approximated arbitrarily well by an ARMA(p, q) process with sufficiently large p and q. It is determined by its mean and autocovariance function. • The approximating ARMA(p, q) process is a linear process with Gaussian marginals, and as such it is determined by its mean and autocovariance function. • The surrogate data method outputs a series with the same autocovariance as its input, and approximately Gaussian marginals. • Therefore, the output is a fractional Brownian motion with the same long range dependence properties as the input. The only obstacle to proving this is that the proof of Theorem 1 requires that the autocovariance function of the series must be absolutely summable, which is not the case for long range dependent data. In fact it is sometimes used as the definition of long range dependence! However, in the next section, we present experimental results that suggest that the surrogate data method does, in fact, work for fractional Brownian motion. 5
Experimental Results
For the following experiments, all trace lengths were 215, Hurst parameters were estimated with the periodigram method, and for the Moving Blocks Bootstrap, a block length of 32 was used (trace length 1/3, as suggested in Kunsch 5 ) . In the graph that follows for each target Hurst parameter in the set 0.5,0.525„0.975, we generated one fractional Brownian motion. Then 100 surrogate versions of the data were generated, and a 95% confidence interval for the Hurst parameter of the surrogate data sets is presented. Similarly, 100 moving block bootstrap versions of the data were generated, and a 95% confidence interval for the Hurst parameter of these moving block bootstrap data sets is presented. Note that the surrogate data confidence intervals are right on line with the actual Hurst parameters, those of the moving blocks bootstrap are significantly low. We see that the moving blocks bootstrap becomes more biased as the Hurst parameter increases, which is to be expected, as the moving blocks bootstrap works for independent data (H = 0.5) and does not for long range dependent data (H > 0.5). The surrogate data method has only small fluctuations in bias as the Hurst parameter increases.
231
95% Confidence Intervals Surrogate Data Confidence Intervals
• ^
~° m -#-• as E ti LLI
CO
• t> °
Moving Blocks Bootstrap Confidence Intervals
ta a
0.B
0.7
OB
0.9
H
Figure 1: 95% Confidence Intervals of Hurst Parameter 6
Conclusion
It seems that the absolute summability hypothesis of Theorem 1 may be stronger than is needed to achieve the result, and that the method of surrogate data may work in the long range dependent case too. In all cases, the Surrogate Data method has less bias than the Moving Blocks Bootstrap. In this paper we used, as our point estimator, the periodogram estimator of the Hurst parameter. However, it is expected that the surrogate data method provides a method to calculate confidence intervals for a much larger variety of statistics for long range dependent processes. An example of this is the Dembo estimator of the effective bandwidth of a fractional Brownian motion traffic stream, discussed in Rabinovitch 10 . Also, note that the same experiments were run with the Veitch-Abry wavelet estimator of the Hurst parameter, and Mandelbrot's method of generating a fractional Brownian motion with essentially the same results. 7
Source Code
Code to calculate the Hurst parameter of a data set by using the periodogram method, (and others) is available from Murad Taqqu's web site at
232
http://math.bu.edu/INDIVIDUAL/murad/methods/index.html. Vern Paxson's source code to generate a fractional Brownian motion was obtained from http://ita.ee.lbl.gov/html/contrib/fft-fgn.html, and is described in Paxson 8 . Code for the surrogate data method, as well as the moving blocks bootstrap was written by Angelo Canty and is included and described in Davison and Hinkley 4 . It is also available at http://lib.stat.emu.edu/S/DH/bootlib.sh.Z. Code for Mandelbrot's method of generating a fractional Brownian motion and for the Veitch-Abry wavelet estimator of the Hurst parameter are available in Coeurjolly 3 . Acknowledgements The author would like to thank Amit Bose as well as the anonymous reviewer for many valuable comments. References 1. W. J. Braun and R. J. Kulperger, Communications in Statistics - Theory and Methods 26(6), 1329 (1997). 2. P.J. Brockwell and R.A. Davis, Time Series: Theory and Methods, 2nd Edition (Springer-Verlag, New York, 1991). 3. J.-F. Coeurjolly, Journal of Statistical Software 5, (2000) (http://www.jstatsoft.org/v05/i07/). 4. A.C. Davison and D.V. Hinkley, Bootstrap Methods and Their Application (Cambridge University Press, UK, 1997). 5. H.R. Kiinsch, The Annals of Statistics 17, 1217 (1989). 6. S.N. Lahiri, Statist. Probab. Lett. 18, 405 (1993). 7. B.B. Mandelbrot and J.W. Van Ness, SIAM Review 10, 422 (1968). 8. V. Paxson, Computer Communications Review 27, 5 (1997). 9. D.B. Percival, S. Sardy and A.C. Davison, Wavestrapping Time Series: Adaptive Wavelet-Based Bootstrapping (Unpublished Manuscript, 2000). 10. P. Rabinovitch, Statistical Estimation of Effective Bandwidth (M.Sc. Thesis, Carleton University, 2000). 11. S.I. Resnick, A Probability Path (Birkhaiiser, Boston, 1998). 12. T. Schreiber and A. Schmitz, Surrogate Time Series (Unpublished Manuscript, 1999, http://xxx.lanl.gov/ps/chao-dyn/9909037). 13. J. Theiler, S. Eubank, A. Longtin, B. Galdrikan and J. D. Farmer, Physica D 58, 77 (1992).
A N A L Y S I S O F O C C U L T T U M O U R T R I A L DATA W I T H VARYING LETHALITY S. N. R A I Department
of Biostatistics, St. Jude Children's Research Hospital, Lauderdale St., Memphis, TN 38105-2794 E-mail: [email protected]
332
North
J. SUN Department
of Statistics,
University
of Missouri,
Columbia,
MO
D. H U N T Department
of Biostatistics,St. Lauderdale
Jude Children's Research St, Memphis TN 38105
Hospital,332
N.
The primary motivation for this work is drawn from problems arising in the analysis of incomplete data. Although data can be incomplete in many ways, we are most interested in those situations where an intermediate event, which is of prime importance, cannot be observed. For example, in the analysis of data from occult tumour trials, the time of tumour onset is not known. Due to incompleteness of the data, analyses become very complex and many assumptions are often required to develop a basis for inference concerning the tumour incidence. We briefly discuss the assumptions made that lead to many different analyses of occult tumour trial data in the past two decades. From the prospect of reducing the sample size in occult tumour trials, some models have been proposed. These semi-parametric models assume relationships in tumor-bearing and tumor-free animals and have impact on tumor onset rates and potency measures of carcinogenic substances that we have explored further.
1
Introduction
Rodent tumorigenicity experiments are commonly used to screen chemicals, drugs, and food additives for carcinogenic effects. Experiments of this type have three different, though related, purposes; these are: (a) to estimate the rate of tumour development, which is assumed to be irreversible, (b) to estimate the effect of tumour presence on the death rate, and (c) to estimate carcinogenic potency, i.e., the magnitude of the dose effect of the substance of interest. In addition, with respect to (b) one is interested in estimating how tumour presence alters the rate of death in tumour-bearing subjects; in other words,
233
234
how lethal is the tumour? Generally, the lethality of a tumour can be classified as incidental, lethal or intermediate. An incidental tumour does not alter the death rate in tumour-bearing subjects, i.e, the death rates in tumour-bearing and tumour-free animals are the same. On the other hand, if an experimental animal dies almost immediately after tumour onset, the corresponding tumour is known to be one of the lethal type. Tumours which are neither lethal nor incidental are known as tumours of intermediate lethality. Another interesting aspect of this type of experiment is the problem of estimating the rate at which tumour develops in a specific environment on specific sets of subjects. However, the most important focus of this type of experiment is to compare a potentially carcinogenic agent to its absence in relation to the rate of tumor onset. We now turn our attention to those trials which involve occult tumours, i.e., the presence of a tumour is determined only at the time of postmortem. A typical experiment involves about 600 experimental animals of both sexes in each of two strains randomized to a control group or one of two or three exposure groups. In most of these experiments, the animals, which are usually mice or rats, are maintained in a controlled environment and dosed with the potential carcinogen according to the experimental protocol. During the experiment, animals may be selected for interim sacrifice according to the protocol in order to determine their tumour status. At the conclusion of the study, all surviving animals are killed for humane reasons and to discover their tumour status as well; this is known as the terminal sacrifice. In experiments which involve smaller sample sizes, often only a terminal sacrifice is performed. Occult tumour studies represent an important source of information concerning the possible carcinogenic effect of potentially hazardous substances such as chemicals, drugs and food additives. Animal survival/sacrifice experiments are commonly used in such studies, and furnish data that typically include the administered dose of the suspected carcinogen, the age of the animal at death and indicators of the presence or absence, at death, of various tumours. These data are frequently both grouped and incomplete, and are invariably difficult to analyze. Hoel and Walburg n were the first investigators to consider the analysis of data from carcinogenicity studies involving occult tumours in living animals, and made a useful distinction between rapidly lethal tumours and incidental tumours. In the former, the time to death following tumour onset is short, and therefore time to death is a good proxy for the time of tumour onset. Accordingly, an analysis based on time to death with tumour is indicated. In the incidental tumour case, the tumour has no effect on the death rate and the proportion of deaths with tumour provides an estimate of the tumour preva-
235
lence at that time. Most tumours, however, are neither lethal nor incidental; consequently, neither of these methods is appropriate. Under the assumption that the cause of death can be identified, i.e., whether death is principally due to tumour or due to competing risks, Kodell and Nelson 14 consider a semiMarkov model with Weibull transition intensity functions. They estimate the parameters by maximizing the likelihood function for the model, and illustrate their results by analyzing a carcinogenicity trial involving the toxic substance benzidine dihydrochloride. Kalbfieisch, Krewski and Van Ryzin 12 provide a thorough review of the field, and describe the construction of the full likelihood function in detail. Other related articles include Kodell, Shaw and Johnson 15 , Dinse and Lagakos 7 , Turnbull and Mitchell 26 and Portier 20 ; in these papers, the authors are interested in estimating the tumour onset distribution based on a multiple decrement analysis. Another key assumption used in the development of methods for the analysis of occult tumour trials is cause of death. That is the deaths can be classified as due to tumour or due to competing causes. Although the context of observation is a fairly common assumption on which the analysis of carcinogenicity experiments is based, the cause of death is not always known, or it may be uncertain (see Finkelstein and Ryan 10 , Lagakos and Ryan 16 or Kodell, Shaw and Johnson 1 5 ). For example, in an empirical investigation of the EDQI data, Lagakos and Ryan 16 found the cause of death information to be inadequate for several tumour types occurring in that experiment. Consequently, it seems unwise to assume that reliable information of this type is likely to be available. Alternative to the assumption of the cause of death information, there are some procedures that are based on estimating the number of deaths due to fatal tumours and non-fatal tumours first and then estimating the tumour onset rates and testing the effects of the potential carcinogen on the tumour onset rates (Ahn et al. 1). McKnight and Crowley 18 and Dewanji and Kalbfieisch 5 both provide an extensive survey of nonparametric methods of estimation in occult tumour studies. McKnight and Crowley 18 argue that the tumour incidence rate should be the principal quantity of interest in carcinogenicity experiments, and propose a nonparametric estimator of this quantity. Dewanji and Kalbfieisch 5 derive a nonparametric estimate of this rate using the EM algorithm. In both papers, information from numerous interim sacrifices is essential in estimating this key rate. Some other results which also require numerous interim sacrifices may be found in the papers by Williams and Portier 27 - 28 . From the perspective of an experimentalist, interim sacrifices represent an undesirable aspect of the nonparametric approach, because they frequently inflate the size and cost of a proposed trial. One approach which reduces
236
the necessity for interim sacrifices involves the use of parametric models for the death rates experienced by tumour-free and tumour-bearing animals. Although many different parametric models might be possible, two particular forms were considered quite naturally in this occult tumour context. In the first parametric form, which is referred as the constant risk ratio (CRR) model (Dinse 7 , s , and Lindsey and Ryan 17 ) and multiplicative failure rate (MFR) model (Rai and Matthews 2 3 ) , are based on the Cox proportional hazards model (Cox 3 ) . In the second, which corresponds to a cause-separable hazards model, the rates of death for tumour-bearing and tumour-free animals differ by a constant, referred as constant risk difference (CRD) model (Dinse 7,8 ) and additive failure rate (AFR) model by Rai and Matthews 23 ; when the death process is a discrete random variable, there is not any preferred version of the AFR model proposed in the occult tumour trials. In all these semiparametric models, the lethality is assumed to be constant over time. When time to tumour onset is considered a continuous random variable and time to death is considered a discrete random variable, all these semi-parametric approaches provide estimates of tumour onset rates in experiments that have as few as only a terminal sacrifice. But in some experiments these rates are heavily dependent on the form of the lethality parameter (Rai et al. 2 4 ) , suggesting that the relationship is too restrictive and model dependent. Therefore when there is at least one more interim sacrifice, a more general model for lethality can be accommodated, that we consider in this manuscript. In the remainder of this manuscript, we organize the remaining sections as follows. In Section 2 we define quantities of interest in occult tumour trials and construction of the likelihood. In Section 3 we define some additional notations for nonparametric estimation and also suggest some parametric relationship for describing lethality. The estimation is discussed very briefly in Section 4. In the penultimate section, we reanalyzed a subset of a data set which was previously analyzed by Dewanji and Kalbfleisch 5 and Rai and Matthews 2 3 . Some discussions are presented in Section 6. 2
Preliminary Considerations
Consider a carcinogenicity trial in which the presence of tumour is not clinically observable, i.e., tumour status can be determined only at necropsy. At the time of observation, an experimental unit can occupy any of the three states illustrated in Figure 1. Let the stochastic process {X(t)} identify the state occupied by an animal at time t. For simplicity, we suppose that n animals in state 1 at time t = 0 are randomly selected as experimental units and are observed for the duration
237
Alive without tumour
Alive with tumour AiW
Aa(t)
Figure 1. An illness-death model involving three states. State 1 corresponds to animals which are alive without tumour. Tumour-bearing animals which are alive are in state 2, which is frequently unobservable in carcinogenicity experiments involving occult tumours. State 3 is an absorbing state and corresponds to death.
of the trial. Let the random variable T denote the time of death and U the time of tumour onset. We also assume that the development of a tumour is an irreversible event, and therefore transitions from state 2 to state 1 do not occur, as illustrated in Figure 1. From time to time, experimental units are sacrificed to determine their status. These animals are chosen for sacrifice independent of their health status, etc., to ensure that sacrifices can be regarded as independent of the times of the events of interest. The intensities shown in Figure 1 are defined as the limits A,(i) = lim Pr{X(t
+ At) = 2\X(t) =
\2{t) = lim Pr{T e[t,t + At)\X{t)
=
1}/At,
(1)
I}/At,
(2)
and A3(*|u)
lim Pr{T€
[t,t +
At)\X(t)
2,U =
u}/At,
(3)
for u < t; otherwise Aa(i|w) = 0. We now define various quantities of interest, as functions of the three hazard rates Ai(i), A2(i) and \3(t\u). The marginal distribution of T, the time to failure, can be determined from the survivor
238
function S(t) = E exp where the expectation E is taken with respect to the distribution of U, the onset time. The pseudo-survival functions corresponding to the intensities Ai(.), A2(.) and A 3 (i|u) are
Qi(t) = exp < - / \i(v)dv >
for i = 1,2 and
Q3(t\u) = exp I - / \3(v\u)dv
whereas
Q(t) = exp { -
/" (Ai(u) + \2{v))dv\
I,
(4)
- Q i ( i ) g 2 (*)
denotes the probability that the time to the first event — tumour onset or death without tumour — exceeds t. Note that S(t) = Pr{X{t)
= 1} + Pr{X(t)
= 2}
= Q(t) + I Ai(«)Q(«)Q 3 (t|u)d«. Jo Following the work of McKnight and Crowley 18 , we can define an average hazard associated with the transition from state 2 to state 3 as \D\T(t)
= lim Pr{T € [t,t + At)\X(t) =
=
2}/At
,. Pr{Te[t,t + At),X(t) = 2}/At lim At-o Pr{X(t) = 2}
J* XiMQMXaitMQiWuldu J0
Ai(u)Q(u)Q3(t\u)du
Similarly, the tumour prevalence function, which is the proportion of live animals with tumour in the population, is defined to be 7r(t) = Pr{X(t)
= 2 | r > *} Pr{X(t) = 2} Pr{X(t) = 1} + Pr{X(t) f0M{u)Q(u)Qz{t\u)du S{t)
= 2}
239
Finally, we define the lethality function, l(t), to be either the difference of death rates, A D l T (i) - X2(t), or the relative death rate, XDlT(t)/X2{t), for different forms of A3(£|u). With respect to the analysis of data from a carcinogenicity trial, there are two special cases, based on lethality assumptions, which are easily handled; these correspond to tumours which are rapidly lethal and, at the other extreme, tumours which are incidental. In the case of tumours which are rapidly lethal, A3(£|u) is very large, and death occurs immediately after tumour onset. In this situation, the time of death can be regarded as a surrogate for the time of tumour development. On the other hand, if the tumour is incidental, tumour development has no effect on the death rate and A3(£|u) =X2(t). In this case, the analysis of data from the trial can be based on the fact that the proportion of deaths at time t which involve tumour-bearing animals provides an estimate of the tumour prevalence in the population at that time. 2.1
Constructing the Likelihood Function
Now, we outline a general framework for constructing the likelihood function. Let 0 represent the full parametric vector. Thus, 0 includes parameters which specify the general form of the transition intensities, c.f. Figure 1. In general, the data arising from a carcinogenicity trial will consist of the time and type of failure, i.e., natural death or sacrifice, and an indication of tumour presence or absence at autopsy. Let U be the realization of the random variable T for the ith experimental animal, i = 1,2,... , n. If d represents the contribution to the likelihood due to the ith animal, then the likelihood function for 6 is -k(^) = n"=i C« • Table 1 identifies the various types of observations which occur and the corresponding contribution to the likelihood. The terms A(t) and B(t) which appear in Table 1 represent the integrals
A(t) = I Ai(u)Q(u)A 3 (<|u)Q 3 (t|tOdu Jo and B(t) = f Ai(«)Q(u)Q 3 (t|«)d«, Jo respectively. These same terms can also be expressed in other familiar forms. Let the random variable W = T — U denote the time between tumour onset and subsequent failure. For t>u, i.e., w > 0, we can write Pr{W E [w,w + dw)\U = u] = A3(w +
u\u)G(w\u)dw
240 Table 1. Contributions to the likelihood function, arising in the analysis of data from a carcinogenicity trial
Observation Type Death without tumour Sacrifice without tumour Death with tumour Sacrifice with tumour
Outcome
T = t,X(t~) = l T>t,X(t)
Likelihood Contribution X2(t)Q(t)
=l
T = t,X(t~) = 2 T>t,X(t)
=2
Q(t) A{t) B(t)
where G(w\u) = Pr{W > w\U = u) = Q3(u + w\u) is the probability of survival to time t given tumour onset at u. Then, A{t)=
/ \3(w + Jo
u\u)f(u)G{w\u)du
and B{t) = f Jo
f(u)G(w\u)du,
where /(.) is the (sub)density function of U, i.e., f(u) = Xi(u)Q(u). Using parametric formulation for intensities or related quantities, one can analyze data (Dewanji et al. 6 ) . A nonparametric method of estimation based on the likelihood approach is considered in the next section. 3
Non-Parametric Estimation
The observed data for each animal consist of the time of death or sacrifice and an indicator of tumour presence or absence. Suppose there are M distinct death times, denoted by ti < • • • < tM, and let Ij = (tj-\,tj],j = 1,2, ...,M, where, for completeness, we define to = 0 and *M is the time of terminal sacrifice. Without loss of generality, set tj = j for j = 0,1,...,M. The
241
argument of Kaplan and Meier 13 can be used to show that, without the imposition of distributional restrictions, the likelihood is maximized when the death rates place mass only at the observed times of death. Following Dewanji and Kalbfleisch 5 , we can treat death as a discrete process and T as a discrete random variable, and we identify the range of T as { 1 , 2 , . . . , M}. Since there is no restriction on the choice of scale for the tumour onset process, it can be considered discrete or continuous. We suppose that the random variable U, which denotes the time of tumour onset, is continuous. Therefore, the resulting model is a mixed scale model for occult tumour trial data (Rai et al. 2 4 ) . When the time to death variable, T, the time to onset variable U, are both considered discrete, the resulting model is a discrete scale model (Rai and Matthews 2 3 ) . Here we are interested in a mixed scale model. For animal i, there is an associated time of sacrifice, Yi, which is chosen in advance of the experiment with Pr(Yi = j) = qj ( j = 1,2,... , M) and J^ qj = 1. This random sacrifice model is analogous to a random censorship model. The event Yi = j corresponds to a planned sacrifice of animal i at time j + , so that sacrifices at j are presumed to follow other events that may occur at j . At time M the experiment is terminated and all surviving animals are sacrificed. For this reason, such experiments are referred to as survival/sacrifice experiments. Let u and t be realizations of U and T respectively. The discrete version of the intensities related to death rates are defined as \*2{t) =Pr{T
= t\X{t)=0,T>t}
and
\*3{t\u) = Pr{T = t\X(t) = l,U =
u,T>t>u}
for t = 1,2,... ,M and u > 0. Furthermore, we define the tumour incidence rate XIU) = Pr{U e Ij\T >j-l, = 1 — exp{— / Jj-i
X(j - 1) = 0}
Xi(u)du}
3
/
Xi(u)du,
for interval j = 1 , . . . , M. By assuming that U is a continuous random variable, we exclude the possibility that tumour onset and death with tumour occur simultaneously. If there are many sacrifices (Dewanji and Kalbfleisch, 5 , McKnight and Crowley l s , parameters of the full model can be estimated non-parametrically
242
without any further assumption about the form of death rates in tumourbearing and tumour-free animals. However, most of the tumorigenicity experiments have a few interim sacrifices. Therefore some relationship between these death rates are required. The relationships in the instantaneous death rates (when T and U are assumed to be continuous random variables) in tumour-free and tumour-bearing animals that are proposed by Dinse 7 and Lindsey and Ryan 17 are A3(*|u) = A2(t)eT
(5)
and X3(t\u)=\2(t)+7.
(6)
Since estimation is based on nonparametric formulation, a discrete version of the death rates in tumour-free and tumour-bearing animals using equations (5) or (6) are needed. The model described in equation (5) has some constraints that may not be very realistic (Rai 2 2 ) . Rai and Matthews 21 have proposed the models
W\u)
=
\*2(t)
i-\ut\u) i-xm
{)
and \*3{t\u)=\*(t)+1,
(8)
for t = 1,2,... , M, which are based on discrete scale for T. The former model, which we refer to as the multiplicative failure (MFR) model, corresponds to the discrete version of the proportional hazards model (Cox 3 ) . The model given in equation (5), the constant risk ratio (CRR) model, is closer to the continuous proportional hazards model. The latter formulation, which we call the additive failure rate (AFR) model, represents a cause-separable hazards model for \l(t\u). For notational convenience, we refer to both models by X3(t). Note that these models assume that the death rate does not depend on the tumour onset time. A detailed comparison of these models is given in Rai 22 . Since both the tumour incidence rate and the death rate without tumour are completely unspecified and events may occur at any time t = 1 , . . . , M, we must estimate the 2M values of \\{t) and A^t). If Ag(t) was also unspecified, corresponding to a fully nonparametric model of the experiment, we would have to estimate 3M parameters; however, the parametric forms for A^(i) prescribed in equations (1) and (2) reduce that total to 2M + 1. These relationships were helpful to estimate the parameters even when there is no interim sacrifice. But sometimes different models lead to very
243
different estimates for tumour onset rates, the primary quantity of interest in tumorigenicity experiments. In all these models the lethality parameter is considered constant over time. This assumption may not be very realistic. In some experiments there may be more information than just the indicator of tumour presence (Ryan and Orav 2 5 , Parise et al. 1 9 ) . For example, we may know the grade of the tumour at the time of death and sacrifice. This grade information can be directly linked to model lethality of the tumour. We propose the following semi-parametric models for lethality: 7 = 7o + 7i*>
(9)
7 = 7o + 7i ( * - " ) ,
(10)
7 = 7o + 7i* + 72-^,
(11)
7 = 7o + 7 i ( * - « ) + 7 2 ^
(12)
and
fort = 1,2,... ,M. The first relationship leads to a Markov model whereas the second relationship is a semi-Markov model. The last two models incorporate covariate information. This covariate is related to development of tumour and therefore is an internal covariate. Accommodating contributions to the likelihood function from internal covariates can sometimes be cumbersome. In this manuscript we, for simplicity, we consider the first model and other possibilities are considered somewhere else. We conclude this section by defining some additional notation which will be used in later sections of the paper. Let
Q(t)
=Pr{T>t,X(t)=0}
=
1[{1-M)}i[{l-K(j)} j=i
j=i
and Qz{t\u) = Pr{T > t,X(t)
= 1\U £ In}
= I[{1-A5(J)}, j>u
for u < t = 1,2,... ,M. The function Q(t) represents the probability that an experimental animal is alive and tumour-free at time t. This quantity is the
244
product of two separate terms involving A*(.) and A^.). This is due to the fact that the variable U is treated as continuous and T is discrete, as described in Rai et al. 2 4 . The quantity Q3(t\u) corresponds to the conditional probability that an animal which developed a tumour in Iu is still alive at time t. The functions Q(.) and Q3(-|.) will be used to calculate prevalence and survival functions. 4
Fitting the Semi-parametric Model
We use the EMI algorithm (Rai and Matthews 21 ) to estimate the parameters of the model. This method of estimation is an algorithm of the EM type which was first described by Dempster, Laird, and Rubin 4 . As with all such algorithms, maximum likelihood estimation of the parameters in the model using the observed data (the incomplete data problem) is accomplished by maximizing the conditional expectation, given the data, of the likelihood function generated by the corresponding complete data formulation. Our complete data formulation is based on the procedure given in Rai et al. 2 4 . Note that our complete data formulation provides variance estimates for tumour onset rates and death rates. Here, we briefly describe the estimation procedure. A Complete D a t a Formulation We assume that the time of tumour onset is known to belong to one of the intervals It, and that sacrifice is simply a right-censoring of the multistate process. In that case, the data from a sample of n experimental animals can be summarized as the observed values of the counting processes Ni(t), N2{t), N3(t), Y0(t) and Yi(t) for t = 1 , . . . , M . The quantity Ni(t) represents the number of animals with tumour onset time in It, whereas iV2(t) identifies the number of tumour-free animals which die at time t. Likewise, N3 (t) indicates the number of tumour-bearing animals which die at time t. The random variables Y\(t) and Yo(t) summarize the number of animals with and without tumour, respectively, which are sacrificed at t; thus n = Y!,tLi{Ni(t) + N2(t) + Yo(t)}. In addition, the variable Ri (t) represents the number of animals at risk of developing a tumour in It; likewise, i?2(t) and i? 3 (t) be the number of tumour-free and tumour-bearing animals, respectively, which are at risk of death at t. Note that due to mixed scale assumption, R2(t) = Ri(t) -
Nttf).
The complete data may be divided into two groups, depending on the type of information available. The first group corresponds to animals in state
245
1; these either remain in state 1 or move into one of the other two states, i.e., state 2 or state 3. For such animals, the contribution to the likelihood function at time t is Li(t) <x \{{t)NlW{l
- Xl(t)}R^-N^
x A5(t)JVa(t>{l -
\*2{t)}R^-N^l
The second group corresponds to animals in state 2; these either remain in state 2 or move into state 3. Thus, animals in state 2 at time t contribute L2{t) oc \*3{t)N^{l
- \l{t)}R3{t)-N3{t)
(13)
to the likelihood function. Combining these two groups, we obtain the likelihood function for the semi-parametric model based on the complete data, viz., M
L a J J L!(t)L2(t).
(14)
t=i
An explicit version of this likelihood function is obtained by replacing \%{t) with various model-specific forms. The Incomplete Data Problem Since tumour information can be obtained only at autopsy, the complete data are not available; instead, the observations consist of N2{t), Yb(i), Ns(t), and Yi(t). Let 6T = (Xi(j), X2(j),70,71; t = 1 , . . . ,M) be the vector of parameters in the semi-parametric model. A modified version of the EM algorithm (Rai and Matthews 21 ) provides a simple method for estimating 6 in two steps: E (expectation) and Ml (one-step maximization). Starting with an initial estimate of 0, say 6(°\ these steps are applied in a strictly alternating sequence until the parameter estimates converge and the log likelihood function of the observed data is maximized. Suppose that 0( , _ 1 ) represents the value of 6 which was obtained at iteration i — 1 of the algorithm. At the next E-step, we have to evaluate the conditional expectation i V f ( j ) - E{N1(j)\N3(t),Y1(t),t
= j,j + l,...,M,9
= flt*-1)}
for j = 1 , . . . , M. To compute this expectation, the conditional probability
P^Ult)
= Pr{U G Ij\T = t,X(t) = 1} =
A
i(j)Q0" ~ l)AS(t|j)Q 3 (t - l]j)
E!=iAi(0('-i)A5(t|0Q3(t-i|0
246
and PtX){j\t)
= Pr{U G Ij\Y = t,X(t) = 1} Af(j)QC?-l)Q3W) =
(16)
ELiAI(0Q('-i)Q3(t|0 for j < £, is required. Note that Q3(t — \\t) = 1, i.e., the probability of surviving at least up to time t — 1 given that the animal develops a tumour in the interval It is one. The dependence of the right-hand side of expression (15) and (16) on (i — 1) has been suppressed for notational convenience. Consequently,
^i(l)(i) - E ^ C ) ^ " 1 ' +
Yitt)}Ptl\j\t)-
t>j
At the ith iteration, the expected values of the risk sets -Ri(-), -R2O) a n d Rs(.) are evaluated by substituting the quantities N^ (.), N%(.), A^O), YQ(.) and Yi(.) in equations (3), (4) and (5), respectively. These expectations are required in order to proceed with the succeeding maximization step. At the next step of the algorithm we maximize the complete data likelihood specified in equation (7) with respect to the parameter vector 9 to obtain the updated value 6^l\ It can easily be seen that the maximum likelihood estimates for the parameters \\{j) are A*«(j) = ( ^ i ( < ~ 1 ) ( j ) M i " 1 ) ( j ) , if *tl){3) 1 \ 0 otherwise
* 0
(j = 1 , . . . ,M). The other parameter estimates A2 (.), % and 7}^ are updated using an one-step maximization algorithm described in Rai and Matthews 2 1 , as there are no closed-form expressions for these parameters. Now we summarize our experience concerning the problem of identifiability. As noted previously, the generalized MFR and AFR models discussed in this paper each involve 1M + 2 unknown parameters. In order to uniquely estimate 6, we therefore require at least 2M + 2 independent observations. According to the design of the experiment, only two types of event can occur at any time: death with tumour or death without tumour. The data on death information form a multinomial table with 2M cell frequencies which specifies the number of deaths with and without tumour at M time points. If there are C interim sacrifices including the terminal sacrifice at M, the data on sacrifice information form C binomial tables with 2 cell frequencies which specifies the number of sacrificed animals with and without tumour. Thus, we will have 2M + C independent informations in the data to use in estimating
247
2M + 2 parameters in either semi-parametric model. Note that we need at least one interim sacrifice in addition to the terminal sacrifice for estimating parameters of the general model. 5
Example: Ionizing Radiation and the Occurrence of Glomerulosclerosis
A few tumorigenicity experiments supplement the terminal kill with numerous interim sacrifices; however, many occult tumour studies involve only one interim sacrifice. The requirements of fully nonparametric methods such as those proposed by Dewanji and Kalbfleisch 5 and others are rarely met in these commonly-occurring experimental designs. In this section we have chosen to analyze the data from an occult tumour study in order to illustrate the merits of the proposed semi-parametric methods. The example involves a number of interim sacrifices. Data for this example is extracted from Table 2 in Dewanji and Kalbfleisch 5 . The data represent a summary, in intervals of 100 days, of the information gleaned from deaths and sacrifices concerning the presence or absence of the disease glomerulosclerosis. This data was also analyzed using discrete scale models in our previous work (Rai and Matthews 2 3 ) . Further information concerning this comparative assay regarding the occurrence of glomerulosclerosis following exposure to ionizing radiation may be found in the report of Berlin et al. 2 . The results of fitting the basic and generalized versions of MFR and AFR models to these data are summarized below. The Kaplan-Meier estimate of the survival function for both the control and irradiated groups of mice and the corresponding estimates of the survival function which were derived from the basic and generalized versions of MFR and AFR models are almost identical, except for some time points where the generalized models produce better results than the basic models (due to lack of space these 8 graphs are not produced here). On this basis, it appears that both types of the basic semi-parametric models appear to fit the observed data well, but the generalized versions produce some improvement. Estimates of the cumulative tumour incidence rate in each of the dose groups for various models are presented in Table 2. As a basis for comparison, the corresponding nonparametric estimates obtained by Dewanji and Kalbfleisch 5 , and denoted by the symbol DK, are also included in Table 2. Note that their model is a saturated model for these data. While agreement between the estimates based on the basic MFR and AFR models and the nonparametric estimates (DK) is not so bad in the control group and the irradiated group of mice, the estimates based on generalized versions of the
248 Table 2. Estimated cumulative tumour incidence rates for the glomerulosclerosis d a t a based on various mixed scale models
Age in Days 0-100 101-200 201-300 301-400 401-500 501-600 601-700 701-
DK° 0.195 0.420 0.961 1.266 2.034 2.442 2.750 3.537
Control group M A GM 0.198 0.160 0.195 0.432 0.410 0.433 0.962 0.980 1.040 1.360 1.481 1.380 1.894 2.044 1.958 2.497 2.385 2.529 3.035 3.094 3.130 3.813 3.672 3.747
GA 0.193 0.427 0.984 1.333 1.894 2.429 3.086 3.789
DK 0.166 0.469 1.104 1.794 1.794 1.794 1.794 1.794
I r r a d i a t e d group M A GM 0.190 0.121 0.173 0.623 0.637 0.538 1.077 1.119 1.052 1.676 1.692 1.649 2.074 2.047 2.163 2.431 2.220 2.183 2.220 2.431 2.183 2.220 2.431 2.183
GA 0.130 0.642 1.117 1.670 2.044 2.262 2.262 2.262
a
DK = Dewanji and Kalbfleisch approach, M = MFR model, A = AFR model, GM = generalized MFR model, GA = generalized AFR model
Table 3. Estimated prevalence functions for the glomerulosclerosis d a t a based on various mixed scale models
Age in Days 0-100 101-200 201-300 301-400 401-500 501-600 601-700 701-
DK* 0.194 0.369 0.698 0.775 0.949 0.970 0.970 1.000
Control group M A GM 0.198 0.157 0.194 0.387 0.381 0.360 0.715 0.752 0.701 0.814 0.817 0.858 0.904 0.937 0.917 0.940 0.958 0.966 0.965 0.985 0.983 0.963 0.994 0.992
GA 0.193 0.375 0.713 0.802 0.905 0.949 0.978 0.984
DK 0.164 0.368 0.758 0.939 0.945 0.937 0.941 1.000
Irradiated group M A GM 0.194 0.116 0.171 0.561 0.553 0.438 0.763 0.728 0.692 0.868 0.889 0.880 0.910 0.926 0.938 0.935 0.928 0.949 0.895 0.928 0.969 0.775 0.911 0.988
GA 0.127 0.562 0.759 0.885 0.922 0.933 0.921 0.886
" as in Table 2
Table 4. Values of the maximized log likelihood and the corresponding number of parameters estimated when the glomerulosclerosis d a t a were fitted with the mixed scale M F R and A F R models involving 7 = 70 + 71*
Dose group Control Irradiated
N u m b e r of parameters 17 17
71 = 0 Maximized log likelihood MFR AFR -1778.50 -1782.06 -2886.39 -2886.79
Number of parameters 18 18
715*0 Maximized log likelihood MFR AFR -1777.78 -1778.41 -2880.87 -2886.72
MFR and AFR models are a bit closer to the estimates based on the saturated model. Estimates of the tumour prevalence function, n(t), based on various semi-parametric models are summarized in Table 3. The corresponding values derived by Dewanji and Kalbfleisch 5 are once again included for com-
249
parison purposes. Relative to the nonparametric estimates of 7r(£), the basic MFR model appears to estimate a higher prevalence during the initial periods and lower towards the end. On the contrary, the revers pattern is observed when comparing the estimates obtained using the basic AFR model with the saturated model. The estimates obtained using the generalized versions of these models bridge the gap with those obtained using the saturated model, expect for the first and last time intervals for the irradiated group when using the AFR model. Table 4 summarizes the values of the maximized log likelihoods and the numbers of parameters estimated for each semi-parametric model that was fitted to the data. The values in the table provide a basis for examining the adequacy of the simpler model for 7 with respect to the glomerulosclerosis data. In the case of the MFR model, it appears that the hypothesis 71 = 0 is not contradicted by the data for the control group of mice (p=0.230); however, the same test yields a significance level of 0.001 when applied to the results of fitting the MFR model to the irradiated mice. Somewhat curiously, the reverse of the above remarks summarize the conclusions based on the likelihood ratio tests when the AFR model is fitted to the same data set. Finally, we also found that incidence rates are different in both the groups for basic and generalized versions of the MFR and AFR models. The results are very similar to those obtained using discrete scale models for tumour onset and death rates (Rai and Matthews 2 3 ) . 6
Discussion
As we have previously indicated, in the absence of supplementary information such as that provided by interim sacrifices or the cause of death, realistic approaches to the estimation of tumour incidence rates and carcinogenic potency will necessarily involve modeling assumptions. For the most part, the assumptions on which the discrete scale models of Rai and Matthews 2 3 were based continue to apply with respect to the mixed scale models which we have just described in preceding sections of this paper. The principle difference has been the choice of scale on which tumour onset is deemed to have occurred. A benefit which is realized through selection of the mixed scale model concerns the form of the likelihood function. In this case, the complete data formulation can be factored into two components; one of these involves parameters related to tumour onset, while the remaining factor concerns the death rates with and without tumour. Such a separation is not achieved in the corresponding complete data likelihood generated by the discrete scale model.
250
Note that the choice of a mixed scale model induces a natural preference or ordering of the two events, tumour onset and death, which is not present in corresponding discrete scale models. In effect, because tumour onset occurs on a continuous scale whereas the time to death is a discrete random variable, tumour-free animals are at risk of developing tumour at any time, whereas the same animals are only at risk of death immediately prior to the next point on the discrete scale of the random variable T after remaining tumour-free for the intervening interval. This natural ordering has implications for the resulting estimates of the tumour incidence rate, and the death rates with and without tumour, which we believe are deserving of further study. For example, in the mixed scale model it is possible for animals to die with tumour at the first value in the range of T, i.e., the time of the first observed death or sacrifice, whereas the same event is impossible in a discrete scale model. Consequently, at least one interim sacrifice is required in order to obtain unique parameter estimates in the discrete scale models, whereas no interim sacrifice is required in the case of the mixed scale models. When there are interim sacrifices in additional to the terminal sacrifice, our generalization of the lethality function using one additional parameter provides adequate results - good fit to the data and comparable estimates of the tumour onset and prevalence rates. Acknowledgments The authors wish to thank two referees and the Editor for their comments and suggestions. The research of Dr. Rai and Dr. Hunt was supported in part by a Cancer Center Support grant (CA21765) and the American Lebanese Syrian Associated Charities (ALSAC) and that of Dr. Sun was supported by a grant from the National Institutes of Health. References 1. H. Ahn, H. Moon and R.L. Kodell, Applied Statistics 49, 157 (2000). 2. B. Berlin, J. Brodsky and P. Clifford, Journal of the American Statistical Association 74, 5 (1979). 3. D.R. Cox, Journal of the Royal Statistical Society, Ser. B 34, 187 (1972). 4. A.P. Dempster, N.M. Laird and D.B. Rubin, Journal of the Royal Statistical Society, Ser. B 39, 1 (1977). 5. A. Dewanji and J.D. Kalbfleisch, Biometrics 42, 325 (1986). 6. A. Dewanji, D. Krewski and M. J. Goddard, Biometrics 49, 367 (1993). 7. G.E. Dinse Biometrics 47, 685 (1991).
251
8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28.
G.E. Dinse, Biometrics 49, 399 (1993). G.E. Dinse and S.W. Lagakos, Biometrics 38, 921 (1982). D.M. Finkelstein and Ryan, L.M. Applied Statistics 36, 121 (1987). D.G. Hoel and H.E. Walburg, Journal of the National Cancer Institute 49, 361 (1972). J.D. Kalbneisch, D.R. Krewski and J. Van Ryzin, The Canadian Journal of Statistics 11, 25 (1983). E.L. Kaplan and P. Meier, Journal of the American Statistical Association 53, 457 (1958). R.L. Kodell and C.J. Nelson, Biometrics 36, 267 (1980). R.L. Kodell, G.W. Shaw and A.M. Johnson, Biometrics 38, 43 (1982). S.W. Lagakos and L.M. Ryan, Environmental Health Perspectives 63, 211 (1985). J. Lindsey and L.M. Ryan, Applied Statistics 42, 283 (1993). B. McKnight and J. Crowley, Journal of the American Statistical Association 79, 639 (1984). H. Parise, G. Dinse and L. Ryan, Applied Statistics 50, 171 (2001). J.C. Portier, Biometrika 73, 371 (1986). S.N. Rai and D.E. Matthews, Biometrics 49, 587 (1993). S.N. Rai, Biometrical Journal 39, 909 (1997). S.N. Rai and D.E. Matthews, Applied Statistics 46, 93 (1997). S.N. Rai, D.E. Matthews and D.R. Krewski, The Canadian Journal of Statistics 28, 65 (2000). L.M. Ryan and E.J. Orav, Biometrika 75, 631 (1988). B.W. Turnbull and T.J. Mitchell, Biometrics 40, 41 (1984). P.L. Williams and C.J. Portier, Communications in Statistics 2 1 , 711 (1992). P.L. Williams and C.J. Portier, Biometrika 79, 711 (1993).
NONLINEAR MIXED EFFECTS MODELS: R E C E N T DEVELOPMENTS
P O D U R I S.R.S. R A O A N D N I C H O L A S Z A I N O Statistics
Program, Hylan 915, University of Rochester, NY 14627 E-mail: [email protected]
Rochester
Illustrations of nonlinear models with fixed and random effects along with their applications for different practical situations are presented. Least squares, maximum likelihood and other procedures suggested in the literature for estimating the fixed effects and variance components are described. Approximations suggested for the likelihood procedure are summarized. Suitable transformations required for these procedures are also presented. i
1
Introduction
Nonlinear models are extensively employed, for instance, in bioassay, in radioimmunoassay, for the analysis of dose-response measurements related to medical treatments, for the examination of concentrations and absorptions of chemical compounds in Pharmacokinetics, and to evaluate the effects of herbicides. They are also widely used to study the degradation and reliability of electronic components and integrated systems, to describe time-to-failure distributions and to predict the plateaus of deterioration of the above types of components and systems. Nonlinear models are also suggested to examine the departures from the assumptions such as normality of the linear models. As in the linear case, some of the coefficients in a nonlinear model can be fixed and the others random. The Analysis of Variance (ANOVA), Minimum Norm and Minimum Variance Quadratic Unbiased Estimation, (MINQUE and MIVQUE), Maximum Likelihood and Restricted Maximum Likelihood (ML and REML) procedures and their modifications are widely used to estimate the variance components and fixed effects of a linear mixed effects model. However, these approaches have to be modified and adapted in the case of a nonlinear mixed effects model. Suitable transformation followed by modifications of the above procedures may provide acceptable estimates for some types of nonlinear models. The major motivation for the present article comes from Solomon and Cox 20 , who have presented examples of nonlinear mixed effects models, examined some transformations to estimate the variance components and fixed effects, suggested an approximation to the likelihood function, and evaluated its appropriateness for estimating the parameters of an exponential regression
252
253
model and also for testing a hypothesis related to the 2 x 2 contingency tables through the logistic regression model. An overview of these and other illustrations and estimation procedures suggested in the literature are presented in the following sections. Illustrations of nonlinear models along with some of the suggested estimation procedures are presented in Sections 2 and 3. Linear mixed effects models, the corresponding marginal distributions along with the MIVQUE, ML and REML procedures for the fixed effects and variance components are briefly summarized in Section 4. The description in this section leads into Section 5, where a general description of the nonlinear mixed effects model, an approximation to the marginal distribution and the likelihood function for the nonlinear model along with the suggested estimation procedure are presented. Section 6 presents illustrations of the degradation models employed to improve the reliability of electronic and other types of components and systems, and a two-stage procedure of estimation. A summary with discussion appears in Section 7. 2
Four Illustrations of Nonlinear Models
Two commonly used models in a number of applications are the exponential regression model Vij = exp(ctj + fcxi) + e{j,
(2.1)
and the logistic regression model Vij = exp(«i + PixCj + ei:j,
(2.2)
For both the models, i = 1,2, ...,k denotes the experimental units and j = 1,2,..., 7i; denotes the number of observations on the ith unit. In these models, the response yij is related to the fixed variable Xi, which can be the covariate or the concomitant variable. The error etj is usually assumed to follow the normal distribution with zero mean and unit standard deviation. When the coefficients a; and fii are fixed parameters, they can be estimated through the nonlinear estimation and similar procedures. However, in some applications, one or both the coefficients are assumed to be fixed or random. A model used in economic analysis for relating production (P) to labor (L) and capital (C) is the Cobb-Douglas production function Jr{j — Ct{yL/ij j
y^ij)
~r ^ij >
(2.3)
254
i = 1,2, ...,m industries, and j = 1,2, ...,£ time periods, for instance. One or more of the coefficients o^, /?, and 7; may be assumed to be random to perform meta-analysis, for instance. To analyze the effects of herbicides, the dose-response model considered by Rudemo et al. 16 takes the form of Vij
=C+(D-
C)/[l + (xij/xoi)0']
+ vtij ,
(2.4)
i = 1,2, ...,8 herbicides, and j = 1,2, ...6 doses. In this model, the concentration of herbicide, xoi is the half-effect of the herbicide, and y^ is the weight of plants. The coefficients C and D are the same for all the herbicides, and a.i represents the fixed effect. The random error e^- is assumed to have mean zero and unit variance. When the half-effect varies with the herbicide absorption by the plants, the random-coefficient model considered by the above authors takes the form of VH =C+(D-
C)/[l + ( f t i « ) a ' l + °tih
(2-5)
where ai is fixed and f3i random. For the case of replications, with the additional subscript k, the random effect rk was added to the model. Box-Cox type transformation, Taylor's approximation to the right side of the model and suitably weighting the error term followed by the likelihood procedure were examined to estimate the parameter of the transformation. Data on four pairs of herbicides and the dry weights (y) of 150 plants were used for this purpose. The above authors have also mentioned that when there are errors of measurement in the controlled variable Xij, it can be replaced in the above models by Xij(obs) = Xij(true) + crx6. Now the model would contain an additional random effect. 3
Nonlinear Models to test normality or to validate transformations
In the one-way mixed effects model, yis = /j, + Ai + Bis, the random effect Ai and the error term B « are usually assumed to be independently normally distributed with zero means and variances a\ and a\ respectively. To represent departures from normality, Solomon and Cox 20 considered yis = fj, + Ai + Bis + a20A2is + anAiBis
+ a02Bfs ,
(3.1)
where i = 1,2, ...m represents the units and s = 1,2,..., r represents measurements on the ith unit. The above authors have demonstrated that "skewness
255
of the random effects and heterogeneity of the within-group variation" are related to the fixed parameters 0:20, « n and ao2 • To test the hypothesis that these coefficients are zero, it was suggested that a suitable permutation test or the likelihood ratio procedure can be employed. Estimation of these parameters and the variance components a\ and a\ through the third order moments obtained from the observations yis was described. To discuss transformations, Solomon 19 considered the mixed-effects model y\'s = fi + A* + B*s, where A* is the random effect and B*s is the error term. Solomon and Cox 2 0 noted that retaining the quadratic terms, this model can be expressed approximately as the nonlinear model in (3.1) with ai2o = c*02 = <2n/2, and examined the approximation with data from 16 systolic and diastolic blood pressure measurements on each of 25 patients. This relationship for the three fixed parameters was found to be satisfactory for their estimates obtained as described above with the actual measurements as well as their square root and log transformations. 4
Linear Mixed Effects Model, Marginal Distribution and the Likelihood Estimation
This section briefly summarizes the estimation procedures for the linear mixed effects model, and it is intended as an introduction to the estimation procedures described in the following two sections. Consider the case in which f(yij\ai) is normal with mean fii = fi + a.i and variance a 2 , and /(OJJ) is normal with mean zero and variance
(4.1)
i = 1,2,..., k and j = 1,2,..., m, where e^ and a* follow independent normal distributions with zero means and variances a2 and a2a respectively. The joint distribution of the n — km observations is f(yu,-,Vkm.
I a>i,a2,...,ak)f(ai,a2,...,ak)
,
(4.2)
which can be expressed as £1(2/11, •••, Vkm\ M> v2)g2{ai, Vi,H, a2, G\), th
(4.3)
where y~i is the mean of the m observations of the i group. The marginal distribution of (j/u,..., j/fcm), which provides the likelihood function L(/i, a2, a\), is obtained by integrating the joint distribution in (4.2) with respect to c^, that is, by integrating out g2 in (4.3). As pointed by Solomon and Cox 2 0 , this integral is the same as E[L(fi, a2, a2a, a,-)] where the
256
expectation is taken with respect to cti. ^From this integration, the marginal distribution of (yn, •••, Vkm) becomes the product of k multinormal distributions each with mean fi, variance a\ + u2 and covariance
(4.4)
In this model, Y is an n-vector of observations, e is an n-vector which includes the random effects and the residuals, X is an (n x p) known matrix, and (3 is the (p x 1) vector of the fixed parameters. In the right hand side of (4.4), suggested by C.R. Rao 13 , £ represents the random effects and £o the residuals. As suggested by Harville 5 , the Linear Mixed Effects Model can also be expressed as Y = Xf3 + Zb + e .
(4.5)
In this representation, the vector of the random effects b is assumed to have mean zero and dispersion D(b) = D, and the vector of residuals e is assumed to have mean zero and dispersion D(e) = R. As a result, the dispersion of Y becomes V = ZDZ' + R. The marginal distribution of Y is obtained by integrating the joint distribution f(Y\b)f(b) with respect to b. When f(Y\b) and f(b) follow the multivariate normal distributions, this marginal distribution becomes the multivariate normal distribution with mean X/3 and covariance V given by f(Y; (3, a) = ( 2 n ) - " / 2 | V | - < 1 ^ e x p [ - ( l / 2 ) ( y - X(3)'W(Y
- X(3)] . (4.6)
In this expression, a = (Cy)) denotes the elements of D and R, and W = V". This distribution provides the likelihood function for (3 and a, L = L(/3,a). ^From (4.6), the likelihood estimates of f3 and a are obtained from P = {X'WX)-X'WY
(4.7)
and trlVidV/daij)}
= (Y- X(3)'W{dV/doi:j)W{Y
- Xf3) .
(4.8)
Note that if V = ECT^V;, (dV/daf) = Vt. The joint density f(Y\b)f(b) was considered by Hartley and J.N.K. Rao 4 for the estimation of the fixed effects and the variance components and by Henderson 7 for the estimation of (3 and predicting the random effect b. The MINQUE and MIVQUE procedures as well as the REML method for estimating the variance components are presented in C.R. Rao and Kleffe
257 14
and P.S.R.S. Rao 15 , and the REML method by Harville 6 . As shown by all these authors, the Best Linear Unbiased Predictor (BLUP) of the random effect vector b is obtained from DZ'W(Y — X/3). With the normality assumption for b and e, the BLUP is the same as E(b\Y). For the variance components of the Generalized Linear Model (GLM) considered by Schall 1 7 , Lee and Chaubey 8 examined the MINQUE type of estimation. 5
Nonlinear Mixed Effects Models and an Approximate Likelihood Estimation
Following (4.5), for the nonlinear case, Vonesh and Carter 2 3 and Gumpertz and Pantula 3 , for instance, consider mixed effects models of the type Y = F(X(3) + Zb + e,
(5.1)
where F(X[5) is a nonlinear function of the fixed effects (3. Note that Zb with the random effect b and the error e are additive. Sheiner and Beal 1 8 , Lindstrom and Bates 9 , Solomon and Cox 20 , Wolfin24 ger , and others consider models of the form Y = F{Xp + b) + e
(5.2)
and Y = F(X,p,Z,b)
+ e = F + e.
(5.3)
The nonlinear models of the type presented in Sections 2 and 3 can be represented through these general models or their modifications. As in Section 4, the dispersions of b and e can be represented by D and R respectively. 5.1
Estimation through the Approximate Likelihood
For the estimation of the fixed effects and variance components of the nonlinear models in Section 2, 3 and 5, least squares, Analysis of Variance, Maximum Likelihood and similar methods may be adapted. For instance, for the estimation of the fixed effects of the model in (3.1), Solomon and Cox 20 considered the method of moments approach. Transformations of one or both sides of the models before applying these types of approaches may provide acceptable estimates in some situations. However, unlike in the linear case described in Section 4, for the nonlinear mixed-effects models of the types in Sections 2, 3 and 5, the integral of g2 can not be expressed in an explicit form. As a consequence, the marginal distribution of the observations and the corresponding likelihood function will not be available.
258
Solomon and Cox 2 0 suggest expressing gi in a Taylor's series and retaining the leading terms, before integrating out the random effects. If g^ is of the exponential type, these terms correspond to the Laplace expansion. Now, the fixed effects and variance components can be estimated from the resulting marginal distribution obtained as described in Section 4. The validity of this approximate procedure was examined by the above authors for the likelihood corresponding to the exponential model Yjs = exp (6 + Aj)xs + Bjs,
(5.4)
j = 1,2, ...,m and s — 1, 2,..., r. The random effect Aj and the error term Bjs were assumed to be independently normally distributed with zero means and variances o\ and a\ respectively. The approximate likelihood was found to be close to the exact likelihood, unless o~\ is very large. The above authors have also investigated the approximation to the likelihood for testing the difference between the effects of two treatments with the responses arranged in several 2 x 2 tables. For the hypothesis of equality of the means of the effects of the treatments, the above type of approximation for the likelihood corresponding to the logistic model was considered. The power of the score test with the approximation was compared with the Mantel-Haenszel test. To adopt the approach for the approximate likelihood L(/3,cr) of (5.3), Wolfinger and Lin 2 5 consider the joint distribution g = C|J?| _ 1 / 2 exp{-(Y - F)'R'l(Y |£>|-
1/2
-
1
exp{-6'Z)- 6/2} .
F)/2} (5.5)
Let, F0 = F0{X,(3) = F[(X,P,Z,b)\b = 0], Z0 = dF/db'\b = 0 and Vo = ZQDZQ + R. The approximation results in the maximization of the likelihood corresponding to the mixed effects model Y = F0(X,(3) + Z0b + e. With the "pseudo observations", Y0 = Y - F0(X,P) dFo/d/3, the estimate of 0 is now obtained from p0 = (X'0WoX0r(X'0W0YQ),
(5.6) + X0/3, where XQ = (5.7)
where Wo = (Vb) _1 . This expression resembles (4.7) for the linear mixed effects models. Estimates of the variance components a for (5.1), (5.2) and (5.3) are obtained from (4.8) with X0, YQ, VO and (30. The same approximation can be considered for the REML estimates of the variance components. The BLUP for b is obtained from b0 = DZQW0(Y0 - X0(3). For the empirical BLUP, P in this expression is replaced by the estimate in (5.7).
259
The above authors examined the approximation in (5.6) for two models. The first model considered was the logistic model Vij = (A + bn)/{l
+ e x p [ - ( t y - /?2 - 6«2)//?3} + e tf)
(5.8) th
for i = 1,2,..., 15 subjects, with j = 1,2, ...10 observations on the i subject. The coefficients /3i and /32 are the fixed effects, (bn, ^2) are the random effects and ey is the error. Ten different values were considered for £y; same for each subject. This model was examined earlier by Pinheiro and Bates n . The second model examined was the pharmacokinetic model for the concentration of a drug given by logyij = log F + 6ij,
(5.9)
where F = {10fca[exp(-fcj£y) - exp(-katij)]/vi(ka - ki)}exp(eij). In this model ka is the fixed effect and fcj is the random effect. The investigation showed that for the approximate likelihood method, estimators of the fixed effects are almost unbiased for the first model, but biased for the second. Shun and McCullagh 21 discuss the conditions suitable for the Laplace approximation. Vonesh 22 applied the Laplace approximation only to the random effects, and found that the resulting estimates are consistent; the rate of convergence was found to depend on both the number of subjects (i) and the number of observations (j). For the repeated measures data, Lindstrom and Bates 9 considered the model Vij
= F(Xi(3 + Zibi) + etf .
(5.10)
For the Taylor's approximation, the derivative of F at the current estimate of b' = (61, 621 •••) was considered; Sheiner and Beal 18 considered the derivative at b' = 0. Ramos and Pantula 12 also discuss the estimation procedures for the nonlinear random coefficient models. 6
Nonlinear Models for Reliability and Degradation
Degradation, which is the cumulative damage or deterioration, provides valuable information on the reliability, for instance, of an electronic component. An important aspect of degradation is the estimation of the time-to-failure distribution for a specified level of deterioration. Degradation over time can be linear or nonlinear. We describe below one illustration for the linear case and two illustrations for the nonlinear case. Linear degradation can be described by the model Vij ~ a{ + Pitj + dj,
(6.1)
260
for i = 1,2, ...,n components (sample paths), and j = l,2,...,mj measurements on the ith component, where tj represents the time of the j t h measurement. When pi in (6.1) follows the Weibull, normal or lognormal distribution or when (ai,fii) follow the bivariate normal distribution, Lu and Meeker 10 present explicit expressions for the time-to-failure distributions. A simple description of nonlinear degradation can be obtained from the Paris Law, g = csm, where g is the growth of fatigue cracks, 5 is the stress intensity factor, and (c,m) are constants. Starting with this equation, the above authors considered the model VH =
-(1/#K)
log[l - (.90)^Pu^itj]
+ ey.
(6.2)
In this model, yy is log (crack length at time tj), and (/3H,/?2;) are random. The time-to-failure distribution was obtained through a two-stage procedure. At the first stage, estimates of (Pu, /?2i) were obtained. For the second stage, the sampling variances and covariances of these estimates along with the variances and covariances of (pu, /?2i) were considered. Estimates of the expected values of (pu, p2i) and the time-to-failure distribution were obtained with the weights obtained from the between and within variances and covariances. We note that for the linear mixed effects model, the estimator obtained from (4.7) for the fixed parameter is a weighted average of the least squares estimators, where the weights are inversely proportional to the unconditional variances of the least squares estimators. A similar procedure was followed for the above two-stage estimation. The second illustration is related to the Plateau, maximum degradation over time. Highly reliable systems such as lasers, integrated circuits and optical communication devices do not fail over time, but their performance can deteriorate and reach a plateau. Boulanger and Escobar 1 describe the experiment of Gumpertz and Pantula 3 and consider the Weibull sigmoidal random coefficient model for degradation, Vij(t) = a « [ l - exp(-(/3yt)T] + e^t).
(6.3)
In this model, y^ (t) is the change in propagation delay up to time t of the j t h device at the ith stress level x,-, a y is the plateau for the j t h device which is random over the devices, /3y is related to a y , and 7 is a fixed constant. The problems of importance considered by the above authors were to (a) estimate a y , (b) optimize the stress levels and the proportion of devices to allocate for each stress level, and (c) optimize the times for measuring the degradation at each stress level. At the first stage, Maximum Likelihood Estimates of (ay,/3y) for each device at the ith stress level Xi were obtained.
261 At the second stage, the model log(a«) = A + Bxi + Sij
(6.4)
was considered. In this model, V(6ij) includes the sampling variance from the first stage and the variation across the devices. Estimates of A and B were obtained from this model. 7
Summary and Discussion
Nonlinear mixed effects models are applicable in several practical situations, and some of the important illustrations are presented in this article. To estimate the variance and covariance components of the linear mixed effects models, the ANOVA, MIVQUE, ML and REML procedures or some of their modifications are known to have desirable properties. To adapt the above type of procedures for a nonlinear mixed effects model, suitable transformations of the model or approximations to the likelihood function are required, as seen in the previous sections. In addition, we note that C.R.Rao 13 presents a nonlinear model for the two-way classification, along with the approximations to the ANOVA type of sums of squares and the corresponding estimation procedure. As expected, some of the approximations have been found to perform well under suitable conditions. For the computations, programs such as SAS NLINMIX can be used. For some types of nonlinear models, acceptable estimates are obtained through the two-stage procedures of the type described in Section 6. However, it was mentioned that these procedures may require a large amount of computer time to obtain confidence limits for the fixed effects. For the different types of nonlinear mixed effects models employed in practical situations, further research is required to evaluate the suitability of the transformations and estimation procedures briefly summarized in this article. Acknowledgments The authors would like to thank Dr. Yogendra Chaubey for his interest in this topic and for suggestions on the first draft of this paper. Thanks also to Joan Robinson for patiently and promptly typing this article in LaTeX. References 1. M. Boulanger and L.A. Escobar, Technometrics 36, 260 (1994).
262
2. M.B. Carey and R.H. Koenig, IEEE Transactions on Reliability 40, 499 (1991). 3. M. Gumpertz and S.G. Pantula, Journal of the American Statistical Association 87, 201 (1992). 4. H.O. Hartley and J.N.K. Rao, Biometrika 54, 93 (1967). 5. D.A. Harville, Journal of the American Statistical Association 72, 320 (1977). 6. D.A. Harville, In Advances in Statistical Methods for Genetic Improvement of Livestock (Springer-Verlag, New York, 1990). 7. C.R. Henderson, Biometrics 9, 226 (1975). 8. H.S. Lee and Y. P. Chaubey, Communications in Statistics, Theory and Methods 25 (6), , (1375)1996. 9. M.J. Lindstrom and D.M. Bates, Biometrics 46, 673 (1990). 10. C.J. Lu and W.Q. Meeker, Technometrics 35, 161 (1993). 11. J.C. Pinheiro and D.M. Bates, Journal of the Computational Graphical Statistics 4, 12 (1995). 12. R.Q. Ramos and S.G. Pantula, Statistics and Probability Letters 24, 49 (1995). 13. C.R. Rao, Linear Statistical inference and its Applications, 2nd Edition (John Wiley and Sons, New York, 1973). 14. C.R. Rao and J. Kleffe , Estimation of Variance Components and Applications (North Holland Publishing Co, Amsterdam, 1988). 15. P.S.R.S. Rao, Variance Components Estimation: Mixed Models, Methodologies and Applications (Chapman & Hall/CRC Press, 1997). 16. M. Rudemo, D. Ruppert and J.C. Streibig, Biometrics 45, 349 (1989). 17. R. Schall, Biometrika 78, 719 (1991). 18. L.B. Sheiner and S.L. Beal, Journal of Pharmacokinetics and Biopharmacokinetics 8, 553 (1980). 19. P.J. Solomon, Biometrika 72, 233 (1985). 20. P.J. Solomon and D.R. Cox, Biometrika 79, 1 (1992). 21. Z. Shun and P. McCuIlagh, Journal of the Royal Statistical Society B57, 749 (1995). 22. E.F. Vonesh, Biometrika 83, 447 (1996). 23. E.F. Vonesh and R.L. Carter, Biometrics 48, 1 (1992). 24. R.D. Wolnnger, Biometrika 80, 791 (1993). 25. R.D. Wolfinger and X. Lin, Computational Statistics and Data Analysis 25, 465 (1997).
T H E S A M P L I N G BIAS OF H E C K M A N ' S S A M P L E BIAS ESTIMATOR PAUL RILSTONE York University, Department of Economics, North York, Canada, M3J 1P3 E-mail: prilQyorku.ca AMAN ULLAH University of California, Riverside, Department of Economics, Riverside, CA, 92521, U.S.A. E-mail: [email protected] Heckman's two-step estimator for sample selection models can be poorly behaved in some situations due to a form of multicollinearity. The high dispersion of the estimator in these contexts can be deduced by inspection of the asymptotic covariance matrix. However, large-sample theory suggests that the estimator is asymptotically unbiased. In this paper, we derive the second—order bias of Heckman's estimator and demonstrate that this is substantial in similar circumstances. This is reflected in reported simulations.
1
Introduction
Numerous authors, including Wales and Woodland 14 , Nelson 5 , Paarsch 8 , Nawata and Nagase 6 , have provided substantial practical and simulation evidence that Heckman's 4 two-step estimator for sample selection models can be poorly behaved in certain circumstances. This typically occurs when the variables in "Heckman's lambda" are a subset of the other explanatory variables, resulting in a form of multicollinearity. The impact of this on the dispersion of Heckman's estimator can be seen by inspection of the asymptotic covariance matrix of the estimator. However, the standard distributional results from large-sample theory indicate that the estimator is asymptotically unbiased, implying that "on average" the estimator is correct. In this paper we derive the second-order bias of Heckman's estimator and demonstrate that, in finite samples, the bias can also be quite substantial. We consider the basic sample selection model which consists of a linear regression equation (referred to as the "wage equation" given its predominant use in this sort of analysis) and a decision equation which determines whether the dependent variable of the wage equation is observed. To estimate the parameters of this model it is usually assumed that the disturbances from
263
264
these two equations have a joint normal distribution. a The parameters of the decision equation are estimated in a first-stage probit approach. The inverse Mills's ratio or "Heckman's lambda" is evaluated at the estimated parameter values and this is then inserted as an additional regressor into the wage equation. Several researchers have considered the effects on the sampling properties of Heckman-like corrections when the normality assumptions are relaxed. Goldberger 3 and Arabmazar and Schmidt l<2 have shown that the simple Tobit estimator may be biased in the presence of nonnormality or heteroscedasticity. Schafgans 12 has shown that this bias carries over to Heckman's model, as one would expect. Our concern is with the bias of this estimator, even in the context of a correctly specified model. The results we derive are applicable to more general models with the normal distribution of the disturbances of the decision equation replaced by another. Our results depend on the nonlinearities in the model and multicollinearity problems, rather than with any particular distributional assumptions. The discussion proceeds as follows. In Section 2 we derive the secondorder bias of the Heckman estimator. In Section 3 we examine the finite sample bias with a simulated model and apply the bias corrections to a data set. Section 4 summarizes the conclusions. 2
Derivation of the second—order bias
The higher-order properties of nonlinear estimators have been considered by several authors including Pfanzagl and Wefelmeyer 9 , Skovgaard 13 and Rilstone, Srivastava and Ullah u . In this paper we find it convenient to use the approach in Rilstone, Srivastava and Ullah n to derive the second-order bias of Heckman's estimator. The equation of interest is written Yi = X[a + ei
(1)
where Y; is observed only if a latent variable Z* —W^ + Vi is greater than zero. Xi and Wi are fcxxl and fc^xl observable vectors of explanatory variables. Also observed is an indicator variable Zi = 1 [W-'y + Vi > 0]. e; and V{ have a joint normal distribution with correlation coefficient p. e^ has variance of and the variance of Ui is normalized to one. "Olsen 7 showed t h a t Heckman's estimator is still consistent if the joint normality assumption is relaxed to an assumption of normality of the disturbance term in the decision equation combined with linearity of the expectation of the wage equation conditional on the decision equation disturbance.
265
It is well known that, conditional on Z{ — 1, €j has a truncated normal distribution and that E[e; | Zi = 1] = pcreAj where
is the inverse Mills's ratio, and $ denoting the density and distribution functions of a standard normal random variable. We denote the first- and second-order derivatives of A; by AJ 'and \\ ' respectively, noting that x\ ' = -A; (Aj + Wrf). It is also useful to define K
{ )
(1-*(W77))"
Augmenting the right hand side of (1) with the inverse Mills's ratio we have the regression: Yi = X{a + ijA(Wi'7) + «*
(4)
where rj = p
(5)
Heckman's estimator for this model may be written as the solution to
^£«(ft=°
(6)
where qi{(3)=\Zi\{W'il)um\
V
Sib)
(7)
j
and 0 = (a', rj, 7')' is a k x 1 vector of parameters with A: = kx + kw + 1. The expressions derived in Rilstone, Srivastava and Ullah n are functions of the derivatives of <&(/?) and their expectations. We indicate the k x k and kxk2 matrices of first- and second-order derivatives by S/Qi(P) and \/2qi(/3)b respectively. The second-order bias is obtained via a second-order stochastic expansion of the difference between /? and the true value of /? and then taking the expectation of that difference. In Rilstone, Srivastava and Ullah n it is 6 We note that these are defined such that the /'th row of V9i(/5) contains the gradient vector of the Vth element of g;(/3), the i'th row of V2
266
shown that the second-order bias of estimators solving moment equations of the form given in equation (6) can be written B0) = j;(£ [V^])" 1 [ f [Vidi] - \e [v2%] £ [di ® * ] }
(8)
in which Vi = ^qi-£
[s/qi],
di = (£ [V9i])
1
Qi-
(9)
Sufficient conditions under which B(/3) is a valid expression for the secondorder bias are discussed in Rilstone, Srivastava and Ullah u and consist of smoothness restrictions on the model as well as existence, uniformly in /3, of the fourth moments of the random variables appearing in \7qi(/3) and \/2qi((3). Since qi((3) is infinitely differentiable in this case, the smoothness conditions are satisfied. We assume that the moment conditions are satisfied. The matrix of first derivatives is ZiXi v Ui (p) (10) Xjqi (/?) = | Zi (Xi V Ui (/?) + Ui (/?) V Ai) V5i(7) -ZiTiXiX^W! -Zii-Uiifl-vXjxVw! SZiXV - (1 ZJX^WiW Since U{ has zero mean, the expectation of this last expression may be written -vXiX^WtZi -riK^WtZi -XiXiWiW!
£ [Vft]
= -£
XrxfZi 0
(11)
vxWjCtWfZi XiXiWiW'
where X* = (X[, Xi)'. The inverse of (11) may be written as (£[Vft]) - 1 =
a b 0 c
where
a = (e [xtxfz^y -r,(e [x;x?z§~ b = £ [X^X*W[Z^
(£ [XiXiWiW;])-1,
a n d c = (f [-AiAiWiW-])" 1 .
(12)
267
We also have £&<& =
a?x:xr'Zi p£ \\WiX*
p£ XfXfWlZi
Zi
(13) £
With expressions (12) and (13) it is useful to review what is known about the asymptotic distribution of 0. Heckman 4 showed that y/N(0 — 0) —>N(0, V) where, using our notation, we can write
v=(f [v*])_1f fe«;i(f [v*])-1' = md'ii
(14)
Bv inspection, we can characterize when the standard errors associated with 0, that is the square roots of the diagonal elements of V, may be large. We concentrate on the first kx + 1 elements of V. There are several standard situations, such as when the X^s are collinear and/or a2 is large. What is of interest are those situations which are particular to the sample selection model: there are two such interesting cases. The standard errors will be large (£[V9i] approaches singularity) when the columns of X* are closely collinear; in particular when A; can be written as a linear combination of the Xi's. Since A; can be well approximated by a linear function of its argument, 0 W/7, collinearity will be a problem when the Wf's are a subset of the X^s or are highly correlated with them. d The second instance where the asymptotic standard errors may be large is, again by inspection, when p is large. This corresponds to situations when there is a large degree of simultaneity in the model. The hypothesis that p is zero is one that is often tested via significance tests on the estimator of 77. We note that both of these situations when the standard errors may be large are quite common in empirical studies. There is often very little theoretical justification for assuming that, for example, there are elements of Wi which are not contained in Xi or that there is no simultaneity between the wage and decision equations. As is evident from equation (8), the second-order derivatives of the moment equations may make a substantial contribution to the second-order bias. We derive explicit representations for these in the following equations. Put g
2
V 9i(/?) c
llg12g13>
q21 q22 q23 q31 q32 q33
Tobin recognized this in his original work on his model. ''Some authors state as an identification condition that the Wi's contain at least one element not in the Xi's. Strictly speaking this is not necessary, although from a practical perspective, if this is not the case, one can expect to have large standard errors.
268 1
qu = (ofcxX*
— Ofc xX (fc :r fc),
qU = (Ofc.x{fc.few) g21 = (0 lxfc2
(W
X;iX^Wi)
Jkxxl
- XiX^Wi
,
- vXiX™ {W[ ® W/)) , g22=(0lxfe0lxi-2AiAfV/)
- Af (*{ ® W/)) ,
«23 = (-Ai 1 ) (W/®X < )-A i AJV/£>) ) £> = («i(/?)A<2> - r ^ ) 9
=
(Ofc„XfcJ
Ofc^xfc*
2
- A,A<2) - (AJ1*)2) {W> ® W*),
Ofc^xfc^fc^j ,
9
=
(Ofc^xfca,
Ofc^xl
Ofc^xfc^))
(ZiX^ + (1 - Zi)AJ2V<(W/ ® W/)) •
Ofc„xu
It is readily shown that
£ [v a »] (olx,2
dx/c
(O fc .xfcO lfc .xi-X i A i (1) W/)
g13.
(olxfc01xi-2AiAf)^/)
g 23
- A((1) {X[ ® W/))
V
g
,
33
where
913 = e (Ofc.x(*.*„) XiA]1 V ; ^ A f > (w/ ® w / ) ) ,
fa = e(-\?\w'i®xi) t 3 = s(0kwxk^
-x^wa-x^P
0kwXkw
-(x^rxw^w'), (l-$i)xf))Wi(Wl®W!)),
($iX™ +
Hn HnH13' 21 22 23 (^[V%]) = I H H H 0 0 H33 _1
Since £[Vidi\=£\xjqi(e[x7qi])
l
q{
269 and (£[v
/HnXiUi + Hu\iUi + H13Si 21 H XiUi + H22XiUi + H23Si V H33Si
1
ii
£ [Vidi] (
-e XiXiH^Si
V
XiX'^St 2
+ XiXiH^Si 23
+ 33
+ X H Si + XiX^W!H Si (Z^
- (1 -
XiX^W-H^Si - m (/?)
, TZ^XY^WiWlH^Si (^
\ X^W'^Si
J
We can now examine in what circumstances the second-order bias may be large. It is not possible to put a sign on the terms of this bias, but several qualitative observations may be made in this respect. There are two terms in B((3), both scaled by (£ [V9i])~ • The first term, £ [Vidi], is essentially the correlation between the residual v % — £ [V
Simulation and Empirical Results
To get a sense for how well the second-order bias derived in Section 2 reflects the sampling bias of the Heckman estimators we conducted a small set of simulation experiments consisting of a wage equation in the form of equation (1) and a decision equation. Each has an intercept and one conditioning variable. The disturbances are constructed as jointly normal. This was parameterized so that there were five regression parameters including the coefficient on A^ and the variance from the wage equation. (This was set equal to unity. Recall that the variance term from the decision equation is normalized to one
270
in these models.) The intercept terms are set to zero, the coefficients on Xi and Wi set to one. We altered two components of the model across the experiments. First for obvious reasons, the correlation between the disturbances was set at a variety of values. Second, as noted, the bias and variance of the parameters of interest depend on the collinearity of the Xi's and A*. Since the latter is roughly linear in its argument, the correlation between the Wi's and the Xi's is crucial. The Wi's were constructed as uniform on [-1,1]. The Xi 's were then constructed as a linear combination mixture of the Wi's and another independent, mean zero, uniform random variable: Xi = pxwWi + Vi. The support of the distribution of Vi was altered across experiments in order to keep the variance of Xi constant and equal to that of Wi, across the experiments. The observed sampling properties of the estimators are summarized in the following tables for a variety of values for p and pxw as well as sample sizes of N = 50, 100. We focus on the estimates of the regression parameters of the wage equation and consider their properties relative to what would be an unbiased least squares estimator (albeit infeasible in this case), a "Heckman estimator" with the true values of 7 used in the wage equation. 6 The tables are self-explanatory. The sampling bias (i.e. averaging over the 1000 replications of the experiment) of the estimates is given in Table 1. We note that these reflect the analytical results of the previous section: the biases are absolutely increasing functions of p and pxw, and decreasing functions of the sample size. Note that the intercept and the coefficient on the inverse Mills's ratio are particularly affected. Since it is often the case in practice that the Xi's and the Wi's in these kinds of models contain almost the same random variables, we would often be in situations corresponding to pxw=.75. With the sample size N — 50, the bias can be particularly large. Table 2 gives the mean squared error of the estimates of the usual Heckman estimators, relative to the infeasible estimators. As would be expected, these are also increasing functions of the correlation parameters. To provide an empirical example we used a data set of 50 observations taken from the 1990 Integrated Public Use Microdata series employed in a study of white-black wage differentials by Chia-Hui Chiu. The variables in the decision equation consisted of an intercept, marital status and working status. The dependent variable of the wage equation was log wages, the explanatory variables consisted of an intercept and experience. This is undoubtably a rather simplistic model, but one that qualitatively reflects many which are e
T h e comparison could have been made with other benchmark estimators, notably efficient maximum likelihood estimators, but we feel that this benchmark is more comparable. Moreover, the MLE is not necessarily unbiased.
271 Table 1. Sampling Bias of (3. iV = 50 Parameter ai Q2
•n 71 72
Pi
.25, .25 -0.02882 -0.00258 0.03511 -0.00165 0.05878
.25, .75 -0.03893 -0.00222 0.04815 -0.00165 0.05878 N=
.25, .25 0.00222 0.00491 -0.00216 0.00380 0.02578
.25, .75 -0.00540 0.00746 0.00730 0.00380 0.02578
Parameter Ctl
V 71 72
Pi
Pxw
.75, .25 -0.09583 0.00019 0.11268 -0.00165 0.05878 100
.75, .75 -0.11844 0.00152 0.13676 -0.00165 0.05878
Pxw
.75, .25 -0.01935 0.00637 0.02655 0.00380 0.02578
.75, .75 -0.03019 0.01001 0.03942 0.00380 0.02578
used in empirical work where identification might be rather tenuous. Table 3 reports the usual point estimates and standard errors as well as the bias estimates. Since the bias depends on the true parameters and population moments, the bias estimates themselves are based on the point estimates and sample moments. * We notice that the bias estimates are quite large relative to the point estimates and the standard errors. Such a situation could lead one to make substantially different inferences. One final point should be made here. One should interpret these estimates very carefully. It is well known that higher-order approximations can be very poor, be they based on stochastic expansions as done here or via other techniques such as Edgeworth expansions. Phillips and Park 10 have a discussion of this. It is a reasonable conjecture that this corresponds to situations where the higher terms in the expansions make a strong contribution to the sampling biases. In the course of the Monte Carlo experiments we calculated the analytic second—order biases and bootstrap estimates of the biases. Both of these approaches performed very poorly, particularly in those cases when the results in Section 2 would indicate potentially high biases.
f In principle, using these estimated values rather than true population values should not effect the rate of convergence of the bias estimate since the estimates themselves are consistent.
272 Table 2. Sampling Inefficiency of /3 AT = 50 Pi pxw
Parameter Ql Q2
V
71 72
.25, .25 2.09845 1.00134 1.67572 1.00000 1.00000
.25, .75 1.95023 1.00014 1.72793 1.00000 1.00000 AT =
.25, .25 1.23581 0.99946 1.67572 1.00000 1.00000
.25, .75 1.18577 0.99866 1.72793 1.00000 1.00000
Parameter Ql
ai V
71 72
.75, .25 4.16079 1.00092 2.87769 1.00000 1.00000 100
.75, .75 3.61555 0.99973 2.78589 1.00000 1.00000
Pi Pxw
.75, .25 1.39542 0.99979 2.87769 1.00000 1.00000
.75, .75 1.27143 1.00024 2.78589 1.00000 1.00000
Table 3. Parameter point and bias estimates, standard errors Point Estimate
Bias Estimate
Standard Error
-0.39561 0.06672 5.65843 -0.28100 0.64172 0.12921
-2.70729 -0.12628 13.81740 -0.47202 0.24726 0.17037
0.59017 0.05453 2.92799 0.73259 0.27583 0.25398
Parameter Ql Q2
V
71 72 73
4
Conclusion
This paper has derived the second-order bias of Heckman's sample selection estimator. The message, from the analytical results which are reflected in the simulations, is that this bias can be large when the estimators have large standard errors and/or the parameters are poorly identified. The lesson for applied researchers is that there is even more reason to be careful in interpreting point estimates in these situations. Acknowledgments The authors would like to thank the workshop participants at the University of British Columbia, the 1996 Berkeley/ NSP Conference on Bootstrapping
273
and the India and Southeast Asia Meeting of the Econometric Society for helpful remarks. Research funding for Rilstone and Ullah was provided by the Natural Sciences and Engineering Research Council of Canada and Academic Senate,UCR, respectively. References 1. A. Arabmazar and P. Schmidt, Journal of Econometrics 17, 253 (1981). 2. A. Arabmazar and P. Schmidt, Econometrica 50, 1055 (1982). 3. A.S. Goldberger, in Studies in Econometrics, Time Series and Multivariate Statistics, eds. S. Karlin, T. Amemiya and L.A. Goodman (John Wiley, New York, 1983). 4. J.J. Heckman, Econometrica 47, 153 (1979). 5. R.D. Nelson, Journal of Econometrics 24, 181 (1984). 6. K. Nawata and N. Nagase, Econometric Reviews 15, 387 (1996). 7. R.J. Olsen, Econometrica 48, 1099 (1980). 8. H.J. Paarsch, Journal of Econometrics 24, 197 (1984). 9. J. Pfanzagl and W. Wefelmeyer, Journal of Multivariate Analysis 8, 1 (1978). 10. P.C.B. Phillips and Joon Y. Park, Econometrica 55, 1065 (1988). 11. Paul Rilstone, V. K. Srivastava and Aman Ullah, Journal of Econometrics 75, 369 (1996). 12. Marcia M.A. Schafgans, Semiparametric estimation of a sample selection model: A simulation study (London School of Economics and Political Science, Department of Economics, 1996). 13. I.M. Skovgaard, Scandinavian Journal of Statistics 8, 227 (1981). 14. T. J. Wales and A.D. Woodland, International Economic Review 2 1 , 437 (1980).
COMPUTATIONAL SEQUENCE ANALYSIS: GENOMICS A N D STATISTICAL CONTROVERSIES
P R A N A B K. S E N Department
of Biostatistics and Statistics, University of North Chapel Hill, NC 27599-7420, USA E-mail: [email protected]
Carolina,
In genomics, and more generally, in computational biology, principles of molecular genetics govern computational sequence analysis, providing room for stochastics to comprehend the basic differences between mathematical exactness and biological diversity. With a large number of sites having categorical (qualitative) responses with imprecise interrelationships, conventional (discrete or continuous) multivariate statistical modeling and analysis may encounter roadblocks of various kinds. Limitations of likelihoods and their variants are appraised in this context. Alternative approaches that take into account underlying biological implications to a greater (and parametrics to a lesser) extent are appraised in the light of validity and robustness perspectives.
1
Introduction
At the dawn of bioinformatics (and genomic science too), biostochastics is in an interdisciplinary phase. The conventional approach of planning biomedical studies (in a reasonably controlled setup), formulating statistical models (from an existing pool of standard ones), and carrying out standard statistical analysis may no longer be universally adoptable in bioinformatics setups. At the present it is not precisely known what constitutes the core of bioinformatics. We may quote from a very recent text by Ewens and Grant 9 : We take bioinformatics to mean the emerging field of science growing from the application of mathematics, statistics, and information technology, including computers and the theory surrounding them, to study and analysis of very large biological and in particular, genetic data sets. The field has been fueled by the increase in the DNA data generation. As such, within the broader setup of bioinformatics, we consider here some aspects of computational sequence analysis (CSA) pertaining to human genomes. Principles of molecular genetics, as well as, various environmental factors govern CSA. At the current stage, gene scientists can not scramble fast enough to keep up with the genomics, emerging at a furious rate and in astounding detail. The data dusts need to be settled in their proper places
274
275
before methodological issues can be viewed in a proper perspective. CSA, and bioinformatics in general, do not aim to lay down fundamental mathematical laws that govern biological systems parallel to those laid down in physics. Such laws, if they exist, are a long way from being determined for biological systems. Biodiversity, extreme variability in human immunity, and genetic complexities (in number and activity) have rather confounding impacts on such biological (theoretical) laws. There is, at this stage, mathematical utility in the creation of tools that investigators can use to analyze biological systems data. Because of underlying stochastic evolutionary forces, such tools involve statistical modeling of biological systems, which in turn, requires the incorporation of probability theory, statistics, and stochastic processes. Although knowledge discovery and data mining (KDDM) procedures (or statistical learning, by another name, Hastie et al. n ) are increasingly being used in CSA, it might not be proper to jump on statistical conclusions based on data analysis alone (Cox 7 , Breiman 5 ) unless the algorithms could be justified from statistical perspectives. There is a genuine need to grasp the genetic and molecular biologic bases of CSA, and in the light of that to formulate statistical modeling and analysis schemes. Primarily driven by this motivation, we (Sen 21 ) designate Biostochastics to deal with stochastic modeling and analysis (i.e., stochastics) for very large biological (including genetic and genomic) data sets. As such, biostochastics extends to large biological systems which may not have predominant genetic factors; neuronal spatio-temporal model (Sen 2 2 )s are noteworthy examples. In this study, however, we confine ourselves to biostochastics pertaining to CSA that has a dominant genetic and genomics flavor; but our inclination is to emphasize the underlying methodology with a view to facilitate applications. With a large number of sites, with categorical responses, statistical modeling may become quite complex, for which classical likelihood approaches may stumble into conceptual as well as computational difficulties. Conventional (discrete or continuous) multivariate analysis may also encounter roadblocks due to excessively high dimensionality as well as imprecise specification of underlying dependence patterns. Without an acceptable topology that defines neighborhoods, there might not be enough incentive for standard statistical modeling and analysis. Alternative approaches that take into account underlying biological implications to a greater (and parametrics to a lesser) extent are appraised on validity and robustness considerations. Section 2 outlines the molecular biological background, and based on this motivation, statistical approaches are outlined in Section 3. Variants of likelihoods, such as the pseudolikelihood, are discussed in Section 4. Section 5 deals with some nonparametrics along with some general remarks.
276
2
Molecular Biological Perspectives
Within the broader domain of computational biology, we pay especial attention to CSA pertaining to human genomes. Gene, in the Mendelian setup, is the basic unit of inheritance. Genes occur at definite sites or loci, on chromosomes, which are strings of DNA (Deoxyrinucleic acid), the basic genetic material in a cell and the carrier of genetic information for all organisms, except for some viruses. DNA is a double-helical model; it is a polymer, madeup of nucleotides which are four in number, and can be distinguished by the four bases: A (adenine), C (cytosine), G (guanine) and T (thymine). Like the DNA, Ribonucleic acid (RNA) and proteins are also macromolecules of a cell, though they differ in their forms and constitution. Like DNA, RNA is a nucleic acid, but with T replaced by U (uracil), and it has a single strand. Proteins are also polymers, and there are 20 amino acids. Most human cells contain 46 chromosomes, in 23 pairs; one pair relates to the sex chromosomes, while the other 22 homologous pairs are termed autosomes. There are about 30,000 to 40,000 genes embedded within the human genome. Genetic data, for duploid organisms, relate to traits determined by autosomal Mendelian loci, so that DNA plays a basic role in genetic data analysis (Waterman 25 , Lange 14 , Ewens and Grant 9 ) . Principles of molecular genetics, such as the central dogma that DNA makes RNA makes protein, govern CSA. The transfer of genetic information from DNA to DNA (called replication) means that the molecule can be copied; the loop from DNA to RNA called transcription precedes the loop from RNA to protein, called translation. The RNA which is translated into protein is termed the messenger RNA (or mRNA), and the transfer RNA (or tRNA) translates the genetic code into amino acids. If we accept the basic role of DNA as the genetic information carrier, then it is natural to conclude that evolution is directly related to changes in DNA This is the genesis of molecular evolution. Substitutions, such as A «-» G or C <-» T are called transitions, while A <-> C, A <-> T, G «-> C, and G <-> T are called transversions (recall that in a DNA, A pairs with T and G pairs with C). Physiochemical studies, electron microscopy, and X-ray diffraction analyzes have established that most DNA molecules are long, flexible, threadlike structures, having a nearly constant diameter with regularly spaced and repeated structures, irrespective of the base composition (A, C, G, T) or their order. In the usual form of the double-helix structured DNA, the purine and pyrimidine rings in each chain are stacked 0.34nm, i.e., every ten base pairs. There are two external grooves: major (wider) and minor (narrower). Next, note that amino acids are encoded by triplets of nucleotides, called codons. Let
277
us define MR = {A, C, G, U}, and let C = {(xi, x2, x3) : Xj € MR,j = 1,2,3} be the codon. Finally, let A be the set of aminoacids and termination codon. Then the genetic code can be defined as a map: g : C —+ A, g £ Q, so that Q is the set of all genetic codes. As the human genome project is heading for a completion, there are some formidable statistical tasks which are surfacing into the research efforts to fathom out the mystery of the genomic code. 3
Statistical Motivation
Most problems in CSA are essentially statistical. Amidst a chaos of random mutation and natural selection, stochastic evolutionary forces act on genomes, resulting in genetic drifts, and thus requiring stochastic modeling and analysis for quantitative studies. In genomic sequence analysis, typically, we encounter data on a large number (K) of positions or sites, and in each position, we have a purely qualitative (nucleotides or amino acid labels) categorical response with 4 to 20 categories depending on the DNA or protein sequence, the spatial (functional as well as stochastic) dependence (or association) patterns of these sites may not be known, nor can they be taken to be stochastically independent. Also, as has been mentioned before, regular and nearly identical structures of the DNA solicitate statistical appraisal based on other variational properties which exhibit more statistical variation and information too. In this way, we are more in the domain of biostochastics than biomathematics or purely computer algorithms. In this high-dimensional qualitative response setup, it is difficult to incorporate standard (discrete or continuous) multivariate analysis tools, in a parametric formulation (as the number of associated parameters may be exceedingly large and the underlying model may not be that well specified or anticipated). As such a (complete/partial/conditional/profile/pseudo) likelihood approach may be hopeless, unless there is a sample, drawn objectively, of enormously large size. On both counts, we may have difficulties in adopting conventional tools. If we restrict ourselves to a single site, in most cases (particularly, for viral sequences), there is little statistical information. In a multiple site context, treating the sites as independent could lead to serious misspecification of the model. As such, we need to consider high-dimensional qualitative categorical data models preserving intersite dependence, and then to proceed to CSA statistical appraisals. There has been some attempts to incorporate suitable quasi-likelihood methods based on generalized estimating equations (GEE) and invoking Markov chain monte Carlo (MCMC) methodology that amends easily to Gibbs sampling and Metropolis-Hastings algorithms (Durbin et al. 8 ) . However, these procedures are yet to have a solid
278
methodological foundation in such a complex setup. In this study, we confine ourselves to some specific models, and appraise variants of likelihoods and alternative approaches towards their resolutions. 4
CSA : Likelihoods and Alternatives.
Let us consider a setup of K sites, and let X = [X\,...,XK) be the response (stochastic) vector, where each Xj is categorical with anywhere between 4 and 20 qualitative categories. If we denote the joint probability function of X by p(x), we may express it as K
P(x) =p(xi)'[[p(xj\xi,i
<j).
(1)
3=2
If these sites were ordered in some way and had a Markov chain (MC) property, then we could have written p(x) =p{xi)-p{x2\xl)---p(xK\x(K_l)).
(2)
Things could have been even simpler if these conditional probabilities p(xj\x(j_i)) were the same for all j = 2,...,K. However, lacking such an ordering, a MC property can not be taken for granted, and on top of that the stationarity of the transition probabilities may need critical appraisals. If we are able to validate the MC property, but stationarity may not hold, then the number of parameters associated with the joint probability law jumps from C(C +1)— to C — 1 + (K — 1)C 2 , and this in turn may demand an excessively large same size (as K is large) to justify conventional statistical methods to be valid and efficient. In the more likely event, we do not even have an ordered index set J = { 1 , . . . , K}, so that it may not be feasible to adopt a MC model. Of prime scientific interest is the statistical comparison of sets of genomic sequences from human immunodeficiency virus (HIV). It may be noted that a retrovirus, like HIV, has the ability to reverse the normal flow of genetic information from genomic DNA, and that the genetic variability of HIV is relatively high compared to other retroviruses. In this way, we have a spatial (or spatio-temporal) model in a general multivariate analysis of variance (MANOVA) setup. Typically, there are molecular epidemiologic studies of genomic sequences that pertain to different epidemiologic strata, so that external CSA like MANOVA becomes pertinent. On the other hand, comparing the variability at different sites for the same individual (or their covariability) constitutes the internal CSA (like the canonical analysis in multivariate
279 statistical analysis). In this way, we encounter both MANOVA and canonical analysis type problems in a very high dimensional categorical data model. The conventional product multinomial model can be adopted to formulate suitable likelihood functions. With K sites and C(> 4) categories for each site response, we have typically CK cells, so that the number of parameters involved in a single multinomial law is equal to CK — 1, and if there are G groups, this number jumps to G{CK — 1). Clearly, in this setup, with large K the above number increases exponentially (even for G = 1 or 2 and C = 2. This in turn requires the sample size(s) astronomically large in order that standard discrete multivariate analysis tools (Agresti 1 , Sen and Singer 23 ) can be validly used to draw statistical conclusions from acquired data sets (satisfying suitable objective sampling schemes). Faced with this roadblock, one may naturally wonder whether the model can be represented in terms of a comparatively smaller number of parameters, and suitable pseudolikelihood formulations can be made instead of the classical likelihood one to avoid some of these difficulties. We present some of these in a very simple setup, and appraise their suitability in CSA. For the K-site model, we introduce a vector Y = (Yj,... ,YK) where Y/t is equal to 1 or 0 according as there is a mutation at site k or not, for k = 1 , . . . , K. Also, let Q = { ( i j , . . . ,iK) : ik = 0,1, k = 1 , . . . , K}; note that the cardinality of fl is 2K. Further let P(y) = P { Y = y } , y £ (1. Let us define then Q(y) = log{P(y)/P(0)}, so that by definition, P
(y)
=
e
Q(y)/^eQ(z)
(3)
zgfi
Led by a basic representation of multivariate binary random variables due to Bahadur 3 , Liang, Zeger and Qaqish 15 advocated the following representation: K
P(y) = expi^UkVk fc=l
+
^2
yaVkUsk + --- + yi---VKUi...,K},
(4)
l<s
where the u^ are the conditional logits, usk are the conditional log -odd ratio, etc.. If in this representation, we let all the second and higher order interaction coefficients (uijfe..) to be null, we end up with a pairwise dependence model wherein K
Q(y) = '^2akyk+ fe=l
^2
iskVsVk,
(5)
\<s
where the otk and jsk are respectively the main effect and first order interactions.
280
We conceive of n independent sequences Yj, i = 1 , . . . , n and denote the K x n matrix of observations by Y = (Yi,. .., Y n ) , and denote its fcth row (vector) by Y„(fc), k = 1 , . . . , K. Then, following Besag 4 , we formulate a pseudolikelihood function as K
LP{0, Y) = [ J P(Yn(k) = yfc|Ys(n), a ± k, 9),
(6)
fc=i
where 9 = (c*i,... ,0^,712,- • • ,1K-I,K)Writing j k k = 0, k = 1 , . . . , K, we may show that for this pairwise dependence model,
M«,Y)=nn - — T T ^ — 7
(7)
0, +
l = i " 1 [ 1 + e »"< * 2^=i ^ » « > This is termed the autologistic model. It is not clear how in general, can this pseudolikelihood function be interpreted as a conditional, partial or even profile likelihood function ? The maximum pseudolikelihood estimator (MPLE) 9n of 6 is a point of maxima of the pseudolikelihood function. This MPLE can be taken more in the spirit of generalized M-estimators (as it is based on a model different from the likelihood function); moreover, using the structure in (7), it can be shown that the MPLE is consistent and asymptotically normal. However, because of the basic model difference, robustness perspectives of such MPLEs need to be assessed properly. Further, for the same reason, the MPLE's efficacy is unknown, and expected to be lower than that of the MLE if the latter were obtainable from the data. Moreover, this also raises the robustness issues for the MPLE for model departures and distortion due to the choice of the particular autologistic model. For the MPLE, there are computational difficulties, and moreover, there is no direct way of obtaining the (asymptotic) dispersion matrix of the MPLE (which depends on the unknown likelihood function). The deficiency of the MPLE stems from the fact that the associated estimating equations (EE) are not isomorphic (even asymptotically) to the EE for the MLE. Also, the observed information (matrix) may not be available in this formulation. Thus, there are some genuine concern over the use of MPLE on the ground of robustness and efficiency. From the EE perspectives, one may use the results in Liang et al. 15 to study large sample properties of the MPLE, although for large values of K, it remains to appraise how far such asymptotic properties are tenable. We may notice that in the above formulation of the autologistic function
P(9,Y) =
C(9)F(Y,0),
(8)
281
where C(6) = P(Y = 0,0) and n
K yik ak+
logF(y,e) = ^2Y^ ^ i=l k=l
K
Y^ IkiVii).
(9)
j=k+l
This representation makes it possible to use the simulated likelihood ratio technique, using the Gibbs sampling or the Metropolis algorithm (or more generally the Hastings algorithm) to generate a simulated sequence of random vectors to carry out the maximization on simulated data. Note that the above sketched MCMC procedure, advocated by Geyer and Thompson 10 , is highly computation incentive, and as in the current CSA setup, usually we have a very large dimensional 0 even such MCMC techniques may run into computational complexities. If we consider internal CSA problems, in an autologistic model, we need to confine our attention to the sub-vector 7 = ( 7 ^ , 1 < j < k < K), there being (^) such parameters. As such, for large K we have a larger number of parameters, and all the problems referred to in connection with 6 show up in this case too. In addition, for testing independence of the K positions an autologistic model may have good relevance, but for canonical analysis, such models may be questionable. The basic difficulty in incorporating the classical canonical analysis (Anderson 2 ) in this context is the lack of linearly combinability of the coordinates of Y; so as to justify the basic interpretation of canonical correlations and canonical variates. We may refer to Sen 22 for some discussion of canonical analysis in neuronal spatio-temporal models, and the present setup is quite akin to it. If we want to have external CSA, such as the MANOVA for several independent groups of sequences, for testing homogeneity of the groups, the autologistic model may serve a good purpose. If we denote the 5th group parameter vector by 0g = (ag,jg) , g = 1, G, then primarily we may be interested in testing for the homogeneity of the ctg. As in the normal theory MANOVA, we may be tempted to assume that the j g are homogeneous, but in view of a lack of parametric orthogonality of ag and 7 such an assumption may have to carefully appraised. Since these parameters are not directly related to the full likelihood functions, the usual likelihood ratio type test may not work out well in this setup. Pseudolikelihood based inference would be subject to the same criticism as in internal CSA. Wald-type test based on the MPLE of the ag may conceived of. But, as it is cumbersome to estimate the (asymptotic) covariance matrix of the 0g, a formulation of the Wald-type test statistic may not be that convenient either. Moreover, with such estimated covariance matrices, we may have genuine concern regarding their robustness
282
properties (against plausible model departures), and as a result, such tests are not expected to be robust. However, it is possible to introduce a suitable permutation test for the homogeneity that we shall consider later on. Suppose now that instead of the binary response we consider the full model involving CK possible response vectors for K positions, each with C possible (qualitative) outcomes. Even for a single position, instead of a conventional logit model (for binary outcome), we need to consider a generalized logit model for polychotomous responses. For multiple sites, a precise formulation of the pseudolikelihood function becomes quite involved, involving too many parameters. Neither the Bahadur 3 representation nor the Liang , Zeger and Qaqish 15 formulation (for the binary case) provides a good resolution. Besides, the question of robustness, validity and efficacy of such procedures becomes even more pertinent in this general case. As such, we take recourse to suitable alternative external CSA procedure. Let us now consider a permutation test for homogeneity of G groups of independent sequences in an external CSA setup. Consider G groups, where the gth group consists of ng independent sequences, for g = 1 , . . . , G. In a pseudolikelihood formulation, we consider the autologistic model, and denote the parameter vector for the
(10)
Let now R „ = ( i ? i , . . . , Rn) be any permutation of ( 1 , . . . , n), so that there are n! possible realizations of R „ ; we denote this set by 1Z. Let then Y°(Rn)
=
(Y°Rl,...,Y°RJ,
(11)
283
for R n G 7c. Thus, corresponding to the sample point Y ° , we generate an orbit of n! sample points, and we denote this orbit by 0 ( Y ° ) . It follows by standard rank-permutation arguments (Chatterjee and Sen 6 ) that under the null hypothesis {Ho) of homogeneity of the G groups, P{Y°
= Y°(Rn)\0(Y°)}
= (n!)-\
(12)
for every R n G 7c. This (discrete) uniform permutation (conditional) probability measure is denoted by Vn. For the autologistic model, the pseudolikelihood function (for the 5th group), Lp (9g,Yg) is defined as in (7), and this leads to the pseudo-score statistic (vector) U„ 0 (9) at 6 as
u
ff w =i=\E E ( wfc=i) [(vift*+E Wfi) j=i -log(l+eXp{^){afc + ^
7
^
)
} )]},
(13)
for g = 1 , . . . , G. Next, we consider the pooled sample of n sequences, and using (7) we compute the pooled MPLE of 6, denoted by Gn; we may need to use the Gibbs sampling or MCMC tools for numerically achieving this. But this is to be done only for the pooled sample, not for the G samples separately. Let us then denote the aligned pseudo-scores by V%=VM(dn),g
= l)...,G.
(14)
Note that by construction, the MPLE 6 is a symmetric function of the n stochastic vectors Y°, r = 1 , . . . , n, and hence, the MPLE is T^-invariant, i.e., for every point on the orbit 0(Y°), the MPLE remains the same. Therefore, under the null hypothesis, for each g(= 1 , . . . , G), the ng terms appearing as summands of U„*' are marginally identically distributed and they are interchangeable or exchangeable random vectors (across the G groups as well). As such, we express the aligned pseudo-scores as U£>=£u(Yfli)0n) = ]Tunsii, i=l
say, 5 = 1,...,G,
(15)
i=l
where under the permutation setup, the MPLE is invariant, and hence, permutation distribution of the Un„ can be generated by the same permutation law , formulated in (12), but applied to the U n< ^ instead of the Ygi.
284
We define the pooled sample vector U° in the same manner as in Y°, and this way, the orbit 0(Y°) is mapped onto an orbit (9(U°), and hence, the permutation law Vn is generated by the column permutations of R „ over the set 7£, i.e., permutations of ( 1 , . . . , n). Note that by virtue of the choice of the MPLE, we have n
£ U ° = 0,
(16)
so that we have £ p n { U # } = 0 , g = l,...,G.
(17)
Let us further define the permutation pseudo-score covariance matrix by C
v
n
»=^iEEM-..i.
as)
9=1 i = l
and note that this matrix is P n -invariant. Also, let A„ be a G x G matrix with elements Xgg',n = (ng(n6gg> - ng>))/n, g, g' = 1 , . . . , G,
(19)
where Sij is the Kronecker delta (i.e., equal to 1 or 0 according as i = j or not). Then, by standard argument (as in Chatterjee and Sen 6 ) , we obtain on writing the rolled-out vector U„ = (Uj,/ , . . . , U i G ' )' that JBP„(U„ul)
= V„(g)Ari,
(20)
where (££) refers to the Kronecker product of the two matrices. Note that A n is of rank G — 1, and noting that n _ 1 A n has the same structure as the multinomial sample covariance matrix, its generalized inverse can be taken as A; =diag(—,...,—)--ll'. (21) ni no n Note that 0 has K{K + l ) / 2 = M (say) elements, so that the pseudo-scores U„ i; are M-dimensional vectors, and as a result V n is an M x M matrix. We denote the generalized inverse of V n by V~, and note that under fairly general regularity conditions (as the elements of 0 are linearly independent), V „ is positive definite, in probability, as n increases, and hence, it is of full rank in probability. Further, note that by (16), l ' U n = 0,
(22)
285
so that we may consider the following quadratic form as a suitable (Wald-type) aligned test statistic: £
« =E^V-U 9=1
n 3
.
(23)
u
Note that essential for the computation of the test statistic Cn is the MPLE 9n which can be obtained by using the algorithms mentioned before. Once this is done, the computation of the aligned pseudo scores vectors U„ B and the covariance matrix V„ does not pose any horrendous task, so that like the Rao score test statistic in the classical likelihood ratio type tests, computationally the major task is the MPLE for the pooled sample. In this sense, Ln is more like an aligned M-statistic; we refer to Rao and Sen 19 for the general asymptotics for such aligned pseudo-score statistics, considered in a directional data model setup. For finite sample sizes, the exact permutation distribution of £ „ , under Vn, can be obtained by direct enumeration. This task becomes prohibitively laborious as the ng increase. For this reason, there is a genuine need to consider suitable large sample approximation to the permutation distribution of Cn In the present setup, note that the VA9' are binary, and hence, the coordinates of each U„ • are all bounded random variables. This makes Tig,l
it possible to use the celebrated Wald-Wolfowitz-Noether-Hoeffding-Motoo permutational central limit theorem on the individual pseudo-scores statistics. Omitting these details of algebraic manipulations (quite similar to those in Puri and Sen 18 (Ch. 5, p. 186)), we arrive at the following : When the individual sample sizes ng,g = 1,...,G all increase, subject to the condition that ng/n —> pg, g = 1 , . . . , G, where the pg are positive numbers adding upto one, the permutational (conditional) distribution of Cn converges, in probability, to the central chi-square distribution with degrees of freedom (DF) M* = {G- 1)K(K + l ) / 2 . Based on this basic convergence result, asymptotically, the permutation test can be replaced by an unconditional test based on Cn having the central chisquare distribution with M* DF (under the null hypothesis of homogeneity of the G groups). Consistency properties and asymptotic power under local alternatives follow conventional lines of arguments, and hence, the details are omitted. In the context of genomics, usually K is large, and hence, (even for G as low as 2 or 3) M* may be quite large. For example, often K is in the range of 200 to 500, so that even for G = 2, M* may be in the range of 20,000 to 75,000. Even, for K as small as 50, and G = 2, M* is more than 750. With
286
larger DF, the accuracy of the chi-square approximation to the permutation distribution of the test statistic may generally require increasing large sample sizes. Although in some cases in CSA this can be justified, in general, it might create some roadblocks. One alternative is to partition the parameter vector 0g = (a f l , 7 ) , for each g = 1 , . . . , G, and then to test for the null hypothesis of homogeneity of the ag assuming further that the 7 9 are homogeneous. This is quite analogous to the multivariate normal compound symmetry testing problem (Anderson 2 ) where the overall hypothesis HMVC is partitioned into HM and H.VC-, and while testing for the null hypothesis HM (of homogeneity of the means), the homogeneity of the variances and covariances is presumed, although here we are to deal with multivariate binary data model where the parametric orthogonality does not hold, and exact likelihood formulations are difficult. Thus, we need to proceed through appropriate permutation formulations with suitable alignments. In the present case, again we need to consider the MPLE of 6 from the pooled sample (of all the n sequences), and then consider the pseudo scores pertaining to the a component, for individual groups, aligned at the pooled MPLE. The rest of the manipulations are quite similar, and the resulting test statistic, denoted by £ n , would have then a permutation distribution which is asymptotically, in probability, chi-square with (G - 1)K DF. 5
Nonparametrics and General Comments
As has been mentioned earlier, the autologistic model may lack proper statistical justifications from genuine likelihood perspectives. As such, routinely using the autologistic model may lead to nonrobust and inefficient statistical resolutions, and this is particularly of serious concern when K is large. Further, in the more general case of very high dimensional categorical data with C categories for each position (and C > 4), an autologistic model needs considerable modifications (to accommodate such categorical outcomes, instead of binary ones), resulting in a much larger number of associated parameters. While such a case may still be treated in a permutation setup, much of the charm of the autologistic model would disappear, and the inefficiency and nonrobustness aspects may even be more pertinent. Because of this reason, we consider some alternative procedures that attach less emphasis on the likelihood approach, and more on alternative measures that deal with similar homogeneity problems. We may refer to Pinheiro et al. 1 6 , 1 7 for some resolutions, and we will mainly add some further comments to their work. Consider a general CSA with K sites, each one having a categorical response with C{> 2) qualitative categories, indexed as 1 , . . . ,C. For the ith
287 sequence, let X; = (Xn,..., Xm)' be a random vector of responses where X^ denotes the category outcome c(= 1 , . . . , C) at site k(= 1 , . . . , K). Recalling that these sites may not be stochastically independent, we need to have a measure of divergence which takes into account the inter-site stochastic dependence to a certain extent. The primary motivation for using a diversity measure stems from the fact that HIV or some other retrovirus have the ability to have higher mutation rates which can be traced with a diversity index, without going through some likelihood formulations. With that in mind, we define the Hamming distance between a pair (i,i') of sequences as K
1
fc=i
so that Da* is the proportion of sites where X; and Xi' do not match. Since the K coordinate indicator functions are not necessarily independent, this measure attempts to take into account their dependence, albeit in a symmetric manner. It is easy to see that the expected value of Dai is the average (over the K positions) Gini-Simpson 24 diversity indexes, which we denote by AH • It is also possible to employ other measures of diversity which have nonparametric flavour; for details, we refer to Pinheiro et al. 1 6 ). It may be remarked that an optimal nonparametric estimator of AH is the Hoeffding 12 {/-statistic (corresponding to the kernel Dij of degree 2):
^ '
l
This {/-statistic formulation enables us to use conventional statistical tools for testing homogeneity of the AH for different groups. In conventional way of using ANOVA procedures based on [/-statistics, such tests for homogeneity of the AH for the different groups were considered by Pinheiro et al. 16>17; their test is based on the 'between group' and 'within group' sum of squares of the sample Hamming distances. Since the kernel is bounded, asymptotic results were not so difficult to derive. However, because of the fact that under the null hypothesis of homogeneity, for the ANOVA test statistic based on the divergence of the statistics in (25), we have stationarity of order 1 (Hoeffding 1 2 ) , and as a result, the asymptotic null hypothesis distribution is not a conventional chi-squared or variance ratio distribution, and its use needs further simulation results. Following the decomposition of the Gini-Simpson index (Sen 2 0 ) , we consider here a somewhat different formulation. For the 5th group of ng independent sequences, we define the sample Hamming distance as in (25) and denote
288
it by DnJ, for g = 1 , . . . , G. Also, we pool the G groups into a combined set of n sequences, and compute the Hamming distance, denoted by Z?„ . Note that the Hamming distances are nonnegative quantities, bounded from above by 1, and it is possible to express D^' = within group component + between group component G
= E K / " ) ^ + Dn(B), say,
(26)
9=1
where Dn{B) is nonnegative and is stochastically small when the G groups of sequences are homogeneous. Denoting the first term on the right hand side of (26) as Dn(W), we may compare the two and formulate a test statistic as CH = Dn(B)/Dn(W).
(27)
The same permutation principle invoked in Section 4 can be incorporated to find the critical level. However, for large sample sizes, such a test may not have a desired chi-square or normal approximation (mainly due to the fact that Dn{B) when standardized may not be asymptotically normal or chi-square). For that reason, some alternative test procedures have been considered by Sen 21 ; these have simple asymptotic distributions under the permutation setup, as well as, unconditionally. For the internal CSA, as a very first step, Karnoub et al. 13 formulated a conditional test for independence of mutations in the case of K = 2. This area merits specific statistical attention to handle the case of general K. Likelihoods are difficult to be formulated precisely, and hence, autologistic or other models may stumble into conceptual difficulties. Nonparametrics have a much better prospect. High-dimensionality and categorical data setups pose the basic challenges for such resolutions. Acknowledgments This work was supported by the Cary C. Boshamer Professorship Research Fund at the University of North Carolina, Chapel Hill. The author is grateful to the reviewers for their helpful comments on the manuscript. References 1. A. Agresti, Categorical Data Analysis (Wiley, New York, 1990). 2. T. W. Anderson, An Introduction to Multivariate Statistical Analysis (Wiley, New York, 1958)
289 3. R.R. Bahadur, In Studies in Item Analysis and Prediction, ed. H. Solomon (Stanford University Press, Calif., 1961). 4. J. Besag, J. Roy. Statist. Soc. B 48, 192 (1974). 5. L. Breiman, Ann. Statist. 28, 374 (2000). 6. S.K. Chatterjee and P.K. Sen,Calcutta Statist. Assoc. Bull. 13, 18 (1964). 7. D.R. Cox, J. Roy. Statist. Soc. A 158, 455 (1995). 8. R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models for Proteins and Nucleic Acids (Cambridge University Press, UK, 1998). 9. W.J. Ewens and G.R. Grant, Statistical Methods in Bioinformatics: An Introduction, ( Springer, New York, 2001). 10. C.J. Geyer and E.A. Thompson, J. Amer. Statist. Assoc. 90, 909 (1995). 11. T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, (Springer, New York, 2001). 12. W. Hoeffding, Ann. Math. Statist. 19, 293 (1948). 13. M.C. Karnoub, F. Seillier-Moiseiwitschand P.K. Sen, Inst. Math. Statist. Led. Mono. Ser. 33, 221 (1999) 14. K. Lange, Mathematical and Statistical Methods in Genetic Analysis, (Springer, New York, 1997). 15. K. Liang, S.L. Zeger and B. Qaqish, J. Roy. Statist. Soc. B 54, 3 (1992). 16. H. Pinheiro, F. Seillier-Moiseiwitsch, P.K. Sen and J. Eron, In Handbook of Statistics, Volume 18 : Bioenvironmental and Public Health Statistics, eds. P. K. Sen and C. R. Rao (Elsevier, Amsterdam, 2000). 17. H. Pinheiro, F. Seillier-Moiseiwitsch and P. K. Sen, In preparation (2001). 18. M. L. Puri and P. K. Sen, Nonparametric Methods in Multivariate Analysis (Wiley, New York, 1971). 19. C.R. Rao and P.K. Sen, J. Nonpar. Statist, (to appear, 2002). 20. P.K. Sen, Calcutta Statist. Assoc. Bull. 49, 1 (1999). 21. P.K. Sen, Excursions in Biostochastics: Biometry to Biostatistics to Bioinformatics, (Lecture Notes, Academia Sinica Inst. Statist. Sci., Taipei, ROC, 2001). 22. P.K. Sen, Scientiae Mathematicae Japonicae 55 (to appear, 2002). 23. P.K. Sen and J.M. Singer(1993). Large Sample Methods in Statistics: An Introduction with Applications (Chapman-Hall, UK, 1993). 24. E.H. Simpson, Nature 163, 688 (1949). 25. M. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes (Chapman and Hall, UK, 1995).
UNIVERSAL OPTIMALITY OF COMPLETELY RANDOMIZED DESIGNS
K I R T I R. S H A H Department
of Statistics
and Actuarial Science, University Ontario N2L 3G1, Canada E-mail: [email protected]
of Waterloo,
Waterloo,
B I K A S K. S I N H A Indian
Statistical E-mail:
Institute, Kolkata, [email protected]
India
It is reasonable to expect that in a completely randomized design when the number of experimental units is not a multiple of the number of treatments, the most symmetrical allocation would be optimal. However, the available proofs are w.r.t. specific optimality criteria and are not always straightforward. In this paper, we show that the most symmetrical allocation is universally optimal both for the estimation of the treatment effects as well as for the estimation of the treatment contrasts.
1.
Introduction
We consider the problem of comparing v treatments using n experimental units when the design adopted is the completely randomized design (CRD). When n is a multiple of v, it is easy to see that the design where every treatment is applied to n/v e.u.'s is optimal. When n is not a multiple of v, the design where every treatment is applied to [^] or to [-] + 1 e.u.'s may be expected to be optimal. Such a design is known as the design with the most symmetrical allocation (MSA) Known optimality results for the MSA relate only to the specific optimality criteria. For some criteria such as the distance optimality criterion, the proof is somewhat involved (Liski et al. 2 ) . In this paper, we use a theorem of Shah and Sinha 4 to show that the MSA is in fact universally optimal (UO) both for the estimation of the treatment effects as well as for the estimation of the treatment contrasts. 2.
Universal Optimality
The concept of universal optimality (UO) was first introduced by Kiefer 1. The motivation is that for some settings a particular design is optimal w.r.t. very many optimality criteria. In such cases it would be useful to try to establish
290
291
optimality w.r.t. a class of optimality criteria. This would obviate the need for establishing optimality w.r.t. specific criteria. When the class of criteria is the set of all criteria satisfying some minimal reasonable conditions, if a design is optimal w.r.t each of these criteria, it is called universally optimal. Kiefer * exploited this idea to establish UO of the balanced block designs and of the generalized Youden design within the appropriate design classes. Kiefer's notion of UO was somewhat modified by Shah and Sinha 3 who used a slightly different set of conditions on the class of optimality criteria. We present these here for the sake of completeness. Let V denote the class of designs and for d 6 V, let d denote the information matrix for the set of effects to be estimated. Let (•) are (i) 0(C) = C2 i.e. if C\ — C2 is non-negative definite then, <j>{C\) < (C2). (This condition is very appealing. In fact, one would wish to impose only this condition and no others. Unfortunately, this leaves the class of designs much too wide which makes it hard to find optimal designs.) (hi) (j)(Ci) > (f>(C2) = > (f>(tCi) > 4>{tC2) for all positive integers t. (This is reasonable. If d2 is better than d\, then t copies of d2 are better than the same number of copies of d{). (i y ) ((S*a)C) where i g 's are non-negative integers. (This is a form of convexity which is very appealing. It is weaker than the usual convexity condition. See Shah and Sinha 3 for a detailed discussion of this). A well known result due to Kiefer l states that if a design d* is such that (i) Cd* is completely symmetric and (ii) C
292
Theorem 2..1. Let d* £ V. If for a design d, we can get non-negative integers tg (not all zero) such that ( 2~] tg) Cd* — 2_. tgCdg is non-negative definite, then d* is at least as good as d w.r.t. every criterion satisfying conditions (i) to (iv). If this holds for all d G X>, d* is said to be universally optimal. The above theorem is an extension of a result due to Yeh 5 modified for the Shah-Sinha formulation. Yeh required the difference to be a null matrix. In fact, he conjectured that this condition is necessary. However, in our application to CRD, the difference is a non-null matrix. 3.
Optimality of M S A CRD's
Here, we have a total of n experimental units to be used for studying the effects of v treatments. If n is divisible by v, it is easy to show that the design where each treatment is equally replicated is UO. We shall now assume that n is not divisible by v. Let 7ij denote the number of e.u.'s receiving the treatment i. For the most symmetrical allocation (MSA), ri; = x or x + 1 where x is the largest integer not exceeding n/v. Thus, \rii — rij\ < 1 for all i,j for the MSA design. We now show that the MSA design is UO. We deal separately with the following two cases. Case I: Estimation of the treatment effects Our model here is E
(Vij) = HiJ = !. 2 . •• -,ni;i = 1,2,..., V(yij)=a2.
v
Here y,j denotes the j'-th observation on treatment i and the observations are assumed to be uncorrelated. It is easy to see that the C-matrix for the estimation of the fi^s is diag (n1,n2,...,nv). For the MSA design, let p denote the number of treatments for which ni = x + 1. Thus, n = J^ 7i; = p(x + 1) + qx where q = v — p. For the competing design, we assume that n\ > n^ > • • • > nv. Let
293 p
v
i
p+i
y = Y^Ui/p and let z — Yl, ni/ q. The case where p < q is handled similarly. We consider the case where q of the j/'s go to the last q positions and these are replaced by the .z's. Number of such permutations is s = ( p ). We give weight (5 to each of these and weight a to (y,..., y, z,..., z). All others are given weight zero so that a + s/3 = 1. We try to see if this combination gives {x + 1 , . . . , x + 1, x,..., x). Equating these, we get,
ay + f3(tz +(s-
t)y) = x + 1
az + j3sy = x where s = (p) and t = (Pz{). It can be shown that these have solution a = (y — x)/(y — z) and /3 = (x — z)/s(y — z). Since y > x > z and since y > z, it follows that we can get non-negative integers tg such that (^2 tg){x + l,...,x
+ l,x,...,x)
= ^ i 3 ( n i , . . . , nv)g
where (TH, . . . , nv)g is a permutation of {n\,..., nv). This establishes the UO property of the MSA design for the estimation of the treatment effects. Case II: Estimation of the treatment differences The same model as above is still applicable. It is easy to see that the
294
information matrix for the estimation of the contrasts in the fii's is given by
Cd = diag(ni,n2,...,
n2
n„)
(ni,n2,...,n„)
\nvJ Let d* denote the MSA design. It is enough to show that given a design d, for a suitable choice of non-negative integers tg
where Cdg is obtained by applying permutation g to the rows and columns of the matrix CdAgain we assume that n\ > n2 > • • • > nv and take the same choice of £3's as in the previous case. If we define Xd = (ni,ri2,... ,nv), it is enough to show that
Y^tgGgxdxdGa
>
(^tg)xd.xd,
where Gg is the permutation matrix for the permutation g. (This follows from the fact that Y,tgGgXd = (%2tg)xd*). We now consider a u-variate distribution where we assign probability tg/C^tg) to GgXd with the tg's as described above. We have seen that
Since the dispersion matrix of GgXd is non-negative definite, it follows that J2 (Yt
) G9XdX'dG'g
>
x
d*Xd..
It now follows that (^2tg)Cd* — Y^^g^dg is a non-negative definite matrix. This establishes the UO property of the MSA design for the estimation of the treatment effects differences. Remark 3..1. The above establishes the optimality of the MSA with respect to all the known criteria such as A-, D-, E-, MV-, (M,S)- and SD- optimality. Though most of these were known, the derivation for some of these such as SD- optimality were rather involved.
295
Yeh
5
had conjectured that Cd* ~ /
(lg Cdg = 0
is a necessary condition for the universal optimality of the design d*. Here, a g 's are non-negative real numbers. Our above result for the CRD shows that the above matrix could be non-null. However, this leaves open the possibility that for some other choice of a 9 's the difference matrix is null. We now give a revised conjecture that for d* to be universally optimal for any design d, there exist non-negative integers tg such that 2_j^a^d* ~ /] tg Cdg is a n.n.d. matrix. The application to the CRD set up appears to be the first known case where the matrix is non-null. In any case, The theorem used here is a powerful tool for establishing UO. It is hoped that it will be found useful in many investigations. Acknowledgment This work was partially supported by a grant from the Natural Science and Engineering Research Council of Canada. This support is gratefully acknowledged. References 1. J. Kiefer, In A Survey of Statistical Design and Linear Models, ed. J.N. Srivastava ( North-Holland, Amsterdam, 1975). 2. E.P. Liski, N.K. Mandal, K.R. Shah and B.K. Sinha, Topics in Optimal Designs (To appear in Lecture Notes in Statistics Series, Springer-Verlag, New York, 2001). 3. K.R. Shah and B.K. Sinha, Canadian Journal of Statistics 17, 345 (1989). 4. K.R. Shah and B.K. Sinha, In Recent Advances in Experimental Designs and Related Topics, ed. S. Altan and J. Singh (Nova Sciences Publishers, Inc. Huntington, New York, 2001). 5. C M . Yeh, Biometrika 73, 701 (1986).
A N O N P A R A M E T R I C COMPARISON OF T U M O R I N C I D E N C E S W I T H I N T E R M E D I A T E LETHALITY A N D D I F F E R E N T DEATH RATES JIANGUO SUN, QIANG ZHAO Department
of Statistics,
University of Missouri, 222 Math Sciences Columbia MO 65211 E-mail : [email protected]
Building,
SHESH N. RAI Department
of Bio statistics, St. Jude Children's Research Hospital,32 Lauderdale St., Memphis, TN 38105-2794
N.
The comparison of incidence rates of occult tumors is usually one of main objectives of tumorigenicity experiments. For the problem, a common practice is usually to assume that tumors are lethal or nonlethal (Hoel and Walburg 9 ; Sun 1 7 ) , to fit incidence rate data to certain parametric or semiparametric models (Dewanji et al. ), or to treat tumors with intermediate, but known lethality (Lagakos and Louis n ) . In this paper, a simple nonparametric test, which allows tumors to have intermediate and unknown lethality, is proposed for the comparison of incidence rates. The method also allows distributions of death times to depend on treatments or doses. The proposed method is applied to data arising from a tumorigenicity experiment.
1
Introduction
This paper considers statistical analysis of tumorigenicity experiments with focus on the comparison of different treatments or doses with respect to rates of development of tumor (e.g., Dinse and Lagakos 7 ; Dewanji and Kalbfleisch 3 ; Dinse 6 ) . In these situations, the variable of interest is usually the time to tumor onset, which is often not directly observable. Instead, only death time of animals under study and the status of tumor onset at the death time are observed. For such treatment comparison problems, an important factor that has to be considered and has a great deal of impact on the comparison is tumor lethality, measuring the effect of tumor onset on the death rate of animals. Two extreme cases are that tumors are lethal and nonlethal, respectively. The former means that tumor onset kills the animal right away, while the latter means that tumor onset has no effect on the death rate and does not alter the risk of death from other causes. If tumors are lethal or nonlethal, there exist a number of parametric and nonparametric test procedures. This is especially the case for lethal tumors
296
297 since tumor onset time and death time can be treated as being identical and thus are observed or right-censored. In other words, many existing survival methods could be directly applied for the comparison of tumor rates in this case (Kalbfleisch and Prentice 1 0 ). For discussion on nonlethal tumors, see, among others, Dinse 6 , Dinse and Lagakos 7 , Peto and Peto 13 , Sun 17 and Sun and Kalbfleisch 18 . If tumor is between lethal and nonlethal, for the comparison, a common method is to fit some parametric or semiparametric three-states regression models to observed data and then to apply the resulting score test (Dinse and Lagakos 7 , Dewanji and Kalbfleisch 3 ; Rai, Matthews and Krewski 1 5 ). The three-states model includes alive without tumor, alive with tumor and dead. Animals could move from either of the first two states to the last state. A detailed description of the three-state model for tumorigenicity experiments is given in the article by Rai, Sun and Hunt in this volume. The goal of this paper is to develop a nonparametric test procedure without making any model assumption about tumor rates. To compare tumor rates, another factor that needs to be considered is animal death time, which serves as observation or censoring time and could depend on treatments. A comparison not accounting for animal death time difference could overestimate or underestimate treatment difference (Dinse and Lagakos 7 ) . For example, consider a situation in which animals in one group have longer survival times and higher tumor rates than animals in the other group. Then tests assuming the same death distribution could overestimate tumor rate difference. In other words, the survival difference if existing needs to be adjusted for treatment comparison. The remainder of the paper is organized as follows. We begin in Section 2 with introducing notation and assumptions that will be used throughout the paper. Section 3 discusses the comparison of incidence rates of occult tumors with intermediate lethality and presents a simple nonparametric test procedure. For the problem, the Cox 2 proportional hazards model is assumed for tumor lethality and treatment effect on death rates. The proposed methodology allows estimation and test of tumor lethality as well as treatment effect on animal death. The method is a generalization of that proposed in Sun 17 , who considered the same problem for nonlethal tumors. In Section 4 the methodology is applied to a real data set from a tumorigenicity experiment. Some discussions and concluding remarks are given in Section 5. 2
Notation and Assumptions
Consider a tumorigenicity experiment involving n independent animals who are tumor-free at time t = 0 and are randomly assigned to one of p + 1
298 treatment or dose groups. For the ith animal, let C7, denote the time of tumor onset, T* the time of death and C; the censoring or sacrifice time. It will be assumed that the development of a tumor is an irreversible event. In practice, since tumors are occult, we do not observe the Ui's. Instead, we observe only Ti = min{ T* , Ci }, the smallest of death and sacrifice times. Define Ni{t) = I(Ui < t), the indicator of the absence (Ni(t) = 0) or presence (Ni(t) = 1) of a tumor in animal i at time t, and Ni(t) = 7(7} < t), indicating if the animal is dead (by 1) at time t, i = 1, ...,n. Then observed data are {T{, Ni{Ti), 6i = I{T{ = T*); i = l , . . . , n } . Let Z{ denote the p-dimensional vector of treatment or dose group indicators associated with the ith animal. Our main goal is to test the hypothesis Ho : E{ Ni{t) \Zi} is independent of i. As mentioned before, in the following, we will focus on tumors that are between lethal and nonlethal. To model tumor lethality, we will assume that given Zi and Ui, the hazard function of the death times T*'s is given by the following proportional hazards model (Cox, 1972), Xi(t) = \iit\Zi)
= I(Ti > t)X0(t)eT'Zi+^^^
,
(1)
where Xo(t) denotes the baseline hazard function, r is a p-dimensional vector of unknown regression parameters characterizing the group or dose effect on death rate, and (3 is a scalar parameter describing tumor lethality, i = 1,..., n. The above model allows that animal death rates could be affected by both doses and the onset of a tumor. Note that /? = 0 means that the tumor is nonlethal and r = 0 means that the dose has no effect on the death rate. If both/? = 0 and r = 0, the death times T*'s then follow the same distribution and are independent of the Ni's and the Z;'s for all animals. Thus the model allows both the assessment and estimation of possible dose effect and tumor lethality. Models similar to model (1) have been discussed, for example, by Dinse 5 , Lindsey and Ryan 12 and Rai and Matthews 14 . In particular, Sun 17 considered the problem of testing the hypothesis Ho under model (1) with /? = 0, that is, for nonlethal tumors. To test Ho for current situation, there exist several difficulties compared to the case of /? = 0. One major problem is estimation of the parameter r , which can be easily obtained if (3 = 0, and the parameter (3. This is because model (1) involves the unknown tumor onset times UiS. In the following, we will develop an EM-type algorithm for this. It will be assumed that the censoring times C;'s are independent of tumor onset and death times and that they follow the same distribution as the death times when there is no dose and tumor effect.
299
3
Statistical Methods
In this section, we discuss the test of the hypothesis HQ and estimation of parameters T and p. Let So(t) = e x p { - JQ Ao(s) ds} be the baseline survival function for the T*'s. To construct a test statistic for H0, first assume that r , P and So are known. Note that we can rewrite Ni(Ti) as N^Tt) = J0°° Ni(t)dNi(t). Thus under H0 and the assumptions, /•OO
= {elT'Zi
E{Ni{Ti)\Zi}
+
A0(i){5o(t)}exp(T'z'+/3)/iW5o(t)^
V + l} / Jo
conditional on Z,, where fi{t) denotes the mean function of the Ni(t)'s under HQ. Therefore, under HQ, we have that
J i W H e ^ + W + l}-*
, ,M *}-f
Wrt
M
This motivated the following test statistic
X(r,P,S0) =Y( Zi^ ~ -
'
Z)^^-—-±^-
(2)
s 0 ( T i ) « p ( ' - ' Z i + /9)
which has expectation zero under Ho, where Z = Y^i=i %i /n. Among others, Charles and Wei x discussed similar test statistics for the comparison of two treatments in the context of balanced repeated measurements. Sun 17 considered the test of HQ and proposed a test statistic similar to X(T,(3,SQ) replacing unknown parameters by their consistent estimates. Let a = (T,P, SO). It can be shown using the method in Sun 17 that under Ho, X(a) has asymptotically a normal distribution with mean zero. Thus using the statistic X(a), the hypothesis HQ should be rejected for large values of X(a), where a denotes a consistent estimate of a. To apply the statistic X(a), we need to estimate a under HQ. For this purpose, note that if the C/j's are known, r and P can be easily estimated by the solution to equations XT(r,p) = 0 and Xp(r,P) = 0, where
XT(r,P) = > J /
E " , /(tz,-) \ Z i - ^ i J ( t " g j + g /^<«) }
and X(3(T,P) n
,.00
(
E " , lit < T,)e''zi*muiS>l(Ui
< t) 1
dN
^)
300
which are score functions from the partial likelihood function of r and (3. In the above, N?(t) = I{T < t,6t = 1) and N*(t) = ^ = 1 N?(t). Given r and (3, the baseline survival function 5*0 can be conveniently estimated by the consistent estimator Softr.fl = e x p j - I
E ; = i / ( s
< ^ )
(
; U
+
^ < 4 •
This motivates the following algorithm for estimating a under HQ. Step 0. Choose initial values for the £/,'s. Step 1. Estimate the marginal survival distribution of the tumor onset time under Ho based on the U's and denote it by SuStep 2. Estimate r and (3 by solving the equations XT(T,(3) = 0 and Xf}{T,(3) = 0 and then estimate 5o by So{t;f,J3). This yields an estimate, say §T*\U> °f the conditional survival function of the T* given the £/j's. Step 3. Let Ui be equal to its conditional expectation E{Ui\Ti,Ni(Ti),Su,ST*\u} under the current values of the parameters, i = 1, ...,n. Step 4. Go back to step 1 until the convergence of the estimates of r , (3 and So. In the above algorithm, a natural choice for the initial value of U is to set Ui = Tifl if Si = 1 and £/; = Tj if Si = 0. Steps 2 and 3 correspond to the maximization step of EM algorithm and Step 4 can be regarded as the expectation step of EM algorithm. An alternative to steps 0 and 1 is to estimate SJJ based on interval-censored data [ (Tt, Ni(Ti)} ; i = 1,..., n] on the UiS (Sun 16 ; Turnbull 1 9 ). In this way, in step 1 after the first iteration, Su can be estimated by the Kaplan-Meier estimate (Kalbfleisch and Prentice 1 0 ). Once estimates a are obtained, we propose to use the simple bootstrap procedure given in Efron 8 to determine the p-value for testing HQ using X(a) as well as the distributions and variance estimates of a. Specifically, the bootstrap procedure can be carried as follows. For a given integer K and each fc = 1, ...,K, Step 1. Draw from observed data {Ti,Ni(Ti),6i , i = 1, ...,n} with replacement n random samples { T^k) ,N^(T}h)), s\k) , i = l , . . . , n } . Step 2. Obtain an estimate of a denoted by d'fc) = (f (fc>,/3 = X(&W). Step 3. Go back to step 1 if fc < K. The distribution of a and X(a) can be estimated by a' 1 ', . . . , d ^ ' and X^, ...,X(K^ for large enough K, respectively. For the test of the hypothesis H0, the p-value can be calculated as £ * = 1 I(\X^\ > \X(a)\)/K. Some-
301 times it is also interest to test r = 0 and 0 = 0, which correspond to no treatment effect on death rate and the tumor under study being nonlethal, respectively. Suppose that we accept Ho- Then for the test of r = 0, the p-value can be calculated as 2fc=i I(\T^\ > \f\)/K for large K and the same can be done for 0 = 0. Note that here both estimation and test about r and 0 are performed under HQ, which is the main interest of the paper. The general inference about r and 0 could be carried out in the similar way. 4
A n Application
To illustrate the proposed method in the previous section, we apply it to the tumorigenicity experiment reported by Hoel and Walburg 9 on lung tumors on 144 male RFM mice. The experiment involves two treatments, conventional environment (96 mice) and germfree environment (48 mice). The observed data include observation times (Ti's) of the mice, lung tumor presence or absence indicators at the observation times (JVi(Ti)'s) and treatment indicators. Note that the observation times here are either the death or sacrifice times of the animals. The data were given in Table 5 of Hoel and Walburg 9 and also analyzed by Lagakos and Louis 16 and Sun 17 . One of the objectives of the study was to compare the lung tumor incidence rates of the two treatments. To compare lung tumor incidence rates, define Z; = 0 if an animal was given conventional environment and Zi = 1 otherwise. Applying the method given in Section 3, we got that f = -2.0314, 0 = 0.4008 and X(a) = 38.1738. To obtain the p-value for testing the equality of tumor development rates between the two groups, the bootstrap procedure given in Section 3 with K = 10000 was used and yielded a p-value of 0.0673 for HQ. The results indicate that there is a moderate difference between the lung tumor incidence rates of the mice under the two treatments and that the mice in germfree environment had a little higher rate of tumor development. These are similar to those given by Hoel and Walburg 9 , Lagakos and Louis u and Sun 17 . Assuming the nonlethality, Hoel and Walburg 9 and Sun 17 gave p-values of 0.01 and 0.028, respectively. Note that the difference between the methods used in Hoel and Walburg 9 and Sun 17 is that the latter adjusts for survival difference between the two groups, while the former does not. In contrast, the method given here adjusts for both the survival difference and lethality. Lagakos and Louis n assumed known lethalities and obtained p-values of around 0.07. We also considered the tests of the hypotheses T = 0 and 0 = 0. Using the same bootstrap samples as those for X(a), we obtained a p-value of almost zero for testing T = 0, indicating that the two groups had significantly
302
different survival rates. The same was also pointed out by Lagakos and Louis 11 and Sun 17 . In particular, the latter obtained and plotted the separate Kaplan-Meier estimates of the survival functions for death times of the mice in the two treatment groups, which suggested that the mice in the germfree environment had significantly longer survival time than those in the conventional environment. For 8 = 0, the bootstrap procedure yielded a p-value of 0.3722. This is consistent with the fact that lung tumors are usually regarded as relatively nonlethal (Hoel and Walburg 9 ) . 5
Concluding Remarks
A nonparametric test procedure is proposed for the comparison of tumor incidence rates, which is often one of the main objectives of tumorigenicity experiments. Compared to the existing methods, a main feature of the presented approach is that it allows tumors having unknown intermediate lethality. Also it adjusts for the possible effect of doses or treatments on animal survival rates. To implement the method, a bootstrap procedure is developed for the determination of p-values. Note that in the above, Zi was assumed to be treatment indicators for simplicity. In fact, the presented method also applies to the situation where the vector Zi includes some covariates whose effect one may want to examine. The focus of this paper is on the comparison of tumor development rates. Sometimes it may be of interest to describe and estimate tumor development rates. For these situations, certain models usually need to be assumed (Dewanji et al. 4 ; Lindsey and Ryan 12 ; Rai, Matthews and Krewski 1 5 ) . More work remains to be done. One problem for future research is to study asymptotic properties of the proposed test procedure and the point estimates of r and /?. In this paper, we have assumed that death time of an animal can be described by the proportional hazards model (1). It would be useful to develop similar test procedures for other commonly used models such as additive failure rate model and multiplicative failure rate model studied by Rai, Matthews and Krewski 15 among others. A related problem is model checking or selection, for which there seems no established method. Acknowledgments The authors wish to thank two referees for their comments and suggestions. The research of J. Sun was supported by a grant from the National Institutes of Health and that of S.N. Rai was supported in part by a Cancer Center Support grant (CA 21765) and the American Lebanese Syrian Associated Charities.
303
References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.
S.D. Charles and L.J. Wei, Biometrics 44, 1005 (1988). D.R. Cox, Journal of the Royal Statistical Society, Ser. B 34, 187 (1972). A. Dewanji and J.D. Kalbneisch, Biometrics 42, 325 (1986). A. Dewanji, D. Krewski and M.J. Goddard, Biometrics 49, 367 (1993). G.E. Dinse, Biometrics 42, 325 (1991). G.E. Dinse, Statistics in Medicine 13, 689 (1994). G.E. Dinse and S.W. Lagakos, Applied Statistics 32, 236 (1983). B. Efron, Journal of the American Statistical Association 76, 312 (1981). D.G. Hoel and H.E. Walburg, Journal of the National Cancer Institute 49, 361 (1972). J.D. Kalbneisch and R.L. Prentice, The Statistical Analysis of Failure Time Data (John Wiley and Sons, New York, 1980). S.W. Lagakos and T.A. Louis, Applied Statistics 37, 169 (1988). L. Lindsey and L.M. Ryan Applied Statistics 42, 283 (1993). R. Peto and J. Peto, Journal of the Royal Statistical Society A135, 185 (1972). S.N. Rai and D.E. Matthews, Applied Statistics 46, 93 (1997). S.N. Rai, D.E. Matthews and D.R. Krewski, The Canadian Journal of Statistics 28, 65 (2000). J. Sun, Interval Censoring, in Encyclopedia of Bio statistics (John Wiley and Sons, New York, 1998). J. Sun, Journal of the Royal Statistical Society B 6 1 , 243 (1999). J. Sun and J. D. Kalbneisch, Biometrics 52, 726 (1996). B.W. Turnbull, Journal of the Royal Statistical Society B38, 290 (1976).
GENERALIZED S M O O T H E D ESTIMATING F U N C T I O N S W I T H C E N S O R E D OBSERVATIONS A. T H A V A N E S W A R A N Department
of Statistics, Manitoba Email:
University of Manitoba, R3T 2N2 CANADA [email protected]
Winnipeg
,
JAGBIR SINGH Department
of Statistics, Fox School of Business, Temple Philadelphia, PA 1922 USA Email: [email protected]
University,
Recently there has been a growing interest in the estimation of intensity of continuous time analog of regression models (semimartingales). Counting process model, a special case of the semimatingale models, is widely used to study the estimate of the hazard function in Aalen x , Ramlau-Hansen 8 etc.. Following a recent work of Novak 5 on generalized kernel density estimators, a new class of nonparametric estimators for the intensity of a semimartingale process is introduced. It includes the kernel smoother for the hazard rate based on optimal estimating function studied in Thavaneswaran and Singh 12 as a special case. The asymptotic properties of the smoothed estimator with censored observations are also discussed in some detail.
1
Introduction
A semimartingale is a stochastic process which can be represented as the sum of a process of bounded variation and a local martingale. In the case of continuous time, a typical example of semimartingale is a process (X(t), t > 0) with independent increments for which E|X(i)| is finite and a function of locally bounded variation. The class of semimartingales includes point processes, Ito processes, diffusion processes, etc. Consider a continuous time stochastic process (X(t), t > 0) defined on (Q,A, P), a complete probability space for each P in a family {V} of probability measures, and a family F = [Ft, t > 0], of a algebras Fs C Ft C A for s < t, F0 augmented by sets of measure zero of A, and Ft = Ft+, where Ft+ = r\s>tFs . We denote by D the space of right-continuous functions {x(t), t > 0) having limit on the left and use X — (X(t),Ft) to denote an Ft -adapted random process (X(t)) with trajectories in the space D. For simplicity we assume that X(0) = 0. We shall denote by Mfoc{F, P) a class of locally square integrable martingales (H(t), Ft). Assume that the process (X(t), Ft) is a semimartingale for each P,
304
305
that is for each P it can be represented in the form X(t) = V(t) + H(t)
(1)
where V(t) is a locally bounded variation process and H(t) € Mfoc(F,P). When we allow V(t) and H(t) to depend on P e {V} only through 6, the model (refeql.l) can be written as
X(t,e)=V(t,O)
+ H(t,0)
When 6 £ TZ, optimal as well as recursive estimates have been studied in Thavaneswaran and Thompson n . Asymptotic properties of the parametric estimators for vector valued multiparameter semimartingales had been studied in Hutton and Nelson 3 .Here we consider the following semimartingale model of the form dX(t) = a(t)Y(t-)dR(t)
+ dM(t).
(2)
where a(t) is an unobservable deterministic part of the intensity process of the semimartingale X(t), {X(s),Y(s),R(s),0 < s < t} are observable processes, M(t) £ Mf0C(F, P) with predictable variance process < M >t= /„' C(s)dR(s), and C(s)is a known function of the observations and a(s). For a similar restriction that the conditional mean and the conditional variance of X(t) are absolutely continuous with respect to R(t), see Hutton and Nelson 3 . When R(t) = t, equation (2) turns out to be the Aalen's * multiplicative intensity model for counting processes. Example 1.1. Poisson process: When X(t) is a right continuous process having jumps of size 1 and X(t) = Jo a(s)Y(s)ds, with Y(s) = 1, is a deterministic function, the semimartingale model (2) becomes a non homogeneous Poisson process model with cumulative intensity \(t) . Example 1.2. Multiplicative intensity model: When X(i) denotes the number of deaths up to time t, a(t) is a hazard rate, Y(t—) is the number of individuals at risk before time t, then the semimartingale model (2) turns out to be the multiplicative intensity model introduced by Aalen *. The process M(t) is a zero mean square integrable with variance process< M >t— J0 a(s)Y(s)ds provided there are no simultaneous deaths. This model has been widely applied to such phenomena as life history data or arrivals at an intensive care unit of a hospital. The counting process model and the martingale limit theory have been used to study the asymptotic properties of the estimators of the hazard function and the survival function with censored observations. In a recent paper, Novak 5 discussed the problem of obtaining a generalized kernel density estimator with independent observations. The kernel function smoothing approach is a useful tool for hazard function estimation as well.
306
Recently, efforts have been made to extend the smoothing methodology in various directions, see Ramlau-Hansen 8 , Thavaneswaran 10 , Thavaneswaran and Singh 12 . The purpose of this paper is to explore the implications of the result of Novak 5 for estimating the intensity function of a semimartingale model. We also apply it to hazard function estimation with censored observations by generalizing the Novak's 5 result. Let Xi, , X n be a random sample from the distribution of a random variable (r.v.) X with density f(x) > 0 . The classical kernel density estimator of f(x), at any given x , proposed by Rosenblatt 7 , and studied by Parzen 6 and Nadaraya 4 is:
>-M = ; s X : / , ( ^ ) . i=l
x
»«"
'
The kernel / 7 , the density of some symmetric r.v. 7, and the smoothing parameter (band width) h = h(n) are to be chosen by the statistician. It is assumed that the density / is continuous at x, / 7 vanishes outside the interval [-T, T] and has at most a finite number of discontinuity points. The generalized kernel estimator studied in Novak 5 is: fn,a (*) = - £
/J* «*< ~ ^f"
(**)) S" (Xi) h
(3)
«=1
where a 6 11, It = I{\x — Xi\fa(x) < hT+}, and T+ is a constant greater than T . For any continuous function g of / we propose the following extended version of the Novak's estimator as fn,o (*) = \ E
M ((Xi ~ x)g(f{Xi)))
g(f(Xi))Ii
(4)
i=i
where U = I{\x - Xi\g(f(x)) < hT+). If g(x) = fa(x), then estimator (1.6) reduces to the estimator (3) and for a = I it coincides with Abramson 2 estimator. Thus, the class of proposed estimators {fn,g {x)} generalizes that of kernel density estimators including Novak's as well. In order to compute the estimator, Novak suggested that one can substitute a consistent estimator of f{Xi). In the case of censored data it is more appropriate to substitute an estimate which incorporates censoring. Thus, it is of interest to note that our approach allows to smooth the density with censored data by letting g {f(x)) as the survival function and substituting the product limit estimator in the kernel. The following lemma will be used in Section 2.
307 Lemma 1.1. As n -> oo E
fn,g {x) -> / (x)
nhDfUig (x) ->• i// (x) 5 (x)
(5)
where D denotes the variance and v = J & [x) dx. The proof of the lemma follows by the dominated convergence theorem. In the next section, a modified version of this generalized kernel density estimator is studied for a semimartingale intensity. 2
Estimating function based kernel estimate
Based on a single realization, a smoothed optimal estimating equation studied in Thavaneswaran and Singh 12 for estimating the semimartingale intensity a(t) at io is: l
j ^{{t0-8)hr1)br1d
(6)
o
where (i) (7° = J0a°s gdMSie, as in Thavaneswaran and Thompson u , is an optimal estimating function defined through a°se = ( c o o ) ' w n e r e J(s) = I(Y(s) > 0,C(s) > 0), and (ii) / 7 is a non-negative integrable kernel function with band width h . In analogy with (6), we propose a smoothed estimate for a(io) as a solution of
L
i
/fc^((*o - s)g(a(X{s))))9(a(X(s)))dG»
o = _
0,
(7)
where g is a continuous function of a(t) through the observations as in (4). When g{x) = 1, (7) reduces to (6). The explicit form of the resulting estimator from the estimating equation (2.2) can be written as: 0O
=
Jo h-y ((to ~ s)g(X(S)))
g(X{8))Z$$*dX(8)
So fh-r ((to - s)g(X(s)))
g(X(s))^m^dR(s)
308
2.1
Asymptotics:
Strong consistency and asymptotic
normality
The optimal smoother for a(to) = 90 for fixed t0 from a sequence of semimartingales indexed by n, (i.e.) dXn{t) = a(t)Yn(t-)dR(t)
+ dMn{t),
for 0 < t < 1, can be written as s)g(Xn(S)))g(Xn(s))Y^lJ{;fsUxn(S)
So A 7 ((*o °°n(t) =
So /fc7((«o "
= e0 + = 9o +
s)s(X„(s)))ff(*n(»))£g^dfl(s)
Jo A 7 ((to ~ s)g(Xn(s)))
g(Xn(s))
Jo A T ((*o - s)g(Xn(s)))
y
-gfcC> dM„(s) g{Xn(s))^i^dR(s)
An
where Mn = ^ h-y ((to ~
s)g(Xn(s)))g(Xn(s))Yn{")'ln{s)dMn(s)
J0
^n\S)
and An = f fhy ((to -
s)g(Xn(s)))g(Xn(s))Y"{")fn{s)dR(s) ^n{s)
JO
For any sample size n, it can be easily shown that the estimate is unbiased for models with deterministic intensity. The following two theorems, analogous to the ones given in Thavaneswaran and Singh u are established. Theorem 2.1. Let mn = S" = 1 (AMi/i4i) . Under the assumptions that (i) the predictable variance process of m„ , < mn >oo< °o a-s. and (ii) for the predictable process An, A,*, = oo a.s. . The optimal smoother 9^(t) —> a(t) a.s. for all fixed t as n —>• oo The assumptions (i) and (ii) are somewhat restrictive for a general semimartingale model. However these can be easily verified for an autoregressive model of order one as in Shiryayev 9 (p. 489). Proof: The process Hn(s) = fhl ((t0 -
s)g(Xn(s)))g(Xn(s))Yn{^){s)
309
is predictable and Mn(s) is a zero mean square integrable martingale. Hence it follows from the properties of stochastic integrals, the integral
Mn=
[
Hn(s)dMn(
Jo is a zero mean square integrable Furthermore, arable martingale. Fui An=
f fhydh Jo
s)g(Xn(s)))g(Xn(s))Y^^n{s)dR(s) w(s)
-
is a Lebesgue-Stieltjes integral of a predictable process with respect to a bounded variation function R(s) and is predictable. Therefore, Mn/An is a martingale sequence. The proof now follows by applying the strong law of large numbers for the martingale sequence Mn/An, as in Shiryayev 9 (p. 487). Theorem 2.2. Assume that _^_i_jasn^OQin probability, ( i ) 9(xn(s)Y„(s)jn(s) (ii) the functions a, g and a are continuous at the point t, (iii) R{t) = t, lim g ( X " ( ^ y j " ( s ) = borhood of t.
7s
in probability uniformly in a neigh-
Then, A A M 0 ° ( £ ) - 6>o) 4
N (0, — J as n - * oo,
where i / = £ fi(t)dt as in (5). Proof: We apply the martingale central limit theorem given in Shiryayev 9 (p. 511). In order to apply the theorem we verify the necessary conditions. Recall that tiHn(s)dMn(s)
V^K(d°n(t) ~ Oo)
M n Mn
/nhn nhn
where Hn(s), An are as defined earlier and{M„} is a sequence of martingales with variance process, < Mn >= 4 - / Hl{s)d <M>n(s)=^nn-n Jo =
Jo
f nnn j 0
Hl{s)Cn{s)ds
r1f2h-y((to-s)g(Xn(s)))g2(Xn(s))Y^s)Jn(s)Cn(s)dc Cn(s) nhn
310 1
= /
,/* {{u)g{Xn{t - hnu))) g2{Xn(t - hnu)Y%{t - hnu)Jn{t - hnu) Cn(t - hnu) du C£(t-hnu) n -» v "it-
Thus the corresponding variance process converges in probability and satisfies one of the necessary conditions for the martingale central limit theorem. In order to verify the second condition, consider [\Hn(s)\ > e^/nhn] = \fhl ((t - s)g(Xn(s)))
g(Xn(s))Yn^n{s)
|>
t^hn
as n ->• oo, hn -> 0, g(*"( a )|M s ?-M s ) _^ _i_^ uniformly in a neighborhood of t. Thus [I\Hn(s)\
>e]-^0
in probability. Applying the martingale central limit theorem, as n —> oo, Mn(t) —> N(0,u 74) in distribution. Moreover, Bn ->• 74 in probability. Hence,
y/nfc(0°n(t) - Oo)
-*N(O,-
\
It
in distribution. 3
Conclusion
In this paper we have introduced generalized kernel estimators for the intensity of a semimartingale process. It extends the results of Thavaneswaran and Singh 12 on strong consistency and asymptotic normality. Generalized kernel estimator of the hazard function with censored observations turns out to be a special case. Acknowledgments Research of both authors supported in part by Stan Alton and R.W. Johnson Pharmaceutical Research Institute. First author's research supported in part by the National Science and Engineering Council of Canada.
311
References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.
0 . 0 . Aalen, Ann. Statist. 6, 701 (1978). I S . Abramson, Ann. Math.Statist. 10, 1217 (1982). J.E. Hutton and P.I. Nelson, Stochastic Process. Appl. 22, 245 (1986). E.A. Nadaraya, Theory Probab. Appl. 19, 133 (1974). S.Y. Novak, Theory of Probability and Applications 90, 301 (2000). E. Par Zen, Ann. Math. Statist. 33, 1065 (1962). M. Rosenblatt, Ann. Math.Statist. 27, 832 (1956). H. Ramlau-Hansen, Ann. Statist. 11, 453 (1983). A.N. Shiryayev, Probability (Springer, New York, 1984). A. Thavaneswaran, Stochastic Process. Appl. 28, 81 (1988). A. Thavaneswaran and M.E. Thompson, Journal of Applied Probability 23, 409 (1986). 12. A. Thavaneswaran and J. Singh, Ann. Inst. Statist. Math. 45, 721 (1993).
L A R G E S A M P L E A S Y M P T O T I C PROPERTIES OF O R D I N A R Y LEAST SQUARES, TWO STAGE LEAST S Q U A R E S A N D LIMITED INFORMATION M A X I M U M LIKELIHOOD ESTIMATORS IN SIMULTANEOUS EQUATIONS MODELS R. TIWARI Department
of Statistics, University of Jammu,Jammu-180006, E-mail: [email protected]
Department
of Statistics,
India
V.K. SRIVASTAVA Lucknow
University,
Lucknow-226007,
India
This article discusses the large sample asymptotic properties of three estimators, viz., OLS, TSLS and LIML, for the coefficients of a single structural equation.
1
Introduction
For the coefficients in a single structural equation of a complete simultaneous equation model, Nagar 5 has considered fc- class estimators and has derived large sample asymptotic approximations for their bias vector and mean squared error matrix. He has restricted his attention to those fc - class estimators which are consistent and have same asymptotic distribution according to large sample asymptotic theory. These estimators are specified by the characterizing scalar fc such that fc is nonstochastic and (1 —fc)is of order C ^ T - 1 ) , T being the number of observations. Such estimators do include the case of two stage least squares (TSLS) estimator but fail to include two popular estimators. One is the ordinary least squares estimator for which the value of fc is nonstochastic but (1 —fc)does not satisfy the requirement of being of order 0(T~l). The other is limited information maximum likelihood (LIML) estimator for which the value of (1 —fc)is of order 0p{T~l) but the value of fc is stochastic. The purpose of this paper is to provide large sample asymptotic approximations for the bias vector and mean squared error matrix for these two estimators and to make a comparative study of the performance properties of the three popular estimators {viz., ordinary least squares, two stage least squares and limited information maximum likelihood) employing the large sample asymptotic theory. The plan of this paper is as follows. Section 2 describes the simultaneous equations model and presents the estimators for the coefficients in a structural equation. Section 3 discusses the asymptotic properties of the ordinary least
312
313 squares estimator while Section 4 provides the large sample asymptotic properties of the limited information maximum likelihood estimator and compares them with those of the two stage least squares estimators. Lastly, Section 5 gives the derivation of the results. 2
Model Specification And Estimators
Consider a simultaneous equation model consisting of a set of M linear structural equations in M jointly dependent and A predetermined variables. YB + XT = U
(1)
Where Y is aT x M matrix of T observations on M jointly dependent variables, B is a M x M nonsingular matrix of coefficients associated with them, X is a T x A matrix of T observations on A predetermined variables, T is a Ax M matrix of coefficients associated with them and U is a T x M matrix of structural disturbances. It is assumed that the row vectors of U are independently and identically distributed, each following a multivariate distribution with mean vector 0 and variance covariance matrix S. With no loss of generality, let us suppose that we are interested in the estimation of parameters in the first structural equation of the model (1), expressible as Y = Yi0 + Xxl = A6 + u;
+ u A = (Y1:X1)
;
S=\... \ 7
(2)
where y is a T x 1 vector of T observations on the jointly dependent variable to be explained, Y\ i s a T x m submatrix of V, j3 is a m x 1 vector of coefficients associated with m(< M) explanatory jointly dependent variables, X\ is a T x I submatrix of X, 7 is a I x 1 vector of coefficients associated with l(< A) explanatory predetermined variables and u is the first column vector of U. The reduced form corresponding to explanatory jointly dependent variables in (2) is expressible as Yi = XUt + UHi
(3)
where III is a A x TO matrix of reduced form coefficients determined by the elements of the matrix —TB~1 and H1 is a M x m matrix obtained from the elements of the inverse matrix B.
314
We can write
A = (n;
xx)
x n i : Xi ) + [ UHi : 0
(4)
XU + UH where and
n = IL
H=[H1:0
o
Assuming the structural equation (2) to be identifiable, the fc-class estimator of 6 in (2) is defined by -l
6k
A'{I-kPx)A
A'{I-kPx)Y
(5)
with Px = I — X(X'X)~1X' and fc as the characterizing scalar. The family (5) encompasses three popular estimators, viz., ordinary least squares estimators(OLS), two stage least squares (TSLS) and limited information maximum likelihood (LIML) estimators. The OLS and TSLS estimators are specified by fc = 0 and fc = 1 respectively while LIML estimator is specified by fc = A where A=:
Tiy-YipyPxiy-Yrf)
with PXl=I3
Xi{X[Xi)-1
(6)
X'± ; see, e.g., Kadane 3 .
Asymptotic Properties of OLS Estimator
In order to examine the properties of estimators of structural coefficients in (2), we assume that the structural disturbances are normally distributed and the model does not contain any lagged endogenous variable. Further, it is assumed that all the exogenous variables in the model are asymptotically cooperative in the sense that T~1(X'X) tends to a finite nonsingular matrix as T tends to infinity. This rules out, for instance the presence of any trend variable in the model; see Kramer 4 . In order to present the large sample asymptotic results by following the approach of Nagar 5 , let us introduce the following quantities: d = H'crn (i)
(7)
315 -i
Q = (^n'x'xn)
(8)
where a^ denotes the first column vector of the variance-covariance matrix S of disturbances. R e s u l t I: The large sample asymptotic approximations, upto order 0(T~l), for the bias vector and variance covariance matrix of the OLS estimators <§o of 6 are given by B(60) = E{60 - 6) 1 {(tr Q^S) Sd+V(S0)
E 60-E{60)
- ljSQ^Sd
+
(9)
SQ^SQ^Sd
S0-E(So) (10)
T
( a n - d'Sd)S - (d'SQ-1Sd)SQ-1S
- SQ'1 Sdd'SQ'1
S
where S = (^jU'X'XU
+ H'T,H\ (11)
and o\\ is the (l,l)th element of £. From(9), it is observed that the OLS estimator is generally inconsistent and biased unlike the TSLS and LIML estimators which are consistent though generally biased. If we examine the asymptotic variance covariance matrix of the three estimators, it is well known that the TSLS and LIML estimators have the same asymptotic variance covariance which is given by T~1
=
V(6X)-V(60) ^(Q-S)
+
±\(d'Sd)S
+ {d'SQ-1Sd)SQ~1S
(12) +
SQ^Sdd'SQ^S
As the matrix
{Q-s) = (^u'x'xnY1 - (^u'x'xu + H'XH^J , -l
(13)
316 is at least positive semidefinite, it follows that the matrix expression (12) is also at least positive semidefinite. This implies that OLS estimator is generally asymptotically more efficient than TSLS and LIML estimators. It may be remarked that if we employ small disturbance asymptotic theory, all the three estimators are consistent and share the same asymptotic distribution; see Kadane 3 and Srivastava and Giles 6 . 4
Comparison of TSLS and LIML Estimators
Both the TSLS and LIML estimators are known to be consistent and to possess the same asymptotic distribution. In order to study the differences in their performance properties in finite samples, we need to examine higher order asymptotic approximations. Let us first consider the case of TSLS estimator. The large sample asymptotic approximations for its bias vector to order 0{T~l) and its mean squared error matrix to order 0(T~2) have been obtained by Nagar 5 . His results are as follows: B{k)
= E{8X - 5) (14)
M(Si) = E(6! - S)(Si - 6)' j , Q + j , 2 {(Tn(trQH'Y,H) + (L2 -3L
- 2(L -
+ 4)Qdd'Q - (L -
l)d'Qd}Q
(15)
2)a11QH'T,HQ
where
L=
(A-e-m)
(16)
denotes the degree of over-identification of the structural equation (2) From (14) and (15) , we can obtain the variance covariance matrix upto order 0{T~l) as follows: V(Si) = E <5i-£(<5i) T
w
^
T2
-{L-
<5i-£(<5i)
{an{trQH'Y,H) 3)Qdd'Q -{L-
- 2(L -
\)d'Qd}Q
2)anQHhLHQ
(17)
317 R e s u l t I I : For the LIML estimator 6\ of 6, the large sample asymptotic approximations for the bias vector to order 0(T~l) and the mean squared error matrix to order 0(T~2) are given by
B(6x) = -±Qd,
(18)
and M(6X) = ~ Q + p
|{ffn(tr QH'ZH)
+ 2d'Qd}Q - (L - A)Qdd'Q (19)
+ (L + 2)crnQH'T,HQ
.
From (18) and (19), the variance covariance matrix to order given by V(6X) = ^-Q
+^
[{an(tr
+ {L +
0(T~2)
is
QH'HH) + 2d'Qd}Q - (L - 3)Qdd'Q (20)
2)anQH"£HQ
It may be noticed that these results tally with the results obtained by Fuller 2 for a special case; see Lemma 1 and equation (12) of Fuller 2 . Comparing (14) and (18), we conclude that both the TSLS and LIML estimators are generally biased though consistent. As expected, both have identical biases to order 0(T~l) for exactly identified equation (2). If the equation is overidentified and the degree of overidentification exceeds one, the bias of TSLS estimator has a sign opposite to that of LIML estimator. Further, the magnitude of bias of TSLS estimator is larger in comparison to that of LIML estimator. Next, we observe from (17) and (20) that V(6X) - V(6i) =
2L r y2
(d'Qd)Q +
1 2anQH'Y,HQ
(21)
which is always a positive definite matrix for L > 0. This implies that the TSLS estimator dominates the LIML estimator according to the criterion of asymptotic variance covariance matrix to order 0(T~2) provided that the structural equation is overidentified. Similarly, if we compare the expressions (15) and (19) we get M(SX) - Af («i) = ^\2(d'Qd)Q
- (L - 2)Qdd'Q +
Now we notice that the matrix 2(d'Qd)Q -{Lif and only if L < 4; see, e.g. Dube et al.
2cmQH"EHQ
(22)
2)Qdd'Q\ is positive definite 1
(Lemma 1) . It thus follows
318
from (22) that the TSLS estimator dominates the LIML estimator according to the asymptotic mean squared error matrix criterion to the order of our approximation at least as long as the structural equation (2) is overidentified and the degree of overidentification is less than four. It may be mentioned that the expression (18) is identical with the small disturbance asymptotic approximation of bias vector of LIML estimator; see Kadane 3 (Theorem 1). Similarly, if we retain the terms upto order 0(T~2) only in the expression for the small disturbance asymptotic approximation of the mean squared error matrix obtained by Kadane 3 (Theorem 2) , we get the same expression as (19). Finally, for the dominance of TSLS estimator over the LIML estimator with respect to the criterion of small disturbance asymptotic approximation for the mean squared error matrix, Kadane 3 (Corollary 1, p.728) has found the condition that the degree of overidentification should be less than six provided (T — A) > 2. 5
Derivation of Results
First of all, we present the following expectation results. Lemma: We have E(UCU) = C'E
,
E{U'CU) = (trC)E
E(UCU') = (trCT,)I E{U'UCU'U)
, E(U'CU') = EC" =T\(T+
1)ECE + (trCE)EJ
(23)
where C is a nonstochastic matrix of suitable order in each case. For proof, see Kadane 3 (Appendix) or Srivastava and Tiwari 7 . Now for the results related to OLS estimator, we observe that (60 -6) =
(A'A^A'u / + ^S(H'U'XU SH'(T{1) +
+ U'X'UH) + SH'AH]
*
(24)
s(Jpl'X'u + H'c)
where S = ( i n ' X ' X n + H"EHy\ A =(![/'[/-E),
(25) (26)
319
and e
(27)
=(rC/'u-fr(1))-
Observing that the elements of the matrix expression ^S{H'U'XU + U'X'UH) + SH'AH are of order O p ( T - 1 / 2 ) and expanding the expression in the first square brackets on the right hand side of (24), we get (So -S)=g0
+ g_1/2 + fl_x + O p (T- 3 / 2 )
(28)
go = SH'(T(i),
(29)
where
g_1/2 = ^SU'X'u -
+ SH'e - ^S(H'U'XU
+
U'X'UH)Sd
(30)
SH'AHSd,
and -(H'U'XU
5-1
+ U'X'UH) + H'AH 9-1/2-
(31)
Here the suffixes of g indicate the order of magnitude in probability. Thus the bias vector to order O^T~x) is given by B(60) =g0 + E(g_1/2
(32)
+ g-i).
It is straightforward to see that (33)
£(2-1/2) = 0.
Similarly, using the results in Lemma along with normality of disturbances, it is easy to see that E{g-i)
= \tr(Q-lS)
- l\sQ~1Sd
+
SQ^SQ^Sd.
Substituting (33) and (34) in (32), we obtain the result (9) Next, we consider the variance covariance matrix upto order V(60) = E\60-
0{T~l):
E(60)\ [«0 - £(*>)]'
= -E(S-i/2£-i/ 2 ) -
(34)
(35)
E{g_l/2)E{g'_l/2)
From the results mentioned in the Lemma, it can be shown that ^(5-1/2^-1/2) = i f f a i -
d'Sd)S-(d'SQ-1Sd)SQ-1S
SQ^Sdd'SQ^s]
(36)
320
Substituting (33) and (36) in (35), we obtain the result in (10) For similar results in the context of LIML estimator, we first observe from (6) that A can be expressed as
(v-YrfyPxAv-YiP)
A = min
p
{y-YrfyPxiy-Yrf)
i .
=
.(y-ASY(PXl-Px)(y-A6) {y-AS)'Px(y-AS)
s i
|
(y - A6x)'(PXl
(37)
- Px)(y - A6X)
(y-ASxyp~x(y-ASx) Now we can write (y - A6X) = u - A{8X - 8) u - (XII + UH) (n'X'XU)-lU'X'u
+
Op(T~l)
i - xu(u'x'xu)-lu'x' u - uHiu'x'xuym'x'u + (38) so that i ( y - ASxy(PXl
- Px){y - A8X) = ^u'RU
+
where R = X{X'X)-lX'
- XU(W X'XU)'1^
^u'RUHQYl'X'u Op(T~2)
X'.
Similarly, we have (y - ASx)'Px(y
- A6xj\"'
= — [l - (-^u'u +
- l) +
-1-d'QTl'X'i
Op(T-1). (40)
Employing (39) and (40) in (37), we find A = 1 + t - i - £-3/2 +
Op(T~2)
(41)
where t-i =
—u'Ru,
(42)
321
and i_3/ 2 = -^u'RUHQTl'X'u
+ -l-u'Rui-^—u'u
-
l) (43)
o\xTi
u'Ru.d'QU'X'u
Here the suffixes of t indicate the orders of magnitude in probability. Next, using (4) and (41), we notice that ±A'(I
- XPx)u = ^U'X'u
+ (^H'U'Pxu + (-t_1H'e
-
t-xd)
+ t_3/2d) +
(44)
Op(T-2)
where e = {hU'u -
are of order
1 2
O p (T- / ),we have ^U'(I
- \PX)U
- i _ i S + O p (T- 3 / 2 )
= ^U'PXU
(45)
whence we can express rl -A'(I
I"1
- XPx)A\
1 =Q-
-Q{U'X'UH
+
H'U'XU)Q
-QH'(±U'PxU-t_^)HQ + ^Q(n'X'UH + H'UXU) Q(n'X'UH + H'UXU)Q + Op(T-^2). (46) Now using (44) and (46), we can express the estimation error of LIML estimator as follows: (6X -S)=
[A'(I - \P~X)A\
i - i
A'{I -
\Px)u
= e_i / 2 + (e_i + / _ i ) + (e_ 3 / 2 + / _ 3 / 2 ) +
(47)
0P(T~2)
where e_i / 2 = e_x = ^QH'U'Ru T
-
-QU'X'u,
Ti
^rQU'X'UHQU'X'u,
(48)
(49)
322 /_! = -t_iQd, e_ 3 / 2 = ^QU'X'UHQU'X'u
-
- ^QU'X'UHQH'U'Ru
(50)
^QH'U'RUHQU'X'u - ^QH'U'XllQH'U'Ru,
(51)
and
/_3/2 = ii_ 1 (Qif / sfi-Qn'x'u + Qn/x'c/JH'gd + QH'U'XUQd)
- t-iQH'e
+ t_3/2Qd
(52)
It may be observed that the sum (e_!/ 2 + e_i + e_3/2) provides the expression for the estimation error of TSLS estimator to order 0 ( T - 3 / 2 ) . Thus the bias vector of LIML estimator to order 0{T~l) is given by B(8x) = E(e_1/2
+ e_1)-E(t_l)Qd
(53)
The first expectation on the right hand side of the above equation is essentially equal to the bias vector of TSLS estimator to order 0{T~l) and is given by (14). As E(t-i) = (L/T), we have B(Sx) = -^Qd
(54)
which is the result (18). In a similar manner, the mean squared error matrix to order 0(T~2) given by
is
M(6X) = £ , ( e _ i / 2 e ' _ 1 / 2 + e_ 1 e'_ 1 / 2 + e_ 1 / 2 e'_i + e_ie'_i + e_ 3 / 2 e'_ 1 / 2 + e_ 1 / 2 e'_ 3 / 2 ) + E ( / _ i e ' _ i / 2 + e _ 1 / 2 / ' _ 1 ) + + e - i / 2 / ' - 3 / 2 ) + ^ ( e - i / ' - i + f-ie'-x
E(f_3/2e'_1/2 + f-d'-i)-
(55)
The first expectation on the right hand side of the above equation is the mean squared error matrix of TSLS estimator and is given by (15). For the remaining expectations, it can be verified that £ ( / - i e ' _ i / 2 ) = 0, £(/-3/2e'_i/2) = ^
[(d'Qd)Q + Qdd'Q + vuQH'ZHQ],
(56) (57)
323
£ ( e - i / ' _ i ) = ^T^-Qdd'Q,
(58)
£ ( / - i / ' - i ) = m^-Qdd'Q.
(59)
and
Using these alonwith (15) in (55), we obtain the result (19). References 1. M. Dube, V.K. Srivastava, H. Toutenburg and P. Wijekoon, Communications in Statistics 20, 2009 (1991). 2. W.A. Fuller, Econometrica 45, 939 (1977). 3. J.B. Kadane, Econometrica 39, 723 (1971). 4. W. Kramer, Communications in Statistics, Theory and Methods 14,1997 (1985). 5. A.L. Nagar, Econometrica 27, 575 (1959). 6. V.K. Srivastava and D.E.A. Giles, Journal of Quantitative Economics 7, 273 (1991). 7. V.K. Srivastava and R. Tiwari, Scandinavian Journal of Statistics 3, 135 (1976).
RELATIVE STABILITY OF W E I G H T E D M A X I M A OF B O U N D E D I.I.D. R A N D O M VARIABLES R.J. TOMKINS Department of Mathematics and Statistics, University of Regina Regina, Sask. S4S 0A2, Canada E-mail:jim. tomkins@uregina. ca Let {Xn,n > 1} be an i.i.d. sequence such that P{X\ > 0] > 0 and P[Xi < x] = 1 for all large x. For a positive real sequence {a„}, define Zn = m a x { a i X i , - - - ,anXn} for n > 1. Necessary and sufficient conditions are given for {Zn} to be relatively stable; i.e., for
> 1 in probability for some real cn
sequence {c„}.
Let Xi,X2, • • • be an i.i.d. sequence with distribution function (d.f.) F such that F(0) < 1 and F(a) = 1 for some a > 0. For a positive real sequence {an,n > 1}, define the random sequence Zn = max{aiXi,a 2 X2, • • • ,anXn},
n > 1.
(1)
This paper will present necessary and sufficient conditions for {Zn,n > 1} to be relatively stable; i.e., Zn/cn —> 1 for some real sequence {c„}, where "—>" denotes convergence in probability, and will show that the norming sequence may be assumed to be cn — max{ai, ••• ,an} ,n > 1. As will be seen, these results may be viewed as part of the process of generalizing a well-known theorem of Gnedenko x to situations involving the maxima of an independent sequence. Notice that Zn < Zn+i almost surely (a.s.), n > 1, so that lim Zn exists n—»oo
a.s. Tomkins
2
proved that lim Zn = oo a.s. if and only if (iff) supa„ = oo.
Unless otherwise noted, the focus of this article is on the case where sup an = n>l
oo and X\ is not degenerate. Define the constant XQ = vai{x : F{x) = 1}. By hypothesis, 0 < x$ < oo. The following five lemmas will lead to the main result: {Zn} is relatively stable iff max{ai, • • • , a n + i } ~ max{ai, • • • , a „ } , where "6„+i ~ 6„" means that bn+\/bn —• 1 as n —> oo. Lemma 1. Without loss of generality, it may be assumed that Xn > 0 a.s., n > 1, and hence that Zn > 0 a.s., n > 1, if Zn/cn -^-> 1 and c„ —> oo.
324
325
Proof: Define X+ = max(Xn,0),n > 1. Note that F(0) < 1 by hypothesis, so P[Xn > 0 infinitely often (i.o.) ] = 1 by the Borel Zero-one Law. But then P[Zn < 0 i.o. ] = 0. Therefore, P[Zn ^ Z'n i.o.] = 0, where Z'n = max{aiX*, • • • , a n X + } . Hence (Z'n — Zn)/cn —> 0 a.s. for any sequence {Cn} such that c„ —> oo. rj Lemma 2. Let {in} be any integer sequence such that 1 < in < n. Then liminf —^- > x0 if —— —> 1. n->oo
ain
Cn
Proof: Let e > 0. Then n
F ( ( l + £ ) C n / a i J > n F ( ( l + £ ) c „ / a i ) = P [ Z n < ( l + e j c ] — 1, since Zn/cn —• 1. Hence F ((1 + e)cn/ain) —> 1 as n —> oo for every e > 0. If liminf c„/oi n < zo, then a number rj > 0 exists such that cn/a,in < x 0 — ^ n—>oo
for an infinite number of values of n and, therefore, liminf F ( ( l + e)cn/ain)
< F ( ( l + e){x0 - » ? ) ) < 1
n—>oo
if e is chosen to be so small that (1 + E)(XQ — v) < ^o- This is a contradiction, so liminf Cn/a^ > XQ. n Lemma 3. If Zn/cn
—> 1 then cn —> oo and cn ~ c n _i.
Proof: As noted earlier, Zn —> oo a.s. since supa n = oo. But Zn/cn
—> 1,
n>l
from which it follows easily that c„ —> oo. Now, since X\ is non-degenerate, one may choose e\ > 0 so that F ( ( l — £i) 2 zo) > 0. For every e > 0, P[Zn < (1 - e)c n ] = P[Z„_! < (1 - e)cn]F((l
- e)c„/a„).
(2)
By Lemma 2 (with i„ = n), cn/an > (1 — e)a;o for all large n (say, n > N). But then, if e < ex and n > N, F((l - e)cn/an) > F ( ( l - e)2x0) > 0. Since Zn/cn —> 1, it follows from (2) that P\Zn-\ < (1 - e)cn) —> oo for every £ < £i, from which fact it is clear that P\Zn_\ < (1 — e)cn} —> 0 for every e > 0. But P[Z„_i < (1 + e)c„] > P[Z n < (1 + e ) c ] -» 1, e > 0, so Zn-i/cn —> 1. It is easy to see that c„ ~ c„_i, because Z„_i/c„_i —> 1 by hypothesis. r-i
326
Lemma 4. If Zn/cn -?-> 1 then c„ ~ a-nxo a* = m a x { a ! , - - - , a „ } .
an
d a* ~
a^_i, where
Proof: For each n > 1, choose in, 1 < i„ < n, such that a^n = a*. By Lemma 2, liminf c„/a* > XQ. n—>oo
Now, Zn < a* XQ a.s., n > 1, so, for every e > 0, P[<:ro < (1 - e)cn) < P[Zn < (1 - e)c„] -> 0. It follows that P[a*nxQ < (1 - e ) c „ ] = 0, i.e., cn/a*n < (1 - e ) - 1 ^ , for all large n. Hence lim sup cn/a*n < x0. Thus c„ ~ a*a;0, and it is an easy consequence of Lemma 3 that o* ~ a;
n-V
U
Lemma 5. Suppose that 0 < a\ < a^ < • • • ] oo. For 8 £ (0,1) and all n so large that an > 6 _ 1 a i , define kn = kn(6) = max{j : a,j < San} and Un = Un(6) = max{X,- : kn < j < n}. The following statements are equivalent: (i) {Zn,n
> 1} is relatively stable;
(ii) an ~ a„_i; (iii) n — kn(8) —> oo as n —> oo for every 6 £ (0,1); (iv) Un(6) —> a;o as n —> oo for every 6 S (0,1). Proof: That (i) implies (ii) is immediate from Lemma 4, since a* = a„, n > 1. m
Suppose a n ~ a n _ i . Then, for each integer m > 1, — — = TT i=l
"~'+
1. If n — kn(6) = m, then _^_
=
^IL > S~l > i.
Thus, if n — kn(6) = m for an infinite number of values of n and some fixed m > 1 and 6 € (0,1), then it would follow that lim sup an/an-m > <5-1 > 1, a contradiction. Hence, n - kn(6) —> oo for every 6 € (0,1); that is, (ii) implies (iii).
327
Note that Un(6) < x0 a.s. for every n > 1 and 0 < 8 < 1, so Un(8) - ^ x0 iff P[Un(8) < (1 - e)a;o] — 0 for every e > 0 iff F n _ f c "((l - e)x0) -> 0 for every e > 0 iff n —fc„(<5)—• oo for every 6 G (0,1). Therefore, (iii) and (iv) are equivalent. Finally, suppose that Un(6) -^-> XQ for all 0 < 8 < 1. In view of Lemma 1 and the definition of kn{8),
x0>^>
™*{°sXr-kn{S)<j6Un{g)
as
for every 5 G (0,1). For any e > 0, choose 8 such that 1 — e < 8 < 1. Then -P[Zn < (1 - e K ^ o ] < P[£4(<5) < (1 - e)x0/8] - • 0, since £/n(<5) -*U l. But Z n / a n < xo a.s., so Zn/an
—• XQ. Hence, (iv) implies
•
(0Here is the main result of the paper.
Theorem 1. Let Xi, X2, • • • be non-degenerate i.i.d. random variables with d.f. F such that F(0) < 1 and F(a) = 1 for some a > 0. Let {an} be a positive real sequence such that supa„ = 00, and define {Zn} by (1). Then — - ^ 1 n>l
Cn
for some real sequence {c n } iff a* ~ a£_i, where a* = max{ax, • • • , o„}, n > 1. Moreover, without loss of generality, c„ = o-nxoProof: In view of Lemma 4, it will suffice to show that {Zn} is relatively stable if a* ~ a*_j. To do so, define mj = 1 and, for j > 1, m j + i = minjn : a-n > ami}- Then ami < am2 < • • • | 00; a* = a^. = amj. if rrij < n < mj+i; and amj = a*^ ~ aj^ -1 = a m -_t = a m J _ 1 as j —» 00, using the assumption a* ~ a^-i- I* n o w follows immediately from Lemma 5 that maxamjXmj/(amNx0)
-^-» 1 as iV -> 00.
(3)
But, for each n > 1, there is a unique JV > 1 such that mjv < n < mjv+i. Therefore, maxj<jv aTO X m . ^, Zn — <
Z , = —n < xo a.s.
It is an easy consequence of (3) and (4) that Znj(a*nxo) -^-> 1.
,.s (4) rj
328
Remark 1. If X\ were degenerate and finite, then Xn = XQ a.s. Thus Zn = a*xo a.s. in this case. If supa„ < oo, then Zn —» A a.s. for a constant A iff sup an = supa„ for every m > 1, in which case A = a*xo (see Tomkins 2 ) ; clearly, a£ ~ a*_j and Zn/(a^xo) —> 1 a.s. in this case. Remark 2. If XQ = oo and an = 1, n > 1, then the relative stability of max{Xi, • • • , Xn} was completely solved by Gnedenko l. The author intends to address the relative stability of {Zn} when xo = oo in a future publication. Remark 3. As noted above, the relative stability problem for the maximum sequence Mn = max{Xi, • • • ,Xn} has been fully resolved in the case where X\,X2, • • • are i.i.d. with F(x) < 1 for all x. A natural extension is to ask are what happens to Mn as n —> oo where X\,X2,-" merely independent with P[Xn < x] < 1 for all n > 1 and all real x. Theorem 1 above may be interpreted as a first step towards the solution of this general problem. Acknowledgment This work was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. The author is grateful for the comments of two referees. References 1. B. Gnedenko, Ann. Math. 44, 423 (1943). 2. R. J. Tomkins, Proceedings of the Conference on Extreme Value Theory and Applications (Gaithersburg, MD, May 1993) 3, 197 (1994).
T R A N S I E N T ANALYSIS OF SOME I N F I N I T E SERVER QUEUES GORDON E. WILLMOT AND STEVE DREKIC Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada E-mail: sdrekic@hopper. math, uwaterloo. ca In this paper, it is demonstrated using generating function arguments t h a t the transient distribution of the number of customers in the system in many Mx /G/oo queues may be computed in a straightforward manner for a wide variety of bulk arrival size distributions. Numerical examples illustrate the ease of implementation of the computational procedure.
In this paper, we consider an infinite server queueing system in which customers arrive according to a Poisson process at rate A > 0. We assume that the bulk arrival size random variable X is distributed with q^ = Pr{X = k},k > 1, and probability generating function (pgf) Q(z) = Y^k^i^^. We assume that the individual service times are independent and identically distributed positive random variables with distribution function (df) F{y) = 1 - F(y) = Pr{Y < y}. This is the Mx/G/oo queue and the purpose of this paper is to present a computational procedure for determining the distribution of the number of customers in the system (equivalently, the number of busy servers) at time t for a particular class of service time distributions. The Mx/G/oo queue can be used to model any system involving delays with no congestion, and is thus quite general in scope. In particular, the Mx/G/oo queue has important applications in a variety of information transmission and storage systems. It has also been used to model the number of incurred but not reported (IBNR) claims in an insurance context (see Willmot and Lin 1). Theoretical results concerning the transient analysis of this system are contained in the excellent reference by Chaudhry and Templeton 2 , and much of this is attributed to the pioneering work of Shanbag 3 , Abol'nikov 4 , Reynolds 5 , and Brown and Ross 6 . These references do not place much emphasis on the computational aspects of the problem. However, Willmot and Drekic 7 have recently introduced a straightforward approach to computing the transient distribution of the number in system in the Mx /M/oo queue for a wide variety of bulk arrival size distributions. Let Nt represent the number of customers in the system at time t, with pn(t) = Pr{Nt — n}, n > 0. Assuming TVn = 0, it follows that Nt has a
329
330
compound Poisson distribution with pgf CO
Pt(z) = J2Pn(t)zn=extlQ<^-V
(1)
n=0
where oo
-i
Qt(z) = J2
pt
= 7
Q[F(x) + zF(x)]dx.
(2)
J
fc=0
°
The derivation of this result follows that of Brown and Ross 6 , who actually consider a more general model. We reproduce this derivation here for completeness. The probability that m > 1 batches of customers arrive by time t is (Xt)me~xt/m\. The conditional distribution of the (unordered) arrival times of these m batches are independent and uniformly distributed over (0,t) with constant probability density function 1/t (see Ross 8 , section 5.3). If a batch arrives at time x where 0 < x < t, a customer from this batch is still in the system with probability F(t — x) and has left the system with probability F(t — x). Therefore, the number of customers remaining in the system at time t from a batch arriving at time x has conditional pgf / Q[F{t-x) + zF{t-x)]— = Q[F(x) + zF(x)}dx, Jo * * Jo which is, with probabilities {qm(t); m = 1,2,3,...}, equation (2). The total number in the system at time t is obtained by summing over the m (independent) batches, i.e. the pgf is thus raised to the power m, so that m „—At
m! which is equation (1). In the case of the compound Poisson with pgf (1), the following recursive formula can be used to compute {pn(t); n = 0,1,2,...} (see Klugman et al. 9 , pp. 239-240 or Tijms 10 , p. 37): oo
Po{t)=exp{-\tY,qk{t)},
(3)
k=i
xt
n
Pn{t) = —Y,k
n = 1,2,3,....
(4)
fc=i
Also, moments of Nt are easily obtainable from (1) and (factorial) moments associated with (2) (see Klugman et al. 9 ) .
331
In this paper, our goal is to demonstrate that numerical evaluation of the probabilities {pn{t)', n = 0,1,2,...} using (3) and (4) is relatively straightforward. In what follows, we shall show that under certain conditions a recursive formula may be used to obtain {qk(t); k = 1,2,3,...}. This formula normally requires evaluation of qo(t) = Qt(0). Simplification results if a closed-form expression for Qt{x) is available. We briefly review some situations where this is the case, and then proceed to the derivation of the recursive formula alluded to above. We then reconsider evaluation of qo(t) in light of the recursion, and follow this with three examples. In general, it is difficult to obtain a simple form for the pgf (2). However, this may be done in some special cases. For example, suppose that (see Willmot n ) aT(k - a) qk
~
fcir(i-a)'
*-1^^.---.
where 0 < a < 1, so that Q(z) = 1 - (1 - z)a.
(5)
Then from (2), it follows that Qt{z) = \ [ {1 - [1 - F(x) - zF(x)]a) * Jo
= =
dx
\j\l-{l-zYF{xY]dx i-et{\-z)a
where
et = \ I F{x)adx. t Jo
Evidently, 0 < 6t < 1, and from (5) one has Qt(z) = 1 — 9t + 0tQ{z), from which it follows that qo{t) =
i-et
and qk(t) = 6tqk, fc = l , 2 , 3 , . . . . In the case of exponential service, Willmot and Drekic 7 (see p. 142 for details) have shown that if X has a zero-truncated geometric distribution (see Klugman et al. 9 , p. 590), both (1) and (2) simplify to reveal that Nt has a compound negative binomial - geometric distribution with easily calculable
332
parameters. As another example where service times are exponentially distributed at rate fi > 0, suppose that X has an extended truncated negative binomial (ETNB) distribution with pgf (see Willmot n , p. 18) [1 - 2 / ^ - 1 ) ] 4 - ( 1 + 2/3)4 V
'
1 - ( 1 + 2/8)4
Subsequent substitution into (2) then yields Qt(z) = I r [ 1 - 2 ^ - ^ ( ^ - 1 ) 1 * - d + 2/?)* dx t Jo 1-(1+2/3)2 2/3e-'iX(z - 1)]4 - ln{l + [1 - 2/3e~'lx(z - 1)]4}\ /**[(!+ 2/3)4-1] ) = 1+
• ( [1 - 2pe-»\z ,rf[(l + 2 / 3 ) 4 - 1 ]
x=0
- 1)] i - [1 - 2/3(2 - 1)]*
, 1+ [ 1 - W , - I ) H ,\ U + [l-2^e-^(z-l)]4/y Hence, from (1), we get {1 + [1 - 2/3(z - l)]i} e -[i-«»(*-i)]* ^(*)
.{1 + [1 - 2pe~^{z
\ ,.[(1+2/3)4-!,
- i)]4} e -[i-2/3e-/"(*-D]4
It is clear from this last example that it may be difficult to extract the coefficient of zn in Qt(z) as defined by (2). (Of course, one could expand the integrand in powers of zn and integrate term by term, but this is not very suitable, since the resulting expression is cumbersome). However, if the service time distribution has a failure rate whose reciprocal is a linear function, a simple computational formula for the probabilities {qk(t); k = 1,2,3,...} is obtainable. That is, let us assume that the failure rate of the service time distribution has the form
Note that differentiation of (2) yields Q't(z) = ^Qt(z)
= j j
Q'[F(x) +
I f —F tJo (1 -
zF(x)]F(x)dx
^ -Q'[F(x) + zF{x)]{\ z)F'{x)
z)F'{x)dx
333
By substituting (8) into (9), multiplying both sides by 1 — z, and applying integration by parts, we obtain that Qt{z) satisfies the differential equation Q't(z)(l -Z) = (l + e) Q[F(t) + zF(t)} - lQ(z)
- 6Qt(z).
(10)
Although we intend to use (10) to derive a computational formula for the probabilities {%(£); k = 1,2,3,...}, we remark that moments also follow easily from (10). In particular, by differentiating both sides of (10) r times, r > 0, we immediately get Q{tr+1)(z)(l
- z) - rQt\z)
F(t)TQ{r)[F(t)
=(j+0)
+ zF(t)] -
-0Q(;\z).
^Q{r){z) (11)
Evaluating (11) at z = 1 gives rise to the following expression: QV{i)
=
9
^-[j{i-F{ty)-eF{ty
+ r.
(12)
Of course, it is always the case that moments may be evaluated directly from (2), giving Q(r){1)=Q^J*F(xydx.
(13)
Equation (10) may be used to derive a recursive relationship between the coefficients qn{t) of zn in the expansion of Qt(z). First of all, define q„t by introducing the pgf oo
Q*t(z) = Q[F(t) + zF(t)} =
J (
£ ln,tZn,
*>0.
(14)
n=0
The probabilities {<j£t; n = 0,1,2,...} can be readily obtained for a wide variety of bulk arrival size distributions of practical interest. In particular, let us consider distributions [qk] k = 1,2,3,...} whose pgf is of the form
Q{z)=±q(k,T)z^m^m
(15)
334
where r is a parameter and B(-) is a function not depending on r. This includes many standard distributions such as the truncated negative binomial, truncated Poisson, truncated binomial, and logarithmic series (see Willmot n ) . Thus, qk = q(k,r), and from (14)
Qt{z)
B[TF(t){l-z)]-B{T)
=
T^W)
'
We then obtain that
0«-»w^-')^f';'i;('-'>i
<„>
and (-mrF(t)}nBM[rF(t)(l
„*(»),.,
Qt W -
- z)}
T=W)
'
from which it follows by setting 2 = 0 and dividing by n! that %,t =
<
B[rF{t)\ ~ B(T)
t = L
1-B(T)
T^fgK^(t)1'
n = 1,2,3,
If one now equates coefficients of zn in (10), one obtains
J_
qn+i(t) = ^-{ [(n - 6)qn(t) +(j+0) Qn,t - Jin] , n = 0 , 1 , 2 , . . . . (18) n +1 This is a first order linear difference equation which is easy to solve analytically, but the resulting solution offers little insight. Thus, for 9^0, our intention is to use (18) to generate qn(t) numerically. Note that (18) is the only place where the starting point qo(t) is needed (unless we have exponential service where 9 = 0, in which case considerable simplification results - see Example 1 below). Assuming qo(t) is available, the probabilities {qn(t); n = 1,2,3,...} may be obtained recursively via (18). For a discussion of numerical issues related to recursions of this nature, see Press et al. 12 , section 5.4. In general, two approaches suggest themselves as a means of computing ()(£)• First of all, if it is possible to obtain a simple analytical expression for Qt{z), as in the example which gave rise to (7), then one can substitute z = 0 easily in Qt(z) to get q0(t) = Qt(0). Otherwise, a Taylor series expansion of Qt{z) about z = 1 yields
Qt{z) = l + Yj^fL{z-l)k, fc=i
(19)
335 !(*)'{1) is given by (12) or (13). Then, we set z = 0 in (19) and use where Q\ (13) to obtain 1
9o(*)
°°
= i + 7£(-i)
f cQ
fc=i
(fc)
(l) ak(t) k\
(20)
where ak{t) = JQ F(x)kdx. We point out that: (i) Q ^ ( l ) may be computed quite readily via (16) for Q(z) satisfying (15), and (ii) a,k(t) may be evaluated explicitly for F(x) satisfying (8). Moreover, since the infinite series in (20) is an alternating one, an effective and efficient means of computing this series is to employ Euler summation. For details regarding Euler summation and the control of the approximation error, the reader is referred to Abate et al. 13 , p. 270 or Press et al. 12 , section 5.1. We now present three special cases: Example 1: Exponential Service Time Distribution. If F(y) — l - e _ w / for y > 0 where /J. > 0, one obtains r(y) = fi and so (18) is applicable with a = /it -1 and 9 = 0. Substituting these parameters into (18) immediately yields (n + l)qn+i(t)
- nqn(t) = — (q^t - qn) .
(21)
Summing both sides of (21) from n = 0 t o n = fc — 1 easily gives rise to the following explicit expression for qk(t), namely Qk(t) =
E n = o ( g n , t ~ gn)
k\it
k — 1,
2,6,...,
(22)
in agreement with the result found by Willmot and Drekic 7 , p. 139, eq. (8). Example 2: Pareto Service Time Distribution. If F(y) = 1 — ( - + - ) for y > 0 where a, ^ > 0, it is easy to verify that r(y) =
a v +y
and so (18) is applicable with a = fi/a and 8 = 1/a. To compute qo(t), we employ (20) with -ak+l
ak(t)
l-ak
-1
:
ak^l (23)
: ak = 1
336
We remark that the Pareto distribution may be a good alternative to the exponential if it is felt that a more heavily tailed service time distribution is better suited to the model (for example, see Harris 1 4 ) . Example 3: Complementary Power Function Service Time Distribution. The power function df (see Evans et al. 15 , p. 161) is H(v) = Pr{V < v} = (^) for 0 < v < u> where (3, u> > 0. Suppose that the service time random variable Y is given by Y = u> — V with df F(y) = 1 — H(CJ — y), that is F(y) = 1 — (l — ^j) for 0 < y < CJ. This subclass of the beta distribution includes the continuous uniform as the special case Q = 1. It is easy to verify that "(y) =
P
0
ui-y
and so (18) is applicable with a = w//3 and 6 = —1//3 if 0 < t < w. Note that if t > u, (2) becomes Qt{z)
-HI: l'-0-£)-('-5) Q
dy + J
Q(l)dy} (24)
Substitution of (24) into (1) then yields (for t > CJ) Pt(z) = exp{\t[jQ„(z)
+ 1 - j - l] } = eMMQUz)
~ 1]} = P«(*),
and so the results also extend to the case t > w. Finally, to compute qo(t), we employ (20) with ak(t) =
1 + Pk
(25)
'-|i-£)
To illustrate the effectiveness of the computational procedure, we consider a simple numerical example in which A = 2 and X has a zero-truncated negative binomial distribution given by fn(m,p) qn = l - / o ( m , p ) '
n = 1,2,3,.
where
fn{m,p) =
771 + n — 1
pm(l-p)n,
n = 0,1,2,
(26)
337
In this case, (15) is satisfied with T = (1 — p)/p and B(x) — (1 + x)~m. Consider m = 4 and p = 0.8, so that the mean bulk arrival size is 625/369 ~ 1.69 customers. Tables 1 and 2 display the results (to seven decimal places of accuracy) for pn(t),n = 0 , 1 , 2 , . . . , 10, for values of t = 0.125, 0.25, 0.5, 1, 5, and 10 under the assumptions that: (i) service times are Pareto distributed with parameters a = 2 and fi = 5, and (ii) uniformly distributed on the interval (0,10) (i.e. the complementary power function distribution with parameters (3 = 1 and ui — 10). The mean service time in both cases is 5. The sums of the first 11 probabilities corresponding to these values of t are also included in the tables for comparative purposes.
Table 1. Results for pn(t) n
0 1 2 3 4 5 6 7 8 9 10 Sum
t = 0.125 0.7814843 0.1083984 0.0606528 0.0285572 0.0123726 0.0051287 0.0020698 0.0008195 0.0003194 0.0001228 0.0000466 0.9999721
t = 0.25 0.6148611 0.1704123 0.1055363 0.0564078 0.0281093 0.0134367 0.0062313 0.0028187 0.0012478 0.0005420 0.0002315 0.9998348
under Pareto service with a = 2 and \i = 5. £ = 0.5 0.3881397 0.2143892 0.1584359 0.1025238 0.0616352 0.0351694 0.0192524 0.0101812 0.0052278 0.0026166 0.0012805 0.9988517
Table 2. Results for pn(t) n
0 1 2 3 4 5 6 7 8 9 10 Sum
t = 0.125 0.7794793 0.1081531 0.0613089 0.0292273 0.0128160 0.0053756 0.0021951 0.0008794 0.0003469 0.0001350 0.0000519 0.9999685
t = 0.25 0.6086560 0.1688919 0.1070310 0.0584697 0.0297673 0.0145380 0.0068900 0.0031859 0.0014420 0.0006405 0.0002798 0.9997921
t = 1 0.1664161 0.1815371 0.1772388 0.1485030 0.1127000 0.0794752 0.0528923 0.0335707 0.0204739 0.0120659 0.0069013 0.9917743
i= 5 0.0021185 0.0093877 0.0235195 0.0434800 0.0657831 0.0859879 0.1003555 0.1068854 0.1055072 0.0976337 0.0854428 0.7261013
t = 10 0.0001672 0.0011223 0.0040202 0.0101899 0.0204642 0.0345967 0.0511161 0.0676914 0.0818113 0.0914660 0.0955872 0.4582325
under Uniform service on (0,10).
t = 0.5 0.3731071 0.2070083 0.1588470 0.1066481 0.0665646 0.0394700 0.0224708 0.0123660 0.0066108 0.0034462 0.0017572 0.9982961
t =1 0.1433320 0.1588770 0.1642742 0.1462919 0.1183933 0.0892471 0.0636007 0.0432810 0.0283298 0.0179332 0.0110248 0.9845850
t= 5 0.0002386 0.0012633 0.0038446 0.0087131 0.0162544 0.0262992 0.0380937 0.0504483 0.0619999 0.0714954 0.0780155 0.3566660
t = 10 0.0000073 0.0000592 0.0002609 0.0008191 0.0020490 0.0043339 0.0080387 0.0133999 0.0204284 0.0288568 0.0381517 0.1164049
338
Acknowledgments This research was supported by the Natural Sciences and Engineering Research Council of Canada. References 1. G.E. Willmot and X.S. Lin, Lundberg Approximations for Compound Distributions with Insurance Applications (Springer-Verlag, New York, 2001). 2. M.L. Chaudhry and J.G.C. Templeton, A First Course in Bulk Queues (John Wiley & Sons, New York, 1983). 3. D.N. Shanbag, Journal of Applied Probability 3, 274 (1966). 4. L.M. Abol'nikov, Probl. Inform. Transm. 4, 82 (1968). 5. J.F. Reynolds, Operations Research 16, 186 (1968). 6. M. Brown and S.M. Ross, Journal of Applied Probability 6604 1969. 7. G.E. Willmot and S. Drekic, Operations Research Letters 28, 137 (2001). 8. S.M. Ross, Introduction to Probability Models, 7th Edition (Academic Press, San Diego, 2000). 9. S.A. Klugman, H.H. Panjer and G.E. Willmot, Loss Models: From Data to Decisions (John Wiley & Sons, New York, 1998). 10. H. Tijms, Stochastic Modelling and Analysis: A Computational Approach (John Wiley & Sons, Chichester, 1986). 11. G.E. Willmot, ASTIN Bulletin 18, 17 (1988). 12. W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes - The Art of Scientific Computing (Cambridge University Press, New York, 1986). 13. J. Abate, G.L. Choudhury and W. Whitt, In Computational Probability, ed. W.K. Grassmann (Kluwer Academic Publishers, Boston, 2000). 14. C M . Harris, Operations Research 16, 307 (1968). 15. M. Evans, N. Hastings and B. Peacock, Statistical Distributions, 3rd Edition (John Wiley & Sons, New York, 2000).
EMPIRICAL LIKELIHOOD M E T H O D FOR FINITE POPULATIONS
CHANGBAO WU Department
of Statistics and Actuarial Science University of Waterloo Waterloo ON N2L 3G1 Canada E-mail: [email protected]
This article provides an overview on recent developments of empirical likelihood methods in estimating the finite population means and totals, distribution function and quantiles, variance and other quadratic functions in the presence of auxiliary information. Major results are unified under the general framework of optimal estimation and model-calibration. Applications of the method to obtaining rangerestricted weights in regression estimators and estimation under measurement error models are also presented.
1
Introduction
The empirical likelihood method was proposed by Owen 11,12 as a device for constructing confidence regions with independent observations. Owen proved that the empirical likelihood ratio statistic has an asymptotic x2 distribution and therefore is useful for interval estimation and hypothesis testing. Qin and Lawless 15,16 discovered that the empirical likelihood method is also a powerful tool for point estimation when side information can be incorporated into constrained maximization of the empirical likelihood function. The method soon became popular and major developments have been summarized in the recent book by Owen 13 . Historically, the first application of the concept behind empirical likelihood was suggested by Hartley and Rao 10 for finite populations. They assume that the study variable y takes only a finite set of scale points y[i], Z/[2], • • • , y[k]- For a given sample {yi,i € s}, let rij be the number of j/i's in s that take scale point y[j]. Under simple random sampling (SRS), (ni,---,7ifc) follows a multivariate hyper-geometric distribution. When the population size TV is large, one can use the likelihood function from a multinomial distribution. Auxiliary information can be incorporated to find the constrained maximum likelihood estimators of the population proportions for each of the scale points. These estimated proportions can then be used to construct the so-called scale-load estimators for the finite population mean
Y =
N-^liyi.
The first formal application of the empirical likelihood method in sur-
339
340
vey sampling was introduced by Chen and Qin 3 under SRS. Let p, be the probability mass at yi for i £ s and X be the known population means for the (vector-valued) auxiliary variable x. The empirical maximum likelihood estimator of Y is defined as YEL — YliesPiVi w n e r e the p,'s maximize the empirical likelihood function L(p) = Ylies Pi subject to ^2pi
— 1 (pi > 0) and
~^2lpiXi=X.
(1)
Chen and Qin 3 showed that YEL ls asymptotically equivalent to the conventional regression estimator and therefore is very efficient. The rest of this paper is organized as follows. In Section 2, we introduce the pseudo empirical likelihood approach under a general sampling design for the estimation of Y (Chen and Sitter 4 ) . The concept of optimal calibration estimation is presented along with the model-calibrated pseudo empirical likelihood (MCPE) approach of Wu and Sitter 21 . In Sections 3 and 4, the MCPE estimators are applied naturally to the estimation of distribution functions and quadratic finite population functions including the population variance. A simple solution to obtain range-restricted weights in regression estimators using the empirical likelihood method is presented in Section 5. In Section 6, we discuss how the empirical likelihood method can be applied to various measurement error problems. We conclude with some remarks in Section 7. 2
Pseudo Empirical Likelihood Approach
Sample data from a finite population obtained through an unequal probability sampling scheme are usually highly correlated with each other. What will be the "empirical likelihood function" to use under a general sampling design? Chen and Sitter 4 proposed a pseudo empirical likelihood function based on a two-stage argument which can be viewed as a non-parametric version of the estimation strategy discussed in Binder 1 and Godambe and Thompson 9 : if the data from the entire finite population, {2/1,2/2, • • • ,VN}, is known, the correct empirical likelihood function would be L(p) — n»=iP»The corresponding empirical log-likelihood function l(p) = ^2i=i l°g(P») i s a population total! The design-based unbiased (Horvitz-Thompson) estimator for this total is given by
?(p)=X> lo 8(P0.
(2)
where di = l/iti are the basic design weights and 7T; = P(i £ s) are the inclusion probabilities. The l(p) was referred to as pseudo empirical (log) likelihood
341
function. The pseudo empirical maximum likelihood estimator (PEML) of Y (Chen and Sitter 4 ) was defined as YPE = YliesPiVi w n e r e t n e Pi's maximize l(p) subject to constraints (1), assuming X is known. The p^s are given by pi = <%/[l + \'(xi - X ) ] , where d? = d{/ £ i £ s di and the vector Lagrange multiplier, A, is the solution to g(X) = ^2ies[d*(xi — X)]/[l + \'(xi — X)] = 0. A modified Newton-Raphson algorithm has been developed by Chen et al. 6 for finding this solution. The algorithm is guaranteed to converge if a solution exists. A crucial component in the (pseudo) empirical likelihood estimation is the use of constraints (1) in the maximization process. There are two issues related to this and any other estimation procedures: efficiency and consistency. Efficiency is measured by the overall performance of the estimator in terms of bias and variance or mean square error; consistency refers to some internal conditions and requirements imposed by the surveyor. The second constraint in (1) is a commonly used consistency requirement called benchmark constraints. Benchmark constraints are often imposed in practice for two reasons: the surveyor believes that the weights which give perfect estimates for the auxiliary variables should also give a good estimate for the study variable; the auxiliary information is only available at the aggregate level, i.e. only X is known. On the other hand, if complete auxiliary information X\, • • •, XN is known, a compelling question to ask would be "what is the best constraint to use in the (pseudo) empirical likelihood estimation?" To put this more formally, let itj = u(xi), i — 1, • • • ,7V, where u(-) is a known function. We use u; as a calibration variable and replace the second constraint in (1) by 1
N
The question becomes "what kind of choice of u; will make Ypg most efficient"? It is very unfortunate that in survey sampling uniformly minimum variance (unbiased) estimators do not exist. Indeed the only choice of U; that results in a YPE with minimum variance is Ui = yi and this of course cannot be used. The model-assisted optimal estimators using the criterion of minimum expected design variance under a superpopulation model have been discussed by several authors. See, for example, the work by Godambe 7 , Godambe and Thompson 8 , and Cassel et al. 2 . Suppose that 2/1,2/2, • • • ,VN is a random
342
sample from a superpopulation such that E((Vi) = IM . V((vi) = of , i = 1, 2, • • •, N, and yi, £/2, • • •, VN are independent of each other. Here E$ and V£ denote the expectation and variance under the superpopulation model. The following result has been established by Wu 20 . Theorem. Under proper asymptotic settings and for any regular sampling designs, the use of u; = fii as a calibration variable in (1) will result in an estimator Y p% with minimum expected asymptotic design variance E(\AVP{YPE)] among all possible choices of («i, U2, • • •, % ) such that Here AVP refers to the asymptotic design-based variance and U = N~l Y2i=i ui- See Wu 2 0 for a detailed discussion on the asymptotic framework and a definition of regular sampling designs. Note that Y'PE is robust against model misspecification, since YPE is asymptotically design unbiased irrespective of the model but will be particularly efficient if the model adequately depicts the finite population. The gain of efficiency depends on the correlation between the response variable and the covariates. Assume complete auxiliary information is available, Wu and Sitter 21 proposed a model-calibration approach to implement this optimal estimator. They adapted a semi-parametric model Ez{yi\xi) = ^ =fj,(xi,0),
Vt(yi\xi) = crf , i = 1,2, • • •, N.
The fitted value fn — H{xi,0) is used as the calibration variable in constraints (1), where 0 is a design-based estimator for the model parameter 6. The resulting PEML estimator was termed a model-calibrated pseudo empirical maximum likelihood (MCPE) estimator, denoted by YMC- They also showed that replacing 9 by 0 in yu, = n(xi, 9) does not change the resulting estimator asymptotically. In addition, with probability close to 1, the MCPE estimator exists and can be computed using a simple bi-section algorithm (Chen et al. 6
).
This optimal model-calibration approach clarified several fundamental issues in using auxiliary information from surveys: (i) The effective use of auxiliary information depends on both the parameters to be estimated and the actual relationship between the response variable and the covariates. Blindly calibrating over auxiliary variables is usually not a good approach.
343
(ii) The benchmark constraints used in (1) are justifiable if the relationship between y and x is close to linear. In this case the resulting PEML estimator is asymptotically equivalent to the optimal MCPE estimator obtained using jj,i = x'fi as the calibration variable (Wu and Sitter 2 1 ) . So benchmarking implies efficient estimation. (iii) If the relationship between y and x is linear, knowing X is "sufficient" for efficient estimation of population mean Y or total Y. If the relationship is nonlinear, or the parameters of interest involve a nonlinear function, complete auxiliary information and/or more advanced modeling are essential for "optimal" estimation. (iv) The optimal model-calibration approach provides a unified framework for the estimation of distribution function and quantiles, variance and other quadratic functions in the presence of auxiliary information. In particular, the intrinsically positive weights, pi > 0, associated with the pseudo empirical likelihood method turn out to be a very valuable asset, as evidenced in the following sections. Under stratified random sampling, Zhong and Rao 22 used a different formulation of the empirical likelihood function by noting that observations from different strata are independent of each other. Each stratum is therefore assigned with an independent empirical distribution and the "overall" empirical (log) likelihood function is the sum of all these strata (log) likelihood functions. We will return to this formulation in Section 6. 3
Estimating the Distribution Function and Quantiles
The finite population distribution function Fy{t) = iV - 1 Y^i=i ^(j/» — *) i s a^so a finite population mean defined over an indicator variable z; = /(?/, < t). Without using any auxiliary information, estimation of Fy(t) is a special case of estimating the population mean and is usually straightforward. In the presence of auxiliary information, special attention needs be given to the following: (a) While benchmark constraints sometimes are justifiable for the estimation of Y, this consistency requirement is not needed for the estimation of Fy(t). Efficiency will be the primary concern. (b) It is the indicator variable Zi = I(yi < t) that we have to work with. There is also an issue of local efficiency (particular value of t) versus global efficiency (an arbitrary t) in estimating Fy(t). (c) It is desirable that an estimator of Fy(t), say F(t), is itself a distribution function, so quantile estimates can be obtained through direct inversion of
344
F(t). Many techniques for estimating Y, when applied directly to the estimation of FY(t), will produce unsatisfactory results. For instance, in the case of a scalar x variable, a regression-type estimator for FY(t) will have the form Freg(t) = FY(t) + {Fx(t) - Fx(t)}B, where FY(t) and Fx(t) are HorvitzThompson type estimators for FY{t) and Fx(t) = TV-1 Yli=\ I(xi — *)i a n ^ B is the estimated slope of regressing I(yi < t) on I(xi < t), Freg(t) suffers from several drawbacks. The obvious one is that Freg(t) is not a distribution function and it can take values outside of [0, 1]. The model-assisted difference estimator proposed by Rao et al. 18 and the regression-type estimators or the bias-adjusted estimators discussed in Rao 17 have similar problems. The pseudo empirical likelihood method combined with the optimal model-calibration approach can be readily applied here. The MCPE estimator of FY(t) defined by Chen and Wu 5 is given by FMC(t) = ^Zies Pil(yi < t) where the p^s maximize l(p) subject to 1
Y^Pi
= l
(pi>
°)
and
pi9i
Ys
=
N
9i
jjYj -
(4)
For a fixed t, the optimal choice of g± is given by gi = E^Ilyi
< t\xi\ = P(yt < t\xi).
It is now clear that no single set of weights p, is optimal for an arbitrary t. Chen and Wu 5 proposed to use a fixed t = to in computing the Qi while the resulting weights pi are used for any t. With this treatment the MCPE estimator FMc{t) will D e a genuine distribution function and is very efficient for t in the neighborhood of t0. Three different ways in computing the g^s were proposed by Chen and Wu 5 . (1) Compute (ft under a regression model Vi= n(xi,0)
+ Viei, i =
l,2,---,N,
where V{ = v(xi) is a known function , the Ei are independent and identically distributed (iid) random variates with mean 0 and variance a2. If the model is linear, /j,(xi,9) — x'fi, but other non-linear regression models will also work. Under this model one can use #; = P{yi < t§\xi) = G{[to — fi(xi,6)]/vi}, where G(-) is the cumulative distribution function of the e^s. Finally, one can replace 0 by 9, and estimate G(-) using the fitted residuals ii if G ( ) is unspecified.
345
(2) Compute g; using a logistic regression model log[gi/(l-gi)]
= x'ie,
t = 1,2, • • - , # ,
with variance function V(g) = g{l — g). The fitted values, gi, are then used. (3) Use pseudo fitted values en = 7[/i(xj, 6) < to], where fi(xi, 6) = E^(yi\xi). A simulation study reported by Chen and Wu 5 showed that the resulting estimators perform well in all three cases, with cases (1) and (2) slightly better than (3). Estimation for the population quantile £Q = Fy1{a) = inf{£ : Fy(t) > a} can be obtained through direct inversion of FMc(t)'- €a = V c ( « ) , where 0 < Q < 1. A Bahadur representation for the quantile process £Q has been established by Chen and Wu 5 under certain sampling designs. 4
Estimation of Variance and Quadratic Functions
Estimation of variance and other second-order finite population quantities using auxiliary information has been addressed by many survey researchers. Various techniques, such as regression, ratio and calibration estimation, have been attempted. See Sitter and Wu 19 for a literature review. A common weakness of these approaches is the ad hoc argument of applying certain techniques, which were originally developed for estimating Y, to estimate the variance without a common framework that unifies the two types of finite population parameters. The model-calibrated pseudo empirical likelihood method can be extended to handle variances and other second-order finite population parameters through a batch approach (Sitter and Wu 1 9 ). Let y be the (possibly vector-valued) study variable(s). For parameters in a quadratic form, T = Yli=i 2 j = i + i 0(lfi> Vj)> which includes the population variance, covariance, and variance of a linear estimator as special cases, a unified estimation strategy is as follows. (1) View T as a total over a synthetic finite population, i.e. T = Yla=i *« where a = (ij) = 1,2, • • •, W , t„ = 0(y i t y,) for a = (ij), and N* = N(N l ) / 2 is the total number of pairs. (2) The sample over the synthetic population consists of all the pairs from the original sample: s* = {(ij) : i < j , i,j € s}. (3) The "first-order" inclusion probabilities under this setting are 7Tjj = P(hJ € s)> a n d the "basic design weights" are d,j = l/7r,j.
346
(4) The pseudo empirical (log) likelihood function is modified to accommodate all the pairs (ij) using the d^'s.
where the pij is the probability mass assigned for The MCPE estimator of T is defined as
{y^Vj)-
where the p ^ ' s maximize the modified l(p) subject to 1
l
51 Z)p« = (pa > °). 5Z & ^ ( & ' »j)
=
JV
JF
N
51 5Z <^> »j) •
Here the y^s are the fitted values for the j/ f 's, as discussed in Section 2. The above approach brings a unified framework to the estimation of linear and quadratic parameters using auxiliary information. The approach is also model-assisted in that the resulting estimator TMC is approximately design unbiased irrespective of the working (superpopulation) model used and will be very efficient if the model is adequate. Also, since the weights p^ are always positive, the method ensures positive estimation for some known positive parameters such as variances. The optimality of the MCPE approach for quadratic finite population parameters has also been established by Wu 20 . The method is generally applicable and improvement over the naive Horvitz-Thompson estimator is guaranteed. Results of a simulation study reported in Sitter and Wu 19 showed that TMC performs very well for samples of small and moderate size. 5
Range-restricted Weights in Regression Estimation
The pseudo empirical likelihood method, combined with a novel idea of Chen et al. 6 , provides a simple solution to the range-restricted weights problem in regression estimation. The generalized regression estimator YQR for the population mean Y is probably the most popular one used by survey practitioners. It is computationally simple, very efficient if the relationship between y and x is nearly linear, and requires only X to be known to compute the estimate. If one rewrites YQR in the form of a weighted average, YQR = Y^i£swiVi> the so-called GREG weights u>; also satisfy the benchmark constraints, i.e.
T,ieswixi=X-
347
The GREG weights Wi, however, possess a very undesirable property: they can be very small or very large, and sometimes can even be negative. This has long been recognized by survey statisticians. Several iterative algorithms have been proposed to adjust the GREG weights so that the adjusted weights will satisfy the range-restrictions: 71 < Wi/d* < 72, where 0 < 71 < 1 < 72 are pre-specified, and d* is the standardized basic design weights. The basic design weights di can be interpreted as the number of units in the population represented by unit i in the sample. Range restrictions state that the adjusted weights obtained from incorporating auxiliary information are not allowed to deviate too far away from the basic design weights. In particular, the minimum restriction of positive weights should be imposed whenever is possible. The PEML and MCPE estimators discussed in Section 2 are asymptotically equivalent to the GREG and the weights, pi, are always positive. The weights may, however, not satisfy a more restricted range specified by 0 < 71 < 1 < 72. Chen et al. 6 proposed a simple solution to this. The idea is to relax the benchmark constraints a little bit while still make good use of auxiliary information. Taking the PEML estimator as an example, if we replace X by XHT = J2ies^iXi m t n e constraints (1), the resulting PEML weights would be pi = d* which will automatically satisfy any rangerestrictions. In general, if we replace X by X + S(XHT — X), the smallest 6 G (0,1) can be found through a simple bi-section algorithm such that the resulting PEML weights will satisfy the pre-specified range-restriction. The algorithm is simple and guaranteed to converge. The adjustment is "optimal" in the sense of minimum relaxation of the benchmark constraints.
6
Estimation Under Measurement Error Models
In many practical situations the cost to obtain exact measurements of a study variable can be high, but "inaccurate" measurements may be gathered quite easily. Let j/i be the exact measurement and Zi be the inaccurate measurement, i.e. measurement with error. The empirical likelihood method provides useful tools for inference under measurement error models. The general framework is to treat the distribution of the study variable non-parametrically while modeling the measurement error parametrically or semi-parametrically. Depending on the structure of the sample data, different approaches can be applied. Two general sampling schemes are often used with measurement error problems. One scheme is to take two (or more) independent samples, a rela-
348
tively small sample si with exact measurement and a large sample S2 measured with error. Another popular scheme is two-phase sampling where a large first phase sample s2 is taken and inaccurate measurement Z; are obtained and then a much smaller second phase sample si is drawn and the exact y^s are measured. An extreme case for two-phase sampling is to take s2 as the entire finite population. Let the parameter of interest be the finite population distribution function Fy (t) and assume all samples are obtained using simple random sampling. With two (or more) independent samples, Zhong et al. 2 3 proposed to formulate an empirical (log) likelihood function similar to the one used under stratified sampling (Zhong and Rao 2 2 ) . Let p» be the probability mass at 2/i, i £ s\, and qi be the probability mass at Z{, i £ s2. The empirical (log) likelihood function is defined as
i(P. 9) = X) logfo) + X) log(*) •
(5)
Extra model information regarding the measurement error can be used as constraints when one maximizes l(p, q). For instance, the inaccurate instrument and the accurate instrument may have a common average reading, one can then obtain j3j and qi by maximizing l(p, q) subject to
Xpi = l> X q i = *' and 12Piyi = XqiZi • The pi's and the exact measurements are used to construct estimators, i.e.
FY(t) =
ZiesiPiI(yi
Under a two-phase sample Si C S2, measurement errors can be modeled more explicitly. There are two commonly used models for measurement errors: the regression calibration model and the classical measurement error model. The regression calibration (RC) model treats Zi as a predictor of J/J: yt = a + (3zi + £i, i = 1,2, • • •, N, where the e;'s are i.i.d. random variates with E^(Si) — 0 and V^(£i) = cr2. Further model information may suggest that a — 0 or (3 — 1. Including both a and (3 in the model can accommodate systematic bias and/or departure accompanied with the error measurements. The role of Zi in the model is the same as an auxiliary variable discussed in Sections 2, 3 and 4. Methodologies developed in these sections can be used here with a minor modification. For instance, the second constraint in (4) should be replaced by ^2ies Pi9i = n _1 2 ^2iea gi, where n-i is the first phase sample size. The model parameters a, P and
349 The classical measurement error (CM) model is a conditional model for Zi given ?/;: Zi — a + fiyi + £i, i = 1,2, •• •, N. The £j's are assumed i.i.d. N(0,cr2). The RC model and the CM model look similar to each other but are fundamentally different. Under the CM model, the distribution of y, which is of primary interest, appears as a marginal distribution in the condition. A mixed likelihood function needs to be used to bring together the non-parametric likelihood function of y and the conditional parametric likelihood function of z given y. Let f(y,z) be the joint distribution function of (y, z), f(y) be the marginal distribution of y, and f(z\y) be the conditional distribution of z given y. The full likelihood function based on (y;,z;), i £ s\ can be written as Ilie.i fiVi'Zi) = Ui€si /(Wi)IL e .i f(zi\Vi)- Denoting f(Vi) b y P i and f{zi\yi) by 4>(zi, yi,9), the log-likelihood function is given by
KP, 0) = J2
1O
§^)
+
Yl l°g0(*> Vi, 9).
This log-likelihood function is based on the small second-phase sample s\ only. Information contained in the large first-phase sample S2 can be used for formulating constraints. Once again the pi's and the y;'s (i 6 s\) will be used to construct estimators. The p; and 0 are obtained through simultaneous maximization of l(p, 8) subject to 5 > i = 1 (ft > 0) and Y,PiE{z\\vi)
=
E(zr).
The second constraint comes from E(zT) = E[E(zr\y)] for r — 1,2,3, •••, and reduces to Y^ies Pi = 1 for r = 0. E(zr) should be replaced by the sample moments from the large sample S2 and E{zi\yi) = a + Pyi, E(zf\yi) = a1 + (a + Pyi)2, etc, are obtained from the conditional normal model. 7
Concluding Remarks
It has been shown by Owen 13 that the empirical likelihood method is a powerful non-parametric approach to inference with applications in many areas of statistics for infinite populations. This article presents an overview of various applications of the method to finite population problems. The optimal MCPE estimators under non-linear situations require complete auxiliary information for implementing the method. When such information is not available, a twophase sampling scheme may be used where the large first-phase sample of x
350
serves as "complete" auxiliary information. Variance estimation under this scenario has been investigated by Prasad and Thach 14 . The use of empirical likelihood methods for measurement error problems needs to be further investigated. Questions that need to be addressed include: (1) How to assess the method under different models; (2) What order of moment, r, to use in the CM model; and (3) How to use the inaccurate measurement Zi, i G s2 directly in the construction of estimators. The idea of using a mixed likelihood function with one component non-parametric and the other parametric or semi-parametric might be very useful for other finite population problems. Acknowledgments This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. The author thanks Professor Randy R. Sitter and Professor Jiahua Chen for helpful comments. References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 17. 18. 19.
D. Binder, Internal Statist. Rev. 5 1 , 279 (1983). C M . Cassel, C.E. Sarndal and J.H. Wretman, Biometrika 63, 615 (1976). J. Chen and J. Qin, Biometrika 80, 107 (1993). J. Chen and R.R. Sitter, Statistica Sinica 9, 385 (1999). J. Chen and C. Wu, Statistica Sinica (Accepted, 2002). J. Chen, R.R. Sitter and C. Wu, Biometrika (To appear , 2002). V.P. Godambe, J. Roy. Statist. Soc. Ser. B 17, 267 (1955). V.P. Godambe and M.E. Thompson, Ann. Statist. 1, 1212 (1973). V.P. Godambe and M.E. Thompson, Internal Statist. Rev. 54, 127 (1986). H.O. Hartley and J.N.K. Rao, Biometrika 55, 547 (1968). A.B. Owen, Biometrika 75, 237 (1988). A.B. Owen, Ann. Statist. 18, 90 (1990). A.B. Owen, Empirical likelihood (Chapman & Hall/CRC, 2001). N.G.N. Prasad and T. Thach, Variance estimation under two-phase sampling (Working paper, Department of Mathematical Sciences, University of Alberta, 2001). J. Qin and J.F. Lawless, Ann. Statist. 22, 300 (1994). J. Qin and J.F. Lawless, Canad. J Statist. 23, 145 (1995). J.N.K. Rao, J. Official Statistics 10, 153 (1994). J.N.K. Rao, J.G. Kovar and H.J. Mantel, Biometrika 77, 365 (1990). R.R. Sitter and C. Wu, J. Amer. Statist. Assoc. (To appear, 2002).
351
20. C. Wu, Optimal calibration estimators in survey sampling (Working paper 2002-01, Department of Statistics and Actuarial Science, University of Waterloo, 2002). 21. C. Wu and R.R. Sitter, J. Amer. Statist. Assoc. 96, 185 (2001). 22. C.X.B. Zhong and J.N.K. Rao, Biometrika 87, 929 (2000). 23. B. Zhong, J. Chen and J.N.K. Rao, Canad. J. Statist. 28, 841 (2000).
SECOND ORDER ESTIMATING EQUATIONS FOR CLUSTERED LONGITUDINAL BINARY DATA WITH MISSING OBSERVATIONS
G R A C E Y. Y I A N D R I C H A R D J. C O O K Department of Statistics and Actuarial Science, University of Waterloo 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1 E-mail: [email protected]; [email protected] For incomplete longitudinal data Robins et al. l s developed inverse probability weighted generalized estimating equations for the marginal mean parameters. In many cases, however, the repeated measurements themselves may arise in clusters, which leads to both a cross-sectional and a longitudinal correlation structure. In some applications the correlation structure may become of interest itself. In this paper we describe second order inverse probability weighted generalized estimating equations for association parameters characterizing the dependence among observations within clusters. The inverse probability weights are estimated from conditional logistic models for the missing data process. The methods are applied to d a t a from the Waterloo Smoking Prevention Project for illustrative purposes.
1
Introduction
Generalized estimating equations (GEE) have been widely used for the analysis of data from longitudinal studies and other settings featuring clustered data (Liang and Zeger 8 , Crowder 2 ) . This marginal approach is widely viewed as attractive because it does not require complete specification of the joint distribution of the longitudinal responses but rather is based only on specification of the means and variances of the responses. Perhaps most frequently it is of primary interest to make inferences about the parameters in regression models for the marginal means, but there has been increasing interest in association parameters in recent years. When the association parameters are of central importance second order generalized estimating equations can be constructed to facilitate their estimation. Prentice 14 developed moment-based GEE methods which model the association between a pair of binary responses in terms of the correlation, whereas Lipsitz et al. 10 , Liang et al. 9 and Fitzmaurice and Lipsitz 3 modeled the association in terms of the marginal odds ratio. Missing data occur frequently in longitudinal studies. Although studies are frequently designed to collect data on every individual in the sample at each assessment, often not all responses are observed at all occasions. For example, some individuals may withdraw from the study after a number of visits and never return. This results in a so-called monotone missing data
352
353
pattern which has been the focus of much of the literature on missing data methodology. In other settings more general missing data patterns may exist. Such patterns arise when individuals who have one or more missing visits may return to the study leading to so-called intermittently missing data. When the data are missing completely at random (MCAR), the GEE approach can produce consistent estimates for the parameters because missing data processes are not related to the processes of generating responses. In contrast, when the data are missing at random (MAR) or nonignorable, the estimating equations are not unbiased and they fail to provide consistent estimates. Robins et al. 15 proposed a modified GEE approach which leads to unbiased estimating equations for the estimation of parameters involved in the marginal means. Inverse probability weights are incorporated into the estimating equations to account for the effects of the missing data processes (e.g. Fitzmaurice et al. 4 ) . In this paper we focus on methods for the analysis of incomplete binary responses which arise in clusters. Clustered longitudinal binary data feature both a cross-sectional and a longitudinal correlation structure and interest often lies in the strengths of both types of association. Examples include longitudinal community intervention studies and family studies which involve repeated assessments of individual members over time. Our problem is motivated by a school-based cluster-randomized longitudinal smoking study. We describe second order generalized estimating equations with the inverse probability weights that are used to estimate the parameters associated with both longitudinal and cross-sectional effects. The mean response is characterized by a regression model, the variance is expressed as a function of the mean, and the form of the correlation structure is assumed. The inverse probability weights are estimated from conditional logistic regression models for the missing data process. In Section 2 we introduce notation and model assumptions. In Section 3 we describe estimation of interest parameters and robust variance estimation. Section 4 presents an application to a smoking prevention study in which the data are intermittently missing and involve both a longitudinal dependence in the repeated measurements and a crosssectional dependence arising from between school variation. General remarks are made in Section 5.
354
2 2.1
Notation and Model Assumptions The Response Process
Suppose there are / clusters and J, subjects within cluster i, i = 1, 2 , . . . , I. Further suppose there are K visits planned and let Yijk denote the binary response for subject j in cluster i at the visit k, k = 1,2, ...,K, Yij = (YijUYij2,...,YijKy, j = 1,2,..., Ju and Y{ = ( Y ^ , Y'i2,.... Y'UJ, i — 1,2,...,/. Let lower case letters yijk,Uij, a n d 2/; denote the realizations of Yijk, Y^, and Yt, respectively. Let xtjk = (l,Xijk,i, ...,Xijk,p-i)' be the p x 1 covariate vector for subject j in cluster i at time k, k = 1,2, ...,K; the covariates may be time-dependent or fixed across the entire observation times. Let x{j = {x'i61,x'ij2,...,x'ijK)', and x{ = (x'n,x'i2, ....sc^.)', j = 1,2,.., Ji,i = 1 , 2 , . . , / . Let Hijk = E(Yijk\xi)
= P(Yijk = l\xi),
and let y,^ = (piji,Pij2, - , Mtj/r)', J = 1,2, . . . , J ; , and ^ = (Mu> A*i2>..., ^'ij^', i = l , 2 , . . . , J . We consider the logistic regression model for the mean response logit Mijfc = a ; ^
(1)
for i = 1 , 2 , . . , / , j = 1,2,.., J;, fc = 1,2,...,K, where ^ = (/30,/3i, ..,/3 P -i)' is the vector of regression parameters. The variance for the response Y^k is specified as Vijk = va.v(Yijk\xi) = fiijk(l - Hijk), which depends on the regression parameter vector j3. The most general representation of the correlation structure is corr(Y;jfc, Yi>j>k>) = 4>jkj'k' if z = z' and 0 otherwise. A more specific structure could be specified as
{
Pkk>, if j = / , k^k', Ik, if j ^f, k = k',
with p = (pkfc/,1 < k < k' < K)',i = (7fc,l < fc < K)', S = {6kk>,l < k < k' < K)', and denoted <j> — ( p ' , 7 ' , ^O'I a q x 1 vector containing all association parameters. That is, p reflects the correlation among repeated measurements within subjects, 7 represents the cross-sectional association between concurrent responses from subjects within the same cluster at each visit, and S represents the association between responses from subjects within
355
the same cluster, but for responses taken at different time points. As may be inferred from the notation, in general the association parameters pertaining to responses at different time points may be functions of the span of time between visits. One may, for example, specify an autoregressive correlation structure for p by introducing the constraint pkk> — p],k~k '. A simpler exchangeable structure results by letting pkki = p for k ^ k'. Analogous specifications for 8kk' can be made. The exchangeable structure seems most appropriate for the cross-sectional association within clusters and indeed it is frequently realistic to constrain the 7fc terms to be equal to a common parameter 7. We remark that other parametric correlation structures may be adopted to take into account constraints on the second moments (e.g. Oman and Zucker 1 3 ) . The following constructions of the estimating equations for the mean and association parameters may be adapted to accommodate alternative correlation structures such as these. 2.2
The Missing Data Process
Let Rijk be the indicator variable taking the value 1 if the response Y ^ is observed and 0 otherwise, k = 1,2, ...,K, let Rij = (Riji, Rij2, •••, Rijx)', j = 1,2,..., Ji, and Ri = {R'n, R'i2,..., R'ij.)', i = 1,2,...,/. Let lower letters fijkirij, and ri denote the realizations of Rijk, Rij, and Ri, respectively. Monotone missing data patterns have been the focus of much of the work in the analysis of longitudinal incomplete data. In practice however, subjects may miss one or more visits before returning for a subsequent visit creating what is termed intermittently missing data. We shall consider arbitrary missing data patterns where Rijk = 0 does not necessarily imply Rijk' = 0 for k < k'. We let Hijk = {yiji,Vij2, • • • > Vij,k-i} denote the history of the responses from subject j in cluster i up to but not including visit k, H\°'k denote the history of the observed components in Hijk, and H\-k = {riji,rij2, • • • ,rij^-i} denote the history of the missing data indicators for subject j in cluster i up to but not including visit k, k — 1,2,... ,K, j = 1,2,... ,Ji, and 1 = 1 , 2 , . . . , / . Three types of missing mechanism have been distinguished (Laird 7 ) based on how missing data processes depend on the responses. When the missing data process is independent of all responses (both observed and unobserved) given relevant covariates and auxiliary variables, the data are said to be missing completely at random (MCAR). When the missing data process is conditionally independent of the unobserved responses given the covariates, auxiliary variables, and observed responses, the missing data are said to be missing at random (MAR). Finally the missing mechanism is called nonignorable or informative where the missing data process depends on the unobserved re-
356
sponses. This last type of mechanism can be addressed only by sensitivity analyses (Rotnitzky et al. 1 6 ) , and so we shall focus on methods for dealing with missing at random mechanism in marginal models. Moreover, we assume P(Rijk = l l - H ^ y y , ^ ) = P(Rijk = l\Hljk,H>jl,Xi). This assumption states that the probability of observing subject j in cluster i at visit k, given Hfjk, the full set of responses y^ for subject j , and covariates Xi, does not depend on the unobserved past responses or on the future responses. This additional assumption is not necessary when we simply restrict attention to monotone missing data patterns but facilitates the derivations that follow. As noted in Robins et al., 15 this assumption implies that the data are missing at random, although missing at random does not imply such an equation. We further assume that subjects are assessed at the first visit and so Rtj\ = 1, 3 ~ *-, z , . . . , Ji, % =
i,.4,...,i.
Suppose that the conditional probability Xijk — P(Rijk = l\Hljk,H>°l,Xi) for being observed at time k is known up to a vector of unknown parameters ak, k — 2,3, ...,K. That is, we assume that there exists a function \ijk{ctk) of H\jk, H\yk and xt such that Xijk = \i:jk(ak); typically a logistic link may relate a linear function of variables summarizing HLk, Hijli a n d xi to the probability of being observed at visit k. Let a. = (a'2,a'3,...,a'K)' and define irij(a) = P(Rij = rij\yi:j,Xi) = K
Y[[Xijk(ak)riik(I-Xijk(cxk))1~rijl'].
Since every subject is observed at k = 1,
fc=2
•ffij(at.) is the conditional probability of missingness over the entire observation period for subject j in cluster i given the vector y^ and the covariates Xi. We assume that within each cluster the indicator variables are conditionally independent given the whole set of observed responses within this cluster and cluster level covariates and that the conditional probability for missingness for one subject given the whole set of responses and covariates in cluster i does not depend on the responses of other subjects in the same cluster. These assumptions enable us to write Ji
Ji
P(Ri = ri\yit Xi) = Y[ P{Rij = rijfoi, x^ = J J P(Rij = r^y^,
Xi),
and the MAR assumption discussed earlier leads to the partial likelihood contribution from cluster i,
Hoc)=n n Nfc(«)rij" • (i - ^(cc))1-^], Ji
K
j=i
k=i
357
and the overall partial likelihood is L(a.) — Yli=1 Li(a). the partial score vector from cluster i is
The contribution to
k{a)
M « ) = t i>«* - ^)f°^
,
(2)
3 = 1 fc=2
and Uo(a) = X^=i Uoi(a), which may be solved by letting Uo(a) = 0 to give the partial maximum likelihood estimator d. 3 3.1
Inference P r o c e d u r e s Estimating Equations for (3 and
Our primary interest lies in estimating parameters associated with mean responses as well as correlation between responses. In this subsection we describe how to construct the estimating equations for both (3 and ')'. Let D{ = dfxyd/3 be the p x JiK derivative matrix of the mean vector /x{ with respect to f3, and Vi be the covariance matrix for the response Yi for cluster i, i = 1, 2,..., / . Note that Vi is comprised of sub-matrices Vijji of two types. The K x K matrix V^j is the covariance matrix for the repeated measurements for subject j in cluster i where its (k, k') entry is Vijk when k = k' and ^/WjkVijk' • Pkk' when k ^ k''. The K x K block matrix V^ji U 7^ J') contains the covariance terms between the responses from subject j and subject j ' in cluster i, and have as (k, k') entry y/VijkVij>k • 7^ when k = k', and y/VijkVij'k1 • ^kk' when k ^ k'. The generalized estimating equations for /3 are given by UX{B) = ] T Uu(0)=0,
(3)
where Uu(0) = DiV~l • (Yi - p.). To estimate association parameters 0, we consider pairwise products among responses within clusters. For subject j in cluster i define Z^ — {YijiYij2, YijiYij3, •••,Yij:K^1YijK)', a vector of K(K - l ) / 2 components consisting of the pairwise products over repeated measurements of subject j . For subjects j and f > j in cluster i define Z^^ = (YijiYij'i,..., YjjiYij'K , YijiYj'i, •••, YijxYij'K)', a vector of K2 components consisting of all pairwise products of responses between subjects j and j ' within cluster i. Let Z% be the vector consisting of all vectors Z^ and Z^jji) defined as Zi = (z'i(1),...,z i{Ji)> Z'i(i,2)>—>Zi{Ji-i,Ji))'> t m s 1S a vector of Qi components,
358
where Qi = KJi{KJ{ - l)/2. Let Ci be the expectation vector of Z» with components determined by the relation E(YijkYij'k')
= 4>jkj'k'y Hijk{1 - Pijk)Hij'k'0- - fJ-ij'k') + Hijk^ij'k',
and Ci = dC'i/dcj) be the q x Qi derivative matrix of the mean vector d with respect to (p. Lipsitz et al. 10 proposed a set of unbiased estimating equations to estimate association parameters. Since the variance covariance matrix of Zi involves third and fourth moments of the responses, they suggested a "working" covariance matrix Wi = diag((^(l - (u)), where C,u is the Ith. element of Ci, which gives / U2{9) = YJU2i{0) = 0,
(4)
i=l
where U2i(9) = dWf1 • (Zi - <4). The efficiency of estimation of the mean parameters (3 was discussed by Sutradhar and Das 17 and Sutradhar and Kumar 18 . Their findings suggested that independence working covariance matrix performs generally well when it is not possible to specify the true covariance matrix as is the case with the second order equations in (4). If data are incomplete, unbiased estimating equations (Godambe and Kale 6 ) of the form (3) and (4) may be constructed to accommodate a variable number of assessments per subject by making a suitable modification to the covariance matrix for (3). Because the estimating equations are unbiased the resulting estimators are consistent when the data are missing completely at random, but not otherwise. 3.2
Estimating Equations for (3 and (\> with Incomplete Data
To obtain unbiased estimating equations for incomplete data with observations missing at random, we modify the estimating equations by incorporating inverse probability weights. We begin with the case that the weights are specified by fixing a at a 0 . We do this to provide insight into the sources of variation in the estimators resulting from the estimating equations with estimated weights. Let Ai(a0) = diag(Ay(a 0 )) be the JiK x JiK block diagonal matrix, where the j t h diagonal matrix Aij(a0) = diag(I(Rijk = l)/iTij(oc0),k = 1,2, ...,K) is the K x K matrix with the diagonal elements being inverse probabilities Try ( a 0 ) _ 1 for observed data points, or zeroes if the corresponding observations are missing; /(•) is the indicator function. Recall
359
9 = (/3 ,(f>)' denotes the parameter of interest. The generalized estimating equations for (3 are then given by
U1(0,ao) = J2uu(9,ao)=0,
(5)
i=i
where Uu(6,a0) = DiV'1 • A ; ( a 0 ) • (Yt - M;)Let A*(j)(a0) = diag(I(Rijk = l,Rijk> = l)/7r i j (a 0 ), 1 < k < k' < K) be the (K(K - l)/2) x (K(K - l)/2) diagonal matrix corresponding to the elements in Z i ( j ) , and A ^ ^ a , , ) = diag(/(.Rjjfc = \,Riyk> = l)/( 7r ij( a o)7''ij'(Q;o)) Ifc,fc'= 1,2,..., if) be the K2 x X 2 diagonal matrix corresponding to the elements in Z^jjiy Consequently the Qi x Qi diagonal matrix corresponding to the elements in Z,- is written as
diag(A (ao)) A*(«)=f ^ A o ) [
0
° diag(A:UJI)(a0)))
"i '
and the inverse probability weighted estimating equations for 0 are given by U2(9,ao)
= ^2U2i(e,ao)=0,
(6)
»=i
where U2i(9, a0) = dW'1 • A*(a D ) • (Z{ - CJ. When a is unspecified it may be estimated by solving the sum of the score equations given in (2) and (5) and (6) can be modified by replacing a 0 with a . A Fisher-scoring algorithm may be used to obtain 9. Specifically, for (f) = cj>(s' at the sth iteration, recursively apply
0W = ^ - u + [MxOge- 1 ), «<•>, a ) ] - 1 . tfl(jg<«-D,), &), where Mi(9, a) = - £ f = 1 DiVr1 • A ; ( a ) • £><, until / 3 ( t ) converges to /3 ( s ) , say. Then given /3 = /3^ , recursively apply ^(t)
= 0 (t-i) + [ M 2 ( / 3 (.) ( 0 (t-i) ) d ) ] -i
. ^(^l^C-D.d) ,
where M 2 (0, a ) = - £ [ = i dWr1 • Af (a) • C7{, until it converges to 4>{s+1), say. These two steps may be cycled through until convergence is achieved for 9 at 9.
360
3.3
Robust Variance Estimation
Let Ui(0,a) = ([/^(0, a ) , [7.^(0, a ) ) ' and U(9,a) = I'1/2 E L i W , " ) When a is specified to be a0, under standard regularity conditions for estimating functions U(0,ao) is asymptotically normally distributed with mean zero and covariance E(Ui(0,ao)U-(8,ao)), and Il/2{6 — 0) is asymptotically normally distributed with mean zero and covariance matrix given by r j 1 £ ( E / i ( 0 , a o ) £ / ; ( 0 , a o ) ) [ r j 1 ] , > where T0 = E(dUi(0, ao)/d0'). When a is unspecified, the variation in the estimator a must be taken into account, and under the regularity conditions, U(0,a) and 7 1 / 2 (0 — 8) are asymptotically normal with mean zero and respective asymptotic variances E and r ^ E f r - 1 ] ' , where r = E [dUi(0,a)/d8'], E = E{Ui(0,a)U!(0,a)} E{dUi(0, a)/da') • {var(C/oi(a))}- 1 • {E(dUi(0, a)/da')}'. We can also write E = var{Resid(£/i, U0i)}, where Resid(A;, B;) = A i - E ^ ^ H - E ^ B - ) } " 1 ^ is the residual from the population least squares regression of A; on Bi. The proof is sketched in the Appendix, which appears in Robins et al. 1 5 . The matrix E is consistently estimated by E = / - ^ [ R e s i d ^ i , (70i)][Resid([/i, U0i)}', »=i i
i
where Resid(^, Uoi) = & - ( ^ t ^ G C ^ o * ) " 1 ^ , & = Ui(0,a), and t=i
t=i
Uoi — Uoi(a). And the matrix F is consistently estimated by r where M21{8,a) 4
=
T-i ( Mifra) 0 \ \M21(0,a)M2(0,a)),
= E - = i CiWr1 • A?(a) • (SC/0/3').
Application to a Smoking Prevention Study
The Waterloo Smoking Prevention Project (WSPP) consists of a series of longitudinal cluster-randomized school-based smoking prevention studies coordinated at the University of Waterloo (Cameron et al. 1). We report here on the results of some analyses of data from WSPP4, the fourth study in the series. In WSPP4 100 schools from 7 school boards in Ontario were randomized to dispense either the regular health education program provided by the school or a more intensive anti-smoking program delivered either by a specially trained teacher or a public health nurse. The program was first delivered to
children in grade 6 at the start of the study, and so-called "booster sessions"
were administered to these same students when they were in grades 7 and 8. The purpose of the booster sessions was to reinforce the material presented in grade 6. Attitudes towards smoking, as well as actual smoking behavior, were assessed annually by questionnaire. A response of interest is a binary indicator of whether the child was a smoker at each of the assessment periods; for the purpose of this analysis, we define a current smoker as a student who has indicated they are a regular or occasional smoker. If we wish to use data from grades 6, 7 and 8 from all students in participating schools, we have at most $K = 3$ visits for each of $J_i$ students in school $i$, $i = 1, 2, \ldots, I$. The cluster-randomized nature of the design suggests the need to deal with the cross-sectional within-school correlation, so we have both the longitudinal and cross-sectional correlation structure discussed in previous sections. For illustrative purposes we consider data from 2006 students in 45 schools from WSPP4. In all, 13.86% of the students have incomplete data: 4.74% have no observations in both grade 7 and grade 8, and 9.12% are missing an observation in one of these two grades. Let $Y_{ijk} = 1$ if student $j$ in school $i$ is a smoker at assessment $k$, and let $Y_{ijk} = 0$ otherwise. Given the relatively small number of visits, we consider exchangeable correlations by letting $\rho_{kk'} = \rho$ and $\delta_{kk'} = \delta$, with a common cross-sectional correlation parameter $\gamma_k = \gamma$. To avoid the need to introduce constraints on the parameters in the algorithm, we reparameterize from $\phi$ to $\tau$, where
$$\tau_\rho = \log\left(\frac{1+\rho}{1-\rho}\right), \quad \tau_\gamma = \log\left(\frac{1+\gamma}{1-\gamma}\right), \quad \tau_\delta = \log\left(\frac{1+\delta}{1-\delta}\right),$$
and $\tau = (\tau_\rho, \tau_\gamma, \tau_\delta)'$. The logistic regression model for the mean response is specified as
$$\mathrm{logit}\, \mu_{ijk} = \beta_0 + \beta_1 x_{ijk1} + \beta_2 x_{ijk2} + \beta_3 x_{ijk3} + \beta_4 x_{ijk4}, \qquad (7)$$
where $x_{ijk1} = 1$ if school $i$ was randomized to the treatment arm and zero otherwise, $x_{ijk2} = 1$ if student $j$ in school $i$ is male and zero otherwise, $x_{ijk3} = 1$ if $k = 2$ (i.e. the response is from a grade 7 assessment) and zero otherwise, and $x_{ijk4} = 1$ if $k = 3$ (i.e. the response is from a grade 8 assessment) and zero otherwise. The logistic regression models for the missing data process are specified in equations (8) and (9) that follow,
$$\mathrm{logit}\, \lambda_{ij2} = \alpha_{20} + \alpha_{21} y_{ij1}, \qquad (8)$$
and
$$\mathrm{logit}\, \lambda_{ij3} = \alpha_{30} + \alpha_{31} y_{ij1} + \alpha_{32} r_{ij2} + \alpha_{33} r_{ij2} y_{ij2} + \alpha_{34} r_{ij2}^{*} + \alpha_{35} r_{ij2}^{*} x_{ij2} + \alpha_{36} r_{ij2}^{*} x_{ij3}, \qquad (9)$$
where $x_{ij2} = 1$ if the school received a medium social models risk score in grade 7 and zero otherwise, $x_{ij3} = 1$ if the school received a high social models risk score in grade 7 and zero otherwise, and $r_{ij2}^{*} = 1$ if the social models risk score was available in grade 7 and zero otherwise.

Table 1 contains the estimates and standard errors of the regression coefficients for three logistic models for the responses. Model 1 is the full model in which $\rho$, $\gamma$, and $\delta$ are estimated; Models 2 and 3 impose the constraints $\delta = 0$ and $\gamma = \delta = 0$, respectively. For each model, a weighted set of generalized estimating equations of the form of (5) and (6) was solved with $\alpha_0$ replaced by $\hat{\alpha}$ estimated from the sum of the score equations from (2), and an unweighted set of equations was solved according to (3) and (4). Comparing the estimates arising from the weighted and unweighted estimating equations for Model 1 shows that for several of the regression parameters the weighting has little effect on the point estimates. The estimate of the treatment coefficient, however, is considerably smaller when weights are used and, while not statistically significant, is not inconsistent with a modest treatment benefit. As one would expect given the sampling variation in the estimated weights, the standard errors are considerably greater in the weighted analyses than in the unweighted analyses. The estimates of the correlation parameters are somewhat larger in the weighted analyses than in the unweighted analyses. The weighted and unweighted analyses of Models 2 and 3 yield broadly consistent relationships.
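In practice the weights are built from the fitted presence probabilities $\hat{\lambda}_{ij2}$ and $\hat{\lambda}_{ij3}$. The following is a hedged sketch of how models (8) and (9) might be fit with standard logistic regression software; the data-frame column names (`y1`, `y2`, `r2`, `r3`, `medium_risk`, `high_risk`, `risk_avail`) are hypothetical stand-ins for the WSPP4 variables, not names from the original analysis.

```python
import numpy as np
import statsmodels.api as sm

def fitted_presence_probs(df):
    """Fit (8) and (9) by logistic regression; return (lambda_2, lambda_3)."""
    # Model (8): presence of the grade 7 response given the grade 6 outcome.
    X2 = sm.add_constant(df[["y1"]])
    lam2 = sm.Logit(df["r2"], X2).fit(disp=0).predict(X2)

    # Model (9): presence of the grade 8 response given the history of
    # presence, observed outcomes, and the social models risk covariates.
    X3 = sm.add_constant(np.column_stack([
        df["y1"],
        df["r2"],
        df["r2"] * df["y2"].fillna(0),            # r_ij2 * y_ij2
        df["risk_avail"],                          # r*_ij2
        df["risk_avail"] * df["medium_risk"],      # r*_ij2 * x_ij2
        df["risk_avail"] * df["high_risk"],        # r*_ij2 * x_ij3
    ]))
    lam3 = sm.Logit(df["r3"], X3).fit(disp=0).predict(X3)
    return lam2, lam3

# Inverse probability weights are then formed from products of these
# fitted presence probabilities, e.g. an observed grade 8 response
# would contribute with weight r3 / (lam2 * lam3).
```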
5 Discussion
Second order estimating equations are used to provide estimates of association parameters simultaneously with estimates of regression coefficients for marginal means, and they have been advocated on the grounds of improved efficiency for the estimation of association parameters (Liang et al.$^{9}$). Interest may lie in precise estimation of the correlation coefficients, to facilitate sample size calculations for example, or simply to obtain an understanding of the nature of the various types of association. We formulate estimating equations for regression coefficients in logistic models for longitudinal binary data, and for the related association parameters, for the problem in which the longitudinal series occur in clusters. We modify these first and second order equations by the introduction of weights to deal with the possibility that the missing data are missing for reasons related to the past observed responses and the history of the missing data process (i.e. missing at random).
Table 1. Estimates from several weighted and unweighted analyses.

Model 1
                        Weighted          Unweighted
Covariate               est.     s.e.     est.     s.e.
Intercept (β0)         -2.553   0.308    -2.826   0.237
Treatment (β1)         -0.409   0.312     0.007   0.214
Gender (β2)            -0.121   0.205    -0.034   0.075
Grade 7 (β3)            0.489   0.103     0.540   0.085
Grade 8 (β4)            1.769   0.146     1.465   0.116
Longitudinal (τρ)       0.970   0.147     0.623   0.068
Cross-sectional (τγ)    0.244   0.070     0.075   0.035
Mixed (τδ)              0.232   0.067     0.050   0.034

Model 2 (δ = 0)
                        Weighted          Unweighted
Covariate               est.     s.e.     est.     s.e.
Intercept (β0)         -2.701   0.258    -2.819   0.253
Treatment (β1)         -0.158   0.265     0.008   0.236
Gender (β2)            -0.145   0.201    -0.025   0.076
Grade 7 (β3)            0.597   0.122     0.542   0.091
Grade 8 (β4)            1.728   0.185     1.458   0.121
Longitudinal (τρ)       0.885   0.115     0.612   0.059
Cross-sectional (τγ)    0.220   0.058     0.070   0.028

Model 3 (γ = δ = 0)
                        Weighted          Unweighted
Covariate               est.     s.e.     est.     s.e.
Intercept (β0)         -2.512   0.241    -2.781   0.249
Treatment (β1)         -0.182   0.232     0.029   0.232
Gender (β2)            -0.093   0.155    -0.013   0.075
Grade 7 (β3)            0.495   0.071     0.442   0.079
Grade 8 (β4)            1.686   0.134     1.421   0.109
Longitudinal (τρ)       0.725   0.073     0.581   0.048
As noted earlier, the estimating equations are unbiased and hence generate consistent estimates of the mean and association parameters. We have shown empirically via simulation studies that accurate estimates of the mean and association parameters are obtained in many practical situations. For example, when the clustered binary data are simulated using the multivariate
Plackett distribution (Molenberghs and Lesaffre$^{11}$) under a broad range of configurations, with the association characterized through odds ratios, the relative differences between the true values and the mean estimates were always less than 2.9% for the regression parameters and less than 4.7% for the association parameters. These simulation studies were based on the estimating equations described in Yi and Cook$^{19}$, which are computationally efficient and designed specifically for associations parameterized through odds ratios. We anticipate that similar findings would be obtained with associations parameterized through correlations. When applied to data from the Waterloo Smoking Prevention Project (WSPP4), the weighted analyses provide mildly suggestive evidence of a treatment effect which is not apparent in the unweighted analyses.

We describe these methods in the context of a study in which the subjects remain in the same cluster for all longitudinal assessments. This was quite reasonable since we examined responses from students when they were in grades 6, 7 and 8, grades chosen in part because the intervention was applied during this period. However, students in participating elementary schools were tracked during high school, for grades 9 to 12, to enable study of the long-term effects of the intervention. When students move from elementary schools to secondary schools it is natural to consider that they have moved from one cluster of individuals to another. One may, for example, wish to accommodate a residual long-term correlation during the secondary school grades for students sharing the same elementary school. In addition, it would seem natural to introduce another correlation parameter for cross-sectional correlations between responses from students within the same secondary school. The methods described here can easily handle this complication through a slightly more general correlation structure.

In some problems it may be of interest to examine how explanatory covariates modulate the cross-sectional or longitudinal associations; for such settings Liang et al.$^{9}$ described how to formulate such models. One may alternatively use odds ratios to model the association between the binary responses, as discussed by many authors (e.g. Fitzmaurice et al.$^{3}$), and introduce the dependence on the covariates through regression models. The weighted estimating equations described in Section 3 are conditional on the specification of parameters that index the missing data processes. In Section 4 we propose logistic regression models to characterize the presence of responses via their past history of presence and observed outcomes; unlike in the marginal models for the response, however, no cross-sectional correlation is addressed. One could nevertheless include cluster effects for the drop-out process by modeling within-cluster dependences with odds ratios and using the Plackett distribution (Molenberghs and Lesaffre$^{11}$).
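Because the Plackett distribution plays this role both in our simulation studies and as a possible model for within-cluster drop-out dependence, a minimal self-contained sketch of sampling a pair of binary responses with given marginal means and a specified Plackett (global) odds ratio may be useful. The function name and interface are ours; the expression for $p_{11}$ is the usual root of $\psi = p_{11}p_{00}/(p_{10}p_{01})$.

```python
import numpy as np

def plackett_binary_pair(p1, p2, psi, size, seed=None):
    """Draw pairs (Y1, Y2) with marginal means p1, p2 and odds ratio psi."""
    rng = np.random.default_rng(seed)
    if psi == 1.0:                          # independence
        p11 = p1 * p2
    else:
        # Solve (psi - 1) p11^2 - a p11 + psi p1 p2 = 0 for the valid root,
        # with a = 1 + (p1 + p2)(psi - 1).
        a = 1.0 + (p1 + p2) * (psi - 1.0)
        p11 = (a - np.sqrt(a * a - 4.0 * psi * (psi - 1.0) * p1 * p2)) \
              / (2.0 * (psi - 1.0))
    # Cell probabilities of the 2x2 table: (1,1), (1,0), (0,1), (0,0).
    probs = [p11, p1 - p11, p2 - p11, 1.0 - p1 - p2 + p11]
    cells = rng.choice(4, size=size, p=probs)
    y1 = (cells <= 1).astype(int)           # cells 0 and 1 have Y1 = 1
    y2 = ((cells == 0) | (cells == 2)).astype(int)
    return y1, y2
```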
The estimation of the parameters involved in the missing data processes can be carried out using the GEE2 approach (Liang et al.$^{9}$).

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes for Health Research. The authors would like to thank two anonymous referees whose comments greatly improved the form of this article. The authors thank Prof. K.S. Brown for providing the data from the Waterloo Smoking Prevention Project and Ms. Ker-Ai Lee for programming assistance. R.J. Cook is an Investigator for the Canadian Institutes for Health Research.

Appendix

Let $H_i(\theta, \alpha) = (U_i'(\theta, \alpha), U_{0i}'(\alpha))'$; then $E(H_i(\theta, \alpha)) = 0$. If $H_i(\theta, \alpha)$ further satisfies the regularity conditions stated in Appendix A of Robins et al.$^{15}$, then by Theorem 3.4 of Newey and McFadden$^{12}$, with probability approaching 1 there is a unique solution, say $(\hat{\theta}', \hat{\alpha}')'$, to $\sum_{i=1}^{I} H_i(\theta, \alpha) = 0$, and it satisfies
$$I^{1/2} \begin{pmatrix} \hat{\theta} - \theta \\ \hat{\alpha} - \alpha \end{pmatrix} = -\left\{ E\left[ \frac{\partial H_i(\theta, \alpha)}{\partial (\theta', \alpha')} \right] \right\}^{-1} \cdot I^{-1/2} \sum_{i=1}^{I} H_i(\theta, \alpha) + o_p(1).$$
Therefore, we obtain
$$I^{1/2}(\hat{\theta} - \theta) = -I^{-1/2} \{E(\partial U_i(\theta, \alpha)/\partial \theta')\}^{-1} \left\{ \sum_{i=1}^{I} U_i(\theta, \alpha) - E(\partial U_i(\theta, \alpha)/\partial \alpha') \cdot [E(\partial U_{0i}(\alpha)/\partial \alpha')]^{-1} \cdot \sum_{i=1}^{I} U_{0i}(\alpha) \right\} + o_p(1)$$
$$= -\Gamma^{-1} I^{-1/2} \sum_{i=1}^{I} \mathrm{Resid}(U_i, U_{0i}) + o_p(1),$$
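The second identity is the less familiar one; a short worked step may help (a sketch, assuming differentiation and integration may be interchanged, with $f(\cdot;\alpha)$ the density of the observed data and $U_{0i}(\alpha) = \partial \log f/\partial \alpha$ its score in $\alpha$):

```latex
\[
0 = \frac{\partial}{\partial \alpha'} E\{U_i(\theta,\alpha)\}
  = \int \frac{\partial U_i}{\partial \alpha'}\, f \, d\nu
    + \int U_i \,\frac{\partial \log f}{\partial \alpha'}\, f \, d\nu
  = E\!\left(\frac{\partial U_i}{\partial \alpha'}\right) + E(U_i U_{0i}'),
\]
% whence E(\partial U_i / \partial \alpha') = -E(U_i U_{0i}').
```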
References

1. R. Cameron, K.S. Brown, J.A. Best, C.L. Pelkman, C.L. Madill, S.R. Manske and M.E. Payne, Am. J. Public Health 89, 1827 (1999).
2. M. Crowder, Biometrika 82, 407 (1995).
3. G.M. Fitzmaurice and S.R. Lipsitz, Appl. Statist. 44, 51 (1995).
4. G.M. Fitzmaurice, G. Molenberghs and S.R. Lipsitz, J. R. Statist. Soc. B 57, 691 (1995).
5. G.M. Fitzmaurice, N.M. Laird and G.E.P. Zahner, J. Am. Statist. Ass. 91, 99 (1996).
6. V.P. Godambe and B.K. Kale, in Estimating Functions, ed. V.P. Godambe (Oxford University Press, London, 1991).
7. N.M. Laird, Statist. Med. 7, 305 (1988).
8. K.-Y. Liang and S.L. Zeger, Biometrika 73, 13 (1986).
9. K.-Y. Liang, S.L. Zeger and B. Qaqish, J. R. Statist. Soc. B 54, 3 (1992).
10. S.R. Lipsitz, N.M. Laird and D.P. Harrington, Biometrika 78, 153 (1991).
11. G. Molenberghs and E. Lesaffre, J. Am. Statist. Ass. 89, 633 (1994).
12. W.K. Newey and D. McFadden, in Handbook of Econometrics, Vol. 4, eds. R. Engle and D. McFadden (North-Holland, Amsterdam, 1993).
13. S.D. Oman and D.M. Zucker, Biometrika 88, 287 (2001).
14. R.L. Prentice, Biometrics 44, 1033 (1988).
15. J.M. Robins, A. Rotnitzky and L.P. Zhao, J. Am. Statist. Ass. 90, 106 (1995).
16. A. Rotnitzky, J.M. Robins and D.O. Scharfstein, J. Am. Statist. Ass. 93, 1321 (1998).
17. B.C. Sutradhar and K. Das, Biometrika 86, 459 (1999).
18. B.C. Sutradhar and P. Kumar, Statist. Prob. Lett. 55, 53 (2001).
19. G.Y. Yi and R.J. Cook, Marginal methods for incomplete longitudinal data arising in clusters (submitted for publication, 2002).