Preface
Survival analysis has become an increasingly active and important area of research. This is clearly evident from the large body of literature that has developed in the form of books, volumes and research papers. In addition, several conferences, workshops and short courses have been held all over the world during the past two decades. For this reason, we felt that it was indeed the right time to dedicate a volume in the Handbook of Statistics series to highlighting some recent advances in both theory and applications in the area of survival analysis. With this specific purpose in mind, we solicited articles from leading experts working in the area of survival analysis, both theoreticians and practitioners. This, in our opinion, has resulted in a volume with a nice blend of articles (40 in total) highlighting theoretical, methodological and applied issues in survival analysis. For the convenience of readers, we have divided this volume into 18 parts as follows:

I. General methodology
II. Censored data and inference
III. Truncated data and inference
IV. Hazard rate estimation
V. Comparison of survival curves
VI. Competing risks and analysis
VII. Proportional hazards model and analysis
VIII. Accelerated models and analysis
IX. Frailty models and applications
X. Models and applications
XI. Multivariate survival data analysis
XII. Recurrent event data analysis
XIII. Current status data analysis
XIV. Disease progression analysis
XV. Gene expressions and analysis
XVI. Quality of life analysis
XVII. Flowgraph models and applications
XVIII. Repair models and analysis
We hope that this broad coverage of the area of survival analysis will not only provide readers with a general overview of the area, but also describe to them the current state of each of the topics listed above. We express our sincere thanks to all the authors for their fine contributions and for helping us to bring out this volume in a timely manner. Our special thanks go to
Ms. Nicolette van Dijk, Ms. Edith Bomers and Ms. Andy Deelen of Elsevier, Amsterdam, for taking a keen interest in this project and also helping us with the final production of this volume.
N. Balakrishnan C.R. Rao
Table of contents
Preface v
Contributors xxi
PART I. GENERAL METHODOLOGY
Ch. 1. Evaluation of the Performance of Survival Analysis Models: Discrimination and Calibration Measures 1
R.B. D’Agostino and B.-H. Nam
1. Introduction 1
2. Discrimination index 2
3. Calibration measures in survival analysis 6
Appendix A 20
References 25

Ch. 2. Discretizing a Continuous Covariate in Survival Studies 27
J.P. Klein and J.-T. Wu
1. Introduction 27
2. Techniques based on the Cox model with a single covariate 28
3. Extensions of Contal and O’Quigley’s approach 37
4. Discussion 41
Acknowledgements 42
References 42

Ch. 3. On Comparison of Two Classification Methods with Survival Endpoints 43
Y. Lu, H. Jin and J. Mi
1. Introduction 43
2. Degree of separation index 44
3. Estimation and inference procedures 48
4. Distribution property of test statistics under the null hypothesis 52
5. Application examples 55
6. Discussion and conclusion 56
Acknowledgement 58
References 58

Ch. 4. Time-Varying Effects in Survival Analysis 61
T.H. Scheike
1. Time-varying effects in survival analysis 61
2. Estimation for proportional or additive models 67
3. Testing in proportional and additive hazards models 74
4. Survival with malignant melanoma 79
5. Discussion 83
Acknowledgement 83
References 83

Ch. 5. Kaplan–Meier Integrals 87
W. Stute
1. Introduction 87
2. The SLLN 91
3. The CLT 93
4. Bias 95
5. The jackknife 97
6. Censored correlation and regression 100
7. Conclusions 102
References 103
PART II. CENSORED DATA AND INFERENCE
Ch. 6. Statistical Analysis of Doubly Interval-Censored Failure Time Data 105
J. Sun
1. Introduction 105
2. Nonparametric estimation of a distribution function 107
3. Semiparametric regression analysis 114
4. Nonparametric comparison of survival functions 117
5. Discussion and future researches 119
References 121

Ch. 7. The Missing Censoring-Indicator Model of Random Censorship 123
S. Subramanian
1. Introduction 123
2. Overview of the estimators of a survival function 127
3. Semiparametric estimation in the MCI model 134
4. Conclusion 139
Acknowledgement 140
References 140

Ch. 8. Estimation of the Bivariate Survival Function with Generalized Bivariate Right Censored Data Structures 143
S. Keleş, M.J. van der Laan and J.M. Robins
1. Introduction 143
2. Modeling the censoring mechanism 144
3. Constructing an initial mapping from full data estimating functions to observed data estimating functions 147
4. Generalized Dabrowska’s estimator 151
5. Orthogonalized estimating function and corresponding estimator 153
6. Simulations 159
7. Discussion 162
Appendix A 162
References 172

Ch. 9. Estimation of Semi-Markov Models with Right-Censored Data 175
O. Pons
1. Introduction 175
2. Definition of the estimators 176
3. Asymptotic distribution of the estimators 181
4. Generalization to models with covariates 189
5. Discussion 193
References 194

PART III. TRUNCATED DATA AND INFERENCE

Ch. 10. Nonparametric Bivariate Estimation with Randomly Truncated Observations 195
Ü. Gürler
1. Introduction 195
2. Estimation of the bivariate distribution function 197
3. Estimation of bivariate hazard 202
4. Bivariate density estimation 205
References 206
PART IV. HAZARD RATE ESTIMATION
Ch. 11. Lower Bounds for Estimating a Hazard 209
C. Huber and B. MacGibbon
1. Introduction 209
2. Framework 210
3. Kullback information and Hellinger distances based on hazards 213
4. A general device to derive lower bounds for estimating a function 215
5. Lower bound for the rate of estimation of a hazard function with right censoring 218
6. Rate of convergence for the kernel estimator of the hazard function 222
Acknowledgement 224
Appendix A 224
References 225

Ch. 12. Non-Parametric Hazard Rate Estimation under Progressive Type-II Censoring 227
N. Balakrishnan and L. Bordes
1. Introduction 227
2. Smoothing cumulative hazard rate estimator 228
3. Asymptotics 230
4. Simulation study 238
Appendix A: Technical results for the mean 244
Appendix B: Technical results for the variance 246
References 248

PART V. COMPARISON OF SURVIVAL CURVES

Ch. 13. Statistical Tests of the Equality of Survival Curves: Reconsidering the Options 251
G.P. Suciu, S. Lemeshow and M. Moeschberger
1. Introduction 251
2. Underlying alternative hypothesis and assumptions 252
3. An overview of available tests 256
4. Hypothesis testing and statistical computer packages 259
5. Applications to papers from major medical journal 260
6. Suggested guidelines 260
7. Discussion and conclusions 261
References 261

Ch. 14. Testing Equality of Survival Functions with Bivariate Censored Data: A Review 263
P.V. Rao
1. Introduction 263
2. Testing H0 with uncensored paired data 265
3. Within-pair difference tests with censored data 269
4. Pooled sample tests with censored data 272
5. Testing H0 when there are missing data 273
6. Overview 274
References 275

Ch. 15. Statistical Methods for the Comparison of Crossing Survival Curves 277
C.T. Le
1. Introduction 277
2. The modified Kolmogorov–Smirnov test 281
3. A Levene-type test 282
4. Linear rank tests 287
References 289

PART VI. COMPETING RISKS AND ANALYSIS

Ch. 16. Inference for Competing Risks 291
J.P. Klein and R. Bajorunaite
1. Introduction 291
2. Basic quantities 292
3. Univariate estimation 293
4. Inference based on the crude hazard rates 298
5. Tests based on the cumulative incidence function 300
6. Regression techniques based on the cumulative hazard function 305
7. Discussion 309
Acknowledgement 310
References 310

Ch. 17. Analysis of Cause-Specific Events in Competing Risks Survival Data 313
J. Dignam, J. Bryant and H.S. Wieand
1. Introduction 313
2. Competing risks analysis based on cause-specific hazard functions 316
3. Competing risks analysis based on cumulative incidence functions 319
4. Examples: Competing risks analysis of events after breast cancer treatment 323
5. Summary 327
Acknowledgements 327
References 327

Ch. 18. Analysis of Progressively Censored Competing Risks Data 331
D. Kundu, N. Kannan and N. Balakrishnan
1. Introduction 331
2. Model: Description and notation 333
3. Estimation 334
4. Confidence intervals 339
5. Bayesian analysis 342
6. Simulation study 343
7. Numerical example 345
8. Some generalizations and extensions 346
9. Conclusions 347
References 348

Ch. 19. Marginal Analysis of Point Processes with Competing Risks 349
R.J. Cook, B. Chen and P. Major
1. Introduction 349
2. Rate functions for point processes 352
3. Point processes with terminal events 354
4. Application to a breast cancer trial 357
5. Discussion 360
Acknowledgements 360
References 360
PART VII. PROPORTIONAL HAZARDS MODEL AND ANALYSIS
Ch. 20. Categorical Auxiliary Data in the Discrete Time Proportional Hazards Model 363
P. Slasor and N. Laird
1. Introduction 363
2. The standard and joint discrete-time proportional hazards models 364
3. Specification of the survival model and censoring 367
4. Discretizing continuous auxiliary data 367
5. Joint models: Recurrent events predicting survival 374
6. Other scenarios for censoring and survival 377
7. Discussion 378
Acknowledgements 379
Appendix A 379
References 382

Ch. 21. Hosmer and Lemeshow type Goodness-of-Fit Statistics for the Cox Proportional Hazards Model 383
S. May and D.W. Hosmer
1. Introduction 383
2. The Hosmer and Lemeshow type test statistics 383
3. Necessity for time-dependent indicator variables 386
4. Examples 389
5. Summary 391
Appendix A 391
Appendix B 391
Appendix C 392
References 393

Ch. 22. The Effects of Misspecifying Cox’s Regression Model on Randomized Treatment Group Comparisons 395
A.G. DiRienzo and S.W. Lagakos
1. Introduction 395
2. Notation and statistics 396
3. Conditions for valid tests 397
4. Bias correction 399
5. Discussion 402
Acknowledgement 404
Appendix A: MATLAB code for computing statistical tests 404
References 408

Ch. 23. Statistical Modeling in Survival Analysis and Its Influence on the Duration Analysis 411
V. Bagdonavičius and M. Nikulin
1. Introduction 411
2. The Cox or the proportional hazards model 412
3. Accelerated failure time model 413
4. Generalized proportional hazards model 415
5. Regression models with cross-effects of survival functions 419
6. Changing shape and scale models 420
7. Models with time-dependent regression coefficients 421
8. Additive hazards model and its generalizations 422
9. Remarks on parametric and semi-parametric estimation 423
References 428

PART VIII. ACCELERATED MODELS AND ANALYSIS

Ch. 24. Accelerated Hazards Model: Method, Theory and Applications 431
Y.Q. Chen, N.P. Jewell and J. Yang
1. Introduction 431
2. Estimation 432
3. Asymptotic results 433
4. Efficiency consideration 434
5. Model adequacy 435
6. Extensions 437
7. Implementation and application 438
8. Some remarks 440
References 441

Ch. 25. Diagnostics for the Accelerated Life Time Model of Survival Data 443
D. Zelterman and H. Lin
1. Introduction 443
2. The likelihood and estimating equations 444
3. A Gibbs-like estimation procedure 448
4. Diagnostic measures 449
5. The bootstrap procedure 451
6. Numerical examples 453
Acknowledgement 459
References 459

Ch. 26. Cumulative Damage Approaches Leading to Inverse Gaussian Accelerated Test Models 461
A. Onar and W.J. Padgett
1. Inverse Gaussian as a lifetime or strength model 462
2. Inverse Gaussian accelerated test models 463
3. Estimation for the inverse Gaussian accelerated test models 469
4. Application of the inverse Gaussian accelerated test models to chloroprene exposure data 474
5. Conclusion 476
Acknowledgement 477
References 477

Ch. 27. On Estimating the Gamma Accelerated Failure-Time Models 479
K.M. Koti
1. The failure time Gamma model 479
2. The maximum likelihood equations 480
3. The problem 481
4. The hybrid approximation 482
5. The SAS/IML subroutine NLPTR 487
6. Pediatric cancer data 487
7. Leukemia data 488
8. Concluding remarks 489
Acknowledgements 490
Appendix A: Fisher information matrix 490
References 493
PART IX. FRAILTY MODELS AND APPLICATIONS
Ch. 28. Frailty Model and its Application to Seizure Data 495
N. Ebrahimi, X. Zhang, A. Berg and S. Shinnar
1. Introduction 495
2. Inference for the shared frailty model 498
3. The shared frailty model for recurrent events 504
4. Seizure data and its analysis 507
5. Concluding remarks 516
Acknowledgements 516
References 516
PART X. MODELS AND APPLICATIONS
Ch. 29. State Space Models for Survival Analysis 519
W.Y. Tan and W. Ke
1. Introduction 519
2. The state space models and the generalized Bayesian approach 519
3. Stochastic modeling of the birth–death–immigration–illness–cure processes 521
4. A state space model for the birth–death–immigration–illness–cure processes 524
5. The multi-level Gibbs sampling procedures for the birth–death–immigration–illness–cure processes 526
6. The survival probabilities of normal and sick people 528
7. Some illustrative examples 529
8. Conclusions 534
References 534

Ch. 30. First Hitting Time Models for Lifetime Data 537
M.-L.T. Lee and G.A. Whitmore
1. Introduction 537
2. The basic first hitting time model 537
3. Data for model estimation 538
4. A Wiener process with an inverse Gaussian first hitting time 538
5. A two-dimensional Wiener model for a marker and first hitting time 539
6. Longitudinal data 541
7. Additional first hitting time models 541
8. Other literature sources 542
Acknowledgements 543
References 543

Ch. 31. An Increasing Hazard Cure Model 545
Y. Peng and K.B.G. Dear
1. Introduction 545
2. The model 546
3. Simulation study 549
4. Illustration 552
5. Conclusions and discussion 555
Acknowledgements 556
References 556
PART XI. MULTIVARIATE SURVIVAL DATA ANALYSIS
Ch. 32. Marginal Analyses of Multistage Data 559
G.A. Satten and S. Datta
1. Introduction 559
2. “Explainable” dependent censoring in survival analysis 560
3. Multistage models: Stage occupation probabilities and marginal transition hazards 562
4. Estimation of marginal waiting time distributions 564
5. Regression models for waiting time distributions 566
Appendix A: Modeling the censoring hazard using Aalen’s linear hazards model 570
References 573

Ch. 33. The Matrix-Valued Counting Process Model with Proportional Hazards for Sequential Survival Data 575
K.L. Kesler and P.K. Sen
1. Introduction 575
2. Introduction to multivariate survival methods 577
3. The matrix valued counting process framework 579
4. Matrix valued counting process framework with repeated measures data 585
5. Estimation 592
6. Example 593
7. Discussion 595
Appendix A 596
References 600
Further reading 601
PART XII. RECURRENT EVENT DATA ANALYSIS
Ch. 34. Analysis of Recurrent Event Data 603
J. Cai and D.E. Schaubel
1. Introduction 603
2. Notation and basic functions of interest 603
3. Semiparametric models for recurrent event data 605
4. Nonparametric estimation of the recurrent event survival and distribution functions 618
5. Conclusion 621
Acknowledgement 621
References 621
PART XIII. CURRENT STATUS DATA ANALYSIS
Ch. 35. Current Status Data: Review, Recent Developments and Open Problems 625
N.P. Jewell and M. van der Laan
1. Introduction 625
2. Motivating examples 626
3. Simple current status data 627
4. Different sampling schemes 630
5. Complex outcome processes 634
6. Conclusion 640
References 641
PART XIV. DISEASE PROGRESSION ANALYSIS
Ch. 36. Appraisal of Models for the Study of Disease Progression in Psoriatic Arthritis 643
R. Aguirre-Hernández and V.T. Farewell
1. Introduction 643
2. Data 643
3. Markov models 644
4. Poisson and negative binomial models 654
5. Discussion 670
Appendix A: Formulas for the estimated transition probabilities 671
References 672

PART XV. GENE EXPRESSIONS AND ANALYSIS

Ch. 37. Survival Analysis with Gene Expression Arrays 675
D.K. Pauler, J. Hardin, J.R. Faulkner, M. LeBlanc and J.J. Crowley
1. Introduction 675
2. Methods 678
3. Results 681
4. Discussion 684
Appendix A 685
References 686
PART XVI. QUALITY OF LIFE ANALYSIS
Ch. 38. Joint Analysis of Longitudinal Quality of Life and Survival Processes 689
M. Mesbah, J.-F. Dupuy, N. Heutte and L. Awad
1. Introduction 689
2. Presentation of the clinical trial: QoL instruments and data 691
3. Preliminary analysis 693
4. Time to QoL deterioration 696
5. Semi-Markovian multi-state model 698
6. Joint distribution of QoL and survival–dropout processes 709
7. Discussion 720
References 726
PART XVII. FLOWGRAPH MODELS AND APPLICATIONS
Ch. 39. Modelling Survival Data using Flowgraph Models 729
A.V. Huzurbazar
1. Series flowgraph model: HIV blood transfusion data 731
2. Data analysis of HIV/AIDS data 732
3. Converting flowgraph MGFs to densities 734
4. Likelihood construction in flowgraph models 737
5. Parametric assumptions 737
6. Parallel flowgraph models 740
7. Loop flowgraph models 740
8. A systematic procedure for solving flowgraphs 742
9. Data analysis for diabetic retinopathy data 743
10. Summary 745
References 746
PART XVIII. REPAIR MODELS AND ANALYSIS
Ch. 40. Nonparametric Methods for Repair Models 747
M. Hollander and J. Sethuraman
1. Introduction 747
2. General repair models 748
3. Estimation in the DHS model 752
4. Estimation in the BBS model 755
5. A two-sample test in the BBS model 758
6. Goodness-of-fit tests in the BBS model 759
7. Testing the minimal repair assumption in the BBS model 761
Acknowledgement 763
References 763

Subject Index 765
Contents of Previous Volumes 773
Contributors
R. Aguirre-Hernández, Departamento de Probabilidad y Estadistica, IIMAS–UNAM, Mexico, e-mail:
[email protected] (Ch. 36). L. Awad, Biostatistics Department, Aventis Pharma, Antony, France, e-mail: lucile.
[email protected] (Ch. 38). V. Bagdonaviˇcius, UFR Sciences et Modélisation, Université Victor Segalen Bordeaux 2, 33076 Bordeaux, France, e-mail:
[email protected] (Ch. 23). R. Bajorunaite, Department of Mathematics, Statistics and Computer Science, Marquette University, Milwaukee, WI 53201-1881, USA, e-mail:
[email protected] (Ch. 16). N. Balakrishnan, Department of Mathematics and Statistics, McMaster University, Hamilton, Ontario L8S 4K1, Canada, e-mail:
[email protected] (Chs. 12, 18). A. Berg, Department of Biology, Northern Illinois University, DeKalb, IL 60115, USA, e-mail:
[email protected] (Ch. 28). L. Bordes, Département Génie Informatique, Division Mathématiques Appliquées, Université de Technologie de Compiègne, 60205 Compiègne cedex, France, e-mail:
[email protected] (Ch. 12). J. Bryant, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh and National Surgical Adjuvant Breast and Bowel Project Biostatistical Center, Pittsburgh, PA 15213, USA, e-mail:
[email protected] (Ch. 17). J. Cai, Department of Biostatistics, School of Public Health, University of North Carolina, Chapel Hill, NC 27599, USA, e-mail:
[email protected] (Ch. 34). B. Chen, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada, e-mail:
[email protected] (Ch. 19). Y.Q. Chen, Division of Biostatistics, School of Public Health, University of California, Berkeley, CA 94720, USA, e-mail:
[email protected] (Ch. 24). R.J. Cook, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada, e-mail:
[email protected] (Ch. 19). J.J. Crowley, Cancer Research and Biostatistics, Seattle, WA 98101, USA, e-mail:
[email protected] (Ch. 37). R.B. D’Agostino, Department of Mathematics and Statistics, Statistics and Consulting Unit, Boston University, Boston, MA 02215, USA, e-mail:
[email protected] (Ch. 1). S. Datta, Department of Statistics, University of Georgia, Athens, GA 30602, USA, e-mail:
[email protected] (Ch. 32).
K.B.G. Dear, National Centre for Epidemiology and Population Health and Centre for Mental Health Research, The Australian National University, Canberra, ACT 0200, Australia, e-mail:
[email protected] (Ch. 31). J. Dignam, Department of Health Studies, The University of Chicago, Chicago, IL 60637, USA, e-mail:
[email protected] (Ch. 17). A.G. DiRienzo, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA, e-mail:
[email protected] (Ch. 22). J.-F. Dupuy, Université de Bretagne-Sud, UFR SSI, 56017 Vannes cedex, France, e-mail:
[email protected] (Ch. 38). N. Ebrahimi, Division of Statistics, Northern Illinois University, DeKalb, IL 60115, USA, e-mail:
[email protected] (Ch. 28). V.T. Farewell, MRC Biostatistics Unit, University of Cambridge, Cambridge, UK, e-mail:
[email protected] (Ch. 36). J.R. Faulkner, Southwest Oncology Group, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA, e-mail:
[email protected] (Ch. 37). Ü. Gürler, Department of Industrial Engineering, Bilkent University, Ankara, Turkey, e-mail:
[email protected] (Ch. 10). J. Hardin, Department of Mathematics, Pomona College, Claremont, CA 91711, USA, e-mail:
[email protected] (Ch. 37). N. Heutte, Université de Paris V and Université de Caen, France, e-mail: n.heutte@lisieux.iutcaen.fr (Ch. 38). M. Hollander, Department of Statistics, Florida State University, Tallahassee, FL 32306, USA, e-mail:
[email protected] (Ch. 40). D.W. Hosmer, Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA 01003, USA, e-mail:
[email protected] (Ch. 21). C. Huber, UFR Biomédicale, Université René Descartes, 75270 Paris cedex 06, France, e-mail:
[email protected] (Ch. 11). A.V. Huzurbazar, Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM 87131, USA, e-mail:
[email protected] (Ch. 39). N.P. Jewell, Division of Biostatistics, School of Public Health, University of California, Berkeley, CA 94720, USA, e-mail:
[email protected] (Chs. 24, 35). H. Jin, Department of Radiology, University of California, San Francisco, CA 94143, USA, Department of Mathematics, South China Normal University, Guangdong, China, e-mail:
[email protected] (Ch. 3). N. Kannan, Department of Management Science and Statistics, The University of Texas at San Antonio, San Antonio, TX 78249, USA, e-mail:
[email protected] (Ch. 18). W. Ke, Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA, e-mail:
[email protected] (Ch. 29). S. Kele¸s, Division of Biostatistics, School of Public Health, University of California, Berkeley, CA 94720, USA, e-mail:
[email protected] (Ch. 8). K.L. Kesler, Division of Biostatistics, Rho, Inc., 100 Eastowne Drive, Chapel Hill, NC 27514, USA, e-mail:
[email protected] (Ch. 33). J.P. Klein, Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI 53226, USA, e-mail:
[email protected] (Chs. 2, 16).
K.M. Koti, 1401 Rockville Pike, Suite 200S, HFM-219, Food and Drug Administration, Rockville, MD 20852, USA, e-mail:
[email protected] (Ch. 27). D. Kundu, Department of Mathematics, Indian Institute of Technology, Kanpur, UP, India 208 016, e-mail:
[email protected] (Ch. 18). S.W. Lagakos, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA, e-mail:
[email protected] (Ch. 22). N. Laird, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA, e-mail:
[email protected] (Ch. 20). C.T. Le, Department of Biostatistics, MMC 303 Mayo, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA, e-mail:
[email protected] (Ch. 15). M. LeBlanc, Southwest Oncology Group, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA, e-mail:
[email protected] (Ch. 37). M.-L.T. Lee, Channing Laboratory, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA, e-mail: meiling@channing.harvard.edu (Ch. 30). S. Lemeshow, Division of Epidemiology and Biometrics, College of Medicine and Public Health, School of Public Health, The Ohio State University, Columbus, OH 43210, USA, e-mail:
[email protected] (Ch. 13). H. Lin, Division of Biostatistics, Yale University, New Haven, CT 06520, USA, e-mail:
[email protected] (Ch. 25). Y. Lu, Department of Radiology, University of California, San Francisco, CA 94143, USA, e-mail:
[email protected] (Ch. 3). B. MacGibbon, Département de Mathématiques, Université du Québec à Montréal, Succursale Centre-Ville, Montréal, Québec H3C 3P8, Canada, e-mail: brenda@math.uqam.ca (Ch. 11). P. Major, Hamilton Regional Cancer Centre and Department of Medicine, McMaster University, Hamilton, Ontario, Canada, e-mail:
[email protected] (Ch. 19). S.J. May, Division of Biostatistics, Department of Family & Preventive Medicine, University of California San Diego, La Jolla, CA 92093-0717, USA, e-mail: smay@ucsd.edu (Ch. 21). M. Mesbah, Université de Bretagne-Sud UFR SSI, F56017 Vannes cedex, France, e-mail:
[email protected] (Ch. 38). J. Mi, Department of Statistics, Florida International University, University Park, Miami, FL 33199, USA, e-mail:
[email protected] (Ch. 3). M. Moeschberger, Division of Epidemiology and Biometrics, College of Medicine and Public Health, School of Public Health, The Ohio State University, Columbus, OH 43210, USA, e-mail:
[email protected] (Ch. 13). B.-H. Nam, Department of Mathematics and Statistics, Statistics and Consulting Unit, Boston University, Boston, MA 02215, USA, e-mail: byungho@math.bu.edu (Ch. 1). M. Nikulin, Mathematical Statistics, University Victor Segalen Bordeaux 2, Bordeaux, France, Laboratory of Statistical Methods, Steklov Mathematical Institute, St. Petersburg, Russia, e-mail:
[email protected] (Ch. 23). A. Onar, Department of Management Science, University of Miami, Coral Gables, FL 33124, USA, e-mail:
[email protected] (Ch. 26).
W.J. Padgett, Department of Statistics, University of South Carolina, Columbia, SC 29208, USA, e-mail:
[email protected] (Ch. 26). D.K. Pauler, Southwest Oncology Group, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA, e-mail:
[email protected] (Ch. 37). Y. Peng, Department of Mathematics and Statistics, Memorial University of Newfoundland, St. John’s, Newfoundland A1C 5S7, Canada, e-mail:
[email protected] (Ch. 31). O. Pons, INRA Biométrie, 78352 Jouy-en-Josas cedex, France, e-mail: odile.pons@jouy.inra.fr (Ch. 9). P.V. Rao, Department of Statistics, University of Florida, Gainesville, FL 32611, USA, e-mail:
[email protected] (Ch. 14). J.M. Robins, Department of Epidemiology, Harvard School of Public Health, Boston, MA 02115, USA, e-mail:
[email protected] (Ch. 8). G.A. Satten, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA, e-mail:
[email protected] (Ch. 32). D.E. Schaubel, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA, e-mail:
[email protected] (Ch. 34). T.H. Scheike, Department of Biostatistics, University of Copenhagen, Blegdamsvej 3, DK-2200, KBH-N, Denmark, e-mail:
[email protected] (Ch. 4). P.K. Sen, Department of Biostatistics, School of Public Health, University of North Carolina, Chapel Hill, NC 27599, USA, e-mail:
[email protected] (Ch. 33). J. Sethuraman, Department of Statistics, Florida State University, Tallahassee, FL 32306, USA, e-mail:
[email protected] (Ch. 40). S. Shinnar, Departments of Neurology and Pediatrics, Montefiore Medical Center, Albert Einstein College of Medicine, Bronx, NY 10467, USA, e-mail:
[email protected] (Ch. 28). P. Slasor, Genzyme Corporation, Cambridge, MA 02142, USA, e-mail: peter.slasor@genzyme.com (Ch. 20). W. Stute, Mathematical Institute, Justus-Liebig University, Giessen, D-35392, Germany, e-mail:
[email protected] (Ch. 5). S. Subramanian, Department of Mathematics and Statistics, The University of Maine, Orono, Maine 04469, USA, e-mail:
[email protected] (Ch. 7). G.P. Suciu, Division of Epidemiology and Biometrics, College of Medicine and Public Health, School of Public Health, The Ohio State University, Columbus, OH 43210, USA, e-mail:
[email protected] (Ch. 13). J. Sun, Department of Statistics, University of Missouri, Columbia, MO 65211, USA, e-mail:
[email protected] (Ch. 6). W.Y. Tan, Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA, e-mail:
[email protected] (Ch. 29). M. van der Laan, Division of Biostatistics, School of Public Health, University of California, Berkeley, CA 94720, USA, e-mail:
[email protected] (Chs. 8, 35). G.A. Whitmore, Faculty of Management, McGill University, Montréal, Québec H3A 1G5, Canada, e-mail:
[email protected] (Ch. 30).
H.S. Wieand, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh and National Surgical Adjuvant Breast and Bowel Project Biostatistical Center, Pittsburgh, PA 15213, USA, e-mail:
[email protected] (Ch. 17). J.-T. Wu, TAP Pharmaceutical Products Inc., Lake Forest, IL 60045, USA, e-mail:
[email protected] (Ch. 2). J. Yang, Division of Biostatistics, School of Public Health, University of California, Berkeley, CA 94720, USA, e-mail:
[email protected] (Ch. 24). D. Zelterman, Division of Biostatistics, Yale University, New Haven, CT 06520, USA, e-mail:
[email protected] (Ch. 25). X. Zhang, Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI 53226, USA, e-mail:
[email protected] (Ch. 28).
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23001-7
Evaluation of the Performance of Survival Analysis Models: Discrimination and Calibration Measures
R.B. D’Agostino and Byung-Ho Nam
1. Introduction

Background: Performance measures in mathematical predictive models

Consider a vector of variables V = (V1, V2, ..., Vk), the independent variables in a regression or the risk factors in a model setting, and a variable W, the dependent or outcome variable, taking the value 1 for a positive outcome and 0 for a negative outcome. Here, 'positive outcome' indicates occurrence or presence of an event such as coronary heart disease. Prediction functions or Health Risk Appraisal functions (HRAF) are mathematical models that are functions of the data V and relate them to the probability of an event W. Symbolically, for a configuration V of the data,

$$f(V) = f(V_1, V_2, \ldots, V_k) = P(W = 1) = P,$$

where P(W = 1) is the probability of a positive outcome or an event. Our focus is to evaluate the performance of a HRAF with regard to its ability to predict the outcome variable. Here, we consider a time-to-event survival model with censored observations, such as the Cox regression model. Much of the following extends methods applied previously to logistic regression (Hanley and McNeil, 1982; Hosmer and Lemeshow, 1980). The Cox regression model is a survival analysis model that relates V to the development of an event over a period of time t, taking into consideration the time to event and censoring (for example, dropouts and loss to follow-up). Its mathematical expression is

$$P(T \geq t) = S_0(t \mid \bar{V})^{\exp(\beta'V - \beta'\bar{V})},$$

where $S_0(t \mid \bar{V})$ is the survival probability for those with the mean vector of values $\bar{V}$.

Measures of predictive accuracy

The accuracy of a model is the degree to which the predicted values coincide with the observed outcomes. When the outcome variable is dichotomous and predictions are stated as probabilities that an event will occur, models can be checked for two general
concepts: discrimination and calibration. Our focus here is to evaluate the performance of a model with regard to its discrimination and calibration.

Discrimination refers to a model's ability to correctly distinguish the two classes of outcomes. A model with good discrimination ability produces higher predicted probabilities for subjects who had events than for subjects who did not. Perfect discrimination would result in two non-overlapping sets of predicted probabilities from the model, one set for the positive outcomes and the other for the negative outcomes. The area under the Receiver Operating Characteristic (ROC) curve is one of the most widely used measures of model discrimination (Hanley and McNeil, 1982).

Calibration describes how closely the predicted probabilities agree numerically with the actual outcomes. A model is well calibrated when predicted and observed values agree for any reasonable grouping of the observations, ordered by increasing predicted values. A common form of calibration statistic is based on the Pearson chi-square statistic that compares the observed and expected outcomes within each of M groups defined by the rank ordering of the predicted probabilities (Hosmer and Lemeshow, 1980).

Although a model with good calibration will tend to have good discrimination and vice versa, a given model may be good on one measure but weak on the other. Harrell et al. (1996) recommend that good discrimination is always to be preferred to good calibration, since a model with good discrimination can always be recalibrated, but the rank ordering of the probabilities cannot be changed to improve discrimination.
2. Discrimination index

2.1. Discrimination index in logistic regression

The area under the Receiver Operating Characteristic (ROC) curve is one of the most widely used measures of model discrimination. The ROC curve is constructed from the logistic regression as follows. Suppose we have n subjects. From the logistic regression, we can compute for all n subjects their predicted probabilities (Y1, Y2, ..., Yn). Then, select some probability value Y* and state as the decision rule that all subjects with predicted probabilities equal to or above that value Y* will be classified as positive, and all others will be classified as negative. Hence, for each Y*, a two-by-two table such as the following can be generated (a subject is called + if Yi >= Y*):

                         True State
                          +      −
    Call subject +        a      c
    Call subject −        b      d

From this table, sensitivity = a/(a + b) and specificity = d/(c + d) can be calculated. If one selects all possible values of Y* as decision cutoff points, plots sensitivity on the Y axis against 1 − specificity on the X axis in a two-dimensional graph, and connects the plots by a line curve, then the resulting curve is called the Receiver Operating Characteristic (ROC) curve. See Figure 1.

Fig. 1. Receiver operating characteristic (ROC) curve.

The area under this curve is a measure of model discrimination. The interpretation of this area, also called the C statistic, is that it is the estimated probability that the predicted probability for a positive outcome is higher than that for a negative outcome. Thus,

$$C \text{ statistic} = C = \text{area under the ROC curve} = \hat{P}(Y_1 > Y_2),$$

where Y1 denotes the predicted probabilities for those who had events and Y2 the predicted probabilities for those without events. The value of C varies from 0.5, for no discrimination ability, to 1, for perfect discrimination, and is related only to the ranks of the predicted probabilities. Bamber (1975) recognized that the area under the ROC curve is an unbiased estimator of the probability of correctly ranking an (event, no-event) pair and that this probability is closely connected with the Mann–Whitney statistic. Hanley and McNeil (1982) further developed the relationship between the area under the ROC curve and the Mann–Whitney statistic and showed that the two are closely related, i.e.,

$$C \text{ statistic} = \frac{1}{N_1 \cdot N_2} W_{Y_1 Y_2}, \qquad (2.1)$$

where N1 = number of those who had events, N2 = number of those without events, and

$$W_{Y_1 Y_2} = [\text{number of pairs } (Y_1, Y_2) \text{ with } Y_1 > Y_2] + \tfrac{1}{2}[\text{number of pairs } (Y_1, Y_2) \text{ with } Y_1 = Y_2].$$

Lehmann (1951, 1975) showed that the Mann–Whitney statistic is asymptotically normally distributed.
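As a concrete illustration of relation (2.1), the short Python sketch below (our own illustrative code, not part of the original chapter; the predicted probabilities are simulated) computes the C statistic directly from the rank comparison between events and non-events.

```python
import numpy as np

def c_statistic(pred_event, pred_nonevent):
    # C statistic via the Mann-Whitney relation (2.1): a pair scores 1 when the
    # event has the higher predicted probability, and 1/2 when the two are tied.
    y1 = np.asarray(pred_event)[:, None]     # N1 predicted probabilities (events)
    y2 = np.asarray(pred_nonevent)[None, :]  # N2 predicted probabilities (non-events)
    w = (y1 > y2).sum() + 0.5 * (y1 == y2).sum()
    return w / (y1.size * y2.size)

# toy example with simulated predicted probabilities
rng = np.random.default_rng(0)
p_event = rng.beta(4, 2, size=50)        # events tend to receive higher predictions
p_nonevent = rng.beta(2, 4, size=150)
print(round(c_statistic(p_event, p_nonevent), 3))
```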
2.2. Extension of the C statistic to survival analysis

Suppose we have n individuals, among whom n1 developed events by time t (event), n2 did not develop events by time t (non-event) and n3 were censored by time t (censored), with n = n1 + n2 + n3. Define Ti = survival time for the ith individual, i = 1, 2, ..., n, and Yi = predicted probability of developing an event by time t for the ith individual, i = 1, 2, ..., n. Then, we have n pairs (T1, Y1), (T2, Y2), ..., (Tn, Yn). Define

$$\text{Overall } C = \frac{1}{Q} \sum_{i=1}^{n-1} \sum_{j=1}^{n} a_{ij} b_{ij},$$

where Q = the total number of comparisons made,

$$a_{ij} = \begin{cases} 1 & \text{if } T_i < T_j \text{ and at least one member of the pair } (T_i, T_j) \text{ is the time to an observed event}, \ i, j = 1, 2, \ldots, n, \\ 0 & \text{otherwise}, \end{cases}$$

$$b_{ij} = \begin{cases} 1 & \text{if } Y_i > Y_j \text{ and at least one member of the pair } (Y_i, Y_j) \text{ is the predicted probability of an event}, \ i, j = 1, 2, \ldots, n, \\ 0 & \text{otherwise}. \end{cases}$$

For a time-to-event survival model, we expand the notation as follows:
T1i = survival time for an event, i = 1, 2, ..., n1,
Y1i = predicted probability for an event, i = 1, 2, ..., n1,
T2j = survival time for a non-event, j = 1, 2, ..., n2,
Y2j = predicted probability for a non-event, j = 1, 2, ..., n2,
T3j = survival time for a censored observation, j = 1, 2, ..., n3,
Y3j = predicted probability for a censored observation, j = 1, 2, ..., n3.

From the above, we have three sets of comparisons: (1) event vs. non-event: comparing those who developed events against those who did not; (2) event vs. event: comparing those who developed events against those who also developed events; (3) event vs. censored: comparing those who developed events against those who were censored. Note that these three comparisons are independent of one another.

Now, we examine the first comparison and develop the first component of the overall C. This concerns the comparison of events vs. non-events. Define

$$C_1 = \frac{1}{Q_1} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} a_{ij} b_{ij}, \qquad (2.12)$$

where Q1 = the total number of comparisons made,

$$a_{ij} = \begin{cases} 1 & \text{if } T_{1i} < T_{2j}, \\ 0 & \text{otherwise}, \end{cases} \qquad b_{ij} = \begin{cases} 1 & \text{if } Y_{1i} > Y_{2j}, \\ 0 & \text{otherwise}. \end{cases}$$

Here, since all the survival times for those who did not develop events are longer than the maximum value of the event times for those who developed events, it is obvious that a_ij is always equal to 1. Hence,

$$C_1 = \frac{1}{Q_1} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} b_{ij}. \qquad (2.13)$$

The numerator in C1 is exactly the same as the Mann–Whitney statistic for continuous data when we compare the predicted probabilities for dichotomous outcomes, where Q1 = n1 · n2. Thus, C1 can be expressed as

$$C_1 = \frac{1}{n_1 \cdot n_2} W_{Y_1 Y_2}. \qquad (2.14)$$

Therefore, C1 is asymptotically normally distributed.

Next, we develop the second component of the overall C (event vs. event). One important note to make here is that one who developed an event earlier in time should have a higher predicted probability for an event. Define

$$C_2 = \frac{1}{Q_2} \sum_{i=1}^{n_1 - 1} \sum_{j=i+1}^{n_1} a_{ij} b_{ij}, \qquad (2.15)$$

where

$$a_{ij} = \begin{cases} 1 & \text{if } T_{1i} < T_{1j},\ i, j = 1, 2, \ldots, n_1,\ i < j, \\ 0 & \text{otherwise}, \end{cases} \qquad b_{ij} = \begin{cases} 1 & \text{if } Y_{1i} > Y_{1j},\ i, j = 1, 2, \ldots, n_1,\ i < j, \\ 0 & \text{otherwise}. \end{cases}$$

C2 is very closely related to the rank correlation coefficient τ̂ (Kendall, 1970). τ̂ has the total score in its numerator, whereas C2 has the total positive score, PS, in its numerator. Both τ̂ and C2 have the same denominator Q2, which is equal to (1/2) n1(n1 − 1). It can be shown that C2 has a linear relationship with τ̂:

$$C_2 = \frac{1}{2}(\hat{\tau} + 1). \qquad (2.16)$$

Kendall (1970) showed that τ̂ is asymptotically normally distributed; see the appendix for its mean and variance. Hence C2 is also asymptotically normal.

Now, we develop the third component of the overall C (event vs. censored). Define

$$C_3 = \frac{1}{Q_3} \sum_{i=1}^{n_1} \sum_{j=1}^{n_3} a_{ij} b_{ij}, \qquad (2.17)$$

where

$$a_{ij} = \begin{cases} 1 & \text{if } T_{1i} < T_{3j},\ i = 1, 2, \ldots, n_1,\ j = 1, 2, \ldots, n_3, \\ 0 & \text{otherwise}, \end{cases} \qquad b_{ij} = \begin{cases} 1 & \text{if } Y_{1i} > Y_{3j},\ i = 1, 2, \ldots, n_1,\ j = 1, 2, \ldots, n_3, \\ 0 & \text{otherwise}. \end{cases}$$

In this component, we only take pairs [(T1i, Y1i), (T3j, Y3j)] in which the censored time (T3j) is longer than the event time (T1i). Thus, the numerator of C3 can be expressed as $\sum_{i=1}^{n_1} \sum_{j=1}^{n_3} (b_{ij} \mid a_{ij} = 1)$. Also, Q3 is equal to $\sum_{i=1}^{n_1} \sum_{j=1}^{n_3} a_{ij}$. An important assumption that we make here is that the censoring occurs completely at random, so that C3 is independent of C1 and C2. The numerator of C3 is the same as the Mann–Whitney statistic, given that we only include the pairs in which the censored time is longer than the event time. Hence, C3 can be described as a conditional Mann–Whitney statistic. We express C3 as

$$C_3 = \frac{1}{Q_3} W^{c}_{Y_1 Y_3}. \qquad (2.18)$$

Since the Mann–Whitney statistic is asymptotically normally distributed, we can argue that C3 is also asymptotically normal; see the appendix for its mean and variance.

Now, we combine all three components into one overall discrimination index C. From (3.1), (3.2) and (3.3), we have

$$C = \frac{1}{Q_1 + Q_2 + Q_3}\bigl(W_{Y_1 Y_2} + PS + W^{c}_{Y_1 Y_3}\bigr)
= \frac{Q_1}{Q_1 + Q_2 + Q_3} \cdot \frac{1}{Q_1} W_{Y_1 Y_2}
+ \frac{Q_2}{Q_1 + Q_2 + Q_3} \cdot \frac{1}{2}(\hat{\tau} + 1)
+ \frac{Q_3}{Q_1 + Q_2 + Q_3} \cdot \frac{1}{Q_3} W^{c}_{Y_1 Y_3}
= a \cdot C_1 + b \cdot C_2 + (1 - a - b) \cdot C_3, \qquad (2.19)$$

so C is a linear combination of C1, C2 and C3, where

$$a = \frac{Q_1}{Q_1 + Q_2 + Q_3}, \qquad b = \frac{Q_2}{Q_1 + Q_2 + Q_3}.$$

Now, we can argue that the overall C tends to normality, since C1, C2 and C3 are asymptotically independent of one another and each of them is asymptotically normal. See Appendix A for the mean and variance of the overall C.
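The construction of the overall C can be sketched in a few lines of Python. The code below is our own minimal illustration under the chapter's setup (simulated data, a fixed horizon t, and the pairwise definitions of a_ij and b_ij given above); it is O(n^2) and intended only to make the definitions concrete, not to be an efficient or definitive implementation.

```python
import numpy as np

def overall_c(time, event, pred, t):
    # Overall discrimination index C = a*C1 + b*C2 + (1 - a - b)*C3 of (2.19).
    # time: observed time; event: 1 = event observed, 0 = censored;
    # pred: predicted probability of an event by the horizon t.
    time, event, pred = map(np.asarray, (time, event, pred))
    ev = (event == 1) & (time <= t)          # events by time t
    ne = time > t                            # non-events (followed beyond t)
    ce = (event == 0) & (time <= t)          # censored before t

    t1, y1 = time[ev], pred[ev]
    y2 = pred[ne]
    t3, y3 = time[ce], pred[ce]

    # C1: events vs. non-events, (2.13)-(2.14); a_ij = 1 automatically
    q1 = y1.size * y2.size
    c1 = (y1[:, None] > y2[None, :]).sum() / q1 if q1 else 0.0

    # C2: events vs. events, (2.15)-(2.16); concordant pairs out of Q2 = n1(n1-1)/2
    n1 = y1.size
    q2 = n1 * (n1 - 1) // 2
    ps = 0
    for i in range(n1):
        for j in range(i + 1, n1):
            earlier, later = (i, j) if t1[i] < t1[j] else (j, i)
            if t1[earlier] < t1[later] and y1[earlier] > y1[later]:
                ps += 1
    c2 = ps / q2 if q2 else 0.0

    # C3: events vs. censored, (2.17)-(2.18); only pairs with a censoring time
    # longer than the event time are compared
    usable = t1[:, None] < t3[None, :]
    q3 = usable.sum()
    c3 = ((y1[:, None] > y3[None, :]) & usable).sum() / q3 if q3 else 0.0

    qtot = q1 + q2 + q3
    return (q1 * c1 + q2 * c2 + q3 * c3) / qtot

# simulated example: higher predicted risk goes with shorter survival
rng = np.random.default_rng(1)
n = 300
risk = rng.uniform(0.05, 0.95, size=n)
T = rng.exponential(1.0 / risk)
Cn = rng.exponential(4.0, size=n)
time, event = np.minimum(T, Cn), (T <= Cn).astype(int)
print(round(overall_c(time, event, risk, t=2.0), 3))
```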
3. Calibration measures in survival analysis

Calibration measures are often statistics that partition a data set into groups and assess how the average predicted probability compares with the outcome prevalence in each group. Common forms of calibration statistics are based on the Pearson chi-square statistic, which summarizes the model fit by comparing observed and expected outcomes within M groups defined by the rank ordering of the predicted probabilities. Hosmer and Lemeshow's chi-square statistics are widely used for assessing goodness of fit of the logistic regression model. In this work, we present a chi-square type statistic for survival time data with censored observations. We use a Poisson log-linear model as an approximation to the survival time model in investigating the asymptotic behavior of the statistic, based on the theoretical framework of goodness-of-fit statistics for generalized linear models (Shillington, 1980). The asymptotic behavior of the statistic is examined by investigating the asymptotic null distribution of the statistic for the Poisson approximation.

3.1. A chi-square statistic for the survival time model with censored observations

The chi-square statistic that Hosmer and Lemeshow (1980, 1982) proposed for the logistic regression model has the following form:

$$\chi^2 = \sum_{j=1}^{M} \frac{[Y_j/n_j - \bar{p}_j]^2}{\bar{p}_j (1 - \bar{p}_j)}. \qquad (3.1)$$

Here, nj, Yj and p̄j are, respectively, the number of observations, the number of positive outcomes (events) and the average of the predicted probabilities from the logistic regression model in the jth cell, j = 1, 2, ..., M. The most popular version of the above ranks the model's predicted probabilities into deciles and then compares the proportion of positive outcomes in each decile with the average predicted probability in each decile; each decile contains 10 percent of the observations. For a survival time model with censored observations, we propose a statistic similar to the H–L chi-square statistic; it has the following form:

$$\chi^2 = \sum_{j=1}^{M} \frac{[KM_j - \bar{p}_j]^2}{\bar{p}_j (1 - \bar{p}_j)}, \qquad (3.2)$$

where KMj and p̄j denote the Kaplan–Meier estimate of failure and the average predicted probability of a positive outcome (event) from a survival time model, such as the Cox proportional hazards model, in the jth cell, j = 1, 2, ..., M.

To examine the asymptotic behavior of this statistic, we first examine the likelihood function for survival data with the general proportional hazards model under the assumption of random censoring, and then express the likelihood as a Poisson likelihood, with the log-linear model for the Poisson mean corresponding to the log-linear model for the hazard function.
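A small Python sketch of formula (3.2) may help fix ideas. It is our own illustration (not code from the chapter): subjects are grouped into deciles of predicted risk, a hand-rolled Kaplan–Meier estimator gives the observed failure probability by time t within each group, and the Hosmer–Lemeshow-type sum is accumulated; the appropriate reference distribution for the statistic is the subject of the remainder of this section.

```python
import numpy as np

def km_failure_prob(time, event, t):
    # product-limit (Kaplan-Meier) estimate of P(T <= t) from right-censored data
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, at_risk = 1.0, len(time)
    for ti, di in zip(time, event):
        if ti > t:
            break
        if di == 1:
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return 1.0 - surv

def calibration_chi2(time, event, pred, t, n_groups=10):
    # Hosmer-Lemeshow-type statistic (3.2): Kaplan-Meier failure probability
    # versus mean predicted probability within groups of predicted risk
    time, event, pred = map(np.asarray, (time, event, pred))
    groups = np.array_split(np.argsort(pred), n_groups)
    stat = 0.0
    for idx in groups:
        p_bar = pred[idx].mean()
        km_j = km_failure_prob(time[idx], event[idx], t)
        stat += (km_j - p_bar) ** 2 / (p_bar * (1.0 - p_bar))
    return stat
```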
3.2. Poisson approximation to survival time model

First, consider the definition of the hazard function,

$$h(t) = \lim_{\delta t \to 0} \frac{P(t \leq T < t + \delta t \mid T \geq t)}{\delta t} = \lim_{\delta t \to 0} \frac{F(t + \delta t) - F(t)}{\delta t\, S(t)} = \frac{f(t)}{S(t)}, \qquad (3.3)$$

$$H_0(t) = \int_0^t h_0(u)\, du, \qquad (3.4)$$

$$S(t) = \exp\{-H_0(t) \exp(\eta)\}. \qquad (3.5)$$

Generally, the hazard function depends both on time and on a set of covariates, some of which may depend on time. The proportional hazards model separates these components by specifying that the hazard at time t for an individual with covariate vector V is h(t; V) = h0(t)ϕ(V), where ϕ(V) is a function of the values of the vector of explanatory variables for an individual with covariate vector V. The function ϕ(V) can be interpreted as the hazard at time t for an individual whose vector of explanatory variables is V, relative to the hazard for an individual with V = 0. Since ϕ(V) is positive, it is convenient to express ϕ(V) as exp(η), where η = β'V is the linear predictor. If we have k explanatory variates, then β'V = β1v1 + β2v2 + ··· + βkvk.
This model implies that the ratio of the hazards for two individuals depends on the difference between their linear predictors at any time. If there are no time-dependent covariates, this hazard ratio is constant, independent of time. Various assumptions may be made about h0(t). Cox's model (1972) treats h0(t) as analogous to the block factor in a blocked experiment, defined only at points where events occur, hence making no assumption about the trend with time. h0(t) is permitted to have arbitrary values and is irrelevant in the sense that it does not enter into the estimation equations derived from the likelihood function.

In order to construct the likelihood function for survival data, an assumption about the nature of censoring must be made. Here, we assume completely random censoring, where the censoring time Ci for the ith individual is a random variable with survivor and density functions Gi(c) and gi(c), respectively, for i = 1, ..., n. Further, we assume that C1, ..., Cn are stochastically independent of one another and of the event times t1, ..., tn. Kalbfleisch and Prentice (1980, p. 40) constructed a likelihood function for survival data under the assumption of random censoring:

$$P\{T_i \in (t, t + dt),\ w_i = 1;\ V_i, \beta\} = P\{T_i \in (t, t + dt),\ C_i > t;\ V_i, \beta\} = G_i(t + 0)\, f(t;\ V_i, \beta)\, dt \qquad (3.6)$$

and

$$P\{T_i \in (t, t + dt),\ w_i = 0;\ V_i, \beta\} = g_i(t)\, S(t + 0;\ V_i, \beta)\, dt, \qquad (3.7)$$

where

$$S(t + 0;\ V_i, \beta) = \lim_{u \to t^+} S(u;\ V_i, \beta). \qquad (3.8)$$

If S is continuous, then

$$S(t + 0;\ V_i, \beta) = S(t;\ V_i, \beta). \qquad (3.9)$$

Kalbfleisch and Prentice also assume that the censoring is noninformative (i.e., that Gi(t) does not involve β). Therefore, the likelihood based on the data (ti, wi, Vi), i = 1, ..., n, is

$$L(\beta) \propto \prod_{i=1}^{n} f(t_i;\ \beta, V_i)^{w_i}\, S(t_i;\ \beta, V_i)^{1 - w_i} \qquad (3.10)$$

and the pairs (ti, wi), i = 1, ..., n, are independent (given V1, ..., Vn).

In a similar fashion, Aitkin and Clayton (1980) constructed a likelihood function for β with a proportional hazards model. Consider the following example. Suppose that a study is performed with n subjects. At the end of the study, an individual with an event at time t contributes a factor f(t) to the likelihood, while one who was censored at time t contributes S(t) to the likelihood. Define W as a variate taking the value 1 for those who had events and 0 for those who were censored (suppose r subjects of the n had events). Now, we develop a likelihood function for the survival data with a general proportional hazards model.
From (3.3), (3.4) and (3.5), we have

$$S(t) = \exp\{-H_0(t) \exp(\beta'V)\}, \qquad (3.11)$$

$$f(t) = h_0(t) \exp\{\beta'V - H_0(t) \exp(\beta'V)\}. \qquad (3.12)$$

Then the likelihood takes the form

$$L = \prod_{i=1}^{n} f(t_i)^{w_i}\, S(t_i)^{1 - w_i}. \qquad (3.13)$$

Hence,

$$\ln(L) = \sum_{i=1}^{n} \bigl[ w_i \ln f(t_i) + (1 - w_i) \ln S(t_i) \bigr]
= \sum_{i=1}^{n} \bigl[ w_i \{\ln h_0(t_i) + \eta_i - H_0(t_i)\exp(\eta_i)\} + (1 - w_i)\{-H_0(t_i)\exp(\eta_i)\} \bigr]
= \sum_{i=1}^{n} \Bigl[ w_i \{\ln H_0(t_i) + \eta_i\} - H_0(t_i)\exp(\eta_i) + w_i \ln \frac{h_0(t_i)}{H_0(t_i)} \Bigr]. \qquad (3.14)$$

Now, if we write µi = H0(ti) exp(ηi), then ln(L) becomes

$$\ln(L) = \sum_{i=1}^{n} \bigl[ w_i \ln(\mu_i) - \mu_i \bigr] + \sum_{i=1}^{n} w_i \ln \frac{h_0(t_i)}{H_0(t_i)}. \qquad (3.15)$$

Here, the first term is identical to the kernel of the likelihood function for n independent Poisson variates Wi with means µi, while the second term does not depend on the unknown β's. Therefore, if H0(t) is given, then we can obtain the estimates of the β's by treating the censoring indicator variate Wi as a Poisson variate with mean µi = H0(ti) exp(ηi). This Poisson model is a log-linear model which takes the form of a generalized linear model. The link function is the same as for log-linear models, except that there is a fixed intercept ln(H0(ti)) to be included in the linear predictor (ln(µi) = ln(H0(ti)) + ηi, where ηi = β'Vi).

Aitkin and Clayton (1980) described the use of GLIM (Baker and Nelder, 1978) to fit explicit distributions (exponential, Weibull, extreme-value) by expressing the likelihood in each case as a Poisson likelihood, with the log-linear model for the Poisson mean corresponding to the log-linear model for the hazard function. They described the estimation procedure for the β's as follows: a simple iterative maximization of the log-likelihood function is possible in GLIM. Given initial estimates of the unknown parameters in H0(t), the ML estimate of the β's is obtained for the Poisson model in GLIM with ln(H0(ti)) treated as a known quantity (also known as an offset) incorporated in the log-linear model. In the next section, we develop a chi-square type statistic for goodness of fit for the survival data.
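The same offset trick can be reproduced with any modern GLM software. The sketch below is our own illustration using statsmodels rather than GLIM, with simulated exponential data so that H0(t) = t is known; in practice H0(t) must be estimated or iterated over as Aitkin and Clayton describe.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, beta_true = 500, np.array([0.8, -0.5])
V = rng.normal(size=(n, 2))                        # covariates
T = rng.exponential(1.0 / np.exp(V @ beta_true))   # exponential event times, h0(t) = 1
C = rng.exponential(2.0, size=n)                   # random censoring times
time = np.minimum(T, C)
w = (T <= C).astype(int)                           # censoring indicator W_i of (3.15)

H0 = time                                          # cumulative baseline hazard H0(t) = t here
fit = sm.GLM(w, V, family=sm.families.Poisson(),
             offset=np.log(H0)).fit()              # log link with fixed offset ln H0(t_i)
print(fit.params)                                  # roughly recovers beta_true
```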
3.3. Chi-square statistic based on the Poisson approximation

Let Wi denote a Poisson random variable taking the value 1 or 0 (1 for those who had events, 0 for those who were censored) and let Vi = (Vi1, Vi2, ..., Vik) be the vector of independent variables, for i = 1, 2, ..., n. The expectation and variance of Wi are

$$E(W_i) = H_0(t_i) \exp(\beta'V_i), \qquad (3.16)$$

$$\operatorname{Var}(W_i) = H_0(t_i) \exp(\beta'V_i), \qquad (3.17)$$

where β = {β1, β2, ..., βk} is the 1 × k vector of regression parameters. We partition the space of V into M separate and distinct classes R1, R2, ..., RM, where M > k, and assign each observation to its appropriate class. We now consider the sum of the Poisson random variables Wi falling into each class, and define

$$Z_j(\beta) = \frac{\sum_{i \in I_j} W_i - E\bigl(\sum_{i \in I_j} W_i\bigr)}{\sqrt{\operatorname{Var}\bigl(\sum_{i \in I_j} W_i\bigr)}}, \qquad (3.18)$$

where Ij = {i: vi ∈ Rj}, for j = 1, 2, ..., M. Define

$$\chi^2(\beta) = \sum_{j=1}^{M} Z_j(\beta)^2 = \sum_{j=1}^{M} \frac{\bigl[\sum_{i \in I_j} W_i - E\bigl(\sum_{i \in I_j} W_i\bigr)\bigr]^2}{\operatorname{Var}\bigl(\sum_{i \in I_j} W_i\bigr)}. \qquad (3.19)$$

Since the {Wi}, i ∈ Ij, are independent Poisson random variables, and the sum of independent Poisson random variables is also a Poisson random variable with mean equal to the sum of the individual means, the mean is

$$E\Bigl(\sum_{i \in I_j} W_i\Bigr) = \sum_{i \in I_j} E(W_i) = \sum_{i \in I_j} H_0(t_i)\, e^{\beta'v_i} \qquad (3.20)$$

and the variance is

$$\operatorname{Var}\Bigl(\sum_{i \in I_j} W_i\Bigr) = \sum_{i \in I_j} \operatorname{Var}(W_i) = \sum_{i \in I_j} H_0(t_i)\, e^{\beta'v_i}. \qquad (3.21)$$

Therefore, from (3.17), (3.18) and (3.19),

$$\chi^2(\beta) = \sum_{j=1}^{M} \frac{\bigl[\sum_{i \in I_j} W_i - \sum_{i \in I_j} H_0(t_i)\, e^{\beta'v_i}\bigr]^2}{\sum_{i \in I_j} H_0(t_i)\, e^{\beta'v_i}}. \qquad (3.22)$$

If the value of β is known, then we can use the central limit theorem to show that Zj(β) is asymptotically normally distributed with mean zero and unit variance for j = 1, ..., M. Also, since the Zj(β)'s are independent, the χ²(β) statistic defined in (3.19) has an asymptotic χ²(M) distribution. However, in most cases the value of β is not known, and hence the unknown parameter β must be estimated. If we use an estimator
of β obtained from the grouped observations, such as the minimum modified chi-square estimator, then the estimated χ²(β) has an asymptotic χ²(M − k) distribution. In most cases in practice, however, χ²(β) is estimated by substituting the maximum likelihood estimator β̂, which is based on the ungrouped observations. The resulting χ²(β) statistic with the maximum likelihood estimator β̂ can be expressed, from (3.22), as

$$K^2 = \chi^2(\hat{\beta}) = \sum_{j=1}^{M} \frac{\bigl[\sum_{i \in I_j} W_i - \sum_{i \in I_j} H_0(t_i)\, e^{\hat{\beta}'v_i}\bigr]^2}{\sum_{i \in I_j} H_0(t_i)\, e^{\hat{\beta}'v_i}}. \qquad (3.23)$$

In the K² statistic with β̂, even though the data are grouped into M cells, the expected cell frequencies are obtained from the original (ungrouped) observations. Stepanians (1994) noted that Chernoff and Lehmann (1954) derived the null distribution of the Pearson chi-square statistic estimated with the maximum likelihood estimator based on the ungrouped observations, for testing goodness of fit to a family of distributions. Chernoff and Lehmann proved a theorem stating that the asymptotic distribution of χ²(β̂) is that of

$$\chi^2(\hat{\beta}) = Z(\hat{\beta})' Z(\hat{\beta}) \stackrel{d}{\longrightarrow} \sum_{j=1}^{M} \lambda_j y_j^2, \qquad (3.24)$$

where the yj's are independent and normally distributed with mean zero and unit variance, and the λj's are the eigenvalues of the asymptotic covariance matrix of Z(β̂). They showed that the eigenvalues of the covariance matrix consist of M − k − 1 ones, one zero, and k values between zero and one. Also, Moore and Spruill (1971, 1975) showed that Chernoff and Lehmann's theorem can be extended to cells with random boundaries. Analogous to Chernoff and Lehmann's results, Shillington (1980) proved a theorem for generalized linear models which states that

$$Z(\hat{\beta}) \stackrel{d}{\longrightarrow} N\Bigl(0,\ I_M - \frac{1}{N} B^0 I^{-1} B^{0\prime}\Bigr), \qquad (3.25)$$

where B⁰, described in (3.31), is B(β⁰) evaluated at the true value β⁰ of the parameter vector β, and

$$\chi^2(\hat{\beta}) = Z(\hat{\beta})' Z(\hat{\beta}) \stackrel{d}{\longrightarrow} \sum_{j=1}^{M} \lambda_j U_j, \qquad (3.26)$$
as M d λj χ 2 (1) = χ 2 (M − k) + χ 2 βˆ −→ j =1
M
λj χ 2 (1).
(3.27)
j =M−k+1
As Shillington stated that the more homogeneous the v’s within each cell, the closer the ˆ to χ 2 (M − k) because the λj ’s in (3.27) will tend approximate distribution of χ 2 (β) to zero, the approximate distribution of χ 2 ( βˆ ) would depend on the variability of the v’s within the M groups. For a more practical approach, Stepanians (1994) suggested that we calculate the theoretical mean and the variance of the null distribution of the statistic. Since the mean of χ 2 (1) is equal to one and the variance of χ 2 (1) is two, the following can be stated with respect to the mean and variance of the asymptotic null ˆ statistic: distribution of the χ 2 (β) M M 2 2 λj χ (1) = λj , E χ βˆ = E (3.28) j =1
j =1
M M 2 2 ˆ λj χ (1) = 2 λ2j . Var χ β = Var j =1
(3.29)
j =1
Hence, by calculating the eigenvalues of the covariance matrix we will be able to deˆ statistic. In order to rive the mean and the variance of the null distribution of the χ 2 (β) 0 calculate the eigenvalues, we must obtain I and B matrix in the asymptotic covariance matrix of Z( βˆ ). Iis the Fisher information matrix for the maximum likelihood estimaˆ evaluated at the true value of the parameter vector β. The j rth element of B( β ) tor β, is expressed as follows: ∂ ∂βr E( i∈Ij Wi ) . Bj r (β) = (3.30) N · Var( i∈Ij Wi ) Now, we evaluate I and B( β ) with respect to its true parameter value. First, consider the j rth element of the matrix from ∂ 0 ∂βr E( i∈Ij Wi ) Bj r β = . N · Var( i∈Ij Wi ) β=β 0
(3.31)
From (3.20),
∂ ∂ E Wi = H0 (ti )eβ vi = H0 (ti )xir eβ vi . ∂βr ∂βr i∈Ij
i∈Ij
Hence, from (3.21) and (3.32), we have β vi 0 i∈Ij H0 (ti )xir e . Bj r β = N · i∈Ij H0 (ti )eβ vi
(3.32)
i∈Ij
(3.33)
Evaluation of the performance of survival analysis models
13
Now, the Fisher information matrix I is the following: 2 ∂ l(L) Irs = − ∂βr βs 2 ∂ Wi ln H0 (ti )eβ vi − H0 (ti )eβ vi =− ∂βr βs i h0 (ti ) + Wi ln H0 (ti ) =−
=
N
i
∂2 ∂βr βs
Wi ln H0 (ti ) + Wi β vi − H0 (ti )eβvi i
H0 (ti )vir vis eβ vi .
(3.34)
i=1
Hence, the Fisher information matrix I can be expressed as follows: I= V Diag H0 (ti ) exp β vi V ,
(3.35)
where V is the N × k matrix of independent variables, Diag[H0 (ti ) exp(β vi )] is an N × N diagonal matrix with the stated elements on the main diagonal. Now, with I and B( β 0 ), we are able to obtain the covariance matrix of Z( β ), hence we are able to calculate the eigenvalues of the covariance matrix to derive the mean and the variance of the null distribution of the K 2 statistic. Next, we investigate the asymptotic null distribution of K 2 under various conditions. The investigation is conducted by generating extensive numerical examples from a variety of conditions and calculating the eigenvalues of the covariance matrices. For each generated example, the theoretical mean and the variance of the null distribution of the statistic K 2 are calculated from the eigenvalues. Then, those calculated means and the variances are used to derive the approximate null chi-square distribution for the statistic K 2 . While we consider only two possible outcomes, 1 for the event, 0 for the censored in generating numerical examples for investigating the asymptotic behavior of the K 2 statistic in Poisson log-linear model, it is important to notice that a Poisson random variable W can have integer value greater than one. Our assumption is that if we have very large probability for zero and reasonably small probability for one so that these two probabilities cover almost the entire probability space for W , then we will be able to obtain the asymptotic null distribution of the K 2 statistic. For validation of our assumption, we employed a truncated Poisson random variable that deliberately has only two possible outcomes, zero for censored observations and one for events. We also develop a similar chi-square statistic, Kt2 based on the truncated Poisson distribution. Now, we examine the distribution function of a truncated Poisson random variable. First, let Wi be an ordinary Poisson random variable with parameter ui = H0 (ti ) exp( β V i ). Then, the truncated Poisson distribution function is the following:
Let

$$P(W_i = k) = \begin{cases} C_i\,e^{-u_i}u_i^{k}/k! & \text{for } k = 0, 1,\\ 0 & \text{if } k \geq 2, \end{cases} \qquad (3.36)$$

where

$$C_i = \frac{1}{\sum_{j=0}^{1} e^{-u_i}u_i^{j}/j!}.$$

Here,

$$C_i = \frac{1}{\sum_{j=0}^{1} e^{-u_i}u_i^{j}/j!} = \frac{1}{e^{-u_i}u_i + e^{-u_i}} = \frac{1}{e^{-u_i}(u_i+1)}. \qquad (3.37)$$

Let $W_t$ be the truncated Poisson random variable having only two values, 0 and 1. Then, for $k = 0, 1$,

$$P(W_{ti} = k) = \frac{e^{-u_i}u_i^{k}}{e^{-u_i}(u_i+1)\cdot k!} = \frac{u_i^{k}}{(u_i+1)\cdot k!}. \qquad (3.38)$$

Hence, from (3.38), the probability mass function for $W_{ti}$ can be expressed as

$$f(w_{ti}, u_i) = \frac{u_i^{w_{ti}}}{(u_i+1)\cdot w_{ti}!} = \Big(\frac{u_i}{1+u_i}\Big)^{w_{ti}}\Big(1-\frac{u_i}{1+u_i}\Big)^{1-w_{ti}}. \qquad (3.39)$$

Let $\delta_i = u_i/(1+u_i)$; then

$$f(w_{ti}, u_i) = \delta_i^{w_{ti}}(1-\delta_i)^{1-w_{ti}} = (1-\delta_i)\exp\Big\{w_{ti}\log\frac{\delta_i}{1-\delta_i}\Big\}. \qquad (3.40)$$

Here,

$$\log\frac{\delta_i}{1-\delta_i} = \log\big(H_0(t_i)\exp(\beta' V_i)\big) = \log H_0(t_i) + \beta' V_i. \qquad (3.41)$$
Equations (3.39) and (3.41) indicate that the truncated Poisson model is a generalized linear model with the random component equal to the binary response $W_{ti}$ with parameter $\delta_i$, and with the logit link. Thus, we can apply the theoretical framework of the goodness-of-fit statistic for the generalized linear model. Now, we have

$$P(W_{ti} = 1) = \frac{u_i}{u_i+1} = \delta_i, \qquad (3.42)$$

$$P(W_{ti} = 0) = \frac{1}{u_i+1} = 1-\delta_i. \qquad (3.43)$$

Then, the expectation and variance of $W_{ti}$ are

$$E(W_{ti}) = P(W_{ti} = 1) = \frac{u_i}{u_i+1} = \delta_i, \qquad (3.44)$$

$$\mathrm{Var}(W_{ti}) = \frac{u_i}{u_i+1}\cdot\Big(1-\frac{u_i}{u_i+1}\Big) = \delta_i(1-\delta_i). \qquad (3.45)$$
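Since the authors' own computations were done in SAS, the following short check is only a sketch, in Python, of the reduction just derived: with support restricted to {0, 1}, the truncated Poisson variable is exactly a Bernoulli variable with parameter $\delta_i = u_i/(1+u_i)$, so its mean and variance agree with (3.44) and (3.45). The value of $u_i$ below is hypothetical.

```python
import numpy as np

def truncated_poisson_pmf(k, u):
    """P(W_t = k) for k in {0, 1}, per (3.38); note k! = 1 for k = 0, 1."""
    return u ** k / (1.0 + u)

u = 0.35                                   # hypothetical u_i = H0(t_i) * exp(beta' v_i)
delta = u / (1.0 + u)                      # Bernoulli parameter, (3.42)

p0, p1 = truncated_poisson_pmf(0, u), truncated_poisson_pmf(1, u)
mean = 0 * p0 + 1 * p1                                # E(W_ti), should equal delta        (3.44)
var = (0 - mean) ** 2 * p0 + (1 - mean) ** 2 * p1     # Var(W_ti), should equal delta(1-delta) (3.45)

assert np.isclose(p0 + p1, 1.0)
assert np.isclose(mean, delta)
assert np.isclose(var, delta * (1.0 - delta))
```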
Here, we develop a similar chi-square type statistic, $K_t^2$, based on the truncated Poisson model, in the same way as we developed the $K^2$ statistic for the Poisson log-linear model. Let $V_i = (V_{i1}, V_{i2}, \ldots, V_{ik})$ be the vector of independent variables, for $i = 1, 2, \ldots, n$. We partition the space of $V$ into $M$ separate and distinct classes $R_1, R_2, \ldots, R_M$, where $M > k$, and assign each observation to its appropriate class. Consider the sum of the truncated Poisson random variables falling into each class. Define

$$Z_j(\beta) = \frac{\sum_{i\in I_j}W_{ti} - E\big(\sum_{i\in I_j}W_{ti}\big)}{\sqrt{\mathrm{Var}\big(\sum_{i\in I_j}W_{ti}\big)}}, \qquad (3.46)$$

where $I_j = \{i:\ x_i \in R_j\}$, for $j = 1, 2, \ldots, M$. Define

$$\chi^2(\beta) = \sum_{j=1}^{M}Z_j^2(\beta) = \sum_{j=1}^{M}\frac{\big[\sum_{i\in I_j}W_{ti} - E\big(\sum_{i\in I_j}W_{ti}\big)\big]^2}{\mathrm{Var}\big(\sum_{i\in I_j}W_{ti}\big)}. \qquad (3.47)$$
Since the $W_{ti}$'s, for $i \in I_j$, are independent, the mean of their sum is the sum of the individual means,

$$E\Big(\sum_{i\in I_j}W_{ti}\Big) = \sum_{i\in I_j}E(W_{ti}) = \sum_{i\in I_j}\frac{u_i}{u_i+1} = \sum_{i\in I_j}\frac{H_0(t_i)\exp(\beta' V_i)}{1+H_0(t_i)\exp(\beta' V_i)}. \qquad (3.48)$$

Also,

$$\mathrm{Var}\Big(\sum_{i\in I_j}W_{ti}\Big) = \sum_{i\in I_j}\mathrm{Var}(W_{ti}) = \sum_{i\in I_j}\frac{u_i}{1+u_i}\cdot\frac{1}{1+u_i} = \sum_{i\in I_j}\frac{H_0(t_i)\exp(\beta' V_i)}{1+H_0(t_i)\exp(\beta' V_i)}\cdot\frac{1}{1+H_0(t_i)\exp(\beta' V_i)}. \qquad (3.49)$$
Therefore, from (3.47), (3.48) and (3.49),

$$\chi^2(\beta) = \sum_{j=1}^{M}\frac{\Big[\sum_{i\in I_j}W_{ti} - \sum_{i\in I_j}\frac{H_0(t_i)\exp(\beta' V_i)}{1+H_0(t_i)\exp(\beta' V_i)}\Big]^2}{\sum_{i\in I_j}\frac{H_0(t_i)\exp(\beta' V_i)}{1+H_0(t_i)\exp(\beta' V_i)}\cdot\frac{1}{1+H_0(t_i)\exp(\beta' V_i)}}. \qquad (3.50)$$

Now, define

$$K_t^2 = \chi^2(\hat\beta) = Z(\hat\beta)'Z(\hat\beta), \qquad (3.51)$$

where $\hat\beta$ is the maximum likelihood estimator of $\beta$.
From the previous discussion, and from (3.26), we know that

$$Z(\hat\beta)\ \xrightarrow{d}\ N\Big(0,\ \mathbf{I} - \frac{1}{N}\,B^0\,I^{-1}\,\big(B^0\big)'\Big),$$

where $B^0$, described in (3.31), is $B(\beta^0)$ evaluated at the true value of the parameter vector $\beta$, $\beta^0$, and

$$\chi^2(\hat\beta) = Z(\hat\beta)'Z(\hat\beta)\ \xrightarrow{d}\ \sum_{j=1}^{M}\lambda_j U_j,$$

where the $U_j$'s are independent $\chi^2(1)$'s and the $\lambda_j$'s are the eigenvalues of the asymptotic covariance matrix of $Z(\hat\beta)$. As discussed above, from (3.28) and (3.29),

$$E\big[\chi^2(\hat\beta)\big] = \sum_{j=1}^{M}\lambda_j, \qquad \mathrm{Var}\big[\chi^2(\hat\beta)\big] = 2\sum_{j=1}^{M}\lambda_j^2.$$
Hence, by calculating the eigenvalues of the covariance matrix we will be able to derive the mean and the variance of the null distribution of the $\chi^2(\hat\beta)$ statistic. In order to calculate the eigenvalues, we must obtain the $I$ and $B(\beta^0)$ matrices in the asymptotic covariance matrix of $Z(\hat\beta)$. $I$ is the Fisher information matrix for the maximum likelihood estimator $\hat\beta$, evaluated at the true value of the parameter vector $\beta$. The $jr$-th element of $B(\beta)$ is, as before,

$$B_{jr}(\beta) = \frac{\frac{\partial}{\partial\beta_r}E\big(\sum_{i\in I_j}W_{ti}\big)}{N\cdot \mathrm{Var}\big(\sum_{i\in I_j}W_{ti}\big)}.$$

Now, from (3.16) and (3.44),

$$E(W_{ti}) = \frac{u_i}{u_i+1} = \frac{H_0(t_i)\exp(\beta' V_i)}{1+H_0(t_i)\exp(\beta' V_i)},$$

$$\frac{\partial}{\partial\beta_r}E\Big(\sum_{i\in I_j}W_{ti}\Big) = \frac{\partial}{\partial\beta_r}\sum_{i\in I_j}\frac{H_0(t_i)\exp\big(\sum_{l=1}^{k}v_{il}\beta_l\big)}{1+H_0(t_i)\exp\big(\sum_{l=1}^{k}v_{il}\beta_l\big)}
= \sum_{i\in I_j}v_{ir}\,\frac{H_0(t_i)\exp\big(\sum_{l=1}^{k}v_{il}\beta_l\big)}{1+H_0(t_i)\exp\big(\sum_{l=1}^{k}v_{il}\beta_l\big)}\cdot\frac{1}{1+H_0(t_i)\exp\big(\sum_{l=1}^{k}v_{il}\beta_l\big)}
= \sum_{i\in I_j}v_{ir}\,\delta_i(1-\delta_i). \qquad (3.52)$$
Therefore, for the truncated Poisson model, the $jr$-th element of the matrix $B(\beta^0)$ is

$$B_{jr}\big(\beta^0\big) = \left.\frac{\sum_{i\in I_j}v_{ir}\,\delta_i(\beta)\big(1-\delta_i(\beta)\big)}{N\cdot\sum_{i\in I_j}\delta_i(\beta)\big(1-\delta_i(\beta)\big)}\right|_{\beta=\beta^0}. \qquad (3.53)$$

Now, the Fisher information matrix $I$ for the maximum likelihood estimator $\hat\beta$, evaluated at the true value of the parameter vector $\beta$, is

$$I_{rs} = -\frac{\partial^2 l(L)}{\partial\beta_r\,\partial\beta_s}
= -\frac{\partial^2}{\partial\beta_r\,\partial\beta_s}\sum_i\Big[\log(1-\delta_i) + w_{ti}\log\frac{\delta_i}{1-\delta_i}\Big]$$
$$= -\frac{\partial^2}{\partial\beta_r\,\partial\beta_s}\sum_i\Big[-\log\big(1+H_0(t_i)\exp(\beta' V_i)\big) + w_{ti}\big(\log H_0(t_i) + \beta' V_i\big)\Big]
= \sum_i v_{ir}v_{is}\,\delta_i(1-\delta_i). \qquad (3.54)$$

Hence, the Fisher information matrix $I$ can be expressed as follows:

$$I = V'\,\mathrm{Diag}\big[\delta_i(1-\delta_i)\big]\,V, \qquad (3.55)$$
where $V$ is the $n\times k$ matrix of independent variables and $\mathrm{Diag}[\delta_i(1-\delta_i)]$ is an $n\times n$ diagonal matrix with the stated elements on the main diagonal, for $i = 1, 2, \ldots, n$. Now, with $I$ and $B(\beta^0)$, we can obtain the covariance matrix of $Z(\hat\beta)$, and hence calculate its eigenvalues to derive the mean and the variance of the null distribution of the $K_t^2$ statistic.

3.4. The asymptotic null distribution of K² and Kt²

We investigate the asymptotic null distribution of $K^2$ and $K_t^2$ under various conditions. The investigation is conducted by generating extensive numerical examples from a variety of conditions and calculating the eigenvalues of the covariance matrices. For each generated example, the theoretical mean and variance of the null distributions of the statistics $K^2$ and $K_t^2$ are calculated from the eigenvalues. Those means and variances are then used to derive the approximate null chi-square distributions for $K^2$ and $K_t^2$. Numerical examples were generated to represent a wide range of situations through various choices of the number of groups ($M$), the number of parameters $\beta$ ($k$), the sample size ($N$), the baseline survival ($S_0(t)$) and the distribution of the $V$'s. In selecting the number of parameters we have the restriction that the number of parameters must be smaller than the number of groups, i.e., $k < M$.
Table 1
Summary of specifications for numerical configurations

Examples with all continuous variables

  Statistic   M    N          k        Choice of S0(t)       βj                                  Configurations
  K²          10   200, 400   2 to 9   0.6, 0.7, 0.8, 0.9    0.1, 0.2, 0.3, ..., 0.8, 0.9, 1.0   2 × 8 × 4 × 10 = 640
  Kt²         10   200, 400   2 to 9   0.6, 0.7, 0.8, 0.9    0.1, 0.2, 0.3, ..., 0.8, 0.9, 1.0   2 × 8 × 4 × 10 = 640
  Total                                                                                          1280

Examples with all ordinal variables

  Statistic   M    N     k        Choice of S0(t)             βj                        Configurations
  K²          10   400   4 to 9   0.6, 0.7, 0.8, 0.9, 0.95    0.2, 0.4, 0.6, 0.8, 1.0   1 × 6 × 5 × 5 = 150
  Kt²         10   400   4 to 9   0.6, 0.7, 0.8, 0.9, 0.95    0.2, 0.4, 0.6, 0.8, 1.0   1 × 6 × 5 × 5 = 150
  Total                                                                                 300

Examples with all binary variables

  Statistic   M    N     k        S0(t)                       β1, β2, β3, ...                         Configurations
  K²          10   400   5 to 9   0.6, 0.7, 0.8, 0.9, 0.95    0.2, 0.21, 0.22, 0.23, ...;             1 × 5 × 5 × 5 = 125
                                                              0.4, 0.41, ...; 0.6, 0.61, ...;
                                                              0.8, 0.81, ...; 1.0, 1.01, ...
  Kt²         10   400   5 to 9   0.6, 0.7, 0.8, 0.9, 0.95    (same five β sequences)                 1 × 5 × 5 × 5 = 125
  Total                                                                                               250
When we group the observations into M classes, the random variable of interest is no longer the individual observation. The grouped observations, namely the means or sums of the responses in the cells, are the variables of interest. Hence, we are dealing with M "observations", and the number of parameters should be smaller than the number of "observations". A detailed description of the specifications is given in Table 1.

3.4.1. Algorithms for generating numerical examples

SAS programs were used to generate the grid of numerical examples; the SAS procedure PROC IML was employed for the matrix calculations. First, we determine the values of M, k, N, S0(t) and the β's to represent a wide range of situations. Second, we generate the V's. For continuous distributions of the V's, k independent standard normal variables were generated using different random number generators to form a k × 1 vector; this was repeated N times to form an N × k matrix of V's. For discrete distributions of
V ’s, standard normal variables were generated and transformed into ordinal categorical variables. Then, given β, S0 (t) and V ’s, we calculated the expectation ui = E[Wi ] = H0 (t) exp β V i , where H0 (t) = − log(S0 (t)), and S0 (t) is the baseline survival function for time t. For the K 2 statistic, we calculate ξ( Xi ) = P (Wi = 1) = exp(−ui ) · ui . The observations are then sorted in ascending order by the value of ξ( Xi ) for K 2 for i = 1, . . . , N and the sorted observations are grouped into M classes such that each group has an equal number of observations. Here, we used M = 10, then the observations are divided into deciles such that each group contains N/10 observations. Hence, the first decile has the smallest mean value of ξ( V )’s and the 10th decile has the largest mean value of ξ( V )’s. Next, the matrix B( β 0 ) was constructed according to (3.33). Then, the Fisher information matrix I was constructed according to (3.35). With B( β 0 ) and I, the covariance matrix was obtained according to (3.25). Then, the eigenvalues of the covariance matrix were calculated and used for obtaining the theoretical mean and the variance of the null distribution of the statistic by the following equations: M M 2 2 ˆ E χ β =E λj χ (1) = λj , j =1
j =1
M M 2 2 ˆ λj χ (1) = 2 λ2j . Var χ β = Var j =1
j =1
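The recipe above is easy to mirror outside SAS/IML. The sketch below, in Python with NumPy, is not the authors' code: the function name and the random configuration are ours, and the exact form used to assemble the covariance matrix of $Z(\hat\beta)$ from $B(\beta^0)$ and $I$ — which the text obtains from (3.25)–(3.26) — is assumed here to be the usual grouped-GLM projection form, written in terms of a derivative matrix standardized by the square root of the cell variances.

```python
import numpy as np

def null_moments_K2(V, S0, beta, M=10):
    """Sketch of the Section 3.4.1 recipe for K^2: rank subjects by predicted
    event probability, cut into M equal groups, build the group-derivative and
    Fisher-information matrices, and return the theoretical null mean and
    variance (3.28)-(3.29) from the eigenvalues of the covariance of Z(beta-hat)."""
    N, k = V.shape
    H0 = -np.log(S0)                        # H0(t) = -log S0(t), a scalar horizon here
    u = H0 * np.exp(V @ beta)               # u_i = E[W_i] under the Poisson model
    xi = np.exp(-u) * u                     # xi(V_i) = P(W_i = 1), used for ranking
    groups = np.array_split(np.argsort(xi), M)

    # Derivative of each group's expected count, standardized by the square root of
    # the group variance (sum of u_i); this plays the role of B(beta^0) in (3.33),
    # up to the scaling convention used in the text.
    Wmat = np.vstack([(u[idx, None] * V[idx]).sum(axis=0) / np.sqrt(u[idx].sum())
                      for idx in groups])
    I = V.T @ (u[:, None] * V)              # Fisher information, (3.35)

    # Assumed covariance of Z(beta-hat): identity minus the projection induced by
    # estimating beta (the role played by (3.25)-(3.26) in the text).
    Sigma = np.eye(M) - Wmat @ np.linalg.solve(I, Wmat.T)
    lam = np.linalg.eigvalsh(Sigma)
    return lam.sum(), 2.0 * (lam ** 2).sum()      # (3.28) and (3.29)

rng = np.random.default_rng(1)
V = rng.standard_normal((400, 4))           # one configuration from Table 1: N = 400, k = 4
mean, var = null_moments_K2(V, S0=0.8, beta=np.full(4, 0.5))
print(mean, var)                            # compare with the chi-square(M-1) approximation of Table 2
```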
From those eigenvalues, we were able to investigate the mean and the variance of the asymptotic null distribution of the $K^2$ statistic. The same procedure was applied for $K_t^2$: we calculated $\delta(V_i) = P(W_{ti} = 1) = u_i/(1+u_i)$ for the $K_t^2$ statistic, sorted the observations in ascending order by the value of $\delta(V_i)$, $i = 1, \ldots, N$, and the remaining steps for obtaining $B(\beta^0)$ and $I$ for $K_t^2$ are the same as those for $K^2$.

3.4.2. Results of generated examples

A summary of the results from the generated examples is given in Table 2. For all-continuous and all-ordinal variables, $\chi^2(M-1)$ is the appropriate approximate null distribution for both $K^2$ and $K_t^2$. For all-binary variables, we can conclude that the null distribution is also close to $\chi^2(M-1)$. In conclusion, for $M = 10$, the appropriate approximate null distribution of these two statistics is $\chi^2(9)$.

3.5. Comparison of the proposed statistic (3.2) and Kt²

From (3.50), $K_t^2$ can be expressed as

$$K_t^2 = \sum_{j=1}^{M}\frac{n_j\,(W_j/n_j - \bar\mu_j)^2}{\bar\mu_j(1-\bar\mu_j)},$$
Table 2
The theoretical means for the approximate null χ² distribution derived by calculation of eigenvalues (M = 10)

  Variables     K²                                   Kt²
  Continuous    M − 1 (100%), n = 640                M − 1 (100%), n = 640
  Ordinal       M − 1 (100%), n = 150                M − 1 (100%), n = 150
  Binary        M − 1 (84%), M − 2 (16%), n = 125    M − 1 (100%), n = 125
where $n_j$, $W_j$ and $\bar\mu_j$ are, respectively, the number of observations, the number of positive outcomes (events) and the average value of $u_i/(u_i+1)$ in the $j$th cell, $j = 1, 2, \ldots, M$, $i \in I_j$. Since $K_t^2$ showed consistent asymptotic behavior in the generated examples (as shown in Table 2), and the two statistics are very similar, comparing them on real data (the Framingham Heart Study) gives a good indication of how the proposed chi-square statistic behaves asymptotically. We examined the calibration measures of one of the Framingham prediction functions (predicting the risk of coronary heart disease at 4, 5, 8 and 10 years) using these two statistics. From the results shown in Table 3, we expect the asymptotic behavior of the proposed statistic to be very close to that of $K_t^2$.
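For readers who want to reproduce the decile-based comparison, a minimal sketch follows (Python with NumPy and SciPy, rather than the SAS used by the authors). It evaluates the grouped statistic in the form displayed above from per-decile summaries; the function name, the illustrative numbers (loosely rounded from the first panel of Table 3) and the assumed common decile size are ours, not values reported by the authors.

```python
import numpy as np
from scipy.stats import chi2

def decile_calibration_stat(events, n, mu_bar):
    """Grouped calibration statistic: sum over cells of
    n_j * (W_j/n_j - mu_bar_j)^2 / [mu_bar_j * (1 - mu_bar_j)],
    referred to a chi-square distribution on M - 1 degrees of freedom."""
    events, n, mu_bar = map(np.asarray, (events, n, mu_bar))
    obs = events / n
    stat = np.sum(n * (obs - mu_bar) ** 2 / (mu_bar * (1.0 - mu_bar)))
    return stat, chi2.sf(stat, df=len(n) - 1)

# Illustrative per-decile inputs (observed event proportions and average predicted
# risk); a common decile size of 240 is an assumption, not a reported value.
obs_prop = np.array([0.000, 0.004, 0.020, 0.037, 0.049, 0.078, 0.070, 0.131, 0.172, 0.238])
mu_bar   = np.array([0.010, 0.018, 0.027, 0.038, 0.050, 0.065, 0.083, 0.110, 0.152, 0.249])
n = np.full(10, 240)
stat, p = decile_calibration_stat(obs_prop * n, n, mu_bar)
```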
Appendix A

(1) Mean and variance of the C statistic, the area under the ROC curve in logistic regression:

$$E(C) = P(Y_1 > Y_2) = P_1,$$
$$\mathrm{Var}(C) = \frac{1}{n_1 n_2}\Big[P_1(1-P_1) + (n_2-1)\big(P_{12}-P_1^2\big) + (n_1-1)\big(P_{13}-P_1^2\big)\Big],$$

where

$$P_{12} = P\big(Y_2 < Y_1 \text{ and } Y_2' < Y_1\big) \text{ for a common } Y_1, \qquad
P_{13} = P\big(Y_2 < Y_1 \text{ and } Y_2 < Y_1'\big) \text{ for a common } Y_2.$$

(2) Mean and variance of $\hat\tau$:

$$E(\hat\tau) = \tau, \qquad
\mathrm{Var}(\hat\tau) = \frac{4(n_1-2)}{n_1(n_1-1)}\,\mathrm{Var}(\tau_i) + \frac{2}{n_1(n_1-2)}\big(1-\tau^2\big),$$
Table 3 Calibration measures by decile (M = 10) Men, 10 years Decile 1 2 3 4 5 6 7 8 9 10
Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00410 0.02049 0.03689 0.04918 0.07787 0.06967 0.13115 0.17213 0.23770
0.0000 0.0044 0.0211 0.0380 0.0508 0.0838 0.0735 0.1430 0.1847 0.2594
0.00955 0.01834 0.02768 0.03854 0.05085 0.06667 0.08646 0.11640 0.16391 0.28466
0.00950 0.01818 0.02730 0.03782 0.04960 0.06454 0.08291 0.11008 0.15173 0.24922
2.34375 2.63477 0.39255 0.00195 0.00002 1.14996 0.51936 1.67822 0.76949 0.76467
2.33142 2.70966 0.42590 0.00581 0.00092 0.71854 0.56273 1.10532 0.78927 0.17292
Total
10.2547
8.82249
Men, 8 years Decile 1 2 3 4 5 6 7 8 9 10
Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00000 0.01639 0.02869 0.04098 0.05328 0.04508 0.10246 0.12705 0.18443
0.0000 0.0000 0.0167 0.0290 0.0418 0.0555 0.0463 0.1091 0.1328 0.1949
0.00679 0.01305 0.01972 0.02751 0.03636 0.04779 0.06216 0.08408 0.11933 0.21279
0.00676 0.01297 0.01953 0.02713 0.03571 0.04667 0.06030 0.08071 0.11268 0.19171
1.66102 3.22704 0.11530 0.02030 0.20580 0.31903 1.05309 1.98285 0.42097 0.46611
1.65482 3.20554 0.12525 0.02234 0.19667 0.23922 0.99685 1.55506 0.50377 0.08358
9.47151
8.58310
Total Men, 5 years Decile 1 2 3 4 5 6 7 8 9 10 Total
Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00000 0.00410 0.01639 0.03279 0.02869 0.02459 0.05328 0.08197 0.13115
0.0000 0.0000 0.0042 0.0165 0.0330 0.0292 0.0249 0.0549 0.0835 0.1349
0.00402 0.00774 0.01171 0.01636 0.02167 0.02855 0.03725 0.05063 0.07244 0.13280
0.00401 0.00771 0.01165 0.01623 0.02144 0.02815 0.03657 0.04938 0.06991 0.12396
0.98129 1.90381 1.19004 0.00028 1.47642 0.00371 1.03835 0.09235 0.44438 0.00938
0.97912 1.89630 1.20746 0.00040 1.49726 0.00260 0.99433 0.07882 0.54578 0.11596
7.14001
7.31802
Table 3 (Continued) Men, 4 years Decile 1 2 3 4 5 6 7 8 9 10
Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00000 0.00000 0.01230 0.01639 0.02049 0.02049 0.03689 0.07787 0.11475
0.0000 0.0000 0.0000 0.0123 0.0165 0.0207 0.0208 0.0376 0.0792 0.1174
0.00318 0.00612 0.00927 0.01296 0.01717 0.02264 0.02956 0.04024 0.05771 0.10672
0.00317 0.00611 0.00923 0.01287 0.01702 0.02238 0.02913 0.03945 0.05609 0.10087
0.77532 1.50358 2.28318 0.00824 0.00649 0.04132 0.65318 0.04405 2.07231 0.29198
0.77397 1.49889 2.27244 0.00642 0.00579 0.03982 0.64407 0.04222 2.18665 0.51855
7.67966
7.98882
Total Women, 10 years Decile 1 2 3 4 5 6 7 8 9 10
Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00356 0.00356 0.00356 0.01068 0.00355 0.02491 0.03559 0.06406 0.12811
0.0000 0.0036 0.0039 0.0036 0.0108 0.0037 0.0259 0.0383 0.0686 0.1379
0.00103 0.00238 0.00414 0.00684 0.01101 0.01643 0.02387 0.03730 0.05986 0.13215
0.00103 0.00237 0.00413 0.00681 0.01094 0.01629 0.02359 0.03661 0.05811 0.12261
0.29078 0.17763 0.00388 0.43324 0.00109 2.82776 0.04970 0.00786 0.38101 0.08100
0.29061 0.16688 0.02228 0.43947 0.00187 2.85932 0.02142 0.00826 0.18149 0.07905
4.25395
4.07065
Total Women, 8 years Decile 1 2 3 4 5 6 7 8 9 10 Total
Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00356 0.00000 0.00356 0.01068 0.00355 0.02491 0.02135 0.05338 0.10676
0.0000 0.0036 0.0000 0.0036 0.0108 0.0037 0.0259 0.0223 0.0561 0.1125
0.00083 0.00190 0.00331 0.00547 0.00880 0.01315 0.01912 0.02992 0.04813 0.10752
0.00083 0.00190 0.00330 0.00545 0.00876 0.01306 0.01894 0.02947 0.04699 0.10099
0.23232 0.42906 0.93267 0.17986 0.12837 1.94044 0.68894 0.56177 0.38927 0.07270
0.23221 0.40981 0.93109 0.18544 0.11823 1.98107 0.53991 0.64735 0.25623 0.10307
5.55540
5.40440
Table 3 (Continued) Women, 5 years Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00356 0.00000 0.00356 0.00711 0.00000 0.01068 0.01068 0.03915 0.06406
0.0000 0.0036 0.0000 0.0036 0.0071 0.0000 0.0108 0.0073 0.0404 0.0658
0.000485 0.001116 0.001945 0.003215 0.005182 0.007748 0.011279 0.017691 0.028575 0.065117
0.000485 0.001116 0.001943 0.003210 0.005169 0.007717 0.011215 0.017533 0.028167 0.062573
0.13649 1.55486 0.54766 0.01300 0.20044 2.20190 0.00579 1.74594 1.41538 0.00216
0.13645 1.50527 0.54711 0.01070 0.20752 2.19326 0.00737 0.76703 1.23732 0.01055
7.82362
6.62257
Decile 1 2 3 4 5 6 7 8 9 10 Total
Women, 4 years Wj nj
KMj
p¯ j
µ¯ j
proposed χ 2
Kt2
0.00000 0.00000 0.00000 0.00356 0.00712 0.00000 0.00712 0.00356 0.03915 0.05694
0.0000 0.0000 0.0000 0.0036 0.0071 0.0000 0.0072 0.0036 0.0404 0.0583
0.000408 0.000938 0.001635 0.002702 0.004356 0.006513 0.009485 0.014885 0.024065 0.055093
0.000408 0.000937 0.001633 0.002698 0.004346 0.006492 0.009440 0.014773 0.023774 0.053245
0.11467 0.26379 0.46005 0.08413 0.48791 1.84882 0.15620 2.44051 3.19271 0.05551
0.11464 0.26366 0.45967 0.07735 0.49867 1.84273 0.16210 2.42801 2.86085 0.07610
9.10430
8.78379
Decile 1 2 3 4 5 6 7 8 9 10 Total
where τ is the rank correlation coefficient and $\mathrm{Var}(\tau_i)$ is an unknown quantity with $\mathrm{Var}(\tau_i) \leq \tfrac{1}{2}(1-\tau^2)$.

(3) $C_3$ tends to normality with

$$E(C_3) = P(Y_1 > Y_3 \mid T_1 < T_3) = P_3,$$
$$\mathrm{Var}(C_3) = \frac{1}{Q_3^2}\Big[Q_3\,P_3(1-P_3) + A\big(P_{32}-P_3^2\big) + B\big(P_{33}-P_3^2\big)\Big],$$

where

$$P_{32} = P\big(Y_1 > Y_3,\ Y_1 > Y_3' \mid T_1 < T_3,\ T_1 < T_3'\big), \qquad
P_{33} = P\big(Y_1 > Y_3,\ Y_1' > Y_3 \mid T_1 < T_3,\ T_1' < T_3\big),$$

and $Q_3$, A and B are unknown quantities.
(4) Mean and variance of the overall C:

$$E[C] = E\big[a\,C_1 + b\,C_2 + (1-a-b)\,C_3\big] = a\,P_1 + b\cdot\tfrac{1}{2}(\tau+1) + (1-a-b)\,P_3,$$

with the weights

$$a = \frac{n_1 n_2}{n_1 n_2 + \tfrac{1}{2}n_1(n_1-1) + \sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}}, \qquad
b = \frac{\tfrac{1}{2}n_1(n_1-1)}{n_1 n_2 + \tfrac{1}{2}n_1(n_1-1) + \sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}},$$

so that

$$E[C] = \frac{n_1 n_2\,P_1 + \tfrac{1}{4}n_1(n_1-1)(\tau+1) + \big(\sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}\big)P_3}{n_1 n_2 + \tfrac{1}{2}n_1(n_1-1) + \sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}},$$

and

$$\mathrm{Var}[C] = \mathrm{Var}\big[a\,C_1 + b\,C_2 + (1-a-b)\,C_3\big] = a^2\,\mathrm{Var}[C_1] + b^2\,\mathrm{Var}[C_2] + (1-a-b)^2\,\mathrm{Var}[C_3]$$
$$= \frac{n_1 n_2\big[P_1(1-P_1) + (n_1-1)\big(P_{12}-P_1^2\big) + (n_2-1)\big(P_{13}-P_1^2\big)\big]}{\big\{n_1 n_2 + \tfrac{1}{2}n_1(n_1-1) + \sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}\big\}^2}$$
$$\quad + \frac{\big\{\tfrac{1}{2}n_1(n_1-1)\big\}^2}{\big\{n_1 n_2 + \tfrac{1}{2}n_1(n_1-1) + \sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}\big\}^2}\cdot\frac{1}{4}\Big[\frac{4(n_1-2)}{n_1(n_1-1)}\,\mathrm{Var}(\tau_i) + \frac{2}{n_1(n_1-1)}\big(1-\tau^2\big)\Big]$$
$$\quad + \frac{\big(\sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}\big)P_3(1-P_3) + A\big(P_{32}-P_3^2\big) + B\big(P_{33}-P_3^2\big)}{\big\{n_1 n_2 + \tfrac{1}{2}n_1(n_1-1) + \sum_{i=1}^{n_1}\sum_{j=1}^{n_3}a_{ij}\big\}^2}.$$
References Aitkin, M., Clayton, D. (1980). The fitting of exponential, Weibull and extreme value distributions to complex censored survival data using GLIM. Appl. Statist. 29, 156–163. Baker, R.J., Nelder, J.A. (1978). General Linear Interactive Modelling (GLIM). Release 3. Numerical Algorithms Group, Oxford. Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating graph. J. Math. Psychol. 12, 387–415. Chernoff, H., Lehmann, E.L. (1954). The use of maximum likelihood estimates in tests for goodness of fit. Ann. Math. Statist. 25, 579–586. Hanley, J.A., McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36. Harrell, F.E., Lee, K.L., Mark, D.B. (1996). Tutorial in Biostatistics, Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statist. Medicine 15, 361–387. Hosmer, D.W., Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Commun. Statist. A 10, 1043–1069. Hosmer, D.W., Lemeshow, S. (1982). A review of goodness of fit statistics for use in the development of logistic regression models. Amer. J. Epidemiol. 115, 92–106. Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. Kendall, M.G. (1970). Rank Correlation Methods. Griffin, London. Lehmann, E.L. (1951). Consistency and unbiasedness of certain nonparametric tests. Ann. Math. Statist. 22, 165–179. Lehmann, E.L., D’Abrera, H.J.M. (1975). Nonparametrics (Statistical Methods Based on Ranks). Holdenday, California. Moore, D.S. (1971). A chi-square statistic with random cell boundaries. Ann. Math. Statist. 42, 147–156. Moore, D.S., Spruill, M.C. (1975). Unified large-sample theory of general chi-square statistics for tests of fit. Ann. Statist. 3, 599–616. Shillington, E.R. (1980). A generalized chi-square goodness of fit procedure. Unpublished doctoral dissertation. University of Waterloo, Canada. Stepanians, M. (1994). Goodness of fit techniques and robustness considerations for multiple logistic regression model. Unpublished doctoral dissertation. Boston University, Boston.
2
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23002-9
Discretizing a Continuous Covariate in Survival Studies
John P. Klein and Jing-Tao Wu
1. Introduction

Many medical studies focus on the relationship between the time to some event, such as death or relapse, and covariates measured at the time at which therapy is initiated. When these covariates are discrete or categorical an interpretation of the effects of the covariates on outcome is relatively simple. Using a proportional hazards model the effect of a binary covariate on outcome is interpreted in terms of the relative risk of a patient with the characteristic as compared to a patient without the characteristic. When the covariate is continuous the interpretation of the effect of the covariate on outcome is more difficult. Here one typically reports the relative risk of a patient with a one-unit increase in the covariate. Most clinical investigators would rather have the continuous covariate converted into a binary covariate reflecting high and low risk values of the covariates. While this model may not be optimal for a continuous covariate, it is the model that is most often reported in the medical literature. There are a number of graphical techniques (see Klein and Moeschberger, 2003), such as martingale residual plots, which can be used to check if a threshold model is correct, but quite often the decision to use such a model is made by the clinical investigator on the grounds that it is more understandable than a model which treats the covariate as continuous. Once a decision is made to use a threshold model, the problem is to determine the cut-point between high risk and low risk patients. In some cases, the cut-point can be determined from the literature. Often cut-points need to be determined from the data. Selection of the cut-point can be made either by a data-oriented or outcome-oriented approach (Schulgen et al., 1994). In the data-oriented approach, cut-points are based on the distribution of the covariate in the study population. For example, the median could be used. The outcome-oriented approach picks a cut-point for which the discretized covariate has the largest effect on outcome. In this paper, we examine the outcome-oriented approach. In this approach it is important that adjustments be made to standard hypothesis tests to account for the fact that we picked a cut-point which makes the outcomes in the two groups as far apart as possible. In the next section, we shall survey methods for making adjustments to the test of no covariate effect when a cut-point procedure is used for a single continuous covariate in the proportional hazards model. We present the results of a Monte Carlo
study that compares these methods. The results of this study suggest that a rescaled sequential approach suggested by Contal and O’Quigley (1999) performs the best. In Section 3, we extend this approach first to the accelerated failure time model with a single covariate and then to the case where there are additional auxiliary covariates to be modeled in both the proportional hazard model and the accelerated failure time model.
2. Techniques based on the Cox model with a single covariate

In this section, we discuss techniques for discretizing a single covariate, X, based on a right-censored random variable. We shall base the techniques on the Cox (1972) proportional hazards model. The data available to us consist of $(T_i, \delta_i, X_i)$, where $T_i$ is the on-study time, $\delta_i$ is the indicator of an event ($\delta_i = 1$ if $T_i$ is an event time, $\delta_i = 0$ if censored) and $X_i$ is the single continuous covariate. We also define the counting process $N_i(t) = I[T_i \leq t, \delta_i = 1]$ and the at-risk process $Y_i(t) = I[T_i \geq t]$.

2.1. Martingale residuals

A graphical technique for assessing whether a threshold model for X is warranted is the martingale residual plot. To construct this plot, when there is a single covariate, we first compute the Nelson–Aalen estimator of the cumulative hazard function of the failure time ignoring the covariate. That is, we compute

$$\hat H(t) = \sum_{T_i \leq t}\frac{dN_\bullet(t)}{Y_\bullet(t)}, \qquad (2.1)$$

where $dN_\bullet(t) = \sum_i dN_i(t)$ is the total number of events at time $t$ and $Y_\bullet(t) = \sum_i Y_i(t)$ is the total number at risk at time $t$. The martingale residual for the $i$th subject is defined by

$$M_i = N_i(T_i) - \hat H(T_i). \qquad (2.2)$$
To determine the functional form of X to be used in a proportional hazards regression, Themeau et al. (1990) show that one can examine a plot Mi versus Xi . A LOESS smooth of the scatter diagram is used to determine the functional form f (X) to be used in a Cox regression analysis as h(t | X) = h0 (t) exp{f (X)}. When the data is suggestive of a threshold model then we would expect the plot to show a distinct jump around the threshold and be approximately constant on either side of the jump. To illustrate the type of plots we use data taken from an International Blood and Marrow Registry study of alternative donor bone marrow transplantations (BMT) reported in Szydlo et al., 1997. This study consisted of patients with Acute Lymphocytic Leukemia (ALL) (n = 449), Acute Myeloid Leukemia (AML) (n = 632), and Chronic Myeloid Leukemia (CML) (n = 972). Patients were transplanted with their disease in either an early (n = 1154), intermediate (n = 529) or advanced (n = 372) stage based on remission status. In the study 1224 patients had an HLA-identical sibling donor, 238 a 1-antigen HLA mismatched related donor, 102 a 2-antigen HLA mismatched related
donor, 383 an HLA matched unrelated donor and 108 an HLA mismatched unrelated donor. The endpoint of interest in the study was treatment failure, defined as death or relapse, whichever came first. Our interest is in modeling the effect of age at transplantation on the rate of treatment failure. In the study patients' ages ranged from 7 months to 57.4 years with a median age of 30.3 years and an interquartile range of 19.5 years. The mean age was 29.06 years with a standard error of 0.28 years. Figure 1 shows the martingale residual plot for this data. The solid line is the LOESS regression line. The line shows that there appears to be a change in the effect of age on outcome occurring at about 40 years of age. While this curve may be more suggestive of two piecewise linear functions of age, clinical interest is in the discretizing of the age covariate.
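As a concrete illustration of Section 2.1, the sketch below computes the Nelson–Aalen estimator (2.1), the martingale residuals (2.2) and a LOESS smooth against the covariate. It is written in Python (NumPy and statsmodels' lowess) rather than whatever software produced Figure 1; the function name, the tie handling and the smoothing span are our choices, not the authors'.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def martingale_residuals(time, event):
    """Nelson-Aalen estimate ignoring the covariate, per (2.1), and the
    martingale residuals M_i = delta_i - H-hat(T_i), per (2.2).
    Tied event times are handled sequentially for simplicity."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    n = len(time)
    order = np.argsort(time)
    at_risk = n - np.arange(n)                 # risk-set size at each sorted time
    increments = event[order] / at_risk        # dN(t) / Y(t)
    H_sorted = np.cumsum(increments)           # cumulative hazard along the sorted times
    H = np.empty(n)
    H[order] = H_sorted                        # map back to the original ordering
    return event - H

# Hypothetical usage with an age covariate; frac controls the LOESS span.
# time, event, age = ...                       # arrays from the study data
# resid = martingale_residuals(time, event)
# smooth = lowess(resid, age, frac=0.4)        # columns: age grid, smoothed residual
```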
2.2. Estimation of the threshold parameter

Estimation of the threshold parameter, γ, is based on the Cox proportional hazards model. The assumed model is

$$h(t \mid X, \gamma) = h_0(t)\exp\big\{\beta_\gamma I[X \leq \gamma]\big\}. \qquad (2.3)$$

Estimates of γ are based on finding the value of γ which maximizes a test statistic for testing $H_0: \beta_\gamma = 0$. Possible tests are the score, Wald or likelihood ratio tests. For any of these tests one computes the value of the test statistic for all possible values of γ in the range where there is data. One can show that the values of the statistics change only at values γ = $X_i$, where $X_i$ is an observed covariate value, so the statistic needs to be computed at only a finite number of potential values for γ. All three statistics are based on the basic proportional hazards partial log likelihood given by

$$L(\beta_\gamma, \gamma) = \sum_{i=1}^{n}\beta_\gamma I[X_i \leq \gamma]\,N_i(\infty)
- \sum_{i=1}^{n}\int_0^{\infty}\log\Big[\sum_{j=1}^{n}Y_j(u)\exp\big\{\beta_\gamma I[X_j \leq \gamma]\big\}\Big]\,dN_i(u). \qquad (2.4)$$
In this special case (2.4) can be simplified. Let $R_i$ be the total number at risk and $D_i$ the number of deaths at time $T_i$, and let $R_i^\gamma$ be the number at risk and $D_i^\gamma$ the number of deaths with $X \leq \gamma$ at time $T_i$. Then we can write (2.4) as

$$L(\beta_\gamma, \gamma) = \sum_{i=1}^{n}\Big[\beta_\gamma D_i^\gamma - D_i\log\big(R_i^\gamma e^{\beta_\gamma} + R_i - R_i^\gamma\big)\Big]. \qquad (2.5)$$
The score statistic for a fixed γ is given by

$$U(\beta_\gamma, \gamma) = \frac{\partial L(\beta_\gamma, \gamma)}{\partial\beta_\gamma} = \sum_{i=1}^{n}\Big[D_i^\gamma - \frac{D_i R_i^\gamma e^{\beta_\gamma}}{(R_i - R_i^\gamma) + R_i^\gamma e^{\beta_\gamma}}\Big], \qquad (2.6)$$

and the Fisher information is given by

$$I(\beta_\gamma, \gamma) = -\frac{\partial U(\beta_\gamma, \gamma)}{\partial\beta_\gamma} = \sum_{i=1}^{n}\frac{D_i (R_i - R_i^\gamma)\, R_i^\gamma e^{\beta_\gamma}}{\big[(R_i - R_i^\gamma) + R_i^\gamma e^{\beta_\gamma}\big]^2}. \qquad (2.7)$$
The value of $\beta_\gamma$, $b_\gamma$, that maximizes (2.5) is the profile maximum likelihood estimator for this value of γ. Estimates of γ are found by finding the $X_i$ that maximizes the likelihood ratio test statistic for testing $\beta_\gamma = 0$, defined by

$$LR(\gamma) = 2\big[\log L(b_\gamma, \gamma) - \log L(0, \gamma)\big], \qquad (2.8)$$

by maximizing the Wald test of $\beta_\gamma = 0$, given by

$$Z(\gamma) = b_\gamma^2\, I(b_\gamma, \gamma), \qquad (2.9)$$

or by maximizing the score test, given by

$$Sc(\gamma) = \frac{U(0, \gamma)^2}{I(0, \gamma)}. \qquad (2.10)$$
Note that maximizing the likelihood ratio statistic is equivalent to maximizing the profile likelihood, since log[L(0, γ)] is the same for all γ. To illustrate the calculations we again use the data on bone marrow transplant patients. In Figure 2 we plot the three test statistics against age. The maxima for the score and Wald statistics are at 0.8 years, with chi-squares of 15.61 and 12.17, respectively. The maximum value for the likelihood ratio statistic is found at 41.2 years, with a chi-square of 7.69. When age is treated continuously the values of the chi-square statistics are 2.63 (p = 0.1047) for the likelihood ratio test and
2.62 (p = 0.1055) for both the Wald and score tests, so there is no evidence of an effect of age on outcome. If we use the test statistics based on the threshold models without any adjustment for the fact that we picked the threshold with the largest chi-square, we have p-values for the age effect of 0.0056 for the likelihood ratio test, 0.0005 for the Wald test and < 0.0001 for the score test. Clearly, some type of adjustment needs to be made to these p-values, and this is discussed in the next section.

2.3. Inference for β

In the previous section we saw how to estimate the threshold parameter based on test statistics for testing the hypothesis that $\beta_\gamma = 0$. While these test statistics are appropriate for estimating γ, they are clearly not appropriate for making inference about $\beta_\gamma$, since they are based on the largest possible value of the test statistic. Some type of correction for this selection bias is needed to preserve the type I error rate. The first two corrections are based on the score statistic. The first, due to Jespersen (1986), is based on the statistic $W_J$ defined by

$$W_J = \frac{\sup_\gamma |U(0, \gamma)|}{\sqrt{D}}, \qquad (2.11)$$
where D is the number of events in the sample. Jespersen shows that under the null hypothesis this statistic converges in distribution to $\sup_{0\leq p\leq 1}|W^0(p)|$, where $W^0(\cdot)$ is a Brownian bridge. Billingsley (1968) shows that

$$P\Big(\sup_{0\leq p\leq 1}\big|W^0(p)\big| \geq k\Big) = 2\sum_{j=1}^{\infty}(-1)^{j+1}\exp\big(-2j^2k^2\big), \qquad (2.12)$$
so that p-values and critical values of the test can be found.

Contal and O'Quigley (1999) also provide a corrected test based on the score statistic. The test is based on the following result due to Billingsley (1968). Suppose that $\varphi_1, \ldots, \varphi_n$ are exchangeable random variables that meet the following three conditions:

$$\sum_{i=1}^{n}\varphi_i \xrightarrow{p} 0, \qquad \sum_{i=1}^{n}\varphi_i^2 \xrightarrow{p} 1 \qquad\text{and}\qquad \max_{1\leq i\leq n}|\varphi_i| \xrightarrow{p} 0 \qquad\text{as } n\to\infty. \qquad (2.13)$$

Then

$$S_n(p) = \sum_{i=1}^{[np]}\varphi_i \Rightarrow W^0(p), \qquad (2.14)$$
where [·] is the greatest integer function. When there is no censoring, so that n = D, one can show that the log-rank test is precisely equal to Savage's test, which is a linear rank test with exponential scores; the score for the $i$th smallest death time is given by

$$a_i = 1 - \sum_{j=1}^{i}\frac{1}{D-j+1}. \qquad (2.15)$$
If the event times are ordered from smallest to largest, then one can show that the score statistic (2.6) can be written using these scores as

$$U(\gamma) = -\sum_{i=1}^{n}I[X_{(i)} \leq \gamma]\,a_i, \qquad (2.16)$$

where $X_{(i)}$ is the value of X associated with the $i$th smallest death time. The scores, the $a$'s, have mean zero and variance $\sigma^2$ given by

$$\sigma^2 = \frac{\sum_{i=1}^{D}a_i^2}{D-1}. \qquad (2.17)$$

If we order the data so that $X_1 < \cdots < X_D$, then $a_i/[(D-1)^{1/2}\sigma]$ meets the conditions of (2.13) and

$$S_n(p) = \frac{1}{\sigma\sqrt{D-1}}\sum_{i=1}^{[Dp]}a_i = \frac{U(X_{[Dp]})}{\sigma\sqrt{D-1}}. \qquad (2.18)$$
This result, in the no-censoring case, allows the limiting Brownian motion result to be applied. When there is censoring the results hold using the log-rank statistic, with (2.17) computed using the observed number of deaths, D. The test of no effect of X on outcome is then based on the maximum value of $|S_n(p)|$ over the possible cut-points, and this statistic has the same limiting distribution as Jespersen's statistic. We shall show extensions of this approach to more complicated models in later sections.

A third method of adjustment is due to Lausen and Schumacher (1992, 1996). To apply this approach the possible range of threshold values needs to be restricted. Values of γ are restricted to the range $[X_{(n\varepsilon)}, X_{(n(1-\varepsilon))}]$, where $X_{(k)}$ is the $k$th order statistic of the continuous variable and $0 < \varepsilon < 0.5$. Let $C(\gamma)$ be a standardized test statistic for the two-sample problem with groups defined by the threshold parameter γ. Then Lausen and Schumacher show that

$$\max\big\{C(\gamma):\ \gamma\in[X_{(n\varepsilon)}, X_{(n(1-\varepsilon))}]\big\} \;\Rightarrow\; \sup_{u\in[\varepsilon, 1-\varepsilon]}\frac{|W^0(u)|}{\sqrt{u(1-u)}} \qquad\text{as } n\to\infty. \qquad (2.19)$$

Miller and Siegmund (1982) show that

$$P\Big(\sup_{u\in[\varepsilon,1-\varepsilon]}\frac{|W^0(u)|}{\sqrt{u(1-u)}} > b\Big) \;\approx\; \varphi(b)\Big(b - \frac{1}{b}\Big)\ln\frac{(1-\varepsilon)^2}{\varepsilon^2} + 4\,\frac{\varphi(b)}{b}, \qquad (2.20)$$
where φ(·) is the standard normal density function. These results suggest the following corrected p-value approach. For all γ in the range $[X_{(n\varepsilon)}, X_{(n(1-\varepsilon))}]$ we compute either the Wald, score or likelihood ratio test p-value. Let $P_{\min}$ be the smallest of these p-values over the range of threshold parameters.
The corrected p-value to test β = 0 is given by

$$P_{\mathrm{cor}} = \varphi(z)\Big(z - \frac{1}{z}\Big)\ln\frac{(1-\varepsilon)^2}{\varepsilon^2} + 4\,\frac{\varphi(z)}{z}, \qquad (2.21)$$

where $z = \Phi^{-1}(1 - P_{\min}/2)$, with Φ the standard normal distribution function.

Table 1
P-values of tests of the age effect

  Method                                                     P-value
  Age treated continuously
    Score test                                               0.1055
    Wald test                                                0.1055
    Likelihood ratio test                                    0.1047
  Threshold models
    Score test, no adjustment                                < 0.0001
    Wald test, no adjustment                                 0.0005
    Likelihood ratio test, no adjustment                     0.0055
    Jespersen's correction                                   0.1836
    Contal and O'Quigley correction                          0.1806
    Lausen and Schumacher, ε = 0.04, score test              0.0593
    Lausen and Schumacher, ε = 0.04, Wald test               0.0601
    Lausen and Schumacher, ε = 0.04, likelihood ratio test   0.0691
    Lausen and Schumacher, ε = 0.1, score test               0.0441
    Lausen and Schumacher, ε = 0.1, Wald test                0.0447
    Lausen and Schumacher, ε = 0.1, likelihood ratio test    0.0516
    Lausen and Schumacher, ε = 0.2, score test               0.0398
    Lausen and Schumacher, ε = 0.2, Wald test                0.0402
    Lausen and Schumacher, ε = 0.2, likelihood ratio test    0.0459

Table 1 shows the results of the various methods of testing for an age effect in the BMT data. Here we see that if no correction is made then there is a highly significant age effect. Using Jespersen's or Contal and O'Quigley's corrections, the age effect is not significant, in agreement with the continuous covariate model. For Lausen and Schumacher's correction, the conclusion depends on the test statistic and on the value of ε.

2.4. Comparison of test procedures

To compare the various strategies for testing the hypothesis that β is equal to zero we present part of a large-scale Monte Carlo study found in Wu (2001). In the study, we simulate data from a Cox proportional hazards model with a baseline hazard rate equal to 1.0. For each observation a continuous covariate is generated from a uniform distribution on the interval [−1, 1]. Censoring is by an independent exponential random variable with a rate selected to yield the indicated censoring percentage. Table 2 shows the estimated power when data is simulated under the null hypothesis of no effect (β = 0). The level of the test is 0.05. Here we see that the unadjusted
Table 2
Observed significance based on different methods of adjustment
N
%
Continuous model
Unadjusted
Jespersen
Contal
Cen
Score
Wald
Like. ratio
Score
Wald
Like. ratio
and O’Quigley
0
0 20 40
5.2 5.2 6.0
4.7 4.8 5.2
4.8 5.2 5.8
66.3 63.7 62.5
60.7 55.4 56.4
55.5 51.9 53.9
1.2 2.0 2.3
2.1 2.9 4.6
100
0 20 40
5.6 4.6 5.8
5.6 4.5 5.8
5.6 4.7 5.8
70.7 69.9 66.1
67.7 66.1 60.4
60.7 61.6 57.9
2.2 2.8 3.1
2.7 3.8 4.4
200
0 20 40
4.7 5.3 4.7
4.6 5.3 4.6
4.5 5.3 4.5
75.0 71.4 70.9
71.9 67.6 67.6
65.4 62.9 61.1
3.6 3.6 2.7
4.1 3.0 3.1
Lausen and Schumacher approach %
Score test
Wald test
Likelihood ratio test
Cen ε = 0.04 ε = 0.10 ε = 0.20 ε = 0.04 ε = 0.10 ε = 0.20
ε = 0.04 ε = 0.10 ε = 0.20
50
0 20 40
17.2 16.9 17.3
6.5 6.9 7.2
4.4 5.1 5.3
7.2 6.5 7.2
4.3 4.7 4.3
3.5 3.6 3.9
2.7 3.8 3.8
3.2 3.8 4.2
2.7 3.6 4.6
100
0 20 40
9.8 10.6 8.7
5.4 5.7 5.0
4.7 4.5 5.5
7.2 7.5 5.7
4.2 4.4 4.0
3.9 4.6 4.5
3.6 4.5 4.2
3.5 4.7 3.9
3.3 4.2 4.4
200
0 20 40
7.2 6.2 5.9
5.7 5.7 4.2
4.5 47 4.1
5.5 5.5 4.9
5.3 5.4 3.6
4.3 4.7 3.6
4.5 4.6 3.1
4.5 4.7 3.1
4.3 4.6 3.6
methods, which base the test on the threshold model without any correction, are clearly not correct in that they reject the null hypothesis much too often. The adjusted p-value approach of Lausen and Schumacher, when based on the score test, also appears to reject the null hypothesis too often. Both Jespersen's and Contal and O'Quigley's methods provide appropriate protection to the null hypothesis, and each may be a bit conservative. The adjusted p-value approach, when applied using either the Wald or the likelihood ratio test, seems to perform well when ε is either 0.1 or 0.2. Similar results hold when the baseline hazard rate is increasing or decreasing. To further compare the methods we examine their performance under the alternative hypothesis (β ≠ 0). In Table 3 we look at how well they perform under the model h(t | X) = exp{βX}, and in Tables 4 and 5 we look at the model h(t | X) = exp{βI[X ≤ γ]} for γ = 0, 0.5. In these three tables we see that all three methods of adjustment have similar powers and that they perform quite well for large sample sizes. For all three methods there is a loss of power as compared to the continuous model, even when the threshold model is true. This suggests that there is a small price to pay for the simpler interpretation of the threshold model.
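To make the procedures of Sections 2.2–2.3 concrete, here is a sketch (ours, in Python; not the authors' code) of the cut-point scan using the score statistic (2.6) at β_γ = 0 together with the Contal–O'Quigley standardization (2.15)–(2.18) and the Brownian-bridge p-value (2.12). It assumes no heavy tie structure and scans every distinct covariate value except the largest.

```python
import numpy as np

def contal_oquigley_scan(time, event, x):
    """Scan candidate thresholds gamma: U(0, gamma) is the log-rank score of
    (2.6) at beta = 0, standardized by sigma*sqrt(D-1) as in (2.17)-(2.18);
    the maximum |S| is referred to the Brownian-bridge tail bound (2.12)."""
    time, event, x = map(np.asarray, (time, event, x))
    D = int(event.sum())
    event_times = np.unique(time[event == 1])

    def logrank_score(gamma):
        low = x <= gamma                      # low-risk group indicator I[X <= gamma]
        u = 0.0
        for t in event_times:
            at_risk = time >= t
            deaths = (time == t) & (event == 1)
            u += low[deaths].sum() - deaths.sum() * low[at_risk].sum() / at_risk.sum()
        return u

    a = 1.0 - np.cumsum(1.0 / (D - np.arange(D)))      # exponential scores (2.15)
    sigma = np.sqrt(np.sum(a ** 2) / (D - 1))          # (2.17), with the observed D

    cuts = np.unique(x)[:-1]                           # keep both groups non-empty
    S = np.array([logrank_score(c) for c in cuts]) / (sigma * np.sqrt(D - 1))
    q = np.abs(S).max()
    j = np.arange(1, 101)
    p_value = 2.0 * np.sum((-1.0) ** (j + 1) * np.exp(-2.0 * j ** 2 * q ** 2))   # (2.12)
    return cuts[np.abs(S).argmax()], q, p_value
```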
Table 3 Power under continuous covariate model N
Ln(1.5)
Ln(2)
%
Continuous model
Jespersen
Cen
Score
Wald
Like. ratio
50
0 20 40
33.5 29.3 24.9
33.2 28.1 24.2
33.3 28.5 24.1
100
0 20 40
61.5 55.4 41.4
61.4 54.9 40.8
200
0 20 40
88.6 81.7 72.1
50
0 20 40
100
200
Contal
Lausen and Schumacher
and O’Quigley
Wald ε = 0.1
Like. ratio ε = 0.1
15.2 12.9 10.5
19.7 18.2 17.4
20.0 16.3 13.9
20.2 17.0 14.1
61.4 54.9 40.8
40.7 34.2 24.1
45.1 38.4 29.1
45.3 36.2 26.7
44.6 36.2 27.6
88.6 81.7 71.9
88.6 81.7 72.1
78.4 70.4 56.4
79.2 72.7 60.1
77.7 69.9 53.7
77.3 70.6 55.3
75.2 63.1 53.3
74.2 62.0 51.9
74.4 62.9 52.8
43.3 37.3 37.6
51.1 45.3 37.2
51.4 42.9 31.0
52.8 45.4 33.8
0 20 40
96.2 92.1 83.7
96.2 91.8 83.4
9.1 91.8 83.5
87.6 80.3 68.6
89.3 83.0 73.3
88.6 80.1 68.3
88.7 81.3 68.9
0 20 40
100.0 100.0 98.9
100.0 100.0 98.8
100.0 100.0 98.8
99.6 99.3 95.3
99.6 99.3 95.9
99.9 99.6 95.5
00.9 99.2 99.5
Table 4 Power based on threshold model with γ = 0 N
Ln(1.5)
Ln(2)
%
Continuous model
Jespersen
Contal and O’Quigley
Lausen and Schumacher
Cen
Score
Wald
Like. ratio
Wald ε = 0.1
Like. ratio ε = 0.1
50
0 20 40
21.7 17.9 14.9
21.3 17.0 14.2
21.2 17.4 14.6
10.0 8.3 8.9
13.0 12.7 12.9
12.4 10.1 8.2
12.8 11.4 9.6
100
0 20 40
39.5 35.5 28.1
39.3 35.2 27.7
39.3 35.2 27.7
29.4 27.2 21.6
32.7 31.0 24.2
28.9 23.3 17.8
28.5 24.7 19.2
200
0 20 40
69.3 57.7 47.2
69.1 57.6 47.0
69.1 57.6 46.0
67.1 54.8 44.7
68.5 57.5 47.6
61.8 49.2 37.4
63.7 49.1 38.0
50
0 20 40
51.0 45.1 40.5
49.7 44.4 39.4
49.7 44.4 39.4
31.0 27.1 25.6
38.6 35.9 34.1
35.0 30.2 23.3
34.6 30.6 26.4
100
0 20 40
82.4 76.1 67.6
82.1 75.5 67.2
82.1 75.5 67.3
77.9 71.7 60.0
80.5 75.6 65.7
73.7 65.9 54.4
74.1 66.2 54.8
200
0 20 40
98. 97.3 92.3
98.5 97.0 92.2
98.5 97.0 92.2
98.8 97.6 95.0
99.0 98.1 93.6
98.1 95.9 90.2
98.1 96.1 90.4
Table 5 Power based on threshold model with γ = 0.5 N
Ln(1.5)
Ln(2)
%
Continuous model
Cen
Score
Wald
Like. ratio
50
0 20 40
14.5 14.0 11.5
14.4 13.5 10.9
14.5 13.7 11.5
100
0 20 40
27.4 21.9 18.8
27.3 21.8 18.3
200
0 20 40
49.0 39.6 29.5
50
0 20 40
100
200
Jespersen
Contal
Lausen and Schumacher
and O’Quigley
Wald ε = 0.1
Like. ratio ε = 0.1
6.5 6.9 5.6
10.1 9.3 7.8
7.9 7.6 4.7
11.5 10.1 7.4
27.3 21.7 18.7
19.8 17.0 12.3
22.5 19.9 14.4
19.0 13.2 9.3
22.2 17.7 14.0
48.7 39.4 29.3
48.8 39.6 29.3
46.3 37.6 27.6
48.2 39.6 30.2
42.2 32.7 22.4
45.9 39.2 28.0
35.6 29.9 24.9
35.1 29.4 23.9
35.0 29.4 23.9
23.3 18.3 16.7
28.4 26.0 22.0
20.8 13.7 11.5
29.8 24.9 19.9
0 20 40
61.6 53.6 45.2
61.2 53.1 44.9
61.2 53.1 44.9
57.9 49.7 39.0
61.7 53.6 43.1
4.7 45.1 32.7
60.2 52.5 42.6
0 20 40
88.9 85.2 73.2
88.7 84.9 73.1
88.7 85.2 73.0
92.3 87.6 75.0
93.1 88.6 76.9
90.8 86.1 72.5
92.5 89.5 77.0
3. Extensions of Contal and O'Quigley's approach

In this section, we present two extensions of Contal and O'Quigley's correction. The first, in Section 3.1, is an extension of the approach to the accelerated failure time model with a single continuous covariate. This extension is based on the score statistic for that model. The second extension allows adjustments for other covariates in either the Cox model or the accelerated failure time model (Section 3.2). These extensions are based on applying the correction to residuals that incorporate the adjustment for the other covariates.

3.1. Accelerated failure time models with a single covariate

As in the previous section, we assume that we have a single continuous covariate, X. We wish to estimate a threshold model for X under a parametric accelerated failure time model. The desired threshold model is

$$\ln(T) = \mu + \beta\cdot I[X \leq \gamma] + \sigma E, \qquad (3.1)$$

where E is a random error term, μ is the intercept, β the regression coefficient, γ the threshold parameter and σ the scale parameter. Of primary interest are estimation of γ and tests of the hypothesis that β = 0 or, equivalently, that
$P[T < t \mid X \leq \gamma] = P[T < t \mid X > \gamma]$. The data consist of $\{(T_i, \delta_i, X_i),\ i = 1, \ldots, n\}$. We assume that the $X_i$'s are ordered from smallest to largest, and we let $Z_i^\gamma = I[X_i \leq \gamma]$, $i = 1, \ldots, n$. To construct the procedure we let $L^\gamma(T_i, (\mu, \beta, \sigma))$ be the contribution to the log likelihood, based on (3.1), for the $i$th observation. Let

$$U_\beta^\gamma\big((\mu, \beta, \sigma)\big) = \sum_{i=1}^{n}\frac{\partial L^\gamma(T_i, (\mu, \beta, \sigma))}{\partial\beta} \qquad (3.2)$$
and let $(\mu_0, \sigma_0)$ be the maximum likelihood estimates of $(\mu, \sigma)$ when $\beta = 0$. We can write the score statistic as

$$U_\beta^\gamma\big((\mu_0, \beta = 0, \sigma_0)\big) = U_\beta^\gamma = \sum_{i=1}^{n}Z_i^\gamma\,\psi_i, \qquad (3.3)$$
where $\psi_i$ is a score associated with the $i$th ordered X. Since our data are ordered by increasing X, if we set γ equal to the $[np]$th smallest X we have that

$$U_\beta^\gamma(p) = \sum_{i=1}^{[np]}\psi_i. \qquad (3.4)$$
Using a result of Billingsley (1968) on partial sums of an ergodic process, one can show that the process

$$S(p) = \frac{U_\beta^\gamma(p)}{v\sqrt{n}} \qquad (3.5)$$

converges weakly to W, a Brownian motion process on the unit interval, where $v = (E[\psi_i^2])^{1/2}$. Wu (2001) shows that for the Weibull, exponential, log logistic and log normal models, the $\psi_i$'s have mean zero and $v$ can be estimated consistently by $\sqrt{\sum\psi_i^2/n}$. In these cases

$$S(p) = \frac{\sum_{i=1}^{[np]}\psi_i}{\sqrt{\sum_{i=1}^{n}\psi_i^2}}. \qquad (3.6)$$

Since $\sum\psi_i = 0$, we have that $S(p) - pS(1) = S(p)$, so that $S(p)$ converges to a Brownian bridge by Billingsley's results, and the maximum value of $|S(p)|$ has the limiting distribution of the supremum of a Brownian bridge, so that the p-value can be found using (2.12). The value of $p$ which maximizes $|S(p)|$ gives the estimate of γ equal to the $[np]$th smallest of the X's. The weights ψ depend on the distribution of the errors, E. Wu shows that the desired weights are of the form

$$\psi = \frac{1}{\sigma_0}\Big[(1-\delta)\,\frac{f_E(z)}{S_E(z)} - \delta\,\frac{f_E'(z)}{f_E(z)}\Big], \qquad (3.7)$$

where $S_E(z)$ and $f_E(z)$ are the survival function and density function of the error distribution and $z = [\ln(T) - \mu_0]/\sigma_0$. The values of ψ can be computed for a number of
special cases. The values are

$$\psi = T\exp[-\mu_0] - \delta \qquad (3.8)$$

for the exponential distribution, with an extreme value error distribution having survival function $S_E(x) = \exp(-e^{x})$ and σ = 1;

$$\psi = \frac{1}{\sigma_0}\Big[\exp\Big(\frac{\ln(T)-\mu_0}{\sigma_0}\Big) - \delta\Big] \qquad (3.9)$$

for the Weibull distribution, also with the extreme value distribution for E;

$$\psi = \frac{1}{\sigma_0}\cdot\frac{\exp\big(\frac{\ln(T)-\mu_0}{\sigma_0}\big) - \delta}{1 + \exp\big(\frac{\ln(T)-\mu_0}{\sigma_0}\big)} \qquad (3.10)$$

for the log logistic, which has a logistic error distribution with $S_E(x) = (1+e^x)^{-1}$; and

$$\psi = \frac{1}{\sigma_0}\Bigg[(1-\delta)\,\frac{\exp\big(-\frac{(\ln(T)-\mu_0)^2}{2\sigma_0^2}\big)}{\sqrt{2\pi}\,\big[1-\Phi\big(\frac{\ln(T)-\mu_0}{\sigma_0}\big)\big]} + \delta\,\frac{\ln(T)-\mu_0}{\sigma_0}\Bigg] \qquad (3.11)$$
for the log normal regression, with a normal distribution for E.

Table 6 shows the results of fitting the exponential, Weibull, log logistic and log normal accelerated failure time models to the bone marrow transplant data. In the table we see that there is a marginal effect of age on outcome when age is treated continuously. All four models show that, in the threshold model, age has no effect on outcome. All methods give the same threshold estimate of 41.2 years of age. Note that had we not made the adjustment for multiple testing, all four error distributions would have had highly significant age effects at 41.2 years.

Table 6
Estimates and tests based on the accelerated failure time model

  Model          Threshold     Estimate of β (SE)    p-value
  Exponential    Continuous    −0.0113 (0.0022)      < 0.0001
  Weibull        Continuous    −0.0090 (0.0046)      0.0489
  Log normal     Continuous    0.0093 (0.0046)       0.0440
  Log logistic   Continuous    −0.0089 (0.0047)      0.0658
  Exponential    41.20         0.3124 (0.0735)       0.2974
  Weibull        41.20         0.4148 (0.1427)       0.2646
  Log normal     41.20         0.4267 (0.1475)       0.1463
  Log logistic   41.20         0.4634 (0.1499)       0.1519
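A small illustration of the Section 3.1 procedure for the simplest case, the exponential model, is given below (our sketch in Python, not the authors' code). It uses the closed-form null MLE of μ, the scores ψ_i of (3.8), the standardized partial-sum process (3.6) and the Brownian-bridge p-value (2.12); the extension to the other error distributions only changes how ψ_i is computed, per (3.9)–(3.11).

```python
import numpy as np

def aft_exponential_cutpoint(time, event, x):
    """Cut-point scan for the exponential AFT model of Section 3.1:
    fit the null model without X, form psi_i = T_i*exp(-mu0) - delta_i per (3.8),
    scan the partial-sum process (3.6) over cut-points in X, and return the
    estimated threshold, sup|S|, and the Brownian-bridge p-value (2.12)."""
    time, event, x = map(np.asarray, (time, event, x))
    mu0 = np.log(time.sum() / event.sum())        # null MLE: exp(mu0) = total time / events
    psi = time * np.exp(-mu0) - event             # (3.8); note sum(psi) = 0 at the MLE

    order = np.argsort(x)                         # order observations by X
    S = np.cumsum(psi[order]) / np.sqrt(np.sum(psi ** 2))   # (3.6)
    k = int(np.abs(S[:-1]).argmax())              # exclude the last point (empty high group)
    q = float(np.abs(S[:-1]).max())
    j = np.arange(1, 101)
    p_value = 2.0 * np.sum((-1.0) ** (j + 1) * np.exp(-2.0 * j ** 2 * q ** 2))
    return np.sort(x)[k], q, p_value              # estimated threshold, sup|S|, p-value
```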
3.2. Adjustment for other covariates

In many instances we wish to fit a threshold model after adjustment for other fixed-time covariates, Z, which may affect outcome. We are then interested in testing the hypothesis $P[T \geq t \mid X \leq \gamma, Z] = P[T \geq t \mid X > \gamma, Z]$ for all t. We assume either a Cox regression model with

$$h(t \mid X, Z) = h_0(t)\exp\big\{\beta^{*\prime}Z + \beta I[X \leq \gamma]\big\} \qquad (3.12)$$

or an accelerated failure time model with

$$\ln(T) = \mu + \beta^{*\prime}Z + \beta I[X \leq \gamma] + \sigma E. \qquad (3.13)$$

One way of estimating γ and β is to first estimate the regression coefficient $\beta^*$ by $b^*$ in a model without X. We then compute an appropriate residual, $R_i$, for the $i$th observation. Under models (3.12) or (3.13) these residuals will have either an exact or a limiting distribution of an accelerated failure time model. Using these residuals we can then apply the results of Section 3.1 to make inference about X in the threshold model.

When the Cox model (3.12) is in use, we compute $R_i$ as the Cox and Snell (1968) residual, defined by

$$R_i = \hat H_0(T_i)\exp\big\{b^{*\prime}Z_i\big\}, \qquad (3.14)$$

where

$$\hat H_0(t) = \sum_{T_j \leq t}\frac{dN_\bullet(T_j)}{\sum_{k=1}^{n}Y_k(T_j)\exp\{b^{*\prime}Z_k\}}. \qquad (3.15)$$

It is well known (cf. Klein and Moeschberger, 2003) that these residuals behave like a censored sample from a unit exponential distribution. Using this fact, we can apply the results of Section 3.1 with $\psi_i = R_i - \delta_i$.

For the accelerated failure time model (3.13) with error distribution E we define the residual by

$$R_i = \frac{1}{\hat\sigma}\big(\ln[T_i] - \hat\mu - b^{*\prime}Z_i\big), \qquad i = 1, \ldots, n, \qquad (3.16)$$

where all parameters are estimated under a model without X. If the error distribution is selected properly, then the $R_i$'s are a sample from a standardized version of the error distribution with μ = 0 and σ = 1. The scores are then

$$\psi_i = (1-\delta_i)\,\frac{f_E(R_i)}{S_E(R_i)} - \delta_i\,\frac{f_E'(R_i)}{f_E(R_i)}, \qquad (3.17)$$

and the results of the previous section apply.

We illustrate this procedure in Table 7. Here all inference for the effect of age on treatment failure in the BMT example is adjusted for the patient's disease state at transplant (early, intermediate or advanced) and the type of donor used for the transplant. The results show, after adjustment for these factors, that there is a strong effect of age on outcome when age is treated continuously. In the threshold model, age remains
Table 7
Estimates and tests for the age effect adjusted for disease stage and type of donor

  Method         Threshold     Estimate of β (SE)    p-value
  Cox            Continuous    0.0111 (0.0025)       < 0.0001
  Exponential    Continuous    −0.0145 (0.0025)      < 0.0001
  Weibull        Continuous    −0.0199 (0.0043)      < 0.0001
  Log logistic   Continuous    −0.0195 (0.0043)      < 0.0001
  Log normal     Continuous    −0.0205 (0.0043)      < 0.0001
  Cox            33.9          −0.2457 (0.0625)      0.0016
  Exponential    20.7          0.4054 (0.0729)       0.0066
  Weibull        33.8          0.4433 (0.1083)       0.0033
  Log logistic   35.6          0.5108 (0.1128)       0.0004
  Log normal     35.6          0.5536 (0.1124)       < 0.0001
significantly associated with outcome after the adjustments for multiple testing. The various models disagree on where the cut-point should be with possible values ranging from 20.7 years for the exponential distribution to 35.6 years for either the log normal or log logistic models. All of these adjustments can be programmed quite easily using standard statistical software that computes a residual from the fitted regression models.
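The residual-based adjustment of Section 3.2 is equally easy to sketch. The function below (ours, in Python) computes Cox–Snell residuals (3.14) from a Cox fit on the auxiliary covariates using the Breslow estimator (3.15); b_star is assumed to come from any standard Cox routine fitted without X, and ties are handled sequentially for simplicity. Because the residuals behave like censored unit-exponential data, the scan of Section 3.1 can then be applied to them with ψ_i = R_i − δ_i, for instance by reusing the aft_exponential_cutpoint sketch above with the residuals in place of the raw times.

```python
import numpy as np

def cox_snell_residuals(time, event, Z, b_star):
    """Cox-Snell residuals R_i = H0-hat(T_i) * exp(b_star' Z_i), per (3.14),
    with H0-hat the Breslow cumulative baseline hazard of (3.15)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    Z = np.asarray(Z, dtype=float)
    risk = np.exp(Z @ b_star)                     # exp(b_star' Z_i)

    order = np.argsort(time)
    d_sorted, r_sorted = event[order], risk[order]
    denom = np.cumsum(r_sorted[::-1])[::-1]       # sum of risk scores over each risk set
    H0_sorted = np.cumsum(d_sorted / denom)       # Breslow baseline cumulative hazard
    H0 = np.empty(len(time))
    H0[order] = H0_sorted
    return H0 * risk

# Hypothetical usage: b_star estimated from a Cox model with Z only (no X), then
# resid = cox_snell_residuals(time, event, Z, b_star)
# threshold, sup_S, p = aft_exponential_cutpoint(resid, event, x)
```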
4. Discussion

We have presented the current state of the art in data-driven methods for discretizing a continuous covariate. We have looked at methods for a single cut-point, but these could easily be extended to more than two groups. Our Monte Carlo study shown in Section 2, and found in more detail in Wu (2001), shows that the method suggested by Contal and O'Quigley performs quite well. This method is also easy to program using standard statistical packages and, as seen in Section 3, can be extended to solve other problems. While these methods are useful when the cut-point is unknown, in many cases the best cut-point should be based on the subject matter under consideration. For example, in the bone marrow example the biology of the transplant may suggest a cut-point that reflects the difference between pediatric and adult patients. Whatever the cut-point is, it is important that it makes plausible biological sense.
Acknowledgements

This research was supported by grant R01-CA54706-09 from the National Cancer Institute.
References Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York. Contal, C., O’Quigley, J. (1999). An application of changepoint methods in studying the effect of age on survival in breast cancer. Comput. Statist. Data Anal. 30, 253–270. Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. B 34, 187–220. Cox, D.R., Snell, E.J. (1968). General definition of residuals (with discussion). J. Roy. Statist. Soc. B 30, 248–275. Jespersen, N.C.B. (1986). Dichotomizing a continuous covariate in the Cox regression model. Statist. Res. Unit Univ. Copenhagen Res. Report 86 (2). Klein, J.P., Moeschberger, M.L. (2003). Survival Analysis: Techniques for Censored and Truncated Data, 2nd edn. Springer, New York. Lausen, B., Schumacher, M. (1996). Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Comput. Statist. Data Anal. 21, 307–326. Lausen, B., Schumacher, M. (1992). Maximally selected rank statistics. Biometrics 48, 73–85. Miller, R., Siegmund, D. (1982). Maximally selected chi-square statistics. Biometrics 38, 1011–1016. Schulgen, G., Lausen, B., Olsen, J.H., Schumacher, M. (1994). Outcome oriented cutpoints in analysis of quantitative exposures. Amer. J. Epidemiology 140, 172–184. Szydlo, R., Goldman, J.M., Klein, J.P., Gale, R.P., Ash, R.C., Bach, F.H., Bradley, B.A., Casper, J.T., Flomenberg, N., Gajewski, J.L., Gluckman, E., Henslee-Downey, P.J., Hows, J.M., Jacobsen, N., Kolb, H.J., Lowenberg, B., Masaoka, T., Rowlings, P.A., Sondel, P.M., van Bekkum, D.W., van Rood, J.J., Vowels, M.R., Zhang, M., Horowitz, M.M. (1997). Results of allogeneic bone marrow transplants for leukemia using donors other than HLA-identical siblings. J. Clinical Oncology 15(5), 1767–1777. Themeau, T.M., Grambsh, P.M., Fleming, T.R. (1990). Martingale-based residuals for survival models. Biometrika 77, 147–160. Wu, J.-T., (2001). Statistical methods for discretizing a continuous covariate in a censored data regression model. Ph.D. Thesis. Division of Biostatistics, Medical College of Wisconsin.
3
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23003-0
On Comparison of Two Classification Methods with Survival Endpoints
Ying Lu, Hua Jin and Jie Mi
1. Introduction

Methods of prognostic prediction and classification, such as staging systems for diseases, are commonly used to assist decision making in clinical management (Stitt et al., 1991; Lu and Stitt, 1994; Sevin et al., 1996; Zhang and Singer, 1999; Esserman et al., 1999; Therasse et al., 2000; Kashani-Sabet et al., 2001; Reynolds, 2002). These staging or classification systems create a finite number of exclusive and exhaustive groups based on known prognostic factors. A successful system will maximize between-group survival differences. Without loss of generality, we assume the classification severity is an ordinal variable corresponding to disease severity, such that group 1 has a longer mean survival time than group 2, and so on. These systems have been used to design treatment strategies in patient management. Often, there are several prognostic classification methods for the same disease and patient population that may depend on different prognostic factors, or on the same prognostic factors but different thresholds. Comparison of the methods' efficiency in separating patients according to survival profiles is therefore necessary. When comparing two classification rules, the common approach utilizes the Cox regression model with both classification methods as ordinal variables (Kashani-Sabet et al., 2001). One may assume that the first method contains all the information of the second if the type-3 p-value for the second method is above a pre-specified significance level but the same p-value for the first method is below that level. On the other hand, if both methods are significant in the presence of the other, they are considered to be complementary. If neither of their type-3 p-values reaches the significance level, no conclusion can be derived. Such an approach does not answer the question posed of whether one method is preferred over another, and it fails to provide an objective index of their separation efficiency. Furthermore, while it may be possible to test whether one classification method provides additional information over another, it cannot determine whether one classification method is equivalent or non-inferior to another method (Lewis, 1999). The goal of this paper is to develop a statistical procedure for comparing the efficiency of two classification methods. In Section 2, we define an index to measure the classification efficiency based on mean survival times and discuss its mathematical
properties. In Section 3, we give an estimation procedure based on restricted mean lifetime and inference procedures to test for two types of hypotheses. In Section 4, we use simulation to evaluate the distribution properties of the proposed test statistics under finite sample size. In Section 5, we demonstrate our testing procedures in two medical applications. In the last section, we provide comments and conclusion.
2. Degree of separation index

Let T be the random variable of survival time of a population, with survivor function $S(t) = P(T > t)$. Suppose a classification method G separates patients into g exclusive and exhaustive groups. Let J be the random variable of the classification assignment of a randomly selected subject from the population; let $p_j = P(J = j)$, $j = 1, 2, \ldots, g$, be the corresponding probability of being in the $j$th group; let $S_j(t) = P(T > t \mid J = j)$ be the corresponding survivor function; and let $m_j = \int_0^\infty S_j(t)\,dt$ be the corresponding mean survival time. To be exclusive and exhaustive, $\sum_{j=1}^{g}p_j = 1$. We denote such a classification as G(m, p), with m = $(m_1, \ldots, m_g)$ and p = $(p_1, \ldots, p_g)$ being two g-dimensional vectors. It is easy to see that

$$S(t) = \sum_j P(T > t, G = j) = \sum_j P(T > t \mid G = j)\,P(G = j) = \sum_j S_j(t)\,p_j \qquad (1)$$

and

$$m = \int_0^\infty S(t)\,dt = \int_0^\infty\sum_j S_j(t)\,p_j\,dt = \sum_j\Big[\int_0^\infty S_j(t)\,dt\Big]p_j = \sum_j m_j\,p_j. \qquad (2)$$
An efficient classification should maximize between-group differences in survival times, which can be measured by the between-group variance of mean survival times. The larger the between-group variance in mean survival times, the larger the difference in mean survival times between prognostic groups. Here, we introduce our definition of degree of separation (DOS).
DEFINITION 1. For a classification rule G(m, p), a measure of the degree of separation is defined as the variance of the m_j, i.e.,

DOS(m, p) = \sum_j (m_j - m)^2 p_j = \sum_j \Big[ \int_0^\infty \big( S_j(t) - S(t) \big) dt \Big]^2 p_j,    (3)
where m is the overall mean survival time defined in (2). Intuitively, m_j - m is the area between the j-th group survival curve and the overall survival curve.

To facilitate our discussion of the properties of DOS, we introduce a discrete random variable \xi defined by P(\xi = m_i) = p_i, 1 \leq i \leq g, and denote this as \xi \sim (m, p). It is easy to see that DOS(m, p) = var(\xi).

In the rest of this section, we give some sufficient conditions under which we can compare the variances of two discrete random variables and consequently compare the DOSs associated with different classification methods.

Let \xi \sim (a, p) and \eta \sim (c, q) be two discrete random variables, where a \in R^k, c \in R^l, p \in (0,1)^k, q \in (0,1)^l and \sum_{i=1}^{k} p_i = \sum_{j=1}^{l} q_j = 1. Define d = (d_1, ..., d_l) by d_i = c_i + E(\xi) - E(\eta), 1 \leq i \leq l, and let \eta^* \sim (d, q). Then

E(\eta^*) = \sum_{i=1}^{l} d_i q_i = \sum_{i=1}^{l} \big( c_i + E(\xi) - E(\eta) \big) q_i = E(\xi).

This implies that

DOS(c, q) = var(\eta) = var(\eta^*) = E\big[(\eta^*)^2\big] - (E\eta^*)^2 = E\big[(\eta^*)^2\big] - (E\xi)^2,

and thus

DOS(a, p) - DOS(c, q) = E(\xi^2) - E\big[(\eta^*)^2\big].

We can assume that \xi is a positive random variable, because otherwise we can consider |\xi|. We further define e = (e_1, ..., e_l) = (|d_1|, ..., |d_l|) and \zeta = |\eta^*| \sim (e, q).

PROPOSITION 1. If \xi \geq_{st} \zeta, then DOS(a, p) \geq DOS(c, q), where \geq_{st} is the usual stochastic ordering.

PROOF. Evidently E(\zeta^2) = E[(\eta^*)^2] and DOS(a, p) - DOS(c, q) = E(\xi^2) - E(\zeta^2). Therefore DOS(a, p) - DOS(c, q) \geq 0, since 0 \leq \zeta \leq_{st} \xi implies E(\zeta^2) \leq E(\xi^2).

Now consider the case when both \xi and \eta have the same probability distribution vector p. Namely, we assume \xi \sim (a, p) and \eta \sim (c, p). Once again, we define the random variable \zeta as above. To study this case, we need the partial ordering \leq^b as defined by Sobel (1954), Savage (1957) and Lehman (1966). For more details we refer to Marshall and Olkin (1979). To be self-contained, the definition of \leq^b is given as follows.
DEFINITION 2. Let \mu, \nu be two vectors of dimension k, with one obtained by a permutation of the other. The vector \mu is said to immediately precede \nu, denoted as \mu <^1 \nu, if there exist indices i < j such that \mu_i = \nu_j > \nu_i = \mu_j, and \mu_n = \nu_n for all n \neq i, j.

DEFINITION 3. We say that \mu \leq^b \nu if \mu = \nu, or if there exists a finite chain \lambda_1, ..., \lambda_n such that \mu <^1 \lambda_1 <^1 \cdots <^1 \lambda_n <^1 \nu. That is, \nu can be obtained from \mu by successive interchanges, each of which corrects an inversion of the natural order.

We also need the relative-arrangement partial order \leq^a.

DEFINITION 4. Let x, y, u, \nu be four vectors, where x and u have the same dimension k, and y and \nu have the same dimension l. We say that (x, y) \leq^a (u, \nu) if u is a permutation of x, \nu is a permutation of y, and there exist permutations \pi_1 and \pi_2 such that \pi_1 x = x^*, \pi_2 u = u^* = x^* and \pi_1 y \leq^b \pi_2 \nu, where x^* = (x_1^*, ..., x_k^*) satisfies x_1^* \geq \cdots \geq x_k^*.

Let \xi \sim (a, p) and \eta \sim (c, p). We define e and \zeta \sim (e, p) as before.

PROPOSITION 2. If (p, a) \leq^a (p, e), then DOS(a, p) \geq DOS(c, p).

PROOF. Without loss of generality, we can assume that p = p^*, i.e., p_1 \geq \cdots \geq p_k. Hence from (p, a) \leq^a (p, e) we have a \leq^b e.

First we show the result when a <^1 e, i.e., there exist i < j such that a_i = e_j > e_i = a_j and a_n = e_n for all n \neq i, j. Then

E(\xi^2) - E(\zeta^2) = (a_i^2 p_i + a_j^2 p_j) - (e_i^2 p_i + e_j^2 p_j) = e_j^2 p_i + e_i^2 p_j - e_i^2 p_i - e_j^2 p_j = (e_j^2 - e_i^2)(p_i - p_j) \geq 0,

since e_j \geq e_i \geq 0 and p_i \geq p_j. Now, if a \leq^b e, then there is a finite chain d_1, ..., d_n such that a <^1 d_1 <^1 \cdots <^1 d_n <^1 e, and the desired result follows.
Below we study the comparison of DOS when one separation can be obtained from the other, either from further separations or by combining some groups into one.
DEFINITION 5. Let G and H be two classification methods and DOS(a, p), DOS(c, q) be the associated degrees of separation. We say that H is finer than G, denoted as G <^{f_1} H, if there is an index i such that

a = (a_1, ..., a_{i-1}, a_i, a_{i+1}, ..., a_k) \in R^k,
p = (p_1, ..., p_{i-1}, p_i, p_{i+1}, ..., p_k) \in (0,1)^k,
c = (a_1, ..., a_{i-1}, a_{i1}, ..., a_{il}, a_{i+1}, ..., a_k) \in R^{k+l-1},
q = (p_1, ..., p_{i-1}, p_{i1}, ..., p_{il}, p_{i+1}, ..., p_k) \in (0,1)^{k+l-1},

satisfying

\sum_{n=1}^{l} p_{in} = p_i   and   \sum_{n=1}^{l} a_{in} p_{in} = a_i p_i.    (4)

REMARK. The modification of Definition 5 in the case of i = 1 or i = k is obvious.

DEFINITION 6. Let G and H be two classification methods and DOS(a, p), DOS(c, q) be the associated degrees of separation. We say that H is finer than G of order j, denoted as G <^{f_j} H, if there exists a chain of separations K_1, ..., K_{j-1} such that G <^{f_1} K_1 <^{f_1} \cdots <^{f_1} K_{j-1} <^{f_1} H.
PROPOSITION 3. Suppose that G <^{f_j} H for an integer j \geq 1. Then DOS(G) \leq DOS(H).

PROOF. Without loss of generality we can assume G <^{f_1} H. Let DOS(G) = DOS(a, p), DOS(H) = DOS(c, q) and \xi \sim (a, p), \eta \sim (c, q). We have

E(\eta) = \sum_{j \neq i} a_j p_j + \sum_{n=1}^{l} a_{in} p_{in} = \sum_{j \neq i} a_j p_j + a_i p_i = E(\xi)

by (4). Thus

DOS(a, p) - DOS(c, q) = E(\xi^2) - E(\eta^2) = a_i^2 p_i - \sum_{n=1}^{l} a_{in}^2 p_{in}.

Using the Lagrange multiplier method, we can see that the minimum value of \sum_{n=1}^{l} x_n^2 p_{in} subject to the constraint \sum_{n=1}^{l} x_n p_{in} = a_i p_i is achieved when x_1 = \cdots = x_l = a_i, and equals a_i^2 p_i. Hence a_i^2 p_i - \sum_{n=1}^{l} a_{in}^2 p_{in} \leq 0, and the desired result follows.

REMARK. Proposition 3 suggests that the finer the separation, the better the degree of separation. This is true in general. However, in practical applications, because the classification rule was developed based on a limited number of observations, a very
fine classification will result in errors when applied to new data. This is similar to the trade-off between the percentage of variation explained by a linear regression model and the number of variables in the model.

3. Estimation and inference procedures

To estimate DOS in Definition 1, we must estimate the vectors m and p based on data from prospective cohort studies. Estimation of the vector p is relatively straightforward. Suppose that there are n_i patients in group i; the maximum likelihood estimate of p_i is

\hat{p}_i = n_i / n,    (5)

where n = \sum_i n_i is the total number of patients in the study.

Estimation of the mean survival time m is relatively complicated and may even be infeasible because of the presence of censoring (Susarla and Van Ryzin, 1980; Gill, 1983; Zheng, 1995). An alternative to the overall mean survival time is the restricted mean life up to a suitably chosen finite time L (Irwin, 1949; Elveback, 1958; Kaplan and Meier, 1958; Karrison, 1987; Zucker, 1998; Chen and Tsiatis, 2001), which can be defined as

m_i(L) = \int_0^L S_i(t) dt    (6)

for group i, i = 1, 2, ..., g, and the restricted mean lifetime of the population can be expressed as

m(L) = \int_0^L S(t) dt = \int_0^L \sum_i S_i(t) p_i dt = \sum_i \Big( \int_0^L S_i(t) dt \Big) p_i = \sum_i m_i(L) p_i.    (7)

Under this restriction, the index measuring the degree of separation becomes

DOS(m(L), p) = \sum_i \big( m_i(L) - m(L) \big)^2 p_i.    (8)

Note that Eq. (8) has the same form as Eq. (3). Furthermore, all previous properties discussed in Propositions 1–3 still hold. In the following sections, we suppress L in our notation for DOS.

The choice of the length of time L is critical in the evaluation of the restricted mean lifetime. One approach is the use of the "effective sample size" (Kaplan and Meier, 1958). Let C_i(t) denote the survivor function of the independent and identically distributed censoring random variables in group i and C_i^-(t) be the left-continuous version of C_i(t). The largest possible observation time for group i is \tau_i = \sup[t: \min\{S_i(t), C_i(t)\} > 0]. Then \tau = \min_i(\tau_i) is the theoretically largest possible observation time for all classification groups. If \hat{S}_i(t) and \hat{C}_i(t) are the Kaplan–Meier estimators of the corresponding survivor functions, then the largest actual observed follow-up time for group i is T_i = \sup[t: \min\{\hat{S}_i(t), \hat{C}_i(t)\} > 0]. Thus,

LAO = \min_i(T_i)    (9)
is the largest actual follow-up time for all classification groups, which is a random variable and the best estimator of \tau. For any given L in (0, LAO], we can derive an estimate of the restricted mean lifetime

\hat{m}_i = \int_0^L \hat{S}_i(t) dt.    (10)

In fact, it follows from Gill (1983) and Pepe and Fleming (1991) that

\sqrt{n_i} (\hat{m}_i - m_i) \to^d N(0, \sigma_i^2)   as n_i \to \infty,   where
\sigma_i^2 = - \int_0^L \Big( \int_t^L S_i(u) du \Big)^2 \big/ \big( S_i^2(t) C_i^-(t) \big) \, dS_i(t) < \infty,    (11)

and a consistent estimator \hat{\sigma}_i^2 is found by substituting the Kaplan–Meier estimators for S_i and C_i^- in (11). An estimator of DOS is thus

\widehat{DOS} = \sum_i (\hat{m}_i - \hat{m})^2 \hat{p}_i.    (12)
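To make the estimator (12) concrete, the following sketch computes the restricted mean lifetime of each group from a hand-rolled Kaplan–Meier curve (Eq. (10)) and then combines the groups as in Eq. (12). It is only an illustration under the stated definitions, not the authors' implementation; the function and variable names are ours.

```python
import numpy as np

def km_survival(time, event):
    """Kaplan-Meier estimate; returns the distinct event times and S_hat just after each of them."""
    uniq = np.unique(time[event == 1])
    at_risk = np.array([(time >= t).sum() for t in uniq])
    deaths = np.array([((time == t) & (event == 1)).sum() for t in uniq])
    return uniq, np.cumprod(1.0 - deaths / at_risk)

def restricted_mean(time, event, L):
    """m_hat_i = integral_0^L S_hat_i(t) dt for the Kaplan-Meier step function, cf. Eq. (10)."""
    t, s = km_survival(time, event)
    grid = np.concatenate(([0.0], t[t < L], [L]))   # partition of [0, L] at the event times
    s_left = np.concatenate(([1.0], s[t < L]))      # value of S_hat on each sub-interval
    return float(np.sum(s_left * np.diff(grid)))

def dos_estimate(time, event, group, L):
    """DOS_hat = sum_i (m_hat_i - m_hat)^2 p_hat_i, cf. Eq. (12)."""
    labels = np.unique(group)
    p_hat = np.array([(group == g).mean() for g in labels])
    m_hat = np.array([restricted_mean(time[group == g], event[group == g], L) for g in labels])
    m_bar = np.sum(m_hat * p_hat)
    return float(np.sum(p_hat * (m_hat - m_bar) ** 2))
```

Here time, event and group are numpy arrays of follow-up times, event indicators (1 = death, 0 = censored) and group labels, and L is the chosen restricted time.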
The variance of the estimator (12) can be derived by the δ-method. Because

\partial DOS / \partial p_i = m_i (m_i - 2m),   \partial DOS / \partial m_i = 2 p_i (m_i - m),

and

Var(\widehat{DOS}) = \sum_i \Big[ \Big( \frac{\partial DOS}{\partial p_i} \Big)^2 Var(\hat{p}_i) + \Big( \frac{\partial DOS}{\partial m_i} \Big)^2 Var(\hat{m}_i) \Big] + \sum_{i \neq j} \frac{\partial DOS}{\partial p_i} \frac{\partial DOS}{\partial p_j} Cov(\hat{p}_i, \hat{p}_j),

we can estimate Var(\widehat{DOS}) by

\hat{\sigma}^2_{DOS} = \sum_i \Big[ \hat{m}_i^2 (\hat{m}_i - 2\hat{m})^2 \frac{\hat{p}_i (1 - \hat{p}_i)}{n} + 4 \hat{p}_i (\hat{m}_i - \hat{m})^2 \frac{\hat{\sigma}_i^2}{n} \Big] - \sum_{i \neq j} \hat{m}_i (\hat{m}_i - 2\hat{m}) \hat{m}_j (\hat{m}_j - 2\hat{m}) \frac{\hat{p}_i \hat{p}_j}{n},    (13)

with \hat{m} = \sum_i \hat{m}_i \hat{p}_i. We expect (\widehat{DOS} - DOS)/\hat{\sigma}_{DOS} \to N(0, 1) as n \to \infty.

In most clinical applications, different classification methods are developed from different studies. Comparison of their efficiency can be based on their performance in different studies as long as the patient inclusion criteria of these studies are the same. However, the most appropriate comparison of two classification methods should use the same patient sample. Thus, we focus our inference procedures on comparing the DOSs of two classification rules based on paired observations.

Let G_1 and G_2 be two classification methods resulting in g_1 and g_2 groups respectively, where g_1, g_2 \geq 2. We apply both methods to a prospective cohort and follow these patients for a period of time. Observations from such a study are summarized in Table 1.
Table 1
Layout for the joint split of G_1 and G_2

            G_2 = 1               G_2 = 2               ...   G_2 = g_2                Total
G_1 = 1     n_11 (p_11, m_11)     n_12 (p_12, m_12)     ...   n_1g2 (p_1g2, m_1g2)     n_1+ (p_1+, m_1+)
G_1 = 2     n_21 (p_21, m_21)     n_22 (p_22, m_22)     ...   n_2g2 (p_2g2, m_2g2)     n_2+ (p_2+, m_2+)
...         ...                   ...                   ...   ...                      ...
G_1 = g_1   n_g11 (p_g11, m_g11)  n_g12 (p_g12, m_g12)  ...   n_g1g2 (p_g1g2, m_g1g2)  n_g1+ (p_g1+, m_g1+)
Total       n_+1 (p_+1, m_+1)     n_+2 (p_+2, m_+2)     ...   n_+g2 (p_+g2, m_+g2)     n (1, m)

Each (i, j) cell gives n_ij, the number of subjects with G_1 = i and G_2 = j, together with (p_ij, m_ij), the probability of being in the cell and the restricted mean lifetime for that cell; a "+" sign denotes summation over the corresponding subscript.
An (i, j) cell in Table 1 represents a result in which G_1 = i and G_2 = j, and n_{ij} subjects have this classification (i = 1, 2, ..., g_1; j = 1, 2, ..., g_2). The parameters p_{ij} and m_{ij} are the probability of being in this cell and the restricted mean lifetime for this group. A "+" sign is used to denote summation over the corresponding subscript. In fact, we have

m_{i+} = \sum_j m_{ij} p_{ij} / p_{i+},   m_{+j} = \sum_i m_{ij} p_{ij} / p_{+j},   and   m = \sum_i m_{i+} p_{i+} = \sum_j m_{+j} p_{+j} = \sum_i \sum_j m_{ij} p_{ij}.

Here we may use (G_1, G_2) = \{(m_{ij}, p_{ij}),\ i = 1, 2, ..., g_1;\ j = 1, 2, ..., g_2\} to denote the joint separation (G_1, G_2). Under this framework, the degrees of separation for G_1 and G_2 are

DOS_1 = DOS(m_{\bullet +}, p_{\bullet +}) = \sum_i (m_{i+} - m)^2 p_{i+}   and   DOS_2 = DOS(m_{+\bullet}, p_{+\bullet}) = \sum_j (m_{+j} - m)^2 p_{+j},

respectively. The most common inference problem is to determine whether the two classifications have the same degree of separation, i.e.,

H_0: DOS_1 = DOS_2   versus   H_1: DOS_1 \neq DOS_2.    (14)
Alternatively we may be interested in a one-sided inference that replaces the equality in the null hypothesis with an inequality in the direction desired.
A different class of inference problems is the equivalence or non-inferiority test (Lewis, 1999; Liu et al., 2002). For example, to determine whether G_2 is not worse than G_1 within an acceptable level (non-inferiority), we can construct a one-sided null hypothesis for non-inferiority:

H_0: DOS_2 / DOS_1 \leq c_0   versus   H_1: DOS_2 / DOS_1 > c_0,    (15)

where 0 < c_0 \leq 1 is a pre-specified constant corresponding to the acceptable limit of equivalence. This null hypothesis assumes that G_1 has a larger DOS than G_2, but that G_2 has other advantages, such as low cost, that make it preferable as long as its loss in efficiency is still tolerable. Rewriting hypothesis (15) as

H_0: T = c_0 DOS_1 - DOS_2 \geq 0   versus   H_1: T = c_0 DOS_1 - DOS_2 < 0,    (16)

we have a one-sided hypothesis testing problem. When c_0 = 1, hypothesis (16) is a one-sided version of (14).

Unrestricted maximum likelihood estimates of \{p_{ij}\} are \hat{p}_{ij} = n_{ij}/n, \hat{p}_{i+} = n_{i+}/n, and \hat{p}_{+j} = n_{+j}/n. Here we need only focus on estimation of the restricted mean lifetime m_{ij} for the subgroups with p_{ij} > 0. Replacing the subscript i with ij in the definition of T_i in Eq. (9), the new LAO = \min\{T_{ij}: \hat{p}_{ij} > 0\} serves as an estimator of \tau. Similarly, we can replace the subscript i with ij in Eqs. (10) and (11) to estimate the restricted mean survival time and its variance for each cell (i, j) with p_{ij} > 0. The test statistic

\hat{T} = c_0 \widehat{DOS}_1 - \widehat{DOS}_2 = c_0 \sum_i (\hat{m}_{i+} - \hat{m})^2 \hat{p}_{i+} - \sum_j (\hat{m}_{+j} - \hat{m})^2 \hat{p}_{+j}    (17)

is constructed for hypotheses (16) in a one-sided test, or for (14) in a two-sided test with c_0 = 1. Here,

\hat{m}_{i+} = \sum_{j: \hat{p}_{ij} > 0} \hat{m}_{ij} \hat{p}_{ij} / \hat{p}_{i+},   \hat{m}_{+j} = \sum_{i: \hat{p}_{ij} > 0} \hat{m}_{ij} \hat{p}_{ij} / \hat{p}_{+j},   and   \hat{m} = \sum_{ij: \hat{p}_{ij} > 0} \hat{m}_{ij} \hat{p}_{ij}.
var T = var m var p ij + ij ∂mij ∂pij i,j
+
i,j
(i1 ,j1 ) =(i2 ,j2 )
∂T ∂T cov p i1 j1 , p i2 j2 ∂pi1 j1 ∂pi2 j2
(18)
52
Y. Lu, H. Jin and J. Mi
with
∂T = 2pij c0 mi+ − m+j + (1 − c0 )m , ∂mij ∂T = c0 mi+ (2mij − mi+ ) − m+j (2mij − m+j ) + 2(1 − c0 )mmij , ∂pij pij (1 − pij ) , var p ij = n
pi1 j1 pi2 j2 cov p , i1 j1 , p i2 j2 = − n
and
Var m ˆ ij = σij2 that is similar to Eq. (11) except for replacing subscript i with ij . We expect T Z= → N(0, 1) for n → ∞
var T when c0 DOS1 − DOS2 = 0. We will reject the null hypothesis in inference setting (14) if |Z| > z1−α/2 for c0 = 1. The corresponding asymptotic type I error is P (|Z| > z1−α/2 |DOS1 = DOS2 ) = α. Here z1−α is the 1 − α percentile for the standard normal distribution. For non-inferiority testing (16), we will reject the null hypothesis when Z < −z1−α . The corresponding asymptotic type I error is sup P (|Z| > z1−α/2 |c0 DOS1 DOS2 ) α. The type I error achieves the significant level α only when the equality in the null hypothesis is true.
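A minimal sketch of the resulting decision rules, assuming that \widehat{DOS}_1, \widehat{DOS}_2 and the δ-method variance estimate of \hat{T} from (17)–(18) have already been computed (the function and argument names are ours):

```python
import numpy as np
from scipy.stats import norm

def dos_comparison_test(dos1_hat, dos2_hat, var_T_hat, c0=1.0, alpha=0.05):
    """Z = T_hat / sqrt(var(T_hat)) with T_hat = c0*DOS1_hat - DOS2_hat.

    With c0 = 1 this is the two-sided test of H0: DOS1 = DOS2 in (14); with
    0 < c0 < 1 it is the one-sided non-inferiority test of (16), which rejects
    H0: c0*DOS1 - DOS2 >= 0 when Z < -z_{1-alpha}.
    """
    z = (c0 * dos1_hat - dos2_hat) / np.sqrt(var_T_hat)
    if c0 == 1.0:
        reject = abs(z) > norm.ppf(1.0 - alpha / 2.0)
    else:
        reject = z < -norm.ppf(1.0 - alpha)
    return z, reject
```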
4. Distribution property of test statistics under the null hypothesis

While the δ-method assures that Z approximately follows the standard normal distribution under the null hypothesis, the distributional behaviour of Z in finite samples is unknown. In this section, we use simulation to examine this behaviour for finite sample sizes. Our simulation experiments were based on random variables generated from exponential distributions. In our experiments, we assumed that two classification methods G_1 and G_2 each separated subjects into 3 prognostic groups and, consequently, into 9 subgroups according to their joint distribution as in Table 1. We further assumed that the probability of a patient being classified into any one of the 9 cells was 1/9. Time to death in each of the 9 cells followed an exponential distribution with the expectations given in Table 2. Censoring time was exponentially distributed with mean censoring time 25 and was independent of the classification results and survival times. The ratios of the two DOS indices based on the restricted mean lifetime for L = 25 were 1, 0.85 and 0.70, respectively, in the three experiments. Alternative hypotheses were two-sided for experiment 1 (c_0 = 1) and one-sided for the other two experiments (c_0 = 0.85 and 0.70, respectively).
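The following sketch generates one dataset under this design; the cell means come from Table 2 below, and the code is only an illustration of the mechanism just described, not the authors' simulation program.

```python
import numpy as np

def simulate_trial(n, cell_means, censor_mean=25.0, seed=None):
    """One simulated dataset: each subject falls in one of the 3 x 3 joint cells with
    probability 1/9, survival is exponential with the cell-specific mean, and censoring
    is independent exponential with the given mean."""
    rng = np.random.default_rng(seed)
    g1 = rng.integers(1, 4, size=n)            # classification by method G1 (1, 2, 3)
    g2 = rng.integers(1, 4, size=n)            # classification by method G2 (1, 2, 3)
    means = np.asarray(cell_means, dtype=float)[g1 - 1, g2 - 1]
    t = rng.exponential(means)                 # survival times
    c = rng.exponential(censor_mean, size=n)   # censoring times
    return g1, g2, np.minimum(t, c), (t <= c).astype(int)

# Experiment 1 of Table 2 (DOS1 = DOS2), sample size 800
exp1_means = [[100, 50, 25],
              [50, 25, 10],
              [25, 10, 5]]
g1, g2, time, event = simulate_trial(800, exp1_means, seed=1)
```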
Table 2
Mean survival times (m_ij) in three simulation experiments

Experimental conditions          Classification     Classification result of G_2
for L = 25                       result of G_1      1      2      3
EXP 1: DOS_1 - DOS_2 = 0              1             100    50     25
                                      2             50     25     10
                                      3             25     10     5
EXP 2: 0.85 DOS_1 - DOS_2 = 0         1             95     50     30
                                      2             50     25     10
                                      3             25     10     5
EXP 3: 0.70 DOS_1 - DOS_2 = 0         1             90     50     35
                                      2             50     25     10
                                      3             24     10     6

The survival time in each classification combination followed an exponential distribution with the mean survival time specified in the corresponding cell. Censoring time also followed an exponential distribution, with mean censoring time 25, and was independent of the classifications and their corresponding survival times. The proportion of subjects classified into any of the nine cells (p_ij) was 1/9.

Table 3
Percentages rejecting the null hypothesis in 500 trials when the null hypothesis of equality was true and the significance level was 5%

EXP   True data structure          Null hypothesis H_0               Sample size
                                                                     800     1800    2700
1     DOS_1 = DOS_2                DOS_1 = DOS_2                     4.2%    5.6%    5.0%
2     0.85 DOS_1 - DOS_2 = 0       0.85 DOS_1 - DOS_2 ≥ 0            6.4%    6.4%    4.8%
3     0.70 DOS_1 - DOS_2 = 0       0.70 DOS_1 - DOS_2 ≥ 0            5.4%    5.6%    4.4%

The first simulation experiment tested for equality of two DOSs and the rejection rate was based on a two-sided test. Experiments 2 and 3 tested for non-inferiority and the rejection rates were based on a one-sided test.
For each selected sample size n and set of experimental conditions, we repeatedly generated simulated data 500 times. Table 3 summarizes the percentage of trials rejecting the null hypothesis based on the normal approximation at a significance level of 5% when the null hypothesis was true. The 95% binomial confidence interval for a 5% rejection rate in 500 repetitions ranges from 3.09% to 6.91%. All type I errors in Table 3 were within this range. Figure 1 shows Q–Q plots of the test statistic Z in the three experimental conditions for sample sizes 800 and 2700, respectively. The figure shows a reasonable approximation to the standard normal distribution for the given sample sizes.

The null hypothesis in (16) is a composite hypothesis and asymptotic normality only holds when c_0 DOS_1 = DOS_2. To verify that the overall type I error rate does not exceed the pre-specified significance level, we performed additional simulation studies based on data generated under experimental condition 3 in Table 2. The true relationship between the two classification methods was 0.7 DOS_1 = DOS_2.
Fig. 1. Q–Q plots for Z-statistics derived in three simulation experiments when the null hypothesis of equality was true. The columns from left to right of this plot were for data generated under experimental conditions 1 to 3 in Table 2 and null hypotheses specified in rows of Table 3. The top row Q–Q plots were for sample size 800 in each trial and the bottom row were those of sample size 2 700.
Table 4
Percentages rejecting the null hypothesis in 500 trials when the null hypothesis of inequality was true and the significance level was 5%

EXP   True data structure          Null hypothesis H_0               Sample size
                                                                     800     1800    2700
3     0.70 DOS_1 - DOS_2 = 0       0.80 DOS_1 - DOS_2 ≥ 0            1.6%    1.2%    0.6%
3     0.70 DOS_1 - DOS_2 = 0       0.90 DOS_1 - DOS_2 ≥ 0            1.0%    0.4%    0.2%
3     0.70 DOS_1 - DOS_2 = 0       0.95 DOS_1 - DOS_2 ≥ 0            0.8%    0.0%    0.0%

Data were generated under experimental condition 3 in Table 2. Tests were for non-inferiority and the rejection rates were based on a one-sided test.
Thus, for c_0 = 0.8, 0.9, and 0.95, the null hypothesis H_0: c_0 DOS_1 - DOS_2 ≥ 0 was true. Table 4 presents the percentages of trials rejecting H_0 at a 5% significance level. The type I error rates were all below 2% and decreased as c_0 increased. Figure 2 shows the corresponding Q–Q plots of the test statistic Z for sample sizes 800 and 2700. While the test statistic Z no longer follows the standard normal distribution in this setting, it still approximately follows a normal distribution.
Fig. 2. Q–Q plots for Z-statistics derived in three simulation experiments when the null hypothesis of inequality was true. The columns from left to right of this plot were for null hypotheses for c0 = 0.8, 0.9, and 0.95, respectively, in Table 4 when data were generated from experimental condition 3 in Table 2. The top row Q–Q plots were for sample size 800 in each trial and the bottom row were those of sample size 2 700.
5. Application examples

In this section, we give two examples of applying DOS in medical research.

EXAMPLE 1. Lu et al. (2003) used the tree structured survival analysis (TSSA) method (Segal, 1988) to classify post-menopausal women into 4 subgroups with distinct distributions of time to hip fracture after their bone mineral density (BMD) measurement. The classification was derived from prospective cohort data on 7665 women enrolled in the Study of Osteoporotic Fractures (SOF) (Cummings et al., 1993, 1995). All of these women had forearm, calcaneal, hip and spine BMD measurements. Time to hip fracture after the BMD measurement was also recorded for these women and was treated as the outcome variable. A random sample consisting of 75% of the 7665 available women (the training data set) was used to generate the 4 prognostic subgroups, while the other 25% (the validation data set) was used to validate the results. Two classification rules were developed in the TSSA analyses. The first tree analysis used the risk factors age and BMD of the hip and calcaneus and was referred to as Model 1 in the paper. The second tree analysis replaced calcaneal BMD with age and was referred to as Model 2. Model 1 was the optimal classification according to the TSSA method. Model 2, however, replaced calcaneal BMD by age to avoid an additional BMD examination.
Table 5
The restricted time L, the acceptable limit for non-inferiority c_0, and the corresponding p-value for Example 1

Data               L       c_0     p-value
Generation data    1825    0.65    0.032
                   2000    0.70    0.037
                   2350*   0.75    0.040
Validation data    1825    0.49    0.050
                   2000    0.50    0.042
                   2287*   0.55    0.045
If the DOS of Model 2 differs from that of Model 1 by an acceptable amount, Model 2 would be a preferred, non-inferior alternative to Model 1 for economic reasons.

Using the testing procedures proposed in Section 3, we selected the restricted time L to be 1825 days (5 years) and 2000 days. The observed LAO of the generation and the validation data were 2350 and 2287 days, respectively. For a given L, we also determined the value of c_0 for which the one-sided p-value was below 5%. These results are summarized in Table 5. The table suggests that the acceptable level c_0 increases with L at the same significance level: the longer the restricted time L, the more stringent the non-inferiority threshold we can impose. Thus, the observed LAO is a reasonable choice of the restricted time in this case.

EXAMPLE 2. Sevin et al. (1996) proposed a new staging system for early cervical carcinomas, also based on the tree structured survival analysis (TSSA) method (Segal and Bloch, 1989), using data on 301 patients. They further compared their staging system with another system proposed by Kamura et al. (1993). Both methods classified patients into three different prognostic groups based on disease-free survival time for the same patient population. We used this example to illustrate our proposed test for superiority. We selected a restricted time L of 60 months (5 years) and rejected the null hypothesis of equal DOS at a significance level of 5%. The actual p-value was 0.6%, leading to the conclusion that the two prognostic staging systems differed in their classification efficiency for these data. The DOS of the classification resulting from TSSA was 40.53 in comparison to 2.33 for Kamura's, suggesting that the TSSA classification is significantly better than that of Kamura in this patient population. If we changed L to 71 months, we obtained a lower p-value of 0.3%, again indicating stronger statistical efficiency with increasing restricted time L.
6. Discussion and conclusion

In this paper, we introduced an index, DOS, to measure the efficiency of prognostic separation by a classification method based on restricted mean lifetimes. This index was used to compare the efficiency of two classification methods with survival time as the endpoint. We provided mathematical relationships between the partial ordering of DOS and classification methods. We proposed estimation and testing procedures for two DOSs
based on paired data. Our simulation results suggest that the proposed testing procedures have the desired type I error rate under reasonable study sample sizes. Clinical examples demonstrated their use in medical research.

Comparing the mean survival times of two survival curves has been studied by many investigators (Pepe and Fleming, 1989, 1991; Mukhopadhyay and Chattopadhyay, 1991; Shen and Fleming, 1997; Murray, 2001). However, none of the previous research has addressed the problem of comparing the variations in mean survival times obtained from two classification methods. While DOS is one measure of such variation, other choices are possible and may be more appropriate under various conditions. Further studies of different indices for classification efficiency, such as the average L_p Wasserstein distances between the survival curves of the classified groups, or indices based on variation in distribution quantiles such as median survival times, are warranted.

The limiting time L is critical in comparing two DOS indices. Because of the non-parametric nature of our approach and the limitations introduced by censoring, we cannot compare mean survival times beyond LAO, a random variable estimating the time \tau, the maximum positive support of the survival and censoring density functions. Traditionally, L is chosen after observation of the data. Karrison (1987) suggested choosing the largest L consistent with an effective sample size greater than two-thirds of the total sample size, that is, \hat{n}(L) > (2/3) n, where \hat{n}(t) = \hat{S}(t)(1 - \hat{S}(t)) / \hat{V}(\hat{S}(t)). However, to use the non-inferiority testing procedure proposed in this paper, we need only choose an L \leq LAO. Our example shows that LAO itself is a good choice of the restricted time. The test efficiency increased with the restricted time L. This is because the survival differences increased with the length of follow-up in our examples; the longer the L, the larger the difference in the survival curves. This may not be true, however, for a classification that results in survival curves that cross each other at a later time.

Another difficulty is the estimation of the survival functions. In Section 3, we used the Kaplan–Meier estimator to estimate the restricted mean lifetime for classification combinations of the two classification methods. Although the marginal and overall restricted mean lifetimes are mathematically linear combinations of the restricted mean lifetimes of these cells, the Kaplan–Meier estimates of the marginal and overall restricted mean lifetimes obtained from the corresponding marginal or overall data are not identical to the linear averages of the cell means obtained from the mathematical formulas. However, our simulations showed little difference between these two approaches. An alternative is the estimator defined in (1.10) of Susarla and Van Ryzin (1980), which possesses the same asymptotic normality property as the Kaplan–Meier estimator, although their properties differ for small sample sizes (Park, 1987).

We propose the use of DOS to compare two pre-established classification methods, each of which yields only a finite number of groups. Our Proposition 3 suggests that DOS increases with further separation of patients. Therefore, DOS is not suitable as a criterion for developing classification rules. It is possible to add a penalty function to DOS based on the number of resulting groups, similar to the AIC in linear models. However, this is beyond the scope of our paper.
Acknowledgement

We thank Mr. David Breazeale for his editorial help. We also thank Dr. Ying-Qing Chen for helpful discussions. The study and the first two authors are supported by a grant from the National Institutes of Health (R03 AR47104).
References Chen, P.-Y., Tsiatis, A.A. (2001). Causal inference on the difference of the restricted mean lifetime between two groups. Biometrics 57, 1030–1038. Cummings, S., Black, D., Nevitt, M., Browner, W., Cauley, J., Ensrud, K., Genant, H., Hulley, S., Palermo, L., Scott, J., Vogt, T. (1993). Bone density at various sites for prediction of hip fractures: The study of osteoporotic fractures. Lancet 341, 72–75. Cummings, S., Nevitt, M., Browner, W., Stone, K., Fox, K., Ensrud, K., Cauley, J., Black, D., Vogt, T. (1995). Risk factors for hip fracture in white women. Study of osteoporosis research group (see comments). New England J. Medicine 332, 767–773. Elveback, L. (1958). Estimation of survivorship in chronic disease: The ‘Actuarial’ method. J. Amer. Statist. Assoc. 53, 420–440. Esserman, L., Hylton, N., Yassa, L., Barclay, J., Frankel, S., Sickles, E. (1999). Utility of magnetic resonance imaging in the management of breast cancer: Evidence for improved preoperative staging. J. Clin. Oncol. 17, 110–119. Gill, R. (1983). Large sample behaviour of the product-limit estimator on the whole line. Ann. Statist. 11, 49–58. Irwin, J.O. (1949). The standard error of an estimate of expectational life. J. Hygiene 47, 188–189. Kamura, T., Tsukamoto, N., Tsuruchi, N., Kaku, T., Saito, T., To, N., Akazawa, K., Nakano, H. (1993). Histopathologic prognostic factors in stage IIb cervical carcinoma treated with radical hysterectomy and pelvic-node dissection – An analysis with mathematical statistics. Internat. J. Gynecol. Cancer 3, 219– 225. Kaplan, E.L., Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53, 457–481. Karrison, T. (1987). Restricted mean life with adjustment for covariates. J. Amer. Statist. Assoc. 82, 1169– 1176. Kashani-Sabet, M., McMillan, A., Zackheim, H. (2001). A modified staging classification for cutaneous T-cell lymphoma. J. Amer. Acad. Dermatol. 45, 700–706. Lehman, E.L. (1966). Some concepts of dependence. Ann. Math. Statist. 37, 1137–1153. Lewis, J. (1999). Statistical principles for clinical trials (ICH E9): An introductory note on an international guideline. Statist. Medicine 18, 1903–1942. Liu, J.-P., Hsueh, H.-M., Hsieh, E., Chen, J.J. (2002). Tests for equivalence or non-inferiority for paired binary data. Statist. Medicine 21, 231–245. Lu, Y., Black, D., Mathur, A.K., Genant, H.K. (2003). Study of hip fracture risk using tree structured survival analysis. J. Miner. Stoffwechs 10 (1), 11–16. Lu, Y., Stitt, F. (1994). Using Markov processes to describe the prognosis of HIV-1 infection. Medical Decision Making 14, 266–272. Marshall, A.W., Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York. Mukhopadhyay, N., Chattopadhyay, S. (1991). Sequential methodologies for comparing exponential mean survival times. Sequential Anal. 10, 139–148. Murray, S. (2001). Using weighted Kaplan–Meier statistics in nonparametric comparisons of paired censored survival outcomes. Biometrics 57, 361–368.
Park, D.H. (1987). Small sample study on estimators of life distributions and of mean survival times using randomly censored data. Commun. Statist. Part B Simulation Comput. 16, 221–232. Pepe, M.S., Fleming, T.R. (1989). Weighted Kaplan–Meier statistics: A class of distance tests for censored survival data. Biometrics 45, 497–507. Pepe, M.S., Fleming, T.R. (1991). Weighted Kaplan–Meier statistics: Large sample and optimality considerations. J. Roy. Statist. Soc. Ser. B (Methodological) 53, 341–352. Reynolds, T. (2002). Updates to staging system reflect advances in imaging, understanding. JNCI Cancer Spectrum 94, 1664–1666. Savage, I. (1957). Contributions to the theory of rank order statistics-the “trend” case. Ann. Math. Statist. 28, 968–977. Segal, M. (1988). Regression trees for censored data. Biometrics 44, 35–47. Segal, M., Bloch, D. (1989). A comparison of estimated proportional hazards models and regression trees. Statist. Medicine 8, 539–550. Sevin, B.-U., Lu, Y., Nadji, M.N., Bloch, D., Koechli, O.R., Averette, H.A. (1996). Surgically defined prognostic parameters in early cervical carcinoma: A tree structured survival analysis. Cancer 78, 1438–1446. Shen, Y., Fleming, T.R. (1997). Weighted mean survival test statistics: A class of distance tests for censored survival data. J. Roy. Statist. Soc. Ser. B (Methodological) 59, 269–280. Sobel, M. (1954). On a generalization of an inequality of Hardy, Littlewood and Polya. Proc. Amer. Math. Soc. 5, 596–602. Stitt, F.W., Lu, Y., Dickinson, G.M., Klimas, N.G. (1991). Automated severity classification of AIDS hospitalizations. Medical Decision Making 11, S41–S45. Susarla, V., Van Ryzin, J. (1980). Large sample theory for an estimator of the mean survival time from censored samples. Ann. Statist. 8, 1002–1016. Therasse, P.A., Arbuck, S.G., Eisenhauer, E.A., Wanders, J., Kaplan, R.S., Rubinstein, L., Verweij, J., Van Glabbeke, M., van Oosterom, A.T., Christian, M.C., Gwyther, S.G. (2000). New guidelines to evaluate the response to treatment in solid tumors (see comments). J. National Cancer Instit. 92, 205–216. Zhang, H., Singer, D. (1999). Recursive Partitioning Tree and Application in Health Sciences. Springer, New York. Zheng, Z. (1995). Two methods of estimating the mean survival time from censored samples. Sankhya Ser. A Indian J. Statist. 57, 126–136. Zucker, D.M. (1998). Restricted mean life with covariates: Modification and extension of a useful survival analysis method. J. Amer. Statist. Assoc. 93, 702–709.
Handbook of Statistics, Vol. 23
ISSN: 0169-7161
© 2004 Elsevier B.V. All rights reserved.
DOI 10.1016/S0169-7161(03)23004-2

Ch. 4. Time-Varying Effects in Survival Analysis

Thomas H. Scheike
1. Time-varying effects in survival analysis

Since Aalen's Ph.D. Thesis in 1975 (Aalen, 1975 and 1978) many authors have considered event time data in the counting process setup which is described briefly below. See Brémaud (1981) and Jacobsen (1982) for thorough general mathematical expositions on counting process theory, and the more recent expositions by Fleming and Harrington (1991) and Andersen et al. (1993). Assume that n independent subjects are observed over some period of time [0, \tau]. For each subject a counting process, N_i(t), that gives the number of events occurring before time t is observed together with possibly additional information in terms of p-dimensional covariates X_i(t). Models for survival data, or more generally counting process data, are very conveniently formulated through the intensity process \lambda_i(t), which is defined as

\lambda_i(t) = \lim_{h \to 0} P(N_i(t + h) - N_i(t) \geq 1 \mid \text{past}) / h.
We make the usual regularity assumptions (see Andersen et al., 1993) and the additional assumption that subjects are independent and identically distributed.

The aim of this exposition is to present some recent developments that deal with time-varying effects of covariates. The work presented here summarizes and extends some of the material in the Doctoral thesis of Scheike (2002b). In medical studies, it is often expected that effects are time-varying, and since traditional models do not allow for a natural description of such effects there is a need to extend these models. One reason for the limited use of the few models that do exist is probably the lack of inferential tools to go along with the non-parametric time-varying effects that may be estimated in some models. I here suggest a procedure for making inference about non-parametric time-varying effects that is easy to apply in practice. One typically wishes to test whether an effect is in fact time-varying, or significant at all. Another limitation of the few existing models that deal with time-varying effects is that these models typically force all effects to be time-varying. I here also emphasize the use of semi-parametric models where some effects are time-varying and some are time-constant, thus giving the extended flexibility only for effects where a simple description is not possible.
Time-varying effects may be modelled completely non-parametrically by a general intensity model

\lambda_i(t) = \lambda(t, X_i(t)).

Smoothing techniques have been suggested for estimation of \lambda(\cdot); see, e.g., Nielsen and Linton (1995) and the references therein. Such a model may be useful when the number of covariates is small compared to the amount of data, but the generality of the model makes it difficult to get a clear, if any, conclusion about covariate effects. McKeague and Utikal (1991) considered the model as a tool for goodness-of-fit procedures.

One often finds that the data may be summarized sufficiently well by a regression model where covariate effects are expressed through regression coefficients. I here consider regression models that can be written in the form

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_i^T(t) \beta),    (1)

where \beta is a p-dimensional vector of regression coefficients, Y_i(t) is a predictable at-risk indicator taking values 0 or 1, \alpha(t) is a non-parametric function and h is a known link function. One choice of h is h(x) = exp(x), which leads to Cox's regression model (slightly re-parametrized) (Cox, 1972); another choice, h(x) = x, leads to a semi-parametric version of the additive risk model (see McKeague and Sasieni, 1994, or Lin and Ying, 1994) that is a special case of the additive risk model of Aalen (1980) considered below.

An immediate extension of the above model assumes that the intensity is of the dynamic additive regression form

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_i^T(t) \beta(t)),    (2)

where \beta(t) is a p-dimensional (suitably well behaved) function of time-varying regression coefficients. We shall show how to deal with this model for the two choices of h just mentioned. The focus of the presentation given here is to provide efficient estimators that can be used for inferential purposes. Two important hypotheses that one often has interest in examining are that one of the regression coefficient functions is constant, H_c: \beta_j(t) \equiv \beta_j, or that a regression coefficient is non-significant, H_0: \beta_j(t) \equiv 0.

To examine for time-varying effects in model (2) it is often preferable to use a number of successive tests that investigate the covariates one at a time, rather than just doing an overall test for all covariates at once, such as H_{gc}: \beta(t) \equiv \beta. Several authors have considered testing of the overall hypothesis, H_{gc}: \beta(t) \equiv \beta, in the context of the proportional hazards model (see, e.g., Grambsch and Therneau (1994), Wei (1984), Lin et al. (1993), Murphy (1993) and Marzec and Marzec (1997)). The approach of Murphy is related to the approach presented here and also makes it possible to consider the effects individually. Successive tests make it possible to use backwards model selection procedures to investigate for time-varying effects. I here present a procedure for successive testing in the dynamic additive intensity model (2) for two choices of h, relying on work by Scheike (2002a) for h(z) = z, and Martinussen et al. (2002) and Scheike and Martinussen (2002) for h(z) = exp(z), as well as some new work presented here. Successive testing is more complicated than the overall testing because it necessitates
the study of the semi-parametric sub-model

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_i^T(t) \beta(t) + Z_i^T(t) \gamma),    (3)

where \beta(t) is a p-dimensional time-varying regression coefficient, \gamma is a q-dimensional regression coefficient, and (X_i(t), Z_i(t)) is a set of (p + q)-dimensional predictable covariate vectors.

In practice, the standard procedures to check for time-varying effects of covariates are based either on graphical procedures or on tests for an extended model with time-varying covariates. To illustrate the use of the standard procedures, consider the simple model where the true intensity has the form

\tilde{\lambda}_i(t) = Y_i(t) h(\alpha(t) + X_i \beta(t) + Z_i \gamma),    (4)

where both X_i and Z_i are 1-dimensional. A typical graphical procedure for investigating whether Z_i has a time-varying effect is based on stratifying Z_i into S strata and then fitting the stratified model

\lambda_i(t) = Y_i(t) h(\alpha_s(t) + X_i \beta)   for Z_i in stratum s, s = 1, ..., S.    (5)

A closer examination of the estimates of \alpha_s(t), to see whether these baseline intensities are consistent with the hypothesized baseline \alpha(t) + Z_i \gamma, is then used to check if the effect of Z_i depends on time. In reality, one only investigates the global hypothesis that all covariate effects are not time-varying.

EXAMPLE. To illustrate that the graphical procedure does not say much about the time-varying nature of individual covariates, consider the following little example. Assume that the true intensity is

\tilde{\lambda}_i(t) = Y_i(t) (0.5 + X_i f(t) + Z_i \cdot 0.3),

where f(t) = 3 for t < 0.5 and 0 otherwise. Covariates X_i and Z_i were the absolute values of a 2-dimensional standard normal with mean 0, variance 1 and correlation 0.7. 400 independent realizations of survival times, censored at time 3, were simulated; this led to approximately 15% censoring. The data were then stratified into three strata according to the value of Z_i, and the stratified model (5) was fitted. If the model

\lambda_i(t) = Y_i(t) (\alpha(t) + X_i \beta + Z_i \gamma)

is true, then stratifying into S strata leads to the approximate baselines \alpha(t) + \gamma_s, s = 1, ..., S, implying that the difference between any two of the cumulative baselines should be a straight line. Figure 1 shows the differences between the three cumulative baselines versus time. The plot suggests that the differences between the cumulative baselines are not linear in time. One difficulty with the graphical procedure is the lack of a description of variability, which, however, in principle can be constructed under the null. In conclusion, the plots reveal the lack of fit of the model but say nothing about the time-varying effect of Z.
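Because the additive hazard in this example is piecewise constant in time for a given subject, the data-generating mechanism can be simulated by inverting the cumulative hazard directly. The sketch below is our own illustration of that mechanism (not the code behind Figure 1):

```python
import numpy as np

def simulate_additive_example(n, seed=None):
    """Simulate from lambda_i(t) = 0.5 + X_i * f(t) + 0.3 * Z_i with f(t) = 3 for t < 0.5
    and 0 otherwise, (X, Z) the absolute value of a correlated standard normal pair,
    and administrative censoring at time 3."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.7], [0.7, 1.0]])
    xz = np.abs(rng.multivariate_normal([0.0, 0.0], cov, size=n))
    x, z = xz[:, 0], xz[:, 1]
    h_early = 0.5 + 3.0 * x + 0.3 * z          # hazard on [0, 0.5)
    h_late = 0.5 + 0.3 * z                     # hazard on [0.5, infinity)
    e = rng.exponential(1.0, size=n)           # T solves H(T) = e for a unit exponential e
    t = np.where(e <= 0.5 * h_early,
                 e / h_early,
                 0.5 + (e - 0.5 * h_early) / h_late)
    return x, z, np.minimum(t, 3.0), (t <= 3.0).astype(int)
```

Data generated this way can then be stratified on Z and fitted with the stratified model (5) to mimic the comparison of cumulative baselines shown in Figure 1.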
Fig. 1. Difference between stratified cumulative baselines.
Lin et al. (1993) (see also Wei, 1984) suggested some important omnibus tests for the goodness-of-fit of the proportional hazards model (Cox's regression model). They suggest that one should consider the behaviour of the components of the score process, as well as other cumulative processes. Later, their important bootstrapping technique is appropriately modified to the extended setting of the semi-parametric models considered here. Denoting the score process under Cox's regression model by U(\beta, t), they suggest a technique for approximating the distribution of the observed score process U(\hat{\beta}, t), where \hat{\beta} is the estimator of the regression coefficients. A formal test for how well the observed score process behaves under the null can be constructed by considering the different components of the observed score and computing, say,

S_j = \sup_t | U_j(\hat{\beta}, t) |,   j = 1, ..., p,
whose percentiles can be simulated under the null.

EXAMPLE. Consider the true model

\tilde{\lambda}_i(t) = Y_i(t) \exp(0.5 + X_i \beta_1(t) + Z_i \cdot 0.3),

where \beta_1(t) = 4t for t < 0.2 and \beta_1(t) = 0.8 - (t - 0.2) for t \geq 0.2. Covariates X_i and Z_i were 2-dimensional standard normal with mean 0, variance 1 and correlation 0.7. The survival times were censored at time 3. The simulation is based on 400 independent realizations. Simulating the score processes under the null hypothesis, H_0: \beta_1(t) \equiv \beta_1, \beta_2(t) \equiv \beta_2, gives random realizations of the score process under the null. Figure 2 shows 50 such random realizations as well as the observed score process. The example given here resulted in p-values of 0.00 and 0.00 for the observed score processes for X and Z, respectively. The score process test thus reveals the lack of fit, but cannot be used to state anything about the covariates individually.

Alternatively, if the effect of Z_i is believed to be approximately of log-type, a standard approach is to consider the model

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_i \beta + Z_i \gamma + Z_i \log(t) \theta)    (6)

and use the test of H: \theta = 0 as a test for time-varying effects of Z_i. This test is also generally invalid as a test for the time-varying effect of Z, owing to the incorrect modelling of X.
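Returning to the omnibus statistic S_j above, the following sketch evaluates the observed Cox score process U(\hat{\beta}, t) at the event times and takes the componentwise supremum; \hat{\beta} is assumed to come from an ordinary Cox regression fit, and the resampling step of Lin et al. (1993) that supplies the null percentiles is not shown. The function names are ours.

```python
import numpy as np

def cox_score_process(time, event, X, beta_hat):
    """Observed Cox score process U(beta_hat, t): the cumulative sum, over event times
    T_i <= t, of X_i minus the exp(X beta_hat)-weighted mean of X over the risk set."""
    order = np.argsort(time)
    time, event, X = time[order], event[order], X[order]
    risk = np.exp(X @ beta_hat)
    U = np.zeros(X.shape[1])
    event_times, score_path = [], []
    for i in range(len(time)):
        if event[i] == 1:
            in_risk = time >= time[i]
            w = risk[in_risk]
            xbar = (X[in_risk] * w[:, None]).sum(axis=0) / w.sum()
            U = U + (X[i] - xbar)
            event_times.append(time[i])
            score_path.append(U.copy())
    return np.array(event_times), np.array(score_path)

# S_j = sup_t |U_j(beta_hat, t)|, one value per covariate:
#   times, U = cox_score_process(time, event, X, beta_hat)
#   S = np.abs(U).max(axis=0)
```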
Fig. 2. Score processes for covariates X and Z (thick lines) and 50 random realizations under null hypothesis.
The model

\lambda_i(t) = Y_i(t) (\alpha(t) + X_i \beta + Z_i \gamma + Z_i f(t) \theta),

where f(t) is assumed known, was used to test the hypothesis H: \theta = 0. The test of H: \theta = 0 was rejected in all of 500 repetitions, thus leading to an observed level of 1, far away from the significance level, which was chosen as 0.05. The test for time-varying effects based on (2), developed later, resulted in an observed rejection rate of 0.352 for X and 0.048 for Z, thus showing some ability to correctly reject the hypothesis that X has a time-constant effect, while giving the correct level for Z. The power for X could be improved within the suggested framework.

To summarize, the standard procedures do not work for testing the time-varying effect of an individual covariate (Z), since they are based on an incorrect modelling assumption about the effect of the other covariates (X). Both (5) and (6) assume that the effect of X_i is time-constant and therefore cannot be used as a basis for testing the time-varying effect of Z. When the effect of X is strongly time-varying, both procedures may thus lead to incorrect conclusions if used to investigate for time-varying effects of Z. In Scheike and Martinussen (2002) it was demonstrated that the test procedure (6) can lead to seriously biased tests in the proportional setting (h(z) = exp(z)). It is therefore important to remember, although one often forgets, that most test procedures in general only say something about the overall goodness-of-fit or about the hypothesis H_{gc}: \beta(t) \equiv \beta.

To investigate the covariates individually for time-varying effects by a correct procedure one may start by considering the model

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_i \beta_1(t) + Z_i \beta_2(t))
and in this model test the hypothesis H: \beta_2(t) \equiv \gamma; if this is accepted, one may then proceed to test, within model (4), the hypothesis H: \beta(t) \equiv \beta. To accommodate the successive testing approach, the general procedure, pursued for both the proportional and the additive risk model, is to provide an efficient estimation procedure for the semi-parametric model where it is assumed that

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_i^T(t) \beta(t) + Z_i^T(t) \gamma).

Estimators of \gamma and of B(t), the cumulative regression coefficient function B(t) = \int_0^t \beta(s) ds, are suggested. Based on \hat{B}(t) an estimator of \beta(s) may be derived, but when the focus is on inferential procedures, the cumulative regression coefficients are preferable to the regression coefficients themselves. Direct estimation of \beta(s) has been studied in a number of papers for model (2); see the references given below. Given the estimators for the semi-parametric model one may proceed to construct a test for time-varying effects of one of the components of X_i by comparing, e.g., the fit of the model

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_{i,1}^T(t) \beta_1(t) + X_{i,2}^T \beta_2(t) + Z_i^T(t) \gamma)

with that of the model under the null (H_0: \beta_2(t) \equiv \beta_2)

\lambda_i(t) = Y_i(t) h(\alpha(t) + X_{i,1}^T(t) \beta_1(t) + X_{i,2}^T \beta_2 + Z_i^T(t) \gamma).

One simple way of doing this is to compare \hat{B}_2(t) with \hat{\beta}_2 t.

The rest of the paper is structured as follows. Section 2 deals with estimation for the proportional and additive models. Section 3 develops an approach for inferential procedures which makes it possible to perform successive tests for covariate effects. Section 4 gives a worked example for the malignant melanoma data of Andersen et al. (1993). Finally, Section 5 contains some discussion.

2. Estimation for proportional or additive models

In this section it is shown how to estimate the parameters of the dynamic additive intensity model for the proportional and the additive hazards models. The proportional and additive models give rise to estimation procedures that are very similar. One important difference between the two models, however, is that due to the non-linearity of the proportional model a smoothing parameter is needed to deal with the non-parametric effects, whereas this can be avoided for the additive hazards model. Essentially, the proportional model is linearized in an iterative procedure in which each step is equivalent to the additive estimation procedure. Although the proportional model makes it more difficult to estimate non-parametric effects, the model has the advantage of always leading to positive hazards. I also present some new models that combine additive and proportional models.

Some technical conditions are needed. These conditions, roughly speaking, consist of the assumption of independent and identically distributed subjects, with covariates that are suitably bounded and lead to a full design for all time points (asymptotically). Regression coefficients are suitably smooth: twice continuously differentiable for the
proportional model and locally integrable for the additive model. More precise conditions are given in the cited work, but I here aim at presenting the main arguments rather than a more technical description.

Some notation is also needed. Define the cumulative intensity \Lambda_i(t) = \int_0^t \lambda_i(s) ds, such that M_i(t) = N_i(t) - \Lambda_i(t) is a martingale. Let further N(t) = (N_1(t), ..., N_n(t)) be an n-dimensional counting process, \Lambda(t) = (\Lambda_1(t), ..., \Lambda_n(t)) its compensator, \lambda(t) = (\lambda_1(t), ..., \lambda_n(t)) an n-dimensional intensity process, M(t) = (M_1(t), ..., M_n(t)) the n-dimensional martingale, and F_t the history generated by the observed processes. When each individual has predictable p-dimensional covariates X_i(t), we organize them into an n × p matrix X(t) = (X_1(t), ..., X_n(t))^T; finally, D(w_i) denotes a diagonal matrix of dimension n × n with diagonal elements w_1, ..., w_n.

2.1. Cox's proportional regression model

Models for the intensity have been dominated by Cox's regression model (Cox, 1972; Andersen and Gill, 1982), which states that the intensity is of the form

\lambda_i(t) = Y_i(t) \lambda_0(t) \exp(X_i(t)^T \beta),

where Y_i(t) is a predictable at-risk indicator and X_i(t) \in R^p is a predictable covariate vector. The baseline \lambda_0(t) gives the intensity for an individual with covariates equal to 0. If the covariates are time-independent, the model stipulates that the hazard rates for different values of the covariates are proportional. When studying treatment effects one often encounters effects that change over time, and one of the limitations of Cox's regression model is that the baseline intensity is the only parameter that depends on time. It is therefore natural to consider an extended model

\lambda_i(t) = Y_i(t) \lambda_0(t) \exp(X_i(t)^T \beta(t)),    (7)

where \beta(t) \in R^p is a suitably smooth p-dimensional function of regression coefficients. This model has been studied by a number of authors, e.g., Zucker and Karr (1990), Murphy and Sen (1991), Hastie and Tibshirani (1993) and Grambsch and Therneau (1994). Murphy and Sen (1991) study a histogram sieve estimator of the cumulative time-varying regression coefficients (B(t) = \int_0^t \beta(s) ds). More recently Pons (2000) and Cai and Sun (2002) have suggested local estimation techniques. The work presented here is based on Martinussen et al. (2002) (MSS) and Scheike and Martinussen (2002). MSS deal with a slightly re-parametrized version of the Cox model that leads to formulae that are easier to relate to the additive model presented later. For a positive baseline, model (7) can be written as

\lambda_i(t) = Y_i(t) \exp(X_i(t)^T \beta(t)),    (8)

incorporating the baseline in the \beta(t) vector and extending the design vector. The model is very flexible and may be thought of as a one-term Taylor expansion of \log(\lambda_i(t, x_i(t))) around the 0 covariate. With the aim of presenting inferential procedures, focus is on estimation of the cumulative regression coefficients since these accommodate inferential procedures. Important
hypotheses are to decide whether a regression coefficient function is (significantly) time-varying or time-constant, and whether a regression coefficient is significant or not. This can be done by modifying the elegant and very useful bootstrapping technique suggested by Lin et al. (1993, 1994) that makes it possible to obtain approximate percentiles for complicated distributions.

To efficiently estimate the cumulative regression coefficient functions B(t) = \int_0^t \beta(s) ds, MSS suggested a simple iterative procedure derived from the score equations of the likelihood based on the parametrization (8). The score equations can be written conveniently as

\int X(t)^T (dN(t) - \lambda(t) dt) = 0,

which, combined with a Newton–Raphson iteration procedure and smoothing, defines an estimation procedure. The method depends on the choice of smoothing parameter. Following MSS, this leads to the iteration step \hat{B}^{(k+1)} = g(\hat{B}^{(k)}) for the cumulative regression coefficients, where

g(\hat{B})(t) = \int_0^t \tilde{\beta}(s) ds + \int_0^t A(s)^{-1} X(s)^T dN(s) - \int_0^t A(s)^{-1} X(s)^T \tilde{\lambda}(s) ds,    (9)

where \tilde{\lambda}(s) = (\tilde{\lambda}_1(s), ..., \tilde{\lambda}_n(s)), \tilde{\lambda}_i(t) = Y_i(t) \exp(X_i(t)^T \tilde{\beta}(t)), and

A_\beta(t) = \sum_i Y_i(t) \exp(X_i(t)^T \beta(t)) X_i(t) X_i(t)^T

is a p × p matrix with A = A_{\tilde{\beta}}. For simplicity, \tilde{\beta}(t) is taken to be a simple kernel estimator of \beta(t), that is,

\tilde{\beta}(t) = b^{-1} \int K\big( (u - t)/b \big) d\hat{B}(u),

with K a continuous kernel with support [-1, 1] satisfying \int K(u) du = 1 and \int u K(u) du = 0, and b the bandwidth.

Iterating leads to an estimator \hat{B}(t) that can be shown to be consistent and asymptotically Gaussian under a set of regularity conditions. These regularity conditions include the assumption that the regression coefficients \beta(t) are twice continuously differentiable, and that the smoothing is asymptotically such that b \sim n^{-\alpha} for 1/8 < \alpha < 1/4. The choice of the smoothing parameter is a serious practical problem; it is also present for the other suggested approaches to this model, and is due to the non-linearity of the model.

Model (8) is very flexible and sometimes too large for medium to small data sets. Therefore, and to further summarize effects that are not time-dependent, MSS also deal with the semi-parametric model

\lambda_i(t) = Y_i(t) \exp(X_i(t)^T \beta(t) + Z_i(t)^T \gamma),    (10)
where X_i(t) \in R^p and Z_i(t) \in R^q are predictable covariate vectors. This model is extremely useful when one wishes to establish inferential procedures, as already indicated in the introduction. The score equations for \beta(t) and \gamma are

\int X(t)^T (dN(t) - \lambda(t) dt) = 0,   \int Z(t)^T (dN(t) - \lambda(t) dt) = 0.

The equations can be used to give an efficient estimation procedure when a Newton–Raphson procedure is combined with smoothing. The score equation for \gamma, after a linear Taylor expansion about a set of initial estimates \tilde{\gamma} and \tilde{B}(t), leads to the iteration step

g_\gamma(\tilde{\gamma}) = \tilde{\gamma} + \Big( \int_0^\tau Z(t)^T G(t) W(t) Z(t) dt \Big)^{-1} \int_0^\tau Z(t)^T G(t) (dN(t) - \tilde{\lambda} dt),    (11)

where G(t) = I - W(t) X(t) (X(t)^T W(t) X(t))^{-1} X(t)^T, W(t) = D(\tilde{\lambda}_i(t)) and \tilde{\lambda}_i(t) = \exp(X_i(t)^T \tilde{\beta}(t)). Similarly, the updating step for B is

g_B(\hat{B})(t) = \int_0^t \tilde{\beta}(s) ds + \int_0^t \big( X(s)^T W(s) X(s) \big)^{-1} X(s)^T \big[ dN(s) - \tilde{\lambda} ds - W(s) Z(s) (g_\gamma(\tilde{\gamma}) - \tilde{\gamma}) ds \big].    (12)

These iterations lead to consistent and efficient estimates of the cumulative regression functions B(t) = \int_0^t \beta(s) ds and of \gamma. Under regularity conditions (see MSS) the asymptotic distribution of the cumulative estimators can be derived and variances can be estimated. Further,

n^{1/2} (\hat{\gamma} - \gamma) \to^D V   as n \to \infty,

where V is a mean-zero normal variable with variance \Sigma_{11}(\tau), and

n^{1/2} (\hat{B} - B)(t) \to^D W(t)   as n \to \infty in D[0, \tau]^p,

where W(t) is a mean-zero Gaussian process with variance \Sigma_{22}(t). The covariance between V and W(t) is \Sigma_{12}(t). Estimators for these covariances are given in MSS, but in Section 3 I provide some alternative simple and robust estimators that have not been presented before. The regularity conditions include assumptions about the smoothness of \beta(t), as well as the undersmoothing assumption that the smoothing parameter must satisfy b \sim n^{-\alpha} for 1/8 < \alpha < 1/4. Non-parametric effects are studied on the cumulative scale since variability is most simply described on this scale. Based on the asymptotic description of the cumulative regression effects under the semi-parametric model, it is possible to construct successive tests for time-dependence of covariate effects. A detailed study of how to do this was
carried out in Scheike and Martinussen (2002), and in the next section these tests are presented.

2.2. Aalen's additive model

An alternative to the proportional hazards model of Cox and the semi-parametric extension considered in the previous sections is Aalen's additive hazards model (Aalen, 1980, 1989; McKeague, 1988; Huffer and McKeague, 1991), where the intensity depends linearly on the covariates:
$$\lambda_i(t)=X_i(t)^T\beta(t), \qquad (13)$$
where $X_i(t)\in\mathbb R^p$ is a predictable covariate vector and the components of $\beta(t)$ are locally integrable. Aalen suggested a simple estimator of the cumulative time-varying effects for which no smoothing parameter is needed. Efficient estimates require smoothing to estimate $\lambda_i(t)$ for the weights, but the method is not sensitive to the choice of the smoothing parameters. The score equations give direct estimation of the increments,
$$X(t)^T W(t)\bigl(dN(t)-\lambda(t)\,dt\bigr)=0$$
with $W(t)=D(1/\lambda_i(t))$, which gives the following estimator of $B(t)$:
$$\hat B(t)=\int_0^t X(s)^-\,dN(s),$$
where $X(s)^-=\bigl(X(s)^T W(s)X(s)\bigr)^{-1}X(s)^T W(s)$. Aalen (1980) studied the estimator for the inefficient choice $W(t)=D(1)$, and McKeague (1988) and Huffer and McKeague (1991) considered a procedure where the weight matrix $D(1/\lambda_i(t))$ was replaced by an estimator. This leads to an efficient estimator (see Sasieni (1992) and Greenwood and Wefelmeyer (1991)). The Aalen model is very flexible and may be interpreted as a first-order Taylor expansion of a completely general intensity function $\lambda(t,X(t))$ around covariate value 0. Often one can further simplify the description of covariate effects, and then the semi-parametric version studied by McKeague and Sasieni (1994) is very useful; it assumes that
$$\lambda_i(t)=X_i(t)^T\beta(t)+Z_i(t)^T\gamma, \qquad (14)$$
where $X_i(t)\in\mathbb R^p$ and $Z_i(t)\in\mathbb R^q$ are predictable covariate vectors. Simple estimators of the cumulative effects $B(t)$ and $\gamma$ that can be studied asymptotically are available, based on solving the simultaneous score equations
$$X(t)^T W(t)\bigl(dN(t)-\lambda(t)\,dt\bigr)=0,\qquad Z(t)^T W(t)\bigl(dN(t)-\lambda(t)\,dt\bigr)=0$$
with $W(t)=D(1/\lambda_i(t))$. The equations may be solved for any weight matrix and then yield consistent estimators; here the unweighted estimator with $W(t)=D(1)$ is used. It is of interest to focus on estimation of the cumulative regression functions $B(t)=\int_0^t\beta(s)\,ds$ and $\gamma$, and also to suggest tests for the significance or time-dependence of $\beta(\cdot)$.
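Because the unweighted Aalen estimator only involves a least-squares projection at each event time, it is easy to sketch in code. The following is a minimal illustrative sketch and not the author's implementation; the function name, the restriction to time-fixed covariates and the simulated toy data are assumptions made purely for the example.

```python
import numpy as np

def aalen_cumulative(time, status, X):
    """Unweighted Aalen additive-hazards estimator of the cumulative
    regression coefficients B(t) = int_0^t beta(s) ds.

    time   : (n,) observed times (event or censoring)
    status : (n,) 1 = event, 0 = censored
    X      : (n, p) covariate matrix, taken time-fixed here for simplicity
    Returns the observed times (sorted) and the (n, p) path of B-hat.
    """
    n, p = X.shape
    order = np.argsort(time)
    time, status, X = time[order], status[order], X[order]

    B = np.zeros(p)
    B_path = []
    for k in range(n):
        if status[k] == 1:
            at_risk = time >= time[k]              # risk set at this event time
            Xr = X[at_risk]                        # design restricted to the risk set
            XtX = Xr.T @ Xr
            # least-squares increment (X'X)^{-1} x_i for the failing subject;
            # by convention the increment is skipped if X'X is singular
            if np.linalg.matrix_rank(XtX) == p:
                B = B + np.linalg.solve(XtX, X[k])
        B_path.append(B.copy())
    return time, np.array(B_path)

# toy usage: intercept column plus one binary covariate, additive hazard 0.5 + 0.5*z
rng = np.random.default_rng(0)
n = 200
z = rng.binomial(1, 0.5, n)
t_event = rng.exponential(1.0 / (0.5 + 0.5 * z))
t_cens = rng.exponential(2.0, n)
time = np.minimum(t_event, t_cens)
status = (t_event <= t_cens).astype(int)
times, Bhat = aalen_cumulative(time, status, np.column_stack([np.ones(n), z]))
```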
Following McKeague and Sasieni (1994), under model (14) the following estimators are derived:
$$\hat\gamma=\Bigl[\int_0^\tau Z(s)^T H(s)Z(s)\,ds\Bigr]^{-1}\int_0^\tau Z(s)^T H(s)\,dN(s),$$
$$\hat B(t)=\int_0^t X(s)^-\bigl(dN(s)-Z(s)\hat\gamma\,ds\bigr),$$
where $H(s)=I-X(s)X(s)^-$, with the convention that $H(s)$ is zero when the inverse matrix does not exist. McKeague and Sasieni showed that the estimators are consistent and asymptotically Gaussian with variances that can be estimated. It can be shown that
$$n^{1/2}(\hat\gamma-\gamma)\xrightarrow{\ \mathcal D\ }V \quad\text{as } n\to\infty,$$
where $V$ is a mean-zero normal with variance $\Sigma_{11}(\tau)$, and
$$n^{1/2}\bigl(\hat B(t)-B(t)\bigr)\xrightarrow{\ \mathcal D\ }W(t) \quad\text{as } n\to\infty \text{ in } D[0,\tau]^p,$$
where $W(t)$ is a mean-zero Gaussian process with variance function $\Sigma_{22}(t)$. The covariance between $V$ and $W(t)$ is $\Sigma_{12}(t)$. Estimators for these covariances are given in McKeague and Sasieni (1994), but in Section 3 I provide some alternative simple robust estimator formulae based on Scheike (2002a). One general problem with the additive risk models is that the estimated intensity for a subject may become negative.

2.3. Additive–multiplicative intensity models

The additive and proportional hazards models may be combined in various ways to achieve flexible and useful models. Generally, the additive and multiplicative models postulate different relationships between the hazard and covariates, and the subject matter seldom indicates clearly which of the models should be preferred. The models may often be used to complement each other and to provide different summary measures. One advantage of the additive models is that time-varying effects are easy to estimate and that no smoothing parameter needs to be chosen. Lin and Ying (1995) considered the following additive–multiplicative intensity model:
$$\lambda_i(t)=Y_i(t)\bigl\{g\bigl(X_i(t)^T\beta\bigr)+\lambda_0(t)\,h\bigl(Z_i(t)^T\gamma\bigr)\bigr\},$$
where $(X_i(t)^T,Z_i(t)^T)$ is a $(p+q)$-covariate vector. One problem with this model is that only the baseline is time-varying, and therefore data with time-varying effects will often not be well described by the model. Dabrowska (1997) considers the following general stratified Cox model:
$$\lambda_i(t)=Y_i(t)\,\lambda\bigl(t,X_i(t)\bigr)\exp\bigl(Z_i(t)^T\beta\bigr).$$
For this model, interest may be in the effect of the variables $Z$ while avoiding modelling error for the additional covariates $X$. Dabrowska suggests a smooth profile estimation
technique for this model. This makes the choice of a smoothing parameter necessary and further restricts the covariates in the non-parametric term to be continuously varying. The non-parametric part of the model is not expected to be able to handle many covariates for small to medium sized data sets. In addition, the non-parametric effects can be difficult to summarize. A first-order Taylor expansion of the Dabrowska baseline hazard around the zero covariate value yields, for the non-parametric part, the Aalen form $X_i(t)^T\alpha(t)$ instead of $\lambda(t,X_i(t))$, i.e.,
$$\lambda_i(t)=Y_i(t)\bigl(X_i(t)^T\alpha(t)\bigr)\exp\bigl(Z_i(t)^T\beta\bigr), \qquad (15)$$
where $(X_i(t)^T,Z_i(t)^T)$ is a $(p+q)$-predictable covariate vector. This model was studied in Scheike and Zhang (2002a, 2002b). The efficient score equations are
$$X(t)^T W(t)\bigl(dN(t)-\lambda(t)\,dt\bigr)=0,\qquad Z(t)^T\bigl(dN(t)-\lambda(t)\,dt\bigr)=0,$$
where $W(t)=D(\phi_i(t)/\lambda_i(t))$, $\phi(t)=(\phi_1(t),\dots,\phi_n(t))$, and $\phi_i(t)=\exp(Z_i(t)^T\beta)$. Choosing the weight matrix $W$ as $D(\exp(Z_i(t)^T\beta))$ makes it possible to get initial estimates that for small samples do as well as or better than the efficiently weighted estimates. The model allows a very flexible description of the covariate effects of $X_i(t)$ while introducing some structure to make estimation and interpretation easier than in the general model. Scheike and Zhang (2002a, 2002b) derive the asymptotic properties for the model, and based on these, tests and confidence bands may be established. This model may be fitted without the choice of any smoothing parameters. Martinussen and Scheike (2002) consider the following additive–multiplicative intensity model:
$$\lambda_i(t)=\alpha(t)^T X_i(t)+\rho_i(t)\lambda_0(t)\exp\bigl(\beta^T Z_i(t)\bigr), \qquad (16)$$
where $\alpha(\cdot)$ is a $q$-vector of time-varying regression coefficients and $\rho_i(t)$ is a risk indicator. Our interest is in estimation of the unknown parameters $\beta$, $A(t)=\int_0^t\alpha(s)\,ds$ and $\Lambda(t)=\int_0^t\lambda_0(s)\,ds$. The score equations for the unknown parameters are given as
$$X(t)^T W(t)\bigl(dN(t)-\lambda(t)\,dt\bigr)=0,\qquad \phi(t)^T W(t)\bigl(dN(t)-\lambda(t)\,dt\bigr)=0,\qquad Z(t)^T V(t)\bigl(dN(t)-\lambda(t)\,dt\bigr)=0,$$
where $W(t)=D(1/\lambda_i(t))$, $V(t)=D(\lambda_0(t)\phi_i(t)/\lambda_i(t))$, $\phi(t)=(\phi_1(t),\dots,\phi_n(t))$, and $\phi_i(t)=\rho_i(t)\exp(\beta^T Z_i(t))$. Efficient estimators are found by solving the score equations for the parameters, and the estimators are provided with an asymptotic description that makes inference possible. Essentially the model reduces to an Aalen model for known $\beta$, and this may be utilized to obtain a score equation for $\beta$ that depends only on observed quantities. The model extends both the Cox and the Aalen model and may in principle be used to investigate goodness-of-fit for these models, but many practical issues need to be resolved to make this operational. Sasieni (1996) considered the special case of this model where $\alpha(t)^T X_i(t)$ is replaced by a known function of $X_i(t)$. Recently, Zahl (2002), in independent and as yet unpublished work, analysed the same model and gave detailed case studies. A different iterative estimation procedure is suggested by Zahl, but unfortunately no large sample properties are given.
3. Testing in proportional and additive hazards models

In this section a unified treatment of how to do the successive testing that was discussed in the introduction is presented. The work is based on Scheike (2002a) for the additive risk model and on Scheike and Martinussen (2002) for the proportional hazards model. The approach is based on a modification of the simulation technique of Lin et al. (1993, 1994), which was recently used in Lin et al. (2000) and Scheike (2002a). The basic idea is that when one considers an estimator $\hat\eta(t)$ of some functional $\eta(t)$ such that
$$\hat\eta(t)-\eta(t)=\sum_{i=1}^n\int_0^t h_i(s)\,dM_i(s)+o_p\bigl(n^{-1/2}\bigr),$$
where $h_i(s)$ and $M_i(s)$ are independent and identically distributed processes, with $M_i(s)$ a zero-mean martingale and $h_i(s)$ predictable, such that $Q_i(t)=\int_0^t h_i(s)\,dM_i(s)$ are independent zero-mean processes, then, under regularity conditions, one can show that the asymptotic distribution of $\hat\eta(t)-\eta(t)$ is equivalent to the asymptotic distribution of
$$\sum_i\hat Q_i(t)G_i,$$
where $G_1,\dots,G_n$ are standard normals and the estimators $\hat Q_i(t)=\int_0^t\hat h_i(s)\,d\hat M_i(s)$ are fixed given the counting process data. This is essentially a version of the conditional multiplier central limit theorem. An alternative to simulating in this way is to proceed as in Lin et al. (1993), who show that when the $M_i$ are martingales one can do something slightly simpler. Essentially one can use
$$\sum_i G_i\int_0^t\hat h_i(s)\,dN_i(s),$$
which also has the same asymptotic distribution as $\hat\eta(t)-\eta(t)$. The simulations based on the martingale residuals, however, have the advantage that they lead to standard errors that are more generally applicable (see Lin et al. (2000) and Scheike (2002a) for more on this robustness property). The martingale is estimated by $\hat M_i(t)=N_i(t)-\hat\Lambda_i(t)$. How one estimates $\hat\Lambda_i(t)$ differs slightly between the proportional and additive models. For the additive hazards models the formula can be implemented slightly more simply, without smoothing, as
$$\hat\Lambda_i(t)=\int_0^t\bigl(X_i(s)^T\,d\hat B(s)+Z_i(s)^T\hat\gamma\,ds\bigr).$$
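The multiplier step itself is simple once the estimated i.i.d. processes $\hat Q_i(t)$ have been evaluated on a time grid. The sketch below is only an illustrative outline of the resampling device, assuming such a matrix is available; the function and variable names are invented for the example, and the $n^{1/2}$ normalization is omitted.

```python
import numpy as np

def multiplier_realizations(Q_hat, n_sim=1000, seed=0):
    """Conditional multiplier resampling sketch.

    Q_hat : (n, K) array; row i holds the estimated i.i.d. process Q_i(t)
            evaluated on a grid of K time points (data are kept fixed).
    Returns an (n_sim, K) array whose rows are realizations of
    sum_i Q_i(t) * G_i with G_i iid standard normal, mimicking the limit
    distribution of the estimator minus its target.
    """
    rng = np.random.default_rng(seed)
    n, K = Q_hat.shape
    G = rng.standard_normal((n_sim, n))   # multipliers
    return G @ Q_hat                      # (n_sim, K) resampled processes

# usage sketch: 95% critical value of a supremum-type statistic
# Delta = multiplier_realizations(Q_hat)
# crit = np.quantile(np.abs(Delta).max(axis=1), 0.95)
```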
3.1. Asymptotics and approximations

The simulation technique and results for the proportional and additive models are essentially the same, and therefore a unified treatment of the two cases $h(z)=z$ and $h(z)=\exp(z)$ is presented. The similarity of the formulae for the two cases is due to the fact that we have worked with the slightly re-parametrized version of Cox's regression model, where the baseline is included in the regression design. For the traditional parametrization of the baseline and the partial likelihood approach, alternative formulae are obtained (see Scheike and Martinussen (2002)). The proportional and additive hazards models lead to equivalent formulae when $W(s)=D(\hat\lambda_i(t))$ in the proportional setting and $W(s)=D(1)$ in the additive setting. The $i$th diagonal element of $W(s)$ is denoted $W_i(s)$. In the case of the efficiently weighted estimator for the additive model, a different weight matrix should be used. Let $X(t)=(Y_1(t)X_1(t),\dots,Y_n(t)X_n(t))^T$ be the $n\times p$ matrix of covariates for the $n$ independent subjects at time $t$, where the $Y_i(t)$ are the at-risk indicators; these were built into the covariates in the section on the additive models. Assume that efficient estimators of $\beta(t)$, $B(t)=\int_0^t\beta(s)\,ds$ and $\gamma$ are at hand and denote these by $\hat\beta(t)$, $\hat B(t)$ and $\hat\gamma$.

3.1.1. Non-parametric model and testing

The non-parametric model states that
$$\lambda_i(t)=Y_i(t)\,h\bigl(X_i(t)^T\beta(t)\bigr), \qquad (17)$$
where $\beta(t)$ is a $p$-dimensional time-varying regression coefficient and $X_i(t)$ is a $p$-dimensional predictable covariate vector. To derive and approximate the asymptotic properties of the estimator, the following observation is of use:
$$\hat B(t)-B(t)=\sum_i Q_i(t)+o_p\bigl(n^{-1/2}\bigr),$$
where
$$Q_i(t)=\int_0^t\bigl(X(s)^T W(s)X(s)\bigr)^{-1}X_i(s)\,dM_i(s)$$
with $M_i(t)=N_i(t)-\int_0^t Y_i(s)h(X_i(s)^T\beta(s))\,ds$. For $n$ large this is essentially a sum of independent and identically distributed mean-zero random variables, and its covariance may thus be estimated by
$$\hat V(t)=\sum_i\hat Q_i(t)^{\otimes2},$$
where
$$\hat Q_i(t)=\int_0^t\bigl(X(s)^T W(s)X(s)\bigr)^{-1}X_i(s)\,d\hat M_i(s)$$
with $\hat M_i(t)=N_i(t)-\int_0^t Y_i(s)h(X_i(s)^T\hat\beta(s))\,ds$, and where for a vector $a$, $a^{\otimes2}=aa^T$.
Under regularity conditions it follows that $n^{1/2}(\hat B(t)-B(t))$ converges to a Gaussian process with a variance function that is estimated consistently by $n\hat V(t)$. Now, if $(G_1,\dots,G_n)$ are independent and standard normally distributed, then
$$\Delta_1(t)=n^{1/2}\sum_i\hat Q_i(t)G_i$$
has the same limit as $n^{1/2}(\hat B(t)-B(t))$. The random variables $(G_1,\dots,G_n)$ must be independent of $(N_i(\cdot),Y_i(\cdot),X_i(\cdot),\ i=1,\dots,n)$, which are considered fixed at their observed values. Let the $j$th component of the $k$th realization of $\Delta_1(t)$ be denoted $\Delta^k_{1,j}(t)$.

For the non-parametric model (17) one may wish to test the hypothesis $H_0:\beta_j(t)\equiv\gamma$, that one of the non-parametric components has a time-constant effect. A simple test, implemented in the second example in the introduction and in the case study given later, is based on computing the test statistic
$$\sup_{t\in[0,\tau]}\Bigl|\hat B_j(t)-\frac{\hat B_j(\tau)}{\tau}\,t\Bigr|.$$
This test can be improved in many different ways, and a more detailed discussion is given for the semi-parametric model. To compute (approximate) percentiles for the observed test statistic under the null, compute
$$\sup_{t\in[0,\tau]}\Bigl|\Delta^k_{1,j}(t)-\frac{\Delta^k_{1,j}(\tau)}{\tau}\,t\Bigr|$$
for a large number of realizations. It is also of interest to construct simultaneous confidence bands and a test for $H_0:\beta_j(\cdot)\equiv0$, or equivalently $H_0:B_j(\cdot)\equiv0$. To test whether a time-varying component is non-significant, $B_j(\cdot)\equiv0$, the following test statistic was applied:
$$\sup_{t\in[0,\tau]}\frac{|\hat B_j(t)|}{\hat V_j(t)^{1/2}},$$
where $\hat V_j(t)$ is the $j$th diagonal element of $\hat V(t)$. Percentiles can be estimated from realizations of $\Delta_1(t)$:
$$\sup_{t\in[0,\tau]}\frac{|\Delta^k_{1,j}(t)|}{\hat V_j(t)^{1/2}}.$$

3.1.2. Semi-parametric model

Consider the semi-parametric model
$$\lambda_i(t)=Y_i(t)\,h\bigl(X_i(t)^T\beta(t)+Z_i(t)^T\gamma\bigr), \qquad (18)$$
where $\beta(t)$ is a $p$-dimensional time-varying regression coefficient, $\gamma$ is a $q$-dimensional time-constant regression coefficient, and $(X_i(t),Z_i(t))$ is a set of $(p+q)$-dimensional predictable covariate vectors.
Similarly to the derivation for the non-parametric model (see Scheike (2002a) for the additive model and Scheike and Martinussen (2002) for the partial likelihood approach), it can be derived that $n^{1/2}(\hat\gamma-\gamma)$ is asymptotically normal with mean 0 and a variance that is estimated consistently by
$$n\,C^{-1}\hat\Sigma_1 C^{-1},$$
where $C=\int_0^\tau Z(s)^T H(s)W(s)Z(s)\,ds$, $H(s)=I-W(s)X(s)\bigl(X(s)^T W(s)X(s)\bigr)^{-1}X(s)^T$,
$$\hat\Sigma_1=\sum_i\hat W_{2,i}^{\otimes2},$$
and
$$W_{2,i}=\int_0^\tau Y_i(s)\Bigl[Z_i(s)-Z(s)^T W(s)X(s)\bigl(X(s)^T W(s)X(s)\bigr)^{-1}X_i(s)\Bigr]\,dM_i(s),$$
and where $\hat W_{2,i}$ is defined from $W_{2,i}$ by replacing the unknown quantities with their estimates. The asymptotic distribution of $n^{1/2}(\hat B(t)-B(t))$ is equivalent to the asymptotic distribution of
$$n^{1/2}\sum_i W_{4,i}(t),$$
where
$$W_{3,i}(t)=\int_0^t\bigl(X(s)^T W(s)X(s)\bigr)^{-1}X_i(s)\,dM_i(s),\qquad W_{4,i}(t)=W_{3,i}(t)-P(t)C^{-1}W_{2,i},$$
and $P(t)=\int_0^t X(s)^-W(s)Z(s)\,ds$. It can be shown that $n^{1/2}(\hat B(t)-B(t))$ is asymptotically Gaussian with a pointwise variance that is estimated consistently by
$$n\sum_i\hat W_{4,i}(t)^{\otimes2}.$$
To construct simultaneous confidence bands and tests, note that if $(G_1,\dots,G_n)$ are independent and standard normally distributed, then
$$\Delta_2(t)=n^{1/2}\sum_i\hat W_{4,i}(t)G_i$$
has the same limit as $n^{1/2}(\hat B(t)-B(t))$. The construction of simultaneous confidence bands for $B(t)$ and tests of the significance of the non-parametric effects may be based on replications of $\Delta_2(t)$ (see the previous section).
3.2. Test for time-dependent effects in the semi-parametric model

To test whether some covariate effects do not depend on time, such as the hypothesis
$$H_0:\ \beta_p(t)=\gamma_{q+1},$$
consider functionals of $\hat B_p(\cdot)$ and $\hat\gamma_{q+1}$, where $\hat B_p(\cdot)$ is obtained in the full model and $\hat\gamma_{q+1}$ is computed under the null hypothesis (see Martinussen and Scheike (1999) for similar ideas in a regression setup). To distinguish between the design under the model with freely varying $\beta_p(t)$ and under the semi-parametric null, let the designs under the null be denoted $\tilde X(s)$ and $\tilde Z(s)$, and use a similar notation for other quantities defined under the null. Two functionals to test the null are
$$F_1\bigl(\hat B_p(\cdot),\hat\gamma_{q+1}\bigr)=\sup_{t\in[0,\tau]}\bigl|\hat B_p(t)-\hat\gamma_{q+1}t\bigr|$$
and
$$F_2\bigl(\hat B_p(\cdot),\hat\gamma_{q+1}\bigr)=\int_0^\tau\bigl(\hat B_p(t)-\hat\gamma_{q+1}t\bigr)^2\,dt.$$
These tests may be improved in many different ways, e.g., by including information about the variation of the estimators. Both functionals have the drawback that the estimation is cumulated over time, so that if a time-varying effect is present only for large $t$ it can be difficult to detect due to cumulated random noise. If this is expected, one may consider the functionals over some pre-specified time period. Alternatively, one may consider functionals that are better suited to exploring more local aspects of the cumulative regression coefficient, such as, e.g.,
$$\sup_{t\in[0,\tau]}\sup_{s\in[0,\tau]}\bigl|\hat B_p(t)-\hat B_p(s)-\hat\gamma_{q+1}(t-s)\bigr|.$$
The process $\hat B_p(t)-\hat\gamma_{q+1}t$ is asymptotically equivalent to a Gaussian process that can be approximated by
$$\sum_i W_{5,i}(t)$$
with
$$W_{5,i}(t)=W_{4,i}(t)-t\,\tilde P(t)\tilde C^{-1}\tilde W_{2,i},$$
where the last term is computed under the null. The asymptotic distribution of $\sum_i W_{5,i}(t)$ is the same as the asymptotic distribution of
$$\Delta_3(t)=\sum_i\hat W_{5,i}(t)G_i,$$
where $(G_1,\dots,G_n)$ are independent and standard normally distributed. Therefore the percentiles of $F_1$ or $F_2$ may be obtained from random samples of $\Delta_3(\cdot)$. A simpler test, based on $\hat B_p(\cdot)$ only, is to compute
$$F_3\bigl(\hat B_p(\cdot)\bigr)=\sup_{t\in[0,\tau]}\Bigl|\hat B_p(t)-\hat B_p(\tau)\frac t\tau\Bigr|,$$
and the asymptotic properties of this test may be simulated as before.
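As a hedged sketch of how the simple statistic $F_3$ might be computed from resampled realizations of the limit process (for instance obtained with the multiplier device sketched in Section 3), the following example is illustrative only; its input layout and names are assumptions, not part of the original text.

```python
import numpy as np

def constant_effect_test(times, Bp_hat, Delta_sims):
    """Kolmogorov-Smirnov-type statistic F3 = sup_t |B_p(t) - B_p(tau) t/tau|
    with a resampling-based p-value.

    times      : (K,) grid of time points in [0, tau]
    Bp_hat     : (K,) estimated cumulative coefficient for component p
    Delta_sims : (n_sim, K) resampled realizations approximating the limit
                 process of B_p_hat - B_p (e.g. multiplier realizations)
    """
    tau = times[-1]
    # observed statistic; under H0 the same linear transformation applied
    # to the resampled processes has the same limit distribution
    obs = np.max(np.abs(Bp_hat - Bp_hat[-1] * times / tau))
    sims = np.max(np.abs(Delta_sims - Delta_sims[:, [-1]] * times / tau), axis=1)
    p_value = np.mean(sims >= obs)
    return obs, p_value
```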
4. Survival with malignant melanoma

The data are described in further detail in Andersen et al. (1993, p. 11). 205 patients with malignant melanoma had their tumor removed and were followed until death (from malignant melanoma) or the end of the study (15 years later). 57 deaths were recorded in the observation period. The aim of the study is to learn about the mortality rate from malignant melanoma and to estimate the predictive effect of the covariates sex (males = 1, females = 0), ulceration (present = 1, absent = 0), age (years) and thickness of tumor (mm). These data are well known to many, and here they are used to show how the successive testing of models may be implemented to learn about the time-varying nature of the effects.

A standard Cox regression model gave the log-relative risk effects (sd) 1.20 (0.31), 0.36 (0.27), 0.01 (0.01) and 0.09 (0.04) for ulceration, sex, age and thickness, respectively. The Lin et al. (1993) score test for goodness-of-fit of the Cox model gave the plots shown in Figure 3. The figure reveals that the first and fourth components of the score process under Cox's regression model indicate lack of fit. The p-values for the four components were 0.004, 0.61, 0.51 and 0.01930, respectively. The model is thus clearly rejected.

First we consider the Cox–Aalen model as an alternative. Recall that for this model
$$\lambda_i(t)=Y_i(t)\bigl(X_i(t)^T\alpha(t)\bigr)\exp\bigl(Z_i(t)^T\beta\bigr).$$
Trying to extend the model only for those covariates where it is really needed, I therefore consider the Cox–Aalen model with ulceration and thickness as additive components and sex and age as multiplicative components. For this model, a score process test for the two multiplicative effects (Scheike and Zhang, 2002b) reveals that the model fits well. The score processes for sex and age are quite similar to those in Figure 3. Figure 4 shows the additive non-parametric effects. The effects of ulceration and thickness appear to be rather constant but additive. In fact, tests for constant additive effects of ulceration and thickness gave the p-values 0.69 and 0.17, respectively (Scheike and Zhang, 2002b). The multiplicative effects of the Cox–Aalen model gave log-relative risks of 0.34 (0.27) and 0.01 (0.01), respectively, estimates that are equivalent to those from the Cox model.

Fitting the model completely within the additive risk framework gives the estimates shown in Figure 5. Note that the additive effects of the Cox–Aalen model are somewhat reduced compared to the Aalen model. Deciding which of the two models one prefers is difficult, and due to the limited amount of data, more sophisticated goodness-of-fit tests are not expected to have sufficient power to be helpful. Successive testing of time-varying effects for the Aalen model resulted in p-values 0.39, 0.48, 0.58 and 0.36 for age, sex, ulceration and thickness, tested successively in that order. For the final semi-parametric model the estimates were 0.058 (0.017), 0.019 (0.016), 0.00063 (0.00048) and 0.0074 (0.0040) for ulceration, sex, age and thickness, respectively. These constant effects are depicted as the straight lines in Figure 5. The relative sizes and significance of the estimates are comparable to those of the Cox model, but the model gives an alternative interpretation of the effects; note also that the goodness-of-fit procedure indicates that the Cox model does not give a satisfactory summary of the data.
Fig. 3. Score processes for covariates of melanoma data.
Fig. 4. Cox–Aalen model: cumulative coefficients for additive effects of melanoma data with 95% pointwise confidence intervals.
Fig. 5. Aalen model: cumulative coefficients for effects of melanoma data with 95% pointwise confidence intervals. Straight lines represent estimates of time-constant effects.
5. Discussion

I have reviewed some recent work on additive and proportional hazards models that focussed on time-varying regression coefficient models. The additive and proportional models supplement each other and will give different types of summary measures that show different aspects of the covariates. An important aspect that was emphasized is that semi-parametric versions of the models are very useful to obtain a satisfactory description of covariate effects. Also, when inference is the interest, it was emphasized that the semi-parametric models are needed to do successive tests. For the semi-parametric models I gave a procedure for testing important hypotheses for the non-parametric and parametric terms of the model.

One seldom has sufficient data to decide clearly whether proportional or additive models should be preferred, and in practice the best fit may be obtained by models that allow for both multiplicative and additive effects. The models reviewed here are all sufficiently flexible to fit most data quite well. One is therefore probably better off trying to decide based on the subject matter whether effects are multiplicative or additive. Additive effects lead to excess risk, and can therefore be considered as leading to mortality due to different reasons. In contrast, multiplicative effects have a general effect on the intensity and also modify the effects of other covariates. A practical consideration that might help guide the statistical analysis is that non-parametric effects are most easily dealt with in additive hazards models, since this can be done without choosing any smoothing parameters. The drawback of additive models, however, is that the model fitting may lead to negative hazards. To avoid negative hazards one may implement model fitting techniques with constraints, but this appears to lead to complicated procedures.

Acknowledgement

The author received partial support from a grant from the National Institutes of Health. Some of the work was done while the author was employed at the Department of Mathematical Sciences at Aalborg University.

References

Aalen, O.O. (1975). Statistical inference for a family of counting processes. Ph.D. Thesis. Department of Statistics, University of California at Berkeley.
Aalen, O.O. (1978). Non-parametric inference for a family of counting processes. Ann. Statist. 6, 534–545.
Aalen, O.O. (1980). A model for non-parametric regression analysis of counting processes. In: Klonecki, W., Kozek, A., Rosinski, J. (Eds.), Lecture Notes in Statistics-2: Mathematical Statistics and Probability Theory. Springer, New York, pp. 1–25.
Aalen, O.O. (1989). A linear regression model for the analysis of lifetimes. Statist. Medicine 8, 907–925.
Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Andersen, P.K., Gill, R.D. (1982). Cox's regression model for counting processes: A large sample study. Ann. Statist. 10, 1100–1120.
Brémaud, P. (1981). Point Processes and Queues, Martingale Dynamics. Springer, New York.
Cai, Z., Sun, Y. (2002). Local linear estimation for time-dependent coefficients in Cox's regression models. Scand. J. Statist. Submitted for publication.
Cox, D.R. (1972). Regression models and life tables. J. Roy. Statist. Soc. Ser. B 34, 406–424.
Dabrowska, D.M. (1997). Smoothed Cox regression. Ann. Statist. 25, 1510–1540.
Fleming, T.R., Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York.
Grambsch, P.M., Therneau, T.M. (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81, 515–526 (corr.: 1995, 82, p. 668).
Greenwood, P.E., Wefelmeyer, W. (1991). Efficient estimation in a nonlinear counting-process regression model. Canad. J. Statist. 19, 165–178.
Hastie, T., Tibshirani, R. (1993). Varying-coefficient models. J. Roy. Statist. Soc. Ser. B 55, 757–796.
Huffer, F.W., McKeague, I.W. (1991). Weighted least squares estimation for Aalen's additive risk model. J. Amer. Statist. Assoc. 86, 114–129.
Jacobsen, M. (1982). Statistical Analysis of Counting Processes. Springer, New York.
Lin, D.Y., Fleming, T., Wei, L. (1994). Confidence bands for survival curves under the proportional hazards model. Biometrika 81, 73–81.
Lin, D.Y., Wei, L.J., Ying, Z. (1993). Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80, 557–572.
Lin, D.Y., Ying, Z. (1994). Semi-parametric analysis of the additive risk model. Biometrika 81, 61–71.
Lin, D.Y., Ying, Z. (1995). Semi-parametric analysis of general additive–multiplicative hazard models for counting processes. Ann. Statist. 23, 1712–1734.
Lin, D.Y., Wei, L.J., Yang, I., Ying, Z. (2000). Semiparametric regression for the mean and rate function of recurrent events. J. Roy. Statist. Soc. Ser. B 62, 711–730.
Martinussen, T., Scheike, T.H. (1999). A semi-parametric additive regression model for longitudinal data. Biometrika 86, 691–702.
Martinussen, T., Scheike, T.H. (2002). A flexible additive multiplicative hazard model. Biometrika 89, 283–298.
Martinussen, T., Scheike, T.H., Skovgaard, I. (2002). Efficient estimation of fixed and time-varying covariate effects in multiplicative intensity models. Scand. J. Statist. 28, 57–74.
Marzec, L., Marzec, P. (1997). On fitting Cox's regression model with time-dependent coefficients. Biometrika 84, 901–908.
McKeague, I.W. (1988). Asymptotic theory for weighted least squares estimators in Aalen's additive risk model. In: Statistical Inference from Stochastic Processes, pp. 139–152.
McKeague, I.W., Sasieni, P.D. (1994). A partly parametric additive risk model. Biometrika 81, 501–514.
McKeague, I.W., Utikal, K.J. (1991). Goodness-of-fit tests for additive hazards and proportional hazards models. Scand. J. Statist. 18, 177–195.
Murphy, S.A. (1993). Testing for a time dependent coefficient in Cox's regression model. Scand. J. Statist. 20, 35–50.
Murphy, S.A., Sen, P.K. (1991). Time-dependent coefficients in a Cox-type regression model. Stoch. Process. Appl. 39, 153–180.
Nielsen, J.P., Linton, O.B. (1995). Kernel estimation in a nonparametric marker dependent hazard model. Ann. Statist. 23, 1735–1749.
Pons, O. (2000). Nonparametric estimation in a varying-coefficient Cox model. Math. Meth. Statist. 9, 376–398.
Sasieni, P.D. (1992). Information bounds for the additive and multiplicative intensity models. In: Klein, J., Goel, P. (Eds.), Survival Analysis: State of the Art. Kluwer Academic, Dordrecht, pp. 249–263 (disc.: pp. 263–265).
Sasieni, P.D. (1996). Proportional excess hazards. Biometrika 83, 127–141.
Scheike, T. (2002a). The additive non-parametric and semi-parametric Aalen model as the rate function for a counting process. Lifetime Data Anal. 8, 247–262.
Scheike, T.H. (2002b). Dynamic event history models. Dr. Scient. Thesis. University of Copenhagen, Faculty of Sciences.
Scheike, T., Martinussen, T. (2002). On efficient estimation and tests of time-varying effects in the proportional hazards model. Scand. J. Statist. Submitted for publication.
Scheike, T.H., Zhang, M.-J. (2002a). An additive–multiplicative Cox–Aalen model. Scand. J. Statist. 28, 75–88.
Scheike, T.H., Zhang, M.-J. (2002b). Extensions and applications of the Cox–Aalen survival model. Biometrics. Submitted for publication.
Wei, L.J. (1984). Testing goodness of fit for proportional hazards model with censored observations. J. Amer. Statist. Assoc. 79, 649–652.
Zahl, P. (2002). Regression analysis with multiplicative and time-varying additive regression coefficients with examples from breast and colon cancer. Statist. Medicine. Submitted for publication.
Zucker, D.M., Karr, A.F. (1990). Non-parametric survival analysis with time-dependent covariate effects: A penalized partial likelihood approach. Ann. Statist. 18, 329–353.
Kaplan–Meier Integrals
Winfried Stute
1. Introduction

Assume that we observe independent identically distributed (i.i.d.) variables $X_1,\dots,X_n$ on the real line from some unknown distribution function (d.f.) $F$, i.e.,
$$F(x)=\mathbb P(X_1\le x),\qquad x\in\mathbb R.$$
The nonparametric maximum likelihood estimator (NPMLE) of $F$ then equals
$$F_n(x)=\frac1n\sum_{i=1}^n 1_{\{X_i\le x\}},\qquad x\in\mathbb R,$$
the empirical distribution function. It plays an important role in statistical inference, and many of its basic properties are summarized in the monograph by Shorack and Wellner (1986). A class of simple statistical functionals, to which more complicated ones may often be traced back by appropriate analytical approximations or probabilistic projection methods, are linear functionals of $F_n$, or empirical integrals,
$$T(F_n)=\int\varphi\,dF_n=\frac1n\sum_{i=1}^n\varphi(X_i),$$
where $\varphi$ is a given function. In many applications, $\varphi$ may vary within a given class of functions, yielding a stochastic process indexed by $\varphi$. Simple $\varphi$'s are
• $\varphi(x)=x^k$;
• $\varphi(x)=1_{\{x\le x_0\}}$;
• $\varphi(x)=e^{itx}$;
• $\varphi(x)=\psi(x,\theta)$;
to name only a few. In the first example we obtain for $T(F_n)$ the $k$th empirical moment, the second yields $F_n(x_0)$, while the third gives the empirical characteristic function at $t$. Finally, the fourth choice leads to a parametrized family appearing in M-estimation. Under a finite first moment assumption, $\int|\varphi|\,dF<\infty$, the Strong Law of Large Numbers (SLLN) yields
$$\lim_{n\to\infty}\int\varphi\,dF_n=\int\varphi\,dF\quad\text{with probability one},$$
while under $\int\varphi^2\,dF<\infty$, the Central Limit Theorem (CLT) gives us
$$n^{1/2}\int\varphi\,[dF_n-dF]\to N\bigl(0,\sigma^2\bigr)\quad\text{in distribution},$$
where
$$\sigma^2=\int\varphi^2\,dF-\Bigl(\int\varphi\,dF\Bigr)^2.$$
Even much more elementary, we have E ϕ dFn = ϕ dF, i.e., the empirical integral is an unbiased estimator of the target ϕ dF . Since along with 2 these unknown, to compute confidence intervals quantities, also σ will be generally for ϕ dF , we need an estimator for σ 2 . Needless to say, in the present framework, an unbiased estimator of σ 2 is given by 1
ϕ(Xi ) − n−1 n
σn2 =
2 ϕ dFn .
i=1
In the analysis of lifetime data one often faces the problem of some kind of censorship. Under right censorship, rather than Xi , one observes Zi = min(Xi , Yi )
and δi = 1{Xi Yi } .
Here Yi is a so-called censoring variable assumed to be independent of Xi for each 1 i n. As a sequence, Y1 , . . . , Yn are also i.i.d. from some unknown d.f. G. The 0–1 variable δi indicates whether Xi is censored (δi = 0) or not (δi = 1). The problem one faces is to recover F or targets related with F from the available observations (Zi , δi ), 1 i n. This is a so-called inverse problem. A crucial quantity in this context is the cumulative hazard function of F : F (dy) ΛF (x) = , 1 − F (y−) [0,x]
where in the following F (y−) = lim F (x) x↑y
and F {y} = F (y) − F (y−). Similarly for ΛF . If, for some reason, we transform the nonnegative X to a random variable which may also take on negative values, the above integral should be computed over (−∞, x]. All what follows equally applies to this setup as well.
Now, see Shorack and Wellner (1986), p. 301, 1 − ΛF {ai } , 1 − F (x) = e−Λc (x)
89
(1.1)
ai ∈A ai x
with Λc denoting the continuous part of ΛF and A being the set of its atoms (respectively jumps). The point is that (1.1) of course not only applies to F but to any d.f., in the sense that it reveals a way how to compute a d.f. when its cumulative hazard function is given. Under censorship we may introduce 1 (x) = P(Z x, δ = 1) H and H (x) = P(Z x),
x ∈ R,
where (Z, δ) denotes a prototype of (Zi , δi ). Since these variables are observable, we may compute 1n (x) = 1 H 1{Zi x,δi =1} n n
i=1
and 1 1{Zi x} n n
Hn (x) =
i=1
and apply existing empirical process theory to these functions whenever necessary. Due to independence of X and Y we also have (1 − H ) = (1 − F )(1 − G) and 1 (x) = H
x
1 − G(y−) F (dy).
0
Conclude that
x
ΛF (x) = 0
1 (dy) H . [1 − H (y−)]
Plugging in the empirical counterparts we obtain x n 1n (dy) 1{Zi x,δi =1} H = , Λn (x) = [n − Rank Zi + 1] 0 [1 − Hn (y−)] i=1
the Nelson–Aalen estimator of ΛF . Now, using (1.1), we get the d.f. associated with Λn , namely n (x) = 1−F
n
i=1
1 − Λn {Zi }
1{Z x,δ =1} i
i
.
(1.2)
n which is more convenient A simple rearrangement of terms yields a representation of F for further purposes: 1{Z x} n i:n δ[i:n] n (x) = 1− 1−F (1.3) . n−i +1 i=1
Here, Z1:n Z2:n · · · Zn:n are the ordered Z-values, where ties within lifetimes or within censoring times are ordered arbitrarily and ties among lifetimes and censoring times are treated as if the former precede the latter. The label δ[i:n] denotes the concomitant associated with Zi:n , i.e., δ[i:n] = δj if Zi:n = Zj . If all δ’s equal 1, i.e., if there is n = Fn . The estimator F n is due to Kaplan and Meier (1958), who no censorship, then F derived it as a limit from then already existing life-table estimators. In view of its shape it is therefore called the product-limit estimator. The product structure is in contrast with the sum structure of Fn , which belongs to the empirical measure µn =
1 1 δX , δ Xi = n n i n
n
i=1
i=1
(1.4)
a sum of weighted Dirac-measures δXi , where 1 if x ∈ A, δx (A) = 0 if x ∈ / A. n , note that the To obtain the analog of (1.4) for the measures µˆ n associated with F weight attached to Zi:n equals δ[j:n] n−1 δ[i:n] n−j Win = (1.5) . n−i +1 n−j +1 j =1
Hence we may write µˆ n =
n
Win δZi:n
i=1
and, as a consequence, n Win ϕ(Zi:n ). ϕ dFn =
(1.6)
i=1
Such integrals are henceforth called Kaplan–Meier (K–M) integrals. If we put $\varphi(x)=1_{\{x\le x_0\}}$, we obtain $\hat F_n(x_0)$ back, so that all results obtained for (1.6) apply in particular to $\hat F_n$ itself. To study $\hat F_n$ as a process in $x_0$, we have to allow for a class of indicators. The Kaplan–Meier estimator $\hat F_n$ constitutes the NPMLE of $F$ under random censorship. We may expect that K–M integrals play the same fundamental role for right-censored data as ordinary empirical integrals play for standard i.i.d. data. It is the purpose of the present article to review some of the most important properties of K–M integrals. Section 2 will discuss their strong consistency, and in Section 3 their distributional convergence will be addressed. Bias considerations are surprisingly difficult now and are dealt with in Section 4. Issues related to variance estimation are reviewed in Section 5, while Section 6 summarizes many extensions and applications to the regression setting.
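As an illustration of (1.5) and (1.6), the following minimal Python sketch computes the Kaplan–Meier weights and a K–M integral. The function names, the tie-breaking rule spelled out in a comment and the usage example are assumptions made for the sketch, not part of the chapter.

```python
import numpy as np

def km_weights(z, delta):
    """Kaplan-Meier weights W_in of (1.5) attached to the ordered data.
    z     : (n,) observed times Z_i
    delta : (n,) labels, 1 = uncensored, 0 = censored
    Returns the sorted times, sorted labels and the weights."""
    # sort by time; within ties, events precede censorings (the chapter's convention)
    order = np.lexsort((-delta, z))
    z, delta = z[order], delta[order].astype(float)
    n = len(z)
    w = np.zeros(n)
    surv = 1.0                       # running product of (1 - d_[j] / (n - j + 1))
    for i in range(n):
        w[i] = delta[i] / (n - i) * surv
        surv *= 1.0 - delta[i] / (n - i)
    return z, delta, w

def km_integral(phi, z, delta):
    """Kaplan-Meier integral (1.6): sum_i W_in * phi(Z_(i))."""
    zs, _, w = km_weights(z, delta)
    return np.sum(w * phi(zs))

# usage sketch: estimated mean lifetime under right censoring
# mean_est = km_integral(lambda x: x, z, delta)
```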
2. The SLLN

Historically, consistency under random censorship was first established for indicators, i.e., for $\hat F_n(x)$ itself. Since $\hat F_n(x)$ depends on $\Lambda_n$ through (1.2) and $\Lambda_n$ has a simple integral form, a promising strategy is to first analyze $\Lambda_n$. Standard properties of empirical d.f.'s applied to $H_n$ and $\hat H_n^1$ immediately yield
$$\lim_{n\to\infty}\sup_{x\le T}\bigl|\Lambda_n(x)-\Lambda_F(x)\bigr|=0\quad\text{with probability 1} \qquad (2.1)$$
as long as T satisfies T < τH , where τH = inf x: H (x) = 1 ∞ is the least upper bound for the support of H . If F is continuous, then ΛF is unbounded on the real line so that the restriction to compact subsets of [0, τH ) is indispensable for uniform consistency. As a consequence, the approach via (2.1) does not give a quick n on the whole support answer as to uniform (or Glivenko–Cantelli) convergence of F of Z. Worse than that, the question whether an extension of the SLLN holds under right censorship, i.e., lim (2.2) ϕ dFn = ϕ dF with probability 1 n→∞
for a larger class of ϕ’s, is far from being trivial. Another possibility would be to analyze (1.6) directly, for a general ϕ. At this stage of the discussion it is worthwhile recalling the different techniques available to prove the ordinary SLLN: • Kolmogorov’s original proof and Etemadi’s (1981) beautiful modification. Their proofs heavily depend on the fact that the summands of the sample mean are independent and the weights all equal n−1 and are therefore nonrandom. In contrast, Eq. (1.6) reveals that K–M integrals are sums of weighted functions of order statistics, where the weights Win are random and depend on the labels in a complicated way. Note that Win = 0 whenever δ[i:n] = 0, i.e., Zi:n is censored. • Recalling that an i.i.d. sequence is strictly stationary and ergodic, the classical SLLN is also implied by the ergodic theorem. There is, however, little hope that K–M integrals can be handled within this framework. • Sample means are also martingales in reverse time so that the SLLN becomes a special case of a martingale convergence theorem. That the limit is a constant follows from the zero–one law. Martingales in n provide a general framework to which empirical integrals mayfit also n under random censorship. It may be seen, however, that the expectation of ϕ dF usually differs from n to n so that there is no hope that K–M integrals form a reverse martingale in n. Fortunately, it turns out that the probabilistic structure of K–M integrals is rich enough to make this technique applicable. For this, put Fn = σ (Zi:n , δ[i:n] , 1 i n, Zn+1 , δn+1 , . . .).
n is adapted to Fn with Fn ↓ F∞ . Furthermore, F∞ is trivial by the Hewitt– Then ϕ dF Savage zero–one law. The following relation is taken from Lemma 2.2 in Stute and Wang (1993a): n+1 − Rn+1 , E ϕ dFn | Fn+1 = ϕ dF (2.3) where Rn+1 =
n−1 n − j δ[j:n+1] 1 ϕ(Zn+1:n+1 )δ[n+1:n+1] (1 − δ[n:n+1] ) . n+1 n−j +1 j =1
If all data are uncensored, then Rn+1 = 0 so that (2.3) recovers the aforementioned fact that sample means of i.i.d. data form a reverse martingale in n. Under censorship the martingale structure may be violated. For nonnegative ϕ, we have Rn+1 0 so that at least n+1 . n | Fn+1 ϕ dF E ϕ dF It follows that for ϕ 0, K–M integrals are supermartingales in reverse time. Existing convergence theorems together with the Hewitt–Savage zero–one law yield convergence to a constant with probability one and in the mean. Since any ϕ may be decomposed into its positive and negative parts, the consistency is obtained for a general F -integrable ϕ. Since the data (Zi , δi ), 1 i n, provide no information τHabout F beyond τH , it is likely that K–M integrals can only provide estimators for −∞ ϕ dF . It turns out that in the presence of discontinuities also τH needs special caution. Identification of the limit leads to the following Theorem due to Stute and Wang (1993a). T HEOREM 2.1. Under |ϕ| dF < ∞, we have with probability 1 and in the mean: ϕ dFn = ϕ(x)F (dx) + 1{τH ∈A} ϕ(τH )F {τH }, lim (2.4) n→∞
{x<τH }
where A is the set of H -atoms (possibly empty). If F and G have unbounded support, then τH = ∞ so that the right-hand side of (2.4) equals ϕ dF . Cases when τH < ∞ are discussed in Stute and Wang (1993a). Theorem 2.1 in particular may be applied to all indicators ϕ(x) = 1{xt } , t τH . Note that τH is included so that we have pointwise consistency over the closed interval (−∞, τH ]. This is important to apply a uniformity argument for Glivenko–Cantelli n . The following result is due to Stute and Wang (1993a). convergence for F C OROLLARY 2.2. Put F (t) F (t) = F (τH −) + 1{τH ∈A} F {τH } Then
if t < τH , if t τH .
(t) → 0 with probability 1. Fn (t) − F sup t ∈R
Wang (1999) applied Theorem 2.1 to obtain strong consistency of M-estimators under random censorship. Stute and Wang (1993b) extended the SLLN for K–M integrals to multi-sample U -statistics, while Bose and Sen (1999) were able to extend the methodology to one-sample U -statistics. From time to time practitioners doubt the validity of independence between X and Y . n also holds in the general A closer look at our proofs reveals that convergence of ϕ dF case. The limit then naturally needs to be expressed through quantities which determine the distribution of (Z, δ), namely H and m(x) = P(δ = 1 | Z = x). Putting
x−
γ0 (x) = 0
1 − m(y) H (dy), 1 − H (y)
one gets with probability 1 and in the mean ϕ dFn = ϕmγ0 dH, lim n→∞
(2.5)
which collapses to the right-hand side of (2.4) if X and Y are independent. If this as sumption is slightly violated it is likely that the limit (2.5) is close to ϕ dF so that K–M integrals are reliable estimators also under weak dependence between X and Y .
3. The CLT Kaplan and Meier (1958), in their landmark paper, not only derived the formula for n (x) but also added – on heuristic grounds – some useful comments on the (limit) coF n (x) and F n (y). Among other things they pointed out that the variance has variance of F many similarities with Greenwood’s (1926) and Irwin’s (1949) formula in connection with actuarial estimates. They also mention (p. 476) that “in the derivation of approxin (x) may have is neglected”. For the limit covariance they mate formulas any bias that F come up with the expression (in our terms) x 1−G 1 − F (x) 1 − F (y) (3.1) dF. 2 0 (1 − H ) In the context of functional distributional convergence (3.1) was justified by Breslow and Crowley (1974). In their extension of Donsker’s invariance principle for the empirical process, they showed that the Kaplan–Meier process
n (x) − F (x) , 0 x T < τH αˆ n (x) = n1/2 F weakly converges in the Skorokhod space D[0, T ] to a centered Gaussian process with covariance (3.1). This process may be written as a scaled time transformed Brownian motion α∞ (x) = 1 − F (x) BΓ (x) ,
where
x
Γ (x) = 0
1−G dF. (1 − H )2
For a general ϕ, the literature was less elaborate for some time. Kaplan and Meier (1958) also considered the mean lifetime pertaining to ϕ(x) = x. Since X is assumed to be nonnegative, integration by parts yields ∞ ∞ 1 − F (x) dx, xF (dx) = 0
0
respectively, ∞
n (dx) = xF
0
∞
n (x) dx. 1−F
(3.2)
0
n is a bona fide distribution function. This Note that the last equation only holds when F is the case if and only if the largest observation is uncensored. For a rigorous treatment of (3.2) Sander (1975) came to the conclusion that “it is extremely difficult to obtain the T distribution theory for the estimators of 0 (1 − F (x)) dx whenever T = ∞”. Susarla and Van Ryzin (1980) apparently were the first to provide a rigorous treatment for the mean lifetime estimator truncated at Mn but such that Mn → ∞ at appropriate rates, as n → ∞. Gill (1983) applied convergence of the Kaplan–Meier process plus integration by parts to obtain, under some additional tail assumptions on the censoring mechanism, distributional convergence of the mean lifetime estimator and K–M integrals for ϕ’s which are nonnegative, continuous and nonincreasing. Schick et al. (1988) obtained, for n in terms of a sum of i.i.d. random this class of ϕ’s, a weak representation of ϕ dF variables plus a remainder. In all of these papers integration by parts was essential. n , under regularity Yang (1994) was able to extend distributional convergence of ϕ dF conditions on F , to those ϕ’s satisfying ϕ2 (3.3) dF < ∞. 1−G See also Akritas (2000). The integral (3.3) will become part of the limit variance so that n as a sum of (3.3) is indispensable. Stute (1995) obtained a representation of ϕ dF i.i.d. random variables plus a remainder which is valid under no regularity assumptions on F and G. Moreover, the paper was written up for the case Z = min(X, Y ) only as a historical tribute but may be readily extended to the case of general censorship, in which as in Section 2 the distributional characteristics of the observed (Z, δ) are H and m and not F and G. The key observation of this approach is the fact that a K–M integral n may be written as ϕ dF n
Win ϕ(Zi:n )
i=1
=
ϕ(w) exp n 0
w−
ln 1 +
1 n0 (dz) H n1 (dw) H n(1 − Hn (z))
Kaplan–Meier integrals
95
n1 is defined as in the first section and correspondingly in which H n0 (x) = 1 1{Zi x,δi =0} . H n n
i=1
Expansion of the logarithmic term and neglecting error terms leads one to a U -statistic of degree three. Its Hájek projection is the desired (simple) sum of i.i.d. random variables to which the ordinary CLT applies. For a large class of ϕ’s the error terms are o(1) with probability one so that an application of the LIL also establishes the law of the iterated logarithm for K–M integrals. n was discussed in Stute (1995). The general formula for the limit variance of ϕ dF For the purpose of the present paper it is enough to consider independent censorship with a continuous F . We then have τH n − n1/2 (3.4) ϕ dF ϕ dF → N 0, σ12 in distribution 0
with
τH 2 ϕ2 dF − ϕ dF 1−G 0 0 2 τH 1 − F (x) − ϕ dF G(dx). [1 − H (x)]2 x
σ12 =
τH
(3.5)
Note that in this three-terms formula for σ12 , the last term vanishes if there is no censorship (G ≡ 0) so that in this case σ12 reduces to σ 2 from Section 1:
2 σ12 = σ 2 = ϕ 2 dF − ϕ dF . Akritas (2000) used continuous time martingale theory to come up with a variance which may be expressed through the function τH 1 ϕ(x) ˜ = 1 − F (x−) ϕ(x) − ϕ dF . 1 − F (x) x We prefer, however, (3.5) since it not only reduces to σ 2 if G ≡ 0 but also since the automatic estimator of σ12 proposed and studied in Section 5 mimics (3.5). A useful extension of the above result is due to Bose and Sen (2002) who also obtained the CLT for K–M U -statistics.
4. Bias n (x) was already briefly discussed in Kaplan As we learned in Section 2, the bias of F and Meier (1958). A first rigorous treatment may be found in Gill’s (1980) thesis. In his formula (3.2.17) he showed that n (x) − F (x) ≡ Bias F n (x) 0, −F (x)H n (x) EF
n (x) is always biased downwards. It also reveals that Kaplan and Meier (1958) i.e., F were right in neglecting the bias when x with H (x) < 1 is fixed. What the left inequality also suggests is that the bias increases as x gets large and that it may then become a nonnegligible quantity. Mauro (1985) extended the right inequality to a general K–M integral:
n 0 for ϕ 0. Bias ϕ dF This is also immediate from integrating (2.3). Zhou (1988) was able to establish a lower bound whenever ϕ 0 is continuous and Riemann-integrable:
n n . − ϕ(t)H (t)F (dt) Bias ϕ dF Phadia and Shao (1999) and Zhang (1999) derived an exact expression for the kth mon (x). ment of 1 − F In Stute (1994b) we derived a formula and an expansion for the bias of a general K–M integral in such a way that (see p. 476): (1) (2) (3) (4) (5)
unbiasedness was readily recovered if there is no censorship; Bias → 0 as n → ∞ for a general ϕ; sharp lower and upper bounds are easily available; all expressions only depend on the joint distribution of the observed (Z, δ)’s; the effect of light, medium or heavy censoring on the bias may be easily discovered.
The solution to this program is surprisingly complicated. But what comes out is that informally speaking the bias of a Kaplan–Meier integral may – be zero, if there is no censoring; – decrease to zero exponentially fast, if, e.g., ϕ is bounded and vanishes right of some T < τH ; – decrease to zero at any polynomial rate, if, e.g., 0 ϕ(x) ↑ ∞ as x → ∞ and censoring is heavy. In particular, the bias may decrease to zero at a rate slower than n−1/2 and therefore becomes an important quantity in assessing the quality of the approximation of ϕ dF n . Gather and Pawlitschko (1998) showed that if ϕ 0 is nondecreasing, by ϕ dF the bias becomes smaller if we set artificially δ[n:n] = 1 even if the largest datum is censored. Denote this estimator with Fn∗ . See also Chen et al. (1982). Wellner (1985) n (x) with Fn∗ (x), i.e., for the nonincreasing ϕ = 1[0,x] , and was led to prefer compared F Fn , because in the cases investigated by him the upward bias of Fn∗ was worse than the n . This puzzling effect is caused by the fact that F n need not be a downward bias of F bona fide d.f. so that what is good for a nondecreasing ϕ does not automatically apply to a nonincreasing one. n which is based on the followStute (1994a) proposed another modification of F ing observation. Suppose that all X’s were observable. Then the empirical d.f. based estimate of ΛF would be Λn (x) =
n 1{Xi:n x} . n−i +1 i=1
(4.1)
In order to measure the impact of censoring we compute the conditional expectation of the Nelson–Aalen estimator with respect to the ordered X’s. It then turns out that n
1{Xi:n x} 1 − Gn−i+1 (Xi:n ) . E Λn (x) | Xi:n , 1 i n = n−i +1 i=1
Compared with (4.1) we see that censoring causes an additional bias term n 1{Xi:n x} n−i+1 G (Xi:n ). n−i+1
(4.2)
i=1
This is particularly large if G compared with F has short tails. Since (4.2) is unknown n and then to substione may be tempted to replace G by its Kaplan–Meier estimator G tute the (unknown) X-sample by some bootstrap replicates X1∗ , . . . , Xn∗ . Utilizing (1.1) n : again we finally come up with the following modified version of F n n ∗ ) n−i+1 ∗ x} G (Xi:n 1{Xi:n 1{Zi:n x,δ[i:n] =1} n n1 (x) = . 1− 1− 1−F n−i+1 n−i +1 i=1
i=1
n1 only jumps at the Z’s, we obtain for some weights W 1 : Since F in n1 (x) = F
n
1 Win 1{Zi:n x} .
i=1
Likewise,
n1 = ϕ dF
n
1 Win ϕ(Zi:n ).
i=1 1 W so that for ϕ 0 the modified procedure reduces the It is easily seen that Win in 1 and W becomes downward bias of the K–M integral. The difference between Win in negligible for small to moderate i while for i = n, n − 1, . . . there may be a difference resulting in an upweighing of the extreme order statistics. The asymptotic theories for both estimators are the same. For finite size, Stute (1994a) pointed out through sample n1 may have a significantly smaller bias and, an extensive simulation study that ϕ dF somewhat unexpectedly, also a smaller variance.
5. The jackknife The jackknife has been proposed to serve two purposes, see Quenouille (1956) and Tukey (1958): (i) If Tn happens to be a biased statistic, the jackknife is expected to provide a modification of Tn with a smaller bias. For later reference, let Tn = S(Fn ) be a statistical functional evaluated at the (ordinary) empirical distribution function Fn . Denote
with Fn(k) the empirical d.f. of the sample X1 , . . . , Xk−1 , Xk+1 , . . . , Xn and put Tn := n−1
n S Fn(k) . k=1
Then the bias-corrected substitute for Tn is defined as Tn = Tn − (n − 1) Tn − Tn . (ii) In the above notation, the jackknife estimate of variance of Tn is defined as Var(Jack) =
n − 1 (k) 2 Tn − Tn n n
k=1
with Tn(k) = S Fn(k) . A general account of the jackknife may be found in Gray and Schucany (1972) and Efron and Tibshirani (1993). For Tn = ϕ dFn , there is no need for a bias correction. This is also confirmed by the jackknife, in view of Tn = Tn . The jackknife estimate of variance equals n−1 times the sample variance, which is what one would expect. Note that the crucial thing about (i) and (ii) is that the statistic of interest is a function of Fn and therefore attaches mass 1/n to each of the data. As a consequence deletion of one point just results in a change of the mass 1/n to 1/(n − 1). For the K–M integral the situation is completely different since now the statistic is a sum of (functions n(k) the of) order statistics weighted by the complicated random Win ’s. Denoting with F n(k) ) not Kaplan–Meier estimator from the entire sample except (Zk:n , δ[k:n] ), then S(F only involves changes of the standard weights, but also incorporates replacement of the weights Win by new ones depending on the labels δ[i:n] , 1 i n. This may be one of the reasons why the jackknife under random censorship has been dealt with only in few papers. Gaver and Miller (1983) proved that the jackknife corrected Kaplan–Meier estin (x). Stute and Wang (1994) mator at a fixed x < τH has the same limit distribution as F derived, for an arbitrary ϕ, a finite sample formula for the jackknife modification of ϕ n : Sn ≡ ϕ dF n−2 n − 1 − j δ[j:n] n−1 Snϕ = ϕ(Zn:n )δ[n:n] (1 − δ[n−1:n] ) Snϕ + . n n−j
(5.1)
j =1
Hence the correction term depends on the largest Z-observation only but on all δconcomitants. Also the jackknife is much more cautious about attaching masses to the last observation when it is censored than what has been recommended in the ad-hoc proposal leading to Fn∗ in the previous section. It is also worthwhile to compare the correction term in (5.1) with Rn+1 in (2.3). First, both vanish unless the largest observation is uncensored and the second last is censored. Only in this case the extreme right data contain enough information on F to make a slight change of Wnn desirable. Shen (1998) extended (5.1) to the delete-2 jackknife but came to the conclusion that
for nondecreasing ϕ’s both jackknife corrected estimators are outperformed by with Fn∗ defined in Section 4. As to the variance, (3.5) suggests that
ϕ dFn∗
ϕ σ12 Var Sn ∼ as n → ∞. n So what Var(Jack) is expected to do is nVar(Jack) → σ12
with probability 1.
(5.2)
Since (omitting ϕ) nVar(Jack) = (n − 1)
n
Sn(k)
2
− n(n − 1) S 2n
(5.3)
k=1
and Sn(k)
k−1 i−1 ϕ(Zi:n )δ[i:n] n − j − 1 δ[j:n] = n−i n−j j =1
i=1
+
δ[j:n] n k−1 i−1 ϕ(Zi:n )δ[i:n] n − j − 1 δ[j:n] n−j n−i +1 n−j n−j +1 j =1
i=k+1
j =k+1
already is a complicated expression to be squared in (5.3), it is a priori not obvious at all if Var(Jack) is able to work out the three terms in the expression (3.5) for σ12 . In Stute (1996a) a finite sample formula for nk=1 [Sn(k) ]2 was derived from which one obtains that up to an error term nVar(Jack) = (n − 1)
n−1
ϕ 2 (Zi:n )δ[i:n]
i−1 n 2 1 n − j − 1 2δ[j:n] − S (n − i)2 n−j n−1 n j =1
i=1
j −1 n−2 (n − k − 1)(n − k + 1) 2δ[k:n] − (n − 1) (1 − δ[j :n] )bj (n − k)(n − k) j =1
×
n
k=1
2 ϕ(Zi:n )Win
,
i=j +1
where bj = bj n =
1 1 1 1 − . − + (n − j − 1)2 (n − j )2 (n − j − 1) n − j
This somewhat mysterious representation of nVar(Jack) in fact constitutes the empirical analog of the three-terms expression (3.5) for σ12 and collapses to the classical estimator
of σ 2 if there is no censorship. By the SLLN for K–M integrals the second term converges to −( ϕ dF )2 . The first and third expressions need some special care. But after all it can in fact be shown that they converge to the desired limits. Provided the error term is negligible these pieces altogether would imply (5.2). Unfortunately, and somewhat unexpectedly, this holds true only when ϕ(x) → 0 as x → τH . So, in particular, the jackknife yields a consistent estimate of σ12 for ϕ = 1[0,t ] with t < τH . For a general ϕ, it may be inconsistent due to the observation that the remainder term becomes nonnegligible iff |ϕ(Zn:n )| is moderately large and δ[n−1:n] = 0
and δ[n:n] = 1.
(5.4)
If, under (5.4), we redefine δ[n:n] by artificially putting δ[n:n] = 0, we obtain a slightly changed version of Var(Jack), for which the remainder term vanishes and the three leading terms still converge. In other words, this modification satisfies (5.2) under the only assumption that σ12 is finite. T HEOREM 5.1 (Stute (1996a)). Under ity 1:
< ∞, we have with probabilϕ 2 /(1 − G) dF
lim nVar(Jack) = σ12 .
n→∞
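For completeness, here is a hedged sketch of the generic delete-one jackknife variance for a K–M integral, computed by brute force (recomputing the weights on every leave-one-out sample) rather than through the closed-form expressions derived above; note Theorem 5.1's caveat that consistency may fail for a $\varphi$ that does not vanish near $\tau_H$ unless the modification described above is used. The helper names are illustrative assumptions.

```python
import numpy as np

def km_integral(phi, z, delta):
    """Compact Kaplan-Meier integral, as sketched in Section 1."""
    order = np.lexsort((-delta, z))
    z, d = z[order], delta[order].astype(float)
    n = len(z)
    ranks = np.arange(n)
    surv_before = np.concatenate(([1.0], np.cumprod(1.0 - d[:-1] / (n - ranks[:-1]))))
    w = d / (n - ranks) * surv_before
    return np.sum(w * phi(z))

def jackknife_variance(phi, z, delta):
    """Delete-one jackknife estimate of Var(S_n) for the K-M integral S_n:
    (n-1)/n * sum_k (S_n^(k) - mean)^2, each leave-one-out sample getting
    freshly computed K-M weights."""
    n = len(z)
    loo = np.array([km_integral(phi, np.delete(z, k), np.delete(delta, k))
                    for k in range(n)])
    return (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
```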
6. Censored correlation and regression

In many situations we not only observe (Z, δ) but also a covariable vector U associated with the true but possibly censored lifetime X. In such a situation one may be interested in the joint distribution
$$F^0(u, x) = P(U \le u,\ X \le x), \quad u \in \mathbb{R}^p,\ x \in \mathbb{R}.$$
Here p is the dimension of U and U ≤ u is to be understood componentwise. In this setup the vector U is always observable but may be correlated with the censoring variable. Sometimes the dependence between U and X is assumed to be of a particular type. As to regression we only mention the Cox-proportional hazards model and the accelerated failure time regression model. See, e.g., Kalbfleisch and Prentice (1981), Lawless (1982), Ritov (1990) and Tsiatis (1981). We first discuss some estimation issues where no such model assumptions are required and the approach is completely nonparametric. Also no conditions as to continuity etc. are needed. For identifiability reasons we just have to require that
$$P(\delta = 1 \mid U, X) = P(\delta = 1 \mid X). \tag{6.1}$$
Condition (6.1) may be viewed as a Markov property of the “sequence” U, X, δ. It was introduced in Stute (1993) and further discussed in Stute (1996b, 1999). We observe independent data (Ui, Zi, δi), 1 ≤ i ≤ n. A function ϕ may now also depend on Ui so that the K–M integral becomes
$$S_n^{\varphi} \equiv S_n = \sum_{i=1}^{n} W_{in}\, \varphi(U_{[i:n]}, Z_{i:n}). \tag{6.2}$$
The random vector U[i:n] denotes the concomitant vector associated with Zi:n. To demonstrate the richness of the above class of statistics we present some examples of ϕ together with the targets:
• ϕ(u, x) = 1{u ≤ u0, x ≤ x0} leads to the K–M estimator of F⁰(u0, x0):
$$\hat F_n^0(u_0, x_0) = \sum_{i=1}^{n} W_{in}\, 1_{\{U_{[i:n]} \le u_0,\ Z_{i:n} \le x_0\}}.$$
• For univariate U's, consider
$$\varphi_1(u, z) = uz, \quad \varphi_2(u, z) = z, \quad \varphi_3(u, z) = u, \quad \varphi_4(u, z) = z^2, \quad \varphi_5(u, z) = u^2.$$
Denote with $S_n^{\varphi_i}$ the corresponding K–M integrals. Combination of these yields estimates of the covariance and correlation of (U, X). Rank correlation is dealt with in Stute (1993).
• Set, for B ⊂ ℝ^p (measurable) and y ∈ ℝ:
$$\varphi_1(u, z) = (z - y)\,1_{\{z>y,\ u\in B\}}, \qquad \varphi_2(u, z) = 1_{\{z>y,\ u\in B\}}.$$
The ratio of $S_n^{\varphi_1}$ and $S_n^{\varphi_2}$ estimates the mean residual lifetime at y, subject to U ∈ B.
• Here we assume that the true (Ui, Xi) satisfy a linear regression model
$$X_i = U_i^t \beta_0 + \varepsilon_i, \quad 1 \le i \le n$$
(possibly after an appropriate transformation of Xi). The nonlinear case was dealt with in Stute (1999). The errors satisfy E[εi | Xi] = 0, but dependence between εi and Xi as in many heteroscedastic models is allowed. The weighted least squares estimator in the present framework is defined as the minimizer βn of
$$\beta \mapsto \sum_{i=1}^{n} W_{in}\bigl(Z_{i:n} - U_{[i:n]}^t \beta\bigr)^2 \equiv S_n(\beta).$$
It has the explicit representation
$$\hat\beta_n = M_{1n}^{-1} M_{2n} \vec Z_n,$$
where $\vec Z_n = (Z_{1:n}, \ldots, Z_{n:n})^t$ and
$$M_{1n}(i, j) = \sum_{k=1}^{n} W_{kn}\, U_{[k:n]}^{i}\, U_{[k:n]}^{j}, \quad 1 \le i, j \le p,$$
$$M_{2n}(i, s) = W_{sn}\, U_{[s:n]}^{i}, \quad 1 \le i \le p,\ 1 \le s \le n.$$
Here $U_{[s:n]}^{i}$ denotes the ith coordinate of $U_{[s:n]}$.
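A minimal sketch of the weighted least squares estimator with K–M weights follows; it is an illustration under the assumption that the data are stored as NumPy arrays, not the author's implementation, and all names are ours.

```python
import numpy as np

def km_weighted_lse(z, delta, u):
    """Weighted least squares estimate beta_n = M_{1n}^{-1} M_{2n} Z with
    Kaplan-Meier weights. z: (n,) responses, delta: (n,) censoring indicators,
    u: (n, p) covariates; the rows are jointly sorted by z below."""
    order = np.argsort(z)
    z, delta, u = z[order], delta[order], u[order]
    n = len(z)
    w = np.zeros(n)
    prod = 1.0
    for i in range(n):
        w[i] = delta[i] / (n - i) * prod           # K-M weight W_{in}
        prod *= ((n - i - 1) / (n - i)) ** delta[i]
    m1 = (u * w[:, None]).T @ u                    # M_{1n}
    rhs = (u * w[:, None]).T @ z                   # M_{2n} applied to the Z-vector
    return np.linalg.solve(m1, rhs)
```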
Now, it may be seen that in all examples the consistency of the relevant estimators may be traced back to the consistency of proper K–M integrals (6.2). In the present situation the analog of Theorem 2.1 becomes (see Stute (1993))
$$\lim_{n\to\infty} S_n^{\varphi} = \int_{\{X<\tau_H\}} \varphi(U, X)\, dP + 1_{\{\tau_H \in A\}} \int_{\{X=\tau_H\}} \varphi(U, \tau_H)\, dP = S^{\varphi}.$$
As a possible application, the strong consistency of $S_n^{\varphi}$ applied to proper ϕ's yields the strong consistency of $\hat\beta_n$. The statistic (6.2) also allows for an i.i.d. representation which yields the asymptotic normality of $n^{1/2}(S_n^{\varphi} - S^{\varphi})$. See Stute (1996b) for details. As one of several applications we obtained the asymptotic normality of $n^{1/2}(\hat\beta_n - \beta_0)$. The limit covariance matrix can be consistently estimated by utilizing the aforementioned jackknife applied to proper ϕ's. As another application we briefly mention an extension of the F-test when the dependent variable is subject to censoring. Assume that the (linear) hypothesis of interest is H: Aβ0 = c, where A and c are a given q × p matrix and a q-vector, respectively. The weighted least squares estimator of β0 with K–M weights satisfying H then equals
$$\hat\beta_H = \hat\beta_n - M_{1n}^{-1} A^t \bigl(A M_{1n}^{-1} A^t\bigr)^{-1} \bigl(A\hat\beta_n - c\bigr).$$
The corresponding F-type test statistic becomes
$$L_n = n\,\bigl(\hat\beta_n - \hat\beta_H\bigr)^t A^t \bigl(A M_{1n}^{-1} \Sigma M_{1n}^{-1} A^t\bigr)^{-1} A \bigl(\hat\beta_n - \hat\beta_H\bigr).$$
Here Σ is an estimable covariance matrix obtained from the linear expansion of several K–M integrals. It may then be shown that
$$L_n \to \chi_q^2 \quad \text{in distribution}, \tag{6.3}$$
where $\chi_q^2$ is the chi-square distribution with q degrees of freedom. Assertion (6.3) constitutes the asymptotic extension of the famous F-test to censored regression. As another application in censored regression, we only mention that results on K–M integrals may be applied to get prediction intervals for a future lifetime given the covariable. The F-test is an example of a test where a certain hypothesis about a parameter is tested but there is no doubt about the overall validity of the model. Tests assessing the goodness-of-fit of the model are different and have found a lot of interest in the last ten years. As to censored regression, K–M methodology was applied in this context in Stute et al. (2000).
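The constrained estimator and the F-type statistic translate directly into a few lines of linear algebra. The sketch below assumes that β̂_n, M_{1n} and an estimate of Σ are already available; it is an illustration with names of our choosing, not code from the chapter.

```python
import numpy as np

def restricted_lse_and_fstat(beta_n, m1, sigma, a, c, n):
    """Constrained estimate beta_H under H: A beta = c and the F-type
    statistic L_n of (6.3). m1 is M_{1n}, sigma an estimate of Sigma."""
    m1_inv = np.linalg.inv(m1)
    ama = a @ m1_inv @ a.T
    beta_h = beta_n - m1_inv @ a.T @ np.linalg.solve(ama, a @ beta_n - c)
    middle = np.linalg.inv(a @ m1_inv @ sigma @ m1_inv @ a.T)
    d = beta_n - beta_h
    ln = n * d @ a.T @ middle @ a @ d
    return beta_h, ln
```

The value ln can then be compared with an upper quantile of the chi-square distribution with q degrees of freedom, q being the number of rows of A.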
7. Conclusions

K–M integrals are analogs of empirical integrals when the data are subject to right censoring. They may be viewed as basic statistics to which more complicated quantities
may be traced back through linearization. In this article we reviewed the most important properties of K–M integrals: Strong and distributional convergence as well as bias and variance issues. Also K–M integrals (or sums) have interesting applications in multivariate statistics.
References Akritas, M.G. (2000). The central limit theorem under censoring. Bernoulli 6, 1109–1120. Bose, A., Sen, A. (1999). The SLLN for Kaplan–Meier U -statistics. J. Theoret. Probab. 12, 181–200. Bose, A., Sen, A. (2002). Asymptotic distribution of the Kaplan–Meier U -statistics. J. Multivariate Anal. Submitted for publication. Breslow, N.E., Crowley, J.J. (1974). A large sample study of the life table and product-limit estimates under random censorship. Ann. Statist. 2, 437–453. Chen, Y.Y., Hollander, M., Langberg, N.A. (1982). Small sample results for the Kaplan–Meier estimator. JASA 77, 141–144. Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York. Etemadi, N. (1981). An elementary proof of the strong law of large numbers. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 55, 119–122. Gather, U., Pawlitschko, J. (1998). On Efron’s and Gill’s version of the Kaplan–Meier integral. Comm. Statist. Theory Meth. 27, 181–192. Gaver, N.P., Miller, R.G. (1983). Jackknifing the Kaplan–Meier survival estimator for censored data: Simulation results and asymptotic analysis. Comm. Statist. Theory Meth. 12, 1701–1718. Gill, R.D. (1980). Censoring and Stochastic Integrals. In: Math. Centre Tracts, Vol. 124. Math. Centrum, Amsterdam. Gill, R.D. (1983). Large sample behavior of the product-limit estimator on the whole line. Ann. Statist. 11, 49–58. Gray, H.L., Schucany, W.R. (1972). The Generalized Jackknife Statistics. Dekker, New York. Greenwood, M. (1926). The natural duration of cancer. Reports on Public Health and Medical Subjects, No. 33. His Majesty’s Stationary Office. Irwin, J.O. (1949). The standard error of an estimate of expectational life. J. Hygiene 47, 188–189. Kalbfleisch, J.D., Prentice, R.L. (1981). The Statistical Analysis of Failure Time Data. Wiley, New York. Kaplan, E.L., Meier, P. (1958). Nonparametric estimation from incomplete observations. JASA 53, 457–481. Lawless, J.F. (1982). Statistical Models and Methods for Lifetime Data. Wiley, New York. Mauro, D. (1985). A combinatoric approach to the Kaplan–Meier estimator. Ann. Statist. 13, 142–149. Phadia, E.G., Shao, P.Y. (1999). Exact moments of the product limit estimator. Statist. Probab. Lett. 41, 277– 286. Quenouille, M.H. (1956). Notes on bias in estimation. Biometrika 43, 353–360. Ritov, Y. (1990). Estimation in a linear regression model with censored data. Ann. Statist. 18, 303–328. Sander, J.M. (1975). Asymptotic normality of linear combinations of functions of order statistics with censored data. Tech. Report, No. 8. Div. Biostatistics Stanford University. Schick, A., Susarla, V., Koul, H. (1988). Efficient estimation of functionals with censored data. Statist. Decisions 6, 349–360. Shen, P.-S. (1998). Problems arising from jackknifing the estimate of a Kaplan–Meier integral. Statist. Probab. Lett. 40, 353–361. Shorack, G.R., Wellner, J.A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York. Stute, W. (1993). Consistent estimation under random censorship when covariables are present. J. Multivariate Anal. 45, 89–103. Stute, W. (1994a). Improved estimation under random censorship. Comm. Statist. Theory Meth. 23, 2671– 2682. Stute, W. (1994b). The bias of Kaplan–Meier integrals. Scand. J. Statist. 21, 475–484. Stute, W. (1995). The central limit theorem under random censorship. Ann. Statist. 23, 422–439.
Stute, W. (1996a). The jackknife estimate of variance of a Kaplan–Meier integral. Ann. Statist. 24, 2679–2704. Stute, W. (1996b). Distributional convergence under random censorship when covariables are present. Scand. J. Statist. 23, 461–471. Stute, W. (1999). Nonlinear censored regression. Statist. Sinica 9, 1089–1102. Stute, W., González Manteiga, W., Sánchez Sellero, C. (2000). Nonparametric model checks in censored regression. Comm. Statist. Theory Meth. 29, 1611–1629. Stute, W., Wang, J.L. (1993a). The strong law under random censorship. Ann. Statist. 21, 1591–1607. Stute, W., Wang, J.L. (1993b). Multi-sample U -statistics for censored data. Scand. J. Statist. 20, 369–374. Stute, W., Wang, J.L. (1994). The jackknife estimate of a Kaplan–Meier integral. Biometrika 81, 602–606. Susarla, V., Van Ryzin, J. (1980). Large sample theory for an estimator of the mean survival time from censored samples. Ann. Statist. 8, 1002–1016. Tsiatis, A.A. (1981). A large sample study of Cox’s regression model. Ann. Statist. 8, 93–108. Tukey, J.W. (1958). Bias and confidence in not quite large samples. Ann. Math. Statist. 29, 614 (Abstract). Wang, J.L. (1999). Asymptotic properties of M-estimators based on estimating equations and censored data. Scand. J. Statist. 26, 297–318. Wellner, J.A. (1985). A heavy censoring limit theorem for the product limit estimator. Ann. Statist. 13, 150– 162. Yang, S. (1994). A central limit theorem for functionals of the Kaplan–Meier estimator. Statist. Probab. Lett. 21, 337–345. Zhang, P.H. (1999). Exact bias and variance of the product limit estimator. Sankhy¯a (Ser. B) 61, 413–421. Zhou, M. (1988). Two-sided bias bound of the Kaplan–Meier estimator. Probab. Theory Rel. Fields 79, 165– 173.
6
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23006-6
Statistical Analysis of Doubly Interval-Censored Failure Time Data
Jianguo Sun
1. Introduction

This article discusses statistical analysis of doubly interval-censored failure time data (De Gruttola and Lagakos, 1989; Fang and Sun, 2001). By doubly interval-censored data we mean that the survival time of interest is defined as the elapsed time between two related events, called initial and subsequent events. Furthermore, observations on the occurrences of both events could be right- or interval-censored. A right-censored observation means that the occurrence time of an event is known either exactly or to be greater than a censoring time (Kalbfleisch and Prentice, 1980), while an interval-censored observation means that the occurrence of an event is observed only to belong to an interval (Sun, 1998). One field that often sees doubly interval-censored failure time data is disease progression or epidemiological studies where initial and subsequent events may represent infection and subsequent onset of certain disease, respectively. In these situations, doubly interval-censored observations occur mainly due to the nature of the disease and/or the structure of the study design. A well-known example of doubly interval-censored data arises from follow-up studies of patients who have been or are at risk of being infected by the human immunodeficiency virus (HIV) and thus are also at risk of developing the acquired immune deficiency syndrome (AIDS). In this case, one variable, which is of great interest to clinicians and physicians and which plays an important role in the study of natural history of the disease and in projections of the course of the epidemic, is AIDS incubation time (survival time of interest), the time between the HIV infection (initial event) and the diagnosis of AIDS (subsequent event). Since the HIV infection can usually only be determined through periodic blood tests, observations on it are thus often interval-censored and given in the form of the dates of the last negative test and the first positive test. In the meantime, observations on the diagnosis of AIDS could be either right- or interval-censored due to, for example, the end of the study and the periodic follow-up nature of the study design (De Gruttola and Lagakos, 1989; Kim et al., 1993). Doubly interval-censored failure time data include usual right- or interval-censored failure time data as special cases (Kalbfleisch and Prentice, 1980; Sun, 1998). For exam-
ple, they reduce to interval-censored data if the occurrence of initial event can be exactly observed and observations on the occurrence of subsequent event are interval-censored. Furthermore, if observations on the occurrence of subsequent event are right-censored, we then have usual right-censored failure time data for the survival time of interest. Note that in the above special cases, for the analysis of the survival time, it is equivalent to treat the occurrence time of the initial event being time zero, which is what is usually done in this case. The analysis of doubly interval-censored data has recently attracted much attention, especially in the context of the analysis of AIDS incubation time (Bacchetti, 1990; Jewell, 1994; Jewell et al., 1994; Gómez and Lagakos, 1994; Sun, 1995, 1997; Tu, 1995). In this case, most of researches focus on two basic problems: nonparametric estimation of the distribution function of the AIDS incubation time and its regression analysis. In addition to interval-censoring, another factor that complicates the analysis is left-truncation. The truncation occurs since an AIDS cohort follow-up study usually includes only subjects who have been infected by HIV, but not developed AIDS at the time of recruitment. This means that the subject’s AIDS incubation time is left-truncated by the time from HIV infection to the recruitment to the study. The first paper that considered nonparametric estimation of the distribution function of the AIDS incubation time is given by De Gruttola and Lagakos (1989). They discussed situations without truncation and proposed to estimate jointly the distribution of HIV infection time and the AIDS incubation distribution using the full likelihood approach. In particular, they proposed a self-consistency algorithm for the joint estimation. Following De Gruttola and Lagakos (1989), a few other authors considered the same problem. For example, Gómez and Lagakos (1994) and Gómez and Calle (1999) presented some marginal estimation methods also for no truncation situation. Lim et al. (2002), Sun (1995, 1997) and Tu (1995) developed some algorithms for the situation where there exist both doubly interval-censoring and left-truncation. Frydman (1992, 1995) discussed the problem under the framework of three-state Markov model. For regression analysis of doubly interval-censored failure time data, several authors considered the problem under the proportional hazards model for the survival time of interest (Cox, 1972; Kalbfleisch and Prentice, 1980). Among these, Kim et al. (1993) studied the full likelihood approach for analyzing AIDS incubation time and derived the maximum likelihood estimates of the distribution of HIV infection time, the baseline survival function of the AIDS incubation time and regression parameters. Goggins et al. (1999) used the same approach and presented a Monte Carlo EM algorithm. In contrast, to investigate the effect of covariates on AIDS incubation time, Sun et al. (1999) developed an estimating equation-based approach that only involves estimation of regression parameters of interest. Following Sun et al. (1999), Pan (2001) presented a multiple imputation method for the problem. Other references that discuss the analysis of doubly interval-censored data include Fang and Sun (2001) and Sun (2001). 
The former studied the asymptotic consistency of the nonparametric maximum likelihood estimate of the distribution of survival time of interest, while the latter considered the treatment comparison problem and proposed a nonparametric test procedure. Many authors have considered the analysis of right- and interval-censored failure time data, the special cases of doubly interval-censored data and this is especially
the case for right-censored data. Among others, an excellent book about the analysis of right-censored data is given by Kalbfleisch and Prentice (1980). For intervalcensored data, several review papers have been given (Huang and Wellner, 1997; Lindsey and Ryan, 1998; Sun, 1998), where one can find most references about the methods proposed for their analyses. The remainder of the article is organized as follows. Section 2 discusses nonparametric estimation of the distribution of the survival time of interest. In particular, we will first consider the situation where no truncation exists and then talk about the situation with truncation on observations on the occurrence of subsequent event. Several algorithms will be described and compared for the nonparametric estimation. Section 3 deals with regression analysis of doubly interval-censored failure time data. For this, we will discuss two methods for fitting the proportional hazards model to the data. Section 4 considers the problem of nonparametric comparison of survival functions based on doubly interval-censored failure time data and Section 5 concludes the article with some discussion and directions for future researches. Throughout the article, we will assume that the survival time of interest is independent of the time of the occurrence of initial event (De Gruttola and Lagakos, 1989; Kim et al., 1993; Sun, 2001). Some comments on this will be given below. Also we will assume that the mechanisms yielding right- or interval-censoring and truncation are independent of the underlying survival time of interest.
2. Nonparametric estimation of a distribution function Consider a survival study that involves n independent subjects. For subject i, let Xi and Si denote the times of occurrences of initial and subsequent events, respectively, i = 1, . . . , n. Define Ti = Si − Xi , the survival time of interest. In an AIDS cohort study, Xi and Si represent HIV infection time and AIDS diagnosis time, respectively, and Ti corresponds to AIDS incubation time. In this section, we will assume that all random variables Xi , Si and Ti take discrete values (De Gruttola and Lagakos, 1989; Sun, 1997). Suppose that for each subject, two intervals [Li , Ri ] and [Ui , Vi ] are observed to which Xi and Si belong. That is, Xi ∈ [Li , Ri ] and Si ∈ [Ui , Vi ], i = 1, . . . , n. Furthermore, suppose that there exists a truncation interval [Bi1 , Bi2 ] for Si . That is, information about Si is available only if the subsequent event occurs inside the interval [Bi1 , Bi2 ]. Consequently, [Ui , Vi ] ⊂ [Bi1 , Bi2 ] and we have truncated and doubly interval-censored data on the Ti ’s. If Bi1 = 0 and Bi2 = ∞, we then have usual doubly interval-censored failure time data. Furthermore, if Li = Ri , the data on the Ti ’s become interval-censored failure time data (Sun, 1998), while they reduce to right-censored data if we also have Ui = Vi or Vi = ∞ (Kalbfleisch and Prentice, 1980), i = 1, . . . , n. 2.1. Estimation without truncation In this subsection, we consider the situation where Bi1 = 0 and Bi2 = ∞, i = 1, . . . , n. That is, there does not exist truncation. Let u1 < · · · < ur denote the possible mass
points for the Xi's and v1 < · · · < vs the possible mass points for the Ti's. Define $\alpha^i_{jk} = I(L_i \le u_j \le R_i,\ U_i \le u_j + v_k \le V_i)$, wj = Pr(Xi = uj) and fk = Pr(Ti = vk), j = 1, . . . , r, k = 1, . . . , s. Then the full likelihood function L1 can be written as
$$L_1\bigl(\{w_j\}_{j=1,\dots,r},\, \{f_k\}_{k=1,\dots,s}\bigr) = \prod_{i=1}^{n}\sum_{j=1}^{r}\sum_{k=1}^{s} \alpha^i_{jk}\, w_j\, f_k \tag{1}$$
(De Gruttola and Lagakos, 1989). To estimate the distribution function F = {fk}k=1,...,s of the Ti's, we will describe two algorithms. The first one (De Gruttola and Lagakos, 1989) is a generalization of the self-consistency algorithm given in Turnbull (1976) for interval-censored and truncated failure time data. It is based on the above full likelihood function and jointly estimates F and H = {wj}j=1,...,r, the distribution function of the Xi's, together. The second algorithm (Gómez and Lagakos, 1994) uses the same self-consistency idea, but involves two steps and estimates F and H separately. Define
$$I^i_{jk} = \frac{\alpha^i_{jk}\, w_j\, f_k}{\sum_{l=1}^{r}\sum_{m=1}^{s} \alpha^i_{lm}\, w_l\, f_m},$$
$$w_j^* = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{s} I^i_{jk}, \tag{2}$$
$$f_k^* = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{r} I^i_{jk}, \tag{3}$$
j = 1, . . . , r, k = 1, . . . , s. Note that the quantity $I^i_{jk}$ is the conditional expectation of the event Xi = uj and Ti = vk, and $\sum_{i=1}^{n} I^i_{jk}$ can be regarded as an estimate of the number of subjects with Xi = uj and Ti = vk given the observed data. Of course, they are functions of F and H except when exact data on both Xi and Si are obtained. Using the $I^i_{jk}$'s, for estimation of F, De Gruttola and Lagakos (1989) proposed the following self-consistency algorithm, which will be referred to as method I.
Step 1. Choose starting values for F and H.
Step 2. Compute the $I^i_{jk}$'s, wj*'s and fk*'s (updated estimates of the wj's and fk's) from Eqs. (2) and (3), respectively.
Step 3. Repeat step 2 until convergence.
It can be shown that the estimate resulting from this algorithm is either a saddle point or a local maximum of the likelihood function L1. One way to distinguish the maximum likelihood estimate from a saddle point is to examine the matrix, evaluated at the final estimate, of negative second derivatives of the logarithm of L1 with respect to F and H. If the eigenvalues of the matrix are positive, then the estimate is a local or global maximum, and if they are both positive and negative, the estimate is a saddle point. Given the ŵj's and f̂k's, the cumulative distribution functions of the Xi's and Ti's can be estimated, respectively, by
$$\hat H(x) = \sum_{u_j \le x} \hat w_j, \qquad \hat F(t) = \sum_{v_k \le t} \hat f_k.$$
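As an illustration of steps 1–3 of method I, here is a minimal NumPy sketch, assuming the indicator array α^i_{jk} has already been computed from the observed intervals; the uniform starting values and the convergence rule are arbitrary choices of ours, not part of De Gruttola and Lagakos (1989).

```python
import numpy as np

def dgl_self_consistency(alpha, n_iter=500, tol=1e-8):
    """Self-consistency algorithm (method I) for doubly interval-censored data.
    alpha: (n, r, s) array with alpha[i, j, k] = I(L_i <= u_j <= R_i,
           U_i <= u_j + v_k <= V_i).  Returns estimates (w, f)."""
    n, r, s = alpha.shape
    w = np.full(r, 1.0 / r)                # starting values for H
    f = np.full(s, 1.0 / s)                # starting values for F
    for _ in range(n_iter):
        joint = alpha * w[None, :, None] * f[None, None, :]   # alpha^i_jk w_j f_k
        denom = joint.sum(axis=(1, 2), keepdims=True)          # per-subject norm
        I = joint / denom                                      # I^i_jk of (2)-(3)
        w_new = I.sum(axis=(0, 2)) / n
        f_new = I.sum(axis=(0, 1)) / n
        done = max(np.abs(w_new - w).max(), np.abs(f_new - f).max()) < tol
        w, f = w_new, f_new
        if done:
            break
    return w, f
```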
Note that if Li = Ri, the above algorithm reduces to the self-consistency algorithm given in Turnbull (1976) for interval-censored data. In this case, we have
$$I^i_{\cdot k} = \frac{\alpha^i_{\cdot k}\, f_k}{\sum_{m=1}^{s} \alpha^i_{\cdot m}\, f_m} \tag{4}$$
and
$$f_k^* = \frac{1}{n}\sum_{i=1}^{n} I^i_{\cdot k}, \tag{5}$$
where $\alpha^i_{\cdot k} = I(U_i \le x_i + v_k \le V_i)$, or $\alpha^i_{\cdot k} = I(U_i \le v_k \le V_i)$ if treating all xi = 0. The same algorithm can also be found in Sun (1998). To describe the second algorithm, which will be referred to as method II, suppose that Ui = Vi or Vi = ∞, i = 1, . . . , n. That is, right-censored data on the Si's are obtained. Note that in this case, if one is only interested in estimating H, it is natural to employ the marginal likelihood
$$L_m\bigl(\{w_j\}_{j=1,\dots,r}\bigr) = \prod_{i=1}^{n}\ \sum_{j:\, L_i \le u_j \le R_i} w_j$$
based on only interval-censored data on the Xi's. For estimation of F, one can then assume that H is known and use the full likelihood function L1 given in (1), which reduces to
$$L_1\bigl(\{f_k\}_{k=1,\dots,s} \mid \{w_j\}_{j=1,\dots,r}\bigr) = \prod_{i=1}^{n} \Biggl(\sum_{j,k:\, L_i \le u_j \le R_i,\ u_j+v_k=U_i} w_j f_k\Biggr)^{\delta_i^1\delta_i^2} \times \Biggl(\sum_{j:\, L_i \le u_j \le R_i}\ \sum_{k:\, u_j+v_k > U_i} w_j f_k\Biggr)^{\delta_i^1(1-\delta_i^2)},$$
where $\delta_i^1 = I(R_i < \infty)$ and $\delta_i^2 = I(U_i = V_i)$, i = 1, . . . , n. This is the idea behind method II given in Gómez and Lagakos (1994), which can be described as follows.
Step 1. Estimate H = ({wj}j=1,...,r) using the nonparametric marginal maximum likelihood estimate from Lm, which can be obtained by method I with replacing (2) and (3) by (4) and (5).
Step 2. Given wj = ŵj from step 1 and the current estimate of the fk, say $\hat f_k^{(l)}$, calculate the updated estimate of fk by $\hat f_k^{(l+1)} = \{N_1^{(l)}(k) + N_2^{(l)}(k)\}/(n - m)$, where
$$N_1^{(l)}(k) = \sum_{i=1}^{n} \delta_i^1\delta_i^2\, \frac{\hat f_k^{(l)} \sum_{j:\,u_j=U_i-v_k} \varphi_{ij}\,\hat w_j}{\sum_{j=1}^{r} \varphi_{ij}\,\hat w_j\, \hat f^{(l)}_{(v=U_i-u_j)}}, \qquad N_2^{(l)}(k) = \sum_{i=1}^{n} \delta_i^1\bigl(1-\delta_i^2\bigr)\, \frac{\sum_{j=1}^{r} \varphi_{ij}\, I(U_i < u_j+v_k)\,\hat w_j\, \hat f_k^{(l)}}{\sum_{j=1}^{r} \varphi_{ij}\,\hat w_j \sum_{h:\,v_h>U_i-u_j} \hat f_h^{(l)}},$$
and $m = \sum_{i=1}^{n}(1 - \delta_i^1)$, where $\varphi_{ij} = I(u_j \in [L_i, R_i])$.
Step 3. Repeat step 2 until convergence.
It is easy to see that method I can be more efficient than method II since it is based on the full likelihood function. However, the former could be slower than the latter. Also, method I may have an identifiability problem with smaller data sets and some starting values will converge to saddle points of the likelihood instead of the maximum. In contrast, method II does not have these problems and is more stable and converges more rapidly. These two methods both concern estimation of the distribution or probability function of the survival time of interest. It is worth noting that the resulting estimates may be rough especially if the probability function is the main target. An alternative for the estimation is to use a penalized likelihood method, which adds an additional term to the likelihood function to control the smoothness of the estimates. For example, Bacchetti (1990) and Joly and Commenges (1999) developed some penalized likelihood methods for estimation of the hazard function of AIDS incubation time.

2.2. Estimation with truncation

Now we consider doubly interval-censored and truncated failure time data. As in the previous subsection, we will describe two estimation procedures, one for general situations and one for the case where right-censored observations are available for the Si instead of interval-censored observations. For general situations, the method (method III) described is a two-step self-consistency estimation procedure involving joint estimation of H and F and the idea behind the method is similar to, but not the same as, that used in method II. It includes as a special case the method developed by Sun (1995), which discussed nonparametric estimation of F when H is known or can be estimated separately. So far we have assumed that the distribution function of the Ti's is completely unknown. Sometimes this may not be the case or other considerations have to be taken into account for the estimation. For example, AIDS incubation time has been found to be quite long. On the other hand, most AIDS studies have relatively shorter durations and do not provide enough information about the tail of the distribution of the AIDS incubation time. To obtain an efficient estimate of the distribution, it is thus natural to estimate the tail parametrically, while still estimating the first part of the distribution nonparametrically (Lim et al., 2002). We will describe an EM estimation procedure below for such situations when right-censored observations are obtained for the Si. In the following, we will assume that
$$\int_{[B_{i1}-R_i,\ B_{i2}-L_i]} dF(t) = 1.$$
Let the uj's, vk's, φij's, $\alpha^i_{jk}$'s, F = ({fk}k=1,...,s) and H = ({wj}j=1,...,r) be defined as before. Also let $\gamma^i_{jk} = I(L_i \le u_j \le R_i,\ B_{i1} \le u_j + v_k \le B_{i2})$, j = 1, . . . , r, k = 1, . . . , s. Then the conditional likelihood function of the data given Xi ∈ [Li, Ri] has the form
$$L_c\bigl(\{w_j\}_{j=1,\dots,r},\, \{f_k\}_{k=1,\dots,s}\bigr) = \prod_{i=1}^{n} \frac{\sum_{j=1}^{r}\sum_{k=1}^{s} \alpha^i_{jk}\, w_j\, f_k}{\sum_{j=1}^{r}\sum_{k=1}^{s} \gamma^i_{jk}\, w_j\, f_k}$$
(Sun, 1997). To estimate F based on the above likelihood, define αik = I (vk ∈ [Ui − Ri , Vi − Li ]), βik = I (vk ∈ [Bi1 − Ri , Bi2 − Li ]), ∗ k:Ui −ul vk Vi −ul fk , if ul ∈ [Li , Ri ], φil = 1, otherwise and ηil∗ =
1,
k:Bi1 −ul vk Bi2 −ul
fk ,
if ul ∈ [Li , Ri ], otherwise,
l = 1, . . . , r, i = 1, . . . , n. Also define ∗ Ui ul +vk Vi wl , if vk ∈ [Ui − Ri , Vi − Li ], = αik 1, otherwise and ∗ βik =
1,
Bi1 ul +vk Bi2
wl , if vk ∈ [Bi1 − Ri , Bi2 − Li ], otherwise,
k = 1, . . . , s, i = 1, . . . , n. For joint estimation of H and F , Sun (1997) proposed the following two-step self-consistency estimation procedure. (0)
Step 0. Estimate H as in step 1 of method II and denote the estimate by H (0) = (wj ). (l)
Step 1. At the lth iteration, define the improved estimate, F (l) = ({fk }k=1,...,s ), of F as the maximum likelihood estimate of F from the conditional likelihood function (l−1) Lc with assuming that H is known and wj = wj from the previous iteration. This can be obtained by iterating the following self-consistency equation (b)
fk
=
1 M(F (b−1), H (l−1))
n
µik F (b−1) , H (l−1)
i=1
+ νik F (b−1) , H (l−1) ,
k = 1, . . . , s
with respect to b until convergence, with the fk(l−1) ’s as the initial estimates, where ∗f ∗ )f αik αik (1 − βik βik k k µik (F, H ) = s , νik (F, H ) = s ∗ ∗f α α f β β ij j ij j =1 j =1 ij ij j n s and M(F, H ) = i=1 k=1 [µik (F, H ) + νik (F, H )]. (l) Step 2. Define the improved estimate, H (l) = ({wj }j =1,...,r ), of H as the maximum (l)
likelihood estimate of H from Lc with assuming that F is known and fk = fk from step 1. This can be obtained by iterating the following self-consistency equation, (b)
wj =
1 M ∗ (H (b−1), F (l) )
n ∗ (b−1) (l) µij H ,F i=1
+ νij∗ H (b−1), F (l) ,
j = 1, . . . , r
with respect to b until convergence, with H (l−1) as the initial estimates, where φij φij∗ wj (1 − φij ηij∗ )wj ∗ µ∗ij (H, F ) = r , ν (H, F ) = r ij ∗ ∗ k=1 φik φik wk k=1 φik ηik wk and M ∗ (H, F ) = ni=1 rj =1 [µ∗ij (H, F ) + νij∗ (H, F )]. Step 3. Go back to step 1 and repeat this cycle until convergence. It can be shown that the conditional likelihood function Lc increases at each iteration between steps 1 and 2 and the algorithm converges to a local or maximum of Lc . Also, it can be shown that if there is no truncation, method III yields equivalent estimates to those given by method I. In other words, the estimates given by the above method are not equivalent to those given by method II, but the method has the same advantages over method I as method II in this case. This is because in method II, estimation of H is done in just one step and has nothing to do with estimation of F , while method III uses the similar idea, but estimates H and F together. For the special case where the Xi ’s are observed exactly, method III reduces to the Turnbull’s (1976) self-consistency algorithm. In the remainder of this subsection, we will consider a little simplified, but still realistic situation. Suppose that Bi2 = ∞ and Ui = Vi or Vi = ∞, i = 1, . . . , n. That is, right-censored and left-truncated data are observed for the Si ’s. Let S(t) and f (t) denote the survival and probability functions of the Ti ’s, respectively. We assume that f (t) has the form if vk τ, fk , f (vk ) = S(τ )/Gθ (τ ) gθ (vk ), otherwise, where τ denotes the change point of S(t) and gθ (t) = g(t, θ ) is a probability function characterized by the parameter vector θ with the cumulative distribution function Gθ (t) and Gθ (t) = 1 − Gθ (t). The above model assumes that the distribution of the Ti ’s is completely unknown up to τ and then follows a parametric model. Suppose that v1 < · · · < vk0 τ < vk0 +1 < · · · < vs . Then the goal becomes to estimate φ = {f1 , . . . , fk0 , θ }. Here we assume that τ is known, and for this case Lim et al. (2002) proposed an EM algorithm for the estimation. To describe the EM algorithm, assume that each subject i with Li < Ri or Ri < Bi1 corresponds to the outcome of a random sample drawn from a population with the conditional distribution H given X ∈ [Li , Ri ]; subject i was the first unit drawn whose initial event occurs within [Li , Ri ] and whose subsequent event occurs at or after Bi1 . Let Wi,x,s ∗ denote the number of subjects drawn from this conceptual population, prior to the selection of subject i, with X = x and Si = s ∗ , where Li x Ri and x s ∗ Bi1 . First note that for the current situation, the log likelihood function can be written as k0
Dj log(fj ) + Cj log S(vj ) L(φ) = E
j =1
s
Dj log S(τ )gθ (vk )/Gθ (τ )
+E
j =k0 +1
+ Cj log S(τ )Gθ (vj )/Gθ (τ ) ,
(6)
where Dj =
i −vj n R
Wi,x,x+vj +
i=1 x=Li
n I Xi = Ui − vj , δi2 = 1 i=1
and Cj =
n I Xi = Ui − vj , δi2 = 0 , i=1
where = I (Ui = Vi ) as defined before, i = 1, . . . , n. In the above expression for Dj , the inner sum in the first term is defined to be 0 whenever the upper limit of summation is smaller than the corresponding lower limit. Note that in Eq. (6), the term inside the expectation is the log likelihood function of φ that would be obtained if there were no truncation on the Si ’s and no interval-censoring on the Xi ’s. Now the EM algorithm can be described as follows. Let φ (l) = (f1(l) , . . . , fk(l) , θ (l)) 0 denote the current value of the estimates after l cycles of the EM algorithm. δi2
E-step. Calculate (l) wi,x,s ∗
=E W
i,x,s ∗
|φ
(l)
= hci (x)f (l) (s ∗
Ri − x) hci (u)S (l) Bi1 − u u=Li
and
(l) bi,x = Pr Xi = x | φ (l) , Li Xi Ri , Ui , δi2 2
2
hci (x){f (l) (ti − x)}δi {S (l) (ti − x)}1−δi
= R i
u=Li
2
2
hci (u){f (l) (ti − u)}δi {S (l) (ti − u)}1−δi
,
where f (l) (·) and S (l) (·) represent the values of f (·) and S(·) after l cycles of the algorithm, respectively, and hci (x) = Pr(X = x | X ∈ [Li , Ri ]). Then calculate the expectations i −vj n R n (l) (l) E Dj | φ (l) = wi,x,x+v + δi2 bi,U j i −vj
i=1 x=Li
(7)
i=1
and n (l) E Cj | φ (l) = 1 − δi2 bi,U . i −vj i=1
(8)
114
J. Sun
M-step. Replace the expectations of Dj and Cj in (6) by (7) and (8), respectively, and maximize L(φ) with respect to φ to obtain the updated estimate φ (l+1) . It is worth noting that this EM algorithm is applicable even if there exists no change point. It represents a compromise between and provides an alternative to parametric and nonparametric estimation. It can be easily seen that it also applies to situations where the Xi ’s, Si ’s and Ti ’s are continuous variables. In the above, we have assumed that the tail of the distribution of the Ti ’s follows a parametric model. An EM algorithm can be similarly developed for situations where the first part of the distribution can be described by a parametric model, while the tail is completely unknown. Both algorithms given above for estimation of the fk ’s are developed based on the conditional likelihood Lc . An alternative is to use the full likelihood function given by r s i n j =1 k=1 αj k wj fk LF {wj }j =1,...,r , {fk }k=1,...,s = (9) , ji0 kij2 w f j k i=1 j =1 k=kij1 where ji0 = max{j ; uj Bi2 }, kij 1 = min{k; vk max{0, Bi1 − uj }}, and kij 2 = max{k; vk Bi2 − uj }. Note that although Lc and the above full likelihood look similar, there exists a major difference between them. If the Xi ’s are observed exactly, the conditional likelihood Lc is then independent of the wj ’s. Thus in this case, as expected, the estimation of the distribution F = {fk }k=1,...,s of the Ti ’s based on Lc can be carried out without estimating H = {wj }j =1,...,r . In contrast, if we employ LF given in (9), F has to be estimated simultaneously with H . Furthermore, in the case of interval-censored and truncated data, Lc becomes the likelihood function considered in Turnbull (1976), the commonly used one in this case, while LF does not. Tu (1995) studied LF given in (9) when all truncation intervals are identical. 3. Semiparametric regression analysis This section considers semiparametric regression analysis of doubly interval-censored failure time data under the proportional hazards model. We will describe two methods: one is based on the full likelihood function of the data and developed for discrete survival time and the other is based on the estimating equation approach and applies to continuous survival time. In this section, we assume that there is no truncation. First consider situations where all concerned variables are discrete. Let the Xi ’s, Si ’s, Ti ’s, uj ’s, vk ’s, wj ’s and αji k ’s be defined as in the previous section. Also as before, let [Li , Ri ] and [Ui , Vi ] denote the intervals to which Xi and Si are observed to belong. For subject i, suppose that there exists a vector of covariates Zi , i = 1, . . . , n. Also suppose that the effect of covariates on the Ti ’s follows the proportional hazards model (Cox, 1972) given by
Sk (Zi ) = Pr(Ti > vk | Zi ) = (p1 · · · pk )exp(Zi β) given Zi , where β denotes a vector of regression parameters and pk = Pr(Ti > vk | Ti > vk−1 , Zi = 0),
(10)
Statistical analysis of doubly interval-censored failure time data
115
k = 1, . . . , s. Then the full likelihood function L1 given in (1) has the form r s n αji k wj fk (Zi ) L1 {wj }j =1,...,r , {fk }k=1,...,s =
(11)
i=1 j =1 k=1
(Kim et al., 1993), where exp(Zi β) . fk (Zi ) = (p1 · · · pk−1 )exp(Zi β) 1 − pk
To estimate the parameters ({wj }j =1,...,r , {pk }k=1,...,s , β), it is convenient to reparameterize the pk ’s by γk = log(log(pk )). This will remove the range restrictions on the pk ’s and improve convergence in the estimation algorithm given below. Under the new parameters γk ’s, we have γk−1 γ1 fk (Zi ) = e(e +···+e )exp(Zi β) 1 − e−exp(γk +Zi β) . For estimation of ({wj }j =1,...,r , {γk }k=1,...,s , β), Kim et al. (1993) generalized method I described in the previous section and gave the following procedure. Step 1. Choose initial estimates for ({wj }j =1,...,r , {γk }k=1,...,s , β). Step 2. At the lth iteration, set γk = γˆk(l−1) and β = βˆ (l−1) and maximize L1 given in (11) with respect to the wj ’s using the self-consistency algorithm to obtain updated (l) (l−1) estimate wˆ j , where γˆk and βˆ (l−1) denote the estimates given in the (l − 1)th iteration. (l) Step 3. Set wj = wˆ j and maximize L1 given in (11) with respect to the γk ’s and β (l) using the Newton–Raphson algorithm to obtain γˆk and βˆ (l) . Step 4. Repeat steps 2 and 3 until convergence. (l)
Note that in step 2, the updated estimate wˆ j can be obtained by method I given in the previous section by applying it only to the wj ’s while holding other parameters fixed. As for method I, it can be proved that the estimate derived by the above algorithm are critical points of the full likelihood function L1 , and one can distinguish maximizers from saddle points by examining the signs of the eigenvalues of the information matrix. Once the estimates of ({wj }j =1,...,r , {γk }k=1,...,s , β) are obtained, their covariance matrix can be estimated using the inverse of the observed Fisher information matrix. Now we consider the situation where the Ti ’s are continuous survival times. In this case, the proportional hazards model is defined by the hazard function and has the form λi (t) = Yi (t)λ0 (t) exp Zi β (12) (Cox, 1972), where λ0 (t) is an unknown baseline hazard function, Yi is a predictable process taking value 0 or 1 indicating (by the value 1) when the ith subject is under observation, and β denotes the vector of regression coefficients as before. To estimate β, for the remaining of this section, we will assume that observations on the Si ’s are right-censored instead of interval-censored and are given by Si∗ = min{Si , Ci } and δi = I (Si = Si∗ ), i = 1, . . . , n, where Ci is the censoring time associated with subject i and assumed to be independent of Si .
116
J. Sun
For estimation of β, define Yi (t | Xi ) = I (Si∗ − Xi t) and Ni (t | Xi ) = I (Si∗ − Xi t, δi = 1). Let X = (X1 , . . . , Xn ) and 1 j Yi (t | Xi )Zi eZi β , n n
S (j ) (β, t | X) =
i=1
j = 0, 1, where = 1 and Zi1 = Zi . Then β can be estimated by the solution, say βˆp , to the following estimating equation n Rn n R1
−1 (xl ) = 0 U β, H = dH al ··· U β | {xi }i=1,...,n Zi0
L1
l=1
Ln
l=1
denotes the maximum likelihood estimate of the cumulative (Sun et al., 1999), where H distribution function of the Xi ’s based on interval-censored data on the Xi ’s only, al = Rl Ll dH (x), l = 1, . . . , n, and S (1) (β, t | X) dNi (t | Xi ), Zi − (0) S (β, t | X)
n τ1
U (β | X) = 0
i=1
where τ1 is the longest possible follow-up time. It is easy to see that if the Xi ’s are ex) = U (β | X), which is the partial likelihood score function actly observed, then U (β, H of β based on right-censored data, and the proposed estimate βˆp of β reduces to the partial likelihood estimate. Under mild regularity conditions, Sun et al. (1999) proved that βˆp is consistent and 1/2 n (βˆp − β0 ) has an asymptotic normal distribution with mean zero and covariance matrix that can be consistently estimated by A(βˆp ) Γ (βˆp ) A (βˆp ), where β0 denotes the )/∂β}−1 and Γ (β) = n−1 ni=1 bˆi (β)bˆ (β), true value of β, A(β) = {−n−1 ∂U (β, H i where τ1 R1 Rn S (1) (β, t | x) ˆbi (β) = Zi − (0) ··· S (β, t | x) 0 L1 Ln n | x) (xl ) Yi (t | xi ) exp(β Zi ) dN(t dH , × dNi (t | xi ) − al nS (0) (β, t | x) l=1 n N (t | x) = i=1 Ni (t | xi ), and x = (x1 , . . . , xn ). It can be easily seen that the above idea can also be applied to discrete survival time situation discussed under model (10). In contrast, the idea described for model (10) does not work for model (12). Another difference between the above two methods for estimation of covariate effect is that the full likelihood method for model (10) involves estimation of a large number of parameters, which makes both computation and study of its properties difficult. In contrast, the estimating equation method estimates only regression parameters of interest and does not require estimation of nuisance parameters. Also, the estimate of covariance matrix given by the full likelihood method may not give realistic and useful results, especially for large data sets. Of course, the estimating equation method could yield less efficient estimates than the full likelihood approach.
Statistical analysis of doubly interval-censored failure time data
117
As mentioned before, the algorithm given above for the full likelihood function L1 in (11) is a generalization of the self-consistency algorithm. As an alternative for determining MLE of L1 , Goggins et al. (1999) presented an EM algorithm. However, it possesses the same shortcomings as the generalized self-consistency algorithm and also estimation of covariance matrix is quite complicated. To estimate β under model (12), another approach is to apply the multiple imputation method on U (β | X) proposed in Pan (2001). This method is simple both conceptually and computationally, but may be less efficient than the two described methods.
4. Nonparametric comparison of survival functions This section discusses nonparametric treatment comparison based on doubly intervalcensored failure time data, for which little research exists compared to estimation of distribution functions and regression analysis. For the problem, Sun (2001) developed a nonparametric test procedure. Let the Xi ’s, Si ’s and Ti ’s be defined as before and suppose that interval-censored observations on the Xi ’s and Si are obtained and given by {[Li , Ri ], [Ui , Vi ]; i = 1, . . . , n} as in the previous sections. Furthermore, suppose that they are discrete variables and study subjects come from p different populations. We assume that the distributions of the Xi ’s are the same among different populations. The goal is to test the null hypothesis, H0 say, that the p populations from which the samples come have identical failure time distributions for the Ti ’s. To help understand the idea, note that for right-censored failure time data, the most popular method is the log-rank test, which has the form of a vector of summations of observed minus expected numbers of failures (e.g., Kalbfleisch and Prentice, 1980). To develop a similar approach, we start by defining pseudoobserved and expected numbers of failures. Define αik = I (vk ∈ [Ui − Ri , Vi − Li ]) and Pk = P (vk ) = Pr(Ti > vk ), and i = 1, . . . , n, k = 0, . . . , s, where the vk ’s are defined as before and v0 = 0. Let H P = (P1 , . . . , Ps ) denote the joint maximum likelihood estimators of the cumulative distribution function of the Xi ’s and the survival function of the Ti ’s under H0 , which can be obtained using method I described in Section 2. Also, as before, let
(x) − H (x − ) , if vk ∈ [Ui − Ri , Vi − Li ], H ∗ x+v ∈[U ,V ] k i i αik = 1, otherwise. Now we can define the estimated or pseudonumbers of failures and subjects at risk as follows. Define n s ∗ k αik αik dk = Pk−1 − P αij αij∗ Pj −1 − Pj , j =1
i=1
nk =
n s r=k i=1
∗ αir αir
r Pr−1 − P
s j =1
αij αij∗
j , Pj −1 − P
118
J. Sun
dkl =
∗ αik αik
s ∗ Pk−1 − Pk αij αij Pj −1 − Pj , j =1
i
and
s s ∗ ∗ r nkl = αir αir Pr−1 − P αij αij Pj −1 − Pj , r=k
j =1
i
where i denotes the summation over subjects i in the population l, l = 1, . . . , p, k = 1, . . . , s. The dj ’s and nj ’s can be regarded as pseudototal numbers of failures and subjects at risk, while dj l ’s and nj l ’s the pseudonumbers of failures and subjects at risk corresponding to each population. By following the log-rank test, a nonparametric test statistic, say U = (U1 , . . . , Up )t , can be defined as m Ul = (dkl − nkl dk /nk ) k=1
(Sun, 2001), the summation of the differences between the pseudoobserved failure numbers and expected failure numbers conditional on the observed data. To apply U for testing H0 , we need to estimate its covariance matrix. For this, Sun (2001) proposed the following bootstrap method. Let M be a prespecified integer. For each l (1 l M), apply the following algorithm: (l)
(l)
Step 1. Let {Xi ; i = 1, . . . , n} be an independent sample of size n such that Xi is drawn from the conditional probability function (x) − H (x − )
(l) H , hˆ i (x) = Pr Xi = x = (L− ) (Ri ) − H H i
x ∈ [Li , Ri ]
given Xi ∈ [Li , Ri ], i = 1, . . . , n. Step 2. For the given Xi(l) ’s, let {(Ti(l) , δi(l) ); i = 1, . . . , n} be an independent rightcensored survival sample of size n such that if Vi = ∞, let Ti(l) = Ui − Xi(l) and δi(l) = 0, and otherwise, let Ti(l) be a random sample drawn from the conditional survival probability function (t−) − P (t) P , P((Ui − Xi(l) )− ) − P(Vi − Xi(l) ) t ∈ Ui − Xi(l) , Vi − Xi(l)
(l) fi (t) = Pr Ti = t =
(l)
(l)
and δi = 1, i = 1, . . . , n, where δi is the censoring indicator. (l) (l) Step 3. Given right-censored failure time data {(Ti , δi ); i = 1, . . . , n}, calculate the (l) (l) (l) (l) numbers of failures and risks denoted by the dk ’s, nk ’s, dkl ’s, and nkl ’s corresponding to the dk ’s, nk ’s, dkl ’s, and nkl ’s and then the statistic U denoted by U (l) (l) (l) (l) (l) with replacing dk , nk , dkl , and nkl by dk , nk , dkl , and nkl , respectively.
Statistical analysis of doubly interval-censored failure time data
119
Step 4. Repeat the steps 1–3 for each r = 1, . . . , M and then estimate the covariance matrix of U by V = V1 + V2 , where 1 (l) (l) t U −U U −U M −1 M
V1 =
r=1
and V2 = V21 + · · · + V2s with V2k being given by (V2k )ll = nkl (nk − nkl )dk (nk − dk )/n2k (nk − 1),
l = 1, . . . , p
and (V2k )l1 l2 = −nkl1 nkl2 dk (nk − dk )/n2k (nk − 1), = M U (l) /M. where U r=1
l1 = l2 = 1, . . . , p,
To understand the above covariance estimation, one can look at it from the point of missing data. In this case, the second part V2 can be regarded to represent the variability due to right-censored data, which of course are not observed, while the first part V1 represents the variability arising from missing data. Once V is obtained, the test of the 2 hypothesis H0 can then be made by using the statistic U ∗ = U t V − U based on the χp−1 distribution, where V − denotes a generalized inverse of V . An equivalent test is to use any p − 1 elements of U and the corresponding submatrix of V . For testing H0 , an alternative to the above nonparametric method is to postulate some parametric or semiparametric models and to develop score tests for the comparison. An example is the Cox proportional hazards model for the Ti ’s described in Section 3. A difficulty for this approach is that model-checking may be difficult to be carried out.
5. Discussion and future researches In the previous sections, we have reviewed statistical methods proposed in the literature for the analysis of doubly interval-censored failure time data. As discussed above, most of existing researches focus on two basic topics, nonparametric estimation of a distribution function and regression analysis under the proportional hazards model, and there exist many topics about the data that have not been touched. We will first discuss several issues that are directly related to existing methods for the analysis of doubly intervalcensored data and need to be investigated, and then point out a few other problems for which no or little research exists for future researches. One issue concerning the estimation of a distribution function based on doubly interval-censored data that calls for immediate research is the comparison of nonparametric estimation procedures proposed in the literature. These include the four methods described in Section 2. Gómez and Calle (1999) also presented a nonparametric estimation procedure that is similar to method II. Although some numerical examples have suggested that they are close to each other, no extensive numerical study exists for their thorough comparison and they could yield quite different estimates in some situations. The same issue exists for the methods for regression analysis of doubly intervalcensored data.
120
J. Sun
Another issue related to the methods discussed in the previous sections that require investigation is the development of methods for estimating covariance matrix. Except the estimating equation approach described in Section 3, most of other methods for nonparametric estimation and regression analysis do not really provide estimates of covariance matrix and rely on observed Fisher information matrix for the estimation. It is well known that Fisher information matrix could give unrealistic results, especially when there exist a large number of parameters, which often tends to be the case with the full likelihood approach. Thus it is important to develop simple and realistic covariance estimates for the methods to be more useful. One way for this is to employ the idea used in Section 4 for estimating covariance matrix of the test statistic U . Asymptotic properties of the existing methods for doubly interval-censored failure time data also call for studies. So far all of the methods proposed in the literature are basically algorithms and there does not exist any asymptotic study on them except Fang and Sun (2001), which discussed the asymptotic consistency of the nonparametric estimate of a distribution function given by method II. A key difficulty for the asymptotic study in this case is that two possibly infinite dimension functions are involved, which is not the case for right- or interval-censored failure time data. Also the martingale theory on which the asymptotic study for right-censored data heavily rely is not available for doubly interval-censored data. Instead, empirical process theory may have to be used in this case as in Fang and Sun (2001). Throughout the article, we have assumed that the distribution of survival time of interest is independent of the occurrence of initial event. This means that in the context of AIDS studies, for example, AIDS incubation time is independent of HIV infection time. Although most of AIDS studies use the assumption, there does not seem to exist research evidence supporting it and also no research exists addressing this assumption in general. One way to deal with this is to develop some test procedures to check the assumption before employing existing methods. Another approach, which is used by Frydman (1992, 1995), is to use models that do not require the assumption. To estimate the joint distribution of X and T for the case of no truncation, Frydman (1995) employed a three-state Markov model that allows the distribution of survival time of interest to depend on initial event. It is obvious that some other models could be used here. For doubly interval-censored failure time data, all of existing researches have been focusing on nonparametric or semiparametric inference. It would be interesting to develop some parametric inference procedures. Obviously, parametric methods have the advantage that their implementation is straightforward although may be complicated. Also the standard asymptotic theory can directly apply to them. One possible problem for parametric methods is model-checking and this could be especially the case for AIDS studies since the history of AIDS research is still relatively shorter. For example, it has been proved that the median of AIDS incubation time is around 8 or 10 years or may be longer given the rapid development of many new AIDS drugs. 
However, most of AIDS cohort studies do not have that long durations and thus there does not exist much information about the tail of the distribution of the AIDS incubation time, which makes model-checking difficult or impossible. As mentioned before, truncation often occurs along with doubly interval-censoring and this could be especially the case for AIDS cohort studies, where left-truncation often
occurs for AIDS diagnosis (subsequent event). Thus it would be useful to develop some regression methods that allow truncation. One way for this is to directly generalize the full likelihood approach discussed in Section 3, which should be straightforward. However, it is not obvious how to generalize the estimating equation method in Section 3 to truncation situations. In terms of regression analysis, another direction for future research is to consider some other regression models such as linear transformation models instead of the proportion hazards model. Although the proportion hazards model is the most popular one, it is well known that there exist situations where it does not fit the data well. Another direction for future research is to generalize statistical methods discussed in the previous sections to the situation where there exist a chain of events instead of just two related events as considered above. One such example is longitudinal observations on a succession of events. For this situation, Sternberg and Satten (1999) discussed estimation of the distribution of times between the successive events and generalized Turnbull’s (1976) self-consistency algorithm. It would be helpful to generalize other methods discussed above to this situation.
References Bacchetti, P. (1990). Estimating the incubation period of AIDS by comparing population infection and diagnosis patterns. J. Amer. Statist. Assoc. 85, 1002–1008. Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. B 34, 187–220. De Gruttola, V., Lagakos, S.W. (1989). Analysis of doubly-censored survival data with application to AIDS. Biometrics 45, 1–12. Fang, H., Sun, J. (2001). Consistency of nonparametric maximum likelihood estimation of a distribution function based on doubly interval-censored failure time data. Statist. Probab. Lett. 55, 311–318. Frydman, H. (1992). A nonparametric estimation procedure for a periodically observed three-state Markov process with application to AIDS. J. Roy. Statist. Soc. Ser. B 54, 853–866. Frydman, H. (1995). Semiparametric estimation in a three-state duration-dependent Markov model from interval-censored observations with application to AIDS data. Biometrics 51, 502–511. Goggins, W.B., Finkelstein, D.M., Zaslavsky, A.M. (1999). Applying the Cox proportional hazards model for analysis of latency data with interval censoring. Statist. Medicine 18, 2737–2747. Gómez, G., Calle, M.L. (1999). Nonparametric estimation with doubly censored data. J. Appl. Statist. 26, 45–58. Gómez, G., Lagakos, S.W. (1994). Estimation of the infection time and latency distribution of AIDS with doubly censored data. Biometrics 50, 204–212. Huang, J., Wellner, J.A. (1997). Interval censored survival data: A review of recent progress. In: Lin, D.Y., Fleming, T.R. (Eds.), Lecture Notes in Statistics: Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis, pp. 123–169. Jewell, N.P. (1994). Non-parametric estimation and doubly-censored data: General ideas and applications to AIDS. Statist. Medicine 13, 2081–2095. Jewell, N.P., Malani, H.M., Vittinghoff, E. (1994). Nonparametric estimation for a form of doubly censored data with application to two problems in AIDS. J. Amer. Statist. Assoc. 89, 7–18. Joly, P., Commenges, D. (1999). A penalized likelihood approach for a progressive three-state model with censored and truncated data: Application to AIDS. Biometrics 55, 887–890. Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. Kim, M.Y., De Gruttola, V., Lagakos, S.W. (1993). Analyzing doubly censored data with covariates with application to AIDS. Biometrics 49, 13–22.
Lim, H.J., Sun, J., Matthews, D.E. (2002). Maximum likelihood estimation of a survival function with a change-point for truncated and interval-censored data. Statist. Medicine 21, 743–752. Lindsey, J.C., Ryan, L.M. (1998). Tutorial in biostatistics: methods for interval censored data. Statist. Medicine 17, 219–238. Pan, W. (2001). A multiple imputation approach to regression analysis for doubly censored data with application to AIDS studies. Biometrics 57, 1245–1250. Sternberg, M.R., Satten, G.A. (1999). Discrete-time nonparametric estimation for semi-Markov models of chain-of-events data subject to interval-censoring and truncation. Biometrics 55, 514–522. Sun, J. (1995). Empirical estimation of a distribution function with truncated and doubly interval-censored data and its application to AIDS studies. Biometrics 51, 1096–1104. Sun, J. (1997). Self-consistency estimation of distributions based on truncated and doubly censored data with applications to AIDS cohort studies. Lifetime Data Anal. 3, 305–313. Sun, J. (1998). Interval censoring. In: Encyclopedia of Biostatistics. Wiley, New York, pp. 2090–2095. Sun, J. (2001). Nonparametric test for doubly interval-censored failure time data. Lifetime Data Anal. 7, 363– 375. Sun, J., Liao, Q., Pagano, M. (1999). Regression analysis of doubly censored failure time data with applications to AIDS studies. Biometrics 55, 909–914. Tu, X.M. (1995). Nonparametric estimation of survival distributions with censored initiating time and censored and truncated terminating time: Application to transfusion data for acquired immune deficiency syndrome. Appl. Statist. 44, 3–16. Turnbull, B.W. (1976). The empirical distribution function with arbitrarily grouped censored and truncated data. J. Roy. Statist. Soc. Ser. B 38, 290–295.
7
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23007-8
The Missing Censoring-Indicator Model of Random Censorship
Sundarraman Subramanian
1. Introduction In survival analysis, right-censored data are described by n independent and identically distributed (i.i.d.) copies of the observable pair (X, δ), where X is the minimum of a survival time T and a censoring time C which is independent of T , and δ is an indicator variable (called the censoring indicator henceforth) signifying whether the observed X equals T or otherwise. When the censoring indicator is always observed, the wellknown Kaplan–Meier estimator (Kaplan and Meier (1958)), KME henceforth, is the nonparametric maximum likelihood estimator (NPMLE) of the survival function S(t) of T , having several appealing asymptotic properties which also includes asymptotic efficiency (cf. Wellner (1982)). For more on the KME, see Shorack and Wellner (1986), Fleming and Harrington (1991), or Andersen et al. (1993), among others. Sometimes, however, the censoring indicator is not observed for a subset of the subjects investigated (e.g., in a bioassay experiment some subjects might not be autopsied to save expense, or the results of an autopsy may be inconclusive), leading to the missing censoring-indicator (MCI) model of random censorship. Specifically, let ξ be an indicator variable that may depend on X = min(T , C). The missingness indicator ξ assumes the value 1 when the censoring indicator δ is observed and takes the value 0 otherwise. The observed data in the MCI model of random censorship are n i.i.d. copies of Y = (X, ξ, σ ) where σ = ξ δ. The KME is inapplicable in the MCI model. In this article, we provide a general overview of several available estimators of S(t) in the MCI model under two well-known types of missing mechanisms. We also propose and analyze, under the less restrictive of the two missing mechanisms, a semiparametric estimator and show that it is more efficient than its nonparametric counterparts whenever the parametric component is correctly specified. The proposed semiparametric estimator is a direct extension of its classical random censorship model counterpart, originating in the work of Dikta (1998). The censoring indicators are said to be missing completely at random (MCAR) if the probability that ξ equals 1 does not depend on either X or δ, implying that the missing mechanism is independent of everything else observed in the MCI model. Assuming that the censoring indicators are MCAR allows consistent estimation of S(t) through 123
124
S. Subramanian
the KME applied to the complete cases. Moreover, the maximum likelihood estimator (MLE) of P (ξ = 1), which is readily estimated under MCAR using the fully observed ξ ’s, can be employed to provide improved estimators of S(t), see Lo (1991) or Gijbels et al. (1993). More generally, if the above probability depends on the fully observed X alone and not on the potentially unobservable δ, i.e., P (ξ = 1 | X, δ) = P (ξ = 1 | X),
(1.1)
then the censoring indicators are said to be missing at random (MAR); van der Laan and McKeague (1998) note that (1.1) is also the “minimal coarsening at random (CAR) assumption” needed for asymptotic efficiency in the MCI model. In other words, under MAR, the observability of the censoring indicator δ depends only on X and not on the value of δ. The effect of the MAR assumption is that P (ξ = 1 | X) is an infinite dimensional parameter, a function π(x), whose estimation is relatively more difficult. More importantly, however, MAR facilitates efficient estimation of S(t) unlike MCAR. Denote the conditional probability of an uncensored observation given X = x by p(x) and the conditional probability of an observable uncensored observation given X = x by q(x). Interestingly, Eq. (1.1) is equivalent to the statement that, conditionally on X, the missingness and censoring indicators are independent: q(x) = P (σ = 1 | X = x) = P (ξ = 1 | X)P (δ = 1 | X) = π(x)p(x). This facilitates estimation of the parametric component in the semiparametric estimator that we analyze using only the “complete cases”; see Tsiatis et al. (2002) for a nice application involving MAR when covariates are present. See Little and Rubin (1987) for more details about the missing mechanisms MCAR and MAR. CAR is a generalization of MAR to the case of coarse data, which arise when a random quantity of interest is not observed directly, but instead a subset of values (of the sample space) in which the unobserved random quantity lies is observed. For example, in the MCI model, when the missingness indicator is 0, the censoring indicator’s precise value is not observed, only the fact that it assumes a value in {0, 1} is known, and this gives rise to coarse data. CAR allows the coarsening mechanism to be ignored while making inferences, see Heitjan and Rubin (1991); see also Jacobsen and Keiding (1995), and Gill et al. (1997). An obvious drawback of the “complete-case estimator” is in its high inefficiency when the degree of missingness is considerable. Several authors have proposed improvements over the complete-case estimator, assuming MCAR. Dinse (1982) used the EM algorithm and obtained an NPMLE. Lo (1991), however, showed that the NPMLE is non-unique and some of them are inconsistent, and constructed two estimators one of which is consistent and asymptotically normal. Gijbels et al. (1993) proposed a convex combination of two estimators, one of which was a modified Lo-estimator. McKeague and Subramanian (1998) proposed an estimator that employed Nelson–Aalen estimators of certain cumulative transition intensities, and showed how their approach can be used to obtain the estimators of Lo (1991) and Gijbels et al. (1993). None of these approaches, however, investigated asymptotic efficiency. For a somewhat different censorship model, Mukherjee and Wang (1992) derived the NPMLE of S(t) when the hazard
The missing censoring-indicator model of random censorship
125
rate is increasing, assuming that the censoring indicator is never observed but that the censoring distribution is always known instead. The important issue of asymptotically efficient estimation in the MCI model was addressed by van der Laan and McKeague (1998), who introduced a sieved NPMLE, under a slightly stronger CAR assumption than (1.1), and obtained its influence curve. The influence curve of S(t), the estimator of S(t), is the random function denoted by IC(Y, t) satisfying the equation n n1/2 IC(Yi , t) + op (1). S(t) − S(t) = n−1/2 i=1
Indeed S(t) − S(t) is asymptotically an average of n i.i.d random quantities where the ith quantity is the influence curve evaluated at the observed data Yi . They also obtained the efficient influence curve for estimating S(t). The variance of the efficient influence curve is the information bound for estimating S(t), and is the smallest asymptotic variance that an estimator can hope to achieve; an estimator that attains the information bound is deemed to be asymptotically efficient. The survival function S(t) is a smooth (compactly differentiable) functional of the cumulative hazard function Λ(t), via the product integral mapping (cf. Gill and Johansen (1990)). The cumulative hazard function Λ(t) is in turn a compactly differentiable functional (cf. Gill (1989)) of the subdistribution function H1 (t) = P (X t , δ t= 1) and the distribution function H (t) = P (X t) through the relation Λ(t) = 0 [dH1 (s)/(1 − H (s))]. Note that both H1 (t) and H (t) are functionals of the bivariate distribution of (X, δ). Therefore, van der Laan and McKeague (1998) explained that an efficient estimator of the bivariate distribution of (X, δ) would readily yield an efficient estimator of S(t). Indeed Eq. (1.1) is essential for identifiability of the subdensity fX,δ (x, d), where d = 0 or 1. Van der Laan and McKeague noted that the subdensity fX,δ (x, 1) = P (X = x, δ = 1) can be written as fX,δ (x, 1) =
fX,σ,ξ (x, 1, 1) , fξ |X,δ (ξ | x, 1)
(1.2)
and pointed out that Eq. (1.1) was needed to make the denominator of (1.2) identifiable which in turn also implies that the subdensity fX,δ (x, 1) will be identifiable provided that the denominator, which by (1.1) is π(x), is bounded away from zero. Van der Laan and McKeague’s (1998) approach of estimating the joint distribution of (X, δ) was to present the problem in the framework of nonparametric estimation of a bivariate distribution from bivariate right-censored data in which the first component X is completely observed but the second component δ is right-censored by a discrete random variable. An important outcome of their work is the benchmark asymptotic variance (information bound) for comparing competing estimators of S(t) in the MCI model. Van der Laan and McKeague also employed the approach of Robins and Rotnitzky (1992) and proposed an explicit efficient estimator of H1 (t) using the general theory of semiparametric efficiency bounds; see Bickel et al. (1993) for more on information bound theory. Since the survival function is a compactly differentiable functional of H1 (t) and H (t), replacing H1 (t) and H (t) with their respective efficient estimators would immediately lead to an efficient estimator of S(t), see van der Vaart (1991).
126
S. Subramanian
All the aforementioned approaches of S(t) in the MCI model are nonparametric, in that they do not suppose that the observed data are random samples from specific populations, letting the data speak for themselves instead. Sometimes scientific rationale may suggest that the parametric approach be employed for estimating some quantities in a proposed model, while the nonparametric approach be pursued for estimating other quantities. This is called a semiparametric approach. An example of a semiparametric model is the well-known and highly popular Cox proportional hazards model (Cox, 1972), in which the parametric component is specified by the regression parameter and the (infinite dimensional) nonparametric part is the baseline hazard function. It turns out that the semiparametric approach can be gainfully employed for estimating S(t) in the MCI model. The link is provided by a representation noted by Dikta (1998). Denote the cumulative hazard function of X by ΛH (t). In the classical random censorship model, Dikta (1998) noted a representation in which Λ(t) is expressible in terms of p(t) and ΛH (t), and provided a semiparametric alternative to the KME, by assuming a parametric model p(t, θ ) for the conditional probability of an uncensored observation and estimating the parametric component via maximum likelihood. He estimated ΛH (t) using a standard estimator that utilizes the empirical distribution of the fully observed X. Dikta further proved that his semiparametric estimator of S(t) is more efficient than the KME, in the sense of having smaller asymptotic variance, provided p(t) is not misspecified. Dikta’s approach has been employed in a different setting as well. Sun and Zhu (2000) proposed a Dikta-type semiparametric estimator of a survival function in a lefttruncated and right-censored model and derived its large-sample properties. Zhu et al. (2002) studied resampling methods for testing the goodness-of-fit of Dikta’s semiparametric model. Interestingly, Dikta’s (1998) work finds yet another application, namely estimation in the MCI model, providing an attractive and very simple semiparametric alternative to the nonparametric approach pursued by researchers thus far in this model. The MAR assumption implies that the distribution of ξ is free of θ and this implies that the MLE of θ can be computed based on just the complete cases, see Section 3. Therefore, provided that the parametric model for p(t, θ ) is specified correctly, the only additional computation relative to the KME is the calculation of a parameter estimate using maximum likelihood. The simplicity of the approach is perhaps a compelling rationale for choosing a semiparametric estimator for S(t) over a nonparametric estimator in the MCI model. Indeed, the semiparametric estimator would not require cumbersome bandwidth calculations based on estimates of densities and their derivatives that are so pervasive with the nonparametric estimators, see, for example, van der Laan and McKeague (1998). When the model for p(t, θ ) is specified correctly, we show as in Dikta (1998) that the semiparametric estimator is more efficient (asymptotically) than the nonparametric estimators proposed by van der Laan and McKeague (1998). Needless to say, correct specification of p(t, θ ) is very crucial for estimating S(t) well. Similar to the argument employed by Tsiatis et al. 
(2002), one may argue that logit(p(t)) = log{p(t)/(1 − p(t))} is the difference between the log survival and log censoring hazards, which is a smooth function of t, hence a logistic model, after incor-
The missing censoring-indicator model of random censorship
127
poration of additional polynomial terms, should, in most cases, provide an appropriate representation for p(t, θ ). Cox and Snell (1989) provide methods that would be useful for analyzing binary data and for model-checking. The article is organized as follows. Section 2 contains an overview of the existing estimators of S(t) in the classical random censorship model and the MCI model. In Section 3, the new semiparametric estimator of S(t) is introduced, analyzed, and its asymptotic variance compared with the information bound for nonparametric estimation of S(t). The article ends with a Conclusion section.
2. Overview of the estimators of a survival function In this section, we first review the KME and Dikta’s (1998) semiparametric estimator in the classical random censorship model. We then present an overview of the estimators proposed in the MCI model, first under MCAR and then under MAR (or more generally CAR). To facilitate comparison of the asymptotic variances of the semiparametric and (efficient) nonparametric estimators of H1 (t) and S(t), we also calculate the information bound for estimating H1 (t) and S(t) from the expressions for their efficient influence curves obtained by van der Laan and McKeague (1998). 2.1. The classical random censorship model 2.1.1. The (nonparametric) KME (t) = 1−H (t), and let H (t), H (t), and H 1 (t) denote Let τ be such that H (τ ) > 0. Let H (t), and H1 (t). In counting process the empirical estimators, respectively, of H (t), H Niu (t) = I (Xi t, δi = 1), Nic (t) = I (Xi t, δi = 0), notation let Ni (t) = I (Xi t), Yi (t) = I (Xi t), and Y· (t) = ni=1 Yi (t). The standard approach of estimating S(t) is to provide an estimator of Λ(t), and then employ the product integral representation of Gill and Johansen (1990) given by S(t) = π0st {1 − dΛ(s)}. When all the censoring indicators are observed the Nelson–Aalen estimator of Λ(t) is given by NA (t) = Λ
t 0
1 (s) dH = (s−) i=1 H n
t 0
dNiu (s) . Y· (s)
(2.3)
NA (s) is the size of the jump of Λ NA (s), which is the reciprocal of the Note that dΛ number of subjects at risk (risk size) just before s. The KME is given by SKM (t) =
NA (s) . 1 − dΛ
(2.4)
0st
It is well known that the process n1/2 ( SKM (t) − S(t)) converges weakly in D[0, τ ] to a zero-mean Gaussian process whose asymptotic variance, assuming that S(t) is continut (s)2 ], see, for instance, Andersen et al. (1993). ous, is given by S(t)2 0 [dH1 (s)/H
128
S. Subramanian
2.1.2. The semiparametric estimator Dikta (1998), noted the following functional relationship between Λ(t), p(·) and ΛH (·): t t p(s) p(s) dΛH (s) = dH (s). Λ(t) = (2.5) 0 0 H (s−) He assumed a parametric model p(t, θ ), where p(·, ·) is a known continuous function and θ = (θ1 , . . . , θk ) are the unknown parameters. He estimated the parameter θ via maximum likelihood, and used p(t, θˆD ), where θˆD is the MLE of θ0 (the true value of (t) to construct his estimator of Λ(t). His estimator of S(t), denoted by θ ), and H SD (t), may be written as (s) p(s, θˆD ) dH 1− SD (t) = (2.6) . (s−) H 0st Let pr (u, θ0 ) = ∂p(u, θ )/∂θr evaluated at θ = θ0 , for each r = 1, . . . , k, and let D(u, θ0 ) = (p1 (u, θ0 ), . . . , pk (u, θ0 ))T . Let I (θ0 ) = E[D(X, θ0 )D T (X, θ0 )/p(X, θ0 ) × (1 − p(X, θ0 ))] denote the k × k “information” matrix. Dikta proved that if H is continuous and H (τ ) < 1, and appropriate regularity conditions were satisfied, then n1/2 ( SD (t) − S(t)) converges weakly to a zero-mean Gaussian process S.L where L is a zero-mean Gaussian process with covariance structure for 0 s t τ given by Cov L(s), L(t)
s s t p(u, θ0 ) α(u, v) = dH1(u) + dH (u) dH (v) , (u)2 (u)H (v) 0 H 0 0 H where α(u, v) = D(u, θ0 ), I −1 (θ0 )D(v, θ0 ), and ·, · denotes the inner product in Rk . Note that the difference in the asymptotic variances between the KME and the semiparametric estimator is S(t)2 r(t) where t t t 1 − p(s, θ0 ) α(u, v) dH1 (s) − dH (u) dH (v). r(t) = 2 (v) H (s) H (u)H 0 0 0 Dikta showed that r(t) is non-negative, which in turn implies that the variance of the KME exceeds that of the semiparametric estimator. 2.2. The MCI model In this section, we provide an overview of estimators of S(t) proposed under MCAR and MAR (CAR). We also compute the information bounds for estimating S(t) and H1 (t). 2.2.1. Estimators of S(t) under MCAR Let ρ = P (ξ = 1 | X = x) = P (ξ = 1), and ρˆ = n−1 ni=1 ξi . Lo (1991) proposed the estimator ξi δi /ρˆ 1 SL∗ (t) = (2.7) 1− . Y· (Xi ) Xi t
The missing censoring-indicator model of random censorship
129
Taking log on both sides of Eq. (2.7) we have n 1 ξi δi log 1 − I (Xi t) log SL∗ (t) = ρˆ Y· (Xi ) i=1
=−
i=1
=−
1 1 I (Xi t) + + op n−3 2 ρˆ Y· (Xi ) 2Y· (Xi )
n ξi δi i=1
=−
n ξi δi
ρˆ
I (Xi t)
n ξi δi i=1
t
=−
ρˆ n
t 0
1 + op n−1/2 Y· (Xi )
1 dNi (s) + op n−1/2 Y· (s)
dNiu (s) + op n−1/2 , ρY ˆ · (s)
i=1 ξi
0
so that defining L (t) = Λ
0
t
n
dNiu (s) , ρY ˆ · (s)
i=1 ξi
(2.8)
L (t) + op (n−1/2 ). It may be noted straightwe have immediately that − log SL∗ (t) = Λ away that the estimator ΛL (t) can be obtained from the Nelson–Aalen estimator NA (t), given by Eq. (2.3), by applying the principle of “inverse probability weightΛ ing of complete cases” due to Robins and Rotnitzky (1992) – the aggregated counting process ni=1 Niu (t), which is not observable due to the missing censoring indicators (ξi = 0), is replaced with ni=1 ξi Niu (t) and scaled by the reciprocal of ρ. ˆ Note however that the estimator differs from the complete-case estimator, since it uses the full size of the risk set that the KME would use when all the censoring indicators are observed. L (t), Gijbels et al. (1993) noted that SL (t), obtained by taking a product integral of Λ ∗ is asymptotically equivalent to SL (t), but “does not utilize the counting process information from the subjects with ξi = 0”. They then introduced t n t n c i=1 ξi dNi (s) i=1 (1 − ξi ) dNi (s) C (t) = − , Λ (1 − ρ)Y ˆ · (s) ρY ˆ · (s) 0 0 L (t) and Λ C (t): Λ GLY (t) = and estimated Λ(t) using a convex combination of Λ α(t)ΛL (t) + (1 − α(t))ΛC (t). Note again that the “inverse probability weighting of C (t). Gijbels et al. (1993) obtained the funccomplete cases” is employed to estimate Λ G (t) and pointed out that the value tion α(t) that minimizes the asymptotic variance of Λ of α(t) ≡ 1 leads to Lo’s estimator. McKeague and Subramanian (1998) viewed the MCI model set-up as a four state non-time-homogeneous Markov process having state space {0, 11, 10, 00}, where “0” represents the initial state in which no failure has occurred, and the other states represent, respectively, ξ δ = 1 (observed uncensored observation), ξ δ = 0 (observed cen-
130
S. Subramanian
sored observation), and ξ = 0 (missing censoring indicator). They expressed the cumulative hazard function Λ in terms of the cumulative transition intensities Λ11 , Λ10 , Λ00 for the transitions 0 → 11, 0 → 10, 0 → 00: ΛT (t) = Λ11 (t) + π(t)Λ00 (t), where π(t) = Λ11 (t)/(Λ10 (t)+Λ11 (t)), and estimated Λ(t) by plugging-in the Nelson–Aalen 10 , Λ 11 , and Λ 00 of Λ10 , Λ11 , and Λ00 . Specifically their estimator is estimators Λ given by MS (t) = Λ 11 (t) + π(t) 00 (t), Λ ˆ Λ 10 (t) + Λ 11 (t)) is defined to be 0 when the denominator van11 (t)/(Λ where π(t) ˆ =Λ ishes, t dHkl (s) kl (t) = Λ , (s−) 0 H 11 (t), H 10 (t) and H 00 (t) are the empirical estimators of the subdistribution funcand H tions Hkl (t) = P (X t, ξ = k, σ = l) for (k, l) = (1, 0),(1, 1), (0, 0). Finally, they MS (s)). estimated S(t) using the product-limit estimator SMS (t) = (0,t ] (1 − dΛ 2.2.2. Estimators of S(t) under MAR (CAR) Van der Laan and McKeague (1998) derived their reduced-data NPMLE as the solution of a self-consistency equation, in a way similar to Efron’s (1967) derivation of the KME. Specifically, they partitioned the interval [0, τ ] into k sub-intervals [aj −1 , aj ], j = 1, . . . , k, where a0 = 0, ak = τ , and ak+1 = ı, and constructed a NPMLE of the joint distribution of (X, δ) based on the reduced data (Xd , ξ, σ ) where Xd is X if ξ = 1 and equals aj if X ∈ (aj , aj +1 ] and ξ = 0. The NPMLE is discrete, placing point masses on all complete observations (X, δ) and on one or more artificially chosen points in each region (aj , aj +1 ] × {0, 1} containing incomplete observation(s) but no complete observations. In particular, for each observed uncensored observation Xi ∈ (aj , aj +1 ] (i.e., δi = 1), their estimator of the subdensity f (x, 1) at x = Xi is obtained by scaling 1/n with a factor given by the ratio of the number of observations in the interval (aj , aj +1 ] to the number of complete observations in the same interval. For example, if there are ten observations in an interval, four of which have missing censoring indicators, and six of the remaining complete observations have four uncensored, then each complete uncensored observation will be assigned a mass equal to (10/6) × 1/n = 5/3n. Employing a stronger CAR assumption than the minimal one given by (1.1), van der Laan and McKeague obtained the efficient influence curve for S(t) given by (note: Y = (X, ξ, σ )) ξ I (X t) p(X) σ I (X t) ICS (Y, t) = I (X t) − + (X) (X) π(X)H π(X) H t ∧X t ξ ξ I (X > u) dH1 (u) dH1 (u) − 1 − , − 2 (u)2 π(X) H 0 π(X)H (X) 0 where we have assumed that the survival and censoring distributions are continuous. Note that E(ICS (Y, t)) = 0. The variance of ICS (Y, t), which is the expectation of the square of ICS (Y, t) (see the next subsection), is the information bound for estimating
The missing censoring-indicator model of random censorship
131
S(t). Van der Laan and McKeague’s (1998) reduced data NPMLE described above attains the information bound and hence is asymptotically efficient. To propose their second estimator, van der Laan and McKeague (1998) appealed to the results of Robins and Rotnitzky (1992) and obtained the efficient influence curve for estimating H1 (t), denoted by ICH1 (Y, t). By setting up an estimating equation using a plug-in estimate of ICH1 (Y, t) they arrived at an estimator for H1 (t). The expression for ICH1 (Y, t) is given by ICH1 (Y, t) =
ξ I (X t, δ = 1) − H1 (t) π(X) ξ −1 , − I (X t)p(X) − H1 (x) π(X)
and the estimator of H1 (t) is given by 1 (t) = n−1 H
n
π(X ˆ i )−1 I (Xi x) σi − ξi − π(X ˆ i ) p(X ˆ i) ,
i=1
where π(·) ˆ and p(·) ˆ are suitable estimates of π(·) and p(·), respectively. 2.2.3. Information bound for estimating S(t) For purposes of variance calculations, we re-express ICS (Y, t) in the following simpler form: ICS (Y, t) =
I (X t)p(X) I (X t)ξp(X) − (X) (X) π(X)H H t I (X t)σ dH1(u) + − I (X > u) . (u)2 π(X)H (X) H 0
We have E ICS (Y, t)2
I (X t)σ I (X t)p(X)2 I (X t)ξp(X)2 =E + + 2 2 2 (X) (X) (X)2 π(X) H π(X)2 H H t dH1(u) 2 I (X t)ξp(X)2 + I (X > u) −2 (X)2 (u)2 π(X)H H 0 t I (X t)σp(X) I (X t)p(X) dH1(u) +2 −2 I (X > u) (X)2 (X) (u)2 π(X)H H H 0 t I (X t)ξp(X) dH1(u) I (X t)σp(X) + 2 I (X > u) −2 (X) (X)2 (u)2 π(X)2 H π(X)H H 0 t dH1 (u) I (X t)σ . I (X > u) −2 (X) 0 (u)2 π(X)H H
132
S. Subramanian
Conditioning on X first, then using the conditional independence of ξ and δ given X while calculating the conditional expectation given X, and finally taking expectation with respect to X, we have E ICS (Y, t)2 t t t dH1 (u) p(u) dH1 (u) 1 dH1 (u) = p(u) + + 2 2 (u) (u) (u)2 H 0 0 π(u) H 0 π(u) H t t t dH1(u) 2 dH1 (u) dH1 (u) +E I (X > u) − 2 p(u) + 2 p(u) 2 2 (u)2 H (u) H (u) H 0 0 0 t t t dH1 (v) dH1 (u) p(u) dH1(u) −2 − 2 2 (v) (u) (u)2 H 0 u H 0 π(u) H t t t t dH1 (v) dH1 (u) dH1 (v) dH1 (u) +2 −2 (v) (u)2 (v) (u)2 H H 0 u H 0 u H t t t dH1 (u) dH1 (u) 2 dH1 (u) = + r ˜ (u) + E I (X > u) (u)2 (u)2 (u)2 H H 0 H 0 0 t t dH1 (v) dH1 (u) −2 , (v) (u)2 H 0 u H where r˜ (u) = −1 + p(u) +
(1 − π(u)) p(u) 1 − = 1 − p(u) . π(u) π(u) π(u)
Note that dH1(u) 2 I (X > u) E (u)2 H 0 u t dH1(v) dH1 (u) =E I (X > u) (v)2 H (u)2 0 0 H t t dH1 (v) dH1(u) +E I (X > v) (v)2 H (u)2 H 0 u t u t t dH1 (v) dH1 (u) dH1 (v) dH1 (u) = + (v)2 (u) (v) (u)2 H H 0 0 H 0 u H t t dH1 (v) dH1 (u) =2 . (v) (u)2 H 0 u H
t
Therefore the information bound for estimating S(t) is given by
t t 1 − π(u) dH1 (u) dH1 (u) . IS−1 (t) = S(t)2 1 − p(u) + (u)2 (u)2 π(u) H 0 H 0
(2.9)
The missing censoring-indicator model of random censorship
133
Note that the first term of (2.9) is the information bound for estimating S(t) when the censoring indicators are all observed (classical random censorship model); the KME attains this bound. The second term is the increase in variability owing to missing censoring indicators. 2.2.4. Information bound for estimating H1 (t) Note that E(ICH1 (Y, t)) = 0. The variance of ICH1 (Y, t) is the information bound for estimating H1 (t). For this purpose note that ICH1 (Y, t) =
σ I (X t) ξ I (X t)p(X) − + p(X)I (X t) − H1 (t). π(X) π(X)
Therefore first taking conditional expectation given X and then taking expectation with respect to X we have E ICH1 (Y, t)2
σ I (X t) ξ I (X t)p(X)2 =E + + p(X)2 I (X t) + H1 (t)2 π(X)2 π(X)2 −2
σ I (X t)H1 (t) σ I (X t)p(X) σ I (X t)p(X) −2 +2 2 π(X) π(X) π(X)
ξ I (X t)H1 (t)p(X) ξ I (X t)p(X)2 +2 π(X) π(X) − 2p(X)I (X t)H1 (t) −2
t t dH1 (u) p(u) + dH1(u) + p(u) dH1 (u) + H1 (t)2 π(u) 0 0 π(u) 0 t t p(u) dH1 (u) + 2 −2 p(u) dH1(u) − 2H1 (t)2 0 π(u) 0 t −2 p(u) dH1 (u) + 2H1 (t)2 − 2H1(t)2 ,
=
t
0
which after simplification leads to the expression E ICH1 (Y, t)2 t 1 − π(u) dH1 (u). 1 − p(u) = H1 (t) 1 − H1 (t) + π(u) 0
(2.10)
Note that the first term of (2.10) is the information bound for estimating H1 (t) when the censoring indicators are all observed (classical random censorship model); the empirical subdistribution function of the uncensored times, which is the empirical estimator of H1 (t), attains this bound. The second term in (2.10) is the increase in variability owing to missing censoring indicators.
134
S. Subramanian
3. Semiparametric estimation in the MCI model The likelihood based on the observed data (X1 , ξ1 , σ1 ), . . . , (Xn , ξn , σn ) is given by n
ξ −σ 1−ξi π(Xi )ξi p(Xi , θ )σi 1 − p(Xi , θ ) i i 1 − π(Xi ) h(Xi ).
i=1
Under MAR the distribution of ξ is free of θ , which implies that Ln (θ ) =
n
ξ −σ p(Xi , θ )σi 1 − p(Xi , θ ) i i
i=1
may be used as a likelihood for θ . Let θˆ denote the MLE of θ0 . Recall that pr (u, θ ) = ∂p(u, θ )/∂θr , r = 1, . . . , k, D(u, θ ) = (p1 (u, θ ), . . . , pk (u, θ ))T . For each r, s = 1, . . . , k, let pr,s (u, θ ) = ∂ 2 p(u, θ )/∂θr ∂θs . Under appropriate regularity conditions and using a standard Taylor-expansion argument (cf. Dikta (1998)), it can be shown that n1/2 (θˆ − θ0 ) is asymptotically multivariate normal with mean 0 and covariance matrix I˜(θ0 )−1 where π(X)D(X, θ0 )D T (X, θ0 ) ˜ . I (θ0 ) = E (3.11) p(X, θ0 )(1 − p(X, θ0 )) To obtain (3.11) consider the log likelihood based on a single observation Y given by l(θ ) = σ log p(X, θ ) + (ξ − σ ) log 1 − p(X, θ ) . Note that −∂ 2 l(θ )/∂θr ∂θs equals σpr (X, θ )ps (X, θ ) σpr,s (X, θ ) − p(X, θ ) p(X, θ )2 +
(ξ − σ )pr (X, θ )ps (X, θ ) (ξ − σ )pr,s (X, θ ) , + (1 − p(X, θ ))2 (1 − p(X, θ ))
so that using the conditional independence of ξ and δ given X, we have that −E(∂ 2 l(θ )/∂θr ∂θs ) evaluated at θ = θ0 equals −1 −1 E π(X)pr (X, θ0 )ps (X, θ0 ) p(X, θ0 ) , + 1 − p(X, θ0 ) from which (3.11) is immediate. We shall use the estimated function p(s, θˆ ) to define the estimator of the integrated hazard by = Λ(t)
t 0
p(s, θˆ ) dH (s). (s−) H
The estimator of S(t) is defined via the product integral as before. We have
The missing censoring-indicator model of random censorship
135
T HEOREM 1. The process n1/2 ( S(t)−S(t)) converges weakly to a zero-mean Gaussian process W , with covariance structure, for 0 s t τ , given by Cov(W (s), W (t)) S(s)S(t)
s s t p(u, θ0 ) α(u, ˜ v) = dH1(u) + dH (u) dH (v) , (u)2 (u)H (v) 0 H 0 0 H
(3.12)
where α(u, ˜ v) = D(u, θ0 ), I˜−1 (θ0 )D(v, θ0 ), and ·, · denotes the inner product in Rk . P ROOF. Under MAR it is straightforward to show that (see Lemma 3.5 of Dikta (1998)) n1/2 p s, θˆ − p(s, θ0 ) = n−1/2
n i=1
ξi (δi − p(Xi , θ0 )) α(s, ˜ Xi ) + Rn (s), p(Xi , θ0 )(1 − p(Xi , θ0 ))
(3.13)
where the remainder term Rn (x) satisfies |Rn (s)| M(s). Op (n−1/2 ) + op (1) uniformly on [0, τ ], and M(s) is an integrable (with respect to the distribution H of X) upper bound for pm,n (s, θ ) = ∂ 2 p(s, θ )/∂θm ∂θn , for θ in a neighborhood of θ0 . Note that p(s, θ0 ) dH (s) = dH1 (s). Using Lemmas 3.7, 3.8 of Dikta (1998), and (3.13), it follows that − Λ(t) n1/2 Λ(t) = n−1/2
n i=1 0
−1/2
+n
t
p(s, θ0 ) d I (Xi s) − H (s) H (s)
n
(I (Xi s) − H (s) dH1(s) (s)2 H
t
i=1 0
+ n−3/2
n n ξi (δi − p(Xi , θ0 ))α(X ˜ j , Xi )I (Xj t) + op (1), (Xj ) p(Xi , θ0 )(1 − p(Xi , θ0 ))H i=1 j =1
uniformly on [0, τ ]. We employ U -statistic theory (cf. Serfling (1980)) to represent the third term above, denoted by n1/2 U (t), as the sum of n independent and identically distributed random variables. We have −2
U (t) = n
n n ξi (δi − p(Xi , θ0 ))α(X ˜ j , Xi )I (Xj t) (Xj ) p(Xi , θ0 )(1 − p(Xi , θ0 ))H i=1 j =1
ξi (δi − p(Xi , θ0 ))α(X ˜ j , Xi )I (Xj t) = n−2 + op n−1/2 p(Xi , θ0 )(1 − p(Xi , θ0 ))H (Xj ) i=j
=
(n − 1)n 1 n n2 2
1i<j n
1 ζt (Xi , σi , ξi , Xj , σj , ξj ), 2
136
S. Subramanian
where ζt (Xi , σi , ξi , Xj , σj , ξj ) = γt (Xi , σi , ξi , Xj ) + γt (Xj , σj , ξj , Xi ) and γt (Xi , σi , ξi , Xj ) =
ξi (δi − p(Xi , θ0 ))α(X ˜ j , Xi )I (Xj t) . (Xj ) p(Xi , θ0 )(1 − p(Xi , θ0 ))H
Next calculate ψt (Xi , σi , ξi ) = E(γt (Xi , σi , ξi , Xj ) | Xi , σi , ξi ) which is equal to t ξi (δi − p(Xi , θ0 ))α(s, ˜ Xi ) dH (s). ψt (Xi , σi , ξi ) = (s) 0 p(Xi , θ0 )(1 − p(Xi , θ0 ))H Since E(γt (Xj , σj , ξj , Xi ) | Xi ) = 0, it follows from U -statistic theory that U (t) = n−1
n
ψt (Xi , σi , ξi ) + op n−1/2
i=1
t ξi (δi − p(Xi , θ0 )) α(s, ˜ Xi ) dH (s) + op n−1/2 . (s) p(Xi , θ0 )(1 − p(Xi , θ0 )) 0 H i=1 − Λ(t)] = n−1/2 ni=1 [A(Xi , t) + B(Xi , t) + C(Xi , t)] + op (1) Therefore n1/2 [Λ(t) where p(X, θ0 )I (X t) A(t) = A(X, t) = − Λ(t), (X) H t I (X s) − H (s) B(t) = B(X, t) = dH1(s) (s)2 H 0 t I (X > s) =− dH1(s) + Λ(t), (s)2 H 0 t ξ(δ − p(X, θ0 )) α(s, ˜ X) C(t) = C(X, t) = dH (s). (s) p(X, θ0 )(1 − p(X, θ0 )) 0 H = n−1
n
We now calculate the covariance structure of the limiting zero-mean Gaussian process. For 0 s t τ , the following can be checked readily: E A(s)A(s) p(X, θ0 )I (X s) p(X, θ0 )I (X t) =E − Λ(s) − Λ(t) (X) (X) H H s p(u, θ0 ) dH1 (u) − Λ(s)Λ(t), = (u)2 0 H E A(s)B(t) t I (X > u) p(X, θ0 )I (X s) dH1 (u) + Λ(t) − Λ(s) − =E (X) (u)2 H H 0 s dH1 (u) Λ(s) − Λ(u) + Λ(s)Λ(t), =− (u)2 H 0
The missing censoring-indicator model of random censorship
E B(s)A(t) =E −
s 0
137
I (X > u) p(X, θ0 )I (X t) dH1(u) + Λ(s) − Λ(t) (u)2 (X) H H
dH1 (u) Λ(t) − Λ(u) + Λ(s)Λ(t), (u)2 H 0 E B(s)B(t) s I (X > u) dH1 (u) + Λ(s) =E − (u)2 H 0 t I (X > v) × − dH1(v) + Λ(t) (v)2 H 0 s s dH1 (u) dH1 (u) Λ(s) − Λ(u) Λ(t) − Λ(u) + − Λ(s)Λ(t). = 2 (u)2 H (u) H 0 0 =−
s
Moreover, E(A(s)C(t)) = E(C(s)A(t)) = E(B(s)C(t)) = E(C(s)B(t)) = 0. Finally we have E D(s)D(t) 2 s t ξ(δ − p(X, θ0 )) dH (u) dH (v) E α(u, ˜ X)α(v, ˜ X) . = (u)H (v) p(X, θ )(1 − p(X, θ )) H 0 0 0 0 Note that conditional on X the random variable ξ and δ are independent and also that E(δ − p(X, θ0 ) | X)2 = p(X, θ0 )(1 − p(X, θ0 )). It follows that E D(s)D(t) s t ˜ X)α(v, ˜ X) 1 π(X, θ0 )α(u, E dH (u) dH (v). = (u)H (v) p(X, θ0 )(1 − p(X, θ0 )) H 0 0 Note that
˜ X)α(v, ˜ X) π(X, θ0 )α(u, E p(X, θ0 )(1 − p(X, θ0 ))
π(X, θ0 )D(u, θ0 )T I˜−1 (θ0 )D(X, θ0 )D(X, θ0 )T I˜−1 (θ0 )D(v, θ0 ) =E p(X, θ0 )(1 − p(X, θ0 )) π(X, θ0 )D(X, θ0 )D(X, θ0 )T ˜−1 I (θ0 )D(v, θ0 ) = D(u, θ0 )T I˜−1 (θ0 )E p(X, θ0 )(1 − p(X, θ0 ))
= D(u, θ0 )T I˜−1 (θ0 )I˜(θ0 )I˜−1 (θ0 )D(v, θ0 ) = α(u, ˜ v). Therefore E D(s)D(t) =
s 0
t 0
α(u, ˜ v) dH (u) dH (v). (u)H (v) H
138
S. Subramanian
Adding all the non-zero terms leads to the expression given by (3.12). The only thing that remains to show is the tightness of the relevant processes. This can be proved as in Dikta (1998) and we omit the details. This completes the proof of the theorem. 3.1. Comparison of asymptotic variances A comparison of the asymptotic variances of the non- and semi-parametric estimators is in order. The asymptotic variance of an efficient nonparametric estimator in the MCI model is
t t dH1 (s) (1 − π(s))(1 − p(s)) dH1(s) 2 VNP (t) = S(t) (3.14) . + (s)2 (s)2 π(s) H 0 H 0 From Theorem 1 the asymptotic variance of the semiparametric estimator is
t t t p(s, θ0 ) α(u, ˜ v) 2 VSP (t) = S(t) dH1 (s) + dH (u) dH (v) . (s)2 (u)H (v) 0 H 0 0 H (3.15) Using (3.14) and (3.15) it follows that R(t) = VNP (t) − VSP (t) is given by
t (1 − p(s, θ0 )) 2 dH1 (s) R(t) = S(t) (s)2 π(s)H 0 t t α(u, ˜ v) − (3.16) dH (u) dH (v) . (u)H (v) 0 0 H P ROPOSITION 1. The expression R(t) defined in (3.16) satisfies R(t) 0, 0 t τ . P ROOF. We reproduce Dikta’s (1998) proof of his Corollary 2.7, with minor modifications. First we note that t t α(u, ˜ v) (3.17) dH (u) dH (v) = b, I˜−1 (θ0 )b , 0 0 H (u)H (v) where b ∈ Rk is defined by T
pk (X, θ0 ) p1 (X, θ0 ) . I (X t) , . . . , E I (X t) b= E (X) (X) H H (t) = t p(s, θ0 )(1 − p(s, θ0 )) dH (s) = t (1 − p(s, θ0 )) dH1(s). For any h = Let H 0 0 (h1 , . . . , hk )T ∈ Rk \{0}, we have using the Cauchy–Schwarz inequality that 2 k t 1 2 hr pr (s, θ0 ) dH (s)
h, b = (s) 0 H r=1
t
= 0
k hr π(s)1/2 pr (s, θ0 ) (s) dH (s) p(s, θ0 )(1 − p(s, θ0 )) π(s)1/2 H
1
r=1
2
The missing censoring-indicator model of random censorship
t
t
139
1 (s) d H (s)2 0 π(s)H 2 k t h π(s)1/2 p (s, θ ) r r 0 (s) dH × p(s, θ0 )(1 − p(s, θ0 )) 0 r=1
= 0
(1 − p(s, θ0 )) dH1(s) (s)2 π(s)H
t
× 0
π(s) 2 p (s, θ0 )(1 − p(s, θ0 ))2
k
2 hr pr (s, θ0 )
(s) dH
r=1
(1 − p(s, θ0 )) dH1(s) (s)2 π(s)H 0
t π(s)hT D(s, θ0 )D(s, θ0 )T h dH (s) × p(s, θ0 )(1 − p(s, θ0 )) 0
t (1 − p(s, θ0 )) π(X)D(X, θ0 )D(X, θ0 )T T h dH1(s) h E = (s)2 p(X, θ0 )(1 − p(X, θ0 )) π(s)H 0
t (1 − p(s, θ0 )) = dH1(s) h, I˜(θ0 )(h) . 2 π(s)H (s) 0
=
t
Therefore
h, b2
h, I˜(θ0 )(h)
t 0
(1 − p(s, θ0 )) dH1 (s). (s)2 π(s)H
(3.18)
As noted by Dikta (1998), the supremum of the left-hand side of (3.18) is b, I˜−1 (θ0 )b, provided that I˜−1 (θ0 ) is positive definite, see Rao (1973, 1f.1.1). The result now follows from (3.17) and (3.18).
4. Conclusion This article provides an overview of several methods of estimation of a survival function S(t) in the classical random censorship model when the censoring indicators are missing for a subset of the study subjects. The model is called the missing censoring-indicator (MCI) model of random censorship. Two well-known missingness mechanisms were presented. The currently existing asymptotically efficient estimators are the reduceddata NPMLE and an estimator obtained through standard estimating equation methodology. A semiparametric estimator was also presented and its asymptotic variance was compared with the information bound for estimating S(t). Semiparametric estimation of a survival function in the MCI model is a novel and attractive alternative approach to the more common nonparametric approach because of its simplicity and flexibility. The semiparametric estimation procedure, due to Dikta
140
S. Subramanian
(1998), specifies a parametric model for the conditional probability p(t, θ ) of an uncensored observation given the (always observable) minimum X but employs a nonparametric specification for the distribution of X. Since good-fitting models for p(t, θ ) can be readily provided, specification of a parametric model for p(t, θ ) is a natural and appealing approach that can be gainfully employed to compensate for the missing censoring information in the MCI model. Indeed, logistic regression tools offer a ready resource for model fitting concerning p(t, θ ). Provided the parametric component is specified correctly, the semiparametric estimator is more efficient than its nonparametric counterparts in the MCI model. The proposed semiparametric estimator is simple to compute unlike the efficient nonparametric estimators, which are not easy to calculate in practice because of artificial binning of the interval of estimation and because no optimal bandwidths are currently available to compute the estimators.
Acknowledgement This research was supported by the University of Maine Summer Faculty Research Fund Award, 2002.
References Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York. Bickel, P.J., Klaassen, C.A.J., Ritov, Y., Wellner, J.A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. John Hopkins University Press. Cox, D.R. (1972). Regression models and life-tables (with Discussion). J. Roy. Statist. Soc. B 34, 187–202. Cox, D.R., Snell, E.J. (1989). Analysis of Binary Data, 2nd edn. Chapman and Hall, London. Dikta, G. (1998). On semiparametric random censorship models. J. Statist. Plann. Inference 66, 253–279. Dinse, G.E. (1982). Nonparametric estimation for partially-complete time and type of failure data. Biometrics 38, 417–431. Efron, B. (1967). Two sample problem with censored data. In: Proc. Fifth Berkeley Sympos. Math. Statist. Probab., Vol. 4, pp. 831–853. Fleming, T., Harrington, D. (1991). Counting Processes and Survival Analysis. Wiley, New York. Gijbels, I., Lin, D.Y., Ying, Z. (1993). Non- and semi-parametric analysis of failure time data with missing failure indicators. Technical Report 039-93, Mathematical Sciences Research Institute, Berkeley. Gill, R.D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (part I). Scand. J. Statist. 16, 97–124. Gill, R.D., Johansen, S. (1990). A survey of product-integration with a view toward application in survival analysis. Ann. Statist. 18, 1501–1555. Gill, R.D., van der Laan, M.J., Robins, J.M. (1997). Coarsening at random. In: Lin, D.-Y. (Ed.), Proceedings of the First Seattle Symposium on Biostatistics. Springer, New York. Heitjan, D.F., Rubin, D.B. (1991). Ignorability and Coarse data. Ann. Statist. 19, 2244–2253. Jacobsen, M., Keiding, N. (1995). Coarsening at random in general sample spaces and random censoring in continuous time. Ann. Statist. 23, 774–786. Kaplan, E.L., Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53, 457–481. van der Laan, M.J., McKeague, I.W. (1998). Efficient estimation from right-censored data when failure indicators are missing at random. Ann. Statist. 26, 164–182.
The missing censoring-indicator model of random censorship
141
Little, R.J.A., Rubin, D.B. (1987). Statistical Analysis with Missing Data. Wiley, New York. Lo, S.-H. (1991). Estimating a survival function with incomplete cause-of-death data. J. Multivar. Anal. 39, 217–235. McKeague, I.W., Subramanian, S. (1998). Product-limit estimators and Cox regression with missing censoring information. Scand. J. Statist. 25, 589–601. Mukherjee, H., Wang, J. (1992). Nonparametric maximum likelihood estimation of an increasing hazard rate for uncertain cause-of-death data. Scand. J. Statist. 20, 17–33. Rao, C.R. (1973). Linear Statistical Inference and Its Applications, 2nd edn. Wiley, New York. Robins, J.M., Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell, N., Dietz, K., Farewell, V. (Eds.), AIDS Epidemiology – Methodological Issues. Birkhäuser, Boston, pp. 279–331. Serfling, T. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. Shorack, G.R., Wellner, J.A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York. Sun, L., Zhu, L. (2000). A semiparametric model for truncated and censored data. Statist. Probab. Lett. 48, 217–227. Tsiatis, A.A., Davidian, M., McNeney, B. (2002). Multiple imputation methods for testing treatment differences in survival distributions with missing cause of failure. Biometrika 89, 238–244. van der Vaart, A.W. (1991). Efficiency and hadamard differentiable functionals. Scand. J. Statist. 18, 63–75. Wellner, J.A. (1982). Asymptotic optimality of the product limit estimator. Ann. Statist. 10, 595–602. Zhu, L.X., Yuen, K.C., Tang, N.Y. (2002). Resampling methods for testing a semiparametric random censorship model. Scand. J. Statist. 29, 111–123.
8
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23008-X
Estimation of the Bivariate Survival Function with Generalized Bivariate Right Censored Data Structures
Sündüz Kele¸s, Mark J. van der Laan and James M. Robins
1. Introduction Bivariate survival data arise when study units are paired such as child and parent, or twins or paired organs of the same individual. This paper addresses the survival function estimation in a general data structure which includes time independent and/or dependent covariate processes that are subject to right censoring. Consider a time dependent process X(t) = (X1 (t), X2 (t)) where Xk (t) includes a component Rk (t) = I (Tk t), 1 (T1 ), X 2 (T2 )), where X k = {Xk (s): s ∈ [0, t]}. k = 1, 2. Let the full data be X = (X We will denote the maximum of T1 and T2 with T so that we can represent the full data ). Let C1 and C2 be two censoring variables. Define Tk = min(Tk , Ck ) and with X(T ∆k = I (Ck > Tk ), k = 1, 2. Then the observed data is given by 1 T1 , T2 , ∆2 , X 2 T2 . Y = T1 , ∆1 , X Such data structures easily arise in longitudinal studies where study units are monitored over a period of time. In this paper, we are interested in estimating µ = S(t1 , t2 ) = P (T1 t1 , T2 t2 ) based on n i.i.d. Y1 , . . . , Yn copies of Y . Let FX denote the distribution of the full data X and G(· | X) denote the distribution of bivariate censoring variables (C1 , C2 ) conditional on X. Then, the distribution of the observed data Y is a function of FX and G(· | X), and we will denote it with PFX ,G . There is no previous work on the estimation of such marginal parameter µ with the generalized bivariate right censored data structure. However, estimation of the bivariate distribution of survival times when both study units are subject to random censoring in marginal data structures (no associated covariate process) has received a considerable attention in statistical literature. Some of the proposed nonparametric estimators are Dabrowska (1988), Prentice and Cai (1992), Pruitt (1991) and van der Laan (1996). These estimators employ the independent censoring assumption. Dabrowska, Prentice– Cai and Pruitt estimators are not, in general, efficient estimators. Laan’s (1996) SOR– NPMLE is globally efficient and typically needs a larger sample size for good performance. A review of most of the available estimators can be found in Pruitt (1993) and 143
144
S. Kele¸s, M.J. van der Laan and J.M. Robins
van der Laan (1997). Recently, Quale et al. (2001) proposed a new estimator of the bivariate survival function based on the locally efficient estimation theory. Their approach guesses semiparametric models for FX and G(· | X) and the estimator proposed is consistent if either one of the models is correctly specified and locally efficient if both are correctly specified. The generalized bivariate right censored data structure has two important aspects. Firstly, by utilizing the associated covariate process of the data structure one can allow informative censoring. Secondly, again through utilization of the covariate processes, one can gain efficiency in estimation of the parameters of the interest. This paper is concerned with achieving these properties in estimation of the bivariate survival function. Estimation of various parameters of interest with these type of general bivariate right censored data structures are addressed in great details in Chapter 5 of van der Laan and Robins (2002). Firstly, we will propose an initial estimator for µ that is a generalization of Dabrowska’s (1988) estimator. Dabrowska’s (1988) estimator, which is developed for marginal data structures, is widely used and depends on a smart representation of the bivariate survival function. It is only efficient under complete independence when survival times and censoring times are all independent of each other (Gill et al., 1995) and becomes inconsistent when there is informative censoring. Our generalization of it deals with informative censoring through utilization of the covariate processes. In our model, we leave the full data distribution completely unspecified and assume a model for G(. | X) that will allow dependent censoring. One crucial assumption that we make 1 , T2 | X) > δ > 0, FX – a.e. This assumpon the censoring mechanism is that G(T tion can be arranged by artificially censoring the data as in van der Laan (1996). For a 1 , τ2 | X) > 0, artificial censoring sets Ti = τi and ∆i = 1 if given τ1 , τ2 satisfying G(τ Ti > τi , i = 1, 2. Our initial estimator remains consistent under informative censoring if the censoring mechanism is either known or estimated consistently. Subsequently, we will provide an orthogonalized estimating function that will result in a robust and more efficient estimator. The general organization of the paper is as follows. In the next section, we will describe the methods to estimate the censoring mechanism in a way that allows dependent censoring. In Section 3, we will briefly review a general methodology of constructing mappings from full data estimating functions to observed data estimating functions, and introduce a new way of obtaining such mappings by using the influence curve of a given regular asymptotically linear (RAL) estimator. In Section 4, we use this method to obtain a generalized Dabrowska’s estimator. We will introduce an orthogonalized estimating function and discuss its corresponding estimator in Section 5. The practical performances of the proposed estimators are demonstrated with a simulation study in Section 6. Finally, we end the paper with a summary of conclusions.
2. Modeling the censoring mechanism We will represent the bivariate censoring variable as a bivariate time-dependent process. Let Ak (t) = I (Ck t), k = 1, 2, and we define Ck = ∞ if Ck > Tk , k = 1, 2. For
Estimation of the bivariate survival function
145
1 (C1 ), X 2 (C2 )). Moreover, let X A (t) = a given A = (A1 , A2 ) we define XA = (X 1 (C1 ∧ t), X 2 (C2 ∧ t)) be the part of XA which is observed by time t. Here X A (t) (X − ). Now, we can represent the observed data as only depends on A through A(t Y = (A, XA ),
(1) which corresponds with observing Y (t) = (A(t), XA (t)) over time t. The distribution of the observed data Y is thus indexed by the distribution FX of X and the conditional distribution of A given X. We now consider the modeling and estimation of the bivariate time dependent censoring process A in the discrete and continuous case. Let g(A | X) denote the conditional distribution of this bivariate process given the full data X. Firstly, we will assume that Ak (t), k = 1, 2, only change value at time points j = 1, . . . , p (indicating the true chronological time points at which Ak can jump). We will assume − 1), X = g A(j ) | A(j − 1), X A (j ) , g A(j ) | A(j (2) for all j ∈ {1, . . . , p}. This assumption is the analogue of the sequential randomization assumption (SRA) in the causal inference literature (e.g., Robins, 1989b, 1989a, 1992; Robins et al., 1994; Robins, 1998, 1999). Then, we have | X) = g(A
p − 1), X g A(j ) | A(j j =1
=
p
− 1), X A (j ) g1 A1 (j ) | A(j
j =1
×
p
− 1), X A (j ) . g2 A2 (j ) | A1 (j ), A(j
(3)
j =1
− 1), X A (j )) and let F2 (j ) = (A1 (j ), A(j − 1), X A (j )). Moreover, Let F1 (j ) = (A(j define λk (j | Fk (j )) = P (Ck = j | Ck j, Fk (j )) to be the conditional hazard of Ck with respect to the history Fk , k = 1, 2. Then, αk j | Fk (j ) ≡ P Ak (j ) = 1 | Fk (j ) = Yk (j )λk j | Fk (j ) , and
1−Ak (j ) gk Ak (j ) | Fk (j ) = αk (j )Ak (j ) 1 − αk (j ) ,
k = 1, 2,
where Yk (j ) = I (Tk j ). We propose to model the discrete intensities αk , k = 1, 2, with separate models. For example, we could assume a logistic regression model 1 , λk j | Fk (j ) = (4) 1 + exp(m(j, Wk (j ) | γk )) where Wk (j ) are functions of the observed past Fk (j ). One can model the effect of time j as nonparametric as possible so that this model contains, in particular, the independent censoring model which assumes that (C1 , C2 ) is independent of X. If the grid is
146
S. Kele¸s, M.J. van der Laan and J.M. Robins
fine, then the multiplicative intensity model λk (j | Fk (j )) = λ0 (t) exp(γk Wk (j )) is also appropriate for k = 1, 2. The sequential randomization is a stronger assumption than the well known coarsening at random assumption (CAR) (Heitjan and Rubin, 1991; Jacobsen and Keiding, 1995; Gill et al., 1997). Under CAR, the likelihood PFX ,G (dy) of Y factorizes in an FX and G-part. Consecutively, the maximum likelihood estimator of γ = (γ1 , γ2 ) is given by: γn = max−1 γ
Ci n
g1,γ1 A1i (j ) | F1i (j ) g2,γ2 A2i (j ) | F2i (j ) .
i=1 j =1
If the models for g1 and g2 have no common parameters, then γ1n = max−1 γ1
Ci n
dA (j ) 1−dA1 (j ) α1,γ1 j | F1i (j ) 1 1 − α1,γ1 j | F1i (j )
i=1 j =1
and γ2n = max−1 γ2
Ci n
dA (j ) 1−dA2 (j ) 1 − αγ2 j | F2i (j ) α2,γ2 j | F2i (j ) 2 .
i=1 j =1
If we assume the logistic regression model given in (4), then γkn can be obtained by applying the Splus-function glm() or gam() with logit link to the pooled sample (A ki (j ), j, Wi (j )), i = 1, . . . , n, j = 1, . . . , mki ≡ min(C2i , T2i ), treating it as N = i mki i.i.d. observations on a Bernoulli random variable Ak with covariates time t and W . | X) as the partial likelihood If A(t) is continuous, then one can formally define g(A of the bivariate counting process A = (A1 , A2 ) with respect to the observed history (t−)) (Andersen et al., 1993) F (t) = σ (Y 1−∆A· (t ) | X = g A (5) 1 − α· (t) dt αk (t)∆Ak (t ) , t,k
where
t
αk (t) = E dAk (t) | F (t)
is the intensity of Ak with respect to F (t) and α· (t) = 2k=1 αk (t) is the intensity of | X) one could assume a multiplicative intensity model A· = A1 + A2 . To estimate g(A (Andersen et al., 1993): αk (t) = Yk (t)λk t | Fk (t) ≡ Yk (t)λ0k (t) exp γ Wk (t) , where Yk (t) is the indicator that Ak is at risk of jumping at time t. To summarize, by treating the bivariate censoring variable (C1 , C2 ) as a bivariate time-dependent process (A1 , A2 ) indexed by the same time t as the full-data and assuming sequential randomization, we have succeeded in presenting a flexible modeling framework that allows dependent censoring. Moreover, parameters of these models can be estimated using the standard software.
Estimation of the bivariate survival function
147
3. Constructing an initial mapping from full data estimating functions to observed data estimating functions In this section, we will briefly review the main ideas of the locally efficient estimation methodology which includes full data estimating functions and mappings into observed data estimating functions (Robins and Rotnitzky, 1992; van der Laan and Robins, 2002). Consecutively, we will define a new way of constructing observed data estimating functions using the influence curve of a given RAL estimator. In order to construct an estimator for the parameter of interest µ based on the observed data Y1 , . . . , Yn , the estimation problem is firstly considered in the full data model since this class of estimating functions is the foundation of the estimating functions in the observed data model. We will firstly go over estimating functions of the full data model and then link these to the observed data estimating functions. Estimating functions in the full data model Let µ(FX ) be the parameter of interest. We will denote the model for FX by MF . Typically, we are interested in estimating functions whose asymptotic behavior are not affected by the choice of the estimators for nuisance parameters. Finding such a class of estimating functions requires finding the so-called orthogonal complement of the nuisance tangent space at FX for each FX ∈ MF . The full data nuisance tangent F (F ), is a subspace of Hilbert space L2 (F ) (space of functions space at FX , Tnuis X 0 X of X with finite variance and mean zero endowed with the covariance inner product f, gFX = EFX f (X)g(X)) defined as the linear space spanned by all nuisance scores. A nuisance score is a score function which is obtained by only varying the nuisance parameters within one-dimensional sub-models of FX (i.e., varying one-dimensional d sub-models Fε through FX at ε = 0 for which dε µ(Fε )|ε=0 = 0). We refer to Bickel F,⊥ et al. (1993) for the general theory of tangent spaces. Let Tnuis (FX ) be the orthogonal F,⊥ F complement of Tnuis(FX ). The representation of Tnuis (FX ), ∀FX plays an important role in constructing the full data estimating functions since this representation generally hints the form of a class of estimating functions. Mainly, one tries to find a class of F,⊥ estimating functions {Dh (X | µ, ρ): h ∈ H} such that Dh , h ∈ H falls into Tnuis (FX ), ∀FX when evaluated at the true parameter values (µ(FX ), ρ(FX )). Here, ρ(FX ) is a possible nuisance parameter of the full data distribution FX and H represents an index set for this class of estimating functions. A template for constructing such a class is given in van der Laan and Robins (2002). Ideally, one would like this class to be rich F,⊥ and cover the whole Tnuis (FX ). In our model, since we will leave the full data distribution completely nonparametric, there is only one full data estimating function, namely D(X | µ) = I (T1 t1 , T2 t2 ) − S(t1 , t2 ), where µ = S(t1 , t2 ). Estimating functions in the observed data model Defining a class of observed data estimating functions requires the notion of orthogonal complement of the observed data nuisance tangent space as in the full data model. Let G(CAR) be the set of all conditional bivariate distributions G(· | X) satisfying CAR. We will then represent the observed data model for the distribution of Y as M(CAR) =
148
S. Kele¸s, M.J. van der Laan and J.M. Robins
{PFX ,G : FX ∈ MF , G ∈ G(CAR)}. Next, define TCAR (PFX ,G ) as the tangent space for G in the model M(CAR) at PFX ,G . TCAR (PFX ,G ) consists of all functions of the observed data Y that have mean zero given the full data X. Let Dh → IC0 (Y | Q0 , G, Dh ) be an initial mapping from full data estimating functions into observed data estimating functions which satisfies EG (IC0 (· | Q0 , G, Dh (· | µ, ρ)) | X) = Dh (X | µ, ρ), ∀Q0 . Here, Q0 refers to nuisance parameters of the full data model other than ρ. As established in Robins and Rotnitzky (1992), the orthogonal complement of the nuisance tangent space ⊥ (P Tnuis FX ,G ) in the observed data model M(CAR) at PFX ,G is given by ⊥ Tnuis (PFX ,G ) = IC0 · | Q0 (FX ), G, Dh F,⊥ (FX ) , − Π IC0 · | Q0 (FX ), G, Dh | TCAR (PFX ,G ) : Dh ∈ Tnuis
where Π(IC0 (· | Q0 (FX ), G, Dh ) | TCAR (PFX ,G )) represents the projection of IC0 (· | Q0 (FX ), G, Dh ) onto TCAR (PFX ,G ). As in the full data model, this representation of ⊥ can be used to construct a mapping IC(Y | Q(F , G), G, D ) from full data esTnuis X h timating functions {Dh (· | µ, ρ), h ∈ H} into observed data estimating functions with ⊥ (P the property that it falls into Tnuis FX ,G ) if evaluated at the true parameter values of the data generating distribution. If the set of the full data estimating functions with the ⊥,F index set H covers all of the Tnuis then the set of the corresponding observed data mappings does not exclude any regular asymptotically linear estimator in the model M(CAR). These mappings result in estimators that are more efficient that the estimators of the initial mappings and are protected against misspecification of either the censoring mechanism or the full data distribution (Robins et al., 2000; van der Laan and Yu, 2001; van der Laan and Robins, 2002). This particular way of constructing observed data estimating functions relies on projections onto TCAR which can sometimes be burdensome. For the marginal bivariate right censored data structure, this projection operator does not exist in closed form but it is still possible to implement it algorithmically and this was done by Quale et al. (2001). However, for the general bivariate right censored data structure, the projection operator does not exist in closed form and is computationally much more complicated to implement. Therefore, we propose to project onto TSRA ⊂ TCAR that is defined as the tangent space for G in the model assuming only SRA. In essence, we will be orthogonalizing the initial mapping with respect to TSRA instead of TCAR . There are two key aspects of these orthogonalized estimating functions. Firstly, they will provide more efficient estimators than the corresponding initial mappings IC0 (Y | Q0 (FX ), G, D). However, since one is not projecting onto TCAR , this class of estimating functions will exclude some estimators, and which estimators are included in the class will depend on the choice of initial mapping D → IC0 (· | Q0 (FX ), G, D). Therefore, we propose a method to construct initial mappings that result in RAL estimators of a specific choice, and hence guarantee that our class of estimators will include the specified RAL estimators with good practical performances. Initial mappings that correspond with a specified RAL estimator In order to obtain a mapping Dh → IC0 (Y | Q0 (FX ), G, Dh ) from full data estimating functions into observed data estimating functions that would result in an estimator
asymptotically equivalent to a specified RAL estimator for a particular choice h, we use the influence curve, IC(Y | Q0,1(FX), G), of the specified RAL estimator. The influence curve IC(Y | Q0,1(FX), G) of a RAL estimator µn of µ is defined by
√n (µn − µ) = (1/√n) Σ_{i=1}^n IC(Yi | Q0,1(FX), G) + oP(1).
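For example, with completely uncensored data the empirical bivariate survival function is a RAL estimator of µ = S(t1, t2) whose influence curve is simply the full data estimating function D(X | µ) introduced in Section 3.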
The parameter Q0,1 (FX ) indicates that this influence curve depends on FX only through a function Q0,1 (FX ) of FX . Since IC(Y | Q0,1 (FX ), G) is an influence curve it is an ⊥ (P element of Tnuis FX ,G ). Consecutively, it satisfies F,⊥ (FX ) ∀F ∈ MF . EG IC Y | Q0,1 (FX ), G | X ∈ Tnuis Let h∗ be such that EG (IC(Y | Q0,1 (FX ), G) | X) = Dh∗ (X | µ(FX ), ρ(FX )). Let Dh → U (Y | Q0,2 (FX ), G, Dh ) be a mapping from full data estimating functions into observed data estimating functions which satisfies EG (U (Y | Q0,2 (FX ), G, Dh ) | X) = Dh (X | µ, ρ), ∀FX . An example of such a mapping would be an inverse probability of censoring weighted mapping and we will use this as an example below. We define ICCAR Y | Q0 (FX ), G = IC Y | Q0,1 (FX ), G − U Y | Q0,2 (FX ), G, Dh∗ · | µ(FX ), ρ(FX ) , (6) where Q0 (FX ) ≡ (Q0,1 (FX ), Q0,2 (FX )). Note that EG (ICCAR (Y | Q0 (FX ), G) | X) = 0, ∀FX ∈ MF . We now propose the following as an initial mapping from full data estimating functions into observed data estimating functions IC0 Y | Q0 (FX ), G, Dh (· | µ, ρ) = U Y | Q0,2 (FX ), G, Dh (· | µ, ρ) + ICCAR Y | Q0 (FX ), G . Note that EG (IC0 (Y | Q0 (FX ), G, Dh (· | µ, ρ)) | X) = Dh (X | µ, ρ), ∀FX , G. Then, the corresponding estimating equation is 1 IC0 Yi | Q0,n , Gn , D(· | µ, ρn ) , n n
0 = (1/n) Σ_{i=1}^n IC0(Yi | Q0,n, Gn, D(· | µ, ρn)),
where Q0,n, ρn and Gn are estimates of Q0, ρ and G, respectively. We can then construct a one-step estimator
µ1n = µ0n + (1/n) Σ_{i=1}^n IC0(Yi | Q0,n, Gn, cn, D(· | µ0n, ρn)),
where IC0(Y | Q0,n, Gn, cn, D(· | µ0n, ρn)) equals
{ −(1/n) Σ_{i=1}^n (d/dµ) IC0(Yi | Q0,n, Gn, D(· | µ, ρn)) |_{µ=µ0n} }^{−1} × IC0(Y | Q0,n, Gn, D(· | µ0n, ρn)),    (7)
√ and µ0n is a n consistent initial estimator. This is the classical one-step estimator defined in Bickel et al. (1993), i.e., first step in the Newton–Raphson algorithm for solving the estimating equation of interest. The general asymptotic linearity Theorem A.1 in Appendix A can now be applied to this one-step estimator. Under the regularity conditions of this theorem, if Q0,n converges to a Q10 , ρn converges to ρ(FX ) and Gn is an efficient estimator of G in the model G ⊂ G(SRA), where G(SRA) is the set of all conditional bivariate distributions G(· | X) satisfying SRA, with tangent space T2 (PFX ,G ) then µ1n is asymptotically linear with influence curve IC Y | Q10 , G, D(· | µ, ρ) − Π IC Y | Q10 , G, D(· | µ, ρ) | T2 (PFX ,G ) . If h = h∗ and Q10 equals Q0 (FX ) then we have the following properties of the one-step estimator. Firstly, if the model G used for G is a submodel of the model G ∗ that the RAL estimator poses for G, µ1n is asymptotically equivalent to the RAL estimator, i.e., it has the same influence curve since Π(IC(Y | Q0 (FX ), G, D(· | µ, ρ)) | T2 (PFX ,G )) is zero. Moreover, if T2 (PFX ,G ) contains scores of submodels which are not in G ∗ , then µ1n is a more efficient estimator than the RAL estimator. Consider the following example with the parameter of interest µ = S(t1 , t2 ) in the general bivariate right censored data structure. Since the full data model is nonparametric the only full data estimating function is I (T1 t1 , T2 t2 ) − µ. Let the | X) − µ be a mapping from full data estimating U (Y | G, D(· | µ)) = I (T t)∆/G(T functions to observed data estimating functions. We use shorthand notation I (T t) to denote I (T1 t1 , T2 t2 ). Let µn be a RAL estimator with the influence curve IC(Y | Q0 (FX ), G) and satisfy EG (IC(Y | Q0 (FX ), G) | X) = I (T t) − µ, ∀Q0 (FX ). Then, the corresponding mapping with h∗ indexed full data function is, I (T t)∆ − µ. U Y | G, Dh∗ (· | µ) = | X) G(T We have, I (T t)∆ ICCAR Y | Q0 (FX ), G = IC Y | Q0 (FX ), G − + µ(FX ), | X) G(T where we assume that µ(FX ) depends on FX only through Q0 (FX ) (i.e., µ(FX ) = µ(Q0 (FX ))). Then, the proposed initial mapping equals IC0 Y | Q0 (FX ), G, D(· | µ) = U Y | G, D(· | µ) + ICCAR Y | Q0 (FX ), G I (T t)∆ I (T t)∆ − µ + IC Y | Q0 (FX ), G − + µ(FX ) | X) G(T | X) G(T = µ(FX ) − µ + IC Y | Q0 (FX ), G .
=
We solve this estimating equation for µ by setting its empirical mean to zero and replacing Q0 (FX ) and G by their estimates Q0,n and Gn . Here, Gn is a consistent estimate of G according to a model G ⊂ G(SRA).
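As a side remark, the inverse probability of censoring weighted mapping U(Y | G, D(· | µ)) above already yields a simple estimator of µ = S(t1, t2) when its empirical mean is set to zero. The following is a minimal R sketch, not the authors' full procedure; the column names (time1, time2, cstatus1, cstatus2) and the vector Ghat, holding an estimate of the censoring survivor probability G(T1i, T2i | Xi) evaluated at each subject's own observed times, are assumptions about how the data and the fitted censoring model are stored.

# IPCW estimate of S(t1, t2): solves 0 = (1/n) sum_i U(Y_i | Gn, D(. | mu)) for mu.
ipcw.surv <- function(t1, t2, time1, time2, cstatus1, cstatus2, Ghat) {
  delta <- cstatus1 * cstatus2                        # both components uncensored
  mean(delta * (time1 > t1) * (time2 > t2) / Ghat)    # weighted empirical survivor
}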
In the next section, we will construct an observed data estimating function which has µ(Q0,n ) equal to the Dabrowska’s (1988) estimator and IC(Y | Q0 (FX ), G) equal to its influence curve. 4. Generalized Dabrowska’s estimator A well-known estimator of µ = S(t1 , t2 ) based on marginal bivariate right-censored data in the independent censoring model G ∗ for G is the Dabrowska’s estimator (Dabrowska, 1988, 1989). The influence curve ICDabr (Y | F, G) of Dabrowska’s estimator of S(t1 , t2 ) derived in Gill et al. (1995) and van der Laan (1997) is given by ICDabr (Y | F, G) = S(t1 , t2 ) −
t1
0
I (T1 ∈ du, ∆1 = 1) − I (T1 u)P (T1 ∈ du | T1 u) PF,G (T1 u)
I (T2 ∈ du, ∆2 = 1) − I (T2 u)P (T2 ∈ du | T2 u) PF,G (T2 u) 0
t1 t2 I (T1 ∈ du, T2 ∈ dv, ∆1 = 1, ∆2 = 1) + 2 v) PF,G (T1 u, T 0 0
t1 t2 I (T1 u, T2 v)P (T1 ∈ du, T2 ∈ dv | T1 u, T2 v) − PF,G (T1 u, T2 v) 0 0
t1 t2 I (T1 ∈ du, T2 v, ∆1 = 1)P (T2 ∈ dv | T1 u, T2 v) − PF,G (T1 u, T2 v) 0 0
t1 t2 2 v P (T1 ∈ du | T1 u, T2 v) I T1 u, T + t2
−
0
−
t1 0
+
0
t1 0
1 u, T2 v −1 × P (T2 ∈ dv | T1 u, T2 v) PF,G T t2
0
I (T1 u, T2 ∈ dv, ∆2 = 1)P (T1 ∈ du | T1 u, T2 v) PF,G (T1 u, T2 v)
t2
2 v P (T1 ∈ du | T1 u, T2 v) I T1 u, T
0
−1 . × P (T2 ∈ dv | T1 u, T2 v) PF,G T1 u, T2 v
Here F represents the bivariate distribution of (T1 , T2 ) and PF,G (T1 s, T2 t) = t). Firstly, we note that, as expected by the theory, EG (ICDabr (Y | F, G) | S(s, t)G(s, X) = I (T1 t1 , T2 t2 ) − µ for all independent censoring distributions G ∈ G ∗ satis 1 , t2 ) > δ > 0, FX – a.e. In addition, if we replace G in this influence curve by fying G(t 1 , t2 | X) > 0, FX − a.e., then we still have any G satisfying CAR and G(t EG ICDabr (Y | F, G) | X = I (T1 t1 , T2 t2 ) − µ (8)
for all bivariate distributions F , as we show in Appendix A. This explicitly corresponds t) by S(s, t)G(s, t | X). We will refer to replacing PF,G (T1 s, T2 t) = S(s, t)G(s, to this resulting influence curve as the modified Dabrowska’s influence curve. Note that Q0,1 (FX ) ≡ F for this influence curve. Also, note that this modification will enable us to use covariate processes when estimating the censoring mechanism. 1 , T2 | X) − µ. We now define Let U (Y | G, D(· | µ)) = I (T1 t1 , T2 t2 )∆/G(T ICCAR (Y | F, G) ≡ ICDabr (Y | F, G) − U Y | D · | µ(F ) . By (8), EG (ICCAR (Y | F, G) | X) = 0 for all G ∈ G(CAR) and bivariate distributions F . We now propose the following observed data estimating function for µ indexed by the true censoring mechanism G and the bivariate distribution F of (T1 , T2 ) IC0 Y | F, G, D(· | µ) = U Y | G, D(· | µ) + ICCAR (Y | F, G). In this estimating equation bivariate distribution F of T1 , T2 plays the role of the nuisance parameter Q0 (FX ) of full data distribution. Note that this estimating function for µ satisfies (8) and at the true µ and F it reduces to the modified Dabrowska’s influence curve (and to Dabrowska’s influence curve at G(· | X) = G(·)). Given consistent estimators Fn of F and Gn of G, let µ0n be the solution of 1 IC0 Yi | Fn , Gn , D(· | µ) . n n
0 = (1/n) Σ_{i=1}^n IC0(Yi | Fn, Gn, D(· | µ)).
Moreover, we have the following closed form solution of this estimating equation:
µ0n = µ(Fn) + (1/n) Σ_{i=1}^n ICDabr(Yi | Fn, Gn),    (9)
where µ(Fn), which will be denoted by µDab_n, is Dabrowska's (1988) estimator. We will refer to µ0n as the generalized Dabrowska's estimator. We estimate F nonparametrically by Dabrowska's estimator; this corresponds to replacing the hazards in the numerator of ICDabr by their empirical estimates and S(t1, t2) by µDab_n, so that the integrals in this expression simply become sums. The conditional bivariate survival function G(· | X) of (C1, C2) can be estimated by low dimensional models such as frailty models or by the methods proposed in Section 2. Under the regularity conditions of Theorem A.1, µ0n is asymptotically linear with influence curve IC(Y) ≡ Π(IC(· | F, G, D(· | µ)) | T2⊥(PFX,G)), where T2⊥(PFX,G) ⊂ TCAR is the orthogonal complement of the observed data tangent space of G under the posed model G for G. Two results emerge from the analysis of this estimator. Firstly, if the posed model for G is the independent censoring model or a submodel of it, then the resulting generalized estimator is asymptotically equivalent to Dabrowska's (1988) estimator, since IC0(Y | F, G, D(· | µ)) is already orthogonal to the tangent space Tindep in this model. In fact, under this scenario Σ_{i=1}^n ICDabr(Yi | Fn, Gn) algebraically equals 0, resulting in a µ0n that is exactly equal to Dabrowska's (1988) estimator. Secondly, if the tangent space T2(PFX,G) contains scores which are not in the tangent space of G in the model posed by the RAL estimator, then the generalized estimator is more efficient than Dabrowska's (1988) estimator even when (C1, C2) is independent of X.
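To make (9) concrete as a computation, the generalized estimator is a plug-in plus an empirical mean of the estimated influence curve. The fragment below is only a schematic R sketch; the helper functions dabrowska() (returning µDab_n at (t1, t2)) and ic.dabr() (evaluating the modified Dabrowska influence curve of this section for one observation, given the estimates Fn and Gn) are hypothetical and not supplied here.

# Generalized Dabrowska estimator, equation (9): a sketch built on
# hypothetical evaluators 'dabrowska' and 'ic.dabr'.
gen.dabrowska <- function(t1, t2, data, Fn, Gn) {
  mu.dab <- dabrowska(t1, t2, data)                   # mu(Fn), Dabrowska (1988)
  ic <- sapply(seq_len(nrow(data)),
               function(i) ic.dabr(data[i, ], t1, t2, Fn, Gn))
  mu.dab + mean(ic)                                   # plug-in plus empirical mean
}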
5. Orthogonalized estimating function and corresponding estimator In this section we discuss the orthogonalization of our initial estimating function IC0 (Y | F, G, D(· | µ)) ≡ IC0 (Y | Q0 (FX ), G, µ) to improve efficiency and gain robustness. We define a new estimating function at G1 ∈ G(SRA) IC∗ Y | Q(FX , G1 ), G1 , µ = IC0 Y | Q0 (FX ), G1 , µ − ICSRA Y | Q1 (FX , G1 ), G1 , (10) where ICSRA (Y | Q1 (FX , G1 ), G1 ) = Π(IC0 (Y | Q0 (FX ), G1 , µ) | TSRA (PFX ,G1 )) represents the projection of IC0 (Y | Q0 (FX ), G1 , µ(FX )) onto TSRA at PFX ,G1 . This orthogonalized estimating function has the so-called double robustness property (Robins et al., 2000; van der Laan and Yu, 2001; van der Laan and Robins, 2002). The double robustness property allows misspecification of either the censoring mechanism G(· | X) or the full data distribution FX . Let FX1 and G1 ∈ G(SRA) be guesses of FX and G(· | X), respectively. Then, we have EPFX ,G IC∗ Y | Q FX1 , G1 , G1 , µ(FX ) = 0 if either G1 = G(· | X) and G(· | X) satisfies the identifiability condition G(· | X) > δ > 0, FX – a.e. or FX1 = FX and without any further assumptions on G(· | X). We refer to the General Double Robustness Theorem in Chapter 1 of van der Laan and Robins (2002) for details and the proof of this result. In order to obtain the double robustness property, special care must be paid to estimating the nuisance parameter Q1 (FX , G) for which we will have an explicit representation below. As we will see shortly, specifying Q1 (FX , G) correctly is usually a much harder task due to the nature of the projections of ICDabr (Y | F, G) onto TSRA . Therefore, in this section we focus on the scenario where G is estimated consistently and satisfies the identifiability condition. We describe how to obtain an estimator from the orthogonalized estimating function by estimating Q1 (FX , G) with a regression approach. Then, in the subsequent Section 5.1, we discuss an alternative way of estimating Q1 (FX , G) in the form of Monte Carlo simulations that would allow misspecification of G(· | X) when Q1 (FX , G) is correctly specified. Given a consistent estimate Gn of G, and an estimate Qn of Q(FX , G), we propose to estimate µ with the solution of 1 ∗ IC (Yi | Qn , Gn , µ). n n
0 = (1/n) Σ_{i=1}^n IC∗(Yi | Qn, Gn, µ).    (11)
The closed form solution of this estimating equation equals
µ1n = µ0n − (1/n) Σ_{i=1}^n ICSRA(Yi | Q1,n, Gn),    (12)
where µ0n is the generalized Dabrowska's estimator introduced in Section 4. This is also equivalent to the one-step estimator one would obtain from (11) by using the generalized Dabrowska's estimator as the initial estimator. This estimator is asymptotically
linear and consistent under the regularity conditions of Theorem A.1 if G is estimated consistently. We now present the explicit representation of the projections onto TSRA . 1 (T1 ), X 2 (T2 )), L EMMA 5.1. Define the following functions of Y = (T1 , T2 , ∆1 , ∆2 , X Aj (t) = I (Cj t), j = 1, 2 where C1 , C2 are two discrete time-variables with finite support contained in j = 1, . . . , p: k u , dMG,k (u) = I (Ck ∈ u, ∆k = 0) − λk u | Fk (u) I T where − 1), X A (j ) , F1 (j ) = A(j − 1), X A (j ) , F2 (j ) = A1 (j ), A(j λk j | Fk (j ) = P Ck = j | Ck j, Fk (j ) ,
(13) (14) k = 1, 2.
(15)
Then, the nuisance tangent space of G at PFX ,G under SRA is given by: TSRA (PFX ,G ) = TSRA,1 (PFX ,G ) ⊕ TSRA,2 (PFX ,G ), where TSRA,k (PFX ,G ) =
p
H j, Fk (j ) dMG,k (j ) : H .
j =1
Subsequently, the projection of any function V (Y ) ∈ L20 (PFX ,G ) onto TSRA(PFX ,G ) is given by
Π V (Y ) | TSRA (PFX ,G ) =
p 2
Hk j, F (j ) dMG,k (j ),
k=1 j =1
where Hk j, F (j ) = E V (Y ) | dAk (j ) = 1, Fk (j ) − E V (Y ) | dAk (j ) = 0, Fk (j ) . | X) into two products under the assumpP ROOF. Firstly, by factorization of g(A tion (2), we have that TSRA(G) = TSRA,1 (G) ⊕ TSRA,2 (G). By the same argument we have that TSRA,k (G) = TSRA,k,1 (G) ⊕ · · · ⊕ TSRA,k,p (G) where TSRA,k,j is the tangent space for the j th component of the kth product of (3). We will now derive the tangent − 1), X A (j )) and F2 (j ) = (A1 (j ), A(j − 1), X A (j )) space TSRA,k,j . Let F1 (j ) = (A(j be the observable histories. Let αk (j | F (j )) = E(dAk (j ) | Fk (j )), k = 1, 2. Then the kth product, k = 1, 2 in (3) can be represented as: p j =1
dA (j ) 1−dAk (j ) αk j | Fk (j ) k . 1 − αk j | Fk (j )
Note that αk (j | F (j )) = λk (j | Fk (j ))I (Tk j ) where λk (j | Fk (j )) is the conditional hazard of Ck as defined in (15). Since αk (j | Fk (j ))dAk (j ) {1 − αk (j | Fk (j ))}1−dAk (j ) is just a Bernoulli likelihood for the random variable dAk (j ) with probability αk (j | Fk (j )), it follows that the tangent space of αk (j | Fk (j )) is the space of all functions of (dAk (j ), Fk (j )) with conditional mean zero given Fk (j ). It can be shown that any such function V can be written as V dAk (j ), Fk (j ) − E V dAk (j ), Fk (j ) | Fk (j ) = V 1, Fk (j ) − V 0, Fk (j ) dMG,k (j ), (16) where
dMG,k (j ) = dAk (j ) − αk j | Fk (j ) k j . = I (Ck = j ) − λk j | Fk (j ) I T
Note that I (Ck = j ) = I (Ck = j, ∆k = 0). Thus the tangent space of αk (j | Fk (j )) for a fixed j equals TSRA,k,j (PFX ,G ) ≡ H Fk (j ) dAk (j ) − αk j | F (j ) : H , where H ranges over all functions of Fk (j ) for which the right-hand side are elements have finite variance. By factorization of the likelihood we have that TSRA,k (PFX ,G ) = TSRA,k,1 ⊕ TSRA,k,2 ⊕ · · · ⊕ TSRA,k,p . Equivalently, TSRA,k (PFX ,G ) =
p
(17)
Hk j, Fk (j ) dMG,k (j ) : H .
j =1
The projection of any function V (Y ) ∈ L20 (PFX ,G ) onto TSRA,k,j (PFX ,G ) is obtained by first projecting on all functions of dAk (j ) and the subtracting from this its conditional expectation given Fk (j ). Thus, we have Π V (Y ) | TSRA (PFX ,G ) =
p 2 E V (Y ) | dAk (j ) = 1, Fk (j ) k=1 j =1
− E V (Y ) | dAk (j ) = 0, Fk (j ) dMG,k (j ).
This completes the proof.
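For implementation, the projection in Lemma 5.1 is a finite sum of regression contrasts times martingale increments, evaluated over the discrete censoring times. The sketch below shows this for a single observation; Ehat(k, j, a, y) and lamhat(k, j, y) are hypothetical fitted functions returning, respectively, an estimate of E[ICDabr | dAk(j) = a, Fk(j)] and of the discrete hazard λk(j | Fk(j)) (for instance from the regression and Cox models used later in Section 6.2), and the data layout (time1, time2, cstatus1, cstatus2) is an assumption.

# Projection of an estimated influence curve onto T_SRA for one subject (Lemma 5.1).
ic.sra <- function(y, Ehat, lamhat, p) {
  out <- 0
  for (k in 1:2) {
    ttilde <- y[[paste0("time", k)]]       # observed time for component k
    dlt    <- y[[paste0("cstatus", k)]]    # 1 = failure observed, 0 = censored
    for (j in seq_len(p)) {
      atrisk <- as.numeric(ttilde >= j)                               # at-risk indicator
      dM     <- as.numeric(ttilde == j & dlt == 0) - lamhat(k, j, y) * atrisk
      out    <- out + (Ehat(k, j, 1, y) - Ehat(k, j, 0, y)) * dM
    }
  }
  out                                      # feeds the one-step update (12)
}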
Application of this lemma with V (Y ) = IC0 (Y | Q0 (FX ), G, µ) gives Π IC0 Y | Q0 (FX ), G, µ(FX ) | TSRA =
p 2 E ICDabr (Y | F, G) | dAk (j ) = 1, Fk (j )
(18)
k=1 j =1
− E ICDabr (Y | F, G) | dAk (j ) = 0, Fk (j ) dMG,k (j ).
(19)
Let Qk (dAk (j ), Fk (j )) ≡ E(ICDabr (Y | F, G) | dAk (j ), Fk (j )) for k = 1, 2. Note that Q1 (FX , G) in (10) represents Qk (dAk (j ), Fk (j )), k = 1, 2, j = 1, . . . , p. Then, following the representation of the projections onto TSRA , the explicit form of the onestep estimator given in (12) becomes µ1n = µ0n −
p 2 k=1 j =1
k dAk (j ) = 1, Fk (j ) Q k dAk (j ) = 0, Fk (j ) dMGn ,k (j ), −Q
k (dAk (j ), Fk (j )) is an estimate of Qk (dAk (·), Fk (·)), k = 1, 2 at j . One way where Q to obtain such estimates is to estimate the corresponding conditional expectations parametrically or semi-parametrically by regressing the estimate of ICDabr (Y | F, G) onto time variable j and covariates extracted from the past (Ak (t), Fk (t)). Making this conditional expectation dependent on time covariate j allows us to evaluate it at all required j . Note that in order to avoid technical conditions such as measurability in establishing the projections onto TSRA , we assumed that C1 and C2 are discrete on a grid {1, . . . , p}2 . Since the time points can be chosen to be the grid points of an arbitrarily fine partition this assumption can be made without loss of practical applicability. We have observed in our simulation studies that models for conditional expectations in projection terms are often misspecified, leading to inconsistent estimates for Q1 (FX , G). It is then desirable to prevent a possible efficiency loss due to projections. We apply the general modification proposed by Robins and Rotnitzky (1992) and use the following estimating function IC∗ Y | Q(FX , G), cnu , G, µ = IC0 Y | Q0 (FX ), G, µ − cnu ICSRA Y | Q1 (FX , G), G , (20) where cnu is defined as EPFX ,G {IC0 (Y | Q0 (FX ), G, µ)ICSRA(Y | Q1 (FX , G), G)} EPFX ,G {ICSRA (Y | Q1 (FX , G), G)2 } so that cnu ICSRA (Y | Q1 (FX , G), G) = Π(IC0 (Y | Q0 (FX ), G, µ) | TSRA). Note that when ICSRA (Y | Q1 (FX , G), G) is the projection of IC0 (Y | Q0 (FX ), G, µ) onto TSRA , cnu equals 1. In other words, this adjustment will only have an effect when Q1 (FX , G) is misspecified. Moreover, it guarantees that the resulting estimating function IC∗ (Y | Q(FX , G), cnu , G, µ) is more efficient than the initial estimating function IC0 (Y | Q0 (FX ), G, µ) even when Q1 (FX , G) is estimated inconsistently (Robins and Rotnitzky, 1992; van der Laan and Robins, 2002). cnu can be estimated by taking empirical expectations of the estimated IC0 (Y | Q0 (FX ), G, µ) and ICSRA (Y | Q1 , G). Specifically, estimate cnu,n of cnu is given by n IC0 (Yi | Q0 (Fn ), Gn , µ(Fn ))ICSRA (Yi | Q1,n , Gn ) n cnu,n = i=1 , T i=1 ICSRA (Yi | Q1,n , Gn )ICSRA (Yi | Q1,n , Gn ) where Q1,n is the estimate of Q1 (FX , G). One practical aspect of this adjustment parameter is that it provides a way of monitoring the goodness of fit of the projection term
ICSRA (Y | Q1,n , Gn ). Since cnu,n will be approximately 1 at the best fit of the projection term, one can use this property to choose the best fit. 5.1. Estimation of Q1 (FX , G) by Monte Carlo simulations In this subsection, we will discuss a Monte Carlo simulation method to estimate the nuisance parameter Q1 (FX , G). This approach requires guessing a low dimensional model for the full data distribution FX and the censoring mechanism G, respectively. As a result, the corresponding estimator of the orthogonalized estimating function (10) remains consistent if either of the guessed models is correctly specified. We will use the longitudinal representation of the observed data with the notation of Section 2 over a discrete time axes (j = 1, . . . , p) given as Y = X1 (1), X2 (1), A1(1), A2 (1), X1,A(1) (2), X2,A(1) (2), A1 (2), A2 (2), (p), X2,A(p−1) (p), A1 (p), A2 (p) . . . . , X1,A(p−1) Define L1 (j ) = X1,A(j −1) (j ) and L2 (j ) = X2,A(j −1) (j ), then Y = L1 (1), L2 (1), A1 (1), A2 (1), . . . , L1 (p), L2 (p), A1 (p), A2 (p) . Under SRA, the likelihood of the observed data is given by p − 1), A(j − 1) f1 L1 (j ) | L(j dPFX ,G (Y ) = j =1
− 1) f2 L2 (j ) | L1 (j ), L(j − 1), A(j − 1), L(j ) g1 A1 (j ) | A(j − 1), L(j ) , g2 A2 (j ) | A1 (j ), A(j
) = (L 1 (j ), where L(j L2 (j )). Since the likelihood factorizes under SRA, we have that the FX and G part of the likelihood are given by Q(FX ) =
p
− 1) f1 L1 (j ) | L(j − 1), A(j
j =1
×
p
− 1) , f2 L2 (j ) | L1 (j ), L(j − 1), A(j
(21)
j =1
| X = − 1), g A g1 A1 (j ) | A(j L(j )
p
j =1
×
p j =1
− 1), g2 A2 (j ) | A1 (j ), A(j L(j ) .
(22)
The modeling and estimation strategies proposed for the censoring mechanism in Section 2 applies to both of these partial likelihoods. Let (f1,θ1 , f2,θ2 ) and (g1,η1 , g2,η2 ) be parametric or semi-parametric models for FX and G part of the likelihood. Let (θ1,n , θ2,n ) and (η1,n , η2,n ) be the maximum likelihood estimators and Qn = Q(θ1,n , θ2,n ) and Gn = Gη1,n ,η2,n be the corresponding estimators of Q(FX ) and G, respectively. Now, one can evaluate the conditional expectations in the projection terms (18) and (19) under the known law PQn ,Gn with a Monte Carlo simulation. Consider a particular observation Y and let j be fixed. The following is the algorithm for performing Monte Carlo simulation on this observation: • Simulate: This step simulates the complete observation from a fixed history. Set b = 1 and m = j + 1. (1) With history (dA1 (j ) = 1, F1 (j )): Set dA1 (j ) = 1 and − 2), L(m − 1)). – (A) Generate A2 (m − 1) from g2,η2,n (· | A1 (m − 1) = 1, A(m – (B) Generate L1 (m) from f1,θ1,n (· | L(m − 1), A(m − 1)). − 1)). – (C) Generate L2 (m) from f2,θ2,n (· | L1 (m), L(m − 1), A(m – Set m = m + 1 and repeat steps (A), (B), (C) until the complete data structure 1,∗ is observed. denoted by Y1,b (2) With history (dA1 (j ) = 0, F1 (j )): − 2), L(m − 1)). – (A) Generate A2 (m − 1) from g2,η2,n (· | A1 (m − 1) = 0, A(m − 1)). – (B) Generate L1 (m) from f1,θ1,n (· | L(m − 1), A(m − 1)). – (C) Generate L2 (m) from f2,θ2,n (· | L1 (m), L(m − 1), A(m − 1), – (D) Generate A1 (m) from g1,η1,n (· | A(m L(m)). – Set m = m + 1 and repeat steps (A), (B), (C), (D) until the complete data 1,∗ is observed. structure denoted by Y0,b (3) With history (dA2 (j ) = 1, F2 (j )): Set dA2 (j ) = 1 and − 1), A(m − 1)). – (A) Generate L1 (m) from f1,θ1,n (· | L(m − 1)). – (B) Generate L2 (m) from f2,θ2,n (· | L1 (m), L(m − 1), A(m – (C) Generate A1 (m) from g1,η1,n (· | A(m − 1), L(m)). – Set m = m + 1 and repeat steps (A), (B), (C) until the complete data structure 1,∗ is observed. denoted by Y2,b (4) With history (dA2 (j ) = 0, F2 (j )): − 1), A(m − 1)). – (A) Generate L1 (m) from f1,θ1,n (· | L(m − 1)). – (B) Generate L2 (m) from f2,θ2,n (· | L1 (m), L(m − 1), A(m − 1), – (C) Generate A1 (m) from g1,η1,n (· | A(m L(m)). – (D) Generate A2 (m) from g2,η2,n (· | A1 (m), A(m − 1), L(m)). – Set m = m + 1 and repeat steps (A), (B), (C), (D) until the complete data 2,∗ is observed. structure denoted by Y0,b k,∗ , k = 1, 2, i = 0, 1. • Evaluate: Evaluate ICDabr (Y | Fn , Gn ) at Y = Yi,b • Repeat: Repeat the steps Simulate and Evaluate B times and report
ICSRA (Y | Q1,n , Gn )(j ) =
B 2 k,∗ k,∗ 1 ICDabr Y1,b | Fn , Gn − ICDabr Y0,b | Fn , Gn dMGk,n (j ) B b=1 k=1
ICSRA (Y | Q1,n , Gn )(j ) is now an estimate of the projection of ICDabr (Y | F, G) onto TSRA,1,j ⊕ TSRA,2,j , j = 1, . . . , p. Note that for each observation j runs up to the corresponding max(T1 , T2 ). This way of estimating the projection terms guarantees that the resulting estimator of the orthogonalized estimating function (10) is consistent if either (f1,θ1 , f2,θ2 ) or (g1,η1 , g2,η2 ) is correctly specified. 5.2. Confidence intervals In this subsection, we will briefly discuss the ways of constructing Wald-type confidence intervals for the proposed one-step estimator given in (12). In particular, we will consider the case where we assume that the model G posed for censoring mechanism is correct. Application of Lemma A.1 in Appendix A shows that µ1n is asymptotically linear with influence curve IC∗ (Y | Q(FX , G), G, µ) − Π(IC∗ (Y | Q(FX , G), G, µ) | T (G)) where T (G) is the tangent space of G for the chosen model. Therefore, one can use 1 IC Yi | Qn , Gn , µ0n n n
σˆ 2 =
i=1
as a conservative estimate of the asymptotic variance of µ1n , and this can be used to construct a conservative 95% confidence interval for µ: σˆ µ1n ± 1.96 √ . n This confidence interval is asymptotically correct if Qn is a consistent estimate of Q, i.e., conditional expectations in the projection terms are estimated consistently. Moreover, it is freely obtained after having computed µ1n . 6. Simulations 1 We performed a simulation study to assess the relative performance of µ0n , µDab n and µn . In our simulations, we generated bivariate survival and censoring times from frailty models with and without covariates. Frailty models are a subclass of Copula models. The theory of Copulas dates back to Sklar (1959) but their application in statistical modeling is a more recent phenomenon (e.g., Genest and MacKay, 1986; Genest and Rivest, 1993; Oakes, 1989; Clayton, 1978; Clayton and Cuzick, 1985; Hougaard, 1987). We have two main simulation setups. Below we describe these in details, and the explicit formulas for data generation is provided in Appendix A.
• Simulation I (Informative censoring): We generated binary baseline covariates Z1 , Z2 ∼ Bernoulli(p) for each pair of subject. Consecutively, both censoring and survival times were made dependent on these baseline covariates to enforce informative censoring. Survival times T1 and T2 are generated from a gamma frailty model
with truncated baseline hazard. This assumes a proportional hazards model of the type λi (t | W = w, Zi = zi ) = λ0 (t)weβt zi ,
i = 1, 2,
where w represents a realization from the hidden gamma random variable. Truncated 1 , t2 | X) > δ > 0, ∀t1 , t2 exponential baseline hazard was chosen to ensure that G(t in the support of FX . Similarly, C1 , C2 were generated from a gamma frailty model with covariates Z1 , Z2 using a constant baseline hazard. We adjusted the amount of dependence between survival and censoring times through the coefficient in front of the Zs (βt for (T1 , T2 ) and βc for (C1 , C2 )). • Simulation II (Independent censoring): We generated survival times as in simulation setup I, but enforced censoring times to be independent of T1 , T2 . This simply corresponds to setting βc = 0 in the conditional hazard functions of C1 , C2 . 6.1. Comparison of µ0n with µDab n We firstly report the mean squared error ratios for µDab and µ0n from a simulation study n of setup I for moderate informative censoring in Table 1. The two estimators are evaluated on a 4 × 4 grid. We observe that µ0n outperforms µDab n at all grid points. This result indicates that our generalization of the Dabrowska’s estimator is truly accounting for informative censoring as expected. Table 1 MSE µ0 /MSE µDab based on 200 simulated data sets of sample size 250. F and G are both generated from n n bivariate gamma frailty models with covariate Z ∼ Bernoulli(0.5). G(· | X) is estimated using a bivariate gamma frailty model with covariates. Correlations between T1 and C1 and T2 and C2 are approximately 0.4. P (T1 < C1 ) = 0.65 and P (T2 < C2 ) = 0.65
            t1 = 0.1    t1 = 1      t1 = 4      t1 = 10
t2 = 0.1    0.961543    0.8015517   0.2000056   0.2023185
t2 = 1      0.9222071   0.6194325   0.2619613   0.2579991
t2 = 4      0.1769921   0.3169093   0.1994758   0.2131806
t2 = 10     0.2335622   0.3569389   0.2638717   0.2433467
Table 2 MSE µ0 /MSE µDab based on 200 simulated data sets of sample size 250. F is generated from a bivariate n n gamma frailty model with covariate Z ∼ Bernoulli(0.5). G(· | X) is from a bivariate gamma frailty model with no covariates. T and C are independent. G(· | X) is estimated using a bivariate gamma frailty model with covariates Z. Correlations between T1 and C1 and T2 and C2 are approximately 0. P (T1 < C1 ) = 0.70 and P (T2 < C2 ) = 0.70
            t1 = 0.05   t1 = 0.2    t1 = 3      t1 = 8
t2 = 0.05   0.9990788   0.9985483   0.9702160   0.9813311
t2 = 0.2    0.9924038   0.9946938   0.9650496   0.9741110
t2 = 3      0.9814260   0.9789798   0.9504253   0.9650510
t2 = 8      0.9806082   0.9821724   0.9665981   0.9789226
In Table 2, we report relative performance of the two estimators when censoring times are independent of the failure times, i.e., G(· | X) = G(·) (generating from setup II). In this simulation, when constructing µ0n , G(· | X) is still estimated by a bivariate frailty model with covariates ignoring independence structure. We observe from Table 2 that both estimators perform about the same under this scenario. There is no efficiency loss for µ0n since our posed model for G(· | X) includes the independent censoring model as a submodel. 6.2. Comparison of µ0n , µDab and µ1n n We now compare the performances of the three estimators on a simulated data set of sample size 250 generated from the simulation setup I. We estimate the quantity E ICDabr (Y | Fn , Gn ) | dAi (t), Fi (t) by a linear regression model based on covariates extracted from the supplied history Fi (t). This corresponds to using the regression approach described in Section 5. Covariates such as I (Ti t), t, I (Ti t) × Ti , Zi , I (Ci t) × Ci , i = 1, 2 and some interactions with the time variable t are used and standard model selection techniques are employed. Moreover, the conditional hazards in the projections are estimated by fitting a Cox-proportional hazards model. Survival function estimates at different grid points for the three estimators are given in Table 3 together with the true survival probabilities. Firstly, since there is informative censoring, µ0n outperforms µDab n at all grid points. Secondly, we observe that the one-step estimator µ1n provides some improvement over the initial estimator µ0n (i.e., the change in the estimator is in the desired direction), however it is not a substantial improvement overall. This is not surprising if we look at the estimated adjustment parameter, cnu,n , reported in this table. cnu,n is away from 1 at all grid points indicating that we are doing a poor job when estimating the projections (i.e., nuisance parameter Q1 (FX , G) is misspecified). It would be worthwhile to put effort in making the projection constant close to 1 with real data applications. The conservative 95% intervals for the one-step estimator are reported in column 6 of Table 3. Table 3 0 1 1 µDab n , µn , µn estimates of P (T1 t1 , T2 t2 ) with 95% confidence interval calculated for µn on a data set simulated from simulation setup I (t1 , t2 ) (0.1, 0.1) (0.1, 1.0) (0.1, 4.0) (0.1, 10.0) (1.0, 1.0) (1.0, 4.0) (1.0, 10.0) (4.0, 4.0) (4.0, 10.0) (10.0, 10.0)
(t1, t2)       P(T1 > t1, T2 > t2)   µDab_n     µ0n        µ1n         95% CI of µ1n            cnu,n
(0.1, 0.1)     0.949409              0.974827   0.974531   0.9742604   (0.9381157, 0.9981405)   2.510069
(0.1, 1.0)     0.794018              0.826173   0.797576   0.7961396   (0.7290466, 0.8632327)   2.300677
(0.1, 4.0)     0.594459              0.720699   0.605632   0.6028978   (0.4939351, 0.7118605)   2.361191
(0.1, 10.0)    0.460084              0.658725   0.541393   0.5387704   (0.4232607, 0.6542801)   2.404321
(1.0, 1.0)     0.674216              0.696585   0.657125   0.6601110   (0.5843612, 0.7298903)   2.460159
(1.0, 4.0)     0.509527              0.620734   0.528965   0.5232135   (0.4216446, 0.6247825)   3.649437
(1.0, 10.0)    0.394196              0.571199   0.474861   0.46910810  (0.3595429, 0.5786733)   3.441779
(4.0, 4.0)     0.390691              0.539030   0.363114   0.3766390   (0.2439028, 0.4823266)   2.525892
(4.0, 10.0)    0.302516              0.482808   0.320330   0.3077152   (0.1829420, 0.4324884)   2.539978
(10.0, 10.0)   0.235763              0.453462   0.306258   0.2954581   (0.1740513, 0.4168649)   3.399358
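For completeness, the adjustment constant cnu,n monitored in Table 3 and the conservative Wald interval of Section 5.2 are one-line computations once per-subject influence-curve terms are available. In the sketch below, ic0 and icsra denote vectors holding the estimated IC0(Yi | ·) and ICSRA(Yi | ·) terms and mu0n the generalized Dabrowska estimate; all three are assumed to have been computed already, and the variance estimate is taken as the empirical mean of the squared estimated influence curve.

cnu.n <- sum(ic0 * icsra) / sum(icsra^2)     # approx. 1 when the projection term fits well
mu1n  <- mu0n - cnu.n * mean(icsra)          # adjusted one-step estimator, cf. (20)
ic    <- ic0 - cnu.n * icsra                 # estimated (conservative) influence curve
mu1n + c(-1.96, 1.96) * sqrt(mean(ic^2) / length(ic))   # conservative 95% Wald interval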
7. Discussion We firstly presented a general method of constructing mappings from full data estimating functions to observed data estimating functions which results in estimators asymptotically equivalent to a specified RAL estimator. This is a powerful method and application of it in general bivariate right censored data structure resulted in a generalized estimator of Dabrowska’s (1988) estimator. This proposed generalized estimator overcomes the deficiencies of the commonly used Dabrowska’s estimator by allowing informative censoring and incorporating covariate processes. Secondly, we constructed an orthogonalized estimating function that has the double robustness property. We mainly considered the scenario where the censoring mechanism is specified correctly and constructed a one-step estimator that improves on our initial estimator. We have shown with a simulation study that the proposed generalized estimator is superior to Dabrowska’s estimator when censoring mechanism is estimated consistently and the results are dramatic in favor of the generalized estimator when there is dependent censoring. We used the one-step estimator together with Dabrowska’s estimator and generalized Dabrowska’s estimator on a simulated data set that included informative censoring. In this example dataset, one-step estimator did not improve much on the generalized Dabrowska’s estimator since estimators of the projections on TSRA performed poorly. We were able to monitor this by the estimated adjustment parameter. One future research direction would be implementing the Monte Carlo simulation method of Section 5.1 for estimating the projection terms. This would provide the desired flexibility to misspecify G(· | X).
Appendix A Both the influence curve lemma and the asymptotic linearity theorem of Appendices A.1 and A.2 require the following Hilbert space terminology: L20 (PFX ,G ) is the Hilbert space of functions of Y with finite variance and mean zero endowed with the covariance inner product v1 , v2 PFX ,G ≡ v1 v2 dPFX ,G . A.1 Influence curve of a asymptotically linear estimator when censoring mechanism is estimated efficiently The following lemma is from van der Laan et al. (2000). L EMMA A.1. Let Y be observed data from PFX ,G where G satisfies coarsening at random. Denote the tangent space for the parameter FX with T1 (PFX ,G ). Consider the parameter µ which is a real valued functional of FX . Let µn (G) be an asymptotically linear estimator of µ with influence curve IC0 (· | FX , G) which uses the true G. Assume that for an estimator Gn √ µn (Gn ) − µ = µn (G) − µ + Φ(Gn ) − Φ(G) + oP 1/ n (A.1) for some functional Φ of Gn . Assume that Φ(Gn ) is an asymptotically efficient estimator of Φ(G) for a given model {Gη : η ∈ Γ } with tangent space T2 (PFX ,G ). Then,
µn (Gn ) is asymptotically linear with influence curve IC1 (· | FX , G) = IC0 (· | FX , G) − Π IC0 (· | FX , G) | T2 (PFX ,G ) . P ROOF. We decompose L20 (PFX ,G ) orthogonally in T1 (PFX ,G ) ⊕ T2 (PFX ,G ) ⊕ T⊥ (PFX ,G ), where T⊥ (PFX ,G ) is the orthogonal complement of T1 (PFX ,G ) ⊕ T2 (PFX ,G ). By (A.1), µn (Gn ) is asymptotically linear with influence curve IC = IC0 + ICnu , where ICnu is an influence curve corresponding with an estimator of the nuisance parameter Φ(G) under the model with nuisance tangent space T1 (PFX ,G ). Let IC0 = a0 + b0 + c0 and ICnu = anu + bnu + cnu according to the orthogonal decomposition of L0 (PFX ,G ). We will now use two general facts about the influence curves. Firstly, an influence curve is orthogonal to the nuisance tangent space, and secondly efficient influence curve lies in the tangent space. Since ICnu is an influence curve of Φ(G) in the model where FX is not specified, it is orthogonal to T1 (PFX ,G ), i.e., anu = 0. Moreover, since Φ(Gn ) is efficient, ICnu lies in the tangent space T2 (PFX ,G ) and hence cnu = 0. We also have that IC0 + ICnu is influence curve of µn (Gn ) thus it is orthogonal to T2 (PFX ,G ), i.e., b0 + bnu = 0. Consequently, we have that IC1 + ICnu = a0 + c0 = Π IC0 | T2⊥ (PFX ,G ) ≡ IC0 − Π IC0 | T2 (PFX ,G ) . This completes the proof.
A.2 Asymptotics assuming consistent estimation of the censoring mechanism The following theorem (van der Laan and Robins, 2002) provides a template for proving asymptotic linearity with specified influence curve of the one-step estimator µ1n given by (7), (12) (i.e., set cnu,n = cnu = 1) or of the one-step solution of the estimating function (20) (if one uses the adjustment constant cnu,n ). The tangent space T2 = T2 (PFX ,G ) for the parameter G is the closure of the linear extension in L20 (PFX ,G ) of the scores at PFX ,G from all correctly specified parametric submodels (i.e., submodels of the assumed semiparametric model G) for the distribution G. T HEOREM A.1. Consider the observed data model M(G) = {PFX ,G : FX ∈ MF , G ∈ G ⊂ G(SRA)}. Let Y1 , . . . , Yn be n i.i.d. observations of Y ∼ PFX ,G ∈ M(G). Consider a one-step estimator of the parameter µ ∈ R1 of the form µ1n = µ0n + cn−1 Pn IC(· | Qn , Gn , cnu,n , Dhn (µ0n , ρn )). We will refer to cn−1 IC(· | Qn , Gn , cnu,n , Dhn (µ0n , ρn )) also by IC(· | Qn , Gn , cnu,n , cn , Dhn (µ0n , ρn )). Assume that the limit of IC(· | Qn , Gn , cnu,n , Dhn (µ0n , ρn )) specified in (ii) below satisfies: EG IC Y | Q1 , G, cnu , Dh (· | µ, ρ) | X = Dh (X | µ, ρ) FX – a.e., (A.2) F,⊥ Dh (· | µ, ρ) ∈ Tnuis (FX ).
√ Assume (we write f ≈ g for f = g + oP (1/ n )) cn−1 Pn IC · | Qn , Gn , cnu,n , Dhn µ0n , ρn − IC · | Qn , Gn , cnu,n , Dhn (µ, ρn ) ≈ µ − µ0n
(A.3)
(A.4)
and √ EPFX ,G IC Y | Qn , G, cnu,n , Dhn (µ, ρn ) = oP 1/ n ,
(A.5)
where the G-component of ρn is set equal to G as well. In addition, assume (i) IC(· | Qn , Gn , cnu,n , cn , Dhn (· | µ0n , ρn )) falls in a PFX ,G -Donsker class with probability tending to 1. (ii) For some (h, Q1 ) we have: IC · | Qn , Gn , cnu,n , cn , Dh · | µ0 , ρn n n → 0, − IC · | Q1 , G, cnu , c, Dh (· | µ, ρ) P FX ,G
where the convergence is in probability. Here (suppressing the dependence of the −1 estimating functions on parameters) cnu = IC0 , IC is such that nu IC nu , ICnu cnu ICnu equals the projection of IC0 onto the k-dimensional space ICnu,j , j = 1, . . . , k in L20 (PFX ,G ). (iii) Define for a G1 Φ(G1 ) = PFX ,G IC · | Q1 , G1 , cnu , c, Dh (µ, ρ) . For notational convenience, let ICn (G) ≡ IC · | Qn , G, cnu,n , cn , Dhn (µ, ρn ) , IC(G) ≡ IC · | Q1 , G, cnu , c, Dh (µ, ρ) . Assume PFX ,G ICn (Gn ) − ICn (G) ≈ Φ(Gn ) − Φ(G). (iv) Φ(Gn ) is an asymptotically efficient estimator of Φ(G) for the SRA-model G containing the true G with tangent space T2 (PFX ,G ) ⊂ TSRA (PFX ,G ). Then µ1n is asymptotically linear with influence curve given by I C ≡ Π IC · | Q1 , G, cnu , c, Dh (· | µ, ρ) | T2⊥ (PFX ,G ) . If Q1 = Q(FX , G) and IC(Y | Q(FX , G), G, cnu , Dh (· | µ, ρ)) ⊥ T2 (PFX ,G ), then this influence curve equals IC(· | Q(FX , G), G, cnu = 1, c, Dh (µ, ρ)). A.3 Proof of Theorem A.1 For notational convenience, we will give the proof for cnu,n = 1 and use appropriate short hand notation. We have µ1n = µ0n + cn−1 Pn IC Qn , Gn , Dhn µ0n , ρn − IC Qn , Gn , Dhn (µ, ρn ) + cn−1 Pn IC Qn , Gn , Dhn (µ, ρn ) .
√ By condition (A.4) the difference on the right-hand side equals µ − µ0n + oP (1/ n ). Thus we have: µ1n − µ = (Pn − P )cn−1 IC Qn , Gn , Dhn (µ, ρn ) + cn−1 P IC Qn , Gn , Dhn (µ, ρn ) . For empirical process theory we refer to van der Vaart and Wellner (1996). Condition (i) and (ii) in the theorem imply that the empirical process term on the right-hand side is asymptotically equivalent with (Pn − PFX ,G )c− IC(· | Q1 , G1 , Dh (µ, ρ)). So it remains to analyze the term cn−1 P IC Qn , Gn , Dhn (µ, ρn ) . Now, we write this term as a sum of two terms A + B, where A = cn−1 P IC Qn , Gn , Dhn (µ, ρn ) − IC Q1 , G, Dh (µ, ρ) , B = cn−1 P IC Q1 , G, Dh (µ, ρ) . By (A.2) and (A.3) we have B = 0. As in the theorem, let ICn (G) ≡ IC · | Qn , G, Dhn µ, ρn (G) , IC(G) ≡ IC · | Q1 , G, Dh (µ, ρ) . We decompose A = A1 + A2 as follows: A = PFX ,G ICn (Gn ) − IC(G) = PFX ,G ICn (G) − IC(G) + PFX ,G ICn (Gn ) − ICn (G) . √ By assumption (A.5) we have that A1 = oP (1/ n ). By assumption (iii) √ A2 = Φ2 (Gn ) − Φ2 (G) + oP 1/ n . By assumption (iv), we can conclude that µ1n is asymptotically linear with influence curve IC(· | Q1 , G, c, cnu , Dh (µ, ρ)) + ICnuis , where ICnuis is the influence curve of Φ2 (Gn ). Now, the same argument as given in the proof of Lemma A.1 proves that this influence curve of µ1n is given by: Π IC · | Q1 , G, c, cnu , Dh (µ, ρ) | T2⊥ . This completes the proof. A.4 Data generation for the simulation study Let W be a gamma random variable with mean 1 and variance αt . Let Z1 and Z2 be Bernoulli random variables with probability p. We assume the following proportional hazards model for T1 and T2 : λi (t | W = w, Zi = zi ) = λ0,T (t)wi eβt zi ,
i = 1, 2,
where w represents a realization from the hidden gamma random variable W . The baseline hazard λ0,T (t) is set to the hazard of a truncated exponential distribution and is given by λ0 (t) =
λt e−λt t , e−λt t − e−λt τ
where λt is the rate and τ is the truncation constant of the distribution. The bivariate distribution of T1 and T2 conditional on Z ≡ (Z1 , Z2 ) is given by −1/αt S(t1 , t2 | Z) = S1 (t1 | Z)−αt + S2 (t2 | Z)−αt − 1 ,
(A.6)
where −1/αt , Si (t | Z) = 1 + αt eβt Zi Λ0 (t)
i = 1, 2.
We use a similar frailty model with constant baseline hazard, λ0,C (t) = λc , for the censoring mechanism and denote the variance of the corresponding hidden gamma variable by αc . We now provide the explicit formulas for generating data from the above defined structures. Let U1 , U2 be random draws from uniform distribution on the interval [0, 1]. Let Z1 and Z2 denote random draws from Bernoulli(p). Then, T1 given (Z1 , Z2 ) and T2 given (T1 , Z1 , Z2 ) can be generated as φ1 =
(1 − U1 )−αt − 1 , αt eβt Z1
T1 = −
1 −λt τ log e(log(1−e )−φ1 ) + e−λt τ , λt
−α /(1+αt ) −1/αt S1 (T1 | Z1 )−αt − S1 (T1 | Z1 )−αt + 1 , φ 2 = U1 t φ3 =
φ2−αt − 1 , αt eβt Z2
T2 = −
1 −λt τ log e(log(1−e )−φ3 ) + e−λt τ . λt
Similarly, we generate the censoring times C1 given Z1 , Z2 and C2 given (C1 , Z1 , Z2 ) as follows 1 (1 − U2 )−αc − 1 C1 = , λc αc eβc Z1 −α /(1+αc ) −1/αc φ4 = U2 c S1 (C1 | Z1 )−αc − S1 (C1 | Z1 )−αc + 1 , C2 =
−α 1 φ4 c − 1 . λc αc eβc Z2
Estimation of the bivariate survival function
167
A.5 Computational remarks We will now go over a few computational details that are required for estimation of G(· | Z). R function coxph is used to estimate G(· | Z) by a frailty model. This is a straightforward procedure and is explained quite well in the help menu of R. Below we provide a piece of code for extracting cumulative baseline hazard function from the R objects generated by coxph. datafr is a data frame of the data. fra1_coxph(Surv(time,cstatus)~frailty(id)+strata(strat)+Z,data=datafr) fra1.sf_survfit(fra1) #survfit gives the survival estimates at #uncensored points with mean covariates. fra1.sum_ summary(fra1.sf) S1_fra1.sum\$surv[fra1.sum\$strata=="strat=0"] #P(T_1 >= tt1 | Z_1) S2_fra1.sum\$surv[fra1.sum\$strata=="strat=1"] #P(T_2 >= tt2 | Z_2) tt1_fra1.sum\$time[fra1.sum\$strata=="strat=0"]#tt1 tt2_fra1.sum\$time[fra1.sum\$strata=="strat=1"]#tt2 alph_fra1\$history\$"frailty(id)"\$theta #extracts the variance of the gamma frailty. #Extracting the cumulative baseline hazard at tt1 and tt2 including time 0: ch1_c(0,(S1^(-alph)-1)/(alph*exp(fra1\$coef*mean(Z1)))) ch2_c(0,(S2^(-alph)-1)/(alph*exp(fra1\$coef*mean(Z2))))
Once we have the estimates of the baseline cumulative hazard for various time points, | Z) using Eq. (A.6). we can estimate G(· A.6 Proving E(ICD (Y | F, G) | X) = I (T1 t1 , T2 t2 ) − S(t1 , t2 ) (t1 , t2 ) where Firstly, we are going to show that E(ICDabr | X) = I (T1 t1 , T2 t2 )− F ICDabr is the influence curve of Dabrowska’s (1988) estimator (without any modification) and X ≡ (T1 , T2 ). Then, it is easily seen that conditional expectation of modified (t1 , t2 ) Dabrowska’s influence curve given X also reduces to I (T1 t1 , T2 t2 ) − F | X) terms in denominator and numerator cancel out. In the following proof since, G(· (t1 , t2 ) = S(t1 , t2 ). Influence curve of Dabrowska’s bivariate survival we are using F function estimator in the random censoring model is given by IC(t1 , t2 )
= F (t1 , t2 ) −
I (T1 ∈ du, ∆1 = 1) − I (T1 u)P (T1 ∈ du | T1 u) P (T1 u) 0
t2 I (T2 ∈ du, ∆2 = 1) − I (T2 u)P (T2 ∈ du | T2 u) − P (T2 u) 0
t1 t2 2 ∈ dv, ∆1 = 1, ∆2 = 1) I (T1 ∈ du, T + 2 v) P (T1 u, T 0 0 t1
(A.7) (A.8) (A.9)
−
t1 0
−
t1 0
+
t1 0
t1
t2 0 t2 0
2 v)P (T1 ∈ du, T2 ∈ dv | T1 u, T2 v) I (T1 u, T P (T1 u, T2 v) 2 v, ∆1 = 1)P (T2 ∈ dv | T1 u, T2 v) I (T1 ∈ du, T 2 v) P (T1 u, T
t2 0
1 u, T2 v P (T1 ∈ du | T1 u, T2 v) I T 2 v −1 × P (T2 ∈ dv | T1 u, T2 v) P T1 u, T
2 ∈ dv, ∆2 = 1)P (T1 ∈ du | T1 u, T2 v) I (T1 u, T 2 v) P (T1 u, T 0 0
t1 t2 1 u, T2 v P (T1 ∈ du | T1 u, T2 v) I T + 0 0 −1 . × P (T2 ∈ dv | T1 u, T2 v) P T1 u, T2 v) −
(A.10) (A.11)
(A.12)
t2
(A.13)
(A.14)
(t1 , t2 ) where Firstly, we will show that E(IC(t1 , t2 ) | X) = I (T1 t1 , T2 t2 ) − F X ≡ (T1 , T2 ). We will take the conditional expectations of the terms (A.7), (A.8), (A.9), (A.10), (A.11), (A.12), (A.13), (A.14) separately. Term (A.7): t1 I (T1 ∈ du, ∆1 = 1) − I (T1 u)P (T1 ∈ du | T1 u)
E −
X P (T1 u) 0
t1
t1 E[I (T1 ∈ du, ∆1 = 1) | X] E[I (T1 u) | X]P (T1 ∈ du | T1 u) + =− P (T1 u) P (T1 u) 0 0
t1 I (T1 ∈ du)P (C1 u | X) =− P (T 1 u)P (C1 u | T1 u) 0
t1 I (T1 u)P (C1 u | T1 u)P (T1 ∈ du | T1 u) + P (T1 u)P (C1 u | T1 u) 0 I (T1 t1 ) 1
u=T1 ∧t1 =− +
(T1 , 0) (u, 0) u=0 F F 1 I (T1 t1 ) + −1 =− F (T1 , 0) F (T1 ∧ t1 , 0) I (T1 t1 ) I (T1 t1 ) I (T1 t1 ) I (T1 t1 ) =− + + −1= − 1. (T1 , 0) (T1 , 0) (t1 , 0) (t1 , 0) F F F F Term (A.8): t2 I (T2 ∈ dv, ∆2 = 1) − I (T2 v)P (T2 ∈ dv | T2 v)
E −
X P (T2 v) 0
t2
t2 E[I (T2 ∈ dv, ∆2 = 1) | X] E[I (T2 v) | X]P (T2 ∈ dv | T2 v) + =− P (T2 v) P (T2 v) 0 0
=−
t2
0
I (T2 ∈ dv)P (C2 v | X) P (T2 v)P (C2 v | T2 v)
I (T2 v)P (C2 v | T2 v)P (T2 ∈ dv | T2 v) P (T2 v)P (C2 v | T2 v) 0 I (T2 t2 ) 1
v=T2 ∧t2 =− +
(0, T2 ) (0, v) v=0 F F 1 I (T2 t2 ) + −1 =− (0, T2 ) (0, T2 ∧ t2 ) F F I (T2 t2 ) I (T2 t2 ) I (T2 t2 ) I (T2 t2 ) =− + + −1= − 1. (0, t2 ) F (0, T2 ) F (0, T2 ) F (0, t2 ) F +
t2
Term (A.9): t1 t2 2 ∈ dv, ∆1 = 1, ∆2 = 1)
I (T1 ∈ du, T E
X P (T1 u, T2 v) 0 0
t1 t2 E[I (T1 ∈ du, T2 ∈ dv, ∆1 = 1, ∆2 = 1) | X] = P (T1 u, T2 v) 0 0
t1 t2 I (T1 ∈ du, T2 ∈ dv)P (C1 u, C2 v | X) = P (T u, T2 v)P (C1 u, C2 v | T1 u, T2 v) 1 0 0 =
I (T1 t1 , T2 t2 ) . (T1 , T2 ) F
Term (A.10): t1 t2 I (T1 u, T2 v)P (T1 ∈ du, T2 ∈ dv | T1 u, T2 v)
E −
X 2 v) P (T1 u, T 0 0
t1 t2 E[I (T1 u, T2 v) | X]P (T1 ∈ du, T2 ∈ dv | T1 u, T2 v) =− P (T1 u, T2 v) 0 0
t1 t2 =− I (T1 u, T2 v)P (C1 u, C2 v | X) 0 0 × P (T1 ∈ du, T2 ∈ dv | T1 u, T2 v) −1 × P (T1 u, t2 v)P (C1 u, C2 v | T1 u, T2 v)
t1 ∧T1 t2 ∧T2 F (du, dv) =− . (u, v)2 F 0 0 Term (A.11): t1 t2 I (T1 ∈ du, T2 v, ∆1 = 1)P (T2 ∈ dv | T1 u, T2 v)
E −
X P (T1 u, T2 v) 0 0
t1 t2 2 v, ∆1 = 1) | X]P (T2 ∈ dv | T1 u, T2 v) E[I (T1 ∈ du, T =− P (T1 u, T2 v) 0 0
169
170
S. Kele¸s, M.J. van der Laan and J.M. Robins
=−
t1 0
t2 0
I (T1 ∈ du, T2 v)P (C1 u, C2 v | X)P (T2 ∈ dv | T1 u, T2 v) P (T1 u, T2 v)P (C1 u, c2 v | T1 u, T2 v)
I (T1 t1 , T2 v)P (T2 ∈ dv | T1 T1 , T2 v) =− P (T1 T1 , t2 v) 0
t2 ∧T2 F (T1 , dv) = −I (T1 t1 ) (T1 , v)2 F 0
v=t 2∧T2 1
= −I (T1 t1 )
(T1 , v) v=0 F t2
I (T1 t1 ) I (T1 t1 ) + (T1 , 0) F (T1 , t2 ∧ T2 ) F I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) I (T1 t1 ) =− − + . (T1 , T2 ) (T1 , t2 ) (T1 , 0) F F F
=−
Term (A.12): t1 t2 1 u, T 2 v P (T1 ∈ du | T1 u, t2 v)P (T2 ∈ dv | T1 u, T2 v) I T E 0 0 −1
X × P T1 u, T2 v =
t1 0
=
t2 0
t1 0
2 v | X P (T1 ∈ du | T1 u, T2 v) E I T1 u, T 1 u, T2 v −1 × P (T2 ∈ dv | T1 u, T2 v) P T
t2 0
I (T1 u, T2 v)P (C1 u, C2 v | X)P (T1 ∈ du | T1 u, T2 v) × P (T2 ∈ dv | T1 u, T2 v) −1 × P (T1 u, T2 v)P (C1 u, C2 v | T1 u, T2 v)
t1 ∧T1 t2 ∧T2
= 0
0
(du, v)F (u, dv) F . (u, v)3 F
Term (A.13): t1 t2 I (T1 u, T2 ∈ dv, ∆2 = 1)P (T1 ∈ du | T1 u, T2 v)
E −
X P (T1 u, T2 v) 0 0
t1 t2 E[I (T1 u, T2 ∈ dv, ∆2 = 1) | X]P (T1 ∈ du | T1 u, T2 v) =− P (T1 u, T2 v) 0 0
t1 t2 I (T1 u, T2 ∈ dv)P (C1 u, C2 v | X)P (T1 ∈ du | T1 u, T2 v) =− P (T1 u, T2 v)P (C1 u, C2 v | T1 u, T2 v) 0 0
t1 I (T1 u, T2 t2 )P (T1 ∈ du | T1 u, T2 T2 ) =− P (T1 u, T2 T2 ) 0
(du, T2 ) F (u, T2 )2 F 0
u=t 1∧T1 1
= −I (T2 t2 )
(u, T2 ) u=0 F = −I (T2 t2 )
t1 ∧T1
=−
I (T2 t2 ) I (T2 t2 ) + (T1 ∧ t1 , T2 ) (0, T2 ) F F
=−
I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) I (T2 t2 ) − + . (T1 , T2 ) (t1 , T2 ) (0, T2 ) F F F
Term (A.14) (same as the term (A.12)): t1 t2 1 u, T2 v P (T1 ∈ du | T1 u, T2 v)P (T2 ∈ dv | T1 u, T2 v) E I T 0 0
2 v −1 X × P T1 u, T =
t1 0
=
t2 0
t1
2 v | X P (T1 ∈ du | T1 u, T2 v) E I T1 u, T 2 v −1 × P (T2 ∈ dv | T1 u, T2 v) P T1 u, T
t2
I (T1 u, T2 v)P (C1 u, C2 v | X)P (T1 ∈ du | T1 u, T2 v) × P (T2 ∈ dv | T1 u, T2 v) −1 × P (T1 u, T2 v)P (C1 u, C2 v | T1 u, T2 v)
t1 ∧T1 t2 ∧T2 (u, dv) F (du, v)F = . (u, v)3 F 0 0 0
Note that
0
(du, dv) (du, v)F (u, dv) −F 1 1 1 F +2 . = 2 (u, v) (u, v) (u, v)3 du dv F F F
Then, the sum of the terms (A.10), (A.12), (A.14) equals
t1 ∧T1 t2 ∧T2 (du, v)F (u, dv) F (du, dv) F − +2 (u, v)2 (u, v)3 F F 0 0
t1 ∧T1 t2 ∧T2 1 1 1 = (u, v) du dv F 0 0
t1 ∧T1 1 1 1 − = (u, t2 ∧ T2 ) F (u, 0) du F 0 =
1 1 1 1 − − + (t1 ∧ T1 , t2 ∧ T2 ) F (t1 ∧ T1 , 0) F (0, t2 ∧ T2 ) F (0, 0) F
=
I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) + (T1 , T2 ) (t1 , T2 ) F F +
I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) + (T1 , t2 ) (t1 , t2 ) F F
−
I (T1 t1 ) I (T1 t1 ) I (T2 t2 ) I (T2 t2 ) − − − + 1. (T1 , 0) (t1 , 0) (0, T2 ) (0, t2 ) F F F F
Bringing all the terms together we obtain E IC(t1 , t2 ) | X I (T2 t2 ) I (T1 t1 , T2 t2 ) I (T1 t1 ) −1+ −1+ = F (t1 , t2 ) (T1 , T2 ) F (t1 , 0) F (0, t2 ) F −
I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) I (T1 t1 ) − + (T1 , T2 ) (T1 , t2 ) (T1 , 0) F F F
−
I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) I (T2 t2 ) − + (T1 , T2 ) (t1 , T2 ) (0, T2 ) F F F
+
I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) + (T1 , T2 ) (t1 , T2 ) F F
I (T1 t1 , T2 t2 ) I (T1 t1 , T2 t2 ) I (T1 t1 ) + − (T1 , t2 ) (t1 , t2 ) (T1 , 0) F F F I (T1 t1 ) I (T2 t2 ) I (T2 t2 ) − − − +1 (t1 , 0) (0, T2 ) (0, t2 ) F F F +
(t1 , t2 ). = I (T1 t1 , T2 t2 ) − F This completes the proof.
References Andersen, P., Borgan, Ø., Gill, R., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York. Bickel, P., Klaassen, C.J., Ritov, Y., Wellner, J. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore, MD. Clayton, D.G. (1978). A model for association in bivariate life tables and its application in epidemiological studies of a familiar tendency in chronic disease incidence. Biometrika 65, 141–151. Clayton, D.G., Cuzick, J. (1985). Multivariate generalizations of the proportional hazards model (with discussion). J. Roy. Statist. Soc. Ser. B 148, 82–117. Dabrowska, D. (1988). Kaplan–Meier estimate on the plane. Ann. Statist. 16, 1475–1489. Dabrowska, D. (1989). Kaplan–Meier estimate on the plane: Weak convergence, LIL and the bootstrap. J. Multivariate Anal. 29, 308–325. Genest, C., MacKay, R.J. (1986). The joy of copulas: Bivariate distributions with given marginals. Amer. Statist. 40, 280–283.
Genest, C., Rivest, L. (1993). Statistical inference procedures for bivariate Arhimedean Copulas. J. Amer. Statist. Assoc. 88, 1034–1043. Gill, R., van der Laan, M., Robins, J. (1997). Coarsening at random, characterizations, conjectures and counter examples. In: Proceedings of the First Seattle Symposium in Biostatistics 1995. Springer, New York, pp. 255–294. Gill, R., van der Laan, M., Wellner, J. (1995). Inefficient estimators of the bivariate survival function for three models. Ann. Inst. H. Poincaré Probab. Statist. 31 (3), 545–597. Heitjan, D., Rubin, D. (1991). Ignorability and coarse data. Ann. Statist. 19 (4), 2244–2253. Hougaard, P. (1987). Modelling multivariate survival. Scand. J. Statist. 14, 291–304. Jacobsen, M., Keiding, N. (1995). Coarsening at random in general sample spaces and random censoring in continuous time. Ann. Statist. 23, 774–786. Oakes, D. (1989). Bivariate survival models induced by frailties. J. Amer. Statist. Assoc. 84, 487–493. Prentice, R.L., Cai, J. (1992). Covariance and survivor function estimation using censored multivariate failure time data. Biometrika 79 (3), 495–512. Pruitt, R. (1991). Strong consistency of self-consistent estimators: general theory and an application to bivariate survival analysis. Technical Report, Nr. 543, University of Minnesota. Pruitt, R. (1993). Small sample comparisons of six bivariate survival curve estimators. J. Statist. Comput. Simulation 45, 147–167. Quale, C., van der Laan, M., Robins, J. (2001). Locally efficient estimation with bivariate right censored data. Technical Report, Nr. 99, Division of Biostatistics. Robins, J. (1989a). The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In: Sechrest, L., Freeman, H., Mulley, A. (Eds.), Health Service Research Methdology: A Focus on AIDS, NCHSR. US Public Health Service, Dordrecht, pp. 113– 159. Robins, J. (1989b). Errata to: “Addendum to: ‘A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect’ ” [Comput. Math. Appl. 14 (1987), no. 9–12, 923–945; MR 89a:92048]. Comput. Math. Appl. 18 (5), 477. Robins, J. (1992). Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika 79, 321–334. Robins, J. (1998). Marginal structural models versus structural nested models as tools for causal inference. In: Symposium on Prospects for a Common Sense Theory of Causation. AAAI Technical Report, Stanford, CA. Robins, J. (1999). Marginal structural models versus structural nested models as tools for causal inference. In: Halloran, M.E., Berry, D. (Eds.), Statistical Models in Epidemiology. In: The Environment and Clinical Trials, Vol. 116. Springer, New York, pp. 95–134. Robins, J., Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS Epidemiology, Methodological Issues. Birkhäuser, Basel. Robins, J., Rotnitzky, A., van der Laan, M.J. (2000). Comment on “On Profile Likelihood” by S.A. Murphy, A.W. van der Vaart. J. Amer. Statist. Assoc. Theory Methods 450, 431–435. Robins, J., Rotnitzky, A., Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 (427), 846–866. Sklar, A. (1959). Fonctions de répartition à n dimensionset leurs marges. Publ. Inst. Statist. Univ 8, 229–231. van der Laan, M. (1996). 
Efficient estimator of the bivariate survival function and repairing NPMLE. Ann. Statist. 24, 596–627. van der Laan, M. (1997). Nonparametric estimators of the bivariate survival function under random censoring. Statist. Neerlandica 51 (2), 178–200. van der Laan, M.J., Robins, J.M. (2002). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York. van der Laan, M., Hubbard, A., Robins, J. (2000). Locally efficient estimation of multivariate survival function in longitudinal studies. Technical Report, Nr. 82, Division of Biostatistics. van der Laan, M., Yu, Z. (2001). Comment on “Inference for semiparametric models: Some questions and an answer” by P.J. Bickel, J. Kwon. Statist. Sinica 11, 910–916. van der Vaart, A., Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer, New York.
9
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23009-1
Estimation of Semi-Markov Models with Right-Censored Data
Odile Pons
1. Introduction Multi-state processes describe the evolution of individuals between a finite number of states or systems exposed to several kinds of events. The data often consist in a large number of independent sample paths of such a process, for a variable observation period from a fixed entry time t = 0. The assumption of a non-homogeneous Markov jump process with a finite number of states is often used in biomedical applications when the transition intensities between states is conveniently specified in the time scale of the follow-up of individuals with a common entry time. The intensity of a direct transition from state j to state j at t only depends on t and on j and j but not on the previously visited states (Aalen (1978), Aalen and Johansen (1978), Andersen et al. (1991), Andersen et al. (1993), Andersen and Keiding (2000)). Another usual assumption is that of a Markov renewal process (or semi-Markov process) where the duration times between the occurrence of two events are supposed to be independent conditionally on the states, and the state variables follow a Markov chain model (Pyke (1961)). The observed process is characterized by the probabilities of a direct transition pjj between states j and j and the distributions F|jj of the durations between consecutive occurrences of j and j , for all pairs (j, j ) such that j is a state accessible from j . Lagakos et al. (1978) proposed a maximum likelihood estimator for the probabilities pjj and the distribution functions F|jj under right-censoring, when F|jj is a discrete function with a finite number of jumps. In non-parametric models for censored counting processes, Gill (1980), Voelkel and Crowley (1984) studied estimators of the sub-distribution functions Fj |j = pjj F|jj based on the estimated cumulative hazard function Λj |j for the transitions from j to j , associated with the sub-distributions Fj |j . Dabrowska et al. (1994), Dabrowska (1995) introduced the effect of time-dependent covariates in a Cox model for the hazard function Λj |j of a Markov renewal model and adapted the estimators to that setting. In this paper, we consider a general semi-Markov model with a finite and discrete state space and we present direct estimators for the transition probabilities and the cumulative hazard functions related to the distribution functions F|jj in a semi-parametric 175
maximum likelihood approach. First we consider a model in the absence of explanatory covariates, then we extend the approach to models with covariates. Section 2 concerns the model without covariate. We define estimators of the probabilities p_{jj'} and of the cumulative hazard functions Λ_{.|jj'} that generalize the Nelson–Aalen estimator for survival data, and we deduce a related Kaplan–Meier estimator for the distribution functions F_{.|jj'}. If a sojourn time in a state j is right-censored, the next state from j is unobserved, so we cannot define the set of individuals at risk for a transition from j to a given state j' as in the classical survival analysis setting, where j and j' are uniquely defined. In order to solve this problem, we introduce a weighted indicator of being at risk for this transition. The weights are defined as estimators of the probability of a transition to state j' conditionally on a censoring in j at the observed censoring times. The estimators of p_{jj'} and Λ_{.|jj'} are both based on this construction and we study their asymptotic properties in Section 3. In Huber et al. (2002), we compare the estimators for a general semi-Markov model to those for an independent competing risks model. In Section 4, we generalize this approach to semi-Markov models where the components of a covariate vector Z are supposed to have an influence on the transition probabilities and/or on the distributions of the durations involved in the process (Pons (2002b)). We assume that, for every pair (j, j') of consecutive states, the duration between the occurrences of the two states follows a proportional hazards model, while the initial probabilities and the transition probabilities are unspecified functions of the covariate. For every pair (j, j'), the proportional hazards model is therefore the classical Cox model for the distribution of the sojourn times conditional on a direct transition from j to j'. The unknown parameters of the model are the baseline hazard functions of transition, the regression coefficients on the covariate, the initial probabilities and the direct transition probabilities. We propose estimators for these parameters in a model with a discrete covariate. For every parameter value and adjoining states j and j', we define a Breslow-type estimator for the cumulative baseline hazard function of the sojourn times before transitions from j to j' (Breslow (1972)). As above, it involves weighted indicators of being at risk for these transitions, and the weights are now estimators of the probability of a transition to j' conditionally on the covariate value and on a censoring in j at the observed censoring times. Replacing the cumulative baseline hazard functions by their estimators in the likelihood, we obtain a relevant version of a partial likelihood (Cox (1972)) and we define estimators of the parameters β and p by its maximization. We used similar methods in other models for lifetime data in Pons and de Turckheim (1988), Pons and Visser (2000) and Pons (2000). We present the estimators and review their asymptotic properties.
2. Definition of the estimators

We consider a large number n of independent sample paths of a semi-Markov jump process in a finite state space S_m = {1, ..., m}. The i-th sample path of the process is defined by the sequence of the successive sojourn states J_i = (J_{i,k})_{k≥0}, where J_{i,0} is the initial state, and by the sequence of the transition times T_i = (T_{i,k})_{k≥0}, with T_{i,0} = 0 and
T_{i,k} is the arrival time in state J_{i,k}. The sojourn time in state J_{i,k−1} is X_{i,k} = T_{i,k} − T_{i,k−1} for k ≥ 1. We suppose that each sojourn time X_{i,k} in a transient state J_{i,k−1} may be right-censored by a random variable C_{i,k} independent of (T_{i,1}, ..., T_{i,k−1}) and (J_{i,0}, ..., J_{i,k−2}) but depending on J_{i,k−1}, k ≥ 1, i = 1, ..., n. The minimum X_{i,k} ∧ C_{i,k} and the indicator δ_{i,k} = 1{X_{i,k} ≤ C_{i,k}} are observable for k ≥ 1. Let K_i be the random number of uncensored transitions of the process (J_i, T_i) and let t_i be the last observed time. For individual i, we observe the sequence of the visited states (J_{i,0}, J_{i,1}, ..., J_{i,K_i}), where J_{i,K_i} is the final state after K_i jumps, and the sequence of the transition times (T_{i,1}, ..., T_{i,K_i}). The last observed duration time is

    X_i^* = t_i − \sum_{k=1}^{K_i} X_{i,k}.

Only the last sojourn time, in a transient state J_{i,K_i}, is actually censored; the last observed time t_i then equals T_{i,K_i} + C_{i,K_i+1} and the last duration X_i^* is the censoring variable, smaller than the unobserved sojourn time X_{i,K_i+1}. If J_{i,K_i} is a transient state then individual i is censored and we set δ_i = δ_{i,K_i+1} = 0; otherwise J_{i,K_i} is an absorbing state and we set X_{i,K_i+1} = 0 and δ_i = 1. We also set δ_{i,0} = 1. The successive sojourn times of an individual are independent conditionally on the states in the semi-Markov model, and their distributions are assumed to be absolutely continuous. Let J(j) denote the set of states accessible from a transient state j. The model is defined by the following components, for j in S_m and j' ≠ j in J(j):

– the distribution of the initial state,

    ρ_j = P(J_{i,0} = j),                                                          (1)

– the conditional distribution of the sojourn time elapsed between states j and j',

    F_{.|jj'}(x) = P(X_{i,k} ≤ x | J_{i,k−1} = j, J_{i,k} = j'),                    (2)

– the direct transition probabilities from state j to state j',

    p_{jj'} = P(J_{i,k} = j' | J_{i,k−1} = j).                                      (3)
By convention, p_{jj} = 1 for an absorbing state and p_{jj} = 0 otherwise. We denote F_{j'|j} = p_{jj'} F_{.|jj'}, and F_j = \sum_{j' \in J(j)} F_{j'|j} is the distribution of the sojourn time in j. The censoring variable of the sojourn times in a transient state j has a distribution function G_j, with G_j(x) = P(C_{i,k} ≤ x | J_{i,k−1} = j). We consider the conditional hazard functions of the duration times before a transition between specific states j and j',

    λ_{.|jj'}(x) = (d/dx) P(x ≤ X_{i,k} < x + dx | X_{i,k} ≥ x, J_{i,k−1} = j, J_{i,k} = j')
                 = f_{.|jj'}(x) / (1 − F_{.|jj'}(x)),                               (4)

where f_{.|jj'} is the density of F_{.|jj'}. The model is equivalently parametrized by the distribution functions F_{.|jj'} or by the cumulative baseline hazard functions Λ_{.|jj'} = \int_0^{·} λ_{.|jj'}(s) ds
through the relationships (4) and 1 − F_{.|jj'}(x) = exp{−Λ_{.|jj'}(x)}, so the unknown parameters of the model are the baseline hazard functions Λ_{.|jj'} and the probabilities ρ_j and p_{jj'}, under the constraints

    \sum_{j=1}^{m} ρ_j = 1,    \sum_{j' \in J(j)} p_{jj'} = 1    for every transient state j ∈ S_m,          (5)

where, by definition of the sets J(j), p_{jj'} > 0 for every j' ∈ J(j) and j ∈ S_m.
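To fix ideas, the following short Python sketch simulates right-censored sample paths from a small semi-Markov model of the type defined by (1)–(5). It is only an illustration: the three-state structure, the exponential sojourn distributions and the censoring rates are assumptions made for the example, not quantities taken from the model above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative three-state model: states 0 and 1 are transient, state 2 is absorbing.
    rho = np.array([0.6, 0.4, 0.0])            # initial distribution rho_j
    p = np.array([[0.0, 0.7, 0.3],             # transition probabilities p_{jj'}
                  [0.5, 0.0, 0.5],
                  [0.0, 0.0, 1.0]])
    sojourn_rate = {(0, 1): 1.0, (0, 2): 0.5,  # exponential rates of F_{.|jj'}
                    (1, 0): 0.8, (1, 2): 1.2}
    cens_rate = {0: 0.3, 1: 0.3}               # censoring rates in the transient states

    def simulate_path(rng):
        """Return (visited states, observed durations, censoring indicators) for one individual."""
        j = rng.choice(3, p=rho)
        states, times, deltas = [j], [], []
        while j != 2:                          # until the absorbing state is reached
            j_next = rng.choice(3, p=p[j])
            x = rng.exponential(1.0 / sojourn_rate[(j, j_next)])
            c = rng.exponential(1.0 / cens_rate[j])
            if x <= c:                         # uncensored transition
                states.append(j_next); times.append(x); deltas.append(1)
                j = j_next
            else:                              # sojourn right-censored: the path stops here
                times.append(c); deltas.append(0)
                break
        return states, times, deltas

    sample = [simulate_path(rng) for _ in range(5)]
    for s, x, d in sample:
        print("states:", s, " durations:", np.round(x, 2), " delta:", d)

Each simulated record has exactly the structure described above: the visited states (J_{i,0}, ..., J_{i,K_i}), the uncensored sojourn times, and possibly a final censored duration X_i^*.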
A survival function related to any distribution function F is denoted \bar F = 1 − F, and \bar F_{j'|j} = p_{jj'} − F_{j'|j}. The true values of the parameters and unknown functions are denoted ρ_j^0, p_{jj'}^0, Λ_{.|jj'}^0, F_{.|jj'}^0, F_j^0, F_{j'|j}^0, G_j^0. We observe the following indicator functions of the elementary events and their sums over all stages k of the process or all individuals i, for j' ≠ j in J(j):

    N_i^0(j) = 1{J_{i,0} = j},
    N_{i,k}(j, j') = δ_{i,k} 1{J_{i,k−1} = j, J_{i,k} = j'},
    N_i^c(j) = (1 − δ_i) 1{J_{i,K_i} = j},
    N_i(j, j') = \sum_{k≥1} N_{i,k}(j, j'),                     the number of transitions from j to j' for individual i,
    N_i(x, j, j') = \sum_{k≥1} N_{i,k}(j, j') 1{X_{i,k} ≤ x},   the number of durations smaller than x before a transition from j to j' for individual i,
    N_i^c(x, j) = N_i^c(j) 1{X_i^* ≤ x},                        the indicator of a censored duration smaller than x in j, for individual i,
    Y_i(x, j, j') = \sum_{k≥1} N_{i,k}(j, j') 1{X_{i,k} ≥ x},   the number of durations larger than x before a transition from j to j' for individual i,
    Y_i^c(x, j) = N_i^c(j) 1{X_i^* ≥ x},                        the indicator of a censored duration larger than x in j, for individual i.
We use similar notations without the index i, but with the argument n, for sums over the n individuals; thus N(j, j', n) = \sum_{i=1}^{n} N_i(j, j'), etc., N^{nc}(j, n) = \sum_{j' \in J(j)} N(j, j', n) is the total number of uncensored sojourn times in j, and Y^{nc}(x, j, n) = \sum_{j' \in J(j)} Y(x, j, j', n). We also denote by \bar N^c(j, j', n) the average number of censored durations in state j before a transition to j', and by \bar Y^c(x, j, j', n) the corresponding number for the durations larger
than x, i.e.,

    \bar N^c(j, j', n) = \sum_{i=1}^{n} N_i^c(j) \frac{\bar F_{j'|j}}{\bar F_j}(X_i^*)
                       = \sum_{i=1}^{n} N_i^c(j) P(J_{i,K_i+1} = j' | J_{i,K_i} = j, X_{i,K_i+1} ≥ x)|_{x=X_i^*},

    \bar Y^c(x, j, j', n) = \sum_{i=1}^{n} Y_i^c(x, j) \frac{\bar F_{j'|j}}{\bar F_j}(X_i^*)
                          = \sum_{i=1}^{n} Y_i^c(x, j) P(J_{i,K_i+1} = j' | J_{i,K_i} = j, X_{i,K_i+1} ≥ x)|_{x=X_i^*}.          (6)
The maximum observed duration before a transition from j to j' or a censoring in state j is denoted X(j, j', n) = max( max_i X_i^* 1{J_{i,K_i} = j}, max_i max_{k≥1} δ_{i,k} X_{i,k} 1{J_{i,k−1} = j, J_{i,k} = j'} ), and X(j, n) = max_{j' \in J(j)} X(j, j', n) is the maximum duration in j. The likelihood of the observed data is proportional to

    \prod_{j=1}^{m} \prod_{i=1}^{n} ρ_j^{N_i^0(j)} \Big\{ \prod_{k=1}^{K_i} \prod_{j' \in J(j)} f_{j'|j}(X_{i,k})^{N_{i,k}(j,j')} \Big\} \bar F_j(X_i^*)^{N_i^c(j)}
      = \prod_{j=1}^{m} \prod_{i=1}^{n} ρ_j^{N_i^0(j)} \Big\{ \prod_{k=1}^{K_i} \prod_{j' \in J(j)} \{ p_{jj'} λ_{.|jj'}(X_{i,k}) \bar F_{.|jj'}(X_{i,k}) \}^{N_{i,k}(j,j')} \Big\} \bar F_j(X_i^*)^{N_i^c(j)}.          (7)
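As a small illustration of how the likelihood (7) is used, the sketch below evaluates the log-likelihood of the transitions out of one transient state j in a parametric sub-model with exponential sojourn distributions, so that λ_{.|jj'}(x) = θ_{jj'} and Λ_{.|jj'}(x) = θ_{jj'} x. The data layout and the exponential parametrization are assumptions made only for this example.

    import numpy as np

    def loglik_from_state_j(transitions, censored, p_j, theta_j):
        """Log-likelihood of the sojourns observed in state j, exponential sojourn model.

        transitions: list of (j_next, x) pairs, one per uncensored sojourn in j
        censored:    list of censored sojourn times x* observed in j
        p_j:         dict j_next -> transition probability p_{jj'}
        theta_j:     dict j_next -> hazard rate of the sojourn before the transition j -> j_next
        """
        ll = 0.0
        for j_next, x in transitions:
            # contribution p_{jj'} * lambda_{.|jj'}(x) * exp(-Lambda_{.|jj'}(x))
            ll += np.log(p_j[j_next]) + np.log(theta_j[j_next]) - theta_j[j_next] * x
        for x_star in censored:
            # censored sojourn: sum_{j'} p_{jj'} * exp(-Lambda_{.|jj'}(x*))
            ll += np.log(sum(p_j[k] * np.exp(-theta_j[k] * x_star) for k in p_j))
        return ll

    # toy data: uncensored sojourns in state 0 ending in states 1 or 2, and two censored sojourns
    transitions = [(1, 0.4), (1, 1.1), (2, 0.7), (1, 0.2)]
    censored = [0.9, 1.5]
    print(loglik_from_state_j(transitions, censored,
                              p_j={1: 0.7, 2: 0.3}, theta_j={1: 1.0, 2: 0.5}))

In such a parametric sub-model the maximization over (p, θ) can be done by any standard numerical optimizer, which is the iterative approach mentioned below.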
The initial probabilities ρ_j are the parameters of a multinomial distribution and their maximum likelihood estimators \hat ρ_{n,j} = n^{-1} \sum_{i=1}^{n} N_i^0(j) are asymptotically Gaussian. The logarithm of the semi-parametric likelihood for the transitions starting from state j is

    l_n(j) = \sum_{j' \in J(j)} N(j, j', n) \log p_{jj'}
           + \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j' \in J(j)} N_{i,k}(j, j') \{ \log λ_{.|jj'}(X_{i,k}) − Λ_{.|jj'}(X_{i,k}) \}
           + \sum_{i=1}^{n} N_i^c(j) \log \Big( \sum_{j' \in J(j)} p_{jj'} e^{−Λ_{.|jj'}(X_i^*)} \Big).
In a parametric setup, numerical approximations of the maximum likelihood estimators are obtained by iterative procedures, but this approach requires a validation of the parametric assumption for the functions F_j and F_{j'|j}. In a non-parametric setup, the estimator of
Λ_{.|jj'} maximizing l_n(j) is a step function with jumps at the observed durations X_{i,k} and with value zero at 0. The parameters p_{jj'} and the values of the jumps of Λ_{.|jj'} at the X_{i,k} are solutions of the score equations under the constraint (5). For non-parametric cumulative hazard functions Λ_{.|jj'} there is no explicit solution of the score equations, but they allow us to characterize the maximum likelihood estimators of the transition probabilities p_{jj'} and of the functions Λ_{.|jj'}. From this characterization, solutions are approximated using non-parametric estimators of F_j and F_{j'|j}. So we define the estimators

    \hat Λ_{n,.|jj'}(x) = \int_0^x \frac{dN(y, j, j', n)}{Y(y, j, j', n) + \hat Y^c(y, j, j', n)},
    \hat p_{n,jj'} = \frac{N(j, j', n) + \hat N^c(j, j', n)}{N^{nc}(j, n) + N^c(j, n)},                    (8)

where the weighted counts

    \hat Y^c(y, j, j', n) = \sum_{i=1}^{n} Y_i^c(y, j) \frac{\hat{\bar F}_{n,j'|j}(X_i^*)}{\hat{\bar F}_{n,j}(X_i^*)}   and
    \hat N^c(j, j', n) = \sum_{i=1}^{n} N_i^c(j) \frac{\hat{\bar F}_{n,j'|j}(X_i^*)}{\hat{\bar F}_{n,j}(X_i^*)}

are estimators of \bar Y^c(y, j, j', n) and \bar N^c(j, j', n) given in (6). They involve the Kaplan–Meier estimator of the survival function \bar F_j and Gill's (1980) estimator of the sub-distribution functions F_{j'|j},

    \hat{\bar F}_{n,j}(x) = \prod_{i=1}^{n} \prod_{k=1}^{K_i} \Big( 1 − \frac{1}{Y(X_{i,k}, j, n) + Y^c(X_{i,k}, j, n)} \Big)^{N_{i,k}(j) 1\{X_{i,k} ≤ x\}},   x ≤ X(j, n),

    \hat F_{n,j'|j}(x) = \int_0^x \hat{\bar F}_{n,j}(y^-) \frac{dN(y, j, j', n)}{Y^{nc}(y, j, n) + Y^c(y, j, n)},   x ≤ X(j, j', n),                    (9)

where N_{i,k}(j) = δ_{i,k} 1{J_{i,k−1} = j},
F n,j n,j |j ing (8) is the following: In order to take into account the constraint (5), we denote by jr any reference index in J (j ). As pjjr = 1 − j ∈J (j ),j =jr pjj , the derivative of ln (j ) with respect to pjj under the constraints (5) is written
|jj − F
|jjr ∂ln (j ) N(j, j , n) N(j, jr , n) c F Xi∗ . = − + Ni (j )
∂pjj pjj pjjr Fj n
i=1
Multiplying this derivative by pjj and summing over j = jr in J (j ), the score equation becomes n
jr |j F N(j, jr , n) c 0 = N(j, n) − X∗ + Ni (j ) 1 −
j i pjjr pjjr F i=1
and this equation has a unique solution

    \tilde p_{n,jj_r} = \frac{N(j, j_r, n) + \bar N^c(j, j_r, n)}{N^{nc}(j, n) + N^c(j, n)}.

We deduce an estimator of p_{jj'} as an estimator of \tilde p_{n,jj'}, since the result does not depend on the arbitrary choice of j_r. Similarly, an estimator of Λ_{.|jj'} solving the score equation may be viewed as an estimator of

    \tilde Λ_{n,.|jj'}(x) = \int_0^x \frac{dN(y, j, j', n)}{Y(y, j, j', n) + \bar Y^c(y, j, j', n)},

and the estimators (8) are defined by replacing, in \bar N^c(j, j', n) and \bar Y^c(y, j, j', n), the unknown (sub-)distribution functions F_j and F_{j'|j} by their non-parametric estimators (9).
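The construction of (8)–(9) can be summarized in a short numerical sketch. The Python code below works with the sojourns observed in a single origin state j: it computes the Kaplan–Meier estimator of \bar F_j, Gill's estimator of the sub-distributions F_{j'|j}, forms a censoring weight estimating P(next state = j' | censored at x*), and uses it in the weighted at-risk counts of (8). The particular form chosen here for the weight, (\hat F_{n,j'|j}(τ) − \hat F_{n,j'|j}(x*)) / \hat{\bar F}_{n,j}(x*) with τ the largest observed duration, is an assumption of this sketch and should be read against the exact definition of the weighted counts above; ties and boundary cases are ignored.

    import numpy as np

    def semi_markov_estimates(x, delta, nxt, targets):
        """Sketch of the estimators (8)-(9) for the sojourns observed in one origin state j.

        x      : observed durations in j (censored or not)
        delta  : 1 if the sojourn ended with an observed transition, 0 if right-censored
        nxt    : next state for uncensored sojourns (ignored when delta == 0)
        targets: the states j' reachable from j
        Returns (p_hat, Lambda_hat); Lambda_hat[j'] maps each uncensored time to the
        estimated cumulative hazard of the sojourn before a transition j -> j'.
        """
        x, delta, nxt = np.asarray(x, float), np.asarray(delta, int), np.asarray(nxt)
        order = np.argsort(x)
        x, delta, nxt = x[order], delta[order], nxt[order]

        at_risk = lambda t: np.sum(x >= t)       # all durations (censored or not) >= t
        km_left = {}                             # Kaplan-Meier survival of the sojourn in j, at t-
        surv = 1.0
        for t, d in zip(x, delta):
            km_left[t] = surv
            if d == 1:
                surv *= 1.0 - 1.0 / at_risk(t)

        def surv_at(t):                          # Kaplan-Meier survival at t
            s = 1.0
            for u, d in zip(x, delta):
                if d == 1 and u <= t:
                    s *= 1.0 - 1.0 / at_risk(u)
            return s

        def subdist(jp, t):                      # Gill's estimator of F_{j'|j}(t), cf. (9)
            return sum(km_left[u] / at_risk(u)
                       for u, d, k in zip(x, delta, nxt) if d == 1 and k == jp and u <= t)

        tau = x.max()
        def weight(jp, xstar):                   # estimated P(next state = j' | censored at x*)
            s = surv_at(xstar)
            return (subdist(jp, tau) - subdist(jp, xstar)) / s if s > 0 else 0.0

        cens = x[delta == 0]
        n_unc, n_cens = np.sum(delta == 1), np.sum(delta == 0)
        p_hat, Lambda_hat = {}, {}
        for jp in targets:
            unc_jp = (delta == 1) & (nxt == jp)
            N_c_hat = sum(weight(jp, c) for c in cens)
            p_hat[jp] = (np.sum(unc_jp) + N_c_hat) / (n_unc + n_cens)   # cf. (8)
            Lam, cum = {}, 0.0
            for t in x[unc_jp]:                  # cumulative hazard with the weighted risk set
                Y = np.sum(x[unc_jp] >= t) + sum(weight(jp, c) for c in cens if c >= t)
                cum += 1.0 / Y
                Lam[t] = cum
            Lambda_hat[jp] = Lam
        return p_hat, Lambda_hat

    # toy call: durations in state j, censoring indicators, next states (-1 when censored)
    p_hat, Lam = semi_markov_estimates(
        x=[0.3, 0.5, 0.6, 0.9, 1.1, 1.4], delta=[1, 0, 1, 1, 0, 1],
        nxt=[1, -1, 2, 1, -1, 2], targets=[1, 2])
    print(p_hat)

Repeating the same computation for every transient state j gives the full collection of estimated transition probabilities and cumulative hazard functions.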
3. Asymptotic distribution of the estimators

The consistency of the estimators \hat Λ_{n,.|jj'} and \hat p_{n,jj'} is based on the uniform convergence
n,j |j on [0, X(n, j, j )], they were established by Gill
n,j on [0, X(n, j )] and of F of F −1 (X∗ )F (X∗ ) − F
−1 (X∗ )F
(X∗ )| Y c (j )|F (1983). We deduce that max |j z
i=1,...,n i
j |j z
i
i
j |j z i i |j z −1 c n {N (j, j , n) −
converges in probability to zero. The convergence to zero of c (t, j, j , n) − Y c (t, j, j , n)| follows from the next c (j, j , n)} and n−1 supt X(n,j ) |Y N lemma.
0 = F
0 G
0
0
0
0 Let H j ∈J (j ) Hj |j , and let τjj = sup{x: Hj j (x) > 0} jj j j j and Hj = 0
(x) > 0} be the upper limit of their support. Let and τj = sup{x: H j
πj0 =
m j0 =1
ρj00 1{j = j0 } +
k1 j1 ,...,jk−1
τj0 j1 0
τjk−1 j
··· 0
0
j0 · · · G G jk −1 0
× dFj01 |j0
· · · dFj0|jk−1
L EMMA 1. For i = 1, . . . , n, j ∈ Sm and j ∈ J (j ), x
j0 dF 0 , ENi (x, j, j ) = πj0 EYinc (x, j, j ) = πj0 G 0
EYic (x, j ) = πj0
τj x
j0 dG0j . F
j |j
τjj x
.
j0 dF 0 , G j |j
P ROOF. From the definitions of the counting processes, ENi (x, j, j ) E δi,k 1{Xi,k x, Ji,k−1 = j, Ji,k = j } = k1
=E
δi,k−1 1{Ji,k−1 = j }P (Xi,k x, δi,k = 1, Ji,k = j | Ji,k−1 = j )
k1
=
P (δi,k−1 = 1, Ji,k−1 = j ) 0
k1
x
0 dF 0 G j j |j
and the proof is similar for E Yinc (x, j, j ). The probability that the Ki + 1 sojourn time of individual i is censored and occurs in state j is given by P (Ji,Ki = j ) = 0 τj 0 0 c k0 P (Ki = k, Ji,k = j ) = πj 0 Fj dGj , which entails the results for EYi (x, j ). T HEOREM 1. If 0 < πj0 < ∞ and
τjj 0
0 )−1 dF 0 < ∞ for every j ∈ J (j ), (G j j |j
0 then the vector (n1/2 (pˆ n,jj − pjj ))j ∈J (j ) converges weakly to a Gaussian variable 2 N (0, Σj ), where the elements of the covariance matrix Σj2 are defined by
Σj2;j ,j =
1 {j =j } 0
2
πj +
0
1 πj0
τjj
0 dF 0 + G j j |j
τjj ∧τjj 0
0 F
0 F j |j j |j
0 F j
dG0j
0 0 0 0 1 − pjj − 1 − pjj 1 − pjj + 1 − pjj .
P ROOF. By Lemma 1, c (j, j , n) EN(j, j , n) + E N τ jj 0 0 0
= nπj Gj dFj |j + 0
τjj 0
0 0 0
Fj |j dGj = nπj0 pjj ,
EN (j, n) + EN (j, n) = nπj0 , nc
c
and we deduce from the uniform consistency of the Kaplan–Meier estimator that c (j, j , n)} and n−1 {N nc (j, n) + N c (j, n)} converge respectively n−1 {N(j, j , n) + N 0 0 0 in probability to πj pjj and πj0 as n → ∞. Then pˆn,jj is a consistent estimator of pjj 0 and n1/2 (pˆ n,jj − pjj ) can be written as 0 n1/2 pˆ n,jj − pjj
=
c (j, j , n) − E N c (j, j , n)} n1/2 {N(j, j , n) − EN(j, j , n) + N N nc (j, n) + N c (j, n) c (j, j , n) + n1/2 EN(j, j , n) + E N
1 1 − nc c nc N (j, n) + N (j, n) E{N (j, n) + N c (j, n)}
183
1 B1 (j, j , n) + B2 (j, j , n) − B3 (j, n) + op (1) 0 πj
with B1 (j, j , n) c (j, j , n) − E N c (j, j , n) , = n−1/2 N(j, j , n) − EN(j, j , n) + N B2 (j, j , n) c (j, j , n) − N c (j, j , n) = n−1/2 N = n−1
−F n
0 ) n1/2 (F Nic (j ) n,j j
n,j |j − F
0
0 − F n1/2 F Xi∗ , j |j j |j 0
∗
F F (X ) n,j
i=1
j
i
B3 (j, n) = n−1/2 N nc (j, n) − EN nc (j, n) + N c (j, n) − EN c (j, n) . n,j |j (x) = x {Y nc (y, j, n) + Y c (y, j, n)}−1 dN(y, j, j , n) be the non-paraLet Λ 0 x 0 −1 0
) dF , and metric estimator of the cumulative hazard function Λ0j |j (x) = 0 (F |j j |j 0 Λ 1/2 Λ Λ let Wn,j |j = n {Λn,j |j − Λj |j } and Wn,|j = j ∈J (j ) Wn,j |j . Using Eqs. (38), (39) in Gill (1980), we have on [0, X(n, j )] 0
−F
0 (x)
n,j (x) = F n1/2 F j j
x
0
−
F n,j Λ dWn,j ,
0 F j
0 (x)
−F
n1/2 F j |j j |j
x
= 0
− dW Λ −
F n,j |j n,j
0
x
xF − −
F n,|j 0 n,j Λ 0 Λ
dW + F (x) dWn,j , F n,j j |j 0
0 j |j
F F 0 j j (10)
therefore B2 (j, j , n) −1
=n
X∗ X∗ n
0 F i i Nic (j ) j |j − − Λ Λ
F n,j 0 dWn,j − F n,j dWn,j |j . (X∗ ) 0
F 0 F n,j
i=1
j
i
As in Gill (1980, 1983), the processes x∧X(n,j ) − dW Λ F
Wn,j (x) = F n,j n,j
and
0
F Wn,j |j (x) =
x∧X(n,j ) 0
− dW Λ
F n,j |j n,j
converge jointly to Gaussian processes WjF and WjF |j on [0, τj ) × [0, τjj ) under the τ 0 −1 0 and F ,
) dF < ∞. By the uniform convergence of F
condition 0 jj (G n,j n,j |j j j |j we deduce that X∗ X∗
0 F Nic (j ) i i j |j − − Λ Λ
max F n,j 0 dWn,j − F n,j dWn,j |j (X∗ ) 0
1in F
F 0 j n,j i
N c (j ) − 0i ∗
(X ) F
0 F j |j
dWjF
0 F j
0
i
j
Xi∗
− WjF |j
∗ P Xi → 0
and B2 (j, j , n) converges in probability to
B2 (j, j
τjj x
) = πj0
0
0
0 F j |j
dWjF
0 F j
− WjF |j (x)
dG0j (x).
Λ Writing Wn,j |j in terms of the empirical processes −1/2
n
0 N(x, j, j , n) − nπj
x 0
0 dF 0 G j j |j
0 (x)}, it follows that the vector ((B1 (j, j , and n−1/2 {Y nc (x, j, n) + Y c (x, j, n) − nπj0H j n), B2 (j, j , n))j ∈J (j ), B3 (j, n)) converges weakly to a Gaussian variable ((B1 (j, j ), B2 (j, j ))j ∈J (j ) , B3 (j )). The expression of its covariance matrix is deduced from the following covariances, C1 (j ; j , j ) = E B1 (j, j )B1 (j, j ) F
0
0 F 2 0 0 j |j j |j 0
j0 dF 0 + = πj0 1{j =j } G − πj0 pjj dG pjj , j j |j 0
F j
0
C3 (j ) = EB32 (j ) = πj 1 − πj0 , C13 (j ; j ) = E B1 (j, j )B3 (j ) = pjj πj0 1 − πj0 , and for functions a and b on R+ , τ τ x jj jj Λ πj0 E a(s) dWn,j (s) |j 0
converges to τ
jj
τjj x∧y
0
0
0
0
0
=2
0
τjj
0 (x) G j
y 0
Λ b(t) dWn,j (t) dG0j (x) dG0j (y) |j
0 −1 0
j ab H dΛj |j dG0j (x) dG0j (y)
x 0
0 −1 0 0
ab H dΛ j j |j dGj (x)
under integrability conditions, then C2 (j ; j , j ) = E B2 (j, j )B2 (j, j ) = 2πj0
j0 (x) G
0 F
0 F j |j j |j
x
0 H |j
0
−
x
0 F
0 F j j |j
0 H j
0
dΛ0j + 1{j =j } dΛ0j |j
x
−
x 0
0 )2 (F j dΛ0j |j
0 H j
0F
0 F j j |j
0 H j
0
dΛ0j |j
dG0j (x)
F
0 F
0
0
0 F d(F j |j j |j )
j0 dF 0 + G
j0 j |j j |j dFj0 + = πj0 1{j =j } G j |j
0 )2
0 (F F |j
|j
F
0
0 F j |j j |j 0
0 dF 0 + 1 + dG = πj0 1{j =j } G j j , j |j
0 F
j from an integration by parts, moreover E{B1 (j, j )B2 (j, j )} = 0 and E{B2 (j, j ) × Λ Λ are centered. As F
0 F
0 , the covariances C1 and B3 (j )} = 0 since Wn,j and Wn,j j j |j τjj |j0 −1 0
C2 are finite if 0 (Gj ) dFj < ∞. 0 ) is obtained by using An estimator of the variance Σj2;j ,j of n1/2 (pˆn,jj − pjj 0 0
the Kaplan–Meier estimator of Fj , the estimator of Fj |j given in (9) and the followτ 0
dFj |j and n−1 N c (x, j, n) for ing unbiased estimators: n−1 N(j, j , n) for πj0 0 jj G j x
0 dGj , hence πj0 0 F j
P ROPOSITION 1. The covariance Σj2;j ,j is estimated by 2 = Σ n,j ;j ,j
2 1{j =j } N(j, j , n) + n(πˆ n,j )2 +
1 πˆ n,j
X(n,j ) 0
F
F n,j |j n,j |j dN c (j, n) 2
F n,j
(1 − pˆn,jj ) + (1 − pˆ n,jj ) − (1 − pˆn,jj )(1 − pˆ n,jj ),
where πˆ n,j = n−1 {N nc (j, n) + N c (j, n)} and replacing pjj and pjj by their estimators. n,|jj − T HEOREM 2. Under the conditions of Theorem 1, the process {n1/2 (Λ 0 Λ|jj )}j ∈J (j ) converges weakly to a centered Gaussian process on every interval [0, τ ] τ
0 )−1 dΛ0 < ∞. such that τ < X(j, j , n) and (H 0
j |j
j |j
(x, j, j , n) = Y (x, j, j , n) + Y c (x, j, j , n), we have P ROOF. Let Y τ jj P
j0 dF 0 → 0, G sup n−1 Y (x, j, j , n) − πj0 j |j x∈[0,τjj )
x
sup x∈[0,τjj )
(x, j, j , n) − π 0 n−1 Y j
τjj x
P
0 dG0 → 0. F j j |j
From (10), P F
n,j (x) − F
0 (x) → 0
sup
j
x∈[0,X(j,n)]
and
P F
n,j |j (x) − F
0 (x) → 0, j |j
sup
x∈[0,X(j,j ,n)]
c (x, j, j , n) − Y (x, j, j , n)| → 0. therefore supx∈[0,X(j,j ,n)] n−1 |Y x
0 dΛ0 , the process Λ n,|jj satisfies, for x ∈ Since EN(x, j, j , n) = πj0 0 H j |j |jj [0, X(j, j , n)], P
n,|jj − Λ0 (x) n1/2 Λ |jj
x
= 0
n−1/2
x
− 0
x
= 0
d(N(y, j, j , n) − EN(y, j, j , n)) (y, j, j , n) n−1 Y 1
(y, j, j , n) n−1 Y dA1(y, j, j , n) −
0 (y) π 0H j
j |j
(y, j, j , n) − πj0 H
0 (y) dΛ0 (y) n−1/2 Y j |j |jj
x 0
A2 (y, j, j , n) + A3 (y, j, j , n) dΛ0|jj (y)
0 (y) π 0H
× 1 + op (1) ,
j
j |j
with A1 (y, j, j , n) = n−1/2 N(y, j, j , n) − EN(y, j, j , n) , c (y, j, j , n) A3 (y, j, j , n) = n−1/2 Y (y, j, j , n) − EY (y, j, j , n) + Y c (y, j, j , n) , − EY A2 (y, j, j , n) c (x, j, j , n) − Y c (x, j, j , n) = n−1/2 Y = n−1
X∗ X∗ n
0 F i i Yic (y, j ) − − j |j dW Λ Λ
dW − F F n,j n,j |j , n,j 0 n,j |j (X∗ ) 0
F 0 F i=1
n,j
i
j
using again (10). By the same arguments as in the proof of Theorem 1, the process ((A1 (j, j , n), A2 (j, j , n), A3 (j, j , n))(· ∧ X(j, j , n))j ∈J (j ) converges weakly to a centered Gaussian process ((A1 (j, j ), A2 (j, j ), A3 (j, j )))j ∈J (j ) on [0, τjj ).
n,|jj −Λ0 )(x1 ) and n1/2 (Λ n,|jj −Λ0 )(x2 ) The asymptotic covariance of n1/2 (Λ |jj1 |jj2 1 2 is equal to K(x1 , x2 ; j, j1 , j2 ) x1 x2 (K2 + K3 )(y1 , y2 ; j, j1 , j2 ) 1 dΛ0|jj (y1 ) dΛ0|jj (y2 ) = 0
0 (y2 )
0 (y1 )H 1 2 (πj )2 0 0 H j |j j |j 1
+
x1 ∧x2
1
2
x1
1−
1{j =j } πj0 1 2 0
y
dΛ0|jj 1
dΛ0
|jj1 (y) 0
H j1 |j
+ Λ0|jj (x1 )Λ0|jj (x2 ), 1
2
using the following covariances K1 (y1 , y2 ; j, j1 , j2 ) = E A1 (y1 , j, j1 )A1 (y2 , j, j2 ) y1 ∧y2 0 2 y1 0 0 0 0 0
= 1{j =j } πj Gj dFj |j − πj Gj dFj |j 1
2
1
0
0
0
= E A3 (y1 , j, j1 )A3 (y2 , j, j2 )
K3 (y1 , y2 ; j, j1 , j2 ) = 1{j =j } πj0 1 2
1
τjj
1
y1 ∨y2
0 dF 0 + π 0 G j j j |j
1
τjj ∧τjj 1
2
0 F
0 F j |j j |j 1
y1 ∨y2
0 F j
y2
2
0 dF 0 , G j j |j 2
0 dG j
2 0
0 (y2 ),
(y1 )H − πj0 H j |j j |j 1
2
K13 (y1 , y2 ; j, j1 , j2 ) = E
= 1{j1 =j2 } 1{y1 >y2 } πj0
A1 (y1 , j, j1 )A3 (y2 , j, j2 ) y1
y2
0
0 dF 0 − π 0 2 H
(y2 ) G j j j |j j |j 1
2
K2 (y1 , y2 ; j, j1 , j2 ) = E A2 (y1 , j, j1 )A2 (y2 , j, j2 ) = k2 (y1 , y2 ; j, j1 , j2 ) + k2 (y2 , y1 ; j, j1 , j2 ),
0
y1
0 dF 0 , G j j |j 1
where
k2 (y1 , y2 ; j, j1 , j2 ) = πj0
τjj ∧τjj 1
y1
2
0
Gj (u ∨ y2 ) 1{j1 =j2 }
u
−
0 F j |j 1
0
E A1 (y1 , j, j1 )A2 (y2 , j, j2 ) = 0,
0
Gj
u 0
uF
0 F
0
0 F j1 |j j2 |j j 0 dΛj |j + dΛ0j
0
0 1 G H 0 j j
dΛ0j |j − 2
u
0 F j |j 2
0
0 G j
dΛ0j |j dG0j (u), 1
E A2 (y1 , j, j )A3 (y2 , j ) = 0.
An asymptotically Gaussian confidence interval for Λ0|jj (x) can be built from the n,|jj − Λ0 )(x) given by consistent estimator of the variance of n1/2 (Λ |jj
n (x; j, j ) = K
x
2n + K 3n )(y1 , y2 ; j, j ) n2 (K n,|jj (y1 ) dΛ n,|jj (y2 ) dΛ (y2 , j, j , n) (y1 , j, j , n)Y 0 0 Y x x n,|jj (y) n dΛ n,|jj (x) 2 , 1− + Λ dΛn,|jj + Y (y, j, j , n) 0 y x
where 3n (y1 , y2 ; j ; j ) K = n−1 Y (y1 ∨ y2 , j, j , n) +
X(n,j ) y1 ∨y2
2
F n,j |j dN c (j, n) 2
F n,j
(y2 , j, j , n), (y1 , j, j , n)Y − n−2 Y 2n (y1 , y2 ; j, j ) = kˆ2n (y1 , y2 ; j, j ) + kˆ2n (y2 , y1 ; j, j ), K kˆ2n (y1 , y2 ; j, j ) X(n,j,j ) nc Y (u ∨ y2 , j, n) + Y c (u ∨ y2 , j, n) = n−1 (u ∨ y )F
y1 F n,j 2 n,j (u) u 1 × nc c 0 Y (j, n) + Y (j, n)
2 2 dΛ F dΛ c
dΛ n,j + F − 2 F × F n,j n,j |j n,j |j n,j |j dN (u, j, n). n,j n,j |j
0 is finally defined from Λ n,|jj as a An estimator of the distribution functions F |jj product-limit estimator
n,|jj (y) . (11) 1 − dΛ F n,|jj (x) = yx
From (11) and as in Gill (1983) for the Kaplan–Meier estimator, the product-limit
estimator F n,|jj has the representation 1/2
n
n,|jj − F 0 F |jj
0 F |jj
x
(x) = 0
−
F n,|jj (y ) 1/2 dΛn,|jj (y) − dΛ0|jj (y) . (12) n 0
F (y) |jj
By the same arguments as in Lemma 2.9 of Gill (1983), we deduce from the uniform 1/2 (Λ
n,|jj − Λ0 ) (Theorem 2) convergence of F n,|jj and the weak convergence of n |jj that the process defined by (12) on [0, X(n, j, j )] converges weakly to a centered Gaussian process on ]0, τjj ).
4. Generalization to models with covariates

We consider the same setting as in Section 2, with the additional observation of a discrete p-dimensional covariate vector Z_i, constant in time and having L possible values in the set Z = {z_1, ..., z_L}, for individual i. We assume that (Z_i)_{i≤n} is an i.i.d. sequence and that, for k > 1,

    P(X_{i,1} ≤ x_1, ..., X_{i,k} ≤ x_k | J_{i,0}, ..., J_{i,k}, Z_i) = P(X_{i,1} ≤ x_1 | J_{i,0}, J_{i,1}, Z_i) ... P(X_{i,k} ≤ x_k | J_{i,k−1}, J_{i,k}, Z_i).

Let J(j, z) be the set of states accessible from state j in the sub-population having the covariate value z. The model previously defined by (1), (2) and (3) is now conditional on the covariate value and it relies on

– the conditional distribution of the initial state, ρ_j(z) = P(J_{i,0} = j | Z_i = z),
– the conditional distribution of the time elapsed between states j and j', F_{.|jj'z}(x) = P(X_{i,k} ≤ x | J_{i,k−1} = j, J_{i,k} = j', Z_i = z),
– the direct transition probabilities from state j to state j', p_{jj'}(z) = P(J_{i,k} = j' | J_{i,k−1} = j, Z_i = z).

We assume that the conditional hazard functions of the duration times before a transition between specific states j and j' follow a Cox model,

    λ_{.|jj'z}(x, β_{jj'}) =
d P (x Xi,k x + dx | Xi,k x, Ji,k−1 = j, Ji,k = j , Zi = z) dx
= λ|jj (x)e
T z βjj
where βjj is a p-dimensional vector. The unknown parameters of the model are the baseline hazard functions λ|jj , the regression coefficients βjj and the probabilities ρj (z) and pjj (z) under the constraints m
ρj (z) = 1,
j =1
pjj (z) = 1,
for every transient state j ∈ Sm and z ∈ Z.
(13)
j ∈J (j,z) 0 (z) β 0 , λ0 , and Λ0 (t) = We denote the true parameter values by ρj0 (z), pjj jj |jj |jj t 0 0 (z), z ∈ Z, is λ (s) ds. The maximum likelihood estimator of ρ j 0 |jj
ρˆn,j (z) = n−1
n
Ni0 (j )1{Zi = z}
i=1
and it is asymptotically Gaussian.
0 (z)) 0 0 The parameter values pj0 = (pjj j ∈J (j,z),z∈Z and βj = (βjj )j ∈J (j ) are estimated by maximum likelihood in the semi-parametric model. Considering βj = (βjj )j ∈J (j ) as fixed, the cumulative hazard function Λ0jj is defined by x dN(y, j, j , n) n,|jj (x, βjj ) = Λ (14) (0) 0 Sn (y, j, j , βjj )
where Sn(0) (x, j, j , βjj ) =
n
e
T Z βjj i Yinc (x, j, j ) + Yic (x, j )aˆ n,i (j, j , Zi ) ,
i=1
(j, j , z)
and aˆ n,i is an estimator of the probability of transition to j conditionally on the covariate value and on a censoring in j at the censoring times Xi∗ , ai (j, j , z) = P (Ji,k = j | Xi,k > x, Ji,k−1 = j, Zi = z)|x=Xi∗ =
j |j z F Xi∗ .
|j z F
0
0 This estimator is obtained by replacing the functions F j |j z and Fj z by their nonparametric estimators conditional on the covariate. Among individuals having the covariate value z, let N(x, j, j , z, n) be the number of transitions from j to j after a sojourn time smaller than x in j , let Y (x, j, z, n) be the number of durations larger than x in j , N(x, j, j , z, n) =
n
Ni (x, j, j )1{Zi = z},
i=1
Y (x, j, z, n) =
n
Yi (x, j, j
) + Yic (x, j )
1{Zi = z}.
i=1 j ∈J (j,z)
Similarly to (9), F n,|j z is the product-limit estimator based on the durations in state j of individuals having the covariate value z, Ni,k (x,j,z) n 1
1− , F n,|j z (x) = Y (Xi,k , j, z, n) i=1 k1
j |j z is with Ni,k (x, j, z) = δi,k 1{Ji,k−1 = j }1{Xi,k x, Zi = z}. The estimator of F deduced as x (x) = 1 − − dN(y, j, j , z, n)
F F n,|j z (y ) n,j |j z Y (y, j, z, n) 0 and we define aˆ n,i (j, j , z) =
(X∗ )
F n,j |j z i ∗
(X ) F n,|j z
with the convention
0 0
i
= 0 in all the above expressions.
(·, β ) be the product-limit estimator of F
n|jj z (·, βjj ) associated with Let F n|jj z jj n,|jj (·, βjj ) given by (14), the estimator Λ βT z (x, β ) =
n,|jj (y, βjj ) 1 − e jj dΛ F n|jj z jj yx
β z e jj dN(y, j, j , n) . 1 − (0) = Sn (y, j, j , βjj ) yx T
|jj Zi (·, βjj ) by its n,|jj (·, βjj ) at Xi,k and F Replacing λ|jj (Xi,k ) by the mass of Λ estimator in the expression of the log-likelihood for the transitions starting from the state j , we obtain the partial likelihood l˜n (j, pj , βj ) n T = Ni (j, j ) log pjj (Zi ) + βjj Zi i=1
−
j ∈J (j )
Ki k=1
(0)
Ni,k (j, j ) log Sn (Xi,k , j, j , βjj ) − log F n,|jj Zi (Xi,k , βjj )
+ Nic (j ) log
p
jj
∗
(Zi )F n,|jj Zi Xi , βjj
.
(15)
j ∈J (j )
The parameters pj0 and βj0 are estimated by the values pˆ j and βˆnj that maximize l˜n (j ) n,|jj (·) = Λ n,|jj (·, βˆnj ). An estimator of the and the function Λ0|jj (·) is estimated by Λ
0 is then defined as the product-limit estimators related to Λ n,|jj , survival function F |jj dN(y, j, j , n)
. 1 − (0) F n,|jj (x) = Sn (y, j, j , βˆnjj ) yx In Pons (2002b), we give integrability and regularity conditions for studying the asymptotic behavior of our estimators as in Andersen and Gill (1982) and Pons and de Turckheim (1988). We prove that n1/2 (pˆnj − pj0 , βˆnj − βj0 ) converges weakly to a centered Gaussian variable under the constraints (13). On every fixed interval [0, τ ] n,|jj is uniformly consistent and n1/2 (Λ n,|jj −Λ0 ) conincluded in [0, X(n, j, j )[, Λ jj n,|jj is verges weakly to a centered Gaussian process; on the interval [0, X(n, j, j )], F n,|jj − Fn,|jj ) converges weakly a uniformly consistent estimator of Fn,|jj and n1/2 (F to a centered Gaussian process. This procedure is extended to the case of discrete and piece-wise constant covariate vectors Zi = (Zi,0 , . . . , Zi,Ki ), where Zi,k is constant on [Ti,k , Ti,k+1 [ with values in Z and such that P (Xi,1 x1 , . . . , Xi,k xk | Ji,0 , . . . , Ji,k , Zi,0 , . . . , Zi,k−1 ) = P (Xi,1 x1 | Ji,0 , Ji,1 , Zi,0 ) . . . P (Xi,k xk | Ji,k−1 , Ji,k , Zi,k−1 ).
The model is now defined by ρj (z) = P (Ji,0 = j | Zi,0 = z), F|jj z (x) = P (Xi,k x | Ji,k−1 = j, Ji,k = j , Zi,k−1 = z), pjj (z) = P (Ji,k = j | Ji,k−1 = j, Zi,k−1 = z), with an assumption of a proportional hazards model λ|jj z (x, βjj ) =
d P (x Xi,k x + dx | Xi,k x, Ji,k−1 = j, Ji,k = j , Zi,k−1 = z) dx
= λ|jj (x)e
T z βjj
.
We introduce the counting processes Ni,k (x, j, z) = δi,k 1{Ji,k−1 = j }1{Zi,k−1 = z}1{Xi,k x}, Ni,k (x, j, z)1{Ji,k = j }, Ni (x, j, j , z) =
k 1,
k1
Yi (x, j, j , z) =
Ni,k (j, j )1{Zi,k−1 = z}1{Xi,k x},
k1
Yic (x, j, z) = Nic (j )1{Zi,Ki = z}1 Xi∗ x , (16) c N(x, j, j , z, n), Y (x, j, j , z, n) and Y (x, j, z, n) are now defined by the sums of these
processes on the n individuals and Y (x, j, z, n) = Y (x, j, j , z, n) + Y c (x, j, z, n). The cumulative hazard function Λ0jj is estimated as in (14) but replacing the above expression of Sn (x, j, j , βjj ) by (0)
Sn(0) (x, j, j , βjj ) =
n
e
T Z βjj i,k−1
δi,k Ni,k (j, j )1{Xi,k x}
i=1 k1
+e
T Z βjj i,Ki
Yic (x, j )aˆ n,i (j, j , Zi,Ki )
where aˆ n,i (j, j , z) is defined as previously but using the processes (16) for the estima 0 and F
0 . Finally, the parameters p0 and β 0 are estimated by maximization tion of F jz j j j |j z of the partial likelihood l˜n (j, pj , βj ) K n i T = Ni,k (j, j ) log pjj (Zi,k−1 ) + βjj Zi,k−1 i=1 k=1 j ∈J (j )
− log Sn(0) (Xi,k , j, j , βjj ) + log F n,|jj Zi,k−1 (Xi,k , βjj ) ∗ c
. pjj (Zi,Ki )F n,|jj Zi,Ki Xi , βjj + Ni (j ) log j ∈J (j )
5. Discussion

Estimation of the distribution of semi-Markov processes from i.i.d. sample paths has been studied for a long time, and the estimators have been applied to the analysis of biomedical data where the individuals visit only a few states during their life. In that field the observations are often censored, and the methods we propose allow us to take the censored durations of the process into account, as in survival analysis. Multi-state processes may also be viewed as systems of multivariate failure times, and other approaches have been proposed for their analysis. Competing risks models are often used in biomedical applications; they suppose that an observed failure time is the smallest of the latent failure times due to all possible causes, under an assumption of independent latent failures or under a specific dependence model. The cause-specific hazard functions are identifiable and they are estimated by Gill's estimator given in (9). However, the joint distribution of the latent failure times is not identifiable unless they are supposed to be independent (Tsiatis (1975)). For a semi-Markov process with competing risks, F_{j'|j}(x) is interpreted as the probability that a sojourn time in state j is smaller than x and that all the other possible transition times from j to states different from j' are larger than the observed sojourn time in state j before the transition to j'. This assumption is not always relevant; in particular, it is not satisfactory if there exist states j and j' with few but very early transitions from j to j', so that the sojourn time in j is much shorter for transitions to j' than to other states although F_{j'|j}(x) is small. This situation occurs in some diseases, and our approach is less restrictive than the competing risks model assumption (Huber et al. (2002)). Here, we only considered a Cox model for the distributions of the sojourn times in the presence of discrete covariates having a finite number of values. It is obviously extended to the case of continuous covariates if we assume a parametric model {p_{jj'}(z, θ)}_θ for the transition probabilities and replace the conditional estimators \hat F_{n,.|jz} by the corresponding non-parametric kernel estimators (Dabrowska (1987)). The assumption of independent consecutive durations of the semi-Markov model was removed in Pons and Visser (2000) for a non-stationary Cox model with three consecutive states, where the baseline hazard function for the duration of interest depends on the duration variable and on the chronological entry time in the corresponding state. For example, in a longitudinal study for a disease without recovery, the patient's states are health, illness and death, and censoring may occur during the illness period. The hazard function for the transition from illness to death may depend on the illness duration x and on the entry time S in the illness state (or on the age at this time in an individual chronological time scale), as well as on explanatory time-dependent covariates Z, in the form

    λ_{X|S,Z}(x | S, Z) = λ_{X|S}(x; S) e^{β^T Z(S+x)}.
.
We defined a non-parametric estimator of the cumulative baseline hazard and an asymptotically efficient estimator of the regression coefficient vector using a kernel smoothing of the log-likelihood with respect to entry time S, and we studied their properties. The model is more general than the Markov renewal model and a goodness-of-fit test for the
hypothesis that the duration X is independent of the entry time S was proposed in Pons (2002a).
References

Aalen, O. (1978). Non-parametric inference for a family of counting processes. Ann. Statist. 6, 701–726.
Aalen, O., Johansen, S. (1978). An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scand. J. Statist. 5, 141–150.
Andersen, P., Borgan, O., Gill, R., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Andersen, P., Gill, R. (1982). Cox's regression model for counting processes: A large sample study. Ann. Statist. 10, 1100–1120.
Andersen, P., Hansen, L., Keiding, N. (1991). Non- and semi-parametric estimation of transition probabilities from censored observations of a non-homogeneous Markov process. Scand. J. Statist. 18, 153–167.
Andersen, P., Keiding, N. (2000). Event history analysis in continuous time. Technical Report, Department of Biostatistics, University of Copenhagen.
Breslow, N. (1972). Discussion of the paper by D.R. Cox. J. Roy. Statist. Soc. Ser. B 34, 216–217.
Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, 187–220.
Dabrowska, D. (1987). Nonparametric regression with censored survival time data. Scand. J. Statist. 14, 181–197.
Dabrowska, D. (1995). Estimation of transition probabilities and bootstrap in a semiparametric Markov renewal model. J. Nonparam. Statist. 5, 237–259.
Dabrowska, D., Sun, G., Horowitz, M. (1994). Cox regression in a Markov renewal model: an application to the analysis of bone marrow transplant data. J. Amer. Statist. Assoc. 89, 867–877.
Gill, R. (1980). Nonparametric estimation based on censored observations of a Markov renewal process. Z. Wahrsch. Verw. Gebiete 53, 97–116.
Gill, R. (1983). Large sample behaviour of the product-limit estimator on the whole line. Ann. Statist. 11, 49–58.
Huber, C., Pons, O., Heutte, N. (2002). Inference for a general semi-Markov model as an alternative to independent competing risks. Report, Université Paris 5.
Lagakos, S.W., Sommer, C., Zelen, M. (1978). Semi-Markov models for partially censored data. Biometrika 65, 311–317.
Pons, O., de Turckheim, E. (1988). Cox's periodic regression model. Ann. Statist. 16, 678–693.
Pons, O., Visser, M. (2000). A non-stationary Cox model. Scand. J. Statist. 24, 619–639.
Pons, O. (2000). Nonparametric estimation in a varying-coefficient Cox model. Math. Meth. Statist. 9, 376–398.
Pons, O. (2002a). Inference in extensions of the Cox model for heterogeneous populations. In: Huber-Carol, C., Balakrishnan, N., Nikulin, M.S., Mesbah, M. (Eds.), Goodness-of-Fit Tests and Validity of Models. Birkhäuser, Basel, pp. 211–225.
Pons, O. (2002b). A regression model for multi-state Markov renewal processes with censoring. In preparation.
Pyke, R. (1961). Markov renewal processes: Definitions and preliminary properties. Ann. Math. Statist. 32, 1231–1342.
Tsiatis, A. (1975). A nonidentifiability aspect of the problem of competing risks. Proc. Nat. Acad. Sci. USA 37, 20–22.
Voelkel, J., Crowley, J. (1984). Nonparametric inference for a class of semi-Markov processes with censored observations. Ann. Statist. 12, 142–160.
10
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23010-8
Nonparametric Bivariate Estimation with Randomly Truncated Observations
Ülkü Gürler
1. Introduction

In survival or reliability studies, it is common to have incomplete data due to the limited time span of the study or to dropouts of the subjects for various reasons. Random truncation and censoring are two common forms of such data. There has been extensive research on univariate and multivariate censored data, which we do not intend to review here. The focus of this paper is on the nonparametric estimation of the bivariate hazard and distribution functions (d.f.) when observations are subject to random truncation. Estimation of bivariate distribution and hazard functions is discussed in more detail and, as an application of the results on the bivariate d.f., bivariate kernel density estimation is presented. In the univariate random left truncation model, the variable of interest is Y and one observes the i.i.d. replicates (Y_i, T_i), i = 1, ..., n, of (Y, T), where T is another random variable, independent of Y. Here T is called the truncating variable, which prevents the complete observation of Y. Random right truncation is defined similarly by reversing the inequality. Random truncation models are conveniently used to model several aspects of AIDS data, such as the incubation time, defined as the time from infection to the onset of the disease, or the time from the onset of AIDS until death; the time from infection to seroconversion; or, in insurance applications, the reporting lag, which is the time between the occurrence of an accident and its reporting to the insurance company. In AIDS data, for instance, suppose the observation period starts at some time point T_o. Let a be the time a person is diagnosed with AIDS and d be the time of death. Setting Y = d − a and T = T_o − a, left truncation occurs since only those individuals for whom T ≤ Y are observed. For insurance applications see Kalbfleisch and Lawless (1989). One of the earliest examples of the random left truncation model was given by Lynden-Bell (1971) with an application in astronomy, where Y refers to the brightness of celestial objects, which is only partially observable due to a preventing variable T. In this study, Lynden-Bell introduced the nonparametric maximum likelihood estimators (NPMLE) of the distribution functions of Y and T. We now present some results for the univariate truncation model. Let Y be the variable of interest and T be the truncating variable with d.f.'s F and G, respectively. The
pairs (Y, T) are observed only if Y_i ≥ T_i, i = 1, ..., n. The NPMLE of \bar F = 1 − F and of G given by Lynden-Bell (1971) are:

    \bar F_n(y) = \prod_{i: Y_i ≤ y} \Big( 1 − \frac{s(Y_i)}{n C_n(Y_i)} \Big),                    (1)

    G_n(t) = \prod_{i: T_i > t} \Big( 1 − \frac{r(T_i)}{n C_n(T_i)} \Big),                         (2)

where

    n C_n(u) = #\{i: T_i ≤ u ≤ Y_i\},   s(u) = #\{i: Y_i = u\},   r(u) = #\{i: T_i = u\}.          (3)
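For concreteness, a direct transcription of (1)–(3) in Python is given below; the data are a small invented left-truncated sample and the function returns the estimated distribution function F_n = 1 − \bar F_n.

    import numpy as np

    def lynden_bell(y, t):
        """NPMLE of F for left-truncated data: (y_i, t_i) observed only when y_i >= t_i."""
        y, t = np.asarray(y, float), np.asarray(t, float)
        n = len(y)
        def C(u):                  # C_n(u): proportion of pairs with t_i <= u <= y_i
            return np.mean((t <= u) & (u <= y))
        uniq = np.unique(y)
        def F(z):                  # F_n(z) = 1 - prod_{Y_i <= z} (1 - s(Y_i) / (n C_n(Y_i)))
            prod = 1.0
            for u in uniq[uniq <= z]:
                s = np.sum(y == u)
                prod *= 1.0 - s / (n * C(u))
            return 1.0 - prod
        return F

    # toy left-truncated sample (all pairs satisfy y >= t)
    y = np.array([1.2, 0.8, 2.5, 1.9, 3.1, 0.9])
    t = np.array([0.5, 0.1, 1.0, 0.3, 2.0, 0.2])
    F_n = lynden_bell(y, t)
    print([round(F_n(z), 3) for z in (1.0, 2.0, 3.0)])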
Note that Cn is the estimator of C given by Y (z−) = G∗T (z) − FY∗ (z) C(z) = α −1 G(z)F G∗ (z)
(4)
F ∗ (z)
where and are the distribution functions of the observed T and Y respectively under random truncation model, α = P (Y T ) is the probability that the pair (Y, T ) can be observed, and nCn (z) is the size of the ‘risk set’ at time z. Consistency of Fn and its right truncation counterpart are studied by Woodroofe (1985) and Wang et al. (1986). Chao and Lo (1988) derived a representation of (Fn − F ) as i.i.d. mean processes. The order of the remainder term for this representation is improved by Stute (1993) and Gijbels and Wang (1993). We summarize below the existing results concerning Fn (y). For i = 1, . . . , n, define z I (Ti u Yi ) I (Yi z) − dF (u) Li (z) = α (Yi ) 2 (u) G(Yi )F G(u)F 0 and n (z) = L
n
n−1 Li (z).
i=1
The following representation is given by Chao and Lo (1988), −1
Fn (z) − F (z) = n
F (z)
n
(z)L n (z) + εn (z). Li (z) + εn (z) = F
i=1
For any d.f. F , let aF and bF denote the lower and upper end points of the support of F . If F is continuous and b < bF , then the behavior of the remainder term is investigated by several authors as stated below: (Chao and Lo (1988)). Let aG = aF . If limx→0+ F (x)/G(x) = 0 and dF /G < ∞, then
sup εn (z) = o n−1/2 a.s. 0zb
(Gijbels and Wang (1993)). Let aG < aF . Then sup εn (z) = O(log n/n) a.s. 0zb
(Stute (1993)). Let aG aF . If dF /G2 < ∞, then
sup εn (z) = O log3 n/n a.s. aFY zb
√ C OROLLARY 1. The process Wn (z) = n (Fn (z) − F (z)) converges weakly to a zero mean Gaussian process on D[0, b], with covariance structure, where z1 ∧ z2 = min(z1 , z2 ) z1 ∧z2
F (du) (z1 )F (z2 ) . Cov Wn (z1 ), Wn (z2 ) = α F 2 (u) G(u) F 0 Regarding the strong uniform consistency of the NPMLE, the following results are available in the literature. Let Fo (y) = P (Y y | Y aG ), where Fo = F if aG aF . Analogously, define Go (t) = P (T y | T bF ). (Chen et al. (1995)). Suppose F and G are continuous, aF = aG = 0. Then n (x) − F (x) → 0 a.s.; sup F x
He and Yang (1988) generalize this result by relaxing the continuity assumptions on F and G. (He and Yang (1988)). Suppose aF = aG = 0. Then sup Fn (x) − Fo (x) → 0 a.s. x
They also establish the strong law under random truncation for the univariate model as an extension of similar results of Stute and Wang (1993) for the random censoring case. In particular, they show that for any nonnegative measurable function φ it holds that lim φ(x) dFn (x) = φ(x) dFo (x) a.s. n→∞
He (1988) extends the strong consistency results to the left truncated and right censored data and He and Yang (1988) consider the estimation of the truncation probability α. Kernel estimators of the hazard function for truncated/censored data are studied by Uzuno˘gulları and Wang (1992). Gürler et al. (1993), Gu and Lai (1990), Lai and Ying (1991), and Gross and Huber-Carol (1992) extended the results for truncated/censored data in various directions. 2. Estimation of the bivariate distribution function For the bivariate truncation model, the existing literature is very limited and mostly refers to the case when only one component of the bivariate vector is subject to truncation. An exception to this assumption is the study by van der Laan (1996),
where he provides the NPMLE for the bivariate d.f. where both components are randomly truncated. We start with singly truncated bivariate model as discussed in Gürler (1996, 1997). Consider the bivariate truncation model, where one observes the triplets (Yi , Xi , Ti ), i = 1, . . . , n only if (Yi Ti ). If the variable of interest is T , that is if right truncation occurs, then this model is applicable to the transfusion related AIDS data described above, where Y is the incubation time truncated due to the limited time window of the study and X can be considered as a covariate such as the age of the patient at the time of infection by HIV. Application to this particular data type is provided in Gürler (1996). Suppose the purpose is to estimate the bivariate d.f. F (y, x) of (Y, X), where T is assumed to be independent of (Y, X), with d.f. G. The marginal d.f.’s of Y and X are denoted by FY and FX respectively. For the identifiability of the model, it is assumed that aFY aG and bFY bG , as in the univariate model (see Woodroofe (1985)). The observations come from the following trivariate distribution H : HY,X,T (y, x, t) = P (Y y, X x, T t | Y T ) x y = α−1 G(t ∧ u) dF (u, v) 0
0
where α is as defined before. The observed (Y, X) pairs have the following distribution: x y ∗ −1 G(u) dF (u, v). F (y, x) ≡ HY,X,T (y, x, ∞) = α 0
0
The estimator suggested by Gürler (1997) for F (y, x) is Fn (y, x) =
Y,n (Yi −) 1F I (Yi y, Xi x), n Cn (Yi )
(5)
i
Y,n and Cn are as given in (1) and (3). where F Fn (y, x) reduces to the product limit estimator (1) of F (y) when x → ∞, and if no truncation is present, it is easy to verify that it is the bivariate empirical distribution function. Also, note that Fn (y, x) is a weighted sum of i.i.d. variables, where the weights are the jump sizes of the truncation product limit estimator Fn (y) at the data points. Therefore theoretical properties of this estimator are strongly related to those of FY,n (y). The following results are due to Gürler (1997) and Gürler and Prewitt (2000): T HEOREM 1. Assume F is continuous in both components, b < bFY and let Tb = {(y, x): 0 < y < b; 0 < x < ∞}. Then Fn (y, x) admits the following representation: Fn (y, x) − F (y, x) y
F (u) ∗ Fn (du, x) − F ∗ (du, x) = 0 C(u) y
F (u) n (u)C(u) F ∗ (du, x) + Rn (y, x) C(u) − Cn (u) + L + 2 0 C (u) ≡ ξ¯n (y, x) + Rn (y, x)
(6)
where (i) if aG < aFY , then – (Gürler (1997)). sup(y,x)∈Tb |Rn (y, x)| = O(log2 n/n), – (Gürler and Prewitt (2000)). sup(y,x)∈Tb |Rn (y, x)| = O(log n/n); (ii) if aG = aFY , and G−2 (u)FY (du) < ∞, – (Gürler (1997)). sup(y,x)∈Tb |Rn (y, x)| = O(log3 n/n) = o(n−1/2 ). The magnitude of the remainder term for part (ii) derives from the result of Stute (1993) and therefore the integrability condition here is more restrictive than that of Chao and Lo (1988). Theorem 2 can also be utilized to establish the weak convergence of Fn (y, x). Define the processes:
√ √ n (y) = n Cn (y) − C(y) , n (y, x) = n Fn (y, x) − F (y, x) , C F
√ ∗ √ ∗ n (y). n (y, x) = n FY,X,n (y, x) − FY,X (y, x) , Ln (y) = n L W n (y, x), Theorem 1, SLLN and functional LIL Regarding the weak convergence of F yields the following result: C OROLLARY 2. Under the assumptions of part (ii) of Theorem 1, (a) For (y, x) ∈ Tb Fn (y, x) → F (y, x) a.s. (b) sup(y,x)∈Tb |Fn (y, x) − F (y, x)| = O((log n/n)1/2 ). n (y, x) (c) Suppose the conditions of part (ii) of Theorem 1 hold. Then for (y, x) ∈ Tb , F converges weakly to a mean zero, two dimensional time Gaussian process on (D2 , d). The covariance structure of the limiting process is highly complicated but the main ingredients to obtain the covariance function are derived in Gürler (1997). For practical n (y, x): interest, we provide the asymptotic variance σ 2 (y, x) of F y σ 2 (y, x) = A(u)F (du, x) −2 where A(u) =
(u) F C(u) .
y
F (y, x) − F (u, x)
1 − b(u) F (du, x) C(u)
An estimator for the above variance can be obtained as follows:
Estimate A(u) by An (u) = is
Y,n (u−) F Cn (u)
and note that the jump size of FY,X,n (u, v) at (Yi , x)
Y,n (Yi −) F I (X[i] x) = An (Yi )I (Xi x). Cn (Yi )
Let V1,n (y, x) = n−1
A2n (Yi )
i: Yi y,Xi x
and V2,n (y, x) = n−1
i: Yi y, Xi x
An (Yi ) FY,X,n (y, x) − FY,X,n (Yi , x)
× 1/Cn (Yi ) − bn (Yi ) ,
where bn (u) =
n I (Yi u) . Cn2 (Yi ) i=1
Then, an estimator of the asymptotic variance is σn2 (y, x) = V1,n (y, x) − 2V2,n (y, x). An alternative estimator As mentioned above, van der Laan (1996) derives the NPMLE of the bivariate distribution function when both components are subject to random truncation. Let S(y, x) = 1 − F (y−, ∞) − F (∞, x) + F (y−, x−)
(7)
be the bivariate survival function of (Y, X), S ∗ (y, x) be the corresponding one for the observed truncated (Y, X) and Sn (y, x) and Sn∗ (y, x) be their empirical counterparts. Then, employing the efficient score operator, van der Laan comes up with the following estimating equation: y dG∗n (u) Sn (dy, dx) (8) = Sn∗ (dy, dx). Sn (u, 0) Integrating both sides over x, one obtains y dG∗n (u) = Sn∗ (dy, 0), Sn (dy, 0) Sn (u, 0) which is the efficient score equation for the univariate truncated data. Hence Sn (dy, 0) is the NPMLE for the univariate case. This observation and the estimating equation (8) now suggests the following estimator for the bivariate d.f. 1 I (Yi y, Xi x) . n −1 n n j =1 I (Tj Yi )/Fn (Tj ) n
Fn (y, x) =
(9)
i=1
The above estimator is proven to be asymptotically efficient. In personal correspondence it is indicated by van der Laan that his estimator and the one suggested by Gürler are asymptotically equivalent, which indicates that the later one is also asymptotically efficient. This equivalence is also observed in the simulation studies conducted in
Gürler and Kele¸s (1998). Since singly truncated data is a special case of doubly truncated one, we will present the asymptotic results regarding the above estimator when we consider the doubly truncated data. Right truncation Although the random right truncation is defined by simply reversing the inequality in the observational constraint, especially the corresponding hazard function is not straightforward to obtain from the left truncation model. We therefore review the results for the right truncation case below. In the univariate right truncation, one observes the pairs (Yi , Ti ), i = 1, . . . , n only if Yi Ti . The NPML estimators for F and G are given as
1 − s(Yi )/nCn (Yi ) , FY,n (y) = i: Yi >y
1 − r(Ti )/nCn (Ti ) ,
Gn (t) = 1 −
i: Ti t
where nCn (u) = #{i: Yi u Ti }. Gürler (1996) provides the following estimator for the bivariate d.f., 1 FY,n (Yi ) I (Yi y, Xi x). FY,n (y, x) = n Cn (Yi ) i
Modifying the previous definitions of L as follows, I (Yi > z) I (Yi u Ti ) ∗ − F (du) Li,n (z) = C(Yi ) C 2 (u) z and n (z) = 1 Li,n (z), L n i
A(z) =
F (z) , C(z)
the i.i.d. representation and asymptotic behavior of the above estimator is obtained as in the left truncation case, for which we omit the details. Doubly truncated data Suppose now both components of the bivariate vector (Y, X) are randomly truncated. For ease of notation here we denote the bivariate vector of interest as Y = (Y1 , Y2 ) and assume that Y is truncated by T = (T1 , T2 ), so that one only observes (Y1 , Y2 , T1 , T2 ) for which (Yi Ti ), i = 1, 2 and left truncation occurs. We simply write (Y, T ) is observed only if (Y T ). Again, for right truncation, the inequality is reversed. Assume that the bivariate d.f. of Y is F and that of T is G. The estimating equation of van der Laan (1996), derived from the efficient score operator now becomes y1 y2 dG∗n (u1 , u2 ) Sn (dy1 , dy2 ) = Sn∗ (dy1 , dy2 ) Sn (u1 , u2 )
where G∗n is the bivariate empirical d.f. of the observed T and Sn∗ is the empirical survival function of the observed Y . Integration with respect to y = (y1 , y2 ), yields an expression for estimating the bivariate d.f. as y1 y2 Fn∗ (du1 , du2 ) u1 u2 Fn (y1 , y2 ) = (10) . G∗n (du1 , du2 )/Sn (u1 , u2 ) The above relation unfortunately does not provide an explicit expression for Fn as in the singly truncated case. However, van der Laan (1996) provided an iterative algorithm to find the jumps of the survival function as Snk+1 (dy1, y2 ) = y1 y2
Sn∗ (dy1 , dy2 ) G∗n (du1 , du2 )/Snk (u1 , u2 )
where the initial estimator Sn0 (y1 , y2 ) is so chosen that it puts mass only to the observed Y vector and the iteration continues until convergence is established. Consistency of Fn is established by the following result of van der Laan (1996). T HEOREM 2. Let τ = (τ1 , τ2 ) such that S(τ ) > δ > 0 and assume thatF = Fd + Fc , τ where Fd is purely discrete and Fc is continuous. Moreover, assume that dF /G < ∞ and that dG/dF is uniformly bounded on [0, τ ]. Then Fn is uniformly consistent on [0, τ ]. Asymptotic normality and efficiency of Fn are also proved under a tail assumption at the origin which is an empirical analogue of dF /G < ∞ and √ construction of confidence bounds is discussed. In particular, it is established that n(Fn − F ) is asymptotically normal with mean zero and variance equal to the variance of IC∗ , which is a linear function of the influence curve. An iterative algorithm is provided to obtain an estimator of the asymptotic variance. Alternatively, it is noted that a bootstrap estimator of the variance could be employed.
3. Estimation of bivariate hazard In the univariate case, it is well known that the hazard function is expressed in terms of the d.f. in a unique way. For the bivariate case however, there is not a single definition of the hazard function (see, e.g., Basu (1971), Marshall (1975), Cox (1972), Johnson and Kotz (1975)). For the censored data, Dabrowska (1988) provides a nice representation of a bivariate survival function in terms of cumulative hazard function which is a vector of three components that correspond to double and single failures. Truncated data imposes different relations between the bivariate distribution function and bivariate hazard dynamics. In particular, as will be discussed in more detail below, for the left truncated data the ‘diverse hazard’ turns out to be a natural quantity to consider, whereas for right truncated data the ‘reverse hazard’ arises as a convenient function to describe the bivariate d.f. The ‘diverse hazard’ will be considered first for the left truncation model. This development is an extension of Dabrowska (1988) for the censored observations. First
introduce the following notation. For a bivariate function φ(· , ·), possibly with discontinuities at finite number of points, for the argument u let du correspond to the partial derivative with respect to u if it is differentiable at that point, and otherwise to denote the difference operator. That is, if φ(u, v) is left-continuous at u and continuous and differentiable at v, then φ(du, v) = φ(u+, v) − φ(u, v) and φ(du, dv) will stand for ∂ ∂v [φ(u+, v) − φ(u, v)]. (u, v) = P (Y u, X v). The bivariate ‘diverse hazard’ vector Λ(u, v) is Let F defined as follows
v) = Λ12 (du, dv), Λ1 (du, v), Λ2 (u, dv) , Λ(u, where Λ12 (du, dv) ≡ − Λ1 (du, v) ≡ −
(du, dv) F , (u, v) F
(du, v) F , (u, v) F
and Λ2 (u, dv) ≡
(u, dv) F . (u, v) F
v) corresponds to the failures of both components at (u, v−), The first member of Λ(u, given that the first one is still alive at u−, while the second is known to have failed at v. In other words, it describes the conditional probability of double failures, the first in the immediate present and the second in the immediate past. The other two components have similar interpretations, which explains the term ‘diverse’. Define C2 (y, x) = HT ,X (y, x) − F ∗ (y−, x) (y, x). = α −1 G(y)F Then the ‘diverse hazard’ is written as ∗ F (du, dv) F ∗ (du, v) C2 (u, dv) , , . Λ(y, x) = C2 (u, v) C2 (u, v) C2 (u, v) Hence the following estimators are obtained: Λ12,n (du, dv) =
Fn∗ (du, dv) , C2,n (u, v)
Λ1,n (du, v) =
Fn∗ (du, v) , C2,n (u, v)
Λ2,n (u, dv) =
Cn (u, dv) , C2,n (u, v)
where nC2,n (u, v) = #{i: Ti u Yi , Xi v}
(11)
is the size of the risk set at (y, x) with respect to ‘diverse hazard’ setup, and Fn∗ (y, x) is the empirical d.f. of observed (Y, X) pairs. y
Λ(y, x) = (12) λ1 (du, v)λ2 (u, dv) − λ12 (du, dv) . x
Gürler (1997) provides an i.i.d. representation for the estimator Λ_n(y, x), obtained by replacing the components of the ‘diverse hazard’ vector in expression (12) by their estimators, and shows that the remainder term is of order O(log³ n / n).

Reverse hazard with right truncation

As noted before, the truncation dynamics for right truncation call for a different bivariate hazard vector, which we term the ‘reverse hazard’; it turns out to be the natural quantity to estimate under right truncation and is defined as

\Lambda(y, x) = \left( \frac{F(dy, dx)}{F(y, x)},\ \frac{F(dy, x)}{F(y, x)},\ \frac{F(y, dx)}{F(y, x)} \right)
\equiv \bigl( \Lambda_{12}(dy, dx),\ \Lambda_{1}(dy, x),\ \Lambda_{2}(y, dx) \bigr).

For the right truncation model, modify the definition of C_2 as C_2(y, x) = α^{-1} G(y) F(y, x). Then it can be shown that

\Lambda_{12}(du, dv) = \frac{F^{*}_{Y,X}(du, dv)}{C_2(u, v)}, \qquad
\Lambda_{1}(du, v) = \frac{F^{*}_{Y,X}(du, v)}{C_2(u, v)}, \qquad
\Lambda_{2}(u, dv) = \frac{C_2(u, dv)}{C_2(u, v)}.

The components of the above vector describe the single or double immediate-past failure probabilities, given that both components have already failed at the point (y, x). See also Lagakos et al. (1988) and Gross and Huber-Carol (1992) for the reverse hazard. A natural estimate of the reverse hazard vector is then obtained by replacing F^{*} and C_2 by their empirical counterparts, with

n C_{2,n}(u, v) = \#\{ i:\ Y_i \le u \le T_i,\ X_i \le v \}.

Letting R(y, x) = − log F(y, x), we have

F(y, x) = F_X(x)\, F_Y(y)\, \exp\{ -\Lambda(y, x) \},   (13)

where the integrated hazard Λ(y, x) is defined as

\Lambda(y, x) = \int_{y}^{b_{F_Y}} \int_{x}^{b_{F_X}} R(du, dv)
= \int_{y}^{b_{F_Y}} \int_{x}^{b_{F_X}} \bigl[ \Lambda_{12}(du, dv) - \Lambda_{1}(du, v)\,\Lambda_{2}(u, dv) \bigr].   (14)
A nonparametric estimator of the integrated hazard can similarly be obtained from the expression given in (14) by substituting the estimated components of the reverse hazard
vector. Note that the representation (13) of F in terms of the integrated hazard can be used to test the hypothesis of independence, since if Y and X are independent, then Λ(y, x) = 0 for all (y, x). Such ideas are further discussed in Gürler (1997).

4. Bivariate density estimation

Using the strong i.i.d. representation of F_n(y, x) in Theorem 2, Gürler and Prewitt (2000) propose a nonparametric estimator of the p.d.f. f(y, x) of (Y, X) by convolving F_n(y, x) with a bivariate kernel. In particular, kernel functions K : S → R with compact support S ⊂ R² satisfying the following moment conditions are employed:

\int\!\!\int K(u, v)\, u^i v^j \, du\, dv =
\begin{cases}
1 & i + j = 0, \\
0 & i + j < k, \\
\beta(i, j) \ne 0,\ |\beta(i, j)| < \infty & \text{for some } (i, j):\ i + j = k.
\end{cases}

The following standard assumptions regarding the bandwidth sequences are made: b_x, b_y → 0 and n b_x b_y → ∞. The proposed bivariate density estimator is

f_n(y, x) = \frac{1}{b_y b_x} \int\!\!\int K\!\left( \frac{y - u}{b_y},\ \frac{x - v}{b_x} \right) F_n(du, dv).

Gürler and Prewitt (2000) provide the following result:

THEOREM 3. Under the assumptions of Theorem 1(i),

f_n(y, x) - f(y, x)
= \frac{1}{b_x b_y} \int \bar\xi_n(y - b_y u,\, x - b_x v)\, K(du, dv)
+ \left[ \frac{1}{b_x b_y} \int F(y - b_y u,\, x - b_x v)\, K(du, dv) - f(y, x) \right] + r_n(y, x)
\equiv S_n(y, x) + B_n(y, x) + r_n(y, x),

where

\sup_{(y,x) \in T_b} | r_n(y, x) | = O\bigl( \log n / (n b_x b_y) \bigr).   (15)
Regarding the bias and the variance of the bivariate density estimator, the following expressions are provided:

THEOREM 4. Under the assumptions of Theorem 1(i),

\mathrm{BIAS}\bigl[ f_n(y, x) \bigr]
= \frac{(-1)^k}{k!} \sum_{i+j=k} b_y^{\,i} b_x^{\,j} f^{(ij)}(y, x)\, \beta(i, j)
+ o\bigl( (b_x b_y)^k \bigr) + O\!\left( \frac{1}{n b_x b_y} \right),
\mathrm{VAR}\bigl[ f_n(y, x) \bigr]
= \frac{1}{n b_x b_y}\, A(y)\, \frac{\partial^2 W(y, x)}{\partial y\, \partial x}
\left[ \int_{-1}^{1} K^2(u)\, du \right]^{2} + o\!\left( \frac{1}{n b_x b_y} \right)
= \frac{1}{n b_x b_y}\, \frac{F_Y(y)}{C(y)}\, f(y, x)
\left[ \int_{-1}^{1} K^2(u)\, du \right]^{2} + o\!\left( \frac{1}{n b_x b_y} \right).
Data-driven and least-squares cross-validation bandwidth choices for bivariate density estimation with left truncated and right censored data are discussed in Prewitt and Gürler (1999).
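As an illustration of the estimator f_n above, the following sketch evaluates a product-kernel version of f_n at a point, given the atoms and masses of an estimated bivariate d.f. F_n. The product-kernel form and the function names are our own simplifying choices; the chapter itself allows a general bivariate kernel K(u, v).

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1)

def bivariate_density(y, x, points, masses, by, bx, kernel=epanechnikov):
    """f_n(y, x) = (1/(by*bx)) * sum_j K((y - Y_j)/by) * K((x - X_j)/bx) * w_j,
    where (Y_j, X_j) are the atoms and w_j the masses of an estimated bivariate
    d.f. F_n (for truncated data the masses come from the estimator of F,
    not simply 1/n)."""
    Yj, Xj = points[:, 0], points[:, 1]
    return float(np.sum(kernel((y - Yj) / by) * kernel((x - Xj) / bx) * masses) / (by * bx))

# with complete data, F_n is the empirical d.f. and every mass equals 1/n
rng = np.random.default_rng(2)
pts = rng.exponential(1.0, size=(1000, 2))
w = np.full(len(pts), 1.0 / len(pts))
print(bivariate_density(1.0, 1.0, pts, w, by=0.4, bx=0.4))
```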
References

Basu, A.P. (1971). Bivariate failure rate. JASA 66, 103–104.
Cox, D.R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. B 34, 187–220.
Chao, M.T., Lo, S.H. (1988). Some representations of the nonparametric maximum likelihood estimators with truncated data. Ann. Statist. 16, 661–668.
Chen, K., Chao, M.T., Lo, S.H. (1995). Strong consistency of the Lynden-Bell estimator for truncated data. Ann. Statist. 23, 440–449.
Dabrowska, D.M. (1988). Kaplan–Meier estimate on the plane. Ann. Statist. 16, 1475–1489.
Gijbels, I., Wang, J.L. (1993). Strong representations of the survival function estimator for truncated and censored data with applications. J. Multivar. Anal. 47, 210–229.
Gross, S.T., Huber-Carol, C. (1992). Regression models for truncated survival data. Scand. J. Statist. 19, 193–213.
Gu, M.G., Lai, T.L. (1990). Functional laws of the iterated logarithm for the product-limit estimator of a distribution function under random censorship or truncation. Ann. Probab. 18, 160–189.
Gürler, Ü. (1996). Bivariate estimation with right truncated data. JASA 91, 1152–1165.
Gürler, Ü. (1997). Bivariate distribution and hazard functions when a component is randomly truncated. J. Multivar. Anal. 60, 20–47.
Gürler, Ü., Keleş (1998). Comparison of independence tests for truncated data. Tech. Rep. IEOR 9816. Industrial Engineering Department, Bilkent University, Ankara.
Gürler, Ü., Prewitt, K. (2000). Bivariate density estimation with randomly truncated data. J. Multivar. Anal. 74, 88–115.
Gürler, Ü., Stute, W., Wang, J.-L. (1993). Weak and strong quantile representations for randomly truncated data with applications. Statist. Probab. Lett. 17, 139–148.
He, S. (1988). The strong law for the P–L estimate in the left truncated and right censored model. Chinese Ann. Math. 3, 341–348.
He, S., Yang, G.L. (1988). The strong law under random truncation. Ann. Statist. 26, 992–1010.
Johnson, N.L., Kotz, S. (1975). A vector multivariate hazard rate. J. Multivar. Anal. 5, 53–66.
Kalbfleisch, J.D., Lawless, J.F. (1989). Inference based on retrospective ascertainment: An analysis of the data on transfusion-related AIDS. J. Amer. Statist. Assoc. 84, 360–372.
Lagakos, S.W., Barraj, L.M., DeGruttola, V. (1988). Nonparametric analysis of truncated survival data with applications to AIDS. Biometrika 80, 573–581.
Lai, T.L., Ying, Z. (1991). Estimating a distribution function with truncated and censored data. Ann. Statist. 19, 417–442.
Lynden-Bell, D. (1971). A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices Roy. Astronom. Soc. 155, 95–118.
Marshall, A.W. (1975). Some comments on the hazard gradient. Stochastic Process. Appl. 3, 293–300.
Prewitt, K., Gürler, Ü. (1999). Variance of the bivariate density estimator for left truncated and right censored data. Statist. Probab. Lett. 45, 351–358.
Stute, W. (1993). Almost sure representations of the product-limit estimator for truncated data. Ann. Statist. 21, 146–156.
Stute, W., Wang, J.-L. (1993). The strong law under random censorship. Ann. Statist. 21, 1591–1607.
Uzunoğulları, Ü., Wang, J.-L. (1992). A comparison of the hazard rate estimators for left truncated and right censored data. Biometrika 79, 297–310.
van der Laan, M.J. (1996). Nonparametric estimation of the bivariate survival function with truncated data. J. Multivar. Anal. 58, 107–131.
Wang, M.C., Jewell, N.P., Tsai, W.Y. (1986). Asymptotic properties of the product limit estimate under random truncation. Ann. Statist. 14, 1597–1605.
Woodroofe, M. (1985). Estimating a distribution function with truncated data. Ann. Statist. 13, 163–177.
11
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23011-X
Lower Bounds for Estimating a Hazard
Catherine Huber and Brenda MacGibbon
1. Introduction

Two very active areas of statistical research are non-parametric function estimation and the analysis of censored survival data. Techniques originally developed for density estimation, such as the kernel estimator approach introduced by Rosenblatt (1956, 1971) and Parzen (1962), have now been applied to the estimation of densities and hazard rates for randomly right censored data. Nadaraya (1974) studied the integral mean square error of some non-parametric density estimates and Stone (1980) obtained rates of convergence for non-parametric estimators. Mielniczuk (1986) studied asymptotic properties of kernel estimators of a density function with censored data. Assuming that the kernel has compact support and that all observations fall in a compact interval, Yandell (1983) obtained maximal deviation and simultaneous confidence band results for such estimators, thus generalizing the original work of Bickel and Rosenblatt (1973) in the uncensored case. Ramlau-Hansen (1983) based his work on the theory of multiplicative counting processes, originally studied by Aalen (1978), and used a kernel function to smooth directly the non-parametric Nelson–Aalen estimator of the cumulative hazard. In the same framework, Pons (1986) obtained the rate of convergence of such estimators for the p-order risk by means of a minimax risk minorization. Csörgo et al. (1991) obtained central limit theorems for Lp-distances of kernel estimators of densities under random censorship. The work of Tanner and Wong (1983) does not have the above compactness restrictions of Ramlau-Hansen (1983) but uses instead kernels which are compatible with the survival time and the censoring distributions. Diehl and Stute (1988) determined the exact rate of uniform convergence for such kernel estimators. Lo et al. (1989) obtained pointwise strong consistency and an expression for the pointwise mean square error as well as asymptotic normality results for kernel estimators of the density and hazard functions under censoring, and Zhang (1996a, 1996b) also studied such problems. Tanner (1983) proposed a variable bandwidth kernel estimator of the hazard under censoring and Schäfer (1985) showed the uniform consistency of such a data adaptive estimator. Various non-parametric methods other than kernel estimators have been used in order to estimate the hazard function with censoring. For example, Antoniadis and Grégoire
(1990) used penalized likelihood, Ying (1992) proposed minimum Hellinger-type estimators, while Loader (1999) used local likelihood methods in order to estimate the hazard. Kooperberg et al. (1995) used maximum likelihood estimation and splines to estimate the log hazard with censored data. We concentrate here on the kernel estimator, however, and its asymptotic optimality. We are interested in asymptotic minimax results which have not as yet been as well studied for hazard function estimation with censoring. Under weak dependence, Roussas (1989, 1990) and Izenman and Tran (1990) studied kernel estimators of the hazard rate and Yu (1993) obtained minimax rates in the L∞ norm for estimating a density. Wang (1986, 1987) obtained asymptotic minimax rates of convergence for the estimation of a distribution function with increasing hazard rate. For the random right censorship model, Chen and Lo (1997) found the rate of uniform convergence of the product-limit estimator and Wellner (1982) showed that it is asymptotically minimax. Huang and Wellner (1995), using non-parametric maximum likelihood estimates for estimating the density and hazard functions under a monotonicity assumption showed that their approach to hazard rate estimation with randomly right censored data is asymptotically equivalent to that of Wang (1986) in the uncensored case. MacGibbon et al. (2002) showed the asymptotic optimality of such estimators under a mixed random right censorship-left truncation model. The problem of best obtainable global rates of convergence, however, has not really been completely addressed for function estimation in the presence of censoring. Here we study the problem of the best obtainable asymptotic rates of convergence under a weighted integrated quadratic risk, for the global estimation of a hazard function with randomly right censored data, knowing that the hazard function belongs to a given regularity class such as a Lipschitz or Sobolev class. A definition of Kullback information for a hazard function with right censoring which generalizes the original definition for densities is given here; it should be contrasted with that of Hjort (1992) and the use of Fisher information by Efron and Johnstone (1990). Perturbation methods developed by Bretagnolle and Huber (1979) are used here in the case of hazard rate estimation with censored data. Particular use is made of the hypercubes of Assouad (1983) who gave an elegant general method for deriving lower bounds for functional estimation, as discussed in Huber (1991, 1997). This method differs from that of Donoho and Low (1992) for establishing rates of convergence in the uncensored case. It is also shown that this lower bound is achieved, in particular, by the Ramlau-Hansen kernel estimator, using results of Pons (1986), thus establishing a minimax result. Although analogous results can be obtained for density estimation with censored data and for both density and hazard estimation under a weighted integrated p-order risk, we will concentrate on hazard estimation and weighted integrated quadratic risk in this paper.
2. Framework

Although in the analysis of censored survival data, the density, the distribution or the survival function may be studied directly, a parsimonious and readily interpretable summary of the data is often given by the hazard rate. Here f, F, S, h and H will denote
respectively the density, distribution function, survival function, hazard rate and cumulative hazard of a positive random variable X. This variable X is assumed to be right censored by a positive random variable C, independent of X, and having survival function Ḡ. The definitions are given as follows:

P(X \ge t) = S(t) = 1 - F(t), \quad t \in \mathbb{R}_+,
P(t \le X < t + dt \mid X \ge t) = h(t)\, dt, \quad t \in \mathbb{R}_+,
H(t) = \int_0^t h(u)\, du, \quad t \in \mathbb{R}_+,
P(C \ge t) = \bar G(t), \quad \bar G(t) = 1 - G(t), \quad t \in \mathbb{R}_+.   (1)
We observe n i.i.d. random variables of the form Z = (T, D), where T = min(X, C), D = 1{X ≤ C}, and D denotes the failure indicator. The “death” point process N associated with this sample as well as the “at risk” point process Y satisfy:

dN(t) = \sum_{i=1}^{n} D_i\, 1\{X_i = t\}, \quad t \in \mathbb{R}_+;   (2)

Y(t) = \sum_{i=1}^{n} 1\{T_i \ge t\}, \quad t \in \mathbb{R}_+.   (3)
Usual estimators for h are derived either from a smoothed version of the Kaplan–Meier estimate Ŝ of the survival function S, as, for example, in Lo et al. (1989), or from a smoothed version of the Nelson–Aalen estimate Ĥ of the cumulative hazard H, as in Ramlau-Hansen (1983). Here

\hat S(t \mid Y_n, N_n) = \prod_{s \le t} \left( 1 - \frac{dN_n(s)}{Y_n(s)} \right),   (4)

and

\hat H(t) = \int_0^t \frac{J(s)}{Y(s)}\, dN(s),   (5)

where J(s) = 1{Y(s) > 0}. Although we are primarily interested in estimating a hazard function that belongs to a certain smoothness class, the technique described here is applicable as well to the estimation of a density or of the intensity of the process.
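A minimal computational sketch of (5), together with the Ramlau-Hansen type kernel smoothing mentioned in the Introduction, may help fix ideas. It assumes ordinary right-censored data (T_i, D_i); the function names (nelson_aalen, kernel_hazard) and the Epanechnikov default kernel are our own choices, not notation from the chapter.

```python
import numpy as np

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard H from right-censored data.

    times  : observed times T_i = min(X_i, C_i)
    events : failure indicators D_i (1 = failure, 0 = censored)
    Returns the distinct failure times and the cumulative hazard at those times.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    t_fail = np.unique(times[events == 1])
    # Y(t) = number at risk at t; dN(t) = number of failures at t
    Y = np.array([(times >= t).sum() for t in t_fail])
    dN = np.array([((times == t) & (events == 1)).sum() for t in t_fail])
    return t_fail, np.cumsum(dN / Y)

def kernel_hazard(t_grid, times, events, b,
                  kernel=lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)):
    """Ramlau-Hansen type estimate: smooth the Nelson-Aalen increments,
    h_hat(t) = (1/b) * sum_s K((t - s)/b) * dH(s) over the failure times s."""
    t_fail, H = nelson_aalen(times, events)
    dH = np.diff(np.concatenate(([0.0], H)))
    return np.array([np.sum(kernel((t - t_fail) / b) * dH) / b for t in np.asarray(t_grid, float)])

# example: exponential lifetimes with exponential censoring (true hazard = 1)
rng = np.random.default_rng(0)
X = rng.exponential(1.0, 500)
C = rng.exponential(2.0, 500)
T, D = np.minimum(X, C), (X <= C).astype(int)
print(kernel_hazard(np.linspace(0.2, 1.5, 6), T, D, b=0.3))  # roughly 1 on the grid
```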
In general, let us suppose that a set of probability laws P is indexed by a set of smooth functions F ; that is f ∈ F ⇒ Pf ∈ P. Here, f will be assumed to determine the distribution of X and Pf will denote the associated law of the n-sample Zi = (Ti , Di ), i = 1, . . . , n observed. Now the n-sample {Zi }ni=1 belongs to the space of observations denoted by the triple (E, B, Pf ), where B is a σ -field, and we estimate f from the observations {Zi }ni=1 . The space Fn of all estimators of f based on the n-sample consists of all measurable observations from (E, B) into some Fn ⊃ F . Fn is a set containing F which is sufficiently large to include all the usual and “interesting” estimators fˆ of f . For example, in the case of density estimation, Fn would contain the optimal Parzen kernel estimators which are known to be not necessarily positive (Parzen (1962), Wahba (1975)). Various estimators will be compared by means of a weighted mean integrated risk. Knowing that f belongs to a certain smoothness class F , the problem is to determine the minimax asymptotic rate of convergence with respect to this risk. The risk R(fˆ, f ) of the estimator fˆ at the point f is defined as
R(\hat f, f) = E_{P_f}\bigl[ d(\hat f, f) \bigr].   (6)

The expectation E_{P_f} is to be taken with respect to the law P_f of the n-sample {Z_i}_{i=1}^n; d is a “pseudo-distance”, d : F̃ × F̃ → R_+, satisfying the two properties (A1) and (A2) defined below (see, e.g., Huber (1991)):

(A1) (triangular inequality) ∃ B > 0: d(f, g) ≤ B[ d(f, h) + d(h, g) ], ∀ f, g, h ∈ F̃;
(A2) (super-additivity) for all partitions (I_j)_{j∈J} of R, d(f, g) ≥ Σ_{j∈J} d(f 1_{I_j}, g 1_{I_j}),

where 1_{I_j} is the indicator function of the interval I_j. The smoothness class F, the pseudo-distance d satisfying the above two properties and the set of probabilities P indexed by elements of F as described above will be said to determine an estimation problem. Such a problem will be denoted by the triple (F, d, P). There are many examples of such pseudo-distances. If we restrict ourselves to the estimation of probability densities, the distance function can be chosen among the usual ones given below. However, in our case, that is, for estimating hazard functions which are not necessarily in L¹, we shall consider weighted mean integrated losses. The weight function w will be chosen such that h or its derivatives are in L¹(w) or Lᵖ(w). Let us now recall the case of density estimation. Let

F = \{ f \ \text{density}:\ \| f^{(s)} \|_p^p < a \}

for some positive integer s and some positive real number a. Then some possible pseudo-distances are given by:
(1) d(f, g) = ‖f − g‖_p^p for some p ≥ 1. The corresponding risk is then called the p-order risk. For example, for p = 2, B = 2 in property (A1).
(2) d(f, g) = ‖f^{(s')} − g^{(s')}‖_p^p, where 1 ≤ s' ≤ s.
(3) d(f, g) = ‖f^{1/q} − g^{1/q}‖_p^p, where 1/p + 1/q = 1 (when p = q = 2, we obtain the well-known Hellinger distance).

When estimating a hazard function, let

\| f \|_{p,w}^{p} = \int | f(t) |^{p}\, w(t)\, dt,   (7)

where w(t) is a weight function satisfying w_0 = ∫_0^∞ w(t) dt < ∞ and chosen so that ‖f‖_{p,w}^p is finite; for example, w = 1_K where K is a compact set, or w = ḠS. (If F consists only of density functions, then w(t) could be taken equal to 1.) Conditions such as F = {f density: ‖f^{(s)}‖_p^p < a} have to be replaced by F_w = {h hazard: ∫_0^∞ |h^{(s)}|^p w(t) dt < a}, and the corresponding pseudo-distances are simply weighted by w. Thus, all the preceding examples of pseudo-distances can be rewritten with the subscript w.

The lower bound u for the asymptotic minimax rate of convergence is defined as the u for which there exist two constants c and b and a sequence of estimators (f̂_n)_{n∈N}, f̂_n ∈ F̃_n (the set of all estimators based on the n-sample), such that

\inf_{\hat f_n \in \tilde F_n} \sup_{f \in F} \limsup_n\ n^{u} R(\hat f_n, f) \ge c

and

\limsup_n \sup_{f \in F}\ n^{u} R(\hat f_n, f) \le b.
It should be noted that those constants depend on both F and d; that is, c = c(F, d), b = b(F, d) and u = u(F, d). Such a lower bound is found here for the estimation of a hazard function under random right censoring when h is assumed to belong to certain Lipschitz or Sobolev classes. We again emphasize the fact that analogous techniques exist for establishing lower bounds for the estimation of a density or an intensity function with or without right censoring.

3. Kullback information and Hellinger distances based on hazards

As in the previous section, let X be a positive random variable with distribution function F, and associated hazard, density and survival h, f, and S, respectively; let C be the positive random censoring variable, independent of X, having survival function Ḡ, and let Z = (T, D) denote an observation where T and D were defined in Section 2. Let P_h denote the probability law of Z where h is the underlying hazard for X. Although P_h also depends on G, this is not indicated as G will be assumed fixed for the moment. Then we can write L(Z | h = h1) = P_{h1}
and L(Z | h = h2) = P_{h2}, if h1 and h2 are the two hazard rates, corresponding respectively to the triples (F1, f1, S1) and (F2, f2, S2). If P and Q are two probability laws defined on (E, B), then we denote the statistical models by (E, B, (P, Q)). The Kullback information “distance” K between two laws P and Q is defined as

K(P, Q) = \int \log\frac{dP}{dQ}\, dP,

and the Hellinger distance h between P and Q as

h^2(P, Q) = \frac{1}{2} \int \bigl( \sqrt{dP} - \sqrt{dQ} \bigr)^2.

We have the obvious relation between these two “distances”.

LEMMA 1.
K(P_{h_1}, P_{h_2}) = \int \left( \log\frac{h_1}{h_2} + \frac{h_2}{h_1} - 1 \right) h_1 S_1 \bar G,   (8)

h^2(P_{h_1}, P_{h_2}) = \int \left( \frac{h_1 + h_2}{2} - \sqrt{h_1 h_2} \right) \sqrt{S_1 S_2}\, \bar G,   (9)

where Ḡ = 1 − G.

For a proof of this result, see Appendix A. Note that we could have chosen the symmetrized version of the Kullback information distance in order to have, as for the Hellinger distance, a symmetric version of K(P_{h1}, P_{h2}). We can always let

h_1 = e^{\varepsilon_1}, \qquad h_2 = e^{\varepsilon_2},

and by defining (ε1 + ε2)/2 = ε and Δε = (ε1 − ε2)/2, we obtain

h_1 = e^{\varepsilon + \Delta\varepsilon} = \bar h\, e^{\Delta\varepsilon}, \qquad
h_2 = e^{\varepsilon - \Delta\varepsilon} = \bar h\, e^{-\Delta\varepsilon},

where h̄ is equal to the geometric mean √(h1 h2). As S̄ = √(S1 S2) = e^{−(H1+H2)/2}, with H_i(t) = ∫_0^t h_i(u) du, then

h^2(P_{h_1}, P_{h_2}) = \int \bigl( \cosh(\Delta\varepsilon) - 1 \bigr)\, \bar h\, \bar S\, \bar G.

REMARK. It should be noted that h̄ and S̄ represent the geometric means of h1 and h2 (h̄ = √(h1 h2)) and of S1 and S2 (S̄ = √(S1 S2)), respectively, but that h̄ is not the hazard corresponding to S̄. As Δε tends to 0, the Kullback and Hellinger distances tend to be equal respectively to ½∫(Δε)² h1 S1 Ḡ and ½∫(Δε)² h̄ S̄ Ḡ, which are usually equivalent in the limit.
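As a quick numerical sanity check of (9) as reconstructed above, the right-hand side of (9) can be compared, for two constant hazards and exponential censoring, with the Hellinger distance computed directly from the sub-densities of Z = (T, D) (the same factorization used in Appendix A). This is a rough sketch under those assumptions; the variable and function names are ours.

```python
import numpy as np
from scipy.integrate import quad

h1, h2, c = 1.0, 2.0, 0.5            # constant hazards for X; censoring C ~ Exp(c)
S1 = lambda t: np.exp(-h1 * t)
S2 = lambda t: np.exp(-h2 * t)
Gbar = lambda t: np.exp(-c * t)       # survival function of the censoring time
g = lambda t: c * np.exp(-c * t)      # its density

# right-hand side of (9)
rhs, _ = quad(lambda t: ((h1 + h2) / 2 - np.sqrt(h1 * h2))
              * np.sqrt(S1(t) * S2(t)) * Gbar(t), 0, np.inf)

# direct computation: h^2 = 1/2 * sum over D of int (sqrt(dP) - sqrt(dQ))^2
# uncensored sub-density under P_h is h*S(t)*Gbar(t); censored one is g(t)*S(t)
unc, _ = quad(lambda t: (np.sqrt(h1 * S1(t) * Gbar(t)) - np.sqrt(h2 * S2(t) * Gbar(t)))**2,
              0, np.inf)
cen, _ = quad(lambda t: (np.sqrt(g(t) * S1(t)) - np.sqrt(g(t) * S2(t)))**2, 0, np.inf)
direct = 0.5 * (unc + cen)

print(rhs, direct)   # the two values agree up to quadrature error
```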
4. A general device to derive lower bounds for estimating a function

Here we outline a general methodology, often called the hypercube of Assouad (1983), for finding lower bounds on the rates of convergence in function estimation. For a more detailed exposition of this method, see Assouad (1983) or Huber (1991). Henceforth (F, d, P) will represent a statistical estimation problem as defined in Section 2. The pseudo-distance will henceforth be called a distance.

PROPOSITION 1. Let (F, d, P) be such an estimation problem, with the distance d on F̃ × F̃ (F̃ ⊃ F) verifying properties (A1) and (A2) and P being indexed by functions in F. Then, if in (F, P) one can construct a hypercube (F₀, P₀) whose respective edges are the loss edge Δ and the Hellinger edge δ, then

\inf_{\hat f \in \tilde F} \sup_{f \in F} R(\hat f, f) \ge \frac{1}{4B}\, N\Delta \exp(-8\delta),

where B is the constant in the triangular inequality (A1).

PROOF. The basic idea is to first be able to separate two points and then to replace the initial problem by a union of N identical separate problems. We proceed as follows.

LEMMA 2. Let (E, B, (P, Q)) be a statistical model with h²(P, Q) ≤ 1/2. Let U and V denote two random variables defined on (E, B) such that U + V ≥ Δ > 0. Then

E_P U + E_Q V \ge \frac{\Delta}{2} \exp\bigl( -4 h^2(P, Q) \bigr).

PROOF.

\int (U\, dP + V\, dQ) \ge \int (U + V)(dP \wedge dQ) \ge \Delta \int (dP \wedge dQ) \ge \frac{\Delta}{2} \left( \int \sqrt{dP\, dQ} \right)^{2}.

Applying the Cauchy–Schwarz inequality,

\int \sqrt{dP\, dQ} = \int \sqrt{dP \wedge dQ}\, \sqrt{dP \vee dQ}
\le \left( \int dP \wedge dQ \right)^{1/2} \left( \int dP \vee dQ \right)^{1/2},

which implies that

\left( \int \sqrt{dP\, dQ} \right)^{2} \le \int (dP \wedge dQ) \int (dP \vee dQ) \le 2 \int (dP \wedge dQ),
C. Huber and B. MacGibbon
where the second factor is always smaller than 2. Besides 2 2 dP dQ = 1 − h2 (P , Q) exp −4h2 (P , Q) as log(1 − u) + 2u 0 for all u ∈ [0; 1/2]. This is satisfied for u = h2 since h has been assumed to be less than or equal to 1/2. R EMARK . The Hellinger distance h2 may be replaced by the Kullback distance K; that is
/ EP U + EQ V exp −4K(P , Q) . 2 P ROOF. Let φ(t) = log t − 1 + t. Then dQ 2 dP φ dP dQ 0 K(P , Q) − h (P , Q) = φ dP as φ is convex and positive on [0, 1].
R EMARK . Passing from (P , Q) to (P ⊗n , Q⊗n ) is easy with both Kullback and Hellinger distances, since: (1) K(P ⊗n , Q⊗n ) = nK(P , Q); and (2) h2 (P ⊗n , Q⊗n ) nh2 (P , Q) ∧ 1. Although the second inequality is only useful when h2 (P , Q) is small, we consider only this case here. L EMMA 3. Let f1 and f2 denote two points in F and fˆ an estimator of f . Then if / = d(f1 , f2 ) and δ = h2 (Pf1 , Pf2 ), then
/ exp(−4δ). max R fˆ, f f ∈{f1 ,f2 } 2B P ROOF. Now,
1 max EPf d fˆ, f EPf1 U + EPf2 V f ∈{f1 ,f2 } 2
/ / exp −4h2 (Pf1 , Pf2 ) exp −4K(Pf1 , Pf2 ) . 2B 2B
Thus,
d(f1 , f2 )
exp −4h2 (Pf1 , Pf2 ) . max R fˆ, f 2B
f ∈{f1 ,f2 }
Lower bounds for estimating a hazard
217
Henceforth, we shall be considering the following hypercube. Let F0 be a finite set of cardinality #F0 = 2N indexed by C = {−1; +1}N , whose generic element is ε = (ε1 , ε2 , . . . , εN ),
εi ∈ {−1; +1}.
Now fε denotes an element in F0 and Pfε the corresponding element in P0 . D EFINITION 1. (F0 , P0 ) is a (/, δ)-hypercube iff: d(fε , fε ) = / |εi − εi |, and h2 (Pfε , Pfε ) = δ |εi − εi |;
(10)
that is, (F0 , d) is a cube of edge 2/ while (P0 , h2 ) is a cube of edge 2δ. 0 be the subset of estimators with values in F0 . L EMMA 4. Let F
N/ exp(−8δ). inf sup R fˆ, f 2 0 f ∈F0 fˆ∈F P ROOF. Let ν be the uniform law on F0 , and let us simply denote Eε = EPfε . Then the Bayesian risk of any estimate fˆ0 at point fε for the uniform a priori law is equal to:
1 R(fˆ0 , ν) = Eν Eε d fˆ0 , fε = N Eε /εˆ j − εj 2 ε j
=
/ . ˆ E − ε j ,ε ) ε j j (ε j 2N j ε j ε j Aj (εj )
Now Aj (εj ) = EP U + EQ V where P = Pεj ,1 and Q = Pεj ,−1 . As εˆ = +1 or −1, U = |ˆεj − 1| and V = |ˆεj + 1| then, U + V / = 2 and by hypothesis (2) of the 2 definition (10)of the hypercube, = 2δ < 1/2. The lemma follows then from h (P , Q)N−1 the fact that | j 1| = N and | εj 1| = 2 . Proposition 1 then follows from Lemma 4. Note that changing from F0 to F introduces a factor of 1/2B because on F0 only estimates with value in F0 have been considered. E XAMPLE 1. Let F = {f density: f (s) 22 r}, s 1, r > 0 and let d(f, g) = f − g22 . Then the lower bound for the maximal risk on F , m(F ) is greater than or equal to
1 1 N/ exp(−8δ) = u2 O(1) exp −8nu2 N −1 O(1) , 4B 8 which is a maximum when u = N −s
and N = n1/(2s+1);
218
C. Huber and B. MacGibbon
that is, m(F ) n−2s/(2s+1)O(1).
5. Lower bound for the rate of estimation of a hazard function with right censoring 5.1. The framework Here we consider different smoothness classes which F could denote (see, for example, Ibragimov and Khasminskii (1981, 1982)). • Let F be a set of hazard functions h, with corresponding density f , distribution function F and survival function S. Thus, F may represent for example:
Σ(s + α, L) = h: h(s) (x) − h(s)(y) < L|x − y|α , s ∈ N, α ∈ ]0, 1] ,
Σ(1, L) = h: h(x2 ) − h(x1 ) < L|x2 − x1 | , Σ(s + α) = Σ(s + α, L). L>0
Then Σ(1) represents the class of Lipschitz functions and p
L(s, p, r) = h: h(s) p r, s ∈ N, p > 0, r > 0 • Now h is to be estimated through Z, an observed n-sample of Z = (T , D), defined in Section 2. • Let d be the following “pseudo-distance” on F × F defined with ∞ respect to a given weight function w, assumed to be positive and such that w0 = 0 w(t) dt < ∞: ∞
2 h1 (t) − h2 (t) w(t) dt. d(h1 , h2 ) = (11) 0
The risk of hˆ at h is defined as
ˆ h . ˆ h = EPh d h, R h, As noted before, although we only consider here integrated weighted quadratic risk, analogous results hold for integrated weighted p-order risk. 5.2. Lower bound for F = {h 0: h(s) 22 < M} for some s ∈ N+ and h – a hazard Note that the d as defined in (11), has both properties (A1 ) and (A2 ). Now construct a hypercube F0 embedded in F , with edge 2/ in correspondence with a hypercube P0 in P with edge 2δ. Note that an analogue of the usual transformation of any distribution into the uniform: U = F (X),
U ∼ U[0, 1]
Lower bounds for estimating a hazard
exists for hazard rates. Let h be a hazard rate and H (t) = cumulative hazard. Then the transformation Y = H (X),
219
t 0
h(u) du the corresponding
Y ∼ E(1)
turns X into the basic exponential random variable on the positive half-line with constant hazard rate equal to 1. Exactly as the uniform is chosen as a basis to construct the hypercube for the densities, we construct the hypercube for the hazards with h ≡ 1 as a basis. Let g denote an infinitely differentiable function with support on [0, 1]. Let g equal 0 for x = 0 and x = 1 and equal 1 on the interval [+ε , 1 − ε ] for some arbitrary small ε . Let x0 = 0 < x1 < x2 < · · · < xi−1 , xi < · · · < xN = ∞, define a sequence of points in R+ and let us denote by /xi = xi − xi−1 the length of the ith interval (xi−1 , xi ). For each i, let ui be a strictly positive number, smaller than 1 and let us denote x − xi gi (x) = ui g . /xi Let us take f (t) = e−t ; S(t) = e−t and h(t) ≡ 1 as a basis for our cube. • Then, in order that all hazards thus defined: hε = 1 +
N
ε ∈ {−1; +1}N
εi gi ,
i=1
are members of F we have to assume that N (s)2 (s) 2 g = h = u2 (/xi )1−2s g (s) = O(1) ε
i
2
i
2
i=1
as
(s) gi
= ui g
(s)
x − xi−1 1 /xi (/xi )s
and (s) 2 g = u2 i
2
i
1 (/xi )2s
xi
xi−1
g 2(s)
(s) 2 1 x − xi−1 g . dx = u2i /x i 2 /xi (/xi )2s
• Let w be assumed to be Lipschitz with coefficient L, that is, w(y) − w(x) < L|y − x|. Then, for any fixed /wi , the corresponding /xi will be at least equal to /wi /L.
220
C. Huber and B. MacGibbon
Let us fix some value /w, common to all /wi . Then all /xi will be at least equal to /w/L, so that (/xi )1−2s will be smaller than (/w/L)1−2s . The number of the xi ’s is then equal to ∞ w0 0 w(u) du N= = , /w /w where [r] means the largest integer less than or equal to r. We can choose w0 = 1 without loss of generality, so that N = [1//w]. Now let us assume all ui are equal to u1 . Then in order for the hazard function hε to be an element of F , we need 1−2s 2 /w Nu1 = O(1). L As N = [1//w] this gives: /w−2s u21 = O(1) or equivalently N 2s u21 = O(1). The edges of the cube F0 are given by 1 . N Now consider the edge of the hypercube P0 in P: 1
1 = 1 /i gi2 e−t G u2i g 2 (u) /i e−t G h2i 2 Ii 2 0 √ √ as on Ii , (h1 + h2 )/2 = 1, span h1 h2 = (1 + ρi )(1 − ρi ) ≈ 1 − gi2 /2, 1 (1 + gi ) + (1 − gi ) S1 S2 = exp ± = e−t . 2 /i = u2i (/w)i = u21 (/w) = u2i
The above is true since (h1 − h2 )2 w(t) dt = 0 when h1 = h2 . Ii
Thus,
= u2i (/vi ) h2i = u2i /i e−t G and the edge of P0 is 2δ where
= nu21 (/vi ). δ = nu21 /i e−t G This δ should be of order O(1) in order to kill the exponential inside the lower bound 1 N/ exp(−8δ). 4B
Lower bounds for estimating a hazard
221
Now if (/v)i = (/v1 ) or else if 0 < a < (/v)i //v1 < b < ∞, ∀i = 1, 2, . . . , N , then nu21 (/v) = O(1),
nu21
1 = O(1), N
δ = o(1).
(12)
Thus, it follows from the definition of F that N 2s u21 = O(1).
(13)
Maximizing N/ = u21 results in u21 =
N O(1) n
N 2s+1 = nO(1)
by (12), by (14) and (13),
(14)
and thus, u21 = n−2s/(2s+1),
(15)
that is, for any estimator h ∈ F of the hazard function h, the lower bound on the maximal risk is
ˆ h n−2s/(2s+1)O(1). inf sup R h,
0 h∈F ˆ F h∈
(16)
It remains to show in the next section that there are estimators of the hazard function that attain this lower bound. It should be noted that the “if” part of the above argument implies that we should have for all x and for some a > 0 x(1 − a) v ◦ w−1 (x) x(1 + a). A sufficient condition on the weight function is that there exist a and b positive such w(t ) that 0 < a < G(t )e−t < b < ∞. E XAMPLE 2. Let w(t) = e−2t G(t) = 1 − e−t
for the loss function, for the distribution function G of the censoring.
−t ≡ w(t) and so /v ≡ /w. Then there is no problem as v(t) = G(t)e Another alternative is to let w(t) = G(t)e−t = v(t). Then we also have /v = /w on any interval. for some survival function U . Then all G are permitted More generally, let w = e−t U G bU at least on some [0, T ] such that 1 − U > p > 0. such that a U
222
C. Huber and B. MacGibbon
5.3. Lower bound for F = {h > 0: h ∈ Σ(s + α, L)} Let us consider the simple case: w(t) = e−t , −t . v(t) = Ge Then g1 = u1 g( /x1 ), and
s x y g (s) − g (s) /1 /1 s x y α x < u1 L − = u1 /−(s+α)L|x − y|. /1 /1 /1
g1(s) (x) − g1(s) (y) = u1
1 /1
Condition for hε to be in F is thus: h ∈ F: u/−(s+α) = O(1), 1 2 Hellinger edge: δ nu /1 O(1), Loss edge: / u2 /1 O(1). (s+α)
This yields u = /1
= /m 1 where m = s + α is the “smoothness” of F , by definition. /1 = n1/(2m+1) , 2m+1 = O(1) ⇒ δ = n/1 u2 = n−2m/(2m+1).
As the rate in N/ = Nu2 /1 and N = 1//1 , the rate of convergence is n−2m/(2m+1) .
6. Rate of convergence for the kernel estimator of the hazard function • Here we show that the lower bound on the maximal risk for estimating a hazard function in F = {h 0: h(s) 22 < M} for some s ∈ N + is attained. Let us consider the kernel estimator of the hazard function defined by Ramlau-Hansen (1983) that is, define the kernel estimate ˆ = 1 K t − x dHn (x), h(t) b b where dHn =
dNn Rn
and Nn and Rn are defined in Eqs. (2) and (3).
This estimator h estimates h∗ , a smoothed version of h, rather than h itself: 1 t −x ∗ K dH ∗ (x), h (t) = b b where dH ∗ (x) = 1{Rn (x) > 0} dH (x).
Lower bounds for estimating a hazard
223
The bias and variance of hˆ have to be computed (See Ramlau-Hansen (1983)). We know that
∗
2 ˆh(t) − h(t) 2 w(t) dt 2 ˆh − h∗ 2 w dt + h − h w dt = 2[Variance + Bias squared term].
(17)
Let us assume that the kernel K is a Parzen kernel oforder s; that is, K is continuous, satisfying K(t) dt = 1; ∀j < s t j K(t) dt = 0 and |t|s K(t) dt is bounded. 6.1. Bias With each K is associated a transformed kernel s K (Bretagnolle and Huber (1979)) defined by ∞ (y − x)s−1 s K(y) dy for x > 0. s K(x) = (−1) (s − 1)! x By taking the convolution of h and K and using the Taylor expansion of h, we obtain h ∗ K − h = h(s) ∗ s K, since
tj h(j ) K(t) dt (x + t)K(t) dt = h(x)K(t) dt + j! ∞ x+t (x + t − u)s−1 (s) + f (u)K(t) du dt (s − 1)! 0 x ∞ (y − u)s−1 = h(s) K(y − x) dy du. (s − 1)! 0 sK
If we let Kb (t) = K(t/b)/b, then s (Kb ) = b
s
(s K)b .
Thus, the order of the bias is bs . 6.2. Variance As for the variance, its pointwise value, as stated in Proposition 3.2.2, p. 457 of RamlauHansen (1983), is of the order nb−1 . 6.3. Risk Now following the method of proof of Pons (1986) we can show that the risk given by (17) when hˆ is the kernel estimator described by Ramlau-Hansen is dominated by a constant times n−2s/(2s+1).
224
C. Huber and B. MacGibbon
This result combined with our lower bound on estimating a hazard in Eq. (15) implies that the rate of convergence of the minimax risk when the hazard function belongs to the space defined here in Section 5.2 is of order 2s/(2s + 1) and that this rate is attained by a kernel estimator (with a kernel of order s) defined by Ramlau-Hansen (1983). Acknowledgement Both authors acknowledge the support of the Mathematical Sciences Research Institute in Berkeley. The second author also acknowledges the partial support of NSERC of Canada and FCAR of Quebec. Appendix A In order to show that
h (Ph1 , Ph2 ) = 2
we know by definition that h2 (Ph1 , Ph2 ) =
1 2
−(H1 +H2 )/2 h1 + h2 − h1 h2 , Ge 2
R+ ×{D=0}
+ =
1 2
#
{D=1}×R+
$
gS1 −
gS2
− f1 G
#
2
f2 G
2
g e− H1 + e−H2 − 2e−(H1 +H2 )/2
+
% h1 e−H1 + h2 e−H2 − 2 h1 h2 e−(H1 +H2 )/2 . G
Let
Then
v = −G,
dv = g,
du = −h1 e−H1 − h2 e−H2 ,
u = e−H1 + e−H2 .
e−H1 + e−H2 −(H1 +H2 )/2 h1 h2 e−(H1 +H2 )/2 −e −G h (Ph1 , Ph2 ) = g 2 −H1 −H1 + e−H2 ) + e−H2 e G(e − g . + − 2 2
2
Now let dv = −g, u = e−(H1 +H2 )/2 ,
v = G,
h1 + h2 , du = e−(H1 +H2 )/2 − 2
Lower bounds for estimating a hazard
then
225
−(H1 +H2 )/2 −(H1 +H2 )/2 h1 + h2 + Ge h (Ph1 , Ph2 ) = −G h1 h2 e 2 −(H1 +H2 )/2 h1 + h2 − h1 h2 = Ge 2 √ h1 + h2 e−(H1 +H2 )/2 1 − 2 h1 h2 . = G h1 + h2 2
2
f¯
K(Ph1 , Ph2 ) =
h1 h1 log + − 1 h1 S1 G h2 h2
is proved by an analogous change of variable argument. References Aalen, O. (1978). Nonparametric inference for a family of counting processes. Ann. Statist. 6, 701–726. Antoniadis, A., Grégoire, G. (1990). Penalized likelihood estimation for rates with censored survival data. Scand. J. Statist. 17, 43–63. Assouad, P. (1983). Deux remarques sur l’estimation. C. R. Acad. Sci. Paris Sér. I Math. 1021. Bickel, P.J., Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. Ann. Statist. 3, 1071–1095; Bickel, P.J., Rosenblatt, M. Ann. Statist. 3 (1973), 1370. Correction. Bretagnolle, J., Huber, C. (1979). Estimation des densités: Risque minimax. Z. Wahrsch. Verw. Gebiete 47, 119–137. Chen, K., Lo, S.-H. (1997). On the rate of uniform convergence of the product-limit estimator: Strong and weak laws. Ann. Statist. 25, 1050–1087. Csörgo, M., Gombay, E., Horvath, L. (1991). Central limit theorems for Lp distances of kernel estimators of densities under random censorship. Ann. Statist. 19, 1813–1831. Diehl, S., Stute, W. (1988). Kernel density and hazard function estimation in the presence of censoring. J. Multivariate Anal. 25, 299–310. Donoho, D.L., Low, M.G. (1992). Renormalization exponents and optimal pointwise rates of convergence. Ann. Statist. 20, 944–970. Efron, B., Johnstone, I.M. (1990). Fisher’s information in terms of the hazard rate. Ann. Statist. 181, 38–62. Hjort, N.L. (1992). On inference in parametric survival data. Internat. Statist. Rev. 60, 335–387. Huang, J., Wellner, J.A. (1995). Estimation of a monotone density or monotone hazard under random censoring. Scand. J. Statist. 22, 3–33. Huber, C. (1991). Estimation de fonctions: Minoration du risque minimax par le cube d’Assouad et la pyramide de Fano. In: Séminaire de Statistique d’Orsay, Estimation fonctionnelle 91-55. Université de ParisSud, pp. 5–29. Huber, C. (1997). Lower bounds for function estimation. In: Pollard, D., Torgensen, E., Yang, G.L. (Eds.), Festschrift for Lucien Le Cam. Research Papers in Probability and Statistics. Springer, New York, pp. 245–258. Ibragimov, I.A., Khasminskii, R.Z. (1981). Statistical Estimation Asymptotic Theory, Vol. 16. Springer, Berlin. Ibragimov, I.A., Khasminskii, R.Z. (1982). Estimation of distribution density belonging to a class of entire functions. Theory Probab. Appl. 27, 551–562.
Izenman, A.J., Tran, L.T. (1990). Kernel estimation of the survival function and hazard rate under weak dependence. J. Statist. Plann. Inference 24, 233–247. Kooperberg, C., Stone, C.J., Truong, Y.K. (1995). The L2 rate of convergence for hazard regression. Scand. J. Statist. 22, 143–157. Lo, S.H., Mack, Y.P., Wang, J.L. (1989). Density and hazard rate estimation for censored data via strong representation of the Kaplan–Meier estimator. Probab. Theor. Relat. Fields 80, 461–473. Loader, C. (1999). Local Regression and Likelihood. Springer, New York. MacGibbon, B., Lu, J., Younes, H. (2002). Limit theorems for asymptotically minimax estimation of a distribution with increasing failure rate under a mixed random right-censorship-left truncation model. Comm. Statist. Theory Methods 31, 1309–1333. Mielniczuk, J. (1986). Some asymptotic properties of kernel estimators of a density function in the case of censored data. Ann. Statist. 14, 766–773. Nadaraya, E.A. (1974). On the integral mean square error of some non-parametric estimates for the density function. Theory Probab. Appl. 19, 131–141. Parzen, E. (1962). On estimation of a probability density and mode. Ann. Math. Statist. 33, 1065–1076. Pons, O. (1986). Vitesse de convergence des estimateurs à noyau pour l’intensité d’un processus ponctuel. Statistics 17, 577–584. Ramlau-Hansen, H. (1983). Smoothing counting process intensities by means of kernel functions. Ann. Statist. 11, 453–466. Rosenblatt, M. (1956). Remarks on some non-parametric estimators of a density function. Ann. Math. Statist. 27, 642–689. Rosenblatt, M. (1971). Curve estimates. Ann. Math. Statist. 42, 1815–1842. Roussas, G.G. (1989). Hazard rate estimation under dependence conditions. J. Statist. Plann. Inference 22, 81–93. Roussas, G.G. (1990). Asymptotic normality of the kernel estimate under dependence conditions: Application to hazard rate. J. Statist. Plann. Inference 25, 81–104. Schäfer, H. (1985). A note on data-adaptive kernel estimation of the hazard and density function in the random censorship situation. Ann. Statist. 13, 818–820. Stone, C.J. (1980). Optimal rates of convergence for non-parametric estimators. Ann. Statist. 8, 1348–1360. Tanner, M.A. (1983). A note of the variable kernel estimator of the hazard function from randomly censored data. Ann. Statist. 11, 994–998. Tanner, M.A., Wong, W.H. (1983). The estimation on the hazard function from randomly censored data by the kernel method. Ann. Statist. 11, 989–993. Wahba, G. (1975). Optimal convergence properties of variable knot, kernel and orthogonal series methods for density estimation. Ann. Statist. 3, 15–29. Wang, J.-L. (1986). Asymptotically minimax estimators for distributions with increasing failure rate. Ann. Statist. 14, 1113–1131. Wang, J.-L. (1987). Estimators of a distribution function with increasing failure rate average. J. Statist. Plann. Inference 16, 415–427. Wellner, J.A. (1982). Asymptotic optimality of the product limit estimator. Ann. Statist. 10, 595–602. Yandell, B.Y. (1983). Non-parametric inference for rates with censored survival data. Ann. Statist. 11, 119– 1135. Ying, Z. (1992). Minimum Hellinger-type distance estimation for censored data. Ann. Statist. 20, 1361–1390. Yu, B. (1993). Density estimation in the L∞ norm for dependent data with applications to the Gibbs sampler. Ann. Statist. 21, 711–735. Zhang, B. (1996a). Some asymptotic results for kernel density estimation under random censorship. Bernouilli 2, 183–198. Zhang, B. (1996b). 
A note on strong uniform consistency of kernel estimators of hazard functions under random censorship. In: Jewell, N.P., Kimber, A.C., Lee, M.-L.T., Whitmore, G.A. (Eds.), Lifetime Data Models in Reliability and Survival Analysis, pp. 395–399.
12
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23012-1
Non-Parametric Hazard Rate Estimation under Progressive Type-II Censoring
N. Balakrishnan and L. Bordes
1. Introduction

As pointed out by Klein and Moeschberger (1997), in survival analysis studies, estimators of the slope of the cumulative hazard rate function provide only a crude estimate of the hazard rate function. For this reason, it is important to estimate directly the hazard rate function, which is the quantity of interest in many practical applications. Let X1, ..., Xn be independent and identically distributed random survival times with hazard rate function λ. Watson and Leadbetter (1964a, 1964b) studied an estimator of the hazard rate function given by:

\hat\lambda(t) = \frac{1}{b} \sum_{i=1}^{n} \frac{1}{n - i + 1}\, K\!\left( \frac{t - X_{i:n}}{b} \right),
where X1:n, X2:n, ... are the ordered observations, K is a kernel function with integral equal to 1, and b is a positive parameter. Later, Ramlau-Hansen (1983) remarked that the above estimator may be generalized to estimate the intensity of counting processes having multiplicative intensities (in the sense of Aalen, 1978), provided that a Nelson–Aalen type estimator is available. Bordes (2004) proposed a Nelson–Aalen type estimator of the cumulative hazard rate function for lifetimes subjected to progressive Type-II censoring. In this paper, we show how the Ramlau-Hansen type estimator may be used to estimate the hazard rate function based on progressively Type-II censored data. This scheme of censoring has been suggested in life testing experiments. Units may be removed at various stages during the experiment, resulting from the experiment itself or in order to reduce its cost and/or its duration (see, e.g., Cohen, 1963; Sen, 1986; Balakrishnan and Aggarwala, 2000; Balasooriya et al., 2000). In such an experiment, n identical units are placed on a life test and after the ith failure, ri surviving units are withdrawn at random (1 ≤ i ≤ m). Hence, only m lifetimes are observed, denoted by X1:m:n, ..., Xm:m:n, while the other units are progressively censored (by the observed preceding lifetimes). For more details and references on progressive censoring, we refer to the recent book of Balakrishnan and Aggarwala (2000) and Viveros and Balakrishnan (1994). It is also
of interest to note that progressively censored samples may be viewed as special cases of the generalized order statistics introduced and studied in detail by Kamps (1995). The paper in organized as follows. In Section 2, we give the Nelson–Aalen type estimator of the cumulative hazard rate function, and then the corresponding kernel estimator. In Section 3, we discuss the properties of this estimator by using the results of Andersen et al. (1993, Section IV.2). The last section examines the practical choice of an optimal bandwidth, and the results of a simulation study are also presented. 2. Smoothing cumulative hazard rate estimator From now on, we suppose that we observe a progressively type-II censored sample of size m, denoted by X1:m:n , . . . , Xm:m:n , from an initial sample of size n, and with censoring scheme defined by integers r1 , . . . , rm such that n = m + r1 + · · · + rm . We assume that the underlying distribution admits a hazard rate function λ, and we denote respectively by R and Λ, the survival (or reliability) function and the cumulative hazard rate function. 2.1. Non-parametric maximum likelihood estimator Bordes (2004) proved that the non-parametric maximum likelihood estimator (NPMLE) is equal to: of the cumulative hazard function Λ, denoted by Λ, = Λ(t)
m 1 1(Xi:m:n t), αim i=1
where αim = m j =i (rj + 1), and 1(·) is the set characteristic function. In fact, defining the counting processes N and Y by N(t) =
m
1(Xi:m:n t),
i=1
and Y (t) =
m
(ri + 1)1(Xi:m:n t),
i=1
may be rewritten in the following way: it is easy to see that Λ t dN(s) = . Λ(t) 0 Y (s) The above expression for the NPMLE of Λ is not surprising since there exists an underlying multiplicative intensity model associated with progressive type-II censoring scheme, leading naturally to a Nelson–Aalen type estimator for Λ (see Andersen et al., 1993). Indeed, defining the natural filtration F = (Ft )t 0 with Ft = σ N(s); s t we have the following result.
Non-parametric hazard rate estimation under progressive type-II censoring
229
P ROPOSITION 1. The process M defined by t Y (s)λ(s) ds, M(t) = N(t) − 0
is an F-martingale. P ROOF. See Bordes (2004).
Let us remark that we have a multiplicative intensity model in the sense of Aalen (1978), since Y is a F-predictable process and λ is a deterministic function. The above result is essential in establishing various properties of the estimators of the cumulative hazard function and the survival function such as variance estimators, consistency and asymptotic normality. 2.2. Smoothing the NPMLE of Λ As mentioned in the Introduction, it is natural to estimate the hazard rate function Let K be a by smoothing the increments of the cumulative hazard rate estimator Λ. bounded function with support equal to [−1, 1] and integral equal to 1, called kernel function. Let b be a positive real number called bandwidth or window. We estimate λ(t) ˆ defined by by λ(t) 1 1 t −s t − s dN(s) ˆλ(t) = dΛ(s) = K K b [0,τ ] b b [0,τ ] b Y (s) m 1 1 t − Xi:m:n 1(Xi:m:n τ ), K = m b αi b i=1
where t is a positive real time in the fixed study interval [0, τ ] (with τ +∞). We can remark that for the special case when all the ri ’s are equal to 0, we have αim = m − i + 1 and Xi:m:n = Xi:n (the usual order statistic), and hence the above estimator reduces to the estimator of Watson and Leadbetter (1964a, 1964b) presented in the Introduction. The study of the statistical properties of λˆ requires us to introduce two others quantities, that is: 1 t −s λ∗ (t) = J (s)λ(s) ds, K b [0,τ ] b where J (s) = 1(Y (s) > 0), and t −s ˜λ(t) = 1 λ(s) ds. K b [0,τ ] b ˆ − λ∗ (t) since Hence, by Proposition 1, we have a martingale structure for λ(t) t − s dM(s) ˆ − λ∗ (t) = 1 . K λ(t) b [0,τ ] b Y (s)
230
N. Balakrishnan and L. Bordes
By the way, λˆ (t) is rather an estimator of λ˜ . It is therefore necessary that λ˜ be close to λ, which clearly depends on the choice of both the kernel function K and the bandwidth b. The bandwidth should therefore be adapted to the sample size m, and that is the reason why we shall introduce a sequence (bm )m1 of bandwidths depending on m. On the other hand, several kernels are available in the literature. The most popular ones are the following: 1 KU (x) = 1[−1,1] (x) (Uniform kernel), 2 3
KE (x) = 1 − x 2 1[−1,1] (x) (Epanechnikov kernel), 4 and KB (x) =
2 15
1 − x 2 1[−1,1] (x) (Biweight kernel). 16
In order to estimate derivatives of λ(t), we can also use higher order kernels (see Ramlau-Hansen, 1983). Indeed, if the kernel is absolutely continuous of order l, then K, K (1) , . . . , K (l) exist and it is possible to estimate the j th derivative of λ by (j ) t − s ˆλ(j ) (t) = 1 dΛ(s). K bj [0,τ ] b Such kernels are proposed in Andersen et al. (1993), for example. ˆ λ˜ and λ∗ will be respectively denoted by λˆ (m) , λ˜ (m) From now on, the quantities λ, ∗(m) and λ when the bandwidth b is adapted to the sample size m (and then denoted by bm ). 3. Asymptotics In the first subsection we give assumptions and we present a new method to derive asymptotic behavior of processes N and Y , which is an alternative to the approximation method given in Bordes (2004). The three last subsections are mainly based on Section IV.2 of Andersen et al. (1993) which in turn is based on the article of Ramlau-Hansen (1983) on non-parametric estimation of the intensity of counting processes. All the results in this section are based on the following assumptions. 3.1. Assumptions and preliminary results (A1) Let τ > 0 be a real number such that R(τ ) > 0; (A2) (rm,i )mi1 is a bounded triangular array of non-negative integers with rm,i K < +∞ for all 1 i m +∞; (A3) There exists a real number r such that r¯ = m−1 m i=1 rm,i = r + o(1). For simplicity, in the sequel we will denote the rm,i ’s by ri . We now give two preliminary results. The following result is given in Balakrishnan and Aggarwala (2000).
Non-parametric hazard rate estimation under progressive type-II censoring
231
P ROPOSITION 2. For i = 1, . . . , m, we have: L
Λ(Xi:m:n ) =
i Zj , αjm j =1
where Z1 , . . . , Zm is a sequence of independent and identically exponentially distribL
uted random variables with mean 1, and = is the equality in law of two random variables. The next result can be found in Bordes and Commenges (2001) or in Kim and Lee (2001). P ROPOSITION 3. Let (Fn )n1 be a sequence of non-decreasing (non-increasing) cadlad processes on [0, τ ] such that for all t ∈ [0, τ ] and for n → +∞:
E Fn (t) → F0 (t) and Var Fn (t) → 0, where F0 is a non-decreasing (non-increasing) continuous deterministic function on [0, τ ]. Then we have: P sup Fn (t) − F0 (t) → 0, n → +∞. t ∈[0,τ ]
The following result will be used repeatedly, and it involves the rescaled versions of processes N and Y , defined by N (m) = N/m and Y (m) = Y/m. P ROPOSITION 4. Under the assumptions (A1)–(A3) we have as m → +∞:
P sup N (m) (t) − 1 − R r+1 (t) → 0, 0t τ
P sup Y (m) (t) − (r + 1)R r+1 (t) → 0.
0t τ
(1) (2)
P ROOF. Let us prove (1). Results are given with respect to m → +∞. By Proposition 3, it is sufficient to prove that
E N (m) (t) → 1 − R r+1 (t) = G(t) and Var N (m) (t) → 0 for t ∈ [0, τ ], or equivalently, since the cumulative hazard function Λ is a nondecreasing continuous function on [0, τ ], that
E N (m) Λ−1 (t) → 1 − R r+1 Λ−1 (t) and Var N (m) Λ−1 (t) → 0 for t ∈ [0, Λ(τ )] (here, Λ−1 can be taken in the generalized inverse sense). Let us first prove that
E N (m) Λ−1 (t) → 1 − R r+1 Λ−1 (t) .
232
N. Balakrishnan and L. Bordes
We have by Proposition 2
E N (m) Λ−1 (t) i m m 1 1
m 1 Λ(Xi:m:n ) t = P Zj /αj t . =E m m i=1
i=1
j =1
Hence
E N
(m)
−1
Λ
(t)
1 = m m
i=1 0
t
(fZ1 /α1m ∗ · · · ∗ fZi /αim )(s) ds,
where fX denotes the density function of a random variable X and ∗ denotes the convolution operator. Let L(F ) be the Laplace transform of the non-negative finite measure F on [0, +∞) defined by +∞ L(F )(t) = exp(−ts)F (ds). 0
Now, from (5.14.18) in Hoffman-Jorgensen (1994, p. 378) we have the equivalence between E[N (m) (Λ−1 (t))] → G(Λ−1 (t)) and there exists an open interval I ⊂ R such that for all t ∈ I
Vm (t) = L E N (m) ◦ Λ−1 (t) → V (t) = L G ◦ Λ−1 (t), from which we deduce that G(t) = L−1 (V ◦ Λ)(t),
t 0,
where L−1 is the inverse Laplace transformation. Now, it is easy to see that 1 1 . m 1 + t/αjm m
Vm (t) =
i
(3)
i=1 j =1
From Lemma A.4 we have Vm → V on (0, 1), where V (t) = 1/(1 + t/(r + 1)). Then
L−1 (V )(t) = 1 − exp −(r + 1)t 1(t 0). It follows that the limit G(t) of EN (m) (t), for t 0, is equal to
G(t) = 1 − exp −(r + 1)Λ(t) 1 Λ(t) 0 = 1 − Rr+1 (t). Let us now prove that Var[N (m) (t)] → 0. Using the above convergence of E[N (m) (Λ−1 (t))], we get
Var N (m) Λ−1 (t) =
m
2 2
j P Λ(Xj :m:n ) t − 1 − exp −(r + 1)t + o(1), 2 m j =1
Non-parametric hazard rate estimation under progressive type-II censoring
233
and using again Proposition 2 and Laplace transformation, we get the expected result if there exists an open subset I of R for which: m
2 2
L j P Λ(Xj :m:n ) · (t) → L 1 − exp −(r + 1)· (t) m2 j =1
or equivalently (by Laplace transforming convolution of exponential distributions): m i 2(r + 1)2 2 1 i m → 2 m 1 + t/αj (t + r + 1)(t + 2 + 2r) i=1 j =1
for t ∈ (0, 1). This result follows by Lemma B.4 in the Appendix B which completes the proof of (1). Let us now prove (2). First, note that
Y (m) (t) − (r + 1) 1 − N (m) (t−) =
1 1 (ri − r) + (ri − r)1(Xi:m:n t). m m m
m
i=1
i=1
Then, using (1), we get the required result if both the terms on the right-hand side of the above equality tend to 0 in probability uniformly on [0, τ ]. By the assumption (A3), it is therefore sufficient to prove that for all ε, we have: N(t ) 1 P sup (ri − r) > ε → 0. t ∈[0,τ ] m i=1
Let t0 be real such that ε = 8K(1 − R r+1 (t0 )), and, by the assumption (A3), m0 be an integer such that for m m0 we have |m−1 m i=1 (ri − r)| < ε/2. Let Bm denotes the event {N(t0 ) m0 }. Then, we have by the assumption (A2) N(t ) 1 P sup (ri − r) > ε t ∈[0,τ ] m i=1
P 2KN (m) (t0 ) ε/2 N(t ) 1 m ) +P sup (ri − r) > ε/2 ∩ Bm + P (B t ∈[t0 ,τ ] m i=1
m , P N (m) (t0 ) 2 1 − R r+1 (t0 ) + P B where by (1), the two last probabilities tend to 0. Hence, the proposition is proved.
3.2. Consistency P ROPOSITION 5. Let t be a real number in ]0, τ [. Assume that the hazard rate function λ is continuous at t, and that the conditions (A1)–(A3) are satisfied. Suppose that the P bandwidth (bm )m1 satisfy (i) bm → 0 and (ii) mbm → +∞. Then, λˆ (m) (t) → λ(t) as m → +∞.
234
N. Balakrishnan and L. Bordes
P ROOF. We have (m) ˆλ (t) − λ(t) ˆλ(m) (t) − λ∗(m) (t) + λ∗(m) (t) − λ˜ (m) (t) + ˜λ(m) (t) − λ(t). To prove the required result, it is enough to show that each term on the right-hand side of the above inequality converges to 0. By Lenglart’s inequality (see Andersen et al., 1993, p. 86), we have for any η, δ > 0:
P ˆλ(m) (t) − λ∗(m) (t) > η −1 τ t − s J (s) dM(s) > η = P bm K bm Y (s) 0 −1 s t − u J (u) >η P sup b dM(u) K m bm Y (u) t −bm s∈[t −bm ,t +bm ] t +bm δ t − u J (u) −2 λ(u) du > δ K2 2 + P bm η bm Y (u) t −bm 1 δ J (t − bm u) −1 2 2 + P bm λ(t − bm u) du > δ K (u) η Y (t − bm u) −1 Y (τ ) δ C > mbm , 2 +P δ m η where C is a finite non-negative constant arising from continuity of λ at t (since bm → 0) and boundedness of the kernel function K. Finally, the probability on the right-hand side of the last inequality tends to 0 since mbm → +∞ and Y (τ )/m → (r + 1)R r+1 (τ ) > 0 in probability by Proposition 4. Now, since 1 ∗(m) (m) λ K(u)J (t − bm u) − 1λ(t − bm u) du, ˜ (t) − λ (t) −1
it is easy to see that
P λ∗(m) (t) − λ˜ (m) (t) > η P J (t + bm ) = 0 P (Xm:m:n > τ ), for m large enough. From Lemma 3 of Bordes (2004), the last probability tends to 0 as m → +∞. Finally, since K is bounded on [−1, 1] and λ is continuous at t, by using the Lebesgue dominated convergence theorem, we readily get 1 (m) ˜λ (t) − λ(t) K(u)λ(t − bm u) − λ(t) du → 0 −1
as m → +∞.
Non-parametric hazard rate estimation under progressive type-II censoring
235
P ROPOSITION 6. Assume that the conditions (A1)–(A3) are satisfied, K is of bounded variations on [−1, 1], and λ is continuous on [0, t] ⊂ [0, τ ]. Let the bandwidth (bm )m1 2 → +∞. Then, as m → +∞ and for 0 < t < t < t, satisfy (i) bm → 0 and (ii) mbm 1 2 we have P sup λˆ (m) (s) − λ(s) → 0. s∈[t1 ,t2 ]
P ROOF. Let us apply Theorem IV.2.2 of Andersen et al. (1993, p. 236) for which we need to check that τ τ
J (s) −2 bm λ(s) ds and 1 − J (s) λ(s) ds Y (s) 0 0 both tend to 0 in probability. In fact, both these conditions are fulfilled if we simply have P
2 inf bm Y (s) → +∞.
s∈[0,τ ]
2 Y (m) (τ ), we get the required result because Since the last quantity is equal to mbm 2 by assumption (ii) we have mbm → +∞, and by Proposition 4 we have Y (m) (τ ) → (r + 1)R r+1 (τ ) > 0 in probability, as m → +∞.
3.3. Optimal bandwidth In order to select an optimal bandwidth, we have to choose a measure of performance of the estimator λˆ (m) . Several measures have been proposed in the literature, but here we will consider mean integrated square error defined by t2
(m) 2 MISE λˆ (m) = E λˆ (t) − λ(t) dt , t1
where t1 and t2 are choosen such that ti ± c ∈ [0, τ ] for a c > 0, and [t1 − bm , t2 + bm ] ⊂ [t1 − c, t2 + c]. Note that
(m) E λˆ (m) (t) = λ˜ (m) (t) + r1 (t), where (m) r (t) 1
1 −1
K(u)λ(t − bm u)E 1 − J (t − bm u) du
c1 P (Xm:m:n τ ). The above constant c1 does not depend on m and t because K is bounded and Λ(τ ) < ∞ by the assumption (A1). By Proposition 2, we have: m Zi
P (Xm:m:n τ ) P Λ(Xm:m:n ) Λ(τ ) = P Λ(τ ) αim i=1
236
N. Balakrishnan and L. Bordes
P
m i=1
Zi Λ(τ ) (K + 1)(m − i + 1)
m = P Zm:m (K + 1)Λ(τ ) = 1 − exp −(K + 1)Λ(τ ) , where we use the equality in law of m i=1 Zi /(m − i + 1) and Zm:m , and the as(m) sumption (A2). Consequently, the remaining term r1 (t) converges to 0 uniformly on [t1 − c, t2 + c] at a geometric rate. Now we have t2
(m) (m) 2 ˆ MISE λ = λ˜ − λ(t) dt
t1
+
t2 t1
where
(m) r = 2 2
t2 t1
2 (m) E λˆ (m) (t) − λ˜ (m) (t) dt + r2 ,
r1(m) (t) λ˜ (m) (t) − λ(t) dt
m c2 1 − exp −(K + 1)Λ(τ ) , where c2 is again a constant which depends neither on m nor on t. Hence, we get the decomposition of the mean integrated square error into a squared bias type term plus a variance type term; the remaining term decreases to 0 at an exponential rate with respect to m. Now, in order to apply the results of Andersen et al. (1993), we restrict our attention to bounded kernels satisfying the conditions 1 K(u) du = 1,
−1 1 −1
uK(u) du = 0 and
1 −1
u2 K(u) du = κ > 0.
(4)
Assuming, along with (4), that λ is twice continuously differentiable on [t1 −c, t2 +c] ⊂ [0, τ ] for a c > 0, we can apply Theorem IV.2.3 of Andersen et al. (1993), since by Proposition 4 and the assumption (A1) we have mJ (s) 1 → , m → +∞, Y (s) (r + 1)R r+1 (s) uniformly in probability on [0, τ ] and therefore on [t1 − c, t2 + c], and
P Y (t) = 0 P (Xm:m:n τ ) = o m−l sup t ∈[t1 −c,t2 +c]
for all l > 0. Hence, the assumptions of this theorem are fulfilled and we get for a sequence (bm )m1 tending to 0 t2 4 t2
(m) 2 2
4 κ 2 bm λ˜ (t) − λ(t) dt = λ (t) dt + o bm , 4 t1 t1
Non-parametric hazard rate estimation under progressive type-II censoring
and
t2
237
2 E λˆ (m) (t) − λ˜ (m) (t) dt
t1
=
1 1 K 2 (u) du R −(r+1) (t2 ) − R −(r+1) (t1 ) 2 mbm (r + 1) −1
+ o (mbm )−1 .
Finally, we obtain
4 κ 2 bm MISE λˆ (m) = 4
t2
(λ (t))2 dt
t1
1 1 K 2 (u) du R −(r+1) (t2 ) − R −(r+1) (t1 ) 2 mbm (r + 1) −1
4
+ o bm (5) + o (mbm )−1 . +
Andersen et al. (1993) derive an optimal bandwidth by minimizing, with respect to bm , the first two leading terms on the right-hand side of equality (5). Here, we obtain 1 2 −(r+1) (t ) − R −(r+1) (t )) 1/5 2 1 −1 K (u) du(R −1/5 −2/5 bm,opt = m κ . t (r + 1)2 t12 {λ (u)}2 du We will see later on in Section 4 two strategies that can be used to derive practical optimal bandwidth for finite sample size. 3.4. Asymptotic normality Applying Theorem IV.2.4 of Andersen et al. (1993), we get the following result. P ROPOSITION 7. Let t be a real number in (0, τ ) such that λ is continuous at t. Assume that (bm )m1 satisfy bm → 0 and mbm → +∞. Then, we have
D
(mbm )1/2 λˆ (m) (t) − λ˜ (m) (t) → N 0, σ 2 (t) , where N (0, σ 2 (t)) is the normal distribution with mean 0 and variance σ 2 (t) defined by 1 λ(t) K 2 (u) du. σ 2 (t) = (r + 1)R r+1 (t) −1 Moreover, we have σˆ 2 (t) =
m bm
[0,τ ]
J (s)
K((t − s)/bm ) Y (s)
2
P
dN(s) → σ 2 (t),
and if λ is continuous at two distinct points t1 and t2 in (0, τ ), then λˆ (m) (t1 ) and λˆ (m) (t2 ) are asymptotically independent.
238
N. Balakrishnan and L. Bordes
P ROOF. By Theorem IV.2.4 of Andersen et al. (1993), we just have to check that there exists a deterministic function y positive and continuous at t such that for a ε > 0 Y (s) P sup − y(s) → 0, s∈[t −ε,t +ε] m
which is immediate by Proposition 4 with y = (r + 1)R r+1 .
As noted by Andersen et al. (1993), we can readily obtain asymptotic normality of (mbm )1/2 (λˆ (m) (t) − λ(t)); however, this requires that (bm )m1 decrease to 0 at a rate which is not compatible with the optimal rate m−1/5 derived previously. In fact, the above result may be more precise if we assume in addition that λ is twice continuously differentiable in a neighborhood of t, the kernel function K satisfies conditions given in (4), and that lim supm m1/5 bm < +∞. In this case, by Theorem IV.2.5 of Andersen et al. (1993),we have
κ 2 D (mbm )1/2 λˆ (m) (t) − λ(t) − bm λ (t) → N 0, σ 2 (t) , m → +∞, 2 which provides us information on the bias of the estimator λˆ (m) (t).
4. Simulation study 4.1. Two methods to choose an optimal bandwidth As we mentioned earlier in Section 3, two different strategies may be used to choose an optimal bandwidth. The first one was proposed byRamlau-Hansen in an unpublished t paper and involves an estimator of MISE(λˆ (m) ) − t12 λ2 (t) dt. Of course, this estimator depends on the bandwidth b, and therefore we just have to search the value of b that minimizes this estimator. In fact, the minimum integrated square error, seen as a function of b, may be written as t2 t2 t2
(m) 2 (m) ˆ ˆ MISE(b) = E λ2 (t) dt. λ (t) dt − 2E λ (t)λ(t) dt + t1
t1
t1
Some authors (see, e.g., Klein and Moeschberger, 1997) propose to estimate the first term on the right-hand side of the above equality by using the trapezoid rule, but of course this can also be done by using some software packages. Ramlau-Hansen proposed to estimate the second term on the right-hand side by a cross-validation technique giving us an estimator (adapted to our problem) Xi:m:n − Xj :m:n 2 1 − K 1(t1 Xi:m:n , Xj :m:n t2 ). b b αim αjm 1i=j m
Then we can take as an optimal bandwidth the value of b that minimizes the sum of these two estimators. An application of the above method to the mortality incidence of diabetic nephropathy in insulin-dependent diabetics is given in Andersen et al. (1993). Another application to the kidney transplant data can be found in Klein and Moeschberger
Non-parametric hazard rate estimation under progressive type-II censoring
239
(1997). We shall illustrate this method on simulated progressively censored data in the next subsection. The second method of choosing an optimal bandwidth involves the optimal bandwidth formula of Section 3.2 together with a twice differentiable kernel. The following one was proposed by Andersen et al. (1993) in order to study the mortality of female diabetic in the county of Fyn. It is defined by 3 35
1 − x 2 1[−1,1] (x), 32 and leads to the following formula for the estimated optimal bandwidth: K(x) =
bˆm,opt (b) 1/5 8640 × (N (m) (t2 ) − N (m) (t1 )) = . t
t −Xi:m:n /α m 2 dt 143m × Y (m) (t1 )Y (m) (t2 ) t12 b13 m K i=1 i b Let us note here that the above formula does not correspond exactly to the one given by Andersen et al. (1993), since we use N (m) (t2 ) − N (m) (t1 ) Y (m) (t1 )Y (m) (t2 ) which by Proposition 4 is a consistent estimator of R −(r+1) (t2 ) − R −(r+1) (t1 ) (r + 1)2 instead of the kernel type estimator they proposed. Again, the integral of the denominator can be calculated by using an appropriate software. However, an obvious limitation of this method is that we need to introduce a preliminary bandwidth b to estimate the integral of the second derivative of the hazard function. As a consequence, it may be difficult to use this formula directly as we will see in the next subsection. Hence, a different approach that uses this formula will be proposed as an alternative solution. 4.2. A simulation study We simulated a progressively censored sample of size m = 50 from an initial sample size of n, where E(n) = 100, using the software S CILAB. The simulation resulted in a sample of size 95. Type-II progressively censored data was generated using the exponential type algorithm given by Balakrishnan and Aggarwala (2000, p. 34). The underlying distribution is Weibull, with hazard rate function λ defined by λ(t) = 3/5 × (t/5)2 . Then, assumption (A1) of Section 3.1 is satisfied for any 0 < τ < +∞. In order to satisfy assumptions (A2) and (A3) of Section 3.1, the ri ’s are chosen randomly, and they are independent and identically distributed with P (ri = 0) = P (ri = 1) = P (ri = 2) = 1/3. It follows that the ri ’s are bounded by 2 and that r¯ , the mean of the ri ’s, converges almost surely to r = 1. In Figures 1–3, we give for each kernel (uniform, Epanechnikov, and biweight) three estimations of λ for t ∈ [1, 6], using b = 0.5, b = 1, and b = 3. These values of b are
240
N. Balakrishnan and L. Bordes
Fig. 1. Estimation of λ using the uniform kernel and 3 bandwidths: b = 0.5, b = 1, and b = 3.
Fig. 2. Estimation of λ using the Epanechnikov kernel and 3 bandwidths: b = 0.5, b = 1, and b = 3.
quite representative of the behavior of the kernel estimator. From a general point of view, we can see that for small values of b and whatever kernel we use, there is some instability (oscillations) in the estimators. Regularity appears as b increases. However, if b is too large, some inconsistencies appear for large values of t due to the small number of observations in this area. In fact, to treat this tail problem it is better to use some
Non-parametric hazard rate estimation under progressive type-II censoring
241
Fig. 3. Estimation of λ using the biweight kernel and 3 bandwidths: b = 0.5, b = 1, and b = 3.
t Fig. 4. Estimation of MISE(λˆ ) − t12 λ2 (t) dt for b ∈ [0.2, 5], t1 = 1, t2 = 6, and using the uniform kernel.
specific non-symmetric kernels suggested by Gasser and Müller (1979) (they are also available in Andersen et al., 1993; Klein and Moeschberger, 1997). In fact, from a practical point of view it is better to choose the values of the bandwidth at which regularity appears when plotting t the estimators curves. Figures 4–6 show the estimation of the MISE criteria (minus t12 λ2 (t) dt) for b ∈ [0.2, 5], and for the uniform, Epanechnikov, and biweight kernels, respectively. Except for the uniform kernel, the optimal bandwidth seems to be around 2. However, the minimum value of 0.6 ob-
242
N. Balakrishnan and L. Bordes
t Fig. 5. Estimation of MISE(λˆ ) − t12 λ2 (t) dt for b ∈ [0.2, 5], t1 = 1, t2 = 6, and using the Epanechnikov kernel.
t Fig. 6. Estimation of MISE(λˆ ) − t12 λ2 (t) dt for b ∈ [0.2, 5], t1 = 1, t2 = 6, and using the biweight kernel.
ˆ 2, tained for the uniform kernel may be attributed to some instability in integrating (λ) since its curve may be very irregular (see Figure 1). However, for this kernel, Figure 4 shows another minimum around 1.7 which appears in a rather regular part of the curve and which is closer to minima obtained for the Epanechnikov and biweight kernels (Figures 5 and 6). As we mentioned earlier in Section 4.1, the formula of bˆm,opt is quite difficult to use since it depends on an a priori value of b, and as we can see in Figure 7, bˆm,opt (b) is increasing quite linearly with respect to the initial guess b. Indeed, choosing as an
Non-parametric hazard rate estimation under progressive type-II censoring
243
Fig. 7. Calculation of bˆm,opt (b) of Section 4.1, for an a priori bandwidth b in [1, 6].
Fig. 8. Calculation of (b − bˆm,opt (b))2 for b ranking from 1 to 4.
initial guess b = 2 (or b = 3) we get approximatively bˆm,opt ≈ 2 (or bˆm,opt ≈ 3). This is the reason why we propose, as an alternative way to determine bˆm,opt , to plot the function b → (b − bˆm,opt (b))2 and to choose for bˆm,opt the value of b that minimizes this function. Figure 8 gives the result for b ∈ [1, 4] (we did not take smaller values of b due to integrability instabilities), and it can be seen that the value bˆ m,opt ≈ 3.3 appears as optimal for our criterion. However, we can see that the area where our criterion has smaller values corresponds to the interval [1.9, 3.5] containing the optimal bandwidth obtained by the MISE criterion.
244
N. Balakrishnan and L. Bordes
Appendix A: Technical results for the mean L EMMA A.1. Let (rm )m1 be a stationary sequence and equal to r > 0 and (Vm )m1 defined by (3). Then (Vm )m1 is stationary and equal to V (t) =
1 . 1 + t/(r + 1)
P ROOF. For t ∈ (0, 1), we have V1 (t) = V (t). For m 2, it is easy to show that Vm (t) =
1 1 + (m − 1)Vm−1 (t) . m + t/(1 + r)
We achieve the proof by induction. L EMMA A.2. Let (an )n1 and (bn )n1 be two sequences of real numbers, then j −1 n n n n aj − bj = (aj − bj ) aj bj . j =1
j =1
j =1
i=j +1
i=1
P ROOF. By induction.
L EMMA A.3. Let (rm )m1 satisfy the assumptions (A2) and (A3) and kε,m = (1 − ε)m, where 0 < ε < 1. Then m 1 sup ri − r → 0. m − j 0j kε,m i=j +1
P ROOF. Let ε > 0 be real, um = m−1
m
i=1 ri
− r, and there exist:
εε (i) m0 such that if m, j m0 then |um − uj | < 2(1−ε) (Cauchy); (ii) m1 such that if m m1 then |um | < ε /2 (convergence); (iii) m2 such that if m m2 then j/(m − j ) ε /4(K + 1) if 0 j m0 .
Let us define vj,m =
m 1 ri − r; m−j i=j +1
we have vj,m = um +
j (um − uj ). m−j
Then, for m max(m0 , m1 , m2 ), using (i)–(iii) we have – if 0 j m0 , we have |vj,m | ε /2 +
j 2(K + 1) ε ; m−j
Non-parametric hazard rate estimation under progressive type-II censoring
245
– if m0 < j kε,m , we have |vj,m | ε /2 +
kε,m εε ε . m − kε,m 2(1 − ε)
The lemma is thus proved.
L EMMA A.4. Let (rm )m1 satisfy the assumptions (A2) and (A3) and (Vm )m1 be defined by (3). Then, for t ∈ (0, 1), we have Vm (t) →
1 . 1 + t/(r + 1)
P ROOF. Let ε > 0 be a fixed real number. Let Vε,m be defined by 1 1 . m 1 + t/αjm kε,m
Vε,m (t) =
i
i=1 j =1
m , where ri = r for 1 i m. From ε,m and V Let us introduce similarly the sequences V Lemma A.1, we have
Vm (t) − 1/ 1 + t/(r + 1) = Vm (t) − V m (t) ε,m (t) + m (t). Vε,m (t) − V Vm (t) − Vε,m (t) + Vε,m (t) − V It is easy to see that the first and the third term on the right-hand side of the above inequality are both less than ε for t ∈ (0, 1). It remains to be shown that it is also true for the second term. For t ∈ (0, 1) from Lemma A.2 we have Vε,m (t) − V ε,m (t) kε,m k k 1 1 1 − m 1 + t/αjm 1 + t/(r + 1)(m − j + 1) k=1 j =1
j =1
kε,m k 1 1 1 . − m m 1 + t/αj 1 + t/(r + 1)(m − j + 1) k=1 j =1
For m large enough and 0 j kε,m , from Lemma A.3 we have 1 ε 1 1 + t/α m − 1 + t/(r + 1)(m − j + 1) m − j + 1 , j and so ε,m k 1 Vε,m (t) − V ε,m (t) ε m m−j +1 k=1 j =1 1 ε kε,m kε,m − 1 + + ···+ ε, m m m−1 m − kε,m + 1 which completes the proof of the lemma.
k
246
N. Balakrishnan and L. Bordes
Appendix B: Technical results for the variance L EMMA B.1. For t ∈ (0, 1), the sequence (Gm (t))m1 defined by Gm (t) =
m i 2 1 , i 1 + t/(m − j + 1) m2 i=1 j =1
is non-increasing. P ROOF. Define Hm by Hm (t) =
2 2 + (1 + t)(2 + t) (m + 1)(2 + t)
for t ∈ (0, 1). Simple but tedious calculations lead, for m 2, to: Gm (t) =
(m − 1)2 2 + Gm−1 (t). m(1 + t) m(m + t)
(B.1)
Let us prove that Gm (t) − Hm (t) 0
(B.2)
for t ∈ (0, 1) and m 1. First, we have 1 0 2+t for t ∈ (0, 1). Now, assume that for m 2 we have Gm−1 (t) − Hm−1 (t) 0. Then, by (B.1) we have G1 (t) − H1 (t) =
Gm (t) − Hm (t) =
(m − 1)2 2 + Gm−1 (t) − Hm (t) m(t + 1) m(m + t)
(m − 1)2 2 + Hm−1 (t) − Hm (t) m(t + 1) m(m + t)
2(mt 2 + mt + t) 0. m2 (m + 1)(m + t)(1 + t)(2 + t)
Now using (B.1) and (B.2), we have 2 (m − 1)2 + − 1 Gm−1 (t) m(1 + t) m(m + t) 2 (m − 1)2 + − 1 Hm−1 (t) m(1 + t) m(m + t)
Gm (t) − Gm−1 (t) =
−
2(m − 1) m2 (m + t)(2 + t)
<0
for all m 2. Therefore, (Gm (t))m1 is non-increasing, as needed to be proved.
Non-parametric hazard rate estimation under progressive type-II censoring
247
L EMMA B.2. For t ∈ (0, 1), we have
lim m Gm (t) − Gm−1 (t) = 0. m→+∞
P ROOF. From Lemma B.1, we have 0 m(Gm (t) − Gm−1 (t))
m−1 m i i 2m 2 1 1 − i i m 1 + t/(m − j + 1) (m − 1)2 1 + t/(m − j ) i=1 j =1
i=1
j =1
m−1 i m m 2 1 1 m−1 − +2 i m−1 1 + t/(m − j ) m m−1 1 + t/j i=1
j =1
j =1
1 − 2m 1 2 +2 , m(m − 1) 1 + t 1 + t/j m
(B.3)
j =1
where we obtain the last inequality by Lemma A.1, that is, m−1 i 2 2 1 . i m−1 1 + t/(m − j ) 1 + t i=1
j =1
Sincethe first term on the right-hand side of (B.3) tends to 0, it remains to be shown −1 → 0 for t ∈ (0, 1). For t ∈ (0, 1), t/j ∈ (0, 1) for j 1. Using that m j =1 (1 + t/j ) the fact that log(1 + x) x/2 on (0, 1), we get m m 1 1 log −t → −∞. 1 + t/j j j =1
m
j =1
It therefore follows that j =1 (1 + t/j )−1 → 0 and consequently by (B.3), we have
m Gm (t) − Gm−1 (t) → 0. The lemma is thus proved.
L EMMA B.3. For t ∈ (0, 1), we have m i 2 2 1 = . i lim m→+∞ m2 1 + t/(m − j + 1) (t + 1)(t + 2) i=1 j =1
P ROOF. From Lemma B.1, (Gm (t))m1 is a non-increasing sequence bounded away by 0. It follows that (Gm (t))m1 converges to a real G(t). Moreover, from (B.1) we have
2 m−1 m Gm (t) − Gm−1 (t) = (B.4) − Gm (t) − (1 + t) Gm−1 (t). 1+t m+t
248
N. Balakrishnan and L. Bordes
If m → +∞, by Lemma B.2 we have for t ∈ (0, 1) 2 − G(t) − (1 + t)G(t), 1+t which leads to the required result. 0=
L EMMA B.4. Under the assumptions (A2) and (A3), we have for all t ∈ (0, 1) m i 2(r + 1)2 2 1 . i = lim m m→+∞ m2 1 + t/αj (t + r + 1)(t + 2 + 2r) i=1 j =1
P ROOF. Following the lines of the proof of Lemma A.4, it is easy to show that for ε > 0 there exists an m0 such that for m m0 we have kε,m m i i 2 2 1 1 i − 2 i (i) 2 2ε, m m m 1 + t/αj m 1 + t/αj i=1 j =1
(ii)
i=1 j =1
kε,m kε,m i i 2 2 1 1 i − i 2 2ε, m 1 + t/αjm m2 1 + t/α˜ jm i=1 j =1
(iii)
i=1 j =1
kε,m kε,m i i 2 2 1 1 i − i 2 m m 2ε, 2 m 1 + t/α˜ j m 1 + t/α˜ j i=1 j =1
i=1 j =1
where kε,m = m(1 − ε) and of the lemma by showing that
α˜ jm
= (r + 1)(m − j + 1). Finally, we complete the proof
m i 2(r + 1)2 2 1 . i = m 2 m→+∞ m 1 + t/α˜ j (t + r + 1)(t + 2 + 2r)
lim
i=1 j =1
This is easily seen by setting t = t˜/(r + 1) in Lemma B.3.
References Aalen, O.O. (1978). Non parametric inference for a family of counting processes. Ann. Statist. 6, 701–726. Andersen, P.K., Borgan, O., Gill, R., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York. Balakrishnan, N., Aggarwala, R. (2000). Progressive Censoring: Theory, Methods and Applications. Birkhäuser, Basel. Balasooriya, U., Saw, S.L.C., Gadag, V. (2000). Progressively censored reliability sampling plans for the Weibull distribution. Technometrics 42, 160–167. Bordes, L. (2004). Non-parametric estimation under progressive censoring. J. Statist. Plann. Inference 119 (1), 179–189. Bordes, L., Commenges, D. (2001). Asymptotics for homogeneity tests based on a multivariate random effects proportional hazards models. J. Multivariate Anal. 78, 83–102. Cohen, A.C. (1963). Progressively censored samples in life testing. Technometrics 5, 327–329.
Non-parametric hazard rate estimation under progressive type-II censoring
249
Gasser, T., Müller, H.G. (1979). Kernel estimation of regression functions. In: Smoothing Techniques for Curve Estimation. In: Lecture Notes in Math., Vol. 757. Springer, Berlin, pp. 23–68. Hoffman-Jorgensen, J. (1994). Probability with a View Toward Statistics. In: Probability Series. Chapman & Hall, London. Kamps, U. (1995). A concept of generalized order statistics. J. Statist. Plann. Inference 48, 1–23. Kim, Y., Lee, J. (2001). On posterior consistency of survival models. Ann. Statist. 29, 666–686. Klein, J.P., Moeschberger, M.L. (1997). Survival Analysis. Springer, New York. Ramlau-Hansen, H. (1983). Smoothing counting process intensities by means of kernel functions. Ann. Statist. 11, 453–466. Sen, P.K. (1986). Progressively censoring schemes. In: Kotz, S., Johnson, N.L. (Eds.), Encyclopedia of Statistical Sciences, Vol. 7. Wiley, New York, pp. 296–299. Viveros, R., Balakrishnan, N. (1994). Interval estimation of life characteristics from progressively censored data. Technometrics 36, 84–91. Watson, G.S., Leadbetter, M.R. (1964a). Hazard analysis I. Biometrika 51, 175–184. Watson, G.S., Leadbetter, M.R. (1964b). Hazard analysis II. Sankhya Ser. A 26, 101–116.
13
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23013-3
Statistical Tests of the Equality of Survival Curves: Reconsidering the Options
Gabriel P. Suciu, Stanley Lemeshow and Melvin Moeschberger
1. Introduction Survival analysis has become one of the most widely used statistical tools for analyzing clinical research data. It is specifically concerned with time to event data and is of particular value because of its intrinsic ability to handle censored observations. Without question, use of the nonparametric survival curve estimators of Kaplan and Meier (1958) has become the scientific standard for comparing survival times of patients randomly assigned to the treatment arms of a clinical trial. For two treatments, the appearance of the Kaplan–Meier curves might be as in Figure 1. To determine whether there is a significant difference between the two survival curves a medical researcher can use confidence intervals or hypothesis testing. Although confidence intervals can often be used for inferences and in many cases are more desirable, we will focus only on hypothesis testing procedures because of their widespread use in the health science literature.
Fig. 1. Kaplan–Meier survival curves for two hypothetical treatments. 251
252
G.P. Suciu, S. Lemeshow and M. Moeschberger
Hypothesis tests of the equality of Kaplan–Meyer survival curves is typically accomplished using one of the available methods designed for this purpose but, without doubt, the logrank test (Le, 1997) is the one most commonly used. Perhaps the reason for the popularity of the logrank test rests in its ready availability in almost all statistical software packages. However, what many users do not appreciate is that the logrank test has very low power for some alternative hypotheses. Furthermore, the alternative hypothesis for which the logrank has good power may not be at all what the investigator has in mind. There are also assumptions that underlie the appropriate use of this test and these assumptions are often ignored (if not violated) and may lead to a test with very low power. In this paper we present the results of a systematic review of every article appearing in a major medical journal during a three year period (1999–2001). Of the 1749 main articles reviewed, 184 (10.5%) employed survival analysis and, of these, 127 articles used Kaplan–Meier estimates of survival distributions. Hypothesis testing for the equality of the survival curves was presented in 107 of these 127 articles (84.3%). None of these papers mentioned anything about the underlying alternative hypothesis of interest to the investigator or assumptions of the chosen test.
2. Underlying alternative hypothesis and assumptions Stating the null hypothesis that a set of survival curves is equal is relatively straightforward. However, specifying the alternative hypothesis of interest is considerably more difficult. An understanding of the nature of the difference the investigator hopes to detect determines the most powerful method to test the hypothesis of survival curve equality. If differences exist between the survival curves, these differences may occur in a variety of ways. For example, differences may exist in the early period of follow-up or in the late period of follow-up. Survival curves may even cross, suggesting possible differences in both early or late follow-up but not in the same direction. The statistical test chosen would not be the same for each of these possible alternative scenarios. Besides choosing the alternative hypothesis of interest, the investigator must be aware of the fact that the employment of a particular statistical test carries along certain underlying assumptions. The choice of the most adequate test depends on the sample sizes, the failure patterns, the censoring patterns, and the status of the hazards. One test does not fit all situations equally well. The statistical test must not be chosen simply because it is readily available in a statistical package. First, we shall address the assumption on the hazards. The basic quantity employed to describe time-to-event phenomena is the survival function, the probability of an individual surviving beyond time x (i.e., experiencing the event of interest after time x). It is defined as S(x) = Pr(X > x).
(1)
Alternative quantities are the hazard rate (or hazard function) which is the chance that an individual of age x experiences the event in the next instant in time, conditional on
Statistical tests of the equality of survival curves
having survived to that time, and the cumulative hazard function x H (x) = h(u) du = − ln S(x)
253
(2)
0
(see Hosmer and Lemeshow (1999), for a discussion of these quantities). If we know any one of these three quantities, then the remaining ones can be uniquely determined. Note that the Nelson–Aalen (Nelson (1972) and Aalen (1978)) estimator of H (t) is given by (t) = H
D di i=1
Yi
,
(3)
where di is the number of events at time ti and Yi is the number at risk at time ti , i = 1, . . . , D, the number of distinct event times. Although the survival function is displayed prominently in the health sciences literature, much of the statistical theory is developed through counting process theory (Andersen et al. (1992) and Fleming and Harrington (1991)) using the Nelson–Aalen estimator of the hazard function. Consequently, much of our discussion will focus on the hazard function (which automatically determines and is determined by the survival function). In the case of two treatments, hA (t) = θ (t)hB (t), where θ (t) is a function of a parameter that may depend upon time. For this case, a common assumption made is hA (t) = θ hB (t)
(4)
where hA (t) and hB (t) are the hazards for treatments A and B, respectively, and θ is a constant, i.e., θ does not depend upon time. In this case, the hazards are said to be proportional. Alternative hypotheses when θ does not depend upon time are said to follow a Lehmann alternative. If θ (t) depends upon time, then alternative hypotheses are said to follow a non-Lehmann alternative. Let us suppose that B is the standard treatment, and denote hB (t) by h0 (t). Then the hazard for the new treatment A is hA (t) = θ h0 (t). Cox (1972) considered θ = exp(β). This can be detailed, in a regression context, by defining an indicator variable, Z, which takes the value zero if an individual is on the standard treatment B and unity if an individual is on the new treatment A. If zi is the realization of Z for the ith individual in the study, i = 1, 2, . . . , N , the hazard function for this individual can be written hi (t) = exp(βz)h0 (t).
(5)
This is the Cox proportional hazards model for two treatments. From Eqs. (4) and (5), θ = exp(β) is called the hazard ratio. For this model, the hazard ratio, θ (sometimes called the relative risk), is constant, and does not depend on time. If there are no ties in the data, then the score test using the Cox model is equivalent to the log rank test. Thus the proportional hazards assumption is an important underlying ingredient in this instance. Often the Cox model is employed when there are several variables to be modeled. Figure 2 presents examples of the behavior of the cumulative hazards under two types of alternatives: “Lehmann” and “non-Lehmann”. In this figure the plots on the left-hand
254
G.P. Suciu, S. Lemeshow and M. Moeschberger
Fig. 2. Most typical situations for the cumulative hazard and estimated cumulative hazard plots for two groups: (a) and (b) proportional hazard assumption (Lehmann alternative); (c) and (d) divergent hazard assumptions (non-proportional hazard and non-Lehmann alternative); (e) and (f) crossing hazard situation (non-proportional hazard and non-Lehmann alternative). Population cumulative hazards are in panels (a), (c), and (e) and estimated cumulative hazards are in panels (b), (d), and (f).
side represent the population cumulative hazards, while those on the right-hand side represent possible realizations of the estimated cumulative hazards. For the Lehmann alternative, panel (a) suggests the proportional hazards pattern (i.e., θ = h1 (t)/ h2 (t) = H1 (t)/H2 (t) is a constant), while panel (b) illustrates the possible estimated cumulative hazards pattern (Nelson–Aalen estimates). These cumulative hazards estimates differ proportionally by the constant θ . The non-Lehmann alternatives are illustrated in panels (c)–(f), (i.e., divergent hazard plots (panel (c)), and the Nelson–Aalen estimated cumulative hazard plots (panel (d)), while panels (e)–(f) illustrate the crossing hazards situation. We define the most appropriate test as the one with the best efficiency or highest power for a specific clinical trial situation where the researcher selects the alternative hypothesis of interest. We note here that the most appropriate test might correctly reject the null hypothesis, while the tests which were not found to be the most appropriate
Statistical tests of the equality of survival curves
255
might fail to do so. The fact is that a researcher never knows in advance what type of situation can evolve from his/her data. However, what alternative hypothesis the researcher wishes to detect should be considered before the data are examined. The test of equal survival functions (or, equivalently, hazard functions) is unlike the test of two means using a t-test since, in the case of means, there is no time dependency (i.e., the difference is constant). Differences between survival functions (or hazard functions) depend upon time. The alternative hypothesis for a two-sided t-test is that the means are simply different whereas the alternative hypothesis when testing equality of survival curves has considerably more options since the curves may differ in the early part or the latter part of the follow up period or they may flare out or cross. The idea of using only one type of test for all nonparametric survival comparisons can clearly lead to wrong conclusions. We will illustrate our study findings under the general classes of hypotheses (Figure 2), the Lehmann (a)–(b), and non-Lehmann ((c)–(f)) alternatives. In this discussion, we will use the more commonly used descriptors “proportional hazards” and “nonproportional hazards”, respectively. The non-proportional hazards case includes the divergent and crossing hazards cases. Recall that in our survey of papers in a major medical journal we found that Kaplan– Meier survival curves were presented 127 times. Of these, 55 (43.3%) were assessed by one of us (GS) to involve crossing survival curves while in the remaining 72 papers the survival curves did not cross. The crossing situation we are concerned with here is the one resulting from the cumulative hazard curves crossing or, equivalently, the survival curves crossing. Table 1 presents the frequencies of the tests used, in crossing and non-crossing situations, in the 107 papers that employed hypothesis testing. We abbreviated the tests names as follows: LGR = logrank test, SLGR = stratified logrank test, GW = generalized Wilcoxon, P&P = Peto and Peto test, and COX = Cox partial likelihood test. The Tarone and Ware test and the Fleming and Harrington test were not used at all. These abbreviations will be used throughout this paper. Among these 107 papers, 95 (88.8%) used the logrank test. However, as introduced in Section 1, none of these papers mentioned anything about the underlying alternative hypothesis of interest to the investigator (early or late failure) or assumptions of the chosen test (proportional or non-proportional hazards). Table 1 Distribution of statistical tests used for comparing survival curves in a major medical journal (1999–2001) Test
Non-crossing survival
Crossing survival
Total
LGR SLGR GW P&P COX Other
53 1 1 1 − 1
42 2 3 − 1 2
95 3 4 1 1 3
Total
57
50
107
% 88.8 2.8 3.7 0.9 0.9 2.8
256
G.P. Suciu, S. Lemeshow and M. Moeschberger
3. An overview of available tests As a background to understanding a comparison of K (> 1) hazard rates we shall examine a variety of tests, as detailed in Klein and Moeschberger (2003), that look at weighted differences between the observed and expected hazard rates. The weights will allow us to put more emphasis on certain parts of the curves. Different weights will allow us to present tests that are most sensitive to early or late departures from the hypothesized relationship between samples as specified by the null hypothesis. In general, we wish to test the following set of hypotheses: H0 : h1 (t) = h2 (t) = · · · = hk (t),
for all t τ, versus
HA : at least one of the hj (t)’s is different for some t τ.
(6)
Our inference is to the hazard rates for all time points less than τ , which is typically the largest time on study. The alternative hypothesis is a global one in that we wish to reject the null hypothesis if at least one of the hazards differs from the others at some point in time. The data available to test the hypothesis (6) consist of independent right-censored and possibly left-truncated samples for each of the K populations. Let t1 < t2 < · · · < tD be the distinct death times in the pooled sample. At time ti we observe dij events out of Yij individuals at risk in the j th sample, j = 1, . . . , K, i = 1, . . . , D. Let di = K j =1 dij K and Yi = j =1 Yij be the number of deaths and number at risk in the combined sample at time ti , i = 1, . . . , D. If the null hypothesis is true, then an estimator of the expected hazard rate in the j th population under the null hypothesis is the pooled sample estimator of the hazard rate, di /Yi . Using data from the j th sample alone, the estimator of the hazard rate is dij /Yij . W (ti ) will be a common weight shared by each group and Yij is the number at risk at time ti in the j th group. With this choice of weight function we have
D di , j = 1, . . . , K. Zj (τ ) = (7) W (ti ) dij − Yij Yi i=1
This test statistic is the sum of the weighted differences between the observed number of deaths in the j th sample, dij , and the expected number of deaths under H0 in the j th sample, Yij (di /Yi ), similar to the pivotal quantity used in a traditional chi-square goodness-of-fit test. The expected number of deaths in sample j at ti is the proportion of individuals at risk at time ti that are in sample j , Yij /Yi , multiplied by the number of deaths at time ti . If all the Zj (τ )’s are close to zero then there is little evidence to believe that the null hypothesis is false, while if one of the Zj (τ )’s is far from zero then there is evidence that this population has a hazard rate which differs from that expected under the null hypothesis. The variance of Zj (τ ) in (7) is given by D Yij Yij Yi − di σˆ jj = (8) di , j = 1, . . . , K 1− W (tj )2 Yi Yi Yi − 1 i=1
Statistical tests of the equality of survival curves
257
and the covariance of Zj (τ ), Zg (τ ) is σˆ jg =
D
W (ti )
i=1
2 Yij
Yi
Yig Yi
Yi − di di , Yi − 1
g = j.
(9)
KThe components of the vector (Z1 (τ ), . . . , ZK (τ )) are linearly dependent since j =1 Zj (τ ) is zero. The test statistic is constructed by selecting any K − 1 of the Zj ’s. The estimated variance-covariance matrix of these statistics is given by the (K − 1) × (K1 ) matrix, Σ, formed by the appropriate σˆ jg ’s. The test statistic is given by the quadratic form t χ 2 = Z1 (τ ), . . . , ZK−1 (τ ) Σ −1 Z1 (τ ), . . . , ZK−1 (τ ) . (10) When the null hypothesis is true this statistic has, for large samples, a chi-squared distribution with K − 1 degrees of freedom. An α-level test of H0 rejects when χ 2 is larger than the αth upper percentage point of a chi-squared random variable with K − 1 degrees of freedom. Many weight functions have been proposed in the literature. The most common weight function, W (t) = 1 for all t, leads to the popular logrank test (LGR in Table 1), which has optimum power to detect alternatives where the hazard rates in the K populations are proportional to each other and is readily available in most statistical packages. (Although we have broken out the Cox test (COX in Table 1), the score test version employing the Cox proportional hazards model is essentially a logrank test when there are no ties present in the data. The stratified logrank test (SLGR in Table 1) allows different baseline hazard functions in the individual levels defined by the stratification variable.) Investigators need to realize that there are many other choices of weight functions that may better represent the investigator’s choice of an alternative hypothesis. For example, a second choice of weights is W (ti ) = Yi , the number at risk at time ti . This weight function yields Gehan’s (1965) generalization of the two sample Mann– Whitney–Wilcoxon test and Breslow’s (1970) generalization of the Kruskal–Wallis test (GW in Table 1) (e.g., see Collett (1994), Allison (1995); Kleinbaum (1996)). Tarone and Ware (1977) generalize to a class of tests where the weight function is 1/2 W (ti ) = f (Yi ), where f is a fixed function. They suggest a choice of f (Yi ) = Yi . The latter two classes of weights give more weight to differences between the observed and expected number of deaths in sample j at time points where there is the most data, i.e., where there are the most people in the study, usually early on in the study. An alternate censored data version of the Mann–Whitney–Wilcoxon test was presented by Peto and Peto (1972) and Kalbfleish and Prentice (1980). They define an estimate of the common survival function by
di
S(t) = (11) , 1− Yi + 1 ti t
which is close to the pooled product–limit estimator and they suggest using W (ti ) =
S(ti ) (P&P in Table 1).
258
G.P. Suciu, S. Lemeshow and M. Moeschberger
Fleming and Harrington (1981) propose a very general class of tests that includes many of the tests discussed here as special cases. The generalized Fleming–Harrington test defines S(t) to be the product–limit estimator of the survival function (1) based on the combined sample, and their weight function is defined as q Wp,q (ti ) = (12) S(ti−1 ) , p 0, q 0. S(ti−1 )p 1 − Here the survival function at the previous death time is used as a weight to ensure that these weights are known just prior to the time at which the comparison is to be made. For this class when p = q = 0 we have the logrank test. When p = 1, q = 0 we have a version of the Mann–Whitney–Wilcoxon test. When q = 0 and p > 0 these tests give the most weight to early differences between the hazard rates in the K populations, while when p = 0 and q > 0 these tests give most weight to differences which occur late in time. By an appropriate choice of p and q one can construct tests that have the most power against alternatives that have the K hazard rates differing over any desired region. Table 2 presents a summary of these weights. See Klein and Moeschberger (2003) for more details and a worked out example illustrating the different weights. It is fairly well understood, and survival analysis textbooks agree on these points, that the logrank test is most powerful for situations where equal weight, i.e., W (ti ) = 1, is given to all points on the entire curve and the proportional hazards assumption holds. Fleming et al. (1980) showed that the logrank and the generalized Wilcoxon test are insensitive to non-Lehmann alternatives and have poor statistical power under this assumption and heavily censored data. Despite the very rich arsenal of nonparametric survival comparison methods, no one has heretofore documented a general set of rules for deciding which is best under specific conditions. This formidable task is beyond the scope of this paper. Latta (1981) has performed a simulation study that indicated that the test with the greatest power changes with the sample sizes, censoring mechanism and distribution of the random variables of interest. Complicating this process even further is the fact that the non-proportional Table 2 Comparison of K sample tests Test Logrank Gehan Tarone–Ware Peto–Peto Modified Peto–Peto Fleming–Harrington p = 0, q = 1 Fleming–Harrington p = 1, q = 0 Fleming–Harrington p = 1, q = 1 Fleming–Harrington p = 0.5, q = 0.5 Fleming–Harrington p = 0.5, q = 2
W (ti ) 1 Yi 1/2
Yi
S(ti )
S(ti )Yi /(Yi + 1) [1 − S(ti−1 )] S(ti−1 ) S(ti−1 )[1 − S(ti−1 )] 0.5 S(ti−1 ) [1 − S(ti−1 )]0.5 S(ti−1 )0.5 [1 − S(ti−1 )]2
Statistical tests of the equality of survival curves
259
hazard alternatives constitute a heterogeneous family and, as such, discourage generalizations. We will point to several authors who have attempted to deal with these complex issues, e.g., Fleming et al. (1980), Stablein et al. (1981), Schoenfeld (1981), Fleming and Harrington (1981), Harrington and Fleming (1982), Schumacher (1984), Gastwirth (1985), Fleming et al. (1987), Liu (1993), and Kosorok and Lin (1999). The proportional hazard alternative was studied by Lee et al. (1975), Lininger et al. (1979), Beltangady and Frankowski (1989), and Jones and Crowley (1990). Mantel and Stablein (1988) have perhaps summarized this dilemma best when they state: “When non-constant hazard ratios exist multiple analyses may be required, and each has attendant advantages and disadvantages”. However, all authors agree that the proportional hazards assumption must be assessed and dealt with if it is not true. An important contribution to our understanding of the available testing options is due to Leurgans (1983, 1984). She concluded that when the hazards cross, use of the logrank test is not appropriate. Instead, she advocates using a class of tests that are approximately unbiased. The efficiencies depend upon alternatives and the censoring distribution. Under the proportional hazard assumption, the logrank test has the best efficiency, followed by the Peto and Peto test and the generalized Wilcoxon test. The results of Leurgans (1983) are consistent with those of Lee et al. (1975). A number of textbooks (e.g., Miller (1998), Collett (1994), Kleinbaum (1996), Klein and Moeschberger (2003), Hosmer and Lemeshow (1999)) also compare the available tests. It is clear from all of these sources that it is not suitable to decide a priori on which test should be used before we consider the alternative hypothesis. Even when the alternative hypothesis is decided upon, the type of test depends on the expected differences, the distribution of failures, the distribution of censoring, and the sample size. Again, it matters what test is used because each test is optimal for a specific set of conditions, and therefore those situations must be analyzed. 4. Hypothesis testing and statistical computer packages Most statistical packages automatically provide tests for any comparison of Kaplan– Meier curves. As of the writing of this manuscript, the following tests are provided by the most commonly used statistical packages: • SAS uses the logrank, the generalized Wilcoxon, and the Cox partial likelihood tests; • SPSS uses the logrank, the Breslow (the generalized Kruskal–Wallis test for Gehan– Wilcoxon test), and the Tarone–Ware tests; • S-plus uses the Fleming–Harrington class of specific tests defined for the power p when q = 0. For example, p = 0 results in the logrank test and p = 1 results in the Peto and Peto test; • MINITAB provides the logrank and the generalized Wilcoxon tests; • SYSTAT provides the Mantel (logrank), Breslow–Gehan (generalized Wilcoxon), and Tarone–Ware tests; • STATA provides perhaps the greatest variety of tests. The default test is the logrank test but researchers can request just about any of the available tests outlined here as well as the Cox partial likelihood test.
260
G.P. Suciu, S. Lemeshow and M. Moeschberger
Table 3 Non-crossing survival Tests
LGR
SLGR
GW
P&P
Other
Total
%
Appropriate Inappropriate or questionable
24 29
– 1
1 –
– 1
– 1
25 32
43.9 56.1
Total
53
1
1
1
1
57
LGR
SLGR
GW
P&P
Cox
Other
Total
%
Appropriate Inappropriate or questionable
– 42
2 –
– 3
– –
– 1
– 2
2 48
4.0 96.0
Total
42
2
3
–
1
2
50
Table 4 Crossing survival Tests
5. Applications to papers from major medical journal We now return to the papers studied (by GS) from the medical literature. Tables 3 and 4 summarize our findings. All tests performed in the articles were classified as appropriate, inappropriate or questionable. A test is appropriate if it is used according to its known properties, and whether or not the specific survival pattern appears to satisfy the assumptions of the chosen test. Alternatively, a test is inappropriate or questionable if there is not an obvious match between the test properties and the specific assumptions that would be optimal for justifying the use of the chosen test. Because we did not have access to the raw data in each paper, our determinations were based on the limited information available. However, often the information that was available was sufficient to make an informed determination. The fact that we can only identify 43.9% and 4% of the tests performed in the case of non-crossing survival curves and crossing survival curves, respectively, is a source of great concern. We feel strongly that this apparent inappropriate use of statistical tests is a manifestation of the reliance on the availability of software that performs survival analyses rather than on the carefully thought out issues of what is actually most appropriate in a given analytical situation.
6. Suggested guidelines The following suggested steps are desirable for a nonparametric survival comparison: (1) State the type of differences of interest in the research study (i.e., is the investigator interested in testing for early or later differences) prior to doing any analysis. This consideration leads to the alternative hypothesis. (2) Plot the estimated Kaplan–Meier survival plots.
Statistical tests of the equality of survival curves
(3) (4) (5) (6)
261
Plot the corresponding Nelson–Aalen cumulative hazard plots. Check visually if the hazards cross. Test the proportional hazards assumption. Choose the test that most appropriately satisfies the above conditions.
7. Discussion and conclusions In this paper we have attempted to present arguments against a researcher using a statistical test for the equality of survival curves solely on the basis that it is easily available in a statistical software package. The literature review on a non-homogeneous set of papers from the medical literature was extensive and time consuming. Our observation that the logrank test was used almost 90% of the time when testing was done is surprising and potentially embarrassing, as it displays ignorance of known theory for the use of the available tests of equality of nonparametric survival curves. There are no general tests that can fit all comparisons, therefore the testing should be performed according to the alternative hypothesis of interest and initial conditions/characteristics of the groups.
References Aalen, O.O. (1978). Nonparametric inference for a family of counting processes. Ann. Statist. 6, 701–726. Allison, P.D. (1995). Survival Analysis Using the SAS System. A Practical Guide. SAS Institute, Cary, NC. Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N. (1992). Statistical Models Based on Counting Processes. Springer, New York. Beltangady, M.S., Frankowski, R.F. (1989). Effect of unequal censoring on the size and power of the logrank and Wilcoxon types of tests for survival data. Statist. Med. 8, 937–945. Breslow, N. (1970). A generalized Kruskal–Wallis test for comparing K samples subject to unequal patterns of censorship. Biometrika 57 (3), 579–594. Collett, D. (1994). Modeling Survival Data in Medical Research. Chapman & Hall, London. Reprinted 1996. Cox, D.R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. B 34 (2), 187–220. Fleming, T.R., Harrington, D.P. (1981). A class of hypothesis tests for one and two sample censored survival data. Comm. Statist. Theory Meth. A 10 (8), 763–794. Fleming, T.R., Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York. Fleming, T.R., Harrington, D.P., O’Sullivan, M. (1987). Supremum versions of the log-rank and generalized Wilcoxon statistics. JASA 82 (397), 312–320. Fleming, T.R., O’Fallon, J.R., O’Brien, P.C., Harrington, D.P. (1980). Modified Kolmogorov–Smirnov test procedures with application to arbitrarily right-censored data. Biometrics 36, 607–625. Gastwirth, J.L. (1985). The use of maximum efficiency robust tests in combining contingency tables and survival analysis. JASA 80, 380–384. Gehan, E.A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52 (1–2), 203–223. Harrington, D.P., Fleming, T.R. (1982). A class of rank test procedures for censored survival data. Biometrika 69 (3), 553–566. Hosmer Jr, D.W., Lemeshow, S. (1999). Applied Survival Analysis. Regression Modeling of Time to Event Data. Wiley, New York. Jones, M.P., Crowley, J. (1990). Asymptotic properties of a general class of nonparametric tests for survival analysis. Ann. Statist. 18 (3), 1203–1220.
262
G.P. Suciu, S. Lemeshow and M. Moeschberger
Kalbfleish, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. Kaplan, E.L., Meier, P. (1958). Nonparametric Estimation from incomplete observations. JASA 53 (282), 457–481. Klein, J.K., Moeschberger, M.L. (2003). Survival Analysis. Techniques for Censored and Truncated Data, 2nd edn. Springer, Berlin. Kleinbaum, D.G. (1996). Survival Analysis. A Self-Learning Text. Springer, Berlin. Kosorok, M.R., Lin, C.-Y. (1999). The versatility of function-indexed weighted log-rank statistics. JASA 94 (445), 320–325. Latta, R.B. (1981). A Monte Carlo study of some two-sample rank tests with censored data. JASA 76 (375), 713–719. Le, C.T. (1997). Applied Survival Analysis. Wiley, New York. Lee, E.T., Desu, M.M., Gehan, E.A. (1975). A Monte Carlo study of the power of some two-sample tests. Biometrika 62 (2), 425–432. Leurgans, S. (1983). Three classes of censored data rank tests: Strengths and weaknesses under censoring. Biometrika 70 (3), 651–658. Leurgans, S. (1984). Asymptotic behavior of two-sample rank tests in the presence of random censoring. Ann. Statist. 12 (2), 572–589. Lininger, L., Gail, M.H., Green, S.B., Byar, D.P. (1979). Comparison of four tests for equality of survival curves in the presence of stratification and censoring. Biometrika 66 (3), 419–428. Liu, P.Y., Green, S., Wolf, M., Crowley, J. (1993). Testing against alternatives for censored survival data. JASA 88 (421), 153–160. Mantel, N., Stablein, D.M. (1988). The crossing hazard function problem. The Statistician 37, 59–64. Miller Jr, R.G. (1998). Survival Analysis (1981). Wiley, New York. 81–118. Nelson, W. (1972). A short life test for comparing a sample with previous accelerated test results. Technometrics 14 (1), 175–185. Peto, R., Peto, J. (1972). Asymptotically efficient rank invariant test procedures. J. Roy. Statist. Soc. A 135 (2), 185–207. Schoenfeld, D. (1981). The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 68, 316–319. Schumacher, M. (1984). Two-sample tests of Cramer–von Mises and Kolmogorov–Smirnov-type for randomly censored data. Internat. Statist. Rev. 52 (3), 263–281. Stablein, D.M., Carter Jr, W.H., Novak, J.W. (1981). Analysis of survival data with nonproportional hazard functions. Controlled Clin. Trials 2, 149–159. Tarone, R.E., Ware, J. (1977). On distribution-free tests for equality of survival distributions. Biometrika 64 (1), 156–160.
14
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23014-5
Testing Equality of Survival Functions with Bivariate Censored Data: A Review
P.V. Rao
1. Introduction Let (X∗ , Y ∗ ) be a vector of bivariate survival times that are subject to right censoring by independent censoring variables Cx and Cy . Following O’Brien and Fleming (1987), we assume C = min{Cx , Cy } to be the common censoring time for both members in the pair. O’Brien and Fleming (1987) argue that it is desirable to use the common censoring time convention in order to eliminate bias that may arise when censoring and survival times are not independent. Let SX∗ and SY ∗ denote the marginal survival functions of X∗ and Y ∗ , respectively. In this chapter, we review some standard nonparametric methods for testing the null hypothesis H0 : SX∗ = SY ∗ . We focus on nonparametric tests because, for analyzing survival data encountered in medical and epidemiological applications, it is difficult to justify, either theoretically or on empirical evidence, the assumption of specific distributional forms for survival and censoring times. Bivariate survival times are observed frequently in practice. They may arise by design as in the leukemia remission data in Table 1 where the study subjects were paired on the basis of some shared characteristics (Lachenbruch et al., 1982), or they may occur naturally as in Table 2 which shows the survival times of closely and poorly matched human lymphocyte antigen (HLA) skin grafts on 16 burn patients (Batchelor and Hackett, 1970). Note that there are 5 pairs with missing values. Technical difficulties resulted in unpaired observations for subjects 1, 2, 3, 6, and 14. This chapter is organized as follows. Since many tests for paired censored data are patterned after the corresponding tests for paired uncensored data, Section 2 is devoted to a brief discussion of two common approaches for testing H0 in the uncensored data setting. In one of these approaches, tests of H0 are constructed by regarding the observed within-pair differences as a random sample from a population of differences. The test statistic is a function of the observed within-pair differences. Tests of this type will be called within-pair difference tests. In the other approach, the n paired observations are regarded as two samples – an X-sample and a Y -sample – of n observations each, and tests of H0 are based on statistics that are functions of the observed values of X∗ and Y ∗ . The resulting tests will be called pooled sample tests. 263
264
P.V. Rao Table 1 Weeks of survival (censored observations are indicated by + ) Subject
Control
Treatment
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1 22 3 2 8 17 2 11 8 12 2 5 4 15 8 23 5 11 4 1 8
10 7 22+ 23 22 6 16 34+ 32 25+ 11+ 20+ 19+ 6 17+ 35+ 6 13 9+ 6+ 10+
Table 2 Days of survival (censored observations are indicated by + ) Subject
Close match
Poor match
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
··· 24 ··· 37 19 ··· 57+ 93 16 22 20 18 63 ··· 29 60+
19 ··· 18 29 13 19 15 26 11 17 26 21 43 28+ 15 40
Generalizations of the uncensored within-pair difference tests to censored data situations are presented in Section 3. Section 4 describes some pooled sample tests for censored paired data. Finally, Section 5 describes a method that can be used for testing
Testing equality of survival functions
265
H0 in the presence of missing observations and Section 6 provides an overall assessment of the tests described in this chapter.
2. Testing H0 with uncensored paired data Within pair difference tests. Let (Xj∗ , Yj∗ ), j = 1, 2, . . . , n; be independent observations of the bivariate random variable (X∗ , Y ∗ ). For the time being, we do not restrict X∗ and Y ∗ to be non-negative. Consequences of restricting the variables to be non-negative – a necessity when dealing with survival times – will be discussed later in this section. As already noted, a within-pair difference test is a test based on a statistic that is a function of the observed values of the within-pair differences D ∗ = X∗ − Y ∗ , whereas a pooled sample test is based on a statistic which utilizes a function of the observations of individual X∗ and Y ∗ . Nonparametric within-pair difference tests are usually designed for testing H0 under the paired sample shift model: Xj∗ = ∆ + Bj + E1j ,
Yj∗ = Bj + E2j .
(1)
Here ∆ is a scalar parameter, Bj is the effect of pair j , and Eij is the error. It is assumed that the Bj are independent identically distributed (i.i.d.) with distribution FB and the Eij are i.i.d. with distribution FE . Model (1) implies that the j th within-pair difference can be expressed as Dj∗ = ∆ + E1j − E2j = ∆ + Vj ,
(2)
showing that the within-pair differences are i.i.d. with a distribution symmetric about ∆. Thus, under model (1), H0 may be tested by testing the null hypothesis that D ∗ has a symmetric distribution with median ∆ = 0. The most common parametric test for the center of symmetry of the distribution of D∗ is the paired sample t-test. This test has many desirable properties when D ∗ has a normal distribution. Among the nonparametric tests of H0 , a class of tests known as linear signed rank tests are the most popular. Several properties of these tests and their relationship to the paired t-test are discussed in Chapter 10 of Randles and Wolfe (1991). Let ψj be defined as ψj = 1 or 0 for Dj∗ 0 or Dj∗ < 0, respectively, and let Rj+ be the rank of |Dj∗ | among {|D1∗ |, . . . , |Dn∗ |}. Then a linear signed rank statistic has the form T=
n
(2ψj − 1)a Rj+ ,
(3)
j =1
where a(·) is a scoring function and (2ψj − 1)a(Rj+ ) is the score for Dj∗ . Thus a linear signed rank statistic is the sum of the scores assigned to the within-pair differences. In practice, the score function is selected to optimize the test when D ∗ has a specific distribution. Scores leading to locally most powerful tests are often called efficient
266
P.V. Rao
scores and it is known that linear signed rank tests with efficient scores have high efficiency relative to paired t-test when the scores are based on the correct distribution of D ∗ (Randles and Wolfe, 1991). More importantly, choice of scores based on an incorrect distribution for D ∗ does not result in serious efficiency loss except in some extreme cases. In addition, compared to the paired t-test, linear signed rank tests are known to be robust in the sense that the tests tend to hold their level and power even when some observations are subject to gross errors. Two of the most common linear signed rank tests are the sign test (a(u) = 1) and the Wilcoxon signed rank test (a(u) = u). The corresponding test statistics are sign test: S=
n
(2ψj − 1),
(4)
j =1
and signed rank test: Ws =
n (2ψj − 1)Rj+ .
(5)
j =1
Pooled sample tests. Linear signed rank statistics, being functions of within-pair differences Dj∗ , utilize only the intra-pair information in the sample. Pooled sample tests are tests that use inter-pair information also. The most common nonparametric pooled sample tests are linear rank tests. These tests are originally designed for testing ∆ = 0 in the independent sample shift model: Xj∗ = ∆ + E1j ,
Yj∗ = E2j .
(6)
Here the Eij are i.i.d. with distribution FE . ∗ , . . . , R ∗ ) be the ranks of (X ∗ , . . . , X ∗ , Y ∗ , . . . , Y ∗ ) in the Let (R1∗ , . . . , Rn∗ , Rn+1 n 1 n 2n 1 combined ranking of the pooled sample. Then a linear rank statistic has the form T=
2n
cj a Rj∗ ,
(7)
j =1
where the cj are constants and a(·) is a scoring function that assigns a score to each observation based on its rank in the pooled sample. Typical scoring functions in (7) are selected to optimize efficiency of the test of the null hypothesis ∆ = 0 in the independent sample shift model (6). An important example of a linear rank statistic is the Wilcoxon rank sum statistic Wr =
\sum_{j=1}^{n} R_j^*,   (8)
resulting from setting a(u) = u and c_j = 1 if 1 \le j \le n and c_j = 0 otherwise. An example of the use of a pooled sample test for paired data is found in Lam and Longnecker (1983). These authors investigate how the Wilcoxon rank sum statistic may be used for testing H_0 with paired samples. Other examples of pooled sample tests for uncensored paired data may be found in Hollander et al. (1974) and Conover and Iman (1981).
Implementing nonparametric tests with paired data. For a large sample test of H_0, critical values of the statistics in (3) and (7) can be based on their asymptotic normal distributions under the null hypothesis. In standard situations, such as the Wilcoxon signed rank test, the asymptotic normal distribution can be completely specified. But in nonstandard cases, such as when we use the Wilcoxon rank sum statistic with paired data, or when we adapt the uncensored data tests to censored data as described in the sequel, the variance of the asymptotic distribution may have to be consistently estimated from the data. For small samples, exact tests of H_0 can be based on the conditional permutation null distributions of T in (3) or (7). These distributions are obtained utilizing the fact that, under model (1), the null distributions of (X_j^*, Y_j^*) and (Y_j^*, X_j^*) are identical. Hence, conditional on the observed values of (X_j^*, Y_j^*), T has a uniform distribution on the set of all 2^n values of T that can be generated by permuting the x, y values within the n pairs. Calculating the exact p-values may be computationally intensive, but statistical software such as StatXact is available for this purpose. A worked example of how within-pair and pooled sample tests are implemented in practice may help clarify the points discussed thus far. Suppose we want to test H_0 under model (1) with n = 4 pairs of observations: (8, 4),
(6, 7),
(16, 11),
(12, 9).
The calculations needed for the Wilcoxon signed rank test are:
(D_1^*, D_2^*, D_3^*, D_4^*) = (4, -1, 5, 3),   (\psi_1, \psi_2, \psi_3, \psi_4) = (1, 0, 1, 1),
(R_1^+, R_2^+, R_3^+, R_4^+) = (3, 1, 4, 2).
The test statistic equals W_s = (2 \times 1 - 1)(3) + (2 \times 0 - 1)(1) + (2 \times 1 - 1)(4) + (2 \times 1 - 1)(2) = 8. Since the null distribution of W_s is asymptotically normal with mean and variance (Randles and Wolfe, 1991)
E_0(W_s) = 0,   V_0(W_s) = \frac{n(n+1)(2n+1)}{24} = 7.5,
a large sample test of H_0 can be based on the z-score
z = \frac{W_s - E_0(W_s)}{\sqrt{V_0(W_s)}} = \frac{8}{\sqrt{7.5}} = 2.92.
The resulting one-sided p-value is 0.0018. An exact test of H_0 can be based on the permutation distribution of W_s. Table 3 lists the vector (\psi_1, \psi_2, \psi_3, \psi_4) corresponding to the 2^4 = 16 within-pair permutations of the sample. The table also lists the values of the signed rank statistic generated by these permutations. Clearly, the permutational p-value is Pr\{W_s \ge 8\} = 2/16 = 0.125.
Table 3
The W_s-values generated by the 2^4 permutations

ψ1  ψ2  ψ3  ψ4   W_s
1   1   1   1     10
1   1   1   0      6
1   1   0   1      2
1   1   0   0     -2
1   0   1   1      8
1   0   1   0      4
1   0   0   1      0
1   0   0   0     -4
0   1   1   1      4
0   1   1   0      0
0   1   0   1     -4
0   1   0   0     -8
0   0   1   1      2
0   0   1   0     -2
0   0   0   1     -6
0   0   0   0    -10
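To make the mechanics concrete, the following short Python sketch (ours, not part of the original chapter) reproduces the calculations above: the signed rank statistic, its large-sample z-score, and the exact permutation p-value over the 2^4 sign patterns of Table 3.

```python
from itertools import product
import math

d = [4, -1, 5, 3]                                    # within-pair differences D_j
order = sorted(range(len(d)), key=lambda j: abs(d[j]))
R = [0] * len(d)
for rank, j in enumerate(order, start=1):            # ranks of |D_j|: (3, 1, 4, 2)
    R[j] = rank

Ws = sum((1 if dj >= 0 else -1) * r for dj, r in zip(d, R))           # = 8

n = len(d)
z = Ws / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)                    # about 2.92

# Exact test: 2^n equally likely sign patterns under H0, as in Table 3
perms = [sum((2 * psi - 1) * r for psi, r in zip(pattern, R))
         for pattern in product([1, 0], repeat=n)]
p_exact = sum(w >= Ws for w in perms) / len(perms)                    # 2/16 = 0.125
print(Ws, round(z, 2), p_exact)
```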
If H_0 is tested with W_r in (8), the large sample test can be based on the result of Lam and Longnecker (1983) that under model (1) the null distribution of W_r is asymptotically normal with mean
E_0(W_r) = \frac{n(2n+1)}{2},
and a variance that can be estimated by
\hat V_0(W_r) = \frac{n^2(2n+1)}{12}\left[1 - \frac{12}{n(n^2-1)}\sum_{j=1}^{n} S_j T_j + \frac{3(n+1)}{n-1}\right],
where S_j (T_j) is the rank of X_j^* (Y_j^*) in the X (Y) sample. The z-score for our data is
z = \frac{W_r - E_0(W_r)}{\sqrt{\hat V_0(W_r)}} = \frac{21 - 18}{\sqrt{2.4}} = 1.94,
yielding a one-sided p-value of 0.026. The p-value for a permutation test of H_0 based on W_r can be computed by considering the 16 possible values of W_r generated by permuting the X and Y values within pairs.
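A companion Python sketch (again ours rather than the chapter's) reproduces the pooled-sample calculation, using the Lam–Longnecker variance formula reconstructed above.

```python
import numpy as np
from scipy.stats import rankdata, norm

x = np.array([8, 6, 16, 12])      # X sample
y = np.array([4, 7, 11, 9])       # Y sample
n = len(x)

Wr = rankdata(np.concatenate([x, y]))[:n].sum()      # pooled-sample rank sum = 21
E0 = n * (2 * n + 1) / 2                             # = 18
S, T = rankdata(x), rankdata(y)                      # within-sample ranks
V0 = n**2 * (2 * n + 1) / 12 * (1 - 12 / (n * (n**2 - 1)) * (S * T).sum()
                                + 3 * (n + 1) / (n - 1))              # = 2.4
z = (Wr - E0) / np.sqrt(V0)                          # about 1.94
print(Wr, V0, round(z, 2), round(1 - norm.cdf(z), 3))
```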
Applications to survival data. Since survival times are nonnegative random variables, the paired sample shift model in (1) is not appropriate without additional restrictions on the parameters. One way around this difficulty is to use (1) to model the log of the survival times. The resulting log-linear model has the form
\log X_j^* = \Delta + B_j + E_{1j},   \log Y_j^* = B_j + E_{2j}.   (9)
Writing U_{ij} = \exp\{E_{ij}\}, i = 1, 2, and \theta = \exp\{\Delta\}, model (9) can be expressed as
X_j^* = \theta e^{B_j} U_{1j},   Y_j^* = e^{B_j} U_{2j},
showing that the log-linear model in (9) is a scale model for the survival times. Also, testing ∆ = 0 is the same as testing θ = 1 and tests based on the within-pair differences of logs, log Xj∗ − log Yj∗ , are tests based on within-pair ratios, Xj∗ /Yj∗ , of the observed values. In what follows, it should be understood that models such as those in (1) and (6), when used with survival times, refer to the log of the survival times, rather than the survival times themselves.
3. Within-pair difference tests with censored data
When censoring is a possibility, the survival variables (X_j^*, Y_j^*) may not be observable for all j. Let I(u) = 1 or 0 according as u \ge 0 or u < 0, and let C_j^* be the value of the censoring variable C^* for the j-th pair. The observed data may be represented as (X_j, Y_j, \delta_{xj}, \delta_{yj}), j = 1, \ldots, n, where
X_j = \min(X_j^*, C_j^*),   Y_j = \min(Y_j^*, C_j^*),   \delta_{xj} = I(C_j^* - X_j^*),   \delta_{yj} = I(C_j^* - Y_j^*).
Here (X_j, Y_j) and (\delta_{xj}, \delta_{yj}) are the observed value and the censoring status of (X_j^*, Y_j^*), respectively. Clearly, statistics of the form (3) cannot be used with censored data because neither the signs nor the ranks of the absolute values are available for all D_j^*. Let D_j = X_j - Y_j. Woolson and Lachenbruch (1980) noted that
|D_j^*| = |D_j|   if (\delta_{xj} = \delta_{yj} = 1),
|D_j^*| > |D_j|   if (\delta_{xj} = 1, \delta_{yj} = 0) or (\delta_{xj} = 0, \delta_{yj} = 1).   (10)
Hence if we discard all doubly censored pairs (\delta_{xj} = \delta_{yj} = 0) from the sample, the remaining observed absolute paired differences, |D_j| = |X_j - Y_j|, may be regarded as a right censored sample from |X^* - Y^*|. Woolson and Lachenbruch (1980) used the property in (10), and the work of Prentice (1978) on linear rank tests for censored data, to develop a class of tests that can be used for testing H_0 with paired right censored data. A brief description of the linear rank tests of Prentice (1978) will help better understand the rationale behind the Woolson–Lachenbruch and related tests described in the sequel. Fundamental in the work of Prentice (1978) is the definition of generalized rank vectors, a concept first suggested by Kalbfleisch and Prentice (1973). In this concept, a set of rank vectors, called generalized rank vectors, is associated with each censored sample. Every vector in this set is a rank vector for the uncensored data that is consistent with the observed censoring pattern. For example, in the censored sample
x_1 = 4,
x_2 = 6^+,   x_3 = 8,   x_4 = 10^+,
where a “+” denotes a censored observation, each of the three orderings of the uncensored observations: x1∗ < x2∗ < x3∗ < x4∗ ,
x1∗ < x3∗ < x2∗ < x4∗ ,
and x1∗ < x3∗ < x4∗ < x2∗ ,
is consistent with the observed censoring pattern. Hence the generalized rank vectors for this sample are (1, 2, 3, 4),
(1, 3, 2, 4) and (1, 4, 2, 3).
Assuming a log-linear model for the survival times, Prentice (1978) used the conditional probability of observing the generalized rank vectors, given the sample, to generate a class of censored data linear rank statistics that can be used for tests on regression constants. In the setting of two independent samples, that is, under model (6), Prentice's censored data linear rank statistics may be formulated as follows. Suppose there are k uncensored observations in the pooled sample. Denote their values by Z_{(0)} < Z_{(1)} < \cdots < Z_{(k)} < Z_{(k+1)}, where Z_{(0)} = 0 and Z_{(k+1)} = \infty. Also let c_{(1j)} denote the number of censored observations belonging to the first group in the interval [Z_{(j)}, Z_{(j+1)}), and c_{(j)} = 1 or 0 according as Z_{(j)} belongs to the first or the second sample. Prentice's (1978) linear rank statistic for comparing two independent censored samples has the form
T = \sum_{j=1}^{k} c_{(j)} s_j + \sum_{j=0}^{k} c_{(1j)} S_j,   (11)
where sj and Sj are scores assigned to Z(j ) and Z(j i) , respectively. In practice, the scores are selected to optimize the efficiency of the test when the errors Eij in (6) have specified distributions. Two important sets of scores arise in the independent sample setting of (6). If the Eij have logistic distributions, the efficient scores are the Prentice–Wilcoxon scores:
s_j = 1 - 2\prod_{i=1}^{j} \frac{n_i}{n_i + 1},   S_j = 1 - \prod_{i=1}^{j} \frac{n_i}{n_i + 1}.   (12)
If the E_{ij} have extreme value distributions, the efficient scores are the logrank scores:
s_j = \sum_{i=1}^{j} \frac{1}{n_i} - 1,   S_j = \sum_{i=1}^{j} \frac{1}{n_i}.   (13)
A third set of scores, sometimes referred to as the Gehan–Gilbert–Wilcoxon scores,
s_j = j - n_j,   S_j = j,   (14)
yields the Gehan–Gilbert (Gehan (1965) and Gilbert (1962)) generalization of the Wilcoxon rank sum test. The scores in (12), (13) and (14) are derived assuming that there are no ties among the failure times. If there are dj failures at Z(j ) , the score should be modified as
Prentice–Wilcoxon:
s_j = 1 - 2\prod_{i=1}^{j} \frac{n_i + d_i - 1}{n_i + 1},   S_j = 1 - \prod_{i=1}^{j} \frac{n_i + d_i - 1}{n_i + 1},
logrank:
s_j = \sum_{i=1}^{j} \frac{d_i}{n_i} - 1,   S_j = \sum_{i=1}^{j} \frac{d_i}{n_i},
Gehan–Gilbert–Wilcoxon:
s_j = \sum_{i=1}^{j} d_i - n_j,   S_j = \sum_{i=1}^{j} d_i.
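The following Python sketch (not from the chapter) computes the three sets of untied scores in (12)–(14) for the small censored sample used above. Here n_i is taken to be the number of observations still at risk just before Z_(i), the usual convention, stated as an assumption since the chapter does not restate the definition.

```python
import numpy as np

def censored_scores(time, event, kind="logrank"):
    """Scores s_j (events) and S_j (censored) of (12)-(14); no ties assumed.

    n_j is taken as the number at risk just before the j-th ordered event time."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    z = np.sort(time[event == 1])                     # ordered uncensored times
    n_at = np.array([(time >= t).sum() for t in z], float)
    if kind == "prentice-wilcoxon":
        prod = np.cumprod(n_at / (n_at + 1.0))
        s, S = 1 - 2 * prod, 1 - prod
    elif kind == "logrank":
        csum = np.cumsum(1.0 / n_at)
        s, S = csum - 1, csum
    else:                                             # Gehan-Gilbert-Wilcoxon
        j = np.arange(1, len(z) + 1, dtype=float)
        s, S = j - n_at, j
    scores = []
    for t, d in zip(time, event):
        j = int((z <= t).sum())                       # index of last event time <= t
        if d == 1:
            scores.append(s[j - 1])
        else:
            scores.append(S[j - 1] if j > 0 else 0.0)  # S_0 = 0: censored before Z_(1)
    return scores

# Sample from the text: 4, 6+, 8, 10+  ->  logrank scores [-0.75, 0.25, -0.25, 0.75]
print(censored_scores([4, 6, 8, 10], [1, 0, 1, 0], "logrank"))
```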
Woolson and Lachenbruch (1980) utilized the probability of observing the signs of the differences and the generalized ranks of the absolute differences in the reduced sample to develop a class of signed rank tests for testing H_0. The test statistics may be expressed as linear combinations of scores assigned to the observed differences in the reduced sample. Suppose there are k uncensored (\delta_{xj} = \delta_{yj} = 1) pairs in the reduced sample. Denote their ordered absolute differences by |D|_{(0)} < |D|_{(1)} < \cdots < |D|_{(k)} < |D|_{(k+1)}, where |D|_{(0)} = 0 and |D|_{(k+1)} = \infty. Let |D|_{(ji)}, i = 1, \ldots, n_j, denote the n_j singly censored (\delta_{xj} = 1, \delta_{yj} = 0 or \delta_{xj} = 0, \delta_{yj} = 1) absolute differences in the interval [|D|_{(j)}, |D|_{(j+1)}), and let \psi_{(j)} and \psi_{(ji)} denote the signs of the differences corresponding to |D|_{(j)} and |D|_{(ji)}, respectively. The Woolson–Lachenbruch statistic for paired censored data can be expressed in the form
T = \sum_{j=1}^{k} (2\psi_{(j)} - 1) s_j + \sum_{j=0}^{k} \sum_{i=1}^{n_j} (2\psi_{(ji)} - 1) S_j,   (15)
where sj and Sj are scores assigned to |D|(j ) and |D|(j i) , respectively. The scores may be selected to optimize the efficiency of the test when the error difference Vj in (2) has a specified distribution. Two special cases of T in (15) are of interest. If we use the efficient scores corresponding to a double exponential distribution for Vj in (1), the statistic in (15) is the sign statistic calculated from the reduced sample. Thus, T in (15) generalizes the uncensored data sign statistic to the censored case. If the efficient scores corresponding to logistic distribution is used in (15), the resulting statistic is a generalization of the signed rank statistic Ws in (5). Woolson and Lachenbruch (1980) and Dabrowska (1990) show that in large samples, tests of H0 can be based on the asymptotic normal distribution of T . Of course, exact conditional tests can be implemented as described in Section 2. See Woolson and Lachenbruch (1980) for an example illustrating the large and small sample tests for the skin graft data in Table 2. Popovich and Rao (1985) proposed a class of within-pair tests of H0 based on linear combinations of two statistics – T1 calculated from the differences in the uncensored
pairs, and T_2 calculated from the signs of the differences in the singly censored pairs. The test statistic has the form
T = c_1 T_1 + c_2 T_2,   (16)
where c1 and c2 are appropriately chosen constants. An important example of (16) is the linear combination of the Wilcoxon signed rank statistic (T1 ) and the sign statistic (T2 ). It can be shown that T is asymptotically normal and when T1 and T2 are chosen as above, the exact null distribution of T is easily determined. See Popovich and Rao (1985) for a worked example and other details.
4. Pooled sample tests with censored data A number of pooled sample tests have been suggested for testing H0 with paired censored data. In each case, the test is based on a statistic originally designed for testing H0 with independent censored samples. As noted in Section 2, large sample tests with these statistics may require consistent estimators of their null variances under model (1), but exact conditional tests can be performed utilizing the null distribution over the space of all within-pair permutations of the observations. Wei (1980) showed how the Gehan–Gilbert test (Gehan (1965) and Gilbert (1962)) for two independent censored samples can be adapted to the paired sample setting. In addition to proving the asymptotic normality of the null distribution of the Gehan–Gilbert statistic under model (1), Wei also provided a consistent estimator for its null variance. Cheng (1984) extended Wei’s (1980) work by defining a class of asymptotically nonparametric tests, of which Wei’s test is one member. O’Brien and Fleming (1987) adapted Prentice’s (1978) censored data linear rank statistics for testing H0 with paired censored samples. If ∆OFj is the difference between the efficient scores for X and Y in the j th pair, the O’Brien and Fleming (1987) statistic for testing H0 has the form T=
\sum_{j=1}^{n} \Delta_{OFj}.   (17)
The null distribution of T in (17) is asymptotically normal, with a variance that can be estimated by
\hat V_0(T) = \sum_{j=1}^{n} \Delta_{OFj}^2.   (18)
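A minimal sketch of the paired test in (17)–(18), assuming the per-pair score differences have already been computed (for example, with Prentice–Wilcoxon scores); the function and variable names below are ours, not the authors'.

```python
import numpy as np
from scipy.stats import norm

def obrien_fleming_test(delta_of):
    """Paired test of (17)-(18): T = sum of per-pair score differences,
    with null variance estimated by the sum of squared differences."""
    delta_of = np.asarray(delta_of, float)
    T = delta_of.sum()
    V0 = (delta_of ** 2).sum()
    z = T / np.sqrt(V0)
    return z, 2 * (1 - norm.cdf(abs(z)))              # two-sided p-value

# delta_of[j] = (score for X in pair j) - (score for Y in pair j)
```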
For details of implementing the O'Brien–Fleming test, see O'Brien and Fleming (1987), where the use of the statistic in (17) is demonstrated on the skin graft data in Table 2 with three sets of scores – paired Prentice–Wilcoxon (PPW), paired logrank (PLR) and paired Gehan–Wilcoxon (PGW). Dabrowska (1989) investigated a family of pooled rank tests that are similar in structure to the O'Brien–Fleming test. The O'Brien–Fleming statistics with PPW, PLR and PGW scores appear as members of the family investigated by Dabrowska.
Albers (1988) proposed an alternative to the Prentice generalized ranks and developed a test statistic for testing H0 . If Z(i) and Z(j i) are defined as in Section 3, the Albers’ ranks for the paired censored data are: R(Z(i) ) = rank of Z(i) among uncensored observations, R(Z(j i) ) = rank of Z(j i) among censored observations.
(19)
The Albers statistic is similar to the O’Brien–Fleming statistic except that the Prentice efficient scores in (17) are replaced with the scores generated by the Albers ranks. Albers suggests generating scores using two generating functions, one for censored pairs and the other for uncensored pairs. The test statistic has the form T=
\sum_{j=1}^{n} \Delta_{ALj},   (20)
where \Delta_{ALj} is the difference between the Albers scores for the X and Y observations in the j-th pair. See Albers (1988) for more details and a worked example. Akritas (1992) suggested an alternative to the statistics in (17) and (20) on the basis of yet another definition of ranks for censored paired data. Let \hat S_x and \hat S_y denote the Kaplan and Meier (1958) survival function estimates calculated from the X and Y samples, respectively. Let
\bar S = \frac{1}{2}(\hat S_x + \hat S_y).
Akritas (1992) proposed that the rank of an observation Z in the pooled sample from censored paired data be defined as
R(Z) = 1 - \frac{1}{2}(1 + \delta_z)\,\bar S(Z),   (21)
where \delta_z = 1 for an uncensored Z and \delta_z = 0 otherwise. Akritas (1992) suggested that H_0 be tested by performing the independent sample t-test on the ranks defined in (21).
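A sketch of the Akritas procedure, assuming the Kaplan–Meier estimates are computed separately from the X and Y samples as the chapter describes; the helper names are ours and ties are not handled specially.

```python
import numpy as np
from scipy import stats

def km_survival(time, event):
    """Kaplan-Meier estimator; returns a right-continuous step function S(t)."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    t_ev = np.sort(np.unique(time[event == 1]))
    surv = np.cumprod([1 - ((time == t) & (event == 1)).sum() / (time >= t).sum()
                       for t in t_ev])
    def S(t):
        idx = np.searchsorted(t_ev, t, side="right") - 1
        return 1.0 if idx < 0 else surv[idx]
    return S

def akritas_test(time1, event1, time2, event2):
    """Akritas (1992): two-sample t-test on the ranks defined in (21)."""
    Sx, Sy = km_survival(time1, event1), km_survival(time2, event2)
    def ranks(time, event):
        return np.array([1 - 0.5 * (1 + d) * 0.5 * (Sx(t) + Sy(t))
                         for t, d in zip(time, event)])
    return stats.ttest_ind(ranks(time1, event1), ranks(time2, event2))
```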
5. Testing H0 when there are missing data The missing data problem may occur in paired survival data either due to technical difficulties or due to study design. The skin-graft data in Table 2 is an example of missing data due to technical problems. An example of missing data generated by design is a study for comparing survival times of two therapies for eye irritation, a treatment and a control, in which patients with two affected eyes will receive both treatments, one on each eye, but those with one affected eye will be randomized to one of the two treatments. Clearly, patients with one affected eye will generate responses with missing data. Cheng (1984), in his work on extending the results of Wei (1980), briefly considered the missing data problem in the context of censored paired survival data. Dallas and Rao (2000) investigated a class of pooled rank tests for testing H0 when the data may
contain missing observations. These authors studied the missing data problem using a setup patterned after the one used by Hollander et al. (1974). Suppose the sample consists of n independent observations of (X, Y, δx , δy ), s independent observations of (X, δx ), and t independent observations of (Y, δy ). In the procedure advocated by Dallas and Rao (2000), the n1 = n + s observations of (X, δx ) are pooled with n2 = n + t observations of (Y, δy ), thereby forming a pooled sample of N = 2n + s + t observations of (X, δx ) and (Y, δy ). The pooled sample scores are then computed using any one of the available scoring procedures that has been proposed for complete paired censored data. Let ∆DRj be the score for the j th observation in the pooled sample. The Dallas–Rao statistic for testing H0 is the sum of the scores for the n1 observations of (X, δx ): T=
\sum_{j=1}^{n_1} \Delta_{DRj}.   (22)
A permutation test of H_0 can be performed on the basis of the null distribution of T over the M = 2^n \binom{s+t}{t} equally likely sets of pooled sample scores generated by permuting the X and Y observations within the n complete pairs, combined with a grouping of the remaining s + t observations into two groups of s observations of X and t observations of Y. A Fortran routine for performing the test with Prentice–Wilcoxon scores as suggested by O'Brien and Fleming (1987) or the Akritas scores as in (21) is available in Dallas and Rao (2003). With minor modifications, the program can be used for testing H_0 with other choices of scores also. Also, since the cases n > 0, s = t = 0 and n = 0, min(s, t) > 0 correspond, respectively, to a single complete paired sample and two independent samples, the computer routine in Dallas and Rao (2003) can be used in these cases also. Finally, the asymptotic normal distribution under the null hypothesis, established in Dallas (1997), can be used to perform large sample tests based on T in (22).
6. Overview
Several nonparametric tests for comparing survival functions on the basis of paired censored data are described in this chapter. The question of which test is appropriate in what situation has generated many investigations of the power and efficiency properties of these tests. As might be expected, no single test emerges as the best test under all conditions. For details, the readers may consult the papers by Popovich and Rao (1985), O'Brien and Fleming (1987), Dabrowska (1990), Woolson and O'Gorman (1992), Raychaudhuri and Rao (1996) and Dallas and Rao (2000), among others. Simulation studies of O'Brien and Fleming (1987), Woolson and O'Gorman (1992) and Dallas and Rao (2000) suggest that the Akritas test or the O'Brien–Fleming test with PPW scores perform well in a variety of situations commonly encountered in practice. The method recommended by Dallas and Rao (2000) for handling missing data works well with paired Prentice–Wilcoxon scores (O'Brien and Fleming, 1987) or Akritas (1992) scores.
References Akritas, M.G. (1992). Rank transformed statistics with censored data. Statist. Probab. Lett. 13, 209–221. Albers, W. (1988). Combined rank tests for randomly censored paired data. J. Amer. Statist. Assoc. 83, 1159– 1162. Batchelor, J.R., Hackett, M. (1970). HL-a matching in treatment of burned patients with skin allografts. The Lancet 2, 581–583. Cheng, K.F. (1984). Asymptotically nonparametric tests with censored paired data. Comm. Statist. Theory Methods 13, 1453–1470. Conover, W.J., Iman, R.L. (1981). Rank transformation as a bridge between parametric and nonparametric statistics. Amer. Statist. 35, 124–129. Dabrowska, D.M. (1989). Rank tests for matched pair experiments with censored data. J. Amer. Statist. Assoc. 85, 478–485. Dabrowska, D.M. (1990). Signed rank tests for matched pair experiments with censored data. J. Multivariate Anal. 28, 88–114. Dallas, M.J. (1997). Permutation tests for randomly right censored data consisting of both paired and unpaired observations. Ph.D. Thesis. University of Florida, Gainesville, FL. Unpublished. Dallas, M.J., Rao, P.V. (2000). Testing equality of survival functions based on both paired and unpaired survival data. Biometrics 56, 154–159; Dallas, M.J., Rao, P.V. Biometrics 56 (2000), 1294. Correction. Dallas, M.J., Rao, P.V. (2003). Computing p-values for a class of permutation tests of equal survival functions. Comput. Methods Programs Biomedicine 71, 149–153. Gehan, E. (1965). A generalized Wilcoxon test for comparing arbitrary single censored samples. Biometrika 52, 203–223. Gilbert, J.P. (1962). Random censorship. Ph.D. Thesis. University of Chicago, Department of Statistics. Unpublished. Hollander, M., Pledger, G., Lin, P. (1974). Robustness of the Wilcoxon test to a certain dependency between samples. Ann. Statist. 2, 177–181. Kalbfleisch, J.D., Prentice, R.L. (1973). Marginal likelihoods based on Cox’s regression and life model. Biometrika 60, 267–278. Kaplan, E.L., Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53, 457–481. Lachenbruch, P.A., Palta, M., Woolson, R.F. (1982). Analysis of matched pairs studies with censored data. Comm. Statist. 11, 549–569. Lam, F.C., Longnecker, M.T. (1983). A modified Wilcoxon rank sum test for paired data. Biometrika 79, 510–513. O’Brien, P.C., Fleming, T.R. (1987). A paired-Wilcoxon test for censored paired data. Biometrics 43, 169– 180. Popovich, E.A., Rao, P.V. (1985). Conditional tests for censored matched pairs. Comm. Statist. Theory Methods 14, 2041–2056. Prentice, R.L. (1978). Linear rank tests with right censored data. Biometrika 65, 167–179. Randles, R.H., Wolfe, D.A. (1991). Introduction to The Theory of Nonparametric Statistics. Krieger, Malabar, FL. Raychaudhuri, A., Rao, P.V. (1996). Efficacies of some rank tests for censored bivariate data. Nonparametric Statist. 6, 1–11. Wei, L.J. (1980). A generalized Gehan and Gilbert test for paired observations that are subject to arbitrary right censorship. J. Amer. Statist. Assoc. 75, 634–637. Woolson, R.F., Lachenbruch, P.A. (1980). Rank tests for censored matched pairs. Biometrika 67, 597–606; Woolson, R.F., Lachenbruch, P.A. Biometrika 71 (1984), 220. Corrections. Woolson, R.F., O’Gorman, T.W. (1992). A comparison of several tests for censored paired data. Statist. Medicine 11, 193–208.
15
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23015-7
Statistical Methods for the Comparison of Crossing Survival Curves
Chap T. Le
1. Introduction
This chapter is focussed on the detection of crossing-curve alternatives, alternative hypotheses under which the two underlying survival curves cross each other at a time point other than t = 0. The following are two typical examples: (i) Surgical intervention for blocked arteries carries short-term risks, as is typical of any surgery, but may result in better long-term survival. In a randomized trial of surgical intervention versus placebo, the two survival curves would cross each other, with treated patients showing less favorable short-term results but much more favorable long-term results. (ii) In cancer chemotherapy, treatment by medications, a high dose may be harmful in the short term (because of unavoidable toxicities which, in some cases, may be fatal) but may result in better long-term survival. In a randomized trial of high dose versus low dose, the two survival curves would cross each other at a time point other than t = 0. Here is another, often-cited, example:
EXAMPLE. In a study performed at the Mayo Clinic (data from Fleming et al., 1980), patients with bile duct cancer were followed to determine whether those treated with a combination of radiation treatment (RöRx) and 5-Fluorouracil (5-Fu) would survive significantly longer than a control population. Survival times for the RöRx + 5-Fu sample were 30, 67, 79+, 82+, 95, 148, 170, 171, 176, 193, 200, 221, 243, 261, 262, 263, 399, 414, 446, 446+, 464, and 777 days; survival times for the control sample were 57, 58, 74, 79, 89, 98, 101, 104, 110, 118, 125, 132, 154, 159, 188, 203, 257, 257, 431, 461, 497, 723, 747, 1313, and 2636 days. Estimates of the survival curves are presented in Figure 1. It is obvious that the two survival curves cross each other; to prove or disprove the benefits of the combination of radiation treatment (RöRx) and 5-Fluorouracil (5-Fu) would require an efficient method for detecting this crossing-curve alternative.
Fig. 1. Estimated survival distributions for bile duct cancer patients: Control population vs. patients receiving RöRx + 5-Fu.
The problem of comparing survival distributions arises very often in biomedical research. Over the last 30–40 years, many non-parametric tests have been proposed for the two-sample problem with right censored survival data. The most notable is the so-called Tarone–Ware family of tests, which encompasses several popular members and which can be described as follows. Suppose that there are N_1 and N_2 individuals corresponding to two treatment groups 1 and 2, respectively. The study provides two samples of survival data:
(t_{1i}, \delta_{1i}),  i = 1, 2, \ldots, N_1,   and   (t_{2j}, \delta_{2j}),  j = 1, 2, \ldots, N_2,
where the t's are duration times and the δ's are survival indicators (δ = 1 for events, δ = 0 if censored). In the presence of censored observations, tests of significance can be constructed as follows (see, for example, Le, 1997, Chapter 3):
(i) Pool the data from the two samples together and let
t_1 < t_2 < \cdots < t_m,   m \le d \le N_1 + N_2,
be the distinct times with at least one event at each (d is the total number of deaths or events in both samples).
(ii) At ordered time t_i, 1 \le i \le m, the data may be summarized into a 2 × 2 table:

                   Status
Sample     Dead      Alive     Total
1          d_{1i}    a_{1i}    n_{1i}
2          d_{2i}    a_{2i}    n_{2i}
Total      d_i       a_i       n_i

where
n_{1i} = number of subjects from sample 1 who were at risk just before time t_i,
n_{2i} = number of subjects from sample 2 who were at risk just before time t_i,
n_i = n_{1i} + n_{2i},
d_i = number of deaths at t_i, d_{1i} of them from sample 1 and d_{2i} of them from sample 2 (d_i = d_{1i} + d_{2i}),
a_i = n_i - d_i = a_{1i} + a_{2i} = number of survivors,
d = \sum_i d_i.
In this form, the null hypothesis of equal survival functions implies the independence of "sample" and "status" in the above cross-classification 2 × 2 table. Therefore, under the null hypothesis, the "expected value" of d_{1i} is
E_0(d_{1i}) = \frac{n_{1i} d_i}{n_i}
(d_{1i} being the "observed value"). The variance is estimated by (hypergeometric model):
Var_0(d_{1i}) = \frac{n_{1i} n_{2i} a_i d_i}{n_i^2 (n_i - 1)}.
After constructing a 2 × 2 table for each uncensored observation, the evidence against the null hypothesis can be summarized in the following statistic:
\theta = \sum_{i=1}^{m} w_i [d_{1i} - E_0(d_{1i})],
where w_i is the "weight" associated with the 2 × 2 table at t_i. We have, under the null hypothesis,
E_0(\theta) = 0,   Var_0(\theta) = \sum_{i=1}^{m} w_i^2 \, Var_0(d_{1i}) = \sum_{i=1}^{m} \frac{w_i^2 n_{1i} n_{2i} a_i d_i}{n_i^2 (n_i - 1)}.
The evidence against the null hypothesis is summarized in the standardized statistic
z = \theta / [Var_0(\theta)]^{1/2},
which is referred to the standard normal percentile z_{1-\alpha} for a specified size α of the test. We may also refer z^2 to a chi-square distribution with 1 degree of freedom in a two-sided test. There are two important special cases: (i) The choice w_i = n_i gives the generalized Wilcoxon test (also called the Gehan–Breslow test) (Gehan, 1965a, 1965b; Breslow, 1970); it reduces to the Wilcoxon test in the absence of censoring. (ii) The choice w_i = 1 gives the logrank test (also called the Cox–Mantel test; it is similar to the Mantel–Haenszel procedure for the combination of several 2 × 2 tables in the analysis of categorical data) (Mantel and Haenszel, 1959; Cox, 1972; Tarone and Ware, 1977). As applied to the above data set for bile duct cancer, we obtain p-values of 0.418 and 0.127 for the logrank and the generalized Wilcoxon tests, respectively. The main reason that both the logrank and the generalized Wilcoxon tests are insensitive to crossing-curve alternatives lies in the way the tests are formulated: because the terms in the summation are not squared, the tests are powerful only when one risk is greater than the other at all times. Otherwise, some terms in the sum are positive, some other terms are negative, and many terms may cancel each other out, leading to a smaller value of the test statistic. For the comparison of crossing survival curves, we will present in the following sections three different major approaches: (i) a modified Kolmogorov–Smirnov test by Fleming et al. (1980), (ii) a Levene-type test aimed directly at crossing-curve alternatives, and (iii) a subclass of linear rank tests in which locally most powerful linear rank tests can be formed by proper choice of an optimal score function.
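As an illustration of the construction just described, here is a small Python sketch (ours, not the author's code) of the Tarone–Ware family; setting weight="logrank" gives w_i = 1 and weight="gehan" gives w_i = n_i. Applied to the bile duct cancer data of the Example, it should reproduce p-values close to those quoted above.

```python
import numpy as np
from scipy.stats import norm

def tarone_ware(time1, event1, time2, event2, weight="logrank"):
    """Weighted logrank statistic: theta = sum_i w_i (d_1i - E0(d_1i)),
    standardized by the hypergeometric variance; returns z and a two-sided p."""
    time = np.concatenate([time1, time2]).astype(float)
    event = np.concatenate([event1, event2]).astype(int)
    group = np.concatenate([np.zeros(len(time1)), np.ones(len(time2))])

    theta, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):             # distinct event times
        at_risk = time >= t
        n_i = at_risk.sum()
        n1_i = (at_risk & (group == 0)).sum()
        n2_i = n_i - n1_i
        d_i = ((time == t) & (event == 1)).sum()
        d1_i = ((time == t) & (event == 1) & (group == 0)).sum()
        a_i = n_i - d_i
        w = n_i if weight == "gehan" else 1.0
        theta += w * (d1_i - n1_i * d_i / n_i)
        if n_i > 1:
            var += w**2 * n1_i * n2_i * a_i * d_i / (n_i**2 * (n_i - 1))
    z = theta / np.sqrt(var)
    return z, 2 * (1 - norm.cdf(abs(z)))
```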
2. The modified Kolmogorov–Smirnov test
Fleming et al. (1980) were the first to address the comparison of crossing survival curves. They proposed a generalized Smirnov procedure for testing H_0: S_1(t) = S_2(t) for t \in [0, T] against the complement, H_A: S_1(t) \ne S_2(t) for some t \in [0, T]. Here, T = \sup\{t: N_1(t) > 0 and N_2(t) > 0\}, where N_i(t) is the number of individuals at risk in sample i at time t^-. The basic idea was the observation that, when S_1(s) \approx S_2(s) for all s \le t, then
S_1(t) - S_2(t) \approx \frac{1}{2}\{S_1(t) + S_2(t)\}\{H_1(t) - H_2(t)\},
where H_i(t) is the cumulative hazard function for S_i(t). The test based on the right-hand side will behave at least as well as the classic Kolmogorov–Smirnov test, which is based on the difference between the survival rates on the left-hand side. In their formulation of the test statistic, Fleming et al. (1980) used Nelson's estimate (Nelson, 1972) of the survival function instead of the Kaplan–Meier estimate (by first estimating the cumulative hazard and then exponentiating it), \hat S_i(t) = \exp(-\hat H_i(t)), i = 1, 2. This led to the following test statistic:
Y_{N_1,N_2}(T) = \sup\{Y_{N_1,N_2}(t): 0 \le t \le T\},
where
Y_{N_1,N_2}(t) = \frac{1}{2}\{\hat S_1(t) + \hat S_2(t)\} \int_0^t \left\{\frac{N_1 \hat C_1(s^-) N_2 \hat C_2(s^-)}{N_1 \hat C_1(s^-) + N_2 \hat C_2(s^-)}\right\}^{1/2} I_{N_1(s)N_2(s) > 0}\, d\{\hat H_1(s) - \hat H_2(s)\}.
Note that if there is no censoring (that is, \hat C_i(t) \equiv 1), and if we ignore the indicator function in the integral, this expression simplifies to
Y_{N_1,N_2}(t) = \frac{1}{2}\{\hat S_1(t) + \hat S_2(t)\}\left(\frac{1}{1/N_1 + 1/N_2}\right)^{1/2}\{\hat H_1(t) - \hat H_2(t)\}.
Details about the derivation and implementation of the test statistic are given in Fleming et al. (1980), along with details about the distribution of the statistic under the null hypothesis. Fleming et al. (1980) showed that it is more powerful than the Tarone–Ware family (the generalized Wilcoxon and logrank tests) in comparing crossing survival curves in two examples. For example, when applied to the survival data set of bile duct cancer patients of Section 1, the modified Kolmogorov–Smirnov test shows that the difference between the two treatments is statistically significant, p-value = 0.046. However, the modified Kolmogorov–Smirnov test (and its original form, the classic Kolmogorov–Smirnov test) was not specifically designed to compare crossing survival curves. A Smirnov-type statistic is designed to measure the maximum distance between estimates of two (survival) functions. The modified Kolmogorov–Smirnov test is more effective than the Tarone–Ware family (the generalized Wilcoxon and logrank tests) when the two survival curves cross each other, or when the two survival distributions differ substantially over some short range of time values but not necessarily elsewhere. It does not work well when the minimum difference between the two estimated curves is small, even if the difference is clear and consistent
throughout the time scale (see, for example, Example 2 in Shen and Le, 2000). In particular, it is not the optimal choice for comparing crossing survival curves; we will include some comparative results when introducing another method in the next section.
3. A Levene-type test
Here are some simple but important observations about crossing survival curves, especially point (iii): (i) When the hazards are proportional, survival curves do not cross, except at t = 0. (ii) Exponential curves do not cross; each curve is characterized by one parameter influencing both mean and variance. (iii) Crossing curves are each characterized by at least two parameters. It can be observed that crossing survival curves (and cdfs) occur when the variances – not the means – differ. Figure 2 illustrates how this can happen with Gaussian distribution functions. The top two figures illustrate a scenario in which the variances are equal but the means differ: the second cdf is a horizontal translation of the first, so the curves do not cross. The bottom two figures show what happens when the means are the same but the variances differ: the curves cross – at the median, in this case.
Fig. 2. Gaussian distributions with crossing cumulative distribution functions.
The Tarone–Ware family focuses on the comparison of location parameters (i.e., means or medians); the generalized K–S test is more general, aiming at different survival functions or cumulative distribution functions – not aiming at different variances. This observation indicates that we should look for a way to form a test aimed directly at the comparison of two variances. If the underlying distributions were known to be Gaussian, and in the absence of censoring, the F-test would give us an appropriate method for testing whether the variances are equal. However, the F-test is ultra-sensitive to departures from the normality assumption, so we should instead try a non-parametric approach to comparing variances, taking the idea from Levene (1960). Such a test can be formed as follows:
(i) Pool the data to form a sample of size (N_1 + N_2), then arrange all the uncensored times from smallest to largest; let n_j be the number of subjects at risk at time t_j and
s_i = \prod_{j=1}^{i} \frac{n_j}{n_j + 1};
(ii) A score c_i is assigned to each observation; these are called Wilcoxon scores, which we will explain in more detail in Section 4: (a) if the subject is the i-th event, c_i = 1 - 2s_i; (b) if the subject is censored between the i-th and (i + 1)-th events, c_i = 1 - s_i;
(iii) Let \bar c_1 and \bar c_2 be the average scores for groups 1 and 2, respectively. For each subject in each of the two groups, compute the absolute value of the deviation from the group average score, d = |c - \bar c|;
(iv) Apply the Wilcoxon rank sum test or the two-sample t-test to these d_i's; the idea is similar to that of Levene (1960), that is, to compare the mean deviations from the two samples.
EXAMPLE. To apply the above procedure to the bile duct cancer data we have Table 1. Suppose we decide to use the Wilcoxon rank sum test (the two-sample t-test is another option); the sum of the ranks for the treated sample is R = 430. Under the null hypothesis H_0, this sum has mean and standard deviation
\mu = \frac{1}{2}(22)(22 + 25 + 1) = 528
Table 1 Calculations for the Levene-type test
Treated sample:

Time   Score    |Deviation|   Rank
30     -0.958   1.047         47
67     -0.834   0.923         44
79+     0.125   0.036          1.5
82+     0.125   0.036          1.5
95     -0.662   0.751         37
148    -0.312   0.401         25
170    -0.182   0.271         17
171    -0.138   0.227         14
176    -0.094   0.183         12
193    -0.006   0.095          7
200     0.038   0.051          5
221     0.126   0.037          3
243     0.170   0.081          6
261     0.260   0.171          9
262     0.306   0.217         13
263     0.352   0.263         16
399     0.398   0.309         21
414     0.444   0.355         23
446     0.538   0.449         27
446+    0.769   0.680         34
464     0.640   0.551         29
777     0.846   0.757         38

Control sample:

Time   Score    |Deviation|   Rank
57     -0.916   0.820         41
58     -0.874   0.778         39
74     -0.792   0.696         36
79     -0.750   0.654         33
89     -0.706   0.610         32
98     -0.678   0.582         30
101    -0.574   0.478         28
104    -0.532   0.436         26
110    -0.488   0.392         24
118    -0.444   0.348         22
125    -0.400   0.304         18
132    -0.356   0.260         15
154    -0.268   0.172         10
159    -0.224   0.128          8
188    -0.050   0.046          4
203     0.082   0.178         11
257     0.212   0.308         19.5
257     0.212   0.308         19.5
431     0.490   0.586         31
461     0.588   0.684         35
497     0.692   0.788         40
723     0.742   0.838         42
747     0.794   0.890         43
1313    0.898   0.994         45
2636    0.948   1.044         46
and
\sigma = \left[\frac{1}{12}(22)(25)(22 + 25 + 1)\right]^{1/2} = 46.9,
respectively. These lead to a z-score of
z = \frac{430 - 528}{46.9} = -2.09   (p = 0.018),
which is similar to the result of the modified Kolmogorov–Smirnov test in the previous section, which was 0.046.
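A sketch of the Levene-type procedure of this section (our own Python, not the author's code). Ties among event times are handled in the simplest way, by giving each tied event its own factor in the product; this is an assumption, since the chapter does not discuss ties.

```python
import numpy as np
from scipy import stats

def wilcoxon_scores(time, event):
    """Scores c_i for a pooled right-censored sample: 1 - 2*s_i for events,
    1 - s_i for censored observations, with s_i = prod_{j<=i} n_j/(n_j+1)."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    ev_times = np.sort(np.unique(time[event == 1]))
    s_at, s = {}, 1.0
    for t in ev_times:
        n_risk = (time >= t).sum()
        for _ in range(int(((time == t) & (event == 1)).sum())):
            s *= n_risk / (n_risk + 1.0)              # one factor per (tied) event
        s_at[t] = s
    scores = np.empty(len(time))
    for k, (t, d) in enumerate(zip(time, event)):
        past = ev_times[ev_times <= t]
        s_k = s_at[past[-1]] if len(past) else 1.0
        scores[k] = 1 - 2 * s_k if d == 1 else 1 - s_k
    return scores

def levene_type_test(time1, event1, time2, event2):
    """Levene-type comparison of scale: rank-sum test on |c - group mean score|."""
    n1 = len(time1)
    c = wilcoxon_scores(np.concatenate([time1, time2]),
                        np.concatenate([event1, event2]))
    c1, c2 = c[:n1], c[n1:]
    d1, d2 = np.abs(c1 - c1.mean()), np.abs(c2 - c2.mean())
    return stats.ranksums(d1, d2)
```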
To check the power of various tests, we put our two-sample problem in the following framework. We assume that samples are drawn from two independent populations: X_{1j} ~ F_1 i.i.d. (j = 1, 2, \ldots, N_1) and X_{2j} ~ F_2 i.i.d. (j = 1, 2, \ldots, N_2) are the event times for the two samples. Censoring times are drawn as Y_{1j} ~ G_1 i.i.d. (j = 1, 2, \ldots, N_1) and Y_{2j} ~ G_2 i.i.d. (j = 1, 2, \ldots, N_2). We are not always able to observe X_{ij}, due to censoring; rather, we observe Z_{ij} = \min(X_{ij}, Y_{ij}) and \delta_{ij} = I_{X_{ij} \le Y_{ij}}. Let S_i(t) = \Pr(X_{ij} > t) and C_i(t) = \Pr(Y_{ij} > t) (i = 1, 2) be the survival and censorship functions, respectively; we are interested in seeing whether F_1 and F_2 have the same variance.
Table 2
Three alternatives used in simulation

Model   λ1    α1    µ1      σ1²     λ2    α2    µ2      σ2²     Crossing time
H       1.5   0.9   0.701   0.610   1.3   4.6   0.703   0.030   0.796
I       2.6   1.2   0.362   0.092   2.5   3.8   0.362   0.011   0.407
J       5.1   1.1   0.189   0.030   2.0   2.8   0.445   0.030   0.916
A simulation was conducted (Short, 2002) under these conditions:
(i) Data are generated for two independent samples of the same size, either 20 or 50; the simulation is repeated 1000 times in order to calculate the power, the proportion of times the difference is statistically significant, keeping the type I error at either 0.01 or 0.05;
(ii) Potential censoring times are generated from a uniform distribution U(0, 1), U(0, 2), or U(1/2, 1);
(iii) In each case, samples are generated from a pair of Weibull distributions to be compared. The cumulative distribution function for a Weibull distribution with parameters λ and α is given by F(t) = 1 - \exp(-(\lambda t)^{\alpha}). It has mean \mu = \frac{1}{\lambda}\Gamma(1 + \frac{1}{\alpha}) and variance \sigma^2 = \frac{1}{\lambda^2}\Gamma(1 + \frac{2}{\alpha}) - \mu^2. We can find the point at which the cdf's of a pair of Weibull distributions cross by equating their cdf's and solving for t; this gives us
t = \exp\left(\frac{\alpha_1 \log(\lambda_1) - \alpha_2 \log(\lambda_2)}{\alpha_2 - \alpha_1}\right).
Note that this formula holds if \alpha_1 \ne \alpha_2. If \alpha_1 = \alpha_2, either the curves coincide (\lambda_1 = \lambda_2) or the hazards are proportional and the survival curves cross each other only at t = 0 (when \lambda_1 \ne \lambda_2).
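As a quick check of the crossing-time formula (our own illustrative snippet), model H of Table 2 gives approximately 0.796, matching the tabulated value.

```python
import math

def weibull_crossing_time(lam1, a1, lam2, a2):
    """Crossing point of two Weibull cdfs F(t) = 1 - exp(-(lam*t)^a), for a1 != a2."""
    return math.exp((a1 * math.log(lam1) - a2 * math.log(lam2)) / (a2 - a1))

print(weibull_crossing_time(1.5, 0.9, 1.3, 4.6))   # model H: about 0.796
```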
We consider three alternatives, called models H, I, and J; their parameters are summarized in Table 2. Figure 3 shows the underlying distributions for each of the three scenarios; models H and I represent the alternatives of "equal mean but different variances" whereas model J represents an alternative of "equal variance but different means". The results are shown in Table 3. With model H, one of the models for which the variances differ greatly, the underlying distributions cross late in time, relative to the censoring distribution; a consequence of this is that the curves are fairly far apart at early and intermediate times. Not surprisingly, then, the Logrank test shows the lowest power in each scenario. The Levene-type test shows the greatest power to detect differences in all scenarios, with the generalized Smirnov and Wilcoxon tests somewhere in the middle. The differences between the Levene-type test and its nearest contenders are most pronounced when working with the α = 0.01 cutoff. Note that for the most "realistic" censoring, U(1/2, 1), the Levene-type test has 65% power, even in the smaller sample size (N_1 = N_2 = 20).
Table 3
Simulation results

                                Proposed         Gen. Smir        Wilcoxon         Logrank
Model  C_i        N_i        0.01    0.05     0.01    0.05     0.01    0.05     0.01    0.05    % censored
H      U(0, 1)     20        0.842   0.935    0.313   0.581    0.579   0.835    0.244   0.463   (51.1, 70.7)
       U(0, 1)     50        0.998   0.998    0.907   0.973    0.981   0.997    0.589   0.776   (51.0, 70.2)
       U(0, 2)     20        0.776   0.920    0.399   0.706    0.369   0.635    0.060   0.121   (31.3, 35.0)
       U(0, 2)     50        1.000   1.000    0.963   0.996    0.831   0.954    0.059   0.145   (32.1, 34.9)
       U(0.5, 1)   20        0.652   0.857    0.425   0.713    0.478   0.731    0.157   0.319   (33.3, 42.6)
       U(0.5, 1)   50        0.998   1.000    0.961   0.996    0.929   0.977    0.331   0.558   (33.7, 42.3)
I      U(0, 1)     20        0.483   0.709    0.144   0.348    0.152   0.338    0.033   0.089   (35.1, 35.8)
       U(0, 1)     50        0.936   0.980    0.626   0.883    0.478   0.710    0.017   0.057   (35.7, 36.2)
       U(0, 2)     20        0.495   0.752    0.175   0.431    0.099   0.266    0.015   0.060   (17.8, 17.7)
       U(0, 2)     50        0.971   0.992    0.756   0.934    0.286   0.518    0.018   0.065   (17.9, 17.8)
       U(0.5, 1)   20        0.598   0.810    0.193   0.475    0.069   0.171    0.018   0.082   (12.6, 0.9)
       U(0.5, 1)   50        0.988   0.996    0.815   0.967    0.179   0.340    0.033   0.113   (12.3, 0.9)
J      U(0, 1)     20        0.098   0.357    0.902   0.974    0.966   0.998    0.867   0.960   (19.2, 45.2)
       U(0, 1)     50        0.556   0.846    0.999   1.000    1.000   1.000    0.998   0.999   (18.9, 44.5)
       U(0, 2)     20        0.009   0.063    0.944   0.989    0.981   0.995    0.869   0.950   (9.7, 22.3)
       U(0, 2)     50        0.068   0.294    1.000   1.000    1.000   1.000    0.999   1.000   (9.4, 22.1)
       U(0.5, 1)   20        0.002   0.011    0.968   0.992    0.979   0.996    0.883   0.961   (1.9, 9.5)
       U(0.5, 1)   50        0.002   0.032    1.000   1.000    1.000   1.000    1.000   1.000   (1.8, 9.0)
Fig. 3. Underlying distributions in three models used in power comparison.
With model I, the other model in which the variances differ greatly, the underlying survival functions intersect fairly early, at t = 0.407. The curves separate quite early (so the Wilcoxon test should have high power); both curves converge rapidly to 0 near t = 1 (so we expect the Logrank test will have low power). In fact the simulations show that the Levene-type test has higher power than any of the other tests, for every sample size and censoring distribution used for the model I simulations. The modified
Kolmogorov–Smirnov test had the second highest power (with one exception), followed in turn by the Wilcoxon and Logrank tests. The ratio of the variances of the survival distributions for model H was 20.3; for model I, it was 8.4. Accordingly, we notice that for each censoring distribution, sample size and α level, the power of the Levene-type test is higher for model H than for model I. With model J (equal variance but different means), the Levene-type test performs extremely poorly compared to the other three tests. None of the other three tests shows power below 80% in any of the setups, whereas the Levene-type test has power below 50% for five of the six possibilities with α = 0.05. The only time the Levene-type test approaches the other three is with a sample size of 50. We notice from the sketch of the underlying distributions that the survival curves do meet, but so close to S_i(t) = 0 that they barely cross; the curves look (practically) as though they are horizontal translates of each other. The simulation results suggest that the Levene-type test described in this section has the greatest power in situations for which it seems tailored – situations with equal means but different variances, as in models H and I. The Levene-type test performed poorly (in absolute terms and relative to the other three tests) when faced with differing means but equal variances – as in model J. The generalized Wilcoxon test is known to be sensitive to departures from the null hypothesis at early times; the Logrank test is more sensitive to differences late in time. But in both tests the emphasis is on comparing location parameters. With the Levene-type test, we have a powerful tool for differentiating crossing survival curves. However, it may appear that the test lacks the flexibility that would allow us to study short-term or long-term risks. This is not necessarily true: we can form more than one Levene-type test, each focusing on a different part of the time scale, by a properly chosen linear rank test, as briefly outlined in the following section.
4. Linear rank tests
In our problem of comparing two survival functions S_1(t) and S_2(t), we may be interested in the location parameter, where the alternative hypothesis is expressible as
H_A: S_1(t) = S(t) and S_2(t) = S(t - \Delta),
with H_0: \Delta = 0, or we may be interested in the scale parameter, where the alternative hypothesis is expressible as
H_A: S_1(t) = S(t - \theta) and S_2(t) = S((t - \theta)/\eta),
with H_0: \eta = 1. For both problems, a linear rank test can be constructed as follows. Given a density function f, called a score function, a function – called the φ-function – for the location parameter is defined as
\varphi_L(u, f) = -\frac{f'[F^{-1}(u)]}{f[F^{-1}(u)]},
and the φ-function for the scale parameter is defined as
\varphi_S(u, f) = -1 - F^{-1}(u)\,\frac{f'[F^{-1}(u)]}{f[F^{-1}(u)]}.
Suppose we have two independent samples of survival times of sizes N_1 and N_2, and let N = N_1 + N_2. The scores associated with a score function f are defined as
a_N(i, f) = N\binom{N-1}{i-1}\int_0^1 \varphi_L(u, f)\, u^{i-1}(1-u)^{N-i}\, du
or
a_N(i, f) = N\binom{N-1}{i-1}\int_0^1 \varphi_S(u, f)\, u^{i-1}(1-u)^{N-i}\, du,
depending on whether we are interested in the location or the scale parameter. A linear rank statistic is one of the form
S = \sum a_N(R_i, f),
where R_i is the rank of the i-th observation and the sum is across subjects in one of the two samples (Wilcoxon, 1945; Ansari and Bradley, 1960; Mood, 1954; Klotz, 1962). It has been shown that linear rank tests are locally most powerful (see, for example, Hajek and Sadek (1967, Section 4.4 of Chapter 2)). Most existing test procedures for the two-sample problem with right-censored data are asymptotically equivalent to linear rank tests with a proper choice of score function. For example, the generalized Wilcoxon test is asymptotically equivalent to the linear rank test with the linear score function, and the Logrank test is asymptotically equivalent to the linear rank test with the logarithmic score function (see, for example, Kalbfleisch and Prentice (1980, Section 6.2)). The Levene-type test of Section 3, using the Wilcoxon scores
c_i = 1 - 2\prod_{j=1}^{i} \frac{n_j}{n_j + 1}   for an event,
c_i = 1 - \prod_{j=1}^{i} \frac{n_j}{n_j + 1}   for a censored observation,
is asymptotically equivalent to the linear rank test, aimed at the scale parameter, with the linear score function (Shen and Le, 2000). Another test to compare scale parameters could be formed using the Levene-type approach and the exponential scores (which lead to the Logrank test in the comparison of location parameters):
c_i = \sum_{j=1}^{i} n_j^{-1} - 1   for an event,
c_i = \sum_{j=1}^{i} n_j^{-1}   for a censored observation.
Direct generalization of linear rank tests to the censored data situation is conceptually simple, but derivations of variance formulas are difficult, even in the cases of the linear score function and the logarithmic score function. The Levene-type test of Section 3, and a similar one using the exponential scores, provide equivalent but simpler methods for the comparison of crossing survival curves.
References Ansari, A.R., Bradley, R.A. (1960). Rank-sum tests for dispersions. Ann. Math. Statist. 41, 724–736. Breslow, N. (1970). A generalized Kruskal–Wallis test for comparing K samples subject to unequal patterns of censorship. Biometrika 57, 579–594. Cox, D.R. (1972). Regression models and life tables. J. Roy. Statist. Soc. B 34, 187–220. Fleming, T.R., O’Fallon, J.R., O’Brien, P.C. (1980). Modified Kolmogorov–Smirnov test procedures with application to arbitrarily right-censored data. Biometrics 36, 607–625. Gehan, E.A. (1965a). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 203–223. Gehan, E.A. (1965b). A generalized two-sample Wilcoxon test for doubly censored data. Biometrika 52, 650–653. Hajek, J., Sadek, Z. (1967). Theory of Rank Tests. Academic Press, New York. Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. Klotz, J. (1962). Nonparametric tests for scale. Ann. Math. Statist. 33, 498–512. Le, C.T. (1997). Applied Survival Analysis. Wiley, New York. Levene, H. (1960). Robust tests for equality of variances. In: Olkin, E.I. (Ed.), Contributions to Probability and Statistics. Stanford University Press, Palo Alto, CA, pp. 278–292. Mantel, N., Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. J. National Cancer Instit. 22, 719–748. Mood, A.M. (1954). On the asymptotic efficiency of certain nonparametric two-sample tests. Ann. Math. Statist. 25, 514–522. Nelson, W. (1972). Theory and applications of hazard plotting for censored failure data. Technometrics 14, 945–966. Shen, W., Le, C.T. (2000). Linear rank tests for censored survival data. Comm. Statist. Simulation Comput. 29, 21–36. Short, M. (2002). A nonparametric test for comparing crossing survival curve. Comm. Statist. Submitted for publication. Tarone, R.E., Ware, J. (1977). On distribution-free tests for equality of survival distributions. Biometrika 64, 156–160. Wilcoxon, F. (1945). Individual comparison by ranking methods. Biometrics 1, 80–83.
16
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23016-9
Inference for Competing Risks
John P. Klein and Ruta Bajorunaite
1. Introduction
Problems involving competing risks are common in medical and reliability applications. In such problems, there are K competing causes of failure that may occur. When one of the competing causes occurs, the other causes can no longer occur. One observes for each unit simply a failure time and a cause of failure. In medical applications, competing risks are found in many situations. A common example is the analysis of cause of death data. In cancer studies common competing risks are relapse and death in remission (or treatment related mortality). Interest is often on estimating the rate of occurrence of the competing risks, comparing these rates between treatment groups and modeling the effect of covariates on the rate of occurrence of the competing risks. In engineering applications, competing risks arise in the analysis of series systems of components. Here failure of any of the components causes the system to fail. One observes the time at which the system fails and which component caused the system to fail. In this paper, we will discuss inference problems for competing risks. We shall illustrate the methods using a typical competing risk data set taken from an International Bone Marrow Transplant Registry study of alternative donor bone marrow transplants (BMT) reported in Szydlo et al. (1997). Here, as in many BMT studies, there are two competing risks of treatment failure: relapse or recurrence of the primary disease, and death in complete remission, also known as treatment related mortality. This study consisted of patients with Acute Lymphocytic Leukemia (ALL) (n = 537), Acute Myeloid Leukemia (AML) (n = 340), or Chronic Myeloid Leukemia (CML) (n = 838). Patients were transplanted with their disease in an early (n = 1026), intermediate (n = 410) or advanced (n = 279) stage based on remission status. The initial Karnofsky performance score was greater than or equal to 90 in 1382 cases. Of primary interest is the comparison of relapse and treatment related mortality between patients with different donor types. Here there were 1224 with an HLA-identical sibling donor, 383 with an HLA-matched unrelated donor, and 108 cases with an HLA-mismatched unrelated donor. We shall use this data set in the sequel to illustrate the methods.
2. Basic quantities
Competing risks are typically represented by a set of positive random variables X_1, \ldots, X_K, where X_j is the potential (unobservable) time to occurrence of the j-th cause of failure. We observe T = \min(X_1, \ldots, X_K) and an indicator \delta which tells which of the K risks caused the failure. That is, \delta = j if T = X_j. A basic quantity in competing risks theory is the crude hazard rate, h_j(t), which is the rate of occurrence of the j-th failure cause in the presence of all causes of failure. That is,
h_j(t) = \lim_{\Delta \to 0} \frac{P[t \le X_j < t + \Delta \mid T \ge t]}{\Delta}.   (2.1)
The crude hazard rate can be computed from the joint survival function of the X's, F(x_1, \ldots, x_K) = P[X_1 > x_1, \ldots, X_K > x_K], as
h_j(t) = -\frac{\partial \ln(F(x_1, \ldots, x_K))}{\partial x_j}   at x_1 = \cdots = x_K = t.   (2.2)
As opposed to the usual hazard rate, the exponential of the cumulative crude hazard rate is a meaningless quantity and is not related to any proper survival function. The crude hazard rates are related, however, to the hazard rate of the time to event, T, by
h(t) = \sum_{i=1}^{K} h_i(t).   (2.3)
When the potential failure times are independent, the crude hazard rates, h_j(t), are the same as the marginal hazard rates of the X_j's, \lambda_j(t). When the X's are dependent, in general, this is not true. It is also not possible to identify from competing risk data whether the X's are independent or not, because for every dependent system of X's there is an independent set of random variables that will have the same crude hazard rates. This independent system of risks, however, will have different marginal distributions from the original dependent set of variables (see, for example, Klein and Moeschberger (1987), Basu and Klein (1982)). An alternative way to summarize the likelihood of a competing risk occurring is in terms of the cumulative incidence function G_j(t) = P[T \le t, \delta = j]. The cumulative incidence function can be computed as a function of all K crude hazard rates by
G_j(t) = \int_0^t h_j(x) \exp\left\{-\int_0^x \sum_{i=1}^{K} h_i(u)\, du\right\} dx.   (2.4)
Note that the cumulative incidence depends on all K of the crude hazard rates. The function G_j(t) is a subdistribution function with the property that G_j(\infty) = P[\delta = j]. The cumulative incidence function represents the chance that competing risk j occurs in a world where individuals can fail from any of the causes. The marginal or net distribution function for cause j, P[X_j \le t], is the chance that cause j occurs in a counterfactual world where only cause j can occur.
Fig. 1. Possible states in the competing risk framework.
An alternative formulation of the competing risks problem is in terms of a multistate model. As originally proposed by Prentice et al. (1978) and recently discussed by Andersen et al. (2002), this approach does not require the construction of potential failure times for each cause of failure. It also removes some of the confusion between the net and crude probabilities of occurrence of a competing risk. In the multistate model formulation, there are K + 1 states a subject may be in at any point in time (see Figure 1). One state is transient and is the state that the subject is alive. The other K states are absorbing states and are the states that the subject is dead from a given cause. The transition intensities of the states are given by the crude hazard rates defined in (2.1). The basic parameters are the transition probabilities, Phj (s, t) = P [state j at time t | state h at time s], s < t. Here P00 (0, t) is the probability a subject is alive at time t and P0j (0, t) = Gj (t).
3. Univariate estimation
In this section, we will discuss how one can estimate the basic competing risks quantities based on a censored sample of competing risk data. The data consist, for each of n subjects, of an on-study time T_i and an indicator \delta_i of the cause of removal from the study. To allow for independent censoring we assign a value of 0 to \delta when the subject is censored. If censoring is correlated with outcome it can be treated as a competing risk. Define Y(t) = \#[T_i \ge t] as the number of individuals at risk at time t. Also let N_j(t) = \#[T_i \le t, \delta_i = j] be the number of failures of type j up to time t, and N(t) = \sum_{j=1}^{K} N_j(t) the total number of failures of any cause. Based on these data we can estimate, without further assumptions, a number of quantities. The first is the so-called Nelson–Aalen estimator of the crude cumulative hazard rate. Here we estimate H_j(t) = \int_0^t h_j(x)\, dx by
\hat H_j(t) = \int_0^t \frac{dN_j(u)}{Y(u)} = \sum_{T_i \le t} \frac{I[\delta_i = j]}{Y(T_i)}.   (3.1)
Fig. 2. Kernel smoothed crude hazard rate estimate for relapse.
Note that this estimator is the usual Nelson–Aalen estimator one obtains if failure from any cause other than the cause of interest is treated as a censored observation. The standard error of this estimator is given by
SE[\hat H_j(t)] = \left\{\sum_{T_i \le t} \frac{I[\delta_i = j]}{Y(T_i)^2}\right\}^{1/2}.   (3.2)
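The estimator (3.1) and its standard error (3.2) are straightforward to compute; the following Python sketch (ours, not from the chapter) does so on the grid of observed type-j failure times.

```python
import numpy as np

def crude_nelson_aalen(time, status, cause):
    """Nelson-Aalen estimate of the crude cumulative hazard H_j(t) with SE (3.2).

    status: 0 for censored, otherwise the observed cause of failure."""
    time, status = np.asarray(time, float), np.asarray(status, int)
    t_j = np.sort(np.unique(time[status == cause]))
    H, V, out = 0.0, 0.0, []
    for t in t_j:
        Y = (time >= t).sum()                            # at risk just before t
        d = ((time == t) & (status == cause)).sum()      # type-j failures at t
        H += d / Y
        V += d / Y**2
        out.append((t, H, np.sqrt(V)))
    return out
```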
Estimates of the crude hazard rate itself can be found by a smoothing technique such as the kernel smoothed estimator proposed by Ramlau-Hansen (1983a, 1983b) (see Klein and Moeschberger (2003) for details). To illustrate the methods we present in Figure 2 the smoothed relapse hazard for the three bone marrow donor groups and in Figure 3 the smoothed treatment related mortality hazard rates. Here we have used a uniform kernel and a bandwidth of 2 months. The figures show typical patterns for the crude hazard rates for these two risks, a decreasing rate of relapse over time and a hump shaped hazard rate for death in remission reflecting early mortality due to infections and graft versus host disease. While the estimates of the crude hazard rate are helpful in understanding the failure mechanism they do not lead directly to estimators of the competing risks probabilities. As noted in Section 2 the sum of the crude hazard rates is equal to the hazard rate of the time to event, T . Thus the survival function of T can be estimated directly from estimated crude cumulative hazard rates by
\hat{S}(t) = \prod_{T_i \le t} \left[ 1 - \sum_{j=1}^{K} d\hat{H}_j(T_i) \right] = \prod_{T_i \le t} \left[ 1 - \frac{\sum_{j=1}^{K} I[\delta_i = j]}{Y(T_i)} \right].   (3.3)
Fig. 3. Kernel smoothed crude hazard rate estimate for death in remission.
This estimator is simply the usual Kaplan and Meier (1958) estimator of the time to event. Several estimates are commonly used to summarize competing risks probabilities. The first is an estimate of the cumulative incidence function, G_j(t). The estimator is

\hat{G}_j(t) = \int_0^t \hat{S}(u-)\,d\hat{H}_j(u) = \sum_{T_i \le t} \hat{S}(T_i-) \frac{I[\delta_i = j]}{Y(T_i)}.   (3.4)

This estimator was first proposed by Kalbfleisch and Prentice (1980). \hat{G}_j(t) provides an estimate of the probability of cause j occurring prior to time t in a subject at risk of experiencing any of the K competing risks. Andersen et al. (1993) provide an estimator of the variance of \hat{G}_j(t),

V[\hat{G}_j(t)] = \int_0^t \hat{S}(u-)^2 \left[\hat{G}_j(t) - \hat{G}_j(u)\right]^2 \frac{dN(u)}{Y(u)^2} + \int_0^t \hat{S}(u-)^2 \left\{ 1 - 2\left[\hat{G}_j(t) - \hat{G}_j(u)\right] \right\} \frac{dN_j(u)}{Y(u)^2}.   (3.5)

The estimator can be shown to converge weakly to a Gaussian process, so that pointwise confidence intervals can be found by the usual techniques. A second summary statistic, suggested by Pepe and Mori (1993) and Pepe (1991), is an estimator of the conditional probability of event j occurring given that none of the other events have occurred by time t. To define the estimator, let \hat{G}_{(j)}(t) be the estimated cumulative incidence function for a pooled cause (j) where all competing risks other
than j are taken as the event of interest. The estimated conditional probability of cause j occurring by time t is given by

\hat{C}_j(t) = \frac{\hat{G}_j(t)}{1 - \hat{G}_{(j)}(t)}.   (3.6)

The variance of this estimator is estimated by

V[\hat{C}_j(t)] = \frac{\hat{S}(t-)^2}{[1 - \hat{G}_{(j)}(t)]^4} \int_0^t \frac{[1 - \hat{G}_{(j)}(u)]^2\,dN_j(u) + \hat{G}_j(u)^2\,dN_{(j)}(u)}{Y(u)^2},   (3.7)

where N_{(j)}(t) is the total number of non-type-j events up to time t. A third estimator commonly used to summarize competing risks data is the complement of the usual Kaplan–Meier estimator where events not of type j are treated as censored observations. That is,

\hat{S}_j(t) = \prod_{T_i \le t} \left[ 1 - \frac{dN_j(T_i)}{Y(T_i)} \right].   (3.8)
1 - \hat{S}_j(t) is an estimate of the marginal or net distribution function, that is, the probability of event j occurring in a counterfactual world where only risk j can occur. It is an (approximately) unbiased estimate only when the competing risks are independent, and it has no meaning in the multistate model for competing risks. When the competing risks are dependent, it may provide a quite biased estimator of the marginal distribution (see Klein and Moeschberger (1984)).

To better understand these probabilities, consider a hypothetical experiment involving 100 patients with two risks, relapse and death in remission. Suppose that there is no censoring and that, at one year after transplant, 10 patients have relapsed and 30 patients have died in remission. When there is no censoring the cumulative incidence reduces to the cumulative number of events of the given type divided by n, so the relapse cumulative incidence is 10/100 and the death in remission cumulative incidence is 30/100. The death in remission incidence is clearly interpreted as the proportion of patients who died in complete remission prior to one year. The conditional probability estimates are 10/70 for relapse and 30/90 for death in remission. Here the death in remission probability is estimated by the number who die in remission divided by the number who could have died in remission, which is the number at risk at one year who have yet to relapse. The complement of the Kaplan–Meier estimate depends on the pattern of occurrences of deaths and relapses. If all deaths occur before the first relapse then the relapse probability is 10/70, while if all the relapses occurred before the first death we get an estimate of 10/100. Any value between these two extremes is possible. Clearly, this estimate has no meaningful interpretation.

We again illustrate these curves using data from the BMT cohort. Figure 4 shows the cumulative incidence, conditional probability and 1-Kaplan–Meier curves for relapse for the HLA-identical sibling donor group. Here we see the typical pattern of cumulative
Fig. 4. Estimated cumulative incidence of relapse, conditional probability of relapse and marginal (1-Kaplan–Meier estimate) probability for the HLA-identical sibling donor group.
Fig. 5. Probability of relapse, death in remission and disease free survival for the HLA-identical sibling donor group.
incidence < 1-Kaplan–Meier < conditional probability, with equality only when the competing risks are independent. Note that at 5 years the estimate of the probability of relapse is 0.31 (SE = 0.019) using the Kaplan–Meier estimate and 0.25 (SE = 0.015) using the correct cumulative incidence estimate. The cumulative incidence curves for the various risks are additive in that the sum of the cumulative incidence curves over all risks is the probability of the event occurring, regardless of risk. Figure 5 is the way we would suggest that cumulative incidence curves be displayed. Here, for the sibling donors, the height of the lower curve is the relapse probability, the difference between the middle and bottom curves is the death in remission probability, and the difference between the middle curve and the top curve at 1 is the (Kaplan–Meier estimate of the) survival function of the time to event, here treatment failure (death or relapse). These curves provide a means of seeing dynamically the relationship between the competing risk probabilities. Note that if one used the complement of the Kaplan–Meier curves for the competing risks, this relationship between the sum of the risk probabilities and the treatment failure probability would not hold. A SAS macro to compute cumulative incidence is available at our website http://www.biostat.mcw.edu.
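To make the cumulative incidence computation in (3.4) and the 100-patient example above concrete, here is a minimal Python sketch. The function name and the simulated event times are ours and purely illustrative; with no censoring before one year, only the event counts matter.

```python
import numpy as np

def cumulative_incidence(time, cause, target, t):
    """Nonparametric cumulative incidence estimate (3.4) at time t for the
    target cause; cause == 0 denotes a censored observation."""
    time, cause = np.asarray(time, float), np.asarray(cause)
    cif, surv = 0.0, 1.0                                    # surv tracks S-hat(u-)
    for u in np.unique(time[time <= t]):
        at_risk = np.sum(time >= u)                         # Y(u)
        d_target = np.sum((time == u) & (cause == target))  # type-target events at u
        d_any = np.sum((time == u) & (cause != 0))          # events of any cause at u
        cif += surv * d_target / at_risk                    # S(u-) dH_target(u)
        surv *= 1.0 - d_any / at_risk                       # Kaplan-Meier update to S(u)
    return cif

# Hypothetical 100-patient cohort: 10 relapses (cause 1) and 30 deaths in
# remission (cause 2) before one year, 60 subjects followed beyond one year.
rng = np.random.default_rng(0)
time = np.concatenate([rng.uniform(0.0, 1.0, 40), np.full(60, 5.0)])
cause = np.concatenate([np.ones(10), np.full(30, 2), np.zeros(60)])
g1 = cumulative_incidence(time, cause, target=1, t=1.0)   # 0.10, relapse
g2 = cumulative_incidence(time, cause, target=2, t=1.0)   # 0.30, death in remission
print(g1, g2, g1 / (1.0 - g2))                            # conditional probability 10/70
```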
4. Inference based on the crude hazard rates

Inference based on the crude hazard rate for cause j can be made by standard methods that treat failures from causes other than the cause of interest as censored observations. Weighted log rank tests (cf. Andersen et al. (1982), Andersen et al. (1993) and Chapter 7 of Klein and Moeschberger (2003)) can be used to compare two or more groups. These tests can be used to compare groups stratified on risk factors, as the basis for tests for trend, and for matched pairs data. The competing risks formulation does not add any additional complication to the analysis. Regression methods can also be performed for competing risks data by treating the other competing risks as censored data. Regression models include the Cox (1972) multiplicative hazards model with h_j(t | Z) = h_{0j}(t) exp(\beta_j Z) or Aalen's (1989) additive hazards model with h_j(t | Z) = \alpha_{0j}(t) + \gamma_j(t) Z, for example. These models can be fit by standard techniques and using standard software.

While these methods are easy to apply using standard software, they are, in many cases, not modeling quantities of direct interest. As noted earlier, the product-integral of the crude hazard rate is a meaningless quantity. Differences in crude hazard rates for a particular risk do not translate directly into differences between cumulative incidence curves, since these curves are functions of all the competing risk crude hazard rates. It is, however, possible to use estimates from a regression model of the K crude hazard rates to obtain an estimate of the cumulative incidence function, adjusted for covariates, by substitution into (2.4). Using results in Andersen et al. (1993) and the multistate model formulation, the standard error of the estimates based on K Cox models can be obtained. The resulting regression model for the cumulative incidence curves is a complicated nonlinear function of the covariates and as such is quite difficult to interpret.
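In practice this amounts to refitting a standard Cox model once per cause after recoding the event indicator. A hedged sketch using the lifelines package is below; the data frame and column names ('time', 'cause', and the covariate list) are hypothetical placeholders, not the authors' software or data.

```python
import pandas as pd
from lifelines import CoxPHFitter

def fit_cause_specific_cox(df, cause, covariates):
    """Fit a Cox model for the crude (cause-specific) hazard of one cause by
    treating failures from all other causes as censored observations."""
    d = df[covariates + ["time"]].copy()
    d["event"] = (df["cause"] == cause).astype(int)   # other causes -> censored
    fitter = CoxPHFitter()
    fitter.fit(d, duration_col="time", event_col="event")
    return fitter

# Hypothetical usage, one fit per competing risk (numeric covariate columns assumed):
# df = pd.read_csv("bmt.csv")
# relapse_fit = fit_cause_specific_cox(df, cause=1, covariates=["donor", "stage", "karnofsky"])
# trm_fit     = fit_cause_specific_cox(df, cause=2, covariates=["donor", "stage", "karnofsky"])
# relapse_fit.print_summary()
```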
Table 1
Log rank tests based on the crude hazard rates

Hypothesis              Relapse X^2   Relapse p-value   TRM X^2   TRM p-value   DFS X^2   DFS p-value
All groups equal        2.87          0.2375            185.09    < 0.0001      118.42    < 0.0001
HLA-sibs = matched      0.52          0.4702            119.93    < 0.0001      84.36     < 0.0001
HLA-sibs = mismatched   2.03          0.1540            116.18    < 0.0001      59.36     < 0.0001
Table 2
Results of the Cox regression analysis based on crude hazard rates

Effect                   Relapse                          TRM                              DFS
                         Estimate (95% CI)     p          Estimate (95% CI)     p          Estimate (95% CI)     p
Donor*
  Sibling                1.00                             1.00                             1.00
  Matched Unrelated      1.01 (0.75–1.36)      0.945      2.25 (1.69–2.83)      < 0.001    1.74 (1.46–2.03)      < 0.001
  Mismatched Unrelated   0.39 (0.19–0.79)      0.009      3.06 (2.35–3.98)      < 0.001    1.90 (1.49–2.40)      < 0.001
Disease
  ALL                    1.00                             1.00                             1.00
  AML                    1.31 (0.98–1.74)      0.062      1.22 (0.93–1.58)      0.151      1.20 (0.99–1.45)      0.067
  CML                    0.64 (0.48–0.86)      0.003      1.63 (1.30–2.03)      < 0.001    1.18 (0.99–1.40)      0.059
Stage of disease
  Early                  1.00                             1.00                             1.00
  Intermediate           1.90 (1.40–2.56)      < 0.001    1.61 (1.31–1.97)      < 0.001    1.72 (1.46–2.03)      < 0.001
  Advanced               6.34 (4.73–8.52)      < 0.001    2.18 (1.69–2.83)      < 0.001    3.26 (2.69–3.95)      < 0.001
Karnofsky score
  < 90                   1.00                             1.00                             1.00
  ≥ 90                   0.90 (0.67–1.17)      0.406      0.60 (0.49–0.74)      < 0.001    0.69 (0.59–0.82)      < 0.001

* Two degree of freedom test of the overall donor effect on relapse: p = 0.032.
We again illustrate these techniques on the alternative donor study. Table 1 provides the results of the (unweighted) log rank test for both competing risks and for the overall disease free survival rate. In Table 2 we present the results of the Cox regression models for the crude hazard rates of relapse, treatment related mortality and disease free survival. Since the observed probability of one of the competing risks, as summarized by the cumulative incidence function, depends on both rates, we feel it important that analyses of both competing risks be reported. We see in Table 2 that the covariates have differing effects depending on the
outcome. The effect of disease type, in particular, shows the importance of analyzing both risks. Here CML patients have a significantly lower relapse rate but a higher death in remission rate, so that the effect of disease on the time to treatment failure washes out.
5. Tests based on the cumulative incidence function

The log rank test discussed in the previous section compares the crude hazard rate between two or more groups. In this section, we present three tests that have been proposed to directly compare the cumulative incidence functions. The first test is due to Gray (1988). This test can be used to compare the cumulative incidence functions for two or more groups. We present the details here for M \ge 2 groups using a slightly different variance calculation than found in Gray. Without loss of generality we assume that there are only two competing risks and we are interested in the first risk. The hypotheses of interest are:

H_0: G_{11}(t) = \cdots = G_{1M}(t) = G_{01}(t), for all t \le \tau,
H_A: at least one of the G_{1k}'s is different for some t.   (5.1)
Here G_{1k}(t) is the cumulative incidence function of an event of type 1 for the kth sample. The test statistic is based on the (improper) random variable X^*_{ik}, i = 1, \ldots, n_k, k = 1, \ldots, M. This random variable, defined by

X^*_{ik} = T_{ik} if \delta_{ik} = 1, and X^*_{ik} = \infty if \delta_{ik} \ne 1,   (5.2)

treats occurrences of the other competing risk as occurring at \infty. One can show that P[X^*_{ik} \le t] = G_{1k}(t) and that the hazard rate of X^*_{ik} is \gamma_{1k}(t), given by

\gamma_{1k}(t) = \frac{dG_{1k}(t)/dt}{1 - G_{1k}(t)}.   (5.3)
The test is based on comparing the (weighted) differences between estimates of \gamma_{1k}(t) and the pooled sample estimate of \gamma_1(t). Using the notation of Section 3, we let N_{1k}(t) be the number of failures of type 1 prior to time t and Y_k(t) the number at risk at time t in the kth group. Let \hat{G}_{1k}(t) be the estimated cumulative incidence curve for the kth sample and cause 1 (see (3.4)) and let \hat{G}_{01}(t) be this estimator based on the pooled sample. Let \hat{S}_k(t) be the estimate of the survival function of the time to event from any cause given by (3.3). An estimate of \Gamma_{1k}(t) = \int_0^t \gamma_{1k}(u)\,du is given by

\hat{\Gamma}_{1k}(t) = \int_0^t \frac{\hat{S}_k(u-)\,dN_{1k}(u)}{[1 - \hat{G}_{1k}(u-)]\,Y_k(u)}.   (5.4)

The test statistic is given by

Z_k = \int_0^\tau W_k(u) \left[ d\hat{\Gamma}_{1k}(u) - d\hat{\Gamma}_{01}(u) \right].   (5.5)
Here W_k(t) is a weight function and d\hat{\Gamma}_{01}(u) is the estimator (5.4) based on the pooled sample. In most cases the weight functions are taken to be of the form L(t)R_k(t), where

R_k(t) = \frac{Y_k(t)[1 - \hat{G}_{1k}(t-)]}{\hat{S}_k(t-)}.   (5.6)
We shall take L(t) = 1 for simplicity. This choice corresponds to an analogue of the log rank test. With this weight function the test statistics are of the form

Z_k = \int_0^\tau \left[ dN_{1k}(u) - \frac{R_k(u)}{R_.(u)}\,dN_{1.}(u) \right],   (5.7)

where R_.(t) and N_{1.}(t) are the values of R_k and N_{1k} summed over all the M samples. Note that this is the weighted difference over time of the number of type 1 events in the kth sample and the expected number in the pooled sample. Standard counting process techniques can be used to show that, under H_0, we have

V(Z_k) = \int_0^\tau Y_k(u) \left[ \frac{R_.(u) - R_k(u)}{R_.(u)} \right]^2 \frac{d\hat{G}_{01}(u)}{\hat{S}_k(u-)} + \int_0^\tau \left[ \frac{R_k(u)}{R_.(u)} \right]^2 \sum_{i \ne k} Y_i(u) \frac{d\hat{G}_{01}(u)}{\hat{S}_i(u-)},   (5.8)

and

Cov(Z_h, Z_k) = -\int_0^\tau \frac{R_k(u)[R_.(u) - R_h(u)]}{R_.(u)^2}\,Y_h(u) \frac{d\hat{G}_{01}(u)}{\hat{S}_h(u-)} - \int_0^\tau \frac{R_h(u)[R_.(u) - R_k(u)]}{R_.(u)^2}\,Y_k(u) \frac{d\hat{G}_{01}(u)}{\hat{S}_k(u-)} + \int_0^\tau \frac{R_k(u)R_h(u)}{R_.(u)^2} \sum_{j \ne h,k} Y_j(u) \frac{d\hat{G}_{01}(u)}{\hat{S}_j(u-)}.   (5.9)
The test statistic is

X^2 = (Z_1, \ldots, Z_M)\,\Sigma^{-}\,(Z_1, \ldots, Z_M)',   (5.10)

where \Sigma^{-} is a generalized inverse of the covariance matrix estimated using (5.8) and (5.9). The statistic has a large sample chi-square distribution with M - 1 degrees of freedom under the null hypothesis.

We apply this test to the transplant data. Figures 6 and 7 show the cumulative incidence curves for relapse and treatment related mortality by donor type. Table 3 gives the results of the test of equality of the three groups' cumulative incidence curves and pairwise tests comparing the HLA-identical sibling donor group to the two alternative donor groups. The pairwise comparisons are based on the statistics computed using only data from the two groups. Table 3 shows that the results are similar to those found using the log rank test for the competing risk of treatment related mortality, but for relapse the conclusions are different. Here we see a significant difference between the three
Fig. 6. Estimated cumulative incidence of relapse for the three donor groups.
Table 3
Tests based on Gray's statistic comparing cumulative incidence functions

Hypothesis            Relapse X^2   Relapse p-value   TRM X^2   TRM p-value
All groups equal      11.90         0.0026            151.58    < 0.0001
Sibs vs matched       2.43          0.1188            96.97     < 0.0001
Sibs vs mismatched    10.28         0.0013            88.47     < 0.0001
groups, with the HLA-identical sibling and HLA-mismatched unrelated donors being significantly different.

The next two tests are available to compare two groups. Both directly compare the two cumulative incidence functions. The first, due to Lin (1997), is a variant of the Kolmogorov–Smirnov test. The test is given by

Q = \sup_t \left| W(t) \left[ \hat{G}_{11}(t) - \hat{G}_{12}(t) \right] \right|,   (5.11)

where W(\cdot) is a weight function and the \hat{G}_{1k}'s are the estimated cumulative incidence functions for cause 1 in the two samples. The distribution of Q under the null hypothesis is difficult to obtain, so a modified bootstrap procedure is used. The procedure is based on the following argument. We again assume that we have only two risks with cumulative incidence functions G_{jk}(t), for samples k, k = 1, 2, and risks j, j = 1, 2. Let N_{jk}(t) be the total number of type j events and Y_k(t) the number at risk at time t in sample k. Let h_{jk}(t) be the crude hazard rate in the kth sample for cause j. One can show that the
Fig. 7. Estimated cumulative incidence of death in remission for the three donor groups.
quantity M_{jk}(t) defined by

M_{jk}(t) = N_{jk}(t) - \int_0^t Y_k(u)\,h_{jk}(u)\,du   (5.12)

is a martingale. Lin shows that \theta_{1k}(t) = n_k^{1/2}\{\hat{G}_{1k}(t) - G_{1k}(t)\} can be approximated by

\theta_{1k}(t) \approx n_k^{1/2} \left[ \int_0^t \frac{\{1 - G_{2k}(u)\}\,dM_{1k}(u)}{Y_k(u)} + \int_0^t \frac{G_{1k}(u)\,dM_{2k}(u)}{Y_k(u)} - G_{1k}(t) \int_0^t \frac{dM_{1k}(u) + dM_{2k}(u)}{Y_k(u)} \right].   (5.13)
Eq. (5.13) can be used to generate replicates of \theta_{1k}(t) by simulating the martingales using a large sample normal approximation. That is,

\hat{M}_{jk}(t) = \sum_{i=1}^{n_k} A_{jki}\, I[T_{ik} \le t, \delta_{ik} = j],   (5.14)

where the A's are independent standard normal deviates. A replicate of \theta_{1k} is then found by generation of \hat{M}_{jk}(t) by (5.14) and substitution of this quantity and the empirical estimates of the cumulative incidence functions into (5.13). Note that the pooled sample estimator is used to estimate G_{1k}. Combining independent replicates of \theta_{1k}(\cdot), k = 1, 2, as \sup_t |\theta_{11}(t) - \theta_{12}(t)| gives a replicate of Q given by (5.11) (with W(t) = 1) when
the null hypothesis is true. Based on a large number of replicates of this statistic an empirical p-value can be found.

Table 4
Tests based on Pepe's and Lin's statistics comparing cumulative incidence functions

                      Relapse                                      TRM
Comparison            Pepe's X^2   Pepe's p-value   Lin's p-value   Pepe's X^2   Pepe's p-value   Lin's p-value
Sibs vs matched       3.74         0.0533           0.0212          98.23        < 0.0001         < 0.0001
Sibs vs mismatched    28.36        < 0.0001         < 0.0001        119.95       < 0.0001         < 0.0001

The second test is due to Pepe (1991) and is based on the integrated difference between the cumulative incidence functions. That is, the test is based on
Z = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \int_0^\tau W(u) \left[ \hat{G}_{11}(u) - \hat{G}_{12}(u) \right] du,   (5.15)

where W(\cdot) is a weight function. The variance of Z is estimated by a method of moments estimator. To express the estimated variance, let N_{1ki}(t) be 1 if the ith subject in group k has experienced a type 1 event by time t and let Y_{ki}(t) be the indicator of whether this subject is under observation at time t. Let N_{1k}(t) and Y_k(t) be the sums of these quantities over all individuals in sample k. Then the estimated variance of Z under the null hypothesis is

V(Z) = \sum_{k=1}^{2} \frac{n_{3-k}}{n_k (n_1 + n_2)} \sum_{i=1}^{n_k} \left( \int_0^\tau W(u)\,A_{k1i}(u)\,du \right)^2,   (5.16)

where

A_{k1i}(t) = \int_0^t \frac{n_k \hat{S}_k(u)\,dN_{1ki}(u)}{Y_k(u)} - \int_0^t \frac{n_k Y_{ki}(u)}{Y_k(u)} \frac{\hat{S}_k(u)\,dN_{1k}(u)}{Y_k(u)} + \int_0^t \frac{B_{ki}(u)\,dN_{1k}(u)}{Y_k(u)}   (5.17)

and

B_{ki}(t) = -\hat{S}_k(t) \left[ \int_0^t \frac{n_k\,d(N_{1ki} + N_{2ki})(u)}{Y_k(u)} - \int_0^t \frac{n_k Y_{ki}(u)\,d(N_{1k} + N_{2k})(u)}{Y_k(u)^2} \right].   (5.18)

The test statistic is given by Z/\sqrt{V(Z)}, which has a large sample normal distribution under H_0.

To illustrate these two tests, we present in Table 4 the comparisons of the HLA-identical sibling group to each of the alternative donor groups. Both tests suggest that each of the alternative donor groups has a different incidence of both relapse and death in remission. Again, for relapse these are different conclusions than those we obtained from the log rank test on the crude hazard rates.
The three statistics discussed in this section have different power depending on the alternative of interest. The test of Lin is an omnibus test. Based on a small Monte Carlo study, Pepe's test seems to reject the null hypothesis too often under H_0, due to an underestimation of the variance of Z, and its use is not recommended. Lin's test appears to be a bit conservative.

6. Regression techniques based on the cumulative incidence function

In this section, we present three approaches to estimation of how covariates directly affect the cumulative incidence function. Again, we assume that the competing risk indexed by 1 is of interest and that there are only two competing risks.

The first technique, suggested by Fine and Gray (1999), is based on the hazard rate \gamma(t) for the variable X^* defined in (5.2) and (5.3). The analysis is closely related to a standard Cox (1972) regression model where occurrences of competing risks other than the risk of interest are treated as occurring at infinity. The model assumes a proportional hazards form for \gamma. That is, given a set of covariates, Z, we have

\gamma(t \mid Z) = \frac{dG_1(t \mid Z)/dt}{1 - G_1(t \mid Z)} = \gamma_0(t) \exp(\beta' Z).   (6.1)

When there is no censoring, \beta can be estimated in exactly the same way as in the Cox model for right-censored data using a modified risk set. Here the risk set, R(t), at time t is all individuals yet to experience any event plus all those individuals who experienced a type 2 event at a time prior to t. Based on N_i(t) = I[T_i \le t, \delta_i = 1] and Y_i(t) = 1 - N_i(t-), the score function when there is no censoring is

U(\beta) = \sum_{i=1}^{n} \int_0^\infty \left[ Z_i - \frac{\sum_{j \in R(s)} Y_j(s) Z_j \exp\{\beta' Z_j\}}{\sum_{j \in R(s)} Y_j(s) \exp\{\beta' Z_j\}} \right] dN_i(s),   (6.2)

which is of the form of the usual Cox score function. Values of \beta that solve the score equation (6.2) are the desired estimators. In this case, the usual information calculations can be used to find the estimated standard errors of the estimated \beta.

When there is right censoring, an inverse probability of censoring weighting technique is used. Here we let C(t) be the probability of not being censored at time t. This is estimated consistently by the usual Kaplan–Meier estimator that treats occurrences of either competing risk as censored observations and occurrences of censoring as an event. We define a time dependent weight function w_i(t) for each observation by

w_i(t) = \hat{C}(t)/\hat{C}(\min(t, T_i)) if N_i(t) is observable, and w_i(t) = 0 otherwise.   (6.3)

Note that w_i(t) is nonzero for censored observations up to the time of censoring. Using this weight, an estimating equation for \beta is given by

U(\beta) = \sum_{i=1}^{n} \int_0^\infty w_i(s) \left[ Z_i - \frac{\sum_{j \in R(s)} w_j(s) Y_j(s) Z_j \exp\{\beta' Z_j\}}{\sum_{j \in R(s)} w_j(s) Y_j(s) \exp\{\beta' Z_j\}} \right] dN_i(s).   (6.4)
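As a rough illustration of the weighting step in (6.3), the following Python sketch computes \hat{C}(t) by the Kaplan–Meier estimator with censoring treated as the event and then forms w_i(t) on a grid of time points. The helper is our own illustrative code under a stated coding convention (cause = 0 for censoring), not the authors' software.

```python
import numpy as np

def censoring_weights(time, cause, grid):
    """IPCW weights w_i(t) of Eq. (6.3): C_hat is the Kaplan-Meier estimate of
    the censoring distribution, and w_i(t) = C_hat(t) / C_hat(min(t, T_i)) while
    subject i's type-1 counting process is still observable, 0 otherwise."""
    time, cause = np.asarray(time, float), np.asarray(cause)

    def C_hat(t):
        # Kaplan-Meier for censoring: censorings are the events, failures are censored.
        c = 1.0
        for u in np.unique(time[time <= t]):
            at_risk = np.sum(time >= u)
            d_cens = np.sum((time == u) & (cause == 0))
            c *= 1.0 - d_cens / at_risk
        return c

    W = np.zeros((len(time), len(grid)))
    for col, t in enumerate(grid):
        for i in range(len(time)):
            # Observable at t: subject failed (from any cause) or is still on study at t.
            if cause[i] != 0 or time[i] >= t:
                W[i, col] = C_hat(t) / C_hat(min(t, time[i]))
    return W
```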
Table 5
Regression estimates for relapse and death in remission based on Fine and Gray's method

Effect                   Relapse                             Death in remission
                         Relative risk (95% CI)   p-value    Relative risk (95% CI)   p-value
Donor
  Sibling                1.00                                1.00
  Matched Unrelated      0.73 (0.54–0.98)         0.0332     2.15 (1.77–2.61)         < 0.0001
  Mismatched Unrelated   0.25 (0.12–0.52)         0.0002     3.19 (2.44–4.16)         < 0.0001
Disease
  ALL                    1.00                                1.00
  AML                    1.19 (0.90–1.58)         0.2274     1.19 (0.89–1.59)         0.2398
  CML                    0.56 (0.42–0.76)         0.0002     1.63 (1.31–2.03)         < 0.0001
Stage of disease
  Early                  1.00                                1.00
  Intermediate           1.66 (1.22–2.25)         0.0011     1.51 (1.23–1.86)         < 0.0001
  Advanced               4.51 (3.33–6.12)         < 0.0001   1.58 (1.22–2.04)         0.0005
Karnofsky score
  < 90                   1.00                                1.00
  ≥ 90                   1.18 (0.90–1.53)         0.2326     0.65 (0.53–0.79)         < 0.0001
Estimates of the variability of the estimates of \beta based on (6.4) are given in Fine and Gray (1999). They suggest using a "sandwich" estimator. In our example, the factors studied were donor type, disease, disease stage at transplant and Karnofsky performance score. Reported in Table 5 are the "relative risk", e^\beta, and the p-value of the test of \beta = 0. The table shows that for both relapse and death in remission there is a difference in the cumulative incidences due to donor type after adjustment for disease, disease stage and Karnofsky score.

An alternative approach to the regression problem is based on the work of Andersen et al. (2003). This general approach is based on using pseudo-values to obtain a regression model for the cumulative incidence function. Let \hat{G}_k(t) be the estimated cumulative incidence function for cause k based on the complete sample of size n. Let \hat{G}_k^{-i}(t) be the estimated cumulative incidence function based on the subsample obtained by deletion of the ith observation. We define the pseudo-value by

\hat{G}_k^{i}(t) = n \hat{G}_k(t) - (n - 1) \hat{G}_k^{-i}(t), \quad i = 1, \ldots, n.   (6.5)

A generalized linear model is assumed for the pseudo-values. That is,

\Psi\left[ E\left( \hat{G}_k^{i}(t) \right) \right] = \alpha_k(t) + \beta' Z_i,   (6.6)

where \Psi(\cdot) is a link function. Choices of link functions include the logit function, \Psi(x) = logit(x) = \log(x/(1 - x)), and the complementary log–log function
\Psi(x) = cloglog(x) = \log(-\log(1 - x)). Alternatively, a complementary log–log transform on 1 - G may be reasonable. The complementary log–log transform model is quite similar to Fine and Gray's model and, when there is only a single risk, is analogous to the Cox (1972) model.

To estimate the model parameters we obtain the pseudo-values at a grid of time points, \tau_1 < \cdots < \tau_m. We find that a grid of 5–10 times roughly evenly spaced on an event scale provides good estimates. Let Y_i = (Y_{i1}, \ldots, Y_{im}) be the m-vector defined by Y_{ij} = \hat{G}_k^{i}(\tau_j). Let \theta_i be the m-vector with elements \theta_{ij} = \Psi^{-1}(\alpha_j + \beta' Z_i) = \mu(\alpha_j + \beta' Z_i), with \alpha_j = \alpha(\tau_j). The estimating equations to solve for \alpha and \beta are

U(\alpha, \beta) = \sum_i \left( \frac{\partial \mu_i}{\partial(\alpha, \beta)} \right)^t V_i^{-1} \left[ Y_i - \theta_i \right] = \sum_i U_i(\alpha, \beta),   (6.7)
where V_i is a working covariance matrix for Y_i. The solution to (6.7) provides the estimates of (\alpha, \beta). The estimated parameters, centered at the true values, are asymptotically normal with mean zero and a covariance that can be estimated by a "sandwich" estimator. That is, an estimated variance is

\hat{\Sigma} = \hat{I}^{-1}(\hat{\alpha}, \hat{\beta})\, \widehat{Var}\!\left[ U(\hat{\alpha}, \hat{\beta}) \right] \hat{I}^{-1}(\hat{\alpha}, \hat{\beta}),   (6.8)

where

I(\alpha, \beta) = \sum_i \left( \frac{\partial \mu_i}{\partial(\alpha, \beta)} \right)^t V_i^{-1} \left( \frac{\partial \mu_i}{\partial(\alpha, \beta)} \right)   (6.9)

and

\widehat{Var}\!\left[ U(\alpha, \beta) \right] = \sum_i U_i(\alpha, \beta)\, U_i(\alpha, \beta)^t.   (6.10)
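The pseudo-values in (6.5) can be computed by a jackknife over the nonparametric estimator of Section 3; a minimal Python sketch follows. It reuses the illustrative cumulative_incidence() helper from the earlier sketch, and the function name is ours, not a library routine.

```python
import numpy as np

def pseudo_values(time, cause, target, t):
    """Jackknife pseudo-values (6.5) for the cause-`target` cumulative incidence
    at time t; assumes the cumulative_incidence() sketch given earlier."""
    time, cause = np.asarray(time, float), np.asarray(cause)
    n = len(time)
    full = cumulative_incidence(time, cause, target, t)
    pv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # leave observation i out
        loo = cumulative_incidence(time[keep], cause[keep], target, t)
        pv[i] = n * full - (n - 1) * loo
    return pv
```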
Most of these calculations can be done using a standard statistical package that has generalized estimating equation routines once the pseudo-values are computed. For example, one can use the SAS routine GENMOD. This procedure is applied to the bone marrow transplant example with results summarized in Tables 6 and 7. In the example, 8 time points were used. We also used an independence (identity) working covariance matrix. The results are reported for three transformations. For each model we report the exponential of the estimate of β, 95% confidence intervals for eβ , and the p-values of the test of the hypothesis that β = 0. For the logit model eβ is the odds ratio for the probability of occurrence of the competing risk for an individual with the characteristic as compared to a subject without the characteristic. For the cloglog model on Gk ( ), we have exp(β) =
\frac{-\ln[1 - G_k(t \mid z = 1)]}{-\ln[1 - G_k(t \mid z = 0)]},   (6.11)

which, if G_k(\cdot) is a proper distribution function, is the usual relative risk. Similarly, for the cloglog model on 1 - G_k(t) a "relative risk" is defined by

\exp(\beta) = \frac{-\ln[G_k(t \mid z = 1)]}{-\ln[G_k(t \mid z = 0)]}.   (6.12)
Table 6 Regression estimates for relapse based on the pseudo-value approach Effect
Logit transform Odds ratio
Donor Sibling Matched Unrelated Mismatched Unrelated Disease ALL AML CML Stage of disease Early Intermediate Advanced Karnofsky score < 90 90
p-value
Complementary log–log transform of G(t) Relative risk p-value
0.0049 1.00 0.71 0.47–1.08 0.20 0.07–0.57
0.1076 0.0029
0.0064 1.00 0.75 0.52–1.07 0.24 0.09–0.63
0.0037 1.00 1.22 0.81–1.82 0.59 0.38–0.90
0.3429 0.0147
1.00 1.18 0.79–1.76
< 0.0001 < 0.0001
0.4133
0.1151 0.0040
0.0035 1.00 1.17 0.96–1.42 2.13 1.33–3.40
0.0081 1.00 1.17 0.83–1.66 0.64 0.43–0.93
< 0.0001 1.00 2.23 1.53–3.26 7.77 5.29–11.41
Complementary log–log transform of 1 − G(t) Relative risk p-value
0.3672 0.0210
1.00 1.19 0.85–1.67
< 0.0001 < 0.0001
0.3140
0.0016 0.0003
1.00 0.91 0.76–1.10 1.31 1.09–1.57
< 0.0001 1.00 2.16 1.52–3.06 6.28 4.50–8.76
0.1258
0.3380 0.0033 < 0.0001
1.00 0.72 0.61–0.84 0.38 0.31–0.46 1.00 0.97 0.80–1.17
0.0001 < 0.0001
0.7253
Note that in the first two cases e^\beta > 1 implies that subjects with the condition are more likely to have the event occur than those without the condition, while for the third model the reverse is true. The results from the pseudo-value approach using the three different link functions are quite similar, and all link functions give a similar inference. The results are also quite similar in magnitude to the regression results from the Fine and Gray method. This method has the advantage of being more flexible in the choice of link function and in the ability to do much of the regression analysis using standard statistical software for generalized estimating equations.

Fine (2001) also presents a regression model based on the cumulative incidence function that includes an arbitrary link function h(\cdot). He assumes that

h[G_k(t)] = \alpha_k(t) + \beta' Z_i.   (6.13)

The estimation scheme is based on an extension of the least squares procedure of Fine et al. (1998) and involves minimizing a sum of squares based on u_{ij} = I(\min(X^*_{ik}, t_0) > X^*_{jk}), where X^*_{ik} is the modified time to event variable which sets its value equal to +\infty when the other competing risk occurs (see (5.2)). Here u_{ij} is the indicator of whether the jth subject fails from cause k prior to individual i and prior to time t_0. When there is right censoring an inverse probability of censoring weighting technique is used with
Table 7 Regression estimates for death in remission based on the pseudo-value approach Effect
Logit transform Odds ratio
Donor Sibling Matched Unrelated Mismatched Unrelated Disease ALL AML CML Stage of disease Early Intermediate Advanced Karnofsky score < 90 90
p-value
Complementary log–log transform of G(t) Relative risk p-value
< 0.0001 1.00 2.48 1.94–3.16 4.21 2.89–6.11
< 0.0001 < 0.0001
< 0.0001 1.00 2.11 1.74–2.57 3.17 2.40–4.18
0.0006 1.00 1.31 0.93–1.86 1.74 1.30–2.32
0.1212 0.0002
1.00 0.56 0.43–0.74
0.0009 0.0038
< 0.0001
< 0.0001 < 0.0001
< 0.0001 1.00 0.60 0.52–0.69 0.43 0.34–0.55
0.0006 1.00 1.26 0.94–1.68 1.59 1.25–2.02
0.0007 1.00 1.56 1.20–2.03 1.64 1.17–2.28
Complementary log–log transform of 1 − G(t) Relative risk p-value
0.1224 0.0002
1.00 0.64 0.51–0.80
0.0012 0.0043
0.0001
< 0.0001 0.0005
1.00 2.33 1.94–2.79 0.75 0.64–0.86
0.0012 1.00 1.43 1.15–1.77 1.49 1.13–1.96
< 0.0001
< 0.0001 0.0001 0.0003
1.00 0.77 0.67–0.89 0.76 0.63–0.91 1.00 1.40 1.20–1.64
0.0006 0.0031
< 0.0001
the least squares procedure. The method requires specification of an artificial censoring time t0 . Further details can be found in the paper.
7. Discussion

The problem of summarizing and making inference about competing risks quantities is important in many medical applications of survival analysis. In this report, we have presented the main results available in this area, but clearly there is a need for further research directed at this problem. There is a need for comparative studies of the relative efficiency of the methods. There is a need for additional work on the regression techniques, including methods for assessing goodness of fit of all the models and rules for selection of the time points and of the form of the working covariance matrix in the pseudo-value approach. Finally, a problem which needs attention is the lack of user-friendly software for these techniques.

As noted in the simple example presented here, the approaches based on the crude hazard rate and those based on the cumulative incidence function may give quite different results. The "correct" solution to the problem and the parameter to base the test
on must depend on the subject matter question being addressed. Many of the methods and parameters discussed here are not well known to clinical investigators and to applied statisticians responsible for the analysis of data. While it is hoped that this report will help to inform applied statisticians of these techniques, there is a need to transfer this technology to the applications literature.

In this paper we have not dealt with inference for some topics. Among these topics is the estimation of marginal survival when there are dependent competing risks. Typically such risks arise from dependent censoring related to the outcome of interest. For example, patients with a greater tumor burden may be removed from a study prior to death to be given therapy that is more intensive. Techniques used here include finding bounds on the marginal survival and specific statistical models for the dependence structure. A survey of these types of techniques can be found in, for example, Moeschberger and Klein (1995).
Acknowledgement This research was supported by Grant R01-CA54706-09 from the National Cancer Institute.
References

Aalen, O.O. (1989). A linear regression model for the analysis of life times. Statist. Medicine 8, 907–925.
Andersen, P.K., Abildstrom, S.Z., Rosthøj, S. (2002). Competing risks as a multi-state model. Statist. Meth. Med. Res. 11, 203–215.
Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. (1982). Linear non-parametric tests for comparison of counting processes, with application to censored survival data (with discussion). Internat. Statist. Rev. 50, 219–258; Amendment: Internat. Statist. Rev. 52 (1984), 225.
Andersen, P.K., Klein, J.P., Rosthøj, S. (2003). Generalized linear models for correlated pseudo-observations with applications to multi-state models. Biometrika 90, 15–27.
Basu, A.P., Klein, J.P. (1982). Some recent results in competing risk theory. In: Crowley, Johnson (Eds.), Survival Analysis. IMS, pp. 216–229.
Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. B 34, 187–220.
Fine, J.P. (2001). Regression modeling of competing crude failure probabilities. Biostatistics 2, 85–97.
Fine, J.P., Gray, R.J. (1999). A proportional hazards model for the subdistribution of a competing risk. J. Amer. Statist. Assoc. 94, 496–509.
Fine, J.P., Ying, Z., Wei, L.J. (1998). On the linear transformation model for censored data. Biometrika 85, 980–986.
Gray, R.J. (1988). A class of K sample tests for comparing the cumulative incidence of a competing risk. Ann. Statist. 16, 1140–1154.
Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
Kaplan, E.L., Meier, P. (1958). Non-parametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53, 457–481.
Klein, J.P., Moeschberger, M.L. (1984). Asymptotic bias of the product limit estimator under dependent competing risks. Indian J. Product. Reliabil. Quality Control 9, 1–7.
Klein, J.P., Moeschberger, M.L. (1987). Independent or dependent competing risks: Does it make a difference? Comm. Statist. Simulat. 16 (2), 507–533.
Klein, J.P., Moeschberger, M.L. (2003). Survival Analysis: Techniques for Censored and Truncated Data, 2nd edn. Springer, New York.
Lin, D.Y. (1997). Non-parametric inference for cumulative incidence functions in competing risks studies. Statist. Medicine 16, 901–910.
Moeschberger, M.L., Klein, J.P. (1995). Statistical methods for dependent competing risks data. Lifetime Data Anal. 1, 195–204.
Pepe, M.S. (1991). Inference for events with dependent risks in multiple endpoint studies. J. Amer. Statist. Assoc. 86, 770–778.
Pepe, M.S., Mori, M. (1993). Kaplan–Meier, marginal or conditional probability curves in summarizing competing risks failure time data? Statist. Medicine 12, 737–751.
Prentice, R.L., Kalbfleisch, J.D., Peterson, A.V., Flournoy, N., Farewell, V.T., Breslow, N.E. (1978). The analysis of failure times in the presence of competing risks. Biometrics 34, 541–554.
Ramlau-Hansen, H. (1983a). The choice of a kernel function in the graduation of counting process intensities. Scandinav. Actuar. J., 165–182.
Ramlau-Hansen, H. (1983b). Smoothing counting process intensities by means of kernel functions. Ann. Statist. 11, 453–466.
Szydlo, R., Goldman, J.M., Klein, J.P., Gale, R.P., Ash, R.C., Bach, F.H., Bradley, B.A., Casper, J.T., Flomenberg, N., Gajewski, J.L., Gluckman, E., Henslee-Downey, P.J., Hows, J.M., Jacobsen, N., Kolb, H.J., Lowenberg, B., Masaoka, T., Rowlings, P.A., Sondel, P.M., van Bekkum, D.W., van Rood, J.J., Vowels, M.R., Zhang, M., Horowitz, M.M. (1997). Results of allogeneic bone marrow transplants for leukemia using donors other than HLA-identical siblings. J. Clinic. Oncology 15 (5), 1767–1777.
17
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23017-0
Analysis of Cause-Specific Events in Competing Risks Survival Data
James Dignam, John Bryant and H. Sam Wieand
1. Introduction

1.1. Background

Situations involving the observation of time to failure when one of several mutually exclusive events may occur are called competing risks problems. There has been longstanding interest in competing risks, as this situation occurs frequently in a large variety of areas, including industrial engineering, demography and econometrics, and there is a long history of applications to biology and health. In this chapter we specifically focus on biomedical applications, and in particular on the analysis of individuals under treatment for cancer, who may be subject to a number of subsequent events following diagnosis and initial treatment. The unique feature of competing risks problems in survival analysis is that there are both competing risks that may be correlated with the outcome of interest and a censoring mechanism that often can be assumed to be independent of the outcome of interest and, generally, of the competing risks as well.

In the simplest case, with no censoring, some estimation issues become clear. For example, suppose 10 people were treated for a cancer and were followed for 2 years from initiation of treatment. If at that time 3 of the patients had died from cancer, 3 others had died from other causes, and 4 patients were still alive, two important estimates would be the probability of dying from any cause and the probability of dying from cancer specifically. In this case, the natural nonparametric estimate of the former would be 0.6 and of the latter would be 0.3. In certain applications, we might be interested in estimating the probability of surviving two years if the competing risks (of death from a cause other than cancer) were eliminated. It seems likely that the best estimate is greater than 0.4, since that proportion of patients survived cancer and the competing risks for at least two years. It also seems likely to be less than 0.7, since 3 of the 10 patients died of cancer and there is some probability that the three patients who died of a competing risk within two years would have died of cancer were it not for the competing risk. If the competing risks were independent of death from cancer, one could use the usual nonparametric (Kaplan and Meier, 1958) estimate (defining deaths from other causes to be censored
observations) of the survival probability. Without the assumption of independence or further knowledge of the covariance structure, this probability is not an estimable quantity.

In this chapter, we will limit our discussion to the consideration of functions that may be estimated from competing risks data without making assumptions that cannot be verified by these data. Specifically, we will consider inference concerning cause-specific hazard functions and cumulative incidence functions, including a discussion of hypothesis tests for comparing the cause-specific events between two groups. In order to present these methods, we need to introduce some mathematical definitions and notation.

A description of typical competing risks data sets is as follows. Assume that there are K modes of failure, with potential failure times (X_{i1}, X_{i2}, \ldots, X_{iK}) for individual i. Failure of the ith individual occurs at time T_i = \min(X_{i1}, X_{i2}, \ldots, X_{iK}) and is due to cause \delta_i = \arg\min(X_{i1}, X_{i2}, \ldots, X_{iK}), the cause of the event which occurred first. We suppose that the follow-up of individual i is censored at time C_i, so that what is actually observed is the time to failure or last follow-up T_i^C = \min(T_i, C_i) and the failure indicator \delta_i^C, with \delta_i^C = \delta_i if T_i \le C_i and \delta_i^C = 0 otherwise. We assume independence across individuals, and will suppose in the following that the joint distribution of X_{i1}, X_{i2}, \ldots, X_{iK} is absolutely continuous and that C_i is independent of (X_{i1}, X_{i2}, \ldots, X_{iK}).^1

^1 Slightly more generally, we may assume independent censoring in the sense that the cause-specific hazards \lambda_k(t) defined in Section 1.2 are also equal to \lambda_k^{\#}(t) = \lim_{\Delta t \to 0} \Pr\{t < T_i \le t + \Delta t, \delta = k \mid T_i > t, C_i > t\}/\Delta t.

1.2. Identifiability

Of fundamental importance in competing risks theory is determination of the information that (T^C, \delta^C) can provide regarding the time to event distributions and related quantities for the K potential failure types. The survivor function of the minimum failure time T, defined as S(t) = \Pr(T > t), can be estimated from competing risks data. The other principal estimable quantities are the cause-specific hazard functions, defined as

\lambda_k(t) = \lim_{\Delta t \to 0} \frac{\Pr\{t < T_i \le t + \Delta t, \delta_i = k \mid T_i > t\}}{\Delta t},

and the cumulative cause-specific hazard functions defined as \Lambda_k(t) = \int_0^t \lambda_k(u)\,du. The hazard function for T is \lambda(t) = \lambda_1(t) + \lambda_2(t) + \cdots + \lambda_K(t). The cumulative hazard \Lambda(t) = \int_0^t \lambda(u)\,du relates the hazard to the survivor function via the identity S(t) = \exp(-\Lambda(t)).

Since the potential event times (X_{i1}, X_{i2}, \ldots, X_{iK}) suggest a (latent) multivariate survivor function, it would be of interest to determine if possible the joint and marginal distributions of these failure times. The joint survivor function S(x_1, x_2, \ldots, x_K) = \Pr(X_1 > x_1, X_2 > x_2, \ldots, X_K > x_K), historically referred to as the multiple-decrement survivor function, and the associated marginal survivor functions S^k(x_k) = \Pr(X_k > x_k), sometimes called net survivor functions for a given failure mode k, have been frequently
studied in relation to competing risks. The latter quantities may be thought of as pertaining to hypothetical situations where other failure modes could be removed; in some situations, this may be realistic, such as the elimination of a disease via vaccination. Cox (1959) noted that the dependence between failure times X and Y cannot be uniquely determined from observation of the minimum of the times. Berman (1963) showed that the assumption of independence between X and Y would permit identification of the marginal distribution of X. The nonidentifiability theorem of Tsiatis (1975) formalized the limitations on estimation imposed by competing risks observations, showing that a given assumed dependent competing risks model would produce observations (T, \delta) that are indistinguishable from those arising from another model where failure times are independent. Crowder (1991) furthermore showed that nonidentifiability of the joint distribution of failure times X and Y persists even when the marginal distribution of X is known. In summary, for any conjectured joint distribution of (X_{i1}, X_{i2}, \ldots, X_{iK}), there always exists an independent risks 'proxy' model that would yield the same observed (T, \delta). The independence assumption, while potentially plausible in some applications, is generally considered unrealistic in the biomedical context, and often interest lies specifically in how the risk of one event may influence another. Consequences of incorrectly assuming independence have been well illustrated (Peterson, 1976; Lagakos, 1979; Klein and Moeschberger, 1987; Dignam et al., 1995).

1.3. Competing risks analyses based on estimable quantities

Due to the limitations that nonidentifiability imposes, many authors have recommended that estimation and inference in competing risks problems be based solely on estimable quantities. For example, Prentice and colleagues contrast methods based on "latent" or "potential" failure times (X_{i1}, X_{i2}, \ldots, X_{iK}) (and the additional unverifiable assumptions they require) with methods based on estimable quantities, favoring methods centered around the set of identifiable cause-specific hazards (Prentice et al., 1978). When additional constraints are imposed on the (X_{i1}, X_{i2}, \ldots, X_{iK}), or ancillary information from other variables is incorporated, identification of marginal and joint distributions can be achieved (see David and Moeschberger, 1974; Basu and Klein, 1982; Yashin et al., 1986; Heckman and Honore, 1989; Moeschberger and Klein, 1995 for review). In this chapter, however, we will limit the discussion to methods based on functions of the estimable quantities S(t) and \lambda_k(t) for the K failure types. One of these that has proven particularly useful is the cumulative incidence function, defined for cause k as

F_k(t) = \Pr\{T \le t, \delta = k\} = \int_0^t S(u-)\,\lambda_k(u)\,du.   (1.1)
As a functional of the cause-specific hazard and survival distribution of T , the cumulative incidence function is identifiable from competing risks data and can be interpreted as the cumulative probability of a specific event occurring in the presence of other competing events, without further assumptions about the dependence among the event times. In this chapter, we briefly review estimation and hypothesis testing methods based on cause-specific hazards and cumulative incidence functions. We present these approaches in the analysis of competing endpoints in data from clinical trials for breast cancer
conducted by the National Surgical Adjuvant Breast and Bowel Project (NSABP), a multi-center Cooperative Clinical Trials Group.
2. Competing risks analysis based on cause-specific hazard functions

As mentioned above, the cause-specific hazards are identifiable from competing risks observations, and consequently many of the practical developments in competing risks data analysis center around these functions.

2.1. The Nelson–Aalen cumulative cause-specific hazard estimator

Let 0 < t_{(1)} < t_{(2)} < \cdots < t_{(J)} be the ordered event times. Let N_k(s) denote the number of failures of type k recorded at or before time s and Y(s) denote the number of individuals at risk at time s. The nonparametric hazard estimator for failure type k at time s is \Delta N_k(s)/Y(s), where \Delta N_k(s) = N_k(s) - N_k(s-), and the cumulative hazard estimator is

\hat{\Lambda}_k(t) = \sum_{j: t_{(j)} \le t} \frac{\Delta N_k(t_{(j)})}{Y(t_{(j)})}.   (2.1)
Properties and uses of this estimator are well described (Nelson, 1972; Aalen, 1976; Andersen et al., 1993). A related quantity useful in data analysis is the so-called incidence density or incidence rate, defined as N_k(\infty)/\sum T_i^C, where \sum T_i^C is the sum of the observation times across all individuals, or the person-time at risk. It can be thought of as an estimator for a weighted average of the event rate for failure cause k, defined by \int_0^\infty \lambda_k(s)\pi(s)\,ds / \int_0^\infty \pi(s)\,ds, where \pi(s) = \Pr\{T^C > s\} (Gardiner, 1982). For a constant cause-specific hazard \lambda_k(\cdot) = \theta, the incidence rate is the familiar maximum likelihood estimate for the exponential survival model.

The most common nonparametric maximum likelihood estimator for the survivor function S(t) is

\hat{S}(t) = \prod_{j: t_{(j)} \le t} \left[ 1 - \frac{\Delta N(t_{(j)})}{Y(t_{(j)})} \right],
where \Delta N(s) = \sum_{k=1}^{K} \Delta N_k(s). \hat{S}(t) is known as the Kaplan–Meier (1958) estimator. Alternatively, the survivor function can be estimated by \tilde{S}(t) = \exp(-\hat{\Lambda}(t)). The two estimators are asymptotically equivalent.

2.2. Regression models for hazards

Regression models for hazards are a commonly used tool in survival analysis. A large variety of parametric models exist, but the most frequently used approach in the biomedical setting is the semiparametric proportional hazards model specified by Cox (1972). This model is readily adapted to modeling the effects of regression covariates on the cause-specific
hazard functions. The model is

\lambda_k(t; z) = \lambda_{0k}(t) \exp(Z\beta_k)

for the k = 1, 2, \ldots, K cause-specific hazards. In this manner, covariates exerting influence on the cause-specific hazards for the different endpoints can be separately evaluated. More restrictive models, for instance models which include parameters relating the cause-specific hazards to each other, can also be formulated. For example, the model \lambda_k(t; z) = \lambda_0(t) \exp(\gamma_k) \exp(Z\beta_k) with \gamma_1 = 0 specifies a model where the baseline cause-specific hazards are proportional (Holt, 1978; Larson and Dinse, 1985). One can also consider an accelerated failure time model, which has the general form \lambda_k(t; z) = \lambda_0\{t \exp(Z\beta_k)\} \exp(Z\beta_k) for the failure modes. In any of these models, it should be noted that, as the effects of covariates are modeled on the estimable cause-specific hazards only, no hypotheses can be tested regarding how these covariates may relate to the inestimable marginal hazards. Consequently, occasionally counterintuitive results can be seen, with certain covariates appearing 'protective' against or unexpectedly not at all associated with some event types, because of a strong association with events of other types (Slud and Byar, 1988; Di Serio, 1997).

2.3. Comparing cause-specific hazards between groups

2.3.1. The log-rank test applied to competing risk data

The log-rank statistic is the most commonly used method for evaluating differences between two groups of individuals in the case where there is a single cause of failure (K = 1), with independent censoring (Mantel, 1966). It is designed to test the hypothesis that the hazard of failure

\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr\{t < T \le t + \Delta t \mid T > t\}}{\Delta t}

is identical in groups A and B, where T represents the time until failure, which implies it is also a test for comparing the survival functions in the two groups, since S(t) = \exp\{-\int_0^t \lambda(s)\,ds\}.

Now consider the case where there are K \ge 2 different modes of failure and suppose that we wish to compare the two groups with respect to the hazard of failure due to a specific cause k. In this case, the log-rank test can be applied exactly as when K = 1, treating type-k failures as events and all other failures as censoring events. But note that when competing risks are present, the independent censoring assumption may well be violated, and it is in any case not testable. Nevertheless, the above formulation is still correct for testing the equality of cause-k-specific hazards between the two groups. More precisely, the log-rank test does not compare the net hazard \lambda_k^{net}(t) = \lim_{\Delta t \to 0} \Pr\{t < X_k \le t + \Delta t \mid X_k > t\}/\Delta t between groups; indeed it could not, since this hazard is not identifiable from competing risks data. However, it is a completely legitimate procedure for testing the equality of the cause-k-specific hazard, \lambda_k(t) = \lim_{\Delta t \to 0} \Pr\{t < X_k \le t + \Delta t \mid \min(X_1, \ldots, X_K) > t\}/\Delta t, across groups.

Unlike the single-cause-of-failure case, the test of equality of cause-k-specific hazards across groups cannot be interpreted as a test of equality of survival functions (recall that S_k(t) = \exp\{-\Lambda_k(t)\} cannot be interpreted as a survival function without making
untestable assumptions of independence of cause-specific failure times). Nor can it be interpreted as a test of the equality of cumulative incidence functions across groups. To see this, consider the following simple example. Suppose there are K = 2 causes of failure. Suppose that in group A the cause-specific hazards are \lambda_1(t) \equiv 1 and \lambda_2(t) \equiv 2, while in group B they are \lambda_1(t) \equiv 1 and \lambda_2(t) \equiv 3. It is easy to verify that the cumulative incidence function for cause-1 failures in group A is F_1^A(t) = (1/3)\{1 - \exp(-3t)\}, t > 0, while in group B it is F_1^B(t) = (1/4)\{1 - \exp(-4t)\}. Thus the cause-1-specific hazards are identical in the two groups, but the cumulative incidence functions are entirely different, and in fact do not even cross. This is because the two groups differ with respect to their cause-2-specific hazards, and this impacts not only the cumulative incidence of type-2 failures, but also the cumulative incidence of type-1 failures, since the modes of failure are in competition. In Section 3.4, tests for comparing cumulative incidence functions are discussed.

Since we have seen that equality of cause-specific hazards is not equivalent to equality of cumulative incidence curves, the question arises as to which approach is the "correct" one. The answer is that both are correct, but they are appropriate to answer different questions. For example, Smith et al. (2003) have analyzed the incidence of secondary acute myeloid leukemias (AMLs) in patients with operable breast cancer who were treated post-surgically with doxorubicin and cyclophosphamide. For simplicity, consider an example in which patients receive one of two regimens: Regimen A, consisting of doxorubicin and cyclophosphamide at "standard" doses, or Regimen B, consisting of standard dose doxorubicin together with intensified delivery of cyclophosphamide. Consider a competing risks analysis where the K = 2 causes of failure are (k = 1) diagnosis of AML, and (k = 2) death from other causes prior to the diagnosis of AML. For purposes of addressing the putative causal impact of intensified cyclophosphamide on the incidence of secondary AMLs, a comparison of cause-specific hazards using the log-rank test seems most appropriate. (In this case, the fact that incidence rates are conditional not only on being undiagnosed for AML, but also on being alive, is surely not a concern since it makes no sense to consider incidence of AML following death.) On the other hand, if one were interested in the incidence of secondary AMLs purely from a health economics perspective rather than a biological one, it may be the case that a comparison of cumulative incidence curves would be more relevant. The above arguments apply to several closely related tests such as those proposed by Gehan (1965) and Harrington and Fleming (1982).

2.3.2. A test for equality of all cause-specific hazards across multiple groups

Lindkvist and Belyaev (1998) generalized the log-rank statistic to provide a class of tests for comparing cause-specific hazard rates from two competing causes of failure between two groups, and Kulathinal and Gasbarra (2002) extended these results to L (\ge 2) groups. The Kulathinal and Gasbarra statistic is given by Z = (Z_{11}, Z_{12}, \ldots, Z_{1K}, Z_{21}, Z_{22}, \ldots, Z_{2K}, \ldots, Z_{L1}, Z_{L2}, \ldots, Z_{LK}), where

Z_{lk} = \int_0^\infty K_{lk}^{n}(t) \left[ \frac{dN_{lk}(t)}{Y_l(t)} - \frac{dN_{\cdot k}(t)}{Y_{\cdot}(t)} \right]

and
N_{lk}(t) is the number of failures due to competing risk k in group l by time t,
Y_l(t) is the number of group l patients at risk at time t, and
K_{lk}^{n}(t) is a locally bounded predictable process which serves as a weight function.
The mean and variance-covariance structure are derived in detail by Kulathinal and Gasbarra using martingale theory, but the key point is that under the null hypothesis \lambda_{1k}(t) = \lambda_{2k}(t) = \cdots = \lambda_{Lk}(t) for k = 1, \ldots, K, the vector Z is asymptotically normal with mean 0. This statistic turns out to be a generalization of the log-rank statistic. If K = 1, L = 2, and K_{11}^{n}(t) = Y_1(t), then Z_{11} can be simplified to

Z_{11} = \int_0^\infty \frac{Y_2(t)}{Y_1(t) + Y_2(t)}\,dN_{11}(t) - \int_0^\infty \frac{Y_1(t)}{Y_1(t) + Y_2(t)}\,dN_{21}(t),

which yields the log-rank statistic (Fleming and Harrington, 1991, p. 43).
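As a concrete illustration of the statistic Z_{lk} with the log-rank weight K_{lk}^{n}(t) = Y_l(t), here is a small Python sketch. The function name and the cause/group coding (0 for censoring) are our own conventions, not the authors'.

```python
import numpy as np

def cause_specific_logrank_numerator(time, cause, group, k, l):
    """Z_lk above with weight K^n_lk(t) = Y_l(t): observed minus expected
    type-k events in group l, treating failures from other causes as censored.
    cause == 0 denotes censoring."""
    time, cause, group = map(np.asarray, (time, cause, group))
    z = 0.0
    for u in np.unique(time[cause == k]):       # times with type-k events
        y_all = np.sum(time >= u)               # Y.(u), at risk in all groups
        y_l = np.sum((time >= u) & (group == l))
        d_all = np.sum((time == u) & (cause == k))
        d_l = np.sum((time == u) & (cause == k) & (group == l))
        z += d_l - y_l * d_all / y_all          # dN_lk - Y_l dN_.k / Y_.
    return z
```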
3. Competing risks analysis based on cumulative incidence functions

While hazard functions are a useful metric for statisticians, often in applied settings the relative likelihood of competing events expressed on a probability scale is more readily interpretable. In particular, the cumulative probability over time of occurrence for a given disease event in the face of other events that might preclude it is a quantity of direct clinical relevance. A common but inadvisable practice is to estimate the cumulative probability of event k occurring by computing 1 - \hat{S}_k(t), where \hat{S}_k(t) is calculated by considering events other than k as independently censored observations (either via the Kaplan–Meier estimator, exponentiation of -\hat{\Lambda}_k(t), or some specified semiparametric or parametric model). This estimator ignores the possibility of dependence among the event times, and since it can be shown that 1 - S_k(t) \ge F_k(t) for all t, this approach can produce overestimates of the event-specific probability in the competing risks setting (Korn and Dorey, 1992; Pepe and Mori, 1993; Gaynor et al., 1993). In contrast, the set of K cumulative incidence functions is additive to the complement of the survivor function for T, that is, \sum_{k=1}^{K} F_k(t) = 1 - S(t), and for complete (uncensored) data the usual nonparametric cumulative incidence estimators at t = \infty equal the proportions of failures due to each cause. The cumulative incidence function has been frequently discussed in relation to the analysis of clinical time to event data with multiple endpoints (Korn and Dorey, 1992; Pepe and Mori, 1993; Gaynor et al., 1993), yet it remains underused in clinical applications (Gooley et al., 1999; Caplan et al., 1994). Earlier discussions can be found in Chiang (1968), Aalen (1978), Elandt-Johnson and Johnson (1980), and Kalbfleisch and Prentice (1980), among others. Because this quantity represents the cumulative probability of observing a given event under the study conditions at hand (e.g., with other failure mechanisms in effect), it is a useful complementary summary to the cause-specific cumulative hazard functions that may be more relevant in certain settings.
3.1. The nonparametric cumulative incidence function estimator

The most common nonparametric estimator of the cumulative incidence function for cause k is

\hat{F}_k(t) = \int_0^t \hat{S}(u-) \frac{dN_k(u)}{Y(u)} = \sum_{j: t_{(j)} \le t} \hat{S}(t_{(j-1)}) \frac{\Delta N_k(t_{(j)})}{Y(t_{(j)})}.   (3.1)
Aalen (1978) and others have described properties of this estimator. Several variance expressions have been derived (Aalen and Johansen, 1978; Dinse and Larson, 1986; Pepe, 1991; Korn and Dorey, 1992; Andersen et al., 1993; Lin, 1997).

3.2. Parametric and regression models for cumulative incidence

Parametric estimators of the cumulative incidence function can be formulated simply by substituting the appropriate cause-specific hazard and survival function estimators (Benichou and Gail, 1990). For example, if the cause-specific hazards are assumed to be constant, then \lambda_k(t) = \theta_k, k = 1, 2, \ldots, K, and the cumulative incidence estimator for cause k is
θˆk k t; θˆ1 , θˆ2 , . . . , θˆK = ˆ , F (3.2) exp −θˆ u θˆk du = 1 − exp −θt θˆ 0 ˆ where θˆk = Nk (t(j ) )/ Ti and θˆ = K k=1 θk . Approximate variance expressions can be obtained using the delta method (Benichou and Gail, 1990). Covariate effects can also be incorporated into cumulative incidence function estimators via regression models for the individual cause-specific hazards (for the event of interest and for the hazard associated with the competing events), followed by construction of the cumulative incidence estimator (Benichou and Gail, 1990; Andersen et al., 1993; Cheng et al., 1998). However, interpretation of covariate effects modeled this way in relation to cumulative incidence of a given event may not be straightforward, since the cumulative incidence is a function of not only the hazard of interest, but also the ‘at-risk’ probability, represented by S(·) in Eq. (1.1) (Gray, 1988; Gaynor et al., 1993). Fine and Gray have proposed a novel semi-parametric proportional hazards model for the cumulative incidence function in order to directly model the effect of covariates (Fine and Gray, 1999). Employing a transformation of the cumulative incidence function that corresponds to the proportional hazards model, they specify a regression model for the subdistribution ‘hazard’, defined as λ∗k (t) = (dFk (t; Z))/(1 − Fk (t; Z)). The approach allows one to directly evaluating covariate effects on cumulative event probabilities, which may be quite different from the effects of the same covariates on individual cause-specific hazards. Fine (2001) has applied a similar transformation approach based on the proportional odds model. 3.3. A partially parametric cumulative incidence function estimator Bryant and Dignam (2004) have investigated the use of cumulative incidence estimators obtained by fitting a parametric function to the cause-specific hazard of primary interest,
accounting for all other modes of failure nonparametrically, and combining the results. Such estimators are considerably more efficient than fully nonparametric cumulative incidence estimators, and have proven to be useful in a number of practical applications. Consider a situation where events of a specific type, denoted type 1, are of primary interest. Let the cause-specific hazard be λ_1(t) = λ_1(t; θ), where θ is an unknown parameter (possibly vector-valued) to be estimated from the data. Without loss of generality, we let K = 2 by combining all other competing risks into a single category. Due to the additivity of the cause-specific hazards, the overall survival distribution is S(u; θ) = exp(−Λ_1(u; θ) − Λ_2(u)) = S_1(u; θ)S_2(u), and thus from Eq. (1.1), the cumulative incidence function for type 1 failures can be written as

F_1(t; θ) = \int_0^t S_2(u) g_1(u, θ) \, du,   (3.3)

where g_1(u, θ) = λ_1(u, θ)S_1(u; θ). This representation suggests a "partially parametric" cumulative incidence function estimator

\hat{F}_1(t) = \int_0^t \hat{S}_2(u) g_1(u, \hat\theta) \, du,   (3.4)

where \hat{S}_2(u) is the Kaplan–Meier estimator for S_2(u) and \hat\theta is the maximum likelihood estimator for θ. Bryant and Dignam provide expressions for the variance of this estimator, and show how to construct confidence intervals for the cumulative incidence at fixed time points t, and confidence bands valid over finite time intervals. In particular, a variance estimator for \hat{F}_1(t) is given by

\widehat{\mathrm{Var}}\bigl[\hat{F}_1(t)\bigr] = \frac{1}{n} \left[\frac{\partial \hat{F}_1(t; \hat\theta)}{\partial\theta}\right]^{\mathrm{T}} \hat{I}^{-1} \left[\frac{\partial \hat{F}_1(t; \hat\theta)}{\partial\theta}\right] + \sum_{j: t_{(j)} \le t} \bigl[\hat{F}_1(t) - \hat{F}_1(t_{(j)})\bigr]^2 p_j,   (3.5)

where n is the sample size,

\frac{\partial \hat{F}_1(t; \hat\theta)}{\partial\theta} = \int_0^t \hat{S}_2(u) \frac{\partial \hat{g}_1(u; \hat\theta)}{\partial\theta} \, du,

\hat{I} is a consistent estimate of the Fisher information, and

p_j = \frac{N_2(t_{(j)})}{Y(t_{(j)})\{Y(t_{(j)}) - N_2(t_{(j)})\}}.
Among estimators of this form, Bryant and Dignam (2004) specifically considered the special case where λ_1(t; θ) = θ, that is, where the cause-specific hazard for the event of interest is assumed to be constant over time. In this case, the cumulative incidence estimator is, from Eq. (3.4),

\hat{F}_1(t) = \hat\theta \int_0^t \hat{S}_2(u) e^{-\hat\theta u} \, du,
where \hat\theta = N_1(\infty)/\sum_i T_i. \hat{F}_1(t_{(h)}) can be expressed as

\hat{F}_1(t_{(h)}) = \sum_{j=0}^{h-1} \hat{S}_2(t_{(j)}) \bigl[e^{-\hat\theta t_{(j)}} - e^{-\hat\theta t_{(j+1)}}\bigr]   (3.6)
and has variance estimator

\widehat{\mathrm{Var}}\bigl[\hat{F}_1(t_{(h)})\bigr] = \frac{1}{n} \frac{\hat\theta}{\bar{T}} \left[\sum_{j=0}^{h-1} \hat{S}_2(t_{(j)}) \bigl(t_{(j+1)} e^{-\hat\theta t_{(j+1)}} - t_{(j)} e^{-\hat\theta t_{(j)}}\bigr)\right]^2 + \sum_{j=1}^{h} \bigl[\hat{F}_1(t_{(h)}) - \hat{F}_1(t_{(j)})\bigr]^2 p_j,

where \bar{T} = n^{-1}\sum_i T_i is the average follow-up time.
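The constant-hazard special case of the partially parametric estimator is easy to program. The sketch below (our own illustration under the stated assumptions; the function names are hypothetical) estimates the cause-1 hazard as events per unit of total follow-up time, fits the Kaplan–Meier estimator for the combined competing causes, and integrates θ̂ Ŝ_2(u)e^{-θ̂u} exactly over the intervals on which Ŝ_2 is constant, which reproduces the sum in Eq. (3.6). It does not attempt the variance expressions.

```python
import numpy as np

def cuminc_partially_parametric(time, cause, t_grid):
    """Partially parametric estimator in the spirit of Eq. (3.6): constant hazard for
    cause 1, Kaplan-Meier for the combined competing causes (coded as cause 2).

    time : follow-up times;  cause : 0 = censored, 1 = event of interest, 2 = competing.
    Returns F1_hat evaluated at each point of t_grid.
    """
    time = np.asarray(time, dtype=float)
    cause = np.asarray(cause)

    # constant cause-1 hazard: type-1 events divided by total follow-up time
    theta_hat = np.sum(cause == 1) / np.sum(time)

    # Kaplan-Meier estimate of S2, treating cause-1 events and censorings as censored
    s2_times, s2_vals, s2 = [0.0], [1.0], 1.0
    for t in np.unique(time[cause == 2]):
        at_risk = np.sum(time >= t)
        d2 = np.sum((time == t) & (cause == 2))
        s2 *= 1.0 - d2 / at_risk
        s2_times.append(t)
        s2_vals.append(s2)

    def S2(u):  # right-continuous step function
        idx = np.searchsorted(s2_times, u, side="right") - 1
        return np.asarray(s2_vals)[idx]

    # F1(t) = theta * int_0^t S2(u) e^{-theta u} du; on each interval where S2 is
    # constant the integral has the closed form S2(a) * (e^{-theta a} - e^{-theta b}),
    # so summing over intervals reproduces Eq. (3.6).
    out = []
    for t in np.atleast_1d(t_grid):
        knots = np.concatenate(([0.0], [u for u in s2_times[1:] if u < t], [t]))
        val = 0.0
        for a, b in zip(knots[:-1], knots[1:]):
            val += S2(a) * (np.exp(-theta_hat * a) - np.exp(-theta_hat * b))
        out.append(val)
    return np.asarray(out)
```

In practice one would plot this curve alongside the fully nonparametric estimator of Eq. (3.1), as is done for the B-21 example in Section 4.3, to check the adequacy of the constant-hazard assumption.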
An alternative and slightly simpler estimator suggested by Eq. (1.1) was also discussed. Define \tilde{F}_1(t) = \hat\theta \int_0^t \hat{S}(u) \, du. Then for t = t_{(h)},

\tilde{F}_1(t_{(h)}) = \hat\theta \hat{Q}(t_{(h)}),   (3.7)

where \hat{Q}(t_{(h)}) = \sum_{j=0}^{h-1} \hat{S}(t_{(j)})(t_{(j+1)} - t_{(j)}). Thus, this estimator is just the integrated Kaplan–Meier estimator for all-cause failure multiplied by the estimated crude hazard of failure due to cause 1. This estimator was shown to be somewhat less efficient than \hat{F}_1(t), but the two estimators are similar in most cases, in particular when failures due to cause 1 are infrequent, i.e., if θ is small; in this case \tilde{F}_1(t) may be preferred for its simplicity. Bryant and Dignam (2004) compared the asymptotic variances of the partially parametric estimators (Eqs. (3.6) and (3.7)) to those of the nonparametric estimator (Eq. (3.1)), and to the fully parametric estimator (Eq. (3.2)), applicable to the case of constant cause-specific hazards. Results indicated that both partially parametric estimators were considerably more efficient than the nonparametric estimator, especially at early and intermediate time points in the follow-up period. Surprisingly, they were nearly as efficient as the fully parametric estimator in the scenarios investigated. Bryant and Dignam also assessed the performance of the cumulative incidence estimators for finite samples of sizes relevant to applications in the analysis of typical cancer clinical trials. Simulations consisting of 10 000 replications of cohorts of 500 patients demonstrated that each of the estimators was virtually unbiased. In each case the corresponding variance estimators were also seen to be essentially unbiased in the scenarios investigated.

3.4. Tests for comparing cumulative incidence functions

Several tests have been proposed to directly compare nonparametric cumulative incidence functions between groups, as such comparisons might elicit different information from that provided by hazard comparisons, depending on the relative intensity and failure time distributions for events other than the event of interest (the event for which cumulative incidence is being compared).
3.4.1. Gray's test

Gray (1988) proposed to test for the equality of cumulative incidence for failure due to a specific cause (k) between L ≥ 2 groups. The structure of the test statistic is best illustrated in the case of two groups (A and B). The test statistic is a weighted and integrated difference of estimates of the subdistribution hazards in the two groups, given by
\int_0^\infty W(u) \left[ \frac{d\hat{F}_k^A(u)}{1 - \hat{F}_k^A(u-)} - \frac{d\hat{F}_k^B(u)}{1 - \hat{F}_k^B(u-)} \right].

Here, \hat{F}_k^A(\cdot) is the cause k cumulative incidence estimator defined in Eq. (3.1) for group A and \hat{F}_k^B(\cdot) is similarly defined for group B. W(\cdot) is a predictable weighting function. Gray presents a class of tests using weight functions generalizing the G^\rho class of tests of Harrington and Fleming (1982).

3.4.2. Other tests

Pepe (1991) proposed a test statistic for comparing cumulative incidence functions which is a weighted integral of the difference between two estimated cumulative incidence curves. The test is a direct application of a class of weighted distance tests proposed as alternatives to the log-rank test for survival curves (Pepe and Fleming, 1989),

\sqrt{\frac{n_A n_B}{n}} \int_0^\infty W(u) \bigl[\hat{F}_k^A(u) - \hat{F}_k^B(u)\bigr] \, du,

where n_A and n_B are the group sample sizes, n = n_A + n_B, and W(\cdot) is a predictable weight function. One such weight function suggested by Pepe and Mori (1993) is W(t) = \hat{C}^A(t-)\hat{C}^B(t-)/\{(n_A/n)\hat{C}^A(t-) + (n_B/n)\hat{C}^B(t-)\}, where \hat{C}(\cdot) is the Kaplan–Meier estimator for the censoring times. Lin (1997) proposed a weighted Kolmogorov–Smirnov type test for comparing two cumulative incidence curves, given by Q = \sup_t \bigl(W(t)\,|\hat{F}_k^A(t) - \hat{F}_k^B(t)|\bigr), where W(\cdot) is a weight function.
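As a rough computational companion to Pepe's statistic, the sketch below (ours, not from the original chapter) evaluates only the unweighted integrated difference, taking W ≡ 1 and a fixed upper limit τ; the censoring-based weight of Pepe and Mori and the variance estimate needed to form an actual test are deliberately omitted, so this is illustrative rather than a usable test.

```python
import numpy as np

def pepe_distance(cif_a, cif_b, n_a, n_b, tau):
    """Unweighted (W = 1) version of Pepe's integrated difference,
    sqrt(n_a*n_b/n) * integral_0^tau [F_k^A(u) - F_k^B(u)] du.

    cif_a, cif_b : arrays of (time, F_k(time)) pairs, e.g., from the
                   cuminc_nonparametric sketch given after Eq. (3.1).
    tau          : upper limit of integration (end of common follow-up).
    """
    def step_integral(cif, tau):
        # integrate the right-continuous step function F_k(u) from 0 to tau
        times = np.concatenate(([0.0], cif[:, 0], [tau]))
        vals = np.concatenate(([0.0], cif[:, 1]))
        total = 0.0
        for a, b, f in zip(times[:-1], times[1:], vals):
            if a >= tau:
                break
            total += f * (min(b, tau) - a)
        return total

    n = n_a + n_b
    return np.sqrt(n_a * n_b / n) * (step_integral(cif_a, tau) - step_integral(cif_b, tau))
```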
4. Examples: Competing risks analysis of events after breast cancer treatment

In this section, we illustrate the competing risks methods we have discussed. The example involves practical problems arising from the analysis of a randomized clinical trial evaluating treatments for early stage breast cancer.

4.1. A trial to study tamoxifen and radiotherapy for the treatment of women with small (≤ 1 cm) tumors

In June of 1989, the NSABP opened a trial for women who were candidates for minimally invasive surgery (a lumpectomy) to remove a breast cancer which was ≤ 1 cm in its largest dimension and was confined to the breast. At that time, there was preliminary evidence that both irradiation and an estrogen-like compound called tamoxifen
had efficacy in treating small tumors when either therapy was given alone. The goal of this trial was to determine if a combined therapy of breast irradiation and tamoxifen (XRT + TAM) was more effective than either of the treatments individually, and also to determine if tamoxifen alone (TAM) might be superior to breast irradiation alone (XRT). Patients who consented to be in this trial were randomly assigned to receive TAM, XRT (and a placebo), or XRT + TAM following surgery. One thousand and nine patients had entered the trial by December 1998 and the trial was closed to accrual. The results of the trial were presented in 2002 (Fisher et al., 2002). In the sections which follow, we will discuss analyses designed to compare the effect of the three treatment approaches in preventing an ipsilateral breast recurrence (IBTR), which is a recurrence in the breast from which the initial tumor was removed. There was considerable preliminary evidence that tamoxifen had the potential to prevent new primary cancer in the opposite breast, and both tamoxifen and irradiation had the potential to have some effect on the likelihood of developing other primary cancers or deaths from other causes. Hence, competing risk methodology was required for the analysis of the trial data.

4.2. Analysis of cause-specific hazards

Table 1
Cause-specific hazards for two treatment arms of the NSABP B-21 randomized trial

                      XRT + Placebo          Tamoxifen
Failure type        # Events   Rate^a      # Events   Rate^a    Hazard ratio^b   Logrank p-value^c
IBTR                   23       11.68         45       22.86         0.51             0.008
Regional-distant        5        2.54          7        3.56         0.71             0.559
Opposite breast        14        7.11          3        1.52         4.68             0.008
Other cancers          11        5.59         15        7.62         0.75             0.461
Deaths, other           8        4.06          4        2.03         2.03             0.239
All events             61       30.99         74       37.59         0.83             0.268

^a The Rate is the number of events per 1000 patient-years.
^b The Hazard Ratio is for XRT + Placebo relative to Tamoxifen and is obtained from the Cox proportional hazards model.
^c Due to the small number of events, p-values were obtained by permutation of treatment labels using the approach described in Peto and Peto (1972, Section 12).

Cause-specific hazards by treatment group for a mutually exclusive and exhaustive classification of event types are shown in Table 1 for two arms of the B-21 trial (the radiotherapy + placebo (XRT) and tamoxifen alone (TAM) arms). For each event type, the estimated incidence rate is shown in events per 1000 person-years. Also shown are the hazard ratio estimate from the Cox model and the log-rank test p-value, which is identically equal to the score test for β = 0 from the Cox model. The rate of ipsilateral breast tumor recurrence (IBTR) was significantly lower in patients undergoing XRT than in those receiving TAM. Conversely, patients receiving TAM experienced fewer occurrences of opposite breast tumors. Other recurrence types, which have been shown to be reduced by use of TAM in other studies, were infrequent among these patients and
do not differ here. Other event types also did not differ significantly, and note that the overall (all causes) hazards of failure do not differ significantly between treatments. In the complete analysis of the trial, the combined therapy arm (XRT + TAM) was found to be superior to both of these treatment arms with respect to IBTR, and patients receiving TAM (either alone or with XRT) experienced significantly fewer occurrences of opposite breast tumors than those undergoing XRT only (Fisher et al., 2002).

4.3. Cumulative incidence estimates and test

To illustrate the application of the cumulative incidence estimators, we focus on IBTR. As shown in Table 1, there were 45 IBTRs through 8 years in the TAM group, with an average annual rate of 22.9/1000 person-years. For women randomized to XRT and placebo, there were 23 IBTRs (average annual rate 11.7/1000 person-years). In the XRT + TAM arm, there were 9 IBTRs (average annual rate 4.4/1000 person-years). Examining the smoothed instantaneous rate of IBTR over the 8-year interval (not shown), a relatively constant rate was observed, suggesting that modeling this event with an exponential cause-specific failure hazard may be appropriate. Figure 1 shows the nonparametric estimator (Eq. (3.1)) and the semi-parametric estimator (Eq. (3.6)) for cumulative incidence of IBTR in patients receiving either TAM, XRT, or XRT + TAM after surgery. The semi-parametric estimator provides a good fit to the data throughout the observation period. In the TAM group, the five-year cumulative incidence of IBTR as a first event was 0.106 with standard deviation (sd) 0.018 according to the nonparametric estimate; the semi-parametric estimate was 0.105 (sd 0.015). In the XRT group, the five-year nonparametric estimate was 0.044 (sd 0.012); the semi-parametric estimate was 0.054 (sd 0.011). For the combined therapy arm, five-year cumulative incidence was 0.019 (sd 0.0084); the semi-parametric estimate was 0.021 (sd 0.0068).

Fig. 1. Cumulative incidence of IBTR through 8 years among patients receiving either tamoxifen (a), radiotherapy + placebo (b), or radiotherapy + tamoxifen (c) after surgery. The nonparametric estimator (Eq. (3.1)) and the semi-parametric estimator discussed in Section 3.3 (Eq. (3.6)) are shown.

Gray's test comparing nonparametric cumulative incidence functions (Section 3.4.1) indicates significant differences between groups (p < 0.0001), with pairwise tests indicating differences between XRT vs. TAM (p = 0.006), XRT vs. XRT + TAM (p = 0.009), and TAM vs. XRT + TAM (p < 0.0001). In this example, the p-values obtained using Gray's method are nearly identical to those obtained using pairwise log-rank statistics to compare cause-specific hazards for IBTR (XRT vs. TAM, p = 0.008; XRT vs. XRT + TAM, p = 0.01; and TAM vs. XRT + TAM, p < 0.0001). The two methods often lead to similar results. However, we have seen in Section 2.3.1 that
conceptually the procedures test different hypotheses and therefore may result in substantially different results.
5. Summary

Competing risks methods based on estimable quantities are generally straightforward and yield additional important information when multiple failure types are present, and we advocate their use. However, it is important that statisticians and others who employ the methodology are careful in their interpretation and presentation of the results.
Acknowledgements

This work was supported by Public Health Service Grants NCI-U10-CA-69651 and NCI P30-CA-14599 from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services.
References Aalen, O. (1978). Nonparametric estimation of partial transition probabilities in multiple decrement models. Ann. Statist. 6, 534–545. Aalen, O., Johansen, S. (1978). An empirical transition matrix for nonhomogeneous Markov chains based on censored observations. Scand. J. Statist. 5, 141–150. Aalen, O. (1976). Nonparametric inference in connection with multiple decrement models. Scand. J. Statist. 3, 15–27. Andersen, P.K., Borgan, O., Gill, R., Keiding, N. (1993). Statistical Methods Based on Counting Processes. Springer, Berlin. Basu, A.P., Klein, J.P. (1982). Some recent results in competing risk theory. In: Crowley, J., Johnson, R.A. (Eds.), Survival Analysis. Institute for Mathematical Statistics, Hayward, CA, pp. 216–229. Benichou, J., Gail, M.H. (1990). Estimates of absolute cause-specific risk in cohort studies. Biometrics 46, 813–826. Berman, S.M. (1963). Notes on extreme values, competing risks, and semi-Markov models. Ann. Math. Statist. 34, 1104–1106. Bryant, J., Dignam, J. (2004). Semiparametric models for cumulative incidence functions. Biometrics. In press. Caplan, R.J., Pajak, T.F., Cox, J.D. (1994). Analysis of the probability and risk of cause-specific failure. Int. J. Radiat. Oncol. Biol. Phys. 29, 1183–1186. Cheng, S.C., Fine, J.P., Wei, L.J. (1998). Prediction of cumulative incidence function under the proportional hazards model. Biometrics 54, 219–228. Chiang, C.L. (1968). Introduction to Stochastic Processes in Biostatistics. Wiley, New York. Cox, D.R. (1959). The analysis of exponentially distributed life-times with two types of failure. J. Roy. Statist. Soc. Ser. B 21, 411–421. Cox, D.R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, 187–202. Crowder, M. (1991). On the identifiability crisis in competing risks analysis. Scand. J. Statist. 18, 223–233. David, H.A., Moeschberger, M.L. (1974). The Theory of Competing Risks. Griffin, High Wycombe. Di Serio, C. (1997). The protective impact of a covariate on competing failures with an example from a bone marrow transplantation study. Lifetime Data Anal. 3, 99–122.
Dignam, J.J., Weissfeld, L.A., Anderson, S.J. (1995). Methods for bounding the marginal survival distribution. Statist. Medicine 14, 1985–1998. Dinse, G.E., Larson, M.G. (1986). A note on the semi-Markov model for partially censored data. Biometrika 73, 379–386. Elandt-Johnson, R.C., Johnson, N.L. (1980). Survival Models and Data Analysis. Wiley, New York. Fisher, B., Bryant, J., Dignam, J., et al. (2002). Tamoxifen, radiation therapy, or both for prevention of ipsilateral breast tumor recurrence after lumpectomy in women with invasive breast tumors of one centimeter or less. J. Clinic. Oncology 20, 4141–4149. Fleming, T.R., Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York. Fine, J.P., Gray, R.J. (1999). A proportional hazards model for the subdistribution of a competing risk. J. Amer. Statist. Assoc. 94, 496–509. Fine, J.P. (2001). Regression modeling of competing crude failure probabilities. Biostatistics 2, 85–97. Gardiner, J.C. (1982). The asymptotic distribution of mortality rates in competing risks analyses. Scand. J. Statist. 9, 31–36. Gaynor, J.J., Feuer, E.J., Tan, C.C., et al. (1993). On the use of cause-specific failure and conditional failure probabilities: Examples from clinical oncology data. J. Amer. Statist. Assoc. 88, 400–409. Gehan, E.A. (1965). A generalized Wilcoxon test for comparing arbitrarily single-censored samples. Biometrika 52, 203–223. Gooley, T.A., Leisenring, W., Crowley, J., Storer, B.E. (1999). Estimation of failure probabilities in the presence of competing risks: new representations of old estimators. Statist. Medicine 18, 695–706. Gray, R.J. (1988). A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann. Statist. 16, 1141–1154. Harrington, D.P., Fleming, T.R. (1982). A class of rank test procedures for censored survival data. Biometrika 69, 553–566. Heckman, J.J., Honore, B.E. (1989). The identifiability of the competing risks model. Biometrika 76, 325– 330. Holt, J.D. (1978). Competing risks analysis with special reference to matched-pairs experiments. Biometrics 65, 159–166. Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. Kaplan, E.L., Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53, 457–481. Klein, J.P., Moeschberger, M.L. (1987). Independent or dependent competing risks: Does it make a difference? Commun. Statist. Comput. Simulation 16, 507–533. Korn, E.L., Dorey, F.J. (1992). Applications of crude incidence curves. Statist. Medicine 11, 813–829. Kulathinal, S.B., Gasbarra, D. (2002). Testing equality of cause-specific hazard rates corresponding to m competing risks among k groups. Lifetime Data Anal. 8, 147–161. Lindkvist, H., Belyaev, Y. (1998). A class of non-parametric tests in the competing risks model for comparing two samples. Scand. J. Statist. 25, 143–150. Lagakos, S.W. (1979). General right-censoring and its impact on analysis of survival data. Biometrics 35, 139–156. Larson, M., Dinse, G.E. (1985). A mixture model for the regression analysis of competing risks data. Appl. Statist. 34, 201–211. Lin, D.Y. (1997). Nonparametric inference for cumulative incidence functions in competing risks studies. Statist. Medicine 85, 901–910. Mantel, N. (1966). Evaluation of survival data and two rank order statistics arising in its consideration. Cancer Chemotherapy Rep. 50, 163–170. Moeschberger, M.L., Klein, J.P. (1995). Statistical methods for dependent competing risks. 
Lifetime Data Anal. 1, 195–204. Nelson, W. (1972). Theory and application of hazard plotting for censored failure data. Technometrics 19, 945–966. Pepe, M.S., Mori, M. (1993). Kaplan–Meier, marginal, or conditional probability curves in summarizing competing risks failure time data? Statist. Medicine 12, 737–751. Pepe, M.S. (1991). Inference for events with dependent risks in multiple endpoint studies. J. Amer. Statist. Assoc. 86, 770–778.
Pepe, M.S., Fleming, T.R. (1989). Weighted Kaplan–Meier statistics – A class of distance tests for censored survival data. Biometrics 45, 497–507. Peterson, A.V. (1976). Bounds for a joint distribution function with fixed subdistribution functions: Applications to competing risks. Proc. Nat. Acad. Sci. USA 73, 11–13. Peto, R., Peto, J. (1972). Asymptotically efficient rank invariant test procedures (with discussion). J. Roy. Statist. Soc. Ser. A 135, 185–206. Prentice, R.L., Kalbfleisch, J.D., Peterson, A.V., et al. (1978). The analysis of failure times in the presence of competing risks. Biometrics 34, 541–554. Smith, R., Bryant, J., DeCillis, A., Anderson, S. (2003). Acute myeloid leukemia and myelodysplastic syndrome following Doxorubicin–Cyclophosphamide adjuvant therapy for operable breast cancer: the NSABP experience. J. Clinic Oncology 21, 1195–1204. Slud, E.V., Byar, D. (1988). How dependent causes of death can make risk factors appear protective. Biometrics 44, 265–269. Tsiatis, A.A. (1975). A nonidentifiability aspect of the problem of competing risks. Proc. Nat. Acad. Sci. 72, 20–22. Yashin, A.I., Manton, K.G., Stallard, E. (1986). Dependent competing risks: A stochastic process model. J. Math. Biol. 24, 119–140.
18
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23018-2
Analysis of Progressively Censored Competing Risks Data
Debasis Kundu, Nandini Kannan and N. Balakrishnan
1. Introduction

In medical studies or in the analysis of reliability data, the failure of individuals or items may be attributable to more than one cause or factor. These "risk factors" in some sense compete for the failure of the experimental unit. Consider the example of Hoel (1972), based on a laboratory experiment in which mice were given a dose of radiation at 6 weeks of age. The causes of death were recorded as thymic lymphoma, reticulum cell sarcoma, or other. Another example is from a study of breast cancer patients (Boag, 1949), where the cause of death was recorded as "cancer" or "other". There are numerous examples in reliability experiments, where items may fail due to one of several causes. In traditional analyses of these datasets, the researcher is primarily interested in the distribution of lifetimes under one specific cause of failure, say cancer, and all other causes are combined and treated as censored data. In recent years, however, models have been developed to assess the lifetimes of a specific risk in the presence of other competing risk factors. The data for these "competing risk models" consist of the failure time and an indicator variable denoting the specific cause of failure of the individual or item. The causes of failure may be assumed to be independent or dependent. In most situations, the analysis of competing risk data assumes independent causes of failure. Even though the assumption of dependence may be more realistic, there is some concern about the identifiability of the underlying model. Kalbfleisch and Prentice (1980), Crowder (2001), and several other authors have argued that, without information on covariates, it is not possible to use the data to test the assumption of independent failure times. See Crowder (2001) and the monograph by David and Moeschberger (1978) for an exhaustive treatment of different competing risks models. In this chapter, we will develop inference for the competing risk model under a very general censoring scheme. Censoring is inevitable in life-testing and reliability studies because the experimenter is unable to obtain complete information on lifetimes for all individuals. For example, patients in a clinical trial may withdraw from the study, or the study may have to be terminated at a pre-fixed timepoint. In industrial experiments, units may break accidentally. In many situations, however, the removal of units prior to
failure is pre-planned in order to provide savings in terms of time and cost associated with testing. The two most common censoring schemes are termed types I and II censoring. Consider n individuals under observation in a clinical study. In the conventional type I censoring scheme, the experiment continues up to a prespecified time T. Failures (deaths) that occur after T are not observed. The termination point T of the experiment is assumed to be independent of the failure times. By contrast, the conventional type II censoring scheme requires the experiment to continue until a prespecified number of failures m ≤ n occur. In this scenario, only the m smallest lifetimes are observed. In type I censoring, the number of failures observed is random and the endpoint of the experiment is fixed, whereas in type II censoring the endpoint is random, while the number of failures observed is fixed. There have been numerous articles and books in the reliability and survival analysis literature dealing with inference under types I and II censoring for different parametric families of distributions. One of the drawbacks of these conventional types I and II censoring schemes outlined above is that they do not allow for removal of units at points other than the terminal point of the experiment. Cohen (1963, 1966) was one of the earliest to study a more general censoring scheme: Fix m censoring times T1, . . . , Tm. At time Ti, remove Ri of the remaining units randomly. The experiment terminates at time Tm with Rm units still surviving. This is referred to as progressive type I right censoring. Cohen (1963) discussed estimation under this scheme for the parameters of the normal distribution. In this chapter, we consider competing risk data under progressive type II censoring. The censoring scheme is defined as follows: Consider n individuals in a study and assume that there are K causes of failure which are known. At the time of each failure, one or more surviving units may be removed from the study at random. The data from a progressively type II censored sample are as follows: (X1:m:n, δ1, R1), . . . , (Xm:m:n, δm, Rm), where X1:m:n < · · · < Xm:m:n denote the m observed failure times, δ1, . . . , δm denote the causes of failure, and R1, . . . , Rm denote the numbers of units removed from the study at the failure times X1:m:n, . . . , Xm:m:n. Note that the complete and type II right censored samples are special cases of the above scheme when R1 = R2 = · · · = Rm = 0, and R1 = R2 = · · · = Rm−1 = 0, Rm = n − m, respectively. For an exhaustive list of references and further details on progressive censoring, the reader may refer to the book by Balakrishnan and Aggarwala (2000). The main focus of this chapter is the analysis of the competing risk model when the data are progressively type II censored. In addition, we will assume the lifetimes under the competing risks have independent exponential distributions. We derive the maximum likelihood estimators (MLE) and uniformly minimum variance unbiased estimators (UMVUE) of the hazard rates. We also show that the MLEs and UMVUEs of the mean lifetime of the different causes may not always exist. We, therefore, propose the use of conditional MLEs of the mean lifetimes. The exact distributions of the MLEs and conditional MLEs are obtained, leading to the construction of confidence intervals
for the different parameters. Bayes estimators and the corresponding credible regions of the different parameters using inverted gamma priors are also obtained. The organization of the chapter is as follows. In Section 2, we describe the model and present the definitions and notation used throughout the chapter. The estimation of the different parameters is considered in Section 3, and their distributions are derived in Section 4. Bayesian analysis of the competing risk model is provided in Section 5. Results of a simulation study comparing the coverage probabilities and lengths of the different confidence and credible intervals are provided in Section 6. We illustrate the performance of these different techniques in Section 7 using a real dataset. Section 8 discusses some extensions and generalizations of our results. Finally, some conclusions are drawn in Section 9.
2. Model: Description and notation

Without loss of generality, we assume that there are only two independent causes of failure. All the methods presented in this chapter may be easily extended to the case of K > 2. We introduce the following notation:

X_{ji}: lifetime of the ith individual under cause j, j = 1, 2;
F(\cdot): cumulative distribution function of X_i;
F_j(\cdot): cumulative distribution function of X_{ji};
\bar{F}_j(\cdot): survival function of X_{ji}, \bar{F}_j(\cdot) = 1 - F_j(\cdot);
δ_i: indicator variable denoting the cause of failure of the ith individual;
m: the number of complete failures observed before termination;
x_{i:m:n}: ith observed failure time, i = 1, . . . , m;
R_i: number of units removed at the time of the ith failure, R_i \ge 0, R_1 + · · · + R_m + m = n;
I_j := \{x_{i:m:n}; δ_i = j\}, j = 1, 2;
|I_j|: cardinality of I_j; we assume |I_j| = n_j;
gamma(α, λ): denotes the gamma random variable with density function \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x};
Igamma(α, λ): denotes the inverted gamma random variable with density function \frac{\lambda^\alpha}{\Gamma(\alpha)} \frac{1}{x^{\alpha+1}} e^{-\lambda/x};
exp(λ): denotes the exponential random variable with density function \lambda e^{-\lambda x};
Bin(N, p): denotes the binomial random variable with probability mass function \binom{N}{i} p^i (1-p)^{N-i}.

We assume that (X_{1i}, X_{2i}), i = 1, . . . , n, are n independent, identically distributed (i.i.d.) exponential random variables. Further, X_{1i} and X_{2i} are independent for all i = 1, . . . , n and X_i = \min\{X_{1i}, X_{2i}\}. We observe the sample

(X_{1:m:n}, δ_1, R_1), . . . , (X_{m:m:n}, δ_m, R_m),
(2.1)
assuming all the causes of failure are known. If some of the causes are unknown, we denote them with a ‘∗’. Under that setup, the observations are denoted as follows:
(X1:m:n , δ1 , R1 ), . . . , (Xi1 −1:m:n , δi1 −1 , Ri1 −1 )(Xi1 :m:n , ∗, Ri1 ), (Xi1 +1:m:n , δi1 +1 , Ri1 +1 ), . . . , (Xik −1:m:n , δik −1 , Rik −1 ), (Xik :m:n , ∗, Rik ), (Xik +1:m:n , δik +1 , Rik +1 ), . . . , (Xm:m:n , δm , Rm ).
(2.2)
The i_1th, . . . , i_kth causes of failure are unknown. We also assume that k, the total number of unknown causes of failure, is fixed.

3. Estimation

We assume that the X_{ji}'s are exponential random variables with parameters λ_j for i = 1, . . . , n and for j = 1, 2. The distribution function F_j(\cdot) of X_{ji} has the following form:

F_j(t) = 1 - e^{-\lambda_j t}
(3.1)
for j = 1 and 2. Since the minimum of two independent exponential random variables is also distributed as an exponential, the likelihood function of the observed data (2.1) is

L(\lambda_1, \lambda_2) = n(n - R_1 - 1) \cdots (n - R_1 - \cdots - R_{m-1} - m + 1)(\lambda_1 + \lambda_2)^m \left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{n_1} \left(\frac{\lambda_2}{\lambda_1 + \lambda_2}\right)^{n_2} e^{-(\lambda_1 + \lambda_2) \sum_{i=1}^{m} (R_i + 1) x_{i:m:n}}.   (3.2)

Here, n_1 and n_2 are the numbers of failures due to causes 1 and 2, respectively. Taking the logarithm of (3.2), and equating the partial derivatives to zero, we obtain the maximum likelihood estimators (MLEs) of λ_1 and λ_2 as

\hat\lambda_1 = \frac{n_1}{\sum_{i=1}^{m} (R_i + 1) X_{i:m:n}}  and  \hat\lambda_2 = \frac{n_2}{\sum_{i=1}^{m} (R_i + 1) X_{i:m:n}} = \frac{m}{Z} - \hat\lambda_1,   (3.3)

where Z = \sum_{i=1}^{m} (R_i + 1) X_{i:m:n} denotes the total time on test. We will suppress the use of the subscript m in X_{i:m:n} for the rest of the chapter without causing any confusion. From the likelihood function (3.2), it is immediate that (n_1, \sum_{i=1}^{m} (R_i + 1) X_{i:n}) is a jointly complete sufficient statistic for (λ_1, λ_2). By the Lehmann–Scheffé theorem, to construct the UMVUEs of λ_1 and λ_2, it is sufficient to find unbiased estimators that are functions of n_1 and \sum_{i=1}^{m} (R_i + 1) X_{i:n}. To construct the UMVUEs, we consider the following transformation (see Balakrishnan and Aggarwala (2000)):

Z_1 = n X_{1:n},
Z_2 = (n - R_1 - 1)(X_{2:n} - X_{1:n}),
\vdots
Z_m = (n - R_1 - \cdots - R_{m-1} - m + 1)(X_{m:n} - X_{m-1:n}).   (3.4)
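To make the sampling scheme and the estimators in Eq. (3.3) concrete, the following Python sketch (our own illustration, not part of the original chapter; the function names, seed and removal plan are assumptions) simulates a progressively type II censored competing risks sample with two independent exponential causes and computes the MLEs of the hazard rates.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_progressive_type2(n, R, lam1, lam2):
    """Simulate a progressively type II censored competing risks sample
    (two independent exponential causes), following the scheme described above.

    n : number of units on test;  R : planned removals (R_1, ..., R_m), sum(R) + len(R) = n.
    Returns the m observed failure times x and their causes delta (1 or 2).
    """
    m = len(R)
    assert sum(R) + m == n
    t1 = rng.exponential(1.0 / lam1, size=n)   # latent lifetime under cause 1
    t2 = rng.exponential(1.0 / lam2, size=n)   # latent lifetime under cause 2
    t = np.minimum(t1, t2)
    cause = np.where(t1 <= t2, 1, 2)

    remaining = list(range(n))                 # units still on test
    x, delta = [], []
    for Ri in R:
        # next observed failure is the smallest lifetime among remaining units
        j = min(remaining, key=lambda u: t[u])
        x.append(t[j]); delta.append(cause[j]); remaining.remove(j)
        # withdraw Ri of the surviving units at random (progressive censoring)
        drop = set(rng.choice(remaining, size=Ri, replace=False))
        remaining = [u for u in remaining if u not in drop]
    return np.array(x), np.array(delta)

def mle_hazards(x, delta, R):
    """MLEs of (lambda_1, lambda_2) from Eq. (3.3)."""
    Z = np.sum((np.asarray(R) + 1) * x)        # total time on test
    return np.sum(delta == 1) / Z, np.sum(delta == 2) / Z

# example: n = 20 units, m = 10 observed failures, two units withdrawn at each of the first five failures
R = [2, 2, 2, 2, 2, 0, 0, 0, 0, 0]
x, delta = simulate_progressive_type2(20, R, lam1=1.0, lam2=0.8)
print(mle_hazards(x, delta, R))
```

With λ_1 = 1.0 and λ_2 = 0.8 the estimates fluctuate around the true values, and averaging over many replications would exhibit the m/(m − 1) bias factor derived in (3.5) and (3.6) below.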
The Z_i's are called the spacings. It can be easily seen (Balakrishnan and Aggarwala (2000)) that the Z_i's are i.i.d. exp(λ_1 + λ_2) random variables. Therefore, \sum_{i=1}^{m} (R_i + 1) X_{i:n} = \sum_{i=1}^{m} Z_i is distributed as a gamma(m, λ_1 + λ_2) random variable. Since n_1 is a Bin(m, λ_1/(λ_1 + λ_2)) random variable, and n_1 is independent of \sum_{i=1}^{m} (R_i + 1) X_{i:n}, we have for m > 1

E(\hat\lambda_1) = \frac{m\lambda_1}{\lambda_1 + \lambda_2} \times \frac{\lambda_1 + \lambda_2}{m - 1} = \frac{m}{m - 1} \times \lambda_1   (3.5)

and

E(\hat\lambda_2) = \frac{m\lambda_2}{\lambda_1 + \lambda_2} \times \frac{\lambda_1 + \lambda_2}{m - 1} = \frac{m}{m - 1} \times \lambda_2.   (3.6)

Simple calculation yields, for m > 2,

V(\hat\lambda_1) = \frac{\lambda_1 m}{m - 1} \, \frac{(m - 1)\lambda_2 + m\lambda_1}{(m - 1)(m - 2)},   (3.7)

V(\hat\lambda_2) = \frac{\lambda_2 m}{m - 1} \, \frac{(m - 1)\lambda_1 + m\lambda_2}{(m - 1)(m - 2)},   (3.8)

and

\mathrm{Cov}(\hat\lambda_1, \hat\lambda_2) = -\frac{m\lambda_1\lambda_2}{(m - 1)^2 (m - 2)}.   (3.9)

From (3.5) and (3.6), it follows immediately that the UMVUEs of λ_1 and λ_2 are given by

\tilde\lambda_1 = \frac{m - 1}{m} \hat\lambda_1  and  \tilde\lambda_2 = \frac{m - 1}{m} \hat\lambda_2.

The variance and covariance of these estimators are given by

V(\tilde\lambda_1) = \lambda_1 \frac{m - 1}{m} \, \frac{(m - 1)\lambda_2 + m\lambda_1}{(m - 1)(m - 2)},   (3.10)

V(\tilde\lambda_2) = \lambda_2 \frac{m - 1}{m} \, \frac{(m - 1)\lambda_1 + m\lambda_2}{(m - 1)(m - 2)},   (3.11)

and

\mathrm{Cov}(\tilde\lambda_1, \tilde\lambda_2) = -\frac{\lambda_1\lambda_2}{m(m - 2)}.   (3.12)
Note that the estimators are always negatively correlated, which is evident from (3.3) as well. To estimate the survival function or the cumulative distribution function, we can use the invariance property of the MLEs. For example, the estimated survival functions due to causes 1 and 2 are given by

\hat{\bar{F}}_1(t) = e^{-\hat\lambda_1 t}  and  \hat{\bar{F}}_2(t) = e^{-\hat\lambda_2 t},

respectively.
D. Kundu, N. Kannan and N. Balakrishnan
Another parameter of interest in survival analysis is the relative risk rate due to a particular cause (say, cause 1). The relative risk is defined as ∞ λ1 π = P [X1i < X2i ] = λ1 e−λ1 x e−λ2 x dx = . λ 1 + λ2 0 Once again, using the invariance property, the MLE of the relative risk is given by πˆ =
λˆ 1 λˆ 1 + λˆ 2
=
n1 . m
For the exponential distribution in (3.1), λ1 , λ2 represent the hazard rates, and θ1 = 1/λ1 , θ2 = 1/λ2 represent the mean lifetime due to causes 1 and 2, respectively. Although the MLEs and UMVUEs of λ1 and λ2 always exist, the UMVUEs of θ1 and θ2 do not exist. The MLE of θ1 (θ2 ) do not exist if n1 = 0 (n2 = 0). We may then define the conditional MLE of θ1 (θ2 ) if n1 > 0 (n2 > 0) as follows: m (Ri + 1)Xi:n if n1 > 0 and θˆ1 = i=1 n1 m (Ri + 1)Xi:n θˆ2 = i=1 (3.13) if n2 > 0. n2 Using the forms of the estimators given above, we will now derive the exact distribution of λˆ 1 , λˆ 2 , θˆ1 , and θˆ2 . These distributions will naturally be useful in constructing confidence intervals. The cumulative distribution function of λˆ 1 is given by m
n1 Zi Fλˆ 1 (x) = P λˆ 1 x = P x =
m
P
j =1
=
m
i=1
m
m j =1
=
i=1
j
n1 Zi n1 = j P (n1 = j ) x
λ1 λ1 + λ2
m m−1
m
j
j =1 i=0
×
j
λ1 λ1 + λ2
λ2 λ1 + λ2
j
m−j
λ2 λ1 + λ2
×P
m
i=1
j Zi x
m−j
e−(λ1 +λ2 )(j/x) (λ1 + λ2 )i (j/x)i . i!
(3.14)
Note that the distribution function of λˆ 1 is a mixture of a discrete and a continuous distribution. We have m λ2 P λˆ 1 = 0 = P (n1 = 0) = . λ1 + λ2
For x > 0, the density function is given by fλˆ 1 (x) =
m
j =1
j m−j λ2 λ1 m gj (x) , λ1 + λ2 λ1 + λ2 j
(3.15)
where gi (x) =
(i(λ1 + λ2 ))m −((λ1+λ2 )i)/x 1 e Γ (m) x m+1
is the density function of the inverted gamma random variable I (m, i(λ1 + λ2 )). The distribution of λˆ 2 may be obtained in a similar manner as
Fλˆ 2 (x) = P λˆ 2 x =
m m−1
m
j
j =1 i=0
×
λ2 λ1 + λ2
j
λ1 λ1 + λ2
m−j
e−(λ1 +λ2 )(j/x) (λ1 + λ2 )i (j/x)i . i!
(3.16)
As before,
P λˆ 2 = 0 = P (n2 = 0) =
λ1 λ1 + λ2
m ,
and for x > 0 the density function of λˆ 2 is given by fλˆ 2 (x) =
m
gj (x)
j =1
j m−j λ1 m λ2 . λ1 + λ2 λ1 + λ2 j
(3.17)
We now obtain the distribution function of the conditional MLE of θ1 as defined in (3.13). For brevity, we denote it by Fθˆ1 (x). We have
Fθˆ1 (x) = P θˆ1 x | n1 > 0 =
m
P θˆ1 x | n1 = i, n1 > 0 × P [n1 = i | n1 > 0] i=1
=
m
P
i=1
m
Zj j =1
i
x × pi ,
where −1 pi = 1 − q m
i m−i m! θ2 θ1 i!(m − i)! θ1 + θ2 θ1 + θ2
and q =
θ1 . θ1 + θ2
338
D. Kundu, N. Kannan and N. Balakrishnan
Note that Zj /i is a gamma(m, i(1/θ1 + 1/θ2)) random variable. Therefore, m m−1
e−(i/θ1 +i/θ2 )x (i/θ1 + i/θ2 )j 1− × pi Fθˆ1 (x) = j! j =0
i=1
=
m
pi Hi (x) (say).
(3.18)
i=1
Here, Hi (x) is the distribution function of a gamma(m, i(1/θ1 + 1/θ2 )) random variable. The density function of the conditional MLE of θ1 becomes fθˆ1 (x) =
m
pi hi (x),
(3.19)
i=1
where hi (x) is the density function of a gamma(m, i(1/θ1 + 1/θ2 )) random variable. Similarly, we obtain the distribution function of the conditional MLE of θ2 as m
Fθˆ2 (x) = P θˆ2 x | n2 > 0 = p˜i Hi (x),
(3.20)
i=1
where
p˜i = 1 − q˜
m −1
i m−i m! θ1 θ2 i!(m − i)! θ1 + θ2 θ1 + θ2
and q˜ =
θ2 . θ1 + θ2
The corresponding density function of the conditional MLE of θ2 is fθˆ2 (x) =
m
p˜ i hi (x).
(3.21)
i=1
Using (3.19) and (3.21), we may obtain easily the moments of θˆ1 and θˆ2 . From (3.19), we have m mθ1 θ2 pi E θˆ1 = θ1 + θ2 i i=1
m m(m + 1)θ12 θ22 pi and E θˆ12 = . 2 (θ1 + θ2 ) i2
(3.22)
m m(m + 1)θ12 θ22 p˜i and E θˆ22 = . 2 (θ1 + θ2 ) i2
(3.23)
i=1
Similarly, we obtain from (3.21) m mθ1 θ2 p˜ i E θˆ2 = θ1 + θ2 i i=1
i=1
Note that, in the expressions above, the quantities within the summation sign denote the inverse moments of positive binomial random variables. Since exact expressions are not available, we may use tabulated values of positive binomial random variables available in Edwin and Savage (1954). Since the estimators are clearly biased, tabulated values of the biases given by Kundu and Basu (2000) may be used for bias correction. In several examples, the investigator may be interested in testing whether the hazard rates (mean lifetimes) of the two causes are identical. For testing H0 : λ1 = λ2 , we
may derive the likelihood ratio test. However, in the case of the competing risks model, the hypothesis H0 is equivalent to testing whether the proportion of deaths due to the 2 causes are identical. This reduces to a simple test of a binomial proportion H0∗ : p = 1/2. We may derive UMP or UMPU tests to determine whether there is indeed a difference in the lifetime distributions. For more than two causes of failure, we may derive tests based on the multinomial distribution.
4. Confidence intervals In this section, we propose four different methods of constructing confidence intervals for λ1 , λ2 , θ1 , and θ2 . The first method is based on the exact distributions of the MLEs derived in the previous section. The second method uses the asymptotic distributions of the estimators to obtain confidence intervals of the different parameters of interest. Finally, we construct parametric bootstrap confidence intervals using the percentile method and the bootstrap-t. 4.1. Approximate confidence intervals In this section, we will outline the procedure for constructing approximate confidence intervals for λ1 when λ2 is assumed to be known. A similar construction will generate intervals for the remaining parameters. In order to use this procedure, we need to assume that Pλ1 [λˆ 1 c] is monotonically increasing in λ1 . This assumption allows the invertibility of the pivotal quantity. This approach has been used by several authors, including Chen and Bhattacharya (1988), Gupta and Kundu (1998), Kundu and Basu (2000), and Childs et al. (2003). Let c(λ) be a function such that Pλ1 [λˆ 1 c(λ1 )] = α/2. Then, for λ1 < λ1 , we have
α Pλ λˆ 1 c(λ1 ) = Pλ1 λˆ 1 c(λ1 ) Pλ λˆ 1 c(λ1 ) = . 1 1 2
(4.1)
From equation (4.1), we have c(λ1 ) < c(λ1 ), which implies c(λ) is an increasing function of λ. Therefore, c−1 (λ) exists and is also an increasing function of λ. From (4.1), we see that
α Pλ1 c−1 λˆ 1 λ1 = 1 − , 2 −1 ˆ and therefore, λL = c (λ1 ) is the lower bound of the 100(1 − α)% confidence interval of λ1 . Similarly, we can obtain λU = d −1 (λˆ 1 ), where d −1 (·) is the inverse of the function d(·) obtained as the solution of the equation
α Pλ1 λˆ 1 d(λ1 ) = . 2 Then (λL , λU ) is a 100(1 − α)% confidence interval for λ1 . Since it is not possible to obtain closed form expressions of c(·) and d(·), we need to use numerical iterative techniques to compute these limits. Since the confidence interval for λ1 involves λ2 ,
340
D. Kundu, N. Kannan and N. Balakrishnan
which is unknown, we replace λ2 by its MLE. To obtain the upper and lower confidence limits, we solve the following two nonlinear equations: j ˆ m−j m m−1 α
m λL λ2 1− = 2 j λL + λˆ 2 λL + λˆ 2 j =1 i=0
ˆ ˆ e−(λL +λ2 )(j/λ1) (λL + λˆ 2 )i (j/λˆ 1 )i , i! m m−1
m λU j λˆ 2 m−j
×
α = 2
λU + λˆ 2
j
j =1 i=0
ˆ
×
(4.2)
λU + λˆ 2
ˆ
e−(λU+λ2 )(j/λ1 ) (λU + λˆ 2 )i (j/λˆ 1 )i . i!
(4.3)
The procedure outlined above requires the assumption that Pλ1 [λˆ 1 c] is increasing in λ1 . The complicated structure of the function makes it extremely difficult to provide a rigorous proof of this assumption. Numerical studies show that the result is true, and some heuristic justification along the lines of Chen and Bhattacharya (1988) or Kundu and Basu (2000) may be provided. 4.2. Fisher information matrix In this section, we present the Fisher information matrix of λ1 and λ2 . Let I(λ1 , λ2 ) = ((Iij (λ1 , λ2 )), i, j = 1, 2, denote the Fisher information matrix of the parameters λ1 and λ2 , where 2 ∂ ln(L(λ1 , λ2 )) . Iij (λ1 , λ2 ) = −E ∂λi ∂λj We have I11 (λ1 , λ2 ) =
m , λ1 (λ1 + λ2 )
I12 (λ1 , λ2 ) = I21 (λ1 , λ2 ) = 0, m . I22 (λ1 , λ2 ) = λ2 (λ1 + λ2 ) The Fisher information matrix of θ1 and θ2 , say I(θ1 , θ2 ), may be obtained from I(λ1 , λ2 ). Let I(θ1 , θ2 ) = ((Iij (θ1 , θ2 ))), i, j = 1, 2, denote the Fisher information matrix. Then, I11 (θ1 , θ2 ) =
mθ2 , + θ2 )
θ12 (θ1
I12 (θ1 , θ2 ) = I21 (θ1 , θ2 ) = 0, I22 (θ1 , θ2 ) =
mθ1 . + θ2 )
θ22 (θ1
Analysis of progressively censored competing risks data
341
Using the asymptotic normality of the MLE and the above information matrix, we obtain the 100(1 − α)% confidence intervals for λ1 , λ2 , θ1 , and θ2 as ˆ 1 (λˆ 1 + λˆ 2 ) λ λˆ 2 (λˆ 1 + λˆ 2 ) , λˆ 2 ± zα/2 , λˆ 1 ± zα/2 m m θˆ12 (θˆ1 + θˆ2 ) θˆ 2 (θˆ1 + θˆ2 ) θˆ1 ± zα/2 , θˆ2 ± zα/2 2 , mθˆ2 mθˆ1 respectively. Here, zα/2 is the upper (α/2)th percentile point of a standard normal distribution. 4.3. Bootstrap confidence intervals In this section, we construct confidence intervals based on the parametric bootstrap. The two methods that are widely used are (1) the percentile bootstrap method proposed by Efron (1982), and (2) the bootstrap-t method proposed by Hall (1988). We illustrate the procedure for the parameter λ1 . Intervals for the other parameters may be constructed in an analogous manner. To obtain the percentile bootstrap confidence interval for λ1 , we use the following algorithm: (1) Determine λˆ 1 and λˆ 2 from the sample {(x1:n , δ1 , R1 ), . . . , (xm:n , δm , Rm )} using (3.3). (2) Generate a random variable from the gamma(m, λˆ 1 + λˆ 2 ) distribution, and an independent Bin(m, λˆ 1 /(λˆ 1 + λˆ 2 )) random variable. The ratio of the two random variables provides the bootstrap estimate of λ1 , say λˆ ∗1 . (3) Repeat Step 2 NBOOT times. (4) Let CDF(x) = P∗ (λˆ ∗1 x) be the cumulative distribution function of λˆ ∗1 . Define −1 (x) for a given x. The approximate 100(1 − α)% confidence θˆboot(x) = CDF interval for λ1 is given by α α ˆθboot ˆ , θboot 1 − . 2 2 To obtain the bootstrap-t confidence interval for λ1 , we use the following algorithm: (1) Determine λˆ 1 and λˆ 2 from the sample {(x1:n , δ1 , R1 ), . . . , (xm:n , δm , Rm )} using (3.3). (2) Generate a random variable from the gamma(m, λˆ 1 + λˆ 2 ) distribution, and an independent Bin(m, λˆ 1 /(λˆ 1 + λˆ 2 )) random variable. The ratio of the two random variables provides the bootstrap estimate of λ1 , say λˆ ∗1 . Compute the variance of λˆ ∗1 , say V (λˆ ∗1 ), using (3.7). (3) Determine the T statistic √ m(λˆ ∗1 − λˆ 1 ) ∗ T = . V (λˆ ∗1 )
342
D. Kundu, N. Kannan and N. Balakrishnan
(4) Repeat Steps 2–3 NBOOT times. (5) From the NBOOT T ∗ values obtained, determine the upper bound and the lower bound of the 100(1 − α)% confidence bound of λ1 as follows: Let CDF(x) = P∗ (T ∗ x) be the cumulative distribution function of T ∗ . For a given x, define −1 (x). λˆ boott (x) = λˆ 1 + m−1/2 V λˆ 1 CDF The approximate 100(1 − α)% bootstrap-t confidence interval for λ1 is then given by α α ˆλboott ˆ , λboott 1 − . 2 2 We illustrate the performance of these different methods using a simulation study and a real dataset in Sections 6 and 7, respectively.
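The percentile algorithm above is simple enough to sketch in a few lines of Python (our own illustration; the function name, seed, and the NBOOT value of 2000 are arbitrary choices, not from the chapter). It exploits the distributional facts used throughout Section 3: given the MLEs, the bootstrap n_1* is binomial and the bootstrap total time on test is gamma, and their ratio gives λ̂_1*.

```python
import numpy as np

rng = np.random.default_rng(2)

def percentile_bootstrap_ci(lam1_hat, lam2_hat, m, n_boot=2000, alpha=0.05):
    """Parametric percentile-bootstrap interval for lambda_1 along the lines of the
    algorithm above: draw n1* ~ Bin(m, lam1/(lam1+lam2)) and Z* ~ gamma(m, lam1+lam2),
    form lam1* = n1*/Z*, and take empirical quantiles of the bootstrap estimates.
    """
    total = lam1_hat + lam2_hat
    n1_star = rng.binomial(m, lam1_hat / total, size=n_boot)
    z_star = rng.gamma(shape=m, scale=1.0 / total, size=n_boot)   # rate = lam1+lam2
    lam1_star = n1_star / z_star
    lo, hi = np.quantile(lam1_star, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# e.g., with the simulation-study values lambda_1 = 1.0, lambda_2 = 0.8 and m = 10
print(percentile_bootstrap_ci(1.0, 0.8, m=10))
```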
5. Bayesian analysis In this section, we approach the problem from a Bayesian perspective. We first consider procedures for estimating (λ1 , λ2 ), followed by procedures for (θ1 , θ2 ). In the context of exponential lifetimes, λ1 and λ2 may be reasonably modelled using gamma priors. We assume λ1 and λ2 are independently distributed with gamma(a1, b1 ) and gamma(a2, b2 ) priors, respectively. The parameters a1 , b1 , a2 , and b2 are all assumed to be positive. When a1 = b1 = 0 (a2 = b2 = 0), we obtain noninformative priors of λ1 (λ2 ). The posterior density of λ1 and λ2 based on the gamma priors is given by m
l(λ1 , λ2 | x) = kλa11 +n1 −1 e−λ1 (b1 +
i=1 (Ri +1)xi:n )
m
× λa22 +n2 −1 e−λ2 (b2 +
i=1 (Ri +1)xi:n )
.
(5.1)
Here, k is the normalizing constant that ensures l(λ1 , λ2 | x) is a proper density function. From (5.1), it is clear that the posterior density functions of λ1 and λ2 , say l(λ1 | x) and l(λ2 | x), respectively, are independent. Further, l(λ1 | x) is the density function of a gamma(a1 + n1 , b1 + m 1)xi:n ) random variable, and l(λ2 | x) is the density i=1 (Ri + function of a gamma(a2 + n2 , b2 + m i=1 (Ri + 1)xi:n ) random variable. The Bayes estimates of λ1 and λ2 under squared error loss are λˆ 1Bayes = λˆ 2Bayes =
a1 + n1 m b1 + i=1 (Ri + 1)xi:n b2 +
a2 + n2 m , i=1 (Ri + 1)xi:n
and (5.2)
respectively. For the noninformative priors (a1 = b1 = 0, a2 = b2 = 0), the Bayes estimators coincide with the MLEs. The credible intervals for λ1 and λ2 are obtained easily m from the posterior distributions. We observe that a posteriori, W = 2λ (b + 1 1 1 i=1 (Ri + 1)xi:n ) and W2 = 2 2 2λ2 (b2 + m i=1 (Ri + 1)xi:n ) follow χ2(a1 +n1 ) and χ2(a2 +n2 ) distributions, respectively.
Analysis of progressively censored competing risks data
Consequently, the 100(1 − α)% credible interval for λ1 is 2 2 χ2(a χ2(a +n ),α/2 +n ),1−α/2 m1 1 1m 1 , 2(b1 + i=1 (Ri + 1)xi:n ) 2(b1 + i=1 (Ri + 1)xi:n ) for a1 + n1 > 0. Similarly, when a2 + n2 > 0, the credible interval for λ2 is 2 2 χ2(a χ2(a +n ),α/2 +n ),1−α/2 m2 2 2m 2 , . 2(b2 + i=1 (Ri + 1)xi:n ) 2(b2 + i=1 (Ri + 1)xi:n )
343
(5.3)
(5.4)
2 is the lower αth percentile point of the central χ 2 distribution with k degrees Here χk,α of freedom. Next, we consider the Bayes estimators for θ1 and θ2 . Using the assumption of gamma priors, and squared error loss, the Bayes estimators of θ1 and θ2 are b1 + m i=1 (Ri + 1)xi:n ˆ and θ1Bayes = a1 + n1 − 1 b2 + m i=1 (Ri + 1)xi:n ˆ , θ2Bayes = (5.5) a2 + n2 − 1
respectively. An argument similar to the one outlined above may be used to obtain the 100(1 − α)% credible intervals for θ1 and θ2 as m 2(b1 + m i=1 (Ri + 1)xi:n ) 2(b1 + i=1 (Ri + 1)xi:n ) (5.6) , 2 2 χ2(a χ2(a 1 +n1 ),1−α/2 1 +n1 ),α/2 and
+ 1)xi:n ) 2(b2 + m i=1 (Ri + 1)xi:n ) , , 2 2 χ2(a χ2(a 2 +n2 ),1−α/2 2 +n2 ),α/2
2(b2 +
m
i=1 (Ri
(5.7)
respectively. In the next section, we illustrate the methods outlined here using an example.
6. Simulation study To compare the coverage probabilities and the lengths of the different confidence and credible intervals, we conducted a small simulation study. It is evident that the performance of the different methods will not depend on the censoring scheme. This has also been observed by Balakrishnan and Aggarwala (2000). The results for different n and m are presented in Tables 1–4. The results are based on an average over 1000 replications. From the tables, we observe that all the procedures provide satisfactory coverage (levels close to the nominal value of 95%). As m increases, the length of the intervals decrease. Surprisingly, the intervals based on the asymptotic distribution perform well even for small sample sizes. All the procedures give similar results: however, the credible intervals based on a noninformative prior have the shortest length. Based on this study, any one of the procedures outlined in Sections 4 and 5 may be used in practice.
344
D. Kundu, N. Kannan and N. Balakrishnan
Table 1 n = 20, m = 10, λ1 = 1.0, λ2 = 0.8 Methods
Parameters
Average length
Coverage percentage
Bayes
λ1 λ2
1.76078 1.56975
0.958 0.936
Asymptotic
λ1 λ2
1.78011 1.57125
0.937 0.922
Exact
λ1 λ2
1.52690 1.37600
0.952 0.947
Boot-P
λ1 λ2
2.12155 1.83684
0.933 0.930
Boot-t
λ1 λ2
2.04192 1.82465
0.939 0.915
Table 2 n = 50, m = 20, λ1 = 1.0, λ2 = 0.8 Methods
Parameters
Average length
Coverage percentage
Bayes
λ1 λ2
1.22373 1.08401
0.955 0.936
Asymptotic
λ1 λ2
1.23344 1.09470
0.940 0.926
Exact
λ1 λ2
1.30335 1.16623
0.966 0.958
Boot-P
λ1 λ2
1.33695 1.18564
0.949 0.934
Boot-t
λ1 λ2
1.31618 1.22481
0.922 0.926
Parameters
Average length
Coverage percentage
Bayes
λ1 λ2
1.08492 0.96296
0.946 0.932
Asymptotic
λ1 λ2
1.09183 0.97073
0.949 0.920
Exact
λ1 λ2
1.17490 1.05530
0.951 0.956
Boot-P
λ1 λ2
1.15351 1.02053
0.931 0.936
Boot-t
λ1 λ2
1.14355 1.04977
0.921 0.929
Table 3 n = 75, m = 25, λ1 = 1.0, λ2 = 0.8 Methods
Analysis of progressively censored competing risks data
345
Table 4 n = 75, m = 35, λ1 = 1.0, λ2 = 0.8 Methods
Parameters
Average length
Coverage percentage
Bayes
λ1 λ2
0.90954 0.80899
0.937 0.945
Asymptotic
λ1 λ2
0.91367 0.81363
0.940 0.946
Exact
λ1 λ2
1.00572 0.93749
0.955 0.956
Boot-P
λ1 λ2
0.93049 0.84728
0.945 0.950
Boot-t
λ1 λ2
0.93652 0.86743
0.928 0.939
7. Numerical example In this section, we consider a dataset originally analyzed by Hoel (1972). The data arose from a laboratory experiment in which male mice received a radiation dose of 300 roentgens at 5 to 6 weeks of age. The cause of death for each mouse was determined by autopsy to be thymic lymphoma, reticulum cell sarcoma, or other causes. For the purpose of analysis, we consider reticulum cell sarcoma as cause 1 and combine the other 2 causes of death as cause 2. There were n = 77 observations in the data. We generated a progressively type II censored sample from the original measurements with m = 25 and censoring scheme R1 = R2 = · · · = R24 = 2, R25 = 4. There were n1 = 7 deaths due to cause 1 and n2 = 18 deaths due to cause 2. Progressive censoring in these kinds of experiments may be invaluable in obtaining information on growths of tumors in the mice. At the time of death of a particular mouse, other mice may be randomly selected and removed from the study. Autopsies on these mice may lead to information on the progression of the cancer over time. The progressively type II censored sample thus obtained is (40, 2), (42, 2), (62, 2), (163, 2), (179, 2), (206, 2), (222, 2), (228, 2), (252, 2), (259, 2), (318, 1), (385, 2), (407, 2), (420, 2), (462, 2), (517, 2), (517, 2), (524, 2), (525, 1), (558, 1),(536, 1), (605, 1), (612, 1), (620, 2), (621, 1). From the above data, we obtain the following: 25
(Ri + 1)xi:n = 28611.00
i=1
which yields λˆ 1 =
7 = 0.000245, 28611
λ˜ 1 = 0.000235,
λˆ 2 =
18 = 0.000629, 28611
λ˜ 2 = 0.000604,
346
D. Kundu, N. Kannan and N. Balakrishnan
Table 5 Confidence and credible intervals λ1
λ2
(0.000064, 0.000426) (0.000098, 0.000456)
(0.000338, 0.000920) (0.000372, 0.000951)
(0.000087, 0.000448) (0.000069, 0.000460)
(0.000349, 0.000949) (0.000361, 0.000959)
Method Exact (MLE) Credible interval (noninformative prior) Bootstrap-t Percentile
Table 6 Confidence intervals θ1
θ2
(1059.38, 7115.20) (2192.98, 10204.08)
(855.19, 2328.81) (1051.53, 2688.17)
Method Exact (MLE) Credible interval (noninformative prior)
Var λˆ 1 = 9.8112 × 10−9 , Var λˆ 2 = 2.5644 × 10−8 , Cov λˆ 1 , λˆ 2 = −2.908 × 10−10 , Var λ˜ 2 = 2.3634 × 10−8 , Var λ˜ 1 = 9.0426 × 10−9 , Cov λ˜ 1 , λ˜ 2 = −2.469 × 10−10 . The relative risk due to cause 1 is 7 = 0.28. πˆ = 25 The MLEs of the mean lifetimes due to causes 1 and 2 are given by θˆ1 = 4087.29,
θˆ2 = 1589.5.
It is quite clear that cause 2 is more severe, with lifetimes almost one-third of that of cause 1. To assess the performance of these estimators, we construct 95% confidence and credible intervals using all the different methods outlined in Sections 4 and 5. The results are presented in Tables 5 and 6. The intervals are very similar for λ1 and λ2 , in terms of the length. However, the credible intervals for θ1 and θ2 based on the noninformative prior are much wider. 8. Some generalizations and extensions 8.1. Unknown causes of failure In all the procedures outlined above, the cause of failure for all individuals in the study were assumed to be known. The problem of unknown causes of failure was originally
Analysis of progressively censored competing risks data
347
considered by Dinse (1982) and Miyakawa (1982, 1984), and more recently by Kundu and Basu (2000). Assume we have data of type (2.2), and k causes of failure are unknown. Let I1 = {δi ; δi = 1}, I2 = {δi ; δi = 2}, I3 = {δi ; δi = 3}, |I1 | = n1 , |I2 | = n2 , and |I3 | = k = m − m∗ (say). Let us also assume that m∗ = n1 + n2 is fixed. The likelihood function of the observed data (2.2) is ∗
L(λ1 , λ2 ) = C × λ11 λ22 (λ1 + λ2 )m−m × e−(λ1 +λ2 ) n
n
m
i=1 (Ri +1)xi:n
,
(8.1)
where C = n(n − R1 − 1) · · · (n − R1 − · · ·− Rm−1 − m + 1) is the normalizing constant. Taking the logarithm of (8.1), and equating the partial derivatives to zeros, we obtain the MLEs of λ1 and λ2 as n1 n2 m m ˆλ1 = ˆ m m and λ2 = . (8.2) ∗ m∗ (R + 1)x m (R i:n i=1 i i=1 i + 1)xi:n ∗ Here n1 is a Bin(m∗ , λ1 /(λ 1 m+ λ2 )) random variable and n2 is Bin(m , λ2 /(λ1 + λ2 )) random variable. Further, i=1 (Ri + 1)xi:n is a gamma(m, λ1 + λ2 ) random variable which is independent of n1 and n2 . All the procedures discussed in Sections 3 and 4 can be easily modified to the present situation. The Bayesian analysis can also be carried out along the lines of Section 5.
8.2. Models for the lifetime distribution The first step in generalizing the assumption of exponential lifetimes is to consider the Weibull model. All the methods outlined in Sections 3–5 can be easily adapted to Weibull lifetimes. If we consider a common shape parameter, it is easy to derive explicit expressions for the two scale parameters. Unfortunately, when estimating the shape parameter, there is no closed form solution, and numerical iterative techniques will have to be employed. The expressions for the information matrix and the Bayesian estimators will also require iterative methods. Since there is not much insight to be gained by providing the forms of the estimators, we will not present here these expressions for brevity. We have also considered the case of dependent causes of failure. As we mentioned in Section 1, identifiability is a major concern. If there is anecdotal or physiological evidence that the causes of failure are dependent, we could use (among others) the bivariate exponential model suggested by Marshall and Olkin (see Kotz et al. (2000)).
9. Conclusions

In this chapter, we have considered the competing risks model when the observed data is progressively type II censored. We have assumed that the lifetimes under the different causes have independent exponential distributions. We have obtained the MLEs and UMVUEs of the hazard rates and the mean lifetimes, and also derived their exact distributions. Several different procedures for constructing confidence intervals have been suggested. In addition, we have derived the Bayes estimators under suitable informative and noninformative priors. A numerical example has been provided to illustrate the
methods outlined in this chapter. We have also suggested some extensions and generalizations of these models.
References

Balakrishnan, N., Aggarwala, R. (2000). Progressive Censoring: Theory, Methods and Applications. Birkhäuser, Boston.
Boag, J.W. (1949). Maximum likelihood estimates of the proportion of patients cured by cancer therapy. J. Roy. Statist. Soc. Ser. B 11, 15–44.
Chen, S.M., Bhattacharya, G.K. (1988). Exact confidence bound for an exponential parameter hybrid censoring. Comm. Statist. Theory Methods 16, 1858–1870.
Childs, A., Chandrasekar, B., Balakrishnan, N., Kundu, D. (2003). Exact likelihood inference based on type I and type II hybrid censored samples from the exponential distribution. Ann. Inst. Statist. Math. In press.
Cohen, A.C. (1963). Progressively censored samples in life testing. Technometrics 5, 327–329.
Cohen, A.C. (1966). Life testing and early failure. Technometrics 8, 539–549.
Crowder, M.J. (2001). Classical Competing Risks. Chapman & Hall, Boca Raton, FL.
David, H.A., Moeschberger, M.L. (1978). The Theory of Competing Risks. Griffin, London.
Dinse, G.E. (1982). Non-parametric estimation of partially incomplete time and type of failure data. Biometrics 38, 417–431.
Edwin, G.L., Savage, R.I. (1954). Tables of expected value of 1/X for positive Bernoulli and Poisson variables. J. Amer. Statist. Assoc. 49, 169–177.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. In: CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 38. SIAM, Philadelphia, PA.
Gupta, R.D., Kundu, D. (1998). Hybrid censoring schemes with exponential failure distribution. Comm. Statist. Theory Methods 27, 3065–3083.
Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals. Ann. Statist. 16, 927–953.
Hoel, D.G. (1972). A representation of mortality data by competing risks. Biometrics 28, 475–488.
Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
Kotz, S., Balakrishnan, N., Johnson, N.L. (2000). Continuous Multivariate Distributions, Vol. 1: Models and Applications, 2nd edn. Wiley, New York.
Kundu, D., Basu, S. (2000). Analysis of incomplete data in presence of competing risks. J. Statist. Plann. Inference 87, 221–239.
Miyakawa, M. (1982). Statistical analysis of incomplete data in competing risks model. J. Japan. Soc. Quality Control 12, 49–52.
Miyakawa, M. (1984). Analysis of incomplete data in competing risks model. IEEE Trans. Reliabil. 33, 293–296.
19
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23019-4
Marginal Analysis of Point Processes with Competing Risks
Richard J. Cook, Bingshu Chen and Pierre Major
1. Introduction

1.1. Overview

Point process data arise in medical research when a clinically important event may recur over a period of observation. Examples are ubiquitous and arise in settings such as oncology (Gail et al., 1980; Byar et al., 1986; Hortobagyi et al., 1996), cerebrovascular disease (Hobson et al., 1993; OASIS, 1997), osteoporosis (Riggs et al., 1990), and epilepsy (Albert, 1991). Interest typically lies in understanding features of the event process such as intensity, rate, or mean functions, as well as related group differences and covariate effects. The method of analysis for point process data is naturally driven by the feature of interest. Andersen et al. (1993) focus on intensity-based methods for counting processes, while others emphasize models with a random effect formulation (Thall, 1988; Abu-Libdeh et al., 1990; Thall and Vail, 1990), marginal methods for multivariate survival data (Wei et al., 1989), or marginal models based on rate functions (Lawless and Nadeau, 1995). Interpretation and fit are key factors which help guide the analysis approach for a given problem, and the merits of the various strategies have been actively discussed in the literature (Lawless, 1995; Wei and Glidden, 1997; Cook and Lawless, 1997a; Oakes, 1997; Therneau and Hamilton, 1997; Cook and Lawless, 2002). Often marginal rate functions serve as a meaningful basis for inference and these will serve as the focus here.

Frequently when subjects are at risk for recurrent events, they are also at risk for a so-called terminal event which precludes the occurrence of subsequent events. Death, for example, is a terminal event for any point process generated by a chronic health condition. The presence of a terminal event with point process data raises challenges which must be addressed if interest lies in the mean function (Cook and Lawless, 1997b), the cumulative distribution function for the number of events over a fixed interval or a lifetime (Strawderman, 2000), or other aspects of the process. The purpose of this article is to describe methods of analysis for point process data in the presence of terminal events while emphasizing connections with methodology for the competing risks problem in survival data.
The remainder of the paper is organized as follows. In the next section methods for the analysis of time to event data subject to a competing risk are reviewed. Methods for the analysis of point processes based on rate functions are then reviewed, along with some simple methods for dealing with terminal events. An application to a study of breast cancer patients with bone metastases (Hortobagyi et al., 1998) illustrates the various procedures. The article concludes with some general remarks.

1.2. Time to event data and competing risks

Let the random variable D denote the time from a well defined origin to death and let C denote a right censoring time. Assume that there is a maximum period of observation of duration C∗ so that C ≤ C∗. The survival function for the time to death of a generic individual is denoted by S^D(t) = Pr(D ≥ t), and Pr(C < t) = 1 − K(t) is the cumulative distribution function for the censoring time (i.e., K(t) = P(C ≥ t) for C ≤ C∗). The hazard function for death is defined as

h^D(t) = lim_{Δt↓0} Pr{D < t + Δt | D ≥ t} / Δt,

and the hazard for censoring, h^C(t), is similarly defined with D replaced by C. Due to right censoring we only observe X = D ∧ C and ∆^D = I(D ≤ C), where x ∧ y = min(x, y) and I(·) is an indicator function. Here X is the total duration of observation and ∆^D = 1 if death is observed and ∆^D = 0 otherwise. In the one sample problem with no covariates, the assumption of independent right censoring is satisfied if

lim_{Δt↓0} P(D < t + Δt | D ≥ t, C ≥ t) / Δt = h^D(t),

and this is assumed to hold in what follows. For a sample of n independent and identically distributed individuals, let Di, Ci, Xi and ∆i^D denote the corresponding quantities for individual i, i = 1, . . . , n. The observed data may then be represented by {(Xi, ∆i^D), i = 1, . . . , n}. To estimate the survival function, S^D(t), we define the counting process Ni^D(t) = ∆i^D I(Di ≤ t) so that dNi^D(t) = lim_{Δt↓0}{Ni^D((t + Δt)−) − Ni^D(t−)} = 1 if subject i dies at time t and dNi^D(t) = 0 otherwise, i = 1, . . . , n. The "at risk" function Yi^D(t) = I(t ≤ Xi) indicates whether a subject is observed to be at risk for death at time t, i = 1, . . . , n. If 0 < t1 < · · · < tm are m distinct times of death, the Kaplan–Meier estimate for S^D(t) is given by

Ŝ^D(t) = ∏_{tk ≤ t} [1 − ĥ^D(tk)],    (1.1)

where ĥ^D(t) = dN·^D(t)/Y·^D(t) is the estimated hazard, dN·^D(t) = Σ_{i=1}^n Yi^D(t) dNi^D(t), and Y·^D(t) = Σ_{i=1}^n Yi^D(t) (Kalbfleisch and Prentice, 1980).
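The product-limit formula (1.1) can be computed directly from the follow-up times and death indicators; the following is a minimal sketch (not from the original article; names and data are assumptions for illustration).

```python
import numpy as np

def kaplan_meier(times, death):
    """Kaplan-Meier estimate of S^D(t) built from the counting-process
    quantities dN.(t) and Y.(t) described above.

    times : follow-up times X_i = min(D_i, C_i)
    death : indicators Delta_i^D = 1 if death was observed
    """
    times = np.asarray(times, dtype=float)
    death = np.asarray(death, dtype=int)

    event_times = np.unique(times[death == 1])   # distinct observed death times
    surv, out = 1.0, []
    for t in event_times:
        d = np.sum((times == t) & (death == 1))  # dN.(t): deaths at t
        y = np.sum(times >= t)                   # Y.(t): number still at risk
        surv *= 1.0 - d / y                      # product-limit step (1.1)
        out.append((t, surv))
    return out

# small assumed data set for illustration
X = [2.0, 3.5, 3.5, 5.0, 7.2, 8.0, 9.1]
delta_D = [1, 0, 1, 1, 0, 1, 0]
for t, s in kaplan_meier(X, delta_D):
    print(t, round(s, 3))
```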
Suppose now that interest lies in the occurrence of an event associated with morbidity, which may or may not occur prior to death. This may be, for example, the time to the progression of disease, or the time to some other clinically important event. Let the time of this event be denoted Vi for individual i and let Ti = min(Vi , Di ), i = 1, . . . , n.
In this setting, since death precludes the occurrence of the morbidity event, a competing risk problem arises for which, in the absence of censoring, we observe only (Ti, Di) and I(Vi ≤ Di) for individual i, i = 1, . . . , n. More generally with independent right censoring, let ∆i^V = I(Vi ≤ min(Di, Ci)) be the indicator that the morbidity event was observed to occur, let Ni^V(t) = ∆i^V I(Vi ≤ t) be the counting process for the morbidity event, and let the at risk indicator for the morbidity event be denoted Yi^V(t) = I(t ≤ min(Ti, Ci)). The function

h^V(t) = lim_{Δt↓0} Pr{T < t + Δt, ∆^V = 1 | T ≥ t} / Δt

is called the cause-specific hazard for the morbidity event and may be interpreted as the instantaneous probability of the morbidity event occurring at time t given neither it nor death have occurred prior to time t. A nonparametric estimate of h^V(t) is given by ĥ^V(t) = dN·^V(t)/Y·^V(t), where dN·^V(t) = Σ_{i=1}^n Yi^V(t) dNi^V(t) and Y·^V(t) = Σ_{i=1}^n Yi^V(t). While this has a similar form to the estimated hazard for death it is important to note that it cannot be used to construct a Kaplan–Meier type estimate of a distribution function for the time to the morbidity event using a formula such as (1.1). Instead, if interest lies in estimating the proportion of subjects who have experienced the morbidity event by time t, one should focus on the cumulative incidence function

ψ(t) = Pr(T ≤ t, ∆^V = 1) = ∫_0^t h^V(u) S^D(u) du.    (1.2)

Note that (1.2) is slightly different from the usual expression provided for the cumulative incidence function in discussions of the competing risk problem. Typically competing risks are discussed in the context of problems where all events preclude the occurrence of other events, as is the case in the analysis of cause of death data. Here, only death precludes the occurrence of the morbidity event (not vice versa) and so we use S^D(u) in (1.2) instead of the survivor function for T = min(V, D). The cumulative incidence function may be estimated nonparametrically by

ψ̂(t) = Σ_{tk ≤ t} ĥ^V(tk) Ŝ^D(tk).    (1.3)
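A sketch of the estimator (1.3) is given below (not from the original article; the function name, arguments and coding of the event types are assumptions). It combines a Kaplan–Meier step for death with the estimated cause-specific hazard for the morbidity event, using S^D as in (1.2) rather than the survivor function of min(V, D).

```python
import numpy as np

def cumulative_incidence(x, d_death, t_morb, d_morb):
    """Nonparametric estimate (1.3) of psi(t) = Pr(T <= t, Delta^V = 1).

    x       : follow-up times for death, X_i = min(D_i, C_i)
    d_death : Delta_i^D, 1 if death was observed
    t_morb  : follow-up times for the morbidity event, min(V_i, D_i, C_i)
    d_morb  : Delta_i^V, 1 if the morbidity event was observed
    """
    x, d_death = np.asarray(x, float), np.asarray(d_death, int)
    t_morb, d_morb = np.asarray(t_morb, float), np.asarray(d_morb, int)

    grid = np.unique(np.concatenate([x[d_death == 1], t_morb[d_morb == 1]]))
    surv, psi, out = 1.0, 0.0, []
    for tk in grid:
        # Kaplan-Meier step for death at tk, giving S_hat^D(tk) as in (1.3)
        yD = np.sum(x >= tk)
        dD = np.sum((x == tk) & (d_death == 1))
        surv *= 1.0 - dD / yD
        # cause-specific hazard of the morbidity event at tk
        yV = np.sum(t_morb >= tk)
        dV = np.sum((t_morb == tk) & (d_morb == 1))
        if yV > 0:
            psi += (dV / yV) * surv
        out.append((tk, psi))
    return out
```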
Consider a two-sample problem in which the hazard for death and the cause-specific hazard for the morbidity event in group j at time t are hj^D(t) and hj^V(t), respectively, j = 1, 2. Suppose nj subjects are initially in group j, and the ith individual in group j has counting processes Nji^D(t) and Nji^V(t) for death and the morbidity event, respectively. The corresponding at risk indicators are denoted Yji^D(t) and Yji^V(t), respectively. The standard class of statistics for testing H0: h1^D(t) = h2^D(t) is

∫_0^{C∗} W^D(u) [ĥ1^D(u) − ĥ2^D(u)] du,    (1.4)
where ĥj^D(t) = dNj·^D(t)/Yj·^D(t), dNj·^D(t) = Σ_{i=1}^{nj} Yji^D(t) dNji^D(t), Yj·^D(t) = Σ_{i=1}^{nj} Yji^D(t), j = 1, 2,

W^D(t) = Y1·^D(t) Y2·^D(t) a(t) / Y··^D(t),

and Y··^D(t) = Y1·^D(t) + Y2·^D(t). The function a(t) is a fixed (predictable) weight function with a(t) = 1 giving the usual log-rank statistic. An analogous test of H0: h1^V(t) = h2^V(t) could be carried out to assess differences between groups in the cause specific hazard function. For this test, however, the statistics forming its basis are not directly linked to observable quantities. To address this, it may be desirable to test H0: ψ1(t) = ψ2(t) where ψj(t) denotes the cumulative incidence function (1.2) for subjects in group j, j = 1, 2. Gray (1988) proposes a two-sample test of the equality of cumulative incidence functions based on the statistic

∫_0^{C∗} W^V(u) { [1 − ψ̂1(u)]^{−1} dψ̂1(u) − [1 − ψ̂2(u)]^{−1} dψ̂2(u) },    (1.5)
where ψˆ j (u) is the estimate of ψj (u) obtained by (1.3) and W V (u) is a weight function. For a suitably chosen W V (u), in the absence of the competing risk problem, the familiar log-rank test is obtained from this statistic. More generally, however, tests of this sort are appealing as they are based on observable quantities and have a simple interpretation.
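As an illustration of the class of statistics in (1.4), the following sketch (not from the original article; names and data layout are assumptions) evaluates the contrast with a(t) = 1, which reduces to the usual log-rank numerator. The robust variance needed for a formal test is omitted here.

```python
import numpy as np

def weighted_hazard_contrast(x1, d1, x2, d2):
    """Statistic (1.4) with a(t) = 1: the weighted contrast of the two
    estimated hazards for death.

    x_j : follow-up times in group j; d_j : death indicators in group j
    """
    x1, d1 = np.asarray(x1, float), np.asarray(d1, int)
    x2, d2 = np.asarray(x2, float), np.asarray(d2, int)

    grid = np.unique(np.concatenate([x1[d1 == 1], x2[d2 == 1]]))
    U = 0.0
    for t in grid:
        y1, y2 = np.sum(x1 >= t), np.sum(x2 >= t)
        dN1 = np.sum((x1 == t) & (d1 == 1))
        dN2 = np.sum((x2 == t) & (d2 == 1))
        if y1 > 0 and y2 > 0:
            w = y1 * y2 / (y1 + y2)            # W^D(t) with a(t) = 1
            U += w * (dN1 / y1 - dN2 / y2)     # weighted hazard difference
    return U
```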
2. Rate functions for point processes

Let Ni(t) denote a right continuous counting process which records the number of events experienced by subject i over the interval (0, t] and let Ni((t + Δt)−) − Ni(t−) denote the number of events occurring over the interval [t, t + Δt). We let dNi(t) = lim_{Δt↓0}{Ni((t + Δt)−) − Ni(t−)} = 1 if an event occurs at time t for subject i, and dNi(t) = 0 otherwise, i = 1, . . . , n. Consider the analysis of point process data in the setting where there is no terminal event, observation is planned over the interval (0, C∗], but subjects may be censored at an earlier time denoted by Ci for subject i, i = 1, . . . , n. If we take Di = ∞ then here Yi(t) = I(t ≤ Xi) = I(t ≤ Ci). Let Hi^N(s) = {Ni(u); 0 < u < s} be the history of the event process at time s for subject i, which represents the times for all of their events occurring over (0, s). The intensity function for the event process for subject i is given by

λ(s | Hi^N(s)) = lim_{Δs↓0} Pr{Ni((s + Δs)−) − Ni(s−) = 1 | Hi^N(s)} / Δs,

which can also be shown to satisfy λ(s | Hi^N(s)) ds = E(dNi(s) | Hi^N(s)). Note that the history of the process may be expanded to include internal covariates and hence intensity based methods provide a rich framework which facilitates detailed examination of a wide variety of aspects of the process under study (Lawless, 1995). Use of intensity based methods, however, requires detailed modeling of sometimes complex
processes and frequently questions of primary interest may be addressed based on marginal features through the use of rate functions. The rate function r(s) is simply given by the unconditional instantaneous probability of an event occurring at time s, satisfying r(s) ds = E(dN(s)). Under a Poisson model, the intensity and rate functions are the same since the increments in the counts in disjoint intervals are independent. For the one sample problem, the Poisson score equation for estimation of the rate function at s is

Σ_{i=1}^n Yi(s) [dNi(s) − r(s) ds] = 0.    (2.1)
Provided E(dNi(s) | Yi(s)) = r(s) ds (i.e., the distribution of Ci is independent of {Ni(u), 0 ≤ u}), the left-hand side is an unbiased estimating function and the solution r̂(s) ds = Σ_{i=1}^n Yi(s) dNi(s) / Σ_{i=1}^n Yi(s) is an unbiased estimate of r(s) ds. Therefore, the solution to (2.1) is the robust Nelson–Aalen estimate of the rate function (Andersen et al., 1993) and the quantity R̂(t) = ∫_0^t r̂(s) ds is an unbiased estimate of the mean R(t) = E(N(t)), the expected number of events over (0, t]. Robust variance estimates for a wide class of distributions to facilitate interval estimation may be obtained for R̂(t) (Lawless and Nadeau, 1995) and extensions to deal with regression problems are also possible.

Consider now two groups of subjects with the counting process {Nji(u); 0 < u ≤ t} and an independent at risk indicator Yji(t) for the ith subject in group j, i = 1, . . . , nj, and rate and mean functions rj(t) and Rj(t) = E{Nji(t)}, respectively, j = 1, 2. To develop tests of H0: r1(t) = r2(t), we may proceed in a manner analogous to that used to develop tests for intensity functions in modulated Poisson processes (e.g., Andersen et al., 1993, Section 5.2). A family of test statistics mentioned by Lawless and Nadeau (1995) is based on

U = ∫_0^{C∗} W(u) [r̂1(u) − r̂2(u)] du,    (2.2)

where

W(u) = Y1·(u) Y2·(u) a(u) / Y··(u)

and a(u) is a fixed (predictable) weight function. Again, if a(u) is a constant, a log rank type statistic results. A variance estimate for (2.2), var̂(U), which is robust to departures from Poisson assumptions (e.g., it is valid for mixed or clustered Poisson processes, mixed renewal processes, self-exciting point processes, etc.) may readily be obtained (Cook et al., 1996). Under mild regularity conditions, the standardized form U²/var̂(U) asymptotically follows a χ²(1) distribution under H0 for a wide class of underlying point processes and hence large observed values of this statistic provide evidence against the null hypothesis.
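The following sketch (an assumed illustration, not code from the article) evaluates the contrast U in (2.2) with a(u) = 1 from recurrent event data; the robust variance of Cook et al. (1996) required for the standardized test is not implemented here.

```python
def rate_contrast(events1, cens1, events2, cens2):
    """Statistic U of (2.2) with a(u) = 1, contrasting the Nelson-Aalen rate
    estimates of two groups of recurrent event data.

    events_j : list of lists, the observed event times of each subject
    cens_j   : end-of-follow-up (censoring) time of each subject
    """
    def dn_y(events, cens, u):
        y = sum(c >= u for c in cens)                            # Y_j.(u): subjects under observation at u
        dn = sum(sum(1 for t in e if t == u) for e in events)    # dN_j.(u): events at u
        return dn, y

    pooled = sorted({t for e in (events1 + events2) for t in e})
    U = 0.0
    for u in pooled:
        dn1, y1 = dn_y(events1, cens1, u)
        dn2, y2 = dn_y(events2, cens2, u)
        if y1 > 0 and y2 > 0:
            w = y1 * y2 / (y1 + y2)                              # W(u) with a(u) = 1
            U += w * (dn1 / y1 - dn2 / y2)                       # weighted rate difference
    return U

# toy data (assumed): each sublist holds one subject's event times
g1_events, g1_cens = [[1.0, 3.0], [2.0], []], [4.0, 3.5, 5.0]
g2_events, g2_cens = [[0.5, 1.5, 2.5], [1.0]], [3.0, 4.0]
print(rate_contrast(g1_events, g1_cens, g2_events, g2_cens))
```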
3. Point processes with terminal events

3.1. Joint models

Frequently while subjects are at risk for a recurrent event, they are at risk for a so-called terminal event which precludes the occurrence of subsequent recurrent events. This is similar in spirit to the competing risk problem in survival analysis in which death precludes the subsequent occurrence of another type of event (e.g., an event associated with morbidity or, in the more classical setting, death from another cause). For example, consider a study of kidney transplant recipients. Graft rejection episodes are transient events in which there are physiological indications of difficulties with acceptance of the transplanted organ. By definition they respond to treatment and are less than 24 hours in duration, but they may occur repeatedly over time. Interest often lies in preventing these episodes since they are associated with morbidity as well as health resource utilization (Cole et al., 1994). While subjects are at risk for rejection episodes, they are also at risk for total graft rejection which in turn precludes the occurrence of subsequent episodes (Cook and Lawless, 1997b). As another example, patients with breast cancer and bone metastases have a strong risk of recurrent skeletal complications. Again it is of interest to prevent these complications from occurring due to their adverse effect on quality of life and the cost in treating them. Since these types of patients are at an advanced stage of disease they are also at high risk of death, and again death precludes the occurrence of subsequent skeletal complications.

For concreteness in what follows, consider the termination time as a time of death. Let Di denote the time of death as before. One may consider {N(u), 0 < u ≤ D; D} as a bivariate process with the first component representing the point process and the second the time of death. Let H^N(s) = {N(u), 0 < u < s} denote the history of the point process and H^D(s) = {I(D ≤ u), 0 < u < s}. In general the joint process will not be fully observable due to right censoring at C. In general one can have different censoring times for the point process and the survival time but we do not consider this here for simplicity. Throughout, we assume a censoring mechanism in which C is independent of {N(u), 0 < u ≤ D; D}.

Intensity-based joint models for the recurrent and terminal events are specified by intensity functions for the terminal event and recurrent event processes of the form

λ^D(t | H^N(t), H^D(t)) = lim_{Δt↓0} Pr(D < t + Δt | H^N(t), H^D(t), D ≥ t) / Δt,    (3.1)

λ(s | H^N(s), H^D(s)) = lim_{Δs↓0} Pr(N((s + Δs)−) − N(s−) = 1 | H^N(s), H^D(s), D ≥ s) / Δs.    (3.2)

Here (3.1) is the intensity for death, which may depend on the history of the point process, and (3.2) is the intensity of the recurrent event process which makes explicit
the requirement that subjects must not have experienced their terminal event (i.e., died) for the recurrent event to occur.

There are a variety of ways of forming such joint models. Perhaps the most familiar one is to consider a marginal model for the event process and a Cox regression model for the terminal event time featuring internal time-dependent covariates summarizing the history of the recurrent event process. This approach is in the spirit of a "selection model" as defined by Little (1995). Alternatively, one can induce association between the event and terminal event processes via shared or correlated random effects. These approaches are not particularly appealing when primary interest lies in characterizing the recurrent event process. In this setting the following "pattern-mixture" approach is more natural (Little, 1995).

In order to discuss inferential issues surrounding the analysis of point processes with terminal events it is helpful to consider a particular model relating the event process and the terminal event. This model may then be used to compute expectations of marginal quantities. Let d∗ denote the realized value of the time to death random variable D. A convenient pattern-mixture type model for the dependence between the event process and D is obtained by adopting a marginal model for death and a conditional rate function model given the time of death (i.e., E(dNi(s) | d∗) = r(s | d∗) ds). One may stratify the rate function on the basis of d∗, or adopt a proportional rate model of the form r(s | d∗) = r0(s) exp(γ g(d∗)), where g(d) is any monotonically increasing function of d (Cook and Lawless, 1997b). In this case r0(s) is the event rate at time s for a subject with g(d∗) = 0. The parameter γ reflects the dependence of the event rate on the survival time. If γ < 0, for example, then subjects with longer survival times have lower event rates. Expressing rates in this way is convenient if detailed information is required about the rate of events for specific values of d∗. Often, however, it is more convenient to examine marginal event rates of the sort

E(dN(s) | D > d∗)/ds = ∫_{d∗}^∞ r(s | u) f(u | u > d∗) du.    (3.3)

This is the rate of events at time s (d∗ < s) among subjects who survived at least to time d∗.

In studies of health resource utilization, estimates of the total number of events experienced over the entire course of the study or a patient's lifetime may be of interest. For example, if each event is associated with a particular cost to the health care system (e.g., as may be the case with the need for radiation in studies of patients with bone metastases) it may be desirable to compare the total number of events across the two groups. If the total study duration is C∗ years, we may be interested in the marginal expectation E(N(C∗)) where we are marginalizing over the survival time. In such analyses, the association between the event process and the survival time must be addressed. This is easily done by noting that

r(s, s) = E{dN(s)} = E[E{dN(s) | D > s}]

is the marginal event rate of interest, and the mean at time t ≤ C∗ is

µ(t) = E(N(t)) = ∫_0^t r(u, u) S^D(u) du.    (3.4)
If we let C∗ → ∞ and t → ∞, then we obtain E(N(D)), which is the expected number of events over a patient's lifetime. If a substantial fraction of the sample is observed to die, then calculations of this sort may be reasonable. If, however, the majority of patients' survival times are right censored, then estimates of E(N(D)) may involve extrapolation over a region of time where it is not possible to assess the model and such calculations should be interpreted with caution. Typically questions are restricted to the period of study and no such extrapolation is required. In such settings it is advisable to conduct supplementary survival analyses to help in the understanding of the treatment effect, since a reduction in the mean function could arise from a reduction in the conditional rate or an increase in the mortality rate.

3.2. Connections with competing risks methodology

Note that (3.4) resembles the expression for the cumulative incidence function given by (1.2) but with the cause specific hazard h^V(u) replaced by the conditional rate function r(u, u). Cook and Lawless (1997b) consider the estimate of (3.4) analogous to the estimate (1.3) given by

µ̂(t) = ∫_0^t r̂(u | u) Ŝ^D(u) du = Σ_{tk ≤ t} r̂(tk, tk) Ŝ^D(tk),    (3.5)

where r̂(u | u) = Σ_{i=1}^n Yi(u) dNi(u) / Σ_{i=1}^n Yi(u), Ŝ^D(t) is the Kaplan–Meier estimate for the survival function S^D(t), and t1 < · · · < tm are the distinct times of recurrent or terminal events. Note that (3.5) can be viewed as an estimate based on (3.4) obtained by replacing the unknown quantities with the corresponding estimates. Moreover, if the point process of interest is a failure time process in which the event of interest can occur at most once, then the estimate (3.5) coincides with that of (1.3).

For the two sample problem, let µ1(t) and µ2(t) denote the marginal mean functions for treatment and control groups, respectively. Nonparametric results which accommodate dependent recurrent and terminal events are given by Cook and Lawless (1997b) and developed more fully by Ghosh and Lin (2000), who consider a generalized log-rank statistic

U∗ = ∫_0^{C∗} W(t) d[µ̂1(t) − µ̂2(t)],    (3.6)

where µ̂j(t) is the estimate of the marginal mean function µj(t) for group j, j = 1, 2, from (3.5), and W(t) is a weight function. The weight function W(t) can be specified as

W(t) = Y1·(t) Y2·(t) a(t) / Y··(t),

where now Yji(t) = I(t ≤ Xi), i = 1, . . . , nj, j = 1, 2. Under the assumptions that n1/n → ρ1 and n2/n → ρ2 as n → ∞ for constants ρ1 and ρ2 and the null hypothesis H0: µ1(t) = µ2(t), Ghosh and Lin (2000) show that the generalized log-rank statistic U∗ has an asymptotic normal distribution with mean zero and variance which can be
consistently estimated from the observed data. Let (U∗)²/var̂(U∗) denote the standardized form of this statistic, which is asymptotically χ²(1) under the null hypothesis of no treatment effect.

4. Application to a breast cancer trial

4.1. Bone metastases and skeletal related events

Hortobagyi et al. (1996) report on a multicenter randomized trial designed to investigate the effect of pamidronate on the development of skeletal complications in breast cancer patients with bone metastases. Patients were accrued between January 1991 and March 1994 from 97 study sites in the United States, Canada, Australia and New Zealand. Patients with stage IV breast cancer receiving cytotoxic chemotherapy with at least one predominantly lytic bone lesion greater than or equal to one centimeter in diameter were randomized within strata defined by ECOG status. A total of 382 women were enrolled in the study with 185 randomized to receive pamidronate and 197 to placebo control. Two patients randomized to placebo did not have bone metastases and were therefore excluded from subsequent analyses. Patients randomized to the pamidronate arm received 90 mg of pamidronate disodium via a two hour infusion every four weeks whereas patients randomized to the placebo received dextrose infusions. Patients on a three week chemotherapy regimen were permitted to receive the study drug every three weeks. After completion of the planned one year follow-up, the observation was extended for an additional year and the results published in Hortobagyi et al. (1998). Each patient was followed until death, the last date of contact or loss to follow-up, or February 1, 1996. At monthly visits patients were assessed and the occurrence of skeletal complications was recorded. The skeletal complications of interest include pathologic fractures, spinal cord compression with vertebral fracture, the need for surgery to treat or prevent fractures, and the need for radiation for the treatment of bone pain. Here we focus on the need for radiation for the treatment of bone pain.

Figure 1 displays the duration of observation and need for radiation for all patients in the control arm of Hortobagyi et al. (1998). Each patient is represented by a horizontal line, the length of which represents the time on study. Those subjects known to have died before they completed the two years of follow-up have solid lines, whereas those known to have survived two years or more have lighter dashed lines. The probability of surviving two years after randomization for these patients is approximately 25% so the duration of follow-up to a large degree reflects the time from randomization to death. The dots on the lines represent the occurrence of radiation episodes, and for graphical presentation, multiple episodes recorded as occurring on the same day are represented with adjacent dots. The plot suggests that there is variation in need for radiation therapy, and the fact that many patients die without requiring radiation therapy illustrates the competing risk phenomenon.

Figure 2 contains a naive estimate of the proportion of control patients who have experienced at least one episode of radiation treatment based on the Kaplan–Meier function obtained from the estimated cause specific hazard. This estimate ignores the
Fig. 1. Profile of events for control patients in Hortobagyi et al. (1998).
Fig. 2. Graphical plots of the time to the first episode of radiation therapy.
Fig. 3. Graphical plots of naive and marginal mean functions for the number of episodes of radiation therapy.
fact that patients dying before a need for radiation will not subsequently experience the need for radiation therapy, since subjects who die at time t are treated in the same way as subjects who are censored at the same time. Also plotted in Figure 2 are the estimated cumulative incidence functions (1.3) based on the time to the first bout of radiation therapy for patients receiving placebo and pamidronate therapy. The estimate of the proportion of control patients requiring at least one bout of radiation therapy at 24 months is substantially lower than the incorrect estimate based on the Kaplan–Meier function. It is also apparent that treatment with pamidronate incurs a reduction in the need for radiation therapy. Despite the fact that the Kaplan–Meier estimate is uninterpretable in the presence of the competing risk for death, the usual log rank test for the effect of pamidronate on the cause specific hazard for the first bout of radiation therapy is valid and demonstrates a strong benefit to treatment (p = 0.00001). The test for the difference in the cumulative incidence functions based on the log-rank (unweighted) version of Gray's (1988) test statistic also provides strong evidence of benefit (p = 0.00031).

Figure 3 contains analogous estimates for the cumulative mean functions. Specifically, the top estimate is the Nelson–Aalen estimate of the mean function for placebo treated patients. Again, it is based on the assumption that subjects who die remain at risk for bone pain and consequent bouts of radiation therapy. A valid estimate for the marginal expected number of bouts of radiation therapy is also provided and demonstrates how greatly one can overestimate the number of events experienced per patient over time by ignoring mortality. Naive use of the test based on (2.2) with a robust variance estimate gives p = 0.00016. The test based on the Ghosh and Lin (2000) statistic (3.6) gives p = 0.00125.
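To make concrete the kind of marginal mean curve shown in Figure 3, the following sketch (an illustration under assumed data structures, not code used for the article) computes the estimate (3.5) by combining the Nelson–Aalen increments for the recurrent events with a Kaplan–Meier step for death.

```python
import numpy as np

def marginal_mean_function(event_lists, x, death):
    """Estimate (3.5) of the marginal mean number of events mu(t),
    accounting for the terminal event (death).

    event_lists : recurrent event times for each subject (observed before X_i)
    x           : follow-up times X_i = min(D_i, C_i)
    death       : 1 if the subject was observed to die at X_i
    """
    x, death = np.asarray(x, float), np.asarray(death, int)
    rec_times = np.array([t for ev in event_lists for t in ev])
    grid = np.unique(np.concatenate([rec_times, x[death == 1]]))

    surv, mu, out = 1.0, 0.0, []
    for tk in grid:
        y = np.sum(x >= tk)                       # subjects under observation at tk
        d_death = np.sum((x == tk) & (death == 1))
        surv *= 1.0 - d_death / y                 # Kaplan-Meier step for S_hat^D(tk)
        dn = np.sum(rec_times == tk)              # recurrent events at tk
        mu += (dn / y) * surv                     # r_hat(tk, tk) * S_hat^D(tk)
        out.append((tk, mu))
    return out

# toy data (assumed): two subjects with events, one dying during follow-up
events = [[2.0, 5.0], [3.0], []]
X, delta_D = [6.0, 4.0, 8.0], [1, 0, 1]
print(marginal_mean_function(events, X, delta_D))
```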
5. Discussion

The analysis of recurrent events poses a number of modeling challenges. We have considered issues pertaining to the analysis of recurrent events in the presence of high mortality. Treatment comparisons are particularly challenging in such settings and there is considerable debate about the most appropriate basis for making treatment comparisons. Marginal methods such as those based on (3.6) are attractive when interest lies in health resource utilization but they may not represent the most natural way of assessing the benefits of treatment to individual patients. At the very least, complementary analyses directed at examining treatment effects on survival are advisable to ensure that a complete impression of the effect of treatment is obtained. Related methodologic issues arise in health economics (Lin et al., 1997) and quality of life (Zhao and Tsiatis, 1997). Cox (1999) discusses a relatively tractable normal theory approach for modeling stochastic processes conditional on the time of a dependent terminal event and highlights connections with problems in other areas.
Acknowledgements

This research was supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institute for Health Research (CIHR). R.J. Cook is a CIHR Investigator. We thank Dr. John Seaman and Dr. Bee Chen for providing the data from the pamidronate trial and Ms. Ker-Ai Lee for programming assistance.
References

Abu-Libdeh, H., Turnbull, B.W., Clark, L.C. (1990). Analysis of multi-type recurrent events in longitudinal studies: Application to a skin cancer prevention trial. Biometrics 46, 1017–1034.
Albert, P.S. (1991). A two-state Markov mixture model for a time series of epileptic seizure counts. Biometrics 47, 1371–1381.
Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Byar, D., Kaihara, R., Sylvester, R., Freedman, L., Hannigan, J., Koiso, K., Oohashi, Y., Tsugawa, R. (1986). Statistical analysis techniques and sample size determination for clinical trials of treatments for bladder cancer. In: Developments in Bladder Cancer. Alan R. Liss, New York, pp. 49–64.
Cole, E.H., Cattron, D.C., Farewell, V.T., et al. (1994). A comparison of rabbit anti-thymocyte serum and OKT3 as prophylaxis against renal allograft rejection. Transplantation 57, 60–67.
Cook, R.J., Lawless, J.F., Nadeau, J.C. (1996). Robust tests for treatment comparisons based on recurrent event responses. Biometrics 52, 557–571.
Cook, R.J., Lawless, J.F. (1997a). Discussion of paper by Wei and Glidden. Statist. Medicine 16, 841–843.
Cook, R.J., Lawless, J.F. (1997b). Marginal analysis of recurrent events and a terminal event. Statist. Medicine 16, 911–924.
Cook, R.J., Lawless, J.F. (2002). Analysis of repeated events. Statist. Methods Medical Res. 11, 141–166.
Cox, D.R. (1999). Some remarks on failure-times, surrogate markers, degradation, wear, and the quality of life. Lifetime Data Anal. 5, 307–314.
Gail, M.H., Santner, T.J., Brown, C.C. (1980). An analysis of comparative carcinogenesis experiments based on multiple times to tumor. Biometrics 36, 255–266.
Ghosh, D., Lin, D.-Y. (2000). Nonparametric analysis of recurrent events and death. Biometrics 56, 554–562.
Gray, R.J. (1988). A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann. Statist. 16, 1141–1154.
Hobson, R.W., Weiss, D.G., Fields, W.S., Goldstone, J., Moore, W.S., Towne, J.B., Wright, C.B., The Veterans Affairs Cooperative Study Group (1993). Effect of carotid endarterectomy for asymptomatic carotid stenosis. New England J. Medicine 328, 221–227.
Hortobagyi, G.N., Theriault, R.L., Porter, L., Blayney, D., Lipton, A., Sinoff, C., Wheeler, H., Simeone, J.F., Seaman, J., Knight, R.D., Heffernan, M., Reitsma, D.J. (1996). Efficacy of pamidronate in reducing skeletal complications in patients with breast cancer and lytic bone metastases. New England J. Medicine 335, 1785–1791.
Hortobagyi, G.N., Theriault, R.L., Lipton, A., Porter, L., Blayney, D., Sinoff, C., Wheeler, H., Simeone, J.F., Seaman, J., Knight, R.D., Heffernan, M., Mellars, K., Reitsma, D.J. (1998). Long-term prevention of skeletal complications of metastatic breast cancer with pamidronate. J. Clin. Oncol. 16, 2038–2044.
Kalbfleisch, J.D., Prentice, R. (1980). The Statistical Analysis of Failure Time Data. Wiley, London.
Lawless, J.F. (1995). The analysis of recurrent events for multiple subjects. Appl. Statist. 44, 487–498.
Lawless, J.F., Nadeau, J.C. (1995). Nonparametric estimation of cumulative mean functions for recurrent events. Technometrics 37, 158–168.
Lin, D.Y., Feuer, E.J., Etzioni, R., Wax, Y. (1997). Estimating medical costs from incomplete follow-up data. Biometrics 53, 419–434.
Little, R.J.A. (1995). Modeling the drop-out mechanism in repeated measures studies. J. Amer. Statist. Assoc. 90, 1112–1121.
Oakes, D. (1997). Discussion of paper by Wei and Glidden. Statist. Medicine 16, 843.
OASIS Investigators (1997). Comparison of the effects of two doses of recombinant hirudin compared with heparin in patients with acute myocardial ischemia without ST segment elevation as pilot study. Circulation 96, 769–777.
Riggs, B.L., Seeman, E., Hodgson, S.F., Taves, D.R., O'Fallon, W.M., Muhs, J.M., et al. (1990). Effect of fluoride treatment on the fracture rate in post-menopausal women with osteoporosis. New England J. Medicine 322, 802–809.
Strawderman, R. (2000). Estimating the mean of an increasing stochastic process at a censored stopping time. J. Amer. Statist. Assoc. 95, 1192–1208.
Thall, P.F. (1988). Mixed Poisson likelihood regression models for longitudinal interval count data. Biometrics 44, 197–209.
Thall, P.F., Vail, S.C. (1990). Some covariance models for longitudinal count data with overdispersion. Biometrics 46, 657–671.
Therneau, T., Hamilton, S. (1997). rhDNase as an example of recurrent event analysis. Statist. Medicine 16, 2029–2047.
Wei, L.J., Lin, D.Y., Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J. Amer. Statist. Assoc. 84, 1065–1073.
Wei, L.J., Glidden, D.V. (1997). An overview of statistical methods for multiple failure time data in clinical trials. Statist. Medicine 16, 833–839.
Zhao, H., Tsiatis, A.A. (1997). A consistent estimator for the distribution of quality-adjusted survival time. Biometrika 84, 339–348.
20
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23020-0
Categorical Auxiliary Data in the Discrete Time Proportional Hazards Model
Peter Slasor and Nan Laird
1. Introduction

In this paper we compare the maximum likelihood estimator for the proportional hazards covariate effect in a (standard) discrete time proportional hazards model (Prentice and Gloeckler, 1978) with the estimator obtained in a larger (joint) model which incorporates categorical auxiliary data, Y. Right censorship of survival T results in a loss of information for estimation of the covariate effect. If auxiliary data Y is available, joint models for T and Y can recover some of the information lost to censoring. We use a mixture model where the joint distribution for T and Y is factored: f(T, Y | Z) = f(T | Z)f(Y | T, Z). The distribution f(T | Z) corresponds to the discrete time proportional hazards model and f(Y | T, Z) is the distribution for a general multinomial model. The asymptotic relative efficiency (ARE) for estimating the covariate effect is available through comparison of the Fisher information matrices for the standard and joint models. We examine ARE under various distributional forms for survival, censoring, and the auxiliary data. For the standard model the maximum likelihood estimator is consistent provided that (T | Z) is missing at random (censoring is noninformative). The condition for the missing data mechanism in the joint model is weaker: (T, Y | Z) is missing at random (Murray and Tsiatis, 1996). In this paper we focus on efficiency gained through joint modeling.

Many authors have investigated the gains in efficiency through joint modeling. Cox (1983) compares a parametric (exponential) proportional hazards model for (T | Z) to a joint model where the auxiliary data Y is the ratio of the (unobserved) remaining lifetime and an independent gamma random variable (parameters unknown). The proportion of the information lost to censoring that is gained through joint modeling ranges from 0 to 1, as the variance of the gamma (noise) ranges from infinite to zero. Slasor and Laird (2003) consider a piece-wise exponential proportional hazards model for (T | Z) where Z is binary. This model is expanded to a joint model (T, Y | Z) where Y is repeated categorical data, predictive of survival. In a simulation, an efficiency gain of 6.4% is demonstrated. They also use this joint model in the analysis of clinical trial data, achieving a 10.2% reduction in the bootstrap estimate of variance for the covariate
effect. Fleming et al. (1994) demonstrate a 14.3% reduction in the bootstrap estimate of variance for the covariate effect, when using cancer recurrence to predict death in a joint (continuous time) proportional hazards model. We focus on the joint discrete time proportional hazards model and investigate the extent of efficiency gains for various scenarios of censoring, failure time and auxiliary data Y. The discrete time model is a natural choice when survival data is grouped. This occurs, for example, when survival status is noted at a sequence of scheduled followup visits. When survival time is grouped, but not coarsely grouped, the discrete time models may be reasonable approximations to continuous time models. The number of parameters in the joint model is large but does not depend on sample size, so that the ARE is a valid means of comparing the joint and standard models. Section 2 briefly outlines the standard and joint models. The ARE is computed as the amount of censoring and probability law for (Y | T , Z) is varied. In this investigation of ARE, survival parameters and the shape for censoring are fixed, as specified in Section 3. Section 4 presents efficiency gains for repeated categorical measurements. This includes the case of baseline binary and polychotomous measurement, and the case of two binary measurements. Recurrence time as auxiliary data is investigated in Section 5. Alternative specifications for the censoring shape and survival distribution are discussed in Section 6.
2. The standard and joint discrete-time proportional hazards models

In a follow-up study for cancer recurrence, patients are examined for new tumor growth at scheduled monthly visits. If growth is detected then failure is recorded. Patients who drop out of the study are recorded as censored at the time that they were last seen. This is an example of a study where survival time is grouped into discrete categories. Survival law and covariate effects can be estimated using the method of maximum likelihood for a discrete time proportional hazards model (Prentice and Gloeckler, 1978). Right censoring of survival time results in a loss of information for estimation of parameters. In some cases additional (auxiliary) data, Y, collected at the clinic visits can be predictive of survival so that incorporation of Y into a joint model may increase the precision of the parameter estimation. The joint model is formed as a mixture model: f(T, Y | Z) = f(T | Z)f(Y | T, Z), where f(T | Z) is the standard proportional hazards model and f(Y | T, Z) is a nonparametric model. We consider auxiliary data Y, obtained through discretization of repeated continuous measurements (e.g., blood chemistry). We also consider recurrent type events as auxiliary data. For example, Y might be the time that a test for preclinical cancer first shows positive. If a positive test predicts tumor growth, then increased precision of parameter estimates should be possible by using a joint model which incorporates test results, Y.
2.1. Likelihood function

Failure time T is discrete, taking values in the set T = {t1, . . . , tv}. Right censorship occurs through a censoring random variable C, and the observed follow-up time is X ≡ min{T, C}. Observed failures are enumerated through the indicator variable ξ = I(T ≤ C). The covariate Z is vector valued and discrete with values in the set Z = {z1, . . . , zs}. The standard model is a product multinomial over the covariate strata, with proportional hazards constraint on the cell probabilities. With auxiliary data Y ∈ Y, the standard model is expanded to a joint multinomial model on Z × T × Y.

The auxiliary data Y = (Y1, . . . , Yr) is collected at scheduled times (u1, . . . , ur). The times of measurement are usually a subset of the potential failure times, T = {t1, . . . , tv}. We assume that measurements on Y are complete up to the time of follow-up X, and not available after X. If Y is a recurrent type event, it can be represented as repeated binary measures which jump from 0 to 1 at the time of recurrence. Alternatively Y could be represented as a scalar random variable that gives the time of recurrence. Note that in this representation, Y ∈ T and Y is right censored by the follow-up time X.

The multinomial joint model is incomplete since T is right censored and components of Y are missing beyond follow-up X. In maximizing the multinomial likelihood, the assumption that (T, Y | Z) is missing at random (MAR, Little and Rubin) allows us to ignore the mechanism for the missing data. The sufficient statistics for the joint model are the cell counts:

Njky = Σ_{i=1}^n I(Zi = zj) I(Xi = tk) I(Yi = y) ξi,

Mjky = Σ_{i=1}^n I(Zi = zj) I(Xi = tk) I(Yi = y) (1 − ξi).

The parameters are defined:

py|jk ≡ P(Y = y | T > tk, Z = zj),
Sk ≡ P(T > tk | Z = 0),
Sjk ≡ P(T > tk | Z = zj) = Sk^(exp(βzj)),
Sjky ≡ P(Y = y, T > tk | Z = zj) = py|jk Sjk.

Let S denote the baseline survival parameters, β the parameter for covariate effect, and p the parameters for f(Y | T, Z). The likelihood and log-likelihood are:

L(S, β, p) = ∏_{j=1}^s ∏_{k=1}^v ∏_y [Sj,k−1,y − Sjky]^(Njky) Sjky^(Mjky),

l(S, β, p) = Σ_{j=1}^s Σ_{k=1}^v Σ_y { Njky log[py|jk−1 Sk−1^(exp(βzj)) − py|jk Sk^(exp(βzj))] + Mjky [log(py|jk) + log Sk^(exp(βzj))] }.
The notation for the likelihood is general, covering both types of auxiliary data. The standard model is the special case where Y is degenerate: the distribution for (Y | T, Z) does not depend on T and has all mass on a single point. The log-likelihood for the standard model is:

l(S, β) = Σ_{j=1}^s Σ_{k=1}^v { Njk log[Sk−1^(exp(βzj)) − Sk^(exp(βzj))] + Mjk log Sk^(exp(βzj)) }.
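As an illustrative sketch (not from the original chapter; the function and variable names are assumptions), the standard-model log-likelihood can be evaluated from the cell counts Njk and Mjk as follows; it assumes Sk > 0 wherever Mjk > 0.

```python
import numpy as np

def standard_loglik(S, beta, z, N, M):
    """Log-likelihood of the standard discrete-time proportional hazards model.

    S    : baseline survival probabilities (S_1, ..., S_v)
    beta : proportional hazards covariate effect
    z    : covariate value for each stratum (z_1, ..., z_s)
    N    : N[j, k], number of observed failures at t_k in stratum j
    M    : M[j, k], number of censorings at t_k in stratum j
    """
    S = np.asarray(S, dtype=float)
    S_prev = np.concatenate([[1.0], S[:-1]])   # S_0 = 1
    ll = 0.0
    for j, zj in enumerate(z):
        e = np.exp(beta * zj)
        pmf = S_prev ** e - S ** e             # P(T = t_k | Z = z_j)
        ll += np.sum(N[j] * np.log(pmf) + M[j] * e * np.log(S))
    return ll
```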
2.2. Maximum likelihood estimation

The expected values of the sufficient statistics and the first and second derivatives of the log-likelihood are easy to compute, but tedious (Appendix A). The ML estimates can be obtained by Newton method or EM algorithm (Dempster et al., 1977). The dimension (number of parameters) of the optimization problem can be very large. For example, when survival is distributed on v time points and a recurrent event time is used as auxiliary data, the number of parameters is v² + 2v − 2. When there are 12 time points, the dimension of the optimization is 166. The maximization by EM algorithm is tractable since the expected complete data likelihood separates into two components corresponding to the models for (T | Z) and (Y | T, Z). In the M-step of the algorithm, optimization of the component corresponding to the model for (Y | T, Z) has a closed form solution which is similar to the empirical estimator. Optimization of the component corresponding to the model for (T | Z) uses the Newton method, but the dimension of the optimization is relatively small (12 in the above example).

We can assess the asymptotic relative efficiency (ARE) of the joint model ML estimate relative to the standard model ML estimate, by computing the Fisher Information for both models. Let

A = [A11  A12; A12'  A22],    B = [B11  B12; B12'  B22]

denote the partitioned Fisher Information for the standard and joint models, respectively. The second components of the matrices correspond to the nuisance parameters and the first component to a particular parameter of interest. For example, with two group studies, the parameter of interest might be β in the proportional hazards model. Let the partitioned inverses of the Fisher Information be denoted:

A^(−1) = [A11*  A12*; A12*'  A22*],    B^(−1) = [B11*  B12*; B12*'  B22*].

The asymptotic variances for the parameter of interest in the standard and joint models are A11* and B11*. The ARE is B11*^(−1/2) A11* B11*^(−1/2). An upper bound on the ARE can be obtained by assuming that the distribution for the auxiliary data, f(Y | T, Z), is known. Let C denote the matrix obtained by evaluating the joint model information, B, at the known parameters, f(Y | T, Z), and then deleting the rows and columns corresponding to these parameters. The upper bound for the ARE is computed as C11*^(−1/2) A11* C11*^(−1/2). The calculation is useful since it gives an upper bound for all joint proportional hazards models, (T, Y | Z), involving the categorical variable Y.
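For a scalar parameter of interest the ARE reduces to the ratio of the (1,1) entries of the inverse information matrices. The following short sketch (an assumed illustration, with made-up matrices) computes it directly.

```python
import numpy as np

def are_first_parameter(A, B):
    """ARE of the joint-model estimator relative to the standard-model estimator
    for the first (scalar) parameter, given the two Fisher information matrices."""
    a11_star = np.linalg.inv(A)[0, 0]   # asymptotic variance, standard model
    b11_star = np.linalg.inv(B)[0, 0]   # asymptotic variance, joint model
    return a11_star / b11_star          # B11*^(-1/2) A11* B11*^(-1/2) for a scalar

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # hypothetical information, standard model
B = np.array([[5.0, 1.0], [1.0, 3.0]])  # hypothetical information, joint model
print(are_first_parameter(A, B))        # > 1 when the joint model is more efficient
```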
3. Specification of the survival model and censoring

Our interest is in delineating scenarios for joint models which yield substantial gains in efficiency for estimation of a proportional hazards treatment effect. Efficiency gains will depend on the amount of censoring and the probability law for the auxiliary data f(Y | T, Z). Efficiency gains also depend on the shapes of the distribution for survival and censoring. In Section 4, we present joint models where Y is a discretization of continuous random variables. Section 5 presents the case where Y is a recurrence type event, predictive of failure. For both of these investigations we fix the parameters for the discrete time proportional hazards model, and the censoring mechanism, as follows. The survival (standard) model, f(T | Z), is for a two group study with proportional hazards effect β = log(2), and survival distributed on the twelve time points {1, 2, . . . , 12}. Censoring does not depend on group, Z. The shapes for baseline survival and censoring are from the family

S(t; c) = (1 − t/12)^c.

The most favorable results for the joint model occur when the distribution for survival is concentrated at late time points and the distribution for censoring is concentrated at early time points. We investigate late survival, S(t; c = 0.5), and early censoring shape, S(t; c = 3), in detail. We comment on later censoring, S(t; c = 2), S(t; c = 1), and earlier failure, S(t; c = 1), in Section 6. Assessment of the performance of the joint models will be achieved by plotting ARE as the amount of censoring and predictiveness of Y are varied. The predictiveness of Y is specified in the next sections. The amount of censoring is parameterized as a mixture of early censoring and noncensoring:

P(C > t | Z) = u (1 − t/12)³ + (1 − u).

We refer to u as the censoring proportion. When u = 0 there is no censoring and when u = 1 all subjects are exposed to the early censoring distribution, P(C > t | Z) = (1 − t/12)³. At a censoring proportion of 0.8, and the early shape of censoring, the percentage of observations censored is 66.4% for group 0 and 56.5% for group 1.
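The censoring percentages quoted above can be checked by simulation; the sketch below is an assumed illustration (not code from the chapter) that draws discrete failure and censoring times from the stated shapes, using the convention ξ = I(T ≤ C) from Section 2.1 so that a subject is censored only when C < T.

```python
import numpy as np

rng = np.random.default_rng(1)

def disc_survivor(c, t):
    """S(t; c) = (1 - t/12)^c evaluated on the grid t = 0, 1, ..., 12."""
    return (1.0 - t / 12.0) ** c

def sample_discrete(surv_grid, size):
    """Draw discrete times on {1, ..., 12} given survivor values surv_grid[t]."""
    pmf = surv_grid[:-1] - surv_grid[1:]            # P(T = t_k) = S(t_{k-1}) - S(t_k)
    return rng.choice(np.arange(1, 13), size=size, p=pmf / pmf.sum())

beta, u = np.log(2.0), 0.8                          # treatment effect and censoring proportion
t_grid = np.arange(13)
S0 = disc_survivor(0.5, t_grid)                     # late-failure baseline shape
S1 = S0 ** np.exp(beta)                             # group 1 under proportional hazards
SC = u * disc_survivor(3.0, t_grid) + (1 - u)       # mixture of early censoring and none
SC[-1] = 0.0                                        # follow-up ends by t = 12

n = 100000
T0, T1 = sample_discrete(S0, n), sample_discrete(S1, n)
C = sample_discrete(SC, n)
print("censored, group 0:", np.mean(C < T0))        # approximately 0.66, as quoted
print("censored, group 1:", np.mean(C < T1))        # approximately 0.57
```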
4. Discretizing continuous auxiliary data

We consider auxiliary data, Y, obtained through discretization of continuous (logistic) repeated measurements. One would expect that later measurements on Y tend to be more predictive of survival than are baseline measurements. However, the efficiency gain through including later measurements on Y within a joint model may be quite small if there are few subjects who are censored late enough for the measurement on Y to be observed. In addition, the loss of information on survival due to late censoring is much less than the information lost to early censoring. Efficiency gains also depend on the number of categories for Y. In this section we consider a single binary measurement at
baseline, a polychotomous measurement at baseline, and the case of two binary repeated measurements.

4.1. A binary measurement at baseline

Let V be an underlying random variable with logistic density whose location depends on event time T, but not group Z. The density function for V is

f(v | T, Z) = e^(v − α(T − 6.5)) / [1 + e^(v − α(T − 6.5))]².

The parameter α quantifies the strength of association between V and T, with α = 0 corresponding to independence and large values of α corresponding to strong dependence. The auxiliary data Y is obtained as a dichotomization of V at the value 0: Y = I(V ≥ 0). The parameters for the component (Y | T, Z) of the joint model are p1|TZ ≡ P(Y = 1 | T, Z):

p1|TZ = e^(α(T − 6.5)) / [1 + e^(α(T − 6.5))].

These parameters are given in Table 1 for α = 0.35 and α = 0.50. The strength of predictiveness of Y is varied according to the parameter α and the amount of censoring varied with the parameter u of the distribution for censoring. Figure 1 gives a plot of ARE over a full range of censorship and predictiveness for Y.
Table 1
P(Y = 1 | T = t) for two predictive settings of Y

α       t = 1   2      3      4      5      6      7      8      9      10     11     12
0.35    0.13    0.17   0.23   0.29   0.37   0.46   0.54   0.63   0.71   0.77   0.83   0.87
0.50    0.06    0.10   0.15   0.22   0.32   0.44   0.56   0.68   0.78   0.85   0.90   0.94
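The entries of Table 1 follow directly from the logistic form of p1|TZ given above; the following two-line check (an illustrative snippet, not part of the original chapter) reproduces both rows.

```python
import numpy as np

t = np.arange(1, 13)
for alpha in (0.35, 0.50):
    p = 1.0 / (1.0 + np.exp(-alpha * (t - 6.5)))   # P(Y = 1 | T = t)
    print(alpha, np.round(p, 2))                   # matches the rows of Table 1
```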
Fig. 1. ARE joint and standard models.
Fig. 2. Information gained using binary Y (α = 0.35).
From the figure it is seen that ARE always exceeds one, and equals one when there is no censoring (u = 0) or when Y is not predictive (α = 0). ARE increases as α and u increase from zero, and attains a maximal value between 1.11 and 1.12 in Figure 1. Figures 2(b) and 3(b) present slices of Figure 1. In Figure 2, the predictiveness of Y is set at α = 0.35 and the censoring proportion is allowed to vary. The upper curve of Figure 2(a) gives an upper bound for the information, obtained by assuming that the distribution for the auxiliary data is known. This is a bound for all joint models. For example, the joint model information under a logistic regression parameterization of f(Y | T, Z) will exceed the information under the general multinomial parameterization that we consider in this paper. However, the upper curve of Figure 2(a) bounds the information for both models. The lower curve is the information for the standard model. From Figure 2(a) we see that when censoring is light, most of the potential information gain is captured by the joint model. When censoring is heavy, there are few observed failures, so that estimation of f(Y | T, Z) is less stable. Information gains fall substantially short of the upper bound when censoring is heavy.
Fig. 3. Information gained using binary Y (80% censoring proportion).
The ARE is given in Figure 2(b). The ARE peaks at a value of 1.067 near a censoring proportion of 0.9. A censoring proportion of 1.0 means that all subjects are exposed to the early censoring mechanism. At this heavy amount of censoring, it is extremely rare for a subject to be observed on study beyond time 9. There is little data for estimating tail survival probabilities and the parameters f(Y | T, Z) for T > 9, so that ARE declines for heavy censoring. For the remainder of the paper we assume that the censoring proportion is 0.8. We fix censoring at this level to ensure that there is information for estimating parameters in the tail of survival. We also choose the proportion 0.8 since results on ARE are favorable. We comment on lesser amounts of censoring in Section 6. Figure 3 gives the information as the censoring proportion is fixed at 0.8 and the predictiveness of Y, (α), is allowed to vary. In Figure 3(a) it can be seen that the joint model captures approximately 40% of the potential gain in information, for α ∈ [0, 0.5]. Figure 3(b) plots the ARE and demonstrates that ARE increases with α.

4.2. Polychotomous measurement at baseline

The auxiliary data Y was formed by categorizing a continuous random variable V at the median of the standard logistic distribution. Figure 4 plots ARE for 2, 4, 8, and 16 category
Fig. 4. ARE for polychotomous joint models.
joint models, formed by splitting the continuous random variable V at the median, quartiles, octiles, and 16-tiles of the standard logistic distribution. The four models are nested, and the larger category models result in greater ARE. The ARE also increases as the parameter α increases. From the figure we see that for estimation of a proportional hazards parameter β = log(2), the information in a binary joint model derived from continuous V with α = 0.50 is equal to the information of a 4 category joint model where the underlying V has the lower predictivity α = 0.41. The 4 category model yields an additional 2.5–4.0% as α ranges from 0.35 to 0.50. Only small additional gains in efficiency are achieved for discretization beyond 4 categories.

We have focused on the range of predictivity, α ∈ [0, 0.5], since this might correspond well to many real data applications. It is interesting to consider what potential gains in efficiency occur in more extreme settings. For very large values of α (α → ∞), the polychotomous Y is equivalent to binary Y. Figure 5(a) is a plot of ARE as a function of α, for binary Y. It reproduces Figure 3(b), but on a larger range for the predictive parameter α. The upper bound for ARE is given in the plot. In the limit (α → ∞) the ARE is equal to its upper bound, which approaches the value 1.40. At this extreme strength of predictiveness, P(Y = 1 | T, Z) equals 1 or 0 according to whether failure is late, T ∈ {7, 8, 9, 10, 11, 12}, or early, T ∈ {1, 2, 3, 4, 5, 6}. At best (α → ∞), the categorical auxiliary data is a perfect predictor of whether failure occurs early or late. Greater gains can be realized using discrete auxiliary data which is more directly related to survival. Following Cox (1983), we modeled Y as a mixture of failure time T and the uniform
372
P. Slasor and N. Laird
Fig. 5. Information gained, extreme cases.
$$P(Y = y \mid Z, T = t) = \begin{cases} 1 - \tfrac{11}{12}\,\varepsilon, & y = t,\\[2pt] \tfrac{\varepsilon}{12}, & y \neq t.\end{cases}$$
Figure 5(b) plots the information for ε ∈ [0, 1]. As ε → 0, all of the information that was lost to censoring is recovered by the joint model. The ARE when ε = 0 is about 2.4.

Fig. 5. Information gained, extreme cases.
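To make the mixture concrete, the probability mass function above can be tabulated directly. The short numpy sketch below (the function name and the ε values chosen are ours, purely for illustration) builds the 12 × 12 matrix of P(Y = y | T = t) and checks that each row is a proper distribution; ε = 0 gives a perfectly predictive Y and ε = 1 gives uninformative uniform noise.

import numpy as np

# Cox (1983)-style mixture: Y equals T with probability 1 - (11/12)*eps,
# and is uniform on the 12 possible failure times otherwise.
def mixture_pmf(eps, n_times=12):
    pmf = np.full((n_times, n_times), eps / n_times)              # off-diagonal mass eps/12
    np.fill_diagonal(pmf, 1.0 - (n_times - 1) * eps / n_times)    # mass 1 - 11*eps/12 at y = t
    return pmf                                                    # rows index t, columns index y

for eps in (0.0, 0.5, 1.0):
    p = mixture_pmf(eps)
    assert np.allclose(p.sum(axis=1), 1.0)        # each row sums to one
    print(eps, p[3, 3], p[3, 0])                  # P(Y=4 | T=4) and P(Y=1 | T=4)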
4.3. Two binary repeated measurements

Fig. 6. ARE using two binary repeated measurements.

Expanding the number of categories for a baseline measurement yielded increases in ARE. Additional gains can be achieved by including follow-up measurements in the joint model. We consider the case of two binary repeated measurements: one measured at baseline (T = 1) and the second measured at the same time or later (T ≥ 1). Follow-up measurements tend to be more predictive of survival than baseline measurements. If the second measurement is made late in follow-up, then it is likely to be more predictive than an early measurement. However, late measurements contribute information only for late censorships. There are few late censorships, and the information lost to a single subject being censored late is much less than the information lost if the subject were censored earlier. There is therefore a trade-off between the frequently observed early follow-up measurement and the less frequent but more strongly predictive measurements made at later times. We assume that the baseline measurement is binary, with predictiveness α = 0.35, and that the second measurement is independent of the baseline measurement. Figure 6 is a plot of ARE as the predictiveness of the second measurement ranges from α = 0 to α = 0.70. The lower curve (horizontal line) gives the ARE when only the first measurement is available. The curves lying above this line correspond to the ARE when a second measurement is collected at times 1, 2, . . . , 12. The upper curve gives the ARE when the second measurement is taken at time 1 (simultaneous with the baseline measurement). The second curve from the top gives the ARE when the second measurement is taken at time 2. The vertical difference between the two curves is the information lost due to censorship occurring prior to the second measurement. The succession of curves corresponds to later times of the second measurement. If the second measurement is made at time 11 or 12, there is no increment in information beyond the joint model that includes only the baseline measurement. This figure gives insight into the trade-off between early measurement and the less available but more predictive later measurements. For example, a strongly predictive (α = 0.70) measurement made at time 3 yields the same increment in information as a measurement at time 2 with predictiveness α = 0.46.
There is little additional gain in collecting a second measurement at time 5 or later. If a second measurement with predictiveness α = 0.50 can be collected as early as time T = 2, then additional gains of 7–8% can be achieved.
5. Joint models: Recurrent events predicting survival

In the setting where the auxiliary data Y is obtained as a discretization of an underlying continuous random variable V, the largest efficiency gains occur when V is highly predictive of survival T (α is large). In the limit α → ∞, Y is perfectly predictive of whether or not T falls in the set {1, 2, 3, 4, 5, 6} (early survival). Large gains should be possible for auxiliary data where the distribution f(T | Y, Z) is concentrated near a single point. For example, if cancer recurrence is typically followed by a rapid decline and death, then inclusion of the time of cancer recurrence as auxiliary data in a joint model should enable large efficiency gains.

In this section the auxiliary data Y is the time of a recurrent event associated with survival. The distribution for (Y | T, Z) does not depend on Z and has support on the set {1, 2, . . . , 12}. Typically, subjects will fail shortly after experiencing recurrence, or they will survive to the end, {T = 12}, without experiencing recurrence. Recurrence time Y is right censored by follow-up time X. Referring to the favorable survival outcome {T = 12} as "cured", we note that subjects are observed to fall into one of four states:
{Cured, No Recurrence} = {T = 12, Y = 12},
{Cured, Recurrence} = {T = 12, Y < 12},
{Fail, No Recurrence} = {T < 12, Y > T},
{Fail, Recurrence} = {T < 12, Y ≤ T}.

The distribution f(Y | T, Z) is specified in two stages. First we give the distribution for recurrence status, and then the distribution for time of recurrence given recurrence status. The strength of association between failure status and recurrence status is parameterized by
a++ ≡ P(Y ≤ T | T < 12),  a−− ≡ P(Y = 12 | T = 12).
If recurrence is considered as a diagnostic test for the eventual outcome {Fail/Cure}, then a++ and a−− are the sensitivity and specificity of the test. We will investigate efficiency gains as the parameters a++ and a−− are varied. To complete the specification of f(Y | T) we require that the probability of recurrence given failure does not depend on the time of failure. In addition, for subjects who do recur, the distribution for (Y | T) has a mode at time T or earlier, and the probability mass drops off from the mode at a geometric rate. When the geometric parameter is set to 1, the distribution for (Y | T) is uniform; at the opposite extreme, when the geometric parameter is 0, the distribution for (Y | T) falls on a single point (Y is perfectly predictive).
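As a rough illustration of this two-stage construction, the sketch below builds one plausible version of P(Y = y | T = t) on the grid {1, . . . , 12}: recurrence status is assigned with sensitivity a++ and specificity a−−, and, given recurrence, mass decays geometrically as y moves away from a mode placed lag periods before failure. The chapter does not spell out the normalization of the geometric decay or how the residual "no recurrence" mass is spread, so those choices (and the function name) are our own assumptions; this is a sketch of the idea, not the authors' exact specification.

import numpy as np

def recurrence_pmf(t, a_pp=0.9, a_mm=0.9, geom=0.7, lag=0, n_times=12):
    # One plausible P(Y = y | T = t), y = 1..n_times (our assumptions, see text).
    # a_pp: P(Y <= T | T < 12); a_mm: P(Y = 12 | T = 12); geom: decay from the mode;
    # lag: mode of T - Y, i.e., recurrence tends to precede failure by lag periods.
    y = np.arange(1, n_times + 1)
    pmf = np.zeros(n_times)
    if t == n_times:                              # "cured" subject
        pmf[-1] = a_mm                            # no recurrence: Y = 12
        pmf[:-1] = (1.0 - a_mm) / (n_times - 1)   # assumption: uniform over earlier times
    else:                                         # failure at time t
        mode = max(1, t - lag)
        w = np.where(y <= t, geom ** np.abs(y - mode), 0.0)   # geometric decay from the mode
        pmf[:t] = a_pp * w[:t] / w[:t].sum()                  # recurrence at or before failure
        pmf[t:] = (1.0 - a_pp) / (n_times - t)                # assumption: "no recurrence" mass over y > t
    return pmf

print(np.round(recurrence_pmf(t=8, geom=0.4, lag=4), 3))   # sharply peaked near y = 4
print(np.round(recurrence_pmf(t=12), 3))                    # most mass at y = 12 ("cured")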
Fig. 7. Predictiveness of Y as the distribution for (T − Y | T) becomes more sharply peaked.

The predictiveness of the auxiliary data is demonstrated in Figure 7(a) for treatment Z = 0, where a++ = a−− = 0.9 and the geometric parameter is set to 1. We compare the survival distribution for a subject who has survived to time T = 4 and has experienced recurrence with the distribution for another subject who has also survived to T = 4 but without experiencing recurrence. The recurrence status is strongly predictive of the status {Fail/Cure} but is not strongly predictive of the time of failure: recurrence does not signal rapid decline and failure. In fact, the distributions f(T | T > 4, Y = 4, Z = 0) and f(T | T > 4, Y = 3, Z = 0) are very similar. Figure 7(b)–(e) displays the predictiveness of the auxiliary data when the mode for (Y | T) is 4 time units prior to failure and the geometric parameter is 0.8, 0.7, 0.6 or 0.4.
The smaller the geometric parameter, the more sharply peaked is the distribution f(T | y, T > y, Z) for subjects who have recurred. The distribution f(T | Y > t, T > t, Z) for subjects who have not recurred does not depend much on the geometric parameter. The distributions f(T | Y > 4, T > 4, Z) and f(T | Y > 3, T > 4, Z) (not shown) are similar. The largest gains in efficiency should occur when the distribution f(T | y, T > y, Z) is sharply peaked and when a large number of subjects are observed to have recurrence. In Figure 8 we plot ARE in a favorable setting where a++ = a−− = 0.9 and 80% of subjects are subject to early censoring. We allow the mode of (T − Y | T) to range over the set {0, 2, 4, 6}, and the sharpness of the peaking at the mode (the geometric parameter) ranges from 0.4 to 1.0. When the mode of (T − Y | T) is 0, recurrence and failure tend to occur together; when the mode is 2, failure tends to occur two time points after the time of recurrence. When recurrence occurs near the time of failure (e.g., mode = 0), efficiency gains decline as the geometric parameter decreases (Y becomes more predictive of survival). This is because few of the censored patients are observed to have recurrence: knowledge of recurrence status is what is important, and the uniform distribution for (Y | T) is favorable when the mode is 0. When recurrence and failure have a greater lag time (e.g., mode > 0), the ARE increases as the geometric parameter decreases (Y becomes more predictive of survival). The largest gains occur when the distribution for (Y | T) peaks sharply and when the lag time between T and Y is large. When the mode for (T − Y | T) is 4 or 6 and the geometric parameter is 0.4, the ARE is near 1.3. This is a case where Y is very highly predictive of survival (Figure 7(e)). At more modest levels, where the geometric parameter ranges from 0.6 to 0.8, the ARE ranges from 1.14 to 1.22. These gains exceed the gains that were achieved through discretizing continuous auxiliary data.

Fig. 8. ARE for recurrent Y.
6. Other scenarios for censoring and survival

Figure 9 plots ARE under four scenarios for the baseline survival distribution and the censoring shape. Figure 9(a) gives results for the joint model using recurrent Y, where a++ = a−− = 0.9, the mode of (T − Y | T) is 4, and the geometric parameter governing the sharpness of peaking of the distribution for (Y | T) is 0.7. Figure 9(b) gives results for the joint model using binary Y, where the level of predictiveness is set at α = 0.5. The upper curve in each plot corresponds to the scenario of late failure and early censoring; the results reported in earlier sections assume this scenario. When failure is earlier (uniform) or censoring is later (uniform), the gains in efficiency through joint modeling are reduced; the lower three curves in each plot demonstrate this finding. The reduction in efficiency gains is a consequence of fewer subjects being censored. When failure is uniform, the ARE results for binary Y are quite similar to the results for recurrent Y. When survival is late, the ARE is increased, and the increase is greatest for models incorporating recurrent Y. The joint model with recurrent Y and late survival performs well since the distribution f(T | Y = t, T > t) is peaked (Figure 7(c)). In particular, the modes for f(T | T > 4, Y = 4) and f(T | T > 4, Y = 3) differ by one. Although we have parameterized (Y | T) as a geometric decay from a modal point, this does not ensure that the distribution for (T | Y) is peaked. In the scenario of earlier failure (uniform), the distribution f(T | T > t, Y = t) is not peaked and the distributions f(T | T > 4, Y = 4) and f(T | T > 4, Y = 3) are similar. In this scenario, recurrence status is strongly predictive of survival time, but recurrence time is weakly predictive of survival time. The performance of recurrent Y is then much like that of binary Y. In this paper we have also focused on a proportion of 80% of the population being subject to censoring. For the lighter censoring proportion of 50%, there are still gains of 5–10%, provided that censoring is heavily concentrated at early times.

Fig. 9. ARE for alternate specifications.
7. Discussion

A single binary measurement Y, obtained by dichotomizing a continuous random variable V that is predictive of survival, can yield asymptotic efficiency gains near 10% when the degree of predictiveness is quite strong (α = 0.5, Table 1). An additional 4% gain can be achieved if V is split more finely into four categories (Figure 4) or if a second independent and equally predictive measurement is obtained at time 3 or earlier (Figure 6). In our models, expansion of the number of categories beyond four and collection of measurements beyond time three yield only a very small increment in efficiency. Murray and Tsiatis (1996) demonstrated the use of a joint model (the weighted Kaplan–Meier estimator) for estimation of survival probabilities in the one-sample problem. They also found that most of the efficiency gains were achieved with 3 or 4 categories. When the auxiliary data is the time of a recurrent event predictive of survival, gains in efficiency depend on the sharpness of the peak of the distribution (T | Y) and on the lag time between recurrence and failure (Figure 8). This occurs in the situation where a recurrent event causes rapid decline and then failure. Peaking of (T | Y) was governed by the choice of the parameter for geometric decay in the distribution for (Y | T). In Figure 8 gains as large as 30% are seen to be achieved when the geometric parameter is 0.4. For a more modest setting of 0.7 for the geometric parameter, gains can be as large as 20%. For both types of auxiliary data, the largest gains in efficiency occur for studies where failure is late and censoring is early and substantial. With a lighter amount of censoring (50% subject to censoring), modest gains of 5–10% are possible in our set-up. When baseline survival is early (uniform) and censoring is later (uniform), gains are only 2–3% at a 50% censoring proportion.
In continuous time joint proportional hazards models, Slasor and Laird (2003) demonstrate a 10.2% gain in efficiency when estimating the treatment effect in an AIDS clinical trial. In this trial, the auxiliary data was a binary indicator of whether CD4 counts declined during the first interval of time on study. Fleming et al. (1994) demonstrated a 14.3% improvement when using cancer recurrence as auxiliary data in a study of cancer survival. In both the AIDS and cancer data, censoring was heavy. In this paper we did not consider bias issues. In the standard proportional hazards model, bias occurs when censoring is informative (T is not MAR). There may be auxiliary data Y such that (T, Y) is MAR. In this situation the joint model is a good choice, since the maximum likelihood estimate of the proportional hazards treatment effect is consistent. Hypothesis tests can be based on score or Wald statistics, and greater power should be achieved when using the joint model. We did not investigate the joint model with several covariates. Hypothesis tests for the standard Cox model are robust to misspecification of the proportional hazards covariate effect (Kong and Slud, 1997; DiRienzo and Lagakos, 1999), and it would be worth investigating the robustness of hypothesis tests based on the joint discrete-time proportional hazards model. When the objective is to improve the efficiency of estimation, the joint model outlined in this paper is a good choice when censoring is early and substantial, and when Y is strongly predictive of T. The largest gains that we demonstrated are achieved when survival is late and Y is a recurrence time strongly predictive of survival.
Acknowledgements

This research was supported by the National Institutes of Health, Grant GM29745. The authors are grateful for helpful suggestions from Stephen Lagakos and Bob Gray.
Appendix A

Computing the asymptotic relative efficiency requires calculation of the Fisher information matrices for the standard and joint models. We give the details for the joint model with two treatment groups, Z ∈ {0, 1}, and polychotomous auxiliary data collected at baseline, Y ∈ {1, 2, . . . , r}. The cases of several measurements on Y and recurrent Y are similar. The parameters are defined as
$$p_{y|jk} \equiv P(Y = y \mid T > t_k, Z = z_j), \qquad S_k \equiv P(T > t_k \mid Z = 0),$$
$$S_{jk} \equiv P(T > t_k \mid Z = z_j) = S_k^{(e^{\beta z_j})}, \qquad S_{jky} \equiv P(Y = y, T > t_k \mid Z = z_j) = p_{y|jk}\, S_{jk}.$$
Let θ = (β, S, p) denote the vector of parameters for the covariate effect, the baseline survival, and the auxiliary data. The log-likelihood and Fisher information are
$$l(\theta) = \sum_{j=0}^{1}\sum_{k=1}^{v}\sum_{y=1}^{r}\Bigl\{ N_{jky}\,\log\bigl[\,p_{y|j,k-1}\, S_{k-1}^{(e^{\beta z_j})} - p_{y|jk}\, S_{k}^{(e^{\beta z_j})}\bigr] + M_{jky}\bigl[\log(p_{y|jk}) + \log S_{k}^{(e^{\beta z_j})}\bigr]\Bigr\},$$
$$I(\theta) \equiv -E\Bigl[\frac{\partial^2 l}{\partial\theta\,\partial\theta^{T}}\Bigr].$$
The log-likelihood is linear in the sufficient statistics
$$N_{jky} = \sum_{i=1}^{n} I(Z_i = z_j)\, I(X_i = t_k)\, I(Y_i = y)\,\xi_i, \qquad M_{jky} = \sum_{i=1}^{n} I(Z_i = z_j)\, I(X_i = t_k)\, I(Y_i = y)\,(1-\xi_i),$$
so that the calculation of the Fisher information involves taking the expectation of the sufficient statistics. This is a simple tabulation involving the multinomial distribution for the joint model and the censoring distribution. For the calculation of the second derivative matrix, the following variables are useful:
$$P_{jky} = P(T = t_k, Y = y \mid Z = z_j) = S_{j,k-1,y} - S_{jky}, \qquad lS_{jky} = S_{jky}\,\log S_{jk},$$
$$\Delta lS_{jky} = lS_{j,k-1,y} - lS_{jky}, \qquad C_{jky} = \frac{S_{jky}}{S_k}\, e^{\beta z_j}.$$
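As a quick numerical illustration of these building blocks (the grid, survival values and β below are invented, not taken from the chapter), the following numpy sketch tabulates S_{jk}, S_{jky}, P_{jky} and C_{jky}; expectations of the sufficient statistics N_{jky} and M_{jky} would then follow by combining these probabilities with the assumed censoring distribution.

import numpy as np

# Illustrative values only: two groups, v = 4 time points, r = 2 auxiliary categories.
beta = np.log(2.0)
z = np.array([0.0, 1.0])                        # the two treatment groups z_0, z_1
S = np.array([1.0, 0.9, 0.75, 0.55, 0.30])      # baseline survival S_0..S_4 (S_0 = 1)
r, v = 2, 4
p = np.full((2, v + 1, r), 1.0 / r)             # p[j, k, y-1] = P(Y = y | T > t_k, Z = z_j)

S_jk = S[None, :] ** np.exp(beta * z)[:, None]                  # S_{jk} = S_k^(exp(beta z_j))
S_jky = p * S_jk[:, :, None]                                    # S_{jky} = p_{y|jk} S_{jk}
P_jky = S_jky[:, :-1, :] - S_jky[:, 1:, :]                      # P_{jky} = S_{j,k-1,y} - S_{jky}
C_jky = S_jky / S[None, :, None] * np.exp(beta * z)[:, None, None]   # C_{jky} = (S_{jky}/S_k) e^{beta z_j}

print(P_jky.sum(axis=(1, 2)))    # total failure probability by group over t_1..t_4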
Diagonal components: ∂²l/∂β² is a scalar quantity, ∂²l/∂S∂Sᵀ is a tri-diagonal matrix, and ∂²l/∂p∂pᵀ is a 2-way array of tri-diagonal matrices.
$$\frac{\partial^2 l}{\partial\beta^2} = \sum_{j=0}^{1}\sum_{k=1}^{v}\sum_{y=1}^{r} z_j\,\frac{N_{jky}}{P_{jky}}\Bigl[-\frac{(\Delta lS_{jky})^2}{P_{jky}} + lS_{j,k-1,y}(\log S_{j,k-1}+1) - lS_{jky}(\log S_{jk}+1)\Bigr] + \sum_{j=0}^{1}\sum_{k=1}^{v}\sum_{y=1}^{r} z_j\, M_{jky}\,\log S_{jk},$$
$$\frac{\partial^2 l}{\partial S_k^2} = \sum_{j=0}^{1}\sum_{y=1}^{r}\Bigl\{-C_{jky}^2\Bigl[\frac{N_{j,k+1,y}}{P_{j,k+1,y}^2} + \frac{N_{jky}}{P_{jky}^2}\Bigr] + \Bigl[\frac{N_{j,k+1,y}}{P_{j,k+1,y}} - \frac{N_{jky}}{P_{jky}}\Bigr]\frac{C_{jky}}{S_k}\bigl(e^{\beta z_j}-1\bigr)\Bigr\} - \sum_{j=0}^{1}\sum_{y=1}^{r}\frac{M_{jky}\, e^{\beta z_j}}{S_k^2},$$
$$\frac{\partial^2 l}{\partial S_k\,\partial S_{k+1}} = \sum_{j=0}^{1}\sum_{y=1}^{r} C_{jky}\, S_{j,k+1,y}\,\frac{N_{j,k+1,y}}{P_{j,k+1,y}^2},$$
$$\frac{\partial^2 l}{\partial p_{y|jk}^2} = -S_{jk}^2\Bigl[\frac{N_{jky}}{P_{jky}^2} + \frac{N_{jkr}}{P_{jkr}^2} + \frac{N_{j,k+1,y}}{P_{j,k+1,y}^2} + \frac{N_{j,k+1,r}}{P_{j,k+1,r}^2}\Bigr] - \frac{M_{jky}}{p_{y|jk}^2} - \frac{M_{jkr}}{p_{r|jk}^2},$$
$$\frac{\partial^2 l}{\partial p_{y|jk}\,\partial p_{y|j,k+1}} = S_{jk}\, S_{j,k+1}\Bigl[\frac{N_{j,k+1,y}}{P_{j,k+1,y}^2} + \frac{N_{j,k+1,r}}{P_{j,k+1,r}^2}\Bigr],$$
$$\frac{\partial^2 l}{\partial p_{a|jk}\,\partial p_{b|jk}} = -S_{jk}^2\Bigl[\frac{N_{jkr}}{P_{jkr}^2} + \frac{N_{j,k+1,r}}{P_{j,k+1,r}^2}\Bigr] - \frac{M_{jkr}}{p_{r|jk}^2}, \qquad a \neq b,$$
$$\frac{\partial^2 l}{\partial p_{a|jk}\,\partial p_{b|j,k+1}} = S_{jk}\, S_{j,k+1}\,\frac{N_{j,k+1,r}}{P_{j,k+1,r}^2}, \qquad a \neq b.$$
Off-diagonal components: ∂²l/∂β∂S and ∂²l/∂β∂p are vectors, and ∂²l/∂p∂S is a 2-way array of tri-diagonal matrices.
$$\frac{\partial^2 l}{\partial\beta\,\partial S_k} = \sum_{j=0}^{1}\sum_{y=1}^{r} z_j\, C_{jky}\Bigl\{ N_{jky}\Bigl[\frac{\Delta lS_{jky}}{P_{jky}^2} - \frac{\log S_{jk}+1}{P_{jky}}\Bigr] - N_{j,k+1,y}\Bigl[\frac{\Delta lS_{j,k+1,y}}{P_{j,k+1,y}^2} - \frac{\log S_{jk}+1}{P_{j,k+1,y}}\Bigr]\Bigr\} + \sum_{j=0}^{1}\sum_{y=1}^{r} z_j\, M_{jky}\, e^{\beta z_j}/S_k,$$
$$\frac{\partial^2 l}{\partial\beta\,\partial p_{y|jk}} = \frac{N_{jky}\,\Delta lS_{jky}}{P_{jky}^2} - \frac{N_{jkr}\,\Delta lS_{jkr}}{P_{jkr}^2} - \frac{N_{j,k+1,y}\,\Delta lS_{j,k+1,y}}{P_{j,k+1,y}^2} + \frac{N_{j,k+1,r}\,\Delta lS_{j,k+1,r}}{P_{j,k+1,r}^2},$$
$$\frac{\partial^2 l}{\partial S_k\,\partial p_{y|jk}} = S_{jk}\Bigl[-\frac{N_{jky}\, C_{jky}}{P_{jky}^2} + \frac{N_{jkr}\, C_{jkr}}{P_{jkr}^2} - \frac{N_{j,k+1,y}\, C_{jky}}{P_{j,k+1,y}^2} + \frac{N_{j,k+1,r}\, C_{jkr}}{P_{j,k+1,r}^2}\Bigr],$$
$$\frac{\partial^2 l}{\partial S_{k+1}\,\partial p_{y|jk}} = S_{jk}\Bigl[\frac{N_{j,k+1,y}\, S_{j,k+1,y}}{P_{j,k+1,y}^2} - \frac{N_{j,k+1,r}\, C_{j,k+1,r}}{P_{j,k+1,r}^2}\Bigr],$$
$$\frac{\partial^2 l}{\partial S_{k-1}\,\partial p_{y|jk}} = S_{jk}\Bigl[\frac{N_{jky}\, C_{j,k-1,y}}{P_{jky}^2} - \frac{N_{jkr}\, C_{j,k-1,r}}{P_{jkr}^2}\Bigr].$$

References

Cox, D.R. (1983). A remark on censoring and surrogate response variables. J. Roy. Statist. Soc. Ser. B 45, 391–393.
Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–22.
DiRienzo, A.G., Lagakos, S.W. (1999). Effects of model misspecification on randomized treatment comparisons arising from Cox's proportional hazards model. In preparation.
Fleming, T.R., Prentice, R.L., Pepe, M.S., Glidden, D. (1994). Surrogate and auxiliary endpoints in clinical trials, with potential applications in cancer and AIDS research. Statist. Medicine 13, 955–968.
Kong, F.H., Slud, E. (1997). Robust covariate-adjusted logrank tests. Biometrika 84, 847–862.
Murray, S., Tsiatis, A. (1996). Nonparametric survival estimation using prognostic longitudinal covariates. Biometrics 52, 137–151.
Prentice, R.L., Gloeckler, L.A. (1978). Regression analysis of grouped survival data with applications to breast cancer data. Biometrics 34, 57–67.
Slasor, P.J., Laird, N.M. (2003). Joint models for efficient estimation in proportional hazards regression models. Statist. Medicine 22, 2137–2148.
21
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23021-2
Hosmer and Lemeshow type Goodness-of-Fit Statistics for the Cox Proportional Hazards Model
Susanne May and David W. Hosmer
1. Introduction

The Cox (1972) proportional hazards (PH) model has been an extremely popular regression model for the analysis of survival data during the last few decades. Even though a number of goodness-of-fit tests have been developed for the PH model, authors who utilize this model rarely compute these tests (Andersen, 1991; Concato et al., 1993). One reason might be that only a few of them can be easily calculated in statistical software packages. We discuss goodness-of-fit tests for the Cox proportional hazards model which are based on ideas similar to the Hosmer and Lemeshow (1980, 2000) goodness-of-fit test for logistic regression. All of these tests can be derived by adding group indicator variables to the model and testing the hypothesis that the coefficients of the group indicator variables are zero via the score test. We will call the tests that can be derived in this way the added variable tests. The tests that we discuss were proposed by Moreau et al. (1985, 1986) and Grønnesby and Borgan (1996). Care needs to be taken when implementing these tests, since some of them require the use of time-dependent group indicator variables. In Section 2 we discuss the different tests. Section 3 provides information regarding the time-dependent nature of the tests. In Section 4 we provide examples. Details of proofs as well as SAS and STATA code for the examples can be found in Appendixes A–C.
2. The Hosmer and Lemeshow type test statistics

We assume the typical right-censored survival data where we observe, for each of n individuals, the time (denoted by t) from study entry to either event or censoring, whether the event occurred or the time was censored (denoted by δ), and a vector of p fixed covariates, x = (x_1, . . . , x_p)′. Under the PH model the hazard function takes the following form:
$$\lambda(t, x) = \lambda_0(t)\exp(\beta' x), \qquad (2.1)$$
where λ_0(t) represents an unspecified baseline hazard function and β = (β_1, . . . , β_p)′ a vector of p coefficients. The component β′x is often called the risk score. An important consequence of the model in (2.1) is that the effects of the covariates are constant over the entire observation period. We discuss three tests for assessing goodness-of-fit of (2.1) which are like the Hosmer–Lemeshow test for logistic regression. The main idea of the Hosmer–Lemeshow test is to compare the number of individuals with a specific outcome to a model-based predicted number with the outcome within groups defined by the predicted probabilities. For the PH model, Schoenfeld (1980) proposed a test which is based on comparing the number of observed events to a model-based estimated expected number of events across partitions of the covariate and time space. Moreau et al. (1986) proposed a slight modification of the Schoenfeld test in which a different estimate of the variance of the difference between observed and expected is used. We focus here on the Moreau et al. (1986) test. The details of the test are as follows. Consider r time intervals (b_0, b_1), . . . , (b_{r−1}, b_r), with b_0 = 0 and b_r = ∞. Each time interval is further subdivided into c_j partitions such that there are a total of L partitions (c_1 + · · · + c_r = L). The individual partitions are denoted by W_{jq}, j = 1, . . . , r, q = 1, . . . , c_j. Moreau et al. (1986) specify time-dependent indicator variables y_{jq}, j = 1, . . . , r, q = 1, . . . , c_j, which have a value of one if (x, t) is in partition W_{jq}, and a value of zero otherwise. The hazard function can then be expressed as
$$\lambda(t, x) = \lambda_0(t)\exp\bigl(\beta' x + \gamma_j' y_j\bigr), \qquad b_{j-1} \le t < b_j, \qquad (2.2)$$
where β = (β_1, . . . , β_p)′, and γ_j and y_j are c_j × 1, j = 1, . . . , r, vectors of unknown parameters and the above-defined indicator variables. Under the hypothesis of no varying effect across partitions, γ_j = 0 for j = 1, . . . , r. The log partial likelihood function for (2.2) is
$$\log l_{\mathrm{MOL}} = \sum_{j=1}^{r}\sum_{i=1}^{k_j}\Bigl[\beta' x_{ij} + \gamma_j' y_{ij} - \log\Bigl(\sum_{\ell\in R_{ij}}\exp\bigl(\beta' x_{\ell} + \gamma_j' y_{\ell}\bigr)\Bigr)\Bigr], \qquad (2.3)$$
where R_{ij} represents the risk set at time t_{ij}, x_{ij} represents the covariate vector of the subject with survival time t_{ij}, k_j represents the number of unique event times in the jth time interval, and the subscript MOL stands for the Moreau et al. (1986) test. The MOL test may be obtained by adding appropriate indicator variables to the model and testing the hypothesis that the coefficients of the indicator variables are zero via a score test. The proof is based on expressing (2.3) as
$$\log l_{\mathrm{MOL,av}} = \sum_{i=1}^{d}\Bigl[\beta' x_{i} + \sum_{j=1}^{r}\sum_{q=1}^{c_j}\gamma_{jq}\, I\bigl((x_{i}, t_i) \in W_{jq}\bigr) - \log\Bigl(\sum_{\ell\in R_{i}}\exp\Bigl(\beta' x_{\ell} + \sum_{j=1}^{r}\sum_{q=1}^{c_j}\gamma_{jq}\, I\bigl((x_{\ell}, t_i) \in W_{jq}\bigr)\Bigr)\Bigr)\Bigr], \qquad (2.4)$$
where I((x_ℓ, t_i) ∈ W_{jq}) has a value of one if the covariate vector for the ℓth individual and the event time of the ith individual are contained in partition W_{jq}, and a value of zero otherwise (j = 1, . . . , r, q = 1, . . . , c_j), d represents the total number of unique event times, and the subscript av stands for the added variable version of the log partial likelihood function. The proof is parallel to the one used by May and Hosmer (2004) to show a similar result for the Moreau, O'Quigley and Mesbah test. The indicator variables I((x_ℓ, t_i) ∈ W_{jq}) need to be generated as time-dependent variables. We will give more details regarding the time-dependent indicator variables in the next section. Schoenfeld (1980) provides suggestions on how to divide the covariate and time space into partitions. One of his suggestions is to group the individuals based on their estimated risk score and form cross products with time intervals. Parzen and Lipsitz (1999) also suggest partitioning the covariate and time space by forming cross products of the risk score and time intervals and testing for the addition of partition indicator variables for a similar goodness-of-fit test. They seem to be unaware that this test is a special case of the MOL test. Also, Parzen and Lipsitz (1999) fail to point out that the indicator variables for the time intervals are time-dependent indicator variables. The MOL test is an omnibus test and should detect any violations of the PH model.

The second statistic considered here is the test proposed by Moreau et al. (1985), denoted MOM. The basic idea of this test is to allow the effect of the covariate vector x to vary over time, but to remain constant within different time intervals. The test assesses whether there is a significant variation in effects across time intervals. Under the MOM time-varying effect model, the hazard function has the following form
$$\lambda(t, x) = \lambda_0(t)\exp\bigl((\beta + \gamma_j)' x\bigr), \qquad b_{j-1} \le t < b_j, \qquad (2.5)$$
where β = (β_1, . . . , β_p)′ and γ_j = (γ_{j1}, . . . , γ_{jp})′, j = 1, . . . , r. Under the hypothesis of no time-varying effect, γ_j = 0 for j = 1, . . . , r. The log partial likelihood function for (2.5) is
$$\log l_{\mathrm{MOM}} = \sum_{j=1}^{r}\sum_{i=1}^{k_j}\Bigl[(\beta + \gamma_j)' x_{ij} - \log\Bigl(\sum_{\ell\in R_{ij}}\exp\bigl((\beta + \gamma_j)' x_{\ell}\bigr)\Bigr)\Bigr], \qquad (2.6)$$
where R_{ij} represents the risk set at time t_{ij}, x_{ij} represents the covariate vector of the subject with survival time t_{ij}, and k_j represents the number of unique event times in the jth time interval. May and Hosmer (2004) show that the log partial likelihood function in Eq. (2.6) can be expressed as
$$\log l_{\mathrm{MOM,av}} = \sum_{i=1}^{d}\Bigl[\beta' x_{i} + \gamma'\bigl(\mathbf{I}(t_i)\otimes x_{i}\bigr) - \log\Bigl(\sum_{\ell\in R_{i}}\exp\bigl(\beta' x_{\ell} + \gamma'\bigl(\mathbf{I}(t_i)\otimes x_{\ell}\bigr)\bigr)\Bigr)\Bigr], \qquad (2.7)$$
where ⊗ represents the direct product (Kronecker product), γ = (γ_1′, . . . , γ_r′)′ = (γ_{11}, γ_{12}, . . . , γ_{1p}, . . . , γ_{r1}, γ_{r2}, . . . , γ_{rp})′, I(t_i) = (I_1(t_i), . . . , I_r(t_i))′, I_j(t_i) = 1 if t_i ∈ [b_{j−1}, b_j), I_j(t_i) = 0 otherwise, j = 1, . . . , r, d represents the number of unique event
times, and the subscript av stands for the added variable version of the log partial likelihood function. The test for proportional hazards is performed by adding appropriate indicator variables to the model and by testing whether their coefficients are zero via a score test. When the covariate vector consists of only categorical variables, it can be shown that the test compares the observed number of events to a model-based estimated expected number of events within time intervals. The MOM test is designed specifically to detect violations of the proportional hazards assumption.

The third test we consider is one proposed by Grønnesby and Borgan (1996). Their test is based on martingale residuals. These residuals represent the difference between the number of observed events and a model-based estimated expected number of events. The idea of the test is to divide the observations into groups based on their estimated risk score β̂′x and to compare the observed and model-based estimated expected number of events within risk score groups. May and Hosmer (1998) show that this test can also be calculated by adding appropriate indicator variables to the model and by testing whether the coefficients of the indicator variables are zero via a score test. Therefore the log partial likelihood function in this setting can be written as
$$l_{\mathrm{GB,av}}(\beta, \gamma) = \sum_{i=1}^{n}\delta_i\Bigl[\beta' x_{i} + \gamma' K_{i} - \log\Bigl(\sum_{\ell\in R_{i}}\exp\bigl(\beta' x_{\ell} + \gamma' K_{\ell}\bigr)\Bigr)\Bigr],$$
where K is a (G − 1) × 1 vector of risk score group indicator variables, γ is a (G − 1) × 1 vector of coefficients, GB stands for the Grønnesby and Borgan test, and av stands for the added variable version of the test. Under the hypothesis of no varying effect across risk score groups, γ = 0. Parzen and Lipsitz (1999) also suggest testing the significance of added variables for risk score groups to assess goodness-of-fit in the PH model. They seem to have been unaware that their test is equivalent to the Grønnesby and Borgan (1996) test. This test, like the MOL test, is an omnibus test.

Each of the three tests described is based on the score test for a vector of coefficients to be zero for variables added to the PH model. Since the score, the likelihood ratio and the Wald tests are known to be asymptotically equivalent, any one of them could be used. This is important since not all software packages can easily perform score tests. In summary, we have shown that there are three previously proposed tests for the PH model which are similar to the Hosmer and Lemeshow goodness-of-fit test for logistic regression. Each test can be calculated using existing statistical software packages. The difference between the three tests is the way in which the covariate and time space are partitioned. The Moreau et al. (1986) test is based on partitions of the covariate and time space, while the Moreau et al. (1985) test is based on partitions of the time space and the Grønnesby and Borgan (1996) test is based on partitions of the covariate space.
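As a rough illustration of the added-variable recipe for the Grønnesby and Borgan test (complementing the SAS and STATA code in Appendixes A–C), the sketch below uses Python, assuming the third-party lifelines package and its CoxPHFitter interface (fit, predict_partial_hazard, log_likelihood_). The simulated data, covariate names and the choice of five risk score groups are placeholders, not the UIS data, and the likelihood ratio version of the test is used because it is the easiest to extract from standard output.

import numpy as np
import pandas as pd
from scipy.stats import chi2
from lifelines import CoxPHFitter

# Synthetic right-censored data, purely for illustration.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"age": rng.normal(32, 6, n),
                   "beck": rng.normal(17, 9, n),
                   "treat": rng.integers(0, 2, n)})
lp = 0.02 * df["age"] - 0.01 * df["beck"] - 0.3 * df["treat"]
t = rng.exponential(1.0 / np.exp(lp))
c = rng.exponential(2.0, n)
df["time"], df["event"] = np.minimum(t, c), (t <= c).astype(int)

base = CoxPHFitter().fit(df, duration_col="time", event_col="event")

# Five risk score groups from the fitted model; the first group is the referent.
score = np.ravel(base.predict_partial_hazard(df))
grp = pd.qcut(pd.Series(score).rank(method="first"), 5, labels=False)
dummies = pd.get_dummies(grp, prefix="g", drop_first=True).astype(float)

full = CoxPHFitter().fit(pd.concat([df, dummies], axis=1),
                         duration_col="time", event_col="event")

# Likelihood ratio form of the GB test: 2*(logL_full - logL_base) ~ chi-square(G-1) under H0.
lr = 2.0 * (full.log_likelihood_ - base.log_likelihood_)
print("GB LR statistic:", round(lr, 2), " p-value:", round(chi2.sf(lr, df=dummies.shape[1]), 3))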
3. Necessity for time-dependent indicator variables

An important aspect of the added variable version of the Moreau et al. (1986) and the Moreau et al. (1985) tests is that the indicator variables for the time intervals are time-dependent. We will use a small example and the Moreau et al. (1985) test to illustrate the time dependence. Assume we observe four non-censored observations, denoted t_1 < t_2 < t_3 < t_4, and also observe whether each observation belongs to group one (denoted x = 0, 1) of two groups. Consider two time intervals, with the first two observations having event times in the first interval and the latter two observations having event times in the second interval (see Figure 1).

Fig. 1. Example data of four non-censored observations and partitions of the time and covariate space.

The MOM test has the form S = U′I⁻¹U, where U is the vector of first derivatives of the log partial likelihood function with respect to the r × p parameters and I is the observed information matrix. In this example U_j, the jth component of U, is a scalar and represents the difference between the observed and expected number of events within the jth time interval for one of the groups. The estimated difference in the MOM formulation of the model is defined as
$$U_j = \sum_{i=1}^{k_j} x_{ij} - \sum_{i=1}^{k_j} \frac{\sum_{\ell\in R_{ij}} x_{\ell}\exp(\hat\beta' x_{\ell})}{\sum_{\ell\in R_{ij}} \exp(\hat\beta' x_{\ell})} = \text{``observed''}_j - \text{``expected''}_j, \qquad (3.1)$$
where β̂ is the maximum likelihood estimate of β under H_0. In Eq. (3.1) the "observed" part is the sum of the observed values of the covariate x for the k_j subjects with event times in the jth time interval. The "expected" is a model-weighted predicted sum of the covariate. If the covariate does not have a time-varying effect, then a model with time-dependent indicator variables should accurately estimate the total in each interval. If the effect is time-varying, then there should be a discrepancy, with the model-based predicted sum over- or under-estimating the "observed" sum.

The necessity for time-dependent indicator variables has the potential to be easily overlooked. The difference in Eq. (2.7) between using time-dependent and time-independent indicator variables would be to use subscript ℓ instead of subscript i for time t in the second sum. Furthermore, familiar coding steps that are used in statistical software to generate indicator variables will result in generating time-independent indicator variables. Therefore, in the following we describe why the use of time-independent indicator variables is incorrect and why the use of time-dependent indicator variables is correct. Using the MOM formulation of the test, we illustrate, using data from this example, which observations contribute to the numerator of the estimated expected number of events. We then show how this differs from the Moreau et al. (1985) formulation if we incorrectly used time-independent indicator variables. Finally, we show that by using
appropriate time-dependent indicator variables we obtain the same estimated expected number of events as Moreau et al. (1985).

For the example data shown in Figure 1, when calculating the estimated expected number of events in the first time interval we sum over i = 1 and i = 2. For i = 1 (and similarly for i = 2), all of the observations in group 1 (the observations with event times t_2 and t_3) contribute to the numerator sum, since both of these observations are at risk of having the event occur just before time t_1. Specifically, in (3.1) with j = 1:
Subject 1: t_1 ∈ R_{11}, x_1 = 0;
Subject 2: t_2 ∈ R_{11}, x_2 = 1;
Subject 3: t_3 ∈ R_{11}, x_3 = 1;
Subject 4: t_4 ∈ R_{11}, x_4 = 0.
When incorrectly using time-independent indicator variables for calculating the added variable version of the Moreau et al. (1985) test, the "expected" part in (3.1) is written as
$$\sum_{i=1}^{d} \frac{\sum_{\ell\in R_{i}} x_{\ell}\, I_j(t_{\ell})\exp(\hat\beta' x_{\ell})}{\sum_{\ell\in R_{i}} \exp(\hat\beta' x_{\ell})}, \qquad (3.2)$$
where I_j(t_ℓ) = 1 if the event time t_ℓ falls into the jth time interval and I_j(t_ℓ) = 0 otherwise. When calculating (3.2) for j = 1 in this (incorrect) formulation, we would sum over i = 1 to 4 (that is, over the observations with event times t_1, t_2, t_3, t_4). Nevertheless, for i = 1 (and similarly for i = 2) only subject 2 would contribute to the numerator sum, since only this subject is in group 1 and has the event occur in the first time interval (t_2 < b_1). Specifically, in (3.2) with j = 1:
Subject 1: t_1 ∈ R_1, x_1 = 0, I_1(t_1) = 1;
Subject 2: t_2 ∈ R_1, x_2 = 1, I_1(t_2) = 1;
Subject 3: t_3 ∈ R_1, x_3 = 1, I_1(t_3) = 0;
Subject 4: t_4 ∈ R_1, x_4 = 0, I_1(t_4) = 0.
For the numerator of the "expected" to be one (for j = 1) for subject ℓ, subject ℓ has to be part of the risk set (t_ℓ ∈ R_1), subject ℓ has to be in group one (x_ℓ = 1) and the event for subject ℓ has to have occurred in the first time interval (I_1(t_ℓ) = 1). Even though the variable x has a value of one for subject 3, the numerator is zero for this subject; the event for subject 3 did not occur in the first time interval and therefore the indicator variable I_1(t_3) is zero. The numerator is zero as well for subject 4. The main difference to the Moreau et al. (1985) formulation (and mistake) would be that the observation with event time t_3 would not contribute to the estimated expected number of events in group one in the first time interval. In more general terms, the expected number of events within a time interval represents a sum of weighted averages based on the individuals who are at risk of having the event occur. If simple time-independent group indicator variables were used, observations with event times occurring in the second time interval (or later time intervals) would not contribute to the numerator of the weighted average calculated for the first time interval.
When using the correct time-dependent indicator variables, the "expected" part is
$$\sum_{i=1}^{d} \frac{\sum_{\ell\in R_{i}} x_{\ell}\, I_j(t_{i})\exp(\hat\beta' x_{\ell})}{\sum_{\ell\in R_{i}} \exp(\hat\beta' x_{\ell})}, \qquad (3.3)$$
where I_j(t_i) = 1 if t_i ∈ [b_{j−1}, b_j) and I_j(t_i) = 0 otherwise. In other words, the time-dependent indicator variable for the jth time interval is one if the jth time interval includes the point in time that is the basis for the risk set. When calculating (3.3) for j = 1 for our example data we sum over i = 1 to 4. In this case all observations in group 1 (x = 1), i.e., the observations with event times t_2 and t_3, contribute to the numerator portion of the estimated expected number of events for i = 1 (and similarly for i = 2), since the first time interval includes the event times that form the basis of the risk sets (t_1 and t_2). Specifically, in (3.3) with j = 1:
Subject 1: t_1 ∈ R_1, x_1 = 0, I_1(t_1) = 1;
Subject 2: t_2 ∈ R_1, x_2 = 1, I_1(t_1) = 1;
Subject 3: t_3 ∈ R_1, x_3 = 1, I_1(t_1) = 1;
Subject 4: t_4 ∈ R_1, x_4 = 0, I_1(t_1) = 1.
For the numerator of the "expected" to be one (for j = 1) for subject ℓ, subject ℓ has to be part of the risk set (t_ℓ ∈ R_1), subject ℓ has to be in group one (x_ℓ = 1) and the event for subject 1 has to have occurred in the first time interval (I_1(t_1) = 1). The definition of the time-dependent indicator variables therefore ensures that observations with event times in later time intervals still contribute to the risk sets that are formed in earlier time intervals. Murphy (1993) and O'Quigley and Pessione (1989) also point out that the MOM test can be calculated by testing for the addition of time-dependent indicator variables. Nevertheless, neither of them provides a proof or details regarding the necessity of the time-dependent nature of the indicator variables. Further comments regarding the necessity of time-dependent indicator variables apply to the test suggested by Parzen and Lipsitz (1999). As mentioned in Section 2, Parzen and Lipsitz (1999) also suggest partitioning the covariate and time space by forming cross-products of the risk score and time intervals and testing for the addition of partition indicator variables for a goodness-of-fit test. Parzen and Lipsitz's article does not make it clear that the indicator variables for the time intervals need to be time-dependent.
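For software that accepts counting-process (start, stop) data rather than programming statements inside the model fit (the SAS approach shown in Appendix A), one common way to obtain correctly evaluated time-dependent interval indicators is to split each subject's follow-up at the interval boundary, so that the added variable x·I_2(t) takes the value dictated by the episode currently at risk rather than by the subject's own event time. The small pandas sketch below does this for the four-subject example above; the numerical event times and the boundary are invented for illustration.

import pandas as pd

# Four uncensored subjects as in Figure 1: t1 < t2 < t3 < t4, two per time interval.
subj = pd.DataFrame({"id": [1, 2, 3, 4],
                     "time": [2.0, 4.0, 7.0, 9.0],   # made-up event times
                     "x": [0, 1, 1, 0]})              # group indicator
b1 = 5.0                                              # boundary between the two intervals

rows = []
for _, s in subj.iterrows():
    if s["time"] <= b1:                   # whole follow-up lies in the first interval
        rows.append((s["id"], 0.0, s["time"], 1, s["x"], 0))
    else:                                 # split at b1 so x*I2 is evaluated at the current time
        rows.append((s["id"], 0.0, b1, 0, s["x"], 0))             # episode in interval 1
        rows.append((s["id"], b1, s["time"], 1, s["x"], s["x"]))  # episode in interval 2
long = pd.DataFrame(rows, columns=["id", "start", "stop", "event", "x", "x_I2"])
print(long)

In this layout subjects 3 and 4 contribute x·I_2 = 0 while they are at risk during the first interval, which is exactly the behaviour the time-dependent definition I_j(t_i) in (3.3) requires.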
4. Examples

The first example is based on the gastric cancer data presented by Stablein et al. (1981) (see also Moreau et al., 1985). Ninety cancer patients were either treated by chemotherapy or by both chemotherapy and radiotherapy. Like Moreau et al. (1985), we divide the time axis into four intervals such that each interval contains 18, 19, 18 and 19 deaths, respectively. The Moreau et al. (1985) test in this case has 3 degrees of freedom, with values of 10.21 (p = 0.02) for the score statistic, 9.55 (p = 0.02) for the Wald statistic and 10.67 (p = 0.01) for the likelihood ratio statistic. We reject the hypothesis of proportional hazards for each statistic. See Appendix A for the SAS code to calculate the Wald test using the added variable version of the MOM test. In any setting with only one dichotomous covariate, the tests proposed by Moreau et al. (1985), by Moreau et al. (1986) and by Schoenfeld (1980) are equivalent, and the Grønnesby and Borgan (1996) test does not provide anything more than the test for significance of the coefficient of the dichotomous variable.

We use the subset of the UIS (University of Massachusetts Aids Research Unit IMPACT Study) data in Hosmer and Lemeshow (1999) as our second example. This study investigated two treatment programs of different approaches and planned duration aimed at reducing drug abuse and preventing high-risk HIV behavior. The major questions of the trial were whether there exists variability in the effectiveness of alternative residential treatment approaches and whether the planned program duration has an effect. The data set includes 628 observations, of which 120 are censored after a follow-up period of 5 years. A variety of covariates are thought to potentially influence return to drug use. Among these are length of treatment, age, a Beck depression score at admission, heroin/cocaine use during the three months prior to admission, and race. After extensive analyses, Hosmer and Lemeshow (1999) arrive at a tentative final model. Their tentative final model, based on 575 complete cases, includes the variables: age, Beck depression score, log-transformed number of prior drug treatments, an indicator variable for recent IV drug use, an indicator variable for non-white ethnic background, a treatment indicator, a site indicator and two interaction terms for interactions between age and site and race and site (see Hosmer and Lemeshow, 1999, Table 5.11). We calculate the test values for the Grønnesby and Borgan (GB) (1996) statistic using 5 risk score groups. The values for the GB test are 3.70 (d.f. = 4, p = 0.45) and 3.68 (d.f. = 4, p = 0.45) for the Wald and the likelihood ratio test, respectively. We calculate the Moreau, O'Quigley, and Lellouch (MOL) (1986) test using 4 time and 5 risk score intervals. As a result, the test statistic has 16 degrees of freedom: (4 − 1) × (5 − 1) = 12 degrees of freedom for the indicator variables representing interaction terms between time and risk score groups plus (5 − 1) = 4 degrees of freedom for the indicator variables representing (main effects for) risk score groups. The values for the MOL test are 9.77 (d.f. = 16, p = 0.88) and 9.75 (d.f. = 16, p = 0.88) for the Wald and the likelihood ratio test, respectively. We fail to reject the hypothesis of adequate fit in all cases. See Appendix B for STATA code for calculating the GB test and Appendix C for SAS code for calculating the MOL test. Note that we did not calculate the value of the Moreau et al. (1985) test for this example. The final model includes 10 variables. If we were to use, e.g., 4 time intervals, the MOM test would require 30 added variables and the test would have 30 degrees of freedom. The model therefore would include a total of 40 variables. For the 575 observations included in the final model, 464 events were observed. If we were to follow the recommendation to use a maximum of 1 variable per 10 observed event times, a model for the MOM test would be borderline overfitting. The fact that we might need to use a large number of added variables presents a limitation of the MOM test.
5. Summary

While various goodness-of-fit tests have been developed to test the assumptions of the Cox proportional hazards model, only a few are readily available in existing statistical software packages. We discuss previously proposed goodness-of-fit tests for the Cox model which are of the Hosmer–Lemeshow type. We present results that show that the tests can be calculated easily using existing statistical software packages. Care needs to be taken, though, when implementing some of these tests, since they require the use of time-dependent group indicator variables.
Appendix A

The following SAS (1999) code can be used to obtain the added variable version of the Moreau et al. (1985) test for the gastric cancer data using time-dependent indicator variables, based on the Wald test.

*** using SAS version 8.2;
*** Final model;
proc phreg data=in.gcancer;
model time*d(0)=groupi;

*** Obtaining the added variable version
*** (based on Wald statistic)
*** of the Moreau, O'Quigley and Mesbah test
*** using time-dependent indicator variables;
proc phreg data=in.gcancer;
model time*d(0)=groupi b2 b3 b4;
MOMtest: test b2=0, b3=0, b4=0;
if time>170 and time<=354 then b2=groupi; else b2=0;
if time>354 and time<=535 then b3=groupi; else b3=0;
if time>535 then b4=groupi; else b4=0;
run;

Note that the time-dependent indicator variables need to be generated within the PROC PHREG statement and cannot be generated within a data step.
Appendix B

The following STATA (2001) code can be used to obtain the added variable version of the Grønnesby and Borgan (1996) test for the UIS data, based on the likelihood ratio test.
*** Calculate the GB test using the likelihood
*** ratio test in STATA for the UIS data
*** (Hosmer and Lemeshow, 1999)
use uis
stset time, failure(censor)
*** run the final model (Hosmer and Lemeshow, 1999,
*** Table 5.11), use ''nolog'' to avoid lengthy output
*** and ''nohr'' to obtain parameter estimates
stcox age becktota ndrugfp1 ndrugfp2 ivhx_3 race treat site /*
*/ agesite racesite, nolog nohr
*** obtain risk score
predict xb, xb
*** drop all observations not included in the final model
drop if xb==.
*** generate risk score groups
sort xb
gen j=group(5)
*** run final model + risk score group indicator variables
*** referent group: first risk score group
xi: stcox age becktota ndrugfp1 ndrugfp2 ivhx_3 race /*
*/ treat site agesite racesite i.j, nolog nohr
*** obtain likelihood value and save
lrtest, saving(0)
*** run final model to compare likelihood values
stcox age becktota ndrugfp1 ndrugfp2 ivhx_3 race /*
*/ treat site agesite racesite, nolog nohr
*** obtain likelihood ratio test
lrtest
Appendix C

The following SAS (1999) code can be used to obtain the Moreau et al. (1986) test for the UIS data using time-dependent indicator variables, based on the Wald test.

*********************************************************
*** run final model + interaction terms of risk score
*** groups and time groups (time dependent)
*********************************************************;
proc phreg data=daxb;
model time*censor(0)=age becktota ndrugfp1 ndrugfp2 ivhx_3
      race treat site agesite racesite
      j2 j3 j4 j5
      k2j2 k2j3 k2j4 k2j5
      k3j2 k3j3 k3j4 k3j5
      k4j2 k4j3 k4j4 k4j5;
MOLtest: test k2j2=0, k2j3=0, k2j4=0, k2j5=0,
              k3j2=0, k3j3=0, k3j4=0, k3j5=0,
              k4j2=0, k4j3=0, k4j4=0, k4j5=0,
              j2=0, j3=0, j4=0, j5=0;
*** Generate the time dependent indicator variables ***;
*------------------------------------------------------;
*** Second time interval (first time interval is referent);
if time> 84 and time<=170 then do;
  k2j2=j2; k2j3=j3; k2j4=j4; k2j5=j5;
end;
else do;
  k2j2=0; k2j3=0; k2j4=0; k2j5=0;
end;
*** Third time interval;
if time>170 and time<=376 then do;
  k3j2=j2; k3j3=j3; k3j4=j4; k3j5=j5;
end;
else do;
  k3j2=0; k3j3=0; k3j4=0; k3j5=0;
end;
*** Fourth time interval;
if time>376 then do;
  k4j2=j2; k4j3=j3; k4j4=j4; k4j5=j5;
end;
else do;
  k4j2=0; k4j3=0; k4j4=0; k4j5=0;
end;
run;
References

Andersen, P.K. (1991). Survival analysis 1982–1991: The second decade of the proportional hazards regression model. Statist. Medicine 10, 1931–1941.
Concato, J., Feinstein, A.R., Holford, T.R. (1993). The risk of determining risk with multivariable models. Ann. Internal Medicine 118, 201–210.
Cox, D.R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B 34, 187–220.
Grønnesby, J.K., Borgan, Ø. (1996). A method for checking regression models in survival analysis based on the risk score. Lifetime Data Anal. 2, 315–328.
Hosmer, D.W., Lemeshow, S. (1980). Goodness-of-fit tests for the multiple logistic regression model. Comm. Statist. Theory Methods A 10, 1043–1069.
Hosmer, D.W., Lemeshow, S. (1999). Applied Survival Analysis: Regression Modeling of Time to Event Data. Wiley, New York.
Hosmer, D.W., Lemeshow, S. (2000). Applied Logistic Regression. Wiley, New York.
May, S., Hosmer, D.W. (1998). A simplified method of calculating an overall goodness-of-fit test for the Cox proportional hazards model. Lifetime Data Anal. 4, 109–120.
May, S., Hosmer, D.W. (2004). An added variable goodness-of-fit test statistic for the Cox proportional hazards model. In preparation.
Moreau, T., O'Quigley, J., Mesbah, M. (1985). A global goodness-of-fit statistic for the proportional hazards model. Appl. Statist. 34, 212–218.
Moreau, T., O'Quigley, J., Lellouch, J. (1986). On Schoenfeld's approach for testing the proportional hazards assumption. Biometrika 73 (2), 513–515.
Murphy, S.A. (1993). Testing for a time dependent coefficient in Cox's regression model. Scand. J. Statist. 20, 35–50.
O'Quigley, J., Pessione, F. (1989). Score tests for homogeneity of regression effect in the proportional hazards model. Biometrics 45, 135–144.
Parzen, M., Lipsitz, S.R. (1999). A global goodness-of-fit statistic for Cox regression models. Biometrics 55, 580–584.
Schoenfeld, D. (1980). Chi-squared goodness-of-fit tests for the proportional hazards regression model. Biometrika 67 (1), 145–153.
SAS (1999). OnlineDoc, Version Eight. SAS Institute, Cary, NC.
Stablein, D.M., Carter, W.H., Novak, J.W. (1981). Analysis of survival data with nonproportional hazard functions. Controlled Clinical Trials 2, 149–159.
Stata, StataCorp (2001). Stata Statistical Software: Release 7.0. Stata Corporation, College Station, TX.
22
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23022-4
The Effects of Misspecifying Cox’s Regression Model on Randomized Treatment Group Comparisons
A.G. DiRienzo and S.W. Lagakos
1. Introduction

Hypothesis tests arising from Cox's proportional hazards model (Cox, 1972) are often used to compare randomized treatment groups with respect to the distribution of a failure time outcome. Some of these tests adjust for covariates that may be predictive of outcome, while others, most notably the log-rank test, do not. In addition to adjusting for any imbalances that may arise between treatment groups, covariate-adjusted tests may enjoy greater efficiency than the log-rank test. Tsiatis et al. (1985) demonstrated the gain in efficiency of covariate-adjusted tests relative to the log-rank test when the working proportional hazards model is properly specified. Slud (1991) provided asymptotic relative efficiency formulae of the log-rank test relative to the optimal score test that arises from a properly specified model for covariates when the effect of treatment is multiplicative on the survival time hazard function. Lagakos and Schoenfeld (1984) studied the effects of various types of model misspecification on the power of tests based on Cox's model.

An important consideration in the application of these tests is their validity when the proportional hazards working model is misspecified. Recent work has shown that the impact of model misspecification on the validity of the resulting tests hinges on whether the distribution of the potential censoring time either (i) is conditionally independent of treatment group given covariates or conditionally independent of covariates given treatment group, or (ii) depends on both treatment group and covariates. In the first case, the resulting test statistics have an asymptotic normal distribution with mean zero under the null hypothesis, and consistent variance estimates are readily obtainable (see Kong and Slud, 1997 and DiRienzo and Lagakos, 2001a). In the second case, the asymptotic mean of the test statistic is not necessarily equal to zero under the null hypothesis when the proportional hazards working model is misspecified. In such cases, the bias of tests can be large, as was demonstrated in DiRienzo and Lagakos (2001a, 2001b).

In this chapter we summarize the properties of hypothesis tests derived from proportional hazards regression models. We introduce notation and define uncorrected statistics in Section 2. In Section 3 we describe conditions necessary for the asymptotic
validity of these test statistics, and also discuss efficiency considerations and effects of model misspecification on the power of uncorrected test statistics. We describe a class of corrected test statistics for use when censoring depends on both treatment group and covariates in Section 4, and also examine estimation procedures and the efficiency of such bias-corrected tests. We provide some recommendations for the use of these tests in Section 5, and give MATLAB code for the computation of the various test statistics in Appendix A.
2. Notation and statistics

Let the continuous random variable T denote the time from randomization to failure and let C denote a potential censoring time. Assume that we observe T* = min(T, C) and the indicator δ = 1(T ≤ C) of whether T is observed (δ = 1) or right-censored (δ = 0). Let the binary random variable X denote treatment group and let W denote a q × 1 vector of bounded baseline covariates. Throughout this paper we assume that censoring acts noninformatively, that is, T ⊥ C | (X, W), and also that X ⊥ W, as is the case in most randomized clinical trials. The true conditional hazard functions of T and C given (X, W) are denoted by κ(t | X, W) and κ_C(t | X, W), respectively, and are not necessarily of a proportional hazards form. The observed data are assumed to consist of n independent and identically distributed realizations of (T*, δ, X, Z*), denoted (T_i*, δ_i, X_i, Z_i*) for i = 1, . . . , n, where Z* is a p × 1 vector whose components are bounded functions of W. The null hypothesis of interest is H_0: X ⊥ T | W; that is, that the failure time distribution does not depend on treatment group, under which we will denote κ(t | X, W) by κ(t | W). Consider tests of H_0 that are based on statistics of the form
$$n^{-1/2}U_n = \sum_{i=1}^{n}\int_0^{\infty} n^{-1/2}\, G_n(t)\,\bigl\{X_i - E_n(t)\bigr\}\, dN_i(t), \qquad (1)$$
where
$$E_n(t) = \sum_{j=1}^{n} Y_j(t)\,\psi_n(Z_j)\, X_j \Big/ \sum_{j=1}^{n} Y_j(t)\,\psi_n(Z_j),$$
Y_i(t) = 1(T_i* ≥ t), N_i(t) = δ_i 1(T_i* ≤ t), and ψ_n(·) is a nonrandom bounded function whose form is known but whose parameters can be estimated from the data. The covariates Z_i are some bounded function of Z_i*, i = 1, . . . , n. The bounded predictable process G_n(·) is also assumed to be nonrandom, converging uniformly in probability to a bounded function G(·). It may be the case that one would want to consider time-dependent covariates, for example an external ancillary covariate process (Kalbfleisch and Prentice, 1980, p. 123). Although the results hold when the components of W_i and Z_i are uniformly bounded and predictable functions of time, for ease of notation we only consider fixed covariates.
Statistics of the form in (1) arise as the numerator of partial likelihood score tests of α = 0 based on working proportional hazards models for κ(t | X, W) that take the form
$$\exp(\alpha X_i)\,\psi(\beta; Z_i)\, h(t). \qquad (2)$$
The working model (2) is misspecified when it is not equivalent to κ(t | X, W), in which case the parameters α, β, and h(·) have no simple interpretation. The statistic n^{−1/2}U_n should then generally be viewed simply as a statistic from which tests of H_0 may be derived. Popular choices for G(·) and ψ(·) are G_n(t) = 1 and ψ(β; Z_i) = exp(β′Z_i), resulting in ψ_n(Z_i) = exp(β̂′Z_i), where β̂ is the restricted maximum partial likelihood estimator of β obtained by fitting the model with α = 0 (Cox, 1972). Here the probability limit of ψ_n(Z) is exp(β̃′Z), where β̃ is the probability limit of β̂ (Lin and Wei, 1989). Note that β̃ = β when the model (2) is properly specified. Another special case of (1) is the class of weighted log-rank statistics (Cox and Oakes, 1984, p. 124), where ψ(β; Z_i) = ψ_n(Z_i) = 1, and where the most commonly used choice for G_n(·) is the identity function, yielding the ordinary log-rank test. In general, ψ(β; Z) can also depend on t, so long as it is uniformly bounded.
3. Conditions for valid tests

Suppose that either C ⊥ X | W or C ⊥ W | X. Then the test statistic n^{−1/2}U_n has an asymptotic normal distribution with mean 0 under H_0, regardless of whether or not the model (2) is misspecified (DiRienzo and Lagakos, 2001a). Furthermore, when either of these conditions holds, consistent estimates of the variance of n^{−1/2}U_n are easily derived, yielding asymptotically valid inference whether or not the relationship between T and (X, W) is properly specified. The condition C ⊥ X | W is usually satisfied in a randomized clinical trial when the only form of censoring is administrative or end-of-study censoring; that is, when C represents the time from enrollment of a subject into the study until the time the data are analyzed. However, when censoring can arise from premature study discontinuation or loss to follow-up, it is well known that this condition may not hold. The condition C ⊥ W | X holds when there is a dependency of censoring on treatment group that does not depend on the covariates. To provide some insight into why either of these conditions is necessary for valid inference, note that at baseline (that is, when t = 0) the distribution of W is independent of X because of randomization; when either C ⊥ X | W or C ⊥ W | X holds and H_0 is true, it is implied that X ⊥ W | Y(t) = 1, t > 0, which is necessary for n^{−1/2}U_n to have mean 0 asymptotically. For a proof of these results, see Appendix A of DiRienzo and Lagakos (2001a) or Kong and Slud (1997). We now provide test statistics for use when either C ⊥ X | W or C ⊥ W | X holds. It follows from Kong and Slud (1997) that under H_0, n^{−1/2}U_n can be expressed as
$$n^{-1/2}U_n = n^{-1/2}\sum_{i=1}^{n} Q_i + o_p(1),$$
where
$$Q_i = \int_0^{\infty}\bigl\{X_i - \mu(t)\bigr\}\bigl\{dN_i(t) - \rho(t)\, Y_i(t)\,\psi(\tilde\beta; Z_i)\, dt\bigr\},$$
$$\rho(t) = E\bigl[Y(t)\,\kappa(t \mid W)\bigr]\big/ E\bigl[Y(t)\,\psi(\tilde\beta; Z)\bigr], \qquad \mu(t) = E\bigl[Y(t)\,\psi(\tilde\beta; Z)\, X\bigr]\big/ E\bigl[Y(t)\,\psi(\tilde\beta; Z)\bigr].$$
It is easily verified (cf. Kong and Slud, 1997, or using Lemmas 1 and 2 in DiRienzo and Lagakos, 2001a) that under H_0, Q_i has mean 0 when C ⊥ X | W or C ⊥ W | X. This implies that n^{−1/2}U_n is asymptotically normal with mean zero and variance equal to the variance of Q_i. As shown in Kong and Slud (1997), a consistent estimate of the variance of n^{−1/2}U_n is
$$\frac{1}{n}\sum_{i=1}^{n}\bigl(\widehat Q_i - \overline Q\bigr)^2,$$
where
$$\widehat Q_i = \int_0^{\infty}\bigl\{X_i - \overline X(t)\bigr\}\Bigl\{dN_i(t) - \frac{Y_i(t)\,\psi(\hat\beta; Z_i)}{\sum_{j=1}^{n} Y_j(t)\,\psi(\hat\beta; Z_j)}\, d\overline N(t)\Bigr\},$$
$$\overline X(t) = \sum_{i=1}^{n} Y_i(t)\, X_i\Big/\sum_{i=1}^{n} Y_i(t),$$
N̄(t) = Σ_{i=1}^{n} N_i(t), and Q̄ = (1/n) Σ_{i=1}^{n} Q̂_i. Thus, the test statistic U_n / √{Σ_{i=1}^{n}(Q̂_i − Q̄)²} is asymptotically standard normal under H_0 when C ⊥ X | W or C ⊥ W | X, regardless of whether the working model (2) above is properly specified. The motivation for replacing µ(t) with X̄(t) is that µ(t) = E{X | Y(t) = 1} under H_0 when C ⊥ X | W or C ⊥ W | X. We note that for the special case of the log-rank test, use of the model-based variance estimator of n^{−1/2}U_n results in a valid asymptotic test, and appears to usually provide nominal finite-sample type I errors (see DiRienzo and Lagakos, 2001a), so that use of a robust variance estimator is not needed.
(t) = ni=1 Ni (t) and Q = (1/n) ni=1 Q N √ n i . 2 Thus, the test statistic Un / { i=1 (Qi − Q) } is asymptotically standard normal under H0 when C ⊥ X | W or C ⊥ W | X, regardless of whether the working model (2) above is that µ(t) = is properly specified. The motivation for replacing µ(t) with X(t) E{X | Y (t) = 1} under H0 when C ⊥ X | W or C ⊥ W | X. We note that for the special case of the log-rank test, use of the model-based variance estimator of n−1/2 Un results in a valid asymptotic test, and appears to usually provide nominal finite-sample type I errors (see DiRienzo and Lagakos, 2001a), so that use of a robust variance estimator is not needed. 3.1. Efficiency considerations Lagakos and Schoenfeld (1984) investigated the effects of various types of misspecification of the working model (2) on the power of n−1/2 Un . When covariates have a multiplicative effect on the true hazard κ(t | X, W ), but the ratio κ(t | X = 1, W )/ κ(t | X = 0, W ), is non-constant but either greater or less than one for all t > 0, i.e., the hazards do not cross, there is often only a small loss in power. One exception to this is when the ratio κ(t | X = 1, W )/κ(t | X = 0, W ) departs from one only after the majority of failures have occurred; in this case, the loss in power can be great. In contrast, when the ratio κ(t | X = 1, W )/κ(t | X = 0, W ) crosses one, the loss in power is often substantial.
Suppose that the effect of covariates in the true model κ(t | X, W) is not multiplicative, that is, the ratio κ(t | X = 1, W)/κ(t | X = 0, W) is a function of W, but that the interaction is not qualitative, in the sense that κ(t | X = 1, W)/κ(t | X = 0, W) is either greater or less than one for all W. In this case, the loss in the power of n^{-1/2} U_n is not in general large unless the discrepancy in the ratio κ(t | X = 1, W)/κ(t | X = 0, W) between levels of W is substantial, especially if larger ratios tend to occur within levels of W that are less prevalent. More generally, the loss in power of n^{-1/2} U_n can be large when a component of W that has a strong effect on the hazard of T is either omitted or mismodeled in such a way that the direction of its effect is not maintained. Further details on all of these situations can be found in Lagakos and Schoenfeld (1984). Morgan (1986) provides a correction to Lagakos and Schoenfeld's (1984) formula for the asymptotic relative efficiency of the log-rank test relative to the score test arising from a properly specified model for covariates. See also Lagakos (1988), who derived asymptotic relative efficiency formulae for the one-sample problem when evaluating the effect of a misspecified form of a time-dependent covariate.
4. Bias correction

When the distribution of the censoring variable depends on both treatment group and covariates, that is, when the conditions C ⊥ X | W and C ⊥ W | X both fail to hold, the statistic n^{-1/2} U_n in general has a non-zero asymptotic mean under H_0. One exception is when the model (2) is equal to κ(t | X, W), i.e., the working proportional hazards model is properly specified. DiRienzo and Lagakos (2001a, 2001b) present simulation results which demonstrate that the bias of tests based on n^{-1/2} U_n can be severe in this setting when the working proportional hazards model is misspecified. In an attempt to correct for this bias, DiRienzo and Lagakos (2001b) present a class of tests that are asymptotically standard normal under H_0 regardless of the joint distribution between C and (X, W), provided that either the conditional distribution of T given (X, W) or the conditional distribution of C given (X, W) is properly modeled. Consequently, these tests are more robust than those arising from n^{-1/2} U_n when the working model is misspecified, and do not appear to lose much efficiency when the working model is correctly specified and bias correction is unnecessary.

Consider the generalization of (1) given by

n^{-1/2} U_n^* = n^{-1/2} \sum_{i=1}^{n} \int_0^\infty G_n(t)\,\varphi(t; X_i, W_i)\,\{X_i - E_n^*(t)\}\, dN_i(t),   (3)

where

E_n^*(t) = \sum_{j=1}^{n} Y_j^*(t)\psi_n(Z_j) X_j \Big/ \sum_{j=1}^{n} Y_j^*(t)\psi_n(Z_j), \qquad Y_i^*(t) = Y_i(t)\,\varphi(t; X_i, W_i),
\varphi(t; X_i, W_i) = \min\{pr(C ≥ t \mid X_i = 0, W_i),\ pr(C ≥ t \mid X_i = 1, W_i)\} \big/ pr(C ≥ t \mid X_i, W_i),   (4)
for i = 1, . . . , n. Unlike the binary indicator variable Y_i(t) normally used in Cox's model, Y_i^*(t) can assume any value in the unit interval. Also, note that \varphi(t; X_i, W_i) is only defined when Z_i = W_i, i = 1, . . . , n. At each point in study time when a survival event occurs, this correction strives to remove any imbalances between treatment groups in the distribution of covariates that are caused solely by censoring. Mechanically, at study time t, the correction downweights, Y_i^*(t) < 1, those subjects in the risk set whose risk of censoring is higher in their opposite treatment group; those subjects whose risk of censoring is lower in their opposite treatment group are unweighted, Y_i^*(t) = Y_i(t) = 1. To see this analytically, note that under H_0, the conditional expectation of Y^*(t) given (X, W) is

\varphi(t; X, W)\, pr\{Y(t) = 1 \mid X, W\} = \varphi(t; X, W)\, pr(C ≥ t \mid X, W)\, pr(T ≥ t \mid W)
= \min\{pr(C ≥ t \mid X = 0, W),\ pr(C ≥ t \mid X = 1, W)\}\, pr(T ≥ t \mid W),

which is independent of X. The probability limit of E_n^*(t) under H_0 is thus

\frac{E\{Y^*(t)\psi(Z)X\}}{E\{Y^*(t)\psi(Z)\}} = \frac{E[X\psi(Z)E\{Y^*(t) \mid W\}]}{E[\psi(Z)E\{Y^*(t) \mid W\}]} = \pi,

where \pi = E(X). As shown in DiRienzo and Lagakos (2001b), n^{-1/2} U_n^* can be expressed under H_0 as

n^{-1/2} U_n^* = n^{-1/2} \sum_{i=1}^{n} A_i + o_p(1),

where

A_i = \int_0^\infty G(t)\,\varphi(t; X_i, W_i)\,(X_i - \pi)\Big\{dN_i(t) - Y_i(t)\psi(Z_i)\,\frac{E\{Y^*(t)\kappa(t \mid W)\}}{E\{Y^*(t)\psi(Z)\}}\, dt\Big\},
and the A_i are independent and identically distributed with mean zero. A consistent estimator of the variance of n^{-1/2} U_n^* is

V_n = \frac{1}{n}\sum_{i=1}^{n} \big(A_i^{(n)} - \bar{A}^{(n)}\big)^2,

where

A_i^{(n)} = \int_0^\infty G_n(t)\,\varphi(t; X_i, W_i)\,\big(X_i - \bar{X}\big)\Big\{dN_i(t) - \frac{Y_i(t)\psi_n(Z_i)}{\sum_{j=1}^{n} Y_j^*(t)\psi_n(Z_j)} \sum_{j=1}^{n}\varphi(t; X_j, W_j)\, dN_j(t)\Big\},   (5)
\bar{X} is the mean of {X_1, . . . , X_n} and \bar{A}^{(n)} is the mean of {A_1^{(n)}, . . . , A_n^{(n)}}. Hence, regardless of the joint distribution between C and (X, W), n^{-1/2} U_n^* / \sqrt{V_n} asymptotically has the standard normal distribution under H_0 whether or not the working model is properly specified. It follows that if the working model (2) is properly specified, n^{-1/2} U_n^* / \sqrt{V_n} is asymptotically standard normal under H_0 regardless of whether pr(C ≥ t | X, W) is properly specified and of the dependency between C and (X, W).

In practice, \varphi(·) will often be unknown. Let \hat\varphi(t; X_i, W_i) denote an estimator of \varphi(t; X_i, W_i). One would then calculate

\hat{U}_n^* = \sum_{i=1}^{n} \int_0^\infty G_n(t)\,\hat\varphi(t; X_i, W_i)\,\{X_i - \hat{E}_n^*(t)\}\, dN_i(t)

instead of (3) and

\hat{A}_i^{(n)} = \int_0^\infty G_n(t)\,\hat\varphi(t; X_i, W_i)\,\big(X_i - \bar{X}\big)\Big\{dN_i(t) - \frac{Y_i(t)\psi_n(Z_i)}{\sum_{j=1}^{n}\hat{Y}_j^*(t)\psi_n(Z_j)} \sum_{j=1}^{n}\hat\varphi(t; X_j, W_j)\, dN_j(t)\Big\}

instead of A_i^{(n)} in (5), where \hat{Y}_i^*(t) = Y_i(t)\hat\varphi(t; X_i, W_i), and \hat{E}_n^*(·) is obtained by substituting \hat{Y}_i^*(·) for Y_i^*(·) in E_n^*(·), i = 1, . . . , n. Denote this variance estimate by \hat{V}_n.

Some methods for estimating \varphi(t; X, W) are given in DiRienzo and Lagakos (2001b). These include the nonparametric regression methods of McKeague and Utikal (1990) as well as Cox's (1972) proportional hazards regression models. If the covariates are discrete with relatively few levels, then a stratified, left-continuous Kaplan–Meier estimator (Kaplan and Meier, 1958) of censoring can be calculated for each treatment group within each level of the covariate space. For example, an estimate for \varphi(t; X, W) can be obtained via the stratified proportional hazards model for \kappa_C(t | X, W),

\lambda^{(X)}(t)\,\exp\{\gamma^{(X)\prime} Z^C\},

where Z_i^C is some bounded function of Z_i^*, i = 1, . . . , n. The maximum partial-likelihood estimator, \hat\gamma_t^{(X)}, and the Breslow (1972, 1974) estimator of the baseline cumulative hazard function of censoring, \hat\Lambda^{(X)}(t), may then be calculated within each treatment group at each censoring time, using data accumulated before that time, and the continuous estimator

\hat{pr}(C_i ≥ t \mid X_i, Z_i^C) = \exp\{-\hat\Lambda^{(X_i)}(t)\,\exp(\hat\gamma_t^{(X_i)\prime} Z_i^C)\}
obtained by linear interpolation between censoring times of \hat\Lambda^{(X_i)}(t). Here estimation was stratified on X, but stratification may additionally be based on any covariate that might possibly have a strong interaction with treatment.

When \varphi(t; X, W) is estimated using a semiparametric or nonparametric model, \hat\varphi(t; X_i, W_i) contains estimates of an infinite-dimensional parameter, in which case a consistent estimate of the variance of n^{-1/2}\hat{U}_n^* would not necessarily be given by \hat{V}_n. However, given the choice for an estimate of \varphi(t; X, W), if it can be shown that \hat{U}_n^* is asymptotically linear, then the nonparametric bootstrap estimate of the variance of \hat{U}_n^* will be consistent (Gill, 1989). DiRienzo and Lagakos (2001b) have shown via simulation that when using a semiparametric proportional hazards model to calculate \hat\varphi(t; X_i, W_i), the variance estimate \hat{V}_n appears to be adequate.

For any given data set, there is no guarantee that it will be possible to specify and estimate \varphi(·) well enough to make the correction for a misspecified model for T reliable. It is thus of utmost importance to check and validate the fit of both the model for censoring and the model for survival. Some well-known techniques for checking the appropriateness of proportional hazards regression models are given in Lin et al. (1993) and Klein and Moeschberger (1997).

A related consideration in the use of bias-adjusted tests is their relative efficiency. When the working proportional hazards model (2) is properly specified, i.e., equal to κ(t | X, W), then the uncorrected, fully model-based test of H_0 is asymptotically valid regardless of the dependency between C and (X, W). In this situation, it is of interest to examine the relative efficiency of the corrected test to that of the uncorrected test and determine if there are situations for which unnecessary use of the corrected test could lead to loss in power. DiRienzo and Lagakos (2001b) provide formulae for the asymptotic mean and variance of n^{-1/2}\hat{U}_n and n^{-1/2}\hat{U}_n^* under the contiguous alternative H_n: α = c/\sqrt{n}, for some constant c, when the true hazard for T is given by κ(t | X_i, W_i) = exp(αX_i)ψ(β, W_i)h(t); that is, when the working proportional hazards model is properly specified and calculation of a corrected test is unnecessary since the uncorrected, fully model-based test of H_0 is asymptotically valid. In their accompanying simulations, the empirical relative efficiency of the corrected test to that of the uncorrected test appears to almost always be close to one.

Other choices for the functional form of \varphi(·) may be of interest; one example is \varphi(t; X, W) = 1/pr(C ≥ t | X, W). However, using simulations, DiRienzo and Lagakos (2001b) have found that this choice for \varphi(·) can be much less efficient than the choice (4). DiRienzo and Lagakos (2001b) also present an efficacy formula for the corrected test; this may be used to compare the efficiencies of tests using different choices for \varphi(·).

5. Discussion

Given the wide use of statistical tests based on Cox's regression model, especially in medical applications, and considering the importance of decisions that are reached from
these analyses, an understanding of their robustness to misspecification of the model is important. Misspecification can occur in many forms, including omitted or mismodeled covariates, the omission of treatment by covariate interactions, or a violation of the underlying proportionality assumption. While goodness-of-fit methods can be applied to check model fit (cf. Klein and Moeschberger, 1997), their failure to signal misspecification is no assurance that this is the case and, furthermore, their subjective and post-hoc nature can be problematic when a new treatment is being assessed; e.g., in clinical trials the standard practice is to precisely prespecify how treatment comparisons will be made.

This chapter has argued that a fundamental question in assessing such robustness is whether treatment group and the censoring variable are conditionally independent given the underlying covariates, or whether the underlying covariates associated with survival are conditionally independent of the censoring variable, given treatment group. When either of these conditions applies, then statistical tests arising from fitting a proportional hazards model, including the popular log-rank test, maintain their validity under misspecification of the model relating treatment and these covariates to the hazard function for survival. That is, when either condition holds, the resulting test statistic, when standardized by a robust variance estimator, has a distribution under the null hypothesis of no treatment effect that is asymptotically standard normal, regardless of whether or not the model is correctly specified. For the special case of the log-rank test, use of the model-based variance estimator to standardize the score statistic arising from the assumed model also leads to the desired asymptotic behavior under the null hypothesis. Thus, establishment of either of these conditions ensures that the size, or Type I error, associated with such tests is not distorted as a result of model misspecification. Moreover, one or both of the conditions can in practice often either be checked empirically or concluded to hold based on the analyst's knowledge of the circumstances that lead to censored observations.

When neither condition holds, that is, when neither treatment nor the underlying covariates are conditionally independent of time to censoring, then tests based on fitting a proportional hazards model can be asymptotically biased under the null hypothesis. Since in practice the significance levels used to evaluate these tests invariably rest on their presumed asymptotic normality, the size of such tests can be seriously biased when the working proportional hazards model is misspecified. To avoid or minimize such biases, a class of bias-corrected tests can readily be adopted. These tests require knowledge or estimation of \varphi(t; X, W), a function of the conditional distribution of censoring. Based on asymptotic considerations and simulations, the corrected test works well in a variety of settings, even when the estimated form of \varphi(t; X, W) is only approximately correct. That is, misspecification of the function \varphi(t; X, W) appears to be far less critical for the bias-corrected test than is misspecification of the underlying hazard model for the uncorrected test. Furthermore, use of a bias-corrected test when one is unnecessary – that is, when the working proportional hazards model happens to be correctly specified – does not appear to result in much loss in efficiency.

Thus, when there is any suspicion that the key conditions for robustness may be violated, use of the bias-adjusted tests instead of or as a complement to standard methods is advised. To facilitate the computation of the adjusted tests, Appendix A gives MATLAB code for these and the uncorrected tests.
Acknowledgement

This work was supported in part by grant AI24643 from the US National Institutes of Health.
Appendix A: MATLAB code for computing statistical tests

We provide below MATLAB code for calculating the uncorrected and corrected score tests presented in this paper. The version of MATLAB used is 5.3.1 (R11.1) along with the Statistics (Version 2.2, R11) and Optimization (Version 2.0, R11) toolboxes. The uncorrected test is calculated using the model-based variance estimator, which is consistent when the working proportional hazards model for κ(t | X, W) is properly specified or when the log-rank test is used as the uncorrected test. The corrected test is calculated with G_n(t) = 1 and using a stratified (by treatment group) proportional hazards model for the conditional distribution of C given (X, W), with ψ_n(Z) = exp(β̂′Z), where β̂ is the restricted maximum partial likelihood estimate of β under H_0. We note, however, that the code can be modified to accommodate other choices for these functions as well as for covariates other than those used below to illustrate the methods.

The observed data consist of the five n × 1 column vectors T0, d, x, Z1, Z2, where T0 corresponds to {T_i^*}, d to {δ_i}, x to {X_i}, Z1 is the first component of {Z_i^*}, say {Z_{1i}}, and Z2 the second, say {Z_{2i}}, i = 1, . . . , n. Suppose that one wanted to adjust for the covariates I(Z1 < 0), Z2^2 in the model for T, and calculate a corrected test using a proportional hazards model for C that was conditional on X, |Z1|^{-1/2}, Z2. Then the MATLAB call would be

[un1, cor1] = SC(T0, d, x, [(Z1 < 0), (Z2.^(2))], [((abs(Z1)).^(-.5)), Z2]);

where the output 1 × 2 row vector un1 consists of the uncorrected score statistic and score test; similarly, cor1 consists of the corrected score statistic and score test. The code for the function SC.m and the two functions it calls, rPLgh.m and BRES.m, is given by:

function [un, cor] = SC(TT, dd, xx, Z1, Z2)
% computes uncorrected *un* and corrected *cor* score statistics
% and tests
% *TT* is the column vector of N possibly right-censored event
% times
% there are assumed to be no TIES in *TT*
% *dd* is the column vector of N indicators I(T<=C)
% *xx* is the column vector of N treatment group indicators
% *Z1* is the N x p matrix of covariates for *T*
% *Z2* is the N x p matrix of covariates for *C*
%---------------------------------------------------------------
% first get the restricted MPLE for T
%---------------------------------------------------------------
global T d x Z;
T=TT; d=dd; x=xx; Z=Z1;
N = length(T); p = size(Z,2); th = zeros(1,p);
options = optimset('GradObj','on','Display','off');
rmple = fsolve('rPLgh',th,options);
clear global
%---------------------------------------------------------------
% calculate the MPLE and Breslow estimate of the baseline
% cumulative hazard of censoring within each treatment group
%---------------------------------------------------------------
global T d Z;
T=TT(xx==0); d=1-dd(xx==0); Z=Z2(xx==0,:);
N = length(T); p = size(Z,2); th = zeros(1,p);
options = optimset('GradObj','on','Display','off');
mple0 = fsolve('rPLgh',th,options);
[L0w,c0] = BRES(T,d,Z,mple0);
clear global
%---------------------------------------------------------------
global T d Z;
T=TT(xx==1); d=1-dd(xx==1); Z=Z2(xx==1,:);
N = length(T); p = size(Z,2); th = zeros(1,p);
options = optimset('GradObj','on','Display','off');
mple1 = fsolve('rPLgh',th,options);
[L1w, c1] = BRES(T,d,Z,mple1);
clear global
%---------------------------------------------------------------
T = TT; d = dd; x = xx; Z = Z1; p = size(Z,2);
I = zeros((p+1),(p+1)); U = 0; Ur = 0;
Ts = T.*d; N = length(T); K = sum(d);
eBz = exp((Z*(rmple'))'); Xb = mean(x);
wr = zeros(N,1); Wr = zeros(N,1);
temp0 = 0; dN = 0; YseM = zeros(N,1);
%---------------------------------------------------------------
% calculate the baseline cumulative hazards at each observed
% failure time for each treatment group by linear interpolation
%---------------------------------------------------------------
L0 = interp1(c0,L0w, min(max(c0),T));
L1 = interp1(c1,L1w, min(max(c1),T));
%---------------------------------------------------------------
% test statistic
%---------------------------------------------------------------
for mm=1:N
 if (Ts(mm)>0)
  Y = (T>=Ts(mm)); Y0 = Y.*(1-x); Y1 = Y.*x;
  %--------------------------------------------------------------
  % treatment-specific survival functions of censoring
  %--------------------------------------------------------------
  F0 = exp(-L0(mm)*exp(Z2*mple0'));
  F1 = exp(-L1(mm)*exp(Z2*mple1'));
  F0 = F0 + (F0==0).*eps; F1 = F1 + (F1==0).*eps;
  phi0r = ((min([F1';F0']))./F0')';
  phi1r = ((min([F1';F0']))./F1')';
  %--------------------------------------------------------------
  % uncorrected test
  %--------------------------------------------------------------
  meBz = (eBz'.*Y)*ones(1,p+1);
  s0 = (eBz*Y)/N;
  s1 = (sum(meBz.*[x,Z]))/N;
  s2 = ([x,Z]'*(meBz.*[x,Z]))/N;
  vz = ((s2/s0) - ((s1/s0)'*(s1/s0)));
  I = I + vz/N;
  sc = ([x(mm),Z(mm,:)] - (s1/s0));
  U = U + sc(1);
  %--------------------------------------------------------------
  % corrected test
  %--------------------------------------------------------------
  Y0n=Y0.*phi0r; Y1n=Y1.*phi1r; Ys = Y0n + Y1n;
  YseM = [YseM, (eBz'.*Ys)];
  meBz = (eBz'.*Ys)*ones(1,p+1);
  s0 = (eBz*Ys)/N; temp0 = [temp0, s0];
  s1 = (sum(meBz.*[x,Z]))/N;
  s2 = ([x,Z]'*(meBz.*[x,Z]))/N;
  E = s1/s0;
  Ur = Ur + ( Ys(mm)*(x(mm)-E(1)) );
  dN = [dN, Ys(mm)];
  wr(mm) = Ys(mm)*( x(mm) - Xb );
 end
end
temp0(1)=[];
dN(1)=[]; YseM(:,1)=[]; I = N*I;
%---------------------------------------------------------------
% calculate sample version of the iid terms (A)
%---------------------------------------------------------------
resr=sum((((x-Xb)*ones(1,K)).*(((YseM)./(ones(N,1)*temp0)).*(ones(N,1)*(dN/N))))');
Wr = wr - resr';
%---------------------------------------------------------------
% model-based variance estimate of uncorrected test
%---------------------------------------------------------------
aa=I((2:p+1),(2:p+1));
iiI=inv(aa); % iiI=aa\eye(size(aa)); may be more efficient
V = I(1,1)-(I(1,(2:p+1))*iiI*I((2:p+1),1));
%---------------------------------------------------------------
% variance estimate of corrected test
%---------------------------------------------------------------
Rrm = sum( (Wr-mean(Wr)).^(2) );
un = [U, U/sqrt(V)];
cor = [Ur, Ur/sqrt(Rrm)];
%---------------------------------------------------------------
% NOTE: to calculate log-rank, set rmple=zeros(1,p) and
% V = I(1,1);

function [dL, ddL] = rPLgh(th)
% computes the gradient and Hessian of Cox's partial likelihood
% at *th*
% *th* is the (p+1) row vector of coefficients
% *T* is the column vector of N possibly right-censored event
% times
% *d* is the column vector of N indicators I(T<=C)
% *Z* is the N-by-p matrix of baseline covariates
global T d Z;
N = length(T); p = size(Z,2);
I = zeros(p,p); U = zeros(1,p);
%---------------------------------------------------------------
% compute S_0(th,t), S_1(th,t) and S_2(th,t) at each event time
Bz = Z*(th'); eBz = exp(Bz');
Ts = T.*d;
for n=1:N
 if (Ts(n)>0)
  Y = (T>=Ts(n));
  meBz = (eBz'.*Y)*ones(1,p);
  s0 = (eBz*Y)/N;
  s1 = (sum(meBz.*Z))/N;
  s2 = (Z'*(meBz.*Z))/N;
  vz = (s2/s0) - ((s1/s0)'*(s1/s0));
  sc = Z(n,:) - (s1/s0);
  U = U + sc;
  I = I + vz;
 end
end
dL = U';
ddL = -I;
%---------------------------------------------------------------
function [LL, tt] = BRES(T,d,z,b)
% computes Breslow's estimate of the baseline cumulative
% hazard fn
% *T* is the column vector of N possibly right-censored event
% times
% *d* is the column vector of N indicators I(T<=C)
% assumes no ties in the data
% *z* is the N x p matrix of covariates
% *b* is the 1 x p vector of regression coefficients
Ts=T.*d; Ts=Ts(Ts>0); Ts=sort(Ts);
n=length(Ts); L=1:n;
eb = exp(z*b');
for mm=1:n
 L(mm) = 1/sum( (T>=Ts(mm)).*eb );
end
tt=[0,Ts'];
LL=[0,cumsum(L)];
References

Breslow, N.E. (1972). Discussion of the paper by D.R. Cox. J. Roy. Statist. Soc. B 34, 216–217.
Breslow, N.E. (1974). Covariance analysis of censored survival data. Biometrics 30, 89–99.
Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. B 34, 187–220.
Cox, D.R., Oakes, D.O. (1984). Analysis of Survival Data. Chapman and Hall, London.
DiRienzo, A.G., Lagakos, S.W. (2001a). Effects of model misspecification on tests of no randomized treatment effect arising from Cox's proportional hazards model. J. Roy. Statist. Soc. B 63, 745–757.
DiRienzo, A.G., Lagakos, S.W. (2001b). Bias correction for score tests arising from misspecified proportional hazards regression models. Biometrika 88, 421–434.
Gill, R.D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (Part 1). Scand. J. Statist. 16, 97–128.
Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
Kaplan, E.L., Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53, 457–481.
Klein, J.P., Moeschberger, M.L. (1997). Survival Analysis – Techniques for Censored and Truncated Data. Springer, New York.
Kong, F.H., Slud, E. (1997). Robust covariate-adjusted log-rank tests. Biometrika 84, 847–862.
Lagakos, S.W. (1988). The loss in efficiency from misspecifying covariates in proportional hazards regression models. Biometrika 75, 156–160.
Lagakos, S.W., Schoenfeld, D.A. (1984). Properties of proportional-hazards score tests under misspecified regression models. Biometrics 40, 1037–1048.
Lin, D.Y., Wei, L.J. (1989). The robust inference for the Cox proportional hazards model. J. Amer. Statist. Assoc. 84, 1074–1078.
Lin, D.Y., Wei, L.J., Ying, Z. (1993). Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80, 557–572.
McKeague, I.W., Utikal, K.J. (1990). Inference for a nonlinear counting process regression model. Ann. Statist. 18, 1172–1187.
Morgan, T. (1986). Omitting covariates from the proportional hazards model. Biometrics 42, 993–995.
Slud, E. (1991). Relative efficiency of the log rank test within a multiplicative intensity model. Biometrika 78, 621–630.
Tsiatis, A.A., Rosner, G.L., Tritchler, D.L. (1985). Group sequential tests with censored survival data adjusting for covariates. Biometrika 72, 365–373.
Ch. 23
Statistical Modeling in Survival Analysis and Its Influence on the Duration Analysis
Vilijandas Bagdonavičius and Mikhail Nikulin
1. Introduction
Survival regression models relate the lifetime distribution to explanatory variables (covariates). These models are used for estimation of the effect of covariates on survival and for estimation of survival under given covariate values. For example, the effect of the treatment method, sex, alcohol consumption, or blood test indices on survival may be analyzed. The most popular and most applied model is the proportional hazards model (also called the Cox model), introduced by Cox (1972). The popularity of this model is based on the fact that there exist simple semi-parametric estimation procedures which can be used when the form of the survival distribution function is not specified, see Cox (1975), Andersen (1991), Andersen et al. (1993). On the other hand, the Cox model is rather restrictive and is not applicable when the ratios of hazard rates under different fixed covariates are not constant in time. The hazard rates may approach each other, move apart, or even intersect. In such a case more sophisticated models are needed. We discuss them in the following sections.

Survival regression models may also be used for the analysis of failure time regression data in reliability, and are applied in econometrics, demography and any other field where the effect of explanatory variables on the time up to the occurrence of some event is observed. The explanatory variables may be modeled by stochastic processes, deterministic time functions or constants (possibly different for different individuals).

Denote by x(·) = (x_1(·), . . . , x_m(·))^T : [0, ∞) → R^m a deterministic time function (possibly multidimensional) which is a vector of covariates itself or a realization of a stochastic process X(·) = (X_1(·), . . . , X_m(·))^T when covariates are modeled by this stochastic process. If x(·) is constant in time, x(t) ≡ x, then we shall write x instead of x(·). The distribution of survival under covariates can be defined by the survival, cumulative distribution, or probability density function. Nevertheless, the sense of the models is best seen if they are formulated in terms of the hazard rate function.
Denote by T the failure time. Then the survival and the hazard rate functions given x(·) are

S_{x(·)}(t) = P\{T ≥ t \mid x(u), 0 ≤ u ≤ t\},

α_{x(·)}(t) = \lim_{h \downarrow 0} \frac{1}{h} P\{T \in [t, t+h) \mid T ≥ t,\ x(u), 0 ≤ u ≤ t\} = -\frac{S'_{x(·)}(t)}{S_{x(·)}(t)}.

Denote by

A_{x(·)}(t) = \int_0^t α_{x(·)}(u)\, du = -\ln S_{x(·)}(t)

the cumulative hazard under x(·). Each specified model relates the hazard rate (or survival function) to the explanatory variable in some particular way.
2. The Cox or the proportional hazards model

Under the proportional hazards (PH) model the hazard rate under a covariate realization x(·) has the form

α_{x(·)}(t) = r\{x(t)\}\, α_0(t),   (1)

where α_0(t) is a baseline hazard rate function and r(·) is a positive function. The model implies that the ratio R(t, x_1, x_2) of hazard rates under different fixed constant covariates x_1 and x_2 is constant over time:

R(t, x_1, x_2) = \frac{α_{x_2}(t)}{α_{x_1}(t)} = \frac{r\{x_2\}}{r\{x_1\}} = const.

In most applications the function r is parametrized in the form r(x) = \exp(β^T x), where β = (β_1, . . . , β_m)^T is the vector of regression parameters. Under this parametrization we obtain the classical semi-parametric Cox model with time-dependent covariables:

α_{x(·)}(t) = e^{β^T x(t)}\, α_0(t).   (2)

Usually the Cox model is considered as semi-parametric: the finite-dimensional parameter β and the baseline hazard function α_0 are supposed to be completely unknown. Nevertheless, non-parametric estimation procedures in which the function r is also supposed to be unknown are sometimes used. Parametric estimation procedures in which α_0 is taken from some parametric class of functions are scarcely used because the parametric accelerated failure time model (see the following sections) is also simple to analyze and more natural.
The PH model is not much used to analyze failure time regression data in reliability. The reason is that the model is not natural when subjects are aging. Indeed, formula (1) implies that for any t the hazard rate under the time-varying covariate x(·) at the moment t does not depend on the values of the covariate x(·) before the moment t but only on its value at this moment. Nevertheless, in survival analysis the PH model usually works quite well, because the values of covariates under which estimation of survival is needed are in the range of covariate values used in experiments. So the use of a not very exact but simple model is often preferable to the use of a more adequate but complicated model. The situation is similar to the application of linear regression models in classical regression analysis: the mean of the dependent variable is rarely a linear function of the independent variables, but the linear approximation works reasonably well in some range of independent variable values.

In reliability, and accelerated life testing in particular, the choice of a good model is much more important than in survival analysis. For example, in accelerated life testing units are tested under accelerated stresses which shorten the life. Using such experiments the life under the usual stress is estimated using some regression model. The value of the usual stress is not in the range of the values of the accelerated stresses, so if the model is misspecified, the estimators of survival under the usual stress may be very bad. If on the basis of graphical analysis or goodness-of-fit tests the PH model is rejected and one has a reason to suppose that the ratios of hazard rates are not constant, other models should be used.
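As a small numerical illustration of (1)–(2) (added here; the Weibull baseline and the parameter values are assumptions for the example only, not part of the chapter), the following MATLAB lines evaluate two survival curves and their constant hazard ratio:

% Minimal sketch of the PH model (2) with constant covariates: Weibull baseline
% alpha0(t) = a*t^(a-1), so A0(t) = t^a and S_x(t) = exp(-exp(beta'*x)*A0(t)).
a = 1.5; beta = [0.7; -0.3];        % assumed baseline shape and regression coefficients
x1 = [1; 0]; x2 = [0; 1];           % two fixed covariate vectors
t = linspace(0, 5, 101);
A0 = t.^a;                          % baseline cumulative hazard
S1 = exp(-exp(beta'*x1) * A0);      % survival under x1
S2 = exp(-exp(beta'*x2) * A0);      % survival under x2
R  = exp(beta'*(x2 - x1));          % hazard ratio R(t,x1,x2), constant in t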
3. Accelerated failure time model

The PH model has the absence of memory property: the hazard rate at any moment does not depend on the values of the covariate before this moment. It is more natural to suppose that the hazard rate at any moment t should depend not only on the value of the covariate at this moment but also on the probability of surviving up to this moment. Under the covariate x(·) this probability is S_{x(·)}(t). It characterizes the summing effect of covariate values in the interval [0, t] on survival. The equality A_{x(·)}(t) = -\ln S_{x(·)}(t) implies that the cumulative hazard also characterizes this summing effect. So it can be supposed that the hazard rate at any moment t is a function of the covariate value x(t) and the value of the cumulative hazard A_{x(·)}(t). The generalized Sedyakin's (GS) model supposes (see Sedyakin (1966), Bagdonavičius (1978)):

α_{x(·)}(t) = g\{x(t), A_{x(·)}(t)\}.   (3)

This model with g completely unknown is too general for statistical inference. But if we choose some regression model for constant covariates, the form of the function g can be made more concrete. Suppose that under different constant covariates x ∈ E_0 the survival functions differ only in scale:

S_x(t) = S_0\{r(x)\, t\}.   (4)
If the GS model holds on a set E, E_0 ⊂ E, of covariates then (4) holds on E_0 if and only if the function g has the form g(x, s) = r(x)q(s) (see Bagdonavičius (1978)). We obtain the following model:

α_{x(·)}(t) = r\{x(t)\}\, q\{A_{x(·)}(t)\}.   (5)

Solving this differential equation with respect to A_{x(·)}(t), and using the relation between the survival and the cumulative hazard functions, we obtain that the survival function has the form

S_{x(·)}(t) = S_0\Big\{\int_0^t r\{x(u)\}\, du\Big\},   (6)

where the function S_0 does not depend on x(·). The function r locally changes the time scale. The model (6) (or, equivalently, (5)) is called the additive accumulation of damages model (Bagdonavičius (1978)) or the accelerated failure time (AFT) model (Cox and Oakes (1984)).

The function r is often parametrized in the following form: r(x) = e^{-β^T x}, where β = (β_1, . . . , β_m)^T is a vector of unknown parameters. Under the parametrized AFT model the survival function is

S_{x(·)}(t) = S_0\Big\{\int_0^t e^{-β^T x(u)}\, du\Big\},   (7)

the hazard rate is

α_{x(·)}(t) = e^{-β^T x(t)}\, α_0\Big\{\int_0^t e^{-β^T x(u)}\, du\Big\},   (8)
and for constant covariates

S_x(t) = S_0\{e^{-β^T x}\, t\}.

So in the case of constant covariates the AFT model can also be written as a loglinear model, since the logarithm of the failure time T_x under a constant covariate x can be written as

\ln\{T_x\} = β^T x + ε,   (9)

where the survival function of the random variable ε does not depend on x and is S(t) = S_0(e^t). In the case of a lognormal failure-time distribution the distribution of ε is normal and we have the standard linear regression model. The equality (8) implies that if the survival function under any constant covariate belongs to parametric families such as Weibull, loglogistic or lognormal, then the survival function under any other constant covariate also belongs to that family. Differently from the PH model, the AFT model is mostly applied in survival analysis as a parametric model: the function S_0 (or the distribution of ε) is taken from some
parametric class of distributions, and the parameters to estimate are the parameters of this class and the regression parameters β. In the case of semi-parametric estimation the function S_0 is supposed to be completely unknown, and the regression parameters as well as the function S_0 are the parameters to estimate in the model (7). The semi-parametric AFT model is much less used in survival analysis than the Cox model because of complicated estimation procedures: modified variants of likelihood functions are not differentiable and even not continuous functions, and the limit covariance matrices of the normed regression parameters depend on the derivatives of the probability density functions, so their estimation is complicated.

The parametric AFT model is used in failure time regression analysis and accelerated life testing. Under special experiment plans even non-parametric estimation procedures are used. In such a case not only the function S_0 but also the function r in the model (6) would be completely unknown. The AFT model is a good choice when the lifetime distribution class is supposed to be known. Nevertheless, it is as restrictive as the PH model. The assumption that the survival distributions under different covariate values differ only in scale is a rather strong assumption. Therefore, more sophisticated models are also needed.
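As a small illustration of (7) (added here; the step covariate, Weibull baseline and parameter values are assumptions for the example only), the integral can be evaluated in closed form when the covariate switches once:

% Minimal sketch of the AFT model (7) with a step covariate: x(u) = 0 for
% u < tchange and x(u) = 1 for u >= tchange, scalar beta, Weibull baseline
% S0(s) = exp(-s^a).
a = 1.2; beta = 0.8; tchange = 2;        % assumed shape, coefficient, switch time
t = linspace(0, 10, 201);
% f(t) = int_0^t exp(-beta*x(u)) du, closed form for the step covariate
f = t.*(t < tchange) + (tchange + exp(-beta)*(t - tchange)).*(t >= tchange);
S = exp(-f.^a);                          % S_{x(.)}(t) = S0(f(t))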
4. Generalized proportional hazards model

4.1. Definitions

The AFT and PH models are rather restrictive. Under the PH model lifetime distributions under constant covariates are from a narrow class of distributions: the ratio of the hazard rates under any two different constant covariates is constant over time. Under the AFT model the covariate changes (locally, if the covariate is not constant) only the scale. Generalized proportional hazards (GPH) models allow the ratios of the hazard rates under constant covariables to be not only constant but also increasing or decreasing. They include the AFT and PH models as particular cases.

As was discussed in the previous section, the survival function S_{x(·)}(t) (or, equivalently, the cumulative hazard function A_{x(·)}(t)) characterizes the summing effect of covariate values in the interval [0, t] on survival. Suppose that the hazard rate at any moment t is proportional not only to a function of the covariate applied at this moment and to a baseline rate, but also to a function of the probability of survival until t (or, equivalently, to the cumulative hazard at t):

α_{x(·)}(t) = r\{x(t)\}\, q\{A_{x(·)}(t)\}\, α_0(t).   (10)

We call the model (10) the generalized proportional hazards (GPH) model, see Bagdonavičius and Nikulin (1999). Particular cases of the GPH model are the PH model (q(u) ≡ 1) and the AFT model (α_0(t) ≡ α_0 = const). Under the GPH model the survival functions S_{x(·)} have the form

S_{x(·)}(t) = G\Big\{\int_0^t r\{x(τ)\}\, dA_0(τ)\Big\},   (11)

where

A_0(t) = \int_0^t α_0(u)\, du, \qquad G = H^{-1}, \qquad H(u) = \int_0^{-\ln u} \frac{dv}{q(v)},

and H^{-1} denotes the inverse function of H.
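As a worked illustration of these definitions (added here; it follows directly from (11) and anticipates the third GPH model of Section 4.3.3), take q(u) = e^{-γu} with γ > 0. Then

H(u) = \int_0^{-\ln u} e^{\gamma v}\, dv = \frac{u^{-\gamma} - 1}{\gamma},
\qquad
G(s) = H^{-1}(s) = (1 + \gamma s)^{-1/\gamma},

so that, by (11), a constant covariate x gives S_x(t) = \{1 + \gamma\, r(x) A_0(t)\}^{-1/\gamma}. This G is (up to parametrization) the Laplace transform of a gamma-distributed frailty variable, consistent with the frailty interpretation given below in Section 4.3.3.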
4.2. Relations with the linear transformations and frailty models

Models of different levels of generality can be obtained by completely specifying q, parametrizing q, or considering q as unknown. Completely specifying q we obtain rather strict models which are alternatives to the PH model, and the field of their application is relatively narrow (see Bagdonavičius and Nikulin (1994)). Under constant covariates such models are the linear transformation (LT) models. Indeed, if q is specified and r is parametrized by r(x) = e^{β^T x}, then under constant covariables the survival functions have the form S_x(t) = G\{e^{β^T x} A_0(t)\} with G specified. This implies that the random variable T_x can be transformed by the function h(t) = \ln\{H(S_0(t))\} to a random variable of the form

h(T_x) = -β^T x + ε,   (12)

where ε is a random error with the parameter-free distribution function Q(u) = 1 - G(e^u). It is the linear transformation (LT) model of Dabrowska and Doksum (1988). Examples of the LT models:

(1) the PH model (G is a Weibull survival function, ε has the extreme value distribution);
(2) the logistic regression model (G is a loglogistic survival function, ε has the loglogistic distribution):

\frac{1}{S_x(t)} - 1 = r(x)\Big\{\frac{1}{S_0(t)} - 1\Big\};

(3) the generalized probit model (G is a lognormal survival function, ε has the normal distribution):

\Phi^{-1}\{S_x(t)\} = \log r(x) + \Phi^{-1}\{S_0(t)\},

where Φ is the standard normal cumulative distribution function.

The last two models are alternatives to the PH model. They are widely used for analysis of dichotomous data when the probability of "success" in dependence on some factors is analyzed. If application of the PH model is dubious then it is better to use a (not very much) wider GPH model, obtained from the general GPH model not by complete specification of the function q but by taking a simple parametric model for it.

Let us consider relations between the GPH models and the frailty models (Hougaard (1986)) with covariates. The hazard rate can be influenced not only by the observable covariate x(·) but also by a non-observable positive random covariate Z, called the frailty variable. Suppose that the hazard rate given the frailty variable value is

α_{x(·)}(t \mid Z = z) = z\, r\{x(t)\}\, α_0(t).
Then
S_{x(·)}(t) = E \exp\Big\{-Z \int_0^t r\{x(τ)\}\, dA_0(τ)\Big\} = G\Big\{\int_0^t r\{x(τ)\}\, dA_0(τ)\Big\},
where G(s) = E e^{-sZ}. So the GPH model can be defined by specification of the frailty variable distribution.

4.3. The GPH models with monotone hazard ratios

The following parametrizations of r and q give submodels of the GPH model with monotone ratios of hazard rates under constant covariates. Using only one parameter and power or exponential functions for the parametrization of q, several important models are obtained.

4.3.1. The first GPH model

Suppose that q(0) = 1 (if it is not so, we can include q(0) in α_0, which is considered as unknown). Taking a power function q(u) = (1 + u)^{-γ+1} and r(x) = e^{β^T x} we obtain the first GPH model:

α_{x(·)}(t) = e^{β^T x(t)} \{1 + A_{x(·)}(t)\}^{-γ+1}\, α_0(t).   (13)

It coincides with the PH model when γ = 1. The supports of the survival functions S_{x(·)} are [0, ∞) when γ ≥ 0 and [0, sp_{x(·)}), with finite right ends sp_{x(·)} < ∞, when γ < 0. Finite supports are very possible in accelerated life testing: failures of units at different accelerated stresses are concentrated in intervals with different finite right limits.

Suppose that at the point t = 0 the ratio R(t, x_1, x_2) of the hazard rates under constant covariates x_1 and x_2 is greater than 1:

R(0, x_1, x_2) = \frac{r(x_2)}{r(x_1)} = c_0 > 1.

The ratio R(t, x_1, x_2) has the following properties:

(a) if γ > 1, then the ratio of the hazard rates decreases from the value c_0 > 1 to the value c_∞ = c_0^{1/γ} ∈ (1, c_0);
(b) if γ = 1 (PH model), the ratio of the hazard rates is constant;
(c) if 0 ≤ γ < 1, then the ratio of the hazard rates increases from the value c_0 > 1 to the value c_∞ = c_0^{1/γ} ∈ (c_0, ∞);
(d) if γ < 0, then the ratio of the hazard rates increases from the value c_0 > 1 to ∞, and the infinity is attained at the point sp_{x_2} = A_0^{-1}\{-1/(γ r(x_2))\}.

The first GPH model is a generalization of the positive stable frailty model with explanatory variables: the GPH model with γ = 1/α > 0 is obtained by taking the frailty variable Z which follows the positive stable distribution with the density

p_Z(z) = -\frac{1}{\pi z}\, e^{-\alpha z + 1} \sum_{k=1}^{\infty} \frac{(-1)^k \Gamma(\alpha k + 1)}{k!}\, \frac{\sin(\pi \alpha k)}{z^{\alpha k}}, \qquad z > 0,

where α is a stable index, 0 < α < 1.
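The limits in (a)–(d) can be checked directly (a short derivation added here for clarity). For a constant covariate x, integrating dA_x(t) = e^{β^T x}\{1 + A_x(t)\}^{1-γ}\, dA_0(t) gives, for γ ≠ 0,

\{1 + A_x(t)\}^{\gamma} = 1 + \gamma\, e^{\beta^T x} A_0(t),
\qquad
\alpha_x(t) = e^{\beta^T x}\,\{1 + \gamma\, e^{\beta^T x} A_0(t)\}^{(1-\gamma)/\gamma}\,\alpha_0(t),

so that

R(t, x_1, x_2) = c_0 \left\{\frac{1 + \gamma\, e^{\beta^T x_2} A_0(t)}{1 + \gamma\, e^{\beta^T x_1} A_0(t)}\right\}^{(1-\gamma)/\gamma} \longrightarrow c_0 \cdot c_0^{(1-\gamma)/\gamma} = c_0^{1/\gamma} \quad \text{as } A_0(t) \to \infty \ (\gamma > 0),

while for γ < 0 the factor 1 + γ e^{β^T x_2} A_0(t) reaches zero first, at t = sp_{x_2} = A_0^{-1}\{-1/(γ e^{β^T x_2})\}, where the ratio diverges.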
4.3.2. The second GPH model

Under the first GPH model the support of the survival functions is infinite when γ ≥ 0 and finite when γ < 0; the limiting case is γ = 0. So it is interesting to take a model with the parametrization q(u) = (1 + γu)^{-1}. We obtain the second GPH model:

α_{x(·)}(t) = e^{β^T x(t)} \{1 + γ A_{x(·)}(t)\}^{-1}\, α_0(t)  (γ ≥ 0).   (14)

It also coincides with the PH model when γ = 0. The supports of the survival functions S_{x(·)} are [0, ∞). The ratio R(t, x_1, x_2) = α_{x_2}(t)/α_{x_1}(t) has the following properties:

(a) if γ > 0, then the ratio of the hazard rates decreases from c_0 > 1 to the value \sqrt{c_0} ∈ (1, c_0);
(b) if γ = 0 (PH model), the ratio of the hazard rates is constant.

The second GPH model is equivalent to the inverse Gaussian frailty model with explanatory variables: the GPH model with γ = (4σθ)^{1/2} > 0 is obtained by taking the frailty variable Z which follows the inverse Gaussian distribution with the density

p_Z(z) = \Big(\frac{σ}{π}\Big)^{1/2} e^{\sqrt{4σθ}}\, z^{-3/2}\, e^{-θz - σ/z}, \qquad z > 0.
4.3.3. The third GPH model

Taking the exponential function q(u) = e^{-γu} and r(x) = e^{β^T x} we obtain the third GPH model:

α_{x(·)}(t) = e^{β^T x(t) - γ A_{x(·)}(t)}\, α_0(t).   (15)

It coincides with the PH model when γ = 0. The supports of the survival functions S_{x(·)} are [0, ∞) when γ ≥ 0 and [0, sp_{x(·)}) with finite right ends when γ < 0. Suppose that R(0, x_1, x_2) = r(x_2)/r(x_1) = c_0 > 1. The ratio R(t, x_1, x_2) has the following properties:

(a) if γ > 0, then the ratio of the hazard rates decreases from the value c_0 > 1 to 1, i.e., the hazard rates approach one another and meet at infinity;
(b) if γ = 0 (PH model), the ratio of the hazard rates is constant;
(c) if γ < 0, then the ratio of the hazard rates increases from the value c_0 > 1 to ∞, and the infinity is attained at the point sp_{x_2} = A_0^{-1}\{-1/(γ r(x_2))\}.

The third GPH model is a generalization of the gamma frailty model with explanatory variables: the GPH model with γ = 1/k > 0 is obtained by taking the frailty variable Z which follows the gamma distribution with the density

p_Z(z) = \frac{z^{k-1}}{θ^k Γ(k)}\, e^{-z/θ}, \qquad z > 0.

All three GPH models are considered as semi-parametric: the finite-dimensional parameters β and γ and the unknown baseline function A_0 are the unknown parameters.
5. Regression models with cross-effects of survival functions

When analyzing survival data from clinical trials, cross-effects of survival functions are sometimes observed. A classical example is the well-known data concerning effects of chemotherapy (CH) and chemotherapy plus radiotherapy (CH + R) on the survival times of gastric cancer patients, see Stablein and Koutrouvelis (1985), Kleinbaum (1996), Klein and Moeschberger (1997), Bagdonavičius et al. (2002). At the beginning of treatment the mortality of CH + R patients is greater, but at a certain moment the survival functions of CH + R and CH patients intersect and later the mortality of CH patients is greater; i.e., if patients survive CH + R therapy during a certain period then later this treatment is more beneficial than CH therapy. Doses of CH and R therapy can be different, so regression data with not necessarily dichotomous covariates may be gathered. Let us consider models for the analysis of data with cross-effects of survival functions under constant covariates.

5.1. First model with cross-effects of survival functions

The first model with cross-effects of survival functions (CE model) can be obtained from the first GPH model considered in the previous section by replacing the scalar parameter γ by e^{γ^T x(t)} in the formula (13), where γ is m-dimensional (see Bagdonavičius and Nikulin (2002)):

α_x(t) = e^{β^T x(t)} \{1 + A_x(t)\}^{1 - e^{γ^T x(t)}}\, α_0(t), \qquad γ = (γ_1, . . . , γ_m)^T.   (16)
Suppose that at the point t = 0 the ratio of the hazard rates R(t, x_1, x_2) = α_{x_2}(t)/α_{x_1}(t) under constant covariates x_1 and x_2 is greater than 1:

R(0, x_1, x_2) = e^{β^T(x_2 - x_1)} = c_0 > 1,

and that γ^T(x_1 - x_2) < 0.
In this case the ratio R(t, x_1, x_2) decreases from the value c_0 > 1 to 0, i.e., the hazard rates intersect once. The survival functions S_{x_1} and S_{x_2} also intersect once in the interval (0, ∞) (for more about this see Bagdonavičius and Nikulin (2002)). Other CE models can be obtained using the same procedure for the second and the third GPH models.

5.2. Second CE model

Hsieh (2001) considered the following generalization of the PH model with cross-effects of the survival functions:

Λ_x(t) = e^{β^T x(t)} \{Λ_0(t)\}^{e^{γ^T x(t)}}.   (17)

It is a generalization of the PH model taking the power e^{γ^T x(t)} of Λ_0(t) instead of the power 1.
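A minimal numerical illustration of the cross-effect implied by (17) (added here; the unit exponential baseline and parameter values are assumptions for the example only):

% Survival functions under the second CE model (17) with Lambda0(t) = t and a
% single binary covariate; illustrative values beta = 0.5, gamma = -0.3.
beta = 0.5; gamma = -0.3;
t = linspace(0.01, 15, 500);
S0 = exp(-t);                                   % x = 0: Lambda_x(t) = t
S1 = exp(-exp(beta) * t.^(exp(gamma)));         % x = 1: Lambda_x(t) = e^beta * t^(e^gamma)
[mn, idx] = min(abs(S1 - S0));                  % grid point closest to the crossing
tcross = t(idx);                                % roughly exp(0.5/(1-exp(-0.3))) ~ 6.9

Here the group with x = 1 has lower survival early on but higher survival after the crossing, the pattern described for the CH + R versus CH comparison above.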
Note that the difference between this second model and the first CE model is the following. In the case of the second CE model the ratios of the hazard rates, and even the ratios of the cumulative hazards, go to ∞ (or 0) as t → 0. In the case of the first CE model these ratios are defined and finite at t = 0. This property of the first CE model is more natural and helps to avoid complications when seeking efficient estimators.

6. Changing shape and scale models

A natural generalization of the AFT model (4) is obtained by supposing that different constant stresses x influence not only the scale but also the shape of the survival distribution, see Mann et al. (1974):

S_x(t) = S_0\Big\{\Big(\frac{t}{σ(x)}\Big)^{ν(x)}\Big\},

where σ and ν are some positive functions on E_1. A generalization of this model to the case of time-variable covariates is the changing shape and scale (CHSS) model, Bagdonavičius and Nikulin (1999):

S_{x(·)}(t) = S_0\Big\{\int_0^t r\{x(u)\}\, u^{ν(x(u))-1}\, du\Big\}.   (18)

In this model the variation of stress locally changes not only the scale but also the shape of the distribution. In terms of the hazard rate the model can be written in the form

α_{x(·)}(t) = r\{x(t)\}\, q\{A_{x(·)}(t)\}\, t^{ν(x(t))-1},   (19)

where q(u) = α_0(A_0^{-1}(u)), A_0(t) = -\ln S_0(t), α_0(t) = A_0'(t). If ν(x) ≡ 1 then the model coincides with the AFT model with r(x) = 1/σ(x). The CHSS model is not in the class of the GPH models because the third factor on the right of formula (19) depends not only on t but also on x(t).

The CHSS model is parametric if S_0 is taken from some parametric class of survival functions and the functions r and ν are parametrized, usually taking r(x) = e^{β^T x}, ν(x) = e^{γ^T x}. The model is semi-parametric if the function S_0 is considered as unknown and the functions r and ν are parametrized:

α_{x(·)}(t) = e^{β^T x(t)}\, q\{A_{x(·)}(t)\}\, t^{e^{γ^T x(t)}-1}.   (20)

For various classes of S_0 the CHSS model includes cross-effects of survival functions under constant covariates. For example, this is so if the survival distribution under constant covariates is Weibull or loglogistic (A_0(t) = t or \ln(1+t), respectively). Parametric analysis can be done using the method of maximum likelihood. Semi-parametric analysis is more complicated because the same problems as in the case of the semi-parametric AFT model arise: modified variants of likelihood functions are not differentiable and even not continuous functions, and the limit covariance matrices of the normed regression parameters depend on the derivatives of the probability density functions.
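As a worked illustration (added here; it follows directly from (18)), for a covariate constant in time the integral in (18) is explicit:

S_x(t) = S_0\Big\{ r(x)\,\frac{t^{\nu(x)}}{\nu(x)} \Big\},

so with the exponential baseline S_0(s) = e^{-s} one obtains S_x(t) = \exp\{-r(x)\, t^{\nu(x)}/\nu(x)\}: a Weibull distribution whose shape parameter ν(x), and not only its scale, depends on the covariate — which is exactly the mechanism producing the cross-effects mentioned above.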
7. Models with time-dependent regression coefficients

7.1. PH model with time-dependent regression coefficients

Flexible models can be obtained by supposing that the regression coefficients β in the PH model (2) are time-dependent, i.e., taking

α_{x(·)}(t) = e^{β^T(t)\, x(t)}\, α_0(t),   (21)

where

β^T(t)\, x(t) = \sum_{i=1}^{m} β_i(t)\, x_i(t).

If the function β_i(·) is increasing or decreasing in time then the effect of the ith component of the explanatory variable is increasing or decreasing in time. The model (21) is the PH model with time-dependent regression coefficients. Usually the coefficients β_i(t) are considered in the form

β_i(t) = β_i + γ_i g_i(t)  (i = 1, 2, . . . , m),

where the g_i(t) are some specified deterministic functions such as, for example, t, \ln t, \ln(1+t), (1+t)^{-1}, or realizations of predictable processes. In such a case the PH model with time-dependent coefficients and constant or time-dependent explanatory variables can be written in the usual form (2), where the role of the components of the "covariables" is played not only by the components x_i(·) but also by x_i(·)g_i(·). Indeed, set

θ = (θ_1, . . . , θ_{2m})^T = (β_1, . . . , β_m, γ_1, . . . , γ_m)^T,
z(·) = (z_1(·), . . . , z_{2m}(·))^T = (x_1(·), . . . , x_m(·), x_1(·)g_1(·), . . . , x_m(·)g_m(·))^T.   (22)

Then

β^T(t)\, x(t) = \sum_{i=1}^{m} \{β_i + γ_i g_i(t)\}\, x_i(t) = θ^T z(t).

So the PH model with time-dependent regression coefficients of the above form can be written as

α_{x(·)}(t) = e^{θ^T z(t)}\, α_0(t).   (23)

We have the PH model with time-dependent "covariables" and constant "regression parameters", so methods of estimation for the usual PH model can be used. Note that the introduced "covariables" have time-dependent components even in the case when the covariable x is constant over time. An alternative method is to take the β_i(t) as piecewise constant functions with jumps as unknown parameters. In such a case the PH model is used locally and the ratios of the hazard rates under constant covariates are constant on each of several time intervals.
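For instance (a small worked case added for concreteness), with m = 1 and g_1(t) = \ln(1+t) a scalar covariate x has coefficient β_1(t) = β_1 + γ_1\ln(1+t), and

α_x(t) = e^{\{β_1 + γ_1 \ln(1+t)\}\, x}\, α_0(t) = e^{β_1 x + γ_1\, x\ln(1+t)}\, α_0(t) = e^{θ^T z(t)}\, α_0(t),
\qquad θ = (β_1, γ_1)^T, \quad z(t) = \big(x,\ x\ln(1+t)\big)^T,

so the effect of x on the log-hazard scale equals β_1 at t = 0 and grows (or decays, if γ_1 < 0) logarithmically; the fit only requires a standard PH routine that accepts the time-dependent "covariable" x\ln(1+t).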
7.2. AFT model with time-dependent regression coefficients

Similarly to the case of the PH model, flexible models can be obtained by supposing that the regression coefficients β in the AFT model (7) are time-dependent, i.e., taking

S_{x(·)}(t) = S_0\Big\{\int_0^t e^{-β^T(u)\, x(u)}\, du\Big\},   (24)

where

β^T(t)\, x(t) = \sum_{i=1}^{m} β_i(t)\, x_i(t).

As in the case of the PH model with time-dependent coefficients, the model (24) with β_i(t) = β_i + γ_i g_i(t) can be written in the form of the usual AFT model

S_{x(·)}(t) = S_0\Big\{\int_0^t e^{-θ^T z(u)}\, du\Big\},   (25)

where θ and z are defined by (22). An alternative method is to take the β_i(t) as piecewise constant functions with jumps as unknown parameters.
8. Additive hazards model and its generalizations

An alternative to the PH model is the additive hazards (AH) model

α_{x(·)}(t) = α_0(t) + β^T x(t),   (26)

where β is the vector of regression parameters. If the AH model holds then the difference of hazard rates under constant covariates does not depend on t. Like the PH model, this model has the absence of memory property: the hazard rate at the moment t does not depend on the values of the covariate before the moment t. Usually the AH model is used in the semi-parametric form: the parameters β and the baseline hazard rate α_0 are supposed to be unknown.

Both the PH and AH models are included in the additive–multiplicative hazards (AMH) model (Lin and Ying (1996)):

α_{x(·)}(t) = e^{β^T x(t)}\, α_0(t) + γ^T x(t).   (27)

Even this model has the absence of memory property, so it is rather restrictive.

A modification of the AH model for constant covariates is Aalen's additive risk (AAR) model (Aalen (1980)): the hazard rate under the covariate x is modeled by a linear combination of several baseline rates with covariate components as coefficients:

α_x(t) = x^T α(t),   (28)

where α(t) = (α_1(t), . . . , α_m(t))^T is an unknown vector function.
Both the AH and AAR models are included in the partly parametric additive risk (PPAR) model (McKeague and Sasieni (1994)):

α_x(t) = x_1^T α(t) + β^T x_2,   (29)

where x_1 and x_2 are the q- and p-dimensional components of the explanatory variable x, and α(t) = (α_1(t), . . . , α_q(t))^T, β = (β_1, . . . , β_p)^T are unknown.

Analogously to the case of the PH model, the AH model can be generalized by the generalized additive hazards (GAH) model:

α_{x(·)}(t) = q\{A_{x(·)}(t)\}\,\{α_0(t) + β^T x(t)\},   (30)

where the function q is parametrized as in the case of the GPH models. Both the GPH and the GAH models can be included in the generalized additive–multiplicative hazards (GAMH) model:

α_{x(·)}(t) = q\{A_{x(·)}(t)\}\,\{e^{β^T x(t)}\, α_0(t) + δ^T x(t)\}.   (31)

In both the GAH and GAMH models the function q is parametrized as in the GPH models: q(u) = (1 + u)^{-γ+1}, (1 + γu)^{-1}, e^{-γu}, and the GAH1, GAH2, GAH3 or GAMH1, GAMH2, GAMH3 models are obtained.
9. Remarks on parametric and semi-parametric estimation

The literature on parametric and non-parametric estimation for the models considered above is enormous. Methods of estimation depend on experiment plans, censoring, covariate types, etc. We do not give here all these methods but give two general methods of estimation (one for the parametric and the other for the semi-parametric case) which work well for all models.

If the models are considered as parametric then the maximum likelihood estimation procedure gives the best estimators. Let us consider for simplicity right-censored survival regression data, which are typical in survival analysis (more complicated censoring or truncation schemes are considered similarly):

(X_1, δ_1, x_1(·)), . . . , (X_n, δ_n, x_n(·)),

where

X_i = T_i ∧ C_i, \qquad δ_i = 1_{\{T_i ≤ C_i\}}  (i = 1, . . . , n),

T_i and C_i are the failure and censoring times, x_i(·) is the covariate corresponding to the ith object, T_i ∧ C_i = \min(T_i, C_i), and 1_A is the indicator of the event A. Equivalently, right-censored data can be presented in the form

(N_1(t), Y_1(t), x_1(t), t ≥ 0), . . . , (N_n(t), Y_n(t), x_n(t), t ≥ 0),

where

N_i(t) = 1_{\{X_i ≤ t,\ δ_i = 1\}}, \qquad Y_i(t) = 1_{\{X_i ≥ t\}}.
In this case, for any t > 0,

N(t) = \sum_{i=1}^{n} N_i(t) \qquad \text{and} \qquad Y(t) = \sum_{i=1}^{n} Y_i(t)

are the number of observed failures of all objects in the interval [0, t] and the number of objects at risk just prior to the moment t, respectively.

Suppose that the survival distributions of all n objects given x_i(·) are absolutely continuous with survival functions S_i(t, θ) and hazard rates α_i(t, θ), specified by a common, possibly multidimensional, parameter θ ∈ Θ ⊂ R^s. Denote by G_i the survival function of the censoring time C_i. We suppose that the functions G_i and the distributions of the x_i(·) (if they are random) do not depend on θ. Suppose that the multiplicative intensities model is verified: the compensators of the counting processes N_i with respect to the history of the observed processes are ∫ Y_i α_i\, du. The likelihood function for estimation of θ is

L(θ) = \prod_{i=1}^{n} α_i^{δ_i}(X_i, θ)\, S_i(X_i, θ) = \prod_{i=1}^{n} \Big\{\int_0^\infty α_i(u, θ)\, dN_i(u)\Big\}^{δ_i} \exp\Big\{-\int_0^\infty Y_i(u)\, α_i(u, θ)\, du\Big\}.

The maximum likelihood (ML) estimator θ̂ of the parameter θ maximizes the likelihood function. It verifies the equation U(θ̂) = 0, where U is the score function

U(θ) = \frac{∂}{∂θ}\ln L(θ) = \sum_{i=1}^{n} \int_0^\infty \frac{∂}{∂θ}\log α_i(u, θ)\,\{dN_i(u) - Y_i(u)\, α_i(u, θ)\, du\}.   (32)
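To make the parametric route concrete, the following MATLAB sketch (an illustration added here; the Weibull proportional hazards specification, the simulated data and the use of fminsearch are assumptions, not prescriptions from the chapter) maximizes the logarithm of the likelihood L(θ) above for right-censored data with a scalar covariate:

% Model: alpha_i(t) = exp(beta*x_i) * a*t^(a-1), so A_i(t) = exp(beta*x_i)*t^a and
% log L(theta) = sum{ delta_i*log alpha_i(X_i) - A_i(X_i) }, theta = (log a, beta).
negloglik = @(theta, X, delta, x) -sum( ...
    delta.*( x*theta(2) + theta(1) + (exp(theta(1))-1).*log(X) ) ...
    - exp(x*theta(2)).*X.^exp(theta(1)) );
% Example usage with simulated data (true a = 1.5, beta = 0.7):
n = 200; x = double(rand(n,1) > 0.5); a = 1.5; beta = 0.7;
T = ( -log(rand(n,1))./exp(beta*x) ).^(1/a);   % Weibull PH failure times
C = 3*rand(n,1);                               % uniform censoring times
X = min(T, C); delta = (T <= C);               % observed times and failure indicators
thetahat = fminsearch(@(th) negloglik(th, X, delta, x), [0; 0]);
ahat = exp(thetahat(1)); betahat = thetahat(2);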
The form of the hazard rates α_i for the PH, AFT, GPH1, GPH2, GPH3, CE, CHSS, AH, AMH, AAR, PPAR, GAH and GAMH models is given by the formulas (2), (7), (13)–(16), (20), (26)–(31). The parameter θ contains the regression parameter β, the complementary parameter γ (for some models) and the parameters of the baseline hazard α_0, which is taken from some parametric family.

Let us consider a general approach (Bagdonavičius and Nikulin (2002)) for semi-parametric estimation in all given models when the baseline hazard α_0 is supposed to be unknown. The martingale property of the difference

N_i(t) - \int_0^t Y_i(u)\, α_i(u, θ)\, du   (33)

implies an "estimator" (which depends on θ) of the baseline cumulative hazard A_0. Indeed, all the above considered models can be classified into three groups in dependence
on the form of $\alpha_i(t, \theta)\,dt$. It is of the form
$$g\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\}\, dA_0(t) \quad(\text{for the PH, GPH, CE models}),$$
$$dA_0(f_i(t, \theta)) \quad(\text{for the AFT, CHSS models}),$$
or
$$g_1\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\}\, dA_0(t) + g_2\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\}\, dt$$
(for the AH, AMH, AR, PPAR, GAH, GAMH models), with $A_0$ possibly multi-dimensional for the AR and PPAR models. We remind the reader that estimation for the PH and AFT models with time-dependent regression coefficients and time-dependent or independent covariates is analogous to estimation for the PH and AFT models with constant regression coefficients and properly chosen time-dependent "covariates".

For the first group, the martingale property of the difference (33) implies the recurrently defined "estimator"
$$\tilde A_0(t, \theta) = \int_0^t \frac{dN(u)}{\sum_{j=1}^n Y_j(u)\, g\{x_j(v), \tilde A_0(v, \theta),\ 0 \le v < u,\ \theta\}}.$$
For the second group,
$$\tilde A_0(t, \theta) = \sum_{i=1}^n \int_0^t \frac{dN_i(h_i(u, \theta))}{\sum_{l=1}^n Y_l(h_l(u, \theta))},$$
where $h_i(u, \theta)$ is the function inverse to $f_i(u, \theta)$ with respect to the first argument. For the third group (AH, AMH, GAH, GAMH models),
$$\tilde A_0(t, \theta) = \int_0^t \frac{dN(u) - \sum_{i=1}^n g_2\{x_i(v), A_0(v),\ 0 \le v < u,\ \theta\}\, du}{\sum_{j=1}^n Y_j(u)\, g_1\{x_j(v), A_0(v),\ 0 \le v < u,\ \theta\}}.$$
A slightly more complicated situation arises with the AR and PPAR models. The "estimator" $\tilde A_0$ is obtained in the following way (McKeague and Sasieni (1994)): consider a submodel $\alpha_0(t) = \alpha(t) + \eta\varphi(t)$, in which $\eta$ is a one-dimensional parameter and $\varphi$, $\alpha$ are $m$-vectors of functions. The score function obtained from the parametric likelihood function for the parameter $\eta$ (AR model) is
$$U(\eta) = \sum_{i=1}^n \int_0^\infty \frac{\varphi^T(t)\, x^{(i)}(t)}{\alpha_i(t)} \left\{dN_i(t) - Y_i(t)\,\big(x^{(i)}(t)\big)^T dA_0(t)\right\},$$
and the score functions for the parameters $\eta$ and $\beta$ (PPAR model) are
$$U_1(\eta, \beta) = \sum_{i=1}^n \int_0^\infty \frac{\varphi^T(t)\, x_1^{(i)}}{\alpha_i(t)} \left\{dN_i(t) - Y_i(t)\,\big(x_1^{(i)}\big)^T dA_0(t) - \beta^T x_2^{(i)}\, Y_i(t)\, dt\right\} = 0,$$
$$U_2(\eta, \beta) = \sum_{i=1}^n \int_0^\infty \frac{x_2^{(i)}}{\alpha_i(t)} \left\{dN_i(t) - Y_i(t)\,\big(x_1^{(i)}\big)^T dA_0(t) - \beta^T x_2^{(i)}\, Y_i(t)\, dt\right\} = 0. \qquad (34)$$
If $A_0$ is unknown and we want to estimate it, the estimator should be the same for all $\varphi$. Setting $U(\eta) = 0$ (AR model) or $U_1(\eta, \beta) = 0$ (PPAR model) for all functions $\varphi$ implies that, for all $t$,
$$\sum_{i} \frac{x^{(i)}(t)}{\alpha_i(t)} \left\{dN_i(t) - Y_i(t)\,\big(x^{(i)}(t)\big)^T dA_0(t)\right\} = 0,$$
or
$$\sum_{i} \frac{x_1^{(i)}}{\alpha_i(t)} \left\{dN_i(t) - Y_i(t)\,\big(x_1^{(i)}\big)^T dA_0(t) - \beta^T x_2^{(i)}\, Y_i(t)\, dt\right\} = 0,$$
which implies the "estimators" (AR model)
$$\tilde A_0(t) = \sum_{j=1}^n \int_0^t \left\{\sum_{i=1}^n x^{(i)}(u)\,\big(x^{(i)}(u)\big)^T Y_i(u)\,\alpha_i^{-1}(u)\right\}^{-1} x^{(j)}(u)\,\alpha_j^{-1}(u)\, dN_j(u)$$
or (PPAR model)
$$\tilde A(t) = \sum_{j=1}^n \int_0^t \left\{\sum_{i=1}^n x_1^{(i)}\big(x_1^{(i)}\big)^T Y_i(u)\,\alpha_i^{-1}(u)\right\}^{-1} x_1^{(j)}\,\alpha_j^{-1}(u)\, \left\{dN_j(u) - \beta^T x_2^{(j)}\, Y_j(u)\, du\right\}.$$
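For the AR-type estimator just displayed, a rough numerical sketch is given below. It uses the simplified weight discussed further on (the unknown $\alpha_i$ replaced by 1) and hypothetical data, so it only illustrates the least-squares form of the increments, not the authors' general procedure.

```python
# A small sketch (simplified weight alpha_i = 1, hypothetical data) of the
# AR-type increment: at each failure time the multi-dimensional baseline
# increment is dA0(t) = (sum_{i at risk} x_i x_i^T)^{-1} x_j dN_j(t).
import numpy as np

def ar_baseline(times, delta, x):
    """x: (n, m) covariate matrix; returns failure times and the A0 path (m-vectors)."""
    order = np.argsort(times)
    A0, path, event_times = np.zeros(x.shape[1]), [], []
    for j in order:
        if delta[j] == 0:
            continue                              # no increment at censored times
        at_risk = times >= times[j]
        M = x[at_risk].T @ x[at_risk]             # sum of x_i x_i^T over the risk set
        A0 = A0 + np.linalg.solve(M, x[j])        # increment for this failure
        path.append(A0.copy()); event_times.append(times[j])
    return np.array(event_times), np.array(path)

# hypothetical data: intercept plus one binary covariate
times = np.array([2., 3., 5., 7., 11.])
delta = np.array([1, 0, 1, 1, 0])
x = np.column_stack([np.ones(5), np.array([0., 1., 0., 1., 0.])])
print(ar_baseline(times, delta, x))
```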
Note that for the PH, GPH1, GPH2 and GPH3 models,
$$g\{x(s), A_0(s),\ 0 \le s \le t,\ \theta\} = e^{\beta^T x(t)},\qquad e^{\beta^T x(t)}\left\{1 + \gamma \int_0^t e^{\beta^T x(u)}\, dA_0(u)\right\}^{1/\gamma - 1},$$
$$e^{\beta^T x(t)}\left\{1 + 2\gamma \int_0^t e^{\beta^T x(u)}\, dA_0(u)\right\}^{-1/2},\qquad e^{\beta^T x(t)}\left\{1 + \gamma \int_0^t e^{\beta^T x(u)}\, dA_0(u)\right\}^{-1},$$
respectively. For the CE model,
$$g\{x(s), A_0(s),\ 0 \le s \le t,\ \theta\} = e^{\beta^T x(t)}\left\{1 + A_{x(\cdot)}(t)\right\}^{1 - e^{\gamma^T x(t)}},$$
where the function $A_{x(\cdot)}$ is defined by the equation
$$\int_0^t e^{\beta^T x(u)}\left\{1 + A_{x(\cdot)}(u)\right\}^{1 - e^{\gamma^T x(u)}}\, dA_0(u) = A_{x(\cdot)}(t).$$
If $x$ is constant in time then, for the CE model,
$$g\{x, A_0(s),\ 0 \le s \le t,\ \theta\} = e^{\beta^T x}\left\{1 + e^{(\beta + \gamma)^T x} A_0(t)\right\}^{e^{-\gamma^T x} - 1}.$$
For the AFT and CHSS models,
$$f_i(t, \theta) = \int_0^t e^{-\beta^T x(u)}\, du \qquad\text{and}\qquad f_i(t, \theta) = \int_0^t e^{-\beta^T x(u)}\, u^{e^{\gamma^T x(u)} - 1}\, du,$$
respectively. For the AH, AMH, AR, PPAR, GAH and GAMH models,
$$g_1\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\} = 1,\quad e^{\beta^T x(t)},\quad x^T,\quad x_1^T,$$
and
$$g_2\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\} = \beta^T x(t),\quad \beta^T x(t),\quad 0,\quad \beta_2^T x(t),$$
respectively. For the GAMH1 model (the formulas are analogous for the GAMH2, GAMH3, GAH1, GAH2, GAH3 models),
$$g_1\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\} = e^{\beta^T x(t)}\, g\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\},$$
$$g_2\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\} = \delta^T x(t)\, g\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\},$$
where
$$g\{x_i(s), A_0(s),\ 0 \le s \le t,\ \theta\} = \left\{1 + \gamma \left[\int_0^t e^{\beta^T x(u)}\, dA_0(u) + \int_0^t \delta^T x(u)\, du\right]\right\}^{1/\gamma - 1}.$$
For the PH, GPH and CE models the weight $(\partial/\partial\theta)\log\alpha_i(u,\theta)$ in (32) is a function of $x_i(v)$, $A_0(v)$, $0 \le v \le u$, and $\theta$, so the modified score function is obtained by replacing $A_0$ by its consistent estimator $\tilde A_0$ in the parametric score function (32). In the case of the AFT, CHSS, AH, AMH, AR and PPAR models, the weight depends not only on $A_0$ but also on $\alpha_0$ and/or its derivative $\alpha_0'$. The more important point, however, is that $\alpha_i(u)\,du$ does not depend on $\alpha_0$ and $\alpha_0'$. So the modified likelihood function can be constructed in two ways. The first way is to replace $A_0$ by $\tilde A_0$, and $\alpha_0$ and $\alpha_0'$ by nonparametric kernel estimators, which are easily obtained from the estimator $\tilde A_0$. The second, much easier way is to replace $\alpha_0$ by 1, $\alpha_0'$ by 0 and $A_0$ by $\tilde A_0$ in the score function (32) (or (34) for the PPAR model; in the case of the AR model there are no parameters left to estimate). The loss of efficiency is very slight with such a simplified weight. Computing the modified likelihood estimators is simple for the PH, GPH and CE models. This is due to the remarkable fact that these estimators can be obtained otherwise:
write the partial likelihood function
$$L_P(\theta) = \prod_{i=1}^n \left\{\int_0^\infty \frac{g\{x_i(v), A_0(v),\ 0 \le v \le u,\ \theta\}}{\sum_{j=1}^n Y_j(u)\, g\{x_j(v), A_0(v),\ 0 \le v \le u,\ \theta\}}\, dN_i(u)\right\}^{\delta_i}, \qquad (35)$$
and suppose at first that $A_0$ is known. Replacing $A_0$ in the score function by $\tilde A_0$ gives exactly the same modified score function as the one obtained from the full likelihood! So, to compute the estimator $\hat\theta$, the score equation is not needed. It is better to maximize the modified partial likelihood function, which is obtained from the partial likelihood function (35) by replacing $A_0$ by $\tilde A_0$. A general quasi-Newton optimization algorithm (available in Splus) works very well for seeking the value of $\theta$ which maximizes this modified function (Bagdonavičius et al. (2002)). The most complicated case is that of the AFT and CHSS models: the modified score functions are neither differentiable nor even continuous. So the modified maximum likelihood estimators are the values of $\theta$ which minimize the distance of the modified score function from zero. Computational methods for such estimators are given in Lin and Geyer (1992).
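To make this recipe concrete in the simplest case, the sketch below (an illustration under assumed conditions: the PH special case with $g = e^{\beta x}$, a scalar covariate, and hypothetical data) plugs the Breslow-type estimator $\tilde A_0$ into the partial likelihood (35) and maximizes over $\beta$ with a general-purpose optimizer.

```python
# A minimal sketch (PH special case, scalar covariate, hypothetical data) of the
# semiparametric recipe: maximize the (modified) partial likelihood over beta,
# then form the Breslow-type estimator of the baseline cumulative hazard.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_partial_lik(beta, X, delta, x):
    risk = np.exp(beta * x)
    ll = 0.0
    for i in np.where(delta == 1)[0]:
        at_risk = X >= X[i]                      # risk set at the i-th failure time
        ll += beta * x[i] - np.log(np.sum(risk[at_risk]))
    return -ll

# hypothetical data
X = np.array([5., 8., 9., 12., 13., 16., 18., 23., 27., 30.])
delta = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 0])
x = np.array([0., 0., 1., 0., 1., 0., 1., 1., 0., 1.])

beta_hat = minimize_scalar(neg_log_partial_lik, args=(X, delta, x)).x
# Breslow-type "estimator" of the baseline cumulative hazard evaluated at beta_hat
times = np.sort(X[delta == 1])
A0 = np.cumsum([1.0 / np.sum(np.exp(beta_hat * x)[X >= t]) for t in times])
print(beta_hat, A0)
```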
References

Aalen, O. (1980). A model for nonparametric regression analysis of counting processes. In: Klonecki, W., Kozek, A., Rosinski, J. (Eds.), Mathematical Statistics and Probability Theory. Lecture Notes in Statist., Vol. 2. Springer, New York, pp. 1–25.
Andersen, P.K. (1991). Survival analysis 1981–1991: The second decade of the proportional hazards regression model. Statist. Medicine 10 (12), 1931–1941.
Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Bagdonavičius, V. (1978). Testing the hypothesis of the additive accumulation of damages. Probab. Theory Appl. 23 (2), 403–408.
Bagdonavičius, V., Nikulin, M. (1994). Stochastic models of accelerated life. In: Gutienez, J., Valderrama, M. (Eds.), Advanced Topics in Stochastic Modelling. World Scientific, Singapore.
Bagdonavičius, V., Nikulin, M. (1999). Generalized proportional hazards model based on modified partial likelihood. Lifetime Data Anal. 5, 329–350.
Bagdonavičius, V., Hafdi, M., Nikulin, M. (2002). The generalized proportional hazards model and its application for statistical analysis of the Hsieh model. In: Dohi, T., Limnios, N., Osaki, S. (Eds.), Proceedings of the Second Euro–Japanese Workshop on Stochastic Risk Modelling for Finance, Insurance, Production and Reliability, September 18–20, Chamonix, France, pp. 42–53.
Bagdonavičius, V., Hafdi, M., El Himdi, K., Nikulin, M. (2002). Analysis of survival data with cross-effects of survival functions. Applications for chemo and radiotherapy data. Preprint 0202, I.F.R. "Santé Publique".
Bagdonavičius, V., Nikulin, M. (2002). Accelerated Life Models: Modeling and Statistical Analysis. Chapman and Hall/CRC, Boca Raton, FL.
Cox, D.R. (1972). Regression models and life tables. J. Roy. Statist. Soc. B 34, 187–220.
Cox, D.R. (1975). Partial likelihood. Biometrika 62, 269–276.
Cox, D.R., Oakes, D. (1984). Analysis of Survival Data. Methuen (Chapman and Hall), New York.
Dabrowska, D.M., Doksum, K.A. (1988). Partial likelihood in transformations models with censored data. Scand. J. Statist. 15, 1–23.
Hougaard, P. (1986). Survival models for heterogeneous populations derived from stable distributions. Biometrika 73 (3), 387–396.
Hsieh, F. (2001). On heteroscedastic hazards regression models: Theory and application. J. Roy. Statist. Soc. Ser. B 63, 63–79.
Kleinbaum, D. (1996). Survival Analysis: A Self-Learning Text. Springer, New York.
Klein, J.P., Moeschberger, M.L. (1997). Survival Analysis. Springer, New York.
Lin, D.Y., Geyer, C.J. (1992). Computational methods for semi-parametric linear regression with censored data. J. Comput. Graph. Statist. 1, 77–90.
Lin, D.Y., Ying, Z. (1996). Semi-parametric analysis of the general additive-multiplicative hazard models for counting processes. Ann. Statist. 23 (5), 1712–1734.
Mann, N.R., Schafer, R.E., Singpurwalla, N. (1974). Methods for Statistical Analysis of Reliability and Life Data. Wiley, New York.
McKeague, I.W., Sasieni, P.D. (1994). A partly parametric additive risk model. Biometrika 81 (3), 501–514.
Sedyakin, N.M. (1966). On one physical principle in reliability theory. Techn. Cybernetics 3, 80–87. In Russian.
Stablein, D.M., Koutrouvelis, I.A. (1985). A two sample test sensitive to crossing hazards in uncensored and singly censored data. Biometrics 41, 643–652.
Ch. 24.
Accelerated Hazards Model: Method, Theory and Applications
Ying Qing Chen, Nicholas P. Jewell and Jingrong Yang
1. Introduction

Time-to-event or survival time data have been thoroughly studied over the past decades. The Cox proportional hazards model (Cox, 1972) is the most extensively used regression model in the analysis of such data. This model assumes that a covariate effect is captured through a proportionality constant between hazard functions, leaving the underlying hazard function unspecified. Specifically, the Cox proportional hazards model for failure time $T$ and associated covariates $Z$ is
$$\lambda(t \mid Z) = \lambda_0(t) \exp(\beta^T Z), \qquad (1)$$
where $\lambda(\cdot)$ is the hazard function and $\beta$ is a parameter vector of the same dimension as the covariate vector $Z$. Here, the superscript $T$ denotes vector transpose. When censoring is present, the partial likelihood provides a simple and efficient way to estimate the parameter $\beta$. Within the framework of counting processes, asymptotic properties of the estimators can be elegantly justified using martingale theory (Andersen and Gill, 1982). For references, see Kalbfleisch and Prentice (1980), Fleming and Harrington (1991) and Andersen et al. (1993). As noted in Lin and Ying (1997), the Cox model is a special case of a more general hazard-based regression model:
$$\lambda(t \mid Z) = L\{\lambda_0(t), \beta^T Z\}, \qquad (2)$$
where $L(\cdot)$ is a known function. Other well-known examples of (2) include the additive hazards model (Lin and Ying, 1994)
$$\lambda(t \mid Z) = \lambda_0(t) + \beta^T Z, \qquad (3)$$
and the accelerated failure time model (Kalbfleisch and Prentice, 1980)
$$\lambda(t \mid Z) = \lambda_0\{t \exp(\beta^T Z)\} \exp(\beta^T Z). \qquad (4)$$
Further discussion of the merits of (2) can be found in Lin and Ying (1997).
The accelerated hazards model (Chen and Wang, 2000a) is also a special case of (2):
$$\lambda(t \mid Z) = \lambda_0\{t \exp(\beta^T Z)\}. \qquad (5)$$
In this model, $\exp(\beta^T Z)$ characterizes how the covariate $Z$ alters the time scale of the underlying hazard progression, and is termed the hazard progression time ratio. Whether $\beta > 0$ or $\beta < 0$ reflects the direction of the alteration: acceleration or deceleration, respectively. For instance, assume that the covariate $Z$ takes the value 0 (control) or 1 (treatment). If $\beta$ is $-\log 2$, the model says that the hazard of the treatment group progresses in half the time that it does in the control group; if $\beta$ is $\log 2$, the hazard of the treatment group progresses in twice the time; and if $\beta = 0$, there is no difference between the two groups in hazard progression. In the rest of this article, we review and summarize methodological and theoretical developments of the accelerated hazards model in estimation and efficiency evaluation. Some techniques are presented for checking model adequacy. Several extensions of the model are introduced. Practical implementation of the model is also discussed.
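As a small numerical illustration (not part of the original article; the log-logistic baseline is borrowed from the simulations of Section 7, and the values are made up), the following snippet shows how (5) rescales the time axis of the hazard.

```python
# A small illustration of the time-scale change in model (5),
# lambda(t | Z) = lambda0(t * exp(beta * Z)), with the standard log-logistic
# baseline hazard lambda0(t) = 1 / (1 + t) used in Section 7.
import numpy as np

def baseline_hazard(t):
    return 1.0 / (1.0 + t)

def ah_hazard(t, z, beta):
    # hazard under the accelerated hazards model for covariate value z
    return baseline_hazard(t * np.exp(beta * z))

beta = np.log(2.0)              # hazard progression time ratio exp(beta) = 2
t = np.linspace(0.0, 5.0, 6)
print("control  :", ah_hazard(t, 0, beta))   # equals the baseline hazard
print("treatment:", ah_hazard(t, 1, beta))   # baseline hazard evaluated at 2t
```

Note that the two hazards coincide at $t = 0$, in line with the remark in Section 8 that the hazards are identical at the onset of a randomized trial.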
2. Estimation

Let non-negative random variables $T$ and $C$ stand for the failure and censoring times, respectively, and let $Z$ denote the $p$-vector covariate. Conditional on $Z$, $T$ and $C$ are assumed to be independent. For the actual data collected in practice, we usually observe $X_i = \min(T_i, C_i)$, $\Delta_i = I(T_i \le C_i)$ and $Z_i$, for $i = 1, \ldots, n$. Here $I(\cdot)$ is the indicator function taking the value 1 if the condition in $(\cdot)$ is satisfied and 0 otherwise. Denote $N_i(t) = I(X_i \le t, \Delta_i = 1)$ and $Y_i(t) = I(X_i \ge t)$. Define the filtration
$$\mathcal{F}_t = \sigma\big\{N_i\big(t \exp(-\beta_0^T Z_i)\big),\ Y_i\big(t \exp(-\beta_0^T Z_i)\big),\ Z_i;\ i = 1, 2, \ldots, n\big\}.$$
Then
$$E\big\{dN_i\big(t \exp(-\beta_0^T Z_i)\big) \mid \mathcal{F}_{t-}\big\} = Y_i\big(t \exp(-\beta_0^T Z_i)\big)\, d\Lambda_i\big(t \exp(-\beta_0^T Z_i)\big) = Y_i\big(t \exp(-\beta_0^T Z_i)\big) \exp(-\beta_0^T Z_i)\, d\Lambda_0(t),$$
where $\Lambda(\cdot) = \int_0^{\cdot} \lambda(u)\, du$ is the cumulative hazard function and $\beta_0$ is the true value of the parameter $\beta$. Let
$$M_i(t; \beta, \Lambda_0) = N_i\big(t \exp(-\beta^T Z_i)\big) - \int_0^t Y_i\big(u \exp(-\beta^T Z_i)\big) \exp(-\beta^T Z_i)\, d\Lambda_0(u).$$
Then $E\{M_i(t)\} = 0$ and the $M_i(t; \beta_0, \Lambda_0)$ are martingales with respect to $\mathcal{F}$. It is therefore reasonable to estimate $\beta$ through
$$\sum_{i=1}^n \int_0^\infty G(t, Z; \beta)\, dM_i(t; \beta, \Lambda_0) = 0, \qquad (6)$$
where $G(t, Z; \beta)$ is a known two-dimensional weight function, and $G(t, Z; \beta_0)$ is measurable with respect to the filtration $\mathcal{F}_t$. For example, one possible choice of $G$ is to let $G(t, Z; \beta) = (1, Z)^T$. Then the above equations become
$$\sum_{i=1}^n \int_0^\infty dM_i(t; \beta, \Lambda_0) = 0 \quad\text{and}\quad \sum_{i=1}^n \int_0^\infty Z_i\, dM_i(t; \beta, \Lambda_0) = 0. \qquad (7)$$
Furthermore, by replacing $\Lambda_0(t)$ with its Breslow-type estimator
$$\widehat\Lambda_0(t, \beta) = \int_0^t \frac{\sum_{i=1}^n dN_i\{u \exp(-\beta^T Z_i)\}}{\sum_{i=1}^n Y_i\{u \exp(-\beta^T Z_i)\} \exp(-\beta^T Z_i)}$$
and doing some algebraic manipulation, we can use the following unbiased estimating functions to estimate the parameters:
$$S(\beta) = \sum_{i=1}^n \int_0^\infty \big\{Z_i - \bar Z(t, \beta)\big\}\, dN_i\big(t \exp(-\beta^T Z_i)\big), \qquad (8)$$
where
$$\bar Z(t, \beta) = \frac{\sum_{j=1}^n Y_j\{t \exp(-\beta^T Z_j)\} \exp(-\beta^T Z_j)\, Z_j}{\sum_{j=1}^n Y_j\{t \exp(-\beta^T Z_j)\} \exp(-\beta^T Z_j)}. \qquad (9)$$
A weighted version of (8) is
$$S^W(\beta) = \sum_{i=1}^n \int_0^\infty W(t, \beta)\big\{Z_i - \bar Z(t, \beta)\big\}\, dN_i\big(t \exp(-\beta^T Z_i)\big), \qquad (10)$$
where the weight function $W(t, \beta_0)$ is left-continuous, non-negative and measurable with respect to $\mathcal{F}_t$, and $n^{-1} W(t, \beta_0)$ is assumed to converge uniformly to a non-random function $w(t, \beta_0)$. When $\hat\beta$ is available, the baseline cumulative hazard function can be estimated by $\widehat\Lambda_0(t, \hat\beta)$:
$$\widehat\Lambda_0(t, \hat\beta) = \int_0^t \frac{d \sum_{i=1}^n N_i\{u \exp(-\hat\beta^T Z_i)\}}{\sum_{i=1}^n Y_i\{u \exp(-\hat\beta^T Z_i)\} \exp(-\hat\beta^T Z_i)}. \qquad (11)$$

3. Asymptotic results

Since $S(\beta)$ is not a continuous function of $\beta$, an exact solution to $S(\beta) = 0$ rarely exists. Following the definitions used in Tsiatis (1990) or Wei et al. (1990) for the rank estimation of the accelerated failure time model, the estimator $\hat\beta$ can be defined as a zero-crossing of $S(\beta)$, such that $S(\hat\beta-)\,S(\hat\beta+) \le 0$ when $\beta$ is one-dimensional, or simply as the minimizer of $\|S(\beta)\|$. Let
$$S(t; \beta) = \sum_{i=1}^n \int_0^t \big\{Z_i - \bar Z(u, \beta)\big\}\, dN_i\big(u \exp(-\beta^T Z_i)\big).$$
Then $n^{-1/2} S(t, \beta_0)$ converges weakly to a zero-mean Gaussian process with variance function $\Sigma(t, \beta_0)$, say, which can be consistently estimated by
$$n^{-1} \sum_{i=1}^n \int_0^t \big\{Z_i - \bar Z(u, \beta)\big\}^{\otimes 2}\, dN_i\big(u \exp(-\beta^T Z_i)\big),$$
where, for a vector $v$, $v^{\otimes 2}$ denotes $v v^T$. Assume the following regularity conditions: (1) the covariates $Z_i$ are uniformly bounded; (2) the censoring times $C_i$ have uniformly bounded densities; (3) $\lambda_0$ has a bounded second derivative. Following arguments similar to those in Ying (1993), $\hat\beta$ is consistent and
$$n^{1/2}\big(\hat\beta - \beta_0\big) \xrightarrow{d} N\big(0,\ D^{-1} \Sigma (D^{-1})^T\big),$$
where
$$D = \int_0^\infty E\Big[Y_1\big(t \exp(-\beta^T Z_1)\big) \exp(-\beta^T Z_1)\big\{Z_1 - \bar Z\big\}\, d\big\{\lambda_0(t)\, t\big\}\Big].$$
By the asymptotic linearity established in Appendix 3 of Chen and Jewell (2001), it is also true that $n^{1/2}\{\widehat\Lambda_0(t, \hat\beta) - \Lambda_0(t, \beta_0)\}$ converges weakly to a zero-mean Gaussian process.
4. Efficiency consideration

The estimating equations proposed in the preceding section are rather ad hoc. The approaches in Lai and Ying (1992), Lin and Ying (1994) and Chen and Jewell (2001) can be followed to study a special parametric subfamily for the semiparametric information bound. Let such a parametric subfamily be
$$\lambda(t \mid Z, \alpha, \beta) = \alpha h\{t \exp(\beta Z)\} + \lambda_0\{t \exp(\beta Z)\}, \qquad (12)$$
where $\alpha$ and $\beta$ are unknown parameters, and $\lambda_0(\cdot)$ and $h$ are fixed functions. Consider the log-likelihood function for $(\alpha, \beta)$ in (12):
$$l(\alpha, \beta) = \sum_{i=1}^n \left[\int_0^\infty \log \lambda(t \mid Z_i, \alpha, \beta)\, dN(t \mid Z_i) - \int_0^\infty Y(t \mid Z_i)\, \lambda(t \mid Z_i, \alpha, \beta)\, dt\right]$$
$$= \sum_{i=1}^n \left[\int_0^\infty \log\big[\alpha h\{t \exp(\beta Z_i)\} + \lambda_0\{t \exp(\beta Z_i)\}\big]\, dN_i(t) - \int_0^\infty Y_i(t)\big[\alpha h\{t \exp(\beta Z_i)\} + \lambda_0\{t \exp(\beta Z_i)\}\big]\, dt\right].$$
Denote the Fisher information matrix $I_h(\alpha, \beta)$ evaluated at $\alpha = 0$ and $\beta = \beta_0$ by
$$I_h(0, \beta_0) = \begin{pmatrix} I_{\alpha\alpha}(h) & I_{\alpha\beta}(h) \\ I_{\beta\alpha}(h) & I_{\beta\beta}(h) \end{pmatrix}. \qquad (13)$$
Then the semiparametric information bound for $\beta$ can be calculated by exhausting all the choices of $h$ in (12) at $\alpha = \alpha_0$ and $\beta = \beta_0$, which, by the Cramér–Rao inequality, is
$$I_{\beta\beta}(h) - I_{\beta\alpha}(h)\, I_{\alpha\alpha}^{-1}(h)\, I_{\alpha\beta}(h). \qquad (14)$$
Straightforward calculation shows that the semiparametrically efficient estimating equation for the parameter $\beta_0$ is
$$S_{\mathrm{opt}}(\beta) = \sum_{i=1}^n \int_0^\infty \big\{\lambda_0^{(1)}(t)\, t / \lambda_0(t)\big\}\big\{Z_i - \bar Z(t; \beta)\big\}\, dN_i\big(t \exp(-\beta^T Z_i)\big). \qquad (15)$$
It attains the semiparametric efficiency bound, although the use of $S_{\mathrm{opt}}$ may be limited in practice because the optimal weight involves the unknown baseline hazard function.
5. Model adequacy

The accelerated hazards model, jointly with the other classes of models, gives users more options in modeling censored survival data. However, because of its sensitivity to the baseline hazard function and the model assumptions, careful assessment of adequacy becomes a critical issue. As demonstrated in Chen (2001), two test statistics are available for checking model adequacy.

Kolmogorov–Smirnov test. A Kolmogorov–Smirnov test can be used when $Z$ is a binary scalar in the accelerated hazards regression model. As studied in Chen (2001), the Kolmogorov–Smirnov type of test statistic is
$$T_{KS}(\hat\beta) \stackrel{\mathrm{def}}{=} \big\|D(t, \hat\beta)\big\| = \sup_{0 \le t < \infty} \big|D(t, \hat\beta)\big|, \qquad (16)$$
where
$$D(t, \hat\beta) = n^{-1/2} \int_0^t Q(u, \hat\beta)\big\{\exp(\hat\beta)\, d\widehat\Lambda_1\big(u \exp(-\hat\beta)\big) - d\widehat\Lambda_0(u)\big\}. \qquad (17)$$
Here, Q(t, β0 ) is measurable with respect to Ft , and n−1 Q(t, β0 ) has the limiting function of q(t, β0). Choices for Q include: QLR (u, β) = of the log-rank test;
Y0 (u)Y1 {u exp(−β)} exp(−β) Y0 (u) + Y1 {u exp(−β)} exp(−β)
QGE (u, β) = n−1 Y0 (u)Y1 u exp(−β) exp(−β)
436
Y.Q. Chen, N.P. Jewell and J. Yang
of Gehan’s; Y0 (u)Y1 {u exp(−β)} exp(−β) QPP (u, β) =
S(u−, β) Y0 (u) + Y1 {u exp(−β)} exp(−β) of the Peto–Prentice generalized Wilcoxon statistics, where
S(u−, β) is the leftcontinuous version of the Kaplan–Meier estimate based on the pooled sample of {(Xi exp(βZi ), ∆i ), i = 1, 2, . . . , n}; and ρ Y0 (u)Y1 {u exp(−β)} exp(−β) QHF (u, β) =
S(u−, β) Y0 (u) + Y1 {u exp(−β)} exp(−β) of the general class of tests by Harrington and Fleming (1982). Under the null hypothesis H0 when the accelerated hazards model is adequate, if ˆ is unexpectedly large, it leads to rejection of H0 . TKS is consistent against any TKS (β) general alternative hypothesis when the accelerated hazards model is inadequate and hence omnibus. How to select critical values of TKS can be found in the appendix of Chen (2001). Gill–Schumacher test For a general covariate vector Z, let βˆ1 and βˆ2 be the solutions of S1 (β) = 0 and S2 (β) = 0 with different choices of weight functions of W1 and W2 , respectively. When the accelerated hazards model is true, the difference between βˆ2 and βˆ1 should be small and therefore the Wald-type of statistics based on βˆ2 − βˆ1 can be used to test the model adequacy. Following similar arguments in Wei et al. (1990) for assessing adequacy of the accelerated failure time model, we know that n−1/2 (S1 (β0 ), S2 (β0 ))T is asymptotically joint normal with the covariance matrix that is the limit of −1 V11 (β0 ) V12 (β0 ) , n V21 (β0 ) V22 (β0 ) where n ∞
⊗2 dNi t exp −β0T Zi , V11 (β0 ) = W12 Zi − Z i=1 0
V12 (β0 ) = V21 (β0 ) =
n i=1 0
V21 (β0 ) =
n i=1 0
∞
∞
⊗2 dNi t exp −β T Zi , W1 W2 Zi − Z 0
⊗2 dNi t exp −β0T Zi . W22 Zi − Z
Therefore the following statistic, S1 (β) V11 (βˆ1 ) TGS = min S2 (β + βˆ2 − βˆ1 ) V21 (βˆ1 ) β∈U (βˆ1 ) S1 (β) , × S2 (β + βˆ2 − βˆ1 )
V12 (βˆ1 ) V22 (βˆ1 )
−1
which is asymptotically equivalent to the Wald-type statistic for testing $\hat\beta_2 - \hat\beta_1$ and is asymptotically $\chi^2_p$ distributed.
6. Extensions

The accelerated hazards model has the flexibility to be extended to various situations when needed. Several extensions based on this model are studied and summarized as follows.

Scale change function. When the treatment is not limited to changing the time scale linearly, an extension with a scale change function can be used:
$$\lambda(t \mid Z) = \lambda_0\{\eta(t, Z; \beta)\}, \qquad (18)$$
where the known positive parametric function $\eta(\cdot)$ is monotonically increasing in $t$. For example, when $Z$ is binary, if $\eta(t, Z; \beta) = (1 - Z)t + Z\beta_0 t^2$, the treatment changes the time scale quadratically. If
$$\eta(t, z; \beta) = t\,\frac{\beta_0^{z} + \beta_1^{z} t}{1 + \beta_1^{z} t},$$
then $\eta(t, Z = 1; \beta)/(\beta_0 t) \to 1$ as $t \to 0$ and $\eta(t, Z = 1; \beta)/t \to 1$ as $t \to \infty$. Therefore $\beta_0$ characterizes the acceleration rate in the early period of time, while $\beta_1$ describes how fast the acceleration effect is dampened as time progresses.

General model I. In some situations, not all the covariates are considered to alter the time scale of the hazard function. Such covariates, for example, may include age, gender and socio-economic status, which may still have a proportional influence on the baseline hazard function. Let $Z = (Z_1, Z_2, \ldots, Z_p)^T = (Z_{p_1}, Z_{p_2})^T$, where $Z_{p_1}$ are the first $p_1$ covariates and $Z_{p_2}$ the rest of the $p$ covariates. Then a variation of the accelerated hazards model for this situation is
$$\lambda(t \mid Z) = \lambda_0\{t \exp(\beta_{p_1}^T Z_{p_1})\} \exp(\beta_{p_2}^T Z_{p_2}), \qquad (19)$$
where $\beta = (\beta_1, \beta_2, \ldots, \beta_p) = (\beta_{p_1}, \beta_{p_2})$. This model assumes that part of the covariates influence the baseline hazard function proportionally, while the baseline hazard function itself keeps an acceleration or deceleration relationship with respect to the other part of the covariates. Therefore, $\beta_{p_1}$ is interpreted as the covariate effect on acceleration/deceleration of the baseline hazard progression with the other covariates held constant on relative hazards; similarly, $\beta_{p_2}$ is interpreted as the covariate effect on relative hazards with the other covariates held constant on the acceleration/deceleration of the baseline hazard progression.
General model II. When all the covariates are considered to have an impact on both the time scale and the magnitude of the baseline hazard function, it is natural to use the more general hazards regression model
$$\lambda(t \mid Z) = \lambda_0\{t \exp(\beta_1^T Z)\} \exp(\beta_2^T Z). \qquad (20)$$
It is not difficult to see that model (20) includes the Cox proportional hazards model, the accelerated failure time model and the accelerated hazards model as subclasses. Namely, when $\beta_{10} = 0$, model (20) becomes the Cox proportional hazards model with a proportionality constant of $\exp(\beta_{20})$; when $\beta_{10} = \beta_{20}$, model (20) becomes the accelerated failure time model with a time scale change of $\exp(\beta_{10})$ or $\exp(\beta_{20})$ in the survival functions; when $\beta_{20} = 0$, model (20) becomes the accelerated hazards model with the hazards progression time ratio $\exp(\beta_{10})$. Therefore, model (20) provides an approach to judge which of the three models may fit the actual data well. As shown in Proposition 1 of Chen and Jewell (2001), when the underlying distribution is Weibull, all three classes of models and the general model (20) coincide.
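The nesting described above is easy to see numerically; the following toy snippet (values and the log-logistic baseline are our own illustrative choices) evaluates the hazard of model (20) at the three special parameter settings.

```python
# A tiny illustration of how the general model (20) nests the three sub-models,
# using the log-logistic baseline lambda0(t) = 1/(1+t) again.
import numpy as np

def general_hazard(t, z, beta1, beta2):
    # lambda(t | Z) = lambda0(t * exp(beta1 * z)) * exp(beta2 * z)
    return (1.0 / (1.0 + t * np.exp(beta1 * z))) * np.exp(beta2 * z)

t, z, b = 2.0, 1.0, np.log(2.0)
print(general_hazard(t, z, 0.0, b))   # beta1 = 0     : Cox PH with ratio exp(beta2)
print(general_hazard(t, z, b, b))     # beta1 = beta2 : accelerated failure time model
print(general_hazard(t, z, b, 0.0))   # beta2 = 0     : accelerated hazards model
```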
7. Implementation and application

Although the semiparametrically efficient estimating equations $S_{\mathrm{opt}}(\beta)$ are available, using them in practical numerical analysis may be difficult, simply because of the difficulty in estimating the ratio of $\lambda_0^{(1)}(t)$ to $\lambda_0(t)$ from the observed data. Lin and Ying (1994) proposed the so-called "sample splitting" technique to construct efficient estimators for the additive-multiplicative hazards model, which may be extended to the current situation. In general, however, if users are willing to sacrifice some efficiency, there are many other, simpler choices of the weight function $G$, such as Gehan's weight function. Simulation studies on different weight functions can be found in Chen and Jewell (2001).

Solving the estimating equations is challenging because the estimating functions are not smooth. The usual numerical approaches, such as the Newton–Raphson algorithm, do not work well. When the covariates are of low dimension, a direct grid search or the bisection method can be used; in high dimensions, random search techniques such as "simulated annealing" could be more efficient. In most cases of moderate dimension, the "recursive bisection" method of Huang (2002) is recommended for finding roots. The rationale of this recursive method is simple: in the $k$th step, suppose we know how to solve for $(\beta_1, \beta_2, \ldots, \beta_{k-1})$; then, equipped with the one-dimensional bisection algorithm, we should be able to solve for $\beta_k$. In our experience, the computing time of this recursive method is modest and acceptable.

It seems straightforward to estimate the variance of $\hat\beta$ by simply replacing $D$ with any consistent estimator $\widehat D$ and $\Sigma$ with $\widehat\Sigma(\infty, \hat\beta)$. However, since $D$ involves the unknown baseline hazard function $\lambda_0(t)$ and even its derivative, it is extremely difficult to obtain such a $\widehat D$ directly, although there are several approaches of which we can
Fig. 1. Contour plots of simulated power curves of the accelerated hazards model. Four censoring percentages are used in graphs: (A) 0%; (B) 10%; (C) 25%; (D) 50%. The marked numbers in the dashed lines are corresponding powers.
take advantage. For example, when the sample size is large enough, the nonparametric kernel density estimation suggested by Tsiatis (1990) can be used; or, a computing-intensive resampling algorithm by Parzen et al. (1994) can be implemented to approximate the variance–covariance matrix. An approach based on numerical differences by Huang (2002) is suggested for computing the variance–covariance matrix; it does not require sophisticated selection of a smoothing bandwidth or computing-intensive resampling. Details can be found in Chen and Jewell (2001). Similar to the other hazards regression models, the accelerated hazards model can be used to analyze many data sets in practice. Some such examples can be found in Chen and Wang (2000a, 2000b). When using the accelerated hazards model as the alternative in planning clinical trials, for example, simulation studies can be conducted to guide sample size calculations. For demonstration purposes, simulation studies have been conducted on (5), with the baseline hazard function assumed to be standard log-logistic: $\lambda_0(t) = 1/(1 + t)$. A two-sample score test statistic based on $S(\beta)$ is used to generate power curves with different combinations of $\beta$, censoring percentages and sample sizes. Simulations are conducted with 1000 iterations. Contour and 3-dimensional graphs are plotted in Figures 1 and 2.
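As a sketch of the grid-search strategy just described (two-sample case with hypothetical data; the function names are ours, and a real analysis would add the variance estimation discussed above), one can evaluate the non-smooth score $S(\beta)$ of (8) on a grid and take the minimizer of $|S(\beta)|$:

```python
import numpy as np

def S(beta, X, delta, Z):
    tstar = X * np.exp(beta * Z)        # observed times moved to the transformed scale
    w = np.exp(-beta * Z)               # weights appearing in the risk-set average (9)
    total = 0.0
    for i in np.where(delta == 1)[0]:
        at_risk = tstar >= tstar[i]
        zbar = np.sum(w[at_risk] * Z[at_risk]) / np.sum(w[at_risk])
        total += Z[i] - zbar
    return total

# hypothetical data
X = np.array([5., 8., 9., 12., 13., 16., 18., 23., 27., 30.])
delta = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 0])
Z = np.array([0., 0., 1., 0., 1., 0., 1., 1., 0., 1.])

grid = np.linspace(-2.0, 2.0, 401)
vals = np.array([S(b, X, delta, Z) for b in grid])
beta_hat = grid[np.argmin(np.abs(vals))]    # zero-crossing / minimizer of |S(beta)|
print(beta_hat)
```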
Fig. 2. Three-dimensional plots of simulated power curves of the accelerated hazards model. Four censoring percentages are used in graphs: (A) 0%; (B) 10%; (C) 25%; (D) 50%.
8. Some remarks

The accelerated hazards model carries some unique features of its own. First, it treats the hazard function as a process of hazard progression over time, and the parameter in the model therefore has an interpretation as an acceleration or deceleration of hazard progression. Furthermore, when the baseline hazard function $\lambda_0(t)$ is monotone, the parameter identifies a meaningful treatment effect (Chen and Wang, 2000b). Second, in contrast to the Cox model and the additive hazards model, the accelerated hazards model is not restricted to constant proportionality or constant additivity. This property is shared with the accelerated failure time model and may provide more empirical flexibility for certain types of data. Therefore, the accelerated hazards model serves as a valuable supplement to the available hazards regression models. When the assumptions of the other hazards regression models are not reasonable in practice, the accelerated hazards model offers an alternative or supplement. The model may be particularly suitable in randomized clinical trials when the treatment effect is assumed to be a time scale change between hazard functions: the hazards are identical at the onset time of a randomized trial, which is consistent with the rationale of randomization.
The accelerated hazards model does not preclude crossovers of either the hazard functions or the survival functions. The removal of this restriction has two immediate consequences. First, it allows us to use a single parameter to reflect a complex phenomenon which would otherwise need a time-dependent structure in other models. Second, the trade-off is that stochastic ordering may not be preserved in an arbitrary situation, which may lead to some difficulty in justifying treatment benefits directly, though it may still indicate an overall biological mechanism of the treatment on subjects – accelerating or decelerating their clocks of failure time. The accelerated hazards model is subject to an identifiability condition: it is not identifiable when the baseline hazard function is constant, i.e., when the underlying distribution is exponential. This identifiability condition needs to be checked before implementing the accelerated hazards model; otherwise the variance estimates of the parameter of interest may look unreasonably large and subsequent inference becomes less meaningful.
References

Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Andersen, P.K., Gill, R.D. (1982). Cox's regression model for counting processes: A large sample study. Ann. Statist. 4, 1100–1120.
Chen, Y.Q. (2001). Accelerated hazards regression model and its adequacy for censored survival data. Biometrics 57, 853–860.
Chen, Y.Q., Jewell, N.P. (2001). On a general class of semiparametric hazards regression models. Biometrika 88, 687–702.
Chen, Y.Q., Wang, M.-C. (2000a). Analysis of accelerated hazards model. J. Amer. Statist. Assoc. 95, 608–618.
Chen, Y.Q., Wang, M.-C. (2000b). Estimating a treatment effect with the accelerated hazards models. Controlled Clinical Trials 21, 369–380.
Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, 187–220.
Fleming, T., Harrington, D. (1991). Counting Processes and Survival Analysis. Wiley, New York.
Harrington, D.P., Fleming, T.R. (1982). A class of rank test procedures for censored survival data. Biometrika 69, 133–143.
Huang, Y.J. (2002). Calibration regression of censored lifetime medical cost. J. Amer. Statist. Assoc. 97, 318–327.
Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
Lai, T.Z., Ying, Z. (1992). Linear rank statistics in regression analysis with censored or truncated data. J. Multivariate Anal. 40, 13–45.
Lin, D.Y., Ying, Z. (1994). Semiparametric analysis of the additive risk model. Biometrika 81, 61–71.
Lin, D.Y., Ying, Z. (1997). Additive hazards regression models for survival data. In: Lin, D.Y., Fleming, T.R. (Eds.), Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis. Springer, New York.
Parzen, M.I., Wei, L.J., Ying, Z. (1994). A resampling method based on pivotal estimating functions. Biometrika 81, 341–350.
Tsiatis, A.A. (1990). Estimating regression parameters using linear rank tests for censored data. Ann. Statist. 18, 354–372.
Wei, L.J., Ying, Z., Lin, D.Y. (1990). Linear regression analysis of censored survival data based on rank tests. Biometrika 77, 845–851.
Ying, Z. (1993). A large sample study of rank estimation for censored regression data. Ann. Statist. 21, 76–99.
Ch. 25.
Diagnostics for the Accelerated Life Time Model of Survival Data
Daniel Zelterman and Haiqun Lin
1. Introduction

Of the two best known semiparametric models for censored survival data, the Cox model of proportional hazards has received far more attention than the accelerated life time model. Perhaps this difference is due to the lack of readily available software to fit and diagnose the model. Despite the relative shortage of published results, the accelerated life time model has many appealing features, most prominently its ease of interpretation. The accelerated life time model assumes that the survival function $S(t \mid x)$ of an individual with vector of covariate values $x$ is expressible as
$$S(t \mid x) = S_0\{t \exp(x'\beta)\}$$
for all times $t \ge 0$. In this relation, $S_0$ is an unspecified survival function. By not specifying $S_0$ we identify this model as semiparametric. The vector of regression coefficients $\beta$ needs to be estimated from the observed survival times and their corresponding covariate values $x$. We describe the estimating equations for producing the estimate $\hat\beta$ of $\beta$ in Section 2. These equations suffer from the problem that they can rarely be solved exactly for a set of estimated regression coefficients $\hat\beta$. Instead, in a small neighborhood of $\hat\beta$ we will typically see both positive and negative values of the estimating equations. In Section 3 we develop a Gibbs-like algorithm for obtaining $\hat\beta$ from the estimating equations by iteratively solving these one at a time. Two general classes of diagnostic measures are explored here. These measures test goodness-of-fit for the model and overdispersion in the individual estimated regression coefficients. In order to simulate the accelerated life model we also develop a semiparametric bootstrap procedure that generates pseudo-samples under the correct model. One set of diagnostic measures is based on extremes of the estimating equations due to Wei (1984). These diagnostics describe a general family of goodness-of-fit tests that are not specific to the accelerated life model. The measures are based on approximating partial sums of the estimating equation as a Brownian bridge. There is a natural ordering of the observed values and the maximum absolute value of this Brownian bridge
provides an intuitive measure of goodness-of-fit. The analogous test of goodness-of-fit, specifically for the accelerated life time model is provided in Section 4.1. Zelterman and Chen (1988) propose a test of overdispersion based on a comparison of the empirical variance and the expected variance using the second derivative of the likelihood function. Intuitively, we say that there is overdispersion in the data if the empirical variance is significantly larger than the variance anticipated by the observed Fisher Information for the likelihood function. This method is used to develop a test of goodness-of-fit for the model of proportional hazards by Lin and Wei (1989) and Le and Zelterman (1992). The overdispersion test for β in the accelerated life model is given in Section 4.2. Partial residuals are defined in Section 4.3 by analogy to those developed by Schoenfeld (1982) for the model of proportional hazards. There is a separate set of residuals for each covariate in the accelerated life time model. Zelterman et al. (1996) develop a semiparametric bootstrap simulation method for the proportional hazards model. They use this bootstrap to simulate goodness-of-fit statistics for proportional hazards models under the proper model. They simulate the goodness-of-fit tests that are analogous to those examined in Section 4 in this work. A semiparametric bootstrap is developed in Section 5 for the accelerated life model. In this simulation, an observed survival time t and its corresponding censoring indicator δ are chosen as an intact pair from the empirical data. A covariate vector x is assigned to this pair corresponding to the accelerated failure model using Bayes theorem. This bootstrap simulation can be performed both conditionally and unconditionally on the distribution of covariate vectors as these appear in the original data. These two separate bootstrap methods are explained in Sections 5.1 and 5.2, respectively. Three numerical examples are provided in Section 6 to illustrate these methods. The first of these contains a single, binary valued covariate. This simple example allows us to provide many of the details of the statistical analysis. Two additional examples have a number of both continuous and binary valued covariates. These latter two examples are more illustrative of situations likely to be encountered by the data analyst in practice.
2. The likelihood and estimating equations

The semiparametric accelerated life time model specifies that the survivor function $S_i$ of subject $i$ ($i = 1, \ldots, n$) with covariate vector $x_i$ satisfies
$$S_i(t \mid x_i) = S_0\{t \exp(x_i'\beta)\}, \qquad (1)$$
where $S_0$ is an unspecified baseline survivor function and $\beta$ is a vector of regression coefficients to be estimated. In this section we derive a set of estimating equations, given below at (6). The development of these equations follows Louis (1981) or Lin and Ying (1995) and is repeated here for completeness. The development demonstrates the assumptions that are made in order to provide consistent parameter estimates for the semiparametric model (1).
The hazard function $\lambda_i(t) = \lambda(t \mid x_i)$ of the $i$th subject in (1) satisfies
$$\lambda(t \mid x_i) = \exp(x_i'\beta)\, \lambda_0\{t \exp(x_i'\beta)\}.$$
Let $Y_i(t)$ be the indicator function that subject $i$ is in the risk set at time $t$: that is, $Y_i(t) = 1$ if subject $i$ is in the risk set at time $t$ and $Y_i(t) = 0$ otherwise. Let $N_i(t)$ count the number of events for subject $i$ by time $t$. We will concentrate on the problem for which each subject can have at most one event, so that $N_i(\infty) = 1$. Once the subject experiences the event he is no longer at risk. Then $N_i(t) = 1 - Y_i(t)$ for all $i = 1, \ldots, n$ and all $t > 0$. If subject $i$ experiences the event of interest at time $t$ then $dN_i(t) = N_i(t) - N_i(t-) = 1$, and $dN_i(t) = 0$ if no event occurs at time $t$. The likelihood for $\beta$ in the accelerated life time model (1) is
$$L(\beta) = \prod_{i=1}^n \prod_{t > 0} \lambda_i(t)^{dN_i(t)} \exp\left\{-\int_0^\infty Y_i(u)\, \lambda_i(u)\, du\right\}. \qquad (2)$$
The log likelihood in (2) can then be written as l(β) = log L = li (β), i
where
∞
li (β) =
0
0
∞
Yi (t)λi (t) dt
xi β + log λ0 t exp(xi β) dNi (t)
∞
− 0
Let
0
∞
=
λ0
log λi (t) dNi (t) −
exp(xi β)Yi (t)λ0 t exp(xi β) dt.
denote the derivative λ0 (s) = dλ0 (s)/ds
of the baseline hazard function λ0 . The likelihood score function for β is derived from ∞ λ (t exp(xi β))xi t exp(xi β) dNi (t) ∂li (β)/∂β = xi + 0 λ0 (t exp(xi β)) 0
∞
−
xi exp(xi β)λ0 t exp(xi β)
0
+ txi exp(2xi β)λ0 t exp(xi β) Yi (t) dt ∞ λ (t exp(xi β))xi t exp(xi β) xi + 0 = λ0 (t exp(xi β)) 0 × dNi (t) − exp(xi β)λ0 t exp(xi β) Yi (t) dt .
(3)
Let us apply the change of variable t 0 = ti0 = t exp(xi β)
(4)
to the baseline time for the ith subject with dt 0 = exp(xi β) dt the Jacobian for the change of time scale given at (4). The baseline t 0 times at (4) are useful because all individuals can be compared on this time scale. In the t 0 time scale it is possible to describe one individual as being in the risk set of another. Notice that this risk set depends on the value of β and can change abruptly with a small perturbation in this parameter. As a result, it may be difficult to find exact solutions to the likelihood equations ∂li (β)/∂β = 0 given at (3). In (3) let us use the notation Ni t 0 , β = Ni t 0 exp(−xi β) and Yi t 0 , β = Yi t 0 exp(−xi β) . Define x0i as
x0i = x0i t 0 , β = xi + xi ti0 λ0 t 0 /λ0 t 0 .
Then (3) can be written as ∞ xi + xi ti0 λ0 t 0 /λ0 t 0 ∂li /∂β = 0
× dNi t 0 , β − Yi t 0 , β λ0 t 0 dt 0 .
Next replace λ0 (t 0 ) dt 0 in (5) with
n n 0 dNi t , β Yi t 0 , β . i=1
i=1
This gives ∂l(β)/∂β =
n i=1 0
∞
x0i
−
n l=1
x0i Yl
n 0 0 t ,β Yl t , β dNi t 0 , β . l=1
(5)
It is mathematically convenient and intuitive to define λ0 (t 0 ) dt 0 as piecewiseconstant where it otherwise would not be defined. Then λ is always equal to zero and x0i is equal to xi . The resulting set of estimating equations is
n n ∞ n 0 0 U(β) = xi − xi Yl t , β Yl t , β dNi t 0 , β . i=1 0
l=1
l=1
Finally we can write U(β) =
J
x(j ) − µj (β),
(6)
j =1
where x(j ) is the value of the covariate for the subject who fails at the j th ordered failure time t(j ) on the t 0 scale. The function
n n 0 0 µj (β) = (7) xl Yl t(j , β Yl t(j ) ), β l=1
l=1
can be interpreted as an average of the covariate values x among all subjects who are 0 at risk at baseline time t(j ). 0 . Our formulation The averages in (7) are taken over all individuals at risk at time t(j ) of the estimating equation at (6) is similar to the approach of Lin and Ying (1995). Forming the estimating equation through the partial likelihood score function of Louis (1981) also results in this same expression. In general, an exact solution of U(β) = 0 in (6) can rarely be obtained in practice because U(β) is a p-dimensional step function of β. Intuitively, small changes of β in (7) may leave this function unchanged or cause individual components to abruptly change sign. In Section 3 we describe an iterative fitting procedure that is similar to the familiar Gibbs sampling method for generating pseudo-samples from multivariate distributions. The estimating equation (6) has a functional form similar to that of the log-rank statistic. At the true parameter value, the statistic n−1/2 U(β) behaves asymptotically multivariate normally by the results of Andersen and Gill (1982) or Cuzick (1985) with variance n−1 V(β). The function V(β) can be consistently estimated by
n n ∞ n 0 ⊗2 0
V(β) = xl − µj (β) Yl t , β Yl t , β dNi t 0 , β , i=1 0
tj exp(xj β)
l=1
l=1
(8)
where = denotes the baseline time scale for j th observed failure time given at (4). The notation v⊗2 of a vector v denotes the outer product tj0
v⊗2 = vv .
We can also write the variance estimate in (8) as
V(β) =
J
σ 2j
j =1
where
σ 2j
=
n n 0 0 (xl − µj )(xl − µj ) Yl tj , β Yl tj , β . l=1
(9)
l=1
Written in this form, we see that V is a weighted average of the empirical variances of the covariate vectors x. We will use this estimate to approximate the variances of estimated parameters βˆ in the examples of Section 6. This asymptotic variance estimate will be compared to the empirical variance in Section 4.2 as a method of testing goodness-of-fit for the accelerated failure model and for testing overdispersion in the individual parameters in β.
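A compact sketch of the estimating function in (6) for a scalar covariate is given below (hypothetical data; this only illustrates the risk-set averages $\mu_j$ of (7), it is not the authors' software).

```python
# A minimal sketch of U(beta) in (6): transform each time to the baseline scale
# t0 = t * exp(x * beta), and at every observed failure compare the failing
# subject's covariate with the risk-set average mu_j of (7).
import numpy as np

def U(beta, t, delta, x):
    t0 = t * np.exp(x * beta)               # baseline time scale (4)
    total = 0.0
    for j in np.where(delta == 1)[0]:
        at_risk = t0 >= t0[j]               # risk set on the t0 scale
        mu_j = np.mean(x[at_risk])          # covariate average over the risk set
        total += x[j] - mu_j
    return total

# hypothetical data
t = np.array([5., 8., 9., 12., 13., 16., 18., 23., 27., 30.])
delta = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 0])
x = np.array([0., 0., 1., 0., 1., 0., 1., 1., 0., 1.])
print([U(b, t, delta, x) for b in (-1.0, 0.0, 1.0)])  # look for a sign change
```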
3. A Gibbs-like estimation procedure

In general there will be no value of β that solves U(β) = 0 in (6). Intuitively, the function U(β) is a locally constant step function with a finite number of steps. Our plan, then, is to identify a parameter value denoted $\hat\beta$ such that each component of the vector $U(\hat\beta)$ changes sign when the vector $\hat\beta$ is perturbed just a small amount. In a small neighborhood of $\hat\beta$ each component of $U(\hat\beta)$ will change sign. We describe an algorithm that identifies a parameter value $\hat\beta$ with the property that the signs of $U(\hat\beta+)$ and $U(\hat\beta-)$ are opposite in each component of the vector-valued function U. There is no guarantee, of course, that such a point $\hat\beta$ exists, but in practice we can usually find a point that comes very close to satisfying these conditions. The value of $\hat\beta$ is obtained here in a component-wise fashion, much as we do when performing Gibbs sampling in multi-dimensional simulations. Specifically, each element of β = {β1, . . . , βJ} is obtained by solving the univariate equation Ui(β) = 0. We then iterate across all components of β and repeat as often as necessary. Typically there might be no value of β that solves any of the univariate equations Ui(β) = 0 exactly. The best we may be able to do is to vary the value of βi in β so that the sign of Ui(β + δei) is opposite that of Ui(β − δei) for δ close to zero and the unit Euclidean basis vector ei = {0, . . . , 0, 1, 0, . . . , 0} with a '1' in the ith component.
Even this component-wise solution of the likelihood equations may fail to converge. A well-known problem with the Gibbs sampler occurs when the target distribution has multiple modes that do not lie along the same coordinate axes. The iterative solution described here can also fail under similar circumstances. If the roots of the individual likelihood equations Ui (β) = 0 do not occur along the coordinate axes of β then the algorithm may alternate between two or more different values. We did not find that this situation occurred in any of the examples we examined in Section 6. This failure of the Gibbs solution occurred in only a few simulated samples during the various bootstrap simulations. In these cases the alternating solutions tended to be rather close to each other. We do not expect that convergence failure of the algorithm should pose a problem in practice.
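The component-wise search can be sketched as follows (an illustrative implementation under our own naming conventions; U_vec stands for a user-supplied function returning the vector U(β)).

```python
# A rough sketch (not the authors' code) of the Gibbs-like component-wise search:
# cycle through the coordinates of beta and, for each, bisect on the sign of the
# corresponding component of U until U_i changes sign across a small interval.
import numpy as np

def gibbs_like_solve(U_vec, beta0, lo=-5.0, hi=5.0, sweeps=20, tol=1e-6):
    """U_vec(beta) returns the p-vector of estimating functions."""
    beta = np.array(beta0, dtype=float)
    for _ in range(sweeps):
        for i in range(len(beta)):
            def Ui(v):
                trial = beta.copy()
                trial[i] = v
                return U_vec(trial)[i]
            a, b = lo, hi
            if Ui(a) * Ui(b) > 0:            # no sign change in the bracket
                continue
            while b - a > tol:               # one-dimensional bisection on the sign
                m = 0.5 * (a + b)
                if Ui(a) * Ui(m) <= 0:
                    b = m
                else:
                    a = m
            beta[i] = 0.5 * (a + b)
    return beta
```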
4. Diagnostic measures In this section we propose three diagnostic measures for the accelerated life time model. These measures are useful for detecting different types of deviations from the assumed model. The first of these is similar to the test of Wei (1984). The second test is derived from tests for overdispersion by Zelterman and Chen (1988). Partial residuals for the accelerated life time model can be defined directly from the individual terms in the estimating equations given at (6). 4.1. Brownian bridge statistic ˆ = 0 as described in the previous section. The Let βˆ denote one Gibbs solution of U(β) score functions U given at (6) are U(β) =
J
x(j ) − µj (β).
j =1
Write U = {U1 , . . . , Up }. Each component Ui of U corresponds to a different covariate. These will be examined separately because their scales and variances depend on the individual covariates x. For every i = 1, . . . , p each Ui (β) can be written as Ui (β) =
J
xij − µij (β)
j =1
using a suitable notation. For each k = 1, . . . , J , define the partial sums k Uik = Uik βˆ = xij − µij βˆ . j =1
These partial sums are only over the covariates of the first k ordered t 0 observations. We will define Ui0 = 0 and note that UiJ should be very close to (but rarely equal to)
ˆ The values Ui0 , . . . , UiJ behave approximately as a realizazero when evaluated at β. tion of a discrete Brownian bridge with Ui0 = 0 and UiJ ≈ 0. The test statistics we propose for each component of the covariate vector x are Ti = max Uik βˆ . 1kJ
Large values of Ti are indicative of a poor fitting model possibly due to the presence of outliers among the covariate values or highly influential observations in the data. The Ti statistics are analogous to the method proposed by Wei (1984) for use in testing goodness-of-fit in proportional hazards models. There is a separate Ti statistic for each component of β. We found that it was not useful to combine all of the Ti statistics in a single goodness-of-fit. These individual statistics are all different in magnitudes because the individual covariates x are measured on different scales. 4.2. Overdispersion statistics A second diagnostic examines the fit of the accelerated life time model for large values of the statistics J ⊗2 2 ˆ S = diag −σ x(j ) − µj β j
j =1
where σ is given at (9). Intuitively, large values of the components in S = {S1 , . . . , Sk } indicate that the observed variance of covariates x are greater than anticipated by the accelerated life time model. Under the null hypothesis of the accelerated life time model, the Si statistics should have zero means. As with the Ti statistics, the Si statistics measure extremes among the covariates rather than in the survival times. The easiest way to determine extreme critical values of Ti or Si is through the semiparametric bootstrap method, described in Section 5. 4.3. Partial residuals We can define residuals for the accelerated life time model by analogy to the partial residuals described by Schoenfeld (1982). The estimating equation (6) is composed from sums of raw residuals defined by raw residual(j ) = x(j ) − µj (β) 0 for the j th ordered, fitted survival time. corresponding to t(j ) Intuitively these raw residuals are useful because they correspond to the difference between the observed covariate values at time t(j ) and the weighted average µj of all covariates in the corresponding risk set. The definition of µj (β) is given at (7). There will be a separate set of partial residuals for every covariate in the model. The residual ˆ time will be equal to zero because the µ values are averaged at the longest fitted t0 (β) over this point alone. Similarly, there are no residuals defined corresponding to censored times.
A difficulty with using these raw residuals is that they are measured on different scales and need to be standardized by the corresponding variances σ 2j given at (9). Using the previous notation, the standardized partial residual is rij = (xij − µij )/σij for the ith covariate value at the j th ordered t0 time. The partial residuals from an example with a single binary valued covariate are plotted in Figure 2. The survival data from the Stanford heart transplant study is also examined in this manner. In this larger example we examined four continuous valued covariates measured on every subject. The partial residuals for this transplant example are plotted in Figure 3. Figure 4 plots the partial residuals for an example of bladder cancer recurrence times given by Wei et al. (1989).
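The three diagnostics of this section can be computed along the following lines (a sketch continuing the hypothetical scalar-covariate setup used earlier; the risk-set standard deviation stands in for the $\sigma_j$ of (9)).

```python
# A short sketch of the diagnostics: Brownian-bridge partial sums U_ik, the
# statistic T_i = max_k |U_ik|, and standardized partial residuals r_ij.
import numpy as np

def diagnostics(beta_hat, t, delta, x):
    t0 = t * np.exp(x * beta_hat)
    order = np.argsort(t0)
    partial_sums, residuals = [0.0], []
    for j in order:
        if delta[j] != 1:
            continue                          # no residual for censored times
        at_risk = t0 >= t0[j]
        mu = np.mean(x[at_risk])              # risk-set covariate average
        sd = np.std(x[at_risk]) or 1.0        # risk-set covariate SD (proxy for sigma_j)
        partial_sums.append(partial_sums[-1] + (x[j] - mu))
        residuals.append((x[j] - mu) / sd)
    T = np.max(np.abs(partial_sums))          # Brownian-bridge statistic
    return T, residuals
```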
5. The bootstrap procedure

Let $t_{(j)}$ denote the $j$th ordered observed life time and let $\delta_i = 1$ if subject $i$ experienced an event and 0 otherwise. The bootstrap values of $(t^*_{(j)}, \delta^*)$ and $x^*$ are separately sampled with replacement from the observed data. A bootstrap pseudo-observation is obtained by drawing a time/censoring pair $(t^*_{(j)}, \delta^*)$ from the data and then sampling a covariate vector $x^*$ from the conditional distribution given the $(t^*_{(j)}, \delta^*)$ selected. This conditional distribution is determined by the accelerated life time model. The covariates $x_j$ are sampled as intact vectors and not assigned as individual scalar covariate values. Similarly, the $(t, \delta)$ pairs are sampled intact as they appear in the original data. If an observed time is censored, for example, then it will continue to remain censored in all bootstrap samples.

5.1. The unconstrained bootstrap

Define the marginal probability distribution of a randomly selected covariate vector by
$$\rho_k = \Pr\big[x^* = x_k\big]. \qquad (10)$$
For the moment we restrict our attention to non-censored ($\delta = 1$) observations. Define $\tau_j$ as the probability that the bootstrap selects the time and censoring pair $(t_{(j)}, 1)$ from the observed data. That is,
$$\tau_j = \Pr\big[(t^*, \delta^*) = (t_{(j)}, 1)\big] = d_j / \delta_+,$$
where $d_j$ is the number of events occurring at time $t_{(j)}$ and
$$\delta_+ = \sum_{j=1}^J d_j$$
is the number of non-censored observations in the observed data.
Given that the bootstrap selects the pair $(t_{(j)}, 1)$, our bootstrap generates $x^*$ according to the conditional distribution
$$\pi_{jk} = \pi_{jk}(\beta) = \Pr\big[x^* = x_k \mid (t^*, \delta^*) = (t_{(j)}, 1)\big],$$
where the $\pi_{jk}$ are determined according to the accelerated life time model (1). For each selected $t_{(j)}$, we transform it to the corresponding baseline time scale $t^0_{(j)}$ and select among those $x_k$'s according to the distribution of the covariates $x$ at baseline time scale $t^0_{(j)}$. The reader is reminded that the baseline time scale $t^0_{(j)}$ defined at (4) depends on $\hat\beta$ and the covariates $x$.

We can find the conditional distribution of the covariates $x^*$ from the values of $\pi_j$ given that the bootstrap selects a censored time with $\delta = 0$. Denote this probability by
$$\pi^c_{jk} = \pi^c_{jk}(\beta) = \Pr\big[x^* = x_k \mid T \ge t_j\big].$$
A simple approximation shows
$$\pi^c_{jk} = \Pr\big[x^* = x_k,\ T \ge t_j\big] \big/ \Pr[T \ge t_j]$$
$$\doteq \sum_{l:\, t_{(l)} \ge t_{(j)}} \Pr\big[x^* = x_k \mid (t^*, \delta^*) = (t_{(l)}, 1)\big] \Pr\big[(t^*, \delta^*) = (t_{(l)}, 1)\big] \Big/ \sum_{l:\, t_{(l)} \ge t_{(j)}} \Pr\big[(t^*, \delta^*) = (t_{(l)}, 1)\big].$$
Bayes' theorem shows that
$$\pi^c_{jk} = \sum_{l:\, t_{(l)} \ge t_{(j)}} \pi_{lk}\, \tau_l \Big/ \sum_{l:\, t_{(l)} \ge t_{(j)}} \tau_l = \sum_{l:\, t_{(l)} \ge t_{(j)}} \pi_{lk}\, d_l \Big/ \sum_{l:\, t_{(l)} \ge t_{(j)}} d_l. \qquad (11)$$
That is, $\pi^c_{jk}$ is a weighted average of all the $\pi_{lk}$ associated with non-censored event times that occur after the censored time $t_{(j)}$. If the longest $t^0$ baseline survival time is censored then we define the corresponding $\pi^c$ to be the same as the last uncensored value. A numerical example of these $\pi$ values is given in Table 3. In this example there is a single, binary valued covariate.

5.2. The constrained bootstrap

In the unconstrained sampling just described, there are no limits placed on the number of times each value of the covariate vector appears in a bootstrap sample. This type of sampling fails to condition on the design of the study and might inflate the variance estimate of $\hat\beta$ over the conditional estimate. Constrained sampling is designed to keep the covariate frequencies intact in each bootstrap sample. Once a covariate vector $x_k$ has been selected in a given bootstrap pseudo-sample, its marginal probability $\rho_k$, defined in (10), of being selected in a subsequent pseudo-observation is set to zero. We then need to recalculate the entire set of conditional probabilities $\pi_{jk}$ and $\pi^c_{jk}$ before the next pseudo-observation can be generated. Under constrained sampling, every pseudo-sample has the same set of covariate vectors as the original data but the time/censoring pairs continue to be sampled
with replacement. This method was applied to several examples in Zelterman et al. (1996) for the model of proportional hazards.
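A minimal sketch of one unconstrained bootstrap draw is shown below (binary covariate; the arrays pi and pi_c are assumed to have been computed from the fitted model as in (11), and all names are ours).

```python
# A compact sketch of the unconstrained semiparametric bootstrap: draw a
# (time, censoring) pair intact, then draw the covariate from pi_jk (uncensored)
# or its weighted average pi^c_jk in (11) (censored).
import numpy as np
rng = np.random.default_rng(0)

def draw_pseudo_observation(t, delta, pi, pi_c):
    """pi[j] and pi_c[j] give Pr(x* = 1 | ...) for the j-th observed time."""
    j = rng.integers(len(t))                  # sample a (t, delta) pair intact
    p1 = pi[j] if delta[j] == 1 else pi_c[j]  # conditional covariate distribution
    x_star = int(rng.random() < p1)
    return t[j], delta[j], x_star
```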
6. Numerical examples

Three numerical examples are presented here. The first example contains a single, binary valued covariate. This simple structure allows us to illustrate many details of the methods described in this chapter. Two additional examples include multiple, continuous valued covariates. These latter examples are more illustrative of settings that the data analyst is likely to encounter in practice. In every example, the bootstrap simulation consisted of 100 pseudo-samples.

6.1. AML

This example was also reported in Embury et al. (1977) and Miller (1981, pp. 49–50). The data appear in Table 1 and represent time in weeks to relapse for patients with acute myelogenous leukemia (AML). The single binary valued covariate indicates whether the patients received maintenance chemotherapy (treatment) or were untreated controls. The Kaplan–Meier plot in Figure 1 contains both the raw data (as solid lines) and the fitted accelerated life times (dotted). The t0 times to remission of the control patients (whose covariate is coded as x = 0) are not affected by the regression model, so their fitted survival curve is the same as that of the original data. The patients receiving treatment (whose covariate x = 1) are estimated to have experienced a beneficial effect of that treatment, and their fitted survival curve (depicted by the dotted line) is shifted left, closer to that of the controls. The negative fitted regression coefficient $\hat\beta = -0.671$ indicates that their t0 time to relapse is shortened in order to coincide with those of the control patients. The summary statistics for our analysis of these data are given in Table 4. The Wald χ2 = 1.845 is not statistically significant (p = 0.17) for the accelerated life model. The constrained and unconstrained bootstrap simulations indicate that the estimated $\hat\beta$ has a negative bias and the true value of the regression coefficient is close to zero. The simulated and asymptotic standard errors of $\hat\beta$ are reasonably close in value. The S and T statistics did not indicate overdispersion in either the constrained or the unconstrained bootstrap simulations. The conditional distribution πi of the covariate values is given in Table 3. Each πi is the probability that the ith patient time corresponds to a treated individual. At 23 weeks

Table 1. Time in weeks to remission for patients with acute myelogenous leukemia (AML). Source: Embury et al. (1977) and Miller (1981, pp. 49–50)
Control: 5, 5, 8, 8, 12, 16+, 23, 27, 30, 33, 43, 45
Treated: 9, 13, 13+, 18, 23, 28+, 31, 34, 45+, 48, 161+
Table 2 Time to first recurrence in bladder cancer patients. Columns are time, censoring indicator, treatment (0 = control, 1 = thiotepa), size of original lesions, and number of lesions (truncated at 8). Source: Wei et al. (1989) 0 10 5 3 26 29 2 34 9 6 49 59 1 10 18 22 6 38 41 2 49 38
0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
1 5 1 1 1 1 1 2 4 2 1 1 1 1 5 1 1 1 3 6 3 2
1 1 3 1 2 4 5 1 1 6 3 1 3 1 1 1 1 1 2 1 3 1
1 6 12 3 1 29 3 36 16 3 35 2 1 13 17 25 6 22 41 45 50 59
0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
1 4 1 3 8 1 2 2 5 2 3 3 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 2 1 1 1 1 1 2 1 1 3 3 1 1 1 2 1 3
4 14 23 7 2 29 12 29 41 9 17 5 5 3 2 25 2 4 1 2 4
0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
2 1 3 2 1 4 1 3 1 1 1 1 8 2 5 1 2 6 1 1 4
1 1 3 3 4 1 3 1 2 1 7 3 1 6 1 5 1 1 1 4 1
7 18 10 3 25 28 32 37 3 18 3 2 9 1 17 25 26 24 44 46 54
0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 3 2 1 5 1 1 8 3 1 1 3
Fig. 1. Kaplan–Meier survival plot of AML data for original (solid) and fitted t0 (dashed) times.
1 1 3 1 2 6 2 2 1 1 1 3 2 3 1 1 3 1 1 4 4
Diagnostics for the accelerated life time model of survival data
455
Table 3 Accelerated life estimated bootstrap probabilities for the AML data t 5 8 9 12 13 13 16 18 23 23 27 28 30 31 33 34 43 45 45 48 161
δ
x
ˆ t0 (β)
πi
1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0
0 0 1 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0 1 1 1
5.00 8.00 4.60 12.00 6.64 6.64 16.00 9.20 11.76 23.00 27.00 14.31 30.00 15.84 33.00 17.38 43.00 45.00 23.00 24.53 82.29
0.454 0.444 0.478 0.429 0.500
πic
0.363 0.350 0.500 0.467 0.333 0.167 0.341 0.200 0.417 0.250 0.400 0.333 0.500 0.286 0.286 0.286
Table 4 Summary of bootstrap simulation of AML data Statistic
Original sample
asymptotic SD Wald χ 2 Wald p-value
βˆ
S
−0.671 0.494 1.847 0.17
−0.029
1.393
T
Unconstrained bootstrap
mean SD empirical significance
0.048 0.450 0.43
−0.050 0.051 0.40
1.549 0.496 0.91
Constrained bootstrap
mean SD empirical significance
0.007 0.324 0.43
−0.052 0.043 0.33
1.604 0.478 0.92
there was a relapse experienced by both a control and a treated patient. These represent two different times on the baseline t0 scale and similarly have two different conditional probabilities πi of treatment.
456
D. Zelterman and H. Lin
Fig. 2. Residual plot of the AML data.
The residuals are plotted against the original survival time in Figure 2. The are no unusual values apparent in this plot. The pattern of residuals about the zero line correspond to the two covariate values of x = 0 below and x = 1 above. A similar pattern appears in plots of residuals from logistic regression. Such plots will exhibit a pair of residuals grouped above and below, but away from the line corresponding to a zero residual. 6.2. Stanford heart transplant data There have been many analyses of this data introduced by Miller and Halpern (1982). The form of the data we examined contained four covariates. Three covariates were continuous: Age at time of enrollment in the transplant program; Mismatch score on alleles; and Total mismatch score. A fourth, binary valued covariate included was the T5 antigen mismatch score. Our summary of the fitted regression coefficients, goodness-offit statistics, and bootstrap appears in Table 5. The standardized residuals for this model are plotted in Figure 3. In Table 5 we see that the T5 antigen mismatch score exhibits an extreme statistical significance in its effect on survival with a Wald p = 0.0003. Age is only moderately significant with a p = 0.086 for the Wald statistic. It has been suggested that the hazard as a function of the age covariate is ‘U’ shaped. That is, very old as well as very young patients requiring a heart transplant are at especially elevated risk of failure. The S and T statistics do not reveal any lack of fit for each of the four covariates in this data set. The partial residual plot in Figure 3 identifies two unusual individuals. Along the bottom of the figure, a patient with an unusually small age is identified. This organ recipient was 19 years old at the time of his transplant surgery and survived 285 days. ˆ = 268 is reasonable close to his actual survival. This His estimated survival time t 0 (β) individual was not the youngest heart recipient in the data set. The youngest person in the data set was 8 years old but his survival was censored and as such this observation
Diagnostics for the accelerated life time model of survival data
457
Table 5 Summary statistics for the Stanford heart transplant data Sample
Statistic Age
Original data
Unconditional bootstrap simulation
βˆ asymp. SD Wald χ 2 Wald p-value
0.0411 0.0313 1.72 0.086
Covariate Mismatch T5 antigen on alleles mismatch −0.228 0.259 0.77 0.44
S T
64.94 43.63
mean βˆ SD
−0.004 0.038 0.0
ave. S SD of S S p-value
59.8 19.8 0.44
1.046 0.309 0.19
ave. T SD of T T p-value
30.1 11.7 0.15
3.88 1.23 0.17
1.274 5.146 −0.001 0.282
1.245 0.653 3.64 0.0003 −0.239 1.733 0.0885 0.762
Mismatch score −0.155 0.613 0.06 0.95 −0.022 1.490 0.012 0.609
−0.321 0.204 0.38
−0.114 0.178 0.31
1.565 0.577 0.82
1.948 0.680 0.96
Fig. 3. Partial residual plot of the Stanford heart transplant data. Age = •; Mismatch on alleles = ∗; T5 antigen mismatch = +; Mismatch score = ♦.
does not generate a partial residual in this plot. Another unusual partial residual at the top of Figure 3 corresponds to the mismatch score of a 55 year old organ recipient who survived only 28 days but was estimated to survive t0 = 306 days.
458
D. Zelterman and H. Lin
Table 6 Summary statistics for the bladder cancer data Sample
Statistic Treatment group
Original data
Unconditional bootstrap simulation
Covariate Size of lesion
Number of lesions
βˆ asymp. SD Wald χ 2 S T
−0.896 0.315 8.05 0.112 3.468
0.336 0.083 16.4 3.324 7.151
0.112 0.104 1.17 2.546 6.090
mean βˆ SD p-value
−0.022 0.537 0.51
−0.011 0.143 0.54
0.015 0.185 0.46
ave. S SD of S S p-value
0.095 0.037 0.34
4.177 1.090 0.75
2.057 0.460 0.15
ave. T SD of T T p-value
2.316 0.849 0.13
11.28 4.370 0.87
8.421 2.819 0.75
Fig. 4. Partial residual plot of the bladder cancer recurrence times. Treatment group = •; Number of lesions = ∗; Size of lesions = +.
6.3. Bladder cancer The data given in Table 2 represents the time to first recurrence of bladder cancer. This data was published by Wei et al. (1989). The three covariates of interest are: Treatment (0 = control, 1 = thiotepa); Size of original lesions; and Number of original lesions,
Diagnostics for the accelerated life time model of survival data
459
truncated at 8. The summary statistics for our examination of this data appears in Table 6 and the partial residuals are plotted in Figure 4. The large values of the Wald statistics are not confirmed in the bootstrap simulation. The simulations indicate that the estimated βˆ should be closer to zero with larger standard errors than estimated by V. The partial residual plot in Figure 4 identifies individuals with a large number of lesions and/or large lesions. The partial residual plot for the binary valued treatment covariate is similar to the partial residuals of the AML data in Figure 2.
Acknowledgement This research was supported by grant P30-CA16359 awarded by the US National Institutes of Health to the Yale Comprehensive Cancer Center.
References Andersen, P.K., Gill, R.D. (1982). Cox regression model for counting processes: A large sample study. Ann. Statist. 10, 1100–1120. Cuzick, J. (1985). Asymptotic properties of censored linear rank tests. Ann. Statist. 13, 133–141. Embury, S.H., Elias, L., Heller, P.H., et al. (1977). Remission maintenance therapy in acute myelogenous leukemia. Western J. Medicine 126, 267–272. Le, C.T., Zelterman, D. (1992). Goodness of fit tests for proportional hazards regression models. Biometrical J. 34, 557–566. Lin, D.Y., Wei, L.J. (1989). The robust inference for the Cox proportional hazards model. J. Amer. Statist. Assoc. 84, 1074–1078. Lin, D.Y., Ying, Z.L. (1995). Semiparametric inference for the accelerated life model with time-dependent covariates. J. Statist. Planning Inference 44, 47–63. Louis, T.A. (1981). Nonparametric analysis of an accelerated failure time model. Biometrika 68, 381–390. Miller, R.G. (1981). Survival Analysis. Wiley, New York. Miller, R., Halpern, J. (1982). Regression with censored data. Biometrika 69, 521–531. Schoenfeld, D. (1982). Partial residuals for the proportional hazards regression-model. Biometrika 69, 239– 241. Wei, L.J. (1984). Testing goodness of fit for proportional hazards model with censored observations. J. Amer. Statist. Assoc. 79, 649–652. Wei, L.J., Lin, D.Y., Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J. Amer. Statist. Assoc. 84, 1065–1073. Zelterman, D., Chen, C.F. (1988). Homogeneity tests against central-mixture alternatives. J. Amer. Statist. Assoc. 83, 179–182. Zelterman, D., Le, C.T., Louis, T.A. (1996). Bootstrap techniques for proportional hazards models with censored observations. Statist. Comput. 6, 191–196.
26
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23026-1
Cumulative Damage Approaches Leading to Inverse Gaussian Accelerated Test Models
Arzu Onar and William J. Padgett
Accelerated test models have been widely used in the reliability context, but are also applicable to survival analysis in medical settings. For example, they can be used for modeling data arising from clinical trials that investigate the effects of a treatment or a harmful agent in a variety of doses. These models also apply to cases where it is hypothesized that a set of covariates possessed by a subject, such as age, body mass index, cholesterol level, etc., are expected to affect how well the subject fares under a given treatment. In accelerated test models, the response variable of interest, such as the lifetime of a subject, is assumed to depend on one or more covariates, called “acceleration variables” here, that are either naturally present or externally administered. Methods for accelerated testing were developed originally for reliability applications where observing the response variable, such as the strength of a given material or the lifetime of a piece of equipment, in the ordinary use environment was difficult due to the strength being too high or lifetime being too long. In these settings, accelerated testing experiments exposed the material or the equipment on test to more severe environmental conditions than usual, which reduced the material’s strength or shortened the equipment’s lifetime so that the response variable of interest could be observed more easily. The data obtained was used in analyzing the effect of the covariate, which in turn resulted in predictions of the material’s strength or the equipment’s lifetime in ordinary use environments. An analogous case in survival analysis is a trial where the effect of a harmful agent is studied. If it is expected that its effect at low doses will take too long to observe, the subjects on test may be exposed to several higher doses of the agent, resulting in shorter observation times. These results may then be used to make inferences about unobserved responses at low doses of the agent, possibly representing the ordinary exposure levels. Another example is when a study involves different doses of a treatment for a terminal illness. There, the lifetimes observed under the studied doses may be utilized to determine the expected addition to the subject’s lifetime due to the treatment at each dose level, which may then be used in determining the appropriate dose to be administered. Further, these results may be used to decide whether to administer the treatment at all, based on the projected lifetime with no treatment and a cost-benefit analysis of extended 461
462
A. Onar and W.J. Padgett
lifetime versus potential side effects, etc. Accelerated test models may be extended to a wide variety of other circumstances along these lines. The models to be discussed here are applicable to accelerated test situations described above and will be used to model survival times that result from such experiments via the inverse Gaussian distribution. This family of distributions is of particular interest as a lifetime model due to its many desirable properties. During the remainder of this paper the terms “lifetime” or “survival time” will be used in a broad sense, where they may imply “time until recovery” or “time until a medically defined endpoint”, or other appropriate interpretations. This article first briefly introduces the inverse Gaussian distribution and discusses some of its relevant properties. Then a family of inverse Gaussian accelerated test models, originally developed by Durham and Padgett (1997) and later extended by Onar and Padgett (2000b), is discussed. The parameter estimation procedures for these models based on maximum likelihood are outlined in Section 3, where the general Fisher information matrix is also given. In Section 4, the utility of the models is illustrated with data adapted from a carcinogenesis study conducted by the National Toxicology Program. The article concludes with a summary and some further discussion of the models. 1. Inverse Gaussian as a lifetime or strength model Since its early development, the inverse Gaussian distribution has been applied successfully to reliability and life testing problems and more recently to survival analysis situations. Due to the fact that its derivation is based on the first passage time of a Wiener process with positive drift, this distribution is particularly applicable to failure and reaction time data. The inverse Gaussian distribution possesses certain properties which are appealing not only from a computational point of view but also from a practical perspective. For example, a multitude of shapes, ranging from highly skewed to almost symmetric, can be represented using this distribution, which is unimodal and belongs to an exponential family of order two. Further, the inverse Gaussian family is complete and all of its positive and negative moments exist. Its probability density function (pdf) is λ −λ(x − µ)2 f (x; µ, λ) = (1.1) , x > 0, exp 2πx 3 2µ2 x where µ > 0 is the mean and λ > 0 is a scale parameter. The cumulative distribution function (cdf) for (1.1) can be written as a linear combination of two standard normal cdfs, denoted below by Φ, as follows: λ x λ x 2λ F (x) = Φ −1 + exp Φ − + 1 , x > 0. x µ µ x µ (1.2) For the rest of this article, a random variable X with this pdf will be denoted by X ∼ IG(µ, λ) (see Chhikara and Folks, 1989, for further details on the inverse Gaussian distribution).
Cumulative damage approaches leading to inverse Gaussian accelerated test models
463
Both in reliability studies and in survival analysis applications, the choice of the lifetime distribution should be based on the knowledge of the inherent failure mechanism. The physical or medical characteristics of the failure process should take priority over empirical model fitting during a search for an appropriate statistical distribution. The inverse Gaussian distribution may be a good choice in cases where a life distribution that describes early failures is needed. In such cases, the failure rate will be non-monotonic, increasing initially and then decreasing. The inverse Gaussian distribution has a nearly constant asymptotic failure rate which suggests that after a long time, the failures take place almost randomly, independent of past life, a behavior exhibited by the exponential distribution for all positive times. This property is not shared by other popular choices for lifetime distributions, such as the lognormal, which has a zero failure rate in the limit, unrealistically suggesting that no failures may be expected after a long time.
2. Inverse Gaussian accelerated test models Accelerated testing (AT) has traditionally been used in the reliability context as a method to test the reliability of a system that has a long expected lifetime under normal use conditions. This method speeds up the failure process by exposing the system on test to unusually high stress levels. If the mean lifetime of a particular system were of interest, for example, then an acceleration model would typically decrease this mean or shift the lifetime distribution such that the lifetimes would be considerably shorter. These methods have been applied very successfully to a variety of reliability problems such as modeling the reliability of electronic equipment. A typical objective is to utilize the results obtained from an accelerated test in order to make inferences about the lifetime in the normal use environment. The same principles may be applied to survival analysis where the appearance of the effect of an agent, say a carcinogen, may require impractically long periods of time under small doses that are typically encountered in everyday environments. Thus, high doses of the agent may be administered to laboratory animals in an effort to observe the carcinogen’s effect within a reasonable amount of time. In a different application, a range of doses of a medication may be tested in an effort to study the effects of the treatment on the subjects as well as its efficacy in a variety of doses in treating the disease at hand. In such settings, because an accelerated test model provides a functional relationship between the response variable of interest and the dose (acceleration variable), the results of the accelerated test analysis may ultimately assist in assigning the appropriate dosage for a treatment or in determining dangerous levels of a harmful substance. There are two main assumptions that govern AT experiments. The first assumption states that at different levels of the acceleration variable, the same family of distributions can be used to model the survival times. Secondly, it is assumed that the functional relationship between the acceleration level V and the parameters of the survival time distribution, typically represented by a parametric acceleration function, are known up to a few acceleration model parameters. Several parametric acceleration models, such as the inverse power law and the Eyring and Arhennius models, have been utilized frequently (see Mann et al., 1974, and Nelson, 1990, for example). Choosing a proper
464
A. Onar and W.J. Padgett
acceleration function, which is also called a “link function” in the hierarchical models context, is a crucial part of analyzing accelerated test data. This decision should be based on the inherent properties of the failure behavior under various acceleration levels as well as the acceleration variables (e.g., dose) under consideration. If T , which denotes the time to failure or survival time, has cdf F(t), then FV (t) denotes the lifetime distribution at dose level, V . Let V0 , V1 , . . . , Vk be the various dose levels with V0 being the normal use environment, with dose increasing as i = 1, . . . , k. At each dose level Vi , ni items with lifetimes Tij0 will be observed, where j = 1, . . . , ni . It is possible that at the end of the experiment some of the items on test will still be operational. If this happens, the actual lifetimes will not be observed and will be denoted by the censoring variable Uij , where the Uij ’s, j = 1, 2, . . . , ni , are independent of the Tij0 ’s. So, randomly right-censored observations (Tij , ∆ij ), j = 1, . . . , ni , i = 1, . . . , k, will be obtained, where 0 1 if Tij0 Uij , Tij = min Tij , Uij and ∆ij = 0 if Tij0 > Uij . Even though this article does not directly address the issue of censored observations, the methods presented here can be directly applied to censored cases, albeit not without some additional computational difficulties (see Mann et al., 1974, and Nelson, 1990, for more details on accelerated testing). The first step in analyzing the data obtained via an accelerated life test experiment is replacing one or more of the lifetime density parameters with a suitable parametric acceleration model such as one of those mentioned earlier. Then, the maximum likelihood estimators (MLEs) of the unknown model parameters are found. Finally, the survival function at V0 is obtained from the parameter MLEs using the acceleration model. A majority of the models that will be discussed in detail in the remainder of this article are based on continuous cumulative damage arguments where the experimental setting will involve a subject with acceleration variable V , e.g., dose of carcinogen, experiencing continuously increasing damage until an end point is reached. The resulting inverse Gaussian accelerated test model obtained from this approach, as described next, is a general family with enough flexibility to accommodate virtually any appropriate acceleration model and seems to provide good fits to a variety of data sets in both reliability and survival analysis applications. Once again, suppose a subject possessing, or exposed to, an acceleration variable V is experiencing a continuously worsening physical condition until a medically meaningful endpoint is reached. Note that if the acceleration variable is the dose of a treatment, the endpoint could be recovery from the disease. In the cases where the acceleration variable is the dose of a harmful substance, the endpoint may be the death of the subject or the appearance of a tumor, etc. Let D(s) denote the “damage” sustained by the subject at survival time s > 0 and suppose {D(s): s > 0} is a Gaussian process with constant mean (amount of damage) ν > 0 and variance σ 2 . Then, the “damage increment” at survival d time s is given as ds X(s) = D(s), s > 0. Further, consider the possibility that the drift parameter ν = ν(V ) may be dependent on the acceleration variable; for instance, the dose of the harmful substance administered to the subject may affect how quickly the end point is reached. 
Another example may be given for the case where the acceleration
Cumulative damage approaches leading to inverse Gaussian accelerated test models
465
variable is not an externally administered agent; rather it is a covariate that is inherent to the subject, such as age or body mass index. More specifically, for certain medical conditions, elderly or overweight patients may have shorter survival times than their younger or average-weight counterparts. Thus, age, or body mass index, would act as an acceleration variable shortening the survival time of the patient. This relationship between ν and V can be expressed using an appropriate parametric function, such as a power law, ν(V ) = αV β , or linear law, ν(V ) = α + βV , where the sign of the parameter β would determine whether the relationship is increasing or decreasing. In a reliability context, using carbon fiber strength as the response variable of interest, Durham and Padgett (1997) proposed two models for the strength of a system subjected to tensile load, based on a discrete version of cumulative damage. Adapted to the survival analysis context, their approach supposes that a subject possessing or exposed to an acceleration variable V has an unknown, yet fixed, theoretical “lifetime” denoted by Ψ , and the subject’s condition worsens steadily until death or another clinically meaningful endpoint is reached. Stated in the survival analysis context, the following three assumptions were made by Durham and Padgett (1997) in order to model this phenomenon: (i) The increasing severity of the disease is visualized in small, discrete increments until an endpoint is reached, which results in the subject’s observed survival time. (ii) Each small increment causes a nonnegative amount of “damage”, D, which is considered a random variable with cdf FD . (iii) The “initial damage” present in the subject before being subjected to the acceleration variable, is in the form of the severity of the disease or the “flaws” existing in the subject initially and is quantified by a random amount of reduction, X0 , in the patient’s theoretical survival time, so that the random variable W = Ψ − X0 represents the “initial/reduced survival time” of the subject. As the severity of the disease is incremented under the assumptions stated above, let the cumulative damage after n + 1 increments be Xn+1 = Xn + Dn+1 h(Xn ), where Dj 0, j = 1, 2, 3, . . . , are independent identically distributed damages to the subject at each increment and h(x) is the damage model function. When h(x) = 1 for all x, an additive damage model is obtained, whereas h(x) = x for all x gives a multiplicative damage model. Let N denote the number of increments of disease level experienced by the subject with a survival time Ψ , that is, N = supn {n: X1 Ψ, . . . , Xn−1 Ψ } = supn {n: X1 − X0 Ψ − X0 , . . . , Xn−1 − X0 Ψ − X0 }, where N = 1 if the set is empty. Conditionally, P (N > n | Ψ − X0 = w) = P (Xn − X0 w), since {Xn } is a non-decreasing sequence, i.e., all Dn 0 with probability one. Then the survival probability after n increments of exposure time is given by ∞ P (N > n) = (2.1) Fn (w) dGW (w), 0
where Fn (w) = P (Xn − X0 w) and GW is the cumulative distribution function of W . As n increases, Fn (w) converges to a normal cdf as follows: When h(x) = 1, write
466
A. Onar and W.J. Padgett
Dn+1 = (Xn+1 − Xn ), so n−1
Di+1 =
i=0
n−1
(Xi+1 − Xi ) = Xn − X0 .
(2.2)
i=0
Then, when n is large, by the central limit theorem, Xn − X0 is approximately normally distributed with mean nµ and variance nσ 2 , where µ = E(D) and σ 2 = Var(D) are the mean and variance of D, respectively. Therefore, for large n, √ Fn (w) = P (Xn − X0 w) ∼ (2.3) = Φ (w − nµ)/ n σ , where Φ denotes the standard normal cumulative distribution function. In the case of the multiplicative damage model, the subject’s pre-exposure lifetime is assumed to be W = Ψ/X0 , where X0 1 is the initial cumulative damage (see Padgett, 1998). Then, n
i=1
Di =
n
Xi − Xi−1 i=1
h(Xi−1 )
≈
Xn
−1 du h(u)
X0
represents the cumulative damage at the nth increment. Since h(u) = u, the integral given above yields ln(Xn ) − ln(X0 ), which is a sum of i.i.d. random variables. So, once again from the central limit theorem, the distribution of Xn /X0 is approximately lognormal with µ and σ defined as before; that is, for large n, Xn ln(w) − nµ Fn (w) = P . w ≈Φ √ X0 nσ Substituting (2.3) for Fn (w) into (2.1) and using an appropriate distribution for the subject’s initial/reduced survival time W in (2.1) yields an expression for the survival probability of the subject after n increments of survival time. In order to obtain an approximate expression for the survival probability given in (2.1), an appropriate distribution for the subject’s pre-exposure lifetime, GW (w), must be chosen. In a reliability context, Durham and Padgett (1997) proposed two such distributions. The first approach models GW (w) via a three-parameter Weibull distribution. The Weibull distribution has been routinely used for modeling failure stresses for industrial materials (see Nelson, 1990, p. 63, for example) and its justification in the reliability context is based on the well known “weakest link theory”. Further details of this model will not be discussed here, so the reader is referred to Durham and Padgett (1997). For their second approach to modeling the initial system strength GW (w), which may be more appropriate for survival analysis, Durham and Padgett assume that a “flaw process” exists over the system that is exposed to an acceleration variable V . For survival analysis applications, this flaw process is assumed to account for the reduction in pre-exposure survival time due to the initial disease severity or most severe “flaw” possessed by the subject. Furthermore, they assume that this process can be described as a stationary Gaussian process (see Karlin and Taylor, 1975), which yields the tails of a normal distribution as the initial survival time distribution described by the inverse
Cumulative damage approaches leading to inverse Gaussian accelerated test models
467
Gaussian density. Given the fixed theoretical survival time, Ψ , the corresponding probability density function (pdf) of W can be written as
2 −(w − Ψ )2 gW (w) = √ , 0 < w < Ψ. exp 2L 2πL Following the arguments of Owen and Padgett (1998) and applying them to the survival analysis setup, it can be shown using (2.1), one of the additive or the multiplicative damage models, and the initial lifetime distribution modeled by the Gaussian flaw process that the continuous cdf for survival time S can be represented by a generalized three-parameter Birnbaum–Saunders-type distribution. That is, the survival time distribution of the subject with an acceleration variable V can be written in the form Λ(θ ; V ) 1 √ GS (s; V ) = Φ ζ s− √ , s > 0, ζ > 0, σ s σ > 0, Λ(θ ; V ) > 0, for all V , where Λ(θ ; V ) is an acceleration function with an unknown parameter (possibly a vector) θ and the known acceleration variable V . Note that this cdf has the general form of the first term in the inverse Gaussian cdf (see Bhattacharyya and Fries, 1982, for details) written as Λ(θ ; V )2 sζ GS (s; V ) = Φ (2.4) − 1 , s > 0. Λ(θ ; V ) sσ 2 We denote the survival time then as Λ(θ ; V ) Λ(θ ; V )2 S∼ ˙ IG , . ζ σ2 Note that the second term of the inverse Gaussian distribution function is ignored in (2.4). This is reasonable if the second parameter (scale) of the inverse Gaussian is much larger than the first parameter (the mean) (Bhattacharyya and Fries, 1982; Chhikara and Folks, 1989), which was shown to be the case for the data considered by Durham and Padgett (1997). The chloroprene exposure data set used as an illustration in Section 4 of this article also satisfies this condition. Therefore, the threeparameter Birnbaum–Saunders-type distribution, which arises via the process described above, can be approximated using the three-parameter inverse Gaussian-type distribution (2.4). The fact that the inverse Gaussian distribution arises naturally through the physical model of the dose increments on the subject suggests that, except perhaps the Birnbaum–Saunders distribution, this model would fit this type of data better than other distributions which might have been considered. The models that arise based on the “Gaussian flaw process” for the initial damage lead to two accelerated inverse Gaussian-type models for the survival time, one for the additive and one for the multiplicative damage case. Consistent with Durham and Padgett (1997) these will be called “Gauss–Gauss Additive” and “Gauss–Gauss Multiplicative” models, respectively. Their functional forms are given below:
468
A. Onar and W.J. Padgett
• Gauss–Gauss Additive Model (GGA) (Durham and Padgett, 1997):
Λ(Ψ ; L) Λ(Ψ ; L)2 S ∼ IG , , 2 ζGGA γGGA
where Λ(Ψ ; L) = Ψ −
2L . π
• Gauss–Gauss Multiplicative Model (GGM) (Padgett, 1998):
Λ(Ψ ; L) Λ(Ψ ; L)2 S ∼ IG , , 2 ζGGM γGGM
1 where Λ(Ψ ; L) = ln(Ψ ) − Ψ
2L . π
Onar and Padgett (2000b) used similar arguments as in the discrete case presented above and gave the following expression for the continuous cumulative damage sustained by the subject up to survival time t, for t > 0, analogous to (2.2), t t X(t) − X(0) = dX(s) = D(s) ds, 0
0
which is also a Gaussian process (see Hoel et al., 1972, for example) with mean function µt = νt = ν(L)t and variance function σt2 = σ 2 t. Let X(0) = x0 , be the constant amount of initial “damage” present in the subject at the time of exposure to the acceleration variable, for example due to previous health conditions or otherwise. This model can be generalized further by allowing the initial damage, x0 , to also depend on V ; however, this case will not be discussed here. Suppose an endpoint is reached at an unknown (cumulative) critical damage level c = cV = c(V ), which may also be a decreasing function of V . Then the survival time of the subject, denoted by S, with an acceleration variable V is a random variable which is the survival time at first passage of the cumulative damage X(t) to the critical threshold level c(V ). Figure 1 illustrates such a cumulative damage process for V1 < V2 . Then,
Fig. 1. Survival times at different values of the acceleration variable V .
Cumulative damage approaches leading to inverse Gaussian accelerated test models
469
Table 1 Some parametric acceleration models for the inverse Gaussian parameters, µ and λ, assuming the continuous cumulative damage model Model type
Only (cV − x0 ) depends on V
Only ν(V ) depends on V
Power law
µV = ζ V θ λV = γ V 2θ (Gauss–Weibull Additive Model, Durham and Padgett, 1997)
µV = ζ V θ λV = λ (Power Law Accelerating µ only, Onar and Padgett, 2000a)
Inverse linear law
1 µV = αo +β oV
1 µV = α+βV
λV = Exponential law
1 (α1 +β1 V )2
µV = ζ eθ V λV = γ e2θ V
λV = λ µV = ζ eθ V λV = λ
S has an inverse Gaussian distribution whose pdf is a function of V , given by (1.1), λ1/2 f (s; µ, λ) = √ exp −λ(s − µ)2 / 2sµ2 , 2πs 3 for s > 0, where λ = λV = [(cL − x0 )/σ ]2 and µ = µV = (cV − x0 )/ν(V ), that is, S = subject’s survival time at V ∼ IG(µV , λV ). If we assume that (cV − x0 ) and/or ν(V ) depend on V , then we can quantify this dependency using parametric acceleration functions. As stated before, such functions should be selected so that they will at least approximately represent the relationship assumed to exist between the parameters involved and the acceleration variable V . Some possible choices for these acceleration functions are given in Table 1 for the inverse Gaussian mean and scale parameters, µ and λ. Onar and Padgett (2000b) gave several other possible models, but their list is in no way comprehensive. There is a wide array of acceleration functions that may be appropriate for a given situation. It should be noted that following the parsimony principle, accelerated inverse Gaussian models with only three parameters are preferred, as these add only one extra parameter to the overall model.
3. Estimation for the inverse Gaussian accelerated test models Suppose subjects are tested at the k different levels of the acceleration variable, V1 , . . . , Vk . At level Vi , tests are performed for ni subjects, resulting in observed survival times sij , j = 1, 2, . . . , ni , for i = 1, 2, . . . , k. Letting S denote the survival time until the subject reaches an endpoint, we have cV (θ1 ) − x0 (cV (θ1 ) − x0 )2 S ∼ IG (3.1) , = IG(µV , λV ). νV (θ2 ) σ2
470
A. Onar and W.J. Padgett
For estimation over all of the k levels, the maximum likelihood estimates (MLEs) of the unknown model parameters given in (3.1) can be obtained by maximizing the likelihood function L(θ ) =
ni k
f(sij ; µVi , λVi ),
(3.2)
i=1 j =1
or equivalently, maximizing the logarithm of the likelihood function, log L = log L(θ ), where θ = (θ1 , θ2 , σ 2 ) is the vector of parameters to be estimated. Letting sij , j = 1, . . . , ni , i = 1, . . . , k, denote the observed survival times and using the cV (θ1 ) and νV (θ2 ) notation which represent the critical cumulative damage level and the drift parameter, respectively, the likelihood function under the experimental conditions described above is L s; σ 2 , cV (θ1 ), νV (θ2 ), x0 =
1 √ σ 2π
× exp
m k
−1 2σ 2
cL (θ1 ) − x0
ni k ni i=1 j =1
i=1
ni k
νV (θ2 )sij − 2
i=1 j =1
+
−3/2
sij
k
ni νV (θ2 )(cL (θ1 ) − x0 )
i=1
ni k
(cL (θ1 ) − x0 )2
sij
i=1 j =1
,
k
where m = i=1 ni and θ1 and θ2 may be vectors. Using this likelihood the MLE of σ 2 can be obtained by substituting the MLEs of θ1 and θ2 into k n k i
1
2 σˆ = νV (θ2 )sij − 2 ni νV (θ2 )(cV (θ1 ) − x0 ) m i=1 j =1
+
i=1
ni k
(cV (θ1 ) − x0 )2 i=1 j =1
sij
.
The MLEs of θ1 and θ2 depend on the actual form of the acceleration functions, cV (θ1 ) and νV (θ2 ), and they can be obtained by solving the following equations simultaneously with the one above for σ 2 : k k 1
∂cV (θ1 ) ni ∂cV (θ1 ) ∂3
= + 2 ni νV (θ2 ) ∂θ1 cV (θ1 ) − x0 ∂θ1 σ ∂θ1 i=1
−
i=1
k ni 1
(cV (θ1 ) − x0 ) ∂cV (θ1 ) =0 2 sij ∂θ1 σ i=1 j =1
Cumulative damage approaches leading to inverse Gaussian accelerated test models
471
and k ni 1
∂3 ∂νV (θ2 ) =− 2 νV (θ2 )sij ∂θ2 σ ∂θ2 i=1 j =1
+
k ∂νV (θ2 ) 1 ni cV (θ1 ) − x0 = 0. 2 σ ∂θ2 i=1
Note that in most cases the actual solutions of these equations will have to be obtained numerically. To obtain starting values, which all numerical procedures require, a “least squares” method can be utilized. For each i = 1, . . . , k, denote the ordered observed sij ’s by si(j ) ’s. The empirical estimate of the cdf of S, is defined as n (si(j ) ) = j . Then setting G ni +1 1 cV (θ1 ) − x0 νV (θ2 )si(j ) j =Φ −1 ni + 1 si(j ) σ cV (θ1 ) − x0 yields the equation √
si(j ) Φ
−1
νV (θ2 ) cV (θ1 ) − x0 j = si(j ) − . ni + 1 σ σ
(3.3)
When the particular forms of the acceleration functions are known, this equation can be used to obtain “least squares” estimates of θ1 , θ2 and σ , given the sij and Vi values, for j = 1, . . . , ni , i = 1, . . . , k. Quantile–quantile (Q–Q) plots of the data at one acceleration level can be obtained by considering a random sample, s1 , s2 , . . . , sn , taken at the ith level of the acceleration variable Vi . This would result in a two-parameter model where the single level V would be absorbed into the other parameters. Denoting the ordered sj ’s by s(j ) , as above, the linear equation j √ = γ0 + γ1 s(j ) s(j ) Φ −1 n+1 is found. Then j √ , s(j ) , s(j ) Φ −1 n+1
j = 1, . . . , n,
can be plotted. Note that, if (1.1) is the actual survival time distribution, then this plot should result in an approximate straight line. Similarly, by considering just one of the levels of the acceleration variable, Vi , an estimate of the 100pth percentile, sp , of the survival time distribution at that value of V can be obtained by setting the inverse Gaussian cdf of (1.2) equal to p, GS (sp ) = p, and solving for sp which will lead to √ (3.4) sp Φ −1 (p) = γ0 + γ1 sp , where Φ −1 (p) is the 100pth percentile of the standard normal distribution. Note that using (3.1) and (3.3), we can express γ0 and γ1 in the more familiar terms of the inverse
472
A. Onar and W.J. Padgett
Gaussian parameters µ and λ as
√ √ −(cV (θ1 ) − x0 ) νV (θ2 ) λ = − λ and γ1 = = . γ0 = σ σ µ √ Since (3.4) is a quadratic equation in sp , the solution for sp can be obtained using the positive root of the resulting quadratic, giving µ2 4λ 2 zp + zp2 + sp = (3.5) . 4λ µ The MLE of the 100pth percentile can then be found simply by substituting the MLEs for µ and λ into (3.5). Furthermore, approximate lower confidence bounds on such percentiles can be obtained. Notice that the right-hand side of (3.5) is increasing in µ and decreasing in λ. Using some distribution theory for MLEs of inverse Gaussian parameters, a conservative lower confidence bound on sp can be calculated as follows: • A lower (1 − α)100% confidence bound on µ is given as −1 s¯ , µˆ V = s¯ 1 + tα ˆ − 1) λ(n where tα is the upper tail (1 − α)100% point of the Student’s t distribution with (n − 1) degrees of freedom and s¯ denotes the sample mean. • An upper (1 − α)100% confidence bound on λ is given as λˆ U =
λˆ χα2 , n
where χα2 is the upper tail (1 − α)100% point of the chi-square distribution with (n − 1) degrees of freedom. Thus, utilizing Bonferroni’s inequality, a conservative (1 − 2α)100% lower confidence bound on the 100pth percentile, sp , has the form µˆ 2L 4λˆ U 2 2 sˆp,LB = zp + zp + . µˆ L 4λˆ U The same procedure used for (3.5) can be utilized to find the 100pth percentile for the survival time distribution where the acceleration variable (such as dose) V is taken 2 0 0) into account. In this case, µ and λ will be replaced by cVν(θV 1(θ)−x and (cV (θσ1 )−x , respec2 2) tively, where cV (θ1 ) and νV (θ2 ) are the appropriate parametric acceleration functions. Then, it follows that the 100pth percentile for the survival time distribution at level V is σ zp + σ 2 zp2 + 4[cV (θ1 ) − x0 ]νV (θ2 ) sp (V ) = (3.6) . 2νV (θ2 ) Note that the MLE of sp (V ) can be obtained by substituting the MLEs of σ, θ1 and θ2 into (3.6). It is easy to see that sp (V ) is increasing in σ . Provided that sp (V ) is
Cumulative damage approaches leading to inverse Gaussian accelerated test models
473
monotone in θ1 and θ2 , which is the case for all acceleration functions considered here, a conservative Bonferroni-type lower confidence bound on sp (V ) can be obtained from the appropriate confidence bounds on each of the parameters, similar to the previous argument. However, in order to obtain the appropriate bounds on these parameters, the distributions of their MLEs need to be known. Even though the exact distributions of these MLEs are not available, it is possible to obtain their asymptotic distributions using the Fisher information matrix, I. Defining the vector ϕ = (σ, θ1 , θ2 ), it can be shown that, as m → ∞, (i.e., as ni → ∞, nmi → pi , for i = 1, . . . , k) the quantity m(ϕˆ − ϕ) converges in distribution to a multivariate normal with mean 0 and covariance matrix I −1 . A lower confidence bound on sp (V ) could be obtained using this asymptotic result with the Cramer δ-method. Though it is computable for a given set of data, this approach yields very complicated forms making the simple “closed form” formulae resulting from the Bonferroni method appealing. The second partial derivatives of the log-likelihood function for finding I are fairly easy to obtain. Assuming θ1 and θ2 are one-dimensional parameters, the average information matrix, I (θ1 , θ2 , σ ; V ) is composed of the negative expected values of the five unique second partial derivatives of the log-likelihood since the inverse Gaussiantype family is in the exponential form. Thus, I (θ1 , θ2 , σ ; V ) for the general accelerated inverse Gaussian model is b1 b2 0 I = b2 b3 b4 , 0 b4 b5 where
∂ 23 b1 = −E ∂σ 2 b2 = −E
=
∂ 23 ∂θ1σ
∂ 23 b3 = −E ∂θ12
2m , σ2
=
k −2
ni ∂cV (θ1 ) , σ cV (θ1 ) − x0 ∂θ1 i=1
=2
k
i=1
+
ni ∂cV (θ1 ) 2 ∂θ1 (cV (θ1 ) − x0 )2
k 1 ni νV (θ2 ) ∂cV (θ1 ) 2 , cV (θ1 ) − x0 ∂θ1 σ2 i=1
∂ 23 b4 = −E ∂θ1∂θ2 b5 = −E
∂ 23 ∂θ22
=
k −1
∂cV (θ1 ) ∂νV (θ2 ) = 2 , ni σ ∂θ1 ∂θ2 i=1
k 1
∂νV (θ2 ) 2 cV (θ1 ) − x0 n . i σ2 νV (θ2 ) ∂θ2 i=1
474
A. Onar and W.J. Padgett
The asymptotic variances of the MLEs of θ1 , θ2 and σ correspond to the diagonal elements of I −1 . These quantities can be estimated by substituting the estimates of the unknown parameters into the equations above.
4. Application of the inverse Gaussian accelerated test models to chloroprene exposure data In this section, in order to illustrate the methods discussed in Sections 2 and 3, data adapted from a publicly available data set listed on the National Toxicology Program’s (NTP) web site will be used. NTP is an interagency program which consists of relevant toxicology activities of the National Institutes of Health’s National Institute of Environmental Health Sciences(NIH/NIEHS), the Centers for Disease Control and Prevention’s National Institute for Occupational Safety and Health (CDC/NIOSH), and the Food and Drug Administration’s National Center for Toxicological Research (FDA/NCTR). The data, which will be used here, was adapted from a toxicology and carcinogenesis study of chloroprene (CAS no. 126-99-8) in F344/N Rats. The chemical chloroprene is used in large amounts yet its carcinogenic potential is not well known. Further, it is the 2chloro analogue of 1,3-butadiene, a potent carcinogen, which has been shown to affect multiple organs in multiple species. The data considered here were collected from a two-year study, which exposed male rats to chloroprene (more than 96% pure) by inhalation at concentrations (dose levels) 0, 12.8, 32 and 80 ppm for 6 hours per day, 5 days per week during those two years. As an illustration of the methods introduced in the earlier sections, the data collected at the upper dose levels will be modeled using one of the inverse Gaussian accelerated test models given in Table 1, and the model’s performance will be checked against the data collected at the 0 dose level, i.e., the “ordinary use environment.” The data from the NTP web site indicates that there were no censored observations and at dose level V1 = 12.8, there were n1 = 41 deaths; at dose level V2 = 32, there were n2 = 45 deaths; and at V3 = 80 ppm there were n3 = 46 deaths. The survival times are given in days. Since no expert knowledge was available about the relationship between the acceleration variable (chloroprene dose) and the survival time, a variety of the acceleration functions given in Table 1 were fitted to the data. The MLEs of the parameter values for each of the models were obtained numerically and are presented in Table 2. The goodness of fit of each of the models was measured based on the overall mean squared error (MSE) values for the fits over all observed dose levels. This mean squared error is defined as MSE =
ni k 1 1
n (si(j ) ) 2 , S (si(j ) ) − G G k ni i=1
j =1
n is the empirical cdf at each S is the MLE of the inverse Gaussian cdf and G where G i = 1, 2, 3. The MSEs of the models considered here were calculated and the results are summarized in Table 3.
Cumulative damage approaches leading to inverse Gaussian accelerated test models
475
Table 2 Maximum likelihood estimates for the chloroprene data Model Exponential C only Exponential ν only Power law ν only
Inverse linear law ν only
Gauss–Gauss additive Gauss–Gauss multiplicative
Gamma
Zeta
Theta
47212.443333
615.615553 615.660613 641.026721
−0.000191487 −0.000193225 −0.001389573
Alpha
Beta
0.0007521946
0.0000268267
Psi
Zeta
Gamma Sqr
224.78483915 55.873565674
0.3600866947 0.0064446589
1.0380331865 0.0003325041
Exponential C only Exponential ν only Power law ν only Inverse linear law ν only Gauss–Gauss additive Gauss–Gauss multiplicative
46451.630256 46727.599895 Lambda 2366.421
Table 3 Mean squared errors for the inverse Gaussian acceleration models Model
Lambda
MSE 0.006456742 0.006433372 0.01547785 0.1340673 0.1217454 0.01424108
Fig. 2. Observed vs. predicted survival times for the control dose (0 ppm).
476
A. Onar and W.J. Padgett
Table 4 Five-number summaries of the observed and predicted survival times (in days) at the control dose for the chloroprene data
Observed
Predicted
Minimum
25th Percentile
Median
75th Percentile
Maximum
469.0
571.0
616.0
657.0
722.0
1st Percentile
25th Percentile
Median
75th Percentile
99th Percentile
471.4
569.7
615.7
665.4
804.1
It is clear that the inverse Gaussian exponential acceleration models fit the chloroprene-exposed rat lifetime data best. The MSEs of these models are half as large as the MSE of the next best fitting model, the Gauss–Gauss Multiplicative. Adopting the “Exponential ν only” acceleration model as the most appropriate one, the predictive performance of this model was checked against the survival times at V0 = 0 dose level, i.e., the “normal use” environment. Figure 2 shows a scatter plot of the predicted versus observed survival times at 0 ppm, the control dose for chloroprene. The 45◦ line is added to the plot for easy comparison of predictions with the actual observed values. Table 4 shows the predicted and the actual five-number summaries of survival times for the control dose. It is clear from Figure 2 and from the numerical summaries given in Table 4 that the predictions are very close to the observed survival times, especially for values that are near the middle of the distribution.
5. Conclusion In this article, a general family of accelerated test models based on the inverse Gaussian distribution, which has been previously used in a reliability setting, is summarized and its use in survival analysis is illustrated. These models are based on “cumulative damage” arguments modeled by a Gaussian process or a “discrete damage” process and possess appealing properties. Most notably, since their derivation is based on the physical properties of the failure process, the models may be applicable in a variety of survival analysis settings. The structure of the models is flexible enough to allow the use of virtually any parametric function that adequately captures the relationship between survival time and the acceleration variable. As outlined here, finding numerical solutions for the maximum likelihood estimates of the unknown parameters is relatively easy, and a general form of the Fisher information matrix is available for asymptotic inference. Though perhaps computationally messy, a Bayesian treatment of the models that have been discussed in this article is relatively straightforward (see Onar and Padgett, 2001). To implement such a Bayesian approach, the necessary integrals for posterior means and posterior marginal distributions can be computed numerically using the Markov chain Monte Carlo (MCMC) procedures, or approximated via an appropriate method such as the Laplace approximation (Tierney and Kadane, 1986). The analysis of Onar and Padgett (2001) utilized the latter method for the Gauss–Gauss Additive model of Durham and Padgett (1997) in a reliability context, but the MCMC approach, which
Cumulative damage approaches leading to inverse Gaussian accelerated test models
477
has been very popular in the recent Bayesian literature (see, for example, Gilks et al., 1996, for an introductory reference), could also be used. Further, these methods may also be applied to the survival analysis models discussed here.
Acknowledgement The work of the second author was partially supported by the National Science Foundation under grant number DMS9877107, and DMS0243594.
References Bhattacharyya, G.K., Fries, A. (1982). Fatigue failure models – Birnbaum–Saunders vs. inverse Gaussian. IEEE Trans. Rel. 31, 439–441. Chhikara, R.S., Folks, J.L. (1989). The Inverse Gaussian Distribution. Marcel Dekker, New York. Durham, S.D., Padgett, W.J. (1997). A cumulative damage model for system failure with application to carbon fibers and composites. Technometrics 39, 34–44. Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London. Hoel, P., Port, S., Stone, C. (1972). Introduction to Stochastic Processes. Houghton Mifflin, Boston. Karlin, S., Taylor, H.M. (1975). A First Course in Stochastic Processes, 2nd edn. Academic Press, New York. Mann, N.R., Schafer, R.E., Singpurwalla, N.D. (1974). Methods for Statistical Analysis of Reliability and Life Data. Wiley, New York. Nelson, W. (1990). Accelerated Testing. Wiley, New York. Onar, A., Padgett, W.J. (2000a). Accelerated test models with the inverse Gaussian distribution. J. Statist. Plann. Infer. 89, 119–133. Onar, A., Padgett, W.J. (2000b). Inverse Gaussian accelerated test models based on cumulative damage. J. Statist. Comput. Simulation 64, 233–247. Onar, A., Padgett, W.J. (2001). Some Bayesian procedures with the accelerated inverse Gaussian distribution. J. Appl. Statist. Sci. 10, 191–210. Owen, W.J., Padgett, W.J. (1998). Birnbaum–Saunders-type models for system strength assuming multiplicative damage. In: Basu, A.P., Basu, S.K., Mukhopadhyay, S. (Eds.), Frontiers in Reliability. World Scientific, River Edge, NJ, pp. 283–294. Padgett, W.J. (1998). A multiplicative damage model for strength of fibrous composite materials. IEEE Trans. Rel. 47, 46–52. Tierney, L., Kadane, J.B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81, 82–86.
27
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23027-3
On Estimating the Gamma Accelerated Failure-Time Models
Kallappa M. Koti
1. The failure time Gamma model
The extended family of generalized gamma models, described by Lawless (1982) and Kalbfleisch and Prentice (1980), includes exponential, Weibull, inverse Weibull, lognormal, and gamma as its special cases. The SAS LIFEREG (SAS Institute Inc., 1994) procedure is used to fit these models to failure-time data that may be right-, left-, or interval-censored. In order to describe the gamma accelerated failure-time models, we focus on analyzing right-censored data from a clinical trial where the objective is to evaluate the efficacy of a single treatment or product with a control. For simplicity, we assume that there are no covariates other than those corresponding to the two treatment groups. Let x = (x1 , x2 ) is the vector of covariates. We let x1 = 1 for all subjects, and x2 = 1 for the active treatment group and x2 = 0 for control. Let T and Y = log T be random variables for failure-time and the logarithm of failure-time. We set Z = (Y − x β)/σ , where β = (β1 , β2 ) is a vector of unknown parameters and σ (> 0) is the scale parameter. Under the extended family of generalized gamma model, Z has the log gamma density. That is, the distribution function, FZ (λ, z), of Z is −2 −2 −z|λ| 1 − Γ λ ,λ e FZ (z, λ) = Φ(z) −2 −2 λz Γ λ ,λ e
λ < 0, λ = 0, λ > 0,
where the incomplete gamma function Γ (k, b) is given by 1 Γ (k, b) = Γ (k)
b
e−u uk−1 du
0
and Γ (k) is the complete gamma function, and where Φ(z) is the standard normal distribution function (Kalbfleisch and Prentice, 1980). The p.d.f. of Z is of the form 479
480
K.M. Koti
|λ| −2 λ−2 exp λ−2 −|λ|z − e−|λ|z , λ < 0, Γ (λ−2 ) λ fZ (z, λ) = (2π)−1/2 exp − 12 z2 λ = 0, λ λ−2 λ−2 expλ−2 λz − eλz , λ > 0, Γ (λ−2 ) where −∞ < z < ∞, and λ is a free shape parameter. Let δi be an indicator of the censoring status. Specifically, δi = 1 means that subject i experienced the event at time ti where as δi = 0 means that subject i has not yet experienced the event by time ti . In what follows, we set γ = (β1 , β2 , σ, λ) and L = L(γ ) denotes the likelihood function. It follows that the log-likelihood is log L =
n
δi log σ −1 fZ (zi ; λ) + (1 − δi ) log 1 − FZ (zi , λ) ,
(1)
i=1
where n is the total number of observations in the survival data under consideration. Needless to say, when λ is negative, it is replaced by |λ| and z is replaced by −z (Lawless, 1982).
2. The maximum likelihood equations We want to find the estimate γˆ that maximizes the log likelihood given by (1). Therefore, we need to set up and solve the maximum likelihood equations: ∂ log L = 0, ∂γl
l = 1, 2, 3, 4.
Here we write down the maximum likelihood equations when λ > 0. The maximum likelihood equations corresponding to βj for j = 1, 2, are n τj i (λ, zi ) ∂ log L xj i λzi e − 1 − (1 − δi ) = 0, δi = ∂βj λσ 1 − FZ (zi , λ)
(2)
i=1
where λ > 0, and τj i (λ, zi ) =
λ−2 1 ∂ λ −2 λzi e−λ e λ−2 eλzi Γ λ−2 , λ−2 eλzi = −xj i · . −2 ∂βj σ Γ (λ )
The maximum likelihood equation corresponding to σ is n 1 κi (λ, zi ) ∂ log L
zi λzi = e −1 − − (1 − δi ) = 0, δi ∂σ λσ σ 1 − FZ (zi , λ) i=1
where λ > 0, and κi (λ, zi ) =
λ−2 1 ∂ −2 −2 λzi λ −2 λzi Γ λ ,λ e e−λ e λ−2 eλzi = −zi · . ∂σ σ Γ (λ−2 )
(3)
On estimating the gamma accelerated failure-time models
481
The maximum likelihood equation corresponding to λ(> 0) is n 2 ∂ log L
1 = − 2λ−3 log λ−2 − 2λ−3 − λ−2 zi − λ−2 eλzi zi − δi ∂λ λ λ i=1
∂FZ (zi , λ)/∂λ = 0, + 2λ−3 ψ λ−2 − (1 − δi ) 1 − FZ (zi , λ)
(4)
where ψ(k) denotes the digamma function, the derivative of log Γ (k) (Prentice, 1974). Next, we state a large sample property of maximum likelihood estimator γˆ for future reference. The Fisher information matrix is 2 ∂ log L I(γ ) = E − . ∂γi ∂γj Assume that I(γ ) is positive definite. Then asymptotically, as n → ∞, d
(γˆ − γ ) −→ N4 (0, Σ),
−1 Σ = I(γ ) ,
(5)
where N4 denotes a 4-variate normal distribution. The sample information matrix I(γˆ ) is a consistent estimator of I(γ ). The ij th element of I(γˆ ) is n
∂ 2 log[1 − FZ (zi , λ)] ∂ 2 log fZ (zi , λ) δi Iij (γˆ ) = − + (1 − δi ) . ∂γi ∂γj ∂γi ∂γj γ =γˆ l=1
Mathematical expressions for −Iij are given in Appendix A. See Kalbfleisch and Prentice (1980) or Lawless (1982) for further details. The following are the basic SAS commands to estimate the accelerated failure time gamma model using the LIFEREG procedure. The option COB prints the variancecovariance matrix. data mydata; input x2 duration status; proc lifereg; model duration ( status(0) = x2 /gamma COB;
The most recent version [SAS Release 8.2] of LIFEREG procedure computes confidence intervals for all model parameters (SAS Institute Inc., 2001). As we will show in Section 5 that, for some data, LIFEREG procedure gives unrealistic results. One possible source is the following.
3. The problem The maximum likelihood equation corresponding to the shape parameter λ needs the derivative
λ−2 eλz ∂ 1 −u λ−2 −1 (6) e u du ∂λ Γ (λ−2 ) 0
482
K.M. Koti
which is not available in closed form. Therefore, it is not possible to apply the Newton– Raphson algorithm to find the maximum likelihood estimates of the model parameters. Most of the nonlinear optimization subroutines such as SAS/IML subroutine NLPTR (SAS Institute Inc., 1995) require the first derivatives to be provided. In the next section, we propose an approximation to the λ-derivative of the incomplete gamma function Γ (λ−2 , λ−2 eλz ).
4. The hybrid approximation First, we find the best possible approximation to the incomplete gamma function. This approximation should be thrice differentiable with respect to λ. Needless to say, it should be independent of the failure time data under consideration. The idea is to use the λ-derivative of this approximation as an approximation to the derivative in (6). We set the step length h(λ, z) = λ−2 eλz /N and consider the composite trapezoidal approximation (Stoer and Bulirsch, 1980) N−1
1 −2 λz h(λ, z) j −2 λz TN (λ, z) := λ e · + g λ e g , N 2 Γ (λ−2 ) j =1
−2
where g(u) = e−u uλ −1 and g(0) = 0 for the incomplete gamma function. This approximation underestimates the incomplete gamma function when λ > 1, whatever may be N . In order to see this, we set N = 1000 and we look at the difference ε(λ, z) = Is λ−2 , λ−2 eλz − TN (λ, z), where Is (λ−2 , λ−2 eλz ) is the value of Γ (λ−2 , λ−2 eλz ) obtained using the SAS probability function PROBGAM(λ−2 , λ−2 eλz ). The SAS function GAMMA(k) is used to compute the gamma function Γ (k). See SAS Institute Inc. (1985). As we see from Table 1 below, this difference is not negligible. We generate over 22 000 observations on Table 1 SAS IG vs. Trapezoidal approximation λ
z
SAS IG
TN (λ, z)
Difference
1.5
−2.0 −1.5 −1.0 −0.5 0.0 0.5
0.2061 0.2855 0.3923 0.5299 0.6925 0.8526
0.1988 0.2753 0.3780 0.5100 0.6647 0.8139
0.0073 0.0102 0.0142 0.0198 0.0277 0.0387
3.0
−2.0 −1.5 −1.0 −0.5 0.0 0.5
0.4247 0.5017 0.5924 0.6985 0.8183 0.9343
0.2400 0.2835 0.3347 0.3941 0.4586 0.5094
0.1847 0.2181 0.2577 0.3045 0.3597 0.4249
On estimating the gamma accelerated failure-time models
483
the difference ε(λ, z), where 0.75 λ < 10 and several appropriate z’s are chosen for each λ. Next, we propose the following piecewise regression type approximation εˆ (λ, z) for the difference ε(λ, z). 0.00743λ + 0.00243z − 0.00458eλz (z < 0.0, 0.75 < λ 1.5), 0.00806λ − 0.02671z + 0.00877eλz (z 0.0, 0.75 < λ 1.5), (z < −5.0, 1.5 <λ 2.5), 0.02317λ +0.0082z + 0.0003z2 2 (−5 z < 0.0, 1.5 < λ 2.5), 0.06694λ + 0.05485z + 0.00637z λz (z 0.0, 1.5 < λ 2.5), 0.07006λ − 0.17472z + 0.03178e 2 (z < −5.0, 2.5 <λ 3.5), 0.04729λ +0.01862z +0.000545z 0.1188λ + 0.10537z + 0.0099z2 (−5.0 z < 0, 2.5 < λ 3.5), λz (z 0.0, 2.5 < λ 3.5), 0.12166λ + 0.00879e 2 (z < −10.1, 3.5 <λ 4.5), 0.02897λ +0.00825z +0.000083z εˆ (λ, z) = 0.06774λ + 0.02791z + 0.00047z2 (−10.1 z < −5.0, 3.5 < λ 4.5), 2 (−5.0 z < 0, 3.5 < λ 4.5), 0.13355λ + 0.11982z + 0.00895z λz 0.13573λ +0.11001z +0.00207e (z 0.0, 3.5 <λ 4.5), 0.02412λ + 0.00635z (z < −10.0, 4.5 < λ 5.5), 2 (−10 z < −5, 4.5 < λ 5.5), 0.05803λ +0.02446z +0.00021z 2 (−5 z < 0.0, 4.5 < λ 5.5), 0.06886λ + 0.05183z + 0.00353z 0.09207λ + 0.09306z (z 0.0, 4.5 < λ 5.5), 2 for all z and 5.5 < λ 7.5, 0.10260λ + 0.06253z + 0.00146z 0.08728λ + 0.05630z + 0.00109z2 for all z and 7.5 < λ < 10.0. Thus, we have the proposed hybrid approximation: Γ λ−2 , λ−2 eλz ≈ TN (λ, z) + εˆ (λ, z),
(7)
where N = 1000, λ > 0, and −∞ < z < ∞. We plotted and compared the graphs of SAS IG Is (λ−2 , λ−2 eλz ), TN (λ, z), N = 10 000, and the hybrid approximation given by (7) for λ = 1.25, 2.25, 3.25 and 4.25. See Figures 1–4. We found the hybrid approximation (7) to be extremely satisfactory. Next, we give an explicit expression for the λ-derivative of the incomplete gamma function. Let gj(1) (z, λ) denote the first derivative of g((j/N)λ−2 eλz ) with respect to λ, j = 1, 2, . . . , N . It is easily verified that j −2 λz j −2 λz (1) gj (z, λ) = g λ e λ e −2λ−3 log N N j 2 −2 λz 2 + λ−2 − 1 z − − z− λ e . λ N λ
484
K.M. Koti
Fig. 1. Shape parameter = 1.25.
Fig. 2. Shape parameter = 2.25.
Writing ω(λ, z) = ∂Γ (λ−2 , λ−2 eλz )/∂λ, for λ > 0, we have the pivotal result: ω(λ, z) =
λ−2 eλz · z − 2/λ + 2λ−3 ψ λ−2 · S1 −2 NΓ (λ ) +
λ−2 eλz ∂ εˆ (λ, z) · S2 + , NΓ (λ−2 ) ∂λ
(8)
On estimating the gamma accelerated failure-time models
Fig. 3. Shape parameter = 3.25.
Fig. 4. Shape parameter = 4.25.
where S1 =
N−1
j =1
S2 =
N−1
j =1
g
1 j −2 λz λ e + g λ−2 eλz ; N 2
1 (1) gj(1) (z, λ) + gN (z, λ), 2
485
486
K.M. Koti
and where 0.00743 − 0.00458zeλz 0.00806 + 0.00877zeλz 0.02317 0.06694 0.07006 + 0.03178zeλz 0.04729 0.1188 0.12166 + 0.00879zeλz 0.02897
∂ εˆ (λ, z) = ∂λ 0.06774 0.13355 0.13573 + 0.00207zeλz 0.02412 0.05803 0.06886 0.09207 0.10260 0.08728
(z < 0.0, 0.75 < λ 1.5), (z 0.0, 0.75 < λ 1.5), (z < −5.0, 1.5 < λ 2.5), (−5 z < 0.0, 1.5 < λ 2.5), (z 0.0, 1.5 < λ 2.5), (z < −5.0, 2.5 < λ 3.5), (−5.0 z < 0, 2.5 < λ 3.5), (z 0.0, 2.5 < λ 3.5), (z < −10.1, 3.5 < λ 4.5), (−10.1 z < −5.0, 3.5 < λ 4.5), (−5.0 z < 0, 3.5 < λ 4.5), (z 0.0, 3.5 < λ 4.5), (z < −10.0, 4.5 < λ 5.5), (−10 z < −5, 4.5 < λ 5.5), (−5 z < 0.0, 4.5 < λ 5.5), (z 0.0, 4.5 < λ 5.5), for all z and 5.5 < λ 7.5, for all z and 7.5 < λ < 10.0.
The digamma function ψ(k) is computed by calling the SAS function DIGAMMA(k). With this, the system of maximum likelihood equations of Section 2 is complete. For λ < 0, we make the necessary modifications. That is, we set ∂ −2 Γ |λ| , |λ|−2 e−|λ|z , ω |λ|, −z = ∂λ where Γ |λ|−2 , |λ|−2 e−|λ|−z ≈ TN |λ|, −z + εˆ |λ|, −z . As the complete gamma function Γ (k) and the digamma function ψ(k) are built-in in most of the statistical computing packages, the proposed hybrid approximation is capable of being used independently of SAS packages. In what follows, by “hybrid approach” we mean that the λ-derivative given by (8) is used to obtain the maximum likelihood estimates.
On estimating the gamma accelerated failure-time models
487
5. The SAS/IML subroutine NLPTR We call the SAS/IML subroutine NLPTR to compute the maximum likelihood estimates (SAS Institute Inc., 1995). A brief sketch of the SAS program: data temp; set mydata; lntime = log(duration); proc IML; use temp; start f_lhood global (x2 , lntime, status); lli = status ∗ log fz (z, λ) + (1 − status) ∗ log(1 − Fz (z, λ)); f = sum (lli); return f; finish f_lhood; start g_lhood global (x2 , lntime, status); g = j(1, 4, 0); dbeta1 = status ∗ ∂β∂ log fz (z, λ) + (1 − status) ∗ ∂β∂ log(1 − Fz (z, λ)); 1
1
dbeta2 = status ∗ ∂β∂ log fz (z, λ) + (1 − status) ∗ ∂β∂ log(1 − Fz (z, λ)); 2
2
∂ log f (z, λ) + (1 − status) ∗ ∂ log(1 − F (z, λ)); dsigma = status ∗ ∂σ z z ∂σ ∂ log f (z, λ) + (1 − status) ∗ ∂ log(1 − F (z, λ)); dshape = status ∗ ∂λ z z ∂λ
g[1] = sum (dbeta1); g[2] = sum (dbeta2); g[3] = sum (dsigma); g[4] = sum (dshape); return(g); finish g_lhood; n = 4; x0 = {β10 , β20 , σ 0 , λ0 }; ∗ initial solution; optn = {1, opt[2], opt[3]}; call NLPTR(rc, xres, “f_lhood”, x0 , optn) grd = “g_lhood”; xopt = xres‘; fopt = f_lhood(xopt);
We also compute the variance–covariance matrix of the maximum likelihood estimators of the model parameters. The SAS program to compute the estimated variance– is straightforward and not shown due to lack of space. Next, we covariance matrix Σ apply the hybrid approximation approach to two published data sets. Our objective of using these data sets is purely to compare the performance of hybrid approximation approach relative to LIFEREG procedure.
6. Pediatric cancer data Cantor and Shuster (1992) used remission duration (in days) data from a randomized trial in pediatric cancer to discuss parametric versus non-parametric methods for estimating cure rates. Two treatment groups are compared. Data contain a total of 78 observations. The data are 87 percent censored. Each treatment has only five failures.
488
K.M. Koti
Table 2 Analysis of parameter estimates (LIFEREG procedure) Variable
DF
Estimate
Standard error
Chi-square
Pr > ChiSq
Intercept x2 Scale Shape
1 1 1 1
10.0158 0.0318 0.3904 3.0997
7.2595 0.7110 13.2284 105.0228
1.90 0.00
0.1677 0.9643
Table 3 Analysis of parameter estimates (Hybrid approach) Variable
DF
Estimate
Standard error
Chi-square
Pr > ChiSq
Intercept x2 Scale Shape
1 1 1 1
9.9964 0.0479 0.4189 2.8845
0.1441 0.1994 0.0831 0.0328
4812.3731 0.0577
0.0001 0.8102
We set x2 = 1 for group 1 and x2 = 0 otherwise. The SAS LIFEREG procedure applied to these data shows a log-likelihood of −41.3463. However, it also gives a warning: Iteration limit exceeded. It produces the following results. These results are unrealistic. By (5), each parameter estimator is supposed to be asymptotically normally distributed. The standard errors in Table 2 seem to contradict this asymptotic theory. Next, we analyze these data using the proposed hybrid approach. That is, we execute the SAS program of Section 3 using an initial solution (3.1, 10.0, 0.05, 0.4). We set opt[3] = 0. After four iterations GCONV convergence criterion is satisfied. We have the solution shown in Table 3. The observed log likelihood is −41.347. In the LIFEREG procedure analysis, the SAS warning “Iteration limit exceeded” may be interpreted as convergence being slow. The standard errors are huge. These are indications of a non-identifiable model. But the results based on hybrid approach suggest otherwise.
7. Leukemia data Chemoradiotherapy and transplantation of bone marrow from matched sibling donors have been useful for the treatment of acute lymphoblastic leukemia in patients with a poor prognosis but are not available to some two thirds of patients who do not have a matched allogeneic donor. Kersey et al. (1987) undertook a study to compare autologous and allogeneic marrow transplantation in the treatment of such cases. They treated 90 patients with high dose chemoradiotherapy and followed them for 1.4 to 5 years. Fortysix patients with an HLA-matched donor received allogeneic marrow, and 44 patients
On estimating the gamma accelerated failure-time models
489
Table 4 Analysis of parameter estimates (LIFEREG procedure) Variable
DF
Estimate
Standard error
Chi-Square
PR > Chi
Intercept x2 Scale Shape
1 1 1 1
−1.5867 0.2726 1.2972 −1.6594
0.4016 0.3381 0.1659 0.4369
15.6143 0.6501
0.0001 0.4201
Table 5 Analysis of parameter estimates (Hybrid approach) Variable
DF
Estimate
Standard error
Chi-Square
PR > Chi
Intercept x2 Scale Shape
1 1 1 1
−1.7329 0.3538 1.2512 −1.8550
0.3150 0.3296 0.1410 0.2625
30.2640 1.1522
0.0001 0.2831
without a matched donor received their own marrow taken during remission and purges of leukemia cells with use of monoclonal antibodies. We look into only one aspect of their study. We revisit their data on T , the time to recurrence of leukemia, in years. The data are 24 percent censored. The data are reproduced on page 82 in Maller and Zhou (1996). We set x2 = 1 for autologous group and x2 = 0 for allogeneic group. We estimate the accelerated failure-time gamma model using the LIFEREG procedure. See Table 4. We have log likelihood = −145.528. Next, we analyze these data using the proposed hybrid approach with an initial solution (−2.2, −1.75, 0.35, 0.95). We set opt[3] = 3. After six iterations GCONV convergence criterion is satisfied. We have the solution shown in Table 5. The observed log likelihood is −145.622. Our results differ from those of LIFEREG procedure. However, the following is worth mentioning. We obtain the Fisher information matrix by inverting the variancecovariance matrix of the LIFEREG procedure. We compute the hybrid approach based Fisher information matrix evaluated at the maximum likelihood estimates of the LIFEREG procedure. The diagonal elements of these two matrices are identical except for the one that corresponds to the shape parameter. We note that the LIFEREG procedure ˆ overestimates the standard error of λ.
8. Concluding remarks It is suggested in Lawless (1982, pp. 326–327) and Yamaguchi (1992) that estimation of model parameters should involve two phases. First, the maximum likelihood estimates of parameters β, and σ may be obtained for each fixed value of λ, using the
490
K.M. Koti
Newton–Raphson algorithm. Then, the search for the value of λ that maximizes the log-likelihood function may be made. Yet estimating the standard error of λˆ remains to be problematic. Of course, as pointed out by Peng et al. (1998), the variance of the shape parameter is of little use in practice. However, non-availability of the second partial of the log likelihood with respect to λ needed in the observed information matrix may cast doubt on the legitimacy of the standard errors of the estimates of the other model parameters. We believe that we have overcome this problem. We intend to apply this approach to Yamaguchi (1992) type mixture models.
Acknowledgements The author is grateful to his colleagues Dr’s. Clara Chu, Ghanashyam Gupta, Peter Lachenbruch and Bo-guang Zhen for reviewing this paper for FDA clearance. The author is also grateful to Dr. N. Balakrishnan for his comments on the earlier version of this paper and for making this paper a reality.
Appendix A: Fisher information matrix
2 n xj i ∂ 2 log L
−δi = eλzi 2 σ ∂βj i=1
2 τj i (λ, zi ) ∂τj i (λ, zi )/∂βj + , − (1 − δi ) 1 − FZ (zi , λ) 1 − FZ (zi , λ)
where j = 1, 2, and τj i is defined in (2). 2 −2 λzi λ−2 xj i ∂τj i 1 1 − eλzi e−λ e λ−2 eλzi = , −2 ∂βj Γ (λ ) σ n xj i xli ∂ 2 log L
−δi 2 eλzi = ∂βl ∂βj σ i=1 ∂τj i (λ, zi )/∂βl τj i (λ, zi )τli (λ, zi ) + , − (1 − δi ) 1 − FZ (zi , λ) (1 − FZ (zi , λ))2 where j = 1, 2, j = l, and −2 λzi λ−2 ∂τj i xj i xli 1 1 − eλzi e−λ e λ−2 eλzi = , ∂βl Γ (λ−2 ) σ 2 n xj i 1 ∂ 2 log L
1 + zi eλzi − −δi 2 = ∂σ ∂βj λ λ σ i=1
− (1 − δi )
τj i (λ, zi )κi (λ, zi ) ∂τj i (λ, zi )/∂σ + 1 − FZ (zi , λ) (1 − FZ (zi , λ))2
,
On estimating the gamma accelerated failure-time models
491
where j = 1, 2, and ∂τj i xj −λ−2 eλzi −2 λzi λ−2 1 = λ e e · λ + zi 1 − eλzi , −2 2 ∂σ Γ (λ ) σ and κi is defined in (3).
n xj i 1 λzi ∂ 2 log L
λzi − e − 1 + zi e δi = ∂λ∂βj σλ λ i=1
∂τj i (λ, zi )/∂λ τj i (λ, zi )ω(λ, zi ) − − (1 − δi ) 1 − FZ (zi , λ) (1 − FZ (zi , λ))2
,
where λ−2 xj i ∂τj i 1 −2 λzi = e−λ e λ−2 eλzi −2 ∂λ σ Γ (λ )
−2 λz 2 λzi 1 2 1 2 i z− e − − log λ e × −1 + + z− λ λ λ λ λ − 2λ−2 ψ λ−2 , 2 n 2zi λzi 1 ∂ 2 log L
zi λzi δ − e = − 1 − e + i ∂σ 2 λσ 2 σ σ2 i=1
2 κi (λ, zi ) ∂κi (λ, zi )/∂σ + , − (1 − δi ) 1 − FZ (zi , λ) 1 − FZ (zi , λ)
where κ is defined in (3), and
2 2 −2 1 zi ∂κi zi zi −λ−2 eλzi −2 λzi λ λzi = e 2λ . λ e − e + ∂σ Γ (λ−2 ) σ2 σ σ n zi2 λz zi λzi ∂ 2 log L
i = e δi − 2 e − 1 + ∂λ∂σ λσ σλ i=1
∂κi (λ, zi )/∂λ κi (λ, zi )ω(λ, zi ) − − (1 − δi ) 1 − FZ (zi , λ) 1 − FZ (zi , λ))2
,
where w(λ, z) is defined in (8), and 1 ∂κi zi −λ−2 eλzi −2 λzi λ−2 =− e λ e ∂λ Γ (λ−2 ) σ
1 1 2 2 − zi eλzi − 2λ−2 log λ−2 eλzi + zi − × 1+ λ λ λ λ + 2λ−2 ψ λ−2 .
492
K.M. Koti
The diagonal element corresponding to the shape parameter λ, needs a lot of ground work. First, in order to find ∂ω/∂λ, where ω(λ, z) is given by (8), we note the following. By the chain rule we have d log Γ λ−2 = −2λ−3 ψ λ−2 , λ > 0. dλ By using the approximation shown in Prentice (1974) ∞
ψ(k) =
−1 d log Γ (k) = −γ − k −1 + k l(k + l) , dk l=1
(where γ is a constant), we have ∞
−1 −2 . l λ +l ψ λ−2 = −γ − λ2 + λ−2 l=1
Then it readily follows that, for λ > 0, d2 log Γ λ−2 = 6λ−4 · ψ λ−2 2 dλ ∞ ∞
−2 −1 −2 −3 −3 −5 −1 −2 −λ − λ . − 4λ λ +l +λ l λ +l l=1
l=1
The infinite sums are approximated by their sums of the first 10 000 terms. Setting d2 ψ (1) λ−2 ≈ 2 log Γ λ−2 dλ we have
ψ (1) λ−2 = 6λ−4 · ψ λ−2
10 000 10 000
−2 −1 −2 −3 −3 −5 −1 −2 −λ − λ . − 4λ l λ +l +λ l λ +l l=1
l=1
Next, we let S3 =
N−1
j =1
1 (2) gj(2) (z, λ) + gN (z, λ), 2
where (2)
gj (z, λ) 2 j −2 λz (1) j −2 λz λ e λ e · −2λ−3 log + λ−2 − 1 z − = gj N N λ 2 j j −2 λz j −2 λz −2 λz −4 z− ·λ e λ e λ e + gj · 6λ log − N λ N N
j 2 2 2 2 −2 λz + 2λ−2 λ−2 − 1 − λ + z − e − 4λ−3 z − λ N λ2 λ
On estimating the gamma accelerated failure-time models
493
is the second λ-derivative of g((j/N)λ−2 eλz ). Then it is easily seen that ∂ω/∂λ, the second λ-derivative of the incomplete gamma function, is ∂ω λ−2 eλz = ∂λ NΓ (λ−2 )
−2 2 2 2 −3 (1) −2 · S1 + 2 −ψ λ × z − + 2λ ψ λ λ λ −2 ∂ 2 εˆ 2 −3 · S2 + S3 + 2 , + 2 z − + 2λ ψ λ λ ∂λ where ∂ 2 εˆ /∂λ2 is straightforward and it is not shown here. We have n ∂ 2 log L
δi −λ−2 + λ−4 10 + 6 log λ−2 + 2λ−3 zi = 2 ∂λ i=1
2 2 2 −2 λz (1) −2 − zi − + 2 λ e −ψ λ λ λ
2 ω(λ, zi ) ∂ω(λ, zi )/∂λ + . − (1 − δi ) 1 − FZ (zi , λ) 1 − FZ (zi , λ)
References Cantor, A.B., Shuster, J.J. (1992). Parametric versus non-parametric methods for estimating cure rates based on censored survival data. Statist. Medicine 11, 931–937. Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. Kersey, J.H., Weisdorf, D., Nesbit, M.E., LeBien, T.W., Woods, W.G., McGlave, P.B., Kim, T., Vallera, D.A., Goldman, A.I., Bostrom, B., Hurd, D., Ramsay, N.K.C. (1987). Comparison of autologous and allogeneic bone marrow transplantation for treatment of high-risk refractory acute lymphoblastic leukemia. New England J. Medicine 317, 461–467. Lawless, J.F. (1982). Statistical Models and Methods For Lifetime Data. Wiley, New York. Maller, R.A., Zhou, S. (1996). Survival Analysis with Long-Term Survivars. Wiley, New York. Peng, Y., Dear, K.B.G., Denham, J.W. (1998). A generalized F mixture model for cure rate estimation. Statist. Medicine 17, 813–830. Prentice, R.L. (1974). A log-gamma model and its maximum likelihood estimation. Biometrika 61, 539–544. SAS Institute (2001). SAS/STAT Software: Changes and Enhancements, Release 8.2. SAS Institute, Cary, NC. SAS Institute (1995). SAS/IML Software: Changes and Enhancements, Release 6.11. SAS Institute, Cary, NC. SAS Institute (1994). SAS/STAT User’s Guide, Vol. 2, Version 6, 4th edn. SAS Institute, Cary, NC. SAS Institute (1985). SAS Language Guide for Personal Computers, Version 6 Edition. SAS Institute, Cary, NC. Stoer, J., Bulirsch, R. (1980). Introduction to Numerical Analysis. Springer, New York. Yamaguchi, K. (1992). Accelerated failure-time regression models with a regression model of surviving fraction: An application to the analysis of permanent employment in Japan. J. Amer. Statist. Assoc. 87, 284– 292.
28
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23028-5
Frailty Model and its Application to Seizure Data
Nader Ebrahimi, Xu Zhang, Anne Berg and Shlomo Shinnar
1. Introduction The modeling and analysis of the data in which the principal endpoint is the time until an event occurs is often of prime interest in medical studies. The event may be death, the appearance of a tumor, the development of some disease, recurrence of a disease, and so forth. The time to an event is referred to as the failure time. The primary goal in analyzing failure time data is to assess the dependence of failure time on covariates. The secondary goal is the estimation of the underlying failure time distribution. One way to explore the relationship of covariates to failure time is by means of a regression model in which failure time has a probability distribution that depends on the covariates. A regression approach to failure time data analysis could be either fully parametric or semi-parametric. A parametric approach involves extensions of existing parametric failure time models such as exponential, Weibull and log-normal models by means of re-parameterizations to include covariates. On the other hand, a semiparametric approach is distribution free and involves less stringent assumptions on the underlying failure time distribution. The Cox proportional hazards (PH) model, Cox (1972), offers a method for exploring the association of covariates with the failure time. An interesting feature of this model is that it is semi-parametric in the sense that it can be factored into a parametric part consisting of a vector of regression parameters associated with the covariates and a nonparametric part that can be left completely unspecified. More specifically, suppose there are n patients in a study. The observations consist of independent triplets (Xj , δj , Zj ), with Xj = min(Tj , Cj ), the minimum of a failure time and censoring time pair, δj = I (Tj Cj ), the indicator of the event that failure has been observed, and Zj , the vector of covariates of risk factors for the j th patient, j = 1, . . . , n. Such data is referred to as parallel data. Let h(t | Z) be the hazard rate function at time t for an individual with the risk factor Z. The basic Cox proportional hazards model is given by h(t | Z) = lim
t →0
1 P (t T t + t | T t, C t, Z) t
= h0 (t)c(βZ),
(1.1) 495
496
N. Ebrahimi et al.
where h0 (t) is the baseline hazard rate function, β is a parameter vector, and c(βZ) is a known function. A common function for c(βZ) is exp(βZ), which yields h(t | Z) = h0 (t) exp(βZ).
(1.2)
For two individuals with covariate vectors Z and Z ∗ , the ratio of their hazard rates is h0 (t) exp(βZ) = exp β Z − Z ∗ , ∗ h0 (t) exp(βZ ) which is a constant. That is, the conditional hazard functions have a fixed ratio over time and plot as parallel lines. The inference for β in Eq. (1.2) is based on a partial or conditional likelihood rather than a full likelihood approach. Here, the baseline hazard function h0 (t) is treated as a nuisance parameter function. Suppose there are no ties at any failure time and let t1 < t2 < · · · < tM denote the M distinct ordered failure times. At tk , only individuals who are still alive are likely to experience the failure. Thus, the probability that an individual with the covariate Zk fails at time tk can be expressed as P (individual with Zk fails at tk | survival to tk ) P (One fails at tk | survival to tk ) h(tk ) l∈R(tk ) h(tk )
=
h0 (tk ) exp(βZk ) l∈R(tk ) h0 (tk ) exp(βZl )
=
exp(βZk ) . l∈R(tk ) exp(βZl )
=
(1.3)
Here, R(tk ) is the set of all individuals who are at risk at tk . The partial likelihood function, which is a function of β only, is formed by multiplying the probability in Eq. (1.3) over all failure times: L(β) =
M k=1
exp(βZk ) . l∈R(tk ) exp(βZl )
(1.4)
One can obtain an estimate of β by maximizing Eq. (1.4) with respect to β. Often, due to the way times are recorded, ties between failure times are found in the failure data. Modified partial likelihood has been provided when such ties exist. See Klein and Moeshberger (1997) for details. The partial likelihood function (1.4) was obtained based on the assumption that the failure times of distinct individuals are independent of each other. Although this assumption may be valid in many experimental settings, it may be suspect in others. For example, we may be making inferences about failure time in a sample of siblings who share a common genetic make-up, or in a sample of married couples who share a common unmeasured environment. We may also be interested in studying the occurrence
Frailty model and its application to seizure data
497
of different non-lethal diseases within the same individual. In each of these situations, it is quite probable that there is some association within groups of failure times in the sample. Problems in which more than one failure time can be recorded on each individual are generally called multidimensional or multivariate. For the purpose of this review it is useful to distinguish problems in which repeated failures can occur to the same individual from problems in which the time axis is truly multivariate. Expanding proportional hazards model to include a random effect, called a frailty, allows for modeling association between individual failure times within a group. A frailty acts multiplicatively on the hazard function and the model that incorporates this random effect into the hazard function is called the frailty model. There are two different, but related, connotations of frailty. First, frailty is the missing covariates that are not known to us and consequently they are unobservable. More specifically, let Z denote the covariate vector that is known to us and w denote the covariate vector that is unknown. The hazard function for a given individual is h(t | Z) = h0 (t) exp(βZ + ψw),
(1.5)
where ψ is the regression coefficient of unknown covariates. To simplify Eq. (1.5), let u = exp(ψw), then the hazard function has the form h(t | Z) = h0 (t)u exp(βZ),
(1.6)
where u is a random variable assumed to have a one-dimensional distribution q. In Eq. (1.6), the frailty u represents the total effect on failure of the covariates not measured when collecting information on individuals. Eq. (1.6) is known as the frailty model, see Vaupel et al. (1979) for more details. Clayton (1978) suggested the other connotation of the frailty when individuals in a study are divided into distinct groups. Here, the frailty denotes unobservable common covariates shared by members in a group, and the frailty model handles the dependence generated by those common covariates. For example, for a study including husband and wife, each couple shares common environmental factors; for menozygotic twins study, twins share a common genotype as well as common environmental factors. Specifically, suppose there are G groups with ni individuals in the ith group; Zij is the observable covariate vector for the j th individual in the ith group. Let wi be the unobservable covariates for the ith group and ψ be its regression coefficient. The hazard function of the j th individual in the ith group is hij (t | Zij ) = h0 (t) exp(βZij + ψwi ),
i = 1, . . . , G, j = 1, . . . , ni .
(1.7)
Replacing exp(ψwi ) by ui , which is the frailty of the ith group, the hazard function incorporating frailty reduces to hij (t | Zij ) = h0 (t)ui exp(βZij ),
i = 1, . . . , G, j = 1, . . . , ni .
(1.8)
Here it is assumed that u1 , . . . , uG are random variables with the common probability density function q. The model (1.8) can be considered as a random effects model with two sources of variation. There is a group variation, described by the random variable u with the probability density function q. Secondly, there is the individual variation described by the hazard function h0 (t) exp(βZij ).
498
N. Ebrahimi et al.
In (1.8), members in a group share the same frailty, so the frailty model under this circumstance is known as the shared frailty model. Also, in this model, groups with a large value of the frailty will experience the failure at earlier times than groups with small values of the frailty. Details of this model are available in Hougaard (2000) and the references cited there. In this paper, we examine estimation and inference procedure for the shared frailty model (1.8). The paper is organized as follows. Section 2 presents estimations of coefficients for different frailty distributions. Section 3 describes the shared frailty model for recurrent events. Section 4 illustrates these techniques in the seizure data.
2. Inference for the shared frailty model As in most contexts, provided that one trusts the model (1.8), maximum likelihood method is the method of choice. To derive the general form of the likelihood function, it is assumed that the common factor causes dependence between individuals in a given group, and conditional on that, all individuals within the group are independent. Thus, for one group of n individuals, the conditional joint survival distribution of failure times T1 , T2 , . . . , Tn is given by P T1 > t1 , . . . , Tn > tn | u = P (T1 > t1 | u)p(T2 > t2 | u) · · · P (Tn > tn | u) n = exp −u H0 (tj ) exp(βZj ) .
(2.1)
j =1
Note that we omit the group index i in Eq. (2.1). The above joint conditional survival distribution holds for any group. Integrating the frailty out, we get the joint survival function for this group as S(t1 , . . . , tn ) = P (T1 > t1 , . . . , Tn > tn )
∞ P (T1 > t1 , . . . , Tn > tn | u)q(u) du = 0
∞
=
exp −u
0
= LP
n j =1
n j =1
H0 (tj ) exp(βZj ) q(u) du
H0 (tj ) exp(βZj ) ,
(2.2)
t where LP is the Laplace transform of the density function q and H0 (t) = 0 h0 (u) du. From Eq. (2.2) it is clear that the joint survival function for one group is the Laplace transform of the frailty density function q with parameter nj=1 H0 (tj ) exp(βZj ). In principle, any distribution on the positive numbers can be applied as a frailty distribution. In this paper, we concentrate on the gamma, the log-normal and the positive
Frailty model and its application to seizure data
499
stable distributions. For other distributions see Hougaard (2000) and Ohman and Eberly (2001). From Eq. (2.2) one can derive the likelihood function for one group as follows: If the failure time is observed for the j th individual at time tj , its probability is given by P (Tj = tj , T1 > t1 , T2 > t2 , . . .) = − = −h0 (tj ) exp(βZj )LP(1)
n
∂S(t1 , . . . , tn ) ∂tj
H0 (tj ) exp(βZj ) ,
(2.3)
j =1
where LP(1) (s) denotes the first derivative of LP(s) with respect to s. Let D. = δj , the total number of failures in the group, and θ be the parameter of the frailty distribution. Then, using Eq. (2.3), the likelihood for one group is given by n n D. δj (D.) h0 (tj ) exp(δj βZj ) LP H0 (tj ) exp(βZj ) . (−1) (2.4) j =1
j =1
The likelihood function for all individuals is constructed by multiplying the group likelihoods together. Specifically, if Di denotes the number of failures in the ith group, and D= G i=1 Di , then the likelihood function is given by (−1)
D
n G i i=1
j =1
h0 (tij )
δij
exp(δij βZij ) LP
(Di )
ni
H0 (tij ) exp(βZij ) .
j =1
(2.5) If we assume a parametric form for h0 , we can handle the estimation in the usual way by differentiating the log likelihood function. If a parametric from is not assumed for h0 , there are several estimation methods available to handle this semi-parametric model. These methods are described below. The full conditional approach is to use the likelihood function (2.5) and insert nonparametric expressions for H0 (t), H0 (t) = tk t h0k , assuming a discrete contribution h0k at each time of failure. Here tk is the kth smallest failure time, regardless of the subgroup, M is the number of distinct failures, and dk is the number of failures at tk , k = 1, . . . , M. Thus we get the function with parameters β, θ, h01, . . . , h0M . Now, we follow the traditional method for estimation: differentiate the log likelihood with respect to all parameters and use the Newton–Raphson method to obtain estimates. This approach takes much more iterations for convergence than other methods, so it is not preferable when we deal with a typical frailty distribution. However, for the data set with complicate dependence structures, we may end up with the likelihood function which cannot be handled by simpler methods. In this case, the full conditional approach may be the only resolution. The EM algorithm, which is a simpler method, can also be used for estimation since we can treat the frailty as covariates. Klein (1992) developed the estimation based on
500
N. Ebrahimi et al.
the EM algorithm for the gamma frailty. A similar method was developed by Wang et al. (1995) for the positive stable frailty. Shu and Klein (1999) also offer SAS macros for the gamma and the positive stable frailty models on the Internet. We discuss this algorithm in more details later in this section. S-plus (version 6) offers the frailty function based on the penalized likelihood approach (Good and Gaskin, 1971). An excellent example of the penalized likelihood estimates is given by Tapia and Thompson (1978). The penalized approach has some similarities to the EM algorithm. Here, the penalized likelihood is constructed by the product of the partial likelihood, including the frailty terms as parameters, and a penalty function which describes the roughness of the curve under consideration. An iterative estimation procedure is used by maximizing the partial likelihood first, then modifying the frailties by the penalized likelihood. In S-plus, there are two penalty functions available within the frailty function to simulate the gamma frailty and the log-normal frailty. For more details about this approach see Hougaard (2000). Most of the analysis in Section 4 was done by using corresponding functions in S-plus. 2.1. The EM algorithm The most commonly applied estimation method for parallel data with covariates is the EM algorithm. The EM algorithm is a combination of an expectation step (E-step) and a maximization step (M-step). This method does not use the likelihood function (2.5), rather the full likelihood function. In the E step of the algorithm the expected value of Lfull is computed given the current estimates of the parameters and the observable data. In the M step estimates of parameters which maximize the expected value of Lfull from the E step are obtained. We define the full likelihood as the product of the conditional and the density of frailties. Lfull =
ni G i=1 j =1
(ui )δij h0 (tij )δij exp(δij βZij ) G q(ui ). × exp −ui H0 (tij ) exp(βZij )
(2.6)
i=1
In Eq. (2.6), the first term is a standard survival likelihood given the frailties. The most basic difference between the likelihood (2.5) and the full likelihood is in the way frailty is treated. For the likelihood, we believe that the frailty is unobservable and thus inestimable; the only information about the frailty is the assumed distribution q(u) with the unknown parameter θ . So we keep only θ in the likelihood function. However, for the full likelihood, we pretend that the frailties are estimable. In this approach, first we estimate the frailty for each group based on the observable data and the initial values of h0 , β, θ . Then we plug in the estimates of u1 , . . . , uG into the modified partial likelihood to derive the estimates of β and h0 and then into the likelihood of θ to get θˆ . An advantage of estimating u1 , . . . , uG is that we can use the hazard functions, ui h0 (tij ) exp(βZij ), i = 1, . . . , G, to construct the corresponding partial likelihood, where we obtain the estimate of β and h0 . Furthermore, estimate of θ is easy to derive by inserting uˆ i , i = 1, . . . , G, into the corresponding likelihood. Compared with the full
Frailty model and its application to seizure data
501
conditional approach, where the likelihood is maximized directly, the EM algorithm is much simpler. More specifically, the steps of the EM algorithm are as follows: (1) Provide initial values for β, h0 and θ . (2) In the E-step, plug values of β, h0 and θ into the full likelihood (2.6) and calculate the conditional expectation of ui and ln(ui ) given the observable data. Parner (1997) suggested a general formula for the expectation of frailties, 0 (tij ) exp(βZ ˆ ij )] LP(Di +1) [ j H E(ui ) = − (2.7) , i = 1, . . . , G. 0 (tij ) exp(βZ ˆ ij )] LP(Di ) [ j H In this step, the unobserved terms in the log-likelihood are removed by substitution with the mean value given the observations (3) In the M-step, plug the expectation of frailties into the modified partial likelihood, update the estimates of β and h0 ; plug into Eq. (2.6), update the estimate of θ . In this case, the partial likelihood turns out to be L(β) =
M k=1
[
exp(uˆ k (βsk )) , ˆ l exp(βZl )]dk l∈R(tk ) u
(2.8)
number of where tk is the kth smallest failure time, regardless of subgroup, dk is the failures at tk , Dk is the set of all individuals who fail at time tk , and sk = j ∈Dk Zj . Furthermore, the maximum likelihood estimate of h0k is hˆ 0k =
dk ˆ
ˆ l exp(βZl ) l∈R(tk ) u
,
k = 1, . . . , M.
(2.9)
Here, the frailty values are considered fixed and known. (4) Repeat the E-step and M-step until the estimates converge. The standard errors of the estimates of h0 , β, and θ can be obtained from the inverse of the observed information matrix. 2.2. The gamma frailty model Clayton (1978) proposed the gamma distribution for the frailty. Since then, the gamma frailty model has been used extensively because the derivatives of its Laplace transformation are quite simple. The density function of the frailty is q(u) =
u1/θ−1 exp(−u/θ ) Γ (1/θ )θ 1/θ
(2.10)
with the Laplace transform LP(s) = (1 + θ s)−1/θ . Usually, we use the one-parameter gamma distribution denoted by Gamma(θ ). Thus the mean of the frailty is 1, which is the desired property of the frailty distribution; the variance is θ , which reflects the degree of dependence in the data. Large θ indicates strong dependence.
502
N. Ebrahimi et al.
From Eq. (2.2), it is easily seen that the marginal hazard is h0 (tj ) exp(βZ) . 1 + θ H0 (tj ) exp(βZ)
(2.11)
Consequently, for two randomly selected individuals with covariate values Z and Z ∗ , the marginal hazards are not proportional over time and the relative risk is given by 1 + θ H0 (t) exp(βZ ∗ ) exp β(Z − Z ∗ ) , 1 + θ H0 (t) exp(βZ)
(2.12)
which depends on time through H0 (t). From Eq. (2.12), it is clear that the relative risk starts at exp(β(Z − Z ∗ )) because no failure occurs, and converges to 1 as t → ∞ since H0 (t) → ∞ in this case. If we are interested in the relative risk at a given time t, we can calculate it by replacing θ, β and H0 (t) with their estimates in Eq. (2.12). One can derive the likelihood function as follows. The pth derivative of the Laplace transform is 1 1 (p) p p −1/θ−p LP (s) = (−1) θ (1 + θ s) (2.13) +p Γ . Γ θ θ Following Eq. (2.5), the likelihood for all individuals is given by n G Di i θ Γ (1/θ + Di ) D δij h0 (tij ) exp(δij βZij ) (−1) Γ (1/θ ) j =1
i=1
× 1+θ
ni
−1/θ−Di H0 (tij ) exp(βZij )
.
(2.14)
j =1
If we decide to use the EM algorithm, the full likelihood is ni G
h0 (tij )δij exp(δij βZij ) exp −ui H0 (tij ) exp(βZij )
i=1 j =1
×
1/θ+Di −1 G u exp(−ui /θ ) i
i=1
Γ (1/θ )θ 1/θ
.
(2.15)
Now the following procedure must be followed: Find estimates of β and H0 (t) in the model without frailty. This corresponds to setting θ = 0. Use them as initial estimates of β and H0 (t) with θ = 0 as an initial estimate for θ . For the E-step, obtain the expectation of frailty ui . We can get it using Eq. (2.7) directly. However, for the gamma frailty, a short-cut is available. Based on Lfull , the distribution of ui given the observable data ˜ is still a gamma with shape parameter α˜ i = 1/θ + Di and scale parameter θi = 1/θ + H (t ) exp(βZ ). In general, it can be proven that for the gamma (α, θ ), ij j 0 ij E(u) = α/θ, E ln(u) = ψ(α) − ln θ, where ψ(α) is the digamma function Γ (α)/Γ (α). Therefore, for the E-step, the expectation of ui and the expectation of ln(ui ) are α˜ i /θ˜i and ψ(α˜ i ) − ln(α˜ i ), respectively. The
Frailty model and its application to seizure data
503
general formula in Eq. (2.7) gives us exactly the same result for E(ui ). For the M-step, obtain the estimate of β based on Eq. (2.8). The estimate of h0k is given by Eq. (2.9) for k = 1, . . . , M, and the estimate of θ is derived by maximizing the likelihood of θ , L(θ ) = Γ (1/θ )−G θ −G/θ
G
1/θ+Di −1
uˆ i
exp(−uˆ i /θ ).
(2.16)
i=1
To get the standard errors of the estimates, we have to derive the information matrix and then plug in estimates of β, h01 , . . . , h0M , θ . The covariance matrix is the inverse of the observed information matrix. 2.3. The positive stable frailty model In this section we assume the positive stable distribution for the frailty, see Hougaard (2000). For most frailty models, the marginal hazard functions are not proportional. The positive stable frailty model is an exception. This is an advantage of the positive stable frailty model. Suppose the frailty has the positive stable distribution with parameter θ . We restrict 0 < θ 1 to get a distribution with positive numbers. Here a small value of θ indicates a strong dependence in the data. If θ is close to 1, individuals in the study are independent of each other, and the frailty model is not necessary. The density is ∞
q(u) = −
1 Γ (-θ + 1) −θ −u sin(θ -π), πu -!
(2.17)
-=1
with Laplace transform L(s) = exp(−s θ ). The marginal distribution of Tj is given by P (Tj > t) = exp −H0 (tj )θ exp(θβZ) .
(2.18)
Thus, the integrated hazard and the hazard function are H0 (tj )θ exp(θβZ), and θ h0 (tj )H0 (tj )θ−1 exp(θβZ), respectively. For the marginal distributions of two individuals with covariate values Z and Z ∗ , the relative risk is the constant exp(θβ(Z − Z ∗ )). The pth derivative of Laplace transform is (s) = (−1) exp −s θ cp,m θ m s mθ−p , p
(p)
L
p
m=1
where cp,m is a polynomial in θ of degree m and is defined recursively by cp,p = 1,
cp,1 = Γ (p − θ )/Γ (1 − θ ),
and cp,m = cp−1,m−1 + cp−1,m (p − 1) − mθ .
(2.19)
504
N. Ebrahimi et al.
Following Eq. (2.5), the likelihood function for all individuals is given by n n θ G i i δij h0 (tij ) exp(δij βZij ) exp − H0 (tij ) exp(βZij ) i=1
×
j =1 Di m=1
j =1
cDi ,m θ m
n i
mθ−Di H0 (tij ) exp(βZij )
.
(2.20)
j =1
One may use the EM algorithm for estimation. For the E-step, using the general formula in Eq. (2.7), the expectation of frailty ui is given by Di +1 m mθ−Di −1 m=1 cDi +1,m θ [ j H0 (tij ) exp(βZij )] E(ui ) = D . i m[ mθ−Di c θ H (t ) exp(βZ )] D ,m 0 ij ij i j m=1 For the M-step, one has trouble estimating θ because its likelihood function, involving the positive stable density, is really complicated. Wang et al. (1995) suggested a practical algorithm for providing values for θ and searching for the value which maximizes Eq. (2.20). 2.4. The log-normal frailty model McGilchrist and Aisbett (1991) proposed the log-normal distribution with mean ξ and variance σ 2 for the frailty. Therefore, the density of frailty is given by 1 (ln u)2 . q(u) = (2.21) exp − u(2πσ 2 )1/2 2σ 2 Unfortunately, the Laplace transform of this distribution is theoretically interact-able. Hence, the explicit form of marginal hazard is not available to us. They offered the penalized likelihood approach for estimation. This approach is implemented in the frailty function in S-plus (version 6).
3. The shared frailty model for recurrent events In this section, data of recurrent events refer to longitudinal data from individuals who experience the same type of events several times. Longitudinal data have two distinct sub-categories. In the first case, one studies individuals continuously and records the exact times of events. Such data is referred to as data of exact times. An alternative is to study individuals over the same period of time but only record the number of events for each individual during that period, count data. In this section, we will focus on the data of exact times. One approach to analyzing recurrent events is to do a separate analysis for each successive event using the Cox proportional hazards model described by Eq. (1.2). This approach is inefficient in several respects. Here we only mention two. First, it is tedious
Frailty model and its application to seizure data
505
to do multiple analyses and more numbers must be interpreted, leaving room for ambiguity and confusion. Secondly, the later events are biased. Because the only individuals who had, say, a fourth event, are those who already have had three events. It simply means that Inter-arrival times for those four events were shorter than average, so it is likely that the fifth Inter-arrival time will also be short. An alternative approach to repeated events that avoids these problems is to treat each interval as a distinct observation, pool the intervals together and fit the Cox model to the pooled observations. Unfortunately, this method introduces a new problem: dependence among the multiple observations. Given the dependence, of course, we can not accept the pooled estimates because the standard errors are likely to be too low, the chi-square statistics too high and the estimates to be biased. One approach is to ignore the possible biases and concentrate on getting better estimates of the standard errors and test statistics. The other approach is to use the shared frailty model and correct biases in the coefficients. Wei et al. (1989) proposed a method (WLW method) for getting robust variance estimates that allow for dependence among recurrent event times. With these variance estimates, one can then get efficient pooled estimates of the coefficients and their standard errors, and one can also test a number of relevant hypotheses. The technique is sometimes described as a marginal method. The advantage of this method is that no assumptions need to be made about the nature or structure of the dependence. On the other hand, there is no correction for biases. Although the shared frailty model is applied most frequently to parallel data, it is also applicable to recurrent data. This approach can correct biases in the coefficients. There are three aspects which make data of recurrent events differ from parallel data. First, recurrent events are obtained by observing each individual over time, whereas for parallel data, all individuals are followed simultaneously. Second, for recurrent events, one has no idea about the number of events which each individual will experience at the beginning of the study, so the total number of events is not known; however, for parallel data, the maximum number of events is fixed by the design. Finally, for recurrent events, the event times for an individual is of the form 0 < t1 < t2 < · · · < t, and only the last time could be a censoring time, whereas for parallel data, there is no such restriction as t1 < t2 . This aspect leads to a different likelihood for recurrent events. Below we describe the method. To incorporate the frailty, we let Y (t) to be the number of events up to time t and
t H0 (t) = 0 h0 (v) dv. It is known that Poisson process provides a good model for the number of events that occur during time, space, or volume, see Ross (1991). Thus conditional on the frailty u, the Y (t) is assumed to follow a Poisson process with the hazard function h0 (t)u exp(βZ) and P Y (t) = y | u, Z = uy H0 (t)y exp(yβZ) exp −uH0 (t) exp(βZ) /y!. (3.1) If M(t | Z, u) = E(Y (t) | u, Z), then M (t | Z, u) = h0 (t)u exp(βZ) is the instantaneous rate of change of the expected number of events with respect to time.
506
N. Ebrahimi et al.
Given that simultaneous events do not occur, one can show that lim
→0
P (Y (t + ) − Y (t) 1 | u, Z) = h0 (t)u exp(βZ).
(3.2)
That is, for small , h0 (t)u exp(βZ) is approximately the probability of an event in the interval (t, t + ]. Hence, it is important to distinguish between the interpretation of the hazard function of a process described by Eq. (3.2), and the hazard function of a distribution given by Eq. (1.2). From (3.1), we obtain the likelihood function for the individual based on y events before time t by P Y (t) = y, T1 = t1 , . . . , Ty = ty
∞ y = uh0 (tj ) exp(βZ) exp −uH0 (t) exp(βZ) q(u) du 0
=
j =1
y
h0 (tj ) exp(βZ)
∞
uy exp −uH0 (t) exp(βZ) q(u) du
0
j =1
= (−1)
y
y
h0 (tj ) exp(βZ) LP(y) H0 (t) exp(βZ)
(3.3)
j =1
Since E Xn exp(−sX) = (−1)n LP(n) (s) .
Here 0 < t1 < · · · < ty < t. The likelihood function of all individuals is now derived by multiplying Eq. (3.3) over individual index i, y n i yi L(h0 , β, θ ) = (−1) h0 (tij ) exp(βZi ) LP(yi ) i=1
j =1
× H0 (ti ) exp(βZi ) .
(3.4)
It should be noted that the likelihood given in Eq. (3.4) is applicable for the data of exact times observed with different observation intervals. If we observe all individuals in a common interval, the likelihood function could be simplified. Furthermore, if the recorded date are only the count of events in the common interval, then the likelihood must be modified. Here, we will not discuss those types of recurrent events. For more details see also Hougaard (2000). If we assume a parametric form for h0 , we can handle the estimation in the usual way by differentiating the log likelihood. If a parametric form is not assumed for h0 , we can use the full conditional method, differentiating the non-parametric log likelihood. However, using this method, we need to replace H0 (t) by its non-parametric form in Eq. (3.4). The newest version of S-plus can handle Poisson process based on frailty model. In fact, it offers us a convenient way to analyze data of recurrent events.
Frailty model and its application to seizure data
507
4. Seizure data and its analysis In this section we use the frailty model described in the previous section to assess the rate of seizure recurrence after a first unprovoked seizure in childhood. Our data come from a cohort of 407 children (one month through 19 years of age) prospectively identified for their first unprovoked seizure in Bronx, NY between 1983 and 1992 and followed through September 1998. (See Shinnar et al., 2000.) The dates of occurrence of the first ten seizure recurrences were recorded, along with several covariates. The first covariate is the etiology of the seizure which is classified as remote symptomatic (RS) (having a neurological abnormality associated with an increased risk of seizures and which is taken to be the presumed cause of the seizure) versus all other causes. The second covariate is the asleep-awake state, which is coded as 1 if the child was asleep when the initial seizure occurs, and given 0 if the child was awake. The third covariate is electroencephalogram (EEG). There are three categories based on children’s EEG: normal, abnormal or not performed. We use two covariates to denote these categories. EEG is coded as 1 for abnormal EEG, and 0 for others. A second variable (MissEEG) is coded as 1 if the child did not have an EEG. If we consider the record of the ith child, the recurrent event times satisfy the condition that 0 < ti1 < ti2 < · · · , where tij denotes the time for the j th seizure recurrence. Now, think of tij ’s as the arrival times of a counting process. Since there is an association between recurrence times for the same child, the frailty model (3.1) represents a way to capture this association. In addition to the three covariates mentioned above, the previous seizures may effect the risk for the next seizure. To adjust for the number of seizures, nine more covariates are created to describe all possible states that a child could be in. The covariates are 1, if a child experienced -th recurrence, Z- = (4.1) 0, otherwise, - = 1, . . . , 9. We note that for all observations related to one child, the covariates Z1 − Z9 change over time, while etiology, EEG and asleep-awake state hold constant. The goal is to determine which covariates explain changes in the intensity of recurrence. To accomplish this goal, ignoring the frailty u, Cox PH was used first. For this case, the intensity of recurrence for the individual i at time t after episode - is defined by hi- (t) = h0 (t) exp βetiology(etiology) + βEEG (EEG) + βasleep (asleep) (4.2) + βmissEEG (missEEG) + β- , - = 1, . . . , 9. Here, β- is the episode effect. Since there is an association among event times, by taking into account the frailty, the true hazard function becomes h∗i- (t) = ui hi- (t),
- = 1, . . . , 9,
where ui is the frailty for the ith child.
(4.3)
508
N. Ebrahimi et al.
Because two children were missing information about their asleep–awake state at the time of the initial seizure, our analysis is actually based on 405, not 407 individuals. Table 1 represents results from the model (4.2). It is often relevant to check the assumptions of the model (4.2). Generally this can be done using the Kaplan–Meier method for estimation of survival function with censored observations. Figure 1 plots − log(log(Kaplan–Meier estimator)) against time for each Table 1 Risk analysis, using the Poisson process based on the Cox model Relative hazard Remote symptomatic (RS) etiology Abnormal EEG Seizure while asleep Initial seizure 1 Recurrence 2 Recurrence 3 Recurrence 4 Recurrence 5 Recurrence 6 Recurrence 7 Recurrence 8 Recurrence 9 Recurrence
1.647 0.965 1.050 1.000 5.377 10.113 12.616 18.018 22.822 32.508 72.218 80.856 75.567
95% Confidence interval (1.43, 1.90) (0.84, 1.11) (0.91, 1.21) (4.25, 6.81) (7.83, 13.06) (9.66, 16.48) (13.57, 23.93) (16.83, 30.94) (23.70, 44.60) (52.02, 100.25) (57.80, 113.11) (53.48, 106.78)
Fig. 1. Checking proportional hazards.
p-value 0.0001 0.6200 0.4900 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
Frailty model and its application to seizure data
509
Table 2 Risk analysis using the Poisson process based on gamma frailty
Remote symptomatic etiology Abnormal EEG Seizure while sleep Initial seizure 1 Recurrence 2 Recurrence 3 Recurrence 4 Recurrence 5 Recurrence 6 Recurrence 7 Recurrence 8 Recurrence 9 Recurrence
ˆ exp(β)
95% Confidence interval
2.359 1.157 1.060 1.000 4.003 5.956 7.906 12.002 12.622 18.133 35.359 31.731 33.160
(1.73, 1.90) (0.89, 1.50) (0.80, 1.38) (3.05, 5.25) (4.41, 8.04) (5.72, 10.93) (8.51, 16.93) (8.80, 18.11) (12.29, 26.76) (24.07, 52.06) (21.31, 47.24) (22.01, 49.97)
p-value 0.0001 0.2700 0.7000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
seizure. From this figure, it is clear that the model (4.2) is not valid, especially at the beginning. Having suggested several frailty distributions, it makes sense to compare them in order to find out whether any of them is preferable to the other. It appears that, based on Akaike Information Criteria (AIC), the gamma distribution is a better fit. Also, the estimated variance for the gamma frailty is 0.955 which confirms our finding that the model (4.2) is not an appropriate model. Table 2 gives the results based on the gamma frailty, the model (4.3). From this table it is clear that, after adjusting for the number of recurrences, etiology is statistically significant. Under this model, Figures 2(a)–(l) give the relative risk function (function of time) for each covariate. From Figure 2(a), the relative risk of recurrence is about 2.4 for RS at the beginning. The relative risk goes down to about 1.6, and still remains significant. For the other two covariates the results are different. The relative risk of recurrence for the abnormal EEG, Figure 2(b), is about 1.2 at the beginning. However, the effect disappears overtime and becomes insignificant. For asleep, see Figure 2(c), the relative risk of recurrence is about 1.06 initially and the effect disappears with time. In summary, from Table 2 and Figures 2(a)–(c), we observe that the etiology remains always significant. Also, the child with abnormal EEG and/or asleep initially is at greater risk to have the next seizure. Our finding confirms results obtained by Shinnar et al. (1996, 2000) who observed that EEG and the asleep-awake state are good predictors of the first recurrence but not significant after that. The effect of etiology persists regardless of the number of prior recurrences. Figure 2(d) demonstrates that relative to the initial risk of a first recurrence the risk of having a second recurrence after the first is four times greater at the beginning. The risk goes down over time. Similarly, from Figure 2(e), the risk of a third recurrence after a second recurrence is six times higher than the risk of a first recurrence after the initial seizure at the beginning and it goes down with time. Also, from both figures, we observe that the risk of a third recurrence after a second recurrence is 1.5 times higher than the
510
N. Ebrahimi et al.
(a)
(b) Fig. 2. (a) Relative risk of etiology. (b) Relative risk of EEG. (c) Relative risk of awake. (d) Relative risk of1 recurrence. (e) Relative risk of 2 recurrence. (f) Relative risk of 3 recurrence. (g) Relative risk of 4 recurrence. (h) Relative risk of 5 recurrence. (i) Relative risk of 6 recurrence. (j) Relative risk of 7 recurrence. (k) Relative risk of 8 recurrence. (l) Relative risk of 9 recurrence.
Frailty model and its application to seizure data
(c)
(d) Fig. 2. (Continued.)
511
512
N. Ebrahimi et al.
(e)
(f) Fig. 2. (Continued.)
Frailty model and its application to seizure data
(g)
(h) Fig. 2. (Continued.)
513
514
N. Ebrahimi et al.
(i)
(j) Fig. 2. (Continued.)
Frailty model and its application to seizure data
(k)
(l) Fig. 2. (Continued.)
515
516
N. Ebrahimi et al.
risk of a second recurrence after the first at the beginning and the risk goes down over time. Similar interpretation can be used for other graphs.
5. Concluding remarks For parallel data, we have presented a review of both a semi-parametric and a parametric approach to estimating the risk coefficients and the frailty parameters for the gamma frailty, the positive stable frailty and the log-normal frailty models. We also give references where the techniques have been extended to other frailty models. Since, in general, recurrent events are more complicated than parallel data, we have reviewed frailty models as well as statistical inference for such events in a separate section. Finally, we have discussed an application of the frailty model in medical science. The model is also applicable to reliability, actuarial science, biological science and other fields, where the multivariate times are observed.
Acknowledgements Berg and Ebrahimi were partially supported by the National Institute of Neurological Disorders and Stroke (NIH-NINDS) grant 1R01 NS 31146. Shinnar was partially supported by the NIH-NINDS grant 1R01 NS 26151.
Ch. 29. State Space Models for Survival Analysis
Wai Y. Tan and Weiming Ke
Handbook of Statistics, Vol. 23 (ISSN 0169-7161), © 2004 Elsevier B.V. All rights reserved. DOI: 10.1016/S0169-7161(03)23029-7
1. Introduction

Many diseases, such as AIDS, cancer and infectious diseases, are often very complicated biologically. Most of these diseases are complex stochastic processes in which it is often very difficult to estimate the unknown parameters, especially when not many data are available. In such cases, it is very difficult to estimate the survival probabilities. To ease the estimation problem and to estimate the survival probabilities, in this article we propose a state space modeling approach that combines stochastic models with statistical models. One can then readily apply the Gibbs sampling method and the Markov chain Monte Carlo (MCMC) approach to estimate the unknown parameters and the state variables. By using these estimates, one can validate the model and estimate the survival probabilities. We illustrate the model and the method by using a birth–death–immigration–illness–cure process, which involves stochastic birth–death processes with immigration together with the illness and cure processes for a disease such as tuberculosis.
2. The state space models and the generalized Bayesian approach

To illustrate, consider a disease such as tuberculosis which is curable by drugs. Let X(t) be the vector of stochastic processes for the key responses of the disease. Then X(t) is the stochastic model for this disease, and in many cases one can derive stochastic equations for the state variables of the system by using the basic biological mechanisms of the disease; for some illustrations in cancer and AIDS, see (Tan, 2000; Tan, 2002; Tan and Chen, 1998; Tan et al., 2001). If some observed data are available from this system, then one may derive some statistical models to relate the data to the system. Combining the stochastic model of the system with the statistical model, one has a state space model for the system. That is, the state space model of a system is a stochastic model consisting of two sub-models: the stochastic system model, which is the stochastic model of the system, and the observation model, which is a statistical model relating some available data to the system. It extracts biological information from the system
via its stochastic system model and integrates this information with that from the data through its observation equation.

2.1. Some advantages of the state space models

The state space model of a system is advantageous over the stochastic model of the system alone or the statistical model of the system alone in several aspects. The following are some specific advantages:
(1) The statistical model alone or the stochastic model alone is very often not identifiable and cannot provide information on some of the parameters and variables. For some specific examples, see Brookmeyer and Gail (1994), Tan (2000, Chapter 5) and Tan and Ye (2000, 2000).
(2) The state space model provides an optimal procedure for updating the model with new data which may become available in the future. This is the smoothing step of the state space models; see (Catlin, 1989; Gelb, 1974; Sage and Melsa, 1971).
(3) The state space model provides an optimal procedure, via Gibbs sampling, to estimate simultaneously the unknown parameters and the state variables of interest; see Tan and Ye (2000, 2000).
(4) The state space model provides an avenue to combine information from various sources. For some examples, see (Tan et al., 2000).
The state space model was originally proposed by Kalman and his associates in the early 1960s for engineering control and communication (Kalman, 1960). Since then it has been successfully used as a powerful tool in aerospace research, satellite research and military missile research. It has also been used by economists in econometrics research (Harvey, 1994) and by mathematicians and statisticians in time series research (Aoki, 1990) to solve many problems which appear extremely difficult from other approaches. It was first proposed by Tan and his associates for AIDS and cancer research (Tan and Chen, 1998, 1999; Tan et al., 2001, 2000, 2002; Tan and Xiang, 1998; Tan and Xiang, 1998; Tan and Xiang, 1998; Tan and Xiang, 1999; Tan and Xiang, 1999; Tan and Ye, 2000; Tan and Ye, 2000; Wu and Tan, 1995; Wu and Tan, 2000). State space models can apparently be extended to other diseases as well, including heart and infectious diseases.

2.2. A general Bayesian procedure for estimating unknown parameters and state variables via state space models

Applying the state space models, Tan and Ye (2000, 2000) have developed a general Bayesian procedure to estimate simultaneously the unknown parameters and the state variables. This procedure combines information from three sources: (1) previous information and experience about the parameters, in terms of the prior distribution of the parameters; (2) biological information, via the stochastic system equations of the stochastic system; and (3) information from the observed data, via the statistical model of the system.
To illustrate, let X = {X(1), . . . , X(tM)} be the collection of all state variables, Θ the collection of all unknown parameters, and Y = {Y(t1), . . . , Y(tk)} (0 ≤ t1 < · · · < tk ≤ tM) the collection of all vectors of observed data sets. Let P(Θ) be the prior distribution of the parameters Θ, P(X | Θ) the conditional probability density of X given the parameters Θ, and P(Y | X, Θ) the conditional probability density of Y given X and Θ. Then the joint probability density function of (X, Y, Θ) is

P(Θ, X, Y) = P(Θ) P(X | Θ) P(Y | X, Θ).

From this, one derives the conditional probability density function P(X | Θ, Y) of X given (Θ, Y) and the conditional probability density function P(Θ | X, Y) of Θ given (X, Y), respectively, as:

P(X | Θ, Y) ∝ P(X | Θ) P(Y | X, Θ);   (1)
P(Θ | X, Y) ∝ P(Θ) P(X | Θ) P(Y | X, Θ).   (2)
Given these conditional distributions, one may then use the multi-level Gibbs sampler method (Liu and Chen, 1998; Shephard, 1994) to estimate Θ and X simultaneously. The multi-level Gibbs sampler is a Monte Carlo method for estimating P(X | Y) (the conditional density function of X given Y) and P(Θ | Y) (the posterior density function of Θ given Y) through a sequential procedure that draws from P(X | Θ, Y) and P(Θ | X, Y) alternately. The algorithm iterates through the following loop:
(1) Given Θ(∗) and Y, generate X(∗) from P(X | Y, Θ(∗)).
(2) Generate Θ(∗) from P(Θ | Y, X(∗)), where X(∗) is the value obtained in (1).
(3) Using the Θ(∗) obtained in (2) as the initial value, go back to (1) and repeat the (1)–(2) loop until convergence.
At convergence, this procedure yields random samples of X from the conditional density P(X | Y) of X given Y, independently of Θ, and random samples of Θ from the posterior density P(Θ | Y) of Θ given Y, independently of X. Repeating these procedures, we generate a random sample of size n of X and a random sample of size m of Θ. One may then use the sample means as estimates of X and Θ and the sample variances as the variances of these estimates. The convergence of these procedures is proved by using the basic theory of homogeneous Markov chains; see (Tan, 2002, Chapter 3).
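As a concrete illustration of this two-block loop, the following minimal sketch (our own, in Python; the two conditional samplers are toy Gaussian stand-ins, not the tuberculosis model developed below) alternates draws from P(X | Θ, Y) and P(Θ | X, Y) and averages the post-burn-in draws, in the spirit of steps (1)–(3) above.

```python
# Minimal two-block Gibbs sampler sketch (toy conditional densities only).
import numpy as np

rng = np.random.default_rng(0)

def draw_state_given(theta, y):
    # stand-in for a draw from P(X | Theta, Y)
    return rng.normal(loc=(theta + y) / 2.0, scale=0.5)

def draw_params_given(x, y):
    # stand-in for a draw from P(Theta | X, Y)
    return rng.normal(loc=(x + y) / 2.0, scale=0.5)

def gibbs(y, theta0, n_burn=500, n_keep=500):
    theta, xs, thetas = theta0, [], []
    for it in range(n_burn + n_keep):
        x = draw_state_given(theta, y)      # step (1): X ~ P(X | Y, Theta)
        theta = draw_params_given(x, y)     # step (2): Theta ~ P(Theta | Y, X)
        if it >= n_burn:                    # step (3): keep post-convergence draws
            xs.append(x)
            thetas.append(theta)
    # sample means as estimates, sample variances as their variances
    return np.mean(xs), np.mean(thetas), np.var(xs), np.var(thetas)

print(gibbs(y=1.0, theta0=0.0))
```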
3. Stochastic modeling of the birth–death–immigration–illness–cure processes

Consider a population of individuals who are at risk for a disease and suppose that some drugs are available to treat the disease. One example is tuberculosis. In this population there are then two types of people: normal healthy people (denoted by N1) who do not have the disease and sick people (denoted by N2) who have contracted the disease. When the population is at risk for the disease, normal people may contract the disease and become sick via contacts with sick people or disease agents. The sick people may die from the disease or from other competing causes, whereas normal healthy people can only die from causes other than the disease; sick people may also be cured by
drugs to become normal people. Besides death, suppose that there are births as well as immigration in the population. Then the stochastic system involves birth, death, immigration, illness and cure of the disease. Let Ni(t) (i = 1, 2) denote the number of Ni (i = 1, 2) people in the population at time t and put X(t) = {N1(t), N2(t)}′. Under treatment by a drug, {X(t), t ≥ 0} is a birth–death–immigration–illness–cure process. This is a two-dimensional stochastic process with continuous time T = [0, ∞) and discrete state space S = {Ni(t) = ji, i = 1, 2, ji = 0, 1, . . .} with initial conditions {N1(0) = n1, N2(0) = n2}. The initial condition is equivalent to assuming that there are n1 healthy people at time 0 who are at risk for the disease and n2 sick people at time 0. If the numbers of {Ni(t), i = 1, 2} or of N(t) = N1(t) + N2(t) are observed at times tj, j = 1, . . . , n, then one can construct a state space model for this process. For this state space model, the stochastic system model is the stochastic model given by the stochastic process X(t), whereas the observation model is a statistical model based on the observed numbers of Ni(t), i = 1, 2, or on N(t).

3.1. The traditional Kolmogorov approach

Denote by Pij(t) = P{N1(t) = i, N2(t) = j | N1(0) = n1, N2(0) = n2} and let φ(x, y; t) be the probability generating function (PGF) of the Pij(t). Let {bi(t), di(t)} denote the birth and death rates of the Ni people at time t, α1(t) the disease rate and α2(t) the cure rate at time t, respectively. Then d2(t) = d1(t) + µ(t), and α1(t) and α2(t) are the transition rates of N1 → N2 and N2 → N1 at time t, respectively. Assume that the number Ri(t) of immigrants of Ni people during [t, t + Δt) follows a Poisson distribution with mean Ni(t)δi(t)Δt + o(Δt). Then it can be shown that φ(x, y; t) satisfies the following Kolmogorov forward equation:

∂φ(x, y; t)/∂t = [x(x − 1)b1(t) + (x − 1)(δ1(t) − d1(t)) + (y − x)α1(t)] ∂φ(x, y; t)/∂x
              + [y(y − 1)b2(t) + (y − 1)(δ2(t) − d2(t)) + (x − y)α2(t)] ∂φ(x, y; t)/∂y,   (3)
with initial condition φ(x, y; 0) = x^{n1} y^{n2}. The solution of the above equation is extremely difficult, if not impossible, to obtain. Hence this approach will not yield any useful results, especially in the non-homogeneous cases. We thus seek an equivalent alternative approach based on stochastic differential equations.

3.2. The stochastic differential equations for the state variables

The traditional approach via Markov theory is extremely difficult and can hardly yield useful results. Hence we use an equivalent alternative approach through stochastic differential equations. Assume that during the time interval [t, t + Δt),
the birth–death–illness–cure processes follow multinomial distributions with parameters {Ni(t); bi(t)Δt, di(t)Δt, αi(t)Δt} and the immigration processes are Poisson processes with means λi(t)Δt. Then, through the method of generating functions, it can readily be shown that the stochastic differential equation approach is equivalent to the classical Markov theory approach; see (Tan, 2002; Tan and Chen, 1998). To derive stochastic differential equations for the state variables Ni(t), observe that X(t + Δt) derives from X(t) stochastically through the birth–death–immigration process and the illness–cure process. Hence this transition is characterized by the following transition variables:

Ri(t) = number of immigrants of Ni people during [t, t + Δt);
Bi(t) = number of births of Ni people during [t, t + Δt);
F1(t) = number of normal healthy people who become sick during [t, t + Δt);
F2(t) = number of sick people who are cured by the drug during [t, t + Δt);
Di(t) = number of deaths of Ni people during [t, t + Δt), i = 1, 2.

Then, conditional on Ni(t), the conditional probability distributions of Ri(t) and {Bi(t), Di(t), Fi(t)} are given respectively by:

Ri(t) | Ni(t) ∼ Poisson with mean Ni(t)λi(t)Δt, i = 1, 2;
[Bi(t), Di(t), Fi(t)] | Ni(t) ∼ Multinomial{Ni(t); bi(t)Δt, di(t)Δt, αi(t)Δt}, i = 1, 2.

Given X(t), conditionally {Ri(t), i = 1, 2} and {[Bj(t), Dj(t), Fj(t)], j = 1, 2} are distributed independently of one another. For i = 1, 2, define εi(t) by

εi(t)Δt = [Ri(t) − Ni(t)λi(t)Δt] + [Bi(t) − bi(t)Ni(t)Δt] − [Fi(t) − αi(t)Ni(t)Δt] − [Di(t) − di(t)Ni(t)Δt].   (4)

Then one has the following stochastic differential equations for Ni(t), i = 1, 2:

ΔN1(t) = N1(t + Δt) − N1(t) = R1(t) + F2(t) + B1(t) − F1(t) − D1(t)
       = {[λ1(t) + b1(t) − d1(t) − α1(t)]N1(t) + α2(t)N2(t)}Δt + ε1(t)Δt,   (5)

ΔN2(t) = N2(t + Δt) − N2(t) = R2(t) + F1(t) + B2(t) − F2(t) − D2(t)
       = {[λ2(t) + b2(t) − d2(t) − α2(t)]N2(t) + α1(t)N1(t)}Δt + ε2(t)Δt.   (6)
In Eqs. (5) and (6), the random noises εj(t), j = 1, 2, have expectation zero and are uncorrelated with the state variables {Ni(t), i = 1, 2}. The variances and covariances of these random noises are easily obtained as Cov{εi(t)Δt, εj(t)Δt} = Qij(t)Δt + o(Δt), where

Q11(t) = E{[λ1(t) + b1(t) + d1(t) + α1(t)]N1(t) + α2(t)N2(t)},
Q22(t) = E{[λ2(t) + b2(t) + d2(t) + α2(t)]N2(t) + α1(t)N1(t)},
Q12(t) = −E{Σ_{i=1}^{2} αi(t)Ni(t)}.
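The transition variables above translate directly into a forward simulation. The following sketch is our own illustration (not part of the original chapter); it takes Δt = 1, ignores births, and borrows the parameter values listed under Table 3 later in the chapter to generate one path of {N1(t), N2(t)} by drawing the Poisson immigration counts and the multinomial death/illness/cure counts.

```python
# Forward simulation of the birth-death-immigration-illness-cure process
# using the multinomial/Poisson transition variables of Section 3.2 (Delta t = 1).
import numpy as np

rng = np.random.default_rng(1)

def simulate_path(n1_0, n2_0, lam, b, d, alpha, t_max):
    """Simulate N1(t), N2(t) for t = 0, ..., t_max (illustrative parameters only)."""
    N1, N2 = [n1_0], [n2_0]
    for _ in range(t_max):
        n1, n2 = N1[-1], N2[-1]
        # immigration: Poisson with mean N_i(t) * lambda_i
        R1, R2 = rng.poisson(n1 * lam[0]), rng.poisson(n2 * lam[1])
        # births, deaths, illness (N1 -> N2) among the n1 healthy people
        B1, D1, F1, _ = rng.multinomial(n1, [b[0], d[0], alpha[0],
                                             1 - b[0] - d[0] - alpha[0]])
        # births, deaths, cure (N2 -> N1) among the n2 sick people
        B2, D2, F2, _ = rng.multinomial(n2, [b[1], d[1], alpha[1],
                                             1 - b[1] - d[1] - alpha[1]])
        N1.append(n1 + R1 + B1 + F2 - F1 - D1)   # transition of Eq. (5)
        N2.append(n2 + R2 + B2 + F1 - F2 - D2)   # transition of Eq. (6)
    return np.array(N1), np.array(N2)

N1, N2 = simulate_path(1000, 10, lam=(0.05, 0.03), b=(0.0, 0.0),
                       d=(0.04, 0.10), alpha=(0.2, 0.4), t_max=90)
print(N1[-1], N2[-1])
```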
4. A state space model for the birth–death–immigration–illness–cure processes

The state space model of a system is a stochastic model consisting of two sub-models: the stochastic system model, which is the stochastic model of the system, and the observation model, which is the statistical model relating available data from the system to the model. For the birth–death–immigration–illness–cure process, the state variables are {Ni(t), i = 1, 2} and the stochastic system model is represented by the stochastic differential equations (5), (6). Assume that the number of sick people and/or the number of healthy people have been counted at times tj, j = 1, . . . , n. Then one can construct a statistical model, and hence the observation model, based on these observed numbers. Applying the MCMC procedures given above, one may then estimate the unknown parameters and the state variables as well as the survival probabilities.

4.1. The stochastic system model and the probability distribution of the state variables

The stochastic system model is represented by the stochastic differential equations (5), (6) for the state variables {Ni(t), i = 1, 2}. Discretize the time scale by letting Δt ∼ 1 correspond to a small time interval such as one month. Then, by using the multinomial distributions for the birth–death–illness–cure processes and the Poisson processes for the immigration process, one may readily derive the probability distribution of the state variables. This probability distribution will be used through the multi-level Gibbs sampling method to estimate the state variables, the unknown parameters and the survival probabilities. Let tM be the current time or the most recent time of interest. Then the collection of state variables is X = {X(t), t = 0, 1, . . . , tM}. Assume that the time interval [0, tM) is partitioned into k non-overlapping sub-intervals Ls = [ts−1, ts), s = 1, . . . , k, with t0 = 0 and tk = tM, and that {λi(t) = λi(s), bi(t) = bi(s), di(t) = di(s), αi(t) = αi(s), i = 1, 2, if t ∈ Ls}. Then the collection of parameters is Θ = {Θs, s = 1, . . . , k}, where Θs = {λi(s), bi(s), di(s), αi(s), i = 1, 2}, and the conditional density of X given Θ is

P{X | Θ} = P{X(0) | Θ} ∏_{j=1}^{tM} P{X(j) | X(j − 1), Θ}.   (7)
Let gi(j; t) denote the density of Ri(t). Then P{X(t + 1) | X(t), Θ} is given by:

P{X(t + 1) | X(t), Θ} = Σ_{i=0}^{N1(t)} Σ_{j=0}^{N2(t)} C(N1(t), i) C(N2(t), j) [α1(t)]^i [α2(t)]^j h1(i, j; t) h2(i, j; t),   (8)

where C(n, m) denotes the binomial coefficient "n choose m",

h1(i, j; t) = Σ_{k=0}^{N1(t+1)−N1(t)+i−j} g1(k; t) Σ_{r=0}^{N1(t)−i} C(N1(t) − i, r) C(N1(t) − i − r, ξ1(t)) [b1(t)]^r [d1(t)]^{ξ1(t)} [1 − α1(t) − b1(t) − d1(t)]^{ξ2(t)},

with ξ1(t) = N1(t) − N1(t + 1) − i + j + k + r and ξ2(t) = N1(t + 1) − j − k − 2r, and

h2(i, j; t) = Σ_{k=0}^{N2(t+1)−N2(t)−i+j} g2(k; t) Σ_{r=0}^{N2(t)−j} C(N2(t) − j, r) C(N2(t) − j − r, σ1(t)) [b2(t)]^r [d2(t)]^{σ1(t)} [1 − α2(t) − b2(t) − d2(t)]^{σ2(t)},
where σ1(t) = N2(t) − N2(t + 1) + i − j + k + r and σ2(t) = N2(t + 1) − i − k − 2r.

4.2. The observation model

For the observation model, assume that observed numbers of Ni people at times {tj, j = 1, . . . , n} are available. Let Yi(j) be the observed number of Ni people at time tj. Assume that [Yi(j) − Ni(tj)]/√Ni(tj) is normal with mean 0 and variance σi², independently for i = 1, 2, j = 1, . . . , n. The observation model is then represented by the statistical model

Yi(j) = Ni(tj) + [Ni(tj)]^{1/2} εi(j),   i = 1, 2, j = 1, . . . , n,   (9)

where the εi(j) are independently distributed as normal with mean 0 and variance σi². Thus, the conditional likelihood function of Θ given X is

L(Θ | X) = ∏_{i=1}^{2} ∏_{j=1}^{n} fi(Yi(j); Ni(tj), σi²),

where

fi(x; Ni(tj), σi²) = [σi √(2π Ni(tj))]^{−1} exp{−[x − Ni(tj)]² / [2σi² Ni(tj)]}.
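For reference, the observation-model likelihood just displayed is straightforward to evaluate numerically. The sketch below is our own helper (with made-up counts): it computes the log of ∏_j fi(Yi(j); Ni(tj), σi²) for one series i.

```python
# Log-likelihood of the observation model: Gaussian errors with variance
# proportional to N_i(t_j), as in Eq. (9).
import numpy as np

def obs_loglik(y, n_state, sigma2):
    """y[j]: observed count Y_i(j); n_state[j]: the state N_i(t_j)."""
    y, n_state = np.asarray(y, float), np.asarray(n_state, float)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma2 * n_state)
                        - (y - n_state) ** 2 / (2.0 * sigma2 * n_state)))

# toy usage with made-up counts
print(obs_loglik(y=[27500, 25400], n_state=[27400.0, 25600.0], sigma2=1.0))
```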
5. The multi-level Gibbs sampling procedures for the birth–death–immigration–illness–cure processes

To implement the multi-level Gibbs sampling method, denote by U(t) = {Fi(t), Ri(t), Bi(t), i = 1, 2} and put U = {U(t), t = 0, 1, . . . , tM − 1}. Then, given X(0) and Θ,

P{X, U | X(0)} = ∏_{i=1}^{tM} P{X(i) | X(i − 1), U(i − 1)} P{U(i − 1) | X(i − 1)}.   (10)
By the above distribution results, with i ≠ j, i, j = 1, 2:

P{U(t) | X(t)} = C1(t) ∏_{i=1}^{2} gi(Ri(t); t) [αi(t)]^{Fi(t)} [bi(t)]^{Bi(t)} [1 − αi(t) − bi(t)]^{Ni(t)−Fi(t)−Bi(t)},   (11)

where

C1(t) = ∏_{i=1}^{2} C(Ni(t), Fi(t)) C(Ni(t) − Fi(t), Bi(t));
also,

P{X(t + 1) | X(t), U(t)} = ∏_{i=1}^{2} C(Ni(t) − Fi(t) − Bi(t), ηi(t)) [di(t)/(1 − αi(t) − bi(t))]^{ηi(t)} [1 − di(t)/(1 − αi(t) − bi(t))]^{ζi(t)},
where ηi(t) = Ni(t) − Ni(t + 1) − Fi(t) + Fj(t) + Ri(t) + Bi(t), i ≠ j, i, j = 1, 2, and ζi(t) = Ni(t + 1) − Fj(t) − Ri(t) − 2Bi(t), i ≠ j, i, j = 1, 2. These distribution results indicate that, given the parameters and given X, one can readily generate U; similarly, given the parameters and given U, one can readily generate X. Let Y = {Yi(j), i = 1, 2, j = 1, . . . , n}. Then the joint density of {X, U, Y} given the parameters Ω = {Θ, σi², i = 1, 2} is:

P{X, U, Y | Ω} = P{X(0) | Θ} P{U(0) | X(0)} ∏_{j=1}^{n} { [∏_{i=1}^{2} fi(Yi(j); Ni(tj), σi²)] ∏_{r=t_{j−1}+1}^{t_j} P{X(r) | U(r − 1), X(r − 1)} P{U(r − 1) | X(r − 1)} }.   (12)
Let P{Ω} = P{Θ}P{σi², i = 1, 2} be the density of the prior distribution of Ω. Then the density of the conditional posterior distribution of Ω given {X, U, Y} is

P{Ω | X, U, Y} ∝ P{Ω} ∏_{i=1}^{2} (σi²)^{−n/2} exp(−Si/σi²) ∏_{s=1}^{k} exp{−m(s)λi(s)} [λi(s)]^{R̂i(s)} [αi(s)]^{F̂i(s)} [bi(s)]^{B̂i(s)} [di(s)]^{η̂i(s)} [1 − αi(s) − bi(s) − di(s)]^{ζ̂i(s)},   (13)
where m(s) = ts − ts−1 and, for i = 1, 2,

Si = (1/2) Σ_{j=1}^{n} [Yi(j) − Ni(tj)]² / Ni(tj),

R̂i(s) = Σ_{t=ts−1+1}^{ts} Ri(t),   F̂i(s) = Σ_{t=ts−1+1}^{ts} Fi(t),   B̂i(s) = Σ_{t=ts−1+1}^{ts} Bi(t),

η̂i(s) = Σ_{t=ts−1+1}^{ts} ηi(t),   ζ̂i(s) = Σ_{t=ts−1+1}^{ts} ζi(t).
Using the above distribution results, the multi-level Gibbs sampling procedures for estimating the unknown parameters Ω and the state variables X are given by the following loop:
(1) Combining a large sample from P{U | X, Ω} for given X with P{Y | Ω, X} through the weighted bootstrap method of Smith and Gelfand (1992), generate U (denote the generated sample by U(∗)) from P{U | Ω, X, Y}, even though the latter density is unknown.
(2) Combining a large sample from P{X | U(∗), Ω} with P{Y | Ω, X} through the same weighted bootstrap method, generate X (denote the generated sample by X(∗)) from P{X | Ω, U(∗), Y}, even though the latter density is unknown.
(3) Substituting the {U(∗), X(∗)} generated in the above two steps and assuming non-informative uniform priors, generate Θ from the conditional density P{Ω | X(∗), U(∗), Y} given by Eq. (13).
(4) Substituting the X(∗) generated in Step 2 and the Ω generated in Step 3, go back to Step 1 and repeat the above loop until convergence.
At convergence, one then generates a random sample of X from the conditional distribution P{X | Y} of X given Y, independent of U and Θ, and a random sample of Θ from the posterior distribution P{Θ | Y} of Θ given Y, independent of {X, U}. Repeating these procedures, one then generates a random sample of size N of X and a random sample of size M of Θ. One may then use the sample means to derive the estimates of X and Θ and use the sample variances as the variances of these estimates. Alternatively, one may also use Efron's bootstrap method (Efron, 1982) to derive estimates of the standard errors of the estimates.
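Steps (1) and (2) rely on the weighted bootstrap of Smith and Gelfand (1992). A generic sampling–importance–resampling step is sketched below; the proposal sampler and log-weight function here are illustrative placeholders, not the model's actual conditional densities.

```python
# Generic weighted-bootstrap (sampling-importance-resampling) step in the spirit
# of Smith and Gelfand (1992): draw from a proposal, weight, then resample.
import numpy as np

rng = np.random.default_rng(3)

def weighted_bootstrap(draw_proposal, log_weight, n_draws=5000):
    """Draw n_draws values from the proposal, weight them by the (unnormalized)
    target/proposal ratio, and resample one value in proportion to the weights."""
    draws = np.array([draw_proposal() for _ in range(n_draws)])
    logw = np.array([log_weight(d) for d in draws])
    w = np.exp(logw - logw.max())          # stabilize before normalizing
    return draws[rng.choice(n_draws, p=w / w.sum())]

# toy usage: proposal N(0,1), weight proportional to an N(2,1) likelihood
sample = weighted_bootstrap(lambda: rng.normal(),
                            lambda d: -0.5 * (d - 2.0) ** 2)
print(sample)
```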
6. The survival probabilities of normal and sick people

Let Si(t) denote the survival probability that an Ni person at time 0 will survive at least t months when the population is at risk for the disease. Then, under the conditions given in Section 3, one has, to order o(Δt),

Si(t + Δt) = αi(t)Δt Sj(t) + [1 − di(t)Δt − αi(t)Δt] Si(t),   for all i ≠ j, i, j = 1, 2.

It follows that the Si(t) satisfy the following system of equations:

dS1(t)/dt = α1(t)S2(t) − [α1(t) + d1(t)]S1(t),   (14)
dS2(t)/dt = α2(t)S1(t) − [α2(t) + d2(t)]S2(t),   (15)

with initial condition {Si(0) = 1, i = 1, 2}. Let S(t) = {S1(t), S2(t)}′. Then, in matrix notation,

dS(t)/dt = −F(t)S(t),

where

F(t) = ( α1(t) + d1(t)   −α1(t)
         −α2(t)          α2(t) + d2(t) ).

The initial condition is S(0) = 1₂, a 2 × 1 column of 1's.
∼
m−1
e−F (m−i)ξm−i 1 2 ,
i=1
if t ∈ [tm−1 , tm ), m = 1, . . . , k + 1,
∼
(16)
State space models for survival analysis
529
where ξi = ti − ti−1 and the e−F (i)x are matrix exponential functions defined by e−F (i)x =
∞ (−1)j j =0
j!
j F (i) x j .
(17)
Now the eigenvalues of F(t) are λi(t) = ½{α1(t) + α2(t) + d1(t) + d2(t) ± w(t)}, where w(t) = {[α1(t) + α2(t) + d1(t) − d2(t)]² + 4α2(t)[d2(t) − d1(t)]}^{1/2}. Let

Ei(t) = [1/(λi(t) − λj(t))] [F(t) − λj(t)I2],   i ≠ j, i, j = 1, 2.

Then e^{−F(t)x} = Σ_{i=1}^{2} e^{−λi(t)x} Ei(t). Hence, for t ∈ [tm−1, tm), m = 1, . . . , k + 1:

S(t) = [e^{−λ1(m)(t − tm−1)} E1(m) + e^{−λ2(m)(t − tm−1)} E2(m)] ∏_{i=1}^{m−1} [e^{−λ1(m−i)ξm−i} E1(m − i) + e^{−λ2(m−i)ξm−i} E2(m − i)] 1₂.   (18)
If further αi(t) = αi and di(t) = di for i = 1, 2, then the solutions of Eqs. (14), (15) are given by:

S1(t) = (1/ω)[(d1 − λ2)e^{−λ1 t} + (λ1 − d1)e^{−λ2 t}],   (19)
S2(t) = (1/ω)[(d2 − λ2)e^{−λ1 t} + (λ1 − d2)e^{−λ2 t}],   (20)

where {ω(t) = ω, λi(t) = λi}.
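Numerically, S(t) in Eq. (16) is convenient to evaluate with a matrix exponential rather than the spectral form (18). The sketch below is our own code; the rate values in the usage line are merely illustrative, loosely in the spirit of Table 2, and the time unit is months with a change point at month 60.

```python
# Evaluate S1(t), S2(t) of Eqs. (14)-(16) for piecewise-constant alpha_i, d_i
# by multiplying segment-wise matrix exponentials.
import numpy as np
from scipy.linalg import expm

def survival_probs(t, breaks, alphas, ds):
    """breaks: change points t1 < ... < t_k; alphas[s] = (alpha1, alpha2) and
    ds[s] = (d1, d2) hold on the s-th interval. Returns the vector S(t)."""
    edges = [0.0] + list(breaks) + [np.inf]
    S = np.ones(2)                                 # S(0) = (1, 1)'
    for s in range(len(edges) - 1):
        lo, hi = edges[s], edges[s + 1]
        if t <= lo:
            break
        a1, a2 = alphas[s]
        d1, d2 = ds[s]
        F = np.array([[a1 + d1, -a1], [-a2, a2 + d2]])
        S = expm(-F * (min(t, hi) - lo)) @ S       # segment-wise exponential, Eq. (16)
    return S

print(survival_probs(24.0, breaks=[60.0],
                     alphas=[(0.0000455, 0.434), (0.0000456, 0.434)],
                     ds=[(0.0008, 0.0038), (0.0008, 0.0038)]))
```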
7. Some illustrative examples

To illustrate the above methods, consider tuberculosis (TB), a disease which is curable by drugs. Given in Table 1 are the numbers of TB cases in the US from 1980 to 1992 reported by the CDC, together with the total US population sizes over these years (CDC Report, 1993). In this data set, it is clear that the curve of TB cases in the US declines to its lowest level in 1985 and then increases, presumably due to the effects of HIV (CDC Report, 1993). To fit these data, we thus assume α1(t) = α1(1) before January 1985 and α1(t) = α1(2) after January 1985, and assume that the other parameters are not affected by HIV and other factors. Because TB is rare in children, we may ignore birth, so that the unknown parameters are Θ = {λi, di, αi(1), αi(2), i = 1, 2}. Let t0 = 0 denote January 1980, so that N1(0) = 226517805 and N2(0) = 28000. Using the data given in Table 1, one may readily apply the method of Section 5 to estimate the unknown parameters and the state variables. Because we do not have
Table 1
Observed numbers of total people, TB people and normal people

Time   Total people   TB people   Normal people
1980   226545805      28000       226517805
1981   228762212      27500       228734712
1982   230978619      25400       230953219
1983   233195025      24000       233171025
1984   235411432      22201       235389231
1985   237627839      22201       237605638
1986   239844246      23000       239821246
1987   242060653      23000       242037653
1988   244277059      23000       244254059
1989   246493466      23800       246469666
1990   248709873      25500       248684373
1991   250926280      26283       250899997
1992   253142687      26673       253116014
Fig. 1. The estimated and observed numbers of (a) normal people and (b) TB people. (—: estimates; –•–•–: observed.)
Table 2
Estimates of parameters and standard errors

Parameter   Mean          Standard error   Minimum       Maximum
λ1          0.0014990     2.0113312E−7     0.0014986     0.0014996
λ2          0.0011764     0.000017760      0.0011363     0.0012171
d1          0.000803630   1.4315684E−7     0.000803242   0.000803936
d2          0.0037722     0.000027462      0.0037019     0.0038617
α1(1)       0.000045527   5.0690768E−7     0.000040545   0.000045720
α1(2)       0.000045570   4.4691119E−8     0.000045448   0.000045679
α2          0.4343895     0.000255445      0.4336948     0.4349698
Fig. 2. The survival probabilities of (a) normal people and (b) TB people, with Jan. 1980 = 0.
previous knowledge and data on US TB, to implement the procedures of Section 5 we assumed a non-informative uniform prior for the parameters. Using the method of Section 5 with bi(t) = 0 (i = 1, 2), the estimates of the parameters are given in Table 2, with the standard errors of the estimates obtained by using Efron's bootstrap method (Efron, 1982). In Figures 1(a), (b), the estimates of the state variables are plotted together with
Table 3
The generated numbers of normal (YN) and sick people (YI)

Time   YN     YI
0      1000   9
10     649    242
20     535    226
30     453    192
40     420    180
50     342    162
60     318    142
70     287    133
80     256    113
90     228    94

The parameter values for generating these data are: λ1 = 0.05, λ2 = 0.03, d1 = 0.04, d2 = 0.1, α1 = 0.2, α2 = 0.4, N(0) = 1000, I(0) = 10.
Fig. 3. The estimated and observed numbers of (a) normal people and (b) sick people. (—: estimates; –•–•–: observed.)
Table 4
Estimates of parameters and standard errors

Parameter   Mean        Standard error   Minimum     Maximum
λ1          0.0407253   0.0014776        0.0376655   0.0447270
λ2          0.0262105   0.0016861        0.0218033   0.0314858
d1          0.0517746   0.0017989        0.0475090   0.0561725
d2          0.1395334   0.0040820        0.1293103   0.1522202
α1          0.2198501   0.0034795        0.2123125   0.2280998
α2          0.3802137   0.0054955        0.3652281   0.3927149
Fig. 4. The estimated survival probabilities of (a) normal people and (b) sick people.
the respective observed numbers. The estimates of the survival probabilities of normal people and sick people are plotted in Figures 2(a), (b). From Figure 1, the estimated numbers of the Ni people are apparently close to the observed numbers. From the results in Table 2, it turns out that the estimate of α1(2) is only slightly greater than that of α1(1), but the standard error of the estimate of α1(2) is 10 times smaller than that of the estimate
of α1(1), indicating that HIV and/or other causes have increased the infection rate of TB slightly.
To further examine the approach, we have assumed some parameter values and generated some Monte Carlo data by computer. The generated numbers are given in Table 3, and the parameter values for generating these data are {λ1 = 0.05, λ2 = 0.03, d1 = 0.04, d2 = 0.1, α1 = 0.2, α2 = 0.4, N1(0) = 1000, N2(0) = 10}. As in the TB example, we have ignored birth, so that we are restricting ourselves to adults. Using the data in Table 3 and assuming a non-informative uniform prior for the parameters, we have applied the procedures of Section 5 to estimate the unknown parameters and the state variables. Given in Table 4 are the estimates of the unknown parameters. Plotted in Figures 3 and 4 are the estimates of the state variables together with the generated numbers, and the estimates of the survival probabilities, respectively. From the results in Table 4, the estimates are apparently very close to their true values. From Figure 3, it is also apparent that the estimates of the state variables are very close to the generated numbers. These results indicate that the methods proposed in this article are quite promising and useful.
8. Conclusions

In this article, we have developed a state space model for the birth–death–immigration–illness–cure process. We have developed a generalized Bayesian method to estimate the unknown parameters and the state variables, and hence the survival probabilities. The numerical examples indicate that the methods are useful and promising. Of course, more studies are needed to further confirm the usefulness of the method and to check its efficiency. In the past 5 years, we have developed state space models for cancers and for AIDS (Tan and Chen, 1998; Tan et al., 1999; Tan et al., 2001; Tan et al., 2000; Tan et al., 2002; Tan and Xiang, 1998; Tan and Xiang, 1998; Tan and Xiang, 1998; Tan and Xiang, 1999; Tan and Xiang, 1999; Tan and Ye, 2000; Tan and Ye, 2000; Wu and Tan, 1995; Wu and Tan, 2000). The present article extends this modeling approach to other human diseases such as tuberculosis. This type of modeling approach is definitely useful for other diseases, such as heart disease, and for risk assessment of environmental agents as well. In this respect, more research work is needed.
References

Aoki, M. (1990). State Space Modeling of Time Series, 2nd edn. Springer, New York.
Brookmeyer, R., Gail, M.H. (1994). AIDS Epidemiology: A Quantitative Approach. Oxford University Press, Oxford.
Catlin, D.E. (1989). Estimation, Control and Discrete Kalman Filter. Springer, New York.
CDC Report (1993). Tuberculosis morbidity – United States, 1992. MMWR Morb. Mortal. Wkly Rep. 42, 696–697, 703–704.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia, PA.
Gelb, A. (1974). Applied Optimal Estimation. MIT Press, Cambridge, MA.
Harvey, A.C. (1994). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge.
Kalman, R.E. (1960). A new approach to linear filter and prediction problems. J. Basic Engrg. 82, 35–45.
Liu, J.S., Chen, R. (1998). Sequential Monte Carlo method for dynamic systems. J. Amer. Statist. Assoc. 93, 1032–1044.
Sage, A.P., Melsa, J.L. (1971). Estimation Theory with Application to Communication and Control. McGraw-Hill, New York.
Shephard, N. (1994). Partial non-Gaussian state space. Biometrika 81, 115–131.
Smith, A.F.M., Gelfand, A.E. (1992). Bayesian statistics without tears: A sampling–resampling perspective. Amer. Statist. 46, 84–88.
Tan, W.Y. (2000). Stochastic Modeling of AIDS Epidemiology and HIV Pathogenesis. World Scientific, New Jersey.
Tan, W.Y. (2002). Stochastic Models with Applications to Genetics, Cancers, AIDS and Other Biomedical Systems. World Scientific, New Jersey.
Tan, W.Y., Chen, C.W. (1998). Stochastic modeling of carcinogenesis: Some new insight. Math. Comput. Modelling 28, 49–71.
Tan, W.Y., Chen, C.W., Wang, W. (1999). Some state space models of carcinogenesis. In: Anderson, J.G., Katzper, M. (Eds.), Proceedings of 1999 Medical Science Simulation. The Society for Computer Simulation, San Diego, CA, pp. 183–189.
Tan, W.Y., Chen, C.W., Wang, W. (2001). Stochastic modeling of carcinogenesis by state space models: A new approach. Math. Comput. Modelling 33, 1323–1345.
Tan, W.Y., Chen, C.W., Wang, W. (2000). A generalized state space model of carcinogenesis. Paper presented at the 2000 International Biometric Conference, UC Berkeley, CA, July 2–7, 2000.
Tan, W.Y., Chen, C.W., Zhu, J.H. (2002). Estimation of parameters in carcinogenesis models via state space models. Paper presented at the Eastern and Northern Biometric Society meeting, March 15–17, 2002, Arlington, VA.
Tan, W.Y., Xiang, Z.H. (1998). State space models of the HIV epidemic in homosexual populations and some applications. Math. Biosci. 152, 29–61.
Tan, W.Y., Xiang, Z.H. (1998). Estimating and predicting the numbers of T cells and free HIV by non-linear Kalman filter. In: DasGupta (Ed.), Artificial Immune Systems and Their Applications. Springer, Berlin, pp. 115–138.
Tan, W.Y., Xiang, Z.H. (1998). State space models for the HIV pathogenesis. In: Horn, M.A., Simonett, G., Webb, G. (Eds.), Mathematical Models in Medicine and Health Sciences. Vanderbilt University Press, Nashville, TN, pp. 351–368.
Tan, W.Y., Xiang, Z.H. (1999). Modeling the HIV epidemic with variable infection in homosexual populations by state space models. J. Statist. Plann. Inference 78, 71–87.
Tan, W.Y., Xiang, Z.H. (1999). A state space model of HIV pathogenesis under treatment by anti-viral drugs in HIV-infected individuals. Math. Biosci. 156, 69–94.
Tan, W.Y., Ye, Z.Z. (2000). Estimation of HIV infection and HIV incubation via state space models. Math. Biosci. 167, 31–50.
Tan, W.Y., Ye, Z.Z. (2000). Some state space models of HIV epidemic and applications for the estimation of HIV infection and HIV incubation. Comm. Statist. Theory Methods 29, 1059–1088.
Wu, H., Tan, W.Y. (1995). Modeling the HIV epidemic: A state space approach. In: ASA 1995 Proceedings of the Epidemiology Section. ASA, Alexandria, VA, pp. 66–71.
Wu, H., Tan, W.Y. (2000). Modeling the HIV epidemic: A state space approach. Math. Comput. Modelling 32, 197–215.
Ch. 30. First Hitting Time Models for Lifetime Data
Mei-Ling Ting Lee and George A. Whitmore
Handbook of Statistics, Vol. 23 (ISSN 0169-7161), © 2004 Elsevier B.V. All rights reserved. DOI: 10.1016/S0169-7161(03)23030-3
1. Introduction

Previous research in several fields dealing with time-to-event data has considered models in which an event occurs when the sample path of a stochastic process first satisfies a specified condition. The time until this occurrence is a first hitting time. The idea of a first hitting time is quite broad. For example, in industrial quality control, the run time until a quality measure first reaches an out-of-control limit is a first hitting time and is a standard quantity considered in quality management. This review of first hitting time models focuses on a limited domain of the literature. It considers only models describing conventional survival or lifetime data. Nevertheless, these models find broad application, spanning medicine (e.g., patient survival), engineering (e.g., equipment lifetime), economics (e.g., business bankruptcy) and the social sciences (e.g., divorce).
2. The basic first hitting time model

A general mathematical formulation of a first hitting time model considers a stochastic process in time, {X(t), t ≥ 0}, and an absorbing set H in the state space of the process. The first hitting time is the random variable S defined as follows:

S = inf{t: X(t) ∈ H}.   (1)

In other words, the first hitting time is the time until the stochastic process first enters or hits set H. The state space of the process {X(t)} may be one-dimensional or multidimensional. The process may have many defining properties, such as a continuous sample path, independent increments, stationarity, and so on. There may be no guarantee that the process will reach the absorbing set, i.e., P(S < ∞) = 1 − P(S = ∞) may be less than 1. Several authors have considered first hitting time models as general models for survival and time-to-event data. For examples in the health field, see Whitmore and Neufeldt (1970) and Eaton and Whitmore (1977).
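As a simple illustration of definition (1), one can approximate the distribution of S by simulating discretized sample paths and recording the first time each path enters H. The sketch below is our own toy example (a drifted Wiener path with H = {x: x ≥ a}; all parameter values are arbitrary).

```python
# Brute-force Monte Carlo approximation of a first hitting time distribution.
import numpy as np

rng = np.random.default_rng(2)

def first_hitting_time(a=1.0, mu=0.5, sigma=1.0, dt=0.01, t_max=20.0):
    """Return the first time the simulated path reaches the barrier a, or np.inf."""
    x, t = 0.0, 0.0
    while t < t_max:
        x += mu * dt + sigma * np.sqrt(dt) * rng.normal()
        t += dt
        if x >= a:                 # path has entered the absorbing set H
            return t
    return np.inf                  # no hit within the simulated horizon

samples = np.array([first_hitting_time() for _ in range(500)])
hit = np.isfinite(samples)
print("P(S < t_max) ~", hit.mean(), " mean hitting time ~", samples[hit].mean())
```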
3. Data for model estimation

Although mathematical modeling alone can be of interest, practical application usually entails estimating the model from data. The form of the data can be quite complicated, but the following elements are usually present.
(1) A sample of individual subjects or items, j = 1, . . . , n, whose processes {Xj(t)} are independent. The absorbing set may also vary by individual, denoted here by Hj for individual j.
(2) Observation pairs (tij, xij), i = 1, . . . , mj, j = 1, . . . , n, with t1j ≤ · · · ≤ tmj,j, and a failure indicator variable Ij for each individual j. Here the pair (tij, xij) denotes the ith longitudinal reading xij on the level of process {Xj(t)} at time tij for individual j. The indicator Ij is 1 if tmj,j = sj, the actual survival or life time of individual j, and is 0 if individual j is surviving at time tmj,j. Reading xij ∈ Hj only if i = mj and Ij = 1, i.e., only if the reading corresponds to the first hitting time of the individual.
(3) Where longitudinal data are not available, mj = 1 for all subjects and the data reduce to an observation pair (tj, xj) and failure indicator Ij for each individual j.
(4) Where the data consist only of time tj and failure indicator Ij for each individual j, the data reduce to censored survival data.
(5) Baseline covariate data can be represented by a vector zj for each individual j. Regression link functions relate the covariate vector zj to the parameters of the process {Xj(t)} and the absorbing set Hj. The link function will have some form θj = g(zj βθ), j = 1, . . . , n, for each parameter θ.

4. A Wiener process with an inverse Gaussian first hitting time

Some of the earliest uses of a first hitting time model for survival and time-to-event data employed a one-dimensional Wiener process and a fixed threshold or absorbing barrier. Examples of these early works include Whitmore and Neufeldt (1970), who looked at hospital stays for mental illness; Whitmore (1975, 1979, 1995), who looked at applications to hospital stay, labor turnover and equipment degradation; Lancaster (1972), who modeled lengths of labor strikes; Eaton and Whitmore (1977), who considered this model and other first hitting time models for hospital stays for mental illness; Doksum and Hoyland (1992), who apply the model to accelerated equipment testing; Lu (1995), who considered longitudinal degradation and censored survival data in modeling equipment failure; Doksum and Normand (1995), who use the model for biomarker data; and Whitmore and Schenkelberg (1997), who considered accelerated degradation data with a time-scale transformation.
The common conceptual framework for this early research involves a model in which physical degradation, health status or the like follows a Wiener process {X(t)} with mean parameter µx and variance parameter σxx. The process starts at X(0) = 0 and drifts at mean rate µx until it produces a terminal event (equipment failure, hospital discharge or the like) when the process first crosses a threshold a > 0 at time S. Henceforth, we use the terminology of a failure and failure threshold. The absorbing set is
therefore defined as the set of points at a or above, i.e., H = {x: x ≥ a}. When µx ≥ 0, the probability of eventually hitting the failure threshold is 1 and the time until failure S follows an inverse Gaussian distribution. If µx < 0, the probability of hitting the threshold is less than 1, specifically, P(S < ∞) = exp(2aµx). In this case, the conditional first hitting time S | S < ∞ remains inverse Gaussian distributed with mean and variance parameters |µx| and σxx, respectively. Note that the absolute value of the mean parameter applies in this conditional case.
The most common form of data considered in the early work was censored survival data. For this kind of data set, it is easy to formulate the sample likelihood function from a combination of inverse Gaussian density values for observed survival times and inverse Gaussian cumulative probability values for (progressively) censored observations. Both the inverse Gaussian density and cumulative distribution functions have explicit mathematical forms. As the first hitting time model has three parameters, namely, a, µx and σxx, while the inverse Gaussian distribution has only two parameters, one of the parameters can be set arbitrarily (for example, a = 1). Regression link functions for accommodating covariates were also introduced – see Whitmore (1983) for example. Less attention has been given to the situation where the data consist not only of censored survival data but also of readings on {X(t)} at the censoring times. This case is considered in Lu (1995) and, more recently, by Padgett and Tomlinson (2003).
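For censored survival data, the likelihood described above combines the first-hitting-time (inverse Gaussian) density for observed failures with its survival function for censored times. The following sketch is our own code, using the standard Wiener-process first-passage formulas with barrier a, drift µx and variance σxx; in the usage line a is fixed at 1, as suggested in the text, and the data values are made up.

```python
# Censored-data log-likelihood for the inverse Gaussian first hitting time.
import numpy as np
from scipy.stats import norm

def ig_fht_density(t, a, mu, sigma2):
    """First-passage density to barrier a > 0 for a Wiener process with drift mu."""
    return a / np.sqrt(2.0 * np.pi * sigma2 * t**3) * \
        np.exp(-(a - mu * t) ** 2 / (2.0 * sigma2 * t))

def ig_fht_survival(t, a, mu, sigma2):
    """P(S > t) for the same first hitting time."""
    s = np.sqrt(sigma2 * t)
    cdf = norm.cdf((mu * t - a) / s) + \
        np.exp(2.0 * a * mu / sigma2) * norm.cdf((-mu * t - a) / s)
    return 1.0 - cdf

def censored_loglik(times, events, a, mu, sigma2):
    """Density contributions for failures (event = 1), survival for censored (0)."""
    t, d = np.asarray(times, float), np.asarray(events, int)
    return float(np.sum(d * np.log(ig_fht_density(t, a, mu, sigma2)) +
                        (1 - d) * np.log(ig_fht_survival(t, a, mu, sigma2))))

# toy usage with a = 1 fixed (only two of the three parameters are identifiable)
print(censored_loglik([0.5, 1.2, 3.0], [1, 1, 0], a=1.0, mu=0.8, sigma2=1.0))
```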
5. A two-dimensional Wiener model for a marker and first hitting time

Whitmore et al. (1998), henceforth WCL, extend the basic Wiener first-hitting-time model. They consider the underlying stochastic process {X(t)} to be latent (i.e., unobservable) but assume the presence of an observable partner process {Y(t)}, called a marker process, that covaries with {X(t)} in time. Their specific model postulates that the two processes jointly form a two-dimensional Wiener diffusion process {X(t), Y(t)}. In their case application, they consider a model for equipment failure in which the degradation process {X(t)} is latent but a physical process {Y(t)}, related to the degradation process, is observable. An item fails when the degradation level first reaches a fixed threshold a > 0. The first passage time to the threshold defines the item's survival time. They also consider the problem of forming a composite marker process from a linear combination of two or more observable marker processes. The aim is to construct a composite marker that can mimic the latent degradation process. WCL only consider the case where a single marker value is recorded at the failure or censoring time for each item. Moreover, their model does not incorporate covariates.
Lee et al. (2000), henceforth LDS, extend the WCL methodology and consider general regression link functions for baseline conditions and other covariates. They apply their model in a medical setting. Specifically, they adopt the first-hitting-time model to describe data from the AIDS Clinical Trial Group (ACTG) 116A Study to illustrate the investigation of surrogacy of laboratory measurements in studies of treatment effects from anti-retroviral drugs.
Following the LDS notation, let {X(t)} denote a latent process that represents the unobservable health status of a subject at time t. Health status here can be interpreted as the converse of disease state. Thus, disease progression corresponds to health deterioration or degradation. {Yw(t)} represents an observed marker process that is correlated with the health status process {X(t)} and tracks its progress. LDS consider the two-dimensional Wiener diffusion process {X(t), Yw(t)}, for t ≥ 0, with initial values {X(0), Yw(0)}. The vector {X(t), Yw(t)} has a bivariate normal distribution with mean vector {X(0), Yw(0)} + tµ, where µ = (µx, µy), and covariance matrix tΣ, where

Σ = ( σxx   σxy
      σyx   σyy ).
Instead of fixing X(0) = 0 as in WCL, LDS assume that the subject’s initial health status is some positive number δ = X(0) > 0. This parameter δ is unknown and must be estimated. The failure threshold is set at zero on the health status scale so the closer X(t) is to zero, the sicker or more diseased is the subject. Failure (death or other clinical endpoint) occurs when the subject’s health status decreases to zero for the first time. LDS adjust the observed marker process by considering changes in the marker from the initial level Yw (0). Hence, the marker change process {Y (t)} is used, where Y (t) = Yw (t) − Yw (0). The initial marker level y0 = Yw (0) is used as a baseline covariate. The first passage time from the initial health status to the failure threshold is the survival time. For the data structure in this model, LDS assume that each subject is observed for a fixed period (0, t] and has two possible observation outcomes: (1) The subject survives to time t at which time a marker level of Y (t) = y(t) is recorded. This occurrence constitutes a censored observation of survival time because S > t; (2) The subject fails at some time S = s during the period (0, t] and a marker level of Y (S) = y(s) is recorded at the moment of failure. The mathematical form of the density functions for these two cases were worked out by WCL and the reader is referred to the original article (or that of LDS) for details. Both WCL and LDS derive forms for the log-likelihood function that allows maximum likelihood estimation of model parameters. In the case of LDS, these are µy , µx /δ, ρ, σyy and σxx /δ 2 . As the health status process {X(t)} is unobservable, one of the parameters for that process can be set arbitrarily. In essence, because this process is latent, it has an undefined measurement scale. LDS fix the value of σxx at 1. LDS show how the model can accommodate covariates z by assuming that the initial health status δ, the mean parameters µx and µy , the variance parameter σyy and the correlation parameter ρ are linked to linear combinations of covariates, as described earlier in Section 3. As parameters µx and µy may range over all real values, LDS use identity links for these parameters, i.e., µx = zβx and µy = zβy . As parameters δ and σyy are necessarily positive, LDS choose link functions δ = exp(zβδ ) and σyy = exp(zβyy ) as reasonable options. Finally, as correlation ρ is defined on the interval
(−1, 1), they choose the following correlation transform as a link function:

ρ = [exp(zβρ) − 1] / [exp(zβρ) + 1].
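The three types of link function just described are easy to express in code. The sketch below uses our own helper names and arbitrary illustrative coefficient vectors; it maps a covariate vector z to a mean parameter, to a positive parameter such as δ or σyy, and to ρ ∈ (−1, 1) via the correlation transform.

```python
# Regression link functions of the LDS type (illustrative sketch).
import numpy as np

def identity_link(z, beta):
    return float(np.dot(z, beta))            # e.g., mu_x = z beta_x, mu_y = z beta_y

def log_link(z, beta):
    return float(np.exp(np.dot(z, beta)))    # e.g., delta = exp(z beta_delta)

def correlation_link(z, beta):
    eta = float(np.exp(np.dot(z, beta)))     # rho = (e^{z beta_rho} - 1)/(e^{z beta_rho} + 1)
    return (eta - 1.0) / (eta + 1.0)

z = np.array([1.0, 0.5, 2.0])                # intercept plus two covariates (toy)
print(identity_link(z, [0.1, -0.2, 0.05]),
      log_link(z, [0.0, 0.3, -0.1]),
      correlation_link(z, [0.2, 0.1, 0.0]))
```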
LDS develop appropriate density functions needed for two kinds of predictive inference: (1) prediction of the health status of a subject at time t from the contemporaneous marker value and (2) prediction of the residual survival time of a surviving subject from his or her current marker value. The predictive inferences are made after obtaining maximum likelihood estimates of the model parameters.
6. Longitudinal data

The preceding review has made only brief reference to model applications in which longitudinal data are gathered on the respective stochastic processes. The reason is that longitudinal data pose a challenge for first-hitting-time models based on Wiener processes. Lu (1995) considers the problem for the basic Wiener model where longitudinal observations are made on process {X(t)} up to the hitting or censoring time, as the case may be. She formulates the likelihood function and computes maximum likelihood estimates. The methodology is somewhat challenging but manageable.
LDS consider the issue of modeling longitudinal data for the bivariate latent-marker Wiener model and propose the following approach as possibly technically satisfactory and practical to implement. Their approach exploits the fact that their model can include baseline covariates and also assumes that both the marker and health status processes have independent increments. If an individual has readings at n + 1 time points, 0 = t0 ≤ t1 ≤ · · · ≤ tn, then these readings form n independent observation increments. The first observation relates to the time increment (0, t1], the second relates to the time increment (t1, t2], and so on. For the observation corresponding to increment (ti−1, ti], covariate values at time ti−1, denoted by zi−1, are baseline covariate values for that observation. The initial health status at time ti−1 is then taken as δi = exp(zi−1 βδ). This simple method is accurate to the extent that the covariates z capture individual risk factors and the assumption of independent increments is valid. In this approach, the baseline covariates z initialize the conditions at the start of each observation increment. The vector zi−1 should include the value of Yw(ti−1), the marker value at time ti−1. This proposal for handling longitudinal data remains untested.
7. Additional first hitting time models

There are many potentially useful variations of the first hitting time model.
(1) Horrocks and Thompson (2003) consider a Wiener diffusion model for health status in a hospital stay context. The model has two absorbing barriers (a lower and an upper barrier). Hitting the upper barrier leads to healthy hospital discharge and the lower barrier to death in hospital. Models in a similar vein were described earlier in Whitmore and Neufeldt (1970), Whitmore (1975) and Eaton and Whitmore (1977), but without the same technical sophistication or benefit of additional decades of development.
(2) A number of criticisms have been leveled at a Wiener process as a model for {X(t)} in particular applications. In some applications, such as physical degradation, it is considered reasonable that the process sample path should be monotonic (the argument being that degradation can only proceed in one direction). A gamma process, for example, possesses this monotonic property. A monotonic process also happens to have a simple characterization for the distribution function of the first hitting time for an absorbing barrier at a, as we have the relationship P(S ≤ t) = P[X(t) ≥ a].
(3) LDS suggest other bivariate processes for their model, mentioning in particular the bivariate gamma process and the bivariate integrated Ornstein–Uhlenbeck process.
(4) Many authors have considered curve-crossing models for reliability and lifetime data. These models involve modeling both the stochastic process for the failure process and selecting the appropriate absorption set – a (curvilinear) function of time. The first passage of the process to the curve defines the hitting time. Desmond (1987) is a typical example. He applies curve-crossing results to modeling metal fatigue and failure.
(5) Several authors have considered the concept of operational time in discussing first hitting time models. The basic idea is that calendar time is not the most suitable time scale on which to measure the progress of the stochastic process that determines the hitting time. For example, the degradation of a piece of equipment may be more a function of accumulated usage than of accumulated clock time. Likewise, a disease may progress more as a function of cumulative exposure to a toxin than of the passage of calendar time. A subordinated process model may be appropriate in this situation. If calendar time is represented by t, then operational time is a monotonic stochastic process {R(t)}, where R(t) = r(t) states that operational time is r(t) by calendar time t. The stochastic process {X} is now defined on the operational time scale and, hence, has form {X(r)}. The combined stochastic process {X[R(t)]} is called a subordinated process, with {X(r)} being the parent process and {R(t)} being the directing process. The first hitting time, denoted by SR, may therefore be defined as the first occasion when {X(r)} enters absorbing set H. Now, however, SR = sR is measured on the r-time scale. The corresponding time on the calendar time scale is sT, where sR = r(sT) and {r(t)} is the realized sample path of directing process {R(t)}. Useful references are Lee and Whitmore (1993), who look at subordinated processes, and Duchesne and Lawless (2000), who consider alternative time scales for failure models.
8. Other literature sources

Numerous other sources cover related subject matter, including a vast literature dealing with purely theoretical and methodological aspects of first hitting time models. Lawless (2003) gives a comprehensive overview of theory, models and methods; especially pertinent to this review is Section 11.5, pp. 518–523. Aalen and Gjessing (2001) provide an extensive investigation and stimulating discussion on the subject.
Acknowledgements

This research is supported in part by NIH grant CA79725 and by the Natural Sciences and Engineering Research Council of Canada.
References

Aalen, O.O., Gjessing, H.K. (2001). Understanding the shape of the hazard rate: A process point of view (with discussion). Statist. Sci. 16, 1–22.
Desmond, A.F. (1987). An application of some curve-crossing results for stationary stochastic processes to stochastic modelling of metal fatigue. In: MacNeill, I.B., Umphrey, G.J. (Eds.), Applied Probability, Stochastic Processes and Sampling Theory. Reidel, Dordrecht, pp. 51–63.
Doksum, K., Hoyland, A. (1992). Models for variable-stress accelerated testing experiments based on Wiener processes and the inverse Gaussian distribution. Technometrics 34, 74–82.
Doksum, K., Normand, S.-L. (1995). Gaussian models for degradation processes – Part I: Methods for the analysis of biomarker data. Lifetime Data Anal. 1, 131–144.
Duchesne, T., Lawless, J.F. (2000). Alternative time scales and failure time models. Lifetime Data Anal. 6, 157–179.
Eaton, W.W., Whitmore, G.A. (1977). Length of stay as a stochastic process: A general approach and application to hospitalization for schizophrenia. J. Math. Soc. 5, 273–292.
Horrocks, J.C., Thompson, M.E. (2003). Modelling event times with multiple outcomes using the Wiener process with drift. Lifetime Data Anal. In press.
Lancaster, T. (1972). A stochastic model for the duration of a strike. J. Roy. Statist. Soc. Ser. A 135, 257–271.
Lawless, J.F. (2003). Statistical Models and Methods for Lifetime Data, 2nd edn. Wiley, New York.
Lee, M.-L.T., Whitmore, G.A. (1993). Stochastic processes directed by randomized time. J. Appl. Probab. 30, 302–314.
Lee, M.-L.T., DeGruttola, V., Schoenfeld, D. (2000). A model for markers and latent health status. J. Roy. Statist. Soc. Ser. B 62, 747–762.
Lu, J. (1995). A reliability model based on degradation and lifetime data. Ph.D. Thesis. McGill University, Montreal, Canada.
Padgett, W.J., Tomlinson, M.A. (2003). Inference from accelerated degradation and failure data based on Gaussian process models. Lifetime Data Anal. In press.
Whitmore, G.A., Neufeldt, A.H. (1970). An application of statistical models in mental health research. Bull. Math. Biophys. 32, 563–579.
Whitmore, G.A. (1975). The inverse Gaussian distribution as a model of hospital stay. Health Services Res. 10, 297–302.
Whitmore, G.A. (1979). An inverse Gaussian model for labour turnover. J. Roy. Statist. Soc. Ser. A 142, 468–478.
Whitmore, G.A. (1983). A regression method for censored inverse-Gaussian data. Canad. J. Statist. 11, 305–315.
Whitmore, G.A. (1995). Estimating degradation by a Wiener diffusion process subject to measurement error. Lifetime Data Anal. 1, 307–319.
Whitmore, G.A., Schenkelberg, F. (1997). Modelling accelerated degradation data using Wiener diffusion with a time scale transformation. Lifetime Data Anal. 3, 1–19.
Whitmore, G.A., Crowder, M.J., Lawless, J.F. (1998). Failure inference from a marker process based on a bivariate Wiener model. Lifetime Data Anal. 4, 229–251.
Ch. 31. An Increasing Hazard Cure Model
Yingwei Peng and Keith B.G. Dear
Handbook of Statistics, Vol. 23 (ISSN 0169-7161), © 2004 Elsevier B.V. All rights reserved. DOI: 10.1016/S0169-7161(03)23031-5
1. Introduction

Although survival data are most commonly summarized graphically using the survival curve, consideration of the underlying mortality process is more natural using the hazard function. The hazard function can be increasing, decreasing, constant, bathtub-shaped, hump-shaped, or have some other shape that describes the failure mechanism (Klein and Moeschberger, 1997). In many settings the hazard increases over time, and models that incorporate this fact are appropriate. In other cases, a heterogeneous patient group may include some whose hazard reduces to zero as a result of medical treatment (the cured group) and a remainder (the uncured group) in whom the disease persists and whose hazard may increase over time. It is therefore of interest to investigate a method for analyzing clinical trials data based on an assumption of increasing hazard for uncured patients.
Marshall and Proschan (1965) discussed a general method of nonparametric maximum likelihood estimation under monotone hazard assumptions for uncensored data. For homogeneous censored data, Padgett and Wei (1980) proposed a maximum likelihood estimation method for right-censored data with the increasing hazard assumption. This work was extended by Mykytyn and Santner (1981) to include the decreasing and bathtub-shaped hazard assumptions. Asymptotic properties of the estimates under these assumptions were also given. These methods are nonparametric in nature and based solely on the hazard assumptions.
In this paper, we propose a semiparametric proportional hazards cure model with the increasing hazard assumption for the failure time of uncured patients. The model is closely related to the work of Peng and Dear (2000) and Sy and Taylor (2000); however, there are important differences. One difference is that the increasing hazard assumption is used in modeling the baseline distribution of uncured patients in this model. When the increasing hazard assumption can be justified for uncured patients, using such an assumption in a model will lead to useful gains in efficiency. Another important difference is in the estimation method. A generalization of the nonparametric method of Padgett and Wei (1980) is used in the EM algorithm to ensure an increasing hazard function in the estimation.
The paper is organized as follows. Section 2 presents the model and its estimation method. Section 3 provides a numerical study comparing the proposed method with existing models. Section 4 illustrates the use of the model with a breast cancer dataset. Section 5 contains a discussion and conclusions.
2. The model

The proposed semiparametric proportional hazards cure model with the increasing hazard assumption is a further development of the work of Peng and Dear (2000), so we adopt their notation in deriving the semiparametric estimation method for the proposed model. Let T be a nonnegative random variable denoting the failure time of patients, with survival function S(t | x, z), where x is a covariate vector affecting the failure time distribution of uncured patients and z is a covariate vector affecting the proportion of cured patients (x and z may or may not be the same). Under the mixture models of Kuk and Chen (1992) and Peng and Dear (2000),

S(t | x, z) = π(z) S_u(t | x) + 1 − π(z),

where π(z) is the proportion of uncured patients, which may depend on z through the logistic form log[π(z)/{1 − π(z)}] = γ′z, and S_u(t | x) is the survival function of the failure time distribution of uncured patients given x, which satisfies the proportional hazards assumption with an arbitrary unspecified baseline hazard function h_u0(t): S_u(t | x) = S_u0(t)^exp(β′x) and S_u0(t) = exp[−∫_0^t h_u0(u) du]. Because of the increasing hazard assumption for uncured patients, h_u0(t) is required to be a non-decreasing function of t (it is to be understood that in this paper "increasing" is written for "non-decreasing"). The unknown parameters to be estimated in this model are θ = (β′, γ′, h_u0(t)).

Censoring is always present in cure rate estimation. Censored observations may occur on either cured or uncured patients. Patients who are cured, and thus will never experience the failure however long the trial lasts, will be censored when the trial ends if not before. Uncured patients may be lost to follow-up, may relapse or die of another cause, or may survive longer than the duration of the trial, in each case providing a censored observation. Therefore it is not known whether an individual censored patient is cured or not. However, a patient with an uncensored observation is certainly not cured. As usual, we assume that the censoring is non-informative and independent of the failure time of uncured patients and of their probability of cure.

Suppose the observed data are of the form (t_i, δ_i, x_i, z_i), i = 1, 2, ..., n, where t_i denotes the observed survival time for the ith patient; δ_i is the censoring indicator, with δ_i = 0 if t_i is censored and 1 otherwise; and x_i and z_i are the observed values of the two covariate vectors. To estimate θ in the model, we employ the EM algorithm, which is outlined as follows. Define c = (c_1, c_2, ..., c_n), where c_i is an indicator of the cure status of the ith patient, namely, c_i is 1 if the patient is cured and 0 otherwise. The vector c contains partially missing information because the cure status of a censored patient is not available. Given c, the complete likelihood function is

∏_{i=1}^{n} π(z_i)^{1−c_i} {1 − π(z_i)}^{c_i} [h_u0(t_i) exp(β′x_i)]^{δ_i} S_u(t_i | x_i)^{1−c_i}.

The EM algorithm starts with an initial value θ^(0). Let θ^(r) be the estimate of θ at the rth iteration. The E-step in the (r + 1)th iteration calculates the expected complete log-likelihood function given θ = θ^(r), which is the sum of the following two functions:

Q_1(γ) = Σ_{i=1}^{n} [ g_i^(r) log π(z_i) + (1 − g_i^(r)) log{1 − π(z_i)} ],    (1)

Q_2(β, h_u0(t)) = Σ_{i=1}^{n} [ g_i^(r) exp(β′x_i) log S_u0(t_i) + δ_i {β′x_i + log h_u0(t_i)} ],    (2)

where g^(r) = (g_1^(r), g_2^(r), ..., g_n^(r)) and g_i^(r) is given by

g_i^(r) = E(1 − c_i | θ^(r)) = δ_i + (1 − δ_i) [exp(γ^(r)′z_i) S_u0^(r)(t_i)^exp(β^(r)′x_i)] / [1 + exp(γ^(r)′z_i) S_u0^(r)(t_i)^exp(β^(r)′x_i)],    (3)

with S_u0^(r)(t) = exp{−∫_0^t h_u0^(r)(u) du}. The M-step in the (r + 1)th iteration maximizes (1) and (2) separately to obtain θ^(r+1). The algorithm is iterated until it converges.

Eq. (1) can be maximized by usual optimization methods such as the Newton–Raphson method to obtain γ^(r+1). To maximize Eq. (2), the partial likelihood approach is used. Let τ_1 < ··· < τ_k denote the distinct uncensored failure times. Following arguments similar to those for Cox's proportional hazards model and the Breslow method (Breslow, 1974), Eq. (2) can be written as

log ∏_{j=1}^{k} [ exp(β′s_j) / {Σ_{i∈R_j} g_i^(r) exp(β′x_i)}^{d_j} ],    (4)

where s_j = Σ_{i∈D_j} x_i, D_j is the set of tied uncensored times at τ_j, d_j is the number of times in D_j, and R_j is the risk set at time τ_j. The baseline hazard function h_u0(t) is eliminated from (4), and β^(r+1) can be obtained without it.

Given β^(r+1), we construct a profile likelihood function to estimate h_u0(t) satisfying the increasing hazard assumption. Let E_j be the set of individuals with censoring times in [τ_j, τ_{j+1}), j = 0, ..., k, where τ_0 = 0 and τ_{k+1} = ∞. Define h*_u0(t) = α_j when τ_j ≤ t < τ_{j+1}, j = 0, ..., k, and α_0 = 0. Let α_j ≤ α_{j+1}, j = 1, ..., k − 1, so that h*_u0(t) is an increasing function. Given the current estimate g^(r), if we fix β at β^(r+1), it is easy to show that the profile log-likelihood function satisfies

Q_2(h_u0(t) | g^(r), β^(r+1)) = Σ_{i=1}^{n} δ_i {β^(r+1)′x_i + log h_u0(t_i)} − Σ_{i=1}^{n} g_i^(r) exp(β^(r+1)′x_i) ∫_0^{t_i} h_u0(v) dv
    = Σ_{j=1}^{k} {d_j log α_j + β^(r+1)′s_j} − Σ_{j=1}^{k} a_j α_j,    (5)
where

a_j = Σ_{i∈E_j} (t_i − τ_j) g_i^(r) exp(β^(r+1)′x_i) + (τ_{j+1} − τ_j) Σ_{i∈R_{j+1}} g_i^(r) exp(β^(r+1)′x_i),   j = 1, ..., k − 1,

a_k = Σ_{i∈E_k} (t_i − τ_k) g_i^(r) exp(β^(r+1)′x_i).

Therefore, estimating h_u0(t) satisfying the increasing hazard assumption is equivalent to estimating α_1, α_2, ..., α_k satisfying α_j ≤ α_{j+1}, j = 1, ..., k − 1. Apparently the right-hand side of (5) is unbounded if g_k^(r) → 0, and in this case α̂_k → ∞. Following similar arguments of Padgett and Wei (1980) and Barlow et al. (1972), we estimate the increasing hazard function h_u0(t) with

ĥ_u0(t) = 0 for 0 ≤ t < τ_1,   ĥ_u0(t) = α̂_j for τ_j ≤ t < τ_{j+1}, j = 1, ..., k − 1,   ĥ_u0(t) = α̂_k for t ≥ τ_k,

where

α̂_j = max_{1≤u≤j} min_{j≤v≤k} [ Σ_{s=u}^{v} d_s / Σ_{s=u}^{v} a_s ],   j = 1, ..., k − 1,    (6)
and α̂_k = ∞. Note that ĥ_u0(t) is a step function, so the estimated cumulative hazard function from this method is continuous. It is easy to show that if there are no cured patients and no covariates for the failure time distribution, i.e., g_i^(r) ≡ 1 for all patients and β^(r+1) ≡ 0, then (6) reduces to the estimator of Padgett and Wei (1980). Mykytyn and Santner (1981) briefly discussed a proportional hazards model with the increasing hazard assumption, and their estimator of h_u0(t) is a special case of (6) with all g_i^(r) ≡ 1, namely, all patients uncured. If there are cured patients, then 0 ≤ g_i^(r) ≤ 1 and only a certain fraction of each censored observation is considered in (6).

The proposed EM algorithm is similar to that used in the work of Peng and Dear (2000). However, they used a different approach to maximize (2), which cannot ensure that the estimated hazard function is increasing. We generalized the method of Padgett and Wei (1980) and employed it in maximizing (2), so that the estimated hazard function is always increasing.

The standard errors of the maximum likelihood estimates β̂, γ̂ and α̂ are not immediately available from the EM algorithm. They cannot be obtained from the inverse of the full observed information matrix of the observed log-likelihood function of (β, γ, α), because β̂, γ̂ and α̂ are maximum likelihood estimates under the restriction α_j ≤ α_{j+1}, j = 1, ..., k − 1. In this paper, we suggest using the bootstrap method to provide an approximation to the variances of β̂ and γ̂ because of its conceptual simplicity. Let B be the number of bootstrap samples from the original data, and let
(β*_i, γ*_i, α*_i) be the estimated parameters from the ith bootstrap sample. The standard errors of β̂ and γ̂ can be approximated by the standard deviations of β*_i, i = 1, ..., B, and of γ*_i, i = 1, ..., B, respectively. See Davison and Hinkley (1997) for details of this method for censored data.
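To make two of the computational pieces of the algorithm concrete, here is a minimal sketch of the E-step weights in (3) and of the max–min update in (6), assuming plain array inputs; all function and argument names are hypothetical and are not those of the authors' C/S-PLUS program.

```python
import numpy as np

def e_step_weights(delta, z, x, gamma, beta, S_u0_at_t):
    """Posterior probability g_i of being uncured, as in Eq. (3).

    delta      : (n,) censoring indicators (1 = uncensored)
    z, x       : (n, q) and (n, p) covariate matrices
    gamma, beta: current logistic and PH coefficient vectors
    S_u0_at_t  : (n,) current baseline survival S_u0^(r)(t_i) for uncured patients
    """
    eta = np.exp(z @ gamma)                        # exp(gamma' z_i)
    s_u = S_u0_at_t ** np.exp(x @ beta)            # S_u0(t_i)^{exp(beta' x_i)}
    return delta + (1 - delta) * eta * s_u / (1 + eta * s_u)

def increasing_hazard_steps(d, a):
    """Order-restricted hazard levels alpha_1 <= ... <= alpha_{k-1}, as in Eq. (6).

    d : (k,) numbers of tied uncensored failures d_1, ..., d_k
    a : (k,) weighted exposures a_1, ..., a_k
    """
    k = len(d)
    alpha = np.empty(k - 1)
    for j in range(1, k):                                        # j = 1, ..., k-1
        candidates = []
        for u in range(1, j + 1):
            ratios = [d[u - 1:v].sum() / a[u - 1:v].sum() for v in range(j, k + 1)]
            candidates.append(min(ratios))                       # inner minimum over v
        alpha[j - 1] = max(candidates)                           # outer maximum over u
    return alpha                                                 # alpha_hat_k is taken to be infinity
```

The monotonicity of the returned levels follows from the max–min form itself, so no separate isotonization step is needed in this sketch.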
3. Simulation study

A simulation study has been conducted to validate the model and the estimation method proposed above (referred to as the IH cure model). We also compare it with the semiparametric model without the increasing hazard rate assumption (Peng and Dear, 2000; Sy and Taylor, 2000) (referred to as the PH cure model) and with their parametric counterparts.

The simulated data mimic clinical trials with possibly cured patients in two groups (control and treatment) with 100 patients in each. The indicator of the treatment group is considered as a covariate. The failure time distributions of uncured patients and the cure rates differ between the two groups, and the differences between the two groups can be described by the IH cure model. We set γ_0 = 2 and γ_1 = −1, which corresponds to cure rates of 11.92% and 26.89% in the control and treatment groups, respectively. We also generate data with γ_0 = 0.8 and γ_1 = −0.6, which corresponds to higher cure rates of 31% and 45% in the two groups. The parameter β is set to log(1/2) = −0.693, which implies that the risk of uncured patients in the treatment group is only half of the risk in the control group. We assume that the baseline distribution is the Weibull distribution with density function p t^(p−1) exp(−t^p). The shape parameter p is set equal to 0.5, 0.9, 1, 2 and 3, where the first two values correspond to decreasing baseline hazard functions and the last three values correspond to increasing baseline hazard functions. We consider the cases of decreasing baseline hazard functions in this simulation study so that the robustness of the proposed estimation method can be examined when the assumption of the model is incorrect. We also consider other types of baseline distributions, such as the gamma and Gompertz distributions, in this study. The censoring times are assumed to follow the uniform and the exponential distributions. Since the censoring rate cannot be less than the cure rate, we consider 30, 40 and 50 percent as the censoring rates for the case with γ_0 = 2 and γ_1 = −1, and only 40 and 50 percent for the case of γ_0 = 0.8 and γ_1 = −0.6.

For each of the settings above, 300 samples are generated, and we compute the mean squared errors (MSEs) and biases of the estimates of the regression parameters β, γ_0, γ_1 and of the baseline survival probabilities at the median and the 90th percentile of the hypothesized baseline distribution using the PH and IH cure models. Since the Weibull distribution satisfies the proportional hazards assumption, we also use the Weibull cure model (Farewell, 1986) and the Weibull cure model with the restriction that the shape parameter is not less than one in this computation. The latter corresponds to a parametric cure model with the increasing hazard assumption (referred to as the IW cure model).

We first consider the Weibull baseline distribution and the uniform censoring distribution. Tables 1 and 2 summarize the results from the cases of γ_0 = 2, γ_1 = −1 and γ_0 = 0.8, γ_1 = −0.6, respectively.
Table 1
MSEs of estimates in four cure models for simulated data generated with γ_0 = 2 and γ_1 = −1. For each baseline shape parameter p = 0.5, 0.9, 1, 2, 3, the table reports the MSEs (×10) of the estimates of β, γ_0 and γ_1 and the MSEs (×100) of the estimated baseline survival probabilities at the median and at the 90th percentile, under the PH, IH, IW and W cure models at censoring rates of 30%, 40% and 50%.
When the true baseline hazard function is increasing, the estimates of β, γ_0, and γ_1 from the IH cure model tend to have smaller MSEs than those from the PH cure model. There are no substantial differences between the two models in estimating the survival probabilities. This indicates that the additional increasing hazard assumption in the IH cure model does improve the accuracy of the estimation over the PH cure model when the assumption is justifiable. The results also reveal that the estimates from the IH and those from the PH cure models are still comparable when the true baseline hazard function is slightly decreasing (p = 0.9).
Table 2
MSEs of estimates in four cure models for simulated data generated with γ_0 = 0.8 and γ_1 = −0.6. For each baseline shape parameter p = 0.5, 0.9, 1, 2, 3, the table reports the MSEs (×10) of the estimates of β, γ_0 and γ_1 and the MSEs (×100) of the estimated baseline survival probabilities at the median and at the 90th percentile, under the PH, IH, IW and W cure models at the censoring rates considered.
This shows that the IH cure model has some robustness to the baseline hazard assumption, and that the estimates from the model are still acceptable when the increasing hazard assumption is moderately violated. When the hazard function decreases rapidly (p = 0.5), the MSEs of the estimates from the IH cure model may be greater than those from the PH cure model. It should be noted that the relative performance of the IH cure model over the PH cure model is not affected noticeably by the censoring rate.

It is interesting to see that the parametric Weibull cure model (denoted as W in the tables), which is the true model in this simulation, performs poorly in the estimation of
regression parameters. It tends to have the largest MSEs among the four models, particularly in the estimation of γ_1. This finding supports the fact that the Weibull cure model suffers a degree of non-identifiability (Farewell, 1986): it tends to mis-classify cured patients as uncured patients, or vice versa. Such mis-classification causes a large estimation variation when using the logistic regression to model the cure rate. Both semiparametric models alleviate the non-identifiability concern in cure models, and the IH cure model tends to have the least degree of non-identifiability when the increasing baseline hazard assumption can be justified. However, the Weibull cure model does outperform the semiparametric models slightly in the estimation of the baseline survival probabilities, especially when the true baseline hazard function is not increasing. With the increasing hazard assumption, the IW cure model improves on the Weibull cure model in the estimation of γ_1, but it also suffers a degree of loss in estimating the baseline survival probability when the hazard function decreases quickly. The MSEs of this model are quite comparable to those of the IH cure model: the latter has smaller MSEs in estimating β, γ_0 and γ_1, while the former has slightly smaller MSEs in estimating the baseline survival function under moderate censoring. However, in comparison with the IH cure model, the disadvantage of the IW cure model is that it is a fully parametric model, which limits its application.

We conclude from this simulation study that the IH cure model outperforms the PH cure model when the baseline hazard function of uncured patients is increasing. The model also exhibits some robustness to a moderate violation of the increasing baseline hazard assumption. Only the estimate of the baseline survival probability from this model may have a large MSE when the hazard function is not increasing but decreases quickly; its performance is then comparable to that of its parametric counterpart. Therefore the IH cure model is recommended when the increasing hazard assumption can reasonably be made for uncured patients on the basis of medical or other related evidence. Similar conclusions can be drawn from the results under the exponential censoring distribution and under the situations where the gamma and Gompertz distributions are used as the baseline distribution. Detailed results are omitted.
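To fix ideas, here is a minimal sketch of generating one simulated dataset under the design described at the start of this section (logistic cure fraction, Weibull failure times for uncured patients, uniform censoring). The censoring bound `c_max` would be tuned to hit a target censoring rate and is not a value taken from the paper.

```python
import numpy as np

def simulate_trial(gamma0=2.0, gamma1=-1.0, beta=np.log(0.5), p=2.0,
                   n_per_arm=100, c_max=3.0, seed=1):
    """One simulated dataset: logistic cure fraction, Weibull uncured failures,
    uniform censoring on (0, c_max). Returns observed times, event indicators, arm."""
    rng = np.random.default_rng(seed)
    trt = np.repeat([0, 1], n_per_arm)                      # control = 0, treatment = 1
    pi_uncured = 1.0 / (1.0 + np.exp(-(gamma0 + gamma1 * trt)))
    uncured = rng.random(2 * n_per_arm) < pi_uncured
    # baseline S_u0(t) = exp(-t^p); proportional hazards effect exp(beta) for treatment
    u = rng.random(2 * n_per_arm)
    t_fail = (-np.log(u) / np.exp(beta * trt)) ** (1.0 / p)
    t_fail[~uncured] = np.inf                               # cured patients never fail
    c = rng.uniform(0.0, c_max, 2 * n_per_arm)              # independent uniform censoring
    t_obs = np.minimum(t_fail, c)
    delta = (t_fail <= c).astype(int)
    return t_obs, delta, trt
```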
4. Illustration

Many authors have recently studied the curative effects of adjuvant therapies for breast cancer; they include Gamel et al. (1993, 1994), Demicheli et al. (1999), Wheeler et al. (1999), Farewell (1986), Kuk and Chen (1992), and Peng and Dear (2000), to name a few. However, the shape of the hazard function for uncured patients was not taken into account in these studies. There are a few studies on the overall hazard function of breast cancer patients following adjuvant therapies, that is, the hazard function of a patient regardless of his or her cure status. It is inappropriate to use the overall hazard to describe the shape of the hazard function of uncured patients. However, these studies do provide evidence for the increasing hazard assumption. For example, Wheeler et al. (1999) show that the overall hazard rate is not constant but increases up to and including years three, four and five. Because of the existence of cured patients, it usually peaks at about year five and then
decreases. However, the validity of the increasing hazard assumption for uncured patients can still be assessed by examining the estimated hazard rate of patients who are known to be uncured.

As an illustration, we consider the breast cancer data analyzed previously with the Weibull cure model (Farewell, 1986) and later with Kuk and Chen's cure model (Kuk and Chen, 1992) and the PH cure model (Peng and Dear, 2000). The data include the time to relapse or death of 139 breast cancer patients in three treatment arms of adjuvant therapies, together with two other factors for each patient: clinical stage and the number of lymph nodes with disease involvement. There are 44 patients who experienced recurrence of the breast cancer after the therapies, and therefore they are uncured patients. The remaining patients did not experience recurrence during the study and their cure statuses are unknown.

To validate the increasing hazard assumption, we apply proportional hazards models to the 44 uncured patients with one of the Weibull, gamma, lognormal and log-logistic distributions as the baseline distribution. The first two distributions allow increasing and decreasing hazard functions, while the hazard function of the last two distributions first increases to a peak and then decreases. The model with the gamma distribution as the baseline distribution provides the best fit to the data, and the estimated 95 percent confidence interval for the shape parameter of the gamma distribution is (1.357, 3.445), which implies that the shape parameter is significantly greater than one. The fit of the model with the Weibull distribution leads to the same conclusion: the shape parameter is significantly greater than one, and the hazard function for uncured patients is indeed increasing. Therefore the increasing hazard assumption for uncured patients is justifiable, and we can fit the IH cure model to the whole dataset.

The estimates from this model and those from the other three cure models are summarized in Table 3. The estimates of the regression parameters from the IH cure model are similar to those from the PH cure model, except that the significance of the estimates from the IH cure model is slightly weaker than those from the PH cure model. To some degree,

Table 3
Estimated parameters and the ratios to their standard errors under four cure models for the breast cancer data

Parameters for uncured patients
                      Weibull           Kuk–Chen          PH                IH
                      β̂      β̂/s.e.    β̂      β̂/s.e.    β̂      β̂/s.e.    β̂      β̂/s.e.
Treatment A          −1.57   −2.20     −1.11   −2.31     −0.92   −1.89     −0.91   −1.73
Treatment B          −0.03   −0.05     −0.98   −1.65      0.09    0.18      0.08    0.15
Clinical stage I     −1.20   −2.38     −1.28   −3.14     −0.85   −1.92     −0.84   −1.78
Lymph nodes           0.77    1.55      0.69    1.42      0.49    1.28      0.47    1.09

Parameters for cure rate
                      γ̂      γ̂/s.e.    γ̂      γ̂/s.e.    γ̂      γ̂/s.e.    γ̂      γ̂/s.e.
Intercept             0.13    0.27      0.36    0.48      0.22    0.53      0.24    0.44
Treatment A          −0.04   −0.04      0.04    0.08     −0.63   −1.18     −0.63   −1.03
Treatment B          −1.06   −1.95     −0.26   −0.37     −1.12   −2.17     −1.14   −1.89
Clinical stage I     −0.71   −1.33      0.10    0.20     −0.94   −2.07     −0.92   −1.82
Lymph nodes           1.26    2.18      1.53    1.84      1.37    2.57      1.35    2.36
the estimates are also similar to those from the Weibull cure model, but the significance of the estimates of β tends to be weaker and the significance of the estimates of γ tends to be stronger. There are substantial differences between the IH cure model and Kuk and Chen's cure model: Kuk and Chen's cure model shows that there is little evidence that clinical stage I reduces the probability of relapse, but that its effect on delaying the occurrence of relapse is strong. The IH cure model shows, however, that clinical stage I significantly reduces the probability of relapse and delays the occurrence of relapse. Treatment B, on the other hand, significantly reduces the probability of relapse but does not affect the time to its occurrence in the IH cure model, whereas in Kuk and Chen's cure model it has an insignificant effect on the probability of relapse and a marginally significant effect on the time to the occurrence of relapse.

To examine the shape of the estimated baseline hazard functions from the models, we plotted the estimated baseline cumulative hazard functions from these models in Figure 1. The Weibull cure model indeed produces an increasing baseline hazard function, because its cumulative hazard function is a convex function. The estimated value of the shape parameter of the Weibull distribution is 1.434 with a standard error of 0.2, which is significantly greater than 1. But this model itself is often criticized for its strong parametric nature. Kuk and Chen's cure model, however, does not produce an increasing hazard function, because the estimated cumulative hazard function from the model is apparently not a convex function of time. For the PH cure model, it is not easy to tell whether the estimated cumulative hazard function is convex or not. The estimated cumulative hazard function from the IH cure model is obviously a convex function because of the IH assumption in the model. Note that in Figure 1 the cumulative hazard functions of the Weibull, PH and IH cure models are plotted up to the largest uncensored
Fig. 1. Estimated baseline cumulative hazard functions of breast cancer patients from four models.
time. Beyond the largest uncensored time, the cumulative hazard functions from the PH and IH cure models are infinite. Therefore the IH cure model provides a semiparametric cure model that satisfies the IH assumption for uncured breast cancer patients and further evidence for the conclusions that both treatment B and clinical stage I significantly reduce the probability of relapse, but only the latter significantly delays the occurrence of relapse.
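The validation step used at the start of this section, fitting a parametric model to the patients known to be uncured and examining its shape parameter, can be sketched as a censored Weibull maximum-likelihood fit. Covariates are omitted for brevity and the data arrays are placeholders, not the actual breast cancer data.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_shape(t, delta):
    """MLE of Weibull (shape, rate) from right-censored data; a shape estimate
    well above one supports an increasing hazard for these patients."""
    def neg_loglik(par):
        p, lam = np.exp(par)                                           # optimize on the log scale
        log_h = np.log(p) + np.log(lam) + (p - 1.0) * np.log(lam * t)  # log hazard
        H = (lam * t) ** p                                             # cumulative hazard
        return -(np.sum(delta * log_h) - np.sum(H))
    fit = minimize(neg_loglik, x0=np.zeros(2), method="Nelder-Mead")
    return np.exp(fit.x)                                               # (shape p, rate lambda)
```

A shape estimate whose confidence interval lies above one supports the increasing hazard assumption for the uncured group.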
5. Conclusions and discussion

In this paper, we proposed a proportional hazards cure model for cancer clinical studies in which the increasing hazard assumption is appropriate for patients who are not cured. A nonparametric estimation method based on the EM algorithm was proposed, which ensures that the estimated failure time distribution for uncured patients always has an increasing hazard function. The model extends Cox's proportional hazards model to allow for cured patients and an increasing hazard function for the failure time distribution of uncured patients. We performed simulation studies to compare this model with the semiparametric PH cure model without the increasing hazard assumption. The results show that if the increasing hazard assumption holds, this model outperforms the model without the assumption. The simulation studies and an application of the model to a breast cancer study also show that the EM algorithm for this model usually converges very quickly.

A perplexing and difficult problem in cure rate estimation is identifiability between cured patients and uncured patients with prolonged censored survival times. It arises in parametric models, such as the Weibull cure model discussed by Farewell (1986). It also arises in semiparametric PH cure models, as investigated by Taylor (1995) and Peng and Dear (2000). A simple "working" solution in these semiparametric models is to assume that patients whose censored times are greater than the largest uncensored time are cured, which is a rather arbitrary decision and lacks justification. However, since the estimated survival function under the IH cure model approaches zero at the largest uncensored time, these patients are treated as cured under the IH cure model. Therefore the IH cure model provides a justification of this solution, and imposing the increasing hazard assumption helps to alleviate the identifiability problem. This is also revealed in the simulation studies. Compared with a parametric assumption, the increasing hazard assumption is usually regarded as a mild assumption that can be satisfied in many cancers and other diseases. Therefore the IH cure model provides a valuable alternative in cure rate estimation when the increasing hazard assumption, rather than a parametric assumption, is appropriate.

When the increasing hazard assumption is not justifiable and a decreasing hazard function is appropriate for the failure time distribution of uncured patients, differentiating cured patients from uncured patients is a real challenge. If cured patients exist, the hazard function for the entire patient population is usually decreasing after a certain time point, because patients with high risks have events in the early stage of the trial and only patients with low or zero risks are left at later stages. Therefore the non-identifiability of a cure model with a decreasing hazard assumption becomes even
more severe, and a great deal of caution and further study are required. However, our simulation study shows that cure models with the increasing hazard assumption have a certain robustness when the increasing hazard assumption is moderately violated. A heuristic explanation is that, because of the severe non-identifiability problem, the estimated parameters in cure models tend to have very large variances; with the increasing hazard assumption, the estimated parameters may have slightly larger biases, but their variances are much smaller. Hence the estimates from a cure model with the increasing hazard assumption, whether parametric or semiparametric, may still have smaller MSEs than the estimates from a model without the assumption. This can be observed particularly in the estimates of γ_1 in the simulation study. The parameter γ_1 determines the cure rate in the treatment group, which has a larger cure rate than the control group. The larger the cure rate in a group, the more sensitive the estimate is to the non-identifiability in a model. Therefore a model without the increasing hazard assumption tends to have larger MSEs in the estimation of γ_1 than a model with the increasing hazard assumption.

The program that the authors used to fit the proposed IH cure model was written in C and S-PLUS. It is available on request from the first author.
Acknowledgements

This work was supported in part by research grants from Memorial University of Newfoundland and the Natural Sciences and Engineering Research Council of Canada to the first author.
References

Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D. (1972). Statistical Inference Under Order Restrictions. Wiley, New York.
Breslow, N.E. (1974). Covariate analysis of censored survival data. Biometrics 30, 89–99.
Davison, A.C., Hinkley, D.V. (1997). Bootstrap Methods and Their Application. Cambridge University Press, New York.
Demicheli, R., Miceli, R., Brambilla, C., Ferrari, L., Moliterni, A., Zambetti, M., Valagussa, P., Bonadonna, G. (1999). Comparative analysis of breast cancer recurrence risk for patients receiving or not receiving adjuvant cyclophosphamide, methotrexate, fluorouracil (CMF): data supporting the occurrence of 'cures'. Breast Cancer Res. Treatment 53, 209–215.
Farewell, V.T. (1986). Mixture models in survival analysis: Are they worth the risk? Canad. J. Statist. 14 (3), 257–262.
Gamel, J.W., Vogel, R.L., McLean, I.W. (1993). Assessing the impact of adjuvant therapy on cure rate for stage 2 breast carcinoma. British J. Cancer 68, 115–118.
Gamel, J.W., Vogel, R.L., Valagussa, P., Bonadonna, G. (1994). Parametric survival analysis of adjuvant therapy for stage II breast cancer. Cancer 74, 2483–2490.
Klein, J.P., Moeschberger, M.L. (1997). Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York.
Kuk, A.Y.C., Chen, C. (1992). A mixture model combining logistic regression with proportional hazards regression. Biometrika 79, 531–541.
Marshall, A.W., Proschan, F. (1965). Maximum likelihood estimation for distributions with monotone failure rate. Ann. Math. Statist. 36, 69–77.
Mykytyn, S.W., Santner, T.J. (1981). Maximum likelihood estimation of the survival function based on censored data under hazard rate assumptions. Comm. Statist. Theory Methods 10 (14), 1369–1387.
Padgett, W.J., Wei, L.J. (1980). Maximum likelihood estimation of a distribution function with increasing failure rate based on censored observations. Biometrika 67 (2), 470–474.
Peng, Y., Dear, K.B.G. (2000). A nonparametric mixture model for cure rate estimation. Biometrics 56, 237–243.
Sy, J.P., Taylor, J.M.G. (2000). Estimation in a Cox proportional hazards cure model. Biometrics 56, 227–236.
Taylor, J.M.G. (1995). Semi-parametric estimation in failure time mixture models. Biometrics 51, 899–907.
Wheeler, T., Stenning, S., Negus, S., Picken, S., Metcalfe, S. (1999). Evidence to support a change in follow-up policy for patients with breast cancer: Time to first relapse and hazard rate analysis. Clinical Oncology 11, 169–173.
32
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 Published by Elsevier B.V. DOI 10.1016/S0169-7161(03)23032-7
Marginal Analyses of Multistage Data
Glen A. Satten and Somnath Datta
1. Introduction

Multistage models are a type of multivariate survival data in which individuals (or experimental units) move through a succession of "stages" corresponding to distinct states. Traditional survival analysis is the simplest example of a multistage model, in which individuals begin in an initial stage (say, alive) and may move irreversibly to a second stage (death). Another simple but well-known example is the three-stage illness–death model, in which individuals can move from an initial stage (well) to either the illness or the death stage. In the irreversible version of this model, persons in the illness stage may subsequently move only to the death stage; in the reversible version, ill persons may recover and move back to the well stage. As with standard survival data, we may not see complete information for each individual because of right censoring. In this chapter, we assume that, up to the time an individual is censored, her transition times are known. The problem of analyzing interval-censored data from multistage models is important, but little progress has been made on it.

This chapter addresses a number of simple but fundamental questions about multistage data: what proportion of persons are in each stage at any given time; what is the hazard of moving from one stage to another at any given time; what is the distribution of waiting times in a given stage; and how do covariates affect the hazard of transitions between stages.

In a general multistage model, the past history of the process may affect its future evolution even once the current stage is accounted for. It is theoretically possible to model this dependence to obtain transition hazards between stages that are conditional on an individual's past history. For example, in the three-stage illness–death model, we may model the effect of the time spent in the well stage on the hazard of death among people in the illness stage. However, such conditional analyses do not easily lead to answers to the marginal questions we posed above, since "unconditioning" or "marginalizing" by averaging conditional hazards can be quite difficult. Here we take an alternative approach based on pioneering work by Robins and Rotnitzky (Robins and Rotnitzky, 1992; Robins, 1993) and construct marginal quantities directly.

Our basic approach can be described fairly simply. All the marginal quantities we seek would be easily estimated in the absence of right censoring. For example, stage
occupation probabilities can be easily estimated by the empirical proportion of persons in each stage at any given time. We first write the uncensored-data estimator in product-limit form. We then model the process that governs censoring. Because each individual is censored at most once, standard (univariate) survival models, such as Aalen's linear hazard model, can be used for the censoring hazard. We use this model of the hazard of being censored to weight the observed (censored) data to reconstruct what the uncensored experiment would have produced. Finally, using the product-limit forms (to ensure good small-sample behavior), we estimate the quantity of interest.

As discussed above, even if there are no "external" covariates, there may be "internal" covariates, generated by an individual's past history, that affect transition and censoring hazards (Cox, 1972). Another feature of multistage models is that, even if censoring is independent, dependent censoring may be induced at the marginal level. For example, if the times spent in two stages are positively correlated, then the waiting time in previously occupied stages predicts both the current waiting time and the likelihood of censoring before the next transition.

Multistage models are often analyzed by fitting one of two models: a Markov model or a semi-Markov model. In a Markov model, the past history of the process does not affect its future evolution given its present state. In a semi-Markov model, waiting times in each stage are independently distributed. Nonparametric estimators of marginal quantities are available for both of these models (Aalen and Johansen, 1978; Fleming, 1978a, 1978b; Lagakos et al., 1978). Recently we have given nonparametric estimators that do not make these structural assumptions (Datta and Satten, 2001, 2002; Satten and Datta, 2002). In this chapter we review and unify these results and extend them to include new results on parameter estimation in regression models.

The rest of the chapter is organized as follows. In Section 2 we consider the inverse-probability-of-censoring approach to dependent censoring for standard survival data. In Section 3 we extend this approach to data from multistage models and give estimators of stage occupation probabilities and transition hazards. In Section 4 we consider estimating marginal distributions of waiting times in a stage, and in Section 5 we consider a proportional hazards regression model for the effect of covariates on the marginal waiting time in a stage. Finally, Appendix A contains some technical details.
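As a concrete picture of the data this chapter works with, here is a toy encoding of right-censored records from the irreversible illness–death model; the stage codes and field names are illustrative only and are not part of the authors' notation.

```python
from dataclasses import dataclass
from typing import List, Tuple

WELL, ILL, DEAD = 0, 1, 2          # illustrative stage codes

@dataclass
class Subject:
    transitions: List[Tuple[float, int, int]]   # (time, from_stage, to_stage), in calendar time
    censor_time: float                          # right-censoring time (inf if followed to death)

# Subject 1: well -> ill at t = 2.1, ill -> dead at t = 5.4 (fully observed)
# Subject 2: well -> ill at t = 1.3, then censored at t = 4.0 while still ill
example = [
    Subject(transitions=[(2.1, WELL, ILL), (5.4, ILL, DEAD)], censor_time=float("inf")),
    Subject(transitions=[(1.3, WELL, ILL)], censor_time=4.0),
]
```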
2. "Explainable" dependent censoring in survival analysis

Dependent censoring in survival analysis occurs when there is correlation between failure and censoring times. If there is no additional information on the nature of this correlation, then the problem is intractable, as for each individual only the earlier of the failure and censoring times is observed (Tsiatis, 1975). However, if this dependence is carried by covariates, so that for fixed levels of the covariates the failure and censoring times are uncorrelated, then it is possible to account for dependent censoring (Robins and Rotnitzky, 1992; Robins, 1993; Robins and Finkelstein, 2000; Satten et al., 2001).

In this chapter, we use the following notational conventions. Let T_i* denote the true (but possibly unseen) failure time for subject i, 1 ≤ i ≤ n. We wish to estimate S(t) = Pr[T_i* > t],
the survival function. Let C_i denote the (possibly unseen) censoring time, and let the observed random variables be T_i = min(T_i*, C_i) and δ_i = I[T_i* ≤ C_i]. We consider estimation of S(t) when T_i* and C_i are not independent, but when this dependence can be explained by covariates Z_i(t). Let Z̄_i(t) denote {Z_i(s), 0 ≤ s < t}. Denote the hazard of being censored by

λ_C(t | ·) = lim_{dt→0} (1/dt) Pr[C_i ∈ [t, t + dt), δ_i = 0 | T_i ≥ t, ·]

if time is measured continuously, and λ_C(t | ·) = Pr(C_i = t, C_i ≤ T_i* | T_i ≥ t, ·) if time is measured in discrete units. Following Robins and Rotnitzky (1992) we assume that

λ_C(t | T_i*, Z̄_i(t)) = λ_C(t | Z̄_i(t)).    (2.1)
Eq. (2.1) stipulates that, given the covariate history Z̄_i(t), the (future) failure time does not affect the current hazard of being censored. This may seem self-evident, since the laws of physics have a similar form; however, if Z̄_i(t) does not contain all relevant variables that affect failure and censoring hazards, then it may appear that knowledge of the future failure time affects the current censoring hazard. For this reason, (2.1) is sometimes called the assumption of no unmeasured confounders.

In the absence of censoring, the survival function can be estimated by the empirical survival probability S*(t) = n^{-1} Σ_{i=1}^{n} I[T_i* > t]. Robins and Rotnitzky (1992) originally proposed n^{-1} Σ_{i=1}^{n} I[T_i > t, δ_i = 1]/K̂_i(T_i−) as an estimator of the survival function. However, the small-sample properties of this estimator are not desirable, and in particular its value at t = 0 is not necessarily 1 (although this estimator does reduce to the Kaplan–Meier estimator of failure times when K̂_i is the Kaplan–Meier estimator of censoring times; see, e.g., Satten and Datta, 2001). Instead, following Robins (1993), we rewrite S*(t) in product-limit form:

S*(t) = ∏_{s≤t} { 1 − dN*(s)/Y*(s) },

where N*(t) = Σ_i I[T_i* ≤ t] and Y*(t) = Σ_i I[T_i* ≥ t]. Robins (1993) first noted that if λ_C satisfies (2.1), then weights based on Λ_C(t | Z̄_i(t)) = ∫_0^t λ_C(s | Z̄_i(s)) ds (or Σ_{s≤t} λ_C(s | Z̄_i(s)) in discrete time) could be used to estimate complete-data quantities like N*(t) and Y*(t). In particular, if we define K_i(t) = ∏_{s≤t} [1 − dΛ_C(s | Z̄_i(s))] and let

N̄(t) = Σ_{i=1}^{n} I[T_i ≤ t, δ_i = 1] / K_i(T_i−)   and   Ȳ(t) = Σ_{i=1}^{n} I[T_i ≥ t] / K_i(t−),

then E{N̄(t)} = E{N*(t)} and E{Ȳ(t)} = E{Y*(t)}. Hence, by the law of large numbers, S̄(t) = ∏_{s≤t} {1 − dN̄(s)/Ȳ(s)} is a consistent estimator of S(t), and Λ̄(t) = ∫_0^t {Ȳ(s)}^{-1} dN̄(s)
3. Multistage models: Stage occupation probabilities and marginal transition hazards For a multistage model, the stage occupation probabilities generalize the survival function, and the cumulative transition hazards generalize the cumulative hazard of failure in a standard survival analysis. Suppose that individuals move through a network of stages
Marginal analyses of multistage data
563
labeled 1, . . . , M that correspond to some observable states. Let Tij∗ denote the j th transition time of the ith individual (where Ti0∗ ≡ 0 and where Tij∗ = ∞ if the individual enters an absorbing state before the j th transition). Let Ti∗ = supj {Tij∗ | Tij∗ < ∞} be the time of the ith individual’s final transition, let Ti = min(Ti∗ , Ci ) and let δi = I (Ci Ti∗ ) indicate whether the ith individual was censored before their final transition. Finally, ∗ let sik denote the stage occupied by the ith individual between times Tik−1 t < Tik∗ , ∗ ∗ k 1. For any time t, let T i (t) = {Tik∗ : Tik∗ t} and s i (t) = {sik+1 : Tik∗ t} denote the transition times and stage occupation history of the ith individual up to time t. As in the ∗ survival data context, observed data consists of {T i (Ti ), s ∗i (Ti ), δi }, 1 i n. As with standard survival analysis, the marginal quantities we are interested in would be easily estimated in the presence of complete data. In particular, the stage occupation probabilities Pj∗ (t) are calculated by the empirical proportion of individuals in each stage at time t, and the (cumulative) marginal hazard of transitions from stage j to stage j , denoted Λ∗jj (t) is calculated using Λ∗jj (t) =
t 0
∗ (s) dNjj
Yj∗ (s)
,
j = j
where ∗ Njj (t) =
n
I Tik∗ t, sik = j, sik+1 = j ,
j = j
i=1 k1
and Yj∗ (t) =
n
∗ I Tik−1 < t Tik∗ , sik = j . i=1 k1
In many ways, the easiest way to define the marginal transition hazard Λjj (t) is that it is the quantity that Λ∗jj (t) converges to as sample size n increases. ∗
To account for right-censored data, let T i (∞), s¯i∗ (∞) denote the complete set of transition times and stages occupied for the ith individual, and let Z i (t) be an increasing collection of sigma-algebras containing covariate information previous to time t that explains the dependence between transition times {Tik∗ } and censoring time Ci . Note that Z i (t) may contain information about arbitrary combinations of transition times Tik∗ that have occurred before time t and stages sij visited before time t, as well as any additional “external” covariates we believe may induce dependent censoring. The censoring hazard condition (cf. (2.1)) is as defined exactly as in univariate survival analysis, except that ∗ Ti is as defined in this section and Ti∗ = (T i (∞), s¯i∗ (∞)). As before, it means that Z i (t) contains all possible confounders (including all relevant combinations of already i (t) are occurred transition times and stages already visited). Similarly, Ki (t) and K as defined for univariate survival analysis. Note that although the transition times are potentially multivariate, individuals can be censored at most once, so that modeling the i (t) for multistage model data is in principle no more censoring hazard and computing K difficult than for standard survival analysis.
564
G.A. Satten and S. Datta
As with estimation of the survival function, we seek an inverse-probability-ofcensoring weighted estimator of Λ∗ (t). Let N jj (t) =
n
I (Tik∗ t, Ci Tik∗ , sik = j, sik+1 = j ) , Ki (Tik∗ −)
j = j ,
i=1 k1
and Y j (t) =
n
∗
I (Tik−1 < t Tik∗ , Ci t, sik = j )
Ki (t−)
i=1 k1
.
∗ (t)} and Then it is possible to show (Datta and Satten, 2002) that E{N jj (t)} = E{Njj
∗ (t)}. Hence, if we define Λ(t) in the same manner as Λ∗ (t) but E{Y jj (t)} = E{Yjj
with N ∗ (t) and Y ∗ (t) replaced by N (t) and Y (t), then Λ(t) is a consistent estimator of i to give N jj (t) and Y j (t) Λ(t). Finally, if we replace Ki by a consistent estimator K and use these estimators to calculate Λjj , then Λjj (t) can be shown to be a consistent estimator of Λjj (t). j (t), but this estimator is not necessarily norIt is possible to estimate Pj (t) using Y malized and stage occupation probabilities may be estimated to be larger than 1. As in standard survival analysis, to obtain an estimate of Pj (t) that has attractive smallsample properties, we must obtain a product-limit form for Pj∗ (t). This can be achieved by defining Pj (t) as Pj (t) =
M
Yk (0+) k=1
n
Pkj (0, t),
(3.1)
where P(0, t) = (0,t ][I + dΛ(u)]. The form of (3.1) is the same as in the case of a Markov process. However it is valid without the Markov assumption; see Datta and i does not vary across individuals, Satten (2002) for details. When K P(0, t) reduces to the Aalen–Johansen estimator of stage occupation probability. Datta and Satten (2001) used this argument to establish the consistency of the Aalen–Johansen estimators (originally proposed for Markov processes) for general multistage models under independent censoring. In this sense, the Aalen–Johansen stage occupation probability P(0, t) is a natural generalization of the Kaplan–Meier estimator to multistage models.
4. Estimation of marginal waiting time distributions When calculating transition hazards and stage occupation probabilities, we use the zero of time as the time of first entry into the network. However, for some problems we wish to consider marginal analyses where the zero of time is first entry into a stage of interest. For example, we may wish to know the distribution of waiting times in the illness stage of the three-stage irreversible illness–death model. Satten and Datta (2002) considered problems of this type.
Marginal analyses of multistage data
565
It is difficult to define “the” waiting time in a stage when that stage can be entered more than once. It is possible to calculate the waiting time for the kth entry into a stage; here we take a slightly different (but ultimately equivalent) approach and assume that the network of stages is acyclic. If this is not the case, it can be “exploded” into an acyclic network by adding to the definition of the stages the number of times the event has occurred. For example, a two-stage model where individuals alternate between health and illness stages can be exploded into the chain of events: health, first illness, first recovery, second illness, etc. In principle an infinite network of stages is necessary, but a finite network will suffice for a finite amount of data. Unfortunately, the notation used for stage occupation probabilities is cumbersome for this problem. For this section we define Tij∗ (Uij∗ ) as the time the ith individual enters (leaves) stage j in the uncensored experiment (= ∞ if stage j is never entered (left)), and for Tij∗ < ∞, Wij∗ = Uij∗ − Tij∗ denotes the stage j waiting time for the ith person. Let γij∗ = I (Tij∗ < ∞) and δij∗ = I (Uij∗ < ∞) indicate whether the ith individual (ever) enters or leaves stage j , respectively. In the uncensored experiment, the survival function of marginal waiting time is Sj∗ (t) = i I [Uij∗ − Tij∗ > t, γij∗ = 1]/ i I [γij∗ = 1]. The survival function of the marginal waiting time distribution Sj (t) is most easily defined as that quantity which Sj∗ (t) converges to as the sample size increases. We may al so express Sj∗ (t) in product-limit form by defining Nj∗ (t) = i I [Uij∗ − Tij∗ t, γij∗ = 1] and Yj∗ (t) = i I [Uij∗ − Tij∗ t, γij∗ = 1] so that Sj∗ (t) = st {1 − dNj∗ (s)/Yj∗ (s)}. When data are right-censored, Sj∗ (t) is not available. In fact, some individuals may be censored before it is clear whether they will ever enter stage j or not. By appropriate sample reweighting it is still possible to estimate Sj (t). As before, let Ci denote the censoring time and let Ti∗ = maxj {Tij∗ | Tij∗ < ∞} denote the time that the ith individual enters their final stage. Define observable random variables Tij = min(Tij∗ , Ci ) if the observed data at time Tij do not imply γij∗ = 0 and ∞ otherwise, Uij = min(Uij∗ , Ci ) if the observed data at time Uij do not imply γij∗ = 0 and ∞ otherwise, Ti = min(Ti∗ , Ci ), γij = I (Tij∗ Ci ) and δij = I (Uij∗ Ci ). We assume that the ∗
data consist of {Tij , Uij , δij , γij , 1 j J }, 1 i n. Let T i = {Tij∗ } denote the ith individual’s complete data from the uncensored experiment. As in the previous section, we assume a censoring model in which future events do not affect the hazard of censoring conditional on the cumulative record Z i (t), t 0, of a set of covariates. The censoring hazard and no unmeasured confounders condition are as in the previous section as well. Recall that Zi (t) may contain information on past transitions, waiting times and stages visited as well as any “external” covariates that may affect both transitions and censoring hazards. The condition for the censoring ∗ hazard is identical to (2.1) with Ti and T i , as defined above. For the censored data case, we proceed as follows. Define
I (Uij − Tij t, δij = 1) N j (t) = Ki (Uij −) i
and let Y j (t) =
I (Uij − Tij t, γij = 1) Ki (Tij + t−) i
where K_i is as defined in Section 2. Note that the arguments of K_i in the denominators are different from those in previous sections. This is because censoring occurs in calendar time, but we are estimating a waiting time whose zero of time occurs at entry into stage j. Hence, an individual who enters stage j at time T_ij and has waiting time at least t in that stage must survive uncensored for time T_ij + t. Similarly, only persons who remain uncensored as they leave stage j (at time U_ij) can contribute to N̄_j(t).

As before, we define N̂_j(t) and Ŷ_j(t) in analogy with N̄_j(t) and Ȳ_j(t) but with K_i(t) replaced by the estimator K̂_i(t). Our final estimator of the waiting time distribution is then Ŝ_j(t) = ∏_{s≤t} {1 − dN̂_j(s)/Ŷ_j(s)}. Note that even if K̂_i(t) = K̂(t) for all individuals i, this estimator does not reduce to the Kaplan–Meier estimator of waiting times in stage j, because the arguments of K̂ differ for each individual based on their time of entry into stage j. This is reasonable because dependent censoring for stage occupation times can arise even with independent censoring (Wang and Wells, 1998; Lin et al., 1999). For a model in which all participants move irreversibly through stages 1 and 2 to arrive at stage 3, our estimator of the stage 2 waiting time is asymptotically equivalent to the Wang and Wells estimator.

Having estimated the waiting time distribution, we may also wish to estimate the proportion of persons who have left stage j for some stage j′ that can be reached in one transition from stage j, within time t of entry into stage j. This competing risks problem can be solved using the methodology outlined here. Let N*_jj′(t) = Σ_i I[U_ij* − T_ij* ≤ t, δ_ij* = 1, γ_ij′* = 1]. If all values of U_ij* and T_ij* were known, then Q_jj′(t) could be estimated by Q*_jj′(t) = N*_jj′(t)/N_j*(∞). Note that Q*_jj′(t) can also be written as

Q*_jj′(t) = ∫_0^t S_j*(u−) dN*_jj′(u) / Y_j*(u),

because S_j*(u−) = Y_j*(u)/N_j*(∞) for uncensored data. Let N̄_jj′(t) = Σ_i I[U_ij − T_ij ≤ t, δ_ij = 1, γ_ij′ = 1] / K_i(U_ij−), and let N̂_jj′(t) be equivalently defined with K̂_i replacing K_i. Then estimate

Q̂_jj′(t) = ∫_0^t Ŝ_j(u−) dN̂_jj′(u) / Ŷ_j(u).

5. Regression models for waiting time distributions

Heretofore we have only considered the problem of nonparametric estimation for marginal models. It is also possible to consider parameter estimation (and testing) for models that are a mix of conditional and marginal. For example, we may wish to estimate the effect of gender on a transition (i.e., treat gender conditionally) but marginalize over other covariates, like past history of stages occupied. Note that for regression modeling of transition hazards, any variable that is treated conditionally (i.e., explicitly included in a regression model) does not induce dependent censoring, so marginalization is only necessary for variables not included in the regression model. For this reason, regression modeling of waiting times is intrinsically more interesting, since dependent censoring is
induced even when all variables that can affect censoring are included in the regression model. In this section, we develop a proportional hazards model to allow regression analysis of waiting times for general multistage models.

Let J and n denote the numbers of stages and individuals, respectively. Once again, consider all the quantities defined in the previous section. We assume that the data consist of independent replicates of {T_ij, U_ij, γ_ij, δ_ij, 1 ≤ j ≤ J} over individuals i = 1, ..., n, plus observations on covariate vectors {X_ij, 1 ≤ j ≤ J} and Z_i. Fix a stage j whose waiting time distribution we are interested in. We consider estimation of the effect of the covariate vector X_j on the waiting time distribution in a multistage model. We allow the possibility that X_j may be time-dependent, where time is measured since entry into stage j. Let, for each individual i, H(t) = H_i(t) denote an increasing sequence of σ-algebras containing T_ij* and all the covariate information available up to stage j waiting time t, measured since T_ij*, 1 ≤ i ≤ n. We let X_ij = X_ij(T_ij*, ·) be an H_i(t)-predictable process, so that the value of X_ij(T_ij*, t) at stage j waiting time t should be available just before calendar time T_ij* + t. Suppose we posit a proportional hazards model with parameters β_j, so that

lim_{dt→0} (1/dt) Pr[W_ij* ∈ [t, t + dt) | W_ij* ≥ t, γ_ij* = 1, H_i(t−)] = λ_j(t) e^{β_j · X_ij(T_ij*, t)}.    (5.1)
For the uncensored experiment, define Yrj∗ (t; β j ) =
n
r ∗
∗ Xkj t, Tkj∗ eβ j ·Xkj (t,Tkj ) I Wkj t, γkj∗ = 1
k=1
Nij∗ (t)
= I [Wij∗ t, δij∗ = 1]. Then, β j can be estimated using a for r = 0, 1 and let weighted estimating function n ∗ (s; β )
Y1j j ∗ ∗ ∗ Uj (β j ) = (5.2) dNij∗ (s) φ s, Tij Xij s, Tij − ∗ Y0j (s; β j ) i=1
where the function φ(s, Tij∗ ) acts as a weight, if desired. That the above is an unbiased estimating equation follows from the martingale representation n ∗ (s; β )
Y1j j ∗ ∗ ∗ Uj (β j ) = (5.3) dMij∗ (s) φ s, Tij Xij s, Tij − ∗ Y0j (s; β j ) i=1
s β ·X (T ∗ ,u) where Mij∗ (s) = Nij∗ (s) − 0 I (Wij∗ u)λj (u)e j ij ij du is an Fi∗ (s)-martingale with Fi∗ (s) = σ I Wij∗ u , 0 u s, j 1 , Hi (s) , s 0. As before, we assume that we have data on a possibly time-dependent covariate process Zi (t) such that it explains the dependence between the censoring variable and the uncensored stage entry (and exit) times. Mathematically speaking, the censoring hazard
obeys the condition
$$\lambda_{Ci}\bigl(t \mid \bar T_i,\ \mathcal{H}_i(t'-),\ \bar Z_i(t)\bigr) = \lambda_{Ci}\bigl(t \mid \bar Z_i(t)\bigr), \qquad \text{with } t' = 0 \vee (t - T^*_{ij}),$$
where time is measured since study entry (calendar time) and, as before, $\bar Z_i(t) = \{ Z_i(s)\colon 0 \le s < t \}$. Note that $\bar Z_i(t)$ may contain any information on the covariates $X_{ij}(s, T^*_{ij})$ that is available before calendar time $t$.

Suppose we estimate $K_i$ through a model for $\lambda_C(t \mid \bar Z_i(t))$ and let $\hat K_i(t) = \exp\{ -\hat\Lambda_C(t \mid \bar Z_i(t)) \}$, where $\hat\Lambda_C(t \mid \bar Z_i(t)) := \int_0^t \hat\lambda_C(s \mid \bar Z_i(s))\, ds$. Using the weighted estimation approach we can define the censored-data version of the estimating equation as follows. Let
$$\hat Y_{rj}(t; \beta_j) = \sum_{\ell=1}^n \frac{\{ X_{\ell j}(t, T^*_{\ell j}) \}^r\, e^{\beta_j \cdot X_{\ell j}(t, T^*_{\ell j})}\, I[U_{\ell j} - T^*_{\ell j} \ge t,\ \gamma_{\ell j} = 1]}{\hat K_\ell(T^*_{\ell j} + t-)}$$
for $r = 0, 1$, and let $\hat N_{ij}(t) = I[U_{ij} - T_{ij} \le t,\ \delta_{ij} = 1]/\hat K_i(U_{ij}-)$. Then the score function $U^*_j(\beta_j)$ may be approximated by
$$\hat U_j(\beta_j) = \sum_{i=1}^n \int \phi(s, T^*_{ij}) \biggl\{ X_{ij}(s, T^*_{ij}) - \frac{\hat Y_{1j}(s; \beta_j)}{\hat Y_{0j}(s; \beta_j)} \biggr\}\, d\hat N_{ij}(s). \tag{5.4}$$

In $\hat U_j(\beta_j)$ the function $\phi$ is a fixed function, but in fact, under appropriate conditions, it may be possible to replace it by a random function that converges in probability to a fixed function. In particular, we would like to consider choices like $\hat S_C(\cdot)$, the Kaplan–Meier estimator of censoring times, evaluated at "calendar" time $T^*_{ij} + s$. The function $\phi$ can be chosen to improve the variance of $\hat\beta_j$.

Note that a marginal estimate of the parameter vector $\beta_j$ can be obtained by solving the censored-data estimating equation $\hat U_j(\beta_j) = 0$, and tests of the hypothesis $H_0\colon \beta_j = \beta_{j0}$ can be performed using the score-like statistic $\hat U_j(\beta_{j0})$.

As before, we suggest Aalen's linear hazard model to calculate $\hat K_i$. In that case, the following martingale representation for the estimating function can be used to derive the asymptotic normality of the resulting estimators; an outline of its derivation is given in Appendix A:
$$\hat U_j(\beta) = \sum_{i=1}^n \int \phi(s, T^*_{ij}) \biggl\{ X_{ij}(s, T^*_{ij}) - \frac{Y^*_{1j}(s; \beta)}{Y^*_{0j}(s; \beta)} \biggr\}\, dM^*_{ij}(s) + \int \xi^T_j(r) \bigl[ I - \hat U(r) \hat A^{-1}(r) \hat U^T(r) \bigr]\, dM^c(r) + o_p(n^{1/2}),$$
where $M^*_{ij}(s)$ is the $\mathcal{F}^*_i(s)$-martingale corresponding to the counting process $N^*_{ij}(s)$; $M^c_i(r)$ is a $\sigma(\{ \bar T_i,\ \mathcal{H}_i(0 \vee (u - T^*_{ij})),\ Z_i(u)\colon 0 \le u \le r \})$-martingale corresponding to the counting process $N^c_i(r) = I(C_i \le r)$; $M^c(r)$ is a vector with $i$th component $M^c_i(r)$; and $\hat U(r)$ is a matrix with $i$th row $I[T_i \ge r]\, U_i(r)$. The quantities $U_i(r)$, $\hat A(r)$ and $\xi_j(r)$ are defined in Appendix A. Furthermore, the two martingale terms are orthogonal.
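To fix ideas before the simulation, the inverse-probability-of-censoring weighting that drives both the Section 4 waiting-time estimator and the weighted score (5.4) can be written out in a few lines of code. The sketch below is our own illustration, not the authors' software: the callable K_hat stands in for whatever censoring model is used (for example, Aalen's model of Appendix A), left limits such as K_i(u-) are ignored for simplicity, and all variable names are hypothetical.

```python
import numpy as np

def ipcw_waiting_time_survival(entry, exit_, exited, K_hat):
    """IPCW product-limit estimate of a stage's waiting-time distribution.

    entry  : calendar times of (uncensored) entry into the stage, one per subject
    exit_  : calendar times at which the stage is left, or the censoring time
    exited : 1 if the exit from the stage was observed, 0 if censored in the stage
    K_hat  : K_hat(i, t) -> estimated probability that subject i remains
             uncensored through calendar time t
    Returns the distinct observed waiting times and the survival estimate at each.
    """
    entry = np.asarray(entry, float)
    exit_ = np.asarray(exit_, float)
    exited = np.asarray(exited, int)
    wait = exit_ - entry                    # observed (possibly censored) waiting times
    S, times, surv = 1.0, [], []
    for t in np.sort(np.unique(wait[exited == 1])):
        # weighted risk set at waiting time t: must survive uncensored to entry + t
        Y = sum(1.0 / K_hat(i, entry[i] + t) for i in range(len(wait)) if wait[i] >= t)
        # weighted number of observed exits at waiting time t: uncensored at exit
        dN = sum(1.0 / K_hat(i, exit_[i]) for i in range(len(wait))
                 if exited[i] == 1 and wait[i] == t)
        if Y > 0:
            S *= 1.0 - dN / Y
        times.append(t)
        surv.append(S)
    return np.array(times), np.array(surv)
```

With K_hat identically equal to one this reduces to the ordinary Kaplan–Meier calculation on the gap times; with subject-specific weights it mimics the estimator above.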
To illustrate the issues involved in parameter estimation, we have conducted a small simulation study. We considered data from a three-stage chain-of-events model, where all individuals start in stage 1 and move first to stage 2 and then to stage 3. We wish to fit a proportional hazards model to the waiting time in stage 2 using the approach described above. We generated correlated waiting times in stages 1 and 2 that each (marginally) followed a Weibull distribution, conditional on a single binary covariate $X$. To accomplish this, for each observation we generated a random variable $Z = (Z_1, Z_2)$ from the bivariate normal distribution with mean 0, variance 1 and correlation $\rho = 0.9$. We then formed correlated uniform deviates $M = (M_1, M_2)$ by taking $M_j = \Phi(Z_j)$, where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. Finally, we generated (correlated) Weibull variates $W^* = (W^*_1, W^*_2)$ by taking $W^*_j = F^{-1}_{(a_j, b_j)}(M_j)$, where $F_{(a,b)}(t) = 1 - e^{-a t^b}$ is the CDF of the Weibull distribution with scale parameter $a$ and shape parameter $b$. For each data set we generated 2000 observations, 1000 having $X = 0$ and 1000 having $X = 1$. For those observations for which $X = 0$, the Weibull parameters were $a_1 = 2.0$, $b_1 = 0.5$ and $a_2 = 2.0$, $b_2 = 2.0$, while for those observations for which $X = 1$, the scale parameters $a_1$ and $a_2$ were multiplied by $1.5 = e^{\beta}$. As a result, the hazard for leaving stage 2 follows a proportional hazards model with respect to the variable $X$, with regression coefficient $\beta = \ln(1.5) \approx 0.4055$. Finally, censoring times were generated according to a Weibull distribution with scale parameter 0.5 and shape parameter 1.5, and follow-up for each observation was terminated at the censoring time if this time was less than $T^*_3 = U^*_2 = W^*_1 + W^*_2$, the time of entry into stage 3. With these parameter values, among persons with $X = 0$ ($X = 1$), approximately 29.1% (21.3%) of observations are censored in stage 1, 22.2% (22.4%) are censored in stage 2, and the remaining 48.7% (54.2%) reach stage 3 before being censored.

We generated 15,000 data sets, each of which was analyzed using five models. In models 1 and 2, we included a constant term, $X$ and $I[T_2 > t]$ as variables in Aalen's linear model for the censoring hazard used to estimate $\hat K_i$. In models 3 and 4, we calculated $\hat K_i$ using $\hat S_c$, the Kaplan–Meier estimator of censoring times. In models 1 and 3 we used the weight function $\phi(s, T^*_{i2}) = 1$, while models 2 and 4 used $\phi(s, T^*_{i2}) = \hat S_c(T^*_{i2} + s-)$. Note that in model 4 the weight function $\phi(s, T^*_{i2})$ exactly cancels the weight $\hat K_i$ in $d\hat N_{i2}$, but weighting still occurs in the ratio $\hat Y_{1j}(s; \beta_j)/\hat Y_{0j}(s; \beta_j)$. Finally, model 5 is the naive proportional hazards model that ignores dependent censoring.

Table 1 summarizes our simulation results. Models 1–4 have negligible bias, while the bias in model 5 shows that the naive proportional hazards model is invalid for these data. Estimates of $\hat\beta$ from model 2 (4) have lower sampling variance than those from model 1 (3), indicating that for a fixed choice of $\hat K_i$, use of the weight function $\phi(s, T^*_{i2}) = \hat S_c(T^*_{i2} + s-)$ decreases the sampling variability of $\hat\beta$. This result is directly analogous to a finding of Hernan et al. (2000, 2001), who stabilize inverse-probability-of-treatment weights in a marginal structural model.
Finally, estimates of βˆ from models 1 and 2 have lower variability than those from models 3 and 4 even though the variables included in the censoring model do not affect censoring, illustrating the variance-lowering effect of adding variables to the censoring model. The sampling error of the naive model is the lowest, but the bias is about 20 times that of the weighted models, and the mean square error (MSE) is also larger.
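For concreteness, the data-generating mechanism described above (a Gaussian copula producing correlated Weibull waiting times, plus independent Weibull censoring) can be reproduced in a few lines. The sketch below is our own illustration of that mechanism, not the authors' code; it assumes numpy and scipy are available, and the array names are made up.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gen_waiting_times(n, a, b, rho=0.9):
    """Correlated Weibull waiting times (W1, W2) via a Gaussian copula.
    a, b are length-2 arrays of scale/shape parameters; F_{a,b}(t) = 1 - exp(-a t**b)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    Z = rng.multivariate_normal(np.zeros(2), cov, size=n)  # bivariate normal, corr 0.9
    M = norm.cdf(Z)                                        # correlated uniform deviates
    # invert F_{a,b}(t) = 1 - exp(-a t**b):  t = (-log(1 - M)/a)**(1/b)
    return (-np.log(1.0 - M) / a) ** (1.0 / b)

beta = np.log(1.5)
# stage-1 and stage-2 parameters for X = 0 and X = 1 (scales multiplied by exp(beta))
W_x0 = gen_waiting_times(1000, a=np.array([2.0, 2.0]), b=np.array([0.5, 2.0]))
W_x1 = gen_waiting_times(1000, a=np.array([2.0, 2.0]) * np.exp(beta), b=np.array([0.5, 2.0]))

# Weibull censoring on the calendar scale (scale 0.5, shape 1.5), compared with W1 + W2
C = (-np.log(1.0 - rng.uniform(size=2000)) / 0.5) ** (1.0 / 1.5)
```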
Table 1
Simulation results

                    Model 1   Model 2   Model 3   Model 4   Model 5
Average β̂            0.401     0.408     0.403     0.410     0.332
|bias| × 10³          4.9       2.1       2.3       4.8      73.9
var(β̂) × 10³         5.58      5.01      7.07      6.13      4.03
MSE × 10³             5.60      5.01      7.08      6.15      9.49
Appendix A: Modeling the censoring hazard using Aalen's linear hazards model

We have advocated use of Aalen's linear hazards model to account for the effect of external and internal covariates $\bar Z_i(t)$ on the hazard of being censored. We recommend Aalen's model because it is the most flexible hazard model currently available (and hence increases the chance of obtaining an estimate of $K_i$ that is close to its true value), and because estimates are available in closed form. Aalen's model writes the (censoring) hazard as
$$\lambda_c\bigl(t \mid \bar Z_i(t)\bigr) = \sum_{j=0}^{p} \beta_j(t)\, U_{ij}(t), \tag{A.1}$$
where each $\beta_j(t)$ is an unknown function, the first component is $U_{i0}(t) \equiv 1$, and $U_{ij}(t) = \phi_j(\bar Z_i(t))$ are $\sigma\{\bar Z_i(t)\}$-predictable functions, $1 \le j \le p$. If we let $B_j(t) = \int_0^t \beta_j(s)\, ds$ and let $U_i(t)$ be the vector $(U_{i0}(t), U_{i1}(t), \ldots, U_{ip}(t))^T$, then Aalen's estimator of the vector $B(t) = (B_0(t), \ldots, B_p(t))^T$ is given by
$$\hat B(t) = \sum_{i=1}^n I(T_i \le t)(1 - \delta_i)\, A^{-1}(T_i) \cdot U_i(T_i), \tag{A.2}$$
where the matrix $A(t)$ is given by
$$A(t) = \sum_{i=1}^n I(T_i \ge t)\, U_i(t)\, U_i^T(t). \tag{A.3}$$
Given the estimator $\hat B(t)$, we can write
$$\hat\Lambda_C\bigl(t \mid \bar Z_i(t)\bigr) = \sum_{j=0}^{p} \int_0^t U_{ij}(s)\, d\hat B_j(s) = \sum_{j=1}^n I(T_j \le t)(1 - \delta_j)\, U_i^T(T_j) \cdot A^{-1}(T_j) \cdot U_j(T_j), \qquad t \le T_i. \tag{A.4}$$
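Because every quantity in (A.1)–(A.4) is available in closed form, the cumulative censoring hazard, and hence $\hat K_i$, can be computed with a short routine. The sketch below is our own illustration of that closed-form recursion (it is not the authors' software): covariates are treated as time-constant for simplicity, a plain matrix inverse is used where a generalized inverse could be substituted when $A(t)$ is rank-deficient (see the discussion that follows), and the distinction between $K(t)$ and $K(t-)$ is ignored.

```python
import numpy as np

def aalen_censoring_K(T, delta, U):
    """Closed-form Aalen estimate of the censoring hazard, returning K_i at each time.

    T     : (n,) follow-up times
    delta : (n,) 1 = failure observed, 0 = censored (censorings are the "events" here)
    U     : (n, p+1) covariate rows U_i, taken as time-constant for simplicity
    Returns K with K[i, k] = exp{-Lambda_C,i(T[k])}; only entries with T[k] <= T[i]
    are needed by the weighted estimators.
    """
    T = np.asarray(T, float)
    delta = np.asarray(delta, int)
    U = np.asarray(U, float)
    n = len(T)
    Lambda = np.zeros((n, n))
    cum = np.zeros(n)                       # running Lambda_C,i for every subject i
    for k in np.argsort(T):                 # sweep follow-up times in increasing order
        if delta[k] == 0:                   # a censoring event: B-hat jumps by A^{-1} U_k
            at_risk = T >= T[k]
            A = U[at_risk].T @ U[at_risk]   # A(T_k) = sum over the risk set of U_i U_i^T
            cum = cum + U @ np.linalg.solve(A, U[k])   # adds U_i^T A(T_k)^{-1} U_k, all i
        Lambda[:, k] = cum                  # Lambda_C,i evaluated at calendar time T[k]
    return np.exp(-Lambda)
```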
Aalen’s model is especially flexible because it fits a function βj (t) to describe the effect of each covariate, even time-independent covariates. However, this flexibility has
Marginal analyses of multistage data
571
limited its use as a model for understanding the effect of variables on survival times for C (t | Zi (t)) may not be monotone increasing. a variety of reasons. First, estimates of Λ Second, the matrix A(t) may fail to have full rank at some time τ . When this occurs, it is impossible to estimate B(t) at time τ or at any subsequent time. For example, if all smokers in a study have died or been censored by time τ , it is no longer possible to estimate the effect of gender after τ . However, these small-sample difficulties do not C (t | Zi (t)) may not be affect use of Aalen’s model to estimate Ki (t). First, while Λ i (t) = exp{Λ C (t | Zi (t))} is always positive, which is sufficient monotone increasing, K to ensure the small sample properties of our estimators (such as normalization and probabilities lying between 0 and 1). Second, while B(t) may not be identifiable, we show i (t) is always defined for any combinations of i and t used to calculate our here that K i (Tj ) for Tj Ti . If estimators. To calculate our estimators, we only need values of K A(t) fails to have full rank, then we could use any generalized inverse of A(t) in (A.4) C (t | Zi (t)) as long as each Ui (Tj ) were entirely contained in the when calculating Λ range of A(Tj ) for every Tj Ti . The form of A(t) and following lemma ensure that i (t). any generalized inverse of A(t) can be used when calculating K L EMMA . Given a set of M column vectors Vm each of dimension q, and matrix A defined as A=
M
Vm VmT ,
m=1
each Vm is contained in the range of A. P ROOF. Assume this is not the case, and that Vm = rm + nm where rm is the projection of Vm into the range of A and nm is the projection of Vm into the null space of A. We know that nTm · A · nm = 0 because nm is in the null space of A, but by definition of A we also have for each m nTm · A · nm =
M
nTm Vi ViT nm =
i=1
M
(nm · Vi )2 (nm · Vm )2 = nm 2 . i=1
Hence, we must have $n_m = 0$ for all $m$.

An outline of the martingale representation

Express $\hat U_j(\beta)$ as
$$\begin{aligned} \hat U_j(\beta) = {} & \sum_{i=1}^n \int \phi(s, T^*_{ij}) \biggl\{ X_{ij}(s, T^*_{ij}) - \frac{Y^*_{1j}(s; \beta)}{Y^*_{0j}(s; \beta)} \biggr\}\, dM^*_{ij}(s) \\ & + \sum_{i=1}^n \int \phi(s, T^*_{ij}) \biggl\{ X_{ij}(s, T^*_{ij}) - \frac{\hat Y_{1j}(s; \beta)}{\hat Y_{0j}(s; \beta)} \biggr\}\, d\bigl( \hat N_{ij} - N^*_{ij} \bigr)(s) \\ & + \sum_{i=1}^n \int \phi(s, T^*_{ij}) \biggl\{ \frac{Y^*_{1j}(s; \beta)}{Y^*_{0j}(s; \beta)} - \frac{\hat Y_{1j}(s; \beta)}{\hat Y_{0j}(s; \beta)} \biggr\}\, dN^*_{ij}(s) \\ = {} & \mathrm{I} + \mathrm{II} + \mathrm{III}, \quad \text{say}. \end{aligned} \tag{A.5}$$
In the sequel we denote by $\approx$ equality up to terms of order $o_p(n^{1/2})$. Satten and Datta (2002) showed that
$$\hat N_{ij}(t) - N^*_{ij}(t) = I\bigl[ W^*_{ij} \le t,\ \delta^*_{ij} = 1 \bigr] \int \frac{I(U^*_{ij} > r)}{K_i(r)}\, d\hat M^c_i(r),$$
where $d\hat M^c_i(s) = dN^c_i(s) - I(T_i \ge s)\, d\hat\Lambda_c(s \mid \bar Z_i(s))$. Therefore,
$$\mathrm{II} \approx - \sum_{i=1}^n \delta^*_{ij}\, \phi(W^*_{ij}, T^*_{ij}) \biggl\{ X_{ij}(W^*_{ij}, T^*_{ij}) - \frac{y^*_{1j}(W^*_{ij}; \beta)}{y^*_{0j}(W^*_{ij}; \beta)} \biggr\} \int \frac{I(U^*_{ij} > r)}{K_i(r)}\, d\hat M^c_i(r), \tag{A.6}$$
where $y^*_{rj}(s; \beta) = E\{ n^{-1} Y^*_{rj}(s; \beta) \}$. Next,
$$\mathrm{III} \approx \sum_{i=1}^n \int \phi(s, T^*_{ij}) \biggl\{ \frac{Y^*_{1j}(s; \beta) - \hat Y_{1j}(s; \beta)}{y^*_{0j}(s; \beta)} + \frac{y^*_{1j}(s; \beta)\bigl( \hat Y_{0j}(s; \beta) - Y^*_{0j}(s; \beta) \bigr)}{\bigl( y^*_{0j}(s; \beta) \bigr)^2} \biggr\}\, \Phi_j\, dS_j(s), \tag{A.7}$$
where $\Phi_j = P\{ \gamma^*_{ij} = 1 \}$. The following martingale representation for $\hat Y_{rj}$ can be obtained by the Duhamel equation (see Andersen et al., 1993) and the product-integral representations of $\hat K_i$ and $I(C_i \ge \cdot)$:
$$Y^*_{rj}(s; \beta) - \hat Y_{rj}(s; \beta) = - \sum_{i=1}^n \bigl\{ X_{ij}(s, T^*_{ij}) \bigr\}^r\, e^{\beta \cdot X_{ij}(s, T^*_{ij})}\, I\bigl[ W^*_{ij} \ge s,\ \gamma^*_{ij} = 1 \bigr] \int_0^{s + T^*_{ij}} \frac{1}{K_i(u)}\, d\hat M^c_i(u).$$
Therefore, by substituting this in (A.7) and interchanging the orders of integration, we get
$$\mathrm{III} \approx \sum_{i=1}^n \int \gamma^*_{ij}\, \frac{1}{K_i(r)} \int_{r - T^*_{ij}}^{W^*_{ij}} \phi(s, T^*_{ij})\, \frac{e^{\beta \cdot X_{ij}(s, T^*_{ij})}}{y^*_{0j}(s; \beta)} \biggl\{ - X_{ij}(s, T^*_{ij}) + \frac{y^*_{1j}(s; \beta)}{y^*_{0j}(s; \beta)} \biggr\}\, \Phi_j\, dS_j(s)\, d\hat M^c_i(r). \tag{A.8}$$
Combining (A.6) and (A.8),
$$\mathrm{II} + \mathrm{III} \approx \sum_{i=1}^n \int \xi_{ij}(r)\, d\hat M^c_i(r),$$
where
$$\xi_{ij}(r) = \gamma^*_{ij}\, \frac{1}{K_i(r)} \int_{r - T^*_{ij}}^{W^*_{ij}} \phi(s, T^*_{ij})\, \frac{e^{\beta \cdot X_{ij}(s, T^*_{ij})}}{y^*_{0j}(s; \beta)} \biggl\{ - X_{ij}(s, T^*_{ij}) + \frac{y^*_{1j}(s; \beta)}{y^*_{0j}(s; \beta)} \biggr\}\, \Phi_j\, dS_j(s) + \delta^*_{ij}\, \phi(W^*_{ij}, T^*_{ij}) \biggl\{ - X_{ij}(W^*_{ij}, T^*_{ij}) + \frac{y^*_{1j}(W^*_{ij}; \beta)}{y^*_{0j}(W^*_{ij}; \beta)} \biggr\} \frac{I(U^*_{ij} > r)}{K_i(r)}.$$
The rest follows from expressing $d\hat M^c_i(r)$ as $d\hat M^c_i(r) = dM^c_i(r) + I(T_i \ge r)\, d\bigl( \Lambda_{iC} - \hat\Lambda_{iC} \bigr)(r \mid \bar Z_i(r))$, and from the martingale representation for $\hat\Lambda_i$ as in Satten et al. (2001), to yield
$$\sum_{i=1}^n \int \xi_{ij}(r)\, d\hat M^c_i(r) = \int \xi^T_j(r) \bigl[ I - \hat U(r) \hat A^{-1}(r) \hat U^T(r) \bigr]\, dM^c(r).$$
References

Aalen, O.O. (1980). A model for nonparametric regression analysis of counting processes. In: Klonecki, W., Kozek, A., Rosiński, J. (Eds.), Lecture Notes on Mathematical Statistics and Probability, Vol. 2. Springer, New York, pp. 1–25.
Aalen, O.O. (1989). A linear regression model for the analysis of lifetimes. Statist. Medicine 8, 907–925.
Aalen, O.O., Johansen, S. (1978). An empirical transition matrix for nonhomogeneous Markov chains based on censored observations. Scand. J. Statist. 5, 141–150.
Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Cox, D.R. (1972). The statistical analysis of dependencies in point processes. In: Lewis, P.A.W. (Ed.), Stochastic Point Processes: Statistical Analysis, Theory and Applications. Wiley, New York.
Datta, S., Satten, G.A. (2001). Validity of the Aalen–Johansen estimators of stage occupation probabilities and Nelson–Aalen estimators of integrated transition hazards for non-Markov models. Statist. Probab. Lett. 55, 403–411.
Datta, S., Satten, G.A. (2002). Estimation of integrated transition hazards and stage occupation probabilities for non-Markov systems under dependent censoring. Biometrics 58, 792–802.
Fleming, T.R. (1978a). Nonparametric estimation for nonhomogeneous Markov processes in the problem of competing risks estimation. Ann. Statist. 6, 1057–1070.
Fleming, T.R. (1978b). Asymptotic distribution results in competing risks estimation. Ann. Statist. 6, 1071–1079.
Hernan, M.A., Brumback, B., Robins, J.M. (2000). Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 11, 561–570.
Hernan, M.A., Brumback, B., Robins, J.M. (2001). Marginal structural models to estimate the joint causal effect of nonrandomized treatments. J. Amer. Statist. Assoc. 96, 440–448.
Lagakos, S.W., Sommer, C.J., Zelen, M. (1978). Semi-Markov models for partially censored data. Biometrika 65, 311–318.
Lin, D.Y., Sun, W., Ying, Z. (1999). Nonparametric estimation of the gap time distributions for serial events with censored data. Biometrika 86, 59–70.
Robins, J.M. (1993). Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers. In: Proceedings of the American Statistical Association – Biopharmaceutical Section, pp. 24–33.
Robins, J.M., Finkelstein, D.M. (2000). Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics 56 (3), 779–788.
Robins, J.M., Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell, N., Dietz, K., Farewell, V. (Eds.), AIDS Epidemiology – Methodological Issues. Birkhäuser, Boston, pp. 297–331.
Satten, G.A., Datta, S. (2001). The Kaplan–Meier estimator as an inverse-probability-of-censoring weighted average. Amer. Statist. 55, 207–210.
Satten, G.A., Datta, S. (2002). Marginal estimation for multistage models: Waiting time distributions and competing risks analyses. Statist. Medicine 21, 3–19.
Satten, G.A., Datta, S., Robins, J. (2001). Estimating the marginal survival function in the presence of time dependent covariates. Statist. Probab. Lett. 54, 397–403.
Tsiatis, A.A. (1975). A nonidentifiability aspect of the problem of competing risks. Proc. Natl. Acad. Sci. USA 72, 20–22.
Wang, W., Wells, M.T. (1998). Nonparametric estimation of successive duration times under dependent censoring. Biometrika 85, 561–572.
33
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23033-9
The Matrix-Valued Counting Process Model with Proportional Hazards for Sequential Survival Data
Karen L. Kesler and Pranab K. Sen
1. Introduction

Biological systems are frequently complex, with many different sources of variability. Trying to separate these processes into distinct components often requires measurements from related units as well as independent units. When the outcome of interest is time to event, the common approach is to use survival methods. Survival analysis is rich with multivariate models specifically designed to handle the complex data questions found in dependent data. These models range from strict parametric distributions to semi-parametric extensions of the proportional hazards model to purely nonparametric methods, all of which approach the problem of correlated data differently. Generally, the parametric models precisely parameterize the dependence; conditional models condition the outcome on the other dependent outcomes; marginal models simply ignore the dependence and adjust for its influence; and random effects models estimate the dependence in the form of the covariance matrix. Although different in their techniques, each of these models strives to overcome the same problem: bias introduced into the estimates of the parameters and their variances by correlated data. An overview of these different approaches to multivariate survival data can be found in Clegg et al. (1999).

All of these models, however, operate under certain assumptions about the structure of the data. In repeated measures survival analysis, the number of events seen in an individual is random and can vary widely. In this type of data, the correlated clusters are typically represented by individuals, with the correlation deriving from all of the measurements coming from one person. Typical examples of this type of data are the time between repeat hospitalizations for some chronic disease, the time between pediatric infections, or the time between successive study-defined events, such as a laboratory-defined "toxicity". Since the typical multivariate survival dataset consists of clusters with a fixed, predetermined size limit, parameterizing a model in the presence of random cluster sizes poses a challenge to most multivariate survival models, especially when those random sizes can range from no events to fifteen events. One method used in typical models to adjust for the random number of event times is to eliminate valuable information by only looking at the time to the first few events. Additionally, some of these models control for the dependence without describing it, yielding no inference or
description of the dependence parameters. Others parameterize the dependence, but in a highly specific manner with a non-intuitive clinical interpretation. Some novel methods, however, attempt to handle such variably sized clusters. Hoffman et al. (2001) present an approach based on within-cluster resampling. In this method, they randomly sample one observation from each cluster, analyze the resulting dataset using existing methods and then repeat this sample-analyze algorithm a large number of times. The regression parameters are estimated by the average of the resample-based estimates. Their method, although computationally intensive, results in valid estimators even when the risk for the outcome of interest is related to the cluster size. It has, however, only been presented for binary outcomes and could be used in conjunction with the methods presented in this article. A new alternative for modeling multivariate survival data called the Matrix Valued Counting Process (MVCP) framework was introduced by Pedroso de Lima (1995), Pedroso de Lima and Sen (1997, 1998). This framework allows for the use of any univariate model with the addition of conditional probabilities that account for the dependence. The framework relies on the underlying model–proportional hazards, parametric, or whatever is most appropriate–to define the main parameters of interest. It then defines a set of conditional probabilities that can be estimated to assess the nature of the dependency between observations. These dependence parameters can be interpreted as the multiplicative increase or decrease in the hazard of an event, conditional on the length of the other event times. Additionally, these parameters can be varied across event times, yielding a more flexible model. Most importantly for this research, the MVCP allows for fixed effect and covariance parameters to be estimated from variably sized clusters. Pedroso de Lima and Sen described the general multivariate framework and developed the asymptotic theory using the Cox Proportional Hazards model as an underlying parameterization. (Pedroso de Lima, 1995; Pedroso de Lima and Sen, 1997, 1998) They only considered a component-type data setup with equal cluster sizes, however. The MVCP framework is extended here to handle the issue of variable cluster size found in the multivariate sequential type of data. Additionally, this paper provides the first data analysis using this methodology. The first section after the introduction overviews general survival notation, the Cox Proportional Hazards model and discusses some of the generalizations of this popular model currently in the literature. Then, to introduce the MVCP framework, an overview of Sen and Pedroso de Lima’s work will be presented in the Section three. Section four describes the challenges presented by repeated measures survival data and extends their work by describing a series of possible parameterizations along with example clinical situations for which they would be appropriate. Section five overviews the changes to Pedroso de Lima and Sen’s likelihood, score function, and information matrix for the MVCP with a Cox proportional hazards model, with specifics in the Appendix. In Section six, we apply the model to a dose-escalation study of hydroxyurea in children suffering from sickle cell disease and interpret the results. The final portion of the paper is dedicated to a discussion of the interpretations, strengths, and weaknesses of the MVCP framework applied to this type of data. 
Appendix A details the modifications to the likelihood function, score function, information matrix, and asymptotic properties of the model needed to model variably-sized clusters.
2. Introduction to multivariate survival methods

2.1. Preliminary notation

In survival analysis, the time to some event is used as the outcome; however, there are often cases where this failure time is not observed because the observations are censored after some time. Specifically, let $T$, $C$, and $X = X(\cdot)$ have a joint distribution, where $T$ and $C$ are the survival and censoring times respectively, and $X(t)$ is a random vector of possibly time-dependent design covariates (also known as explanatory or auxiliary covariates). Set $Y = \min(T, C)$ and $\delta = I\{T \le C\}$. $Y$ is the observed time for the observation and $\delta$ indicates whether the event was observed or not. We will assume that the covariate process $X$ is external, or not directly involved with the failure mechanism (Kalbfleisch and Prentice, 1980). It is assumed that $T$ and $C$ are conditionally independent given $X$. This is the standard assumption of noninformative censoring that allows the distribution of $C$ to be factored out of the likelihood, thus simplifying the model. Although the censoring distribution is assumed to be noninformative, censoring can take place before or after the observed period. These two options correspond to left and right censoring, respectively, and if both take place, it is called interval censoring. For the purposes of this research, we will focus solely on right censoring.

2.2. Cox proportional hazards model

Methods for analyzing survival data are not so different from those for other types of data; in this case, inference proceeds according to a twist on standard likelihood methods. By assuming that $T$ and $C$ are conditionally independent given $X$, we can partition the likelihood into two parts, one for $C$ (not of interest) and one for $T$ (of interest). The hazard function is often used for modelling the distribution of these variables. Recall that the hazard function is a ratio of the density function and the survival function, often referred to as a measure of intensity; specifically,
$$\lambda(t \mid X) = \frac{f(t \mid X)}{S(t \mid X)}.$$
One extremely popular and easily interpretable model simplifies the likelihood by assuming a constant proportional hazard between the covariates, with an unspecified but continuous underlying hazard function:
$$\lambda(t \mid X) = \lambda_0(t)\, a(X),$$
where $\lambda_0(t)$ is the baseline hazard and the covariates are related to the hazard through $a(X)$. The traditional choice for $a(X)$ is $e^{\beta X}$ (Cox, 1972), which implies that the log-hazard is a linear combination of the covariates. In many analyses, the parameterization of $a(X)$ is the main interest, while $\lambda_0(t)$ can be treated as a nuisance parameter. Partial likelihood based inference was proposed by Cox (1975) for any model containing high-dimensional nuisance parameters, with one of the examples being the proportional hazards regression model. In proportional hazards regression, the baseline hazard function $\lambda_0(t)$ is essentially an infinite-dimensional nuisance parameter when estimating the
regression coefficients β. Cox proposed a conditional hazard approach to create the likelihood, involving the ranks of the data. A second, more inferentially powerful approach utilizes counting processes and martingale theory to create the likelihood (Nielsen et al. (1992), Andersen et al. (1993)). In general, the goals of both methods are the same, and in some cases, the likelihoods coincide. The Cox Proportional Hazards model has become incredibly popular with researchers due to its ease of use and the interpretability of the parameters. Specifically, the β correspond to a log-hazard increase due to a unit increase in the corresponding covariate. We will use this pervasive model in conjunction with a new method of specifying the dependence between observations. First, however, we present other extensions of the Cox model to give the reader a perspective on how this well-liked model has been augmented to handle multivariate data. 2.3. Multivariate survival methods As mentioned in the introduction, there are three general methods for specifying the structure of dependence in multivariate survival data. As outlined in Clegg et al. (1999), one can either ignore the dependence (marginal models) or model it in one of three ways: (1) assume specific joint distributions (parametric approach), (2) condition on the history or past events (conditional approach), or (3) use frailty models to include random effects in the model (random effects approach). Although useful in many instances, the conditional models do not fit the type of data we will be examining and so we will not provide further information on them. The joint distribution models define somewhat rigidly the relationship between the event times by using a joint distribution, such as the Marshall–Olkin or Block–Basu multivariate exponentials. Although the assumptions for these models are extreme, if the data fits, the parameters are well-defined and inference is straightforward using maximum likelihood methods. Marginal models, on the other hand, treat the dependence in the data as a nuisance parameter and focus on the marginal time to event distributions. They then adjust the variance estimators using a robust covariance estimator that accounts for the dependence. These models are especially useful in situations where the structure of the dependence is not of interest because the marginal distributions can be analyzed much like univariate data, which leads to easily interpretable parameters and more flexible assumptions. If, however, the relationship between event times is of primary interest and the strict assumptions of the parametric approach do not fit the data, an alternative is the random effects or frailty model (Murphy (1995)). These generalizations of the proportional hazards model incorporate an unobserved random effect to model the association between two event times. This random effect is multiplicative in the hazard and can either lead to a semi-parametric form where the marginal survival distributions have a specified parametric form (copula models) or to a generalization of univariate partial likelihood of Cox. The decision as to which of these models to use is driven by the data and the goals of the analysis.
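For readers who want to see the workhorse model of Section 2.2 in action before the MVCP machinery is layered on top, a standard proportional hazards fit takes only a few lines of code. The sketch below uses the Python lifelines package on made-up data; it is purely illustrative and is not the software or dataset used by the authors.

```python
import pandas as pd
from lifelines import CoxPHFitter

# toy right-censored data: observed time, event indicator (1 = event, 0 = censored), covariate
df = pd.DataFrame({
    "time":  [2.1, 3.5, 0.9, 4.2, 1.7, 5.0, 2.8, 3.9],
    "event": [1,   0,   1,   1,   0,   1,   1,   0],
    "x":     [0,   0,   1,   1,   0,   1,   0,   1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")  # partial-likelihood estimation of beta
cph.print_summary()                                   # reports exp(beta), the hazard ratio for x
```

The MVCP framework described below keeps this familiar regression structure for the marginal parameters and adds explicit dependence parameters on top of it.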
Specific extensions of the Cox proportional hazards model that are somewhat comparable to the research presented in this paper include two bivariate extensions from DeMasi (2000) and Tien and Sen (2002). The first approach stratifies the subjects into disjoint strata based on their outcome profile or specific sequence of events. DeMasi et al. (1997, 1998) and DeMasi (2000) exemplify this approach by developing a generalized competing risks framework for use with clustered data that has some competing risks components and some repeated measures components. By first formulating the event-specific hazard for the first event and then cross-classifying subjects and modeling their conditional partial likelihoods, pooled estimators can then be derived for the multivariate failures. The general stratification approach using conditional hazards can be extremely useful in situations where a competing risks outcome is considered, but is less necessary in the repeated measures data we consider here. The approach of Tien and Sen (2002) is a bivariate extension of the Cox proportional hazards model to handle interval censoring. In this model, they consider parameters defined as the effect of a covariate on both of the related outcomes, in addition to the more traditional parameters of covariate effect on a single outcome, conditional on the other outcome. The primary example presented with this method is of two strongly related outcomes (time to amenorrhea and time to cessation of breastfeeding) rather than a repeated event situation, which seems to reflect the strength of this method. As you will see in the following sections, the Matrix Valued Counting Process (MVCP) framework can be used with any of these Cox model extensions. The strength of the MVCP method is that it can be used with a variety of models (parametric, semiparametric, etc.) for parameterizing the non-dependent parameters. It then partitions the dependence in a structured set of parameters, thus removing the complexity of the dependence structure from the model.
3. The matrix valued counting process framework

In this section, we outline the Matrix Valued Counting Process methodology, a framework in which various survival models can be parameterized for multivariate correlated data. Perhaps it is more convenient to start with a simple bivariate case and then proceed to the general multivariate case; in that way the complexities arising in the general multivariate case can be visualized more clearly.

3.1. Bivariate matrix valued counting process

Let $(T_1, T_2)$ be a nonnegative random vector defined on a probability space $(\Omega, \mathcal{F}, P)$. We define the counting processes
$$N_k(t) = I\{T_k \le t\}, \qquad t \ge 0,\ k = 1, 2,$$
representing a right-continuous function that assumes the value zero, jumping to one when the particular event associated with $T_k$ occurs. We define the at-risk process as
$$Y_k(t) = I\{T_k \ge t\}, \qquad t \ge 0,\ k = 1, 2.$$
Such a process is assumed to have its value known at the instant just before $t$. Both of these indicators are functions of time, allowing for a stochastic approach consisting of a predictable portion of the process over time and a martingale, which accounts for random fluctuations. As time progresses, the counting process builds on the previous accrual of information. Specifically, consider a stochastic basis with the right-continuous filtration $\{\mathcal{F}_t\colon t \ge 0\}$ defined by
$$\mathcal{F}_t = \sigma\bigl\{ X,\ N_k(u),\ Z(u+)\colon 0 \le u \le t \bigr\},$$
where the filtration is an increasing family of sub-$\sigma$-algebras. The increasing process $N$ satisfies the conditions of the Doob–Meyer decomposition (Corollary 1.4.1, Fleming and Harrington, 1991), and hence with respect to this stochastic basis there is a unique predictable process $A$ such that $N - A$ is a martingale. We therefore expect the increment in our counting process to equal the jump in the predictable process, given the information up to the present time, that is,
$$E\bigl[ dN(s) \mid \mathcal{F}_{s-} \bigr] = E\bigl[ dA(s) \mid \mathcal{F}_{s-} \bigr] = \lambda(s)\, ds.$$
The intensity process, or hazard, $\lambda(t)$ is the expected jump size in $N_k(t)$ over a small window of time, given all of the information about the process up to $t$. The matrix-valued counting process method utilizes the instantaneous hazard function (also known as the intensity) of an event conditional on different combinations of the events occurring. Since the quantities in the counting process are defined on dependent random variables, it makes sense to consider also the random vector
$$\mathbf{N}(t) = \bigl( N_1(t), N_2(t) \bigr).$$
The intensity process associated with $\mathbf{N}(t)$, represented by the vector $\Lambda(t)$, can be written as
$$\Lambda(t) = \begin{pmatrix} \lambda_1(t) \\ \lambda_2(t) \end{pmatrix},$$
where $\lambda_k(t) = E[\, dN_k(t) \mid \mathcal{N}_{t-} \,]$ is the average number of jumps for component $k$ given the information available just before $t$, and $\mathcal{N}_{t-}$ is the value of the counting process just before $t$. We may note that in this case $\mathcal{N}_{t-}$ contains information on whether or not one (or both) component(s) have failed just before $t$. If component $k$ has failed before $t$, then the expected value equals zero. In other words, we need to consider four situations: (i) no component has failed before time $t$, i.e., both components are still at risk, $Y_1(t) = Y_2(t) = 1$; (ii) only the first component has failed before $t$, i.e., $Y_1(t) = 0$ and $Y_2(t) = 1$; (iii) only the second has failed before $t$, i.e., $Y_1(t) = 1$ and $Y_2(t) = 0$; (iv) both components failed before $t$, i.e., $Y_1(t) = Y_2(t) = 0$.
If we want to consider the intensity process for the first component, then we only consider cases where $Y_1(t) = 1$. Each of these events has a conditional hazard, corresponding to the relevant cases of event failure:
$$p^{(1)}_1(t) = \lim_{\Delta t \to 0} \frac{\Pr\{T_1 \in [t, t + \Delta t) \mid T_1 \ge t,\ T_2 < t\}}{\Delta t}$$
and
$$p^{(1)}_2(t) = \lim_{\Delta t \to 0} \frac{\Pr\{T_1 \in [t, t + \Delta t) \mid T_1 \ge t,\ T_2 \ge t\}}{\Delta t}.$$
Thus, $p^{(1)}_1(t)$ is the conditional hazard function for the failure of component one, given that event two has occurred, while $p^{(1)}_2(t)$ is the conditional hazard function for the failure of component one, given that event two has not occurred. Similarly, for component 2,
$$p^{(2)}_2(t) = \lim_{\Delta t \to 0} \frac{\Pr\{T_2 \in [t, t + \Delta t) \mid T_2 \ge t,\ T_1 < t\}}{\Delta t}$$
and
$$p^{(2)}_1(t) = \lim_{\Delta t \to 0} \frac{\Pr\{T_2 \in [t, t + \Delta t) \mid T_2 \ge t,\ T_1 \ge t\}}{\Delta t}.$$
Note that case four, where both components have failed, is not included in these conditional probabilities because there is no risk or hazard when both components have failed. These components, along with indicators for each case, give us the appropriate probabilities to allow us to write a form for the expected value of the jump size in the counting process, the hazard function:
$$\lambda_1(t) = E\bigl[ dN_1(t) \mid \mathcal{N}_{t-} \bigr] = p^{(1)}_1(t)\, Y_1(t)\bigl(1 - Y_2(t)\bigr) + p^{(1)}_2(t)\, Y_1(t)\, Y_2(t)$$
and
$$\lambda_2(t) = E\bigl[ dN_2(t) \mid \mathcal{N}_{t-} \bigr] = p^{(2)}_2(t)\, Y_2(t)\bigl(1 - Y_1(t)\bigr) + p^{(2)}_1(t)\, Y_1(t)\, Y_2(t).$$
Based on these equations, we can represent the intensity process by the product of matrices
$$\Lambda(t) = \begin{pmatrix} \lambda_1(t) \\ \lambda_2(t) \end{pmatrix} = \begin{pmatrix} Y_1(t) & 0 \\ 0 & Y_2(t) \end{pmatrix} \begin{pmatrix} \alpha_{11}(t) & \alpha_{12}(t) \\ \alpha_{21}(t) & \alpha_{22}(t) \end{pmatrix} \begin{pmatrix} Y_1(t) \\ Y_2(t) \end{pmatrix} = \mathrm{Diag}\bigl( \mathbf{Y}(t) \bigr)\, \alpha(t)\, \mathbf{Y}(t),$$
where the elements of $\alpha(t)$ are given by $\alpha_{11}(t) = p^{(1)}_1(t)$ and $\alpha_{12}(t) = p^{(1)}_2(t) - p^{(1)}_1(t)$ for the first component, and $\alpha_{22}(t) = p^{(2)}_2(t)$ and $\alpha_{21}(t) = p^{(2)}_1(t) - p^{(2)}_2(t)$ for the second. The matrix-valued counting process model is expressed in terms of the vectors $\mathbf{N}_1, \mathbf{N}_2, \ldots, \mathbf{N}_n$, that is, $n$ copies of the process $\mathbf{N}$ defined on the random vector of component counting processes and associated with a sample of $n$ individuals. Then the
matrix-valued counting process is given by $\mathbf{N}(t) = \bigl( \mathbf{N}_1(t), \mathbf{N}_2(t), \ldots, \mathbf{N}_n(t) \bigr)$, with an associated intensity process given by $\Lambda(t)$. Note that the columns of $\mathbf{N}$ are independent and, in each column, the two elements may be stochastically dependent.

3.2. Matrix valued counting process framework and the Cox proportional hazards model

The specific characterization of these hazards can be achieved by many different methods. We have chosen the popular and easily interpreted Cox proportional hazards model to illustrate the specifics of how the framework can be parameterized. In the Cox proportional hazards model, the intensity is modeled as an unknown baseline hazard multiplied by a function of the covariates, specifically of the form
$$\lambda(t) = \lambda_0(t)\, e^{\beta X},$$
where $\lambda_0$ is the baseline hazard, $X$ is the vector of covariates, and $\beta$ is the vector of covariate parameters representing log hazard ratios per unit increase in $X$. We will assume that the covariates are external, or non-time-dependent, for the purposes of this paper. For our initial illustration, we will additionally assume that the covariates are the same across each bivariate cluster, as one would see with certain types of clinical covariates in the case where the cluster is an individual. Specifically, if we were modelling time to progression of a disease or death, this would correspond to covariates collected at the start of the study, like age or initial severity of disease. These covariates would not change across the events a subject experiences but would change across subjects; thus we denote the covariate vector as $X_i$ for the $i$th subject. The vector of $\beta$ parameters then describes the log-hazard ratios of these covariates across all subjects and events. Additionally, we will consider the general case of the baseline hazard being the same for all subjects, but not necessarily both events. This implies that the baseline hazard is indexed by the order of the events, so $\gamma_{11}(t)$ represents the baseline hazard of the first event, given event two has occurred, and $\gamma_{12}(t)$ represents the baseline hazard of the first event, given event two has not occurred. We can then describe the Cox model's proportional hazard as a function of the hazards defined in the MVCP framework by expressing each element of $\alpha(t)$ as proportional to the exponential of the covariates. Specifically,
$$\alpha_{11}(t) = p^{(1)}_1(t) = \gamma_{11}(t)\, e^{\beta_1 X_i}, \qquad \alpha_{12}(t) = p^{(1)}_2(t) - p^{(1)}_1(t) = \bigl( \gamma_{12}(t) - \gamma_{11}(t) \bigr) e^{\beta_1 X_i},$$
$$\alpha_{21}(t) = p^{(2)}_1(t) - p^{(2)}_2(t) = \bigl( \gamma_{21}(t) - \gamma_{22}(t) \bigr) e^{\beta_2 X_i}, \qquad \alpha_{22}(t) = p^{(2)}_2(t) = \gamma_{22}(t)\, e^{\beta_2 X_i}.$$
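To see how these pieces fit together numerically, the bivariate intensity vector can be assembled directly from the at-risk indicators and the Cox-parameterized α matrix. The following sketch is our own illustration; the baseline functions, covariate value and parameter values are placeholders, not quantities taken from the chapter.

```python
import numpy as np

def bivariate_intensity(t, T1, T2, X, beta, gamma):
    """Intensity vector Lambda(t) = Diag(Y(t)) @ alpha(t) @ Y(t) for one bivariate cluster.

    T1, T2 : the two event times of the cluster
    X      : scalar covariate shared by both components
    beta   : (beta1, beta2) log hazard ratios
    gamma  : dict of baseline hazards gamma[(1,1)], gamma[(1,2)], gamma[(2,1)], gamma[(2,2)]
    """
    Y = np.array([float(T1 >= t), float(T2 >= t)])           # at-risk indicators
    a11 = gamma[(1, 1)](t) * np.exp(beta[0] * X)
    a12 = (gamma[(1, 2)](t) - gamma[(1, 1)](t)) * np.exp(beta[0] * X)
    a21 = (gamma[(2, 1)](t) - gamma[(2, 2)](t)) * np.exp(beta[1] * X)
    a22 = gamma[(2, 2)](t) * np.exp(beta[1] * X)
    alpha = np.array([[a11, a12], [a21, a22]])
    return np.diag(Y) @ alpha @ Y                             # (lambda_1(t), lambda_2(t))

# toy usage: constant baselines with gamma_12 = (1 + 0.5) * gamma_11, i.e. theta_1 = 0.5
gamma = {(1, 1): lambda t: 1.0, (1, 2): lambda t: 1.5,
         (2, 2): lambda t: 0.8, (2, 1): lambda t: 1.2}
print(bivariate_intensity(0.7, T1=1.2, T2=0.9, X=1.0,
                          beta=(np.log(1.5), np.log(1.5)), gamma=gamma))
```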
We can further simplify the model by requiring an additional layer of proportionality, specifically $\gamma_{12}(t) = (1 + \theta_1)\gamma_{11}(t)$. This forces the hazard of the first event given that the second event has occurred to be proportional to the hazard of the first event given that the second event has not occurred. Therefore, the intensity process vector can be written as
$$\Lambda_i(t) = \begin{pmatrix} \gamma_{11}(t)\bigl( Y_{1i}(t) + \theta_1 Y_{1i}(t) Y_{2i}(t) \bigr) e^{\beta_1 X_i} \\ \gamma_{22}(t)\bigl( Y_{2i}(t) + \theta_2 Y_{1i}(t) Y_{2i}(t) \bigr) e^{\beta_2 X_i} \end{pmatrix}.$$

3.3. Multivariate matrix valued counting process framework

This section introduces the MVCP framework for multivariate sequential survival data and describes the conditional probabilities that characterize its dependence structure. Let the vector $\mathbf{T}_k = (T_{k1}, T_{k2}, \ldots, T_{kj_k})$ be our random vector of the times between the sequential events for the $k$th cluster, with $j_k$ elements. The vector $\mathbf{T}_k$ can also be subject to an independent censoring mechanism, which we denote by the random variable $C$. The distribution of $C$ is left unspecified, but due to the nature of the sequential data it only affects the $j_k$th member of the vector. We can define the counting process for the $i$th member of the $k$th cluster as
$$N_{ki}(t) = I\{T_{ki} \le t\}, \qquad t \ge 0,\ i = 1, 2, \ldots, j_k.$$
Putting these counting processes into a vector gives us the counting process for the $k$th cluster,
$$\mathbf{N}_k(t) = \bigl( N_{k1}(t), N_{k2}(t), \ldots, N_{kj_k}(t) \bigr).$$
We also have a predictable vector of at-risk processes for the $i$th member of the $k$th cluster, given by
$$Y_{ki}(t) = I\{T_{k,i-1} \le t \le T_{ki}\}, \qquad t \ge 0,\ i = 1, 2, \ldots, j_k,$$
and the corresponding at-risk vector for the $k$th cluster, given by
$$\mathbf{Y}_k(t) = \bigl( Y_{k1}(t), Y_{k2}(t), \ldots, Y_{kj_k}(t) \bigr),$$
where the individual at-risk processes are defined as in the bivariate case. See Section 4 for a detailed description of how this process differs from the bivariate case.

In order to compute the intensity processes needed for the MVCP, we consider three combinations of failure and non-failure for a cluster with multiple events and define conditional probabilities based on each case. Thus, for the $k$th cluster, we could observe:

(1) Only event $a$ has not occurred, but all other events have occurred. The conditional hazard is denoted $p^{(a)}_{ka}$ and represented by
$$p^{(a)}_{ka} = \lim_{\Delta t \to 0} \frac{\Pr\{T_a \in [t, t + \Delta t) \mid T_a \ge t,\ T_l < t,\ \forall l \ne a\}}{\Delta t}.$$

(2) Event $a$ and event $b$ have not occurred, but all other events have occurred. The conditional hazard is denoted $p^{(a)}_{kb}$. This corresponds to a first-order interaction and is represented by
$$p^{(a)}_{kb} = \lim_{\Delta t \to 0} \frac{\Pr\{T_a \in [t, t + \Delta t) \mid T_a \ge t,\ T_b \ge t,\ T_l < t,\ \forall l \ne a, b\}}{\Delta t}.$$

(3) Events $a$, $b$ and $c$ have not occurred, but all other events have occurred. The conditional hazard is denoted $p^{(a)}_{kbc}$. This corresponds to a second-order interaction and is represented by
$$p^{(a)}_{kbc} = \lim_{\Delta t \to 0} \frac{\Pr\{T_a \in [t, t + \Delta t) \mid T_a \ge t,\ T_b \ge t,\ T_c \ge t,\ T_l < t,\ \forall l \ne a, b, c\}}{\Delta t}.$$
These interactions can then be generalized up to the $(m - 1)$th order. We then define a series of $\alpha$-parameters to be linear combinations of these conditional probabilities. Specifically,
$$\alpha_{k11}(t) = p^{(1)}_{k1}, \qquad \alpha_{k12}(t) = p^{(1)}_{k2} - \alpha_{k11}(t) \qquad \text{and} \qquad \alpha_{k123}(t) = p^{(1)}_{k23} - \alpha_{k11}(t) - \alpha_{k12}(t).$$
The vector of the $k$th cluster's event hazards, $\Lambda_k(t)$, is then built from these conditional hazards and can be placed in a matrix form. With this higher-order data, the model can be viewed as a sum of main effects and interactions, as we can see here with the trivariate case:
$$\Lambda_k(t) = \begin{pmatrix} \alpha_{k11}(t) \\ 0 \\ 0 \end{pmatrix} Y_{k1}(t) + \begin{pmatrix} 0 \\ \alpha_{k22}(t) \\ 0 \end{pmatrix} Y_{k2}(t) + \begin{pmatrix} 0 \\ 0 \\ \alpha_{k33}(t) \end{pmatrix} Y_{k3}(t) + \begin{pmatrix} \alpha_{k12}(t) \\ \alpha_{k21}(t) \\ 0 \end{pmatrix} Y_{k1}(t) Y_{k2}(t) + \begin{pmatrix} \alpha_{k13}(t) \\ 0 \\ \alpha_{k31}(t) \end{pmatrix} Y_{k1}(t) Y_{k3}(t) + \begin{pmatrix} 0 \\ \alpha_{k23}(t) \\ \alpha_{k32}(t) \end{pmatrix} Y_{k2}(t) Y_{k3}(t) + \begin{pmatrix} \alpha_{k123}(t) \\ \alpha_{k213}(t) \\ \alpha_{k321}(t) \end{pmatrix} Y_{k1}(t) Y_{k2}(t) Y_{k3}(t).$$
The first three terms represent the main effects, the following three terms the first-order interactions, and the last term the second-order interaction. Interpretationally, $\alpha_{k11}(t)$ is the hazard of event one conditional on all other events having previously occurred; $\alpha_{k12}(t)$ is the difference between $\alpha_{k11}(t)$ and the hazard of event one conditional on all other events having occurred except event two; and $\alpha_{k123}(t)$ is the difference between the hazard of event one conditional on events two and three not having occurred yet, and the sum of $\alpha_{k12}(t)$ and $\alpha_{k13}(t)$.

If we then assume that the second-order interactions (and any higher-order interactions in a more general model) are zero, then we can write the general model in the form
$$\Lambda_k(t) = \begin{pmatrix} Y_{k1}(t) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & Y_{kj_k}(t) \end{pmatrix} \begin{pmatrix} \alpha_{k11}(t) & \cdots & \alpha_{k1j_k}(t) \\ \vdots & \ddots & \vdots \\ \alpha_{kj_k 1}(t) & \cdots & \alpha_{kj_k j_k}(t) \end{pmatrix} \begin{pmatrix} Y_{k1}(t) \\ \vdots \\ Y_{kj_k}(t) \end{pmatrix},$$
where the Yki (t) are the at risk processes for the ith event in the kth cluster. This means that the diagonal elements of α k (t) represent the hazard of each event given all of the other events have occurred. The off-diagonal elements represent the difference between this hazard and the hazard of the event occurring given all events except one have occurred. More specifically, αk11 (t) is the hazard of the first event occurring given all other events in cluster k have occurred. And αk13 (t) is the difference between αk11 (t) and the hazard of the first event given that only event three has not yet occurred. Thus, the offdiagonal elements give us a measure of the increase or decrease in hazard between each pair of events.
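As a quick illustration of this general form, the cluster-level intensity under first-order-only interactions can be evaluated directly as Λ_k(t) = diag(Y_k(t)) α_k(t) Y_k(t). The sketch below is our own, with arbitrary placeholder values for a cluster of three sequential events; it is not data or code from the chapter.

```python
import numpy as np

def cluster_intensity(Y, alpha):
    """Lambda_k(t) = diag(Y_k(t)) @ alpha_k(t) @ Y_k(t) for one cluster at a fixed t.

    Y     : (j_k,) 0/1 at-risk indicators for the cluster's events at time t
    alpha : (j_k, j_k) matrix; diagonal = conditional hazard given all other events
            have occurred, off-diagonal = pairwise differences (first-order interactions)
    """
    Y = np.asarray(Y, float)
    alpha = np.asarray(alpha, float)
    return np.diag(Y) @ alpha @ Y

# placeholder values for a cluster of three events
Y = [1, 1, 0]                       # events 1 and 2 still at risk at time t
alpha = np.array([[0.50, 0.10, 0.05],
                  [0.08, 0.40, 0.02],
                  [0.03, 0.04, 0.30]])
print(cluster_intensity(Y, alpha))  # intensities of events 1, 2, 3 at time t
```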
4. Matrix valued counting process framework with repeated measures data

We first address the issue of adapting the Matrix Valued Counting Process model to handle the challenges found in repeated measures data. We start with some notation and then address the issue of event times running in sequence rather than the traditional event times running concurrently. We then discuss the impact of variable cluster size on the general MVCP framework.

For each cluster $k$ we have a variable number of event times, which we denote by the vector $\mathbf{T}_k = (T_{k1}, T_{k2}, \ldots, T_{kj_k})$, our random vector of the times between the sequential events for the $k$th cluster, with $j_k$ elements. The vector $\mathbf{T}_k$ can also be subject to an independent censoring mechanism, which we denote by the random variable $C$. The distribution of $C$ is left unspecified, but due to the nature of the sequential data it only affects the $j_k$th member of the vector. Thus, we observe
$$Z_{ki} = T_{ki}, \quad i = 1, \ldots, j_k - 1; \qquad Z_{kj_k} = \min(T_{kj_k}, C)$$
and
$$\delta_{ki} = 1, \quad i = 1, \ldots, j_k - 1; \qquad \delta_{kj_k} = I\{Z_{kj_k} = T_{kj_k}\}.$$
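The observed-data structure just defined is easy to mimic in code. The sketch below is our own illustration with made-up gap times; it builds (Z_k, δ_k) for one cluster from its latent gap times and an independent censoring time measured from study entry.

```python
import numpy as np

def observe_cluster(gap_times, C):
    """Observed gap times Z_ki and indicators delta_ki for one cluster.

    gap_times : latent waiting times T_k1, ..., T_kjk between successive events
    C         : independent censoring time on the calendar (study-entry) scale
    Censoring only affects the last observed gap, as in the text.
    """
    Z, delta, elapsed = [], [], 0.0
    for T in gap_times:
        if elapsed + T <= C:              # event observed before censoring
            Z.append(T)
            delta.append(1)
            elapsed += T
        else:                             # censored during this gap: truncate and stop
            Z.append(max(C - elapsed, 0.0))
            delta.append(0)
            break
    return np.array(Z), np.array(delta)

Z, d = observe_cluster([0.8, 1.3, 0.5], C=1.6)   # -> Z = [0.8, 0.8], delta = [1, 0]
print(Z, d)
```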
In most applications, the multivariate survival data has the same cluster size with the time to events running concurrently. However, when we have repeated measures data, the times run subsequent to one another and the number of events an individual can have is not prespecified. This open structure can lead to clusters of varying size. Additionally, since the person is not at risk for the second event until after the first event, the sequential data does not fit into our framework of conditional probabilities well, where we need to discuss the probability of one event in the presence of others. To illustrate, Figure 1 presents (on the left) the “at risk” process for a cluster with three events. Notice that the second event is not at risk until after the first event occurs at t1 and is then at risk only until the second event occurs at t2 . The same thing happens with the
Fig. 1. Unscaled and rescaled at risk processes for a cluster of size three.
third and fourth event. In order to fit this into our matrix framework, we must "rescale" these processes and essentially start the clock over each time an event occurs. Specifically, $t_1$ becomes time zero for the second event, $t_2$ becomes time zero for the third event, and so on. This leads to the "at risk" processes on the right, where $t^*_2 = t_2 - t_1$ and $t^*_3 = t_3 - t_2$. Now the events appear to be at risk concurrently, and we can discuss the probability of event one occurring given that all of the other events have occurred, which leads us back to the matrix model. Another interpretation of this is that we are measuring the "gap time" between events. Although this contradicts the biological interpretation of the repeated measures setup, we can interpret the hazards obtained from the MVCP as the risk of a gap time being longer than all other gap times or, for the off-diagonal parameters, the risk of a gap time being longer than all other gap times except one. This interpretation can help extend the clinical interpretation in the presence of our statistical modeling necessity. After rescaling, we can define the counting process for the $i$th member of the $k$th cluster as
$$N_{ki}(t) = I\{T_{ki} \le t\}, \qquad t \ge 0,\ i = 1, 2, \ldots, j_k,$$
where $T_{ki}$ is the gap time for the $i$th event in the $k$th cluster. This is defined in the same way as in the previous case of non-sequential data. However, the random vector of the individual
event processes seen here,
$$\mathbf{N}_k(t) = \bigl( N_{k1}(t), N_{k2}(t), \ldots, N_{kj_k}(t) \bigr),$$
is dependent on the size of the cluster, as denoted by the subscript $k$. Assuming that any interaction above the first order is zero, we have the general model for the intensity $\Lambda_k(t)$ being the product of the diagonal $Y$ matrix, the $\alpha$ matrix and the vector of $Y$s:
$$\Lambda_k(t) = \begin{pmatrix} Y_{k1}(t) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & Y_{kj_k}(t) \end{pmatrix} \begin{pmatrix} \alpha_{k11}(t) & \cdots & \alpha_{k1j_k}(t) \\ \vdots & \ddots & \vdots \\ \alpha_{kj_k 1}(t) & \cdots & \alpha_{kj_k j_k}(t) \end{pmatrix} \begin{pmatrix} Y_{k1}(t) \\ \vdots \\ Y_{kj_k}(t) \end{pmatrix}.$$
Now that we have discussed some general adaptations of the MVCP, we can discuss the differences from the bivariate case presented above when using the Cox proportional hazards model.

4.1. Multivariate matrix valued counting process model and the Cox proportional hazards model

As in the bivariate case, we will assume that the covariates are external, or non-time-dependent, and that they are the same across each cluster. These covariates would not change across the events a subject experiences but would change across subjects; thus we denote the covariate vector as $X_k$. The vector of $\beta$ parameters then describes the log-hazard ratios of these covariates across all subjects and events. We will consider variations to this parameterization of the covariates in the next section. Additionally, we will consider the general case of the baseline hazard being the same for all subjects, but not necessarily all events. A clinical example of this would be the case where an initial event leaves the subject more susceptible to the subsequent events, changing the shape of the underlying hazard. This implies that the baseline hazard is indexed by the event number, or $\gamma_i(t)$ for the baseline hazard of the $i$th event across all subjects. We further assume that the baseline hazard is the same for each event conditional on whether or not the other events have occurred. We can then describe the Cox model's proportional hazard as a function of the hazards defined in the MVCP framework by expressing each element of $\alpha_k(t)$ as proportional to the exponential of the covariates. Specifically,
$$\alpha_{kii}(t) = p^{(i)}_{ki} = \gamma_i(t)\, e^{\beta X_k}, \qquad \alpha_{kij}(t) = p^{(i)}_{kj} - p^{(i)}_{ki} = \bigl( \gamma_j(t) - \gamma_i(t) \bigr) e^{\beta X_k}.$$
We can further simplify the model by requiring an additional layer of proportionality, specifically $\gamma_j(t) = (1 + \theta_{ij})\gamma_i(t)$. This forces the hazard of the $i$th event given that all other events have occurred to be proportional to the hazard of the $i$th event given that only the $j$th event has not occurred. Simply put, the off-diagonal elements of the alpha matrix are restricted to be proportional to the on-diagonal elements. It allows us to write the
elements of $\alpha_k(t)$ as proportional to an event-specific baseline hazard, $\alpha^0_i(t)$:
$$\alpha_{kii}(t) = \alpha^0_i(t)\, e^{\beta X_k}, \qquad \alpha_{kij}(t) = \alpha^0_i(t)(1 + \theta_{ij})\, e^{\beta X_k} - \alpha^0_i(t)\, e^{\beta X_k} = \alpha^0_i(t)\, \theta_{ij}\, e^{\beta X_k}.$$
The model of the intensities is now
$$\Lambda_k(t) = \begin{pmatrix} Y_{k1}(t) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & Y_{kj_k}(t) \end{pmatrix} \begin{pmatrix} \alpha^0_1(t)\, e^{\beta X_k} & \cdots & \alpha^0_1(t)\, \theta_{1 j_k}\, e^{\beta X_k} \\ \vdots & \ddots & \vdots \\ \alpha^0_{j_k}(t)\, \theta_{j_k 1}\, e^{\beta X_k} & \cdots & \alpha^0_{j_k}(t)\, e^{\beta X_k} \end{pmatrix} \begin{pmatrix} Y_{k1}(t) \\ \vdots \\ Y_{kj_k}(t) \end{pmatrix},$$
where the $\alpha^0_i(t)$ represent the baseline hazards, the $\beta$ represent the vector of coefficients or hazard ratios for the covariates, and the $\theta_{ij}$ measure the influence of the $j$th event on the $i$th event.

4.2. Parameterizations and interpretations

Even this specific representation of the MVCP framework has a great deal of flexibility in parameterizing the log-hazard ratios and dependency parameters to handle different research questions. We explore this further in this section, first for the log-hazard ratios and then for the dependency parameters. We first consider different parameterizations of the log-hazard ratios ($\beta$), their impact on the MVCP parameterization, their clinical interpretation, and other concerns. While focussing on the log-hazard ratios, we will assume a simple case for the dependency parameters. Specifically, we will force them to be the same across events and clusters (i.e., $\theta_{ij} = \theta$), indicating that the influence of one event on another is the same across all events.

Example 1 (A treatment of diminishing efficacy). In our first example, we consider the parameterization of $\beta$ which allows for event-specific hazard ratios, modelling a varying relationship between covariates and the hazard of each event. Thus, a covariate could have a strong impact on an early event and a weak impact on a later event. Although the parameters are different for each event, they are the same over all clusters (or subjects, in this example). Thus, $\beta_{k_1 i} = \beta_{k_2 i} = \beta_i$. The number of parameters estimated (the length of the $\beta_i$ vector) would be the maximum number of events observed across clusters, or $\max(j_1, \ldots, j_n)$. This gives us a parameterization for the elements of $\alpha_k(t)$ of
$$\alpha_{kii}(t) = \alpha^0(t)\, e^{\beta_i X_{ki}}, \qquad \alpha_{kij}(t) = \alpha^0(t)(1 + \theta)\, e^{\beta_j X_{kj}} - \alpha^0(t)\, e^{\beta_i X_{ki}} = \alpha^0(t)\bigl\{ (1 + \theta)\, e^{\beta_j X_{kj}} - e^{\beta_i X_{ki}} \bigr\}.$$
Clinically speaking, suppose each subject is randomized to an active treatment or a control. Then, βi represents the log-hazard increase or decrease of the ith event occurring for a subject being treated with the active therapy compared to one being treated with
the control. If, over time, a subject develops a tolerance of the treatment, we would expect the estimate of the log hazard ratio of treated versus untreated patients to decrease over the sequential observed events. In this case, a model with event specific estimates of the hazard ratio would be appropriate to measure trends like this that occur across all patients. E XAMPLE 2 (A genetic or environmental confounder). If, on the other hand, there exists a covariate whose impact varies across clusters, a within cluster parameterization of β can be used. This case, denoted in the model by βk , would generate a hazard ratio for each cluster, yielding up to n separate estimates. Thus, βki = βkj = βk . The interpretation of the βk would be that of a log-hazard increase or decrease for any event due to a subject-specific factor. Note that in this parameterization, there must be multiple events for each cluster in order to make these estimates. This gives us a parameterization for the elements of α k (t) of: αkii (t) = α 0 (t)eβk Xk , αkij (t) = α 0 (t)(1 + θ )eβk Xk − α 0 (t)eβk Xk = α 0 (t)θ eβk Xk . In clinical studies, some etiologies are marked by a genetic or environmental susceptibility for certain patients, making a heterogeneous relationship between survival time and the covariate. We would therefore expect each patient’s hazard ratio to be individualized. Alternatively, a genetic marker or series of genetic markers could confer susceptibility or a protective effect to groups of subjects. We would therefore expect to estimate several hazard ratios, one for each combination of the genetic markers, but not one for each subject. E XAMPLE 3 (A combination of event-specific and cluster-specific covariates). The cluster-specific and event-specific parameterizations could be combined by including βki into the model. However, note that an individual member of the βki vector cannot be both event and cluster specific, since that would require overparameterization of the model. In this situation, each cluster would have its own set of estimates, so this setup presumes that all of the clusters have sufficient numbers of events to model these parameters. If both a treatment with tolerance problems and a separate measure of genetic susceptibility were needed in a model, some hazard ratios could be made to be event specific and some to be made cluster specific. The θ parameters in this model can be interpreted as pairwise changes in the hazard of an event due to the influence of another event. Specifically, θij is the proportional increase or decrease in the hazard of the ith event (when all other events have occurred) due to the j th event not yet having occurred. In the setting of sequential events, it is impossible to think of a future event influencing a prior one. Thus, it is important to realize that the θ parameters represent a type of covariance measure between the time to events as described above. Thus, the “influence” of one hazard of event on another is not a temporal relation, but rather a parameter that models the correlation due to these events occurring in a single individual or across homogeneous populations.
Just like the log-hazard ratio parameters, the θ parameters can also take different forms depending on the needs of the data. They are, however, tied to the parameterization of the baseline hazard in the Cox proportional hazards model. If we wish to assume a different baseline hazard for each event, we would need to specify θ parameters to be associated with each of those baseline hazards. Within this setting, however, we can specify different types of θ parameters to be associated with each baseline hazard event. For instance, we can model the complex case where there is a θij for each pairwise combination of events (as seen above) or the simplest case where a single proportion θ describes the influence of any event on any other event. For the purposes of our influence parameter examples, we will assume a simple case for the log-hazard parameters, specifically, that the log-hazards are the same across events and clusters or βki = β. E XAMPLE 4 (Events with different influences). As we shall see illustrated later in the example presented at the end of the article, events can exert different, but consistent, influences on the other events. For example, the length of the first event time may have a strong influence on the other event times, but later event times may not exert the same type of influence. We can parameterize our log-hazards by θij = θi . Thus, each event would have one parameter and the number of parameters estimated would be the maximum number of events. This gives us a parameterization for the elements of α k (t) of: αkii (t) = αi0 (t)eβXk , αkij (t) = αj0 (t)(1 + θj )eβXk − αi0 (t)eβXk . Under clinical descriptions, this parameterization might correspond to a situation where a subject will generally have all short event times or all longer event times, except that as the number of events increase, the variability of the event times widens and they exert less influence than the earlier events. As we will see with the example, certain subjects have a strong tolerance of the hydroxyurea therapy and therefore have longer event times between toxicities, other subjects have less tolerance of the therapy and therefore generally have shorter event times. E XAMPLE 5 (An event with lasting consequences). In a clinical setting, it is easy to imagine a situation where the influence of the ith event on the j th event is the same as the influence of the j th event on the ith event, so we will consider first, the case where θij = θj i . Considering this case of symmetric influences, we could still parameterize a different relationship between events of different distance (i.e., θ12 = θ14 ). The interpretation of parameterizing each pair of events separately is that some events are more influential on the hazard of the ith event than others. More specifically, this situation occurs when the events closest to the ith event in the sequence have more influence than those further away. If a short event time is more likely to influence a short event time for the next event than for an event further away in the sequence, having individual pairwise parameters would be a good modeling strategy. We would expect to model [max(j1 , . . . , jn )∗ max(j1 , . . . , jn ) − 1]/2 or the number of entries in the upper triangle of the matrix with dimensions corresponding to the maximum number of events. This
$$\alpha_{kii}(t) = \alpha_i^0(t)e^{\beta X_k},$$
$$\alpha_{kij}(t) = \alpha_i^0(t)(1+\theta_{ij})e^{\beta X_k} - \alpha_i^0(t)e^{\beta X_k} = \alpha_i^0(t)\theta_{ij}e^{\beta X_k},$$
$$\alpha_{kji}(t) = \alpha_j^0(t)(1+\theta_{ij})e^{\beta X_k} - \alpha_j^0(t)e^{\beta X_k} = \alpha_j^0(t)\theta_{ij}e^{\beta X_k}.$$
Note that although our dependency parameters are symmetric (i.e., θ_{ij} = θ_{ji}), the elements of α_k(t) are not; specifically, α_{kij}(t) ≠ α_{kji}(t) because of the underlying hazards of the specific events, α_i^0(t) and α_j^0(t). Clinically speaking, this example corresponds to the situation where an infection or other event of interest causes damage to the patient's system, making them more susceptible to another event, but, given time, this damage heals. We would therefore expect events closer together in sequence to be more highly correlated than those further apart, and would need to estimate separately, for instance, the influence of event one on event two compared to the influence of event one on event four.

EXAMPLE 6 (Genetic or environmental susceptibility). Another possible parameterization for θ is to choose a cluster-specific dependency. This situation constrains the model to assume that events exert the same amount of influence on other specific events, but that individuals have varying degrees of dependence among their event times. Thus, some individuals will have strong dependencies and homogeneous event times, and others will have weak dependencies and heterogeneous event times. In our model, this implies θ_{kij} = θ_k, where k represents the cluster and i and j represent events. This parameter gives us a measure of the influence of events on each other for the kth cluster. The number of parameters estimated would be n, the number of clusters. Note that in order to estimate this set of parameters accurately, each cluster (i.e., subject) must have multiple events. This gives us a parameterization for the elements of α_k(t) of:
$$\alpha_{kii}(t) = \alpha_k^0(t)e^{\beta_k X_k}, \qquad \alpha_{kij}(t) = \alpha_k^0(t)(1+\theta_k)e^{\beta_k X_k} - \alpha_k^0(t)e^{\beta_k X_k} = \alpha_k^0(t)\theta_k e^{\beta_k X_k}.$$
Once again, we consider the impact of each patient having a specific genetic or environmental exposure that affects their survival times. In this case, it makes their survival times more or less heterogeneous. If this susceptibility is large, it can strengthen the correlation between event times and create a more homogeneous cluster. If it is weak, more variability can enter and we would observe a more heterogeneous cluster. By using a different parameter for each patient, we can adjust for such differences in correlation.

EXAMPLE 7 (Disease susceptibility in a homogeneous population). In an extremely homogeneous population, the θ parameters can be simplified further by assuming a constant effect for all clusters and a constant effect of other events on each event. In our notation, this leads to a single θ parameter, pooled over all clusters and events. Its interpretation is the change in the hazard of an event due to any other event not yet having happened, and it is the same for all subjects. This gives us a parameterization for the elements of α_k(t) of:
$$\alpha_{kii}(t) = \alpha^0(t)e^{\beta_k X_k}, \qquad \alpha_{kij}(t) = \alpha^0(t)(1+\theta)e^{\beta_k X_k} - \alpha^0(t)e^{\beta_k X_k} = \alpha^0(t)\theta e^{\beta_k X_k}.$$
This special case of Example 4 covers the situation in which the influence of the hazard of one event is the same on all other events, across all subjects. More clinically put, subjects who have a short time to an event are more likely to have shorter inter-event times for all events, and patients who have a long event time are less likely to have shorter inter-event times for all events. Also note that a dearth of data can force this assumption to be made in order to obtain stable estimates.

5. Estimation

Using a parameterization appropriate to the repeated measures data structure, we now extend Pedroso de Lima's (1995) likelihood, score function, and description of the asymptotic properties of the model to the variably-sized clusters found in sequential survival data. Details of these calculations can be found in Appendix A. For brevity's sake, we present only the final results for each of these important functions.
Since our focus is on sequential events within an individual, we will use a simple model for the hazard ratios which assumes that the covariates have a common effect across individuals and events. Additionally, we will parameterize cluster-specific pairwise dependencies between the events. This leads to a model of the form:
$$\alpha_{kii}(t) = \alpha_i^0(t)e^{\beta X_k}, \qquad \alpha_{kij}(t) = \alpha_i^0(t)(1+\theta_{ki})e^{\beta X_k} - \alpha_i^0(t)e^{\beta X_k} = \alpha_i^0(t)\theta_{ki}e^{\beta X_k},$$
where k represents the cluster, and i and j index the events in the cluster (i ≠ j). The parameters of interest are β and θ; the α_i^0(t) are treated as nuisance parameters and ignored, just like the baseline hazard in the standard multivariate version of the partial likelihood used in the proportional hazards model. This leads to a log-likelihood of the form:
$$\log L_n(\beta,\theta) = \int_0^\tau\sum_{k=1}^{n}\sum_{i=1}^{j_k}\left[\beta X_k + \log\Bigl(Y_{ki}(t)+\theta_{ki}\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)\Bigr) - \log\Bigl(\sum_{m=1}^{n}e^{\beta X_m}\Bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\Bigr)\Bigr)\right]dN_{ki}(t),$$
with the Y and N functions defined above. The corresponding score functions can then be computed by simple differentiation with respect to β and all of the θ_{ki} (see Appendix A for specifics). Maximum partial likelihood estimators can then be obtained by setting the score functions to zero and solving for $\hat\beta$ and $\hat\theta$. These need to be computed iteratively because of the complexity of the equations.
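The iteration itself is the standard Newton–Raphson scheme; the display below is a generic statement of that update, not taken from the chapter, with U denoting the stacked score vector in (β, θ) and $\mathcal{I}$ the corresponding observed information described in Appendix A:
$$\begin{pmatrix}\hat\beta^{(r+1)}\\ \hat\theta^{(r+1)}\end{pmatrix} = \begin{pmatrix}\hat\beta^{(r)}\\ \hat\theta^{(r)}\end{pmatrix} + \mathcal{I}\bigl(\hat\beta^{(r)},\hat\theta^{(r)}\bigr)^{-1}\, U\bigl(\hat\beta^{(r)},\hat\theta^{(r)}\bigr), \qquad r = 0, 1, 2, \ldots,$$
with iteration stopping once the change in the estimates (or the norm of the score) falls below a chosen tolerance.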
The asymptotic properties of these estimates are based on the properties of martingales and accompanying Taylor expansions. As with the standard multivariate case (Pedroso de Lima, 1995), the score equations can be represented by a stochastic integral with respect to a martingale process (under the null hypothesis). We can derive this representation from our score equations by using the Doob–Meyer decomposition. The derivation implies that U_β is itself a sum of martingales, allowing us to use the asymptotic distribution for the score function found in Pedroso de Lima (1995).

6. Example

6.1. Background

In this section, we present the data analyses for the motivating clinical example. The HUG-KIDS study was designed to determine the short-term toxicity profile, laboratory changes, and clinical efficacy associated with hydroxyurea therapy (HU) in pediatric patients with severe sickle cell anemia (Kinney et al., 1999). Eighty-four children with sickle cell anemia, aged five to fifteen years, were enrolled from ten centers. This study was not an efficacy study, but rather a monitoring and dose-escalation study. Specifically, the study was designed to determine the maximum tolerated dose (MTD) of HU and to monitor a cohort of severely affected children treated for one year at MTD for adverse events. Laboratory blood counts were monitored every two weeks in order to identify toxicities as quickly as possible. Each toxicity had specific laboratory measures with predefined boundaries to define it. The study found that HU was a safe and reasonably well tolerated therapy for children.
Our goal is to examine the baseline demographic and laboratory covariates for predictors of longer inter-toxicity times and to describe the relationship between inter-toxicity times. This will allow clinicians to identify which patients would be the best candidates for HU therapy and which patients would need to be monitored more closely.

6.2. Methods

The proportional hazards model chosen for these data parameterizes three covariates for which we will estimate log-hazard ratios, together with a simple form of the hazard dependence parameters. We assume that the covariates are constant within a cluster and denote the vector of these log-hazard ratios by β. The vector of hazard dependence parameters will be denoted by θ, and we assume that the hazard of one event is influenced by the same amount by each of the other events. For instance, the influence of event two on event one is the same as the influence of event three on event one. This is accomplished by having one associated θ parameter for each event. Thus, θ_1 is the multiplicative increase in the hazard due to any other event time being longer than event one. For four events, our alpha matrix would therefore look like:
$$\boldsymbol{\alpha}_k(t) = \begin{pmatrix} \alpha_1^0(t)e^{\beta' X_k} & \cdots & \alpha_1^0(t)\theta_1 e^{\beta' X_k} \\ \vdots & \ddots & \vdots \\ \alpha_4^0(t)\theta_4 e^{\beta' X_k} & \cdots & \alpha_4^0(t)e^{\beta' X_k} \end{pmatrix}.$$
Additionally, we will estimate the variance and covariance between the beta parameters, the variance of the theta parameters, and the covariance between the theta and beta parameters.

6.3. Results

Toxicities were specifically predefined and their presence was monitored every two weeks. Table 1 provides the distribution of the number of toxicities seen as well as the number of children who experienced each number. The covariates used in this analysis were chosen for their clinical importance and availability at the start of drug use. The demographic variables age and gender were used along with the following lab variables: ALT, creatinine, hemoglobin, absolute neutrophil count, platelets, and absolute reticulocyte count. For modeling purposes, ALT, neutrophil counts, platelets, and reticulocyte counts were log transformed because of their strongly skewed distributions. Initial marginal models led us to consider the most promising of these variables, specifically ALT, neutrophils, and reticulocytes. We then estimated the parameters from the MVCP, using the repeated measures parameterization described in the introduction of this chapter. Table 2 presents the estimated log-hazard ratios and dependence parameters, along with their standard errors.

Table 1
Number of patients by frequency of toxicity

Number of toxicities    Number of patients
 0                       9
 1                      13
 2                       9
 3                      11
 4                       7
 5                      10
 6                       6
 7                       6
 8                       5
 9                       3
11                       3
12                       1
18                       1
Table 2
Log-hazard ratios and dependence parameters, with standard errors, from the MVCP model

Variable              Estimate    Std Err
ALT                     0.30       0.12
Neutrophil count       -0.60       0.13
Reticulocyte count     -0.20       0.08
θ1                      0.90       0.23
θ2                      0.60       0.17
θ3                      0.40       0.16
θ4                      0.40       0.14
Table 3
Covariance estimates between log-hazard ratios and hazard dependence parameters

Variable                  θ1           θ2        θ3        θ4
ALT                    -0.0005      0.0003    0.0005   -0.0004
Neutrophil count        0.00007     0.0001    0.0006   -0.0003
Reticulocyte count    > -0.00001    0.0003    0.0006    0.00002
Recall that the thetas are structurally independent from each other, but we do estimate covariance parameters between the beta and theta parameters. Table 3 presents the estimates of the covariance between these parameters.
7. Discussion

The most predictive covariates in the unadjusted marginal models were ALT, neutrophil count, and reticulocyte count. Higher levels of ALT were associated with an increased risk of toxicity, while lower levels of neutrophils and reticulocytes were associated with an increased risk. The MVCP was fit with only four between-event hazard dependence parameters because that was the maximum number the data would support. However, event times for all other events were left in the data and contributed to the estimation of the log-hazard ratios. A separate analysis of the impact of these "extra" events, and of whether they should be used or excluded, is planned for future research.
The theta estimates reveal a protective relationship between the inter-toxicity event times. Recall that each θ_i is the multiplicative increase in the hazard due to any other event time being longer than the ith event. Since all of the θ_i are less than one, there is a decrease in the hazard of the ith event when the other event times are longer. Clinically speaking, subjects with a longer first time to toxicity will tend to continue to have longer inter-toxicity times. Additionally, patients with longer first and second event times have an even greater reduction in their hazard of the third and fourth event times. In contrast, if one of the theta parameters had been greater than one, we would have expected to see short event times for that event associated with long times for the other events. This can happen if the duration of the study is short relative to the event times: a long first event time only allows for short times afterwards, or else the subject is censored. We would then see an artificially inflated estimate of θ_1, leading us to suppose, incorrectly, that patients who hold out longer for their first event succumb more quickly to the subsequent events than those who have a relatively short first event time.
The covariances between the hazard ratios and the hazard event dependencies showed no overall pattern, except that the covariances between these two types of parameters were low. The MVCP has provided insight into the complex research question of which children tolerate HU better than others. In addition to estimating log-hazard ratios for covariates in the presence of multiple events per subject, it also quantifies the relationship between these inter-event times.
There are, however, some outstanding questions about this methodology. The impact of including rare but large clusters while only estimating a few inter-event dependence parameters requires quantification. In this case, only four dependence parameters could be estimated, while some subjects had up to 19 events. Additionally, the magnitude of the theta parameters needs further explanation. Although the interpretation is a straightforward multiplicative effect on the hazard, hazards themselves are difficult to translate into objective clinical terms. Further data analyses could yield more insights and potential pitfalls for this methodology.
Appendix A

This appendix gives the details of the likelihood function, score functions, and information matrix, and describes the asymptotic properties of the model for the variably-sized clusters found in sequential survival data. It is an extension of Pedroso de Lima's (1995) work.

Sequential data likelihood and score functions

Obviously, there is a great deal of flexibility in describing the model using the proportional hazards MVCP framework. Since our focus is on sequential events within an individual, we will use a simple model for the hazard ratios which assumes that the covariates have a common effect across individuals and events. This also circumvents the problem of lack of data and excessive variability in the extreme cases of many events. Additionally, we will parameterize cluster-specific pairwise dependencies between the events. This leads to a model of the form:
$$\alpha_{kii}(t) = \alpha_i^0(t)e^{\beta X_k}, \qquad \alpha_{kij}(t) = \alpha_i^0(t)(1+\theta_{ki})e^{\beta X_k} - \alpha_i^0(t)e^{\beta X_k} = \alpha_i^0(t)\theta_{ki}e^{\beta X_k},$$
where k represents the cluster, and i and j index the events in the cluster (i ≠ j). The parameters of interest are β and θ; the α_i^0(t) are treated as nuisance parameters and ignored, just like the baseline hazard in the standard multivariate version of the partial likelihood used in the proportional hazards model.
Thus, the contribution to the overall likelihood of the kth cluster at time t is the hazard function of the ith event divided by the sum of the hazards of the ith event over all clusters at risk when the event occurs:
$$l_{ki} = \frac{\lambda_{ki}(t)}{\sum_{l=1}^{n}\lambda_{li}(t)}.$$
For clusters which have fewer than i events, no contribution to either the numerator or the denominator is made, which nicely handles the problem of variable cluster size. Recall that, for each cluster, only the last event time can be censored, since the events are sequential and each must be observed before the next event time starts. The likelihood is then the product of these contributions over all clusters k, all time points t, and all events i = 1, …, j_k within a cluster. Recall that N_{ki}(t) is the counting process for the ith event in the kth cluster,
and note that m indexes all individuals, with the at-risk processes Y_{mi}(t) indicating which observations should be included in the risk set count in the denominator. With this sequential parameterization, we have a partial likelihood of:
$$L_n(\beta,\theta) = \prod_{k=1}^{n}\prod_{t\geq 0}\prod_{i=1}^{j_k}\left[\frac{\lambda_{ki}(t)}{\sum_{m=1}^{n}\lambda_{mi}(t)}\right]^{dN_{ki}(t)}$$
$$= \prod_{k=1}^{n}\prod_{t\geq 0}\prod_{i=1}^{j_k}\left[\frac{\alpha_i^0(t)\,e^{\beta X_k}\bigl(Y_{ki}(t)+\theta_{ki}\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)\bigr)}{\sum_{m=1}^{n}\alpha_i^0(t)\,e^{\beta X_m}\bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\bigr)}\right]^{dN_{ki}(t)}$$
$$= \prod_{k=1}^{n}\prod_{t\geq 0}\prod_{i=1}^{j_k}\left[\frac{e^{\beta X_k}\bigl(Y_{ki}(t)+\theta_{ki}\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)\bigr)}{\sum_{m=1}^{n}e^{\beta X_m}\bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\bigr)}\right]^{dN_{ki}(t)}.$$
It simplifies just like the standard multivariate partial likelihood of Nielsen, Gill, Andersen and Sorensen: the baseline hazard functions cancel, leaving a partial likelihood for the parameters of interest, β and θ. This in turn gives a log-likelihood of the form:
$$\log L_n(\beta,\theta) = \int_0^\tau\sum_{k=1}^{n}\sum_{i=1}^{j_k}\left[\beta X_k + \log\Bigl(Y_{ki}(t)+\theta_{ki}\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)\Bigr) - \log\Bigl(\sum_{m=1}^{n}e^{\beta X_m}\Bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\Bigr)\Bigr)\right]dN_{ki}(t).$$
The corresponding score functions can then be computed by simple differentiation with respect to β and all of the θ_{ki}. We first consider the score with respect to β:
$$U_\beta = \frac{\partial\log L_n(\beta,\theta)}{\partial\beta} = \int_0^\tau\sum_{k=1}^{n}\sum_{i=1}^{j_k}\left[X_k - \frac{\sum_{m=1}^{n}X_m e^{\beta X_m}\bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\bigr)}{\sum_{m=1}^{n}e^{\beta X_m}\bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\bigr)}\right]dN_{ki}(t).$$
This score equation can also be viewed as a sum of our covariates X_k, each centered by an empirical average of the X_m calculated using weights $e^{\beta X_m}\bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\bigr)$, which provides an insight into the decomposition of the score function into a sum of martingale processes illustrated below.
The score function with respect to θ_{ki} is:
$$U_{\theta_{ki}} = \frac{\partial\log L_n(\beta,\theta)}{\partial\theta_{ki}} = \int_0^\tau\left[\frac{\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)}{Y_{ki}(t)+\theta_{ki}\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)} - \frac{\sum_{m=1}^{n}e^{\beta X_m}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)}{\sum_{m=1}^{n}e^{\beta X_m}\bigl(Y_{mi}(t)+\theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\bigr)}\right]dN_{ki}(t)$$
for k = 1, …, n; i = 1, …, j_k. Maximum partial likelihood estimators can be obtained by solving each of the following equations:
$$U_\beta = 0, \qquad U_{\theta_{ki}} = 0, \qquad k = 1,\ldots,n;\; i = 1,\ldots,j_k,$$
which need to be computed iteratively because of the complexity of the equations.
The asymptotic properties of these estimates are based on the properties of martingales and accompanying Taylor expansions. As with the standard multivariate case (Pedroso de Lima, 1995), the score equations can be represented by a stochastic integral with respect to a martingale process (under the null hypothesis). We can derive this representation from our score equations by using the Doob–Meyer decomposition. We define a process M_{ki}(t) in the following manner:
$$M_{ki}(t) = N_{ki}(t) - \int_0^{t} w_k(\theta_{ki})\lambda_{ki}(u)\,du,$$
where $w_k(\theta_{ki}) = Y_{ki}(t) + \theta_{ki}\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)$. Due to the predictability and boundedness of Y_{ki}(t) and X_k, we know that M_{ki}(t) is a uniformly integrable martingale, which implies $dM_{ki}(t) = dN_{ki}(t) - w_k(\theta_{ki})\lambda_{ki}(t)\,dt$. We can now substitute for dN_{ki}(t) in our score equation and note that
$$\int_0^\tau\sum_{k=1}^{n}\sum_{i=1}^{j_k}\left[X_k - \frac{\sum_{m=1}^{n}X_m e^{\beta X_m}w_m(\theta_{mi})}{\sum_{m=1}^{n}e^{\beta X_m}w_m(\theta_{mi})}\right]w_k(\theta_{ki})\lambda_{ki}(t)\,dt = 0$$
because of its properties as a martingale. Thus, we are left with
$$U_\beta = \int_0^\tau\sum_{k=1}^{n}\sum_{i=1}^{j_k}\left[X_k - \frac{\sum_{m=1}^{n}X_m e^{\beta X_m}w_m(\theta_{mi})}{\sum_{m=1}^{n}e^{\beta X_m}w_m(\theta_{mi})}\right]dM_{ki}(t),$$
which implies that U_β is itself a sum of martingales, allowing us to use the asymptotic distribution for the score function found in Pedroso de Lima (1995).
Information matrix

We now describe the second derivatives of the log-likelihood, which make up the observed information matrix, an estimate of the covariance matrix of the score vector. This is an important component in developing inference about the variability of the maximum partial likelihood estimators. It is very similar to the derivation found in Pedroso de Lima (1995), with changes allowing for the new parameterization for variable cluster size.
$$\frac{\partial U_\beta}{\partial\beta} = \int_0^\tau\sum_{k=1}^{n}\sum_{i=1}^{j_k}\left[\frac{\sum_{m=1}^{n}X_m^2 e^{\beta X_m}w_m(\theta_{mi})}{\sum_{m=1}^{n}e^{\beta X_m}w_m(\theta_{mi})} - \left(\frac{\sum_{m=1}^{n}X_m e^{\beta X_m}w_m(\theta_{mi})}{\sum_{m=1}^{n}e^{\beta X_m}w_m(\theta_{mi})}\right)^{2}\right]dN_{ki}(t),$$
where $w_m(\theta_{mi}) = Y_{mi}(t) + \theta_{mi}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)$;
$$\frac{\partial U_\beta}{\partial\theta_{ki}} = \frac{\partial U_{\theta_{ki}}}{\partial\beta} = -\int_0^\tau\left[\frac{\sum_{m=1}^{n}X_m e^{\beta X_m}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)}{\sum_{m=1}^{n}e^{\beta X_m}w_m(\theta_{mi})} - \frac{\Bigl(\sum_{m=1}^{n}X_m e^{\beta X_m}w_m(\theta_{mi})\Bigr)\Bigl(\sum_{m=1}^{n}e^{\beta X_m}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\Bigr)}{\Bigl(\sum_{m=1}^{n}e^{\beta X_m}w_m(\theta_{mi})\Bigr)^{2}}\right]dN_{ki}(t);$$
and finally,
$$\frac{\partial U_{\theta_{ki}}}{\partial\theta_{ki}} = \int_0^\tau\left[-\frac{\Bigl(\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)\Bigr)^{2}}{\Bigl(Y_{ki}(t)+\theta_{ki}\sum_{l\neq i}Y_{ki}(t)Y_{kl}(t)\Bigr)^{2}} - \frac{\Bigl(\sum_{m=1}^{n}e^{\beta X_m}\sum_{l\neq m}Y_{mi}(t)Y_{ml}(t)\Bigr)^{2}}{\Bigl(\sum_{m=1}^{n}e^{\beta X_m}w_m(\theta_{mi})\Bigr)^{2}}\right]dN_{ki}(t).$$
Note that the other second derivatives are equal to zero; specifically,
$$\frac{\partial U_{\beta_i}}{\partial\beta_j} = 0, \qquad \frac{\partial U_{\theta_{ki}}}{\partial\theta_{kj}} = 0, \qquad \frac{\partial U_{\theta_{ki}}}{\partial\theta_{li}} = 0.$$
This leads to a block diagonal form for the observed information matrix,
$$\mathcal{I} = \begin{pmatrix} I_1 & 0 & \cdots & 0 \\ 0 & I_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & I_{j_k} \end{pmatrix},$$
where I_i is an n × n matrix of the form
$$I_i = \begin{pmatrix} -\dfrac{\partial U_\beta}{\partial\beta} & -\dfrac{\partial U_\beta}{\partial\theta_{1i}} & \cdots & -\dfrac{\partial U_\beta}{\partial\theta_{ni}} \\[1ex] -\dfrac{\partial U_{\theta_{1i}}}{\partial\beta} & -\dfrac{\partial U_{\theta_{1i}}}{\partial\theta_{1i}} & \cdots & -\dfrac{\partial U_{\theta_{1i}}}{\partial\theta_{ni}} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] -\dfrac{\partial U_{\theta_{ni}}}{\partial\beta} & -\dfrac{\partial U_{\theta_{ni}}}{\partial\theta_{1i}} & \cdots & -\dfrac{\partial U_{\theta_{ni}}}{\partial\theta_{ni}} \end{pmatrix}.$$
The information matrix can therefore be used to calculate the variance-covariance matrix of the maximum partial likelihood estimators described in Section 3.2. With the MPLEs and their variances, inferential tests can be performed. The computations for the MPLEs of β and θ as well as the variance-covariance matrix were performed in C. The program is based in part on Terry Therneau’s coxfit2.c, a component of his Survival 4 package of functions in Splus®. The program uses Efron’s approximation for tied event times and the Newton–Raphson method for calculating sequential iterations. It can currently accommodate up to twenty-five covariates and twenty-five events within a cluster.
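The authors' C program is not reproduced here, but the flavor of the computation can be conveyed with a much simpler stand-in. The sketch below is a minimal illustration of Newton–Raphson iteration for an ordinary Cox partial likelihood with a single covariate and Breslow handling of ties; it is not the MVCP fitting routine, omits the θ dependence parameters, and all function and variable names are hypothetical.

```python
import numpy as np

def cox_newton_raphson(time, status, x, max_iter=25, tol=1e-8):
    """Fit a one-covariate Cox model by Newton-Raphson (Breslow ties).

    time   : observed times (event or censoring)
    status : 1 if the time is an observed event, 0 if censored
    x      : scalar covariate per subject
    Returns the estimated log-hazard ratio.
    """
    time, status, x = map(np.asarray, (time, status, x))
    beta = 0.0
    for _ in range(max_iter):
        risk = np.exp(beta * x)
        score, info = 0.0, 0.0
        # Loop over distinct event times; the risk set is everyone still
        # under observation at that time (time >= t).
        for t in np.unique(time[status == 1]):
            at_risk = time >= t
            d = np.sum((time == t) & (status == 1))       # events tied at t
            s0 = np.sum(risk[at_risk])                    # sum of exp(beta*x)
            s1 = np.sum(x[at_risk] * risk[at_risk])       # weighted sum of x
            s2 = np.sum(x[at_risk] ** 2 * risk[at_risk])  # weighted sum of x^2
            xbar = s1 / s0
            score += np.sum(x[(time == t) & (status == 1)]) - d * xbar
            info += d * (s2 / s0 - xbar ** 2)
        step = score / info                               # Newton-Raphson step
        beta += step
        if abs(step) < tol:
            break
    return beta

# Toy usage with simulated data (purely illustrative):
rng = np.random.default_rng(0)
xcov = rng.binomial(1, 0.5, 200)
times = rng.exponential(1.0 / np.exp(0.5 * xcov))
cens = rng.exponential(2.0, 200)
obs, delta = np.minimum(times, cens), (times <= cens).astype(int)
print(cox_newton_raphson(obs, delta, xcov))
```

The MVCP computation follows the same pattern, but with the score and information replaced by the (β, θ) quantities derived above and a correspondingly larger system of equations per iteration.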
References Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York. Clegg, L.X., Cai, J., Sen, P.K. (1999). A marginal mixed baseline hazards model for multivariate failure time data. Biometrics 55 (3), 805–812. Cox, D.R. (1972). Regression models and life-tables (with discussion). JRSS-B 34, 187–220. Cox, D.R. (1975). Partial likelihood. Biometrika 62 (2), 269–276. DeMasi, R.A., Qaqish, B.F., Sen, P.K. (1998). A family of bivariate failure time distributions with proportional crude and conditional hazards. Comm. Statist.: Theory Methods 27, 365–394. DeMasi, R.A., Qaqish, B.F., Sen, P.K. (1997). Statistical models and asymptotic results for multivariate failure time data with generalized competing risks. Sankhya (Ser. A) 59, 408–434. DeMasi, R.A. (2000). Statistical methods for multivariate failure time data and competing risks. In: Handbook of Statistics, Vol. 18. Elsevier, Amsterdam. Chapter 25. Fleming, T.R., Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York. Hoffman, E.B., Sen, P.K., Weinberg, C.R. (2001). Within-cluster resampling. Biometrika 88 (4), 1121–1134. Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. Kinney, T.R., Helms, R.W., O’Branski, E.E., Ohene-Frempong, K., Wang, W., Daeschner, C., Vichinsky, E., Redding-Lallinger, R., Gee, B., Platt, O.S., Ware, R.E. (1999). Safety of hydroxyurea in children with sickle cell anemia: Results of the HUG-KIDS study, a phase I/II trial. Pediatric Hydroxyurea Group. Blood. 94 (5), 1550–1554. Murphy, S.A. (1995). Asymptotic theory for the frailty model. Ann. Statist. 23 (1), 182–198.
Nielsen, G.G., Gill, R.D., Andersen, P.K., Sorensen, T.I.A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Statist. 19, 25–43.
Pedroso de Lima, A.C. (1995). A semi-parametric matrix-valued counting process model for survival analysis. Dissertation. Institute of Statistics, Mimeo Series 2145T, Chapel Hill, NC.
Pedroso de Lima, A.C., Sen, P.K. (1997). A matrix-valued counting process with first-order interactive intensities. Ann. Appl. Probab. 7 (2).
Pedroso de Lima, A.C., Sen, P.K. (1998). Bivariate exponentials, dual counting processes and their intensities. J. Combin. Inform. System Sci. 23 (1–4), 33–46.
Tien, H.-C., Sen, P.K. (2002). A proportional hazards model for bivariate survival data under interval censoring. Sankhya Ser. A 64 (2), 409–428.
Further reading Aalen, O.O. (1978). Nonparametric inference for a family of counting processes. Ann. Statist. 6, 701–726. Balakrishnan, N., Basu, A.P. (Eds.) (1995). The Exponential Distribution: Theory, Methods, and Applications. Gordon and Breach, Amsterdam. Bandeen-Roche, K.J., Liang, K.Y. (1996). Modelling failure-time associations in data with multiple levels of clustering. Biometrika 83 (1), 29–39. Basu, A.P. (1995). Bivariate exponential distributions. In: Balakrishnan, N., Basu, A.P. (Eds.), The Exponential Distribution: Theory, Methods, and Applications. Gordon and Breach, Amsterdam. Block, H.W., Basu, A.P. (1974). A continuous bivariate exponential extension. JASA 69, 1031–1037. Bremaud, P. (1981). Point Processes and Queues: Martingale Dynamics. Springer, New York. Cai, J., Prentice, R.L. (1995). Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika 82 (1), 151–164. Clayton, D., Cuzick, J. (1985). Multivariate generalizations of the proportional hazards model. JRSS-A 48 (2), 82–117. Clayton, D.G. (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65 (1), 151. Clegg, L. (1997). Marginal models for multivariate failure time data with generalized dependence structure. Dissertation. Institute of Statistics, Mimeo Series 2184T Chapel Hill, NC. Clegg, L. (2000). Modeling multivariate failure time data. In: Handbook of Statistics, Vol. 18. Elsevier, Amsterdam. Chapter 27. Cox, D.R., Oakes, D. (1984). Analysis of Survival Data. Chapman and Hall, New York. Cochran, W.G. (1963). Sampling Techniques, 2nd edn. Wiley, New York. DeMasi, R.A. (1994). Proportional hazards models for multivariate failure time data with generalized competing risks. Dissertation. Institute of Statistics, Mimeo Series 2139T Chapel Hill, NC. Genest, C., MacKay, J. (1986). The joy of copulas: Bivariate distributions with uniform marginals. Amer. Statist. 40 (4), 280–283. Hilgartner, M., Donfield, S., Willoughby, A., Contant, C., Evatt, B., Gomperts, E., Hoots, K., Jason, J., Loveland, K., McKinlay, S., Stehbens, J. (1993). Hemophilia growth and development study. Amer. J. Pediatric Hematol. Oncol. 15 (2), 208–218. Hoots, W.K., Mahoney, E., Donfield, S., Bale, J., Stehbens, J., Maeder, M., Loveland, K., Contant, C. (1998). Are there clinical and laboratory predictors of 5-year mortality in HIV-invected children and adolescents with hemophilia? JAIDS and HR 18, 349–357. Hougaard, P. (1998). Frailty. In: Armitage, Colton (Eds.), Encyclopedia of Biostatistics. Wiley, New York. Hougaard, P. (1987). Modelling multivariate survival. Scand. J. Statist. 14, 291–304. Hougaard, P. (1986a). Survival models for heterogeneous populations derived from stable distributions. Biometrika 73 (2), 387–396. Hougaard, P. (1986b). A class of multivariate failure time distributions. Biometrika 73 (3), 671–678. Kish, L. (1965). Survey Sampling. Wiley, New York. Klein, J.P. (1995). Inference for multivariate exponential distributions. In: Balakrishnan, N., Basu, A.P. (Eds.), The Exponential Distribution: Theory, Methods, and Applications. Gordon and Breach, Amsterdam.
Klein, J.P., Basu, A.P. (1985). Estimating reliability for bivariate exponential distributions. Sankhya-B 47, 346–353. Lee, E.W., Wei, L.J., Amato, D.A. (1992). Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In: Klein, J.P., Goel, P.K. (Eds.), Survival Analysis: State of the Art. Kluwer Academic, Dordrecht, pp. 237–247. Liang, K.L., Self, S.G., Chang, Y.C. (1993). Modelling marginal hazards in multivariate failure time data. JRSS-B 55, 441–453. Munoz, S.R. (1994). Group sequential methods for bivariate survival data in clinical trials: A proposed analytic method. Dissertation. Institute of Statistics, Mimeo Series 2140T Chapel Hill, NC. Murphy, S.A. (1994). Consistency in a proportional hazards model incorporating a random effect. Ann. Statist. 22 (2), 712–731. Oakes, D. (2000). Survival analysis. JASA 95, 282–285. Oakes, D. (1994). Multivariate survival distributions. Nonparametric Statist. 3, 343–354. Oakes, D. (1989). Multivariate survival models induced by frailties. JASA 84, 487–493. Oakes, D., Manatunga, A.K. (1992). Fisher information for a bivariate extreme value distribution. Biometrika 79 (4), 827–832. Prentice, R.L., Cai, J. (1992). Covariance and survivor function estimation using censored multivariate failure time data. Biometrika 79 (3), 495–512. Sen, P.K. (1981). The Cox regression model, invariance principles for some induced quantile processes and some repeated significance tests. Ann. Statist. 9 (1), 109–121. SenGupta, A. (1995). Optimal tests in multivariate exponential distributions. In: Balakrishnan, N., Basu, A.P. (Eds.), The Exponential Distribution: Theory, Methods, and Applications. Gordon and Breach, Amsterdam. Skinner, C.J., Holt, D., Smith, T.M.F. (1989). Analysis of Complex Surveys. Wiley, New York. Wei, L.J., Lin, D.Y., Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. JASA 84, 1065–1073.
34
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23034-0
Analysis of Recurrent Event Data
Jianwen Cai and Douglas E. Schaubel
1. Introduction

In many research settings, the event of interest can be experienced more than once per subject. Such outcomes have been termed recurrent events. Recurrent event data are often encountered in biomedicine (e.g., opportunistic infections among AIDS patients), demography (e.g., birth patterns among women of child-bearing age) and quality control (e.g., automobile repairs). Typically, the study will have a fixed termination date, such that the occurrence times are potentially censored. The data structure for recurrent events represents a special case of multivariate survival data, where the failure times for a subject are ordered. As such, recurrent event data have often been analyzed using methods of multivariate survival analysis (e.g., Wei et al., 1989; Prentice et al., 1981; Andersen and Gill, 1982). However, fundamental characteristics of recurrent event data mean that care must be exercised in the application of methods designed for a larger class of general data structures amenable to multivariate survival analysis. Correspondingly, the analysis of recurrent event data has recently been the subject of much methodological research.
In this chapter, we describe non- and semi-parametric methods of analyzing recurrent event data. The chapter is organized as follows. In the second section, we set up essential notation, then focus on the functions which are modelled in the analysis of recurrent event data. In the third section, semiparametric regression methods are described, with special attention given to elucidating the sometimes subtle differences between the methods with respect to interpretation of parameter estimates. The methods are illustrated using an example analysis of a preschool asthma data set. In Section 4, we describe various nonparametric estimation methods which can be used for recurrent event data. The final section contains some concluding remarks and lists some very recent work dealing with data structures which lie beyond the scope of this chapter.

2. Notation and basic functions of interest

Throughout this chapter, the notation will be as follows. $N_i(t) = \int_0^t dN_i(s)$ will represent the number of events in [0, t] for subject i, for i = 1, …, n, where dN_i(s) denotes
the number of events in the small time interval [s, s + ds). We assume that subject i is observed over the interval [0, C_i], where C_i denotes the censoring time, and is observed to experience events at times T_{i,1}, …, T_{i,m_i}. We assume that C_i is determined independently of {dN_i(t); t ≥ 0}. In the presence of covariates, the assumption is relaxed to that of conditional independence, given the covariate vector. Subject i's observation time with respect to the kth event is denoted by X_{i,k} = T_{i,k} ∧ C_i, where a ∧ b ≡ min(a, b). The corresponding observed event indicators are given by Δ_{ik} = I(T_{i,k} < C_i), where I(A) = 1 when the event A holds, and 0 otherwise. We denote inter-event (gap) times by T_{ij} = T_{i,j} − T_{i,j−1}, with T_{i,0} ≡ 0. Let $m = \sum_{i=1}^n m_i$. Let Z_i(s) be a possibly time-dependent covariate vector.
We refer to {N_i(t); t ≥ 0} as a subject-specific counting process (Ross, 1989) since (i) N_i(t) ≥ 0; (ii) N_i(t) is integer-valued; (iii) for s < t, N_i(s) ≤ N_i(t); and (iv) for s < t, the number of events in (s, t) is given by N_i(t) − N_i(s).
Analysis of recurrent event data is usually achieved through the framework of counting process (Andersen et al., 1993) or point process (Cox and Isham, 1980) models. With these approaches, parameter estimation is often carried out through maximum likelihood or partial likelihood, for which a model that completely specifies the probability distribution of a subject's event history is required. Specification of the entire distribution of {dN_i(s); s ≥ 0} is usually not desirable. Since the distribution can be complicated, a suitably comprehensive model may be very complex, provided it even exists. Moreover, usually only certain components of the event process are of interest to the investigator. These facts have motivated methods which focus on the marginal means or rates functions. We now formally define the terms intensity, rate and mean. Although these functions have strong inter-relationships, their differences are essential in the models we discuss.

2.1. Intensity function

Let $\mathcal{N}_i(t) = \{N_i(s); s \in [0, t)\}$ denote the ith subject's event history at time t−. Suppressing dependence on {Z_i(s); s ∈ [0, t)}, the conditional intensity (hereafter referred to as simply the intensity) is defined by:
$$\lambda_i\{t \mid \mathcal{N}_i(t)\} = \lim_{\delta\to 0}\frac{1}{\delta}\,P\{dN_i(t) = 1 \mid \mathcal{N}_i(t)\}.$$
If the process {N_i(t); t > 0} contains only unit increments, then we can also write $E[dN_i(t) \mid \mathcal{N}_i(t)] = \lambda_i\{t \mid \mathcal{N}_i(t)\}\,dt$. The probability distribution of {dN_i(t); t ≥ 0} can be specified completely in terms of $\lambda_i\{t\mid\mathcal{N}_i(t)\}$, as discussed by Andersen et al. (1993). In particular, the probability density that exactly m_i events occur for subject i at times t_{i,1}, …, t_{i,m_i} is given by:
$$\prod_{j=1}^{m_i}\lambda_i\{t_{i,j}\mid\mathcal{N}_i(t_{i,j})\}\; e^{-\int_0^\tau \lambda_i\{s\mid\mathcal{N}_i(s)\}\,ds},$$
where τ is a pre-specified point in time. This expression is valid quite generally (Lawless, 1995) and provides the basis for maximum likelihood estimation and the associated inference procedures. Note that although MLE is in theory available, since the probability distribution is completely specified through $\lambda_i\{t\mid\mathcal{N}_i(t)\}$, such estimation may be
very difficult to implement in practice. For Poisson processes, $\lambda_i\{t\mid\mathcal{N}_i(t)\} = \lambda_i(t)$, while for renewal processes, $\lambda_i\{t\mid\mathcal{N}_i(t)\} = \lambda_i(t - T_{i,N_i(t-)})$ (Chiang, 1968). The recurrent event process might be modelled as a Poisson process when event counts are of interest, and as a renewal process when times between successive events are of concern.

2.2. Mean and rate functions

The simplicity of Poisson and renewal process models makes them attractive. However, in practice neither model may describe the recurrent event sequence sufficiently well. When the objective of the investigation is to assess the effect of covariates on the process, analysis of the marginal distributions is suggested, particularly because assumptions on the intra-subject dependence structure are avoided. Marginal means and rates can be modeled. These functions are appealing because they are easily interpreted by non-statisticians, and because they are often of direct interest to investigators. The rate function is defined by $d\mu_i(s) = E[dN_i(s)]$; the cumulative rate function is then given by $\mu_i(t) = E[N_i(t)]$, which has the interpretation of a mean function when Z_i(·) consists only of "external" covariates (i.e., covariates unaffected by the recurrent event process (Kalbfleisch and Prentice, 2002)). Although $d\mu_i(s) = \lambda_i(s)\,ds$ for the Poisson process, in general the two are quite distinct, and $d\mu_i(s)$ hence need not characterize the entire distribution of the recurrent event process. There is usually no simple relationship between a given marginal model and its corresponding conditional counterpart. A direct interpretation of the effect of {Z_i(s); s ∈ [0, τ]} on {dN_i(s); s ∈ [0, τ]} in a conditional model often occurs with no such relationship in the marginal analog, and vice versa. In addition, marginal specifications in terms of counts, times to events, and inter-event times are different, as implied by the relationships:
$$P\{N_i(t) < k\} = P(T_{i,k} > t) = P\Bigl(\sum_{j=1}^{k} T_{ij} > t\Bigr).$$
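To make the notation concrete, the following small simulation sketch generates recurrent events for a few subjects and tabulates the quantities defined above (total times T_{i,k}, gap times T_{ik}, observation times X_{i,k} and indicators Δ_{ik}). It is an illustrative construction only: the event-generating mechanism (a homogeneous Poisson process), the censoring distribution, and all variable names are assumptions, not part of the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, rate = 5, 1.5                 # assumed event rate per unit time

for i in range(n_subjects):
    C_i = rng.uniform(0.5, 3.0)           # censoring time for subject i
    # Homogeneous Poisson process: gap times are i.i.d. exponential(1/rate).
    gap_times, total = [], 0.0
    while True:
        gap = rng.exponential(1.0 / rate)
        if total + gap > C_i:
            break
        total += gap
        gap_times.append(gap)
    total_times = np.cumsum(gap_times)                 # T_{i,1}, ..., T_{i,m_i}
    m_i = len(total_times)
    X_i = list(np.minimum(total_times, C_i)) + [C_i]   # X_{i,k} = T_{i,k} ^ C_i
    Delta_i = [1] * m_i + [0]                          # final interval is censored
    print(f"subject {i}: C_i={C_i:.2f}  N_i(C_i)={m_i}  "
          f"gaps={np.round(gap_times, 2)}  X={np.round(X_i, 2)}  Delta={Delta_i}")
```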
3. Semiparametric models for recurrent event data

The effects of certain covariates on the recurrent event process are often of interest to investigators. In this section, we describe various semiparametric models which often form the basis for regression analysis of recurrent event data. Several of the models are illustrated through the analysis of a retrospective cohort study on preschool asthma (Schaubel et al., 1996). For this study, children born during a one-year period, April 1, 1984 to March 31, 1985, were followed retrospectively from birth until March 31, 1989. Thus, children were between the ages of 4 and 5 when they were censored. Files with information based on birth characteristics were deterministically linked with physician claim files; details are available in Schaubel et al. (1996). The chief goal of the investigation was to estimate the effects of gender and various adverse birth characteristics on preschool asthma. The recurrent events of interest are physician office visits attributable to asthma.
3.1. Conditional regression models

3.1.1. Andersen–Gill (AG) proportional intensity model

The most popular regression model for assessing the effect of covariates on survival probability is the Cox proportional hazards model (Cox, 1972), in which it is assumed that the hazard function for the ith subject at time t is given by:
$$\lambda_i(t) = \lambda_0(t)e^{\beta_0^T Z_i(t)},$$
where λ_0(t) is an unspecified non-negative function representing the baseline hazard and β_0 is a p × 1 vector of unknown parameters. Estimation is performed via partial likelihood (Cox, 1975), while the large-sample properties of the parameter estimators can be derived through the theory of martingales (Andersen and Gill, 1982) or empirical processes (Tsiatis, 1981).
The Andersen–Gill (Andersen and Gill, 1982) model is a generalization of the Cox proportional hazards model which relates the recurrent event intensity process to the covariates through the formulation:
$$\lambda_{ik}(t) = \lambda_0(t)e^{\beta_0^T Z_i(t)}, \qquad (1)$$
for the kth recurrence, k = 1, …, K, where K < ∞. The at-risk process for the AG model is defined as $Y_{ik}(t) = I(X_{i,k-1} < t \le X_{i,k})$. Although subjects may experience multiple events, no subject may experience more than one event at any specific time; that is, dN_i(s) ∈ {0, 1}. Parameter estimation is carried out via partial likelihood (Cox, 1975). The estimator of β_0, denoted by $\hat\beta_n^A$, is obtained by iteratively solving the estimating equation $U_n^A(\beta) = 0_{p\times 1}$, where:
$$U_n^A(\beta) = \sum_{i=1}^{n}\int_0^\tau\{Z_i(s) - E(s;\beta)\}\,dN_i(s),$$
with $E(s;\beta) = S^{(1)}(s;\beta)/S^{(0)}(s;\beta)$, $S^{(r)}(s;\beta) = n^{-1}\sum_{i=1}^{n}\sum_{k=1}^{K} Y_{ik}(s)Z_i(s)^{\otimes r}e^{\beta^T Z_i(s)}$ and, for a vector z, z^{⊗0} = 1, z^{⊗1} = z and z^{⊗2} = zz^T. The Breslow–Aalen (Breslow, 1974) estimate of the cumulative baseline hazard, $\Lambda_0(t) = \int_0^t\lambda_0(s)\,ds$, is given by $\hat\Lambda_0(t;\hat\beta_n^A) = n^{-1}\int_0^t dN_\cdot(s)/S^{(0)}(s;\hat\beta_n^A)$, where $dN_\cdot(s) = \sum_{i=1}^{n} dN_i(s)$.
Under certain regularity conditions, as n → ∞, $\hat\beta_n^A \xrightarrow{P} \beta_0$ and $n^{1/2}(\hat\beta_n^A - \beta_0) \xrightarrow{D} N_p(0_{p\times 1}, A(\beta_0)^{-1})$, where $A_n(\beta) = -\partial U_n^A(\beta)/\partial\beta \xrightarrow{P} A(\beta)$, and a consistent estimator of the covariance matrix is given by $A_n(\hat\beta_n^A)$, with
$$A_n(\beta) = n^{-1}\sum_{i=1}^{n}\int_0^\tau\left\{\frac{S^{(2)}(s;\beta)}{S^{(0)}(s;\beta)} - E(s;\beta)^{\otimes 2}\right\}dN_i(s).$$
Extending results of Huber (1967) and White (1982) for mis-specified models, the robust covariance matrix is given by $\Sigma(\beta_0) = A(\beta_0)^{-1}B(\beta_0)A(\beta_0)^{-1}$ (Lin and Wei, 1989), which can be estimated consistently using $A_n(\hat\beta_n^A)$ given above and $B_n(\hat\beta_n^A)$, where
$$B_n(\beta) = n^{-1}\sum_{i=1}^{n}\int_0^\tau\{Z_i(s) - E(s;\beta)\}^{\otimes 2}\,d\hat M_i(s;\beta),$$
with $d\hat M_i(s;\beta) = dN_i(s) - Y_i(s)e^{\beta^T Z_i(s)}\,d\hat\Lambda_0(s;\beta)$.
Note that, when model (1) holds, $M_i(t;\beta_0)$ is a martingale with respect to the filtration (a right-continuous, nondecreasing family of σ-algebras) $\mathcal{F}_t = \sigma\{Y_i(s), N_i(s-), Z_i(s); s\in[0,t]\}$, which is the essential property used by Andersen and Gill to derive the large-sample results for their model.
In the Andersen–Gill (AG) model, each subject is treated as a counting process with essentially independent increments (i.e., a non-homogeneous Poisson process (Chiang, 1968)), since λ_{ik}(t) is assumed independent of $\mathcal{N}_i(t)$ except for dependencies captured by appropriately specified time-dependent covariates (e.g., the number of previous events, or some function thereof). The AG model is among the simplest to conceptualize and program. It is similar in spirit to the piece-wise exponential model, which is often fitted via Poisson regression numerical routines. In fact, the AG model can be accurately approximated by Poisson regression software, analogous to Laird and Olivier's (1981) approximation of the univariate Cox (1972) model. In an illustrative comparison of several recurrent event models, Lin (1994) recommends the Andersen–Gill model when the investigator is only interested in the overall recurrence rate, and when only a small proportion of subjects have $N_i(\tau) \ge 2$.
Assuming that the AG model is correct, the previously listed consistency and asymptotic normality results hold, although the asymptotic covariance matrix is given by $A(\beta_0)^{-1}$ rather than the sandwich form. However, the independent increment assumption is strong and may be untenable. Since past events are likely to be positively correlated with future events, when the independence assumption fails, $V(\hat\beta_n)$ will be underestimated by the "naive" variance, $A(\beta_0)^{-1}$, and the robust variance, $A(\beta_0)^{-1}B(\beta_0)A(\beta_0)^{-1}$, can be employed. However, as discussed by Lin and Wei (1989), without the independent increment structure, the observed data may follow a model which does not satisfy the proportional hazards assumption, rendering the limiting value of $\hat\beta_n^A$ difficult to interpret. Nonetheless, in cases where the model is approximately true, $\hat\beta_n^A$ is a useful statistic even when the underlying assumptions do not hold; in practice, for example, truly proportional hazards are rarely observed.
Table 1 contains the results of three different AG models fitted to the asthma data set. Each model contained terms for the covariates of chief interest, namely low birth weight (LBW), respiratory distress syndrome (RDS), transient tachypnea of the newborn (TTN), birth asphyxia (ASPH) and male gender (SEXM). All models we examine in this chapter contain these covariates. Note that HR denotes the hazard ratio, the exponential of the estimated coefficient. The three models in Table 1 differ with respect to the form of the variates used to represent previous event history. Based on the first model (Model 1), SEXM and all birth conditions except TTN have highly significant effects on the intensity function for asthma. Previous event history was represented in Model 1 by the previous number of events; the asthma hazard is increased by a multiplicative factor of 1.20 with each additional previous office visit (HR = 1.20; p < 10^-4).
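As a rough illustration of the Poisson-approximation idea mentioned above, the sketch below fits a log-linear Poisson model to episode-level data with the log of time at risk as an offset, a generic piece-wise exponential approximation to a proportional rates model. It is not the analysis performed in this chapter; the toy data frame, its column names, and the use of the statsmodels library are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical episode-level data: one row per at-risk interval, with the
# number of events in the interval, the interval length, and covariates.
df = pd.DataFrame({
    "events":   [0, 1, 1, 0, 2, 0],
    "exposure": [1.2, 0.8, 1.5, 2.0, 0.9, 1.1],   # time at risk in the interval
    "sexm":     [1, 1, 0, 0, 1, 0],
    "lbw":      [0, 1, 0, 0, 1, 1],
})

X = sm.add_constant(df[["sexm", "lbw"]])
# Poisson regression with a log(exposure) offset; exponentiated coefficients
# play the role of approximate rate ratios.
fit = sm.GLM(df["events"], X, family=sm.families.Poisson(),
             offset=np.log(df["exposure"])).fit()
print(np.exp(fit.params))
```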
Table 1
Analysis of preschool asthma data: estimated hazard ratios based on the Andersen–Gill model

                      Model 1             Model 2             Model 3
Covariate             HR      p           HR      p           HR       p
Ni(t−)                1.20    < 10^-4     2.01    < 10^-4     –        –
Ni(t−)^2              –       –           0.98    < 10^-4     –        –
I(Ni(t−) = 1)         –       –           –       –           24.52    < 10^-4
I(Ni(t−) = 2)         –       –           –       –           46.23    < 10^-4
I(Ni(t−) ≥ 3)         –       –           –       –           139.67   < 10^-4
LBW                   1.53    < 10^-2     1.21    0.10        1.04     0.77
RDS                   1.94    < 10^-3     1.00    0.98        1.25     0.18
TTN                   1.37    0.09        1.36    0.04        0.98     0.91
ASPH                  1.42    < 10^-2     1.29    < 10^-3     1.14     0.20
SEXM                  1.61    < 10^-4     1.33    < 10^-4     1.20     < 10^-3
When a quadratic term for the previous number of events was added to the event history (Model 2), the estimated hazard ratios for gender and each birth condition were reduced. In certain cases the drop was dramatic; e.g., for male gender, HR = 1.61 based on Model 1, but HR = 1.33 in Model 2. Note also that LBW and RDS fail to attain statistical significance (p > 0.05) based on Model 2. Model 3 features the most flexible modelling of prior event history, using indicator functions as opposed to assuming a log-linear relationship with the prior event counts. As a result, SEXM, but no birth conditions, attains statistical significance. The estimated effect for SEXM has dropped further to HR = 1.20, which is still highly significant, but not of the same order as that of Model 1.
This example illustrates a limitation of the AG model: it is difficult to model the previous event history. Moreover, one can never be sure that the event history has been specified correctly. It is difficult to interpret differences in the parameter estimates which occur upon using different forms to represent event history. Previous event history represents an internal time-dependent covariate. Essentially, we are modelling the event process (i.e., the intensity function thereof) while adjusting for covariates determined by previous values of the event process. Borrowing terminology from epidemiology, this is tantamount to adjusting for covariates along the "causal pathway", which attenuates covariate effect estimates towards the null value (Rothman and Greenland, 1998).

3.1.2. Prentice–Williams–Peterson (PWP) models

The first extension of the Cox model to multiple event data was developed by Prentice et al. (1981). In this subsection, we consider two classes of PWP models commonly employed in practice. In particular, the intensity function for subject i at time t for the kth recurrence, conditional on $\mathcal{N}_i(t)$, takes the following forms:
$$\lambda_{ik}(t) = Y_{ik}(t)\lambda_{0k}(t)e^{\beta_k^T Z_{ik}(t)}, \qquad (2)$$
$$\lambda_{ik}(t) = Y_{ik}(t)\lambda_{0k}(t - T_{i,k-1})e^{\beta_k^T Z_{ik}(t)}, \qquad (3)$$
for total and gap times, respectively. Essentially, this approach produces a stratified proportional intensity model with time-dependent strata, where dependence between event times is accommodated by stratifying on the previous number of occurrences. For the total time model, risk set indicators are defined the same as in the AG model; i.e., $Y_{ik}(t) = I(X_{i,k-1} \le t < X_{i,k})$. For analyzing gap times, set $Y_{ik}(t) = I(X_{i,k} \ge X_{i,k-1} + t)$ and replace $Z_{ik}(t)$ with $Z_{ik}(X_{i,k-1} + t)$. Unlike the AG approach, the PWP models permit event-specific regression parameters and baseline intensity functions. Thus, when using the PWP model it is necessary to construct $Z_{ik}(t)$ such that the effect of $\mathcal{N}_i(t)$ on $\lambda_{ik}(t)$ is accurately captured.
Parameters are estimated through partial likelihood. For model (2), the $p_k \times 1$ regression parameter $\beta_k$ is estimated as the solution to the estimating equation $U_n^{PT}(\beta_k) = 0_{p_k\times 1}$, where
$$U_n^{PT}(\beta_k) = \sum_{i=1}^{n}\int_0^\tau\{Z_{ik}(s) - E_k^{PT}(s;\beta_k)\}\,dN_{ik}(s), \qquad (4)$$
for k = 1, …, K, where $E_k^{PT}(s;\beta_k) = Q_k^{(1)}(s;\beta_k)/Q_k^{(0)}(s;\beta_k)$, $Q_k^{(r)}(s;\beta_k) = n^{-1}\sum_{i=1}^{n} Y_{ik}(s)Z_{ik}(s)^{\otimes r}e^{\beta_k^T Z_{ik}(s)}$ and $N_{ik}(t) = I(T_{i,k}\le t, \Delta_{ik} = 1)$. For PWP model (3), the partial likelihood estimator is given by the solution to $U_n^{PG}(\beta_k) = 0_{p_k\times 1}$, where
$$U_n^{PG}(\beta_k) = \sum_{i=1}^{n}\int_0^\tau\{Z_{ik}(s + T_{i,k-1}) - E_k^{PG}(s;\beta_k)\}\,d\widetilde N_{ik}(s), \qquad (5)$$
where $E_k^{PG}(s;\beta_k) = R_k^{(1)}(s;\beta_k)/R_k^{(0)}(s;\beta_k)$, $R_k^{(r)}(s;\beta_k) = n^{-1}\sum_{i=1}^{n} Y_{ik}(s)Z_{ik}(T_{i,k-1} + s)^{\otimes r}e^{\beta_k^T Z_{ik}(T_{i,k-1}+s)}$ and $\widetilde N_{ik}(t) = I(T_{ik}\le t, \Delta_{ik} = 1)$.
In fitting PWP models (2) and (3), note that K is usually chosen such that a sufficient amount of experience is observed in the data set under analysis. Interpretation of parameters in these models is hindered by the conditional nature of the model. The main difficulty in analyses based on restricted risk sets is that the "missing completely at random" (MCAR) assumption is violated by the fact that subjects who have not experienced k events are excluded from the analysis with respect to the (k+1)th-event intensity function. However, analyses based on unrestricted risk sets can be criticized because of their "carry-over" effect. Subject-specific total times tend to be highly correlated even when gap times are not (Lipschutz and Snapinn, 1997), since the total times of the (k+1)th risk interval contain those of risk intervals 1, …, k. When a covariate's effect exists only for the initial event, but not subsequent recurrences, an analysis based on total times could deceptively estimate a large treatment effect for the second, third, and later events because of such a carry-over effect. The choice of time scale and risk-set construction ultimately depends on the aims of the investigator. In the PWP model, as in the methods of Andersen and Gill (1982), it is assumed that the information in $\mathcal{N}_i(t)$ is captured by the covariate vector. A valid estimate of the variance can be obtained through a robust variance estimator when this assumption is thought to be possibly violated.
Table 2 lists estimated hazard ratios based on the Prentice et al. (1981) total time model.
Table 2
Analysis of preschool asthma data: estimated hazard ratios based on the Prentice, Williams and Peterson total time model

             1st event            2nd event            3rd event
Covariate    HR      p            HR      p            HR      p
Ti,1         –       –            0.98    < 10^-4      1.02    0.02
Ti,2         –       –            –       –            0.94    < 10^-4
LBW          1.50    < 10^-3      1.11    0.57         1.01    0.98
RDS          1.92    < 10^-4      1.47    0.09         1.45    0.20
TTN          1.39    0.03         0.94    0.77         1.37    0.26
ASPH         1.23    0.02         1.23    0.09         0.96    0.80
SEXM         1.67    < 10^-4      1.23    < 10^-2      1.06    0.57
Table 3
Analysis of preschool asthma data: estimated hazard ratios based on the Prentice, Williams and Peterson gap time model

             1st event            2nd event            3rd event
Covariate    HR      p            HR      p            HR      p
Ti,1         –       –            1.01    0.05         1.00    0.44
Ti,2         –       –            –       –            0.98    < 10^-2
LBW          1.50    < 10^-3      1.11    0.55         1.02    0.89
RDS          1.92    < 10^-4      1.34    0.14         1.35    0.20
TTN          1.39    0.03         0.90    0.59         1.24    0.32
ASPH         1.23    0.02         1.16    0.18         1.12    0.44
SEXM         1.67    < 10^-4      1.18    0.03         1.09    0.33
All covariates demonstrate a significant association with time to first physician visit for asthma. For time to the second asthma visit, only SEXM attains statistical significance, although the effect has been greatly reduced relative to the first visit. No covariates significantly affect the total time to the third physician visit. For the second and third events, note the statistically significant relationship with the time(s) of the preceding events.
For the PWP gap time model (Table 3), we see similar patterns. Naturally, results are the same for event 1, since total and gap times are equal for the time to the first office visit. The same results would be obtained through a Cox regression model. For the gap time of the second office visit, only SEXM remained significant; for the third gap time, no covariates are significant. Judging by the magnitude of the regression coefficients and their p-values, the degree of association was stronger for the total times than for the gap times, which makes sense intuitively.
It is difficult to interpret the decrease in the importance of gender and the birth conditions as the number of events increases. The fact that restricted risk sets are employed is the key. One hypothesis is that, as we move forward in time from event to successive event, the children under consideration (i.e., in the risk set) are progressively less healthy. As such, covariate effects are drowned out by a steadily increasing baseline hazard. Another possibility is that the birth conditions affect only the time until asthma onset, but not the times of subsequent events.
There is little opportunity to empirically assess which of these two possibilities is more likely.

3.1.3. Models for recurrence time hazards

Chang and Wang (1999) proposed the following semiparametric hazards model for recurrent event times:
$$\lambda_{ij}(t) = \lambda_{0j}(t - T_{i,j-1})\,e^{\beta_0^T Z_{i1}(t) + \gamma_j^T Z_{i2}(t)}. \qquad (6)$$
Here, the λ_{0j}(·) are event-specific unspecified non-negative functions, β_0 is the p × 1 "structural" parameter of primary interest, while the q × 1 episode-specific parameters γ_j may or may not be of interest depending on the particular application. As an example of its application, in a study of schizophrenia, gender and marital status may have the same effect for different episodes, but age of disease onset may have distinct effects over different episodes. When β_0 = 0_{p×1}, model (6) reduces to the gap time model proposed by Prentice et al. (1981). This model would be preferred if chief interest lies in the possibly changing pattern of covariate effects across episodes. For example, a medication may be effective in reducing the risk of the first two or three infections, but may not have an impact on subsequent episodes. Note that precise interpretation of covariate effects which steadily decrease in magnitude across the episode sequence is hindered by the fact that the baseline group is changing across the episode sequence. That is, analysis of the (j+1)th infection times is necessarily restricted to subjects who experienced the first j infections, since the immediately preceding infection time serves as the time origin. Thus, it may be difficult to reconcile trends among $\{\hat\gamma_{1:n}, \hat\gamma_{2:n}, \ldots\}$, since the baseline cohort shifts from event to event. In spite of such difficulties in interpretation, patterns in episode-specific covariate effects may still be of interest for descriptive purposes.
Another special case of (6) results from setting $\gamma_j = 0_{q\times 1}$ for j = 1, 2, …; i.e.,
$$\lambda_{ij}(t) = \lambda_{0j}(t - T_{i,j-1})\,e^{\beta_0^T Z_{i1}(t)},$$
similar to model (3) of Prentice et al. (1981), but employing a common regression parameter. This model was studied independently by Chang and Hsiung (1994) and Chang (1995), and would be of interest if all covariate effects are assumed constant across events, or if covariate effects change across episodes but only the average effect is of interest as a descriptive summary measure. Chang and Wang (1999) propose a profile likelihood method to estimate β_0 and the γ_j for j = 1, 2, …. Hence, a key distinction between the approach of Chang and Wang (1999) and those of previous authors (e.g., Prentice et al., 1981) is that all data are used in the analysis, as opposed to limiting the number of recurrent events considered per subject. Neyman and Scott (1948) described the inconsistency of maximum likelihood estimators in the setting where the number of nuisance parameters grows as n → ∞. Using counting process (Andersen et al., 1993) and martingale (Fleming and Harrington, 1991) theory, Chang and Wang (1999) prove that, under certain regularity conditions, the estimator of the structural regression parameter β_0 is consistent and asymptotically normal, even in situations where the episode-specific parameters cannot be estimated consistently.
3.2. Wei–Lin–Weissfeld (WLW) marginal hazards model

Motivated by the lack of robustness of the AG and PWP models to mis-specification of the intra-subject dependence structure, Wei et al. (1989) proposed modelling the marginal hazards as an alternative to modelling the intensity function conditional on $N_i(t)$. The WLW approach entails modelling the marginal distribution of the $k$th event time using the Cox specification for the hazard function:
$$\lambda_{ik}(t) = \lambda_{0k}(t)\, e^{\beta_k^T Z_{ik}(t)} \quad \text{for } k = 1, \ldots, K.$$
The $k$th-event partial likelihood function is given by:
$$PL_k(\beta_k) = \prod_{i=1}^n \left[ \frac{e^{\beta_k^T Z_{ik}(X_{ik})}}{\sum_{j=1}^n Y_{jk}(X_{ik})\, e^{\beta_k^T Z_{jk}(X_{ik})}} \right]^{\Delta_{ik}}$$
with corresponding score function:
$$U_{k:n}(\beta_k) = \frac{\partial \log PL_k(\beta_k)}{\partial \beta_k} = \sum_{i=1}^n \int_0^\tau \bigl\{ Z_{ik}(s) - E_k(s; \beta_k) \bigr\}\, dN_{ik}(s),$$
where $\Delta_{ik} = I(T_{i,k} \le C_i)$, $N_{ik}(t) = I(X_{i,k} \le t, \Delta_{ik} = 1)$, $E_k(s; \beta_k) = S_k^{(1)}(s; \beta_k)/S_k^{(0)}(s; \beta_k)$, with $S_k^{(r)}(s; \beta_k) = n^{-1} \sum_{i=1}^n Y_{ik}(s)\, Z_{ik}(s)^{\otimes r}\, e^{\beta_k^T Z_{ik}(s)}$ for $r = 0, 1, 2$, and $Y_{ik}(t) = I(X_{i,k} \ge t)$. The cumulative baseline hazard function for the $k$th event is estimated by $\hat\Lambda_{0k}(t; \hat\beta_{k:n}) = n^{-1} \int_0^t dN_{\cdot k}(s)/S_k^{(0)}(s; \hat\beta_{k:n})$, with $dN_{\cdot k}(s) = \sum_{i=1}^n dN_{ik}(s)$.

Set $X_i = (X_{i,1}, \ldots, X_{i,K})^T$ and $\Delta_i = (\Delta_{i1}, \ldots, \Delta_{iK})^T$. Among other regularity conditions listed by Wei et al. (1989), it is assumed that $(Z_i, X_i, \Delta_i)$ are independent and identically distributed. The solution to the above score function, $\hat\beta_{k:n}$, is consistent for $\beta_k$, provided that the marginal models are correctly specified. As $n \to \infty$, $n^{1/2}(\hat\beta_{k:n} - \beta_k) \xrightarrow{D} N_p\bigl(0_{p\times 1}, A_k(\beta_k)^{-1} B_k(\beta_k) A_k(\beta_k)^{-1}\bigr)$, where a consistent estimator of the asymptotic variance is obtained through $A_{k:n}(\hat\beta_{k:n})$ and $B_{k:n}(\hat\beta_{k:n})$, where
$$A_{k:n}(\beta_k) = n^{-1} \sum_{i=1}^n \int_0^\tau \left\{ \frac{S_k^{(2)}(s; \beta_k)}{S_k^{(0)}(s; \beta_k)} - \left( \frac{S_k^{(1)}(s; \beta_k)}{S_k^{(0)}(s; \beta_k)} \right)^{\otimes 2} \right\} dN_{ik}(s)$$
and
$$B_{k:n}(\beta_k) = n^{-1} \sum_{i=1}^n \left[ \int_0^\tau \bigl\{ Z_{ik}(s) - E_k(s; \beta_k) \bigr\}\, d\hat M_{ik}(s; \beta_k) \right]^{\otimes 2},$$
with $d\hat M_{ik}(s; \beta_k) = dN_{ik}(s) - Y_{ik}(s)\, e^{\beta_k^T Z_{ik}(s)}\, d\hat\Lambda_{0k}(s; \beta_k)$. Note that if the marginal model is correctly specified, and if events for the same subject truly are uncorrelated, then $A_{k:n}(\hat\beta_{k:n})$ is asymptotically equivalent to $B_{k:n}(\hat\beta_{k:n})$. Inferences regarding $\beta_k$ are valid asymptotically, irrespective of the true intra-subject correlation structure. That is, analogous to the methods of Liang and Zeger (1986) for uncensored longitudinal data, marginal model parameters are fitted under the working independence assumption (with respect to events for the same subject), while intra-subject correlations are adjusted for in the analysis through a robust variance estimator.
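As an illustration of the WLW approach (our hedged sketch, not the authors' code), the event-specific marginal Cox models can be fitted one event number at a time on the total-time data, each with all $n$ subjects in the risk set and a robust sandwich variance; the joint covariance across event numbers, needed to formally combine the $\hat\beta_{k:n}$, is not computed here. The simulated data and names are hypothetical.

```python
# Sketch: WLW marginal hazards fits, one Cox model per event number.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n, K = 300, 3
z = rng.binomial(1, 0.5, size=n)
c = rng.uniform(4.0, 20.0, size=n)                     # censoring times
gaps = rng.exponential(scale=5.0, size=(n, K)) * np.exp(-0.5 * z)[:, None]
total = np.cumsum(gaps, axis=1)                        # ordered total times T_{i1} <= ... <= T_{iK}

for k in range(K):
    dk = pd.DataFrame({
        "time":  np.minimum(total[:, k], c),           # X_{ik}
        "delta": (total[:, k] <= c).astype(int),       # Delta_{ik}
        "z":     z,
    })
    cph = CoxPHFitter()
    # every subject contributes to every event-specific model (marginal,
    # unrestricted risk sets); robust=True gives a sandwich variance
    cph.fit(dk, duration_col="time", event_col="delta", robust=True)
    print(f"event {k + 1}: beta_hat = {cph.params_['z']:.3f}, "
          f"robust SE = {cph.standard_errors_['z']:.3f}")
```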
The above formulae pertain to the case of event-specific regression parameters, which permit examination of trends in the effects with each subsequent occurrence. In cases where $Z_{ik}(s) = Z_i(s)$ for $k = 1, \ldots, K$, the average covariate effect can be estimated by fitting a model which employs a common regression parameter; i.e., $\beta_k = \beta_0$ for $k = 1, \ldots, K$. In practice, certain strata may need to be aggregated, particularly, for example, if too few subjects experience the maximum number of events to induce stability in $\hat\beta_{k:n}$ and/or justify the asymptotic approximations.

The AG, PWP and WLW methods have been compared by various authors (e.g., Lin, 1994; Gao and Zhou, 1997; Clayton, 1994; Therneau and Hamilton, 1997; Wei and Glidden, 1997) using real and simulated data, and it is well known that these models often yield different results for the same data set. This is not an unexpected finding, since the different models each address different research questions. The WLW method and its derivatives (Wei et al., 1990; Lee et al., 1992; Liang et al., 1993; Cai and Prentice, 1995, 1997) are robust and well-developed theoretically. They are viewed as excellent methods for making inferences on "population average" covariate effects. However, they provide no information on inter-relationships among failure times.

There has been much debate in the literature regarding whether the WLW method is suitable, in principle, for recurrent event data. When the method was proposed, Wei et al. (1989) cast the following two research settings into the same framework: (i) clustered subjects, with 1 event per subject, and (ii) independent subjects, with possibly > 1 event each. In theory, the WLW method can be validly applied to both settings. In the past, it has been commonly employed, and has been explicitly recommended by many authors (e.g., Lin, 1994; Therneau and Hamilton, 1997; Wei and Glidden, 1997; Barai and Teoh, 1997; Kelly and Lim, 2000) for recurrent event data. On the other hand, the WLW method has been criticized with respect to interpretation of the regression parameter estimates. Since $\lambda_{ik}(t)$ is a marginal hazard, subjects can be at risk for the $(k+1)$th event prior to having experienced the $k$th event. Event-specific hazard functions can be quite inter-dependent; low observed values of $\Lambda_{ik}(t) = \int_0^t \lambda_{ik}(s)\,ds$ imply low $\Lambda_{i,k+1}(t)$, since $e^{-\Lambda_{i,k+1}(t)} \ge e^{-\Lambda_{ik}(t)}$, appealing to the survivor function and the fact that the $k$th event must precede the $(k+1)$th event. Authors such as Kelly and Lim (2000) claim that the WLW method overestimates regression coefficients due to the previously described carry-over effect. Cook and Lawless (1997a) comment on the logical inconsistency of defining subjects as at risk for the $(k+1)$th event prior to experiencing the $k$th event, a paradoxical situation which arises naturally from the WLW formulation.

An analysis of the preschool asthma data based on the Wei, Lin and Weissfeld model is presented in Table 4. Naturally, the parameter estimates corresponding to the first office visit are the same as those for the first event in Tables 2 and 3, with all covariates having a significant effect. Comparing results across events 1–3, the HRs for each covariate generally increase with each successive event, which, given the results of the PWP models (Tables 2 and 3), illustrates the previously described carry-over phenomenon. For example, consider the covariate ASPH. Based on the WLW model, the estimated hazard ratio increases from HR = 1.23 (1st event) to HR = 1.35 (2nd event) to HR = 1.44 (3rd event). However, based on the PWP models, it appears that ASPH has no significant effect beyond the first event time. The pattern is generally the same for the remaining covariates.
Table 4
Analysis of preschool asthma data: estimated hazard ratios based on the Wei, Lin and Weissfeld model

            1st event           2nd event           3rd event
Covariate   HR      p           HR      p           HR      p
LBW         1.50    < 10^-3     1.51    < 10^-3     1.58    < 10^-2
RDS         1.92    < 10^-4     2.36    < 10^-4     2.68    < 10^-4
TTN         1.39    0.03        1.32    0.18        1.52    0.07
ASPH        1.23    0.02        1.35    < 10^-2     1.44    < 10^-2
SEXM        1.67    < 10^-4     1.86    < 10^-4     1.95    < 10^-4
3.3. Pepe and Cai rate models

Pepe and Cai (1993) developed an approach that can be considered intermediate between the conditional intensity and marginal hazard approaches. They proposed modelling the rate functions, $\{r_{i1}(t), r_{i2}(t), \ldots\}$, where $r_{ik}(t)$ is the rate (i.e., average intensity) of occurrence of the $k$th event among subjects at risk at time $t$ who have already experienced $(k-1)$ events. That is,
$$r_{ik}(t) = \lim_{\delta \to 0} \frac{1}{\delta}\, P\bigl(t < T_{i,k} \le t + \delta \mid T_{i,k} \ge t,\, T_{i,k-1} < t\bigr).$$
Similar to the marginal approach, the intra-subject event time correlations are unspecified. The function $r_{ik}(t)$ is well-defined and, irrespective of the nature of the event time correlations, is often of interest to investigators. In contrast to the WLW marginal hazard functions $\{\lambda_{ik}(t): k = 1, 2, \ldots\}$, the rate functions $\{r_{ik}(t): k = 1, 2, \ldots\}$ are conditional on having experienced $(k-1)$ events, which is a more intuitive approach for a recurrent event set-up. Moreover, whereas $\lambda_{i1}(t), \lambda_{i2}(t), \ldots$ bore a previously outlined numerical relationship to each other, $r_{i1}(t), r_{i2}(t), \ldots$ have no inherent relationship to each other. Thus, each conditional rate can be viewed as summarizing distinct components of the data set. One could model each conditional rate with the Cox form, $r_{ik}(t) = r_{0k}(t)\, e^{\beta_k^T Z_i(t)}$, where $\{r_{0k}(t): k = 1, 2, \ldots\}$ are arbitrary non-negative functions. Moment estimators of the baseline rate functions are given by $\hat r_{0k}(t; \hat\beta_{k:n}) = dN_{\cdot k}(t)/S_k^{(0)}(t; \hat\beta_{k:n})$, which has the same form as that of $\hat\lambda_{0k}(t; \hat\beta_{k:n})$, but with $Y_{ik}(t)$ appropriately modified. The score function is given by:
$$U_n^k(\beta_k) = \sum_{i=1}^n \int_0^\tau Z_i(s) \bigl\{ dN_{ik}(s) - Y_{ik}(s)\, \hat r_{0k}(s)\, e^{\beta_k^T Z_i(s)}\, ds \bigr\},$$
which exhibits a strong correspondence to the partial likelihood estimators listed in previous subsections.

Table 5 lists estimated rate ratios (RR) for an analysis of the asthma data based on the Pepe and Cai rate model.
Table 5
Analysis of preschool asthma data: estimated rate ratios based on the Pepe and Cai rate model

            1st event           2nd event           3rd event
Covariate   RR      p           RR      p           RR      p
LBW         1.50    < 10^-3     1.01    0.57        1.25    0.23
RDS         1.92    < 10^-4     1.57    0.04        1.37    0.19
TTN         1.39    0.03        0.97    0.89        1.28    0.29
ASPH        1.23    0.02        1.21    0.10        1.10    0.48
SEXM        1.67    < 10^-4     1.26    < 10^-2     1.16    0.09
All covariates are significantly associated with time until first asthma-attributable office visit. For SEXM, ASPH and RDS, the RR decreases steadily for the second and third events. Again, since risk sets are restricted under this model, this may reflect the heterogeneity among risk sets corresponding to successive events.

3.4. Marginal means/rates models

The majority of the models previously discussed in this chapter have focused on the hazard or intensity function. In the context of recurrent event data, the mean number of events is a more interpretable quantity, particularly for non-statisticians, and is often of direct interest to investigators. For these reasons, a marginal means/rates model may be preferred, where the rate function is the derivative of the mean function. Lawless and Nadeau (1995) originally proposed the marginal means/rates model, although that specific name was not employed. They considered primarily the discrete time case, and provided no large sample results for the continuous time setting. Semi- and fully-parametric models considered were of the form:
$$E\{dN_i(s)\} = m_0(s)\, g\{s; \beta_0, Z_i(s)\}, \qquad E\{dN_i(s)\} = m_0(s; \alpha)\, g\{s; \beta_0, Z_i(s)\},$$
respectively, where $m_0(s)$ is an unspecified non-negative function, $m_0(s; \alpha)$ is a known function with unknown parameter $\alpha$, and $g(\cdot) \ge 0$ is a pre-specified link function. Lin et al. (2000) provided a rigorous formalization of the marginal means/rates model, and develop inference procedures for the continuous time setting. They proposed a semi-parametric continuous time model with a Cox-type link function. The authors present the approach as an alternative to the intensity model in the interests of robustness and interpretability. For example, the previously described AG model, $\lambda_i(s) = \lambda_0(s)\, e^{\beta_0^T Z_i(s)}$, contains two essential components:
(a) $E[dN_i(t) \mid \mathcal{F}_i(t)] = E[dN_i(t) \mid Z_i(t)]$,
(b) $E[dN_i(t) \mid Z_i(t)] = \lambda_0(t)\, e^{\beta_0^T Z_i(t)}\, dt$,
where $\mathcal{F}_i(t) = \sigma\{Y_i(s), Z_i(s), N_i(s-);\ s \in [0, t]\}$, with $\sigma\{\cdot\}$ denoting a $\sigma$-algebra. Under assumption (a), the effect of $\mathcal{F}_i(t)$ is completely described by $Z_i(t)$. To avoid such a strong and unverifiable assumption, assumption (a) is deleted, and the model is defined through assumption (b). Specifically, Lin et al. (2000) propose the proportional rates model:
$$E\{dN_i(s) \mid Z_i(s)\} = d\mu_i(s) = e^{\beta_0^T Z_i(s)}\, d\mu_0(s). \qquad (7)$$
Note that although $d\mu_i(s)$ is always a rate function, $\mu_i(t) = \int_0^t d\mu_i(s)$ represents a mean function only if $Z_i(\cdot)$ consists only of external covariates. If $Z_i(\cdot)$ consists of time-dependent internal covariates, then $\mu_i(t)$ can only be interpreted as a cumulative rate function. Note that, when all covariates are time-independent, integrating both sides of (7) yields the proportional means model:
$$E\{N_i(t) \mid Z_i\} = e^{\beta_0^T Z_i}\, \mu_0(t). \qquad (8)$$
Parallels can be drawn between the marginal hazards model (Wei et al., 1989) and the marginal means/rates models (Lin et al., 2000). In both cases, the estimating equation for $\hat\beta_n$ ignores intra-subject correlations, analogous to the approach of Liang and Zeger (1986) for longitudinal data. A proportional hazards model implies a proportional rates model; but, the converse is not true. For example, consider a typical frailty model:
$$\lambda_i(s \mid Q_i) = Q_i\, \lambda_0(s)\, e^{\beta_0^T Z_i(s)},$$
where $Q_i$ is an unobservable random effect inducing heterogeneity uncaptured by $Z(\cdot)$, with $E[Q_i] = 1$. When $Q_i$ follows any distribution other than the positive stable distribution (e.g., the Gamma or Inverse Gaussian distribution, which are commonly employed in frailty modelling), the proportional rates model holds, while the proportional hazards model does not.

With respect to parameter estimation, $\beta_0$ is estimated by $\hat\beta_n$, the solution to $U_n(\beta; \tau) = 0_{p\times 1}$, with
$$U_n(\beta; t) = \sum_{i=1}^n \int_0^t \bigl\{ Z_i(s) - E(s; \beta) \bigr\}\, dN_i(s).$$
The baseline mean is estimated by the Breslow-type estimator $\hat\mu_0(t; \hat\beta_n)$, where $\hat\mu_0(t; \beta) = n^{-1} \int_0^t dN_\cdot(s)/S^{(0)}(s; \beta)$. The asymptotic distribution of $\hat\beta_n$ is derived through that of $U_n(\beta_0) = U_n(\beta_0; \tau)$. We can write:
$$U_n(\beta_0; t) = \sum_{i=1}^n \int_0^t \bigl\{ Z_i(s) - E(s; \beta_0) \bigr\}\, dM_i(s; \beta_0),$$
where $dM_i(s; \beta) = dN_i(s) - Y_i(s)\, e^{\beta^T Z_i(s)}\, d\hat\mu_0(s; \beta)$. When the proportional hazards model holds, $M_i(t; \beta_0) = \int_0^t dM_i(s; \beta_0)$ is a martingale with respect to the filtration $\sigma\{N_i(s-), Z_i(s);\ s \in [0, t]\}$, and the asymptotic distribution of $U_n(\beta_0)$ can be derived using the martingale central limit theorem (Fleming and Harrington, 1991). When the proportional means (but not hazards) model holds, the multivariate central limit theorem can be applied to derive the limiting distribution of $U_n(\beta_0)$. More generally, Lin et al. (2000) demonstrate that $\{n^{-1/2} U_n(\beta_0; t);\ t \in [0, \tau]\}$ converges weakly to a continuous zero-mean Gaussian process with covariance function between time points $s$ and $t$ given by $B(\beta_0; s, t)$, where
$$B(\beta; s, t) = E\left[ \int_0^s \bigl\{ Z_1(r) - E(r; \beta) \bigr\}\, dM_1(r; \beta) \times \int_0^t \bigl\{ Z_1(r) - E(r; \beta) \bigr\}^T dM_1(r; \beta) \right]$$
for $0 \le s, t \le \tau$. Lin et al. (2000) show that, under the proportional means model, $n^{1/2}(\hat\beta_n - \beta_0) \xrightarrow{D} N_p\bigl(0_{p\times 1}, A(\beta_0)^{-1} B(\beta_0) A(\beta_0)^{-1}\bigr)$, where $A_n(\beta) = -\partial U_n(\beta)/\partial \beta^T$ and $A(\beta_0)$ is its limiting value evaluated at $\beta_0$, and $B(\beta) = E[(\int_0^\tau \{Z_1(s) - E(s; \beta)\}\, dM_1(s; \beta))^{\otimes 2}]$.
Table 6
Analysis of preschool asthma data: estimated mean ratios based on the proportional means model

Covariate   MR      p
LBW         1.59    < 10^-2
RDS         2.70    < 10^-4
TTN         1.30    0.30
ASPH        1.41    0.02
SEXM        1.98    < 10^-4
The proportional means model (8) was fitted to the asthma data (Table 6). Based on this model, all covariates except TTN demonstrated a significant association with the mean number of asthma-attributable office visits. For all covariates, the estimated effect is of much greater magnitude based on the marginal means model, compared to the proportional intensity model (Table 1). Recall that, based on the most flexible modelling of previous event history (Table 1, Model 3), statistical significance was not observed for any of the covariates.

The results of our empirical comparison between the models of Andersen and Gill (1982) and Lin et al. (2000) make sense intuitively. In the AG results, it is apparent that the asthma intensity is positively associated with both (i) increasing previous number of events and (ii) each covariate. It then stands to reason that an analysis which, essentially, adjusts for prior event history will attenuate estimated covariate effects (i.e., the conditional effects will be of less magnitude than the marginal effects). Analogous statements would hold if both prior number of events and covariates decreased the intensity. Of course, if conditional covariate effects, specifically, are of interest, then the AG model would be preferred.

Note that the intensity and hazard formulations contain the counting process restriction $dN_i(s) \in \{0, 1\}$. Many event processes of interest in biomedical studies do not obey this restriction (e.g., health care costs; see Lin, 2000). Hazard and intensity models no longer make sense in such settings. However, the marginal means model can accommodate processes with increments of any arbitrary positive size.
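The following is a minimal numerical sketch (ours, not the authors' software) of the proportional rates estimating equation and its robust sandwich variance for a single time-fixed covariate; it mirrors $U_n(\beta)$, the Breslow-type baseline increments and $\hat A^{-1}\hat B\hat A^{-1}$ displayed above, on simulated Poisson-process data with hypothetical parameter values.

```python
# Sketch: Lin et al. (2000)-type proportional rates fit, one covariate, Newton-Raphson.
import numpy as np

rng = np.random.default_rng(2)
n = 400
z = rng.binomial(1, 0.5, size=n).astype(float)        # single binary covariate
c = rng.uniform(2.0, 10.0, size=n)                    # censoring times
# recurrent events: Poisson process with rate 0.4 * exp(0.6 * z_i) on [0, C_i]
events = [np.sort(rng.uniform(0.0, c[i],
                  rng.poisson(0.4 * np.exp(0.6 * z[i]) * c[i])))
          for i in range(n)]

t_grid, d = np.unique(np.concatenate(events), return_counts=True)

def moments(beta):
    w = np.exp(beta * z)                               # exp(beta * Z_i)
    at_risk = c[:, None] >= t_grid[None, :]            # Y_i(t) = I(C_i >= t)
    s0 = (at_risk * w[:, None]).sum(axis=0)
    s1 = (at_risk * (w * z)[:, None]).sum(axis=0)
    s2 = (at_risk * (w * z * z)[:, None]).sum(axis=0)
    return w, at_risk, s0, s1 / s0, s2 / s0            # ..., zbar, z2bar

beta = 0.0
for _ in range(50):                                    # Newton-Raphson on U_n(beta) = 0
    w, at_risk, s0, zbar, z2bar = moments(beta)
    U = sum((z[i] - zbar[np.searchsorted(t_grid, events[i])]).sum()
            for i in range(n))
    A = (d * (z2bar - zbar ** 2)).sum()                # = -dU/dbeta
    step = U / A
    beta += step
    if abs(step) < 1e-10:
        break

# robust sandwich variance: A^{-1} B A^{-1}, with B = sum_i u_i^2
w, at_risk, s0, zbar, z2bar = moments(beta)
dmu0 = d / s0                                          # Breslow-type baseline increments
u = np.array([(z[i] - zbar[np.searchsorted(t_grid, events[i])]).sum()
              - (at_risk[i] * w[i] * (z[i] - zbar) * dmu0).sum()
              for i in range(n)])
A = (d * (z2bar - zbar ** 2)).sum()
se = np.sqrt((u ** 2).sum()) / A
print(f"beta_hat = {beta:.3f} (true value 0.6), robust SE = {se:.3f}")
```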
4. Nonparametric estimation of the recurrent event survival and distribution functions

Each of the methods described in this chapter thus far has involved semiparametric modelling. We now consider nonparametric estimation of the gap time distribution, survival and related functions. A complication in the estimation of gap time functions is induced dependent censoring. That is, even if total times are censored independently (e.g., loss to follow-up, administrative censoring), the gap times for the second and subsequent events will be subject to induced dependent censoring, except under independence of the gap times within a subject, which is an unrealistic assumption in many studies. For example, the longer the time until the first event, the shorter the observation times for the second and subsequent events. Thus, if gap times are correlated, then the second and subsequent gap times are, essentially, censored by a dependent variable. Hence, the independent censoring assumption which underlies the usual univariate and multivariate survival function estimators is violated. Note that for the gap time regression methods of Prentice et al. (1981) and Chang and Wang (1999), it was assumed that event times within a subject were conditionally independent given the covariate vector. We now discuss some nonparametric estimators which account for induced dependent censoring.

4.1. Joint distribution function

The censoring time for $T_{ik}$ is denoted by $\tilde C_{ik} = C_i - T_{i,k-1}$ for $k = 1, \ldots, K$, with $T_{i,0} \equiv 0$. For simplicity of illustration, we restrict attention to $K = 2$. Due to right censoring, observed data consist of $\{\tilde X_{i1}, \tilde X_{i2}, \Delta_{i1}, \Delta_{i2}\}$ for $i = 1, \ldots, n$, where $\tilde X_{ik} = T_{ik} \wedge \tilde C_{ik}$ and $\Delta_{ik} = I(T_{ik} \le \tilde C_{ik}) \equiv I(T_{i,k} \le C_i)$. Below, random variables and functions not incorporating the subscript $i$ can be assumed to refer to an arbitrary subject. Wang and Wells (1998) proposed a product limit estimator of the joint survival function, $S_{12}(t_1, t_2) = P(T_{i1} > t_1, T_{i2} > t_2)$, which accounts for the induced dependent censoring. To begin,
$$S_{12}(t_1, t_2) = P(T_2 > t_2 \mid T_1 > t_1)\, P(T_1 > t_1) = \left[ \prod_{s \le t_2} \bigl\{ 1 - d\Lambda_2(s \mid T_1 > t_1) \bigr\} \right] S_1(t_1),$$
where $\prod$ represents the product integral and $\Lambda_2(s \mid T_1 > t_1)$ is the cumulative hazard of $(T_2 \mid T_1 > t_1)$, which Campbell and Földes (1982) suggested estimating by:
$$d\hat\Lambda_2(s \mid T_1 > t_1) = \frac{\sum_{i=1}^n \Delta_{i2}\, I(\tilde X_{i1} > t_1, \tilde X_{i2} = s)}{\sum_{i=1}^n I(\tilde X_{i1} > t_1, \tilde X_{i2} \ge s)},$$
where the notation "$I(\tilde X_{i1} > t_1, \tilde X_{i2} = s)$" is shorthand for "$\lim_{\delta \to 0} I(\tilde X_{i1} > t_1, \tilde X_{i2} \le s + \delta) - I(\tilde X_{i1} > t_1, \tilde X_{i2} \le s) = 1$". When the assumption that $(T_{ik} \perp \tilde C_{ik})$ fails for some $k$, Wang and Wells (1998) show that the Campbell–Földes estimators of $S_{12}(t_1, t_2)$ and $S_2(t_2) = P(T_2 > t_2)$ are inconsistent; similar arguments can be made for other bivariate survival function estimators, such as those of Tsai et al. (1996), Dabrowska (1988),
Tsai and Crowley (1998), Prentice and Cai (1992), and Lin and Ying (1993), as well as the Kaplan and Meier (1958) estimator of $S_2(t_2)$. The modification proposed by Wang and Wells (1998) is to weight risk set contributions by the inverse of their inclusion probabilities, similar in spirit to the dependent censoring methods of Robins and Rotnitzky (1992). The resulting estimator of the conditional cumulative hazard increment is given by:
$$d\tilde\Lambda_2(s \mid T_1 > t_1) = \frac{\sum_{i=1}^n \Delta_{i2}\, I(\tilde X_{i1} > t_1, \tilde X_{i2} = s)/\hat G_1(\tilde X_{i1} + s)}{\sum_{i=1}^n I(\tilde X_{i1} > t_1, \tilde X_{i2} \ge s)/\hat G_1(\tilde X_{i1} + s)},$$
where $G(t) = P(C > t)$ and $\hat G_1(t)$ is the corresponding Kaplan and Meier (1958) estimator based on $\{(\tilde X_{i1}, 1 - \Delta_{i1})\}_{i=1}^n$. Then,
$$\tilde S_{12}(t_1, t_2) = \left[ \prod_{s \le t_2} \bigl\{ 1 - d\tilde\Lambda_2(s \mid T_1 > t_1) \bigr\} \right] \hat S_1(t_1), \qquad (9)$$
with the marginal survival function for $T_2$ estimated by $\hat S_2(t_2) = \tilde S_{12}(0, t_2)$. Wang and Wells (1998) show that, as $n \to \infty$, $\tilde S_{12}(t_1, t_2)$ converges in probability to $S_{12}(t_1, t_2)$ and $n^{1/2}(\tilde S_{12}(t_1, t_2) - S_{12}(t_1, t_2))$ converges weakly to a zero-mean Gaussian process, but with a limiting covariance which is sufficiently complex that the authors suggest applying the survival data bootstrap (Efron, 1981) to construct reliable standard error estimates.

Lin et al. (1999) proposed an estimator for the joint distribution function, $F_{12}(t_1, t_2) = P(T_1 \le t_1, T_2 \le t_2)$, which accounts for induced dependent censoring, where $t_1 + t_2 \le \tau_C$, with $\tau_C = \sup\{t:\ P(C \ge t) > 0\}$. The authors begin with the relationship $F_{12}(t_1, t_2) = H(t_1, 0) - H(t_1, t_2)$, where $H(t_1, t_2) = P(T_1 \le t_1, T_2 > t_2)$. In the absence of censoring, $H(t_1, t_2)$ could be estimated by $n^{-1} \sum_{i=1}^n I(T_{i,1} \le t_1, T_{i2} > t_2)$. Since $I(T_{i,1} \le t_1, T_{i2} > t_2)$ cannot be observed due to right censoring, Lin et al. (1999) replace it with an observable quantity with the same expectation. The fact that $E[I(\tilde X_{i1} \le t_1, \tilde X_{i2} > t_2) \mid T_{i,1}, T_{i2}] = I(T_{i,1} \le t_1, T_{i2} > t_2)\, G(T_{i,1} + t_2)$ suggests the estimator
$$\hat H(t_1, t_2) = n^{-1} \sum_{i=1}^n \frac{I(\tilde X_{i1} \le t_1, \tilde X_{i2} > t_2)}{\hat G(\tilde X_{i1} + t_2)},$$
where $\hat G(t)$ is the Kaplan–Meier estimator based on $\{(X_{i2}, 1 - \Delta_{i2})\}_{i=1}^n$. As $n \to \infty$, $\hat H(t_1, t_2)$ converges almost surely to $H(t_1, t_2)$, uniformly in $(t_1, t_2)$, and $n^{1/2}(\hat H(\cdot, \cdot) - H(\cdot, \cdot))$ converges weakly to a zero-mean Gaussian process with a covariance function which can be estimated using empirical quantities. The joint distribution function is then estimated by $\hat F_{12}(t_1, t_2) = \hat H(t_1, 0) - \hat H(t_1, t_2)$, while the conditional survival function, $F_{2|1}(t_2 \mid t_1) = P(T_2 > t_2 \mid T_1 \le t_1)$, is estimated through:
$$\hat F_{2|1}(t_2 \mid t_1) = 1 - \hat H(t_2 \mid t_1)/\hat H(0 \mid t_1).$$
Uniform strong consistency of $\hat F_{2|1}(t_2 \mid t_1)$ for $F_{2|1}(t_2 \mid t_1)$ and weak convergence of $n^{1/2}\{\hat F_{2|1}(t_2 \mid t_1) - F_{2|1}(t_2 \mid t_1)\}$ to a zero-mean Gaussian process can be shown using properties inherited through $\hat H(t_2 \mid t_1)$.
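A small sketch (ours, under the stated independence assumptions) of the inverse-probability-of-censoring idea behind $\hat H(t_1, t_2)$: the censoring survival function is estimated by Kaplan–Meier from $\{(X_{i2}, 1 - \Delta_{i2})\}$ and each observed indicator is reweighted by $1/\hat G(\tilde X_{i1} + t_2)$. The simulated data and helper names below are hypothetical.

```python
# Sketch: IPCW estimator H_hat(t1, t2) of Lin et al. (1999), two gap times per subject.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
q = rng.gamma(shape=2.0, scale=0.5, size=n)        # shared frailty -> correlated gaps
g1 = rng.exponential(1.0 / q)                      # first gap time
g2 = rng.exponential(1.0 / q)                      # second gap time
c = rng.uniform(0.5, 6.0, size=n)                  # total censoring time

x1 = np.minimum(g1, c)                             # observed first gap, X~_{i1}
d1 = (g1 <= c)
x2 = np.where(d1, np.minimum(g2, c - g1), 0.0)     # observed second gap (0 if 1st censored)
x2_total = np.minimum(g1 + g2, c)                  # total time for the second event, X_{i2}
d2 = (g1 + g2 <= c).astype(int)                    # Delta_{i2}

def km_survival(times, events):
    """Kaplan-Meier survival curve, returned as a step-function evaluator."""
    uniq = np.unique(times[events == 1])
    surv, s = [], 1.0
    for u in uniq:
        at_risk = np.sum(times >= u)
        s *= 1.0 - np.sum((times == u) & (events == 1)) / at_risk
        surv.append(s)
    surv = np.array(surv)
    def evaluate(x):
        idx = np.searchsorted(uniq, x, side="right") - 1
        return np.where(idx < 0, 1.0, surv[np.clip(idx, 0, len(surv) - 1)])
    return evaluate

# KM estimate of G(t) = P(C > t) based on (X_{i2}, 1 - Delta_{i2}), as in the text
G_hat = km_survival(x2_total, 1 - d2)

def H_hat(t1, t2):
    ind = (x1 <= t1) & (x2 > t2)
    return np.mean(ind / np.maximum(G_hat(x1 + t2), 1e-10))

def F12_hat(t1, t2):                               # joint distribution function
    return H_hat(t1, 0.0) - H_hat(t1, t2)

print("F12_hat(1.0, 1.0) =", round(F12_hat(1.0, 1.0), 3))
```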
4.2. Recurrence time survival function

Chang and Wang (1999) proposed methods for a recurrence time survival function using a different approach. The authors consider the case when the first event itself defines the beginning of a subject's follow-up period. An example would be infections for a notifiable disease (i.e., a disease which, by law, requires reporting to local health authorities); individuals enter the database at the time of their first infection, with recurrence times measured from that first infection onwards. This is a useful framework for many disease registry databases. Hence, maintaining notation already defined in this chapter, $T_{i,j}$ represents, in this setting, the total time of the $(j-1)$th recurrence. We define $e_i$ to be the number of observed events for subject $i$, including the initial event which marks subject $i$'s commencement of follow-up. It is convenient to define $e_i^* = e_i - I(e_i \ge 2)$. Of interest is the recurrence time survival function, $P(T_{ij} > t)$; it is assumed that the marginal survival function is the same for all recurrence times. Define the following function of censoring times, $a_i = a(C_i) > 0$, which acts as a weight and could be chosen, for example, such that greater weight is assigned to subjects with longer observation times. Define the quantities:
$$H_a(t) = E\bigl[ a_i\, I(T_{i,1} \ge t)\, I(C_i \ge t) \bigr], \qquad F_a(t) = E\bigl[ a_i\, I(T_{i,1} \le t)\, I(C_i \ge T_{i,1}) \bigr],$$
and note that the cumulative hazard can be represented by
$$\Lambda(t) = \int_0^t \frac{E[a_i\, I(C_i \ge s)]\, d\{1 - S(s)\}}{E[a_i\, I(C_i \ge s)]\, S(s)} = \int_0^t H_a(s)^{-1}\, dF_a(s),$$
when $S(\cdot)$ is absolutely continuous. Correspondingly, Chang and Wang (1999) specify the following estimators,
$$\hat H_a(t) = n^{-1} \sum_{i=1}^n \frac{a_i}{e_i^*} \sum_{j=1}^{e_i^*} I(\tilde X_{ij} \ge t), \qquad \hat F_a(t) = n^{-1} \sum_{i=1}^n \frac{a_i\, I(e_i \ge 2)}{e_i^*} \sum_{j=1}^{e_i^*} I(\tilde X_{ij} \le t).$$
It can be shown that $\hat H_a(t)$ and $\hat F_a(t)$ are unbiased for $H_a(t)$ and $F_a(t)$, respectively, motivating the estimator:
$$\hat\Lambda_a(t) = \int_0^t \hat H_a(s)^{-1}\, d\hat F_a(s).$$
The recurrence time survival function can then be estimated through
$$\hat S_a(t) = \exp\bigl\{ -\hat\Lambda_a(t) \bigr\}.$$
Chang and Wang (1999) show that, as $n \to \infty$, when $a_i$ is bounded for $i = 1, \ldots, n$, $n^{1/2}\{\hat S_a(t) - S(t)\}$ converges weakly to a mean zero Gaussian process for $t \in [0, t^*]$, with $t^* < \sup\{t:\ S(t)G(t) > 0\}$.
5. Conclusion

In this chapter, we have reviewed several methods useful in the analysis of recurrent event data. The example analysis of the preschool asthma data set illustrates the differences between the methods, and the resulting parameter estimators. Ultimately, the choice of which model to fit depends on the objectives of the investigator, and possibly the specifics of the data set of interest. Methodological interest in recurrent event data persists. Methods have recently been developed for data structures outside the scope of this chapter. For example, Lin et al. (2001) developed point process transformation models to model the recurrent event intensity. Ghosh and Lin (2000) propose one- and two-sample methods for the mean number of events in the presence of death, extending methods suggested by Cook and Lawless (1997b). Wang and Chiang (2002) and Wang et al. (2001) have developed nonparametric and semiparametric methods, respectively, for analyzing recurrent event data in the presence of informative censoring. Cai and Schaubel (2003) have proposed methods for fitting marginal means/rates models for analyzing data with multiple recurrent event sequences.
Acknowledgement

This work was partially supported by National Institutes of Health grant R01 HL-57444.
References Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York. Andersen, P.K., Gill, R.D. (1982). Cox’s regression model for counting processes: A large sample study. Ann. Statist. 10, 1100–1120. Barai, U., Teoh, N. (1997). Multiple statistics for multiple events, with application to repeated infections in the growth factor studies. Statist. Medicine 16, 941–949. Breslow, N. (1974). Contribution to the discussion of the paper by D.R. Cox. J. Roy. Statist. Soc. Ser. B 34, 187–220. Cai, J., Prentice, R.L. (1995). Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika 82, 151–164. Cai, J., Prentice, R.L. (1997). Regression estimation using multivariate failure time data and a common baseline hazard function model. Lifetime Data Anal. 3, 197–213. Cai, J., Schaubel, D.E. (2003). Marginal means and rates models for multiple type recurrent event data. Lifetime Data Anal. Submitted for publication. Campbell, G., Földes, A. (1982). Large sample properties of nonparametric statistical inference. In: Gnedenko, B.V., Puri, M.L., Vincze, I. (Eds.), Colloquia Methemetica-Societatis Jaános Bolyai. NorthHolland, Amsterdam, pp. 103–122. Chang, I.-S., Hsiung, C.A. (1994). Information and asymptotic efficiency in some generalized proportional hazards models for counting processes. Ann. Statist. 22, 1275–1298. Chang, S.-H., Wang, M.-C. (1999). Conditional regression analysis for recurrence time data. J. Amer. Statist. Assoc. 94, 1221–1230. Chang, S.-H. (1995). Regression analysis for recurrent event data. Doctoral Dissertation. Johns Hopkins University, Department of Biostatistics. Chiang, C.L. (1968). Introduction to Stochastic Processes in Biostatistics. Wiley, New York.
Clayton, D. (1994). Some approaches to the analysis of recurrent event data. Statist. Methods Medical Res. 3, 244–262. Cook, R.J., Lawless, J.F. (1997a). Discussion of paper by Wei and Glidden. Statist. Medicine 16, 841–851. Cook, R.J., Lawless, J.F. (1997b). Marginal analysis of recurrent events and a terminating event. Statist. Medicine 16, 911–924. Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, 187–220. Cox, D.R. (1975). Partial likelihood. Biometrika 62, 262–276. Cox, D.R., Isham, V. (1980). Point Processes. Chapman and Hall, London. Dabrowska, D.M. (1988). Kaplan–Meier estimate on the plane. Ann. Statist. 16, 1475–1489. Efron, B. (1981). Censored data and the bootstrap. J. Amer. Statist. Assoc. 76, 312–319. Fleming, T.R., Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York. Gao, S., Zhou, X.-H. (1997). An empirical comparison of two semi-parametric approaches for the estimation of covariate effects from multivariate failure time data. Statist. Medicine 16, 2049–2062. Ghosh, D., Lin, D.Y. (2000). Nonparametric analysis of recurrent events and death. Biometrics 56, 554–562. Huber, P.J. (1967). The behaviour of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics and Probability. University of California Press, Berkeley, CA, pp. 221–233. Kalbfleisch, J.D., Prentice, R.L. (2002). The Statistical Analysis of Failure Time Data. Wiley, Hoboken, NJ. Kaplan, E.L., Meier, P. (1958). Nonparametric estimation from incomplete samples. J. Amer. Statist. Assoc. 53, 457–481. Kelly, P., Lim, L.L.-Y. (2000). Survival analysis for recurrent event data: An application to childhood infectious diseases. Statist. Medicine 19, 13–33. Laird, N.M., Olivier, D. (1981). Covariance analysis of censored survival data using log-linear analysis techniques. J. Amer. Statist. Assoc. 76, 231–240. Lawless, J.F. (1995). The analysis of recurrent events for multiple subjects. Appl. Statist. 44, 487–498. Lawless, J.F., Nadeau, C. (1995). Some simple robust methods for the analysis of recurrent events. Technometrics 37, 158–168. Lee, E.W., Wei, L.J., Amato, D.A. (1992). Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In: Klein, J.P., Goel, P.K. (Eds.), Survival Analysis: State of the Art. Kluwer Academic, Dordrecht, pp. 237–247. Liang, K.Y., Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22. Liang, K.-Y, Self, S.G., Chang, Y.-C. (1993). Modelling marginal hazards in multivariate failure time data. J. Roy. Statist. Soc. Ser. B 55, 441–453. Lin, D.Y. (1994). Cox regression analysis of multivariate failure time data. Statist. Medicine 15, 2233–2247. Lin, D.Y. (2000). Proportional means regression for censored medical costs. Biometrics 56, 775–778. Lin, D.Y., Sun, W., Ying, Z. (1999). Nonparametric estimation of the gap time distributions for serial events with censored data. Biometrika 86, 59–70. Lin, D.Y., Wei, L.J. (1989). The robust inference for the Cox proportional hazards model. J. Amer. Statist. Assoc. 84, 1074–1078. Lin, D.Y., Wei, L.J., Yang, I., Ying, Z. (2000). Semiparametric regression for the mean and rate functions of recurrent events. J. Roy. Statist. Soc. Ser. B 62, 711–730. Lin, D.Y., Wei, L.J., Ying, Z. (2001). Semiparametric transformation models for point processes. J. Amer. Statist. Assoc. 96, 620–628. Lin, D.Y., Ying, Z. (1993). 
A simple nonparametric estimator of the bivariate survival function under univariate censoring. Biometrika 80, 573–581. Lipschutz, K.H., Snapinn, S.M. (1997). Discussion of paper by Wei and Glidden. Statist. Medicine 16, 846– 848. Neyman, J., Scott, E.L. (1948). Consistent estimates based on partial consistent observations. Econometrica 16, 1–32. Pepe, M.S., Cai, J. (1993). Some graphical displays and marginal regression analyses for recurrent failure times and time-dependent covariates. J. Amer. Statist. Assoc. 88, 811–820. Prentice, R.L., Cai, J. (1992). Covariance and survival function estimation using censored multivariate failure time data. Biometrika 79, 495–512.
Prentice, R.L., Williams, B.J., Peterson, A.V. (1981). On the regression analysis of multivariate failure time data. Biometrika 68, 373–389. Robins, J.M., Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell, N., Dietz, K., Farewell, V. (Eds.), AIDS Epidemiology-Methodological Issues, Proceedings of the Biopharmaceutical Section. American Statistical Association, pp. 24–33. Ross, S.M. (1989). Introduction to Probability Models. Academic Press, New York. Rothman, K.J., Greenland, S. (1998). Modern Epidemiology. Lippincott-Raven Publishers, Philadelphia, PA. Schaubel, D., Johansen, H., Dutta, M., Desmeules, M., Becker, A., Mao, Y. (1996). Neonatal characteristics as risk factors for preschool asthma. J. Asthma 33, 255–264. Therneau, T.M., Hamilton, S.A. (1997). rhDNase as an example of recurrent event analysis. Statist. Medicine 16, 2029–2047. Tsai, W.-Y., Crowley, J. (1998). A note on nonparametric estimators of the bivariate survival function under univariate censoring. Biometrika 85, 573–580. Tsai, W.-Y., Leugrans, S., Crowley, J. (1996). Nonparametric estimation of the survival function in the presence of censoring. Ann. Statist. 14, 1351–1365. Tsiatis, A.A. (1981). A large sample study of Cox’s regression model. Ann. Statist. 9, 93–108. Wang, M.C., Qin, J., Chiang, C.-T. (2001). Analyzing recurrent event data with informative censoring. J. Amer. Statist. Assoc. 96, 1057–1065. Wang, M.C., Chiang, C.-T. (2002). Non-parametric methods for recurrent event data with informative and non-informative censorings. Statist. Medicine 21, 445–456. Wang, W., Wells, M.T. (1998). Nonparametric estimation of successive duration times under dependent censoring. Biometrika 85, 561–572. Wei, L.J., Lin, D.Y., Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modelling marginal distributions. J. Amer. Statist. Assoc. 84, 1065–1073. Wei, L.J., Glidden, D.V. (1997). An overview of statistical methods for multiple failure time data in clinical trials. Statist. Medicine 16, 833–839. Wei, L.J., Ying, Z., Lin, D.Y. (1990). Linear regression analysis of censored survival data based on rank tests. Biometrika 77, 845–851. White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.
Ch. 35. Current Status Data: Review, Recent Developments and Open Problems

Handbook of Statistics, Vol. 23, ISSN 0169-7161, © 2004 Elsevier B.V. All rights reserved. DOI: 10.1016/S0169-7161(03)23035-2
Nicholas P. Jewell and Mark van der Laan
1. Introduction

In some survival analysis applications, observation of the lifetime random variable $T$ is restricted to knowledge of whether or not $T$ exceeds a random monitoring time $C$. This structure is widely known as current status data, and is sometimes referred to as interval censoring, case I (Groeneboom and Wellner, 1992). Section 2 briefly notes several generic examples where current status data is encountered frequently.

Let $T$ have a distribution function $F$, with associated survival distribution $S = 1 - F$. We assume that interest focuses on estimation and inference on $F$, but recognize throughout that, in most applications, the goal will be estimation of a variety of functionals of $F$. In many cases, the regression relationship between $T$ and a set of covariates $Z$ will be of primary concern. In some situations, parametric forms of $F$ may be useful, although we pay most attention to the nonparametric problem where the form of $F$ is unspecified. In the regression model, semiparametric models for the conditional distribution of $T$, given $Z$, are appealing and heavily used.

The monitoring time $C$ is often taken to be random, following a distribution function $G$, almost always assumed independent of $T$. However, most techniques are based on the conditional distribution of $T$, given $C$, and so work equally well for fixed nonrandom $C$. In the random case, we assume, for the most part, that the data arise from a simple random sample from the joint distribution of $T$ and $C$; in the non-random case, we assume that simple random samples, often of size 1, are selected for each fixed choice of $C$. When $C$ is random, the data can thus be represented by $n$ observations from the joint distribution of $(T, C)$; however, only $\{(Y_i, C_i):\ i = 1, \ldots, n\}$ is observed, where $Y = I(T \le C)$.

In Section 6, we make some brief remarks about the intriguing possibility of dependence between $C$ and $T$, particularly when such dependence is introduced by design. In Section 4.1, we discuss an important variant to simple random sampling, namely the analysis of case-control samples. Here, two separate random samples are obtained, the first an i.i.d. random sample of size $n_0$ from those for whom $T > C$ (controls), the second an i.i.d. random sample of size $n_1$ from individuals for whom $T \le C$ (cases).
Section 4.2 covers the situation where observation of the origin of $T$ is also subject to censoring, thereby yielding doubly censored current status data. Section 5 extends the notion of current status observation to more complex forms of survival data. These include competing risks, multivariate survival variables $T = (T_1, \ldots, T_p)$, and special cases of the latter, for example, when $T_p \ge T_{p-1} \ge \cdots \ge T_1$. This leads naturally to consideration of the scenario, in Section 5.4, where observation at time $C$ is on a general counting process, rather than the case of a single jump from count '0' to '1' as occurs with a simple survival random variable.
2. Motivating examples

Before discussing estimation techniques designed for current status data, it will be helpful to have some motivating examples at the back of our minds as we proceed. Early examples arose in demographic applications, with a common version occurring in studies of the distribution of the age at weaning in various settings (Diamond et al., 1986; Diamond and McDonald, 1991; Grummer-Strawn, 1993). Here, $T$ represents the age of a child at weaning and $C$ the age at observation. Inaccuracy and bias surrounding exact measurement of $T$, even when $T < C$, led to use of solely current status data on $T$ at $C$ for the purpose of understanding $F$.

Another kind of example arises naturally in the study of infectious diseases, particularly when infection is an unobserved event, that is, one with often no or few clinical indications. The prototypical example is infection with the Human Immunodeficiency Virus (HIV), in particular, partner studies of HIV infection (Jewell and Shiboski, 1990; Shiboski, 1998a). The most straightforward partner study occurs when HIV infection data is collected on both partners in a long-term sexual relationship. These partnerships are assumed to include a primary infected individual (index case) who has been infected via some external source, and a susceptible partner who has no other means of infection other than contact with the index case. Suppose $T$ denotes the time (or number of infectious contacts) from infection of the index case to infection of the susceptible partner, and that the partnership is evaluated at a single time $C$ after infection of the index case; then, the infection status of the susceptible partner provides current status data on $T$ at time $C$. Since partnerships are often recruited retrospectively so that the event of the susceptible partner's infection has occurred (or not) at the time of recruitment, some form of case-control design may be used; in this case the methods of Section 4.1 are appropriate.

Our next area of application is in carcinogenicity testing when a tumor under investigation is occult (see Gart et al., 1986). In this example, for each experimental animal, $T$ is the time from exposure to a potential carcinogen until occurrence of the tumor, and $C$ is the time, on the same scale, of sacrifice. Upon sacrifice, the presence or absence of the occult tumor can be determined providing current status information on $T$.

Finally, a common source of current status data is estimation of the distribution of age at incidence of a non-fatal human disease for which the exact incidence time is usually unknown although accurate diagnostic tests for prevalent disease are available. If a cross-sectional sample of a given population receives such a diagnostic test, then the
presence or absence of disease in an individual of age $C$ yields current status information on the age, $T$, at disease incidence. Keiding (1991) describes the nonparametric maximum likelihood estimator of the distribution of the age at incidence of Hepatitis A infection, based on cross-sectional data obtained by K. Dietz. A case study of the application of current status techniques to estimation of age-specific immunization rates is given in Keiding et al. (1996). For rare diseases, this approach to age incidence is only viable if a case-control sampling scheme is used. For example, with Alzheimer's disease, it is feasible to obtain a random sample of prevalent Alzheimer's patients, measuring their age at sampling, and then subsequently sample population controls. However the data are obtained, modifications to current status methods are required if presence of the disease substantially modifies the risk of death, thereby reducing the probability of being sampled. This is an issue that deserves further study. Note that, in econometrics, there is a parallel terminology and literature that has developed on similar topics to those discussed below.
3. Simple current status data

Recall that the binary random variable $Y$ is defined to be 1 if $T \le C$ and 0 if $T > C$. Now, suppose an i.i.d. random sample of the population is obtained with observed data thereby given by $\{(y_i, c_i):\ i = 1, \ldots, n\}$. Assuming that the monitoring time $C$ is independent of the survival time, $T$, the likelihood of this data is thus given by
$$L = \prod_{i=1}^n F(c_i)^{y_i} \bigl\{ 1 - F(c_i) \bigr\}^{1 - y_i}\, dG(c_i). \qquad (1)$$
Estimation of $F$ can then be based on the conditional likelihood of $Y$, given $C$, namely,
$$CL = \prod_{i=1}^n F(c_i)^{y_i} \bigl\{ 1 - F(c_i) \bigr\}^{1 - y_i}. \qquad (2)$$
This conditional likelihood is immediately applicable also in the case of fixed nonrandom selection of the monitoring times, assuming that such selection is again independent of $T$. Note also that $E(Y \mid C = c) = P(T \le c \mid C = c) = F(c)$, and so estimation of $F$ can be viewed in terms of estimation of the conditional expectation of $Y$ for all $c$, with a monotonicity constraint imposed on the regression function.

Before we continue our discussion of estimation of $F$, it is worth noting that there are conditions, somewhat weaker than independence of $T$ and $C$, under which the conditional likelihood (2) remains appropriate (Betensky, 2000). When $T$ and $C$ are dependent, it is not possible to identify $F$ from the data in general. In particular, there are non-independent situations where the conditional probability that $Y = 1$ given $C$ is non-decreasing in $C$ but not equal to $F(C)$. Despite this possibility, many forms of dependence between $T$ and $C$ will result in $E(Y = 1 \mid C)$ being locally decreasing in $C$. Rabinowitz (2000) provides a simple rank statistic, based on the observed data, that is designed to detect such decreasing trends. For Sections 3–5 we assume independence of $C$ and $T$ (or conditionally independent given covariates when we discuss regression
models); in Section 6, we discuss accounting for, and even exploiting, relaxation of this assumption.

If $F$ belongs to a finite-dimensional parametric family, $\{F = F_\theta:\ \theta \in \Theta\}$, then estimation and inference regarding $\theta$, and thus $F_\theta$, can be obtained by standard maximum likelihood techniques based on (2). On the other hand, nonparametric maximum likelihood estimation of $F$ requires maximization of (2) over the space of all distribution functions. This nonparametric maximization problem has been much studied – Ayer et al. (1955) provided a fast and effective approach, the ubiquitous pool–adjacent–violators algorithm, to compute the nonparametric maximum likelihood estimator, $\hat F$ (a small numerical sketch of the algorithm is given below). The connection to convex minorants is extensively discussed in Barlow et al. (1972) and Groeneboom and Wellner (1992). The estimator $\hat F$ converges to $F$ as $n$ tends to infinity, but at rate $n^{-1/3}$, unlike the empirical distribution function, or the Kaplan–Meier estimator, both of which converge at the more familiar $n^{-1/2}$ rate. The limiting distribution of $\hat F$ is not Gaussian, but a more complex distribution associated with two-sided Brownian motion (Groeneboom and Wellner, 1992). The estimator $\hat F$ is a step function, jumping only at a subset of the observed monitoring times $c_1, \ldots, c_n$. In fact, the data only identify the value of $F$ at $c_1, \ldots, c_n$ and at no other value of $t$. Identification of the entire distribution function $F$ as $n$ tends to infinity depends therefore on the support of $F$ being contained within the support of $G$. Finally, a smoothing technique can be incorporated into the pool–adjacent–violators algorithm to produce smoother estimates of $F$ across the $c_i$'s – see Mammen (1991) and Mukerjee (1988).

Despite the unusual and slow rate of convergence of $\hat F$ to $F$, Huang and Wellner (1995) show that estimates of smooth functionals of $F$, based on $\hat F$, converge at rate $n^{-1/2}$ and are asymptotically efficient at many data generating distributions. These authors also supply the influence curve for such smooth functional estimators, thereby facilitating straightforward calculations for (asymptotic) confidence intervals.

3.1. Epidemiological applications – calculation of the relative risk

In some simple epidemiological studies, interest focuses on the calculation and comparison of the cumulative incidence rate for a specific disease over a pre-determined period of risk and for differing levels of exposure to some risk factor. In many investigations, the risk interval is common to all individuals under study, and calculation of the cumulative risk thereby corresponds to current status estimation of $F$ at a single monitoring time corresponding to the length of the interval, $C$. Of course, standard 'survival' follow-up of the study participants yields exact incidence times, albeit right censored at $C$. If risk intervals vary in length across individuals, the nonparametric maximum likelihood estimator, $\hat F$, discussed above provides an estimate of the cumulative risk, at any observed value of $C$, that is based only on whether incident disease occurs in the observed risk interval or not. Again, estimates of cumulative risk can be computed from follow-up data using the Kaplan–Meier estimator for right censored data. Typically, follow-up measurement of the exact time of disease incidence is considerably more expensive than mere (current status) assessment of incidence at some point during the risk interval.
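The pool–adjacent–violators computation referred to above can be sketched in a few lines (our illustration, not the chapter's software): for current status data, the NPMLE of $F$ at the ordered monitoring times coincides with the isotonic (non-decreasing) weighted least-squares fit to the indicators $y_i$ ordered by $c_i$. All names and data below are hypothetical.

```python
# Sketch: pool-adjacent-violators for the current status NPMLE of F.
import numpy as np

def pava(y, w):
    """Weighted, non-decreasing isotonic fit to y (pool-adjacent-violators)."""
    vals, wts, counts = [], [], []
    for yi, wi in zip(map(float, y), map(float, w)):
        vals.append(yi); wts.append(wi); counts.append(1)
        # pool blocks backwards while the non-decreasing constraint is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, w2, c2 = vals.pop(), wts.pop(), counts.pop()
            v1, w1, c1 = vals.pop(), wts.pop(), counts.pop()
            vals.append((w1 * v1 + w2 * v2) / (w1 + w2))
            wts.append(w1 + w2)
            counts.append(c1 + c2)
    return np.repeat(vals, counts)

# hypothetical current status data: the true T's are never observed directly
rng = np.random.default_rng(6)
n = 500
t = 3.0 * rng.weibull(1.5, size=n)        # latent failure times
c = rng.uniform(0.0, 6.0, size=n)         # monitoring times
y = (t <= c).astype(float)                # current status indicators

order = np.argsort(c)
F_hat = pava(y[order], np.ones(n))        # NPMLE of F at the ordered c's
print(np.column_stack([np.sort(c)[:5], F_hat[:5]]))
```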
If F is parametrically specified, the efficiency of current status estimates of cumulative incidence, as compared to use of more complete incidence
times arising from full follow-up, can be calculated directly. The simpler current status measurements are often surprisingly efficient, except in situations where the monitoring times are all either very small or very large in terms of the location of the support of $F$. Of more relevance, similar efficiency comparisons can be made when the parameter of interest is a comparative measure of the cumulative incidence rates across exposure groups, often leading to similar conclusions regarding the effectiveness of current status observations. In a study design, the relative costs of continuous follow-up versus a single current status assessment must be fully considered, and, of course, the latter allows investigation of more complex incidence properties. Consideration of the role of more complex measurements of exposures and other factors associated with incidence leads naturally to the development of regression models and their estimation from current status data.

3.2. Regression models

In Section 3.1, we touched on the two-group situation where the difference in survival properties across exposure groups is of fundamental concern rather than the shape of the underlying survival distributions. Clearly, many applications include more general and higher dimensional covariates in situations where the relationships between the latter and survival time are key. A substantial literature has developed for regression models of this kind for survival outcomes, potentially subject to right censoring. Much recent work has extended the application of these models to current status data. There is an immediate and valuable correspondence between the regression models that link $T$, the survival random variable, and $Y$, the current status version of $T$, to a $k$-dimensional covariate vector $Z$. Doksum and Gasko (1990) had previously considered this association between survival and binary regression models in the context of censored survival data. This is extremely useful since estimates of parameters in the regression model for the observed $Y$ can then be interpreted in terms of the parameters in the regression model for the unobserved $T$. For example, suppose that survival times follow a proportional hazards model (Cox, 1972)
$$S(t \mid Z = z) = S_0(t)^{e^{\beta z}}, \qquad (3)$$
where $S_0$ is an arbitrary survival function for the sub-population for whom $Z = 0$, and $\beta$ is a $k$-dimensional vector of regression coefficients. Each component of $\beta$ gives the relative hazard associated with a unit increase in the corresponding component of $Z$, holding all other components fixed. Then, if we write $p(z \mid c) = E(Y \mid C = c, Z = z)$, the current status random variable $Y$ is related to $Z$ through
$$\log\bigl\{ -\log\bigl[ 1 - p(z \mid c) \bigr] \bigr\} = \log\bigl\{ -\log\bigl[ S_0(c) \bigr] \bigr\} + \beta z. \qquad (4)$$
This is a particular case of a generalized linear model for $Y$ with complementary log–log link and offset given by an arbitrary increasing function of the observed 'covariate' $C$ (that is, $\log\{-\log[S_0(C)]\}$). The regression coefficients, $\beta$, here are thus exactly the relative hazards from the regression model for $T$.
As another example, suppose $T$ follows the proportional odds regression model (Bennett, 1983) defined by
$$1 - S(t \mid Z = z) = \frac{1}{1 + e^{-\alpha(t) - \beta z}},$$
where $S_0(t) = \frac{1}{1 + e^{\alpha(t)}}$. Here, $Y$ is associated with $Z$ via the logit link:
$$\log \frac{p(z \mid c)}{1 - p(z \mid c)} = \alpha(c) + \beta z. \qquad (5)$$
Again, the 'intercept' term, $\alpha(C) = \log \frac{1 - S_0(C)}{S_0(C)}$, is an increasing function of $C$. If the baseline survival function $S_0$ is assumed to follow a particular parametric form, the corresponding binary regression model will often simplify to a familiar generalized linear model, so that standard software can be used to estimate both $S_0$ and the regression parameters $\beta$. As an example, suppose that $S_0$ is assumed to be a Weibull distribution with hazard function $e^a b t^{b-1}$, and that the proportional hazards model (3) holds for $T$. Then, the binary regression model for $Y$, given by (4), simplifies to a straightforward generalized linear model with complementary log–log link:
$$\log\bigl\{ -\log\bigl[ 1 - p(z, c) \bigr] \bigr\} = a + b \log(c) + \beta z.$$
On the other hand, if S0 is left arbitrary, semiparametric methods can be used to tackle inference on β, treating S0 as a nuisance parameter. Shiboski (1998b) provides an excellent review of these methods for current status data, discussing versions of a backfitting algorithm to compute estimates of β while fully acknowledging the monotonicity constraints in the intercept terms of the kind illustrated in (4) and (5). In the semiparametric regression model, dependence between C and the covariates Z can introduce some bias in estimation of β. Shiboski (1998b) also describes some simulations that compare the relative performance of coefficient estimates based on parametric or nonparametric assumptions on S0 . Asymptotic results regarding coefficient estimates within a semiparametric model (S0 left unspecified), necessary for inference, are discussed in Rabinowitz et al. (1995), Huang (1996) and Rossini and Tsiatis (1996) for the accelerated failure time, proportional hazards and proportional odds regression models, respectively, for T . Andrews et al. (2002) give locally efficient estimates for regression coefficient estimates in a broad class of models that (i) includes the accelerated failure time model, and (ii) allows for time-dependent covariates. Additive hazards regression models are studied in Lin et al. (1998) and Martinussen and Scheike (2002).
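To illustrate the Weibull special case just described, the sketch below (ours; all names and data are simulated and hypothetical) maximizes the current status likelihood (2) directly under $\Pr(Y = 1 \mid c, z) = 1 - \exp\{-\exp(a + b\log c + \beta z)\}$ using scipy, rather than relying on any particular GLM routine.

```python
# Sketch: parametric (Weibull / proportional hazards) fit to current status data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 1000
z = rng.binomial(1, 0.5, n)
a_true, b_true, beta_true = -1.0, 1.5, 0.7
# Weibull survival times with hazard exp(a) * b * t^(b-1) * exp(beta*z)
u = rng.uniform(1e-12, 1.0, n)
t = (-np.log(u) / (np.exp(a_true) * np.exp(beta_true * z))) ** (1.0 / b_true)
c = rng.uniform(0.1, 3.0, n)                     # monitoring times
y = (t <= c).astype(float)                       # current status responses

def negloglik(theta):
    a, b, beta = theta
    eta = a + b * np.log(c) + beta * z           # complementary log-log scale
    p = 1.0 - np.exp(-np.exp(eta))               # Pr(Y = 1 | c, z)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(negloglik, x0=np.zeros(3), method="BFGS")
print("a, b, beta =", np.round(fit.x, 3))        # should be near (-1.0, 1.5, 0.7)
```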
4. Different sampling schemes In Section 3, and in the construction of (1) and (2) in particular, we have assumed that an i.i.d. random sample of observations of (Y, C) are available, noting that, with the
Current status data
631
assumption of independence between T and C, the use of (2) allows the methods to apply directly to designs where the monitoring times are pre-determined. Often, the failures of interest are rare in the population so that such random samples provide very few observations where failure has occurred at the observed monitoring time, whether the latter is random or fixed. In these contexts, it is natural to consider a case-control strategy where separate samples of individuals to whom an event has already occurred (cases), and those for whom the event has not yet occurred (controls), are obtained. Section 4.1 briefly discusses the extension of the results of Section 3 to case-control designs. In some applications, the survival time, T , refers to the time between two events in chronological time, for example, the time between infection with HIV and the moment when an infected individual becomes infectious through a specified mechanism (see Jewell et al., 1994). Current status monitoring of an individual at a single point in chronological time then yields current status observation of T with the random variable C being defined by the difference in chronological time between the ‘origin’ of T and the monitoring time. Measurement of C assumes that the chronological time of this origin is known for all sampled individuals. Situations where this is not known leads to doubly censored current status data which is briefly described in Section 4.2. Some other modifications to standard current status data have also been studied; for example, Shiboski and Jewell (1992) allow for the possibility of a form of staggered entry in an observational study setting. 4.1. Case-control sampling As noted above, it is often useful to consider a case-control sampling scheme. Here, cases refer to a random sample of n1 observations on C from the sub-population where T C, and controls to a random sample of n0 observations from the sub-population where T > C. Even when the support of T is contained within the support of C, there is an additional identifiability problem that arises in nonparametric estimation of F from casecontrol samples. Jewell and van der Laan (2004) show that case-control data only idenF (t ) tify the odds function associated with F , namely log[ 1−F (t ) ], up to a constant. While this may be sufficient to identify F in an assumed parametric family, it is insufficient nonparametrically. However, additional data regarding the population distribution of cases and controls can be used to identify a specific F with a given odds function that is compatible with the population information. In particular, suppose that N individuals are sampled from the joint distribution of (Y, C), and that only the numbers of individuals for whom Y = i (i = 0, 1), say N0 and N1 , respectively, are observed. Subsequently, case-control data comprised of fixed samples of size n0 ( N0 ) and n1 ( N0 ) are selected, by simple random sampling, separately from the two groups, with Y = 0 and Y = 1, in the original sample of N . The random variable C is then measured for each of the n0 + n1 sampled individuals at this stage. In practice, the sampling rates, at this second stage, that is (n0 /N0 ) and (n1 /N1 ) will usually be quite different. The supplemented data is thus {(yij , cij ): i = 0, 1; j = 0, . . . , ni ; N0 , N1 }. Assuming that the sample sizes, n0 and n1 , are non-informative, a simple consistent nonpara-
632
N.P. Jewell and M. van der Laan
metric estimator of F is immediately available by weighting observations inversely proportional to their probability of selection, and using the estimator for standard current status data (Section 3) on this weighted data. Specifically, the weights are (N0 /n0 ) for controls and (N1 /n1 ) for cases. Jewell and van der Laan (2004) show that this simple estimator is, in fact, the nonparametric maximum likelihood estimator based on casecontrol data supplemented by knowledge of N0 and N1 . This nonparametric estimator assumes knowledge of the population totals N0 and N1 (in fact only the ratio N1 /N0 need be known). Without such information, we can hypothesize a value for N1 /N0 , compute the nonparametric maximum likelihood estimator, and then vary the assumed N1 /N0 as a sensitivity parameter over a range of plausible values. If N1 /N0 is allowed to take on all values the corresponding nonparametric maximum likelihood estimators trace out the population odds family associated with any particular choice of N1 /N0 . For parametric models for F , the situation is not as straightforward, even with knowledge of the supplementary population totals N0 and N1 , as the weighted and maximum likelihood estimators need not coincide. However, Scott and Wild (1997) provide an elegant iterative algorithm to compute the maximum likelihood estimator of F using data on N0 and N1 . Their approach is based on the regression model induced for Pr(Y = 1 | C = c), and the proposed algorithm is particularly simple when this regression model can be easily fit for randomly sampled (i.e., prospective) data. For example, if F is assumed to follow a Weibull distribution, with hazard ea bt b−1 , then log − log[Pr(Y = 1 | C = c)] = a + b log(c), as noted in Section 3.2, that is, a standard generalized linear model with complementary log–log link; the iterative steps in fitting a Weibull distribution to case-control current status data are therefore straightforward since there is standard software that accommodates this form of prospective generalized linear model. 4.2. Doubly censored current status data Suppose that the survival variable T measures the length of time between two successive events in chronological time. We refer to these as the initiating and subsequent events, and assume that their occurrence times are given by the random variables I and J , respectively, so that T = J − I . We assume that T is independent of I . Now, consider a single monitoring occasion whose chronological time is given by B, independent of I and J , at which point current status information is available on the subsequent event J ; that is, we observe whether J B or not. For a random sample of individuals for whom I B, such an observation scheme yields current status observations of T , assuming that the random variable I is known for all observations. In particular, we observe the random variable Y which takes the value 1 if T B − I , and 0, otherwise. In this case, the induced monitoring time for T is C = B − I , so that its distribution is determined by that of I . An additional complication is introduced when the random variable I is unknown or unobserved. Now, at chronological time B, we merely observe whether either or both of the initiating and subsequent events have occurred by time B, but not the times of either event. Without loss of information on F , we assume that only individuals for whom I B are included in the sample. The observed data is thus reduced to Y ∗ where Y ∗ = 1 if I J B and Y ∗ = 0 if I B < J .
In order for F to be identifiable from such data, we assume that the conditional distribution of I, given that I ≤ B, is known (Jewell et al., 1994), although it is allowable that this distribution varies from individual to individual. For convenience, for the ith sampled individual, suppose that the known conditional distribution of I, given that I ≤ Bi, is labeled by Hi, and has finite support on some interval (Ai, Bi). Then, we have

$$P_i = \Pr(Y_i^* = 1) = \int_0^{C_i} H_i(B_i - T)\, dF(T), \qquad (6)$$

where now Ci = Bi − Ai. Further, the conditional likelihood of n observations of this kind is then

$$CL = \prod_{i=1}^{n} P_i^{Y_i^*}\,(1 - P_i)^{1 - Y_i^*}. \qquad (7)$$
This data is referred to as doubly censored current status data by Rabinowitz and Jewell (1996) since it is a special case of doubly censored survival data as described by DeGruttola and Lagakos (1989). Two applications to data on HIV are given in Jewell et al. (1994).
Parametric estimation of F, based on the likelihood (7), is again straightforward in principle. Nonparametric maximum likelihood estimation of F can be approached by viewing the model as a nonparametric mixture estimation problem (Jewell et al., 1994). An important special case occurs when Hi is assumed to be Uniform on [Ai, Bi], in which case (6) reduces to

$$P_i \equiv P(C_i) = \frac{1}{C_i}\int_0^{C_i} F(T)\, dT. \qquad (8)$$

Here, P is a distribution function that only depends on Ci, and so doubly censored current status data in this case is a sub-model of current status data. Estimation of F, with this assumption on each Hi, is examined in Jewell et al. (1994), van der Laan et al. (1997) and van der Laan and Jewell (2001). The latter paper shows that the nonparametric maximum likelihood estimator of F is uniformly consistent, and further that the distribution function P(C), defined by (8), is nonparametrically estimated at rate n^{−2/5}, indicating the value of the additional structure given in (8) as compared to standard current status data. On the other hand, it is conjectured that F itself can only be estimated at rate n^{−1/5} (see van der Laan et al., 1997), although this result and the limiting distribution of the nonparametric maximum likelihood estimator of F remain to be established. Despite the very slow rate of convergence of the nonparametric maximum likelihood estimator, many smooth functionals can still be efficiently estimated, at rate n^{−1/2}, using the appropriate functionals of the nonparametric maximum likelihood estimator. An alternative iterative weighted pool–adjacent–violators algorithm is also given for computation of the nonparametric maximum likelihood estimator. Rabinowitz and Jewell (1996) extend the results of Rabinowitz et al. (1995), for estimation of regression parameters in the accelerated failure time model for T, to doubly censored current status data assuming each Hi to be Uniform. See also van der Laan et al. (1997).
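Since E(Y* | C) = P(C) is monotone in C under the Uniform assumption (8), P can be estimated exactly as for standard current status data. The sketch below illustrates this reduced estimation problem; the simulated data are assumptions made purely for illustration, and scikit-learn's isotonic regression is used as a convenient stand-in for the pool–adjacent–violators computation. It does not implement the iterative weighted algorithm for F itself.

```python
# Sketch: with H_i Uniform on [A_i, B_i], E(Y* | C) = P(C) = (1/C)∫_0^C F(t) dt,
# which is nondecreasing in C, so P can be estimated as in standard current status data.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
n = 400
t = rng.exponential(1.0, n)                 # latent durations T (illustrative)
c = rng.uniform(0.2, 4.0, n)                # C_i = B_i - A_i
u = rng.uniform(0.0, c)                     # I - A, Uniform on (0, C) given I <= B
ystar = (t <= c - u).astype(float)          # Y* = 1 iff the subsequent event occurred by B

order = np.argsort(c)
iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True)
p_hat = iso.fit_transform(c[order], ystar[order])   # estimate of P(C) at the ordered c's
```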
van der Laan and Andrews (2000) replace the assumption of a Uniform distribution for Hi by a mixture of a point mass and a Uniform, a generalization that arises naturally in partner studies. The presence of a point mass now permits the nonparametric maximum likelihood estimator to converge to F at rate n^{−1/3}, as for standard current status data; again, smooth functionals can be efficiently estimated based on the nonparametric maximum likelihood estimator at rate n^{−1/2}. Some speculation is given there regarding the situation for other forms of Hi.
5. Complex outcome processes

It is well known that the survival random variable T can be alternatively viewed as the time to the ‘jump’ of a simple 0–1 counting process X(t). In this context, a current status monitoring scheme corresponds with a single cross-sectional observation of the stochastic process X(t). Considering cross-sectional observation of more complex monotone stochastic processes leads to various extensions of simple current status data structures. In particular, current status competing risk data, discussed in Section 5.1, arise when X still only jumps once in each sample path, but now jumps are marked by a discrete set of outcomes, usually the cause of the jump or failure. Section 5.2 investigates the situation where X is now defined by a bivariate pair of binary counting processes, (X1, X2). In Section 5.3, we return to a univariate X, but now allow for the possibility of two successive jumps – from 0 to 1, and then from 1 to 2. Finally, we briefly examine the case where X is a general counting process in Section 5.4. For some brief remarks for the case where X(t) is a renewal process, see Jewell and van der Laan (1997).

5.1. Competing risk outcomes

In Section 3, we introduced simple current status data in terms of a single survival random variable T with an assumed single definition of failure. In some scenarios, failure may be associated with more than one ‘cause’, leading to the extensive literature on competing risks. For simplicity here, we assume only two competing risks, although all the material readily extends to an arbitrary number of risks. If J is the random variable that indicates the cause of failure at time T, the two sub-distribution functions of interest are Fj(t) = Pr(T ≤ t, J = j), with the overall survival function given by S(t) = 1 − F1(t) − F2(t).
Jewell et al. (2003) consider nonparametric estimation of F1, F2 and F = F1 + F2, when only current status information on survival is available at the monitoring time C, but cause of failure is known whenever failure is seen to have occurred before C. Here, observed data can thus be represented as Y = (∆, Φ) and C, where ∆ = 1 only if T ≤ C with J = 1, and Φ = 1 only if T ≤ C with J = 2, with ∆ = Φ = 0 otherwise. This is
a special case of competing risk survival data subject to general interval censoring as studied in Hudgens et al. (2001). We again assume that C is independent of (T, J), with the implication that we still focus on the conditional likelihood of the data, given C. This is easily seen to be given by

$$CL = \prod_{i=1}^{n} F_1(c_i)^{\delta_i}\, F_2(c_i)^{\phi_i}\, S(c_i)^{1-\delta_i-\phi_i}. \qquad (9)$$
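A minimal sketch of the simple marginal approach for competing risks current status data – separate isotonic regressions of δ, φ and δ + φ on c – is given below. The simulated data and the use of scikit-learn's isotonic regression are illustrative assumptions; these correspond to the naive and pooled estimators discussed in this subsection, not to the full nonparametric maximum likelihood estimator.

```python
# Sketch: naive nonparametric estimators for current status competing risks data,
# using E(Δ|C)=F1(C), E(Φ|C)=F2(C) and E(Δ+Φ|C)=F(C).  Simulated data for illustration.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)
n = 500
t = rng.exponential(1.0, n)                      # failure times
j = rng.binomial(1, 0.4, n) + 1                  # cause of failure, J in {1, 2}
c = rng.uniform(0.1, 4.0, n)                     # monitoring times, independent of (T, J)
delta = ((t <= c) & (j == 1)).astype(float)
phi   = ((t <= c) & (j == 2)).astype(float)

order = np.argsort(c)
iso = lambda y: IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(c[order], y[order])
F1_naive = iso(delta)            # estimate of F1 at the ordered monitoring times
F2_naive = iso(phi)              # estimate of F2
F_hat    = iso(delta + phi)      # estimate of F = F1 + F2 from the pooled indicator
# Note: F1_naive + F2_naive need not be bounded by 1, the drawback noted in the text.
```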
Ideas for estimation of parametric competing risk models, based on the likelihood (9), apply here much as they do for standard current status data (Jewell et al., 2003). Since, by definition, E(∆ | C) = F1(C) and E(Φ | C) = F2(C), simple nonparametric estimators of F1 and F2 can be constructed via separate current status estimators based on (δi, ci: i = 1, . . . , n) and (φi, ci: i = 1, . . . , n), respectively, using the methods of Section 3. A disadvantage of this naive approach is that there is no guarantee that F̂1 + F̂2 is a distribution function, so that the derived estimator of the overall survival function, Ŝ(t) = 1 − F̂1(t) − F̂2(t), may be negative for large t.
An alternative ad hoc approach is developed by Jewell et al. (2003) as follows. First, reparameterize F1 and F2 in terms of F and F1. An immediate estimator of F is available from the data (γi, ci), where γi = δi + φi; since E(Γ | C) = F(C), where Γ = ∆ + Φ, we can again use the current status methods of Section 3 to produce F̂ as an estimator of F. Now, restrict attention to the data where ∆ + Φ = 1, and define a constructed variable Z by Z = F̂(C)∆. Note that E(Z | C, ∆ + Φ = 1) = F(C) × Pr(∆ = 1 | C, ∆ + Φ = 1) = F1(C). This suggests an isotonic regression estimator of the constructed data, F̂(ci)δi, against ci, using only observations where δi + φi = 1, yielding an estimator F̂1p. Similarly, the isotonic regression of F̂(ci)φi against ci will provide the analogous estimator F̂2p for F2. Again, F̂1p(t) + F̂2p(t) may exceed one for large t, although this may be less likely than for the naive approach since the isotonic regressions are here based on F̂(·)∆ and F̂(·)Φ, both smaller than the respective dependent variables, ∆ and Φ, for the previous estimators.
Neither of these approaches yields the nonparametric maximum likelihood estimator in general. The difference between the second approach and the nonparametric maximum likelihood estimators, say F1n and F2n, hinges on variation in the support of F1n and F2n; that is, the nonparametric maximum likelihood estimator uses the fact that F2n may be non-constant between support points of F1n. However, Jewell et al. (2003) show that smooth functionals of either F1 or F2 are efficiently estimated using the appropriate functionals of either of the two simpler estimators of F1 and F2, respectively. Simulations show that the naive current status estimator (which ignores cause of failure data) and the full NPMLE of F have very similar performances in general; this is to be expected as there can be no value in knowing the cause of failure if one is solely interested in estimating the overall survival distribution.
The general EM algorithm can be used to compute the nonparametric maximum likelihood estimators of F1 and F2. However, Jewell and Kalbfleisch (2004) provide a much faster algorithm that generalizes pool–adjacent–violators. Their approach can most easily be described by restating the problem as follows: let (Ai, Bi, Di) be a trinomial
variate with index ni and probabilities pi, qi, 1 − pi − qi, independently for i = 1, . . . , k. We wish to maximize the log likelihood function

$$\ell(p, q) = \sum_{i=1}^{k} \bigl\{ a_i \log p_i + b_i \log q_i + d_i \log[1 - p_i - q_i] \bigr\}, \qquad (10)$$
where p = (p1, . . . , pk) and q = (q1, . . . , qk). The parameter space,

$$\Theta = \bigl\{(p, q)\colon\ 0 \le p_1 \le \cdots \le p_k;\ 0 \le q_1 \le \cdots \le q_k;\ 1 - p - q \ge 0 \bigr\},$$

is a compact convex set in R^{2k}. Equivalence to maximization of the conditional likelihood given in (9) is easily seen by ordering and grouping observations according to the size of the ci's; then, for each distinct ci, let Ai be the number of observations with monitoring time ci for which δi = 1, with a similar definition for Bi and Di; in previous notation, pi = F1(ci) and qi = F2(ci).
An iterative algorithm to find the estimator of (p, q) that maximizes (10) is given in Jewell and Kalbfleisch (2004) using the strategy of maximizing over the vector p, holding q fixed, and vice versa. These maximizations are achieved using a variation on the pool–adjacent–violators algorithm where pooling now involves solution of a polynomial equation rather than simple averaging. Care is needed with regard to estimates of the vectors p, q for both the first and last set of entries. Further work is required to establish the limiting distribution of the nonparametric maximum likelihood estimator or other techniques that may be used to provide confidence limits for specific values of F1 or F2; one approach is to approximate such ‘parameters’ by smooth functionals of F1 and F2.
Jewell et al. (2003) and Jewell and Kalbfleisch (2004) illustrate the application of the nonparametric estimators discussed in this section on an example on women's age at menopause, where the outcome of interest (menopause) is associated with two competing causes, natural and operative menopause. Jewell et al. (2003) also consider the situation where failure times for one risk are observed exactly whenever failure due to that cause occurs prior to the monitoring time.

5.2. Bivariate current status data

Consider a study in which interest focuses on the bivariate distribution F of two random survival variables (T1, T2), neither of which can be directly measured. Rather, for each individual, we observe, at a random monitoring time, C, whether Tj exceeds C or not for each j = 1, 2. That is, on each subject, we observe (Y1 ≡ I(T1 ≤ C), Y2 ≡ I(T2 ≤ C), C). Again, C is assumed independent of (T1, T2). Wang and Ding (2000) refer to this data structure as bivariate current status data. Conditional on the observed values of C, the likelihood of a set of n independent observations of this kind is given by

$$CL = \prod_{i=1}^{n} F_3(c_i)^{y_{1i}y_{2i}}\, (1 + F_3 - F_1 - F_2)(c_i)^{(1-y_{1i})(1-y_{2i})}\, (F_1 - F_3)(c_i)^{y_{1i}(1-y_{2i})}\, (F_2 - F_3)(c_i)^{(1-y_{1i})y_{2i}}, \qquad (11)$$
where F1(t) = P(T1 ≤ t), F2(t) = P(T2 ≤ t) and F3(t) = P(T1 ≤ t, T2 ≤ t) are marginal distributions of F along the two axes and the diagonal, respectively. It follows that only these three univariate cdf's F1, F2 and F3 are identified from the data. In particular, the complete bivariate distribution, F, is not identifiable; however, the dependence measure F3 − F1F2 is identifiable from the data, so that some assessment of independence of T1 and T2 is possible. Wang and Ding (2000) considered a semiparametric copula model for F, parametrized by the marginals, F1 and F2, and a single real-valued parameter α which represents a measure of dependence between T1 and T2.
Note that ‘marginal’ nonparametric current status estimators of Fj, j = 1, 2, 3, are available. With Y3 = Y1Y2, Fj(t) can be represented in terms of a monotonic regression of Yj on C since Fj(t) = E(Yj | C = t), for j = 1, 2, 3; we can thus use the current status estimator based on (Yj, C) to estimate Fj. This estimator is, of course, the nonparametric maximum likelihood estimator based on the reduced data (Yj, C). From the results of Section 3, it follows that these reduced data nonparametric maximum likelihood estimators are consistent and converge, under appropriate conditions, at rate n^{−1/3}, to known asymptotic distributions. In spite of the simplicity of these three reduced data nonparametric maximum likelihood estimators relative to the full nonparametric maximum likelihood estimator based on (11), Jewell et al. (2004) show that, at most data generating distributions, the reduced data nonparametric maximum likelihood estimators yield efficient estimators of smooth functionals of (F1, F2, F3). If interest focuses on the possible dependence of T1 and T2, then estimates of appropriately chosen functionals of F3 − F1F2 may be examined based on these reduced data nonparametric maximum likelihood estimators.
We can restate the problem of nonparametric maximization of the likelihood (11) in terms of a multinomial random variable as follows: let (Ai, Bi, Di, Ei) be a four-state multinomial variate with index ni and probabilities pi, qi, ri, 1 − pi − qi − ri, independently for i = 1, . . . , k. Having ordered the observations according to the ci's, maximizing the likelihood (11) is equivalent to maximization of the log likelihood function

$$\ell(p, q, r) = \sum_{i=1}^{k} \bigl\{ a_i \log p_i + b_i \log(q_i - p_i) + d_i \log(r_i - p_i) + e_i \log(1 + p_i - q_i - r_i) \bigr\}, \qquad (12)$$
where p = (p1, . . . , pk), q = (q1, . . . , qk) and r = (r1, . . . , rk), with the parameter space defined by

$$\Theta = \bigl\{(p, q, r)\colon\ 0 \le p_1 \le \cdots \le p_k;\ 0 \le q_1 \le \cdots \le q_k;\ 0 \le r_1 \le \cdots \le r_k;\ q - p \ge 0;\ r - p \ge 0;\ 1 - p - q - r \ge 0 \bigr\}.$$

Note that Θ is again a compact convex set in R^{3k}. This formulation is obtained by setting, for each distinct ci, Ai to be the number of observations with monitoring time ci for which y3 = 1, Bi the number of observations with monitoring time ci for which y1 = 1 − y2 = 1, and Di the number of observations with monitoring time ci for which 1 − y1 = y2 = 1. With regard to the parameters, we have pi = F3(ci), qi = F1(ci) and ri = F2(ci). With this respecification of the problem, it would be of considerable value to derive an iterative algorithm akin to the Jewell and Kalbfleisch (2004) approach of Section 5.1; the main issue here is the appropriate handling of the ‘edge’ effects of the constraints linking p, q and r.
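The reduced-data estimators of F1, F2 and F3 described above amount to three separate isotonic regressions of Y1, Y2 and Y3 = Y1Y2 on C. A minimal sketch, with simulated (dependent) data and scikit-learn's isotonic regression standing in for pool–adjacent–violators, is given below; all names and the data-generating choices are illustrative assumptions.

```python
# Sketch: reduced-data estimators for bivariate current status data, regressing
# Y1, Y2 and Y3 = Y1*Y2 isotonically on C to estimate F1, F2 and F3.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
n = 600
z = rng.exponential(1.0, n)                       # shared component inducing dependence
t1 = rng.exponential(1.0, n) + 0.5 * z
t2 = rng.exponential(1.0, n) + 0.5 * z
c = rng.uniform(0.2, 5.0, n)                      # common monitoring times
y1, y2 = (t1 <= c).astype(float), (t2 <= c).astype(float)
y3 = y1 * y2

order = np.argsort(c)
iso = lambda y: IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(c[order], y[order])
F1_hat, F2_hat, F3_hat = iso(y1), iso(y2), iso(y3)
dependence = F3_hat - F1_hat * F2_hat             # crude pointwise estimate of F3 - F1*F2
```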
We have assumed that the monitoring time C is the same for both T1 and T2. In some applications, the monitoring times may differ so that current status information on Ti is obtained at time Ci, i = 1, 2, where the random or fixed C1 is not the same as C2. This is a substantially more complex problem than the case considered here, and, to date, there is little work that has addressed this version of bivariate current status data.

5.3. Outcomes with intermediate stage

A special form of bivariate survival data arises from observations on the time to failure where all individuals pass through an intermediate stage prior to failure. In this situation, let T1 represent the time from the origin until the intermediate event occurs, with T2 being the time to failure. Here, necessarily, T2 ≥ T1. Current status observation of this process at a monitoring time C reveals whether an individual has failed by time C or not, and in the latter case, whether the intermediate event has occurred by time C or not. As a result, the observed data is then given by the random variable (Y1 ≡ I(T1 ≤ C), Y2 ≡ I(T2 ≤ C), C). Unlike arbitrary bivariate current status data, there are only three possible outcomes for Y ≡ (Y1, Y2), namely (0, 0), (1, 0), and (1, 1). Once more, C is assumed independent of (T1, T2). A variant of this data structure, where exact information is available on T2 whenever T2 ≤ C, is studied in van der Laan et al. (1997). Conditional on the observed values of C, the likelihood of a set of n independent observations of this kind is given by

$$CL = \prod_{i=1}^{n} F_2(c_i)^{y_{1i}y_{2i}}\, (1 - F_1)(c_i)^{(1-y_{1i})(1-y_{2i})}\, (F_1 - F_2)(c_i)^{y_{1i}(1-y_{2i})}, \qquad (13)$$
where F1(t) = P(T1 ≤ t) and F2(t) = P(T2 ≤ t) are the marginal distributions of T1 and T2, respectively. It follows that just the two marginal cdf's F1 and F2 are identified. As for bivariate current status data, the complete bivariate distribution of (T1, T2) is not identifiable; an unfortunate consequence of this is that the data contains no information on the possibility of dependence between T1 and T2 − T1, the recurrence times of the first and second event, respectively. Thus, the relationship between recurrence times can only be investigated via a prior model assumption whose dependence structure cannot be verified nonparametrically from the data.
This data structure is a special case of current status observation on a counting process, which we discuss in more detail in Section 5.4. Here, we point out that, as in Sections 5.1 and 5.2, we can restate the problem of nonparametric maximization of the likelihood (13), now in terms of a trinomial random variable. Let (Ai, Bi, Di) be a trinomial variate with index ni and probabilities pi, qi, 1 − pi − qi, independently for i = 1, . . . , k. Nonparametric maximization of (13) is equivalent to maximization of the log likelihood function

$$\ell(p, q) = \sum_{i=1}^{k} \bigl\{ a_i \log p_i + b_i \log q_i + d_i \log[1 - p_i - q_i] \bigr\}, \qquad (14)$$
where p = (p1, . . . , pk) and q = (q1, . . . , qk), with the parameter space defined by Θ = {(p, q): 0 ≤ p1 ≤ · · · ≤ pk; 0 ≤ q1 ≤ · · · ≤ qk; 1 − p − q ≥ 0}, a compact convex set in R^{2k}. This equivalence is achieved as before by setting Ai to be the number of observations with y1 = (1 − y2) = 1 and monitoring time a distinct ci from amongst the ordered monitoring times; similarly, Bi is the number of observations with y1 = y2 = 1 and monitoring time ci. The parameters are pi = F2(ci) and qi = F1(ci). With this formulation at hand, it would be of interest to describe an appropriate version of the Jewell and Kalbfleisch (2004) iterative algorithm, with again the edge effects being important.

5.4. Counting processes

We now consider current status monitoring of a counting process X(t) = ∑_{j=1}^{k} I(Tj ≤ t), where, for j = 1, . . . , k, Tj is the random variable which measures the time at which X jumps from j − 1 to j. Necessarily T1 ≤ T2 ≤ · · · ≤ Tk. Now assume that data arises from a sample of n current status observations of the process X, where the monitoring times are described by the random variable C, assumed independent of X. Note this corresponds to simple cross-sectional observation of X. Jewell and van der Laan (1995) describe several possible applications where this data structure arises naturally. Note that allowing the marginal distributions, Fj of Tj, j = 1, . . . , k, to each have a possible point mass at infinity accommodates data structures where individuals may "stop" after one jump, or two, etc. Individuals are not therefore required to pass through the exact same number of stages or jumps. Further, choosing the finite number of states to be large enough accommodates any practical application, so that the case of an infinite number of states is only of theoretical import.
The data is thus a sample of independent and identically distributed observations on the random variable (X(C), C). As we have seen in previous sections, particularly Section 5.3, it is easy to see that, nonparametrically, the likelihood only depends on the marginal distributions Fj. An unfortunate consequence of this is again that, absent some additional model assumptions, the data tells us nothing about the interesting possibility of dependence among the recurrence times T1, T2 − T1, . . . , Tj − Tj−1, . . . . Nonparametric maximum likelihood estimation of F1, . . . , Fk requires some form of iterative algorithm – see Section 5.3. However, as we observed in Sections 5.1 and 5.2, direct estimation of any single Fj is possible using the standard current status observations (Y = I(Tj ≤ C), C), and estimates of smooth functionals of Fj can be based on this simple estimator, enjoying all the asymptotic properties outlined in Section 3. Note that this estimator ignores apparently useful information given in X(C) beyond the simple fact of whether X(C) ≥ j or not. Nevertheless, van der Laan and Jewell (2003) show that, at many data generating distributions, the simple standard current status estimators of Fj yield efficient estimators of smooth functionals. These simple current status estimators are not the full nonparametric maximum likelihood estimators, and van der Laan and Jewell (2003) discuss in detail the differences between the two approaches, thereby giving insight into why the nonparametric maximum likelihood estimator shows no asymptotic gain for such functional estimation. In the above, we have focused on estimation of Fj, the marginal distribution of Tj, for j = 1, . . . , k. A sketch of this marginal approach follows.
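The sketch below illustrates the marginal estimator of Fj based on the indicator I(X(C) ≥ j), and the estimate of the mean function Λ obtained by summing the marginal estimates (see (15) below). The simulated jump times and the use of scikit-learn's isotonic regression in place of pool–adjacent–violators are assumptions made purely for illustration.

```python
# Sketch: marginal estimation of F_j from a single cross-sectional observation of a
# counting process, by isotonic regression of I(X(C) >= j) on C.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)
n, k = 500, 3
gaps = rng.exponential(1.0, size=(n, k))          # waiting times between successive jumps
jump_times = np.cumsum(gaps, axis=1)              # T1 <= T2 <= ... <= Tk
c = rng.uniform(0.2, 6.0, n)                      # monitoring times, independent of X
x_at_c = (jump_times <= c[:, None]).sum(axis=1)   # observed X(C)

order = np.argsort(c)
iso = lambda y: IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(c[order], y[order])
F_hat = [iso((x_at_c >= j).astype(float)) for j in range(1, k + 1)]  # F1,...,Fk estimates
Lambda_hat = np.sum(F_hat, axis=0)                # pointwise estimate of the mean function
```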
In some applications, particularly when the number of states, k, is
large there may be little interest in each individual marginal distribution. In such cases, a simple function of the marginal distributions, namely the so-called mean function, Λ(t) = E(X(t)), may however be of considerable importance. It is easy to see that
$$\Lambda(t) = \sum_{j=1}^{k} F_j(t), \qquad (15)$$

a description that is applicable even if the number of jumps can be arbitrarily large so that the above sum has an infinite number of terms. The mean function may be particularly useful as a method to summarize the effects of covariates on X(t). Sun and Kalbfleisch (1993) consider estimation of Λ, discuss regression models that allow this mean function to vary across covariate groups, and consider application of the ideas to multiple tumor data from a tumorigenicity experiment. Note that, for current status observation on X(t) at random monitoring times C with no covariates, the mean function is isotonic in the observed C's, so that many of the ideas of Section 3 can be immediately applied to estimation of Λ, including the pool–adjacent–violators algorithm.

6. Conclusion

This paper has reviewed recent advances in the understanding of nonparametric estimation based on various forms of current status data. Throughout, a key assumption has been independence between the monitoring time variable C and the survival random variable, T, or counting process, X, of interest. An important future area of study with current status data concerns the relaxation of this assumption. For example, suppose, for a survival random variable T and random monitoring time C, we observe the data structure Y = (I(T ≤ C), C, L(C)) that includes observation of covariate processes L up to time C. The assumption of independence between T and C can now be assumed conditional on the observed L(C). This therefore allows dependence between the monitoring time C and T that arises solely through L(C). To illustrate the importance of this extension, consider an animal tumorigenicity experiment designed to estimate the distribution of time to development of an occult tumor. Suppose that L(u) includes the weight of the experimental animal at time u, and that Y = (I(T ≤ C), C, L(C)) is observed. A reasonable alternative to choosing monitoring times completely at random is to increase the ‘hazard’ of monitoring shortly after an animal begins to lose weight as reflected in measurements of L; this is likely to improve efficiency in estimation if the monitoring time is thereby closer to the time of tumor onset (i.e., T). This monitoring scheme introduces dependence between C and T, and estimators, discussed in Section 3, that ignore this dependence will be biased. For the extended current status data structure Y = (I(T ≤ C), C, L(C)), van der Laan and Robins (1998) develop locally efficient estimators for smooth functionals of F, the distribution function of T. An important open problem of interest involves the use of these results in choosing optimal, or close to optimal, designs for the dynamic selection of monitoring times C that depend on concurrent observation of key covariates within L.
Finally, since current status data corresponds with taking a single cross-sectional observation on individual survival processes, it is natural to consider similar questions where multiple cross-sectional observations are available at differing monitoring times for each individual. Data of this kind are often referred to as panel data. In the context
of the single survival random variable T of Section 3, this monitoring scheme leads to interval-censored data, case II (Groeneboom and Wellner, 1992). There is a parallel extensive literature on estimation problems of the kind considered here, based on this more informative and general form of interval censored data, that deserves a similar review article of recent advances. For helpful introductions, see Sun (1998) and Huang and Wellner (1997). Panel data has also been considered in the context of counting processes as in Sections 5.4 by Sun and Kalbfleisch (1995), Wellner and Zhang (2000) and others. References Andrews, C., van der Laan, M., Robins, J.M. (2002). University of California Berkeley Division of Biostatistics, Working Paper Series. Working Paper 110. http://www.bepress.com/ucbbiostat/paper110. Ayer, M., Brunk, H.D., Ewing, G.M., Reid, W.T., Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Ann. Math. Statist. 26, 641–647. Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D. (1972). Statistical Inference under Order Restrictions. Wiley, New York. Bennett, S. (1983). Analysis of survival data by the proportional odds model. Statist. Medicine 2, 273–277. Betensky, R.A. (2000). On nonidentifiability and noninformative censoring for current status data. Biometrika 87, 218–221. Cox, D.R. (1972). Regression models with life tables (with discussion). J. Roy. Statist. Soc. B 34, 187–220. DeGruttola, V., Lagakos, S.W. (1989). Analysis of doubly-censored survival data with application to AIDS. Biometrics 45, 1–11. Diamond, I.D., McDonald, J.W., Shah, I.H. (1986). Proportional hazards models for current status data: application to the study of differentials in age at weaning in Pakistan. Demography 23, 607–620. Diamond, I.D., McDonald, J.W. (1991). The analysis of current status data. In: Trussel, J., Hankinson, R., Tilton, J. (Eds.), Demographic Applications of Event History Analysis. Oxford University Press, Oxford. Doksum, K.A., Gasko, M. (1990). On a correspondence between models in binary regression and in survival analysis. Internat. Statist. Rev. 58, 243–252. Gart, J.J., Krewski, D., Lee, P.N., Tarone, R.E., Wahrendorf, J. (1986). Statistical Methods in Cancer Research, Vol. III, The Design and Analysis of Long-term Animal Experiments. In: IARC Scientific Publications, Vol. 79. International Agency for Research on Cancer, Lyon. Groeneboom, P., Wellner, J.A. (1992). Nonparametric Maximum Likelihood Estimators for Interval Censoring and Denconvolution. Birkhäuser, Boston. Grummer-Strawn, L.M. (1993). Regression analysis of current status data: An application to breast feeding. J. Amer. Statist. Assoc. 88, 758–765. Huang, J. (1996). Efficient estimation for the proportional hazards model with interval censoring. Ann. Statist. 24, 540–568. Huang, J., Wellner, J.A. (1995). Asymptotic normality of the NPMLE of linear functionals for interval censored data, Case I. Statist. Neerlandica 49, 153–163. Huang, J., Wellner, J.A. (1997). Interval censored survival data: A review of recent progress. In: Lin, D.-Y. (Ed.), Proceedings of First Seattle Conference in Biostatistics. Springer, Berlin, pp. 123–169. Hudgens, M.G., Satten, G.A., Longini, I.M. (2001). Nonparametric maximum likelihood estimation for competing risks survival data subject to interval censoring and truncation. Biometrics 57, 74–80. Jewell, N.P., Kalbfleisch, J.D. (2004). Maximum likelihood estimation of ordered multinomial parameters. In preparation. 
Jewell, N.P., Malani, H., Vittinghoff, E. (1994). Nonparametric estimation for a form of doubly censored data with application to two problems in AIDS. J. Amer. Statist. Assoc. 89, 7–18. Jewell, N.P., Shiboski, S. (1990). Statistical analysis of HIV infectivity based on partner studies. Biometrics 46, 1133–1150. Jewell, N.P., van der Laan, M. (1995). Generalizations of current status data with applications. Lifetime Data Anal. 1, 101–109.
Jewell, N.P., van der Laan, M. (1997). Singly and doubly censored current status data with extensions to multistate counting processes. In: Lin, D.-Y. (Ed.), Proceedings of First Seattle Conference in Biostatistics. Springer, Berlin, pp. 171–184. Jewell, N.P., van der Laan, M. (2004). Case-control current status data. In preparation. Jewell, N.P., van der Laan, M., Henneman, T. (2003). Nonparametric estimation from current status data with competing risks. Biometrika 90, 183–197. Jewell, N.P., van der Laan, M., Lei, X. (2004). Bivariate current status data. In preparation. Keiding, N. (1991). Age-specific incidence and prevalence: a statistical perspective (with discussion). J. Roy. Statist. Soc. A 154, 371–412. Keiding, N., Begtrup, K., Scheike, T.H., Hasibeder, G. (1996). Estimation from current-status data in continuous time. Lifetime Data Anal. 2, 119–129. Lin, D.Y., Oakes, D., Ying, Z. (1998). Additive hazards regression with current status data. Biometrika 85, 289–298. Mammen, E. (1991). Estimating a smooth monotone regression function. Ann. Statist. 19, 724–740. Martinussen, T., Scheike, T.H. (2002). Efficient estimation in additive hazards regression with current status data. Biometrika 89, 649–658. Mukerjee, R. (1988). Monotone nonparametric regression. Ann. Statist. 16, 741–750. Rabinowitz, D. (2000). Testing current status data for dependent monitoring. Statist. Probab. Lett. 48, 213– 216. Rabinowitz, D., Jewell, N.P. (1996). Regression with doubly censored current status data. J. Roy. Statist. Soc. B 58, 541–550. Rabinowitz, D., Tsiatis, A., Aragon, J. (1995). Regression with interval censored data. Biometrika 82, 501– 513. Rossini, A., Tsiatis, A.A. (1996). A semiparametric proportional odds regression model for the analysis of current status data. J. Amer. Statist. Assoc. 91, 713–721. Scott, A.J., Wild, C.J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika 84, 57–71. Shiboski, S.C. (1998a). Partner studies. In: Armitage, P., Colton, T. (Eds.), The Encyclopedia of Biostatistics. Wiley, New York, pp. 3270–3275. Shiboski, S.C. (1998b). Generalized additive models for current status data. Lifetime Data Anal. 4, 29–50. Shiboski, S.C., Jewell, N.P. (1992). Statistical analysis of the time dependence of HIV infectivity based on partner study data. J. Amer. Statist. Assoc. 87, 360–372. Sun, J. (1998). Interval censoring. In: Armitage, P., Colton, T. (Eds.), The Encyclopedia of Biostatistics. Wiley, New York, pp. 2090–2095. Sun, J., Kalbfleisch, J.D. (1993). The analysis of current status data on point processes. J. Amer. Statist. Assoc. 88, 1449–1454. Sun, J., Kalbfleisch, J.D. (1995). Estimation of the mean function of point processes based on panel count data. Statist. Sinica 5, 279–290. van der Laan, M., Andrews, C. (2000). The nonparametric maximum likelihood estimator in a class of doubly censored current status data models with application to partner studies. Biometrika 87, 61–71. van der Laan, M.J., Bickel, P.J., Jewell, N.P. (1997). Singly and doubly censored current status data: Estimation, asymptotics and regression. Scand. J. Statist. 24, 289–308. van der Laan, M., Jewell, N.P. (2001). The NPMLE in the doubly censored current status data model. Scand. J. Statist. 28, 537–547. van der Laan, M., Jewell, N.P. (2003). Current status data and right-censored data structures when observing a marker at the censoring time. Ann. Statist. 31, 512–535. van der Laan, M., Jewell, N.P., Petersen, D. (1997). 
Efficient estimation of the lifetime and disease onset distribution. Biometrika 84, 539–554. van der Laan, M.J., Robins, J.M. (1998). Locally efficient estimation with current status data and timedependent covariates. J. Amer. Statist. Assoc. 93, 693–701. Wang, W., Ding, A.A. (2000). On assessing the association for bivariate current status data. Biometrika 87, 879–893. Wellner, J.A., Zhang, Y. (2000). Two estimators of the mean of a counting process with panel count data. Ann. Statist. 28, 779–814.
Ch. 36
Handbook of Statistics, Vol. 23 ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(03)23036-4
Appraisal of Models for the Study of Disease Progression in Psoriatic Arthritis
R. Aguirre-Hernández and V.T. Farewell
1. Introduction

The study of disease progression in a chronic disease often requires the analysis of longitudinal data which derive from the occurrence of multiple events over time per patient. Continuous monitoring of patients is seldom feasible, however, and, for long-term chronic diseases in particular, the data are usually taken from a clinical database. A clinical database will primarily record information on patients at the times of clinic visits. Therefore the time at which a clinical event occurs is usually only known to be between two clinic visits. In addition, the time between clinic visits is commonly quite variable, both within and between patients. This generates what are commonly termed panel data.
This chapter deals with the analysis of data on multiple events when the available data are limited to counts of the number of events which occur between clinic visits. A Markov model based on a categorization of the count data is examined first. Subsequently, the inter-visit counts are modelled directly with Poisson and negative binomial distributions. The need for special treatment of zero counts in such models is examined. The particular focus of the chapter is on model appraisal.
2. Data

Psoriatic arthritis (PsA) is an inflammatory arthritis associated with psoriasis (Gladman, 1997). Although initially considered a benign form of arthritis, over the past decade PsA has been recognised as a progressive form of arthritis, leading to joint damage and disability. The data used in this chapter relate to patients with PsA followed at the University of Toronto Psoriatic Arthritis Clinic and tracked in the PsA Clinic database. Patients are registered in the clinic if they have an inflammatory arthritis associated with psoriasis, and if other forms of arthritis such as seropositive nodular rheumatoid arthritis, systemic lupus erythematosus, inflammatory bowel disease, gout and grade 4 osteoarthritis are excluded (Gladman et al., 1987).
Patients are assessed according to a standard protocol, including history, physical examination and laboratory evaluation at 6–12 month intervals. Active, swollen and damaged joint counts are recorded at each assessment. While active and swollen joints may subsequently improve and therefore represent reversible events, damage is irreversible and thus provides a measure of disease progression over time. Individual joint damage is based on joint deformity and is assessed clinically on the basis of a decreased range of movement of more than 20% of the normal range that could not be attributed to active inflammation, the presence of contractures, subluxation, loosening or ankylosis, or previous surgery. Joint counts made in this clinic have been shown to be reliable (Gladman et al., 1990). Patients are officially registered with the clinic after their second visit.
An initial study was undertaken of clinical indicators, available at the time of a patient's first clinic visit, which were related to subsequent disease progression. This was taken to be reflected in the development of damaged joints (Gladman et al., 1995). This study used a multi-state Markov model for disease progression, with states of damage defined by dividing the number of damaged joints into four categories: 0, 1 to 4, 5 to 9, and 10 or more. Five or more swollen joints and a high medication level at presentation were associated with progression of damage, whereas a low erythrocyte sedimentation rate (ESR) was "protective".
The choice of appropriate explanatory variables in a regression model is, of course, a major component of model appraisal. However, methods for this are widely known. Therefore, the analyses reported here will incorporate the risk factors found by Gladman et al. (1995) without further investigation of this aspect of the models.
3. Markov models

3.1. Background

As mentioned in the previous section, in Gladman et al. (1995) damage progression in PsA was defined as transitions between 4 damaged states based on the number of damaged joints. Patients in states 1, 2, 3, and 4 had 0, 1 to 4, 5 to 9, and 10 or more damaged joints, respectively. The aim of the Toronto clinic is to measure damage at a clinic visit every six months but, in practice, considerable variation occurred. The transitions between damaged states were described by a progressive and time-continuous Markov regression model. Details of the prognostic factors selected by Gladman et al. (1995) are: number of effused joints divided into two classes, < 5 or ≥ 5; erythrocyte sedimentation rate (ESR) categorized as < 15 mm/h or ≥ 15 mm/h; and type of medication taken before participating in the study, none or nonsteroidal antiinflammatory medications, disease modifying drugs (DMD) or oral corticosteroids. Patients were also stratified by their initial state in order to adjust for any differences in the referral pattern, although these were not expected to be marked. No attempt was made by the authors to address the goodness of fit problem.
Multi-state Markov models have also been used by Kalbfleisch and Lawless (1985) to study smoking prevention programmes in schoolchildren. Pérez-Ocón et al. (1998)
used a Markov process to model the influence of treatments in relapse and survival times to breast cancer. Keiding et al. (2001) used such models to analyse the course of events after bone marrow transplantation. Longini Jr et al. (1989), Gentleman et al. (1994), Lee and Kim (1998), and Sypsa et al. (2001) applied Markov models to study HIV/AIDS disease. Little attention was given to the issue of goodness of fit of the models. Most goodness of fit tests for Markov models assume that, for each subject, a response variable is recorded at a fixed number of times and that no explanatory variables are measured. This type of longitudinal data can be displayed in contingency tables as explained by Bishop et al. (1975). The authors show that the analysis of these tables is formally equivalent to certain contingency table analyses based on log-linear models. For equally spaced longitudinal data, Kalbfleisch and Lawless (1985) also proposed the construction of a contingency table of observed and expected transition counts for each time interval. The matrix of estimated transition probabilities is used to calculate the expected transition counts. The chi-squared statistic or the likelihood ratio statistic is calculated for each contingency table. These tables are independent if the Markov model is of order one so an overall goodness of fit statistic is obtained by adding the statistics calculated for each time interval. The authors state that the overall statistic has an asymptotic chi-squared distribution. This goodness of fit test was also proposed by de Stavola (1988). Gentleman et al. (1994) proposed the use of approximate observed and expected counts when no explanatory variables are included in the model and the observations are, in general, unequally spaced. A partition of the time scale was suggested by the authors. The approximate counts are calculated at each of the partition’s time points by assuming that an individual not observed at time t remained at his/her preceding state. Most repeated measurement studies are designed so that data are collected with certain periodicity. However, in public health and medicine, patients move away or drop out of the study, others miss or change an appointment, and some die for reasons unrelated to the treatment. In the PsA study, the variability between the inter-visit periods is such that it is natural to assume that damage occurs on a continuous-time scale rather than a discrete scale. Thus, the goodness of fit statistics described above are inappropriate to examine the adequacy of the fitted model. Furthermore, none of the above statistics can be used to examine the fit of Markov regression models in which the transition rates depend on several explanatory variables. Here, we present a Pearson-type statistic to examine the goodness of fit of stationary and time-continuous Markov regression models of order one. The statistic is designed for models in which the transition rates depend on several explanatory variables. The test statistic is also suitable for panel data in which the spacing between the responses and the total number of observations vary from one individual to another. 3.2. Inference for progressive Markov models Consider a random sample of size n in which subject i is observed at mi points in time denoted as: ti,1 < ti,2 < · · · < ti,mi . Let Yi,j represent a qualitative or discrete response
variable recorded for individual i at time ti,j. The values assumed by Yi,j are called states and will be denoted as 1, 2, . . . , k, . . . , K where K < ∞. In general, individuals will change state between observation times. Markov models of order one are defined by the transition probabilities

$$p_{i,j(a,b)} = p_{(a,b)}(t_{i,j}, t_{i,j+1}) = P(Y_{i,j+1} = b \mid Y_{i,j} = a),$$

or by the transition rates, qi(a,b) ≥ 0 for b ≠ a and qi(a,a) = −∑_{b≠a} qi(a,b). It will be assumed that the transition probabilities are stationary or time-homogeneous. In progressive Markov models transitions can only occur to the current state or to one state ahead, i.e., qi(a,a+1) > 0 and qi(a,b) = 0 for b ∉ {a, a + 1}. This means that K is an absorbing state. Progressive Markov models have been used to describe chronic diseases like cancer, AIDS, and arthritis.
In Markov regression models, the transition rates may depend on several explanatory variables or covariates. Here we only consider the situation in which for each subject p − 1 explanatory variables are measured at the beginning of the study. The vector of explanatory variables for subject i is denoted as zi = (1, zi,1, . . . , zi,p−1) for i = 1, 2, . . . , n. As qi(a,b) ≥ 0 for a ≠ b, the logarithm of the transition rates can be expressed as a linear combination of zi, i.e.,

$$\ln(q_{i(a,b)}) = \beta_{0(a,b)} + \sum_{u=1}^{p-1} \beta_{u(a,b)}\, z_{i,u} \qquad \text{for } a \neq b.$$
Notice that the effect of the explanatory variables can change from one transition rate to another. Kalbfleisch and Lawless (1985) present a procedure to obtain maximum likelihood estimates and associated asymptotic covariance matrices for transition rates in stationary Markov regression models.
The sojourn time Ti(a) represents the time that individual i spends in state a before moving to state a + 1. In stationary Markov models, the distribution of Ti(a) is independent of the amount of time that individual i has already spent in state a. This means that the Ti(a) follow exponential distributions with parameters λi(a) = qi(a,a+1)^{−1}, i.e., f_{Ti(a)} = (1/λi(a)) exp(−ti(a)/λi(a)), where ti(a) is the time measured from the previous observation time point. This exponential assumption has been found to be widely useful with appropriate state definitions. As Kalbfleisch and Lawless (1985) point out, it may not always be strictly appropriate but is a useful and often necessary first step. The exponential assumption is one aspect of the model which is examined through the use of goodness of fit statistics but it is sensible to also use more specific tests for this purpose.
As shown subsequently, the probability of a transition from state a to state a + 1 can be expressed in terms of the sojourn times in states a and a + 1. The procedure is illustrated for a progressive model although it is applicable to any kind of stationary Markov model.

$$\begin{aligned}
p_{i,j(a,a+1)} &= P\bigl[T_{i(a)} < t_{i,j+1} - t_{i,j} \text{ and } T_{i(a+1)} > t_{i,j+1} - t_{i,j} - t_{i(a)} \mid Y_{i,j} = a\bigr] \\
&= P\bigl[T_{i(a)} < t_{i,j+1} - t_{i,j} \text{ and } T_{i(a+1)} > t_{i,j+1} - t_{i,j} - t_{i(a)}\bigr] \\
&= \int_0^{t_{i,j+1}-t_{i,j}} f_{T_{i(a)}} \Bigl\{1 - P\bigl[T_{i(a+1)} \le t_{i,j+1} - t_{i,j} - t_{i(a)}\bigr]\Bigr\}\, dt_{i(a)} \\
&= \int_0^{t_{i,j+1}-t_{i,j}} f_{T_{i(a)}} \Bigl\{1 - \int_0^{t_{i,j+1}-t_{i,j}-t_{i(a)}} f_{T_{i(a+1)}}\, dt_{i(a+1)}\Bigr\}\, dt_{i(a)}.
\end{aligned}$$

The second and third expressions follow from the Markov property. The explicit solution is

$$p_{i,j(a,a+1)} = \frac{\lambda_{i(a+1)}}{\lambda_{i(a+1)} - \lambda_{i(a)}} \times \left[\exp\left(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(a+1)}}\right) - \exp\left(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(a)}}\right)\right]. \qquad (1)$$
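Equation (1) can be checked numerically: for a stationary model the matrix of transition probabilities over an interval of length dt is the matrix exponential of Q·dt, where Q contains the transition rates. The sketch below assumes K = 4 states and uses the average sojourn times reported in Section 3.5 purely as illustrative values; it is not the computation used in the original analysis.

```python
# Sketch: transition probabilities for a progressive stationary Markov model with
# K = 4 states and mean sojourn times lam[a] = 1 / q_{a,a+1}.  P(dt) = expm(Q*dt);
# the (1,2) entry agrees with formula (1).  Values are illustrative only.
import numpy as np
from scipy.linalg import expm

lam = np.array([11.0, 6.23, 4.03])      # mean years in states 1, 2, 3 (cf. Section 3.5)
Q = np.zeros((4, 4))
for a in range(3):
    Q[a, a], Q[a, a + 1] = -1.0 / lam[a], 1.0 / lam[a]   # state 4 is absorbing

dt = 1.0                                 # one year between visits
P = expm(Q * dt)                         # full matrix of transition probabilities

# closed form (1) for the 1 -> 2 probability
p12 = lam[1] / (lam[1] - lam[0]) * (np.exp(-dt / lam[1]) - np.exp(-dt / lam[0]))
assert np.isclose(P[0, 1], p12)
```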
In a similar way, pi,j(a,a+2), . . . , pi,j(a,K) can be calculated for all a = 1, 2, . . . , K − 1. As ∑_{k=a}^{K} pi,j(a,k) = 1, then pi,j(a,a) = 1 − ∑_{k=a+1}^{K} pi,j(a,k). An estimate of the transition probabilities is obtained by replacing the parameter of the exponential distribution by its maximum likelihood estimate. The formulas to calculate the nine transition probabilities of a progressive Markov regression model with K = 4 states are included in Appendix A.

3.3. A goodness-of-fit statistic

In Markov regression models, the estimated transition probabilities depend on the covariate pattern, the sequence of observed states, and the time elapsed between consecutive observations. For non-stationary Markov models, the time at which the observations are made also affects the value of the estimated transition probabilities. When the model is stationary and it is wished to test if the stationarity assumption is valid, the behaviour of the estimated transition probabilities across time should also be examined. All these factors need to be considered in order to group the estimated transition probabilities to calculate a Pearson-type goodness of fit statistic.
When the explanatory variables are qualitative or discrete, the observed transitions, yi,j → yi,j+1, and the estimated transition probabilities, p̂i,j(a,b), should be grouped according to the covariate pattern associated with individual i. If some explanatory variables are continuous, we propose use of the quantiles of the estimated transition rates, {q̂i(a,b)}, to generate a partition of the space of explanatory variables. The transition rates are used here because they do not depend on time. The total number of categories thus defined will be denoted as C.
If the set of all possible values for the response variable is {1, 2, . . . , K} then a maximum of K² different transitions between states can be observed. In a progressive Markov model, only transitions of the form a → b with b ≥ a can occur, so the total number of different transitions is K(K + 1)/2 − 1 (transitions of the form K → K are ignored because K is an absorbing state). Let R denote the total number of classes in which the observed transitions and the estimated transition probabilities are classified based on the values of the response variable.
The length of the time intervals between measurements and the total number of observations for each sample unit may be fixed before collecting the data (as in experimental studies) or random (as in observational studies). The first case has been considered by several authors; see Section 3.1. For the second observation pattern (panel data) we generalize the grouping technique proposed by Hosmer and Lemeshow (1989) as follows. Classify the observed transitions and the estimated transition probabilities of the first time interval, i.e., yi,1 → yi,2 and p̂i,1(a,b), according to the quantiles (deciles, e.g.) of the distances {ti,2 − ti,1}. Similarly, group the observed transitions and the estimated transition probabilities of the second time interval based on the quantiles of {ti,3 − ti,2}. In general, classify yi,j → yi,j+1 and p̂i,j(a,b) according to the quantiles of the length of the jth time intervals (ti,j, ti,j+1). Let H denote the total number of classes in which the time intervals are classified. When all the individuals have the same number of observations equally spaced in time, then H = m − 1. If the time at which the observations are made is unimportant, the observed transitions and the estimated transition probabilities from different time intervals can be grouped into one category, so H = 1. The symbol Lh will represent the number of levels defined by the quantiles of the length of the time intervals classified in category h, with h = 1, 2, . . . , H. For simplicity it can be assumed that Lh = L for all h = 1, 2, . . . , H.
The quantity eh,l,r,c will be called the expected number of transitions in cell (h, l, r, c). It is the sum of the estimated transition probabilities classified in categories h, l, r, and c. Analogously, nh,l,r,c will denote the total number of observed transitions in cell (h, l, r, c). The Pearson-type goodness of fit statistic we propose to examine the adequacy of stationary Markov regression models is

$$T = \sum_{h=1}^{H}\sum_{l=1}^{L}\sum_{r=1}^{R}\sum_{c=1}^{C} \frac{(n_{h,l,r,c} - e_{h,l,r,c})^2}{e_{h,l,r,c}}. \qquad (2)$$
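Computation of (2) is elementary once the cells have been formed. A minimal sketch follows; the placeholder counts are taken from the first cells of Table 3, and skipping empty cells mirrors the treatment used later for the bootstrap tables.

```python
# Sketch: the Pearson-type statistic (2) given observed and expected counts over
# the H x L x R x C grouping cells.  The small arrays below are placeholders.
import numpy as np

def pearson_T(observed, expected):
    """Sum of (n - e)^2 / e over all non-empty cells."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    mask = expected > 0                      # empty cells are skipped
    return np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask])

obs = np.array([[57, 0], [19, 0]])           # toy layout of four cells
exp_ = np.array([[55.32, 1.68], [18.11, 0.89]])
print(pearson_T(obs, exp_))
```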
This statistic can be generalized to test the goodness-of-fit of a non-stationary Markov model if the transition probabilities are appropriately estimated.

3.4. Distribution of the goodness-of-fit statistic

Given that individual i is in state a at time ti,j, the probabilities of observing transitions to the states a, a + 1, a + 2, . . . , K in the time interval (ti,j, ti,j+1) follow a multinomial distribution with parameters 1 and θij = (p̂i,j(a,a), p̂i,j(a,a+1), . . . , p̂i,j(a,K)). (A similar distribution is obtained for non-progressive Markov models.) Individual i will, in general, have a different multinomial distribution at time ti,j+1 because the vector of parameters θi,j+1 depends on the time elapsed between yi,j+1 and yi,j+2. As θij also depends on the explanatory variables in the model, different subjects also have different multinomial distributions. Therefore, if the total number of observations is taken as fixed, P[N1,1,1,1 = n1,1,1,1, . . . , NH,L,R,C = nH,L,R,C] is the sum of several independent and non-identical multinomial distributions. As the total number of observations is not fixed, since entry into an absorbing state terminates observation of one subject, the exact distribution of the proposed test statistic is particularly intractable. The data structure also rules out any general likelihood ratio based goodness of fit procedure because the
number of parameters involved in any saturated model would increase with the sample size.
A method of estimating the distribution of (2) is by generating B independent bootstrap samples from the model specified by the null hypothesis, and calculating the goodness of fit statistic for each sample. As B goes to infinity, the bootstrap distribution of (2) will approach the true null distribution of the test statistic (Efron and Tibshirani, 1993). If the null hypothesis states that a stationary Markov regression model of order one fits the data, then the times elapsed between transitions follow exponential distributions with parameters λi(a) = qi(a,a+1)^{−1} for a = 1, . . . , K − 1.
The bootstrap states for individual i will be denoted as Y*i1, Y*i2, . . . , where the first bootstrap state is taken to be the first observed state, i.e., Y*i1 = Yi1. Assume that Y*i1 = a. The rest of the Y*ij are obtained by simulating the times at which individual i enters states a + 1, . . . , K (if the model is progressive). Let t*i(k) be an observation simulated from the exponential distribution with parameter λ̂i(k) for k = a, . . . , K − 1; t*i(k) represents the time that individual i remains in state k before moving to state k + 1. Therefore t*i(a), ∑_{r=a}^{a+1} t*i(r), . . . , ∑_{r=a}^{K−1} t*i(r) are the simulated times at which individual i enters state a + 1, a + 2, . . . , K, respectively. The bootstrap state Y*ij is then obtained as follows:

$$\begin{aligned}
&\text{If } t_{i,j} < t^*_{i(a)} \text{ then } Y^*_{i,j} = a, \\
&\text{otherwise if } \sum_{r=a}^{k} t^*_{i(r)} \le t_{i,j} < \sum_{r=a}^{k+1} t^*_{i(r)} \text{ then } Y^*_{i,j} = k + 1 \text{ for } k = a, \ldots, K - 2, \\
&\text{otherwise if } t_{i,j} \ge \sum_{r=a}^{K-1} t^*_{i(r)} \text{ then } Y^*_{i,j} = K.
\end{aligned} \qquad (3)$$

The above inequalities state that Y*i,j is equal to a if ti,j is less than the simulated time at which the transition to state a + 1 occurs. Similarly, individual i remains in state a + 1 until ti,j is greater than or equal to the simulated time at which the transition to state a + 2 occurs, etc. If the transition to the absorbing state takes place at a simulated time which is less than ti,mi then several bootstrap states are equal to K. Whenever Y*i,si = Y*i,si+1 = · · · = Y*i,mi = K, the last mi − si states are ignored. Thus the total number of bootstrap states for individual i will be denoted as si for i = 1, 2, . . . , n (si ≤ mi).
Once a sequence of states is generated for each individual, the Markov model is estimated based on the bootstrap data and the test statistic is calculated. This process is replicated to generate a bootstrap distribution for the test statistic. Finally, the value of the statistic from the original data is compared with the bootstrap distribution of values to compute the significance level.
3.5. The fit of the Markov model

Table 1 shows the parameter estimates for the stationary Markov regression model fitted in Gladman et al. (1995). This analysis is based on 271 patients. The first column indicates the condition which was coded as one. The numbers in brackets are the standard deviations. Notice that a different intercept or constant term was used to model each transition rate. When no covariates or stratification variables are included in the model, these provide estimates for the average (mean) time spent in state a, where a = 1, 2, 3. Such a model applied to the PsA data suggests that patients with psoriatic arthritis do not develop damaged joints for an average of 11 years. Also, patients with PsA remain in states 2 and 3 for an average of 6.23 and 4.03 years, respectively. The coefficients for the stratification variables are not significantly different from zero, indicating that the state at entry to the PsA Clinic has no effect on the progression of the disease.
The estimated relative risks shown in Table 2 are the ratio of the transition rate of a patient with a given explanatory variable coded as 1 and the transition rate of a patient with that same covariate coded as 0, if both individuals have the same values for the stratifying variables and for the other explanatory variables.

Table 1
Estimated parameters and standard deviations for the Markov regression model fitted by Gladman et al.

                                      Transition rates
Parameter                1 → 2             2 → 3             3 → 4
Constant term            −2.52 (0.14)      −1.67 (0.18)      −1.70 (0.24)
Effused joints ≥ 5        0.49 (0.18)       0.49 (0.18)       0.49 (0.18)
ESR < 15 mm/h            −0.54 (0.19)      −0.54 (0.19)
Corticosteroids, Yes      0.45 (0.15)       0.45 (0.15)       0.45 (0.15)
DMD, Yes                  0.61 (0.20)       0.61 (0.20)       0.61 (0.20)
Initial state is 2                         −0.51 (0.24)      −0.00 (0.30)
Initial state is 3                                           −0.64 (0.40)
Table 2
Estimated relative risks for each prognostic factor in the Markov regression model fitted by Gladman et al.

                                                          Transition rates
Covariate                          Condition          1 → 2     2 → 3     3 → 4
Number of effused joints           < 5                1         1         1
                                   ≥ 5                1.63      1.63      1.63
Erythrocyte sedimentation rate     ≥ 15 mm/h          1         1
                                   < 15 mm/h          0.58      0.58
Use of corticosteroids             No                 1         1         1
                                   Yes                1.57      1.57      1.57
Disease modifying drugs            No                 1         1         1
                                   Yes                1.84      1.84      1.84
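The relative risks in Table 2 are simply the exponentiated coefficients of Table 1, since the covariates act linearly on the log transition rates. A minimal check (an illustrative script only, not part of the original analysis):

```python
# Relative risks as exponentiated regression coefficients (values copied from Table 1).
import numpy as np

beta = {"effused joints >= 5": 0.49, "ESR < 15 mm/h": -0.54,
        "corticosteroids": 0.45, "DMD": 0.61}
for name, b in beta.items():
    print(f"{name}: RR = {np.exp(b):.2f}")   # 1.63, 0.58, 1.57, 1.84, matching Table 2
```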
Thus, a patient taking disease modifying drugs (DMD) before entering the study has a risk 1.84 times higher of moving to the next state as compared with a patient taking none or nonsteroidal antiinflammatory medications. Also, subjects using oral corticosteroids prior to the study have a risk that is 1.57 times higher of moving to the next state as compared with subjects not taking oral corticosteroids. The transition rate of a patient with 5 or more effused joints is 1.63 times bigger than the transition rate of a patient with the same characteristics but having less than 5 effused joints. Patients with an erythrocyte sedimentation rate less than 15 mm/h have a smaller risk of moving to the next stage of the disease as compared to patients with an ESR of 15 mm/h or more. In the PsA study, some explanatory variable patterns were common among the patients but others were rare. In view of this, the partition of the space of explanatory variables was done based on the number of prognostic factors coded as one. Three categories (C = 3) were defined. Level one refers to subjects with all prognostic factors coded as zero, level two corresponds to patients with only one prognostic factor coded as one, and level three contains individuals with two or more prognostic factors coded as one. Gladman et al. (1995) tested the appropriateness of the assumption that the transition rates are stationary. They allowed the transition probabilities to depend on a power of time but did not find evidence for this dependence. Therefore, we decided to group together observations obtained at different points in time so H = 1. Even though clinic visits were planned every 6 months, the mean time between assessments is 1.18 years with a standard deviation of 1.42 years and a median of 0.61 years. The time elapsed between observations was categorized into L = 5 levels defined by the quintiles of ti,j +1 − ti,j for i = 1, 2, . . . , n and j = 1, 2, . . . , mi . Nine types of transitions were observed; 47.60% are of the form 1 → 1 but only 0.55% are of the type 1 → 4. Cross-classifying the 9 types of transitions by the H × L × C = 1 × 5 × 3 = 15 categories previously defined produces a table with sparse cells. Therefore, transitions of the form a → b with b > a were grouped together in a category denoted as a → a ∗ . Consequently, R = 6 types of transitions were considered: a → a and a → a ∗ with a = 1, . . . , K − 1. The observed and expected transitions for the PsA data are shown in Table 3. The first two rows contain the observed and expected transitions of those patients with all the prognostic factors coded as zero and with an elapsed time between visits less than or equal to 0.4791 years. Analogously, the observed and expected transitions in the 3th and 4th rows correspond to subjects with only one prognostic factor coded as one and with ti,j +1 − ti,j 0.4791, etc. As ties occurred at some quintiles, the L = 5 groups defined for the time elapsed between observations do not have the same number of transitions. The nine cells with bold numbers contribute 67.7% to the value of the goodness of fit statistic: T = 69.95. One thousand bootstrap replications were carried out. In 44 of them the bootstrap algorithm did not generate any observations from state 3 so the associated contingency tables contain empty cells. These cells were ignored in the calculation of the bootstrap goodness of fit statistic. 
The p-value thus obtained is 9/1000 = 0.009, indicating that the fitted stationary Markov regression model does not adequately describe the PsA data. Note that all the cells with bold numbers occur in the columns labeled a → a* with a = 1, 2, 3.
Table 3
Contingency table for the observed and expected transition counts for the Markov regression model fitted by Gladman et al. (1995)

                                            Transition
Time               P. Factor         1→1      1→1*     2→2      2→2*     3→3      3→3*
0.0384 to 0.4791   Zero       Obs.   57       0        19       0        17       1
                              Exp.   55.32    1.68     18.11    0.89     16.75    1.25
                   One        Obs.   51       2        38       7        16       2
                              Exp.   51.14    1.86     42.59    2.41     16.45    1.55
                   Two+       Obs.   20       3        18       1        6        1
                              Exp.   22.01    0.99     17.69    1.31     6.21     0.79
0.4791 to 0.5394   Zero       Obs.   50       3        18       1        6        1
                              Exp.   50.87    2.13     17.55    1.45     6.37     0.63
                   One        Obs.   53       5        31       7        21       3
                              Exp.   55.09    2.91     35.32    2.68     21.68    2.32
                   Two+       Obs.   16       1        19       3        6        2
                              Exp.   16.04    0.96     19.97    2.03     7.00     1.00
0.5394 to 0.7474   Zero       Obs.   47       6        24       1        11       1
                              Exp.   50.42    2.58     22.94    2.06     10.69    1.31
                   One        Obs.   51       8        42       2        23       2
                              Exp.   55.95    3.05     39.86    4.14     21.76    3.24
                   Two+       Obs.   15       1        14       2        5        2
                              Exp.   14.72    1.28     14.09    1.91     5.50     1.50
0.7474 to 1.4012   Zero       Obs.   57       3        21       2        1        2
                              Exp.   55.39    4.61     20.32    2.68     2.50     0.51
                   One        Obs.   51       6        27       2        25       1
                              Exp.   52.01    4.99     24.59    4.41     20.51    5.49
                   Two+       Obs.   15       2        15       3        11       4
                              Exp.   15.70    1.30     14.27    3.73     10.78    4.22
1.4012 to 15.3484  Zero       Obs.   38       13       21       3        4        2
                              Exp.   39.16    11.84    16.17    7.83     3.80     2.20
                   One        Obs.   67       12       25       9        14       5
                              Exp.   57.88    21.12    22.49    11.51    11.43    7.57
                   Two+       Obs.   13       5        10       6        2        4
                              Exp.   12.21    5.79     9.40     6.60     2.33     3.67
Total                         Obs.   601      70       342      49       168      33
                              Exp.   603.89   67.11    335.36   55.64    163.76   37.24
In six of these cells the observed count is larger than the expected count. These six cells contain patients with an elapsed time between clinic visits less than or equal to 1.4 years. Therefore, it may be that patients who experienced a rapid progression in damage were prompted to visit the clinic in a shorter time interval than the majority of patients. Thus, the non-random distribution of the bold numbers would reflect that some clinic visits did not occur at random times. Furthermore, two outliers were detected in the contingency table formed by using the 9 original transitions instead of the 6 collapsed categories. The outliers are patients
who had a 1 → 4 transition in less than 2 years and thus make a large contribution to the value of the corresponding goodness of fit statistic.
Table 3 has 3 independent columns because the number of patients in state a at the beginning of every time interval is known, so the expected number of transitions from a to a* depends on the expected number of transitions within state a, for a = 1, 2, 3. The number of independent cells is therefore 45 (= H × L × (R − 3) × C). Furthermore, η = 10 parameters were estimated to fit the Markov regression model. Naively, the expected degrees of freedom in Table 3 would be 35 (= H × L × (R − 3) × C − η), which would correspond to the mean of the distribution of the statistic if it were chi-square. The mean value of the bootstrap goodness of fit statistic was, in fact, 41.12. Similarly, the 95% and 99% quantiles of the chi-squared distribution with 35 degrees of freedom are 49.80 and 57.34, whereas for the bootstrap distribution they are 57.08 and 68.39. Aguirre-Hernández and Farewell (2002) show that, in general, the use of a "naive" chi-square distribution for the test statistic will lead to rejection of the fitted model more often than the bootstrap distribution. The effect is most marked when the models include explanatory variables.

3.6. Some general remarks on table construction

A contingency table needs to be constructed to calculate the proposed goodness of fit statistic. The classification criteria that need to be considered are:
(1) the transition type,
(2) the covariate pattern,
(3) the time elapsed between observations, and
(4) the time at which the measurements are made.
The fourth classification criterion automatically induces criterion 3 when all the subjects have the same number of equally spaced observations. Classification criterion 4 is of paramount importance if the fitted Markov model is non-stationary or if interest lies in testing whether the model is stationary or not. When the time elapsed between measurements is not constant, we propose classification of the observed transitions and the estimated transition probabilities by the quantiles of {t_{i,j+1} − t_{i,j}}. This is justified by the fact that two individuals with the same vector of explanatory variables having a transition from state a to state b also have similar transition probabilities if their inter-visit periods are approximately equal. Analogously, if the explanatory variables are continuous we propose to use the deciles of the estimated transition rates, {q̂_{i(a,b)}}, to generate a partition of the space of explanatory variables. In large contingency tables, some discrepancies between the observed and expected transition counts will be large simply due to chance. In order to avoid sparse tables, each classification criterion should have few categories even if the number of observations per individual is large. The theoretical distribution of the proposed goodness of fit statistic is intractable when explanatory variables are used to model the transition rates and the observations are made at random times, particularly since the number of transitions is not fixed because of the absorbing state. In the bootstrap algorithm used to estimate the distribution
of the test statistic, the number of states generated by the bootstrap algorithm, s_i, is smaller than or equal to the number of observed states, m_i. This characteristic, and the fact that the contingency tables constructed for the models with explanatory variables had a large number of cells, produced zero counts. The bootstrap test statistic was computed by ignoring the empty cells instead of collapsing adjacent categories. For the PsA data, it is unlikely that this approach can account for the qualitative conclusions, since the number of statistics affected was quite small (less than 5% in most simulations). The proposed bootstrap algorithm does not guarantee that the proportion of individuals that reach the absorbing state is the same in the original data and in the bootstrap sample. The algorithm was modified so that additional states were generated for those individuals who, in the PsA data, reached the final state, provided that their observation period did not exceed the mean observation period of the subjects who did not reach this absorbing state. The p-value obtained in this way is similar to the value mentioned in Section 3.5. This may be due to the fact that only 20.3% of the patients reached state 4. With a higher rate of progression to the absorbing state a different result might be obtained.
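To make the mechanics of this test concrete, the following sketch shows how the Pearson-type statistic and its bootstrap p-value could be computed. It is a minimal illustration, not the code used for the analyses reported here: the model-specific simulation and refitting steps are assumed to be wrapped in a user-supplied callable (here called simulate_and_refit), and all names are hypothetical.

```python
import numpy as np

def pearson_statistic(observed, expected):
    """Pearson-type statistic T = sum (n - e)^2 / e over non-empty cells
    (empty cells were also ignored in the bootstrap of Section 3.5)."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    keep = expected > 0
    return float(np.sum((observed[keep] - expected[keep]) ** 2 / expected[keep]))

def bootstrap_p_value(t_obs, simulate_and_refit, n_boot=1000, seed=0):
    """Estimate the null distribution of T by a parametric bootstrap.

    simulate_and_refit(rng) must simulate one data set from the fitted model
    (at the observed visit times and covariate values), refit the model, and
    return the bootstrap value of T for that replicate.
    """
    rng = np.random.default_rng(seed)
    t_boot = np.array([simulate_and_refit(rng) for _ in range(n_boot)])
    return float(np.mean(t_boot >= t_obs))  # proportion of bootstrap statistics >= observed
```

Applied to the PsA table above, this is the calculation that gives 9/1000 = 0.009 for the observed value T = 69.95.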
4. Poisson and negative binomial models

4.1. Background

The categorization of joint damage into 4 states that was used in the Markov regression model for the PsA data provided useful information on prognostic factors (Gladman et al., 1995). As shown in Section 3.5, however, there is some evidence of lack of fit. This, and the summarization of the outcome variable of interest that is inherent in the method, suggests the examination of alternative approaches. A natural alternative is to directly model the number of damaged joints. Since the number of damaged joints can only increase over time, it is most sensible to focus on models for the increase in the number of damaged joints between clinic visits. A natural first choice for modelling this longitudinal count data would be based on Poisson distributional assumptions. Table 3 shows, however, that 601 (47.59%) transitions in the Gladman et al. (1995) PsA dataset are of the form 1 → 1. In fact, 116 (42.8%) patients with a total of 463 (36.66%) transitions did not develop damaged joints during the course of the study. From a clinical viewpoint, this raises the question of whether a subpopulation of individuals with PsA never develops damaged joints. If so, a mixture model might be used to describe the PsA data. In this context, a mixture model assumes that there is a subpopulation of PsA patients who are not at risk of damaged joints. The existence of such a population would motivate efforts to identify its members early in their disease course, on genetic or clinical grounds, and might eliminate unnecessary levels of treatment. However, other distributions, notably the negative binomial, can be used to describe count data with a large proportion of zeros. In the next sections, we develop mixture regression models for longitudinal (panel) data and investigate the potential need for
such models to represent PsA data. The PsA data set used for this is larger than the one analysed by Gladman et al. (1995). The new data are based on a longer observation period in which more patients registered in the PsA clinic and the existing patients continued to be assessed. Also, observations corresponding to 10 or more damaged joints were incorporated into the analysis. A total of 365 patients was included in the new data set. Since the data set consists of multiple observations for each patient, the correlation between these observations must be considered. There is a wealth of methods, including generalised estimating equations and multi-level models, which have been used for longitudinal count data. Here, however, we adopt the simpler strategy of introducing dynamic covariates into the regression models to account for this correlation. This simplifies the extension to mixture models and, by a suitable choice of the covariates, makes the analysis more comparable to the Markov approach discussed earlier. Four models for longitudinal count data are used here: a Poisson regression model, a Poisson regression model with added zeros (PWAZ), a negative binomial (NB) regression model and a negative binomial regression model with added zeros (NBWAZ). Details are provided for the PWAZ and NBWAZ models; the Poisson and NB models are special cases of these models. The goodness of fit of the best fitting model for the PsA data will be examined using the same approach adopted for the Markov regression model. This requires the use of expected counts, and the calculation of these is also outlined for each model. The models are estimated using maximum likelihood.

4.2. Background to mixture models

Some probability distributions can be regarded as a combination of other distributions. Usually, the resulting distribution is more complex than the original ones and is known as a contagious or mixture distribution. Formally, let h(x | θ) be a conditional density function that depends on the vector of parameters θ. Suppose also that θ ∈ R^m is subject to random variation according to the probability law q(θ). Then the contagious density function f(x) is defined as:

    f(x) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x \mid \theta)\, q(\theta)\, d\theta_1 \cdots d\theta_m.    (4)

When θ assumes a finite number of values, θ_1, . . . , θ_G, each with probability p_g = q(θ_g), where \sum_{g=1}^{G} p_g = 1, then q(θ) has a discrete multivariate distribution and

    f(x) = \sum_{g=1}^{G} p_g\, h(x \mid \theta_g)    (5)

is called a finite mixture or compound distribution. Johnson and Kotz (1969, Chapter 8) describe several density functions of the form (4) and (5), and McLachlan and Peel (2000) is a recent book on these distributions. Sometimes, contagious distributions are used to describe heterogeneous populations. For example, in accident proneness, Greenwood and Yule (1920) derived the negative binomial distribution by assuming that the number of accidents per individual follows
a Poisson distribution with parameter θ. The authors assumed that θ varies from individual to individual according to a gamma distribution. A negative binomial regression model is obtained when θ depends on several explanatory variables. For this model, Lawless (1987) compared the efficiency and robustness properties of maximum likelihood estimators with those of weighted least-squares along with moment estimation of the dispersion parameter. Finite mixture distributions are applied when a population is formed by G distinct subpopulations. Sometimes it is known to which subpopulation each individual belongs, so the primary aim is to estimate the mixing proportions p_1, . . . , p_G in (5). In other situations it is impossible to observe the variable(s) that split the individuals into different groups. This means that there is no available information for each conditional distribution separately but only for the combined mixture distribution. Several examples are given by Everitt and Hand (1981). In this situation, the objective is to estimate both the mixing proportions and the parameters of the conditional distributions in (5). Usually, in this context, G is fixed by the theoretical background of the problem under investigation. In fact, some authors, like Everitt and Hand (1981) and Farewell (1986), state that this type of mixture distribution should only be used when there is strong scientific evidence for the existence of two or more subpopulations. The reason is that interpretation problems may arise since a mixture distribution can always be fitted to the data by choosing a sufficiently large number of groups, G. In spite of this, mixtures of distributions are frequently used in cluster analysis where there is no a priori knowledge about any grouping structure in the population. The aim is to model heterogeneous data and to obtain some insight into the problem by the formation of several clusters. An example of this kind of approach is given by McLachlan and Basford (1988). When the population is divided into G = 2 groups, expression (5) becomes: f(x) = p_1 h(x | θ_1) + (1 − p_1) h(x | θ_2). Here, X can be viewed as depending on a binary variable V that is equal to one with probability p_1 and equal to zero with probability (1 − p_1). In other words, q(θ) has a Bernoulli distribution with parameter p_1. If covariates are available, their effect on p_1 can be assessed by fitting a logistic model. Farewell (1977) and Struthers and Farewell (1989) applied this model to time to event data. In these two articles, p_1 represents the proportion of individuals that experience the event of interest (e.g., AIDS or relapse of a disease) and X is the time until the event occurs. It is assumed that the conditional distribution of X given V = 1 follows an exponential or Weibull distribution. Consider now the situation in which X represents the number of events that occur in a specified period of time. The Poisson distribution is a natural choice for X. However, count data may have an excess of zeros as compared with a Poisson distribution. This is one example of the phenomenon known as overdispersion. It may arise when a proportion, p_1, of individuals cannot experience the event of interest. Their zero count is a structural zero. Other individuals have a zero count by chance; these are sampling zeros. Several methods for modelling such overdispersed data have been proposed in the literature. The use of mixture distributions is one of them. In the situation just described,
X is a discrete variable and the population is again divided into G = 2 groups: individuals that can experience the event of interest, V = 0, and those who cannot, V = 1. Therefore, expression (5) can be rewritten more conveniently as: P (X = x) = p1 P (X = x | V = 1) + (1 − p1 )P (X = x | V = 0) or equivalently: P (X = 0) = p1 + (1 − p1 )P (X = 0 | V = 0), P (X = x) = (1 − p1 )P (X = x | V = 0) if x = 1, 2, . . . . This particular type of mixture distribution is known as a Poisson distribution with added zeros or as a zero-inflated Poisson (ZIP) distribution because the proportion of zeros has been increased by a constant p1 . If measured, explanatory variables can be used to model both the binomial parameter p1 and the mean of the Poisson distribution. Lambert (1992) compares several zero-inflated Poisson regression models for experimental data obtained at AT & T Bell Laboratories. The experiment was a study of the influence of five qualitative factors on the number of soldering defects on printed wiring boards. When a reliable manufacturing process is in control, the number of defects on an item should be Poisson distributed. Nevertheless, the Bell Laboratories data have many more items without defects than would be expected from a Poisson distribution. The author postulates that slight, unobserved changes in the environment cause the process to move randomly back and forth between a perfect state (V = 1) and an imperfect state (V = 0). Lambert (1992) considered three types of ZIP models. In the first one, the probability that the process is in the perfect state, p1 , does not depend on the factors. In the second model, the Poisson parameter and the Bernoulli parameter are not functionally related but both depend on the same factors. Finally, in the third model, p1 is a simple function of the Poisson parameter that depends on the factors. For each model, the author discusses the interpretation of the parameters and the algorithm to calculate the maximum likelihood estimates. Simulations showing the appropriateness of the asymptotic results are also presented. In unpublished work, Ridout et al. (1998) fitted several regression models to experimental data from horticulture. The aim of the experiment was to evaluate the effect of 4 hormone concentrations and 2 periods of light exposure on the number of roots produced by a plant cutting. The regression models examined by the authors are: Poisson, Poisson with added zeros, negative binomial and negative binomial with added zeros. Several variations of each model were examined as the parameter accounting for the extra zeros and the dispersion parameter were sometimes expressed as a function of the exposure to light. The authors analysed the significance of the factors and compared non-nested models using the Akaike information criterion and the BIC statistic. Bohning et al. (1999) use the zero-inflated Poisson model for a dental study and Cheung (2002) uses both zero-inflated Poisson and negative binomial models in a study of growth and development. The necessary extensions which allow for counts which are observed repeatedly for individual subjects are given in subsequent sections. For concreteness, we present the models in the context of the PsA data but they are immediately applicable to any comparable longitudinal count data. A rather different application involving longitudinal data is discussed by Hall (2000).
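As a concrete illustration of the added-zeros construction just described, the short sketch below evaluates zero-inflated Poisson probabilities. It is an illustration only; the mixing proportion p1 and the Poisson mean in the example are hypothetical values, not estimates from any of the data sets discussed here.

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(x, p1, mu):
    """P(X = x) for a Poisson distribution with added zeros:
    P(X = 0) = p1 + (1 - p1) * exp(-mu); P(X = x) = (1 - p1) * Poisson(x; mu) for x >= 1."""
    x = np.asarray(x)
    base = (1.0 - p1) * poisson.pmf(x, mu)
    return np.where(x == 0, p1 + base, base)

# Hypothetical example: 30% structural zeros, Poisson mean 1.5.
probs = zip_pmf(np.arange(6), p1=0.30, mu=1.5)
```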
4.3. Poisson regression model with added zeros for longitudinal count data

Here, the Poisson regression model with added zeros for longitudinal data includes the assumption that a subpopulation of individuals with PsA never develops damaged joints. It is further assumed that, in a fixed time interval, the average rate at which the joints are damaged varies from one person to another. The source of this variability is, however, considered to be related to measured explanatory variables.

4.3.1. Description of the model

Let J_{i,j} represent the total number of damaged joints for patient i up to time t_{i,j}. Then, D_{i,j} = J_{i,j+1} − J_{i,j} is the number of joints damaged between times t_{i,j} and t_{i,j+1}, with j = 1, 2, . . . , m_i − 1 and i = 1, 2, . . . , n. The D_{i,j} are discrete variables, so in a first approach it can be assumed that they have a Poisson distribution with mean μ_{i,j}. Suppose also that D_{i,j} is independent of D_{i,j−1}, . . . , D_{i,1} given the value of J_{i,j}. This assumption is similar to the one made for the Markov model, in which the state occupied by individual i at time t_{i,j} depends only on the previous state. Therefore, the expected number of damaged joints in an interval of length t_{i,j+1} − t_{i,j} is expressed as a function of the vector of explanatory variables z_i = (1, z_{i,1}, . . . , z_{i,p−2}) and J_{i,j}, i.e.,

    \mu_{i,j} = (t_{i,j+1} - t_{i,j}) \exp(\alpha_0 + \alpha_1 z_{i,1} + \cdots + \alpha_{p-2} z_{i,p-2} + \alpha_{p-1} J_{i,j}) = (t_{i,j+1} - t_{i,j}) \exp(\alpha' z_{i,j}),

where z_{i,j} = (z_i, J_{i,j}) and α = (α_0, α_1, . . . , α_{p−2}, α_{p−1}). Notice that J_{i,j} is the only variable in z_{i,j} that varies over time. A more general model would allow the other explanatory variables, z_{i,u}, to change over time. In this case, μ_{i,j} = E(D_{i,j} | z_{i,j}) would be a function of the explanatory variables measured at time t_{i,j}. The probability density function of D_{i,j} is given by:

    P(D_{i,j} = d_{i,j} \mid z_{i,j}) = \frac{\exp(-\mu_{i,j})\, \mu_{i,j}^{d_{i,j}}}{d_{i,j}!}, \qquad d_{i,j} = 0, 1, \ldots.
As suggested by the results obtained in Section 3.5, an excess of zero increments in the number of damaged joints may occur relative to the Poisson distribution. If the excess of zeros can be explained by the existence of a subpopulation of individuals who never develop damaged joints during the course of the disease, a mixture model can be used to describe the data. Let V_i be a binary variable where V_i = 1 indicates that individual i will not develop damaged joints and V_i = 0 indicates that individual i is susceptible to damaged joints. The probability that patient i is not susceptible to damaged joints can be described by a logistic model:

    \theta_i = P(V_i = 1; z^*_i) = \frac{\exp(\beta' z^*_i)}{1 + \exp(\beta' z^*_i)}

where β = (β_0, β_1, . . . , β_{r−1}) and z^*_i = (1, z^*_{i,1}, . . . , z^*_{i,r−1}). In general, the explanatory variables z_{i,j} and z^*_i in the models for the Poisson and the Bernoulli parameters will be different. In the PsA dataset, the age of the patient
at the time of disease onset was the only available explanatory variable which could sensibly be related to the probability of never developing damaged joints. Preliminary analyses indicated that there was little evidence for an effect due to age. In the analyses reported here, therefore, no explanatory variables will be included in the model for θ_i, so that the only parameter of interest from this part of the model is β_0 and θ_i = θ for all i. The subscript i on θ will however be retained in the general specification of the model. The probability of not observing any damaged joints for patient i is:

    P(D_{i,1} = 0, D_{i,2} = 0, \ldots, D_{i,m_i-1} = 0 \mid z_{i,1}, z_{i,2}, \ldots, z_{i,m_i-1})
      = P(V_i = 1 \mid z^*_i)\, P(D_{i,1} = 0, \ldots, D_{i,m_i-1} = 0 \mid V_i = 1, z_{i,1}, \ldots, z_{i,m_i-1})
        + P(V_i = 0 \mid z^*_i)\, P(D_{i,1} = 0, \ldots, D_{i,m_i-1} = 0 \mid V_i = 0, z_{i,1}, \ldots, z_{i,m_i-1})
      = \theta_i + (1 - \theta_i) \prod_{j=1}^{m_i-1} P(D_{i,j} = 0 \mid V_i = 0, z_{i,j})
      = \theta_i + (1 - \theta_i) \prod_{j=1}^{m_i-1} \exp(-\mu_{i,j})
      = \theta_i + (1 - \theta_i) \exp\Bigl(-\sum_{j=1}^{m_i-1} \mu_{i,j}\Bigr).    (6)
For patient i, who developed damaged joints, the probability of the observed d_{i,j} values is:

    P(D_{i,1} = d_{i,1}, \ldots, D_{i,m_i-1} = d_{i,m_i-1} \mid z_{i,1}, z_{i,2}, \ldots, z_{i,m_i-1})
      = (1 - \theta_i) \prod_{j=1}^{m_i-1} P(D_{i,j} = d_{i,j} \mid V_i = 0, z_{i,j})
      = (1 - \theta_i) \prod_{j=1}^{m_i-1} \frac{\exp(-\mu_{i,j})\, \mu_{i,j}^{d_{i,j}}}{d_{i,j}!},    (7)

where d_{i,j} > 0 for some j = 1, 2, . . . , m_i − 1. Note that expressions (6) and (7) retain the usual form for a distribution with added zeros. Let J_i = (J_{i,1}, J_{i,2}, . . . , J_{i,m_i}). The likelihood function for α and β is:

    L(\alpha, \beta) = \prod_{i \mid J_i = 0} \Bigl[\theta_i + (1 - \theta_i) \exp\Bigl(-\sum_{j=1}^{m_i-1} \mu_{i,j}\Bigr)\Bigr]
      \times \prod_{i \mid J_i \neq 0} (1 - \theta_i) \prod_{j=1}^{m_i-1} \frac{\exp(-\mu_{i,j})\, \mu_{i,j}^{d_{i,j}}}{d_{i,j}!}.    (8)
Therefore, the logarithm of the likelihood function is proportional to:

    l(\alpha, \beta) = \sum_{i \mid J_i = 0} \ln\Bigl[\theta_i + (1 - \theta_i) \exp\Bigl(-\sum_{j=1}^{m_i-1} \mu_{i,j}\Bigr)\Bigr]
      + \sum_{i \mid J_i \neq 0} \ln(1 - \theta_i)
      - \sum_{i \mid J_i \neq 0} \sum_{j=1}^{m_i-1} \mu_{i,j}
      + \sum_{i \mid J_i \neq 0} \sum_{j=1}^{m_i-1} d_{i,j} \ln \mu_{i,j}.
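A minimal computational sketch of this log-likelihood is given below, assuming the increments, inter-visit gaps and covariate rows have been assembled per patient. The function and variable names are illustrative, and the code is not the implementation used for the analyses in this chapter; the returned negative log-likelihood could be passed to a general-purpose optimizer (e.g., scipy.optimize.minimize) to obtain maximum likelihood estimates.

```python
import numpy as np
from scipy.special import gammaln, expit

def pwaz_neg_loglik(params, J1, gaps, X, d):
    """Negative log-likelihood of the Poisson-with-added-zeros model for panel counts.

    params = (beta0, alpha_0, ..., alpha_{p-1}); theta = expit(beta0) is common
    to all patients (no covariates in the zero part, as in the PsA analysis).
    For patient i: J1[i] is the damaged-joint count at the first visit, gaps[i]
    the inter-visit gaps t_{i,j+1} - t_{i,j}, X[i] the covariate rows z_{i,j}
    (intercept, fixed covariates and J_{i,j}), and d[i] the increments D_{i,j}.
    The condition J1[i] == 0 with all increments zero corresponds to J_i = 0 in (8).
    All names are illustrative, not the authors' code.
    """
    beta0, alpha = params[0], np.asarray(params[1:])
    theta = expit(beta0)
    loglik = 0.0
    for J1_i, g_i, X_i, d_i in zip(J1, gaps, X, d):
        mu = g_i * np.exp(X_i @ alpha)                     # mu_{i,j}
        if J1_i == 0 and np.all(d_i == 0):                 # no damage ever observed
            loglik += np.log(theta + (1.0 - theta) * np.exp(-mu.sum()))
        else:
            loglik += np.log(1.0 - theta)
            loglik += np.sum(d_i * np.log(mu) - mu - gammaln(d_i + 1.0))
    return -loglik
```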
4.3.2. The expected values

A goodness of fit analysis will require the calculation of expected values. Based on the model defined by Eqs. (6) and (7), expectations can be calculated based on the quantities:

    e_{i,j}(k) = P(D_{i,j} = k \mid z_{i,j}) \quad \text{for } k = 0, 1, \ldots \text{ and } z_{i,j} = (z_i, J_{i,j}),

defined for the interval (t_{i,j}, t_{i,j+1}). For a patient with zero damaged joints up to time t_{i,j}, i.e., J_{i,j} = 0, the above probabilities are calculated as:

    e_{i,j}(0) = P(D_{i,j} = 0 \mid z_{i,j}) = \hat{\theta}_i + (1 - \hat{\theta}_i) \exp(-\hat{\mu}_{i,j}),
    e_{i,j}(k) = P(D_{i,j} = k \mid z_{i,j}) = \frac{(1 - \hat{\theta}_i) \exp(-\hat{\mu}_{i,j})\, \hat{\mu}_{i,j}^{k}}{k!}, \qquad k = 1, 2, \ldots,

where

    \hat{\theta}_i = \frac{\exp(\hat{\beta}' z^*_i)}{1 + \exp(\hat{\beta}' z^*_i)} \quad \text{and} \quad \hat{\mu}_{i,j} = (t_{i,j+1} - t_{i,j}) \exp(\hat{\alpha}' z_{i,j});

if J_{i,j} > 0 then the probability of k damaged joints in an interval of length t_{i,j+1} − t_{i,j} is:

    e_{i,j}(k) = P(D_{i,j} = k \mid z_{i,j}) = \frac{\exp(-\hat{\mu}_{i,j})\, \hat{\mu}_{i,j}^{k}}{k!} \qquad \text{for } k = 0, 1, 2, \ldots.
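These quantities are straightforward to evaluate; the helper below is an illustrative sketch (names are hypothetical) of e_{i,j}(k) for a single interval under the fitted model with added zeros.

```python
import numpy as np
from scipy.stats import poisson

def pwaz_expected_probs(k, theta_hat, mu_hat, J_prev):
    """e_{i,j}(k) = P(D_{i,j} = k | z_{i,j}) under the fitted PWAZ model.

    theta_hat and mu_hat are the fitted values for one (i, j) interval and
    J_prev is J_{i,j}, the damaged-joint count at the start of the interval.
    With previous damage the added-zeros term drops out, as in Section 4.3.2.
    """
    k = np.asarray(k)
    base = poisson.pmf(k, mu_hat)
    if J_prev == 0:
        return np.where(k == 0,
                        theta_hat + (1 - theta_hat) * base,
                        (1 - theta_hat) * base)
    return base
```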
4.3.3. Significance testing with respect to a subpopulation at no risk

Since a mixture model is based on the assumption that the population is divided into G different groups, it is natural to examine the evidence that the proportion of individuals in each group is, in fact, greater than zero. The simplest situation examines if the population is divided into two groups defined by a Bernoulli parameter that does not depend on covariates, i.e.,

    H_0: \theta = 0 \quad \text{vs.} \quad H_1: \theta > 0.

This is a non-standard hypothesis test because, under H_0, θ lies on the boundary of the parameter space. Note that this problem persists if the logit of θ is equal to β_0 and
the hypothesis is expressed in terms of β_0:

    H_0: \beta_0 = -\infty \quad \text{vs.} \quad H_1: \beta_0 > -\infty.    (9)

Self and Liang (1987) and Ghitany et al. (1994) proved that the deviance statistic for testing this type of hypothesis has, asymptotically, not a chi-squared distribution with one degree of freedom, but the distribution of X, where

    P(X \leq x) = 0.5 + 0.5\, P(\chi^2_1 \leq x).    (10)

A more general test was also proposed by Ghitany et al. (1994) to determine if the proportion of individuals that do not experience the event of interest differs between levels in a one-way classification. This assumes that the proportions do not depend on additional explanatory variables. When explanatory variables influence the proportion of individuals in each subpopulation, the test to determine if the proportions are greater than zero has an additional non-standard feature. Under the null hypothesis, the intercept is equal to −∞ and any regression coefficients become irrelevant. Essentially, the regression coefficients disappear under the null hypothesis. Jansakul and Hinde (2002) generalize the score test development of van den Broek (1995) for this situation. Since this more complicated situation does not arise with the PsA data, we utilize the simpler likelihood ratio procedure following Self and Liang (1987) in the analyses reported subsequently. Interval estimation for β_0 = logit(θ) can be based on the profile likelihood for β_0. Profile likelihoods are often used to construct confidence regions when the maximum likelihood estimate of a parameter does not have an asymptotic normal distribution. Here, it is perhaps better to think of such intervals as significance intervals, i.e., the set of all values for β_0 which are not rejected at a specified level of significance. Again, it is assumed that no explanatory variables affect the Bernoulli parameter. Thus logit(θ) = β_0. A significance test of β_0 = −∞ is non-standard, but a test of β_0 = β_0^*, where β_0^* is an interior point of the parameter space, can be based on standard asymptotic results for maximum likelihood estimation. In this case, if β_0 is replaced by a fixed value β_0^*, Eq. (8) becomes:

    L(\alpha \mid \beta_0 = \beta_0^*) = \prod_{i \mid J_i = 0} \Bigl[\theta^* + (1 - \theta^*) \exp\Bigl(-\sum_{j=1}^{m_i-1} \mu^*_{i,j}\Bigr)\Bigr]
      \times \prod_{i \mid J_i \neq 0} (1 - \theta^*) \prod_{j=1}^{m_i-1} \frac{\exp(-\mu^*_{i,j})\, (\mu^*_{i,j})^{d_{i,j}}}{d_{i,j}!}.
The profile likelihood for β_0, denoted as PL(β_0), is obtained by substituting α by its maximum likelihood estimate α̂* in the above expression, i.e.,

    PL(\beta_0^*) = L(\hat{\alpha}^* \mid \beta_0 = \beta_0^*).

The likelihood ratio statistic for the hypothesis

    H_0: \beta_0 = \beta_0^* \quad \text{vs.} \quad H_1: \beta_0 \neq \beta_0^*

can be expressed in terms of the profile likelihood for β_0:

    -2 \ln \Lambda = -2 \ln \frac{PL(\beta_0^*)}{PL(\hat{\beta}_0)} = -2 \ln \frac{PL(\beta_0^*)}{L(\hat{\alpha}, \hat{\beta}_0)}.

The test statistic −2 ln Λ has an asymptotic chi-square distribution with one degree of freedom. Therefore a (1 − α) · 100% significance interval for β_0 includes the set of values of β_0 for which

    \frac{PL(\beta_0)}{L(\hat{\alpha}, \hat{\beta}_0)} \geq \exp(-0.5\, \chi^2_{1, 1-\alpha}),

or equivalently,

    S = \ln(PL(\beta_0)) - \ln(L(\hat{\alpha}, \hat{\beta}_0)) \geq -0.5\, \chi^2_{1, 1-\alpha};

if α = 0.05 then −0.5 χ^2_{1,0.95} = −1.921.
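The following snippet sketches both calculations under the assumptions above: the 50:50 mixture p-value for the boundary test (10) and the profile-likelihood criterion defining the significance interval. It is an illustration only; the log-likelihood values would come from fitting the models described earlier, and the function names are hypothetical.

```python
from scipy.stats import chi2

def boundary_lr_pvalue(lr_stat):
    """p-value for H0: theta = 0 vs H1: theta > 0 using the 50:50 mixture
    0.5 * point mass at 0 + 0.5 * chi-square(1) of Self and Liang (1987)."""
    return 1.0 if lr_stat <= 0 else 0.5 * chi2.sf(lr_stat, df=1)

def in_significance_interval(profile_loglik, max_loglik, alpha=0.05):
    """Is a fixed beta0 inside the (1 - alpha) significance interval?
    Checks ln PL(beta0) - ln L(alpha_hat, beta0_hat) >= -0.5 * chi2_{1, 1-alpha}
    (about -1.921 when alpha = 0.05)."""
    return (profile_loglik - max_loglik) >= -0.5 * chi2.ppf(1 - alpha, df=1)
```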
4.4. Negative binomial regression model with added zeros for longitudinal count data

The negative binomial regression model is frequently used as an alternative to the Poisson regression model. The difference is that the negative binomial model assumes that the average rate at which the joints are damaged between clinic visits depends on known explanatory variables and on a random (unknown) component. The negative binomial regression model can be extended, in a way exactly analogous to the Poisson model, to allow for a possible subpopulation of patients with PsA who are not susceptible to damaged joints.

4.4.1. Description of the model

As before, the increase in the number of damaged joints between times t_{i,j} and t_{i,j+1} is denoted as D_{i,j} = J_{i,j+1} − J_{i,j} for i = 1, 2, . . . , n and j = 1, 2, . . . , m_i − 1. In the group of individuals susceptible to damaged joints, the distribution of D_{i,j} conditional on the vector of explanatory variables z_{i,j} = (1, z_{i,1}, . . . , z_{i,p−2}, J_{i,j}) is described by a negative binomial regression model. The mean of D_{i,j} is expressed as a function of the vector of explanatory variables z_i = (1, z_{i,1}, . . . , z_{i,p−2}) and J_{i,j} as follows:

    \mu_{i,j} = (t_{i,j+1} - t_{i,j}) \exp(\alpha' z_{i,j}),

where z_{i,j} = (z_i, J_{i,j}) and α = (α_0, α_1, . . . , α_{p−2}, α_{p−1}). Thus, conditional on being in the population at risk,

    P(D_{i,j} = d_{i,j} \mid z_{i,j}) = \frac{\Gamma(d_{i,j} + \gamma^{-1})}{d_{i,j}!\, \Gamma(\gamma^{-1})} \Bigl(\frac{\gamma \mu_{i,j}}{1 + \gamma \mu_{i,j}}\Bigr)^{d_{i,j}} \Bigl(\frac{1}{1 + \gamma \mu_{i,j}}\Bigr)^{\gamma^{-1}},    (11)

where γ ≥ 0 is the dispersion parameter, Γ(·) is the gamma function and d_{i,j} = 0, 1, 2, . . . . Therefore, the unconditional probability of not observing damaged joints for patient i during the course of the study is:

    P(D_i = 0 \mid z_{i,1}, \ldots, z_{i,m_i-1}) = \theta_i + (1 - \theta_i) \prod_{j=1}^{m_i-1} \Bigl(\frac{1}{1 + \gamma \mu_{i,j}}\Bigr)^{\gamma^{-1}}    (12)
while the probability that patient i develops damaged joints in the pattern specified by d_i is:

    P(D_i = d_i \mid z_{i,1}, \ldots, z_{i,m_i-1}) = (1 - \theta_i) \prod_{j=1}^{m_i-1} \frac{\Gamma(d_{i,j} + \gamma^{-1})}{d_{i,j}!\, \Gamma(\gamma^{-1})} \Bigl(\frac{\gamma \mu_{i,j}}{1 + \gamma \mu_{i,j}}\Bigr)^{d_{i,j}} \Bigl(\frac{1}{1 + \gamma \mu_{i,j}}\Bigr)^{\gamma^{-1}},    (13)

where θ_i = P(V_i = 1 | z^*_i) = exp(β' z^*_i)/(1 + exp(β' z^*_i)) is the probability that subject i belongs to the subpopulation of individuals not susceptible to damaged joints. The likelihood function for α, β, and γ is:

    L(\alpha, \beta, \gamma) = \prod_{i \mid J_i = 0} \Bigl[\theta_i + (1 - \theta_i) \prod_{j=1}^{m_i-1} \Bigl(\frac{1}{1 + \gamma \mu_{i,j}}\Bigr)^{\gamma^{-1}}\Bigr]
      \times \prod_{i \mid J_i \neq 0} (1 - \theta_i) \prod_{j=1}^{m_i-1} \frac{\Gamma(d_{i,j} + \gamma^{-1})}{d_{i,j}!\, \Gamma(\gamma^{-1})} \Bigl(\frac{\gamma \mu_{i,j}}{1 + \gamma \mu_{i,j}}\Bigr)^{d_{i,j}} \Bigl(\frac{1}{1 + \gamma \mu_{i,j}}\Bigr)^{\gamma^{-1}}.
Thus the logarithm of L(α, β, γ) is proportional to:

    l(\alpha, \beta, \gamma) = \sum_{i \mid J_i = 0} \ln\Bigl[\theta_i + (1 - \theta_i) \prod_{j=1}^{m_i-1} \Bigl(\frac{1}{1 + \gamma \mu_{i,j}}\Bigr)^{\gamma^{-1}}\Bigr]
      + \sum_{i \mid J_i \neq 0} \ln(1 - \theta_i)
      + \sum_{i \mid J_i \neq 0} \sum_{j=1}^{m_i-1} \ln \frac{\Gamma(d_{i,j} + \gamma^{-1})}{d_{i,j}!\, \Gamma(\gamma^{-1})}
      + \sum_{i \mid J_i \neq 0} \sum_{j=1}^{m_i-1} d_{i,j} \ln(\gamma \mu_{i,j})
      - \sum_{i \mid J_i \neq 0} \sum_{j=1}^{m_i-1} d_{i,j} \ln(1 + \gamma \mu_{i,j})
      - \gamma^{-1} \sum_{i \mid J_i \neq 0} \sum_{j=1}^{m_i-1} \ln(1 + \gamma \mu_{i,j}),

where

    \frac{\Gamma(d_{i,j} + \gamma^{-1})}{d_{i,j}!\, \Gamma(\gamma^{-1})} =
      \begin{cases}
        1 & \text{if } d_{i,j} = 0, \\
        \dfrac{1}{d_{i,j}!}\, \gamma^{-1}(\gamma^{-1} + 1)(\gamma^{-1} + 2) \cdots (\gamma^{-1} + d_{i,j} - 2)(\gamma^{-1} + d_{i,j} - 1) & \text{if } d_{i,j} = 1, 2, \ldots.
      \end{cases}
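In numerical work the Gamma ratio is usually evaluated on the log scale; the sketch below (illustrative, with hypothetical names) computes the log of the negative binomial probability (11) using gammaln, which corresponds to the recursion displayed above.

```python
import numpy as np
from scipy.special import gammaln

def nb_logpmf(d, mu, gamma):
    """log P(D = d | mu, gamma) for the negative binomial parameterization (11),
    valid for gamma > 0 (the Poisson model is recovered as gamma -> 0).
    gammaln evaluates the log of the Gamma ratio expanded in the recursion above."""
    d = np.asarray(d, dtype=float)
    inv_g = 1.0 / gamma
    return (gammaln(d + inv_g) - gammaln(inv_g) - gammaln(d + 1.0)
            + d * np.log(gamma * mu / (1.0 + gamma * mu))
            - inv_g * np.log(1.0 + gamma * mu))
```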
4.4.2. The expected values

For the ith patient, the estimated probability that D_{i,j} = k in the interval (t_{i,j}, t_{i,j+1}) is again used as the estimate of the expected number of increases equal to k damaged joints in the period t_{i,j+1} − t_{i,j}, i.e., e_{i,j}(k) = P(D_{i,j} = k | z_{i,j}) where k = 0, 1, 2, . . . . If J_{i,j} = 0, the model described by Eqs. (12) and (13) implies that e_{i,j}(k) is calculated as:

    e_{i,j}(k) =
      \begin{cases}
        \hat{\theta}_i + (1 - \hat{\theta}_i) \Bigl(\dfrac{1}{1 + \hat{\gamma} \hat{\mu}_{i,j}}\Bigr)^{\hat{\gamma}^{-1}} & \text{if } k = 0, \\
        (1 - \hat{\theta}_i)\, \dfrac{\Gamma(k + \hat{\gamma}^{-1})}{k!\, \Gamma(\hat{\gamma}^{-1})} \Bigl(\dfrac{\hat{\gamma} \hat{\mu}_{i,j}}{1 + \hat{\gamma} \hat{\mu}_{i,j}}\Bigr)^{k} \Bigl(\dfrac{1}{1 + \hat{\gamma} \hat{\mu}_{i,j}}\Bigr)^{\hat{\gamma}^{-1}} & \text{if } k = 1, 2, \ldots,
      \end{cases}

but if J_{i,j} > 0 then

    e_{i,j}(k) = \frac{\Gamma(k + \hat{\gamma}^{-1})}{k!\, \Gamma(\hat{\gamma}^{-1})} \Bigl(\frac{\hat{\gamma} \hat{\mu}_{i,j}}{1 + \hat{\gamma} \hat{\mu}_{i,j}}\Bigr)^{k} \Bigl(\frac{1}{1 + \hat{\gamma} \hat{\mu}_{i,j}}\Bigr)^{\hat{\gamma}^{-1}} \qquad \text{for } k = 0, 1, 2, \ldots,
where θ̂_i, μ̂_{i,j}, and γ̂ are the maximum likelihood estimates of θ_i, μ_{i,j} and γ, respectively.

4.5. A goodness-of-fit statistic

Many of the difficulties associated with the examination of goodness of fit for Markov models fitted with panel data apply equally to models for longitudinal count data with irregular spacing. However, a Pearson-type goodness-of-fit statistic, like that defined in Eq. (2) for the Markov model, can be used in this situation. Its use to test the adequacy of the Poisson and the negative binomial models is discussed here. The first step is to group the observed increases in damaged joints, d_{i,j}, and the estimated probabilities of an increment equal to k, e_{i,j}(k), k = 0, 1, . . . . For the PsA data, the values of d_{i,j} and e_{i,j}(k) for the jth observation period could be classified according to one or more of: the value of the response variable, the presence or absence of previous damaged joints, the explanatory variable pattern for individual i, and the time elapsed between clinic visits j and j + 1. The observation period can be ignored in a first instance if the fitted model assumes that the response variable (D_{i,j}) depends only on the distance between consecutive measurements and not on the time at which the observations are made. If all four factors are used for classification, e_{abcd} can be used to denote the sum of the expected values e_{i,j}(k) classified in level a of the response variable, category b for the presence (yes–no) of previous damaged joints, explanatory variable pattern c and group d defined by the quantiles of the time elapsed between clinic visits. Similarly, n_{abcd} will represent the total number of observed increments classified in categories a, b, c, d, where a = 1, 2, . . . , A; b = 1, 2 (B = 2); c = 1, 2, . . . , C; d = 1, 2, . . . , D. For each cell in the contingency table, the ratios (n_{abcd} − e_{abcd})^2 / e_{abcd} are calculated and then summed to give a Pearson-type goodness-of-fit statistic. The total number of cells in the contingency table is A × B × C × D. Only (A − 1) × B × C × D expected counts are independent.
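As an illustration of this grouping step, the sketch below pools the observed increments and the fitted probabilities into cells and returns the Pearson-type statistic. It is a simplified stand-in for the procedure just described (for instance, it pools large increments into a single class), and all names are hypothetical.

```python
import numpy as np
from collections import defaultdict

def count_cells(d_obs, e_probs, groups, max_level=10):
    """Observed and expected counts per cell, plus the Pearson-type statistic.

    For pooled interval m: d_obs[m] is the observed increment, e_probs[m] the
    fitted probabilities e_{i,j}(0), ..., e_{i,j}(max_level - 1) (the remainder
    is pooled into a 'max_level or more' class), and groups[m] a tuple with the
    remaining classification criteria (previous damage, covariate pattern,
    time-gap quantile).  Illustrative helper, not the authors' code.
    """
    obs, exp = defaultdict(int), defaultdict(float)
    for d_m, p_m, g_m in zip(d_obs, e_probs, groups):
        obs[(min(int(d_m), max_level), *g_m)] += 1
        p_m = np.asarray(p_m, dtype=float)[:max_level]
        for k, p in enumerate(p_m):
            exp[(k, *g_m)] += p
        exp[(max_level, *g_m)] += max(0.0, 1.0 - p_m.sum())
    t = sum((obs.get(c, 0) - e) ** 2 / e for c, e in exp.items() if e > 0)
    return dict(obs), dict(exp), t
```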
As for the Markov model, a parametric bootstrap algorithm can be used to estimate the distribution of the test statistic under the hypothesis that the fitted model is correct. In the bootstrap data, the initial number of damaged joints for patient i is given by the number of damaged joints recorded for that patient at the first clinic visit, i.e., J*_{i,1} = J_{i,1}. The increase in the number of damaged joints between times t_{i,j} and t_{i,j+1}, D*_{i,j}, is then simulated from the fitted model. For the models that assume that all patients are susceptible to develop damaged joints, D*_{i,j} is generated either from a Poisson distribution with mean rate of damage μ̂_{i,j} = (t_{i,j+1} − t_{i,j}) exp(α̂' z_{i,j}) or from a negative binomial model with parameters γ̂ and μ̂_{i,j}. The bootstrap algorithm to simulate observations from the models with added zeros is as follows. If zero damaged joints are observed for patient i at the initial clinic visit (J_{i,1} = 0) then a uniform random variable, W_i, in the interval (0, 1) is simulated. If W_i is less than θ̂_i then the bootstrap increments are all zero for patient i (i.e., D*_{i,j} = 0 for all j). Otherwise, if W_i ≥ θ̂_i or if J_{i,1} > 0, the D*_{i,j} are simulated from a Poisson distribution with mean μ̂_{i,j} or from a negative binomial model with parameters γ̂ and μ̂_{i,j}. Once the D*_{i,j} are simulated from the appropriate model, the bootstrap number of damaged joints at each clinic visit is calculated as J*_{i,j+1} = J*_{i,j} + D*_{i,j}. The proposed model is then fitted to the bootstrap data and the goodness of fit statistic is computed. This process is replicated to generate a bootstrap distribution for the test statistic. The significance level associated with the proposed model is the proportion of bootstrap goodness of fit statistics greater than the statistic obtained for the observed data.

4.6. Application to the PsA data

In this section, the results of fitting the various models for longitudinal count data to the PsA data are presented. Recall that an important clinical question is whether there exists a subpopulation of patients at no, or at least minimal, risk of joint damage. To give some impression of the available data, Figure 1 presents the Kaplan–Meier estimated probability of no damage being observed at clinic visits as a function of time since clinic entry. The initial drop in the curve is due to those patients arriving at the clinic with damage. The estimated curve appears to decline more slowly in the tail, when the estimated probability is approximately 28%. This number would reflect both any subpopulation at no risk of damage and patients who had not yet experienced damage although at risk. Table 4, for the Poisson, PWAZ, NB and NBWAZ regression models, gives the estimated relative rates associated with the prognostic factors. Recall that this aspect of the model is not the primary focus of this chapter. The only substantial variation in relative risks across the models is with respect to prior medication. The maximum likelihood estimate of β_0 from the PWAZ model is −0.785 with an estimated standard error of 0.129. Thus based on this model, the estimated fraction of PsA patients at no risk of joint damage is 31% with an associated 95% confidence interval of (26%, 37%). The significance level for a test that the fraction is zero leads to a p-value of less than 0.001.
Fig. 1. Estimated probability of no damage being observed at clinic visits.

Table 4
Estimated relative risks

Factor        Poisson            PWAZ               NB                 NBWAZ
              RR    p-value      RR    p-value      RR    p-value      RR    p-value
ESR < 15      0.84  0.002        0.89  < 0.048      0.77  0.077        0.77  0.077
Effus. > 4    1.46  < 0.001      1.52  < 0.001      1.35  0.067        1.35  0.067
DMDs          1.07  0.426        0.98  0.802        1.73  0.014        1.73  0.014
Steroids      1.08  0.147        0.97  0.583        1.30  0.077        1.30  0.077
1–4 DJs       2.13  < 0.001      0.80  0.005        2.21  < 0.001      2.21  < 0.001
5–9 DJs       4.13  < 0.001      1.59  < 0.001      4.24  < 0.001      4.24  < 0.001
10+ DJs       4.18  < 0.001      1.64  < 0.001      5.42  < 0.001      5.42  < 0.001

PWAZ: Poisson with added zeros. NB: Negative binomial. NBWAZ: Negative binomial with added zeros.
However, for the NBWAZ model, the maximum likelihood estimate of β_0 is −∞, which implies that there is no subset of PsA patients at no risk of damage. Figures 2 and 3 give the profile likelihoods for β_0 from these two models. From Figure 3, a confidence interval for the proportion of PsA patients at no risk of damage, from the NBWAZ model, can be calculated as (0%, 8%). Since the maximum likelihood estimate of this proportion is 0%, the fits of the NB and NBWAZ models are identical. This is reflected in Table 4 and in Table 5, which is discussed subsequently.
Fig. 2. Profile log likelihood of β0 .
A preliminary examination of goodness of fit can be based on a comparison of expected and observed counts, collapsed over all variables except the number of new damaged joints that developed between clinic visits. Table 5 gives the observed increases in joint counts between clinic visits and the expected numbers as estimated by the various models. Additionally, the zero class is subdivided by previous damage. Note that both Poisson models considerably underestimate the number of inter-visit periods with no change in damage. Table 5 demonstrates the better fit to the observed counts provided by the two NB models, which essentially provide the same fit. Since the estimated fraction of patients at no risk of damage is 0%, the better fit to the data provided by the negative binomial models derives primarily from the introduction of the heterogeneity in risk over different inter-visit periods associated with the negative binomial distribution. A formal investigation of the goodness of fit of the negative binomial model is given in the next section.

4.7. The fit of the negative binomial model

As outlined in the previous section, Table 5 contains the observed and expected number of damaged joints between clinic visits equal to 0, 1, . . . , 10. The zero increments were split into two categories based on the presence or absence of previous damaged joints for each patient. Almost 80% (1736/2196) of the observations are zero increments.
Fig. 3. Profile log likelihood of β0 .
Table 5
Observed and expected changes in joint count

Increase in number          Observed   Expected   Expected   Expected   Expected
of damaged joints                      Poisson    PWAZ       NB         NBWAZ
0 without previous damage   827        684.33     614.78     782.63     782.63
0 with previous damage      909        610.43     607.79     954.98     954.98
1                           174        525.70     530.50     185.14     185.14
2                           103        193.21     217.54     80.88      80.88
3                           35         78.66      92.24      46.03      46.03
4                           27         38.10      46.91      29.75      29.75
5                           23         21.87      28.21      20.77      20.77
6                           20         14.02      18.61      15.27      15.27
7                           9          9.50       12.73      11.66      11.66
8                           12         6.57       8.78       9.17       9.17
9                           7          4.57       6.03       7.37       7.37
10+                         50         9.05       11.88      52.36      52.36
Total                       2196       2196.00    2196.00    2196.00    2196.00

PWAZ: Poisson with added zeros. NB: Negative binomial. NBWAZ: Negative binomial with added zeros.
Table 6
Observed and expected counts for the negative binomial model

                                            Increase in damaged joints
Time elapsed    Prognostic    Zero without             Zero with                One or more
                factors       previous damage          previous damage
                              Obs      Exp             Obs      Exp             Obs      Exp
0.01 to 0.47    Zero          63       58.31           43       47.84           14       13.85
                One           75       70.88           98       107.82          37       31.30
                Two+          25       27.27           59       64.18           26       18.55
0.47 to 0.52    Zero          68       63.03           40       35.99           7        15.98
                One           76       71.81           107      106.21          34       39.07
                Two+          17       16.87           57       69.48           37       24.66
0.52 to 0.65    Zero          59       58.73           39       39.49           18       17.77
                One           83       83.33           96       97.93           44       41.74
                Two+          28       24.56           51       50.18           17       21.27
0.65 to 1.16    Zero          71       65.35           37       40.23           20       22.42
                One           73       69.12           103      97.67           38       47.21
                Two+          20       20.30           51       51.10           26       25.60
1.16 to 13.49   Zero          60       56.28           30       33.20           34       34.52
                One           84       73.90           62       69.93           65       67.18
                Two+          25       22.90           36       43.82           43       37.27
Total                         827      782.63          909      954.98          460      458.39
The 120 (36.25%) patients for whom damage was never observed contribute zero counts for 27.8% (611/2196) of the observations. From Table 5, the value of the Pearson-type statistic, Σ_k (n(k) − e(k))²/e(k), is 17.66. The biggest contribution comes from increments of two damaged joints, which are underestimated by the negative binomial model. One thousand bootstrap data sets produced a significance level of 0.103. There is no evidence therefore that the negative binomial model is not appropriate for these data. Table 5 has 11 independent categories and the fitted model contains 9 parameters. The mean of the bootstrap goodness of fit statistic is 15.41. A more detailed examination of the fitted model is achieved through Table 6. In this table, the observed response variable and the expected values were further classified by the quintiles of the time elapsed between clinic visits and the number of prognostic factors coded as one. As for the Markov model, patients were classified into three groups based on the initial value of the prognostic factors: zero, one, and two or more prognostic factors coded as one. Increments of one or more damaged joints were also collapsed into one category. In general, there is good agreement between the observed and expected counts. Two out of 45 = 3 × 3 × 5 cells, indicated with bold numbers, make large contributions to the value of the goodness of fit statistic. These two cells correspond to patients with all prognostic factors coded as zero or with at least two prognostic factors coded as one and who developed one or more damaged joints between clinic visits spaced between
0.47 and 0.52 years apart. Nevertheless, the value of the goodness of fit statistic is only 32.09. The significance level associated with this value is p = 0.366, computed, for convenience, from the same 1000 bootstrap data sets used to assess Table 5. Thus, based on Table 6, there is no indication of lack of fit for the negative binomial model. The mean of the bootstrap distribution of the goodness of fit statistic is 31.12. As for Table 5, this value is bigger than the difference between the number of independent cells in Table 6 and the number of estimated parameters: 30 − 9 = 21. This general pattern was also observed for goodness of fit statistics for the Markov model in Section 3. Note that an assumption that the test statistic has an asymptotic chi-squared distribution with 21 degrees of freedom would give a conservative p-value of 0.057.
5. Discussion

The study of the occurrence of multiple events over time, based on irregular monitoring of current status, is often based on models for which goodness of fit is not easily ascertained. In this chapter we have demonstrated, by their application to models estimated with data from a psoriatic arthritis clinical data base, how Pearson-type goodness of fit statistics can be used in this situation. Key features are the definition of cells in contingency tables of observed and expected counts and the empirical specification of the distribution of the test statistic under the null hypothesis. Some aspects of model appraisal can be configured as hypothesis tests which relate to model choice. For example, in this chapter, the choice between the Poisson and negative binomial models with added zeros and the ordinary Poisson and negative binomial models was based on hypothesis tests and confidence intervals. These were nested models, albeit with the parameter of interest on the boundary of the parameter space under the null hypothesis. Ridout et al. (2001) also develop a score test for comparing Poisson with added zeros and negative binomial with added zeros models. However, the Markov model and the models for longitudinal count data cannot be compared in this way, even via the procedures for non-nested models proposed by Cox (1970). The major reasons for this are the different measurement scales for the response variables and the presence of the absorbing state in the Markov model, which essentially excludes data used in the fitting of the models for the changes in damaged joint counts. Of course, in the specific results presented there were also differences in the amount of data available as well as different representations of explanatory variable information. Although a formal comparison cannot be made between the Markov and negative binomial models, the use of the Pearson-type goodness of fit statistics does demonstrate lack of fit for the Markov model and no lack of fit for the negative binomial model. The source of lack of fit for the Markov model appears to be patients with rapid progression, reflected in a large change in the damaged joint count between visits which were quite close together. Although there might be some evidence of this in the fit of the negative binomial model as well, it is much less marked. One fundamental difference between the models fitted in this chapter is that heterogeneity between patients is a component of the negative binomial model not paralleled in the Markov model. Perhaps the influence of this could be reduced if additional explanatory variables could be identified. An alternative
approach is to introduce random effects in the Markov model as done by Cook (1999) for a two-state Markov process. Finally, it can be noted that the mixture models for count data assume the development of initial damage derives from and, therefore, can be modelled as part of the same process which leads to subsequent damage. An alternative is to model the occurrence of initial damage as a separate event. In this case, for estimation of the fraction of patients who will not develop damage, it is the time to initial damage only which would be relevant, and standard mixture models for time to event data could be used. Since the negative binomial model fits the data acceptably, there was no compelling need to investigate this for the psoriatic arthritis data.

Appendix A: Formulas for the estimated transition probabilities

When K = 4, the transition probabilities p_{i,j}(1,2) and p_{i,j}(2,3) are obtained by substituting a = 1 and a = 2 in expression (1), respectively. This expression is only valid when a + 1 < K (if K is an absorbing state). The transition probability p_{i,j}(1,3) is calculated as:
    p_{i,j}(1,3) = \int_0^{t_{i,j+1}-t_{i,j}} f_{T_{i(1)}} \int_0^{t_{i,j+1}-t_{i,j}-t_{i(1)}} f_{T_{i(2)}} \Bigl[1 - \int_0^{t_{i,j+1}-t_{i,j}-t_{i(1)}-t_{i(2)}} f_{T_{i(3)}}\, dt_{i(3)}\Bigr] dt_{i(2)}\, dt_{i(1)}
      = \frac{\lambda_{i(2)} \lambda_{i(3)}}{(\lambda_{i(2)} - \lambda_{i(3)})(\lambda_{i(1)} - \lambda_{i(2)})} \Bigl[\exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(1)}}\Bigr) - \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(2)}}\Bigr)\Bigr]
        - \frac{\lambda_{i(3)} \lambda_{i(3)}}{(\lambda_{i(2)} - \lambda_{i(3)})(\lambda_{i(1)} - \lambda_{i(3)})} \Bigl[\exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(1)}}\Bigr) - \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(3)}}\Bigr)\Bigr]

and

    p_{i,j}(1,4) = \int_0^{t_{i,j+1}-t_{i,j}} f_{T_{i(1)}} \int_0^{t_{i,j+1}-t_{i,j}-t_{i(1)}} f_{T_{i(2)}} \int_0^{t_{i,j+1}-t_{i,j}-t_{i(1)}-t_{i(2)}} f_{T_{i(3)}}\, dt_{i(3)}\, dt_{i(2)}\, dt_{i(1)}
      = 1 - \frac{\lambda_{i(1)} \lambda_{i(1)}}{(\lambda_{i(1)} - \lambda_{i(2)})(\lambda_{i(1)} - \lambda_{i(3)})} \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(1)}}\Bigr)
        + \frac{\lambda_{i(2)} \lambda_{i(2)}}{(\lambda_{i(1)} - \lambda_{i(2)})(\lambda_{i(2)} - \lambda_{i(3)})} \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(2)}}\Bigr)
        - \frac{\lambda_{i(3)} \lambda_{i(3)}}{(\lambda_{i(2)} - \lambda_{i(3)})(\lambda_{i(1)} - \lambda_{i(3)})} \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(3)}}\Bigr).

For a = 1, 2, . . . , K − 1, p_{i,j}(a,a) = 1 − \sum_{b > a}^{K} p_{i,j}(a,b) or, equivalently,

    p_{i,j}(a,a) = 1 - P(T_{i(a)} \leq t_{i,j+1} - t_{i,j}) = \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(a)}}\Bigr).

Finally,

    p_{i,j}(2,4) = \int_0^{t_{i,j+1}-t_{i,j}} f_{T_{i(2)}} \int_0^{t_{i,j+1}-t_{i,j}-t_{i(2)}} f_{T_{i(3)}}\, dt_{i(3)}\, dt_{i(2)}
      = 1 + \frac{\lambda_{i(2)}}{\lambda_{i(3)} - \lambda_{i(2)}} \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(2)}}\Bigr) - \frac{\lambda_{i(3)}}{\lambda_{i(3)} - \lambda_{i(2)}} \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(3)}}\Bigr)

and

    p_{i,j}(3,4) = P(T_{i(3)} < t_{i,j+1} - t_{i,j}) = 1 - \exp\Bigl(\frac{t_{i,j} - t_{i,j+1}}{\lambda_{i(3)}}\Bigr).
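The closed forms above are easy to check numerically. The sketch below is an illustration only: it implements p_{i,j}(1,3) and p_{i,j}(1,4) and compares them with a Monte Carlo simulation of the three exponential sojourn times; the λ values used are arbitrary (and assumed distinct), not estimates from the PsA data.

```python
import numpy as np

def p13(w, lam1, lam2, lam3):
    """Closed-form p_{i,j}(1,3) from Appendix A (lam_a = mean sojourn time in state a)."""
    e1, e2, e3 = np.exp(-w / lam1), np.exp(-w / lam2), np.exp(-w / lam3)
    a = lam2 * lam3 / ((lam2 - lam3) * (lam1 - lam2))
    b = lam3 * lam3 / ((lam2 - lam3) * (lam1 - lam3))
    return a * (e1 - e2) - b * (e1 - e3)

def p14(w, lam1, lam2, lam3):
    """Closed-form p_{i,j}(1,4) from Appendix A."""
    e1, e2, e3 = np.exp(-w / lam1), np.exp(-w / lam2), np.exp(-w / lam3)
    return (1.0
            - lam1**2 / ((lam1 - lam2) * (lam1 - lam3)) * e1
            + lam2**2 / ((lam1 - lam2) * (lam2 - lam3)) * e2
            - lam3**2 / ((lam2 - lam3) * (lam1 - lam3)) * e3)

# Monte Carlo check with arbitrary, distinct sojourn-time means.
rng = np.random.default_rng(0)
lam, w, n = (2.0, 1.0, 0.5), 1.5, 200_000
t1, t2, t3 = (rng.exponential(l, n) for l in lam)
mc13 = np.mean((t1 + t2 <= w) & (t1 + t2 + t3 > w))  # should be close to p13(w, *lam)
mc14 = np.mean(t1 + t2 + t3 <= w)                    # should be close to p14(w, *lam)
```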
References

Aguirre-Hernández, R., Farewell, V.T. (2002). A Pearson-type goodness-of-fit test for stationary and time-continuous Markov regression models. Statist. Medicine 21, 1899–1911.
Bishop, Y.M.M., Fienberg, S.E., Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge.
Bohning, D., Dietz, E., Schlattmann, P., Mendonca, L., Kirchner, U. (1999). The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. J. Roy. Statist. Soc. A 162, 195–209.
Cheung, Y.B. (2002). Zero-inflated models for regression analysis of count data: A study of growth and development. Statist. Medicine 21, 1461–1469.
Cox, D.R. (1970). Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 32, 406–424.
Cook, R.J. (1999). A mixed model for two-state Markov processes under panel observation. Biometrics 55, 915–920.
Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Everitt, B.S., Hand, D.J. (1981). Finite Mixture Distributions. Chapman and Hall, New York.
Farewell, V.T. (1977). A model for a binary variable with time-censored observations. Biometrika 64, 43–46.
Farewell, V.T. (1986). Mixture models in survival analysis: Are they worth the risk? The Canad. J. Statist. 14, 257–262.
Gentleman, R.C., Lawless, J.F., Lindsey, J.C., Yan, P. (1994). Multi-state Markov models for analysing incomplete disease history data with illustrations for HIV disease. Statist. Medicine 13, 805–821.
Ghitany, M.E., Maller, R.A., Zhou, S. (1994). Exponential mixture models with long-term survivors and covariates. J. Multivariate Anal. 49, 218–241.
Gladman, D.D., Shuckett, R., Russell, M.L., et al. (1987). Psoriatic arthritis (PSA) – an analysis of 220 patients. Quart. J. Med. 62, 127–141.
Gladman, D.D., Farewell, V., Buskila, D., et al. (1990). Reliability of measurements of active and damaged joints in psoriatic arthritis. J. Rheumatol. 17, 62–64.
Gladman, D.D., Farewell, V.T., Nadeau, C. (1995). Clinical indicators of progression in psoriatic arthritis: Multivariate relative risk model. J. Rheumatol. 22, 675–679.
Gladman, D.D. (1997). Psoriatic arthritis. In: Maddison, P.J., Isenberg, D.A., Woo, P., Glass, D.N. (Eds.), Oxford Textbook of Rheumatology. Oxford University Press, Oxford, pp. 1051–1056.
Greenwood, M., Yule, G.U. (1920). An enquiry into the nature of frequency distributions of multiple happenings, with particular reference to the occurrence of multiple attacks of disease or repeated accidents. J. Roy. Statist. Soc. Ser. A 83, 255–279.
Hall, D.B. (2000). Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics 56, 1030–1039.
Hosmer, D.W., Lemeshow, S. (1989). Applied Logistic Regression, 1st edn. Wiley, New York.
Jansakul, N., Hinde, J.P. (2002). Score tests for zero-inflated Poisson models. Comput. Statist. Data Anal. 40, 75–96.
Johnson, N.L., Kotz, S. (1969). Discrete Distributions. Houghton Mifflin Company, Boston.
Kalbfleisch, J.D., Lawless, J.F. (1985). The analysis of panel data under a Markov assumption. J. Amer. Statist. Assoc. 80, 863–871.
Keiding, N., Klein, J.P., Horowitz, M.M. (2001). Multi-state models and outcome prediction in bone marrow transplantation. Statist. Medicine 20, 1871–1885.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14.
Lawless, J.F. (1987). Negative binomial and mixed Poisson regression. Canad. J. Statist. 15, 209–225.
Lee, E.W., Kim, M.Y. (1998). The analysis of correlated panel data using a continuous-time Markov model. Biometrics 54, 1638–1644.
Longini, I.M. Jr., Clark, W.S., Byers, R.H., Ward, J.W., Darrow, W.W., Lemp, G.F., Hethcote, H.W. (1989). Statistical analysis of the stages of HIV infection using a Markov model. Statist. Medicine 8, 831–843.
McLachlan, G.J., Basford, K.E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
McLachlan, G., Peel, D.A. (2000). Finite Mixture Models. Wiley, New York.
Pérez-Ocón, R., Ruiz-Castro, J.E., Gámiz-Pérez, M.L. (1998). A multivariate model to measure the effect of treatments in survival to breast cancer. Biometrical J. 40, 703–715.
Ridout, M., Demétrio, C.G.B., Hinde, J. (1998). Models for count data with many zeros. In: Proceedings of the XIXth International Biometric Conference, Cape Town, Invited Papers, pp. 179–192.
Ridout, M., Hinde, J., Demétrio, C.G.B. (2001). A score test for testing a zero-inflated Poisson regression model against zero-inflated negative binomial alternatives. Biometrics 57, 219–223.
Self, S.G., Liang, K. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc. 82, 605–610.
de Stavola, B.L. (1988). Testing departures from time homogeneity in multistate Markov processes. Appl. Statist. 37, 242–250.
Struthers, C.A., Farewell, V.T. (1989). A mixture model for time to AIDS data with left truncation and an uncertain origin. Biometrika 76, 814–817.
Sypsa, V., Touloumi, G., Kenward, M., Karafoulidou, A., Hatzakis, A. (2001). Comparison of smoothing techniques for CD4 data in a Markov model with states defined by CD4: An example on the estimation of the HIV incubation time distribution. Statist. Medicine 20, 3667–3676.
van den Broek, J. (1995). A score test for zero inflation in a Poisson distribution. Biometrics 51, 738–743.
37
Survival Analysis with Gene Expression Arrays
Donna K. Pauler, Johanna Hardin, James R. Faulkner, Michael LeBlanc and John J. Crowley
1. Introduction

In this paper we discuss the identification and measurement of gene expressions as prognostic indicators for survival times. As the technology and field of bioinformatics have rapidly exploded in recent years, so too has the need for tools to analyze outcome data with covariates of extremely high dimension, such as arise from the measurement of expression levels from large numbers of genes. As a particular example, a study in progress at the Myeloma Institute for Research and Therapy in Little Rock, Arkansas has measured 12 625 gene expression levels on 156 patients before standard treatment for myeloma. Patients are currently being followed for response to treatment and survival. Part of our collaboration with this institute will focus on the identification of genes and genetic profiles that serve as pre-treatment prognostic indicators for response and survival. Here we outline a variety of data-reduction and analysis techniques for correlating survival times with high-dimensional covariates. We perform simulation studies to compare techniques for use in the upcoming analysis of the Arkansas data.

1.1. The measurement of gene expression levels

A key to the genomic revolution has been the development of the microarray chip, a technology which permits the researcher to study literally thousands of genes in a single assay. There are several approaches to developing chips, but the basics are that known genes or key parts of genes, such as sequences of nucleotides, are fixed on a small slide, and a sample from a patient is prepared in such a way that gene transcripts that exist in the sample will attach to the gene targets on the slide. The output for each gene is compared to either an internal or external control and is expressed as either a categorical variable, indicating gene expressed or not, or a measured variable, which is typically a log ratio of expression by the experimental to control, which quantifies the degree of over or under expression. The chips can be created to target genes hypothesized to be involved in a specific cancer, such as the lymphochip for studying lymphoma (Alizadeh et al., 2000), or can be quite general, such as the Affymetrix chip (Affymetrix, Inc., http://www.affymetrix.com/).
There are two types of chips that are commonly used, which produce oligonucleotide and cDNA arrays. In an oligonucleotide array, single strands of oligos, or partial gene sequences, are printed on a chip. Multiple oligos are printed for each gene, and a mismatch strand, which is an oligo with one base change, is simultaneously hybridized as a measure of control. The sample is labeled with fluorescent dye and measured for both the pairing to the oligo and the mismatch. In a spotted cDNA array, one basic sequence representing all or part of a gene is printed on the chip. The experimental sample is labeled with fluorescent red dye while the reference sample is labeled with fluorescent green dye, and both are simultaneously hybridized to the chip. The reference sample acts as a control for the experimental sample. For both types of arrays the amount of fluorescent dye measured is quantified. There are many questions that can be addressed with microarray technology, such as which genes, combinations of genes, or entire genetic profiles differentiate normal subjects from cancer patients, predict survival, or refine histological subsets of specific cancers. As a long-term goal, identification of responsible genes can potentially lead to the development of targeted therapy against the gene products or proteins.

1.2. The statistical analysis of genes

Routine statistical inference correlating gene expression with clinical outcome is complicated by the extremely high number of genes. Even in the optimistic case of a large complete clinical databank, the number of genes will certainly outnumber the patients, making model selection or multivariable modeling difficult. To combat this problem many approaches have focused on performing a large number of simultaneous hypothesis tests, such as t-tests or Wilcoxon tests, one for each gene; see, for example, the “class prediction” approach of Golub et al. (1999). In principle the same simultaneous testing procedure can be performed for analyzing survival outcomes, using Cox’s proportional hazards models as an example. There are more advanced screening techniques, including the significance analysis of microarrays (SAM) (Tusher et al., 2001) and the directed indices for exploring gene expressions proposed by LeBlanc et al. (2003), which might better serve to polish the crude multiple hypothesis testing approach. To account for the thousands of simultaneous hypothesis tests, a Bonferroni correction can be applied to retain a family-wide error rate of 5%. However, due to the high dimensionality involved, this approach can be overly conservative, leaving no genes declared significant. Therefore, a number of more liberal testing procedures which retain a pre-specified family-wide error rate or false discovery rate have been proposed. For examples, see the permutation method of Westfall and Young (1993) and the adjustment method of Benjamini and Hochberg (1995). The simultaneous hypothesis-testing approach can be used as a first stage in a two-stage procedure to reduce the covariate space to a manageable size for a second-stage multivariable model. Stepwise regression, tree-based (Breiman et al., 1984; Segal, 1988; LeBlanc and Crowley, 1993) or neural network (Khan et al., 2001) methods can then be applied to hopefully further optimize the predictive capability of the model. The marginal simultaneous hypothesis testing approach does not take into consideration the correlation structure of the genes. Alternative data reduction techniques attempt
to first understand the structure of the covariate space and utilize dependencies before performing independent or multivariable modeling. An old familiar solution is principal component decomposition, which reduces the predictor space to a smaller sample of orthogonal linear combinations of the original variables which explain a pre-specified portion of the total variation. Multivariate modeling can then proceed using the reduced set of principal component scores. For gene expression data, the method is not completely straightforward to implement because of the high dimensionality. Many software packages utilize standard approaches for matrix inversion, which do not work efficiently for high-dimensions. The covariate matrix may be singular or nearly singular, in which case principal component decomposition will not work. Another drawback of principal component decomposition for gene expressions is interpretation. It is often the case that investigators are interested in the roles of individual genes in order to understand mechanism, and the linear combinations comprising the principal components may not easily translate back to individual influential genes. It has been shown in simple cases that principal components may obscure groupings of interest (Chang, 1983). However, a number of authors are investigating methodologies to address this issue. In a recent report, Hastie et al. (2000) outline a method for using principal components in a clustering framework to identify subsets of genes. For additional applications see Alter et al. (2000), Quackenbush (2001), and Yeung and Ruzzo (2001). An alternative approach to principal component decomposition for capitalizing on the dependency structure is to first perform a clustering algorithm, such as hierarchical (e.g., Ward, 1963; Weinstein et al., 1997; Eisen et al., 1998), K-means (e.g., MacQueen, 1967; Herwig et al., 1999; Tavazoie et al., 1999) or model-based (e.g., Bock, 1996; Fraley and Raftery, 2002) clustering on the gene space and then use the cluster centers or medoids as the predictors in the regression model. A small pre-specified number of clusters may be specified so that a multivariable model with stepwise selection can be feasibly used. Alternatively, the number of clusters may be determined using a criteria such as the average silhouette width (Kaufman and Rousseeuw, 1990). The clustering approach retains some interpretability since individual genes in the influential clusters may be easily traced. With so many potential predictors in addition to the multitude of clinical variables, there is a high probability of overfitting the data. Therefore, it is important to examine other criteria such as predictive capability in addition to goodness of fit as measures for selecting a model on which to base inference. Criteria such as Akaike’s information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978), which penalize for high dimensionality and overfitting, are part of most standard software packages. A number of authors provide recommendations for measuring predictive accuracy in Cox’s proportional hazards models; see, for example, Harrell et al. (1984, 1996), Korn and Simon (1990), Henderson (1995), Raftery et al. (1996), Volinsky et al. (1997), and O’Quigley and Xu (2001). Verweij and van Houwelingen (1993) propose a cross-validated log likelihood based on the partial likelihood, which reduces to Allen’s PRESS (Allen, 1974) and Mallow’s Cp (Mallows, 1973) statistic in normal linear models. 
Methods described in Efron and Tibshirani (1997) can be used to reduce variability in cross-validation.
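As a concrete illustration of the first-stage screening idea discussed in this section, the following sketch fits one Cox regression per gene and then adjusts the resulting p-values for multiplicity. It is written in R with the survival package as a stand-in for the S-Plus routines used by the authors; the function and variable names are ours, so it is illustrative rather than the authors' implementation.

    library(survival)
    # expr: n x p matrix of expression levels (patients in rows, genes in columns)
    # time, status: survival time and event indicator for the n patients
    marginal_screen <- function(expr, time, status) {
      pvals <- apply(expr, 2, function(z) {
        fit <- coxph(Surv(time, status) ~ z)
        summary(fit)$coefficients[1, "Pr(>|z|)"]
      })
      data.frame(p          = pvals,
                 bonferroni = p.adjust(pvals, method = "bonferroni"),
                 fdr        = p.adjust(pvals, method = "BH"))  # Benjamini-Hochberg adjustment
    }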
1.3. Motivating application Our motivating application comes from an ongoing study at the Myeloma Institute for Research and Therapy in Little Rock, Arkansas. In this study 156 patients are enrolled on a protocol designed to determine the efficacy of thalidomide added to high dose chemotherapy and tandem transplants for multiple myeloma. For individuals in the study, blood sera were collected immediately pre-treatment for gene expression level measurement. Expression levels from 12 625 genes were obtained from each of the samples using the Affymetrix oligonucleotide chip. The patients are currently being followed for response and survival. When these data become available, it will be of interest to examine the prognostic significance of expression levels in conjunction with other clinical parameters such as stage of disease, age and treatment. In this paper we perform simulation studies to compare some of the approaches outlined in Section 1.2 for correlating survival times with gene expression levels in preparation for our analysis of the Myeloma dataset. For feasibility of simulation, we reduce attention to 500 genes, and choose the 500 with highest variance in the Myeloma sample.
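The variance filter mentioned above is straightforward; a minimal sketch (in R, with expr denoting a hypothetical patients-by-genes expression matrix):

    # Keep the 500 genes with the largest sample variance across patients
    gene_var <- apply(expr, 2, var)
    expr500  <- expr[, order(gene_var, decreasing = TRUE)[1:500]]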
2. Methods

For our simulations we use Cox’s proportional hazards model, which specifies that the hazard of failing at time t is equal to

    λ0(t) exp(β′x),    (1)

where λ0(t) denotes the baseline hazard function, β a vector of regression coefficients, and x a vector of predictors. We use the Splus function coxph to fit (1) (Mathsoft Inc., 1999). The function provides an estimate of the AIC as −2 times the difference between the partial log likelihood evaluated at the maximum likelihood estimate and the number of regression coefficients in the model.

2.1. Data reduction and model fitting

We consider the following four methods for reducing and analyzing the data.

Stepwise selection. First do 500 simultaneous marginal Cox proportional hazards regressions, one for each gene. Choose the 50 genes which have the largest absolute t-statistic for further analysis. On the set of selected 50 genes, perform a forward stepwise AIC procedure to select the main effects model with lowest AIC, allowing the model size to vary from 1 to 50.

Top 5. First do 500 simultaneous Cox proportional hazards regressions. Choose the 5 genes with largest absolute test statistic values and include them as main effects in a second Cox proportional hazards model.

Clustering. Perform Splus CLARA medoid clustering on the set of 500 genes of dimension 156 (Mathsoft Inc., 1999). Choose the cluster size with maximum average silhouette width. Fit a Cox proportional hazards model with one predictor for each cluster, with value equal to the median of all genes assigned to the cluster.
Principal components. First do a principal component analysis of the 156 by 500 matrix of gene expression levels to obtain a ranked list of principal component scores in terms of percent variability explained. Fit a Cox proportional hazards model using the top 5 principal components as predictors. For the 500 simultaneous Cox proportional hazard regressions we use the method of fast simultaneous regressions as proposed by Hastie and Tibshirani (1990) and outlined in Appendix A. We also invoke this procedure for the forward stepwise AIC selection since a large number of models have to be fit at each forward step. We cluster the 500 genes of dimension 156 using the Splus program CLARA (Clustering LARge Applications) (Mathsoft Inc., 1999). CLARA is an extension of the partitioning around medoids (PAM) method to accommodate large numbers of objects, as often encountered in the analyses of microarrays; see Kaufman and Rousseeuw (1990) or Han and Kamber (2001, pp. 352–354). PAM was one of the first K-medoid clustering algorithms for partitioning n objects into K groups. After an initial random selection of K medoids, PAM repeatedly tries to make a better choice of medoids by minimizing the sum of dissimilarities of all objects to their nearest medoid. PAM does not work efficiently for large numbers of objects, n, greater than 200. CLARA extends PAM by drawing multiple samples of the dataset, applying PAM, and returning the best clustering as the output. The effectiveness of CLARA depends on the number of random samples drawn. For our simulations we use five samples. For additional relief to the computational burden, we fix the cluster size within simulations. In principal the cluster size can be selected by repeatedly running CLARA and choosing the K corresponding to the cluster analysis with largest average silhouette width. The average silhouette width is a function of individual object dissimilarities and measures how well the clustering performs, with well-separated clusters having values near 1; for a precise definition see the plot.partition command in Splus or Chapter 2 of Kaufman and Rousseeuw (1990). For the principal component analysis we apply the Splus function princomp to the 156 by 500 matrix of gene expression levels, which produces a list of 500 principal scores for each of the observations and the orthogonal loading matrix for the scores (Mathsoft Inc., 1999). 2.2. Assessing predictive accuracy To assess the predictive accuracy of the methods we perform 6-fold cross-validation. That is, we randomly divide the sample of 156 cases into six groups of 26 cases. We apply the four model-fitting procedures of Section 2.1 to 5/6 of the sample and then assess them on the remaining 1/6 using the average cross-validated log likelihood (ACVL). Repeating for K = 6 random partitions of the data, the ACVL is given by ACVL =
(1/K) Σ_{k=1}^{K} ℓ^k(βˆ(−k)),    (2)

where

ℓ^k(β) = ℓ(β) − ℓ^(−k)(β)    (3)
is the difference between the partial log-likelihood for the entire sample and that with the kth group of observations excluded, respectively, and βˆ(−k) is the value of β that maximizes (−k) (β), for k = 1, . . . , K. See Verweij and van Houwelingen (1993) for an explicit formula for the leave-one-out cross-validated log likelihood in the context of Cox’s partial likelihood. For the clustering and principal component procedures it is necessary to retain both the clustering assignment indices and loading matrices from the fit to the training data (the 5/6th sample) in addition to the estimated regression coefficients, βˆ(−k) ’s. These indices and loading matrices are then applied to both the genes in the full data set and reduced training set as required in the two terms of (3). 2.3. Simulations We simulated both gene and survival data for the 12 scenarios listed in Table 1. The scenarios are clumped into three different types corresponding to the underlying joint distribution for the set of 500 genes for 156 patients. For the first subset of simulations, the Independent case, the genes are assumed to follow a Normal distribution with zero mean and variance–covariance matrix equal to the identity matrix. For the second subset, Myeloma, the genes are again assumed to follow a Normal distribution, but with mean vector and variance–covariance matrix set equal to that observed for the 500 genes with most variability in the Arkansas myeloma dataset. In this dataset, the mean log expression level across all genes was 5.27 with a range of 2.36 to 9.92. The standard deviations for the first four genes were 5.31, 3.65, 2.92, and 2.30, respectively. The median standard deviation for the 500 genes was 1.28 with range 1.11 to 5.31. The median correlation was 0.02 with a range of −0.69 to 0.97. The variance covariance matrix was borderline singular so we added a small constant, ε = 1.0e−13, to the diagonal elements to obtain a nonsingular matrix. For the third subset, 3-Cluster, genes are simulated from a mixture of Normals distribution with three clusters. Genes from different clusters are assumed independent. Means, variances and correlation coefficients within the clusters are listed in the caption of Table 1. We generated survival data from an exponential model with baseline hazard specified so that the median survival time was approximately 4 years. We varied the designated number of genes with non-zero β coefficients from 0, 1, 10 to 50. Values of the designated β parameters are given in the fourth column of Table 1; values for all other β parameters are set to zero. For the case of a single gene we set β equal to − log 2, corresponding to a doubling in the odds of failing with a unit decrease in gene expression. For the case of 10 and 50 designated genes we dampened the values of β to control the prognostic signal and produce survival times in the range observed for the default case of median survival at 4 years. We used a standardization to control the magnitude of the predictor, β x, which yielded the β coefficient values listed for the 10 and 50 gene cases in Table 1. We assumed a uniform censoring mechanism on the interval 4 to 10 years, which yielded approximately 35% censoring under all of the scenarios considered. We simulated 100 datasets for each scenario; average median survival times and censoring proportions for each scenario under all simulations are reported in the last two columns of Table 1. 
For each of the 100 datasets we calculated (2) for a single random partition and then averaged the ACVL over the 100 datasets.
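The sketch below shows one way the cross-validated quantity in (2)–(3) can be computed in R for a given fold assignment. Here X denotes the reduced set of predictors produced by one of the procedures of Section 2.1 (for example the top-5 genes or the cluster medians); the log partial likelihood is written out directly (Breslow-style handling of ties) and the names are ours, so this illustrates the scheme of Section 2.2 rather than reproducing the authors' S-Plus code.

    library(survival)

    # Cox log partial likelihood at a fixed coefficient vector (Breslow form for ties)
    coxpl <- function(beta, X, time, status) {
      eta <- as.vector(X %*% beta)
      o <- order(time)
      eta <- eta[o]; status <- status[o]
      riskset_sum <- rev(cumsum(rev(exp(eta))))   # sum of exp(eta) over the risk set at each time
      sum(status * (eta - log(riskset_sum)))
    }

    acvl <- function(X, time, status, K = 6) {
      folds <- sample(rep(seq_len(K), length.out = nrow(X)))
      contrib <- sapply(seq_len(K), function(k) {
        tr  <- folds != k
        fit <- coxph(Surv(time[tr], status[tr]) ~ X[tr, , drop = FALSE])
        b   <- coef(fit)
        # l^k(b) = l(b) - l^(-k)(b), Eq. (3)
        coxpl(b, X, time, status) - coxpl(b, X[tr, , drop = FALSE], time[tr], status[tr])
      })
      mean(contrib)   # Eq. (2)
    }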
Table 1
Parameters and survival data used in the simulation study. Gene dist. indicates the distribution of the individual gene vectors; No. des. genes indicates the number of genes that have non-zero β coefficients in the survival model; Des. param. values indicates the value of the non-zero β coefficients; Med. surv. and Med. % cens. indicate the mean of the median survival time and censoring proportion, respectively, over 100 datasets from the simulation. Under Des. param. values, 1_x indicates a column vector of ones of length x. Under Gene dist., Independent indicates each gene is identically distributed with a N(0·1_156, I_156) distribution, where I_156 denotes the 156 × 156 identity matrix; Myeloma indicates each gene follows a 156-variate normal distribution with mean and variance–covariance matrix equal to that for the top 500 genes in the Arkansas Myeloma project; and 3-Cluster indicates 3 clusters of size 25, 25, and 450, with N(8·1_156, Σ_1), N(4·1_156, Σ_2), and N(0·1_156, I_156) marginal distributions, respectively, where Σ_1 is a matrix with unity across the diagonal and all pairwise correlations equal to 0.8 and Σ_2 is similarly defined with pairwise correlations equal to 0.5.

Sim. no.   Gene dist.    No. des. genes   Des. param. values   Med. surv.   Med. % cens.
 1         Independent    0               –                    4.05         0.31
 2         Independent    1               −0.69                3.81         0.33
 3         Independent   10               −0.32 × 1_10         3.71         0.34
 4         Independent   50               −0.14 × 1_50         3.66         0.34
 5         Myeloma        0               –                    4.10         0.32
 6         Myeloma        1               −0.69                3.65         0.43
 7         Myeloma       10               −0.32 × 1_10         3.66         0.42
 8         Myeloma       50               −0.14 × 1_50         3.63         0.41
 9         3-Cluster      0               –                    3.98         0.31
10         3-Cluster      1               −0.69                3.87         0.33
11         3-Cluster     10               −0.32 × 1_10         3.53         0.38
12         3-Cluster     50               −0.14 × 1_50         3.57         0.43
For the clustering method, we performed initial smaller sized trial runs, which based on the average silhouette width, indicated that the cluster size should be no greater than two for the Independent and Myeloma scenarios and should be 3 for the 3-Cluster scenarios. We therefore fixed the cluster size at 2 for the first 8 simulations and 3 for the remaining 4 simulations. For the principal components procedure we also ran a small pre-test to see which of the optimal rules provided the maximal ACVL: the top 5, 10, or 50 principal components, or the number that explained at least 90% of the variability. We found that restricting to the top 5 principal components produced ACVLs just as high as the other rules, even though the percent variation explained may be very low. Therefore we fixed the number of principal components at 5 for all simulations.
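For reference, a brief R sketch of the clustering step, using the cluster package's clara as the open-source analogue of the S-Plus routine; the grid of candidate cluster sizes and all object names are ours. Genes are clustered, the average silhouette width guides the choice of K, and each cluster is then summarized by its per-patient median to form the predictors described in Section 2.1.

    library(cluster)
    # expr: 156 x 500 matrix (patients in rows, genes in columns); we cluster the genes
    genes <- t(expr)                      # one row per gene, of dimension 156
    ks <- 2:6
    avg_sil <- sapply(ks, function(k) clara(genes, k, samples = 5)$silinfo$avg.width)
    K  <- ks[which.max(avg_sil)]
    cl <- clara(genes, K, samples = 5)
    # One predictor per cluster: the median over the genes assigned to that cluster
    cluster_preds <- sapply(seq_len(K), function(j)
      apply(expr[, cl$clustering == j, drop = FALSE], 1, median))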
3. Results

The results for the different simulations of Table 1 are shown in Figure 1. For each of the four methods we report the difference between the ACVL for the method and the ACVL for a naive approach of just taking the median of all genes as the sole predictor in the model.
Fig. 1. Average cross-validated log likelihoods (ACVL) for the Stepwise (S), Top 5 (T), Cluster (C) and Principal Component (P) method from the simulation study of 100 randomly generated datasets corresponding to the scenarios in Table 1. The ACVLs have each been subtracted by ACVL(median), which is the ACVL corresponding to the case where the only predictor in the model is the median across all genes. Across all scenarios, the average standard errors for points on the graph were approximately 1.9, 0.64, 0.67 and 0.7 for the Stepwise, Top 5, Cluster and Principal Component method, respectively.
Stepwise As shown by the figure, the stepwise method performed much more poorly than any of the other procedures, including the naive median approach, under all simulations except for the 1 Beta case under Myeloma. The standard errors for the stepwise estimates in the figure were approximately 1.9 and varied from 0.9 to 2.9 across simulations. Standard errors for all other procedures were smaller in magnitude and variation and were near 0.7. An explanation for the poor performance of the stepwise procedure is given in Table 2. The last column of Table 2 shows that the stepwise procedure had a tendency to select oversized models, even in the case where there were no designated genes. As seen by the fourth column of the table, except for the case for a single gene with large absolute value of the β coefficient, the procedure was not able to detect designated genes. The true positive rate, or number of designated genes that made it into the final stepwise model, declined as the magnitude of the β coefficients decreased for all scenarios except for the Cluster-3 scenario (simulations 9–12). Finally, as seen by the third column, except for the 3-Cluster scenario, the stepwise procedure was doomed from the start since many of the designated genes did not make it past the top 50 marginal regression screen.
Table 2
Performance of the stepwise model. Top 50 is the number of designated genes that were selected in the initial marginal screen to select the top 50 genes; True positive rate (TPR) is the proportion of designated genes that appeared in the stepwise model; Model Size is the number of predictors in the model selected by stepwise. All summaries are given as the average and range over the 600 fits of the model.

Sim. no.   No. des. genes   Top 50           TPR (range)         Model Size (range)
 1          0               –                –                   27.37 (12, 50)
 2          1               1.00 (1, 1)      1.0 (1.0, 1.0)      26.22 (8, 50)
 3         10               6.56 (2, 10)     0.53 (0.10, 1.0)    28.39 (12, 50)
 4         50               11.00 (4, 18)    0.14 (0.02, 0.28)   28.24 (14, 50)
 5          0               –                –                   15.35 (6, 38)
 6          1               1.00 (1, 1)      1.0 (1.0, 1.0)      10.50 (1, 50)
 7         10               6.52 (3, 9)      0.53 (0.10, 1.0)    25.83 (7, 50)
 8         50               12.85 (4, 20)    0.15 (0.02, 0.34)   25.12 (8, 50)
 9          0               –                –                   26.77 (13, 50)
10          1               1.00 (1, 1)      0.89 (0, 1.0)       21.98 (8, 50)
11         10               9.93 (8, 10)     0.51 (0.10, 1.0)    16.67 (3, 50)
12         50               48.93 (41, 50)   0.64 (0.12, 1.0)    32.56 (6, 50)
Top 5

Surprisingly, just choosing a model with the top 5 genes in terms of largest absolute test statistics from an initial screen of the 500 genes produced a model that always outperformed the stepwise procedure in cross-validation. For the case where all β's were set to zero, under all three gene distribution types the top 5 approach performed worse than the principal components, clustering and naive median approaches. Note again that the standard errors for all three methods are approximately 0.7. For the 1 Beta case the top 5 approach performed as well as the principal component and clustering methods for the Independent and 3-Cluster scenarios. For the Myeloma 1 Beta case it performed markedly better than all of the other methods. This simulation is somewhat different from the others in that we placed a large β-coefficient on a gene with highest variability (standard deviation 5.31) so that the signal was strong. The principal component method could not recover the signal as effectively as the top 5 approach. For the 10 Beta case, the top 5 approach did not perform as well as clustering and principal components in the Independent and 3-Cluster scenarios. It did however match the principal component approach in the Myeloma case and outperform the clustering method there. Finally, the top 5 approach did not perform as well as the principal component and clustering approaches in the 50 Beta case, but for the 3-Cluster case it performed better than expected. The top 5 approach is handicapped because its fixed size does not allow it to capture the majority of designated genes. A top 10 or 20 approach might have higher predictive power for the 50 Beta case.

Clustering

The clustering approach does not beat the naive median approach in either the Independent or Myeloma case because there was little evidence for clusters in these scenarios.
For the Independent and Myeloma scenarios, 2 clusters were used but the average silhouette widths were only 0.009 and 0.28, respectively. For the 3-Cluster scenario, 3 clusters were used and the average silhouette width was 0.69, reflecting the three well-separated clusters. In this case the clustering method performed neck and neck with principal components and greatly outperformed the top 5, stepwise and median approaches.

Principal components

The top 5 principal components explained on average only 8%, 28% and 13% of the total variability for the Independent, Myeloma and 3-Cluster scenarios, respectively. Yet this method either outperformed or matched the best method for all simulations except for the Myeloma 1 Beta case.
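As a concrete sketch of the principal component variant (in R, with prcomp in place of the S-Plus princomp call because prcomp handles the n < p case through the singular value decomposition; the variable names are ours): the loadings estimated from the training patients are applied unchanged to held-out patients, as required by the cross-validation of Section 2.2.

    library(survival)
    # Training data: expr_tr (patients x genes), time_tr, status_tr
    pc      <- prcomp(expr_tr, center = TRUE, scale. = FALSE)
    scores5 <- pc$x[, 1:5]                            # top 5 principal component scores
    fit     <- coxph(Surv(time_tr, status_tr) ~ scores5)

    # Held-out patients: project onto the training loadings before evaluating the model
    scores5_new <- predict(pc, newdata = expr_new)[, 1:5]
    lp_new      <- scores5_new %*% coef(fit)          # linear predictor for the new patients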
4. Discussion We have compared four methods for correlating high-dimensional gene expression levels with survival outcome in the context of Cox’s proportional hazards model. Each method consists of an initial data reduction step and a secondary model-fitting step. Under all three types of dependency structures considered, stepwise selection consistently fit overly complex models leading to a reduction in predictive power. In the case of a single influential gene, the stepwise model did consistently capture the gene in the optimal model, but unfortunately it included a number of false positive genes as well. Increasing the number of designated genes but dampening their correlation with survival led to a decrease in the true positive rate and little to no improvement in predictive performance as measured by the ACVL. Perhaps relying on a more stringent criterion such as the BIC for stepwise selection would choose more parsimonious models and hence improve predictive ability. Some authors have found that principal component analysis or partial least squares does not necessarily improve prediction (Yeung and Ruzzo, 2001; Nguyen and Rocke, 2002). We did not find this to be the case in any of the scenarios we considered, except for the 1 Beta case under Myeloma, where a very strong signal was placed on a single gene with high variance. We did not explore the next step of isolating influential genes through the loading matrices. This is important because frequently an aim of a microarray experiment is to identify smaller subsets of genes for further exploration of co-regulation or pathway regulation. We were impressed that the clustering algorithm performed as well as the principal component algorithm, even in the Independent and Myeloma cases. As in the principal component approach, we did not further explore the next step of how one proceeds from the cluster to the single gene or genes driving the relationship to survival. The K-medoids algorithm gives medians of the genes in each cluster so in principle one can identify genes that are close to the median value. Hierarchical clustering could also be used to further refine clusters. There are certainly a variety of more complex techniques that can be used for data reduction. As an example, van der Laan and Bryan (2000) provide asymptotic properties of general subsetting techniques using the parametric bootstrap. Hastie et al. (2000)
outline a particular form of clustering based on principal components and show how it may be used to identify influential subsets of genes. If there are specific genes with known function it may be of interest to perform a supervised clustering to locate similar genes that may or may not correlate with survival. Finally, other baseline or prognostic variables can also be included in the final gene model to see if the explanatory capability of the genes supersedes better known indicators. For example, Rosenwald et al. (2002) examined the additive explanatory power of individual genes over the independently obtained international prognostic index for the prediction of survival after chemotherapy in diffuse large B-cell lymphoma.
Appendix A

We use a weighted least squares approximation to full maximum likelihood. Hastie and Tibshirani (1990) use a similar weighted least squares method for estimating smooth covariate effects in the proportional hazards model. Cox's model can be defined with hazard function

    λ(t | z) = λ0(t) exp(η(z)),

where λ0(t) is an unspecified baseline hazard function. Assume there are n patients indexed by i = 1, . . . , n. Let ηi = η(zi), where zi = (zi1, . . . , zip) is the vector of covariates (genes) for the ith patient. Let ∆i = 1 if patient i is a failure and ∆i = 0 if patient i is censored. Define the score vector as U = dl/dη and the observed information matrix as I = −d²l/dη dη^T. An adjusted dependent variable is y = η + I^(−1) U. For ease of presentation assume the data have been ordered so that Ti ≤ Ti′ for i < i′. The elements of the score vector and information matrix can be expressed as follows:

    dl/dηi = ∆i − exp(ηi) A(η, Ti),
    d²l/dηi² = −exp(ηi) A(η, Ti) + exp(2ηi) B(η, Ti),
    d²l/(dηi dηj) = −exp(ηi) exp(ηj) B(η, min(Ti, Tj)),    i = 1, . . . , n,

where l represents the logarithm of the partial likelihood and where

    A(η, Ti) = Σ_{k: Tk ≤ Ti} ∆k / Σ_{j ∈ Rk} exp(ηj),

    B(η, Ti) = Σ_{k: Tk ≤ Ti} ∆k / ( Σ_{j ∈ Rk} exp(ηj) )²,

and where Ri is the risk set at time Ti. We drop all second-order terms B(η, Ti), which leaves a diagonal approximation to the information matrix. The diagonal elements of I are used as the weight vector. Therefore, we use

    yi = ηi + [∆i − exp(ηi) A(η, Ti)] / A(η, Ti)

as the adjusted dependent variables, and weights wi = A(η, Ti). Typically, we set the ηi = 0, although multiple iterations are possible. Since we need to apply the algorithm for each of several thousand genes, it is computationally efficient to start by sweeping out the mean, centering the adjusted dependent variables by their weighted mean and the covariates by their weighted mean. Therefore, the univariate regressions do not require many 2 × 2 matrix inversions. We note a close connection to score test statistics based on the partial likelihood to choose basis functions. For instance, for ηi = 0 the component of the score vector for covariate g is

    Ug = Σ_i z_gi wi yi.
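A small R sketch of this one-step screen, following the reconstruction above with ηi = 0 (so the weights reduce to Nelson–Aalen-type sums A(0, Ti)); ties are handled crudely and the function and variable names are ours, so this should be read as an illustration of the idea rather than the authors' implementation.

    # One-step weighted least squares screen at eta = 0
    fast_marginal_screen <- function(expr, time, status) {
      n_at_risk  <- sapply(time, function(t) sum(time >= t))
      hazard_inc <- status / n_at_risk                            # Delta_k / |R_k|
      A <- sapply(time, function(t) sum(hazard_inc[time <= t]))   # A(0, T_i)
      w <- A                                                      # weights w_i
      y <- ifelse(A > 0, (status - A) / A, 0)                     # adjusted dependent variable
      yc <- y - weighted.mean(y, w)                               # sweep out the weighted mean
      apply(expr, 2, function(z) {
        zc <- z - weighted.mean(z, w)
        c(slope = sum(w * zc * yc) / sum(w * zc * zc),            # univariate WLS coefficient
          score = sum(zc * w * yc))                               # score-type statistic U_g
      })
    }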
References Akaike, H. (1973). Information theory and an extension of the entropy maximization principle. In: Petrov, B.N., Csak, F. (Eds.), Proceedings of the Second International Symposium on Information Theory. Akademia, Kiado. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson Jr, J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brwon, P.O., Staudt, L.M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511. Allen, D.M. (1974). The relation between variable selection and data augmentation and a method for prediction. Technometrics 16, 125–127. Alter, O., Brown, P.O., Botstein, D. (2000). Singular value decomposition for genome-wide expression data and modeling. Proc. Nat. Acad. Sci. 97, 10101–10106. Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289–300. Bock, H.H. (1996). Probabilistic models in cluster analysis. Comput. Statist. Data Anal. 23, 5–28. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA. Chang, W.C. (1983). On using principal components before separating a mixture of two multivariate normal distributions. Appl. Statist. 32, 267–275. Efron, B., Tibshirani, R. (1997). Improvements on cross-validation: The 632 bootstrap. J. Amer. Statist. Assoc. 92, 548–560. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. 95, 14863–14868. Fraley, C., Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97, 611–631.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeck, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999). Molecular classification of cancer: Class discovery and class prediction be gene expression monitoring. Science 286, 531–537. Han, J., Kamber, M. (2001). Data Mining Concepts and Techniques. Morgan Kaufmann, San Francisco. Harrell Jr, F.E., Lee, K.L., Califf, R.M., Pryor, D.B., Rosati, R.A. (1984). Regression modelling strategies for improved prognostic prediction. Statist. Medicine 3, 143–152. Harrell Jr, F.E., Lee, K.L., Mark, D.B. (1996). Tutorial in biostatistics. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statist. Medicine 15, 361–387. Hastie, T., Tibshirani, R. (1990). Exploring the nature of covariate effects in the proportional hazards model. Biometrics 46, 1005–1016. Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D., Brown, P. (2000). ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1. research0003.1–0003.21. Henderson, R. (1995). Problems and prediction in survival-data analysis. Statist. Medicine 14, 161–184. Herwig, R., Poustka, A.J., Muller, C., Bull, C., Lehrach, H., O’Brien, J. (1999). Large-scale clustering of cDNA fingerprinting data. Genome Research 9, 1093–1105. Kaufman, L., Rousseeuw, P. (1990). Finding Groups in Data. Wiley, New York. Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westerman, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673–679. Korn, E.L., Simon, R. (1990). Measures of explained variation for survival data. Statist. Medicine 9, 487–503. LeBlanc, M., Crowley, J. (1993). Survival trees by goodness of split. J. Amer. Statist. Assoc. 88, 457–467. LeBlanc, M., Kooperberg, C., Grogan, T.M., Miller, T.P. (2003). Directed indices for exploring gene expression data. Bioinformatics 19, 686–693. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In: Cam, L.M.L., Neyman, J. (Eds.), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley, CA. Mallows, C.L. (1973). Some comments on Cp . Technometrics 15, 661–675. Mathsoft Inc. (1999). S-PLUS 2000 Professional Release 1. Mathsoft Inc. Nguyen, D., Rocke, D. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18, 39–50. O’Quigley, J., Xu, R. (2001). Explained variation in proportional hazards regression. In: Crowley, J. (Ed.), Handbook of Statistics in Clinical Oncology. Marcel Dekker, New York. Quackenbush, J. (2001). Computational analysis of microarray data. Nature Rev. 2, 418–427. Raftery, A.E., Madigan, D., Volinsky, C.T. (1996). Accounting for model uncertainty in survival analysis improves predictive performance. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics 5. Oxford University Press, Oxford. Rosenwald, A., Wright, G., Chan, W.C., Connors, J.M., Campo, E., Fisher, R., Gascoyne, R.D., MullerHermelink, H.K., Smeland, E.B., Staudt, L.M., et al. (2002). 
The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England J. Medicine 346, 1937– 1947. Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464. Segal, M.R. (1988). Regression trees for censored data. Biometrics 44, 35–48. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M. (1999). Systematic determination of genetic network architecture. Nature Genetics 22, 281–285. Tusher, V.G., Tibshirani, R., Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Nat. Acad. Sci. 98, 5116–5121. van der Laan, M.J., Bryan, J. (2000). Gene expression analysis with the parametric bootstrap. Biostatistics 1, 1–19. Verweij, P.J.M., van Houwelingen, H.C. (1993). Cross-validation in survival analysis. Statist. Medicine 12, 2305–2314. Volinsky, C.T., Madigan, D., Raftery, A.E., Kronmal, R.A. (1997). Bayesian model averaging in proportional hazard models: Assessing the risk of a stroke. Appl. Statist. 46, 433–448.
Ward, J.H. (1963). Hierarchical groupings to optimize an objective function. J. Amer. Statist. Assoc. 58, 234– 244. Weinstein, J.N., Myers, T.G., O’Connor, P.M., Friend, S.H., Fornace Jr, A.J., Kohn, K.W., Fojo, T., Bates, S.E., Rubenstein, L.V., Anderson, N.L., Buolamwini, J.K., van Osdol, W.W., Monks, A.P., Scudiero, D.A., Sausville, E.A., Zaharevitz, D.A., Bunow, B., Viswanadhan, V.N., Johnson, G.S., Wittes, R.E., Paull, K.D. (1997). An information intensive approach to the molecular pharmacology of cancer. Science 275, 343– 349. Westfall, P.H., Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment. Wiley, New York. Yeung, K.Y., Ruzzo, W.L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics 17, 763–774.
Ch. 38. Joint Analysis of Longitudinal Quality of Life and Survival Processes
Mounir Mesbah, Jean-François Dupuy, Natacha Heutte and Lucile Awad
1. Introduction

The aim of clinical trials conducted in oncology is to estimate the efficacy and the safety of new chemotherapies. An important endpoint is the patient's perception of his/her global quality of life. This endpoint integrates the impact of both efficacy and safety of the drug from the patient's perspective (Cox et al., 1992). Despite the growing importance of quality-of-life outcomes in medical research in general and in cancer clinical trials in particular, methods of analysis of these data remain an issue. In quality-of-life analysis, missing data cannot be considered as missing at random, as they are most likely related to the worsening of the patient's condition or to death. No consensus exists on the way of dealing with missing data. Therefore it is important to use a palette of statistical methodologies. The first and easiest strategy consists of converting repeated measures of QoL into time-to-event data and then applying the Kaplan–Meier method and log-rank tests. This strategy can be generalized by building a semi-parametric semi-Markovian multi-state model where the states are defined by a score level of quality of life or by death. The last strategy consists of specifying a joint distribution for the longitudinal continuous quality-of-life data and the survival duration and then carrying out the statistical inference (estimation and tests). Clinical endpoints such as time to progression or time to treatment failure are definitive events generally occurring during the study. These data can be analyzed using survival analysis methods. Following Awad et al. (2002), we show in this paper how, similarly, QoL can be analyzed by defining an event corresponding to its definitive deterioration. In addition to that methodology, two other strategies for analyzing such data are presented in detail:

• an analysis based on a semi-Markovian multi-state model (Heutte and Huber-Carol, 2002);
• an analysis based on a joint model of the continuous longitudinal (QoL) variable and the time of dropout or death (Dupuy and Mesbah, 2002).
All these different methodologies and their application to data from a clinical trial of treatments for metastatic colorectal cancer (Douillard et al., 2000) are explained theoretically and practically.

Internal covariate

Survival studies usually collect, on each participant, both the duration until some terminal event and repeated measures of a time-dependent covariate. Such a covariate is referred to as an internal time-dependent covariate. Usually, some subjects drop out of the study before occurrence of the terminal event of interest. One may then wish to evaluate the relationship between time of dropout and the internal covariate. The Cox model is a standard framework for that purpose. Here, we address this problem in situations where the value of the covariate at dropout time is unobserved.

Dropout process

Following Little and Rubin (1987), three main dropout processes can be distinguished in longitudinal studies. A dropout process is said to be completely random when the dropout is independent of both observed and unobserved measurements of the longitudinal variable. A dropout is random when it is independent of the unobserved measurements but depends on the observed ones. A dropout is non-ignorable when it depends on unobserved measurements. Under completely random and random dropout processes, and provided that there are no parameters in common between the measurement and the dropout models, nor any functional relationship between the parameters describing the measurement process and the parameters describing the dropout process, the longitudinal measurement process can be ignored for the purpose of making likelihood-based inferences about the time-of-dropout model. This property does not hold when the dropout is non-ignorable. Recently, a number of methods have been proposed to accommodate non-ignorable dropout in longitudinal data. Diggle and Kenward (1994) combined a multivariate linear model for the longitudinal process with a logistic regression model for the dropout. This logistic model allows dependence of dropout on the missing observation at dropout time. Molenberghs et al. (1997) adopted a similar approach for longitudinal ordinal data. These models fall into the class of outcome-based selection models (Little, 1995; Hogan and Laird, 1997b), for which the joint density of the repeated measures vector and dropout time is obtained as the conditional density of the dropout time given the longitudinal outcomes, multiplied by the marginal density of these outcomes. In some situations, dropout is related to a trend over time, rather than to the unobserved value of the covariate. One may then relate dropout time to the longitudinal outcomes through individual random effects used to model the longitudinal process. This yields the random-coefficient-based selection models (Wu and Carroll, 1988; Schluchter, 1992; DeGruttola and Tu, 1994; Ribaudo et al., 2000). An alternative class of models for the joint distribution of repeated measures and dropout time is called pattern-mixture (Little, 1995; Hogan and Laird, 1997a). This approach stratifies the sample by time of dropout and then models the distribution of the repeated measures within each stratum. Detailed
reviews of these various approaches can be found in Little (1995), Hogan and Laird (1997b), Verbeke and Molenberghs (2000) and Billot and Mesbah (1999). In Section 2, the design of the QoL clinical trial used here is presented, and the quality-of-life instrument and the data are described. In Section 3, we present preliminary analyses of the QoL scores; these mainly comprise an analysis of compliance, a study of missingness and a mixed-model analysis of variance of the QoL scores. In Section 4, we characterize deterioration of QoL by an absorbing state (similar to death when analyzing survival time). Only two kinds of events are defined, so repeated measures of QoL are easily converted into time-to-event data, and the Kaplan–Meier method and log-rank tests are then used for analysis. This methodology is applied to the clinical trial data. In Section 5 the semi-parametric semi-Markovian multi-state model of Heutte and Huber (Heutte and Huber-Carol, 2002) is presented. It is based on the same principle as the Kaplan–Meier method and the Cox model. The states are defined by a score level of quality of life or by death. An event is a change in the quality-of-life score level or the occurrence of death. The transitions occur along a competing risk process. Estimators of the quantities of interest are given and the method is illustrated on the same QoL cancer clinical trial data. In Section 6, we present the joint model of Dupuy and Mesbah (Dupuy and Mesbah, 2002), which combines a first-order Markov model for the continuous longitudinally measured QoL with a time-dependent Cox model for the dropout or death process. Estimation in this model is performed by maximum likelihood, and we show how it can be carried out via the EM algorithm. The interest of this model is that it takes non-ignorable dropout into account. Indeed, it can be viewed as generalizing Diggle and Kenward's model (Diggle and Kenward, 1994) to situations where dropout may occur at any point in time and may be censored. We apply both models and compare their results on a data set of longitudinal measurements among patients in a cancer clinical trial.
2. Presentation of the clinical trial: QoL instruments and data 2.1. Quality of life instruments and data The European Organization for Research and Treatment of Cancer (EORTC) QLQ-C30 instrument (Aaronson et al., 1993) has been widely used to evaluate the quality of life of cancer patients. It is a 30-item questionnaire completed by patients. Following EORTC guidelines, linear transformations are used to obtain 15 scales for analysis. These scales include five functional scales, nine symptom scales and one global scale (QL) including two items, the global health status (QL1) and quality-of-life (QL2) each one scaled with 7 levels. For the five functional scales and the QL scale, a high score indicates a high level of functioning. On the contrary, a high score on a symptom scale is related to a high level of symptoms. The EORTC (Fayers et al., 1995) and other authors (Beitz et al., 1998; McLachlan et al., 1998) recommend using the QL scale as an overall summary of
the patient's quality of life. The QL scale is preferable to a composite summary of the other scales (Osoba, 1994). It indeed corresponds to a self-assessed summary measure of the patient's quality of life, based on two separate ratings of his/her overall health and quality of life during the past week. The QL score takes 13 possible values ranging from 0 to 100, in steps of 8.3. Among the analyses performed on the 15 scales in the colorectal cancer clinical trial, a specific analysis was performed on the QL scale to assess both the efficacy and safety of the drug. Repeated measures of QL were recorded at various time points during the study: before randomization, during treatment (usually before each cycle of treatment), and during follow-up.

2.2. Clinical studies

The QLQ-C30 instrument has been used to measure quality of life in three randomized phase III studies conducted in metastatic colorectal cancer. The objective of the first study (V301) was to show that a new drug (irinotecan) administered after failure of treatment with fluorouracil prolonged the survival of the patients in second-line metastatic colorectal cancer (Cunningham et al., 1998). This study compared irinotecan plus best supportive care (BSC) to BSC alone. In the second study (V302), irinotecan was compared to fluorouracil as second-line treatment for metastatic colorectal cancer (Rougier et al., 1998). In the third study (V303), irinotecan plus fluorouracil (CPT-11 + 5-FU) was compared to fluorouracil alone (5-FU) as first-line treatment for metastatic colorectal cancer (Douillard et al., 2000). The primary endpoints were overall survival for V301 and V302, and response and time to progression for V303. The schedule of assessment was different in the three studies: the patients were followed for their quality of life until death or for a maximum of one year in V301 and V302, whereas in V303 the questionnaire was administered only during treatment. It is important to emphasize that survival was prolonged by the administration of irinotecan in each of the three studies. Therefore, it was important to show that this survival gain was not obtained at the price of a deterioration of the patient's quality of life. Additionally, due to the survival difference between the treatment groups, it was clear that the analysis of variance comparison would be biased in favor of the group with the worst survival. In this paper, we present in detail only the methodology used in the analysis of V303. Results for V301 and V302 are briefly discussed in Section 3.2. Two possible fluorouracil regimens were chosen by the clinical center according to local clinical practice or preference. One regimen consisted of administration of the drug every two weeks. The other regimen consisted of 6 infusions every 7 weeks. In V303, the quality-of-life questionnaire was to be completed within 8 days after randomization, before the first treatment, and before treatment for each treatment cycle (every 6 to 7 weeks) while patients were on treatment. The Kaplan–Meier analysis (Section 4) applied to the data from this study uses the definition where death is an event only if it occurs sufficiently close in time to the last QoL assessment. Indeed, death was only considered in patients without a prior deterioration, and only if it occurred within 10 weeks of the last QL score available for the patient.
That is, death was considered only when it could replace a QL evaluation given the period between two assessments defined in the protocol (6 to 7 weeks, to which a security margin of 3 weeks was added).
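Stated as a rule (a hypothetical sketch; the variable names and the day-based rendering of "10 weeks" are ours):

    # Death counts as a QoL event only if no prior deterioration was observed and it
    # occurs within 10 weeks (70 days) of the last available QL score
    death_is_qol_event <- function(death_day, last_ql_day, prior_deterioration) {
      !prior_deterioration & !is.na(death_day) & (death_day - last_ql_day) <= 70
    }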
3. Preliminary analysis

3.1. Methods

Preliminary to the Kaplan–Meier, Cox and semi-Markovian analyses, more classical quality-of-life analyses were performed, including an analysis of compliance and an analysis of variance on the QL scores. Repeated measures analysis of variance was used to study the evolution of the QL score over time. Mixed-effects model analyses, assuming a normal distribution for the QL scores, were carried out. Treatment and the treatment-by-time interaction were the fixed effects, time was a repeated factor, and patient-within-treatment-group was a random factor. The most suitable covariance structure was chosen according to Akaike's information criterion (AIC) and Schwarz's Bayesian criterion (SBC), among unstructured, compound symmetric and autoregressive of order 1. The hypothesis of no treatment effect was tested at a 5% two-sided significance level. A certain paucity of data was expected beyond a certain time point, making interpretation of the observed quality-of-life patterns difficult. Therefore, we limited the repeated measures analysis to time points at which at least 10% of patients in each treatment group provided evaluable questionnaires. In some cases, multiple evaluable questionnaires for a patient occurred within a single time interval. In these cases, we used the mean of the evaluable QoL scores within the time interval. To study whether data missingness was informative of the patient's quality of life, the relationship between missing data and bad clinical condition or reason for study dropout was analyzed. The cumulative number of missing QoL scores by treatment group was tabulated against the concomitant occurrence, in the same time interval, of one of the following:

(1) death,
(2) occurrence of progressive disease,
(3) occurrence of a grade 3 or worse adverse event,
(4) progressive disease and a grade 3 or worse adverse event occurring in the same time interval, or
(5) the patient withdraws consent to participate.

3.2. Results

One hundred ninety-eight patients were randomized to CPT-11 + 5-FU, and 187 were randomized to 5-FU alone. The results of the clinical endpoints (response rate, time to progression and survival) were in favor of CPT-11 + 5-FU. In particular, survival was prolonged. These results are reported elsewhere in detail (Douillard et al., 2000; Awad et al., 2002). A high percentage of the 643 questionnaires received in the CPT-11 + 5-FU group and of the 518 received in the 5-FU alone group was evaluable (94% and 92%, respectively). The averaging of multiple questionnaires in the same time interval caused a slight reduction of the degrees of freedom for the analysis. The equivalent of 570 distinct questionnaires were eventually considered in the CPT-11 + 5-FU group, and 448 in the 5-FU alone group. Of the total number received, this represents 89% and 87% in the two treatment groups, respectively. These percentages would have been 92% and 90%, respectively, if we had used protocol cycles for the questionnaire averaging
instead of time intervals. Therefore, only a limited number of degrees of freedom were lost because of the definition of the time intervals (22 intervals for the CPT-11 + 5-FU group, and 20 intervals in the 5-FU alone group).

3.2.1. Analysis of compliance

The study protocol specified that a QLQ-C30 questionnaire had to be completed by each patient before each treatment cycle. We calculated compliance with this protocol as the ratio of the number of patients with at least one evaluable questionnaire per cycle to the number of patients in the treatment cycle, as defined in the protocol. The overall compliance was 62% in the CPT-11 + 5-FU group and 59% in the 5-FU alone group. As expected, attrition was due to the decrease in the number of patients on treatment, and compliance decreased as the cycle number increased. Compliance was above 50% in both treatment groups for the first three cycles of treatment. At baseline, compliance was 83% in the CPT-11 + 5-FU group versus 77% in the 5-FU alone group. Compliance was stable for the first three treatment cycles and then lower but still nearly stable for the next four cycles. In the 5-FU alone group, compliance was 56% and 53% at cycles 2 and 3 (versus 65% and 61% for the CPT-11 + 5-FU group). We conclude that compliance was comparable in both treatment groups for the next four cycles. A small and identical number of patients in each treatment group had QoL data while on treatment without a score at baseline (18 patients, representing less than 10% of the initial population in each group). The trends with and without a baseline score were very similar in both groups, although the curve for patients without a baseline score in the CPT-11 + 5-FU group was below the corresponding curve for the 5-FU alone group. The small number of available scores from the patients without a baseline score, which did not exceed 7% of the initial population in both groups in each time interval, prevented any further interpretation.

3.2.2. Time interval definition

Because the duration of a protocol cycle, and therefore the frequency of quality-of-life evaluation, was different for the two 5-FU regimens used (every 7 weeks versus every 6 weeks), the analyses (other than the analyses of compliance and of the time to deterioration of the QL score) were performed according to time interval. The baseline interval was the time before the first infusion of the first cycle. While on treatment, the time intervals corresponded to 7-week periods (49 days), except for the first on-treatment interval. For the first on-treatment time interval, a 3.5-week lag (25 days) was added to the 7-week period, to make it a 10.5-week period (49 + 25 = 74 days). This allowed the questionnaires in the same time interval to be more synchronous for the two 5-FU schedules. Time interval 1 was from the first infusion (day 1 of week 1, excluded) to day 5 of week 11 (included). Time interval 2 was from day 5 of week 11 (excluded) to day 5 of week 18 (included). In general, the nth time interval was from day 5 of week [4 + 7(n − 1)] (excluded) to day 5 of week (4 + 7n) (included).

3.2.3. Study of data missingness

The number of evaluable QoL scores was comparable in both treatment groups, although always slightly higher in the CPT-11 + 5-FU group, both in absolute number
and in percentage of the initial population. As expected from the clinical analysis, the primary reason for data missingness in both groups was the progression of the disease, which was more frequent in the 5-FU group. Conversely, a little more missing data could be related to grade 3 or worse adverse events with CPT-11 + 5-FU. The number of missing observations due to the withdrawal of the patient's consent also seemed higher in the CPT-11 + 5-FU group. These findings were consistent with the reasons for treatment discontinuation. It may be assumed that the patients with missing data for progressive disease or adverse events had a worse QL score. Conversely, the same assumption could not be made so easily for missing data due to consent withdrawal. Some patients indeed withdrew their consent because they were feeling well and did not see further benefit from chemotherapy. The relationship between data missingness and the possibly low values of the missing QL scores for progressive disease or adverse events suggested that the missingness mechanism was not at random. In addition, the number of missing observations for these reasons seemed to be unbalanced between the treatment groups.

Fig. 1. Evolution over time of the QL-full analysis population.

3.2.4. Analysis of variance

Although all the data are shown in Figure 1, the model for the primary analysis of variance was applied to the first four on-treatment time intervals only. The QL patterns observed on fewer questionnaires would indeed not be representative of the initial population. We used a compound symmetric covariance structure for the repeated measures mixed model. It led to the lowest absolute values for both the AIC and SBC criteria as compared to unstructured or autoregressive (order 1) matrices. In addition, convergence
was reached in 3 iterations instead of 6 with the unstructured matrix, which indicated less over-parametrization with the simpler compound symmetric model. Despite the observed better trend in the CPT-11 + 5-FU group, the test of a treatment effect from the repeated measures analysis of variance was not significant (p = 0.5).
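A repeated-measures mixed model of this kind can be sketched, for illustration only, with a linear mixed model containing a patient-level random intercept, which induces a compound-symmetric within-patient covariance. This is not the original analysis code; the data frame and column names (ql, arm, interval, patient) are hypothetical.

```python
# A minimal sketch, assuming a long-format data frame with one row per
# evaluable questionnaire; a random intercept per patient gives a
# compound-symmetric within-patient covariance structure.
import pandas as pd
import statsmodels.formula.api as smf

def fit_ql_mixed_model(df: pd.DataFrame):
    """df columns assumed: 'ql' (score), 'arm' (treatment group),
    'interval' (on-treatment time interval), 'patient' (identifier)."""
    model = smf.mixedlm("ql ~ C(arm) * C(interval)", data=df, groups=df["patient"])
    return model.fit(reml=True)

# Example usage, with a hypothetical data frame `qol_long`:
# print(fit_ql_mixed_model(qol_long).summary())
```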
4. Time to QoL deterioration 4.1. Preliminaries In metastatic cancer studies, the deterioration of the patient’s health status is often observed during the trial either because of the toxicity of treatment or the progression of disease. This deterioration is very likely to be reflected in the patient’s quality of life. Therefore, when analyzing the global health status/quality-of-life (QL) score of the EORTC QLQ-C30 instrument, an event, QoL deterioration can be defined in terms of health states. Deterioration could be characterized by an absorbing state (similar to death when analyzing survival time). We define only two kinds of events, so repeated measures of QoL are easily converted into time-to-event data, and the Kaplan–Meier method and log-rank tests can be used for analysis. 4.2. Time to event definition To allow for a time-to-event analysis, repeated measures of QoL can be transformed into survival data by defining an absorbing state that represents deterioration of quality of life. We define QoL deterioration as a decrease in QL score from baseline without any return to a better state during the study. Because there are no established criteria for choosing a specific level of deterioration, we used four levels for the analysis: 5%, 10%, 20%, and 30%. For the 5% level of deterioration, any maintained decrease will be considered as an event. For the other levels of deterioration, the better the QoL state at baseline, the greater the decrease that is required for the change to be considered an event. Levels of deterioration greater than 30% were not considered because of the small number of events occurring. Two endpoints were defined: “QoL-deteriorationfree-survival” and the “time to deterioration of QoL”. For the first endpoint, the event of interest is the occurrence of either a QoL deterioration or the death of the patient, whichever occurs first. For the second endpoint, the event is the QoL deterioration, but death is considered as an event only if it occurs within a predefined period following the last QoL assessment. This is analogous to using death as a surrogate for QoL deterioration. By convention, death is considered as an event when it occurs within a period of time defined by 1.5 times the period between two assessments as planned in the study protocol. Otherwise, the patient is considered lost to follow up for QoL and is censored at the date of last assessment. The first endpoint is a composite endpoint of QoL deterioration and death. The second endpoint focuses on the QoL deterioration but takes into account the information of death assuming that if death occurs within a relatively short period of time after a QoL assessment, the QoL of the patient might have deteriorated within this period of time.
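As an illustration of the event definition above, the following minimal sketch (in Python, not part of the original analyses) converts one patient's repeated QL scores into a (time, event) pair for the "time to deterioration of QoL" endpoint. The function name and arguments are hypothetical, and the treatment of missing scores is a simplification of the rule used in the chapter.

```python
# A minimal sketch, assuming simplified inputs: baseline score, assessment
# times, scores (None for a missing assessment), optional death time and the
# death window (1.5 times the planned inter-assessment period).
def time_to_deterioration(baseline, times, scores, level=0.05,
                          death_time=None, window=None):
    """Return (time, event): event=1 at the first assessment from which the
    score stays at or below baseline*(1-level) for the rest of follow-up
    ("definitive" deterioration), or if death occurs within `window` of the
    last assessment; otherwise the patient is censored at the last assessment."""
    threshold = baseline * (1.0 - level)
    observed = [(t, s) for t, s in zip(times, scores) if s is not None]
    onset = None
    for t, s in observed:
        if s <= threshold:
            if onset is None:
                onset = t                     # candidate deterioration time
        else:
            onset = None                      # returned to a better state
    if onset is not None:
        return onset, 1
    last = observed[-1][0] if observed else 0.0
    if death_time is not None and window is not None and death_time <= last + window:
        return death_time, 1                  # death as surrogate for deterioration
    return last, 0

# Example: baseline score 66.7, maintained 10% deterioration from day 14
print(time_to_deterioration(66.7, [7, 14, 21], [66.7, 58.3, 50.0], level=0.10))
```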
Using these two definitions, the longitudinal QoL data are transformed into time-to-event data and can be analyzed using methodology appropriate for survival analysis. Kaplan–Meier probabilities can be estimated and the log-rank test can be used to compare treatment groups. To take into account that missing values were very likely to be related to the bad condition of the patient, when a deterioration was observed after a missing value, it was assumed that the deterioration had occurred at the time of the missing value. The Kaplan–Meier analysis includes only patients with at least a baseline score. This may induce a selection bias. Therefore, we compared QoL over time in patients with and without a baseline QL score. All the analyses were performed on evaluable questionnaires only. To be considered evaluable on treatment, a questionnaire had to be completed more than 5 days after the latest infusion, and before any further infusion. In the protocol, questionnaires were intended to be completed before the infusions to avoid the effects of acute toxicities following treatment administration (these toxicities were analyzed separately). In addition, questionnaires without a date of assessment or that were completed after the cut-off date were not evaluable. All questionnaires completed before the first infusion were considered as evaluable baseline questionnaires.

4.3. Results
The Kaplan–Meier estimate of the survival distribution of the time to QoL deterioration in the CPT-11 + 5-FU group was consistently above that of the 5-FU alone group, regardless of the level of deterioration considered (Figure 2). The log-rank test was statistically significant for the 5% and 20% deterioration levels (p = 0.03 and p = 0.04, respectively), and the trend was similar for the 10% and 30% deterioration levels (p = 0.06 in both cases), which indicates an increased time to deterioration of QoL for CPT-11 + 5-FU. Censoring was mainly due to the absence of an event at the cut-off date. The proportion of patients lost to follow up for QoL (as defined in Section 2) was always below 33% in the CPT-11 + 5-FU group and always below 22% in the 5-FU alone group. The aim of this analysis was to focus on the QoL score and not to mix the information with survival. Therefore we analyzed the proportions of deaths among the QoL events of the Kaplan–Meier analysis. Table 1 shows the proportion of deaths in the number of events at all levels of deterioration (below 10% in both treatment groups at deterioration levels of 5% and 10%, below 17% at the 20% deterioration level, and below 24% at the 30% deterioration level), indicating that quality of life and not survival was the primary cause of the treatment effect observed.

Table 1
Events in the analyses of the time to definitive QL scale deterioration from baseline (Arm A: CPT-11 + 5-FU; Arm B: 5-FU)

Deterioration level   Arm (# of events)   Deaths (%)    QL deteriorations (%)
5%                    A (N = 33)          3 (9.1%)      30 (90.9%)
                      B (N = 40)          3 (7.5%)      37 (92.5%)
10%                   A (N = 30)          3 (10.0%)     27 (90.0%)
                      B (N = 35)          3 (8.6%)      32 (91.4%)
20%                   A (N = 19)          3 (15.8%)     16 (84.2%)
                      B (N = 24)          4 (16.7%)     20 (83.3%)
30%                   A (N = 13)          3 (23.1%)     10 (76.9%)
                      B (N = 18)          4 (22.2%)     14 (77.8%)

Fig. 2. Kaplan–Meier curves of the time to QoL deterioration.
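The Kaplan–Meier and log-rank comparison described above can be sketched, for instance, with the lifelines package; this is an illustrative sketch only (not the software used for the trial analyses), and the array names are hypothetical.

```python
# A minimal sketch, assuming one (duration, event) pair per patient in each arm,
# e.g. produced by the time_to_deterioration() helper sketched earlier.
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def compare_arms(time_a, event_a, time_b, event_b):
    """Plot the two Kaplan-Meier curves and return the log-rank p-value."""
    kmf_a, kmf_b = KaplanMeierFitter(), KaplanMeierFitter()
    ax = kmf_a.fit(time_a, event_observed=event_a,
                   label="CPT-11 + 5-FU").plot_survival_function()
    kmf_b.fit(time_b, event_observed=event_b,
              label="5-FU alone").plot_survival_function(ax=ax)
    result = logrank_test(time_a, time_b,
                          event_observed_A=event_a, event_observed_B=event_b)
    return result.p_value
```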
5. Semi-Markovian multi-state model

5.1. Preliminaries
We now present the Heutte and Huber method, based on a multi-state model with competing risks (Crowder, 2001). Each state is defined by a level of the quality-of-life score or by death, and each transition is a change of quality of life or the occurrence of death. We assume that the corresponding process is semi-Markovian with a finite number m of states. This assumption is less restrictive than the usual Markov assumption. We take the first m_1 states 1, . . . , m_1 to be transient and the others to be absorbing. In order to take into account the possible effect of covariates on the evolution of the quality of life of the
patient, we present a semi-Markovian model, in continuous time, which assumes a multiplicative model for each waiting time, the analog of the semi-parametric Cox model. The covariates may be qualitative or quantitative. For example, we could use sex or age as a covariate. They may also depend on time, as long as the time considered is the time spent in the current state. One may consider the time spent in a specific state as a competing risks problem, each state reachable from this specific state being in competition with all other states reachable from it; this is the "competing risk scheme". We define state m + 1 as the "censored state", which occurs when a transition is right censored. Under relevant assumptions about the corresponding hazards, we may consider the different waiting times separately and apply Cox's partial likelihood to each of them. This allows us to estimate easily the impact of the covariates and the law of the waiting time in state i before a transition to state j. We can thus evaluate the effect of the treatment on the change of QoL when a specific QoL state is left. We test the impact of the covariates using the likelihood ratio test and give an estimate of the transition probabilities from state i to state j according to the time spent in state i. We apply this method to the previous QoL clinical trial in cancerology after converting the 7 possible values of the QL2 score into 3 states, in order to get enough subjects with inter-state transitions. This is essential to estimate the parameters of the Cox model and the survival distribution.

5.2. Definitions and notations
The aim of the trial was to compare two drugs A and B. Drug A is the reference. We introduce a covariate: ξ is equal to 0 for drug A and 1 for drug B. The score levels quantifying the quality of life of the 215 patients in the trial are observed at different times. The initial score scale ranges from 1 to 7, reduced for our study to only 3 values, due to the small number of patients involved. The quality of life increases with the score. State 1 is bad quality of life, state 2 is neutral quality of life, and state 3 is good quality of life. State 4 is death and we define state 5 as censored. Thus we have m_1 = 3 transient states, one absorbing state, death, and the special censored state. Every patient has a history of the following type:

H = {J_0, X_1, J_1, X_2, J_2, . . . , X_{s−1}, J_{s−1}, X_s, J_s},

where 1 ≤ J_i ≤ m_1 for i < s and m_1 < J_s ≤ m + 1, with m = 4. Here m_1 = 3 is the number of transient states (the remaining states are absorbing). X_i is the sojourn time in state J_{i−1}. The number of visited states s may not be the same for all patients. The method we propose allows taking into account patients whose history is censored. Let us note that right-censored patients are in state J_s = m + 1. The observed transition numbers in this study are shown in Table 2.

Table 2
Number of observed transitions (rows: state left; columns: state entered)

Treatment arm A
From state    1    2    3    4    5
1 (bad)       0   19    2   20    6
2 (neutral)  18    0   21   21    7
3 (good)     13   22    0   32   23

Treatment arm B
From state    1    2    3    4    5
1 (bad)       0   15    0   22    1
2 (neutral)  15    0   23   30    5
3 (good)     11   30    0   26   22

Fig. 3. Multi-state model: quality of life in a clinical trial in cancerology (1 ≡ bad, 2 ≡ neutral, 3 ≡ good, 4 ≡ death).

Figure 3 shows the possible transitions between the different states, except the transition from a transient state to the censored state, which is always possible. The patients enter the study with a specific quality of life, bad, neutral or good. The only covariate used here is the type of drug, but the proposed method could take into account several covariates under convenient assumptions. The covariates included in the study may be qualitative or quantitative, but they must not depend on chronological time. They can be a known function of the time spent in the current state. Those conditions are necessary in order to fulfill the semi-Markov assumption.

5.3. The semi-Markovian model
The sojourn time in a state before moving to another state is an important issue. In a Markov process, the time spent in a given state does not depend on the next state. We assume that the distribution of the sojourn time X in state i before a transition to state j may depend not only on state i but also on the next state j (j ≠ i). This implies that we assume a semi-Markov model for our multi-state data. For each patient, we observe a process (J, X) = {(J_s, X_s): 0 ≤ s}. We assume this process to be semi-Markov (Pyke, 1961). A semi-Markov process is a stochastic process whose state space is denumerable and whose sequence of visited states is a Markov chain. Conditionally on the Markov chain J, the sojourn times in each state form a sequence X = {X_s, 0 ≤ s} of independent random variables with probability laws depending on this state as well as on the next visited state. Precisely, (J, X) is a renewal process if the sojourn time distributions satisfy:

\[
P(J_n = j,\; X_n \le x \mid J_0, X_1, \dots, J_{n-1}, X_{n-1}) = P(J_n = j,\; X_n \le x \mid J_{n-1}). \tag{1}
\]

We define the sojourn time laws through the following hazard functions, conditional on the two adjacent states:

\[
\lambda_{|ij,\xi}(t) = \lim_{\Delta t \to 0}
\frac{P\bigl(t \le X_{n+1} < t + \Delta t \mid X_{n+1} \ge t,\; J_n = i,\; J_{n+1} = j,\; \xi\bigr)}{\Delta t}.
\]
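A minimal sketch of how such a history can be expanded into one record per observed sojourn, which is the layout used for the per-transition analyses below, is given here; the function and field names are hypothetical and the example data are invented.

```python
# A minimal sketch, assuming a history H = (j0, [(x1, j1), (x2, j2), ...]):
# each observed sojourn yields one record with the state occupied, the sojourn
# time, the state entered next, and a censoring flag (state 5 = censored).
def history_to_records(history, treatment):
    state, steps = history
    records = []
    for sojourn, next_state in steps:
        records.append({
            "from_state": state,
            "sojourn": sojourn,
            "to_state": next_state,
            "censored": int(next_state == 5),   # right-censored sojourn
            "treatment": treatment,             # covariate xi (0 = drug A, 1 = drug B)
        })
        state = next_state
    return records

# Example: bad -> neutral after 30 days, then the neutral sojourn censored at 45 days
print(history_to_records((1, [(30, 2), (45, 5)]), treatment=1))
```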
Let J(i) be the set of states that can be reached from state i. We can sketch the competing-risks assumption as follows. Consider independent random variables W_{i1}, . . . , W_{im}, i = 1, . . . , m_1, having respective survival functions G_{|i1,ξ}(t), . . . , G_{|im,ξ}(t). The sojourn time X_{n+1} in state J_n = i before a transition to state J_{n+1} is:

\[
X_{n+1} = \min_{l \in J(i)} W_{il}, \tag{2}
\]
and J_{n+1} = j if and only if W_{ij} < W_{il} for all l ∈ J(i), l ≠ j. The W_{il}, l ∈ J(i), are potential sojourn times in state i before a transition to state l, the smallest of which determines the "current" sojourn time and the next state, and the likelihood may be derived using the independence assumption.

5.4. Likelihood function
Let us now write the likelihood of our model, assuming that the total number of patients is N. Let j^h_{s^h} be the last observed state for patient h, ranging from 1 to m + 1 = 5. The likelihood may be written as

\[
L = \prod_{h=1}^{N} \prod_{n=0}^{s^h-1} \prod_{i=1}^{m_1}
P(J_0 = i)^{\mathbf{1}\{j_0^h = i\}} \tag{3}
\]
\[
\times \prod_{j=1}^{m} L_{nc}\bigl(j, x_{n+1}^h \mid i, \xi\bigr)^{\mathbf{1}\{j_n^h = i,\; j_{n+1}^h = j\}} \tag{4}
\]
\[
\times\; L_{c}\bigl(x_{s^h}^h \mid i, \xi\bigr)^{\mathbf{1}\{j_{s^h}^h = m+1\}}, \tag{5}
\]
where the first product takes care of the initial state, the second product takes care of the fully observed transitions, with respective contributions L_{nc}(j, x | i, ξ), and the third one takes care of the censored transition, with contribution L_c(x | i, ξ). The likelihood contribution of a transition between J_n = i and J_{n+1} = j, with no censoring, after a sojourn time in state i equal to x, is:

\[
\begin{aligned}
L_{nc}(x, j \mid i, \xi)
&= P(J_{n+1} = j \mid J_n = i, \xi)\,
\Bigl[-\frac{\partial}{\partial u} P(X_{n+1} > u \mid J_n = i, J_{n+1} = j, \xi)\Bigr]_{u=x} \\
&= P\Bigl(\bigcap_{\substack{l \in J(i) \\ l \ne j}} \{W_{il} > x\} \cap \{x \le W_{ij} < x + dx\} \,\Big|\, \xi\Bigr) \\
&= \prod_{\substack{l=1 \\ l \ne j}}^{m} P(W_{il} > x \mid \xi)\; dP(W_{ij} = x \mid \xi)
= -\prod_{\substack{l=1 \\ l \ne j}}^{m} G_{|il,\xi}(x)\; \frac{\partial G_{|ij,\xi}(x)}{\partial x}.
\end{aligned}
\]
The likelihood contribution of a censored sojourn time in state i equal to x is:

\[
\begin{aligned}
L_{c}(x \mid i, \xi)
&= \sum_{l=1}^{m} P(J_{n+1} = l \mid J_n = i, \xi)\, P(X_{n+1} > x \mid J_n = i, J_{n+1} = l, \xi) \\
&= P\Bigl(\bigcap_{l \in J(i)} \{W_{il} > x\} \,\Big|\, \xi\Bigr)
= \prod_{l=1}^{m} P(W_{il} > x \mid \xi)
= \prod_{l=1}^{m} G_{|il,\xi}(x).
\end{aligned}
\]
Let δ^h be equal to 0 when censoring occurs, in other words when j^h_{s^h} = 5, and 1 otherwise. Then the likelihood L is:

\[
\begin{aligned}
L = \prod_{h=1}^{N}
&\;\prod_{n=0}^{s^h-2}
\lambda_{|j_n^h j_{n+1}^h,\xi}\bigl(x_{n+1}^h\bigr)
\prod_{\substack{l=1 \\ l \ne j_n^h}}^{m} G_{|j_n^h l,\xi}\bigl(x_{n+1}^h\bigr) \\
&\times
\Bigl[\lambda_{|j_{s^h-1}^h j_{s^h}^h,\xi}\bigl(x_{s^h}^h\bigr)
\prod_{\substack{l=1 \\ l \ne j_{s^h-1}^h}}^{m} G_{|j_{s^h-1}^h l,\xi}\bigl(x_{s^h}^h\bigr)\Bigr]^{\delta^h}
\times
\Bigl[\prod_{l=1}^{m} G_{|j_{s^h-1}^h l,\xi}\bigl(x_{s^h}^h\bigr)\Bigr]^{1-\delta^h}.
\end{aligned}
\]
Thus, the likelihood L is a product, over the transitions from state i to state j, of terms such as:

\[
L_{ij} = \prod_{h=1}^{N} \prod_{n=0}^{s^h-1}
\bigl[\lambda_{|ij,\xi}\bigl(x_{n+1}^h\bigr)\, G_{|ij,\xi}\bigl(x_{n+1}^h\bigr)\bigr]^{\mathbf{1}\{j_n^h = i,\, j_{n+1}^h = j,\, j \ne m+1\}}
\bigl[G_{|ij,\xi}\bigl(x_{n+1}^h\bigr)\bigr]^{\mathbf{1}\{j_n^h = i,\, j_{n+1}^h \ne j\}}
\bigl[G_{|ij,\xi}\bigl(x_{s^h}^h\bigr)\bigr]^{\mathbf{1}\{j_{s^h-1}^h = i,\, j_{s^h}^h = m+1\}},
\tag{6}
\]
\[
L = \prod_{i=1,\dots,m_1} \;\prod_{j \in J(i)} L_{ij}. \tag{7}
\]
For each transition from state i to state j, we assume a Cox regression, and we consider each transition from state i to any state j′ ≠ j as a censored transition. Then, Eq. (6) is the total likelihood corresponding to the Cox model. The parameters β_ij and G_{0|ij} are estimated by the standard Cox method.

5.5. Estimation and tests
The estimates of the regression parameters β_ij are given in Table 3 and the corresponding survival curves are shown in Figures 4 and 5. Only the coefficient β̂_14 is significantly different from zero. This is the transition from the state of bad quality of life to death.

Table 3
Estimates of regression parameters β_ij

Transition    β         se(β)    z         p
(1, 2)        0.185     0.347    0.532     0.59
(1, 4)        1.01      0.331    3.05      0.0023
(2, 1)       −0.207     0.35    −0.591     0.55
(2, 3)        0.0135    0.302    0.0448    0.96
(2, 4)        0.463     0.286    1.62      0.11
(3, 1)       −0.0868    0.41    −0.212     0.83
(3, 2)        0.388     0.281    1.38      0.17
(3, 4)       −0.107     0.265   −0.402     0.69
Fig. 4. Survival curves with 95% confidence limits around the estimated probability (continuation).
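The per-transition Cox fits described above, in which each transition out of state i to a state other than j is treated as censored, can be illustrated as follows. This sketch uses the lifelines package rather than the software used by the authors, and the data-frame layout follows the record structure sketched after Figure 3.

```python
# A hedged illustration (not the authors' code): fit one Cox model per
# transition (i, j) with the treatment indicator xi as the only covariate,
# treating sojourns that end in any other state as right-censored.
import pandas as pd
from lifelines import CoxPHFitter

def fit_transition(records: pd.DataFrame, i: int, j: int):
    """records columns assumed: from_state, to_state, sojourn, treatment."""
    sub = records[records["from_state"] == i].copy()
    sub["event"] = (sub["to_state"] == j).astype(int)   # competing-risk censoring
    cph = CoxPHFitter()
    cph.fit(sub[["sojourn", "event", "treatment"]],
            duration_col="sojourn", event_col="event")
    return cph  # cph.params_["treatment"] estimates beta_ij

# Example: transition from bad quality of life (state 1) to death (state 4)
# fit_transition(all_records, 1, 4).print_summary()
```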
The likelihood ratio test statistic for testing H_0: β_ij = β_0 is T = 2{log L(β̂) − log L(β_0)}. Under H_0, T → χ²_1. Let β_0 = β̂_14. We can test whether β_24 is equal to β_0 and whether β_34 is equal to β_0. β_34 is significantly different from β_0 (p-value < 0.001), while β_24 is not significantly different from β_0 (p-value = 0.062). Having fitted the transition intensities, we want to estimate P_{ij;ξ}(t), i ∈ {1, . . . , m_1}, j ∈ {1, . . . , m}, the probability that a patient is in state j at time t given that he was initially in state i, with a covariate value equal to ξ. We compute this probability when we have a hierarchical model. In a hierarchical model, if state j can be reached from state i, then state i cannot be reached from state j. It is always possible to transform a model into a hierarchical model. Let

\[
S_{i|\xi}(t) = \exp\Bigl(-\int_0^t \sum_{j} \lambda_{|ij;\xi}(u)\, du\Bigr)
\]

be the survival function of the sojourn time in state i given ξ. Then, for a direct transition from state i to state j, we have:

\[
P_{ij;\xi}(t) = P\bigl(X_{n+1} = s,\; J_{n+1} = j,\; X_{n+2} > t - s \mid J_n = i, \xi\bigr)
= \int_0^t S_{i|\xi}(s^-)\, \lambda_{|ij,\xi}(s)\, S_{j|\xi}(t - s)\, ds, \qquad 0 < s < t.
\]
Fig. 5. Survival curves with 95% confidence limits around the estimated probability.
REMARK. If state j is an absorbing state, the expression of P_{ij;ξ}(t) becomes simpler because S_{j|ξ}(t − s) = 1.

Let N_{ij}(t) be the number of direct transitions from state i to state j in [0, t] and Y_{i;ξ}(s) the number of sojourn times in state i exceeding s with covariate equal to ξ. We estimate P_{ij;ξ}(t) by:

\[
\widehat{P}_{ij;\xi}(t) = \int_0^t \frac{I\{Y_{i;\xi}(s) > 0\}}{Y_{i;\xi}(s)}\,
\widehat{S}_{i|\xi}(s^-)\, \widehat{S}_{j|\xi}(t - s)\, dN_{ij}(s).
\]

Let \(\widehat{A}_{|ij,\xi}\) be the estimated cumulative transition rate from state i to state j given ξ:

\[
\widehat{A}_{|ij,\xi}(t) = \widehat{A}_{0|ij}(t)\, \exp\bigl(\hat{\beta}_{ij}\, \xi\bigr) \quad \text{for } i \ne j,
\qquad
\widehat{A}_{|ii,\xi}(t) = -\sum_{j:\, j \ne i} \widehat{A}_{|ij,\xi}(t) \quad \text{for } i \in \{1, \dots, m_1\}.
\]

We estimate this cumulative transition rate through the Nelson–Aalen estimator. As \(\widehat{A}_{|ij,\xi}\) is a step function, we can rewrite \(\widehat{P}_{ij;\xi}(t)\) as:

\[
\widehat{P}_{ij;\xi}(t) = \sum_{0 < s \le t} \widehat{S}_{i;\xi}(s^-)\, \Delta\widehat{A}_{|ij;\xi}(s)\, \widehat{S}_{j;\xi}(t - s),
\]

with \(\widehat{S}_{i;\xi}(t) = \prod_{0 < u \le t}\{1 - \Delta\widehat{A}_{|ij;\xi}(u)\}\) (Kaplan–Meier estimator).

For an indirect transition, consider a path π from state i to state j through intermediate states l, k, . . . , with successive sojourn times s_1, s_2, s_3, . . . , s_{p−1}:

\[
i \xrightarrow{\;s_1\;} l \xrightarrow{\;s_2\;} k \xrightarrow{\;s_3\;} \cdots \xrightarrow{\;s_{p-1}\;} j.
\]

The probability P_{ij;ξ}(t) associated with this path π is:

\[
P^{\pi}_{ij;\xi}(t) = \int_{\sum_{l=1}^{p} s_l = t}
S_{i|\xi}(s_1^-)\, \lambda_{|il,\xi}(s_1)\, S_{l|\xi}(s_2^-)\, \lambda_{|lk,\xi}(s_2) \cdots
S_{j|\xi}\Bigl(t - \sum_{u=1}^{p-1} s_u\Bigr)\, ds_1\, ds_2 \cdots ds_{p-1}.
\]

Let \(t_k = \sum_{l=1}^{k} s_l\). The estimates of the transition probabilities are given by:

\[
\widehat{P}^{\pi}_{ij;\xi}(t) = \int \widehat{S}_{i;\xi}(t_1^-)\, d\widehat{A}_{|il;\xi}(t_1)\,
\widehat{S}_{l;\xi}\bigl((t_2 - t_1)^-\bigr)\, d\widehat{A}_{|lk;\xi}(t_2 - t_1) \cdots \widehat{S}_{j;\xi}(t - t_{p-1}).
\]

Let C(i, j) be the set of all possible paths from state i to state j. Then \(\widehat{P}_{ij;\xi}(t) = \sum_{\pi \in C(i,j)} \widehat{P}^{\pi}_{ij;\xi}(t)\). Voelkel and Crowley (1984) show how to obtain large-sample properties of this estimator.
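A minimal numerical sketch of the plug-in estimator for a direct i → j transition is given below. It assumes simplified inputs (one observed sojourn or transition per array element), handles ties only approximately, and is an illustration of the formula rather than the authors' implementation.

```python
# A minimal sketch, assuming: for the Kaplan-Meier part, all observed sojourn
# durations in a state with an event indicator; for the transition part, the
# observed i->j transition times and the number at risk in state i at each.
import numpy as np

def km_sojourn(durations, events):
    """Kaplan-Meier estimate of the sojourn-time survival function of a state;
    ties are handled sequentially for simplicity.  Returns a step function."""
    order = np.argsort(durations)
    t, d = np.asarray(durations, float)[order], np.asarray(events)[order]
    at_risk = len(t) - np.arange(len(t))
    surv = np.cumprod(np.where(d == 1, 1.0 - 1.0 / at_risk, 1.0))
    def S(u):
        k = np.searchsorted(t, u, side="right")
        return 1.0 if k == 0 else float(surv[k - 1])
    return S

def direct_transition_prob(t, trans_times, n_at_risk, S_i, S_j):
    """Plug-in estimate of P_ij(t): sum over i->j transition times s <= t of
    S_i(s-) * (dN_ij(s)/Y_i(s)) * S_j(t - s)."""
    total = 0.0
    for s, y in zip(trans_times, n_at_risk):
        if s <= t and y > 0:
            total += S_i(s - 1e-9) * (1.0 / y) * S_j(t - s)   # s- approximated
    return total
```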
Fig. 6. Probability of being in the state of bad quality of life given that the patient was initially in the neutral state of quality of life, \(\widehat{P}_{21,\xi}(t)\), with 95% confidence limits around the estimated probability.

Fig. 7. Probability of being in the state of good quality of life given that the patient was initially in the bad state of quality of life, \(\widehat{P}_{13,\xi}(t)\), with 95% confidence limits around the estimated probability.
In order to obtain the variance of this estimator we use a bootstrap procedure. The variance was obtained using 500 iterations. The probability of being in the state of bad quality of life given that the patient was initially in the neutral state of quality of life is represented in Figure 6, and the probability of being in the state of good quality of life given that the patient was initially in the bad state of quality of life is represented in Figure 7.

5.6. Asymptotic results
We derive in this section the asymptotic properties when n covariates are present. Let Y_{ik}(t) = 1 indicate the presence of patient k in state i at time t. The functions Y_{ik} are left continuous. Let τ be the maximal duration of observation. Let N_{ijk}(x) be the number of transitions from state i to state j that have occurred before time x for patient k. We observe M independent point processes on [0, τ] with respective counts N_{ijk}. Then, we can write the partial log-likelihood as:

\[
L_{ij} = \sum_{k=1}^{M} \int_0^{\tau} \beta_{ij}\, \xi_k\, dN_{ijk}(x)
- \int_0^{\tau} \log\Bigl(\sum_{k=1}^{M} Y_{ik}(x)\, \exp(\beta_{ij}\, \xi_k)\Bigr)\, dN_{ij}(x).
\]
We consider that the vectors ξ are column vectors. For given β_ij and x, let

\[
S^{(0)}_{ij}(\beta_{ij}, x) = \sum_{k=1}^{M} e^{\beta_{ij}^{T} \xi_k}\, Y_{ik}(x),
\]

let \(S^{(1)}_{ij}(\beta_{ij}, x)\) be the column vector of dimension n

\[
S^{(1)}_{ij}(\beta_{ij}, x) = \sum_{k=1}^{M} e^{\beta_{ij}^{T} \xi_k}\, Y_{ik}(x)\, \xi_k,
\]

and let \(S^{(2)}_{ij}(\beta_{ij}, x)\), \(V_{ij}(\beta_{ij}, x)\) and \(J_{ij}(\beta_{ij})\) be the n × n matrices given by:

\[
S^{(2)}_{ij}(\beta_{ij}, x) = \sum_{k=1}^{M} e^{\beta_{ij}^{T} \xi_k}\, Y_{ik}(x)\, \xi_k \xi_k^{t},
\qquad
V_{ij}(\beta_{ij}, x) = \frac{S^{(2)}_{ij}(\beta_{ij}, x)}{S^{(0)}_{ij}(\beta_{ij}, x)}
- \frac{S^{(1)}_{ij}(\beta_{ij}, x)\, S^{(1)}_{ij}(\beta_{ij}, x)^{t}}{\bigl(S^{(0)}_{ij}(\beta_{ij}, x)\bigr)^{2}},
\]
\[
J_{ij}(\beta_{ij}) = \int_0^{\tau} V_{ij}(\beta_{ij}, x)\, dN_{ij}(x).
\]
ASSUMPTIONS. We assume that there exist a neighborhood ν of β_ij, a scalar function \(s^{(0)}_{ij}\), a column vector \(s^{(1)}_{ij}\) and a matrix \(s^{(2)}_{ij}\) defined on ν × [0, τ] such that, for all u = 0, 1, 2:

(1) \(\sup_{\beta_{ij} \in \nu,\, t \le \tau} \bigl\| \tfrac{1}{M} S^{(u)}_{ij}(\beta_{ij}, t) - s^{(u)}_{ij}(\beta_{ij}, t) \bigr\| \xrightarrow{P} 0\) as M → ∞.
(2) \(s^{(u)}_{ij}(\cdot)\) is a continuous function of β_ij ∈ ν, uniformly in t ∈ [0, τ], and bounded on ν × [0, τ].
(3) There exists a > 0 such that the function \(t \mapsto s^{(0)}_{ij}(\beta_{ij}, t)\) is bounded on [a, τ].
(4) For β_ij ∈ ν and t ≤ τ:
\[
s^{(1)}_{ij}(\beta_{ij}, t) = \frac{\partial s^{(0)}_{ij}(\beta_{ij}, t)}{\partial \beta_{ij}},
\qquad
s^{(2)}_{ij}(\beta_{ij}, t) = \frac{\partial^{2} s^{(0)}_{ij}(\beta_{ij}, t)}{\partial \beta_{ij}^{2}}.
\]
(5) Let \(v_{ij} = \dfrac{s^{(2)}_{ij}}{s^{(0)}_{ij}} - \dfrac{s^{(1)}_{ij}\,(s^{(1)}_{ij})^{t}}{(s^{(0)}_{ij})^{2}}\). The matrix
\[
\Sigma_{ij} = \int_0^{\tau} v_{ij}(\beta_{ij}, t)\, s^{(0)}_{ij}(\beta_{ij}, t)\, \lambda_{j|i,\xi}(t)\, dt
\]
is positive definite.
(6) There exists δ > 0 such that
\[
\frac{1}{\sqrt{M}} \sup_{\substack{1 \le k \le M \\ 0 \le t \le \tau}}
\bigl|\xi_k\, Y_{ik}(t)\, I\{\beta_{ij} \xi_k > -\delta |\xi_k|\}\bigr| \xrightarrow{P} 0.
\]
√ L M βˆij − βij −→N 0, Σij−1 , P 1 Jij βˆij −→Σij . M For all x1 , . . . , xu in [0, τ ] put:
xk
Cij (k, ·) = 0
Γij (k, l) =
λ0j |i (u)
(1) sij (βij , u) du, (0) sij (βij , u)
xk ∧xl
λ0j |i (u) (0)
du,
sij (βij , u)
0
Kij = Γij + Cij Σij−1 Cijt , then: √ 0 L (x1 ) − A0 (x1 ), . . . , A 0 (xu ) − A0 (xu ) −→N M A (0, Kij ), ij ij ij ij let
Kij (k, l) = n
xk ∧xl
1 (0) (Sij (βˆij , u))2
0
x
+ 0
(Sij(1) (βˆij , u))t (0) (Sij (βˆij , u))2
dNij (u) −1 dNij (u)J βˆij
Joint analysis of longitudinal quality of life and survival processes
vl
×
(Sij(1) (βˆij , u)) (Sij (βˆij , u))2 (0)
0
709
dNij (u)
then, it can be shown that: P ij (k, l)−→K K ij (k, l).
More generally: √ L 0 (x1 ) − A0 (x1 ), . . . , A 0 (xu ) − A0 (xu ) −→N M βˆij − βij , A 0, Kij1 , ij ij ij ij with
Kij1 =
Σij−1
−Cij Σij−1
−Σij−1 Cijt
Kij
and for all l: −1 −MJij βˆij
vl 0
(Sij (βˆij , u)) (1)
(0) (Sij (βˆij , u))2
P
dNij (u)−→ − Σij Cijt (·, l).
6. Joint distribution of QoL and survival–dropout processes Many survival studies collect longitudinal measures of covariates on each study subject, until occurrence of a terminal event such as death, infection or disease progression. Usually, some subjects drop out of the study before occurrence of this event. Our interest is on modelling the relationship between time of dropout (the precise definition of dropout will be given later in this introduction) and a longitudinally measured covariate. (Kalbfleisch and Prentice, 1980) call such a time-dependent covariate an internal covariate. More precisely, they define an internal covariate to be the output of a stochastic process that is generated by the individual under study. This stochastic process is observed only so long as the individual remains in the study. Let Z be an internal co denote its history {Z(u), 0 u t} up to time t. Let T denote time to variate and Z(t) some event. (Kalbfleisch and Prentice, 1980) define hazard of occurrence of this event at time t by λ t | Z(t) = lim Pr t T t + dt | Z(t), (8) T t /dt, dt →0
which is conditional on the covariate process up to t. Specific problems arise when fitting a Cox model (Cox, 1972) with internal covariates. For example, several authors (Cox and Oakes, 1984; Altman and De Stavola, 1994) point out that in a clinical trial comparing two treatments, inclusion of an internal covariate whose path is directly affected by treatment may mask the treatment effect. Other authors discuss estimation of probabilities of the form Pr[T t + t | Z(t)] (note that Pr[T t | Z(t)] = 1). This should also be conducted with care since such probabilities depend on the development of Z between t and t + t (Andersen et al., 1993; Altman and De Stavola, 1994). Other references on these issues include (Kalbfleisch and Prentice, 1980) and (Collett,
710
M. Mesbah et al.
1994). Some new developments involving internal covariates in a Cox model have recently been proposed. They address the problem of measurement error in an internal covariate, and include the work of (Tsiatis et al., 1995), (Wulfsohn and Tsiatis, 1997), (Dafni and Tsiatis, 1998), and (Tsiatis and Davidian, 2001). Here is considered a survival study where values of an internal covariate Z are measured at discrete times until a terminal event occurs. Some sequences of measurements may terminate prematurely (i.e., before occurrence of this event). This phenomenon, which truncates the sequence of observations of Z, is called a dropout (Diggle and Kenward, 1994; Little, 1995; Scharfstein et al., 1999). A standard framework to evaluate the relationship between time of dropout and the time-dependent covariate Z is given by the Cox model (Cox, 1972). For a given individual, we assume that changes in the path of Z occur at the times tj of measurement of Z (Z is piecewise constant) and that the value of Z on [tj , tj +1 [ is observed at the end of the interval. Hence, Z is not observed at the time of dropout. Following suggestions of (Altman and De Stavola, 1994) and (Collett, 1994), one may nevertheless fit a Cox model to these data by replacing the unobserved value of Z at the event time by the last observed value. However, this approach is not appropriate if Z may vary in the instants preceding dropout. In this paper, we propose a joint modelling approach for dropout time and longitudinal covariate data. This model accommodates possible changes in Z just prior to dropout and could be applied in the context of nonignorable dropout. In our setting, the value of the internal covariate at dropout is unobserved. Hence, we may interpret the dropout as being non-ignorable. Moreover, the joint model suggested naturally falls into the class of outcome-based selection models and can be seen as a generalization of (Diggle and Kenward’s, 1994) selection model to situations where dropout may occur at any point in time and may be censored. The selection and pattern-mixture models cited above assume that dropout occurs at one of the prespecified measurement times of the longitudinal variable (it must be noted that (Hogan and Laird, 1997a) proposed a mixture model also allowing for censoring of the dropout times). The methodology presented in this section was first motivated by situations similar to those of the clinical trial presented in Section 2, were QoL values are measured up to disease progression and constitute the internal covariate of interest. Dropout may occur on some subjects before disease progression. Since our interest is on characterizing the relationship between dropout and the longitudinal QoL covariate, we consider dropout time as being censored when disease progression occurs first. So, it is important to note that patients can leave the QoL (for various reasons) study and, in the same time, continue to participate to the clinical study. This is a slightly different definition of the classical dropout as a person leaving definitely the clinical study. In Section 6.1, we define notation and model and in Section 6.2, we derive the joint likelihood for the dropout time and the internal covariate Z. A Markov AR(1) model is assumed for the marginal distribution of this covariate. In Section 6.3, we show how the EM-algorithm (Dempster et al., 1977) can be applied to estimate the parameters in the proposed model. 
In Section 6.4, we apply our model to actual data and compare the results to those obtained using the model of (Diggle and Kenward, 1994).
6.1. Model and notations
Let Z denote an internal covariate and Z_i(t) the value of Z at time t for the ith individual under study (i = 1, . . . , n). Repeated measurements of Z are taken on each subject at common fixed times t_j, j = 1, 2, . . . (t_0 = 0). In the following, it will be convenient to write Z_ij to denote the response value for the ith subject on [t_j, t_{j+1}[. If l ∈ [t_j, t_{j+1}[, let [l] = j − 1. With this notation, Z_{i[l]} denotes the response value on [t_{j−1}, t_j[, recorded at time t_j on the ith subject (i.e., the last recorded value of Z on the ith subject before time l). Let Z_{i(l)} = (Z_{i[l]}, Z_i(l))^T, i = 1, . . . , n. Let T_i denote the time of dropout for the ith individual (i = 1, . . . , n). If we denote by C_i the potential censoring time for the ith individual, then we actually observe S_i = min(T_i, C_i) and the corresponding censoring indicator δ_i = 1{T_i ≤ C_i}, where 1{A} is the indicator of an event A. The value of Z at s_i, denoted z_i(s_i), is not observed. The time-of-dropout model assumes that the hazard of dropout is related to the internal covariate Z through a time-dependent Cox model (Cox, 1972). More precisely, it assumes that the hazard of dropout at time t depends on the covariate history up to t through the last observed value before t, Z_{i[t]}, and through the current unobserved value Z_i(t). It is defined by:

\[
\lambda\bigl(t \mid \tilde{Z}_i(t)\bigr) = \lambda_0(t)\, \exp\bigl(\beta^{T} Z_{i(t)}\bigr)
= \lambda_0(t)\, \exp\bigl(\beta_0 Z_{i[t]} + \beta_1 Z_i(t)\bigr), \tag{9}
\]
where λ_0(·) is an unspecified baseline hazard function and β = (β_0, β_1)^T is a vector of unknown regression parameters. Some observed (possibly time-dependent) covariates X may also be included in the model, by defining the hazard as λ_0(t) exp(β^T Z_{i(t)} + γ^T X_i(t)). However, for simplicity, we do not include X in what follows. For ease of exposition, we shall develop the model by assuming that the hazard of dropout at time t is a function of the observed longitudinal process before t only through the last observed value. This can be generalized to a more complex functional of the observed history of Z. Let f_θ(z_{i0}, . . . , z_{ij}) denote the joint distribution of the responses Z_{i0}, . . . , Z_{ij}.

6.2. Likelihood function
The observed data for each subject i are y_i = (s_i, δ_i, z_{i0}, . . . , z_{i[s_i]}). We assume that, conditional on (z_0, . . . , z(s)), T and C, the censoring time, are independent. This corresponds to the usual assumption of independent censoring in survival studies. We also assume that the distribution of C depends neither on θ, β, λ_0 nor on z(s). This last hypothesis can be viewed as an analogue of the noninformative censoring hypothesis stated by Nielsen et al. (1992) in the context of frailty models. Here it is motivated by the fact that censoring arises when disease progression occurs first. In this case, a subject is removed from the longitudinal study by doctors, who base their choice on clinical criteria which are independent of the current value of Z. In other situations, however, C might depend on z(s). We obtain the likelihood for the observed data by first writing the likelihood for (y_i, z_i(s_i)) as the likelihood for (s_i, δ_i) given z_{i0}, . . . , z_{i[s_i]}, z_i(s_i), times the marginal density of (Z_0, . . . , Z_{[s]}, Z(s)). We then integrate over the variable Z(s). This conditioning naturally arises from the definition (8) of the hazard function with internal covariates (Kalbfleisch and Prentice, 1980). The contribution from a single observation y_i to a partial likelihood, obtained by discarding terms adhering to censoring, can then be written:

\[
L_i\bigl(\beta, \theta, \lambda_0(\cdot)\bigr) = \int_{\mathbb{R}}
\lambda_0^{\delta_i}(s_i)\, \exp\Bigl[\delta_i\, \beta^{T} z_{i(s_i)}
- \int_0^{s_i} \lambda_0(u)\, e^{\beta^{T} z_{i(u)}}\, du\Bigr]
f_\theta\bigl(z_{i0}, \dots, z_i(s_i)\bigr)\, dz_i(s_i). \tag{10}
\]
However, the maximum of this likelihood does not exist for λ_0 ranging over the space of positive functions on [0, +∞). We modify the likelihood by constraining the cumulative baseline hazard ∫_0^t λ_0(u) du to be a step function taking positive jumps λ_1, . . . , λ_p at the distinct observed dropout times u_1 < u_2 < · · · < u_p. In the special case of no missing covariate values, this approach results in the usual maximum partial likelihood estimator of β and the Breslow estimator of ∫_0^t λ_0(u) du. Similar approaches have been proposed in the context of frailty models (Nielsen et al., 1992), for Cox regression with measurement error in the covariate (Wulfsohn and Tsiatis, 1997) and for Cox regression with missing covariates (Martinussen, 1999). The resulting likelihood is usually called a nonparametric likelihood and maximum likelihood estimation yields the so-called "nonparametric maximum likelihood estimators" (although the only part that is really nonparametric is the representation of the baseline hazard). In our case, a proof of consistency of the nonparametric maximum likelihood estimators by Dupuy et al. (2001) supports the asymptotic validity of the proposed method. Asymptotic normality of the estimator of the parameters and consistency of the estimate of their asymptotic variance are shown in Dupuy et al. (2001); the proof relies on techniques based on empirical process theory, used by Murphy (1995) to establish the asymptotic theory for the frailty model. Comparing the treatment group distributions can therefore be done by a straightforward asymptotic test. Letting ψ = (β, λ_1, . . . , λ_p, θ) be the vector of all parameters of the joint model and using Eq. (10), a nonparametric likelihood function is obtained by multiplying the following contributions over subjects:

\[
L_i(\psi) = \int_{\mathbb{R}} \prod_{l=1}^{p} \lambda_l^{\delta_i 1\{u_l = s_i\}}
\exp\Bigl[\delta_i\, \beta^{T} z_{i(s_i)} - \sum_{l=1}^{p} \lambda_l\, e^{\beta^{T} z_{i(u_l)}}\, 1\{u_l \le s_i\}\Bigr]
f_\theta\bigl(z_{i0}, \dots, z_i(s_i)\bigr)\, dz_i(s_i). \tag{11}
\]
Without any missing information, a full likelihood for β, λ1 , . . . , λp , θ can be factorized into one component for the dropout process, involving β and λ1 , . . . , λp , and another component for the longitudinal measurement process, involving θ . Estimation of (β, λ1 , . . . , λp ) and θ would then be achieved by separate maximization of these two components. We show later that in our case, a similar partition of the likelihood function can be achieved through the use of the EM-algorithm (Dempster et al., 1977).
We finally introduce a particular form for the longitudinal variable, assuming a first-order Markov model. In a first-order Markov model, f_θ(z_{ij} | z_{i,j−1}, . . . , z_{i0}) = f_θ(z_{ij} | z_{i,j−1}). It follows that the joint density of the responses Z_{i0}, . . . , Z_{ik} from subject i can be written as:

\[
f_\theta(z_{i0}, \dots, z_{ik}) = f_\theta(z_{i0}) \prod_{j=1}^{k} f_\theta(z_{ij} \mid z_{i,j-1}).
\]
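A small sketch of the assumed Gaussian first-order Markov (AR(1)) model may help fix ideas: it simulates one trajectory and evaluates the corresponding conditional log-likelihood. The function names and the way the baseline value is conditioned on are illustrative only.

```python
# A minimal sketch of the longitudinal model Z_ij | Z_i,j-1 ~ N(alpha*Z_i,j-1, sigma^2).
import numpy as np

def simulate_trajectory(z0, n_steps, alpha, sigma, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    z = [z0]
    for _ in range(n_steps):
        z.append(alpha * z[-1] + sigma * rng.standard_normal())
    return np.array(z)

def ar1_loglik(z, alpha, sigma):
    """Log-likelihood of one observed sequence, conditioning on the first value
    (the baseline enters only as an explanatory value, as in the text)."""
    z = np.asarray(z, dtype=float)
    resid = z[1:] - alpha * z[:-1]
    n = resid.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2

print(ar1_loglik(simulate_trajectory(4.0, 10, alpha=0.95, sigma=0.75), 0.95, 0.75))
```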
The distribution of Z_ij is conditioned by the previous response z_{i,j−1}, considered as an explanatory variable once it has been observed. Since such an explanatory variable is not available for the measurement obtained at time t_1, we drop that response from the model, except as a fixed value upon which Z_{i1} is conditioned. We further assume that f_θ(z_{ij} | z_{i,j−1}) is a Gaussian probability density function with mean αz_{i,j−1} and variance σ² (here θ = (α, σ²)).

6.3. Parameter estimation using the EM-algorithm

6.3.1. The EM-algorithm
Inference for this joint model is conducted using the EM-algorithm (Dempster et al., 1977). EM iterates between an E-step, where the expected log-likelihood of the complete data conditional on the observed data and the current estimate of the parameters is computed, and an M-step, where parameter estimates are updated by maximizing this expected log-likelihood. We consider (y_i, z_i(s_i)) as the set of complete data for the ith subject. At the (m + 1)th iteration of the EM-algorithm, the E-step consists of calculating the conditional expectation E(log L_c(ψ) | y_i, i = 1, . . . , n; ψ^{(m)}) of the complete-data log-likelihood given the observed data y_i, for the fixed set of parameter estimates ψ^{(m)}. To simplify notation, we denote E(· | y_i, i = 1, . . . , n; ψ^{(m)}) by E_{ψ^{(m)}}(·). Letting Y_i(u) = 1{t_{[s_i]+1} ≤ u ≤ s_i} and W_i(u) = 1{u < t_{[s_i]+1}}, the expected complete-data log-likelihood can be written as

\[
\begin{aligned}
E_{\psi^{(m)}}\bigl(\log L_c(\psi)\bigr) = \sum_{i=1}^{n} \Bigl[\;
&\sum_{l=1}^{p} \delta_i\, 1\{u_l = s_i\}\, \log \lambda_l
- \sum_{l=1}^{p} \lambda_l\, e^{\beta^{T} z_{i(u_l)}}\, W_i(u_l)
+ \log f_\theta(z_{i0}, \dots, z_{i[s_i]}) \\
&+ \delta_i\, E_{\psi^{(m)}}\bigl(\beta^{T} Z_{i(s_i)}\bigr)
- \sum_{l=1}^{p} \lambda_l\, E_{\psi^{(m)}}\bigl(e^{\beta^{T} Z_{i(u_l)}}\, Y_i(u_l)\bigr)
+ E_{\psi^{(m)}}\bigl(\log f_\theta\bigl(Z_i(s_i) \mid z_{i[s_i]}\bigr)\bigr)\Bigr].
\end{aligned}
\]
It can be seen that Eψ (m) (log Lc (ψ)) can be separated in two components, one involving β and λ1 , . . . , λp and another involving θ .
In the M-step, we solve ∂E_{ψ^{(m)}}(log L_c(ψ))/∂ψ̄ = 0 (where ψ̄ is the vector (β_0, β_1, λ_1, . . . , λ_p, α, σ²)^T), which results in updated estimates (β^{(m+1)}, λ_1^{(m+1)}, . . . , λ_p^{(m+1)}, θ^{(m+1)}). The M-step proceeds as follows:

(a) We first maximize E_{ψ^{(m)}}(log L_c(ψ)) with respect to α and σ² to obtain the updated estimates α^{(m+1)} and σ²^{(m+1)}.
(b) No closed-form solution exists for updating β. However, the updated estimate may be approximated using the Newton–Raphson algorithm. At the (m + 1)th iteration of the EM-algorithm, the (d + 1)th iteration of the Newton–Raphson algorithm is
\[
\beta^{(d+1)} = \beta^{(d)} + \bigl[I^{(m)}(\beta^{(d)})\bigr]^{-1} U^{(m)}(\beta^{(d)}),
\]
where U^{(m)}(β) and I^{(m)}(β) are respectively the score and information for β.
(c) We finally update λ_l^{(m+1)} (l = 1, . . . , p).

The EM-algorithm requires computation of conditional expectations of the form E_{ψ^{(m)}}(φ(Z_i(s_i))). Because Z is continuous, numerical integration is required. Since Z is normally distributed, we evaluate these expectations using Gauss–Hermite quadrature (Crouch and Spiegelman, 1990). Closed-form formulas for α^{(m+1)}, σ²^{(m+1)}, λ_l^{(m+1)}, and expressions for U^{(m)}(β^{(d)}), I^{(m)}(β^{(d)}) and the evaluation of E_{ψ^{(m)}}(φ(Z_i(s_i))) follow in the next section.

6.3.2. EM-formulas
Updating formulas for α^{(m+1)}, σ²^{(m+1)} and λ_l^{(m+1)}. At the M-step of the (m + 1)th iteration of the EM-algorithm, the estimates for α, σ² and λ_l (l = 1, . . . , p) are updated using respectively:

\[
\alpha^{(m+1)} =
\frac{\sum_{i=1}^{n}\bigl[\sum_{j=1}^{[s_i]} z_{ij}\, z_{i,j-1} + E_{\psi^{(m)}}\bigl(z_{i[s_i]}\, Z_i(s_i)\bigr)\bigr]}
{\sum_{i=1}^{n} \sum_{j=0}^{[s_i]} z_{ij}^{2}},
\]
\[
\sigma^{2\,(m+1)} =
\frac{\sum_{i=1}^{n}\bigl[\sum_{j=1}^{[s_i]}\bigl(z_{ij} - \alpha^{(m+1)} z_{i,j-1}\bigr)^{2}
+ E_{\psi^{(m)}}\bigl(Z_i(s_i) - \alpha^{(m+1)} z_{i[s_i]}\bigr)^{2}\bigr]}
{n + \sum_{i=1}^{n} [s_i]},
\]
and
\[
\lambda_l^{(m+1)} =
\frac{\sum_{i=1}^{n} \delta_i\, 1\{u_l = s_i\}}
{\sum_{i=1}^{n}\bigl[E_{\psi^{(m)}}\bigl(e^{\beta^{T(m+1)} Z_{i(u_l)}}\bigr)\, Y_i(u_l)
+ e^{\beta^{T(m+1)} z_{i(u_l)}}\, W_i(u_l)\bigr]}.
\]
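The conditional expectations appearing in these updates (for example \(E_{\psi^{(m)}}(e^{\beta^{T} Z_{i(u_l)}})\)) are one-dimensional Gaussian integrals, evaluated by Gauss–Hermite quadrature as mentioned above. A generic sketch of such an evaluation follows; the optional weight argument plays the role of the survival/dropout factor appearing in the conditional density, and the function name is illustrative.

```python
# A minimal sketch: E[phi(Z)*weight(Z)] / E[weight(Z)] for Z ~ N(mu, sigma2),
# using the substitution z = sqrt(2*sigma2)*w + mu so that the Gauss-Hermite
# weight function exp(-w^2) appears; with weight=None this is simply E[phi(Z)].
import numpy as np

def gauss_hermite_expectation(phi, mu, sigma2, n_points=20, weight=None):
    nodes, omegas = np.polynomial.hermite.hermgauss(n_points)
    z = np.sqrt(2.0 * sigma2) * nodes + mu
    w = np.ones_like(z) if weight is None else np.array([weight(v) for v in z])
    num = np.sum(omegas * w * np.array([phi(v) for v in z]))
    den = np.sum(omegas * w)
    return num / den          # the ratio form normalizes the 1/sqrt(pi) factor

# Example check: E[exp(Z)] for Z ~ N(0, 1) equals exp(0.5) ~ 1.6487
print(gauss_hermite_expectation(np.exp, 0.0, 1.0))
```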
Score and information for β. It is convenient to introduce the following notation:

\[
S_m^{(r)}(\beta, s) = \sum_{j=1}^{n}
\Bigl[E_{\psi^{(m)}}\bigl(Z_{j(s)}^{\otimes r}\, e^{\beta^{T} Z_{j(s)}}\bigr)\, Y_j(s)
+ z_{j(s)}^{\otimes r}\, e^{\beta^{T} z_{j(s)}}\, W_j(s)\Bigr]
\]

for r = 0, 1, 2, where for any column vector a, a^{⊗0} = 1, a^{⊗1} = a and a^{⊗2} = aa^T.
Define E_m(β, s) = S_m^{(1)}(β, s)/S_m^{(0)}(β, s). The score and information for β are respectively:

\[
U^{(m)}(\beta) = \sum_{i=1}^{n} \delta_i \bigl[E_{\psi^{(m)}}\bigl(Z_{i(s_i)}\bigr) - E_m(\beta, s_i)\bigr]
\]
and
\[
I^{(m)}(\beta) = \sum_{i=1}^{n} \delta_i \bigl[S_m^{(2)}(\beta, s_i)/S_m^{(0)}(\beta, s_i) - E_m^{\otimes 2}(\beta, s_i)\bigr].
\]
Computation of conditional expectations. The conditional expectations to be evaluated in the E-step of the EM-algorithm have the form E_ψ(φ(Z_i(s_i))). In particular, the choices Z_i(s_i), Z_i²(s_i), e^{Z_i(s_i)}, Z_i(s_i)e^{Z_i(s_i)} and Z_i²(s_i)e^{Z_i(s_i)} for φ(Z_i(s_i)) are of interest. These expectations are taken under the conditional distribution of Z_i(s_i) given the observed data (s_i, δ_i, z_{i0}, . . . , z_{i[s_i]}) and ψ, which is:

\[
\frac{f_{\lambda,\beta}\bigl(s_i, \delta_i \mid z_{i0}, \dots, z_i(s_i)\bigr)\, f_\theta\bigl(z_i(s_i) \mid z_{i[s_i]}\bigr)}
{\int_{\mathbb{R}} f_{\lambda,\beta}\bigl(s_i, \delta_i \mid z_{i0}, \dots, z_i(s_i)\bigr)\, f_\theta\bigl(z_i(s_i) \mid z_{i[s_i]}\bigr)\, dz_i(s_i)}.
\]

Taken with respect to this density, the E_ψ(φ(Z_i(s_i))) are then equal to:

\[
\frac{\int_{\mathbb{R}} \varphi\bigl(z_i(s_i)\bigr)\, f_{\lambda,\beta}\bigl(s_i, \delta_i \mid z_{i0}, \dots, z_i(s_i)\bigr)\, f_\theta\bigl(z_i(s_i) \mid z_{i[s_i]}\bigr)\, dz_i(s_i)}
{\int_{\mathbb{R}} f_{\lambda,\beta}\bigl(s_i, \delta_i \mid z_{i0}, \dots, z_i(s_i)\bigr)\, f_\theta\bigl(z_i(s_i) \mid z_{i[s_i]}\bigr)\, dz_i(s_i)}. \tag{12}
\]

Letting W_i = (2σ²)^{-1/2}(Z_i(s_i) − αz_{i[s_i]}), \(\tilde{Z}_i(s_i) = (2σ²)^{1/2} W_i + αz_{i[s_i]}\) and \(\tilde{Z}_{i(s_i)} = (Z_{i[s_i]}, \tilde{Z}_i(s_i))^T\), Eq. (12) can be re-written after simplification as:

\[
\frac{\int_{-\infty}^{+\infty} \varphi\bigl(\tilde{z}_i(s_i)\bigr)
\exp\bigl[\delta_i \beta_1 \tilde{z}_i(s_i) - \sum_{l=1}^{p} \lambda_l\, e^{\beta^{T} \tilde{z}_{i(s_i)}}\, Y_i(u_l) - w_i^{2}\bigr]\, dw_i}
{\int_{-\infty}^{+\infty}
\exp\bigl[\delta_i \beta_1 \tilde{z}_i(s_i) - \sum_{l=1}^{p} \lambda_l\, e^{\beta^{T} \tilde{z}_{i(s_i)}}\, Y_i(u_l) - w_i^{2}\bigr]\, dw_i}.
\]

We then use an N-point Gauss–Hermite quadrature to compute E_{ψ^{(m)}}(φ(Z_i(s_i))) using the following formula:

\[
\frac{\sum_{j=1}^{N} \varphi\bigl(\tilde{z}_i(s_i)\bigr)
\exp\bigl[\delta_i \beta_1^{(m)} \tilde{z}_i(s_i) - \sum_{l=1}^{p} \lambda_l^{(m)} e^{\beta^{T(m)} \tilde{z}_{i(s_i)}}\, Y_i(u_l)\bigr]\, \omega_j}
{\sum_{j=1}^{N}
\exp\bigl[\delta_i \beta_1^{(m)} \tilde{z}_i(s_i) - \sum_{l=1}^{p} \lambda_l^{(m)} e^{\beta^{T(m)} \tilde{z}_{i(s_i)}}\, Y_i(u_l)\bigr]\, \omega_j},
\]

where w_i takes on the N abscissa values a_j (j = 1, . . . , N) and ω_j are the corresponding weights.

6.3.3. Estimation of standard errors
Under the assumption that the cumulative baseline hazard is a step function with jumps at the distinct dropout times, we derived maximum likelihood estimators of the parameters. A convenient consequence of this approach is that we can use likelihood-based methods to estimate the asymptotic variance of the estimators. The asymptotic variance of the maximum likelihood estimators is usually estimated by the inverse of the observed Fisher information matrix. In the presence of missing data, we may obtain it using
formulas given by Louis (1982). However, in the case of nonparametric likelihood estimation, the information matrix for the observed data is obtained using the expression −Σ_{i=1}^{n} ∂² log L_i(ψ)/∂ψ̄ ∂ψ̄^T, evaluated at the maximum likelihood estimate ψ̂_n of the parameters. Estimation of the asymptotic variances requires inversion of a high-dimensional matrix. Martinussen (1999) nevertheless uses this approach in his work on Cox regression with incomplete covariate measurements. An alternative method consists of estimating the asymptotic variance by inverting the negative of the second-order derivatives of a profile log-likelihood. We explain this approach using the asymptotic variance of α̂_n as an example. The profile log-likelihood for α is given by log L_P(α) = log(∏_{i=1}^{n} L_i(ψ̂_α, α)), where ψ̂_α = (β̂, λ̂_1, . . . , λ̂_p, σ̂²) maximizes the log-likelihood log(∏_{i=1}^{n} L_i(ψ)) for fixed α. Noting that

\[
\Bigl[\frac{\partial^{2} \log L_P(\hat{\alpha}_n)}{\partial \alpha^{2}}\Bigr]^{-1}
= \Bigl[\Bigl(\sum_{i=1}^{n} \frac{\partial^{2} \log L_i(\hat{\psi}_n)}{\partial \bar{\psi}\, \partial \bar{\psi}^{T}}\Bigr)^{-1}\Bigr]^{\alpha\alpha},
\]

where [ · ]^{αα} denotes the sub-matrix of [ · ] associated with the parameter α (Gourieroux and Monfort, 1996), we can estimate the asymptotic variance of α̂_n by [−∂² log L_P(α̂_n)/∂α²]^{−1}. Since analytical differentiation of log L_P may be cumbersome, we use numerical differentiation instead. In particular, we use the following central-difference approximation (Nocedal and Wright, 1999) of ∂² log L_P(α̂_n)/∂α²:

\[
\frac{\log L_P(\hat{\alpha}_n - \varepsilon) - 2 \log L_P(\hat{\alpha}_n) + \log L_P(\hat{\alpha}_n + \varepsilon)}{\varepsilon^{2}}
\]

(ε is an arbitrarily small perturbation). For a given value of α, ψ̂_α is calculated using the same EM-algorithm as described in Section 6.3.1, applied while keeping the value of α fixed. Similar procedures give estimates of the asymptotic variances for the maximum likelihood estimators of the other parameters.

6.4. Numerical comparison with the DK method
As an illustration of the proposed joint model, we now analyze data from the cancer clinical trial described above. We also explain how this model can be used in longitudinal studies with dropout to distinguish between random and non-ignorable dropouts. We then compare our results to those obtained by fitting the model of Diggle and Kenward (1994) (DK in the following) to the same data (Study V302, Arm B). We applied the analysis to a set of 256 patients, who were included in the second study V302 (see Section 2.2). We retain in the analysis only patients who completed at least one evaluable questionnaire at cycle 1 and at baseline. The number of scores per patient ranges between 1 and 13. A description of the data in treatment group Arm B is provided by Table 4, which may be read as follows. Among the 29 patients who reported 2 QoL measurements, 27 left the longitudinal study following dropout. Dropout times were censored for 2 patients who left the longitudinal study following disease progression. Disease progression occurred in 31 of 120 patients. The remaining subjects dropped out of the study.
Table 4
Numbers of scores and dropouts among the 120 patients included in the analysis

Number of completed QoL assessments   Number of patients   Number of dropouts   Number of censoring events
1                                      13                   10                   3
2                                      29                   27                   2
3                                      25                   19                   6
4                                      21                   17                   4
5                                      15                   10                   5
6                                      10                    3                   7
7                                       2                    1                   1
8                                       3                    1                   2
9                                       1                    0                   1
13                                      1                    1                   0
Due to the study design, the QoL score at dropout or disease progression was unobserved. A simple approach to the problem of fitting a Cox model to these data would be to set the value of Z at the dropout time equal to the last observed value of Z (i.e., to assume that λ(t | \(\tilde{Z}_i(t)\)) = λ_0(t) exp(β_1 Z_i(t)) with Z_i(t) = Z_{i[t]}). This approach may be viewed as a random dropout analysis, when considered in the context of longitudinal studies with dropout. Gender and a measure of gravity of illness at baseline were also recorded for each patient. None of these covariables had a significant effect when included in the model, hence we do not include them in the following analysis. We then fit the joint model described in Section 6.1. Computations were carried out with the Sciprog program (Sciprog program, 2001), written in the programming language Scilab (Scilab Group, 1998). The EM-algorithm was stopped and considered to have converged when the increment in the log-likelihood was less than 10^{-4}. Starting values for the EM-algorithm were given by the parameter estimates resulting from the random dropout analysis. The number of iterations needed to reach convergence was about 20. Table 5 displays the EM-algorithm estimates together with standard errors, as well as the log-likelihood evaluated at the maximum likelihood estimate ψ̂_n of ψ. The estimated hazard function resulting from the first analysis is λ̂_0(t) exp(−0.167Z(t)). The negative value of the regression parameter implies that individuals with low levels of QoL are more likely to drop out. This is natural, as high values of Z indicate a better overall QoL: we would expect that a patient feeling well is more likely to complete the QoL questionnaire. The estimated hazard resulting from the joint modelling approach is λ̂_0(t) exp(0.089Z_{[t]} − 0.316Z(t)). As suggested by various authors (Diggle and Kenward, 1994; Verbeke and Molenberghs, 2000), some insight into this model can be obtained by rewriting the hazard rate as a function of the increment Z(t) − Z_{[t]} and the level Z_{[t]} of the outcome variable, as

\[
\hat{\lambda}_0(t)\, \exp\bigl[-0.316\bigl(Z(t) - Z_{[t]}\bigr) - 0.227\, Z_{[t]}\bigr]. \tag{13}
\]
Table 5
Parameter estimates and group comparison tests under both dropout process schemes (Random: random dropout analysis; NI: non-ignorable dropout, joint model)

                    Arm A                  Arm B                  Test statistic
                    Random      NI         Random      NI         Random      NI
β̂0                 −0.164      0.128      −0.167      0.089      0.03277     0.34679
SE(β̂0)             0.078       0.081      0.078       0.080      –           –
β̂1                 –           −0.362     –           −0.316     –           −0.37464
SE(β̂1)             –           0.086      –           0.087      –           –
α̂                  0.959       0.952      0.955       0.948      0.35086     0.32268
SE(α̂)              0.008       0.009      0.009       0.010      –           –
σ̂e²                0.704       0.710      0.571       0.576      2.19565     2.18126
SE(σ̂e²)            0.046       0.047      0.039       0.0401     –           –
log-likelihood      −963.928    −896.4     −927.2      −857.2     –           –
The value −0.227 confirms that low levels of QoL are associated with higher rates of dropout. Moreover, the joint model we propose suggests that the hazard of dropout is associated with a change in the level of the variable Z. More precisely, dropout increases with a decreasing evolution of the QoL outcome. Finally, the conditional density of Z_ij (given z_{i,j−1}) is Gaussian with an estimated mean of 0.947z_{i,j−1} (suggesting an overall decrease in QoL over time) and variance of 0.576.

The suggested joint model may be useful for distinguishing between random and non-ignorable dropout. For example, for our data, one may suspect dropout to be non-ignorable since QoL values were measured using a questionnaire that was filled out by each patient. Hence it is likely that a patient feeling poorly will not complete the questionnaire, implying that dropout is non-ignorable. Although a more general test of non-ignorable versus random dropout is not yet available, examination of Table 5 (where, for instance, in treatment group Arm B, β̂_{1n} = −0.316 with a standard error of 0.087) may confirm this opinion.

The suggested joint model accommodates censoring and continuous dropout times. In the following, we will refer to these two features of our data as the JMAC (joint model's application conditions). The DK model considers less general situations in that it does not accommodate the JMAC. However, it may be fitted to our data by relaxing these conditions in the following way. In line with Diggle and Kenward (1994), we assumed that the probability of dropout at time t_j (j ≥ 2), given that the subject was still under study at time t_{j−1}, follows a logistic regression model: logit Pr[T_i = t_j | T_i ≥ t_j, z_{i0}, . . .] = β_0 z_{i,j−1} + β_1 z_{ij}. When this model was applied to our data, T_i denotes the time of the first missing measurement following dropout and z_{ij} denotes the unobserved measurement of Z at t_j. Subjects who reached disease progression were also treated as non-ignorable dropouts, since the value of Z at the occurrence of this event was unobserved. We fit the DK model to our data using the PCMID function of the S-plus OSWALD suite (Smith, 1997). This led to the following fitted model:
\[
\operatorname{logit} \Pr[T_i = t_j \mid T_i \ge t_j, z_{i0}, \dots]
= -0.036\, z_{i,j-1} - 0.170\, z_{ij}
= -0.170\,(z_{ij} - z_{i,j-1}) - 0.206\, z_{i,j-1}. \tag{14}
\]
From this model, we also conclude that subjects with lower levels of QoL and subjects whose QoL decreased were more likely to dropout. We wish to compare our results to the ones obtained by the DK model. In particular, we wish to evaluate the effect of ignoring the JMAC when treating our data with the DK method. We suggest that the following approach may give some clues to answer this question. It may appear uncomfortable to directly compare estimates of the parameters in the fitted model (13) and (14), since these models rely on different modelling choices. Hence we suggest comparing instead, estimated relative risks of dropout between two individuals, using the fitted models. Let us consider two individuals sharing the same value z of Z at time tj . One individual has constant value z over time. Z varies by a quantity denoted by incr for the second individual. Using (13), the relative risk of dropout between these individuals is exp(−0.316 incr). From Eq. (14), it is equal to exp(−0.170 incr)(1 + exp(−0.206 z))/(1 + exp(−0.170 incr− 0.206 z)). Estimated relative risks from these two models are represented as functions of incr, by the solid and dotted lines respectively on Figure 8. The DK model was fit under the conditions of discrete dropout times and no censoring. We also fit the suggested joint model under these conditions, which gave exp(−0.201 incr) as the relative risk. Its representation on Figure 8 lies close to the one obtained from the DK model. It appears that both the DK model and our model underestimate the relative risk of dropout when incr becomes negative. Both models perform equally well when incr is positive.
Fig. 8. Relative risk of dropout vs incr for the joint model (JM) fitted under JMAC and by relaxing each of the JMAC in turn, and for the DK model.
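The two relative-risk curves plotted in Figure 8 can be reproduced directly from the coefficients quoted in Eqs. (13) and (14); in the following sketch, the shared level z of the two hypothetical individuals is an arbitrary illustrative value.

```python
# A short sketch of the relative-risk comparison of Figure 8.
import numpy as np

def rr_joint_model(incr):
    """Relative risk of dropout implied by the joint model, Eq. (13)."""
    return np.exp(-0.316 * incr)

def rr_dk_model(incr, z):
    """Relative risk of dropout implied by the DK logistic model, Eq. (14)."""
    return (np.exp(-0.170 * incr) * (1.0 + np.exp(-0.206 * z))
            / (1.0 + np.exp(-0.170 * incr - 0.206 * z)))

incr = np.linspace(-3.0, 3.0, 7)       # change in QoL between the two individuals
print(rr_joint_model(incr))
print(rr_dk_model(incr, z=4.0))        # z = 4.0 is an arbitrary shared level
```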
It may then be interesting to investigate the effect of ignoring each of the JMAC in turn. Hence, we fit the suggested joint model to our data, first ignoring censoring, secondly ignoring the continuous nature of the dropout process. The estimated relative risks were respectively exp(−0.284 incr) and exp(−0.224 incr). We compare in Figure 8 these results to those obtained by fitting the joint model using the full information given by the data. The results obtained by ignoring each of the JMAC in turn are similar to each other. Again, they underestimate the relative risk of dropout for negative values of incr. When the continuous nature of the dropout is ignored, underestimation may occur because a dropout is set to occur at t_j when it actually lies between t_{j−1} and t_j. The high prognostic value of a decrease in QoL for the occurrence of a dropout is then moderated by the increase in the duration of follow-up induced by this convention. Similarly, considering censoring events, which are not preceded by a decline in the QoL level, as dropouts attenuates the impact of a decrease in QoL on dropout.

6.5. Comparison of QoL between treatment groups
In Arm A, 115 subjects out of 127 were analyzed, with 1 to 10 measurements. The dimension of the Sciprog matrix X (Sciprog program, 2001) is 115 × 11, the last column being a column of 0. 74 subjects had an informative dropout; 41 were censored. In Arm B, 120 subjects out of 129 were analyzed, with 1 to 13 measurements. The dimension of the Sciprog matrix X (Sciprog program, 2001) is 120 × 14, the last column being, again, a column of 0. 89 subjects had an informative dropout; 31 were censored. In Table 5, a simple asymptotic test is derived, using the fact that the parameter estimates obtained are asymptotically normal and that the estimates of their variances are consistent (Dupuy et al., 2001). The statistic of this asymptotic test can be written, for any parameter Θ, as:

\[
\frac{\widehat{\Theta}_A - \widehat{\Theta}_B}{\sqrt{\hat{\sigma}^{2}_{\widehat{\Theta}_A} + \hat{\sigma}^{2}_{\widehat{\Theta}_B}}}.
\]

It follows, using results from Dupuy et al. (2001), that this statistic is asymptotically distributed as a standard normal variable. Testing the global significance of a parameter, without focusing on group comparison, can be done easily in the same way. The numerical results indicate that the only significant difference between the two treatment groups is for the variance parameter of the QoL process, under both dropout schemes. These results are coherent with those obtained by other methods.
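This test statistic is immediate to compute; the following sketch, applied to the β̂1 estimates of Table 5 under the non-ignorable scheme, approximately reproduces the value −0.375 reported in the table.

```python
# A direct transcription of the asymptotic two-group test used in Table 5.
from math import sqrt
from scipy.stats import norm

def group_comparison(theta_a, se_a, theta_b, se_b):
    """Return the z statistic (theta_A - theta_B)/sqrt(se_A^2 + se_B^2)
    and its two-sided p-value under a standard normal reference."""
    z = (theta_a - theta_b) / sqrt(se_a**2 + se_b**2)
    return z, 2.0 * norm.sf(abs(z))

# Example with the beta_1 estimates (non-ignorable scheme) of Table 5:
print(group_comparison(-0.362, 0.086, -0.316, 0.087))
```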
7. Discussion

7.1. Specificity and interest of the time to QoL deterioration approach
Obviously, when there is a survival difference between the treatment groups, the classical linear repeated analysis of variance cannot be considered adequate for comparing QoL.
It is indeed biased in favor of the group with the poorer survival. This can be generalized to data missing not at random, as was the case in this study. Another criticism is that the analysis of variance does not take deaths into account. On the contrary, time-to-event analyses explicitly account for deaths. Also, the time-to-event analyses avoid the need for time intervals, which can be difficult to define in some cases. The absence of a treatment effect in the analysis of variance presented in this paper contrasts with the results of the time to QoL deterioration analysis. This should be related to the fact that in the linear mixed model analysis, missing data are ignored, even when they are related to a bad QoL status. The results of the deterioration-free survival analyses performed on the V301 and V302 studies can be summarized as follows. In V301, all log-rank tests comparing the irinotecan group to the no-irinotecan group were significant with p < 0.001. In V302, the irinotecan group showed curves always above those of the no-irinotecan group, but none of the comparisons was significant. For these three studies, the results of the QoL analyses seem quite consistent with their respective clinical results.

Mixed models for longitudinal repeated measures are currently very popular. Nevertheless, the occurrence of death, and its relationship with missing data, need to be taken into account, and this makes the analysis more complicated than a classical longitudinal analysis. Longitudinal data with non-random missingness, even without the occurrence of death, remain an issue. The Heutte–Huber and Dupuy–Mesbah methods, presented in Sections 5 and 6 (and elsewhere), deal with such problems (Billot and Mesbah, 1999; Heutte and Huber-Carol, 2002; Dupuy and Mesbah, 2002).

The endpoint "time to QoL deterioration" takes the missing data into account in a specific manner. It appears more sensitive to treatment effects in the presence of missing data. As it has been defined, this endpoint has the advantage that missing data occurring before or after deterioration do not impact it. Indeed, if deterioration was observed after a missing value, it was assumed that the first deterioration had already occurred at the time the missing value was expected. This could be related to a last observation carried forward (LOCF) imputation of the missing data, coupled with a procedure carrying backward the first observation after deterioration. Thus, it corresponds to a sophisticated kind of imputation, which considers that the endpoint is at least deteriorated by the same amount as the observed deterioration. Of course, this holds as long as it can be assumed that a missing observation is very likely to be related to deterioration. In addition, missing data due to death are specifically taken into account in this endpoint. For this analysis, it is important to define reasonably large intervals within which death is taken as an event, to avoid censoring a death closely following a QoL assessment, but without prolonging too much the time to deterioration if no assessment has been performed during a long period. It was more logical to use "time to QoL deterioration" for the analysis of V303 because the patients were followed on treatment only. Because patients were followed for their quality of life until death, the endpoint "QoL deterioration-free survival" was used for V301 and V302.
The endpoint “QoL deterioration-free survival” seems more appropriate when the objective is to show that a treatment prolongs survival without deterioration of the quality of life. “Time to QoL deterioration”, on the other hand, focuses on quality of life per se. Another possibility would be to use the competing risk methodology, with death being considered as a
competing risk for QoL deterioration. However, using this method would imply assuming the same survival in the two treatment groups. When the objective of the study is to show a difference in survival, this methodology is not applicable. However, if the objective of the trial is to show that there is no difference in survival, but that the quality of life is better in one of the groups, this could be an appropriate method.

The definition of QoL deterioration is a weak definition in the sense that if only one measure of QoL shows deterioration at the end of the study, it is considered a deterioration even though it is not confirmed by a further assessment. This should be put in the context of metastatic cancer. Usually, if the patient has a worsening status at the end of the study without any other measure of QoL, it can be assumed that the deterioration is continuous.

The two endpoints based on quality of life are very similar to the two time-to-event analyses usually performed on the time until progressive disease. The time to deterioration of QoL is analogous to the time to progression, and “QoL deterioration-free survival” is analogous to progression-free survival. When the parameter of interest is progression-free survival (time to progression or death, whichever occurs first), it is important to have regular tumor assessments; otherwise the endpoint would be artificially prolonged until death. This remark also holds for “QoL deterioration-free survival”.

Among the possible analyses of quality of life, the time-to-event analyses have good properties. The choice of the relevant analysis depends on the objective of the trial. If the objective is to show a survival advantage without quality of life deterioration, the endpoint “QoL deterioration-free survival” can be used. If the objective is to show an improvement of both survival and quality of life, then the “time to QoL deterioration” parameter might be preferable. When applied to the QL score of the QLQ-C30 questionnaire, this type of analysis has been shown to be sensitive. The analyses were highly significant in the V301 study when comparing the irinotecan treatment to best supportive care. They were also significant or not far from significant when applied to the V303 study. A third possibility would be the time to deterioration with death being a competing event. This endpoint would be appropriate if the objective were to show a similar survival with an improved quality of life.

In that case, another possibility could be to analyze the effect of treatment on quality of life, knowing the effect of treatment on survival. Some authors suggest performing a classical longitudinal analysis adjusted for the survival duration. This simple kind of analysis answers a surrogate question: what is the effect of treatment on quality of life if the treatment groups were equivalent in terms of survival? This type of analysis would not be appropriate if there were differences in survival between the groups. For instance, if one treatment is better than another in terms of survival, then when we adjust for the survival duration, we will disadvantage that better treatment group, because quality of life and the survival duration are highly positively related. A better solution might be to jointly analyze quality of life and survival, taking into account the previously established results about the treatment effect on survival duration.
This is a more sophisticated solution, which requires modelling and estimating the joint distribution of the longitudinal QoL outcome and the survival process. A growing recent literature, described by Dupuy (2002), deals with that issue. In Section 6, following Dupuy and Mesbah (2002), we present such methods in more detail. In Section 5, we show that multi-state semi-Markov models (Heutte and
Huber-Carol, 2002), a natural generalization of the QoL-deterioration model, may also be used, provided that quality-of-life scores can easily be translated into states. All these appealing methods need specific software and careful computer programming. Furthermore, interpretation of results is not straightforward. The Kaplan–Meier methodology applied to the time to QoL deterioration has the main advantage of being easy to implement and to interpret. Clinicians are familiar with Kaplan–Meier curves. All general statistical software nowadays includes survival methods with Kaplan–Meier estimates and the log-rank test. As there does not seem to be a logical choice for a unique deterioration level, we recommend using a range of different levels. The consistency of the results is important for the strength of the conclusion. In fact, it would be justified to use a higher significance level than 0.05 for each of the tests. Further research is aimed at building a global significance test, integrating the results of a set of deterioration levels.

7.2. Specificity and interest of the semi-Markovian modelling

It can be seen from Table 3, which gives the estimated regression parameters, and also from the various survival curves (Figures 4 and 5), that there is a significant difference between the two drugs for transitions from the bad quality-of-life state to death, and only for those transitions. Actually, if we look at the probability of being in the bad quality-of-life state given that the patient was initially in the neutral quality-of-life state, and the probability of being in the good quality-of-life state given that the patient was initially in the bad quality-of-life state, we note that there is no difference between the two drugs. We have assumed a Cox model for each hazard function. We could also model each hazard function with a parametric model or another semi-parametric model. We have introduced some right censoring. One interesting extension of this model could be the introduction of interval censoring: this would allow us to take into account QoL changes not observed between two assessments. This method can describe the differential behavior of patients between two groups. Here we have shown a significant difference for the transitions from the bad QoL state to death.

7.3. Specificity and interest of the joint modelling approach

In this paper, we have proposed a new approach to the problem of Cox regression with an internal time-dependent covariate whose value at the event time is not observed (here the event is dropout). This approach jointly models the longitudinal covariate and hazard processes. We have shown that the likelihood resulting from this model can be maximized using an EM algorithm. We fit the suggested joint model to data from a cancer clinical trial. We suggested that it may be useful to apply model (9) within the context of longitudinal data with non-ignorable dropout. A comparison between our approach and the DK model is provided. The DK model assumes discrete dropout times and does not accommodate censoring. Under these conditions, the two models led to similar results, both appearing to underestimate the risk of dropout for individuals with a decrease in the longitudinal outcome. To compare the two models, we used the hazard ratio and a relative risk
calculated from the DK model. It should be noted that this relative risk is conditional on the fact that individuals have not yet dropped out. One may nevertheless keep in mind that, in general, hazard ratios and relative risks are different quantities that should be compared with care. More work on how to compare results of both models is still needed.

The increasing recognition of the need for models that accommodate missing values in longitudinal studies makes the issue of model checking extremely important. However, goodness-of-fit analysis is rarely performed by the users of such models, despite awareness of the adverse effects of model misspecification on the statistical inference. Some suggestions follow, which may be viewed as preliminary attempts to validate the model we suggest. Whenever possible, one may rely on a double sampling scheme (Mesbah et al., 1992) in which one obtains the QoL value at dropout for a subsample of the studied population and conducts a validation study on this subsample. However, this may also raise difficulties (e.g., the size and choice of the subsample). An alternative approach may rely on goodness-of-fit tests for the model using the entire data set (including unobserved values). However, missing values preclude using such an approach. We might then consider instead assessing the validity of the marginal model (10) obtained by integration over the missing quantities, recognizing however that this does not allow validation of the assumption about the distribution of the missing components given the observed data. Validation of this marginal model is important in order to draw reliable conclusions from the estimates of parameters obtained by maximizing the integrated likelihood (11). For this purpose, joint work involving some of the authors is currently in progress (Gulati et al., 2002).

We may ask whether estimates are robust to misspecification of the model assumptions. Various authors have discussed robustness in selection models, including Little (1995), Hogan and Laird (1997b), Molenberghs et al. (1997), and some of the discussants of the papers by Diggle and Kenward (1994) and Scharfstein et al. (1999). To date, very little work has been done to propose methods that investigate the sensitivity of selection-modelling results with respect to the model assumptions (for a review, see Verbeke and Molenberghs (2000), who also adapt the DK model to a tractable form for sensitivity analysis in their Chapter 19). Hence, further work is needed to propose a methodology to be applied to the model we suggest. Some work has already been done for pattern-mixture models (a recent overview and treatment are given by Verbeke and Molenberghs (2000, Chapter 20)). However, these models require large numbers of dropouts per dropout pattern to reliably estimate the usually large number of parameters. One feature of our data was the continuous nature of dropout, and Figure 8 suggests that one has to take this characteristic into account when it is present in the data. A problem raised by pattern-mixture models is that they cannot accommodate this feature.

Identification of parameters is also an important issue in models for non-ignorable dropout. It has recently been discussed (in a context different from that of longitudinal studies, however) by Scharfstein et al. (1999). There, the authors consider a study designed to end at a fixed time T, at which an outcome of interest Y is measured on each individual.
Letting V be some time-dependent covariate, they propose to model the hazard of dropout by λ(t | V̄(t), Y) = λ0(t | V̄(t)) exp(α0 Y), thus introducing non-ignorable dropout through the term exp(α0 Y). In the absence of further knowledge of
Y and V̄(t), Scharfstein et al. (1999) do not formulate any hypothesis on the joint distribution of the observed V̄(t) and the potentially unobserved Y, and they show that the parameter α0 cannot be identified. The approach we take when applying model (9) in the context of longitudinal data with non-ignorable dropout is different and is motivated by the following reason: when interest lies in longitudinal trajectories, rather than in a single measure of an outcome, it may not be unreasonable to make certain model assumptions on the longitudinal process. Then, assuming a first-order Markov model for the longitudinal process, Dupuy et al. (2001) have shown that model (2) is identifiable under mild conditions. It is easy to accommodate longitudinal categorical data in the model we suggest, and to compare results with those obtained from the model of Molenberghs et al. (1997). One may also investigate how the suggested method may be extended to non-monotone missing data and compare results with those given by Troxel et al. (1998), who extended the DK model to this situation. One may view non-monotone missing data as recurrent dropouts, and proceed by extending methods for analysis of recurrent events.

7.4. Extension of previous analyses to the latent nature of quality of life

In the previous part, quality of life was considered as an observed continuous score Z (except for its value at dropout time in the DM analysis). But with real data, quality of life is in fact an unobserved latent variable. In practice, QoL data always consist of a multidimensional binary or categorical observed variable, the QoL scale, used to measure the true unobserved latent variable (QoL). From this QoL scale, we can derive QoL scores, i.e., individual statistics. These scores are surrogates for the true unobserved latent variable. When the QoL variable z was observed (except, of course, for the last unobserved dropout value zd), the likelihood for one observation yi (1 ≤ i ≤ n) was

L^(i)(τ) = ∫ [λ(xi)]^{δi} exp{δi β^T wi(xi) − ∫_0^{xi} λ(u) e^{β^T wi(u)} du} × f(zi0, …, ziad, zd; α) dzd = ∫ l(yi, zid, τ) dzid,

where yi = (xi, δi, zi0, …, ziad) = (xi, δi, ziobs), and then the previous inference can be done. But in the latent variable context, ziobs is in fact not directly observed. The k item responses Qij of a subject i (response or raw vector Qi) are the core observation and must be used to recover the latent QoL values zi through the measurement model. The most famous measurement model in the psychometric context is the Rasch model (Fisher and Molenaar, 1995), which, for binary responses, is

P(Qij = qij | zi, ζj) = f(qij, zi, ζj) = e^{(zi − ζj) qij} / (1 + e^{zi − ζj}),
and for categorical ordinal responses (with the number of levels mj differing per item),

pc = P(Qij = c | zi, ζj) = e^{(c zi − Σ_{l=1}^{c} ζjl)} / Σ_{h=0}^{mj} e^{(h zi − Σ_{l=1}^{h} ζjl)},
known also in the psychometric literature as the partial credit model. Under the following assumptions: (1) the Dupuy and Mesbah analysis model holds for the true unobserved QoL Z and dropout D or survival T; (2) the Rasch measurement model relates the observed response items Q to the QoL Z; it could be interesting to derive the likelihood and to re-analyse the data. Let us recall the statistical problem: (1) the observations are Yi = (Xi, Δi, Qi0, …, QiaD), 1 ≤ i ≤ n; (2) the parameters are τ = (α, β, Λ) and the nuisance difficulty parameters of the QoL questionnaire, ζ; (3) hidden variables appear in the model (latent Z, missing Q): (Zi0, …, ZiaD, Zid, Qd), 1 ≤ i ≤ n. This allows us to extend our methodology to the latent case. We can, in the same way, extend the Awad et al. or the Heutte and Huber methodologies to the latent case, assuming that the QoL states are obtained after a more rigorous classification of the data (from item responses) into states, using the latent specificity of the QoL and latent class analysis. This latent regression modelling approach of course needs more mathematical work to be achieved.
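To make the measurement model concrete, the following minimal Python sketch (not part of the original analysis; the latent value and item difficulties below are invented for illustration) evaluates the Rasch and partial credit probabilities written above.

```python
import numpy as np

def rasch_prob(z, zeta, q):
    # P(Q_ij = q | z_i, zeta_j) for a binary item under the Rasch model
    return np.exp((z - zeta) * q) / (1.0 + np.exp(z - zeta))

def partial_credit_probs(z, zeta_j):
    # Category probabilities P(Q_ij = c | z_i, zeta_j) for c = 0, ..., m_j,
    # with item thresholds zeta_j = (zeta_j1, ..., zeta_jm).
    zeta_j = np.asarray(zeta_j, dtype=float)
    m = zeta_j.size
    exponents = np.array([c * z - zeta_j[:c].sum() for c in range(m + 1)])
    w = np.exp(exponents - exponents.max())  # subtract the max for numerical stability
    return w / w.sum()

# Illustrative values only (z and the thresholds are invented):
print(rasch_prob(z=0.5, zeta=-0.2, q=1))
print(partial_credit_probs(z=0.5, zeta_j=[-1.0, 0.0, 1.2]))
```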
References

Aaronson, N.K., Ahmedzai, S., Bergman, B., Bullinger, M., Cull, A., Duez, N.J., Filiberti, A., Flechtner, H., Fleishman, S.B., De Hues, J.C.J.M., Kaasa, S., Klee, M., Osoba, D., Razavi, D., Rofe, P.B., Schraub, S., Sneeuw, K., Sullivan, M., Takeda, F. (1993). For the European Organization for Research and Treatment of Cancer Study Group on Quality of Life, “The European Organization for Research and Treatment of Cancer QLQ-C30: A quality-of-life instrument for use in international clinical trials in oncology”. J. Nat. Cancer Instit. 85, 365–376.
Altman, D.G., De Stavola, B.L. (1994). Practical problems in fitting a proportional hazards model to data with updated measurements of the covariates. Statist. Medicine 13, 301–341.
Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Awad, L., Zuber, E., Mounir, M. (2002). Applying survival data methodology to analyse longitudinal quality of life. In: Mesbah, M., Cole, B.F., Lee, M.L.T. (Eds.), Statistical Methods for Quality of Life Studies: Design, Measurements and Analysis. Kluwer Academic, Boston.
Beitz, J., Gnecco, C., Justice, R. (1998). Quality-of-life end points in cancer clinical trials: The US Food and Drug Administration Perspective. J. Nat. Cancer Instit. Monographs 20, 7–9.
Billot, L., Mesbah, M. (1999). Analyse d'un essai de Qualité de Vie longitudinal en cancérologie avec prise en compte des abandons. In: Actes des Journées Vannes-Bordeaux (Proceedings).
Collett, D. (1994). Modelling Survival Data in Medical Research. Chapman & Hall, London.
Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, 187–220.
Cox, D.R., Oakes, D. (1984). Analysis of Survival Data. Chapman & Hall, London.
Cox, D.R., Fitzpatrick, R., Fletcher, A.E., Gore, S.M., Spiegelhalter, D.J., Jones, D.R. (1992). Quality-of-life assessment: can we keep it simple? J. Roy. Statist. Soc. Ser. A 155, 353–393.
Cunningham, D., Pyrhönen, S., James, R.D., Punt, C.J.A., Hickish, T.P., Heikkila, R., Johannesen, T.B., Starkhammar, H., Topham, C.A., Awad, L., Jacques, C., Herait, P. (1998). Randomised trial of irinotecan plus supportive care versus supportive care alone after fluorouracil failure in patients with metastatic colorectal cancer. Lancet 352, 1413–1418.
Crouch, E.A.C., Spiegelman, D. (1990). The evaluation of integrals of the form ∫_{−∞}^{+∞} f(t) exp(−t²) dt: Application to logistic-normal models. J. Amer. Statist. Assoc. 85, 464–469.
Crowder, M. (2001). Classical Competing Risks. Chapman & Hall/CRC, Boca Raton.
Dafni, U.G., Tsiatis, A.A. (1998). Evaluating surrogate markers of clinical outcome when measured with error. Biometrics 54, 1445–1462.
DeGruttola, V., Tu, X.M. (1994). Modelling progression of CD4-lymphocyte count and its relationship to survival time. Biometrics 50, 1003–1014.
Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39, 1–38.
Diggle, P.J., Kenward, M.G. (1994). Informative dropout in longitudinal data analysis (with discussion). Appl. Statist. 43, 49–93.
Douillard, J.Y., Cunningham, A.D., Navarro, M., James, R.D., Karasek, P., Jandik, P., Iveson, T., Carmichael, J., Alakl, M., Gruia, G., Awad, L., Rougier, P. (2000). Irinotecan combined with fluorouracil compared with fluorouracil alone as first-line treatment for metastatic colorectal cancer: A multicenter randomized trial. Lancet 355, 1041–1047.
Dupuy, J.-F. (2002). Joint modeling of survival and non-ignorable missing longitudinal QoL data. In: Mesbah, M., Cole, B.F., Lee, M.L.T. (Eds.), Statistical Methods for Quality of Life Studies: Design, Measurements and Analysis. Kluwer Academic, Boston.
Dupuy, J.-F., Grama, I., Mesbah, M. (2001). Identifiability and consistency in a Cox model with missing time-dependent covariate. Technical Report SABRES 2001/11, University of South-Brittany, France.
Dupuy, J.-F., Mesbah, M. (2002). Joint modeling of event time and non-ignorable missing longitudinal data. Lifetime Data Anal. 8, 99–115.
Fayers, P., Aaronson, N., Bjordal, K., Curran, D., Groenvold, M. (1995). For the EORTC Quality of Life Study Group, EORTC QLQ-C30 Scoring Manual, 2nd edn. European Organization for Research and Treatment of Cancer, Brussels.
Fisher, G.H., Molenaar, I.W. (1995). Rasch Models: Foundations, Recent Developments and Applications. Springer, Berlin.
Gourieroux, C., Monfort, A. (1996). Statistique et Modèles Econométriques. Economica.
Gulati, S., Dupuy, J.-F., Mesbah, M. (2002). Goodness of fit of a joint model for event time and non-ignorable missing longitudinal quality of life data. Technical Report SABRES 2002/4, University of South-Brittany, France.
Heutte, N., Huber-Carol, C. (2002). Semi-Markov models for quality of life data with censoring. In: Mesbah, M., Cole, B.F., Lee, M.L.T. (Eds.), Statistical Methods for Quality of Life Studies: Design, Measurements and Analysis. Kluwer Academic, Boston.
Hogan, J.W., Laird, N.M. (1997a). Mixture models for the joint distribution of repeated measurements and event times. Statist. Medicine 16, 239–257.
Hogan, J.W., Laird, N.M. (1997b). Model-based approaches to analysing incomplete longitudinal and failure time data. Statist. Medicine 16, 259–272.
Kalbfleisch, J.D., Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
Little, R.J.A. (1995). Modeling the dropout mechanism in repeated-measures studies. J. Amer. Statist. Assoc. 90, 1112–1121.
Little, R.J.A., Rubin, D.B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
Louis, T.A. (1982). Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. Ser. B 44, 226–233.
Martinussen, T. (1999). Cox regression with incomplete covariate measurements using the EM-algorithm. Scand. J. Statist. 26, 479–491.
McLachlan, S.A., Devins, G.M., Goodwin, P.J. (1998). Validation of the European Organization for Research and Treatment of Cancer quality-of-life questionnaire (QLQ-C30) as a measure of psychological function in breast cancer patients. European J. Cancer 34, 510–517.
Mesbah, M., Lellouch, J., Huber, C. (1992). The choice of loglinear models in contingency tables when the variables of interest are not jointly observed. Biometrics 48, 259–265.
Molenberghs, G., Kenward, M.G., Lesaffre, E. (1997). The analysis of longitudinal ordinal data with nonrandom dropout. Biometrika 84, 33–44.
Murphy, S.A. (1995). Asymptotic theory for the frailty model. Ann. Statist. 23, 182–198.
Nielsen, G.G., Gill, R.D., Andersen, P.K., Sørensen, T.I.A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Statist. 19, 25–43.
Nocedal, J., Wright, S.J. (1999). Numerical Optimization. Springer, New York.
Osoba, D. (1994). Lessons learned from measuring health-related quality of life in oncology. J. Clinical Oncology 12, 608–616.
Pyke, R. (1961). Markov renewal processes: Definitions and preliminary properties. Ann. Math. Statist. 32, 1231–1342.
Ribaudo, H.J., Thompson, S.G., Allen-Mersh, T.G. (2000). A joint analysis of quality of life and survival using a random effect selection model. Statist. Medicine 19, 3237–3250.
Rougier, P., van Cutsem, E., Bajetta, E., Niederle, N., Possinger, K., Labianca, R., Navarro, M., Morant, R., Bleiberg, H., Wils, J., Awad, L., Herait, P., Jacques, C. (1998). Randomized trial of irinotecan versus fluorouracil by continuous infusion after fluorouracil failure in patients with metastatic colorectal cancer. Lancet 352, 1407–1412.
Scharfstein, D.O., Rotnitzky, A., Robins, J.M. (1999). Adjusting for non-ignorable dropout using semiparametric nonresponse models (with discussion). J. Amer. Statist. Assoc. 94, 1096–1146.
Schluchter, M.D. (1992). Methods for the analysis of informatively censored longitudinal data. Statist. Medicine 11, 1861–1870.
Scilab Group (1998). Introduction to Scilab. INRIA Meta2 Project/ENPC Cergrene, User's Guide.
Sciprog (2001). User Guide of Sciprog, Source and User's Guide. http://www.univ-ubs.fr/sabres/dupuyj/JointAnalysis.htm.
Smith, D.M. (1997). Oswald: Object-Oriented Software for the Analysis of Longitudinal Data in S. http://www.maths.lancs.ac.uk/Software/Oswald/.
Troxel, A.B., Lipsitz, S.R., Harrington, D.P. (1998). Marginal models for the analysis of longitudinal measurements with non-ignorable non-monotone missing data. Biometrika 85, 661–672.
Tsiatis, A.A., Davidian, M. (2001). A semiparametric estimator for the proportional hazards model with longitudinal covariates measured with error. Biometrika 88, 447–458.
Tsiatis, A.A., DeGruttola, V., Wulfsohn, M.S. (1995). Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS. J. Amer. Statist. Assoc. 90, 27–37.
Verbeke, G., Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Springer, New York.
Voelkel, J., Crowley, J. (1984). Nonparametric inference for a class of semi-Markov processes with censored observations. Ann. Statist. 12, 142–160.
Wu, M.C., Carroll, R.J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modelling the censoring process. Biometrics 44, 175–188.
Wulfsohn, M.S., Tsiatis, A.A. (1997). A joint model for survival and longitudinal data measured with error. Biometrics 53, 330–339.
Handbook of Statistics, Vol. 23
ISSN: 0169-7161. © 2004 Elsevier B.V. All rights reserved. DOI: 10.1016/S0169-7161(03)23039-X

Ch. 39. Modelling Survival Data using Flowgraph Models
Aparna V. Huzurbazar
Multistate stochastic networks are used to model diseases that progress through various stages. Examples include cancer progression, the stages of HIV leading to AIDS, and the failures of organs such as kidneys or eyes. Of interest are survival times and times to events such as organ failure. There is a great deal of interest in data analysis for multistate models; in fact, an entire issue of the journal Statistical Methods in Medical Research was recently devoted to the topic.¹ Traditionally, such models have been analyzed using Markov processes with associated simplifying assumptions including the use of exponential waiting times. Semi-Markov processes relax the exponential assumption and provide alternatives for viewing Markov processes in continuous time. However, in practice, data analysis for semi-Markov processes can be very difficult, especially when returns to a state are allowed. Generally, work with semi-Markov processes restricts all the waiting times in the multistate model to one family of distributions such as the Weibull (cf. Wilson and Solomon, 1994) for tractability. The proportional hazards model has also been used in the context of multistate models, but this is also restrictive as the proportional hazards assumption may not hold.

Flowgraph models provide an innovative approach for the analysis of time to event data obtained from complex stochastic networks. Flowgraphs model semi-Markov processes and allow for a variety of distributions to be used within the stages of the multistate model. Flowgraphs model potential outcomes, probabilities of outcomes, and waiting times for the outcomes to occur.

Figure 1 is a flowgraph model for the degenerative stages of diabetic retinopathy due to Yau and Huzurbazar (2002). This model was first proposed as a Markov model by Marshall and Jones (1995). The stages of retinopathy in the model are classified according to a modified Airlie House (Diabetic Retinopathy Study Research Group, 1981) classification scheme. State 1 indicates no retinopathy; state 2 indicates microaneurisms only; state 3 represents intermediate stages of background retinopathy; and state 4 indicates that preproliferative and proliferative retinopathy have occurred leading to total blindness. State 4 is an absorbing state. A patient with diabetes begins in state 1 having no retinopathy. The patient can progress to a degenerative state, 2. Once in state 2, the patient can improve and return to state 1, or progressively worsen to state 3. Once in state 3, the patient can improve and return to state 2, or worsen to state 4, total blindness.

¹ Multi-State Models. Statistical Methods in Medical Research 11 (2) (2002).
Fig. 1. Flowgraph model for the progression of diabetic retinopathy.
Each branch is labeled with the probability of taking that branch, p·, times the moment generating function (MGF) of the waiting time distribution associated with that transition, M·(s). Flowgraphs model this entire complex stochastic network. Interest focuses on the probabilities of getting to any stage and on how long a person remains in a given stage. In this case, interest centers on the time until total blindness or the time until the various degenerative stages are reached. Particular quantities of interest in the flowgraph model are the waiting times 1 → 4 or 1 → 3.

Complete data on the flowgraph model consists of following every individual through every intermediate transition. Incomplete data occur when not all of the transitions are observed. This happens in several ways; for example, suppose that we observe a patient only in states 1 → 3 → 4. Then, we know that the patient must have transited through 2 but we do not know when or how often, i.e., the patient's path may have been 1 → 2 → 3 → 2 → 1 → 2 → 3 → 4. In addition, the patient's path may be subject to censoring.

The end result from a flowgraph model is either a maximum likelihood estimated density, survivor, or hazard function for the quantity of interest or a Bayes predictive density, predictive survivor, or predictive hazard function of a future observable waiting time such as the time to total blindness. The Bayes predictive density of a future observable Z given data D is

f_Z(z | D) = ∫ f_Z(z | θ) L(θ | D) π(θ) dθ / ∫ L(θ | D) π(θ) dθ ≡ E_{θ|D}[f_Z(z | θ)],   (1)

where Z has density f_Z(z | θ), L(θ | D) is the likelihood function, and π(θ) is the prior. In our problem, f(z | θ) is unknown but can be constructed using a flowgraph model and, in the case of incomplete data, the likelihood function, L(θ | D), is also unknown and must be constructed with the flowgraph model.

This chapter is organized as follows. Section 1 presents background on flowgraph models and reviews the basic series flowgraph structure. Section 2 details data analysis with series flowgraphs using Bayesian methods and data on the progression of HIV. Section 3 discusses saddlepoint approximations for flowgraph models. Section 4 discusses likelihood construction for flowgraphs in the presence of incomplete data using an example on kidney failure, with the parametric assumptions detailed in Section 5. Sections 6 and 7 discuss parallel and loop flowgraph structures, respectively. Section 8 presents a general procedure for handling flowgraphs that combine series, parallel, and loop structures such as the retinopathy flowgraph of Figure 1. Section 9 presents a full analysis of the diabetic retinopathy flowgraph. As flowgraphs can also be used in the frequentist setting, this last example is presented in that framework.
1. Series flowgraph model: HIV blood transfusion data

Block diagrams and signal flowgraphs are widely used to represent engineering systems, especially in circuit analysis. Basic flowgraph ideas were developed in engineering, but they never incorporated probabilities, waiting times, or data analysis. The literature on flowgraph methods as used in engineering is vast. Introductions to flowgraph methods are contained in most circuit analysis or control systems textbooks such as D'Azzo and Houpis (1981), Dorf and Bishop (1995), Gajic and Lelic (1996), Lorens (1964), and Whitehouse (1973, 1983). Statistical flowgraph models are based on flowgraph ideas but, unlike their predecessors, flowgraph models can also be used to model and analyze data from complex stochastic systems. Butler and Huzurbazar (1997) present statistical flowgraphs in a Bayesian context. Their focus is on modelling disease progression. Huzurbazar (1999) extends flowgraph models to phase type distributions (cf. Aalen (1995)). Huzurbazar (2000) applies flowgraph models in a Bayesian setting to the design and analysis of cells used in cellular telephone networks. Huzurbazar (2005) is a comprehensive text on statistical flowgraph models.

In a flowgraph model, the states or nodes represent outcomes. This is distinct from a graphical model where the nodes represent variables. Flowgraphs are outcomes graphs whereas graphical models are variables graphs. In the retinopathy example of Figure 1 the nodes represent various stages of retinopathy or blindness. The nodes are connected by directed line segments called branches. These branches are labeled with transmittances. A transmittance consists of the transition probability times the MGF of the waiting time distribution in the preceding state. This quantity is called the branch transmittance when it refers to a specific branch or an overall transmittance when it refers to the transmittance of the entire flowgraph. In the figure, probabilities and MGFs of the waiting time distributions are shown as branch transmittances. We use the branch transmittances of a flowgraph model to solve for the MGF of the distribution of the waiting time of interest. Note that the waiting time distribution on each branch can be any distribution with an MGF that is available either analytically or numerically. Each branch can have a different waiting time distribution. For example, we could model 1 → 2 with a Weibull and 2 → 3 with a gamma density.

The three basic components of a flowgraph model are series, parallel, and loop structures. Larger flowgraph models are combinations of these basic structures. This section discusses the series model and associated data analysis for this model. Figure 2 is a flowgraph model for patients infected with HIV through blood transfusions. The model is motivated by data from the San Francisco Men's Health Study (SFMHS) and was presented by Huzurbazar and Huzurbazar (1999). State 1 is HIV infection, antibody negative, state 2 is antibody positive but without AIDS symptoms,
Fig. 2. Flowgraph model for transfusion data.
Fig. 3. Solved flowgraph for transfusion series model.
and state 3 is pre-AIDS symptoms. Let T1 be the random waiting time in state 1 until transition to state 2 and let T2 be the random waiting time in state 2 until transition to state 3. One quantity of interest is the incubation time of the virus, the time until pre-AIDS symptoms appear. This is T = T1 + T2, the total waiting time for passage from 1 → 3. Suppose that T1 ∼ Exp(λ) with mean 1/λ and T2 ∼ Gamma(α, β) with mean α/β. There are two branches in this flowgraph, 1 → 2 and 2 → 3, and they are labeled with transmittances. In this flowgraph, the probability that an AIDS patient eventually reaches state 2 is p12 = 1, and the MGF of the waiting time distribution for this outcome is

M12(s) = λ / (λ − s).

In a flowgraph, the transmittance, p12 M12(s), is written on the corresponding branch. The patient remains in state 2 until pre-AIDS symptoms appear. The probability that this occurs is p23 = 1 and the MGF of the corresponding waiting time distribution is

M23(s) = (β / (β − s))^α.
Solving a flowgraph model refers to reducing all of the branch transmittances in a flowgraph into one overall transmittance for the flowgraph. This is called the equivalent transmittance since this transmittance is equivalent to the entire flowgraph. To solve the flowgraph of Figure 2, we compute the transmittance of the path 1 → 2 → 3. The transmittance of a path is the product of all of the branch transmittances for that path. In terms of random variables, we are dealing with the distribution of the sum of two independent random waiting times from 1 → 2 and 2 → 3. Here, the transmittance of the path, or the MGF of the distribution of the sum, is M12(s)M23(s), the equivalent transmittance. We can replace Figure 2 with an equivalent flowgraph, Figure 3, in which node 2 is removed and passage is directly from node 1 to node 3. This equivalent flowgraph is labeled with

M(s) = M12(s)M23(s) = (λ / (λ − s)) (β / (β − s))^α,   (2)

the MGF of the waiting time distribution of T. This procedure of replacing a more complicated flowgraph with an equivalent one consisting of only nodes 1 and 3, say, is called solving the flowgraph from 1 to 3.
2. Data analysis of HIV/AIDS data

The transfusion data from the San Francisco Men's Health Study (SFMHS) consist of individuals who were given HIV infected blood products. Detailed descriptions of the
data are in Longini et al. (1989) and Huzurbazar and Huzurbazar (1999). The data were collected from 1978–1980 and have been used to model the early stages of progression of individuals with HIV infection. The transfusion data are informative about the early stages of HIV infection because the exact time of transfusion is known. Huzurbazar and Huzurbazar (1999) use the transfusion data to illustrate saddlepoint methods in the frequentist setting and Huzurbazar and Huzurbazar (2000) present a Bayesian analysis of these data without flowgraphs. This example focuses on illustrating data analysis with flowgraph models by computing the Bayes predictive density, predictive survivor function, and predictive hazard function for the waiting time for HIV progression.

The data consist of 90 individuals who received HIV infected blood or blood products through transfusion. 70 individuals made the transition from state 1 to state 2. For 35 of these 70 individuals, there is no information beyond state 2. For the remaining 35 individuals, 22 individuals made the transition from state 2 to state 3 and 13 individuals were right-censored in state 2. This means that these 13 individuals were last seen in state 2 and not observed making the transition to the end state, state 3. 20 individuals, initially observed in state 1 and then again in state 3 and beyond, are ignored in the current analysis.

Recall that T1 is the waiting time in state 1 until transition to state 2 and T2 is the waiting time in state 2 until transition to state 3. The total waiting time for passage from 1 → 3 is T = T1 + T2. Previous analyses have taken T1 and T2 to be exponentially distributed. Examining a censored data histogram (cf. Huzurbazar and Huzurbazar (1999) or Barnett and Cohen (2000)) of observations on T1 indicates that an exponential model is adequate for T1. Figure 4 presents censored data histograms for observations on T1 and T2. These histograms suggest that a distribution with a positive mode, such as a gamma, would be a more appropriate distribution for T2. We model T1 ∼ Exp(λ) with mean 1/λ and density fT1(·), and T2 ∼ Gamma(α, β) with mean α/β, density fT2(·) and distribution function FT2(·). For illustration, we assume diffuse uniform priors on all of the parameters. The independent priors on λ, α, and β are λ ∼ Unif(0, 10), α ∼ Unif(0, 10), and β ∼ Unif(0, 10).

The data contribute terms to the likelihood function as follows. Each individual that made the transition from 1 → 2 contributes a term, fT1(t1i), to the likelihood function. The 70 individuals that made the transition from 1 → 2 contribute the product ∏_{i=1}^{70} fT1(t1i). Each individual that made the 2 → 3 transition contributes a term, fT2(t2j), to the likelihood. The 22 individuals that made the 2 → 3 transition contribute the product ∏_{j=1}^{22} fT2(t2j). Each individual that is right-censored in state 2 contributes a term, 1 − FT2(t2k), to the likelihood since the individual is last observed in state 2. The 13 individuals that are right-censored in state 2 contribute the term ∏_{k=23}^{35} [1 − FT2(t2k)]. The full likelihood function is

L(λ, α, β) = ∏_{i=1}^{70} fT1(t1i) ∏_{j=1}^{22} fT2(t2j) ∏_{k=23}^{35} [1 − FT2(t2k)].   (3)
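A minimal sketch of how the likelihood (3) could be coded is shown below. The data vectors are hypothetical placeholders (the SFMHS observations are not reproduced here), and the parameterizations assume scipy's conventions: Exp(λ) with mean 1/λ uses scale = 1/λ, and Gamma(α, β) with mean α/β uses shape α and scale 1/β.

```python
import numpy as np
from scipy import stats

def log_likelihood(params, t1, t2_obs, t2_cens):
    """Log of (3): t1 are the observed 1->2 times, t2_obs the observed 2->3 times,
    and t2_cens the right-censored waiting times in state 2."""
    lam, alpha, beta = params
    if lam <= 0 or alpha <= 0 or beta <= 0:
        return -np.inf
    ll = stats.expon.logpdf(t1, scale=1.0 / lam).sum()                   # f_T1 terms
    ll += stats.gamma.logpdf(t2_obs, a=alpha, scale=1.0 / beta).sum()    # f_T2 terms
    ll += stats.gamma.logsf(t2_cens, a=alpha, scale=1.0 / beta).sum()    # 1 - F_T2 terms
    return ll

# Hypothetical data (placeholders only), in months:
t1 = np.array([2.1, 3.4, 1.2, 5.0])
t2_obs = np.array([30.0, 42.5, 55.1])
t2_cens = np.array([60.0, 24.0])
print(log_likelihood((0.5, 3.0, 0.08), t1, t2_obs, t2_cens))
```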
With uniform priors on the parameters, the posterior is proportional to the likelihood. We use rejection sampling to sample from this posterior to compute the Bayes predictive
Fig. 4. Censored data histograms for observations on T1 and T2 .
density, predictive survivor, and predictive hazard functions using the flowgraph MGF of the distribution of the waiting time from HIV infection to pre-AIDS symptoms,

M(s) = M12(s)M23(s) = (λ / (λ − s)) (β / (β − s))^α.

The next section describes the details of converting this MGF into a density.

3. Converting flowgraph MGFs to densities

Solving flowgraph models gives us the MGF of the waiting time distribution of interest; however, we still do not have the distribution. Flowgraph MGFs are converted into density, survivor, and hazard functions using saddlepoint approximations. In simple models that are restricted to exponential waiting times or gamma waiting times with integer-valued shape parameter, these MGFs may be inverted directly using Laplace transform inversion routines available in packages such as MAPLE or Mathematica. However, for more complicated models we must use numerical methods. One method for numerically converting an MGF to a density is to use a saddlepoint approximation.

Let T be the random waiting time, 1 → 3, in the flowgraph of Figure 2 and let M(s) be the corresponding MGF derived in (2). Let K(s) = log M(s) be the cumulant generating function (CGF) of T. Then the saddlepoint approximation for the density of T is

f̃_T(t) = [2π K″(ŝ)]^{−1/2} exp{K(ŝ) − ŝ t},   (4)

where K″(s) = d²K(s)/ds² and ŝ is the solution to the saddlepoint equation

K′(ŝ) = t.   (5)
For most problems, ŝ is a complicated implicit function of both t and the parameters of the distribution. The saddlepoint approximation requires that M(s) exist for s ∈ (c1, c2), an open neighborhood of zero, where c1 and c2 can be found numerically.
Fig. 5. Bayes predictive density for the time to incubation.
The necessary inputs to the saddlepoint approximation are the CGF, K(s), and its first and second derivatives. The CGF for the HIV example is

K(s) = log(λ) − log(λ − s) + α log(β) − α log(β − s),   for s < min(λ, β).   (6)

The saddlepoint equation, K′(ŝ) = t, is given by

(λ − ŝ)^{−1} + α(β − ŝ)^{−1} = t,   for ŝ < min(λ, β).   (7)

The second derivative at the saddlepoint is K″(s) = (λ − s)^{−2} + α(β − s)^{−2}.

Figures 5, 6, and 7 give the Bayes predictive density, survivor, and hazard functions, respectively. A sample of 5000 θ = (λ, α, β) values was generated from this posterior. For each θ value the predictive density, survival, and hazard functions were computed using the flowgraph MGF. Table 1 gives the predictive CDF values. The “exact”, presented for comparison, is computed via numerical integration of the convolution of the exponential and gamma. We can do this only because we have a simple convolution. The “exact” is in quotes because we still use rejection sampling to perform the integration in (1). The CDF F̂ is the numerically integrated saddlepoint density. In Figure 7, the shape of the hazard function starts out like a gamma hazard and then approaches the value of the smaller pole, λ, in the limit.
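The computation in (4)–(7) is straightforward to program. The sketch below is a hedged illustration (not the author's code; the parameter values are arbitrary): it solves the saddlepoint equation (7) with a root finder and returns the approximate density (4).

```python
import numpy as np
from scipy.optimize import brentq

def saddlepoint_density(t, lam, alpha, beta):
    """Saddlepoint approximation (4) for T = Exp(lam) + Gamma(alpha, beta),
    using the CGF (6) and saddlepoint equation (7)."""
    upper = min(lam, beta)
    K = lambda s: np.log(lam) - np.log(lam - s) + alpha * np.log(beta) - alpha * np.log(beta - s)
    K1 = lambda s: 1.0 / (lam - s) + alpha / (beta - s)              # K'(s)
    K2 = lambda s: 1.0 / (lam - s) ** 2 + alpha / (beta - s) ** 2    # K''(s)
    # K' is increasing on (-inf, min(lam, beta)), so the root of K'(s) = t is unique.
    s_hat = brentq(lambda s: K1(s) - t, -1e6, upper - 1e-10)
    return np.exp(K(s_hat) - s_hat * t) / np.sqrt(2.0 * np.pi * K2(s_hat))

# Arbitrary illustrative parameter values (months):
print(saddlepoint_density(t=40.0, lam=0.5, alpha=3.0, beta=0.08))
```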
Table 1
Predictive CDF values for AIDS/HIV transfusion data (time in months)

Percentile    10th     50th     90th      95th      99th
Exact         16.10    37.60    84.30     103.30    140.60
SA (F̂)        15.90    37.30    84.40     105.00    153.00
Fig. 6. Bayes predictive survivor function for the time to incubation.
Fig. 7. Bayes predictive hazard for the time to incubation.
Fig. 8. Series flowgraph for kidney failure.
4. Likelihood construction in flowgraph models

The previous section illustrated data analysis with flowgraph models when the data were complete and censored. An important problem is that of likelihood construction in the presence of incomplete data. Figure 8 presents a 3-state model of kidney failure. The data were originally analyzed using a Markov model by Gross et al. (1971). Butler and Huzurbazar (1997) presented a flowgraph model. Our focus will be to illustrate likelihood construction using these data. State 0 represents initial diagnosis of kidney disease where the patient has two functioning kidneys. State 1 represents one failed kidney, and state 2 represents two failed kidneys, an absorbing state. Let Y1 be the random waiting time in state 0 until state 1 is reached and Y2, independent of Y1, be the random waiting time in state 1 until state 2 is reached. One quantity of interest is the survival time of the patient's kidneys, that is, the time spent in states 0 and 1 before reaching state 2. Let T = Y1 + Y2 be the total waiting time from 0 → 2. There are two branches in this flowgraph, 0 → 1 and 1 → 2, and they are labeled with transmittances. In this flowgraph, the probability that a kidney fails and the patient eventually reaches state 1 is p01 = 1, and the MGF of the waiting time distribution for this outcome is M01(s). In state 1, we wait for the remaining kidney to fail. The probability that this occurs is p12 = 1 and the MGF of the corresponding waiting time distribution is M12(s).
5. Parametric assumptions

We use the same distributional assumptions as in Gross et al. (1971). Suppose that kidneys fail independently and according to an exponential distribution, Exp(λ1), with mean 1/λ1. In state 0, we observe the minimum of two independent exponentials, which is Exp(2λ1). This is the waiting time in state 0 until one kidney fails and the patient is in state 1. Since the kidney patient will eventually make the transition from state 0 to state 1, the probability that this occurs is 1. Once in state 1, we assume that the remaining kidney has a failure time distributed according to an Exp(λ2) distribution such that (1/λ1) > (1/λ2), and this transition occurs with probability 1. This is a way to account for the additional stress on the remaining kidney, now that one kidney has failed. The waiting time distribution for passage from state 0 to 2 is the sum of the two independent waiting times, 0 → 1 and 1 → 2, i.e., the convolution of two independent exponential distributions. If we let Y1 ∼ Exp(2λ1) be the waiting time from 0 → 1 and Y2 ∼ Exp(λ2) be the waiting time from 1 → 2, then the total waiting time from 0 → 2 is T = Y1 + Y2. Since Y1 and Y2 are independent, the MGF of the total waiting time T, MT(s), is the product of the MGFs of the waiting times from 0 → 1 and 1 → 2, which
are M01(s) and M12(s), respectively. Recall that the MGF of an Exp(λ) is M(s) = λ/(λ − s). Using (2) for the series model, the MGF of T is

MT(s) = M_{Y1+Y2}(s) = M01(s)M12(s) = (2λ1 / (2λ1 − s)) (λ2 / (λ2 − s)).   (8)

Because each waiting time is exponential, we can analytically solve this for the density of T. We substitute the appropriate exponential MGFs in (8) and use partial fraction expansion to simplify and convert MT(s) to a density as follows:

MT(s) = M_{2λ1}(s) M_{λ2}(s) = (2λ1 / (2λ1 − s)) (λ2 / (λ2 − s)),   for s < min(2λ1, λ2),   (9)
      = (2λ1λ2 / (λ2 − 2λ1)) · 1/(2λ1 − s) + (2λ1λ2 / (2λ1 − λ2)) · 1/(λ2 − s).   (10)

We rearrange the terms to get known MGFs and then convert each MGF to a density, giving

MT(s) = (2λ1 / (2λ1 − λ2)) · λ2/(λ2 − s) − (λ2 / (2λ1 − λ2)) · 2λ1/(2λ1 − s),   (11)

f_T(t) = (2λ1 / (2λ1 − λ2)) λ2 e^{−λ2 t} − (λ2 / (2λ1 − λ2)) (2λ1) e^{−2λ1 t}
       = (2λ1λ2 / (2λ1 − λ2)) (e^{−λ2 t} − e^{−2λ1 t}),
for t > 0, 2λ1 ≠ λ2, λ1 > 0, λ2 > 0.   (12)
The key step is in recognizing that the MGFs in (11) are MGFs of exponential distributions. In more complicated problems, especially ones without exponentials, we cannot simply look up corresponding distributions on a table, and we must convert MT(s) numerically using a saddlepoint approximation to get the density fT(t) in (12).

5.1. Constructed likelihood: Incomplete data

Recall that complete data on the flowgraph model consists of following every individual through every intermediate transition. Incomplete data occur when not all of the transitions are observed. In this case, incomplete data occur when we observe a patient in state 0 and again in state 2; we do not know when the patient's first kidney failed, i.e., we do not know when the transition to state 1 occurred and we do not observe the waiting times Y1 and Y2. We only observe T, the total waiting time from 0 → 2. The flowgraph gives the MGF for 0 → 2, MT(s), for fixed θ. Suppose that our data consist only of t1, …, tn, the total waiting times for n kidney patients. We can write the approximate likelihood using

L(θ | D) = ∏_{j=1}^{n} f̃_T(tj | θ),   (13)
Fig. 9. Kidney example, exact likelihood.
where D represents the data and f̃(·) is the saddlepoint approximation to the density of T. We can find f̃_T(t | θ) for a specific value of t and θ. In L(θ | D) we could evaluate the likelihood on a multidimensional grid of θ vectors to find an approximate MLE (impractical for high-dimensional θ), or we could perform a Bayesian analysis. Figure 9 gives a plot of the exact likelihood as a function of θ = (λ1, λ2) and Figure 10 gives a plot of the constructed likelihood. We can see that the constructed likelihood is quite good. The exact likelihood is computed using (12) instead of f̃(·) in (13). These plots are based on n = 10 observations of the total time to failure of the kidneys, T. In two dimensions, we can see that the shape of the likelihood reflects the deconvolution problem. The direction of stretch of the likelihood shows the lack of identifiability of λ1 and λ2. We have information on the sum and we are trying to deconvolve information in the likelihood for the separate parameters of the convolved distributions. The constructed likelihood uses the flowgraph MGF along with the saddlepoint approximation on the same data of 10 observations.

The Bayesian approach to the problem requires sampling from the prior distribution to help construct the likelihood. We can use this constructed likelihood to compute the Bayes predictive density by modifying (1) to give the computational formula

f̃_Z(z | D) = [ Σ_{i=1}^{m} f̃_Z(z | θi) ( ∏_{j=1}^{n} f̃_T(tj | θi) ) ] / [ Σ_{i=1}^{m} ( ∏_{j=1}^{n} f̃_T(tj | θi) ) ].   (14)

This involves sampling from the prior and using saddlepoints to construct f_Z(z | θ) for every value of z and θ, and constructing the likelihood of θ, (13).
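A hedged sketch of the computation in (14) for the kidney example is given below. It uses the exact density (12) in place of the saddlepoint approximation (either can serve as the weight in (14)); the uniform priors and the data values are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_T(t, lam1, lam2):
    # exact density (12) of T = Y1 + Y2 with Y1 ~ Exp(2*lam1), Y2 ~ Exp(lam2)
    return 2 * lam1 * lam2 / (2 * lam1 - lam2) * (np.exp(-lam2 * t) - np.exp(-2 * lam1 * t))

def predictive_density(z_grid, t_data, m=20000):
    # sample theta = (lam1, lam2) from assumed Unif(0.01, 2) priors
    lam1 = rng.uniform(0.01, 2.0, size=m)
    lam2 = rng.uniform(0.01, 2.0, size=m)
    # likelihood weights: prod_j f_T(t_j | theta_i), per (13)
    w = np.ones(m)
    for t in t_data:
        w *= f_T(t, lam1, lam2)
    w = np.where(np.isfinite(w) & (w > 0), w, 0.0)
    # self-normalised average, per (14)
    return np.array([np.sum(w * f_T(z, lam1, lam2)) / np.sum(w) for z in z_grid])

t_data = np.array([3.1, 5.4, 2.2, 7.0, 4.8, 6.1, 3.9, 5.0, 8.2, 2.9])  # hypothetical
z_grid = np.linspace(0.5, 20, 5)
print(predictive_density(z_grid, t_data))
```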
Fig. 10. Kidney example, saddlepoint likelihood.
6. Parallel flowgraph models

Parallel structures are the second basic flowgraph element. In a parallel flowgraph, transition from a state is allowed to one of a set of outcomes. We illustrate this by redrawing a portion of the retinopathy flowgraph of Figure 1 into Figure 11. In Figure 11, the states 1 and 3 are in parallel from the point of reference of state 2. This represents a competing risks situation in that a patient with microaneurisms can either improve to state 1 or deteriorate to state 3. To solve a parallel flowgraph, we consider the two paths 2 → 1 and 2 → 3. Parallel flowgraphs lead to finite mixture distributions. The transmittance of the path for the overall waiting time is a finite mixture distribution: with probability p21 it is M21(s), the MGF of the waiting time distribution for 2 → 1, and with probability (1 − p21) it is M23(s), the MGF of the waiting time distribution for 2 → 3. Therefore the MGF of the waiting time to state 1 or 3 beginning in state 2 is p21 M21(s) + (1 − p21)M23(s).
7. Loop flowgraph models

The third basic element of a flowgraph is the feedback loop. Figure 12 shows a feedback loop taken from Figure 1. Huzurbazar (2005) presents detailed derivation for the transmittance of the feedback loop. We present the following heuristic argument. A patient with microaneurisms in state 2 of Figure 12 can improve to state 1 or deteriorate to state 3. Once in state 1, the patient's eyesight can worsen, returning the patient to state 2.
Fig. 11. Flowgraph model for a parallel structure.
Fig. 12. Loop flowgraph structure.
Each time that the patient is in state 2, he can improve to state 1 or worsen to state 3. As drawn in Figure 12, the patient will eventually worsen to state 3. To solve this flowgraph, note that the states 2 → 1 → 2 are in series and we can reduce that portion of the feedback loop to p21 M21(s)M12(s). For simplicity, let M22(s) = M21(s)M12(s) so that the reduced equivalent transmittance of the feedback loop is p21 M22(s) = p21 M21(s)M12(s). Beginning in state 2, the patient can also proceed directly to state 3. This gives an overall path transmittance of (1 − p21)M23(s). If the patient experiences improvement once, his contribution to the overall path transmittance would be magnified by the transmittance of the loop so that the overall path transmittance would be (1 − p21)M23(s) p21M22(s). If he takes the feedback loop twice, this becomes (1 − p21)M23(s)[p21M22(s)]², and three times gives (1 − p21)M23(s)[p21M22(s)]³. Iterating in this manner gives the overall MGF of the feedback loop, M(s), as

M(s) = (1 − p21)M23(s) + (1 − p21)M23(s)[p21M22(s)] + (1 − p21)M23(s)[p21M22(s)]² + (1 − p21)M23(s)[p21M22(s)]³ + ···
     = (1 − p21)M23(s) [1 + p21M22(s) + (p21M22(s))² + (p21M22(s))³ + ···]
     = (1 − p21)M23(s) Σ_{j=0}^{∞} [p21M22(s)]^j
     = (1 − p21)M23(s) · 1 / (1 − p21M22(s)).   (15)
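The three reductions used so far — series (2), parallel (Section 6), and loop (15) — can be expressed as simple operations on MGFs. The sketch below is an illustrative composition of Python callables (not code from this chapter); any MGFs, such as the exponential MGFs used earlier, can be plugged in, and the branch probabilities and rates shown are arbitrary.

```python
def series(M1, M2):
    # MGF of a sum of independent waiting times: product of MGFs, as in (2)
    return lambda s: M1(s) * M2(s)

def parallel(p, M_a, M_b):
    # finite mixture: branch a with probability p, branch b with probability 1 - p
    return lambda s: p * M_a(s) + (1 - p) * M_b(s)

def loop(p_feedback, M_loop, M_exit):
    # feedback loop reduction (15): geometric sum over loop traversals
    return lambda s: (1 - p_feedback) * M_exit(s) / (1 - p_feedback * M_loop(s))

# Example with exponential MGFs (illustrative rates only):
M12 = lambda s: 0.5 / (0.5 - s)
M21 = lambda s: 0.2 / (0.2 - s)
M23 = lambda s: 0.3 / (0.3 - s)
M22 = series(M21, M12)
M_overall = loop(0.4, M22, M23)
print(M_overall(0.0))   # equals 1: eventual passage to the exit state is certain
```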
8. A systematic procedure for solving flowgraphs

To solve a complicated flowgraph such as in Figure 1 requires several steps for reducing the series, parallel, and loop components. For complex flowgraphs we use a procedure based on Mason's rule (cf. Mason (1953)) to solve flowgraphs. Mason's rule was developed in the context of graph theory for solving systems of linear equations. Mason's rule does not involve probabilities or MGFs. Solving flowgraph models involves applying Mason's rule to the branch transmittances. This gives a systematic procedure for computing the path transmittance between any two states, A and B, of the flowgraph. When eventual passage is certain from state A to state B, the path transmittance is the MGF of the waiting time distribution. Practical use of Mason's rule entails identifying all of the paths from A to B and the loops involved in those paths. It requires computing the transmittance for every distinct path from the initial state to the end state and adjusting for the transmittances of various loops.

We illustrate using the retinopathy example of Figure 1. Recall that a path is any possible sequence of nodes from the initial state to the end state that does not pass through any intermediate node more than once. Suppose our interest is in the time to blindness (state 4) beginning with no retinopathy; then there is only one path, 1 → 2 → 3 → 4. The transmittance of a path is simply the product of individual branch transmittances. A first-order loop is any closed path that returns to the initiating node of the loop without passing through any node more than once. The transmittance of a first-order loop is the product of the individual branch transmittances involved in its passage. A jth-order loop consists of j non-touching first-order loops. In this case, there are two first-order loops: 1 → 2 → 1, with transmittance p21 M12(s)M21(s); and 2 → 3 → 2, with transmittance (1 − p21)p32 M23(s)M32(s). There are no higher-order loops.

The general form of Mason's rule that gives the MGF from node A to node B is

M(s) = Σ_i P_i(s) [1 + Σ_j (−1)^j L_{ij}(s)] / [1 + Σ_j (−1)^j L_j(s)],   (16)

where P_i(s) is the transmittance for the ith path, L_j(s) in the denominator is the sum of the transmittances over the jth-order loops, and L_{ij}(s) is the sum of the transmittances over jth-order loops sharing no common nodes with the ith path, i.e., loops not touching the path. This is the MGF of the first passage time distribution from node A to node B.

Let T be the waiting time from no retinopathy to total blindness, i.e., from state 1 to the absorbing state 4. We can reduce the flowgraph of Figure 1 using the steps for
series, parallel, and loop flowgraphs as discussed, or we can apply (16) with

P1(s) = (1 − p21)(1 − p32)M12(s)M23(s)M34(s),
L1(s) = p21 M12(s)M21(s),
L2(s) = (1 − p21)p32 M23(s)M32(s),   and   L_{ij}(s) = 0,   (17)

the last term being zero since all loops touch the path. The MGF of the waiting time T, time to blindness, is

MT(s) = P1(s) / (1 − [L1(s) + L2(s)]),

MT(s) = (1 − p21)(1 − p32)M12(s)M23(s)M34(s) / (1 − [p21 M12(s)M21(s) + (1 − p21)p32 M23(s)M32(s)]).   (18)
While expression (18) may appear complicated, in general, we do not calculate this by hand. Having a systematic procedure such as (16) allows us to program equations such as (18) using symbolic algebra.
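For instance, the reduction (16)–(18) can be scripted with a symbolic algebra package. The following sketch is an illustration only (it is not the author's code, and the branch rates and probabilities substituted at the end are invented): it builds (18) symbolically in sympy and then plugs in exponential MGFs λ/(λ − s) for the branch waiting times.

```python
import sympy as sp

s, p21, p32 = sp.symbols('s p21 p32', positive=True)
names = ('M12', 'M21', 'M23', 'M32', 'M34')
M = {name: sp.Function(name)(s) for name in names}

# Mason's rule ingredients (17) for the retinopathy flowgraph
P1 = (1 - p21) * (1 - p32) * M['M12'] * M['M23'] * M['M34']
L1 = p21 * M['M12'] * M['M21']
L2 = (1 - p21) * p32 * M['M23'] * M['M32']

MT = P1 / (1 - (L1 + L2))   # equation (18)
print(MT)

# Substitute exponential MGFs lambda/(lambda - s), with purely illustrative rates:
rates = {'M12': 0.5, 'M21': 0.3, 'M23': 0.4, 'M32': 0.2, 'M34': 0.1}
MT_exp = MT.subs({M[name]: lam / (lam - s) for name, lam in rates.items()})
print(MT_exp.subs({p21: 0.2, p32: 0.35, s: 0}))   # ~1 at s = 0: passage to state 4 is certain
```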
9. Data analysis for diabetic retinopathy data

We will use the diabetic retinopathy application of Figure 1 to illustrate data analysis and computation for a complicated flowgraph model. Yau and Huzurbazar (2002) analyze these data using methods for censored and incomplete data and we refer the reader to Yau and Huzurbazar (2002) for details of the analysis. The data consist of 277 patients having Type I diabetes for five years or more. The data are heavily censored and many of the transitions are incomplete. For example, we may have an observation for a patient that is 3 → 1. In this case, we do not know how the patient made the transition from state 3 to state 1. For example, was the transition really 3 → 2 → 1 or could it have been 3 → 2 → 3 → 2 → 1? Let T be the waiting time from state 1, no retinopathy, to state 4, total blindness. Using (16) we compute the MGF of the distribution of T as

M(s) = (1 − p21)(1 − p32)M12(s)M23(s)M34(s) / (1 − [p21 M12(s)M21(s) + (1 − p21)p32 M23(s)M32(s)]).   (19)
Yau and Huzurbazar (2002) use censored data histograms to suggest appropriate parametric models. Distributions of waiting times used in the analysis are inverse Gaussian IG(λ1 , λ2 ) for 1 → 2, inverse Gaussian IG(µ1 , µ2 ) for 2 → 3, Exp(γ ) with mean γ for 3 → 4, Gamma(α, β) with mean αβ for 2 → 1, and Exp(η) for 3 → 2. The inverse Gaussian random variable has mean λ1 /λ2 and its density is parameterized as
(λ21 /t) + λ22 t λ1 + λ1 λ2 . f (t | λ1 , λ2 ) = √ (20) exp − 2 2πt 3/2 Parameter estimates are found by maximizing the likelihood function of the model.
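To make the estimation step concrete, here is a minimal sketch in Python (not the authors' code; scipy is an assumed tool, the data arrays are synthetic stand-ins rather than the retinopathy data, and only the 1 → 2 piece of the likelihood in (21) is shown). It implements the log of the density (20), uses the standard closed form of the inverse Gaussian survival function for the right-censored contributions, and maximizes the resulting log likelihood.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def ig_logpdf(t, lam1, lam2):
    # log of the inverse Gaussian density in the parameterization of Eq. (20), mean lam1/lam2
    return (np.log(lam1) - 0.5 * np.log(2 * np.pi) - 1.5 * np.log(t)
            - 0.5 * (lam1**2 / t + lam2**2 * t) + lam1 * lam2)

def ig_logsf(t, lam1, lam2):
    # log survival function 1 - F(t); standard closed form for the inverse Gaussian
    a = lam2 * np.sqrt(t) - lam1 / np.sqrt(t)
    b = lam2 * np.sqrt(t) + lam1 / np.sqrt(t)
    cdf = norm.cdf(a) + np.exp(2 * lam1 * lam2) * norm.cdf(-b)
    return np.log1p(-np.minimum(cdf, 1 - 1e-12))

# Hypothetical waiting times in state 1: x12 observed 1 -> 2 transitions, c1 censored spells.
rng = np.random.default_rng(0)
x12 = rng.gamma(3.0, 10.0, size=73)
c1 = rng.gamma(3.0, 12.0, size=58)

def neg_loglik(theta):
    lam1, lam2 = np.exp(theta)            # keep parameters positive
    return -(ig_logpdf(x12, lam1, lam2).sum() + ig_logsf(c1, lam1, lam2).sum())

fit = minimize(neg_loglik, x0=np.log([5.0, 0.2]), method='Nelder-Mead')
print(np.exp(fit.x))                      # MLEs of (lambda1, lambda2) for this transition

The full likelihood (21) simply adds the analogous density and censoring terms for the remaining transitions.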
Data on the waiting time in state 1 (no retinopathy) until transition to 2 (microaneurisms only) are available from 131 observations of which 58 are censored. There are 186 observations providing information on the waiting time in state 2 until transition to either state 1 or state 3. In state 2, 39 observations were observed to return to state 1, no retinopathy; 48 deteriorated to background retinopathy; and 99 had no change until the end of the study. The observed proportions estimate the transition probabilities p21 and p23 . There are 40 observations providing information on the waiting time in state 3. Of these 14 recovered to state 2 (microaneurisms only), 3 reached total blindness, the absorbing state 4, and 23 were censored. The overall likelihood function is L(λ1 , λ2 , µ1 , µ2 , γ , α, β, η | data) =
\prod_{i=1}^{73} f_{12}(x_i^{12} | λ_1, λ_2) \prod_{j=1}^{58} \left[ 1 − F_{12}(x_j^{*1} | λ_1, λ_2) \right]
× \prod_{i=1}^{39} f_{21}(x_i^{21} | α, β) \prod_{i=1}^{48} f_{23}(x_i^{23} | µ_1, µ_2) \prod_{j=1}^{99} \left[ 1 − \{ p_{21} F_{21}(x_j^{*2} | α, β) + p_{23} F_{23}(x_j^{*2} | µ_1, µ_2) \} \right]
× \prod_{i=1}^{14} f_{32}(x_i^{32} | η) \prod_{i=1}^{3} f_{34}(x_i^{34} | γ) \prod_{j=1}^{23} \left[ 1 − \{ p_{32} F_{32}(x_j^{*3} | η) + p_{34} F_{34}(x_j^{*3} | γ) \} \right].   (21)
The likelihood in (21) was maximized using the SPLUS function ms. The MLEs are: λ̂_1 = 7.9, λ̂_2 = 0.23, µ̂_1 = 7.2, µ̂_2 = 0.086, γ̂ = 146, α̂ = 6.13, β̂ = 3.11, η̂ = 19.7. With these estimates, the estimated MGF of the flowgraph from (19) is

M(s) = \frac{637}{1240(1 − 146.02s)} \exp\{2.44 − 7.91(0.05 − 2s)^{1/2} − 7.24(0.01 − 2s)^{1/2}\}
× \left[ 1 − \frac{13}{62(1 − 3.11s)^{6.13}} \exp\{1.82 − 7.91(0.05 − 2s)^{1/2}\} − \frac{343}{1240(1 − 19.71s)} \exp\{0.62 − 7.24(0.01 − 2s)^{1/2}\} \right]^{−1}.   (22)
This MGF is converted to a density, CDF, survival, or hazard function using the saddlepoint approximation of Section 3. For further details on parametric modelling, likelihood construction for incomplete data, and the estimation required to convert this MGF
Fig. 13. Results from the flowgraph model for diabetic retinopathy data.
to a density, CDF, survival, and hazard function using a saddlepoint approximation, see Yau and Huzurbazar (2002). Figure 13 gives the estimated density, CDF, survival function, and hazard function for the time to total blindness. The density is heavy tailed and the hazard rises sharply for large t as the CDF → 1. Flowgraphs model the actual waiting time distributions and then convert those to the corresponding hazards without making any direct assumptions about the hazard. Hazard shapes that are increasing and then decreasing, or vice versa, are common with flowgraphs.
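The conversion step itself can be sketched generically. Given a cumulant generating function K(s) = log M(s), the standard first-order saddlepoint approximation to the density at t is \exp\{K(\hat{s}) − \hat{s}t\}/\sqrt{2π K''(\hat{s})}, where \hat{s} solves K'(\hat{s}) = t. The following Python sketch (not the authors' code; it uses a simple exponential MGF so that the exact density is known for comparison, and numerical differentiation in place of analytic derivatives) illustrates the idea; the raw first-order approximation differs from the exact exponential density by a constant factor that is usually removed by renormalization.

import numpy as np
from scipy.optimize import brentq

mean = 146.0
M = lambda s: 1.0 / (1.0 - mean * s)      # MGF of an Exp with mean 146, valid for s < 1/mean
K = lambda s: np.log(M(s))                 # cumulant generating function

def d1(f, s, h=1e-6):                      # central first difference
    return (f(s + h) - f(s - h)) / (2 * h)

def d2(f, s, h=1e-5):                      # central second difference
    return (f(s + h) - 2 * f(s) + f(s - h)) / h**2

def saddlepoint_density(t, lo=-50.0):
    hi = 1.0 / mean - 1e-9
    s_hat = brentq(lambda s: d1(K, s) - t, lo, hi)     # solve K'(s_hat) = t
    return np.exp(K(s_hat) - s_hat * t) / np.sqrt(2 * np.pi * d2(K, s_hat))

for t in (50.0, 146.0, 400.0):
    exact = np.exp(-t / mean) / mean
    print(t, saddlepoint_density(t), exact)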
10. Summary

Flowgraph models provide greater generality for the analysis of time to event data arising from multistate models for disease progression. Flowgraphs allow nonexponential waiting times for each stage of the disease and provide a data analytic method for modelling semi-Markov processes. They do not make the proportional hazards assumption but rather model the waiting time densities directly and then use those to compute the corresponding hazard functions. This results in a variety of hazard shapes. In addition to censored data, flowgraphs also handle incomplete data by allowing likelihoods to be constructed for such data. The methodology accommodates feedback loops
thus handling recurrent events. It also accommodates feed-forward loops and multiple beginning and end points.
References

Aalen, O. (1995). Phase type distributions in survival analysis. Scand. J. Statist. 22, 447–463.
Barnett, O., Cohen, A. (2000). The histogram and boxplot for the display of lifetime data. J. Comput. Graphical Statist. 94, 759–778.
Butler, R.W., Huzurbazar, A.V. (1997). Stochastic network models for survival analysis. J. Amer. Statist. Assoc. 92, 246–257.
D'Azzo, J., Houpis, C. (1981). Linear Control System Analysis and Design: Conventional and Modern. McGraw-Hill, New York.
Diabetic Retinopathy Study Research Group (1981). A modification of the Airlie House classification of diabetic retinopathy: Report 7. Investigative Ophthalmol. Visual Sci. 21, 210–226.
Dorf, R., Bishop, R. (1995). Modern Control Systems. Addison-Wesley, Reading, MA.
Gajic, Z., Lelic, M. (1996). Modern Control Systems Engineering. Prentice-Hall, New York.
Gross, A.J., Clark, V.A., Liu, V. (1971). Estimation of survival parameters where one of two organs must function for survival. Biometrics 27, 369–377.
Huzurbazar, A.V. (1999). Flowgraph models for generalized phase type distributions with non-exponential waiting times. Scand. J. Statist. 26, 145–157.
Huzurbazar, A.V. (2000). Modeling and analysis of engineering systems data using flowgraph models. Technometrics 42, 300–306.
Huzurbazar, A.V. (2005). Flowgraph Models for Multistate Time to Event Data. Wiley, New York.
Huzurbazar, S., Huzurbazar, A.V. (1999). Survival and hazard functions for progressive diseases using saddlepoint approximations. Biometrics 55, 198–203.
Huzurbazar, S., Huzurbazar, A.V. (2000). Bayesian models for progressive diseases. Chilean J. Statist. 17, 29–43.
Longini, I.M., Clark, W.S., Byers, R.H., Ward, J.W., Darrow, W.W., Lemp, G.F., Hethcote, H.W. (1989). Statistical analysis of the stages of HIV infection using a Markov model. Statist. Medicine 8, 831–843.
Lorens, C.S. (1964). Flowgraphs for the Modeling and Analysis of Linear Systems. McGraw-Hill, New York.
Marshall, G., Jones, R. (1995). Multi-state models and diabetic retinopathy. Statist. Medicine 14, 1975–1983.
Mason, S.J. (1953). Feedback theory – some properties of signal flow graphs. Proc. IRE 41, 1144–1156.
Whitehouse, G. (1973). Systems Analysis and Design using Network Techniques. Prentice-Hall, Englewood Cliffs, NJ.
Whitehouse, G. (1983). Flowgraph analysis. In: Kotz, S., Johnson, N. (Eds.), Encyclopedia of Statistical Science, Vol. 3. Wiley, New York.
Wilson, S.R., Solomon, P.J. (1994). Estimates for different stages of HIV/AIDS disease. Comput. Appl. Biosci. 106, 681–683.
Yau, C., Huzurbazar, A. (2002). Analysis of censored and incomplete data using flowgraph models. Statist. Medicine 21 (23), 3727–3743.
Nonparametric Methods for Repair Models
Myles Hollander and Jayaram Sethuraman
1. Introduction

Many systems are maintained by administering some type of repair after each failure. Thus an important area of concern is the development and study of various repair models. In this chapter we focus on statistical aspects of various repair models. Our approach is nonparametric. Let F denote the distribution of the time to first failure of the system. The repair models we consider postulate that the distribution of interfailure times depends on F in some way. Important problems include the estimation of F, the development of simultaneous confidence bands for F, and goodness-of-fit tests for F. Contributions in these areas include nonparametric inference for F in the general repair model of Dorado et al. (1997) (DHS). The DHS model contains many other repair models in the literature such as the minimal repair model (cf. (Ascher, 1968)), the Brown and Proschan (1983) model, and models I and II of Kijima (1989). It also provides many new models using the notion of life supplements. Estimators of F and simultaneous confidence bands for F as derived in Dorado et al. (1997) thus also provide estimates and bands for F in those repair models that are special cases of the DHS model. The age-dependent minimal repair model of Block et al. (1985) (BBS) is not a special case of the DHS model and thus requires a separate treatment. Whitaker and Samaniego (1989) derived the nonparametric maximum likelihood estimator of F for the BBS model. Hollander et al. (1992) extended the Whitaker–Samaniego results to the whole line, provided asymptotic simultaneous confidence bands for F, and developed a two-sample Wilcoxon-type test. Let Λ be the cumulative hazard function corresponding to F. Augustin and Peña (1999, 2001) derived goodness-of-fit tests of H_0: Λ = Λ_0 where Λ_0 is completely specified. Using a Cox-type proportional hazards framework in conjunction with the BBS model, Augustin and Peña (2002) test H_0^C: λ ∈ C, where λ is an unspecified baseline hazard function, C = {λ_0(·, ξ)} and λ_0(·, ξ) is specified, except for the p × 1 vector ξ. In Section 2, we define some general repair models. Section 3 describes estimators and simultaneous confidence bands for F in the DHS model. Section 4 considers the BBS model and describes estimators and simultaneous confidence bands for F. Section 5 considers a Wilcoxon-type test for two BBS processes. Section 6 presents
goodness-of-fit tests in the BBS model. Section 7 concerns hypothesis tests for the minimal repair assumption.
2. General repair models

To define a repair model, one specifies the joint distribution of the failure times or, equivalently, the joint distribution of the interfailure times. Let {S_j} denote the failure times of the system, and let {T_j} denote the interfailure times, where T_j = S_j − S_{j−1} and, by definition,
S_0 = 0. Let F be the distribution of the time to first failure of the system. Data in a repair model will typically have dependencies that can be induced, for example, by the nature of the repair process and the specification of the period of observation. Although these dependencies complicate the mathematics, the data can be broken down into independent identically distributed random length sequences of failure and interfailure times until the next perfect repair. This often provides insight as well as a mathematical route to investigating the properties of the statistics, distribution functions, confidence bands, etc. that are computed with the data.

(i) Perfect repair model
In the perfect repair model, upon failure, a failed system is replaced by a new one having the same properties as the original. Thus the interfailure times are independent and identically distributed (iid) according to F.

(ii) Minimal repair model
In the minimal repair model, upon failure, the failed system is returned to a functioning state so that the distribution of the time until the next failure is the same as that of a system of the same age that has not yet failed. That is, the effective age of the system is unchanged from the effective age at failure. In this model, {S_j} is a Markov process with

P(S_j > x | S_{j−1} = y) = \frac{\bar{F}(x)}{\bar{F}(y)},   x > y.   (2.1)

Under this model N(t), the number of failures by time t, is a nonhomogeneous Poisson process with mean value function equal to the cumulative hazard function Λ of F, that is,

E[N(t)] = Λ(t) = \int_0^t \frac{dF(s)}{\bar{F}(s)}.   (2.2)

Many inferential procedures about Λ and its derivative λ(t) make use of this fact.

(iii) Brown–Proschan (1983) model
The Brown and Proschan (1983) model is an imperfect repair model where, at the time of each repair, two types of repair are possible. With probability p, a perfect repair is performed and with probability 1 − p a minimal repair is performed.
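A small simulation sketch (not from the chapter) may help fix ideas. It generates the successive failure ages of one Brown–Proschan process until its first perfect repair, using the minimal repair property (2.1) between perfect repairs; F is taken to be Weibull purely for illustration and all numerical choices are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
shape, scale, p = 1.5, 10.0, 0.25

def sbar(x):                      # survival function of F (Weibull)
    return np.exp(-(x / scale) ** shape)

def sbar_inv(u):                  # inverse survival function of F
    return scale * (-np.log(u)) ** (1.0 / shape)

def brown_proschan_ages():
    ages = []
    y = 0.0
    while True:
        # minimal-repair property (2.1): given current age y, the next failure age X
        # satisfies P(X > x | y) = sbar(x) / sbar(y), sampled here by inversion
        x = sbar_inv(rng.uniform() * sbar(y))
        ages.append(x)
        if rng.uniform() < p:     # perfect repair with probability p
            return np.array(ages)
        y = x                     # minimal repair: effective age unchanged

print(brown_proschan_ages())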
(iv) Block–Borges–Savits (1985) model
Block et al. (1985) generalized the Brown–Proschan model by allowing the probability of a perfect repair to depend on the age of the failed system. In the BBS model the probability of a perfect repair is p(t), where p(·) is a measurable function p : [0, ∞) → [0, 1]. Under the condition

\int_{(0,∞)} \frac{p(t)}{\bar{F}(t)}\, dF(t) = ∞,   (2.3)

Block, Borges and Savits showed that for continuous F, the waiting time between perfect repairs is almost-surely finite with distribution H given by

H(t) = 1 − \exp\left\{ −\int_{(0,t]} \frac{p(s)}{\bar{F}(s)}\, dF(s) \right\},   t ≥ 0.   (2.4)

(v) Kijima's (1989) models
Kijima (1989) introduced models to allow for repairs that are better than minimal but not as good as perfect repairs. In Kijima's models, after repair, the system is restored to an effective age that depends on its age just before failure as well as random variables expressing the degree of repair. Let A_{j+1} denote the effective age of the system after the jth repair, with A_1 = 0, and let D_j, j = 1, 2, . . . , be degree of repair random variables that are independently distributed on [0, 1] and independent of other processes. In Kijima's model I,

P(T_j > x | T_1, . . . , T_{j−1}, D_1, . . . , D_{j−1}) = \frac{\bar{F}(x + A_j)}{\bar{F}(A_j)},   (2.5)

where

A_j = \sum_{i=1}^{j−1} D_i T_i,   j > 1,   (2.6)

so that

A_{j+1} = A_j + D_j T_j.   (2.7)

In Kijima's model II,

P(T_j > x | T_1, . . . , T_{j−1}, D_1, . . . , D_{j−1}) = \frac{\bar{F}(x + A_j)}{\bar{F}(A_j)},   (2.8)

where

A_j = \sum_{k=1}^{j−1} \left( \prod_{i=k}^{j−1} D_i \right) T_k,   j > 1,   (2.9)

so that

A_{j+1} = D_j(A_j + T_j).   (2.10)
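The two recursions are straightforward to compute. The following sketch (illustrative only, not from the chapter; the interfailure times and repair degrees below are arbitrary) returns the effective ages A_1, A_2, . . . under either model.

import numpy as np

def kijima_ages(T, D, model="I"):
    A = [0.0]                                  # A_1 = 0
    for t, d in zip(T, D):
        if model == "I":
            A.append(A[-1] + d * t)            # Eq. (2.7): A_{j+1} = A_j + D_j T_j
        else:
            A.append(d * (A[-1] + t))          # Eq. (2.10): A_{j+1} = D_j (A_j + T_j)
    return np.array(A)

T = np.array([5.0, 3.0, 8.0])
D = np.array([0.5, 1.0, 0.2])
print(kijima_ages(T, D, "I"))      # effective ages under model I
print(kijima_ages(T, D, "II"))     # effective ages under model II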
When D_j = 1 with probability p and D_j = 0 with probability 1 − p, Kijima's model II reduces to the Brown–Proschan model. Uematsu and Nishida (1987) indicated that in general the state of a system after repair would depend not only on its age before failure but also on the degree of repair. Under their model the system has age q(S_1, . . . , S_j, D_1, . . . , D_j) after the jth repair is performed, and q(S_1, . . . , S_j, D_1, . . . , D_{j−1}) is the age of the system just before the jth repair. Kijima's models I and II are special cases of the Uematsu–Nishida model. Under Kijima's models, q(S_1, . . . , S_j, D_1, . . . , D_{j−1}) ≥ q(S_1, . . . , S_j, D_1, . . . , D_j), so that repair produces age reduction.

(vi) Dorado–Hollander–Sethuraman (1997) model
Dorado et al. (1997) defined a general repair model that contains many popular repair models and introduces many others. For any distribution F, θ ∈ (0, 1] and a ∈ [0, ∞), DHS consider the family of survival functions

\bar{F}_{a,θ}(x) = \bar{F}(θx + a)/\bar{F}(a).   (2.11)
The family of distributions {F_{a,θ}} satisfies a stochastic ordering property, namely, θ < θ′ implies F_{a,θ′} ≤_{st} F_{a,θ} for each a. That is, \bar{F}_{a,θ′}(x) ≤ \bar{F}_{a,θ}(x) for every x. Thus lower values of θ mean longer remaining life. The distribution F_{a,θ} is the life distribution of an item with an effective age of a and a life supplement of θ. The DHS model is based on two sequences {A_j}, {θ_j}, called the effective ages and life supplements, respectively, satisfying

A_1 = 0,   θ_1 = 1,   A_j ≥ 0,   θ_j ∈ (0, 1]   and   A_j ≤ A_{j−1} + θ_{j−1} T_{j−1},   j ≥ 2.   (2.12)

In the DHS model, the joint distribution of the {T_j} is:

P(T_j ≤ t | A_1, . . . , A_j, θ_1, . . . , θ_j, T_1, . . . , T_{j−1}) = F_{A_j,θ_j}(t).   (2.13)
We see that for j ≥ 1, the effective age A_{j+1} of the system after the jth repair is less than the effective age X_j = A_j + θ_j T_j just before the jth failure, and since θ_j ≤ 1, X_j in turn is less than the actual age S_j. Some special cases and examples of the DHS model are as follows. If we set θ_j = 1, A_j = 0 for j ≥ 1, we obtain the perfect repair model

P(T_j > t | T_1, . . . , T_{j−1}) = \bar{F}(t).

If we set θ_j = 1, A_j = S_{j−1}, j ≥ 1, we obtain the minimal repair model

P(T_j > t | S_{j−1}) = \bar{F}_{S_{j−1},1}(t).

If we set θ_j = 1 for each j and let A_j be defined by (2.6) so that A_{j+1} = A_j + D_j T_j, we obtain the Kijima I model,

P(T_j > t | T_i, D_i, 1 ≤ i ≤ j − 1) = \bar{F}_{A_j,1}(t).
If we set θ_j = 1 for each j and let A_j be defined by (2.9) so that A_{j+1} = D_j(A_j + T_j), we obtain the Kijima II model

P(T_j > t | T_i, D_i, 1 ≤ i ≤ j − 1) = \bar{F}_{A_j,1}(t).   (2.14)
Set θ_j = 1 for all j and define

A_{j+1} = \sum_{k=1}^{j} α_{jk} T_k   for j ≥ 1,   (2.15)
where the sequence of random variables {α_{jk}, j ≥ k ≥ 1} satisfies the following conditions:
(i) For each k, the random variable α_{jk} is measurable with respect to the σ-algebra generated by (D_1, . . . , D_j).
(ii) For each j, {α_{jk}} is an increasing sequence in k.
(iii) For each k, {α_{jk}} is a decreasing sequence in j.
When α_{jk} = \prod_{i=k}^{j} D_i, (2.15) is in the form A_{j+1} = D_j(A_j + T_j). This is Kijima's model II. Eq. (2.15) states that the effective age of the system just after repair is a weighted sum of the interfailure times where the weights satisfy conditions (i)–(iii). Condition (i) says that the assigned weights are independent of the age of the system and depend solely on the degrees of repair done. Condition (ii) says that the more recent interfailure times receive a greater weight than the more distant interfailure times in computing the effective age, i.e., T_j receives a weight that is at least as great as the weight given to T_i for i < j ≤ k − 1 in the computation of A_k. This is a way of formalizing that, on the presumption that repair does not eliminate all damage incurred by the system, any damage to the system that occurred in the more distant past has had more opportunities to be fixed than more recent damage. Condition (iii) says that the weight assigned to a particular T_i in the computation of A_j is at least as great as the weight assigned to the same T_i in computing A_k whenever j < k. This reflects the assumption that the current repair process is able to fix damage that occurred in the past but was left unfixed for some reason during previous repairs.
If, in the DHS model, we set θ_1 = 1, A_j = \sum_{i=1}^{j−1} θ_i T_i and 0 < θ_j < 1 for j > 1, we obtain what we call the supplemented life repair model. The term "supplemented life" is motivated as follows. If a minimal repair were performed at the time of the first failure, T_2 would have the distribution F_{T_1,1}. We can, however, provide a longer expected life for T_2 if we use the distribution F_{T_1,θ_2} for some θ_2 satisfying 0 < θ_2 < 1. Starting with the distribution F_{T_1,θ_2} for T_2 and using minimal repair after the second failure, T_3 would have the distribution F_{A_3,1} where A_3 = T_1 + θ_2 T_2. If we want a longer expected life for T_3 we can use the distribution F_{A_3,θ_3} for some 0 < θ_3 < 1. Continuing in this fashion yields the supplemented life model.

(vii) Last–Szekli (1998) model
The restriction θ_j ∈ (0, 1] imposed by the DHS model does not incorporate deterioration due to repair. Last and Szekli (1998) do allow deterioration due to repair by extending model (2.14) so that the larger range [0, ∞) is allowed for the factors {D_i}. Values of D_i
larger than 1 correspond to a deterioration due to repair. Last and Szekli (1998) showed that their repair model contains many proposed repair models in the literature including those of Stadje and Zuckerman (1991) and Baxter et al. (1996).
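Before turning to estimation, a small simulation sketch (not from the chapter) may help fix ideas: interfailure times are drawn from the family (2.11) by inversion, with F taken Weibull purely for illustration, and the effective ages are then updated according to a supplemented life rule with θ_j = 0.9 for j ≥ 2. All numerical choices are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
shape, scale = 2.0, 10.0
sbar = lambda x: np.exp(-(x / scale) ** shape)            # survival function of F
sbar_inv = lambda u: scale * (-np.log(u)) ** (1.0 / shape)

def draw_interfailure(a, theta):
    # P(T > x) = sbar(theta * x + a) / sbar(a)  =>  T = (sbar_inv(U * sbar(a)) - a) / theta
    u = rng.uniform()
    return (sbar_inv(u * sbar(a)) - a) / theta

a, theta, interfailures = 0.0, 1.0, []
for j in range(5):
    t = draw_interfailure(a, theta)
    interfailures.append(t)
    a = a + theta * t          # supplemented life rule: A_{j+1} = A_j + theta_j T_j
    theta = 0.9                # life supplement used for subsequent interfailure times
print(interfailures)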
3. Estimation in the DHS model

Relating the DHS model to a survival model provides insight as to how to obtain estimators and confidence bands for F. Dorado et al. (1997) fix a T > 0 and define two processes N(·) and Y(·) by

N(t) = \sum_j I\{X_j ≤ t, S_j ≤ T\}

and

Y(t) = \sum_j I\{A_j < t ≤ X_j ∧ [A_j + θ_j(T − S_{j−1})]\},

where X_j = A_j + θ_j T_j is the effective age just before the jth failure. Letting

δ_j = I(X_j ≤ A_j + θ_j(T − S_{j−1})) = I(S_j ≤ T)

and

\tilde{X}_j = X_j ∧ [A_j + θ_j(T − S_{j−1})],

the random variables

(\tilde{X}_1, δ_1), (\tilde{X}_2, δ_2), . . .

can be viewed as censored observations from a lifetime data model where the lifetime X_j of subject j is observed only if it is smaller than A_j + θ_j(T − S_{j−1}). Thus one can think of the repair model as a survival model where the subject j enters the study at A_j and dies during the study at age X_j or leaves the study by age A_j + θ_j(T − S_{j−1}). Hence the process N(t) can be viewed as the number of observed (uncensored) deaths by time t and Y(t) represents the number at risk at time t. Let Λ be the hazard function of F and define the process

M(t) = N(t) − \int_0^t Y(s)\, dΛ(s).
Dorado et al. (1997) recognized that it is not necessary to establish that {M(t)} is a martingale or that {\int_0^t Y(s)\, dΛ(s)} is its compensator. They proved that E[M(t)] = 0 and

cov(M(t), M(t′)) = \int_0^{t∧t′} E(Y)(1 − ΔΛ)\, dΛ.   (3.1)

Dorado et al. (1997) assumed that n independent copies of the processes N and Y are observed on a finite interval. Let N_n and Y_n denote the sum of the n independent
copies of N and Y, respectively. Let Λ denote the cumulative hazard function of F. The Nelson–Aalen estimator of Λ is

\hat{Λ}_n(t) = \int_0^t \frac{J_n}{Y_n}\, dN_n,

where J_n(t) = I(Y_n(t) > 0) for t ∈ (0, T]. Since

F(t) = \int_0^t (1 − F(s−))\, dΛ(s),

it is natural to require an estimator \hat{F}_n to satisfy

\hat{F}_n(t) = \int_0^t (1 − \hat{F}_n(s−))\, d\hat{Λ}_n(s).

The solution of this Volterra integral equation is

1 − \hat{F}_n(t) = \prod_{s ≤ t} (1 − d\hat{Λ}_n(s)),   (3.2)

where \prod_{s ≤ t}(1 − d\hat{Λ}_n(s)) denotes the product integral (see Gill and Johansen (1990)). Let M_n = N_n − \int Y_n\, dΛ. This is the sum of n iid processes in D[0, T] with mean 0 and covariance function given by (3.1). It follows that W_n(t) = n^{−1/2} M_n(t), 0 ≤ t ≤ T, will converge to a Gaussian process if tightness can be established. This is done in Dorado et al. (1997, Theorem 5.1). It can be shown that

\frac{\hat{F}_n(t) − F(t)}{\bar{F}(t)} = \int_0^t \frac{(1 − \hat{F}_n(s−)) J_n(s)}{\bar{F}(s)(Y_n(s)/n)}\, dM_n(s).   (3.3)

Let

C(t) = \int_0^t \frac{dF}{E(Y)(1 − F)}.

Assume that F(T) < 1 and F is an increasing failure rate distribution. From the continuity mapping principle, and a result on the uniform convergence of the integrand in (3.3), Dorado et al. (1997, Corollary 5.1) show that

\sqrt{n}\, \frac{\hat{F}_n − F}{\bar{F}} ⇒ B(C)   on D[0, T],

where B denotes Brownian motion on [0, ∞). They also established

\sqrt{n}\, (1 − K)\, \frac{\hat{F}_n − F}{\bar{F}} ⇒ B^0(K)   on D[0, T],

where B^0 denotes a Brownian bridge on [0, 1] and K = C/(1 + C). Dorado et al. (1997) also derived a simultaneous confidence band for F. For t ∈ [0, T], let L_n = I(\hat{F}_n(t) < 1) and set

\hat{C}_n(t) = \int_0^t \frac{J_n L_n\, d\hat{F}_n}{(Y_n/n)(1 − \hat{F}_n)},   \hat{K}_n(t) = \frac{\hat{C}_n(t)}{1 + \hat{C}_n(t)}.
For t such that \hat{F}_n(t) = 1, set \hat{K}_n(t) = 1. A nonparametric asymptotic simultaneous confidence band for F with confidence coefficient at least 100(1 − α)% is

\hat{F}_n ± n^{−1/2} λ_α \frac{1 − \hat{F}_n}{1 − \hat{K}_n},   (3.4)

where λ_α is such that

P\left( \sup_{t ∈ [0,1]} |B^0(t)| ≤ λ_α \right) = 1 − α.   (3.5)
Let X_{(1)}, X_{(2)}, . . . , X_{(r)} be the distinct ordered values of the X's whose corresponding failure times are within [0, T]. Also, let δ_j be the number of observations with value X_{(j)}. Then for computational purposes we note that

1 − \hat{F}_n(t) = \prod_{X_{(j)} ≤ t} \left( 1 − \frac{δ_j}{Y_n(X_{(j)})} \right)

and

\hat{C}_n(t) = n \sum_{X_{(j)} ≤ t} \frac{\hat{F}_n(X_{(j)}) − \hat{F}_n(X_{(j−1)})}{Y_n(X_{(j)})(1 − \hat{F}_n(X_{(j)}))}.
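These computational formulas translate directly into code. The sketch below is illustrative only (not from the paper): each at-risk spell is summarized by its entry age A_j, exit age \tilde{X}_j and failure indicator δ_j, λ_α ≈ 1.358 is the usual 95% constant from (3.5), and the calculation assumes 1 − \hat{F}_n stays positive over the range used (see the remark that follows on what happens otherwise).

import numpy as np

def dhs_estimate(entry_age, exit_age, failed, n, lam_alpha=1.358):
    # lam_alpha ~ 1.358 gives an approximate 95% band via (3.5)
    x = np.unique(exit_age[failed == 1])              # distinct uncensored effective ages X_(j)
    d = np.array([np.sum((exit_age == xi) & (failed == 1)) for xi in x])
    Y = np.array([np.sum((entry_age < xi) & (exit_age >= xi)) for xi in x])   # Y_n(X_(j))
    sbar = np.cumprod(1.0 - d / Y)                    # 1 - Fhat_n at the X_(j)
    Fhat = 1.0 - sbar
    dF = np.diff(np.concatenate(([0.0], Fhat)))
    C = n * np.cumsum(dF / (Y * sbar))                # Chat_n(X_(j))
    half = lam_alpha / np.sqrt(n) * sbar * (1.0 + C)  # band half-width (1 - Fhat_n)/(1 - Khat_n)
    return x, Fhat, Fhat - half, Fhat + half

# toy data: 5 at-risk spells contributed by n = 3 systems
entry_age = np.array([0.0, 0.0, 0.0, 2.0, 1.5])
exit_age  = np.array([3.0, 5.0, 2.0, 6.0, 4.0])
failed    = np.array([1,   0,   1,   0,   1])
print(dhs_estimate(entry_age, exit_age, failed, n=3))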
In practice, it may be that the data obtained lead to \hat{F}_n(t_0) = 1 for some 0 < t_0 < T. When this happens, the data give us a confidence band only on the interval [0, σ), where σ = inf{t ∈ [0, T]: \hat{F}_n(t) = 1}.
Dorado et al. (1997) presented Monte Carlo simulation studies of the coverage probabilities of the simultaneous confidence bands given by (3.4). They considered two models, viz. Kijima II where the D_j's are U[0, 1] (called Model A) and the supplemented life model with θ_j = \prod_{i=1}^{j−1} D_i where the D_j's are U[0.8, 1] (called Model B). They considered the situations where F is gamma with α = 3, 5, 7 and Weibull with α = 1, 1.5, 2. For the gamma distribution T was taken to be equal to 10, and for the Weibull T was chosen to be 2. The sample sizes considered were n = 10, 20, 30, 50, 100, 200 and the nominal coverage probabilities were taken to be 0.90, 0.95 and 0.99. In Model A, the estimated actual coverage probability was within 2% of the nominal value for sample sizes as low as 20. In Model B, a sample size of 50 was sufficient in most cases to obtain the "2%" accuracy. This is because there are more failures per sample in Model A than in Model B. In most cases, the bands did better as F(T) moved farther away from 1 (i.e., large values of α for the gamma and small values of α for the Weibull). This is consistent with the fact that the DHS large sample results require F(T) < 1. For details of the simulation results, see Tables 1–4 of Dorado et al. (1997). Gäertner (2000), in the context of the Last and Szekli (1998) model, obtained results similar to those of Dorado et al. (1997). He estimates F and Λ from one observation of the repairable system. Peña and Hollander (2003), in a recurrent event setting, allow the incorporation of covariates into the Dorado et al. (1997) and Last and Szekli (1998) models.
If we set Aj = 0, θj = 1 for all j in the Dorado et al. (1997) model, then the estimator given by (3.2) reduces to the product limit estimator derived by Gill (1981) in a “testing with replacement” setting. Furthermore, the band given by (3.4) provides a band for F in Gill’s setting.
4. Estimation in the BBS model

Whitaker and Samaniego (1989) proposed an estimator for F for the BBS model. Using counting process techniques, Hollander et al. (1992) extended the large sample theorems of Whitaker and Samaniego to the whole real line. Hollander et al. (1992) also derived a nonparametric asymptotic simultaneous confidence band for F.
In the BBS model, upon failure at age t, with probability p(t) a perfect repair is performed whereas with probability 1 − p(t) a minimal repair is performed. For j = 1, . . . , n, let {X_{j,0} ≡ 0, X_{j,1}, X_{j,2}, . . .} be independent record value processes from F. These are Markov processes with P(X_{j,k} > t | X_{j,0}, . . . , X_{j,k−1}) = \bar{F}(t)/\bar{F}(X_{j,k−1}), for t > X_{j,k−1}, k ≥ 1. Let τ_F be the (possibly infinite) upper endpoint of the support of F. If ΔF(τ_F) > 0, define X_{j,l} = ∞ for all l larger than the first k for which X_{j,k} = τ_F. Take p(τ_F) = 1 in all cases. These processes represent the failure ages of n systems under a "forever minimal repair" model. Perfect repair is introduced into the model by letting {U_{j,k}: 1 ≤ j ≤ n, k ≥ 1} be iid uniform random variables. Let

δ_{j,k} = I(U_{j,k} ≤ p(X_{j,k}))   (4.1)

and

ν_j = inf\{k: δ_{j,k} = 1\}.   (4.2)

Then observing

(X_{j,1}, . . . , X_{j,ν_j}),   j = 1, . . . , n,   (4.3)

is equivalent to observing n independent copies of the BBS process, each until the time of its first perfect repair. Hollander et al. (1992) used this structure and martingale methods to establish their asymptotic results. Let

N(t) = \#\{(j, k): X_{j,k} ≤ t, k ≤ ν_j, 1 ≤ j ≤ n\}

and

Y(t) = \#\{j: X_{j,ν_j} ≥ t, 1 ≤ j ≤ n\}.

Assume F is continuous and the pair (F, p) satisfies

\bar{F}(τ_F) = 0   with F(τ_F) = 1   and   \int_0^{τ_F} \frac{p(t)}{\bar{F}(t)}\, dF(t) = ∞.   (4.4)

Let X_{(k)} be the kth ordered value of {X_{j,s}: s ≤ ν_j, 1 ≤ j ≤ n} and let

T = \min\{X_{(k)}: Y(X_{(k)}) = 1\}.
That is, the sampling scheme here is to observe the n BBS processes, each until the time of its first perfect repair, but stop at the first failure age such that only one process remains at risk (that is, has not yet experienced a perfect repair). Then the Whitaker–Samaniego estimator can be written as

1 − \hat{F}(t) = \prod_{s ≤ t} (1 − d\hat{Λ}(s)),   (4.5)

where

\hat{Λ}(t) = \int_0^t \frac{J(s)}{Y(s)}\, dN(s)

and J(s) = I(s ≤ T). By employing an appropriate filtration of σ-fields to account for the BBS model of minimal and perfect repairs through the function p(t), Hollander et al. (1992) find the compensator of {N(t)} and show that

M(t) = N(t) − \int_{(0,t]} Y(s)\, dΛ(s)

is a locally square-integrable martingale with a predictable variation process ⟨M⟩ given by

⟨M⟩(t) = \int_{(0,t]} Y(s)(1 − ΔΛ(s))\, dΛ(s).

The estimator \hat{F}(t) given by (4.5) is related to the process {M(t)} by the relation

\frac{\hat{F}(t) − F(t)}{\bar{F}(t)} = \int_{(0,t]} \frac{1 − \hat{F}(s−)}{\bar{F}(s) Y(s)}\, dM(s).   (4.6)
Using Rebolledo's (1980) martingale central limit theorem and the methods of Gill (1983), Hollander et al. (1992) showed that, assuming F is continuous and (4.4) is satisfied, (i), (ii) and (iii) below hold.
(i) As n → ∞,

\sqrt{n}\,(\hat{F} − F) ⇒ \bar{F} · B(C)   in D[0, ∞],

where B is Brownian motion on [0, ∞) and

C(t) = \int_0^t \frac{dF(s)}{\bar{H}(s)\bar{F}(s)},

where H is given by (2.4).
(ii) As n → ∞,

\sqrt{n}\,(1 − K)\, \frac{\hat{F} − F}{\bar{F}} ⇒ B^0(K)   in D[0, ∞],

where B^0 is a Brownian bridge on [0, 1] and K = C/(1 + C).
(iii) As n → ∞,

\sqrt{n}\,(1 − \hat{K})\, \frac{\hat{F} − F}{\bar{F}} ⇒ B^0(K)   in D[0, τ]   (4.7)

for any τ < τ_F, where \hat{K} = \hat{C}/(1 + \hat{C}),

\hat{C}(t) = \int_0^t \frac{d\hat{F}(s)}{(1 − \hat{H}(s))(1 − \hat{F}(s))}   (4.8)

and \hat{H} is the empirical cdf of the X_{j,ν_j}. From (4.7) a nonparametric asymptotic simultaneous confidence band for F, on [0, τ], is shown to be of the form

\hat{F} ± \frac{λ_α}{\sqrt{n}} \cdot \frac{1 − \hat{F}}{1 − \hat{K}},   (4.9)
where λ_α is defined by (3.5). With \hat{H} the empirical cdf of the X_{j,ν_j}, we have 1 − \hat{H}(s−) = Y(s)/n, and \hat{C}(t) simplifies to

\hat{C}(t) = \sum_{X_{(k)} ≤ t} \frac{n}{Y(X_{(k)})(Y(X_{(k)}) − 1)}.
When there are ties in the data, take T to be the first age at which the number of units failing is equal to the number at risk, that is, the first t for which ΔN(t) = Y(t). Then take \hat{F} to be

1 − \hat{F}(t) = \prod_{s ≤ t} \left( 1 − \frac{ΔN(s)}{Y(s)} \right)

and 1 − \hat{K} = 1/(1 + \hat{C}), where

\hat{C}(t) = \sum_{s ≤ t} \frac{n\, ΔN(s)}{Y(s)(Y(s) − ΔN(s))}.
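As with the DHS estimator, these formulas are straightforward to program. The following sketch (illustrative only, not from the paper) computes the Whitaker–Samaniego estimate and the band (4.9) from the pooled failure ages and the perfect-repair age of each system; truncating just before the first age at which ΔN = Y, so that \hat{C} stays finite, is a simplification standing in for the choice of an interval [0, τ] with τ < τ_F.

import numpy as np

def ws_estimate(failure_ages, repair_ages, lam_alpha=1.358):
    # failure_ages: all observed failure ages pooled over the n systems (the last failure
    # of each system is its perfect-repair age); repair_ages: the n perfect-repair ages.
    n = len(repair_ages)
    t = np.unique(failure_ages)
    dN = np.array([np.sum(failure_ages == ti) for ti in t])     # Delta N(t)
    Y = np.array([np.sum(repair_ages >= ti) for ti in t])       # number still at risk
    keep = Y > dN                         # truncate just before the first age with Delta N = Y
    if not keep.all():
        first_bad = np.argmin(keep)
        t, dN, Y = t[:first_bad], dN[:first_bad], Y[:first_bad]
    sbar = np.cumprod(1.0 - dN / Y)                  # 1 - Fhat(t)
    C = np.cumsum(n * dN / (Y * (Y - dN)))           # Chat(t)
    Fhat = 1.0 - sbar
    half = lam_alpha / np.sqrt(n) * sbar * (1.0 + C) # (1 - Fhat)/(1 - Khat), Eq. (4.9)
    return t, Fhat, Fhat - half, Fhat + half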
Hollander et al. (1992) presented Monte Carlo simulation studies of the coverage probabilities of the simultaneous bands given by (4.9). The BP process was simulated with p taking the values 0.10, 0.25, and 0.50. The choices for the underlying distribution F were exponential, gamma with shape parameter α = 2 or 5, and Weibull with shape parameter α = 1/2 or 3/2. In each case the scale parameter was taken to be 1. The interval over which the bands were computed (the interval of estimation) was taken to be the 90th, 95th and 99th percentile of F. The sample sizes considered were n = 10, 20, 30, 50, 100, 200 and the nominal coverage probabilities were taken to be 0.90, 0.95, and 0.99. In all cases considered a sample size of 100 was sufficient to ensure the true coverage probability was at worst one or two percent less than the nominal coverage probability. In most cases, a sample size of 50 was adequate. The simulations also indicated that as the interval of estimation increased, larger sample sizes were needed to approximate the nominal confidence level well. Also, coverage probabilities for p = 0.5
were generally higher than those for p = 0.25 and p = 0.10. Within the cases considered, the choice of F did not have a strong effect on the coverage probability. Kvam et al. (2002) considered a framework where one observes n identical and independent systems, each of which undergoes minimal repair at each failure. They use isotonic regression techniques to find the maximum likelihood estimator of F , the distribution of the time to first failure for each system, when F is known to have an increasing failure rate. They also extend their results to the Brown–Proschan model.
5. A two-sample test in the BBS model

Hollander et al. (1992) considered the situation where one observes two independent BBS processes and it is desired to test whether the distribution of the time to first failure for process 1 is equal to the distribution of the time to first failure for process 2. They assume that for i = 1, 2, n_i BBS processes from (F_i, p_i) are observed, each until its first perfect repair. They test

H_0: F_1 = F_2   (5.1)

using a Mann–Whitney type statistic W described below. The statistic is based on

W = \int \hat{F}_1\, d\hat{F}_2,   (5.2)

where \hat{F}_i is the WS estimator (see (4.5)) for i = 1, 2. Hollander et al. (1992) assume that F_1 and F_2 are continuous and

\int_0^{τ_{F_i}} \frac{p_i(t)}{\bar{F}_i(t)}\, dF_i(t) = ∞,   i = 1, 2.

Under these assumptions they show that, if n_1, n_2 → ∞ in such a way that n_1/(n_1 + n_2) → λ, 0 < λ < 1, then

(n_1 + n_2)^{1/2} \left( W − \int F_1\, dF_2 \right) → N\left( 0, \frac{σ_1^2}{λ} + \frac{σ_2^2}{1 − λ} \right),   (5.3)

where
σ_1^2 = 2 \int_0^∞ \int_t^∞ \bar{F}_1(s)\, \bar{F}_1(t)\, C_1(t)\, dF_2(s)\, dF_2(t),

σ_2^2 = 2 \int_0^∞ \int_t^∞ \bar{F}_2(s)\, \bar{F}_2(t)\, C_2(t)\, dF_1(s)\, dF_1(t),

where C_1, C_2 are obtained from (4.8). Under H_0: F_1 = F_2 = F (say), σ_1^2 reduces to

σ_1^2 = \frac{1}{4} \int_0^∞ \frac{\bar{F}^3(s)}{\bar{H}_1(s−)}\, dF(s)   (5.4)

and, under H_0, σ_2^2 reduces to (5.4) with \bar{H}_2 replacing \bar{H}_1.
To test H_0, an asymptotically distribution-free test refers

Z = \frac{W − \frac{1}{2}}{\left( \frac{\hat{σ}_1^2}{n_1} + \frac{\hat{σ}_2^2}{n_2} \right)^{1/2}}   (5.5)
to a N(0, 1) distribution. In (5.5), \hat{σ}_i^2 is a consistent estimator of σ_i^2 and is given by

\hat{σ}_i^2 = \frac{1}{4} \int_0^∞ \frac{(1 − \hat{F}_i(s))^3}{1 − \hat{H}_i(s−)}\, d\hat{F}_i(s),

where \hat{H}_i is the empirical distribution of the perfect repair ages in the ith sample.
In the Brown–Proschan model, the p_i are constants and, under H_0, \bar{H}_i = \bar{F}^{p_i}. The p_i's are consistently estimated by the ratio of n_i to the total number of failures in the ith sample, and H_0 can be tested by treating Z as having an asymptotic N(0, 1) distribution, where

Z = \left( W − \frac{1}{2} \right) \left[ \frac{1}{4n_1(4 − \hat{p}_1)} + \frac{1}{4n_2(4 − \hat{p}_2)} \right]^{−1/2}.

If p_1 = p_2 = 1, the setting further reduces to the standard iid model, the Whitaker–Samaniego estimators reduce to the empirical cumulative distribution functions, and W reduces to a multiple of the Mann–Whitney form of the two-sample Wilcoxon rank sum statistic. Then, in agreement with the results for that iid model,

\left( W − \frac{1}{2} \right) \left[ \frac{1}{12} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right]^{−1/2} → N(0, 1).
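A minimal sketch of the two-sample calculation is given below (illustrative only, not from the paper; the helper names are ours, each sample is summarized by its pooled failure ages and the perfect-repair ages of its systems, and ties and truncation issues are ignored).

import numpy as np

def ws_fit(failures, repairs):
    # Whitaker-Samaniego fit for one sample: jump points, cdf, survival, at-risk counts
    t = np.unique(failures)
    dN = np.array([np.sum(failures == ti) for ti in t])
    Y = np.array([np.sum(repairs >= ti) for ti in t])
    sbar = np.cumprod(1.0 - dN / Y)
    return t, 1.0 - sbar, sbar, Y

def step_eval(tq, t, F):
    # evaluate a right-continuous step function with jumps at t and values F, at the points tq
    idx = np.searchsorted(t, tq, side="right") - 1
    return np.where(idx < 0, 0.0, F[np.maximum(idx, 0)])

def two_sample_test(f1, r1, f2, r2):
    n1, n2 = len(r1), len(r2)
    t1, F1, S1, Y1 = ws_fit(f1, r1)
    t2, F2, S2, Y2 = ws_fit(f2, r2)
    dF2 = np.diff(np.concatenate(([0.0], F2)))
    W = np.sum(step_eval(t2, t1, F1) * dF2)          # W = int F1hat dF2hat, Eq. (5.2)
    def sig2(F, S, Y, n):                            # (1/4) int (1 - Fhat)^3 / (1 - Hhat(s-)) dFhat
        dF = np.diff(np.concatenate(([0.0], F)))
        return 0.25 * np.sum(S**3 * dF / (Y / n))
    Z = (W - 0.5) / np.sqrt(sig2(F1, S1, Y1, n1) / n1 + sig2(F2, S2, Y2, n2) / n2)
    return W, Z                                      # refer Z to N(0, 1) as in (5.5)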
6. Goodness-of-fit tests in the BBS model

Using the Hollander et al. (1992) structure given by (4.1), (4.2) and (4.3), Augustin and Peña (1999) derive goodness-of-fit tests for the cumulative hazard function Λ(t) = \int_0^t dF(s)/\bar{F}(s) when the data follow the BBS model. They test

H_0: Λ(·) = Λ_0(·)   (6.1)
versus

H_1: Λ(·) ≠ Λ_0(·),   (6.2)
where Λ_0 is a completely specified cumulative hazard function. They consider two cases, viz. (i) the probability p(·) of a perfect repair is known or is a constant function, and (ii) p(·) is known to belong to some specified parametric family but the true value of the parameter is unknown.
Let F be continuous with density function f. Let λ(t) = f(t)/\bar{F}(t) and λ*(t) = [1 − p(t)]λ(t). Let Λ*(t) = \int_0^t λ*(s)\, ds. Let POI(µ) denote a Poisson distribution with mean µ.

THEOREM (Augustin and Peña (1999)). Let (X_1, . . . , X_ν) denote a BBS process as defined by (4.1)–(4.3). Then, for w ∈ (0, ∞) and k ∈ {1, 2, . . .}:
(a) ν − 1 | X_ν = w ∼ POI(Λ*(w)).
(b) (X_1, . . . , X_{ν−1}) | [(X_ν, ν) = (w, k)] =_d (V_{(1)}, . . . , V_{(k−1)}), where (V_{(1)}, . . . , V_{(k−1)}) are order statistics of size k − 1 from the density function g(v | w) = [λ*(v)/Λ*(w)] I_{[0,w]}(v).
(c) (Λ*(X_1)/Λ*(X_ν), . . . , Λ*(X_{ν−1})/Λ*(X_ν)) | (ν = k) =_d (U_{(1)}, . . . , U_{(k−1)}), where U_{(1)}, . . . , U_{(k−1)} are order statistics of size k − 1 from a U[0, 1] distribution.

The Augustin and Peña (1999) tests of H_0 (6.1) versus H_1 (6.2) are conditional on the vector (ν_1, . . . , ν_n). For the case where p(·) is known, at the approximate α-level of significance, and conditional on (ν_1, . . . , ν_n) = (k_1, . . . , k_n),

reject H_0 if S_n^* ≥ χ_{1,α}^2;   accept H_0 otherwise,   (6.3)
where

S_n^* = \frac{12}{\sum_{j=1}^{n} (k_j − 1)} \left[ \sum_{j=1}^{n} \sum_{i=1}^{k_j − 1} \left( \frac{Λ_0^*(X_{j,i})}{Λ_0^*(X_{j,k_j})} − \frac{1}{2} \right) \right]^2,   (6.4)
where Λ_0^*(t) = \int_0^t (1 − p(s))\, dΛ_0(s) and χ_{1,α}^2 is the 100(1 − α)% percentile point of the chi-squared distribution with 1 df. The test is justified as follows. From part (c) of the previous theorem and conditional on (ν_1, . . . , ν_n) = (k_1, . . . , k_n),

\sum_{j=1}^{n} \sum_{i=1}^{k_j − 1} \frac{Λ_0^*(X_{j,i})}{Λ_0^*(X_{j,k_j})} =_d \sum_{j=1}^{n} \sum_{i=1}^{k_j − 1} U_{j,i},   (6.5)
where the U's are iid U[0, 1]. For large n, using the central limit theorem, the distribution of the random variable on the right-hand side of (6.5) can be approximated by a normal distribution with mean \sum_{j=1}^{n} (k_j − 1)/2 and variance \sum_{j=1}^{n} (k_j − 1)/12. Hence the distribution of S_n^* is approximately that of a chi-squared distribution with 1 df. Augustin and Peña (1999) point out that, even if n = 1, if k_1 is sufficiently large, the distribution of S_n^* is still approximately that of a chi-squared distribution with one degree of freedom.
For the situation where p(·) is known to belong to a parametric family P = {p(·, θ), θ ∈ Θ ⊆ R^k}, but where the true value θ_0 of θ is not known, Augustin and Peña (1999) utilize a multivariate counting process and martingale theory to produce an analog of S_n^*. Let
η_0(t, θ) = [1 − p(t, θ)]λ_0(t)   and   Λ_0^*(t, θ) = \int_0^t η_0(s, θ)\, ds.
Their test is:

reject H_0 if \frac{Q^2(T, \hat{θ})}{\hat{σ}^2(T, \hat{θ})} ≥ χ_{1,α}^2;   accept H_0 otherwise.   (6.6)

In (6.6), T is the upper endpoint of the study period,

Q(T, θ) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \sum_{i=1}^{ν_j − 1} \left( \frac{Λ_0^*(X_{j,i}, θ)}{Λ_0^*(X_{j,ν_j}, θ)} − \frac{1}{2} \right),
θˆ is a consistent estimator of θ (such as the maximum likelihood estimator), and σˆ 2 (T , θˆ ) is a consistent estimator of the asymptotic variance of Q(T , θˆ ). Several choices are suggested in Augustin and Peña (1999). Augustin and Peña (2001) developed, for data from a BBS model, tests of H0 : λ(·) = λ0 (·), where λ0 is a completely specified failure rate. Augustin and Peña (2002), using a Cox-type proportional hazards structure, test H0C : λ ∈ C, where λ is an unspecified baseline hazard function, C = {λ0 (·, ξ )} and λ0 (·, ξ ) is specified, except for the p × 1 vector ξ . Their methods apply to the BBS model and the methods admit a general covariate structure.
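As a small illustration of the known-p(·) case, the following sketch (not from the paper; the data below are synthetic and Λ_0^* is supplied by the user as a vectorized function) computes S_n^* of (6.4) and the corresponding approximate p-value from its χ²_1 reference distribution.

import numpy as np
from scipy.stats import chi2

def sn_star(processes, Lambda0_star):
    # processes: list of arrays [X_{j,1}, ..., X_{j,k_j}], the last entry being the
    # perfect-repair age of process j; Lambda0_star: function t -> int_0^t (1 - p(s)) dLambda_0(s)
    num, denom = 0.0, 0
    for ages in processes:
        kj = len(ages)
        if kj < 2:
            continue                          # no interior failure ages to contribute
        r = Lambda0_star(np.asarray(ages[:-1])) / Lambda0_star(ages[-1])
        num += np.sum(r - 0.5)
        denom += kj - 1
    S = 12.0 / denom * num**2
    return S, chi2.sf(S, df=1)                # statistic of (6.4) and approximate p-value

# Example under H0: Lambda_0 exponential with rate 1 and constant p = 0.3,
# so Lambda0_star(t) = 0.7 * t.
processes = [np.array([0.4, 1.1, 2.3]), np.array([0.8, 2.9]), np.array([1.7])]
print(sn_star(processes, lambda t: 0.7 * t))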
7. Testing the minimal repair assumption in the BBS model

Presnell et al. (1994) proposed two asymptotically distribution-free tests of the minimal repair assumption in the BBS model. Many of the inferential procedures and their supporting asymptotic distribution theory (such as those, for example, considered in Sections 4, 5, and 6 of this paper) depend heavily on the assumption that imperfectly repaired systems are minimally repaired. Hence it is useful to have tests of the minimal repair assumption. As in the previous BBS sections, we assume that n systems are observed under the BBS model, each until the time of its first perfect repair or, equivalently, a single system is assumed to be observed until the time of its nth perfect repair. The motivation for the PHS tests (say), based on the initial failure times, is as follows. The empirical survival function \bar{F}_e of the n initial failure times provides a consistent estimator of \bar{F}, the survival function of the time to first failure of the system. Under the BBS model, the WS estimator \hat{F} given by (4.5) also provides a consistent estimator of F. If, however, the minimal repair assumption does not hold, \hat{F} may diverge from F. Presnell et al. (1994) proposed a Kolmogorov–Smirnov-type test based on the maximum absolute difference between \hat{F} and F_e, and a Wilcoxon-type test based on \int F_e\, d\hat{F}.
Let \hat{H} be the empirical cdf of the X_{j,ν_j} of (4.3), that is, the empirical cdf of the perfect repair ages. Let
\hat{L}(t) = \frac{1}{1 − \hat{F}(t)} − 1 − \hat{C}(t),
where \hat{C}(t) is given by (4.8), and let \hat{G} = \hat{L}/(1 + \hat{L}). Corollary 2.3 of Presnell et al. (1994) shows that a test of the minimal repair assumption can be performed by referring

S_τ = \sup_{0 ≤ t ≤ τ} \sqrt{n}\,(1 − \hat{G}(t))\, \frac{|\hat{F}(t) − F_e(t)|}{1 − \hat{F}(t)}   (7.1)

to a table of the distribution of the supremum of the absolute value of the Brownian bridge over the interval [0, \hat{G}(τ)] (cf. Hall and Wellner, 1980; Koziol and Byar, 1975). Corollary 2.4 of Presnell et al. (1994) justifies a test of the minimal repair assumption based on referring
V^* = \sqrt{n}\, \left( V − \frac{1}{2} \right) / \hat{σ}   (7.2)

to a N(0, 1) distribution, where
V = \int_0^∞ F_e\, d\hat{F}
and

\hat{σ}^2 = \frac{1}{12} − \frac{1}{4} \int_0^∞ \frac{(1 − \hat{F}(s))^3}{1 − \hat{H}(s−)}\, d\hat{F}(s).   (7.3)
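A minimal sketch of the Wilcoxon-type calculation (7.2)–(7.3) is given below (illustrative only; the Whitaker–Samaniego fit is assumed to be summarized, as in Section 4, by its jump points t, values Fhat and at-risk counts Y, and first_failures holds the initial failure ages X_{j,1} that define F_e).

import numpy as np

def v_star(first_failures, t, Fhat, Y, n):
    dF = np.diff(np.concatenate(([0.0], Fhat)))
    Fe = np.array([np.mean(first_failures <= ti) for ti in t])    # empirical cdf F_e at jump points
    V = np.sum(Fe * dF)                                           # V = int F_e dFhat
    sbar = 1.0 - Fhat
    sigma2 = 1.0 / 12.0 - 0.25 * np.sum(sbar**3 * dF / (Y / n))   # Eq. (7.3), with 1 - Hhat(s-) = Y/n
    return np.sqrt(n) * (V - 0.5) / np.sqrt(sigma2)               # refer to N(0, 1), Eq. (7.2)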
EXAMPLE. Presnell et al. (1994) used the classic Proschan (1963) data on intervals between failures of Boeing air conditioner systems for 13 Boeing 720 jet airplanes (see Table 1 of Presnell et al. (1994)) to illustrate the minimal repair tests based on S_τ and V^*. They treated the intervals between failures as interfailure times between minimal repairs. They omitted intervals following a major overhaul (indicated by ** in Table 1 of Presnell et al. (1994)) because it is impossible to determine the age of the unit after the overhaul. For the four planes 7908, 7909, 7910 and 7911 they treated the age at which a major overhaul occurs as the time of the first perfect repair for that plane. For their example, the last observed failure ages of the remaining planes were treated as the times of their first perfect repair.
Presnell et al. (1994) arbitrarily decided to compute S_τ over the interval from 0 to 500 hours. They obtained S_{500} = 0.7705 and \hat{G}(500) = 0.9902. The value of S_{500} is less than the 50th percentile of its asymptotic null distribution. Thus application of the Kolmogorov–Smirnov-type test yields no evidence against the minimal repair assumption. The value of the Wilcoxon-type statistic is V = 0.4984, \hat{σ} = 0.1753, and V^* = −0.03323. Thus this test also finds no evidence against the minimal repair assumption.
Acknowledgement

Myles Hollander's research was supported by National Institutes of Health Grant 5 R01 DK52329 and National Heart Lung Blood Institute Grant 7 R01 HL 67460.

References

Ascher, H. (1968). Evaluation of repairable systems using the "bad as old" concept. IEEE Trans. Reliability 17, 105–110.
Augustin, Z., Peña, E. (1999). Order statistic properties, random generation, and goodness-of-fit testing for a minimal repair model. J. Amer. Statist. Assoc. 94, 266–272.
Augustin, Z., Peña, E. (2001). Goodness-of-fit of the distribution of time-to-first-occurrence in recurrent event models. Lifetime Data Anal. 7, 289–306.
Augustin, Z., Peña, E. (2002). A basis approach to goodness-of-fit testing in recurrent event models. In preparation.
Baxter, L., Kijima, M., Tortella, M. (1996). A point process model for the reliability of a maintained system subject to general repair. Stochastic Models 12, 37–65.
Block, H., Borges, W., Savits, T. (1985). Age-dependent minimal repair. J. Appl. Probab. 22, 370–385.
Brown, M., Proschan, F. (1983). Imperfect repair. J. Appl. Probab. 20, 851–859.
Dorado, C., Hollander, M., Sethuraman, J. (1997). Nonparametric estimation for a general repair model. Ann. Statist. 25, 1140–1160.
Gäertner, M. (2000). Nichtparametrische Statistik für Reparierbare Systeme, Ph.D. Dissertation. Technischen Universität Braunschweig.
Gill, R. (1981). Testing with replacement and the product-limit estimator. Ann. Statist. 9, 853–860.
Gill, R. (1983). Large sample behavior of the product-limit estimator on the whole line. Ann. Statist. 11, 49–58.
Gill, R.D., Johansen, S. (1990). A survey of product-integration with a view toward application in survival analysis. Ann. Statist. 18, 1501–1555.
Hall, W.J., Wellner, J. (1980). Confidence bands for a survival curve from censored data. Biometrika 67, 133–143.
Hollander, M., Presnell, B., Sethuraman, J. (1992). Nonparametric methods for imperfect repair models. Ann. Statist. 20, 879–896.
Kijima, M. (1989). Some results for repairable systems with general repair. J. Appl. Probab. 26, 89–102.
Koziol, J.A., Byar, D.P. (1975). Percentage points of the asymptotic distributions of one- and two-sample K–S statistics for truncated or censored data. Technometrics 17, 507–510.
Kvam, P.H., Singh, H., Whitaker, L.R. (2002). Estimating distributions with increasing failure rate in an imperfect repair model. Lifetime Data Anal. 8, 53–67.
Last, G., Szekli, R. (1998). Asymptotic and monotonicity properties of some repairable systems. Adv. Appl. Probab. 30, 1089–1110.
Peña, E., Hollander, M. (2003). Models for recurrent phenomena in survival analysis and reliability. In: Mazzuchi, T., Singpurwalla, N., Soyer, R. (Eds.), Mathematical Reliability: An Expository Perspective. Kluwer Academic, Dordrecht. In press.
Presnell, B., Hollander, M., Sethuraman, J. (1994). Testing the minimal repair assumption in an imperfect repair model. J. Amer. Statist. Assoc. 89, 289–297.
Proschan, F. (1963). Theoretical explanation of observed decreasing failure rate. Technometrics 5, 375–383.
Rebolledo, R. (1980). Central limit theorems for local martingales. Z. Wahrsch. Verw. Gebiete 51, 269–286.
Stadje, W., Zuckerman, D. (1991). Optimal maintenance strategies for repairable systems with general degree of repair. J. Appl. Probab. 28, 384–396.
Uematsu, K., Nishida, T. (1987). One unit system with a failure rate depending on the degree of repair. Math. Japon. 32, 139–147.
Whitaker, L.R., Samaniego, F.J. (1989). Estimating the reliability of systems subject to imperfect repair. J.
Amer. Statist. Assoc. 84, 301–309.
Subject Index
Aalen’s additive hazards, 298 Aalen’s linear hazard model, 560, 562, 568, 570 Aalen–Johansen estimator, 564 absorbing barrier, 538 absorbing set, 537 absorbing state, 563, 646 accelerated degradation data, 538 accelerated equipment testing, 538 accelerated failure time model, 37, 40, 413, 479, 630 accelerated hazards model, 432 accelerated life time model, 443 accumulated clock time, 542 accumulated usage, 542 acute myelogenous leukemia, 453 added variable test, 383 additive model, 67 additive risk model, 62 administrative censoring, 397 age at incidence, 626 AIDS, 519, 520, 534, 539 AIDS incubation time, 106 Akaike information criterion, 509, 677 analysis of a randomized clinical trial, 323 area under the receiver operating characteristic curve, 2 asymptotic minimax rates, 210 asymptotic null distribution, 12 asymptotic optimality, 210 asymptotic variance, 448 asymptotically efficient, 200 Augustin–Peña goodness-of-fit test, 759 average cross-validated log likelihood, 679 average silhouette width, 679 baseline covariate, 540 Bayes predictive density, survivor, and hazard functions, 730, 735 Bayes theorem, 444, 452 Bayesian information criterion, 677 bias of Kaplan–Meier integral, 90 biomarker data, 538 Birnbaum–Saunders distribution, 467
birth–death–immigration process and illness–cure process, 523, 524 birth–death–immigration–illness–cure process, 519, 521, 522, 524, 534 bivariate current status data, 636 bivariate distribution, 195 bivariate gamma process, 542 bivariate hazard, 202 bivariate integrated Ornstein–Uhlenbeck process, 542 bivariate kernel density estimation, 195 bivariate normal distribution, 540 bivariate right censored data structure, 143 bivariate survival data, 143 bivariate survival time, 263 bivariate time-dependent process, 144 bladder cancer, 451, 454, 458 Block–Borges–Savits model, 749 bone mineral density (BMD), 55 bootstrap, 548 bootstrap distribution, 649 bootstrap procedure, 302 bootstrap simulation, 444, 451 breast cancer, 323, 546 Breslow estimator, 401 Breslow generalization, 257 Breslow–Gehan test, 259 Brown–Proschan model, 748 Brownian bridge statistic, 443, 449 C statistic, 3 calendar time, 542 calibration, 2 cancer, 519, 520, 534 carcinogenecity testing, 626 case-control samples, 625 causal inference literature, 145 cause-specific hazard function, 314 censored correlation and regression, 100 censored data, 545, 568 censored data histogram, 733 censored survival data, 539
censoring, 8 censoring hazard, 560, 569 censoring time, 495 cervical carcinomas, 56 chain-of-events model, 569 CLARA, 679 clustering, 678, 681, 683 clustering algorithm, 677 coarsening at random, 124, 146 common censoring time, 263 comparison of two variances, 283 competing risk, 291, 313, 331–333, 347, 349, 634 competing risks model, 193 complete data, 730 complete gamma function, 479 composite marker process, 539 composite trapezoidal approximation, 482 conditional distribution of the sojourn time, 177 conditional likelihood, 496 conditional Mann–Whitney statistic, 5 conditional maximum likelihood estimator (MLE), 336–338 conditional regression model, 606 confounder, 563 constructed likelihood: incomplete data, 738 contagious distribution, 655 convex minorants, 628 copula, 637 copula model, 159 correlation transform, 541 counting process, 61, 228, 568, 578, 626 covariate, 465, 495, 559, 567, 568 covariate data, 538 coverage probability, 343 Cox model, 62, 189, 305, 690, 691, 699, 709, 710, 717, 723 Cox multiplicative hazards model, 298 Cox partial likelihood, 259 Cox partial likelihood test, 255 Cox proportional hazards model, 28, 30, 34, 161, 253, 257, 383, 495, 677 Cox regression model, 40, 299, 443 Cox–Mantel test, 280 Cramér–Rao inequality, 435 credible interval, 342, 343, 346 critical damage level, 468 cross-validated log likelihood, 677 crossing survival curves, 282 crossing-curve alternatives, 277 crude hazard rate, 292 crude incidence rate, 292 cumulative baseline hazard function, 177 cumulative exposure, 542
cumulative hazard function, 88, 125, 126, 130, 253, 562 cumulative incidence function, 292, 295, 298, 300, 304, 315 (cumulative) marginal hazard, 563 cure model, 545 cure rate estimation, 546 cured patient, 546 current status competing risk data, 634 current status data, 625 curve-crossing model, 542 Dabrowska’s estimator, 144 degree of separation (DOS), 44, 45 dependent censoring, 560, 563, 566, 569 diabetic retinopathy, 729 diagnostic measures, 449 digamma function, 481, 486 direct transition probability, 177 directing process, 542 discrete time proportional hazards model, 363, 364 discrimination, 2 discrimination index, 2 disease progression, 643 distributional convergence of Kaplan–Meier integral, 90 diverse hazard, 202 Dorado–Hollander–Sethuraman confidence band, 753 Dorado–Hollander–Sethuraman estimator, 753 double robustness, 153 doubly censored current status data, 631 doubly interval-censored data, 105 doubly truncated data, 201 dropout, 689–691, 693, 709–712, 715–720, 723–726 Duhamel equation, 572 dynamic additive regression, 62 E-step, 501 efficiency–joint model, 363 efficient classification, 44 efficient influence curve, 125, 127, 130, 131 efficient score, 266, 271–273 Efron’s bootstrap method, 531 eigenvalue, 11 electroencephalogram, 507 EM-algorithm, 124, 499, 545, 635, 691, 710, 712–717, 723 empirical distribution function, 87 empirical integral, 87 empirical variance, 448 end-of-study censoring, 397 equipment degradation, 538 estimating equation, 149, 443, 568
Subject Index estimating function, 147, 433 expected value, 660 explanatory variable, 411 exponential distribution, 39, 332, 333, 347, 646 exponential model, 495 exponential score, 288 extreme value error, 39 F -test, 102 failure indicator variable, 538 failure threshold, 538 fast simultaneous regression, 679 first hitting time, 537 first passage, 542 first passage time, 462 Fisher information matrix, 340, 444, 481 Fleming–Harrington class of specific test, 259 Fleming–Harrington test, 255 flowgraph model, 729 frailty model, 152, 497 full data, 143 gamma distribution, 498 gamma frailty model, 159 gamma process, 542 gamma random variable, 333, 337, 338 gap time, 609 gastric cancer data, 389 Gaussian process, 127, 128, 135, 136, 197 Gehan’s generalization, 257 Gehan–Gilbert–Wilcoxon score, 270 Gehan–Wilcoxon test, 259 general Bayesian procedure, 520 General Double Robustness Theorem, 153 generalized Bayesian method, 534 generalized Fleming–Harrington test, 258 generalized Kruskal–Wallis test, 259 generalized linear model, 9, 306, 629 generalized order statistics, 228 generalized proportional hazard, 415 generalized rank vector, 269, 270 generalized Wilcoxon test, 255, 258, 259, 280 genes, 675 Gibbs sampling, 447, 448, 519, 520 Gill’s estimator, 180 goodness-of-fit test, 383, 443, 450, 645, 759 Gray’s test, 300 hazard, 209, 559–561, 564, 566, 567, 570 hazard estimator, 316 hazard function, 7, 177, 252, 384, 431 hazard rate, 252, 495 hazard ratio, 253 health risk appraisal functions, 1
health status, 540 Hellinger distance, 214 high-dimensional covariates, 675 Hilbert space, 162 HIV, 529, 534, 730 Hollander–Presnell–Sethuraman confidence band, 757 Hollander–Presnell–Sethuraman two-sample test, 758 Hosmer and Lemeshow’s chi-square statistics, 6 Hosmer–Lemeshow test, 384 hospital stays, 538 hybrid approach, 486, 488, 489 hybrid approximation, 483, 486, 487 hypercube of Assouad, 215 identifiability, 314, 552 incidence rate, 316 incomplete data, 730 incomplete gamma function, 479, 482, 483, 493 increasing hazard, 545 independent censoring assumption, 143 independent sample shift model, 266 infectious diseases, 519, 520 influence curve, 147 influential observation, 450 information bound, 125, 127, 128, 130–133, 139 informative censoring, 144, 621 initial mapping from full data estimating functions into observed data estimating functions, 149 intensity function, 352, 604 interaction, 399, 402, 403 internal covariates, 560, 570 interval censoring, 625 inverse Gaussian distribution, 539 inverse probability of censoring approach, 560 inverse probability of censoring weighted mapping, 149 inverse-probability-of-censoring weighted estimator, 564 inverse-probability-of-censoring weighting, 305 inverse-probability-of-treatment weights, 569 inverted gamma, 333, 337 isotonic regression, 635 jackknife of Kaplan–Meier integral, 97 joint distribution function, 618 Kaplan–Meier curve, 251, 259 Kaplan–Meier estimator, 7, 180, 295, 401, 561, 562, 564, 566, 568, 569 Kaplan–Meier integral, 90, 92 Kaplan–Meier method, 508 Kaplan–Meier process, 93
Kaplan–Meier survival curve, 255, 260 kernel estimator, 209, 228 kernel smoothed estimator, 294 Kijima model, 749 Kolmogorov approach, 522 Kolmogorov forward equation, 522 Kolmogorov–Smirnov test, 302 Kruskal–Wallis test, 257 Kullback information, 210, 214 labor turnover, 538 largest actual observed follow-up time, 48 Last–Szekli model, 751 Lehmann alternative, 254 lengths of labor strikes, 538 leukemia data, 488 Levene-type test, 280, 289 LIFEREG procedure, 479, 481, 488, 489 lifetime data, 537 lifetime distribution, 411 likelihood, 179 likelihood construction, 730 Lin’s test, 302 linear combination, 271, 272 linear rank statistic, 270, 272, 288 linear rank test, 266, 269, 280, 287 linear signed rank test, 265, 266 link function, 464, 540 Lipschitz classes, 213 locally efficient estimation methodology, 147 log gamma density, 479 log logistic, 39 log normal, 39 log partial likelihood function, 384 log rank, 259, 397 log-linear model, 268–270 log-normal model, 495 log-rank score, 270 log-rank test, 32, 252, 280, 299, 317, 395, 397–399, 403, 404 logistic error distribution, 39 logistic model, 126 logistic regression model, 2, 145, 456 long term survival, 277 longitudinal data, 504, 538, 541, 645, 690, 723 longitudinal studies, 143 loop flowgraph model, 740 loss-to-follow-up, 397 lower bounds, 215 M-step, 501 Mann–Whitney statistic, 3 Mann–Whitney–Wilcoxon test, 257, 258 Mann–Whitney–Wilcoxon two-sample test, 759
Mantel, 259 marginal bivariate right-censored data, 151 marginal event rate, 355 marginal hazards model, 612 marginal transition hazard, 562, 563 marker process, 539 Markov assumption, 564 Markov chain and Monte Carlo (MCMC) procedure, 519, 524 Markov model, 560, 644 Markov process, 564, 729 martingale, 229, 567, 568, 571–573, 578 martingale method, 755 martingale residual, 28, 386 Mason’s rule, 742 MATLAB, 396, 403, 404 matrix valued counting process (MVCP), 576 maximum likelihood, 462 maximum likelihood equation, 480, 486 maximum likelihood estimator (MLE), 332, 334–336, 339–342, 346, 347, 487, 489 mean function, 605, 640 mean lifetime, 332, 338, 347 measures of predictive accuracy, 1 mental illness, 538 metal fatigue, 542 microarray, 676 minimal repair model, 748 minimax risk, 209 missing at random, 124, 363, 365 missing data, 273, 274 missing observation, 265 mixture model, 363, 364, 546, 654 mixture or compound distribution, 655 model adequacy, 435 modified Dabrowska’s influence curve, 152 modified Kolmogorov–Smirnov test, 280, 281, 284 monotone hazard assumption, 545 monotonic process, 542 Monte Carlo simulation, 153 multi-level Gibbs sampling method, 521, 524, 527 multi-state model, 293, 689, 691, 698, 729 multinomial distribution, 523 multiple myeloma, 678 multiplicative intensity model, 146, 228 multistage model, 559, 560, 562, 564, 567 multivariate matrix valued counting process framework, 583 multivariate survival data, 559 multivariate survival method, 578 negative binomial with added zeros, 655 Nelson–Aalen cumulative cause-specific hazard estimator, 316
Subject Index Nelson–Aalen cumulative hazard plot, 261 Nelson–Aalen estimator, 89, 124, 127, 129, 130, 209, 253, 254, 293 Newton–Raphson algorithm, 482, 490, 499 non-homogeneous Poisson process, 607 non-inferiority test, 51 non-informative uniform priors, 527, 531, 534 non-Lehmann alternative, 254 non-parametric bootstrap, 402 non-parametric comparison, 107 non-parametric cumulative incidence function estimator, 320 non-parametric density estimates, 209 non-parametric estimation, 106, 618 non-parametric maximum likelihood estimator (NPMLE), 123, 195, 228, 627 non-parametric method, 209 non-parametric mixture estimation, 633 non-parametric survival comparison, 255 non-parametric test, 263, 265, 267, 274 non-standard hypothesis test, 660 non-stationary Cox model, 193 nuisance parameter, 147 nuisance score, 147 observation model, 519, 524, 525 observed data, 143 observed data nuisance tangent space, 147 omnibus test, 386 one-step estimator, 149 operational time, 542 orthogonal complement of the nuisance tangent space, 147, 148 overdispersion, 444, 450, 656 paired censored data, 263, 272, 274 paired censored sample, 272 paired right censored data, 269 paired sample shift model, 265 PAM, 679 panel data, 640, 643 parallel flowgraph model, 740 parameterizing the log-hazard ratio, 588 parent process, 542 partial likelihood, 191, 192, 496, 547, 577, 677 partial ordering, 45 partial residual, 444, 450, 456, 457 partial sum, 449 partially parametric cumulative incidence function estimator, 320 partitions of the covariate and time space, 386 partner studies, 626 Parzen kernel, 223 Pearson chi-square statistic, 6
Pearson statistic, 645 pediatric cancer data, 487 penalized likelihood approach, 500 Pepe’s test, 304, 305 perfect repair model, 748 permutation distribution, 267 permutation null distribution, 267 permutation test, 268, 274 perturbation method, 210 Peto and Peto test, 255, 257, 259 point process, 211, 349 point process transformation model, 621 Poisson and negative binomial model, 654 Poisson distribution, 522 Poisson likelihood, 9 Poisson log-linear model, 6 Poisson process, 505, 523, 524 Poisson with added zeros, 655 pool–adjacent–violators algorithm, 628 pooled sample test, 263–266, 272 positive stable, 499 posterior distribution, 527, 528 predictable process, 567 Prediction functions, 1 predictive inference, 541 Prentice–Wilcoxon score, 270 Presnell–Hollander–Sethuraman test of the minimal repair assumption, 761 principal component, 679, 684 principal component decomposition, 677 prior distribution, 520, 521, 527 product integral, 125, 127, 129, 134, 572 product-limit estimator, 90, 188 product-limit form, 560, 562, 564, 565 profile likelihood, 661 progressive censoring, 227, 332 projection, 148 proportional hazard, 261, 411, 545, 560 proportional hazard model, 7, 107, 431, 547, 567, 569, 575, 629 see also Cox proportional hazards model proportional hazard regression, 443 see also Cox regression model proportional intensity model, 606 proportional means model, 616 proportional model, 67 proportional odds regression, 630 proportional rates model, 616 pseudo-distance, 212 pseudo-value, 306 psoriatic arthritis, 643 quality of life, 689, 691–693, 696, 698–700, 706, 707, 721–723, 725
random censoring, 143 random truncation, 195 randomized clinical trial, 323 rank correlation coefficient, 5 rate functions, 353, 605 rate model, 614 rates of convergence, 209 receiver operating characteristic curve, 2 recurrent event, 349, 363, 374, 504, 603 recursive bisection, 438 regression analysis, 106 regression link function, 538 regression method, 298 regression model, 495, 560, 566, 567 regression model for hazard, 316 regression techniques, 305 regular asymptotically linear (RAL) estimator, 144 relative risk, 509, 650 relative-arrangement partial order, 46 remote symptomatic, 507 repeated measures data, 585 repeated measures survival analysis, 575 restricted mean life, 48 reverse hazard, 202 right censoring, 88, 143, 263, 559 right-censored data, 209, 563, 565 risk score, 385 robust variance estimator, 398, 403 S-plus, 504 saddlepoint approximation, 730 sample information matrix, 481 SAS, 383 scale model, 269 score function, 568 score statistics, 30, 32 score test, 257, 383 self-consistency, 130 semi-Markov jump process, 176 semi-Markov process, 729 semi-Markovian model, 560, 689, 691, 698–700 semi-parametric efficiency, 435 semi-parametric estimation, 411 semi-parametric hazards model, 611 semi-parametric information bound, 434 semi-parametric maximum likelihood, 176 semi-parametric method, 630 semi-parametric model, 443, 605 semi-parametric proportional hazard cure model, 545 semi-parametric sub-model, 63 sensitivity, 2 sequential randomization assumption (SRA), 145 series flowgraph model, 731 shape parameter, 480, 481, 492
shared frailty model, 498 short term risks, 277 sign statistic, 272 sign test, 266 signed rank test, 266, 271 smooth functional, 635 smoothness class, 209, 211 Sobolev class, 213 sojourn time, 177, 646 space model, 519 specificity, 2 stage occupation probability, 560, 562–565 stage occupation time, 566 staging or classification system, 43 standard error, 488 Stanford heart transplant, 456 STATA, 383 state space model, 519, 520, 522, 524, 534 stationary, 646 stationary Gaussian process, 466 statistical model, 519, 520, 524, 525 stepwise method, 682 stepwise selection, 678 stochastic differential equation, 522–524 stochastic model, 519–521, 524 stochastic process, 519, 522, 537 stochastic system equation, 520 stochastic system model, 519, 520, 522, 524 stratified logrank test, 255 strong consistency of Kaplan–Meier integral, 90 subdensity, 125, 130 subdistribution ‘hazard’, 320 subordinated process, 542 subroutine NLPTR, 482, 487 supermartingales in reverse time, 92 supervised clustering, 685 survival analysis, 61, 227, 251, 332, 336 survival data, 537, 696 survival function, 143, 252, 561, 565 survival method, 575 survival probability, 519, 524, 528, 533, 534 survival regression model, 411 survival time, 462 Tarone–Ware family of test, 278 Tarone–Ware test, 255, 259 tests for comparing cumulative incidence functions, 322 tests of hypothesis, 568 three-stage illness–death model, 559 three-stage irreversible illness–death model, 564 threshold, 538 threshold model, 37
threshold parameter, 30, 32 time independent and/or dependent covariate process, 143 time scale, 542 time to event, 575 time-dependent covariate, 567, 630 time-dependent possibility, 567 time-dependent variable, 385 time-scale transformation, 538 time-to-event data, 537 time-varying effect, 61, 385 total time, 609 transfusion related AIDS data, 198 transition probability, 646 transition rate, 646 truncated Poisson, 13 tuberculosis, 519, 521, 529, 534 two sample Mann–Whitney–Wilcoxon, 257 two-dimensional Wiener diffusion process, 539 U-statistics, 93 Uematsu–Nishida model, 750 unbiased estimating equation, 567 uniformly minimum variance unbiased estimator (UMVUE), 332, 334–336 University of Massachusetts AIDS Research Unit IMPACT Study, 390 unmeasured confounder, 561, 565
variance of Kaplan–Meier integral, 90 waiting time, 559, 560, 564–567 Wald test, 386 Wald-type confidence interval, 159 Weibull distribution, 39, 466, 549, 569, 630 Weibull model, 495 weight function, 213 weighted Bootstrap method, 527 weighted distance test, 323 weighted estimating function, 567 weighted integrated quadratic risk, 210 weighted least squares, 685 weighted mean integrated risk, 212 Whitaker–Samaniego estimator, 756 Wiener process, 538 Wilcoxon rank sum statistic, 266, 267, 270 Wilcoxon score, 288 Wilcoxon signed rank statistic, 272 Wilcoxon signed rank test, 266, 267 Wilcoxon test, 259 within-cluster resampling, 576 within-pair difference, 269 within-pair difference test, 263–265, 269 working model, 395, 397–399, 401 zero-inflated Poisson distribution, 657
Handbook of Statistics Contents of Previous Volumes
Volume 1. Analysis of Variance Edited by P.R. Krishnaiah 1980 xviii + 1002 pp. 1. Estimation of Variance Components by C.R. Rao and J. Kleffe 2. Multivariate Analysis of Variance of Repeated Measurements by N.H. Timm 3. Growth Curve Analysis by S. Geisser 4. Bayesian Inference in MANOVA by S.J. Press 5. Graphical Methods for Internal Comparisons in ANOVA and MANOVA by R. Gnanadesikan 6. Monotonicity and Unbiasedness Properties of ANOVA and MANOVA Tests by S. Das Gupta 7. Robustness of ANOVA and MANOVA Test Procedures by P.K. Ito 8. Analysis of Variance and Problems under Time Series Models by D.R. Brillinger 9. Tests of Univariate and Multivariate Normality by K.V. Mardia 10. Transformations to Normality by G. Kaskey, B. Kolman, P.R. Krishnaiah and L. Steinberg 11. ANOVA and MANOVA: Models for Categorical Data by V.P. Bhapkar 12. Inference and the Structural Model for ANOVA and MANOVA by D.A.S. Fraser 13. Inference Based on Conditionally Specified ANOVA Models Incorporating Preliminary Testing by T.A. Bancroft and C.-P. Han 14. Quadratic Forms in Normal Variables by C.G. Khatri 15. Generalized Inverse of Matrices and Applications to Linear Models by S.K. Mitra 16. Likelihood Ratio Tests for Mean Vectors and Covariance Matrices by P.R. Krishnaiah and J.C. Lee 17. Assessing Dimensionality in Multivariate Regression by A.J. Izenman 18. Parameter Estimation in Nonlinear Regression Models by H. Bunke 19. Early History of Multiple Comparison Tests by H.L. Harter 20. Representations of Simultaneous Pairwise Comparisons by A.R. Sampson 21. Simultaneous Test Procedures for Mean Vectors and Covariance Matrices by P.R. Krishnaiah, G.S. Mudholkar and P. Subbaiah 22. Nonparametric Simultaneous Inference for Some MANOVA Models by P.K. Sen 773
23. Comparison of Some Computer Programs for Univariate and Multivariate Analysis of Variance by R.D. Bock and D. Brandt 24. Computations of Some Multivariate Distributions by P.R. Krishnaiah 25. Inference on the Structure of Interaction Two-Way Classification Model by P.R. Krishnaiah and M. Yochmowitz
Volume 2. Classification, Pattern Recognition and Reduction of Dimensionality Edited by P.R. Krishnaiah and L.N. Kanal 1982 xxii + 903 pp. 1. Discriminant Analysis for Time Series by R.H. Shumway 2. Optimum Rules for Classification into Two Multivariate Normal Populations with the Same Covariance Matrix by S. Das Gupta 3. Large Sample Approximations and Asymptotic Expansions of Classification Statistics by M. Siotani 4. Bayesian Discrimination by S. Geisser 5. Classification of Growth Curves by J.C. Lee 6. Nonparametric Classification by J.D. Broffitt 7. Logistic Discrimination by J.A. Anderson 8. Nearest Neighbor Methods in Discrimination by L. Devroye and T.J. Wagner 9. The Classification and Mixture Maximum Likelihood Approaches to Cluster Analysis by G.J. McLachlan 10. Graphical Techniques for Multivariate Data and for Clustering by J.M. Chambers and B. Kleiner 11. Cluster Analysis Software by R.K. Blashfield, M.S. Aldenderfer and L.C. Morey 12. Single-link Clustering Algorithms by F.J. Rohlf 13. Theory of Multidimensional Scaling by J. de Leeuw and W. Heiser 14. Multidimensional Scaling and its Application by M. Wish and J.D. Carroll 15. Intrinsic Dimensionality Extraction by K. Fukunaga 16. Structural Methods in Image Analysis and Recognition by L.N. Kanal, B.A. Lambird and D. Lavine 17. Image Models by N. Ahuja and A. Rosenfield 18. Image Texture Survey by R.M. Haralick 19. Applications of Stochastic Languages by K.S. Fu 20. A Unifying Viewpoint on Pattern Recognition by J.C. Simon, E. Backer and J. Sallentin 21. Logical Functions in the Problems of Empirical Prediction by G.S. Lbov 22. Inference and Data Tables and Missing Values by N.G. Zagoruiko and V.N. Yolkina 23. Recognition of Electrocardiographic Patterns by J.H. van Bemmel 24. Waveform Parsing Systems by G.C. Stockman
25. Continuous Speech Recognition: Statistical Methods by F. Jelinek, R.L. Mercer and L.R. Bahl 26. Applications of Pattern Recognition in Radar by A.A. Grometstein and W.H. Schoendorf 27. White Blood Cell Recognition by F.S. Gelsema and G.H. Landweerd 28. Pattern Recognition Techniques for Remote Sensing Applications by P.H. Swain 29. Optical Character Recognition – Theory and Practice by G. Nagy 30. Computer and Statistical Considerations for Oil Spill Identification by Y.T. Chien and T.J. Killeen 31. Pattern Recognition in Chemistry by B.R. Kowalski and S. Wold 32. Covariance Matrix Representation and Object-Predicate Symmetry by T. Kaminuma, S. Tomita and S. Watanabe 33. Multivariate Morphometrics by R.A. Reyment 34. Multivariate Analysis with Latent Variables by P.M. Bentler and D.G. Weeks 35. Use of Distance Measures, Information Measures and Error Bounds in Feature Evaluation by M. Ben-Bassat 36. Topics in Measurement Selection by J.M. Van Campenhout 37. Selection of Variables Under Univariate Regression Models by P.R. Krishnaiah 38. On the Selection of Variables Under Regression Models Using Krishnaiah’s Finite Intersection Tests by J.L. Schmidhammer 39. Dimensionality and Sample Size Considerations in Pattern Recognition Practice by A.K. Jain and B. Chandrasekaran 40. Selecting Variables in Discriminant Analysis for Improving upon Classical Procedures by W. Schaafsma 41. Selection of Variables in Discriminant Analysis by P.R. Krishnaiah
Volume 3. Time Series in the Frequency Domain Edited by D.R. Brillinger and P.R. Krishnaiah 1983 xiv + 485 pp. 1. Wiener Filtering (with emphasis on frequency-domain approaches) by R.J. Bhansali and D. Karavellas 2. The Finite Fourier Transform of a Stationary Process by D.R. Brillinger 3. Seasonal and Calendar Adjustment by W.S. Cleveland 4. Optimal Inference in the Frequency Domain by R.B. Davies 5. Applications of Spectral Analysis in Econometrics by C.W.J. Granger and R. Engle 6. Signal Estimation by E.J. Hannan 7. Complex Demodulation: Some Theory and Applications by T. Hasan 8. Estimating the Gain of a Linear Filter from Noisy Data by M.J. Hinich 9. A Spectral Analysis Primer by L.H. Koopmans 10. Robust-Resistant Spectral Analysis by R.D. Martin 11. Autoregressive Spectral Estimation by E. Parzen
12. Threshold Autoregression and Some Frequency-Domain Characteristics by J. Pemberton and H. Tong 13. The Frequency-Domain Approach to the Analysis of Closed-Loop Systems by M.B. Priestley 14. The Bispectral Analysis of Nonlinear Stationary Time Series with Reference to Bilinear Time-Series Models by T. Subba Rao 15. Frequency-Domain Analysis of Multidimensional Time-Series Data by E.A. Robinson 16. Review of Various Approaches to Power Spectrum Estimation by P.M. Robinson 17. Cumulants and Cumulant Spectra by M. Rosenblatt 18. Replicated Time-Series Regression: An Approach to Signal Estimation and Detection by R.H. Shumway 19. Computer Programming of Spectrum Estimation by T. Thrall 20. Likelihood Ratio Tests on Covariance Matrices and Mean Vectors of Complex Multivariate Normal Populations and their Applications in Time Series by P.R. Krishnaiah, J.C. Lee and T.C. Chang
Volume 4. Nonparametric Methods Edited by P.R. Krishnaiah and P.K. Sen 1984 xx + 968 pp. 1. Randomization Procedures by C.B. Bell and P.K. Sen 2. Univariate and Multivariate Multisample Location and Scale Tests by V.P. Bhapkar 3. Hypothesis of Symmetry by M. Hušková 4. Measures of Dependence by K. Joag-Dev 5. Tests of Randomness against Trend or Serial Correlations by G.K. Bhattacharyya 6. Combination of Independent Tests by J.L. Folks 7. Combinatorics by L. Takács 8. Rank Statistics and Limit Theorems by M. Ghosh 9. Asymptotic Comparison of Tests – A Review by K. Singh 10. Nonparametric Methods in Two-Way Layouts by D. Quade 11. Rank Tests in Linear Models by J.N. Adichie 12. On the Use of Rank Tests and Estimates in the Linear Model by J.C. Aubuchon and T.P. Hettmansperger 13. Nonparametric Preliminary Test Inference by A.K.Md.E. Saleh and P.K. Sen 14. Paired Comparisons: Some Basic Procedures and Examples by R.A. Bradley 15. Restricted Alternatives by S.K. Chatterjee 16. Adaptive Methods by M. Hušková 17. Order Statistics by J. Galambos 18. Induced Order Statistics: Theory and Applications by P.K. Bhattacharya 19. Empirical Distribution Function by F. Csáki 20. Invariance Principles for Empirical Processes by M. Csörg˝o
21. M-, L- and R-estimators by J. Jureˇcková 22. Nonparametric Sequential Estimation by P.K. Sen 23. Stochastic Approximation by V. Dupaˇc 24. Density Estimation by P. Révész 25. Censored Data by A.P. Basu 26. Tests for Exponentiality by K.A. Doksum and B.S. Yandell 27. Nonparametric Concepts and Methods in Reliability by M. Hollander and F. Proschan 28. Sequential Nonparametric Tests by U. Müller-Funk 29. Nonparametric Procedures for some Miscellaneous Problems by P.K. Sen 30. Minimum Distance Procedures by R. Beran 31. Nonparametric Methods in Directional Data Analysis by S.R. Jammalamadaka 32. Application of Nonparametric Statistics to Cancer Data by H.S. Wieand 33. Nonparametric Frequentist Proposals for Monitoring Comparative Survival Studies by M. Gail 34. Meteorological Applications of Permutation Techniques Based on Distance Functions by P.W. Mielke Jr 35. Categorical Data Problems Using Information Theoretic Approach by S. Kullback and J.C. Keegel 36. Tables for Order Statistics by P.R. Krishnaiah and P.K. Sen 37. Selected Tables for Nonparametric Statistics by P.K. Sen and P.R. Krishnaiah
Volume 5. Time Series in the Time Domain Edited by E.J. Hannan, P.R. Krishnaiah and M.M. Rao 1985 xiv + 490 pp. 1. Nonstationary Autoregressive Time Series by W.A. Fuller 2. Non-Linear Time Series Models and Dynamical Systems by T. Ozaki 3. Autoregressive Moving Average Models, Intervention Problems and Outlier Detection in Time Series by G.C. Tiao 4. Robustness in Time Series and Estimating ARMA Models by R.D. Martin and V.J. Yohai 5. Time Series Analysis with Unequally Spaced Data by R.H. Jones 6. Various Model Selection Techniques in Time Series Analysis by R. Shibata 7. Estimation of Parameters in Dynamical Systems by L. Ljung 8. Recursive Identification, Estimation and Control by P. Young 9. General Structure and Parametrization of ARMA and State-Space Systems and its Relation to Statistical Problems by M. Deistler 10. Harmonizable, Cramér, and Karhunen Classes of Processes by M.M. Rao 11. On Non-Stationary Time Series by C.S.K. Bhagavan 12. Harmonizable Filtering and Sampling of Time Series by D.K. Chang 13. Sampling Designs for Time Series by S. Cambanis
14. Measuring Attenuation by M.A. Cameron and P.J. Thomson 15. Speech Recognition Using LPC Distance Measures by P.J. Thomson and P. de Souza 16. Varying Coefficient Regression by D.F. Nicholls and A.R. Pagan 17. Small Samples and Large Equations Systems by H. Theil and D.G. Fiebig
Volume 6. Sampling Edited by P.R. Krishnaiah and C.R. Rao 1988 xvi + 594 pp. 1. A Brief History of Random Sampling Methods by D.R. Bellhouse 2. A First Course in Survey Sampling by T. Dalenius 3. Optimality of Sampling Strategies by A. Chaudhuri 4. Simple Random Sampling by P.K. Pathak 5. On Single Stage Unequal Probability Sampling by V.P. Godambe and M.E. Thompson 6. Systematic Sampling by D.R. Bellhouse 7. Systematic Sampling with Illustrative Examples by M.N. Murthy and T.J. Rao 8. Sampling in Time by D.A. Binder and M.A. Hidiroglou 9. Bayesian Inference in Finite Populations by W.A. Ericson 10. Inference Based on Data from Complex Sample Designs by G. Nathan 11. Inference for Finite Population Quantiles by J. Sedransk and P.J. Smith 12. Asymptotics in Finite Population Sampling by P.K. Sen 13. The Technique of Replicated or Interpenetrating Samples by J.C. Koop 14. On the Use of Models in Sampling from Finite Populations by I. Thomsen and D. Tesfu 15. The Prediction Approach to Sampling Theory by R.M. Royall 16. Sample Survey Analysis: Analysis of Variance and Contingency Tables by D.H. Freeman Jr 17. Variance Estimation in Sample Surveys by J.N.K. Rao 18. Ratio and Regression Estimators by P.S.R.S. Rao 19. Role and Use of Composite Sampling and Capture-Recapture Sampling in Ecological Studies by M.T. Boswell, K.P. Burnham and G.P. Patil 20. Data-based Sampling and Model-based Estimation for Environmental Resources by G.P. Patil, G.J. Babu, R.C. Hennemuth, W.L. Meyers, M.B. Rajarshi and C. Taillie 21. On Transect Sampling to Assess Wildlife Populations and Marine Resources by F.L. Ramsey, C.E. Gates, G.P. Patil and C. Taillie 22. A Review of Current Survey Sampling Methods in Marketing Research (Telephone, Mall Intercept and Panel Surveys) by R. Velu and G.M. Naidu 23. Observational Errors in Behavioural Traits of Man and their Implications for Genetics by P.V. Sukhatme 24. Designs in Survey Sampling Avoiding Contiguous Units by A.S. Hedayat, C.R. Rao and J. Stufken
Volume 7. Quality Control and Reliability Edited by P.R. Krishnaiah and C.R. Rao 1988 xiv + 503 pp. 1. Transformation of Western Style of Management by W. Edwards Deming 2. Software Reliability by F.B. Bastani and C.V. Ramamoorthy 3. Stress–Strength Models for Reliability by R.A. Johnson 4. Approximate Computation of Power Generating System Reliability Indexes by M. Mazumdar 5. Software Reliability Models by T.A. Mazzuchi and N.D. Singpurwalla 6. Dependence Notions in Reliability Theory by N.R. Chaganty and K. Joagdev 7. Application of Goodness-of-Fit Tests in Reliability by B.W. Woodruff and A.H. Moore 8. Multivariate Nonparametric Classes in Reliability by H.W. Block and T.H. Savits 9. Selection and Ranking Procedures in Reliability Models by S.S. Gupta and S. Panchapakesan 10. The Impact of Reliability Theory on Some Branches of Mathematics and Statistics by P.J. Boland and F. Proschan 11. Reliability Ideas and Applications in Economics and Social Sciences by M.C. Bhattacharjee 12. Mean Residual Life: Theory and Applications by F. Guess and F. Proschan 13. Life Distribution Models and Incomplete Data by R.E. Barlow and F. Proschan 14. Piecewise Geometric Estimation of a Survival Function by G.M. Mimmack and F. Proschan 15. Applications of Pattern Recognition in Failure Diagnosis and Quality Control by L.F. Pau 16. Nonparametric Estimation of Density and Hazard Rate Functions when Samples are Censored by W.J. Padgett 17. Multivariate Process Control by F.B. Alt and N.D. Smith 18. QMP/USP – A Modern Approach to Statistical Quality Auditing by B. Hoadley 19. Review About Estimation of Change Points by P.R. Krishnaiah and B.Q. Miao 20. Nonparametric Methods for Changepoint Problems by M. Csörg˝o and L. Horváth 21. Optimal Allocation of Multistate Components by E. El-Neweihi, F. Proschan and J. Sethuraman 22. Weibull, Log-Weibull and Gamma Order Statistics by H.L. Herter 23. Multivariate Exponential Distributions and their Applications in Reliability by A.P. Basu 24. Recent Developments in the Inverse Gaussian Distribution by S. Iyengar and G. Patwardhan
Volume 8. Statistical Methods in Biological and Medical Sciences Edited by C.R. Rao and R. Chakraborty 1991 xvi + 554 pp. 1. Methods for the Inheritance of Qualitative Traits by J. Rice, R. Neuman and S.O. Moldin 2. Ascertainment Biases and their Resolution in Biological Surveys by W.J. Ewens 3. Statistical Considerations in Applications of Path Analytical in Genetic Epidemiology by D.C. Rao 4. Statistical Methods for Linkage Analysis by G.M. Lathrop and J.M. Lalouel 5. Statistical Design and Analysis of Epidemiologic Studies: Some Directions of Current Research by N. Breslow 6. Robust Classification Procedures and their Applications to Anthropometry by N. Balakrishnan and R.S. Ambagaspitiya 7. Analysis of Population Structure: A Comparative Analysis of Different Estimators of Wright’s Fixation Indices by R. Chakraborty and H. Danker-Hopfe 8. Estimation of Relationships from Genetic Data by E.A. Thompson 9. Measurement of Genetic Variation for Evolutionary Studies by R. Chakraborty and C.R. Rao 10. Statistical Methods for Phylogenetic Tree Reconstruction by N. Saitou 11. Statistical Models for Sex-Ratio Evolution by S. Lessard 12. Stochastic Models of Carcinogenesis by S.H. Moolgavkar 13. An Application of Score Methodology: Confidence Intervals and Tests of Fit for One-Hit-Curves by J.J. Gart 14. Kidney-Survival Analysis of IgA Nephropathy Patients: A Case Study by O.J.W.F. Kardaun 15. Confidence Bands and the Relation with Decision Analysis: Theory by O.J.W.F. Kardaun 16. Sample Size Determination in Clinical Research by J. Bock and H. Toutenburg
Volume 9. Computational Statistics Edited by C.R. Rao 1993 xix + 1045 pp. 1. Algorithms by B. Kalyanasundaram 2. Steady State Analysis of Stochastic Systems by K. Kant 3. Parallel Computer Architectures by R. Krishnamurti and B. Narahari 4. Database Systems by S. Lanka and S. Pal 5. Programming Languages and Systems by S. Purushothaman and J. Seaman 6. Algorithms and Complexity for Markov Processes by R. Varadarajan 7. Mathematical Programming: A Computational Perspective by W.W. Hager, R. Horst and P.M. Pardalos
8. Integer Programming by P.M. Pardalos and Y. Li 9. Numerical Aspects of Solving Linear Least Squares Problems by J.L. Barlow 10. The Total Least Squares Problem by S. van Huffel and H. Zha 11. Construction of Reliable Maximum-Likelihood-Algorithms with Applications to Logistic and Cox Regression by D. Böhning 12. Nonparametric Function Estimation by T. Gasser, J. Engel and B. Seifert 13. Computation Using the QR Decomposition by C.R. Goodall 14. The EM Algorithm by N. Laird 15. Analysis of Ordered Categorical Data through Appropriate Scaling by C.R. Rao and P.M. Caligiuri 16. Statistical Applications of Artificial Intelligence by W.A. Gale, D.J. Hand and A.E. Kelly 17. Some Aspects of Natural Language Processes by A.K. Joshi 18. Gibbs Sampling by S.F. Arnold 19. Bootstrap Methodology by G.J. Babu and C.R. Rao 20. The Art of Computer Generation of Random Variables by M.T. Boswell, S.D. Gore, G.P. Patil and C. Taillie 21. Jackknife Variance Estimation and Bias Reduction by S. Das Peddada 22. Designing Effective Statistical Graphs by D.A. Burn 23. Graphical Methods for Linear Models by A.S. Hadi 24. Graphics for Time Series Analysis by H.J. Newton 25. Graphics as Visual Language by T. Selkar and A. Appel 26. Statistical Graphics and Visualization by E.J. Wegman and D.B. Carr 27. Multivariate Statistical Visualization by F.W. Young, R.A. Faldowski and M.M. McFarlane 28. Graphical Methods for Process Control by T.L. Ziemer
Volume 10. Signal Processing and its Applications Edited by N.K. Bose and C.R. Rao 1993 xvii + 992 pp. 1. Signal Processing for Linear Instrumental Systems with Noise: A General Theory with Illustrations from Optical Imaging and Light Scattering Problems by M. Bertero and E.R. Pike 2. Boundary Implication Results in Parameter Space by N.K. Bose 3. Sampling of Bandlimited Signals: Fundamental Results and Some Extensions by J.L. Brown Jr 4. Localization of Sources in a Sector: Algorithms and Statistical Analysis by K. Buckley and X.-L. Xu 5. The Signal Subspace Direction-of-Arrival Algorithm by J.A. Cadzow 6. Digital Differentiators by S.C. Dutta Roy and B. Kumar 7. Orthogonal Decompositions of 2D Random Fields and their Applications for 2D Spectral Estimation by J.M. Francos
8. VLSI in Signal Processing by A. Ghouse 9. Constrained Beamforming and Adaptive Algorithms by L.C. Godara 10. Bispectral Speckle Interferometry to Reconstruct Extended Objects from Turbulence-Degraded Telescope Images by D.M. Goodman, T.W. Lawrence, E.M. Johansson and J.P. Fitch 11. Multi-Dimensional Signal Processing by K. Hirano and T. Nomura 12. On the Assessment of Visual Communication by F.O. Huck, C.L. Fales, R. AlterGartenberg and Z. Rahman 13. VLSI Implementations of Number Theoretic Concepts with Applications in Signal Processing by G.A. Jullien, N.M. Wigley and J. Reilly 14. Decision-level Neural Net Sensor Fusion by R.Y. Levine and T.S. Khuon 15. Statistical Algorithms for Noncausal Gauss Markov Fields by J.M.F. Moura and N. Balram 16. Subspace Methods for Directions-of-Arrival Estimation by A. Paulraj, B. Ottersten, R. Roy, A. Swindlehurst, G. Xu and T. Kailath 17. Closed Form Solution to the Estimates of Directions of Arrival Using Data from an Array of Sensors by C.R. Rao and B. Zhou 18. High-Resolution Direction Finding by S.V. Schell and W.A. Gardner 19. Multiscale Signal Processing Techniques: A Review by A.H. Tewfik, M. Kim and M. Deriche 20. Sampling Theorems and Wavelets by G.G. Walter 21. Image and Video Coding Research by J.W. Woods 22. Fast Algorithms for Structured Matrices in Signal Processing by A.E. Yagle
Volume 11. Econometrics Edited by G.S. Maddala, C.R. Rao and H.D. Vinod 1993 xx + 783 pp. 1. Estimation from Endogenously Stratified Samples by S.R. Cosslett 2. Semiparametric and Nonparametric Estimation of Quantal Response Models by J.L. Horowitz 3. The Selection Problem in Econometrics and Statistics by C.F. Manski 4. General Nonparametric Regression Estimation and Testing in Econometrics by A. Ullah and H.D. Vinod 5. Simultaneous Microeconometric Models with Censored or Qualitative Dependent Variables by R. Blundell and R.J. Smith 6. Multivariate Tobit Models in Econometrics by L.-F. Lee 7. Estimation of Limited Dependent Variable Models under Rational Expectations by G.S. Maddala 8. Nonlinear Time Series and Macroeconometrics by W.A. Brock and S.M. Potter 9. Estimation, Inference and Forecasting of Time Series Subject to Changes in Time by J.D. Hamilton
10. Structural Time Series Models by A.C. Harvey and N. Shephard 11. Bayesian Testing and Testing Bayesians by J.-P. Florens and M. Mouchart 12. Pseudo-Likelihood Methods by C. Gourieroux and A. Monfort 13. Rao’s Score Test: Recent Asymptotic Results by R. Mukerjee 14. On the Strong Consistency of M-Estimates in Linear Models under a General Discrepancy Function by Z.D. Bai, Z.J. Liu and C.R. Rao 15. Some Aspects of Generalized Method of Moments Estimation by A. Hall 16. Efficient Estimation of Models with Conditional Moment Restrictions by W.K. Newey 17. Generalized Method of Moments: Econometric Applications by M. Ogaki 18. Testing for Heteroscedasticity by A.R. Pagan and Y. Pak 19. Simulation Estimation Methods for Limited Dependent Variable Models by V.A. Hajivassiliou 20. Simulation Estimation for Panel Data Models with Limited Dependent Variable by M.P. Keane 21. A Perspective Application of Bootstrap Methods in Econometrics by J. Jeong and G.S. Maddala 22. Stochastic Simulations for Inference in Nonlinear Errors-in-Variables Models by R.S. Mariano and B.W. Brown 23. Bootstrap Methods: Applications in Econometrics by H.D. Vinod 24. Identifying Outliers and Influential Observations in Econometric Models by S.G. Donald and G.S. Maddala 25. Statistical Aspects of Calibration in Macroeconomics by A.W. Gregory and G.W. Smith 26. Panel Data Models with Rational Expectations by K. Lahiri 27. Continuous Time Financial Models: Statistical Applications of Stochastic Processes by K.R. Sawyer
Volume 12. Environmental Statistics Edited by G.P. Patil and C.R. Rao 1994 xix + 927 pp. 1. Environmetrics: An Emerging Science by J.S. Hunter 2. A National Center for Statistical Ecology and Environmental Statistics: A Center Without Walls by G.P. Patil 3. Replicate Measurements for Data Quality and Environmental Modeling by W. Liggett 4. Design and Analysis of Composite Sampling Procedures: A Review by G. Lovison, S.D. Gore and G.P. Patil 5. Ranked Set Sampling by G.P. Patil, A.K. Sinha and C. Taillie 6. Environmental Adaptive Sampling by G.A.F. Seber and S.K. Thompson 7. Statistical Analysis of Censored Environmental Data by M. Akritas, T. Ruscitti and G.P. Patil
8. Biological Monitoring: Statistical Issues and Models by E.P. Smith 9. Environmental Sampling and Monitoring by S.V. Stehman and W. Scott Overton 10. Ecological Statistics by B.F.J. Manly 11. Forest Biometrics by H.E. Burkhart and T.G. Gregoire 12. Ecological Diversity and Forest Management by J.H. Gove, G.P. Patil, B.F. Swindel and C. Taillie 13. Ornithological Statistics by P.M. North 14. Statistical Methods in Developmental Toxicology by P.J. Catalano and L.M. Ryan 15. Environmental Biometry: Assessing Impacts of Environmental Stimuli Via Animal and Microbial Laboratory Studies by W.W. Piegorsch 16. Stochasticity in Deterministic Models by J.J.M. Bedaux and S.A.L.M. Kooijman 17. Compartmental Models of Ecological and Environmental Systems by J.H. Matis and T.E. Wehrly 18. Environmental Remote Sensing and Geographic Information Systems-Based Modeling by W.L. Myers 19. Regression Analysis of Spatially Correlated Data: The Kanawha County Health Study by C.A. Donnelly, J.H. Ware and N.M. Laird 20. Methods for Estimating Heterogeneous Spatial Covariance Functions with Environmental Applications by P. Guttorp and P.D. Sampson 21. Meta-analysis in Environmental Statistics by V. Hasselblad 22. Statistical Methods in Atmospheric Science by A.R. Solow 23. Statistics with Agricultural Pests and Environmental Impacts by L.J. Young and J.H. Young 24. A Crystal Cube for Coastal and Estuarine Degradation: Selection of Endpoints and Development of Indices for Use in Decision Making by M.T. Boswell, J.S. O’Connor and G.P. Patil 25. How Does Scientific Information in General and Statistical Information in Particular Input to the Environmental Regulatory Process? by C.R. Cothern 26. Environmental Regulatory Statistics by C.B. Davis 27. An Overview of Statistical Issues Related to Environmental Cleanup by R. Gilbert 28. Environmental Risk Estimation and Policy Decisions by H. Lacayo Jr
Volume 13. Design and Analysis of Experiments Edited by S. Ghosh and C.R. Rao 1996 xviii + 1230 pp. 1. The Design and Analysis of Clinical Trials by P. Armitage 2. Clinical Trials in Drug Development: Some Statistical Issues by H.I. Patel 3. Optimal Crossover Designs by J. Stufken 4. Design and Analysis of Experiments: Nonparametric Methods with Applications to Clinical Trials by P.K. Sen
5. Adaptive Designs for Parametric Models by S. Zacks 6. Observational Studies and Nonrandomized Experiments by P.R. Rosenbaum 7. Robust Design: Experiments for Improving Quality by D.M. Steinberg 8. Analysis of Location and Dispersion Effects from Factorial Experiments with a Circular Response by C.M. Anderson 9. Computer Experiments by J.R. Koehler and A.B. Owen 10. A Critique of Some Aspects of Experimental Design by J.N. Srivastava 11. Response Surface Designs by N.R. Draper and D.K.J. Lin 12. Multiresponse Surface Methodology by A.I. Khuri 13. Sequential Assembly of Fractions in Factorial Experiments by S. Ghosh 14. Designs for Nonlinear and Generalized Linear Models by A.C. Atkinson and L.M. Haines 15. Spatial Experimental Design by R.J. Martin 16. Design of Spatial Experiments: Model Fitting and Prediction by V.V. Fedorov 17. Design of Experiments with Selection and Ranking Goals by S.S. Gupta and S. Panchapakesan 18. Multiple Comparisons by A.C. Tamhane 19. Nonparametric Methods in Design and Analysis of Experiments by E. Brunner and M.L. Puri 20. Nonparametric Analysis of Experiments by A.M. Dean and D.A. Wolfe 21. Block and Other Designs in Agriculture by D.J. Street 22. Block Designs: Their Combinatorial and Statistical Properties by T. Calinski and S. Kageyama 23. Developments in Incomplete Block Designs for Parallel Line Bioassays by S. Gupta and R. Mukerjee 24. Row-Column Designs by K.R. Shah and B.K. Sinha 25. Nested Designs by J.P. Morgan 26. Optimal Design: Exact Theory by C.S. Cheng 27. Optimal and Efficient Treatment – Control Designs by D. Majumdar 28. Model Robust Designs by Y.-J. Chang and W.I. Notz 29. Review of Optimal Bayes Designs by A. DasGupta 30. Approximate Designs for Polynomial Regression: Invariance, Admissibility, and Optimality by N. Gaffke and B. Heiligers
Volume 14. Statistical Methods in Finance Edited by G.S. Maddala and C.R. Rao 1996 xvi + 733 pp. 1. Econometric Evaluation of Asset Pricing Models by W.E. Ferson and R. Jagannathan 2. Instrumental Variables Estimation of Conditional Beta Pricing Models by C.R. Harvey and C.M. Kirby
3. Semiparametric Methods for Asset Pricing Models by B.N. Lehmann 4. Modeling the Term Structure by A.R. Pagan, A.D. Hall and V. Martin 5. Stochastic Volatility by E. Ghysels, A.C. Harvey and E. Renault 6. Stock Price Volatility by S.F. LeRoy 7. GARCH Models of Volatility by F.C. Palm 8. Forecast Evaluation and Combination by F.X. Diebold and J.A. Lopez 9. Predictable Components in Stock Returns by G. Kaul 10. Interest Rate Spreads as Predictors of Business Cycles by K. Lahiri and J.G. Wang 11. Nonlinear Time Series, Complexity Theory, and Finance by W.A. Brock and P.J.F. deLima 12. Count Data Models for Financial Data by A.C. Cameron and P.K. Trivedi 13. Financial Applications of Stable Distributions by J.H. McCulloch 14. Probability Distributions for Financial Models by J.B. McDonald 15. Bootstrap Based Tests in Financial Models by G.S. Maddala and H. Li 16. Principal Component and Factor Analyses by C.R. Rao 17. Errors in Variables Problems in Finance by G.S. Maddala and M. Nimalendran 18. Financial Applications of Artificial Neural Networks by M. Qi 19. Applications of Limited Dependent Variable Models in Finance by G.S. Maddala 20. Testing Option Pricing Models by D.S. Bates 21. Peso Problems: Their Theoretical and Empirical Implications by M.D.D. Evans 22. Modeling Market Microstructure Time Series by J. Hasbrouck 23. Statistical Methods in Tests of Portfolio Efficiency: A Synthesis by J. Shanken
Volume 15. Robust Inference Edited by G.S. Maddala and C.R. Rao 1997 xviii + 698 pp. 1. Robust Inference in Multivariate Linear Regression Using Difference of Two Convex Functions as the Discrepancy Measure by Z.D. Bai, C.R. Rao and Y.H. Wu 2. Minimum Distance Estimation: The Approach Using Density-Based Distances by A. Basu, I.R. Harris and S. Basu 3. Robust Inference: The Approach Based on Influence Functions by M. Markatou and E. Ronchetti 4. Practical Applications of Bounded-Influence Tests by S. Heritier and M.-P. VictoriaFeser 5. Introduction to Positive-Breakdown Methods by P.J. Rousseeuw 6. Outlier Identification and Robust Methods by U. Gather and C. Becker 7. Rank-Based Analysis of Linear Models by T.P. Hettmansperger, J.W. McKean and S.J. Sheather 8. Rank Tests for Linear Models by R. Koenker 9. Some Extensions in the Robust Estimation of Parameters of Exponential and Double Exponential Distributions in the Presence of Multiple Outliers by A. Childs and N. Balakrishnan
10. Outliers, Unit Roots and Robust Estimation of Nonstationary Time Series by G.S. Maddala and Y. Yin 11. Autocorrelation-Robust Inference by P.M. Robinson and C. Velasco 12. A Practitioner’s Guide to Robust Covariance Matrix Estimation by W.J. den Haan and A. Levin 13. Approaches to the Robust Estimation of Mixed Models by A.H. Welsh and A.M. Richardson 14. Nonparametric Maximum Likelihood Methods by S.R. Cosslett 15. A Guide to Censored Quantile Regressions by B. Fitzenberger 16. What Can Be Learned About Population Parameters When the Data Are Contaminated by J.L. Horowitz and C.F. Manski 17. Asymptotic Representations and Interrelations of Robust Estimators and Their Applications by J. Jureˇcková and P.K. Sen 18. Small Sample Asymptotics: Applications in Robustness by C.A. Field and M.A. Tingley 19. On the Fundamentals of Data Robustness by G. Maguluri and K. Singh 20. Statistical Analysis With Incomplete Data: A Selective Review by M.G. Akritas and M.P. La Valley 21. On Contamination Level and Sensitivity of Robust Tests by J.Á. Visšek 22. Finite Sample Robustness of Tests: An Overview by T. Kariya and P. Kim 23. Future Directions by G.S. Maddala and C.R. Rao
Volume 16. Order Statistics – Theory and Methods Edited by N. Balakrishnan and C.R. Rao 1997 xix + 688 pp. 1. Order Statistics: An Introduction by N. Balakrishnan and C.R. Rao 2. Order Statistics: A Historical Perspective by H. Leon Harter and N. Balakrishnan 3. Computer Simulation of Order Statistics by Pandu R. Tadikamalla and N. Balakrishnan 4. Lorenz Ordering of Order Statistics and Record Values by Barry C. Arnold and Jose A. Villasenor 5. Stochastic Ordering of Order Statistics by Philip J. Boland, Moshe Shaked and J. George Shanthikumar 6. Bounds for Expectations of L-Estimates by T. Rychlik 7. Recurrence Relations and Identities for Moments of Order Statistics by N. Balakrishnan and K.S. Sultan 8. Recent Approaches to Characterizations Based on Order Statistics and Record Values by C.R. Rao and D.N. Shanbhag 9. Characterizations of Distributions via Identically Distributed Functions of Order Statistics by Ursula Gather, Udo Kamps and Nicole Schweitzer 10. Characterizations of Distributions by Recurrence Relations and Identities for Moments of Order Statistics by Udo Kamps
11. Univariate Extreme Value Theory and Applications by Janos Galambos 12. Order Statistics: Asymptotics in Applications by Pranab Kumar Sen 13. Zero-One Laws for Large Order Statistics by R.J. Tomkins and Hong Wang 14. Some Exact Properties of Cook’s DI by D.R. Jensen and D.E. Ramirez 15. Generalized Recurrence Relations for Moments of Order Statistics from NonIdentical Pareto and Truncated Pareto Random Variables with Applications to Robustness by Aaron Childs and N. Balakrishnan 16. A Semiparametric Bootstrap for Simulating Extreme Order Statistics by Robert L. Strawderman and Daniel Zelterman 17. Approximations to Distributions of Sample Quantiles by Chunsheng Ma and John Robinson 18. Concomitants of Order Statistics by H.A. David and H.N. Nagaraja 19. A Record of Records by Valery B. Nevzorov and N. Balakrishnan 20. Weighted Sequential Empirical Type Processes with Applications to Change-Point Problems by Barbara Szyszkowicz 21. Sequential Quantile and Bahadur–Kiefer Processes by Miklós Csörg˝o and Barbara Szyszkowicz
Volume 17. Order Statistics: Applications Edited by N. Balakrishnan and C.R. Rao 1998 xviii + 712 pp. 1. Order Statistics in Exponential Distribution by Asit P. Basu and Bahadur Singh 2. Higher Order Moments of Order Statistics from Exponential and Right-truncated Exponential Distributions and Applications to Life-testing Problems by N. Balakrishnan and Shanti S. Gupta 3. Log-gamma Order Statistics and Linear Estimation of Parameters by N. Balakrishnan and P.S. Chan 4. Recurrence Relations for Single and Product Moments of Order Statistics from a Generalized Logistic Distribution with Applications to Inference and Generalizations to Double Truncation by N. Balakrishnan and Rita Aggarwala 5. Order Statistics from the Type III Generalized Logistic Distribution and Applications by N. Balakrishnan and S.K. Lee 6. Estimation of Scale Parameter Based on a Fixed Set of Order Statistics by Sanat K. Sarkar and Wenjin Wang 7. Optimal Linear Inference Using Selected Order Statistics in Location-Scale Models by M. Masoom Ali and Dale Umbach 8. L-Estimation by J.R.M. Hosking 9. On Some L-estimation in Linear Regression Models by Soroush Alimoradi and A.K.Md. Ehsanes Saleh 10. The Role of Order Statistics in Estimating Threshold Parameters by A. Clifford Cohen
11. Parameter Estimation under Multiply Type-II Censoring by Fanhui Kong 12. On Some Aspects of Ranked Set Sampling in Parametric Estimation by Nora Ni Chuiv and Bimal K. Sinha 13. Some Uses of Order Statistics in Bayesian Analysis by Seymour Geisser 14. Inverse Sampling Procedures to Test for Homogeneity in a Multinomial Distribution by S. Panchapakesan, Aaron Childs, B.H. Humphrey and N. Balakrishnan 15. Prediction of Order Statistics by Kenneth S. Kaminsky and Paul I. Nelson 16. The Probability Plot: Tests of Fit Based on the Correlation Coefficient by R.A. Lockhart and M.A. Stephens 17. Distribution Assessment by Samuel Shapiro 18. Application of Order Statistics to Sampling Plans for Inspection by Variables by Helmut Schneider and Frances Barbera 19. Linear Combinations of Ordered Symmetric Observations with Applications to Visual Acuity by Marios Viana 20. Order-Statistic Filtering and Smoothing of Time-Series: Part I by Gonzalo R. Arce, Yeong-Taeg Kim and Kenneth E. Barner 21. Order-Statistic Filtering and Smoothing of Time-Series: Part II by Kenneth E. Barner and Gonzalo R. Arce 22. Order Statistics in Image Processing by Scott T. Acton and Alan C. Bovik 23. Order Statistics Application to CFAR Radar Target Detection by R. Viswanathan
Volume 18. Bioenvironmental and Public Health Statistics Edited by P.K. Sen and C.R. Rao 2000 xxiv + 1105 pp. 1. Bioenvironment and Public Health: Statistical Perspectives by Pranab K. Sen 2. Some Examples of Random Process Environmental Data Analysis by David R. Brillinger 3. Modeling Infectious Diseases – AIDS by L. Billard 4. On Some Multiplicity Problems and Multiple Comparison Procedures in Biostatistics by Yosef Hochberg and Peter H. Westfall 5. Analysis of Longitudinal Data by Julio M. Singer and Dalton F. Andrade 6. Regression Models for Survival Data by Richard A. Johnson and John P. Klein 7. Generalised Linear Models for Independent and Dependent Responses by Bahjat F. Qaqish and John S. Preisser 8. Hierarchical and Empirical Bayes Methods for Environmental Risk Assessment by Gauri Datta, Malay Ghosh and Lance A. Waller 9. Non-parametrics in Bioenvironmental and Public Health Statistics by Pranab Kumar Sen 10. Estimation and Comparison of Growth and Dose-Response Curves in the Presence of Purposeful Censoring by Paul W. Stewart
11. Spatial Statistical Methods for Environmental Epidemiology by Andrew B. Lawson and Noel Cressie 12. Evaluating Diagnostic Tests in Public Health by Margaret Pepe, Wendy Leisenring and Carolyn Rutter 13. Statistical Issues in Inhalation Toxicology by E. Weller, L. Ryan and D. Dockery 14. Quantitative Potency Estimation to Measure Risk with Bioenvironmental Hazards by A. John Bailer and Walter W. Piegorsch 15. The Analysis of Case-Control Data: Epidemiologic Studies of Familial Aggregation by Nan M. Laird, Garrett M. Fitzmaurice and Ann G. Schwartz 16. Cochran–Mantel–Haenszel Techniques: Applications Involving Epidemiologic Survey Data by Daniel B. Hall, Robert F. Woolson, William R. Clarke and Martha F. Jones 17. Measurement Error Models for Environmental and Occupational Health Applications by Robert H. Lyles and Lawrence L. Kupper 18. Statistical Perspectives in Clinical Epidemiology by Shrikant I. Bangdiwala and Sergio R. Muñoz 19. ANOVA and ANOCOVA for Two-Period Crossover Trial Data: New vs. Standard by Subir Ghosh and Lisa D. Fairchild 20. Statistical Methods for Crossover Designs in Bioenvironmental and Public Health Studies by Gail E. Tudor, Gary G. Koch and Diane Catellier 21. Statistical Models for Human Reproduction by C.M. Suchindran and Helen P. Koo 22. Statistical Methods for Reproductive Risk Assessment by Sati Mazumdar, Yikang Xu, Donald R. Mattison, Nancy B. Sussman and Vincent C. Arena 23. Selection Biases of Samples and their Resolutions by Ranajit Chakraborty and C. Radhakrishna Rao 24. Genomic Sequences and Quasi-Multivariate CATANOVA by Hildete Prisco Pinheiro, Françoise Seillier-Moiseiwitsch, Pranab Kumar Sen and Joseph Eron Jr 25. Statistical Methods for Multivariate Failure Time Data and Competing Risks by Ralph A. DeMasi 26. Bounds on Joint Survival Probabilities with Positively Dependent Competing Risks by Sanat K. Sarkar and Kalyan Ghosh 27. Modeling Multivariate Failure Time Data by Limin X. Clegg, Jianwen Cai and Pranab K. Sen 28. The Cost–Effectiveness Ratio in the Analysis of Health Care Programs by Joseph C. Gardiner, Cathy J. Bradley and Marianne Huebner 29. Quality-of-Life: Statistical Validation and Analysis An Example from a Clinical Trial by Balakrishna Hosmane, Clement Maurath and Richard Manski 30. Carcinogenic Potency: Statistical Perspectives by Anup Dewanji 31. Statistical Applications in Cardiovascular Disease by Elizabeth R. DeLong and David M. DeLong 32. Medical Informatics and Health Care Systems: Biostatistical and Epidemiologic Perspectives by J. Zvárová 33. Methods of Establishing In Vitro–In Vivo Relationships for Modified Release Drug Products by David T. Mauger and Vernon M. Chinchilli
34. Statistics in Psychiatric Research by Sati Mazumdar, Patricia R. Houck and Charles F. Reynolds III 35. Bridging the Biostatistics–Epidemiology Gap by Lloyd J. Edwards 36. Biodiversity – Measurement and Analysis by S.P. Mukherjee
Volume 19. Stochastic Processes: Theory and Methods Edited by D.N. Shanbhag and C.R. Rao 2001 xiv + 967 pp. 1. Pareto Processes by Barry C. Arnold 2. Branching Processes by K.B. Athreya and A.N. Vidyashankar 3. Inference in Stochastic Processes by I.V. Basawa 4. Topics in Poisson Approximation by A.D. Barbour 5. Some Elements on Lévy Processes by Jean Bertoin 6. Iterated Random Maps and Some Classes of Markov Processes by Rabi Bhattacharya and Edward C. Waymire 7. Random Walk and Fluctuation Theory by N.H. Bingham 8. A Semigroup Representation and Asymptotic Behavior of Certain Statistics of the Fisher–Wright–Moran Coalescent by Adam Bobrowski, Marek Kimmel, Ovide Arino and Ranajit Chakraborty 9. Continuous-Time ARMA Processes by P.J. Brockwell 10. Record Sequences and their Applications by John Bunge and Charles M. Goldie 11. Stochastic Networks with Product Form Equilibrium by Hans Daduna 12. Stochastic Processes in Insurance and Finance by Paul Embrechts, Rüdiger Frey and Hansjörg Furrer 13. Renewal Theory by D.R. Grey 14. The Kolmogorov Isomorphism Theorem and Extensions to some Nonstationary Processes by Yûichirô Kakihara 15. Stochastic Processes in Reliability by Masaaki Kijima, Haijun Li and Moshe Shaked 16. On the supports of Stochastic Processes of Multiplicity One by A. Kłopotowski and M.G. Nadkarni 17. Gaussian Processes: Inequalities, Small Ball Probabilities and Applications by W.V. Li and Q.-M. Shao 18. Point Processes and Some Related Processes by Robin K. Milne 19. Characterization and Identifiability for Stochastic Processes by B.L.S. Prakasa Rao 20. Associated Sequences and Related Inference Problems by B.L.S. Prakasa Rao and Isha Dewan 21. Exchangeability, Functional Equations, and Characterizations by C.R. Rao and D.N. Shanbhag 22. Martingales and Some Applications by M.M. Rao 23. Markov Chains: Structure and Applications by R.L. Tweedie
24. Diffusion Processes by S.R.S. Varadhan 25. Itô’s Stochastic Calculus and Its Applications by S. Watanabe
Volume 20. Advances in Reliability Edited by N. Balakrishnan and C.R. Rao 2001 xxii + 860 pp. 1. Basic Probabilistic Models in Reliability by N. Balakrishnan, N. Limnios and C. Papadopoulos 2. The Weibull Nonhomogeneous Poisson Process by A.P. Basu and S.E. Rigdon 3. Bathtub-Shaped Failure Rate Life Distributions by C.D. Lai, M. Xie and D.N.P. Murthy 4. Equilibrium Distribution – its Role in Reliability Theory by A. Chatterjee and S.P. Mukherjee 5. Reliability and Hazard Based on Finite Mixture Models by E.K. Al-Hussaini and K.S. Sultan 6. Mixtures and Monotonicity of Failure Rate Functions by M. Shaked and F. Spizzichino 7. Hazard Measure and Mean Residual Life Orderings: A Unified Approach by M. Asadi and D.N. Shanbhag 8. Some Comparison Results of the Reliability Functions of Some Coherent Systems by J. Mi 9. On the Reliability of Hierarchical Structures by L.B. Klebanov and G.J. Szekely 10. Consecutive k-out-of-n Systems by N.A. Mokhlis 11. Exact Reliability and Lifetime of Consecutive Systems by S. Aki 12. Sequential k-out-of-n Systems by E. Cramer and U. Kamps 13. Progressive Censoring: A Review by R. Aggarwala 14. Point and Interval Estimation for Parameters of the Logistic Distribution Based on Progressively Type-II Censored Samples by N. Balakrishnan and N. Kannan 15. Progressively Censored Variables-Sampling Plans for Life Testing by U. Balasooriya 16. Graphical Techniques for Analysis of Data From Repairable Systems by P.A. Akersten, B. Klefsjö and B. Bergman 17. A Bayes Approach to the Problem of Making Repairs by G.C. McDonald 18. Statistical Analysis for Masked Data by B.J. Flehinger† , B. Reiser and E. Yashchin 19. Analysis of Masked Failure Data under Competing Risks by A. Sen, S. Basu and M. Banerjee 20. Warranty and Reliability by D.N.P. Murthy and W.R. Blischke 21. Statistical Analysis of Reliability Warranty Data by K. Suzuki, Md. Rezaul Karim and L. Wang 22. Prediction of Field Reliability of Units, Each under Differing Dynamic Stresses, from Accelerated Test Data by W. Nelson
23. Step-Stress Accelerated Life Test by E. Gouno and N. Balakrishnan 24. Estimation of Correlation under Destructive Testing by R. Johnson and W. Lu 25. System-Based Component Test Plans for Reliability Demonstration: A Review and Survey of the State-of-the-Art by J. Rajgopal and M. Mazumdar 26. Life-Test Planning for Preliminary Screening of Materials: A Case Study by J. Stein and N. Doganaksoy 27. Analysis of Reliability Data from In-House Audit Laboratory Testing by R. Agrawal and N. Doganaksoy 28. Software Reliability Modeling, Estimation and Analysis by M. Xie and G.Y. Hong 29. Bayesian Analysis for Software Reliability Data by J.A. Achcar 30. Direct Graphical Estimation for the Parameters in a Three-Parameter Weibull Distribution by P.R. Nelson and K.B. Kulasekera 31. Bayesian and Frequentist Methods in Change-Point Problems by N. Ebrahimi and S.K. Ghosh 32. The Operating Characteristics of Sequential Procedures in Reliability by S. Zacks 33. Simultaneous Selection of Extreme Populations from a Set of Two-Parameter Exponential Populations by K. Hussein and S. Panchapakesan
Volume 21. Stochastic Processes: Modelling and Simulation Edited by D.N. Shanbhag and C.R. Rao 2003 xxviii + 1002 pp. 1. Modelling and Numerical Methods in Manufacturing System Using Control Theory by E.K. Boukas and Z.K. Liu 2. Models of Random Graphs and their Applications by C. Cannings and D.B. Penman 3. Locally Self-Similar Processes and their Wavelet Analysis by J.E. Cavanaugh, Y. Wang and J.W. Davis 4. Stochastic Models for DNA Replication by R. Cowan 5. An Empirical Process with Applications to Testing the Exponential and Geometric Models by J.A. Ferreira 6. Patterns in Sequences of Random Events by J. Gani 7. Stochastic Models in Telecommunications for Optimal Design, Control and Performance Evaluation by N. Gautam 8. Stochastic Processes in Epidemic Modelling and Simulation by D. Greenhalgh 9. Empirical Estimators Based on MCMC Data by P.E. Greenwood and W. Wefelmeyer 10. Fractals and the Modelling of Self-Similarity by B.M. Hambly 11. Numerical Methods in Queueing Theory by D. Heyman 12. Applications of Markov Chains to the Distribution Theory of Runs and Patterns by M.V. Koutras 13. Modelling Image Analysis Problems Using Markov Random Fields by S.Z. Li 14. An Introduction to Semi-Markov Processes with Application to Reliability by N. Limnios and G. Opri¸san
15. Departures and Related Characteristics in Queueing Models by M. Manoharan, M.H. Alamatsaz and D.N. Shanbhag 16. Discrete Variate Time Series by E. McKenzie 17. Extreme Value Theory, Models and Simulation by S. Nadarajah 18. Biological Applications of Branching Processes by A.G. Pakes 19. Markov Chain Approaches to Damage Models by C.R. Rao, M. Albassam, M.B. Rao and D.N. Shanbhag 20. Point Processes in Astronomy: Exciting Events in the Universe by J.D. Scargle and G.J. Babu 21. On the Theory of Discrete and Continuous Bilinear Time Series Models by T. Subba Rao and Gy. Terdik 22. Nonlinear and Non-Gaussian State-Space Modeling with Monte Carlo Techniques: A Survey and Comparative Study by H. Tanizaki 23. Markov Modelling of Burst Behaviour in Ion Channels by G.F. Yeo, R. K. Milne, B.W. Madsen, Y. Li and R.O. Edeson
Volume 22. Statistics in Industry Edited by R. Khattree and C.R. Rao 2003 xxi + 1150 pp. 1. Guidelines for Selecting Factors and Factor Levels for an Industrial Designed Experiment by V. Czitrom 2. Industrial Experimentation for Screening by D.K.J. Lin 3. The Planning and Analysis of Industrial Selection and Screening Experiments by G. Pan, T.J. Santner and D.M. Goldsman 4. Uniform Experimental Designs and their Applications in Industry by K.-T. Fang and D.K.J. Lin 5. Mixed Models and Repeated Measures: Some Illustrative Industrial Examples by G.A. Milliken 6. Current Modeling and Design Issues in Response Surface Methodology: GLMs and Models with Block Effects by A.I. Khuri 7. A Review of Design and Modeling in Computer Experiments by V.C.P. Chen, K.-L. Tsui, R.R. Barton and J.K. Allen 8. Quality Improvement and Robustness via Design of Experiments by B.E. Ankenman and A.M. Dean 9. Software to Support Manufacturing Experiments by J.E. Reece 10. Statistics in the Semiconductor Industry by V. Czitrom 11. PREDICT: A New Approach to Product Development and Lifetime Assessment Using Information Integration Technology by J.M. Booker, T.R. Bement, M.A. Meyer and W.J. Kerscher III 12. The Promise and Challenge of Mining Web Transaction Data by S.R. Dalal, D. Egan, Y. Ho and M. Rosenstein
13. Control Chart Schemes for Monitoring the Mean and Variance of Processes Subject to Sustained Shifts and Drifts by Z.G. Stoumbos, M.R. Reynolds Jr and W.H. Woodall 14. Multivariate Control Charts: Hotelling T 2 , Data Depth and Beyond by R.Y. Liu 15. Effective Sample Sizes for T 2 Control Charts by R.L. Mason, Y.-M. Chou and J.C. Young 16. Multidimensional Scaling in Process Control by T.F. Cox 17. Quantifying the Capability of Industrial Processes by A.M. Polansky and S.N.U.A. Kirmani 18. Taguchi’s Approach to On-line Control Procedure by M.S. Srivastava and Y. Wu 19. Dead-Band Adjustment Schemes for On-line Feedback Quality Control by A. Luceño 20. Statistical Calibration and Measurements by H. Iyer 21. Subsampling Designs in Industry: Statistical Inference for Variance Components by R. Khattree 22. Repeatability, Reproducibility and Interlaboratory Studies by R. Khattree 23. Tolerancing – Approaches and Related Issues in Industry by T.S. Arthanari 24. Goodness-of-fit Tests for Univariate and Multivariate Normal Models by D.K. Srivastava and G.S. Mudholkar 25. Normal Theory Methods and their Simple Robust Analogs for Univariate and Multivariate Linear Models by D.K. Srivastava and G.S. Mudholkar 26. Diagnostic Methods for Univariate and Multivariate Normal Data by D.N. Naik 27. Dimension Reduction Methods Used in Industry by G. Merola and B. Abraham 28. Growth and Wear Curves by A.M. Kshirsagar 29. Time Series in Industry and Business by B. Abraham and N. Balakrishna 30. Stochastic Process Models for Reliability in Dynamic Environments by N.D. Singpurwalla, T.A. Mazzuchi, S. Özekici and R. Soyer 31. Bayesian Inference for the Number of Undetected Errors by S. Basu