Modern Statistical and Mathematical Methods in Reliability
SERIES IN QUALITY, RELIABILITY & ENGINEERING STATISTICS
Series Editors: M. Xie (National University of Singapore), T. Bendell (Nottingham Polytechnic), A. P. Basu (University of Missouri)
Published:

Vol. 1: Software Reliability Modelling (M. Xie)
Vol. 2: Recent Advances in Reliability and Quality Engineering (H. Pham)
Vol. 3: Contributions to Hardware and Software Reliability (P. K. Kapur, R. B. Garg & S. Kumar)
Vol. 4: Frontiers in Reliability (A. P. Basu, S. K. Basu & S. Mukhopadhyay)
Vol. 5: System and Bayesian Reliability (Y. Hayakawa, T. Irony & M. Xie)
Vol. 6: Multi-State System Reliability Assessment, Optimization and Applications (A. Lisnianski & G. Levitin)
Vol. 7: Mathematical and Statistical Methods in Reliability (B. H. Lindqvist & K. A. Doksum)
Vol. 8: Response Modeling Methodology: Empirical Modeling for Engineering and Science (H. Shore)
Vol. 9: Reliability Modeling, Analysis and Optimization (Hoang Pham)
Series on Quality, Reliability and Engineering Statistics
Vol. 10
Modern Statistical and Mathematical Methods in Reliability
editors
Alyson Wilson, Sallie Keller-McNulty
Yvonne Armijo, Los Alamos National Laboratory, USA
Nikolaos Limnios, Université de Technologie de Compiègne, France
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
Published by
World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
MODERN STATISTICAL AND MATHEMATICAL METHODS IN RELIABILITY
Copyright © 2005 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the Publisher.
ISBN 981-256-356-3
Printed in Singapore by World Scientific Printers (S) Pte Ltd
PREFACE
This volume is published on the occasion of the fourth International Conference on Mathematical Methods in Reliability (MMR 2004). This biannual conference was hosted by Los Alamos National Laboratory (LANL) and the National Institute of Statistical Sciences (NISS), June 21-25, 2004, in Santa Fe, New Mexico. The MMR conferences serve as a forum for discussing fundamental issues on mathematical methods in reliability theory and its applications. They bring together mathematicians, probabilists, statisticians, and computer scientists around a central focus on reliability. This volume contains a careful selection of peer-reviewed papers from MMR 2004. It provides a broad overview of current research activities in reliability theory and its applications, with coverage of reliability modeling, network and system reliability, Bayesian methods, survival analysis, degradation and maintenance modeling, and software reliability. The contributors are all leading experts in the field and include the plenary session speakers, Tim Bedford, Thierry Duchesne, Henry Wynn, Vicki Bier, Edsel Peña, Michael Hamada, and Todd Graves. This volume follows Statistical and Probabilistic Models in Reliability: Proceedings of the International Conference on Mathematical Methods in Reliability, Bucharest, Romania (D. C. Ionescu and N. Limnios, eds.), Birkhäuser, Series on Quality, Reliability and Engineering Statistics (1999); Recent Advances in Reliability Theory: Methodology, Practice, and Inference, Proceedings of the Second International Conference on Mathematical Methods in Reliability, Bordeaux, France (N. Limnios and M. Nikulin, eds.), Birkhäuser, Series on Quality, Reliability and Engineering Statistics (2000); and Mathematical and Statistical Methods in Reliability, Proceedings of the Third International Conference on Mathematical Methods in Reliability, Trondheim, Norway (B. Lindqvist and K. A. Doksum, eds.), World
Scientific Publishing, Series on Quality, Reliability and Engineering Statistics 7 (2003). The editors extend their thanks to Hazel Kutac for the formatting and editing of this volume.
A. Wilson, Los Alamos National Laboratory, Los Alamos, NM, USA
S. Keller-McNulty, Los Alamos National Laboratory, Los Alamos, NM, USA
Y. Armijo, Los Alamos National Laboratory, Los Alamos, NM, USA
N. Limnios, Université de Technologie de Compiègne, Compiègne, France
CONTENTS

Preface ... v

1. Competing Risk Modeling in Reliability (Tim Bedford) ... 1
   1.1 Introduction
   1.2 Independent and Dependent Competing Risks
   1.3 Characterization of Possible Marginals
   1.4 Kolmogorov-Smirnov Test
   1.5 Conservatism of Independence
   1.6 The Bias of Independence
   1.7 Maintenance as a Censoring Mechanism
       1.7.1 Dependent copula model
       1.7.2 Random clipping
       1.7.3 Random signs
       1.7.4 LBL model
       1.7.5 Mixed exponential model
       1.7.6 Delay time model
   1.8 Loosening the Renewal Assumption
   1.9 Conclusion
   References

2. Game-Theoretic and Reliability Methods in Counter-Terrorism and Security (Vicki Bier) ... 17
   2.1 Introduction
   2.2 Applications of Reliability Analysis to Security
   2.3 Applications of Game Theory to Security
       2.3.1 Security as a game between defenders
   2.4 Combining Reliability Analysis and Game Theory
   2.5 Directions for Future Work
   2.6 Conclusion
   Acknowledgments
   References

3. Regression Models for Reliability Given the Usage Accumulation History (Thierry Duchesne) ... 29
   3.1 Introduction
       3.1.1 Definitions and notation
       3.1.2 Common lifetime regression models
   3.2 Other Approaches to Regression Model Building
       3.2.1 Models based on transfer functionals
       3.2.2 Models based on internal wear
   3.3 Collapsible Models
       3.3.1 Two-dimensional prediction problems
   3.4 Discussion
   Acknowledgments
   References

4. Bayesian Methods for Assessing System Reliability: Models and Computation (Todd Graves and Michael Hamada) ... 41
   4.1 Challenges in Modern Reliability Analyses
   4.2 Three Important Examples
       4.2.1 Example 1: Reliability of a component based on biased sampling
       4.2.2 Example 2: System reliability based on partially informative tests
       4.2.3 Example 3: Integrated system reliability based on diverse data
   4.3 YADAS: a Statistical Modeling Environment
       4.3.1 Expressing arbitrary models
       4.3.2 Special algorithms
       4.3.3 Interfaces, present and future
   4.4 Examples Revisited
       4.4.1 Example 1
       4.4.2 Example 2
       4.4.3 Example 3
   4.5 Discussion
   References

5. Dynamic Modeling in Reliability and Survival Analysis (Edsel A. Peña and Elizabeth H. Slate) ... 55
   5.1 Introduction
   5.2 Dynamic Models
       5.2.1 A dynamic reliability model
       5.2.2 A dynamic recurrent event model
   5.3 Some Probabilistic Properties
   5.4 Inference Methods
       5.4.1 Dynamic load-sharing model
       5.4.2 Dynamic recurrent event model
   5.5 An Application
   References

6. End of Life Analysis (H. Wynn, T. Figarella, A. Di Bucchianico, M. Jansen, W. Bergsma) ... 73
   6.1 The Urgency of WEEE
       6.1.1 The effect on reliability
       6.1.2 The implications for design
   6.2 Signature Analysis and Hierarchical Modeling
       6.2.1 The importance of function
       6.2.2 Wavelets and feature extraction
   6.3 A Case Study
       6.3.1 Function: preliminary FMEA and life tests
       6.3.2 Stapler motor: time domain
       6.3.3 Lifting motor: frequency domain
   6.4 The Development of Protocols and Inversion
   Acknowledgments
   References

7. Reliability Analysis of a Dynamic Phased Mission System: Comparison of Two Approaches (Marc Bouissou, Yves Dutuit, Sidoine Maillard) ... 87
   7.1 Introduction
   7.2 Test Case Definition
   7.3 Test Case Resolution
       7.3.1 Resolution with a Petri net
       7.3.2 Resolution with a BDMP
       7.3.3 Compared results
   7.4 Conclusions
   References

8. Sensitivity Analysis of Accelerated Life Tests with Competing Failure Modes (Cornel Bunea and Thomas A. Mazzuchi) ... 105
   8.1 Introduction
   8.2 ALT and Competing Risks
       8.2.1 ALT and independent competing risks
       8.2.2 ALT and dependent competing risks
   8.3 Graphical Analysis of Motorettes Data
   8.4 A Copula Dependent ALT - Competing Risk Model
       8.4.1 Competing risks and copula
       8.4.2 Measures of association
       8.4.3 Archimedean copula
       8.4.4 Application on motor insulation data
   8.5 Conclusions
   References

9. Estimating Mean Cumulative Functions from Truncated Automotive Warranty Data (S. Chukova and J. Robinson) ... 121
   9.1 Introduction
   9.2 The Hu and Lawless Model
   9.3 Extensions of the Model
       9.3.1 "Time" is age case
       9.3.2 "Time" is miles case
   9.4 Example
       9.4.1 The "P-claims" dataset
       9.4.2 Examples for the "time" is age case
       9.4.3 Examples for the "time" is miles case
   9.5 Discussion
   References

10. Tests for Some Statistical Hypotheses for Dependent Competing Risks - A Review (Isha Dewan and J. V. Deshpande) ... 137
    10.1 Introduction
    10.2 Locally Most Powerful Rank Tests
    10.3 Tests for Bivariate Symmetry
    10.4 Censored Data
    10.5 Simulation Results
    10.6 Test for Independence of T and δ
        10.6.1 Testing H0 against H1
        10.6.2 Testing H0 against H2
        10.6.3 Testing H0 against HA
    Acknowledgments
    References

11. Repair Efficiency Estimation in the ARI1 Imperfect Repair Model (Laurent Doyen) ... 153
    11.1 Introduction
    11.2 Arithmetic Reduction of Intensity Model with Memory 1
        11.2.1 Counting process theory
        11.2.2 Imperfect repair models
    11.3 Failure Process Behavior
        11.3.1 Minimal and maximal wear intensities
        11.3.2 Asymptotic intensity
        11.3.3 Second order term of the asymptotic expansion
    11.4 Repair Efficiency Estimation
        11.4.1 Maximum likelihood estimators
        11.4.2 Explicit estimators
    11.5 Empirical Results
        11.5.1 Finite number of observed failures
        11.5.2 Application to real maintenance data set and perspective
    11.6 Classical Convergence Theorems
    References

12. On Repairable Components with Continuous Output (M. S. Finkelstein) ... 169
    12.1 Introduction
    12.2 Asymptotic Performance of Repairable Components
    12.3 Simple Systems
    12.4 Imperfect Repair
    12.5 Concluding Remarks
    References

13. Effects of Uncertainties in Components on the Survival of Complex Systems with Given Dependencies (Axel Gandy) ... 177
    13.1 Introduction
    13.2 System Reliability with Dependent Components
    13.3 Bounds on the Margins
        13.3.1 Uniform metric
        13.3.2 Quantiles
        13.3.3 Expectation
    13.4 Bayesian Approach
    13.5 Comparisons
    References

14. Dynamic Management of Systems Undergoing Evolutionary Acquisition (Donald Gaver, Patricia Jacobs, Ernest Seglie) ... 191
    14.1 Setting
        14.1.1 Preamble: broad issues
        14.1.2 Testing
    14.2 Modeling an Evolutionary Step
        14.2.1 Model for development of Block b+1
        14.2.2 Introduction of design defects during development and testing
        14.2.3 Examples of mission success probabilities with random K0
        14.2.4 Acquisition of Block b+1
        14.2.5 Obsolescence of Block b and Block b+1
    14.3 The Decision Problem
    14.4 Examples
    14.5 Conclusion and Future Program
    References

15. Reliability Analysis of Renewable Redundant Systems with Unreliable Monitoring and Switching (Yakov Genis and Igor Ushakov) ... 205
    15.1 Introduction
    15.2 Problem Statement
    15.3 Asymptotic Approach. General System Model
    15.4 Refined System Model and the FS Criterion
    15.5 Estimates of Reliability and Maintainability Indexes
    15.6 Examples
    15.7 Heuristic Approach. Approximate Method of Analysis of Renewal Duplicate System
    References

16. Planning Models for Component-Based Software Offerings Under Uncertain Operational Profiles (Mary Helander and Bonnie Ray) ... 221
    16.1 Introduction
    16.2 Mathematical Formulation as an Optimal Planning Problem
    16.3 Stochastic Optimal Reliability Allocation
        16.3.1 Derivation of the distribution for G0
        16.3.2 Solution implementation
    16.4 Examples
    16.5 Summary and Discussion
    Acknowledgments
    References

17. Destructive Stockpile Reliability Assessments: A Semiparametric Estimation of Errors in Variables with Validation Sample Approach (Nicolas Hengartner) ... 235
    17.1 Introduction
    17.2 Preliminaries
    17.3 Estimation
    17.4 Proofs
        17.4.1 Regularity conditions
        17.4.2 Proof of Theorem 17.2
    Acknowledgments
    References

18. Flowgraph Models for Complex Multistate System Reliability (Aparna Huzurbazar and Brian Williams) ... 247
    18.1 Introduction
    18.2 Background on Flowgraph Models
    18.3 Flowgraph Data Analysis
    18.4 Numerical Example
    18.5 Conclusion
    References

19. Interpretation of Condition Monitoring Data (Andrew Jardine and Dragan Banjevic) ... 263
    19.1 Introduction
    19.2 The Proportional Hazards Model
    19.3 Managing Risk: A CBM Optimization Tool
    19.4 Case Study Papers
        19.4.1 Food processing: use of vibration monitoring
        19.4.2 Coal mining: use of oil analysis
        19.4.3 Nuclear generating station
        19.4.4 Gearbox subject to tooth failure
    19.5 Future Research Plans
    References

20. Nonproportional Semiparametric Regression Models for Censored Data (Zhezhen Jin) ... 279
    20.1 Introduction
    20.2 Models and Estimation
        20.2.1 Accelerated failure time model
            20.2.1.1 Rank-based approach
            20.2.1.2 Least-squares approach
        20.2.2 Linear transformation models
    20.3 Remark
    Acknowledgments
    References

21. Binary Representations of Multi-State Systems (Edward Korczak) ... 293
    21.1 Introduction
    21.2 Basic Definitions
    21.3 Binary Representation of an MSS and Its Properties
    21.4 Examples of Application
    21.5 Conclusions
    Acknowledgments
    References

22. Distribution-Free Continuous Bayesian Belief Nets (D. Kurowicka and R. Cooke) ... 309
    22.1 Introduction
    22.2 Vines and Copulae
    22.3 Continuous bbns
    22.4 Example: Flight Crew Alertness Model
    22.5 Conclusions
    References

23. Statistical Modeling and Inference for Component Failure Times Under Preventive Maintenance and Independent Censoring (Bo Henry Lindqvist and Helge Langseth) ... 323
    23.1 Introduction
    23.2 Notation, Definitions, and Basic Facts
    23.3 The Repair Alert Model
    23.4 Statistical Inference in the Repair Alert Model
        23.4.1 Independent censoring
        23.4.2 Datasets and preliminary graphical model checking
        23.4.3 Nonparametric estimation
        23.4.4 Parametric estimation
    23.5 Concluding Remarks
    References

24. Importance Sampling for Dynamic Systems (Anna Ivanova Olsen and Arvid Naess) ... 339
    24.1 Introduction
    24.2 Problem Formulation. Reliability and Failure Probability
    24.3 Numerical Examples
        24.3.1 Linear oscillator excited by white noise
        24.3.2 Linear oscillator excited by colored noise
    24.4 Conclusions
    Acknowledgments
    References

25. Leveraging Remote Diagnostics Data for Predictive Maintenance (Brock Osborn) ... 353
    25.1 Introduction
    25.2 Accounting for the Accumulation of Wear
    25.3 Application to Inventory Management of Turbine Blades
    25.4 Developing an Optimal Solution
    25.5 Application
    25.6 Formulating a Generalized Life Regression Model
    25.7 Concluding Remarks
    Acknowledgments
    References

26. From Artificial Intelligence to Dependability: Modeling and Analysis with Bayesian Networks (Luigi Portinale, Andrea Bobbio, Stefania Montani) ... 365
    26.1 Introduction
    26.2 Bayesian Networks
    26.3 Mapping Fault Trees to Bayesian Networks
    26.4 Case Studies: The Digicon Gas Turbine Controller
    26.5 Modeling Issues
        26.5.1 Probabilistic gates: common cause failures
        26.5.2 Probabilistic gates: coverage
        26.5.3 Multi-state variables
        26.5.4 Sequentially dependent failures
    26.6 Analysis Issues
        26.6.1 Analysis example
        26.6.2 Modeling parameter uncertainty in BN model
    26.7 Conclusions and Current Research
    References

27. Reliability Computation for Usage-Based Testing (S. J. Prowell and J. H. Poore) ... 383
    27.1 Motivation
    27.2 Characterizing Use
    27.3 Computing Reliability
        27.3.1 Models
        27.3.2 Arc reliabilities
        27.3.3 Trajectory failure rate
    27.4 Similarity to Expected Use
    27.5 Conclusion
    References

28. K-Mart Stochastic Modeling Using Iterated Total Time on Test Transforms (Francisco Vera and James Lynch) ... 395
    28.1 Introduction
    28.2 Generalized Convexity, Iterated TTT, K-Mart
    28.3 Mixture Models
    28.4 A Binomial Example
    28.5 Construction of "Most Identical" Distribution
    References
CHAPTER 1 COMPETING RISK MODELING IN RELIABILITY
TIM BEDFORD
Department of Management Science, Strathclyde University, Glasgow, UK
E-mail: tim.bedford@strath.ac.uk
This paper gives a review of some work in the area of competing risk applied to reliability problems, focusing particularly on that of the author and co-workers. The results discussed cover a range of topics, starting with the identifiability problem, bounds and a characterization of marginal distributions with given competing risk information. We discuss the way in which the assumption of independence usually gives an optimistic view of failure behavior, possible models for maintenance, and generalizations of the competing risk problem to nonrenewal systems.
1.1. Introduction

The competing risk problem arises quite naturally in the reliability context. Maintenance logs often track the history of events occurring at a particular socket. The events can be failure mode specific, incipient failures, maintenance actions, etc. Where the cost of critical failure is large, the maintenance policy will ensure that the whole system is as good as new. Hence we can regard the data as arising from a renewal process in which we only see the "first" possible event occurring after renewal and we know what that event is. The different events can be regarded as competing risks. The "competing risk problem" is that we cannot identify the marginal distributions of the time to each event without making untestable distributional assumptions. Competing risk information is, at least implicitly, used in reliability databases such as the Center for Chemical Process Safety (CCPS) and European Industry Reliability Data (EIREDA) generic
databases, amongst several others. These databases give data (failure rates or on-demand failure probabilities, as appropriate) for each failure mode of the component. In order to derive this information it is necessary to have used a competing risk statistical model to interpret the underlying failure data. A discussion of the way in which such databases are built up and the role that the competing risk problem plays there is given in Cooke and Bedford.¹ The paper gives an overview of competing risk work, specifically in the reliability context, with which I and co-workers have been associated. It does not attempt to give a general overview of competing risk. For a more general view we refer the reader to Crowder's recent book² and Deshpande's overview paper.³ The issues we cover are:
- Independent and dependent competing risks
- Characterization of possible marginals
- Kolmogorov-Smirnov test
- The bias of independence
- Maintenance as a censoring mechanism
- Loosening the renewal assumption
In considering these issues we shall largely take a probabilistic modeling viewpoint, although one could equally well take a more statistical view or an operations research view. The competing risk problem is a part of the more general issue of model identifiability. We build models in order to gain an understanding of system behavior. In general terms, the more tightly we specify the model class, the more likely we are to be able to identify the specific model within the class. If we define the model class too tightly though, the model class may not capture all the features contained in the data. However, our intuition about defining model classes more tightly does not always correspond to the functional constraints that imply identifiability or lack thereof. We shall give an example of this in Sec. 1.8. In competing risk applications to reliability the model-identifiability issue means that we do need to specify “tight” families of models to apply in specific application situations. To do this we need a better understanding of the engineering context, particularly for applications to maintenance censoring.
1.2. Independent and Dependent Competing Risks

In general there may be several different processes going on that could remove a component from service. Hence the time to next removal, Y, is the minimum of a number of different potential event times, Y = min(X1, ..., Xn). For simplicity assume that the different events cannot occur together and, furthermore, just consider a single nonfailure event, for example unscheduled preventive maintenance. Hence there is a failure time X1, which is the time (since the previous service removal) at which the equipment would fail, and a single PM time X2, which is the time at which the equipment would be preventively maintained. We only observe the smallest of the two variables but also observe which one it is, that is, we know whether we have observed a failure or a PM. Hence the observable data is of the form (min(X1, X2), 1{X1 < X2}), where 1{X1 < X2} indicates whether the failure occurred before the PM. It would clearly be interesting to know about the distribution of X1, that is, the behavior of the system with the maintenance effect removed. However, we cannot observe X1 directly. The observations we have allow us only to estimate the subdistribution function G1(t) = P(X1 ≤ t, X1 < X2). The subdistribution function converges to the value P(X1 < X2) as t → ∞. We often talk about the subsurvivor function S1*(t) = P(X1 > t, X1 < X2), which is equal to P(X1 < X2) − G1(t). The normalized subsurvivor function is the quantity S1*(t)/S1*(0), normalized to be equal to 1 at t = 0. The final important quantity that can be estimated directly from observable data is the probability of a censor after time t, Φ(t) = P(X2 < X1 | Y > t). The shapes of these functions can play a role in model selection.

The classical competing risks problem is to identify the marginal distributions of the competing risk variables from the competing risk data. It is well known⁴ that the marginal and joint distributions of (X1, X2) are in general "nonidentifiable," that is, there are many different joint distributions which share the same subdistribution functions. It is also well known⁵,⁶,⁷ that if X1 and X2 are independent, nonatomic and share essential suprema, their marginal distributions are identifiable. Given a pair of subsurvivor functions we can assume an underlying independent model, but may have to accept that one of the random variables has a degenerate distribution, that is, an atom at infinity.⁸ In any case, if we are prepared to assume independence then, subject to some technical conditions, we can identify marginals. For the moment we shall concentrate on general bounds that make as few assumptions as possible, and explain the source of nonidentifiability.
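As a concrete illustration of these observable quantities, the sketch below estimates the empirical subsurvivor functions, their normalized versions, and Φ(t) from competing risk data of the form (min(X1, X2), 1{X1 < X2}). It is only a sketch: the function name, the simulated dependence between failure and PM times, and all parameter values are assumptions chosen for illustration, not part of any particular dataset or of the analysis in this paper.

```python
import numpy as np

def competing_risk_diagnostics(y, delta):
    """Empirical competing-risk diagnostics.

    y     : observed times, y_i = min(x1_i, x2_i)
    delta : 1 if the failure X1 was observed, 0 if the PM/censor X2 was observed
    Returns the evaluation grid, the normalized subsurvivor functions of X1 and
    X2, and Phi(t) = P(X2 < X1 | Y > t).
    """
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    grid = np.sort(np.unique(y))

    s1 = np.array([np.mean((y > t) & (delta == 1)) for t in grid])  # S1*(t) = P(X1 > t, X1 < X2)
    s2 = np.array([np.mean((y > t) & (delta == 0)) for t in grid])  # S2*(t) = P(X2 > t, X2 < X1)
    p1 = np.mean(delta == 1)        # S1*(0) = P(X1 < X2)
    p2 = 1.0 - p1                   # S2*(0) = P(X2 < X1)
    surv = s1 + s2                  # P(Y > t)
    phi = np.divide(s2, surv, out=np.full_like(s2, np.nan), where=surv > 0)

    return grid, s1 / p1, s2 / p2, phi

# Illustrative data: the PM time X2 tends to anticipate the failure time X1,
# so the two risks are dependent.  Only (min(X1, X2), indicator) is "observed".
rng = np.random.default_rng(0)
x1 = rng.weibull(1.5, size=5000) * 10.0
x2 = np.maximum(x1 + rng.normal(-1.0, 2.0, size=5000), 0.01)
grid, ns1, ns2, phi = competing_risk_diagnostics(np.minimum(x1, x2), (x1 < x2).astype(int))
```

Plotting ns1, ns2 and phi against the grid gives the shape diagnostics referred to above.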
Figure 1.1 shows the (X1, X2) plane and the events whose probabilities can be estimated from observable data. For example, given times t1 < t2 we can estimate the probability of the event {t1 < X1 ≤ t2, X1 < X2}, which corresponds to the vertical hatched region on the figure, while given times t3 < t4 we can estimate the probability of the event {t3 < X2 ≤ t4, X2 < X1}, which corresponds to the horizontal hatched region on the figure.
Fig. 1.1. (L) Events whose probabilities can be estimated by competing risk data; (R) Geometry of events determining the upper and lower bounds.
We are able to estimate the probability of any such region, but we cannot estimate how the probability is distributed along such a region. Now we can see why the distribution of X1 is not identifiable; varying the mass within the horizontal region changes the distribution of X1 without changing the distribution of the observable quantities. Considering the "extreme" ways in which that probability mass could be distributed leads to upper and lower bounds on the marginal distribution function of X1. The Peterson bounds⁹ are pointwise upper and lower bounds on the value of the marginal distribution function. They say that for any t ≥ 0 we have G1(t) ≤ F1(t) ≤ FY(t), where FY is the distribution function of Y = min(X1, X2). A functional bound was found by Crowder,¹⁰ who showed that the distance between the distribution function F1 and the Peterson lower bound, F1(t) − G1(t), is nondecreasing. A simple proof of these bounds is possible through considering the geometry of the events in the (X1, X2)-plane. Figure 1.1(R) shows, for a given t, three events marked A, B and C. The probability of event A is the lower bound probability in the Peterson bound, G1(t). This is clearly less than or equal to the probability of event A ∪ B, which is just F1(t). This in turn is less than or equal to the probability of event A ∪ B ∪ C, which is just FY(t).
This shows that the lower and upper Peterson bounds hold. To get the functional lower bound of Crowder, just note that the difference F1(t) − G1(t) is the probability of the event {X1 ≤ t, X2 < X1}, that is, the region marked B on the figure. Clearly, as t increases we get an increasing sequence of events whose probabilities must therefore also be nondecreasing. This demonstrates the functional bound. Both Peterson and Crowder make constructions to show when there is a joint distribution satisfying the bounds. A result in Bedford and Meilijson,¹¹ however, improved these results slightly while giving a very simple geometric construction, which we now consider.

1.3. Characterization of Possible Marginals
The characterization tells us exactly which marginal distributions are possible for given subdistribution functions. To simplify things in this presentation we assume that we are only going to deal with continuous (sub)distributions, and the reader is referred to Bedford and Meilijson¹¹ for the details of the general case of more than 2 variables, which may have atoms and could have ties. The key idea here is that of a co-monotone representation. We have already seen that the Crowder functional bound writes the distribution function as a sum of two monotone functions: the subdistribution function and the nondecreasing "gap." More generally we define a co-monotone representation of a continuous real-valued function f as a pair of monotone nondecreasing continuous functions f1 and f2 such that f = f1 + f2. The characterization result will show when we can find a pair of random variables compatible with the observable subdistribution functions. It is therefore necessary to define abstractly what a pair of subdistribution functions is without reference to a pair of random variables. We define a pair of functions G1, G2 to be a lifetime subdistribution pair if

(1) Gi : [0, ∞) → R, i = 1, 2.
(2) They are nondecreasing continuous real-valued functions with G1(0) = G2(0) = 0.
(3) limt→∞ G1(t) = p1 and limt→∞ G2(t) = p2, with p1 + p2 = 1.
Suppose we are given such a pair of functions. (As stated above, the subdistribution functions can be estimated from competing risk data.) Consider the unit strip [0, ∞) × [0, 1] and subdivide it into two strips of heights p1 and p2, respectively, as shown in Fig. 1.2. In the lower and upper strips we plot
the functions G1(t) and G2(t), respectively. In the same strips we choose, arbitrarily, two nondecreasing right continuous functions whose graphs increase from the bottom to the top of the strip, while lying under the graphs of the functions we have already drawn. For reasons that will become clear we call these new functions F2 − G2 and F1 − G1, respectively. See Fig. 1.3. The notion of "lying under" will be made clear in the statement of the theorem below.
Fig. 1.2. Co-monotone construction - 1.

Fig. 1.3. Co-monotone construction - 2.
Theorem 1.1:¹¹

(i) Let X1 and X2 be lifetime random variables. Then, using the notation established above,

(1) Fi = Gi + (Fi − Gi) is a nonnegative co-monotone representation of a nondecreasing continuous function, for i = 1, 2.
(2) Fi(t) ≤ G1(t) + G2(t) for all t, and the Lebesgue measure of the range set {(Fi − Gi)(t) | Fi(t) = G1(t) + G2(t)} is zero, for i = 1, 2.
(3)(a) Fi(0) = 0 and Fi(∞) = 1, for i = 1, 2. (b) G1(∞) + G2(∞) = 1.
(ii) If nondecreasing right continuous functions F1, F2, G1, and G2 satisfy the conditions (1)-(3) of (i), then there is a pair of random variables (X1, X2) for which Fi and Gi are the distribution and subdistribution functions respectively (i = 1, 2).

The construction of the random variables (X1, X2) can be done quite simply and is shown in Fig. 1.3. Draw a uniform random variable U. If it is in the lower strip (that is, U < p1) then we can invert U through the functions drawn in the lower strip. Define X1 = G1⁻¹(U) and X2 = (F2 − G2)⁻¹(U). The ordering of these two functions implies that X1 ≤ X2, and furthermore that X1 = X2 with probability 0. If U is in the upper strip then a similar construction applies with the roles of X1 and X2 reversed. See Fig. 1.3. This geometrical construction shows that the conditions of the theorem are sufficient for the existence of competing risk variables. Condition 1 is necessary as it is Crowder's functional bound. Condition 3 is an obvious necessary condition. The first part of Condition 2 is the Peterson upper bound, while the second part is a rather subtle "light touch" condition whose proof is fairly technical and for which we refer the reader to Bedford and Meilijson.¹¹
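The construction lends itself to simulation. The sketch below is one possible numerical rendering of it: the subdistribution pair comes from an exponential model with θ = 1 and p1 = 0.6 and, purely as one admissible choice, the gap functions F1 − G1 and F2 − G2 are those implied by the independent exponential model; any other pair satisfying the conditions of Theorem 1.1 could be substituted. All names, parameter values and the crude grid inversion are assumptions made for illustration.

```python
import numpy as np

p1, p2 = 0.6, 0.4   # p1 = P(X1 < X2)

# Subdistribution pair of an exponential model with failure rate theta = 1.
G1 = lambda t: p1 * (1.0 - np.exp(-t))
G2 = lambda t: p2 * (1.0 - np.exp(-t))
# One admissible choice of "gap" functions: those of the independent model.
gap1 = lambda t: (1.0 - np.exp(-p1 * t)) - G1(t)   # F1 - G1, rises from 0 to p2
gap2 = lambda t: (1.0 - np.exp(-p2 * t)) - G2(t)   # F2 - G2, rises from 0 to p1

# Crude numerical inverse of a nondecreasing function, evaluated on a grid.
ts = np.linspace(0.0, 60.0, 200001)
def inverse(f, u):
    vals = f(ts)
    return ts[np.minimum(np.searchsorted(vals, u), len(ts) - 1)]

rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)
x1 = np.empty_like(u)
x2 = np.empty_like(u)

lower = u < p1                                # lower strip: the failure comes first
x1[lower] = inverse(G1, u[lower])
x2[lower] = inverse(gap2, u[lower])
upper = ~lower                                # upper strip: the PM/censor comes first
x2[upper] = inverse(G2, u[upper] - p1)
x1[upper] = inverse(gap1, u[upper] - p1)

# Check: the simulated pair reproduces the subdistribution G1 and the
# marginal F1 = G1 + (F1 - G1).
for t in (0.5, 1.0, 2.0):
    print(t, np.mean((x1 <= t) & (x1 < x2)), G1(t), np.mean(x1 <= t), G1(t) + gap1(t))
```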
1.4. Kolmogorov-Smirnov Test

The complete characterization described above was used to produce a statistical test based on the Kolmogorov-Smirnov statistic in which a hypothesized marginal distribution can be tested against available data. The functional bound tells us that, given a dataset, the difference between the empirical subdistribution and the unobserved empirical marginal distribution function should be nondecreasing. A little thought shows that the distances can be computed at the "jumps" of these functions, which of course occur at the times recorded in the dataset. Recall that the Kolmogorov-Smirnov test takes a hypothesized distribution and uses the asymptotic relation that the
functional difference between true population distribution and empirical distribution converges, when suitably normalized, to a Brownian bridge. Extremes of the Brownian bridge can then be used to establish classical confidence intervals. Although in our competing risk situation we cannot estimate the maximal difference between empirical distribution function and hypothesized distribution function (as we are not able to observe the empirical distribution function), the functional bound can be used to give lower bound estimates on that maximal difference, thus enabling a conservative Kolmogorov-Smirnov test to be developed. A dynamic programming algorithm can be used to determine the maximum difference. Theoretical details of the test are in Bedford and Meilijson,¹¹ with more implementation details about the dynamic programming and application examples given in Bedford and Meilijson.¹²
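The dynamic programming algorithm itself is described in the cited papers and is not reproduced here. The sketch below computes only a much weaker, purely pointwise lower bound on the Kolmogorov-Smirnov distance, using the fact that the unobserved empirical distribution function of X1 must lie between the empirical subdistribution function and the empirical distribution function of Y; the names and the hypothesized distribution are assumptions for illustration.

```python
import numpy as np

def ks_lower_bound(y, delta, F0):
    """Conservative (pointwise) lower bound on sup_t |F1_hat(t) - F0(t)|.

    The unobservable empirical distribution function F1_hat of X1 satisfies
    G1_hat(t) <= F1_hat(t) <= FY_hat(t) for all t, where G1_hat is the empirical
    subdistribution function of X1 and FY_hat is the empirical distribution
    function of Y = min(X1, X2).  Any excess of G1_hat over F0, or shortfall
    of FY_hat below F0, therefore bounds the KS distance from below.
    """
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    grid = np.sort(np.unique(y))
    g1 = np.array([np.mean((y <= t) & (delta == 1)) for t in grid])  # G1_hat
    fy = np.array([np.mean(y <= t) for t in grid])                   # FY_hat
    f0 = F0(grid)
    return max(np.max(g1 - f0), np.max(f0 - fy), 0.0)

# Example: test an exponential hypothesis for X1 against censored data.
rng = np.random.default_rng(2)
x1 = rng.exponential(5.0, size=3000)
x2 = rng.exponential(8.0, size=3000)               # independent censoring here
y, delta = np.minimum(x1, x2), (x1 < x2).astype(int)
print(ks_lower_bound(y, delta, lambda t: 1.0 - np.exp(-t / 5.0)))  # small: F0 is the truth
```

The bound that also exploits the monotonicity of F1 − G1, as developed in the papers cited above, is tighter than this pointwise version.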
1.5. Conservatism of Independence
As we noted at the beginning, commercial reliability databases make use of competing risk models in interpreting and presenting data. Common assumptions are that underlying times to failure from different failure modes are exponential and that the censoring is independent. Clearly, assumptions need to be made, but one can ask whether or not these assumptions are going to bias the numerical results in any consistent way. It turns out that the functional bounds can be used to show that the assumption of independence tends to give an optimistic assessment of the marginal of X1. It was shown in Bedford and Meilijson¹³ that any other dependence structure would have given a higher estimate of the failure rate; the "independent" failure rate is the lower endpoint of the interval of constant failure rates compatible with the competing risk data. This result, which is generalized to a wider class of parametric families (those ordered by monotone likelihood ratio), is essentially based on a constraint implied by differentiating the functional bound at the origin. Consider the following simple example from Bedford and Meilijson.¹³ Suppose that the lifetime Y of a machine is exponentially distributed, there are two failure modes with failure times X1, X2, and its cause of failure I = 1, 2 is independent of Y. Suppose also that the failure rate of Y is θ and let P(I = 1) = p1. The unique independent model (X1, X2) for this observed data joint distribution makes X1 and X2 exponentially distributed with respective failure rates p1θ and (1 − p1)θ. Suppose now that we believe that X1 has a marginal exponential
distribution, but we are not sure about possible dependence. What is then the range of possible values of its failure rate λ? Since the subdistribution function of X1 is G1(t) = p1(1 − e^{−θt}) and λ must satisfy F1(t) = 1 − e^{−λt} ≥ G1(t) for all t > 0, we have that p1θ ≤ λ ≤ θ. The upper Peterson bound tells us that λ ≤ θ. However, the "light touch" condition discussed above shows that equality is not possible, thus giving a range of feasible λ values as

    p1θ ≤ λ < θ.    (1)
This shows that the lowest, most optimistic, failure rate compatible with the general competing risk bounds is that obtained from the independence assumption.
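For completeness, the calculation behind Eq. (1) can be sketched as follows (this is a reconstruction from the bounds quoted above, not a full proof):

```latex
% Lower endpoint: the Peterson/Crowder lower bound F_1(t) \ge G_1(t) for all t.
\[
  1 - e^{-\lambda t} \;\ge\; p_1\bigl(1 - e^{-\theta t}\bigr)
  \quad\text{for all } t > 0
  \;\Longrightarrow\; \lambda \ge p_1\theta
  \quad\text{(compare the slopes at } t = 0\text{)}.
\]
% Conversely, \lambda = p_1\theta already satisfies the bound for every t, since
% f(t) = (1 - e^{-p_1\theta t}) - p_1(1 - e^{-\theta t}) has f(0) = 0 and
% f'(t) = p_1\theta\,(e^{-p_1\theta t} - e^{-\theta t}) \ge 0.
%
% Upper endpoint: the Peterson upper bound F_1(t) \le F_Y(t) = 1 - e^{-\theta t}
% gives \lambda \le \theta, and the "light touch" condition excludes equality.
% Together: p_1\theta \le \lambda < \theta, which is Eq. (1).
```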
1.6. The Bias of Independence
As a further illustration of the potential bias created by the assumption of independence when it is not clearly appropriate, we consider an example from reliability prediction discussed in Bedford and Cooke.¹⁴ This gives the theoretical background to work carried out for a European Space Agency project. Four satellites were due to be launched to carry out a scientific project, which required the functioning of all satellites throughout the mission period of 2 years post launch. A preliminary assessment of the satellite system reliability using a standard model (independent exponentially distributed lifetimes of subsystems) indicated a rather low probability of mission success. The study included discussions with mission engineers about the mechanisms of possible mission failure which suggested that, in contrast to the assumptions made in the standard reliability modeling, mission risk was highly associated with events such as vibration damage during launch and satellite rocket firing, and with thermal shocks caused by moving in and out of eclipses. A new model was built in which the total individual satellite failure probability over the mission profile was kept at that predicted in the original reliability model. Now, however, a mission phase-dependent failure rate was used to capture the engineering judgement associating higher failure rates with the abovementioned mission phases, and a minimally informative copula was used to couple failure rates to take account of residual couplings not captured by the main effects. The effect of positive correlation between satellite lifetimes is to improve system lifetime. This may seem counter intuitive, especially to those with an engineering risk background used to thinking of common cause effects as being a "bad thing." However, it is intuitively easy to understand. For
if we sample 4 positively correlated lifetime variables all with marginal distribution F then the realizations will tend to be more tightly bunched as compared to 4 independent realizations from the same marginal distribution F. Hence the series system lifetime, which is the minimum of the satellite lifetimes, will tend to be larger when the lifetimes are more highly correlated. This is illustrated in Fig. 1.4, taken from Bedford and Cooke,¹⁴ which is based on simulation results.
Fig. 1.4. Cluster system survival probability depending on correlation.
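The effect can be reproduced with a small simulation along the following lines. The Gaussian copula, the exponential marginals, the SciPy dependency and all parameter values below are assumptions made purely for illustration; the study described above used a minimally informative copula over a mission-phase model rather than this construction.

```python
import numpy as np
from scipy import stats

def series_survival(rho, t, n_sat=4, mean_life=5.0, n_sim=200_000, seed=3):
    """P(all n_sat satellites survive past t) when the lifetimes have exponential
    marginals coupled by a Gaussian copula with common pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.full((n_sat, n_sat), rho) + (1.0 - rho) * np.eye(n_sat)
    z = rng.multivariate_normal(np.zeros(n_sat), cov, size=n_sim)
    u = stats.norm.cdf(z)                      # correlated uniforms
    lifetimes = -mean_life * np.log1p(-u)      # exponential marginals
    return np.mean(lifetimes.min(axis=1) > t)  # series system survives iff the minimum does

for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, series_survival(rho, t=2.0))    # survival probability increases with rho
```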
1.7. Maintenance as a Censoring Mechanism

One very interesting and practical area of application of competing risk ideas is in understanding the impact of maintenance on failure data. There is a lot of anecdotal evidence to suggest that there can be major differences in performance between different plant (for example nuclear plant) that are not caused by differences in design or different usage patterns, and that are therefore likely to be related to different maintenance practices and/or policies. Theoretical models for maintenance optimization require us to know the lifetime distribution of the components in question. Therefore it is of major significance to understand the impact that competing risk has in masking
our knowledge of that distribution due to current maintenance (or other) practices that censor the lifetime variable of interest.
1.7.1. Dependent copula model

The sensitivity of predicted lifetime to the assumptions made about dependency between PM and failure time was investigated in Bunea and Bedford¹⁵ using a family of dependent copulae. This is based on the results in Zheng and Klein,¹⁶ where a generalization of the Kaplan-Meier estimator is defined that gives a consistent estimator based on an assumption about the underlying copula of (X1, X2). To illustrate this in an optimization context, the interpretation given was that of choosing an age-replacement maintenance policy. Existing data, corresponding to failure and/or unscheduled PM events, is taken as input. In order to apply the age replacement maintenance model we need the lifetime distribution of the equipment. Hence it is necessary to "remove" the effect of the unscheduled PM from the lifetime data. The objective was to show what the costs of assuming the wrong model would be, if one was trying to optimize an age replacement policy. The conclusion of this paper is that the costs of applying the wrong model can indeed be very substantial. The costs arise because the age replacement interval is incorrectly set for the actual lifetime distribution when the lifetime distribution has been incorrectly estimated using false assumptions about the form of censoring. See Fig. 1.5, taken from Bunea and Bedford.¹⁵
1.7.2. Random clipping

This is not really a competing risk model, but is sufficiently close to be included here. The idea, due to Cooke, is that the component lifetime is exponential, and that the equipment emits a warning at some time before the end of life. That warning period is independent of the lifetime of the equipment, although we only see those warnings that occur while the equipment is in use. The observable data here is the time at which the warning occurs. An application of the memoryless property of the exponential distribution shows that the observable data (the warning times) has the same distribution as the underlying failure time. Hence we can estimate the MTBF just by the mean of the warning times data. This model, and the following one, is discussed further in Bedford and Cooke.¹⁷
Fig. 1.5. Optimizing to the wrong model: effect of assuming different correlations.
1.7.3. Random signs
This model¹⁸ uses the idea that the time at which PM might occur is related to the time of failure. The PM (censoring time) X2 differs from the failure time X1 by a random quantity ξ, X2 = X1 − ξ. Now, while ξ might not be statistically independent of X1, its sign is. In other words PM is trying to be effective and to occur round about the time of the failure, but might miss the failure and occur too late. The chance of failure or PM is independent of the time at which the failure would occur. This is quite a plausible model, but is not always compatible with the data. Indeed, Cooke has shown that the model is consistent with the distribution of observable data if and only if the normalized subsurvivor functions are ordered. In other words, this model can be applied if and only if the normalized subsurvivor function for X1 always lies above that for X2.
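The ordering property is easy to see in a simulation of random signs censoring such as the sketch below; the lifetime distribution, the probability that the PM preempts the failure, and all names are assumptions for illustration, and with a large sample the normalized subsurvivor function of X1 should dominate that of X2 up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x1 = rng.weibull(2.0, size=n) * 10.0          # failure times

# Random signs censoring: X2 = X1 - xi, where the *sign* of xi is independent
# of X1.  With probability q the PM preempts the failure (xi > 0); otherwise
# it would come too late (xi < 0) and the failure is what we observe.
q = 0.4
pm_first = rng.uniform(size=n) < q
xi = np.where(pm_first, 1.0, -1.0) * rng.exponential(1.0, size=n)
x2 = np.maximum(x1 - xi, 1e-6)

y, delta = np.minimum(x1, x2), (x1 < x2).astype(int)
grid = np.quantile(y, np.linspace(0.01, 0.95, 40))
s1 = np.array([np.mean((y > t) & (delta == 1)) for t in grid])
s2 = np.array([np.mean((y > t) & (delta == 0)) for t in grid])
ns1 = s1 / np.mean(delta == 1)                # normalized subsurvivor of X1
ns2 = s2 / np.mean(delta == 0)                # normalized subsurvivor of X2
print(np.all(ns1 + 1e-2 >= ns2))              # ordering characteristic of random signs
```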
1.7.4. LBL model

This model, proposed in Langseth and Lindqvist,¹⁹ develops a variant of the random signs model in which the likelihood of an early (that is, before failure) intervention by the maintainer is proportional to the unconditional
failure intensity for the component. Maintenance is possibly imperfect in this model. The model is identifiable.
1.7.5. Mixed exponential model
A new model capturing a class of competing risk data not previously covered by the above was presented in Bunea et al.²⁰ The underlying model is that X1 is drawn from a mixture of two exponential distributions, while X2 is also exponential and independent of X1. This is therefore a special case of the independent competing risks model, but in a very specific parametric setting. Important features of this model that differ from the previous models are: (1) the normalized subdistribution functions are mixtures of exponential distribution functions; (2) the function Φ(t) increases continuously as a function of t. This model was developed for an application to OREDA data in which these phenomena were observed.
1.7.6. Delay time model

The delay time model²¹ is well known within the maintenance community. Here the two times X1 and X2 are expressed in terms of a warning variable and supplementary times, X1 = W + X1', X2 = W + X2', where W, X1', X2' are mutually independent life variables. As above we observe the minimum of X1 and X2. In the case that these variables are all exponential it can be shown (see Hokstadt and Jensen²²) that (a) the normalized subdistribution functions are equal and are exponential distribution functions, and (b) the function Φ(t) is constant as a function of t.

1.8. Loosening the Renewal Assumption
The main focus of the paper is on competing risks, and this implies that when we consider a reliability setting we are assuming that the data can be interpreted as though generated by a renewal process. In practice this is not always a good assumption. In general, the higher the risk potentially caused by failure of the equipment and the lower the cost of repair, the more likely it is that the maintenance crew will try to get the system back to an "as good as new" state. As discussed in the introduction, the competing risk identifiability question is part of the general issue of model identifiability. It is natural therefore to consider loosening the renewal assumption. In the context of a series
system it is quite natural to think about three types of maintenance programs. The first is "good as new," that is, when one component fails, all components are restored to as good as new, thus enabling us to make an assumption of a renewal process. The second is partial renewal: only the component that fails is restored to as good as new. The fact that the other component(s) are not new may lead to extra stress being placed on all components. The third possibility is "as bad as old," where a minimal repair is applied to the failed component that restores it to the functioning state but leaves the failure intensity for the whole system in the same state as just before the failure. This situation is discussed by Bedford and Lindqvist,²³ where it is assumed that each component has a failure intensity depending on the component's own lifetime plus another term that depends on the age of each component. In this context, identifiability means that we can estimate the failure intensities of the components using data from a single socket. This means that we start off with a single unit and replace the components according to the maintenance policy that was determined, recording failure times as we go. As stated above, the "good as new" policy essentially means that we are in the classical competing risk situation. Our model class is sufficiently general that it is not identifiable. The least "intensive" maintenance policy, the "bad as old" policy, is also not identifiable (indeed, from single socket data we never revisit any times, so cannot possibly make estimates of failure probabilities). The partial repair policy, however, is rather different. Here, at least under some quite reasonable technical conditions, we are able to identify the model. The reason for this is enlightening. We can consider the vector valued stochastic process which tells us the current ages of the components at each time point. This process has no renewal properties whatsoever in the minimal repair case. In the full and partial repair cases, however, it can be considered as a continuous time, continuous state Markov process. However, in the full repair case there is no mixing, whereas in the partial repair case (under suitable technical, but weak, conditions) the process is ergodic. This ergodicity implies that a single sample path will (with probability 1) visit the whole sample space, thus enabling us to estimate the complete intensity functions.
1.9. Conclusion
(Dependent) competing risk models are increasingly being developed to support the analysis of reliability data. Because of the competing risk problem we cannot identify the joint distribution or marginal distributions without making nontestable assumptions. The validation of such models on a statistical basis is therefore impossible, and validation must therefore be of a "softer" nature, relying on assessment of the engineering and organizational context. This is particularly so in the area of maintenance policy. Clearly, there is a whole area of modeling that can be developed.
Acknowledgments

I would like to thank my various collaborators over the last few years, including Isaac Meilijson (Tel Aviv University), Bo Lindqvist (NTNU Trondheim), Cornel Bunea (George Washington University), Hans van der Weide (TU Delft), Helge Langseth (NTNU Trondheim), Sangita Karia Kulathinal (National Public Health Institute, Helsinki), Isha Dewan (ISI New Delhi), Jayant Deshpande (Pune University), Catalina Mesina (Free University, Amsterdam) and especially Roger Cooke (TU Delft), who introduced me to the area.

References
1. R. Cooke and T. Bedford, Reliability databases in perspective, IEEE Transactions on Reliability 51, 294-310 (2002).
2. M. Crowder, Classical Competing Risks, Chapman and Hall/CRC (2001).
3. J. V. Deshpande, Some recent advances in the theory of competing risks, Presidential Address, Section of Statistics, Indian Science Congress 84th Session (1997).
4. A. Tsiatis, A nonidentifiability aspect of the problem of competing risks, Proceedings of the National Academy of Sciences, USA 72, 20-22 (1975).
5. E. L. Kaplan and P. Meier, Nonparametric estimation from incomplete observations, Journal of the American Statistical Association 53, 457-481 (1958).
6. A. Nádas, On estimating the distribution of a random vector when only the smallest coordinate is observable, Technometrics 12, 923-924 (1970).
7. D. R. Miller, A note on independence of multivariate lifetimes in competing risk models, Ann. Statist. 5, 576-579 (1976).
8. J. A. M. van der Weide and T. Bedford, Competing risks and eternal life, in Safety and Reliability (Proceedings of ESREL '98), S. Lydersen, G. K. Hansen, and H. A. Sandtorv (eds), Vol. 2, 1359-1364, Balkema, Rotterdam (1998).
9. A. Peterson, Bounds for a joint distribution function with fixed subdistribution functions: Application to competing risks, Proc. Nat. Acad. Sci. USA 73, 11-13 (1976).
10. M. Crowder, On the identifiability crisis in competing risks analysis, Scand. J. Statist. 18, 223-233 (1991).
11. T. Bedford and I. Meilijson, A characterization of marginal distributions of (possibly dependent) lifetime variables which right censor each other, Annals of Statistics 25, 1622-1645 (1997).
12. T. Bedford and I. Meilijson, A new approach to censored lifetime variables, Reliability Engineering and System Safety 51, 181-187 (1996).
13. T. Bedford and I. Meilijson, The marginal distributions of lifetime variables which right censor each other, in H. Koul and J. Deshpande (Eds.), IMS Lecture Notes Monograph Series 27 (1995).
14. T. Bedford and R. M. Cooke, Reliability methods as management tools: dependence modelling and partial mission success, Reliability Engineering and System Safety 58, 173-180 (1997).
15. C. Bunea and T. Bedford, The effect of model uncertainty on maintenance optimization, IEEE Transactions on Reliability 51, 486-493 (2002).
16. M. Zheng and J. P. Klein, Estimates of marginal survival for dependent competing risks based on an assumed copula, Biometrika 82, 127-138 (1995).
17. T. Bedford and R. Cooke, Probabilistic Risk Analysis: Foundations and Methods, Cambridge University Press (2001).
18. R. Cooke, The total time on test statistic and age-dependent censoring, Stat. and Prob. Let. 18 (1993).
19. H. Langseth and B. Lindqvist, A maintenance model for components exposed to several failure mechanisms and imperfect repair, in B. Lindqvist and K. Doksum (Eds.), Mathematical and Statistical Methods in Reliability, pp. 415-430, World Scientific Publishing (2003).
20. C. Bunea, R. Cooke, and B. Lindqvist, Competing risk perspective over reliability databases, in H. Langseth and B. Lindqvist (Eds.), Proceedings of Mathematical Methods in Reliability, NTNU Press (2002).
21. A. Christer, A review of delay time analysis for modelling plant maintenance, in Stochastic Models in Reliability and Maintenance, Springer (2002).
22. P. Hokstadt and U. Jensen, Predicting the failure rate for components that go through a degradation state, in Safety and Reliability, Lydersen, Hansen and Sandtorv (eds), Balkema, Rotterdam, pp. 389-396 (1998).
23. T. Bedford and B. H. Lindqvist, The identifiability problem for repairable systems subject to competing risks, Advances in Applied Probability 36, 774-790 (2004).
CHAPTER 2 GAME-THEORETIC AND RELIABILITY METHODS IN COUNTER-TERRORISM AND SECURITY
VICKI BIER Center for Human Performance and Risk Analysis University of Wisconsin-Madison Madison, Wisconsin, USA E-mail:
[email protected] The routine application of reliability and risk analysis by itself is not adequate in the security domain. Protecting against intentional attacks is fundamentally different from protecting against accidents or acts of nature. In particular, an intelligent and adaptable adversary may adopt a different offensive strategy to circumvent or disable protective security measures. Game theory provides a way of taking this into account. Thus, security and counter-terrorism can benefit from a combination of reliability analysis and game theory. This paper discusses the use of risk and reliability analysis and game theory for defending complex systems against attacks by knowledgeable and adaptable adversaries. The results of such work yield insights into the nature of optimal defensive investments in networked systems to obtain the best tradeoff between the cost of the investments and the security of the resulting systems.
2.1. Introduction

After the September 11, 2001, terrorist attacks on the World Trade Center and the Pentagon, and the subsequent anthrax attacks in the United States, there has been an increased interest in methods for use in security and counter-terrorism. However, the development of such methods poses two challenges to the application of conventional statistical methods: the relative scarcity of empirical data on severe terrorist attacks, and the intentional nature of such attacks. In dealing with extreme events, for which empirical data is likely to be sparse, classical statistical methods have been of relatively little use.1
Instead, methods such as reliability analysis are used to break complex systems down into their individual components (such as pumps and valves), for which larger amounts of empirical failure-rate data may be available. Risk analysis2,3 builds on the techniques of reliability analysis, adding consequence-analysis models to allow for estimation of health and safety impacts. Quantification of risk-analysis models generally relies on some combination of expert judgment4 and Bayesian statistics to estimate the parameters of interest in the face of sparse data.5 Zimmerman and Bier6 argue that “Risk assessment in its current form (as a systems-oriented method that is flexible enough to handle a variety of alternative conditions) is a vital tool for dealing with extreme events.” However, the routine application of reliability and risk analysis by itself is not adequate in the security domain. Protecting against intentional attacks is fundamentally different from protecting against accidents or acts of nature (which have been the more usual focus of engineering risk analysis). In particular, an intelligent and adaptable adversary may adopt a different offensive strategy to circumvent or disable protective security measures. Game theory7 provides a way of taking this into account analytically. Thus, security analysis can benefit from a combination of techniques that have not usually been used in tandem. This paper discusses approaches for using risk and reliability analysis and game theory to defend complex systems against intentional attacks.
2.2. Applications of Reliability Analysis to Security
Early applications of engineering risk analysis to counter-terrorism and security include Martz and Johnson8 and Cox.9 More recently (following September 11), numerous risk analysts have proposed its use for homeland security.10-13 Because security threats can span such a wide range, the emphasis has been mainly on risk-based decision making (i.e., using the results of risk analyses to target security investments at the most likely and most severe threats). Much of this work has been directed specifically towards threats against critical infrastructure.14-19 Levitin and colleagues have by now amassed a large body of work applying reliability analysis to security problems.20-22 Risk and reliability analysis have a great deal to contribute to ensuring the security of complex engineered systems. However, unlike in applications of risk analysis to problems such as the risk of nuclear power accidents, the relationship of recommended risk-reduction actions to the dominant risks
emerging from the analysis is not straightforward. In most applications of risk analysis, the decision maker can review a list of possible actions, ranked by the magnitude of risk reduction per unit cost, and simply choose the most cost-effective actions. This does not work so well in the security context (especially if the potential attacker can readily observe system defenses), since the effectiveness of investments in defending one component can depend critically on whether other components have also been hardened. Risk and reliability analysis are clearly important in identifying the most significant security threats, particularly in complex engineered systems (whose vulnerabilities may depend on networks of interdependencies that cannot be readily identified without detailed analysis). However, in the security context, the results of such analyses do not lead in a straightforward manner to recommended improvements. In particular, risk and reliability analysis generally assumes that the threat or hazard is static, whereas in the case of security, the threat is adaptive and can change in response to the defenses that have been implemented. Therefore, simply rerunning an analysis with the same postulated threat, but assuming that some candidate security improvements have been implemented, will in general significantly overestimate the effectiveness of the candidate improvements.23 For example, installing anthrax sterilization equipment in every post office in the country (if publicly known) might just cause future attackers to deliver anthrax by Federal Express, United Parcel Service, or even bicycle courier. Game theory provides a natural way of taking this into account.
2.3. Applications of Game Theory to Security
Due to its value in understanding conflict, game theory has a long history of being applied to security-related problems, beginning with military applications.7 It has also been extensively used in political science, e.g., in the context of arms control.24 Recently, there have also been exploratory applications of game theory and related ideas to computer security.25-27 With respect to applications of game theory to homeland security, there is a large body of work already, much of it by economists.28-30 Much of the work in this area until now has been designed to provide policy insights, e.g., into the relative merits of deterrence versus other protective measures.30 However, there has also been interest in using game theory in support of operational-level decisions, e.g., determining which assets to protect. For example, the Brookings Institution31 has recommended that
“policy-makers should focus primarily on those targets at which an attack would involve a large number of casualties, would entail significant economic costs, or would critically damage sites of high national significance.” The Brookings recommendation constitutes a reasonable “zero-order” suggestion about how to prioritize targets for investment. Under this type of “weakest-link” model, defensive investment is allocated only to the target(s) that would cause the most damage if attacked. However, such weakest-link models tend to be unrealistic. For example, Arce and Sandler28 note that the extreme solutions associated with weakest-link models “are not commonly observed among the global and transnational collective action problems confronting humankind.” Instead, real-world decision makers will generally want to “hedge” by investing in defense of additional targets, to cover contingencies such as whether they have failed to correctly estimate which targets will be most attractive to the attackers. Moreover, it is important to go beyond the zero-order heuristic of protecting only the most valuable assets, to also take into account the success probabilities of attacks against various possible targets. This can be important, since terrorists appear to take the probability of success into account in their choice of targets; for example, Woo32 has observed that “al-Qaeda is ... sensitive to target hardening,” and that “Osama bin Laden has expected very high levels of reliability for martyrdom operations.” Models that take the success probabilities of potential attacks into account include Bier et al.,33 Major,34 and Woo.32 The results of Bier et al.33 still represent weakest-link models, in which at equilibrium the defender invests in only those components that are most vulnerable. By contrast, the models proposed by Major34 and Woo32 achieve the more realistic result of hedging at optimality. In particular, Major34 assumes that the defender allocates defensive resources optimally, but that the attacker does not know this, and randomizes the choice of targets to protect against the possibility that the allocation of defensive resources was suboptimal. The result is that the attacker's probability of choosing any particular target is “inversely proportional to the marginal effectiveness of defense ... at that target.” Moreover, since the attacker randomizes in choosing which asset to target, the optimal defensive investment involves hedging (i.e., positive defensive investment even in assets that are not “weakest links”). The basic game analyzed by Major34 and Woo32 involves simultaneous play by attackers and defenders, so the assumption that the attacker cannot readily observe the defender's investment makes sense in that context. However, many types of defenses (such as security guards) are either public
knowledge or readily observable by attackers; moreover, some defenses involve large capital outlays, and hence would be difficult to change in response to evolving attacker strategies. In such situations, it seems counterintuitive to assume that the attacker can observe the marginal effectiveness of defensive investments in each possible target, but cannot observe which defenses the defender has actually implemented. Presumably, in sequential play, those defenses that have actually been implemented should be easier to observe than the hypothetical effectiveness of defenses that have not been implemented. Since the basic premise underlying the models of Major34 and Woo32 is of questionable applicability for sequential play, recent work by the author and colleagues35 achieves the same goal (i.e., an optimal defensive strategy that allows for hedging in equilibrium) in a different manner. In particular, in this model, attackers and defenders are assumed to have different valuations for the various potential targets. This is reasonable given the observation by Woo32 that “If a strike against America is to be inspirational [to al-Qaeda], the target should be recognizable in the Middle East.” In addition to allowing attacker and defender valuations to differ, the proposed model assumes that attackers can observe defensive investments perfectly (which is conservative, but perhaps not overly so), but that defenders are uncertain about the attractiveness of each possible target to the attackers. This model has several interesting features, in addition to the possibility of defensive hedging at equilibrium. First, it is interesting to note that hedging does not always occur. In particular, it will often be optimal for the defender to put no investment at all into some targets, even if they have a nonzero probability of being attacked, especially when the defender is highly budget constrained, and when the various potential targets differ greatly in their values to the defender. Moreover, in this model, if the allocation of defensive resources is suboptimal, defending one set of targets could deflect attacks to alternative targets that are simultaneously less attractive a priori to the attackers (e.g., more costly or difficult to attack), but also more damaging to the defenders. This could be important, in light of past substitution effects documented by Enders and Sandler.29
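As a toy numerical illustration of the randomization result attributed to Major above, the short sketch below (with invented effectiveness numbers) computes an attacker's mixed strategy when the probability of choosing a target is taken to be inversely proportional to the marginal effectiveness of defense at that target.

```python
def attack_probabilities(marginal_effectiveness):
    """Attack probability proportional to 1 / (marginal effectiveness of defense
    at each target); the input values are purely illustrative."""
    weights = [1.0 / m for m in marginal_effectiveness]
    total = sum(weights)
    return [w / total for w in weights]

print(attack_probabilities([0.5, 1.0, 2.0]))   # -> [4/7, 2/7, 1/7]
```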
2.3.1. Security as a game between defenders

The work discussed above views security primarily as a game between an attacker and a defender, focusing on anticipating the effects of defensive actions on possible attackers. However, it also makes sense to consider the
effects of defensive strategies adopted by one agent on the incentives faced by other defenders. Some defensive actions (such as installation of visible burglar alarms or car alarms) can actually increase the risk to other potential victims. This can lead to overinvestment in security when viewed from the perspective of society as a whole, because the payoff to any one individual or organization from investing in security is greater than the net payoff to the entire society. Conversely, other types of defensive actions, such as vaccination, fire protection, or use of anti-virus software, decrease the risk to other potential victims. This can be expected to result in underinvestment in security, since defenders may attempt to “free ride” on the investments of others. Recently, Kunreuther and Heal36 proposed a model of interdependent security in which agents are vulnerable to “infection” from other agents. For example, consider supply-chain partners who share access to each other's computer systems, and hence are vulnerable to the computer security vulnerabilities of their partners. In this context, it may be extremely costly or difficult for agents to defend their own systems against infections from their partners, and they may therefore need to rely on their partners to protect them against such threats. Kunreuther and Heal36 consider in particular the case where even a single successful attack can be catastrophic. In this case, they show that failure of one agent to invest in security can make it unprofitable for other agents to invest in security, even when they would normally find it profitable to do so. Moreover, they show that this game can have multiple equilibrium solutions (e.g., an equilibrium in which all players invest, and another in which none invest). Kunreuther and Heal36 discuss numerous possible coordinating mechanisms that can help to ensure that all players arrive at the socially optimal level of defensive investment. Recent work by the author and colleagues37,38 has extended these results to the case of attacks occurring over time (according to a Poisson process), rather than the static model assumed in the original analysis. In this model, differences in discount rates among agents can lead some agents with low discount rates not to invest in security when it would otherwise be in their interests to do so, if other agents choose not to invest in security. Thus, heterogeneous time preferences can complicate the task of achieving security. Differences in discount rates can arise for a variety of reasons, ranging from participation in different industries with different typical rates of return, to risk of impending bankruptcy, to myopia. As in the simpler static model, coordinating mechanisms can be important in ensuring that the socially optimal level of investment is achieved.
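The multiple-equilibria phenomenon described above is easy to reproduce numerically. The sketch below encodes a symmetric two-agent game in the spirit of Kunreuther and Heal, with illustrative numbers only (investment cost, direct-attack probability, contagion probability and loss chosen so that the interesting case arises), and enumerates the pure-strategy Nash equilibria.

```python
from itertools import product

def expected_cost(own_invests, other_invests, p=0.2, q=0.15, loss=100.0, c=18.0):
    """Expected cost to one agent in a two-agent interdependent-security game;
    all parameter values are made up for illustration."""
    direct = 0.0 if own_invests else p * loss          # risk from a direct attack
    survive_direct = 1.0 if own_invests else (1.0 - p) # the loss is not counted twice
    contagion = 0.0 if other_invests else survive_direct * q * loss
    return (c if own_invests else 0.0) + direct + contagion

def nash_equilibria():
    eqs = []
    for a, b in product([True, False], repeat=2):
        a_ok = expected_cost(a, b) <= expected_cost(not a, b)
        b_ok = expected_cost(b, a) <= expected_cost(not b, a)
        if a_ok and b_ok:
            eqs.append((a, b))
    return eqs

print(nash_equilibria())   # -> [(True, True), (False, False)]
```

With these numbers both “everyone invests” and “no one invests” are equilibria, which is precisely why coordinating mechanisms matter.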
2.4. Combining Reliability Analysis and Game Theory
Many recent applications of risk and reliability analysis to security do not explicitly model the response of potential attackers to defensive investments, and hence may overstate the effectiveness and cost-effectiveness of those investments. Similarly, much of the existing game-theoretic security work focuses on nonprobabilistic games. Moreover, even models that explicitly consider the success probabilities of potential attacks (e.g., Major,34 Woo,32 Kunreuther and Heal36) usually consider individual assets or components in isolation, and fail to consider the effect that disabling one or several components can have on the functionality of a larger system. Combining risk and reliability analysis with game theory could therefore be a fruitful way of protecting complex systems against intentional threats. Hausken39 has integrated probabilistic risk analysis and game theory (although not in the security context) by interpreting system reliability as a public good. In particular, he elucidates the relationships between the series or parallel structure of the system and several classic games. Rowe40 presents a simple game-theory framework for evaluating possible protective actions in light of terrorists' ability to “learn from experience and alter their tactics.” Finally, Banks and Anderson41 apply similar ideas to the threat of intentionally introduced smallpox, embedding risk analysis (quantified using expert opinion) into a game-theoretic formulation of the defender's decision problem. They conclude that this approach “captures facets of the problem that are not amenable to either game theory or risk analysis on their own.” Recent results by Bier et al.33,42 use game theory to explore the nature of optimal investments in the security of simple reliability systems, as a building block to the analysis of more complex systems. The results suggest that defending series systems against informed and determined attackers is an extremely difficult challenge. In a series system, if the attacker knows about (or can observe) the system's defenses, the defender's options for protecting the system are extremely limited. In particular, the defender is deprived of the ability to allocate defensive investments according to their cost-effectiveness; rather, if potential attackers know about the effectiveness of defensive measures, defensive investments in series systems must essentially equalize the strength of all defended components in order to be economically efficient. This is consistent with the observation by Dresher7 that, for optimal allocation of defensive resources, “each of the defended targets [must] yield the same payoff to the attacker.”
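The “equal attacker payoff” condition can be turned into a simple allocation rule once a functional form for the attack success probability is assumed. The sketch below is only illustrative: it assumes, hypothetically, that the success probability of an attack on component i decreases exponentially in the investment devoted to that component, and finds by bisection the allocation of a fixed budget that equalizes the attacker's expected payoff across all defended components.

```python
import math

def equalizing_allocation(values, k, budget, tol=1e-9):
    """Allocate 'budget' over components so that all defended components offer
    the attacker the same expected payoff values[i] * exp(-k[i] * x_i).
    The exponential form and all numbers are illustrative assumptions."""
    def spend_needed(payoff_level):
        return sum(max(0.0, math.log(v / payoff_level) / ki)
                   for v, ki in zip(values, k))
    lo, hi = 1e-12, max(values)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if spend_needed(mid) > budget:
            lo = mid          # a higher common payoff level is needed to stay within budget
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    return [max(0.0, math.log(v / level) / ki) for v, ki in zip(values, k)], level

alloc, level = equalizing_allocation(values=[10.0, 6.0, 3.0], k=[1.0, 0.8, 1.2], budget=2.0)
```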
The difficulty of defending series systems also emphasizes the value of redundancy as a defensive strategy. Essentially, redundancy reduces the flexibility available to the attacker (since the attacker must now disable multiple redundant components in order to disable the system), and increases the flexibility available to the defender (since the defender can now choose which of several redundant components to defend). Traditional design considerations such as spatial separation and functional diversity can also be important in defensive strategy, to help ensure that redundant components cannot all be disabled by the same type of attack.
2.5. Directions for Future Work
It is clearly important in practice to extend the types of security models described above to more complicated system structures. Recent work by the author and colleagues42 has begun to address this challenge, at least under particular assumptions. However, achieving fully general results (say, for arbitrary system structures, and for more general assumptions about the effects of security investments on the costs and/or success probabilities of potential attacks) is likely to be difficult, and may require heuristic approaches. In addition, for reasons of mathematical convenience, the models developed until now have generally assumed that the success probability of an attack on a particular component is a convex function of the resources invested to defend that component. While in many contexts this is a reasonable assumption (for example, due to declining marginal returns to defensive investments), it is clearly not fully general. In fact, certain types of security improvements (such as relocating a critical facility to a more secure location) are “inherently discrete,”30 in the sense that they require some minimal level of investment in order to be feasible. This will tend to result in step changes in the success probability of an attack as a function of the level of defensive investment. Similarly, if security investment beyond some threshold deters potential attackers from even attempting an attack, then the likelihood of a successful attack could decrease rapidly beyond that threshold. Such effects can result in the success probability of an attack being a nonconvex function of the defensive investment, at least in certain regions (for example, when the level of investment is not too large). This makes the problem of identifying the optimal level of defensive investment more complicated, and can change the nature of the optimal solutions, increasing the likelihood that there will be multiple local optima and,
for example, the likelihood that not investing in security at all may be the global optimum strategy. Moreover, historical experience suggests that secrecy and even deception can be important strategies for improving security and/or reducing defensive investment costs. Understanding when such strategies are likely to be desirable will require departing from the assumptions of common knowledge used in most applications of game theory. Finally, it would of course be worthwhile to extend our models to include the dimension of time, rather than the current static or “snapshot” view of system security. This would allow us to model imperfect attacker information (including, for example, Bayesian updating of the probability that an attack will succeed based on the past history of successful and failed attacks), as well as the possibility of multiple attacks over time.
2.6. Conclusion
As noted above, protecting engineered systems against intentional attacks is likely to require a combination of game theory and reliability analysis. Risk and reliability analysis by itself will not be sufficient to address many security challenges, since it does not take into account the attacker’s response to the implementation of improvements. However, most applications of game theory to security deal with individual components in isolation, and hence could benefit from the use of reliability analysis to model the risks to complex networked systems. Approaches that embed reliability models in a game-theoretic framework may make it possible to take advantage of the strengths of both approaches.
Acknowledgments

This material is based upon work supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number DAAD19-01-1-0502, by the U.S. National Science Foundation under grant number DMI-0228204, by the Midwest Regional University Transportation Center under project number 04-05, and by the United States Department of Homeland Security through the Center for Risk and Economic Analysis of Terrorism Events (CREATE) under grant number EMW2004-GR-0112. Any opinions, findings, and conclusions or recommendations expressed herein are those of the author and do not necessarily reflect the views of the sponsors.
References
1. V. M. Bier, Y. Y. Haimes, J. H. Lambert, N. C. Matalas, and R. Zimmerman, Assessing and managing the risk of extremes, Risk Analysis 19, 83-94 (1999).
2. V. M. Bier, An overview of probabilistic risk analysis for complex engineered systems, in Fundamentals of Risk Analysis and Risk Management (V. Molak, editor), Boca Raton, FL: Lewis Publishers (1997).
3. T. Bedford and R. Cooke, Probabilistic Risk Analysis: Foundations and Methods, UK: Cambridge University Press (2001).
4. R. M. Cooke, Experts in Uncertainty: Opinion and Subjective Probability in Science, Oxford, UK: Oxford University Press (1991).
5. V. M. Bier, S. Ferson, Y. Y. Haimes, J. H. Lambert, and M. J. Small, Risk of extreme and rare events: Lessons from a selection of approaches, in Risk Analysis and Society: Interdisciplinary Perspectives (T. McDaniels and M. J. Small, editors), Cambridge, UK: Cambridge University Press (2004).
6. R. Zimmerman and V. M. Bier, Risk assessment of extreme events, Columbia-Wharton/Penn Roundtable on "Risk Management Strategies in an Uncertain World," http://www.ldeo.columbia.edu/CHRR/Roundtable/ZimmermanWP.pdf (2003).
7. M. Dresher, Games of Strategy: Theory and Applications, Englewood Cliffs, NJ: Prentice-Hall (1961).
8. H. F. Martz and M. E. Johnson, Risk analysis of terrorist attacks, Risk Analysis 7, 35-47 (1987).
9. L. A. Cox, A probabilistic risk assessment program for analyzing security risks, in New Risks: Issues and Management (L. A. Cox and P. F. Ricci, editors), New York: Plenum Press (1990).
10. E. Paté-Cornell and S. Guikema, Probabilistic modeling of terrorist threats: A systems analysis approach to setting priorities among countermeasures, Military Operations Research 7, 5-20 (December 2002).
11. B. J. Garrick, Perspectives on the use of risk assessment to address terrorism, Risk Analysis 22, 421-424 (2003).
12. E. Zebroski, Risk management lessons from man-made catastrophes: Implications for anti-terrorism risk assessments using a simple relative risk method, American Nuclear Society Topical Meeting on Risk Management, San Diego, CA (June 2003).
13. R. A. Zilinskas, B. Hope, and D. W. North, A discussion of findings and their possible implications from a workshop on bioterrorism threat assessment and risk management, Risk Analysis 24, 901-908 (2004).
14. Y. Y. Haimes, Strategic responses to risks of terrorism to water resources, Journal of Water Resources Planning and Management 128, 383-389 (2002).
15. Y. Y. Haimes and T. Longstaff, The role of risk analysis in the protection of critical infrastructure against terrorism, Risk Analysis 22, 439-444 (2002).
16. Y. Y. Haimes, N. C. Matalas, J. H. Lambert, B. A. Jackson, and J. F. R. Fellows, Reducing vulnerability of water supply systems to attack, Journal of Infrastructure Systems 4, 164-177 (1998).
17. B. C. Ezell, J. V. Farr, and I. Wiese, Infrastructure risk analysis model, Journal of Infrastructure Systems 6, 114-117 (2000a).
18. B. C. Ezell, J. V. Farr, and I. Wiese, Infrastructure risk analysis of municipal water distribution system, Journal of Infrastructure Systems 6, 118-122 (2000b).
19. D. M. Lemon and G. E. Apostolakis, A methodology for the identification of critical locations in infrastructures, Massachusetts Institute of Technology, working paper ESD-WP-200401 (2004).
20. G. Levitin, Optimal allocation of multi-state elements in linear consecutively connected systems with vulnerable nodes, European Journal of Operational Research 150, 406-419 (2003).
21. G. Levitin and A. Lisnianski, Optimizing survivability of vulnerable series-parallel multi-state systems, Reliability Engineering and System Safety 79, 319-331 (2003).
22. G. Levitin, Y. Dai, M. Xie, and K. L. Poh, Optimizing survivability of multi-state systems with multi-level protection by multi-processor genetic algorithm, Reliability Engineering and System Safety 82, 93-104 (2003).
23. I. Ravid, Theater Ballistic Missiles and Asymmetric War, The Military Conflict Institute, http://www.militaryconflict.org/Publications.htm (2002).
24. S. J. Brams and D. M. Kilgour, Game Theory and National Security, Oxford: Basil Blackwell (1988).
25. R. Anderson, Why information security is hard: An economic perspective, presented at the 18th Symposium on Operating Systems Principles, October 21-24, Alberta, Canada; also the 17th Annual Computer Security Applications Conference, December 10-14, New Orleans, Louisiana; http://www.ftp.cl.cam.ac.uk/ftp/users/rja14/econ.pdf (2001).
26. F. Cohen, Managing network security: Attack and defense strategies, Technical Report 9907, Fred Cohen and Associates (1999).
27. B. Schneier, Managed security monitoring: Network security for the 21st century, Counterpane Internet Security, Inc., http://www.counterpane.com/msm.pdf (2001).
28. D. G. Arce M. and T. Sandler, Transnational public goods: Strategies and institutions, European Journal of Political Economy 17(3), 493-516 (2001).
29. W. Enders and T. Sandler, What do we know about the substitution effect in transnational terrorism? in A. Silke and G. Ilardi (editors), Researching Terrorism: Trends, Achievements, Failures, London: Frank Cass (2004).
30. N. O. Keohane and R. J. Zeckhauser, The ecology of terror defense, Journal of Risk and Uncertainty 26, 201-229 (2003).
31. M. O'Hanlon, P. Orszag, I. Daalder, M. Destler, D. Gunter, R. Litan, and J. Steinberg, Protecting the American Homeland, Washington, DC: Brookings Institution (2002).
32. G. Woo, Insuring against Al-Qaeda, Insurance Project Workshop, National Bureau of Economic Research, Inc., http://www.nber.org/~confer/2003/insurance03/woo.pdf (2003).
33. V. M. Bier, A. Nagaraj, and V. Abhichandani, Protection of simple series and parallel systems with components of different values, Reliability Engineering and System Safety 87, 313-323 (2005).
34. J. Major, Advanced techniques for modeling terrorism risk, Journal of Risk Finance 4, 15-24 (2002).
35. V. M. Bier, S. Oliveros, and L. Samuelson, Choosing what to protect: Strategic defensive allocation against an unknown attacker, submitted to Journal of Risk and Uncertainty (2005).
36. H. Kunreuther and G. Heal, Interdependent security, Journal of Risk and Uncertainty 26, 231-249 (2003).
37. A. Gupta and V. M. Bier, Myopia and interdependent security risks: A comment on "Interdependent Security" by Kunreuther and Heal, draft manuscript (2005).
38. J. Zhuang and V. M. Bier, Subsidized security and stability of equilibrium solutions in an N-player game with errors, submitted to IIE Transactions (2005).
39. K. Hausken, Probabilistic risk analysis and game theory, Risk Analysis 22, 17-27 (2002).
40. W. D. Rowe, Vulnerability to terrorism: Addressing the human variables, Risk-Based Decision Making in Water Resources X, pp. 155-159, Reston, VA: American Society of Civil Engineers (2002).
41. D. Banks and S. Anderson, Game-theoretic risk management for counterterrorism, Office of Biostatistics and Epidemiology, U.S. Food and Drug Administration (2003).
42. N. Azaiez and V. M. Bier, Optimal resource allocation for security in reliability systems, submitted to European Journal of Operational Research (2005).
CHAPTER 3 REGRESSION MODELS FOR RELIABILITY GIVEN THE USAGE ACCUMULATION HISTORY
THIERRY DUCHESNE Département de mathématiques et de statistique, Pavillon Alexandre-Vachon, Université Laval, Québec, QC, G1K 7P4, Canada E-mail:
[email protected] In this paper we survey a few common and less common approaches to lifetime regression model building when the covariate of interest is the usage accumulation history. Classical models for lifetime regression with time-varying covariates based on the hazard function and on transformations of the time variable are first considered. Less common, but potentially interesting, regression models are then derived using other modeling approaches based on the notions of transfer functionals and internal wear. The class of simple regression models known as “collapsible models” is discussed, along with its potential applicability to tackle two-dimensional prediction problems involving time and cumulative usage at failure. Ideas for future research are also discussed.
3.1. Introduction

In many reliability applications, the lifetime distribution of the devices under consideration will strongly depend on the usage accumulation pattern of these devices. One obvious example is the reliability of automobiles: failures will usually happen much earlier on cars that accumulate mileage at a high rate than on similar cars that are seldom used. For this reason, it is clear that usage accumulation measures should be incorporated as much as possible in the forecast of time to failure, for they are likely to be quite predictive. In order to include usage information in reliability forecasts, a regression model for lifetime given the usage accumulation history is needed.
In this paper, we will survey several approaches to building such models. Throughout, we will assume that a single measure of usage is available. We also suppose that the notion of “usage” is clear and that a measure of usage accumulation is readily defined. Unfortunately, this might not be the case in every application. For instance, when modeling the lifetime of certain automobile parts, cumulative mileage is the obvious manner in which to measure usage.1 However, for the reliability of factory equipment, there might not be a single trivial way in which to measure usage accumulation. Coming up with a formal definition of a usage accumulation measure is beyond the scope of this paper. As a matter of fact, ways to define suitable usage measures could be the basis of several research projects! For the remainder of this paper, we focus on applications for which usage accumulation can be measured with a single time-varying covariate that can be clearly identified.
3.1.1. Definitions and notation

Let T be the random variable of time to failure for the device under study. We suppose that T follows a continuous distribution on the positive real line. Let θ(t) denote the usage rate of the device at time t and θ_t = {θ(u), 0 ≤ u ≤ t} be the usage accumulation history up to time t. For convenience, we drop the subscript t from θ_t to denote the entire potential (if the device never failed) usage accumulation history, i.e., θ = lim_{t→∞} θ_t. Finally, y(t) = ∫₀ᵗ θ(u) du denotes the usage accumulated from time 0 up to time t. For example, if the device under consideration is an automobile, T represents its time of failure, θ(t) is its speed at time t and y(t) is the cumulative mileage on this car at time t. Following Duchesne and Rosenthal,2 we make mild assumptions on the usage rate process {θ(t), t ≥ 0}. Let Θ be the space of all possible usage histories. Then for all θ ∈ Θ, we suppose that θ(t) is a nonnegative and piecewise smooth function, i.e., for each θ ∈ Θ, there exists a countable set of time points 0 ≤ t₁ < t₂ < ... with tᵢ → ∞ such that θ(t) = αᵢ(t), tᵢ < t < tᵢ₊₁, where αᵢ(·) is continuous and smooth over [tᵢ, tᵢ₊₁], i = 1, 2, ... Moreover, we assume that the process {θ(t), t ≥ 0} is external in the sense that its trajectories are determined independently of T. Note that this family of usage accumulation histories is broad enough to include stepwise continuous usage histories (on-off, low intensity-high intensity, smooth acceleration-smooth deceleration-off, etc.) and thus covers a wide range of potential applications. Our goal in this paper is to survey different approaches to building
regression models for lifetime given the usage history. Mathematically, we want models for

P[T > t | θ],   ∀t ≥ 0, θ ∈ Θ.   (1)
In this context, usage is nothing more than a time-varying covariate. We will therefore devote the remainder of this section to a short survey of the more classical survival regression models with time-varying covariates. In Sec. 3.2, we will survey other lifetime regression modeling approaches that are not as common but that might be especially appropriate for building regression models for lifetime given usage. We devote Sec. 3.3 to the class of collapsible models and illustrate how they can be convenient to tackle two-dimensional prediction problems involving both time to failure and cumulative usage at failure, such as preventive maintenance schedules or warranty cost calculations. A short discussion as well as ideas for future research are given in Sec. 3.4.
3.1.2. Common lifetime regression models

The effect of the usage accumulation history θ on the conditional distribution of lifetime (1) can be modeled in several ways. One classical approach to regression for lifetime in the presence of time-varying covariates is to model the effect of the covariates on the hazard function

λ(t|θ) = lim_{Δt↓0} P[t ≤ T < t + Δt | T ≥ t, θ] / Δt.   (2)

One such model is the proportional hazards model of Cox,3 where

λ(t|θ) = λ₀(t) exp{βθ(t)},   ∀t ≥ 0, θ ∈ Θ,   (3)

where λ₀(t) is a “baseline” hazard function that only depends on t and not on the usage history. Banjevic et al.4 use this model to derive optimal condition-based maintenance strategies. Another hazard-based regression model is the additive hazards model:5

λ(t|θ) = λ₀(t) + βθ(t),   ∀t ≥ 0, θ ∈ Θ,   (4)

where λ₀(t) is, again, a “baseline” hazard function that only depends on t and not on the usage history. This model is particularly convenient mathematically when the joint distribution of T and Y = y(T) is needed, as is the case with two-dimensional warranty and maintenance calculations. Examples of such calculations in the case of warranty pricing are given by Singpurwalla and Wilson.6
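As a small numerical illustration of how a hazard-based model such as (3) is used, the sketch below evaluates P[T > t|θ] by numerically integrating the cumulative hazard for a piecewise-constant usage history. The baseline hazard, the regression coefficient and the usage history are arbitrary choices made for the example, not quantities from the text; the additive hazards model (4) would be handled the same way, with λ₀(x) + βθ(x) inside the integral.

```python
import math

def ph_survival(t, usage_rate, beta=0.8, lambda0=lambda x: 0.05, n_grid=2000):
    """P[T > t | theta] under the proportional hazards model (3),
    lambda(t|theta) = lambda0(t) * exp(beta * theta(t)), computed by a
    midpoint-rule integration of the cumulative hazard.  All parameter values
    are illustrative assumptions."""
    dx = t / n_grid
    cum_hazard = sum(lambda0(x) * math.exp(beta * usage_rate(x)) * dx
                     for x in (dx * (i + 0.5) for i in range(n_grid)))
    return math.exp(-cum_hazard)

# A piecewise-constant usage history: heavy use for the first 2 time units, idle after.
theta = lambda x: 1.0 if x < 2.0 else 0.0
print(ph_survival(5.0, theta))
```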
Another common approach to regression model building consists in measuring the age of the device in an operational time scale or general time transformation7 that takes into account its usage history. There is a vast literature on time-scale based models, and different authors seem to be using different terminology for the same concepts. In this paper, we use the terminology of Duchesne and Lawless,8 who define an ideal time-scale as a functional φ : R × Θ → R such that

P[T > t | θ] = G[φ(t, θ)],   ∀t ≥ 0, θ ∈ Θ,   (5)

where G(·) is some continuous survival function. This notion of ideal time-scale is equivalent to the concepts of intrinsic scale of Çinlar and Özekici,9 cumulative exposure of Nelson,10 load-invariant scale of Kordonsky and Gertsbakh,11-14 and virtual age of Finkelstein,15 and is closely related to the notion of transfer functional of Bagdonavičius and Nikulin.16 From (5), we see that the age in the ideal time-scale at chronological time t contains all the information necessary to obtain the conditional probability of survival past that time given the usage history. From Eqs. (3)-(5) of Ref. 8, we have that the conditional probability P[T > t|θ] can be equivalently specified via the hazard function (2) or through an ideal time-scale as in (5). However, models that tend to have a simple form for (2) tend to have complex expressions for (5), and vice versa. One model defined through a time-scale change is the basic cumulative exposure model given in Sec. 10.3 of Ref. 10. Suppose that if θ(t) = θ ∀t, then P[T > t|θ] = G[t/ν(θ)]. Then for a general usage history θ ∈ Θ, we have P[T > t|θ] = G[φ(t, θ)] with φ(t, θ) = ∫₀ᵗ dx/ν(θ(x)). Nelson10 uses this model to analyze oil breakdown data. The accelerated failure time model with time-varying covariates is a generalization of this varying-stress model. It does not suppose that the usage rate only affects the conditional survival probability through a scale parameter, but rather poses P[T > t|θ] = G[φ(t, θ)] with

φ(t, θ) = ∫₀ᵗ ψ(x, θ(x)) dx,   ∀t ≥ 0, θ ∈ Θ,   (6)

where ψ is a functional that is usually specified up to a finite-dimensional vector of parameters. Robins and Tsiatis17 derive inference procedures for this model with ψ(x, θ(x)) = exp{βθ(x)}.

3.2. Other Approaches to Regression Model Building
Though the methods of Sec. 3.1 can be used in many situations where time-varying covariates are available, perhaps other classes of regression models
could also be useful when the covariate in question is usage. The classes that we consider in this section are especially appropriate in situations where information on how usage impacts lifetime is available in the form of differential equations that specify the rate of "consumption" of the time in the ideal time-scale or that specify the rate of increase in "internal wear" of the device.
3.2.1. Models based on transfer functionals

Bagdonavičius and Nikulin16 define a transfer functional as a functional f_θ : R × Θ → R such that

f_θ(t) = H⁻¹(P[T > t | θ]),   ∀t ≥ 0, θ ∈ Θ,   (7)

where H(·) is some specified survivor function. Note that if H(·) in (7) is equal to G(·) in (5), then the transfer functional and the ideal time-scale are the same. Hence both quantities are in one-to-one correspondence with the conditional quantiles of the lifetime distribution given usage. Bagdonavičius and Nikulin16 also define the H-resource at time t as the value of f_θ(t), and the H-resource used up to failure as the random variable R = f_θ(T). The transfer functional can thus be viewed as a tool to transfer the lifetimes measured in chronological time for different devices with different usage histories into lifetimes measured in a scale whose lifetime distribution has survivor function H(·) for all devices, regardless of their usage history. An interesting interpretation of this model is as follows: each device is created with a random initial amount of resource R, whose distribution has survivor function H(·). The usage accumulation history dictates how this amount of resource is consumed via the transfer functional. When all the resource has been consumed, the item fails. Bagdonavičius and Nikulin16 use this approach to define new classes of lifetime regression models by specifying the effect of usage on the rate of resource consumption, df_θ(t)/dt, through a differential equation. For instance, their Model 4 is given by a differential equation of the form (8), with initial condition f₀(0) = f_θ(0) = H⁻¹(1) and where f₀(t) = H⁻¹{F₀(t)} for some baseline survivor function F₀. From (8), the conditional survivor function of interest is then given by (9).
Obviously, a wide class of regression models can be generated by changing the form of the differential Eq. (8) according to how usage influences the rate of resource use. Note that the common models of Sec. 3.1 are special cases of (8). For example, if H(t) = e⁻ᵗ, a{θ(t)} = 0, r{θ(t)} = exp{βθ(t)} and λ₀(t) = −d ln F₀(t)/dt, then Model 4 reduces to the proportional hazards model (3). The additive hazards model (4), the cumulative damage model of Nelson10 and the accelerated failure time model (6) are also special cases of (8).

3.2.2. Models based on internal wear
It is common in the reliability literature to model the lifetime of a device as the first time at which the internal wear (“cumulative damage,” “degradation”) of the device crosses a certain threshold value. Mathematically, let {X(t), t ≥ 0} be a real-valued stochastic process and X* be a positive real-valued random variable independent of {X(t), t ≥ 0}. Assume that X(0) = 0 and that {X(t), t ≥ 0} has right-continuous paths with finite left-hand limits with probability 1. Then the time to failure random variable, T, of the device can be defined in terms of {X(t), t ≥ 0} and X* as T = inf{t : X(t) ≥ X*}. With T so defined, a model of the effect of usage θ on the internal wear process {X(t), t ≥ 0} will lead to a model for the conditional probability of interest P[T > t|θ]. There are several models for the distribution of {X(t)} given covariates available in the literature: many references are given in the survey on survival in dynamic environments presented in Ref. 18; Chapter 21 of Ref. 7 is devoted to regression models for degradation; Bagdonavičius and Nikulin20 propose a method to regress degradation on time-varying covariates, such as usage, to model tire reliability. Some regression models for internal wear given a covariate will yield a simple form for the conditional probability P[T > t|θ]. Duchesne and Rosenthal2 have studied the class of internal wear models given by

X(t) = ∫₀ᵗ μ[s, θ, X(s)] ds + ∫₀ᵗ a[s, θ, X(s)] dγ(s),   (10)

where {γ(t), t ≥ 0} is a stochastic process with nonnegative increments. They have derived the conditional probability P[T > t|θ] under different forms for the functions μ and a, and have been able to recover the accelerated failure time model (6) as well as the collapsible models of Sec. 3.3 under some conditions. A simple interpretation of (10) is that the internal wear accumulated on a device increases at every time instant. This increment
in wear can be split into a deterministic increase due to time and usage, equal to μ[s, θ, X(s)] ds, and a stochastic increase, whose distribution may be influenced by time and usage as well, equal to a[s, θ, X(s)] dγ(s). For example, if the deterministic increase in wear at time s is proportional to the time interval ds and to the usage rate at that time, and if the stochastic increase is not influenced by the usage rate or the level of internal wear, then (10) can be written as X(t) = ∫₀ᵗ [β₁ + β₂θ(s)] ds + ∫₀ᵗ a[s] dγ(s), which yields P[T > t|θ] = G[β₁t + β₂y(t)], where G is some survivor function that does not depend on θ and y(t) = ∫₀ᵗ θ(s) ds is the total usage accumulated up to time t. Bagdonavičius and Nikulin20 model the effect of usage on internal wear in a completely different manner. Rather than letting usage modify the increment in wear at a given time, they suppose that the trajectory of the internal wear process is a stochastic process that does not depend on usage, but rather that usage will influence the value of the time index of the process. With this approach, each device randomly draws an internal wear accumulation trajectory, but travels more slowly or more rapidly along this trajectory, depending on how intensely it is used. This approach via subordinated processes is relatively new in reliability, although it has been used extensively in other disciplines.21,22 Mathematically, suppose that each device draws an internal wear trajectory {X(t), t ≥ 0} with X(t) = a₂γ(t), where {γ(t), t ≥ 0} is the same nondecreasing process as in (10). Models similar to those presented in Ref. 16 can then be applied to modify the time index of the processes {X(t)} and {γ(t)}. For instance, under the modeling approach of models (8) and (9) with a{θ(t)} = 0, the amount of internal wear at time t for a device with usage history θ is given by X(t) = a₂γ(∫₀ᵗ exp{βθ(s)} ds). This model leads to P[T > t|θ] = G[∫₀ᵗ exp{βθ(s)} ds]. These models can all be extended to cases where failure is not caused solely by internal wear reaching a threshold, but also by external (traumatic) events such as accidents, overheating, etc. In this regard, internal wear can also be viewed as “failure rate determining”18,19,20 in the sense that the hazard that the device will succumb to an adverse event increases with the amount of internal wear of the device. In this setup, time to failure is defined as U = min(T, K), where T = inf{t : X(t) ≥ X*}, as before, and K is the time at which a traumatic event that could “kill” the device occurs. Modeling of the distribution of time to failure is then typically done by assuming that the process {γ(t)} and K are independent
and that K has hazard function λ_K(t | θ, X(t)).   (11)
Several authors have found the additive hazards formulation λ_K(t|θ, X(t)) = λ₀ + βX(t) to be particularly convenient mathematically in this framework. Once again, under some conditions on the hazard function (11) and on the internal wear process {X(t)}, it is possible to obtain some of the regression models described in Sec. 3.1 or Sec. 3.3 as special cases.
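A minimal simulation sketch of this threshold-crossing view of failure is given below: internal wear grows deterministically at a usage-dependent rate and jumps by random shocks, and the device fails when the wear first exceeds a fixed threshold (a simplification of the random threshold X* used in the text). All numerical values are illustrative.

```python
import random

def first_passage_time(theta, beta1=0.05, beta2=0.4, shock_rate=0.3,
                       mean_shock=0.5, threshold=10.0, dt=0.01,
                       t_max=200.0, seed=0):
    """Time at which internal wear first exceeds the threshold.  Wear grows at
    rate beta1 + beta2*theta(t) and by exponentially sized shocks arriving as a
    Poisson process; all parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    wear, t = 0.0, 0.0
    while t < t_max:
        t += dt
        wear += (beta1 + beta2 * theta(t)) * dt
        if rng.random() < shock_rate * dt:             # a shock arrives in (t, t+dt]
            wear += rng.expovariate(1.0 / mean_shock)
        if wear >= threshold:
            return t
    return None   # no failure observed before t_max (right censored)

heavy_use = lambda t: 1.0
light_use = lambda t: 0.2
print(first_passage_time(heavy_use), first_passage_time(light_use))
```

Running it with a heavy and a light usage history shows the earlier failures expected under intensive use.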
3.3. Collapsible Models

Collapsible models are the family of simple lifetime regression models that can be written as

P[T > t | θ] = G[φ{t, y(t)}],   ∀t ≥ 0, θ ∈ Θ,   (12)

i.e., models for which the ideal time-scale only depends on time, t, and the amount of accumulated usage at that time, y(t), not on the entire usage accumulation history between times 0 and t. The term “collapsibility” seems to have been introduced by Oakes,24 but models of the collapsible form (12) have been used in reliability for a much longer time.25,11,12 Kordonsky and Gertsbakh11-14 have used the linear time-scale version of these models, with φ{t, y(t)} = β₁t + β₂y(t), to analyze aircraft reliability data and fatigue life of steel specimens. Duchesne and Lawless8 have shown that this linear time-scale model provided a better fit to the steel specimen data than the traditional accelerated failure time model. Model selection and parametric inference for these models are discussed in Refs. 24, 14, 8, while semiparametric inference (φ specified up to a vector of parameters, G arbitrary) is considered by Kordonsky and Gertsbakh14 as well as Duchesne and Lawless.26 Duchesne27 illustrates how it is possible to estimate level curves of the form C_a = {(x, y) : φ{x, y} = a} nonparametrically. As shown in Sec. 3.2, collapsible models can arise in setups where failure is caused by excessive internal wear when this internal wear process satisfies certain conditions.
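The defining feature of a collapsible model, namely that only the pair (t, y(t)) matters and not the path taken to reach it, is easy to check numerically. In the sketch below, the linear time-scale version is evaluated for two very different usage histories that accumulate the same total usage by time t; the survivor function G and the coefficients are illustrative choices, not values from the text.

```python
import math

def collapsible_survival(t, y_t, b1=0.02, b2=0.003):
    """P[T > t | theta] under a linear-time-scale collapsible model with a
    unit-exponential survivor function G; coefficients are illustrative."""
    return math.exp(-(b1 * t + b2 * y_t))

# Two very different usage histories that accumulate the same usage by t = 10:
# constant 50 units/time, versus 100 units/time for the first half and idle after.
y_constant = 50.0 * 10.0
y_front_loaded = 100.0 * 5.0
print(collapsible_survival(10.0, y_constant), collapsible_survival(10.0, y_front_loaded))
# identical by construction: only (t, y(t)) matters, not the path
```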
3.3.1. Two-dimensional prediction problems

In several applications, predictions are based on the joint distribution
of time to failure given the usage history of the collapsible form can potentially simplify some of these calculations. Here are a few examples of two-dimensional prediction problems: 0
0
0
For a typical North American automobile warranty, coverage has both time and cumulative usage (mileage) limitations. As shown by several authors, e.g., Singpurwalla and Wilson,6 evaluation of the cost of such a warranty will therefore involve an expectation with respect t o the joint distribution of T and Y = y(T). Moreover, warranty data are censored in both time and cumulative usage;' a warranty period which would be parallel t o an age curve C, would bring this censoring problem back t o a single dimension. Though such a warranty would be difficult t o sell, it would represent a fair warranty since every user would have the same probability t o see his device fail within the warranty period, regardless of how the device is used. Kordonsky and GertsbakhZ8 and, more recently, Frickenstein and Whitake~ consider ~~ the problem of optimizing preventive maintenance operations when usage has a strong influence on the distribution of time t o failure and there is a lot of variability in the usage histories of different items. Again in this case, the joint distribution of T and Y is a central element of the calculations. Kordonsky and GertsbakhZ8 use a collapsible model with a linear time-scale to minimize the average cost per life cycle of a fleet of items with different usage histories. Many modern devices are equipped with sensors that measure all sorts of environmental and usage rate information in real time. Engineers want to use this information t o monitor the state of the device and propose condition-based maintenance in real time. Conditioned-based maintenance decisions a t a given time t are usually based on the hazard of failure at that time.4 Data storage capabilities and the necessity for instantaneous calculations suggest that collapsible models could be useful for this type of application, since the hazard of failure does not depend on the entire usage history but only on time, cumulative usage and usage rate a t that time: d d A(t(0) = -- l n P [ T > tls]= -- lnG[c${t,y(t))] dt dt = XG[d{t,Y(t))l [d(i){t,Y ( t ) ) -k d(z){t,Y(t))e(t)] where AG(z)
4 ( 2 ) kY l
=
-dlnG(x)/ds,
= W { Z , Y)/dY.
,
4(1){x,y} = a$(x,y}/az and
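The following sketch evaluates the hazard expression above for the linear time-scale φ{t, y} = β₁t + β₂y, taking G to be a Weibull survivor function G(x) = exp(−x^k) so that λ_G(x) = k x^(k−1) has a closed form; the parameter values and the numbers in the example are invented for illustration only.

```python
def collapsible_hazard(t, y_t, theta_t, b1=0.02, b2=3e-5, shape=1.5):
    """Real-time hazard under the linear collapsible model phi{t, y} = b1*t + b2*y
    with Weibull G(x) = exp(-x**shape), so lambda_G(x) = shape * x**(shape - 1).
    Only current time, cumulative usage and usage rate are needed; all
    parameter values are illustrative assumptions."""
    phi = b1 * t + b2 * y_t
    lambda_g = shape * phi ** (shape - 1.0)
    return lambda_g * (b1 + b2 * theta_t)

# e.g., a car at t = 3 years with 60,000 km on the clock, currently driven 25,000 km/yr
print(collapsible_hazard(3.0, 60000.0, 25000.0))
```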
3.4. Discussion
We have looked at various approaches to constructing regression models for the conditional distribution of lifetime given a usage accumulation history. While some of these models have been thoroughly investigated and have proven their worth for lifetime regression based on time-varying covariates, several new models can be derived using the construction approaches outlined in this paper. Of course this review is not exhaustive, as there exists a vast literature on lifetime regression models. Numerous semi- and nonparametric hazard-based models are considered in the biostatistics and econometrics literature; some of these models might also be useful in reliability applications. Moreover, specific regression models based on chemical and physical theory have been described in the reliability literature and have not been explicitly mentioned here. Since usage can be viewed as a stress factor or a catalyst, perhaps these chemical/physical regression models could be used to model the distribution of lifetime given usage.

There are still many open problems when we consider the approaches of Sec. 3.2. The class of models given by Eqs. (8) and (9) is very broad. Nonparametric inference methods that can eliminate some models from this class would be very helpful for model selection; perhaps such a test could lead to an omnibus test of the collapsibility assumption (12). Regression models for reliability data tend to be specified via a time transformation. This suggests that the subordinated process approach to internal wear/degradation modeling could be a more natural way to proceed when regressing internal wear or degradation on usage. This avenue requires further study.
Acknowledgments
I would like to thank the numerous researchers who provided me with interesting feedback after I presented this article at the MMR 2004 meeting in Santa Fe in June 2004; many of their comments are reflected in this paper. I also wish to thank an anonymous referee for useful comments. Financial support of this work by the Natural Sciences and Engineering Research Council of Canada and the Fonds québécois de la recherche sur la nature et les technologies is gratefully acknowledged.
References

1. J. F. Lawless, J. Hu, and J. Cao, Methods for the estimation of failure distributions and rates from automobile warranty data, Lifetime Data Anal. 1, 227-240 (1995).
2. T. Duchesne and J. S. Rosenthal, On the collapsibility of lifetime regression models, Adv. Appl. Prob. 35, 755-772 (2003).
3. D. R. Cox, Regression models and life-tables, J. R. Statist. Soc. B 34, 187-220 (1972).
4. D. Banjevic, A. K. S. Jardine, V. Makis, and M. Ennis, A control-limit policy and software for condition-based maintenance optimization, INFOR 39, 32-50 (2001).
5. D. Y. Lin and Z. Ying, Additive hazards regression models for survival data, in Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis, Eds. D. Y. Lin and T. R. Fleming, Springer-Verlag, New York (1997).
6. N. D. Singpurwalla and S. P. Wilson, The warranty problem: its statistical and game theoretic aspects, SIAM Review 35, 17-42 (1993).
7. W. Q. Meeker and L. A. Escobar, Statistical Methods for Reliability Data, Wiley, New York (1998).
8. T. Duchesne and J. Lawless, Alternative time scales and failure time models, Lifetime Data Anal. 6, 157-179 (2000).
9. E. Cinlar and S. Ozekici, Prob. Eng. Inf. Sci. 1, 97-115 (1987).
10. W. Nelson, Accelerated Testing: Statistical Models, Test Plans, and Data Analyses, Wiley, New York (1990).
11. K. B. Kordonsky and I. Gertsbakh, Choice of the best time scale for system reliability analysis, Europ. J. Oper. Res. 65, 235-246 (1993).
12. K. B. Kordonsky and I. Gertsbakh, System state monitoring and lifetime scales, Reliability Engineering and System Safety 47, 1-14 (1995).
13. K. B. Kordonsky and I. Gertsbakh, System state monitoring and lifetime scales 2, Reliability Engineering and System Safety 47, 145-154 (1995).
14. K. B. Kordonsky and I. Gertsbakh, Multiple time scales and the lifetime coefficient of variation: engineering applications, Lifetime Data Anal. 2, 139-156 (1997).
15. M. S. Finkelstein, Wearing-out of components in a variable environment, Reliability Engineering and System Safety 66, 235-242 (1999).
16. V. B. Bagdonavicius and M. S. Nikulin, Transfer functionals and semiparametric regression models, Biometrika 84, 365-378 (1997).
17. J. Robins and A. A. Tsiatis, Semiparametric estimation of an accelerated failure time model with time-dependent covariates, Biometrika 79, 321-334 (1992).
18. N. D. Singpurwalla, Survival in dynamic environments, Stat. Sci. 10, 86-103 (1995).
19. D. R. Cox, Some remarks on failure-times, surrogate markers, degradation, wear, and the quality of life, Lifetime Data Anal. 5, 307-314 (1999).
20. V. B. Bagdonavicius and M. S. Nikulin, Estimation in degradation models with explanatory variables, Lifetime Data Anal. 7, 365 (2001).
21. M. L. T. Lee and G. A. Whitmore, J. Appl. Prob. 30, 302-314 (1993).
22. P. Hougaard, M. L. T. Lee, and G. A. Whitmore, Analysis of overdispersed count data by mixtures of Poisson variables and Poisson processes, Biometrics 53, 1225-1238 (1997).
23. N. P. Jewell and J. D. Kalbfleisch, Marker processes in survival analysis, Lifetime Data Anal. 2, 15-29 (1996).
24. D. Oakes, Multiple time scales in survival analysis, Lifetime Data Anal. 1, 7-18 (1995).
25. M. A. Miner, Cumulative damage in fatigue, J. Appl. Mechanics 12, A159 (1945).
26. T. Duchesne and J. Lawless, Semiparametric inference methods for general time scale models, Lifetime Data Anal. 8, 263-276 (2002).
27. T. Duchesne, Semiparametric methods of time scale selection, in Recent Advances in Reliability Theory: Methodology, Practice and Inference, Eds. N. Limnios and M. S. Nikulin, Birkhauser, Boston (2000).
28. K. Kordonsky and I. Gertsbakh, Best time scale for age replacement, Int. J. Reliab. Safety Eng. 1, 219 (1994).
29. S. G. Frickenstein and L. R. Whitaker, Age replacement policies in two time scales, Naval Res. Logist. 50, 592-613 (2003).
CHAPTER 4 BAYESIAN METHODS FOR ASSESSING SYSTEM RELIABILITY: MODELS AND COMPUTATION
TODD L. GRAVES Statistical Sciences Group Los Alamos National Laboratory Los Alamos, NM 87545 USA E-mail:
[email protected]
MICHAEL S. HAMADA Statistical Sciences Group Los Alamos National Laboratory Los Alamos, NM 87545 USA E-mail:
[email protected] There are many challenges with assessing the reliability of a system today. These challenges arise because a system may be aging and full system tests may be too expensive or can no longer be performed. Without full system testing, one must integrate (1) all science and engineering knowledge, models and simulations, (2) information and data at various levels of the system, e.g., subsystems and components, and (3) information and data from similar systems, subsystems and components. The analyst must work with various data types and how the data are collected, account for measurement bias and uncertainty, deal with model and simulation uncertainty, and incorporate expert knowledge. Bayesian hierarchical modeling provides a rigorous way to combine information from multiple sources and different types of information. However, an obstacle to applying Bayesian methods is the need to develop new software to analyze novel statistical models. We discuss a new statistical modeling environment, YADAS, that facilitates the development of Bayesian statistical analyses. It includes classes that help analysts specify new models, as well as classes that support the creation of new analysis algorithms. We illustrate these concepts using several examples.
4.1. Challenges in Modern Reliability Analyses
There are many challenges with assessing the reliability of a system today. First, full system testing may be prohibitively expensive or even prohibited. For this reason and others, it is important to be able to make use of expert opinion and information in the form of physics/engineering/material science based models (deterministic and stochastic) or simulation, and to account for model bias and uncertainty. Results from multiscale science/engineering experiments also need to be incorporated. One must be able to handle complex system reliability models, including reliability block diagrams, fault trees and networks. One must incorporate data at various levels (system, subsystem, component), and properly account for how higher level event data inform about lower level data. In this context, models for subsystems or even components can be nontrivial. The effects of aging and other covariates, including those that define subpopulations, are often of interest. Efficient analysis entails combining information and data from similar systems, subsystems and components. Reliability data can come in various flavors, including binomial counts, Poisson counts, failure times, degradation data, and accelerated reliability data. Such data may be nontrivial to analyze in their own right. How such data are collected must also be considered. For example, measurement error (bias and precision) from destructive or nondestructive evaluation techniques may be too large to ignore. These challenges go beyond those addressed in the system reliability literature (Cole,^1 Mastran,^2 Mastran and Singpurwalla,^3 Natvig and Eide,^4 Martz, Waller, and Fickas,^5 Martz and Waller^6), which mostly considers binomial data. This literature also predates the advances made in Bayesian computation in the 1990s and resorts to various approximations. Such approximations will be hard to generalize for these challenges. An outline of this paper is as follows. First we consider three examples which illustrate these challenges. Next we discuss a statistical modeling environment which can support the needs of modern reliability analyses. Then we return to the three examples and present results of their reliability analyses. Finally, we conclude with a discussion.

4.2. Three Important Examples
We discuss three examples of challenging statistical problems that arise in reliability estimation. First, even the analysis of a single component can require development of new techniques. Consider the case in which there are indications that a component’s manufacturing lot impacts its reliability, and
some of the test data are obtained in ways that might favor the sampling of (un)reliable items. Second, we discuss the estimation of the reliability of a system based on (1) system tests, where failures provide partial information about which components may have failed, and (2) specification tests, which measure whether components meet specifications that relate imperfectly to system success. Finally, we present an ambitious approach to integrating many sorts of component data into a system reliability analysis.
4.2.1. Example 1: Reliability of a component based on biased sampling

Our first example, which deals with reliability estimation for a single component, is discussed in Graves et al.^7 Of interest is the prevalence of a certain feature in an existing population of items. Some items have already been destructively tested and removed from the population. There is reason to believe that the probability that an item has the feature is related to the lot in which it was manufactured, but it is not obviously appropriate to assume that the feature is confined to a small number of lots. We handle this situation with a Bayesian hierarchical model, p_i ~ Beta(a, b), where p_i is the probability that an item in lot i was manufactured with the feature, and where a and b are given prior distributions, so that a test on an item in one lot is informative about the prevalence of the feature in the other lots, but more informative about its own lot. A further complication is that we are not willing to assume that the process by which items were selected for sampling was done so that items with and without the feature were equally likely to be sampled. (We do also have some truly random samples along with these "convenience samples.") Naive estimation is therefore in danger of systematically over- or under-estimating the prevalence. We use the extended hypergeometric distribution (see Graves et al.^7 and its references) to allow for the possibility of biased sampling; using this distribution for this purpose was new, so software did not exist for using it. Finally, the unknown quantities of most interest are the actual numbers of items with the feature in each of the lots (which were of known finite size), so the software must be able to sample posterior distributions of quantities which take on finitely many values. Before introducing the statistical models, we begin with some notation. The finite population consists of M lots. Let the ith lot size be denoted by N_i and the unknown number of systems in the ith lot with the attribute be denoted by K_i. The sample sizes and numbers of sampled features in
the convenience and random samples for the ith lot are denoted respectively by n_{ci}, y_{ci}, n_{ri} and y_{ri}. Because the convenience sample is assumed to be taken first, the ith lot size for the random sample is N_{ri} = N_i - n_{ci}, and K_{ri} = K_i - y_{ci} is the number of components with the attribute remaining in the ith lot after the convenience sample has been taken. We assume
K_i ~ Binomial(N_i, p_i)   (1)

and

p_i ~ Beta(a, b).   (2)
Now we consider statistical models for the data. For the convenience sample data, we want to account for the potential bias of sampling too many or too few components with the attribute. To do this, we use the extended-hypergeometric distribution for y_{ci}, which has probability function

P(y_{ci} = y) = \binom{K_i}{y} \binom{N_i - K_i}{n_{ci} - y} θ^y / Σ_u \binom{K_i}{u} \binom{N_i - K_i}{n_{ci} - u} θ^u

for y = max(0, n_{ci} - N_i + K_i), ..., min(n_{ci}, K_i), with the sum in the denominator taken over the same range of values. When the biasing parameter θ is equal to one, the extended-hypergeometric reduces to the hypergeometric, which arises from a completely random sample in which there is no biasing. When θ is greater than one, the sampling favors components with the attribute. The randomly sampled data y_{ri} are assumed to follow hypergeometric distributions.
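The probability function above is straightforward to compute directly; the following sketch (not part of the chapter) implements it as written, with the denominator summed over the same support.

```python
from math import comb

def ext_hypergeom_pmf(y, N, K, n, theta):
    """P(y_ci = y) for a convenience sample of size n from a lot of N items,
    K of which have the feature, with biasing parameter theta;
    theta = 1 recovers the ordinary hypergeometric."""
    lo, hi = max(0, n - (N - K)), min(n, K)
    if not lo <= y <= hi:
        return 0.0
    weights = {u: comb(K, u) * comb(N - K, n - u) * theta**u
               for u in range(lo, hi + 1)}
    return weights[y] / sum(weights.values())

# Biased (theta = 2) versus unbiased (theta = 1) sampling of a small lot.
print(ext_hypergeom_pmf(3, N=25, K=5, n=10, theta=2.0))
print(ext_hypergeom_pmf(3, N=25, K=5, n=10, theta=1.0))
```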
4.2.2. Example 2: System reliability based on partially informative tests

Another reliability problem involves synthesizing two different types of data, neither of which is standard for reliability analysis. First, the system test data provide complicated information; for notational clarity, consider a single system test. If the set of components in the system is denoted by C, there is a subset of components C_1 that we know to have worked, another subset of components C_0 that we know to have failed, and a third subset of components C_2, at least one of which must have failed. (The test provides no information at all about the success or failure of the remaining components.) The system is a series system. The likelihood function for this single
test, assuming that the system is of age t, is

∏_{i ∈ C_1} p_i(t) ∏_{i ∈ C_0} [1 - p_i(t)] [1 - ∏_{i ∈ C_2} p_i(t)],   (3)

where the first two products are defined to be one if empty, while the last is zero. Here p_i(t) is the probability of success of component i at age t; we used p_i(t) = Φ{(σ_i² + τ_i²)^{-1/2}(α_i + β_i t - ξ_i)}, where Φ is the Gaussian distribution function. One reason for this choice of p_i(t) (in particular, the seemingly redundant parameterization) is the other type of data we use; certain of the components were tested to assure that they met specifications, and these tests generate continuous data. If one assumes that the specification measurement S_i on component i satisfies S_i ~ N(α_i + β_i t, τ_i²), specification data can be incorporated naturally. Then if one assumes that, conditionally on its specification measurement S_i, the component would succeed in a system test with probability Φ{σ_i^{-1}(S_i - ξ_i)}, it follows that unconditionally, the component's success probability is Φ{(σ_i² + τ_i²)^{-1/2}(α_i + β_i t - ξ_i)}. More generally, suppose n_i specification measurements S_{i1}, ..., S_{in_i} apply to component i, that S_{ij} ~ N(α_{ij} + β_{ij} t, τ_{ij}²), and that the success probability given the specs is ∏_{j=1}^{n_i} Φ{σ_{ij}^{-1}(S_{ij} - ξ_{ij})}. Then the unconditional success probability, or the likelihood function for the system tests, is ∏_{j=1}^{n_i} Φ{(σ_{ij}² + τ_{ij}²)^{-1/2}(α_{ij} + β_{ij} t - ξ_{ij})}. We also generalize to multiple covariates; they need not be the same for different specs. A final complication, but one that is trivial to handle, is that the system is built in several different configurations; not all components are present in all configurations. This formulation enables us to use specification data to help make inferences on parameters relevant to system tests. It also requires special-purpose software.
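To make the structure of (3) concrete, here is a small sketch (not from the chapter) that evaluates the single-test likelihood for hypothetical component parameters, using the probit-type success probability given above.

```python
from math import erf, sqrt

def Phi(z):
    # standard normal distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_comp(t, alpha, beta, xi, sigma, tau):
    # success probability Phi{(sigma^2 + tau^2)^(-1/2) (alpha + beta*t - xi)}
    return Phi((alpha + beta * t - xi) / sqrt(sigma**2 + tau**2))

def system_test_likelihood(t, params, C1, C0, C2):
    """Likelihood (3) for one series-system test at age t.
    params maps component id -> (alpha, beta, xi, sigma, tau);
    C1 worked, C0 failed, C2 contains at least one failure."""
    lik = 1.0
    for i in C1:                       # empty product = 1
        lik *= p_comp(t, *params[i])
    for i in C0:
        lik *= 1.0 - p_comp(t, *params[i])
    if C2:                             # empty product defined as zero, so this factor is 1
        prod = 1.0
        for i in C2:
            prod *= p_comp(t, *params[i])
        lik *= 1.0 - prod
    return lik

# Hypothetical test: component 1 worked, component 2 failed,
# and at least one of components 3 and 4 failed.
params = {i: (2.0, -0.05, 0.0, 1.0, 0.5) for i in range(1, 5)}
print(system_test_likelihood(5.0, params, C1={1}, C0={2}, C2={3, 4}))
```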
4.2.3. Example 3: Integrated system reliability based on diverse data

A fundamental problem of system reliability is estimating the reliability of a system whose components are combined in series and parallel subsystems, and where data relevant to the component qualities take on general (not necessarily binomial) forms. (The case of binomial data is discussed in Johnson et al.^8) As a simple example, consider a three-component series system. Binomial data are available on Component 1 at various ages, and the success probability at age t satisfies log(p_1(t)/{1 - p_1(t)}) = α_0 - α_1 t.
The success probability for Component 2 is defined in terms of its lifetime, which is distributed Weibull; a lifetime η equates to a component success at time t if t < η, and data on Component 2 are a collection of possibly right-censored lifetimes. Component 3 is required to generate a desired amount of power T on demand; the distribution of power is lognormal, with a logged mean that decreases linearly in age. Data are (power, age) pairs. Finally, we have binomial system test data, where the success probability is the age-dependent success probability of Component 1, multiplied by the reliability of Component 2, multiplied by the age-dependent probability of sufficient power generation by Component 3. The full data likelihood contains terms for each of the four types of tests, and other information can be captured in prior distributions. It is necessary that the software analyze likelihood functions with each of these terms, and ideally it would support the integration of components into (sub)systems in arbitrary parallel/series combinations. As described above, the Component 1 data y_{1s} at time t_{1s} follow Binomial(n_{1s}, p_{1s}), where log(p_{1s}/(1 - p_{1s})) = α_0 - α_1 t_{1s}. The Component 2 data y_{2s} follow Weibull(λ, β) with scale λ and shape β. The Component 3 data y_{3s} at time t_{3s} follow Lognormal(μ_{3s}, σ²), where μ_{3s} = γ_0 + γ_1 t_{3s}. Finally, the system data y_s at time t_s follow Binomial(n_s, p_s), where p_s = R_1(t_s) R_2(t_s) R_3(t_s) with R_1(t_s) = exp(α_0 - α_1 t_s)/(1 + exp(α_0 - α_1 t_s)), R_2(t_s) = exp{-(t_s/λ)^β}, and R_3(t_s) = 1 - Φ{(log(T) - (γ_0 + γ_1 t_s))/σ}.
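The following sketch (with hypothetical parameter values, and using the survivor forms assumed above) shows how the three component reliabilities combine into the series-system success probability that enters the system-level binomial likelihood.

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical parameter values, chosen only for illustration.
a0, a1 = 2.0, 0.1            # Component 1: logit p1(t) = a0 - a1*t
lam, beta = 20.0, 1.5        # Component 2: Weibull survivor exp{-(t/lam)^beta}
g0, g1, sigma, T = 3.0, -0.05, 0.4, 10.0  # Component 3: lognormal power, threshold T

def R1(t):
    return math.exp(a0 - a1 * t) / (1.0 + math.exp(a0 - a1 * t))

def R2(t):
    return math.exp(-((t / lam) ** beta))

def R3(t):
    # P(power > T) when log(power) ~ N(g0 + g1*t, sigma^2)
    return 1.0 - Phi((math.log(T) - (g0 + g1 * t)) / sigma)

def system_reliability(t):
    return R1(t) * R2(t) * R3(t)

for t in (0, 5, 10, 20):
    print(t, round(system_reliability(t), 4))
```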
4.3. YADAS: a Statistical Modeling Environment

YADAS is a software environment for performing arbitrary Markov chain Monte Carlo (MCMC) computations, and as such it is very useful for defining and analyzing new, nonstandard statistical models. Its source code and documentation, several examples, and supporting technical reports are available for download at yadas.lanl.gov (Graves^{9,10}). Its software architecture makes it easy to define new terms in models and make small adjustments to existing models. MCMC algorithms often suffer from high autocorrelation, and YADAS provides an environment for exploring and fixing these problems. YADAS is written in Java and generally requires additional Java code to work a new problem, but work continues on alternative interfaces. We discuss all these issues in this section.
4.3.1. Expressing arbitrary models

Defining a model in YADAS is as simple as specifying how to calculate the unnormalized posterior distribution evaluated at an arbitrary parameter
value. This is an advantage of a Bayesian approach, as well as being one of the benefits of the design decision to emphasize the Metropolis-Hastings algorithm instead of Gibbs sampling as in WinBUGS (Spiegelhalter et al.^11). In the Gibbs sampler, each time the model is changed, the sampling algorithm must be changed accordingly. In YADAS, however, the model definition is decoupled from the algorithm definition. Provided the acceptance probabilities are generated correctly, the distribution of the samples will converge to the desired posterior distribution. Since the acceptance probabilities are determined by the unnormalized posterior density function, this happens automatically when the new model is defined. The definition of a model is a collection of objects called bonds. Each bond is a term in the posterior distribution. Bonds are defined in the software in such a way as to make it easy to make the sort of small changes to an analysis that are common in the model-building phase. Examples include changing a parameter from fixed to random, or changing a distributional form. In particular, analyzing the sensitivity of conclusions to the choice of prior is natural. The ease of defining new models was particularly evident in the analysis of the reliability of the component manufactured in lots and sampled nonrandomly. The analysts needed to write code to calculate the extended hypergeometric density function, but after this trivial exercise was complete, it could be plugged in as any other density would be. Without YADAS, time constraints would have forced the analysts to make use of a more convenient but less appropriate analysis. In our second example, the critical step was to compute (3) after first computing the p_i(t)'s. This was also straightforward, and the handling of specification data just required adding another bond (with the familiar normal linear model form) to the existing list. The third example, in which various forms of component test data are combined with system test data, is an excellent example of the usefulness of the YADAS model definition strategy. The component test data are as easy to include as in an application where they are the only data source. We make use of YADAS's general code that reads in a system structure and integrates component success probabilities in any series or parallel combination, in order to incorporate the system test data.
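The bond idea can be illustrated generically in a few lines of Python (this is only an illustration of the additive log-posterior structure, not YADAS's actual Java API): each term of the unnormalized log posterior is one entry in a list, so changing a prior or a likelihood term means editing a single entry.

```python
import math

def normal_loglik(data, mu, sigma):
    # likelihood "bond": iid normal observations
    return sum(-0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) for x in data)

def normal_logprior(value, mean, sd):
    # prior "bond": normal prior on a single quantity
    return -0.5 * ((value - mean) / sd) ** 2

data = [9.8, 10.4, 10.1, 9.6]

def log_posterior(mu, sigma):
    bonds = [
        normal_loglik(data, mu, sigma),               # data term
        normal_logprior(mu, 10.0, 5.0),               # prior on mu
        normal_logprior(math.log(sigma), 0.0, 1.0),   # prior on log sigma
    ]
    return sum(bonds)

print(log_posterior(10.0, 0.5))
```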
4.3.2. Special algorithms

While it is true that defining an MCMC algorithm for a new problem is as easy as specifying how to compute the unnormalized posterior distribution, it is also true that these first attempts at algorithms may fail to perform adequately. However, YADAS turns this to a strength by helping users to improve algorithms by adding steps to the existing algorithm; Metropolis-Hastings-based software is much better suited to this goal than Gibbs-based software. The most common MCMC performance problem is high posterior correlation among parameters; this generates high autocorrelation in consecutive MCMC samples, because parameters are reluctant to move individually. YADAS's typical approach is the "multiple parameter update": one proposes simultaneous moves to parameters in a direction of high variability. For example, in our second example (as happens in many generalized linear model examples), the intercept and slope parameters for some components were highly correlated, and the algorithm was improved with new steps that proposed the addition of a random amount to the intercept while simultaneously subtracting a multiple of the same amount from the slope. The naturalness of defining algorithms in YADAS was also exhibited in the biased sampling problem, where the numbers of items in each lot with the feature of interest needed to be sampled; this was handled with YADAS's general approach for sampling parameters that take on finitely many values.
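A minimal sketch of the multiple parameter update idea (again generic Python, not YADAS code): the proposal adds a random amount to one parameter while subtracting a multiple of the same amount from another, which lets the chain move along a direction of high posterior correlation; such a step would be added to, not substituted for, the usual single-parameter updates.

```python
import math, random

def multi_update(a, b, log_post, c=1.0, step=0.3):
    # Propose (a + delta, b - c*delta); the proposal is symmetric, so the
    # Metropolis acceptance ratio is just the posterior ratio.
    delta = random.gauss(0.0, step)
    a_new, b_new = a + delta, b - c * delta
    if math.log(random.random()) < log_post(a_new, b_new) - log_post(a, b):
        return a_new, b_new
    return a, b

# Toy target with strong negative correlation between the two parameters.
def log_post(a, b):
    return -0.5 * ((a + b) ** 2 / 0.01 + (a - b) ** 2 / 4.0)

random.seed(0)
state = (0.0, 0.0)
for _ in range(1000):
    state = multi_update(*state, log_post)
print(state)
```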
4.3.3. Interfaces, present and future

YADAS is written in Java, and that provides portability advantages beyond its encouragement of generality that helped YADAS become as ambitious as it is. However, few Bayesian statisticians use Java as their language of choice, so this limits its popularity. An area of active YADAS development is providing additional interfaces to its capabilities. One such interface is the interface with the R package (www.R-project.org), a very popular, free statistics computing environment that is very similar to S-Plus. This interface will facilitate the handling of both input to and output from MCMC algorithms, including examining output for adequate convergence to the limiting distribution and rapid mixing. One application is the use of genetic algorithms for experiment design; each candidate design selected by the genetic algorithm will generate data, which will then be analyzed using YADAS, and the analysis results will be
examined for "fitness" and fed back to determine the next genetic algorithm generation. System reliability is an application of particular interest. For a given budget, which system, subsystem or component data should be collected to reduce the uncertainty of system reliability the most? (See Hamada et al.^12 for an example involving a fault tree.) The R-YADAS interface is possible thanks to the SJava package of the omegahat project (www.omegahat.org). Another interface that will particularly help with reliability problems is the interface with a new graphical tool for eliciting and defining system structure and its relationship to data (Klamann and Koehler^13).

4.4. Examples Revisited
4.4.1. Example 1
We illustrate the problem with convenience and random samples from the stratified population with a simulated data set; the data from the real application are proprietary. The simulated data set features a total population of 5000 items in 230 lots (100 lots of size 10, 100 of size 25, and 30 of size 50). Total sample sizes were 100 for the convenience sample and 50 for the random sample, with 18 lots being sampled in both ways, 21 only randomly sampled, and 57 only sampled by convenience. A total of 513 components had the feature; individual lot feature proportions p_i were drawn from the Beta(1, 9) distribution, and the true value of θ was 2, representing mild biasing. In the convenience sample, 16 (16%) had the feature, while 6 features appeared in the random sample (12%). A naive estimate for the feature prevalence would then be 22/150 = 0.147. Ten percent of the unsampled items had the feature, so the feature was overrepresented in both samples. We used the following prior distributions: a/(a + b) ~ Beta(0.3, 1.7), a + b ~ Gamma(2, 5), and log θ ~ N(0, 1). Posterior medians for these quantities were then 0.13, 5.5, and (for θ) 1.32. The posterior median for p*, the fraction of unsampled items with the feature, was 0.13, different from the naive estimate and closer to the random sample fraction than the convenience sample fraction, as it should be. A 90% posterior interval for this quantity is (0.073, 0.209). Posterior density estimates for a/(a + b), a + b, θ, and p* are shown in Fig. 4.1.
4.4.2. Example 2
The system we studied in Example 2 had a total of 23 components. Thirteen of these had no related specification measurements, five had a single
Fig. 4.1. Plot of Example 1 parameter posteriors. Dotted curves represent priors. The four plots are for a/(a + b), a + b, θ, and p*.
specification, and five had between two and four. Roughly 1000 system tests were available, with a proprietary number of failures. The 20 specification measurements generated a total of roughly 2000 data points. We studied the effects of six covariates, and the posterior distribution involved 294 unknown parameters. Using YADAS to analyze these data required us to write code to handle the system test data in the form of successes, failures, and partial successes. We needed to calculate the likelihood for each system test, and all the covariates affected the success probabilities for each component in a novel way. Also, many combinations of the 294 unknown parameters were highly correlated due to shortcomings in the completeness of the data, and this led to poorly performing algorithms. However, YADAS made it possible to improve these algorithms using multiple parameter updates. Results are shown in Fig. 4.2; these are reliability posterior distributions for four different versions of the system. The axes are not shown because of the sensitivity of the data.
The system pass/fail data consisted of 15 tests at each of 0, 5, 10, 15 and 20 years. The Component 1 pass/fail data consisted of 25 tests a t each of 0, 2, 4, 6, 8, 10, 15 and 20 years. The Component 2 consisted of 25 lifetimes.
Fig. 4.2. Density estimates for four different versions of the system described in Example 2 and for one value of the covariates. The axes are omitted due to proprietary concerns.
Finally, the Component 3 data consisted of 10 destructive observations at each of 0, 2.5, 5, 7.5, 10, 15 and 20 years. Note that some of the system and component data are collected at different rates. Analyzing these data provides the system reliability median and 95% credible interval over time, as displayed in Fig. 4.3. The component model parameter posteriors are given in Figs. 4.4 and 4.5. Note that the posteriors plotted as dotted lines did not use the system data. Those posteriors plotted as solid lines, which use the system data, are tighter and therefore more informative. The system data provided little additional information for the Component 3 model parameters, as displayed in Fig. 4.5. Figure 4.6 displays the system reliability 95% credible interval over time, both when the system data are used and when they are not used. The solid lines display the results when the system data are used. Note that the solid lines are higher and closer together than the dotted lines.
In this paper we have attempted t o communicate some of the excitement of working on modern system reliability assessments. New methodology is important for dealing with such issues as nonrandom sampling and analyzing test results that utilize different levels of the system and generate different data distributions. Though we have not illustrated them in this
Fig. 4.3. Plot of Example 3 system reliability posterior median and 95% credible interval.
Fig. 4.4. Plot of Example 3 component models 1 and 2 parameter posteriors (with system data (solid line) and without system data (dotted line)).
Though we have not illustrated them in this paper, other forms of information such as expert judgment, more detailed engineering models, and simulation models also need to be integrated with traditional data, and there are opportunities for research into good methods for doing this.
Fig. 4.5. Plot of Example 3 component model 3 parameter posteriors (with system data (solid line) and without system data (dotted line)).
Fig. 4.6. Plot of Example 3 system reliability posterior 95% credible intervals (with system data (solid line) and without system data (dotted line)).
Appropriate new models require new computational methods, and an extensible modeling environment makes it practical to work with new models, even in a deadline-driven scientific environment. Finally, complex system models with automatic analysis software are amenable to exciting research on using genetic algorithms to guide resource allocation
and experimental design.
Acknowledgments

We thank Dee Won for her encouragement of this work.

References

1. P. V. Z. Cole, A Bayesian reliability assessment of complex systems for binomial sampling, IEEE Transactions on Reliability R-24, 114-117 (1975).
2. D. V. Mastran, Incorporating component and system test data into the same assessment: a Bayesian approach, Operations Research 24, 491-499 (1976).
3. D. V. Mastran and N. D. Singpurwalla, A Bayesian estimation of the reliability of coherent structures, Operations Research 26, 663-672 (1978).
4. B. Natvig and H. Eide, Bayesian estimation of system reliability, Scandinavian Journal of Statistics 14, 319-327 (1987).
5. H. F. Martz, R. A. Waller, and E. T. Fickas, Bayesian reliability analysis of series systems of binomial subsystems and components, Technometrics 30, 143-154 (1988).
6. H. F. Martz and R. A. Waller, Bayesian reliability analysis of complex series/parallel systems of binomial subsystems and components, Technometrics 32, 407-416 (1990).
7. T. Graves, M. Hamada, J. Booker, M. Decroix, C. Bowyer, K. Chilcoat, and S. K. Thompson, Estimating a proportion using stratified data arising from both convenience and random samples, Los Alamos National Laboratory Technical Report LA-UR-03-8396 (2004).
8. V. Johnson, T. Graves, M. Hamada, and C. S. Reese, A hierarchical model for estimating the reliability of complex systems (with discussion), in Bayesian Statistics 7, J. M. Bernardo, M. J. Bayarri, J. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West (Eds.), Oxford University Press, 199-213 (2003).
9. T. L. Graves, A framework for expressing and estimating arbitrary statistical models using Markov chain Monte Carlo, Los Alamos National Laboratory Technical Report LA-UR-03-5934 (2003a).
10. T. L. Graves, An introduction to YADAS, yadas.lanl.gov (2003b).
11. D. Spiegelhalter, A. Thomas, and N. Best, WinBUGS Version 1.3 User Manual (2000).
12. M. Hamada, H. F. Martz, C. S. Reese, T. Graves, V. Johnson, and A. G. Wilson, A fully Bayesian approach for combining multilevel failure information in fault tree quantification and optimal follow-on resource allocation, Reliability Engineering and System Safety 86, 297-305 (2004).
13. R. Klamann and A. Koehler, GROMIT: Graphical Modeling Tool for System Statistical Structure, Los Alamos National Laboratory, Los Alamos, NM (2004).
CHAPTER 5 DYNAMIC MODELING IN RELIABILITY AND SURVIVAL ANALYSIS
EDSEL A. PEÑA Department of Statistics University of South Carolina Columbia, SC 29208 USA E-mail:
[email protected]. edu
ELIZABETH H. SLATE Department of Biostatistics, Bioinformatics and Epidemiology Medical University of South Carolina Charleston, SC 29425 USA E-mail:
[email protected]
Dynamic models are important, realistic, and more appropriate than static models in many settings, notably reliability, engineering, survival analysis, and also in biomedical settings. General classes of dynamic models are described. Some probabilistic properties of these models are presented, parameter estimation methods are indicated, and their applicability is demonstrated through an illustrative data set.
5.1. Introduction

Consider a situation where a recurrent event of interest is being monitored over some observation period. This could occur, for example, in a reliability or engineering setting when a p-component coherent system (cf., Barlow and Proschan^1) with structure function φ is being followed and the events of interest are component failures and eventually system failure. Or, it could be in a biomedical study where of interest are successive occurrences of some recurrent event, such as, for example, hospitalization due to a chronic disease, onset of depression, occurrence of migraine headaches, tumor occurrence, etc. Additional examples of the recurrent event situation we study include
monitoring a married couple in a sociological study of divorce, where the events of interest are recurrences of major disagreements in the marriage; a drop of 200 points during a trading day in the Dow Jones Industrial Average in an economic context; and the commission of a terrorist act in the United States. In developing stochastic models for event occurrences in such systems, dynamic models become highly appropriate and more realistic. In such models the impact of actions or interventions, which are undertaken as the monitoring progresses, with such actions possibly dictated by the accrued history of the system, can be incorporated in the model. Dynamic models could also incorporate the possible effects of an increasing number of event occurrences and take into account the impact of possibly time-varying covariate processes. When dealing with coherent structures in reliability and engineering settings, component failures may have the effect of increasing the load on the remaining functioning components arising from the change in the effective structure function. This is illustrated by considering the bridge structure in the first picture in Fig. 5.1. In this picture, the initial load that each component is receiving is indicated. After the failures of component 2, and then component 4, the resulting loads of the remaining functioning components may change, as demonstrated by the second and third pictures in Fig. 5.1, because of the change in the effective structure function.

The usual static modeling approach, which is characterized by a specification of the failure density model at time s in response to the question posed at the time origin: "What is the probability that an event will occur in the infinitesimal time interval [s, s + ds)?," encounters difficulties in modeling such dynamic changes in the component loads, or the impact of performed interventions after each event occurrence. Dynamic models are best specified through failure intensities, and so are often stated in terms of hazard rate functions. The underlying dynamic modeling approach is conditional in that one asks the question: "Given the history of the system just before time s, what is the conditional probability that an event will occur in the infinitesimal time interval [s, s + ds)?" This leads to a specification of the hazard or failure intensity at time s, say α(s), with the interpretation that α(s) ds is approximately the probability of an event occurring in [s, s + ds), given the history up to time s. If α(s) is deterministic, then through the product-integral representation of the survivor function (cf., Andersen, Borgan, Gill and Keiding,^2 p. 57),
After Component 2 Has Failed: Series-Parallel
After Components 2 and 4 Have Failed: Series

Fig. 5.1. A bridge structure with structure function φ(x_1, x_2, x_3, x_4, x_5) = (x_1 x_4) ∨ (x_1 x_3 x_5) ∨ (x_2 x_5) ∨ (x_2 x_3 x_4), with the initial component loads indicated in the first picture, and the consequent component loads indicated in the second and third pictures after component 2, then component 4, fail.
the survivor function of the system life S becomes

P{S > s} = ∏_{v=0}^{s} [1 - α(v) dv].   (1)
A simple, but not rigorous, way of seeing the validity of this formula is that if the unit is to survive the period [0, s], so S > s, then for a partition 0 = s_0 < s_1 < s_2 < ... < s_M = s of [0, s], we have by the multiplication rule that P{S > s} = ∏_{j=1}^{M} P{S > s_j | S ≥ s_{j-1}}. But as max_{1≤j≤M}(s_j - s_{j-1}) becomes small, P{S > s_j | S ≥ s_{j-1}} ≈ 1 - α(s_{j-1})(s_j - s_{j-1}), so letting max_{1≤j≤M}(s_j - s_{j-1}) → 0 as M → ∞, formula (1) obtains. Because the approach is conditional, the resulting model is able to incorporate the impact of performed actions and interventions, as well as the situational and environmental changes, that dynamically occur during the monitoring of the system. A pictorial depiction of these static and dynamic modeling approaches is given in Fig. 5.2. Much work in the reliability and engineering settings addressing dynamic models has dealt with the modeling aspect and the determination of the stochastic and probabilistic properties of such models. There has been a dearth of work dealing with statistical inference issues for such dynamic models. In the survival analysis setting, where dynamic models are typically associated with biomedical studies and clinical trials, there has been work dealing with inference issues for such models, though for general dynamic models, inference procedures are still incomplete. In this paper, aside from discussing dynamic models and their properties, focus will also be on methods for making inference about the model parameters, with the inference to be based on data arising from the monitoring of a sample of study systems or units.

We outline the major portions of this paper. In Sec. 5.2 we present the mathematical setting that will facilitate the formal description of dynamic models and describe three specific dynamic models. In Sec. 5.3 we provide some properties of the dynamic models described in Sec. 5.2, and in Sec. 5.4 we indicate inference issues regarding dynamic model parameters. As the primary intent of this paper is to call attention to and to encourage the study and use of dynamic models, mathematical technicalities will be confined to a minimum. Instead we refer the reader for technical details to the relevant papers. The applicability of the models will be illustrated in Sec. 5.5.
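As a quick numerical check (not part of the chapter), the sketch below approximates the product-integral in (1) on a fine partition for an assumed deterministic hazard α(s) and compares the result with the exact answer exp{-∫_0^s α(v) dv}.

```python
import math

def alpha(s):
    # assumed hazard, increasing in s
    return 0.2 + 0.05 * s

def survivor_product(s, M=100000):
    # P{S > s} approximated by the finite product over a partition of [0, s]
    ds = s / M
    prob = 1.0
    for j in range(M):
        prob *= 1.0 - alpha(j * ds) * ds
    return prob

s = 4.0
exact = math.exp(-(0.2 * s + 0.025 * s * s))
print(survivor_product(s), exact)
```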
Fig. 5.2. Pictorial depiction of the static (top) and dynamic (bottom) modeling approaches.
5.2. Dynamic Models
To formally describe the dynamic models of interest, given a system or unit under study, let I = [0, τ] be the monitoring period, with τ possibly random. Define the processes {(N^t(s), Y^t(s)) : s ∈ I} according to N^t(s) = number of events that occurred in [0, s], and Y^t(s) = I{the system is under observation at s}, where I{A} is the indicator function of event A. (We remark that the use of the superscript t simplifies notation since no new letters are needed when we convert to doubly-indexed processes later.) We also define on an appropriate probability space (Ω, F, P) a filtration {F_s : 0 ≤ s ≤ τ} such that F_s represents the σ-field containing all
information about the system that has accrued over the time period [0, s] (cf., Sec. II.2 in Andersen et al.^2). In particular, N^t and Y^t are adapted to this filtration. A dynamic model is specified by providing the intensity process of N^t, defined for every s ∈ I via

α(s) = lim_{h↓0} (1/h) P{N^t((s+h)^-) - N^t(s^-) ≥ 1 | F_{s^-}}.   (2)
We describe two general classes of dynamic models. The first one, which is relevant for coherent systems in reliability and engineering, was introduced in Hollander and Peña;^3 while the second one, which perhaps is more relevant in biomedical and public health settings, was introduced in Peña and
Hollander.

5.2.1. A dynamic reliability model

Consider a coherent system with p components and structure function φ. Let Z_p = {1, 2, ..., p}, and denote by P the power set of Z_p. Let K_φ ⊆ P be the collection of minimal cut sets of φ. A set J ∈ P is defined to be φ-absorbing if there exists a K ∈ K_φ with K ⊆ J. Let Q_φ be the collection of φ-absorbing sets of φ, and let Q̄_φ = P \ Q_φ be the collection of φ-nonabsorbing sets of φ. We now describe the first dynamic model. Let λ_0(·) be a hazard rate function, and for each J ∈ Q̄_φ, let {α_i[J], i ∈ J^c} be a set of nonnegative real numbers. For each s ≥ 0, denote by F(s) the set of component indices that are nonfunctioning at time s-. With Y^t(s) = I{τ ≥ s}, the intensity process of the dynamic model is specified according to
α(s) = Y^t(s) [ Σ_{J ∈ Q̄_φ} I{F(s) = J} Σ_{j ∈ J^c} α_j[J] ] λ_0(s).   (3)
An important property that this intensity specification induces is that the process {M^t(s) = N^t(s) - ∫_0^s α(v) dv : s ≥ 0} becomes a zero-mean martingale, which means that E{M^t(s + t) | F_s} = M^t(s) for every s, t ≥ 0 (cf., Sec. II.3 of Andersen et al.^2). This allows for the representation

N^t(s) = ∫_0^s α(v) dv + M^t(s),   s ≥ 0,

where the first term on the right-hand side could be interpreted as the 'signal' component, while the martingale term is the 'noise' component. Through this martingale property, important results about martingales
such as Kurtz's Lemma and Rebolledo's martingale central limit theorem could be employed to examine properties of resulting inference procedures. The model in (3) is a special case of the general model introduced in Hollander and Peña.^3 If the system is a p-component parallel system, so that Q_φ = {Z_p} and, for any state vector (y_1, y_2, ..., y_p) ∈ {0,1}^p, φ(y_1, y_2, ..., y_p) = ∨_{j=1}^{p} y_j, and if α_j[J] = γ_{|J|}, where |J| is the cardinality of the set J and {γ_j = γ[j] : j = 0, 1, ..., p-1} are nonnegative reals with γ_0 = 1, then (3) simplifies to

α(s) = Y^t(s) [p - N^t(s-)] γ_{N^t(s-)} λ_0(s),   (4)

which is the equal load-sharing model for a parallel system considered in Kvam and Peña.^5
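A short simulation sketch (not from the chapter) may help fix ideas: under (4) with a constant baseline hazard λ_0, assumed here purely for simplicity, the waiting time between the kth and (k+1)th component failures is exponential with rate (p − k)γ_k λ_0.

```python
import random

def simulate_parallel_system(p, gamma, lambda0, rng=random):
    # gamma[k] is the load-share multiplier once k components have failed (gamma[0] = 1).
    times, t = [], 0.0
    for k in range(p):
        rate = (p - k) * gamma[k] * lambda0   # total intensity from (4)
        t += rng.expovariate(rate)
        times.append(t)                       # time of the (k+1)th failure
    return times                              # the last entry is the system life

random.seed(1)
print(simulate_parallel_system(p=5, gamma=[1.0, 1.25, 1.5, 2.0, 3.0], lambda0=0.1))
```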
5.2.2. A dynamic recurrent event model

Next we describe a general dynamic model for recurrent events which takes into account the impact of performed interventions after each event occurrence, the effects of accumulating event occurrences and relevant covariate processes, and the effect of an unobserved latent variable, called a frailty, which induces association among the inter-event times. This model was proposed in Peña and Hollander,^4 and further studied in Peña, Slate and González.^6 To specify the intensity process for this model, we suppose that a vector of predictable covariate processes {X(s) : s ≥ 0} is observed, and we also require an observable and predictable effective age process {E(s) : s ≥ 0}, which is possibly specified dynamically. For the notion of predictability, see for instance Sec. II.3 of Andersen et al.^2 Conditional on the frailty variable Z, which is assumed to follow a distribution H(·; ξ), where ξ is an unknown parameter vector, the intensity process is given by

α(s | Z) = Z Y^t(s) λ_0[E(s)] ρ[N^t(s-); α] ψ(β'X(s)),   (5)

where λ_0(·) is a hazard rate function, ρ(·; α) is a nonnegative function with ρ(0; α) = 1, and ψ(·) is a nonnegative link function. In this model, the effective age process encodes the impact of performed interventions after each event occurrence. To illustrate the importance of this notion, Fig. 5.3 depicts the evolution of this effective age process as events are occurring and based on certain interventions that are performed after each event occurrence. If minimal repair or intervention is performed after each event occurrence, this effective age process takes the form E(s) = s, which is exemplified in Fig. 5.3 at the first event, whereas if perfect repair or intervention is performed after each event occurrence, then this is the backward
Fig. 5.3. Illustration of a possible effective age process.
recurrence time given by E(s) = s - S_{N^t(s-)}, where 0 ≤ S_0 < S_1 < S_2 < ... are the successive calendar times of event occurrences, with this situation demonstrated in Fig. 5.3 at the second event. Many other forms of E(·) are possible, such as that induced by the minimal repair model of Brown and Proschan^7 and Block, Borges, and Savits.^8 The effect of accumulating event occurrences is contained in the ρ(·; α) function, and a simple form for this is ρ(k; α) = α^k; whereas the covariate effect is encoded in the link function ψ(·), which is usually taken to be ψ(v) = exp(v). The frailty distribution H(·; ξ) could have many forms, but in many cases it is assumed to be a gamma distribution with mean 1 and variance 1/ξ. The baseline hazard rate function λ_0(·) could either be parametrically specified, or could be nonparametrically specified. The latter may be more appropriate in biomedical and public health applications, whereas the former is more appropriate in reliability and engineering applications. As discussed in Peña and Hollander^4 and in Peña et al.,^6 the class of models specified by (5) includes as special cases many existing models currently in use in reliability and survival analysis.
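The sketch below (illustrative only; all choices are hypothetical) evaluates the effective age and the intensity (5) for one unit under the two repair conventions just described, taking ρ(k; α) = α^k, ψ(v) = exp(v), a single covariate, the frailty Z fixed at 1, a simple increasing baseline hazard, and the at-risk indicator Y^t(s) equal to 1.

```python
import math

def effective_age(s, event_times, repair="perfect"):
    if repair == "minimal":
        return s                                   # E(s) = s
    last = max([u for u in event_times if u < s], default=0.0)
    return s - last                                # backward recurrence time

def intensity(s, event_times, x_s, alpha=0.9, beta=-0.5, z=1.0,
              lam0=lambda e: 0.2 * e, repair="perfect"):
    k = sum(1 for u in event_times if u < s)       # N^t(s-)
    return z * lam0(effective_age(s, event_times, repair)) * (alpha ** k) * math.exp(beta * x_s)

events = [1.2, 2.7, 4.1]
print(intensity(5.0, events, x_s=1.0, repair="perfect"),
      intensity(5.0, events, x_s=1.0, repair="minimal"))
```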
5.3. Some Probabilistic Properties

In contrast to statically-specified models, probabilistic properties of dynamically-specified models are harder to obtain due to the changing intensities as time evolves, and this difficulty could be a potential impediment to their use and application. Nevertheless, in some special cases, concrete results are possible. To provide a flavor for such results, consider a p-component parallel system with intensity process specified in (3). We present the distribution of the time to the kth event as obtained in Hollander and Peña.^3 To state this result, we need to introduce notation. For a collection {a_i : i ∈ C} of distinct real numbers, define

ρ_i(a_j; C) = ∏_{j ∈ C, j ≠ i} a_j / (a_j - a_i).

In the notation ρ_i(a_j; C), C denotes the set of possible values of the index j. We also utilize the notation C_k = {0, 1, 2, ..., k} for k = 0, 1, 2, .... Following earlier notation, let S_k be the time of the kth event for this system, which corresponds to the kth component failure. We now state a theorem that provides the survivor functions of the event times.
Theorem 5.1: For a p-component parallel system following a dynamic model with intensity process in (3), if the collection {a_·[J] : J ⊂ Z_p}, where a_·[J] = Σ_{j ∈ J^c} α_j[J], is such that |J| = k implies a_·[J] = a_k (k = 0, 1, ..., p), with a_k ≠ a_l whenever k ≠ l, then for k = 1, 2, ..., p and with Λ_0(s) = ∫_0^s λ_0(t) dt,

P{S_k > s} = Σ_{i=0}^{k-1} ρ_i(a_j; C_{k-1}) exp{-a_i Λ_0(s)}.   (6)
From this distributional result, certain characteristics of the event times S_k, such as their means, variances, and medians, may be obtained. A more general result, for which the result in Theorem 5.1 is a special case, was proved in Hollander and Peña^3 using an inductive proof, with this proof relying extensively on the interesting identity in Lemma 5.1. Because this identity can be established in many ways (e.g., induction, Lagrange polynomials, etc.), we leave it for the reader's enjoyment to prove it! Of course, with Theorem 5.1, by setting s = 0 in (6), we obtain the identity; however, this proof is not allowed as we used this identity to prove (6)!
Lemma 5.1: Let {ζ_0, ζ_1, ..., ζ_{M-1}} be any set of M distinct, nonzero real numbers [e.g., ζ_0 = 1; ζ_1 = c (speed of light); ζ_2 = Avogadro's number; ...]. Then
Σ_{i=0}^{M-1} ∏_{j=0, j≠i}^{M-1} ζ_j / (ζ_j - ζ_i) = 1.
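Readers who prefer a numerical sanity check to a proof can verify the identity directly; the snippet below (not from the chapter) evaluates the left-hand side for an arbitrary set of distinct nonzero reals, including the lemma's playful choices.

```python
def identity_sum(zeta):
    total = 0.0
    for i, zi in enumerate(zeta):
        prod = 1.0
        for j, zj in enumerate(zeta):
            if j != i:
                prod *= zj / (zj - zi)
        total += prod
    return total

# zeta_0 = 1, zeta_1 = speed of light, zeta_2 = Avogadro's number, plus one more.
print(identity_sum([1.0, 299792458.0, 6.022e23, -3.5]))   # prints 1.0 up to rounding
```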
5.4. Inference Methods
Suppose that a sample of n systems, units, or subjects governed by the dynamic models described earlier are monitored, with the ith system followed over the period [0, τ_i]. It then becomes of interest to make inference about the unknown model parameters. Having estimates of these model parameters will enable us to perform predictions and aid in making practical decisions such as, for example, performing preventive maintenance or doing some interventions. Inference methods for these dynamic models become more elaborate and complicated, especially if the baseline hazard rate function λ_0(·) is nonparametrically specified. The major tools that enable inference are the construction of the likelihood function via Jacod's^9 formula, which utilizes product integration (see also Andersen et al.^2); the fact that with respect to calendar time there is a martingale structure to {M^t(s) = N^t(s) - ∫_0^s α(v) dv : s ≥ 0}, which allows powerful results such as the martingale central limit theorem to be employed; and, to deal with the computational complexity for some models, the use of the expectation-maximization (EM) algorithm of Dempster, Laird and Rubin.^14

5.4.1. Dynamic load-sharing model
In Kvam and Peña^5 the estimation of the load-share parameters {γ_j : j = 0, 1, ..., p-1}, the baseline hazard Λ_0(t) = ∫_0^t λ_0(v) dv, and the associated baseline survivor function S̄_0(t) = F̄_0(t) = ∏_{v=0}^{t} [1 - Λ_0(dv)] was developed for the load-sharing model in (4), which pertains to a parallel system whose structure function is φ(x_1, x_2, ..., x_p) = max(x_1, x_2, ..., x_p). We summarize some of the results in that paper and refer the reader to it for details of the derivations and the proofs. With the alternate notation γ[j] = γ_j, j = 1, 2, ..., p, the estimators γ̂_j of the load-share parameters γ_j are obtained by maximizing with respect to γ the profile likelihood (7),
where ΔN_i^t(w) = N_i^t(w) - N_i^t(w-). This profile likelihood is obtained by first assuming that the load-share parameter γ is known, so that a nonparametric estimator of Λ_0(·) can be obtained. This estimator, which depends on γ, is plugged into the full likelihood, which yields (7). As this profile likelihood is a rather complicated function of γ, numerical methods are required to obtain the maximizing value of γ. For example, in the implementation in Kvam and Peña,^5 a Newton-Raphson algorithm was utilized to obtain the maximizers. As derived in Kvam and Peña,^5 a Nelson-Aalen type estimator of Λ_0(s) is given by (8).
To obtain an estimator of the associated reliability or survivor function F̄_0(·), one invokes the product-integral representation in (1) to obtain the estimator

Ŝ_0(t) = ∏_{v=0}^{t} [1 - Λ̂_0(dv)].   (9)
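For readers unfamiliar with this construction, the following sketch shows the textbook one-sample Nelson-Aalen / product-limit recursion on a toy data set; it is only a generic illustration of the idea behind (8) and (9), not the load-share-weighted estimator derived in Kvam and Peña.^5

```python
def nelson_aalen(times, events):
    """times: increasing observation times; events: 1 = failure, 0 = censored."""
    at_risk = len(times)
    cum_haz, surv, out = 0.0, 1.0, []
    for t, d in zip(times, events):
        if d:
            increment = 1.0 / at_risk       # Nelson-Aalen jump
            cum_haz += increment
            surv *= 1.0 - increment         # product-limit step, as in (9)
        out.append((t, cum_haz, surv))
        at_risk -= 1
    return out

data = [(2.0, 1), (3.5, 0), (4.1, 1), (5.0, 1), (7.2, 0)]
for row in nelson_aalen([t for t, _ in data], [d for _, d in data]):
    print(row)
```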
Asymptotic properties of the estimators were also obtained in Kvam and Peña.^5 In particular, the weak consistency and asymptotic normality (weak convergence to Gaussian processes for the estimators of Λ_0 and F̄_0) of the estimators, under certain conditions and as n → ∞, were established.

5.4.2. Dynamic recurrent event model
For the general recurrent event model specified in (5), estimation procedures for the model parameters, which are ξ in the frailty distribution, α in the ρ(k; α) component, β in the link function, and λ_0 and S̄_0, were developed in Peña et al.^6 We briefly describe the main ingredients of the estimation procedure for this general class of models as given in the aforementioned report. The starting point is to first consider the model without frailties. By utilizing an idea by Gill,^10 exploited by Sellke^11 and later by Peña, Strawderman, and Hollander,^12 define the doubly-indexed process given by Z_i(s, t) = I{E_i(s) ≤ t}, where E_i(s) is the effective age process for the ith unit. One then defines two doubly-indexed processes given by

N_i(s, t) = ∫_0^s I{E_i(v) ≤ t} N_i^t(dv)   and   A_i(s, t) = ∫_0^s I{E_i(v) ≤ t} A_i^t(dv),
where A_i^t(s) = ∫_0^s Y_i^t(v) ρ[N_i^t(v-); α] ψ{β'X_i(v)} λ_0[E_i(v)] dv. Interpretatively, for the ith unit, the process N_i(s, t) denotes the number of events that had occurred during the period [0, s] whose effective age at event occurrence is at most t, while A_i(s, t) represents the cumulative intensity over [0, s] for regions in which the effective age is at most t. From the point of view of inference, two cases are of interest: first, when λ_0(·) is parametrically specified, and second, when it is nonparametrically specified. In this paper, we focus on the latter case, which is the more relevant model when dealing with biomedical settings. The former case is dealt with in Stocker and Peña.^13 When dealing with this latter setting, we note in passing that the difficulty encountered in estimation is that the data accrual places an argument of E_i(v) in the baseline hazard rate λ_0(·), but of interest to us is λ_0(t) itself. For some notational reduction, define

E_{ij-1}(v) = E_i(v) 1_{(S_{i,j-1}, S_{ij}]}(v) 1{Y_i^t(v) > 0};

φ_{ij-1}(w | α, β) = ρ(j-1; α) ψ{β'X_i[E_{ij-1}^{-1}(w)]} / E'_{ij-1}[E_{ij-1}^{-1}(w)].
With this notation at hand, it turns out that we can define a generalized at-risk process given by
Y_i(s, w | α, β) = Σ_{j=1}^{N_i^t(s-)} 1_{(E_{ij-1}(S_{i,j-1}), E_{ij-1}(S_{ij})]}(w) φ_{ij-1}(w | α, β)
   + 1_{(E_{iN_i^t(s-)}(S_{iN_i^t(s-)}), E_{iN_i^t(s-)}(min(s, τ_i))]}(w) φ_{iN_i^t(s-)}(w | α, β).
The importance of this expression is that it leads to a more useful alternative expression for A_i(s, t), given by A_i(s, t) = ∫_0^t Y_i(s, w | α, β) Λ_0(dw). By defining S_0(s, t | α, β) = Σ_{i=1}^n Y_i(s, t | α, β), through a method-of-moments idea, we obtain an 'estimator' of Λ_0 given by

Λ̂_0(s, t | α, β) = ∫_0^t [S_0(s, w | α, β)]^{-1} Σ_{i=1}^n N_i(s, dw).   (10)

However, since (α, β) are unknown, the above expression is not yet an estimator. We need to obtain estimators for (α, β), which will then be substituted in the above expression to finally get an estimator of Λ_0. The expression in (10), when plugged into the full likelihood, yields a profile likelihood for (α, β). This profile likelihood is given by (11),
where we note that 0 = S_{i0} < S_{i1} < S_{i2} < ... are the successive times of event occurrences for the ith unit. The maximizers α̂ and β̂ of this profile likelihood provide estimators of α and β, respectively. Again, an iterative procedure, such as the Newton-Raphson procedure implemented in Peña et al.,^6 is needed to obtain these maximizers. When these α̂ and β̂ are plugged into Λ̂_0(s*, t | α, β), we obtain our estimator of Λ_0(t), and by the product-integral representation, we are able to obtain an estimator of F̄_0(t). This product-limit type estimator of F̄_0(t) is given by

F̂_0(t) = ∏_{w=0}^{t} [1 - Λ̂_0(s*, dw | α̂, β̂)].   (12)

For the general model with frailties, recall that the intensity process is λ_i(s | Z_i, X_i) = Z_i Y_i^t(s) λ_0[E_i(s)] ρ[N_i^t(s-); α] ψ(β'X_i(s)), where the Z_i are IID gamma random variables with mean 1 and variance ξ^{-1}. The model parameters that need to be estimated are (ξ, α, β, λ_0). In Peña et al.^6 we used the EM algorithm of Dempster et al.,^14 first implemented by Nielsen, Gill, Andersen and Sørensen^15 in the gamma frailty model. To implement this algorithm in this general recurrent event model, we also consider the frailty variables Z_i to be missing. We refer to Peña et al.^6 for the details of the implementation of this EM algorithm. The major steps in the implementation of this algorithm are as follows:
Step 0: (Initialization) Provide seed values for (ξ, α, β). For the nonparametric parameter, start with the no-frailty estimator of Λ_0.
Step 1: (E-step) Compute Ẑ_i = E(Z_i | data, ξ̂, α̂, β̂, Λ̂_0), expressions for which are given in Peña et al.^6
Step 2: (M-step 1) Obtain a new estimate of Λ_0(·). The form of this is analogous to the no-frailty case except that every occurrence of Y_i(s, t | α̂, β̂) is multiplied by its corresponding Ẑ_i.
Step 3: (M-step 2) Obtain new estimates of α and β. The estimating equations are analogous to the no-frailty situation except for the change noted in the preceding step.
Step 4: (M-step 3) Obtain a new estimate of ξ by maximizing the marginal likelihood with respect to ξ.
Step 5: Check for convergence.
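A minimal sketch of the E-step (assuming the standard gamma-Poisson-process conjugacy; the chapter's exact expressions are in Peña et al.^6): with a gamma frailty of mean 1 and variance 1/ξ, and conditional on the current fit, Z_i | data ~ Gamma(ξ + N_i, ξ + A_i), where N_i is unit i's event count and A_i its fitted cumulative intensity, so the imputed frailty is the posterior mean computed below.

```python
def e_step(xi, counts, cum_intensities):
    # E(Z_i | data) = (xi + N_i) / (xi + A_i) under the gamma frailty model
    return [(xi + n) / (xi + a) for n, a in zip(counts, cum_intensities)]

# Toy numbers: three units with 0, 2 and 5 events and fitted cumulative
# intensities 1.1, 1.8 and 3.9, with xi = 2 (frailty variance 0.5).
print(e_step(2.0, [0, 2, 5], [1.1, 1.8, 3.9]))
```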
In Peña et al.^6 we have successfully implemented this algorithm and developed an R package of our implementation. We have also examined small- to moderate-sample size properties of the estimators through computer simulation. Research is still ongoing to ascertain asymptotic properties of the
estimators, such as consistency and weak convergence properties.
5.5. An Application

The applicability of these dynamic models is still in its infancy. However, by virtue of the fact that they are more appropriate models for real situations, their practical promise is appealing. We provide a simple illustration of the use of the general recurrent event model as applied to the bladder cancer data that have been used in Wei, Lin, and Weissfeld.16 The data set is depicted pictorially in Fig. 5.4.
Fig. 5.4. Pictorial representation of the bladder cancer data
For each of the 85 subjects, the picture shows the times of bladder cancer recurrence, marked by dots. The subjects are sorted by treatment assignment (denoted as rx in the table below) so that the first 47 subjects received placebo and the remaining 38 received thiotepa. Within treatment group, the subjects are sorted according to the number of cancer recurrences. There were two other covariates which are not shown in this picture. These were the size of the largest initial tumor (called size) and the number of initial
tumors (called number). The green x marks represent ends of observation periods. We fitted the general model in (5) with ρ(k; α) = α^k to this data set. As it is still a limitation of this class of models that the effective age is not yet monitored, and this may indeed be an obstacle to the utility of this class of models in biomedical settings, we considered two situations: that a perfect repair or intervention is always performed, so that E_i(s) = s − S_{i,N_i†(s−)} for each i = 1, 2, ..., n; and that a minimal repair or intervention is always performed, so that E_i(s) = s for each i = 1, 2, ..., n. We present in Table 5.1 the estimates of the parameters together with estimates of their standard errors (in parentheses); the standard errors for the minimal repair model were obtained via a jackknifing procedure.

Table 5.1. Parameter estimates for bladder cancer data under different models.
Parameter     AG (Cox)      WLW Marginal   PWP Conditional   General: Perfect   General: Minimal
α                                                            .98 (.07)          .79 (.13)
Frailty                                                                         .97
rx            -.47 (.20)    -.58 (.20)     -.33 (.21)        -.32 (.21)         -.57 (.36)
Size          -.04 (.07)    -.05 (.07)     -.01 (.07)        -.02 (.07)         -.03 (.10)
Number         .18 (.05)     .21 (.05)      .12 (.05)         .14 (.05)          .22 (.10)
These estimates were obtained according to the estimation procedure described in Sec. 5.4.2. The table also reports the parameter estimates arising from the marginal method of Wei et al.,16 the conditional method of Prentice, Williams and Peterson,17 and the Andersen and Gill18 method, which are considered the current methods for analyzing recurrent event data; these results were presented in Therneau and Grambsch.19 Examining this table, we note the apparent importance of the effective age process in reconciling the differences among the estimates obtained from the currently used methods. When the effective age process corresponds to perfect repair, the resulting estimates from the general model are quite close to those obtained from the Prentice et al.17 conditional method; whereas when a minimal repair model is assumed, the general model estimates are close to those arising from the marginal method of Wei et al.16 This seems to indicate that the differences among these methods of modeling are borne out mostly by the type of effective age process that is being assumed. Thus, the notion of an effective age appears to be an important one in modeling recurrent event data, and
it behooves us to try to monitor and assess this process in real applications, even in biomedical settings. This seems to call for a paradigm shift in the gathering of recurrent event data.
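As a small illustration of the two effective-age specifications considered above, the following Python helper (ours, not part of the authors' software) evaluates E_i(s) for one unit's event history under the perfect-repair and minimal-repair assumptions.

```python
# Illustrative only: the two effective-age specifications used above for unit i,
# given that unit's ordered event times S_i1 < S_i2 < ...
import bisect

def effective_age(s, event_times, kind="perfect"):
    """Effective age E_i(s) at calendar time s.

    kind="perfect": E_i(s) = s - S_{i,N(s-)}  (the clock resets at each event)
    kind="minimal": E_i(s) = s                (interventions do not alter the age)
    """
    if kind == "minimal":
        return s
    # number of events strictly before s, and the time of the last one (0 if none)
    k = bisect.bisect_left(event_times, s)
    last = event_times[k - 1] if k > 0 else 0.0
    return s - last

events = [3.0, 9.0, 21.0]                       # hypothetical recurrence times (months)
print(effective_age(10.0, events, "perfect"))   # 1.0: one month since the event at 9
print(effective_age(10.0, events, "minimal"))   # 10.0
```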
Acknowledgments
E. Peña acknowledges NSF Grant DMS 0102870, NIH Grant GM056182, and NIH COBRE Grant RR17698; E. Slate acknowledges NIH Grant CA077789, NIH COBRE Grant RR17696, and DAMD Grant 17-02-1-0138. Both also thank the USC/MUSC Collaborative Research Program and the MMR 2004 Organizers, Dr. S. Keller-McNulty and Dr. A. Wilson, for inviting them to give talks at the conference and to contribute to the Conference Proceedings. The authors also thank a reviewer whose comments led to improvements in this paper.

References
1. R. Barlow and F. Proschan, Statistical Theory of Reliability and Life Testing: Probability Models. To Begin With, Silver Spring, MD (1981).
2. P. Andersen, O. Borgan, R. Gill, and N. Keiding, Statistical Models Based on Counting Processes. Springer-Verlag, New York (1993).
3. M. Hollander and E. Peña, Dynamic reliability models with conditional proportional hazards, Lifetime Data Analysis 1, 377-401 (1995).
4. E. Peña and M. Hollander, Mathematical Reliability: An Expository Perspective (eds., R. Soyer, T. Mazzuchi and N. Singpurwalla), Chap. 6, Models for Recurrent Events in Reliability and Survival Analysis, pp. 105-123. Kluwer Academic Publishers (2004).
5. P. Kvam and E. Peña, Estimating load-sharing properties in a dynamic reliability system, Journal of the American Statistical Association 100(469), 262-272 (March 2005).
6. E. Peña, E. Slate, and J. Gonzalez, Semiparametric inference for a general class of models for recurrent events. Technical Report 214, Department of Statistics, University of South Carolina (2003). (Currently under consideration for publication.)
7. M. Brown and F. Proschan, Imperfect repair, J. Appl. Prob. 20, 851-859 (1983).
8. H. Block, W. Borges, and T. Savits, Age-dependent minimal repair, J. Appl. Prob. 22, 51-57 (1985).
9. J. Jacod, Multivariate point processes: predictable projection, Radon-Nikodym derivatives, representation of martingales, Z. Wahrsch. verw. Geb. 31, 235-253 (1975).
10. R. D. Gill, Testing with replacement and the product-limit estimator, The Annals of Statistics 9, 853-860 (1981).
11. T. Sellke, Weak convergence of the Aalen estimator for a censored renewal
process, in Statistical Decision Theory and Related Topics IV (eds., S. Gupta and J. Berger) 2, 183-194 (1988).
12. E. Peña, R. Strawderman, and M. Hollander, Nonparametric estimation with recurrent event data, J. Amer. Statist. Assoc. 96(456), 1299-1315 (December 2001).
13. R. Stocker and E. Peña, A general class of parametric models for recurrent event data. Submitted for publication (2005).
14. A. Dempster, N. Laird, and D. Rubin, Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion), J. Roy. Statist. Soc. B 39, 1-38 (1977).
15. G. Nielsen, R. Gill, P. Andersen, and T. Sorensen, A counting process approach to maximum likelihood estimation in frailty models, Scand. J. Statist. 19, 25-43 (1992).
16. L. Wei, D. Lin, and L. Weissfeld, Regression analysis of multivariate incomplete failure time data by modeling marginal distributions, J. Amer. Statist. Assoc. 84, 1065-1073 (1989).
17. R. Prentice, B. Williams, and A. Peterson, On the regression analysis of multivariate failure time data, Biometrika 68, 373-379 (1981).
18. P. Andersen and R. Gill, Cox's regression model for counting processes: a large sample study, Annals of Statistics 10, 1100-1120 (1982).
19. T. Therneau and P. Grambsch, Modeling Survival Data: Extending the Cox Model. Springer, New York (2000).
CHAPTER 6 END OF LIFE ANALYSIS
H. P. WYNN
Department of Statistics, London School of Economics, Houghton Street, London WC2A 2AE, UK
E-mail: [email protected]

T. FIGARELLA
EURANDOM, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
E-mail: [email protected]

A. DI BUCCHIANICO
Department of Mathematics, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
E-mail: [email protected]

M. H. JANSEN
Department of Mathematics, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
E-mail: mjansen@win.tue.nl

W. P. BERGSMA
EURANDOM, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
E-mail: [email protected]
Under the new European WEEE directive there will be strict limits for disposal of electrical goods to landfill. The area of analysis which takes into account the full life cycle of products and components is called "end of life" analysis. The idea is that the economic life of components is longer than the life of the initial use. To use this inherent value and to meet such directives, a radically new approach to recycling is needed. Modules, components or materials may be reused. This leads to a complex feedback to earlier stages of the supply chain. End of life analysis should predict the life of components with objective testing for reuse. It can be considered as an extension of predictive maintenance and signature analysis in reliability. The ideal is to have cheap and fast tests on the basis of which decisions about reuse, etc. can be made. The implications for design are considerable.
6.1. The Urgency of WEEE

The EC Directive on Waste Electrical and Electronic Equipment (WEEE) was agreed on 13 February 2003 and, with amendments, is likely to come into effect. It is driven by "the need for promoting waste recovery with a view to reducing the quantity of waste for disposal and saving natural resources, in particular by reuse, recycling, composting and recovering energy from waste ..." Governments have programmes of induction in the directives, and companies in the relevant sectors are urgently addressing the implications for product development. These implications are a radical extension of the current approaches to the product life cycle. The life cycle really becomes a cycle. Components can be recycled in various ways: component reuse, material reuse, use as a source of energy, etc. The more complex life cycle drives a more complex supply chain and affects the links in the chain and the contractual relationships. New business models are needed to take account of the change, based on a better understanding of the risks and risk sharing.
6.1.1. The effect on reliability

The best place to start is with reuse. Parts and components come back into a factory and are combined with new parts and components to build a "new" product. Thus, the concept of lifetime is already affected. A part may have more than one physical life in the normal sense. A simple way of saying this is that the economic life of a product is longer than its first use. At the time of possible reuse a decision has to be made about whether a part is good enough to be reused or must be rejected. It could also, of course, be reconditioned or repaired before being reused. It will certainly
be cleaned. The area of predictive maintenance needs to be extended to predictive reuse. Enough must be known about the probable effective life and performance of a component to forecast whether it can be used and to reduce testing to a minimum. Reuse decisions have to be made quickly so as not to delay production. The area of reliability-based decision making for reuse and waste disposal is called "end of life" analysis. It is too early to give a more precise definition, given the complexity and the legal and economic uncertainties.
6.1.2. The implications for design
End of life analysis heralds a new era for design, as extensions of design for reliability and robust design, which must take into account the afterlife of all components. It must be made easy to disassemble components, retrieve materials or reuse them to generate energy. Components which are probably going to be reused need to be robust and reliable for two, three ... lives. Components with a single life can be cheaper and more easily reused as material or energy. Warranty and maintenance will change. Perhaps the concept of a product itself needs to be revised to be more fluid and adaptable. Components from one product may be reused in a different product. Sourcing components will include reusable components. The market place for components will increasingly be a mixture of the old and the new.
6.2. Signature Analysis and Hierarchical Modeling
The main technology used in predictive maintenance is "signature analysis." Measurements on products such as engines exhibit particular signatures, and departures from these signatures or the appearance of special features in the signals can indicate the possibility of failure ahead of actual failure. When dynamic characteristics are being measured, signature analysis can be considered as a branch of signal processing. Electric current, vibration, and sound are typical measurements. In line with quality improvement concepts, the signature should be linked to the cause of (possible) failure or wear, that is, it should give traceability. To make this effective there must be a considerable amount of off-line modelling. The methodology in the project is to combine a signal processing style of measurement with designed experiments in order to build models which relate high-level signal features to low-level failure and degradation.
6.2.1. The importance of function

The importance of linking the modeling to the engineering cannot be overestimated. Within engineering, "function" has become a key concept in recent years, particularly in design. In quality improvement, "Quality Function Deployment" and "Design Function Deployment" are useful methods. Consideration of the function of a component or higher level module is very useful in determining failure modes, suggesting the right type of measurement and helping to build hierarchical models. There is a hierarchy of engineering function underlying any model of failure. Failure can be defined as a failure of functional performance: "I just want the car to get me there."

6.2.2. Wavelets and feature extraction

Wavelet analysis is used to decompose (deconvolute) dynamic signals into components, each with its own wavelet coefficient. Using the discrete wavelet transform (DWT), the raw signal is decomposed into components which represent the fine structure of the signal and noise, which is removed by thresholding methods (denoising); see Bruce and Gao.1 The frequencies are represented by the local scaling of the wavelet and time by the time-shifted wavelet. The inverse transformation is used to build up an estimate of the "true" signal. See Refs. 2, 3, and 4 for a few references in the field. The methodology consists in extracting particular time or frequency features which can be related to particular failure modes or degradation. Multivariate analysis can be used either directly on the wavelet coefficients or on features, such as peaks, to further reduce dimensionality. The aim is to use the high-level feature detection in real time for fast testing at the end of life decision point. The features would hold the information about the lower-level functional failure or material quality.

6.3. A Case Study

Flextronics International in Venray, The Netherlands, has the responsibility for the development of end of life analysis for a range of photocopiers and other products. The aim, with partners, is to build up a general purpose protocol using a combination of reliability methods, experimental design, signal processing, and field data analysis. Some experiments were conducted on functions of a photocopier finisher module, which has three main components: stapler motor, nip motor and solenoid. Figure 6.1 shows a schema of the module.
Fig. 6.1. Three parts of finisher module.
6.3.1. Function: preliminary FMEA and life tests

We shall describe time and frequency analysis of electrical signal output. Some important background tests were performed to help understand the function. This was partly carried out via auxiliary Masters student projects. Some early experiments also confirmed that the current spectrum is heavily affected by load variation and wear. An important decision was also made to analyse electrical signals, rather than noise. One reason for this was that in industrial plants where parts are assessed for reuse, there is likely to be considerable background noise. This is in contrast to the considerable literature on sound and vibration.
6.3.2. Stapler motor: time domain

The current amplitude was used as a main output for the stapler motor. Figure 6.2 shows the raw signal. Figure 6.3 shows the gradual denoising of the signal using wavelet analysis. The peak amplitudes were related to different functional activities during the cycle. For example, one peak is clearly associated with the spring load, the point at which the staple first cuts into the paper before stitching. At this point the signal dominates the first 6 scale levels of the wavelet analysis, so that one can consider the signature as having at least 6 dimensions. That is to say, the peak itself is a complex feature. One or more of the relevant coefficients which compose the peak can be analyzed against the factors in the experiments or combined using multivariate methods into a single component; see Bucchianico et al.5
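A rough sketch of the kind of DWT denoising and peak-feature extraction described here is given below; the wavelet family, decomposition depth, threshold rule and the synthetic trace are our assumptions for illustration and are not the settings used in the study.

```python
# Sketch of DWT denoising and peak-feature extraction for a motor-current trace.
import numpy as np
import pywt

def denoise_and_peak(signal, wavelet="db4", level=6):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # noise scale estimated from the finest detail coefficients (MAD estimator)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))         # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    clean = pywt.waverec(coeffs, wavelet)[: len(signal)]
    peak_idx = int(np.argmax(clean))
    return clean, peak_idx, clean[peak_idx]                    # e.g. the spring-load peak

# toy example: a noisy trace with one dominant peak
t = np.linspace(0, 1, 2048)
raw = np.exp(-((t - 0.3) / 0.02) ** 2) + 0.2 * np.random.default_rng(0).normal(size=t.size)
clean, idx, amp = denoise_and_peak(raw)
print(f"peak at t = {t[idx]:.3f}, denoised amplitude = {amp:.2f}")
```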
Fig. 6.2. Current signal of stapler motor and spring load peak.
Fig. 6.3. Approximation coefficients at levels 4 and 5.
6.3.3. Lifting motor: frequency domain

This concerns a submodule, the high capacity feeder (HCF). It stores A4 paper sheets in two trays and supplies the paper to the next module. An initial screening experiment concluded that the load and volt-
age produce variation in the motor current, which can be identified in both the time and frequency domains. Load was found to affect the low frequencies in the spectrum. The motor is connected to a set of four gears for gearing down the speed. A test was run for 202 hours until breakdown because of gear wear. The current consumed by the motor during the important "lifting process" was measured for 20 seconds at 17 separate time points beginning at the hour marks 0, 4, 7, 11, 15, 21, 27, 33, 39, 45, 51, 63, 79, 95, 116, 137, and 202; see Fig. 6.4. Figure 6.5 gives the raw data. The DC motor used has three poles, so that there are 6 commutation events per revolution: as the rotor turns, the commutator switches current to pairs of coils in turn, resulting in the oscillation in the current, see Fig. 6.6. Figures 6.7 and 6.8 show the change in the shape of the signal from 0 to 202 hours. For the time domain analysis, Fig. 6.9 shows the increase in root mean squared deviation (RMS) over the whole 202 hours, with a piecewise linear fit to the group RMS at each of the 17 time points. That this is increasing is a feature of the wear. The value increases from 174.4 to 249.96, an increase of 43% of the total current. Figures 6.10 and 6.11 show the estimated spectral density of the signals for the measurements at 0 hours and 202 hours, and we can see that the motor frequency is around 700 Hz in both cases. In the low-frequency content of the spectrum we observe important changes. At the beginning of the test the meshing frequency of the load gear with the third gear is significant, see Fig. 6.12. At 202 hours, more meshing frequencies and their harmonics are present. Besides the frequency of mesh 1, the meshing frequency of the third pinion with the second gear (mesh 2) and its harmonics are present, see Fig. 6.13. Figures 6.14 and 6.15 show that up to 39 hours the power contained in the frequency of mesh 1 increases, but after that time it decreases while the power at mesh 2 increases.
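The two wear indicators discussed in this subsection, the RMS level per measurement period and the low-frequency content of the current spectrum, can be sketched as follows; the sampling rate, segment length and the synthetic signal below are our assumptions for illustration only.

```python
# Sketch of the wear indicators used above: RMS per measurement period and the
# Welch-estimated power spectral density of the motor current.
import numpy as np
from scipy.signal import welch

fs = 5000.0                                    # assumed sampling rate (Hz)
t = np.arange(0, 20.0, 1.0 / fs)               # one 20-second measurement period

def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))

# synthetic "current" with a 700 Hz commutation component plus a low-frequency
# gear-mesh component whose amplitude grows with wear
def current_trace(wear):
    return (200.0 * np.sin(2 * np.pi * 700 * t)
            + wear * 30.0 * np.sin(2 * np.pi * 55 * t)
            + 5.0 * np.random.default_rng(1).normal(size=t.size))

for hours, wear in [(0, 0.0), (202, 2.0)]:
    x = current_trace(wear)
    f, pxx = welch(x, fs=fs, nperseg=4096)
    low = f < 200.0                            # low-frequency band holding the mesh lines
    print(f"{hours:>3} h: RMS = {rms(x):6.1f}, low-frequency power = {np.trapz(pxx[low], f[low]):8.1f}")
```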
6.4. The Development of Protocols and Inversion

The overall purpose of the project outlined here is to develop a protocol that will lead to an implementable system for testing for reuse. This has a number of stages, some of which are common to much of industrial experimentation.
(1) Elucidation of function using engineering knowledge, e.g., from the design, from failure mode analysis and field data.
(2) Signal measurement: electrical, vibration, etc.
Fig. 6.4. Current measurement periods.
Fig. 6.5. Raw signal, pooled over 20 seconds.
(3) Design of experiments with varying factors, e.g., the load on the module, the nature of the module itself (e.g., new/used).
(4) Use of statistical models to relate factors to signal features of the output, e.g., peak amplitudes, frequencies, wavelet coefficients, drift.
(5) Inversion of the models to be able to trace particular faults from observed signal features; also called "calibration."
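As a toy illustration of stages (4) and (5), one can fit a simple statistical model relating a signal feature to wear and then invert it; the linear form and the generated data below are assumptions for illustration, anchored only loosely to the RMS range reported in Sec. 6.3.3.

```python
# Toy illustration of stages (4)-(5): relate a feature (RMS) to wear hours with a linear
# model, then invert ("calibrate") it to judge a returned part.  Numbers are invented.
import numpy as np

hours = np.array([0, 4, 7, 11, 15, 21, 27, 33, 39, 45, 51, 63, 79, 95, 116, 137, 202], float)
rms = 174.4 + (249.96 - 174.4) * hours / 202.0 + np.random.default_rng(3).normal(0, 3, hours.size)

slope, intercept = np.polyfit(hours, rms, 1)            # stage (4): statistical model
def estimated_wear_hours(observed_rms):                 # stage (5): inversion / calibration
    return (observed_rms - intercept) / slope

print(f"a part showing RMS = 230 is estimated at ~{estimated_wear_hours(230.0):.0f} h of wear")
```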
We have seen in this study several points at which features can be related to underlying wear. The power (RMS) typically increases with time. This
Fig. 6.6. Current signal between 4.055 and 4.090 seconds.
Fig. 6.7. Current signal at 0 hours.
can be made more detailed by modeling the drift in wavelet coefficients over time, as demonstrated in Figs. 6.14 and 6.15. Inspection of Figs. 6.7 and 6.8 shows that there is considerable scope for subtle modeling of the change in the shape of the signal over time as wear increases. Further work is studying
Fig. 6.8. Current signal at 202 hours.
Fig. 6.9. Overall RMS level.
this in more detail. The next stage of the project will develop these themes by carrying out experiments in which the load itself varies over time. Combining these with the inverse problem will lead to age-dependent calibration and from there, finally, to fast implementable testing regimes.
Fig. 6.10. Spectrum at 0 hours.
Fig. 6.11. Spectrum at 202 hours.
In summary, end-of-life analysis, in its full sense, requires a protocol that is adapted to the particular problem but contains ingredients from experimental design, signal processing and reliability. This protocol leads to decision support systems adapted to the likely actions: reuse, reject, use
Fig. 6.12. Spectrum at 0 hours: meshing frequency of load gear with third gear.
Fig. 6.13. Spectrum at 202 hours: meshing frequencies and harmonics.
of whole modules, use of materials, maintenance/repair and so on. Ideally, once the analysis and modeling have been performed, the analysis can be
Fig. 6.14. Third gear - load gear power over time.
Fig. 6.15. Third pinion - second gear power over time.
streamlined to be fast so that the decision making is as far as possible automatic. To date the project from which the particular case studies have been drawn has essentially passed the proof-of-concept stage; it is possible to have a small number of signals which identify failure and wear. Although there are only a few signals, they hold much information about degradation; the issue is to discover which features of the signal hold this information
and to convert these into simple-to-use but accurate discriminators.
Acknowledgments

This work was conducted as part of the project funded by the Dutch government, project EDT 20015.
References
1. A. Bruce and H. Gao, Applied Wavelet Analysis with S-plus, Springer, New York (1996).
2. M. E. Benbouzid, A review of induction motors signature analysis as a medium for faults detection, IEEE Trans. Ind. Electron. 47, 984 (2000).
3. K. Kim and A. G. Parlos, Induction motor fault diagnosis based on neuropredictor and wavelet signal processing, IEEE/ASME Trans. Mech. 7, 201 (2000).
4. E. Strangas, E. Khalil, H. Zanardelli, and J. Miller, Wavelet-based methods for the prognosis of mechanical and electrical failures in electric motors, Mech. Syst. Sig. Proc. 19, 411 (2000).
5. A. Di Bucchianico, T. Figarella, G. Hulsken, M. H. Jansen, and H. P. Wynn, A multi-scale approach to functional signature analysis for product end-of-life analysis, Qual. Rel. Eng. Int. 20, 457 (2004).
CHAPTER 7 RELIABILITY ANALYSIS OF A DYNAMIC PHASED MISSION SYSTEM: COMPARISON OF TWO APPROACHES
MARC BOUISSOU
Electricité de France R&D, 1 avenue du Général de Gaulle, 92141 Clamart, France
and CNRS UMR 8050, Université de Marne la Vallée, France
E-mail: [email protected]

YVES DUTUIT
Université Bordeaux I / LAPS, 351 cours de la Libération, 33405 Talence cedex, France
E-mail: [email protected]

SIDOINE MAILLARD
Institut National Polytechnique de Lorraine (CRAN), Faculté des sciences, BP 239, 54506 Vandoeuvre Cedex, France
E-mail: [email protected]

Phased mission systems are frequently encountered in industrial fields, and many approaches have been proposed in the literature to compute their reliability. After a short review of the existing literature, this paper aims to illustrate the use of two reliability analysis methods applied to a simple, but not trivial, problem. The system proposed as a test case enables us to compare the respective benefits and drawbacks of a Petri net-based approach and of the so-called Boolean logic Driven Markov Process (BDMP) approach, recently published.
7.1. Introduction

With the increasing complexity and automation associated with systems encountered in many domains such as the nuclear, aerospace, chemical, and electronic industries, safety studies are more and more complex and multifaceted.
Nowadays, phased mission analysis is being recognized as the appropriate reliability analysis methodology for a large number of problems. Indeed, industrial systems are mostly used over several modes of functioning, including degraded states. A phased mission system (PMS) is a system subject to multiple, consecutive and nonoverlapping time periods of operation (or phases). The phases can be characterized along a wide variety of differentiating features:
- the tasks performed within a phase may differ from phase to phase;
- performance and dependability requirements can be different from one phase to another;
- the system may be subject to a particularly stressing environment in a specific phase, thus experiencing increases in the failure rates of its components;
- the structure may change over time, depending or not on the performance and/or dependability requirements of the current phase;
- the successful completion of a phase may bring a different benefit to the PMS with respect to that obtained with other phases.
Thus, considering the characteristics of the whole system over phases can be very difficult. Moreover, the effects of the past history of the system (for instance a degraded configuration) need to be taken into account to explore its future behavior within the successive phases, and this results in a great increase in the modeling and analysis complexity. Phased-mission systems have been widely investigated, and many different approaches appear in the literature. The studies encountered can be roughly classified into two main groups, based on the approach used to deal with the changes in the structure of the system. These studies consider either the definition of a global model including all phases, as proposed in Refs. 1, 2, 3, 4, 5, 6, and 7, or the definition of a distinct model for each phase of the system and a separate evaluation for each of these models. The definition of a single model that takes into account all the possible behaviors of the system in the different phases makes it easy to consider the dependencies among the phases. This approach gives the possibility of exploiting the similarities among phases to obtain a compact model in which all the phases are properly embedded. But building such a single model may not be simple or suitable in some cases where the following aspects prevail over the similarities among different phases:
- The operational configuration of the system is not inflexible but rather may vary from phase to phase, in accordance with the criticality of the specific phase.
- The failure and repair history of some components within one phase affects system behavior in subsequent phases. Therefore, the state of a component at the beginning of a phase depends on the state it had at the previous phase completion time.
- The criteria defining the level of fulfilment of dependability and performance requirements inside a phase may differ from those valid for a subsequent phase.
One of the main disadvantages of this single-model approach is the lack of reusability of the model. A new model needs to be built if the behavior of the system in any phase is changed or if the phase order is changed. Moreover, a substantial effort may be needed to define and solve, using automatic tools, the overall model of the system, which is often of large size. On the other hand, a separate modeling and evaluation of each phase, as in Refs. 8, 9, 10, and 11, allows better management of the complexity of the analysis and the reuse of previously built models of the phases. Furthermore, this approach permits focusing, inside each phase, on the most interesting behaviors to be analyzed from the system dependability viewpoint, and also reduces the size of the models for each phase. Above all, the characterization of the differences among phases, in terms of different failure rates and different configuration requirements, is much easier. Very often, a small difference does not allow the use of a single model, even if two consecutive phases are quite similar to each other. Each phase can be separately solved, and then its solution outcomes aggregated with those of the other phases to obtain the overall result for the PMS, thus demonstrating a better performance at solution time. The major weakness of the separate modeling approach (not shown by the single-model one) is the treatment of the dependencies among phases, which are to be taken into consideration because of the sharing of components among phases. This approach explicitly requires mapping a component state at the end of a phase to the state of the component at the beginning of the next phase. This mapping is conceptually simple but can be cumbersome and certainly becomes a potential source of errors for large models. However, it must be done because, as was shown in Burdick,2 estimating the mission reliability by the product of the reliabilities of the phases usually results in optimistic results, with an appreciable over-prediction of the
reliability. In the face of this duality, phased mission techniques are thus required for proper analysis of problems, in particular when switching procedures are carried out or equipment is reassembled into new systems at predetermined times. An important quantitative phased mission analysis problem is to calculate exactly, or obtain bounds for, the mission unreliability, where mission unreliability is defined as the probability that the system fails to function successfully in at least one phase. Dependability modeling and evaluation of such systems has been addressed mainly by resorting to Fault Tree models and to Markov-process-based models. The dynamic structure of phased systems makes the analysis more complex compared to single-phased systems. Models such as Fault Trees and Reliability Block Diagrams were widely used to analyse phased mission systems dependability.12,4 More recently, a new family of approaches based on Fault Trees has been proposed. It exploits the gain in computational complexity that is possible thanks to the use of Binary Decision Diagram based techniques. Reference 5 applies Fault Tree methodology to the dependability analysis of PMS with nonrepairable and repairable components. State-based modeling approaches based on Markov chains and Petri nets (PN) were also applied because of their ability to represent complex dependencies among system components. Combinatorial models provide simpler formalisms that allow a very intuitive mapping between modeling elements and system failures. Moreover, it is quite immediate to exploit results of classical qualitative analysis such as those made available by the Failure Modes and Effects Analysis (FMEA) to build quantitative dependability Fault Tree or Reliability Block Diagram models. On the other hand, such models show severe limitations with respect to the representation of dependencies among different system components, imperfect coverage of fault containment mechanisms, and repair actions for failed units and subsystems. State-space models exhibit a higher flexibility with respect to representation capabilities. However, such generality does not come alone; it is paid for by a higher complexity of both the modeling formalism itself and of the modeling process. The considerations above on differences between flexibility and expressiveness generally apply to any modeling formalism. In the specific case of PMS, additional increased complexities are to be handled by the dependability modelers because of the phased behavior of the systems to be analyzed. In the literature, there are examples of separate and single modeling studies for both combinatorial and state-space based approaches. Recently, some hierarchical approaches13 tried to grab the best aspects while alleviat-
ing the limitations of each of the two choices. They allow for the definition of a high-level, single model of the PMS, which has the sole purpose of defining the sequence of phases, and a second, lower-level modeling layer, which focuses on PMS intra-phase behavior. Nowadays, combinatorial and state-space modeling formalisms still represent the two dominant approaches to PMS dependability analysis. Each approach has its own advantages and weaknesses, and the choice of the best one is largely dependent on the specific characteristics of the system at hand and on the goals of the analysis. The sequel of this article aims to illustrate the use of two reliability analysis methods applied to a simple, but not trivial, two-phase problem. The system proposed as a test case enables us to compare the respective benefits and drawbacks of a PN-based approach14,15 and of the so-called BDMP (Boolean logic Driven Markov Process) approach, recently published.16,17
7.2. Test Case Definition

The system to be studied is a hypothetical example of a phased mission system, as shown in Fig. 7.1. It consists of two main nonrepairable components A and B, and five switches that are used for protection or reconfiguration functions in different configurations over two consecutive phases, as described hereafter:
Fig. 7.1. System structure of the studied test case.
Phase 1
- T1 (the duration of phase 1) is exponentially distributed with a mean value equal to E(T1) = 1/λ1 = 100 hours.
- Switches K1, K2, K3, and K4 are normally closed.
- Switch K5 is normally open.
- Components A and B work in parallel. Their (constant) failure rate is λA = λB = λ = 10^-4 h^-1. A failure of A or B is considered as a short circuit between the input and output of the component.
- Possible reconfigurations:
  - In case of failure of one component, some switches (K2 and K4 on a failure of A, K1 and K3 on a failure of B) must be opened, in order to avoid a short circuit of the system, with a probability of failure on demand equal to γ = 5×10^-3.
  - Inadvertent opening of switches can also occur, with a failure rate λS = λ = 10^-4 h^-1.

Phase 2
- T2 (the duration of phase 2) is exponentially distributed with a mean value equal to E(T2) = 1/λ2 = 50 hours.
- At the beginning of phase 2, the positions of some switches are changed to enable the two active components to work in series. More precisely: in the nominal procedure, K1 and K4 are opened, then K5 is closed (operations must be done in this order to avoid creating a short circuit). But some alterations due to unwanted opening of K1 or K4 during phase 1 may occur.
- If component A or B has failed during phase 1, the system cannot be used in the second phase.
Note that the reconfiguration of the system (change of structure from parallel to series) is scheduled by an independent process.

Comments on the test case:
- This test case may seem simple at first sight, because it involves only 7 components. However, the number of elementary failure modes to be taken into account is 12, including 5 failures on demand, which can happen at various times, depending on the system evolution. What makes this example really tricky is the omnipresent dependencies between components, and between the system behaviors in the two phases. For example, in Phase 1, if A fails, K2 and K4 are supposed to open. If at least one of the switches opens, the system can still work until the end of Phase 1, but if both refuse to open, the system is lost at once. If K4 opens inadvertently during Phase 1, it cannot refuse to open during the reconfiguration at the phase change. A real system would involve many more components, but its reconfiguration would probably be simpler than the one we are studying in the present test case.
- Another criticism that this test case could suffer from is the fact that all distributions are exponential. This assumption is very common for the times to failure of components, but what may seem strange is the fact that the phase durations are exponentially distributed. We have deliberately chosen this assumption in order to be able to use the FIGSEQ tool (which works only on Markov models) and to show the advantages brought by the original method implemented in this tool. The other solving method we used, i.e. Monte Carlo simulation, would have worked as well with any other distributions.
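Before turning to the two formal models, a back-of-the-envelope Monte Carlo sketch of this test case is given below. It is only a rough illustration of how such a two-phase mission can be simulated directly: the phase-2 series path is assumed (from Fig. 7.1 and the descriptions above) to be A-K2-K5-K3-B, a few second-order dependencies are treated only approximately, and all names are ours. It is not a substitute for the PN or BDMP models of Sec. 7.3 and will not reproduce Table 7.1 exactly.

```python
# Rough Monte Carlo sketch of the two-phase test case (illustrative only; see caveats
# in the text).  Phase-1 paths: A-K2-K4 in parallel with K1-K3-B; assumed phase-2
# series path: A-K2-K5-K3-B.
import numpy as np

rng = np.random.default_rng(2004)
LAM   = 1e-4    # failure rate of A, B and inadvertent-opening rate of each switch (1/h)
GAMMA = 5e-3    # probability of failure on demand (refuse to open / refuse to close)

def one_trial():
    t1 = rng.exponential(100.0)                 # phase-1 duration, E(T1) = 100 h
    t2 = rng.exponential(50.0)                  # phase-2 duration, E(T2) = 50 h
    fail = {c: rng.exponential(1.0 / LAM) for c in ("A", "B")}
    io = {k: rng.exponential(1.0 / LAM) for k in ("K1", "K2", "K3", "K4", "K5")}

    # Phase 1: a component failure demands the opening of its two switches; if both
    # are still closed and both refuse, the short circuit brings the system down.
    for comp, (s1, s2) in (("A", ("K2", "K4")), ("B", ("K1", "K3"))):
        if fail[comp] < t1:
            if all(io[s] >= fail[comp] and rng.random() < GAMMA for s in (s1, s2)):
                return False
    path_A_down = min(fail["A"], io["K2"], io["K4"]) < t1
    path_B_down = min(fail["B"], io["K1"], io["K3"]) < t1
    if path_A_down and path_B_down:             # both redundant paths lost in phase 1
        return False

    # Reconfiguration at the start of phase 2.
    if fail["A"] < t1 or fail["B"] < t1:        # a failed main component cannot be reused
        return False
    for k in ("K1", "K4"):                      # must open unless already open
        if io[k] >= t1 and rng.random() < GAMMA:
            return False                        # refusal -> short circuit when K5 closes
    if rng.random() < GAMMA:                    # K5 refuses to close
        return False

    # Phase 2: series path A-K2-K5-K3-B must survive until t1 + t2.
    if min(fail["A"], fail["B"], io["K2"], io["K3"]) < t1 + t2:
        return False
    return io["K5"] >= t2                       # K5 is only closed (and at risk) in phase 2

n = 200_000
p = sum(one_trial() for _ in range(n)) / n
print(f"estimated probability of mission success ~ {p:.4f}")   # should be near 0.92
```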
7.3. Test Case Resolution
7.3.1. Resolution with a Petri net

To model in a concise way the behavior of the system during its phased mission using the PN formalism, a system-based approach has been chosen instead of the usual component-based approach. The symmetrical configuration of the system has been exploited (symmetrical structure and identical characteristics of the switches and components) and an aggregation procedure has been carried out to obtain the PN shown in Fig. 7.2. Two main subnets make it up. The first one (on the right side) models both all the possible states (aggregated states) in which the system could be during its first mission phase (places 1 to 10 representing these lumped states) and the transitions between them. It was obtained by transposing a Markov chain model of the system in this first phase. The second subnet (on the left side) consists of two parts: one models the behavior of the system during its second mission phase (place P14 for success and place P15 for system failure), and the other is used to manage the phase sequence (places P11 and P12 for phase 1 and phase 2, and place P13 signifying the end of the whole mission time). An inhibitor link inhibits the transition it is tied to whenever its origin place contains tokens. The rate of each transition is explicitly written on the figure. Some transitions are marked with special signs: ?M notifies that the transition needs the Boolean variable M to be in state TRUE to be fired. This variable is initially set to TRUE, and when the transition between P12 and P13 is fired (i.e. at the end of Phase 2), it changes to FALSE. Therefore, the meaning of the variable M is "mission is in progress." To make the PN of Fig. 7.2 more understandable, each place of the right subnet corresponds to a state of a Markov graph, which is explicitly defined hereafter:
- Place 10 corresponds to the initial state of the system, i.e. A and B are working and all Ki except K5 are closed.
- Place 1: one of the components A and B has just failed. This state is instantaneous, because some switches are required to open.
- Place 4: A (resp. B) is failed and at least one of its associated switches K2 and K4 (resp. K1 and K3) has opened on demand (probability 1 − γ²).
- Place 2: only one of the switches K1 and K4 has inadvertently opened. This place corresponds to a degraded functioning state of the system. Its whole mission can still be performed.
- Place 3: only one of the switches K2 and K3 has inadvertently opened. This failure does not prevent success of phase 1, but it will induce the failure of both phase 2 and the whole mission.
- Place 5: A (resp. B) is still working and (K1 and K3) (resp. K2 and K4) opened inadvertently. From the initial state, the state corresponding to place 5 can be reached via place 2 or via place 3. The consequences with regard to the mission are the same as for place 3.
- Place 6 is reached from place 2 because of a failure of either A or B. The reaction of the corresponding switch K2 (resp. K3) does not play any role for the remainder of the mission.
- Place 8 is reached from place 3 because of a failure of either A or B. The reaction of the corresponding switch K4 (resp. K1) has no impact on the remainder of the mission.
- Place 7 is reached from place 5 because of a failure of either A or B. The states of all other components remain unchanged in this transition.
- Place 9 corresponds to system failure during Phase 1. It is an absorbing state. It can be reached from each of the places 2, 3, 4, 5, 6, 7 and 8 because of the failure of one of the 3 components of the remaining operating path (A-K2-K4 or K1-K3-B). This is why the corresponding failure rate equals 3λ. Place 9 can also be reached from the initial state (place 10), after only one timed transition, via one of the two following sequences: (A fails and both K2 and K4 refuse to open) or (B fails and both K1 and K3 refuse to open).
A qualitative analysis of the PN enables us to identify the sequences of component failures which result in the failure of each phase. This step-by-step procedure can be done manually with the interactive simulator of KB3 or by using the FIGSEQ tool.18,19,20,21
7.3.2. Resolution with a BDMP
The second resolution was done with the BDMP formalism. The bulk of the BDMP depicted in Fig. 7.4 (directly copied and pasted from the KB3 tool19,20,21) is self-explanatory. The advantage of this formalism is that it looks like fault trees. It has the same ability to progressively break down a global event into more elementary events, in a top-down approach. Because of the lack of space, we cannot give the full formal definition of BDMP; it is available in Refs. 16 and 17. Instead, we are going to show on a very simple example how a BDMP with 3 leaves can specify a Markov model with 16 states representing a system with a standby redundancy. Let us suppose we wish to model a system with the structure given in Fig. 7.3 (left). The second line of the system is a standby redundancy. Therefore, when C1 works, C2 and C3 can only be in a standby or failed state, whereas when C1 is failed, C2 and C3 can only be in a working or failed state (this explains the 16 states of the Markov model). This behavior is precisely specified by the BDMP of Fig. 7.3 (right).
Fig. 7.3. A simple system with a standby redundancy and a BDMP modeling it.
For a better readability of Fig. 7.4, let us now introduce the meaning of the unusual symbols of this model in a few simple words. First of all, some symbols simply represent a split link: the names of the origin and target of the link are below the symbols. Split links are here just to avoid some disgraceful crossings of links in the drawing. Secondly, red dotted arrows represent the "triggers" of the BDMP; their role is to transform what seems to be a standard fault tree into a fully dynamic model. As long as the event at the origin of a trigger is FALSE, the trigger maintains all the elements in the subtree under its target in a "nonrequired"
0 ‘
*
97
Reliability Analysis of a Dynamic Phased Mission System
KI
mode. In this mode, the leaves representing failures in function cannot change from state FALSE to state TRUE. Besides the failures of A and B, the inadvertent openings are represented by such leaves (with names beginning with IO-).
The leaves representing on-demand failures react to a mode change. When their mode changes from "not required" to "required," they can instantaneously become TRUE with a given probability. All on-demand failures of the system are represented in this way, with names beginning with RO- for "refuse to open" and RC- for "refuse to close." When a mode change occurs at the same time for several components of that kind, it is possible to specify constraints on the order in which their reactions must be taken into account. This is done with grey dotted links. Two of these links specify that the outcome of the opening demands on K1 and K4 must be determined before the attempt to close K5. The last symbol we must explain is the phase indicator leaf, represented with a clock. The behavior of this leaf is as follows: if no trigger points at it (like for phase-1), it is initialized in the TRUE state and becomes FALSE after an exponentially distributed time. If a trigger points at it (like for phase-2), it is initialized in the FALSE state, and when the origin of the trigger changes from the TRUE to the FALSE state, the leaf instantaneously becomes TRUE. It goes back to the FALSE state after an exponentially distributed time. This kind of behavior makes it easy to link an arbitrary number of phases. It is even possible to define a cyclic chain of phases; this is consistent with the general theory of BDMP.

7.3.3. Compared results
By animating the PN model by means of a Monte Carlo simulation technique, one can obtain interesting quantitative information such as the success probability of each phase, the mean time to the first system failure, the mean sojourn time of the system in its different states, etc. The result obtained for the probability of mission success was p = 0.92396 after 40 s of calculation (time needed to perform 10^7 trials) using the software MOCA-RP,14 and p = 0.92402 after about 4 min with YAMS (for 10^7 trials). YAMS22 is a Monte Carlo simulator able to process any model written in the FIGARO modeling language, and therefore any model built with KB3. Mission success corresponds to the fact that place P14, see Fig. 7.2, contains one token at the end of the trial (simulated duration
of each trial: 3000 hours; this time is large enough to ensure that the end of the whole mission is reached in each trial). We have also solved the PN with the tool FIGSEQ, which is based on sequence exploration and quantification of the Markov graph specified by the PN. FIGSEQ uses an analytical quantification of sequences leading to a specified set of states,23,24 and is able to process any Markovian model written in the FIGARO language. FIGSEQ instantly solved the model and gave the following result for the probability of mission success: p = 0.92394. We do not report the sequences output by FIGSEQ in this article, because the aggregation of the states prevents them from being legible.

We solved the second model (the BDMP) with the two evaluation tools FIGSEQ and YAMS (MOCA-RP is dedicated to PNs and cannot be used to solve a BDMP). The Monte Carlo simulation gave, in 6 min for 10^7 trials of 3000 hours, a probability of success p = 0.92380. The BDMP was solved instantly with FIGSEQ. This solution yielded p = 0.92392 as the probability of mission success, and two sets of sequences sorted by decreasing probability: one for the sequences leading to loss of mission and the other leading to success. The results tables in the appendix are directly those created by FIGSEQ. They display the list of transitions for each sequence, with their rates and classes (EXP for exponential distribution, INS for instantaneous), the probability of the sequence at mission time 3000 hours (Proba MT), the average duration after initiator (Aver. Dur. After init.) and the contribution to the probability of the whole event. The contributions of the 12 first sequences decrease from 11.9% to 5.96% of the mission failure probability; subsequent sequences have much lower contributions (45 sequences). The 12 first sequences are listed in Table 7.3. We could also obtain the three only success sequences, corresponding to the nonoccurrence of the top event (UE-1) and the end of phase 2 (see Table 7.2).

The cross results are summed up in Table 7.1. Note that the Monte Carlo simulation results are given with a confidence interval of 1.64×10^-4 and are therefore consistent with the analytical results given by FIGSEQ. The FIGSEQ results are exact because the sequences have been exhaustively explored and quantified in both models. The difference of 2×10^-5 between the results obtained with FIGSEQ from the PN and from the BDMP is due to subtle differences between the behaviors depicted by the two models. Here is the main one: the BDMP allows switches K1 and K4 to open inadvertently in phase 1 and then to refuse to open at the phase change. In fact these failure modes are mutually exclusive. The BDMP formalism has now been extended with the notion of "inverted trigger" in order to cope with this kind of modeling problem, but it would be too long to explain this extension here; it will be done in a forthcoming publication.

Table 7.1. Cross results of the different models.

Model   Processing tool   CPU time                   Pr(success)
PN      MOCA-RP (MC)      40 s, 10^7 trials          0.92396
PN      YAMS (MC)         3 min 54 s, 10^7 trials    0.92402
PN      FIGSEQ            < 1 s                      0.92394
BDMP    FIGSEQ            < 1 s                      0.92392
BDMP    YAMS (MC)         6 min, 10^7 trials         0.92380
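As a quick check on the Monte Carlo figures in Table 7.1 (assuming, as the magnitude suggests, that the quoted 1.64×10^-4 is a 95% binomial half-width):

```python
# Check that the quoted Monte Carlo uncertainty is consistent with a 95% binomial
# confidence half-width for 10^7 trials (the 95% convention is our assumption).
import math

p_hat, n = 0.92396, 10_000_000
half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / n)
print(f"95% half-width ~ {half_width:.2e}")   # ~1.6e-04, matching the value in the text
```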
7.4. Conclusions

The results obtained with quite different methods are practically the same, which constitutes a good cross validation. Since both models are Markovian, any solving method valid for Markov processes could have been used to solve the two models. Therefore, the only significant difference between the two approaches resides in the model construction. Whereas the BDMP construction was straightforward and produced a self-explaining, easy-to-validate model, the PN required in this case some further work to result in a concise graphical representation. The size of the PN could be limited thanks to a careful exploitation of all the symmetries of the system (structure of the installation and features of components). However, if we had to model a system with the same behavior, but made of components all having different characteristics, the PN size would obviously increase, while the BDMP would remain exactly the same. The same remark would apply if we wanted to introduce repairs. But the most spectacular advantage of the BDMP formalism is probably the following: thanks to its hierarchical structure, all we would have to do in order to replace the simple components A and B by subsystems would be to replace the leaves FailureOfA and FailureOfB of the BDMP of Fig. 7.4 by sub-BDMP. For example, if A and B were subsystems with the characteristics of the system of Fig. 7.3 (left), the replacement of the leaves FailureOfA and FailureOfB by two BDMP like the one of Fig. 7.3 (right) would solve the problem. By contrast, we would of course have to rebuild the PN from scratch, and it would be hard work. Another interesting result of this study is the illustration of the interest of the sequence exploration and quantification method used by FIGSEQ, which allows a quick and precise quantification of a large Markov model and
gives interesting qualitative results: the most probable sequences leading to the mission failure.

Table 7.2. Sequences leading to mission success.

Transition Name | Proba MT | Aver. Dur. After init. | Contribution
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [end OF phase-2] | 9.0665e-01 | 4.8780e+01 | 9.8131e-01
[Fail. OF IO-K4] [end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [end OF phase-2] | 8.6348e-03 | 1.4402e+02 | 9.3458e-03
[Fail. OF IO-K1] [end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [end OF phase-2] | 8.6348e-03 | 1.4402e+02 | 9.3458e-03
Table 7.3. Sequences leading to loss of mission.
Transition Name | Proba MT | Aver. Dur. After init. | Contribution
[end OF phase-1] [start OF phase-2] [Fail OF RO-K1, OK OF RO-K4] | 4.6934e-03 | 0.0000e+00 | 6.1712e-02
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, Fail OF RO-K4] | 4.6934e-03 | 0.0000e+00 | 6.1712e-02
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [Fail OF RC-K5] | 4.6699e-03 | 0.0000e+00 | 6.1403e-02
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [Fail OF FailureOfB] | 4.5332e-03 | 4.8780e+01 | 5.9606e-02
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [Fail OF IO-K2] | 4.5332e-03 | 4.8780e+01 | 5.9606e-02
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [Fail OF IO-K5] | 4.5332e-03 | 4.8780e+01 | 5.9606e-02
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [Fail OF FailureOfA] | 4.5332e-03 | 4.8780e+01 | 5.9606e-02
[end OF phase-1] [start OF phase-2] [OK OF RO-K1, OK OF RO-K4] [OK OF RC-K5] [Fail OF IO-K3] | 4.5332e-03 | 4.8780e+01 | 5.9606e-02
References
1. M. Alam and U. M. Al-Saggaf, Quantitative Reliability Evaluation of Repairable Phased-Mission Systems Using Markov Approach, IEEE Transactions on Reliability, Vol. R-35 (5), 498-503 (Dec. 1986).
2. G. R. Burdick, J. B. Fussell, D. M. Rasmusson, and J. R. Wilson, Phased Mission Analysis, a Review of New Developments and an Application, IEEE Transactions on Reliability, Vol. 26 (1), 43-49 (1977).
3. J. B. Dugan, Automated Analysis of Phased Mission Reliability, IEEE Transactions on Reliability, Vol. 40, 45-52 (1991).
4. A. Pedar and V. V. S. Sarma, Phased-Mission Analysis for Evaluating the Effectiveness of Aerospace Computing-Systems, IEEE Transactions on Reliability, Vol. R-30 (5), 429-437 (Dec. 1981).
5. J. K. Vaurio, Fault tree analysis of phased mission systems with repairable and non-repairable components, Reliability Engineering and System Safety, Vol. 74, 169-180 (2001).
6. A. Bondavalli, S. Chiaradonna, F. Di Giandomenico and I. Mura, Dependability modeling and evaluation of multiple-phased systems using DEEM, IEEE Transactions on Reliability 53(4), 509-522 (2004).
7. V. Volovoi, Modeling of system reliability Petri nets with aging tokens, Reliability Engineering and System Safety, Vol. 84 (2), 149-161 (May 2004).
8. A. Bondavalli, I. Mura, and M. Nelli, Analytical Modeling and Evaluation of Phased-Mission Systems for Space Applications, Proceedings of the IEEE High Assurance System Engineering Workshop (HASE) (1997).
9. A. K. Somani, Simplified Phased-Mission System Analysis for Systems with Independent Component Repairs, Proceedings of ACM SIGMETRICS (1996).
10. L. Xing, Reliability and sensitivity analysis of static phased mission systems with imperfect coverage, M.S. Thesis, Electrical Engineering, University of Virginia (January 2000).
11. L. Xing and J. B. Dugan, Analysis of generalized phased mission system reliability, performance and sensitivity, IEEE Transactions on Reliability, Vol. 51 (2), 199-211 (June 2002).
12. J. D. Esary and H. Ziehms, Reliability analysis of phased missions, in Reliability and Fault Tree Analysis, p. 213-236, SIAM, Philadelphia (1975).
13. A. Bondavalli and I. Mura, Markov Regenerative Stochastic Petri Nets to Model and Evaluate Phased Mission Systems Dependability, IEEE Transactions on Computers, Vol. 50 (12) (December 2001).
14. Y. Dutuit, E. Chatelet, P. Thomas, and J. P. Signoret, Dependability Modelling and Evaluation by Using Stochastic Petri Nets: Application to Two Test-Cases, Reliability Engineering and System Safety 55 (2), 117-124 (1997).
15. D. C. Ionescu, E. Zio, and A. Constantinescu, Availability Analysis of a Safety System of a Nuclear Reactor, Proceedings of KONBIN'03 Conference, Vol. 2, p. 225-233 (2003).
16. J. L. Bon and M. Bouissou, A new formalism that combines advantages
of fault trees and Markov models: Boolean logic Driven Markov Processes, Reliability Engineering and System Safety, Vol. 82 (2), 149-163 (November 2003).
17. M. Bouissou, Boolean logic Driven Markov Processes: a powerful new formalism for specifying and solving very large Markov models, PSAM6, Puerto Rico (June 2002).
18. M. Bouissou, H. Bouhadana, M. Bannelier, and N. Villatte, Knowledge modeling and reliability processing: presentation of the FIGARO language and associated tools, Proceedings of SAFECOMP'91, Trondheim (Norway) (November 1991).
19. M. Gallois and M. Pillière, Benefits expected from automatic studies with KB3 in PSAs at EDF, Proceedings of the PSA99 conference, Washington (August 1999).
20. The KB3 and FIGSEQ tools: detailed information, software download at http://rdsoft.edf.fr.
21. M. Bouissou, S. Humbert, S. Muffat, and N. Villatte, KB3 tool: Feedback on knowledge bases, Proceedings of Lambda Mu 13 / ESREL 2002, European Conference, Lyon (France), p. 754-759 (March 2002).
22. M. Bouissou, H. Chraibi, and S. Muffat, Utilisation de la Simulation de Monte-Carlo pour la résolution d'un benchmark (MINIPLANT), 14ème congrès de fiabilité et maintenabilité, Bourges, France (October 2004).
23. J. L. Bon and M. Bouissou, Fiabilité des grands systèmes séquentiels: résultats théoriques et applications dans le cadre du logiciel GSI, Revue de Statistique appliquée XXXX (2), p. 45-54 (1992).
24. M. Bouissou and Y. Lefebvre, A path-based algorithm to evaluate asymptotic unavailability for large Markov models, Annual Reliability and Maintainability Symposium Proceedings, Seattle (2002).
CHAPTER 8 SENSITIVITY ANALYSIS OF ACCELERATED LIFE TESTS WITH COMPETING FAILURE MODES
CORNEL BUNEA School of Engineering and Applied Science George Washington University 1776 G ST N W Washington, DC 20052 USA E-mail:
[email protected]
THOMAS A. MAZZUCHI School of Engineering and Applied Science George Washington University 1776 G ST N W Washington, DC 20052 USA E-mail:
[email protected]

Most accelerated life tests (ALT) ignore the possibility of competing modes of failure. The literature that attempts to address this problem often does so by assuming independence among the competing failure modes. In practice, however, the failure modes often display a highly dependent structure, which is usually influenced by the applied stress. The dependent ALT-competing risks models available are based on multivariate distributions with specified marginals. However, multivariate exponential and Weibull distributions, which dominate the literature, have properties not suited for all applications. This paper investigates the applicability of the existing models to two data sets and presents a new ALT model with dependent competing failure modes. The dependence structure is modeled by a copula given the rank correlation. The effects of the degree of dependence of the competing failure modes are also studied.
8.1. Introduction
Accelerated life tests are applied to long-life components in order to obtain failure data in a reasonable time interval; they are performed at stress levels that exceed the normal use conditions. The goal is to make inference about the life distribution of the component under use conditions using failure data obtained under a more severe environment. To be valid, an accelerated life test must not alter the basic modes and/or mechanisms of failure or their relative prevalence. In real experiments all these problems may occur if the stresses become too high. Many authors have taken the change in failure mechanism into account and proposed distributional or Bayesian models that allow all data to be taken into account. However, when more failure modes are present, the change in their relative prevalence as the stress increases has not yet been considered, even though it is indicated by accelerated life data.

Multiple failure modes, or competing failure modes, are often present in industrial accelerated life experiments. Classical and Bayesian models available in the literature usually assume independent failure modes. Little work has been done for dependent competing failure modes, and it is limited to multivariate models with positively correlated lives of failure modes, or to upper and lower bounds for the component life distribution. None of these models takes into account the change in the relative prevalence of the failure modes and the change in the dependence structure as the stress increases.

The purpose of this paper is to study the influence of the dependence structure of competing failure modes as their relative dominance changes with increasing applied stress. This is achieved by integrating the results of competing risk theory into the ALT procedure and by using graphical techniques in model selection. Since the motorettes data (Nelson¹) present a strong change in the relative prevalence of competing failure modes, they will be used as a supporting application throughout the paper.

Section 8.2 presents several step-stress ALT models, with independent and dependent competing failure modes. Empirical distribution functions based on competing risk data are used in Sec. 8.3 to illustrate the change in the relative dominance of the competing failure modes and to select the appropriate model to interpret ALT data. An ALT model with dependent failure modes is presented in Sec. 8.4. The dependence structure is modeled by a copula given the rank correlation coefficient for each stress level. The impact of the independence assumption on the estimated lifetime of the
component under use conditions is studied as well.
8.2. ALT and Competing Risks

We consider an m-constant-stress-level ALT. At each stress level l, l = 1, ..., m, a number of items are tested until a failure or a censoring time occurs. Failure is assumed to occur due to k competing failure modes, X_1, X_2, ..., X_k. In a competing risks context, we observe the shortest of the X_i, i = 1, ..., k, and observe which failure mode it is. In many cases we can reduce the problem to the analysis of two competing risk classes, described by two random variables X_1 and X_2, and we call X_2 the censoring variable. Competing risk data will only allow us to estimate the subsurvival functions
S*_{X_1}(t) = Pr(X_1 > t, X_1 < X_2)   and   S*_{X_2}(t) = Pr(X_2 > t, X_2 < X_1),

but not the true survival functions of X_1 and X_2. Hence, we are not able to estimate the underlying failure distributions without making additional, nontestable, model assumptions. The conditional subsurvival function is the subsurvival function conditioned on the event that the failure mode in question is manifested. Assuming continuity of S*_{X_1} and S*_{X_2} at zero:

CS*_{X_1}(t) = Pr(X_1 > t, X_1 < X_2 | X_1 < X_2) = S*_{X_1}(t)/S*_{X_1}(0),
CS*_{X_2}(t) = Pr(X_2 > t, X_2 < X_1 | X_2 < X_1) = S*_{X_2}(t)/S*_{X_2}(0).
Closely related to the notion of the subsurvival functions is the probability of censoring beyond time t,

Φ(t) = Pr(X_2 < X_1 | min(X_1, X_2) > t) = S*_{X_2}(t) / [S*_{X_1}(t) + S*_{X_2}(t)].
This function seems to have some diagnostic value, enabling us to choose the competing risk model which fits the data.

8.2.1. ALT and independent competing risks
The presence of independent competing risks in ALT has been widely studied in the literature. McCool² presented a technique for calculating interval estimates for the Weibull parameters of a primary failure mode when a secondary failure mode having the same (but unknown) Weibull shape parameter is
acting. Klein and Basu³,⁴ presented the analysis of ALT when more than one failure mode is acting. Assuming independence among competing failure modes at each stress level, the authors obtained maximum likelihood estimators when the lifetimes are exponentially or Weibull (with common or different shape parameters) distributed. Nelson¹ presented graphical and analytical (maximum likelihood) methods to analyze data on a failure mode, to estimate a product life distribution when several failure modes act, and to estimate a product life distribution with certain failure modes eliminated. Nelson indicated that complete data sets are usually analyzed with standard least-squares regression analysis; such an analysis may be misleading for data with competing failure modes. The analysis should instead consist of a separate Arrhenius model for each failure mode and a series-system model for the relationship between the failure times of each failure mode and the failure time of the component. Examples of products with multiple causes of failure are given, including insulation systems, ball bearings and industrial heaters.

A large sample of data is considered by Nelson¹ (p. 393); it is collected from a temperature-accelerated life test of motor insulation (the Turn, Phase, or Ground insulation can fail). The experiment was conducted to observe a greater number of failures for each failure mode. When a specimen fails by a specific failure mode, that insulation is isolated and the motorette is kept on test and run to a second or third failure. In actual use, the first failure from any cause ends the life of the motor. This leads to pseudo-competing risk data. Nelson's analysis of the motorettes data consists of a separate Arrhenius-lognormal model for each failure mode, and a competing risk (series system) model for the relationship between the failure times of different failure modes and the failure time of a specimen. The assumptions made in the Arrhenius-lognormal model are:

- for temperature T, life has a lognormal distribution;
- the log standard deviation σ_k is constant;
- the mean log life as a function of x = 1000/T is μ_k(x) = α_k + β_k x,

where α_k, β_k and σ_k are parameters characteristic of failure mode k, the product and the test method. For units run at a given stress x, the probability that failure mode k survives time t is

R_k(t) = Φ{-[log(t) - μ_k(x)]/σ_k}.
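To make the Arrhenius-lognormal relationship concrete, the following minimal Python sketch evaluates R_k(t) for a single failure mode; the function name, the base-10 log convention, the use of temperature in kelvin, and the parameter values are illustrative assumptions, not Nelson's fitted estimates.

```python
# Sketch of the Arrhenius-lognormal survival probability described above
# (illustrative only; the parameters below are hypothetical, not fitted values).
from math import log, sqrt, erf

def lognormal_reliability(t_hours, temp_kelvin, alpha, beta, sigma):
    """R_k(t) = Phi(-(log10(t) - mu_k(x)) / sigma_k), with x = 1000/T (assumed log10)."""
    x = 1000.0 / temp_kelvin
    mu = alpha + beta * x                      # mean log life at stress x
    z = (log(t_hours, 10) - mu) / sigma        # standardized log life
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))    # survival probability Phi(-z)

# Example: survival probability of one failure mode at 5,000 h and 190 C.
print(lognormal_reliability(5000.0, 190.0 + 273.15, alpha=-2.0, beta=3.0, sigma=0.3))
```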
The competing risk model assumes that each unit has k competing failure modes, which act together to terminate the service life of the component. The k competing risks (times to failure) are considered statistically independent. Graphical analysis of data with multiple failure modes involves the usual two plots for each failure mode: a hazard plot of the multiply censored data and a relationship plot (relationship between the estimated life for a failure mode and temperature/stress). Figure 8.1a shows the hazard plot of the Turn data, and Fig. 8.1b shows the Arrhenius plot of median times to (Turn) failure against test temperature. A straight line fitted to the medians estimates the Arrhenius relationship of Turn failures.
Fig. 8.1. Graphical analysis of Turn failures: (a) lognormal hazard plot and (b) Arrhenius plot of median times to failure.
8.2.2. ALT and dependent competing risks
Simple empirical upper and lower bounds of the system life distribution can be found when the lives of the competing failure modes are positively correlated. The lower bound corresponds to the most pessimistic case, in which independent failure modes are assumed. The upper bound is the lowest life distribution for a single failure mode. These bounds may give some insight into the true distributional form, especially when they are translated into time-average failure rate bounds. Other techniques for dealing with dependent failure modes in ALT are based
on multivariate distributions. The multivariate log-normal distribution used by Nadas⁵ has satisfactory properties. However, multivariate exponential and Weibull distributions, which dominate the literature, have properties not suited for all applications. Klein and Basu⁴ proposed a bivariate Weibull distribution as the joint survival function of two competing risks, both Weibull distributed with the same shape parameter α. They considered two competing risks X_1 and X_2, both Weibull distributed with shape parameter α and scale parameters λ_i(T, β_i) and λ_12(T, β_12), i = 1, 2, at each temperature (stress) level T (β_1, β_2, β_12 are time transformation parameters). The joint survival function of (X_1, X_2) is given by

S(x_1, x_2) = exp(-λ_1(T, β_1) x_1^α - λ_2(T, β_2) x_2^α - λ_12(T, β_12) max(x_1, x_2)^α).

To estimate the parameters of X_1 and X_2 at use conditions, the classical methodology of ALT is applied. They illustrated this procedure on the motorette data. However, as will be shown in Sec. 8.3, simple plots of the data reject the applicability of exponential and Weibull multivariate models to this specific data set. More elaborate dependent competing risks models based on the use of copulas have been presented by Hougaard⁶ and Bunea and Bedford.⁷ These models will be discussed in Sec. 8.4.

8.3. Graphical Analysis of Motorettes Data

Cooke⁸ presented several independent and dependent competing risks models. These models have been further investigated by Bedford and Cooke,⁹ Langseth and Lindqvist,¹⁰ and Bunea et al.¹¹ Nevertheless, the choice of the most appropriate model to fit competing risk data remains an open debate. All the above authors gave some guidelines for model selection. An important indicator for model selection is the probability of censoring after time t, Φ(t), together with the conditional subsurvival functions CS*_{X_1}(t) and CS*_{X_2}(t) (see Bunea et al.¹¹). Note that empirical versions of these functions can be obtained directly from a competing risk data set. The model selection guidelines for the most well-known competing risk models (independent exponential model, random signs model, conditional independence model, mixture of exponentials model) are:
- If the risks are exponential and independent, then the conditional subsurvival functions are equal and correspond to exponential distributions. Moreover, Φ(t) is constant.
- Under random signs censoring, Φ(0) > Φ(t) and CS*_{X_1}(t) > CS*_{X_2}(t) for all t > 0.
- If the conditional independence model holds with exponential marginals, then the conditional subsurvival functions are equal and Φ(t) is constant.
- If the mixture of exponentials model holds, then Φ(t) is strictly increasing and CS*_{X_2}(t) ≤ CS*_{X_1}(t) for all t > 0.
In addition to the above statements, one can easily show that Klein and Basu's model requires equal conditional subsurvival functions. Simple calculations give the subsurvival and conditional subsurvival functions:

S*_{X_1}(t) = [λ_1/(λ_1 + λ_2 + λ_12)] exp(-(λ_1 + λ_2 + λ_12)t),
S*_{X_2}(t) = [λ_2/(λ_1 + λ_2 + λ_12)] exp(-(λ_1 + λ_2 + λ_12)t),
CS*_{X_1}(t) = CS*_{X_2}(t) = exp(-(λ_1 + λ_2 + λ_12)t).
Note that the conditional subsurvival functions should be equal at each stress level in order to have a valid ALT-competing risks model. Figure 8.2 shows the empirical conditional subsurvival functions for the motorettes data, with failure modes grouped in two competing risk classes: risk 1, the turn failure modes; risk 2, the phase and ground failure modes. One can see a strong change in the failure mechanism as the stress level increases. At the first stress level the dominant failure mode is risk 2, with its conditional subsurvival function lying entirely above the conditional subsurvival function of risk 1. A random signs model may be appropriate for this case (risk 1 being censored). The conditional subsurvival functions are more or less equal for stress levels 2 and 3, indicating that an independent exponential model may apply. Data obtained from stress level 4 indicate a dominant conditional subsurvival function for risk 1, so a random signs model may be selected (risk 2 being censored). The change in relative prevalence of the failure modes does not allow a classical ALT-competing risk model to analyze the data. Since the conditional subsurvival functions are not equal at each stress level, the assumption of independent exponential or Weibull (with the same shape parameter) failure modes is clearly rejected by the motor insulation data. Based on the model selection criteria presented at the beginning of this section, the random signs model and the multivariate exponential (or Weibull with same shape parameter) model are also rejected. The next section presents a more complex dependent model and investigates the influence of the degree of dependence on the estimated distribution functions at use conditions.
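The empirical quantities plotted in Fig. 8.2 can be computed directly from competing-risks observations (t_i, δ_i). The sketch below, with a made-up data array, shows one way to obtain the empirical subsurvival functions, the conditional subsurvival functions and Φ(t); it is an illustration of the definitions above, not the authors' code.

```python
# Minimal sketch: empirical (conditional) subsurvival functions and Phi(t)
# from competing-risks observations (t_i, delta_i); the data below are made up.
import numpy as np

times = np.array([150., 320., 480., 610., 790., 905., 1100., 1260.])
cause = np.array([1,    2,    1,    1,    2,    1,    2,     1])   # 1 = turn, 2 = phase/ground

def subsurvival(times, cause, which, grid):
    """S*_j(t) = (1/n) * #{i : t_i > t and delta_i = j}."""
    n = len(times)
    return np.array([np.sum((times > t) & (cause == which)) / n for t in grid])

grid = np.sort(times)
s1, s2 = subsurvival(times, cause, 1, grid), subsurvival(times, cause, 2, grid)
cs1 = s1 / (np.sum(cause == 1) / len(times))     # CS*_1(t) = S*_1(t)/S*_1(0)
cs2 = s2 / (np.sum(cause == 2) / len(times))     # CS*_2(t) = S*_2(t)/S*_2(0)
phi = np.divide(s2, s1 + s2, out=np.zeros_like(s2), where=(s1 + s2) > 0)  # Phi(t)
print(np.column_stack([grid, cs1, cs2, phi]))
```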
Fig. 8.2. Empirical conditional subsurvival functions for motorettes data, at each temperature (stress) level.
8.4. A Copula-Dependent ALT-Competing Risk Model
8.4.1. Competing risks and copula
We assume that the dependence structure between X_1 and X_2 is given by a copula. The copula of two random variables X_1 and X_2 with distribution functions F_{X_1} and F_{X_2} is the distribution C on the unit square [0,1]² of the pair (F_{X_1}(X_1), F_{X_2}(X_2)). The functional form of C : [0,1]² → R is C(u, v) = H(F_{X_1}^{-1}(u), F_{X_2}^{-1}(v)), where H is the joint distribution function of (X_1, X_2) and F_{X_1}^{-1} and F_{X_2}^{-1} are the right-continuous inverses of F_{X_1} and F_{X_2}. Under the assumption of independence of X_1 and X_2, the marginal distribution functions of X_1 and X_2 are uniquely determined by the data. Zheng and Klein¹² showed the more general result that, if the copula of (X_1, X_2) is known, then the marginal distribution functions of X_1 and X_2 are uniquely determined by the competing risk data. More precisely, the marginal distribution functions F_{X_1} and F_{X_2} are solutions of the following system of ordinary differential equations:
dF_{X_1}(t)/dt = [dF*_{X_1}(t)/dt] / [1 - C_u(F_{X_1}(t), F_{X_2}(t))],
dF_{X_2}(t)/dt = [dF*_{X_2}(t)/dt] / [1 - C_v(F_{X_1}(t), F_{X_2}(t))],

with initial conditions F_{X_1}(0) = F_{X_2}(0) = 0, where C_u(F_{X_1}(t), F_{X_2}(t)) and C_v(F_{X_1}(t), F_{X_2}(t)) denote the first-order partial derivatives ∂C(u, v)/∂u and ∂C(u, v)/∂v evaluated at (F_{X_1}(t), F_{X_2}(t)), and F*_{X_1}(t) and F*_{X_2}(t) are the subdistribution functions of X_1 and X_2 (see Bunea and Bedford⁷).

8.4.2. Measures of association
We now discuss the problem of choosing a copula. There are many measures of association for the pair (X, Y) which are symmetric in X and Y. The best known measures of association are Kendall's tau and Spearman's rho. Kendall's tau for a vector (X, Y) of continuous random variables with joint distribution function H is defined as follows: let (X_1, Y_1) and (X_2, Y_2) be i.i.d. random vectors, each with joint distribution H; then Kendall's tau is the probability of concordance minus the probability of discordance:

τ(X, Y) = Pr((X_1 - X_2)(Y_1 - Y_2) > 0) - Pr((X_1 - X_2)(Y_1 - Y_2) < 0).

The other measure of association, Spearman's rho, is defined as follows: let X and Y be continuous random variables; then Spearman's rho is the product moment correlation of F_X(X) and F_Y(Y):

ρ_S(X, Y) = ρ(F_X(X), F_Y(Y)).

Simple formulae relate the measures of association to the copula:

τ = 4 ∫∫_{[0,1]²} C(u, v) dC(u, v) - 1,    ρ_S = 12 ∫∫_{[0,1]²} C(u, v) du dv - 3.
Since the measure of association is to be treated as a primary parameter, it is necessary to choose a family of copulae that models all possible measures of association in a simple way. Meeuwissen and Bedford¹³ proposed to use the unique copula with the given Spearman's rho that has minimum information with respect to the independent distribution, and they also gave a method to calculate this copula numerically. Prior knowledge or subjective information is needed to obtain values for the measure of association. Expert judgment is used to model the uncertainty over the measure of association (see Bunea and Bedford⁷).
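When the rank correlation is estimated from data rather than elicited from experts, standard estimators can be used. The sketch below, on simulated positively dependent pairs, shows one way to obtain Kendall's tau and Spearman's rho with scipy; the simulated data are purely illustrative.

```python
# Sketch: estimating the rank correlations (Kendall's tau, Spearman's rho) that serve
# as the primary dependence parameter, here from made-up paired lifetimes.
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
x = rng.lognormal(mean=7.0, sigma=0.4, size=200)
y = 0.7 * x + rng.lognormal(mean=6.0, sigma=0.5, size=200)   # positively dependent pair

tau, _ = kendalltau(x, y)
rho, _ = spearmanr(x, y)
print(tau, rho)
```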
The work of Zheng and Klein¹² suggests that the important factor for an estimate of the marginal survival function is a reasonable guess at the strength of the association between competing risks, and not the functional form of the copula. For this reason we choose a class of copulae that is easy to work with from the mathematical point of view. Such a class is the Archimedean copulae.

8.4.3. Archimedean copula
Let X_1 and X_2 be continuous random variables with joint distribution H and marginal distributions F_{X_1} and F_{X_2}. An Archimedean copula is defined by

φ(H(x_1, x_2)) = φ(F_{X_1}(x_1)) + φ(F_{X_2}(x_2)),

or, in terms of the copula,

φ(C(u, v)) = φ(u) + φ(v).

The function φ is called an additive generator of the copula. If φ(0) = ∞, φ is a strict generator and C(u, v) = φ^{-1}(φ(u) + φ(v)) is a strict Archimedean copula. For our goal we choose a one-parameter family of copulae which has a strict generator. The Gumbel family is defined as

C_α(u, v) = exp(-[(-log u)^α + (-log v)^α]^{1/α})

for α ∈ [1, ∞). The generator is the function φ_α(t) = (-log t)^α. Using the relation between Kendall's tau and the copula, we can write α as a function of Kendall's tau: α_τ = 1/(1 - τ).
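A small numerical sketch of the Gumbel family and the Kendall's-tau link follows; the function names and the example value τ = 0.5 are hypothetical.

```python
# Sketch of the Gumbel copula C_alpha(u, v) and the link alpha = 1/(1 - tau).
import numpy as np

def gumbel_copula(u, v, alpha):
    """C_alpha(u, v) = exp(-[(-log u)^alpha + (-log v)^alpha]^(1/alpha)), alpha >= 1."""
    return np.exp(-(((-np.log(u)) ** alpha + (-np.log(v)) ** alpha) ** (1.0 / alpha)))

def alpha_from_tau(tau):
    """Invert tau = 1 - 1/alpha, i.e. alpha = 1/(1 - tau)."""
    return 1.0 / (1.0 - tau)

alpha = alpha_from_tau(0.5)                 # e.g. rank correlation tau = 0.5 -> alpha = 2
print(alpha, gumbel_copula(0.7, 0.4, alpha))
```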
8.4.4. Application on motor insulation data

Let U_1 and U_2 be two random variables associated with the conditional subsurvival functions of X_1 and X_2. The proposed ALT-competing risk model consists of a separate Arrhenius-lognormal model for each random variable U_1 and U_2, and of a dependence structure of X_1 and X_2 modeled by a copula, with the rank correlation coefficient given as a primary parameter. The Arrhenius-lognormal model applied to the conditional subsurvival functions enables us to find the conditional subsurvival functions at use conditions and to estimate the distributional parameters. Elicitation of the rank correlation coefficient at each stress level and at use conditions using expert opinion allows us to fully specify the copula. Further, applying Zheng
and Klein’s result at each stress level and at use conditions, one can obtain the distribution functions of X1 and X2. Table 8.1 presents the estimated parameters of the lognormal distribution (for U1 and U2) at each stress level and at use conditions. These distributions, together with the values in point zero of the subsurvival functions of X1 and X2, and with the elicitated rank correlation coefficients, constitute input factors in the system of equations presented in Sec. 8.4.1.
Table 8.1. Estimated parameters of the lognormal distribution (for U_1 and U_2) at each stress level and at use conditions (motorettes data).

Temperature   μ_{U_1}   σ_{U_1}   μ_{U_2}   σ_{U_2}
180 C         1.4061    0.0201    1.46601   0.028
190 C         1.36863   0.0208    1.4165    0.0206
220 C         1.2453    0.0236    1.2506    0.0321
240 C         1.1540    0.0259    1.2243    0.1026
260 C         1.0535    0.0287    1.1508    0.1334
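The following sketch illustrates how the system of Sec. 8.4.1 could be integrated numerically with a Gumbel copula. The sub-density functions, the copula parameter, and the use of finite-difference partial derivatives are assumptions made purely for illustration; they are not the motorette estimates of Table 8.1.

```python
# Sketch of the copula-based marginal recovery of Sec. 8.4.1 (Zheng-Klein system)
# with a Gumbel copula; f1_star/f2_star are hypothetical lognormal sub-densities.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import lognorm

ALPHA = 2.0                     # Gumbel parameter, alpha = 1/(1 - tau) for assumed tau = 0.5
P1, P2 = 0.6, 0.4               # assumed S*_1(0), S*_2(0): failure-mode probabilities

def gumbel(u, v, alpha=ALPHA):
    return np.exp(-(((-np.log(u)) ** alpha + (-np.log(v)) ** alpha) ** (1.0 / alpha)))

def partials(u, v, eps=1e-6):
    """Numerical partial derivatives C_u and C_v of the Gumbel copula."""
    cu = (gumbel(u + eps, v) - gumbel(u - eps, v)) / (2 * eps)
    cv = (gumbel(u, v + eps) - gumbel(u, v - eps)) / (2 * eps)
    return cu, cv

def f1_star(t):                  # hypothetical sub-density of mode 1
    return P1 * lognorm.pdf(t, s=0.5, scale=2000.0)

def f2_star(t):                  # hypothetical sub-density of mode 2
    return P2 * lognorm.pdf(t, s=0.6, scale=2500.0)

def rhs(t, y):
    """dF_j/dt = f*_j(t) / (1 - C_u) and f*_j(t) / (1 - C_v), kept away from 0 and 1."""
    f1, f2 = np.clip(y, 1e-5, 1 - 1e-5)
    cu, cv = partials(f1, f2)
    return [f1_star(t) / max(1.0 - cu, 1e-9), f2_star(t) / max(1.0 - cv, 1e-9)]

sol = solve_ivp(rhs, (1e-6, 10000.0), [1e-9, 1e-9], dense_output=True)
print(sol.y[:, -1])              # marginal F_X1 and F_X2 at the end of the time grid
```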
Figure 8.3 shows the distribution functions of X_1 and X_2 at each stress level. One can see a shift of the plots to the left as the stress increases, indicating a shorter time to failure at higher stress. Figure 8.4 presents the distribution functions of X_1 and X_2 at use conditions, for different degrees of dependence between the failure modes: from independence in the upper-left corner to strong dependence in the lower-right corner. Even though a logarithmic scale is used, one can see the difference that the degree of dependence makes in estimating the true distribution functions at use conditions. A clearer visualization of the above results may be obtained by analyzing the industrial heaters data presented by Nelson¹ (p. 420). These data are collected from a temperature-accelerated life test of industrial heaters, which have two failure modes, open and short. A Bayesian Arrhenius-exponential model for the random variables U_1 and U_2 is used (see Bunea and Mazzuchi¹⁴) and the dependence structure is modeled by an Archimedean copula, given the rank correlation coefficient. Table 8.2 presents the estimated failure rates of U_1 and U_2 at each stress level and at use conditions. Figure 8.5 shows the distribution functions of X_1 and X_2 at use conditions. The sensitivity of the estimated distributions with respect to the degree of dependence is more obvious in this case.
Fig. 8.3. Distribution functions for each failure mode, at each temperature (stress) level (motorettes data).

Table 8.2. Estimated failure rates of U_1 and U_2 at each stress level and at use conditions (industrial heaters data).

Temperature   λ_{U_1}       λ_{U_2}
1150 F        4.1075E-04    6.0985E-05
1600 F        6.5287E-04    6.8595E-04
1675 F        8.0000E-04    7.4048E-04
1750 F        1.0865E-03    1.3333E-03
1820 F        3.9393E-03    4.000E-03
Fig. 8.4. Distribution functions for each failure mode, at use conditions (motorettes data); different degrees of dependence are considered.

8.5. Conclusions

Several ALT-competing risks models have been studied in this paper. Classical independent and dependent models fail to suit all types of ALT data. Simple plots of experimental data rejected the applicability of such models, especially when the relative dominance of the competing failure modes changes as the applied stress increases. A new dependent model based on a copula has been proposed. The results showed a high sensitivity with respect to the degree of dependence between the competing failure modes. Thus, the "independent failure modes" assumption can lead to wrong results when the risks are actually highly correlated. The disadvantage of this method is the large number of measures of association that need to be estimated by experts (one for each stress level and one for use conditions).
Fig. 8.5. Distribution functions for each failure mode, at use conditions (industrial heaters data); different degrees of dependence are considered.

References

1. W. Nelson, Accelerated Testing, John Wiley and Sons, New York, USA (1990).
2. J. I. McCool, Competing risk and multiple comparison analysis for bearing fatigue tests, ASLE Transactions 21, 271-284 (1978).
3. J. P. Klein and A. P. Basu, Weibull accelerated life tests when there are competing causes of failure, Communications in Statistical Methods and Theory A10, 2073-2100 (1981).
4. J. P. Klein and A. P. Basu, Accelerated life testing under competing exponential failure distribution, IAPQR Transactions 7, 1-20 (1982).
5. A. Nadas, A graphical procedure for estimating all parameters of a life distribution in the presence of two dependent death mechanisms, each having a lognormally distributed killing time, personal communication (1969).
6. P. Hougaard, Analysis of Multivariate Survival Data, Springer, New York (2000).
7. C. Bunea and T. J. Bedford, The Effect of Model Uncertainty on Maintenance Optimization, IEEE Transactions on Reliability 51 (4), 486-493 (December 2002).
8. R. M. Cooke, The design of reliability databases, Parts I and II, Reliability Engineering and System Safety 51, 137-146 and 209-223 (1996).
9. T. J. Bedford and R. M. Cooke, Probabilistic Risk Analysis: Foundations and Methods, Cambridge University Press (2001).
10. H. Langseth and B. Lindqvist, A maintenance model for components exposed to several failure mechanisms and imperfect repair, Mathematical and Statistical Methods in Reliability, Series on Quality, Reliability and Engineering Statistics, pp. 415-430, World Scientific (2003).
11. C. Bunea, R. M. Cooke and B. Lindqvist, Competing risk perspective on reliability databases, Mathematical and Statistical Methods in Reliability, Series on Quality, Reliability and Engineering Statistics, pp. 355-370, World Scientific (2003).
12. M. Zheng and J. P. Klein, Estimates of marginal survival for dependent competing risks based on an assumed copula, Biometrika 82, 127-138 (1995).
13. A. M. H. Meeuwissen and T. Bedford, Minimally informative distribution with given rank correlation for use in uncertainty analysis, Journal of Statistical Computation and Simulation 57, 143-174 (1997).
14. C. Bunea and T. A. Mazzuchi, Bayesian Accelerated Life Testing under Competing Failure Modes, RAMS Conference Proceedings, Alexandria, USA (January 2005).
CHAPTER 9 ESTIMATING MEAN CUMULATIVE FUNCTIONS FROM TRUNCATED AUTOMOTIVE WARRANTY DATA
S. CHUKOVA School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, New Zealand E-mail:
[email protected]

J. ROBINSON General Motors R&D Center, Warren, Michigan, USA E-mail:
[email protected]

This article deals with a type of truncation that occurs with typical automotive warranties. Warranty coverage and the resulting claims data are limited by mileage as well as age. Age is known for all sold vehicles all the time, but mileage is only observed for a vehicle with a claim and only at the time of the claim. Here we deal with the univariate case, taking either age or mileage as the usage measure. We evaluate the mean cumulative number of claims or cost of claims and its standard error as functions of the usage measure. Within a nonparametric framework, we extend the usual methods in order to account for the fact that odometer readings are available only for a vehicle with a claim and only at the time of the claim. We illustrate the ideas with real data on four cases based on whether the usage measure is age or miles and whether the results are adjusted for withdrawals from warranty coverage. We also note that these adjustments can be further refined by taking into account the effects of reporting delay.
9.1. Introduction
Data from warranty systems are of interest to manufacturers for several reasons. Paid claims are a cost of doing business and a liability incurred by the manufacturer at the time of sale. For these reasons forecasting warranty
expenses is of interest. Also, warranty data provide information about the durability of products in the field and are therefore of interest to engineers. Furthermore, warranty coverage can be regarded as a product attribute that affects buying decisions. Manufacturers may wish to change warranty coverage to attract more buyers, and they would want to estimate how much the changes would cost. See Robinson and McDonald¹ for more discussion on these points.

Here we emphasize automotive warranties as they are typically offered in the U.S.A. The warranty guarantees free repairs subject to both age and mileage limits. The most common limit is now thirty-six months or thirty-six thousand miles, whichever comes first. Age is known all the time for all sold vehicles because sales records are retained. But odometer readings are only collected in the dealership at the time of a claim. So automotive warranties are a case where two usage measures are of interest and information on one of them is incomplete.

Automobiles are repairable systems, and warranty claims can be thought of as recurrent events associated with the system. We take this approach, although for a small enough range of warranty labor operations the likelihood of duplicate claims on the same vehicle is small enough that methods based on lifetimes (time to first warranty claim) may be appropriate. We also prefer a nonparametric approach because sample sizes are large. We wish to consider two usage measures, one of which is incompletely observed. Finally, we want to deal explicitly with the fact that the events are coming from a warranty plan with specific restrictions, i.e. repair events that occur outside the limitations of the warranty plan will not be included in the database.

Even considering these restrictions there is some relevant past literature. The model and estimation procedure we discuss later is an extension of the "robust estimator" presented by Hu and Lawless.² Their approach extends the "standard" estimator as reported in Nelson³ to account for, in our case, the probability of a vehicle not being eligible to generate a warranty claim because it has exceeded its warranty coverage limits. Lawless, Hu and Cao⁴ also explicitly deal with our censoring problem and specify a simple linear automotive mileage accumulation model, which we use here to estimate the current mileage of a vehicle for which we have a previous odometer reading. They also present a family of semi-parametric models to relate event times to mileage accumulation. Lawless⁵ presents a survey and some extensions, including dealing with the bias caused by the reporting delay of warranty claims. This issue was also previously discussed in Lawless and Nadeau⁶ and in Kalbfleisch, Lawless and Robinson.⁷
A finite-population correction to the variance estimate is discussed in Robinson.⁸ In addition to their other work mentioned, Hu and Lawless⁹ also developed a general framework for handling the type of censoring considered here, making use of supplementary information. Other relevant methods applicable to warranty data are also found in Blischke and Murthy.¹⁰

9.2. The Hu and Lawless Model

For the most part we utilize notation from Hu and Lawless² and adopt the generally reasonable assumption that at the beginning of their lifetime, cars accumulate mileage approximately linearly with their age. Let n_i(t) be the number of claims (or the cost associated with those claims) at time t for vehicle i. It will be convenient, and not restrictive, to think of time as discrete, i.e. t = 1, 2, .... Let N_i(t) be the accumulated number of claims (or cost) up through and including time t for vehicle i. "Time" here will be either the age or the mileage of the vehicle, not calendar time. Suppose M such vehicles have been under observation, and their records are part of the warranty database. Sometimes it will be convenient to think of these M vehicles as a sample from a larger population of vehicles, some of which have not yet been sold. Let τ_i be the "time" that unit i has been under observation. The exact definition of the τ's will depend on whether time is age or mileage.

We obtain an estimator Λ̂(t) of the population mean cumulative function Λ(t) = E N_i(t). For the discrete time case, the incremental rate function is λ(t) = Λ(t) - Λ(t - 1) with the initial condition Λ(0) = 0. Let δ_i(t) = I(τ_i ≥ t) be the indicator of whether car i is under observation at time t. Then

n_·(t) = Σ_{i=1}^{M} δ_i(t) n_i(t)

is the total number of claims (or cost) observed at time t for all M vehicles. We note that δ_i(t) may be unknown for some of the cases we discuss later, but the product δ_i(t) n_i(t) is always known. If the observation times are independent of the event process, the rate function is estimated by

λ̂(t) = n_·(t) / (M P(t)),   (1)

where P(t) is the probability that a vehicle of age t is eligible to generate a claim. This is the "robust estimator" stated in Hu and Lawless.² Their
P(t) is a right-hand tail probability of an age distribution and is assumed known. We will need to estimate P(t). It will be more convenient to think in terms of M(t) = M P(t), the expected number of vehicles eligible to generate a claim at time t. Denote its generic estimator by M̂(t) = M P̂(t). Then we replace (1) by

λ̂(t) = n_·(t) / M̂(t).   (2)
The associated mean cumulative function estimator is

Λ̂(t) = Σ_{s=1}^{t} λ̂(s).   (3)
Under mild conditions, Hu and Lawless² show asymptotic normality of Λ̂(t) (assuming known M(t)) with a standard error given by the square root of

v̂ar[Λ̂(t)] = Σ_{i=1}^{M} { Σ_{s=1}^{t} δ_i(s) [n_i(s) - λ̂(s)] / M̂(s) }².   (4)
We will compute M̂(t) from the warranty data itself and substitute this estimate into (4) to obtain our standard errors. For the general cases considered next the τ_i's are also unknown and estimated from the data, although it will not always be necessary to write down the estimators explicitly.
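A compact sketch of the estimators (2)-(4) on a made-up claim history is given below. M̂(t) is taken here as the simple count of vehicles still under observation, and the variance follows the robust form assumed above for Eq. (4); the vehicle records are invented for illustration.

```python
# Compact sketch of the rate/MCF estimators of Sec. 9.2 with a made-up claim history.
import numpy as np

tau = np.array([450, 300, 500, 620, 700])          # days under observation, 5 vehicles
claims = {0: [120, 400], 1: [90], 3: [250, 250, 600], 4: [30]}   # claim ages per vehicle

T = int(tau.max())
M = len(tau)
n_it = np.zeros((M, T + 1))                        # n_i(t): claims of vehicle i at age t
for i, ages in claims.items():
    for a in ages:
        n_it[i, a] += 1

delta = (np.arange(T + 1)[None, :] <= tau[:, None]).astype(float)   # delta_i(t) = I(tau_i >= t)
M_hat = delta.sum(axis=0)                          # vehicles eligible at each age
lam_hat = (delta * n_it).sum(axis=0) / np.maximum(M_hat, 1)          # Eq. (2)
Lambda_hat = np.cumsum(lam_hat)                    # Eq. (3)

resid = delta * (n_it - lam_hat[None, :]) / np.maximum(M_hat, 1)[None, :]
var_hat = np.cumsum(resid, axis=1) ** 2            # per-vehicle cumulative residuals, squared
se_hat = np.sqrt(var_hat.sum(axis=0))              # assumed form of Eq. (4)

print(Lambda_hat[-1], se_hat[-1])
```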
9.3. Extensions of the Model

9.3.1. "Time" is age case

First, we consider "time" to be the age of the vehicle. If we ignore withdrawals from coverage due to mileage, then the number of vehicles eligible to generate a claim at age t is simply the number of vehicles of age t or older, i.e.,

M̂(t) = Σ_{i=1}^{M} I(a_i ≥ t),
where a_i = τ_i is the age of vehicle i. This estimator correctly characterizes how warranty claim rates and costs actually accrue as a function of age, and it may be useful for prediction. However, it is influenced by two factors. The first is the inherent claim rate as a function of age. The second is the rate of accumulation of mileage, because this affects how long vehicles remain in warranty coverage. To illustrate the point, suppose hypothetically that warranty claims occur
due to age and not miles, and suppose that drivers begin to drive more. The inherent reliability of the population of vehicles would not change, but the warranty claim rate would go down because more vehicles would drop out of coverage sooner due to the mileage limit. To get at this "true" warranty claim rate, we will adjust for the fact that some vehicles leave coverage by exceeding the mileage limit. Here and later the adjustment will always be to M̂(t). To capture this rate, the "time" under observation is now defined as τ_i = min(a_i, y_i), where y_i is the age in days at which the vehicle exceeds (or would exceed) the mileage limit of l_m = 36,000 miles and a_i is its current age. Subsequently, it will be necessary to differentiate vehicles that have had warranty claims from those that have had no claims. Denote by M_1 the number of vehicles that have had at least one claim (and therefore some reported mileage history) and by M_2 the number of vehicles that have had no claims, where M_1 + M_2 = M.
Since odometers are not monitored continuously, y_i is not known even for vehicles that have had a claim. Based on our assumption that the accumulation of miles is approximately linear with age, for a vehicle that has had at least one claim we simply extrapolate linearly using the age and mileage at the time of the most recent claim. For vehicles with claims, let r_i = o_i/c_i, where c_i and o_i are the age and mileage of vehicle i at the time of the latest claim. Then r_i is the mileage accumulation rate in miles per day for vehicle i. For a target age t, vehicle i counts in M̂(t) if it is old enough and if its mileage at age t is estimated to have been within the mileage limit l_m. Its contribution to M̂(t) is I(a_i ≥ t) I(r_i ≤ l_m/t). Figure 9.1a illustrates this graphically. It depicts a miles (in thousands) by age (in months) grid. The large square represents the warranty coverage limits. The little stars represent the mileages and ages at the most recent claim for four hypothetical vehicles. The small squares at the end of each line are the extrapolated current mileages of the vehicles at their current age. Two of the vehicles are older than the target age t_1, but one of them is estimated to have left warranty coverage due to mileage just prior to reaching age t_1, and would not be counted when computing M̂(t) adjusted for withdrawals due to the mileage limit. This accounts for the vehicles that have had a claim, but we must also account for the ones that have not had a claim. In practice there may be thousands of vehicles that have experienced at least one claim, so we make use of the empirical distribution function of the mileage accumulation rates.
Fig. 9.1. (a) "Time" is age. (b) "Time" is mileage.
Denote this distribution by

F̂(r) = (1/M_1) Σ_{i=1}^{M_1} I(r_i ≤ r),

and take it to be an estimate of the mileage accumulation rate distribution for vehicles that have not yet experienced a claim. The probability of a typical vehicle remaining in coverage at age t is F̂(l_m/t), and its contribution to M̂(t) is as shown in the second row of Table 9.1. We note here that using the mileage accumulation distribution in this way amounts to assuming that vehicles with one or more claims have mileage accumulation rates similar to those of vehicles that have not experienced a claim. Early in the model year, particularly, selection bias could cause this assumption to be violated, say if high-mileage vehicles are more likely than the general population to generate a claim. Furthermore, for vehicles that never experience a warranty claim, the assumption cannot be checked without resorting to external (nonwarranty) data sources. Such sources exist, e.g. quality surveys and maintenance records, but they must be cross-referenced with the warranty database in order to be useful. We have not attempted to use supplementary mileage information in this article, but later we do examine the impact of using mileage data from a previous model year as well as trends over time with evolving mileage records. From our limited studies, these effects appear to be small.

Reporting delays occur with warranty data when there is a delay between when a warranty event occurs in the dealership and when the claim is
posted to the database. Rejected claims that are later resolved are one reason for these delays. Such delays result in undercounting claims and their corresponding claim rates and costs. An adjustment for removing the effects of reporting delays was reported by Lawless and Nadeau⁶ and Kalbfleisch, Lawless, and Robinson.⁷ Like the case of withdrawals from coverage due to mileage, the adjustment is to M̂(t). Furthermore, this adjustment can be used in combination with the adjustment for mileage. When a claim is posted to the database, the date that it was posted as well as the date of the repair order in the dealership are known. The difference between these two dates is the claim delay (perhaps zero), and every posted claim has a delay associated with it. Thus we can compute from the claims database an empirical distribution function,

Ĝ(d) = (1/N_c) Σ_{j=1}^{N_c} I(d_j ≤ d),
where d_j is the delay in days for claim j and N_c is the total number of claims in the database. As with the mileage accumulation distribution, we may in some situations want to compute this empirical cumulative distribution function from a previous model year. If we consider the eligibility of vehicle i, now of age a_i, to have generated a claim at age t ≤ a_i, then such a claim, if it occurred, has had d = a_i - t days to have been posted. The estimated probability that a claim is posted in that length of time is Ĝ(a_i - t). Here we are assuming independence between the process generating the claim delays and that generating the mileage accumulation, which is reasonable. To adjust M̂(t) for claim delay we multiply the contributions of vehicle i by these probabilities. Note that this reduces M̂(t) and increases the rate function, which is the expected effect. Table 9.1 summarizes the details of the contributions to M̂(t) for all cases.

Table 9.1. Contribution to M̂(t) for vehicle i at target age t.

Case                                     Vehicle with at least one claim          Vehicle with no claims
Unadjusted                               I(a_i ≥ t)                               I(a_i ≥ t)
Adjusted for mileage                     I(a_i ≥ t) I(r_i ≤ l_m/t)                I(a_i ≥ t) F̂(l_m/t)
Adjusted for claim delay                 I(a_i ≥ t) Ĝ(a_i - t)                    I(a_i ≥ t) Ĝ(a_i - t)
Adjusted for mileage and claim delay     I(a_i ≥ t) I(r_i ≤ l_m/t) Ĝ(a_i - t)     I(a_i ≥ t) F̂(l_m/t) Ĝ(a_i - t)

9.3.2. "Time" is miles case

If warranty claim rates are more closely related to mileage than to age, then we may wish to analyze by miles. This could occur in some engineering applications where we would expect warranty incidents to occur more as a result of usage than of age, for example suspension parts, bearings and other components that cycle with mileage. On the other hand, corrosion and paint deterioration are generally treated as age-related. In general it is difficult to identify the dominant damage parameter for a particular subsystem due to widely varying customer usage patterns.
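The following sketch evaluates the Table 9.1 contributions for a handful of made-up vehicle records; the mileage limit, ages and claim histories are hypothetical, and the empirical distributions F̂ and Ĝ are built directly from those records.

```python
# Sketch of the Table 9.1 contributions to M_hat(t) ("time" is age), on made-up records.
import numpy as np

LM = 36_000.0                                    # mileage limit l_m (miles)
ages = np.array([400., 650., 900., 300., 820.])  # current ages a_i (days)
has_claim = np.array([True, True, True, False, False])
rate = np.full(len(ages), np.nan)                # r_i = o_i/c_i from the latest claim
rate[has_claim] = np.array([14_000., 21_000., 33_000.]) / np.array([350., 500., 700.])
delays = np.array([0., 3., 11., 25., 40., 8.])   # posting delays of all claims (days)

rates_obs = rate[has_claim]
F_hat = lambda r: float(np.mean(rates_obs <= r))                    # ECDF of accumulation rates
G_hat = lambda d: float(np.mean(delays <= d)) if d >= 0 else 0.0    # ECDF of claim delays

def M_hat(t, adjust_mileage=True, adjust_delay=True):
    total = 0.0
    for a, hc, r in zip(ages, has_claim, rate):
        if a < t:                                 # I(a_i >= t)
            continue
        if adjust_mileage:
            contrib = float(r <= LM / t) if hc else F_hat(LM / t)
        else:
            contrib = 1.0
        if adjust_delay:
            contrib *= G_hat(a - t)               # G_hat(a_i - t)
        total += contrib
    return total

print(M_hat(365.0), M_hat(365.0, adjust_mileage=False, adjust_delay=False))
```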
Throughout the remainder of the paper we will use the argument "m" for mileage. The fact that the exact mileage is unknown except at the time of a claim complicates the calculation of M̂(m), but the same linear extrapolation model and mileage accumulation distribution used in the previous subsection produce reasonable results. The numerators in (2) are available because they are the number of claims or cost for vehicle i at mileage m and are recorded in the database. For the unadjusted case the "time" under observation for vehicle i is τ_i = m_i, the current mileage. As before, it will be convenient to think of mileage as a discrete variable. The current mileage is not known exactly, even for vehicles with claims, but it can be estimated by a_i r_i, where a_i is the current age and r_i is the mileage accumulation rate based on the most recent claim. So for a typical vehicle having at least one claim, the contribution to M̂(m) at a target mileage m is estimated to be I(r_i ≥ m/a_i).
Similarly, for a vehicle without any claims, the contribution is 1 - F̂(m/a_i), representing the likelihood that the vehicle has attained m miles. To adjust M̂(m) for vehicles leaving coverage due to age, we want to ensure that the target mileage m is reached before the warranty age limit, say l_a. Thus, we replace a_i in the unadjusted case by min(a_i, l_a). These ideas are illustrated graphically in Figure 9.1b, where two vehicles are estimated to have exceeded the target mileage m_1, but one of them did so after leaving coverage due to age. The contributions to M̂(m) are summarized in the second row of Table 9.2.

The calculations for incorporating reporting delay are more cumbersome for the miles case, but still straightforward. If vehicle i, now of age a_i, has experienced a claim, its age at mileage m is estimated by m/r_i. If a claim had occurred on that day, it has had a_i - m/r_i days to be posted. Therefore, we multiply the previous contributions by Ĝ(a_i - m/r_i) to account for claim delay.

Table 9.2. Contribution to M̂(m) for vehicle i at target mileage m.

Case                                 Vehicle with at least one claim               Vehicle with no claims
Unadjusted                           I(r_i ≥ m/a_i)                                1 - F̂(m/a_i)
Adjusted for age                     I(r_i ≥ m/min(a_i, l_a))                      1 - F̂(m/min(a_i, l_a))
Adjusted for claim delay             I(r_i ≥ m/a_i) Ĝ(a_i - m/r_i)                 Σ_{j=2}^{a_i} f̂(m, j) Ĝ(a_i - j)
Adjusted for age and claim delay     I(r_i ≥ m/min(a_i, l_a)) Ĝ(a_i - m/r_i)       Σ_{j=2}^{min(a_i, l_a)} f̂(m, j) Ĝ(a_i - j)
If the vehicle has not had a previous claim, no mileage rate is available, and we must consider all previous ages. If mileage m was achieved at age j, where j ≤ a_i, then the mileage accumulation rate is m/j. The corresponding empirical probability mass function of the mileage accumulation is

f̂(m, j) = F̂(m/(j - 1)) - F̂(m/j),   for j = 2, 3, ....
If a claim had occurred at vehicle age j, it has had (a_i - j) days to be posted. Thus, the contribution to M̂(m) for vehicle i is as shown in the third row of Table 9.2. To adjust for withdrawals due to age as well as reporting delay, the sum is restricted to j ≤ min(a_i, l_a), the minimum of the vehicle's current age and the warranty age limit, as shown in the last row of Table 9.2.
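For a vehicle with no claims, the contribution just described is a sum over the possible ages j at which mileage m could have been reached. The sketch below evaluates that sum; the observed rates and delays feeding the empirical distributions are made up.

```python
# Sketch of the Table 9.2 contribution for a no-claims vehicle at target mileage m,
# summing over the possible ages j at which m could have been reached (made-up ECDFs).
import numpy as np

rates_obs = np.array([20., 28., 35., 42., 55., 66.])    # observed miles/day from claims
delays = np.array([0., 3., 11., 25., 40., 8.])           # observed posting delays (days)

F_hat = lambda r: float(np.mean(rates_obs <= r))
G_hat = lambda d: float(np.mean(delays <= d)) if d >= 0 else 0.0

def no_claim_contribution(m, age_i, age_limit=None, adjust_delay=True):
    """Sum_j f_hat(m, j) * G_hat(age_i - j), with f_hat(m, j) = F_hat(m/(j-1)) - F_hat(m/j)."""
    j_max = int(age_i if age_limit is None else min(age_i, age_limit))
    total = 0.0
    for j in range(2, j_max + 1):
        f_mj = F_hat(m / (j - 1)) - F_hat(m / j)          # prob. m first reached at age j
        total += f_mj * (G_hat(age_i - j) if adjust_delay else 1.0)
    return total

print(no_claim_contribution(12_000.0, age_i=500, age_limit=3 * 365))
```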
9.4. Example

9.4.1. The "P-claims" dataset
We illustrate the methods on a set of actual warranty data with 44,890 records taken from model year 2001 vehicles sold mainly in calendar years 2000 and 2001. We examine the warranty claims on one major system of the vehicle, which is not identified; it will be referred to as "System P". Table 9.3 summarizes the dataset. To illustrate our points we created versions of the dataset as it would have existed at four different "cuts" in time: Jan. 1, 2001; Jan. 1, 2002; Jan. 1, 2003; and the actual "cut" date for our original dataset, Oct. 24, 2003. These are displayed in Table 9.3 along with the descriptive statistics of the datasets up through the respective dates. For proprietary reasons a few vehicles were randomly selected and their records were deleted from the original dataset. Also, the costs have been re-scaled. These precautions do not affect the authentic nature of the data. Note that the median reporting delay is around 11 days. Fewer than half of the vehicles generate one or more claims (of any type), even for the final cut when the median age is 982 days. At the first cut only 6% of the vehicles have had a claim. The median mileage accumulation rate is around 40 miles per day and declines slightly over calendar time. This is more than the rate of 33 miles per day, which corresponds roughly to exhausting the 36,000-mile limit in exactly three years. So most cars leave coverage due to mileage. Also, we should emphasize that we use mileage information from any claim, not just those from System P.

Table 9.3. Summary of 2001 warranty file.
9.4.2. Examples for the "time" is age case
In this section and the next, we illustrate the calculations for cost per car. The results are similar for the number of claims per car.
Fig. 9.2. (a) Unadjusted Λ̂(t) and Λ̂(t) adjusted for mileage. (b) Unadjusted Λ̂(t) with 95% CL and Λ̂(t) adjusted for mileage.
Figure 9.2a illustrates the effect of the adjustment for withdrawals from coverage due to mileage, shown for the most recent time cut. Without further analysis, the bend in the unadjusted curve could be attributed to a decline in warranty claim rates with age. But the adjusted curve indicates that the "true" rate is not declining; the bend is caused by vehicles leaving coverage due to mileage. Figure 9.2b superimposes 95% confidence limits on the unadjusted curve, roughly indicating that the differences due to the adjustment are statistically significant. Figure 9.3a shows the adjusted curves for all four time cuts to illustrate how the results would unfold over calendar time. We see few differences despite the fact that the mileage accumulation distribution trends slightly toward lower rates over calendar time. To investigate the impact of the mileage accumulation rate distribution F(r), its estimate F̂(r) was computed using data from 2000 and again using data from 2001. Figure 9.3b contrasts the adjustments at the third time cut, Jan. 1, 2003, using these two estimates of F̂(r). The differences are small.
Fig. 9.3. (a) Λ̂(t) adjusted for mileage at four time cuts. (b) The impact of F̂(r) on Λ̂(t) adjusted for mileage.
The adjustment for claim delay is not substantial for our dataset except for the first time cut, Jan. 1, 2001, when vehicles are relatively young, and a typical claim has had little time to be posted. Figure 9.4a illustrates the impact.
Fig. 9.4. (a) Unadjusted Λ̂(t) with 95% CL and Λ̂(t) adjusted for claim delay. (b) Unadjusted Λ̂(m) with 95% CL and Λ̂(m) adjusted for age.
Also, the adjustment for mileage has little impact early in the model year because few vehicles have left coverage, so we are not able to demonstrate the effects of adjusting simultaneously for mileage and claim delay.
9.4.3. Examples for the "time" is miles case
Figure 9.4b shows the cumulative cost per car by mileage for the unadjusted case (with 95% confidence limits) along with the adjustment for withdrawal from coverage due to age. Unlike the "time" is age case, the adjustment is essentially indistinguishable. Moreover, the simultaneous adjustment for age withdrawals and claim delay also makes almost no difference (not shown). The reason is that in our dataset relatively few vehicles, only about 38%, leave coverage due to age. Also, the age limit adjustment does not begin until the oldest car reaches the age limit of warranty coverage, three years from the first sale for our dataset. This is in contrast to the mileage adjustment for the "time" is age case, where the adjustment begins to have an effect when the oldest car might possibly reach the mileage limit. Our mileage accumulation rate distribution has 2% of the vehicles reaching 36,000 miles in one year, and 10% reaching the limit in a year and a half. Other calculations for the mileage case are similar to the age case and are not shown here.

9.5. Discussion
We have discussed extensions to the Hu and Lawless² "robust" estimator for the mean cumulative function and its associated standard error. In particular, without requiring any supplemental source of data for mileage accumulation, we deal with the problem of incomplete mileage information as it typically occurs in automotive warranty data. We also discussed how to combine these methods with previously reported methods for dealing with reporting delay. We have been motivated primarily by practical concerns and have concentrated on stating, and computing on real data, various "adjustments" to the rate function and its standard error. In all cases the adjustments are with respect to the number of cars at risk. We have viewed the problem from a sampling perspective even though it could more precisely be described as a prediction problem; that is, the first vehicles sold are treated as a sample from the "population" of vehicles that eventually will be sold. Also, the population is taken here as infinite, but it would be straightforward to apply a finite population correction, as in Robinson,⁸ to the standard error calculation.

There are many issues related to the adjustments of M̂(t) (or M̂(m), respectively). Hu and Lawless² demonstrate asymptotic normality and provide a consistent estimator of the asymptotic variance under appropriate
conditions. The key assumption is that the process that determines the time under observation is independent of the event process. Here the time under observation is determined by the process of mileage accumulation. It is difficult to justify or test this independence assumption in practice, so we recommend caution due to the possibility of bias. Using mileage data from the previous model year is one precaution that we have employed, and for our data so far it makes little difference. Also, we conjecture that proper asymptotic results can be developed for the estimators described here under "reasonable" conditions, but we have made no attempt to do that.

The simple linear mileage accumulation model used here is convenient, but it is a first approximation. It only uses the last observed claim to calculate mileage accumulation rates, and it does not account for changes in rates as vehicles age. Also, we do not distinguish a mileage rate calculated from an older vehicle from one calculated from a younger vehicle, even though the one from the older vehicle should be less variable. In our example calculations we excluded rates from claims with vehicle ages less than 30 days to deal with this issue. Certainly more elaborate models for mileage accumulation are possible if the data to support them are available.

We end on a technological note. It is certainly feasible now for vehicles to be equipped with sensors that would transmit mileage and other relevant information to the manufacturer at all times. Interestingly, the additional mileage information would eliminate the need for some of the approximations that were discussed in this paper. The primary obstacles are cost and privacy issues, not technology.
References

1. J. Robinson and G. McDonald, Issues related to field reliability and warranty data, in Data Quality Control: Theory and Pragmatics, Ed. G. E. Liepins and V. R. R. Uppuluri, p. 69, Marcel Dekker, New York (1991).
2. X. Hu and J. Lawless, Estimation of rate and mean functions from truncated recurrent event data, J. Amer. Statist. Assoc. 91, 300 (1996).
3. W. Nelson, Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications, ASA-SIAM, Philadelphia (2003).
4. J. F. Lawless, J. Hu, and J. Cao, Methods for the estimation of failure distributions and rates from automobile warranty data, Lifetime Data Anal. 1, 227-240 (1995).
5. J. Lawless, Statistical analysis of product warranty data, Int. Stat. Rev. 66, 41 (1998).
6. J. Lawless and J. Nadeau, Some simple robust methods for the analysis of recurrent events, Technometrics 37, 158 (1995).
7. J. Kalbfleisch, J. Lawless, and J. Robinson, Methods for the analysis and prediction of warranty claims, Technometrics 33, 273 (1991).
8. J. Robinson, Standard errors for the mean number of repairs on systems from a finite population, in Recent Advances in Life-Testing and Reliability, Ed. N. Balakrishnan, p. 195, CRC Press, Boca Raton, London, Tokyo (1995).
9. X. Hu and J. Lawless, Estimation from truncated lifetime data with supplementary information on covariates and censoring times, Biometrika 83, 747 (1996).
10. W. R. Blischke and D. N. P. Murthy, Product Warranty Handbook, Marcel Dekker, New York (1996).
CHAPTER 10 TESTS FOR SOME STATISTICAL HYPOTHESES FOR DEPENDENT COMPETING RISKS-A REVIEW
ISHA DEWAN Indian Statistical Institute New Delhi - 110016, India E-mail:
[email protected]
J. V. DESHPANDE Statistics Department, Poona University, Pune, India E-mail:
[email protected]
Competing risks data consists of time to failure and cause of failure. Suppose that the underlying risks are dependent. We review distribution-free tests for bivariate symmetry against alternatives involving dominance of cause specific hazard rate, subdistribution functions and subsurvival functions. We also review tests for independence of time to failure and cause of failure. Many of the statistics that were originally proposed in the context of independent risks continue to be useful for testing similar hypotheses regarding dependent risks.
10.1. Introduction The competing risks situation arises in life studies when a unit is subject t o many, say k, modes of failure and the actual failure, when it occurs, can be ascribed to a unique mode. These k modes are also called the k risks t o which the unit is exposed, and as they all seemingly compete for the life of the unit, the term ‘competing risks’ is used to describe it. Suppose that the continuous positive valued random variable T represents the lifetime of the unit and S taking values 1 , 2 , . . . , k represents the risk which caused the failure of the unit. The joint probability distribution of (T,S) is specified by the set of k 137
subdistribution functions F(i, t) = P[T ≤ t, δ = i], or equivalently by the subsurvival functions S(i, t) = P[T > t, δ = i], i = 1, 2, ..., k. Let H(t) and S(t), respectively, denote the distribution function and the survival function of T. Let f(i, t) denote the subdensity function corresponding to the ith risk. Then the density function of T is h(t) = Σ_{i=1}^{k} f(i, t), H(t) = Σ_{i=1}^{k} F(i, t), S(t) = Σ_{i=1}^{k} S(i, t), and p_i = F(i, ∞) is the probability of failure due to the ith risk.

A commonly used description of the competing risks situation is the latent failure time model. Let X_1, X_2, ..., X_k be the latent failure times of any unit exposed to k risks, where X_i represents the time to failure if cause i were the only cause of failure present. The observable random variables are still the time to failure T = min(X_1, X_2, ..., X_k) and the cause of failure δ = j if X_j = min(X_1, X_2, ..., X_k). If X_1, X_2, ..., X_k are independent, then their marginal distributions carry all the probabilistic information regarding the model with k risks. It is easily seen that the marginal, and hence the joint, distribution is identifiable from the probability distribution of the observable random variables (T, δ). However, in general when the risks are not independent, neither the joint distribution of the X's nor their marginals are identifiable from the probability distribution of (T, δ) (Tsiatis¹, Crowder²). Hence, the independence or otherwise of the latent lifetimes (X_1, X_2, ..., X_k) cannot be statistically tested from any data collected on (T, δ). The independence of (X_1, X_2, ..., X_k) has to be assumed on the basis of a priori information, if any. Also, the marginal distribution functions may not represent the probability distribution of lifetimes in any practical situation: elimination of one or more of the risks may change the environment in such a way that the marginal distributions under the remaining risks are no longer an appropriate probability model.

In view of the above considerations, unless one can assume independence, it is necessary to suggest appropriate models, develop methodology and carry out the data analysis in terms of the observable random variables (T, δ) alone. Kalbfleisch and Prentice³,⁴ proposed methods for analyzing competing risks data in terms of cause-specific hazard rates

λ(i, t) = lim_{Δt→0} (1/Δt) P(t < T ≤ t + Δt, δ = i | T > t) = f(i, t)/S(t).

Dewan and Kulathinal⁵ have considered parametric models for subsurvival functions by assuming a suitable parametric form for the cause-specific hazard rates. Aly et al.,⁶ Kochar,⁷ and Sun and Tiwari⁸ have
the problem of testing for equality of cause-specific hazard functions. Deshpande,⁹ Aras and Deshpande,¹⁰ and Deshpande and Dewan¹¹ have considered the problem of analyzing competing risks data by using the subdistribution and subsurvival functions. Recently, Dewan et al.¹² have proposed tests for the independence of T and δ.
10.2. Locally Most Powerful Rank Tests
Suppose k = 2, that is, a unit is exposed to two risks of failure, denoted by 1 and 0. When n units are put on trial, the data consist of (T_i, δ_i*), i = 1, . . . , n, where δ* = 2 − δ. Suppose we wish to test the hypothesis H₀ : F(1,t) = F(2,t) for all t. First we look at tests based on likelihood theory. Under the null hypothesis the two risks are equally effective at all ages in the prevailing environment; the alternative hypothesis is that the two risks are not equally effective at least at some ages. The likelihood function is given by (see Aras and Deshpande¹⁰)

L(\theta; T, \delta^*) = \prod_{i=1}^{n} [f(1,t_i)]^{\delta_i^*} [f(2,t_i)]^{1-\delta_i^*},   (1)

where T = (T₁, . . . , T_n) and δ* = (δ₁*, . . . , δ_n*). If F(i,t) depends upon a parameter θ, then inference about θ can be based on the above likelihood function. When T and δ* are independent, Deshpande⁹ proposed the model F(1,t) = θH(t), F(2,t) = (1 − θ)H(t), where θ = P[δ* = 1]. Then the likelihood reduces to

L(\theta; T, \delta^*) = \theta^{\sum_{i=1}^{n}\delta_i^*} (1-\theta)^{\sum_{i=1}^{n}(1-\delta_i^*)} \prod_{i=1}^{n} h(t_i).   (2)
Testing the hypothesis F(1,t) = F(2,t) reduces to testing that θ = 1/2. The obvious statistic is then the sign statistic

U_1 = \frac{1}{n}\sum_{i=1}^{n} \delta_i^*.   (3)

nU₁ has a B(n, θ) distribution and there exist optimal estimation and testing procedures based on it. However, if F(1,t) and F(2,t) depend on a parameter θ in a more complicated manner, then one needs to look at locally most powerful rank (LMPR) tests. Let f(1,t) = f(t,θ) and f(2,t) = h(t) − f(t,θ), where h(t) is a known density function and f(t,θ) an incidence density such that f(t,θ₀) = h(t)/2. Let T₍₁₎ ≤ T₍₂₎ ≤ · · · ≤ T₍ₙ₎ denote the ordered failure times, and let

W_i = 1 if T_{(i)} corresponds to the first risk, and W_i = 0 otherwise.   (4)
Let R_j be the rank of T_j among T₁, . . . , T_n. Let R = (R₁, R₂, . . . , R_n) and W = (W₁, W₂, . . . , W_n) denote the vectors of ranks and of indicator functions corresponding to the ordered minima. The likelihood of (R, W) is given by

P(\theta; R, W) = \int \cdots \int_{0 < t_1 < \cdots < t_n} \prod_{i=1}^{n} [f(t_i, \theta)]^{w_i} [h(t_i) - f(t_i, \theta)]^{1-w_i} \, dt_1 \cdots dt_n.
Theorem 10.2.1: If f′(t,θ) is the derivative of f(t,θ) with respect to θ, then the locally most powerful rank test of H₀ : θ = θ₀ against H₁ : θ > θ₀ rejects H₀ for large values of L_n = Σ_{i=1}^{n} W_i a_i, where the scores a_i are given by (5).
Special cases. (i) If Deshpande's model holds with θ₀ = 1/2, then the sign test is the LMPR test. (ii) If f(1,t) and f(2,t) are each one half of a logistic density, one of them shifted by θ > 0, where the logistic density is g(t,θ) = e^{−(t−θ)}/[1 + e^{−(t−θ)}]², then the LMPR test is based on the statistic W⁺ = Σ_{i=1}^{n} W_i R_i, which is the analogue of the Wilcoxon signed rank statistic for competing risks data. (iii) In the case of a Lehmann-type alternative, in which F(1,t) is specified as a power of H(t), the LMPR test is based on the scores a_i = E(E₍ᵢ₎), where E₍ⱼ₎ is the jth order statistic from a random sample of size n from the standard exponential distribution. For more complicated families of distributions, e.g., Gumbel,¹³ the scores are complicated and have to be obtained by numerical integration (see Aras and Deshpande¹⁰).
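To illustrate how these statistics are obtained from data, the following minimal Python sketch (not part of the original text; the function name and the toy data are illustrative assumptions) computes the sign statistic and the Wilcoxon-type statistic W⁺ from a sample of failure times and cause indicators.

```python
import numpy as np

def competing_risk_statistics(times, delta_star):
    """Compute the sign statistic U1 and the Wilcoxon-type statistic W+.

    times      : failure times T_i
    delta_star : delta*_i = 1 if the failure is due to the first risk, 0 otherwise
    """
    times = np.asarray(times, dtype=float)
    delta_star = np.asarray(delta_star, dtype=int)

    # Sign statistic: proportion of failures attributed to the first risk.
    u1 = delta_star.mean()

    # Ranks R_i of each T_i among T_1, ..., T_n (continuous data, so no ties assumed).
    ranks = np.argsort(np.argsort(times)) + 1

    # Wilcoxon-type statistic: sum of ranks of the failures due to the first risk.
    w_plus = np.sum(delta_star * ranks)
    return u1, w_plus

# Illustrative (hypothetical) data: 8 units with failure time and cause indicator.
T = [2.3, 0.7, 1.5, 3.1, 0.9, 2.8, 1.1, 4.0]
d_star = [1, 0, 1, 1, 0, 0, 1, 1]
print(competing_risk_statistics(T, d_star))
```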
10.3. Tests for Bivariate Symmetry
Suppose that the latent failure times X and Y are dependent, with joint distribution function F(x,y). On the basis of n independent pairs (T_i, δ_i*) we want to test whether the forces of the two risks are equivalent against the alternative that the force of one risk is greater than that of the other. That is, we test the null hypothesis of bivariate symmetry
H_0 : F(x,y) = F(y,x) for every (x,y).   (6)
Before we formulate the alternatives of interest let us consider the following theorem which is easy to prove.
Theorem 10.3.1: Under the null hypothesis of bivariate symmetry we have (i) F(1,t) = F(2,t) for all t, (ii) S(1,t) = S(2,t) for all t, (iii) λ(1,t) = λ(2,t) for all t, (iv) P[δ* = 1] = P[δ* = 0], and (v) T and δ* are independent.
In view of the above theorem, the following alternatives to the null hypothesis are worth considering.
H_1 : λ(1,t) < λ(2,t);   H_2 : F(1,t) < F(2,t);   H_3 : S(1,t) > S(2,t).   (7)
All these alternatives say that risk II is more potent than risk I at all ages t in some stochastic sense. Sen¹⁴ considered fixed-sample and sequential tests for the null hypothesis of bivariate symmetry of the joint distribution of (X,Y). The alternatives are expressed in terms of π₁(t) = P[δ* = 1 | T = t], the conditional probability that the failure is due to the first risk given that failure occurs at time t. He derived optimal score statistics for such parametric situations, but these statistics cannot be used without knowledge of the joint distribution F(x,y). We look at various distribution-free test procedures for testing H₀ against the above alternatives. For testing H₀ against H₁ consider

\psi(t) = F(1,t) - F(2,t) = \int_0^t S(u)[\lambda(1,u) - \lambda(2,u)]\,du.   (8)
H₁ holds iff the above function is nonincreasing in t. Consider the following measure of deviation between H₀ and H₁,

\Delta = \iint_{0 < x < y < \infty} [\psi(x) - \psi(y)]\, dF(x)\, dF(y),   (9)
and its empirical estimator Δ_n as the test statistic, where Δ_n is given by

\Delta_n = \iint_{0 < x < y < \infty} [\psi_n(x) - \psi_n(y)]\, dF_n(x)\, dF_n(y),   (10)
where F_{1n}(t) = (1/n) Σ_{j=1}^{n} I(δ_j = 1, T_j ≤ t), F_n(t) = (1/n) Σ_{j=1}^{n} I(T_j ≤ t), and ψ_n(t) = 2F_{1n}(t) − F_n(t) are the empirical estimators of F(1,t), H(t), and ψ(t), respectively. Then

\Delta_n = \frac{1}{n^3}\Big[\frac{n(n^2-1)}{6} - 2\sum_{i=1}^{n}(i-1)(n-i+1)W_i\Big].   (11)
Under H₀, Δ = 0, while under the alternative Δ > 0; large values of the statistic are significant. Rejecting H₀ for large values of Δ_n is equivalent to rejecting it for small values of the statistic
U_2 = \sum_{i=1}^{n-1} i(n-i)\, W_{i+1}.   (12)
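A minimal Python sketch of the computation of U₂ is given below; it is illustrative only, and the standardization uses the null mean and variance that follow from treating the W_i as i.i.d. Bernoulli(1/2) under H₀.

```python
import numpy as np

def u2_statistic(times, delta_star):
    """U2 = sum_{i=1}^{n-1} i(n-i) W_{i+1}, where W_i indicates that the i-th
    ordered failure time corresponds to the first risk."""
    order = np.argsort(times)
    w = np.asarray(delta_star)[order]      # W_1, ..., W_n in time order
    n = len(w)
    i = np.arange(1, n)                    # i = 1, ..., n-1
    return np.sum(i * (n - i) * w[1:])

def u2_standardized(times, delta_star):
    """Standardize U2 with its null moments; these follow from W_i being
    i.i.d. Bernoulli(1/2) under the bivariate-symmetry null hypothesis."""
    n = len(times)
    u2 = u2_statistic(times, delta_star)
    mean = n * (n**2 - 1) / 12.0
    var = n * (n**2 - 1) * (n**2 + 1) / 120.0
    return (u2 - mean) / np.sqrt(var)
```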
The exact null distribution of U₂ can be obtained following the approach of Hettmansperger¹⁵ (p. 35). For n = 5, 6, . . . , 20, the 1% and 5% critical values of U₂, together with the corresponding exact significance levels, are given in Table 10.1.
Table 10.1. Exact critical points for U₂: 1% and 5% critical values and the corresponding exact significance levels, n = 5, 6, . . . , 20.
Under H₀, E[U₂] = n(n² − 1)/12 and Var[U₂] = n(n² − 1)(n² + 1)/120. By the Central Limit Theorem it can be shown that, under H₀, n^{1/2}(U₂/n³ − 1/12) converges in distribution to N(0, 1/120). After a change of the order of integration, Δ can also be expressed as

\Delta = \int_0^{\infty} S^2(x)\, F(x)\, [\lambda_2(x) - \lambda_1(x)]\, dx.   (13)
It is seen from this representation that the test based on U₂ is equivalent to the test V of Yip and Lam,¹⁶ proposed for the case of independent risks without censoring; they did not discuss its small-sample exact null distribution. Deshpande⁹ proposed two tests for testing H₀ versus H₂ on heuristic
grounds. The first test is the Wilcoxon signed rank type statistic

W^{+} = \sum_{i=1}^{n} (1 - \delta_i^*) R_i.   (14)
It is argued that W⁺ will be large when the alternative H₂ is true, there being a greater incidence of the second risk up to any fixed time t. This consideration leads to another test based on the U-statistic

U_3 = \binom{n}{2}^{-1} \sum_{i<j} \phi_3(T_i, \delta_i^*, T_j, \delta_j^*),   (15)

where φ₃ is given by

\phi_3(T_i, \delta_i^*, T_j, \delta_j^*) = \begin{cases} 1 & \text{if } \delta_j^* = 0,\ T_i > T_j, \text{ or } \delta_i^* = 0,\ T_i < T_j,\\ 0 & \text{otherwise.} \end{cases}   (16)
E(U₃) = 1/2 under H₀ and strictly larger than 1/2 under H₂. U₃ is the same as the statistic proposed earlier to test H₀ against the alternative of stochastic dominance of the distribution functions of independent latent failure times (see Deshpande and Dewan¹⁷). It is also consistent for testing bivariate symmetry against dominance of the incidence functions. For testing H₀ against H₂ one can consider the measure of deviation F(2,t) − F(1,t), which is nonnegative under H₂; the corresponding integral

\int_0^{\infty} [F(2,t) - F(1,t)]\, dH(t)   (17)

has the statistic U₃ discussed above as its U-statistic estimator. Similarly, for testing H₀ against H₃, consider the measure of deviation S(1,t) − S(2,t), which is nonnegative under H₃. Then

\int_0^{\infty} [S(1,t) - S(2,t)]\, dH(t) = P[\delta_1^* = 1, T_1 > T_2] - \tfrac{1}{2}.   (18)
Consider the kernel

\phi_4(T_i, \delta_i^*, T_j, \delta_j^*) = \begin{cases} 1 & \text{if } \delta_i^* = 1,\ T_i > T_j, \text{ or } \delta_j^* = 1,\ T_i < T_j,\\ 0 & \text{otherwise,} \end{cases}   (19)

and the corresponding U-statistic is

U_4 = \binom{n}{2}^{-1} \sum_{i<j} \phi_4(T_i, \delta_i^*, T_j, \delta_j^*).   (20)
This statistic was earlier proposed by Bagai et al.¹⁸ to test for equality of the failure rates of independent latent competing risks.
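As an illustration, the following Python sketch (an assumption-laden example, not part of the original text) evaluates the kernels φ₃ and φ₄ over all pairs to obtain U₃ and U₄.

```python
import numpy as np
from itertools import combinations

def u3_u4(times, delta_star):
    """Evaluate the pairwise kernels phi_3 and phi_4 over all pairs and return
    the U-statistics U3 and U4.  delta_star[i] = 1 means failure from risk 1."""
    t = np.asarray(times, dtype=float)
    d = np.asarray(delta_star, dtype=int)
    n = len(t)
    phi3_sum = 0
    phi4_sum = 0
    for i, j in combinations(range(n), 2):
        # phi_3: the earlier failure of the pair is due to the second risk.
        if (d[j] == 0 and t[i] > t[j]) or (d[i] == 0 and t[i] < t[j]):
            phi3_sum += 1
        # phi_4: the later failure of the pair is due to the first risk.
        if (d[i] == 1 and t[i] > t[j]) or (d[j] == 1 and t[i] < t[j]):
            phi4_sum += 1
    n_pairs = n * (n - 1) / 2
    return phi3_sum / n_pairs, phi4_sum / n_pairs
```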
Aly et al.⁶ proposed Kolmogorov–Smirnov type tests for testing the equality of the two competing risks against the alternatives H₁ and H₂. Here we discuss their approach. Consider
\psi^*(t) = F(2,t) - F(1,t) = \int_0^t S(u)[\lambda(2,u) - \lambda(1,u)]\, du.   (21)
Under H₀, ψ*(t) = 0, and H₁ holds iff ψ*(t) is nondecreasing in t. Let ψ_n^*(t) be its empirical estimator and consider the Kolmogorov–Smirnov type statistic

D_{1n} = \sup_{0 \le s \le t < \infty} [\psi_n^*(t) - \psi_n^*(s)].   (22)
Large values of the statistic are significant. If one is interested in the general two-sided alternative F(1,t) ≠ F(2,t) for some t, or equivalently λ₁(t) ≠ λ₂(t) for some t, then one can use the Kolmogorov–Smirnov type statistic
D_n = \sup_{t \ge 0} |\psi_n^*(t)|.   (23)
Under H₀, n^{1/2} D_n converges in distribution to sup_{0≤u≤1} |W(u)|, where W is a standard Brownian motion. For testing H₀ against H₂ one can use

D_{2n} = \sup_{0 \le t < \infty} \psi_n^*(t).   (24)
Large values of D₂ₙ are significant for testing H₀ against H₂. For the exact null and asymptotic distributions of these statistics see Aly et al.⁶
10.4. Censored Data
Most of the above tests can be generalized to the case when the data are right censored. Let C be the censoring random variable, independent of the latent failure times X and Y. Denote the survival function of C by S_C and assume that S_C(t) > 0 for all t. Now the available information consists of (T̃_i, δ̃_i), i = 1, 2, . . . , n, where T̃ = min(T, C) and δ̃ = δ* I(T ≤ C). Aly et al.⁶ generalized the function ψ* so as to capture departures from H₀ in the case of censored data. Let
\phi(t) = \int_0^t S(u^-)\,(S_C(u^-))^{1/2}\,[\lambda(2,u) - \lambda(1,u)]\, du,   (25)
which reduces to the ψ* function when there is no censoring. The factor (S_C(u−))^{1/2} in the integrand is required to compensate for censoring so that the D statistics remain asymptotically distribution-free. Under H₀, φ(t) = 0, and H₁ holds iff φ(t) is increasing in t.
An obvious choice of φ̂ₙ is

\hat{\phi}_n(t) = \int_0^t \hat{S}(u^-)\,(\hat{S}_C(u^-))^{1/2}\, d(\hat{\Lambda}_2 - \hat{\Lambda}_1)(u),   (26)
where Ŝ and Ŝ_C are the product-limit estimators of S and S_C, and Λ̂_j is the Aalen estimator of the cumulative cause-specific hazard function Λ_j(t) = ∫₀^t λ_j(u) du. A suitable modification of U₂ to censored data is given by the statistic
K_n = \iint_{0 < x < y < \infty} [\hat{\phi}_n(x) - \hat{\phi}_n(y)]\, d\hat{S}(x)\, d\hat{S}(y),   (27)

where φ̂ₙ and Ŝ are as defined above.
Large values of K_n are significant for testing H₀ against H₁. Aly et al.⁶ showed that, under H₀, n^{1/2} φ̂ₙ converges weakly to W(S(·)), where W(·) is a standard Brownian motion; using the continuous mapping theorem, n^{1/2} K_n then converges in distribution to a centered normal random variable. Sun and Tiwari⁸ modified the statistic U₃ so that it can be used with censored data. Consider

V = \int_0^{\infty} [S(t^-)]^2\, d(\Lambda_2 - \Lambda_1)(t),

whose natural estimator is

V_n = \int_0^{\infty} [\hat{S}(t^-)]^2\, d(\hat{\Lambda}_2(t) - \hat{\Lambda}_1(t)),
where Ŝ(t) and Λ̂_j are as defined above. In the absence of censoring V_n reduces to the statistic U₃. Sun and Tiwari⁸ discussed the asymptotic distribution of V_n. Aly et al.⁶ extended D₁ₙ and D₂ₙ to include censored data.
10.5. Simulation Results
Given below are the results of a simulation study of the power of the various tests listed above for the uncensored case. Random samples were generated from the absolutely continuous bivariate exponential (ACBVE) distribution of Block and Basu,¹⁹
with parameters (λ₀, λ₁, λ₂) and λ = λ₀ + λ₁ + λ₂. The cause-specific hazard rates are λ_j(t) = λλ_j/(λ₁ + λ₂), j = 1, 2, and under H₁, λ₁ < λ₂. X and Y are independent if and only if λ₀ = 0. We set λ₁ = 1 and consider λ₂ = 1.0, 1.4, 1.8, 2.2, indicating larger and larger departures from H₀; the case λ₂ = 1.0 corresponds to the null hypothesis. The sample size is n = 100 and there are 10000 replications (Table 10.2).
Table 10.2. Uncensored case.
λ₂      D₁ₙ      D₂ₙ      U₂       U₃       U₄       Sign
1.0     3.76     4.85     5.09     4.79     4.99     4.39
1.4     41.98    47.71    44.60    43.06    42.67    49.50
1.8     82.53    86.98    83.96    80.54    80.81    88.29
2.2     96.83    98.14    96.92    95.42    95.77    98.66
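For readers who wish to reproduce experiments of this kind, the following Python sketch generates competing-risks data (T, δ) from the Block–Basu ACBVE model by sampling the Marshall–Olkin bivariate exponential and rejecting tied pairs; this construction, the parameter names, and the seed are illustrative assumptions rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def acbve_competing_risks(n, lam0, lam1, lam2):
    """Generate n observations (T, delta) from the Block-Basu ACBVE model.

    The ACBVE is the Marshall-Olkin bivariate exponential conditioned on X != Y,
    so we sample (X, Y) from Marshall-Olkin and reject tied pairs (which occur
    with positive probability when lam0 > 0)."""
    T = np.empty(n)
    delta = np.empty(n, dtype=int)
    k = 0
    while k < n:
        z0 = rng.exponential(1 / lam0) if lam0 > 0 else np.inf
        z1 = rng.exponential(1 / lam1)
        z2 = rng.exponential(1 / lam2)
        x, y = min(z1, z0), min(z2, z0)
        if x == y:          # tie caused by the common shock; reject
            continue
        T[k] = min(x, y)
        delta[k] = 1 if x < y else 2
        k += 1
    return T, delta

# Example: null configuration lambda1 = lambda2 = 1 with dependence lambda0 = 1.
T, delta = acbve_competing_risks(100, lam0=1.0, lam1=1.0, lam2=1.0)
```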
Next we look at the censored case (Tables 10.3 and 10.4). The censoring distribution was exponential with parameter 1 and 3, respectively. We use asymptotic critical levels of 5 percent, and the results are based on 5,000 replications. The underlying distribution of (X, Y) is ACBVE with λ₁ = 1.
Table 10.3. Observed levels and powers of K_n at an asymptotic level of 5 percent. Censoring Exp(1).
        n = 50               n = 100
λ₂      λ₀ = 0   λ₀ = 1     λ₀ = 0   λ₀ = 1
1.0     .0218    .0312      .0360    .0376
1.5     .1864    .2192      .3732    .4302
2.0     .4482    .5080      .8342    .7862
2.5     .6928    .7414      .9704    .9546
Table 10.4. Observed levels and powers of K_n at an asymptotic level of 5 percent. Censoring Exp(3).
        n = 50               n = 100
λ₂      λ₀ = 0   λ₀ = 1     λ₀ = 0   λ₀ = 1
1.0     .0048    .0124      .0084    .0172
1.5     .0342    .0762      .012     .1834
2.0     .1202    .1986      .3216    .4860
2.5     .2496    .3774      .5862    .7344
From the tables it is clear that the asymptotic critical levels give conservative tests in the censored case, the effect increasing as the censoring becomes more severe. There is only a slight effect on the levels or the power due to lack of independence of X and Y in the presence of censoring. The results are comparable with those for the test proposed by Aly et al.⁶ in the lightly censored case.
Remarks
(i) The various tests are consistent against their intended alternatives.
(ii) These tests can also be used for the hypothesis λ₁(t) = λ₂(t) against the alternative that the cause-specific hazards are ordered.
(iii) The tests are distribution-free under H₀. The null distribution of the tests U₃ and U₄ is the same as in the case of independent latent failure times (see Deshpande and Dewan¹⁷).
(iv) It is important to note that T and δ continue to be independent under the null hypothesis of bivariate symmetry. Hence the conclusions of Lemma 1 in the review paper hold under H₀.
(v) The statistic U₂ puts more weight on the middle observations and is less sensitive to observations at the beginning and the end of the experiment. On the other hand, U₄ puts more weight on later observations and U₃ puts higher weight on observations at the beginning.
(vi) Deshpande and Dewan¹¹ proposed tests of bivariate symmetry against dispersive asymmetry. There the alternatives can be expressed in terms of an ordering of the subsurvival functions and an ordering of the subdistribution functions of the maximum of the observations and δ. The statistic is a linear combination of two statistics: the first is a U-statistic based on the minimum and δ, the other a U-statistic based on the maximum and δ. The one based on the minimum and δ is the statistic U₄.
(vii) The statistics U₂, U₃, U₄ are all linear combinations of the sign statistic and the Wilcoxon signed rank type statistic.
(viii) The tests proposed by Aly et al.⁶ can be extended to the case of multiple risks in which any two of the cause-specific risks are to be compared. The statistic can also be modified to test dominance of one risk over the other in a specified interval.
10.6. Test for Independence of T and δ
The nature of the dependence between T and δ is crucial and useful in modeling competing risks data via the subdistribution/subsurvival functions. If T and δ
are independent, then S(i,t) = P(δ = i)S(t), allowing the study of the failure times and the causes (risks) of failure separately. The hypothesis of equality of the incidence functions, or that of the cause-specific hazard rates, then reduces to testing whether P(δ = 1) = P(δ = 2) = 1/2. This simplifies the study of competing risks to a great extent. Dewan et al.¹² studied the properties of the conditional probability functions
\Phi_i(t) = P[\delta = i \mid T \ge t] = S(i,t)/S(t), \quad i = 1, 2,
\Phi_i^*(t) = P[\delta = i \mid T \le t] = F(i,t)/H(t), \quad i = 1, 2.   (30)
They observed that (i) T and δ are independent iff Φ₁(t) = P[δ = 1], or equivalently Φ₂*(t) = 1 − P[δ = 1]; (ii) T and δ are PQD iff Φ₁(t) ≥ P[δ = 1], or equivalently Φ₂*(t) ≥ 1 − P[δ = 1]; (iii) δ is right tail increasing in T iff Φ₁(t) is increasing in t; (iv) δ is left tail decreasing in T iff Φ₁*(t) is decreasing in t. They considered the problem of testing H₀ : T and δ are independent, which is equivalent to H₀ : Φ₁(t) is a constant, against various alternative hypotheses which characterize the properties of Φ₁(t) and Φ₁*(t):
H₁' : Φ₁(t) is not a constant;
H₂' : Φ₁(t) ≥ P[δ = 1] for all t, with strict inequality for some t;
H₃' : Φ₁(t) is a monotone nondecreasing function of t;
H₄' : Φ₁*(t) is a monotone nonincreasing function of t.
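A simple way to inspect these hypotheses on data is to plot the empirical versions of Φ₁(t) and Φ₁*(t); the following Python sketch (illustrative only, not part of the original text) computes them at the observed failure times.

```python
import numpy as np

def conditional_probability_curves(times, delta):
    """Empirical estimates of Phi_1(t) = P(delta = 1 | T >= t) and
    Phi_1*(t) = P(delta = 1 | T <= t), evaluated at the observed failure times."""
    t = np.asarray(times, dtype=float)
    d = np.asarray(delta, dtype=int)
    order = np.argsort(t)
    t, d = t[order], d[order]
    n = len(t)

    is_risk1 = (d == 1).astype(float)
    # P(delta = 1 | T <= t): running proportion of risk-1 failures among T_j <= t.
    phi1_star = np.cumsum(is_risk1) / np.arange(1, n + 1)
    # P(delta = 1 | T >= t): proportion of risk-1 failures among T_j >= t.
    remaining_risk1 = np.cumsum(is_risk1[::-1])[::-1]
    phi1 = remaining_risk1 / np.arange(n, 0, -1)
    return t, phi1, phi1_star
```

Under independence both curves should fluctuate around the constant P(δ = 1), while a systematic trend points towards one of the alternatives above.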
A test based on the concept of concordance and discordance was proposed for testing H₀ against H₁'; a one-sided version of this test was seen to be consistent against H₂'. Two tests were proposed for testing H₀ against H₂'. A U-statistic based test was proposed for testing H₀ against H₃', and along the same lines a test was proposed for testing H₀ against H₄'. Note that there is no relationship between H₃' and H₄', but both imply H₁'. Some of the test statistics considered are already in the literature, but in other contexts.
10.6.1. Testing H₀ against H₁'
Kendall's τ is used as a test statistic for the very general alternative of non-independence. A pair (T_i, δ_i) and (T_j, δ_j) is a concordant pair if T_i > T_j, δ_i = 1, δ_j = 0, or T_i < T_j, δ_i = 0, δ_j = 1, and is a discordant pair
if T_i > T_j, δ_i = 0, δ_j = 1, or T_i < T_j, δ_i = 1, δ_j = 0. Define the kernel

\psi_k(T_i, \delta_i, T_j, \delta_j) = \begin{cases} 1 & \text{if the pair is concordant},\\ -1 & \text{if the pair is discordant},\\ 0 & \text{otherwise.} \end{cases}

The corresponding U-statistic is

U_k = \binom{n}{2}^{-1} \sum_{i<j} \psi_k(T_i, \delta_i, T_j, \delta_j).

It is seen that E(U_k) ≥ 0 under H₂'. Hence, a one-sided test based on U_k can also be used to test Φ₁(t) ≥ Φ₁(0) for all t. This statistic was first introduced by Sengupta and Deshpande²⁰ for testing proportionality of the cause-specific hazard rates with independent competing risks. For details see Deshpande and Dewan¹⁷ and Dewan et al.¹²
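A minimal Python sketch of the concordance–discordance statistic is given below; it assumes, as in the kernel written above, that concordant pairs score +1 and discordant pairs score −1, and it is illustrative rather than the authors' implementation.

```python
import numpy as np
from itertools import combinations

def kendall_type_statistic(times, delta):
    """Concordance-discordance statistic U_k for testing independence of T and delta:
    each pair contributes +1 if concordant, -1 if discordant, 0 otherwise."""
    t = np.asarray(times, dtype=float)
    d = np.asarray(delta, dtype=int)
    total = 0
    for i, j in combinations(range(len(t)), 2):
        concordant = (t[i] > t[j] and d[i] == 1 and d[j] == 0) or \
                     (t[i] < t[j] and d[i] == 0 and d[j] == 1)
        discordant = (t[i] > t[j] and d[i] == 0 and d[j] == 1) or \
                     (t[i] < t[j] and d[i] == 1 and d[j] == 0)
        total += 1 if concordant else (-1 if discordant else 0)
    n = len(t)
    return total / (n * (n - 1) / 2)
```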
10.6.2. Testing H₀ against H₂'
Under H₂', Φ₁(t) ≥ Φ₁(0), which is equivalent to Φ₂*(t) ≥ Φ₂*(0). Consider
\Delta_2(S_1, S) = \int_0^{\infty} [S(1,t) - \phi S(t)]\, dF(t) = P[T_2 > T_1, \delta_2 = 1] - \phi/2,

where φ = P(δ = 1).
Under H₀, S(1,t)/S(t) = P(δ = 1), which implies Δ₂(S₁,S) = 0. Under H₂', S(1,t) ≥ Φ₁(0)S(t), and hence Δ₂(S₁,S) ≥ 0. Define the symmetric kernel

\psi_2(T_i, \delta_i, T_j, \delta_j) = \begin{cases} 1 & \text{if } T_j > T_i,\ \delta_j = 1, \text{ or } T_i > T_j,\ \delta_i = 1,\\ 0 & \text{otherwise.} \end{cases}

The corresponding U-statistic estimator is

\binom{n}{2}^{-1} \sum_{i<j} \psi_2(T_i, \delta_i, T_j, \delta_j).   (31)

This statistic was proposed by Bagai et al.¹⁸ for testing the equality of the failure rates of two independent competing risks and is the same as the statistic U₄. Another test of H₀ versus H₂' can be derived using the fact that Φ₂*(t) ≥ Φ₂*(0) under H₂'.
10.6.3. Testing H₀ against H₃'
Note that Φ₁(t) being nondecreasing in t is equivalent to Φ₁(t₁) ≤ Φ₁(t₂) whenever t₁ ≤ t₂. This gives γ(t₁,t₂) = S(1,t₂)S(t₁) − S(1,t₁)S(t₂) ≥ 0 for t₁ ≤ t₂, with strict inequality for some (t₁,t₂). Define Δ₃(S₁,S) as the integral of γ(t₁,t₂) over the region {t₁ ≤ t₂}; it can be written as

\Delta_3(S_1, S) = \int_0^{\infty} [S^2(1,t) - \phi^2/2]\, S(t)\, dF(1,t).   (32)

Under H₀, S(1,t)/S(t) = φ, and hence Δ₃(S₁,S) = 0; under H₃', Δ₃(S₁,S) ≥ 0. Define the kernel
\psi_3^*(T_i, \delta_i, T_j, \delta_j, T_k, \delta_k, T_l, \delta_l) = \begin{cases} 1 & \text{if } T_k > T_j > T_l > T_i,\ \delta_i = \delta_j = \delta_k = 1,\ \delta_l = 0,\\ -1 & \text{if } T_l > T_j > T_k > T_i,\ \delta_i = \delta_j = \delta_k = 1,\ \delta_l = 0,\\ 0 & \text{otherwise.} \end{cases}

Then the U-statistic corresponding to Δ₃(S₁,S) is

U_R = \binom{n}{4}^{-1} \sum_{i<j<k<l} \psi_3(T_i, \delta_i, T_j, \delta_j, T_k, \delta_k, T_l, \delta_l),   (33)
where ψ₃ is the symmetric version of ψ₃*. Note that E(ψ₃*(T_i, δ_i, T_j, δ_j, T_k, δ_k, T_l, δ_l)) = Δ₃(S₁,S), and the expectation of the symmetric kernel is 24Δ₃(S₁,S), owing to the number of orderings combined in forming the symmetric kernel; hence E(U_R) = 24Δ₃(S₁,S). Under H₀, E(U_R) = 0 and under H₃', E(U_R) ≥ 0.
Theorem 10.6.1: As n tends to ∞, under H₀, n^{1/2} U_R converges in distribution to N(0, σ₃²), where σ₃² = (96/35)φ⁵(1 − φ).
The null hypothesis is rejected for large values of n^{1/2} U_R/σ̂₃, where σ̂₃² = (96/35)φ̂⁵(1 − φ̂). The tests proposed above help in discriminating between the constant (or proportional) warning–constant inspection and random signs censoring models due to Cooke,²¹ and in determining whether the corresponding mode of failure becomes more likely with increasing age. On similar lines, Dewan et al.¹² proposed a test for testing H₀ against H₄'. For modeling competing risks data in terms of (T, δ), it is of prime importance to check whether T and δ are independent. The above tests are simple and perform satisfactorily in distinguishing between the hypotheses.
All the tests are typically consistent against larger classes of alternatives than the ones for which they were proposed. The tests are “almost” distribution-free in the sense that their null distributions depend only on the parameter P(δ = 1), which can be estimated consistently. If the hypothesis of independence is accepted, one can simplify the model and study the failure time and the cause of failure separately. If the hypothesis is rejected, a suitable model for the specific dependence between T and δ, expressed in terms of the incidence functions, is needed. The results are being extended to three or more risks. It is observed that many of the statistics originally proposed in the context of independent risks continue to be useful for testing similar hypotheses regarding dependent risks. This raises the interesting possibility that, in general, such rank-based procedures work whether or not the risks are independent.
Acknowledgments
We thank the referee for suggesting improvements in the presentation of the results in the paper.
References
1. A. Tsiatis, A nonidentifiability aspect of the problem of competing risks, Proc. Natl. Acad. Sci. U.S.A. 72, 20-22 (1975).
2. M. J. Crowder, Classical Competing Risks, Chapman and Hall/CRC, London (2001).
3. J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data, John Wiley, New York (1980).
4. J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data, Second Edition, John Wiley, New Jersey (2002).
5. I. Dewan and S. B. Kulathinal, Parametric models for sub-survival functions, Preprint (2003).
6. E. A. A. Aly, S. C. Kochar, and I. W. McKeague, Some tests for comparing cumulative incidence functions and cause-specific hazard rates, J. Amer. Statist. Assoc. 89, 994-999 (1994).
7. S. C. Kochar, A review of some distribution-free tests for the equality of cause specific hazard rates, Analysis of Censored Data, IMS Lecture Notes 27, 147-162 (1995).
8. Y. Sun and R. C. Tiwari, Comparing cause-specific hazard rates of a competing risks model with censored data, Analysis of Censored Data, IMS Lecture Notes 27, 255-270 (1995).
9. J. V. Deshpande, A test for bivariate symmetry of dependent competing risks, Biometrical Journal 32, 736-746 (1990).
10. G. Aras and J. V. Deshpande, Statistical analysis of dependent competing risks, Statistics and Decisions 10, 323-336 (1992).
11. J. V. Deshpande and I. Dewan, Testing bivariate symmetry against dispersive asymmetry, Journal of the Indian Statistical Association 38, 227-249 (2000).
12. I. Dewan, J. V. Deshpande, and S. B. Kulathinal, On testing dependence between time to failure and cause of failure via conditional probabilities, Scandinavian Journal of Statistics 31, 79-92 (2004).
13. E. J. Gumbel, Bivariate exponential distribution, J. Amer. Statist. Assoc. 55, 698-707 (1960).
14. P. K. Sen, Nonparametric tests for interchangeability under competing risks, in Contributions to Statistics, Jaroslav Hajek Memorial Volume, 211-228, Reidel, Dordrecht (1979).
15. T. P. Hettmansperger, Statistical Inference Based on Ranks, John Wiley, New York (1984).
16. P. Yip and K. F. Lam, A class of nonparametric tests for the equality of failure rates in a competing risks model, Comm. Statist. Theory Methods 21, 2541-2556 (1992).
17. J. V. Deshpande and I. Dewan, A review of some statistical tests under independent competing risks model, Preprint (2003).
18. I. Bagai, J. V. Deshpande, and S. C. Kochar, Distribution-free tests for stochastic ordering in the competing risks model, Biometrika 76, 775-781 (1989).
19. H. W. Block and A. P. Basu, A continuous bivariate exponential distribution, Journal Amer. Statist. Assoc. 69, 1031-1037 (1974).
20. D. Sengupta and J. V. Deshpande, Some results on the relative ageing of two life distributions, J. Appl. Prob. 31, 991-1003 (1994).
21. R. M. Cooke, The design of reliability databases, part II, Reliability Engineering and System Safety 51, 209-223 (1996).
CHAPTER 11
REPAIR EFFICIENCY ESTIMATION IN THE ARI1 IMPERFECT REPAIR MODEL
LAURENT DOYEN
Institut National Polytechnique de Grenoble, Laboratoire LMC, BP 53, 38041 Grenoble Cedex 9, France
E-mail: [email protected]
The aim of this paper is to study the estimation of repair efficiency in an imperfect repair model called the Arithmetic Reduction of Intensity model with memory one (ARI1). This model, first introduced by Doyen and Gaudoin, has a very simple failure intensity, and repair efficiency is characterized by a single parameter. Thanks to that simplicity, the asymptotic almost sure behavior of the failure and cumulative failure intensities of the ARI1 model can be derived. Then, the almost sure convergence and asymptotic normality of several estimators of repair efficiency (including maximum likelihood) can be proved in the case where the wear-out process without repair is known. The influence of the number of observed failures on the quality of the repair efficiency estimation is studied empirically. Finally, repair efficiency is estimated on a real maintenance data set.
11.1. Introduction
Many imperfect repair models have already been proposed (see, for example, the review in Pham and Wang¹). One of the most famous is the Brown and Proschan² model, in which the system is renewed after repair with probability p and restarted in the same state as before failure with probability 1 − p. Another very important class of models is the virtual age (VA) models proposed by Kijima,³ in which repair is assumed to rejuvenate the system. The first imperfect repair model was proposed by Malik⁴ in 1979; it is a particular VA model called the Proportional Age Reduction (PAR) model. Even if many imperfect repair models exist, only a few of them have been statistically studied, especially regarding the estimation of repair efficiency.
For the Brown–Proschan model, Lim⁵ has studied estimation with the EM algorithm, Langseth and Lindqvist⁶ have estimated the parameters on a real data set by maximum likelihood, and Lim, Lu and Park⁷ have used Bayesian estimation. For virtual age models, several studies of maximum likelihood estimators have been published: Shin, Lim and Lie,⁸ Yun and Choung,⁹ Kaminskiy and Krivtsov,¹⁰ Yanez, Joglar and Modarres,¹¹ Gasmi, Love and Kahle,¹² Doyen and Gaudoin.¹³ But all these studies proposed only numerical results. The aim of this paper is to propose theoretical results on repair efficiency estimation in the Arithmetic Reduction of Intensity model with memory one (ARI1), proposed by Doyen and Gaudoin.¹³ Arithmetic Reduction of Intensity (ARI) models are analogous to VA models, except that repair does not rejuvenate the system but reduces the failure intensity. The ARI1 model is the analogue of the PAR model and is defined in Sec. 11.2. Section 11.3 analyzes the asymptotic behavior of the failure process. From these results, Sec. 11.4 derives the asymptotic properties of some estimators of repair efficiency (including the maximum likelihood estimator). The influence of the number of observed failures on these asymptotic properties is studied in Sec. 11.5. Finally, repair efficiency is estimated on a real maintenance data set. The classical convergence theorems used in the article are given in the appendix.
11.2. Arithmetic Reduction of Intensity Model with Memory 1
11.2.1. Counting process theory
Let 0 = T₀ < T₁ < · · · be the successive, nonexplosive (T_n → ∞ a.s.) failure times. The counting process associated with the observation of these failure times up to t will be denoted by N_t = Σ_{i=1}^{+∞} 1{T_i ≤ t}, and the inter-failure times by X_i = T_{i+1} − T_i for i ≥ 1. A repair is supposed to be performed after each failure, and the corresponding repair times are not taken into account. The filtration under consideration is the natural filtration associated with the failure times, ℱ_t = σ({N_s}_{0≤s≤t}); it represents all the failure times that can be observed before t. N = {N_t}_{t≥0} has a failure intensity λ_t if there exists a predictable process λ_t such that M_t = N_t − ∫₀^t λ_s ds, where M_t is a martingale representing the noise, that is to say, a right-continuous, left-hand-limited, adapted stochastic process which is integrable and satisfies, for all s, t ≥ 0, E[M_{t+s} | ℱ_t] = M_t. If λ_t is assumed to be left-continuous with right-hand
limits λ_{t+}, then one can show¹⁴ that λ_t is unique (up to indistinguishability) and completely characterizes the failure process. Under further sensible conditions, the failure intensity can be viewed as the conditional rate of failure¹⁴: P(N_{t+Δt} − N_t = 1 | ℱ_t) = λ_{t+} Δt + o(Δt). Finally, the integral of the failure intensity, Λ_t = ∫₀^t λ_s ds, is called the cumulative failure intensity or compensator of N. The predictable variation process ⟨M⟩ is the compensator associated with M², and ⟨M⟩_{+∞} denotes its limit: ⟨M⟩_{+∞} = lim_{t→+∞} ⟨M⟩_t.
11.2.2. Imperfect repair models
Before the first failure, the failure intensity is supposed to be a nonnull, nondecreasing, deterministic function from ℝ₊ to ℝ₊, called the initial intensity and denoted by λ(t). The initial intensity represents the intrinsic wear-out, that is to say, the wear-out in the absence of repair actions. When the initial intensity is known, an imperfect repair model is characterized solely by the effect of the repair actions on the failure intensity. Basic assumptions on repair efficiency are known as As Bad As Old (ABAO) and As Good As New (AGAN). In the ABAO case, each repair is supposed to be minimal, that is to say, it restores the system to the same state as it was just before failure; the corresponding random process is the non-homogeneous Poisson process (NHPP), whose failure intensity is a continuous function of time, λ_t = λ(t). In the AGAN case, each repair is supposed to be perfect, that is to say, it renews the system; the corresponding random process is the renewal process, whose failure intensity is of the form λ_t = λ(t − T_{N_{t-}}). The idea of the VA models is to consider that repair rejuvenates the system, in the sense that the failure intensity at time t is equal to the initial intensity at time A_t, called the virtual age of the system, with generally A_t ≤ t: λ_t = λ(A_t). Between failures, the virtual age is generally supposed to increase as the real age t: A_t = t − T_{N_t} + A_{T_{N_t}} for t ≠ T_{N_t}; the effect of repair is to reduce the virtual age. The idea of the PAR model is to consider that the repair action reduces the virtual age by an amount proportional to the virtual age accumulated since the last repair: A_{T_n^+} = A_{T_n^-} − ρ(T_n − T_{n-1}), for all n ≥ 1. ρ is called the repair efficiency parameter, and one¹³ can prove that the failure intensity of the PAR model is λ_t = λ(t − ρT_{N_{t-}}). This model appears to be the same as the Kijima et al.¹⁵ model, the Shin et al.⁸ model, and the Kijima³ type I model in the case of a deterministic repair effect. In the ARI models,¹³ repair is supposed to reduce not the age, but the failure
intensity. That reduction is supposed to be arithmetic: λ_{T_n^+} = λ_{T_n^-} − Z_n, for all n ≥ 0. In addition, the system's wear-out speed is supposed to be unaffected by repair actions: λ_t = λ(t) − λ(T_{N_{t-}}) + λ_{T_{N_{t-}}^+}, for t ≠ T_{N_t}. ARI models can be defined by analogy with the VA models. The analogue of the PAR model is the ARI1 model: the repair effect is supposed to reduce the failure intensity by an amount proportional to the increase of the failure intensity since the last repair, λ_{T_n^+} = λ_{T_n^-} − ρ(λ_{T_n^-} − λ_{T_{n-1}^+}). By analogy with the PAR model, one¹³ can prove that the failure intensity of the ARI1 model is

λ_t = λ(t) − ρ λ(T_{N_{t-}}).

It is certainly one of the simplest imperfect repair models. In the following, the probabilistic and statistical properties of this model are developed. The parameter ρ of the ARI1 model characterizes repair efficiency.¹³ When ρ is between 0 and 1, repair is efficient; when it is smaller than 0, repair is harmful; when ρ equals 0, repair is inefficient, that is to say ABAO; and finally, when ρ equals 1, repair is optimal but not AGAN. Thus, evaluating repair efficiency in the ARI1 model is equivalent to estimating the parameter ρ.
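To make the model concrete, the following Python sketch simulates failure times from an ARI1 process by inverting the conditional cumulative intensity numerically; the power-law initial intensity λ(t) = αβt^{β−1}, the parameter values, and the function names are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)

def simulate_ari1(n_failures, alpha=1.0, beta=2.5, rho=0.5):
    """Simulate the first n_failures failure times of an ARI1 process with
    initial intensity lambda(t) = alpha*beta*t**(beta-1) and repair efficiency rho.
    Between failures the conditional intensity is lambda(t) - rho*lambda(T_last)."""
    lam = lambda t: alpha * beta * t ** (beta - 1)        # initial intensity
    cum_lam = lambda t: alpha * t ** beta                 # its integral
    times = []
    t_last = 0.0
    for _ in range(n_failures):
        e = rng.exponential(1.0)
        # Cumulative conditional intensity accumulated on (t_last, t].
        acc = lambda t: (cum_lam(t) - cum_lam(t_last)
                         - rho * lam(t_last) * (t - t_last))
        upper = t_last + 1.0
        while acc(upper) < e:                             # expand the search bracket
            upper = t_last + 2 * (upper - t_last)
        t_last = brentq(lambda t: acc(t) - e, t_last, upper)
        times.append(t_last)
    return np.array(times)

failure_times = simulate_ari1(20)
```

Because 0 ≤ ρ ≤ 1 and the initial intensity is nondecreasing, the accumulated conditional intensity is monotone on each inter-failure interval, so the root found in the inversion step is unique.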
11.3. Failure Process Behavior 11.3.1. Minimal and maximal wear intensities In the case of efficient repair, the ARIl failure intensity satisfies for all t (1 - p ) X ( t ) I At 5
W).
2 0: (1)
For harmful repair, there exists an equivalent equation:
q q IAt 5 (1- p)X(t).
(2)
The previous lower and upper bounds are called respectively minimal and maximal wear intensities and are denoted by: (3)
(4
If X ( t ) > 0 for t the sense that:
> 0, it
Vt 2 0, VE > 0, P(Xt
can be proved13 that these bounds are optimal in
5 X m i n ( t ) + E ) > 0 and P(Xt 2 A,,,(t)
-E)
> 0. (5)
Using the fact that under assumption:
Repair Eficiency Estimation in ARI1 Model
0
157
Al:p < 1,
the failure intensity is lower bounded by a nondecreasing, nonnull function, the following lemma 11.1proves that times between failures are negligible with regards to failure times.
Lemma 11.1: Under assumption A1, the failure process satisfies:
t - TN,- = o(t) ( a x ) . Proof: p
< 1 and A(t)
(6)
is nondecreasing and nonnull, then,
3c > 0, 3 t o > 0,
vt 3 t o ,
Amin@)
2 c.
(7)
Then, At is almost surely divergent. Brkmaud" proves that it is equivalent to the fact that Nt is almost surely divergent. Since the failure process is not explosive, 2 ' ~is~also almost surely divergent, then almost surely:
3tl > 0, V t 3 tl,
TN,- 2 t o . (8) By using the differential martingale writing of N , there is almost surely, for all t 2 tl: (9)
(10
Corollary 11.1(see appendix) can be applied and then,
loX, t l
dMs "2. o(
1;
ds)
+ 0(1)=
t to T) -
O(
+ O(1)= o(t).
(11)
That result remains true by replacing t by TN,-. Then, since ~ ( T N ), -= o(t), lemma 11.1 is proved. 0 11.3.2. Asymptotic intensity
The failure intensity of an ARIl model can be rewritten as: At = (1 - p ) A ( t )
+ p [ A ( t ) - A(TN,_)].
-
(12)
But, thanks to lemma 11.1, it can be proved under additional sensible assumption that: A(TN,-) = A(t - (t - TN,-)) A(t). Then, the failure intensity is asymptotically equivalent to (1 - p)X(t) (see proposition 11.1 below), called asymptotic intensity and denoted by A,(t). (y) That result is true for an initial intensity that is a regular variation function:17
L . Doyen
158
0
A2:30 E IR+, Vx > 0 ,
limt,,
X(x t ) / A ( t ) = xo.
Regular variation functions with /3 = 0 are slow variation functions. Powers of logarithm functions ( A ( t ) = a,B(Zn(l+ t))o-', a > 0,p > 1) and constant functions ( X ( t ) = a,Q > 0) are slow variation functions. Increasing power functions ( X ( t ) = crpto-l, (Y > 0 , p > 1) are not slow variation functions. However, there are regular variation functions. Nonregular variation functions for the initial intensity correspond to quick divergent initial intensity, for example: X ( t ) = e z p ( t P ) with p > 0. That kind of initial intensity seems not to be realistic for reliable industrial systems.
Proposition 11.1: Under assumptions A 1 and AZ, the failure intensity is asymptotically equivalent to the asymptotic intensity: At "2 X,(t) (t) o(Xm ( t ) ) .
+
Proof: Lemma 11.1 implies that, almost surely: (13)
Then, Eq. (12) implies, for all t 2 M I , that:
I&
IIpI [X,(t)
- A,(t)I
-
L((1
+ e l ) t ) ] (a.s.).
(14)
In addition, the initial intensity is a regular variation function: Vt2
> 0 , 3M2 > 0 , V t 2 M2,
This is equivalent to:
I X,((l
+ € l ) t )- (1
+El)%&)
)I€ZX,(t).
(15)
By using this result in Eq. (14) it can be deduced that, almost surely:
vt 2 max(M1,M2), I At And the result is proved.
-A @ ,)
IL IPI [ 11 - (1 + E 1 ) O + €21 L ( t ) . (16) 0
Figure 11.1 shows the failure intensity in the case of efficient repair ( p = 0.3), in solid thin line, and harmful repair ( p = -0.3), in solid bold line. The initial intensity, X ( t ) = 3t2, is in dashed line. The asymptotic intensity is represented by the doted line in the case of efficient repair and by the dashed-dotted line in the case of harmful repair. It appears that the difference between the failure intensity and the asymptotic intensity is decreasing. Proposition 11.1 only proves that the difference is negligible with regards to the asymptotic intensity.
159
Repair Eficiency Estimation in ARIi Model
351
Fig. 11.1. Failure intensity for p
<0
(thin line), 0 < p
< 1 (bold
line).
11.3.3. Second order term of the asymptotic expanding
The difference between the failure process and the asymptotic process can not be asymptotically approximated, but the integral of that difference can. Let h , ( t ) = X(s,) ds. Then, the cumulative failure intensity of the AH1 model is: (17
The following proposition 11.2 proves that if repair is not perfect and the initial intensity is a regular variation function that diverges:
X(t) = +m, then, Jt[A(s) - X ( T N ~)]- ds is asymptotically equivalent to: l n ( X ( t ) ) / (- pl ) . 0
A3: limt,,
Proposition 11.2: Under assumptions A1 to A3, the cumulative failure intensity satisfies:
At "2A,(t)
+ -Zn(X(t)) P 1-P
+ o(ln(X(t))).
Proof: There almost surely exists to E TR; such that Amin, ,A nonnull for t 2 to and N,, 2 1.
(18)
and X are
L. Doyen
160
Using the martingale writing of the counting process N , the failure intensity is such that for all s 2 to:ds = dNs/Xs - dMs/Xs. Then, since X(s) - X(TN~-) 5 A(s), the failure process satisfies:
(19)
Corollary 11.1 can be applied and then,
"2.o(At - A,(t))
+ O(1).
(20)
Proposition 11.1 implies that:
(21)
And, since T, is almost surely finite for all n 2 0:
(22
Finally,
At - A,(t)
+
"2.L l n ( A ( t ) ) o(At - A,(t))
+
O(1). 1-P Since the initial intensity diverges, proposition 11.2 is proved.
(23) 0
11.4. Repair Efficiency Estimation
The aim of this section is to study some estimators of repair efficiency in the case where the initial intensity is known, that is to say the behavior of the
Repair Ejjiciency Estimation an ARIi Model
161
system in the absence of repair is known. Then failure intensity is supposed to depend on a single parameter p E 3:At = X t ( p ) . The true value of that parameter will be denoted po. First, the properties of the maximum likelihood estimator are studied. Then, explicit estimators are introduced. 11.4.1. Maximum likelihood estimators
To prove the convergence of the maximum likelihood estimator (MLE), the true value of the repair efficiency parameter is assumed to be in a known bounded and closed interval of ] - 00, I [ :
-
0
A l : The MLE is searched in [ p ~ , p z ]with ,
-00
< p1 < po < p2 < 1.
The log-likelihood is:14
1
1 t
t
U P )=
W X s ( p ) ) dNs
-
M P ) ds.
(24)
Since the failure intensity is lower and upper bounded:
VJP E
[Pl,P21,
11 A (1- P 2 ) I W )
I&(P) I [1 v ( 1 - P l ) l X ( t ) ,
(25)
it can be proved, using the property of the Lebesgue integral, that the derivative of the log-likelihood exists and satisfies: (26
The maximum likelihood estimator ,6f is the value of [ p l ,,021 that maximizes the log-likelihood. Classically it is also the value of [ p 1 , p 2 ] such that the derivative of the log-likelihood is null.
Proposition 11.3: Under assumptions A1, A 2 and A3, the maximum likelihood estimator of p satisfies, f o r a single observation of the failure process over [ O , t ] : VC > 0, I po - ,6f I A(t)0.5--e% O (27) [Po - , 6 f I J A ( t ) / ( l Po)
5 N o , 1).
(28)
Proof: The derivative of the log-likelihood satisfies (26), then:
(29)
L. Doyen
162
But, proposition 11.2 showed that: Ji[X(s) - X(TN=-)] ds "2 O(ZnX(t))). And, using Eq. (25), it can be proved that:
(30)
And, as it has been done in Eq. (21) and (22), it can be proved that: (3)
Finally, using the martingale writing of N ,
(32)
By using previous results in Eq. (29), it is proved that:
IL:(P)- ( P - Po)A(t)- Mtl "2 o(ln(X(t))).
SUP
(33)
P€[Pl,PZl
Equation (33) implies in particular that:
L ~ ( P I"2' ) (PI - P O ) A(t) + Mi
+ o(ln(x(t))).
(34)
But corollary 11.2 implies that, for all E > 0, Mt = O ( A ~ ( ~ ) ~ . = ~ O ( A ( ~ ) ~ . ~ +And ' ) . Karamata's theorem" proves the following asymptotic equivalence for regular variation functions with p > -1:
A(t)
(p+ l)-'tX(t).
(35) This implies A ( t ) = o(A(t)) and O(ln(A(t)))= ~ ( h ( t ) ~Then, . ~ +Li(p1) ~). is negative for all t large enough. Similarly, it can be proved that Li(p2) is positive for all t large enough. Using the property of the Lebesgue integral, it can be proved that Lt(p) and L:(p) are continuous. Then, the maximum likelihood estimator is such that: L:(pf) = 0 for t large enough. Equation (33) also implies that: (p,"
N
- P O ) A(t) + Mt
+ O(ln(X(t)))"2' 0.
(36)
Then, with the same argument used for L:(pl),it is proved that:
> 0,
(p,"
P O ) A(t) "2 O ( A ( ~ > ~ . ~ + ' ) .
(37) The first result of the proposition is proved. Finally, the result of corollary 11.2 is used in Eq. (36) and b'e
-
3N ( 0 , l ) . {l - P o The second result of the proposition is then proved. (po - p,")
(38) 0
Repair Eficiency Estimation in ARI1 Model
163
11.4.2. Explicit estimators There also exists explicit estimators (EE) with the same asymptotic properties as the MLE. These EE are based on the asymptotic expanding of the cumulative failure intensity of proposition 11.2. Proposition 11.4: Under assumptions A1 to A3, f o r a single observation of the failure process over [O,t], the explicit estimator: Nt (39) = --
iT1
l
satisfies the same convergence properties as the MLE of proposition 11.3. Proof: By identifying the result of proposition 11.2 in the martingale writing of Ntl it is proved that:
+
Nt - ( 1 - po)A(t) O(Zn(A(t)))
Mt (40)
(42_
+
In addition, since At "Z (1 - po)A(t) o ( A ( t ) ) corollary , 11.2 implies that M t / ( ( l - ~ ~ ) h ( t ) ) ~converges . ~ + € almost surely to zero for all E > 0 and converges in distribution to a standard Gaussian variable for E = 0. By using those two results in Eq. (40), proposition 11.4 is proved. To build previous EE, the cumulative failure intensity has been replaced, in the martingale writing of N , by its first order asymptotic expansion: (1 - po)A(t) + o ( l n ( A ( t ) ) ) .Then, only the value of the first order term ((1- po)A(t))influences the EE. Because of the particularly simple shape of the failure intensity, there exists another EE that is built in accordance with the real value of the failure intensity in the martingale writing of N . Because that special EE takes in consideration the real value of the failure intensity and not only its asymptotic value, it is expected to converge more quickly than the previous EE. Proposition 11.5: Under assumptions A1 to A3, f o r a single observation of the failure process over [ O , t ] , the explicit estimator: (42
satisfies the same convergence properties as the MLE of proposition 11.3.
L . Doyen
164
Proof: The martingale writing of N implies that: Nt - A(t)
+ PO
I'
X(TIV-) ds = Mt (43
Using lemma 11.1 and assumption A2, it can be proved that:
I"
X(TN~-) ds - A ( t ) =
I'
[X(TN~-) - X(s)] ds
o(X(s)) ds = o ( A ( t ) ) .
(44)
So, po -ijF2 "2 M t / A ( t ) + o ( M t / A ( t ) )And . similarly to the proof of proposition 11.4, proposition 11.5 is proved by applying corollary 11.2. 0 11.5. Empirical Results
11.5.1. Finite number of observed failures
To test the quality of previous estimators, the empirical bias and standard deviation (SD) of these estimators have been calculated over 10000 simulations, for an initial intensity X ( t ) = aptP-', with cy E (1, lo}, /3 E {1.5,2,3}, p E {0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9}and a number of observed failures n E {5,10,20,40,60,80, loo}. Figure 11.2 represents, for the MLE and both explicit estimators, in the case of increasing wear out speed ( p = 3), the logarithm of the opposite of the empirical bias (Zog(p0 - 3)) and the logarithm of the empirical SD of estimators
d x ) )
(log( function of the number of observed failures and repair efficiency. Figure 11.2 is exactly the same for cy = 1or 10. Then the scale parameter cy seems to have no influence on repair efficiency estimators bias and SD. As the number of observed failures increases, the bias and SD of the estimators decrease. That confirms the fact that the estimators are asymptotically convergent. The bias of all estimators seems to be ever negative, then repair efficiency is under estimated. That is to say repair effect tends to be better than what is estimated. The bias and SD of the MLE and E2 are quite the same. But, E2 is an explicit estimator much more simple to use than the MLE. Both estimators improve as p converges to one, that is to say as repair efficiency becomes optimal. The SD of El has the same trend. In addition, for a few number of observed failures (5,lO) the SD of E l is slower than the SD of MLE and E2. For a reasonable number of observed failures, all SD are quite equivalent.
165
Repair Eficiency Estimation an ARIi Model
0 n
n
P
P
-2
E
-4 -6
0
0 n
n
P
P
0
M
-2 -4
......
.. . '
'
0 n
P
n
P
Fig. 11.2. Empirical bias and standard deviation of repair efficiency estimators for p = 3.
The bias of E l does not depend much on p . It decreases slightly as repair efficiency becomes inefficient. Except in that case, the bias of E l is bigger than the bias of E2 and MLE. In conclusion, both explicit estimators are efficient. E l is more adapted for a few number of observed failure times or for inefficient repair efficiency. And E2 is recommended in all other cases.
11.5.2. Application to real maintenance data set and perspective In practical cases, the initial intensity is generally unknown. But, it can be estimated using maximum likelihood method. The photocopier failure data of Murthy, Xie and Jianglg are considered. These data consist in the 42 failure times (in days or number of copies) of a particular photocopier over
L. Doyen
166
its first 43 functioning years. The maximum likelihood estimators are: Li
= 8.8
10m7(days),,8 = 2.79, /i = 0.99.
(45)
The values of previous repair efficiency estimators are calculated in Table 11.1, for a known power function for the initial intensity with parameters a E (6.8 10-7,8.8 10.8 and p E {2.59,2.79,2.99}. The repair efficiency estimation seems to be sensible to the initial intensity shape. Then, further work will be to extend the properties of Sec. 11.5 to the case where the initial intensity is unknown and has to be estimated. The problem is to prove that the MLE are also convergent in the case of simultaneous estimation of repair efficiency and initial intensity parameters.
MLE, E l , E2 a = 6.8
a = 8.8 a = 10.8
lo-'
fl = 2.59
p = 2.79
fl = 2.99
0.79, 0.72, 0.75 0.85, 0.78, 0.82 0.89, 0.82, 0.86
0.98, 0.94, 0.98 0.99, 0.95, 0.99 1, 0.96, 1
1, 0.99, 1.03 1, 0.99, 1.03 1, 0.99, 1.03
11.6. Classical Convergence Theorems
Theorem 11.1 is a weak version of a classical theorem14 on stochastic integration. Theorem 11.1: Let us consider a predictable stochastic process 2 = {Zt}t>O such that 2," d h , is finite for all t 2 0 (it is automatically true if 2, and A, are bounded over [0,t]).Then, s," 2, dM, is a local square t integrable martingale with predictable variation process: 2,"dAs.
so
Theorems 11.2 and 11.3 are two convergence results,20the first one is a consequence of the law of large numbers, and the second one of the central limit theorem.
Theorem 11.2: For {(M)+m< +w}, Mt converges almost surely to a finite random variable as t grows to infinity. For { ( M ) + m= +w}, and for all E > 0: M,/(M);.~+' Z 0. Theorem 11.3: Let 2 be a locally bounded predictable process. Then under the following assumptions:
167
Repair Eflciency Estimation in ARIi Model
0
there exists ct
> 0 and u such that:
s,” Z:Xs
ds/c;
5 u2,
> 0 s,” Z ~ U ~ l ~ a l , ~ cdslc? , ~ X , + 0 (it is automatically true if Zt is almost surely finite and ct grows to the infinity), the random variable: s,” 2, dM,/ct converges in distribution to a Gaussian centered variable with standard deviation u. 0
P
for all b
In particular, theorems 11.1 and 11.2 imply the first following corollary and theorems 11.2 and 11.3 imply the second following corollary. Both deal with martingale asymptotic behavior. Corollary 11.1: Let Z be a bounded predictable process. If the failure intensity i s bounded for all t 2 0 , then: 2, dM, “2 o(s,” 2,”dA,) O(1).
s,”
+
Corollary 11.2: If there exists a positive, nondecreasing, divergent function A,(t) such that the cumulative failure intensity satisfies almost surely: At = A,(t) o(A,(t)), then for all 6 > 0: Mt “2‘ o(A,(~)’.~+‘) and
+
M t / d m -% N(o,1). References 1. H. Pham and H. Wang, Imperfect maintenance, European Journal of Operational Research 94, 452-438 (1996).
2. M. Brown and F. Proschan, Imperfect Repair, Journal of Applied Probability 20, 851-859 (1983). 3. M. Kijima, Some Results for Repairable Systems with General Repair, Journal of Applied Probability 26, 89-102 (1989). 4. M. A. K. Malik, Reliable preventive maintenance scheduling, AZZE Z’ransactions 11, 221-228 (1979). 5. T. J. Lim, Estimating System Reliability with Fully Masked Data under Brown-Proschan Imperfect Repair Model, Reliability Engineering and System Safety 59, 277-289 (1998). 6. H. Langseth and B. H. Lindqvist, A Maintenance Model for Components Exposed to Several Failure Mechanisms and Imperfect Repair, Mathematical and Statistical Methods in Reliability, World Scientific Publishing Co. p. 415430 (2004). 7. J. H. Lim, K. L. Lu, and D. H. Park, Bayesian Imperfect Repair Model, Communications in Statistics - Theory and Methods 27, 965-984 (1998). 8. I. Shin, T. J. Lim, and C. H. Lie, Estimating Parameters of Intensity Function and Maintenance Effect for Repairable Unit, Reliability Engineering and System Safety 54, 1-10 (1996). 9. W. Y. Yun and S. J. Choung, Estimating Maintenance Effect and Parameters of Intensity Function for Improvement Maintenance Model, 5th. ISSAT International Conference Reliability and Quality in Design, p. 164-166, Las Vegas (1999).
168
L . Doyen
10. M. P. Kaminskiy and V. V. Krivtsov, G-Renewal Process as a Model for Statistical Warranty Claim Prediction, Annual Reliability and Maintenability Symposium, p. 276-280, Los Angeles (2000). 11. M. Yafiez, F. Joglar and M. Modarres, Generalized renewal process for analysis of repairable systems with limited failure experience, Reliability Engineering and System Safety 77, 167-180 (2002). 12. S. Gasmi, C. E. Love, and W. Kahle, A general repair, proportional-hazards, framework to model complex repairable systems, ZEEE Transactions o n Reliability 54, 26-32 (2003). 13. L. Doyen and 0. Gaudoin, Classes of Imperfect Repair Models Based on Reduction of Failure Intensity or Virtual age, Reliability Engineering and System Safety 84, 45-56 (2004). 14. P. K. Andersen, 0. Borgan, R. D. Gill, and N. Keiding, Statistical Models Based on Counting Processes, Springer-Verlag (1993). 15. M. Kijima, H. Morimura, and Y . Suzuki, Periodical Replacement Problem Without Assuming Minimal Repair, European Journal of Operational Research 37, 94-203 (1988). 16. P. BrBmaud, Point processes and queues: Martingale Dynamics, SpringerVerlag (1981). 17. J. Karamata, Sur un mode de croissance rbguliitre, ThCoritmes fondamentaux, Bulletin de la Socie'te' de Mathe'matique de France 61, 55-62 (1933). 18. P. Embrechts, C. Kliippelberg, and T. Mikosch, Modeling extremal events, Springer, p. 567 (1997). 19. D. N. P. Murthy, M. Xie, and R. Jiang, Weibull Models, Wiley Interscience, p. 314 (2004). 20. C. Cocozza-Thivent, Processus Stochastiques et FiabilitC des Systkmes, Springer (1997).
CHAPTER 12 ON REPAIRABLE COMPONENTS WITH CONTINUOUS OUTPUT
M. S. FINKELSTEIN Department of Mathematical Statisticsj University of the Free State PO Box 339, 9300 Bloemfontein, Republic of South Africa E-mail: [email protected]. ac.za
An output of a component’s performance is characterized by a monotone deterministic function or by a stochastic process with monotone sample paths. A stationary expected output and the probability of exceeding the given level of an output are derived for perfectly repaired components. Simple systems of components with the continuous output are considered. The case of imperfect repair is discussed. Several examples are considered.
12.1. Introduction The state space of a binary component is {O,l}. When a component is “on,” its state is 1 and when a component is “off,” its state is 0. Reliability theory of engineering systems is mostly devoted t o this case, although the multistate components and systems were also considered rather extensively (see Lisniansky and Levitin’ for references). Less attention was paid to the continuous case, when the state space is e.g., [O,w) or some other interval. In this note we shall consider some simple approaches that can help to analyze main stationary characteristics of repairable components and elementary systems with continuous state space. Define an output of a continuous state component a t some time t as the state of a component a t this instant of time. Let the output of this component, which characterizes its performance level, be a positive, decreasing ) the corresponding stochastic (increasing) function of time Q(t) ( @ ( t )or process Q t , t 2 0 ( @ t , t 1 0 ), showing monotonic deterioration of the 169
170
M. 5’. Finkelstein
components’ performance level. Monotonicity assumption is quite natural in many applications. The main objective of this paper is to obtain for repairable components simple formulas for the expected output at time t and for the probability of exceeding (nonexceeding) of a given level of output at this instant of time. As the renewal reward processes methodology will be used, the focus will be on asymptotic results for t 00. We shall consider three important settings describing components with continuous output: --f
a. Let a system consist of two independent components in series. The first component is binary with the Cdf F(t) and the second is a continuous state component which state is defined by the decreasing Q(t)(Qt,t 2 0 ) . Usual reliability reasoning for the series system results in the following: when the first component is operating (its state is l), the output of a system is defined as Q(t)(Qtrt 2 0 ) and when the first component is “off’ (failed), the output of the whole system is 0. In other words: an output of a system is a product of a binary random variable and a random output of a continuous part. Example: an onboard navigation system consisting of a binary part, which failure leads to the failure of the whole system, and the continuous output part with a degrading between corrections performance level (accuracy of navigation parameters). It is clear that, as two components are independent, the expectation of the output of this system is given by Q E @ ) = F(t)Q(t) ( Q E ( ~= ) F(t)E[Qtl),
(1)
where we assume that E[Qt]exists and that the criterion of failure with respect to the level of Q(t)(Qt,t2 0) is not defined (as usually: F ( t ) = 1- F ( t ) ) . b. There is only one component with increasing output @ t , t 2 0 and the failure is defined as exceeding some deterministic level PO.Obviously the corresponding first passage time for this process is its stopping time. Consider a stochastic process &, t 2 0, which is equal to zero when the initial process exceeds PO: Et
= @tI(cpo
-
at),
(2)
where I ( t ) is the corresponding indicator. Therefore, the expectation of an output at time t is @ E ( t ) = E[Jt].Without losing generality, assume that = @ ( O ) = 0 as in the case of accumulated damage (degradation).
On Repairable Components with Continuous Output
171
c. The same as in b. but the critical level p o is a random variable \Eo (with the Cdf Fo(cp) ) independent of @(t). All three settings describe continuous state components and, as it was mentioned, the renewal reward methodology will be used for all of them when considering the repairable case. The difference is in defining a failure. In the first setting the failure is defined via the failure of the binary component, and in the second and the third settings a failure time is defined as the first passage time for the corresponding stochastic process. 12.2. Asymptotic Performance of Repairable Components
a. Let each failure of the component be instantaneously perfectly repaired a t failures. This means that all inter-failure times are i.i.d. with distribution functions F ( t ) , whereas the output after each failure is ideally ‘refreshed.’ Denote by Q ( t ) a random value of the output at time t . Therefore, it can be easily seen:
1 t
E[Q(t)l
Q E ( ~= )
F ( t ) E [ Q t+ ]
0
h ( x ) F ( t- x)E[Qt-,]dx,
(3)
where h ( z ) is the renewal density function for the ordinary renewal process governed by the distribution function F ( t ) . Applying the key renewal theorem2 to Eq. 3 , the stationary value (as t 4 m) of the expected output is obtained in a usual way: (4)
where T is the mean time to failure. The probability that the output exceeds some acceptable level qo > 0 is also of interest. Similar to ( 3 ) and (4),the following stationary probability can be obtained: (5
where F ( x ,4 0 ) is the distribution function of the first passage time for the decreasing stochastic process Qt, t 2 0. Example 12.1: Let, firstly, Q ( t ) be deterministic and consider the exponential case: F ( t ) = 1 Q ( t ) = ePat,LY > 0. Then (6)
M. S. Finkelstein
172
For the simplest stochastic process: Qt(Z) = 1- e - z t , where Z is uniformly distributed in [0,a ] a , > 0, the straightforward calculations give: (7)
where d = -(In qo)/a. b. Denote by &(cp) the pdf of at for the fixed t. It is clear that it can be easily defined for processes for which the distribution of the first passage time can be obtained. Then: (8)
Assume that this mean exists for W 2 0. The perfect repair is instantaneously performed each time @ t reaches PO.Therefore, the inter-arrival times of the renewal process are just the corresponding i.i.d. first passage times with the Cdf F ( t ,9 0 ) and the mean Tv0.It can be shown that E[&]is decreasing in t and that its integral from 0 to infinity is finite. This means that the key renewal theorem can be applied, which leads to the following stationary value:
1
l o o @ES(PO) = TVO
E[Ct]dt.
(9)
A similar result can be obtained for the stationary probability for the output values to be in the interval [cp', 901 for some 0 < cpl < cp. Example 12.2: Let
Qt
= Pt, where ,f3 is a positive random variable. Then @ES(PO) =
PO 2
1
and this simple result does not depend on the Cdf of p. c. Let now PO be random with the Cdf Fo(cp) and consider firstly deterministic @(t).As this function is increasing, the inter-arrival times are defined by the following distribution
@(U= Fo(W)>
P(P0 I
(10)
and, eventually: (11)
Example 12.3: Let Fo(t) = 1- exp{-At}, @(t)= Pt,P > 0. Then @ E S = and, as in the previous example, the stationary value does not depend on P.
173
On Repairable Components with Continuous Output
For the random process, Eq. 11 is modified to: @ES =
L To
1
00
EIFo(@z)@z]dx,
(12)
0
where E[Fo(@,)]is the inter-arrival time distribution and TOis its mean.
12.3. Simple Systems The above reasoning can be generalized to systems of components of the described type. We shall consider the setting ''a." and stationary characteristics, but the following is valid for other settings and for V t 0 as well. Let h(z1,5 2 , ..., z), be an ordinary structure function of a coherent system of n binary components. Assume now that each component of this system is non-binary anymore, but a repairable continuous state, one with an output defined by relations (4) and (5). Let the following rule be applied to the output of our ~ y s t e m : ~
>
(13)
where Qi(t), is an output of the ith component i = 1,2...,n and 01,022,..., 01 are the corresponding minimal path sets. It is clear that the parallel continuous output system is defined by: Q(t) = maxls,Qi(t), whereas relation Q(t) = mini<, Qi(t) defines the series system. Denote: Pi(qo,t) = P(&i(t)2 40). It isclear that, if relation (13) holds, then
P(Q(t)2 40) P ( q o l t ) = h(Pi(qo,t),P2(~o,t), ...,P,(qo,t)),(14) for V t 2 0 and, specifically, for the stationary value when t + 00. The expected system output in this case is: (15)
Example 12.4: Let, as in Example 1: F ( t ) = 1Q ( t ) = e-Qt,Ly > 0 and consider stationary characteristics for the parallel system of identical components. It is clear that (14) turns to 1 - (1 - PS(q0))" and the upper limit of integration in (15) is 1. Then (compare with a nonredundant structure in Example 1):
Ps,n(qo)= 1 - (qo)??
7
Q E S , ~=
nX
nX
+ LY'
Remark 12.1: Different components can have different 'output supports' (output range). This means that the structure function can be different for different values of qo, which can lead to changing with the level of qo system structure.
M. S. Finkelstein
174
Remark 12.2: Definition 13 is the simplest one for the output of a coherent system. Some other approaches can be considered as well. For instance, the cumbersome relations, when the output of the entire system is the sum of outputs of components, can be derived (‘parametrical convolution’). 12.4. Imperfect Repair
The results of previous sections were derived under the usual assumption that repair is ideal, thus returning the system to ‘as good as new’ state. We shall consider here only one type of imperfect repair and apply it to Model “c.” This approach applied to Model “b” results trivially in the corresponding delayed renewal process. Another type of imperfect repair, which is defined for binary components as the combination of the perfect repair and the minimal repair4i5i6can be easily generalized to the case of the continuous output in Model Assume that after each repair the process at,t 2 0 is restarted from the same initial level, but the critical level of the output, which is a random variable and defines a failure (end of a cycle), is stochastically decreasing with each repair. This kind of imperfect repair is relevant in various applications. Specifically, let QQ, i = 1 , 2 , ... be the critical output at the end of the ith cycle. For the perfect repair all !Po,%,i = 1,2, ...,Q o , E ~ QO are i.i.d. Let the sequence Q o , ~i ,= 1 , 2 , ... be stochastically decreasing: Qo,2+1 I s t
Qo,2
@ FO,r+l(cp)
I Fo,z((P), = 172, ..., VcP E [O, 001,
(16)
where, F ~ , ~ + l ( cis p )the Cdf of Q o , and ~ Fo,l(ip) = Fo(p). The simplest example of a stochastically decreasing sequence Q o , ~i , = 1 , 2 , ... is the following geometric process:
FO,*((p)= Fo(wz-lcp), i = 1 , 2 , ..., w > 1,
(17)
where w is a constant scale transformation parameter. It follows from (17) that the mean length of a cycle tends to 0 as i -+ m. A more interesting situation would be, if the cycle duration Cdf tends, as i -+ 00 , to some limiting Cdf. In this case asymptotic relation (12) is true, where Fo(cp) is substituted by this limiting distribution l$ (cp). Consider the following specific model, which meets these requirements. It is clear, that the duration of the first cycle is defined by the Cdf Fo,l((p=) Fo((p),whereas Q0,I QO (and, of course, by the stochastic process at, t 2 0 itself). Let the next cycles be governed by the distribution Fo,%((p) = E[Fo(cpl@,,)] where @$ is a random starting value, and
=
On Repairable Components with Continuous Output
175
Assume, specifically,that the starting age of this conditional distribution is the decreased output at the end of the previous cycle. FO,i+l(Cp)= ~[Fo(cplq~o,i)l, = 1,2,
.-.
(18)
where 0 5 q < 1 shows 'the extent' of the repair action. The value q = 0 corresponds to the perfect repair and it is clear that there is no analogue of the minimal repair6 in this case. Therefore, this setting is similar to the Model 1 of Kijima.8 It can be provedg that if Fo(9) is IFR, then (16) holds and the limiting distribution f i ( c p ) exists (for the latter result the IFR assumption is not required). Therefore, relation (12) for the stationary output value (with mentioned above substitution Fo((P)by 4 (9)) holds.
Remark 12.3: There can be different ways of modelling the initial increasing process at,t 2 0. Assume, for instance, that it has independent increments. A good candidate for this is the gamma process Wt,t 2 0, WO= 0, which, according to the definition, has stationary independent increments and an increment Wt - W,, (t > s), has a gamma density with scale 1 and shape (t - s). The Levy process, as a more general one, is often suitable for this purpose as well. The increment in [t,t + h] can be also defined in a following natural way:"
+
@t+At - @t = a ( @ t ) & ( A t ) b(@t)At,
(19)
where &(At) is a random variable with a positive support and finite first two moments, and a ( . ) , b ( . ) are continuous positive functions of their arguments. Letting At -+ 0, we arrive at the continuous version of (19) in the form of the Ito stochastic differential equation:
+
d@t = a(@'t)dv(t) b(@t)dt,
(20)
where v ( t ) , t L 0 is, for instance, a gamma process, if &(At)has a gamma density with scale 1and shape (At). Integrating Eq. (20) €or O0 = 0 results in
12.5. Concluding Remarks
The continuous outputs (deterministic and stochastic) for repairable components have been studied in this note and simple relations for stationary characteristics were obtained. It is easy to use the formulas of Sec. 12.2 and Sec. 12.3 in practical calculations.
M. S. Finkelstein
176
Only one method of describing continuous output systems of continuous output components, based on t h e structure function of Barlow and W U was considered. The described approach can be easily generalized to the model, which takes into account the possibility of relevant maintenance actions. The new type of imperfect repair, discussed in Sec. 12.4, characterizes imperfect repair actions via stochastically decreasing sequence of random variables defining the corresponding critical values (and, therefore, the failures at each cycle). It seems t h a t this topic needs further attention in the future.
References 1. A. Lisniansky and G. Levitin, Multi-State System Reliability. Assessment,
2. 3. 4. 5. 6. 7. 8. 9. 10.
Optimization, and Application, World Scientific. New Jersey, London, Singapore, Hong Kong (2003). A. Hoyland and M. Rausand, System Reliability Theory, John Wiley and Sons, New York (1994). A. Barlow and A. Wu, Coherent systems with multistate components, Math. Oper. Research 4, 275-278 (1978). F. Beichelt and K . Fischer, General failure model applied to preventive maintenance policies, I E E E Transactions o n Reliability 29 (1980). H. W. Block, T. H. Savits, and W. Borges, Age dependent minimal repair, J. Appl. Prob. 22,370-386 (1985). M. S. Finkelstein, Some notes on two types of minimal repair, Adw. Appl. Prob. 24,226-229 (1992). M. S. Finkelstein, The performance quality of repairable systems, Quality and Reliability Engineering International 19,67-72 (2003). M. Kijima, Some results for repairable systems with general repair, J . Appl. Prob. 26,89-102 (1989). M. S. Finkelstein, A restoration process with dependent cycles, Automat. Remote Control 53, 1115-1120 (1992). N. D. Singpurwalla, Survival in dynamic environment, Statistical Science 10,86-108 (1995).
CHAPTER 13 EFFECTS OF UNCERTAINTIES IN COMPONENTS ON THE SURVIVAL OF COMPLEX SYSTEMS WITH GIVEN DEPENDENCIES
AXEL GANDY Institut fur Angewandte Mathematik und Statistik, Universitat Hohenheim 70593 Stuttgart, Germany E-mail: agandyauni-hohenheim.de When considering complex systems (i.e. systems composed of multiple components) in practice, the failure behavior of the components is usually not known precisely. This is particularly true in early development phases of a product. We study the influence of uncertainties in the marginal distributions of the components' lifetimes on one-dimensional properties of the system's lifetime (e.g. expectation, quantiles). We do not assume that the components are independent; instead we require that their dependence is given by a known copula. We consider two approaches. In the first approach, we assume that the margins are within a fixed distance from known distributions. This approach leads to bounds on the one-dimensional properties and requires the solution of a nontrivial optimization problem. We provide solutions for some special cases. The second approach is Bayesian and assumes some prior distribution on the marginal distributions. For example, one may assume that the margins belong to parametric classes and that distributions on the parameters are given. Monte-Carlo simulation can be used to obtain the distribution of one-dimensional properties of the system's failure time.
13.1. Introduction We consider a classical complex system allowing just two states: working (coded as 1) and failed (coded as 0). We assume that t h e system consists of n binary components and that t h e states of the components determine the state of the system, i.e. we have a structure function @ : (0, l}" -+ (0, l}. In this paper, we assume @ to be monotone, meaning that @ is monotone in each component, @ ( O , . . . ,0) = 0 and @ ( l ,..,.1) = 1. The nonnegative random lifetimes of the components are denoted by T I , .. . ,T,, with joint 177
178
A . Gandy
cumulative distribution function (cdf) H . Let X i ( t ) = l{Ti>t) be the state of the ith component at time t , where 1 denotes the indicator function. The probability that the system has failed up to a given time t is
(1) For further discussion of complex systems see Aven and Jensen.’ If F = (F1,. . . , F,) denotes the marginal distributions of the lifetimes TI, . . . ,T, then, by Sklar’s theorem, there exists an n-copula C such that H ( t 1 , .. . ,t,) = C ( F l ( t l ) ., . . , F,(t,)) for all t l , . . . , t,. This can be written as H = C o F . Recall that an n-copula is an n-variate cdf with margins that are uniform distributions on [0, 11. If F I ,. . . ,F, are continuous then C is uniquely determined. If T I ,. . . ,T, are independent then C can be chosen as the product copula II(z1,.. . ,z,) = z1z2 . . . z,. Further details can be found in Nelsen2 and In practice, H is usually not known precisely. This is particularly true in early development phases of a product. Often, one is interested in onedimensional properties of the system’s lifetime distribution S*,H like the expectation or quantiles. These properties are defined by mappings from the space D of cdfs of nonnegative random variables, into R = R U ( - 0 0 , 0 0 ) . Say q : D -+ is one of these mappings. We want to study how imprecise knowledge of H influences q ( S @ , H ) . To simplify this task, we confine ourselves to uncertainties about the marginal distributions F1 , . . . , F, and assume that the dependence structure of H is given by a known copula C. We consider two approaches. In the first approach, we assume that there are known GI , . . . , G, E D such that di(Fi,G,) 5 for some E , 2 0 where d l , . . . , d , are functions measuring distances in the space D. To derive the resulting bounds on q ( S + , H ) we have to solve a nontrivial optimization problem. We consider this approach in Sec. 13.3. The second approach is Bayesian in nature. We assume that the marginal distributions Fl , . . . , Fn depend on some parameter 8 E RP. We assume that some (prior) distribution on 0 is given. In Sec. 13.4 we consider this approach and give an example. In Sec. 13.5 we will compare aspects of the two approaches. Before starting with the two approaches, we show in Sec. 13.2 how the marginal distributions F1, . . . , F, can be separated from the structure function @ and the copula C in the lifetime distribution S+,Hof the system.
Effects of Uncertainties in Components on the Survival of Complex Systems
179
13.2. System Reliability with Dependent Components We will introduce a function G*,c and show how, together with the marginal distributions, it determines the cdf S ~ , H of the system's lifetime. Let @ : { O , l } " + { 0 , 1 } be a monotone structure function and let C be an n-dimensional copula. Let be the probability measure on (R", B(R")) induced by C , where B(Rn)is the Borel-cT-algebra on R". Note that 11") = 1. For t E R,let @, := (-m, t ] and BI := ( t ,cm). Let G+,c : [0, 11" + [0,1]be given by
c
c([O,
C
G a , c ( t l , . . . ,tn):= 1 -
@(x)C
PE{O,l)"
Lemma 13.1: Let @ : ( 0 , 1)" -+ { 0 , 1 } be a structure function, let C be an n-copula and let F = (F1,. . . ,F,) E D". Then for all t 2 0, S + , C ~ F == ( ~G+,c(Fl(t),... ) ,Fn(t)), where S ~ , C is ~ given F by ( 1 ) .
Proof: Let (TI,. . . ,Tn) be a random vector with cdf C o F . For t X i ( t ) := l { ~ ~ ,F~ o r}t .2 0 ,
2 0 let
l{+(e)=O}P(Xi(t)= ZiVi)
S+,CoF(t) = PE{0,1}"
=
1-
c
( P ( X ) P ( X i ( t )= ZiVi).
ecE{O,l)"
Let U = ( U I , . . . , Un) C and for i = 1 , . . . ,n, let FL1 be a generalized d inverse of Fi.Then (Fcl(U1),. . . ,FL1(Un))= ( T I , . . ,Tn). Hence, for t 2 0 and x E (0, l}", N
P ( X i ( t )= ZiVi) = P(T2 E Bk;Vi) = P(FC1(U,) E BkiVi)
E) P(UiE @)yq
=6
("npxt)) xi
,
i= 1
To see the validity of (I),one can argue as follows. Let the at most countable set of discontinuities of FL1 be denoted by Ai. If u $ Ai then u 5 Fi(t) iff FL1(u) I t (see Witting4 p. 20) and hence u E B E ( t ) iff FF1(u) E B4,. Since each Ui is uniformly distributed on [0,1], P(UZ1{ UiE Ai})5 P(Ui = x) = 0. Hence, (I) holds. 0
cz,czEAi
The following lemma is a consequence of properties of copulas and the assumption that is monotone.
A . Gandy
180
Lemma 13.2: G+,c is nondecreasing and continuous. Proof: To prove that G+,c is nondecreasing we show that G+,c is nondecreasing in each component. Without loss of generality, we consider the last component. Let s = (sl,.. . ,)s, E [0, lIn such that s, 5 t 5 1. We show that for any z E (0, l},-',
(2) where @(z,y) = @(XI,.. . ,zn-l,y). If @(z,O) = @(z, 1) = 0 then (2) is trivial. If @(z,0) = @(z,1) = 1 then (2) holds because B:" Bin = BA BE = R and 6 is additive. Since @ is nondecreasing the only remaining case is @(z, 0) = 0 and @(z, 1) = 1. Since (s, co)3 ( t ,co),
+
+
The continuity of G+,cis a consequence of the continuity of the copula C. In fact, G ~ , c ( is t )just a linear combination of the values of C a t certain points that depend continuously on t E [0,11,. These points are the elements of n;=&ti, 1). 0
Example 13.1: For a parallel system with n components, i.e. the system fails when all components have failed, @(XI,.. . ,z), = 1 - fl:==,(l- 5 , ) and hence G+,c = C. In the case of a serial system, i.e. the system fails when a single component fails, @(XI,. . . , z), = xi. If there are just n = 2 components then G+,c(tl,tz)= tl t2 - C(t1,ta). Fig. 13.1 illustrates G+,c for n = 2 components.
+
nYZl
Example 13.2: If the product copula components are independent, then
n(u) = ny=lui
is used, i.e. the
Effects of Uncertainties in Components on the Survival of Complex Systems
181
Fig. 13.1. Illustration of G + , c in the case n = 2. The copula C induces a measure on 10,112. The value of G g , c ( t l ,t2) is the measure of the shaded area in the case of a parallel/serial system (from left to right).
13.3. Bounds on the Margins In this section, we want to study how one-dimensional properties of the system's lifetime cdf behave if only bounds on the margins are known. More formally, let @ be a monotone structure function, let C be an n-copula and suppose we know that the marginal distributions F = (F1,. . . ,F,) E D" of the failure times satisfy di(Fi,Gi)5 f i ti = 1,.. . ,n, where GI , . . . ,G, E D are known cdfs, ~i 2 0 and di : D x D + [0,m] measure distances between cdfs. Examples for di are the uniform metric
d,(F,H)
:= sup
IF(t) - H ( t ) l ,
tE[O,m)
L,-distances on the cdfs
d,(F,H) :=
(1"
1lP
IF(t) - H(t)IPdt)
(for some p > 0),
and L,-distances on the inverse cumulative distribution function
(For some P < 0 ). The distance d z l is called Mallow's metric. We are interested in a one-dimensional property of the system's lifetime cdf which we assume t o be given by a function q : D -+ E.Examples for q include the expectation E ( F ) := - F ( t ) ) d t and quantiles Q p ( F ):= inf{t E [0,m) : F ( t ) 2 p } (for some O < p 5 1). In practice, it is often interesting to be able to give guarantees for these properties of a system. In mechanical engineering, for example, the requirement specifications often include a 10% quantile for the system's lifetime.
som(l
A. Candy
182
For the properties we mentioned above, a smaller value is “worse.” Therefore, to guarantee a certain behaviour, we are interested in the minimal possible value of the property. In the following, we shall consider the minimal possible value of q(S+,CoF)given the restrictions on the margins. For this, the following optimization problem over F = ( F I ,. . . ,F,) has to be solved.
i
Q ( S + , C o F ) + min
(*> F E Dn di(Fi,Gi)5 ~ i ,=i 1,.. . ,n We are not aware of a general solution of (x). However, for some important special cases solutions can be given. A related problem would be to consider the maximal value given the restrictions. We will not discuss this problem here. 13.3.1. Uniform metric For the following, we use the usual stochastic ordering5>‘on D , i.e. F 5 , G iff F ( z ) 2 G(z) for all z E [ O , c o ) . If q is monotone with respect to “5,” then a solution of (*) in the case of the uniform metric is given in the following proposition. For applications, note that the functions E and Q p are nondecreasing.
Proposition 13.1: If q is nondecreasing with respect to the usual stochastic order ‘Y,” on D and if di = d,,i = 1,.. . , n, then Fo given by
F,O(t):= (Gi(t)+ ~ is a solution of
(*)1
i A )
1, i
=
1,.. . , n
(3)
where a A b = min{a, b } .
Proof: Clearly, Fo E D” and d,(Gi, F!) 5 ~i for all i. Let H E D” with d,(Gi,Hi) 5 ~i for all i. For i E (1,.. . , n } and t 2 0, we have Hi(t) 5 (H,(t) - Gi(t)( Gi(t) 5 &i Gi(t) and Hi@)5 1. Hence, Hi(t) 5 F!(t) for all i and t . Since G+,c is nondecreasing by Lemma 13.2, S+,CoH(t) = G+,c(Hi(t),. . . , & ( t ) ) I GQ,c(F:(t),. . . , F:(t)) = SQ,CoFO(t).Hence, S ~ , C ~2,HS Q , C ~ FSince O. 0 q is nondecreasing, Q ( S + , C o H ) >_ q(S+,CoFO).
+
+
Remark 13.1: Note that (3) does not depend on q.
Remark 13.2: Instead of using Lemma 13.2 in the proof of Proposition 13.1 one could also employ a remark on p. 240 of Muller and Stoyan.‘ In
Effects of Uncertainties in Components on the Survival of Complex Systems
183
our notation the remark states the following. If @ is a monotone structure function, C is an n-variate copula, F 1 = ( F t ,. . . ,F i ) , F 2 = (F;, . . . ,F:) fi D” and F: IsF;,i = 1,.. . ,n,then S + , C o FIs ~ S+,CoF1.
13.3.2. Quantiles Next, we consider (*) if q = Qp,that is if we are interested in quantiles. In the following we use the notation a V b := max{a, b}. Proposition 13.2: Suppose that for i = 1,.. . ,n, the distance di has the property that for Ho, H I ,H2 E D ,
IHo(t) - Hl(t)JI JHo(t)- H2(t)lVt 2 0 implies di(H0,H I ) I di(H0,H z ) . Let G;”(t) := Gi(t)V ~ l [ ~ , ~ zi(s) ) ( t ):= , sup{z E [0,1] : di(Gi,Gi’”)5 t i } and G” := (GS,’z’(s), . . . ,G$””(”). If q = Qp for some 0 < p 5 1 then a lower bound for the optimal target value of (*) is given by
to
.
:= inf{t E [ O , c o ) : G+,c(Gi(t),. . ,Gk(t))2 p } .
Proof: Let H E D” such that d(H,,G,) 5 e,,i = 1,.. . , n . Let v := Qp(G+,co H ) . We will show to 5 v. Define K E D n by K , := G7H‘(”). Note that this implies K,(v)2 H,(v), i = 1,. . . , n. The assumption on d implies that d(G,, K,) I d(G,, H,) I 6,. Hence By the monotonicby the definition of G”, we have G,”(v) 2 K,(v). ity of G+,c (see Lemma 13.2) it follows that G+,c(G’;(v), . . . ,GK(v)) G+,c(Hl(v), . . . , Hn(v)). Furthermore, the right-continuity of H and the continuity of G+,c imply that Ga,c(Hl(v),. . . , H n ( v ) ) L p . Hence, G+,c(G’;(v),. . . ,GK (v)) 2 p which implies to 5 v.
>
To actually compute to, often one has to resort to numerical methods.
Example 13.3: Gto need not be a solution of (*). Let n = 1, d l ( F , H )= l { F ( 0 ) Z H ( O ) ) , € 1 = 0.5 and G ( t ) = t for t E [0,1]. As a consequence of CP being monotone we have @(O) = 0 and @(1)= 1. If p = 1 then z ~ ( s )= 1 for s > 0 and zl(0) = 0. Clearly, G+,c(Gt(t))= G+,c(l)= 1 for t > 0 and G@,c(Go(0)) = 0. Hence, to = 0. But Q1(G@,co Go) = 1 > 0.5 = Qi(G+,c0
A . Gandy
184
13.3.3. Expectation
Solutions for the expected system lifetime (i.e. q = E ) in the special cases of parallel and serial systems with independent components when di(F, H). = IF(x) - H(z)I dx are also possible.
so"
Proposition 13.3: Suppose that di(F,H ) = IF(z) - H(x)I dx,i = 1 , . . . ,n, that q ( F ) = F ( s ) ) d z and that C(z) = n(z) = xi. If q(Gi) < 00, i = 1 , .. . ,n then the following holds.
s,"(l-
n:=,
(1) Suppose that @ i s the structure function of a parallel system, i.e. @(x) = 1- xi). For i = 1,. . . ,n, let & := inf{q E [O,oo) : Jv"(l -
n;.,(l
Gi(z))dz 5 ~ i }a n d P i ( z ) := G i ( z ) l ~ , , b } + l ~ , ~ t }T. h e n ( P I , . . .,fin) i s a solution of (*). (2) Suppose that @ is the structure function of a serial system, i.e. @(z) = n;=,xi. For i = 1,..., n, let 6i := sup{^ E [Gi(O),l] : Jo"(~ - Gi(z))+dz 5 ~ i } and fii(x) := Gi(z) V 6 i . T h e n (Fl,... ,pn)i s a solution of (*). Proof: First, we consider the case of a parallel system. We will show that for any H E D n satisfying di(Hi,Gi) 5 ~i for all i the following holds. For -i each i E { l , . . , n } , H := ( H I , .. . , H i _ l , F i , H i + l , .. . , H n ) satisfies (4)
Using this result sequentially we get q(S+,noF) 5 q(Sa,noH)which is what we want. Due to the symmetry of C = Il and @ it suffices to show (4) for i = 1. Note that G+,n(tl,.. . , tn) = ti. With h ( z ) := Hi(z) and using Lemma 13.1,
nZ1
nZ2
Effects of Uncertainties in Components on the Survival of Complex Systems
Splitting the integral at
61
185
we can use that h is nondecreasing t o get
) J,"(l - Gl(z))dz. I f f ( & ) = €1 then where f ( ~ :=
(L
00
A
I h(J1)
IHl(z) - Gl(z)ldz - €1) I h(E1)(~1- €1) = 0.
Iff(&) < €1 then since f is decreasing and continuous, & = 0. Hence,
A 5 h(tl) (Lrn(H1(z)- G1(z))+dx -
1
Im(l
- G1(r))dz)
00
I h ( h ) (Lm(l- Gl(z))+dz -
(1 - G ~ ( z ) ) c h )= 0.
This finishes the proof of (4) and as discussed above the proof for the parallel system. Next, we consider the serial system. By the same arguments as for the parallel system, it suffices t o show Y(S?p,nofi4 i d S * , I I o H ) ,
(5)
where H a := (Hll . . . l H ~ - l l F a , H a + l , . . . l Hand n ) H E D n such that d3(H3,G3)5 c3 for all j . Again we may assume i = 1. Note that now Go,n(tl,.. . , t n ) = 1 - t a ) . With h(z) := - Ha(z)),
nz1(l
n;="=,l
A . Gandy
186
If 61 = 1 then F1 (z) = 1 which implies H l ( z ) V Gl(z) - Fl(z) 5 0 and thus A 5 0. If 61 < 1 one can proceed as follows. Since Gl(z) V Hl(z) is nondecreasing and Gl(z) V H l ( z ) -+ 1 as z 00, there exists p E R+ such that Gl(z) V Hl(z) < 61 for all z < p and Gl(z) V H1(z) 2 61 for all 3: 2 p . Hence, G1 (z) V H1(z) < (z) for all z < p and G1(z) V H I (z) 2 fil(z)for all z 2 p . Since h is nonincreasing, -+
where f(.) := som(~-G1(z))+dz.The function f ( ~ is ) increasing and continuous. Indeed, continuity follows from f ( 6 ) = 1 { , > ~ ~ ( ~ ) ) d=~ d a : l{,,G,(,)}dzdu. Since 61 < 1 this implies f(&) = G. Thus,
:s :s
soo0
13.4. Bayesian Approach
The second approach is Bayesian in nature. We assume that the marginal distributions Fe = (Ff, . . . F f ) depend on a parameter 6 = (01,. . . , O P ) E RP. We assume that the parameter 6 is random and follows a known distribution T . Since 6 is random, the quantity of interest q(S@,CoFe)is random as well. We denote the distribution of q ( s * , C o F e ) by G. Explicit formulas for G in the general case cannot be expected. However, in many cases Monte-Carlo simulations be used as follows to obtain an approxmation of G. Let 6(l),. . . , 6 ( k ) be independent random vectors with distribution T . Based on this sample one can compute ai := q(S+,CoFe(;))using numerical methods. Let G ( t ) = l { , i s t )be the empirical distribution based on a l , . . . ,ak. The Glivenko-Cantelli theorem ensures that G(t) converges uniformly to G as k tends to infinity. This approach can be generalized to incorporate uncertainties in the copula. Indeed, one only has to choose a suitable class of parametric copulas
xF=l
Effects of Uncertainties in Components on the Survival of Complex Systems
187
and assign a distribution to the parameters of the copula. After that, one can use the obvious modification of the above simulation approach.
Example 13.4: Consider a 2-component serial system. Assume that the joint distribution of the failure times is a Marshall-Olkin distribution, i.e. TI = min(21,212), T2 = min(Z2,212), where 2 1 , 2 2 , 2 1 2 are independent and exponentially distributed with rates XI, X2, X l z . The copula of the joint distribution of TI and T2 is a generalized Cuadras-Auge'copulaCa,p(u, v) = min(u'-av,uwl-P) where CY = X12/(X1 X12) and j 3 = X12/(X2 Xl2). Details of this can be found in Chap. 3.1.1 of Ref. 2. Assume that we know that c o . 5 , 0 . 5 is the copula of our system and that TI and T2 are exponentially distributed with rates 61 and 62, where 61 and 8 2 are i.i.d. following a uniform distribution on [0.005,0.015].Then the cdf of the 0.1-quantile and the 0.2-quantile of the system are shown in Fig. 13.2.
+
+
Fig. 13.2. Cdf of Qo.i(FS) and Q o . 2 ( F S ) of a 2-component serial system with copula c0.5,O.S.
If we replace c0,5,0,5by the FrBchet-Hoeffding lower bound copula W(u,v) = max(u w - l , O ) , the product copula ll(u,v) = uw or the Frdchet-Hoeffding upper bound copula M ( u ,w) = min(u, v) the cdf of the 0.2-quantile of the system is as shown in Fig. 13.3. The plots in Fig. 13.2 and Fig. 13.3 are based on the simulation approach sketched at the beginning of this section.
+
188
A . Candy
Fig. 13.3. Cdfs of Qo.z(FS) of a 2-component serial system with copulas (from left to right) W , n, c 0 . 5 , 0 . 5 and M .
13.5. Comparisons
A problem of the approach of Sec. 13.3 is that it only yields a bound. This bound may be too pessimistic. Furthermore, in order to compute this bound one has to solve a nontrivial optimization problem. The advantage is that one does not only consider marginal distributions within a certain parametric class as in the approach of Sec. 13.4. The advantage of the Bayes approach is that it yields a distribution of properties of the system’s failure behavior which may be more realistic than just a fixed bound. Since this distribution can be evaluated using simulation, the Bayes approach is relatively straightforward to apply.
Acknowledgments Support of the “Deutsche Forschungsgemeinschaft” is gratefully acknowledged. I would like to thank Uwe Jensen for his valuable comments.
References 1. T. Aven and U. Jensen, Stochastic Models in Reliability, Springer (1999). 2. R. B. Nelsen, An Introduction to Copulas, Lecture Notes in Statistics, Vol. 139, Springer (1999). 3. H. Joe, Multivariate Models and Dependence Concepts, Chapman and Hall (1997). 4. H. Witting, Mathematische Statistik. I, B. G. Teubner, Stuttgart (1985).
Effects of Uncertainties in Components on the Survival of Complex Systems
189
5. M. Shaked and J. G . Shanthikumar,Stochastic Orders and Their Applications,
Academic Press (1994). 6. A. Miiller and D. Stoyan, Comparison Methods for Stochastic Models and
Risk, Wiley (2002).
This page intentionally left blank
CHAPTER 14 DYNAMIC MANAGEMENT OF SYSTEMS UNDERGOING EVOLUTIONARY ACQUISITION
DONALD P. GAVER Operations Research Department Naval postgraduate School Monterey, C A 93943 USA Email: [email protected]
PATRICIA A. JACOBS Operations Research Department Naval Postgraduate School Monterey, C A 93945 USA Email: [email protected]
ERNEST A. SEGLIE Ofice, Director Operational Test and Evaluation The Pentagon, Washington D C 20301 USA Email: [email protected] Procurement of modern military systems is made timely and effective by invoking evolutionary acquisition and spiral development. This process is dynamic: an initial system version (Block 1) is designed, manufactured, tested, and fielded while next generation’s (Block 2) altered improved subsystems are simultaneously developed (along an upward technological spiral of effectiveness and suitability). Developmental and operational tests determine when Block 2 provides a measurable and operationally significant improvement over, or complement to Block 1. Inferred cost and capability of Block 2 dictates its introduction time, and so on to Block 3, etc. Enemy adaption to the capabilities of current blocks accelerates the transitions to new versions and concepts of operations (CONOPS); otherwise the current Block eventually becomes technically obsolete. Concepts of effectiveness and suitability growth, and
191
192
D. P . Gaver, P. A. Jacobs, and E. A. Seglie
mutation, must be encouraged and quantified, initially using exploratory models, but later tested against challenging but realistic opposition and environments. Such exploratory models are furnished.
14.1. Setting The new US. Department of Defense (DoD) acquisition regulations state that “the primary mission of Defense acquisition is to acquire quality products that satisfy user needs with measurable improvements to mission capability and operational support, in a timely manner and at a fair and reasonable price” (DoD Directive 5000.1). “Evolutionary acquisition is the preferred DoD strategy for rapid acquisition of mature technology for the user. An evolutionary approach delivers capability in increments, recognizing up front the need for future capability improvements” (DoD Instruction 5000.2). Thus, a military system is fielded in a series of Blocks (partial or total system upgrades), each Block desirably being more capable and suitable than the previous ones. The improvement should be testable and adequately tested in a field environment. Modeling and Simulation (M&S) techniques are to be used to plan, preview, and rehearse such tests, but not to substitute for them.
14.1.1. Preamble: broad issues The types of systems involved are one-shot/destructable (missiles), vehicles (land, sea, amphibious in groups), C4ISR systems; all in end-to-end operation. The time to develop and test a new block is typically several nominal one-year budget cycles, but can also be event-driven-sequential. There can be many reasons to initiate design of a new Block, b + 1, say, given Block b is in some stage of testing or field employment. Anytime such a block initiation starts, it typically encounters engineering/technical problems that will take time to resolve, and also incurs unpredictable and somewhat uncontrollable costs; these two features are inter-related and are generally categorized as project (Block b 1) risk. At present, such risk is put into broad categories by “expert judgment,” so choices are made subjectively, and it is assumed that mistakes can be corrected subsequently, either by within-Block b modification, or after a new block is fielded. The latter is apt to be especially costly, time-consuming, and even unsafe, so alternative options are important; the evolving planning tool Real Options (see Glaros’) is being examined as a systematic way to study this entire life-cycle process.
+
Dynamic Management of Systems Undergoing Evolutionary Acquisition
+
193
Timing the start of a new block (the ( b l)st)is an important decision problem: a relatively long time between blocks may, and is desired to mean, that upon introduction, the new ( ( b + l)St) Block is considerably more capable than is Block b, where capability is a composite of performance and Effectiveness and Suitability, which includes Reliability, Availability, Maintainability, and Safety (RAMS); RAMS is a crucial aspect of Suitability. However, there is a tendency for a new ( ( b + l)St) Block to have undesirable features such as too much weight and to cost more than Block b unless vigilance is strongly exercised; unfortunately, such undesirable features often have shown up undesirably late in the acquisition cycle (during Operational Test (OT) or even field experience). This surprise may negate the promise of the new Block, at least temporarily for certain missions and concepts of operations (CONOPS). Also, if Block b is in place too long, opponent adaptation to its initial capability may greatly degrade that capability; this may be called operational obsolescence. These issues argue for built-in dynamic flexibility and operational effectiveness tracking; unexpected field modification can sometimes be essential, but is better avoided by careful early design and testing. Generally, design modifications are most cost-effective if made early in an acquisition evolution. A relatively short time between issuance of successive blocks may provide for quick response to surprises, either from opponent action or from unanticipated misbehavior of the system itself. Actual large losses (attrition) of the systems in Block b may be tolerable if Block b 1 is nearly ready, or its development can be accelerated. On the other hand, a short between-Block time interval can mean that more modest increases in capability are likely, and that there must be more frequent developmental testing (DT) and operational testing (OT) to validate the quicker succession of new Blocks. A compromise must be sought and justified. Tradeoffs can best be understood by systematic not-excessively-detailed end-to-end preview modeling. Real Options should be defined and compared, initially by M&S. Real Options refer to maintenance of future flexibility in program deployment, e.g., the option to (a) defer investment in a technology using one of several alternatives, (b) to contract differently, at a different pace, (c) to change scale/level/pace of acquisition, (d) to plan to operate in the field differently, or any combination thereof.
+
194
D . P. Gaver, P. A . Jacobs, and E. A . Seglie
14.1.2. Testing Consider a system that is undergoing evolutionary development; it will be fielded in successive blocks. Each Block will undergo DT, OT, and eventually fielding. DT is testing at the component-subsystem level, and is the optimum period in which to rectify Design Defects or Failure Modes. OT is intended to closely imitate field operational actions, e.g., using military (“soldier”) operators and maintainors, thus revealing different Failure Modes for removal. They can well be coordinated much more nearly “in parallel,” or combined, than is current practice. Efforts in that direction are encouraged. This should include total system testing in scenarios that include “Red Team” opposition. In this initial simple discussion we combine both developmental and operational testing into one test phase. Similar results can be obtained for the more general situation. Suppose Block b 1 is starting development. We assume there are D(b 1) design defects (DDs) that can be discovered and removed from Block b 1 design, and copies thereof during testing (the number is unknown but characterized by a probability distribution). Those DDs not removed by testing can activate in the field, leading to costly repairs and decreased system availability for missions. Tests should be designed to reveal those DDs most likely to occur under realistic field environments. Inevitably, some DDs will remain in each block; they are candidates for later correction. In this chapter, we consider a model for the development of a oneshot destructable system, e.g., a missile. Although such systems realistically function in a nearly sequential fashion, so that later stages (e.g., the warhead) can only be tested if early stages (e.g., initial propulsion, booster, guidance, etc.) function properly (any existing early stage DDs do not activate and cause early failure, inhibiting the test of later stages, see Gave?), we postpone treatment of this real issue for the present. We assume a version of the system (Block b ) has been fielded at time 0. A new version of the system (Block b + 1) is presumed, (arbitrarily) starting development at time 0. Time and money spent on the development and testing of Block b + 1 should ultimately result in a system which is more capable than Block b but which may have a number of new different DDs. Alternatively, Real Options will possess their own virtues and faults to be discovered by test. Once Block b 1 is deemed sufficiently capable and sufficiently free of DDs, it will be appropriate to field it to conduct missions in place of Block 6. Naturally, if too much of the budget has been committed to development and testing, then little will remain to purchase units of the improved Block
+
+
+
+
Dynamic Management of Systems Undergoing Evolutionary Acquisition
195
b + 1 in order to conduct missions. An appropriate balance must be struck. A modeling, analysis, and evaluation scheme such as we propose should guide this evolution. In Sec. 14.2 we introduce the model and in Sec. 14.3 we present numerical examples. 14.2. Modeling an Evolutionary Step 14.2.1. Model for development of Block b 4-1
+
The ultimate probability of mission success for Block b 1 after development and testing depends in part on the funding allocated to development and its rate of expenditure. The development process should not only increase the effectiveness of Block b + 1, but can also introduce DDs which can be removed by testing. Let Td(b 1) be the total time to develop and test the design and manufacture of Block b 1. Let M(Td(b + 1)) be the total funds spent to develop and test Block b 1; m ( T d ( b 1)) = M(Td(b l ) ) / T d ( b 1) is the average rate of that expenditure. A parametric model for the conditional probability of mission success after expenditure of M(Td(b+ 1))at a rate m(Td(b 1))given no remaining DDs activate during the mission, is of the form
+
+
+
+
+
+
+
Pd
= p*(b
( b + 1; T d ( b + 1); M (Td(b + 1)))
+ 1)Gd [b+ 1; M (Td(b + 1)); m (T’(b + l))],
(1) where p * ( b 1) is the maximum achievable probability of mission success (summarizing operational effectiveness) and Gd[b + 1; M ( T d ( b + 1)); m ( T d ( b l))] is the fraction of the maximum achieved as a result of the development process. The specific parametric example used for illustration is
+
+
Gd
[b+ 1; M (Td(b + 1)) ; m (Td(b
+
+
+ I))]
= [ I - exp {-alM (Td(b 1))=’}] exp {-a2 [m(Td(b I))=*]}, (2) for a, > 0, ai > 0 €or i = 1, 2. Of course (1) and (2) are hypothetical and speculative, but are plausible, and certainly subject to revision and/or refinement. Note that the fraction of the maximum effectiveness achieved, Gd, increases as the funding expended on development, M ( .), increases; while the overall fraction of possible success G d achieved decreases as the time-rate of expenditure increases (the “haste makes waste” effect). Other parametric forms are possible, but the present is illustrative, and can be sequentially estimated parametrically (ai > 0, ai > 0 for i = 1, 2 can at least
196
D. P . Gaver, P. A . Jacobs, and E. A . Seglie
be block-dependent) . It is presumed that performance level should increase at an eventual decreasing (saturating) rate with development budget ( M ) , but can be penalized by a too-rapid attempted rate of improvement (rate of expenditure, m).
14.2.2. Introduction of design defects during development
and testing We assume the number of DDs introduced during initial development has a Poisson distribution with (conditional) expected value (3)
where a2 is not necessarily = a3 > 0 above; both (2) and (3) are illustrative and subject to change as test data accumulates. K Omay be partially random, but also allowed a deterministic regression component, introduced to represent between-copy and environmental variability in Block b 1. In this model, the expected number of DDs introduced during the development process increases as the rate of expenditure for development increases. Less extreme variation can, of course, also be represented, and the growth function G d should be sequentially re-estimated as the b lStdevelopment progresses. It is assumed that at least some of the DDs introduced during development can be found during testing and removed; this is a costly and timeconsuming process. In this chapter, we will mingle/pool DT and OT; such intermingling is a proposed current trend. Each remaining DD can be activated during a test with probability 1- 8, more generally, each &value can be chosen at random and held fixed; such more complex analysis is avoided here. More than one DD may be activated during a test and, desirably, removed. Here DDs activated during a test are presumed to be successfully removed (an optimistic simplifying assumption that can and should be relaxed). The testing policy is to allocate ~ T d ( b 1) DT/OT tests: The decision parameter T is an average number of DT/OT tests conducted per time unit of development. Each test uses/expends one copy of Block b 1 at a unit cost of c,(b+ 1).Additionally, each test costs c t ( b + l ) .Each DD discovered during a test is removed at a cost of cT(b 1). By results for the present simple Poisson/binomial model, the conditional distribution of the number of DDs remaining after development and testing, given K O ,is represented
+
+
+
+
+
197
Dynamic Management of Systems Undergoing Evolutionary Acquisition
by a Poisson distribution with mean (hasty expenditure induces more DDs):
+
+
+
E [ D( b 1; M (Td(b 1)) ; Td(b 1); 7)I K o ] = K , , ( ~~ ~+(1))"3 b eTTd(b+l).
(4)
We assume that DDs remaining in Block b + 1 after fielding have the potential of causing mission failure if activated during a mission. Let OF be the probability a DD remaining in Block b + 1 after fielding does not activate during a mission. The conditional probability of mission success for Block b 1 after completion of development and testing given the number, D ( . ) , of remaining DDs is
+
+ 1lD ( b + 1; kf(Td(b + 1 ) ) ; T d ( b + 1 ) ; D(bt-1; M ( T d ( b + l ) ) ; T d ( b + l ) ; ~+ 1); ( M b ( ~ ~+ ( 1))) b eF (b
pF
= pd ( b
+ 1; ~
T)) 7).
(5)
Note that all fielded copies of Block b+ 1will have the same remaining DDs since they all are constructed according to the currently evolved design. This is a simplification that does not hold for certain system types, e.g., copies of a particular Navy ship design. Nor is manufacturing and environmental variation represented here.
14.2.3, Examples of mission success probabilities with random K O a) Suppose first that KO = k, a constant; then the probability that, after fielding, no remaining DDs activate during a mission is D ( b + l ; M(Td(b+l));Td(b+l); [eF
= exp
"I
{ -km (Td(b + 1 ) ) Q 3O r T d ( b + l ) ( l
-OF)}.
(64
b) Alternatively, assume K Ois random with a gamma distribution, scale v > 0, shape parameter 0 < P, and Laplace transform E [e-sKo] = [l (s/v)]-'. This is the classical modification of the Poisson that leads to the negative binomial.
+
[e,
D ( b + l ; M ( T d ( b + l ) ) ; Td(b4-1);
=
[I+[
m (T&
"I
(6b)
+ 1))03 v
eTTd(b+l)
(l-eF)ll-p.
Further randomization is possible and mathematically tractable (e.g., randomization of 0 in the above exponent).
D. P. Gaver, P. A . Jacobs, and E. A. Seglae
198
c) Assume next that K Ois a random variable with a standard positive stable law distribution with scale u > 0 and order 0 < P < 1 and Laplace transform E [e-saO] = exp { -(s/v)O}; see Feller.3 Assume KO is the minimum of the stable random variable and a truncating exponential random variable having mean l / ~in;this case the Laplace transform of K Ois E [ e - s K o ] = K / ( K s ) S / ( K s) exp - ( ( s K ) / V ) ’ ) and
+ +
[1
{
+
+
1
E[Ko]= (l/n) - e-(n/v)P . The probability that no remaining DDs activate during a mission is
(6c)
where
+ 1) = m ( ~ d ( +b 1))*”e T T d ( b + l )
(1 - O F ) . Such auxiliary randomizations have been called “double stochasticity” (c.f. D. R. COX,^ M. S. B a ~ - t l e t tM. , ~ S. Bartlett,‘ and D. R. Cox;’ they can usefully represent environmental variation between (within) missions and/or manufacturing or configuration variability between manufactured copies. Its introduction is a form of sensitivity analysis. If K = 0, then the probability of field success is simply s(b
[e,
D(b+l; h f ( T d ( b + l ) ) ; T d ( b + l ) ;
= exp
{
-
[YIJ + s(b
I
T)
1)
(6d)
14.2.4. Acquisition of Block b 4-1
+ +
+
Let B(b 1) be the total budget to develop, test, and procure Block b 1. Let c,(b 1) be the cost per copy of Block b 1; ct(b 1) be the cost per test; c,(b+l) be the average cost to remove a discovered DD. The expected budget remaining after development and testing is
+
+
+ 1) = B(b + 1)- M ( T d ( b+ 1)) - [cm(b+ 1) + ct(b + l)]T T d ( b + 1) -c,(b + l)E [DR( b + 1; T d ( b + 1); , (7) where D R( b + 1; T d ( b + 1); is the number of DDs removed as a result B#(b
T)]
T)
of testing. After development and testing, the mean number of copies of
Dynamic Management of Systems Undergoing Evolutionary Acquisition
+
199
+
Block b+ 1that can be fielded is N F ( ~ 1) + = (B#(b l)/c,(b 1)). Each of these copies has the same remaining DDs in this initial model, and hence has the same probability of field reliability/suitability. This simplification is a strong candidate for relaxation.
14.2.5. Obsolescence of Block b and Block b
+1
Operational (and physical) obsolescence (i.e., the loss of effectiveness due t o enemy countermeasure adaptation, and/or physical aging, dearth of spare parts, etc.) is to be anticipated, and is a powerful reason for Block upgrading. Assume in advance that Block b will become obsolete after an independent exponential time having mean l/w(b); a statistical test for obsolescence, (i.e., a run of f 2 3, say, failures) can signal a need for change, e.g., in Block design or CONOPS. After obsolescence, the probability of mission success for Block b is p o ( b ) < p ( b ) ; p o ( b ) may be close to zero. Assume a fielded Block b + 1 can also become obsolete after an independent exponential time having mean l / w ( b 1). After obsolescence, the probability of mission success for Block b 1 is p o ( b 1) < p ( b 1).If Block b becomes obsolete before the completion of development and testing for Block b 1, then a decision can be made to continue to field Block b, or to field the current Block b 1 prematurely. We defer discussion of this possibility; see Gaver.8 In this chapter we will assume that Block b is completely ineffective after obsolescence; that is, p o ( b ) = 0; we also assume that Block b 1 is completely ineffective after obsolescence. Sequential real-time detection of obsolescence is an open problem, well worthy of attack.
+ +
+
+
+
+
+
14.3. The Decision Problem Suppose for the present (one special case} that Block b does not become obsolete before the end of development and testing of Block b 1. The decision is whether to use the budget remaining after development and testing of Block b 1 to purchase copies of Block b 1, or to use it to purchase more copies of Block b. Suppose that each mission uses one copy of a Block. DDs remaining in Block b 1 after fielding can cause mission failure. The decision criterion will be to maximize the expected number of successful missions. The decision is to allocate that amount of the budget, B(b l), to upgrade Block b 1 so the expected number of successful missions by the copies of Block b + 1 procured with the remaining budget is maximized. Assume that missions arrive according to a stationary Poisson process
+
+
+
+
+
+
D. P. Gaver, P. A . Jacobs, and E. A . Seglae
200
having rate A. Let N F ( ~be ) the number of copies of Block b that can be purchased using B ( b 1). The number of mission successes if Block b is purchased exclusively is denoted by Sg ( N F ( ~ )Then ).
+
(8)
There is a similar expression for the conditional expected number of successful missions given the remaining number of DDs if N F ( + ~ 1) copies of Block b 1 were fielded; that number of successful missions is S b + l ( N F ( ~ 1)).It is seen that
+
+
E
[Sbfl
+
( N F ( ~ 1)) ID ( b
= [h+w(b+l)
+ 1; M (%(b 4- 1)) ; T d ( b
1); T ) ]
lNF(b+l)
+ 1lD ( b + 1; M (%(b
1)); T d ( b
+
+
x N F ( b $- 1)PF ( b
+
+ 1);
T))
(9)
The distribution of D(b 1; M ( T d ( b 1)); T d ( b 1); T ) can be Poisson ( K O= k), or Gamma-Poisson (negative binomial), or (Truncated) Stable Poisson.
14.4. Examples The parameters for the numerical examples appear in Table 14.1. The total budget for development, testing, and procurement of Block b + 1 is $2000K. Development budgets $100K, $200K, and $400K are considered. Each of these budgets can be spent over an integer time interval length T d of 1 time period, or alternatively, 5 time periods. The rate of DT/OT testing ranges in integer values from 1 to 100 tests per time period. The proportionality constant, K O is constant with value 0.0002. Table 14.2 summarizes the gains made by an appropriate choice of time for development and testing. If Block b 1 is not developed and units of Block b are bought with the budget, then the expected number of mission successes is 120 (so developing
+
Dynamic Management of Systems Undergoing Evolutionary Acquisition
201
+
and buying Block b 1 is advantageous in this case; our model provides useful quantitative support for such decisions). Table 14.1. Parameters. Total budget Development budget Development (spiral) time Rate of development expenditure Mean number of tests per unit time Cost per copy Block b Cost per copy Block b + 1 Cost of a test Cost removing each DD found during test Mission arrival rate Probability mission success Block b Utopian requirement: Probability mission success Block b + 1 Parameters of effectiveness growth
Parameter for introduction of DD Rate of obsolescence Block b Rate of obsolescence Block b 1 Probability DD survives test Probability DD survives mission Proportionality constant for effectiveness
+
+
B ( b 1) M Td(b + 1) m = M / T d ( b 1)
+
7
Cm (b) cm(b 1 ) ct(b 1) cr(b 1)
+ + +
x P(b) P*(b 1)
+
2000 variable variable variable variable 1 1.2 1 3
100 0.4 0.85
a1
0.05 2
a2
0.002
a2
0.8 2 0.33 0 .2 0.85 0.8 0.0002
a1
03
4) w(b
+ 1)
e
OF
KO
Table 14.2 displays the number of tests that maximizes the expected number of successes for Block b + 1 once it is fielded; also displayed are the expected number of mission successes for Block b 1. Discussion: It is noticeable that for the situations presented above, the optimum expected number of mission successes occurs for a 5% budget (B) expenditure on Development and Testing, and for (tabulated) minimum rate of development expenditure. This leads to significantly less testing than do other options, and to a noticeable gain in successful missions. Of course, different parameterizations can lead to quite different outcomes. It is important to expand the above analysis to dynamic conditions. Study of sequential parameter estimates is strongly endorsed and is in progress. Tables 14.3 and 14.4 display the expected probabilities of mission success after development and testing according to the policies of Table 14.2 for
+
D. P. Gaver, P. A . Jacobs, and E. A . Seglae
202
+
Table 14.2. Maximum expected number of successes for Block b 1 after development and testing; [maximizing number of tests]; total budget = $2000K.
M / T : rate of expenditure of development budget M / B : fraction of total budget used for development 0.2 = 400/2000
20 = 40 = 80 = 100 = 200 = 400 = 10015 20015 40015 10011 20011 40011
366 [301
0.1 = 200/2000
301 [481
347 PSI
386
~ 5 1 0.05 = 100/2000
397
373
[35
[15]
different distributions of K O .In Table 14.3, the distributions considered are K O= 0.0002; K Ohaving a gamma distribution with shape parameter 0.5 and mean 0.0002; KO having a positive stable law distribution with order equal to the shape of the gamma and scale the same as the gamma; K O having a truncated stable law distribution with the same scale and order as the stable and the exponential truncation rate K. chosen so that the mean is equal to 0.0002. In Table 14.4 the distributions of K Oare the same except the shape of the gamma and the order of the stable law are 0.1. The policies in Tables 14.3 and 14.4 are for K O= 0.0002. Table 14.3. Expected probability of mission success for Block b + 1 after develop ment and testing; constant K O= 0.0002; [gamma K Oshape parameter 0.5 and mean 0.00021; {stable KOwith order (scale) equal to shape (scale) of gamma); (truncated stable K O ) .
M / T : rate of expenditure of development budget M I B : fraction of total budget used for development 0.2 = 400/2000
20 = 100/5
40 = 200/5
80 = 400/5
100 = 100/1
200 =
200/1
0.80 [0.79] 70.75)
0.67 [0.66] f0.62) i0.67j
io.soj 0.82 [0.81] (0.78)
0.1 = 200/2000
0.74 [0.73] (0.74) (0.74)
(0.82)
0.05 = 100/2000
0.83 [0.83] (0.79) (0.83)
400 = 400/1
0.78
[o.m] (0.75) (0.78)
Dynamic Management of Systems Undergoing Evolutionary Acquisition
203
Discussion: The expected probabilities of mission success are nearly equal for the moment and shape matched versions of constant, gamma, and truncated stable K O except , for stable with ,O small (0.1). This may be due to the very exaggerated shape of the pure stable law with this shape parameter; there can be many small values versus a few very large. The mean number of DDs introduced when K Ohas a pure (untruncated) positive stable law is infinite and the expected probability of mission success after development and testing is much less than the others; the smallest expected probabilities of mission success occur for the stable law of order 0.1. This suggests the need for a sequential stopping rule, a problem under current investigation. Table 14.4. Expected probability of mission success for Block b + 1 after development and testing; constant KO = 0.0002; [gamma K Oshape parameter 0.1 and mean 0.0002]; (stable K Owith order (scale) equal to shape (scale) of gamma}; (truncated stable KO
M / T : rate of expenditure of development budget M / B : fraction of total budget used for development 0.2 = 400/2000
20 = 100/5
80 = 400/5
100 = 100/1
200 = 200/1
0.79 [0.79] (0.41) (0.80)
0.1 = 200/2000
0.05 = 100/2000
40 = 200/5
0.67 [0.66] (0.33) (0.67)
0.82 [0.81] (0.43) (0.82) 0.83 [0.83] (0.43) (0.83)
400 = 400/1
0.74 [0.73] (0.38) (0.74) 0.78 [0.78] (0.41) (0.78)
14.5. Conclusion and Future Program The problem discussed is simple and generic, and widely encountered in defense acquisition. In subsequent work we propose to more extensively explore the above models, conditions and issues, and to provide more refined operational tools to guide the timing of evolutionary cycles; Bayesian methods and sequential success run criteria suggest themselves. We will also consider elsewhere the development and testing of systems consisting of subsystems in series. As stated above, we also plan to consider TestAnalyze-Fix-Test (TAFT) testing policies with a sequential stopping rule
204
D. P. Gaver, P. A . Jacobs, and E. A. Seglie
based on first occurrence of a run of successful (no DDs activating) operational tests; see Gaver.2 Simulation and more analysis will be designed to assess the procedure’s robustness. Application to Interim Armored Vehicle, IAV (STRYKER) acquisition and testing is underway. References 1. G.
Glaros, http://www.oft.osd.mil/library/library-files/trends-205 -transformation-trends-9-june% 202003-issue.pdf (June 4,2003). 2. D. P. Gaver, P. A. Jacobs, K. D. Glazebrook, and E. A. Seglie, Probability models for sequential-stage system reliability growth via failure mode removal, Znt. J. Rel., Qual., and Safety Eng. 10 ( I ) , 15-40(2003). 3. W. Feller, An Introduction t o Probability Theory and Applications, Vol. 11, Wiley, New York (1966). 4. D. R.Cox, Some Statistical Methods Connected with Series of Events, JRSS B 17,129-164 (1955). 5. M.S.Bartlett, Discussion of paper by D. R. Cox, JRSS B 17,159-160(1955). 6. M. S. Bartlett, Inference and Stochastic Processes, JRSS A 130, 457-477 (1967). 7. D.R.Cox and V. Isham, Point Processes, Chapman and Hall, London (1980). 8. D. P.Gaver, P. A. Jacobs, K. D. Glazebrook, and E. A. Seglie, System reliability growth through block upgrade, to appear (2005).
CHAPTER 15 RELIABILITY ANALYSIS OF RENEWABLE REDUNDANT SYSTEMS WITH UNRELIABLE MONITORING AND SWITCHING
YAKOV GENIS Borough of Manhattan Community College City University of New York New York, New York 10007 USA ygenis@bmcc. cuny. edu
IGOR USHAKOV Canadian Training Group San Diego, CA USA iushakov2000@yahoo. com Two methods of approximate evaluation of probability of no-failure operation and maintainability of renewable redundant system is suggested. The model takes into account incomplete monitoring of units state and not absolutely reliable switching device. Fast restoration, fast unit failure detection, and fast standby switching are assumed. A comparison of approximate estimates of reliability indexes with exact calculated results and simulation results shows practical applicability of suggested methods.
15.1. Introduction
Delays in unit failure detection due to inadequate testing, belated switching in of standby units in case of standby redundancy (SR) as a result of unreliable operation of switching and testing devices, and the introduction of additional faults and “false” failure during periodic testing (PT) substantially reduce the reliability of redundant systems. To find the actual system reliability, one must take into account all factors associated with incomplete testing, unreliable operation of test and switching devices, and also the effect of periodic testing on system performance. 205
206
Yakov Genis and Igor Ushakov
The reliability of certain particular systems was calculated in Handbook of Reliability Engineering and Gnedenko and Ushakov2 assuming that the no-failure operation and unit restoration times as well as the standby unit switching time and the time between two consecutive tests have exponential distribution functions (DF). These works indicate that even if the system behavior can be described by a Markov process, exact calculation of reliability is possible only for relatively simple systems. For more complex systems, some heuristic methods are suggested in Handbook of Reliability Engineering,' and Gnedenko and Ushakov.2 In the present paper the ideas proposed in Genis3 and G e n i ~are , ~ further developed and more precisely stated, and the domain where they can be used in practice for approximate evaluation of the reliability is defined.
15.2. Problem Statement A renewable redundant system contains n units and r repair facilities (RF). The DF of time to failure for each unit is assumed exponential. A unit might be in one of the two states: operational or nonoperational (failed). The unit serviceability testing is incomplete; that may delay detection of a unit failure. Accordingly, a nonoperational unit can be in a state of detected or undetected failure. In systems with SR, the units are divided into active and standby ones. If an active unit fails, its functions are performed by a standby unit. Switching device itself is not absolutely reliable; switching device can delay switching and thus reduce efficient utilization of standby resources of the system. As soon as a unit is detected as failed (real or false), the renewal of this unit should begin. Every unit has at least one RF which is capable to restore it. One RF can simultaneously restore not more than one unit, and one unit can be simultaneously restored by not more than one RF. Unit restoration is complete and restoration begins if there is an available RF. No constraints are imposed on unit restoration time DF. There are no restrictions to the structure of the system. The system is said to be failed if the domain of the states of its units belongs to a specified set of states. The system is assumed to be provided with fast servicing (FS). This concept is defined more accurately in Sec. 15.4. Its practical meaning is that the unit restoration time, the unit failure detection time, and the standby unit switching time are negligible in comparison with the time between any two unit failures including those taking place during PTs. This also means that the probability of events directly responsible for an
Reliability Analysis of Renewable Redundant Systems
207
SR system failure as a result of noninstantaneous standby unit switching or of noninstantaneous detection of an active unit failure is low. Thus, FS entails fast restoration (see Genis3), fast and mostly instantaneous detection of unit failure, fast and mostly instantaneous standby unit switching, and a low rate of additional failures occurring during periodic tests. The problem is to estimate the reliability and serviceability indexes of a system under FS conditions. 15.3. Asymptotic Approach. General System Model The state of system units is assumed to belong t o a malfunction interval (MI) if even one of its units is in a state of detected, undetected, or false failure and to a serviceable interval (SI), otherwise. An MI is called a failure MI if system failure occurs within this interval. It means that in this MI at least once the system will exhaust its reliability reserve. The system can fail in some MI more than once. System behavior is described by an alternating random process in which MIS and SIs follow each other. The behavior of such systems has been analyzed in G e n i ~ . ~ Let us assume that the system operating conditions and the units nofailure operation time and restoration time DF do not change with time. The system to be discussed is assumed to be highly reliable. Since the probability of failure of such a system in its nonstationary operation region can be made small enough, its behavior in the stationary operation region becomes of primary importance. The results of Genis3 indicate that in such a case the DF of the time to first failure converges to an exponential function if the product of the rate of occurrence of MIS by the maximum mean duration of the malfunction interval T as well as the probability of system failure q in the MI interval tends to zero. If also the probability q* of more than one system failure taking place in the MI tends to zero, the DF of the time between two system failures converges to an exponential function. The FS criterion defined below ensures conditions in which AT -+ 0, q 3 0 and q* -+ 0. 15.4. Refined System Model and the FS Criterion
In the following we discuss one possible model of a system with incomplete testing and unreliable unit switching. Other models of such systems can be analyzed by a similar technique. The important condition for such systems is that the FS criterion should ensure small T , q, and q*. The systems to be discussed are systems with SR. Let Fi(z)and mi be
208
Yakov Genis and Zgor Ushakov
respectively the DF and the mean operating time of the i-th unit and let Fi(x) be fixed. The probability of instantaneous detection of the i-th active unit failure is p l i and that of the i-th standby unit is p2i. When a failure of an i-th active unit is detected, the probability of instantaneous switching to a standby unit is psi; let Hi(z) denote the DF of switching time to the standby when a failure of the i-th active unit is detected and the switching is not instantaneous. An undetected failure of an i-th unit can be found in the course of periodic tests with a probability p4i; the distribution function, mean time between two PTs, and its second moment are denoted by a(.), mpt,and m$) respectively. The DF of the time from the beginning of an MI to the first PT after the MI begun is given by
and
r
Here and below = 1 - r for all r. The DF of the time to detect the failure of i-th unit under the condition of no instantaneous detection of this failure is Bi(z),and its mean is given by
The probabilities of a false signal being generated to indicate a no existing failure of a serviceable active and standby unit are psi and psi respectively. Let G,(z) be the DF of the restoration time of the i-th failed unit and Gli(z)an analogous function when a false failure of the i-th unit is indicated. In addition, let s be the minimum number of units whose failure causes system failure; G(z) = mazG,(z , ~ ( z=) maxZ,(z),B(z) = mazBi(z), where i E 1,n; m?), mg), m$, and m f ) are the j-th moments of the (1) DF G(z),Gl(z), H ( z ) ,and B ( z ) ;and let m, = rn?),ml, = m,, ,m, = 1) (1) m, ,mk = mk . Under stationary system operating conditions the rate of occurrence of MIS is estimated as n
n
i= 1
k l
209
Reliability Analysis of Renewable Redundant Systems
In practically important cases m(j)5 C(m)j (subscripts omitted), and the FS condition Genis3 can be written as
a1 = imaz(m,, ml,, m,,
m k ) + 0.
For systems with SR an additional condition is a2 =
max j j l i + 0,
lsisn
a3 =
max jj3i lsisn
+ 0.
The obtained reliability estimates are valid when a = m a z ( a l , a 2 ,a3) I 0.1 since their error is of the same’order as a. 15.5. Estimates of Reliability and Maintainability Indexes
A failure in which a system persists for a t least a time z is called an z-
failure, and let fiz denote the rate or z-failures taking place in the system under stationary conditions. It can be shown that the desired reliability indexes can be found in terms of the fiz. Under FS conditions the distribution functions of system operating time to first failure, of the time between two failures, and of the no-failure operating time are nearly exponential. The mean time t o failure and the mean time between failures can be assumed to be approximately equal t o the mean no-failure operating time of the system. Then, assuming in ,& z = 0, we will receive the system failure rate = /?o, the estimation of no-failure operating time T M l/p, the estimation of DF of no-failure operatin$ time 1-exp{ -fix}, the estimation of DF of system restoration time 1-&/p, and the estimation of mean system restoration time in the form of ,hz,drc/p). Under FS conditions fiz is found as the sum of x-failure rates of the system along monotonous paths G e n i ~ i.e., , ~ paths along with no single restoration can be completed during the time from the beginning of the MI until system failure on this interval plus the time rc; SR system requires in addition that no single standby switching can be completed and no single active unit failure is not instantaneously detected during that time. The rate of rc-failures along some monotonous path is defined as the product of the rate of occurrence of MIS at which a given path can start and the probability of z-failure along this path. The obtained estimates are then simplified in accordance with the results obtained in G e n i ~ . ~ To simplify the calculation of estimates only minimal monotonous paths are used on which the probability of system failure is substantially greater than on nonminimal paths. Other conditions being equal, the number of failed units that cause system failure is the least along minimal paths.
a
(s,”
210
Yakov Genis and Zgor Ushakov
Thus, in case of instantaneous switching of standby units and instantaneous detection of active unit failures, one only takes into account those paths along which system failure takes place with the failure of minimal number of units. In case of noninstantaneous standby switching or noninstantaneous active unit failure detection, only those paths are taken into account along which the noninstantaneous switching or noninstantaneous detection takes place with failure of the first unit in this MI. If only the system reliability indexes are needed, then the ,6 can be found immediately. It should be also noted that the rate of x-failures of a system can be used to estimate the reliability indexes of systems with time redundancy. The obtained results can be probably extended to the case when the DF Fi(z) are absolutely continuous. In this case the rate of the occurrence of MIS in stationary system operation can be estimated by value n
n
where ci is the value at the zero of density of Fi(x), ci = F,(O). The above method is illustrated with the following examples in which the approximate estimates have been found to be close to either exact values or to values obtained by simulation.
15.6. Examples Unloaded Duplication with Nonreliable Switching. The system consists of two identical units, one active and one standby, and a switch. The standby unit is unloaded. As soon as the active unit fails, the switch substitutes for it the standby unit. A unit is said to be active from the instant it is put into operation until it is replaced with another unit. During this time the second unit is said to be a standby unit. Let F ( x ) , c, and m be the distribution function, density at zero, and the mean no-failure operating time of the active unit. The distribution function of the time of switching the standby unit in place of the active unit is L ( t ) = p3+jj3H(t), 0 I p3 i 1, j j 3 = 1-p3, where H ( t ) is the distribution function of some random quantity with a mean m,. Switching takes place when the active unit fails, provided the standby unit is serviceable. Restoration of the failed unit begins as soon as switching is completed. The system has a single RF. The unit restoration time distribution function is G(x) with a mean m,. After restoration, the unit acts as a new one.
Reliability Analysis of Renewable Redundant Systems
211
The system fails when the active unit fails, the standby unit is not yet restored or when unit switching is not instantaneous. At the beginning both the active and standby components were serviceable and new. Here we obtain an approximate solution and compare it with the exact one. Let us estimate fiz = fiL1) where ,&') is due to x-failures of the system caused by no instantaneous standby switching and ,6i2)is x-failures rate because of that, that in the moment of failure of the active unit the restoration of the standby unit was not yet finished. Then,
+ fii2',
A = 2b3R(x) + p3 m
s,
W
q t + x) d F ( t ) ] ,
and the FS conditions have the form
max(mT, m,) * max(1 / m , c ) The approximate estimate of
-+
0,
p3 4 0.
6 under FS conditions is given by
If T o is the mean time between system failures and T 1is the mean time to the first system failure, the exact solution has the form
i.e., agrees well with ,8. To estimate Tr we have
M
b3m, + 0.5p3cm?)]/[Tj3 + p 3 ~ m r I r
which also agrees well with the exact solution. It should be noted that the coincidence of approximate estimations of and T, with the exact solutions is excellent over a wide range of p3. But the condition FS p3 --f 0 must be observed to ensure that the DF of the no-failure operating time of the system is exponential. Loaded Reserve, No Instantaneous Standby Switching, and No Instantaneous Detection of Standby Component Failure. The standby redundancy
212
Yakov Genis and Igor Ushakov
system described in Sec. 15.5 is discussed. The system has one RF, 1 active and one standby units. The reserve is loaded. Since all components are of the same type the subscript a denoting the unit number in Sec. 15.4 is omitted. The failure of an active unit is detected instantaneously. Since, only the state of the standby component is checked during periodic testing, p5 = 0. System failure occurs when at least one active unit fails and cannot be replaced because of one of the events: or if the standby switching is not instantaneous, or if no serviceable standby unit is available. FS conditions are satisfied. To estimate ,& we consider that under stationary conditions the DF of the residual no-failure operating time has the form of an integral of F ( u ) / m , where u E [0,z]. Let
m,(z) =
I"- ,&:+ G(u
z) du, m, = m,(O).
Let us estimate the terms of 1) one of 1 active units failed, standby switching is not instantaneous:
/32) = p3 1E(z)/ m; 2) one of 1 active units failed, standby switching is instantaneous; before restoration is completed one more active unit failed and standby switching is instantaneous (noninstantaneous switching is only taken into account in the estimate of :
Bi'))
3) the standby unit fails, failure is detected, and before restoration is completed one active unit fails, the switching is instantaneous:
,8i3)M
1"- + 1
G(t z) 1 p3 F(t)dt / m2 M
p2
p2 p 3
1 m,(z) / m2;
4) the standby unit fails, failure is not detected, and before failure is detected one active unit fails, the switching is instantaneous: 00
,8i4)M P2
B(t)1 p 3 F ( t )c(z)dt / m2 M pz p3 1 c(x) mk / m2;
5 ) the standby unit fails, failure is not detected, one of active units fails after failure of the standby unit is detected but before its restoration is completed, standby switching is instantaneous:
dB(t1)c(t2- tl
+ z) l p 3 F(t2)dt2 / m 2 =
Reliability Analysis of Renewable Redundant Systems
‘jj2 p3 1 m
213
~ ( z/)m2;
6 ) a false failure of the standby unit is indicated and during its restoration an active unit fails, standby switching is instantaneous:
Hence 6
p?)
=
M
1E3Z(Z)
4- (1p3 -!- 1 ) p 3 m ~ ( X ) / m +
i=l
M 1$3
+ ~2 p3 c(z)mk / m + ~ 3 miT(z) ~ 6 / m p t l / m, + (1p3 + I)P3mT/m + P2p3mk/m + P3p6mlT/mpt]/mi fizdx/p lp3m, + ( I p s + I ) p ~ m Y ) / ( 2 m ) S
1
00
TT
+P2p3
m~mk / m
f p3pS
mg) / (2mpt)]/ (mb).
The obtained estimates are practically the same as the exact estimates for the case when undetected component failures are detected during the next PT, no false failures are indicated, and all continuous random quantities have an exponential distribution. The general case has been computer simulated using the GPSS language. In particular, we have simulated the case with 1 = 1 and p6 = 0. In this case
b
$3
mk TT
E3m3
+ 1)?’3mr/m + P,p3mk/m]/m, = [m$)/ ( 2 ( m ~ t + ) ~P4) / ~ 4 mpt ]
f (p3
+ (p3 + 1 ) p 3 m ? ) / ( 2 m ) +
P2p3mTmk/m]/(mb).
Certain results of simulation for the case p4 = 1 are listed in Tables 15.1, T,?,,, and usim are the simulation results of the mean time between failure, the mean system restoration time, and the standard deviation of system full operating time; TT,caiand Tcaiare calculated by the described method(s) values of mean system restoration time and the mean system full operating time, Tcai= 1 / $; n is the number of observed system failures; T , T ~ T,,, and ~~t are component no-failure operating time, the component restoration 15.2, and 15.3, where the following notation is used: T,i,,
214
Yakov Genis and 1gor Ushakov Table 15.1. The reliability and maintainability indexes’ dependence from p3 and p 2 .
p3 p2 tcal Tsim sim Tr,cal Trsim n
0.990 0.950
59856 60466 643; 64 20.00 20.47
1474
1 213; 1 1 1:; I
I
0.995 0.990
I
0.900 0.950
16
26859 20.00 17.42
186
16154 11’ 19.50 19.52
308
I 0.001
I 0.001 202 2018
1976
20.00 20.66
2334
time, the standby switching time, and the time between two periodic tests; ( A * E X P ) , ( A ;B ) , or ( A c * N R M ) denote that the respective random quantity is exponentially distributed with a mean A, or uniformly distributed in the interval (A - B, A + B), or has a normal distribution with a mean A and standard deviation 0, truncated at 3u. In Table 15.1, pp and p3 were varied and T , r,., T,, and TPt are exponentially distributed with means 2000, 20, 20, and 150. Table 15.2 shows the effect of the T~ and 7., distribution functions on Tsim.In this case, r and rPt are exponentially distributed with means 2000 and 150, p3 = 0.990, p2 = 0.950, and Teal = 59856. Table 15.3 shows the effect of r and rPt distribution functions on Tsimand Tr,,im.The values rr and T~ are exponentially distributed with the means 20 and 40; and p3 = 0.970, p z = 0.950.
+
Table 15.2. The dependence of Tsim and usim from rs and r,.. 7s Tr
Tsim gsim
n
20 20
* EXP * EXP 60466 64864 147
20; 0 20; 0 60874 58128 146
20; 20 20; 20 57362 59984 86
20 20
+ 6 * NRM + 6 * NRM 56198 60576 88
The tables indicate that Tsimand Tr,,imagree well with the calculated values over a wide range of p z and p 3 for different forms of distribution functions of all random quantities; the similarity of Tsim and Usim shows that if FS conditions are observed the distribution functions of the system no-failure operating time is nearly exponential. 15.7. Heuristic Approach. Approximate Method of
Analysis of Renewal Duplicate System The idea of approximate method is based on the Renyi Theorem that states the following. Consider a recurrent point process with distribution function
Reliability Analysis of Renewable Redundant Systems
215
Table15.3 The dependence of Tsim and Tr,sim from
tpt. r TPt
Tca~ Tsim maim
Tr, cat Tr,sim
n
200 * EXP l00*EXP 38809 34958 34608 31.64 35.69 266
2000; 1500 100*EXP 38809 32973 33536 31.64 35.20 302
2000
+ 600 * N R M 100; 0 39744 38847 37616 31.92 32.70 255
of time between events D(t). Determine a “sifting procedure” to this process, i.e. with probability p one independently excludes current event. The Renyi Theorem states that with infinite continuation of the sifting procedure, the resulting limit point process converges to a Poisson process. In our case for each system unit, we consider a recurrent alternative process consisting of “on-intervals” with DF F ( t ) and the mean T , and “down-intervals” with DF G ( t ) and the mean T. In our case, considering highly reliable systems, we take T >> T. Let us call a unit failure an “alarm,” if it can be developed into a system failure if another failure (or other failures) will occur. Notice that such a situation exists for a very short time, since we assume that T >> I-. So, in limit, we can consider pure recurrent point process of “alarms” instead of alternating one. Sometimes these alarms might develop into system failures, sometimes not. Lets now return to the analysis of investigated renewable system. Consider a general renewable duplicate system with loaded redundancy ( “hot” redundancy). Both units are identical with distribution function of time to failure F ( t ) and distribution function of renewal time G ( t ) .Denote corresponding mathematical expectations of these distributions by T and T , respectively, and assume again that T >> T . The state of a standby unit is monitored periodically with the period 0, assume that 0 << T . Periodical test might produce a false signal: an operational unit can be erroneously indicated as failed and switching device (SD) begins the reconfiguration of the system replacing acting unit by standby one. The probability of false detection equals 6, 6 << 1. The unit that assumed to be failed is sent to RF for repair, as it would be failed. A monitoring device itself is assumed reliable in the sense that it always is monitoring the state of the standby unit. (Otherwise, we wiH meet a problem of the type: “Who guards the guardians?”) If the standby unit has failed, its failure can be detected only at the
216
Yakov Genis and Igor Ushakov
moment of monitoring test. In other words, a standby unit is in the state of undetected (“hidden”) failure from the moment of failure until the moment of a forthcoming monitoring test. Switching device is not reliable: its failure rate is &wit& and distribution of its repair time has DF S ( t ) with the mean s*. Besides, the SD can fail to perform switching with probability q. If it happens, the SD is to be repaired. A switching procedure takes a random time with distribution function H ( t ) with the mean h. If switching duration is longer than some given time z , then it is assumed that the system has failed. For stationary process, the mean residual time of switching is equal h*:
After renewal procedure (repair), a unit is assumed as good as new. For repair of failed units there are k RFs, where k = 2 or k = 1. The problem is to determine main reliability indices of the system: probability of failure-free operation during a given time; availability coefficient, i.e. the average portion of time when the system is in an operational state; mean time between failures. Solution. Let us enumerate all primary events that could be a cause of the system failure in the case if something will fail additionally. Let us make at the very beginning an important notice: in our analysis of highly reliable systems, we will consider only such failure situations that are developing by “monotonous trajectory,” i.e. within a “down period” after the first (initializing) failure of a unit, there is no renewal of any other failed units. The situations of system failure are the following ones. (a) During restoration of the active unit, the standby unit will have failed. Since we consider a stationary period of the system operating, failure rate of the standby unit equals 1IT.This is a component of the system “alarms flow” due to active unit failures. The probability of “sifting,” that is, the probability that alarm will not lead to the system failure, in this case equals p = exp(-.r/T). Otherwise, with probability Q = T / T ,the system failure will have occurred. It means that the “failure flow” generated by this alarm has rate A, z TIT2.
System down time Table 15.4.
7,
for failures of this nature can be found from
Reliability Analysis of Renewable Redundant Systems
217
Table 15.4. The "ditime"for the canse when the standy unit fails during restoratio of the active unit. ra when k = 1 T i 2 7 1 2 5 Ta 5 7
G(t) Degenerated (constant) “Aging“ Exv ~
ra when k = 2
7
Ti2 Ti2 TI2
(b) During “down time” of the standby unit, the active unit will have failed. For a stationary period of the system operating, failure rate of the acting unit equals 1 /T. This is a component of the system “alarms flow.” The “down time” of standby unit is a sum of time of a “hidden” failure (average duration equals 0.5 x 0) and namely restoration time ( r ) .Therefore, with probability qb = ( r 0.5 x 0) /T, the system failure will have occurred. So, the “failure flow” generated by this type of alarm has rate
+
Ab M
(r
+ 0.5 x 0)) / T2.
(c) Switching is not successful due to SD failure, which occurs with probability q. The flow of alarms of that type is equivalent just to flow of failures of the acting unit, i.e. rate of these events equals 1/T. Using the Renyi Theorem, one can immediately write A, x q/T. System LLdown time” for this situation, i.e. impossibility to switch from a failed acting unit t o standby unit, is r, = s*.
(d) SD is operable but switching procedure exceeds a given time z . System failure rate due to this type of situations is Ad M
[l-H(~)](l-q)/T.
Duration of a system “down time” was obtained above as the mean residual time of switching duration: Td
= h*.
(e) The active unit fails when SD is under a restoration procedure. An alarm caused by the SD failure might be developed into the system failure with probability s* / T . The failure rate due to that reason is Ae
M S* Aswitch
/ T.
Yakov Genis and Igor Ushakov
218
Table 15.5. The “down time” for the case when the active unit fails when SD is under restoration. G(t)
re
Degenerated (Costant) “Aging“
T/2 r / 2
5
re
5
r
7
EXP
The mean system “down time” T, can be found from Table 15.5. (f) A test of the standby unit, performing in time period 6, falsely shows that it has failed. The standby unit begins its restoration, and during this time interval the active unit has failed. Intervals between such possible false signals is constant and equals 6, so distribution between such events is geometrical. Under our assumptions of highly reliable systems, we can take an exponential distribution as an appropriate approximation. Then the failure rate due t o this situation can be written as Xf
NN
ST/OT.
System “down time” Tf can be found from Table 15.4 if the value 7, is substituted by value ~ f . Other situations, being not numerous, have probabilities of the highest order of magnitude. For the system as a whole, the failure rate is equal t o XSyst 25 x u
+ x b + + Ad xc
f
xe
xf
and the mean “down time” is found as a weighed average: T S y s t % (Tu x u
f Tb x b
+
Tc x c
+ Td Ad
f Te x e
+ xf ) / h y s t . Tf
These two values give us a possibility t o derive all other reliability indices: probability of failure-free operation during a given time, mean time between failures and availability coefficient.
References
1.
Handbook of Reliability Engineering, Ed. by I. A . Ushakov, New York, John Wiley and Sons (1994). 2. B. Gnedenko and I. Ushakov, Probabilistic Reliability Engineering, New York, John Wiley and Sons (1995). 3 Y. Genis, 2-Sided estimates for the reliability of a renewable system under a nonstationary operational regime, Soviet Journal of Computer and Systems Sciences 2 7 ( 6 ) , 168-170 (1989).
Reliability Analysis of Renewable Redundant Systems
219
4. Y. Genis, Indexes of suitability for repair and the coefficient of readiness of standby systems for various renewal disciplines, Soviet Journal of Computer and Systems Sciences 26(3), 164-168 (1988).
This page intentionally left blank
CHAPTER 16 PLANNING MODELS FOR COMPONENT-BASED SOFTWARE OFFERINGS UNDER UNCERTAIN OPERATIONAL PROFILES
MARY HELANDER Department of Mathematical Sciences ZBM Watson Research Center Yorktown Heights, N Y 10598 USA E-mail: [email protected]
BONNIE RAY Department of Mathematical Sciences ZBM Watson Research Center Yorktown Heights, N Y 10598 USA E-mail: [email protected]. com We present a modeling framework for allocating development or support effort among software components to meet a specified cost or reliability objective. The approach is based on knowledge of the linkage between the operational profile and the architecture of the system, but allows for uncertainty through the use of probability distributions to characterize usage. An approach based on stochastic optimization is presented to obtain efficient solutions to the allocation problem. Results are demonstrated when uncertain usage is characterized by a Dirichlet distribution.
16.1. Introduction
Guidelines for achieving a specified reliability target under resource constraints play a key role in the software development planning process. In
particular, methods that determine how to allocate resources among components of a software system to facilitate cost-efficient progress toward a quantified system reliability goal are essential. Here, a component is defined as a set of operations, a subsystem, a module, an object, or any other distinguishable software entity that can be assigned a failure intensity represent22 1
222
M . Helander and B. Ray
ing its reliability. If the intended usage of the system is given, then system reliability can be computed as a function of component failure intensities, the expected component utilizations, and the specified usage. Specification of usage through the assignment of occurrence probabilities to operations form the quantification known as the operational profile.’ However for many systems, in particular commercial software systems, the usage of the system in a production setting may vary considerably from an expected usage, or the assumed usage as characterized for test phases of system development. This paper describes a modeling framework for cost and reliability planning in software system development that allows for random variation in the operational profile. Previous authors have addressed the problem of software reliability allocation in the context of a specified operational profile. For instance, Poore, Mills, and Mutchler’ used a spreadsheet approach to consider various strategies for allocating reliability to software modules. The paper of Helander, Zhao, Ohlsson3 provides an analytical solution to the optimal allocation problem based on standard nonlinear optimization methods. A component utilization matrix is used to link system structure to operations, which are partitionings of the software system from a user’s perspective. The paper of Leung solved the reliability allocation problem in the case of an operational profile that was specified up to an E allowable difference, where the E uncertainty was the same across all operations. We build directly on the framework of Helander13by allowing for very general uncertainty in the operational profile through the use of probability distributions on the operation occurrences. This is especially important as the high-level componentization of software systems becomes more common. By high-level, we mean that what were once considered as individual products are now bundled together in different ways and sold as software systems. An example is the componentization of software offerings from IBM, which bundles key elements of products such as DB2, Tivoli, and Websphere Application Server together so that they are quickly deployable in a customized setting. The usage of the components may vary widely from setting to setting, but reliability targets for the individual components need to be allocated so that overall reliability targets are achieved across a range of settings. In Sec. 16.2, we provide the general model formulation for specification of component failure intensities to plan for a software system reliability target while minimizing costs. Section 16.3 gives the derivation of the stochastic optimization problem under a particular distributional assump-
Planning Models for Component-Based Software O f f e ~ n g s
223
tion about usage. Section 16.4 presents an example. Section 16.5 concludes. 16.2. Mathematical Formulation as an Optimal Planning
Problem Let n denote the number of software components comprising a software system. In assembling components to form a software offering, we want the components to be reliable enough so that the probability of failure-free execution for a configured system has probability of at least p (0 < p < 1) of achieving failure-free execution with respect to an execution time interval of length T . The problem of interest is to determine what the failure intensities of the individual components should be to achieve this target in the most cost-effective manner. This assumes a cost associated with achieving a specified performance, a common concept in software engineering eccnomics. Denote f(X1, Xz, . . . , A), as the total cost of achieving the failure intensities A1, Az, . . . , A, , which we assume to be a pseudoconvex, nonincreasing function of the The function R(X1,Xz, . . . , A,; T ) measures system reliability in terms of XI, &, . . . ,A,, which will be used as the main decision variables in the reliability allocation cost-optimization problem. The following model simultaneously finds XI, Xz, . . . ,A:, Minimize
f(A1,
Subject to:
Az, . . . , An).
(1)
(2)
X j 2 0 for j = l ,... ,n, (3) where 0 < p; < 1, C E l p i = 1 are the operational profile parameters and p i j (0 5 pij 5 1) are the component usage parameters for i = 1,.. . ,m operations and j = 1,.. . , n components with Cj”=, uij = 1 for each i. Note that the reliability function in (2) is derived under the assumption that failure events are statistically independent, following an exponential distribution, and that the system architecture is such that a failure in one component causes failure of the entire system, i e . the system is not faulttolerant with respect to the components as identified in the planning model. could be derived under different assumptions concernOther forms for R(-) ing the failure distribution and system architecture, although the resulting optimization problems become more complex. Given positive association of
224
M. Helander and B. Ray
failure events, it appears that the form based on (2) is helpful for planning conservatively since it should assure the system reliability target is a a lower bound. This model specified by (l),(2), and (3) is the multivariate, nonlinear, constrained optimization model introduced in Helander.3 It can be equivalently restated as:
Minimize
(4)
Subject to
(5 gj(X1,Xz ,..., A),
= -Xj
5 0
for j = 1 ,..., n,
(6)
which follows a standard nonlinear programming model form involving minimization of a nonlinear objective function constrained by “50” inequalities. Conditions and solutions for this form are given in Helander3 for some common cost functions in software economics (see, e.g., Boehm5). These functions typically reflect costs related t o development and testing and costs related t o reliability problems after the software is released, but not trade-offs between the cost of continued testing and the cost of delay in taking software to market. In this framework, uncertainty in the operational profile can be introduced by treating the operational profile parameters as random variables, giving a stochastic programming formulation. This formulation and general solutions are discussed in the next section. Note that the model proposed here is useful for reliability planning purposes, e.g. to set reliability targets minimizing the total cost of development. Reliability estimation techniques are required to allow for uncertainty in the achieved reliability of each component as testing progresses.
16.3. Stochastic Optimal Reliability Allocation
Let f = {ti,.. . ,tm)denote the random vector representing random variable replacements for the deterministic operational profile parameters in constraint ( 5 ) , i.e. for the p i , i = 1 , . . . , rn. The decision model stated by (4), (5) and ( 6 ) in the context of a random operation profile leads to a
Planning Models for Component-Based Software Offerings
225
problem statement “Minimize”
f(X1,
XZ, . .. ,An;$).
(7)
Subject to: (8) i=l
j=1
gj(X1,Xz,. ..,A,)
=
-Xj
5 0
for j = 1 , . . . , n . (9)
In this formulation, we have replaced go by Go to indicate that go is now a random variable. As noted in Kall and Wallace,‘ this problem as a whole, and the constraint (8), are not well-defined when trying to make a decision for setting the XI, Xz, . . . ,An values, prior to knowing a realization of [. To address this, we may consider a deterministic equivalent of the model specified by (7), (8) and (9) by replacing (8) with a probabilistic constraint such as: P[Go(Xi,Xz, . . . ,L; I 01 2 $, (10) where 0 < $ < 1. This constraint says that we want the chance that the overall reliability target is met to be at least $. Note that when a realization of is specified, e.g., as P I , . . . ,pm, then the values of pi should sum to one, and each value must be between zero and one. A convenient probability distribution that achieves these properties is a Dirichlet distribution. A random variable is said to follow a Dirichlet distribution if its probability distribution function has the form
r)
i
i
(11 when 6 1 , . . . , E m 2 O , C L , constant is
= 1 and q , .. . ,urn
> 0. The normalization (12)
A univariate Dirichlet distribution reduces to a standard Beta distribution. The values v = (211,. . . ,urn)determine the shape of the distribution. See Kotz’ for additional details on Dirichlet distributions. Standard results in stochastic optimization, for example Kall and Wallace,‘ show that a solution involving (10) is obtained by applying the same nonlinear programming techniques used to solve the deterministic formulations. For example, when (10) is quasiconvex and differentiable with
M. Helander and B. Ray
226
respect to A1, Ap, . . . ,AYn ,, then the Kuhn-Tucker conditions may be used to characterize and identify an optimal solution. Application of such techniques, including validation of properties such as quasiconvexity for (lo), requires derivation of the distribution of Go. The next subsection explores this derivation.
16.3.1. Derivation of the distribution for Go Upon examination of expression (lo), we see that Go is basically a shifted and scaled linear combination of Dirichlet random variables. Recent results by Provost and Cheongs provide an expression for the cumulative distribution function of a linear combination of Dirichlet random variables in integral form. Let a = plog(p), b = pr and ci = pijAj in (8). Then the distribution function is given by
c;=,
(13)
+
+
for a bmin(ci) < z < a bmax(ci). Note that -5 = -log(p)/.r = A, the system failure intensity target. Then it follows from (7)-(lo), that the optimization problem can now be restated as Minimize
f(A1,
Az,
. . . , A;,
Subject to: h(A1, A 2 , . . . ,A,)
-A1, - A 2 , . . . , -A,
i).
(14 (14)
= -FG,,(O; A1,
5
0
for j
A2,.
=
. . ,A,)
+ 11, 5 0 , (15)
1 , . . . ,n,
(16
where
(17) from (13). As stated above, the Kuhn-Tucker sufficient conditions can be used to find a global optimal solution to the stochastic optimization problem provided (14) is pseudoconvex and h(X1, A 2 , . . . ,A), is quasiconvex and differentiable with respect to the X i in the feasible region.
227
Planning Models for Component-Based Software Offerings
The Kuhn-Tucker sufficient conditions are as follows: BTC
7
BTC axz
7
T
... 1 7 1
T
fi]
, 72
T 7 . e -
t m l
d h(A1,A2,. . .,A,) yjxj = 0
-
(18)
= 0,
(19)
= 0,
for j = 1, ..., n,
(20)
where d and y1,y2, . . . ,y, are Lagrangian multipliers associated with inequality constraints (5) and ( 6 ) respectively. From (15), we have h(A1, A21..
After
.,A,)
= -FGo(O;
A11
A2r.. ., A n ) -k 11,.
some algebraic manipulations, the partial
(21)
derivative of
FG~(O; X I , . . . ,A,) with respect to A j needed for (19) has the following form:
Lrn
1
~ F -G ~
--
ax,
7r
C O S I C z=1 m 21.2 tan-l{(ci
EE1viPij[l+(c,-1~s)2wi1dw
- A,)w}]
nzl{l+
(Ci - AS)2W2}Vi/2
s i n [ C z l vitan-l((ci - A,)w}] i=l
(22) Looking again at (17), it is not obvious to show that h(X1,A2,. . . ,A,) is quasiconvex, given its complicated form. However, inspection of feasible region plots in two-dimensions suggests that for 11, > 0.5, the boundary of the feasible region is convex, and appears to be approximately piecewise linear. When 11, = 0.5, the boundary of the feasible region coincides with that for the deterministic solution. Interestingly, the feasible region appears to be concave for 11, < 0.5. However, recall that 11, denotes the chance that the overall reliability target will be met. Thus, from a practical standpoint, restriction to 11, > 0.5 to achieve a convex feasible region does not unduly limit the formulation. From (17), we also see that F G ~ ( O ;..., X ~ A,,) takes value 1/2 when X i = A, for each i = 1 , . . . n. This implies that, for any 11,, the boundary of the feasible region always goes through the point X i = A,, which forms an inflection point (under varying 11,) for the boundary of the feasible region corresponding to the system reliability constraint.
228
M. Helander and B. Ray
Given that we have not definitively determined the quasi-convexity of (17), it is possible that the solution to (14) obtained using numerical optimization techniques may not provide the global optimal solution. While this is not entirely satisfying, a “close-to-optimal” solution may still provide significant gains over the deterministic solution when uncertainty in the usage profile is large. We provide examples to illustrate the effects of allowance for uncertainty in Sec. 16.4.
16.3.2. Solution implementation To obtain a solution to (14)-(17), the fmincon function of Matlab’s Optimization Toolbox (v. 6.2, Release 13) was used to directly minimize (14) subject to the constraints (15)- -(17). The function (17) was evaluated numerically using adaptive Simpson quadrature methods, with the Matlab function quad. We note that several parameters of fmincon and quad affect the quality of the obtained solution. First, the precision tolerance (tol) of the quad function can result in differences in the optimal solution obtained. Decreasing the tolerance to a very small value usually provided good results, at the cost of increasing the solution computation time or exceeding the maximum number of allowed function evaluations. However, making the tolerance value too small sometimes resulted in singularity and nonconvergence. Second, several parameters of the fmincon function affect the quality of the solution. These are a) tolx, the required change in A, for continued iteration, b) tolfun, the required change in the objective function for continued iteration, and c) t o l c o n s t r , the required distance of the solution from the active constraint. Matlab documentation recommends staying with the default values, although we found it necessary to experiment with smaller values in order to achieve algorithm convergence. While these algorithm parameters impacted the achieved solution, the choice of starting point for finding the solution seemed to have the greatest effect. We found that starting at a value a small step into the feasible region from the deterministic solution generally proved an effective strategy.
16.4. Examples In this section, we graphically illustrate the effect of uncertainty on the allocation problem specified by (l), (2), and (3), by extending the n = 2 component example from Helander3 to allow variation in assumed usage. Specifically, the input parameters of the model in the deterministic case
Planning Models for Component-Based Software O@rings
are: ? =I2
p = .99
m=2
~
=
1
p"
I;[ [;:I
i;=
[
229
=
Pl1 P 1 2 1121 P 2 2
]
0.6 0.4 =
[o.o L O ] .
The cost of each component is assumed to be inversely proportional to the achieved reliability, following an inverse power law of the form C(X) = P (X-6)" ' A > S. The InvPow cost parameters are taken as
The total cost is simply the sum of the individual component costs, ie., f ( x l , A2)
=CP~,a~,61
+CP~,a2,62(~2).
Figure 16.1 shows the solution for the deterministic version of the model, as in Helander,3 which corresponds to the stochastic solution for any choice of Dirichlet parameters wi when the chance of meeting the reliability target, $, is 0.5. Using a Dirichlet distribution with w1 = 7 , v 2 = 3 to characterize uncertainty in p i , Fig. 16.2 shows the effect on the feasible region of increasing $, while Fig. 16.3 shows the cost contours and resulting solution, respectively. Table 16.1 gives the precise values of XI, A2 and the resulting Total Cost to achieve the specified reliability under the Inverse Power cost function given above. As expected, the Total Cost increases with $. The four plots of Fig. 16.4 show the effect of varying wi over the values v1 = 0 2 = 0.5 (top left), 1 (top right), 2 (bottom left), and 16 (bottom right) for values of 1c, varying from 0.2 to 0.8, where the Dirichlet distributions corresponding to the w1 values are shown in Fig. 16.5. We see that decreased uncertainty in the usage, as specified through decreased variability in p i , results in less variation from the 1c, = 0.5 case (also corresponding to the deterministic boundary). Depending on the objective function, the stochastic optimization algorithm may converge toward the common failure intensity solution as $ approaches one, ie. XI = A2 = . . . = An = A, = - log(p)/r. When this happens, FG~(O; XI, ..., An) is exactly 0.5, which forces the Lagrangian multiplier, d, to be zero so that (19) can hold. In this case, there is no feasible solution to the stochastic optimization problem under the Dirichlet distribution assumption. In practice, convergence toward the common failure intensity solution may make it difficult to terminate with an acceptable
230
M . Helander and B. Ray
exit condition from a procedure like fmincon,as was the case for values of 0.85 < $J < 1 in this two component example.
Fig. 16.1. Solution t o the deterministic model for the 2 component example.
Table 16.1. Component reliability solutions for two-component system under varying probabilistic constraints and resulting total cost. Deterministic Stochastic Stochastic Stochastic Stochastic
I $ J I I
-
0.50 0.65 0.75 0.85
I
XI
0.03654 0.03654 0.03419 0.03335 0.03144
I I
XZ
0.02606 0.02606 0.02742 0.02788 0.02946
I 1
cost 1884.10 1884.10 1888.58 1890.98 1893.06
16.5. Summary and Discussion In this paper, we have formulated and solved the optimal reliability allocation problem for a system of software components when uncertainty in the operational profile is quantified using probability distributions. In other work, we plan to make more explicit the relationship between the stochastic formulation of the problem, as outlined in this paper, and other
Planning Models for Component-Based Software Offerings
231
.
Fig. 16.2. Feasible region changes with varying
+.
Fig. 16.3. Solutions to the stochastic model for the two component example with varying
+.
232
M. Helander and B. Ray
0.03
0.03
0.02
0.02
0.01
0.01
0
0
0.02
0.06
0.04
0 0
0.02
0.04
0.06
0.04
0 03
0.03
0 02
0.02
0 01
0.01
0
0
002
004
006
0
0
0.02 004 0.06 0.08
Fig. 16.4. Behavior of the feasible region boundary formed by the system reliability for the two component example with varying from 0.2,. . . ,0.8and v1 = 0.5 (top left), 1 (top right), 2 (bottom left), 16 (bottom right). $J
“robust” deterministic approaches to the same problem, such as that given in Leung4 Here, we have assumed that the parameters in the cost function being optimized are known. If these parameters are subject to uncertainty which can be characterized through a probability distribution on f ( A ) for a specified A, several approaches may be taken for incorporating this additional source of uncertainty into the proposed modeling framework. For example, the optimization problem could be solved to minimize the expected cost, or to minimize the the pth percentile of the cost distribution, with p = 0.95 for example. We leave exploration of the sensitivity of the solution to uncertainty in both cost and operational profile to future research. Additionally, we will consider how the present formulation can be made relevant to application areas having attributes other than reliability as the system attribute of interest.
Acknowledgments The authors wish to thank Samer Takriti and Giuseppe Paleologo for extensive discussions concerning stochastic optimization methods and tools.
Planning Models for Component-Based Software Offerings V
233
v= 1
= 0.5
1
0
0.2 0.4
0.6
0.8
Fig. 16.5. Illustration of the Dirichlet distribution for vi = 0.5 (top left), 1 (top right), 2 (bottom left), 16 (bottom right).
References 1. J. D. Musa, Operational Profiles in Software Reliability Engineering, I E E E Software, Vol. 10 (2), 14-32 (1993). 2. J. H. Poore, H. D. Mills, and D. Mutchler, Planning and certifying software system reliability, I E E E Software, Vol. 10 ( l ) ,88-99 (1993). 3. M. E. Helander, M. Zhao, and N. Ohlsson, Planning models for software reliability and cost, I E E E Transactions on Software Engineering, Vol. 24 (6), 420-434 (1998). 4. Y. W. Leung, Software reliability planing models under an uncertain operational profile, Journal of the Operational Research Society, Vol. 48, 401-411 (1997).
5. B. W. Boehm, Software Engineering Economics, Prentice-Hall: New Jersey (1981). 6. P. Kall and S. W. Wallace, Stochastic Programming, John Wiley and Sons: New York (1994). 7. S. Kotz, N. Balakrishnan, and N. Johnson, Continuous Multivarite Distributions, Volume 1: Models and Applications, John Wiley and Sons: New York (2000). 8. S. B. Provost and Y . H. Cheong, On the distribution of linear combinations of the components of a Dirichlet random vector, The Canadian Journal of Statistics, Vol. 28, 417-425 (2000).
This page intentionally left blank
CHAPTER 17 DESTRUCTIVE STOCKPILE RELIABILITY ASSESSMENTS: A SEMIPARAMETRIC ESTIMATION OF ERRORS IN VARIABLES WITH VALIDATION SAMPLE APPROACH NICOLAS W. HENGARTNER Statistical Sciences Group, M S F600 Los Alamos National Laboratory Los Alamos, NM 87545 USA E-mail: [email protected] This paper extends the methodology of parameter estimation for errors in variables with validation sample to the problem of assessing the reliability of a stockpile from different types of destructive testing. Our contribution is a generalization of the methods introduced by Pepe and Fleming' for discrete covariates and surrogate variables to their continuous counterparts. We show that our proposed semiparametric estimator is asymptotically Gaussian and efficient.
17.1. Introduction
Integrating different destructive tests to assess the reliability of items in a stockpile is a challenging statistical problem. Consider the following example. To assess the reliability of rocket engines in a stockpile of missiles, we perform two different types of testing: The first type of testing consists of firing the missile and recording 2,the status of the engine. Pass if it worked, and fail if not. The second type of testing consists of firing the engine in a laboratory and monitoring and recording the performance X of the engine, such as thrust, heat, and burn time. This second type of testing data is informative for the overall engine reliability when the conditional distribution p ( z l z ) of the status Z given the performance X is known. Since
p = P[Z = 11 = E[P[Z= llX]],
235
236
N. W . Hengartner
the stockpile reliability can be estimated from a sample XI, . . . ,X formance measurements by
n of per-
l n
?j = - C P [ Z = llXi]. n i= 1
More typically, the engineers know the conditional distribution p ( z J z ) up to some unknown parameter 9, which needs to be estimated from data. Unfortunately, the two independent samples (one of performances and the other of status) do not in themselves allow the estimation of that parameter. To make the estimation of 9 possible, we link the two samples by introducing a covariate W , observed on all the test units, which is related to both the status and the performance. Further, we will assume that this additional covariate W does not contain information about the status that is not already contained by the performance. Thus W acts as a surrogate for the performance measurements. The concept of surrogate variables was introduced by Prentice2 and r e fined by Pepe3 in the context of medical studies, where it is also called non-differential measurement error by Caroll et al.4 Pepe and Fleming.' Wang and Pepe5 call this data structure errors in variables with validation sample and have considered estimation of 0 when both the surrogate W and performance X are discrete random variables taking on finitely many values. Extensions to continuous performance and surrogate variables is straightforward if a parametric model for the conditional distribution of X given W is available. However, given the complexity of the relationship between the performance and the surrogate variables, such distributions may be hard to specify. Jiang and Turnbulls and Jiang and Turnbul17 consider estimating 9 without specifying the joint distribution of performance and surrogate variables by a generalization of the method of moments. While their estimators are consistent and asymptotically Gaussian, it is unlikely that they are efficient. For a discussion on the challenges of estimating 9 in this semiparametric model, we refer the interested reader to Caroll and Wand' and Stephanski and Lee.g In this paper, we consider efficient estimation of the 0 without specifying explicitly the relationship between performance and surrogate. Our approach to produce efficient estimates for 9 is similar in spirit to Pepe3 and rests on the idea of estimating the score function. This paper is organized as follows. Section 17.2 lays the groundwork by introducing the notation and giving sufficient conditions for the parameter 8 to be identifiable. Section 17.3 defines the estimation procedure and states our main result, which is
proven in Sec. 17.4. We wish to mention that Wang et al.¹⁰ have studied a similar problem arising in econometrics and have produced efficient semiparametric estimates using an empirical likelihood approach. However, we prefer our estimator over theirs because of its simplicity and its ease of computation. Finally, to help keep the notation as simple as possible, this paper focuses on univariate parameter estimation, as the extension to multivariate parameter estimation is straightforward.
17.2. Preliminaries
Let {(Z_i, X_i, W_i), i = 1, ..., n + m} be n + m independent copies of (Z, X, W), where Z is a binary status variable, X the performance variable, and W the surrogate that satisfies the surrogacy assumption

P[Z = 1 | X, W] = P[Z = 1 | X].    (1)

Consider the parametric model for the conditional distribution of the status variable given the performance variable
R(z|x; θ) = P_θ[Z = z | X = x],

and denote by

F(x|w) = P[X ≤ x | w]  and  f(x|w) = (∂/∂x) F(x|w),
the conditional probability distribution function and related density function of the performance variable given the surrogate. Finally, one gets to observe the two independent samples

V_1 = {(Z_i, W_i); i = 1, ..., n}    (2)

and

V_2 = {(X_i, W_i); i = n + 1, ..., n + m}.    (3)

In light of the surrogacy assumption, the conditional probability function of the status variable given the surrogate variable W = w is

p(z|w; θ) = P_θ[Z = z | W = w] = E[ P_θ[Z = z | X, W = w] | W = w ] = E[ R(z|X; θ) | W = w ] = ∫ R(z|x; θ) f(x|w) dx.    (4)

Both p(z|w; θ) and f(x|w) can be estimated from the data (2) and (3). Hence a necessary condition for identifiability of the parameter θ is that the kernel of the functional

T[H](z, w) = E[ H(z|X) | W = w ] = ∫ H(z|x) f(x|w) dx
be trivial; that is, for any function H(z|x),

T[H](z, w) = 0  implies  H(z|x) = 0.

It is easy to see that this condition fails to hold if, for example, the performance X and the surrogate W are independent, in which case the parameter θ is not identifiable. Theorem 17.1 below gives a sufficient condition for identifiability of a nonparametric specification of the conditional probability of status given performance.
Theorem 17.1: If {f(·|w)} is a complete family of densities, then the reliability function R(z|x) is identifiable.

Theorem 17.1 proves identifiability of arbitrary conditional probability functions R(z|x). Thus, under the conditions of the theorem, if the parameter θ is identifiable in the (unobserved) model R(z|x; θ), then it is also identifiable in the errors in variables with validation sample model.
Proof: Suppose to the contrary that there exist two conditional probability functions R_1(z|x) ≠ R_2(z|x) such that

∫ R_1(z|x) f(x|w) dx = ∫ R_2(z|x) f(x|w) dx  for all w.

However, the assumed completeness of the family of densities {f(·|w)} implies that R_1(z|x) - R_2(z|x) = 0, contradicting the ansatz.

17.3. Estimation
To gain insight into the proposed estimation method, let us first consider the case of a known conditional distribution of X given W. This may be seen as the limiting case of an unlimited validation sample (X, W). Then

p(z|w; θ) = ∫ R(z|x; θ) f(x|w) dx = E[ R(z|X; θ) | W = w ],

and the maximum likelihood estimators for θ are solutions of the estimating equation

Σ_{i=1}^{n} (∂/∂θ) log p(z_i|w_i; θ) = 0.
Under mild regularity assumptions to justify interchanging derivatives and integrals, the score function can be expressed as
S(z, w; θ) = E[ (∂/∂θ) R(z|X; θ) | W = w ] / E[ R(z|X; θ) | W = w ],    (5)
which is easily verified to also be the minimizer, with respect to a, of the criterion (6). We propose to estimate the score function S(z, w; θ) by minimizing, with respect to a, a kernel-weighted empirical counterpart of (6) computed from the validation sample i = n + 1, ..., n + m, where K(u) is a symmetric smoothing kernel of sufficiently high order and h denotes the bandwidth, which goes to zero as the validation sample size m increases. Armed with an estimate Ŝ(z, w; θ) of the score function, we estimate the parameter θ as the solution θ̂ of

Σ_{i=1}^{n} Ŝ(z_i, w_i; θ̂) = 0.    (7)
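The two-step idea (estimate the score from the validation sample, then solve the estimating equation on the status sample) can be sketched as follows in Python. This is not the paper's implementation: the one-parameter logistic model R(1|x; θ), the Gaussian kernel (the chapter requires a compactly supported kernel of sufficiently high order), the fixed bandwidth, and the direct Nadaraya-Watson estimation of the two conditional expectations in (5) (rather than the minimization of the empirical counterpart of (6)) are all simplifying assumptions.

import numpy as np
from scipy.optimize import minimize_scalar

def R(z, x, theta):                      # hypothetical model P_theta[Z = z | X = x]
    p1 = 1.0 / (1.0 + np.exp(-theta * x))
    return p1 if z == 1 else 1.0 - p1

def dR(z, x, theta):                     # derivative of R with respect to theta
    p1 = 1.0 / (1.0 + np.exp(-theta * x))
    d = x * p1 * (1.0 - p1)
    return d if z == 1 else -d

def score_hat(z, w, theta, Xv, Wv, h):
    # Nadaraya-Watson estimates of the two conditional expectations in (5),
    # using a Gaussian kernel over the validation sample {(X_j, W_j)}.
    wts = np.exp(-0.5 * ((Wv - w) / h) ** 2)
    return np.sum(wts * dR(z, Xv, theta)) / np.sum(wts * R(z, Xv, theta))

def estimating_eq(theta, Zs, Ws, Xv, Wv, h):
    # Empirical version of (7): sum of estimated scores over the status sample.
    return sum(score_hat(z, w, theta, Xv, Wv, h) for z, w in zip(Zs, Ws))

rng = np.random.default_rng(1)
n, m, h = 200, 200, 0.3
Wv = rng.normal(size=m); Xv = Wv + 0.5 * rng.normal(size=m)   # validation sample (X, W)
Ws = rng.normal(size=n); Xs = Ws + 0.5 * rng.normal(size=n)   # X stays unobserved in the status sample
Zs = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-1.5 * Xs))).astype(int)

theta_hat = minimize_scalar(lambda t: estimating_eq(t, Zs, Ws, Xv, Wv, h) ** 2,
                            bounds=(0.1, 5.0), method="bounded").x
print(theta_hat)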
The following theorem states that the resulting estimator has good statistical properties.
Theorem 17.2: Let θ̂ be a solution of (7). Under suitable regularity conditions detailed in Sec. 17.4.1, we have that

√n (θ̂ - θ₀) → N(0, I⁻¹(θ₀))  in distribution,

where I(θ₀) is the Fisher information matrix for estimating θ from independent replications of (Z, W) when the conditional distribution of X given W is known.

Several remarks are in order.

(1) The estimator θ̂ is efficient, as its asymptotic variance is the same as if we had known the joint distribution of X and W.
(2) The asymptotic distribution of the estimator θ̂ is not sensitive to the choice of the smoothing parameter h, which can be chosen anywhere within a wide range of admissible rates (indexed by a small ε > 0). Hence there is little need to discuss bandwidth selection for first-order asymptotics. We note, however, that the choice of the bandwidth does impact the second-order properties of the estimator, and thus, for finite samples, there may still be some benefit in identifying good choices for h.

(3) The size m of the validation sample can be much smaller than n. In particular, the conclusions of the theorem hold provided that m grows at least as a suitable small power of n (for small ε > 0). This shows that the smoother the joint density f(x, w) is, the smaller the validation sample needs to be.

17.4. Proofs

17.4.1. Regularity conditions

The proof of Theorem 17.2 requires that the following set of assumptions hold.

A1 The parameter θ is identifiable.
A2 The joint density f(x, w) is r times continuously differentiable in w.
A3 The smoothing kernel K(·) is bounded with compact support and satisfies
(a) ∫ K(u) du = 1;
(b) K(·) is of order r.
A4 Both n, m → ∞.
A5 The bandwidth h satisfies nh^{2r} → 0 and mh → ∞.
A6 The envelope functions M_h(Z, W), defined through suprema of the smoothed terms entering the score estimate, have finite expectations.
17.4.2. Proof of Theorem 17.2
The idea of the proof is to write the semiparametric estimator θ̂ as a U-statistic. Unfortunately, the standard convergence results for U-statistics (as in Grams and Serfling¹¹) do not directly apply because of the smoothing we do in estimating the score function. Let the estimator θ̂ be a solution of J_h(θ̂) = 0. A Taylor expansion of J_h(θ̂) about the true parameter value θ₀ implies that

θ̂ - θ₀ = -[∇J_h(θ*)]⁻¹ J_h(θ₀)

for some θ* with ‖θ* - θ₀‖ < ‖θ̂ - θ₀‖. A central limit theorem for θ̂ follows by showing that J_h(θ₀) is asymptotically Gaussian and that ∇J_h(θ*) converges in probability to some nonnegative definite matrix Σ. Denote by S_h(z, w; θ₀) the smoothed population version of the score
and write

Ŝ(z, w; θ₀) = S_h(z, w; θ₀) + r_m(z, w),    (8)

with the remainder r_m → 0 in probability as m → ∞. Set U_i = (Z_i, W_i) and V_j = (X_{n+j}, W_{n+j}). In light of Equation (8), we approximate the objective function J_h(θ₀) by the U-statistic
with kernel H(u, v; θ₀).
The Hoeffding decomposition of the latter kernel is
where
Standard calculations reveal that
(9)

with the remainder bounded by
Identity (9) further implies that
(10)

Using the fact that for all θ
we get that

∫∫ S(z, w + hu; θ₀) p(z, w + hu; θ₀) f(z, w + hu) K(u) du dz = E_{θ₀}[S(Z, W; θ₀)] + (h^r / r!) ∫ u^r K(u) du · E[ξ₁(X, W)],    (11)

with integrable remainder ξ₂. By construction, the random variables {A(U_i, V_j), i = 1, ..., n, j = 1, ..., m} are uncorrelated. Furthermore, since Var(A(U_i, V_j)) = O(h⁻¹), it follows that
(nm)⁻¹ Σ_{i=1}^{n} Σ_{j=1}^{m} A(U_i, V_j) = o_p(n^{-1/2}).    (12)
Combining (9), (10), (11), and (12), we conclude that

J_h(θ₀) = E_{θ₀}[H(U_i, V_j; θ₀)] + (1/n) Σ_{i=1}^{n} {E_{θ₀}[H(U_i, V_j; θ₀) | U_i] - E_{θ₀}[H(U_i, V_j; θ₀)]} + (1/m) Σ_{j=1}^{m} {E_{θ₀}[H(U_i, V_j; θ₀) | V_j] - E_{θ₀}[H(U_i, V_j; θ₀)]} + o_p(n^{-1/2}).
Hence, if both m, n → ∞, nh^{2r} → 0, and mh → ∞, then
with
Finally, since

(∂/∂θ) J_h(θ*) = (∂/∂θ) J_h(θ₀) + o_p(1),

we get the desired conclusion.
Acknowledgments
I wish to thank Alyson Wilson for helpful discussions and an anonymous referee for comments that have improved the presentation of this paper.
References
1. M. S. Pepe and T. R. Fleming, A general nonparametric method for dealing with errors in missing or surrogate covariate data, JASA 86, 108-113 (1991).
2. R. L. Prentice, Surrogate endpoints in clinical trials: definition and operational criteria, Statistics in Medicine 8, 431-440 (1989).
3. M. S. Pepe, Inference using surrogate outcome data and a validation sample, Biometrika 75, 237-249 (1992).
4. R. J. Carroll, D. Ruppert, and L. A. Stefanski, Measurement error in nonlinear models, Chapman and Hall, London (1995).
5. C. Y. Wang and M. S. Pepe, Expected estimating equations to accommodate covariate measurement errors, Journal of the Royal Statistical Society, Series B, 62, 509-524 (2000).
6. W. Jiang and B. W. Turnbull, The indirect method: Inference based on intermediate statistics - A synthesis and examples, to appear in Statistical Review (2004).
7. W. Jiang and B. W. Turnbull, The indirect method: Robust inference based on intermediate statistics, Technical Report No. 1377, School of Operations Research, Cornell University, Ithaca, New York, http://www.orie.cornell.edu/trlist/trlist.html (2003).
8. R. J. Carroll and M. P. Wand, Semiparametric estimation in logistic measurement error models, Journal of the Royal Statistical Society, Series B, 53, 652-663 (1991).
9. J. H. Sepanski and L. F. Lee, Semiparametric estimation of nonlinear errors-in-variables models with validation study, Journal of Nonparametric Statistics 4, 365-394 (1995).
10. Q. Wang, K. Yu, and W. Härdle, Likelihood-based kernel estimation in semiparametric errors-in-variables models with validation data, Technical report, Humboldt-Universität zu Berlin (2002).
11. W. F. Grams and R. J. Serfling, Convergence rates for U-statistics and related statistics, Annals of Statistics 1, 153-160 (1973).
CHAPTER 18 FLOWGRAPH MODELS FOR COMPLEX
MULTISTATE SYSTEM RELIABILITY
APARNA V. HUZURBAZAR
Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM 87131 USA
E-mail: [email protected]

BRIAN J. WILLIAMS
Statistical Sciences Group Los Alamos National Laboratory Los Alamos, NM 87545 USA E-mail: [email protected] This chapter reviews flowgraph models for complex multistate systems. The focus is on modeling data from semi-Markov processes and constructing likelihoods when different portions of the system data are censored and incomplete. Semi-Markov models play an important role in the analysis of time to event data. However, in practice, data analysis for semi-Markov processes can be quite difficult and many simplifying assumptions are made. Flowgraph models are multistate models that provide a data analytic method for semi-Markov processes. Flowgraphs are useful for estimating Bayes predictive densities, predictive reliability functions, and predictive hazard functions for waiting times of interest in the presence of censored and incomplete data. This chapter reviews data analysis for flowgraph models and then presents methods for constructing likelihoods when portions of the system data are missing.
18.1. Introduction

Multistate models are used to describe longitudinal, time-to-event data. They model stochastic processes that progress through various stages. Today’s complex systems make the analysis of multistate models very im-
portant in reliability. Flowgraph models are one type of multistate model. Flowgraphs model potential outcomes, probabilities of outcomes, and waiting times for the outcomes to occur. They can be used to model complex system behavior, time to total or partial system failure, time to repair of components or the entire system, and to predict system reliability. Examples of processes include the internal mechanisms of cellular telephone networks, stages in the maintenance and repair of aircraft, or improving stages of assembly and rework in a manufacturing process. For example, modeling the stages of the cellular telephone network begins with a fully functioning network that proceeds through a series of degradations to a partially functioning network and eventually to a fully failed network. The network itself may be a local network composed of only a few cells or it may be composed of subsystems that comprise a network over an entire state or country. In aircraft maintenance, the macro level process can be represented by five main systems. However, each system consists of subsystems with thousands of elements that represent tasks involved in repair and maintenance. Data may be available in detail for the entire process or, more realistically, data may be available only partially at the component, subsystem, or system level. Combining such information across various levels of a system has become increasingly important with today’s complex systems. Current methods in reliability for complex systems require that all the component waiting times be from the same distributional family such as the Weibull so that information may be easily combined. In addition, some methods for complex systems such as fault trees restrict themselves to binomial data so that large amounts of information may be easily aggregated. Flowgraphs allow each component or set of components to have its own distribution and flowgraph algebra provides a way to combine these varied distributions. The examples of the previous paragraph involve modeling the time until the occurrence of some event or events. Our interest is in modeling large, complex systems or subsystems of such systems. This chapter reviews data analysis for flowgraph models when the data are complete and possibly censored and then presents methods for handling systems where portions of the system data are incomplete. We extend the work of Williams and Huzurbazar3 who perform posterior simulation based on one constructed likelihood for an entire system flowgraph. The problem considered here requires the construction of several different likelihoods for portions of missing data in the system flowgraph. Flowgraph models can be viewed as a type of semi-Markov multistate
model. The theory underlying multistate models for event history analysis was formalized by the work of Aalen,⁴ who showed that such models may be analyzed within the framework of counting processes. In terms of data analysis, multistate models have been restricted to the realm of Markov models. Let X_n denote the state of the process at stage n, let T_n denote the time of transition to X_n, and let α_hj(·) denote the transition intensity of the process for the h → j transition. In a Markov multistate model, given the current state of the process, the transition time to a future state does not depend on the past history of the process. In a Markov multistate model, the transition intensity α_hj(·) reduces to the hazard function. At the initial state of a Markov process, the transition time is a minimum of the waiting time distributions corresponding to all possible transitions from the initial state. Hence, in practice, for tractability and analytical convenience, exponential distributions are assumed. Occasionally, with appropriate parametric restrictions, Weibull distributions are used, exploiting the fact that the minimum of independent and identically distributed Weibulls is again a Weibull distribution. A semi-Markov multistate model allows the transition time to a future state to depend on the duration of time spent in the current state, so that

P[X_{n+1} = j, T_{n+1} ∈ [t, t + δt) | {(X_k, T_k)}_{k ≤ n-1}, (X_n = h, T_n), T_{n+1} ≥ t] = α_hj(t - T_n) δt,  for h, j ∈ S,
where S = {1, 2, 3, ..., m} is the finite state space of the multistate process. In practice, it is quite difficult to analyze data for semi-Markov multistate models. One method of analysis for multistate models consists of combining independent submodels for each transition intensity, a method that restricts the analysis to models with unidirectional or progressive flow.⁵ In fact, Hougaard⁶ states that for nonunidirectional or nonprogressive multistate models, it is impossible to obtain general formulas for transition probabilities for models where the hazard is allowed to depend on the history in any way. Flowgraph models circumvent this difficulty by working in the moment generating function domain. Another method is the use of the proportional hazards model and its many extensions. The proportional hazards model as used in medical statistics was developed by Cox;⁷ however, the idea of assuming proportional hazards to fit more parsimonious models dates back to the operations research literature.⁸ Cox's model is semi-parametric and assumes that the intensity of the counting process is a product of a parametric function of the covariates and an arbitrary function of time. In practice, this method
is also restricted to a unidirectional or progressive multistate model. The obvious restriction of the proportional hazards model is that hazards are not always proportional. In both methods, the key approach for analyzing such multistate models is based on modeling the hazard function, a quantity that is not directly observable. The end result is a hazard function model based on a set of covariates, which can be converted to a reliability function if required. However, if our interest is in predicting the overall waiting time, especially when state-to-state transition times are not from the same distributional family, this method is also limited due to the complexity of the convolutions and finite mixtures involved. For example, the convolution or a finite mixture of a Weibull waiting time with an inverse Gaussian is not analytically tractable.
18.2. Background on Flowgraph Models
A flowgraph is a graphical representation of a stochastic system in which possible outcomes are connected by directed line segments. Modeling concerns probabilities of the outcomes, the waiting time distributions of the outcomes, and manipulating the flowgraph to access waiting time distributions for total or partial passage through the system. Flowgraphs model semi-Markov processes and allow for a variety of distributions to be used within the stages of the multistate model. They also easily handle reversibility. This means that a failed component can be repaired. Flowgraphs model the observable waiting times rather than the hazards and as such, they do not directly make any assumptions about the shape of the hazard. The end results from a flowgraph analysis are Bayes predictive densities, CDFs, reliability functions, and hazard functions of the waiting times of interest. Figure 18.1 shows a complex system consisting of outcomes in series and cascaded in parallel with feedback loops. The system is an assembly line for a manufacturing process for car stereos. The system flowgraph was originally presented in Huzurbazar⁹ but without data or analysis. State 0 represents an initial detection of a problem with a stereo. The problem is categorized into one of two types of severity. If the severity is of type I, the system is in state 1 for repair of the item. Eventually, the problem is fixed and the item moves to state 3 where it is specifically inspected to make sure that the type I problem is fixed. If the problem is not fixed, the item is returned to state 1; otherwise, it passes inspection and moves to state 5. Similarly, if the severity is of type II, the system is in state 2 for repair. Eventually, the problem is fixed and the item moves to state 4 where it is
specifically inspected to make sure that the type II problem is fixed. If the problem is not fixed, the item is returned to state 2; otherwise, it passes inspection and moves to state 5.
Fig. 18.1. Flowgraph model for manufacturing system.
While block diagrams and signal flowgraphs are widely used to represent engineering systems, they do not incorporate probabilities, waiting times, or data analysis. The literature on nonstatistical flowgraph methods in engineering is vast, beginning with Mason.¹⁰ Introductions to these methods are contained in most circuit analysis or control systems textbooks.¹¹,¹² Statistical flowgraph models are based on flowgraph ideas but, unlike their antecedents, flowgraph models can also be used to model and analyze data from complex stochastic systems. Flowgraphs are also distinct from graphical models in that the states represent outcomes rather than variables. For example, while feedback loops are an integral part of flowgraph models, they are redundant in a graphical model. Huzurbazar¹³ is a recent book on statistical flowgraph models. In a flowgraph model, the states, representing outcomes, are connected by directed line segments called branches. These branches are labeled with transmittances, called branch transmittances. A branch transmittance consists of the transition probability multiplied by the moment generating function (MGF) of the waiting time distribution in the previous state. The
waiting times on the branches can be any parametric distributions that admit moment generating functions. Hence, the model is quite general in that exponential assumptions, or assumptions that the waiting times for the various branches be from the same family of distributions, are not made. For example, in Fig. 18.1, the branch waiting time for state 0 to state 1 could be inverse Gaussian and for state 0 to state 2 could be gamma. The overall waiting time from state 0 to state 5 is a complicated combination of finite mixtures and convolutions of all of the branch transition distributions weighted by the branch probabilities. Flowgraphs are simplified by solving them for the MGF of the waiting time distribution of interest. The branch transmittances of a flowgraph model are used along with flowgraph algebra to solve for the MGF of the distribution of the waiting time of interest. This can be done by reducing the series, parallel, and feedback loop components of the flowgraph or by using an automated procedure implemented with symbolic algebra. Both of these procedures are described in the statistical flowgraph literature. The system of Fig. 18.1 is an example of a combination of a series and parallel system with feedback. States 0, 1, and 3 are connected in series. States 1 and 2 are in parallel, as are states 1 and 5 if the system is in state 3. The transitions 1 → 3 → 1 and 2 → 4 → 2 constitute feedback loops. These basic components can be used to reduce the flowgraph to just two states, 0 and 5, labeled with one transmittance, M(s), the MGF of the distribution of the total time required for successful inspection of the part. In general, for complex systems, the procedure of reducing flowgraphs can become tedious and a technique from graph theory is used to automate the process. Mason¹⁰ developed a rule in the context of graph theory for solving systems of linear equations. In its original implementation, Mason's rule did not involve probabilities or MGFs. However, flowgraph models can be solved by applying Mason's rule to the branch transmittances. This provides a systematic procedure for computing the transmittance of a flowgraph from any state i to any state j. It requires computing the transmittance for every distinct path from the initial state to the end state and adjusting for the transmittances of various loops. When a system is certain to pass from state i to state j, this transmittance is the MGF of the transition time distribution. Mason's rule requires the following definitions. A path from state i to state j is any possible sequence of states from i to j that does not pass through any intermediate state more than once. In Fig. 18.1, there are two paths from 0 to 5: 0 → 1 → 3 → 5 and 0 → 2 → 4 → 5. A path transmittance is the product of all the branch transmittances for that
path. The transmittances of these paths are p01 p35 M01(s) M13(s) M35(s) and p02 p45 M02(s) M24(s) M45(s). A first-order loop is any closed path that returns to the initiating state without passing through any state more than once. In Fig. 18.1, 1 → 3 → 1 and 2 → 4 → 2 are first-order loops. A second-order loop consists of two nontouching first-order loops. In Fig. 18.1 there is one second-order loop, since the two first-order loops do not touch. A jth-order loop consists of j nontouching first-order loops. The transmittance of a first-order loop is the product of the individual branch transmittances involved in the path. The transmittance of a higher-order loop is the product of the transmittances of the first-order loops it contains. There are no loops of order three or more in this flowgraph. The general form of Mason's rule gives the overall transmittance, i.e. the MGF from input to output, as

M(s) = [ Σ_i P_i(s) ( 1 + Σ_j (-1)^j L_j^i(s) ) ] / [ 1 + Σ_j (-1)^j L_j(s) ],    (1)
where P_i(s) is the transmittance for the ith path, L_j(s) in the denominator is the sum of the transmittances over the jth-order loops, and L_j^i(s) is the sum of the transmittances over jth-order loops sharing no common nodes with the ith path. For this system, P_1(s) = p01 p35 M01(s) M13(s) M35(s) and P_2(s) = p02 p45 M02(s) M24(s) M45(s), so that (1) for the overall waiting time from 0 to 5 is

M(s) = { p01 p35 M01(s) M13(s) M35(s) [1 - L_1^1(s)] + p02 p45 M02(s) M24(s) M45(s) [1 - L_1^2(s)] } / { 1 - L_1(s) + L_2(s) },

where

L_1^1(s) = p42 M24(s) M42(s),
L_1^2(s) = p31 M13(s) M31(s),
L_1(s) = p31 M13(s) M31(s) + p42 M24(s) M42(s),
L_2(s) = p31 p42 M13(s) M31(s) M24(s) M42(s).
In general, for systems with many states, paths, and loops, (1) is programmed using symbolic algebra. When the interest is in partial passage from state i to state j , (1) must be modified for the event of interest conditional on the occurrence of the event of interest. The conditional MGF in this case is
M_ij(s)/M_ij(0), for partial passage from i → j.
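As a hedged illustration of how (1) can be programmed using symbolic algebra, the following Python/SymPy sketch assembles M(s) for the flowgraph of Fig. 18.1 from generic branch transmittances; the symbol names are ours, and the branch MGFs are left as unspecified symbolic functions.

import sympy as sp

s = sp.symbols('s')
p01, p31, p42 = sp.symbols('p01 p31 p42')
# Generic branch MGFs, left as unspecified symbolic functions of s.
M01, M13, M35, M02, M24, M45, M31, M42 = [
    sp.Function('M' + ij)(s) for ij in ('01', '13', '35', '02', '24', '45', '31', '42')]

P1 = p01 * (1 - p31) * M01 * M13 * M35          # path 0 -> 1 -> 3 -> 5
P2 = (1 - p01) * (1 - p42) * M02 * M24 * M45    # path 0 -> 2 -> 4 -> 5
L11 = p42 * M24 * M42     # first-order loop 2 -> 4 -> 2 (does not touch path 1)
L12 = p31 * M13 * M31     # first-order loop 1 -> 3 -> 1 (does not touch path 2)
L1 = L11 + L12            # sum of first-order loop transmittances
L2 = L11 * L12            # the single second-order loop

M = sp.simplify((P1 * (1 - L11) + P2 * (1 - L12)) / (1 - L1 + L2))

# Sanity check: every branch MGF equals 1 at s = 0, so the overall transmittance is 1.
print(sp.simplify(M.subs({f: 1 for f in (M01, M13, M35, M02, M24, M45, M31, M42)})))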
While such flowgraph algebra provides the MGF of interest, it must be converted to a waiting time density, reliability, or hazard function in order to be useful. One class of methods for numerically converting MGFs to densities is saddlepoint approximations.¹⁴ This method applies to a wide range of complex flowgraphs. The approximation is summarized as follows. Let K(s) = log{M(s)} be the cumulant generating function (CGF) of the waiting time T. Let c_1 and c_2 be constants such that c_1 < 0 < c_2. Suppose that M(s) exists for s ∈ (c_1, c_2). Then the saddlepoint approximation to the density of T is

f̂_T(t) = [2π K''(ŝ)]^{-1/2} exp{K(ŝ) - ŝ t},    (2)
where K''(s) = d²K(s)/ds² and ŝ solves the saddlepoint equation

K'(ŝ) = t,    (3)
where K'(s) = dK(s)/ds. In practice, (2) is normalized by its numerical integral over the support of T. The quantity ŝ is called the saddlepoint and it is the solution to the saddlepoint equation (3). With flowgraphs, ŝ is a complicated implicit function of both t and the parameters of the branch distributions. In practice, c_1 and c_2 are found numerically. These quantities are used to bound the solution to the saddlepoint equation (3). Practically, the upper bound c_2 is the smallest positive root of (3). The saddlepoint is unique and is a monotonically increasing function of t, a property that can be exploited in its computation. The saddlepoint approximation is a highly accurate second-order approximation that works well with flowgraphs. For further details on the saddlepoint approximation see Jensen,¹⁵ Kolassa,¹⁶ or Reid.¹⁷ Although other transform inversion techniques, such as those based on characteristic functions,¹⁸,¹⁹,²⁰ are available, they do not work well with complex systems such as those modeled by flowgraphs. This is due to the complex nature of the convolutions, finite mixtures, and feedback of a variety of nonstandardized distributions present in flowgraph models. In addition to allowing a variety of distributions to be used within the stages of the multistate model, the flowgraph methodology also easily handles reversibility. This means that a failed component can be repaired. Flowgraphs model the observable waiting times rather than the hazards and as such, they do not directly make any assumptions about the shape of the hazard. Quantities of interest include predicting the distribution of the total time to testing and repair, 0 → 5; predicting the waiting time to repair, say, 3 → 1; or predicting the number of times the component failed
inspection of type II, 4 → 2, all in the presence of censored and incomplete data.
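Before turning to data analysis, a small numerical sketch of the saddlepoint approximation (2)-(3) may help. The hypothetical waiting time below is the convolution of an exponential and a gamma branch, chosen only because its CGF is available in closed form; with a flowgraph MGF, K, K' and K'' would instead be obtained from M(s) (symbolically or numerically), and the root bracket would use c_1 and c_2.

import numpy as np
from scipy.optimize import brentq

lam, alpha, beta = 0.5, 2.0, 1.0    # illustrative Exp(lam) and Gamma(alpha, beta) branches

def K(s):   return np.log(lam / (lam - s)) + alpha * np.log(beta / (beta - s))   # CGF of the convolution
def K1(s):  return 1.0 / (lam - s) + alpha / (beta - s)                          # K'(s)
def K2(s):  return 1.0 / (lam - s) ** 2 + alpha / (beta - s) ** 2                # K''(s)

c2 = min(lam, beta)                 # M(s) exists for s < c2

def saddlepoint_density(t):
    s_hat = brentq(lambda s: K1(s) - t, -50.0, c2 - 1e-9)       # saddlepoint equation (3)
    return np.exp(K(s_hat) - s_hat * t) / np.sqrt(2.0 * np.pi * K2(s_hat))   # Eq. (2)

ts = np.linspace(0.5, 15.0, 200)
dens = np.array([saddlepoint_density(t) for t in ts])
dens /= np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(ts))      # normalize numerically
print(dens[:3])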
18.3. Flowgraph Data Analysis

Data analysis using flowgraph models can handle both censored and incomplete data. The basic data types are called complete data, incomplete data, and unrecognizably incomplete data. Complete data on a flowgraph model consists of having every intermediate transition for each observation. The associated waiting time for the transitions may be censored; however, the transition information is complete. In Fig. 18.1, complete data consists of observations such as 0 → 1 → 3 → 1 → 3 → 5. Incomplete data consists of data that have complete information on observed waiting times but incomplete information on the associated transitions. For example, in Fig. 18.1, if we observe a waiting time such as 0 → 4, we know that the transition is 0 → 2 → 4 but the information on the intermediate transition through state 2 is incomplete. Unrecognizably incomplete data are data that appear to have complete information on the transitions and waiting times but in reality are incomplete with respect to transition information. For example, suppose that we observe 0 → 2 → 4; we might assume that the data are complete. But perhaps the true observation transitions were 0 → 2 → 4 → 2 → 4. Currently, there are no good methods for dealing with such data in the semi-Markov case. The end result(s) from a flowgraph analysis are Bayes predictive densities, CDFs, reliability functions, and hazard functions of the waiting times of interest. The Bayes predictive density of a future observable Z given data D is
f_Z(z|D) = E_{θ|D}{ f_Z(z|θ) },    (4)
where Z has density f_Z(z|θ) and π(θ|D) is the posterior distribution of θ. The predictive cumulative distribution function (CDF) is defined analogously as F_Z(z|D) = E_{θ|D}{F_Z(z|θ)}. The flowgraph model provides the MGF of the density f_Z(z|θ). For computational purposes, (4) can be written as
f_Z(z|D) ∝ ∫ f_Z(z|θ) L(θ|D) π(θ) dθ,    (5)
where L(θ|D) is the likelihood function of the data and π(θ) is the prior distribution of θ. The integration can be performed using any of the many available methods for posterior simulation including Markov chain Monte
Carlo (MCMC). We use the MCMC-based slice sampling method due to Neal.²¹ Slice sampling is convenient because it requires minimal tuning, which is also true of equally viable alternatives such as the random walk Metropolis algorithm with automatic step size selection.²² Parametric assumptions about the flowgraph allow us to specify the likelihood function for complete data. The flowgraph models a semi-Markov process, so that conditional on the observed transitions, the branch waiting times are independent.²³,²⁴ Parametric models for the branch waiting times are chosen using exploratory data analysis such as histograms or censored data histograms.²⁵,²⁶ Sometimes, theoretical considerations lead to a parametric model; for example, if the flowgraph is modeling a queuing system, then the queuing assumptions are applied to derive the parametric model. When data are incomplete, we use an approximate likelihood for the data. This approximate likelihood is constructed from the saddlepoint approximation applied to the MGF of the incomplete data path in the underlying flowgraph model. Suppose T_1, ..., T_n are n independent and identically distributed random variables representing the total waiting time in a flowgraph model. For example, in the flowgraph of Fig. 18.1, our data would consist only of observations on the waiting time 0 → 5. We do not observe any intermediate transitions that may have been made. Since we do not observe the intermediate transitions, we cannot easily specify the likelihood function involving the associated waiting time distributions. However, we can construct an approximate likelihood

L̃(θ) = Π_{j=1}^{n} f̂_T(t_j | θ),    (6)
where f̂_T(t_j | θ) is the normalized saddlepoint density approximation (2). The flowgraph model gives the MGF of the total waiting time. The saddlepoint approximation converts this MGF to a density function, giving an approximate likelihood. Thus, we are using the saddlepoint approximation to construct an approximate sampling distribution from the individual branch models in the flowgraph. Note that this likelihood construction depends on the entire flowgraph parameterization. Incomplete data can also occur on portions of the flowgraph and the treatment is the same as (6) but using only the relevant portion of the flowgraph. In general, observations contribute different amounts of data. Most observations will be complete, allowing specification of their full likelihood; while others will be incomplete, requiring likelihood construction. The over-
all likelihood function is given by
L(θ) = L_C(θ) × L̃(θ).    (7)
The likelihood can involve several flowgraph MGFs associated with the incomplete data cases.
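The structure of (7) can be sketched as follows; branch_logpdf and saddlepoint_logpdf are assumed helper functions (the latter along the lines of the saddlepoint sketch above), and the record formats are illustrative, not the authors' data structures. The resulting log-likelihood is what would be handed to the slice sampler.

# Hedged sketch of the overall likelihood (7) on the log scale.
def total_loglik(theta, complete_records, incomplete_records,
                 branch_logpdf, saddlepoint_logpdf):
    ll = 0.0
    for history in complete_records:
        # history: sequence of (from_state, to_state, waiting_time) triples
        for g, h, t in history:
            ll += branch_logpdf(g, h, t, theta)    # exact branch contribution (transition prob. x density)
        # a censored terminal sojourn would add a log-survivor term here
    for path, t in incomplete_records:
        # path labels the incomplete portion, e.g. '0->3' or '2->5'
        ll += saddlepoint_logpdf(path, t, theta)   # constructed contribution, Eq. (6)
    return ll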
18.4. Numerical Example
Consider the system of Fig. 18.1. In the following discussion, the exponential density Exp(λ) has mean 1/λ, the gamma density Gamma(α, β) has mean α/β, and the inverse Gaussian density IG(μ_1, μ_2) is parameterized so that its MGF is M_IG(s) = exp[μ_1 μ_2 (1 - √(1 - 2s/μ_2))], for s < μ_2/2. Complete and incomplete data were simulated as follows. Transition probabilities p01, p31, and p42 were taken to be 0.6, 0.3, and 0.5, respectively. The branch waiting time distributions used were Exp(1/8) for 0 → 1, Exp(10/3) for 0 → 2, IG(5, 4/5) for 1 → 3, IG(2, 4) for 2 → 4, Gamma(5, 4) for 3 → 1, Gamma(7, 3) for 4 → 2, Exp(2) for 3 → 5, and Exp(1/4) for 4 → 5. The simulated data had complete uncensored information for 45 transitions on 0 → 1, 42 transitions on 0 → 2, 47 transitions on 1 → 3, 70 transitions on 2 → 4, 14 transitions on 3 → 1, 42 transitions on 3 → 5, 38 transitions on 4 → 2, and 32 transitions on 4 → 5. In addition, there were 3 observations censored in state 0, 12 observations censored in state 1, and 1 observation censored in state 3. We also created incomplete data such that there were 10 incomplete transitions observed on 0 → 3 and 10 incomplete transitions observed on 2 → 5. The overall likelihood function is given by (7), where the incomplete data likelihood consists of two incomplete portions, for 0 → 3 and 2 → 5.
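The following Python sketch indicates how one history through the flowgraph of Fig. 18.1 could be simulated under these branch models; it is not the authors' code, and the inverse Gaussian draw assumes that IG(μ_1, μ_2) has mean μ_1 and shape μ_1² μ_2, consistent with the MGF given above but still an assumption about the chapter's exact parameterization.

import numpy as np

rng = np.random.default_rng(42)
p01, p31, p42 = 0.6, 0.3, 0.5

def draw(dist, a, b=None):
    if dist == 'exp':   return rng.exponential(1.0 / a)     # Exp(lam), mean 1/lam
    if dist == 'gamma': return rng.gamma(a, 1.0 / b)        # Gamma(alpha, beta), mean alpha/beta
    if dist == 'ig':    return rng.wald(a, a * a * b)        # IG(mu1, mu2): mean mu1, shape mu1^2*mu2 (assumption)
    raise ValueError(dist)

branch = {(0, 1): ('exp', 1/8), (0, 2): ('exp', 10/3),
          (1, 3): ('ig', 5, 4/5), (2, 4): ('ig', 2, 4),
          (3, 1): ('gamma', 5, 4), (4, 2): ('gamma', 7, 3),
          (3, 5): ('exp', 2), (4, 5): ('exp', 1/4)}

def next_state(state):
    if state == 0: return 1 if rng.uniform() < p01 else 2
    if state == 3: return 1 if rng.uniform() < p31 else 5
    if state == 4: return 2 if rng.uniform() < p42 else 5
    return {1: 3, 2: 4}[state]          # states 1 and 2 each have a single exit

def simulate_history():
    state, history = 0, []
    while state != 5:
        nxt = next_state(state)
        history.append((state, nxt, draw(*branch[(state, nxt)])))
        state = nxt
    return history

print(simulate_history())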
The complete data likelihood function is
L_C(θ) = Π_{i=1}^{n01} p01 f01(x_i01; θ01) · Π_{i=1}^{n02} (1 - p01) f02(x_i02; θ02) · Π_{j=1}^{n*_0} {1 - [p01 F01(x*_j0) + (1 - p01) F02(x*_j0)]}
× Π_{i=1}^{n13} f13(x_i13; θ13) · Π_{j=1}^{n*_1} [1 - F13(x*_j1)] · Π_{i=1}^{n24} f24(x_i24; θ24)
× Π_{i=1}^{n31} p31 f31(x_i31; θ31) · Π_{i=1}^{n35} (1 - p31) f35(x_i35; θ35) · Π_{j=1}^{n*_3} {1 - [p31 F31(x*_j3) + (1 - p31) F35(x*_j3)]}
× Π_{i=1}^{n42} p42 f42(x_i42; θ42) · Π_{i=1}^{n45} (1 - p42) f45(x_i45; θ45),    (8)
where n01 = 45, n02 = 42, n*_0 = 3, n13 = 47, n*_1 = 12, n24 = 70, n31 = 14, n35 = 42, n*_3 = 1, n42 = 38, and n45 = 32. The notation used in (8) is summarized in Table 18.1.
Table 18.1. Notation used in (8)

n_gh    Number of uncensored transitions from g → h
x_igh    ith uncensored observation from g → h
n*_g    Number of observations censored in state g
x*_jg    jth observation censored in state g
p_gh    Transition probability from g → h
f_gh(·)    Density function of the waiting time g → h
F_gh(·)    CDF of the waiting time g → h
θ_gh    Parameter vector for the g → h branch model
1 - F_gh(·)    Contribution of an observation censored in state g when transition is only possible to state h
1 - [p_gh F_gh(·) + (1 - p_gh) F_gk(·)]    Contribution of an observation censored in state g when transition is possible to state h or state k
The incomplete data likelihood for 0 → 3 is constructed using the MGF M03(s)/M03(0). Note that this MGF is normalized by its value at 0 because it is conditional on taking the path 0 → 3. The relevant MGF is

M03(s)/M03(0) = [ p01 M01(s) M13(s) / (1 - p31 M13(s) M31(s)) ] × [ (1 - p31) / p01 ].    (9)
The MGF for the 2 → 5 incomplete data likelihood is just M25(s), since M25(0) = 1:

M25(s) = (1 - p42) M24(s) M45(s) / (1 - p42 M24(s) M42(s)).    (10)
The MGFs (9) and (10) are used with the saddlepoint approximation to construct the incomplete data likelihood of ( 6 ) and used to give the second term in (7). The resulting Bayes predictive density (5) and hazard functions computed using slice sampling to obtain posterior samples with the constructed likelihood of (7) are given in Fig. 18.2. The first 1,000 samples were discarded and 15,000 samples were used for the analysis. Table 18.2 gives
Fig. 18.2. Density and hazard function for manufacturing system. Pointwise 95% confidence bands (dashed lines) for the density and hazard functions are essentially indistinguishable from the estimate.
the quantile values for the predictive density estimate. The column labeled “Complete Sample Estimate” refers to the estimate based on the full sample of 15,000 iterates. The column labeled “Batch Mean” refers to the estimate calculated from the mean of 30 batch estimates, each based on 500 samples. The confidence intervals on the hazard function and the quantiles apply variance estimates calculated from the batch method, while everything else uses variance estimates based on “effective sample size” due to Carlin and Louis.²⁷ For a description of these methods for constructed likelihoods, see Williams and Huzurbazar.³ The slight bump in the density is reflective of the manner in which the
Table 18.2. Quantile values for predictive density estimate.

Quantile   Lower Bound   Batch Mean   Complete Sample Estimate   Upper Bound
0.25       5.753893      5.762913     5.762881                   5.771934
0.50       10.33510      10.34666     10.34661                   10.35821
0.75       16.87805      16.89889     16.89877                   16.91974
0.90       24.79846      24.83618     24.83612                   24.87390
0.95       30.57338      30.62560     30.62578                   30.67782
0.99       43.69914      43.79044     43.79202                   43.88174
finite mixture and convolution distributions were chosen. Specifically, there were some “fast” times through 0 → 2 → 4 → 5. This is also reflected in the hazard. Flowgraph modeling focuses on modeling the observed state-to-state waiting times and then converting the resulting overall distribution to a hazard. This approach has an advantage over assuming an a priori model for the hazard function in that only observable quantities are modeled directly.
18.5. Conclusion

Flowgraph models are useful for modeling data arising from complex multistate systems that can be represented by semi-Markov processes. Such data can be censored, incomplete, or unrecognizably incomplete. The case of complete data, which includes censoring, is the most straightforward. In the case of incomplete data, inference becomes more challenging and likelihoods must be constructed. Incomplete data often provide information relevant to the likelihood in addition to the complete data, and thus methods for incorporating such data into the likelihood are necessary in practice. Future work concerns developing analogous methods for unrecognizably incomplete data. For example, in Fig. 18.1, we may have an unrecognizably incomplete transition such as 0 → 2 → 4. In this case, we could create a type of count variable for the loop and allow it to be taken a finite number of times. This would contribute to the appropriate constructed likelihood. Flowgraph models with slice sampling provide a novel method for inference that has performed effectively in situations where data are incomplete. The method can be used with flowgraphs having arbitrarily large numbers of parameters, at the cost of additional computational complexity. Future work will consider analysis of flowgraph data using random walk Metropolis MCMC with automatic step size selection as in Graves.²² This approach has the benefit of automatically tuning multiple parameter updates, which can improve mixing in some applications.
References
1. M. Hamada, H. Martz, C. S. Reese, T. Graves, V. Johnson, and A. Wilson, A fully Bayesian approach for combining multilevel failure information in fault tree quantification and optimal follow-up resource quantification, Reliability Engineering and System Safety 86, 297-305 (2004).
2. V. Johnson, T. Graves, M. Hamada, and C. S. Reese, A hierarchical model for estimating the reliability of complex systems (with discussion), Bayesian Statistics 7. Edited by Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Smith, A., and West, M. Oxford: Oxford University Press, 199-213 (2003).
3. B. J. Williams and A. V. Huzurbazar, Posterior sampling with constructed likelihood functions, Los Alamos National Laboratory Technical Report LA-UR-04-6833 (2004).
4. O. O. Aalen, Nonparametric inference for a family of counting processes, Annals of Statistics 6, 701-726 (1978).
5. P. K. Andersen and N. Keiding, Multi-state models for event history analysis, Statistical Methods for Medical Research 11, 91-115 (2002).
6. P. Hougaard, Multi-state models: a review, Lifetime Data Analysis 5, 239-264 (1999).
7. D. R. Cox, Regression models and life tables (with discussion), Journal of the Royal Statistical Society Series B 34, 187-220 (1972).
8. W. R. Allen, A note on conditional probability of failure when hazards are proportional, Operations Research 11, 658-659 (1963).
9. A. V. Huzurbazar, Flowgraph models: A Bayesian case study in construction engineering, Journal of Statistical Planning and Inference 129, 181-193 (2005).
10. S. J. Mason, Feedback theory - Some properties of signal flow graphs, Proceedings of the Institute of Radio Engineers 41, 1144-1156 (1953).
11. R. Dorf and R. Bishop, Modern control systems, Reading, Mass.: Addison-Wesley (1995).
12. Z. Gajic and M. Lelic, Modern control systems engineering, New York: Prentice Hall (1996).
13. A. V. Huzurbazar, Flowgraph Models for Multistate Time-to-Event Data, New York: Wiley (2005).
14. H. Daniels, Saddlepoint approximations in statistics, Annals of Mathematical Statistics 25, 631-650 (1954).
15. J. L. Jensen, Saddlepoint Approximations, New York: Clarendon Press (1995).
16. J. E. Kolassa, Series Approximation Methods in Statistics, New York: Springer-Verlag (1997).
17. N. Reid, Asymptotics and the theory of inference, Annals of Statistics 31, 1695-1731 (2003).
18. H. Bohman, Numerical inversions of characteristic functions, Scandinavian Actuarial Journal, 121-124 (1975).
19. L. A. Waller, B. W. Turnbull, and J. M. Hardin, Obtaining distribution functions by numerical inversion of characteristic functions with applications, American Statistician 49, 346-350 (1995).
20. C. Zhang, Fourier methods for estimating mixing densities and distributions, Annals of Statistics 18, 806-831 (1990).
21. R. Neal, Slice sampling, Annals of Statistics 31, 705-767 (2003).
22. T. Graves, Automatic step size selection in random walk Metropolis algorithms, Los Alamos National Laboratory Technical Report LA-UR-05-2359 (2005).
23. S. Karlin and H. Taylor, A First Course in Stochastic Processes, New York: Academic Press (1975).
24. N. Limnios and G. Oprisan, Semi-Markov Processes and Reliability, Boston: Birkhauser (2001).
25. O. Barnett and A. Cohen, The histogram and boxplot for the display of lifetime data, Journal of Computational and Graphical Statistics 9, 759-778 (2000).
26. A. V. Huzurbazar, A censored data histogram, Communications in Statistics: Simulation and Computation 34(1), 113-120 (2005).
27. B. Carlin and T. Louis, Bayes and Empirical Bayes Methods for Data Analysis, 2nd edition, Boca Raton: Chapman & Hall (2000).
CHAPTER 19 INTERPRETATION OF INSPECTION DATA EMANATING FROM EQUIPMENT CONDITION MONITORING TOOLS: METHOD AND SOFTWARE
ANDREW K. S. JARDINE
CBM Laboratory, Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada
E-mail: [email protected]
DRAGAN BANJEVIC
CBM Laboratory, Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada
E-mail: [email protected]

This chapter focuses on current industry-driven research that blends together risk estimation, using a PHM, with economic considerations to establish optimal condition-based maintenance decisions. Recent results of the research program are described, including development of the EXAKT software and its successful application to the interpretation of inspection data emanating from vibration monitoring in the food processing industry, oil analysis in an open pit mine, seal failure data in a nuclear generating station, and vibration signal data from gearboxes subject to tooth failure.
19.1. Introduction

Much research and product development in the area of condition-based maintenance (CBM) focuses on data acquisition and signal processing. However, the focus of this chapter of the book is to examine what might be thought of as the final step in the CBM process: optimizing the decision-making step. Jardine¹ provides an overview of the following procedures being used to
assist organizations to make smart CBM decisions: physics of failure; trending; expert systems; neural networks; and optimization models. Possibly the most common approach to understanding the health of equipment is through trending various measurements and comparing them to specified standards, as illustrated in Fig. 19.1, where measurements of iron deposits in an oil sample are plotted on the Y-axis and compared to warning and alarm limits. The maintenance professional then takes remedial action when deemed appropriate. Many software vendors addressing the needs of maintenance have packages available to assist in interpreting condition monitoring (CM) measurements, with the goal of predicting failure through trending. This method usually works well if the variables used for trending (in engineering sometimes called features, or parameters) are directly linked to the physics of failure, or used to define the failure, such as the total amount of wear, thickness, sometimes the level of vibration, and the amount of cracks (e.g. the total length, or total area with cracks, on turbine blades). It is also necessary that features show a clear trend (increasing or decreasing) with relatively small variations in successive values around the trend line, so that it is possible to give reliable predictions (using extrapolation) of oncoming failures (that is, of crossing alarm and/or failure levels). But in many other cases, when the variables are only indirectly linked to failures, and/or “contaminated” by maintenance (temporarily reduced by certain minor maintenance actions, such as an oil change), and also prone to large sampling errors, the trending method is not very useful. A typical example (from our experience) of this situation is oil analysis, in which simple trending may not be very successful. A consequence observed when the trending approach is undertaken is that the maintenance professional is often too conservative in interpreting the measurements. In work undertaken by Anderson et al.² on using PHM (the proportional-hazards model, which will be discussed in some detail in Sec. 19.2), it was observed that 50% of aircraft engines that were removed before their design life, on the basis of information obtained from sampling of engine oil, were identified by the engine manufacturer to still be in a state that would have enabled them to remain on the 4-engined aircraft. Christer³ has made the same point, where he reported that since condition monitoring of gearboxes had been introduced, gearbox failures within an organization had fallen by 90%. As Christer said: “This is a notable accolade for CM.” Christer also reported that it transpired that when reconditioning “defective” gearboxes, in 50% of occasions there was no evident gearbox fault. He then concluded: “Seemingly, CM can be at the same time very
Fig. 19.1. A classical approach to condition monitoring.
effective, and rather inefficient.” Clearly there is a need to focus attention on the optimization of condition monitoring procedures. In this chapter we will present an approach that estimates the hazard (risk of failure), where the hazard depends on the age of the equipment and condition monitoring data, using a PHM. We will then examine the optimization of the CM decisions by blending in with the hazard calculation the economic consequences of both preventive maintenance, including complete replacement, and equipment failure.

19.2. The Proportional Hazards Model
Cox,⁴ in the section "Dependence of failures on wear" of his well-known book, introduced the idea of deciding on the replacement time of an item (component) based not only on its age, but also on its physical properties, which may be called, by Cox, "wear." If Z(z) (in Cox's notation Z_z), multidimensional, is the value of the wear of the component at age z, Cox defines "the age wear-specific failure rate" as

h(z, x) = lim_{Δz→0+} P(z < X ≤ z + Δz | X > z, Z(z) = x) / Δz,    (1)

(in Cox's notation h(z, x) = φ(z, x)). As Cox states: "When {Z_z} is a stochastic process of specific structure and the function φ(z, x) is given, we have a probabilistic model of wear and failure." He then considers some special cases of the "wear" process (including the case when wear is a perfect predictor of failure, as already discussed above). In the section "Strategies involving wear" of the book, Cox⁴ discusses some replacement strategies that
depend on the level of wear. This idea seemed to be a real novelty at that time, since Cox provided only two references to this method. Gertsbakh⁵ proposed a similar idea and applied it in maintenance, but without reference to Cox. The idea was also included in the book by Gertsbakh,⁶ which was a translation of the 1969 Russian edition, so the idea should not be unknown to the reliability community. Still, the idea didn't get much attention until Cox⁷ introduced the proportional-hazards model approach to estimate the risk of failure of an item, taking into account concomitant information, by specifying the form of the function φ(z, x) in more detail. Even then, it was mainly applied in lifetime data analysis in the biomedical field, and only in the early 1980s in reliability (Anderson et al.²). Cox⁸ also introduced the proportional-intensity model (an equivalent of PHM for recurrent events, such as for repairable systems), directly suggesting applications in reliability, but this paper remained almost ignored for some years. This situation might be explained by the fact that condition monitoring was not extensively used in industry at that time (as a new and often expensive technique) and/or the inclination of practitioners to use simple age-based mathematical models. It might have happened at that time also, as Ushakov⁹ stated regarding the problem of preventive maintenance, "due to a lack of consistent field data." Nowadays "consistent field data" is much more available (still not always in a satisfactory state). The work reported in this chapter is a direct outcome of Cox's insight that PHM can be applied in the reliability field. In particular we demonstrate its power in assisting reliability and maintenance professionals to efficiently interpret the signals they obtain when using condition monitoring, where the objective is to obtain the maximum useful life, along with economic considerations, from each physical asset before taking it out of service for preventive maintenance. Cox⁷ introduces PHM so that the conditional hazard rate of a unit is the product of a baseline hazard rate and a functional term,

h(t | Z(t)) = h₀(t) ψ(γ · Z(t)),    (2)
with h₀(t) being the baseline hazard rate depending on time only and ψ(γ · Z(t)) being an adjusting functional term that models the effect of the particular unit characteristics, with γ a vector of regression coefficients and Z(t) a vector of concomitant factors (covariates) that may include condition monitoring data and other useful available information about the unit collected up to time t. Covariates can also be diagnostic variables that describe the state of the unit. If the baseline hazard is left unspecified, the model in (2) is usually called the semiparametric Cox PHM. The
functional term may have various forms, the most common being the exponential, ψ(γ · Z(t)) = exp(γ′ · Z(t)). Cox⁷ suggested a method, which he called conditional likelihood and later partial likelihood, for estimating the parameters of the functional term without specifying a form of the baseline hazard term. This makes the semiparametric model very flexible, but it then requires nonparametric estimation of the baseline hazard term. For additional details about the semiparametric model and earlier references and applications, see Kumar and Klefsjö.¹⁰ The PHM is now commonly used in reliability. For some more recent applications, see e.g. Gasmi et al.,¹¹ Krivtsov et al.,¹² Vlok et al.,¹³ Le Gat and Eisenbeis,¹⁴ Percy et al.,¹⁵ Kobbacy et al.,¹⁶ Luxhoj and Shyur,¹⁷ and Kumar and Westberg.¹⁸ There is a tendency in reliability applications to use the parametric model, in which the baseline hazard h₀(t) has a specified functional form, up to unknown parameters that should be estimated from the data, along with the regression parameter γ. A common model is a "Weibull PHM" in which
the baseline hazard is of the power-law form, that is, h₀(t) = (β/η)(t/η)^{β-1}, a model used in the case studies described in this chapter. Information that may be included in the PHM can come from various sources. In practice, covariates may be measurements obtained from condition monitoring (and then usually internal, time-dependent variables, such as from oil or vibration analysis, or mounted sensors that provide various types of information, e.g. from diesel engines, such as oil pressure and temperature, crankcase pressure, exhaust gas temperature and pressure, etc.), or from working conditions (external variables, such as environment, humidity, work-load, temperature, rotating speed), or event-specific variables (number of startups, number of previous preventive maintenances (PMs) and/or failures, accumulated downtime, servicing frequency, maintenance strategy, total money spent on repair and maintenance actions). Time variables such as global and local working age, and often the number of previous failures/PMs, are typically not considered as covariates, and they are preferably included in the baseline intensity, which provides more flexibility. It should be noted that "time variables" should be interpreted as appropriate "working age" variables of the unit, such as the working hours, mileage, total accumulated work load, etc., usually not the same as calendar time. The "working age" may be adjusted to an "effective age" that may, for example, incorporate maintenance interventions, or changes in "load" that affect aging, when the load is not considered as a covariate. A classical method to incorporate effective age is to use the virtual age concept, as defined by Kijima.¹⁹ An overview of differ-
ent methods to incorporate usage factors into survival models can be found in Duchesne and Lawless,²⁰,²¹ and references therein related to reliability. See also Lugtigheid et al.²² and Peña and Slate²³ in this volume. A main goal of applying PHM in reliability is to combine condition monitoring data and the working history of a unit with its age for optimizing maintenance/replacement decisions. The typical intention of maintenance professionals, as they usually state, is to "prevent failures," but also to run the unit as long as possible, without false alarms. These two requirements are obviously contradictory, unless the failures can be predicted accurately, because the first one requires that more money is spent on maintenance and preventive repairs/replacements, and the second one increases the risk of failures. If cost of operation is the criterion for decisions, an optimal balance should be found between the cost of maintenance and the cost consequence of failures. An obvious extension of the classical age replacement policy (replace/repair the unit at a predetermined fixed age, or at failure, whichever comes first) is the policy to replace the unit when covariates reach some predetermined "alarm" values, where the alarm values may depend on the current age of the unit (so, "age specific" levels), and at failure. If the hazard rate function has an increasing trend, the optimal (or near-to-optimal) policy has a very simple form: replace/repair the unit when the hazard rate (not particular covariates!) reaches a predetermined (fixed) risk level (Aven and Bergman²⁴). It simply means that the hazard rate function captures all information contained in the data needed for optimal decisions. Obviously, it also means that in real applications of this result the hazard rate function should be properly established. This will be a main task in creating optimal decision policies, as will be seen from the reported case studies. Makis and Jardine²⁵ developed a methodology based on stochastic dynamic programming to optimize the risk and economic trade-off, i.e. to calculate the optimal threshold "risk level." This approach is illustrated in Fig. 19.2, where it is assumed that as the equipment ages there is an increasing risk associated with a unit failing in the next moment in time. The reliability specialist can select a risk cut-off level at which to intervene and replace the deteriorating equipment. Each possible risk level has an associated expected cost made up of the cost of preventive replacement and the cost consequence of an equipment failure. The key is to identify the optimal risk level d* at which the equipment should be replaced. Thus the plan is to monitor risk, using a PHM, and once the risk hits the specified value d*, then it is optimal to perform a preventive replacement. It is
Fig. 19.2. Risk level and optimal policy.
important to understand that the suggested moment of preventive replacement is not at a fixed age (as it is in age-based models), but depends on the particular “hazard history” influenced by the measurements and other information included in the hazard rate function. In typical situations, if covariates show “nice” behavior, the preventive replacement will be postponed, but eventually the age will become the most dominant factor in the risk of failure, and the unit will be replaced. Of course, if the equipment fails it has to be replaced under failure conditions. There exist cases (not uncommon) in which the hazard rate function does not directly depend on the age of the unit, but only on the condition monitoring variables, in which case the policy may be reduced to a situation similar to the “classical approach,” illustrated in Fig. 19.1, with fixed alarm limits on covariates. It should be noted that the “replacement” may not be a physical replacement of the unit (which may be typical for certain components), but also an appropriate major maintenance/repair (more typical for systems and larger components) that returns the unit to, practically, “as-good-as-new” conditions. If this is not the case, the methodology still could be applied by including certain measures of repair in the hazard function, or flexible “initial” conditions that affect the baseline hazard, as discussed above. We will not consider this more complicated situation in greater detail in the following, but we will mention it in the last section of this chapter, about future research plans.
Technically, the optimal risk level d* is calculated to minimize the expected average cost per unit time of the decision policy, Φ(d), as a function of the risk level d, which is given by

Φ(d) = [C (1 − Q(d)) + (C + K) Q(d)] / W(d),   (3)
where C is the preventive replacement cost and C + K the failure replacement cost. Q(d) represents the probability that failure will occur before the hazard reaches the level d, so that the replacement will be at failure. Then, 1 − Q(d) is the probability that the hazard will reach the level d before the failure occurs, and then the replacement will be preventive. W(d) is the expected time until replacement, either preventive or at failure. For details of the calculation, see Banjevic et al.26 The costs C and C + K in practice are almost never constant, but the average values can be used, unless the variations in the costs are very high. In the latter case it may appear that significantly different failure modes and/or failure consequences exist, which would require another approach to the modeling of the hazard rate function and the decision procedure. This is also another research topic, quite significant for applications.
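The following sketch shows how d* could be located numerically once Q(d) and W(d) are available. The functions Q and W below are made-up placeholders; in practice they come from the fitted PHM and the covariate process model, and the grid search is only one of several possible optimization strategies.

```python
import numpy as np

def expected_cost_rate(d, C, K, Q, W):
    """Phi(d) = [C*(1 - Q(d)) + (C + K)*Q(d)] / W(d): long-run cost per unit
    time of the control-limit policy 'replace when the hazard reaches d'."""
    return (C * (1.0 - Q(d)) + (C + K) * Q(d)) / W(d)

def optimal_risk_level(C, K, Q, W, d_grid):
    """Grid search for the risk level d* minimizing Phi(d)."""
    costs = [expected_cost_rate(d, C, K, Q, W) for d in d_grid]
    i = int(np.argmin(costs))
    return d_grid[i], costs[i]

# Illustrative (made-up) Q and W, both increasing in d.
Q = lambda d: d / (d + 0.002)            # P(failure before hazard reaches d)
W = lambda d: 8000.0 * d / (d + 0.001)   # expected time until replacement
d_star, cost = optimal_risk_level(C=1.0, K=9.0, Q=Q, W=W,
                                  d_grid=np.linspace(1e-4, 0.02, 200))
```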
19.3. Managing Risk: A CBM Optimization Tool
Subsequent to the Makis and Jardine25 paper a research group was formed to take the theory and make it work in practice. To achieve this goal a number of companies supported in 1995 the creation of the Condition-Based Maintenance Laboratory at the University of Toronto (CBM Lab) (www.mie.utoronto.ca/cbm). As a consequence, the EXAKT software was developed, along with addressing challenging theoretical issues that resulted from close collaboration with the consortium members (numbering 11 in 2005). There are a few main components to EXAKT (see Banjevic et al.26):
(1) Creation of a convenient and flexible database by extracting the event (maintenance) and condition (inspection - condition monitoring) information from external databases.
(2) Data checking, analysis and preprocessing (including transformations), using graphical and statistical tools.
(3) Estimating and testing parameters of the PHM and Markov process probabilistic model for covariates.
(4) Computation, analysis, validation and saving of the optimal replacement policy (decision model).
(5) Using an already saved decision model to make decisions for current inspection records stored in the database. The decision recommendation, and some other useful information, such as the expected remaining life (in the case of a “don't replace” recommendation), is stored in the database for the user's convenience.
Fig. 19.3. Inspections records and events data.
Figure 19.3 is a screen shot of condition monitoring data that is obtained from equipment that is monitored through oil analysis, combined with event data that provides information about the timing of equipment installations (B), removals (whether preventive and so treated as a suspension time, or due to failure) and any maintenance action that is recorded, such as oil changes (OC) (an illustration for points 1 and 2 above). Both types of data are then used to build the PHM for the hazard function. Combining the inspection records and the event data enables the following Weibull PHM to be obtained:
h(t, z) = (5.007/38,988)(t/38,988)^4.007 exp(0.2626 × Iron + 1.0522 × Lead),   (4)
where Iron and Lead are the values obtained from the measurement at time t, and z = (Iron, Lead) (an illustration for point 3 above). Incorporating cost data then enables the optimal decision chart to be obtained from the PHM, as shown in Fig. 19.4 (an illustration for points 4 and 5 above). The decision chart actually presents a more convenient equivalent form of the stopping decision rule h(t, z) ≥ d*, in the form γ'z ≥ A* − (β − 1) ln(t), which
Fig. 19.4. Optimal decision chart.
can be easily obtained by taking the logarithm of both sides of the first form, once the optimal threshold level d* is calculated. γ'z is referred to as the “composite covariate” in the following. The Weibull PHM for the hazard function is built by maximizing the full likelihood. The important (significant) covariates are selected by combining the backward and forward selection procedures. The same method is used in all case studies presented in Sec. 19.4. Further details on the theory and its application to the CBM optimization work can be found in Banjevic et al.26 and Vlok et al.13
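A small sketch of this threshold check follows. The closed form of A* is derived here under the power-law baseline assumed above; the function is purely illustrative and not any software's actual implementation.

```python
import numpy as np

def replace_now(t, z, gamma, beta, eta, d_star):
    """Check the stopping rule h(t, z) >= d* in its 'decision chart' form:
    the composite covariate gamma.z is compared with A* - (beta - 1) ln t,
    where A* = ln(d* * eta**beta / beta) for a power-law baseline."""
    composite = float(np.dot(gamma, z))
    A_star = np.log(d_star * eta ** beta / beta)
    return composite >= A_star - (beta - 1.0) * np.log(t)
```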
19.4. Case Study Papers
Readers interested in further applications of the CBM optimization work can find details about pilot studies undertaken in the food processing industry in Jardine et al.,27 open pit mining in Jardine et al.,28 nuclear power generation in Jardine et al.,29 and in testing gearbox tooth failures in Lin et al.30 A summary of these studies follows.
19.4.1. Food processing: use of vibration monitoring
A company undertook regular vibration monitoring of a critical shear pump bearings. At each inspection, 21 measurements were provided by an accelerometer. Using the theory described in the previous section, and its
embedding in the software EXAKT, it was established that of the 21 measurements there were 3 key (significant) vibration measurements: Velocity in the axial direction in both the first band width and the second band width, and velocity in the vertical direction in the first band width. In the plant the consequence of a bearing failure was 9.5 times greater than replacing the bearing on a preventive basis. After calculating the optimal hazard level, as explained in Sec. 19.3, and comparing it with the data, it was clear that through following the decision optimization approach total cost due to maintenance and failures could be reduced by 35%. The Optimal Decision Chart used for interpreting the appropriate action to take at inspection of the bearing is shown in Fig. 19.5.
Fig. 19.5. Shear pump bearing decision chart.
19.4.2. Coal mining: use of oil analysis
Electric wheel motors on a fleet of haul trucks in an open-pit mining operation were subject to oil sampling on a regular basis. Twelve measurements resulted from each inspection and compared to warning and action limits to decide whether or not the wheel motor should be preventively removed. These measurements were: Al, Cr, Ca, Fe, Ni, Ti, Pb, Si, Sn, Visc 40, Visc 100 and Sediment. Applying a PHM to the data set it was identified that there were only
two key risk measurements, correlated to the cause of the wheel motor failure being monitored by oil analysis; these measurement were iron (Fe) and Sediment. The cost consequence of a wheel motor failure was estimates three times greater than when replacing it preventively. The economic advantage of following the optimal replacement strategy would be a cost reduction of 22%. The Optimal Decision Chart used for interpreting the appropriate action to take at inspection of the wheel motor is shown in Fig. 19.6. "
[Decision chart: “Don't replace” region; working age axis 0-20,000 hr (example shown at 11,587 hr); composite covariate Z = 0.002742 × Fe + 5.3955e-005 × CorrSed.]
Fig. 19.6. Electric wheel motor decision chart.
19.4.3. Nuclear generating station
Hydro-dyne seals perform an important function in CANDU type reactors. During refueling, they seal the connection between the fueling machine and the end fitting of the reactor fuel channels, preventing the leakage of heavy water from the reactor. At a nuclear plant, 4 reactors employ 8 hydro-dyne sealed refueling machines. Keeping track of the condition of each seal poses a particular challenge to site engineers, and safety and environmental considerations have prompted a 100% inspection regimen. That is, prior to each use, after having positioned the refueling machine adjacent to the target reactor tube, the operators check for leakage by measuring the pressure drop in the seal
over a fixed time interval. If the seal fails the test, the operators abort the intended operation and perform the required maintenance. As a consequence of failure, the plant will have incurred the costs of setting up for the canceled refueling operation, some loss in energy efficiency due to the delay in refueling, and some organizational disruption due to lack of planning of the resulting emergency maintenance work in addition to the direct cost of seal replacement. Engineers desire to perform timely preventive seal replacements in an attempt to minimize the overall costs due to failure and maintenance. After applying a PHM to the inspection and event data (the leak rate and records of working cycles) from the refueling operations at the nuclear plant over an 11-year period, the optimal decision chart was created. The leak rate shows clear increasing trend, but with large variations in subsequent measurements, an example of a “feature” directly linked to the definition of the failure, as mentioned in the introduction. Also, the “alarm” level curve is almost flat (parameter ,8 close to one) because the leak rate “dominates” the working age. The Optimal Decision Chart used for interpreting the appropriate action to take at inspection of the hydro-dyne seals is shown in Fig. 19.7.
Fig. 19.7. Hydro-dyne seal decision chart.
19.4.4. Gearbox subject to tooth failure
Gearboxes are run to failure on the mechanical diagnostic test-bed. The condition vibration signal was collected and processed to calculate the “fault growth parameter” (FGP) and its revised version FGPl from the residual error signal. Some other variables were also calculated and all were used to calculate the risk of gear-box tooth failure using a PHM. The initial torque was increased or alternated differently for different pumps. The examined failure mode was “gear tooth fracture.” Since the gearboxes operated under varying loads, it was not appropriate to use simple “calendar time” as a working age variable (see a comment in Sec. 19.2). Instead, the integral of the product of actual running time and instantaneous torque was used as a measure of the working age. Several models were tested and the variables FGPl and RFM (mean of the power spectrum of the residual error signal) appeared to be the best to predict oncoming gear tooth failure. Using an estimate of the cost ratio of the failure replacement to the preventive replacement of 5:1, the model with FGPl showed decision results similar to model with the both variables. The Optimal Decision Chart used for preventing the gear-tooth failure is shown in Fig. 19.8.
[Decision chart: working age axis in in-lb-days (example shown at 3470.22); composite covariate Z = 0.38843 × FGP1.]
Fig. 19.8. Gearbox tooth decision chart.
19.5. Future Research Plans
Some of the research topics that are in future plans, or are already in development in the CBM Lab, are: relaxation of a common assumption in the literature that at the maintenance intervention the item is returned to the statistically as-good-as-new condition; inclusion of different failure modes or multicomponent systems in the model; variable costs at failure and preventive maintenance; inclusion of the cost of the condition monitoring in the decision optimization model and planning of the next inspection; calculation of the remaining useful life (Banjevic and Jardine31); and modeling of complex repairable systems (Lugtigheid et al.22). Of great importance also are the problems related to the statistical estimation of model parameters from incomplete or sparse data, particularly with a small number of failures, such as for a new type of equipment. In that area, combining expert knowledge with Bayesian methods is the main approach.
References
1. A. K. S. Jardine, Optimizing condition based maintenance decisions, Annual Reliability and Maintainability Symposium, Seattle, USA, 90 (2002).
2. M. Anderson, A. K. S. Jardine, and R. T. Higgins, The use of concomitant variables in reliability estimation, Modeling and Simulation 13, 73 (1982).
3. A. H. Christer, Developments in delay time analysis for modelling plant maintenance, Journal of the Operational Research Society 50, 1120 (1999).
4. D. R. Cox, Renewal Theory, Methuen, London (1962).
5. I. B. Gertsbakh, Engineering Cybernetics 1, 54 (1967).
6. I. B. Gertsbakh, Models of Preventive Maintenance, North-Holland, Amsterdam (1977).
7. D. R. Cox, Regression models and life-tables, Journal of the Royal Statistical Society B34, 187 (1972).
8. D. R. Cox, The statistical analysis of dependencies in point processes, in Stochastic Point Processes, Ed. P. A. W. Lewis (Wiley, New York, 1972), p. 55.
9. I. Ushakov, Reliability: past, present, future, in Recent Advances in Reliability Theory, Ed. N. Limnios and M. Nikulin (Birkhauser, Boston, 2000), p. 3.
10. D. Kumar and B. Klefsjo, Reliability Engineering and System Safety 44, 177 (1993).
11. S. Gasmi, C. Love, and W. Kahle, A general repair, proportional-hazards, framework to model complex repairable systems, IEEE Transactions on Reliability 52, 26 (2003).
12. V. V. Krivtsov, D. E. Tananko, and T. P. Davis, Regression approach to tire reliability analysis, Reliability Engineering and System Safety 78, 267 (2002).
13. P. J. Vlok, J. L. Coetzee, D. Banjevic, A. K. S. Jardine, and V. Makis, Optimal component replacement decisions using vibration monitoring and the proportional-hazards model, Journal of the Operational Research Society 53, 193 (2002).
14. Y. Le Gat and P. Eisenbeis, Using maintenance records to forecast failures in water networks, Urban Water 2, 173 (2000).
15. D. F. Percy, K. A. H. Kobbacy, and H. E. Ascher, Using proportional-intensities models to schedule preventive-maintenance intervals, IMA Journal of Mathematics Applied in Business and Industry 9, 289 (1998).
16. K. A. H. Kobbacy, B. B. Fawzi, D. F. Percy, and H. E. Ascher, A full history proportional hazards model for preventive maintenance scheduling, Quality and Reliability Engineering International 13, 187 (1997).
17. J. T. Luxhoj and H. Shyur, Comparison of proportional hazards models and neural networks for reliability estimation, Journal of Intelligent Manufacturing 8, 227 (1997).
18. D. Kumar and U. Westberg, Proportional hazards modeling of time-dependent covariates using linear regression: a case study, IEEE Transactions on Reliability 45, 386 (1996).
19. M. Kijima, Further results for dynamic scheduling of multiclass queues, J. Appl. Probab. 26, 89 (1989).
20. T. Duchesne and J. Lawless, Semiparametric inference methods for general time scale models, Lifetime Data Analysis 8, 263 (2002).
21. T. Duchesne and J. Lawless, Lifetime Data Analysis 6, 157 (2000).
22. D. Lugtigheid, D. Banjevic, and A. K. S. Jardine, Modelling repairable system reliability with explanatory variables and repair and maintenance actions, IMA Journal of Management Mathematics 15, 89 (2004).
23. E. A. Pena and E. H. Slate, Modern Statistical and Mathematical Methods in Reliability, Chap. 25, pp. 347-362 (2005).
24. T. Aven and B. Bergman, Optimal replacement times - a general set-up, Journal of Applied Probability 23, 432 (1986).
25. V. Makis and A. K. S. Jardine, Optimal replacement in the proportional hazards model, INFOR 30, 172 (1992).
26. D. Banjevic, A. K. S. Jardine, V. Makis, and M. Ennis, A control-limit policy and software for condition-based maintenance optimization, INFOR 39, 32 (2001).
27. A. K. S. Jardine, T. Joseph, and D. Banjevic, Optimizing condition-based maintenance decisions for equipment subject to vibration monitoring, Journal of Quality in Maintenance Engineering 5, 192 (1999).
28. A. K. S. Jardine, D. Banjevic, M. Wiseman, S. Buck, and T. Joseph, Journal of Quality in Maintenance Engineering 7, 286 (2001).
29. A. K. S. Jardine, K. Kahn, D. Banjevic, M. Wiseman, and D. Lin, Proceedings of COMADEM 2003, Sweden, 175 (2003).
30. D. Lin, M. Wiseman, D. Banjevic, and A. K. S. Jardine, An approach to signal processing and condition-based maintenance for gearboxes subject to tooth failure, Mech. Syst. Signal Proc. 18, 993 (2004).
31. D. Banjevic and A. K. S. Jardine, to appear in IMA Journal of Management Mathematics (2005).
CHAPTER 20 NONPROPORTIONAL SEMIPARAMETRIC REGRESSION MODELS FOR CENSORED DATA
ZHEZHEN JIN Department of Biostatistics Mailman School of Public Health Columbia University 722 West 168th Street New York, NY 10032 USA E-mail: [email protected]. edu In lifetime or censored data analysis, the accelerated failure time (AFT) model and the linear transformation models are attractive nonproportional semiparametric regression models. The AFT model is of the same form as the usual linear regression models for noncensored data, and the linear transformation models include the popular Cox proportional hazards model and proportional odds model as special cases. In this paper, we review recent development on the estimation and inference methods for these two types of models. The development provides effective and reliable means of estimating the regression parameters in the AFT and linear transformation models, which makes these models available for practitioners.
20.1. Introduction
Censored data are common in many fields, such as economics, business, industrial engineering and biomedical studies. The right-censored data usually consist of (X_i, Y_i, δ_i), i = 1, ..., n, where X is a p-dimensional covariate, Y = min{T, C}, with T being the response variable and C being the censoring variable, and δ = 1{T ≤ C} being the indicator of censoring. With the censored data (X_i, Y_i, δ_i), the Cox proportional hazards model is often used to examine the covariate effects.1 The Cox model is semiparametric and is easy to fit with many existing statistical software packages. Its asymptotic properties can be justified by elegant martingale theory.2 However,
the validity of this approach depends on the proportional hazards assumption, which can be violated in many important practical occasions. The Cox model is also difficult to interpret when the response variable is not related to the concept of “time to event.” A number of alternative semiparametric regression models to the Cox model have also been proposed and studied, particularly the accelerated failure time model,3-18 the proportional odds model,19-24 the linear transformation model,25-30 and the accelerated hazards model.31,32 These models do not require the proportional hazards assumption. The accelerated failure time model is of the same form as the usual linear regression model and is easy to interpret. The proportional odds model is particularly useful in modeling hazards converging with time in two or more groups. The linear transformation model includes the Cox proportional hazards model and the proportional odds model as special cases. The accelerated hazards model includes the accelerated failure time model and the Cox proportional hazards model as special cases. Unlike the Cox model, however, these models are seldom used in practice due to the lack of practically feasible estimation methods. Finding numerically feasible estimation methods is critical to the application of these models. We refer readers to Refs. 3, 33-35 for further discussion on these models. In this paper, we focus on and review recent new development in the estimation and inference for the accelerated failure time model17,18 and the linear transformation models.30 As mentioned above, these two types of models do not require the proportional hazards assumption. They are also applicable when the response variable is not related to the concept of “time to event.” The new development provides numerically effective and reliable means of estimating regression parameters in the two models. In Sec. 20.2, the accelerated failure time model and the linear transformation models are described and recent development on estimation and inference on these models is presented. Some concluding remarks are given in Sec. 20.3.
20.2. Models and Estimation
20.2.1. Accelerated failure time model
A way to study the covariate effects on the response is to use the semiparametric linear regression model that is of the form
T_i = X_i'β_0 + ε_i,   (1)
where β_0 is the unknown true p × 1 parameter of interest and ε_i (i = 1, ..., n) are unobservable independent random errors with a common but completely unspecified distribution function. (Thus, the mean of ε is not necessarily 0.) In survival analysis, in which the response is the logarithm transformation of the survival time, the model (1) is called the accelerated failure time (AFT) model. As usual, it is assumed that T_i and C_i are independent conditional on X_i. In the absence of censoring, there are many well-developed semiparametric estimation and inference methods for the model (1), particularly the rank-based approach and the least-squares approach. On the other hand, there are few practically useful estimation and inference methods for the model when the response T_i may not be observed accurately due to censoring. The presence of censoring creates a serious challenge for the semiparametric analysis of the model. Extensions of rank-based or least-squares approaches were proposed and studied.4-16 Despite theoretical advances, these methods have rarely been used in practice due to their computational difficulty.
20.2.1.1. Rank-based approach
In this section, we present the inference procedure developed recently by Jin, Lin, Wei, and Ying.17 Let e_i(β) = Y_i − X_i'β, i = 1, ..., n. The original rank-based approach uses an estimating function for β_0 that takes the form
U_φ(β) = Σ_{i=1}^n δ_i φ(β; e_i(β)) {X_i − S_1(β; e_i(β))/S_0(β; e_i(β))},   (2)

where S_1(β; t) = n^{-1} Σ_{k=1}^n X_k I(e_k(β) ≥ t), S_0(β; t) = n^{-1} Σ_{k=1}^n I(e_k(β) ≥ t), and φ is a weight function, see Refs. 4 and 5. For the
two-sample problem, in which X is a single dichotomous variable, U_φ(β) is the weighted logrank statistic. The weights φ = 1 and φ(β; t) = S_0(β; t) give the usual logrank test statistic and the Gehan test statistic, respectively. It is easy to observe that the rank-based estimating function U_φ(β) is neither differentiable nor continuous in terms of the regression parameters. Exact roots to the equation U_φ(β) = 0 seldom exist, and U_φ can also be nonmonotone and lead to multiple solutions. Consequently, the resulting estimators may not be well-defined, especially when the number of covariates is large. Furthermore, it is shown that the covariance matrices of the estimators involve nonparametric estimation of the underlying unknown density function of ε, which is rather difficult to obtain. In summary, although theoretically a consistent and asymptotically normal estimator of β_0 can be
obtained by solving the estimating equation U_φ(β) = 0, approaches along this line are numerically complicated and difficult to implement, both in point estimation and in covariance matrix estimation. The general weighted logrank estimating function (2) for β_0 can be reexpressed in counting process notation:

U_φ(β) = Σ_{i=1}^n ∫ φ(β; t) {X_i − X̄(β, t)} dN_i(β, t),   (3)

where φ(β; t) is a weight function, N_i(β, t) = δ_i I(e_i(β) ≤ t), and X̄(β, t) = S_1(β, t)/S_0(β, t). The key of the newly developed method of Jin et al.17 is the use of the following modified estimating function
U_M(β, β̂_1) = Σ_{i=1}^n δ_i ψ(β̂_1; e_i(β̂_1)) S_0(β; e_i(β)) {X_i − X̄(β, e_i(β))},   (4)

where ψ(β; t) = φ(β; t)/S_0(β; t) and β̂_1 is an initial consistent estimator of β_0. Note that U_M(·, β̂_1) is closely related to but not equivalent to the original estimating function U_φ(·). Let β̂_2 be the solution to U_M(β, β̂_1) = 0; then the new estimating function with β̂_2 as the initial value is U_M(β, β̂_2). At the kth step, let the solution to U_M(β, β̂_{k-1}) = 0 be denoted by β̂_k. It is important to note that the estimating function U_M(β, β̂_{k-1}) is monotone in β and is the gradient of the following objective function

G_ψ(β, β̂_{k-1}) = Σ_{i=1}^n Σ_{j=1}^n δ_i ψ(β̂_{k-1}; e_i(β̂_{k-1})) {e_i(β) − e_j(β)}^-,   (5)

where {a}^- = |a| I(a < 0) for a real number a. Therefore, β̂_k can be obtained by minimizing G_ψ(β, β̂_{k-1}). It is easy to see that G_ψ(β, β̂_{k-1}) is a convex function of β, and the minimization of G_ψ(β, β̂_{k-1}) can be implemented through the standard linear programming technique. If β̂_k converges, the limit is the solution of the original estimating equation U_φ(β) = 0 in (3). Under regularity conditions, it can be shown that each β̂_k, k = 2, ..., is consistent and asymptotically normal if the initial estimator β̂_1 is consistent and asymptotically normal.17 Thus, the approach provides a class of legitimate estimators. If G_ψ(β, β̂_k), k = 1, ..., is minimized repeatedly until two consecutive estimates are close enough, then the limit is the estimator for β_0 based on the original rank estimating function (2). A consistent and asymptotically normal initial estimator can be obtained by choosing the Gehan-type weight φ(β; t) = S_0(β; t) in (2). In this
case, as shown by Fygenson and Ritov,9 U_φ(β) is monotone in each component of β. It can be reexpressed as

U_G(β) = n^{-1} Σ_{i=1}^n Σ_{j=1}^n δ_i (X_i − X_j) I{e_i(β) ≤ e_j(β)}.   (6)

Let

G(β) = n^{-1} Σ_{i=1}^n Σ_{j=1}^n δ_i {e_i(β) − e_j(β)}^-.   (7)
Then (6) is the gradient of the convex function G(β) of β whenever it exists. Let β̂_G be the minimizer of G(β). Since G(β) is a convex function, β̂_G is well-defined for any sample size. Again, the minimization of G(β) can be carried out by linear programming techniques, i.e., minimizing the objective function Σ_{i,j} δ_i w_ij under the linear constraints w_ij ≥ e_j(β) − e_i(β) and w_ij ≥ 0, for i, j = 1, ..., n. It is also not difficult to see that minimizing G(β) is equivalent to minimizing

Σ_{i=1}^n Σ_{j=1}^n δ_i |e_i(β) − e_j(β)| + |Y_{n+1} − β' Σ_{k=1}^n Σ_{l=1}^n δ_k (X_l − X_k)|,   (8)
where Y_{n+1} is an extremely large number. This minimization can be implemented with the algorithm of Barrodale and Roberts36 and Koenker and D'Orey,37 which is available in statistical software packages, such as S-PLUS. The asymptotic variance of the resulting β̂_k, however, involves the unknown density and its derivative functions of ε, which are difficult to estimate in practice. To overcome this difficulty, Jin et al.17 also proposed a resampling method to estimate the variance of β̂_k by modifying the resampling approach of Jin, Ying, and Wei.38 Specifically, the objective function G_ψ(β, β̂_{k-1}) is perturbed by independent and identically distributed known positive random variables V_i (i = 1, 2, ..., n) satisfying E(V_i) = Var(V_i) = 1. That is,
G*_ψ(β, β*_{k-1}) = Σ_{i=1}^n Σ_{j=1}^n δ_i ψ(β*_{k-1}; e_i(β*_{k-1})) {e_i(β) − e_j(β)}^- V_i.   (9)

Note that G*_ψ(β, β*_{k-1}) is a convex function of β. Again, the minimization can be done through the standard linear programming technique. The initial value can be obtained by minimizing the perturbed Gehan-type objective function

G*(β) = Σ_{i=1}^n Σ_{j=1}^n δ_i {e_i(β) − e_j(β)}^- V_i.   (10)
Conditional on the observed data (Y_i, δ_i, X_i) (i = 1, ..., n), the {V_i} are the only random components in G*_ψ and G*. Let β*_k be the minimizer of G*_ψ(β, β*_{k-1}). Under regularity conditions, it can be shown that the distribution (conditional on the data) of n^{1/2}(β*_k − β̂_k) is asymptotically equivalent to that of n^{1/2}(β̂_k − β_0).17 Therefore, one may generate a large number of realizations of β*_k through {V_i} and use them to estimate the variance-covariance matrix of β̂_k. The perturbation random variable V_i can be generated from known distributions, such as the exponential distribution, beta distribution, and lognormal distribution, satisfying E(V_i) = Var(V_i) = 1. In summary, the use of the linear programming technique and the resampling method provides a practically useful method for estimation and inference on the AFT model. An S-PLUS program is available from the author for implementing the new procedure.
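To make the linear-programming step concrete, here is a small sketch of the Gehan-type initial estimation described above, written in Python with SciPy rather than S-PLUS. It solves the LP in (β, w_ij) directly for small samples and is meant only to illustrate the formulation, not as the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def gehan_lp(Y, delta, X):
    """Gehan-type rank estimate of beta via linear programming:
    minimise sum_{i,j} delta_i * w_ij subject to w_ij >= e_j(beta) - e_i(beta)
    and w_ij >= 0, where e_i(beta) = Y_i - X_i' beta.
    Sketch only; the number of LP variables grows as n**2."""
    n, p = X.shape
    pairs = [(i, j) for i in range(n) if delta[i] == 1
                    for j in range(n) if j != i]
    m = len(pairs)
    c = np.concatenate([np.zeros(p), np.ones(m)])        # objective coefficients
    A = np.zeros((m, p + m))
    b = np.zeros(m)
    for k, (i, j) in enumerate(pairs):
        A[k, :p] = -(X[j] - X[i])   # -(X_j - X_i)'beta - w_ij <= -(Y_j - Y_i)
        A[k, p + k] = -1.0
        b[k] = -(Y[j] - Y[i])
    bounds = [(None, None)] * p + [(0, None)] * m
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.x[:p]
```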
20.2.1.2. Least-squares approach
In the absence of censoring, the classical least-squares estimate β̂ of β_0 is defined as a solution vector to the minimization problem

Σ_{i=1}^n {e_i(β) − α}^2 = n ∫ (t − α)^2 dF_{n,β}(t),   (11)

with respect to (α, β'), where α_0 = Eε and F_{n,β}(·) is the empirical distribution constructed from {e_i(β)}. Equivalently, the least-squares estimate is the solution of the estimating equations

Σ_{i=1}^n (T_i − α − X_i'β) = 0,   Σ_{i=1}^n X_i (T_i − α − X_i'β) = 0.
In particular, the estimate of β_0 can be obtained by solving the estimating equation

Σ_{i=1}^n (X_i − X̄)(T_i − X_i'β) = 0,   (12)
where X̄ = n^{-1} Σ_{i=1}^n X_i. Miller11 extended the least-squares approach to the censored data by replacing the F_{n,β}(·) in (11) with the Kaplan-Meier estimate F̂_{n,β}(·) based on {e_i(β), δ_i}. As noted by Mauro39 and Lai and Ying,40 this approach often leads to an inconsistent estimate of β_0. Alternative extensions of the least-squares estimate to the censored data were proposed by modifying the estimating equation (12).
In particular, Buckley and James12 modified (12) and introduced an estimate of β_0 as a solution of the estimating equation S(β) = 0, where

S(β) = Σ_{i=1}^n (X_i − X̄){Ŷ_i*(β) − X_i'β}.   (13)

This is equivalent to replacing T_i in (12) with E[T_i | Y_i, δ_i], which is estimated by Ŷ_i*(β), where

Ŷ_i*(β) = δ_i Y_i + (1 − δ_i) { ∫_{e_i(β)}^∞ u dF̂_{n,β}(u) / [1 − F̂_{n,β}(e_i(β))] + X_i'β }.   (14)
The theoretical properties of this approach were investigated by many people, see Refs. 16 and 41. Despite theoretical advances, the method is seldom used in practice due to numerical complexity. Note that the estimating function (13) is neither monotone nor continuous and its roots may not exist.15 The numerical difficulty increases as the dimension of β increases. Also, the variance of the resulting estimates may not be easy to obtain because it involves the unknown hazard function of the error term ε. Generally, with censored data this function may not be well estimated nonparametrically. On the other hand, Koul, Susarla, and Van Ryzin13 proposed a different modification of the least-squares method by estimating the distribution of the censoring variable C. Unfortunately, the approach has often been found to be unsatisfactory in practice since its validity depends on the independence assumption between the censoring variable C and the covariate X.14 Next, we present a new implementation procedure that was developed by Jin, Lin, and Ying.18 Let β̂_1 be a known consistent estimate of β_0, for example, the Gehan-type estimate β̂_G in Sec. 20.2.1.1. Define
L(b) = { Σ_{i=1}^n (X_i − X̄)^{⊗2} }^{-1} { Σ_{i=1}^n (X_i − X̄)(Ŷ_i*(b) − Ȳ*(b)) },   (15)

where a^{⊗2} = aa', Ŷ_i*(b) is defined in (14) as the estimate of E[T_i | Y_i, δ_i], and Ȳ*(b) = n^{-1} Σ_{i=1}^n Ŷ_i*(b). Then the estimate of β_0 is updated by β̂_2 = L(β̂_1). At the kth step, let β̂_k = L(β̂_{k-1}). If β̂_k satisfies certain convergence criteria, then β̂_k is the solution of the original estimating equation S(β) = 0. Now, it is important to note that Buckley and James12 originally considered the iterative procedure, but they were not able to provide initial values that are consistent. The use of the Gehan-type estimate β̂_G exactly
overcomes this difficulty. Jin, Lin, and Ying18 showed that for each fixed k = 1, ..., β̂_k is consistent and asymptotically normal under mild conditions. As mentioned before, the asymptotic covariance matrix of β̂_k, however, is difficult to estimate directly because it depends on the unknown hazard function of the error terms. To overcome this difficulty, we also developed a resampling method. Although the resampling shares the same spirit as that in Refs. 17 and 38, in the sense that the perturbation is done by independent and identically distributed known positive random variables V_i (i = 1, ..., n), E(V_i) = Var(V_i) = 1, there is a major difference. Specifically, at the kth step, with the perturbed estimate β*_{k-1} obtained at the (k−1)th step, β*_k is obtained by L*(β*_{k-1}), i.e., β*_k = L*(β*_{k-1}), where
L*(b) = { Σ_{i=1}^n V_i (X_i − X̄)^{⊗2} }^{-1} { Σ_{i=1}^n V_i (X_i − X̄)(Ŷ_i**(b) − Ȳ**(b)) },   (16)

Ŷ_i**(b) = δ_i Y_i + (1 − δ_i) { ∫_{e_i(b)}^∞ u dF̂*_{n,b}(u) / [1 − F̂*_{n,b}(e_i(b))] + X_i'b },   (17)

where F̂*_{n,b}(t) is the Kaplan-Meier estimate based on the perturbed data {V_i e_i(b), V_i δ_i} and Ȳ**(b) = n^{-1} Σ_{i=1}^n Ŷ_i**(b). It is easy to see that the perturbation scheme in (16) is quite different from that in (9) and (10). If the Gehan-type estimate β̂_G is used as the initial estimate in (15), then the corresponding initial perturbed estimate β*_1 can be obtained by minimizing the perturbed Gehan-type objective function (10). Under regularity conditions, it can be shown that, for each fixed k, the distribution (conditional on the data) of n^{1/2}(β*_k − β̂_k) is asymptotically equivalent to that of n^{1/2}(β̂_k − β_0).18 Thus, one can generate a large number of realizations of β*_k through {V_i} and use them to estimate the variance-covariance matrix of β̂_k. Moreover, extensions to marginal linear models for multivariate failure time data can also be readily developed; we refer readers to Ref. 18 for more details.
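The following is a rough numerical sketch of one iteration of the least-squares update (15) with the imputed responses (14), using a Kaplan-Meier estimate of the residual distribution. It is not the authors' implementation; ties among residuals and a censored largest residual are handled crudely, and all helper names are illustrative.

```python
import numpy as np

def km_jumps(e, delta):
    """Kaplan-Meier estimate of the residual distribution: returns the sorted
    residuals, the survival values just after each point, the probability mass
    at each point, and the sorting permutation."""
    order = np.argsort(e)
    e_s, d_s = e[order], delta[order].astype(float)
    n = len(e)
    at_risk = n - np.arange(n)
    surv = np.cumprod(1.0 - d_s / at_risk)      # S just after each point
    prev = np.concatenate([[1.0], surv[:-1]])   # S just before each point
    mass = prev - surv                          # jump (zero at censored points)
    return e_s, mass, order

def imputed_responses(Y, delta, X, b):
    """Buckley-James-type imputation: Y*_i = Y_i if uncensored, otherwise
    X_i'b + estimated E[eps | eps > e_i(b)] from the KM residual distribution."""
    e = Y - X @ b
    e_s, mass, order = km_jumps(e, delta)
    Ystar = Y.astype(float).copy()
    for idx, i in enumerate(order):
        if delta[i] == 1:
            continue
        tail = mass[idx + 1:]
        if tail.sum() > 0:
            cond_mean = np.dot(e_s[idx + 1:], tail) / tail.sum()
            Ystar[i] = X[i] @ b + cond_mean
        # if no KM mass lies beyond e_i, the censored value Y_i is kept
    return Ystar

def ls_update(Y, delta, X, b):
    """One step beta_k = L(beta_{k-1}) of the iterative procedure."""
    Ystar = imputed_responses(Y, delta, X, b)
    Xc = X - X.mean(axis=0)
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ (Ystar - Ystar.mean()))
```

Starting from a consistent initial value (for example the Gehan-type estimate), `ls_update` would be applied repeatedly until successive estimates are close.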
20.2.2. Linear transformation models
The linear transformation model assumes
g(T_i) = X_i'β_0 + ε_i.   (18)
It is assumed that g(·) is an unknown monotone increasing function and ε is an unobservable random error with a completely known distribution
function. This type of model has been studied extensively in the literature, see Refs. 1, 19-29. It becomes the Cox proportional hazards model if ε follows the extreme-value distribution with density f(t) = exp{−e^t + t}. It also becomes the proportional odds model if ε follows the standard logistic distribution, whose density is f(t) = exp{t}/(1 + exp{t})^2, see Refs. 19-24. Cheng, Wei, and Ying27 proposed a general estimation method for linear transformation models with censored data and justified their inference procedure. The method was further investigated and extended, see Refs. 28, 29, and 42. A key step of the approach is the estimation of the survival function for the censoring variable by the Kaplan-Meier estimator. Its validity relies on the assumption that the censoring variable is independent of the covariates. Thus, unlike the Cox partial likelihood approach, the method of estimation might fail when the assumption is violated. Moreover, the method is numerically unstable and difficult to implement.23 In the following, we present a new estimation and inference method that was developed by Chen, Jin, and Ying.30 The method overcomes the shortcomings of the above methods. Let λ(t) be the hazard function and Λ(t) = ∫_{−∞}^t λ(s) ds be the cumulative hazard function of the error term ε. Note that λ(t) and Λ(t) are known since the error distribution is completely specified. Let N_i(t) = δ_i I{Y_i ≤ t} and R_i(t) = I{Y_i ≥ t}. Motivated by the fact that N_i(t) − ∫_0^t R_i(s) dΛ(−X_i'β_0 + g(s)) is a martingale process, the following unified estimating equations are proposed:30

Σ_{i=1}^n ∫_0^∞ X_i [dN_i(t) − R_i(t) dΛ(−X_i'β + g(t))] = 0,   (19)

Σ_{i=1}^n [dN_i(t) − R_i(t) dΛ(−X_i'β + g(t))] = 0,   t ≥ 0.   (20)
In the special case of the Cox model, in which λ(t) = Λ(t) = exp(t), the estimating equation (19) becomes the Cox partial likelihood score equation. This can be seen from (20), which gives

d exp{g(t)} = Σ_{i=1}^n dN_i(t) / Σ_{i=1}^n R_i(t) exp(−X_i'β),

and plugging this into equation (19), one can obtain the Cox partial likelihood score equation.
Thus, the method can easily be implemented as in the Cox proportional hazards model. Let t_1 < ... < t_K < ∞ denote the observed distinct failure times. Then the computation can be done with the following simple iterative algorithm:30
Step 0: Choose an initial value of β, denoted by β^(0).
Step 1: With the initial estimate β^(0), obtain ĝ^(0) at t_1, ..., t_K as follows: First obtain ĝ^(0)(t_1) by solving

Σ_{i=1}^n R_i(t_1) Λ{−X_i'β + g(t_1)} = 1.   (21)

Then, obtain ĝ^(0)(t_k), k = 2, ..., K, one-by-one by solving the equation

Σ_{i=1}^n R_i(t_k) Λ{−X_i'β + g(t_k)} = 1 + Σ_{i=1}^n R_i(t_k) Λ{−X_i'β + g(t_{k-1})}.   (22)

(If there are tied failures at t_k, the value 1 in (21) and (22) should be replaced by the number of ties.)
Step 2: Obtain a new estimate of β = β^(1) by solving (19) with g = ĝ^(0).
Step 3: With β^(1), repeat Steps 1 and 2 until some convergence criteria are met. The resulting estimate is β̂.
An alternative implementation of Step 1 can also be found in Ref. 30. It is shown that the asymptotic variance of β̂ has a closed form which can be easily estimated by plug-in rules.30 Specifically, under mild conditions,
n^{1/2}(β̂ − β_0) → N{0, Σ_*^{-1} Σ (Σ_*^{-1})'}   (23)

in distribution, as n → ∞, where Σ and Σ_* can be consistently estimated by

Σ̂ = n^{-1} Σ_{i=1}^n ∫_0^∞ {X_i − X̄(t)}^{⊗2} λ{−X_i'β̂ + ĝ(t)} R_i(t) dĝ(t),

Σ̂_* = n^{-1} Σ_{i=1}^n ∫_0^∞ {X_i − X̄(t)} X_i' λ̇{−X_i'β̂ + ĝ(t)} R_i(t) dĝ(t),

respectively. Here λ̇(t) is the first derivative of λ(t), and X̄(t) is a weighted average of the covariates of the subjects at risk at time t.
The proposed method is numerically stable and valid even when the censoring variable depends on the covariates. It is noted that the implementation of the method depends on the full specification of the error distribution. A possible choice is the class of distribution functions whose cumulative hazard functions are specified by

Λ(t; r) = log(1 + r e^t)/r for r > 0, and Λ(t; 0) = e^t.
This class of distribution functions is closely related to the Box-Cox transformation. As r varies, Λ(t; r) varies and gives various distribution functions. In particular, when r = 0, it gives the extreme value distribution and the model is the Cox proportional hazards model, and when r = 1, it gives the standard logistic distribution and the model is the proportional odds model.
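The following sketch shows one way the iterative algorithm of Sec. 20.2.2 could be coded. It is not the authors' code: the error class Λ(t; r) above is assumed, the root-finding bracket is assumed wide enough, and the β-equation (19) is solved by a generic nonlinear solver.

```python
import numpy as np
from scipy.optimize import brentq, fsolve

def Lambda(t, r=0.0):
    """Assumed error cumulative hazard: r = 0 gives exp(t) (Cox model),
    r = 1 gives log(1 + exp(t)) (proportional odds model)."""
    return np.exp(t) if r == 0.0 else np.log1p(r * np.exp(t)) / r

def fit_transformation_model(Y, delta, X, r=0.0, n_iter=50, tol=1e-6):
    """Alternate between solving (21)-(22) for g at the distinct failure times
    and solving the beta-equation (19) with g held fixed."""
    n, p = X.shape
    t_fail = np.unique(Y[delta == 1])          # t_1 < ... < t_K
    beta = np.zeros(p)

    def solve_g(beta):
        g, prev = np.empty(len(t_fail)), None
        for k, tk in enumerate(t_fail):
            risk = Y >= tk
            ties = np.sum((Y == tk) & (delta == 1))
            rhs = ties + (0.0 if prev is None
                          else np.sum(Lambda(-X[risk] @ beta + prev, r)))
            f = lambda gk: np.sum(Lambda(-X[risk] @ beta + gk, r)) - rhs
            g[k] = brentq(f, -30.0, 30.0)      # bracket assumed wide enough
            prev = g[k]
        return g

    def beta_score(beta, g):
        # Equation (19), integrating over the jumps of g at t_1, ..., t_K.
        score = X.T @ delta.astype(float)
        lam_prev = np.zeros(n)
        for k, tk in enumerate(t_fail):
            risk = (Y >= tk).astype(float)
            lam_now = Lambda(-X @ beta + g[k], r)
            score -= X.T @ (risk * (lam_now - lam_prev))
            lam_prev = lam_now
        return score

    for _ in range(n_iter):
        g = solve_g(beta)
        beta_new = fsolve(lambda b: beta_score(b, g), beta)
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta_new, t_fail, g
```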
20.3. Remark
In this paper, we have reviewed recent development on the AFT model and linear transformation models. For the AFT model, newly developed rank-based and least-squares approaches are presented. If the unobserved error follows the extreme value distribution, then the logrank-based method should yield the efficient estimator among the weighted rank estimators. If the unobserved error follows the Gaussian distribution, then the least-squares based method should yield the efficient estimator. It should be noted that the essence of the new rank-based method for the AFT model is the construction of a class of monotone estimating functions which approximate the weighted logrank estimating functions around the true values of the regression parameters. The corresponding estimators are consistent and asymptotically normal with covariance matrices being readily estimated by a simple resampling technique. More importantly, the linear programming technique can be used for both the parameter estimation and the variance-covariance estimation. In a technical report, we extended the approach to marginal accelerated failure time models for multivariate failure time data including multiple events, recurrent events and clustered data. However, it is noted that the procedure is not a straightforward generalization. The resampling approach in Ref. 17 is not valid and a different resampling method is developed.43 The modification of the resampling method entails considerable new technical challenges. For details, we refer readers to Ref. 43.
For linear transformation models, the essence of the new development is the construction of the mean zero estimating functions based on the martingale process. It appears that the approach can be readily generalized to deal with multivariate failure time data along the line of Ref. 44. We also would like to point out that there are still many interesting and important research topics, such as model checking and selection. Moreover, the effects of censored covariates or measurement error need to be addressed.
Acknowledgments
This work is partially supported by the New York City Council Speaker's Fund and National Science Foundation Career Award DMS-0134431.
References 1. D. R. Cox, Regression models and life-tables (with discussion), J. R. Statist. SOC.Ser. B 34, 187-220 (1972). 2. P. K. Anderson and R. D. Gill, Cox’s regression model for counting processes: a large sample study (Com: p. 1121-1124), Ann. Statist. 10, 11001120 (1982). 3. D. R. Cox and D. Oakes, Analysis of Survival Data, London: Chapman and Hall (1984). 4. A. A. Tsiatis, Estimating regression parameters using linear rank tests for censored data, Ann. Statist. 18, 354-372 (1990). 5. L. J. Wei, Z. Ying, and D. Y. Lin, Linear regression analysis of censored survival data based on rank tests, Biometrika 77, 845-851 (1990). 6. T. L. Lai and Z. Ying, Linear rank statistics in regression analysis with censored or truncated data, Journal of Multivariate Statistics 40, 13-45 (1992). 7. J. M. Robins and A. A. Tsiatis, Semiparametric estimation of an accelerated failure time model with time-dependent covariates, Biometrika 79, 311-319 (1992). 8. Z. Ying, A large sample study of rank estimation for censored regression data, Ann. Statist. 21, 76-99 (1993). 9. M. Fygenson and Y. Ritov, Monotone estimating equations for censored data, Annals of Statist. 22, 732-746 (1994). 10. M. P. Jones, A class of semiparametric regressions for the accelerated failure time model, Biometrika 84, 73-84 (1997). 11. R. G. Miller, Least squares regression with censored data, Biometrika 63, 449-464 (1976). 12. I. V. Buckley and I. James, Linear regression with censored data, Biometrika 66, 429-436 (1979). 13. H. Koul, V. Susarla, and J . Van Ryzin, Regression analysis with randomly right-censored data, Ann. Statist. 9, 1276-1288 (1981).
14. R. G. Miller and J. Halpern, Regression with censored data, Biometrika 69, 521-531 (1982). 15. I. R. James and P. J. Smith, Consistency results for linear regression with censored data, Ann. Statist. 12, 590-600 (1984). 16. T. L. Lai and Z. Ying, Large sample theory of a modified Buckley-James estimator for regression analysis with censored data, Ann. Statist. 19, 13701402 (1991). 17. Z. Jin, D. Y. Lin, L. J. Wei, and Z. Ying, Rank-based inference for the accelerated failure time model, Biometrika 90, 341-353 (2003). 18. Z. Jin, D. Y. Lin, and Z. Ying, On least-squares regression with censored data, submitted (2004). 19. A. N. Pettitt, Inference for the linear model using a likelihood based on ranks, J . R. Statist. SOC.Ser. B 44, 234-243 (1982). 20. S. Bennett, Analysis of survival data by the proportional odds model, Statist. in Med. 2, 273-277 (1983). 21. A. N. Pettitt, Proportional odds model for survival data and estimates using ranks, Appl. Statist. 33,169-175 (1984). 22. D. M. Dabrowska and K. A. Doksum, Estimation and testing in a twosample generalized odds-rate model, J. A m . Statist. Assoc. 83, 744-749 (1988a). 23. S. A. Murphy, A. J. Rossini, and A. W. Van der Vaart, Maximum likelihood estimation in the proportional odds model, J. A m . Statist. Assoc. 92, 968976 (1997). 24. S. Yang and R. L. Prentice, Semiparametric inference in the proportional odds regression model, J. A m . Statist. Assoc. 94, 125-136 (1999). 25. D. M. Dabrowska and K. A. Doksum, Partial likelihood in transformation models with censored data, Scand. J. Statist. 15, 1-23 (1988b). 26. J. Cuzick, Rank regression, Ann. Statist. 16, 1369-1389 (1988). 27. S. C. Cheng, L. J. Wei, and Z. Ying, Analysis of transformation models with censored data, Biometrika 82,835-845 (1995). 28. S. C. Cheng, L. J. Wei, and Z. Ying, Predicting survival probability with semi-parametric transformation models, J . A m . Statist. Assoc. 92, 227-235 (1997). 29. J. Fine, Z. Ying, and L. J. Wei, On the linear transformation model for censored data, Biometrika 85, 980-986 (1998). 30. K. Chen, Z. Jin, and Z. Ying, Semiparametric analysis of transformation models with censored data, B i o m e t ~ k a89, 659-668 (2002). 31. Y. Q. Chen and M. C. Wang, Analysis of accelerated hazards model, J. A m . Statist. Assoc. 95, 608-618 (2000). 32. Y. Q. Chen and N. P. Jewell, On a general class of semiparametric hazards regression model, Biometrika 88, 687-702 (2001). 33. J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data, 2nd edition. New York: Wiley (2002). 34. J. F. Lawless, Statistical Models and Methods f o r Lifetime Data, 2nd edition. New York: Wiley (2003). 35. P. Andersen, 0. Borgan, R. Gill, and N. Keiding, Statistical Models Based
on Counting Processes. New York: Springer-Verlag (1993). 36. I. Barrodale and R. Roberts, An improved algorithm for discrete L1 linear approximations, SIAM J. Numerical Analysis 10, 839-848 (1973). 37. R. Koenker and V. D’Orey, Computing regression quantiles, Applied Statist. 36, 383-393 (1987). 38. Z. Jin, Z. Ying, and L. J. Wei, A simple resampling method by perturbing the minimand, Biometrika 88, 381-390 (2001). 39. D. W. Mauro, A note on the consistency of Kaplan-Meier least squares estimators, Biometrika 70, 534-535 (1983). 40. T. L. Lai and Z. Ying, A missing information principle and M-estimators in regression analysis with censored and truncated data, Ann. Statist. 22, 1222-1255 (1994). 41. Y . Ritov, Estimation in linear regression model with censored data, Ann. Statist. 18, 303-28 (1990). 42. T. Cai, L. J. Wei, and M. Wilcox, Semiparametric regression analysis for clustered failure time data, Biometrika 87, 867-878 (2000). 43. Z. Jin, D. Y. Lin, and Z. Ying, Rank regression analysis of multivariate failure time data based on marginal linear models, submitted (2003). 44. L. Chen and L. J. Wei, Analysis of multivariate survival times with nonproportional hazards models, in Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis, Editors Lin, D.Y. and Fleming, T.R., 23-36 (1997).
CHAPTER 21 BINARY REPRESENTATIONS OF MULTI-STATE SYSTEMS
EDWARD KORCZAK Telecommunications Research Institute ul. Poligonowa 90, 04-051 Warsaw, Poland E-mail: [email protected]
A new definition of binary representation of a multi-state structure function is formulated, and its main properties are presented. The definition may be applied to both monotone and nonmonotone structures. Several forms of binary representation and its factoring are discussed. Applicability of the results is illustrated by two examples.
21.1. Introduction
Although the theory of binary systems has many practical applications, it is being replaced by the theory of multi-state systems (MSS). In fact, many modern systems (and their elements as well) are capable of assuming a whole range of performance levels, varying from perfect functioning to complete failure. Since the performance level of a system depends on element performance levels, the binary model is often an over-simplification and insufficient for describing multi-state systems. The present state-of-the-art and historical overview of the theory and practice of MSS may be found in a recent book by Lisnianski and Levitin.1 Reliability analysis of MSS is usually more difficult than the analysis of binary systems. Indeed, the number of states of an MSS may be quite big, and the relation between the system state and the states of its elements (i.e. the structure function) may be very complicated. Therefore some effort has been made in applying existing binary methods to MSS. In particular, Block and Savits2 introduced two equivalent binary representations of a multi-state monotone system (MMS), each decomposing the multi-state monotone structure into a sequence of binary monotone structures of binary
variables. A similar approach was considered independently in Refs. 3-6. The binary variables used in this approach are threshold indicators, i.e. they indicate whether or not an element or a system is in a state with performance rate higher than or equal to a given performance level. See also Refs. 7-13 for applications of this decomposition to MMS. The Block-Savits decomposition is based on the assumption that the system is monotone and has a totally ordered state space. However, there are multi-state systems that are not monotone, for which the state spaces of the system and/or its elements are partially ordered14-17 or even unordered. For the latter case, Caldarola proposed an alternative binary representation method, which does not require the state ordering and is based on a special Boolean algebra. In his method, the binary variables indicate whether or not the element or system is in a particular state. In this paper a new definition of a binary representation of an MSS is formulated and its main properties are obtained. The definition is given in terms of conditions to be satisfied, and is applicable to any multi-state structure, monotone or nonmonotone. Several forms of the binary representation are discussed. The problem of variables' relevance is also addressed. Furthermore, two different types of factoring of the binary representation are introduced, which can be used to simplify the corresponding binary structures. Finally, the application of the binary representation to the calculation of the reliability indices is illustrated by two examples.
21.2. Basic Definitions
Let (C, K, K_1, ..., K_n, φ) be a multi-state system (MSS) consisting of n multi-state elements with the index set C = {1, 2, ..., n}, where K = {0, 1, ..., M} is the set of the system states, K_i = {0, 1, ..., M_i} is the set of the states of element i ∈ C, and φ: V → K is the system structure function, where V = K_1 × K_2 × ... × K_n is the space of element state vectors. The state of the system is determined by the element state vector and the structure function: φ(x) is the system state when the element state vector is x = (x_1, x_2, ..., x_n), where x_i ∈ K_i represents the state of element i. The state spaces of the MSS and its elements are considered as totally ordered sets, each with the usual order induced by the state indexing (numbering). This order need not be consistent with performance rates or other features corresponding to the states. Without any loss of generality we assume that the system states 0 and M are essential, i.e. the sets φ^{-1}(0) and φ^{-1}(M) are
nonempty. If a, b ∈ V and a ≤ b, then the set [a, b] = {x ∈ V : a ≤ x ≤ b} is called an interval of V, or a box. We use standard notation: a ≤ b iff a_i ≤ b_i for each i ∈ C, and a < b iff a ≤ b and a_i < b_i for some i ∈ C. For any j ∈ K \ {0}, let V_{<j} = {x ∈ V : φ(x) < j} and V_{≥j} = {x ∈ V : φ(x) ≥ j}. According to our assumption, these sets are nonempty. A box is called an upper (lower) box to level j if it is contained in V_{≥j} (V_{<j}). An upper (lower) box to level j is maximal if it is not properly contained in another upper (lower) box to level j. A collection of maximal upper (lower) boxes to level j is named an upper (lower) base to level j if their union is the whole set V_{≥j} (V_{<j}). Of course, the collection of all maximal upper (lower) boxes to level j is an upper (lower) base to level j. An upper (lower) base to level j is irredundant if it ceases to be an upper (lower) base to level j after deletion of any of its members. A general MSS may have many irredundant upper and/or lower bases. Let U_j (L_j) be any upper (lower) base to level j. Then:
φ(x) = Σ_{j=1}^M 1(x ∈ V_{≥j}) = Σ_{j=1}^M max_{[a,b]∈U_j} 1(x ∈ [a, b]) = Σ_{j=1}^M max_{[a,b]∈U_j} min_{i∈C} 1(x_i ∈ [a_i, b_i]),   (1)

φ(x) = Σ_{j=1}^M min_{[a,b]∈L_j} max_{i∈C} 1(x_i ∉ [a_i, b_i]),   (2)
where 1(·) is the indicator function. An MSS is said to be a multi-state monotone system (MMS) if its structure function φ is nondecreasing in each argument, φ(0) = 0 and φ(M) = M, where 0 = (0, 0, ..., 0), M = (M_1, M_2, ..., M_n). It is easy to see that for any MMS, each upper box to level j has the form [y, M], and the lower bound y of this box is called a path vector to level j. A path vector to level j is minimal if [y, M] is a maximal upper box to level j. Similarly, each lower box to level j of an MMS has the form [0, z], and the upper bound z of this box is called a cut vector to level j. A cut vector to level j is minimal if [0, z] is a maximal lower box to level j. Unlike for a general MSS, for an MMS the set of all maximal upper (lower) boxes to level j is the only irredundant upper (lower) base to level j.
For monotone systems, it is usually assumed that the states of the system (element i) represent successive levels of performance ranging from the perfect functioning level M (M_i) down to the complete failure level 0. In other words, the states of the system and its elements are totally ordered with respect to the performance level, and the state numbering is consistent with this order. In this paper, in order to keep full generality, we do not assume that. The state ordering is considered formally, as the natural order between the integers representing the states of the system and its elements. The dual structure φ^D is given by φ^D(x) = M − φ(M − x). If φ is monotone, then so is φ^D.
21.3. Binary Representation of an MSS and Its Properties
Let J_{i;r}(x_i) = 1(x_i ≥ r), x_i ∈ K_i, i ∈ C, r ∈ K_i \ {0}. Define vector-valued functions J_i: K_i → {0,1}^{M_i} = F_i (i ∈ C) and J: V → {0,1}^{M_1} × ... × {0,1}^{M_n} = F = F_1 × ... × F_n:
J_i(x_i) = (J_{i;1}(x_i), J_{i;2}(x_i), ..., J_{i;M_i}(x_i)),   (3)

J(x) = (J_1(x_1), J_2(x_2), ..., J_n(x_n)) = (J_{i;r}(x_i) : i ∈ C, r ∈ K_i \ {0}).   (4)

Elements of the set F_i are denoted as vectors with a single underline: x_i = (x_{i;1}, x_{i;2}, ..., x_{i;M_i}), where x_{i;r} ∈ {0,1}. Elements of the set F are denoted as vectors with a double underline: x = (x_1, x_2, ..., x_n) = {x_{i;r}}. For notational convenience, we set x_{i;0} = 1 and x_{i;M_i+1} = 0. Let A_i = J_i(K_i) ⊆ F_i and A = J(V) = A_1 × ... × A_n ⊆ F. We have:

A_i = {x_i = (x_{i;1}, x_{i;2}, ..., x_{i;M_i}) : 1 ≥ x_{i;1} ≥ x_{i;2} ≥ ... ≥ x_{i;M_i} ≥ 0} = {e_i(r) : r ∈ K_i},   (5)

e_i(r) = (1, ..., 1, 0, ..., 0) with r ones followed by M_i − r zeros; e_i(0) = (0, ..., 0), e_i(M_i) = (1, ..., 1),   (6)

where e_i(r) = (1(a ≤ r) : a ∈ K_i \ {0}) = J_i(r).
(J(x)).
Binary Representations of Multi-State Systems
297
For example, let f(x) = 1(x ∈ [a, b]), a, b ∈ V, a ≤ b. Let us consider two functions f_{B1}, f_{B2}: F → Z defined by:

f_{B1}(x) = Π_{i=1}^n x_{i;a_i}(1 − x_{i;b_i+1}),   f_{B2}(x) = Π_{i=1}^n (x_{i;a_i} − x_{i;b_i+1}).   (7)
These functions are not equivalent on the whole space F. Indeed, substituting x_{i;a_i} = 0 and x_{i;b_i+1} = 1, i = 1, ..., n, we have f_{B1}(x) = 0 and f_{B2}(x) = (−1)^n. However, they are equivalent on A, and both are binary representations of f.
Definition 21.2: Let φ_j: F → Z, j ∈ K \ {0}, be integer-valued functions of the binary variables x = {x_{i;r}}. We say that the sequence {φ_j} is a BR of an MSS with the structure φ if ∀j ∈ K \ {0}, φ_j is a BR of 1(φ(x) ≥ j), i.e. if 1(φ(x) ≥ j) = φ_j(J(x)) for x ∈ V. Such a representation exists. Indeed, since
1(φ(x) ≥ j) = Σ_{a∈V_{≥j}} 1(x = a) = Σ_{a∈V_{≥j}} Π_{i=1}^n 1(x_i = a_i),   (8)

then clearly

φ_j(x) = Σ_{a∈V_{≥j}} Π_{i=1}^n (x_{i;a_i} − x_{i;a_i+1})   (9)

is a binary representation of 1(φ(x) ≥ j). The following proposition summarizes the basic properties of a BR of an MSS.
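To illustrate the construction, the sketch below builds the binary image J(x) and evaluates the canonical representation (9) by enumerating the state space. The two-element "min" structure at the end is an assumed toy example, not one of the chapter's case studies.

```python
from itertools import product

def binary_image(x, M):
    """J(x): threshold indicators J_{i;r}(x_i) = 1(x_i >= r), r = 1..M_i."""
    return {(i, r): int(x[i] >= r)
            for i in range(len(x)) for r in range(1, M[i] + 1)}

def phi_j_canonical(xb, phi, M, j):
    """Canonical BR (9) of 1(phi(x) >= j), evaluated at a binary image xb,
    using the conventions x_{i;0} = 1 and x_{i;M_i+1} = 0."""
    def var(i, r):
        if r == 0:
            return 1
        if r > M[i]:
            return 0
        return xb[(i, r)]
    total = 0
    for a in product(*[range(m + 1) for m in M]):
        if phi(a) >= j:
            term = 1
            for i, ai in enumerate(a):
                term *= var(i, ai) - var(i, ai + 1)
            total += term
    return total

# Toy example: two 3-state elements, system state = min of the element states.
M = [2, 2]
phi = lambda x: min(x)
x = (2, 1)
xb = binary_image(x, M)
assert phi_j_canonical(xb, phi, M, 1) == int(phi(x) >= 1)
assert phi_j_canonical(xb, phi, M, 2) == int(phi(x) >= 2)
```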
Proposition 21.1: Let {φ_j} be a BR of an MSS with the structure φ.
(i) {φ_j} are nontrivial binary functions on A.
(ii) If the system is an MMS, then {φ_j} are nontrivial binary monotone functions on A.
(iii) 1 ≥ φ_1 ≥ φ_2 ≥ ... ≥ φ_M ≥ 0 on A.
(iv) φ(x) = Σ_{j=1}^M φ_j(J(x)) for each x ∈ V.
(v) The binary representation is determined uniquely on A, i.e. if {φ_j} and {ψ_j} are two binary representations of an MSS, then φ_j = ψ_j on A for each j ∈ K \ {0}.
(vi) Let {ψ_j : j = 1, ..., M} be integer-valued functions defined on F. If {ψ_j} satisfy (iii), then ψ(x) = Σ_{j=1}^M ψ_j(J(x)) is the structure function of an MSS and {ψ_j} is its binary representation. If {ψ_j} also satisfy the conclusion of (ii), then ψ(x) is monotone.
(vii) A binary representation {φ_j^D} of the dual structure φ^D can be obtained from a binary representation {φ_j} of φ as follows: (1) obtain the pseudo-dual functions φ_j^{d1}({x_{i;r}}) = 1 − φ_j({1 − x_{i;r}}); (2) obtain functions {φ_j^{d2}} by replacing each x_{i;r} with x_{i;M_i−r+1} in the algebraic expressions defining the functions {φ_j^{d1}}; (3) set φ_j^D = φ^{d2}_{M−j+1}, j = 1, ..., M.
The function ' p j may be written in several forms equivalent on A, but not necessarily on F \ A. Some of these forms are direct generalizations of the forms known from the binary system theory. The algebraic expression defining the function ' p j may be treated as in the binary system theory, using techniques and tools known from this theory. Moreover, due to the restrictions on xi;, imposed by (5), we may use several additional simplification rules which do not alter the function ' p j on the set A. The basic rules are: xi;rxi;s = xi;r A
- -
xi;s
xi;, V
-
= xi;max(,,s), xi;,xi;s = xi;, A -
xi;s = xi;min(r,s)i
xt;, V
zi;s
3i;S
= zi;min(,,s)i
= Fi;max(r,s)l
(10) (11) (12)
where, as usual, a A b = min(a, b) = ab, a V b = max(a, b) = a + b - ab and a = 1 - a for any binary a and b. Block and Savits' defined binary representations of an MMS using socalled it minimal path vector and minimal cut vector forms: (13)2 (14) 0
where M P , and M C j are the sets of minimal path vectors and minimal cut vectors to level j respectively. Both functions 'p(iMP) and 'p(iMC) satisfy Definition 21.2. Indeed, by the definition of minimal path vectors to level j , V2j = U y E M P[y, j MI, hence l(cp(x) 2 j ) = l(x E V>j) = max min l(xi 2 yi) = y E M P j iEC:y;>O
max
min
y E M P j iEC:yi>O 'piMc).
~ i ; ~ ~ ( x=i )( p i M P ' ( ~ ( X ) ) . Similar arguments apply to
BY Proposition 21.1(v), P I M P ) = PIMC) on A.
299
Binary Representations of Multi-State Systems
For general MSS, the corresponding forms use upper and lower bases to level j, Uj and C j :
(15)
(16)0
where min 0
= 1 and rnax = 0. 0
By applying the inclusion-exclusion principle to (15) and then reduction rules (10) to each term, we obtain a generalized Sylvester-PoincarL’s form of an MSS:
(17)
where for any nonempty set of boxes D , wi(D) = max
ai, w i ( D ) =
[a,bjE D
min b,, i E C , v(D) = ( v l ( D ) ,. . . ,wn(D)),w(D) = (wi(D), . . . ,wn(D>),
la,bl . .E D
ID/ = card(D). By convention, a sum (product) over the empty set equals zero (one). In case of MMS, the above form simplifies to the well known formula:
vj ( S P )(22)
=
c
(-l)lD1+l 0#D&MP,
[ g:..:.l]
7
(18) (18)
u,(D)>O
where for any 0 # D 2 V, u,(D) = max a,, i E C . a€D
Among many other forms, there is a pseudo-polynomial form:
(19) Hk(&
=
n
k= 1
Zi;ak,iZi;bk,i
=A
iEC
where cu are integer coefficients 0 5 a k , i means equality on A, and the products
n(Zi;ak,i
- %;bk,i),
(20)
iEC
< b k , i I Mi t’1 for all i and k, = A are nontrivial, i.e. 0 < a k , i or
Hk
E. Korczak
300
bk,i
I Mi for some i E C.Observe that J-'(Hi'(l))
= [ak,b k - 11,.where
ak = (a,', . . . , ~ k , ~bk) = , & , I , . . . ,b k , n ) and 1 = ( 1 , . . . , 1). Two products Hk(x)and Ht(x)are orthogonal if H k ( x ) H l ( s ) = 0 on A, i.e. when the corresponding boxes are disjoint. If in (E),a 6 two products are disjoint, a0 = 0 and a k = 1 (Ic = 1,. . . ,m), then the pseudo-polynomial form is called the orthogonal form, or SDP form. It corresponds to a partition of V>j into disjoint nonempty boxes. A particular case of the orthogonal form is the canonical disjunctive normal form, in which the products correspond to single-point boxes, see (9). Let us now consider the problem of variable relevance. We say that variable xi;,.,r E Ki\{O}, is relevant for the function cpj if there exists x -E A
such that (O(i;,.),X) - E A, P(i;,.),Z) E A and cpj(O(i;,.,,g)# cpj(l(i;,.), where (d(i,,),s) is equal to ZI - with xi;,.replaced by d. Other equivalent conditions are7 (a) there exists x ), - E A such that c p j ( g i ( r-) , ~#) cpj(ei(r - l ) -, ~where ( ~ i ( rg) ) ,= ( X I 7 . . .,xi-1 7 g i ( r ) ,xi+', . . . > ~ n ) . (b) there exists x E V such that l(p(r(i),x)2 j ) # l(cp((r- l)(q,x)2 j),where (r(i),x) = ( 2 1 , .. . , x i - l , r , x i + l , . . . ,xn). The functions cpj may have several irrelevant variables. However, in many practical cases the algebraic expressions defining these functions contain relevant variables only. For example, all variables in the expressions in (13)-(18) are relevant. Proposition 21.2: For a n y upper (lower) base Uj ( C j ) of a n MSS, at
holds: {relevant variables of
cpj}
> 0, [a,b] E U j } U { q b i + l : i E C,bi < Mi, [a,b] E Uj} {xiia,: i E C,ai > 0, [a,b] E Lj}
= {xi;ai: i E C,ai -
u {xi;bi+l : i E c,bi < Mi, [a,b] E cj} . (21) I n particular, for MMS, {relevant variables of
cpj}
> 0,y E MPj} {xi;*i+l: i E c,zi < M i , 2 E M C j } . (22)
= {xiiyi: i E C , y i -
Factoring is a popular technique for analyzing complex binary systems. There are two kinds of factoring (pivotal decomposition) of the function cpj: with respect to the entire vector siE A (or the state of element i),
301
Binary Representations of Multi-State Systems
and with respect to a single variable xiiT.Considering all possible values of vector xiE A, we obtain the factoring formula with respect to the state of element i:
l(xi = e i ( T ) ) ’ p j ( c i ( T ) ,-X ) =
’p.(x) 3 = = TEKi
=
C
xi;TZi;T+l
‘pj(ci(T)lg)
TEK~
(23)
( 5 i ; r - x i ; r + l )v j ( c i ( T ) > g )
rEKi
for x E A. The factoring with respect to a single variable Xi;r is given by: ’ ~($j
- + ~ i ;‘ r~(O(i;r) X) j g) xi;r ~ j ( ( l ( i ;: ~1 )I s I T ) ,E)
=zi;T =A
~ p(l(i;r) j
+ T i ; r p j ( ( O ( i ; s ): T
7
Is I Mi),g).
(24)
The second equality in (24) follows from the fact that on A, the condition xiir = 1 means x i i s = 1 for all 1 5 s 5 T , and the condition xi;,, = 0 means xi;s = 0 for all T 5 s I Mi. It can be seen that recursive application of (24) with respect to xi;^^, x i ; ~ > - l.,. .,x i ; l successively leads to (23). 21.4. Examples of Application
Suppose that the states of elements 1,. . . n of an MSS are represented by s-independent random variables X I , . . . ,X,. Then the state of the MSS is cp(X), where X = (XI,. . . ,X n ) . These variables may be time-dependent, but we omit, for brevity, the time-parameter t. It is clear that for any fmed i E C the random variables Xi;,. = J i i T ( X i ) T, E Ki \ { 0 } , are s-dependent, as XiiT _> Xi;,for T < s. However, the random vectors Xi= Ji(Xi) = (Xt;T: T E Ki \ { 0 } ) ,i E C, are s-independent. Let R ( j ) = P{cp(X) 2 j } = E(cpj(J(X))]and R ~ ( T=) P{Xi 2 T } = E[Xi,,.].Under the assumptions that { R i ( r ) }are known, and that the function ‘pj is given in the form (19), the calculation of R ( j ) is very easy: m
(25)0
where clearly Ri(0) = 1, Ri(Mi
{Xi;r
T
E
Ki \ {O},i E
C}.
+ 1 ) = 0, and X- = J(X) = (Xi: i E C) =
302
E. Korczak
For example, if A C K, then
P{(P(X>E A ) = C(R(j) - R(j
+ 1))
(26)
*
jEA
This shows that the choice of a suitable binary representation of an MSS overcomes difficulties with the dependence of random variables belonging to the same element. Now consider the general case when the random variables XI,. . . , X, representing states of system’s elements may be s-dependent. We show that (25) can be extended to this case. Let pxl ,..,,X, (c1, . . . ,G)= p ~ ( cbe ) the joint probability mass function (pmf) of X I , . . . , X, defined by: PXl,...(X,(Cl,. . . , C n ) = P{Xl = c1,. . . ,x, = c n } ,
(27)
or in compact form: p x ( c ) = P{X = c } , c E
v.
(28)
Since Hk(J(X)) = 1 iff X E [arc,b k - 11, we have: E[Hrc(J(X))] = P { a k
I X < h}=
PX(C).
(29)
ak
Under the assumptions that the join pmf { p x ( c ) : c E V} is known, and that the function pj is given in the form (19), it follows that:
k=l m
k=l
(30)0 ab
Example 21.1: Consider an MMS with K = {0,1,2,3}, K1 = K2 = {0,1,2}, K3 = {0,1,2,3}. The structure p is defined by the sets of minimal path vectors: MP1 = {(1,2, l ) , (2,1, l ) , (0,1,2), (1,0,2), (O,O, 3)}, MP2 = {(1,1,2),(0,1,3),(1,0,3)}andMP3= {(1,2,3),(2,1,3)}. Wedemonstrate the use of factoring of cp1 with respect to element 3. According to (13): pl(g) = 2 1 ; 1 2 2 ; 2 2 3 ; 1 v
21;222;123;1 v 2 2 ; 1 2 3 ; 2
We have:e3(0) = ( O , O , O ) , % ( l ) (1,1,l),hence:
v
21;123;2
v
Z3;3.
(31)
= (1,0,0),g3(2)= ( l , l , O ) and%(3) =
(32)0
Binary Representations of Multi-State Systems
303
Let us consider the second type of factoring. Using (24) with xi;r=x3;3, yields:
By substituting (38) into (37), we obtain:
Which is equal to (36) Finally, Assuming the independence of elements, we obtain:
304
E. Korczak Table 21.1. Values of structure 9.
22
Example 21.2: Consider a two-element MSS with K = {0,1,2,3} and K 1 = Kz = {0,1,2,3,4}. The structure function cp is defined by Table 21.1, and is nonmonotone. We see the maximal upper boxes to level j = 1 are [(0, l),(4,4)] and [(2,0),(4,4)], hence x H l(cp(x) L 1) is a monotone binary function of x, and the vectors ( 0 , l ) and (2,O) are minimal path vectors to level 1. Therefore results from MMS theory apply. PI($ = ~
2 ;V 1 21;2
= 22;1
+ 2 1 ; 2 - 2 2 ; 1 2 1 ; 2 = 2 2 ; 1 + C 2 ; i ~ i ; .z
(41) Now consider level j = 2. The maximal upper boxes to level 2, together with corresponding products, are: [(1,2),(2,4)1 [(I,3), (4,3)]
-
[(3, I), (4,I)]
21;1:1;322;2,
[(I,a), (3,3)]
21;122;3T2;4,
[(3, I), (3, 311
21;322;1z2;2,
[(4,3)7 (47 4)]
-
21;1:1;422;2:2;4, 21;3T1;422;152;4, 21;422;3
.
There are two irredundant upper bases to level 2:
uzw = { K L21, (2,411, ~
a,
1 , (3,3)1, [(3,1),(4,1)1, [(4,3),(4,4)11,
w,21, (2,4)1, ~ 3I),, (3,3)1, ~ 3 ~ 1(4,1)1, 1, [(4,3),(4,4)11.
~ ~ ( =2 )
According to (15), with
(pZ(Z) = 21;1:1;322;2 v
U2
= U2(1), we have:
21;1:1;422;2z2;4
v
v
21;322;122;2
21;422;3.
(42)
Applying the inclusion-exclusion rule yields:
(pZ(5) =
+ ( 2 l ; l - 21;4)(22;2 + 21;3(22;1 - 2Z;Z) + 2 1 ; 4 2 2 ; 3
(21;l
- 21;3)22;2
- (21;3 - 21;4)(22;1 - 22;2)
An SDP form of 'p2(g)
=
(pz
-
22;4)
.
(43)
is
(21;l
- 2 1 ; 3 ) 2 2 ; 4 $- ( 2 1 ; l - 2 1 ; 4 ) ( 2 2 ; 2
f z1;3(22;1 - 22;Z)
+ 21;422;3 .
-
22;4)
(44)
Binary Representations of Multi-State Systems
305
For level j = 3 we have two disjoint maximal upper boxes: [(3,l),(3,2)] and [(4,3), (4,4)], hence the upper boxes form and Sylvester-PoincarCform coincide, and are also SDP forms: (P3(5) = 21;3T1;422;1Z2;3v 21;422;3
-
(21;3 - 21;4)(22;1 - 22;3)
+ 21;422;3 .
(45)
Assuming the independence of elements, the substitution rule (25) yields: (46)
R(2) =
(47) (48)
+
P{(p(X) E {1,2}} = (R(1) - R ( 2 ) ) (R(2) - R(3)) = R(1) - R(3). (49) Now suppose that the elements are s-dependent, with known joint pmf PX~,X~(C= ~ , P{Xi C ~ ) = Q , X2 = c2}, 0 5 c 1 , I ~ 4. Let P(A1, A2) = P{Xi E Ai, X2 E A2}, A1 C K1,A2 2 K2:
P(A1, -42) =
cc
PX1,XZ(Cl,C2).
(50)
~ ( 1=)~ ( [ 0 , 4 1~, ~ 4+1w) . 4 1 , [o,oI),
(51)
c i E A i czEAz
Then, according to (30), we obtain:
(52) (53) 21.5. Conclusions New definition of binary representation of an MSS presented in the paper extends the applicability of existing definitions of binary decomposition of MMS. This shows that some results from the theory of MMS can be adapted to general MSS, forming a unified framework for both cases. Two examples show that the definition of binary representation proposed in the paper can be applied to both monotone and nonmonotone systems.
306
E. Korczak
Acknowledgments
I am very thankful to t h e referees for a careful reading of the manuscript and constructive comments. References 1. A. Lisnianski and G. Levitin, Multi-State System Reliability. Assessment, Optimization and Applications, World Scientific, New Jersey (2003). 2. H. W. Block and T. H. Savits, A decomposition for multistate monotone systems, J. Appl. Probab. 19, 391 (1982). 3. D. A. Butler, Bounding the reliability of multistate systems, Oper. Res. 30, 530 (1982). 4. B. Natvig, Two suggestions of how to define a multistate coherent system, Adv. in Appl. Probab. 14, 434 (1982). 5. K. Reinschke, Determination of Reliability Quantities for Monotonous Multivalued Systems, 2. elektr. Inform.- u. Energietechnik 11,549 (1981). 6. K. Reinschke and M. Klingner, Messen, Steuern, Regeln 24, 422 (1981). 7. E. Funnemark and B. Natvig, Bounds for the Availabilities in a Fixed Time Interval for Multistate Monotone Systems, Adv. in Appl. Probab. 17,638 (1985). 8. J. Huang and M. J. Zuo, Dominant multi-state systems, I E E E Trans. Reliability 53,362 (2004). 9. E. Korczak, Reliability analysis of multistate monotone systems, in Safety and Reliability Assessment - A n Integral Approach, ESREL'93, p. 671, Eds. P. K a h and J. Wolf, Elsevier, Amsterdam (1993). 10. E. Korczak, in Advances in Safety and Reliability, ESREL'97, Vol. 3, p. 2213, Ed. C. Guedes Soares, Pergamon, London (1997). 11. W. Kuo and M. J. Zuo, Optimal Reliability Modeling: Principles and Applications, p. 494, Wiley, New York (2003). 12. A. P. Wood, Multistate Block Diagrams and Fault-Trees, IEEE Trans. Reliability 34, 236-240 (1985). 13. J. Xue and K. Yang, Dynamic reliability analysis of coherent multistate systems, IEEE Trans. Reliability 44, 683-688 (1995). 14. B. Lindqvist, Bounds for the reliability of multistate systems with partially ordered state spaces and stochastically monotone Marko transitions, Int. J. Reliab. Quality and Safety Eng. 10, 235 (2003). 15. K. Nakashima and K. Yamato, Some properties of multi-state monotone systems and their Boolean structure functions, Trans. ZECE of Japan E66, 535 (1983). 16. S. Shinmori and F. Ohi, Stochastic bounds for generalized systems in reliability theory, J. Oper. Res. SOC.Japan 33, 103-118 (1990). 17. K. Yu, I. Koren, and Y. Guo, Generalized multistate monotone coherent systems, I E E E Trans. Reliability 43, 242 (1994). 18. L. Caldarola, Coherent Systems with Multistate Components, Nucl. Eng. Des. 58,127-139 (1980).
Binary Representations of Multi-State Systems
307
19. L. Caldarola, in Synthesis and Analysis Methods for Safety and Reliability Studies, Eds. G. Apostolakis, S. Garribba and G. Volta, p. 199, Plenum Press, New York (1980). 20. M. Veeraraghavan and K . S. Trivedi, Combinatorial algorithm for performance and reliability analysis using multistate models, IEEE Rans. Comput. 43, 229-234 (1994). 21. X. Zang, D. Wang, H. Sun, and K . S. Trivedi, A BDD-Based Algorithm for Analysis of Multistate Systems with Multistate Components, IEEE h n s . Comput. 52, 1608-1618 (2003).
This page intentionally left blank
CHAPTER 22 DISTRIBUTION-FREE CONTINUOUS BAYESIAN BELIEF NETS
D. KUROWICKA Delft Institute for Applied Mathematics Delft University of Technology Delft, The Netherlands E-mail: D.Kurowicka@ewi. tudelft. nl
R. M. COOKE Delft Institute f o r Applied Mathematics Delft Univeristy of Technology Delft, The Netherlands E-mail: R.M. [email protected] This paper introduces distribution-free continuous belief nets using the vine-copulae modeling approach. Nodes are associated with arbitrary continuous invertible distributions, influences are associated with (conditional) rank correlations and are realized by (conditional) copulae. Any copula which represents (conditional) independence as zero (conditional) correlation can be used. We illustrate this approach with a flight crew alertness model.
22.1. Introduction
Bayesian belief nets (bbns) are directed acyclic graphs representing high dimensional uncertainty distributions, and are becoming increasingly popular in modeling complex systems. Appiication is limited by the excessive assessment burden, which leads to informal or unstructured quantification. Continuous bbns exist only for the joint normal case, and require assessing means, conditional variances and partial regression coefficients.' 309
310
D. Kurowicka and R . M. Cooke
This paper develops a “copula-free” approach to continuous bbns.” Any copula may be used so long as the chosen copula represents (conditional) independence as zero (conditional) correlation. This approach cannot rely on the equality of partial and conditional correlation, and hence cannot rely on vine transformations to deal with observation and updating, as done in Ref. 2. Nonetheless, it is shown that the elicitation protocol2 based on conditional rank correlation can work in a copula-free environment. A unique joint distribution can be determined and sampled based on the protocol, which factorizes in the manner prescribed by the bbn and can be updated with observations of values or intervals. The theory presented here can be extended to include “ordinal” variables; that is, variables which can be written as monotone transforms of uniform variables, perhaps taking finitely many values. The dependence structure must be defined with re spect to the uniform variates. Further, we consider here only the case where the conditional correlations associated with the nodes of vines are constant; however, the sampling algorithms discussed below will work mutatis mutandis for conditional correlations depending on the values of the conditioning variables. We note that quantifying bbns in this way requires assessing all (continuous, invertible) one-dimensional marginal distributions. On the other hand, the dependence structure is meaningful for any such quantification. In fact, when comparing different decisions or assessing the value of different observations, it is frequently sufficient to observe the effects on the quantile functions of each node. For such comparisons we do not need to assess the one-dimensional margins at all. This will be illustrated in the sequel. We assume the reader is familiar with bbns. We introduce vines and copula’s in Sec. 22.1, and distribution-free bbns in Sec. 22.3. Section 212.4 illustrates their use with an example from Roelen, et al.3 involving airline flight crew alertness.
”In Kurowicka and Cookez the authors introduced an approach to continuous bbns using vines4 and the elliptical ~ o p u l a Influences .~ were associated with conditional rank correlations, and these were realized by (conditional) elliptical copulae. While this approach has some attractive features, notably in preserving some relations between conditional and partial correlation, it also has disadvantages. Foremost among these is the fact that zero (conditional) correlation does not correspond t o (conditional) independence under the elliptical copula.
311
Distribution-F!ree Continuous Bayesian Belief Nets
22.2. Vines and Copulae We build complex high-dimensional distributions from two-dimensional and conditional two-dimensional distributions with uniform margins. The twodimensional distributions on unit square with uniform margins are called “copulae.”
Definition 22.1: A copula C is a distribution on the unit square with uniform margins. Random variables X and Y are joined by copula C if their joint distribution can be written FXY
(-7
Y) = C(Fx(-), FY (Y)).
The diagonal band copula was introduced in Cooke and Waij.6 For positive correlation the mass is concentrated on the diagonal band with vertical bandwidth P = 1 - a. Mass is distributed uniformly on the rectangle with corners (P, 0), (0,P), (1 - P , l ) and ( 1 , l - P ) and is uniform but “twice as thick” in the triangular corners. For negative correlations the mass of the diagonal band density is concentrated in a band along the diagonal y = 1--. Then we have bandwidth P = 1+a, -1 5 Q: 5 0. The correlation coefficient is given by p = sign(a) ((1- la1)3- 2(1 - 1 ~ ~ 1 )1).6 ~ Figure 22.1 shows the density of the diagonal band distribution with correlation 0.8.
+
...
Fig. 22.1.
.. ....
A density of the diagonal band copula with correlation 0.8.
In this paper we use the diagonal band copulae (or mixtures of these) because of their compliant analytical form. However we could use any copulae for which zero correlation entails independence, including the maximum entropy copulae, Frank’s copulae, etc. For more information about copulae we refer to e.g. Doruet and Kotz,’ N e l ~ e nand , ~ Lewandowski.”
312
D. Kurowicka and R. M. Cooke
i
nqu
. i
\..,.........
Fig. 22.2. D-vine (left) and canonical vine (right) on 4 variables with (conditional) rank correlations assigned to the edges.
x3
x3
Fig. 22.3. Graphical representation of sampling value of
24
in D-vine.
Graphical models called vines were introduced in Cooke" and Bedford and Cooke.12 A vine on N variables is a nested set of trees { T I ,...Tn-l} where the edges of tree j are the nodes of tree j 1, and each tree has the maximum number of edges. A regular vine on N variables is a vine in which two edges in tree j are joined by an edge in tree j + 1 only if these edges share a common node. A regular vine is called a canonical vine if each tree Ti has a unique node of degree N - a, hence has maximum degree. A regular vine is called a D-vine if all nodes in 2 ' 1 have degree not higher than 2 (see Fig. 22.2). There are N(N - 1)/2 edges in a regular vine on N variables. Each edge in a regular vine may be associated with a conditional
+
Distribution-Free Continuous Bayesian Belief Nets
313
copula, that is, a conditional bivariate distribution with uniform margins (for j = l the conditions are vacuous). The conditional bivariate distributions associated with each edge are determined as follows: the variables reachable from a given edge are called the constraint set of that edge. When two edges are joined by an edge of the next tree, the intersection of the respective constraint sets are the conditioning variables, and the symmetric differences of the constraint sets are the conditioned variables. The regularity condition insures that the symmetric difference of the constraint sets is always a doubleton. Each pair of variables occurs once as conditioned variables. It is convenient to specify the conditional bivariate copulae by a rank correlation specification: first assign a constant conditional rank correlation to each edge of the vine, then choose a class of copulae indexed by correlation coefficients in the interval [-1,1] and select the copulae with correlation corresponding to the conditional rank correlation. The density can be factorized into the product of bivariate densities depending on the conditioning variables of each edgelZ (see formula 2 below). For the precise definitions and properties of regular vines we refer to Bedford and Cooke4 and Kurowicka and C00ke.l~A joint distribution satisfying the vine-copula specification can be constructed and sampled on the fly, and will preserve maximum entropy properties of the conditional bivariate distributions.11312 The rank correlation specificationon a regular vine determines the whole joint distribution. There are two strategies for sampling such a distribution, which we term the cumulative and density approaches. We first illustrate the cumulative approach with the distribution specified by the D-vine in Fig. 22.2, D(1,2,3,4): Sample four independent variables distributed uniformly on interval [0,1],U1, UZ, U3, U4 and calculate values of correlated variables XI, XZ,X3, X4 as follows:
(1) 21 = 211,
F4a;r3
(F-1 ~14123;~~,31z;F,.23;z2(z~)(~r]2;z2 (2 1 ))(U4))) where FTijll; ;xi(Xj)denotes the cumulative distribution function for Xj, applied to Xj, given Xi under the conditional copula with correlation Tijlk. Figure 22.3 shows the procedure of sampling value of 2 4 graphically. Notice that the sampling procedure for D-vine uses conditional distributions as well as inverse conditional distributions. We shorten the notation by (4)
24 =
~ ~ ~ 1 3 ; F r 2 3(I21 ; 1 3
D. Kurowicka and R . M. Cooke
314
dropping the ‘‘r”s and write the general sampling algorithm as:b
..
When the bivariate distributions are indexed by conditional rank correlations, the correlations to be specified are r 1 2 1 T1312, T141237 T151234r r23,
r2413,
. . . Tl,n-112...n-2r
r l , n l 2...n-1,
T 2 5 ( 3 4 ~. . . r 2 , n - 1 ( 3 ...n - 2 1 T2,n(3...n-11
T 3 4 ~ T3514r
. . . T3,n-114...n-2,
T3,nl4 ...n-17
... Tn-2p-1,
(1)
Tn-2,nln-l,
rn - 1,n7
Notice that the conditional rank correlations can be chosen arbitrarily in the interval [-1, 11; they need not be positive definite or satisfy any further algebraic constraint. Two distributions with the same conditional correlations (1) and the same conditional copulae, are the same distribution. When the vine-copula distribution is given as a density, the density approach to sampling may be used. Let V = (7’1 . . .Tn-l) be a regular vine on n uniform variables (XI,.. . Xn), let Em be the edge set for tree T,, and for e E Em with conditioning variables D,, let C ~ ~ I be D ,the copula density associated with e . 1 2 show that the density for a distribution specified by
bInstead ofx3 = F-123 etc.
we writex3= F-12
Distribution-Free Continuous Bayesian Belief Nets
315
the assignment of copulae to the edges of V is given by n- 1
n-1
(2) m=l eEE,
where, by uniformity, the densities fi(xi) = 1. This expression may be used to sample the vine distribution; namely, draw a large number of samples (XI,. . . 2), uniformly, and then resample these with probability proportional to (2). This is less efficient than the general sampling algorithm given previously; however it may be more convenient for conditionalization. 22.3. Continuous bbns
We associate nodes of a bbn with univariate random variables (1, ..n} having uniform distributions on ( 0 , l ) . We will associate the arcs, or “influences,” with (conditional) rank correlations according to the following protocol:
(1) Construct a sampling order for the nodes, that is, an ordering such that all ancestors of node i appear before i in the ordering. A sampling order begins with a source node and ends with a sink node. Of course the sampling order is not in general unique. Index the nodes according to the sampling order 1,.. . ,n. (2) Factorize the joint in the standard way following the sampling order.
With sampling order is 1 , 2 , .. . , n, write: P(1,. . . , n ) = P(l)P(2(1)P(3(21).. . P ( n ( n- 1 , n - 2 , . . . , l ) . (3) Underscore those nodes in each condition, which are not parents of the conditioned variable and thus are not necessary in sampling the conditioned variable. This uses (some of) the conditional independence relations in the belief net. Hence if in sampling 2 , . . . , n variable 1 is not necessary (i.e. there is no influence from 1 to any other variable) then
P(1,.. . ,n) = P(l)P(2(1)P(3(21).. . P(n(n- 1 , n - 2 , . . . ,L). (3) The underscored nodes could be omitted thereby yielding the familiar factorization of the bbn as a product of conditional probabilities, with
D. Kurowicka and R . M . Cooke
316
each node conditionalized on its parents (for source nodes the set of parents is empty). (4) For each term i with parents (nonunderscored variables) il..ip(i) in (3), associate the arc i p ( i ) - h --+ i with the conditional rank correlation
r ( i ,i p ( i ) ) ;k = 0 ~ ( i ip (,i ) - k / i p ( i ) , . - . i p ( i ) - k + l ) ; 1 I k I P (-~ 1,)
(4)
where the assignment is vacuous if { i ~ . . i ~ ( = ~ ) 0. } Assigning conditional rank correlations for i = 1,..n, every arc in the bbn is assigned a conditional rank correlation between parent and child. Let V i denote a D-vine on i variables. The following theorem shows that these assignments uniquely determine the joint distribution and are algebraically independent:
Theorem 22.1: Given (1) a directed acyclic graph ( D A G ) with n nodes specihing Conditional in-
dependence relationships in a bbn, (2) the specification of conditional rank correlations (4), i
=
1, ...n and
(3) a copula realizing all correlations [-1,1] for which correlation 0 entails independence,
the joint distribution is uniquely determined. This joint distribution satisfies the characteristic factorization (3) and the conditional rank correlations in (4) are algebraically independent. Proof. The first term in (3) is determined vacuously. We assume the joint distribution for (1, ...i - l } has been determined. Term i of the factorization (3) involves i - 1 conditional variables, of which { i p ( % ) +.l. & , - I } are conditionally independent of i given { i l ,. . i p ( i ) } .We assign
r(i,ijlil,...ip(i)) = 0; ip(i)< i j 2 i - 1.
(5)
Then the conditional rank correlations (4) and ( 5 ) are exactly those on V ainvolving variable i, that is, in the last column of matrix ( 1 ) . The other conditional bivariate distributions on V i are already determined. It follows that the distribution on (1,..i} is uniquely determined. Since zero conditional rank correlation implies conditional independence,
P(1, ...i ) = P(il1, ...i - 1 ) P ( 1 ,...i - 1) = P(ili1, ...ip(i))P(il, ...ipci)),
Distribution-he Continuous Bayesian Belief Nets
317
from which it follows that the factorization (3) holds. 0 Nodes and arcs can be added or deleted from a bbn quantified with this protocol, without reassessing previously assessed correlations. This is significant difference with respect to parametric continuous bbns, in which partial regression coefficients of the child given all of its parents must be assessed. When a parent is added or deleted, the remaining coefficients must be assessed again. We sample Xi using the sampling procedure for VZ.In general it is not possible to keep the same order of variables in successive D-vine, and some conditional distributions will have to be calculated as in Example 22.1.
Example 22.1: Let us consider the following bbn on 5 variables.
El Sampling order: 1, 2, 3, 4, 5. Factorization: P( 1)P(2)1)P(3)21)P(4)321)P( 5 )4321). Rank correlations that have t o be assessed: T217 T3l I T437 T42(37T54r T52(4*
In this case D4 = 0 ( 4 , 3 , 2 , 1 ) but the order of variables in V5 must be 0 ( 5 , 4 , 2 , 3 , 1 ) .Hence this bbn cannot be represented as one vine.
Using the conditional independence properties of the bbn, the sampling procedure can be simplified as: = u1,
D . Kurovicka and R. M. Cooke
318
The conditional distributions F 2 1 3 ( 2 2 ) ,F 2 1 4 ( 2 2 ) are not known and must be calculated. In calculating these distributions the formula (2) is used.c f23(227z3)
=
F213(22)
=
F214(x2)
=
I’1
C21(Z2,21)C13(21,23)du
f213(U)dU =
11 22
1
12
22
1 C43(24,23)
f23(u7 23)dTi
* f23(u,
23)
* c4213 (F413(24),
F213(21)) d23dV7
where c4213 is a density of the copula with correlation ?-4213. Since the diagonal band copula is not supported on the entire unit square, it is preferable to use mixtures of diagonal band copulae which realize the desired correlations while assigning positive probability to each point in the unit square. Good results are obtained by mixing a diagonal band copula with the independent copula; these mixtures have lower information relative to the uniform density, under a correlation constraint, and do not significantly impede computations. This was done in the following example. 22.4. Example: Flight Crew Alertness Model
In Fig. 22.4 a flight crew alertness model adapted from the discrete model described in Roelen, et aL3 is presented. In the original model all chance nodes were discretized t o take one of two values “OK” or “Not OK.” Alertness is measured by performance by a simple tracking test during off-duty moments. The results are scored on an increasing scale and can be modeled as a continuous variable. Continuous distributions for each node must be gathered from existing data or expert judgment.14 The distribution functions are used to transform each variable to uniform on the interval (0,l). COfcourse f23 is also a copula but in general will not belong t o the copula family used on the right hand side.
Distribution-& ee Continuous Bayesian Belief Nets
319
Required (conditional) rank correlations are found using the protocol described in Sec. 22.3. These were assessed by experts in the way described in Kraan,” and Kraan and Cooke.lGThis indirect assessment of rank correlations was found successful. In Kraan and CookelGwe find: “Experts easily understood the question and had no trouble giving meaningful answers.” In Fig. 22.4 (conditional) rank correlation is assigned to each arc of the bbn. These numbers are chosen to illustrate this approach and are based on the assessments of in house experts.
Operational load
Fig. 22.4. Flight crew alertness model.
The sampling algorithm for distribution described by the bbn in Fig. 22.4 is the following. 21
= u1,xg = u g , 5 4 = U4’ 5 7 = u7,
IC3 = F&:;z2(F4:12;z1
(U3))r
25
= F
x6
=
F~.1;,4(F~51,4i~~14(55)
x8
=
F6i;z6 (FG:1614;53
(u6)),
(FG:p6;z7(u8))).
The main use of bbns is in decision support, and in particular updating on the basis of possible observations. Let us suppose that we know before the flight that the crew didn’t have enough sleep and they will have a long flight. Let us assume that the crew’s hours of sleep correspond to 25th percentile of hours of sleep distribution and the fly duty period is equal to 80th percentile of the flight duty period distribution.
320
D. Kurowicka and R . M. Cooke
We seek policies that could compensate loss of the crew alertness in this situation. Firstly we require that the number of night hours on the flight should be small (equal to 10th percentile). This improves the situation a bit (dotted line in Fig. 22.5). Alternatively we could require having long resting time on a flight (equal to 90th percentile). This results in a significant improvement of the crew alertness distribution (see dashed line in Fig. 22.5). Combining both of these policies improves the result even more.
Fig. 22.5.
Four conditional distributions of crew alertness.
Notice that in comparing different polices it is not necessary to know actual distributions of given variables. Our decisions can be based on quantile information. We might think of the transformation from quantiles to physical units of the variables as being absorbed into a monotonic utility function. Thus, conclusions based on quantiles will hold for all monotonic utility functions of the random variables. Notice also that this quantification requires eight numbers. If the individual nodes are described with discrete distributions involving K outcomes, then 22K algebraically independent numbers are required. This demonstrates the dramatic reduction of assessment burden obtained by
Distribution-Free Continuous Bayesian Belief Nets
321
quantifying influence as conditional rank correlation. 22.5. Conclusions
The discrete bbns have recently become a very popular tool in modeling of risk and reliability. Their popularity is based on the fact that influence diagrams capture engineer's intuitive understanding of complex systems, and at the same time serve as user interfaces for sophisticated software systems. Continuous bbns can significantly reduce the assessment burden. Parametric continuous bbns have the advantage of enabling analytic updating. On the other hand, assessing partial regression coefficients may be unintuitive, especially if the variables must first undergo transformation to joint normal. Further, adding or deleting variables requires re-assessing previously assessed partial regression coefficients. Distribution free continuous bbns have an advantage in this regard; their primary disadvantage is that updating must be done by Monte Carlo simulation. References 1. F. V. Jensen, Bayesian Networks and Decision Graphs, New York: SpringerVerlag (2001). 2. D. Kurowicka and R. M. Cooke, The vine copula method for representing high dimensional dependent distributions; application to continuous belief nets, Proc. Winter Simulation Conference (2002). 3. A. L. C. Roelen, R. Wever, A. R. Hale, L. H. J. Goossenes, R. M. Cooke, R. Lopuhaa, M. Simoons, and P. J. L. Valk, Casual modeling for integrated safety at airport, Proc ESREL 2003, Safety and Reliability 2, 1321-1327 (2003). 4. T. J. Bedford and R. M. Cooke, Vines-a new graphical model for dependent random variables, Ann. of Stat. 30 (4), 1031-1068 (2002). 5. D. Kurowicka, J. Misiewicz, and R. M. Cooke, Elliptical copulae, Proc. of the International Conference on Monte Carlo Simulation - Monte Carlo, 209-214 (2000). 6. R. M. Cooke and R. Waij, Monte Carlo sampling for generalized knowledge dependence with application to human reliability, Risk Analysis 6, 335-343 (1986). 7. H. Joe, Multivariate Models and Dependence Concepts, London: Chapman and Hall (1997). 8. Mari D. Doruet and S. Kotz, Correlation and Dependence, London: Imperial College Press (2001). 9. R. B. Nelsen, An Introduction to Copulas, New York: Springer (1999). 10. D. Lewandowski, Generalized diagonal band copulae - applications to the vinecopula method, presented at the DeMoSTAFI 2004 Conference in Quebec, Canada (2004).
322
D. Kurowicka and R .
M. Cooke
11. R. M. Cooke, Markov and entropy properties of tree and vines-dependent variables, in Proceedings of the ASA Section of Bayesian Statistical Science (1997). 12. T. J. Bedford and R. M. Cooke, Probability density decomposition for conditionally dependent random variables modeled by vines, Annals of Mathematics and Artificial Intelligence 32, 245-268 (2001). 13. D. Kurowicka and R. M. Cooke, A parametrization of positive definite matrices in terms of partial correlation vines, Linear Algebra and its Applications 372, 225-251 (2003). 14. R. M. Cooke, Experts in Uncertainty. New York: Oxford University Press (1991). 15. B. C. P. Kraan, Probabilistic Inversion in Uncertainty Analysis and related topics. ISBN 90-9015710-7, PhD dissertation, TU Delft (2002). 16. B. C. P. Kraan and R. M. Cooke, Processing expert judgements in accident consequence modeling, Radiation Protection Dosimetry 90 (3), 31 1-315 (2000).
CHAPTER 23 STATISTICAL MODELING AND INFERENCE FOR COMPONENT FAILURE TIMES UNDER PREVENTIVE MAINTENANCE AND INDEPENDENT CENSORING
BO HENRY LINDQVIST Department of Mathematical Sciences Norwegian University of Science and Technology N-7491 Trondheim, Norway E-mail: [email protected]
HELGE LANGSETH Department of Mathematical Sciences Norwegian University of Science and Technology N-7491 Trondheim, Norway E-mail: [email protected] Consider the competing risks situation for a component which may be subject to either a failure or a preventive maintenance (PM) action, where the latter will prevent the failure. It is then reasonable to expect a dependence between the failure mechanism and the PM regime. The chapter reconsiders the so-called repair alert model, which is constructed for handling such cases. A main ingredient here is the repair alert function, which characterizes the “alertness” of the maintenance crew. The main emphasis of the chapter is on statistical inference for the model, based on possibly right-censored data. Both nonparametric and parametric inference is studied. The methods are applied to two different datasets.
23.1. Introduction We consider the competing risks situation occurring when a potential component failure at some time X may be avoided by a preventive maintenance (PM) at time 2. The experienced event will in this case be at time Y = m i n ( X , Z ) , and it will either be a failure or a PM. It is convenient to use the notation 6 = I ( Z < X ) to denote the type of event, where 323
324
B. H. Lindqvist and H. Langseth
I ( A ) is the indicator function of the event A. Thus 6 = 0 means that the component fails and S = 1 means that it is preventively maintained. The observable result is now the pair (Y,S), rather than the underlying times X and Z, which will often be the times of interest. For example, knowing the distribution of X would be important as a basis for maintenance optimization. It is well known,172 however, that in a competing risks case as described here, the marginal distributions of X and Z are not identifiable from observation of (Y,6 ) alone unless specific assumptions are made on the dependence between X and 2. The most used assumption of this kind is to let X and Z be independent, in which case identifiability follows. This assumption is not reasonable in our application, however, since the maintenance crew is likely to have some information regarding the component's state during operation. This insight is used to perform maintenance in order to avoid component failures. We are thus in practice usually faced with a situation of dependent competing risks between X and Z. Lindqvist, Stcbve and Langseth3 suggested a model called the repair alert model for describing the joint behavior of failure times X and PM-times Z. This model is a special case of random signs censoring, C ~ o k eunder ,~ which the marginal distribution of X is identifiable. Recall that 2 is said to be a random signs censoring of X if the event { Z < X } is stochastically independent of X , i.e. if the event of having a PM before failure is not influenced by the time X at which the component fails or would have failed without PM. The idea is that the component emits some kind of signal before failure, and that this signal is discovered with a probability, which does not depend on the age of the component. The repair alert model extends this idea by defining in addition a repair alert function that describes the "alertness))of the maintenance crew as a function of time. The main emphasis of the present chapter is on statistical inference for the repair alert model. It will be assumed that data are available for a sample of N independent observations of (Y,S), which may be right censored. In the case of censoring we only know that Y is greater than the censoring time, but do not know the type of event (failure or PM) that would have been eventually experienced. Independent censoring will be assumed in this case. This assumption is reasonable in many cases and is needed to identify the distribution of (Y,6 ) and hence the distribution of X under random signs censoring. The ability to handle censored data is important for practical applications, and this is the main motivation for the present chapter.
325
Failure Data Censored by Preventive Maintenance
Two examples will be given: In the first example we reconsider the data given by Mendenhall and Hader.6 These data are type I censored at a fixed time T , but were for illustrative purposes analyzed in Lindqvist et al.3 without taking these censorings into account. The second example is based on data from the OREDA database7 and are also considered by Langseth and Lindqvist.8 The component failures can in this example be due to several different failure modes. We study one of the failure modes with respect to failure time X and PM-time 2, while treating failure and PM events for the other failure modes as censorings.
23.2. Notation, Definitions, and Basic Facts We assume that ( X ,2 ) is a pair of continuously distributed life variables, with the properties that P ( X = 2 ) = 0 and 0 < P ( Z < X ) < 1. The cumulative distribution functions of X and 2 are, respectively, F x ( t ) = P ( X t ) and F z ( t ) = P ( Z 5 t). Now let (Y, 6) define a competing risk case between X and 2. Here Y = min(X, 2 ) and 6 = I ( Z < X ) . The distribution of (Y,S) is characterized by the subdistribution functions of X and 2, defined respectively by F;((t) = P ( X 5 t , X < 2 ) = P(Y 5 t,6 = 0 ) and F z ( t ) = P(Z 5 t , Z < X ) _= P(Y 5 t,6 = 1). Note that the functions F i and F; are nondecreasing with FZ(0) = FS(0) = 0 and F;((m) F;T(m)= 1. Any pair of functions K1, Kz satisfying these conditions, will be referred to as a subdistribution pair. We next define the conditional distribution functions of X and 2 respectively by #x(t) = P(X 5 tlX < 2 ) and FZ(t) = P ( Z tlZ < X ) . Note that &(t) = F ; ( t ) / F / ; ( m ) , #z(t) = F;)(t)/F;(m). For convenience we assume the existence of densities corresponding to each of the functions defined above, i.e. f x ( t )= F i ( t ) , f / ; ( t ) = F$(t), f x ( t )= p i ( t ) ,and similarly for 2. It follows by definition that the subdistribution functions F; and F;) are identifiable from observation of (Y, 6). In practice this means that if an infinite sample of (Y, 6) is available, then we can estimate the subdistribution functions without error. On the other hand, the marginal distribution functions F x and Fz are not identifiable in this manner from observation of (Y,6).172 Thus, even with an infinite sample of (Y,6)we are unable to estimate F x and Fz exactly.
.s
+
<
B. H. Lindqvist and H. Langseth
326
23.3. The Repair Alert Model
Definition 23.1: The pair (X, 2)of life variables satisfies the requirements of the repair alert model provided the following two conditions both hold:
(i) The event (2 < X} is stochastically independent of X (i.e. 2 is a random signs censoring of X ) .
(ii) There exists an increasing function G with G(0) = 0 such that for all 2
> 0,
The function G is called the cumulative repair alert function. Its derivative g (which we shall assume exists) is called the repair alert function. The repair alert model is, as already noted, a specialization of random signs censoring, obtained by introducing the repair alert function g. Part (ii) of the definition means that, given a potential failure at time X = x , and given that a PM will be performed before that time, the conditional density of the actual time Z of P M is proportional to g. The repair alert function is meant to reflect the reaction of the maintenance crew. Thus g ( t ) ought to be large at times t for which failures are expected and the alert therefore should be high. Langseth and Lindqvistg simply used g ( t ) = Xx(t) where AX is the hazard rate of the failure time X . It is seen that the repair alert model is completely determined by the marginal distribution function FX of X , the cumulative repair alert function G , the probability q 3 P ( Z < X), and the assumption that the event (2< X } is independent of X . In fact, given those ingredients it is straightforward to derive a valid joint distribution for ( X ,Z ) . 3 From the definition we obtain the following expressions for the subdistribution and conditional-distribution f ~ n c t i o n s : ~
(!) (2) (3) (4) F;(t) = qF;(t).
(5)
It follows from (1)-(2) that the marginal distribution function FX as well as q are identifiable under the repair alert model, being functions of the
Failure Data Censored by Preventive Maintenance
327
subdistribution function F i (t).Moreover, (1) and (3) imply the following relation between the conditional distribution functions Fz for Z and Fx for X , F z ( t ) > F,y(t) for all t > 0. (6) This property can be used in a graphical check of plausibility of a repair alert model for a dataset by plotting empirical estimators of Fx and Two examples are given in Fig. 23.1.
Fz.
(a)
(b)
Fig. 23.1. Empirical subdistribution functions p z ( t ) (thick line) and p x ( t ) (thin line) for the VHF data (a) and OREDA data (b).
The ordering (6) between FX and FZ holds whenever Z is a random signs censoring of X (C00ke4). In fact, Cooke4 proved that this ordering is also sufficient for the existence of a joint distribution of X and Z satisfying the requirements of random signs and having a given set of subsurvival functions consistent with px and Fz. As a strengthening of Cooke’s result it was shown in Lindqvist et al.3 that whenever (6) holds, there is an essentially unique repair alert model having a given set of subsurvival functions for X and 2 consistent with Fx and Pz.A precise formulation of this is given by the following result:
Theorem 23.1: Let K1,K2 be a subdistribution pair such that K2 is differentiable. Suppose furthermore that
Then there exists a pair ( X ,Z ) of life variables which satisfy the requirements of the repair alert model and which are such that
F;(t) = K l ( t ) , F;(t) = K2(t) for all t 2 0.
B. H. Lindqvist and H. Langseth
328
Moreover, for any such pair (X, 2 ) we have Fx(t) = Kl(t)/Kl(co),q = KZ(oa), while the cumulative repair alert function G is uniquely (modulo a multiplicative constant) given by
(7) (8) for all t > 0, where t o
> 0 is a fixed, arbitrary constant.
The theorem is proved in Lindqvist et al.3 Note that the expression (7) for G is obtained from equations (3)-(4) which imply that
A simple example of a cumulative repair alert function is G(t) = to where P > 0 is a parameter. Then g(t) = ptov1 so = 1 means a constant repair alert function, while ,f3 < 1 and P > 1 correspond to, respectively, a decreasing and increasing repair alert function. It follows, furthermore, that for this repair alert function we have
(9) Thus cost efficient PM performance corresponds to large values of /3, since this implies that PM can be expected to be close to the potential failure time.
23.4. Statistical Inference i n the Repair Alert Model 23.4.1. Independent censoring Let (Y,6) be the result of a competing risk case between failure time X and PM-time 2 of a component. Suppose now that observations of (Y,b)may be right censored by a random variable C which is independent of (X, 2 ) and hence of (Y,6). Then by considering the competing risk case between Y and C it follows by independence that the marginal distributions of Y and C are identifiable. However, in order to identify the underlying repair alert model we need to have identifiability of the distribution of the pair (Y,6 ) . Fortunately, this is the case. To see this, note first that the probabilities P ( y I Y 5 y+dy, 6 = 0, Y < C) and P ( y 5 Y 5 y dy,6 = 1,Y < C) are identifiable from observation
+
Failure Data Censored by Preventive Maintenance
329
of the competing risk case between X , Z and C.But these probabilities can be written as, respectively, f:(y)P(C c > y)dy and fi(y)P(C > y)dy by independence of Y and C. Thus, assuming that P(C > y) > 0 for all y, the subdistribution functions of X and 2 are identifiable since the distribution of C is. Hence the underlying repair alert model can be identified as well.
23.4.2. Datasets and preliminary graphical model checking
Let there be N independent right-censored observations as described in the previous subsection. By extending the notation of Bedford and Cookello Sec. 23.5, we may let these observations be represented on the form XI,... ,x m , zl,.. . , z,, c1,. . . ,cr, which are, respectively, the observed times to failure, the observed times to PM, and the observed times to censoring. For practical illustration we use two datasets. The first one, from Mendenhall and Hader,‘ gives failure times for ARC-1 VHF communication transmitter-receivers of a single commercial airline. They will later be referred to as the VHF-data. Failed units were removed from the aircraft for maintenance. However, in some cases the apparent failures were unconfirmed upon arrival at the maintenance center, as the unit exhibited satisfactory operation when tested there. Thus, the failure times can be divided into two groups, unconfirmed, 2, and confirmed failures, X . There are m = 218 observations of X , and n = 107 observations of 2. The data were censored at time r = 630, and there are r = 44 such censored observations. This gives a total of N = 369 observations in the dataset. The second dataset was prepared by Langseth and Lindqvist.8 These data are failure times of a single mechanical component taken from the OREDA database17and will be referred to as the OREDA data. The component under study could fail due to several different failure modes. In the present data study we will focus on failures of type “2,” and treat the other failures as independent censorings of the failure times. The component failures are either “criti~a1,’~ X , or ‘ L n ~ n ~ r i t i(degraded ~al” or incipient), 2. We will only use data starting from the tenth event, that is, after the first critical failure was repaired. This gives us m = 12 observations of X , n = 29 observations of 2,and r = 37 censored observations, a total of N = 78 cases. The resulting data are given in Table 23.1. Suppose we want to fit a repair alert model to the data. By (6) we need to have Fz(t) > py.(t) for all t > 0. For a graphical verification of this, we use nonparametric estimators of the conditional distribution functions fix and F.z as derived by Lawless,ll Sec. 23.2. We then start by computing the
330
B. H. Lindquist and H. Langseth Table 23.1. OREDA data. The first line contains the observed failure times xi; the second line contains the observed P M times z j ; the two last lines contain the censoring times C k .
xi: ~ j : Ck:
1,1,5,8,10,11,11,13,25,80,85,117 1,1,1,1,1,1,1,1,3,3,3,3,4,5,7,8,10,12,12,14,17,18,24,24,28,28,28,32,36 1,1,2,2,2,2,2,2,3,3,4,4,4,5,6,6,6,7,7,7,10,12,12,12,12,13,19,30,31,32, 32.47.49.61.65.76.97
Kaplan-Meier estimator S ( t ) based on the right-censored sample of Y s , i.e. the union { y i } of the xi and the z j with the Ck being censorings. This leads tn
where R ( t ) is the total number of units (components) which are a t risk just before t , while d ( t ) is the number of observed events (failure or PM) at time t. The subdistribution functions can next be estimated using equation (9.2.5) in Lawless," which in our notation can be written as
(10) (11) Recall that the conditional distribution functions are given by Fx ( t )=
Fi(t)/Fi(co) and likewise for F z . Moreover, recall that F;(co)+F;(co) = 1. The natural estimators of Fx and p' are obtained by dividing (10)-(11) by I';;(co) and fi;(co), respectively. However, the estimates k;(co)and k;(co)do not necessarily add t o 1, so Lawless" suggests t o normalize them to have sum 1. Thus, defining
(12) we obtain the estimators (13) (!4) Figure 23.1 shows the plots of &(t) and @ Z ( t ) obtained in this way for the two datasets. The required inequality (6) is apparently satisfied for the estimated functions, and we conclude that it is indeed meaningful to fit repair alert models t o both datasets. Formal tests for investigations of this kind are considered by Dewan et a1.12
331
Failure Data Censored by Preventive Maintenance
23.4.3. Nonparametric estimation In this subsection we suggest simple nonparametric estimators of q, Fx and G for the repair alert model. First, a natural estimator for q will be the $ defined by (12). For the VHF-data this equals q = 0.33, and the OREDA data gives q = 0.71. Next, by (1) we may estimate FX by Fx given in (13) and depicted in Fig. 23.1. It remains therefore to estimate G. Following Lindqvist et al.3 we start from the definition of G ( t ) in (8), repeated here for ease of reference,
G ( t ) = exp
{
dy L t r :Y - Fx(&'(y))
}.
We then proceed by substituting the estimator &(t) for F x ( t ) . It follows from (10) and (13) that
fix
is constant on intervals [ze,xe+l), with value
P x ( z ~ )Thus . F x ( ~ ~ l ( y=) jx(xe) ) for ~z(xe) Iy
< ~z(se+l),e = 1,.. . , m - 1.
By selecting to = x1 and t = xi in (8), we obtain
Since G(t) is only determined modulo a constant, we can define
G(x1) =
1. Finally, substituting f i z ( t ) from (14) for Fz(t), we obtain the nonparametric estimator for G ( t ) defined at the points t = xi:
(15) We have tacitly assumed that F z ( t ) > Fx (&'(t)) for all C in this development. This assumption is theoretically justified by ( 6 ) , but in practice it may still happen that fiz(se)
I Fx (kgl(xe)) for some C.
In this case
we suggest to put the corresponding factor of (15) equal to 1. Figure 23.2 shows the described estimator of G for the two datasets with logG(si) plotted against logxi. The motivation for these plots is to check whether the parametrization G ( t ) = to is plausible. In that case we
332
B . H. Lindqvist and H. Langseth
will have log G(t) = log t , so we would expect plots of log G(xi) against log xi to be approximately a straight line with slope ,8. This is roughly true in Fig. 23.2. Based on the plots, we may estimate the slopes of the curves to be around 5 (VHF-data), and 0.7 (OREDA-data). These estimates are therefore our first guesses of p.
5
Fig. 23.2. Nonparametric estimate logG(ri) plotted against logzi for the VHF data (a) and the OREDA data (b).
23.4.4. Parametric estimation
In this section we assume the special parametric model where X is exponentially distributed with fx(z) = Xe-xz, while G ( t ) = t P . For notational simplicity, we recall the definition of the incomplete Gamma function, I?($, t ) = Jt" wQ'-le-"'dw. Note that the integral converges for all real $ when t > 0, and for all $ > 0 when t = 0. Following Crowder,2 the contributions to the likelihood from an uncensored observation is given by the subdensity function at the observed time. Thus, (1)-(5) imply that the likelihood contribution from an zi is f/F(xi) = (1 - q ) f x ( z i ) = (1 - q)Xe-Azi; the contribution from a z j is f ; ( z j ) = q . g ( z j ) Jz'[fx(t)/G(t)]dt = qXp(Xzi)P-lI'(l - p,Xzi), and finally the contribution from a censoring Ck is P(min(X,Z) > ck) = 1- ( F ; ( c ~ + ) F ; ( c ~ )= ) e-'Q - q ( X c k ) P . r(i - p, X C ~ ) . The total likelihood for the data is obtained as the product for each
333
Failure Data Censored by Preventive Maintenance
data point. Taking the logarithm we obtain the log-likelihood function
l(A, p, q ) = mlog(1 - q ) m
+ nlogq + (n+ m) log x + n1ogp
1
2=
n
n
2=1
2=1
r
+ Clog[exp(--Xck) - q ( A c k ) P . r(i - P,
(16)
AC~)].
k=l Maximum likelihood estimates of the parameters A, P and q can be found by maximizing (16), which needs t o be done numerically. It turns out that the E M - a l g ~ r i t h m 'is~ useful here. The general idea is t o augment the data artificially in order t o obtain a more tractable likelihood function, for which there may exist simple expressions for the maximum likelihood estimators. This is the so-called M-step (maximization step) of the EMalgorithm. The M-step alternates in an iterative manner with the E-step (expectation step), in which we compute the conditional expectation of the augmented likelihood function conditional on the observed data. During the M-step we shall assume that we have always observed X and 6, while 2 was observed only when 6 = 1. Furthermore, we assume that none of these observations are censored by C. It is practical t o change slightly the meaning of the x, and z3. We now assume that there are N triples (z,, z,, 6,). Here 6, = 0 if x, < z,, in which case we observe only x,,6,, and 6, = 1 if z, < x,, in which case we observe the whole triple (x,,z,, 6,). The augmented likelihood now becomes
which by taking the logarithm gives the augmented log-likelihood, N
I A ( x , P, 4) = N
log
-
xi
+ Nlog(1
N -
4) - bg(1 - q)
i= 1
6i a=
N
N
N
N
i=l
i= 1
i=l
i=l
1
(17)
We consider the M-step first, where we find the maximum likelihood estimators from (17). Then we consider the E-step where we replace the unobserved terms of the log-likelihood by their expected values, conditional on the observed data and the current parameter estimates.
334
B. H. Lindqvist and H . Langseth
M-step: Maximization of (17) gives us explicit expressions for the maximum likelihood estimators:
(18) E-step: In this step we compute the expected value of (17) given our data 51,. . . ,x,, 21,. . . ,z,, c1,. . . ,c,.. For the observations where the failure time xi is observed, we have Si = 0 and the value of zi is not needed. For the observations where the PM-time zi is observed we have bi = 1 while xi is not observed. Hence, we need to replace the corresponding xi and log xi in (17) by their conditional expectations. These are computed by first noting that the conditional density of X given (2 < X, 2 = z } is
f(zl2 < x,z = 2) =
5- P e-
x1:
x P - q i - p, xz)
for x > z ,
and hence that E[XIZ < X, 2 = z] = r(2 - p,Xz)/(X . r(l - p,Xz)) and E[log(X)JZ< X, 2 = z ] = log(w)w-P exp(-w)dw/I'(l - p, Xz) 1% (XI. Finally, we consider the observations where the censoring time C = c is observed. In this case we do not observe bi, and hence from (17) we need to compute E [ X l m i n ( X , Z ) > c ] , E[bImin(X,Z) > c ] , E [bX1 min (X, 2 ) > c], and E [blog X 1 min ( X ,2 ) > c]. After some algebraic manipulations, we find that
JE
P ( Z < XI min(X, 2 ) > c) = -
f(zl min(X, 2 ) > c, 2 < X)= -
f(xl min(X, 2 ) > c, X < 2) =
f(zl min(X, 2 ) > c, 2 < X) = -
Failure Data Censored by Preventive Maintenance
335
From these conditional densities we find the desired expectations. The EMalgorithm now proceeds by using the augmented dataset to re-estimate the parameters using (18), using these new estimators to generate a new augmented database, and so on until convergence. The resulting estimates for the VHF and OREDA-data are given in Tables 23.2 and 23.3, respectively. We also include bounds for approximate 95% confidence intervals based on standard log-likelihood theory. We see that p appears to be larger for the VHF data than for the OREDA data. This is in correspondence with Fig. 23.1, where the sub-distribution functions for the VHF data are closer together than those of the OREDA data. It is also interesting to note that = 1.00 for the OREDA data. This corresponds to choosing g ( t ) proportional to the hazard rate of the failure times, as in the class of models investigated by Langseth and Lindqvist.’ Finally, we note that the confidence interval for 0in the VHF data extends all the way to infinity. The meaning of using 0 = 00 should be seen in relation to (9), and indicates that when an unconfirmed failure ( 2 )was observed, it occurred immediately before a failure (X)would have been realized. Table 23.2. Maximum likelihood estimates and approximate 95% confidence intervals for parametric repair alert model for the VHF-data. Parameter A
P 4
Estimate Lower bound 3.10. 10-3 2.73. lo-’ 4.44 2.08 0.318 0.270
Upper bound 3.51. lo-’ 00
0.369
Table 23.3. Maximum likelihood estimates and approximate 95% confidence intervals for parametric repair alert model for the OREDA-data. Parameter
x P 4
Estimate 1.80. lo-‘ 1.00 0.621
Lower bound 1.04. lo-’ ,553 0.461
Upper bound 2.86. lo-’ 2.74 0.771
23.5. Concluding Remarks In this chapter we have considered the repair alert model, which describes
a specific dependence structure between failures (X) and preventive maintenance (2).We extend our previous work3 by including the possibility of
336
B. H. Lindqvist and H . Langseth
independent censoring due to an external source. The use of our model is exemplified by analyzing two different datasets: The VHF data are type I censored at a given time T but have previously been analyzed3 without taking this into account. The OREDA dataset describes several different failure modes, and we handle this situation by focusing on one particular failure mode and consider all other failure modes as external censorings. In general this may lead to dependent censoring, thus violating the assumption of independent external censoring. Independent censoring may, however, be justified in the example from physical properties of the involved failure mechanisms. Otherwise, an assumption of independent censoring may be forced by mathematical convenience or the lack of knowledge of the underlying failure mechanisms. Next, it may be noted that the OREDA dataset is rather heavily censored, which may lead to less confidence in the nonparametric curves (right panes of Figs. 23.1 and 23.2). In fact, these curves are based on the Kaplan-Meier estimator which may behave poorly in the presence of heavy ~ens0ring.l~ We will not pursue this further here, but only note that this motivates the use of parametric models in such cases. We finally comment on the choice of the exponential distribution for X in the parametric modeling of Sec. 23.4.4. By the memoryless property of the exponential distribution it seems likely that preventive maintenance would not be appropriate in this case. However, random signs censoring and the repair alert model still have meaning for the exponential distribution since they impose a strong dependence between the P M time 2 and the failure time X . Recall the intuitive content of random signs censoring being that the component emits a signal before failure. It is inherent in the model that the time from this signal to the failure would occur, is not distributed as X as one might believe from the memoryless property. In an extreme case the signal may be emitted immediately before the failure, in which case 2 , when observed, is always approximately equal to X. On the other hand, a main reason for choosing the exponential distribution in our parametric model is for ease of exposition. In principle, any parametric model for X may of course be used, the Weibull distribution being a natural choice. Still, for maintained components it is well known that failure times often appear to be nearly exponentially distributed since there are usually few observations in the tail of the distribution.
Failure Data Censored by Preventive Maintenance
337
Acknowledgment Comments from an anonymous referee are greatly acknowledged.
References 1. A. Tsiatis, A nonidentifiability aspect of the problem of competing risks, Proceedings of the National Academy of Sciences, USA 72, 20-22 (1975). 2. M. J. Crowder, Classical Competing Risks, Chapman and Hall/CRC, Boca Raton (2001). 3. B. H. Lindqvist, B. Stove, and H. Langseth, Modelling of dependence between critical failure and preventive maintenance: The repair alert model, to appear in Journal of Statistical Planning and Inference, Special Issue on Competing Risks (2005). 4. R. M. Cooke, The total time on test statistics and age-dependent censoring, Statistics and Probability Letters 18,307-312 (1993). 5. R. M. Cooke, The design of reliability databases, Part I and 11, Reliability Engineering and System Safety 51,137-146 and 209-223 (1996). 6. W. Mendenhall and R. J. Hader, Estimation of parameters of mixed exponentially distributed failure time distributions from censored life test data, Biometrika 45, 504-520 (1958). 7. OREDA, Offshore Reliability Data Handbook, 4th ed. Distributed by Det Norske Veritas, P.O. Box 300, N-1322 Hovik, Norway, http://www.oreda.com/ (2001). 8. H. Langseth and B. H. Lindqvist, Competing risks for repairable systems: A data study, to appear in Journal of Statistical Planning and Inference, Special Issue on Competing Risks (2005). 9. H. Langseth and B. H. Lindqvist, A maintenance model for components exposed to several failure mechanisms and imperfect repair, in Mathematical and Statistical Methods in Reliability, Series on Quality, Reliability and Engineering Statistics, Vol. 7,Eds. B. H. Lindqvist and K. A. Doksum (World Scientific Publishing, Singapore, 2003), pp. 415-430. 10. T. Bedford and R. M. Cooke, Probabilistic risk analysis: Foundations and methods. Cambridge University Press, Cambridge (2001). 11. J. F. Lawless, Statistical models and methods f o r lifetime data, 2nd ed., WileyInterscience, Hoboken, NJ. (2003). 12. I. Dewan, J. V. Deshpande, and S. B. Kulathinal, On testing dependence between time to failure and cause of failure via conditional probabilities, Scandinavian Journal of Statistics 31,79-91 (2003). 13. A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Ser. B 39,1-38 (1977). 14. R. G. Miller, Jr., What price Kaplan-Meier? Biometrics 39,1077-1081 (1983).
This page intentionally left blank
CHAPTER 24 IMPORTANCE SAMPLING FOR DYNAMIC SYSTEMS BY APPROXIMATE CALCULATION OF THE OPTIMAL CONTROL FUNCTION
ANNA IVANOVA OLSEN CeSOS, NTNU, Otto Nielsens v. 10 NO-7491, Trondheim, Norway E-mail: [email protected] Web: http://www. cesos. ntnu.no/N ivanova
ARVID NAESS Dept of Mathematical Sciences €3 CeSOS, NTNU, A . Get2 vei 1 NO-7491, Trondheim, Norway E-mail: [email protected] An iterative method for estimating the failure probability for timedependent reliability problems has been developed. The system response has been modeled by a diffusion process, solution of an It6 stochastic differential equation. On the first iteration a simple control function has been built using a design point weighting principle for a reliability problem. After time discretization, two points were chosen to construct the compound deterministic control function. It is based on the time point when the first maximum of the homogenous solution has occurred and on the end point of the considered time interval. An importance sampling technique is used in order to estimate the failure probability functional on a set of initial values of state space variables and time. On the second iteration, the concept of optimal control function developed by Milstein has been implemented to construct a Markov control which provides better accuracy of the failure probability estimator than the simple control function. On both iterations, a concept of changing the probability measure by the Girsanov transformation is utilized. As a result the lower variance of estimates is achieved by fewer samples and the CPU time is reduced by order of 10 compared with the crude Monte Carlo procedure.
339
340
A . Ivanova Olsen and A . Naess
24.1. Introduction The evaluation of system performance is one of the objectives of mathematical modeling and numerical analysis in structural mechanics. Therefore, the first passage characteristics play an important role in structural reliability. The first passage failure occurs when the response of the structure first reaches a prescribed tolerance level. Unfortunately, the general solution of this problem is not available in closed analytical form. Therefore, by now, there is only the Monte Carlo method which is a universal procedure for estimating the system failure probability. However, since such failure probabilities are often rather small, the standard implementation of the Monte Carlo method requires an immense number of samples to secure acceptably accurate estimates. Therefore, as in the time-independent reliability, the methods of variance reduction, also referred to as importance sampling techniques, are desirable to be used for dynamic problems. In this paper we investigate the use of methods from the theory of stochastic control for developing a practical importance sampling method for dynamical systems. It is considered that the dynamic response is modeled by a diffusion process, solution of an It6 stochastic differential equation. The importance sampling density is obtained by successive implementation of the Girsanov transformation of probabilistic measures.' Though this theorem is applicable to a wide class of dynamic systems, the scope of this paper is oscillatory systems with single degree of freedom excited by white and colored noise. Recently, several author^'>^>^ applied the First Order Reliability Method (FORM)5 to the randomly excited dynamic systems. The so-called design point excitations lead the response of the structure to the failure domain while the unlikeliness of such paths is taken into account by a correction process. This procedure creates a deterministic control for the dynamic system. However, the analytical expression is obtainable only in the linear case for one degree of freedom systems. For the nonlinear and multidimensional problems special optimization algorithms have to be applied to assess the control function for each time step and, in some cases, the evaluation of failure probability is restricted only to the stationary regime. Herein, it is shown how the Markov control function can be designed for a linear oscillator in the transient regime. Furthermore, it is assumed that this procedure may be used for nonlinear systems which can be suitably linearized. This method might be applied to seismic structural design, estimation of a nuclear plant reliability, as well as, forecasting some economic
Importance Sampling for Dynamic Systems
341
strategies such as, for instance, values of American options. 24.2. Problem Formulation. Reliability and Failure
Probability Performance and design requirements of a dynamical system restrain the acceptable values of the response to the safe domain Ds E R", where D s is some given Bore1 set. The probability (1)
ps(T) = P{ES), where
Es
= {w : X ( t ) = X ( t ,w ) E
Ds for every0 < t 5 T }
(2) is the probability that the response X ( t ) will stay inside the safe domain throughout the time interval (0,TI. For simplicity it has been assumed that X ( 0 ) = 0 E Ds.p s ( T ) is referred to as the reliability of the system. Thus the probability of failure ~ F ( Tcan ) be defined as ~ F ( T=) 1 - ps(T). The response is represented by a continuous stochastic process X ( t , w ) : (0,T ] x il + W" defined on a (complete) probability space (il,F,P ) and for the time interval (0, TI, where il is a space of elementary events, F is a aalgebra of measurable sets of il and P : F + [0,1]is a probability measure.6 Thus the probability of failure ~ F ( T=)1 - P ( E s } , where Es f F by the assumptions made, is given as P F ( T ) = 1-
/
Es
dP(w) =
/
I [ g ( X ( .">,)I
d P ( w ) = E ( I [ g ( X ) ] ) , (3)
n
where g ( X ) is a limit state function and I [ g ( X ) ]is an indicator function defined as follows: 0 : WEE^ (4) I[g(X)1= 1 : otherwise.
{
The dynamic response of the system is assumed to be given by an It6 stochastic differential equation (SDE)
d X ( t ) = rn(t,X)dt4-( T ( t , X ) d W ( t ) , 0 5 s 5 t 5 T , X ( s ) = 5, (5) where X ( t ) = ( X I @ )X,(t), , . . . , X n ( t ) ) T E W" and W ( t )E W is a standard Wiener process with respect to the measure P. m(t,X ) , a(t,X ) E R" are drift and diffusion coefficients, respectively, which satisfy suitable Lipschitz and growth conditions.6 The mean value estimator of the failure probability by the Monte Carlo method is . N
(6)
342
A . Ivanova Olsen and A . Naess
where zi (i = 1,. . . ,N ) are N realizations of the process X ( t ) . The standard error of this estimator7 is approximately By the Girsanov theorem' the Wiener process W ( t )in Eq. ( 5 ) may be substituted with another stochastic process @(t)defined as
l/dm.
(7) where U ( T , W ) is a real nonanticipative bounded process.6 Thus, @(t)becomes a standard Wiener process on ( s , T ] with respect to a new measure p , which is defined below. Then the SDE (Eq. 5) takes the form (0 5 s 5 t 5 T ) d X ( t ) = m(t,X ) d t
+ g ( t ,X)v(t, X ) d t + ~ ( Xt ), d W ( t ) ,
X ( s ) = x, (8)
where the control function u ( ~ , w = ) v ( t , z ( t , w ) )is given below in Eq.(13). This transformation of measures may be done according to the RadonNikodym theorem.8 By the Novikov conditioq6 the Radon-Nikodym derivative (dP/dP)-l is integrable. Hence we can write
(9) where E denotes expectation with respect to P , and
(10 The reliability problem is now considered on the time interval (s,T ]with initial condition X ( s ) = z, and ~ F ( T s,x) ; denotes the associated failure probability. The Monte Carlo estimator of the failure probability based on the measure is then by Eq. (9)4
(11) In the terms of stochastic control t h e ~ r ythe , ~ Radon-Nikodym derivative depends on the control process for the system (5). Usually four types of control may be distinguished.6 However, for the method proposed in the present paper, only two of them are used, namely, a deterministic control v(t) (also called open-loop control) and a Markov control v ( t ,2). Analyzing Eqs. 3 and 9, it is obvious that the failure probability is independent on the choice of controller, whereas the variance does depend
Importance Sampling for Dynamic Systems
343
on v. Invoking again the theory of stochastic control, it can be shown that there exists an optimal control function7 for minimizing the functional
J = E 12"9()1)] --=
,
(12)
viz.. (13) .
.
From Eq. (13) it follows that the optimal control function depends on the failure probability ~ F ( T s,x), ; which has to be known for all values of the arguments (s,x)E (O,T]x Rn. But, of course, if the answer is known, there is no need in controlling the system. On the other hand, if the failure probability can be calculated approximately on a suitable finite grid in (O,T]x Rn, then it is possible to construct a control function that may provide a more accurate estimation of the failure probability. 24.3. Numerical Examples
24.3.1. Linear oscillator excited by white noise
A very wide class of engineering systems can be modeled, to a first approximation, in terms of linear differential equation, if the amplitude of motion is relatively small. Thus, the considered numerical example is linear oscillations of a light damped spring-mass system. The motion of this linear oscillator excited by white noise N ( t ) follows the second order differential equation: X ( t ) + 2 [ w o X ( t ) W i X ( t ) = yN(t) (14) X ( s ) = 2, X ( s ) = k. Under standard assumptions, Eq. (14) can be written as a SDE (Eq. 5). Let X I = X and X , = X , then in matrix form
+
d X ( t ) = AX(t)dt +ybdW(t), where W ( t )is a standard scalar Wiener process" and
(15)
(16)
The considered safe domain is given by D s = ( 2 : x < x c , x E R}, where x, is some critical threshold for the displacement response. By replacing the Wiener process W ( t ) by W(t) as explained in the previous section, Eq. (15) takes the form d ( t )= A r i ( t ) d t+ y b v ( t , j ( t ) ) d t + y b d @ ( t ) .
(17)
344
A . Zvanova Olsen and A . Naess
-
On the first iteration it is assumed, that the control function is a deterministic open loop control, ie., w ( t , X ( t ) ) = w(t), then Eq. (14) has an explicit analytical solution (18)
where 1
h(t) = -e--Ewot
sinwdt,
wd
Wd
=w g J m l
(19)
is an impulse response function and (20)
is an homogenous solution for the displacement X ( t ) . Using the forward Euler approximation, Eq. (18) can be approximated by the difference scheme: ()21)
+
where s < (is 1)At < . . . < j . At, is = [s/At] ( [ a ] denotes the largest integer smaller than or equal to a), j 5 m = [ T / A t ] ,t j = j . At, wi = v ( i .A t ) , hji = h ( ( j- i) . A t ) , V i a = W ( i. At) - W ( ( i- 1).At) are the increments of the Wiener process. Ui are independent standard Gaussian variables with zero mean and unit standard deviation. In accordance with the definition of Ds,the limit state function is given as:
-
g ( X ) = z, -
x,
(22)
where zc is a preassigned critical threshold. Then the mapping to the finite standard Gaussian space becomes possible because the number of random variables is finite due to the time discretization. Thus, the FORM-like procedure can be applied to design a deterministic controller. The idea was developed by Ref. 2. In Ref. 3 the method is called the design point oscillations. Hence the mapping is a standard procedure given (23)
i=i,+l
i=i,+l
The excitations leading to out-crossing of the safety margin gu(V) = 0
Importance Sampling for Dynamic Systems
345
at time t j are obtained then as
(24)
and the design point index is given
pj
= p(tj) =
4c
uf.
(25)
i = i . +1
Unless the control function is known, assume that the design point of the system (Eq. 17) is at the origin of the standard Gaussian space, then the simple control function" is given as zc - F( t j - S,5,k ) (26) vi = hji.
c j
7
Let At
--+
h5kAt
k=i,+l
0, then (26) in integral form is given
(27)
Note that the design point index at a time expressed as
tj
of the system (14) can be
(28)
The corresponding m-dimensional sampling importance sampling density function3 is given as the sum of Gaussian distributions weighted at each time step by the coefficient (29)
where a(.) is the normal Gaussian distribution, p ( t j ) is the design point index at ti. Let t ( j )be the time where the first passage is assumed to happen for the given control function. For some cases, where a single passage time is dominant, one time point t* : {p(t*) = min p(t)} is enough to estimate the failtE(O,TI
ure probability accurately as for instance in the case of fatigue fracture. For an oscillatory system such an approach would lead to an underestimation of the failure probability." In this case considering zero initial conditions, the
346
A. Iuanoua Olsen and A . Naess
domain near the end of the time interval T contributes most to the failure probability, whereas for nonzero conditions the area near tmax,which is the time of the first maximum of the homogenous solution F ( t - s, z, k ) , is the most probable failure interval. Taking into account these features and keeping the number of samples low, two time points are suggested to use, t ( l ) = T and t ( 2 )= tmax,weighted according to their contributions. Simulations with the given deterministic control functions (Eq. 27) are performed for the set of the initial values of z, 3 and s in order to compose the approximate data for the Markov control (Eq. 13). The plot of 3D data is shown in Fig. 24.1. This surface corresponds t o an initial time s = 0. The consequent surfaces for following initial times will be deeper at the origin, and tend to 1 at the distant regions.
Fig. 24.1. Results of the first iteration-approximate failure probability in R2 smoothed by B-spline.
On the second iteration, the value of the Markov control function at each time step is estimated from 3D data calculated on the first iteration, ie. wi = v ( t i - l , Z l ( t i - l ) , Z ~ ( t i - l ) ) .After the sample path crosses the critical threshold zc, the control function is set to zero. If this is not done, the control will keep the oscillations on the threshold level. The examples of
347
Importance Sampling for Dynamic Systems
the sample paths and control functions are shown in Fig. 24.2. In the upper figure (a) the original paths (Eq. 14) are plotted, in the middle (b) the system is controlled by the deterministic function w ( t ) and in the bottom figure (c) the final Markov control v ( t , q , z 2 ) is active. The time axis is normalized by the natural period T, = of the free oscillations, while the ordinate is divided by the stationary standard deviation of the response X ( t ) , 00 = where Go = y 2 / n is one-sided spectral density of the input noise.
m,
I
I
I
I
I
I
I
I
0
1
2
3
4
5
6
7
8
0
1
2
3
4
5
6
7
8
0
1
2
3
4
5
6
7
8
-4'
t l Tn Fig. 24.2. Examples of the samples a) (*) z ( t ) and (0)k ( t ) for the original Eq. (14), b) (*) z ( t ) and ( 0 ) k ( t ) for the transformed Eq. (17) with deterministic control (0 v ( t ) and c) (*) z ( t ) and (0)k ( t ) for the transformed Eq. (17) with Markov control (0 v(t,zl, Q); critical threshold zJuo = 5.3.
Calculated by (ll),the failure probability estimates are compared with
A . Ivanova Olsen and A. Naess
348
the crude Monte Carlo results in Fig. 24.3 for various threshold values. The results from the crude Monte Carlo method are obtained with different number of samples (> lo3) by the convergency criterium that standard error equals E = 0.05, otherwise the number of samples is N = lo6, if e has not converged to 0.05. Whereas N used by the importance sampling procedure in the two steps is about lo4.The standard error of the estimates is varying between 0.05 and 0.2. However, the method converges to the chosen value E = 0.05 if the number of samples used at the second step is increased, although not more than by a factor of 10. This will not affect the total calculation time very much, because most of the computational burden is associated with the first step.
lo-z
'
......
1o4
lo" PF 10-8
'
i o * I _. . . . . . . . . .]. . . . . . . . . _].. . . . . . . . ,;. . . . . . . . . . I . . . . . .o..j.*. ;
1o-lO
10-12
-. . . . . . . . .
. . . . . . .;. . . . . . . . . .] *
.:. . . . . . . . . .:. . . . . . . . . . ; . . . . . . . . . .:. . . . . . . . . . . .:O .........
; I . . . . . . . . ..:
1 0 5 -.
........
.I . . . . . . . . . I . . . . . . . . .
j
.........
.I..
.......
.I.
:
.......
*
;
.o: . . . . . . . . . .: : o
: I
-
Fig. 24.3. Failure probability for the linear oscillator excited by white noise.
24.3.2. Linear oscillator excited
by colored noise
To exploit the quality of the proposed procedure, another application is suggested. In this example we shall assume that the excitation process of
Importance Sampling for Dynamic Systems
349
the linear oscillator is a white noise that has passed through a linear, second order filter. Hence, the following model is obtained X(t)
Z(t)
+ 2&JoX(t)+ WZX(t)= Z(t),
+ 2J,w,Z(t) +
(30)
= yN(t),
(31) where X ( s ) = 0, X ( s ) = 0, Z ( s ) = 0,Z(s) = 0. N ( t ) denotes standard Gaussian white noise, and the parameter y determines the intensity of the noise term. Equations (30-31) are interpreted in such a manner that they can be recast into the form given by (5) in terms of the state space vector ( X ( t ) ,X ( t ) ,Z ( t ) ,Z ( t ) ) .The parameters of the system are chosen such that the filter is broad banded, i.e. & = 0.5, whereas the system response is assumed to be a narrow banded process in the sense of small damping, specifically 6 = 0.05. The safe domain to be considered in this section is of the similar type Ds = {(z,x,z , i.)T : z < z, (i, z , i )E R3}. Consider the auxiliary system W,2Z(t)
X(t)
+ 2 ~ ~ o X (+t ~) i X ( t=) y e N ( t ) ,
-2
-2 I
(32) where a corresponding parameter ye is evaluated from the criterium that the spectral density $/(2x) of the equivalent white noise is the same as the contribution from the colored noise Z ( t ) ,namely
_ ‘e -
(33) 2x 2x( (W; - u p 4E;W;W;). Then on the first iteration the same procedure is used (Sec. 24.3.1) to estimate the approximate failure probabilities. According to the Girsanov transformation, the control for the original system should be added as a drift to the white noise N ( t )
+
X ( t ) + 2&doX(t) + W i X ( t ) = Z ( t )
(34)
Z ( t ) + 2[,w,Z(t) + W , 2 Z ( t ) = y(vz(t,F) + N ( t ) ) , - - - -
(35)
where F = (X,X, 2,2)T.However, from the previous step the approximation of the failure probability is obtained only as a function of two space variables, X ( t ) , X ( t ) and time. Using the linear property of the differential operator, it is appropriate to use this controller in the first system Eq. (34)
+ Z ( t ) + 2<,w,Z(t) + W , 2 Z ( t ) = y N ( t ) ,
i ( t ) 2<~0k(t) + u $ f ( t ) ( l+ & X 2 ( t ) )= Z ( t )+ yevx(t,j?,?) (36) where basically
(37)
- -
z(t> = ~ ( t+)y e t J X ( t ,X ,XI.
(38)
A. Ivanova Olsen and A. Naess
350
Hence, the original control V Z ,needed for the evaluation of the RadonNikodym derivative (Eq. lo), can be found as a solution of the direct problem
wx
+ 2J,W,VX +
W,”VX
Y
= -Vz.
(39) Ye On the second iteration for each time step during the integration of the controlled system (36-37), a corresponding value of the control process is chosen depending on the state space variables at this time, i.e. wx(ti,Z ( t i ) ,i ( t i ) ) .To recover the control VZ, the differentiation, required by (39), is performed numerically and subsequently smoothed by a B-spline approximation in the time domain. In Fig. 24.4 various samples of the control process and the response variable X ( t ) are presented. In Fig. 24.5 the failure probability vs threshold is plotted. The results from the crude Monte Carlo method (a solid line, standard error c = 0.05 or N = lo6 if E has not converged to 0.05) are compared with the results obtained at the second step of the importance sampling procedure with N = 100. The total number of samples, used on two iterations, is about lo4. The standard error of the estimators is varying also between 0.05 and 0.2. 24.4. Conclusions
The study has demonstrated good convergence of the failure probability estimates provided by the proposed importance sampling procedure to the results of the crude Monte Carlo method when those are available. The linear single degree of freedom oscillator subjected to white and colored noise is considered. The procedure allows us to calculate robustly values of low probabilities of failure of the order low4and less. The advantage of the used methodology is that all expressions are given analytically, thus no special techniques are required. The used numerical algorithms, as spline or difference methods, are well-established procedures and available in most mat hematical packages. A lower variance of the failure probability estimates is obtained and the required CPU time are reduced at least by a factor 10 compared with the crude Monte Carlo method at a failure probability of 10-4 Furthermore, with increasing of the critical threshold, calculation time for the proposed procedure will remain more or less unchanged, whereas the crude Monte Carlo method will require more samples and, correspondingly, more time for evaluation. Further work is in progress to use the approximation of optimal control function for nonlinear problems which can be appropriately linearized.
Importance Samplingfor Dynamic Systems
351
Fig. 24.4. Examples on the sample paths (critical threshold x c / u g = 4): a) samples of the linear oscillator: ( 0 ) uncontrolled x (Eq. 31) and (*) controlled 1 0%. 361, b) ( 0 ) control function vx, (*) control function vz.
Acknowledgments This project is financially supported by t h e Research Council of Norway. The authors gratefully acknowledge this support.
References 1. I. V. Girsanov, On Transforming a Certain Class of Stochastic Processes by Absolutely Continuous Substitution of Measures, Theory of Probability and Its Applications, Vol. 3, 285-301 (1960). 2. H. Tanaka, Application of an Importance Sampling Method to Timedependent System Reliability Analyses Using the Girsanov Transformation, Proceedings of ICOSSAR’97, eds. N. Shiraishi and M. Shinozuka and Y.K. Wen, 411-418 (1998). 3. M. Macke and C. Bucher, Importance Sampling for Randomly Excited Dynamical Systems, Journal of Sound and Vibration 268, 269-290 (2003).
352
A. Ivanova Olsen and A. Naess
Fig. 24.5.
Failure probability for the linear oscillator excited by colored noise.
4. A. Naess, Comments on Importance Sampling for Time Variant Reliability Problems, Stochastic Structural Dynamics 4, 197-202 (1999). 5. R. E. Melchers, Structural Reliability Analysis and Prediction, John Wiley & Sons Ltd (1999). 6. B. Bksendal, Stochastic Differential Equations: a n Introduction with Application, Springer (1998). 7. G. N. Milstein, Numerical Integration of Stochastic Differential Equations, Kluwer Academic Publishers (1995). 8. W. Rudin, Real and Complex Analysis, McGraw-Hill, Inc. (1987). 9. W. H. Fleming and R. 0. Rishel, Deterministic and Stochastic Optimal Control, Springer-Verlag (1975). 10. T. T. Soong and M. Grigoriu, Random Vibration of Mechanical and Structural Systems, PrenticeHall, Inc. (1997). 11. A. Naess and C. Skaug, Importance Sampling for Dynamical Systems, I C A S P Applications of Statistics and Probability 8 , 749-755 (2000).
CHAPTER 25 LEVERAGING REMOTE DIAGNOSTICS DATA FOR PREDICTIVE MAINTENANCE
BROCK OSBORN Applied Statistics Laboratory, G E Global Research Research Circle, Naskayua, NY USA E-mail: [email protected]
Embedded sensors are used in many applications to collect remote diagnostics data, providing valuable information about the health of a system during its operation. In addition to detecting problems that need immediate attention, this information can be used in conjunction with historical reliability data to determine how the system is aging or degrading over time. This paper will describe an approach for analyzing remote diagnostics data for the purpose of predictive reliability and maintenance. 25.1. Introduction
One of the most important advances in the area of predictive maintenance involves the use of embedded sensors in complex systems t o collect remote diagnostics data. This data provides a snapshot of the health of the system during its operation. Such sensors are found in a variety of complex systems, including power plants, automobiles, home appliances, medical equipment, and aircraft engines. The data collected may include internal temperatures and pressures, external environmental conditions, stresses, loads, etc. This data has proved t o be very valuable in detecting problems that require immediate attention such as detecting low oil pressure in an automobile and recognizing excessive exhaust gas temperatures in an aircraft engine. However, this data also has another important use: when combined with historical reliability data it can provide insights into how the system, and the various parts that make up that system, age or degrade over time. The first challenge in utilizing such data for reliability analysis lies in the development of the proper mathematical framework t o “piece together” 353
354
B. Osborn
information from each use of each system in the study (e.g., each flight of each of the aircraft engines in a fleet) in order to properly account for the aging that is taking place over time. Clearly, if each system in a study operates under one set of conditions throughout its life and parts do not move between systems, then the analysis is greatly simplified. Lifetime regression methods like Weibull regression (see Lawless’ and Meeker2)can be used to account for the various operational environments and determine the relative importance of various remote diagnostic parameters on predicting reliability. Unfortunately, however, the problem is usually complicated by the fact that every use of an engine may be different. For example, during one flight an aircraft engine may be operated at full thrust at take-off with an ambient temperature of 75 degrees, and during the next flight may be operated at 50% of full thrust with an ambient temperature of 34 degrees. Physics-based models and experience suggest that the wear experienced by turbine blades in the high-pressure section of the engine under these flights is different, and therefore neither hours of operation nor number of cycles (i.e., number of take-offs) can fully account for the total aging that takes place on turbine blades over time. One interesting application of this type of data is in the area of part inventory management. Repaired (vs. new) parts form an increasing proportion of inventories in industrial settings. Advances in material science have resulted in the ability to repair many parts, thereby extending their useful life and drastically reducing the amount of money spent on new part manufacturing, sometimes by as much as a factor of ten. This change in inventory composition creates challenges and opportunities. Specifically, given an inventory of parts with various histories, can we find optimal ways in which to allocate these parts to maximize utilization and minimize overall cost? In this paper we will describe a methodology for utilizing remote diagnostics data to quantify the “true age” of a part in a complex system, and we will leverage this information to solve problems like the one described above. Our focus will be on applications in aircraft engines, since these complex systems provide a wealth of insight and results are directly applicable to many other systems.
25.2. Accounting for the Accumulation of Wear Although many techniques have been developed to account for the accumulation of wear (e.g., BagdonaviEius and N i k ~ l i n Singp~rwalla~), ,~ here we will consider a simple yet powerful method for accounting for wear based on
Leveraging Remote Diagnostics Data for Predictive Maintenance
355
the cumulative hazard function. We are motivated by the fact that the hazard function represents the instantaneous failure rate of the part and that the cumulative hazard function is the accumulation of these failure rates over time. If we can relate changes in stress to changes in the instantaneous failure rate (the hazard function)] then the cumulative hazard function can be utilized as the true metric for age of the part. By translating between the time domain (e.g., cycles or hours) and the cumulative hazard domain, one is able to develop a methodology for accumulating the wear that occurs during each use of the system in which the part is installed. This method is actually a simple extension of a technique of accumulating damaged in accelerated testing studies presented by N e l ~ o n . ~ As an illustrative example, consider a part installed in a system which operates under one set of conditions (stress level A ) from time 0 until tl and then operates under a different set of conditions (stress level B ) from time tl until t 2 . Our goal is to determine the cumulative hazard of the part at time t 2 . Let H ~ ( t 1 be ) the value of the cumulative hazard for the part at time tl (we will use upper case, H ( . ) , for the cumulative hazard and lower case, h(.),for the hazard). For example, if we assume that time to failure follows a Weibull distribution with shape parameter PA and scale parameter V A , then:
In order to find the time, s, at stress level B that corresponds to this level of wear, we solve H B ( s ) = H ~ ( t 1 for ) s. In the case of Weibull we have:
and obtain:
(1) Therefore] since the stress level changes from A to B at time t l , we can calculate the total cumulative hazard for the part at time t 2 by:
H(t2) = HA(t1)
+
rLZt1) hB(x)dz = H B ( s f
t2 -tl)
which in the Weibull case is equal to:
(2)
356
B. Osborn
25.3. Application to Inventory Management of Turbine Blades We will now apply these results to the problem of repairable turbine blades from the high-pressure turbine section of an aircraft engine. Each jet engine of our study has 72 of these blades. At each shop visit, all seventy-two blades are removed and inspected. Blades that have exceeded their useful life are scrapped. All other blades are sent to a repair shop where they undergo various types of repair such as recoating and resurfacing. Seventytwo blades are also chosen from the inventory to repopulate the engine to complete the overhaul process. As an option, new blades (or a mixture of new and used blades) can be selected to repopulate the engine. Each turbine blade can be repaired two, three, or possibly more times before it needs to be scrapped, and the cost of repair is approximately one tenth of the cost of a new blade. The time to scrap distribution (which will define blade reliability for this paper) is greatly affected by how the blade is used, i.e., blades installed in engines subject to high stress (high thrust take-offs, high ambient temperatures, etc.) will need to be scrapped sooner than those installed in engines that are subject to low stress. It is also important to note that blades are scrapped long before they actually fail to function, and therefore are never in danger of causing the engine to actually fail t o operate. To simplify the analysis, we will assume that the time to scrap distribution is unaffected by the repair process, i.e., blades that are repaired are as bad as old from a time-until-scrap reliability perspective. In a more detailed analysis (one that we are currently exploring), one could consider a more general state based process with states such as new, needing repair, needing to be scrapped, etc. However, analysis of the data available suggests that this simplification provides a good first-order approximation. Our problem can be stated as follows: given an initial inventory of blades with various usage histories, and a fleet of engines which enter the shop at various times, what is our optimal blade allocation process? Specifically at any point in time when a particular engine from the fleet enters the shop, which blades should be selected from the inventory to repopulate this engine in order to maximize the total usage of all of the blades and minimize overall cost? Here usage is defined as the total number of cycles accumulated on a blade before it must be scrapped, and cost is associated with the initial cost of the blade plus the cost of each repair. Note that blades are only examined at a shop visit and the timing
357
Levemging Remote Diagnostics Data for Predictive Maintenance
of shop visits for a particular engine, i.e., the number of flights (cycles) between shop visits, is controlled by several factors including scheduled maintenance, performance degradation, etc. If a blade is installed on an engine that runs 1000 cycles between shop visits, it will receive credit for all 1000 cycles, even if it entered the “needing to be scrapped” state after the engine’s lootfiflight. To motivate the optimization process for blade allocation, consider a simple example. Suppose that a blade in the inventory has already accumulated 1200 cycles, has been repaired three times, and has the potential of being repaired at most once more before it must be scrapped. NOWassume that we have two engines A and B in which to potentially install the blade. Further assume that if the blade is installed in engine A, it will accumulate an average of 1000 additional cycles and will have a good chance of needing to be scrapped at the next shop visit. Alternatively, if the blade is installed in engine B it will only accumulate 800 additional cycles but will most likely not need to be scrapped at the next shop visit. So here’s our choice: install the blade on engine A resulting in a total of 1200 1000 = 2200 cycles, or install the blade in engine B, followed by a repair, followed by an installation in engine A resulting in a total of 1200 + 800 1000 = 3000 cycles. If we define value as total cycles accumulated on the blade divided by the total number of repairs, then in the first case, the value would be = 733.3 cycles/repair while in the second case the value would be = 750 cycles/repair. As this simple example illustrates, the fact that the blade is only examined at the shop visit dictates a strategy that leads us to install the blade which is near the end of its life (i.e., a few cycles before it reaches the needing to be scrapped state) in the engine which will accumulate the most cycles before the next shop visit.
+ +
25.4. Developing an Optimal Solution
To determine such optimal strategies for the entire set of blades (i.e., the blades in the initial inventory and the blades currently installed on all of the engines in the fleet), we will develop a policy for optimal blade allocation. This will be done in two phases. For the first phase, we consider each blade separately and determine the optimal strategy for using that blade in various engines throughout the rest of its life given its current age. Here age is measured by its cumulative hazard function. This is a dynamic programming problem.6 For the second phase we consider the inventory of
B. Osborn
358
blades at any particular point in time and consider the particular engine in the shop. Utilizing the dynamic programming solutions in the first phase, each blade is evaluated according to how the action of installing that blade on this particular engine at this point in time will affect its future total value. The blade for which this is maximized is the optimal blade chosen for installation. We begin with the first phase, blade level optimization, and define the various characteristics of our dynamic programming model: 0
0
0
0
0
A stage of our process is defined to begin when the blade under consideration is in the inventory and an engine is in the shop requiring a blade. The state of our process at each stage is defined by: (1)either the current age of the blade defined by its cumulate hazard function or “needing to be scrapped”; and (2) which engine is in the shop. Profit for our process is defined as total cycles accumulated on the blade divided by the total cost of repairs plus the initial cost of the blade. The set of actions that can be taken at each stage consists of either installing the blade in the given engine or leaving the blade in the inventory. Finally, the last stage of our process is reached when the blade is in the scrap state, and hence can no longer be used. Therefore if we define:
0
0 0
0
VN(Z)= maximum expected profit when N stages remain and we are in state i.
R(i,a ) =reward for taking action
LY
when we are in state i.
P2j(a)= probability of moving from state i to state j when action a is taken.
A = set of all possible actions at this stage and state.
Then the dynamic programming solution is given by:
(3)
Vl(i)= maxR(i,a), WA
(4)
where Vl(i)is the maximum expected profit when we have one stage left and are in state i, i.e., the blade can only be used one more time before it
Leveraging Remote Diagnostics Data for Predictive Maintenance
359
must be scrapped. As we pointed out in the above discussion, the optimal strategy in this case is simply to leave the blade in the inventory until the engine which will provide the maximum number of expected cycles on the blade comes into the shop. This strategy will ensure that we will maximize our profit, i.e., maximize the total number of cycles divided by the total cost. For any stage before this point, we need to calculate Pij(cu),the probability that the blade will move from state i to state j if we take action a. Since at each stage there are only two possible actions, to leave the blade in the inventory or to install it in the engine currently in the shop, and since leaving the blade in the inventory will not alter the cumulated age or number of repairs on the blade, we need only concern ourselves with how the blade changes in state if it is installed on the current engine. Two quantities are needed to calculate Pij(o):its expected age at the next shop visit and the probability that it will be in the scrap state at the next shop visit. In order to simplify notation, we will assume a Weibull distribution for the time to scrap distribution and will further assume an average operating characteristic for each engine. In Section ( 6 ) we will return to the general case in which the engine’s operating characteristics can be different for each cycle. If the current engine in the shop ( e , ) is k, the current cumulative hazard is Hi, and the probability density function for the time to the next shop visit for this engine is given by f k ( t ) r then we can use Eq. (2) to calculate the expected age. Specifically, this is given by:
(5) Similarly, the probability that the blade will not be in the scrap state at the next shop visit is given by:
(6) where R k ( . ) is the reliability function for the time to scrap distribution for the blade installed on engine k evaluated at time (t s ) where s is given by E q . (1).Therefore, Pij(a = install) equals P when j is the state indicating that the blade has the age given in Eq. ( 5 ) and is equal to (1 - P) when j is the state indicating that the blade needs to be scrapped. Once this dynamic programming solution is calculated for each blade in the inventory, we are able to determine for any engine that enters the shop the expected total value (total expected number of cycles that will be
+
360
B. O s b r n
accumulated on this blade divided by the total expected cost) associated with the decision of installing each of the blades on this engine. Our optimal decision is simply to choose the 72 blades in the inventory with the highest expected total value. If there is ever less than 72 blades in the inventory, then new blades are used as needed.
25.5. Application
This solution was applied to an actual problem of turbine blade inventory management €or a fleet of 1,015 engines installed on aircraft operated by four airlines under various thrust conditions. Analysis of the historical data allowed us to group the engines into six unique operating environments. We labeled these groups A, B, C , D, E, and F. Average time between shop visits (MTBSV) was calculated for each group. In addition, historical data on over 4,000 turbine blades installed on these engines enabled us to estimate Weibull parameters for the time to scrap distributions for each of the groups. This study was important because of a proposed change in policy. Blades on engines in group A were scheduled to be upgraded. As a result, at their next shop visit, the blades on these engines would be added to the inventory (in the manner described above), but the engine would be repopulated with blades from another source. The resulting increase in inventory from these used blades created the need for exploring optimal blade allocation. At the beginning of the study, historical information was used to estimate the age (measured in cumulative hazard) of each of the blades currently installed on the 1,015 engines in the fleet in addition to the blades currently in the inventory. Two computer simulations were then run to compare random blade allocation (i.e., the current policy) with the optimal policy. In each simulation, engines entered the shop according to the distributions calculated above and blades were either sent to be repaired or scrapped according to the time to scrap distribution. If the engine was from group B, C , D, E, or F, new blades were selected from the inventory. If the engine was from group A it was simply removed from the study. Engines from groups B, C, D, E, and F were kept in the study for the duration of their particular service contract (ranging between five and twenty years). The simulation analysis was quite revealing. There was on average a 37% reduction in the number of new blades required when the optimal allocation policy was used vs. the random allocation policy. Based on the actual costs of purchasing new blades and repairing used blades, the projected savings
361
Levemging Remote Diagnostics Data for Predictive Maintenance
was over three million dollars.
25.6. Formulating a Generalized Life Regression Model In the problem described above, engines were characterized by belonging to one of six groups based on their particular operating conditions. This greatly simplified the process of calculating the effective age (cumulative hazard) for each blade in the study as it moved from engine to engine. In the more general case, we wish to determine how a part installed on an engine ages over time as the engine constantly changes its operational characteristics. This is of interest because every flight is different and, as a result, different stresses are exerted on the part. In the introduction we provided the example of ambient temperature and engine thrust at take-off as two such operational characteristics that exert stress on the high-pressure section of the engine and vary from take-off to takeoff. To formulate the proper lifetime model for a part installed on an engine with changing operational characteristics, we will leverage results from lifetime regression mode1s.l For example, consider an engine with K sensors, each of which is monitoring some operating characteristic of the engine (thrust, temperatures, pressures, etc.). Assuming a proportional hazard Weibull model with constant shape parameter 0, the scale parameter will take the form:
v , , ~= exp [QO
+~
+ . .~. + ~ ~ ~
(7) Here, X , J k represents the value of the output of the kth sensor during the j t h measurement on the i t h engine, and the Q k are the corresponding regression parameters. Then if N is the number of parts in our sample (one part per engine), Nl the number of parts that have failed, m, the number of the time of the j t h measurement on the measurements on engine i, and ta3J i t h engine (ta,o = 0), then the log likelihood function is given by: 1
31 4
~
~
1
.
(8)
By maximizing this equation with respect to Q k , we can determine which sensors are monitoring factors that significantly influence part reliability. It is also worth noting that in the special case where the qi,j are the same for
B. Osborn
362
all flights of a particular engine, this log likelihood equation reduces to the log likelihood for the Weibull regression model. In many of the applications that we have been focused on, takeoff data is used. In this case, one measurement is taken per flight and mi is the number of flights on engine i, and therefore, the time of failure or censoring (measured in cycles) for engine i. In this case, the expression ( t i , j - ti,^-^) is equal to 1. We applied this methodology to analyze time to scrap data for a part located in the compressor section on 796 CF6 engines. A total of 35 sensor measurements were considered one at a time. The likelihood ratio test was used to determine the significance of the parameters. Core speed and flight leg (length of flight in hours) were determined t o be the most significant parameters and, when combined, were found to explain over 24% of the variation in the time to scrap distribution. 25.7. Concluding Remarks
Remote diagnostics data provides a valuable resource for determining both the current and future health of a complex system. In this paper we have outlined a technique for combining this data with historical reliability information to determine how the parts of the system are aging over time and to determine which factors are most important for predicting reliability. We have also shown how this information can be leveraged to develop an inventory management strategy for optimally allocating repairable parts.
Acknowledgments The author wishes to thank the editor and an anonymous referee for their helpful comments which improved the quality of this article.
References 1. J. Lawless, Statistical Models and Methods for Lifetime Data, New York,
Wiley (1982). 2. W. Q. Meeker and L. A. Escobar, Statistical Methods f o r Reliability Data, New York, Wiley (1998). 3. V. B. BagdonaviEius and M. S. Nikulin (1997), Transfer functionals and semiparametric regression models, Biometrika 84, 365-378 (1997).
4. N.D. Singpurwalla, Survival in dynamic environments, Statistical Science 10,85-103 (1995). 5. W. Nelson, Prediction of field reliability of units each under differing dynamic stresses, from accelerated test data, in N. Balakrishnan and C. R.
Leveraging Remote Diagnostics Data for Predictive Maintenance
363
Rao (Eds.), Handbook of Statistics 20, pp. 611-621. Elsevier Science B.V. (2001) 6. S. Ross, Introduction to Stochastic Dynamic Programming. New York, Academic Press (1983).
This page intentionally left blank
CHAPTER 26 FROM ARTIFICIAL INTELLIGENCE TO DEPENDABILITY: MODELING AND ANALYSIS WITH BAYESIAN NETWORKS
LUIGI PORTINALE Dipartimento d i Infomnatica Universitci del Piemonte Orientale “A. Avogadro” Via Bellini 25/g - 15100 Alessandria, Italy E-mail: [email protected]
ANDREA BOBBIO Dipartimento di Infomatica Universitci del Piemonte Orientale “A. Avogadro” Via Bellini 25/g - 15100 Alessandria, Italy E-mail: [email protected]
STEFANIA MONTANI Dipartimento d i Infomnatica Universitri del Piemonte Orientale “A. Avogadro” Via Bellini 25/g - 15100 Alessandria, Italy E-mail: [email protected] The present work is aimed at exploring the capabilities of the Bayesian Networks (BN) formalism in the modeling and analysis of dependable systems. We compare B N with one of the most popular techniques for the dependability analysis of large, safety-critical systems, namely Fault Tree Analysis ( F T A ) . The work shows that any Fault Tree ( F T ) can be directly mapped into a BN and that basic inference techniques on the latter may be used to obtain classical parameters computed from the former. By using B N , some additional power can be obtained, both at the modeling and at the analysis level. The comparison of the two methodologies is carried on by means of a case study concerning a gas turbine controller system.
365
366
Luigi Portinale, Andrea Bobbio, and Stefuniu Montani
26.1. Introduction
Fault Tree Analysis (FTA) is a very popular and diffused technique for the dependability modeling and evaluation of large, safety-critical systems.l In FTA, the analysis is carried on in two steps: a qualitative step in which the logical expression of the Top Event (TE) (e.g. the system failure) is derived in terms of prime implicants (the minimal cut-sets); a quantitative step in which, on the basis of the failure probabilities assigned to the basic components, the probability of occurrence of the TE (and of any internal event corresponding to a logical sub-system) is calculated. On the other hand, Bayesian Networks ( B W 2 provide a robust probabilistic method of reasoning under uncertainty. They have been successfully proposed in the field of Artificial Intelligence (AI) as the most flexible formalism for reasoning under uncertain knowledge and have been applied to a variety of real-world problems. However, they have received little attention in the area of dependability with few e x c e p t i o n ~ . ~ ? ~ > ~ The present chapter is aimed at exploring the capabilities of the BN formalism in the modeling and analysis of dependable systems. Starting from the work described in Bobbio, Portinale, Minichino, and Cian~amerla,~ we compare the use of BNs with the use of FTs and the modeling and the decision power of the FTA and BN methodologies in the area of dependability. We show that any FT can be algorithmically mapped into a BN and how the results obtained from FTA can be cast in the BN setting. The major shown advantage is that, by using BN, some additional power can be obtained, both at the modeling (where several restrictive assumptions implicit in FTA can be removed) and at the analysis level (where classical probabilistic computation of FTA can be generalized). The comparison of the two methodologies is carried on through the analysis of a case study, represented by the controller of the gas turbine of the ICARO co-generative plant.6
26.2. Bayesian Networks
A Bayesian Network (BN)2 is a pair N = ((V, E), P), where (V, E) are the nodes and the edges of a Directed Acyclic Graph (DAG), respectively, and P is a probability distribution over V. Discrete random variables V = {X_1, X_2, ..., X_n} are assigned to the nodes, while the edges E represent the causal probabilistic relationships among the nodes.^a
^a Extensions are possible where nodes represent continuous random variables.
In a BN, we can then identify a qualitative part (the topology of the network represented by the DAG) and a quantitative part (the conditional probabilities). The qualitative part represents a set of conditional independence assumptions that can be captured through a graph-theoretic notion called d-separation.2 This notion has been shown to model the usual set of independence assumptions that a modeler assumes when considering each edge from variable X to variable Y as a direct dependence (or as a cause-effect relationship) between the events represented by the variables. The quantitative analysis is based on the conditional independence assumptions modeled by the net. Because of these assumptions, the quantitative part is completely specified by considering the probability of each value of a variable conditioned on every possible instantiation of its parents. These local conditional probabilities are specified by defining, for each node, a Conditional Probability Table (CPT). Variables having no parents are called root variables and marginal prior probabilities are associated with them. According to that, the joint probability distribution P of a BN having variables X_1, ..., X_n can be factorized as in Equation 1:
P[X_1, X_2, ..., X_n] = ∏_{i=1}^{n} P[X_i | Parent(X_i)].    (1)
The basic inference task of a BN consists of computing the posterior probability distribution on a set of query variables Q, given the observation of another set of variables E called the evidence (i.e. P(Q|E)). In Sec. 26.6 we will return to this issue.
26.3. Mapping Fault Trees to Bayesian Networks
In standard FTA methodology we have the following basic assumptions: (i) events are binary events (working/not-working); (ii) events are statistically independent; (iii) relationships between events and causes are represented by logical AND and OR gates. We usually adopt the following convention: given a generic binary component C, we denote with C = 1 the component failure and with C = 0 the component working. The quantification of the FT requires the assignment of a probability value to each leaf node. Since the computation is performed at a given mission time t, the failure probabilities of the basic components at time t should be provided. Under the usual hypothesis that the component failures are exponentially distributed, the probability of occurrence of the primary event (C = 1, i.e. faulty) is P(C = 1, t) = 1 − e^{−λ_C t}, where λ_C is the failure rate of component C.
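The following minimal sketch (in Python, not part of the original chapter) evaluates this exponential failure probability; the failure rate and mission time used below are illustrative values only.

```python
import math

def failure_probability(rate, t):
    """P(C = 1, t) = 1 - exp(-lambda_C * t) for a component whose time to
    failure is exponential with constant failure rate `rate` (failures/hour)."""
    return 1.0 - math.exp(-rate * t)

# Illustrative numbers only: a 2.5e-7 f/h component at t = 5e5 hours.
print(failure_probability(2.5e-7, 5.0e5))   # about 0.118
```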
We first show how a FT can be converted into an equivalent BN, and then we show, in Sec. 26.5, how assumptions (i), (ii) and (iii) can be relaxed in the new formalism. First of all it should be clear that deterministic AND/OR gates can be translated through CPTs with extreme probabilities (i.e. equal to 0 or 1). For instance, if the output event of an OR gate is C and the input events are A and B, Pr[C = 1 | A = 0, B = 0] = 0, while Pr[C = 1 | A = x, B = y] = 1 whenever x = 1 ∨ y = 1. In a dual way, if the gate is an AND gate, Pr[C = 1 | A = 1, B = 1] = 1 and Pr[C = 1 | A = x, B = y] = 0 whenever x = 0 ∨ y = 0. Similarly, any kind of Boolean function (like, for instance, the usual implicit (k:n) gates of a FT13) can be made explicit in the BN by only modifying the corresponding CPT. According to the translation rules for the basic gates, it is straightforward to map a FT into a binary BN, i.e. a BN in which every variable V has two admissible values: false, corresponding to a normal or working value, and true, corresponding to a faulty or not-working value. The conversion algorithm proceeds along the following steps (a small sketch of the gate-to-CPT translation is given after the list):
(i) for each leaf node (i.e. primary event or system component) of the FT, create a root node in the BN; however, if more leaves of the FT represent the same primary event (i.e. the same component), create just one root node in the BN;
(ii) assign to the root nodes in the BN the prior probability of the corresponding leaf node in the FT (computed at a given mission time t);
(iii) for each gate of the FT, create a corresponding node in the BN;
(iv) connect nodes in the BN as corresponding gates are connected in the FT;
(v) for each gate (OR, AND or k:n) in the FT, assign the equivalent CPT to the corresponding node in the BN.
Due to the very special nature of the gates appearing in a FT, non-root nodes of the BN are actually deterministic nodes rather than random variables, and the corresponding CPT can be assigned automatically. The prior probabilities on the root nodes coincide with the corresponding probabilities assigned to the leaf nodes in the FT.
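The sketch below illustrates the gate-to-CPT translation just described. The dictionary-based CPT representation and the function names are illustrative choices rather than the chapter's own tooling, and the k:n reading given in the comment is only one common convention for such gates.

```python
from itertools import product

def gate_cpt(gate, n_inputs, k=None):
    """Deterministic CPT of a Boolean gate: maps each tuple of parent values
    (0 = working, 1 = faulty) to Pr[output = 1 | parents]."""
    cpt = {}
    for parents in product((0, 1), repeat=n_inputs):
        failed_inputs = sum(parents)
        if gate == "OR":
            failed = failed_inputs >= 1
        elif gate == "AND":
            failed = failed_inputs == n_inputs
        elif gate == "K:N":
            # one common convention: the gate fails when at least k inputs fail
            failed = failed_inputs >= k
        else:
            raise ValueError("unknown gate type: %s" % gate)
        cpt[parents] = 1.0 if failed else 0.0
    return cpt

print(gate_cpt("OR", 2))    # {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0}
print(gate_cpt("AND", 2))   # {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
```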
26.4. Case Studies: The Digicon Gas Turbine Controller
This case study concerns the controller of the PGT2 Gas Turbine of the ICARO co-generative plant.6 For what concerns the safety and dependability assessment, the structure of the system can be summarized as follows:
(i) the Digicon PGT2 controller, composed of two subsystems: the "main controller" (MC), which provides control and shutdown functions, and the "back-up" (BU) unit, which provides only the protection function (related to two critical parameters only: temperature and rotational speed);
(ii) the "watchdog" relays associated with each hardware circuit board.
The Digicon controller carries out two main functions: Control, to ensure dependability (reliability and availability); Protection, to ensure safety. The hardware structure of the main controller is depicted in Fig. 26.1(a),
Fig. 26.1. Hardware structure of the main controller and backup unit.
where: DI - Digital input; AI - Analog input; CPU - 32-bit microprocessor; MEM - Memory; I/O - I/O bus; DO - Digital output; AO - Analog output; WD - Watchdog relay; PS - Power Supply inlet; SMC - Supply circuit of the main controller. The back-up unit provides a redundant protection with respect to two critical events only: overspeed and over-temperature. The back-up unit has a CPU independent from the main controller and uses a separate power supply circuit (operating from the same supply inlet). The back-up unit shares the following transducer signals: 2 thermocouples and 1 speed probe. The hardware structure of the back-up unit is depicted in Fig. 26.1(b). The labels appearing on the blocks of Fig. 26.1(b) have the same meaning as for the MC, and: RO - Relay output; Therm - Thermocouple signal; Speed - Speed probe. The elementary blocks of the Digicon Controller are assumed to have constant failure rates, whose values are reported in Table 26.1. We adopt as TE the occurrence of a safety-critical behavior. The Fault Tree representation of the TE is reported in Fig. 26.2. According to the
Table 26.1. Failure rates for the elementary blocks of the DIGICON controller.

  Component   Failure Rate (f/h)       Component   Failure Rate (f/h)
  I/O bus     λ_IO  = 2.0 x 10^-9      Therm.      λ_Th  = 2.0 x 10^-9
  Speed       λ_Sp  = 2.0 x 10^-9      Memory      λ_M   = 5.0 x 10^-8
  DO          λ_DO  = 2.5 x 10^-7      AO          λ_AO  = 2.5 x 10^-7
  RO          λ_RO  = 2.5 x 10^-7      DI          λ_DI  = 3.0 x 10^-7
  AI          λ_AI  = 3.0 x 10^-7      PS          λ_PS  = 3.0 x 10^-7
  SMC         λ_SMC = 3.0 x 10^-7      SBU         λ_SBU = 3.0 x 10^-7
  CPU         λ_CPU = 5.0 x 10^-7      WD          λ_WD  = 2.5 x 10^-7
Fig. 26.2. The FT for the safety critical failures.
translation algorithm presented in Sec. 26.3, the BN derived from the FT of Fig. 26.2 is reported in Fig. 26.3.
Fig. 26.3. The Bayesian network translating the FT of Fig. 26.2
In the BN of Fig. 26.3, gray ovals represent root nodes (corresponding to the basic events in the FT), while white ovals represent non-root nodes. Every node in the BN is a binary node, since the variable associated with it is a binary variable. The binary values of the variables associated with the nodes represent the presence of a failure condition (true value) or an operational condition (false value). The only chance (probabilistic) nodes of the BN are the roots (gray nodes). All the other nodes in the BN (white ovals) are deterministic nodes. To make Fig. 26.3 more self-consistent, we have labeled the non-root nodes with the corresponding Boolean function (from which the CPT can be obtained straightforwardly).
26.5. Modeling Issues
The mapping procedure described in Sec. 26.3 shows that each FT can be naturally described as a BN. However, BNs are a more general formalism than FTs; for this reason, there are several modeling aspects underlying BNs that may make them very appealing for dependability analysis. In
the following sections, we examine a number of modeling extensions to the standard FT methodology that can be exploited in a BN framework.
26.5.1. Probabilistic gates: common cause failures
Differently from FTs, the dependence relations among variables in a BN are not restricted to be deterministic. This corresponds to being able to model uncertainty in the behavior of the gates, by suitably specifying the conditional probabilities in the CPT entries. Probabilistic gates may reflect an imperfect knowledge of the system behavior, or may avoid the construction of a more detailed and refined model. A typical example is the incorporation of Common Cause Failures (CCF). CCFs are usually modeled in a FT by adding an OR gate, directly connected to the TE, in which one input is the system failure and the other input is the CCF leaf, to which the probability of failure due to common causes is assigned. In the BN formalism, such additional constructs are not necessary, since the probabilistic dependence is included in the CPT. Fig. 26.4 shows an AND gate with CCF and the corresponding BN. The value λ_CCF is the probability of failure of the system, due to common causes, when one or both components are up.
Fig. 26.4. The CCF representation in BN: the fault tree AND gate with common cause failures and the corresponding BN AND node, whose CPT is Pr(C=1 | A=0, B=0) = Pr(C=1 | A=0, B=1) = Pr(C=1 | A=1, B=0) = λ_CCF and Pr(C=1 | A=1, B=1) = 1.
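A minimal sketch of the CPT of Fig. 26.4 follows; the encoding is illustrative and λ_CCF is passed in as a plain number.

```python
def and_with_ccf_cpt(lambda_ccf):
    """CPT of the BN AND node of Fig. 26.4: the output C can fail with
    probability lambda_ccf (common causes) even when not both inputs
    A and B are failed."""
    return {(0, 0): lambda_ccf,
            (0, 1): lambda_ccf,
            (1, 0): lambda_ccf,
            (1, 1): 1.0}

print(and_with_ccf_cpt(0.01))
```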
26.5.2. Probabilistic gates: coverage
An important modeling improvement in redundant systems is the consideration of a coverage factor.7 The coverage factor is defined as the probability that a single failure in a redundant system entails a complete system failure. Coverage accounts for the fact that the recovery mechanism can be inac-
curate and the redundancy may become not operative even in the presence of a single failure. A coverage factor may also be included in combinatorial and FT models;7,8 however, it finds a very natural application in BNs by resorting to the possibility of defining probabilistic gates. The coverage c is defined as the probability that the reconfiguration process is able to restore the system to a working state when a redundant element fails. Figure 26.5 reports an excerpt of Fig. 26.3 related to the AND gate labeled Function. The figure shows a probabilistic AND gate and the corresponding CPT modeling the coverage factor. In the standard AND gate, the event Function is failed with probability 1 only when both inputs MC and WD-M are down.
Fig. 26.5. An AND gate with coverage and the corresponding CPT
In the coverage case, the node Function may be down, with a small probability equal to (1 − c), c being the coverage factor, even when only one of the inputs is down. As can be seen, the coverage factor modifies only the CPT and not the structure of the model (as in Doyle, Dugan, and Patterson-Hine7). To show the effect of the coverage factor on the dependability measures of the system, we have run the following numerical case. We have included a coverage factor c in the AND gates labeled Function, Protec and TE, and we have calculated the TE unreliability for coverage factors c = 0.9, 0.95, 0.99. The results are in Fig. 26.6. For the sake of comparison, the curve with no coverage (i.e. coverage factor c = 1) is also reported in Fig. 26.6. A sketch of the coverage CPT is given below.
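The following illustrative sketch builds the coverage CPT of Fig. 26.5; the tuple key is the pair of input values (MC, WD-M).

```python
def and_with_coverage_cpt(c):
    """CPT of the probabilistic AND gate of Fig. 26.5 with coverage factor c:
    a single failed input (MC or WD-M) brings Function down with
    probability 1 - c."""
    return {(0, 0): 0.0,
            (0, 1): 1.0 - c,
            (1, 0): 1.0 - c,
            (1, 1): 1.0}

for c in (1.0, 0.99, 0.95, 0.9):
    print(c, and_with_coverage_cpt(c))
```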
26.5.3. Multi-state variables
All the above considerations concerned binary variables. The use of multi-state or n-ary variables can be very useful in many applications7,9 where it is not sufficient to restrict the component behavior to the dichotomy
Fig. 26.6. Unreliability of TE as computed by BN with different coverage factors.
working/not-working. Typical scenarios requiring the incorporation of multi-state components are: the possible presence of various failure modes7 (short vs. open, stuck-at-0 vs. stuck-at-1, covered vs. uncovered), the different effects of the failure modes on the system operation (e.g. fail-safe/fail-danger), or various performance levels between normal operation and failure9 (Sec. 26.5.4). By dealing with variables having more than two values, BNs allow the modeler to represent a multi-valued component by means of different values of the variable representing the component itself.
26.5.4. Sequentially dependent failures
Another modeling issue that may be quite problematic to deal with using FTs is the problem of components failing in some dependent way. For instance, the abnormal operation of a component may induce dependent failures in other ones. Suppose that, in our case study, the component Power Supply (PS) actually has different behavioral modes: when PS is in state working or failed, the behavior of the overall system is the same as in the BN of Fig. 26.3 translated from the FT. When PS is in state degraded, it induces an anomalous behavior also in the supply equipment (SMC) of the main controller (MC) and in that (SBU) of the back-up unit (BU). The BN that models the described situation is reported in Fig. 26.7,
Fig. 26.7. Portion of the BN showing the influence of a PS degradation.
where only the relevant part of the BN of Fig. 26.3 is reconsidered. The PS node has three states, denoted by W for working, deg for degraded and F for failed. The prior probabilities of the PS node in the three different states are also reported on the figure. The arcs connecting node PS with both nodes SMC and SBU indicate a possible influence of the parent node PS on the children nodes SMC and SBU. This influence is quantified in the CPTs reported in Fig. 26.7, where it is shown that a degradation in PS induces a failure also in SMC and SBU with probability 0.9. The degradation of the power supply PS does not have a direct effect on the system dependability, but its effect originates from a negative influence of the degradation on other components of the system.
26.6. Analysis Issues
Typical analyses performed on a FT involve both qualitative and quantitative aspects. In particular, any kind of quantitative analysis exploits the basics of the qualitative analysis, that is, the computation of the minimal cut-sets (prime implicants of the TE). Usual quantitative analysis involves: (i) the computation of the overall unreliability of the system, corresponding to the unreliability of the TE (i.e. P(Fault)); (ii) the computation of the unreliability of each identified subsystem, corresponding to the unreliability of each single gate; (iii) the importance of each minimal cut-set, corresponding to the prior probability of the cut-set itself, assuming statistical independence among components. Any analysis performed on a FT can be performed on the corresponding BN; moreover, other interesting measures can be obtained from the BN that cannot be evaluated in a FT. Let us first consider the basic analyses of a FT and how they are performed in the corresponding BN: (i) unreliability of the TE: this corresponds to computing the prior probability of the variable TE = Fault, that is P(Q|E) with Q = Fault and E = ∅; (ii) unreliability of a given subsystem: this corresponds to computing the prior probability
of the corresponding variable S_i, that is P(Q|E) with Q = S_i and E = ∅. Differently from the computations performed on a FT, the above computations in a BN do not require the determination of the cut-sets. Concerning the computation of the cut-set importance, it is worth noting that BNs may directly produce a more accurate measure. Indeed, posing a query having the node Fault as evidence and the root variables R as queried variables allows one to compute the distribution P(R | Fault); this means that the posterior probability of each mode of each component (just working and faulty in the binary case) can be obtained. Related to the above issue is another aspect that is peculiar to the use of BNs with respect to FTs: the possibility of performing diagnostic problem-solving on the modeled system. Classical diagnostic inference on a BN involves: (i) computation of the posterior marginal probability distribution on each component; (ii) computation of the posterior joint probability distribution on sets of components; (iii) computation of the posterior joint probability distribution on the set of all nodes but the evidence ones. A brute-force sketch of these prior and posterior computations is given below.
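The enumeration below makes these computations concrete on a toy two-component OR tree; the structure and the prior values are illustrative and are not taken from the Digicon model.

```python
from itertools import product

# Toy fault tree encoded as a BN: two root components A and B with
# illustrative prior failure probabilities and a deterministic OR node TE.
priors = {"A": 0.01, "B": 0.02}

def te(a, b):
    """Deterministic CPT of the OR gate: TE = 1 iff A = 1 or B = 1."""
    return 1 if (a == 1 or b == 1) else 0

def joint(a, b):
    """Joint probability of a root configuration (a, b)."""
    pa = priors["A"] if a else 1.0 - priors["A"]
    pb = priors["B"] if b else 1.0 - priors["B"]
    return pa * pb

configs = list(product((0, 1), repeat=2))

# (i) prior probability of the top event: P(TE = 1) with empty evidence.
p_te = sum(joint(a, b) for a, b in configs if te(a, b) == 1)

# posterior marginals of the roots given the evidence TE = 1.
post_a = sum(joint(a, b) for a, b in configs if te(a, b) == 1 and a == 1) / p_te
post_b = sum(joint(a, b) for a, b in configs if te(a, b) == 1 and b == 1) / p_te

print(p_te)              # 1 - 0.99 * 0.98 = 0.0298
print(post_a, post_b)    # about 0.336 and 0.671
```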
26.6.1. Analysis example
Consider again the Digicon controller case study. Given the failure rates of Table 26.1, we can evaluate the unreliability of the TE at different mission times, from t = 1·10^5 h to t = 5·10^5 h, by computing the probability of node TE in the BN of Fig. 26.3, given a null evidence. The TE unreliability at the considered time points is plotted in Fig. 26.6 as a solid line with label Coverage 1.0 (see Sec. 26.5.2). Concerning posterior analysis, Table 26.2 reports the posteriors of each single component computed at time t = 5·10^5 h. These values have been obtained by using the SPI analysis tool.10
Table 26.2. Posterior probabilities for single components.

  Component   Posterior     Component   Posterior     Component   Posterior
  WDb         1             WDm         1             CPUb        0.37063624
  PS          0.34525986    CPUm        0.30848555    AIb         0.2333944
  SBU         0.2333944     ROb         0.19688544    AIm         0.19425736
  SMC         0.19425736    DIm         0.19425736    AOm         0.16387042
  DOm         0.16387042    Mem         0.03443292    Speed       0.00247744
  I/Ob        0.00167474    I/Om        0.00139391    Th1, Th2    0.00100097
We can notice that the two watchdogs WDm and WDb have a criticality 1, since their failures are necessary in order to have a system failure (as
it could have been easily deduced from the structure of the FT as well). Moreover, the probability of a CPU failure in case of TE occurrence is about 30% for the CPU-M of the main controller and about 37% for the CPU-B of the backup unit. Notice that these posterior values are different, even if the failure rate of both CPUs is the same, because of the different roles they play in the overall system dependability. In fact, the failures of the main controller MC and of the backup unit BU are both caused by the failure of the corresponding CPU in Boolean OR with the failure of the PER sub-system, but the failure of PER-M follows a different sequence of events than the failure of PER-B, resulting in different posterior probabilities also for the two CPUs. Table 26.3 reports the top 9 configurations of the components having the highest posterior probabilities. The configurations reported in Table 26.3 have to be read by assuming the mentioned components as faulty and all the other ones as operational; so, for instance, the most probable configuration of the components explaining the occurrence of the TE is the one assuming the power supply inlet and the two watchdogs faulty and all the other components working properly (and the probability of occurrence of such a situation is 0.06597642).
Table 26.3. Most probable posterior configurations.

  Faulty Components              Posterior Prob.
  PS, WDb, WDm                   0.06597642
  CPUb, CPUm, WDb, WDm           0.03288764
  DIm, CPUb, WDb, WDm            0.01873898
  SMC, CPUb, WDb, WDm            0.01873898
  AIm, CPUb, WDb, WDm            0.01873898
  SBU, CPUm, WDb, WDm            0.01873898
  PS, CPUm, WDb, WDm             0.01873898
  PS, CPUb, WDb, WDm             0.01873898
  AIb, CPUm, WDb, WDm            0.01873898
As expected, the failure of the watchdogs is present in every posterior configuration, since their failure is necessary for the occurrence of the TE. The diagnostic information provided by Table 26.3 is in general more precise (both qualitatively and quantitatively) than the set of MCS with their unreliability.
26.6.2. Modeling parameter uncertainty in the BN model
As a final example of the flexibility of using BN models in dependability analysis, we take a closer look at the problem of parameter uncertainty and sensitivity analysis. In the BN framework, parameters may be considered as random variables, and their uncertainty modeled more appropriately by probability distributions. To illustrate this point, we have carried out the following experiment, considering the Digicon controller case study and assuming again the PS as the exemplificative node. The input parameter is, in this case, the failure rate φ_PS of the PS node. In order to compare the present results with those obtained from the FT analysis, we have assumed that φ_PS is a random variable uniformly distributed between a minimum and a maximum around the value λ_PS = 3.0·10^-7 of Table 26.1:
λ_minPS = (1 − 0.1) λ_PS,    λ_maxPS = (1 + 0.1) λ_PS.
For the sake of simplicity, we have discretized the uniform distribution of φ_PS with three values (λ_minPS, λ_PS, λ_maxPS) with equal prior probability. In this new framework, the node PS becomes a non-root node and a child of a new root node, which is the multi-state variable φ_PS. Figure 26.8 reports an excerpt of the overall BN representing the new situation.
Fig. 26.8. The failure rate of component PS is modeled as a discrete random variable: φ_PS takes the values λ_minPS = 0.9 λ_PS, λ_PS and λ_maxPS = 1.1 λ_PS, each with prior probability 1/3, and the CPT of PS is Pr{PS = F | φ_PS = λ} = 1 − exp(−λ t).
The CPT associated with the node PS provides the failure probability of PS conditioned on the particular value assumed by its failure rate φ_PS. A forward analysis of the modified BN provides the TE unreliability when
the failure rate of the component PS is a random variable φ_PS distributed according to the prior probability shown in Fig. 26.8. The obtained results are reported in Fig. 26.9, but they do not change significantly with respect to the original case.
Fig. 26.9. Unreliability of TE when the failure rate of component PS is uniformly distributed.
A more interesting computation is the posterior analysis, whose aim is to compute the backward probabilities of the various instantiations of the random variable φ_PS (the PS failure rate) given the TE. The results are reported in Table 26.4.
Table 26.4. Posterior probabilities of the PS failure rate values versus time.

  time (h)   φ_PS = λ_minPS        φ_PS = λ_PS           φ_PS = λ_maxPS
  100000     0.314908014710765     0.333351749430096     0.351740235859138
  200000     0.320016853038519     0.333359939607828     0.346623207353651
  300000     0.322778441634621     0.333364950369460     0.343856607995917
  400000     0.324592747846168     0.333368225472299     0.342039026681531
  500000     0.325920367975634     0.333370305037643     0.340709326986721
It is interesting to note that, while the prior probabilities of the three failure rate values have been chosen to be equal, the posterior probabilities result in a higher criticality for the higher failure rate values.
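A minimal sketch of this kind of parameter-uncertainty analysis follows. For brevity it conditions on the failure of PS itself rather than on the TE (so its numbers are not those of Table 26.4), but the mechanics of discretizing φ_PS and computing its posterior are the same.

```python
import math

lam_ps = 3.0e-7                                  # lambda_PS from Table 26.1
values = {"min": 0.9 * lam_ps, "nom": lam_ps, "max": 1.1 * lam_ps}
prior = {k: 1.0 / 3.0 for k in values}           # equal prior probabilities

def posterior_given_ps_failed(t):
    """Posterior of phi_PS given the evidence PS = F at mission time t."""
    likelihood = {k: 1.0 - math.exp(-v * t) for k, v in values.items()}
    z = sum(prior[k] * likelihood[k] for k in values)
    return {k: prior[k] * likelihood[k] / z for k in values}

for t in (1.0e5, 5.0e5):
    print(t, posterior_given_ps_failed(t))
```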
26.7. Conclusions and Current Research
Bayesian Networks provide a robust probabilistic method of reasoning with uncertainty and are rather interesting for the dependability analysis of safety-critical systems. Here, we have dealt with BNs versus FTs, a very popular technique for hardware dependability analysis. BNs, with respect to FTs, can address interesting questions, allowing both forward and backward analysis; moreover, BNs are more suitable for representing complex dependencies among components, for including uncertainty in modeling, and for generalizing the probabilistic analysis. Several extensions of the BN formalism appear to be of great interest in dependability analysis; among them, Dynamic Bayesian Networks for dealing with dynamic system behavior as assumed in Dynamic Fault Trees11 and first-order representations for dealing with parametric representations,5,12 as dealt with in Parametric Fault Trees.13 We are currently working towards the use of such formalisms for the definition of a more powerful and flexible dependability framework based on BNs.
References
1. N. G. Leveson, Safeware: System Safety and Computers. Addison Wesley (1995).
2. F. V. Jensen, Bayesian Networks and Decision Graphs. Springer (2001).
3. A. Bobbio, L. Portinale, M. Minichino, and E. Ciancamerla, Improving the analysis of dependable systems by mapping fault trees into Bayesian networks, Reliability Engineering and System Safety 71, 249-260 (2001).
4. J. Solano-Soto and L. E. Sucar, A methodology for reliable system design, in Lecture Notes in Computer Science, Vol. 2070, 734-745. Springer (2001).
5. H. Langseth, Bayesian networks with application in reliability analysis, PhD Thesis, Dept. of Mathematical Sciences, Norwegian University of Science and Technology (2002).
6. A. Bobbio, G. Franceschinis, R. Gaeta, L. Portinale, M. Minichino, and E. Ciancamerla, Sequential application of heterogeneous models for the safety analysis of a control system: a case study, Reliability Engineering and System Safety 82(3), 269-280 (2003).
7. S. A. Doyle, J. Bechta Dugan, and A. Patterson-Hine, A combinatorial approach to modeling imperfect coverage, IEEE Transactions on Reliability 44, 87-94 (1995).
8. S. Amari, J. Dugan, and R. Misra, A separable method for incorporating imperfect fault-coverage into combinatorial models, IEEE Transactions on Reliability 48, 267-274 (1999).
9. A. P. Wood, Multistate block diagrams and fault trees, IEEE Transactions on Reliability 34, 236-240 (1985).
10. B. D'Ambrosio, Local expression languages for probabilistic dependence, International Journal of Approximate Reasoning 11, 1-158 (1994).
11. J. Bechta Dugan, K. J. Sullivan, and D. Coppit, Developing a low-cost high-quality software tool for dynamic fault-tree analysis, IEEE Transactions on Reliability 49(1), 49-59 (2000).
12. A. Bobbio, S. Montani, and L. Portinale, Parametric dependability analysis through Probabilistic Horn Abduction, in Proc. 19th Conference on Uncertainty in Artificial Intelligence, Acapulco (2003).
13. A. Bobbio, G. Franceschinis, R. Gaeta, and L. Portinale, Parametric Fault Tree for the dependability analysis of redundant systems and its high-level Petri net semantics, IEEE Transactions on Software Engineering 29(3), 270-287 (2003).
CHAPTER 27 RELIABILITY COMPUTATION FOR USAGE-BASED TESTING
S. J. PROWELL Department of Computer Science The University of Tennessee 203 Claxton Complex, 1122 Volunteer Blvd. Knoxville, T N 37996-3450 USA E-mail: [email protected]
J. H. POORE
Department of Computer Science, The University of Tennessee, 203 Claxton Complex, 1122 Volunteer Blvd., Knoxville, TN 37996-3450 USA
E-mail: [email protected]
Markov chains have been used successfully to model system use, generate tests, and compute statistics about anticipated system use in the field. A few reliability models are in use for Markov chain-based testing, but each has certain limitations. A reliability model that is gaining support in field use is presented here, along with a modification of a common stopping criterion. While the presentation of the reliability model has been given previously,1 a discussion of the underlying assumption of independence has not.
27.1. Motivation
Not all system failures are equal; some failures are a nuisance, and some result in failure of mission, loss of revenue, or even loss of life. For this reason the central focus of testing should not be simply to find bugs, but instead to gain sufficient confidence that system release will not be harmful. The failures to find are:
(1) those failures that occur most frequently under anticipated operational use, and
(2) those failures that are most serious, by some measure.
In order to find the failures which occur most frequently, the system should be tested as it is expected to be used. To find the failures that are most serious, testing should be biased toward critical functionality, but in the context of expected use. This latter part is significant; simply testing critical functions under artificial conditions does not necessarily provide evidence of the reliability of these functions when the system is in operational use. These observations provide the motivation for usage-based testing. To perform usage-based testing, one needs to characterize use in a form that is adequate for testing, and then to select a sample of tests that is representative (in some way) of expected use. The ideas of selection, confidence, and prediction point the way to performing testing as a statistical experiment. Treating software testing as a statistical experiment has the additional advantage that it allows making quantitative assertions about one's confidence in the system. Other approaches to software testing, such as code, branch, and path coverage, can lead to increased confidence in the software, but this experience is not necessarily relevant to field use, and is very hard to quantify. As Healy2 notes:
   In a metrics mentality, the goal is to find quantities that can be calculated easily. This metrics mentality leads to the calculation of meaningless quantities. The user thinks he has useful information, but is mistaken. A statistical mentality defines parameters, collects data, and applies statistical procedures using the data to estimate parameters. The results are meaningful, useful estimates.
This paper discusses statistical testing of systems (not necessarily software-only) based on a Markov chain usage model, an approach which has gained acceptance in industry and been applied to a variety of different applications.3,4,5,6 Tools exist to support these techniques.7 Section 27.2 of this paper discusses how Markov chain usage models can be used to characterize use of systems. Section 27.3 discusses an approach to computing reliability based on experience during testing, and Sec. 27.4 discusses a method of comparing test experience to expected field use.
27.2. Characterizing Use
The population under study is the set of all relevant tests, where a test is a sequence of usage events to apply to the system under test. Given
some initial conditions and a sequence of events, the choice of next event is governed by a probability distribution; the next event selection is stochastic. While there are many classes of stochastic models, finite (first-order, discrete parameter) Markov chains are a well-studied class of stochastic models. Additionally, they are appealing because of their simple structure: such a chain is a state transition diagram, where the outgoing arcs from a given state have probabilities summing to one. The current state thus determines a probability distribution on the outgoing arcs, each of which is labeled with a usage event. Additionally, nearly any statistic of interest for such a model is computable.8,9 The structure of a usage model represents what is possible; the probabilities are intended to capture what is believed (or in some cases, known) to be likely in field use, for some particular stratum of use. There may be many usage models developed for a particular product: one for each different stratum of expected users, uses, and environments. A usage model can be characterized by a transition matrix. Let P = [p_{i,j}] be the n × n matrix of conditional next-state probabilities. That is, the probability that the next state of a long realization X_1, X_2, ..., X_r is j, given that the current state is i, is Pr[X_{r+1} = j | X_r = i] = p_{i,j}. If there is no arc from state i to state j, then p_{i,j} = 0. If there is just one arc leaving state i, and the arc terminates at state j, then p_{i,j} = 1. For any state i one has Σ_{j=1}^{n} p_{i,j} = 1. These probabilities need not be explicitly set; one can use mathematical programming techniques to specify just what is believed about use, and then generate feasible transition probabilities which honor these constraints.10 Usage models must be developed consistent with a clear definition of use. Such a definition defines a single test (or demand, or use) of the system by stating precisely and verifiably the initial conditions for the test and the conditions under which the test terminates. Given this, one has a clear notion of what constitutes a single test, and one is further able to argue that tests are independent and identically distributed, because all tests begin with the same verifiable initial conditions. If more than one set of initial conditions is relevant, one has more than one definition of use, and consequently more than one model. This more general case is an example of partition testing and is not discussed here. For a discussion of partition testing and Markov chain usage models, see Ref. 11. A trajectory is any path in the usage model. The initial conditions are represented in the Markov chain by a unique state called the source. The final conditions are represented by a unique state called the sink. A test (or
use, or demand) is any trajectory which begins in the source and terminates upon reaching the sink. For convenience, the source and sink will always be represented herein as the first and last state, respectively, in the matrix. If one is interested in long-run behavior, one can make the chain ergodic by setting p_{n,1} = 1. If one is interested in the single-use characteristics, one can make the sink absorbing by setting p_{n,n} = 1. For the rest of this paper, we assume the chain is ergodic. Figure 27.1 shows an example model of a simple telephone. The source is the state On Hook, the sink is the state Done. The definition of use begins with the phone on-hook and not ringing, and terminates when the phone goes on hook after being connected exactly once in a call. Note that relative weights are shown instead of probabilities. These are easily converted to probabilities by normalizing them, as sketched after Fig. 27.1.
Fig. 27.1. The phone model.
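A minimal sketch of this normalization step follows; the three-state weight matrix below is illustrative and is not the phone model.

```python
import numpy as np

def normalize_weights(weights):
    """Turn a matrix of non-negative relative arc weights into a Markov
    chain transition matrix by normalizing each row to sum to one."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum(axis=1, keepdims=True)

# Illustrative three-state model (source, middle, sink), not the phone model;
# the sink returns to the source so that the chain is ergodic.
weights = [[0, 2, 1],
           [1, 0, 3],
           [1, 0, 0]]
print(normalize_weights(weights))
```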
The generation of test cases and their execution and evaluation are beyond the scope of this paper.7,12
27.3. Computing Reliability
27.3.1. Models
The single-use reliability is the probability that the system will not fail on a use, given some definition of use. In terms of the usage model, it is the probability that, given the system is started in the source, one reaches the sink without encountering a failure. Several reliability models have been proposed for use with Markov chain usage models.
• Bernoulli sampling model. One can treat each test as a Bernoulli trial, and use the binomial distribution to compute reliability and confidence.13 This approach is statistically conservative, but it does not take into account variation among tests and assigns equal weight to both long and short test cases.
• Failure state model. One can make a copy of the Markov chain usage model, called the testing model, and introduce failure states into the model where failures are detected.14,9 This model makes strong assumptions about transition reliabilities and does not give a useful estimate of reliability when no failures are observed, which limits the utility of the model in the short run.
• Arc-based model. If one views traversing an arc (a usage event) as a single trial, one can compute arc reliabilities using any suitable model. One can then apply simulation techniques to aggregate these into an overall single-use reliability estimate.11
This paper focuses on the arc-based model. Success or failure of each arc is observed and recorded and used to produce a reliability estimate associated with the arc. When arc reliabilities are computed using a model such as that provided by Miller et al.,15 useful reliability estimates are obtained even when no failures are observed. Computing the single-use reliability via simulation is complicated by the need to generate many random test cases. Depending on the structure of the usage model, a simulation could take considerable time to converge to a useful estimate, even when models are small (fewer than 10,000 states). Usage models specify the use of the system, answering the question "what do we expect the user to do," rather than specifying the behavior of the
system, answering the question "what should the system do in response to some event." As a result, Markov chain usage models are seldom large, and an analytical solution for the reliability and its variance is often faster and more precise than simulation. Further, by investigating an analytical solution other results are revealed, such as the probability of failure given any starting state.
27.3.2. Arc reliabilities
Let 0 ≤ f_{i,j} ≤ 1 be a random variable called the transition failure rate, which counts the fraction of failed transitions from state i to state j. For the purpose of this paper, we assume the f_{i,j} are governed by the beta distribution with parameters a_{i,j} and b_{i,j}. Let a_{i,j} − 1 and b_{i,j} − 1 be the counts of successful and unsuccessful transitions, respectively. Given this information, the moments E[f_{i,j}^k] can easily be computed. For convenience, let r_{i,j} = 1 − f_{i,j}. The f_{i,j} will be assumed to be independent. Some transition failure rates are truly independent. In other cases, it is likely that transition failure rates will be positively correlated (a failure on one transition increases the likelihood of a later failure). Since we are only concerned with the first failure encountered, this will lead to a conservative estimate, because the end-to-end probability of failure is less for positively-correlated failures than for independent failures. It is also possible that transition failures can be negatively correlated, where a failure on one transition increases the likelihood of success on a subsequent transition. This is regarded as less likely in general, and its effect is not considered. Each individual transition from state i to state j in a test will be considered independent and identically distributed. Consider running many tests and recording success or failure for each. Let f_i* denote the random variable counting the fraction of tests which encounter a failure prior to reaching the sink, given that one starts in state i. The single-use reliability is thus E[1 − f_1*]. The method of first-passage gives the equation:
f_i* = Σ_k p_{i,k} f_{i,k} + Σ_k p_{i,k} (1 − f_{i,k}) f_k*.
It is tempting to replace every f_{i,k} with its known expectation, and then solve for all the expectations E[f_i*]. Unfortunately, this will not work in general, because the random variables f_i* are not independent. The error can grow quite large, as one is effectively substituting E^m[x] for E[x^m], because of loops in the usage model.
27.3.3. Trajectory failure rate
To avoid the problem encountered above, the single-use failure rate will be derived by appealing to fixed-length trajectories and basic definitions of expectation. The derivation is only sketched here; for complete details, see Ref. 1. Let 0 ≤ f_i^(m) ≤ 1 be a random variable called the m-step failure rate, counting the fraction of tests on which the first failure is observed on the m-th step, given that the realization started in state i. Let F(t) denote the probability that a trajectory t executes without failure up to the last step, which then fails. That is, F(t) is the probability that trajectory t fails for the first time at the last step. Let T_m be the set of all trajectories of length m. Then, by definition:
E[f_i^(m)] = Σ_{t ∈ T_m} Pr[t] ∫_0^1 Pr[F(t) = f] f df.    (1)
The integral in Eq. 1 gives the expected failure rate for the trajectory t, and the summation is thus the expected value of the m-step failure rate. Rewriting t as the sequence of state visits s_1, s_2, ..., s_{m+1} allows factoring and reorganizing Eq. 1, to obtain:
E[f_i^(m)] = Σ_{s_2} p_{i,s_2} (1 − E[f_{i,s_2}]) E[f_{s_2}^(m−1)].
This gives an equation for E[f_i^(m)] in terms of E[f_{s_2}^(m−1)]. The base of the recursion is m = 1, with E[f_i^(1)] = Σ_{s_2} p_{i,s_2} E[f_{i,s_2}].
Solving for the expected value of the single-use failure rate is now a matter of summing all failure rates from m = 1 up. This gives:
E[f_i*] = Σ_{m=1}^{∞} E[f_i^(m)].    (2)
Let R̄_1 denote the matrix whose (i, j)-th element is p_{i,j}(1 − E[f_{i,j}]), for 1 ≤ i < n and 1 ≤ j < n (note that the row and column for the sink are discarded). Let F_1 denote the matrix whose (i, j)-th element is p_{i,j} E[f_{i,j}], for 1 ≤ i < n and 1 ≤ j ≤ n. Let U denote a vector of ones of the proper size. Then the vector F* = [f_i*] can be obtained with the equation:
F* = (I − R̄_1)^{-1} F_1 U.
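A small numerical sketch of this computation follows; the transition matrix and arc failure expectations below are illustrative values, and numpy is assumed.

```python
import numpy as np

def single_use_failure(P, Ef):
    """Vector F* of expected single-use failure rates, computed as
    F* = (I - R1)^(-1) F1 U.

    P  : n x n transition matrix (state n is the sink).
    Ef : n x n matrix of expected arc failure rates E[f_ij].
    """
    P = np.asarray(P, dtype=float)
    Ef = np.asarray(Ef, dtype=float)
    n = P.shape[0]
    R1 = (P * (1.0 - Ef))[: n - 1, : n - 1]    # sink row and column discarded
    F1 = (P * Ef)[: n - 1, :]
    U = np.ones(n)
    return np.linalg.solve(np.eye(n - 1) - R1, F1 @ U)

# Illustrative three-state chain: source (0), middle (1), sink (2).
P = [[0.0, 1.0, 0.0],
     [0.2, 0.0, 0.8],
     [1.0, 0.0, 0.0]]
Ef = [[0.0, 0.010, 0.0],
      [0.005, 0.0, 0.002],
      [0.0, 0.0, 0.0]]
F_star = single_use_failure(P, Ef)
print(F_star)             # expected failure rate starting from each non-sink state
print(1.0 - F_star[0])    # single-use reliability when starting in the source
```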
By a similar process, the following equation can be derived for the variance of the single-use failure rate.
E[(f_i*)^2] = Σ_{j=1}^{∞} E[(f_i^(j))^2] + 2 Σ_{m=1}^{∞} Σ_{n=m+1}^{∞} E[f_i^(m) f_i^(n)].    (3)
The significant term in Eq. 3 is the last expectation. Again, by looking at the term over all trajectories and factoring, the following is obtained, where F^(m,n) denotes the vector [E[f_i^(m) f_i^(n)]]:
F^(m,n) = R̄_1 F^(m−1,n−1)   if m > 1.    (4)
Using induction and Eq. 4, one can show the following for n > m ≥ 1:
F^(m,n) = R̄_1^{m−1} (R̄_1 − R̄_2) R̄_1^{n−m−1} F_1 U.    (5)
Combining Eq. 5 with Eq. 3, one can obtain the vector of second moments F_2* from the equation:
F_2* = (I − R̄_2)^{-1} F_2 U + 2 (I − R̄_2)^{-1} (R̄_1 − R̄_2) F*,
where R̄_2 and F_2 are defined analogously to R̄_1 and F_1, with the second moments E[r_{i,j}^2] and E[f_{i,j}^2] in place of the first moments.
This allows the variances to be obtained.
27.4. Similarity to Expected Use
Let X be a random variable governed by the true distribution p. A statistical model is constructed to approximate p, and this model uses the approximating distribution q. It is reasonable to ask how closely the model resembles the true distribution. One way to measure this is to compute the relative entropy of distribution p with respect to distribution q. This number can be thought of as the number of bits which are wasted by encoding observations of X using the not-quite-right distribution q. Let lg a = log_2 a. Whenever outcome x ∈ X is observed, it is encoded using −lg q(x) bits. This outcome is observed with the true probability p(x), so on average −Σ_{x∈X} p(x) lg q(x) bits are used to encode an outcome of X. If the statistical model precisely matched the true distribution, the minimum average of −Σ_{x∈X} p(x) lg p(x) bits would be used to encode the outcome. The number of bits wasted is the difference between the encoding used and the minimum, and can be written as follows:
K[p, q] = Σ_{x∈X} p(x) lg (p(x)/q(x)).
This quantity is known as the Kullback-Leibler number, or sometimes just the discrimination, and is denoted K[p, q].16 Note that the discrimination is not a true metric: it is not symmetric. As testing experience grows, it will come to more and more closely approximate the distribution given by the usage model. The difference between the "true" distribution given by the usage model and the approximating distribution given by the ensemble of test cases can be measured by the discrimination. The matrix T = [t_{i,j}], where t_{i,j} simply counts the number of transitions from state i to state j, can be normalized to obtain a Markov chain, and the discrimination between this testing chain and the original usage chain can be computed. For this reason the discrimination has been recommended as a stopping criterion.14,17 As testing experience grows to more closely resemble expected use, K[p, q] approaches zero. There are many different entropies for a model; the one commonly used is the source entropy. For the usage model given by transition matrix P = [p_{i,j}], the source entropy is given by the equation:
H_u = −Σ_{i=1}^{n} Σ_{j=1}^{n} π_i p_{i,j} lg p_{i,j},
n
i=l
j=1
A problem with the discrimination is that it is measured (when one uses the logarithm base two) in bits. It is not intuitively appealing. The discrimination is defined as the difference between the approximating entropy and the model’s source entropy, with the latter always being smaller. If the source entropy is ten bits, then a discrimination of one bit is small. If the source entropy is 0.1 bits, then a discrimination of one bit is large. A better measure may be the relative discrimination, given by the percentage difference between the testing experience and the expected use.
Ht - Hu Hu . Thus if the source entropy is ten bits and the discrimination is one bit, the relative discrimination is 10%. If the source entropy is 0.1 bits and the discrimination is one bit, the relative discrimination is l,OOO%. These two cases are clearly differentiated, and it is possible to establish a target value for the relative discrimination during testing. K,[p,q] = 100 x
27.5. Conclusion
Statistical testing based on a usage model directly addresses the goal of gaining confidence that system release will not be harmful. Usage models can be constructed to capture expected use, and testing experience can be quantified in terms of reliability and the discrimination with respect to expected use.
References
1. Stacy J. Prowell and Jesse H. Poore, Computing system reliability using Markov chain usage models, Journal of Systems and Software 73(2), 219-225 (September 2004).
2. John D. Healy, Commentary software: Metrics mentality versus statistical mentality, IEEE Transactions on Reliability 49(3), 319-321 (September 2000).
3. David P. Kelly and Jesse H. Poore, From good to great: Lifecycle improvements can make the difference, Cutter IT Journal 13(2) (February 2000).
4. David P. Kelly and Robert Oshana, Improving software quality using statistical testing techniques, Information and Software Technology 42(12), 801-807 (2000).
5. Kirk D. Sayre, Testing dynamic web applications with usage models, in International Conference on Software Testing, Analysis and Review (STAR West), Software Quality Engineering (October 2003).
6. Kirk D. Sayre, Automated API testing: A model-based approach, in International Conference on Software Testing, Analysis and Review (STAR East), Software Quality Engineering (May 2004).
7. Stacy J. Prowell, JUMBL: A tool for model-based statistical testing, in Proceedings of the 36th Annual Hawaii International Conference on System Sciences (HICSS'03), IEEE Computer Society Press (January 2003).
8. J. G. Kemeny and J. L. Snell, Finite Markov Chains. Springer-Verlag, New York, NY (1976).
9. Stacy J. Prowell, Computations for Markov chain usage models, Computer Science Technical Report UT-CS-03-505, The University of Tennessee, Knoxville, TN (2003).
10. Gwendolyn H. Walton and Jesse H. Poore, Generating transition probabilities to support model-based software testing, Software Practice and Experience 30(10), 1095-1106 (August 2000).
11. Kirk D. Sayre, Improved Techniques for Software Testing Based on Markov Chain Usage Models, PhD thesis, The University of Tennessee, Knoxville, TN (December 1999).
12. Stacy J. Prowell, Carmen J. Trammell, Richard C. Linger, and Jesse H. Poore, Cleanroom Software Engineering: Technology and Process, Addison-Wesley, Reading, MA (1999).
13. Jesse H. Poore, Harlan D. Mills, and David Mutchler, Planning and certifying software system reliability, IEEE Software 10(1), 88-99 (1993).
14. James A. Whittaker and Michael G. Thomason, A Markov chain model for statistical software testing, IEEE Transactions on Software Engineering 20(10), 812-824 (October 1994).
15. Keith W. Miller, Larry J. Morell, Robert E. Noonan, Stephen K. Park, David M. Nicol, Branson W. Murrill, and Jeffrey M. Voas, Estimating the probability of failure when testing reveals no failures, IEEE Transactions on Software Engineering 18(1), 33-44 (January 1992).
16. R. B. Ash, Information Theory, Dover Publications, Inc., New York, NY (1990).
17. Kirk D. Sayre and Jesse H. Poore, Stopping criteria for statistical testing, Information and Software Technology 42(12), 851-857 (September 2000).
CHAPTER 28 K-MART STOCHASTIC MODELING USING ITERATED TOTAL TIME ON TEST TRANSFORMS
FRANCISCO VERA Department of Statistics University of South Carolina Columbia, S C 29208 USA E-mail: [email protected]
JAMES LYNCH
Department of Statistics, University of South Carolina, Columbia, SC 29208 USA
E-mail: [email protected]
Over-dispersion of a population relative to a fitted baseline model can be accounted for in various ways. For example, one way is by using a mixture over the family of baseline models. Another is via a martingale structure if the Total Time on Test (TTT) Transform of the population "dominates" that of the baseline model. Here these latter ideas are extended to iterated TTT comparisons and related to a martingale-type of structure, called a k-mart, between the population and the baseline model. These ideas are illustrated for a binomial baseline model using the Saxony 1876-85 sibship census for families with 12 siblings. In addition, the construction of a "most identical" distribution in the case of a 1-mart is presented.
28.1. Introduction
A principal objective of stochastic model building is to choose a simple baseline model that describes the salient features of the population. One possibility is to construct a stochastic model so that
Y = X + ε    (1)
where Y is the population, X is the baseline model and ε is an error term. Two features that one might require of X in (1) are that it be "fair" and uncorrelated with the error term ε. Formally,
E(Y) = E(X) and Cov(X, ε) = 0.    (2)
A stronger and more satisfying requirement is that the construction of the joint distribution of (X, Y) result in a martingale, i.e.
E(Y | X) = X.    (3)
The martingale structure quantifies overdispersion of the population relative to the baseline model. As we shall see below, this is related to Y being a dilation of X, i.e., both random variables have finite means and
E(c(Y)) ≥ E(c(X))    (4)
for every convex function c. An operational way to verify (4) is through the use of Total Time on Test (TTT) comparisons. Let F denote a cumulative distribution function (cdf) with nonnegative support and finite mean μ (i.e. μ = ∫ x dF(x)). Then the TTT transform is
T_F(y) ≡ 1 − (1/μ) ∫_y^∞ (1 − F(x)) dx.    (5)
It is well known that (4) holds if and only if
E(Y) = E(X) and T_G ≤ T_F,    (6)
where G and F denote the distribution functions of Y and X, respectively (e.g. Ross1 or Shaked and Shanthikumar2). Moreover, (6) holds if and only if there exists a joint distribution for Y and X such that (3) holds, i.e., (X, Y) has a martingale structure (e.g., Blackwell,3,4 Strassen,5 Meyer6 and the references therein). Dilations can arise quite naturally when fitting mixture models where the means of the population and the fitted model are the same (Shaked7). Consequently, the joint distribution (X, Y) can be constructed to have a martingale structure. However, as we shall see, higher order mixture models can oftentimes be fitted in a hierarchical way where the k-point fitted model and the population mixture have a k-mart structure rather than a martingale structure. For k-marts, the analog of dilations is that of balayages. Balayage is a generalization of dilation, defined here in terms of generalized convex functions. Vera and Lynch8 introduce the concept of Iterated TTT transforms
and prove its relationship with balayages. They also introduce the notion of k-marts, which are martingales when k = 1, and determine their relationship with balayages. The details of this relationship are given in Sec. 28.2. In Sec. 28.3 the application of these ideas to mixture problems is presented, where the method of moments is used to match the fitted model to the population. These ideas are illustrated in Sec. 28.4 for the Saxony 1876-85 sibship census for families with 12 siblings from Sokal and Rohlf9 (see also Shaked7). In Sec. 28.5 the construction of the "most identical" distribution in the case of a 1-mart is given.
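As a small numerical illustration of (5) and (6) (not part of the original chapter), the sketch below evaluates T_F and T_G for a baseline exponential distribution and a more dispersed two-point mixture of exponentials with the same mean, and checks that T_G ≤ T_F pointwise; the distributions and the grid are illustrative choices.

```python
import numpy as np

def ttt_transform(sf, mean, y, upper=60.0, n=200001):
    """T_F(y) = 1 - (1/mu) * int_y^upper (1 - F(x)) dx, evaluated with a
    simple Riemann sum; `sf` is the survival function 1 - F and `upper`
    truncates the (numerically negligible) tail."""
    xs = np.linspace(y, upper, n)
    integral = np.sum(sf(xs)) * (xs[1] - xs[0])
    return 1.0 - integral / mean

# Baseline X ~ Exp(1); population Y ~ equal-weight mixture of Exp(2) and
# Exp(2/3). Both have mean 1, and Y is more dispersed than X.
sf_F = lambda x: np.exp(-x)
sf_G = lambda x: 0.5 * np.exp(-2.0 * x) + 0.5 * np.exp(-(2.0 / 3.0) * x)

ys = np.linspace(0.0, 5.0, 11)
tF = np.array([ttt_transform(sf_F, 1.0, y) for y in ys])
tG = np.array([ttt_transform(sf_G, 1.0, y) for y in ys])
print(np.all(tG <= tF + 1e-9))   # True: T_G <= T_F, so (6) holds here
```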
28.2. Generalized Convexity, Iterated TTT, K-Mart
To generalize the notion of dilation we need the following definitions.
Definition 28.1: A family of real functions U = {u_0, u_1, ..., u_m} is called a Tchebycheff system (or T-system) if det[u_i(x_j); i, j = 0, ..., m] > 0 whenever x_0 < x_1 < ... < x_m.
Definition 28.2: Let U = {u_0, u_1, ..., u_m} be a system. We say that a function c, defined on X, is U-convex or convex with respect to U if and only if
det [ u_0(x_0) ... u_0(x_{m+1}) ; u_1(x_0) ... u_1(x_{m+1}) ; ... ; u_m(x_0) ... u_m(x_{m+1}) ; c(x_0) ... c(x_{m+1}) ] ≥ 0    (7)
for every x_0 < x_1 < ... < x_{m+1}. If the inequality is strict (>), then we say that the function is strictly convex.
As a special case of U-convexity we have the following definition.
Definition 28.3: Under the conditions of Definition 28.2, if U = {1, x, ..., x^{2k-1}}, then any U-convex function is called k-convex.
In the next theorem c^(j) denotes the j-th derivative of c.
Theorem 28.1: Under the conditions of Definition 28.3, if c is a differentiable function then c is k-convex if and only if c^(2k) ≥ 0.
Next, we define the concept of a balayage in terms of the class of k-convex functions.
398
Francisco Vera and James Lynch
Definition 28.4: Let G and F be two cdf. We say that G is a balayage of
5 F ) if
F (denoted by G
(8)
for every k-convex function c. If G and F have densities g and f respectively, g
k
k
> f denotes G > F .
Notice that in the above definition a dilation is a balayage with k = 1. Also notice that if the inequality is strict in the case of strictly k-convex functions then F and G are different on a set with positive probability. Another important notion is that of Iterated TTT transforms.
Definition 28.5: Let F be the cdf of a nonnegative random variable X . The j-order TTT transform, T g ) , is defined recursively as follows: TS 5 F 2
(94
where
s
m
(j)
Pj=PF
= -
T- F ( A( t ) d t ; T - (FA = 1
- T F ( j ); j. = 0 , 1 , 2 , ...
(9b)
0
The relationship between iterated TTT transforms and balayages is given in the next theorem. k
Theorem 28.2: Let G and F be two distribution functions. G > F if and -(2k-1) ) j = 1 , . .. , 2 k - 1 . only if TG 2 TP-')and J z j d G ( z ) = ~ j d F ( zfor Proof: See Vera and Lynch8 for details.
0
The relationship given in Theorem 28.2 suggests a diagnostic tool to check when the population is a balayage of the fitted model, or in terms of mixtures, if a higher order mixture should be fitted. Let F2 denote the fitted distribution and let F1 denote the distribution of the population. If k
F1 > F2 then the first 2k- 1 moments of Fl and Fz are the same. Moreover, the moments pj defined in (9b) for j = 0, . . . ,2k- 2 are also the same, since from Vera and Lynch' formula (3.5), (j)
P j = PF
=
EF
(
x~+I)
(10) ( ~ + I ) E F ( X ~ ) '
399
K-martStochastic Modeling Using Iterated TTT T h n s f o m s
7
Hence
[Fp)(t)- F$)(t)]dt= 0
(11)
0
for j = 0,.. . ,2k - 2. Now, the function x Z k is strictly k-convex. Therefore, if F1 and F2 are different, then E 1 ( X z k )> E 2 ( X Z k ), and by (lo),
or equivalently, by (9b), 00
(12) 0
Summarizing the above we have the following lemma. k
Lemma 28.1: Let Fl > F2 with F1
# F2. Then, (12) holds if and only if
E F (X2‘) ~ > E F (~X z k ) .
(13)
Another important notion is that of k-marts.
Definition 28.6: Let (Y,X I , .. . ,xk) be jointly distributed random variables with X I ,. . . ,xk independent and identically distributed (i.i.d.). We say that (Y,X I ,. . . , xk) have a k-mart structure if (!4)
for j = 1,. . . ,2k - 1. The following theorem gives the relationship between k-marts and balayages analogous to the one between martingales and dilations. k
Theorem 28.3: Let G and F be two distribution functions. G > F if and only if there exists jointly distributed random variables (Y,X1 , . . . ,x k ) having a k-mart structure with Y N G and X1 F . N
Proof: See Vera and Lynch^8 for details. □

The next theorem gives a sufficient sign change condition on G - F for verifying when G \ge_k F.

Theorem 28.4: Let F and G be two distribution functions with the same first j moments and with respective densities f and g (with respect to some
measure \nu). If g - f has at most j + 1 sign changes, then T_G^{(i)} - T_F^{(i)} has at most j - i sign changes for i = 1, ..., j. In particular, if the last sign of g - f is +, then \bar{T}_G^{(j)} - \bar{T}_F^{(j)} \ge 0.

Proof: See Vera and Lynch^8 for details. □
Remark 28.1: Meyer^6 gives a thorough presentation of balayages defined in terms of a class of functions and the consequent representation results akin to the k-mart representation given here. A useful reference on generalized convexity is Roberts and Varberg.^{10}
28.3. Mixture Models

In this section we consider mixtures over the family {F_\theta : \theta \in \Theta}. For \nu a distribution on \Theta, let F_\nu(x) = \int F_\theta(x)\, d\nu(\theta). Assume that F_\theta has a density f_\theta with respect to a measure \lambda, so the mixed density is f_\nu(x) = \int f_\theta(x)\, d\nu(\theta). Also denote the j-th moment of F_\theta by m_j(\theta) = \int x^j\, dF_\theta(x). The following theorem characterizes balayages between mixtures in terms of balayages between the mixing distributions.

Theorem 28.5: Let \nu_1 and \nu_2 be distributions on \Theta, with \nu_1 having k points of mass, and write F_i = F_{\nu_i}. Assume that f_\theta is totally positive. If \nu_2 \ge_k \nu_1 and m_j(\theta) is a polynomial of degree at most j for j = 1, ..., 2k-1, then F_2 \ge_k F_1.
Proof: First notice that Theorem 28.2 implies that

\int \theta^j\, d\nu_2(\theta) = \int \theta^j\, d\nu_1(\theta), \qquad j = 1, \ldots, 2k-1,

and, therefore, \int m_j(\theta)\, d\nu_2(\theta) = \int m_j(\theta)\, d\nu_1(\theta) since m_j is a polynomial of degree at most j. Thus

\int x^j\, dF_1(x) = \int x^j\, dF_2(x),    (15)
since

\int x^j\, dF_{\nu_i}(x) = \int m_j(\theta)\, d\nu_i(\theta)    (16)

for j = 1, ..., 2k-1. Now, d\nu_2(\theta) - d\nu_1(\theta) has at most 2k sign changes since \nu_1 has k points in its support. This implies that

f_2 - f_1 = \int f_\theta \left( d\nu_2(\theta) - d\nu_1(\theta) \right)

has at most 2k sign changes because f_\theta has a totally positive kernel (see the Variation Diminishing Theorem in Karlin^{11}). Thus, by Theorem 28.4, \bar{T}_{F_2}^{(2k-1)} - \bar{T}_{F_1}^{(2k-1)} \ge 0. The desired result now follows from Theorem 28.2. □
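The variation-diminishing step in this proof can be seen numerically for the binomial kernel used below. The following sketch is our own illustration (the mixing distributions and all names are assumptions): it compares a two-point mixed binomial f_2 with the one-point mixture f_1 at the mean of the mixing distribution and counts the sign changes of f_2 - f_1.

import numpy as np
from scipy.stats import binom

def mixed_binomial_pmf(n, supports, masses):
    """pmf of a mixed binomial: f_nu(x) = sum_i masses[i] * bin(n, supports[i]) pmf at x."""
    x = np.arange(n + 1)
    return sum(m * binom.pmf(x, n, p) for p, m in zip(supports, masses))

def sign_changes(diff, tol=1e-12):
    """Number of sign changes in a sequence, ignoring near-zero entries."""
    signs = [np.sign(d) for d in diff if abs(d) > tol]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

n = 12
supports2, masses2 = [0.45, 0.60], [0.5, 0.5]      # two-point mixing distribution nu_2
theta_bar = float(np.dot(supports2, masses2))      # one-point nu_1 at the mean (k = 1)

f2 = mixed_binomial_pmf(n, supports2, masses2)
f1 = binom.pmf(np.arange(n + 1), n, theta_bar)

# dnu_2 - dnu_1 has at most 2k = 2 sign changes; the totally positive binomial
# kernel keeps f2 - f1 at two sign changes or fewer, with last sign +.
print("sign changes of f2 - f1:", sign_changes(f2 - f1))
print("equal means:", np.isclose(np.dot(np.arange(n + 1), f2),
                                 np.dot(np.arange(n + 1), f1)))

With matched means the difference comes out with the pattern (+, -, +), consistent with Theorem 28.5 for k = 1 and with Shaked's sign change result recalled in the next section.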
As an application of Theorem 28.5 we consider mixed binomial distributions. For this, let f_\theta(x) = \binom{n}{x} \theta^x (1-\theta)^{n-x} be the binomial density with respect to the measure \lambda that places mass on the points x = 0, 1, ..., n. Since the binomial distribution is an exponential family, it has a totally positive kernel (Karlin^{11}). To obtain an explicit expression for the j-th moment of F_\theta, m_j(\theta), denote the "falling" powers (Graham, Knuth and Patashnik^{12}) by x^{\underline{j}} = x(x-1)\cdots(x-j+1). It is straightforward to prove, via probability generating functions, that if X ~ bin(n, \theta), then the "falling" moments of the binomial are

E(X^{\underline{j}}) = n^{\underline{j}} \theta^j, \qquad j = 1, 2, \ldots    (17)

It is immediate from (17) that m_j(\theta), j = 1, 2, ..., is a polynomial of degree j. Hence, by Theorem 28.5, whenever a k-point mixed binomial, F_1, is fitted to a population coming from a higher order mixture, F_2, in such a way that the first 2k-1 moments are matched, F_2 \ge_k F_1. The following lemma specifies the polynomial m_j(\theta).
Lemma 28.2: Let X ~ bin(n, \theta). Then

m_j(\theta) = E(X^j) = \sum_{i=1}^{j} \left\{ j \atop i \right\} n^{\underline{i}} \theta^i.    (18)
Proof: To determine the polynomial m_j(\theta) for the case of the binomial, we can proceed as follows. Regular and "falling" powers are related by the expression

x^j = \sum_{i=1}^{j} \left\{ j \atop i \right\} x^{\underline{i}},    (19)

where \left\{ j \atop i \right\} denotes the number of ways to partition a set of j elements into i nonempty subsets (Stirling numbers of the second kind, Graham, Knuth and Patashnik^{12}). These numbers are defined recursively as follows:

\left\{ n \atop 0 \right\} = 0, \quad n = 1, 2, \ldots; \qquad \left\{ n \atop n \right\} = 1, \quad n = 0, 1, 2, \ldots;

\left\{ n \atop k \right\} = k \left\{ n-1 \atop k \right\} + \left\{ n-1 \atop k-1 \right\}, \quad n = 1, 2, \ldots; \ k = 1, \ldots, n.

The proof now follows from (17) and (19). □
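A minimal sketch of the recursion and of the polynomial in (18), assuming nothing beyond (17) and (19) (the function names are ours), with a check against the j-th moment computed directly from the bin(n, theta) pmf.

import numpy as np
from functools import lru_cache
from scipy.stats import binom

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling numbers of the second kind, via the recursion in the proof of Lemma 28.2."""
    if n == k:
        return 1                    # {n atop n} = 1 (including {0 atop 0} = 1)
    if k == 0 or k > n:
        return 0                    # {n atop 0} = 0 for n >= 1
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def falling(n, j):
    """Falling power n(n-1)...(n-j+1)."""
    out = 1
    for i in range(j):
        out *= n - i
    return out

def m_j(theta, n, j):
    """m_j(theta) = E(X^j) for X ~ bin(n, theta), using (17) and (19)."""
    return sum(stirling2(j, i) * falling(n, i) * theta**i for i in range(1, j + 1))

n, theta = 12, 0.5192
x = np.arange(n + 1)
pmf = binom.pmf(x, n, theta)
for j in range(1, 6):
    direct = float(np.dot(x.astype(float) ** j, pmf))
    print(j, round(m_j(theta, n, j), 6), round(direct, 6))   # the two columns agree

The agreement is exact up to floating-point error, since (18) is an algebraic identity in theta.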
Let (X_1, ..., X_n) be a sequence of independent identically distributed (iid) random variables with distribution F_2. We close this section with some asymptotics regarding the behavior of

\frac{1}{n} \sum_{i=1}^{n} X_i^{2k} - E_1(X^{2k}),

which in some sense measures how well the k-point model, F_1, fits the population model, F_2. This is given as a corollary of the following theorem.
Theorem 28.6: Let (X_1, ..., X_n) be a sequence of iid random variables with distribution F_2, and let S_n = \sum_{i=1}^{n} X_i^{2k}. Then

\sqrt{n} \left( \frac{S_n}{n} - E_2(X^{2k}) \right) \xrightarrow{d} N\!\left(0, \mathrm{Var}_2(X^{2k})\right),    (20)

where

E_2(X^{2k}) = \sum_{j=1}^{2k} \left\{ 2k \atop j \right\} n^{\underline{j}} \int \theta^j\, d\nu_2(\theta).

Proof: Direct consequence of Lemma 28.2 and the central limit theorem. □
Corollary 28.1: If E_1(X^{2k}) = E_2(X^{2k}), then

F_1 = F_2,    (21a)

and

\Delta_n = \sqrt{n} \left( \frac{S_n}{n} - E_1(X^{2k}) \right) \xrightarrow{d} N\!\left(0, \mathrm{Var}_1(X^{2k})\right).    (21b)

If E_2(X^{2k}) > E_1(X^{2k}), then \Delta_n \xrightarrow{P} \infty.

Proof: To see (21a), note that \bar{T}_{F_1}^{(2k-1)} = \bar{T}_{F_2}^{(2k-1)} from Theorem 28.2 and from (12). Thus, from (9a), the same argument can be applied recursively to get F_1 = F_2. Then, (21b) follows from Theorem 28.6. Note that \Delta_n \xrightarrow{P} \infty when E_2(X^{2k}) > E_1(X^{2k}) by the law of large numbers. □
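As a rough illustration of the diagnostic behind Corollary 28.1, the sketch below is our own (the two-point "population" mixture, the one-point fit with matched mean, and all names are assumptions): it simulates Delta_n for k = 1. Because the population is a genuine mixture, E_2(X^2) exceeds E_1(X^2) and Delta_n drifts to +infinity.

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n_trials, k = 12, 1

supports = np.array([0.45, 0.60])          # "population" F2: two-point mixed binomial
masses = np.array([0.5, 0.5])
theta_bar = float(supports @ masses)       # fitted one-point model F1: bin(n_trials, theta_bar)

# E_1(X^{2k}) under the fitted model, computed exactly from its pmf
x = np.arange(n_trials + 1)
E1_X2k = float(np.dot(x.astype(float) ** (2 * k), binom.pmf(x, n_trials, theta_bar)))

# Delta_n = sqrt(n) * ( S_n / n - E_1(X^{2k}) )
for n in (100, 1_000, 10_000, 100_000):
    thetas = rng.choice(supports, size=n, p=masses)
    sample = rng.binomial(n_trials, thetas).astype(float)
    delta_n = np.sqrt(n) * (np.mean(sample ** (2 * k)) - E1_X2k)
    print(f"n = {n:>6d}   Delta_n = {delta_n:8.2f}")

Under the one-point model itself, the same statistic would instead be approximately N(0, Var_1(X^{2k})) for large n, which is what (21b) says.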
Remark 28.2: In a statistical inference problem, the parameters in E_1(X^{2k}) and Var_1(X^{2k}) would have to be estimated. The required asymptotics would be more delicate and are not considered here.

28.4. A Binomial Example
The ideas discussed in Sec. 28.3 are illustrated in this section with a data set from Sokal and Rohlf,^9 where the frequencies are given for the number of males in 6115 families with 12 siblings each in Saxony from 1876 through 1885. A binomial with n = 12 is a natural baseline model for this data, with p equal to the probability of a male. For a binomial fit, the method of moments (MOM) gives p = .5192. Shaked^7 observed that the difference between the observed and expected frequencies for this data had the sign sequence +, -, + (Fig. 28.1(a)). He also proved that, for a one-parameter exponential family {g_\theta, \theta \in \Theta} with a mixing distribution \mu, the difference in densities f_2 - f_1 exhibits such a sign change pattern, where \bar{\theta} = \int \theta\, d\mu(\theta), f_2 = \int g_\theta\, d\mu(\theta), and f_1 = g_{\bar{\theta}}. He also proved that the likelihood ratio is convex and that f_2 is a dilation of f_1. These two features show up in the data, and are seen in the likelihood ratio plot (Fig. 28.1(b)) and in the TTT comparison plot (Fig. 28.1(c)). All these graphs thus confirm that the distribution on {0, 1, 2, ..., 12} defined by this data, F, is more dispersed than the fitted binomial.

Fig. 28.1. One-point mixed binomial.
Lindsay^{13} also considered this data and fitted a mixed binomial model. He proved that the likelihood ratio is log-convex and developed gradient plots to find the maximum likelihood estimator (MLE) of the mixing distribution. In particular, he showed that the MLE for this data set has 4 points in its support. Here we fit, via MLE and MOM, mixed binomial distributions, F_k, with k points in the support, for k = 1, 2, ..., 5. The results are shown in Table 28.1. As can be seen, except for k = 1, the MOM and MLE procedures do not necessarily result in the same fitted models. It can also be seen that the MLE in the case of the 4-point mixture has 1 in its support. Also, the 5-point mixture has two points in its support that are very close together, suggesting that it is actually a 4-point mixture. For the 4-point mixture, there was no solution by the method of moments, which means that F is not a mixed binomial with four or more points in the support of the mixing distribution, since, if it were, then F \ge_4 F_4 and the first seven moments would match up.
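For concreteness, here is a sketch of the moment-matching step for the 1-point and 2-point MOM fits. It is our own illustration, not code from the chapter. The Saxony counts are quoted from the standard published tabulation of these data (they are not reproduced in this chapter, but they imply the mean 6.2306 and the estimate p = .5192 reported above); the function names and the Prony-type solution of the two-point moment problem are our choices. By (17), matching the first 2k - 1 moments of F is equivalent to matching the first 2k - 1 moments of the mixing distribution, which is what the code does.

import numpy as np

# Saxony sibship data (number of males among 12 children in 6115 families),
# as commonly tabulated from Geissler via Sokal and Rohlf; x = 0, 1, ..., 12
counts = np.array([3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7])
n, x = 12, np.arange(13)
freq = counts / counts.sum()

def falling_moment(j):
    """E(X(X-1)...(X-j+1)) under the empirical distribution."""
    xf = np.ones_like(x, dtype=float)
    for i in range(j):
        xf *= x - i
    return float(np.dot(xf, freq))

def n_falling(j):
    out = 1.0
    for i in range(j):
        out *= n - i
    return out

# By (17), E(X^(j)) = n^(j) E(theta^j), so the data determine E(theta^j) directly.
mu = [falling_moment(j) / n_falling(j) for j in range(1, 4)]

print("1-point MOM fit: p =", round(mu[0], 6))        # about 0.5192

# Two-point fit: the support points are the roots of t^2 + b t + c, where (b, c)
# solve mu_{j+2} + b mu_{j+1} + c mu_j = 0 for j = 0, 1 (with mu_0 = 1).
m0, m1, m2, m3 = 1.0, mu[0], mu[1], mu[2]
b, c = np.linalg.solve([[m1, m0], [m2, m1]], [-m2, -m3])
p_lo, p_hi = sorted(np.roots([1.0, b, c]).real)
w_lo = (p_hi - m1) / (p_hi - p_lo)                    # mass at the smaller support point
print("2-point MOM fit:", (round(w_lo, 6), round(p_lo, 6)),
      (round(1 - w_lo, 6), round(p_hi, 6)))

The two-point output should be close to the corresponding MOM entries in Table 28.1; higher order fits can in principle be obtained the same way from the higher order moment equations.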
Table 28.2 shows the moments of F and of the different MOM fitted mixed binomials. In the case of the 1-point mixture it is seen that, while the first moments are the same, the second moment of F is bigger. For the 2-point mixture, the first three moments match up, but the fourth moment of F is bigger. In the 3-point mixture, the first five moments are the same, but the sixth moment of F is smaller. This indicates that F is more "spread out" than the 1-point and 2-point mixtures, but is less "spread out" than the 3-point mixture. From this one might conclude that F_3 \ge_3 F \ge_2 F_2 \ge_1 F_1. As we shall now see, the last two comparisons are correct but not the first.

Table 28.1. Fitted mixing distributions.
              Maximum Likelihood                 Method of Moments
  k           Mass        Support points (p)     Mass        Support points (p)
  k = 1       1           0.519215               1           0.519215
  k = 2       0.7200452   0.4814297              0.6509593   0.4744328
              0.2799548   0.6163991              0.3490407   0.6027337
  k = 3       0.0072224   0.2250591              0.0016216   0.0853183
              0.8174521   0.4952738              0.7771662   0.4886285
              0.1753255   0.6429582              0.2212122   0.6298527
  k = 4       0.0068421   0.2220664              0.0008227   5e-07
              0.8088673   0.4943468              0.3100417   0.4849806
              0.1841744   0.6391686              0.45189     0.4876126
              0.0001162   1                      0.2372455   0.6259495
  k = 5       0.0068421   0.2220664              0.2924128   0.4478722
              0.3885243   0.4943467              0.0949093   0.5209143
              0.4203429   0.4943469              0.064388    0.5263
              0.1841744   0.6391686              0.4245071   0.5281071
              0.0001162   1                      0.1237829   0.6523147
Table 28.2. Moments of F and of the MOM fitted mixed binomials.

  moment   F (data)     k = 1        k = 2        k = 3        k = 4        k = 5
  1        6.2306       6.2306       6.2306       6.2306       6.2306       6.2307
  2        42.3094      41.8157      42.3094      42.3094      42.3093      42.3074
  3        307.1117     297.7493     307.1117     307.1117     307.112      307.1214
  4        2354.5671    2227.2933    2353.7153    2354.5671    2354.598     2354.7494
  5        18904.141    17379.229    18880.079    18904.141    18904.184    18904.752
  6        157899.12    140686.56    157462.58    157904.55    157895.71    157888.63
  7        1364938.1    1176451.1    1358401.2    1365132      1364893      1364731.9
  8        12159637     10126711     12071120     12163734     12159372     12157551
  9        111248741    89467145     110115825    111315075    111247672   111235712
The iterated TTT plots for the different MOM fitted distributions are given in Fig. 28.2. As seen from Figs. 28.2(a) and 28.2(b), the first and third order iterated TTT transforms for the 1-point and 2-point fitted mixed distributions, respectively, are dominated by that of F. Consequently, F \ge_2 F_2 \ge_1 F_1, where the second comparison follows from Theorem 28.5. Fig. 28.2(c) shows partial domination of the 5th order iterated TTT for F_3 over F; however, there is a small interval where F dominates F_3 slightly. This demonstrates that F_3 is not a balayage of F.
Fig. 28.2. Iterated TTT comparison: (a) first order TTT; (b) third order TTT; (c) fifth order TTT.
Figure 28.3 graphs the differences between the data and the various MOM fitted mixed binomials and indicates that the higher order fits reduce the maximum error by more than half. The higher order fits also exhibit (with some numerical error) the higher order sign change patterns: (+, -, +, -, +) for the 2-point mixture, (+, -, +, -, +, -, +) for the 3-point mixture, and (+, -, +, -, +, -, +, -, +), which is a generalization akin to Shaked's sign change result (see Lynch,^{14} Theorem 3.2).
Fig. 28.3. Comparison of fitted probabilities (a) and estimated survival functions (b).
28.5. Construction of "Most Identical" Distribution

If F_2 \ge_k F_1, then the existence of a joint distribution with marginals F_1 and F_2 having a k-mart structure is given by Theorem 28.3. Such a joint distribution is not necessarily unique. In this section a construction is given for a 1-mart (X, Y), with F_2 \ge_1 F_1, such that X is "most identical" to Y. The ideas presented are related to Sethuraman.^{15}

For two densities f_1 and f_2, with respect to some measure \lambda, define

p = \int (f_1 - f_2)_+\, d\lambda = \frac{1}{2} \int |f_1 - f_2|\, d\lambda, \qquad \bar{p} = 1 - p.
Also define the following densities:

g \propto f_1 \wedge f_2, \qquad h_1 \propto (f_1 - f_2)_+, \qquad h_2 \propto (f_1 - f_2)_-.
The next lemma gives a representation of f_i in terms of g and h_i.

Lemma 28.3:

f_i = \bar{p}\, g + p\, h_i, \qquad i = 1, 2.    (22)
Proof: Let X ~ f_1 and let I = I(f_1(X) > f_2(X)), where I(\cdot) denotes the indicator function. Then, by the law of total probability,

f_1(x) = P(I = 0)\, f_1(x \mid I = 0) + P(I = 1)\, f_1(x \mid I = 1) = \bar{p}\, g(x) + p\, h_1(x).

A similar proof holds for f_2. □
Notice that (22) implies that

f_2 \ge_k f_1 \quad \text{if and only if} \quad h_2 \ge_k h_1,    (23)

since

\int c(x) \left( f_2(x) - f_1(x) \right) d\lambda(x) = p \int c(x) \left( h_2(x) - h_1(x) \right) d\lambda(x)

for any function c. The following result, from Lemma 2.1 and (2.5) in Sethuraman,^{15} is needed to prove the main result.
Lemma 28.4: inf P(X \ne Y) = p, where the infimum is over all joint distributions (X, Y) having marginals f_1 and f_2.
Theorem 28.7: Given two densities f_1 and f_2 with f_2 \ge_1 f_1, there exist jointly distributed random variables (X, Y), with marginals f_1 and f_2 respectively, having a 1-mart structure such that P(X \ne Y) = p.

Proof: Let Z_1 ~ h_1 and Z_2 ~ h_2, where, by (23), we choose (Z_1, Z_2) to have a 1-mart structure. Let Z ~ g be independent of Z_1 and Z_2, and let J be an auxiliary random variable, independent of Z, Z_1 and Z_2, with P(J = 1) = p = 1 - P(J = 0). Writing \bar{J} = 1 - J, set

Y = \bar{J} Z + J Z_2, \qquad X = \bar{J} Z + J Z_1.

Note that

E(Y \mid X) = E\left( E(Y \mid Z, Z_1, J) \mid X \right)    (24)

since \sigma(X) \subset \sigma(Z, Z_1, J). Now

E(Y \mid Z, Z_1, J) = E(\bar{J} Z + J Z_2 \mid Z, Z_1, J) = \bar{J} Z + J\, E(Z_2 \mid Z, Z_1, J) = \bar{J} Z + J\, E(Z_2 \mid Z_1) = \bar{J} Z + J Z_1 = X,    (25)

where the third equality follows since Z_1 and Z_2 are independent of Z and J, and the fourth since (Z_1, Z_2) is a 1-mart. Thus, from (24) and (25),

E(Y \mid X) = E(X \mid X) = X.

Finally, X = Y whenever J = 0, so P(X \ne Y) \le p, and equality holds by Lemma 28.4. □
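A small simulation sketch of this construction (our own; the discrete densities on {0, 1, 2} and all names are illustrative assumptions). In this example (f_1 - f_2)_+ is a point mass, so taking Z_2 independent of Z_1 with E(Z_2) = Z_1 already yields the required 1-mart coupling; the checks confirm P(X != Y) ~ p, E(Y | X = x) ~ x, and that Y has marginal f_2.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative densities on {0, 1, 2}; both have mean 1 and f2 is a dilation of f1
support = np.array([0, 1, 2])
f1 = np.array([0.2, 0.6, 0.2])
f2 = np.array([0.3, 0.4, 0.3])

p = float(np.sum(np.clip(f1 - f2, 0, None)))   # p = integral of (f1 - f2)_+  (= 0.2 here)
g = np.minimum(f1, f2) / (1 - p)               # common part, g proportional to f1 ^ f2
h1 = np.clip(f1 - f2, 0, None) / p             # h1: point mass at 1
h2 = np.clip(f2 - f1, 0, None) / p             # h2: mass 1/2 at 0 and at 2 (mean 1)

m = 1_000_000
J = rng.random(m) < p                          # auxiliary Bernoulli(p)
Z = rng.choice(support, size=m, p=g)           # Z ~ g, independent of the rest
Z1 = rng.choice(support, size=m, p=h1)         # Z1 = 1 almost surely
Z2 = rng.choice(support, size=m, p=h2)         # Z2 independent of Z1 with E(Z2 | Z1) = 1 = Z1,
                                               # so (Z1, Z2) is (trivially) a 1-mart
X = np.where(J, Z1, Z)
Y = np.where(J, Z2, Z)

print("P(X != Y) ~", round(float(np.mean(X != Y)), 4), " target:", round(p, 4))
for s in support:
    print("E(Y | X =", s, ") ~", round(float(Y[X == s].mean()), 3))
print("marginal of Y ~", [round(float(np.mean(Y == s)), 3) for s in support], " target:", f2.tolist())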
Acknowledgments

Research partially supported by NSF Grant DMS 0243594.

References

1. S. M. Ross, Stochastic Processes, John Wiley, New York (1983).
2. M. Shaked and J. G. Shanthikumar, Stochastic Orders and Their Applications, Academic Press, San Diego, CA (1994).
3. D. Blackwell, Comparison of experiments, in Proc. Second Berkeley Symp. Math. Statist. Probab., pp. 93-102 (1951).
4. D. Blackwell, Equivalent comparisons of experiments, Ann. Math. Statist. 24, 265-272 (1953).
5. V. Strassen, The existence of probability measures with given marginals, Ann. Math. Statist. 36, 423-439 (1965).
6. P. A. Meyer, Probability and Potentials, Blaisdell, London (1966).
7. M. Shaked, On mixtures from exponential families, J. R. Statist. Soc. B 42, 192-198 (1980).
8. F. Vera and J. Lynch, Iterated total time on test transforms comparisons, Submitted (2004).
9. R. R. Sokal and F. J. Rohlf, Introduction to Biostatistics, Freeman and Company, San Francisco, CA (1973).
10. A. W. Roberts and D. E. Varberg, Convex Functions, Academic Press, New York (1973).
11. S. Karlin, Total Positivity, Stanford University Press, Stanford (1968).
12. R. L. Graham, D. E. Knuth, and O. Patashnik, Concrete Mathematics, Addison-Wesley Publishing Company (1990).
13. B. G. Lindsay, Mixture Models: Theory, Geometry and Applications, Institute of Mathematical Statistics, Hayward, California; American Statistical Association, Alexandria, Virginia (1995).
14. J. Lynch, Mixtures, generalized convexity and balayages, Scand. J. Statist. 15, 203-210 (1988).
15. J. Sethuraman, Some extensions of the Skorohod representation theorem, Sankhya, Series A 64, 884-893 (2002).