Series on Quality, Reliability & Engineering S t a t i s t i c s
RECENT ADVANCES IN RELIABILITY AND QUALITY ENGINEERING
F i ( t )
til \ )
S 2
1 1
t 2
13
t
S 3
Editor
Hoang Pham
World Scientific
RECENT ADVANCES IN RELIABILITY AND QUALITY ENGINEERING
SERIES IN QUALITY, RELIABILITY & ENGINEERING STATISTICS Series Editors:
M. Xie (National University of Singapore) T. Bendell (Nottingham Polytechnic) A. P. Basu (University of Missouri)
Published Vol. 4:
Frontiers in Reliability A. P. Basu, S. K. Basu & S. Mukhopadhyay
Forthcoming Vol. 1: Vol. 3:
Reliability Optimization & Design of Fault Tolerant Systems H. Pham Contributions to Reliability Analysis of Software & Computing Systems P. K. Kapur& R. B. Garg
VOL 2
Series on Quality, Reliability & Engineering Statistics
RECENT ADVANCES IN RELIABILITY AND QUALITY ENGINEERING
Editor
Hoang Pham Rutgers University
I
V f e World Scientific «•
Singapore • New Jersey • London • Hong Kong Sinqapore
Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Fairer Road, Singapore 912805 USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
RECENT ADVANCES IN RELIABILITY AND QUALITY ENGINEERING Copyright © 2001 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4221-2
Printed in Singapore by U t o P r i n t
To Michelle, Hoang J r . a n d David
PREFACE
Today's engineering systems have become increasingly complex to design and build while the demand for reliability, quality, and cost effective development continues. Reliability is one of the most important attributes in such critical systems as defense systems,0 aerospace applications, real-time control, medical applications, as well as commercial systems. Growing international competition has increased the need for all engineers and designers to ensure a level of quality and reliability of their products before release, and for all manufacturers and producers to produce products at their optimum reliability level at the lowest cost. Hence, the interest in reliability and quality has been growing in recent years. This volume presents recent research in reliability and quality theory and its applications by many leading experts in the field. The subjects covered include reliability optimization, software reliability, maintenance, quality engineering, system reliability, Monte Carlo simulation, tolerance design optimization, manufacturing system estimation, neural network, software quality assessment, optimization design of life tests, software quality, reliability centered maintenance, multivariate control chart, methodology for the measurement of test effectiveness, imperfect preventive maintenance, Markovian reliability modeling, accelerated life testing, and system availability assessment. This volume will serve as a reference for postgraduate students and also prove useful for practicing engineers and researchers of reliability and quality engineering. The method proposed in Chapter 1 is devoted to the designing of control charts (mean, median, standard deviation, and range) for data having a symmetrical leptokurtic distribution. This method is based on the symmetrical Johnson Su distributions. Since computers are being used increasingly to monitor and control both safetycritical and civilian systems, there is a great demand for high-quality software products. Reliability is also a primary concern for both software developers and software users [Pham et al. (1999)]. Chapter 2 describes an NHPP software reliability model that integrates testing coverage and imperfect debugging by incorporating fault introduction phenomenon into software fault detection rate function. The model is also illustrated by using two sets of data collected from real applications. In Chapter 3, a software system is viewed as a hierarchy of programming and integration modules. Each programming module can be developed in-house or purchased (when available). The reliability of software modules developed in-house and of software
Vll
viii
Preface
hierarchies as a function of their cost is presented. This chapter also includes branch and bound schemes to compute the optimal reliability where only easier continuous nonlinear problems are solved. Successful testing of a one-shot device usually results in its destruction. This occurs quite often in the inspection of military weapons as well as industrial products. Chapter 4 discusses general test plans that satisfy both sample size and power requirements. An algorithm for generating these plans is also obtained. There are many situations that the process may be in control when the characteristics are considered separately but out of control when considered jointly with the multivariate control chart. In Chapter 5, multivariate control charts are discussed which are based on Hotelling's T 2 statistic. Chapter 6 presents the preparedness maintenance model for multi-unit systems with imperfect maintenance and economic dependence. The optimum opportunistic preparedness maintenance policies to optimize the system operating performance are then obtained. Chapter 7 describes the use of Monte Carlo simulation to provide estimates of reliability functions for use in classical variational methods. The variational principle is derived from the forward and backward Kolmogorov equations. Chapter 7 also demonstrates a number of practical ways in which the efficiency of the simulation and variational processing procedure can be further improved. Chapter 8 presents a Bayesian approach to the periodic and the sequential imperfect preventive maintenance with minimal repair at failure. The optimal policies that minimize the expected cost rate under a Weibull hazard function are discussed. A case study of the K Steelworks is also presented to illustrate the results. Chapter 9 deals with life tests based on multi-stage decision making. The decision making of which action, namely accepting, rejecting or deferring to the next stage, is done at each stage. This chapter also deals with the multi-stage life test with step-stress life acceleration at each stage and presents a design procedure for the multi-stage accelerated life test using the cumulative exposure model. Reliability centered maintenance (RCM) is a systematic process for preserving a system's function by selecting and applying effective maintenance tasks. Chapter 10 reports an application of RCM techniques to enhance the maintenance efficiency and effectiveness of light rail vehicle equipment at the Light Rail Division of KowloonCanton Railway Corporation. An RCM team was set up during the implementation of RCM to rethink and redesign maintenance procedures. For robust tolerance design, tightening systems' variations can lead to better customer satisfaction but it may result in a higher cost to the manufacturer. The focus of environmental protection concern is now shifting to pollution prevention, from documentation of pollutant patterns and identification of remediation technology. Chapter 11 presents a tolerance design optimization model incorporating environmental concepts. The model can be used to minimize the total societal loss, including a loss to the environment, by selecting the best level settings of the tolerance limits. Chapter 12 discusses two stochastic software safety assessment models based on the existing software reliability models: the reliability assessment model with safety
Preface
ix
and the availability-intensive safety assessment model. These models can be used to predict the software safety and reliability measurements. Chapter 13 discusses variable stream and sample size group control charts in which both the number of streams selected for sampling and sample size from each of the selected streams are allowed to vary based on the values of the preceding sample statistics. The Markov chain process is used to derive the formulas for evaluating the performances of the proposed charts. Chapter 14 describes a model to estimate the total number of faults inserted into a software system, its residual fault content at any given time, and the efficacy of the testing activity in executing the code containing the newly inserted faults. This study also shows that changes in the relative complexity can be used to estimate the rates at which faults are inserted into a system between successive revisions. Chapter 15 presents the classification and regression trees algorithm to practitioners in software engineering and draws practical lessons learned on building classification trees for software quality modeling. A case study of a very large telecommunications system is used to illustrate the algorithm to build software quality models. Chapter 16 investigates the impact of software design on reliability assessment by describing a novel approach to software reliability assessment which combines the strengths of formal verification and program testing. The approach in this chapter provides a better understanding of the consequences of design decisions and a better insight into the problem domain. Chapter 17 discusses the radial basis function neural network method for empirical modeling and illustrates the method through a well-known temperature controller circuit application. In Chapter 18, a mathematical programming framework for assisting decision-makers in determining the optimal subset of maintenance activities to perform prior to the next mission is discussed. This process allows the decision-maker to consider the time, reliability and cost issues.
Hoang Pham
Piscataway, New Jersey October 1999
ABOUT THE AUTHOR
Dr. Hoang Pham is currently Associate Professor in the Department of Industrial Engineering at Rutgers University, New Jersey, USA. Before joining Rutgers in 1993, he was Senior Engineering Specialist at the Boeing Company, Seattle, USA, and the Idaho National Engineering Laboratory, Idaho Falls, USA. He received the BS in Mathematics and BS in Computer Science, both with high honors, from Northeastern Illinois University, MS in Statistics from the University of Illinois at Urbana-Champaign, and MS and PhD in Industrial Engineering from the State University of New York at Buffalo. He is Editor-in-Chief of the International Journal of Reliability, Quality and Safety Engineering. He was Guest Editor for the IEEE Transactions on Reliability, IEEE Communications Journal, Journal of Systems and Software, and International Journal of Modelling and Simulation. He is an editorial board member of several international journals and has been Conference Chair of many international reliability conferences. He is the General Chair of the Seven International Conference on Reliability and Quality in Design, Washington, D.C., USA. He is the author of Software Reliability (Springer, 2000) and editor of 10 volumes. He has published over 60 journal articles and 10 book chapters. Among his edited books are Fault-Tolerant Software Systems: Techniques and Applications (IEEE Computer Society Press, 1992) and Software Reliability and Testing (IEEE Computer Society Press, 1995). His book chapters include Software Reliability (Wiley Encyclopedia of Electrical and Electronics Engineering,1999). He is a Senior Member of the IEEE and HE. He was given the HE Quality Control and Reliability Engineering Achievement Award for "Outstanding Service in the Field of Quality Control and Reliability Engineering". He is listed in Who's Who in Science and Engineering.
LIST OF C O N T R I B U T O R S
Edward B. Allen
Florida Atlantic University
D. S. Bai
Korea Advanced Institute of Science and Technology
Farokh B. Bastani
University of Texas at Dallas
Oded Berman
University of Toronto
C. Richard Cassady
Mississippi State University
M. Chang
University of Cambridge
Guangming Chen
Morgan State University
Patricia L. Cooper
University of Waterloo
Bojan Cukic
West Verginia University
Michal Cutler
Binghamton University
S. H. Hong
Chonbuk National University
Anthony E. Izundu
Lucent Technologies
Chi-Hyuck Jun
Pohang University of Science and Technology
Akihiro Kanagawa
Okayama Prefectural University
Taghi M. Khoshgoftaar
Florida Atlantic University
K. T. Lee
Institute for Advanced Engineering
Ken H. K. Leung
Kowloon-Canton Railway Corporation
J. D. Lewins
University of Cambridge
Ming-Wei Lu
DaimlerChrysler Corporation
John C. Munson
University of Idaho
W. Paul Murdock
Air Force Institute of Technology
Allen P. Nikora
California Institute of Technology
Hiroshi Ohta
Okayama Prefectural University
Kwang-Su Park
Pohang University of Science and Technology
Geoffrey T. Parks
University of Cambridge
Hoang Pham
Rutgers University
xiv
List of
Contributors
Castagliola Philippe
Ecole Des Mines De Nantes
Edward A. Pohl
Office of the Secretary of Defense
Richard J. Rudy
DaimlerChrysler Corporation
Gordon J. Savage
University of Waterloo
Wei-Kei Shiue
Southern Illinois University at Edwardsville
Koichi Tokuno
Tottori University
Shigeru Yamada
Tottori University
Weili Zhang
Novartis
Xuemei Zhang
Lucent Technologies
Ming J. Zuo
University of Alberta
Hongzhou Wang
Lucent Technologies
Richard Whitfield
Megatrend Information Services Ltd
CONTENTS
Preface About the Author List of Contributors 1. Control Charts for Data Having a Symmetrical Distribution with a Positive Kurtosis C. Philippe 2. A Software Reliability Model with Testing Coverage and Imperfect Debugging X. Zhang and H. Pham 3. Cost Allocation for Software Reliability O. Berman and M. Cutler 4. General Reliability Test Plans for One-Shot Devices W. Zhang and W.-K. Shiue 5. Multivariate Control Chart M.-W. Lu and R. J. Rudy 6. Optimal Preparedness Maintenance of Multi-Unit Systems with Imperfect Maintenance and Economic Dependence H. Wang, H. Pham, and A. E. Izundu 7. Estimation of System Reliability by Variationally Processed Monte Carlo Simulation M. Chang, G. T. Parks, and J. D. Lewins 8. A Bayesian Approach to the Optimal Policy under Imperfect Preventive Maintenance Models K.-S. Park and C.-H. Jun 9. Design of Life Tests Based on Multi-Stage Decision Process . . . . A. Kanagawa and H. Ohta 10. Reliability Centered Maintenance for Light Rail Equipment . . . . K. H. K. Leung, M. J. Zuo, and R. Whitfield 11. Incorporating Environmental Concepts with Tolerance Design Optimization Model G. Chen
XV
vii xi xiii 1
17 33 51 61
75
93
123 137 153
169
xvi
Contents
12.
Markovian Reliability Modeling for Software Safety /Availability Measurement K. Tokuno and S. Yamada 13. Group Control Charts with Variable Stream and Sample Sizes . . . K. T. Lee, D. S. Bai, and S. H. Hong 14. A Methodology for the Measurement of Test Effectiveness J. C. Munson and A. P. Nikora 15. Modeling Software Quality with Classification Trees T. M. Khoshgoftaar and E. B. Allen 16. Highly Reliable Systems: Designing Software for Improved Assessment B. Cukic and F. B. Bastani 17. Manufacturing Systems Estimation using Neural Network Models . P. L. Cooper and G. J. Savage 18. A Deterministic Selective Maintenance Model for Complex Systems . C. R. Cassady, W. P. Murdoch, and E. A. Pohl Author index Subject index
181 203 215 247
271 291 311 325 327
Chapter 1 CONTROL CHARTS FOR DATA HAVING A SYMMETRICAL DISTRIBUTION W I T H A POSITIVE KURTOSIS CASTAGLIOLA PHILIPPE Dept. d'Automatique et de Productique Ecole des Mines de Nantes, La Chantrerie 4, Nantes,
France
1. Introduction The designing of a "classical" control chart (Shewhart, EWMA, CUSUM) supposes that the probability density function of the quality characteristic X has to be normal or approximately normal. However, in some situations, it has been shown that this condition does not hold [Jacobs (1990)]. In order to design control charts when the underlying population is non-normal (without increasing the sample sizes), different approaches may be used: • Use classical Shewhart control charts anyway. Many authors studied the effect of non-normality on Shewhart control charts: [Burr (1967), Schilling and Nelson (1976), Balakrishnan and Kocherlakota (1986), Chan, Hapuarachchi and Macpherson (1988)]. One important conclusion of these studies is that classical Shewhart control charts give good results unless the population is highly skewed. • Assume that the distribution of the underlying population is known and then derive specific control limits which verify the type I error a. Such an approach was chosen by Ferrell (1958), Nelson (1979). Ferrell assumed a log-normal distribution for the underlying population and proposed control limits for the geometric midrange and the geometric range, whereas Nelson assumed a Weibull distribution and derived control limits for the median, range, scale, and location. • Use distribution free control charts that provide a type I error close enough to the theoretical one. This approach was first considered by Cowden (1957) who proposed to split the skewed distribution into two parts at its mode, and to consider the two new distributions as two half-normal distributions having the same mean, but different standard deviations. Another very similar approach, the Weighted Variance control chart (WV control chart), was proposed by Choobineh and
1
2
C. Philippe
Ballard (1987) who suggested to split the skewed distribution into two parts at its mean, instead of its mode, and then compute the standard deviations of the two new distributions using the semivariance approximation of Choobineh and Branting (1986). Finally, we can cite the recent works of Seppala (1995) who suggests to use the "Bootstrap" in the computation of control limits, Willemain and Runger (1996) who proposes to use the notion of "Statistically Equivalent Blocks" to design nonparametric control charts, and Castagliola (1997) who proposes an extension of the Weighted Variance method called the "Scaled Weighted Variance" method. • Transform the data in order to make them quasi-normal. This approach was chosen by Pyzdek (1992), Farnum (1997) who used the Johnson system of distributions as a general tool for transforming the data to normality. The method proposed in this paper, which follows the last approach, is devoted only to the designing of "classical" control charts (mean, median, standard deviation, range, EWMA, CUSUM, etc.) for data having a symmetrical distribution with a positive kurtosis (leptokurtic distribution). This method is based on the properties of the symmetrical Johnson Su distributions which will be examined in the following section. 2. The Symmetrical Johnson Su Distributions Let us focus on transformations of form Z = a + bg(Y) of the random variable Y, where a and b > 0 are two parameters, where g is a monotone increasing function, and where Z is a (0,1) normal random variable. It is very easy to show that the random variable Y has the following characteristics: • cumulative distribution: FY(y) = $[a + bg(y)} • inverse cumulative distribution:
Fy1(a)=g-1
y
• density function: fY(y)=bg'{y)4>[a • noncentral moments of order s:
+ bg(y)}
-OW_:H^)
(1)
If c and d > 0 are two additional parameters such that Y = (X — c)/d, then we can straightforwardly deduce the characteristics of the random variable X, i.e., Fx{x) = FY[{x - c)/d] and F^(a) = c + dFy1(a). There are a large number of possibilities for choosing an adequate function g. Johnson (1949) has proposed a very popular system of distributions based on a set of three different functions:
Control Charts for Symmetrical
Distribution
Data with Positive Kurtosis
3
• 9L(Y) = ln(V) and d = 1. The distributions defined by this function, called Johnson SL distributions, are defined on [c,+00]. • gB(Y) — ln[y/(l — Y)]. The distributions defined by this function, called Johnson SB distributions, are defined on [c, c + d]. • gu(Y) = ln(y 2 + y/Y2 + 1) = s i n h _ 1 ( y ) . The distributions defined by this function, called Johnson Su distributions, are defined on ] —00, +oo[. Johnson has proved in his paper that (a) for every skewness coefficient 7i = M3/M2 a n d every kurtosis coefficient 72 = M4/M2 ~~ 3 such that 72 > 7i - 2 there is one and only one Johnson distribution, (b) the SB and Su distributions occupy nonoverlapping regions covering the whole of the skewness-kurtosis plane, and the SL distributions are the transitional distributions separating them (see Fig. 1).
6SU Distributions
Distributions
SB Distributions No Possible Distribution
-1-
0
0.2
0.4
0.6
0.8 1 1.2 Skewness gammal
1.4
1.6
1.8
Fig. 1. The (71,72) plane for the Johnson distributions.
If we look at this figure, we can see that among the symmetrical Johnson distributions (71 = 0) the Su distributions are leptokurtic while the SB ones are platykurtic. For this reason, we will now focus more precisely on Johnson Su distributions which are symmetrical (about the mean m\(X) = mi). It is clear that a necessary and sufficient condition for a Johnson Su to be symmetrical is that a = 0 and c = m i . Consequently, the characteristics of a symmetrical Johnson Su random variable X are:
4
C. Philippe
• cumulative distribution: Fx (x) = $ b sinh~_! / x — mi • inverse cumulative distribution: F^1(a)
= m\ + dsinh
i-^a)
(2)
• density function:
/x(aO
Vz2 + d2
isinh
1
(x/d)]
Let fi2(X) = fj,2 and 72(A") = 72 be, respectively, the variance and kurtosis coefficients of the random variable X. If X is a symmetrical Johnson Su distribution, then we proved in Castagliola (1998) that parameters b and d are related to y,2 and 72 using the following equations (see the appendix for the proof):
l n ( ^ 2 ( 7 2 + 2) - 1) /
2/i 2
~
^ 2 ( 7 2 + 2) - 2
(3)
(4)
3. Application to Control Charts Let X\,..., Xn be a sample of n independent random variables corresponding to training data taken when the process is considered to be "in control", from which we have to compute control limits. Let rh\, fa, 7i, and 72 be, respectively, the (moment) estimators of the mean, variance, skewness, and kurtosis. We will assume now (see the appendix) that a first statistical test leads to the conclusion 71 = 0 (the data distribution seems symmetrical) and a second one leads to the conclusion 72 > 0 (the data distribution seems leptokurtic). If these conditions are verified, we suggest to compute control limits as presented below: • Compute b and d using Eqs. (3) and (4) in which fi2 and 72 have been replaced by their estimators. • Transform each new observation X to a quasi-normal iV(0,1) random variable Z using the following equation: Z = b sinh
1 ( X — m\ ~~d
(5)
• Use "classical" control limits (mean, median, standard deviation, range, EWMA, CUSUM, etc.) corresponding to a normal iV(0,1) distribution.
Control Charts for Symmetrical
Distribution
Data with Positive Kurtosis
5
4. A n Example The training data set plotted in Fig. 2(a) is composed of m = 40 subgroups of n = 5 observations, corresponding to an "in control" process. We have estimated rhi = 19.99, p,2 = 1.11, 7i = -0.27, and 72 = 2.96. If we use the methodology suggested in the appendix, we have Si = —1.61 and S2 — 8.64. Because —1.96 < S\ < 1.96 and S2 > 1-645 with a confidence level 1 - a = 0.95, we can conclude that the data distribution is significantly symmetrical and leptokurtic. In Fig. 2(b) we plotted the standard deviation for the m = 40 subgroups. We notice that the UCL = 1.89 of the standard deviation chart, assuming normality for the data, leads three points "out of control". Thus, if the value UCL = 1.89 is used for the control of the production, we can expect numerous false "out of control" signals. Because the distribution of the data seems to be symmetrical and leptokurtic we suggest to use a symmetrical Johnson Su transformation. We plotted in Fig. 2(c) the standard deviation of the data transformed using Eq. (5) with b = 1.617 and d — 1.39 (estimated using Eqs. (3) and (4)). Assuming that the transformed data are quasi-normal N(0,1), we immediately have (for n = 5) UCL = 2.089. We can notice now that there is no more point "out of control". 5. How Kurtosis Affects Classical Charts The goal of this section is to investigate when the method proposed in this paper can favorably be used and when it cannot, and to give a very general rule concerning 24-i
1
20
40
60
80
100 120 Observations
(a) Fig. 2. An example.
140
160
180
200
6
C. Philippe
20 Subgroup
25
(b)
UCL--
20 Subgroup (c)
Fig. 2.
(Continued).
25
Control Charts for Symmetrical
Distribution
Data with Positive Kurtosis
7
its use. T h e c o m p u t a t i o n of t h e limits of t h e "classical" control charts assumes t h a t t h e distribution of the d a t a must be normally distributed. If this is not t h e case, t h e t y p e I error a really obtained will be different t h a n the expected t y p e I error a = 0.0027 (3a limits). But how large is t h e difference between the observed and t h e expected type I error when t h e kurtosis increases? In order t o evaluate the impact of t h e kurtosis 72 of t h e d a t a on the observed type I error we used t h e following approach: • • • •
for a sample size n G {5, 7, 9}. for a kurtosis 72 € [0,10], mean m i = 0, and variance \i2 = \. compute b and d using Eqs. (3) and (4). generate (by inverse simulation of Eq. (2)) a set of m samples of n symmetrical Johnson Su r a n d o m variables having parameters (b, d). T h e number m of sample has been chosen such t h a t the total number of generated d a t a isTOx ro = 3 465 000. • compute t h e mean, median, s t a n d a r d deviation, and range for each sample. • compute t h e proportion of d a t a outside the control limits for each of t h e four statistics. For a kurtosis 72 — 0, the estimated observed proportion of d a t a outside the control limits (the t y p e I error) must be close to the expected one a = 0.0027, for all the control charts.
In Fig. 3 we have plotted the observed t y p e I error versus t h e kurtosis for sample size n = 5, 7,9. T h e conclusions of these simulations are: • T h e mean and median charts seem t o be very insensitive to t h e kurtosis of t h e data. As expected, this is particulary t r u e for the median. T h e larger t h e sample size, the more insensitive the charts. • In contrary, t h e s t a n d a r d deviation and range charts seem t o be very sensitive t o the kurtosis. T h e range chart is t h e most sensitive. T h e larger t h e sample size, the more sensitive the charts. From this, we can conclude t h a t t h e m e t h o d proposed in this paper is mostly devoted to t h e computation of t h e limits of s t a n d a r d deviation or range charts (dispersion charts in general) of d a t a having a positive kurtosis, even if it can also be applied t o t h e other statistics. 6. O C a n d A R L C u r v e s Let X be a symmetrical Johnson Su with parameters (b,d). W i t h o u t loss of generality, we will assume t h a t TOI = 0. By definition, the random variable Z defined by:
Z = 6sinh-1^j
(6)
is a normal (0,1) r a n d o m variable. In order to compute the OC and ARL curves of a control chart using the method proposed in this paper, we have to first find t h e distributions of t h e random variables U and V defined as:
8
C. Philippe n=5 0.05
(a)
n=7 Average Median StDev Range 0.04-
0.03-
0.02-
0.01-
0.0027 -c
oJ
1
1
1
1
10 Kurtosis
(b) Fig. 3. Influence of the kurtosis on the observed type I error.
Control Charts for Symmetrical
Distribution
Data with Positive Kurtosis
9
n=9 0.05 Average Median StDev Range 0.04
0.03
0.02
0.01
0.0027'
'QO0QQ0i?0O00000000000OOOOOOOOOOO00000000OOOOOOOOOOO0000000000OOOOOOOQ0iX)0O00OOO<
Kurtosis
(c) Fig. 3.
U = 6 sinh
{Continued).
i fe + X
V^ = 6 s i n h - 1 ( ^ d
(7) (8)
These random variables result from the transformation of the random variables X + e and TX by the symmetrical Johnson Su transformation with parameters {b,d), where e and r > 0 are shifts corresponding respectively to the mean and standard deviation of the process. From Eqs. (6) and (7) we easily deduce:
7 = sinh (f X +e
sinh
U
Substracting the first equation from the second yields sinh(C//b) — sinh(Z/b) — e/d. Then: Z — b sinh
sinh ( —
10
C. Philippe
Because Z is a normal (0,1) random variable, we immediately have the cummulative distribution function of the random variable U !
b(«) = * {bsinh - l
sinh
©-ai
(9)
and from the equation above we immediately deduce its inverse cummulative distribution function: 1
jsirsinh = bsinh - 1 <
FvL(a)
S-^a)
(10)
+
Deriving Eq. (9) gives the density function fu(u) of U cosh ( | )
fu(u)
{l + [sinh(f)-f] 2 }
^{bsinh-^sinhQ)-!]}
(11)
Of course, if e = 0 (no shift in the process mean) then Eqs. (9)-(ll) simply become Fu(u) = <&(«), F^1(a) = <J>_1(a), and fu(u) — 4>(u). From Eqs. (6) and (8) we have:
f = Slnh (f '-f = Slnh (j Dividing the second equation by the first gives: sinh(£) sinh(f) and then: Z = bsinh" 1 - sinh I — r V o The cummulative distribution function of the random variable V is then equal to: ,-i Fy (v) = $ < b sinh
;-©]}
(12)
and its inverse cummulative distribution function can be obtained from the previous equation: Fy\a)
= bsinh"'
Tsinh
Deriving Eq. (12) gives the density function fv(v) fv(v)
=
cosh(f) 2
[T +sinh2(H)j
^
\
(13)
of V
—24> {bsinh" 1 [ 1 sinh ( I ) ] }
(14)
Control Charts for Symmetrical
Distribution
Data with Positive Kurtosis
11
As for the case e = 0, if r = 1 (no shift in the process standard deviation) then Eqs. (12)-(14) simply become Fv(v) = $(u), Fy1(a) = $ _ 1 ( a ) and fv(v) = <j>(v). Unfortunately, because the random variables U and V are clearly non-normally distributed, it seems impossible to straightforwardly find the distribution of the sample mean or median (idem for the sample standard deviation or range). The only way to achieve the computation of the OC or ARL curves seems to be simulation. As an example, we have computed the ARL curve for the standard deviation using the following approach: • • • • •
for a sample size n e {5, 7,9}. for a kurtosis 72 € [0,10], mean m\ = 0, and variance /x2 = 1. for a shift r e [0,2]. compute b and d using Eqs. (3) and (4). generate (by inverse simulation of Eq. (13)) a set of m samples of n random variables V. The number m of sample has been chosen such that the total number of generated data is m x n = 3 465 000. • compute the standard deviation for each sample. • compute the proportion of data inside the control limits for the standard deviation. This gives an estimate for OC(r). Then, deduce an estimate for ARL(r) = 1/(1 - OC(r)).
The results of this simulation are plotted in Fig. 4. The main conclusion is that the larger the kurtosis, the larger the ARL of the standard deviation chart. Of course, for a specific kurtosis, when the sample size n increases, the ARL decreases. n =5 Kurtosis=0 Kurtosis=3 Kurtosis=6 Kurtosis=9
450-
kh 400350-
\\
// // //
300250200-
I
150-
4
% \ \^
I
\
i
§1
* ' • '•
\ M';>
100A
500-
v
^ -^ 1
0.2
^^:;^^vvvvvv
i^
" ^ — 1
0.4
1-
0.6
— ,
0.8
T
1
1 tau
1.2
— 1
1.4
1
1.6
1
1.8
(a) Fig. 4. Examples of ARL curves for the standard deviation chart.
2
12
C. Philippe
(b)
n=9
(c)
Fig. 4.
{Continued).
Control Charts for Symmetrical
Distribution
Data with Positive Kurtosis
13
7. Conclusions In this paper we have developed a methodology for the designing of control charts for data having a symmetrical and leptokurtic distribution. We show that this kind of approach is particularly useful when one wants to implement dispersion control charts such as standard deviation or range control charts. The method proposed in this paper allows the quality practitioner to compute "classical" control charts without the use of complex numerical algorithms. The suggested method is so simple that it can be easily programmed into a spreadsheet such as Excel. We also prove that the distributions of the shifted and transformed data can be computed exactly, but this is not the case for the OC and ARL curves which have been obtained by simulation only. Appendix How to Test the Skewness
and the
Kurtosis
The method which is developed in this paper can only be applied if, in theory, the data have a symmetrical leptokurtic distribution. Therefore, in pratice, we need one test to ensure that the distribution of the data is significantly symmetrical, and another test to ensure that the kurtosis of the distribution is significantly positive. In other words, if 7! = /Lt3//z2' and 72 = AWMI ~~ ^ a r e *^ e skewness and kurtosis coefficients of the data, we have to solve the two following hypothesis tests Hso : 7i = 0
HKo • 72 = 0 and
HS\ • 7i ¥= 0
HK\
•• 72 > 0
Our method can be applied if hypothesis Hso and HKI are accepted. In order to solve the two hypothesis tests above, we suggest the use of an approach which is usually dedicated to test the normality assumption of a data set: Compute an estimate for ;a
n
1
C n z-~' J
J'=I
Compute estimates for ^k ,k = 2, 3,4 f^k
1
n
"A)fc .7=1
Compute estimates for 71 and 72 7i =
M3
Compute the standardized statistics
and
72 =
M4 M22
14
C. Philippe
• with yn)
( n - 2 ) ( n + l ) ( n + 3)
y(7"2) = 24n(n-ir yn ' ( n - 3 ) ( n - 2 ) ( n + 3)(n + 5) • The hypothesis i?s 0 will be accepted with a confidence level 1 — a if —$ _ 1 (1 — a/2) <5\ < $ " 1 ( l - a / 2 ) . • The hypothesis HKI will be accepted with a confidence level 1 — a if S2 > S-^l-a). The symmetry of the distribution can also be tested using the following nonparametric scheme: • Enumerate the number 8 of data which are less than or equal to fi. • The hypothesis Hso will be accepted with a confidence level 1 — a if B~l (a/2, n, 0.5) < 8 < B~l{\ — a / 2 , n, 0.5), where B~1(a,n,0.5) is the inverse cummulative function of the binomial distribution.
Proof of Eqs. (3) and (4) The moments of order s of the random variable Y can be easily computed using Eq. (1) assuming g(Y) — s i n h _ 1 ( y ) , i.e., ms(Y)
1 r+ao •= \ exp(-z 2 /2)[exp(z/6) -
= n„(Y) =
exp(-zj'b)\ s dz
In particular, for s = 2,4, we have: M 2 (y)
= i[exp(2/62)-l]
/i 4 Q0 = ^[exp(8/b 2 ) - 4exp(2/6 2 ) + 3] We deduce the moments of order s = 2,4 of the random variable X = dY: M2(X)d=^2 = ^ [ e x p ( 2 / 6 2 ) - l ]
(15)
Ai4(X) d= M4 = ^[exp(8/6 2 ) - 4exp(2/6 2 ) + 3] o
and then: ,y.dtf 7 2 W = 7 2 =
tn(X) M l f -
3 =
exp(8/fe 2 )-4exp(2/6 2 ) + 3 2[exp(2/ b2 ) - I] 2
3
(16)
Control Charts for Symmetrical Distribution Data with Positive Kurtosis
15
If we define u = e x p ( 2 / 6 2 ) , then Eqs. (15) and (16) become:
d2(u-l) V2 = 2 uA-Au T2 = -^7
, ^ ' + 2, TT2- ~
3
(l8>
2(u - ly We notice t h a t Eq. (18) can be markedly simplified after some basic calculus: 72 = ^ ( « - l ) ( « + 3)
(19)
T h e solution for u in Eq. (19) is u = \/2(^/2 + 2) - 1, and we t h e n deduce t h e value of b (Eq. (3)). Finally, using Eq. (17) we have d (Eq. (4)). References Bai, D. S. and Choi, I. S. (1995), lCX and R Charts Control Charts for Skewed Populations", Journal of Quality Technology 27, 120-131. Balakrishnan, N. and Kocherlakota, S. (1986), "Effects of Non-Normality on X Charts: Single Assignable Cause Model", Sankhya B 48, 439-444. Burr, I. W. (1967), "The Effect of Non-Normality on Constants for X and R Charts", Industrial Quality Control 24, 563-569. Castagliola, P. (1997), "An Improvement of the Weighted Variance X Control Chart", in 8th International Symposium on Applied Stochastic Models and Data Analysis, 49-54. Castagliola, P. (1998), "Approximation of the Normal Sample Median Distribution Using Symmetrical Johnson Su Distributions: Application to Quality Control", Communications in Statistics: Simulation and Computation 27(2), 289-301. Chan, L. K., Hapuarachchi, K. P. and Macpherson, B. D. (1988), "Robustness of X and R Charts", IEEE Transactions on Reliability 37, 117-123. Choobineh, F. and Branting, D. (1986), "A Simple Approximation for Semivariance", European Journal of Operational Research 27, 364-370. Choobineh, F. and Ballard, J. L. (1987), "Control-Limits of a QC Charts for Skewed Distribution Using Weighted Variance", IEEE Transactions on Reliability 36, 473477. Cowden, D. J. (1957), Statistical Methods in Quality Control, Prentice-Hall. Farnum, N. R. (1997), "Using Johnson Curves to Describe Non-Normal Process Data", Quality Engineering 9(2), 329-336. Ferrell, E. B. (1958), "Control Charts for Log-Normal Universe", Industrial Quality Control 15, 4-6. Jacobs, D. C. (1990), "Statistical Process Control: Watch out for Non-Normal Distributions", Chemical Engineering Progress 86, 19-27. Johnson, N. L. (1949), "Systems of Frequency Curves Generated by Methods of Translation", Biometrika 36, 149-176. Nelson, P. R. (1979), "Control Charts for WeibuU Processes with Standards Given", IEEE Transactions on Reliability 28, 283-287. Pyzdek, T. (1992), "Process Capability Analysis Using Personal Computers", Quality Engineering 4(3), 419-440.
16
C. Philippe
Schilling, E. G. and Nelson, P. R. (1976), "The Effect of Non-Normality on the Control Limits of X Charts", Journal of Quality Technology 8, 183-188. Seppala, T. (1995), "Statistical Process Control Via the Subgroup Bootstrap", Journal of Quality Technology 27, 139-153. Willemain, T. R. and Runger, G. C. (1996), "Designing Control Charts Using an Empirical Reference Distribution", Journal of Quality Technology 28, 31-38.
Chapter 2 A S O F T W A R E RELIABILITY MODEL W I T H TESTING C O V E R A G E A N D I M P E R F E C T D E B U G G I N G X U E M E I Z H A N G and H O A N G P H A M Department of Industrial Engineering P.O. Box 909, Rutgers University Piscataway, NJ 08854-0909
1. Introduction Today, computer software is the heart of various safety-critical and civilian systems and software reliability measurement has become a significant factor in quantitatively characterizing quality and determining when to stop testing and release software on the basis of predetermined reliability objectives. Research activities in software reliability engineering have been conducted and several software reliability growth models [Goel and Okumoto (1979), Ohba (1984a, 1984b), Ohba and Yamada (1984), Pham (1993), Pham and Zhang (1997), Pham et al. (1999), Tohma et al. (1991), Yamada et al. (1983), Yamada et al. (1986), Yamada et al. (1992), Wood (1996)] have been proposed to estimate the number of remaining faults and assess the reliability of software. NHPP Software reliability models have been successfully used in modeling and analyzing the reliability of the software products. These models consider the debugging process as a counting process, which follows Poisson distribution with a time dependent failure intensity function. Existing software reliability models can be unified into a general form from which different software reliability growth models can be developed [Pham and Zhang (1997)]. Some previous models [Goel and Okumoto (1979), Ohba (1984b), Ohba and Yamada (1984)] were derived based upon perfect debugging which assumes that no new faults are introduced during the debugging process. Others consider imperfect debugging [Yamada et al. (1983), Ohba (1984b), Pham (1993), Yamada et al. (1986), Yamada et al. (1992)] in the sense that new faults can be introduced while the detected faults are removed. The imperfect debugging models assume that the fault content function is nondecreasing as one progresses through the testing phase.
17
18
X. Zhang and H. Pharn
This indicates that the total number of faults including the initial faults and the introduced ones is increasing in testing time. Recently, a software reliability model based on NHPP that integrates imperfect debugging with the learning phenomenon is developed [Pham et al. (1999)]. Another important issue in the software development process is testing coverage. Testing coverage measurement helps software developers to evaluate the reliability of the software during the testing phase and determine how much additional effort is needed to improve the reliability of the software. Testing coverage also can provide customers with a quantitative confidence when they plan to use the software. To our knowledge, testing coverage has not been addressed in the existing imperfect software reliability models. In this paper, a software reliability model is proposed to integrate testing coverage and imperfect debugging by incorporating fault introduction phenomenon into software detection rate function. In Sec. 2, a new NHPP model addressing fault introduction and fault detection rate is formulated. In Sec. 3, the proposed model is evaluated to illustrate the results by using two sets of data collected from real applications. 1.1.
Notation
m(t) a d(t) c(t)
expected number of software failures by time t, m(t) = E[N(t)] number of initial software faults imperfect debugging intensity rate testing coverage function, i.e., the percentage of the code covered by testing cases by time t intensity function or fault detection rate per unit time, X(t) = d[m(t)]/dt software reliability function for a mission time x given that the last failure occurred at time t actual number of failures observed by time t^
X(t) R(x\t) yk
2. Software Reliability Modeling The general NHPP software reliability growth model is formulated based on the following assumptions: (1) The occurrence of software failures follows an NHPP. (2) The software failure intensity rate at any time is proportional to the number of remaining faults in the software at that time. (3) When a software failure occurs, a debugging effort takes place immediately. This debugging is s-independent at each location of the software failures. (4) During the debugging process, the effort to remove each fault may not be perfect and therefore new faults may be introduced into the software system with the imperfect debugging intensity rate d(t).
A Software Reliability Model with Testing Coverage and Imperfect Debugging
19
(5) The imperfect debugging rate is assumed to decrease as testing progresses and becomes negligible towards the end of the testing phase because the experience and knowledge of the testing team increases with the progress of the learning process. (6) The imperfect fault debugging does not increase the number of initial faults. 2.1. The Nonhomogeneous
Poisson
Process
Software testing process has been widely modeled as an fault counting process. A counting process {N(t),t > 0} is said to be a nonhomogeneous Poisson process with intensity function X(t), if N(t) follows a Poisson distribution with mean value function m(t), i.e., PrfiVm = k} = t^)Le-m(t)
fc
= 0,1,2A
(1)
The mean value function m{t), which is the expected number of failures experienced up to a certain time t, can be expressed in terms of failure rate of the program, i.e., t(i) = / Jo
\(s)ds
where A(s) is the failure intensity function. Software reliability R(x/t) is defined as the probability that a software failure does not occur in (t,t + x), given that the last failure occurred at testing time t(t > 0 , x > 0 ) . That is, R(x/t) = e -M*+*)-m(t)] (2) 2.2. Model
Formulation
In a recent literature on software reliability engineering, imperfect debugging is usually considered by modeling a time-dependent fault content rate function [Ohba (1984a, 1984b), Ohba and Yamada (1984), Pham (1993), Pham and Zhang (1997), Pham et al. (1999), Yamada et al. (1983), Yamada et al. (1986), Yamada et al. (1992)]. That is, the total number of faults including the initial faults in the code and the introduced ones increases in testing time. These models require users to select an appropriate fault content rate function. In this paper, we derive a new model which incorporates fault introduction phenomenon into fault detection rate function. By doing so, we take imperfect debugging into the construction of fault detection phenomenon and the number of initial faults can be estimated consistently. Since fault introduction rate is taken into consideration, a general NHPP model incorporating testing coverage and imperfect debugging can be formulated as follows: dm{t) c'(t) [a - m(t)} - d{t) [a - m(t)} (3) dt 1 - c{t)
20
X. Zhang and H. Pham
where a is the number of initial faults in the software code. The function c'(t)/ [1 — c(t)} represents the fault detection rate function, where c(i) represents the testing coverage function. The function c(t) measure the percentage of the software code covered by testing cases up to any time t. Then, l — c(t) is the percentage of the software code which has not yet been covered by test cases by time t. The derivative of the testing coverage function, c'(t), represents the coverage rate. Therefore, the fault detection rate function can be expressed as c'(t)/[l — c(t)]. And d(t) denotes the fault introduction rate which is a decreasing function of time. Define g(t) = c'(t)/[l — c(t)]— d(t), which represents the imperfect fault detection rate. Equation (1) can be re-written as follows: d
^=g(t)[a-m{t)\
(4)
This model indicates that the failure intensity depends on both the imperfect fault detection rate and the number of remaining faults. Once the imperfect fault detection function, g(t) is determined from Eq. (3) in Pham et al. (1999), the mean value function can be expressed as follows: m(t)
-Bit) m0+
J Jt0
aeB{T)g(r)dT
(5)
where B{t) = f g{r)dr and m{to) = mo is the marginal condition of Eq. (2) with to representing the starting time of the debugging process. The reliability function can be derived substitute Eq. (5) into Eq. (3). 2.2.1. A Software Reliability Model with Testing Coverage and Imperfect Debugging In this study, fault introduction phenomenon is considered in the construction of fault detection rate function. The testing coverage function is a nondecreasing function of testing time t (see Fig. 1). The following function can be used to model the testing coverage function c(t) = 1 - (1 + bt)e~bt
(6)
then 1 - c(t)
1+bt
{
'
This is an S-shaped curve that captures the learning process of software tester [Yamada et al. (1983), Ohba and Yamada (1984)]. In other words, the fault detection rate increases at beginning but with a slower slope; after a certain period, since testers earn knowledge and get familiar with the underlying software, the fault detection rate increases faster; at the end most of the faults have been removed and the fault detection turns flat at the tail.
A Software Reliability Model with Testing Coverage and Imperfect Debugging
•
21
t
Fig. 1. Testing coverage function c(t).
d(t)
Fig. 2. Fault introduction rate function dli) versus testing time.
Define the fault introduction rate, d(t), which is a decreasing function of testing time t (see Fig. 2), as follows:
* > " TTS
<8>
This function is used to reflect the fact that as one progresses through the testing phase, the fault introduction rate slow down since testers become more experienced and introduce less faults. The imperfect fault detection rate function is given by: 9(t) =
bH l + bt
l + dt
(9)
22
X. Zhang and H. Pham Table 1. Summary of the N H P P software reliability models. Model Name
MVF
Model Type
Comments
(m(t)) bt
Also called exponential model
Goel-Okumoto (G-O) (1979)
Concave
m{t) --= a(t) =• a &(*) = b
a(l-e- )
Delayed S-shaped [Yamada et al. (1983)]
S-shaped
m(t) --= a ( l - (1 +
Inflection S-shaped SRGM [Ohba (1984b)]
Concave
m(t) a(t) b(t)
a(l-e~ht) --_ l+0e-bt =• a b =
Yamada Exponential [Yamada et al. (1986)]
Concave
m(t) a(t) b(t)
--= 0 ( l - e - r a . ( l - e < - ' " ' ) ) == a =: rafie-P1
Attempt to account for testing effort
Yamada Rayleigh [Yamada et al. (1986)]
S-shaped
m(t) == a ( l - e - « » ( i - « ( - , " 2 / a > ) ) a(t) == a b(t) =: r a / J e " ' 3 ' 2 / 2
Attempt to account for testing effort
Yamada Imperfect Debugging (1) [Yamada et al. (1992)]
S-shaped
m(t) := ^ x s ( e Q t - e - 6 t ) a(t) =--aeat 6(t) =•b
Assume exponential fault content function and constant fault detection rate
Yamada Imperfect Debugging (2) [Yamada et al. (1992)]
S-shaped
m(t) = a{l-e-bt] a(t) == a ( l + at) b(t) =-b
PNZ [Pham et al. (1999)]
S-shaped concave
m(t)
bt)e~bt)
Modification of G-O model to make it S-shaped Solves a technical condition with the G-O model. Becomes the same as G-O if 0 = 0
l+/3e-"
[ 1 - f]
+aat
Assume constant introduction rate a and the fault detection rate
a [ l - e - " ] [ l - f ] + aat
~ \ + 0e-bt <*(*) == a ( l + at) b(t) = b l+/3e-"
is a linear function of testing time, and the function is nondecreasing with an inflexion S-shaped model
P-Z [Pham and Zhang (1997)]
S-shaped and concave
m(t)
New Model
S-shaped and concave
m(t) = a — ae
(l+/3e-«) xKc + a X l - e - " ) - b^(e-at - e-bt)} a(t) = c + a{\ - e~at)
(1 + (b + d)t + bdt2)
a(t) a(t) = _&1*
yw
i+dt
i+u
Assume introduction rate is exponential function of the testing time, and the fault detection rate is nondecreasing with an inflexion S-shaped model Assume a constant initial fault content function, and the imperfect fault detection rate combining the fault introduction phenomenon
A Software Reliability Model with Testing Coverage and Imperfect Debugging
23
Substitute Eq. (7) into Eq. (3) with m(0) = 0, we obtain: m(t) = a - ae-bt[l + (b + d)t + bdt2}
(10)
Note that both imperfect debugging and learning curve of the fault detection and introduction are integrated into the proposed model. Table 1 summarizes the features of the proposed and several existing NHPP models, where in Table 1 the function a(t) is defined as the fault content function, and b(t) is the fault detection function, and g(t) is the imperfect fault detection rate. 2.2.2. Model Comparison Criterion The mean squared errors (MSE) is usually used as a criterion for comparison of goodness-of-fit and predictive power. MSE can be calculated as follows: MSE = S f - i f r f a ) - * ' * ] '
(11)
N —n where ykobserved number of failures by time t\. rh(tk): estimated number of failures by time tk k: testing interval index N: number of the observations n: number of parameters in the model MSE value indicates the overall relative "goodness" of a model after compensation for its complexity in terms of the number of parameters it has. The small the MSE value, the better the model predicts. 3. Model Evaluation and Applications 3.1. Goodness-of-fit
Test: Data from IBM Entry
Software
Package
In this section, we examine both the goodness-of-fit of the proposed model and the existing NHPP models using the software failure data collected from testing an online data entry software package at IBM [Ohba (1984a)] (see Table 2). The failures are recorded in days. We use subset of the above data to fit the models and estimate the parameters. Then we use the remaining ones to compare the predictive power of these models. For illustration purpose, we assume that the software has been tested for 17 days and the software failures during these 17 days are recorded. We estimate the parameters and determine the software reliability models using the first 17 data points and listed the parameter estimate in Table 3. We then utilized all the 21 data to estimate the model parameters, and the last column of Table 3 lists the estimation using all the 21 data.
24
X. Zhang and H. Pham Table 2. Failure data from IBM entry software package. Testing Time (Day) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Failures 2 1 1 1 2 2 2 1 7 3 1 2 2 4 1 6 1 3 1 3 1
Cumulative Failures 2 3 4 5 7 9 11 12 19 21 22 24 26 30 31 37 38 41 42 45 46
The results in Table 3 show that the new model provides consistent estimates for the number of initial faults (a). That is, a estimated using 17 data points are very close to the one estimated using all the 21 data points. This consistency will provide software developers with the precise information of how many initial faults in the software and number of faults remaining in the code at any time in the testing process. The estimates given by other models change when more failure observations are included into the model fitting. Table 4 summarizes the MSE value for model comparison. MSE is calculated based on 21-day failure data. From the results, we can draw a conclusion that the new model provides the best fit. 3.2. Evaluation for the Predictive Real Time Control System
Power:
Data from
a
In this section, the predictive power of the proposed model is evaluated by using a data set collected from testing a program for monitor and real-time control systems [Tohma et al. (1991)]. The software consists of about 200 modules and each module has, on average, 1000 lines of a high-level language like FORTRAN. Table 5 records the software failures detected during 111-day testing period. This actual data set is concave overall with two clusters of significant increasing detected faults.
A Software Reliability Model with Testing Coverage and Imperfect Debugging
25
Table 3. MLEs of model parameters — IBM data. Model Name
MVF (m(t)) -e~bt)
MLEs (17 Data Points)
MLEs (21 Data Points)
a = 37435.36 6 = 5.97 x 1 0 " 5
a = 56051 6 = 3.9 x 1 0 " 5
Goel-Okumoto (G-O)
m{t) = a ( l a(t) = a b(t) = b
Delayed S-shaped
m(t) = a ( l - (l + 6t)e- 6 £ ) a(t) = a
a = 75.36 6 = 0.1995
a = 71.73 6 = 0.10397
Inflexion S-shaped
m(t)
" ( l - =_ " )
a = 63.25
a = 57.37
a(t) = a °W
l +/ 3e-f"
6 = 0.1558
6 = 0.175
13 = 7.743
P = 8.5136
Yamada Exponential
m(t) =a{l-e-ra(1-e{~m'>) a(t) = a b(t) = raPe^131
a = 17081 d = 0.725 P = 0.000181
a = 17264.83 d = 0.734 P = 0.000173
Yamada Rayleigh
m(t)=a(l-e-''«(1-<=<"'3t2/2>)) a{t) = a b(t) = raPe-f3*2/2
a = 645.57 d = 0.0837 P = 0.0089
a = 664.19 d = 0.086 P = 0.0083
Yamada Imperfect
m(t) = ^ ( e » « -
Debugging (1)
Yamada Imperfect Debugging (2)
PNZ
e~bt)
a = 91920.94
a = 39252.59.
a(t) = ae
d = 0.0454
b(t) = b
6 = 1.61 X 1 0 " 5
d = 0.0185 6 = 4.57 X 10~ 5
m{t) = a[l - e~bt] [l - f ] + aat a(t) = a ( l + at) 6(t) = 6
a = 1787056 d = 0.08104 6 = 7.41 x 1 0 " 6
a = 1671644 d = 0.02932 6 = 1.0 X 10~ 6
a = 63.19
a = 57.30
8 = 0.1559 a = 4.98 x 1 0 " 5
6 = 0.175 d = 4.96 x 10~ 5
P = 7.735
P = 8.506
a = 303.71
a = 167.9
at
a[l-e-"][l-Sl+a«t a(t) = a ( l + at) b h(t\ °W l+0e-i<-
P-Z
-5^(e-at-e-6t)] a(i) = c + a ( l - e _ Q t ) b 6C/'l °W l + /3e-l" New Model
m(t) = a - a e " 6 t ( l + (6 + d)t + 6iii 2 ) a(t) = a „(t) = - ^
6 = 0.01125
6 = 0.0112
d = 0.0625
d = 0.1592
P = 0.03105
P = 0.01785
c = 104.0103
c = 100.615
a = 60.32 6 = 0.138 d = 0.011
a = 60 6 = 0.137 d = 0.013
26
X. Zhang and H. Pham Table 4. Model comparison — IBM data. Model Name
MSE (Fit)
MVF (m(4)) bt
Goel-Okumoto (G-O)
m(t)
Delayed S-shaped
m(i) -- : a ( l - (1 + a(t) =
•
a(l-e- )
117.24
o,(t) b(t) = b bt)e-bt)
38.0
b(t) = l+bt Inflexion S-shaped
m(t) •-
52.01
~l+f3e-bt
a(t) = b(t) = l+0e Yamada Exponential
m{t) •-
•• a ( l
—
e~
^l-et-"'))
124.81
a(t) = 6(t) = ra(3e ~pt Yamada Rayleigh
m(i)
t2 2 a ( l — -er Q ( l - e ( - S / > ) )
•• =
36.89
a(t) = a b(t) = Yamada Imperfect Debugging (1)
m(t)
= 4 ( = o t - e~bt) oc + b ^
a(t) =
53.09
'
aeat b m(t) = a[l - e-bt] [1 - f J + aat a(t) = a ( l + at) b(t) = b 6(4) =
Yamada Imperfect Debugging (2)
PNZ
m(t)
_
43.87
n[l-e~ilt][l-f]+aat l + Se-6*
34.05
a(t) = a ( l + a t ) b(t) = 6
P-Z
a{t)~b(t) = New Model
37.13
m(t)
= (I + /£="> [ ( C + a K l :c + o(l - e " 6
Qt
_
^^
~ ^{£~at
m(t) = T+I3e~ a - abte - ' ^ l + (& + d)t + 6dt 2 ) -bt/ a(i) = a
&(«)
l+bt
~
^^
) 32.20
1 + dt
Since the software turns stable after 61 days of testing, we will estimate the model parameters using the first 61 data points and compare them with parameter estimates using all the 111 data points. The results are summarized in Table 6. There are totally 481 faults detected by the end of 111-day of testing period. As seen from Table 6, the number of initial faults, a, estimated by the new model using the first 61 data points is 482.5, whereas when using all the 111 data is 482.
B
.-. en en £
oq CD en O
tl
M
3
»
-T"
O
C< D I P
.-. _
/*! 3p
CD en
en o P 3
H ct
CD
Oq
5
O
P
CD
5'
aq
P en
B o
o P 3 rt-
Oq' 3
P B
cr
CD
cr cr
rt
111!
pj
a- re cr
P "• B CD CL
CD X T3
C L
<-l CD
•a
CL
p 3
32
CL
CD^
CD
o
3
CL
•O a
ft,
en
CL P
en
»-! CD
en
g
3 M rtCD
H . ' DO
en 3
ff
BW q O 2
2
i
2 3
a 3 p
2, B.
i= CD en
3- 3
Mo
3 =
a.
CD
T3
cr o
B p
o
p
4
3 CD
p
*•*
a
o
—
O
?.B
O
CD >1
cr '"
^
P
2.
CL CD
S a-
P
M
rt-
O P 3
CD
^
p
«-
S° en CD
'— CL P
CD
a-
3
o O B o
CO
CD
rt
CD
r+
r+
cr
CD
°#
-• <: o 2
S" <2.
B
g
? ™
CD p
B
Oq
O si 3 .
CD „ , O
c l cr
CD
a- CL rt1-! cr p
^ P 3
<
3 3 s"«O ;-
en 3
CL
3
S P
cr 3 cT -a P
p
cr
T)
aq
£ B
aq
B
cr
CD
O
3
CL 3 3
P
3
en
P B CL
p
3
O en' CD
-'
3
en
CL CD CD »
•^ CD
3 Oq a. S1 2 "O & c 3 i-t
CD
9
B Oq ,_,
3 J cr •C 5 en CD
W
B
rt <S re 3 "
TO 2
a- S ^ « ^ 2C
CD
°^ S £
& £ 2 3 ' ^
3 o
•-•
<
en _
CD »
c? 3
M
£.
era aCD 3 CR " O
H
a 3 O &•
s
3
p
B- tS-
TO
II -, hi
h->
h->
h->
o o t D O o ~ q c n o
i
O O t - '
^ 4^ j i . ^ >fck 4^ rfi. CC 00 00 --J S S h - " O O t o y ? ? o o o o o
^
^
J^
h - ' O h - ' O O h - ' O O i - ' O O i - ' O O O
C O W C O O O W t O t O O C D X J ^ O O O ^ ^ ^
O O O O H i O t O l — i O O O l - * i -
O l O l ^ ^ C ^ W l ^ O C ^ ^ ^ h ^ h i h i C n ^
^ ^ ^ ^ ^ j ^ ^ ^ ^ ^ ^ ^ ^ o o w c o ^ ^ 4 ^ c o w w o o w t o w i ^ i ^ o o o ^ a i
uij^^cococotoi—'i—•
t O t O t O t O t O f c O t O t O t O t O t O h - 1
28
X. Zhang and H. Pham
Table 6. MLEs of model parameters — the control system data. Model Name
MVF (m(t)) e~bt)
Goel-Okumoto (G-O)
m(t) = a ( l a(i) = a bit) = b
Delayed S-shaped
m(t) = a ( l - (1 + a{i) = a
6(0 Inflexion S-shaped
bt)e-bt)
b't 1+bt _ aU-e-'") l + f3e~bt
MLEs (61 Data Points)
MLEs (111 Data Points)
a = 852.97 6 = 0.01283
a = 497.282 6 = 0.0308
a = 522.49 6 = 0.06108
a = 483.039 6 = 0.06866
a = 852.45
a = 482.017
ait) =
b = 0.01285
6 = 0.07025
b(t) =
0 = 0.001
(3 = 4.15218
3 6(0 = rape'! *
a = 9219.7 a = 0.09995 4 = 0.01187
a = 67958.8 a = 0.00732 /3 = 0.03072
Yamada Rayleigh
-m(l-e' m(t) : a(l a(t) = 2 2 6(t) = r a / J t e " ^ /
a = 611.70 a = 1.637 4 = 0.00107
a = 500.146 a = 3.31944 4 = 0.00066
Yamada Imperfect
m(0
a = 654.963
Debugging
a(0 =
Model (1)
6(0 =
a = 1795.7 6 = 0.00614 a = 0.002
Yamada Imperfect Debugging Model (2)
m(t)
a ( 0 = •• a ( l + at) 6(0 = b
a = 16307 6 = 0.0068 a = 0.009817
a = 591.804 6 = 0.02423 a = 0.0019
PNZ Model
m(0
a = 470.759
a = 470.759
6 = 0.07497
6 = 0.07497
Yamada Exponential
m(t)
l+0e-bt
m(t) =
a
(l_c-r«»(l-e<-'">))
a(t) =
^TE(e Q t -
e~bt)
= a[l - e~bt) [1 - f ] + aat
l+f)e-bt
a ( 0 = a ( l + at) 6(0 =
P-Z Model
W =
b l + /3e-bt
TTT=4^T[(c
+ a)(l-e-")
*)]
b—a v
a(0 = c + a(l 6(0
New Model
~eat)
l + /3e~
m(t) = a - a e - " ( l + (6 + d)t + 6d*2) a(0 = a
6(0 1 + bt
1+tit
b = 0.02059 a = 0.0027
a = 0.00024
a = 0.00024
J3 = 4.69321
4 = 4.69321
a = 0.920318
a = 0.46685 6 = 0.07025
6 = 0.0579 a = 2.76 x 1 0 "
5
a = 1.4 x 10~ 5
J3 = 3.152
4 = 4.15213
c = 520.784
c = 482.016
a = 482.5 6 = 0.0751 d = 0.006
a = 482 6 = 0.081 d = 0.007
A Software Reliability Model with Testing Coverage and Imperfect Debugging
29
Table 7. Model comparison — the control system data. Model Name Goel-Okumoto (G-O)
M V F (m(t))
MSE (Production)
6
m(t) = a ( l - e " ' ) a(t) = a
11611.42
6(4) = 6
Delayed S-shaped
m(t) = a ( l - (1 + bt)e~bt) a(4) = a 6
Inflexion S-shaped
W = T&
m(4) = " ^ " ^ ^ a(4) = a 6(4)
Yamada Exponential
935.88
590.38
&
m(t) = a ( l - e -™ ! (l-e<-' 3 t >))
12228.25
a(4) = a 6(4) = rafle-0t Yamada Rayleigh
m{t) = a ( l - e - r Q ( 1 - e ( " ' 3 ' a(4) = a 6(4) = r a / 3 e " ' 3 t 2 / 2
Yamada Imperfect
m(t) = ^ ( e
Debugging Model (1)
Q t
/2>
))
- e"6')
187.57
8950.54
at
a(t) = ae 6(4) = 6
Yamada Imperfect Debugging Model (2)
m(t) = o[l - e~bt} [l - f ] + aat a(t) = a ( l + at)
2752.83
6(4) = 6
PNZ Model
m(4) =
B[1
~'~";£:tf1+aa'
2480.7
a(4) = a ( l + a t )
W = TT^r = 7—-i-n^-[(c + a)(l-e"M) = (1+/e-'")t(c + a ) ( 1 ~ e " ' " ) -^(e-Qt-e-bt)] a(t) = c + a ( l - e a t )
P-Z Model
m(t) m t
New Model
m(t) = a - a e " 6 t ( l + (6 + d)t + bdt2) a(4) = a
()
102.66
9.37
30 X. Zhang and H. Pham number of testers, the proposed model did a better j o b in capturing these changes and therefore improved the predictive power. 4. C o n c l u s i o n s In this study, a N H P P software reliability model t h a t incorporates fault introduction phenomenon and testing coverage information is developed. T h e proposed model does not require any specific assumptions for t h e fault content function. Furthermore, this model incorporates fault introduction phenomenon and testing coverage information into error detection. T h e results show t h a t the number of initial faults in t h e software estimated by this model is consistent, in other words, it does not change significantly with time. This model has technical merit in estimating t h e number of initial faults consistently as one progresses through a testing phase. Therefore it can be used to determine t h e number of remaining faults in t h e software at any time point during testing process. T h e proposed model also provides testing coverage measure, which will help developers t o evaluate how much of the software has been executed and how much more effort is needed for further improvement. Acknowledgment This research was supported in part by t h e U.S. Federal Aviation Administration. References Goel, A. L. and Okumoto, K. (1979), "Time-Dependent Error-Detection Rate Model for Software and Other Performance Measures", IEEE Transactions on Reliability 28, 206-211. Ohba, M. (1984a), "Software Reliability Analysis Models", IBM Journal of Research Development 28, 428-443. Ohba, M. (1984b), "Inflexion S-Shaped Software Reliability Growth Models", in Stochastic Models in Reliability Theory, ed. S. Osaki and Y. Hatoyama, Springer-Verlag, Berlin, 144-162. Ohba, M. and Yamada, S. (1984), "S-Shaped Software Reliability Growth Models", in Proc. Jfth Int. Conf. Reliability and Maintainability, 430-436. Pham, H. (1993), "Software Reliability Assessment: Imperfect Debugging and Multiple Failure Types in Software Development", EG&G-RAAM-10737; Idaho National Engineering Laboratory. Pham, H., Nordmann, L. and Zhang X. (1999), "A General Imperfect Software Reliability Model with S-Shaped Fault Detection Rate", IEEE Transactions on Reliability 48, 2. Pham, H. and Zhang, X. (1997), "An NHPP Software Reliability Models and Its Comparison", International Journal of Reliability, Quality and Safety Engineering 4(3), 269-282. Tohma, Y., Yamano, H., Ohba, M. and Jacoby, R. (1991), "The Estimation of Parameters of the Hyper-Geometric Distribution and its Application to the Software Reliability Growth Model", IEEE Transactions on Software Engineering 17(5), 483-489. Yamada, S., Ohba, M. and Osaki S. (1983), "S-Shaped Reliability Growth Modeling for Software Error Detection", IEEE Transactions on Reliability 12, 475-484.
A Software Reliability Model with Testing Coverage and Imperfect Debugging 31 Yamada, S., Ohtera, H. and Narihisa, H. (1986), "Software Reliability Growth Models with Testing Effort, IEEE Transactions on Reliability 4, 19-23. Yamada, S., Tokuno, K. and Osaki, S. (1992), "Imperfect Debugging Models with Fault Introduction Rate for Software Reliability Assessment", International Journal of Systems Science 23(12), 2253-2264. Wood, A. (1996), "Predicting Software Reliability", IEEE Computer 11, 69-77.
Chapter 3 COST ALLOCATION FOR S O F T W A R E RELIABILITY ODED BERMAN Joseph L. Rotman School of Management University of Toronto, Toronto, Ontario M5S 3E6 Canada E-mail:
[email protected] MICHAL CUTLER Department of Computer Science Watson School of Engineering and Applied Science Binghamton University, Binghamton NY 13902-6000 E-mail:
[email protected]
1. Introduction Managing a large software project to produce highly reliable code with limited resources, which is delivered on schedule, is a very difficult task. In this research the limited resource is the overall budget provided for the software system. Our goal is to maximize the reliability of developed and/or purchased software systems. We deal with the following aspects of resource allocation: (1) Determining which parts of the software will be developed-in-house and which parts will be purchased (or reused). (2) Selecting one of possibly many available implementations for the parts that will be purchased (whenever such a choice exists). (3) Predicting the resources (cost) needed to develop in-house software parts, and the expected reliability that will be produced with those resources. (4) Computing the overall maximum reliability of the system without exceeding the budget constraint. Software tends to have a very complex structure. Commonly, reliability models consider a software system as a single unit, or as a linear sequence of modules executed serially. In this research we assume that the structure of a software system is hierarchical (tree structure) and we compute the hierarchical reliability of the system. We call the software functions implemented in the leaves of the tree the programming modules, and the software implemented in the internal nodes the
33
34
O. Berman
and M. Cutler
integration modules. We assume that the main goal of integration modules is to invoke the code of its children, and to deal with interface issues that must be addressed when the output of one child becomes the input of another child. More specifically the integration module must deal with extracting and converting data to the form needed by a child node. Changing the assumptions about the role of an integration task will not invalidate the model. For economical reasons, and/or to save time, it is possible that only some of the software will be developed in-house and the rest will be purchased or reused. In this research we assume that all integration tasks are developed in-house but any programming module can be either purchased or developed in-house. A module is a candidate for purchasing when there are available versions on the market, or when the code is already available in-house and can be reused. The cost of the reused code is assumed to be 0. The development of a hierarchical software system begins with coding and unit testing all modules that are developed in-house. Purchased modules have a fixed known reliability and will not be unit-tested. Next, every integration module whose children have already been tested will be tested. Each test checks the code of a single hierarchy for interface errors. Interface errors occur when the structure of the data and/or some assumed units of measurement and accuracy are inconsistent between the module that generates the data and the module that uses the data. Two models are described. The first model, called the predetermined model, explains in more detail the model discussed in Cutler and Berman (1998). The second model, called the general model, describes a similar but simpler model than the model presented in Berman and Cutler (1998). In the predetermined model the decision to purchase or develop each programming module has been made. This model is based on the assumption that to enable on time delivery and/or to save cost, an organization may elect to purchase a subset of the modules. The organization may however decide that the rest of the modules must be developed in-house since they contain the additional functionality that will make the software competitive, or since they are not available on the market. Thus, the main issue is to determine which version should be selected for purchased modules and how much to invest for developed ones. The second model is more general as it also determines for each module whether it should be purchased or developed in-house. Both models are formulated as mixed integer non-linear programs that are very difficult to solve using a general-purpose algorithm. The paper includes branch and bound schemes to compute the optimal reliability where only easier continuous non-linear problems are solved. The paper has the following structure. In the next section we review related work. Section 3 discusses the reliability of modules developed in-house. Section 4 presents the predetermined model and its branch and bound scheme. An example is included in Sec. 5. Section 6 presents the general model and its branch and bound scheme. Finally, Sec. 7 discusses the contributions of the research as well as some limitations that need to be addressed in the future.
Cost Allocation for Software Reliability
35
2. Related Work In this section we discuss previous work that focuses on optimization models for reliability and cost allocation. Reliability allocation deals with the setting of reliability goals for individual components to achieve a system's reliability goal. Kubat (1989) formulated a mathematical programming model which, for a given level of software reliability, determines module reliabilities by minimizing development and testing costs. Zahedi and Ashrafi (1991) presented a software reliability allocation model. In their model the reliability of a program is the multiplication of the reliability of its modules, and the reliability of the system is a weighted sum of the reliability of its programs. The cost of each module is assumed to be a linear function of its reliability. The goal of the model is to find the reliability of each module so that the reliability of the system will be maximized without exceeding a given budget. They also proposed extending the model by making cost an exponential or a logarithmic function of reliability. A cost-effective alternative to developing your own software is to build it from commercially available components. Software selection deals with optimal selection of commercially available software components, in order to maximize the overall reliability of the system subject to a budget constraint. Ashrafi and Berman (1992) presented optimization models for selecting a set of available commercial programs. The first model does not allow redundancy, whereas the second model permits redundant selections of available versions for each program. This work was extended in Ashrafi and Berman (1993) by considering a software system in which each program consists of a sequence of modules which upon execution performs the program. Ashrafi Berman and Cutler (1994) deal with software selection models when N-version programming provides redundancy. Berman and Cutler (1996) assume that the needed software is included in commercially available "libraries" (stored for example of CD ROMs), and present optimization models for selecting a subset of these libraries. Berman and Cutler (1998) assumed that a software package consisting of K different functions needs to be implemented and that some modules might be shared by these functions, and conducted research on cost allocation with the assumption that each module could either be purchased or developed in house. 3. Reliability of an In-house Developed Module We have chosen to represent the reliability of a module developed in-house with the negative exponential function. Let x° be the basic cost needed for developing module i in-house. The basic cost of developing the module is x\ and the initial reliability is r°. With additional expenditure (more debugging, testing, etc.), the reliability (the probability that the module will perform correctly when called upon) can be further increased. The maximum reliability achievable for module i is r™1. Usually we assume that r™ = 1. But since 100% correctness may be very difficult to achieve, r™ can be set to a value smaller than 1.
36
O. Berman and M. Cutler
The reliability of module i, Ti is: r
T ~ ( ( C - r°)e-Qi(x^x?))
Xi
> x\
n
(1)
where a, is a parameter that reflects the sensitivity of the reliability of the module to changes in the cost. The larger oti, the larger is the effect of a change in the cost Xi. Note that when xt ~ x®, r* = r ° , and when Xi —>• oo r, = r™\ Figure 1 taken from Berman and Cutler (1998) shows the reliability function of Eq. (1) for r° = 0.3, r™ = 0.9, an = 0.05, and x° = 100. In this case, the reliability is 0 while the cost is less than 100, and 0.3 when the cost is equal to 100. The reliability then increases until the maximum reliability of 0.9 is achieved.
1000 Cost Fig. 1. The reliability function.
The analysis developed in later sections is independent of the function used to represent the relationship between reliability and cost. However, to demonstrate the applicability of our model, we had to choose a specific reliability function. Avritzer and Weyuker (1995) plotted the reliability of five projects as a function of the number of test cases performed. Their figures show that the reliability grows rapidly after a relatively small number of test cases and then much slower as the number of tests is increased. Moreover, their plots are very similar to plots of the negative exponential function used in this paper (see Fig. 1). Since the cost of testing grows as the number of tests performed increases, the use of the negative exponential function to describe the relationship between reliability and cost of a module is reasonable. A possible alternative would be an S-shaped function. The Sshaped model reflects the underlying assumption that early testing is not as efficient as later testing, so there is a ramp-up period during which the error detection rate increases Zhang, Pham and Vu (1998).
Cost Allocation for Software Reliability
37
4. The Predetermined Model and Its Branch and Bound Scheme 4.1.
The Predetermined
Model
In the predetermined model the assignment of a module as either purchased or developed in-house has already been determined. Consider a hierarchical software system composed of programming modules 1,2,... ,m and integration modules m + 1 , . . . , n . The purchased subset consists of modules \,... ,m\. The rest of the modules mi + 1 , . . . , n are developed inhouse. Each module i in the purchased subset is assumed to have rij > 1 versions available on the market. The j t h version of module i assumes to cost c^ and has reliability r^. The objective of the model is to maximize the reliability, R, of the software system subject to a constraint on the total available budget $B. Let yij be a binary variable that assumes the value 1 if version j G { 1 , . . . ,n) of module i € { l , . . . , m } is purchased, and 0 otherwise. To ensure that only one version is purchased we require Y^LiUij = 1- For any purchased module i € { 1 , . . . , m i } the reliability r^ is given by: rti
and its cost is Y^jLi VijCijLet Ti 6 (m + 1 , . . . ,n) denote an integration module. The reliability of the hierarchy with root Tj depends on the reliability of its children. Let (ji,J2, • • • ,js) denote TVs children. Each child is either a programming module or an integration module. We assume that the execution of the children is serial and that there is no dependency among them. We further assume that when hierarchy with root Tj is tested, only interface errors are found. These errors are corrected by changing the code of the integration module. The reliability of the child modules remains unchanged. Let rjk denote the reliability of the code associated with jk- When jk is an integration module, rjk is the hierarchical reliability of the code with root jk- By testing and correcting the code of the integration module, the hierarchical reliability may increase to, at most, r™ = nfc=irifc- However, because of the existence of interface errors, the initial reliability r^ may be less than r™. Let 7^(0 < qrt < 1), be a coefficient that reflects the effect of interface errors on the initial reliability. So r
Tt =9T i rifc=l r J fc =(lTir^.. The negative exponential function is used for representing the reliability of a hierarchy. The reliability of the hierarchy with root Tj is thus:
RT,
f rrn ~ ((r% ~ r%..)e-aT'(xT<-x°Ti)) = < y0
xTi > x°T. ' XTi <
where a^, ^Ti, and Xj,. are as defined in the previous section.
x
Ti
(3)
38
O. Berman
and M. Cutler
The formula for R depends on the hierarchical structure of the system. If the system contains only one programming module, R is equal to the reliability of the single module. If the module is an in-house developed module, its reliability formula is that of Eq. (1). If the module is a purchased module, its reliability formula is that of Eq. (2). When the system contains more than one module, the root is an integration module and the hierarchical reliability of the root depends on the reliability of its children. When all the children of the root are programming modules, the formula for R is that of Eq. (3). When some children of the root's integration module are themselves integration modules, the formula for their hierarchical reliability must be computed before R can be derived. A simple recursive procedure can be used to derive the formula for R. Figure 2 shows a software system that contains four programming modules. Index-generator (3) and Analyzer (4) are programming modules developed in-house and their reliability formula is that of Eq. (1) (see r% and r^ in Sec. 5). Parser (1) and Stemmer (2) are purchased modules and their reliability formula is that of Eq. (2) (see r\ and r-i in Sec. 5). The reliability formula of an integration module Eq. (3) is used for Keyword (5) (see r$ in Sec. 5). Once the formulas for all children of the root have been determined, Eq. (3) is used to derive R (See r^ in Sec. 5).
II
II
Integration task Purchased module
|
|
Developed module Fig. 2. A database-indexing system.
The problem called (P) can be written as: maxi? S.T.
(PI)
Cost Allocation for Software Reliability
mi
rii
n
c
E E ^ + E x^B
39
(p2>
ni
]Pyij=l
for i = l , 2 , . . . , m i
Xi > 0
for i = mi + 1 , . . . , n
and
Uij > 0,1
for z = 1 , . . . , mi
and j = 1 , . . . , n»
(P3)
(P4) The objective (PI) is to maximize the system reliability R. Constraint (P2) ensures that total expenditures do not exceed the budget. Constraint (P3) makes sure that exactly one version is purchased for module i, where i = 1 , 2 . . . , m i . Finally, (P4) guarantees that all nonnegative and all y^ are binary. The formula for R depends on the reliability of all the modules, and will be zero if there is a module with zero reliability. The reliability of a purchased module is the reliability of the selected version, and is always positive. The reliability of the in-house developed module i is positive if xt > x®. Therefore, any solution with R > 0 must satisfy Xi > x\ for all in-house developed modules i = m,\ + 1 , . . . , n. 4.2. The Branch and Bound Scheme for the Predetermined
Model
The problem is solved using a branch and bound scheme (B & B). As we show later, the B & B utilizes the following simplified version of problem (P), called (RP): maxi? S.T.
E
*S = B>
j=m1 + l
where Xj > 0 for all j € {mi + 1 , . . . , n}. B' is the remaining budget after all discrete variables have been assigned a value (i.e., after all decisions on (purchased) versions have been made). Notice that (RP) is a much easier problem than (P) since it is just a maximization of a non-linear function subject to one simple linear constraint. Without any loss of generality, we rule out from the set of versions of each module all versions that cannot belong to an optimal solution. Version k of module i can be deleted if there is another version j for module i satisfying r^ > (>)rjfc and Cij < (<)cik- We call the deleted versions dominated. The largest cost of the versions of module i is c° 0 = max J= i i 2,...,nj{cij}- Thus version j° is the most expensive version. The lowest cost of the versions of module i is Cjjoo = minj =li2) ... >ni {cj : ,-}. So version j 0 0 is the least expensive version. Note that after discarding all dominated versions, the most expensive version j° is also the most reliable version, and the least expensive version j 0 0 is the least reliable
40
O. Berman
and M. Cutler
version. We now use the notation rii for the number of nondominated versions of module i for i = 1 , . . . , m\. The B & B tree has a root node at level 0 of the tree and each node at level k > 0 has 7ifc+i branches. The jth branch corresponds to selecting version j of module k + 1. Therefore yk+ij — 1 (and thus yk+i,i = 0 for I / j and j = 1,2,... ,nk+i). If, for example, n\ = 2, there will be 2 branches leaving the root, one for yn = 1 Om = 0) and one for j / 1 2 = 1 (y22 = 0). For each node at level k > 1 we define a partial solution, pr, that includes all the decisions made along the path from the root to the node. Each of these decisions specifies which version is selected for purchased module i for i < k. For each pr at level k > 1 we calculate the remaining budget B' = B — J2Vi zprViji a n d upper and lower bounds. Nodes of the B & B tree are fathomed (need not be branched), since they either cannot lead to an optimal solution ((i) and (ii)), or since the best solution for the node has already been found (hi): (i) The best lower bound available (which is always a feasible solution in this case) is greater or equal to the upper bound for the node, or (ii) The cost of pr exceeds the budget, or (hi) The lower bound for the node is equal to its upper bound (in which case the best feasible solution for this node is already derived). 4.2.1. The Upper Bound (UB) The upper bound of a given node at level k > 1, with partial solution pr is computed by solving problem (RP). This means that values must be assigned to all y^ that are not in the partial solution. Also a new budget, B', must be provided for (RP). The initial value of B' is B — ]T) 6 ytj. The updated budget, B', and the values yij are calculated as follows: k < Vi < mi Set yij0 = 1;
yij=0
for j ^ j° ;
B' = B' - cij00
Thus, to get an upper bound, the most reliable version, j°, is used for the rest of the purchased modules i — k + 1 , . . . , m i . The remaining budget, B', is decreased by the cost of the least expensive version j 0 0 for i = k + 1 , . . . , mi. The rational for this calculation is that to make the upper bound tight we would like to use the smallest possible remaining budget B' while still ensuring that the solution of (RP) is an upper bound. 4.2.2. The Lower Bound (LB) To derive the lower bound, we again solve problem (RP) for a new budget B' and the values y^ calculated as follows:
Cost Allocation for Software Reliability
41
k < V i < TOi
Set yij00
= 1;
2/^ = 0
for j ± j 0 0 ;
B' = B' - cij00
Thus, to get the lower bound, the least reliable version is used for i = k + 1 , . . . , m i . Note that in this case B' is the actual remaining budget. This means that the lower bound provides a feasible solution. 5. E x a m p l e This example, discussed earlier, will clarify some of our earlier discussion. The implementation of a database-indexing engine in our example consists of the following modules (see Fig. 2): 1. 2. 3. 4. 5. 6.
A lexical parser A word stemming procedure An indexing module A lexical analyzer Integration module Integration module
Parser (1) Stemmer (2) Index-generator (3) Analyzer (4) Keyword (5) and Database-indexing (6)
Parser (1) and Stemmer (2) are the purchased modules with two versions for each (11 and 12 for (1), and 21 and 22 for (2)). For competitive edge, it may be important to develop the Index-generator (3) and the Analyzer (4), in-house. Keyword (5) integrates the code of the Analyzer (4) and Stemmer (2). Databaseindexing task (6) integrates Keyword (5) with the Index generator (3) and the Parser (1). The arbitrary numbers chosen for the example are: r n = 0.7, en = 5 r i 2 = 0 . 9 , C12 = 6
r 2 1 =0.87,c 2 i = 7 r 22 =0.95,c 2 2 = 8 r% = 0.83, r° = 0.53 , a3 = 0.3, x% = 2 rf = 0.9, r°4 = 0.5 , a4 = 0.4, x°4 = 3.5 q5 = 0 . 8 , a 5 = 0.25,x° = 4 q6 = 0 . 8 , a 6 = 0.3,xg = 3 To calculate the reliability of the system, we first calculate the reliability of purchased modules 1 and 2 and programming modules 3, and 4.
42
O. Berman
and M. Cutler
n = r u 2 / i i +r 1 2 yi2 T2 = r2lV2\
T3
TA
+ r22V22
0.83 - (0.83 - 0.53)e-°- 3 ( X3 - 2 )
X3
0
Otherwise
0 . 9 - (0.9 - 0.5)e-°- 4 ( X4 - 3 - 5 )
x4 > 3.5
x0
>2
Otherwise
The hierarchical reliability of Keyword (5) is: lrm
_
r0\e-0.25(x5-4)
Th
x5>4
Otherwise x where r™ = r2r4 and r° = 0.8r™
.R is the hierarchical reliability of Database-indexing (6):
f C " (C - r^e"0-3^-3) ^6
0
x6 > 3 Otherwise
x where r™ = r\r$rz and r° = 0.8r-g The problem is, max/2 S.T. 53/11 + 62/12 + 7j/2i + 82/22 + a;3 + a;4 + a;5 + x 6 < B 2/11 +2/12 = 1 2/21 + 2/22 = 1 £3 > 0, J/11,
£4 > 0, 2/12,
2/21,
X5 > 0,
X6 > 0
2/22 = 0, 1
In Table 1 we give the optimal solution of the example when the budget is varied from 25 to 200. The non-linear programs were solved using the Solver function of the Microsoft Excel (Office 97) software. The solver function can solve non-linear programs and is based on the Quasi Newton routine (Bazarra and Shetty, 1979). We have found it useful to start the optimization with the initial solution Xi = x® i = 3, 4, 5, 6. In all cases investigated, the obtained optimal solution of (RP) was the global optimum but since we do not have a proof in general it is suggested to solve (RP) with several different initial solutions.
Cost Allocation for Software Reliability
43
Table 1. Solutions for different budgets B. B
2/11
yi2
2/21
2/22
xs
X4
0C3
16
Optimal Reliability
25 26
1 0
1 1
0 0
2 2
30 35
0 0 0
0 0 0
1
3.3816
1 1
5.1168 6.3842
0 0 0
1
7.6511
50 55
0 0 0
1 1
8.9178 10.1842
60 70 80 100 150 200
0 0 0 0 0 0
0 0 0 0 0 0
1 1 1 1 1 1
11.4505 13.9826 16.5145 21.5779 34.2205 46.7863
40 45
0
4 4 5.6183 6.9833 7.9627 8.9325 9.8958 10.8547 11.8105 13.7167 15.6189 19.4188 28.9238 37.8185
4 4
3 3
0.11826 0.15205
4 4.7556 6.2414
3 4.1441
0.2518 0.3491
5.4115
0.4269
7.7371
6.6785 9.9452 9.2116 10.4778 13.0099 15.5418 20.6053 33.2734 45.0614
0.4870
9.2410 10.7494 12.2610 15.2906 18.3246 24.3978 39.5821 56.3336
0.5316 0.5639 0.5868 0.6140 0.6270 0.6361 0.63863 0.63868
For an example of the B & B scheme, suppose the budget is equal to $45. The B & B tree is given in Fig. 3.
yu= i UB = .3940 LB = .36085
yi2= i
UB = .4971 LB = .4552
UB = LB = .4552
UB = LB = .4870
Fig. 3. Branch and bound tree for the example.
To explain the calculations Node 2 at level k = 1. Since ylx the most reliable version 22 for B' is decreased by the cost 7 of
of the upper bounds consider pr = (y\\ = 1) for e pr, the remaining budget B' = 45 - 5 = 40. Now module 2 is selected (so y22 = !)• Next the budget the less expensive version 21. So B' = 40 - 7 = 33.
44
O. Berman
and M. Cutler
The problem (RP) is: maxi? S.T. x3 + X4 + x5 + x6 < 33 £3 > 0,
X4 > 0,
x5 > 0,
x6 > 0
where : r™ = r 2 r 4 = r 2 2r 4 = 0.95r4 r° = tor? = 0-76r4 C = 0.7r3r5 r% = 0.56r 3 r 5 R = r6 The optimal solution of this (RP) is 0.394, which is the value of UB of Node 2. To calculate the lower bound, the least reliable version 21 of module 2 is selected (so 2/21 = 1), and the initial budget used is B' = 33. Now the following problem (RP) problem is solved: maxi? S.T. xz + x4 + x5 + x6 < 33 x3 > 0,
X4 > 0,
x5
>o
x6 > 0
where : rT = r2r4 = r2r4 = 0.87r4 r° = q2r^ = 0.696r4 C = 0.7r 3 r 5 r° = 0.56r 3 r 5 fl = r 6 The optimal solution for (RP) is 0.36085, which is the value of LB for Node 2 of Fig. 3. As can be seen from Fig. 3, Node 2 is fathomed since the best lower bound at stage fc = 1, LB = 0.4552 is larger than the upper bound. The optimal solution is: 2/12 = 2/22 = 1, and from the (RP) calculated for Node 5: x3 = 7.6511, x4 = 8.9325
Cost Allocation for Software Reliability
45
x$ = 7.7371, x6 = 6.6785 and the optimal objective function value is 0.487. Figure 4, shows the optimal reliability as a function of the budget.
0.6 -
Reliability p p
0.5
0.2 i0.1 *"" i
j 25
4
26
30
35
40
45
50
_».
55
60
70
80
100
150
200
Budget
Fig. 4. Optimal reliability as a function of the budget.
Note that the maximum possible reliability of the system is: i 6 — r 1 2 ^ 3 ^5
Since r 5 = r™ = r22r™ = 0.95 * 0.9 = 0.855, the maximum reliability of the software is r 6 = 0.9 * 0.83 * 0.855 = 0.63868. Table 1 shows that this reliability is achieved when the investment for the system is 200. 6. The General Model and Its Branch and Bound Scheme 6.1. The General
Model
In this model the programming modules are 1 , . . . , m and each module, i, can be either purchased in the market or developed in-house. The number of versions is 7ij for i = 1 , . . . ,m. Let Zi be a binary variable that assumes the value 1 if the programming module i is to be developed in-house, and 0 otherwise. Since a module is either developed in-house or one of its version is purchased, Zi + 5Z"Li ytj = 1Let 7-j be the reliability achieved with cost Xi when module i is developed inhouse. For any programming module i its reliability Ri is given by: R%
TrZi + 7V *, Vijrij 3=1
The cost of the module is X{ when it is developed in-house (zt = 1), or Vijrij when it is purchased.
Y^Li
46
O. Berman
and M. Cutler
The problem called (GP) can be written as, max^ S.T.
(GP1)
££c«l>tf+5>i<£
(GP2)
i~l
j=l
j=l
rti
Zi + Y,yiJ
= 1
fori = l , 2 , . . . , m
(GP3)
i = l,...,m
(GP4)
Xi - Uzi < x° - e Xi > 0
for i = 1 , . . . , n
and
Vij = 0,1
for i = 1 , . . . , m j = 1 , . . . , nt
(GPS) where U is a very large number (e.g. U = 1, 000,000) and e is a very small positive number (e.g. e = 0.000001). (GP1) maximizes the total reliability, (GP2) ensures that total expenditures are within the budget, and (GP3) ensures that a module is either developed in house, or exactly one of its versions is selected to be purchased. Constraint (GP4) guarantees that Xi and, consequently, r, will be 0 if one of the versions is purchased. In that case 3 j such that -tjij = 1 and Zi — 0. So Xi is less than the minimum expenditure x° for programming module i, causing Xi to be equal to 0 in the solution. Note that if Zi = 1, the left side of the constraint is negative and since the right side is positive the constraint is satisfied. 6.2. The Branch and Bound Scheme for the General
Model
The problem is solved using a branch and bound scheme (B & B) similar to the one used for the predetermined model. A simplified version of problem (P), called (GRP) is solved. The (GRP) problem is: max/? S.T.
Y^x3z3+ J2 xJ -B' j—1
Xj > 0
j—m+1
for j = 1 , . . . , n
B' is the new budget computed when the discrete variables are assigned a value. The B & B scheme is based on the fact that (GRP) is an easier problem. The cost
Cost Allocation for Software Reliability
47
c^o of the most expensive version, and the cost Cjj-oo of the least expensive version are as defined in the predetermined model, fy is the cost that must be allocated to programming module i in order to achieve the reliability r ^ . Let rj(x) denote the reliability that can be achieved with investment x when module i is developed in-house. So when r° < r^ < r™, Xij, satisfies rtj = rj(xy). Without any loss of generality we rule out from the set of all possible decisions the dominated ones (decisions that cannot be optimal): (1) If 3j 3 Tij > r™ and Cy < a:?, then developing module i in-house cannot lead to an optimal solution, and the module will be purchased. (2) If 3j 3 r° < rij < r™ and Cij > Xij, the reliability of version j can be achieved at a lower cost by developing the module in-house, so version j of module i can be ignored. (3) If 3j B Tij < r° and c^ > a;°, version j for module i can be ignored. (4) If 3j, k 9 Tij > (>)/"ife and Cij(<) < Cik, version k for module i can be ignored. The structure of the B & B tree is similar to the one used in the predetermined model. The partial solution pr, and the initial remaining budget B' are as defined for the predetermined model. For each pr at level k > 1 we calculate the remaining budget B', and upper and lower bounds. Also, the same three conditions are used to fathom nodes of the B & B tree. 6.2.1. The Upper Bound The upper bound is computed below. Line 1 selects the most reliable version if its reliable version is at least r™\ The budget is decreased by the lower of the following two costs: The cost to buy the cheapest version, and the basic cost to develop the module in-house. Line 2 makes the decision to develop the module in-house if the reliability of the most reliable version is less than the maximum that can be achieved by developing the module in-house. In this case, the budget is increased by x^o — ci:/oo > 0, where Xif is the cost of developing module i in-house so its reliability will be r„o. Line 3 determines that the module will be developed in-house if the reliability of the most reliable version is less than r°. The budget is increased by x° — c^-oo > 0, which is the minimum additional cost needed to develop the module instead of buying the cheapest version k < Vi < m (1) If r y o < rj", set yij0 = 1 (Vij = 0j ^ j°,Zi = 0), B' = B' - min^-oo.a:?) (2) If r° < r-ijo < rf1 set Vij = 0, j = 1,2,..., nu zt = 1, B' = B' + max{0, xij0
-
CijOO }
(3) Ifr?
< r ™ s e t y i j = 0, j = 1,2,... ,ni,Zi
= 1, B' = B'+max{0,a:9-Cyoo}
48
O. Berman
and M. Cutler
6.2.2. The Lower Bound We solve three different problems (GRP) and obtain three lower bounds: LB', LB" and LB'". The best of these lower bounds serves as the lower bound called LB. The procedure for calculating LB' is: k < Vi <m If ci:joo < x°, set yij00 = 1, (yi:j = Oj ^ j 0 0 ) , zt = 0, B = B - cij00 If ci:joo > xl, set yij =0j = l,2,...,rii, Zi = l Note that LB' selects the least expensive and least reliable version of module i if the cost of the least expensive version is less than the initial cost of developing the module in-house. The budget B' is decreased by the cost of the selected least expensive version. Otherwise, the module is developed in-house and the budget B' remains unchanged. In this case the remaining budget B' is the actual budget and the solution of (GRP) is a feasible solution. The procedure for calculating LB" is: k < Vi < m Set i/ijoo = 1, B = B - Cijoo, (yij = Oj ^ j 0 0 ) ,
Zj
=0
Here, the least expensive version is chosen for all remaining modules, and the budget is decreased accordingly. It is possible, however, that even the least expensive versions are too costly and there will be no feasible solution for LB" since the budget B is exceeded. In that case it is not used in the calculation of LB. The procedure for LB"' is: k < Vi < m Set ytj =0,j
= 1,2, . . . , n j , z, = 1
In this case, all remaining modules are developed in-house. It is possible, however, that developing the modules in-house is more expensive and LB'" may not be feasible. In that case it will not be used in the computation of LB. The B & B scheme for the predetermined module is relatively more efficient than the B & B scheme for the general model for the following reasons: (1) In the predetermined model, the maximum number of branches in a path from the root to a leaf is mi. In the general model it may have m > mi branches. (2) In the predetermined model, (RP) has n — m\ continuous variables while (GRP) has n continuous variables. (3) The computation of the bounds requires solving two (RP) problems for the predetermined model, and four (GRP) problems for the general model.
Cost Allocation for Software Reliability
49
7. C o n c l u d i n g R e m a r k s This paper deals with t h e problem of cost allocation t o develop or purchase the components of a software system with the goal of maximizing system reliability. T h e main contributions of t h e research are: (1) Modeling a software system as a hierarchy of programming and integration modules. (2) T h e computation of t h e reliability of an in-house developed module and of the hierarchical reliabilities. (3) T h e inclusion of b o t h in-house developed and purchased modules. (4) T h e branch and b o u n d schemes for b o t h models. T h e research has some limitations t h a t need t o be addressed in the future. Our models do not take into account t h a t major redesign may occur as a result of testing. It limits t h e role of integrating modules t o dealing with interface issues and invoking child modules. We assume t h a t the reliability of child nodes remains unchanged when a hierarchy is tested. Investment decisions are made only for the implementation stage of t h e software. Finally, the estimation of t h e coefficients q^ and a.i is not addressed. Acknowledgment This research was supported by a grant from the N a t u r a l Sciences Engineering Council C a n a d a ( N S E R C ) . References Ashrafi, N. and Berman, O. (1992), "Optimization Models for Selection of Programs: Considering Cost and Reliability", IEEE Trans, on Reliability 41(June), 281. Ashrafi, N. and Berman, O. (1993), "Optimization Models for Reliability of Modular Software Systems", IEEE Trans, on Software Engin. 19(November), 1119. Ashrafi, N., Berman, O. and Cutler, M. (1994), "Optimal Design of Large SoftwareSystems Using N-Version Programming", IEEE Trans, on Reliability 43(June), 334. Avritzer, A. and Weyuker, E. (1995), "The Automatic Generation of Load Test Suites and the Assessment of the Resulting Software", IEEE Trans, on Software Engin. 21(9), 705. Bazarra, M. S. and Shetty, C. M. (1979), Nonlinear Programming Theory and Algorithms, John Wiley and Sons, New York. Berman, O. and Cutler, M. (1996), "Choosing Optimal Set of Libraries", IEEE Trans, on Reliability 45(2), 303. Berman, O. and Cutler, M. (1998), "Optimal Software Implementation Considering Reliability and Cost", Compu. and Opers. Res. 25(10), 857. Cutler, M. and Berman, O. (1998), "Optimal Reliability Considering Cost of a Software System with Purchased and In-House Developed Components", in Proceedings of the 4th ISSAT International Conference on Reliability and Quality in Design, August 12-14, Seattle, Washington, 51. Kubat, P. (1989), "Assessing Reliability of Modular Software", Opers. Res. Lett. 8, 35.
50
O. Berman and M. Cutler
Zahedi, F. and Ashrafi, N. (1991), "Software Reliability Allocation Based on Structure, Utility, Price, and Cost", IEEE Trans, on Software Engin. 17(4), 345. Zhang, X., Pham, H. and Vu, M. (1998), "Comparison of NHPP Software Reliability Models", in Proceedings of the 4th ISSAT International Conference on Reliability and Quality in Design, August 12-14, Seattle, Washington, 66.
Chapter 4 G E N E R A L RELIABILITY T E S T P L A N S FOR ONE-SHOT DEVICES WEILI Z H A N G Novartis, Strategic Marketing Operations 59 Route 10, E. Hanover, NJ 07936 WEI-KEI SHIUE Department of Mathematics and Statistics Southern Illinois University at Edwardsville, Edwardsville,
IL 62026
1. Introduction The reliability of military weapons as well as industrial products normally degrade over time. To monitor possible decay in reliability, inspection of a sample of items is needed periodically in order to ensure the stockpile still meet certain specifications. However, inspection of such one-shot devices is destructive and thus expensive. The standard binomial methods require the same sample size for each stage of testing. Therefore it is desirable to adopt an approach which can utilize past sample data and subsequently reduce the total sample size. Bayesian methods had been applied to this problem by Wood (1984) and Launer and Singpurwalla (1986). Bain and Engelhart (1991) proposed a Weilbull degradation model for the reliability of these devices. They derived test plans allowing for zero observed failures, which are only pertinent to highly reliable items. The sample sizes for such tests may be reasonable, but the resulting powers are usually rather low. In this paper, Bain and Engelhard's approach is generalized to include nonzero observed failures. With this extension, test plans which meet specified power and require moderate sample sizes are available. Sections 2 and 3 summarize Bain and Engelhard's approach and its extension. The presentation of notation and formulation stays very similiar to that given in Bain and Engelhardt (1991). The problem of deciding sample size to achieve a specified power is discussed in Sec. 4. The algorithm for generating desired test plans is given as an appendix.
51
52
W. Zhang and W.-K.
Shine
2. R e l i a b i l i t y T e s t P l a n s Suppose a r a n d o m sample of n one-shot devices is tested at t i m e tj, j = 1, 2 , . . . , k, in order to d e m o n s t r a t e t h a t t h e level of deterioration in reliability is acceptable. T h e n the number of unsuccessful tests, Xj, is a binomial random variable with parameters n and q(t), t h e probability of an unsuccessful test. Instead of 1 — q(t), q(t) is used as a reliability criterion for such devices. Assuming t h a t the stockpile will be rejected if Xj + 1 or more failures are observed, it is required t o have an upper confidence limit for q(tj) which is, at most, some prescribed value, q*, with confidence level at least 1 — a. By applying a general method for constructing a confidence interval (Bain and Engelhardt (1992), p. 374), an upper confidence limit for q(tj) is obtained as a solution, say q , 1 (i 1 ), in the equation B(x1;ni,q1(t1)) = a (1) for given x\ and ni, where B(x;n,q) denotes a binomial cumulative distribution function with parameters n and q. Since B(x;n,q) is a decreasing function in q, n\ can be solved from Eq. (1) by replacing q\{t\) with q*. W h e n x\ = 0, n\ = l n a / l n ( l — q*). If q* = 0.05, 1 — a = 0.90, then n\ = 44.89 w 45. However, if x\ — 1, no closed form solution is available. For q* = 0.05, 1 — a = 0.90, using a simple computer program with qi(ti) = q* = 0.05, rii = 77. Allowing for nonzero failures appears t o increase the sample size, but it improves t h e power of the test as will be discussed in Sec. 4. Suppose Xj or less failures are observed in t h e first sample of size rii and a second sample is to be taken from the stockpile at the next inspection time. Another sample of size n\ is needed unless model assumptions are introduced concerning t h e function q(t). Next, consider a test plan t h a t requires two independent samples. One sample at time ti of size n\ obtained from Eq. (1) and a second sample is chosen at time t2- Assuming t h a t t h e failure function is q2(t), we wish ri2 to be large enough t o produce an upper confidence limit not exceeding q* with a confidence level of at least 1 — a. T h e minimum value of n,2 is obtained by solving B{xi;ni,q2(ti))B(x2;n2,q*)
= a
(2)
In general, for k independent samples, a n d for given failure function qk{t) and allowed number of failures, Xk, t h e minimum sample size n^ is t h e solution of B(xk;nk,q*)
fc-i Y[B(xj;nj,qk(tj))
= a
j= l
For the special case, Xj = 0, j = 1 , 2 , . . . , k, fc-i
nk =
In a - ^rij
ln(l -
qk{tj))
ln(l-0
(3)
General Reliability
Test Plans for One-Shot Devices
53
Otherwise a computer program is needed to solve Eq. (3) for sample size. 3. Weilbull Degradation Model Let T be the random variable for the lifetime of a one-shot device. Since the device will be functioning at time t if and only if T exceeds t, q(t) = P[T < t] = F(t), where F(t) is the cumulative distribution function of T. Bain and Engelhardt (1991) proposed two-parameter Weilbull distribution as the distribution for T, then q(t) = 1 — exp[—(t/0)P]. This distribution has been used to fit reliability data from various applications. It has a scale parameter 9 and a shape parameter (3. The failure rate function for this model is h(t) = {(3/9)(t/9)P~l. h{t) is increasing if (3 > 1 and decreasing if (3 < 1. Exponential distribution corresponds to (3 = 1, which has constant failure rate function. When (3 = 2, it is called Rayleigh distribution. For the present application, j3 is assumed known, but 6 is not known. If at stage k, q{t) does not exceed a prescribed limit q* with a specified level of confidence, then qk{tk) = 9*! from which we obtain 9k=tk/{-\n(l-q*
(4)
In terms of this value, qk(tj) = 1 - exp
ln(l - q*)
tk
Given a, q* and xk, sample size can then be solved by substituting qk(tj) into Eq. (3). For the special case of equal time period, tj = j , and Xj = 0, P
ln(l - q*)
Qk(tj) = 1 - e x p and
nk
ln(l-<7*)
^
n j
\ k
Table 1 provides some sample size comparisons for the exponential {(3 = 1) and Rayleigh {(3 = 2) model. For various combinations of k, a, xk, and q*, the case xk = 0 again corresponds to the smallest sample size which satisfies reliability criteria q* and a. However, the resulting power in the context of test of hypotheses is low. This issue will be dealt with in the next section. For detailed treatment of the special case xk = 0, the reader is referred to Bain and Engelhardt (1991).
54
W. Zhang and W.-K.
Shiue
Table 1. Sample size n^ such that P[q{t^) < q*} > 1 — a for Weibull degradation model. k
1
2
3
4
5
6
7
8
9
10
A. Exponential (/3 = 1)
Zfc Xk Xk
Xk Xk Xk
Xk Xk Xk
Xk Xk Xk
0 1 2
45 77 105
23 55 85
15 43 72
a = 0.1 q" = 0.05 11 9 8 37 33 30 64 54 58
6 27 50
6 25 48
5 24 45
5 22 43
0 1 2
59 93 124
29 65 98
20 52 82
a = 0.05 9* = 0.05 15 9 12 44 35 38 73 61 66
9 32 57
7 29 53
7 28 51
6 26 49
0 1 2
230 388 531
115 276 428
76 220 362
a = 0.1 9*=<).01 58 46 38 187 165 149 321 291 270
33 136 252
28 127 239
26 118 227
23 112 217
0 1 2
299 473 628
149 329 495
100 260 416
a = 0.05 9* = 0.01 74 60 50 221 194 174 366 306 331
42 160 286
38 148 269
33 138 256
30 138 145
14 43 71
12 39 66
11 36 63
9 34 60
9 33 57
71 104
a = 0.05 9* = 0.05 18 26 21 62 50 55 81 95 86
16 46 76
14 43 71
12 40 69
11 38 65
128 298 454
a = 0.1 9* = 0.01 70 100 83 262 213 234 354 412 380
61 197 333
54 183 316
48 172 301
44 162 289
63 203 344
57 191
B. Rayleigh (/? = 2)
Xk Xk Xk
Xk Xk Xk
Xk Xk Xk
0 1 2
45 77 105
34 69 101
25 59 90
0 1 2
59 93 124
44 84 118
33
0 1 2
230 388 531
172
0 1 2
299 473 628
351 507
a = 0.1 20 52 82
q * = i0.05
16 46 75
a = 0.05 Q* Xk Xk Xk
224 424 596
166 358 527
131 311 477
=
0.01
107 278 437
91 253 407
79 232 381
70 216 361
328
4. Power and Sample Size Requirements The procedure discussed in the previous sections can be reformulated in terms of test of hypotheses. If qk{tk) is a 1 - a level upper confidence limit on q(tk), then an a level test of the null hypothesis H0 : qitx) > q* against the alternative hypothesis
General Reliability
Test Plans for One-Shot Devices
55
Ha • q(tk) < q* is to reject H0 if qk{tk) < Q* or the number of observed failures is less than or equal to Xk • The null and alternative hypotheses correspond to nonacceptable and acceptable stockpile, respectively. Hence a type I error is committed when nonacceptable stockpile is declared acceptable, which places emphasis on protection against accepting a bad stockpile. By contrast, a type II error is referred to as rejecting an acceptable stockpile and the power function provides a measure of protection against rejecting a good stockpile. Suppose the Weilbull degradation model is adequate for the problem at hand. Then qk(tj) = 1 - exp[(iJ/ifc)/3 ln(l - q*)}. We can study the power of the following test H0 : q(tk) = q* versus Ha : q(tk) = aq*, where 0 < a < 1. Under H0, the failure function would be the member qk{t) with q(tk) = q* and parameter value 6k given by Eq. (4). While under Ha, it would be another member, say qa,k(t), determined by q(tk) = aq* with parameter value 6aik = tk/[— ln(l — aq*)]~P . Hence, qa,k{tj) = 1 - exp[(tj/*fe)^ m ( l ~~ a 9*)]- The power of the test is given by B(xk,rik,aq*)Yl-Z1 B(xj-,nj,qa,k{tj))For a combination of/?, k, a, Xk, q* and a, Table 2 provides some power comparisons. For example, ifa = 0.1, fc = 10, /3 = 1, a = 0.10, q* = 0.05 and Xk = 0, using sample sizes from Table 1, the resulting power is 0.795; if Xk = 1, the power is 0.968, which is a substantial improvement, but at the expense of larger sample sizes. To avoid this difficulty, power and sample size requirements need to be considered together. Since H0 is rejected at each stage of testing when the number of failures is less than or equal to Xj, j = 1,2,... ,k, type I error a and power 1 — 7 must satisfy the following inequalities: fc-i
B{xk;nk,q*)W_B{xj;nj,qk{tj))
(5)
(6)
and fc-i
B(xk;nk,aq*)Y[B(xj;nj,qaik(tj))
For a given set of a, k, (3, a and q*, Eqs. (5) and (6) can be solved simultaneously to obtain general test plan consisting of sample size and failure number at each stage of testing. However, the solution to Eqs. (5) and (6) is not unique. The algorithm given in the appendix generates minimum n / s and Xj's, j — 1,2,..., k, which could be attained at each stage. Table 3 contains some general test plans produced by the algorithm for two Weilbull degradation models. A copy of the computer program for implementing the algorithm is available from the first author upon request. Under the present framework, sampling plans for most applications can be derived. It is possible to consider test plans where the sampling begins lower and increases as the population age. For problems where the items are less costly or of lower reliability, some failures can be allowed to occur to reduce sample sizes. Combinations of sample sizes and observed failures can be considered in order to reduce testing cost.
56
W. Zhang and W.-K.
Shiue
Table 2. Probability of finding the stockpile acceptable on the fcth stage of testing with the WeibuU degradation model and alternative q(tk) = aq*.
Xk
0.1
0.3
0.5
0.7
0.9
0.05
0.10
0 1 2
0.798 0.963 0.922
0.507 0.745 0.862
0.320 0.483 0.597
0.201 0.275 0.333
0.126 0.142 0.154
0.05
0.05
0 1 2
0.742 0.949 0.989
0.407 0.671 0.812
0.222 0.379 0.496
0.120 0.183 0.232
0.065 0.078 0.086
10
0.05
0.10
0 1 2
0.795 0.968 0.994
0.500 0.765 0.884
0.314 0.504 0.632
0.196 0.287 0.357
0.121 0.145 0.160
10
0.05
0.05
0 1 2
0.743 0.956 0.992
0.408 0.695 0.842
0.223 0.400 0.536
0.121 0.192 0.253
0.065 0.079 0.091
0.05
0.10
0 1 2
0.798 0.959 0.990
0.506 0.731 0.843
0.320 0.470 0.573
0.201 0.269 0.319
0.126 0.141 0.152
0.05
0.05
0 1 2
0.744 0.943 0.986
0.410 0.651 0.787
0.225 0.362
0.123 0.175 0.220
0.066 0.076 0.085
0.316 0.487 0.608
0.198 0.277 0.341
0.123 0.141 0.157
0.226 0.387 0.507
0.123 0.186 0.237
0.067 0.079 0.088
10
0.05
0.10
0 1 2
0.796 0.964 0.993
0.503 0.750 0.868
10
0.05
0.05
0 1 2
0.745 0.952 0.990
0.412 0.680 0.820
0.469
Table 3(a). The pairs (x, n) which satisfy both prescribed reliability criteria and given power requirement for WeibuU Model with (3=1 (Exponential). 1
k
2
3
4
5
6
7
8
9
10
a = 0.1 7 = 0.1 9* = 0.05 a = 0.1 a = 0.3 a = 0.5
1
1 43 3 103 306
X
1 77 4
n
158
55 4 142
X
13 377
13 371
X
n
n
11
0 16 3 91
1 31 3 84
1
1
28 2 58
26 3 72
0 8 2 50
1 22 3 66
1 21 2 46
10 268
9 233
8 202
8 193
8 187
7 162
7 156
General Reliability Table 3(a). k
1
2
57
(Continued). 7
6
5
4
3
Test Plans for One-Shot Devices
9
8
10
a == 0.1 7 =0.1 g* == 0.1 a = 0.1 a = 0.3 a = 0.5 a = 0.7
X
1
n X
38 4
n
78
1 27 4 70
0
1
1
1
0
8 3
15
14
13
4
1 11
3 42
2 29
3 36
2 25
23
9 115
8
8
7
7
7
100
96
83
80
78
39
36
34
420
392
32 364
31 349
30 334
1
1 5 2 12 6 34 27 151
1 6 2
3 51
45
X
13
12
10
9
n
187
173
140
121
X
47 566
47 566
46 552
43 512
1 18 4
1 13 3 29 11 79 43 259
n
1
1 22
460
2
10 3 32
a == 0.1 7 := 0.1 q* = 0.2 a = 0.1
X
n a = 0.3
X
n a = 0.5
n
38 12 86 43 259
X
2
n
124 7
X
n a = 0.7
X
1 11 3 24 10 69 41 246
0 4 3 22 9 60 38 226
1 7 2
1
15 8 51 35
7 3 19 7 44 32
206
186
0 2 2 13 7 42 30 172
6 2 12 7 41 29 165
11 7 39 27 150
a = 0.1 7 = 0.1 q* = 0.05 a = 0.1 a = 0.3
X
260 22
n
624
2 98 6 219 21 598
X
n
2 61 7 129 21 298
2 49 6 109 21 297
X
2
n
30 7 63
1 17 6 53 19 134 71 425
X
n a = 0.5
1 57 6 196 20 552
1
1
1
47 5 159 18 487
40 5 147 16
35 4
1 33 5
1
1
1
353
30 4 105 13 323
27 4 101 13 314
27 4 97 12 286
117 15
131 14
424
388
1 16 4 54 14 175
1 15 4 52 13 160
1 14 4 50 12 146
1 13 4 48 12 142
1 7
1
1
1
6 4 25 11
7 3 18 11 65 46 267
a = 0.05 7 =: 0.05 g* = 0.1 a = 0.1
n a = 0.3
X
n a = 0.5
X
1 28 5 85 18 251
1 23 5 77 17
1 20 5 73 15
1 17 4 58 14
229
199
181
a = 0.05 7 == 0.05 q" = 0.2 a = 0.1 a = 0.3
X
n a = 0.5
X
n a = 0.7
X
n
19 135 71 425
1 13 5 42 17 118 71 424
1
1
5 38 15
9 4 30 14
9 4 28
4 27
8 4 25
101 67 398
92 62 366
13 83 58 340
13 81 54 314
13 73 52 300
1 11
67 50 286
58
W. Zhang and W.-K.
Shiue
Table 3(b). The pairs (x,n) which satisfy both prescribed reliability criteria and given power requirement for Weibull Model with /3 = 2 (Rayleigh). k
2
1
4
3
7
6
5
8
9
10
0 16 3 89 10 264
1 31 3 86 9 236
1 31
a = 0.1 7 = 0.1 q* = 0.05 a = 0.1 a = 0.3
1 52 4
147 13 374
138 12 343
1 46 3 107 11
1 43 3 99 11
310
301
n
1 69 4 157 13 377
X
1
n
1 29 4 73 12 174 47
1 26 3 56 12 170 47
n
38 4 78 13 187 47 566
1 34 4 78 13 187 47 566
566
566
X
1
1
1
1
1
1
n
18 4 38 12 86 43 259
17 4 38 12 86
14 3 29 12 86 43 259
13 3 27 11 78 43 259
11 3 25 10 70 42
10 3 24 9 62 41
252
245
X
n a = 0.5
59 4
1 77 4 158 13 377
X
n
X
1
a = 0.1 7 = 0.1 q' == a = 0.1 a = 0.3
X
n a = 0.5
X
n a = 0.7
X
a == 0.1 7 =0.1 a = 0.1
a = 0.3
X
n a = 0.5
X
n a = 0.7
X
n
43 259
1 39 3 93 10 272
3 82 9 230
0.1
1
19 3 46 10 134 44
0 8 3 44 9 120 42
1
11 154 46
1 21 3 48 10 138 45
16 3 43 9 117 40
15 2 30 8 103 38
553
539
525
499
473
447
0 4 3 23 9 60 40 238
1
1
1
8 2 16 8 53 38 225
8 3 21 8 52
8 3 20 8 51
36 212
35 205
1 23 3 52
1
q* == 0.2
a = (3.05 7 = 0.05 q* := 0.05 a = 0.1
X
260 22
2 118 7 256 22
n
624
624
X
n a = 0.3
X
n a = 0.5
2 124 7
2 104 6 215 22
68 6 200 21
2 82 5 166 20
1 54 5 155 19
1 48 5 149 18
1 44 5 142 17
1 41 4 116 16
1 38 5 133 16
623
593
559
525
491
459
427
420
19 5
1
a = 0.05 7 == 0.05 q* = 0.1 a = 0.1 a = 0.3
X
2
2
n
61 7
58 7
X
129 21
128 21
7 124 21
n
298
298
298
X
n a = 0.5
2 52
1
2
1
1
33 6
41 6
26 5
24 5
1 22 5
1 20 5
106 20 282
100 19
84 18
80 17
76 16
265
249
232
216
74 16 212
1
72 15 196
General Reliability Test Plans for One-Shot Devices 59 Table 3(b). (Continued). k
1
2
3
4
5
7
6
9
8
10
a = 0.05 7 = 0.05 q* = 0.2 a = 0.1
X
a = 0.3
2
1
1
1
1
1
1
1
11
11
10
5 39 16 109 67
5 37 15 100 65 386
9 4 31 14
n
30
28
25
16
14
X
7 63 19 135 71 425
6 56 19 135 71
6 54 19 135 71
5 43 18 125 71
425
425
6 51 18 127 71 425
5 41 17 116 69 412
X
n a = 0.7
2
12
n a = 0.5
2
X
n
454
399
93
4 29 14
63
91 61
373
360
Appendix In order to obtain the smallest values of Xk, nk subject t o the two inequalities in Eqs. (5) and (6), an algorithm is used. Let s u m l = B(xk;nk,q*)
fe-i JJ
B{xj\nj,qk{t:j))
i=i sum2 = B{xk;nk,aq*)
fc-i JJ
B{xj;nj,qa,k{tj))
3= 1
T h e following is an enumeration algorithm, which can be used to obtain a n d rifc's. 1. 2. 3. 4.
5. 6.
7. 8.
Initialize nkxk = 0 Compute suml. If s u m l < a then if Xk < rik then Xk = Xk + 1 go t o step 3 if Xk = rik then go to step 7. xk = xk - 1. if Xk < 0 then nk =nk + 1 go to step 1. compute sum2. if sum2 < 1 — 7 t h e n nk=Tik + l go to step 2.
Xk's
60
W. Zhang and W.-K.
Shiue
9. If xk > 1 then Xk = Xk ~ 1
compute sum2 if sum2 > 1 — 7 t h e n go t o step 9 if sum2 < 1 — 7 t h e n xk = xk + 1. 10. O u t p u t xk and null. Stop. References Bain, L. J. and Engelhardt, M. (1991), "Reliability Test Plans for One-Shot Devices Based on Repeated Samples", Journal of Quality Technology 23, 304-311. Bain, L. J. and Engelhardt, M. (1992), Introduction to Probability and Mathematical Statistics, 2nd Edition, PWS-Kent, Boston, MA. Launer, R. L. and Singpurwalla, N. D. (1986), "Monitoring the Reliability of an Arsenal Using a Combined Bayesian and Sample Theoretic Approach", in Reliability and Quality Control, ed. A. P. Basu, Elsevier Science Publishers, North Holland, New York. Wood, B. B. (1983), "Bayesian Reliability Test Plans for One-Shot Devices", Technical Report USAFA TR-84-01, Department of Mathematical Sciences, USAF Academy, Colorado Springs, CO.
Chapter 5 MULTIVARIATE CONTROL CHART M I N G - W E I L U and R I C H A R D J. R U D Y DaimlerChrysler Corporation 800 Chrysler Drive, Auburn Hills, MI 48326,
USA
1. Introduction There are many situations in which the overall quality of a product often depends on successful monitoring of several interdependent correlated characteristics. A common practice to control several characteristics simultaneously is to consider multiple Shewhart control charts, even though they are designed to monitor a single characteristic. The assumption in this practice is that the characteristics are independent, which is often incorrect. If we assume that the characteristics are independent, multiple Shewhart charts with 3-sigma control limits fail to take the simultaneous nature into account. Suppose that xi and X2 are two measurable characteristics that determine the usefulness of the part. The process is considered to be in control only if the sample means x~\ and a?2 fall within their respective control limits. Monitoring these two characteristics independently can be very misleading. The probability that either x\ or X2 exceeds 3-sigma control limits is 0.0027. Hence, the probability that both ~x~[ and X2 will simultaneously plot inside the control limits is 0.9973 * 0.9973 = 0.9946 when the process is really in control. The chance of rejecting the process when the process is really in control is 1 - 0.9946 = 0.0054 (two times 0.0027), where 0.0027 is the level for the individual control charts. That is, we are left with a control region that is smaller than expected, which will result in investigating the process when it is actually in control. In the multivariate environment, the characteristics are often related to each other. It is possible that the process may be in control when the characteristics are considered separately but out of control when considered jointly with the multivariate control chart. As shown in Fig. 1, the individual charts for characteristics X and Y do not indicate unusual process behavior. But a joint, two-dimensional picture illustrates one unusual data point.
61
62
M.-W. Lu and R. J. Rudy
V
IT
0
2
4
v
A
8
JL.
10
12
10
12
Fig. 1. Disadvantages of multiple individual control charts.
It is also possible that the process may be out of control when the characteristics are considered separately but in control when considered jointly with the multivariate control chart. Increased frequency and cost errors in judging product
Multivariate
Control Chart
63
acceptance necessitate paying more attention to using proper statistical tools to monitor process quality. In this paper, multivariate control charts are discussed which are based on Hotelling's T2 statistic. An exhaust manifold example which was used on Chrysler's 5.2 L engine is demonstrated. 2. Multivariate Control Charts Multivariate control charts are the proper tools to use when the overall quality depends on several interdependent characteristics considered simultaneously. It is assumed that the joint probability distribution of the p quality characteristics is the p-variate normal distribution. The control charts were developed and used by Hotelling (1947), who applied his procedures to bombsight during World War II. There are charts available for both location (mean) and spread, this paper will only consider charts for location. For more information on multivariate control charts, the reader can refer Montgomery (1991) and Alloway (1991). There are two general categories of control charts: (1) no and £ are given and (2) no and £ are not given, where no is the (p x 1) vector of means and X is the (p x p) population variance-covariance matrix. Let n be the subgroup size and k be the number of subgroups. Case (1) When no and £ are given The test statistic plotted on the chi-square control chart for each sample is T2 = n ( x - M o ) ' S - 1 ( x - / i 0 )
(1)
where x is the (p x 1) vector of means for subgroup, the prime denotes the matrix transpose operation, Xi
p,o is the vector of in-control means for each quality characteristic. That is "Mi' Mo =
-MP. x
and £~ is the inverse of the (p x p) population variance-covariance matrix.
64
M.-W. Lu and R. J. Rudy
The upper control limit for this T2 statistic is UCL = xl,P
(2)
where x2ap *s the (1 — a) * 100th percentile point of the x2 distribution with p degree of freedoms. Case (2) When /J,Q and £ are not given The test statistic plotted on the chi-square control chart for each sample is T2 = n(x = x)'S'1(x-x)
(3)
where x is the (pxl) vector of the grand means, the prime denotes the matrix transpose operation, and 5 _ 1 is the inverse of the (p x p) pooled sample variancecovariance matrix. The upper control limit for this T2 statistic is p(k + l)(n-l) , n ra;p,kn-k~p+\ kn — k — p + 1
UtL = —
\^)
where Fa-)P,kn-k-p+i is the (1 — a) * 100th percentile point of the F-distribution with p and (kn — k — p + 1) degrees of freedom. Case when n — 1 Frequently, when multiple quality characteristics must be monitored, multivariate control charts with n = 1 would be of interest. From Eq. (3), the T 2 statistic is T2 = (x-x)'S-1(x-x)
(5)
The upper control limit for this statistic is [Tracy (1992)] TIPT
P(k + i)(k-i) =
k2_kp
( ) F
«;P*-P
W
When the number of subgroups k is large, then from the Eq. (6), the approximate control limit is given by UCL = — Fa;Ptk-P K p
(7)
The T 2 display format as shown in Fig. 5 is more conventional in appearance in that the control limit is parallel to the time axis and points are plotted in chronological order.
Multivariate
2 . 1 . Bivariate
Control
Control Chart
65
Chart
T h e most direct way to illustrate multivariate control charts is to use the bivariate case. Once this bivariate case situation is understood, it can be extended t o any number of characteristics. M o d e l — T h e two variables X and Y are assumed t o follow a bivariate normal distribution with means nx,
Yy, variances ax,
<7y, a n d correlation coefficient p.
T h e density function is given by 1
-1
f(x,y)
2
exp •
2TraX(7Y\/l~P
1
_2p{x-»x)(y-»y) &x
2
(
1
x -
nx
ox
-P)
(y-»y aY
+
ay
(8)
For the i t h group of size n, the subgroup means are x^ and j/j, the subgroup variances are s ^ and syi, and t h e subgroup covariance sxyi.
x = ^2/xi/k
y = y^yi/k
i=i
i=i
k S
x
=
(9)
k
2__j Sxi/k i=l
S
y
=
/__, i=l
(10)
S
yi/k
n
sXyi = 5^(acij - Xi)(yij - y~i)/(n - 1) i=i fc
Sxy
=
/ ^ Sxyi/ &
(11)
t=l
From Eq. (3), i
rp2 QZ Q ^
r-^r === 4 ( 5 - x)1 + 8%(y- vY - 2s^(x e^
- x){y - y)
(12)
T h e upper control limit, from Eq. (4), is _2(fc + l ) ( n - l ) UL/J_i —
J
;
"
kn — k — 1
^a;2,fcn-fc-l
(13)
66
M.-W. Lu and R. J. Rudy
When n = 1, from Eq. (6), u r T
~
2(*+ !)(*-!) ba k 2 k2-2k ^-
(14)
When n = 1 and fc is large, from Eq. (7), UCL =
2(k — I) \_2'Fa,2^2
(15)
3. Identifying Out of Control Signals As long as the subgroups statistics plot within the control region, no further action is required. However, when an out of control signal is given, steps must be taken to identify and eliminate the assignable cause. The multivariate control chart does not identify which characteristic(s) is responsible, and supplemental measures are required to guide corrective action. The standard practice is to plot univariate X bar charts on the individual variables x\,... ,xp. Control limits are calculated as x±Ap<ns
(16)
where APtTl values are given in Table 1 [Alloway (1991)].
n (Subgroup S ize)
p = l p = 2 p = 3
2
3
4
5
6
7
8
9
10
2.66 2.84 2.94
1.95 2.09 2.16
1.63 1.74
1.43 1.53 1.58
1.29
1.18 1.26 1.31
1.10 1.17 1.22
1.03
0.98 1.04 1.08
1.80
1.38 1.42
1.10 1.14
When p = 1, APt1l reduces to the traditional A3 constant in the average and standard deviation charts (X bar and s) [AIAG (1991)]. In practice, one would plot values of T2 on the multivariate chart and the values of x~i on individual control charts with the Bonferroni control limits. When an out-of-control situation is indicated on the multivariate chart, the individual characteristics responsible can be identified from the modified univariate limits. Another very useful approach to diagnosis of an out-of-control signal is to decompose the T2 statistic into components that reflect the contribution of each individual variable. Let di = T2- T2} (17) where T 2 is the current value of the statistic, and T?^ is the value of the statistic for all variables except the ith one.
Multivariate
Control Chart
67
Hence di is an indicator of the relative contribution of the ith individual variable to the overall statistic. When an out-of-control signal is generated, we recommend computing the values of di (i = 1,2,... ,p) and focusing attention on the variables for which di are relatively large. For more information, see Montgomery (1991). 4. Selecting Important Characteristics (Variables) There is a practical limit to the number of characteristics that should be considered simultaneously. The following techniques from principal component analysis (PCA) can be used to reduce the number of variables to be considered. 4.1. Principal
Component
Analysis
(PCA)
PCA is a statistical technique that linearly transforms an original set of multiple dependent variables into a substantially smaller set of uncorrelated variables that represents most of the information in the original set of variables [Dunteman (1989)]. The principal components, Z» (i = 1,2,... ,p), are represented as a linear combinations of the p original correlated variables, Xi [i = 1,2,... ,p). That is p
Zt = '^2aijXj
(18)
subject to the condition that an2 + ai22 A
h aip2 = 1
Those Zt (i = 1,2,... ,p) variables are uncorrelated. The lack of correlation is a useful property. It means that the independent variables are measuring different "dimensions" in the data. The variance of Zi should be as large as possible given the constraint on a^. a,ij can be considered as the contribution of each variable Xj on the principal component Zj. From Eq. (18), the original p variables, Xi (i = 1,2,... ,p), can be represented as a linear combinations of the p principal components, Zi (i = 1,2,... ,p). That is p
Xi = ^biJZj
(19)
Hence, the variance of Xi can be expressed as p
Var(X l ) = ^ 6 ? , J V a r ( Z i )
(20)
where Var (Zj) are the variances of the principal components. The sum of the variances of the principal components is equal to the sum of the variances of the original variables. That is p
p
^Var(Zi) = ^Var(Xl) i=l
i=\
(21)
68
M.-W. Lu and R. J. Rudy
If the Xi variables are standardized (i.e., subtracting the mean from each X, variables and dividing the results by the standard deviation), then p
^Var(Zi)=p
(22)
i=l
For the data in Table 2, there are two (p = 2) original correlated variables, Xi — X and X2 = Y. The first principal component, Zi, has the following equation: Zx = 0.707107X* + 0.707107y* The second principal component, Z2, has the equation Z2 = 0.707107X* - 0.707107Y* where X* and Y* are standardized variables of X and Y, respectively. Var(Zi) = 1.44077 Var(Z 2 ) = 0.55923 Var(Zi) + Var(Z 2 ) = 2 The first principal component, Zu accounts for 72.04% (1.44077/2 * 100) of the variance of the two variables. The second principal component, Z 2 , accounts for considerably less variance (27.96% (0.55923/2 * 100)) of the two variables. That means the first principal component explains a substantial amount of variation in the two variables and the second component considerably less. 4.2.
Approaches
Approach (a) — First, we need to determine which principal components to retain. Relatively large elements indicate the variables that are important in defining a particular principal component. The variable that has the highest loading or weight on a principal component would be selected to present that component, provided it has not been chosen to represent a larger variance principal component. This approach should start with the largest principal component and proceed to the smallest retained component. Approach (b) — First, we need to determine which principal components to retain. This approach is to use the discarded principal components to discard variables. We should start with the smallest discarded principal component and delete the variables with largest loading or weight on that principal component. Then the variable with largest loading or weight on the second smallest component would be discarded. This approach should continue up through to the largest discarded component.
Multivariate
Control Chart
69
Approach (c) — The Variance of Xi as defined in Eq. (20) can also be used as the importance ranking on p variables. 5. Example This example deals with an exhaust manifold used on a Chrysler 5.2 L engine in a given model year. The exhaust manifold is shown in the Fig. 2. Two dimensional characteristics, X and Y as indicated in Fig. 1, are selected for our multivariate
Fig. 2. Exhaust manifold.
Table 2. Dimension X and Y measurements. No.
X
y
T-squared
No.
X
Y
T-squared
1 2 3 4 5 6 7 8 9 10 11 12
8.9464
0.8190
2.25
0.8167
11.26
0.8185
5.89
8.9437
0.8164
0.83
8.9452
0.8165
0.05
8.9438
0.8159
0.73
8.9453
0.8181
0.88
8.9457
0.8166
0.24
8.9441
0.8142
2.85
8.9447
0.8177
0.74
8.9433
0.8151
1.74
8.9464
0.8189
2.07
8.9434
0.8151
1.64
8.9451
0.8153
1.14
8.9440
0.8164
0.50
8.9469
0.8151
4.49
8.9426
0.8144
3.43
8.9450
0.8155
0.80
8.9444
0.8180
1.49
8.9450
0.8166
0.01
8.9463
0.8171
0.60
8.9436
0.8165
0.99
8.9466
0.8193
2.90
8.9450
0.8177
0.54
13 14 15
8.9455
0.8175
0.25
8.9449
0.8200
6.04
8.9439
0.8163
0.58
8.9447
0.8168
0.09
8.9445
0.8144
2.53
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
8.9498
8.9489
8.9442
0.8169
0.48
70
M.-W. Lu and R. J. Rudy
control charts example. For the dimension X, the nominal is 8.44 + 0.5 = 8.9400" with the upper spec limit = 8.9519" and lower spec limit = 8.9281". For the dimension Y, the nominal is 0.8200" with the upper spec limit = 0.8319" and lower spec limit = 0.8081". The data (30 pairs) in Table 2 are collected from a machine capability study performed on the machine builder's floor [Chrysler (1993)]. The correlation coefficient between X and Y is 0.44. All the data analysis and control charts calculation in this example are handled by the PC software STATGRAPHICS [Manugistics (1994)]. 5.1. For X
Chart
The process mean = 8.9451" and the process sigma = 0.00129616", hence UCL = 8.9451 + 3 * 0.00129616 = 8.94899" LCL = 8.9451 - 3 * 0.00129616 = 8.94121" The Shewhart control chart for the X variable is given in Fig. 3. One point is beyond the control limits; the process is out of statistical control. (X 0.001) UCL = 8.94899 Centeriine = 8.9451 LCL = 8.94121
X
0
5
10
15
20
25
30
Observation Fig. 3. Shewhart control chart for the X variable.
5.2. For Y
Chart
The process mean = 0.81675" and the process sigma = 0.00128699", hence UCL = 0.81675 + 3 * 0.00128699 = 0.82061" LCL = 0.81675 - 3 * 0.00128699 = 0.81289" The Shewhart control chart for the Y variable is given in Fig. 4. No points are beyond the control limits; the process is in a state of statistical control.
Multivariate (X 0.001) 822 -
Control Chart
71
UCL = 0.82061
>
J.
v\\ v
\
A 7\,
A A \ , V\ ^ J \ rx/ \ J ~* / v>
/v \
Centcriine = 0.81675
V
LCL = 0.81289
^
812 10
15
20
25
30
Observation Fig. 4. Shewhart control chart for the Y variable.
5.3. For Bivariate
Control Chart
From Eq. (14), we have UCL = 2(30 + 1)(30 - l)/[30 2 - 2(30)] * F0.0o27;2,28 = 2.14 * 7.36 = 15.75 Using Eq. (15), we have an approximate control limit UCL = 15.24. T 2 statistic is denned in Eq. (5) and the calculated T2 values are given in Table 2. In the following, we will compute the T2 value for the first data point (Xi = 8.9464, Y\ = 0.8190). The mean (average) and sample variance of X variable are 8.9451" and 2.46171 * 10~ 6 , respectively, and the mean (average) and sample variance of Y variable are 0.816751" and 2.29086 * 10" 6 , respectively. Also, the covariance of X and Y is 1.04672 * 10~ 6 . Hence the variance-covariance matrix S is given by S =
6
10
2.46171
1.04672
1.04672
2.29086
Hence, S~l = 10b
0.504172
-0.230362
-0.230362
0.541772
From Eq. (5), the T 2 value at the first data point (Xx = 8.9464, Y1 = 0.8190) is 10b
8.9464
-8.9451
0.504172
-0.230362
0.8190
-0.81675
-0.230362
0.541772
8.9464
-8.9451
0.8190
-0.81675
2.25
72
M.-W. Lu and R. J. Rudy
16
:
. . . . . . •
UCL =15.24
•
T3
g 3 CO
12
"
g . 4 . 0
"
1
wAJLAH j|
•
A A
10
20
. 1
30
40
Observation Fig. 5. Bivariate control chart.
The bivariate control chart is shown in Fig. 5 (with UCL = 15.24). No points are beyond the control limits; the process is in a state of statistical control. This example provides a case where the process may be interpreted as out of control when -the characteristics are considered separately, but is shown to be in control when considered jointly with the multivariate control chart. In Table 2, if the number 16th data point (X = 8.9498, Y = 0.8167) is replaced by (X = 8.9515, Y = 0.8167), then the T 2 value will be 15.4041. The bivariate control chart is shown in Fig. 6 (with UCL = 15.24). One point is beyond the control limits; the process is out of statistical control. 16
UCL =15.24
12
g 3 CO
10
20
30
40
Observation Fig. 6. Bivariate control chart.
When an out of control situation is indicated on the multivariate chart as shown above, the individual characteristics responsible can be identified from the individual univariate limits. For X chart, the process mean = 8.94515" and the process sigma = 0.0014001", hence
Multivariate
Control Chart
73
UCL = 8.94515 + 3 * 0.0014001 = 8.94935" LCL = 8.94515 - 3 * 0.0014001 = 8.94095" The Shewhart control chart for the X variable is given in Fig. 7. One point is beyond the control limits; the process is out of statistical control. The Shewhart control chart for the Y variable is same as in Fig. 4. No points are beyond the control limits; the process is in a state of statistical control. 8.952 -
,
A
AV-
\
UCL = X.94935
i /
Cemeriine = 8.94515 LCL = 8.94095
L
M
i
r\ i 1
^V
\
*
AA
V 17 A/v.^A^v, V i-j
8.94 -
5
10
15
20
25
30
Observation Fig. 7. Shewhart control chart for the X variable.
6. Conclusion There are many situations in which the overall quality of a product often depends on successful monitoring of several interdependent correlated characteristics. A common practice to control several characteristics simultaneously is to consider multiple Shewhart control charts, even though they are designed to monitor a single characteristic. The assumption in this practice is that the characteristics are independent, which is often incorrect. In the multivariate environment, the characteristics are often related to one another. It is possible that the process may be in control when the characteristics are considered separately but out of control when considered jointly with the multivariate control chart. In this paper, multivariate control charts are discussed which are based on Hotelling's T 2 statistic. An engine exhaust manifold example is presented. Widely commercially available PC software for multivariate control charts and programmable scientific calculators can be used to present the methods and simplify calculations. Multivariate control charting is a cost effective alternative for placing high powered statistical tools into the hands of those closest to the process. References Alloway, J. A. and Raghavachari, M. (1991), "A Introduction to Multivariate Control Charts", in Proceedings of the ASQC 45th Annual Quality Congress, Milwaukee, Wisconsin, May.
74 M.-W. Lu and R. J. Rudy Dunteman, G. H. (1989), Principal Components Analysis, Sara Miller McCune, Sage Publications, Inc. "Fundamental Statistical Process Control", Automotive Industry Action Group (AIAG), 1991. Hotelling, H. (1947), "Multivariate Quality Control, Illustrated by the Air Testing of Sample Bombsights", Selected Techniques of Statistical Analysis and Industrial Research and Production and Management Engineering, eds. Eisenhart, C , Hastay, M. W. and Wallis, A. W., McGraw-Hill Book Company, New York. Montgomery, D. C. (1991), Introduction to Statistical Quality Control, 2nd Edition, John Wiley & Sons, New York. STATGRAPHICS Plus software (Version 1 for Window), Manugistics, Inc., Rockville, Maryland, 1994. "Tooling and Equipment Supplier Quality Assurance (TESQA)", Procurement & Supply Office, Chrysler Corporation, 1993. Tracy, N. D., Young, J. C. and Mason, R. L. (1992), "Multivariate Charts for Individual Observations", Journal of Quality Technology 24(2), 88-95.
Chapter 6 OPTIMAL PREPAREDNESS MAINTENANCE OF MULTI-UNIT SYSTEMS W I T H I M P E R F E C T M A I N T E N A N C E A N D ECONOMIC DEPENDENCE H O N G Z H O U W A N G and A N T H O N Y E. I Z U N D U Lucent Technologies, Holmdel, NJ 07733, USA HOANG PHAM Department of Industrial Engineering Rutgers University, Piscataway, NJ 08854-8018,
USA
1. Introduction Block, age and sequential preventive maintenance (PM) policies are designed for a system composed of a single stochastically deteriorating subsystem [McCall (1963)]. A natural generalization of the underlying maintenance model is to consider a system with multi-subsystems. Optimal maintenance policies for such systems reduce to those for systems with a single subsystem only if all subsystems are stochastically, economically and structurally independent. In this case, maintenance decisions are also independent and the optimal policy is to employ an optimal block, age or sequential PM policy for each separate subsystem. However, if it costs less to perform maintenance on several subsystems jointly than on each subsystem separately, i.e., there exists economic dependence, then the optimal maintenance policy is not one of considering each subsystem separately and maintenance decisions will not be independent. Obviously, the optimal maintenance action for a given subsystem at any point of time depends on the states of all other subsystems in the system [Cho and Parlar (1991), Dekker and Smeitink (1991), McCall (1963), Pham and Wang (1996), Radner and Jorgenson (1963), Zheng (1995)]. Generally, the assumption of perfect maintenance, which brings a system to the "good as new" state, may be plausible for the system with only one component which is structurally simple. On the other hand, the assumption of minimal repair, which bring a system to the "bad as old" state, seems reasonable for failure behavior of the system when one of its many, non-dominating components is replaced
75
76
H. Wang, H. Pham and A. E.
Izundu
by a new one [Pham and Wang (1996)]. However, many maintenance activities may not result in these two extreme situations but in a complicated intermediate one. For example, an engine may not be "good as new" or "bad as old" after tune-up, a type of common PM. It usually becomes "younger" than at the time just prior to the tune-up and enters some state between "good as new" and "bad as old". In the recent literature, this kind of maintenance is often called imperfect maintenance [Bhattacharjee (1987), Brown and Proschan (1983), Fontenot and Proschan (1984), Makis and Jardine (1992), Nakagawa (1987), Pham and Wang (1996), Sheu et al. (1995), Zheng (1995)]. Generally, imperfect maintenance means any maintenance which makes a system "younger", and results in an improved system operating condition. Usually, it is assumed that imperfect maintenance restores the unit's operating state to somewhere between "good as new" and "bad as old". Clearly, imperfect maintenance is a general maintenance since it can include two extreme cases: minimal and perfect maintenance [Brown and Proschan (1983), Fontenot and Proschan (1984), Pham and Wang (1996)]. So far most maintenance models in the reliability and maintenance literature have been proposed for one-unit systems, and maintenance models for multisubsystem systems are only a small proportion of them. We note also that economic dependence is ignored for many previous maintenance models of multi-subsystem systems. Some early models consider economic dependence but they suppose that after maintenance the system is "good as new", i.e., all maintenance is perfect. This paper contemplates an optimal preparedness maintenance model for a system with n + l subsystems among which there is a single uninspected subsystem in presence of n continuously-inspected subsystems. Assume that the system is placed in storage and is called upon to perform a given task only if a specific but unpredictable emergency occurs. Its actual state is unknown unless it is inspected. Some maintenance actions may be taken while the system is in storage or long-term cold standby, and the objective is to choose the maintenance action sequence providing the best level of preparedness for field use [McCall (1963, 1965)]. This maintenance model is called preparedness maintenance model, which can applies to maintenance problems of missiles, rockets, etc. McCall (1963, 1965) applied this preparedness maintenance model to the ballistic missile maintenance and obtained optimal the preparedness maintenance policy. The ballistic missile studied was composed of one uninspected subsystem: the rocket engines, as well as three subsystems which are continuously inspected: the nozzle control units, the guidance and control system, and the re-entry vehicle [McCall (1963)]. Obviously, to keep ballistic missiles in the highest level of operational readiness and thus to prevent them from failure in emergence use, frequent maintenance are necessary when they are in storage. This study assumes that there exists economic dependence in this multi-unit system, and that PM is imperfect, given the times to failure of the subsystems in this system are stochastically independent. Maintenance time is not negligible in this work. For a multi-component system, if there exists strong economic dependency, joint maintenance should be considered. The optimal maintenance policy for
Optimal Preparedness Maintenance
of Multi-unit
Systems
77
this kind of systems possesses an opportunistic characteristic: the optimal maintenance action for each subsystem entails taking advantage of opportunities that arise through failures of the other subsystems, i.e., the failure of one subsystem results in the possible opportunity to perform maintenance on other subsystems (opportunistic maintenance). This paper also contemplates that the uninspected subsystem in this system has an increasing failure rate while the remaining n subsystems have constant failure rates. The system is functioning if and only if every subsystem is functioning (series system). The distinctive feature of preparedness maintenance models is that the state of the system is ascertained only at the time of inspection or maintenance since usually the system is not in operation, unlike the other kind of systems such as automobiles or aircraft. Accordingly, the preparedness model is characterized by three different uncertainties. First, it is impossible to predict the exact time of system failure. Secondly, the time of emergency is also not susceptible to exact prediction. Finally, the state of the system is known only at the time of certain maintenance or inspection actions [McCall (1963, 1965), Radner and Jorgenson (1963)]. The main difference between the preparedness maintenance model for missiles, rockets, etc. and the other maintenance models for automobiles, aircraft etc. lies in the way in which failures are detected. With the automobiles, aircraft etc. failure occurring while the system is not in operation will be detected whenever an attempt at operation is made. The state of the system is always known with certainty. In fact, continuous operation provides assurance that the state of the system is always known with certainty. However, in a missile system, such a failure will go undetected indefinitely; the state of the system (at least some of its subsystems) is not known with certainty unless some definite maintenance or inspection action is taken. The difference directly affects the design of optimal maintenance policies for each kind of system. Those for automobiles must be designed to overcome the effects of uncertainty about when failures will occur. The policies for missile systems must overcome the same uncertainty, plus another as well: uncertainty about the actual state of the system at any given time — that is, whether it is good or has failed [McCall (1965)]. If the actual state of the system is known with certainty, either through continuous inspection or continuous operation, the theory of maintenance for the preparedness model becomes the same as for the other maintenance models such as the age-replacement policy and block-replacement policy. In this sense, the theory of maintenance for the preparedness model is more general than the other maintenance models. The objective of this study is to present realistic optimal opportunistic maintenance polices which reduce maintenance expenses and keep high system availability. In this paper, maintenance will be a general term and may represent preventive or corrective maintenance (CM), and repair only means CM. Replacement is a perfect maintenance.
78
H. Wang, H. Pham and A. E.
Izundu
Notation T U Co wo Ci, u>i Cot, u>oi
Time interval at the end of which a PM is performed on subsystem 0 Critical age of subsystem i for i = 1,2,... ,n PM cost of subsystem 0 at jT, j = 1,2,3,... PM time of subsystem 0 The cost and time to replace subsystem % for i = 1,2,... ,n The cost and time to maintenance subsystem 0 and i together, i = 1,2,... ,n p Probability that PM (at multiples of T) is perfect q Probability that PM (at multiples of T) is minimal, p + q = 1 qoi Probability that the renewal cycle ends with a replacement of subsystem i and PM of subsystem 0 together di Probability that the renewal cycle ends on the interval [
i = l,...,n
and
X'0(t) > 0
Optimal Preparedness Maintenance
of Multi-unit
Systems
79
Since subsystems 1, 2 , . . . , n have constant failure rates, they are never replaced before failure, that is, no PM will be performed on them at any time. Besides, as stated at beginning of this paper, it is assumed that it costs less expense and time to perform maintenance on subsystem 0 and any other subsystem together than to do maintenance on each subsystem separately, i.e., Co, Ct < C0i < Co + Q
and
w0, Wi < w0i < w0 +Wi
(1)
Note that at any point in time, the maintenance performer must choose among four alternatives: do maintenance on the uninspected subsystem, on an inspected subsystem, on the uninspected subsystem and an inspected subsystem together, or do nothing. Using a dynamic programming formulation, Radner and Jorgenson (1963) showed that the optimum maintenance policy is what they call a (ti,T) type of policy and proposed an opportunistic maintenance policy. Moreover, in study of maintenance problems Barlow and Proschan (1975), and Khalil (1985) proposed and investigated the following shut-off rule (B-P) for the series system: While a failed subsystem is in repair or maintenance, all other subsystems remain in "suspended animation". After the repair or maintenance is completed, the system is returned to operation. At that instant, the subsystems in "suspended animation" are as good as they were when the system stopped operating. Obviously, the above (B-P) shut-off rule seems realistic and plausible in many applications since most maintenance times are relatively short. Next a new realistic preparedness maintenance model subject to the opportunistic maintenance, imperfect maintenance, and (B-P) shut-off rule is designed. Let x be the age of subsystem 0 since last replacement of subsystem 0. This paper investigates the following opportunistic preparedness maintenance policy, modified from the preparedness maintenance model developed by Radner and Jorgenson (1963), based on the above (B-P) shut-off rule by Barlow and Proschan (1975) and Khalil (1985), and the concept of imperfect maintenance [Bhattacharjee (1987), Brown and Proschan (1983), Fontenot and Proschan (1984), Makis and Jardine (1992), Nakagawa (1987), Pham and Wang (1996), Sheu et al. (1995), Zheng (1995)] and opportunistic maintenance: (i) If subsystem i fails when the age of subsystem 0 is in the time interval [0,£j), replace subsystem i alone at a cost of Cj and at a time of u>i (i = 1,2,..., n); (ii) If subsystem i fails when the age of subsystem 0 is in the time interval [U, T), replace subsystem i and do perfect PM on subsystem 0 (i = 1, 2 , . . . , n). The total maintenance cost is Coi and total maintenance time is w0i', (iii) If subsystem 0 survives until its age x = T, perform PM on subsystem 0 alone at a cost of Co and at a maintenance time of Wo at x = T. PM is imperfect. (iv) If subsystem 0 has not received a perfect PM at T, perform PM on it alone at time jT (j = 2,3,...) until it gets a perfect PM; If subsystem 0 has not experienced a perfect maintenance and subsystem i fails after some PM, replace subsystem i and do perfect PM on subsystem 0 (i = 1,2,... ,n). The total
80 H. Wang, H. Pham and A. E. Izundu Critical Age Repair Subsystem i only (C,., w,)
Maintenance of Subsystems i and 0 ( c o , > woi >
T
t
PM of Subsystem 0 only (Q>>
w
o)
Fig. 1. Optimal opportunistic maintenance policy. maintenance cost is still C0i and total maintenance time is woi- This process continues until subsystem 0 gets a perfect maintenance. (v) T h e (B-P) shut-off rule applies to all maintenances. T h e above opportunistic maintenance model is also shown in Fig. 1. It is characterized by n +1 decision variables (ti,t2, • • • ,tn,T), a n d the optimal maintenance policy is obtained by determining t h e optimal (ti,ti,... ,tn,T) to maximize the system availability or minimize t h e system maintenance cost rate or optimize one when t h e predetermined requirements for the others are satisfied. It is worth t o note t h a t to achieve good operating characteristics of systems, we might take into account system availability because while the system cost rate is minimized t h e system availability may not be maximized sometimes and even very low [Pham and Wang (1996)]. For multicomponent systems this opportunistic maintenance policy may result in higher system availability as compared with t h e case t h a t each subsystem is separately maintained. This is because while any subsystem fails and is under maintenance t h e whole system is down, and it would save time to perform P M on unfailed subsystems during this down period and t h u s reduce t h e system down time. Therefore, t h e optimal maintenance policy discussed in this paper can be expected to approximate any t y p e of multicomponent systems. 2. S y s t e m R e l i a b i l i t y a n d C o s t M e a s u r e s In this paper, imperfect P M is treated in a way t h a t after P M a system is "good as new" with probability p (perfect P M ) and is "bad as old" with probability q = 1 — p (minimal P M , 0 < p < 1) [Bhattacharjee (1987), Brown and Proschan (1983), Fontenot a n d Proschan (1984), Makis and J a r d i n e (1992), Nakagawa (1987), P h a m and Wang (1996), Sheu et al. (1995), Zheng (1995)]. Note t h a t P M becomes a perfect one when p = 1 and a minimal one when p = 0. Therefore, this modeling method for imperfect P M makes imperfect P M t o include perfect P M and minimal P M as special cases. It makes each P M a weighted average of perfect P M a n d
Optimal Preparedness Maintenance
of Multi-unit
Systems
81
minimal PM with weight p [Pham and Wang (1996)]. Details on this treatment method for imperfect maintenance can be found in Bhattacharjee (1987), Brown and Proschan (1983), Fontenot and Proschan (1984), Makis and Jardine (1992). Note that PM of subsystem 0 together with another subsystem between jT's is assumed to be perfect where j = 1,2,3,... . Next we will first derive the long-run system maintenance cost per unit of time, or expected system maintenance cost rate, the asymptotic average system (storage) "availability", and then other system operating performance characteristics. The optimal maintenance polices are finally studied. 2.1. System
Maintenance
Cost Rate and
"Availability"
Given the above opportunistic preparedness maintenance policy, the time between consecutive perfect maintenances of subsystem 0 constitutes a renewal cycle. From the renewal reward theory, system maintenance cost rate is given by: T(rp
.
.
L{T tut2
,
x
C(T,ti,t2,.
• • ,tn;p)
. .
tn P)=
(2)
' '-' ' D(Tttl,t2t...,tn;p)
Asymptotic average system storage "availability" is defined as: A ( T [
. '
, U 2
. '""-'
v ~U(T,t1,t2,...,tn;P)
n,P)
U{T,txM,---,tn;p) + R(T,t1,t2,...,tn;p)
W
where C(T, ti,t2,. • • ,tn;p) is the expected system maintenance cost per renewal cycle, D(T, t\,t2,... ,tn;p) is the expected duration of a renewal cycle, and U(T, ti, t2,... ,tn;p) a,ndR(T,ti,t2,... ,tn;p) are respectively the accumulating system storage time and the maintenance time of this system in one renewal cycle. Obviously, D(T,ti,t2,...
,t„;p) = U{T,tx,t2,
...,tn;p)
+ R(T,h,t2,
...,tn;p)
Let Si = time spent on replacing subsystem i alone in one renewal cycle (i = 1, 2,...,n); Y = age of subsystem 0 when first perfectly preventively maintained at time jT alone or together with another subsystem between jT's, whichever occurs first; Z = time spent on performing PM on subsystem 0, possibly with other subsystems (at end of a cycle). We first evaluate the expected duration D(T, t\,t2,..., tn;p). The renewal cycle duration B is the sum of three random variables. The duration and expected duration are respectively B =S+Y +Z D(T, t1,t2,...,tn;p)
= E(B) = E(S) + E(Y) + E(Z)
(4)
82
H. Wang, H. Pham and A. E.
Izundu
We investigate the cumulative distribution function (Cdf), probability density function (Pdf) and mean of Y first. Next we make the convention that the monitored subsystems are numbered so that t\
Pr(tf = jT) =
q^p
where
1,2,
and Cdf of Y for y < T is as follows: P r ( F < y) = 1 - P r ( y > y) = 1 - Pr(ii + Ui > y,t2 + U2 > y,... ,tn + Un > y) n
=
l-Y[Pr(Ui>y-tl) i=l
51 K(y-U
1 — exp .
y-U>0
and for y = T we have P r ( y = T) = p • exp l - ^ A ^ r - t i ) i=l
The Cdf of Y for T < y < 2T is given by: P r ( y < y) = 1 - P r ( F > y) = 1 - Pr(£j + Ui>y,i
= l,2,...,n;
1 - qY[Pr(Ui > y - U) n
1 - g e x p -^Xj(y 2=1
-U)
1st PM = imperfect)
Optimal Preparedness Maintenance
of Multi-unit
Systems
83
and for y = 2T P r ( F = 2T) = gp • exp
• z ^A i (2T-t i ) i=l
Generally, for (j — 1)T < V < j ' T where j = 1 , 2 , 3 , . . . we have: P r ( F < y) = 1 - Pr(U + Ui > y, i = 1, 2 , . . . , n;
1st perfect P M is j t h P M )
= i-gJ'-inPr(^>y-*i)
1 — <jJ * exp
A
E
i(y-*i)
l/-ti>0
and for y = jT: Pr(Y = j T ) = ^ _ 1 p - e x p
•^AiO'T-ti) i=l
Next we determine Pdf of Y. i = 1, 2 , . . . , n, let us introduce the following notation for future convenience:
Mi= [J2Xi Dj = exp I ^ J Xjtj
(5)
o=i
/<(») = (J2 x j) ex p
A M , exp(-Miy)
Then V has probability density given by, for y < T
{
0,
0
fi(y),
U
/n(w), tn >(T-i,) with probability
i = 1,2, . . . , n - 1
:p-I>nexp(-MnT)
(7)
84
H. Wang, H. Pham and A. E.
Izundu
For (j - 1)T
(7a)
with probability mass at Y = jT: G0(jT) = Pi(Y = jT) 1
= q*
•^XiUT-U)
p • exp
i=l 1
=
q'- pDnexp[-jTMn]
(7b)
It follows that the expected value of Y is given by: ft. + l
^(y) = E i=l
Jt
*
n
= E
°°
r(j+l)T
l//i(l/) d y + E
/
j=l
Qj~1fn(y)dy
/ 0T
rti + 1
y^jTGoUT) j =
°°
l
/.(j + l)T
yfi(y)dV + DnM„ J21^1
/
+
J
/
exp(-Mny)dy
oo
+ ^ W - ^ D , , exp[-jTAf n ] "
= E
/-ti + 1
/
°°
J ^ G / ) ^ + ^ " 1 3 9 J _ 1 {exp[-jM„T] -
exp[-j(+l)MnT}}
+ DnPT J2 JV_ 1 exp[-jTMn]
(8)
3= 1
where tn+i = T. Each of the integrals in the first sum can be simplified as follows [McCall (1963), Radner and Jorgenson (1963)]: f
+
yfi(y)dy
M-1)exp(-Miti)-(ti+1+Mr1)exp(-Miti+1)}
= Di\(ti +
Jti
Secondly, we derive the expected value of S. Recall that V, be the duration of the interval over which subsystem i alone would be replaced if it were to fail (i = 1,2,...,n). Then Vi = min(y,ti) E(Si) =
i = 1,2,.. .n
\iE(Vi)Wi
l,2,...n
Note that V* has Pdf equal to that of F for F < U, and probability mass Pr(Y > U) concentrated at £,. Therefore,
E(V) = E / J + 1 yfMdy + u i - J2 S 3=1
tj
L
3=1
j
+1
f*(y)dv
(9)
Optimal Preparedness Maintenance
of Multi-unit
Systems
85
Recall that
S= J> Hence
n
i~l
izi
r«j+i
y^A^ Y,
/-'i+i
vfi(v)dy+u[i-Y, Jt
j=0 i
\
fi Mdy
(10)
Jt
j=0 i
Finally, let us derive an expression for E(Z). Denote by d, the probability that the renewal cycle ends on the interval [ti, U+i): di = Pr(tj < Y < ti+i),
where i = 1,2,... , n ,
and
tn+\=T
Then (
^ = l-exp[-M1(t2-*i)] d2 = (1 - a i ) { l - exp[-M 2 (i 3 - *2)]}
i-1
di = J J ( 1 - aj)cti 3=1
n-1 rf
n = JJ(1 -aj)a« 3=1
dn+\ = J J ( 1 - oij) = p • Dn exp[-M„T] 3=1
where a3•. = 1 — exp[—Mj(tj+i cycle ends at T.
— t,)J and d„+i is the probability that the renewal
It is easy to verify that the probability that the renewal cycle ends on the interval ((j - l)T,jT) and at jT for j = 2 , 3 , . . . respectively, from Eqs. (7a) and (7b):
86
H. Wang, H. Pham and A. E.
Izundu
d y = Pr{(j - 1)T < Y < jT} rJT
qj-1fn(y)dy
[ JU-1)T
qj~l /
Uy)dy
'(j-l)T
d2j = P r ( F = jT) =
qi-1PDnexp[-jTMn}
Denote by q^ the probability that the renewal cycle ends with a replacement of subsystem i and subsystem 0 together. Noting that for two independent exponential random variables Z\ and Z2 with rate r\\ and r\2 respectively, there exists the following equation Pr(Zi < Z2) =
- ^ ~
Then, we have
jk + 3=1^ i t 3
J
j=i
j=i
The probability that the renewal cycle ends with a replacement of subsystem 0 alone is given by: 00
00
d0p = dn+1 + Y^ d2j = pDn Y2 QJ~X exp[-jTMn] j=2
(11a)
3=1
Therefore, it follows that the third term in Eq. (4) is given by: n
E(Z) = ^2 qoiWoi + d0pWo
Recall that D(T,t1:t2,...,tn;p)
= E(B) = E(S) + E(Y) + E(Z)
(12)
Optimal Preparedness
Maintenance
of Multi-unit
Systems
87
Thus, D(T,t1,t2,...,tn;p) «-l
5> n
rti
E /
vfi (y)dy + tAi-J2
fi (y)dy
Lj=0-"J .ti+l
oo
x {exp(-jM„T) - e x p [ - j ( + l ) M „ T ] } + DnT Y^ 3QJ
1
exp[-jTM„] + ^ to^Oi + c?oPw0
j=i
(13)
t=i
This completes the derivation of the expected duration of a renewal cycle D{T,t\, t2,...,tn;p). Next we investigate the expected system maintenance cost over a single renewal cycle. Since the expected number of replacements of subsystem i alone is XiE(Vi), the corresponding expected replacement cost for subsystem i in one renewal cycle is XiE(Vi)Ci, (i = 1,2,... ,n). The probability of a replacement of subsystem 0 and i together multiplied by the corresponding cost is qoiCoi, i = \,...,n. The probability that subsystem 0 is subject to PM at jT, multiplied by the cost of PM at T, results in dopCo- It follows that the expected system maintenance cost during one renewal cycle is given by: C(T,t1,t2,-
• .,tn;p) i-1
/
\Ci
J
E / j=0
+ /
rU
vfi(y)dy + uii-J2
uL
J
fi(y)dy
j = 0 " li
(14)
loiCoi + dopCo
»=i
Substituting the above results into Eqs. (2) and (3), and noting that E(Y) = U(T,t\,t2,... ,tn;p) and E(S + Z) = R(T,ti,t2,... ,tn;p), we obtain the following proposition: Proposition 1 The long-run system maintenance cost per unit of time, or expected system maintenance cost rate, and the asymptotic average system (storage) "availability" are respectively: A(T,ti,t2,...,tn;p)
E{Y) D(T,t1,t2,...,tn;p)
(15)
88
H. Wang, H. Pham and A. E. Izundu
L(T,t1,t2,-..,tn;p) 'i~L
7 _, \Ci
ftj+i
(
ill
fti+i
E / J+1 yfMdy + U (i - Yl j J + 1 Mv)dy A—n
** ti
+ 2jgoiCoi + dopCo/D(T,ti,t2,
\
• — "n *' J ti 3=0
• • • ,tn;p)
(16)
i=l
From the above result, the optimal opportunistic preparedness maintenance policy (T*, t\, <2,.. •, £*) to minimize the system maintenance cost rate or to maximize the asymptotic average system storage "availability" can be obtained by using nonlinear programming software. Next we discuss the other operating performance characteristics of this opportunistic preparedness maintenance model. 2.2. Other Operating
Characteristics
In Sec. 2.1, we evaluate system maintenance cost rate and availability. To learn more about this imperfect opportunistic preparedness maintenance model, and to predict supply and maintenance requirements, it is useful to investigate its other operating characteristics. First, we note that for this imperfect preparedness maintenance model, three different maintenance actions are observed: (i) Replacement (perfect repair) of a failed subsystem with a constant failure rate by itself; (ii) Joint opportunistic maintenance of a failed subsystem with a constant failure rate and subsystem 0 unfailed; (iii) PM of unfailed subsystem 0 at some time jT where j is a natural number. In addition to system storage availability and maintenance cost rate derived in last section, other important operating characteristics of this preparedness maintenance model include the expected number of each of these maintenance actions per unit time, and expected maintenance cost of each of these maintenance actions per unit time. Another characteristic of interest is the probability of at least m failures of one of the subsystems in the interval (0,t) [McCall (1963, 1965), Radner and Jorgenson (1963)]. In details, the following operating characteristics will be investigated in this section: Ti roi r0o rop
expected rate of replacement of subsystem i, i = 1,2,..., n expected rate of joint opportunistic replacement of failed subsystem i and unfailed subsystem 0, i = 1,2,..., n expected rate of planned maintenance of subsystem 0 at times jT expected rate of total perfect maintenance (alone and joint) of subsystem 0
Optimal Preparedness Maintenance
of Multi-unit
Systems
89
Tfj
the expected rate of joint opportunistic maintenance of subsystem 0 with another subsystem coo expected rate of expenditure on PM of subsystem 0 at times jT Ci expected r a t e of expenditure o n replacement of subsystem i Coi expected rate of expenditure on joint replacement of subsystem 0 a n d subsystem J Pi(m, t) probability of at least m failures of subsystem i in the interval (0, t) where t is constant and % = 0 , 1 , . . . , n Let us consider the subsystems with constant failure rates first. The time to failure for each of the inspected subsystems is an exponential random variable with rate Xi. Obviously, ri = Xi i = l,2,...,n (17) or including Wi, t h e time t o replace subsystem i,
1/Aj + Wi
1,2,....n
XiWi + 1
(18)
Therefore, cir=riCi
AC = -———
i = l,2,...,n
AjU>j + 1
(19)
Using t h e elementary renewal theorem, we know t h a t t h e rate of perfect maintenance of subsystem 0, is asymptotically equal t o t h e reciprocal of t h e expected value of Y, t h e time t o t h e first perfect maintenance of subsystem 0, t h a t is, [EiY)}-1
lim rop(t) = t—>oo
where E(Y) is given in Eq. (8). T h u s , for large value of t, r0p « [EiY)}-1
(20)
O n t h e other hand, from t h e foregoing definitions for rates of maintenance we have n
r0p = r'00 + '^2roi
(21)
l
T h a t is, this expected rate, r o p , can be partitioned into two parts: t h e expected r a t e of perfect P M a t some time jT a n d t h e expected r a t e of joint opportunistic replacement with another subsystem. We can also see this relationship from t h e derivation of Eq. (8). B y t h e law of large numbers, t h e fraction of the total number of perfect P M of subsystem 0 is equal t o dop given in Eq. (11a) for large t. Hence, in t h e long run, t h e expected r a t e of perfect P M of subsystem 0 is given by:
r 00 «do^mr 1
90
H. Wang, H. Pham and A. E.
Izundu
or, including maintenance time, r
/, _ doplEjY)}-1 °° ~ d0p[E(Y)]-i[E(Z) + E(S)] + 1
dop d0p[E(Z) + E(S)] + E(Y)
=
It follows that the expected rate of PM of subsystem 0 is r
°° ^ r°o/P
=
d0p[E(Z) + E°(S)}p + E(Y)p
(22)
The expected expenditure on planned maintenance is given by C
°° " r ° ° C 0 - d0p[E(Z) + E(S)}P
+
E(Y)p
(23)
It is easy to see that in the long run, ^™ roi, the expected rate of joint opportunistic maintenance of subsystem 0 is: n
rfj = 52r0i = (l-dOp)[E(Y)]-1
(24)
l
From Sec. 2.1, n
i — d0p = 2_^ qoi 1
where qoi's are given by Eq. (11). Therefore, n
rfj = [E(Y)}-1'£qoi
(25)
l
Since the probability that subsystem 0 will be replaced jointly with subsystem i is qa, the asymptotic expected rate of opportunistic replacement of subsystem 0 and i is given by: r0i = q0l[E(Y)}-1 i = l,2,...,n (26) or including maintenance time, _/ 0i
qoilEjY)]-1 q0i[E(Y)]- [E(Z) + E(S)] + l 1
qoi q0i[E(Z) + E(S)]+E(Y)
,
,
V
;
Accordingly, the expected rate of expenditure on opportunistic maintenance is, coi=roiCoi
= —f . . ' °' „,v, i = l,2,...,n (28) qoi[E(Z) + E(S)\ + E{Y)p Because the lifetime of subsystem i follows the exponential distribution, the probability of at least m replacements of subsystem i in the interval (0, t) is given by: Fi(m,t)=5]exp(Ait).Myij—m
t = l,...,n
(29)
Optimal Preparedness Maintenance
of Multi-unit
Systems
91
3. Optimization Models So far we have derived the system reliability measures — system storage availability, probability of at least m failures of subsystem i in the interval (0,i), expected rate of failure of subsystem i, etc., and system maintenance cost measures — system maintenance cost rate, expected rate of expenditure on planned maintenance of subsystem 0, expected rate of expenditure on replacement of subsystem i, etc. To obtain the optimal system maintenance policies both system reliability and maintenance cost measures must be both acceptable. For example, it may be required that the system maintenance cost rate is minimized while the system availability is not less than some predetermined requirement Ac,. For such a problem, we can formulate the following optimization model from Eqs. (15) and (16): Minimize L(T,ti,t2,-..,tn;p) 7 i=i
J
AjC7j
E J lj=0-"-i
+ 2jgoiC 0 i + d0pCo/D(T,ti,t2,
Subject
!Z*
/-tj+i
3=0
Lj
Vfj (v)dy + tAl-^2
J
fj (y)dy
...,tn;p)
to A(T,ti,t2,...,tn;p)
> A0
,T>0 where constant Ao is the predetermined requirement for system availability. This optimization model can be solved by nonlinear programming software to obtain an optimal system preparedness maintenance policy (t*, ££, • • •, C T*). Similarly, based on other operating characteristics derived and different requirements in practice we can construct other optimization models. 4. Concluding Remark One of the key problems for multicomponent systems in modern maintenance practice is economic dependence. Besides, maintenance is often imperfect. This paper has considered these two factors which greatly affect optimal maintenance policies in multi-unit systems and presents some preliminary work on this aspect. Moreover, maintenance time is not ignored in this work. Both system reliability and maintenance cost measures are incorporated in the optimal opportunistic maintenance models in this paper so that the optimal maintenance policies obtained may be optimal not only in terms of maintenance costs but also in terms of reliability
92
H. Wang, H. Pham and A. E. Izundu
measures. Therefore, t h e opportunistic maintenance model of the multi-component system with n+1 decision variables ( i i , t^, • • •, tn, T) proposed in this study is more realistic and the results obtained in this paper expect to be effective in practice. In t h e future we will extend this work t o multi-unit systems with two or more I F R subsystems which are subject t o imperfect maintenance and economic dependence as well as other shut-off rules. It is noted t h a t Khalil (1985) discussed various shutoff rules a n d system availability, a n d those shut-off rules can be considered in later research. T h e second p a r t of this paper — application of this preparedness model t o t h e ballistic missile maintenance will be furnished in a separate paper. References Barlow, R. E. and Proschan, F. (1965), Mathematical Theory of Reliability, John Wiley and Sons, New York. Barlow, R. E. and Proschan, F. (1975), Statistical Theory of Reliability and Life Testing, Holt, Renehart & Winston, New York. Bhattacharjee, M. C. (1987), "New Results for Brown-Proschan Model of Imperfect Repair", Journal of Statistical Planning and Inference 16, 305-316. Brown, M. and Proschan, F. (1983), "Imperfect Repair", Journal of Applied Probability 20, 851-859. Cho, D. I. and Parlar, M. (1991), "A Survey of Maintenance Models for Multi-Unit Systems", European Journal of Operational Research 5 1 , 1-23. Dekker, R. and Smeitink, E. (1991), "Opportunity-Based Block Replacement", European Journal of Operational Research 53, 46-63. Fontenot, R. A. and Proschan, F. (1984), "Some Imperfect Maintenance Models", in Reliability Theory and Models, Academic Press. Khalil, Z. A. (1985), "Availability of Series Systems with Various Shut-Off Rules", IEEE Transactions on Reliability 34, 187-189. Makis, V. and Jardine, A. K. S. (1992), "Optimal Replacement Policy for a General Model with Imperfect Repair", Journal of the Operational Research Society 4 3 , 111-120. McCall, J. J. (1963), "Operating Characteristics of Opportunistic Replacement and Inspection Policies", Management Science 10, 85-97. McCall, J. J. (1965), "Maintenance Policies for Stochastically Failing Equipment: A Survey", Management Science 11, 493-524. Nakagawa, T. (1987), "Optimum Policies for a System with Imperfect Maintenance", IEEE Transactions on Reliability 36, 631-633. Pham, H. and Wang, H. (1996), "Imperfect Maintenance", European Journal of Operational Research 94, 425-438. Radner, R. and Jorgenson, D. W. (1963), "Opportunistic Replacement of a Single Part in the Presence of Several Monitored Parts", Management Science 10, 70-84. Sheu, S.-H., Griffith, W. S. and Nakagawa, T. (1995), "Extended Optimal Replacement Model with Random Minimal Repair Costs", European Journal of Operational Research 85, 636-649. Zheng, X. (1995), "All Opportunity-Triggered Replacement Policy for Multi-Unit Systems", IEEE Transactions on Reliability 44, 648-652.
Chapter 7 ESTIMATION OF S Y S T E M RELIABILITY B Y VARIATIONALLY P R O C E S S E D M O N T E CARLO SIMULATION M. C H A N G , G. T. PARKS and J. D. LEWINS Department
of Engineering, University of Cambridge Cambridge CB2 1PZ, UK
1. Introduction Complex systems displaying stochastic behavior may be represented as Markov processes, even when these processes include aging of such parameters as the mean specific failure and repair rates. To estimate useful properties of such systems, e.g. availability, mean time between failures, etc., with such a high dimensionality structure is challenging, especially if the systems are safety-critical and therefore designed for high reliability and rare failure events. The Monte Carlo (MC) or stochastic simulation technique is one of the few known to be capable of estimating the behavior of such complex systems, although even here the challenge of the rare event is considerable. Variational principles (VPs) of the Lagrange type are well known as providing to a single question posed of a distributed system an answer that is relatively more accurate than the distributed field approximation used. Our purpose then is to study reliability problems, where the single question might typically be a mean time to (first) failure, by combining stochastic simulation (or MC methods) with a VP. MC analog methods can be biased in several ways to promote accuracy (i.e., reduce the variance) and adjoint biasing is known to be efficient. Rieff (1996) gives a good review of MC methods including biasing. Our use of the adjoint equation is, we think, more illuminating in showing exactly what is achieved by the process. The VP for a Markov process was given by Lewins (1978) in the context of a neutron stochastic problem but it can be applied in principle to any Markov problem or a problem (semi-Markov) that can be put in the Markov form, i.e., where the future of the system depends on the present but not the past state. Such a system can be represented by Kolmogorov forward and backward equations, both yielding
93
94
M. Chang, G. T. Parks and J. D. Lewins
the full conditional distribution functions leading from the past to the future state, which we think of as the Green's function for the system: the probability of the system being in any final state (/) at a final time t, given the probability of it being in an initial state (i) at the earlier time s, or conditional probability P,!u£|s). The backward equation forms the variational stationary condition with the forward equation, and vice versa. MC methods developed from neutron transport studies are now widely used to study a range of problems including those arising in reliability studies. De Vooght (1997) has recently reviewed such dynamic reliability problems and their MC solutions. Each simulation or "track" represents a possible outcome and hence an approximation for the conditional probability we need to answer some specific question. Enough simulations should give a sufficiently precise answer. But simulations of realistic systems are expensive in computation time. Thus one looks for advantages of processing the MC trials through a VP, if the gain obtained calls for less computation than would be needed to obtain a similar accuracy by further direct (analog) simulation. In the field of logistics, standard measures of reliability such as total uptime or downtime are very important factors which impact on the life cycle cost of systems [Chang and Lewins (1998a)]. When using MC methods to estimate this factor, one of the drawbacks of the MC method is encountered — the rare event. For a system with highly reliable components, the likelihood of observing a system failure during each MC simulation is very low. In consequence there is a large variance in the estimation of the system reliability or life cycle cost. In MC reliability applications there are a number of well developed methods for reducing the variance associated with rare events [Billington and Li (1994)]. However, these methods do not appear to be effective when attempting to estimate system reliability in logistics studies under the constraints of spare parts provisioning [Chang and Lewins (1998b)]. A major motivation behind our development of the variational approach described here is the hope that it will provide an effective variance reduction method for use in such logistics studies. The scheme of the remainder of the paper is first to give the VP for a Markov process. We then report some initial findings for a particularly simple system consisting of a single component subject to a constant mean failure rate. This simplicity admits an exact analytical solution for comparison. We evaluate the use of both analytical and simulated trial functions in calculating parameters of interest in assessing the reliability of systems, particularly the mean time to failure (MTTF). Comparison with the known exact solution demonstrates the potential power of the variational processing and the associated bias in the results, and prompts suggestions for further work. In the second half of the paper, we seek to demonstrate that the combination of stochastic simulation and variational processing is feasible in more complicated systems and gives increases in accuracy for a given computational cost. We therefore apply ourselves to the assessment of the reliability of a system having one component
Estimation
of System Reliability by Variationally
Processed MC Simulation
95
which suffers failure and repair stochastically in a Markov manner. To retain the comparison with known exact solutions we continue to assume constant age-specific rates, although neither, in principle, would aging components take us out of the Markov domain (and our associated VP) nor, in practice, would the MC method be unsuited to such a development, to be studied later. We undertake various numbers of trials and treatment of a single component with failure and repair to illustrate the gain in figure of merit of the VP technique over analog MC. Our conclusions show the results achieved, with indications of further work in hand. 2. Variational Principle Suppose we have a single element whose states may be working (1) or failed (0). The matrix of conditional probabilities is then of the form P(t\s) = ( Jo D°i )• If A is the matrix of transition rates, the Kolmogorov equations and their boundary conditions, which are derived in the Appendix, are forward: ^
= AP;P(8\s)
= I={5ij)
(1)
and backward: BP -— = PA;P(t\t)
= I = (5ij),
or
3PT - — = A T P ; PT{t\t) = /
(2)
Let pM(s) = ( P i ) and P(/)(t) = (p°) be the known initial state probabilities and the final state probabilities, desired to be known, respectively. Then p ( / ) ( t ) = P(t|s)p ( i ) W
(3)
If there is a single question c to be answered at time t, in the form of a final probability weighted as c(t) = hT(t)p^(t) where hT(t) = (ho, hi), then c(t) = hT(t)P(t\s)p^(s)
(4)
In general P is difficult to find exactly, so we seek a VP through which c(t) may be computed approximately but accurately. In the spirit of Lagrange we write a Lagrangian L(t) comprised of what we seek plus a weighted expression for the condition which P should satisfy at all times r from s to t: L(t) = h1 P{t\s)pw
+ /
h'Qfar) A P M . ) - ^
p{i)dr
(5)
The weighting of interest, h, has been introduced into the integral weighting but leaves it fully general in terms of the as yet unspecified matrix Q(t,r). The constraint is the forward equation, to be satisfied at all times r between s and t. Were this constraint satisfied, implying P was known exactly, the integral involving
96
M. Chang, G. T. Parks and J. D. Lewins
[ ] = 0 would vanish and hence, the Lagrangian would indeed yield the exact value of c. We therefore seek to make the Lagrangian stationary to variations of P and write SL(SP) = hT 5P{t\s)p{i) + f
hTQ(t,T) ASP
dSP
p{i)dT -> 0
Js
which after integration by parts reduces to 5L(SP) = hTSP(t\s)p{i)
-
hTQ(t,t)6P(t\s)pW
+ hTQ{t,s)6P(s\s)p{i)
+ f
p{i)TSP
ATQT+9C^
hdr -» 0
(6)
OT
This result can be achieved for arbitrary error SP by imposing equation and boundary conditions on Q(t,r). We can assume there is no error 5P(s|s) since P(s|s) is known to be / , and the two leading terms at t will cancel if Q is then the identity function / . The integral will be zero for any error if Q satisfies the equation obtained by putting the terms in [] to zero. However, this is seen to be identical to the backward.equation and its boundary condition. Hence, the stationary condition for L(t) is to identify Q(t,r) with P(t\r) and we have a kind of a self-adjoint VP requiring a single trial function:
L(t) = hTP(t\s)p{i) + f
hTP{t\r)
AP(T\S)
JS
dP{r\s) dr
V« d r
(7)
The indirect correction term to the error in the direct evaluation of c is seen to be a weighting of the out of balance forward equation at r with the consequences at t given r, over all T. Although both fields can be found from an estimate of the conditional probability P, it should be noted that the forward field is P(T\S) and the backward field is P(t\r); hence the integrals are in the form of convolution integrals. While this interpretation has been given in terms of an elementary Markov system, an interpretation of the forward and backward equations in terms of operators suggests that any Markov process (and by extension, any pseudo-Markov process) will have the same form of VP and pseudo-self adjoint stationary condition. This should allow application to a wide range of stochastic processes. 3. Analytical Example Consider a single component system subject only to failure (no repair) at a (constant) mean specific failure rate A, so that A = ( . ). We pose the problem of calculating P/(t|0) and therefore take p^T = (0,1) and hT = (0,1). It is readily shown that the exact solution for this model takes the form 1 P(t\s)
l-e-^-^ (8)
Estimation
of System Reliability by Variationally
Processed MC Simulation
97
We propose the linear approximation At < 0
0 < At < 1
P(t|0) = (
) ,
(9)
At > 1
The VP now yields
U = P±{t\Q) - J* \Pl{t\r)Pl{T\Q) + Pl{t\r)dP^0)
dr
(10)
and hence the variational approximation (VA): Zx = 1 - A t + - ( A t ) 2 - - ( A t ) 3 , Zi = ^ - ^ ( A t - l ) + i ( A t - l ) 3 , Li = 0,
0 < At < 1 l
(11)
At > 2
and we may make the following comments: • In the first domain, from zero initial time, the result has additional second and third order terms that are indeed exact, compared to the original approximation that is exact to first order only. This is a marked improvement. • In the second domain, the VA is an improvement, although not as spectacularly as the third order, but still an improvement generated by the good approximation in the neighborhood. • In the final domain the approximation is unchanged (at zero) by the variational procedure, but is still good as the correct result starts the range at exp(—2) and falls to zero at infinity. • The VAs are continuous across both the internal domain boundaries (with no slope discontinuities) and remain within the domain of probabilities (0, 1). Figure 1 shows the variationally improved results, as well as exact and linear approximation values. The improvement brought about by the VP is seen to be substantial. One test of the accuracy is to compute the MTTF which, for the exact solution, is unity in nondimensional units. However, we find some uncertainty in whether to
98
M. Chang, G. T. Parks and J. D. Lewins
""• -o-~
Linear approximation Lagrangian Exact solution
Fig. 1. Variational development of the linear approximation.
first form :
(12a) /•OO
or second form : (Xt) — / Jo
XtPd(Xt)
(12b)
Both are identical for the exact solution but in view of the approximation, it is not clear that one is a better form to use than the other, so we examine both. For the first (differential) form, the original approximation gives a nondimensional MTTF of 1/2, a 50% underestimate. The VA gives
(Xt) = fx ( 1 — x H—x2
) dx + /
Jo
Ji
X [ X
\
7 11 3 dx = — + — = 24 24 4 2 J X2 I
v(13) ;
a 25% underestimate that can be said to be twice as good as the original approximation. The original approximation in the second form gives a poorer value still, 1/6, since it again neglects those rare events that contribute to a long time to failure. The VA in the second form yields (Xt) = W
x ( 1 -x+
io
\
-x2 - -x3 ) dx+
2
6 ;
x2 + -x3 ) dx = —
/ x[
A
\3
2
Q J
a 58% error compared to the 83% error of the original approximation. form sees a gain on using the variational expression, but the first form is we take the variance, we find the direct approximation has variances of 1/18 in the first and second form while the variational expression has
12
(14) V J
So either better. If 1/12 and variances
Estimation
of System Reliability
by Variationally
Processed MC Simulation
99
of 13/48 and 7/48, respectively. We believe that the reduction in variance in the second form does not make up for the greater bias in the predicted mean. 4. Monte Carlo Analytical Example In MC stochastic simulation, a mean time to any event is selected using a (pseudo-) random number generator, followed by the selection of which event occurred, using a further random number, and the state of the system updated. In the present case, starting with the component working, there is only one possible event — failure of the unrepairable system. Of course, with a constant mean specific failure rate, the time to failure is known to be exponentially distributed. Repeated simulations would allow for a MTTF to be estimated although even one trial provides some information. Suppose such a simulation fails at time t\. This trial provides an estimate for Pl(t\Q) in the form of a step function P / = 1 — H{t — t\). We seek to show consistency of the MC method, that repeated simulations will realize the correct MTTF. It is known that t\ is exponentially distributed (indeed this is how the MC program finds it) as Aexp(—A*i), so that the expectation of Pl(t\0) over t\ is rOO
Pl{t\Q) = / Xe~XtlH(*! - t)dtx = e~xt (15) Jo the exact result. Note the change of arguments in H on changing integration from t to t\. We conclude that direct use of MC gives an unbiased estimator for the MTTF in this problem. Similarly, the mean and variance of Xt\ are both unity. Consider, however, the VP using the MC result as a trial function. No run is needed to estimate the final state if the initial state starts "failed" and the single run yielding estimates the final state given the initial state "working". The Lagrangian for the problem posed is Eq. (10), and hence Li = 1 - / Xdr, t < ti Jo Li = -
Xdr+j Jt-ti
L1 = 0,t>
S(t-ti)dT,ti
(16)
JO
2fi
We can evaluate the Lagrangian in these three domains, making use of the Dirac delta distribution as the derivative of a step function. The result is Li =
l-\t,t
Li = l + \t-2\ti,ti Li = 0 , i > 2ti
(17)
100
M. Chang, G. T. Parks and J. D. Lewins
In the first interval, the VA falls (as it should) but note that the less usual case of \t\ > 1 can go negative and hence is unphysical. The approximation is continuous into the next interval where it now rises to unity, before dropping discontinuously to zero in the final range. Figure 2 gives the variationally processed approximation for a case with a sufficiently large ti that the probability goes negative (a rare event).
r, 1/A Afj < 1 more usual
/U[ > 1 less usual
Fig. 2. Variational development of a one-run Monte carlo approximation.
The expectation over is t\ /•oo
(P£(t\0)) = / (l-\t)e~xdx J\t 2e~M -
p\t
+
(l +
Xt-2x)e-xdx
Jxt/2
(18)
e~xt'2
which starts correctly at unity but then does not follow the simple, single exponential form of the exact solution. Therefore, we see that the variational processing introduces a bias; indeed at large enough ii the result is negative. The result for the single trial function probability itself is also unexpected, giving increasing probabilities, but is explicable perhaps in terms of an over-high initial (MC) approximation in the first interval, giving an over-great variational drop in correction. The rise in the second interval is due to this jump and large differential term. Note also the moving of the discontinuity from the first to the second transition. The discrepancy in the solution also presents some problems in evaluating the MTTF. For the direct (analog) MC, both forms are exact but the first form is more convenient, evaluating as JQ tdP = ti, which averages correctly. The variationally processed first form, with due allowance for the discontinuity, is (At),
/ .A)
xdx -
/ J Mi
xdx + /
x5(x - 2Xtx)dx - 2Aii = -(Aii) 2
(19)
JO
This again starts correctly (at unity) but can give a negative, unphysical, and unacceptable result for large enough (though less usual) t\ because of the positive change of probability in the second domain. The second form gives
Estimation
of System Reliability by Variationally
/•Aii
Processed MC Simulation
101
r2Xti
(Ai)2nd = / x(l-x)dxx(l + x - 2Xti)dx = 2(Xt1)2 = - ( A ^ ) 3 (20) JO J\t! This will also be negative for sufficiently long ti and so again is unsatisfactory. The expectation and variances of the two forms over t\ are not the same: E({Xti))lst=
e~x(2x - x2)dx = 2 x 1! - 2! = 0
(21a)
e~x(2x - x2)2dx = 4 x 2 ! - 4 x 3 ! - 4 ! = 8
(21b)
Jo /•OO
Var((Ati)) lst = / Jo
POO
E({\ti))2nd
= / Jo
e-x{2x2
- x3)dx = 2 x 2! - 3! = - 2
(22a)
/•OO
Var((Ai 1 )) 2nd = / Jo
e~x(2x2 - x3)2dx - ( - 2 ) 2
= 4 x 4! - 4 x 5! - 6! - 4 = 332
(22b)
The first form is clearly preferable in terms of least bias and least variance, but neither is satisfactory. To show they can be improved upon we consider two developments: smoothing and two trials. 4.1. Smoo thing Part of the difficulty with the MC trial function is its discontinuities. This will be expected to diminish with a large number of trial functions so that each step in the approximation of the probability distribution becomes small. But it is certainly possible that a smoothing of the one-trial approximation before entry into the VP will be beneficial. To demonstrate this, consider smoothing the single run trial function motivated by the concept that we take a linear interpolation between the epoch (zero) when it is known to be unity (component working) and the epoch when it is known to have failed (ii) according to the MC simulation. Thus Pi(t\0) =
l,t<0
Pt(t\0) = l-t/t1,0
=0,t>h
The domains are for the linear approximation already treated so that:
Pl(t\0) = 1 - At + (\h - 0 ^ - i ^ - , 0 < t < U
(23)
102
M. Chang, G. T. Parks and J. D. Lewins
+
*<*»>=HH K^ + A*i
1 + 1 ft
1 \
2 \ ( t1
•6 2{r1- )-2{r1- )
\2
1 ft
n r
x3 1 '
(24)
U < t < 2ti P/^IO) = 0,t > 2 i i , These results are continuous across their domain and reduce to the previous analytical example when Xti = 1, the expected value of the MC trial. We evaluate, in the first form, the MTTF as
(At)lst = hh
- ^(Aii) 2 + |At! - ^(A*i) 2 = At! - J(Ati)2
(25)
This has expectation £ , ((Ati))i st = 1/2 and variance Var((Ati))i st = 1/4. Thus, the smoothed form shows a marked improvement over the non-smoothed result. The bias has been halved, from 1 to 1/2, and the variance reduced from 8 to 1/4. Similarly, for the second form, we have (At) 2nd = ^ ( A t O 2 - i(At!) 3
(26)
giving E((\ti))2nd = 1/6 and Var((Ati)) 2n d = 173/36. Although this is an improvement over the unsmoothed result, the second form is again less desirable than the first. 4.2.
Two
Trials
It would be unrealistic to assess any simulation method on the basis of a single trial function. The analytical treatment can be extended to two trial functions, to be performed before variational processing. In the direct or analog method, the probability will be estimated as 1 - H(t - t i ) / 2 — H{t — t 2 )/2 and this gives the true value for the probability over repeated runs as before.a That is, the first form of the normalized MTTF gives A(ti + t 2 )/2 whose expectation is correct, and the second gives \2{t\+t\)/A, whose expectation is also correct. The approximation for the probability distribution now has two discontinuities, each magnitude 1/2 rather than 1. This also means that the variational result is less likely to go negative, as it is less likely that both results will occur after the expected failure. There are two forms, dependent on the relationship between the two random variables of the twin MC simulation. Figure 3 gives the results and shows the improvements in the probability distribution having been passed through the VP. a T h e number is not the order of running. Whichever of the runs has the shorter time to failure is numbered one.
Estimation
of System Reliability
by Variationally
Processed MC Simulation
103
+P
l + ^Xt-l-Xu -Xu 2 4
4
2
'
t2 < 2fj
1 + Xt - Xu - Xu 1
+\Xt-^Xt2
\-Xu
1 -It
+P 4 3
4
-
!
4
2
'
2
t2 > 2fj
!
A/ Ar 2Af'"4A'
\-Xt
\ + -AXt-\xu2 4 4 2
• Xu
Fig. 3. Variational development of a two-run Monte Carlo approximation.
In the first form, we obtain (with due allowance for the discontinuities) the variational result (Ai) lst = Aix + Xt2 - -(Atj + \t2f (27) This is seen to go negative for the (exceptional) case of both MC failures greater than expected, less likely than the expression for a single trial function. The expectation and variance are E((Xt))ist = 1/2 and Var((At))i st = 5/4. Thus, the first form reduces the bias in the expectation from unity to 1/2, compared to the expectation over the single trial function, and correspondingly reduces the variance from 8 to 5/4. This is a substantial improvement. The second form can be assessed similarly. For the MTTF we have (A*)2nd = ~ [ ( A * i ) 2 + (Ai2)2] - i [ ( A ^ ) 3 + (Xt2)3} + \(Xti+Xt2)2--^(\t1
+
\t2)\
(28)
giving £((Ai)) 2n d = - 1 / 2 and Var((At)) 2nd = 177/4. Thus the second form has less bias and smaller variance than the single trial function result, but is poor compared to the first form. We conclude at this point that the first form is to be preferred in evaluating a MTTF from variationally processed MC results. At this stage, we can have some confidence that with the variational processing of a single composite trial function based on many MC runs, the reduction in variance
104
M. Chang, G. T. Parks and J. D. Lewins
achieved will outweigh the bias introduced — although the goal is shifted, the shots strike closer to the original goal. This hypothesis, which is supported by analogous work in neutron transport theory [Allagi et al. (1998)], is put to the test in subsequent sections, where not only are the results based on a more realistic number of trials, but for a model incorporating repair as well as failure. 5. I n t e r m e d i a t e Discussion These analytical results lead us to expect that: • Increasing the number of MC samples will help, and favors variational processing. • The variational processing introduces a bias to the analog (unbiased) MC result. • Variational processing will nevertheless give an improvement since the reduction in variance achieved outweighs the bias introduced. • Optimal results will be gained by solely post-processing, i.e., undertaking all the desired analog MC trials and then variationally processing the single trial function approximation for the probability distribution. We consider that smoothing may be advantageous, but we have not checked this in this paper in a more realistic model, including, for example, repair as well as failure. Table 1. Mean and variance of the normalized mean time to failure by various methods. Form
Mean
Bias
Variance
Exact* and MC-1 Direct
1
0
1
1/2
-1/2
1/12
Analytical Direct Analytical VP MC-1 VP Smoothed MC-1 V P MC-2 Direct MC-2 VP
(") (0 (") (i)
(») (0 (")
1/6 3/4 5/12 0 -2 1/2 1/12 1 1/2 -1/2
-5/6 -1/4 -7/12 1 -3 -1/2 -11/12 0 -1/2 -3/2
1/18 13/48 7/48 8 332 1/4 173/36 1/2 5/4 177/4
'variance of the distribution of failure times. In other cases, variance of estimate of the M T T F .
Table 1 summarizes a number of results we have obtained for the MTTF by various methods, together with the variance to be expected, and illustrates these deductions quantitatively. It is seen that although the variational processing introduces bias, the reduction in variance is more significant. The first form (i) of
Estimation
of System Reliability by Variationally
Processed MC Simulation
105
evaluation of the MTTF is to be preferred in all cases studied. We can conclude that the second form (ii) should not be employed. The analytical results are generally favorable showing a bias but a reduced variance. After VP on one MC trial, there is a bias of —1 and a large variance of 8. VP of two trials halves the bias and reduces the variance by a factor of 6.4, whereas the analog result for two trials is unbiased but with a reduction of variance of only 2. We therefore expect that with, say, n MC trials variationally post-processed, the bias will be reduced by a further large factor until it becomes less than the analog variance reduction of 1/n. Such extended trials are undertaken in the sections which follow. Our results for two trial functions suggest the preliminary maxim, that all the trials should be conducted first and then variationally processed in one pass: variational post-processing is best. This is also explored in subsequent sections in more realistic models. 6. Simulation Model To demonstrate that the combination of stochastic simulation and variational processing is viable in more complicated systems, we conduct extended trials on a single component system with exponentially distributed probabilities of failure and renewal with constant failure and repair rates, A and /i, respectively. We also assume that there is an ideal spare support system, meaning that a spare part is immediately available whenever required. These two assumptions make the system Markovian. The form of the VP is as given in Sec. 2 with the matrix of transition rates for a system model with renewal as well as failure being:
-x)
(29)
We seek to estimate the system reliability function Pl(t\s = 0). As P^ + PQ = 1, from the relevant forward Kolmogorov equation [Eq. (49) in the Appendix]: dP1 - ^ = - A P 1 1 + / i ( l - i , i 1 ) = / * - ( M + A)P11
(30)
for which the analytical solution is: u
_i_
\P-(v+>>)t (l -r A
In our simulations, the system starts in the working (or failed) state and operates continuously for Tm hours. Initially, we take Tm = 500 h, A = 0.001 h _ 1 and H = 0.005 h" 1 . In performing these simulations on a digital computer the mission time, T m , must be discretized into time steps of length At, so that there are m = Tmj At
106
M. Chang, G. T. Parks and J. D. Lewins
time steps in all. Theoretically, the smaller At, the more accurate the results, but the greater the computational cost entailed in the simulations and in processing the data. There must be a trade-off. In our model the mean time to failure and the mean time to repair are 1/A and 1//^, respectively. This implies that we have two different cycles, the "failure cycle" and the "repair cycle". The "frequencies" of these cycles are A and /x, respectively. The digitization time step At chosen should not be so large that information from either of these cycles is lost. Shannon sampling theory [Roberts and Mullis (1987)], widely used in digital signal processing, states that in order to ensure that no information is lost through digital sampling, the sampling interval should be at least as short as the reciprocal of the Nyquist frequency, which is twice the highest signal frequency to be observed. The Nyquist frequency for the model initially under consideration is 2fi (2 x 0.005 h _ 1 ) . Therefore using this criterion, At should be 100 h or shorter. In our initial studies we use a time step At = 10 h. 7. Simulation Results 7.1. Analog
MC
First, we use standard analog MC methods to simulate the system for, say, 1000 life cycles, each starting with the system working. From each simulated cycle we can construct a histogram which represents a (very crude) estimate Pi of P\, as shown in Fig. 4. system repaired at t = 321.6 h
system failed at t= 175.2 h
10
20
30
40
Time step (10 h each) Fig. 4. Pj 1 estimate from a single life cycle histogram.
50
Estimation
of System Reliability
by Variationally
1.0
Processed MC Simulation
i
i
i
r
I
I
I
L
20 30 Time step (10 h each)
40
107
0.9 XX XXX X X
IS a
X)
S PL,
0.8
J
0.7 0
I
10
I
I
I
50
Exact P. * * Estimated P. from 100 histograms ° ° Lagrangian L. from 100 histogram pairs Fig. 5. Imporved estimate of Pf.
We can improve our estimate P / by taking the mean from several such histograms. Figure 5 shows an example of P / constructed by taking the mean of histograms obtained from 100 independent simulation histories. It is clear from Fig. 5 that there is a large difference between the estimated P / and the known, exact result for P / . Here we must note that the reliability functions in Fig. 5 are all digitised in time steps (here At = 10 h). 7.2. Variational
Processing
We can improve this crudely estimated P / by applying our VP [Eq. (7)], but in order to do so, we need not only an estimate of Pi but also an estimate of P°, which therefore requires us to perform further MC simulations, each starting with the system failed. The remaining components of the matrix of conditional probabilities P are readily obtained using the fact that Pj1 + P<$ = 1 and that P® + PQ = 1. If we seek to improve upon our estimate P / , we take p^T — (0,1) and hT = (0,1) and perform the convolution integrals required by Eq. (7) to obtain the Lagrangian L\. Figure 5 shows the result of variationally processing the P / histogram. It is readily apparent that the Lagrangian L1 is much closer to the known exact P*. Thus, our VP has improved the accuracy of estimate, but this has been obtained
108
M. Chang, G. T. Parks and J. D. Lewins
at the cost of performing a further 100 MC simulations plus significant additional computational effort — the convolution integrals are particularly expensive. 7.3. Figures
of
Merit
In order to quantify the benefit, if any, to be gained by using our proposed VP, we must define some figures of merit (FOMs). We can measure the accuracy of an individual estimate P ^ or L^ in terms of the mean square error between P ^ or L\t and the exact P^, if the latter is known: 1 m r MSE„ = - J2\ Ph(kAt) fc=i
i2 - Pi (A;At)
(32)
If we do not know the exact solution as is likely to be the case in practice, then a single estimate is of limited value. However, if we have several, say n, such estimates Pfi or Ljj then we can reasonably expect that our best estimate of P / will be the mean of these estimates: I n P/(fcAt) = - ^2(kAt),
k = l,m
(33)
and the accuracy of this best estimate will be indicated by the standard deviation a in the mean square errors between each P ^ and this average estimate P / : MSE 2i
, m = - J2 [^ii(fcAt) - Pf'(fcAi)] m fc=i
2
(34)
Thus: a2 = — — V [MSE 2i - MSE 2 | 2 n — 1 ^-^
(35)
where 1
n
MSE 2 = - V MSE 2i
(36)
In cases where we do know the exact P\, we can also measure the accuracy of the best estimate P\ directly by calculating the mean square error between P\ and P\: ,
m
„
MSE3 = - X ] f ^li (kAt) - Pl (k At)} (37) m fc=i Since we run our program on a time-sharing workstation network, the exact execution time for the program cannot be measured with any accuracy. However, the MATLAB package [The Mathworks Inc. (1997)] in which the program is implemented provides another factor, the FLOP count, which measures the accumulated number of floating point operations in executing the program. We use this factor to
Estimation
of System Reliability by Variationally
Processed MC Simulation
109
represent the CPU time. Then, following Dubi and Greenfeld (1991), we can define two FOMs: FOMi = MSE 3 x FLOP (38) and FOM 2 = a2 x FLOP
(39)
For both these measures, the smaller the FOM the better the performance of the method under consideration. 7.4. Benefit
Study
If we have a certain amount of MC simulation information available, say N pairs of life cycle histories (half starting with the system working, half with the system failed), then this data can be processed in a variety of ways. We can construct n pairs of histograms (each approximating P / and P°) using a = N/n histories and variationally process these to give n Lagrangians L\{, which can then be averaged to give a best estimate of Pi. Table 2. Errors and FOMs for different data. processing options — P\ estimation. Number of Histories in Bach Histogram
Number of (Pairs of) Histograms
a
n
10
100
20
50
40
25
50
20
100
10
200
5
250
4
500
2
1000
1
MC
MSE 3 X10- 1 0
FOM! Xl0~3
4.84 4.84 4.84 4.84 4.84 4.84 4.84 4.84 4.84
0.014 0.014 0.014 0.013 0.013 0.013 0.012 0.012 0.012
VP (7
FOM 2
xl0~4 27.42 12.95 5.93 5.14 2.33 1.15 0.92 0.47 CO
0.2
0.05 0.01 0.007 0.001 0.0003 0.0002 0.0001 CO
MSE 3 xlO-10
FOMi XlO-3
7.84 0.81 0.09 0.04 0.01 0.01 0.01 0.01 0.01
2.11 0.10 0.005 0.003 0.0004 0.0001 0.0001 0.0001 0.0001
a
FOM 2
X10" 4 7.68 2.06 0.43 0.32 0.09 0.03 0.02 0.01
0.06 0.001 0.0006 3 x 10~ 5 1 x 10~ 6 5 x 10"7 2 x 10~ 7
CO
CO
1.6
Table 2 shows the results of processing in different groupings the data provided by simulating, 1000 times from an initial working state and 1000 times from an initial failed state, a single component system with Tm = 500 h, A = 0.001 h _ 1 , \i = 0.033 h _ 1 and At = 10 h in order to estimate P}. The analog MC results which use only the 1000 simulations starting from the working state show that it makes no difference, in terms of the accuracy of the best estimate Pi with respect to the known solution, how the data is sub-divided and averaged. The slight reduction in FOMi with increasing group size is due to the fact that fewer histograms are being constructed and averaged, thus slightly reducing
110
M. Chang, G. T. Parks and J. D. Lewins
the FLOP count. There is, however, a clear benefit in constructing relatively few histograms (with large numbers of samples contributing to each) in reducing the variance in the mean square errors between the histograms and therefore FOM2. The VP results show a clear benefit compared to analog MC in both FOMs except when the sample size in each group is fewer than 40. Even in the case of group sizes of 20 there is an improvement in the accuracy of the resulting P^ estimate obtained but the computational cost of variationally processing so many histograms means that the FOMs are worse. Otherwise, improvements over analog MC of a factor of up to 120 are seen for accuracy (FOMi), despite the need for twice the number of MC simulations and the additional computational costs. The improvements are even higher for FOM2, which is based on the sample variance, although a small sample variance is in itself no guarantee of actual accuracy. We must remember that the VP introduces a bias. 7.5. Integrated
Reliability
Measures
Our VP has been derived on the basis of seeking to improve estimates of the conditional system probabilities. In practice, one is often interested in calculating an overall measure of system reliability such as availability, total system downtime, To, or total system uptime, Ty. These quantities can be calculated by integrating the system probabilities appropriately and we would naturally expect that a more accurate estimate of the conditional system probabilities will result in a more accurate measure of the system reliability. However, such integral reliability measures can be estimated directly by analog MC simulation without the need to construct histograms approximating the conditional system probabilities. Therefore, we must establish whether our VP allows the accuracy of these reliability estimates to be improved cost effectively. As an example, we consider the estimation of the total system uptime: Tv = / Pldt (40) Jo To measure the accuracy of our best estimate, Tu, the average of the n individual estimates we calculate Tut, we can define the percentage relative bias range (PRBR) between T\j and the known exact result Xj/: Tu
Z_Tu Tu The sample standard deviation is given by: PRBR -
1
n
(41)
2
2
a = -L-S2\fm-Tu\ ro—1 ^—' L
(42) J
i=i
so the standard deviation of Tu should be: <7j- = a/n
(43)
Estimation
of System Reliability by Variationally
Processed MC Simulation
111
From this we can construct a 95% confidence interval about the best estimate, Tu, as TJJ ± ao-jr-, where a depends on n according to the Student-t distribution and tends asymptotically on the value for the normal distribution (a = 1.96) with increasing n. If the exact TJJ lies within this 95% confidence interval, we can deduce that the expectation of Tu is close enough to the exact result that Tu can Tu therefore be called an unbiased estimate. If Tu lies outside this confidence interval, then the estimate is referred to as biased. Following Dubi and Greenfeld (1991), we also define the proportional relative standard deviation (PRSD) as: PRSD
a-,Tu
(44)
with an associated figure of merit: FOM = PRSD 2 x FLOP
(45)
Table 3. Errors and FOMs for different data processign options — Tu estimation. Number of Histories in Each Histogram a 10 20 40 50 100 200 250
Number of (Pairs of) Histograms n 100 50 25 20 10 5 4
2 500 1 1000 Direct Analog MC with 1000 Histories
VP
Bootstrap Analog MC PRBR
PRSD
(%)
(%)
0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09
0.17 0.16 0.13 0.15 0.15 0.14 0.24 0.29 oo
0.12
0.17
Bias Test
PRBR
PRSD
(%)
(%)
0.09 0.07 0.04 0.06 0.06 0.04 0.14 0.20 oo
unbiased
0.44
unbiased unbiased unbiased unbiased unbiased unbiased unbiased
0.19 0.01 0.02 0.02 0.004 0.003 0.01 0.02
0.10 0.06 0.04 0.04 0.03 0.04 0.05 0.05 oo
0.06
unbiased
FOM
—
FOM
Bias Test
2.73 0.60 0.10 0.09 0.03 0.03 0.04 0.03 oo
biased biased unbiased unbiased unbiased unbiased unbiased unbiased
—
Table 3 shows the results of processing in different groupings the data provided by simulating, 1000 times from an initial working state and 1000 times from an initial failed state, a single component system with Tm = 500 h, A = 0.001 h _ 1 , H = 0.033 h _ 1 and At = 10 h in order to provide estimates of P / and then integrating these to provide estimates of Tu. This is the same system (indeed the same simulations) as that considered in the preceding section. In the direct analog MC process Tu is estimated by direct tallying without constructing any histograms. In the bootstrap analog MC process the analog MC
112
M. Chang, G. T. Parks and J. D. Lewins
results are used to construct P / histograms which are then integrated to provide Tu estimates. In our experience, the bootstrap analog MC process sometimes produces superior results to the direct analog MC process. We have not yet found any way to predict which will be better for individual cases. The direct analog MC method is what is normally used in practice. We therefore focus on comparisons between our VP procedure and the direct analog MC method. In Sec. 5 we anticipated that best VP results would be obtained by undertaking all the desired analog MC trials and then variationally processing a single trial function approximation for the probability distribution. This is not totally born out by the results in Tables 2 and 3. The results in Table 3 confirm our previous findings that VP gives a benefit compared to analog MC when the sample size in each group is large, i.e., a relatively small number of histograms are constructed. The advantage of VP for estimating this integrated reliability measure is not so marked as for estimating the conditional system probabilities. Nevertheless, it is apparent that compared to the direct MC method, the optimal VP processing (here constructing five pairs of histograms using 200 samples each) reduces the PRBR and PRSD by factors of 30 and 4, respectively, and the FOM by a factor of 2. We use this optimal VP processing strategy in the further studies which follow.
8. Refinements 8.1. Time
Discretization
The foregoing studies were conducted using a time step At = 10 h. As discussed earlier, a smaller time step can be expected to give more accurate results but at greater computational cost. For the single component system just considered /J, = 0.033 h _ 1 . Therefore, according to Shannon sampling theory, At should be shorter than 15 h so that no information is lost in the digital sampling process. Table 4 shows the effect of varying the discretization time step At in processing the data provided by simulating a single component system (1000 times from an initial working state and 1000 times from an initial failed state) with Tm = 500 h, A = 0.001 h _ 1 and n = 0.033 h _ 1 . This is again the same system (and the same simulations) as that considered in preceding sections. All the data processing (except using the direct analog MC method) is done by constructing five (pairs of) histograms (variationally processing these if appropriate) to provide estimates of P / which are then integrated to provide estimates of T\j. The exact result for this simple system is known to be 486.2852 h. The computation cost analysis in Table 4 shows that the cost of the direct analog MC method is entirely in simulation. The bootstrap analog MC method is dominated by simulation, though the cost of histogram construction is significant, particularly for short time steps.
Estimation
of System
Reliability
by Variationally
Processed MC Simulation
113
Table 4. FOMs and cost analysis for different discretization time steps — Tu estimation. Bootstrap Analog MC
VP
Direct Analog
At
At
At
At
At
At
At
At
MC
1 h
5 h
10 h
25 h
1 h
5 h
10 h
25 h
486.70
486.38
486.24
486.30
486.26
Number of Histories Tu (h) PRSD (%)
2000
1000
1000 486.87
486.84
486.85
486.30
0.170
0.130
0.127
0.135
0.160
0.016
0.022
0.036
0.074
FOM
0.060
0.047
0.039
0.043
0.059
0.038
0.018
0.027
0.072
PRBR (%)
0.120
0.114
0.116
0.085
0.020
0.009
0.002
0.004
0.005
unbiased
unbiased
unbiased
unbiased
unbiased
unbiased
unbiased
unbiased
unbiased
Bias Test
Computat ion Cost Analysis (Mflops) 0.0208
0.0208
0.0208
0.0208
0.0208
0.0556
0.0556
0.0556
0.0556
Building
0
0.0069
0.0037
0.0030
0.0023
0.0254
0.0191
0.0166
0.0137
Convolutions
0
0
0
0
0
1.4025
0.2992
0.1379
0.0628
0.0208
0.0277
0.0245
0.0237
0.0231
1.4835
0.3740
0.2102
0.1321
Simulation Histogram
Total
For the VP method the simulation cost is increased because of the need to do additional simulations with the system starting failed. However, the most significant cost is that incurred in performing the convolutions required. The latter depends very strongly on the discretization time step. As expected, the smaller the time step the lower the PRSD. However, because of the increase in convolution cost, in particular, and histogram building cost, to a lesser extent, the FOM goes through a minimum for At = 5 h. Note that when At = 25 h, a value longer than the maximum suggested by Shannon sampling theory, the VP FOM is worse than that for direct analog MC, confirming the applicability of this theory to reliability studies. In the example presented here the optimal discretization time step is about half as long as the maximum predicted by Shannon sampling theory. This is insufficient evidence on which to base a rule of thumb. Nevertheless, it is true to say that, although Shannon sampling theory sets an upper limit on the discretization time step, a lower value may well lead to improved performance. 8.2. Additional
Histories
The preceding section showed that the computation cost of the VP method is dominated by the cost of performing the convolutions required. This depends on the number of histograms to be processed and strongly on the discretization time step, but not, in theory at least, on the number of samples used in constructing the histograms. Performing additional MC simulations will increase the accuracy of the estimates obtained. Because the (dominating) convolution cost will not be changed by doing
114
M. Chang, G. T. Parks and J. D. Lewins
so, it can be anticipated that the performance of the VP method will be improved if further histories are used. Table 5. FOMs and cost analysis for different numbers of MC simulation histories — Ty estimation. Direct Ailalog MC
Number of Histories
5*7 00 PRSD (%) FOM P R B R (%) Bias Test
VP
500
1000
2000
4000
493.06 0.18 0.03 1.38 unbiased
486.87 0.17 0.06 0.12 unbiased
486.07 0.13 0.07 0.04 unbiased
487.00 0.08 0.06 0.15 unbiased
1000 485.91 0.044 0.065 0.08 unbiased
2000
4000
8000
486.30 0.022 0.018 0.002 unbiased
486.24 0.017
486.31 0.005
0.013 0.010 unbiased
0.001 0.005 unbiased
Computation Cost Analyjsis (Mflops) Simulation Histogram Building Convolutions Total
0.0105
0.0208
0.0410
0.0817
0.0282
0.0556
0.1129
0.2242
0 0 0.0105
0 0 0.0208
0 0 0.0410
0 0 0.0817
0.0108 0.2986 0.3375
0.0191 0.2992 0.3740
0.0353 0.2995 0.4477
0.0658 0.2997 0.5897
Table 5 shows the effect of varying the number of histories used in simulating our single component system with Tm = 500 h, A = 0.001 h _ 1 , and n = 0.033 h _ 1 . In these studies the discretization time step At is set at the optimal value (5 h) identified in the last section. The VP data processing is again done by constructing five pairs of histograms, which are variationally processed to provide five Lagrangians (estimates of P / ) , and then integrating these to provide estimates of TJy. As before, the exact result is known to be 486.2852 h. Table 5 shows that as more histories are used in the direct analog MC method (even though the PRSD is reduced) because the computation (simulation) cost increases in proportion, there is no significant change in the FOM. In contrast, as more histories are used for the VP method, the PRSD reduces and, because the convolution cost is little changed, although the costs of simulation and histogram building do increase, the overall computation cost increases relatively slowly, and the FOM is observed to decrease. Thus, the advantage of VP over direct analog MC increases with the number of simulations required. 8.3. Digitization
Time
Reseating
As was shown earlier, the cost of performing the convolutions required by the VP method depends strongly on the digitization time step size used in constructing the histograms. It was also demonstrated that Shannon sampling theory gives a practical upper limit on the length of this time step, so that information is not lost in the digital sampling process. Once the histograms have been
Estimation
of System
Reliability by Variationally
• ®sf.
Processed MC Simulation
115
Original histogram
03 Rescaled histogram At„
*ft-
!
&
Time step Fig. 6. Digitization time rescaling.
constructed, however, it may be possible beneficially to rescale them in order to reduce the number of time steps before performing the convolutions to construct the Lagrangians. We call this process digitization time rescaling (DTR). Figure 6 shows schematically what it entails. The probability values over a number of consecutive histogram time steps (of length At) are arithmetically averaged to give an appropriate probability value for a rescaled histogram with longer time steps (of length AtR.). Table 6 shows the effect of varying the rescaled digitization time step in simulating a single component system with Tm = 500 h, A = 0.001 h _ 1 and ^i = 1 h _ 1 . Table 6. FOMs and cost analysis for digitization time rescaling — Trj estimation.
Simulation Time Step At Rescaled Time Step AtR
0.5 h
5 h
(h) PRSD (%) FOM P R B R (%) Bias Test
0.5 h
—
10 h
50 h
100 h
2000
Number of Histories TD
VP
Direct Analog MC
0.5044
0.5052
1.37 571.9 1.17 unbiased
1.39 37.4 1.32 unbiased
0.5045 1.50 31.0 1.19 unbiased
— 1000
0.5040 1.48 22.2 1.09 unbiased
0.5034 1.55 24.0 0.97 unbiased
0.4967 6.11 81.3 0.36 unbiased
Computation Cost Analysis (Mflops) Simulation Histogram Building Convolutions Total
2.9636 3.0458
0.1124 0.1945
0.0587 0.0234 0.0548 0.1370
0.0218 0.0198 0.1020
0.0171 0.0992
0 0 0.0218
116
M. Chang, G. T. Parks and J. D. Lewins
In this case we seek to estimate the total system downtime, another important factor in reliability and logistics studies: TD = f ™ Podt (46) Jo which for this simple system is known to have an exact value of 0.4985 h. The VP data processing is again done by constructing five pairs of histograms, which are variationally processed to provide five Lagrangians (estimates of P j ) , and then integrating these to provide estimates of TDIn these studies the simulation discretization time step At is set at the threshold value given by Shannon sampling theory (0.5 h). As can be seen in Table 6, this short time step results in such a high convolution computation cost that if rescaling is not performed, VP is inferior to direct analog MC. Table 6 shows that as the rescaling time step is increased: • The best estimate of the total system downtime, TD, does not change very much. • The PRSDs do not change very much. • The cost of performing the VP convolutions decreases dramatically. • VP becomes superior to direct analog MC. This clearly shows the potential benefit of DTR in improving the efficiency of VP in estimating integrated reliability measures. Obviously its efficacy depends on the fact that in highly reliable systems with low failure rates and rapid repair rates, the conditional system probabilities vary slowly with time, thus allowing coarse averaging without significant accuracy loss. It is not anticipated that DTR will prove as effective in estimating the performance measures of less reliable systems. 8.4. Virtual
Sampling
Table 6 shows that if DTR can be applied successfully, the computation cost of the VP method then becomes dominated by the cost of simulation. To apply the VP method to a single component system requires two conditional system probabilities, e.g. Pi and Pf, to be estimated, thus doubling the number of simulations required compared to direct analog MC. For an m component system, 2 m conditional system probabilities, corresponding to the 2 m possible initial conditions (each component working or failed in various combinations), will be required in order to apply the VP method. It would therefore be desirable to reduce the cost of performing the MC simulations, if possible. We therefore explore the application of Allagi and Lewins' concept of virtual sampling (VS) in neutron transport simulation [Allagi and Lewins (1998)] to our problem. This concept is illustrated in Fig. 7. Suppose we are simulating a single component starting in the failed state (in order to estimate PQ ). When the system is repaired it enters the working state. We
Estimation
of System Reliability by Variationally
system repaired at f= 136.7 h
system failed at / = 288.0h
Processed MC Simulation
117
system repaired at ? = 521.3h
o o 0, a o
l
20 system starts from failed state at t = 0 h
1
'
r
system failed at t = 288.0 h
1
system still failed at t = 500 h
system repaired at t= 136.7 h
system repaired at t = 384.6 h
system starts from working state at t = 0 h
L virtual P . history
'
30 40 50 Time step (10 h each)
system failed at / = 151.3 h
1
system still working at f = 500h
Fig. 7. Virtual sampling — Pfi and virtual P j sampling from a single MC history.
can regard this transition as t h e beginning of another (virtual) simulation starting from t h e working state (thus providing a n estimate of PQ). In order t o complete this virtual sample, we must extend t h e simulation time t o Tm plus t h e first sampled free flight time. It is clear, however, t h a t t h e cost of performing these two simulations (the original one and the virtual one) is less t h a n the cost of doing two simulations from scratch. We are aware of t h e complications of correlation this introduces and this is t h e subject of continuing study. In Allagi's work, such correlation showed no disadvantage [Allagi and Lewins (1998)]. Table 7 shows t h e effects of applying virtual sampling t o the total system downtime estimation problem considered in t h e previous section, again using D T R . Comparing Tables 6 and 7 reveals t h a t : • T h e use of virtual sampling does not significantly alter t h e P R B R s or P R S D s . • T h e cost of simulation is significantly reduced (by 35%).
118
M. Chang, G. T. Parks and J. D. Lewins Table 7. FOMs and cost analysis for virtual sampling with DTR — Tr> estimation.
Virtual Sampling V P
Direct Analog MC
0.5 h
—
Simulation Time Step At Rescaled Time Step A t ^
0.5 h
5 h
50 h
100 h
1000
Number of Histories T b (h) PRSD (%) FOM P R B R (%) Bias Test
10 h
0.5031 1.51 691.1 0.92 unbiased
0.5048 1.66 51.9 1.25 unbiased
0.5037 1.50 29.3 1.02 unbiased
— 1000
0.5037 1.48 21.0 1.03 unbiased
0.5037 1.48 20.4 1.02 unbiased
0.4967 6.11 81.3 0.36 unbiased
0.0171 0.0930
0.0218 0 0 0.0218
Computation Cost Analysis (Mflops) Simulation Histogram Building Convolutions Total
2.9636 3.0395
0.1132 0.1891
0.0379 0.0380 0.0550 0.1309
0.0198 0.0958
• Unfortunately, this gain is almost completely negated by the fact that VS histogram construction costs 62% more. • In consequence, virtual sampling shows no real advantage. We suspect that there is a more efficient way in which VS histogram construction can be done. If this could be found, then the computation cost saving associated with the use of virtual sampling on this single component system would be significant. The real advantage of virtual sampling will become more evident when it is applied in the simulation of multi-component systems. In these, the potential computational savings associated with its use are substantially higher. 8.5. Benefit
of
Synergy
In addition to the potential benefit of cost-effective variance reduction, the other significant advantage offered by the VP method is the ability to answer more than one question. In order to answer one question the VP method requires that all the system conditional probabilities be estimated. Once these are known, any other question can be answered by performing appropriate convolutions and integrations. However, there is no need for further system simulation or histogram construction. In contrast, in order to answer a different question by traditional direct analog MC methods, the system must be simulated again from scratch. The advantage offered by the VP method is obvious.
Estimation
of System Reliability
by Variationally
Processed MC Simulation
119
9. Concluding Discussion The preceding theoretical and experimental analysis demonstrates the potential benefits of our proposed VP method in estimating the reliability of simple systems. It might be queried how merely a manipulation of the analog MC data can give improved results. There are two reasons an advantage can be gained. The first is to say that the exact Kolmogorov equation is used in one pass through the VP and can be expected to give an improvement like a further term in a von Neumann series approximation. The second is to say that there is indeed unused information in the MC simulation — where the system started as well as where it ended. That is, the conventional averaging that takes place hides the detail of the origin of each case. This detail is employed in the convolution of the adjoint weighting and provides additional information. Simple theory led us to expect that all trials should be processed in a single batch for least bias and least cost. We found that this was not universally true in practice; a few batches subsequently averaged performed better than a single batching. We can take advantage of this by noting that, in general, an exact result for comparison is not available. A single batch would then have no associated variance to estimate the accuracy (unless the variance of the analog MC result were used). But by variationally processing, say ten batches, we have a sample variance to predict the accuracy of the final result (either batch averaged or repeated as a final single batch). Experience so far shows that the sample variance over, say ten batches, is a conservative estimate of the exact variance, and a closer estimate than the original analog MC. We also note that multiple batch processing may make the use of higher-order theory, where it is desirable to simulate the initial and final (adjoint) sources, more efficient by allowing these sources (themselves dependent upon the probability distributions) to be estimated in a recursive fashion from the previous batch. A notable feature of our approach has been to develop the Green's function for the system, the full conditional probability matrix P. If this is made available from the MC simulation, then any question of interest can be assessed through the VP. A further feature is that a classical VP allows development of classical perturbation theory. The use of MC trial functions in a simply based perturbation theory has always been suspect: the difference between two simulations has 'noise' from the statistical fluctuations which hides the effect of a small perturbation in properties. The present formulation does not suffer this defect, being based on a single simulation of the unperturbed system. Consequently it can address again any small perturbation of properties without further trials. Furthermore, such perturbation theory lends itself to higher-order variational and perturbation theory of classical form [Lewins et al. (1986)]. We acknowledge that a corpus of analog MC benchmark results will need to be built up before the analyst can justifiably estimate the benefit of our proposed VP procedure. The extension can readily be made to multi-component systems and to
120
M. Chang, G. T. Parks and J. D. Lewins
systems with aging parameters for failure and repair, since these can all be put in Markov form and are hence subject to the same variational principle and can be modelled by MC simulation. In a problem with multiple components, say m, we note that the number of trial functions required by the VP is of order 2 m , thus the effort needed rises rapidly. Whether the VP still brings benefit in such a model is under study. This paper has focused on developing the Green's function, the complete matrix of probabilities, from which any question about the system can be answered subsequent to simulation. We are not unaware, however, that if it is known in advance that there is but a single question to be answered, then it may well be more efficient to simulate trial functions in the form of forward and backward vectors, rather than the full matrix. Finally, we note that although the present study applies to problems in reliability (and elementary ones to provide exact solutions for comparison), the VP given for a Markov process evidently has application to a wide range of systems and should be useful in any variety of applications involving stochastic theory. Acknowledgment Part of this work was undertaken during a period when one of us (JDL) was Visiting Professor in the Mechanical Engineering and Material Science Department, Duke University. Thanks are extended for their support. Appendix: Kolmogorov Equations We derive the forward and backward equations physically. For the former, consider the sequence relating the probabilities at t to those at t + St. The probabilities in these disjoint periods are assumed independent and can therefore be multiplied. We study the way in which the end result at t + St can be achieved allowing for the probabilities of failure at a mean probability of XSt and of repair at a mean probability of /j,St in the limit of the vanishing interval St during which there can be no or at most one event. For instance: Pi(t + St\s) = Pi(t\s) x Prob(no system failure in St) + Po(t\s) x Prob(system repaired in St)
(47)
or Pl(t + St\s) = P}(t\s){l - XSt] + P£(t\s)[LiSt}
(48)
?I1 = -XP}+»PZ
(49)
giving
with boundary condition Pl(s\s) = 1.
Estimation of System Reliability by Variationally Processed MC Simulation
121
Similarly:
8P° dt
-AP 1 °+/iP o °;P 1 o (s|s)=0
(50)
8P1 _A
=
_Api_/xP0i;Poi(s|s)=0
(51)
<9P° _A
=
_ApO_/ipO.pO(s|s)=0
(52)
Note the viewpoint taken in the derivation of the forward equations is essentially: "Here I am; where do I go next?" For the backward case, consider the effect of changing the initial time s by 5s and the changes in this period at mean probabilities of XSs and ^i5s in the limit of the vanishing interval, when again no multiple events can occur. Again the probabilities in the two disjoint periods may be multiplied, so that, for instance: Pl(t\s)
= Pl(t\s + 5s) x Prob(no system failure in 5s) + Po(t\s + 5s) x Prob(system failed in 5s)
(53)
Pl(t\s) = Pl(t\s + Ss)[l - \5s] + P?(t\s + 5s){X5s]
(54)
or
giving dP
i - _ \ p i , \p0 ds ~ XPl + XPl
(55)
with boundary condition Pl(t\t) = 1. Similarly: 8P1 . _ J L = -APo1 + AP0° ; Po1 (t\t) = 0
(56)
8P° = -/iP10-MP11;P10(i|i)=0 ' ds
(57)
8P° —£ = ~^Po ~ ^Po : PS(t\t) = 1
(58)
Note that the viewpoint taken in the derivation of the backward equations is essentially: "Here I am; where did I come from ?" References Allagi, M. O., Lewins, J. D. and Parks, G. T. (1998), "Variationally Processed Monte Carlo Transport Theory", Ann. Nucl. Energy 25, 1055.
122
M. Chang, G. T. Parks and J. D. Lewins
Allagi, M. O. and Lewins, J. D. (1998), "Real and Virtual Sampling in Variational Processing of Stochastic Simulation in Neutron Transport: The One-Dimensional Rod", Ann. Nucl. Energy 25, 1521. Billington, R. and Li, W. (1994), Reliability Assessment of Electrical Power Systems Using Monte Carlo Methods, Plenum Press, NY. Chang, M. and Lewins, J. D. (1998a), "Using Life Cycle Revenue Loss and Monte Carlo Simulation as a Prior and Direct Assessment of Consequences of Un-wished Events", Ann. Nucl. Energy 25, 117. Chang, M. and Lewins, J. D. (1998b), "Some Experiences in Using Biasing and Antethetic Variance Techniques in Monte Carlo Logistics Simulation", in Proc. European Safety and Reliability Conference, Trondheim, Norway, 697-702. De Vooght, J. (1997), "Dynamic Reliability", Adv. Nucl. Sci. Tech. 25, 211. Dubi, A. and Goldfeld, A. (1991), "AMIR 2.2 User's Manual: A Multipurpose Performance and Reliability Analysis Code", Technical Report AMIR 2.2, Malchi Science Ltd. Lewins, J. D. (1978), "Variational Method in Neutron Stochastics", Ann. Nucl. Energy 5, 141. Lewins, J. D., Parks, G. T. and Babb, A. L. (1986), "Higher Order Perturbation Theory — An Example for Discussion", Nucl. Sci. Eng. 94, 380. MATLAB Version 5.1 (1997), The Mathworks Inc., Natick MA. Rieff, H. (1996), "Stochastic Perturbation Analysis Applied to Neutral Particle Transport", Adv. Nucl. Sci. Tech. 23, 96. Roberts, R. A. and Mullis, C. T. (1987), Digital Signal Processing, Addison-Wesley, Reading MA.
Chapter 8 A B A Y E S I A N A P P R O A C H TO T H E OPTIMAL POLICY UNDER IMPERFECT PREVENTIVE M A I N T E N A N C E MODELS K W A N G - S U P A R K and C H I - H Y U C K J U N Department of Industrial Engineering Pohang University of Science and Technology San 31, Hyoja-dong, Pohang 790-784, Korea E-mail: [email protected]
1. Introduction Following the work of Barlow and Hunter (1960), many researchers have proposed various maintenance policies which include age replacement, block replacement, and periodic replacement with minimal repair at failure, etc. [Barlow et al. (1965); Valdez-Flores and Feldman (1989)]. Though age replacement and condition-based maintenance policies are also being studied by many researchers, time-based maintenance policies where replacements are done at specified time intervals are most popularly used in heavy industries or steel companies. These industries generally have periodic preventive maintenance (PM) schedules three or four times a month and have a major overhaul one or two times a year. A simple maintenance policy assumes that the system becomes anew after each PM. In reality, however, the improvement after each PM depends on the age of the system as well as the cost and the cumulative number of PMs done since the last major overhaul or replacement. Hence, the system may not be renewed at each PM, but its failure rate generally increases with the number of PMs, while it can be assumed to be renewed at the major overhaul. This kind of PM is often referred to as being imperfect in the literature [Nakagawa (1981, 1986)]. The imperfect PM policy may be divided into a periodic case and a sequential case, where the periodic case is to make PM intervals constant while the sequential case is to make PM intervals shorter as the number of PMs increases. We reconsider Nakagawa's model [Nakagawa (1986)] for the imperfect (sequential) PM and replacement policy by adopting the Bayesian approach. Assumptions
123
124
K.-S. Park and C.-H.
Jun
are as follows: (i) PMs are done with variable intervals xk (fc = 1, 2 , . . . , K), where K is the number of PMs until a complete replacement. That is, PMs are done at successive times 0 < x\ < #1+2:2 < • • • < xi + - • - + XK-I', (ii) the system is replaced by a new system at the Kth PM so that x\ + X2 + • • • + XK is the replacement period; (iii) the system undergoes minimal repair at failure, hence the failure rate remains unchanged by any repair between PMs; (iv) the system failure rate Afc(i), say, in the fcth period of PM (time between the (k — l)th PM and the fcth PM) is increasing with fc such that Xk(t) < Xk+i(t) for any t > 0, which means that the system is undergoing imperfect PMs; and (v) times required for PMs, minimal repairs, and replacements are negligible or reflected at costs. Note that it will reduce to the periodic case having the replacement period Kx when x\ = X2 = • • • = XK = x. We propose a Bayesian approach to determining the optimal PM intervals and the replacement period under a Weibull hazard function. In this paper, we use a gamma distribution and discretized Beta distribution as priors of two Weibull parameters. One of rationales of adopting the Bayesian approach is based on the fact that failure parameters vary over time as the system is stabilizing or deteriorating. A similar Bayesian approach has been considered to the block replacement and the age replacement protocol [Mazzuchi and Soyer (1996)]. 2. Prior Distribution of Failure Parameters and the Cost Function Let us assume that system failures during the fcth PM interval which is the time between (fc — l)th PM and fcth PM occur according to a nonhomogeneous Poisson process with intensity or failure rate function, Xk(t) = ark-1ptp-1,
k = l,2,...,K
(1)
where a and (3 are unknown parameters and r (> 1) is assumed to be known. The form of failure rate function in Eq. (1) is constructed to make it increasing with the number of PMs and to ensure Afc(i) < Xk+i(t). Note that the above assumption is equivalent to the use of the following Weibull distribution for the time to the first failure during the fcth PM interval under the minimal repair: fk(t\a,
6) = a r * - 1 / ^ " 1 exp(-ar f c - 1 t / 3 )
(2)
where a and j3 are said to be the scale parameter and the shape parameter of a Weibull distribution, respectively. We assume that the prior distribution of the scale parameter a is given by the following gamma distribution which could express diverse shapes of distribution through two known parameters a and b, and will be denoted by Gamma(a,6):
9(a) = ^fa~1e~ba.
«>0
(3)
Bayesian Approach to Optimal Policy under Imperfect PM Models
125
For the prior distribution of the shape parameter ft, we use a discretized beta distribution [Mazzuchi and Soyer (1996)], which is a discretization of the generalized beta density on (PL,PU)'-
T(c + m
=
r(c)T(d)
d)(p~pLy-\pu-p)^ for0
(ft,-/jL)^-i—'
^^^
(4)
where PL, PU, C, d > 0 are specified constants. That is, the prior of P (discretized beta distribution) is defined by fPi+6/2
Pi=-Pv{P = (3l}=
h(J3)d0, 1 = 1,2,...,m
(5)
J 0,-5/2
where Pi = pL + 5(21 - l ) / 2 and 5 = (Pu - /?L)/TO for a given m. This allows great flexibility in representing prior uncertainty between specified lower and upper values and makes calculation more effective. We assume that the two parameters a and (3 are independent a priori and thus the joint prior distribution is the product of the distributions of a and (3. Because P controls the rate at which the system ages, (3 must be greater than 1. Thus, PL should be assumed to be greater than or equal to 1 and other constants /%, c, d may be guessed by an expert or evaluated by effective methods. Once the prior distribution of P has been established, the prior constants of a could be obtained by eliciting information about the time to the first failure during fcth PM and equating these with appropriate expression from the predictive distribution. If we let Nk(xk) be the number of failures during the fcth PM interval, then the conditional expectation of Nk(xk) given that a and P are known is E{Nk(xk)\a,p]=
nk-l„P [ " \k{t)dt = ar^xl Jo
(6)
As in a previous work [Nakagawa (1986)], we consider three kinds of cost: the cost of minimal repair, the cost of PM, and the cost of replacement. The expected cost during a replacement period is the sum of expected cost of minimal repair, expected cost of PM, and expected cost of replacement. Therefore, the expected cost per unit time given that parameters a and P are known is E[Cix,K)\a,0\
= * S f - i W * O K f l + ( * - ! ) < » + <»
cia Ek=i
rk lx
+(K-
~ k >K
!) c 2 + c3
(7)
where c\ is the cost of minimal repair, C2 is the cost of PM, and c 3 (> C2) is the cost of replacement. In this paper, using this cost rate in Eq. (7), we seek the optimal xk's and replacement time K which minimize the cost rate when unknown parameters a and
126
K.-S. Park and C.-H. Jun
P have some distribution. First, the optimal PM intervals and replacement period are obtained in Sec. 3 under prior knowledge for the sequential case as well as for the periodic case. Adaptive procedures based on posterior knowledge are discussed in Sec. 4. 3. The Optimal P M Intervals and Replacement Period under Prior Knowledge 3.1. Sequential
Case
The optimal PM intervals and the replacement period are obtained via the value of the PM intervals Xk (k = 1 , . . . K) and the number of PMs (K) which minimize the following unconditional expected cost: E[C(x,K)]=Ea,pE[C(x,K)\a,0\
=
^ 1=1
y^-^ Z^fc=l
{)
x
k
When K is given, to find the optimal x* = (xi*, X2*, • •., xk*) which minimize E[C(x,K)}, we differentiate the right-hand side of Eq. (8) with respect to Xk and set it equal to 0, to obtain the following: 771
ci^Pirk-14t-1Pi=E[C{x,K)],
k = l,...,K
(9)
i=i
If we let
Rk = J2&rk~lxk~lpi>
k = l,...,K
(10)
i=i
then Rk = (b/aci)E[C(x, K)) is a positive constant value for a certain go*, which means that there exists a unique Xk satisfying Eq. (10) for all k = 1 , . . . , K, if Next, to find the optimal value of K for the sequential case, Ks*, say, which minimizes E[C(x,K)], we form the inequalities: E[C(w,K-l)]>E[C(x,K)]
(11a)
and E[C(x,K)}<E[C(y,K
+ l)}
(lib)
where w, x and y are (K - 1), K, and (K + 1) element vectors, respectively. If those vectors minimize each cost function when the numbers of PMs are given by (K — 1), K and (K + 1), respectively, we see that the optimal PM intervals depend on the number of PMs. It is thus difficult to show the existence and uniqueness of Ks*. Though the convexity of E[C(x,K)] with respect to K has not been shown, we can find the optimal Ks* satisfying Eq. (11) as follows: First, set K to a value
Bayesian Approach to Optimal Policy under Imperfect PM Models
127
(for example, the optimal value of the periodic PM policy that is easily evaluated). When K is given, x* is determined by Eq. (8) or Eq. (9) using a suitable numerical method. Next, perform the neighborhood search which is based on the conjecture that the optimal Ks* should be around the neighborhood of the optimal number of PMs for the periodic policy. That is, after comparing E[C(x*,K)] with the cases of K, K — 1, K + 1, K — 2, K + 2, etc., we can obtain the optimal Ks* and x*, minimizing the expected cost E[C(x, K)]. 3.2. Periodic
Case
Especially, when we assume that all the values of Xk have the same value x, the optimal PM interval and the replacement period are obtained via the value of the constant PM interval x and the number of PMs (K) which minimize the following unconditional expected cost: ™ ci a ^ x
A
E[C(x, K)\ = £ -±^±
+(K-
l)c 2 + c 3
^
^ — ^
i=i
_ "
K
Cla(r
bK{r-l)
- 1) ^ -_. {^X
Fl+
(^-l)c2+c3 Kx
,
,
(12)
To find an x* which minimizes E[C(x, K)], we differentiate the right-hand side of Eq. (12) with respect to x and set it equal to 0;
dx
bK(r-l)
f^^
'
Kx2
K
'
From Eq. (13), we have arK
-I
V ( A - l ) ^ ' f l = {(A--l)c2+c3}/ci i o r — 1 i=i ^-^
(14)
Then we can solve Eq. (14) for x* with an appropriate numerical technique when K is given. It can be easily shown that there exists a finite and unique x* which satisfies Eq. (14) for any integer K if Pi > 1, I = 1,2,... ,m. Next, to find the optimal value of K for the periodic case, Kp*, say, which minimizes E[C(x, K)] for a given value of x, we form the inequalities E[C(x,K
+ l)]>E[C(x,K)}
(15a)
and E{C(x,K)}<E[C(x,K-l)]
(15b)
If we define L{x,K)
K = % Kr --
yy'Pl, J
i=i
(A" = 1,2,...,)
(16)
128 K.-S. Park and C.-H. Jun t h e n Eqs. (15a) a n d (15b) imply L{x,K)
> (c 3 - c 2 ) / c i
(17a)
and L(x,K-l)
< (c3-c2)/ci
(17b)
Under t h e condition of r > 1, we have at
L(x,K)-L{x,K-l)
= ^ ( r - l J r ^ ^ V ' P ;
>0
(18)
i=i
It means t h a t L(x, K) is increasing in K and grows to infinity as K goes t o infinity. Hence, there exists a finite and unique Kp* which satisfies t h e above two inequalities in Eqs. (17a) and (17b) for any x > 0. From Eqs. (14) and (18), we can see t h a t t h e r e exists a unique value of a: a n d K minimizing the cost rate function in Eqs. (12). To solve t h e optimal x* a n d Kp*, therefore, we solve Eqs. (14) a n d (17) iteratively. T h a t is, starting with any possible K, we solve Eq. (14) for x a n d apply Eq. (17) t o solve for a new K. We need t o repeat t h e procedure until t h e convergence. In our experience with a case study, t h e optimal values are rapidly obtained in a few iterations. 4. A d a p t i v e P M P o l i c y Suppose t h a t t h e system under consideration has undergone failures, a n d has been minimally repaired at t h e following times: tki, i = 1,2,...,nk during t h e fcth P M , where rik is t h e number of failures and Xk is t h e P M interval, k = 1, 2,..., K. 4.1.
Sequential
Case
Consider first the sequential case with Ks and t h e P M intervals of (x\,X2,. •., XKS)Using t h e sequence of failures a n d P M times 0 < tu < ti2 < • • • < i i n i < X\ < Xi + t2l < Xi + t22 < • • • < Xi + t2n2 < Xi+X2
< ••• < S f c ^ r 1 xk < X ^ l ^ xk +
**.i < £fc=i ^ + tK.2 <••• < £fc=r xk + tK SUKS < ]Cfc=i xki w e could write t h e likelihood function under t h e failure d a t a t = (in,£12, • • •, £im> £21^22,-••> <2n2, • • • , tK,l, tKs2,- • -, tKsnKs) &S Ks Uxpi-ar*-1^} (19) £(a,/J|i) = n flar*-1/^-1 fc=l T h e joint posterior distribution of a and /3 is obtained via Bayes theorem as KS
/(a,/3 = /3,|t)oc H K=\ x aa
n^-^/tfr k
1
\-exp{-ark-'xf}
i=l
e x p { - 6 a } x Pi
(20)
Bayesian Approach to Optimal Policy under Imperfect PM Models 129
and it is rewritten by
f(a,P
= /3,|t) cc ^ - ' " '
Pi x a a + ^ " - . ' " - 1
I j | n*fc,- I vfc=ij=i
/
x exp
x,
-affc + ^ r * - 1 ^ V
fe=i
(21)
/
It is considered as the product of the posterior conditional probability density of a given (3 = Pi and the posterior density of 0. Note that the former follows Gamma anc t n e a t t e r (a',b\) with a' = a + J2k=ink, b[ = b + Ylk=irk~lxk> ^ l has the following form:
Pi
nk (p= P{(3 =ft|t} ex
1 (ll&IVLit*)*' n V L , ; {b )a
(22)
We see that the posterior distributions of a and /? are no longer independent. In spite of the dependence, however, since the expected cost function for updating the optimal strategy is E[C(x,K)t]
=
Ea,0[E{C(x,K)\a,p,t}} m
/-oo
= 2 / i=iJ° m
E[C(x,K)\a,P]f(a,p
= pl\t)da
E[C{x1K)\a,P)f{a\P
= pl\t)da-P{P
/"OO
= 53 /
= pl\t}
i=iJ°
^c^Kklirk-l4l
- Z, i=i
+(^-l)c 2 + C 3 n / v
«.
^
t23)
Z-.k=i xfc
the posterior optimal values of a:* and Ks* are achieved in a straightforward manner by replacing the prior constants a, b and Pi with the posterior constants a', b[ and P{, respectively, in Eqs. (8), (9) and (11). 4.2. Periodic
Case
For the periodic PM with a; and Kp, using the sequence of failures and PM times 0 < t n < t i 2 < • • • < t l r n < X < X + t 2 l < X + t-n < • • • < X + t2n2
< 2x < •• • <
(Kp - l)x < (Kp - l)x + tKpl < (Kp - l)x + tKp2 <••• <(Kp-l)x + tKp7lKp < KpX, we write the likelihood function under the failure data t = (in, ii2, • • •, iim > hl,t22,-
•• , t2n2,-
• • , * / f p i , i/f p 2 >-- • ' f V ( p ) i
a S
130
K.-S. Park and C.-H. Jun Kv
fe 1 1 £(a,(3\t) = H J]ar - /3if- i - e x p l - a ^ - ^ ^ } k=i
(24)
kt=i
The joint posterior distribution of a and /3 is obtained via Bayes theorem as
/(a,/3 = A|i)aJ] k=\
1
n ^ -
/ ^ -
1
•expf-ar'-1^}
L U=l
x aa
2
exp{-6a} x P;
(25)
and it is rewritten by 01-1
,fe=li=l
K„
x exp
H^rfc-V'
-a
(26)
fc=i
which is considered to be the product of the posterior conditional probability density of a given (3 = (3i and the posterior density of (3. Note that the former follows Gamma (a', b{) with a' = a + J2k=i nk, b't = b + Ylk=i Tk~X^1, and the latter has the following form:
nk
P{ = P{0 = 0i\t}
AF-
(n&n^t*)*'1* (biY
(27)
As in the sequential case, the revision of the expected cost function of E[C(x,K)\t] and the calculation of its optimal x* and Kp* for the updating optimal strategy under the posterior distributions are achieved by replacing the prior constants a,b and Pi with the posterior constants a', b\ and P[, respectively, in Eqs. (12), (14) and (17). 5. A Case Study of the K Steelworks We have applied this methodology to a hot-coil plant in the K steelworks, one of the leading steel companies in the world, to obtain the optimal PM intervals and the replacement period. There were 154 failures observed during a certain year and 50 PMs were done including two replacements at 174.87 days and 193.57 days after each previous replacement. This data is listed in Table 1.
§ §
a. t-
•s; t•a
e s
C3
>> C3
ho
c 3 "0 cfl
s> cj
t-i
£
S
CD
cu
a
i>
u, e P
*
n, a.
ea
^ CO CO
a a 05 cci
H
T3
CN CO
00
m
t^
en CM
CO
•cP
00 o
en o
~
r < in
CO CO
CM
CO CN
CO
Tjl
m o ^p
o f--
o ^p
i^
en
CO
CO
CN
o
CO
CO CO
m
CO CO
CO
CN CO
in
CO
CN
o
p^
CD CO
in in
CO
o
CO
on
en
CN
co
CO CN
in P/CN
CN r-H
CO CN rH
CN
^
CO CO TP en en CO CD CN m o o rH CO CO CO CO "CP m
en CO
CO lO
co
CO
o m
CO
CO
TP CO
CO
^
CD
en
CN CD
en
in o
en
O
CN
on
CO
CN
in
CO
CO CO
o t^
CN
CO
CN
CO
in
CO
in in rp
CN
t^
00 CN o rU
o
CO
t^p
^P
en
on
in
O)
en
rH
'-' o CO
en
m
CO CO
o
TP
CO
CO
in
CN
CO
CD
en
^P
rH
CO •cP
in
O CN
CO
CN
en in
o
o o
^ ^
t^
on
CO
CO
CO LO
o
CN
o
CO
CO
^p
o
^p
00
CO CO
o o
p-
co
o
00
CD CO
CO
CO
CO
'""'
en
o
CD
in
CO
CO
CO
in
^P
CD CN
o
CN
CD
en
^p en
en o
CO
CO
CO CO
en
CO
in
CO
in
in
^p
o CO
in in
CO
in
o
CO
CN
in 00
CO CO
00
m
en TP
en ^p
CN
P^
CN
m
00
en
CO
CO CN
CN CN
CN CO
on
CD
CN
CN
00
CO
CO CO
t00
o
CD
q
o o
in 00
^P
rH
^p CO
CO CO
m
CO
o
in
CO
en
m
CO 'CP
HH
^p
t^
CO
^p
co en
CO CO
TP
CO
o
o CO
CN
-
CN CO
CO CO
lO
CN
CO
o T—1
cn t-
CO
CO CO
CO
CO
§
o Z
s CU o rH
o
co
•*p
"3"
CO
CO TP
en
^p CN •cf
CO
en CO
o
00 in o o
Ol
CD
00
o
t-
CD
CO
LO
m
CN
CO
^P
CN
CO
^
CN
CN
5 CO
o lO
en ^p
X
CO
tCN
in
rH
o en
en
on o
2
o Z
rH
CU
o Z
CN
PH
2 CU
Z
0
Fai lure Fai lure Fai lure Fai lure
Fai ble
132
K.-S. Park and C.-H.
Jun Table 1.
No
Failure
34 35
6.57 7.23
36 37
1.57 3.04
38
2.29
PM
No
Failure
73 74
0.82 4.03
9.35
3.58 10.49 39 40
PM
9.50 [174.87] 75 76 77 78
0.12 1.28 1.46 1.63
4.90 5.27
(Continued). No
Failure
115 116 117
2.35 4.18 5.00
118 119
2.72 5.05
PM
No
Failure
151 152 153 154
0.22 3.17 3.87 3.05
8.23
7.39
6.83 [193.57]
6.64
5.50 6.29
120
0.34
121
4.20
PM
7.01
Avg.
3.17
7.37
Note: Figures in brackets are the actual replacement times. Table 2. Optimal PM number, replacement period, and expected cost rate based on the prior. Cl
C3
Cl
Kp*
X*
Replacement Period
10
2 3 4 5 2 3 4 5 2 3 4 5
26 20 16 14 32 26 22
5.471
142.246
6.670 7.677 8.517 5.499 6.676 7.667
19 36 30 26 23
8.539 5.546 6.706 7.685 8.548
15
20
Sequential
Periodic
= 1
E[C(x,K)\
Ks'
Replacement Period
E[C(x,K)]
27 21 17
147.410
0.847744
133.400 122.832
0.850134 1.007451 1.139353
139.640 130.010
1.005749 1.138062
119.238 175.968 173.576 168.674 162.241 199.656 201.180 199.810 196.604
1.254563 0.881748 1.040073 1.173441 1.290615 0.908198 1.066784 1.200751 1.318741
14 34 26 23 19 39 32 27 24
119.326 185.969 174.032
1.253582 0.877891 1.037233 1.171133 1.288753 0.902842
175.848 162.483 214.275 213.292 207.058 204.636
1.062661 1.197379 1.315916
It is assumed for the gamma prior distribution of a that a = 2.31 and 6 = 30.4, which were calculated by the method of moment from the values E[a] = 0.076, Var[a] = (0.05) 2 . An engineer in K steelworks suggested these using the method of maximum likelihood estimation (MLE) based on the Weibull life distribution. Furthermore, the engineer proposed the value of constants for prior distribution (3 in Eq. (4) as 0L = 1-4, Pu = 2.4, c = 2 and d = 2. In order to use the discretized distribution of (3, we divide the range (PL,PU) into 10 separate segments, that is, take m = 10. We also assume that r = 1.01. It means that the failure rate increases by 1 percent after each PM. Because we cannot evaluate the cost terms exactly, we assume that c\ = 1, and we consider four cases of c^ (c2 = 2,3,4,5) and three cases of C3 (c 3 = 10,15, 20), which yields a total of 12 cases. Table 2 shows the optimal
Bayesian Approach to Optimal Policy under Imperfect PM Models 133 Kp*, x* (periodic P M interval) and their product (replacement period) and Ks*, sum of Xk*s (sequential replacement period) a n d expected cost rate for each case on t h e basis of prior knowledge. T h e sequence of optimal intervals Xk*s are plotted in Fig. 1 for each of 12 cases on the basis of prior knowledge.
10
f
•c2 -c2 -c2 •c2
6 0
10 20 30 (a) number of PM when c3 = 10
=2 =3 =4 =5
40
10 -c2 = 2 -c2 = 3 -c2 = 4 -c2 = 5
s a.
0
10 20 30 (b) number of PM when c3 = 15
40
-c2 -c2 -c2 -c2
10 20 30 (c) number of PM when c3 = 20
= 2 =3 =4 =5
40
Fig. 1. Optimal sequential PM intervals (x*) based on the prior. From Table 2, when C\ = 1, C2 = 4, C3 = 15, for example, the optimal P M interval is 7.667 days and the optimal replacement should be done on t h e 22nd P M (or 169 days from t h e last replacement) under t h e periodic policy. In t h e case of t h e sequential policy, t h e optimal replacement should be done on the 23rd P M period, or after 176 days, with slightly smaller expected cost rates t h a n t h a t of t h e periodic policy. In Table 1, the first actual replacement was m a d e at 175 days or at t h e 23rd P M . During t h a t time, there were 74 failures. T h o u g h these values are not exactly t h e same as the result of prior calculation under the case of ci = 1, ci = 4, C3 = 15, we here use these real failures a n d P M s d a t a listed in Table 1 as the input t o the calculation of posterior policies for each of t h e periodic and the sequential cases. T h e n , using the posterior distribution of Weibull parameters with these d a t a , we obtain t h e posterior optimal Kp* = 15 (or Ks* = 15) and the replacement period of 238 days (or 268 days) for t h e periodic (or the sequential) policy. In t h e
134
K.-S. Park and C.-H.
Jun
case of ci = 1, C2 = 5, C3 = 20 under the periodic case (or the sequential case), the prior optimal replacement period is about 197 days (or 205 days) which is similar to the second replacement period (194 days) shown in Table 1. During the second replacement period, 80 failures were observed with 27 PMs. Also, using these data, we have a similar calculation as the case of C\ = 1, C? = 4, C3 = 15. The results of the posterior case calculations are listed in Table 3 and plotted in Fig. 2.
interval
25 -* "
"
^
•
- — c 2 = 4, c3 = 15 -m~~ C2 = 5, c3 = 20
^
s 10 5 ()
10
20 number of PM
30
40
Fig. 2. Optimal sequential PM intervals (x*) based on the posterior. Table 3. Optimal PM number, replacement period, and expected cost rate based on the posterior. =1
Periodic
Cl =
C3
CI
Kv*
15
4
20
5
15 15
X*
15.845 20.503
Replacement Period 237.697 307.535
Sequential
E[C(x,K)}
Ks*
0.877975
15
0.913752
16
Replacement Period 267.794 341.615
E[C(x, K)] 0.840758 0.892906
The results in Table 3 and Fig. 2 show that the PM and the replacement should be done at longer intervals. This may be justified by the recent observation that the number of failures actually tended to decrease during last several years in this steelworks. Generally, from Table 2 and Fig. 1, we can easily see that the optimal PM interval Xk* is more sensitive to the cost of PM (02) than to the cost of replacement (C3) and that the optimal replacement period is more sensitive to C3 than to c-iAlso, Table 2 and Table 3 show that the sequential policy has a slightly smaller expectation cost rate than that of the periodic policy. 6. Conclusions This paper uses an imperfect PM policy, one of the most practical maintenance models, to introduce the situation that the optimal PM interval and replacement period under uncertainty in a parametric life distribution can be easily analyzed by adopting a Bayesian theoretic view. The selection of a Weibull failure model and the priors was mainly for the illustrative purpose. Many alternative selections of the
Bayesian Approach to Optimal Policy under Imperfect PM Models 135 type of failure rates or t h e use of prior distributions may be possible. In the context, though we assume t h a t t h e increasing scaling factor r of the failure intensity function is known, an expert opinion or methods of the classical parameter estimation may be available to evaluate it. Acknowledgments T h e authors would like t o t h a n k anonymous referees for their valuable comments. This work was partially supported by Korea Science and Engineering Foundation through t h e A u t o m a t i o n Research Center at P O S T E C H . References Barlow, R. E. and Hunter, L. C. (1960), "Optimum Preventive Maintenance Policies", Operations Research 18, 90-100. Barlow, R. E., Prochan, F. and Hunter, L. C. (1965), Mathematical Theory of Reliability, Wiley, New York. Mazzuchi, T. A. and Soyer, R. (1996), "A Bayesian Perspective on Some Replacement Strategies", Reliability Engineering and System Safety 5 1 , 295-303. Nakagawa, T. (1981), "A Summary of Periodic Replacement with Minimal Repair at Failure", Journal of the Operations Research Society of Japan 24(3), 213-227. Nakagawa, T. (1986), "Periodic and Sequential Preventive Maintenance Policies", J. Appl. Prob. 23, 536-542. Valdez-Flores, C. and Feldman, R. E. (1989), "A Survey of Preventive Maintenance Models for Stochastically Deteriorating Single-Unit Systems", Naval Research Logistics 36, 419-446.
Chapter 9 D E S I G N OF LIFE TESTS B A S E D O N MULTI-STAGE DECISION PROCESS AKIHIRO KANAGAWA Faculty of Computer Science and System Engineering Okayama Prefectural University, Japan HIROSHI OHTA Department of Industrial Engineering Osaka Prefecture University, Japan
1. Multi-Stage Life Test 1.1.
Introduction
Multi-stage life test is a reliability test plan corresponding to a multiple sampling inspection in the field of SQC. This paper deals with a life test based on the multistage decision process. Multi-stage discussed here has k stages for decision making. The objective is to test Ho : 9 = 6$ versus Hi : 9 = 9\, under the specified type I and type II errors. In this multi-stage (fc-stage) decision process, an economic factor, e, is invested to each stage in order to obtain the data. From the first stage until the k — 1th stage, the decision maker (DM) must decide whether Ho or Hi is accepted or preceed to the next stage based on the presented data. When the DM reaches the final stage, he or she must decide whether Ho or Hi should be accepted. The economic factor is an object to be minimized. Concretely, let us consider an acceptance inspection. The parameter 9 implies a proportion nonconforming, and the object is to test Ho : p = po versus Hi : p = p\, under the specified producer and consumer risks. In this case, at the first stage, N samples are invested as an economic factor to obtain the data x that is the number of nonconforming items in the sample. The decision process at the first stage is, if x < a\, the hypothesis H 0 is accepted and if a; > r\, the hypothesis Hi is accepted. Otherwise, the test continues to the second stage, and another N samples are invested anew. In this manner, the test continues through each successive stage.
137
138
A. Kanagawa and H. Ohta
1.2. Assumptions
and
Notation
Assuming that (1) (2) (3) (4)
Investing economic factor is the same quantity at every stage. At the final stage, Hi is accepted if H 0 was rejected. The probabilities of accepting Ho are uniform at each stage. The probabilities of continuing to next stage are uniform at each stage except for the final stage.
Notation N e k
sample size investing economic factor at each stage number of stages X data for decision at each stage acceptance number at j t h stage Qj r rejection number at j t h stage 3 observation time at jth stage tj parameter for the test 9 acceptable parameter 0o rejectable parameter &i probability of accepting Ho at each stage £ probability of continuing to next stage q probability of accepting Ho throughout the test L{6) expected total investing economic factor E{0) specified producer and consumer risks a, 0 actual producer and consumer risks a*,P* cumulative distribution function (Cdf) of an item F(t) b(x,p, N) binomial probability := (^.)px(l ~p)N~x B(a; p, N) cumulative of binomial probability := 2™=o b(x'-> P> N) The decision process is depicted in Fig. 1. 1.3. Determination
of Decision
Plan
From the assumptions, the probability of accepting Ho under the true parameter 6 is L(9)=e
+ qe + q2£+---
+ qk-1l
=£ ^ -
(1)
The expected total investing economic factor E{6) is E{0) = e + qe + q2e + --- + qk-1e = e - ^ 1-9
(2)
Design of Life Tests Based on Multistage
H,
H,
\-q-tJ 1st stage
q
1-q-t
2nd stage
jt—1th
stage n
e \
\-0 q 0
«0
e
A: th stage
\
Ho
Wo
139
H,
« i
l-q-t
ii
Decision Process
Hn
e Fig. 1. Multi-stage decision process.
Then we have E{9)
eL{9) (3)
From Eq. (1), we can consider the following policy to determine a decision plan. Policy: Choose parameters (k,q,£) which minimize max{E(9o),E(9i)},
subject to
f 1 - L(90) < a
Wi)3 1.4. Life Test Based on Multi-Stage
Decision
Process
In reliability life tests, shortening test time and saving test items are important matters. Hewett and Spurrier (1983) provided a survey of two-stage or double sampling test plans, as well as examples of experiments for which a two-stage procedure would be more appropriate than single stage procedure from the above standpoints. Bulgren and Hewett (1973) considered a two-stage test of the exponential parameter with type II (failure) censoring only at each stage. Fairbanks (1988) proposed a two-stage test of the exponential parameter with hybrid censoring at each stage. It was a combined extension of the single-stage hybrid test of Epstein (1956) and the two-stage test of Bulgren and Hewett, combining desirable aspects of two-stage procedures while allowing the user to specify the maximum time duration of the test. Fairbanks also stated that it was a fairly direct extension of these test plans for the exponential parameter with three or more stages. Apparently, a criterion for selecting failure size and an efficient method for solving the resulting nonlinear system of equations are the main problems in such an extension. Let us consider a multi-stage life test whose objective is to test Ho : 9 > #o versus Hi : 6 < 6\, where 9 is a parameter of the probability distribution F. In this muliti-stage (/c-stage) life test, TV items are placed on test. The first test decision occurs at a time t\. At this time, if x\ < a\, the null hypothesis is accepted. If X\ > r\, the null hypothesis is rejected. Xi denotes the number of failures that occurred at a time t\. Otherwise, all sample items are replaced with new items and the test continues to the second stage. In stage 2, the test ends with acceptance if x\ + x^ < a.2, and rejection if
140
A. Kanagawa and H. Ohta
X\ + X2 > r2, where x2 denotes the number of failures that occurred from a time t\ to a time t2. Otherwise, the test continues again to the next stage after replacing all sample items with the new items again. In this manner, at each successive stage, the decision of accepting, rejecting, or continuing is decided by the relation between the cumulative failure number and the acceptance/rejection numbers. The test terminates at the maximum time duration of the test (T at the latest). To meet with the four assumptions mentioned in Assumptions and Notation, the test duration time at each stage should be equalized, namely *2 - *1 = *3 - *2 = • • • = **
(4)
Then we have t* = T/k
(5)
Furthermore, we set the acceptance/rejection numbers as follows: a,j = jm — 1,
r3; = jm + 1;
at = km — 1,
rfc = km
j = 1,..., k — 1 (6)
Then we have q=
b(m;p,N)
I=
B(a;p,N)
(7)
where p=F(t*)
(8)
The economic factor can be given by: e = a N + c2t* =c1N
T + c2-r
(9)
where C\ and c2 denote the weighting coefficients. A close investigation is needed in case F(t) is given by the exponential distribution as an application to the life test. When F(t) is given by the exponential distribution, it is not necessary to replace all samples at each stage. It is only necessary to replace the failed items due to the memoryless property. The following is a discussion of the multistage life test with exponential parameter as a special case of multi-stage decision makings. Hereafter Cdf F(t) is equal to 1 — exp(—1/8). When both the stage size k and the continuance number m are given, an algorithm for obtaining the smallest sample size of the simplified multi-stage test is shown in Fig. 2. Since L(G) is monotonously decreasing in N, the checking of the condition L(0\) < (3 precedes with the checking of the condition L(#o) > 1 — a. Next we discuss how to determine the stage size k and the continuance number m. Assuming that the upper limit of sample size N* is priorly prescribed owing to testing device, and the maximum number of stages k* can be determined by taking
Design of Life Tests Based on Multistage
Decision Process
141
CSTARTJ
/
';"
/
N «- 1
^ N — N+l
Kfl,) : /3
> 1r
1
output
There is no feasible plan
(^
STOP
N
^
Fig. 2. Design algorithm for minimum sample size.
procedural complication into consideration. When 6 = 0i, the expected failures at one stage is Np±, so the upper limit of the continuance number m* is:
m* = [N*Pl] + 1
(10)
In this case, saving of samples is assured in some degree so we mainly aim to shorten the expected total test time. Setting ci = 0 and C2 = 1, we find from Eq. (9) that the economic factor T
..T The criterion for determining m and k can be considered: Criterion: Select m and k such that max{.E(0o)> E(6i)} criterion can be rewritten as follows:
(ID is minimized.
This
142
A. Kanagawa and H. Ohta
Select (k, m) which maximize I = k min
(Jo
£
\ L ( 0 O ) ' L{Oi
An algorithm based on the criterion is shown in Fig. 3.
("
/ |
START
from
")
LOOP \ k = 2 to k = k* |
m* «- | /V»p,] + 1
I
^
from
LOOP m = 2 to m = m ^
from
n = 1 to
LOOP
k k k
LOOP END (AO LOOP END (m) LOOP END (k)
t (^
STOP
/ * ;
;
")
Fig. 3. Design algorithm based on the testing time index.
(12)
Design of Life Tests Based on Multi-Stage
Decision Process
143
An example of the actual design of the multi-stage test and comparisons between the proposed plan and corresponding single stage plan are presented. Example 9o = 3500 , 0i = 800 , a = 0.05 , /? = 0.10 , T = 100 , k* = 7 and N* = 200 . We obtain the following results (in Table 1) by using the algorithm shown in Fig. 3. Although the maximum stage size of seven is determined, feasible plans with more ,aan five stage tests could not be found. From the results of Table 1, we have the test plan: fc = 4, m = 3 , N = 177. In this test plan, the actual risks are: a* = 1- L(60) = 0.0427,
f3* = L{9X) = 0.0999 .
Table 1. Feasible multi-stage test plans. k
m
N
/
2 2 2 2 2 3 3 3 4
4 5 6 7 8 3 4 5 3
111 133 154 17'4 195 133 166 198 177
1.817 1.838 1.852 1.861 1.872 2.657 2.705 2.738 3.540
Single stage plan of the example with N = 177, type I censoring with no replacement is: censoring time = 37,
acceptance number = 4.
In this test plan, the actual risks are: a* = 1 - L(0o) = 0.0401,
(3* = L(6»i) = 0.0944.
Single stage plan of the example with n = 177, type I censoring with replacement is: censoring time = 37 ,
acceptance number = 4 .
In this test plan, the actual risks are: a* = 1 - L(00) = 0.0418,
/T = L(6»i) = 0.0895 .
144
A. Kanagawa and H. Ohta
expected testing time •to 38 36
••
\
>
*
single stage plan
34 32 30 28 26 4 24 22 20 0
5<X)
!()00 !5H) 2IXXJ
250O
31X1) 35(X) 40(X>
45tt)
5IJ00
MTBF Fig. 4. Comparison of the expected testing time.
expected number of r.
failures single stage plan ' vi rtri hu: rep I aceineii t
4
single stage plan with replacement
3 \ i
proposed plan
a
ax)
i(xx)
ir>m
ZCXTO
2500
.mm
3503
4(ix)
4500 BIXXJ
MTBF Fig. 5. Comparison of the expected number of failures.
From the perspective of expected testing time and expected number of failures, comparisons between the proposed plan and the single stage plan are shown in Figs. 4 and 5. 2. M u l t i - S t a g e Accelerate Life Test 2.1.
Introduction
In reliability acceptance life test, the small size of samples and the early decision of the test are required. There are two effective ways to shorten testing time. One is to use the multi-stage decision process mentioned in the previous chapter,
Design of Life Tests Based on Multistage
Decision Process
145
and the other is to use accelerated test. This chapter deals with the multi-stage life test with step-stress life acceleration at each stage, and presents a design procedure for the multi-stage accelerated life test using the cummulative exposure (Nelson's) model. The probability of accepting the null hypothesis with respect to the given MTBF can be derived. Furthermore, the expected testing time and the expected number of failures by this test plan are also presented. It is well known that the accelerated life test, namely the life test running under more severe conditions than usual, is effective for shortening the total testing time. The object of this life testing is to test Ho : 9 > 90 versus Hi : 9 < 9\, where 9 is the MTBF of the item. In this multi-stage (fc-stage) life test, N items are placed on test. The first test decision occurs at a time t\. At the time, if X\ < a\, the null hypothesis is accepted (test passing), and if x\ > r\, the null hypothesis is rejected (test failing), where X\ is the number of failures observed at the first stage. Otherwise, the test continues to the second stage. In stage 2, the test is carried out by the application of increased stress, and at the time £2 the test ends with acceptance if a; < a-i, and rejection if x > Ti. Note that x denotes the cumulative number of failures at all past stages. Otherwise, the test continues again to each succesive stage. At the new stage, another stress is added. The test terminates maximally at the fcth stage. Figure 6 shows the typical decision boundaries for this test. Additional Ti Si Gi Fi A 9 0(j) 0O 6»i a, (3
Notation
temperature stress at step i stress except for temperature at step i Cdf under the constant stress Si (or Tj) Cdf under the step-stress from S\ (or Ti) to Si (or Ti) accelerative multiplier for general stress ordinary MTBF accelerated MTBF with stress step i desirable MTBF undesirable MTBF (< 0O) producer and consumer risks
2.2. The Model for Step-Stress
Testing
In this paper, Nelson's model [W. Nelson (1980, 1990), W. Nellson and T. J. Kielpinski (1976), R. Miller and W. Nelson (1979)], including Arrhenius and Eyring models, is adopted for physical mechanism of wearing. Nelson's model assumes that the remaining life of an item depends only on the current cumulative fraction failed and current stress, regardless how the fraction accumulated. In this model the cumulative stress (damage) W of time t duration can be expressed as follows:
146
A. Kanagawa and H. Ohta cumulative failures
rejection line acceptance line 'k
«,9 a2 Q
9 t, stress
Normal
Normal xA
Normal x4-
Fig. 6. Decision line of the multi-stage accelerated life test.
W = Kt
(13)
where K is the constant responsive speed. When t is the time to failure, W denotes the cumulative stress to failure. By taking a probabilistic behavior into consideration, it can be expressed: w* =Kt* +e (14) where e denotes the random variable whose mean value is 0. Eq. (14) is: W = E{w*} = K6
Expectation of (15)
In case that the stress implies the temperature T, the Arrhenius model gives the relation: K oc e-B'T (16) where B is a constant. For the stress except for the temperature, the Eyring model gives the relation: K oc SA (17) From Eq. (15), we have 6 = W/K
(18)
6{i) = W/Ki
(19)
f%} =
(20)
Similarly,
then, we have 6K/Ki
Design of Life Tests Based on Multi-Stage
Decision Process 147
Accordingly, when t e m p e r a t u r e Ti is used as t h e stress a t step i, we have
e{i)=6exp
_BQ;-1)
(21)
So the stress should be increased such t h a t Ti=Ti-.1+b(l/T)
(22)
where b is a positive constant and T is t h e absolute t e m p e r a t u r e . W h e n the other stress Si is used of t h e i t h step, we have
'(«>='(^J
(23)
where S denotes the actual stress. Eventually the stress should be increased such t h a t Si = ASi-i
(24)
where t h e A is a positive constant which should be greater t h a n 1. These constants A, B are usually known based on t h e property of the material. If we use 1/T scale t o measure the raising stress from T; t o Tj+i, we can t r e a t t e m p e r a t u r e stress synthetically by Eq. (24). Let Si be general stress hereafter. For a step-stress p a t t e r n , there is a Cdf for time t o failure under the test. T o analyze d a t a , which is observed from this Cdf, one needs a model t h a t relates t h e distribution under step-stressing to t h e distribution under constant stress. Suppose t h a t , for a particular p a t t e r n , step i runs a t stress Si, s t a r t s a t time U-\, and runs in time U. T h e Cdf of time t o failure for units run at a constant stress Si is denoted by Gi(t). T h e population cumulative fraction of specimens failing in step 1 is Fi(t) = Gi(t), t0
(26)
T h e population cumulative fraction of specimens failing in step 2 by t o t a l time t is F2(t) = G2[(t-t1)
+ s1],
h
(27)
Similarly, step 3 has the equivalent s t a r t time s 2 t h a t is the solution of G3(s2) = G2(t2 - h +
Sl)
(28)
and F3(s2)
= G2[{t - t2) + s2],
h
(29)
148
A. Kanagawa and H. Ohta
Si
S 2
S3
Fig. 7. Cdf under the multi-stage life acceleration.
In general, step i has the equivalent start time Sj_i that is the solution of Giisi-!)
= Gi-iKU-!
- U-2) + Si-2\,
<»-i<*<*i
(30)
and Fi(t) = Gt[(t - U-i) + Si-!]
(31)
Thus, F(t) for the step-stress pattern is determined by connecting with segments of Cdf's Fi,F2,...,Fk as shown in Fig. 7. 2.3. Derivation
of the Property
of the
Test
(1) The Probability of Accepting H 0 We derive the probability of accepting Ho with respect to 8. First, the probability of accepting HQ at the first stage is: L1=B(a1;p1,N)
(32)
where Pi = i ? i ( t i ) - i ; , i ( t o ) = i ; i ( t i )
(33)
Second, the probability of accepting HQ at the second stage is: r\ — 1
(34) x\=ai
+l
Design of Life Tests Based on Multistage
Decision Process
149
where P2 = F2(t2) - F2(t1)
(35)
because the event of accepting Ho at the second stage is the joint event of the test is decided to be continued at the first stage and the test accepting H 0 at the second stage. Similarly, we get the probability of accepting Ho at the j t h (2 < j < k) stage as:
Lj=
n-l
r2-l-X(2)
Y.
E
ri
_i-l-X(i-l)
•••
E
x i = a i + l x2=a2+l-X(2)
b(x1;Pl,N)B(x2;p2,N)---
xj-1=aj-1+l-X(j-l)
x bixj-^pj-!,N)B(aj
- X(j);Pj,N)
(36)
where
( 3? )
X(j) = E x" and Pi = Fjtti) - Fj(tj-i)
(38)
Note that each Cdf Fj has the accelerated MTBF 9^ as a parameter. The probability of accepting Ho at the fcth (final) stage is: ri-l
ifc=
r2-l-X(2)
E
rt„!-l-X(fc-l)
E
•••
x i = a i + l x2=a2 + l-X(2)
E xk^i=aj-i
Kxi;Pi,N)B(x2;P2,N)---
+
l-X(k-l)
x 6(x fc _ 1 ;p fc _ 1 ,JV)S(a fc -X(k);pk,N)
(39)
Accordingly, the probability of accepting Ho for the MTBF 9 can be obtained as: k
L{9) = YJLj
(40)
J=I
(2) Expected Testing Time We derive ET, the expected testing time per a life test. In case that the test terminates at the first stage, the expected testing time is ET! = t i { B ( o i ; P l , N ) + [1 - B(rx - l;PlN)]}
(41)
At the second stage, we have ri-l
ET2=t2
Y,
Kx1;p1,N)B(a2-X(2);p2,N)
xi=a\ + l r i
+h
-l
Y, x1=a1+l
N-X(2)
E x2=r2-X(2)
Kxi;Pi,N)b(x2;P2,N)
(42)
150
A. Kanagawa
and H. Ohta
Similarly, we can obtain ENj for further stages. Accordingly we have fc ET
ET=Y,
J
(43)
(3) Expected Number of Failures We derive EN, the expected number of failed items per test. In case that the test terminates at the first stage, the expected numer of failures is O]
ENi=
JV
Y^x1b(x1;puN)+
^
X\ = \
xlb(xl-pl,N)
(44)
li=n
At the second stage, we have r-i-l EN
*=
r2-X(2)
Z
Z)
xi=a\ + l r i
+
(x2+X(2))b(x1;pl,N)b(x2;p2,N-X(2))
22=0
-l
E
r2-X(2)
E
(x2+X(2))b(xl;Pl,N)b(x2;p2,N-X(2))
(45)
xi=oi + l x2=r2-X(2)
Similarly, we can obtain ENj for further stages. Accordingly we have fc
EN = Y,ENi
( 46 )
J'=I
2.4. Design
Procedure
and
Example
We must determine the test plan (fc, N, ai,..., L{60)
a^, r\,...,
r^, £ i , . . . , t^) such that
>l-al
Ufii)l < P
(47)
J
are satisfied. These numbers can be chosen arbitrarily to some extent. For example, the user decides a standard observation time t* and an acceptance number at the first stage a\. Then other parameters are recursively determined using the following formula: t i -tj-i +**0(j)/0 dj = jai
>
(48)
Tj = dj + (fc - j + 1) y Note that this is merely an example for selecting parameters. An algorithm for the proposed life test is as follows:
Design of Life Tests Based on Multi-Stage
Decision Process
151
[STEP 1] Give the parameters 90, 61, a, p, a\ and A. [STEP 2] Set the initial values as k 1 — a is satisfied with the current parameters, then forward to [STEP 7 ] or else increment k by 1 and reset iV to 1. [STEP 6] Go to [STEP 3]. [STEP 7] Output current parameters. Example Assume that 60 = 1500,
0i = 300,
a = 0.05,
P = 0.10 ,
A = 2, A = 1.25, t * = 5 0 , a! = 3 . Cdf at the first stage is F^t) = Gi(t) = 1 - exp(-i/0 ( 1 ) ); 0 < t < tx. Then we have the following test plan: k =2
N = 43 ,
a\ = 3 ,
j"i = 6 ,
i 2 = 52.5,
a 2 = 6,
r2=7.
Actual risks of this plan are: a* = 1 - L(0O) = 1-78%,
P* = L{9X) = 9.71% .
expected t e s t i n g time er> , Hi Si
\
multi-stage plan
o
wx) iai) IHI) zna> zsa) 3000 35a) imi « ! ) MTBF
5<)
15 •10
:s :«)
Fig. 8. Comparison of the expected testing time.
ran
152
A. Kanagawa and H. Ohta
Figures 8 a n d 9 show t h e expected testing t i m e a n d t h e expected number of failures, respectively. For reference, in Figs. 8 a n d 9, t h e expected testing time and the expected number of failures for the normal multi-stage life test without acceleration are also shown, respectively. From Fig. 8, it is found t h a t t h e proposed test has a remarkable effect of shortening the testing time. In Fig. 9, t h e greater part of two characteristic curves overlapped each other. As for the number of failures, very little difference between t h e two tests can be found. expected number of ,„ f a i l u r e s 9 H 7 B
5 4 3
multi-stage plan
I 1
proposed plan
0 0
«»)
I00O
1500 20(11 2HD SIM
3500
4(H)
4f>(J) 5000
MTBF Fig. 9. Comparison of the expected number of failures. References Bulgren, W. G. and Hewett, J. E. (1973), "Double Sampling Tests for Hypotheses about the Mean of an Exponential Distribution", Technometrics 15, 187. Epstein, B. E. (1954), "Truncated Life Tests in the Exponential Case", Ann. Math. Stat. 25, 555. Fairbanks, K. (1988), "A Two-Stage Life Test for the Exponential Parameter", Technometrics 30, 175. Hewett, J. E. and Spurrier, J. D. (1983), "A Survey of Two Stage Tests of Hypotheses: Theory and Application", Communications in Statistics — Theory and Methods 12, 2307. Miller, R. and Nelson, W. (1979), "Optimum Simplestep Stress Plans for Accelerated Life Testing", General Electric Research & development TIS Report 79CRD262. Nelson, W. and Kielpinski, T. J. (1976), "Theory for Optimum Accelerated Life Tests for Normal and Lognomal Life Distributions", Technometrics 20, 105. Nelson, W. (1980), "Accelerated Life Testing — Step-Stress Model and Data Analysis", IEEE Trans. Rel. 29, 103. Nelson, W. (1990), Accelerated Testing: Statistical Models, Test Plans and Data Analysis, John Wiley and Sons.
Chapter 10 RELIABILITY C E N T E R E D M A I N T E N A N C E FOR LIGHT RAIL E Q U I P M E N T K E N H. K. L E U N G Light Rail Division, Kowloon-Canton Railway Corporation 55-65 Lung Mun Road, Tuen Mun, NT, Hong Kong M I N G J. Z U O Department
of Mechanical Engineering, University of Alberta Edmonton, Alberta, T6G 2G8, Canada RICHARD WHITFIELD
Megatrend Information Services Ltd, Unit 703, Charm 700 Castle Peak Road, Kowloon, Hong Kong
Center
1. Background The Light Rail Transit System is a public transportation system operated by the Kowloon-Canton Railway Corporation's Light Rail Division (KCRC-LRD). It provides passenger services to the Northwest part of the New Territories in Hong Kong and moves an average of 336 000 passengers each day. It has 99 Light Rail Vehicles (LRVs). The rapid development of the local community and population growth in the neighborhood is increasing demands for passenger services in the region. Historical data shows that the average ridership growth rate has been around 5% per year and the trend is predicted to continue into the next century [KCRC (1995)]. Rising customer demands and expectations mean that KCRC-LRD must continuously improve the safety, reliability, efficiency, and value-for-money of its transport service [KCRC (1996a)]. The reliability of the LRV equipment was improved to 65,000 km per failure in 1996 from 45,000 km per failure in 1994. Top management has set a target of 75,000 km per failure by 1998 [KCRC-LRD (1996a)]. This target needs to be achieved while serving the growing rider population and maintaining the same fleet size. The operating profit of the KCRC-LRD has been decreasing over the past few years [KCRC-LRD (1996b)], which has resulted in budget cuts by top management.
153
154
K. H. K. Leung, M. J. Zuo and R.
Whitfield
Maintenance cost is a large portion of the total operating costs, and it is rising with the fleet age. Proposed fare increases to cover increasing maintenance costs have received strong resistance from customers. The management is under pressure to make maintenance more cost-effective. Maintenance planning will become more difficult if the maintenance budget continues to decrease while satisfactory operating performance needs to be maintained. The current levels of reliability and availability of LRVs could not meet the operational requirements. Equipment failures occurred with an average of 690 breakdowns per month [Leung (1997)]. There was also an average of 159 minutes of headway delays and four LRV outages per month due to LRV equipment failures. One possible explanation was that the equipment was not maintained in the right place at the right time. Currently, the equipment was maintained according to the manufacturer's recommendations, which may not be appropriate for the unique operating conditions at KCRC-LRD. Delierlien (1991) identified some problem areas in the KCRC-LRD's maintenance activities and provided recommendations for improvements. The identified problems included (1) maintenance managers used the original equipment manufacturer's recommendations without considering the cost-effectiveness and the rationale behind these recommendations, (2) the KCRC-LRD had never conducted a systematic study of the applicability of the preventive maintenance actions performed and their relationships to the reliability of LRVs, and (3) the preventive maintenance program lacked traceability and visibility. On this basis, we began this project with the following objectives: (1) the KCRC-LRD needed to examine the current preventive maintenance practices and (2) the RCM approach should be used to determine the best balance among condition monitoring, preventive maintenance, and breakdown maintenance. 2. Reliability Centered Maintenance The premise of RCM is that a more efficient and effective life-cycle maintenance program for equipment can be developed by addressing individual component failure modes, the consequences of failures, and the actual preventive maintenance tasks to be done. The primary objective of RCM is to preserve equipment function by preserving component operation. There are several different age reliability relationships for components [Moubray (1995)]. A large percentage of components do not benefit from a limit on operating age. The traditional approach of replacing old parts during maintenance is often wasteful. Because components may have different age-reliability relationships, different maintenance tasks should be applied to different components. The following maintenance tasks were considered in our study: (a) Hard Time Replacement (HTR) — Failures due to age or usage, prior to functional failure, may be prevented by replacing or overhauling at a predetermined fixed time interval.
Reliability
Centered Maintenance
for Light Rail Equipment
155
(b) On-Condition Maintenance (OCM) — Deterioration prior to functional failure can be detected by periodic inspections and evaluations. This is only effective when a potential failure can be identified reliably and inexpensively. (c) Condition Monitoring (CM) — Degradation prior to functional failure can be detected in sufficient time by instrumentation (e.g. temperature, pressure, vibration indicators). The recorded information is used to determine the need for action. (d) Run To Failure (RTF) — A deliberate decision is made to allow the equipment or component to operate until it fails and no preventive maintenance is ever performed. Proper maintenance tasks should be selected for different failure patterns. Parameters including component lifetime, age-reliability information, the nature of failures (hidden or evident), and repair costs should be considered in the course of maintenance task selection. The complete implementation of RCM typically involves four major steps [Kuehn (1992)]: Step 1. Information on delays, outages, frequency of failures, and maintenance activities is retrieved from relevant reports and computer systems. The information is analyzed to identify the items that contribute significantly to delays and outages and have high frequencies of failures. Step 2. Perform failure mode and effect analysis of the selected items to identify failure modes and their consequences. Failure modes are categorized into safety, outage, and economic areas for resource allocation via Logic Tree Analysis. Then, prioritisation of failures in hierarchical order is required to fix failures through the calculation of Risk Priority Number in terms of severity, occurrence frequency, and ease of detection. Step 3. Apply Hard Time Replacement, On-Condition Maintenance, Condition Monitoring, or Run to Failure (RTF) on the selected equipment according to its age-reliability patterns and parameters including lifetime, nature of failures, repair cost, and downtime. Step 4. Implement RCM by denning specific maintenance task requirements, such as what must be done and how often or at what intervals to do it and integrate RCM into the current maintenance activities. 3. LRV Systems and Equipment An LRV is a kind of railroad vehicle driven by the electrical supply delivered from overhead lines. It consists of advanced mechanical, pneumatic, electrical, electronic, and computer systems with complicated interfaces. An LRV basically contains seven systems and four major equipment items. (1) Propulsion System: An LRV is equipped with two DC propulsion choppers. Each chopper controls the power requirements of its associated traction motor. It operates the traction motor in either motoring or braking modes.
156
K. H. K. Leung, M. J. Zuo and R.
Whitfield
(2) Pneumatic & Braking System: Compressed air is generated by a compressor unit and supplied to the pneumatic system. The system contains an air compressor, pressure governors, pressure valves, and isolating cocks. The pneumatic system is used to operate the train's doors, pantograph, horn, exterior mirror, air suspension, flange lubrication, sanding and braking, and wheel slide. The braking system is equipped with a brake control unit (BCU) and four springforced cylinders connected to the brakes mounted on two bogies. It is of a fail-safe design so that braking is applied when brake cylinder pressure is zero. (3) Air Conditioning System: The air conditioning system is a split type unit using R22 refrigerant. The compressor and the control cubicle with its electrical switch gears and programmable logic controllers are mounted on the underframe of the LRV. The condenser, evaporator, and their motors are mounted on the roof. The system requires 380 volts AC and 24 volts DC supplied from a static inverter. (4) Passenger Door System: An LRV has three pairs of doors, which can be operated by the driver or by passengers in case of emergency. The door system is a combination of electrical and pneumatic systems. It has a large impact on passenger safety. A speed sensor is in the system to prevent the doors from opening when the LRV is moving faster than 2 km/hour. (5) Auxiliary Power System: A Static Inverter (SIV) is mounted on the underframe of an LRV to transform the 750 volts DC supply into a three phase 380 volts AC, single phase 220 volts AC, and 24 volts DC. (6) Control and Signal System: The propulsion control and signal system is comprised of the driver's hand controller, operator annunciation, auxiliary driver's panel, Car Control Unit (CCU), Brake Control Unit (BCU), and relay rack. The hand controller generates signals for the CCU to control vehicle motion and the BCU to control the braking effort. The annunciation functions as a fault indicator whenever a fault is detected. The CCU and the BCU are the main electronic system controllers which issue commands, respond to the driver's directions, and monitor the signals from sensors throughout the vehicle. (7) Signal and Communication System: Signal and Communication systems consist of the Vehicle Information System (VIS), Radio Communication, and Public Announcement subsystems. VIS is a wireless radio system which controls the directions, routes, and positions of a vehicle through a central computer in the Operation Control Center (OCC) and transponders located in track sites. (8) Speed Sensing Equipment: Speed sensing equipment includes the digitizer, tachograph, and tachogenerator. It provides the speed signal to the systems which require speed control and indication, such as the braking system, the signal and control system, and the door system. (9) Pantograph: A pantograph is a metal frame mounted on the roof of an LRV. It is used to deliver electrical power from the overhead lines to the vehicle. (10) Coupler: The coupler provides automatic mechanical coupling of LRVS. With two cars buffing together, a rigid and slackfree connection is effected without
manual assistance, even at very low speeds and with mating parts initially out of alignment. When two LRVs are coupled, all air pipes are connected automatically, while the electrical connectors have to be coupled manually.
(11) Bogie: Bogies are swivel frames connected to the underframe of the car body by a ball-bearing slewing ring. Each bogie provides primary suspension by a chevron spring with inherent damping and secondary suspension by two air springs, which automatically adjust the carbody height for different loads under the control of air-spring levelling valves.

4. Selection of Systems and Equipment for the RCM Project

Maintenance records were retrieved from the maintenance management system (MMS) to identify critical systems and equipment that contributed significantly to maintenance costs, delays, outages, and failures. The identified systems and equipment would then be recommended for the implementation of RCM. The current MMS was installed in June 1995; thus, only the data in the new MMS were used for the selection of systems and equipment.

4.1. Preliminary Selection
The preliminary selection criterion was based on the total accumulated number of failures and the maintenance costs over a pre-selected interval. For this project, the interval from June 1995 to June 1997 was selected. The actual failure frequencies in some systems are tabulated in Table 1.

Table 1. Summary of observed and estimated failures.

System and Equipment    Number of Failures Observed    Number of Failures Estimated    Estimated
                        6-12/95   1-12/96   1-6/97     1-12/95   1-12/96   1-12/97     Trend
Coupler                 3         6         3          6         6         6           Same
Air Compressor          20        57        40         40        57        80          Up
Static Inverter         5         31        25         10        31        50          Up
Digitizer               5         22        49         10        22        98          Up
Car Control Unit        61        30        65         122       30        130         Up
Brake Control Unit      35        128       102        70        128       204         Up
Chopper                 9         31        19         18        31        38          Up
Bogie                   44        120       78         88        120       156         Up
Lighting                518       1276      857        1036      1276      1714        Up
Tachograph              16        35        8          32        35        16          Down
Passenger Doors         145       293       155        290       293       310         Up
Technograph             4         9         2          8         9         4           Down
To compare the failure trends on an annual basis, we made the assumption that failures were evenly distributed throughout the year, and used it to estimate the total numbers of failures for the systems over the whole of 1995 and 1997. The data were scaled to hide sensitive private information. An up-trend suggests that the system or equipment is deteriorating with age. The failure trends of some individual systems and equipment went up while others went down or remained at the same level throughout the two-year period. All the up-trend items deserved further attention.

Maintenance costs are another major concern to take into account in equipment selection. The maintenance costs include material costs and labor costs. The material costs take a large portion of the total maintenance costs. In addition, the labor costs not only capture the cost of fixing the failure in monetary terms but also reflect the downtime of the LRV. The material costs per failure and the labor costs per failure (in hours) for the up-trend systems and equipment are shown in Table 2.

Table 2. Maintenance costs for up-trend items.
Systems and           Material      Labor Cost   Total      Material       Hour/     Remarks*
Equipment             Cost (HKD)    (Hour)       Failures   Cost/Failure   Failure
Air Compressor        2 361 774     7 484        234        10 093         31.9      HMC, LDT
Static Inverter       725 830       3 380        302        2 403          11.2      MMC, LDT
Brake Control Unit    59 898        561          265        107            2.1       LMC, SDT
Car Control Unit      256 809       1 150        156        1 646          7.3       MMC, MDT
Chopper               2 067 126     10 950       580        3 564          18.9      MMC, LDT
Bogie                 9 173 380     35 943       482        19 032         74.6      HMC, LDT
Carbody               154 668       1 153        183        845            6.3       LMC, MDT
Miscellaneous System  884 832       5 373        5 358      165            1.0       LMC, SDT

*Legend: LMC: Low Maintenance Cost; MMC: Medium Maintenance Cost; HMC: High Maintenance Cost; SDT: Short Downtime; MDT: Medium Downtime; LDT: Long Downtime.
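The two screening computations behind Tables 1 and 2 are simple enough to automate. The following is a minimal sketch, not the authors' tooling; the counts are the scaled values from the tables, and the data structures are invented for illustration. It annualizes the half-year observations by doubling them, labels the trend, and computes a cost-per-failure figure.

```python
# Sketch of the preliminary-selection arithmetic (illustrative only).
observed = {  # item -> (6-12/95, 1-12/96, 1-6/97) scaled failure counts
    "Air Compressor": (20, 57, 40),
    "Tachograph": (16, 35, 8),
}

def annualize(half_95, full_96, half_97):
    # Failures are assumed evenly distributed over the year, so a
    # half-year count is doubled to estimate the full-year total.
    return 2 * half_95, full_96, 2 * half_97

for item, counts in observed.items():
    est_95, est_96, est_97 = annualize(*counts)
    if est_97 > est_95:
        trend = "Up"
    elif est_97 < est_95:
        trend = "Down"
    else:
        trend = "Same"
    print(f"{item}: estimated {est_95}/{est_96}/{est_97} -> {trend}")

# Material cost per failure for an up-trend item (Table 2 values):
material_cost, failures = 2_361_774, 234
print("Material cost/failure:", material_cost // failures, "HKD")
```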
The air compressor and the bogie both had a high maintenance cost (HMC) and long downtime (LDT), so they should definitely be selected for the RCM project. The remaining items, however, would go through the final selection stage to see whether they had major impacts on the operations, delays, and outages of the LRV.

4.2. Final Selection
The items selected in the previous step were then examined for their impact on the operations, delays, and outages of the LRV. In addition, the decision guidelines
were based on the function of the system and equipment, and on the failure and delay reports prepared by the Operations Department. Every system and equipment item must perform its function for the LRV to run satisfactorily, but not every item is of equal importance: some perform major functions while others perform minor functions. The equipment was therefore divided into major and minor groups. The major equipment was defined such that failures of the equipment would affect the operations of an LRV in terms of delays and outages, whereas failures of the minor equipment would not affect operations as much. The consequences of different system and equipment failures differed in their impacts on the operations, delays, and outages of an LRV. For instance, lighting had the highest failure rates but did not affect LRV operations enough to cause delays or outages; by contrast, the consequence of a bogie failure was far more serious. The operational delays and outages of the LRVs for the pre-selected interval were summarized and the equipment was categorized into the major and minor groups. Table 3 shows that failures of the major-group equipment caused interruptions of service in terms of operational delays and outages, while failures of the minor-group equipment did not. For the carbody, bells, destination units, saloon equipment, and lighting, failures would not interrupt LRV operations; therefore, these items fell outside the scope of further studies.

Table 3. Summary of LRV operational delays and outages.

System and Equipment            Group    Operational Delays          LRV Outages
                                         6-12/95  1-12/96  1-6/97    6-12/95  1-12/96  1-6/97
Air Compressor                  major    152      269      51        5        11       4
Static Inverter (SIV)           major    31       432      141       4        20       5
Brake and Brake Control Unit    major    53       167      31        2        3        1
Car Control Unit (CCU)          major    1        7        6         0        1        2
Chopper                         major    286      66       62        3        14       4
Bogie                           major    17       32       13        1        1        2
Carbody                         minor    0        0        0         0        0        0
Bell                            minor    0        0        0         0        0        0
Destination Unit                minor    0        0        0         0        0        0
Passenger Doors                 major    179      35       36        2        3        0
Saloon Equipment                minor    0        0        0         0        0        0
Lighting                        minor    0        0        0         0        0        0
Mirror                          minor    0        0        0         0        0        2
The final selection was carried out considering downtime, operational delays, and LRV outages. The systems and equipment that would directly cause delays and outages were within the scope of further study regardless of their maintenance costs or downtime. With the combined selection criteria of the preliminary and final selection stages, seven equipment items were identified for the RCM project. These items are given in Table 4: the Air Compressor, Static Inverter (SIV), Brake and Brake Control Unit, Car Control Unit, Chopper, Bogie, and Passenger Doors.

Table 4. Summary of selected items.

System and Equipment          Material   Maintenance   Operational   Operational
                              Cost*      Downtime*     Delays        Outages
Air Compressor                HMC        LDT           ✓             ✓
Static Inverter               MMC        LDT           ✓             ✓
Brake & Brake Control Unit    LMC        SDT           ✓             ✓
Car Control Unit              MMC        MDT           ✓             ✓
Chopper                       MMC        LDT           ✓             ✓
Bogie                         HMC        LDT           ✓             ✓
Carbody                       LMC        MDT           ✗             ✗
Passenger Doors               LMC        LDT           ✓             ✓

*Legend: LMC: Low Maintenance Cost; MMC: Medium Maintenance Cost; HMC: High Maintenance Cost; SDT: Short Downtime; MDT: Medium Downtime; LDT: Long Downtime.
5. RCM Implementation

In this section we focus on how the RCM tools were applied to the identified systems and equipment. The factors considered were the age-reliability relationship, component lifetime, failure modes, and repair costs for the corresponding systems and equipment. The identified maintenance actions were then integrated into the current maintenance work.

5.1. Team Work
An RCM team was established in December 1996 and fully supported by management and the shop-floor staff. The six team members had different expertise, covering electrical, mechanical, pneumatic, and braking systems; this diversity was essential to cover the equipment and systems selected. The RCM philosophy and techniques were explained to all team members, who were briefed on how maintenance must focus on safety, reliability of service, and availability of the LRVs. Team members understood the relationship between maintenance and quality, and how proper maintenance extends equipment life and curtails maintenance costs. Regular meetings were held, in which team members discussed maintenance problems and designed solutions enthusiastically.
5.2. Failure Mode and Effect Analysis (FMEA)
FMEA was used to determine the consequences of equipment failure, starting with the function of the equipment and analyzing the ways it can fail and the causes of those failures. FMEA is a process intended to identify equipment failure modes, failure causes, and, finally, the effects that may result during equipment operation. By definition, a failure mode is what went wrong, a failure cause is why the failure occurred in the first place, and a failure effect is how the failure impacts the equipment. A sample worksheet for each selected piece of equipment is shown in Fig. 1.
Fig. 1. Structure of FMEA worksheet. (A primary function leads to a functional failure, which branches into failure modes 1, 2, ..., n, each paired with its failure effect.)
Take the compressor as an example. The primary function of the compressor is to provide compressed air to the braking, pneumatic, and passenger door systems. The block diagram in Fig. 2 illustrates the components and the control mechanism of the air compressor. A failure of the compressor means that it is incapable of supplying compressed air to the above-mentioned systems. The individual failure modes can be drilled down further to locate the defective part and its influence on operation. For example, if the pressure governor setting drifts or the governor malfunctions, the air compressor frequently starts and stops the motor, which eventually burns the motor out.
Fig. 2. Air compressor block diagram. (Main elements: a three-phase 380 VAC motor and a 24 VDC pressure governor/switch; air filter and air input; screw block; oil filter, oil cooler, and oil/air separator; temperature sensor; and air output.)
5.3. Logic Tree Analysis (LTA) and Risk Priority Number (RPN)
Each failure mode was mapped into one of three distinct categories: safety related, outage related, or economic related. We also distinguished between evident and hidden failures; the decision depended on the experience of the team members. For hidden failure modes (D), each failure mode was additionally classified to identify its potential severity as D/A, D/B, or D/C (hidden failure related to the safety, outage, or economic category, respectively). A ranking scheme for the RPN was established by the team to assist maintenance task selection. A consensus was reached to set the rankings for the parameters of failure severity, occurrence frequency, and ease of detection (hidden or evident failure). These rankings were based on the experience of the team members, maintenance information, and recommendations from equipment manufacturers. Each rating was in the range from 1 to 10, with severity referring to the repair time needed. The higher the total RPN score, the higher the priority for fixing the failure. Table 5 shows the guidelines used in assigning RPN scores.

Table 5. Ranking scheme for RPN.
Scores   Severity                Occurrence               Detection
         (Repair Time in Hr.)    (Frequency per Month)    (Evident or Hidden)
1        1-2                     0-0.4                    Evident is given 5 points
2        3-4                     0.5-1.0                  Hidden is given 10 points
3        5-6                     1.1-2.0
4        7-8                     2.1-3.0
5        9-10                    3.1-4.0
6        11-12                   4.1-5.0
7        13-14                   5.1-6.0
8        15-16                   6.1-7.0
9        17-18                   7.1-8.0
10       Above 19                8.1-9.0
RPNs were obtained for all the major failure modes of the selected equipment and systems. Team members agreed to start with safety and fleetwide problems and with items with high RPN scores (i.e., greater than 11). Based on this procedure, all important failure modes were identified. Using the air compressor as an example, the following were included: pressure switch high resistance; screw block worn out; shaft seal worn out; oil separator damage; and screw block and driving motor bearing damage/noise. From this analysis, twenty-three failure modes were identified for the selected equipment and systems, and proper maintenance tasks needed to be selected for them.
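The chapter does not state how the three ratings are combined into a single RPN. Classical FMEA multiplies them, but the quoted cut-off of 11 is more consistent with a sum of the three scores, so the sketch below assumes an additive RPN; treat that combination, and the helper names, as assumptions rather than the authors' definition. It encodes the ranking guidelines of Table 5.

```python
# Hedged sketch of RPN scoring per Table 5; the additive combination
# is an assumption (classical FMEA uses severity * occurrence * detection).
def severity_score(repair_hours):
    # 1-2 h -> 1, 3-4 h -> 2, ..., 17-18 h -> 9, longer -> 10.
    return min(10, max(1, (int(repair_hours) + 1) // 2))

def occurrence_score(failures_per_month):
    bounds = [0.4, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
    for score, upper in enumerate(bounds, start=1):
        if failures_per_month <= upper:
            return score
    return 10

def detection_score(hidden):
    return 10 if hidden else 5  # per Table 5

def rpn(repair_hours, failures_per_month, hidden):
    return (severity_score(repair_hours)
            + occurrence_score(failures_per_month)
            + detection_score(hidden))

# A pressure-switch-like failure mode: ~3 h repair, ~1.5 failures/month, evident.
score = rpn(3, 1.5, hidden=False)
print(score, "-> high priority" if score > 11 else "-> lower priority")
```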
5.4. Identification of Age-Reliability Patterns and Part Damage
Age-reliability patterns can roughly be divided into two simple classes: items that may benefit from a limit on operating age and items that do not. The age-reliability classification of frequently failing items, those scoring over three points in occurrence, can be identified from maintenance records in which the times between failures can be found. Wear and tear of mechanical parts justifies a limit on operating age, while failures of electronic cards occur irregularly in time. With information provided by equipment manufacturers, maintenance records, and team members' experience, we could determine the lifetimes and reasonably judge which of the two age-reliability patterns applies. The age-reliability pattern that benefits from a limit on operating age was marked with "√"; the pattern that does not was marked with "x".

5.5. Maintenance Task Selection
The task selection road map, shown in Fig. 3, was used to structure and record the task selection process.

Fig. 3. Task selection road map. (The road map asks, in sequence: Is the age-reliability relationship for this failure known? Are there any applicable HTR tasks? Are there applicable OCM tasks? Are there applicable CM tasks? Is this a hidden failure mode? Are any of these tasks effective? Is this an economic-related failure mode? Can a design modification eliminate the failure mode or its effect? Depending on the answers, the analyst specifies HTR, OCM, or CM tasks, accepts RTF, or initiates a design modification.)

The road map was useful in helping the analyst to logically develop the
candidate maintenance tasks for each failure mode. Maintenance selection in the RCM process required that each task prevent or mitigate failure and be cost effective. The following examples demonstrate the ways to determine the appropriate maintenance tasks.

Table 6. Results of maintenance task selection, age-reliability, and lifetime.

Defective Parts                      Life Time (Month)   Maintenance Task   Age-Reliability Pattern
Air Compressor
  Pressure switch                    9-15                HTR                √
  Screw block                        > 60                RTF                √
  Shaft seal                         12-18               HTR                √
  Oil separator                      12-18               HTR                √
  Bearing                            9-15                HTR                √
Static Invertor
  Link filter capacitor              > 70                D & M              √
  Major electrical component         > 70                RTF                √
  Electronic cards                   Irregular           D & M              x
  Current and voltage transformers   60-70               OCM                √
Brake and Brake Control Unit
  B18 regulator                      Irregular           RTF                x
Chopper
  Recharging unit                    Irregular           D & M              x
  Pulse formation card               Irregular           D & M              x
  GTO control card                   Irregular           D & M              x
Bogie
  Traction motor                     > 80                HTR                √
  Gear box oil seal                  > 80                HTR                √
  Traction rod                       40-48               HTR                √
  Wheel shunt                        36-40               D & M              √
  Air bag                            > 80                OCM                √
  Chevron spring                     > 80                OCM                √
  Slewing ring                       40-48               HTR                √
Passenger Doors
  Door track                         60-70               HTR                √
  Door speed relay                   Irregular           D & M              x
  Door seal                          Irregular           RTF                x

Legend: HTR = Hard Time Replacement; RTF = Run To Failure; OCM = On-Condition Maintenance; CM = Condition Monitoring; D & M = Design and Modification; √ = benefits from a limit on operating age; x = cannot benefit from a limit on operating age.
Team members found that the pressure switch had an age-reliability pattern that could benefit from a limit on operating age. Its lifetime was around 9 to 15 months based on maintenance records. The costs of Condition Monitoring (CM) and On-Condition Maintenance (OCM) were far higher than the cost of the pressure switch itself; therefore, Hard Time Replacement (HTR) was the simplest and most effective maintenance task for the pressure switches. For link filter capacitors, the lifetime was irregular and hence there was no point in imposing a limit on operating age. Failure of electronic cards was also a hidden failure. CM and OCM were not suitable here because of the limited space in the underframe of LRVs for instrument installation and the time-consuming testing of individual electronic cards; Design and Modification (D & M) was recommended in accordance with the task selection road map. For all items selected, the combined outcomes of maintenance task selection, age-reliability patterns, and lifetimes are shown in Table 6.

5.6. Design Validation Procedures
If the maintenance tasks, including HTR, OCM, CM, and RTF, could not be applied to resolve problems effectively, redesign and modification were required. Design and Modification procedures were composed of design input, design review, design output, verification, design changes, and design validation. Design inputs consist of International Codes of Practice, statutory regulations, customer specifications, drawings, samples, and previous similar designs. All of these have to be discussed and clarified by the KCRC Design and Modification Panel prior to commencing the redesign process. The Design and Modification Panel consists of representatives from the Operations Department and the Rolling Stock Section; it is responsible for investigating design modifications and issuing approvals for long-term implementation. The design process translates design inputs into design outputs. For this project, all the maintenance activities recommended by the RCM team were reviewed and verified by the Design and Modification Panel before implementation. The panel also evaluated the progress and performance of modifications and trials; if anything went wrong, corrective actions were initiated to fix the mistakes.

5.7. Integration with Current Maintenance Activities
Limits on operating age were imposed on the 16 selected items but not on the other items. The lifetimes of the items differed over wide ranges, and electronic items had irregular lifetimes: they may fail at any time, even right after maintenance. Maintenance task requirements were defined in terms of what needed to be fixed and when to do the work. The defective parts were derived directly from the related failure modes, implying that particular equipment items must be replaced or maintained before they fail. Maintenance frequencies or intervals for the items
could be set in accordance with the lifetimes: the longer the life limit, the longer the maintenance interval, and vice versa. Equipment items or parts should be replaced before they reach their life limits, while the utilization of materials should be maximized; therefore, the maintenance intervals should match the life limits of the defective items. For items with irregular lifetimes, defects could not be predicted accurately. Hence, decisions on integration with the current maintenance activities depended upon the scope of work of the examinations and the rectification time. The timing of the maintenance actions for the selected items should match the current maintenance activities and schedules. LRV maintenance schedules were based on recommendations from the vehicle manufacturers, and inspections were conducted at different intervals. The contents of the work in each inspection differed, and the required maintenance downtime differed correspondingly. The programs were implemented with tolerances to meet the operational requirements. Maintenance downtime was a critical factor for defect rectification: if a maintenance task is integrated into an inspection with a short downtime, it is impossible to get the work done when the rectification time is longer than the downtime. Based on these maintenance criteria, opportunities were sought to incorporate the new maintenance activities into the current ones in terms of age-reliability patterns, lifetime, rectification time, scope of work of an inspection, maintenance intervals, and downtime.

5.8. Outcomes of Integration
The maintenance task requirements have been recognized: the age-reliability patterns, lifetime, defect repair time, scope of work of an inspection, maintenance intervals, and downtime. In addition, defective items on equipment with the same or similar life limits were grouped together for simultaneous maintenance. Repeated work and double handling were eliminated so as to increase maintenance efficiency and cost effectiveness. Examples were available to demonstrate the ways of integrating with current maintenance activities. The results of the maintenance task integration are shown in Table 7.
Table 7. Summary of maintenance tasks integration.

Defective Parts                      Maintenance Action   Maintenance Interval (Month)   Maintenance Program
Air Compressor
  Pressure switch                    Replacement          About 15                       Examination E
  Screw block                        Replacement          Not fixed                      Repair when failed
  Shaft seal                         Replacement          About 15                       Examination E
  Oil separator                      Replacement          About 15                       Examination E
  Bearing                            Replacement          About 15                       Examination E
Static Invertor
  Link filter capacitor              Modification         About 72                       Overhaul
  Major electrical component         Replacement          Not fixed                      Repair when failed
  Electronic cards                   Modification         About 72                       Overhaul
  Current and voltage transformers   Check & Inspection   About 72                       Overhaul
Brake and Brake Control Unit
  B18 regulator                      Adjustment           Not fixed                      Repair when failed
Chopper
  Recharging unit                    Modification         About 72                       Overhaul
  Pulse formation card               Modification         About 72                       Overhaul
  GTO control card                   Modification         About 72                       Overhaul
Bogie
  Traction motor                     Replacement          About 72                       Overhaul
  Gear box oil seal                  Replacement          About 72                       Overhaul
  Traction rod                       Replacement          About 48                       Intermediate overhaul
  Wheel shunt                        Replacement          About 48                       Intermediate overhaul
  Air bag                            Check & Inspection   About 72                       Overhaul
  Chevron spring                     Check & Inspection   About 72                       Overhaul
  Slewing ring                       Replacement          About 48                       Intermediate overhaul
Passenger Doors
  Door track                         Replacement          About 72                       Overhaul
  Door speed relay                   Modification         Not fixed                      Repair when failed
  Door seal                          Replacement          Not fixed                      Repair when failed

6. Summary and Conclusions

In this project, preliminary and final selection criteria were used to identify equipment items for the application of defined maintenance tasks. FMEA was applied to the selected equipment, and twenty-three key failure modes were identified using logic tree analysis and the risk priority number. Four maintenance tasks were selected and applied to different failure modes of different systems; design modifications were included when none of the RCM tools could be used. The new maintenance tasks were incorporated into the existing maintenance practices. Future works after this project include (1) pursuing information from equipment manufacturers and developing tailored testing procedures for electronic cards and equipment items, (2) developing procedures for maintenance staff to capture the cost and failure data of miscellaneous equipment, and (3) providing training and devising field practices that enable the maintenance staff to determine failure causes and proper remedies more objectively and effectively.
References

Delierlein, B. (1991), Repair or Replace Fleet Equipment, MSc Thesis, Department of MEEM, City University of Hong Kong, Hong Kong.
KCRC (1995), KCRC Track Record, No. 48, November 1995.
KCRC (1996a), KCRC Track Record, No. 50, January 1996.
KCRC (1996b), KCRC Annual Report, 1996.
KCRC-LRD (1996a), Rolling Stock Meeting, No. 152, 25 June 1996.
KCRC-LRD (1996b), Rolling Stock Section — Monthly Report, August 1996.
Kuehn, S. F. (1992), "RCM Trim Nuclear Plant Costs", Power Engineering 96(8), 23-28.
Leung, H. K. (1997), Reliability Centered Maintenance for Light Rail Vehicle Equipment, MSc Thesis, Department of MEEM, City University of Hong Kong, Hong Kong.
Moubray, J. (1995), "Making a Positive Contribution to Asset Management Strategy", Work Management 48(3), 14-15.
Chapter 11

INCORPORATING ENVIRONMENTAL CONCEPTS WITH TOLERANCE DESIGN OPTIMIZATION MODEL

GUANGMING CHEN
Department of Industrial Engineering, School of Engineering, Morgan State University, Baltimore, MD 21251, U.S.A.
1. Introduction

With increasing concern for environmental protection in recent years, nations face world-wide challenges in incorporating environmental concepts into traditional product design and manufacturing technologies. Indeed, all countries now, with no exception, must acknowledge a responsibility for the global environment. For instance, as part of an international agreement, the industrial usage of Chlorofluorocarbon (CFC) must be eliminated, and its production must be phased out or banned soon [Beardsley (1988), Parson (1995), Manzer (1993)]. CFC is an aerosol propellant as well as a refrigerant material, believed to affect the atmospheric ozone layer. Thus, it is not surprising that some leading manufacturers have realized the importance of reducing the potential environmental impacts of their products and have enthusiastically participated in or supported environmental programs [Jones (1989), Nordwall (1992)]. For the sake of the environment, industries should strive to extract maximal value from given resources and to reduce the industrial wastes generated in manufacturing processes. In many companies, especially small ones, design and manufacturing have lagged behind with regard to integrating environmental factors into the product design and development stages. This has remained so because it is currently the customer, rather than the manufacturer, who has to deal with the disposal of retired products. As a result, these manufacturers are not concerned about the environmental impacts of their products; their goal in design and manufacturing is achieving high quality and low cost so that the products can be competitive in the market. No doubt this concept contributed greatly to wealth creation in the past. Nevertheless, economic prosperity is also accompanied by environmental problems. Society, each year, has to deal with increasing amounts of waste from retired products
(billions of tons of industrial wastes). These wastes have adversely affected the environment: the direct cost of handling hazardous waste in the United States alone is as high as six billion US dollars per year [Weinrach and Shahinpoor (1996)]. Thus, the approach of environment-friendly design and manufacturing is increasingly important. The technology should be used in the new millennium so that economic growth remains consistent with the goal of environmental preservation and enhancement. The task of sustainable development can be accomplished by implementing these contemporary approaches in design and manufacturing practice.

During the past few decades, great progress has been made in quality and productivity improvement. This progress includes important reforms in quality design and manufacturing; one example is the application of robust design [Kackar (1985), Taguchi (1986, 1995), Bullington and Lovin (1993)]. The ultimate goal of robust design is to reduce the deviations of a system's performance from the target value and to make the performance insensitive, or robust, to variations. Philosophically, robust design has three basic steps: (1) System Design, (2) Parameter Design, and (3) Tolerance Design. Building on the tremendous body of work in the robust design area, the newer concept of integrating environmental factors has been proposed [Chen (1997a, 1997b)]. The ultimate goal is to minimize a societal loss that includes a loss to the customer, a loss to the producer, and a loss to the environment. In the following sections, we discuss the integration of environmental factors into tolerance design approaches and models.

2. Basic Concept of Robust Design

Robust design is a cost-effective method for improving quality. The objective, as proposed by Taguchi, is to simultaneously make a product or a system insensitive to manufacturing variations, usage conditions, and deterioration over time, with little or no impact on cost. In other words, robust design can produce high-quality products with inexpensive materials or parts; the quality is built in through design rather than through enhanced on-line quality control activities. Depending on the nature of the quality characteristics, products or systems can be classified as static or dynamic [Chen and Kapur (1997)]. A static system has a fixed target value for the quality characteristic, while a dynamic system has a variable target. A quality loss occurs if the performance of the characteristic cannot match the fixed or variable target. As mentioned earlier, robust design has three basic steps: System Design, Parameter Design, and Tolerance Design. The objective of System Design is to obtain a workable prototype model of the product. Except for the general dimensions, the parameters of parts and components are not determined in this step. Much of the previous and current effort in industry is concentrated in this step. In addition to developing a workable model, we also select materials, subsystems, or parts based on general strength requirements, functional requirements, and economic effects. Environmental considerations for material selection and life cycle analysis can be integrated into this step.
Parameter Design, the most important and effective step in robust design, focuses on selecting parameter settings for the design factors. In this step, engineers design a robust product by selecting the optimal parameter settings of the design factors; robustness is achieved by selecting the best parameter levels rather than by using expensive parts or components. The design can be performed by using orthogonal arrays, analysis of variance (ANOVA) [Montgomery (1997)], statistical models, and optimization. The controllable design factors are arranged into a special orthogonal array called the Inner Array, with appropriate level assignments. The uncontrollable noise factors are assigned to another special orthogonal array called the Outer Array to simulate the random variations associated with the design factors, such as those caused by manufacturing variations, deterioration over time, etc. Some user-adjustable factors, such as signal factors, are also assigned to the outer array. For each combination of the levels of the design factors (each row of the inner array), we conduct experiments or simulations under the conditions specified in the outer array and collect data. By performing ANOVA on the design factors, we can find which factors have a significant effect on the variations of the system performance or quality characteristics of the product and which have an insignificant effect. The significant factors stay in the statistical model and the insignificant factors are removed from it. The best, or robust, parameter settings of the design factors are then selected based on the statistical model and optimization techniques. The parameter settings selected by this procedure make the performance of the product insensitive to undesired variations in the parameters of the design factors, such as undesired but uncontrollable manufacturing variations due to mass production. For details, refer to Coleman and Montgomery (1993) and Khattree (1996).

Tolerance Design, at the expense of increased cost, is usually used to tighten the tolerances to reduce the variations in the product's performance. In this step, designers balance quality and cost to meet or exceed the requirements of the design specification. By integrating environmental factors, we can design tolerances or allowances that lead to a smaller societal loss. Tolerance design is performed based on the quality loss function [Chen and Kapur (1989)] and the economic effects of upgrading components, parts, or materials. Traditionally, the design aims at minimizing the quality loss by selecting the best tolerance settings; Chen and Kapur (1994) and Kapur and Cho (1996) give systematic studies of the details of tolerance design.

The basic design criterion used in robust design is the minimization of the quality loss, or the variations in product performance. As advocated by Taguchi, the quality of a product is viewed as the loss imparted by the product to society from the time the product is shipped. This is a more holistic viewpoint of quality because it relates quality to the loss incurred not just by the manufacturer at the time of production, but also by the consumer after the product is delivered. The quality activity must focus on reducing this loss. To integrate environmental factors into the paradigmatic robust design, we use a societal loss, which will be defined in
the next section, as the design criterion. The objective is to minimize the societal loss when we select the materials in the System Design step and select the best parameter settings and tolerance settings in Parameter Design and Tolerance Design. The basic application can be described by the flowchart in Fig. 1.

Fig. 1. Schematic flowchart of integrating environmental factors into robust design. (For each step the flowchart lists outcomes and activities. System Design yields the prototype model, general dimensions, and materials and processes, via idea generation, brainstorming, screening, economic analysis, and material selection with environmental factors. Parameter Design yields the optimal parameter settings and a reduction of societal loss based on reduced variations and wastes, via orthogonal arrays, ANOVA, simulation, quality and societal loss models, material and waste reduction models, and life cycle analysis. Tolerance Design yields the best tolerance setting, variation reduction, minimal societal loss, and longer life time, via loss functions, variation transfer models, economic data, and optimization.)
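As a concrete illustration of the Parameter Design step just described, the following sketch crosses a toy inner array of design-factor settings with a toy outer array of noise conditions and picks the setting with the smallest performance variance. The response function and both arrays are invented for illustration; they are not from the chapter.

```python
import statistics

# Toy inner array: candidate settings of two design factors.
inner = [(1.0, 0.5), (1.0, 1.5), (2.0, 0.5), (2.0, 1.5)]
# Toy outer array: noise offsets simulating manufacturing variation.
outer = [(-0.1, -0.1), (-0.1, 0.1), (0.1, -0.1), (0.1, 0.1)]

def response(x1, x2):
    # Invented transfer function standing in for the real product model.
    return x1 * x2 + 0.5 * x1 ** 2

best = None
for x1, x2 in inner:
    # Evaluate this setting under every noise condition of the outer array.
    ys = [response(x1 + n1, x2 + n2) for n1, n2 in outer]
    var = statistics.pvariance(ys)
    if best is None or var < best[0]:
        best = (var, (x1, x2))
print("most robust setting:", best[1], "variance:", round(best[0], 6))
```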
3. Societal Loss

Usually, reducing the variations in product quality characteristics leads to better customer satisfaction. However, this reduction may also require higher-grade parts, materials, and production facilities, which leads to a higher production cost. On the other hand, larger allowable variations adversely affect quality and decrease the competitiveness of the manufacturer [Chen and Kapur (1994), Chen (1994)]. To balance these, an optimization model can be established. To integrate the environmental factors, the objective function of this model represents a societal loss that includes a loss to the customer (LC, due to variation in product performance), a loss to the environment (LE), and a loss to the producer (LP, the cost of upgrading or tightening tolerance limits). Usually, LC is an increasing function of the tolerance limit, and LP is the tolerance control cost, which should be a decreasing function of the tolerance limits. LE, depending on the practical case, can be a constant, increasing, or decreasing function of the tolerance limits. The design variables in this optimization model are the tolerance levels associated with the design factors. As mentioned earlier, the total societal loss (denoted by TL) can be modeled as the sum of three parts: loss to the customer (LC), loss to the producer (LP) and loss
to the environment (LE). LC is reflected by the deviation from the best or target value of the quality characteristic. LP is reflected by the relative cost needed for upgrading tolerance limits. LE is reflected by the potential environmental impact of this design. Since the environmental impact is integrated into this design, an environmental weight can be introduced. Thus, TL can be given by

TL = LC + LP + αLE    (1)
In Model (1), α is the environmental weight and it should assume a positive value. The more important the environmental impact is for a design, the larger the value of α should be. The case α = 1 gives the same weight to the environmental loss as to the quality loss; the case α = 0 gives no consideration to environmental impacts, and the problem boils down to a traditional tolerance design without environmental effects. Since tolerance limits may have various environmental effects, the environmental impact can be an increasing or a decreasing function of the tolerance limit; the tolerance limit may also have no effect on the environmental impact. In the following paragraphs, we demonstrate these three cases by graphic illustrations based on one tolerance limit variable. The principle can be applied to multi-tolerance-limit design problems, where multi-dimensional surfaces should be used rather than two-dimensional curves. For simplicity, the tolerance limit is denoted by σ or by a variance σ².

(1) LE is an increasing function of σ. In this case, LE increases as σ increases. Suppose that the best tolerance level based on traditional tolerance design was σ₀. By integrating the environmental factor, the best tolerance limit should be reduced to σ* (Fig. 2). Smaller tolerance limits bring a smaller loss to the customer and a smaller loss to the environment at the expense of a higher cost to the manufacturer.

(2) LE is a decreasing function of σ. In this case, LE decreases as σ increases. Assume that the best tolerance level given by the traditional tolerance design was σ₀. By integrating the environmental factor, the best tolerance limit will be increased by a small amount to σ* (see Fig. 3). At the expense of a larger variation, looser tolerance limits bring a smaller cost to the manufacturer and a smaller loss to the environment.

(3) LE is a constant function of σ. In practice, σ may have no impact on LE; that is, LE is a constant value regardless of what value the tolerance limit assumes. Thus, LE will not affect the selection of the best tolerance limit; its effect is nothing but adding a constant value to the societal loss (see Fig. 4). In this case, integrating environmental factors will not affect the best tolerance level setting.
Fig. 2. Societal loss when LE is an increasing function of σ.

Fig. 3. Societal loss when LE is a decreasing function of σ.

Fig. 4. Societal loss when LE is a constant function of σ.
It is therefore not necessary to consider the environmental impact in such a tolerance design, and the environmental weight α can usually be set to zero.

4. Variation Transfer Process

A manufacturing process transfers raw materials, parts, components, and subassemblies into products. It is a value-added process (Fig. 5) that transfers costs into the value of a finished product. Along with this value-adding, variations associated with materials, parts, components, and subassemblies, as well as the manufacturing variations associated with the process itself, are also transferred into the performance of the products (Fig. 6). Tolerance design is used to control these variations at a reasonable cost to the manufacturer. Integrating environmental factors makes tolerance design more complicated; to facilitate the task, we can establish a generic optimization model, which will be given in the next section.

Fig. 5. Value-added design and manufacturing process. (Costs for components, parts, subassemblies, and materials, plus costs for labor, energy, machines, facilities and overhead, and shipping/handling, add up to the wholesale price or value of the product.)

Fig. 6. Variation transfer process. (Variations in incoming components, parts, subassemblies, and materials, the lower-level characteristics, combine with manufacturing variations to produce variations in the performance of finished products, the higher-level quality characteristic.)
Advanced topics on design and manufacturing technologies have been developed to address the challenges that will confront industries in the 21st century, such as environmentally conscious design and manufacturing and the integration of environmental factors into robust design [Chen (1997c, 1999), Graedel and Allenby (1996), Kalpakjian (1995)]. To meet these challenges, we will discuss tolerance design modeling and methodology integrated with environmental factors. We use the term design parameters to represent those parameters associated with components, subassemblies, parts, and/or materials. We use process parameters to represent the on-line adjustable parameters, such as the temperature setting, pressure setting, and timing of an injection molding machine. Thus, for a transfer process composed of many design parameters, variations associated with these parameters will be transferred into the performance of the products. This performance variation results in a quality loss. To reduce this variation, we can control the tolerance limits of the design parameters and the process parameters, but doing so may simultaneously increase the cost to the manufacturer. As discussed in the last section, the tolerance levels may also have environmental effects. To resolve the conflict among these objectives, we can establish an optimization model to balance quality, cost, and environmental effects. If an analytic system transfer model is known, the variation transfer model can be developed analytically; if not, empirical methods can be employed to obtain one. Assume that the process transfer model is given by

y = f(x₁, x₂, …, x_n; p₁, p₂, …, p_m)    (2)
where x₁, x₂, …, x_n are the design parameters and p₁, p₂, …, p_m are the process setting parameters. The variations in the design parameters, denoted by variances σ²_{x₁}, σ²_{x₂}, …, σ²_{x_n}, and the variations in the process parameters, denoted by variances σ²_{p₁}, σ²_{p₂}, …, σ²_{p_m}, can be transferred to the variation in the output, denoted by variance σ_y², through a variation transfer function or model (VTF). Thus, σ_y² is given as follows:

σ_y² = VTF(σ²_{x₁}, σ²_{x₂}, …, σ²_{x_n}; σ²_{p₁}, σ²_{p₂}, …, σ²_{p_m})    (3)
The variation transfer takes place according to this VTF model.

5. Tolerance Design Optimization Model

Theoretically, the cost of controlling tolerance limits is a decreasing function of the tolerance limits. It can be assumed to take a basic form: the cost of controlling the tolerance limit of design factor X_i is

C_d = a_i σ_{x_i}^{−b_i}    (4)

and the cost of controlling the variation of process factor P_j is

C_p = c_j σ_{p_j}^{−d_j}    (5)
where a_i, b_i, c_j, and d_j are positive constants. Assume the tolerance limits can be denoted by a vector x = [σ²_{x₁}, σ²_{x₂}, …, σ²_{x_n}]ᵀ and the process parameters by another vector p = [σ²_{p₁}, σ²_{p₂}, …, σ²_{p_m}]ᵀ, where the σ²_{x_i} are the variances of the design parameters and the σ²_{p_j} are the variances of the process parameters. The conceptual optimization model can be given by (see also Fig. 7):

Fig. 7. Robust design optimization with integration of environmental impacts. (The figure plots the cost for controlling variations against the tolerance limits.)
Minimize  LC(x; p) + LP(x; p) + α LE(x; p)
subject to  (x; p) ∈ (X, P)    (6)

where X is the domain of x and P is the domain of p, and VTF is the variation transfer model of the manufacturing process. As a result, the tolerance design optimization problem can also be modeled by
Minimize  K σ_y² + Σ_{i=1}^{n} a_i σ_{x_i}^{−b_i} + Σ_{j=1}^{m} c_j σ_{p_j}^{−d_j} + α LE(σ²_{x₁}, …, σ²_{x_n}; σ²_{p₁}, …, σ²_{p_m})
subject to  σ_y² = VTF(σ²_{x₁}, …, σ²_{x_n}; σ²_{p₁}, …, σ²_{p_m}),  all σ > 0    (7)

In the model, K is the coefficient of the quality loss function, which transfers the units of the quality characteristic into monetary scales. Apparently, Model (7) is a nonlinear program (NLP); it can be solved by using an NLP algorithm.
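As a rough numerical illustration (not the author's procedure), Model (7) can be handed to a generic NLP solver. The sketch below uses scipy.optimize.minimize with an invented linear VTF, invented cost constants, and an LE term that increases with the output variance; every numeric value is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

K, alpha = 10.0, 1.0
a, b = np.array([4.0, 2.0]), np.array([1.0, 1.0])   # design-factor cost constants
c, d = np.array([3.0]), np.array([1.0])             # process-factor cost constants
w = np.array([1.0, 0.5, 0.8])                       # invented linear VTF weights

def societal_loss(v):                               # v = variances of x1, x2, p1
    var_y = w @ v                                   # toy linear VTF (assumption)
    LC = K * var_y                                  # quality loss, K * sigma_y^2
    # Cost terms use sigma^(-b); with sigma = sqrt(variance), that is v^(-b/2).
    LP = (a * v[:2] ** (-b / 2)).sum() + (c * v[2:] ** (-d / 2)).sum()
    LE = 0.5 * var_y                                # LE increasing in tolerance
    return LC + LP + alpha * LE

res = minimize(societal_loss, x0=np.ones(3), bounds=[(1e-6, None)] * 3)
print("best variances:", res.x.round(4), "societal loss:", round(res.fun, 3))
```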
6. Tolerance for Outgoing Quality Characteristic

Suppose a product has a quality characteristic y. The tolerance of y before shipping to the customer can be represented by σ_y. Thus, LC can be given by K σ_y²; LP can be given by a σ_y^{−b}; and LE can be given by K_E σ_y^e, in which K_E and e are constants. Hence

TL = K σ_y² + a σ_y^{−b} + α K_E σ_y^e    (8)
where α is the environmental weight. If LE is an increasing function of the tolerance, for simplicity we assume that a quadratic function is appropriate for LE, i.e., e = 2. To find the best tolerance limit, we take the derivative of Eq. (8) with respect to σ_y and set it to zero. Thus, we obtain

σ_y* = [ab / (2(K + αK_E))]^{1/(b+2)}    (9)
If LE is a decreasing function of the tolerance, we assume LE has the same power of σ_y as LP (i.e., e = −b). Taking the derivative of Eq. (8) with respect to σ_y and setting it to zero, we obtain

σ_y* = [(ab + αbK_E) / (2K)]^{1/(b+2)}    (10)
If the tolerance limit has no environmental impact, then the power e is zero. When we take the derivative of Eq. (8) with respect to σ_y, the last term in Eq. (8) vanishes, and the best tolerance limit is given by

σ_y* = [ab / (2K)]^{1/(b+2)}    (11)

which is the same as the tolerance design given in Chen and Kapur (1994). If the environmental impact is not considered in a design, then α is zero in Eqs. (9) and (10), and the best tolerance is again given by Eq. (11).
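A quick numerical check of Eqs. (9)-(11), with made-up constants, confirms the ordering sketched in Figs. 2-4: an increasing LE tightens the optimal tolerance below the traditional value, while a decreasing LE loosens it.

```python
# Optimal tolerance sigma_y* from Eqs. (9)-(11); constants are invented.
K, a, b, K_E, alpha = 10.0, 4.0, 1.0, 3.0, 1.0

sigma_traditional = (a * b / (2 * K)) ** (1 / (b + 2))                        # Eq. (11)
sigma_le_increasing = (a * b / (2 * (K + alpha * K_E))) ** (1 / (b + 2))      # Eq. (9)
sigma_le_decreasing = ((a * b + alpha * b * K_E) / (2 * K)) ** (1 / (b + 2))  # Eq. (10)

print(f"traditional sigma*:   {sigma_traditional:.4f}")
print(f"LE increasing (e=2):  {sigma_le_increasing:.4f}  (tighter)")
print(f"LE decreasing (e=-b): {sigma_le_decreasing:.4f}  (looser)")
```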
7. Tolerance for Design/Process Parameters

In the case where tolerance design involves both design parameters and process parameters, we can establish an NLP optimization model as given by Eq. (7). The solutions of the model are the best tolerance limits for all of these design and process parameters. Solving an NLP problem may be too complicated for industrial practice. However, in industrial design the number of tolerance levels may be limited, and the optimization problem can usually boil down to the enumeration of the feasible solutions, i.e., of all feasible combinations of the tolerance levels.

8. Conclusion

Tolerance design is important for high-quality and cost-effective product and process design. The ultimate goal of incorporating environmental factors into tolerance design is to minimize the societal loss, which includes a loss to the customer, a loss to the producer, and a weighted loss to the environment. The tolerance limit may have no environmental impact, an increasing impact, or a decreasing impact. In addition, the variations in the parameters of parts, components, and manufacturing processes can be transferred to variations in the product's performance; the transfer occurs in the manufacturing conversion process, and a variation transfer function can be used to estimate the transferred variations. Thus, we can set up an optimization model to minimize the societal loss, and solving it yields the best tolerance settings for the design and process parameters. In industrial design, where the number of combinations of tolerance levels is limited, the enumeration of all feasible combinations can be used to find the optimal solutions.

Acknowledgment

This research was supported in part by a faculty research grant from the School of Graduate Studies at Morgan State University and by an appointment to the US Nuclear Regulatory Commission HBCU Research Participation Program administered by the Oak Ridge Institute for Science and Education through a cooperative agreement between the US Department of Energy and the US Nuclear Regulatory Commission.

References

Beardsley, T. M. (1988), "Ultraviolet Verdict: A Global Decline in Atmospheric Ozone Appears to be Real", Scientific American 258(5).
Bullington, R. G. and Lovin, S. (1993), "Improvement of an Industrial Thermostat Using Designed Experiments", Journal of Quality Technology 25(4).
Chen, G. and Kapur, K. C. (1989), "Quality Evaluation System Using Loss Function", in Proceedings of the 1989 International Industrial Engineering Conference, Toronto, Canada, May.
Chen, G. and Kapur, K. C. (1994), "Tolerance Design by Break-Even Analysis for Reducing Variation and Cost", International Journal of Reliability, Quality and Safety Engineering 1(4).
Chen, G. (1994), "Tolerance Design Based on Variation Transfer Function", in Proceedings of the 3rd IASTED International Conference on Reliability, Quality Control and Risk Assessment, Washington, DC, Oct. 3-5.
Chen, G. and Kapur, K. C. (1997), "A Two-Step Robust Design Procedure of Linear Dynamic Systems for Reducing Performance Variations", International Journal of Reliability, Quality and Safety Engineering 4(2).
Chen, G. (1997a), "Environmentally-Conscious Product Design by Material Reduction for Minimizing Societal Loss", Journal of Engineering Valuation and Cost Analysis 1(4).
Chen, G. (1997b), "Integrating Environmental Factors into Robust Design to Minimize Societal Loss", in Proceedings of the 4th International Conference on Manufacturing Technology in Hong Kong, Hong Kong, Nov. 30-Dec. 3.
Chen, G. (1997c), "Product and Material Recycling Policy Models for Environmentally Conscious Design and Manufacturing", in The 1997 International Conference on Industry, Engineering, and Management Systems, Cocoa Beach, FL, March 10-12.
Chen, G. (1999), "Environmentally Conscious Robust Design and Manufacturing for Sustainable Development", a chapter in Massie Book Special Elements of Environmental Engineering, to be published in 1999.
Coleman, D. and Montgomery, D. (1993), "A Systematic Approach to Planning for a Designed Industrial Experiment" (with Discussion), Technometrics 35(1).
Graedel, T. E. and Allenby, B. R. (1996), Design for Environment, Prentice Hall.
Jones, P. M. (1989), "Can the World Cooperate? (Pollution and Environment)", Scholastic Update 121(16).
Kackar, R. N. (1985), "Off-line Quality Control, Parameter Design, and the Taguchi Method", Journal of Quality Technology 17(4).
Kalpakjian, S. (1995), Manufacturing Engineering and Technology, 3rd Edition, Addison-Wesley Publishing Company.
Kapur, K. C. and Cho, B. (1996), "Economic Design of the Specification Region for Multiple Quality Characteristics", IIE Transactions 28(3).
Khattree, R. (1996), "Robust Parameter Design: A Response Surface Approach", Journal of Quality Technology 28(2).
Manzer, L. E. (1993), "DuPont Tackles the CFC Dilemma a Year Ahead of Schedule", Focus, National Center for Manufacturing Sciences.
Montgomery, D. C. (1997), Design and Analysis of Experiments, 4th Edition, John Wiley & Sons, Inc., New York.
Nordwall, B. D. (1992), "Hughes Soldering Changes to Aid Environment, Cut Manufacturing Cost (Hughes Aircraft Co.)", Aviation Week & Space Technology 136(7).
Parson, E. A. (1995), "The Complex Chemistry of the International Ozone Agreements", Environment 37(2).
Taguchi, G. (1986), Introduction to Quality Engineering, Unipub/Kraus International Publications, White Plains, NY, and American Supplier Institute Inc., Dearborn, MI.
Taguchi, G. (1995), "Quality Engineering (Taguchi Methods) for the Development of Electronic Circuit Technology", IEEE Transactions on Reliability 44(2).
Weinrach, J. B. and Shahinpoor, M. (1996), "Environmentally Conscious Design and Manufacturing: New Research Sponsored by National Science Foundation", International Journal of Environmentally Conscious Design & Manufacturing 5(1).
Chapter 12

MARKOVIAN RELIABILITY MODELING FOR SOFTWARE SAFETY/AVAILABILITY MEASUREMENT

KOICHI TOKUNO and SHIGERU YAMADA
Tottori University, Tottori-shi, Japan
1. Introduction

Software reliability is one of the most important quality characteristics, and its evaluation methods have been much discussed [Pham (1995)]. A mathematical model for software reliability measurement is called a software reliability growth model; it describes a software fault-detection or software failure-occurrence phenomenon in a dynamic environment such as the testing phase of software development or the actual operation phase. A software failure is defined as an unacceptable departure from program operation caused by a fault remaining in the software system [Yamada (1991)]. Software safety and reliability have so far tended to be treated as the same concept. However, software safety has begun to be distinguished from software reliability and to be regarded as an important quality characteristic in its own right, since software systems have come to control safety-critical systems such as nuclear power generation systems, defense systems, vehicle control systems, and so on. Software systems in unsafe states lead to fatal accidents, mishaps, and hazards, for instance financial losses or injuries to human life. Software reliability is the attribute that software systems do not incur software failures, while software safety is the attribute that software systems do not fall into hazardous conditions, whether or not the systems are performing their intended functions [Leveson (1986)]. Accordingly, software failure occurrences do not always cause problems related to safety, and software systems functioning in accordance with the specifications do not always ensure safety. Fault Tree Analysis (FTA) and Failure Mode and Effect Analysis (FMEA) are qualitative techniques for measuring and assessing software safety [Keene, Jr. (1992)]. FTA and FMEA are static analyses and effective tools for the investigation and verification of software safety in the specification and design phases. But safety-critical systems have many restrictions concerning "time", and there are limits to performing dynamic analyses
with FTA and FMEA. Quantitative methods of software safety assessment in dynamic environments have scarcely been discussed. In this paper, we develop two stochastic software safety assessment models based on existing software reliability models: a reliability assessment model with safety and an availability-intensive safety assessment model. Our attention is directed to the event that the system causes hazardous conditions randomly in operation, not to the possibility that the system falls into an unsafe state when it is down due to a software failure. These two models are formulated by Markov processes [Ross (1996)] to describe the time-dependent behavior of the software system, taking into account the software reliability growth process. The models provide a metric of software safety defined as the probability that the system does not fall into hazardous states at a specified time point. Moreover, other reliability measures are derived from the former model and availability measures from the latter. Numerical illustrations are presented to show that these models are useful for software safety/reliability measurement and assessment.

2. Reliability Assessment Model with Safety

2.1. Model Description
Referring to the model of Yamada et al. (1993), we make the following assumptions to construct the reliability assessment model with safety, taking notice of only operational states:

A1. When the software system is operating, the holding times of the safe and the unsafe state follow exponential distributions with means 1/θ and 1/η, respectively.
A2. A debugging activity is performed when a software failure occurs. Debugging activities are perfect with probability a (0 < a < 1) and imperfect with probability b (= 1 − a). We call a the perfect debugging rate.
A3. Software reliability growth occurs in case of perfect debugging. The time interval between software failure occurrences follows an exponential distribution with mean 1/λ_n, where n = 0, 1, 2, … denotes the cumulative number of corrected faults.
A4. The probability that two or more software failures occur simultaneously is negligible.
A5. Only one fault is corrected and removed from the system per activity of perfect debugging, and the debugging time is not considered.

Consider a stochastic process {X(t), t ≥ 0} representing the state of the software system at time point t. The state space of {X(t), t ≥ 0} is defined as follows [Ross (1996)]:

W_n: the system is operating safely,
U_n: the system has fallen into the unsafe state.
From assumption A2, when the next software failure occurs in {X(t) = W_n} or {X(t) = U_n},

X(t) → W_n (with probability b), or W_{n+1} (with probability a)    (1)

Further, we use Moranda's model [Moranda (1979)] to describe the software reliability growth process. That is, when n faults have been corrected, the hazard rate for the next software failure occurrence, λ_n, is given by

λ_n = D k^n    (n = 0, 1, 2, …; D > 0, 0 < k < 1)    (2)
where D and k are the initial hazard rate and the decreasing ratio of the hazard rate, respectively. Equation (2) is expressed from the point of view that software reliability depends on the debugging efforts, not the residual fault content. We do not note how many faults remain in the system. Equation (2) reflects that the faults cause software failures with high frequency in execution during the early stage of the testing or the operation phase and the hazard rate decreases rapidly as they are removed [Lyu (1996), Yamada et al. (1993)]. In this paper, we consider that the occurrences of the software failure and the unsafe conditions are independent mutually [Yamada et al. (1998)]. In fact, some faults which cause software failures may be those in connection with safety, and debugging such faults may contribute to the improvement of software safety. However, debugging activities are to correct the program so that the system functions in accordance with the specifications and do not aim at the improvement of software safety. Therefore, we assume that parameters 6 and TJ, which are related to software safety, are constant regardless of the software reliability growth process. Let QA,B(T) (A,B € {Wn,Un; n = 0,1,2,...}) denote the one-step transition probability that after making a transition into state A, the process {X(t),t > 0} makes a transition into state B by time r. The expressions for QA,B(T)'S are given as follows:
Qwn,un(r) = j^re^-^+6)T\ Qwn,wn+1{r)
Qwn,wAr)
Qun,wn+1(r)
(3)
= ^-e[l~e-^+^}
= ^ ^ [ l ~ e - ^ = ^~-[l~e-^+^}
(4)
+
^ }
(5)
(6)
184
K. Tokuno and S. Yamada QU„,W„(T)
=
-(An+^Tl
(7)
(8) where QU„,W„(T) denotes t h e case where a software failure occurs and debugging is imperfect, while Q\jn wn(T) denotes t h e case where the system returns t o t h e safe state before a software failure occurs. T h e sample state transition diagram of X(t) is illustrated in Fig. 1.
6Ax
Fig. 1. A diagrammatic representation of state transitions between X(t)'s for the reliability assessment model with safety.
Markovian
2 . 2 . Derivation
Reliability Modeling for Software Safety/Availability
of Safety/Reliability
Measurement
185
Measures
2.2.1. Distribution of the First Passage Corrected Faults
Time to a Specified Number of
Let Sn (n = 1, 2, . . . ; SQ = 0) be t h e r a n d o m variable representing t h e first passage time t o s t a t e Wn, in other words t h e t i m e spent in correcting n faults, a n d Gn{t) be the distribution function of Sn. Further, let Gi, n (£) be the distribution function associated with t h e probability t h a t n faults are corrected in the time-interval (0, t] on the condition t h a t i faults have already been corrected at time zero. Then, we obtain the following renewal equation: Gi,n(t)
= QWi,Wt * Gi>n(t) + Q\Vi,Ui * Qui,Wi * Gi,n{t) + QWi,Ui * QUi,Wi * Gitn(t) + Q\Vi,Ui * Qui,wi+1
+ Q\Vi,Wi+1 *
* Gi+i,n(t)
Gi+i:n(t)
(i = 0 , 1 , 2 , . . . , n - 1)
(9)
Applying t h e Laplace-Stieltjes (L-S) transforms [Osaki (1992)] to Eq. (9) recursively, we obtain the L-S transform of Gn (t) as
=
y
h
A
n ^ i
(1Q)
s+aX
*
where AJEEI
A
"=flT^TY. j=o
(n = 2,3,...;i J
= 0,1,2,...,n-1)
f
( U )
%
By inverting Eq. (10), we have t h e distribution function of the first passage time when n faults are corrected: Gn(t)
= P r { 5 „ < t} ra-l
= ^2A"[l-e-aXit]
(t>0;n=
l , 2 , . . . ; G 0 ( t ) = 1(f) (unit function))
i=0
(12) Equation (12) is identical t o the result obtained in Y a m a d a et al. (1993) and has no bearing on parameters 6 and 77, which are related to safety.
186
K. Tokuno and S.
Yamada
Further, the mean and the variance of Sn are given by n —1
1
E 5
l
(13)
^
(14)
t «] = E TV. =0
V
»i 5 -] = E
i=0
respectively. 2.2.2. 5^aie Occupancy Probability and Software Safety Metrics Let PA,B(t) be the conditional state occupancy probability that the system is in state B at time point t on the condition that the system was in state A at time point zero, i.e., PAtB(t)
=Pv{X(t)
= B\X(0) = A}
(A,Be{Wn,Un;n
= 0,1,2,...})
and it is denoted that Pw„{t) = P\v0,wn(t) obtain the following renewal equations:
and Pun(t)
(15) = P\v0,un{t)-
Pwn(t)=Gn*PWn,wAt)
Then, we
(16)
= e - ( A " + ^ + QWn,wn * Pwn,wn(t)
Pwn,wn{t)
+ Qwn,un * Qun,wn * Pwn,wn (t) + Qwn,un * Q*un,wn * Pwn,Wn(t)
( 17 )
By applying the L-S transforms to Eqs. (16) and (17), we obtain the L-S transform °f P\vn (t) as P
f-\
fwn(s) -
s(s + Xn+rj) (s +
^
+ e+
^
+ aK)
aA
Yj
ll
s +
»
^
do)
u«;
By inverting Eq. (18), we have Pw„(t) as Pwn{t) = Pr{X(t)
= Wn} n
= Bne~{\„
+e+n)t + S^B?e~aXit j=0
(n = 0,1,2,...)
(19)
Markovian Reliability Modeling for Software Safety/Availability
Measurement
187
where constant coefficients B n and B™ are given by Bn =
u n
, ,llj=.°
\
r
(20)
(A„ + 6> + 7j - a\i) [ ] j=o (Aj - Aj) respectively, and we postulate n7=o ' = 0 j=o • = 1- It is noted that
B° + B° = 1 (22)
B" + J > ? = 0
(n=l,2,...)
i=0
The following equation holds for arbitrary time point t: oo
'Z,[PwAt) + PUn(t)} = l
(23)
n=0
In this section, the software safety is defined as oo
S1(t) = £ i V „ ( t )
(24)
n=0
which represents the probability that the system does not fall into any unsafe states at time point t. Software unsafety is defined as oo
Ui(t) = Y , P ^
( 25 )
n=0
which represents the probability that the system falls into unsafe states at time point t. Using Eq. (23), we get
oc
= l-J2Pwn(t)
(26)
71 = 0
2.2.3. Distribution of the Time between Software Failures Let Xi (I = 1, 2,...) be the random variable representing the time interval between the (Z — l)-st and the Zth software failure-occurrences, and $j(x) be the distribution function of Xi. It is noted that Xi depends on the number of the faults corrected up
188
K. Tokuno and S.
Yamada
to the (/ — l)-st software failure-occurrence, which is not explicitly known. However, the random variable C;, which represents the number of the faults corrected up to the (/ — l)-st software failure-occurrence, follows a binomial distribution having the following probability mass function: Pr{Cj
= t } = (l~il\i*>l~1~i
(* = 0 , 1 , 2 , . . . , Z - 1 )
(27)
where ( T1) = (I — l)!/[(/ — 1 — i)li\] denotes a binomial coefficient. Further, from assumption A3, Pr{Xt < x\Ci = i} = 1 - e~XiX
(28)
Accordingly, we can get the distribution function for Xi as $J(I)
= Pi{Xt < x}
= Y, Pr{*< < x\Ci = i} • Pr{Ci = t} i=0 l-l
= J2 C • 1 ) a ^ ' _ 1 ~ i [ l - e-XiX] i=o ^
l
(29)
'
We then have the reliability function for Xi as Rt(x)
= PT{XI
>
x}
= Sr~ 1 ) a<& '" 1_<e_A ' X i=0
^
%
(30
^
'
The expectation of Xi, i.e., mean time between software failures (MTBSF), is defined by /•OO
E[X/] = / Jo
Ri(x)dx
(31)
Epf,, . M ± f l ^
(32)
From Eq. (30), we can derive MTBSF as
Rl(x) and E[Xi] also have no bearing on parameters 0 and TJ. 3. Availability-Intensive Safety Assessment Model In this section, we describe the behavior of a software system which alternates between operable and inoperable states in consideration of software safety.
Markovian
3.1. Model
Reliability Modeling for Software Safety/Availability
Measurement
189
Description
The following assumptions are made for availability-intensive safety assessment modeling: Bl. When the software system is operating, the holding times of the safe and the unsafe state follow exponential distributions with means 1/6 and 1/r], respectively. B2. The software system breaks down and starts to be restored as soon as a software failure occurs, and the system cannot operate until the restoration action is complete. B3. The restoration action implies the debugging activity and software reliability growth occurs if a debugging activity is perfect. B4. The debugging activity is perfect with probability a (0 < a < 1), while imperfect with probability b(= 1 — a). A perfect debugging activity corrects and removes one fault from the system. B5. When n faults have been corrected, the next software failure-occurrence timeinterval and the restoration time follow exponential distributions with means 1/An and l//n n , respectively. B6. The probability that two or more software failures occur simultaneously is negligible. B7. The restoration actions are performed in safe states. The state space of the process {X(t), t > 0} representing the state of the software system at time point t is denned over again as follows [Yamada et al. (1998)]: Wn: the system is operating in a safe state, Un: the system is operating in an unsafe state, Rn: the system is inoperable and restored, where n = 0,1, 2 , . . . denotes the cumulative number of faults corrected during the operation phase. From assumption B4, when a restoration action is complete in {X(t) = Rn},
{
Wn
(with probability b)
Wn+i (with probability a) Similar to the preceding section, we use Moranda's model to describe the software failure-occurrence phenomenon, i.e., the hazard rate An is given by \n=Dkn
(n = 0 , l , 2 , . . . ; D > 0 , 0 < k < 1)
(34)
Furthermore, we describe the restoration characteristic. Generally, the faults detected later tend to have higher complexity; i.e., it takes more time to isolate the faults and to check the fault-correction [Nakagawa and Takenaka (1987),
190
K. Tokuno and S.
Yamada
Tokuno and Yamada (1997)]. Accordingly, it is appropriate that the restoration time becomes longer with increasing n. Then, we assume fxn as follows: /in = Ern
(35)
(n = 0 , 1 , 2 , . . . ; E > 0,0 < r < 1)
where E and r are the initial restoration rate and the decreasing ratio of the restoration rate, respectively. The expressions of one-step transition probabilities QA,B(T)'S (A,B £ {Wn, Un, Rn] n = 0,1, 2,...}) are given as follows: Qwn,un{T -
Qwn,Rn(T
K +v
[1
(36)
_P-(A„+9)T
(37)
-(A„+7j)T
(38)
[l-e
An X . g - C A n + ^ r i A„ +V
Qun,Rn(r +
A„ ^" V
Qon,wn(r
QRn,Wn
,-(A„+0)-H
An + 0 l
i{T
QRn,Wn(l~
(39)
= a(l-e-^r)
(40)
= 6(1 - e-"" r )
(41)
The sample state transition diagram of X(t) is illustrated in Fig. 2.
GAr
a/it AT
AO4T
fln-nAT
AnAT
AI4T
Fig. 2. A diagrammatic representation of state transitions between X(t)'s intensive safety assessment model.
3.2. Software
Availability/Safety
AA+I4T
for the availability-
Analysis
3.2.1. Distribution of the First Passage Time to the Specified Number of Corrected Faults Let Tiii+i and Giti+i(t) be the random variable representing the time spent in making a transition from state Wi to state Wi+\ and the distribution function of Ti,i+i, respectively. Then, we obtain the following renewal equations:
Markovian
Reliability Modeling for Software Safety/Availability
Gj,t+i(£) =
Measurement
191
Hw.tRi*Qfiitwi+1(t) +
Hwi,Ri*QRi,Wi*Gi,i+l(t)
(42) HWi,Ri(t)
= QWi,Ri{t)
+ QWi,Ui *
+ Q\Yi,Ui *Qui,Wi
QuitRi(t)
*HWi,Ri(t)
(i = 0 , 1 , 2 , . . . ) ,
where Hw^Ri (t) represents the probability that the process X(t) makes a transition from state Wi to state Ri in an amount of time less than or equal to t. Substituting the L-S transforms of Eqs. (36)-(41) into that of Eq. (42) yields %iVi
iW
{s + Xi)(s + yi)
(43)
where
\ = ^
X i
±
+ ^)
\ / ( A i + Mi) 2 - 4 a A i / X i ]
(double signs in same order)
(44)
Recall that Sn and Gn{t) (n = 1,2,...; So = 0) denote the random variable representing the time spent in correcting n faults and the distribution function of Sn, respectively. Then, the following relation are obtained:
(45)
&n — / _, J-i,i+l i=0
Noting that Ti^+i's are mutually independent, we can get the L-S transform of Gn(t) as
Gn(s) =
l[Gi,i+1(s) i=0
n
i=0
XiVi
{s + Xi)(s + yi)
ST^ I ^n,iXi i=0
where constant coefficients A\
i
and A\
s + Xi i
,
An,iVi
s + yt
are given by
(46)
192
K. Tokuno and S.
Yamada
=
Uj=o xiVi
^ i =
Uj=o^^
Ki
(z = 0 , l , 2 , . . . , n - l )
(47)
rrn-1 71,1
r r n - l
/
(z = 0 , l , 2 , . . . , n - l )
\ i-TTl — 1 /
yi 1IJ=O (%• - 2/i) IIj=o w
\
V
'
'
'
'
(48) /
V
/
- y<)
respectively. By inverting Eq. (46), we have the distribution function of Sn as G„(<) = Pv{Sn < t} n-l
= l-^«ie-^+
(n = l , 2 , . . . ; G 0 ( t ) = l(t))
(49)
where we postulate nj'=o • = 1 for n = 1. It is noted that Eq. (49) has no bearing on parameters 6 and r\ and that
£«*+<*) =!
(n=l,2,...)
(50)
t=0
Further, the mean and the variance of Sn are given by
**J-£(£ + ;;)
(51)
respectively. 3.2.2. State Occupancy Probability Here, we derive the probability that X(t) is in the respective states. Recall that PA,BV)(A,
B £ {Wn, Un, Rn; n = 0,1,2,...})
denotes the conditional state occupancy probability that X(t) is in state B at time point t on the condition that X(t) was in state A at time point zero and Pwn(t) = Pw0,w„(t), Pun(i) = Pw0,un(t), and PRn(t) = Pwo,Rn(t) denote the state occupancy probabilities that X(t) is in states Wn, Un, and Rn at time point t, respectively.
Markovian Reliability Modeling for Software Safety/Availability Measurement 193
At first, we obtain the following renewal equations with respect to F*wn(t): (53)
Pwn{t) = Gn*PWn,wn{t) Pwn,wn{t)
= e~{Xn+6)t
+ Qwn,un * Qun,wn * Pwn,wn(t)
+ Q\Yn,Rn
* QRn,Wn *
+ Q\Vn,Un * Qun,Rn
Pwn,Wn(t)
* QRn,Wn * P\Vn,Wn (t)
(54)
Then, the L-S transform of Pwn (t) is obtained as
PwM
n
S(S + \„ + T))(S + fj.n)
\ j
XiUi
(s + Xn + 9 + r])(s + xn)(s + yn) A l ( s + Xi)(s + y.)
(55)
By inverting Eq. (55), we have Pwn{t) as PWn(t)
= Fr{X(t) = Wn} = B°ne-^+0+^
+ ^2(Bitie-^
+ Bite-**)
(n = 0,1, 2,...)
(56)
i=0
where constant coefficients £?„, B\ i, and B\ 6i,in Xn e Vm X B^„,:x :: ~ ^ ^ n x
nLo( J ~ n-o-v)(yj
Bl
(A„ + r] - Xi)(fin
4
are given by
,
-K-o-v)
(n = 0 , l , 2 , . . . )
- Xi) n " = o ^JJ/J
(57)
(i = 0 , l , 2 , . . . , n )
n
(A„ + e + j] - x^ Y\ j=o (XJ - x^ YYj=o(yj - xi)
(58)
2 B,n.i
(An + r) - yj){nn - y%) n"=o xiVi (An + 0 + 77 - yi) n " = o (yj ~ Vi) U"=o(xj - Vi
(i = 0 , l , 2 , . . . , n )
3^i
(59)
respectively. It is noted that
(60)
£„ + £ « i i=0
+ <*)=0
(n=l,2,...)
194
K. Tokuno and S. Yamada
Next, we obtain the following renewal equations with respect to Pfin(i): PRn{t) = Gn * HWn,Rn
* PRn,Rn(t)
PRn,Rn(t) = e~" n t + QRn,wn * HWnMn
(61) * PRn,Rn{t)
(62)
From Eqs. (61) and (62), the L-S transform of PRn(t) is obtained as aX n PRn(s) = — • , ^ • Gn(s) a/j,„ {s + xn)(s + yn)
= -^Gn+1(s)
(63)
a/j,n
Then, we have PR„ (t) as PRn(t) = Pr{X(t)
= Rn}
= —gn+i(t) afin
(n = 0,1,2,...)
(64)
where gn(t) denotes the probability density function of Sn, i.e., gn(t) = dGn(t)/dt. Considering the stochastic process {Y(t),t > 0} representing the cumulative number of faults corrected up to time t, we have the following equivalent relation: {Y(t) = n} *> {X(t) = Wn) U {X(t) = Un} U {X(t) = Rn} Further, since {Y(t),
(65)
t > 0} is a counting process, {Sn
{Y(t) > n}
(66)
then, Pn(t) = Pv{Y(t) = n} = Gn{t) - Gn+l{t)
(n = 0,1,2,...)
(67)
Therefore, we have P\jn (t) as Pun(t) = Pi{X(t)
= Un}
= Gn(t) - Gn+l(t)
- PWn(t) - PRn(t)
(n = 0,1,2,...)
3.2.3. Software Safety and Availability The following identical equation holds for arbitrary time t:
(68)
Markovian
Reliability Modeling for Software Safety/Availability
£
[Pwn (t) + Pun (t) + PRn (t)] = 1
Measurement
195
(69)
n=0
In this section, the software safety is defined as oo
S2(t) = Y/[Pwn(t) + PRn(t)}
(70)
71 = 0
which represents the probability that the software system does not fall into any unsafe states at time point t. The instantaneous software availability [Tokuno and Yamada (1997), Yamada et al. (1998)] is defined as oo
A(t) = £ i V „ ( < )
(71)
Tl = 0
which represents the probability that the software system is operating safely at time point t. Further, the average software availability is defined as
i r* Aav(t) = - / A(x)dx (72) t Jo which represents the ratio of the amount of time when the system is operating safely to the time-interval (0, t]. 4. Numerical Examples Using two software safety assessment models discussed above, we show numerical illustrations for software safety and reliability measurement. To begin with, we show numerical examples of the reliability assessment model with safety. The distribution functions of the first passage time to the specified number of corrected faults, Gn(t) in Eq. (12) are shown in Fig. 3 for various perfect debugging rates, a, where n = 5, D = 0.1, and k = 0.8. We can see that the smaller the perfect debugging rate becomes, the more difficult it is to correct faults. The software safety metrics, Si(t) in Eq. (24) for various values of 0 are shown in Fig. 4, where D = 0.1, k = 0.8, a = 0.9, and r/ = 0.1. Figure 4 indicates that the software safety becomes larger as 9 decreases. Si(t)'s are shown in Fig. 5 for various values of k, where D = 0.1, a = 0.9, 0 = 0.01, and r) = 0.1. 5i(t) converges to r)/(8 + rj), which denotes the steady probability that the system is operating safely in the case where software failureoccurrences are not considered. Figure 5 indicates that the software safety converges earlier with decreasing k. Smaller k means that software reliability growth occurs more rapidly. Since this model assumes that the system is not unsafe in causing a software failure, the software safety becomes larger with increasing k, which means the high frequency of software failure-occurrences.
196
K. Tokuno and S.
Yamada
G5(t)
0.6
0.4 0.2
50
100
150
200
250
300
Time Fig. 3. Dependence of perfect debugging rate a on G$(t) (D — 0.1, fc = 0.8).
?i(0 e=o.oi 0.9 '
0.8
0.7
0.6
200
400
600
Time Fig. 4. Dependence of 0 on Si(t) (D = 0.1, k = 0.8, a = 0.9, T) = 0.1).
400
600
300
Time Fig. 5. Dependence of k on Si(t)
(D = 0.1, a = 0.9, 9 = 0.01, rj = 0.1).
Markovian Reliability Modeling for Software Safety/Availability Measurement
197
T h e reliability functions, Ri(x) in Eq. (30) for various values of / are shown in Fig. 6 and the values of M T B S F are shown in Table 1. T h e figure and t h e table show t h a t a software reliability growth occurs whenever a software failure occurs.
Fig. 6. Dependence of number of failures / on Ri(x) (D = 0.1, k = 0.8, a = 0.9).
Table 1. MTBSF E[X(] (D = 0.1, k = 0.8, a = 0.9). 1
E[*i]
1 2 3 4 5 6 7 8 9 10
10.00 12.25 15.01 18.38 22.52 27.59 33.79 41.40 50.71 62.12
Next, numerical illustrations of the availability-intensive safety assessment model are shown. Figure 7 represents an example of the time-dependent behaviors of s t a t e occupancy probabilities, Pw„(t), PR„(t), and Pun{t) where a = 0.9, D — 0.1, k = 0.8, E = 0.2, r = 0.9, 6 = 0.01, and T] = 0.1. Figure 8 shows t h e dependence of a on the software safety, 52(f) in Eq. (70) where D = 0.1, k = 0.8, E = 0.2, r = 0.9, 9 = 0.01, and rj = 0.1. This figure indicates t h a t t h e software safety decreases with increasing a. This reasoning is t h e
198
K. Tokuno and S.
Yamada
IfuAt) n \^ >^
EP« /O n ^^_
>> •4—"
r^5 0 . 6
' o °- 4
n
S-H
OH 0.2
0
100
200
300
400
500
600
700
800
Time Fig. 7. Behaviors of state occupancy probabilities (a = 0.9, D = 0.1, k = 0.8, E = 0.2, r = 0.9, 6» = 0.01, r/ = 0.1).
52(0
0.98
0.96
0.94
0.92
0
100
200
300
400
500
600
700
800
Time Fig. 8. Dependence of a on S2(t)
(D = 0.1, k - 0.8, E = 0.2, r = 0.9, 6 = 0.01, ?j = 0.1).
same as in the case of Fig. 5 since larger a means that software reliability growth occurs more rapidly. Figures 9 and 10 represent the instantaneous software availability, A(t) in Eq. (71) and the average software availability, Aav(t) in Eq. (72) for various values of a, respectively, where D = 0.1, k = 0.8, E = 0.2, r = 0.9, 6 = 0.01, and rj = 0.1. A(t) and Aav(t) drop rapidly immediately after operation and improve gradually with the lapse of time. These figures also tell us that a system has higher availability with increasing a.
Markovian
Reliability Modeling for Software Safety/Availability
Measurement
199
0.85
0.75
0.65 •
0.6 0
100
200
300
400
500
600
700
800
Time Fig. 9. Dependence of a on A(t) (D = 0.1, k = 0.8, E = 0.2, r = 0.9, 9 = 0.01, n = 0.1).
Aav(t)
0.75
0.7
0.65 0
100
200
300
400
500
600
700
800
Time Fig. 10. Dependence of a on Aav(t)
(D = 0.1, k = 0.8, E = 0.2, r = 0.9, 6 = 0.01, n = 0.1).
5. Concluding Remarks In this paper, we have proposed two software safety assessment models: the reliability assessment model with safety, and the availability-intensive safety assessment model. These have considered the random occurrences of hazardous conditions in system operation. The stochastic behaviors of the software system in dynamic
200
K. Tokuno and S. Yamada
environments have been described by Markov processes, involving t h e software reliability growth process. Several software safety/reliability assessment measures have been derived from these two models and the numerical examples of software safety/reliability measurement have been illustrated. These models can provide quantitative measures of software safety, which have scarcely been proposed so far. In particular, it is very meaningful t h a t this work suggests to enable quantitative assessment of simultaneous software safety and software availability, which are t h e customer-oriented quality characteristics. Generally, it is difficult t o estimate t h e p a r a m e t e r s related t o safety, 6 a n d r\. Reasonable estimation of model p a r a m e t e r s remains a future problem. Further, these models handle software safety and reliability factors independently. However, t h e r e may be cases where these two factors have a strong relation. It is an interesting problem t o construct models correlating safety factors with software reliability growth.
Acknowledgment This work was supported in p a r t by a Grant-in-Aid for the Scientific Research of t h e Ministry of Education, Science, Culture and Sports of J a p a n under G r a n t No. 10680431, in part by t h e Telecommunications Advancement Foundation, and in part by t h e Yazaki Memorial Foundation for Science and Technology. References Keene, Jr. S. J. (1992), "Assuring Software Safety", in Proc. Annu. Reliability and Maintainability Symp., Las Vegas, U.S.A., 274-279. Leveson, N. G. (1986), "Software Safety: Why, What, and How", ACM Computing Surveys 18(2), 125-163. Lyu, M. R. (ed.) (1996), Handbook of Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA. Moranda, P. B. (1979), "Event-Altered Rate Models for General Reliability Analysis", IEEE Trans. Reliability R-28(5), 376-381. Nakagawa, Y. and Takenaka, I. (1987), "Error Complexity Model for Software Reliability Estimation", (in Japanese), Trans. IEICE D-I J74-D-I(6), 397-386. Osaki, S. (1992), Applied Stochastic System Modeling, Springer-Verlag, Heidelberg. Pham, H. (ed.) (1995), Software Reliability and Testing, IEEE Computer Society Press, Los Alamitos, CA. Ross, S. M. (1996), Stochastic Processes, 2nd Edition, John Wiley & Sons, New York. Tokuno, K. and Yamada, S. (1997), "Markovian Availability Measurement and Assessment for Hardware-Software System", Int. J. of Reliability, Quality and Safety Engineering 4(3), 257-268. Yamada, S. (1991), "Software Quality/Reliability Measurement and Assessment: Software Reliability Growth Models and Data Analysis", J. Information Processing 14(3), 254-266. Yamada, S., Tokuno, K. and Osaki, S. (1993), "Software Reliability Measurement in Imperfect Debugging Environment and Its Application", Reliability Engineering & System Safety 40(2), 139-147.
Markovian Reliability Modeling for Software Safety/Availability Measurement 201 Yamada, S., Tokuno, K. and Kasano, Y. (1997), "Quantitative Assessment Models for Software Safety/Reliability", (in Japanese), Trans. IEICE A J80-A(12), 21272137; Electronics and Communications in Japan, Part 2, 81(5), 1998, 33-43.
Chapter 13 G R O U P CONTROL CHARTS W I T H VARIABLE S T R E A M A N D S A M P L E SIZES K. T . LEE Manufacturing Technology Laboratory Institute for Advanced Engineering 633-2, Koan-ri, Baegam-myon, Yongin, Kyonggi-do 44^-800,
Korea
D . S. B A I Department of Industrial Engineering Korea Advanced Institute of Science and Technology Gusung-dong 373-1, Yusung-gu Taejon 305-701, Korea S. H. H O N G Department of Industrial Engineering Chonbuk National University, Chonju, Chonbuk 561-756,
Korea
1. Introduction Control charts are widely used to determine whether the process is in-control, to bring out-of-control process into in-control, and to monitor the process to make sure that it stays in-control. The conventional practice in maintaining control charts is to take samples of fixed size with a fixed time interval between samples. The resulting control charts are called fixed sampling rate (FSR) control charts. In recent years, however, variable sampling rate (VSR) control charts, in which sampling intervals and/or sample sizes are allowed to vary based on the values of the preceding sample statistics, have received much attention. The rationale of the VSR control charts is that the sampling rate should be high if a sample statistic shows some indication of a process change, and low if there is no such indication. See, for example, Reynolds et al. (1988), Runger and Pignatiello (1991), Amin and Letsinger (1991), Amin and Miller (1993), Prabhu et al. (1993, 1994), Costa (1994, 1997), Baxley (1995) and Prybutok et al. (1997). All these studies show that the VSR control charts are substantially quicker than the FSR control charts in detecting shifs in the process. Manufacturing processes often have multiple sources or streams of output. For example, a machine may have several heads, with each head producing supposedly
203
204
K. T. Lee, D. S. Bai and S. H. Hong
identical units of product. In such situations, one can use separate control charts on each stream. This approach, however, usually results in a prohibitively large number of control charts. One method of overcoming this problem is to use group control charts (GCCs) in which samples are drawn from all streams but only a few of them are plotted on the charts. For instance, when 20 streams are monitored simultaneously, we plot only the largest and smallest of the 20 sample means observed at any sampling point. If these means fall between lower and upper control limits, all the other means will also lie inside the limits. In this manner, the use of GCCs significantly reduces administrative costs. See, for example, Boyd (1950), Nelson (1986), Mortell and Runger (1995), Nelson and Stephenson (1996) and Runger et al. (1996). GCCs are usually run with samples of size 1 due to the cost of sampling and testing. However, control charts with small sample size are generally insensitive to the changes in process parameters. Therefore, in these situations, if we select a large portion of streams and take samples of size n > 1 when there is some indication of a process change and select a small portion of streams and take samples of size n = 1 when there is no such indication, then the detection ability of the charts can be significantly improved with no increase in sampling and testing costs. For clarity, the number of streams selected for sampling will be called stream size. In this paper, we propose variable stream and sample size (VSSS) GCCs in which both stream and sample sizes are allowed to vary based on the values of the preceding sample statistics. In the next section, a description of the VSSS GCCs is presented, and formulas for evaluating the properties of the charts are given in Sec. 3. In Sec. 4, the proposed VSSS GCCs are compared with the traditional fixed stream and sample size (FSSS) GCCs. Notations which will be used throughout the paper are provided in App. A.l for reference purposes. 2. Variable Stream and Sample Size GCCs Consider a process which consists of M sources or streams of output where the distribution of the observations on each stream is normal with mean fij, j = 1 , . . . , M and variance a2. A group control chart is maintained to detect shifts in fj,j's from a common target value ^o- Suppose that M, streams are selected for sampling and that samples of size rij are taken from each of the selected streams. Let tpi be the set of selected streams at the ith sampling point and let Xtj be the sample mean for the jth stream. When Zi{Mi,rn) = min je ^ i {y / n7(Xj : , — fi0)/^} and Z2(Mi, i%i) = maxj 6 ^{y / n7(Xj J — fxo)/cr} are plotted on the chart, a signal is given and a search for assignable causes is triggered if Zi(Mi,rii) or Zi{Mi,rii) falls outside control limits ±.k\. In the traditional FSSS GCCs, samples are taken from all streams and the sample size rii is fixed, say n. The FSSS GCCs are usually run with n = 1, since the cost of sampling and testing can be high even if one uses moderate sample sizes, i.e., n = 4 ~ 5. However, the use of small sample sizes make the charts insensitive to the changes in process parameters. Table 1 shows the expected number of sampling
Group Control Charts with Variable Stream and Sample Sizes
205
Table 1. Expected number of sampling points required to signal by FSSS GCC with n = 1 (false alarm rate = 1/350).
5
M =5
10
20
0.5 1.0 1.5 2.0 2.5 3.0
253.816 104.970 35.531 13.087
312.485 190.796 75.020 26.093
5.736
288.263 145.725 52.043 18.435 7.577
3.031
3.745
4.691
10.125
points required to signal by the FSSS GCC with a false alarm rate of 1/350 when one of /Zj's has shifted to ^Q + 5a. One method of overcoming this problem is to adopt a variable sampling rate scheme. In VSSS GCCs, stream and sample sizes are varied based on the values of the preceding sample statistics. In this paper we assume that the VSSS GCCs use two stream sizes, M^ and M( 2 ), and two sample sizes, n^ and rt(2), where 2 < M(2) < Af(!) < M, and n^2) < n(i)> s m c e it keeps the complexity of VSSS schemes to a reasonable level. Let the region between the two control limits be partitioned into two subregions as follows; h = h =
(-ki,-k2]u[k2,k1) {-k2,k2)
where 0 < k2 < k\, and let Is be the region outside of the control limits. Then the decision of the proposed charts at the ith sampling point is (D Signal,
if Zi(Mi,m)
e Is
(2) U s e ( M i + 1 , n i + i ) = (M{2),n{2)),
\i,Z1{Muni)
e I2 and Z2{Mi,rn)
(3) Use (Mi+i,ni+i)
otherwise.
= (M( 1 ),ri( 1 )),
or Z2(Mi,ni)
e Is , e I2
Since no sample is taken at the start of the process, the first stream and sample sizes cannot be determined by the above decision rule. However, we assume that Mi = M(i) and n\ = n^) since it gives additional protection against problems that arise during start-up. See Lucas and Crosier (1982) for a further discussion on this point. 3. Properties of the VSSS GCCs The properties of VSSS GCCs are determined by the number of sampling points and the number of observations taken from the process until a signal is given. Define the average number of sampling points to signal (ANSPS) to be the expected number of sampling points from the start of production until the chart signals. Similarly, define the average number of observations to signal (ANOS) to be the expected
206
K. T. Lee, D. S. Bai and S. H. Hong
number of observations from the start of the process to a signal. For the FSSS GCCs the ANSPS is the well known average run length (ARL) and the ANOS is just nxANSPS, but for the VSSS charts both the stream and sample sizes are allowed to vary and thus the ANOS cannot be obtained from the ANSPS alone. The ANSPS should be large when the process is in-control so that the false alarm rate is low and small when the process is out-of-control so that the detection of process changes is quick. Before obtaining the formulas for ANSPS and ANOS, we first derive some probabilities. Let gr(zi, Z2; Mi,rii) be the joint probability density function (pdf) of Zi(Mi, rii) and Z2(Mi, rii) when the number of streams with shifted mean is r. Then
gr{zi,z2;Mi,ni)=
>
/ ,„ \
' f{zuzi\l)
(2)
where f(zi,z2\l) { Mi{Mi - 1)A0B™'-2C0,
if Z = 0
(Mi - l)£? 0 M '- 3 [A)£oCi + AiBoCo + (Mt - 2)A0B1C0], (Mi - ^AoB^-^B^llBoCi = <
+ (Mi-l-
+lA1B^-l-1Bif2[(Mi (Mi -
lJBxC]
- l)BiC0 + (l- l)B 0 Ci],
1)B^-3[,4OBICI
if./ = 1
+ A1B1C0 + (Mt - 2)A1B0C1],
{MiiMi-^AxB^Ci,
if 2 < I < Mi - 2 if I = Mt - 1 if I = Mi
A0 = 4>(Zl),Ai = <j>(zi - SifnTi), B0 = $(z 2 ) - $(z{),Bx
= $(z 2 - 5y/n~)
- $(21 - (z2), Ci = (/>(z2 - 5y/nl), and <£(•) and $(•) are the standard normal pdf and cumulative distribution function (cdf), respectively. A detailed derivation is given in App. A.2. Let qg (Mi,rii) be the probability that the chart signals, and let qrJ(Mi,rii) and q?2(Mi,rii) be, respectively, the probabilities that the next stream and sample sizes are (M( 1 ),n( 1 )) and (M( 2 ),n( 2 )) when the number of streams with shifted mean is r. Then, qP(Mi,rii)
= l-
/
gr(zi,z2;Mi,ni)dz2dzi
Group Control Charts with Variable Stream and Sample Sizes
207
fc2 rk g<
r)
(M 4) nO
qfHM^m)
/
gr(zi, z2; -fc?
Mi,ni)dz2dzi
(3)
Jz\
= 1 - q§(Mi,ni)
-
^(M^m)
To derive the ANSPS and ANOS, a Markov chain approach is used. Let Ri = (Mi, rii) be the state of the chain at the ith sampling point. Then, the sequence of {Ri,i > 1} forms a Markov chain with 2 transient states {(M^,n^), (M( 2 ),n( 2 ))} and one absorbing state corresponding to the signal. Let p m i , m 2 be the transition probability from state (M( m i ), n( m i )) to state (M( m 2 ),ri( m 2 )), and let P be the transition probability matrix of transient states. Then,
P = P21
P\2
>) 9(i)(M(i)>n(i))
P22
. 9(ij (M(2), n{2))
,(r)
(r),
9(2j(M(i)'n(i)) (r),
(4)
9(2j (M (2 ), n ( 2 ) )
Let Nm, m = 1,2, be the number of sampling points to a signal when the chain starts with state m. Then, by the elementary properties of Markov chains, E(Nm)
= el • (I - P)"
(5)
where e m denotes a unit vector whose mth element is one, I is a 2 x 2 identity matrix, and 1 T = (1,1). Since we have assumed that the first stream and sample sizes are M(j) and n^, the ANSPS is E(N±). Let wm, TTI = 1,2, be the number of observations when Ri = m, and let Om be the number of observations from the start of the process to a signal when Ro = m. Then, • M(1),
if m
. n(2) • M ( 2 ) ,
if m
"(I)
(6)
and £ ( O m ) = e ^ - ( I - P )- -l I . w where w
T
— (wi,W2)- The ANOS is then simply
(7)
E(0{).
4. Numerical Comparisons In this section, the VSSS GCCs are compared with the FSSS GCCs. The properties of the FSSS GCCs can easily be obtained by letting M^) = M(2) = M and n^ = n(2) =n in the formulas of Sec. 3. These two control charts are to be compared under equal conditions. For this purpose, we adjust the values of k\ and k2 so that the two charts have the same in-control ANSPS and in-control ANOS values. Then they are 'matched' in the sense that they produce the same average quantity of false alarms and demand the same average quantity of items to be inspected during the in-control period.
208
K. T. Lee, D. S. Bai and S. H. Hong
Out-of-control ANSPS and out-of-control ANOS values of the two charts can be compared and the chart with smaller values is considered more efficient. ANSPS and ANOS values of the VSSS GCCs with M = 10 having in-control ANSPS value of 350 and in-control ANOS value of 3500 are given in Tables 2-5 for selected combinations of (n( 2 ),M( 2 )) = (2,1), (4,1), (n( 2 ),M( 2 )) = (5,4), (10,2), (5,8), (10, 4), r = 1,3 and 6 = 0.5(0.5)3.0. Those values of the FSSS GCCs with n = 1 are also given in the tables. One can see from these tables that: Table 2. ANSPS and ANOS values of the FSSS and VSSS GCCs. (M = 10, r = 1, ANSPSo = 350,+ ANOS 0 = 3500tt, ( n ( 2 ) , M ( 2 ) ) = (2,1)) VSSS
s
FSSS
( n (D ,MW) (5,4)
ANSPS
ANOS
(5,8)
(10, 2)
(10, 4)
0.5 1.0 1.5 2.0 2.5 3.0
288.26 145.73 52.04 18.44 7.58 3.75
189.97 33.64 8.06 3.84 2.93 2.70
236.70 65.43 13.96 4.16 1.94 1.29
141.00 15.99 5.50 4.17 3.74 3.42
187.03 25.01 3.65 1.40 1.05 1.00
0.5 1.0 1.5
2882.63 1457.2 520.43 184.34 75.77 37.45
2028.66 413.91 109.64 54.33 42.04 39.14
2518.30 826.32 218.72 76.34 37.96 25.63
1618.88 246.32 98.52 80.69 76.33 72.00
2132.95 438.49 111.07 54.07 42.12 40.17
2.0 2.5 3.0
+ ANSPSo: in-control ANSPS ttANOSo: in control ANOS
Table 3. ANSPS and ANOS values of the FSSS and VSSS GCCs. ( M = 10, r = 3, ANSPS 0 =350, ANOS 0 =3500, ( n ( 2 ) , M ( 2 ) ) = (2,1)) VSSS
s
(n(1),M(1))
FSSS (5,4)
1.0 1.5
213.24 67.46 19.56
90.86 9.21 2.31
135.19 19.55 4.17
2.0
6.70
1.34
2.92 1.66
1.17 1.13
0.5 ANSPS
(10, 2)
2.5 3.0
(5, 8)
(10, 4)
1.68
51.09 3.20 1.38 1.22
80.97 5.05 1.36 1.02
1.12 1.01
1.18 1.16
1.00 1.00
Group Control Charts with Variable Stream and Sample Sizes
209
Table 3. (Cor,dinued). VSSS
s 0.5 ANOS
(n(1),M(1))
FSSS
2131.41
(5,4)
(10, 2)
(5, 8)
(10, 4)
1095.30
1609.32
766.82 94.46 47.93
1183.09 165.79 54.19
1.0 1.5 2.0 2.5
674.57 195.58
151.00 42.84
326.61 81.74
67.02 29.22
25.54 22.28
33.57 22.39
43.69 43.21
40.72 40.01
3.0
16.56
21.71
20.22
42.83
40.00
Table 4. ANSPS and ANOS values of the FSSS and VSSS GCCs. (M = 10, r = 1, ANSPS 0 =350, ANOS 0 =3500, ( n ( 2 ) , M ( 2 ) ) = (4,1)) VSSS 5
FSSS
("(i) ,M(1))
(5,4)
ANSPS
ANOS
(10, 2)
(5,8)
(10, 4)
0.5 1.0 1.5 2.0 2.5 3.0
288.26 145.73 52.04 18.44 7.58 3.75
214.77 41.94 9.35 4.13 2.97 2.64
249.20 75.28 16.05 4.53 2.01 1.30
181.35 21.64 6.16 4.19 3.51 3.04
218.77 35.59 4.66 1.50 1.07 1.01
0.5 1.0 1.5 2.0 2.5 3.0
2882.63 1457.25 520.43 184.34 75.77 37.45
2254.68 499.06 124.87 58.24 43.07 38.69
2610.31 907.47 237.70 80.30 38.99 25.95
1989.87 312.03 109.23 83.43 75.97 69.24
2387.11 542.44 124.84 56.38 42.54 40.21
1. The proposed charts are dramatically quicker in detecting process mean shifts than the FSSS GCCs. For the case where r = 1 and 5 = 1.5, for instance, the out-of-control ANSPS value of the former with {(M( 2 ),n( 2 )), (M( 1 ),n( 1 ))} = {(2,1), (10, 4)} is only 3.65 whereas that of the latter is 52.04. 2. Out-of-control ANOS values of the VSSS GCCs are smaller than those of the FSSS GCCs when S is not too large. The use of the proposed VSSS GCCs requires the user to predetermine the values of M(i), M( 2 ), n(!j and n( 2 ). A general guidance from our numerical study is to use small n(2) • M(2) and large n^ • M^y For the case where there is an upper
210
K. T. Lee, D. S. Bai and S. H. Hong Table 5. ANSPS and ANOS values of the FSSS and VSSS GCCs. ( M = 10, r = 3, ANSPS 0 =350, ANOS 0 =3500, ( n ( 2 ) , M ( 2 ) ) = (4,1)) VSSS 6
FSSS
(«(i)>-M(i))
(5,4)
ANSPS
ANOS
(10, 2)
0.5 1.0 1.5 2.0 2.5 3.0
213.24 67.46 19.56 6.70 2.92 1.66
112.99 11.40 2.53 1.38 1.17 1.13
150.74 60.97 9.32 1.73 1.13 1.01
0.5 1.0 1.5 2.0 2.5 3.0
2131.41 674.57 195.58 67.02 29.22 16.56
1301.85 177.20 46.19 26.18 22.37 21.62
1722.83 809.68 166.38 34.56 22.61 20.25
(5,8)
(10, 4)
77.98 4.10 1.43 1.21 1.17 1.14
110.48 6.93 1.45 1.02 1.00 1.00
1032.20 110.09 49.40 43.83 43.09 42.52
1432.40 195.18 57.33 40.99 40.01 40.00
limit on n^ • M^, it is preferable to use large M^ and small n^) when it is more important to find large mean shifts, and vice versa. 5. Concluding Remarks We have proposed VSSS GCCs which select a large portion of streams and take samples of size n > 1 from the selected streams if there is some indication of a process change, and select a small portion of streams and take samples of n = 1 if there is no such indication. The expressions of ANSPS and ANOS are derived with a Markov chain approach. Numerical comparisons show that VSSS GCCs are substantially quicker than FSSS GCCs in detecting shifts in the process. The proposed GCCs, however, do not pinpoint the stream(s) responsible for an out-of-control signal. Therefore, a possible area of further investigation would be to devise an effective method of finding out the responsible stream(s) when the charts signal. A . Appendix A.l.
Notations
Subscript i denotes sampling point. M ixj Xij S
total number of streams in a process process mean of stream j , j = 1 , . . . , M sample mean of stream j,j — 1 , . . . , M magnitude of mean shift, expressed in units of a
Group Control Charts with Variable Stream and Sample Sizes
211
Mi rti ipi Zi(Mi,rii)
stream size sample size set of streams selected for sampling the smallest of Mi sample means expressed in standardized unit; min j6 ^, i {- v /ni(Xij — jj,0)/a} Zi(Mi,ni) the largest of Mi sample means expressed in standardized unit; maxjg1/,i{v/ni'(Xj:;' — fi0)/a} M( 2 ), M(i) small and large stream sizes, 2 < M(2) < M(i) < M n (2)) n (i) small and large sample sizes, n^2) < n(i) qg(Mi, rii) probability of a signal when the number of streams shifted in mean is r q(VJMi,rii) probability that the next stream and sample sizes are M^Q and TL^Q, respectively, when the number of streams shifted in mean is r, £ =• 1, 2 Ri state variable of Markov chain Nm number of sampling points from the start of the process to a signal when the initial state is m Om number of observations from the start of production to a signal when the initial state is m wm number of observations corresponding to state m 4>{-) standard normal pdf $(•) standard normal cdf A.2. Derivation
of Equation
(2)
Let L, L — 0 , . . . , min(Mj, r), be the number of streams with shifted mean among the selected Mi streams, and let f(z\,Z2\L) be the conditional distribution of Zi(Mi,rii) and Z2(Mi,ni) given that the number of streams with shifted mean is L. Then min(Mi,r)
gr(zi,Z2\Mi,ni)=
]T
f{zuz2\l)-Pi(L
= l)
(8)
1=0
Random variable L follows the hypergeometric distribution with parameters M, Mi and r, and thus Eq. (8) reduces to
min(Mi,r)
9r{zi,z2;Mi,ni)
=
V
1 * 1=0
f(z1,z2\l)
•
,—v /
—
(9)
!\/f \
Let 9 , be the set of streams with shifted mean among all the streams. Also, let £o(-) and Ho(-) be, respectively, the pdf and cdf of Zij = ^Jnl{Xij — Ho)/cr for j £ &i, and ^(-) and Hi(-) be the corresponding pdf and cdf for j 6 Oj.
212
K. T. Lee, D. S. Bai and S. H. Hong
Then, from t h e result of Vaughan a n d Venables (1997), f(z\, by
f(zi,z2\l)
z2\l) can be represented
Mi-I
I
A0---AQ
A f - A i
Bo • • • Bo
B\ • • • B\
+ \
(Mi - 2) (Mi - 2)! Bo • • • BQ
BI
CQ • • • CQ
Ci • • • C\
(10)
• • • B\
where A0 = £o(zi), Ax = £i(z{), B0 = =^o(z2) - E0(zi), Bx = =.i(z2) - 'E.i(zi), Co = £i(-zi), C\ = ^1(22), a n d + | A | + is t h e permanent of a square matrix A defined like t h e determinent, except t h a t all signs are positive. Note t h a t Z\(Mi,rii) and Z2(Mi, rii) are, respectively, t h e smallest a n d largest order statistics for independent non-identically distributed variates. See Vaughan a n d Venables (1972) for a further discussion on this distribution. We can obtain f(zi,z2\l) in Eq. (2) from Eq. (10) with some manipulation. Finally, since Zij follows t h e s t a n d a r d normal distribution when j ^ 0 , and t h e distribution of Z^ is normal with mean S^/ni a n d variance one when j G 0 , , AQ = <j>(z!), Aj. = <j>(zi - Sy/rn), B0 = $ ( z 2 ) - ®(zi), Bx = $ ( z 2 - Sy/rn) - $ ( z j - Sy/rn), Co = (z2), Cx = cf>(z2 - S^M) D References Amin, R. W. and Letsinger, W. II. (1991), "Improved Switching Rules in Control Procedures Using Variable Sampling Intervals", Communications in Statistics — Simulation and Computation 20, 205-230. Amin, R. W. and Miller, R. W. (1993), "A Robustness Study of X Charts with Variable Sampling Intervals", Journal of Quality Technology 25, 36-44. Baxley, R. W., Jr. (1995), "An Application of Variable Sampling Interval Control Charts", Journal of Quality Technology 27, 275-282. Boyd, D. F. (1950), "Applying the Group Chart for X and R", Industrial Quality Control 7, 22-25. Costa, A. F. B. (1994), UX Charts with Variable Sample Size", Journal of Quality Technology 26, 155-163. Costa, A. F. B. (1997), "X Chart with Variable Sample Size and Sampling Intervals", Journal of Quality Technology 29, 197-204. Lucas, J. M. and Crosier, R. B. (1982), "Fast Initial Response for CUSUM Quality Control Schemes: Give Your CUSUM a Head Start", Technometrics 24, 199-205. Mortell, R. R. and Runger, G. C. (1995), "Statistical Process Control of Multiple Stream Processes", Journal of Quality Technology 27, 1-12. Nelson, L. S. (1986), "Control Chart for Multiple Stream Processes", Journal of Quality Technology 18, 255-256.
Group Control Charts with Variable Stream and Sample Sizes
213
Nelson, P. R. and Stephenson, P. L. (1996), "Runs Tests for Group Control Charts", Communications in Statistics-Theory and Methods 25, 2739-2765. Prabhu, S. S., Montgomery, D. C. and Runger, G. C. (1994), "A Combined Adaptive Sample Size and Sampling Interval X Control Scheme", Journal of Quality Technology 26, 164-176. Prabhu, S. S., Runger, G. C. and Keats, J. B. (1993), "An Adaptive Sample Size X Chart", International Journal of Production Research 3 1 , 2895-2909. Prybutok, V. R., Clayton, H. R. and Harvey, M. M. (1997), "Comparison of Fixed versus Variable Sampling Interval Shewhart X Control Charts in the Presence of Positively Autocorrelated Data", Communications in Statistics — Simulation and Computation 26, 83-106. Reynolds, M. R., Jr., Amin, R. W., Arnold, J. C. and Nachlas, J. A. (1988), " X Charts with Variable Sampling Intervals", Technometrics 30, 181-192. Runger, G. C , Alt, F. B. and Montgomery, D. C. (1996), "Controlling Multiple Stream Processes with Principal Components", International Journal of Production Research 34, 2991-2999. Runger, G. C. and Pignatiello, J. J., Jr. (1991), "Adaptive Sampling for Process Control", Journal of Quality Technology 23, 135-155. Vaughan, R. J. and Venables, W. N. (1972), "Permanent Expressions for Order Statistics Densities", Journal of Royal Statistical Society-Ser. B 34, 308-310.
Chapter 14 A M E T H O D O L O G Y FOR T H E M E A S U R E M E N T OF T E S T E F F E C T I V E N E S S J O H N C. M U N S O N Computer Science Department University of Idaho, Moscow, ID 83844-1010 E-mail: [email protected] A L L E N P. N I K O R A Jet Propulsion Laboratory California Institute of Technology, Pasadena, CA E-mail: [email protected]
91109-8099
1. Introduction Over a number of years of study, we can now establish a distinct relationship between software faults and certain aspects of software complexity. When a software system consisting of many distinct software modules is built for the first time, we have little or no direct information as to the location of faults in the code. Some of the modules will have far more faults in them than others. We now, however, know that the number of faults in a module is highly correlated with certain software attributes that can be measured. This means that we can measure the software on these specific attributes and have some reasonable notion as to the degree to which the modules are fault prone [Munson and Khoshgoftaar (1990a), Munson (1996)]. In the absence of information as to the specific location of software faults, we have successfully used a derived metric, the relative complexity measure, as a fault surrogate. That is, if the relative complexity value of a module is large, then it will likely have a large number of latent faults. If, on the other hand, the relative complexity of a module is small, then it will tend to have fewer faults. As the software system evolves through a number of sequential builds, faults will be identified and the code will be changed in an attempt to eliminate the identified faults. The introduction of new code however, is a fault prone process just as the initial code generation was. Faults may well be injected during this evolutionary process.
215
216
J. C. Munson and A. P. Nikora
Code does not always change just to fix faults that have been isolated in it. Some changes to code during its evolution represent enhancements, design modifications, or changes in the code in response to continually evolving requirements. These incremental code enhancements may also result in the insertion of still more faults. Thus, as a system progresses through a series of builds, the relative complexity fault surrogate of each program module that has been altered must also change. We will see that the rate of change in relative complexity will serve as a good index of the rate of fault insertion. Once the rate of fault insertion has been established, it becomes possible to estimate the number of faults remaining in the system at any point during the development. Since we use changes in relative complexity as an index of the fault insertion rate, it becomes possible to estimate the number of residual faults at the module level, in which a module is taken to be a procedure, function, or method. This information is useful to software development managers wishing to estimate the resources required to remove the remaining faults — not only can the number of remaining faults be estimated, but it is possible to direct fault detection and removal resources at those portions of the software estimated to have the highest concentrations of residual faults. However, this is only half of the picture. Once the software is operating in the field, we wish to estimate its reliability. The estimated number of residual faults, a static measure, must be transformed into an estimate of the system's dynamic behavior. The general notion of software test is that rate of fault removal will generally exceed the rate of fault insertion. In most cases, this is probably true [Munson and Hall (1997)]. Some changes are rather more heroic than others. During these more substantive change cycles, it is quite possible that the actual number of faults in the system will rise. We would be very mistaken, then, to assume that software test will monotonically reduce the number of faults in a system. This will only be the case when the rate of fault removal exceeds the rate of fault insertion. The rate of fault removal is relatively easy to measure. The rate of fault insertion is much more tenuous. This fault insertion process is directly related to two measures that we can take on code as it evolves, code change and code churn. In this investigation we will establish a methodology whereby code can be measured from one build to the next, a measurement baseline. We will use this measurement baseline to develop an assessment of the rate of change to a system as measured by our relative complexity fault surrogate. From this change process we will then be able to derive a direct measure of the rate of fault insertion based on changes in the software from one build to the next. We examine data from an actual system on which faults may be traced to specific build increments to assess the predicted rate of fault insertion with the actual. Finally, we will develop a method of measuring the efficiency of a test activity. To estimate rates of fault insertion, it is necessary to identify a complete software system on which every version of every module has been archived together with the faults that have been recorded against the system as it evolved. Of the two systems
A Methodology for the Measurement
of Test Effectiveness
217
we analyzed for this study, the Cassini Orbiter Command and Data Subsystem at JPL met all of our objectives. On the first build of this system there were approximately 96K source lines of code in approximately 750 program modules. On the last build there were approximately 110K lines of source code in approximately 800 program modules. As the system progressed from the first to the last build there were a total of 45 200 different versions of these modules. On the average, then, each module progressed through an average of 56 evolutionary steps or versions. For the purposes of this study, the Ada program module is a procedure or function. It is the smallest unit of the Ada language structure that may be measured. A number of modules present in the first build of the system were removed on subsequent builds. Similarly, a number of modules were added. The Cassini CDS does not represent an extraordinary software system. It is quite typical of the amount of change activity that will occur in the development of a system on the order of 100 KLOC. It is a non-trivial measurement problem to track the system as it evolves. Again, there are two different sets of measurement activities that must occur at once. We are interested in the changes in the source code and we are interested in the fault reports that are being filed against each module. To determine the efficiency of a test activity, it is necessary to have a system in which structural changes between one increment and its predecessor can be measured together with the execution profile observed during test. Since we were unable to accomplish this for the CASSINI CDS flight software, we studied the real-time software for a commercial embedded system. 2. A Measurement Baseline The measurement of an evolving software system through the shifting sands of time is not an easy task. Perhaps one of the most difficult issues relates to the establishment of a baseline against which the evolving systems may be compared. This problem is very similar to that encountered by the surveying profession. If we were to buy a piece of property, there are certain physical attributes that we would like to know about. Among these properties is the topology of the site. To establish the topological characteristics of the land, we will have to seek out a benchmark. This benchmark represents an arbitrary point somewhere on the subject property. The distance and the elevation of every other point on the property may then be established in relation to the measurement baseline. Interestingly enough, we can pick any point on the property, establish a new baseline, and get exactly the same topology for the property. The property does not change. Only our perspective changes. When measuring software evolution, we need to establish a measurement baseline for the same purpose described above [Munson and Werries (1996), Nikora et al. (1997)]. We need a fixed point against which all others can be compared. Our measurement baseline also needs to maintain the property so that, when another point is chosen, the exact same picture of software evolution emerges; only the perspective changes. The individual points involved in measuring software evolution are individual builds of the system.
218
J. C. Munson and A. P. Nikora
Standardizing metrics for one particular build is simple. For each metric obtained for each module, subtract from that metric its mean and divide by its standard deviation. This puts all of the metrics on the same relative scale, with a mean of zero and a standard deviation of one. This works fine for comparing modules within one particular build. But when we standardize subsequent builds using the means and standard deviations for those builds a problem arises. The standardization masks the change that has occurred between builds. In order to place all the metrics on the same relative scale and to keep from losing the effect of changes between builds, all build data is standardized using the means and standard deviations for the metrics obtained from the baseline system. This preserves trends in the data and lets measurements from different builds be compared. For each raw metric in the baseline build, we may compute a mean and a standard deviation. Let us denote the vector of mean values for the baseline build as 5tB and the vector of standard deviations as s s . The standardized baseline metric values for any module j in an arbitrary build i, then, may be derived from raw metric values as Z
3
~
qB
The process of standardizing the raw metrics certainly makes them more tractable. Among other things, it now permits the comparison of metric values from one build to the next. This standardization does not solve the main problem. There are too many metrics collected on each module over many builds. We need to reduce the dimensionality of the problem. We have successfully used principal components analysis for reducing the dimensionality of the problem [Khoshgoftaar and Munson (1992), Munson and Khoshgoftaar (1990b)]. The principal components technique will reduce a set of highly correlated metrics to a much smaller set of uncorrelated or orthogonal measures. One of the products of the principal components technique is an orthogonal transformation matrix T that will send the standardized scores (the matrix z) onto a reduced set of domain scores thusly, d = zT. In the same manner as the baseline means and standard deviations were used to transform the raw metric of any build relative to a baseline build, the transformation matrix TB derived from the baseline build will be used in subsequent builds to transform standardized metric values obtained from that build to the reduced set of domain metrics as follows: dB'1 = z B , l T B , where z S '* are the standardized metric values from build i baselined on build B. Another artifact of the principal components analysis is the set of eigenvalues that are generated for each of the new principal components. Associated with each of the new measurement domains is an eigenvalue, A. These eigenvalues are large or small varying directly with the proportion of variance explained by each principal component. We have successfully exploited these eigenvalues to create a new metric called relative complexity, p, that is the weighted sum of the domain metrics to wit:
A Methodology for the Measurement
of Test Effectiveness
219
m
Pi = 50 + 10^2 Xjdj where m is the dimensionality of the reduced metric set [Munson and Khoshgoftaar (1990b)]. As was the case for the standardized metrics and the domain metrics, relative complexity may be baselined as well using the eigenvalues and the baselined domain values as follows: m
If the raw metrics that are used to construct the relative complexity metric are carefully chosen for their relationship to software faults, then the relative complexity metric will vary in exactly the same manner as the faults [Munson (1992)]. The relative complexity metric in this context is a fault surrogate. Whereas we cannot measure the faults in a program directly, we can measure the relative complexity of the program modules that contain the faults. Those modules having a large relative complexity value will ultimately be found to be those with the largest number of faults [Munson and Khoshgoftaar (1992)].

3. Software Evolution

As a software system grows and modifications are made, the modules that comprise the system are recompiled and a new version, or build, is created. Each build is constructed from a distinct set of these software modules, though not always exactly the same ones. The new version may contain some of the same modules as the previous version, may add some entirely new modules, and may even omit some modules that were present in an earlier version. Of the modules that are common to both the old and new versions, some may have undergone modification since the last build. The set of modules that constitute the system on any one build is thus subject to material change over the life of the system.

3.1. Module Sets and Versions
When evaluating the change that occurs to the system between any two builds i and j, we are interested in three sets of modules. The first set, M_c^{i,j}, is the set of modules present in both builds of the system; these modules may have changed since the earlier version but were not removed. The second set, M_a^{i,j}, is the set of modules that were in the early build and were removed prior to the later build. The final set, M_b^{i,j}, is the set of modules that have been added to the system since the earlier build. As an example, let build i consist of the following set of modules:

    M^i = {m_1, m_2, m_3, m_4, m_5} .
Between builds i and j, module m_3 is removed, giving

    M^j = M^i ∪ M_b^{i,j} − M_a^{i,j}
        = {m_1, m_2, m_3, m_4, m_5} ∪ { } − {m_3}
        = {m_1, m_2, m_4, m_5} .

Then between builds j and k, two new modules, m_7 and m_8, are added and module m_2 is deleted, giving

    M^k = M^j ∪ M_b^{j,k} − M_a^{j,k}
        = {m_1, m_2, m_4, m_5} ∪ {m_7, m_8} − {m_2}
        = {m_1, m_4, m_5, m_7, m_8} .

With a suitable baseline in place, and the module sets defined above, it is now possible to measure software evolution across a full spectrum of software metrics. We can do this first by comparing average metric values for the different builds. Second, we can measure the increase or decrease in system complexity as measured by the code delta, or we can measure the total amount of change the system has undergone between builds, the code churn. We can now see that establishing the complexity of a system across builds, in the face of changing modules and changing sets of modules, is in itself a very complex problem. In terms of the example above, the relative complexity of the system R^{B,i} at build i, the early build, is given by

    R^{B,i} = Σ_{m_c ∈ M^i} ρ_c^{B,i} ,
where ρ_c^{B,i} is the relative complexity of module m_c on this build, baselined by build B. Similarly, the relative complexity of the system R^{B,j} at build j, the later build, is given by

    R^{B,j} = Σ_{m_c ∈ M^j} ρ_c^{B,j} .

The later system build is said to be more complex if R^{B,j} > R^{B,i}. Regardless of which metric is chosen, the goal is the same: we wish to assess how the system has changed over time with respect to that particular measurement. The concept of a code delta provides this information. A code delta is, as the name implies, the difference between two builds as to the relative complexity metric.
3.2. Code Churn and Code Deltas
The change in the relative complexity of a single module between two builds may be measured in one of two distinct ways. First, we may simply compute the difference in the module relative complexity between build i and build j. We will call this value the code delta for the module m_a:

    δ_a^{i,j} = ρ_a^{B,j} − ρ_a^{B,i} .

The absolute value of the code delta is a measure of code churn. In the case of code churn, what is important is the absolute amount by which the code has been modified. From the standpoint of fault insertion, removing a lot of code is probably as catastrophic as adding a lot. The measure of code churn, χ, for module m_a is simply

    χ_a^{i,j} = |δ_a^{i,j}| = |ρ_a^{B,j} − ρ_a^{B,i}| .

It is now possible to compute the total change activity for the aggregate system across all of the program modules. The total net change of the system between two builds i and j is the sum of the code deltas:

    Δ^{i,j} = Σ_{m_c ∈ M_c^{i,j}} δ_c^{i,j} + Σ_{m_b ∈ M_b^{i,j}} ρ_b^{B,j} − Σ_{m_a ∈ M_a^{i,j}} ρ_a^{B,i} .
A limitation of measuring code deltas is that a delta does not indicate how much change the system has undergone. If, between builds, several software modules are removed and are replaced by modules of roughly equivalent complexity, the code delta for the system will be close to zero: the overall complexity of the system, based on the metric used to compute deltas, will not have changed much. However, the reliability of the system could have been severely affected by the process of replacing old modules with new ones. What we need is a measure to accompany code delta that indicates how much change has occurred. Code churn, calculated in a similar manner to code delta, provides this information. The net code churn of the same system over the same builds is

    ∇^{i,j} = Σ_{m_c ∈ M_c^{i,j}} χ_c^{i,j} + Σ_{m_b ∈ M_b^{i,j}} ρ_b^{B,j} + Σ_{m_a ∈ M_a^{i,j}} ρ_a^{B,i} .
When several modules are replaced between builds by modules of roughly the same complexity, the code delta will be approximately zero but the code churn will be equal to the sum of the values of ρ for all of the modules, both inserted and deleted. Both the code delta and the code churn for a particular metric are needed to assess the evolution of a system.
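The set algebra and the two change measures translate directly into code. The sketch below is a minimal Python rendering of the definitions above; the module names and relative complexity values are hypothetical.

    def evolution_measures(rho_old, rho_new):
        # rho_old, rho_new: dicts mapping module name -> baselined relative
        # complexity on the early build i and the later build j.
        common  = rho_old.keys() & rho_new.keys()   # M_c: in both builds
        removed = rho_old.keys() - rho_new.keys()   # M_a: removed before j
        added   = rho_new.keys() - rho_old.keys()   # M_b: added since i

        delta = (sum(rho_new[m] - rho_old[m] for m in common)
                 + sum(rho_new[m] for m in added)
                 - sum(rho_old[m] for m in removed))
        churn = (sum(abs(rho_new[m] - rho_old[m]) for m in common)
                 + sum(rho_new[m] for m in added)
                 + sum(rho_old[m] for m in removed))
        return delta, churn

    # Replacing a module by one of similar complexity gives a near-zero
    # delta but a large churn, exactly the situation discussed above.
    build_i = {"m1": 50, "m2": 48, "m3": 52}
    build_j = {"m1": 50, "m2": 48, "m7": 53}        # m3 removed, m7 added
    print(evolution_measures(build_i, build_j))     # delta = 1, churn = 105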
4. Obtaining Average Build Values

One synthetic software measure, relative complexity, has clearly been established as a successful surrogate measure of software faults [Munson and Khoshgoftaar (1990b)]. It seems only reasonable that we should use it as the measure against which we compare different builds. Since relative complexity is a composite measure based on the raw measurements, it incorporates the information represented by LOC, V(g), η1, η2, and all the other raw metrics of interest. Relative complexity is a single value, representative of the complexity of the system, that incorporates all of the complexity attributes we have measured (e.g., size, control flow, style, data structures, etc.). By definition, the average relative complexity, ρ̄^B, of the baseline system will be

    ρ̄^B = (1/N^B) Σ_{i=1}^{N^B} ρ_i^B ,
where N^B is the cardinality of the set of modules on build B, the baseline build. Relative complexity for the baseline build is calculated from standardized values using the means and standard deviations of the baseline metrics. The relative complexities are then scaled to have a mean of 50 and a standard deviation of 10. For that reason, the average relative complexity of the baseline system will always be a fixed point, 50. Subsequent builds are standardized using the means and standard deviations of the metrics gathered from the baseline system, to allow comparisons. The average relative complexity for subsequent builds is given by

    ρ̄^k = (1/N^k) Σ_{i=1}^{N^k} ρ_i^{B,k} ,
where N^k is the cardinality of the set of program modules in the kth build and ρ_i^{B,k} is the baselined relative complexity of the ith module of that set. The total relative complexity, R^0, of a system on its initial build is simply the sum of the relative complexities of all modules of the initial system,

    R^0 = Σ_{i=1}^{N^0} ρ_i^0 .
The principle behind relative complexity is that it serves as a fault surrogate; that is, it will vary in precisely the same manner as do software faults. The fault potential r_i^0 of a particular module i is directly proportional to its value of the relative complexity fault surrogate. Thus,

    r_i^0 = ρ_i^0 / R^0 .

To derive a preliminary estimate of the actual number of faults per module, we may make judicious use of historical data. From previous software development projects it is possible to develop a proportionality constant, say k, that will allow
the total system relative complexity to map to a specific system fault count as follows: F^0 = kR^0, or R^0 = F^0/k. Substituting for R^0 in the previous equation, we find that

    r_i^0 = kρ_i^0 / F^0 .
Thus, our best estimate for the number of faults in module i in the initial configuration of the system is

    g_i^0 = r_i^0 F^0 .

After an interval of testing, a number of faults will be found and fixes made to the code to remedy them. Let F^j be the total number of faults found in the total system up to and including the jth build of the software. In a particular module i there will be f_i^1 faults found in the first build that are attributable to this module. The estimated number of faults remaining in module i will then be

    g_i^1 = g_i^0 − f_i^1 ,
assuming that we have only fixed faults in the code and not added any new ones. Our ability to locate the remaining faults in a system will relate directly to our exposure to these faults. If, for example, at the jth build of a system there are g_i^j remaining faults in module i, we cannot expect to identify any of these faults unless some test activity is allocated to exercising module i. As the code is modified over time, faults will be found and fixed. However, new faults will be introduced into the code as a result of the change. In fact, this fault insertion process is directly proportional to the change in the program modules from one version to the next. As a module is changed from one build to the next in response to evolving requirements and fault reports, its complexity will also change. Generally, the net effect of a change is that complexity will increase; only rarely will it decrease. It is now necessary to describe the measurement process for the rate of change in an evolving system.

5. Software Evolution and the Fault Insertion Process

Initially, our best estimate for the number of faults in module i in the initial configuration of the system is

    g_i^0 = r_i^0 F^0 .

As the ith module was tested during the test activity of the first build, the number of faults found and fixed in this process was denoted by f_i^1. However, in the process of fixing these faults, the source code will change; in all likelihood, so, too, will the relative complexity of this module. Over a sequence of builds, the complexity of this module may change substantially. Let

    Δ_i^{0,j} = Σ_{k=1}^{j} δ_i^{k−1,k}
represent the net change in relative complexity of the ith module over the first j builds, and let ∇_i^{0,j} = Σ_{k=1}^{j} χ_i^{k−1,k} represent the corresponding cumulative churn of that module. Then the cumulative churn of the total system over these j builds will be

    ∇^{0,j} = Σ_{i=1}^{N^j} ∇_i^{0,j} ,

where N^j is the cardinality of the set of all modules that were in existence over these j builds. The complexity of the ith module will have changed over this sequence of builds; its new value will be ρ_i + Δ_i^{0,j}. Some changes may increase the relative complexity of this module and others may decrease it. A much better (as will be demonstrated) measure of the cumulative change to the module is ρ_i + ∇_i^{0,j}. The system complexity, R, will also have changed; its new value will be R^0 + Δ^{0,j}. On the initial build of the system, the initial burden of faults in a module was proportional to the relative complexity of the module. As the build cycle continues, the rate of fault insertion is most closely associated with code churn. Thus, the proportion of faults in the ith module will have changed over the sequence of j builds, related to its initial relative complexity and its subsequent code churn. Its new value will be

    r_i^j = (ρ_i + ∇_i^{0,j}) / (R^0 + ∇^{0,j}) .
We now observe that our estimate of the number of faults in the system has now changed. On the jth build there will no longer be F^0 faults in the system; new faults will have been introduced as the code has evolved. In all likelihood, the initial software development process and the subsequent evolution processes will be materially different. This means that there will be a different proportionality constant, say k′, representing the rate of fault insertion for the evolving system. For the total system, then, there will have been

    F^j = kR^0 + k′∇^{0,j}

faults introduced into the system from the initial build through the jth build. Each module will have had h_i^j = r_i^j F^j faults introduced into it, either on the initial build or on subsequent builds. Thus, our revised estimate of the number of faults remaining in module i on build j will be
    g_i^j = h_i^j − f_i^j .

The rate of fault insertion is directly related to the change activity that a module receives from one build to the next. At the system level, we can see that the expected number of injected faults from build j to build j + 1 will be

    F^{j+1} − F^j = kR^0 + k′∇^{0,j+1} − (kR^0 + k′∇^{0,j})
                  = k′(∇^{0,j+1} − ∇^{0,j})
                  = k′∇^{j,j+1} .
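Putting the pieces of this section together, the sketch below estimates the faults remaining in each module on build j from the initial relative complexities, the cumulative churn, and the two proportionality constants. It is a sketch under stated assumptions: the constants and data are hypothetical, and modules added after the initial build are not handled.

    def estimated_faults(rho0, churn_0j, k, k_prime, found):
        # rho0:     module -> relative complexity on the initial build
        # churn_0j: module -> cumulative code churn through build j
        # found:    module -> cumulative faults found and fixed through j
        R0 = sum(rho0.values())
        total_churn = sum(churn_0j.values())
        F_j = k * R0 + k_prime * total_churn       # faults inserted through j
        remaining = {}
        for m in rho0:
            r_mj = (rho0[m] + churn_0j.get(m, 0.0)) / (R0 + total_churn)
            h_mj = r_mj * F_j                       # faults inserted in module m
            remaining[m] = h_mj - found.get(m, 0)   # g = h - f
        return remaining

    # Hypothetical two-module system with hypothetical constants k, k'.
    rho0 = {"m1": 55.0, "m2": 45.0}
    churn = {"m1": 20.0, "m2": 2.0}
    found = {"m1": 3, "m2": 0}
    print(estimated_faults(rho0, churn, k=0.05, k_prime=0.1, found=found))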
At the module level, the rate of fault insertion will again be proportional to the level of change activity. Hence, the expected number of injected faults between builds j and j + 1 in module i will be simply h_i^{j+1} − h_i^j. The two proportionality constants k and k′ are the ultimate criterion measures of the software development and software maintenance processes. Each process has an associated fault insertion proportionality constant. If we institute a new software development process and observe a significant downward change in the constant k, then the change will have been a good one. Very frequently, however, software processes are changed because development fads change, and not because a criterion measure has indicated that a new process is superior to a previous one. We will consider that an advance in software development process has occurred if either k or k′ has diminished for that new process.

6. Definition of a Fault

Unfortunately, there is no precise definition of just what a software fault is. In the face of this difficulty it is rather hard to develop meaningful associative models between faults and metrics. In calibrating our model, we would like to know how to count faults in an accurate and repeatable manner. In measuring the evolution of the system, in order to talk about rates of fault introduction and removal, we must measure in units that match the way the system changes over time. Changes to the system are visible at the module level, and we attempt to measure at that level of granularity. Since the measurements of system structure are collected at the module level (by module we mean procedures and functions), we would like information about faults at the same granularity. We would also like to know if there are quantities related to fault counts that can be used to make our calibration task easier.

Following the second definition of fault in IEEE (1983, 1989), we consider a fault to be a structural imperfection in a software system that may lead to the system's eventually failing. In other words, it is a physical characteristic of the system, of which the type and extent may be measured using the same ideas used to measure the properties of more traditional physical systems. Faults are introduced into a system by people making errors in their tasks; these errors may be errors of commission or errors of omission. In order to count faults, we needed to develop a method of identification that is repeatable, consistent, and identifies faults at the same level of granularity as our structural measurements.

In analyzing the flight software for the Cassini project, the fault data and the source code change data were available from two different systems. The problem reporting information was obtained from the JPL institutional problem reporting system. For the software used in this study, failures were recorded in this system starting at subsystem-level integration and continuing through spacecraft integration and test. Failure reports typically contain descriptions of the failure at varying levels of detail, as well as descriptions of what was
done to correct the fault(s) that caused the failure. Detailed information regarding the underlying faults (e.g., where the code changes were made in each affected module) is generally unavailable from the problem reporting system. The entire source code evolution could be obtained directly from the Source Code Control System (SCCS) files for all versions of the flight software. The way in which SCCS was used in this development effort makes it possible to track changes to the system at the module level, in that each SCCS file stores the baseline version of that file (which may contain one or more modules) as well as the changes required to produce each subsequent increment (SCCS delta) of that file. When a module was created, or changed in response to a failure report or engineering change request, the file in which the module is contained was checked into SCCS as a new delta. This allowed us to track changes to the system at the module level as it evolved over time. For approximately 10% of the failure reports, we were able to identify the source file increment in which the fault(s) associated with a particular failure report were repaired. This information was available either in the comments inserted by the developer into the SCCS file as part of the check-in process, or as part of the set of comments at the beginning of a module that track its development history. Using the information described above, we performed the following steps to identify faults:

1. For each problem report, search all of the SCCS files to identify all modules, and the increment(s) of each module, for which the software was changed in response to the problem report.

2. For each increment of each module identified in Step 1, start with the assumption that all differences between the increment in which repairs are implemented and the previous increment are due solely to fault repair. Note that this is not necessarily a valid assumption: developers may be making functional enhancements to the system in the same increment in which fault repairs are being made. Careful analysis of failure reports for which there was sufficiently detailed descriptive information served to separate areas of fault repair from other changes. However, the level of detail required to perform this analysis was not consistently available.

3. Use a differential comparator (e.g., Unix diff) to obtain the differences between the increment(s) in which the fault(s) were repaired and the immediately preceding increment(s). The results indicate the areas to be searched for faults.

After completing the last step, we still had to identify and count the faults; the results of the differential comparison cannot simply be counted up to give a total number of faults. In order to do this, we developed a taxonomy for identifying and counting faults [Nikora (1998)]. This taxonomy differs from others in that it does not seek to identify the root cause of the fault. Rather, it is based on the types of changes made to the software to repair the faults associated with failure reports; in other words, it constitutes an operational definition of a fault. Although identifying the root causes of faults is important in improving the development
process [Chillarege et al. (1992), IEEE (1994)], it is first necessary to identify the faults. We do not claim that this is the only way to identify and count faults, nor do we claim that this taxonomy is complete. However, we found that this taxonomy allowed us to identify faults in the software used in the study in a consistent manner at the appropriate level of granularity.

7. The Relationship between Faults and Code Changes

Having established a theoretical relationship between software faults and code changes, it is now of interest to validate this model empirically. This measurement occurred on two simultaneous fronts. First, all of the versions of all of the source code modules were measured; from these measurements, code churn and code deltas were obtained for every version of every module. Second, the failure reports were sampled to lead to specific faults in the code. These faults were classified according to the above taxonomy, manually, on a case by case basis. Then we were able to build a regression model relating the code measures to the code faults.

The Ada source code modules for all versions of each of these modules were systematically reconstructed from the SCCS code deltas. Each of these module versions was then measured by the UX-Metric analysis tool for Ada [SET (1993)]. Not all metrics provided by this tool were used in this study; only a subset of them actually provide distinct sources of variation [Khoshgoftaar and Munson (1990)]. The specific metrics used in this study are shown in Table 1.

Table 1. Software metric definitions.

    Metric     Definition
    η1         Count of unique operators [Halstead (1977)]
    η2         Count of unique operands
    N1         Count of total operators
    N2         Count of total operands
    P/R        Purity ratio: ratio of Halstead's N to total program vocabulary
    V(g)       McCabe's cyclomatic complexity
    Depth      Maximum nesting level of program blocks
    AveDepth   Average nesting level of program blocks
    LOC        Number of lines of code
    Blk        Number of blank lines
    Cmt        Count of comments
    CmtWds     Total words used in all comments
    Stmts      Count of executable statements
    LSS        Number of logical source statements
    PSS        Number of physical source statements
    NonEx      Number of non-executable statements
    AveSpan    Average number of lines of code between references to each variable
    Vl         Average variable name length
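Most of the measures in Table 1 require a full lexical analysis of the Ada source, but the simpler size and commentary counts can be illustrated directly. The following sketch is ours, not the UX-Metric tool: it approximates LOC, Blk, Cmt, and CmtWds for an Ada source fragment, and assumes only that Ada comments begin with "--".

    def simple_metrics(source_lines):
        # Rough counts for a few of the Table 1 measures; operator/operand
        # counts and V(g) would require a real parser and are omitted.
        loc = len(source_lines)                                   # LOC
        blk = sum(1 for ln in source_lines if not ln.strip())     # Blk
        comments = [ln.lstrip() for ln in source_lines
                    if ln.lstrip().startswith("--")]
        cmt = len(comments)                                       # Cmt
        cmt_wds = sum(len(ln.lstrip("- ").split())
                      for ln in comments)                         # CmtWds
        return {"LOC": loc, "Blk": blk, "Cmt": cmt, "CmtWds": cmt_wds}

    example = [
        "-- Compute the running checksum.",
        "procedure Update (X : in Integer) is",
        "begin",
        "",
        "   Total := Total + X;",
        "end Update;",
    ]
    print(simple_metrics(example))  # {'LOC': 6, 'Blk': 1, 'Cmt': 1, 'CmtWds': 4}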
To establish a baseline system, all of the metric data for the module versions that were members of the first build of the CDS were analyzed by our PCA-RCM tool. This tool is designed to compute relative complexity values either from a baseline system or from a system being compared to the baseline system. In that the first build of the Cassini CDS system was selected to be the baseline system, the PCA-RCM tool performed a principal components analysis on these data with an orthogonal varimax rotation. The objective of this phase of the analysis is to use the principal components technique to reduce the dimensionality of the metric set.

Table 2. Principal components of software metrics.

    Metric      Size    Structure  Style   Nesting
    Stmts       0.968   0.022      -0.079  0.021
    LSS         0.961   0.025      -0.080  0.004
    N2          0.926   0.016      0.086   0.086
    N1          0.934   0.016      0.074   0.077
    η1          0.884   0.012      -0.244  0.043
    AveSpan     0.852   0.032      0.031   -0.082
    V(g)        0.843   0.032      -0.094  -0.114
    η2          0.635   -0.055     -0.522  -0.136
    Depth       0.617   -0.022     -0.337  -0.379
    LOC         -0.027  0.979      0.136   0.015
    Cmt         -0.046  0.970      0.108   0.004
    PSS         -0.043  0.961      0.149   0.019
    CmtWds      0.033   0.931      0.058   -0.010
    NonEx       -0.053  0.928      0.076   -0.009
    Blk         0.263   0.898      0.048   0.005
    P/R         -0.148  -0.198     -0.878  0.052
    Vl          0.372   -0.232     -0.752  0.010
    AveDepth    -0.000  -0.009     0.041   -0.938
    % Variance  37.956  30.315     10.454  6.009
As may be seen in Table 2, there are four principal components of the 18 metrics shown in Table 1. For convenience, we have chosen to name these principal components Size, Structure, Style, and Nesting. From the last row in Table 2 we can see that the new reduced set of orthogonal components accounts for approximately 85% of the variation in the original 18-metric set. As is typical in the principal components analysis of metric data, the Size domain dominates the analysis; it alone accounts for approximately 38% of the total variation in the original metric set. Not surprisingly, this domain contains the metrics of total statement count (Stmts), logical source statements (LSS), and the Halstead lexical metric primitives of operator and operand count, but it also contains cyclomatic complexity (V(g)). In that we regularly find cyclomatic complexity in this domain, we are forced to conclude that it is only a simple measure of size, in the
same manner as statement count. The Structure domain contains those metrics relating to the physical structure of the program, such as non-executable statements (NonEx) and the program block count (Blk). The Style domain contains measures of attributes that are directly under a programmer's control, such as variable name length (Vl) and purity ratio (P/R). The Nesting domain consists of the single metric that measures the average depth of nesting of program blocks (AveDepth).

In order to transform the raw metrics for each module version into their corresponding relative complexity values, the means and the standard deviations of the baseline metrics must be computed. These are shown in Table 3. These values will be used to transform the raw metric values of all versions of all modules to their baselined z-score values. The last four columns of Table 3 contain the actual transformation matrix that will map the metric z-score values onto their orthogonal equivalents to obtain the orthogonal domain metric values used in the computation of relative complexity. Finally, the eigenvalues for the four domains are presented in the last row of this table.

Table 3. Baseline transformation data.

    Metric       x̄       s        Domain 1  Domain 2  Domain 3  Domain 4
    Stmts        11.37    7.79     0.10      -0.02     0.26      0.05
    LSS          25.18    27.08    0.13      0.00      0.04      -0.09
    N2           79.59    129.08   0.13      0.02      -0.17     -0.08
    N1           68.24    115.72   0.13      0.02      -0.17     -0.09
    η1           1.32     0.54     0.00      -0.07     0.54      -0.16
    AveSpan      4.77     6.19     0.12      0.01      -0.03     0.07
    V(g)         1.48     1.58     0.10      -0.01     0.17      0.30
    η2           0.00     0.05     0.01      0.00      0.06      0.88
    Depth        162.05   515.83   -0.01     0.17      0.07      -0.02
    LOC          19.05    30.14    0.03      0.16      0.07      -0.02
    Cmt          34.19    124.24   -0.01     0.17      0.09      -0.01
    PSS          139.27   452.48   0.00      0.16      0.10      0.00
    CmtWds       16.61    20.44    0.14      0.01      -0.07     -0.05
    NonEx        17.52    23.50    0.14      0.01      -0.07     -0.04
    Blk          108.80   372.11   -0.01     0.17      0.06      -0.02
    P/R          7.36     22.84    -0.01     0.16      0.10      0.00
    Vl           5.75     8.26     0.12      0.02      -0.11     0.06
    AveDepth     9.00     4.40     0.07      -0.06     0.40      -0.11
    Eigenvalues                    6.832     5.457     1.882     1.082
Table 3, then, contains all of the essential information needed to obtain baselined relative complexity values for any version of any module relative to the baseline build. As an aside, it is not necessary that the baseline build be the initial build. As a typical system progresses through hundreds of builds in the course of its life,
it is well worth re-establishing a baseline closer to the current system. In any event, these baseline data are saved by the PCA-RCM tool for use in later computation of metric values. Whenever the tool is invoked referencing the baseline data, it will automatically use these data to transform the raw metric values given to it.

In relating the number of faults inserted in an increment to measures of a module's structural change, we had only a small number of observations with which to work. Problem reports could not be consistently traced back to source code, and there were numerous modules for which UX-Metric did not report measurements. The net result was that, of the over 100 faults that were initially identified, there were only 35 observations in which a fault could be associated with a particular increment of a module and with that increment's measures of code delta and code churn. For each of the 35 modules for which there were viable fault data, there were three data points. First, we had the number of injected faults for that module that were the direct result of changes that had occurred on that module between the current version, which contained the faults, and the previous version, which did not. Second, we had code delta values for each of these modules from the current to the previous version. Finally, we had code churn values derived from the code deltas.

Linear regression models were computed for code churn and code deltas with code faults as the dependent variable in both cases. Both models were built without constant terms, in that we surmise that if no changes were made to a module, then no new faults could be introduced. The results of the regression between faults and code deltas were not at all surprising. The squared multiple R for this model was 0.001, about as close to zero as you can get. This result is directly attributable to the non-linearity of the data. Change comes in two flavors: change may increase the complexity of a module, or change may decrease it. Faults, on the other hand, are related not to the direction of the change but to its intensity. Removing masses of code from a module is just as likely to introduce faults as adding code to it.

The regression model between code churn and faults is dramatically different. The regression ANOVA for this model is shown in Table 4. Whereas code deltas do not show a linear relationship with faults, code churn certainly does. The actual regression model is given in Table 5. In Table 6 the regression statistics are reported. Of particular interest is the squared multiple R term, which has a value of 0.649. This means, roughly, that the regression model will account for more than 65% of the variation in the faults of the observed modules based on the values of code churn.

Table 4. Regression analysis of variance.

    Source      Sum-of-Squares  DF  Mean-Square  F-Ratio  P
    Regression  331.879         1   331.879      62.996   0.00
    Residual    179.121         34  5.268
Table 5. Regression model.

    Effect  Coefficient  Std Error  t      P(2-Tail)
    Churn   0.576        0.073      7.937  0.000
Table 6. Regression statistics.

    N   Multiple R  Squared Multiple R  Standard Error of Estimate
    35  0.806       0.649               2.296
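A regression through the origin of the kind reported in Tables 4-6 can be reproduced with ordinary least squares and no constant term. The sketch below uses hypothetical observations in place of the study's 35 data points; note that for a no-intercept model the squared multiple R is conventionally computed about zero.

    import numpy as np

    # Hypothetical (churn, faults) observations standing in for the study's 35.
    churn = np.array([1.2, 4.5, 0.3, 7.8, 2.2])
    faults = np.array([1.0, 3.0, 0.0, 5.0, 1.0])

    b1, *_ = np.linalg.lstsq(churn[:, None], faults, rcond=None)
    pred = churn * b1[0]
    ss_res = float(((faults - pred) ** 2).sum())
    r2 = 1.0 - ss_res / float((faults ** 2).sum())  # R^2 about zero (no intercept)
    print(b1[0], r2)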
Of course, it may be the case that both the amount of change and the direction in which the change occurred affect the number of faults inserted into the system. The linear regression through the origin shown in Tables 7-9 illustrates this particular regression model. Tables 5 and 8 contain our estimates for the constants relating the rate of fault insertion to the measured structural change, as measured by code churn and code delta. We see that the model incorporating code delta as well as code churn performs significantly better than the model incorporating code churn alone, as measured by squared multiple R and the mean sum of squares.

Table 7. Regression analysis of variance.

    Source      Sum-of-Squares  DF  Mean-Square  F-Ratio  P
    Regression  367.247         2   183.623      42.153   0.00
    Residual    143.753         33  4.356
Table 8. Regression model.

    Effect  Coefficient  Std Error  t      P(2-Tail)
    Churn   0.647        0.071      9.172  0.00
    Delta   0.201        0.071      2.849  0.00
Table 9. Regression statistics.

    N   Multiple R  Squared Multiple R  Standard Error of Estimate
    35  0.848       0.719               2.08
We evaluated the predictive ability of the regression models by performing a cross-validation of a specific type: excluding one observation at a time and examining the prediction made with the remaining observations. For our set of 35 observations, 35 different predictions were therefore made for each regression model. Tables 10 and 11 summarize the cross-validation results for the two linear regression models through the origin specified in Tables 4-9.
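The cross-validation just described is a leave-one-out procedure. A minimal Python sketch appears below; the statistics it returns are the two summarized in Tables 10 and 11, and the data layout (X as a column matrix of churn, or of churn and delta) is our assumption.

    import numpy as np

    def loo_prediction_stats(X, y):
        # For each observation, fit y = X b through the origin on the other
        # n-1 observations, predict the held-out one, and record the squared
        # residual and the predicted/observed ratio.
        n = len(y)
        sq_resid, ratio = [], []
        for i in range(n):
            keep = np.arange(n) != i
            b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            pred = float(X[i] @ b)
            sq_resid.append((y[i] - pred) ** 2)
            ratio.append(pred / y[i] if y[i] != 0 else np.nan)
        return np.array(sq_resid), np.array(ratio)

    # Churn-only design matrix (one column); a churn-and-delta model would
    # simply use a two-column X.
    churn = np.array([1.2, 4.5, 0.3, 7.8, 2.2])
    faults = np.array([1.0, 3.0, 0.0, 5.0, 1.0])
    sq, ratios = loo_prediction_stats(churn[:, None], faults)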
Table 10. Predicted squared residuals for linear regressions.

    Model                                      Mean  Variance  Minimum  Maximum  25th %tile  50th %tile  75th %tile
    d̂^{j,j+1} = b_1 ∇^{j,j+1}                 5.43  124.91    0.003    51.02    0.752       1.000       3.840
    d̂^{j,j+1} = b_1 ∇^{j,j+1} + b_2 Δ^{j,j+1}  4.68  69.66     0.03     36.27    0.799       1.479       3.876
Table 11. Ratio of predicted faults to observed faults for linear regressions.

    Model                                      Mean   Variance  Minimum  Maximum  25th %tile  50th %tile  75th %tile
    d̂^{j,j+1} = b_1 ∇^{j,j+1}                 0.899  1.334     0.00     5.03     8.64E-2     0.508       1.136
    d̂^{j,j+1} = b_1 ∇^{j,j+1} + b_2 Δ^{j,j+1}  0.911  1.164     0.00     4.03     6.69E-2     0.463       1.447
For each of these models, Tables 10 and 11 show statistics for:

• Predicted squared residuals. For each observation, a regression model is formed that excludes that observation. The resulting model is then used to predict the number of faults inserted for the excluded observation. This prediction is subtracted from the number of faults actually observed for the excluded observation, and the residual is squared, thereby forming the predicted squared residual.

• Ratio of predicted number of faults to observed number of faults, where predictions are made for excluded observations. For each excluded observation, a prediction is made as described above, and the ratio of that prediction to the actual number of faults is then formed.

Figures 1 and 2 are histograms that present additional information to that given in Tables 10 and 11. Looking at Table 10, we see that the regression model that includes both code delta and code churn has the lowest values for the mean and the variance of the predicted squared residuals. This is also shown in Fig. 1. In addition, Table 10 shows that the two-parameter model that includes both code churn and code delta has a slightly smaller difference between the points at the 25th and 75th percentiles. Table 11 shows that the mean value of the predictions made by the model which includes both code delta and code churn comes closer to predicting the number of faults observed. Table 11 also shows that the model which includes both code churn and code delta has the lowest variance for this ratio of predicted to actual values. This can be seen in Fig. 2, which shows that the regression depending only on code churn has a higher variability for this ratio than the regression which includes both
Fig. 1. Histograms of predicted squared residuals for excluded observations (series: Code Churn; Code Delta and Code Churn; x-axis: predicted squared residuals).
Fig. 2. Histograms of ratio of predicted to observed number of faults for excluded observations (series: Code Churn; Code Delta and Code Churn).
code delta and code churn. However, the range between the points at the 25th and 75th percentiles is the highest for the two-parameter model. Table 12 shows the results of the Wilcoxon signed ranks test, as applied to the predictions for the excluded observations and the number of faults observed for each of the regression models. We see that about two thirds of the estimates tend to be less than the number of faults observed. We can also plot the predicted residuals against the actual number of observed faults for each of the two linear regression models. These plots are shown in Figs. 3 and 4.
Table 12. Wilcoxon signed ranks test for linear regressions through the origin.

    Sample Pair                         N      Mean Rank  Sum of Ranks  Test Statistic Z  Asymptotic Significance (2-tailed)
    Observed Faults; Churn only
      − Ranks                           25(a)  17.52      438.00        −2.015(d)         0.044
      + Ranks                           10(b)  19.20      192.00
      Ties                              0(c)
      Total                             35
    Observed Faults; Churn and Delta
      − Ranks                           24(a)  16.92      406.00        −1.491(d)         0.136
      + Ranks                           11(b)  20.36      224.00
      Ties                              0(c)
      Total                             35

    a. Faults > regression model predictions
    b. Faults < regression model predictions
    c. Faults = regression model predictions
    d. Based on positive ranks
The results of the Wilcoxon signed ranks tests, as well as Figs. 3 and 4, indicate that the predictive accuracy of the regression models might be improved if syntactic analyzers capable of measuring additional aspects of a software system's structure were available. Finally, we investigated whether the linear regression model which uses code churn alone is an adequate predictor at a particular significance level when compared
Fig. 3. Predicted residuals versus number of observed faults for linear regression with churn (Defects = b1*Churn; x-axis: number of observed defects, versions 2.0, 2.1a, and 2.1b).
Fig. 4. Predicted residuals versus number of observed faults for linear regression with churn and delta (Defects = b1*Churn + b2*Delta; x-axis: number of observed defects, versions 2.0, 2.1a, and 2.1b).

to the model using both code churn and code delta. We used the R²-adequate test [MacDonell et al. (1997), Neter et al. (1983)] to examine the linear regression models through the origin and determine whether the model that depends only on code churn is an adequate predictor. A subset of predictor variables is said to be R²-adequate at significance level α if

    R²_sub > 1 − (1 − R²_full)(1 + d_{n,k}) ,

where d_{n,k} = k F(α; k, n − k − 1)/(n − k − 1), n is the number of observations, and k is the number of predictors in the full model.
Table 13. Values of F_{k,n−k−1}, d(n,k), and thresholds for the R²-adequate test.

    k  R²     DF  F_{k,n−k−1} for significance α  d(n,k)  Threshold for significance α
    1  0.649  34  4.139                           0.125   —
    2  0.719  33  3.295                           0.206   0.661
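The threshold in the last column of Table 13 can be reproduced directly from the definition of the R²-adequate test. A small check in Python, assuming SciPy's F quantile function and α = 0.05:

    from scipy.stats import f

    n, k_full, r2_full = 35, 2, 0.719
    F_crit = f.ppf(0.95, k_full, n - k_full - 1)  # F(2, 32), roughly 3.295
    d = k_full * F_crit / (n - k_full - 1)        # d(n, k), roughly 0.206
    threshold = 1 - (1 - r2_full) * (1 + d)       # roughly 0.661
    print(0.649 > threshold)  # False: the churn-only model is not R^2-adequate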
Table 13 shows that the value of squared multiple R for the regression using only code churn is 0.649, while the 5% significance threshold for the code churn and code delta model is 0.661. This means that the regression model using only code churn is not R²-adequate when compared to the model using both code churn and code delta. Although the amount of change occurring between subsequent revisions appears to be the primary factor determining the number of faults inserted, the direction of that change also appears to be a significant factor.

8. Testing Objectives

Deterministically testing a large software system is virtually impossible. Even trivial systems, on the order of 20 or 30 modules, often have far too many possible execution paths for complete deterministic testing. This being the case, we must revisit what we hope to accomplish by testing the system. Is our goal to remove all of the faults within the code? If this is our goal, how do we know when we have found them all? What is it worth, in terms of expense, to try to find one more fault? Given unlimited time and resources, identification and removal of all faults might be a noble goal, but real-world constraints make this largely unattainable. The problem is that we must provide an adequate level of reliability in light of the fact that we cannot find and remove all of the faults. Through the use of software measurement, we hope to identify which modules contain the most faults and, based on execution profiles of the system, how these potential faults can impact software reliability. The idea is that a fault that never executes never causes a failure, whereas a fault that lies along the path of normal execution will cause frequent failures. The majority of the testing effort should be spent finding those faults that are most likely to cause failure. The first step towards this testing paradigm is the identification of those modules that are likely to contain the most faults.

The objectives of the software test process are not clearly specified and sometimes not clearly understood. An implicit objective of a deterministic approach to testing is to design a systematic and deterministic test procedure that will guarantee sufficient test exposure for the random faults distributed throughout a program. By ensuring, for example, that all possible paths have been executed, any potential faults on these paths will have had the opportunity to be expressed. We must, however, come to accept the fact that some faults will always be present in the code; we will not be able to eliminate them all. The objective of the testing process should be to find those faults that will have the greatest impact on the safety/survivability of the code. Under this view of the software testing process, the act of testing may be thought of as conducting an experiment on the behavior of the code under typical execution conditions. We will determine, a priori, exactly what we wish to learn about the code in the test process, and conduct the experiment until this stopping condition has been reached.
To know the loci of probable faults in a complex software system is not a sufficient condition for reliability modeling. A software system may be viewed as a set of program modules that execute a set of mutually exclusive functions. If the system executes a functionality expressed by a subset of modules that are fault free, it will never fail. If, on the other hand, the system is executing a functionality expressed in a subset of fault-laden modules, there is a very high probability that it will fail. Thus, failure probability is dependent upon the input data sets, which drive the system into regions of code (i.e., functionalities) of differing complexities (i.e., fault proneness). Each software test suite implements a subset of functionalities, and as each test is run to completion it generates a test execution profile which represents the results of the execution of one or more functions.

When a program begins the execution of a particular functionality, we can describe this beginning as the start of a stochastic process. For the system S there is a call tree that shows the transition of program control from one program module to another. This transition can be modeled as a stochastic process, in which we define an indexed collection of random variables {X_t}, where the index t runs through a set of non-negative integers, t = 0, 1, 2, ..., representing the epochs of the process. At any particular epoch the software is found to be executing exactly one of its M modules; the fact of the execution occurring in a particular module is a state of the system. For a given software system, it may be found in exactly one of a finite number of mutually exclusive and exhaustive states 1, 2, ..., M. In this representation of the system, there is a stochastic process {X_t}, where the random variables are observed at epochs t = 0, 1, 2, ... and where each random variable may take on any one of the M integers from the state space A = {1, 2, ..., M}.

The probability that a particular module may execute is a conditional probability. Let F denote the set of functionalities, and let Y be a random variable defined on the indices of the elements of F. Then p_i^{(k)} = Pr[X_t = i | Y = k], where k = 1, 2, ..., #{F}, represents the execution profile for a set of modules expressing functionality k exclusively. The distribution of the execution profile is multinomial for a software system consisting of more than two modules. In other words, for each functionality f_k there is an execution profile represented by the probabilities p_1^{(k)}, p_2^{(k)}, ..., p_n^{(k)}.

9. Test Efficiency

The test process for evolving software systems takes on a different measurement aspect than that of new systems. Existing systems are continually being modified as a normal part of the software maintenance activity. Changes will be introduced into such a system based on the need for corrections, adaptations to changing requirements, and enhancements to make the system perform faster or better. The precise effect of changes to software modules, in terms of the number of latent faults, is now reasonably well understood. From a statistical testing perspective, test effort should be focused on those modules that are most likely to contain faults. Each program module that
has been modified, then, should be tested in proportion to the number of anticipated faults that might have been introduced into it. Each program module is usually closely linked to a specific functionality; that is, as we exercise a particular functionality, a distinct execution profile emerges for that functionality. For each functionality, some modules have a high probability of being executed, while others have a low probability. Each test suite will express one or more of these functionalities. The execution profile generated by the kth test may be characterized by the probability distribution P^{(k)} = {p_i^{(k)} | 1 ≤ i ≤ n}. In the face of the evolving nature of the software system, the impact of a single test may change from one build to the next.

Each program module has a relative complexity value, and this relative complexity is a fault surrogate: the larger the value of the relative complexity, the greater the fault potential of the module. If a given module has a large fault potential but limited exposure (a small profile value), then the functional complexity of that module is also small. Our objective during the test phase is to maximize our exposure to the faults in the system. Another way to say this is that we wish to maximize the functional complexity, φ, given by

    φ = Σ_{j=1}^{n} ρ_j^i p_j^{(k)} ,

where ρ_j^i is the relative complexity of the jth module on the ith system build and p_j^{(k)} is the test profile of the kth test suite.

The initial phase of the efficient testing of changed code is to identify the functionalities that will exercise the modules that have changed. Each of these functionalities so designated will have an associated test suite designed to exercise that functionality. With this information it is now possible to describe the efficiency of a test from a mathematical/statistical perspective. A regression test is one specifically tailored to exercise the functionalities that will cause the changed modules to be executed. A regression test will be efficient if it does a good job of exercising changed code. It is worth noting, however, that a regression test that is efficient on one build may be inefficient on a subsequent build. The efficiency of a regression test between builds i and j, then, is given by the following formula:

    τ = Σ_{a=1}^{m} p_a^{(k)} χ_a^{i,j} ,
where m represents the cardinality of {M_c ∪ M_b} as defined earlier. In this case, τ is simply the expected value of the code churn under the profile P^{(k)}. This concept of test efficiency permits the numerical evaluation of a test on the actual changes that have been made to the software system. It is simply the expected value of the fault exposure from one release to another under a particular test. If the value of τ is large for a given test, then the test will have exercised
the changed modules. If the set of τ's for a given release is low, then it is reasonable to suppose that the changed modules have not been tested in proportion to the number of probable faults that were introduced during the maintenance changes. For practical purposes, we need to know something about the upper bound on test efficiency: if we were to execute the best possible test, what would the value of test efficiency be? A best regression test is one that will spend the majority of its time in the modules that have changed the most from one build to the next. Let

    χ = Σ_{a=1}^{n} χ_a^{i,j} .
This is the total code churn between builds i and j. To exercise each module in proportion to the change that has occurred in the module during its current revision, we compute this proportion as

    q_a = χ_a^{i,j} / χ .

This computation yields a new hypothetical profile, called the best profile. That is, if all modules were executed in proportion to the amount of change that they had received, we would theoretically have maximized our exposure to software faults that may have been introduced. Finally, we seek to develop a measure that relates well to the difference between the actual profile that is generated by a test and the best profile. To this end, consider the term |p_a − q_a|, the absolute difference between the best profile and the actual profile. This value has a maximum of 1 and a minimum of 0; the minimum is achieved when the best and actual coverage of the module are identical. A measure of the total coverage for a set of modules (task or program) is then

    Coverage = 10 − 5 Σ_a |p_a − q_a| .

This coverage value has a maximum value of 10 when the best and the actual profiles are identical, and 0 when there is a complete mismatch of profiles.
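The efficiency and coverage computations above reduce to a few lines of code. The following sketch, with hypothetical churn values and a hypothetical test profile, computes τ, the best profile q, and the 0-10 coverage measure; the ratio of actual to best coverage used later as a percent coverage index follows the same pattern.

    import numpy as np

    def test_efficiency_and_coverage(churn, actual_profile):
        # churn:          per-module code churn chi_a between builds i and j
        # actual_profile: execution profile p_a observed for one test (sums to 1)
        churn = np.asarray(churn, dtype=float)
        p = np.asarray(actual_profile, dtype=float)
        tau = float(p @ churn)       # expected churn under the test profile
        q = churn / churn.sum()      # best profile: exercise change proportionally
        coverage = 10 - 5 * float(np.abs(p - q).sum())
        return tau, coverage

    # Hypothetical three-module example: the test largely avoids the one
    # module that changed, so coverage is poor.
    tau, cov = test_efficiency_and_coverage([40, 5, 5], [0.05, 0.50, 0.45])
    print(tau, cov)   # 6.75, 2.5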
10. Regression Test Results

The following discussion documents the results of the execution of 36 instrumented tasks on two sequential builds of a large embedded software system. The perspective of this discussion is strictly from the standpoint of regression testing: certain program modules have changed across the two sequential builds, and the degree of this change is measured by code churn. As was clearly demonstrated on the Cassini spacecraft project, the greater the change in a program module, the greater the likelihood that faults will have been introduced into the code by the change.
Table 14. Code churn, best and actual profiles, and best and actual coverage, by task.
All of the tasks in the system were instrumented with our Clic 1.0 tool. This tool permitted us to count the frequency of execution of each module in each of the instrumented tasks and thus obtain the execution profiles of these tasks for each of the tests. The execution profiles show the distribution of activity in each module of the instrumented tasks. For each of the modules, the code churn measure was computed. The code churn values for each module reflect the degree of change to the module during the most recent sequence of builds. The cumulative churn values for all tasks are shown in the second column of Table 14. A churn value of zero indicates that the module in question received no changes during the last build sequence; a large churn value (> 30) indicates that the module in question received substantial changes.

For the subsequent analysis, two profile values for each test will be compared. The actual profile is the actual execution profile of each test. The best profile is the best hypothetical execution profile, given that each module would be tested in direct proportion to its churn value; that is, a module whose churn value was zero would receive little or no activity during the regression test process. From Table 14 we can see that the A and B tasks have received the greatest change activity. Associated with each task entry in this table are the best profile and the actual profile for the task across all tests. The last row in the table gives the total values of code churn for all tasks. The last two columns of this table contain the expected value of the code churn of the task under the best profile and also under the actual profile; these columns are labeled Best Coverage and Actual Coverage. The total expected value for code churn under the best profile is 1311; the total expected value for code churn under the actual profile is 89. The tests spent a disproportionate amount of time in modules that had not changed during this build interval.

The ratio of Total Actual Coverage to Total Best Coverage will yield a percent coverage index for the task, for the system, or for the test, depending on the granularity of the summary. The change coverage index was computed by module for each task and then for the total system. In Fig. 5, these coverage data are presented for the total system and for Tasks A, B, D, and E. For this figure, the values have been scaled onto the interval from 0 to 10. Had there been perfect best coverage, the total value would have been 10. The coverage values for the A and B tasks were the best out of all tasks. The E and D tasks, while having relatively high code churn values, did not fare so well. The test coverage of the D task was typical of the total system, shown as the rightmost entry in this figure.

We would now like to look within a task to see why the A and B tasks showed better coverage than other tasks. The difference between the best profile and the actual profile for Task A is shown in Fig. 6. Here, if the line is negative, the module in question was exercised well out of proportion to the possible faults that it contained; if the line is positive, the module in question was not exercised in proportion to the faults that it might contain. A perfect line on this chart would be perfectly straight at zero on the profile axis.
Fig. 5. Change coverage index (x-axis: Tasks A, B, E, D, and Total).
Fig. 6. Difference between best and actual profile for Task A.
A similar graph for Task B is shown in Fig. 7. Here we can see that almost all test activity fell on three distinct program modules (the negative values); the code that was changed was not exercised by this test to any large extent. Table 15 summarizes the performance of the best 24 of a suite of 115 instrumented tests. Only those tests whose performance index exceeded 10% of the theoretical total are shown here. Again, the performance index shown in this table was computed by forming the ratio of the actual profile to the best profile for that test. It must be remembered that not all tests will exercise all modules. The performance index
Fig. 7. Difference between best and actual profile for Task B.
Table 15. Individual test summaries.

    Test #  Percent Coverage    Test #  Percent Coverage
    28      20.6                177     11.7
    18      19.0                31      11.6
    14      18.2                3       11.5
    12      16.9                167     11.5
    47      14.8                59a     11.4
    49      14.8                2       11.3
    169     14.7                159     11.3
    156     13.2                1       10.9
    20      13.1                38      10.8
    39      12.9                180     10.7
    9       12.2                33      10.6
    158     12.2                137     10.2
is computed only for those modules whose functionality was included in the test. From a regression test perspective, we now know that we have a testing problem: none of these tests does a really good job of executing the code most likely to contain the newly introduced faults.

11. Summary

There is a distinct and strong relationship between software faults and measurable software attributes. This is in itself not a new result or observation. The most interesting result of this current endeavor is that we also found a strong association
between the fault insertion process over the evolutionary history of a software system and the degree of change taking place in each of the program modules. We also found that the direction of the change had an effect on the number of faults inserted. Some changes will have the potential of introducing very few faults, while others may have a serious impact on the number of latent faults. Different numbers of faults may be inserted, depending upon whether code is being added to or removed from the system.

In order for the measurement process to be meaningful, the fault data must be very carefully collected. In this study, the data were extracted ex post facto in a very labor intensive effort. Since fault data cannot be collected with the same degree of automation as much of the data on software metrics being gathered by development organizations, material changes in the software development and software maintenance processes must be made to capture these fault data. Among other things, a well defined fault standard and fault taxonomy must be developed and maintained as part of the software development process, and all designers and coders should be thoroughly trained in its use. A viable standard is one that may be used to classify any fault unambiguously. A viable fault recording process is one in which any one person will classify a fault exactly the same way as any other person.

Finally, the whole point of measuring the fault insertion process is its ultimate value as a measure of software process. The software engineering literature is replete with examples of how software process improvement can be achieved through the use of some new software development technique. What is almost absent from the same literature is the controlled study needed to validate that the new process is in fact an improvement. The techniques developed in this study can be implemented in a development organization to provide a consistent method of measuring fault content and structural evolution across multiple projects over time. The initial estimates of fault insertion rates can serve as a baseline against which future projects can be compared, to determine whether progress is being made in reducing the fault insertion rate and to identify those development techniques that seem to provide the greatest reduction.

Software test is not an intuitive process. Different modules are changed between builds. A regression test that was satisfactory for one build might well be totally inadequate on a subsequent build. When a program is subjected to numerous test suites to exercise differing aspects of its functionality, the test risk of the system will vary greatly as a result of the execution of these different test suites. Intuitively, and empirically, a program that spends a high proportion of its time executing a module set of high relative complexity will be more failure prone than one driven to executing program modules with low complexity values. Thus, we need to identify the characteristics of test scenarios that cause our criterion measures of χ and τ to be large. The importance of this research is that we can now have a clearer understanding of how to quantify and evaluate the effectiveness of the regression testing process.
For this study, we were not able to perform an analysis of test effectiveness on the same system for which we estimated the rate of fault insertion. We are currently working with NASA and commercial software development efforts to apply both types of analysis to the same project, with the goal of improving our ability to estimate the number of faults remaining in the system after the completion of a test sequence and to allocate them among those portions of the system that have changed since the last increment.

Acknowledgments

The research described in this paper was carried out at the University of Idaho and the Jet Propulsion Laboratory, California Institute of Technology. The work at the University of Idaho was partially supported by a grant from the National Science Foundation. Portions of the work performed at JPL were sponsored by the U.S. Air Force Operational Test and Evaluation Center (AFOTEC) and the National Aeronautics and Space Administration's IV&V Facility.
Chapter 15

MODELING SOFTWARE QUALITY WITH CLASSIFICATION TREES

TAGHI M. KHOSHGOFTAAR and EDWARD B. ALLEN
Florida Atlantic University, Boca Raton, Florida, USA
1. Introduction

High software quality is essential for mission-critical systems. However, assuring high quality often entails time-consuming and costly development processes, such as more rigorous design and code reviews, automatic test-case generation, more extensive testing, strategic assignment of key personnel, and reengineering of high-risk portions of a system. One cost-effective strategy is to target enhancement activities to those software modules that are most likely to have problems [Hudepohl et al. (1996a)]. A software quality model based on empirical data can be used to guide enhancement efforts.

Following the terminology of the classification tree statistics literature, a typical model consists of an algorithm in which a single response variable is a function of a set of predictor variables. Predictor variables can be measured earlier in the development life cycle than the response variable, whose value is then predicted. A software fault is a defect in an executable product that causes a software failure [Lyu (1996)]. For example, whether a module has excessive faults detected during operations could be the response variable. Given the predictors' values prior to release, the class of each module can be predicted, and enhancement efforts can then be targeted to those modules that are most at risk.

Measures of product attributes are called software product metrics, and measures of attributes of software development processes are called software process metrics [Fenton and Pfleeger (1997)]. Prior research by Khoshgoftaar et al. [Khoshgoftaar and Allen (1995b), Khoshgoftaar and Allen (1998), Khoshgoftaar et al. (1996c), Lanning and Khoshgoftaar (1995), Munson and Khoshgoftaar (1996), Szabo and Khoshgoftaar (1995)], and others [Arthur and Henry (1995), Fenton and Pfleeger (1997), Oman and Pfleeger (1997)], has shown that software product and process metrics can be the basis for quality predictions.
A project's developmental history can be captured by information systems. Many software development organizations have very large data bases for configuration management and for problem reporting, which capture data on events during development. Such data bases are potential sources of new information relating software quality factors to the attributes of software products and the attributes of their development processes. For large legacy systems or product lines, the amount of available data can be overwhelming. The combination of numerous attributes of software products and processes, very large data bases designed for other purposes, and weak theoretical support [Kitchenham and Pfleeger (1996)] mandates an empirical approach to software quality prediction, rather than a strictly deductive approach [Khoshgoftaar et al. (2000)].

Fayyad (1996) defines knowledge discovery in data bases as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Given a set of large data bases or a data warehouse, the major steps of the knowledge discovery process are [Fayyad et al. (1996)]: (1) selection and sampling of data; (2) preprocessing and cleaning of data; (3) data reduction and transformation; (4) data mining; and (5) evaluation of knowledge. Fayyad restricts the term data mining to denote the step of extracting patterns or models from clean, transformed data, for example, fitting a model or finding a pattern. Classification-tree modeling is an acknowledged tool for data mining [Glymour et al. (1996), Hand (1998)]. Knowledge discovery in general, and the data mining step in particular, is focused on finding patterns and models that can be interpreted as useful knowledge [Fayyad et al. (1996)].

Industrial software systems often have thousands of modules, and a large number of variables can be extracted from source code measurements, configuration management data, and problem reporting data. The result is a large amount of multidimensional data to be analyzed by the data mining step. Classification trees can be used as a data mining technique to identify significant and important relationships between faults and software product and process attributes [Khoshgoftaar et al. (1996a), Porter and Selby (1990), Troster and Tian (1995)].

This paper introduces the Classification And Regression Trees (CART) algorithm [Breiman et al. (1984)] to software engineering practitioners. A "classification tree" is an algorithm, depicted as a tree graph, that classifies an input object. Alternative classification techniques used in software quality modeling include discriminant analysis [Khoshgoftaar et al. (1996b)], the discriminative power technique [Schneidewind (1995)], logistic regression [Basili et al. (1996)], pattern recognition [Briand et al. (1992)], artificial neural networks [Khoshgoftaar and Lanning (1995)], and fuzzy classification [Ebert (1996)]. A classification tree differs from these in the way it models complex relationships between class membership and combinations of variables. CART automatically builds a parsimonious tree by first building a maximal tree and then pruning it to an appropriate level of detail. CART is attractive because it emphasizes pruning to achieve robust models. Although Kitchenham briefly
reports using CART to model software project productivity [Kitchenham (1998)], to our knowledge CART has seldom been used to model software quality. This paper presents practical lessons learned on building classification trees for software quality modeling. Preliminary results indicated that CART can be useful for software quality modeling [Khoshgoftaar et al. (1998c), Khoshgoftaar et al. (1998d)]. A case study of a very large telecommunications system used CART to build software quality models [Naik (1998)], focusing on problems discovered in the field by customers. The models predicted whether or not modules were fault-prone, based on various sets of software product and process metrics as predictor variables. The remainder of this paper presents background on software quality modeling, background on classification trees (including a summary of the CART algorithm), our modeling methodology, details of our case study, and conclusions.

2. Software Quality Modeling

Due to the high cost of correcting problems discovered by customers, the goal of our modeling is identification of fault-prone modules early in development. Software quality models are tools for focusing efforts to find faults. Such models yield timely predictions on a module-by-module basis, enabling one to target high-risk modules. The field of software metrics assumes that characteristics of software products and development processes strongly influence the quality of the released product, and its residual faults in particular. The more complex the product is, the more likely it is that developers will make mistakes that cause faults and consequent failures. Since product characteristics can be measured earlier than quality, software metrics can guide improvements to quality before the product is released.

Software metrics research has emphasized models based on software product metrics. Commonly measured software abstractions include call graphs, control flow graphs, and statements. For example, fan-in and fan-out [Myers (1978)] are attributes of a node in a call graph, where each node is an abstraction of a module and each edge represents a call from one module to another. Many software product metrics are attributes of a control flow graph in which the nodes represent decision statements or branch destinations and the edges represent potential flow of control. McCabe's cyclomatic complexity is one of the best known in this category [McCabe (1976)]. Lines of code is the best known statement metric. Other examples are Halstead's counts of operators and operands [Halstead (1977)]. Commercially available "code analyzers" measure more than fifty static software product metrics at a time.

Software process metrics can be derived from project management, problem reporting, and configuration management systems [Henry et al. (1994)]. Process metrics are especially important for legacy systems and developments with significant reuse [Evanco and Agresti (1994)]. For example, preliminary empirical research on a military system found that software quality models based on process metrics alone had similar accuracy to models based on the combination of product metrics and process metrics [Khoshgoftaar et al. (1998a)]. Case studies by Khoshgoftaar
et al. found that reuse [Khoshgoftaar et al. (1996b), Khoshgoftaar et al. (1996c)], the history of corrected faults [Khoshgoftaar et al. (1998a)], and the experience of programmers [Khoshgoftaar et al. (1999)] were significantly associated with faults.

A recent special issue of IEEE Software on software measurement provides a snapshot of the state of the art in the measurement and software quality modeling fields [Pfleeger (1997)]. The corpus of software-measurement and software quality-related literature indicates that software metrics and related models are becoming an integral part of recommended development and maintenance practices. Measurements and models are the means for understanding, controlling, and improving development processes [Pfleeger (1997)]. Successful measurement programs in development organizations are inextricably linked to empirical models that are clearly related to business goals [Offen and Jeffery (1997)]. Software developers are gradually embracing software quality modeling in mainstream practice.

The multitude of software product and process metrics can be overwhelming. Some try to identify one metric that is the key to reliability. However, due to the complexity of software products and processes, no single metric can consistently give accurate predictions. Predicting software reliability is a multivariate problem [Khoshgoftaar and Allen (1995a)]. Instead of relying on programmer intuition, software quality models use computational intelligence and statistical modeling techniques to predict a reliability indicator that developers and managers are familiar with. A software quality model is developed using measurements and fault data from a past release. The calibrated model is then applied to modules currently under development.

Enhanced Measurement for Early Risk Assessment of Latent Defects (EMERALD) is an example of a system that includes software quality models [Hudepohl et al. (1996b)]. It is a sophisticated system of decision support tools used by software designers and managers to assess risk and improve software quality [Hudepohl et al. (1996a)]. It was developed by Nortel (Northern Telecom) in partnership with Bell Canada and others. EMERALD provides access to software product metrics, deployment usage, fault histories, other software process metrics, and software quality models based on those metrics. At various points in the development process, EMERALD's software quality models predict which modules are likely to be fault-prone, based on available measurements. Lyu et al. (1995) report on a prototype with similar objectives.

Systems such as EMERALD are the key to improved software quality. For example, inspections are a standard part of a development process [Russell (1991)]. Rather than inspecting all modules on an equal basis, focusing on high-risk modules can improve the efficiency of inspection efforts. Suppose EMERALD's models indicate that certain modules are at risk when changed. After a review, the design team may choose to reengineer these modules before continuing their implementation, so that their risk of faults is reduced. By integrating risk predictions with review processes, designers gain immediate feedback on the quality of their products. This reduces the cost of quality problems through early detection.
Moreover, during maintenance, any changes proposed for high-risk modules may require more stringent justification and inspection. Risk assessments can also be used for assigning technical staff, prioritizing test automation, and matching skill level with complexity during maintenance.

The prerequisite infrastructure for a system such as EMERALD includes a strong configuration management system, a flexible source code library system, a thorough problem reporting system, and widespread networking of the desktop systems of developers. Metrics and models must be delivered to the developers in an attractive manner so they will actually use them in their day-to-day work. EMERALD offers periodic measurement that is highly automated and does not interfere with development activities. An extensive history data base is accessible through a graphical user interface or a company-private World Wide Web page, and a client-server architecture. Interfaces to the support systems make the process of collecting data unintrusive. An inviting user interface facilitates acceptance of the system by the developers, so that software metrics and models are integrated into day-to-day development activities.

3. Classification Trees

In our application, an object is a software module, predictors are software metrics, and the response variable is class membership. We model each object with a set of ordinal predictors and a nominal response variable with two categories: not fault-prone or fault-prone, abbreviated nfp and fp, respectively.

3.1. Using a Classification Tree: An Example
A classification tree for software quality modeling represents an algorithm as an abstract tree of decision rules that classify a module as a member of the not fault-prone group or the fault-prone group. Figure 1 depicts a classification tree created in our case study, described below. Each diamond node represents a decision, and each edge represents a possible result of that decision. Each circular node is a leaf that classifies a module into the group noted at the bottom. The root of the tree is the node at the top.

Suppose we have measurements on one software module. For example, one metric could be the number of unique file-include directives in the code, FILINCUQ. Because include files ("header files") are often used to implement interfaces and data abstractions, this is an attribute of the high-level design. Another metric could be the total span of variables, VARSPNSM, where the span of one variable is the number of lines of code between the first and last use of the variable in a procedure. This is a measure of locality of reference; it is an attribute of the coding. Given measurements on a module and beginning at the root node, the algorithm traverses a downward path in the tree, one node after another, until it reaches a leaf node. At each decision node, the current decision is applied to one measurement. For example, in Fig. 1, Node 1 examines the module's measurement of FILINCUQ.
Fig. 1. Example tree based on product metrics.
If FILINCUQ < 27, then the algorithm chooses the left edge to a leaf where the module is classified as not fault-prone; few interfaces imply that mistakes discovered by customers are not likely. Otherwise, the algorithm proceeds to Node 2, which is another decision. At Node 2, if FILINCUQ < 49, then the algorithm proceeds to Node 3; otherwise, the algorithm proceeds to the right, since many interfaces imply that customers are more likely to discover mistakes. Node 3 examines VARSPNSM. If VARSPNSM < 15657, then the module is classified as not fault-prone, and otherwise, fault-prone. When there is a medium number of interfaces, locality of reference (VARSPNSM) distinguishes a module's class. The process is repeated for each node along the path: when a decision node is reached, its rule is applied to the module; when a leaf node is reached, the module is classified as not fault-prone or fault-prone, and the path is complete. Each module in a data set can be classified using such an algorithm.

A Type I misclassification occurs when the model identifies a module as fault-prone that is actually not fault-prone. A Type II misclassification occurs when the model identifies a module as not fault-prone that is actually fault-prone.
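For illustration, the decision rules of Fig. 1 can be written directly as executable code. The following Python sketch encodes the three decision nodes just described; the function name and dictionary-based input are our own illustrative choices, not part of the original study.

def classify_module(metrics):
    """Classify a module as 'NFP' (not fault-prone) or 'FP' (fault-prone)
    by walking the decision tree of Fig. 1, one node at a time."""
    # Node 1: few include files imply customer-discovered faults are unlikely.
    if metrics["FILINCUQ"] < 27:
        return "NFP"
    # Node 2: many include files imply customers are more likely to find faults.
    if metrics["FILINCUQ"] >= 49:
        return "FP"
    # Node 3: with a medium number of interfaces, locality of reference
    # (total variable span) distinguishes the two classes.
    return "NFP" if metrics["VARSPNSM"] < 15657 else "FP"

# Example: 30 distinct include files and a small total variable span.
print(classify_module({"FILINCUQ": 30, "VARSPNSM": 12000}))  # prints NFP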
3.2. Building a Classification Tree Using CART
The Classification And Regression Trees (CART) algorithm [Breiman et al. (1984)] builds a classification tree. It is implemented as a supplementary module for the SYSTAT package [Steinberg and Colla (1995)].
Beginning with all modules in the root node, the algorithm recursively partitions ("splits") the set into two leaves until a stopping criterion applies to every leaf node. A goodness-of-split criterion is used to minimize the heterogeneity of each leaf at each stage of the algorithm. CART's default goodness-of-split criterion is the "Gini index of diversity", also called impurity, which is based on probabilities of class membership, p(.) [Breiman et al. (1984)]. The impurity of a node t for the two-class problem is

i(t) = 1 - (p^2(nfp|t) + p^2(fp|t))    (1)

The best split minimizes the impurity. Further splitting is impossible if only one module is in a leaf or if all modules have exactly the same measurements. CART also stops splitting if a node has too few modules (e.g., fewer than 10 modules). The result of this process is typically a large tree. Usually, such a maximal tree overfits the data set and, consequently, is not robust. CART then generates a series of trees by progressively pruning branches from the maximal tree. The accuracy of each size of tree in the series is estimated, and the most accurate tree in the series is selected as the final classification tree.

We use the following two methods for estimating the accuracy of a tree. Resubstitution estimates model accuracy by using the model to classify the same modules that were the basis for building the model, and then calculating misclassification rates; the estimated accuracy can be overly optimistic. v-fold cross-validation is an alternative method which also uses the same modules as were the basis for building the model, but the estimated accuracy is not biased [Efron (1983), Gokhale and Lyu (1997), Lachenbruch and Mickey (1968)]. The algorithm has these steps: randomly divide the sample into v approximately equal subsets (e.g., v = 10); set aside one subset as a test sample, and build a tree with the remaining modules; classify the modules in the test subset and note the accuracy of each prediction; repeat this process, setting aside each subset in turn; and calculate the overall accuracy. This is an estimate of the accuracy of the tree built using all the modules. The number of subsets, v, should not be small; Breiman et al. (1984) found that ten or more worked well.
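To make the goodness-of-split computation concrete, the following sketch computes the impurity of Eq. (1) for a node, searches a single predictor for the best cutoff, and forms the v subsets used in cross-validation. The helper names are our own illustrative choices, not CART itself, which also handles stopping rules, priors, and costs.

from collections import Counter

def gini_impurity(labels):
    """Gini index of diversity, Eq. (1): 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Find the cutoff on one predictor that minimizes the size-weighted
    impurity of the two child nodes."""
    n = len(values)
    best_cut, best_score = None, float("inf")
    for cut in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < cut]
        right = [lab for v, lab in zip(values, labels) if v >= cut]
        if not left or not right:
            continue  # a split must send at least one module each way
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

def v_fold_subsets(n_modules, v=10):
    """Indices of v approximately equal test subsets for v-fold cross-validation."""
    return [list(range(start, n_modules, v)) for start in range(v)]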
CART allows one to specify prior probabilities and costs of misclassifications. These parameters are used to evaluate the goodness-of-split of a node as a tree is recursively generated [Breiman et al. (1984)]. Let pi_fp and pi_nfp be the prior probabilities of membership in the fault-prone and not fault-prone classes, respectively, and let C_I and C_II be the costs of Type I and Type II misclassifications, respectively. Due to the different costs associated with each type of misclassification, we need a way to provide appropriate emphasis on the Type I and Type II misclassification rates according to the needs of the project.(a) We experimentally choose a constant c, which can be interpreted as a priors ratio times a cost ratio:

c = (pi_fp / pi_nfp)(C_II / C_I)    (2)

We have observed a tradeoff between the Type I and the Type II misclassification rates as functions of c. Generally, as one goes down, the other goes up. We estimate these functions of c by repeated calculations with the fit data set. Given a candidate value of c, we build a tree and estimate the Type I and Type II rates using resubstitution and v-fold cross-validation. We repeat for various values of c until we arrive at the preferred c for the project.

(a) In [Khoshgoftaar and Allen (1997)], we propose a generalized classification rule applied to discriminant analysis, which has a similar constant c.

3.3. Other Classification Tree Algorithms
Several other algorithms for building classification trees have appeared in the software engineering literature.

3.3.1. Information Theory-Based Algorithm

Selby and Porter (1988) first preprocessed measurements by merging metric values into categories representing quartiles or octiles of each metric's distribution [Porter and Selby (1990)]. Thus, each predictor had four or eight possible values. Their algorithm recursively partitions the set of modules using an information theory-based criterion. Each decision node consists of a selected predictor, and each child node represents a possible value (category) of that predictor. The algorithm selects the predictor that minimizes the weighted average of the entropy of each possible child subtree. The entropy calculation is based on the probability of an object being fault-prone or not. The weighting factor is the fraction of objects in each child subtree.

3.3.2. Akaike Information Criterion Algorithm

Takahashi, Muraoka, and Nakamura (1997) extended Selby and Porter's algorithm by applying Akaike Information Criterion (AIC) procedures [Akaike (1987)] to prune the tree. AIC procedures are based on maximum likelihood estimation and the least number of predictors. The pruning criterion minimizes

AIC = -2 log(maximum likelihood) + 2 (number of parameters)    (3)
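As a small illustration of Eq. (3), the criterion can be evaluated for a set of candidate subtrees and the subtree with the smallest AIC kept; the numbers below are hypothetical, not from the cited study.

def aic(log_likelihood, num_parameters):
    """Akaike Information Criterion, Eq. (3), given the maximized log-likelihood."""
    return -2.0 * log_likelihood + 2.0 * num_parameters

# Hypothetical candidate subtrees as (maximized log-likelihood, parameter count).
candidates = [(-120.4, 15), (-123.1, 9), (-130.8, 4)]
pruned_choice = min(candidates, key=lambda t: aic(*t))
# Here the middle subtree wins: its slightly lower likelihood is offset
# by needing far fewer parameters.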
The pruned classification tree can be more stable than Selby and Porter's algorithm alone, without any significant decrease in the proportion of correct classifications.

3.3.3. S-plus Algorithm

Troster and Tian (1995) used regression trees with real response variables and real predictors, rather than categories, and then used the result to classify modules. This algorithm is implemented in the S-Plus system. The algorithm recursively partitions the set of modules. When creating a decision node, this algorithm makes a binary partition of the modules based
on a selected predictor and a cutoff threshold. Let v_i be the response value for module i, and let vbar(S) be the mean response over the set S of modules. Deviance is defined as

D(S) = sum over i in S of (v_i - vbar(S))^2    (4)
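In code, Eq. (4) is simply the sum of squared deviations of the responses from their node mean; the function name below is our own illustrative choice.

def deviance(responses):
    """Deviance of a node, Eq. (4): sum of squared deviations from the node mean."""
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses)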
The algorithm chooses the predictor-threshold combination that minimizes D(S_1) + D(S_2) for the partitioned subsets S_1 and S_2. After building a tree, Troster and Tian suggest that the analyst perform "intelligent pruning" when a decision node yields little insight, in the view of the analyst.

3.3.4. CHAID Algorithm

A preliminary study [Khoshgoftaar et al. (1996a)] used the TREEDISC algorithm [SAS Institute staff (1995)] to build a classification tree. It is a refinement of the CHAID algorithm [Kass (1980)], which automatically determines the most significant split criteria based on chi-squared tests. TREEDISC is implemented as a macro package for the SAS System. Improvements to the CHAID algorithm include adjusting the chi-squared statistic for better accuracy [Hawkins and Kass (1982)], specifying a method for finding the most significant branching criterion, and avoiding the possibility of infinite loops. The TREEDISC algorithm recursively partitions the set of modules. It finds the most significant partitioning of a predictor's original categories with respect to the dependent variable. Like the information theory-based algorithm, TREEDISC allows multiway splits, but it minimizes the adjusted p-value of chi-squared tests to determine the ranges.

4. Modeling Methodology

Given data on a past project, our case study's methodology is summarized by the following steps.

(1) Measure static software product metrics from source code for each module and derive software process metrics from configuration management data and problem reporting data. Forecast usage from deployment records and installation plans.

(2) Prepare fit and test data sets for the set of modules, using data splitting. In our case study, we impartially divided the available data into approximately equal fit and test data sets. The fit data set was used to build the model, and the test data set was used to evaluate its accuracy. This assured that the test data set's observations were independent of those in the fit data set, facilitating an unbiased estimate of model accuracy. These proportions were chosen to provide an adequate statistical sample in each data set. Other proportions might be appropriate in another study, according to the needs of the modeling technique.
(3) Choose the parameter c to achieve a preferred balance between the Type I and Type II misclassification rates using the fit data set. (This involves building a series of trees using the CART algorithm.)

(4) Build the final classification tree based on the preferred value of c and the fit data set using the CART algorithm. Calculate the final resubstitution and v-fold cross-validation misclassification rates. In our case study, we chose to use 10-fold cross-validation, which is the CART default.

(5) Classify each module in the test data set using the final tree, and calculate misclassification rates.

(6) Evaluate the model's accuracy estimated by misclassification rates for resubstitution (based on the fit data set), cross-validation (based on the fit data set), and data splitting (based on the test data set).

(7) The model is ready to use on a similar project or subsequent release where measurements are available.

5. Empirical Case Study

5.1. System Description
We conducted a case study of a very large legacy telecommunications system written in a high-level language, using the procedural development paradigm, and maintained by professional programmers in a large organization. The entire system had significantly more than ten million lines of code. This embedded computer application included numerous finite state machines and interfaces to other kinds of equipment. A module consisted of a set of related source code files.

This case study focused on faults discovered by customers after release. A module was considered fault-prone if any faults were discovered by customers, and not fault-prone otherwise. Faults discovered in deployed systems are typically extremely expensive because, in addition to down-time due to failures, visits to customer sites are usually required to repair them. Fault data were collected at the module level by the problem reporting system. Analysis of configuration management data yielded an update indicator, identifying modules that were unchanged from the prior release. Approximately 99% of the unchanged modules had no faults. This case study therefore modeled "updated" modules only, namely, those that were new or had at least one update to source code since the prior release. These modules comprised several million lines of code in a few thousand modules. The proportion of modules with no faults among the updated modules was pi_nfp = 0.926, and the proportion with at least one fault was pi_fp = 0.074. Such a small set of fault-prone modules is often difficult to identify early in development.

Tables 1 and 2 list the software metrics used in this study. We do not advocate a particular set of metrics for software quality models to the exclusion of others recommended in the literature. Pragmatic considerations usually determine the set of available metrics.
Table 1. Software product metrics.

Symbol: Description

USAGE: Deployment percentage of the module.

Call Graph Metrics
CALUNQ: Number of distinct calls to others.
CAL2: Number of second and following calls to others. CAL2 = CAL - CALUNQ, where CAL is the total number of calls.

Control Flow Graph Metrics
CNDNOT: Number of arcs that are not conditional arcs.
IFTH: Number of non-loop conditional arcs, i.e., if-then constructs.
LOP: Number of loop constructs.
CNDSPNSM: Total span of branches of conditional arcs. The unit of measure is arcs.
CNDSPNMX: Maximum span of branches of conditional arcs.
CTRNSTSM: Total control structure nesting.
CTRNSTMX: Maximum control structure nesting.
KNT: Number of knots. A "knot" in a control flow graph is where arcs cross due to a violation of structured programming principles.
NDSINT: Number of internal nodes (i.e., not an entry, exit, or pending node).
NDSENT: Number of entry nodes.
NDSEXT: Number of exit nodes.
NDSPND: Number of pending nodes, i.e., dead code segments.
LGPATH: Base 2 logarithm of the number of independent paths.

Statement Metrics
FILINCUQ: Number of distinct include files.
LOC: Number of lines of code.
STMCTL: Number of control statements.
STMDEC: Number of declarative statements.
STMEXE: Number of executable statements.
VARGLBUS: Number of global variables used.
VARSPNSM: Total span of variables.
VARSPNMX: Maximum span of variables.
VARUSDUQ: Number of distinct variables used.
USAGE was calculated using data derived from installation records of an earlier release. Deployment plans could also be considered. Considering that this is a legacy system, USAGE was a forecast of deployment usage in the current release. The other software product metrics were measured from source code at the procedure level and were then aggregated to the module level. Attributes of a call graph, control flow graphs, and source code statements were measured [Mayrand and Coallier (1996)]. Metrics of problems found and fixed were derived from problem reporting system data. Personnel attributes were derived from configuration management data. Metrics related to updates by designers, such as UPD_CAR, capture attributes of
Table 2. Software process metrics.

Symbol: Description

Process Metrics
DES_PR: Number of problems found by designers.
BETA_PR: Number of problems found during beta testing.
TOT_FIX: Total number of problems fixed.
DES_FIX: Number of problems fixed that were found by designers.
BETA_FIX: Number of problems fixed that were found by beta testing in the prior release.
CUST_FIX: Number of problems fixed that were found by customers in the prior release.
REQ_UPD: Number of changes to the code due to new requirements.
TOT_UPD: Total number of changes to the code for any reason.
REQ: Number of distinct requirements that caused changes to the module.
SRC_GRO: Net increase in lines of code.
SRC_MOD: Net new and changed lines of code.
UNQ_DES: Number of different designers making changes.
VLO_UPD: Number of updates to this module by designers who had 10 or less total updates in their entire company career.
LO_UPD: Number of updates to this module by designers who had between 11 and 20 total updates in their entire company career.
UPD_CAR: Number of updates that designers had in their company careers.
and Process
Metrics
We built and evaluated a model based on product and process metrics. Candidate predictors were all forty product and process metrics in Tables 1 and 2. We generated a series of trees using various values of c. The value c = 5.0 was too extreme for CART to generate a useful model. The CART parameters for prior probabilities were fixed at artificial values, 7r^/p = 0.8, and -ir'fp — 0.2. CART'S cost of a Type I error, C{, was set to one, and CART'S cost of a Type II error, C'n, was varied to achieve the desired values of c, according to Eq. (2). We built a series of models using CART, each with its own value of c. Table 3 and Fig. 2 show the cross-validation accuracy of this series of trees as a function of c. We preferred c = 0.95, which is bold in Table 3. In this case study, because the proportion of fault-prone modules was very small, we preferred the tree with approximately equal Type I and Type II misclassification rates using cross-validation. Other criteria might be appropriate for other situations. Figure 3 also shows the proportion of modules predicted by the model to be fault-prone, as a function of c, based on the cross-validation accuracy. Some might
Table 3. Results for model based on product and process metrics (cross-validation misclassifications).

c       Type I    Type II   Pred. f-p
0.50    10.8%     59.3%     13.0%
0.60    15.0%     51.1%     17.5%
0.70    18.3%     44.4%     21.1%
0.80    21.2%     39.3%     24.1%
0.90    25.7%     28.9%     29.1%
0.95    26.4%     25.9%     29.9%
1.00    28.2%     24.1%     31.7%
1.10    30.4%     21.5%     34.0%
1.20    30.9%     23.7%     34.3%
1.30    30.9%     23.7%     34.3%
1.40    42.5%     14.8%     45.7%
1.50    37.4%     17.0%     40.8%
2.00    48.3%     13.3%     51.1%
3.00    52.4%     11.9%     55.0%
4.00    54.6%     10.4%     57.2%
5.00    No tree generated

Candidate predictors: 40 product and process metrics.
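For illustration, the balanced-rates selection of c can be automated over cross-validation results like those in Table 3. The sketch below copies a few rows from the table and applies the selection rule stated above; the variable names are ours.

# (c, Type I rate, Type II rate) from cross-validation, as in Table 3.
results = [
    (0.80, 0.212, 0.393),
    (0.90, 0.257, 0.289),
    (0.95, 0.264, 0.259),
    (1.00, 0.282, 0.241),
]
# Balanced-rates criterion: choose the c whose Type I and Type II
# misclassification rates are closest to each other.
preferred = min(results, key=lambda r: abs(r[1] - r[2]))
print(preferred[0])  # prints 0.95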
Fig. 2. 10-fold cross-validation accuracy (Type I and Type II rates as functions of c).
Fig. 3. Proportions predicted to be fault-prone.
Fig. 4. Tree based on product and process metrics.
Some might choose a preferred c by limiting the number of modules to be given enhancement treatment. For example, c = 0.95 would limit the number predicted to be fault-prone to 29.9% of the modules. Figure 4 depicts the preferred classification tree model built using c = 0.95. Note that only six metrics out of forty were used in the final tree: the number of distinct include files (FILINCUQ), the net new and changed lines of code (SRC_MOD), the maximum nesting level (CTRNSTMX), the number of updates in designers' company careers (UPD_CAR), the number of control
statements (STMCTL), and deployment usage (USAGE). CART determined that other variables would not significantly improve the model. We applied the model shown in Fig. 4 to the test data set, and achieved the results shown in Table 4.

Table 4. Accuracy of model based on product and process metrics.

                      Model class
Actual        not f-p     f-p       Total
not f-p        73.8%      26.2%     100.0%
f-p            28.9%      71.1%     100.0%
Total          70.5%      29.6%     100.0%

Overall misclassification rate: 26.4% (test data set); preferred c = 0.95.

5.3. Model Based on Product Metrics

To evaluate the value of the process metrics in the context of software quality modeling, we similarly applied the CART algorithm to the fit data set to build a model based on only the twenty-five product metrics listed in Table 1 as candidate predictors. Using the same criteria as above, we preferred c = 0.90. Figure 5 depicts
Fig. 5. Tree based on product metrics.
the resulting classification tree. Note that only two metrics out of twenty-five were used in the final tree: the number of distinct include files (FILINCUQ) and the total span of variables (VARSPNSM). We applied the model in Fig. 5 to the test data set, and achieved the results shown in Table 5.

Table 5. Accuracy of model based on product metrics.

                      Model class
Actual        not f-p     f-p       Total
not f-p        73.0%      27.0%     100.0%
f-p            27.4%      72.6%     100.0%
Total          69.6%      30.4%     100.0%

Overall misclassification rate: 27.0% (test data set); preferred c = 0.90.

5.4. Model Based on Best Single Metric
The number of distinct include files (FILINCUQ) was the most significant individual variable. It was the root decision node in all the tree models we generated. We also generated a classification tree where FILINCUQ was the only candidate predictor. Using criteria similar to the other models, we preferred c = 0.90. Figure 6 depicts the preferred model. This tree's structure reveals a nonmonotonic relationship between FILINCUQ and membership in the fault-prone class. When applied to the test data set, the model in Fig. 6 had a Type I misclassification rate of 27.3% and a Type II rate of 39.2%, as shown in Table 6. This model's accuracy was somewhat worse than the other two models.

Table 6. Accuracy of model based on single metric.

                      Model class
Actual        not f-p     f-p       Total
not f-p        72.7%      27.3%     100.0%
f-p            39.2%      60.8%     100.0%
Total          70.2%      29.8%     100.0%

Overall misclassification rate: 28.2% (test data set); preferred c = 0.90.
Fig. 6. Tree based on single variable (FILINCUQ).
This result indicates the value of the other product metrics in the context of these software quality models. However, even if its accuracy were comparable, we do not recommend single-variable models because they are often not robust.

5.5. Discussion of Empirical Results
As stated above, we applied each preferred model to the fit and test data sets to predict the class membership of each module. Table 7 compares the resulting estimated accuracies. The table indicates the data set on which each evaluation is based. We prefer the data-splitting estimate because it simulates practical use of the model. The data-splitting results in Table 7 indicate the level of accuracy to expect on a similar project or subsequent release. The tree models had sufficient accuracy to be useful to a software development project for targeting enhancement techniques.

As shown in Table 7, the model based on only two significant product metrics (out of twenty-five candidates) had data-splitting accuracy similar to the model based on six significant product and process metrics (out of forty candidates). This was probably due to correlation among the variables. The smaller model was easier to interpret, and data collection was streamlined. This illustrates that experimentation with the list of candidate metrics is advisable to achieve the preferred balance between parsimony (few variables) and accuracy.

Let us consider a cost-benefit analysis of the model based on product and process metrics (Fig. 4), using its test data set results in Table 7.
Table 7. Accuracy of models (misclassifications).

Model (c)                          Evaluation                Type I    Type II
Product & process (c = 0.95)       Resubstitution (fit)      25.8%     15.6%
                                   Cross-validation (fit)    26.4%     25.9%
                                   Data splitting (test)     26.2%     28.9%
Product metrics (c = 0.90)         Resubstitution (fit)      26.7%     28.1%
                                   Cross-validation (fit)    34.6%     25.2%
                                   Data splitting (test)     27.0%     27.4%
FILINCUQ (c = 0.90)                Resubstitution (fit)      27.5%     29.6%
                                   Cross-validation (fit)    34.6%     25.2%
                                   Data splitting (test)     27.3%     39.2%
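For reference, the rates in Tables 4 through 7 follow mechanically from confusion counts and the class priors. A minimal sketch (function and variable names ours), reusing Table 4's published test-set rates:

def misclassification_rates(nfp_as_fp, nfp_total, fp_as_nfp, fp_total):
    """Type I: actual nfp modules classified fp; Type II: actual fp classified nfp."""
    return nfp_as_fp / nfp_total, fp_as_nfp / fp_total

# The overall misclassification rate weights each type by its class prior.
pi_nfp, pi_fp = 0.926, 0.074
type1, type2 = 0.262, 0.289                # data-splitting rates from Table 4
overall = type1 * pi_nfp + type2 * pi_fp   # about 0.264, i.e., 26.4%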
If used to guide enhancements, the modules recommended for extra treatment would consist of those that are actually not fault-prone but are misclassified (Type I), plus those correctly classified as fault-prone. The model based on product and process metrics would recommend that about 29.5% = (0.262)(0.926) + (1 - 0.289)(0.074) of the modules be given special treatment to discover faults early. Extra efforts would be wasted on the not fault-prone modules that were mistakenly recommended. However, effective efforts would discover 71.1% = 1 - 0.289 of the fault-prone modules early, so that faults could be corrected before release. If we instead selected the same number of modules at random for extra effort, then only 29.5% of the fault-prone modules would be selected for enhancement. Using the model to select modules for enhancement would have a substantial benefit [Khoshgoftaar et al. (1998b)], because avoiding customer-discovered faults (71.1% versus 29.5% of fault-prone modules) is extremely valuable.

Let us consider the cost implications of misclassifications for c = 0.95 as an example. Suppose we have 1,000 updated modules from a similar project or subsequent release to analyze with the model. We expect 74 to be actually fault-prone, because the prior probability was pi_fp = 0.074. The model identifies 295 = (0.262)(1000 - 74) + (1 - 0.289)(74) modules as fault-prone. When we apply an enhancement technique, such as additional reviews or testing, to the 295 modules, we discover that 53 = (1 - 0.289)(74) are actually fault-prone and the remaining 242 = 295 - 53 are not. Over the operational life of the software, customers find faults in 21 = 74 - 53 additional modules which the model thought were not fault-prone. The operational consequences of the faults in these 21 modules are the "costs" of the Type II misclassifications. The cost of a Type I misclassification is the time
and expense wasted trying to enhance a module that is already not fault-prone. Let us consider the cost of a Type I misclassification to be a "cost unit", and let us suppose that there is a benefit of 100 units for identifying a fault-prone module at the beginning of testing, compared to waiting until a fault is discovered in that module by a customer. Even though a cost ratio of 100:1 is only an illustration here, we believe that the costs associated with customer-discovered faults in telecommunications systems are so high that even more than 100:1 is plausible. Each project must determine its own value of the cost ratio.

We define the Cost of using a model as the extra effort entailed beyond the budgeted costs of normal development. Our focus is on using a model, and thus we assume the costs of building a model and of data collection are already budgeted; we do not include them in the cost-benefit analysis. An expanded definition of Cost is a topic for future research. We define Benefit as the benefit due to using the model. From these quantities we can calculate Profit and return on investment, ROI:

Profit = Benefit - Cost    (5)

ROI = Profit / Cost    (6)
The cost of enhancements when acting on this model's predictions is 295 units. The benefit of using the quality model is 5,300 = 53 x 100 units for identifying some fault-prone modules correctly. Figure 7 shows profit as a function of c for several values of C_II/C_I; the figure uses a logarithmic scale for profit. For our example, with a hypothetical cost ratio of C_II/C_I = 100, profit is maximized when all modules are enhanced, i.e., when c is so large that the Type II misclassification rate is zero. In other words, the value of discovering faults early, which would otherwise be discovered by customers, is so high that it pays to enhance as many modules as possible. However, it is usually not practical to enhance all modules. In this example, we supposed that 29.5% of the modules is the practical limit, and thus the profit of using the model is 5,005 = 5,300 - 295 units.

Figure 8 shows return on investment (ROI) as a function of c for several values of C_II/C_I. For small values of c, the Type II misclassification rate is so high that very few modules are classified as fault-prone (see Fig. 2), and therefore ROI is also high. However, a model based on a low value of c finds so few fault-prone modules that it is not very useful to the project. At a value of c = 0.95, the return on investment is ROI = 17.0 = 5,005/295. Most investors would consider this an excellent return on investment.

For comparison, suppose we randomly select 295 modules for reliability enhancement. We expect 22 = (0.074)(295) modules to be actually fault-prone. The direct cost is 295 units. The benefit is 2,200 = 22 x 100 units. The profit is 1,905 = 2,200 - 295 units. The return on investment is ROI = 6.5 = 1,905/295. Thus, using the model almost triples the ROI of randomly selecting the same number of modules.
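The arithmetic of this example is easy to script. The sketch below (variable names ours) reproduces the quantities above from the test-set rates, the priors, and the assumed 100:1 cost ratio; small differences from the text's figures arise because the text rounds the intermediate counts to 53 and 295.

n_modules, pi_fp = 1000, 0.074
type1, type2 = 0.262, 0.289            # test-set misclassification rates (Table 4)
benefit_per_find = 100.0               # units per fault-prone module found early
cost_per_enhancement = 1.0             # one "cost unit" per enhanced module

n_fp = pi_fp * n_modules                                    # about 74 fault-prone
flagged = type1 * (n_modules - n_fp) + (1 - type2) * n_fp   # about 295 flagged
found = (1 - type2) * n_fp                                  # about 53 found early

cost = flagged * cost_per_enhancement
benefit = found * benefit_per_find
profit = benefit - cost       # Eq. (5): about 4,966 (5,005 with rounded counts)
roi = profit / cost           # Eq. (6): about 16.8 (17.0 with rounded counts)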
Fig. 7. Profit (curves for cost ratios 100, 20, and 10; logarithmic profit scale).
Fig. 8. Return on investment (curves for cost ratios 100, 20, and 10).
This example illustrates the computation of ROI. It is easily recomputed if one chooses a different value of c, and thereby different Type I and Type II misclassification rates, and/or a different cost ratio.

6. Conclusions

High software quality is required of mission-critical systems. We focus on software quality models that make timely predictions of software quality on a module-by-module basis. Our objective is to target enhancement techniques early in development to those modules that will benefit the most.
This paper introduces the Classification And Regression Trees (CART) algorithm to practitioners in software engineering. We draw the following practical lessons learned on building classification trees for software quality modeling:

• Extreme prior probabilities can prevent CART from building a tree.
• A project should control the balance between misclassification rates. In this paper, we implemented this control by choosing a constant c, defined as the ratio of prior probabilities times the ratio of misclassification costs. These parameters control the goodness-of-split criterion when CART generates a tree.
• Minimize the number of independent variables without significantly sacrificing accuracy. However, we do not advocate models with a single independent variable.
• Building a tree is vulnerable to overfitting, and consequently, the accuracy of resubstitution of the fit data set into the model can be misleading.

A case study of a very large telecommunications system used CART to build software quality models based on various sets of software product and process metrics. The models predicted whether or not modules were fault-prone. The case study illustrates the way that tree algorithms can build successful parsimonious models.

Acknowledgments

We thank Wendell D. Jones and the EMERALD team at Nortel for collecting the data. We thank John P. Hudepohl for his encouragement and support. We thank Archana Naik for performing the experiments. This work was supported in part by a grant from Nortel, through the Software Reliability Engineering Department, Research Triangle Park, NC, USA. The findings and opinions in this paper belong solely to the authors, and are not necessarily those of the sponsor. Moreover, our results do not in any way reflect the quality of the sponsor's software products.

References

Akaike, H. (1987), "Factor Analysis and AIC", Psychometrika 52(3), 317-332.
Arthur, J. D. and Henry, S. M. (eds.) (1995), "Software Process and Product Measurement", Annals of Software Engineering 1, J. C. Baltzer.
Basili, V. R., Briand, L. C. and Melo, W. (1996), "A Validation of Object-Oriented Design Metrics as Quality Indicators", IEEE Transactions on Software Engineering 22(10), 751-761.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984), Classification and Regression Trees, Chapman & Hall, London.
Briand, L. C., Basili, V. R. and Thomas, W. M. (1992), "A Pattern Recognition Approach for Software Engineering Data Analysis", IEEE Transactions on Software Engineering 18(11), 931-942.
Ebert, C. (1996), "Classification Techniques for Metric-Based Software Development", Software Quality Journal 5(4), 255-272.
Efron, B. (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation", Journal of the American Statistical Association 78(382), 316-331.
Evanco, W. M. and Agresti, W. W. (1994), "A Composite Complexity Approach for Software Defect Modeling", Software Quality Journal 3(1), 27-44.
Fayyad, U. M. (1996), "Data Mining and Knowledge Discovery: Making Sense Out of Data", IEEE Expert 11(4), 20-25.
Fayyad, U. M., Piatetsky-Shapiro, G. and Smyth, P. (1996), "The KDD Process for Extracting Useful Knowledge from Volumes of Data", Communications of the ACM 39(11), 27-34.
Fenton, N. E. and Pfleeger, S. L. (1997), Software Metrics: A Rigorous and Practical Approach, 2nd Edition, PWS Publishing, London.
Glymour, C., Madigan, D., Pregibon, D. and Smyth, P. (1996), "Statistical Inference and Data Mining", Communications of the ACM 39(11), 35-41.
Gokhale, S. S. and Lyu, M. R. (1997), "Regression Tree Modeling for the Prediction of Software Quality", in Proceedings of the Third ISSAT International Conference on Reliability and Quality in Design, ed. Pham, H., Anaheim, CA, International Society of Science and Applied Technologies, 31-36.
Halstead, M. H. (1977), Elements of Software Science, Elsevier, New York.
Hand, D. J. (1998), "Data Mining: Statistics and More?", The American Statistician 52(2), 112-118.
Hawkins, D. M. and Kass, G. V. (1982), "Automatic Interaction Detection", in Topics in Applied Multivariate Analysis, ed. Hawkins, D. M., Chapter 5, Cambridge University Press, Cambridge, 269-302.
Henry, J., Henry, S., Kafura, D. and Matheson, L. (1994), "Improving Software Maintenance at Martin Marietta", IEEE Software 11(4), 67-75.
Hudepohl, J. P., Aud, S. J., Khoshgoftaar, T. M., Allen, E. B. and Mayrand, J. (1996a), "EMERALD: Software Metrics and Models on the Desktop", IEEE Software 13(5), 56-60.
Hudepohl, J. P., Aud, S. J., Khoshgoftaar, T. M., Allen, E. B. and Mayrand, J. (1996b), "Integrating Metrics and Models for Software Risk Assessment", in Proceedings of the Seventh International Symposium on Software Reliability Engineering, White Plains, NY, IEEE Computer Society, 93-98.
Kass, G. V. (1980), "An Exploratory Technique for Investigating Large Quantities of Categorical Data", Applied Statistics 29, 119-127.
Khoshgoftaar, T. M. and Allen, E. B. (1995a), "Multivariate Assessment of Complex Software Systems: A Comparative Study", in Proceedings of the First International Conference on Engineering of Complex Computer Systems, Fort Lauderdale, FL, IEEE Computer Society, 389-396.
Khoshgoftaar, T. M. and Allen, E. B. (1995b), "Predicting Faults from an Information Theory Based Software Metric", in Proceedings of the Second ISSAT International Conference on Reliability and Quality in Design, ed. Pham, H., Orlando, FL, International Society of Science and Applied Technologies, 210-214.
Khoshgoftaar, T. M. and Allen, E. B. (1997), A Practical Classification Rule for Software Quality Models, Technical Report TR-CSE-97-56, Florida Atlantic University, Boca Raton, FL.
Khoshgoftaar, T. M. and Allen, E. B. (1998), "An Information Theoretic Approach to Predicting Software Faults", International Journal of Reliability, Quality and Safety Engineering 5.
Khoshgoftaar, T. M., Allen, E. B., Bullard, L. A., Halstead, R. and Trio, G. P. (1996a), "A Tree-Based Classification Model for Analysis of a Military Software System", in Proceedings of the IEEE High-Assurance Systems Engineering Workshop, Niagara on the Lake, Ontario, Canada, IEEE Computer Society, 244-251.
Khoshgoftaar, T. M., Allen, E. B., Halstead, R., Trio, G. P. and Flass, R. (1998a), "Process Measures for Predicting Software Quality", Computer 31(4), 66-72.
Khoshgoftaar, T. M., Allen, E. B., Jones, W. D. and Hudepohl, J. P. (1998b), "Return on Investment of Software Quality Models", in Proceedings 1998 IEEE Workshop on Application-Specific Software Engineering and Technology, Richardson, TX, USA, IEEE Computer Society, 145-150.
Khoshgoftaar, T. M., Allen, E. B., Jones, W. D. and Hudepohl, J. P. (1999), "Which Software Modules have Faults that will be Discovered by Customers?", Journal of Software Maintenance: Research and Practice 11(1), 1-18.
Khoshgoftaar, T. M., Allen, E. B., Jones, W. D. and Hudepohl, J. P. (2000), "Data Mining for Predictors of Software Quality", International Journal of Software Engineering and Knowledge Engineering, forthcoming.
Khoshgoftaar, T. M., Allen, E. B., Kalaichelvan, K. S. and Goel, N. (1996b), "Early Quality Prediction: A Case Study in Telecommunications", IEEE Software 13(1), 65-71.
Khoshgoftaar, T. M., Allen, E. B., Kalaichelvan, K. S. and Goel, N. (1996c), "The Impact of Software Evolution and Reuse on Software Quality", Empirical Software Engineering: An International Journal 1(1), 31-44.
Khoshgoftaar, T. M., Allen, E. B., Naik, A., Jones, W. D. and Hudepohl, J. P. (1998c), "Modeling Software Quality with Classification Trees", in Proceedings of the Fourth ISSAT International Conference on Reliability and Quality in Design, eds. Pham, H. and Lu, M.-W., Seattle, WA, International Society of Science and Applied Technologies, 171-182.
Khoshgoftaar, T. M., Allen, E. B., Naik, A., Jones, W. D. and Hudepohl, J. P. (1998d), "Using Classification Trees for Software Quality Models: Lessons Learned", in Proceedings of the Third IEEE International High-Assurance Systems Engineering Symposium, Bethesda, MD, USA, IEEE Computer Society, 82-89.
Khoshgoftaar, T. M. and Lanning, D. L. (1995), "A Neural Network Approach for Early Detection of Program Modules having High Risk in the Maintenance Phase", Journal of Systems and Software 29(1), 85-91.
Kitchenham, B. A. (1998), "A Procedure for Analyzing Unbalanced Datasets", IEEE Transactions on Software Engineering 24(4), 278-301.
Kitchenham, B. A. and Pfleeger, S. L. (1996), "Software Quality: The Elusive Target", IEEE Software 13(1), 12-21.
Lachenbruch, P. A. and Mickey, M. R. (1968), "Estimation of Error Rates in Discriminant Analysis", Technometrics 10(1), 1-11.
Lanning, D. L. and Khoshgoftaar, T. M. (1995), "The Impact of Software Enhancement on Software Reliability", IEEE Transactions on Reliability 44(4), 677-682.
Lyu, M. R. (1996), "Introduction", in Handbook of Software Reliability Engineering, ed. Lyu, M. R., Chapter 1, McGraw-Hill, New York, 3-25.
Lyu, M. R., Yu, J. S., Keramidas, E. and Dalal, S. R. (1995), "ARMOR: Analyzer for Reducing Module Operational Risk", in Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing, Pasadena, CA, IEEE Computer Society, 137-142.
Mayrand, J. and Coallier, F. (1996), "System Acquisition Based on Software Product Assessment", in Proceedings of the Eighteenth International Conference on Software Engineering, Berlin, IEEE Computer Society, 210-219.
McCabe, T. J. (1976), "A Complexity Measure", IEEE Transactions on Software Engineering SE-2(4), 308-320.
Munson, J. C. and Khoshgoftaar, T. M. (1996), "Software Metrics for Reliability Assessment", in Handbook of Software Reliability Engineering, ed. Lyu, M., Chapter 12, McGraw-Hill, New York, 493-529.
270
T. M. Khoshgoftaar and E. B. Allen
Myers, G. J. (1978), Composite/Structured Design, Van Nostrand Reinhold, New York. Naik, A. (1998), Prediction of Software Quality Using Classification Tree Modeling, Master's thesis, Florida Atlantic University, Boca Raton, FL, USA. Advised by Taghi M. Khoshgoftaar. Offen, R. J. and Jeffery, R. (1997), "Establishing Software Measurement Programs", IEEE Software 14(2), 45-53. Oman, P. and Pfleeger, S. L. (eds.) (1997), Applying Software Metrics, IEEE Computer Society Press, Los Alamitos, CA. Pfleeger, S. L. (1997), "Assessing Measurement", IEEE Software 14(2), 25-26. Editor's introduction to special issue. Porter, A. A. and Selby, R. W. (1990), "Empirically Guided Software Development Using Metric-Based Classification Trees", IEEE Software 7(2), 46-54. Russell, G. W. (1991), "Experience with Inspection in Ultralarge-Scale Developments", IEEE Software 8(1), 25-31. SAS Institute staff (1995), TREEDISC macro (beta version), Technical report, SAS Institute, Inc., Cary, NC. Documentation with macros. Schneidewind, N. F. (1995), "Software Metrics Validation: Space Shuttle Flight Software Example", Annals of Software Engineering 1, 287-309. Selby, R. W. and Porter, A. A. (1988), "Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis", IEEE Transactions on Software Engineering 14(12), 1743-1756. Steinberg, D. and Colla, P. (1995), CART: A Supplementary Modules for SYSTAT, Salford Systems, San Diego, CA. Szabo, R. M. and Khoshgoftaar, T. M. (1995), "An Assessment of Software Quality in a C + + Environment", in Proceedings of the Sixth International Symposium on Software Reliability Engineering, Toulouse, France, IEEE Computer Society, 240249. Takahashi, R., Muraoka, Y. and Nakamura, Y. (1997), "Building Software Quality Classification Trees: Approach, Experimentation, Evaluation", in Proceedings of the Eighth International Symposium on Software Reliability Engineering, Albuquerque, NM, USA, IEEE Computer Society, 222-233. Troster, J. and Tian, J. (1995), "Measurement and Defect Modeling for a Legacy Software System", Annals of Software Engineering 1, 95-118.
Chapter 16

HIGHLY RELIABLE SYSTEMS: DESIGNING SOFTWARE FOR IMPROVED ASSESSMENT

BOJAN CUKIC
Department of Computer Science and Electrical Engineering, West Virginia University, P.O. Box 6109, Morgantown, WV 26506

FAROKH B. BASTANI
Computer Science Department, University of Texas at Dallas, 2601 N. Floyd Rd., Richardson, TX 75080
1. Introduction

Very few computer users have escaped problems related to software failure, such as the mysterious loss of a file or a suddenly frozen operating system shell. Due to software faults, hours of work may be lost or, unknowingly, inappropriate business decisions may be made. Computer users, not having a better option, consider these events tolerable and expected inconveniences. In terms of their consequences, these failures relate primarily to economics. The users' tolerance of inconvenience is inversely proportional to the price they are willing to pay for the given software product. Software price and software quality are the attributes which need to be in balance for the given application.

However, failures in certain systems can endanger the financial prosperity of a company, or even lead to the loss of human lives. These systems are called mission-critical (telecommunication, transaction-processing, etc.) and safety-critical (flight control, nuclear plant monitoring, railway signaling, etc.) systems, respectively. Generally, safety-critical systems require high levels of predictability. Security, availability, safety, reliability, performability, and other attributes characterizing their operation must be guaranteed. The term high-assurance systems was recently introduced to stress the significance of a multidisciplinary approach to engineering critical systems and applications [Bhattacharya et al. (1997)].

Software reliability is defined as the probability of failure-free execution given a specific environment and a specified time interval. International standards limit
the rate for catastrophic failures (Safety Integrity Level 4) to less than 10^-8 failures per hour in the case of continuous control systems and less than 10^-4 failures per hour for protection systems such as emergency shutdown systems [Cullyer and Storey (1994)]. The reliability requirements stated above, also known as ultra-high reliability, are widely disseminated in the scientific and engineering community. It is generally agreed that the reliability levels required by protection systems can be successfully assessed and validated [Butler and Finelli (1993)]. However, present methods for assessing and validating the reliability requirements of continuous control systems are either theoretically sound or practical, but not both at the same time. Due to the limited adequacy of current assessment methods, reliability requirements of safety-critical continuous control systems cannot be rigorously guaranteed to a sufficiently high level of confidence.

In dealing with the above-mentioned uncertainty, regulatory authorities generally recognize the difficulty of quantifying software reliability. The Federal Aviation Administration (FAA) of the USA and other federal agencies, such as the Nuclear Regulatory Commission (NRC) and the Food and Drug Administration (FDA), have chosen to perform software certification using a technique similar to that used for certifying hardware. The basic message of the Radio Technical Commission for Aeronautics document RTCA/DO-178A, for example, is that "designers must take a disciplined approach to software: requirements definition, design, development, testing, configuration management, and documentation". Since standards and recommendations feature considerable diversity in their details [Bhansali (1993)], it is hard to compare their efficacy. Furthermore, software engineering techniques for building and validating software so that it adheres to stringent safety and reliability requirements are continually being improved and remain open research problems [Littlewood (1994), Parnas et al. (1990)]. In the meantime, safety-critical systems are being built using the best current practices. An overview of some of the current practices in software development and assessment for nuclear power plants is presented in Lawrence et al. (1992), Parnas et al. (1991).

Where software is concerned, unlike in other engineering disciplines, very little attention has been paid to design patterns and principles which make software assessment easier and/or more efficient. Since software assessment is inherently difficult, the following question can be asked: Can the reliability of a program be made easier to assess as the result of specific design decisions? In this chapter, we discuss principles of software design for process-control programs that may enhance the assessment and quantification of software reliability. Section 2 addresses reasons for the increased use of programmable electronic systems in safety-critical process-control applications. Section 3 presents an overview of assessment techniques for reliability assurance of safety-critical systems, and concludes with a discussion of several "common sense" rules of software design that have been proposed for enhancing software reliability. Section 4 discusses the transformational approach to software reliability assessment, a framework which combines the strengths of formal verification and program testing. Section 5 describes
software design techniques which improve reliability assessment in the context of the transformational approach. We conclude with a summary in Sec. 6.

2. Programmable Electronic Systems

Programmable electronic subsystems are increasingly used in process-control systems, including safety-critical applications. Given that computer systems are often considered untrustworthy, their use in critical systems appears to be a paradox. In spite of their deficiencies, computerized systems have been found to have substantial advantages over the technology that they replace.

Cost is probably the main driving force behind the increased use of software in process control. The cost of realizing a given set of functions in software is likely to be much lower than with any alternative mechanism. The cost of changing and enhancing software systems is reasonably low when compared with the new designs, components, and installation changes incurred by alternative hardware technologies.

Functionality of programmable systems is as important as cost. Programmable systems can monitor a range of inputs, perform complicated calculations, and provide different outputs at high speeds. Furthermore, software offers means to increase system safety. According to IEE/BCS (1992), these are:
(1) Frequent proof checks on critical components,
(2) Potential to provide sophisticated safety interlocks,
(3) Potential to provide warnings of impending problems resulting from diagnostic programs,
(4) The ability to compare actual system conditions with an "ideal" model during operation,
(5) The ability to provide better information to operators and hence improve decision making,
(6) Potential to replace people working in hazardous and hostile environments.

The ease of change and reusability of algorithms, designs, and, sometimes, even software implementations is also important. Provided that hardware resources allow adding functionality, improvements may take advantage of the maturity of the current system and incorporate new features. While cost and functionality are significant advantages of programmable systems, the ease of change and software reuse also hide the risk of reducing overall system dependability. Note that changeability is one of the essential difficulties in creating error-free software. Therefore, programmable systems can potentially improve dependability attributes such as reliability and safety only if the software is engineered "properly".

3. Reliability Assurance of Safety-Critical Systems

If we knew what the future held, then we could know precisely how reliable today's software is, when it would fail, how often it would fail, and under what conditions
it would fail. With that information, it would be trivial to confidently assess the program's reliability. But since the future is unknown, we do not know when and what faults will be triggered, what data will be sampled on the input, and how severe the consequences of the failures will be. From the development standpoint, the goal is to avoid, minimize, and fix known faults. The goal of the validation phase in the software life-cycle (often performed through some form of certification testing) is to assess and assure that the current software meets the user's expectations (projected into the future), including the reliability requirement. Two basic approaches exist to quantitatively assess software reliability: static analysis and dynamic testing. This classification is shown in Fig. 1.

Fig. 1. Approaches to software reliability assessment: static analysis (formal verification) and dynamic testing (time domain, fault based, and input domain methods).
3.1. Static Analysis
Static analysis does not require the execution of the program being analyzed. It can be used to determine properties of the program which are universally true for all possible execution conditions. The most rigorous form of static analysis, suitable for the reliability assessment of safety-critical programs, is program verification. A program is functionally correct if it behaves according to the functional requirements specification. Under the assumption that the functional requirements specification captures all the desirable properties of the application and that no undesirable properties are erroneously specified in it, correct software implementations represent a subset of reliable software implementations. Unfortunately, the assumption that requirements specification documents capture all the desired properties without any errors is not easy to achieve in practice.
3.2. Time Domain Models
Dynamic testing is a conventional method of program checking. A program under test is executed with different combinations of input data, and the results are compared with the expected values. The outcome of testing is used to predict the reliability of the program. Reliability prediction is accomplished using one of three different groups of methods, namely, time domain methods, fault based methods, and input domain methods. Note, however, that software reliability depends on the software quality as well as on its operational usage. Since testing cannot guarantee
the absence of faults, exposing the program to the operations that are anticipated to be used most frequently should catch the failures that are most likely to appear during field use. It is assumed that these failures are the ones that matter the most to the user. The quantitative approximation of the system's field use is called the operational profile [Musa (1993)].

Time domain models are used to predict the future reliability of a program based on the analysis of time-indexed failure data. In time domain models, the failure arrival process resulting from repeated execution of test cases is viewed as a stochastic process. The observed failure data over time are fitted to various statistical models, the so-called reliability growth models. Unfortunately, time domain models cannot be used for the assessment of ultra-high software reliability. The first reason is that many of the assumptions underlying these models are not realistic (perfect fault removal, independence of time intervals between failures). Furthermore, reliability growth models are based on counting and interpolating failure data. For ultra-reliable programs which need to achieve failure rates of less than 10^-8 per hour, it is highly unlikely that any failure will occur in certification phase testing [Musa (1998)]. Without observing (several) failures, time domain models cannot be constructed and a program's reliability cannot be estimated [Butler and Finelli (1993)].
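A quick calculation makes this infeasibility argument concrete. The Python sketch below is our illustration (not from the chapter); it assumes the standard exponential failure model, under which demonstrating a failure rate of lambda failures per hour with confidence C by failure-free testing requires roughly -ln(1 - C)/lambda hours of operation:

    import math

    def required_failure_free_hours(failure_rate, confidence):
        """Hours of failure-free testing needed to demonstrate `failure_rate`
        (failures per hour) with the given confidence, assuming exponentially
        distributed times to failure."""
        return -math.log(1.0 - confidence) / failure_rate

    # Safety Integrity Level 4 target for continuous control: 10^-8 failures/hour.
    hours = required_failure_free_hours(1e-8, 0.99)
    print(f"{hours:.3g} hours, or about {hours / 8766:.0f} years")
    # ~4.61e8 hours -- tens of thousands of years of testing, which is the
    # essence of the Butler and Finelli (1993) argument.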
3.3. Fault-Based Testing
Software fault injection is the process of physically injecting a fault into the program. Fault seeding is a statistical fault injection method used to estimate the number of faults remaining in the program after testing has been performed. Let M denote the known number of injected faults, k the total number of faults detected through testing, and m the number of injected faults detected by testing (m < k and m < M). Under the assumption that both injected and inherent faults are equally likely to be detected, an estimate of the number of inherent faults N is

    N = (M/m)(k - m)

Presently, it is generally agreed that fault-based testing does not provide an insight into how good the code is per se. It rather answers "what if" scenarios by simulating human factor errors and environment failures [Voas and Miller (1995)]. Due to its limitations, fault-based testing is not suitable for software reliability assessment of safety-critical systems.
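As a small numerical sketch of this estimator (ours, with illustrative numbers):

    def seeded_fault_estimate(M, k, m):
        """Mills-style fault seeding estimate of remaining inherent faults.
        M: faults injected; k: total faults found (seeded + inherent);
        m: seeded faults among those found."""
        if m == 0:
            raise ValueError("no seeded faults detected; estimate undefined")
        return M * (k - m) / m

    # 20 seeded faults, 30 faults found in testing, 15 of them seeded:
    # detection ratio 15/20 = 0.75, so (30 - 15)/0.75 = 20 inherent faults.
    print(seeded_fault_estimate(M=20, k=30, m=15))  # 20.0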
3.4. Input Domain Models
In the input domain reliability assessment, the reliability of a program is the probability of failure free operation for a specific input profile. The time component of reliability assessment in time domain models is replaced by the concept of test runs, which are activated by test cases from the input domain I. Program P is then a
function that maps all the elements of the multidimensional space I into the output space O. The inputs that activate faults in the program are mapped into failures, i.e., incorrect outputs. An intuitive measure of software reliability is the proportion of test cases that result in correct outputs. If n represents the total number of test cases, and n_f the number of detected failures, the estimated reliability R, according to the Nelson model [Thayer et al. (1978)], is

    R = (n - n_f)/n = 1 - n_f/n

When an indefinite number of test runs is taken,

    R = 1 - lim_{n→∞} (n_f/n)

where the fraction n_f/n in its limit represents the estimated probability of failure in a single run of the program (frequently denoted by θ). A small numerical sketch of the Nelson estimate follows the list of drawbacks below.

All input domain models (as well as the other statistical models) require that a test oracle is available [Amman et al. (1994)]. The oracle should decide (with absolute certainty) whether a test run is correct or incorrect. A sound foundation for reliability assessment in the input domain is provided by statistical sampling theory [Bastani and Pasquini (1994), Tsoukalas et al. (1993)]. A program under test separates the input domain into two disjoint classes: inputs that are correctly and those that are incorrectly mapped into the corresponding outputs. From the statistical viewpoint, each test case is a Bernoulli trial, and the model is frequently called the sampling model. Since input domains are large, the likelihood of selecting the same test case twice is small. Thus, reliability estimation assumes sampling with replacement, in which nothing precludes repetition of a test case, but which is simpler (and cheaper) to implement than sampling without replacement. The input domain models for software reliability assessment have the strongest theoretical foundation [Ramamoorthy and Bastani (1982)], and will be discussed later in this chapter. Some practical drawbacks are the following:
• A large number of test cases is required,
• The existence of a test oracle is assumed,
• Reliability estimation depends upon the ability to closely approximate/predict the operational profile of the field use.
3.5. Designing Software for Assessment
In recent years, the term software testability has attracted the attention of the software engineering community. According to the IEEE Glossary of Software Engineering Terminology [IEEE (1990)], testability is defined as "the degree to which a system of components facilitates the establishment of test criteria and the performance of tests to determine whether those criteria have been met". Voas and
Miller (1995), and later Bertolino and Strigini (1996), define testability of a program as the probability that the program will fail under test if it contains at least one fault. Testability becomes very useful when the reliability of a safety-critical program has to be inferred from a series of observations of failure-free test executions. If testability can be estimated (this is not an easy task), it can serve as the basis for drawing inferences on reliability estimation. According to Hamlet and Voas (1993), repeated failure-free executions of a program with a high probability of revealing failures (if they exist) should provide higher confidence in the assessed reliability. However, as noted in Bertolino and Strigini (1996), if faults remain in the program, they have a higher probability of causing a failure during field use. While Voas and Miller attribute testability only to the input-output behavior of the system, Bertolino and Strigini extend it to include "observability of the inside of the system". However, in our opinion, these approaches only partially address the most important issue, which is how to design a safety-critical system such that it will allow its failures and faults to be observed and detected, respectively, during the reliability assessment.

For ultra-high reliable systems, it is necessary to develop a methodology to systematically select among alternatives in architectural and/or detailed design. Presently available general guidelines for selection, borrowed from Clutterbuck (1992), are listed below:
• Assessment is easier for less complex systems.
• The analysis of systems based on asynchronous interrupts is extremely difficult and should be avoided in favor of designs in which the inputs are polled.
• Multi-tasking systems pose difficult analysis problems, and, whenever possible, design decisions should be made in favor of a sequential design.
• Dynamic memory allocation introduces uncertainty, and, in safety-critical systems, memory requirements should be statically determinable.

The guidelines above are useful for guiding high-level design decisions. However, they do not address detailed design phase concerns. Our observations regarding the importance of detailed design decisions in the context of software reliability assessment are reported in the following two sections.

4. Software Reliability Assessment in Reduced Input Domains

As mentioned in Sec. 3.4, input domain models are founded on the strongest theoretical background of all available methods for software reliability assessment of ultra-reliable programs. Repeated failure-free execution of the program provides a certain level of confidence that the required reliability level has been successfully achieved [Miller et al. (1992)]. Furthermore, it has been shown that a combination of statistical testing and formal correctness verification can be used in software reliability assessment. In the so-called input-domain approach [Ramamoorthy et al. (1981)], "high usage" partitions for which the confidence in the correctness
of their formal specification is high are verified formally, if possible. The remaining partitions are tested using statistical sampling according to the operational profile. In the recently introduced transformational approach to software reliability assessment [Cukic (1997)], program transformations and partial program proofs are used to amplify the effect of test cases, i.e., it is possible to infer the behavior of the program on many inputs based on its behavior on one input. This section describes how careful software design can replace program transformations. The effect of this replacement will remain the same: reducing the effective size of the input space, typically by eliminating one or more dimensions in the input space.

Combining formal verification with statistical testing in order to enhance software reliability is not straightforward. One of the measures of success of a methodology that combines the two should be its ability to significantly reduce the (often prohibitively) expensive amounts of testing required to attain the required confidence level. Reliability depends on the quality of the implemented program as well as on its operational usage. When satisfaction of one (or more) of the general properties is formally proven, it provides a guarantee that these types of failures will not appear in the output. Following this approach, only the proof of absence of all types of failures would ensure correctness. Software products are usually too complex to allow us to understand and predict all possible failure types. Consequently, correctness proofs are the only type of formal proof technique that has a direct impact on reliability estimation.

The transformational approach to software reliability assessment uses correctness proofs that reduce the size of the input domain (here, "input domain" is defined as the set of all inputs relevant to a program). As mentioned earlier, input domains are usually very large, because they are formed as (a subset of) the Cartesian product of all possible values of all input parameters. The size of the input domain can be reduced by decreasing the number of input parameters or limiting the number of values that input parameters can assume. Reducing the number of input parameters lowers the dimensionality of the input domain and significantly reduces its size. In the following two subsections we present methods of reducing the dimensionality of the input domain while preserving the program's semantics. They are based on combinations of correctness proofs and program slicing techniques [Weiser (1984)]. Program slices are sets of program statements which directly or indirectly affect the selected occurrence of a variable or a set of variables. Vertical slicing is a method for program decomposition in the direction of control and data flow, while horizontal program slices capture the logical processing units of a program, providing program decompositions orthogonal to vertical slicing.
4.1. Reducing the Dimensionality of Input Domains
The idea behind vertical slicing is to form groups of statements which may directly or indirectly affect the value of an output parameter. Suppose that the program
can be divided into two different computations, P0 and P1, and that the input to P0 is xy and the input to P1 is yz, where x, y, and z are vectors of input parameters. Then two vertical slices of the program can be taken, one for P0 and one for P1, as shown in Fig. 2.

Fig. 2. Vertical program slicing.

Vertical slicing provides advantages for testing because slices P0 and P1 can be tested separately. For example, if input vectors x, y, and z contain 2, 1, and 2 input parameters, respectively, testing a program with a five-dimensional input domain is reduced to (possibly concurrent) testing of two slices with three-dimensional input domains. The impact of this method on the input domain size becomes apparent if one (or more) of the slices can be verified by a partial program proof. Assuming high confidence in the validity of the requirements specifications for this slice, input parameters used as inputs only to this slice (z for slice P1, for example) can be eliminated from the input space. A large number of inputs x_i y_j z_k correspond to a single input x_i × y_j × z for P0 and x × y_j × z_k for P1. Furthermore, assuming that one of the slices, say P1, can be formally verified, only P0 needs random testing in the reduced, three-dimensional input space. Application of the vertical slicing rule is guaranteed to preserve semantic equivalence between the original program P and the transformed programs P0, P1, ..., Pn used during the reliability assessment, because slices are just projections of P.

4.1.1. Case Study I: Power Plant Control System

Figure 3 describes a control system for a power plant using the SA/RT (Structured Analysis/Real Time) method. The functionality of the power plant control system was originally described in Heitmeyer and Mandrioli (1996). This control system checks the amount of power that needs to be produced by the plant and, using this information, manages the amount of fuel or coolant to be injected into the plant. In an emergency, the system may even shut down the plant.
Fig. 3. Power plant control system.
SA/RT is a semiformal operational notation closely related to the data flow diagrams used in the description of the slicing rules shown earlier in this section. Continuous-line arrows and circles represent data items (input or output numerical parameters) and data transformers, respectively. Dotted-line arrows and circles represent signals and control transformations, respectively. Arrows labeled Temperature and Pressure denote input parameters coming from the system's environment. The arrow labeled Power Request also denotes information from the environment, specifically, how much power is requested from the plant. The arrows labeled Coolant and Power Production, which represent the amount of coolant to be injected into the plant under the given circumstances and the amount of power the plant will produce, represent output parameters from the system to the environment. Coolant Tank and Fuel Tank are data repositories, i.e., the place holders for state variables representing the amount of coolant and fuel available in the tanks.

The power plant control system performs two kinds of functions: a service function and a safety function. Service functions manipulate the amount of power produced based on the available power, the safety state of the plant, and whether an external request for power has been received. Further, the amount of available fuel is regularly updated. The safety function includes regular checking of temperature and pressure in order to determine whether the system is in a hazardous state. In response, the system either injects further coolant into the coolant tank, reduces the amount of available power, or, in case of an emergency, shuts down the plant.

The design of the Power Plant Control System fits the vertical slicing rule very well. The system may be decomposed into two different vertical slices, and each of them may be tested and/or formally verified separately. The first slice is composed of Compute Safety State and the Monitor. The second slice contains the Compute Power Production function. The input space for each of the two slices has reduced size and dimensionality. The complete program has a 5-dimensional input space. The first slice has a 3-dimensional input space. For the second slice, the dimensionality of
the input space appears to be 5, but bearing in mind that Shut Down, Restart, and Continue are signals that can be represented by boolean values, its dimensionality drops to only 2.

This system allows even further decomposition of the first slice. The function Monitor is a control function, and it can be represented by a finite state machine in the form of a Mealy automaton [Heitmeyer and Mandrioli (1996)]. Since this particular finite state machine is very simple, composed of only three states, its formal verification is feasible. While this step does not reduce the size of the input space for the first slice (it actually increases it, because the signal Cool represents a new input), it is still desirable since it allows the correctness of the state monitoring logic to be proven. As we discuss in Sec. 4.4, the reliability of this Power Plant Control System is made assessable by the design decisions that allow the system's decomposition. In fact, this specific system can be certified to a high level of reliability through the application of formal program proofs and exhaustive testing techniques.
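The decomposition can be pictured as two functions that share no state, so each slice can be tested or verified in isolation. The sketch below is purely our illustration of the structure (the function names, signatures, and threshold values are invented, not taken from the case study):

    def safety_slice(temperature, pressure, coolant_level):
        """Slice 1: Compute Safety State plus Monitor (3-D input space)."""
        hazardous = temperature > 620.0 or pressure > 15.5  # illustrative limits
        inject_coolant = hazardous and coolant_level > 0.0
        shut_down = hazardous and not inject_coolant
        return inject_coolant, shut_down

    def service_slice(power_request, shut_down):
        """Slice 2: Compute Power Production; the control signals are
        boolean, so the effective input space is 2-D as noted above."""
        return 0.0 if shut_down else power_request

    # No data flows between the slices except the explicit signals, so the
    # two input domains can be sampled (or verified) independently.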
4.2. Reducing the Cardinality of Input Domains
Horizontal slicing provides program decomposition orthogonal to vertical slicing. In the case of vertical slicing, the selected occurrence of a variable which guided the constitution of a slice is necessarily a program output. When forming a horizontal slice, the occurrence of the selected variable (or variables) is somewhere in the program's data flow. The essence of horizontal slicing is shown in Fig. 4.
Fig. 4. Horizontal program slicing.
Application of partial program proofs in combination with horizontal slicing can significantly reduce the size of the input domain. Referring to Fig. 4, assume that program P can be decomposed into three stages, P2(P1(P0(x), y), z), where x, y, and z are the input vectors. The execution of stages P0, P1, and P2 results in output vectors x°, y°, and z°, respectively. If P0 and P2 are formally verified, that leaves only P1(x°, y) for random testing and statistical reliability assessment. Formal verification of P0 and P2 is necessary in order to guarantee the semantic equivalence of the transformed and the original program implementation. The dimensionality of the input domain is reduced if x° has fewer dimensions than x. The input domain of the transformed program consists of the parameters in x° and y. Due to the partial correctness proofs, each sample x° y_j validates the program for all points x y_j z such that P0(x) = x°.

The precondition for applicability of the horizontal slicing rule is the data flow proof. It requires that the variables which are eliminated not be used in subsequent stages of the program. This condition can be verified automatically by generating definition-use chains for each and every definition of the variables under consideration (a variable is defined in statements where it is assigned a new value, either in a read() statement or in an assignment statement). A definition-use chain statically links a definition of a variable to all its possible uses. If there are no uses in the successive (unproven) stages of the program, the data flow precondition is fulfilled [Zima and Chapman (1990)].

In addition to the potential reduction of input space dimensionality in cases when P0 represents an aggregate or reduction function (min, max), horizontal slicing may alternatively reduce the number of values an input parameter may assume. Referring to Fig. 4, the stage P0 may represent a sorting function or a cyclic function such as sin(x) or cos(x). These functions are attractive for horizontal slicing because their codomains are smaller than their input domains, so proving their correctness reduces the cardinality of the newly formed input domains. It is desirable that the data reduction step P0 either uses well-known algorithms or can be specified and verified easily. If possible, a verified preprocessing module can (and should) be reused in many applications, without the need to be reverified over and over again, given that the new environment complies with the restrictions in the non-functional requirements specifications. From a design point of view, stage P1 in Fig. 4 should be the portion whose specification has a nonzero probability of containing errors.
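A schematic of the three-stage decomposition is sketched below. It is our illustration (the concrete functions are placeholders, not the chapter's code); the point is that a verified reduction step P0 shrinks the space that statistical testing must cover:

    def p0(x):
        """Verified preprocessing: a reduction (here, max) whose codomain
        is much smaller than its domain, cutting the input-space size."""
        return max(x)

    def p1(x_o, y):
        """The stage left for statistical testing; only (x°, y) remain."""
        return x_o * y  # placeholder computation

    def p2(y_o, z):
        """Verified output stage."""
        return y_o + z  # placeholder computation

    def program(x, y, z):
        return p2(p1(p0(x), y), z)

    # With p0 and p2 formally verified, one failure-free sample (x°, y_j)
    # of p1 stands for every input vector x with p0(x) == x°.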
4.3. Case Study II: Satellite Navigation Control System
This section describes the application of the slicing rules to the pitch-control system of a spacecraft orbiting the Earth. The spacecraft is required to maintain a fixed trajectory with a nominal speed. On the other hand, the spacecraft is subject to several environmental disturbances. For pitch attitude control, torque disturbances are principally due to the gravitational gradient, orbit eccentricity effects, and solar
pressure [Lahdhiri and Alouani (1993)]. The pitch controller is required to maintain the attitude within a very narrow margin of error, typically 3.5 milliradians. In our project, the pitch-control system receives sensor readings (inputs) from an environment simulator, and sends actuator commands to the environment simulator. Attitude command mode was chosen for the controller implementation. The input domain is five-dimensional, consisting of the current pitch position θ, current angular velocity θ_v, desired pitch angle θ_d, current mission time t, and current fuel level f. In this version of the controller, the last two dimensions of the input domain, current mission time t and current fuel level f, are state variables.

The controller implements the firing logic for the jet thrusters. It is assumed that the thrusters are located at exact opposite ends of the space vehicle and "paired" so that they fire in "couples". This assumption simplifies the firing logic because the thrusters produce pure torque with no linear acceleration. There are two output parameters representing firing commands for the two pairs of thrusters. The controller optimizes the time necessary to correct the orbiter's pitch angle to the desired value θ_d, under the constraint of fuel consumption.

The pitch-control program was designed with the applicability of slicing techniques for input domain reduction in mind. Indeed, the implementation enables the application of both vertical and horizontal slicing. The state variables, mission time t and fuel level f, can be eliminated by formally verifying the vertical slice of numerical computations in which they are both involved. The slice computes the angular velocity that the spacecraft is requested to achieve during the maneuver, given the phase of the mission and the fuel level. In order to correct the pitch angle, thrusters need to be fired. The desired angle can be achieved at different angular velocities, depending on the moment of force induced by firing the jet thrusters. The requirements specification document completely defines the control law for this computation, and we relied on it in the partial program proof. The reliability of the second vertical slice (and the whole program) is to be estimated by statistical sampling. This slice includes the algorithm for determining the optimal trajectory for achieving the desired pitch angle, given the spacecraft's current angular velocity.

The implementation offers an opportunity for the application of the horizontal slicing rule. Its very first computational step is the so-called "coordinate system normalization". What really matters in the controller is the difference between the current pitch angle and the desired pitch angle. Therefore, by setting θ' to θ - θ_d and θ_d to 0, θ_d becomes a constant and is eliminated from the input domain. Proving this preprocessing step is trivial. The reduced input domain is two-dimensional; the remaining input parameters are the relative pitch "drift" θ' and the current angular velocity θ_v.

In terms of input space reduction, the above-mentioned steps reduce the size of the input domain from approximately 5.18 × 10^29 to 7.2 × 10^7. These numbers were obtained by taking into account the accuracy of sensors that could actually be used in the field. Assuming that execution of a test case takes 1 second, exhaustive testing in the reduced domain could be completed in eleven days.
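The normalization step itself is tiny, which is what makes its correctness proof trivial. A sketch (ours; the values are illustrative radians):

    def normalize(theta, theta_d):
        """Coordinate-system normalization: work with the drift
        theta' = theta - theta_d, so the desired angle becomes the
        constant 0 and drops out of the input domain."""
        return theta - theta_d

    # Remaining 2-D input domain: relative pitch drift and angular velocity.
    theta_prime = normalize(theta=0.0123, theta_d=0.0100)
    print(f"{theta_prime:.4f}")  # 0.0023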
4.4. Effects of Slicing on Reliability Assessment
A program under test separates the input domain into two disjoint classes: inputs that are correctly and incorrectly mapped into the corresponding outputs. Consequently, from a statistical viewpoint, each test case is a Bernoulli trial and testing is modeled as sampling with replacement; that is, a test case can be repeated, but in large input domains that is unlikely to occur. The goal of reliability assessment is to bound the (unknown) failure probability θ of the program to a particular value with a predefined statistical confidence C. The consequence of applying the slicing rules is that testing (sampling) is performed by choosing inputs in the reduced set of inputs. If the reduced domains are "reasonably small", sampling without replacement (the design that prohibits repetition of test cases) provides higher confidence in the assessed failure probability. It was noted by Musa et al. (1990) that testing strategies based on partitioning of the input domain are especially suitable for testing without replacement because of the smaller population size in each partition. Partitioning coupled with partial program proofs provides a good framework for this type of testing.

Frequently, when reliability assessment is performed on safety-critical systems, testing reveals no failures. Correctness probability is defined as the conditional probability that the program is free of errors, given that t test cases have revealed no failures [Hamlet (1987), Ramamoorthy and Bastani (1982)]. An additional parameter that can be computed from the application of partial program proofs is the reduction factor, i.e., the ratio between the size of the original input domain and the reduced input domain. If it is difficult to perform testing in the reduced domain, due to unknown operational profiles in the reduced domain or the unavailability of automatic test data generators, testing in the original domain may have no alternative. However, due to the input space reduction, the provably minimal fault size in the original domain is equal to the reduction factor, because in the reduced domain one test case represents an entire equivalence class of original inputs. When this exact knowledge about minimal fault size is incorporated into the statistical assessment of correctness probability, the estimate increases. In other words, the probability that the program is correct when testing reveals no failures increases as a result of performing partial program proofs that reduce the size of the input domain. A detailed explanation of estimating the minimal fault size and correctness probability is provided in Cukic (1997).

Lastly, it has been observed that one of the ways to reduce the task of determining an operational profile is to reduce the size of the input space [Musa (1993)]. Slicing techniques neither reduce the functionality nor change the input-output behavior of the program. Input parameters eliminated by vertical slicing and partial program proofs have no further impact on program testing. Therefore, there is no need to estimate them prior to reliability assessment. The operational profile has to be determined only for the input parameters in the slices that are to be exposed to testing. In the case of horizontal slicing, the operational profile of variables in
the output of the (formally verified) slice is a function of the operational profiles of the input parameters, but can frequently be assessed independently. It has been shown that reducing the number of parameters in the input space also results in decreased sensitivity of the reliability estimate to variations in the operational profile of the input parameters [Cukic and Bastani (1996)].

5. Design Rules for Process-Control Programs

Good architectural design is a major factor in determining the success of a software system. While there are many architectural paradigms, design decisions pertaining to a specific software system are usually made in an ad hoc fashion. It is our intention to show that design decisions have a direct impact on the applicability of slicing techniques and, consequently, on the ability to address software reliability requirements during the early phases of the software life-cycle.
5.1. The Architecture of Process-Control Programs
The types of programs we consider are process-control programs. The purpose of a control system is to maintain specified properties of the outputs of the process, provided that reference values are given. Properties such as temperature, pressure, flow rates, or altitudes are monitored and their values are used to control the process by changing the settings of the active components, such as valves, heaters, thrusters, etc. Some of the basic requirements for creating a process-control program are as follows [Shaw and Garlan (1996)]:
• The program must accommodate deliberative behavior, such as a command coming from the authority (an authorized user, for example),
• It must also accommodate reactive behavior, that is, the system must respond to changes in its state caused by the timing aspects of the problem (for example, changes in the system state may not be instantaneous, and observing and then adjusting to them might take a certain time period),
• The software must account for dangers inherent in the process environment, i.e., by considering fault tolerance, safety, reliability, performance, etc., the software must continuously enforce and maintain the integrity of the system it controls despite the occurrence of unexpected problems.

The architectural solution frequently used for reactive systems is the control loop paradigm shown in Fig. 5. The controller executes a series of cycles or frames. At the beginning of each frame, it reads inputs from the sensors, performs some computations, and at the end of the frame sends commands to the mechanisms. The inputs are numerical quantities rather than symbolic strings. Similarly, the computations are numerical computations. The computation in each frame depends on the inputs read at the beginning of the frame as well as the values of some internal variables called history or state variables. These state variables are updated at the end of every frame.
Fig. 5. The feedback control loop of a reactive system.
Simplicity is one of the important advantages of the control loop paradigm. It captures the interaction between the controller and the environment. The underlying assumption is that changes in the environment are continuous and require continuous reactions. The simplicity of the paradigm supports fault tolerance and safety, in the sense that duplication of computation lanes (modular redundancy) is easy to incorporate.

The control loop paradigm allows (and actually encourages) a layered architecture of process-control programs. An input layer accepts real-world inputs, performs sensor interpretation (data analysis from one sensor), and sensor integration (the combined analysis of different sensor inputs). A decision-making layer is the core of the controller; it implements the control functions and reacts to changes in the environment and/or to commands from the authority. An output generation layer transforms computational outputs into a format meaningful to the actuators. Usually, the input layer and the output generation layer implement well-known and fairly simple algorithms.
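One frame of this layered control loop can be sketched as follows. This is our illustration only; all names, stub values, and the control law are invented:

    state = {"fuel": 100.0}  # history/state variables, updated each frame

    def read_sensors():
        # Input layer: sensor interpretation/integration. Inputs are
        # polled, per the earlier guideline against asynchronous interrupts.
        return {"temperature": 550.0, "power_request": 80.0}  # stub values

    def decide(inputs, state):
        # Decision-making layer: a deterministic control law.
        burn = min(inputs["power_request"] / 10.0, state["fuel"])
        return {"fuel_valve": burn}, {"fuel": state["fuel"] - burn}

    def actuate(commands):
        # Output-generation layer: format commands for the actuators.
        print(f"fuel_valve <- {commands['fuel_valve']:.1f}")

    def frame():
        global state
        inputs = read_sensors()              # beginning of frame
        commands, new_state = decide(inputs, state)
        actuate(commands)                    # end of frame
        state = new_state                    # state variables updated last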
5.2. Design Support for the Transformational Reliability Assessment
In principle, the application of vertical and horizontal slicing rules can take one of four forms. These are listed in order of increasing complexity and decreasing preference as follows:
(1) Design the program so that it has a specified property.
(2) Prove that the program has a specified property.
(3) Transform the program into a version that has a specified property.
(4) Transform the program into a version for which one can prove that it has a specified property.

Our main emphasis is on the application of slicing rules in the design phase of the software life-cycle. It must be noted that domain experts, with their extensive knowledge about the requirements specification and the application, should cooperate with software engineers in finding potential opportunities for the application of slicing. If the opportunity for "slice discovery" is missed during the early design phase, determining slices in the later phases of software development is much more difficult. General guidelines for software design that bear in mind the benefits of input domain reduction are:
• Detect and group data reduction steps in the input layer of the process-control program.
• Keep orthogonal computations independent. However, these computations can share code segments, since this improves the ability to detect failures by testing.
• Reduce the number of state variables to a minimum. These are the variables with potentially unknown operational profiles, and their elimination through the combination of slicing and formal verification is desirable (when possible).
• Use deterministic control algorithms; otherwise, testing cannot guarantee that the repeated execution of a test case results in the same (correct or incorrect) output.

It has already been stated that the input layer and the output generation layer usually implement well-known and fairly simple algorithms. If the confidence in the validity/correctness of the specifications for these two layers is high, they are good candidates for formal verification. According to the horizontal slicing rule, formal proofs of these two layers may significantly reduce the size of the program's input space. Further reduction may be achieved if an opportunity for vertical slicing exists in the decision-making layer. Some common candidates for the application of vertical slicing are computations of history variables, orthogonal computations, error-checking portions of the code, etc. Since history variables introduce dependence between test runs, they have to be treated as input parameters [Parnas et al. (1991)]. Random generation of their values for testing purposes assures independence between test runs and assures the applicability of the sampling model.

The decision of whether to formally verify a given vertical slice should be based on:
• One's confidence in the validity of the specification, and
• The ability (or inability) to obtain a good estimate of the operational profiles for the variables in the slice.

The preference for verification is given to slices with higher confidence in the validity of the specification and to slices that eliminate from testing input parameters with unknown operational profiles.
The above set of guidelines is not complete. These are just ideas which, when followed, may allow reasoning about the reliability levels that can be assessed later, in the validation phase of the software life-cycle. This approach also enables an early insight into the cost trade-offs in a reliability assessment framework that combines formal verification and statistical testing. It promotes a better understanding of the consequences of design decisions, a better insight into the problem domain, and a basis for an improved definition of testing scenarios. Lastly, this approach to software design enforces an early consideration of software testability, a design issue deemed crucial in other engineering disciplines but, unfortunately, frequently forgotten in software engineering.
6. Summary

Good architectural design has always been a major factor in determining the success of a software system. While there are many architectural paradigms, design decisions pertaining to a specific software system are usually made in an ad hoc fashion. In this paper, we presented software design guidelines which, when followed, may improve the assessment of software reliability. The transformational approach to software reliability assessment is unique in its combined use of formal verification methods and statistical testing. The program transformations that identify modules which are to be formally verified can be applied automatically. In practice, however, most applications of the slicing rules are the direct consequence of careful program structuring and understanding. If the opportunity for "slice discovery" is missed during the early design phase, determining slices in the later phases of software development is much more difficult.

This approach promotes a better understanding of the consequences of design decisions and a better insight into the problem domain. Furthermore, the transformational approach encourages software designers to consider the reliability-related consequences of design decisions early in the life-cycle. It makes a clear distinction between testable design choices and non-testable ones. We strongly believe that in the future, safety-critical applications will have to be designed and developed with their testability in mind, i.e., with regard to the applicability of practical assessment techniques in the reliability certification process. To the best of our knowledge, the transformational approach to software reliability assessment is the first paradigm which practically allows and promotes consideration of the assessable level of confidence in software reliability in relation to the software design choices. This is a significant practical step forward from the generalized notion that "complex structure leads to reduced reliability" [Thayer (1978)].
References

IEE/BCS Joint Study Report (1992), "Software in Safety-Related Systems", in Software in Safety-Related Systems: A Special Report, ed. Wichmann, B. A., John Wiley & Sons, Chichester, 1-144.
"IEEE Glossary of Software Engineering Terminology", ANSI/IEEE Standard 610.12, IEEE, New York, 1990.
Amman, P. E., Brilliant, S. S. and Knight, J. C. (1994), "The Effect of Imperfect Error Detection on Reliability Assessment via Life Testing", IEEE Trans. on Software Engineering 20(2), 142-148.
Bertolino, A. and Strigini, L. (1996), "On the Use of Testability Measures for Dependability Assessment", IEEE Trans. Software Engineering 22(2), 97-108.
Bastani, F. B. and Pasquini, A. (1994), "Assessment of a Sampling Method for Measuring Safety-Critical Software Reliability", in Proceedings of the 5th International Symposium on Software Reliability Engineering (ISSRE'94), Monterey, CA, Nov.
Bastani, F. B., Cukic, B., Hilford, V. and Jamoussi, A. (1996), "Towards Dependable Safety-Critical Software", in Proceedings of the 2nd IEEE Workshop on Object-Oriented Real-Time Dependable Systems, Laguna Beach, CA, Feb.
Bhansali, P. V. (1993), "Survey of Software Safety Standards Shows Diversity", IEEE Computer 26(1), 88-89.
Bhattacharya, S., Onoma, A. and Bastani, F. B. (1997), "High-Assurance Systems", Communications of the ACM 40(1), 67.
Butler, R. W. and Finelli, G. B. (1993), "The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software", IEEE Trans. Software Eng. 19(1), 3-12.
Clutterbuck, D. L. (1992), "Drive Report: Review of Current Tools and Techniques for the Development of Safety-Critical Software", in Software in Safety-Related Systems: A Special Report, ed. Wichmann, B. A., John Wiley & Sons, Chichester, 145-176.
Cukic, B. and Bastani, F. B. (1996), "On Reducing the Sensitivity of Software Reliability to Variations in the Operational Profile", in Proceedings of the 7th Int'l Symposium on Software Reliability Engineering (ISSRE'96), White Plains, NY, Oct., 45-54.
Cukic, B. (1997), Transformational Approach to Software Reliability Assessment, Doctoral Dissertation, Department of Computer Science, University of Houston, August.
Cullyer, V. J. and Storey, N. (1994), "Tools and Techniques for the Testing of Safety-Critical Software", Computing and Control Engineering Journal, October, 239-244.
Hamlet, R. (1987), "Probable Correctness Theory", Information Processing Letters, June, 17-25.
Hamlet, D. and Voas, J. (1993), "Faults on Its Sleeve: Amplifying Software Reliability Testing", in Proceedings of the 1993 Int'l Symposium on Software Testing and Analysis, Cambridge, MA, June, 89-98.
Heitmeyer, C. and Mandrioli, D. (1996), Formal Methods for Real-Time Computing, John Wiley & Sons, New York.
Lahdhiri, T. and Alouani, A. T. (1993), "LQG/LTR Pitch Attitude Control of an Earth-Orbiting Spacecraft", in Proc. 32nd Conf. on Decision and Control, San Antonio, TX, 445-446.
Lawrence, J. D., Persons, W. L., Preckshot, G. G. and Gallagher, J. (1992), "Evaluating Software for Safety Systems in Nuclear Power Plants", in Proceedings of the 9th Annual Conference on Computer Assurance (COMPASS '94) 35(2), 13-21.
Littlewood, B. (1994), "Learning to Live with Uncertainty in Our Software", in Proceedings of the 2nd International Software Metrics Symposium, London, UK, Oct.
Miller, K., Morell, L. J., Noonan, R. E., Park, S. K., Nicol, D. M., Murrill, B. W. and Voas, J. W. (1992), "Estimating the Probability of Failure When Testing Reveals No Failures", IEEE Trans. on Software Engineering 18(1), 33-44.
Musa, J. D., Iannino, A. and Okumoto, K. (1990), Software Reliability: Measurement, Prediction, Applications (Professional Edition), McGraw-Hill, 178-180.
Musa, J. D. (1993), "Operational Profiles in Software Reliability Engineering", IEEE Software, Mar., 14-32.
Musa, J. D. (1998), Software Reliability Engineering, McGraw-Hill.
Parnas, D. L., van Schouwen, A. J. and Kwan, S. P. (1990), "Evaluation of Safety-Critical Software", Communications of the ACM 33(6), 636-648.
Parnas, D. L., Asmis, G. J. K. and Madey, J. (1991), "Assessment of Safety-Critical Software in Nuclear Power Plants", Nuclear Safety 32(2), 189-198.
Ramamoorthy, C. V., Mok, Y. R., Bastani, F. B., Chin, G. H. and Suzuki, K. (1981), "Application of a Methodology for the Development and Validation of Reliable Process-Control Software", IEEE Trans. on Software Engineering SE-7(6), 354-371.
Ramamoorthy, C. V. and Bastani, F. B. (1982), "Software Reliability-Status and Perspectives", IEEE Trans. Software Eng. SE-8(July), 354-370.
Shaw, M. and Garlan, D. (1996), Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall.
Thayer, T. A., Lipow, M. and Nelson, E. C. (1978), Software Reliability, North-Holland Publishing, TRW Series of Software Technology, Amsterdam.
Tsoukalas, M. Z., Duran, J. W. and Ntafos, S. C. (1993), "On Some Reliability Estimation Problems in Random and Partition Testing", IEEE Trans. on Software Engineering SE-19(7), 687-697.
Voas, J. M. and Miller, K. W. (1995), "Software Testability: The New Verification", IEEE Software 12(3), 17-28.
Weiser, M. (1984), "Program Slicing", IEEE Trans. Software Engineering SE-10(4), 352-357.
Zima, H. and Chapman, B. (1990), Supercompilers for Parallel and Vector Computers, Frontier Series, ACM Press, New York.
Chapter 17

MANUFACTURING SYSTEMS ESTIMATION USING NEURAL NETWORK MODELS

PATRICIA L. COOPER* and GORDON J. SAVAGE†
Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada
*E-mail: [email protected]
†E-mail: [email protected]
1. Introduction

The challenge to design high-quality products and processes at low cost is the basis of a very large field of study. Methods of Robust Design provide a systematic and efficient way to meet this challenge through design optimization for performance, quality, and cost. Four components of Robust Design taken from Phadke (1989) include:

(1) making product performance insensitive to raw material variation,
(2) making designs robust against manufacturing variation,
(3) making designs least sensitive to variation in operating environment, and
(4) using a structured development process so that engineering time is used most productively.
There are six activities that help to achieve the components of Robust Design. These include:

(1) determining which process characteristics have a significant influence on system responses,
(2) determining how key process characteristics interact to influence the system response,
(3) selecting levels of key process characteristics for maximizing the assurance that the system responses meet customer requirements,
(4) predicting system responses for future expected levels of process characteristics,
(5) monitoring trends over time in the system in order to control system responses, and
(6) comparing the performance of the process with customer requirements to judge the process' suitability.

Traditional methods of Robust Design provide a means to carry out these activities, but no one technique can address all six simultaneously. Statistical Process Control, Capability Analysis, and computing Signal-to-Noise Loss Ratios are reactive methods; therefore the results, at most, can show that an adverse change has occurred and that changes are needed to restore the process. Design of Experiments is active, rather than reactive, but has its limitations since all work is done off-line. Thus there is a need for a method that can be automated to provide on-line, real-time information related to the six activities above. It is proposed that the integration of a manufacturing systems model with "Design for Quality" activities is a suitable alternative.

System models can normally be broken down into two categories: mechanistic and empirical. Mechanistic models use the physical mechanisms of the process to infer the form of the functional relationship describing the inputs and outputs, and the coefficients in the model represent physical parameters. When physical knowledge of a system is incomplete, no mechanistic model is available. In this case, it can often be realistically assumed that the relationship between the process characteristics and responses is smooth and, consequently, the function can be locally approximated by an empirical model that is an interpolation function based on observed behavior.

The generalized modeling problem for a manufacturing system application is to identify the system that maps a set of system variables to values of key response variables observed at a given point in time and relevant to the suitable level of organization. The set of system variables may include any pertinent variables of a system and may fall into any of Taguchi's six factor types: Control, Signal, Adjustment, Variance-Affecting, Noise, and Neutral. Also, characteristics of the system inputs and other relevant system conditions, such as machine and operator capability, specification of tolerances, and structure of the system, may have a significant effect on the system responses and should be included in the set of system variables. System responses may include quality attributes, levels of end-product reliability, and manufacturing costs, which may encompass human cost, energy cost, and physical cost.

Traditional systems modeling methods applied to manufacturing systems have their limitations. It is highly unlikely that enough physical systems knowledge of a manufacturing system is available to rely solely on the principles of mechanistic modeling. It is more often the case that modeling will be carried out using some type of empirical model, such as the popular Response Surface Models (RSM) [Khuri and Cornell (1996), Box and Draper (1987)]. Here a polynomial model of predetermined degree is fit using a least squares algorithm and data typically generated through Design of Experiments (DOE) procedures. With RSM, prior assumptions must be made on the degree of the resulting model as well as the nature of the model
residuals. As a result, RSM is problematic for manufacturing systems. For example, RSM (like DOE) is difficult to execute on-line. Furthermore, all past data must be retained when future observations are collected in order to update an existing model, which can become restrictive when data are collected over a long period of time. Also, the validation of residual assumptions is imperative each time a model is created, and violation of these assumptions discredits the fitted model. These restrictions render the present models in manufacturing systems dubious.

This paper discusses an alternative modeling technique based on Artificial Neural Networks. The main advantages of this type of model over the traditional ones are:

(1) no a priori assumptions need to be made surrounding the form of the model or physical relationships,
(2) the model can be updated in real-time without the need to retain past data,
(3) a large number of input variables can be incorporated into one model, and
(4) there is no restriction on the amount of nonlinearity that can be captured.

Herein, neural network techniques for deriving system models are presented and the theory of Radial Basis Function neural networks is discussed in detail. An example is included that clearly shows the ability of the neural network models to generalize an observed input/output relationship. Discussion is included on the use of the proposed model in performing the six activities for achieving the components of Robust Design.

A manufacturing process can be considered as an aggregate of subsystems, or elemental processes, whose collective purpose is to transform the physical state of materials to achieve overall system goals. Continuity and compatibility constraints relate flows, temperatures, product dimensions, and other variables between adjacent subsystems. Robust Design of manufacturing systems involves the selection of network characteristics so that the collection of subsystems performs best according to end-product specifications. The principles of neural network modeling can be used to create subsystem models or continuity and compatibility constraint equations from empirical data for whatever physical system information is lacking. When individual subsystem models can be combined by way of the continuity and compatibility constraints to form an overall system model, techniques discussed in this paper can be used to carry out Robust Design activities for the collection of subsystems. An example included in this paper shows a method to combine empirical and physical information for a collection of two subsystems in order to create an overall system model.

2. Radial Basis Function Neural Network Theory

Concepts of neural network modeling have developed rapidly over the last decade, and their use is widespread in aerospace, medical, and manufacturing industries, among others. Neural networks are one class of system estimation functions that
learn a continuous multivariate mapping between input and output space given a set of input/output pairs. The inspiration for neural network structures was taken from characteristics of brain function, and while a neural network does not approach the complexity of the brain, there are strong similarities. One similarity is that the building blocks of both biological and artificial neural networks are simple computational devices (neurons) that are highly interconnected. The connections between neurons determine the function of the network. Artificial neurons can be interconnected in a variety of network architectures [Hagan et al. (1996)]. The architecture selected for modeling manufacturing systems is the Radial Basis Function neural network.

Radial Basis Function (RBF) neural networks are particularly suited to handle the multivariable interpolation problem $f: \mathbb{R}^n \to \mathbb{R}^L$, where $n$, the number of independent variables, is sufficiently small. The RBF network architecture employs a single internal layer of locally-tuned processing units, where local representations have the advantage that the strength of the response depends on the proximity to each field, and computational overhead can be reduced by disregarding weak signals. The architecture depicted through the network diagram in Fig. 1 shows that the internal or hidden layer units are fully connected to an output layer of linear units, where $x$ is a real valued vector in the input space of $n$ variables.
Fig. 1. Radial basis function neural network architecture.
The RBF has a feedforward structure, consisting of $K$ locally-tuned units, or basis functions, which simultaneously receive the $n$-dimensional vector $x$. Each hidden output $z_k$ ($k = 1, \ldots, K$) is obtained by calculating the "closeness" of the input $x$ to an $n$-dimensional vector $\mu_k$ associated with the $k$th basis function. The response function of the $k$th basis function, $z_k$, is a radially-symmetric function $R$ with a single maximum at the origin which drops off rapidly to zero away from the center, as follows:

$$z_k = R\left(\frac{\|x - \mu_k\|}{\sigma_k}\right) \qquad (1)$$
where $\mu_k$ and $\sigma_k$ are the center and width, respectively, of the receptive field in the input space for unit $k$. The value of $z_k$ will be appreciable only when the "distance" $\|x - \mu_k\|$ is smaller than the width $\sigma_k$. Then for any given input, only the small fraction of basis functions with centers very close will respond with activations that differ significantly from zero. This leads to the notion of locality of RBF networks. A commonly used RBF network assumes a Gaussian basis function for the locally-tuned units:

$$z_k = \exp\left(-\frac{\|x - \mu_k\|^2}{\sigma_k^2}\right) \qquad (2)$$
where the norm is Euclidean. No bias terms are needed when Gaussian basis functions are used. The output layer of the RBF network is linear and produces a weighted sum of the outputs of the hidden layer, where the sum is calculated by the matrix multiplication given in Eq. (3).
$$\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_L \end{bmatrix} = \begin{bmatrix} w_{11} & w_{21} & \cdots & w_{K1} \\ w_{12} & w_{22} & \cdots & w_{K2} \\ \vdots & \vdots & & \vdots \\ w_{1L} & w_{2L} & \cdots & w_{KL} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_K \end{bmatrix} \qquad (3)$$
The strength of the connection between the $k$th hidden unit and the $l$th output unit is denoted by weight $w_{kl}$. Term $d_l$, where $l = 1, \ldots, L$, is the $l$th component of the network output vector for one input/output pair. The linear output layer function may also include a bias term $\lambda_{0l}$. An allowance for nonlinearity in the output layer is possible, provided the transfer function is invertible. Moody and Darken (1989), Broomhead and Lowe (1988), and Hassoun (1995) are popular citations for theory on Radial Basis Function networks.

Training the Network
Training of RBF networks is most computationally efficient when a hybrid learning method, combining linear supervised learning and linear self-organized learning is used. Supervised learning rules adjust the network parameters to move network outputs closer to target outputs and self-organized learning rules modify parameters in response to network inputs only. The combination of local representation and linear learning offers tremendous speed advantages relative to other architectures such as backpropagation. The hybrid learning method is an example of a training strategy that decouples learning at the hidden and the output layers, made possible for RBF networks because of the local receptive field nature of the hidden units. Under the hybrid learning method, receptive field centers and widths are first determined using a self-organizing or feedforward technique. Then, a supervised feedback procedure that optimizes total error is used to adjust the network weights and biases that connect the hidden and output layers.
Locating Receptive Field Centers

The standard K-means clustering algorithm is one of several schemes that have been suggested to find suitable locations for the receptive field centers without propagating output error back through the network. The goal of locating the receptive fields is to populate dense regions of the input space. The K-means algorithm [MacQueen (1967)] finds a set of $K$ receptive field centers to minimize the total squared Euclidean distance, or Sum Squared Error (SSE), between the $M$ training vectors (exemplars) and the nearest of the $K$ centers. The SSE is calculated as follows:

$$\mathrm{SSE} = \sum_{m=1}^{M} \sum_{k=1}^{K} M_{mk} \, \|x_m - \mu_k\|^2 \qquad (4)$$
where $M_{mk}$ is a $K \times M$ matrix of 0's and 1's with exactly one 1 per column, which identifies the receptive field to which a given exemplar belongs. The K-means algorithm is invoked until convergence, at which point the change in the centroids is insignificant from iteration to iteration. Each cycle of the algorithm decreases the SSE, and since there are only finitely many partitions of the data points into $K$ clusters, convergence is ensured. The optimal cluster configuration is influenced by the initial assignment of the centroids; oftentimes, $K$ randomly selected training vectors are used. Determining the centers based on a group of exemplars is known as batch training. In a real-time process where the aim is to update the current value of the network parameters on-line due to a small number of additional exemplars, a variation of the batch K-means algorithm, called the adaptive K-means algorithm, is available. The adaptive algorithm is attractive since it requires no storage of past training vectors or cluster membership information. At each time step, the receptive field center $\mu_j$ that is closest (in the Euclidean distance sense) to the new exemplar $x_{m+1}$ is updated according to $\Delta\mu_j = \rho(x_{m+1} - \mu_j)$, where $\rho$ is a small positive constant.

Generally speaking, there is no formal method for specifying the number $K$ of receptive fields in the RBF network. It is apparent that an architecture with a greater number of basis functions will be capable of achieving a smaller SSE. Radial basis function networks can produce a strict interpolating surface that exactly passes through all the pairs of the training set for a value of $K$ equal to the number of training points [Micchelli (1986)]. In application, the exact fit is misleading, especially in the presence of noisy data. Partial training may lead to better approximation of unknown functions in the sense of improved interpolation and possibly improved extrapolation. Hassoun (1995) recommends the procedure of cross-validation to decide on $K$.
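A minimal sketch of the adaptive K-means update, written here in Python with NumPy for illustration (the function name and the step size ρ = 0.02 are assumptions, not taken from the chapter):

```python
import numpy as np

def adaptive_kmeans_update(centers, x_new, rho=0.02):
    """Adaptive K-means step: move only the receptive-field center
    nearest to the new exemplar, by Delta-mu = rho * (x - mu).
    No past exemplars or cluster memberships are stored."""
    j = np.argmin(np.linalg.norm(centers - x_new, axis=1))
    centers[j] += rho * (x_new - centers[j])
    return centers

# Illustrative use: initialize with K randomly chosen exemplars, then stream.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                  # 500 exemplars, n = 5
centers = X[rng.choice(len(X), size=16, replace=False)].copy() # K = 16
for x in X:
    centers = adaptive_kmeans_update(centers, x)
```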
Determining Receptive Field Widths

The P-nearest neighbor heuristic [Moody and Darken (1989)] is suitable for determining the widths of the receptive fields in the RBF network in order to form a smooth and contiguous interpolation over the represented region of the input space. Under this heuristic, the receptive field widths are varied in order to achieve a certain amount of response overlap between each field and its neighbors. The width of a given receptive field is set equal to the root mean squared value of the Euclidean distances to its $P$ nearest neighboring receptive fields. Parameter $P$ is referred to as the overlap parameter, and typically assumes a value between 1 and 4, depending on $K$. Since the P-nearest neighbor heuristic relies on the receptive field center locations, but not on information about particular exemplars, the same heuristic is suitable for on-line and off-line training.

Determining Output Layer Weights

In practical cases, selecting the optimal weights $w^*$ is subject to pitfalls such as overfitting. To avoid overfitting, the strict interpolation problem is weakened to the case where fewer basis functions are used than the number of data points. The price of avoiding overfitting this way is that $z$ is no longer square, and the optimal weights $w^*$ cannot be obtained from $y = w^T z$ by $w^* = (z^T)^{-1} y^T$. When $z$ and $zz^T$ are of full rank, it can be proven that the solution $w^* = (zz^T)^{-1} z\, y^T$ is guaranteed to be the minimum of $\mathrm{SSE}(w)$ [Hassoun (1995)]. The matrix $(zz^T)^{-1} z$ is called the pseudoinverse matrix of $z^T$; equivalently, $z^T (zz^T)^{-1}$ is the pseudoinverse of $z$, denoted by $z^+$. This is confirmed by verifying the required property of pseudoinverses, namely $zz^+ = I$, where $I$ is the identity matrix of size $K \times K$. Equivalently, $z^T (z^T)^+ = I$ where $I \in \mathbb{R}^{M \times M}$. Whenever the output units are linear, the method of pseudoinverses can be used to solve the set of simultaneous linear equations in Eq. (3) to obtain the minimum SSE solution.

If the matrix $zz^T$ is badly conditioned or nearly singular, then $zz^+ = zz^T (zz^T)^{-1}$ will not be a good approximation to $I$. More specifically, for any matrix $A$ of size $n \times m$, linear algebra theory states that the largest possible rank is the smaller of $m$ and $n$. Suppose, without loss of generality, that $n$ is the smaller of the two. Then if $\operatorname{rank} A = n$, the matrix product $AA^T$ is invertible, of size $n \times n$, and $AA^T (AA^T)^{-1} = I$. In this case $A^T (AA^T)^{-1}$ is a suitable pseudoinverse for $A$. For $\operatorname{rank} A$ or $\operatorname{rank} AA^T$ less than full rank, while $(AA^T)^{-1}$ can be calculated, $AA^+$ will deviate significantly from $I$. It is suggested in this case to use the singular value decomposition procedure to find a suitable pseudoinverse. Complete details of singular value decomposition can be found in Broomhead and Lowe (1988). Through the Golub-Reinsch algorithm, matrix $A \in \mathbb{R}^{n \times m}$ is decomposed as $A = USV^T$, where $U$ is an $n \times m$ matrix with orthonormal columns, $S$ is an $m \times m$ diagonal matrix of positive singular values, and $V$ is an $m \times m$ orthogonal matrix. Then $A^+ = VS^+U^T$, where $S^+$ is obtained from $S$ by inverting the nonzero elements. When $A$ and $AA^T$ are full rank matrices, the pseudoinverse from the singular value decomposition method is the same as $A^T (AA^T)^{-1}$. Algorithms to obtain the pseudoinverses for determining the optimal weights are readily available.
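The width heuristic and the pseudoinverse weight solve can both be sketched in a few lines of NumPy. This is an illustrative reading of the text, not the authors' code; `np.linalg.pinv` computes the SVD-based pseudoinverse, which is the route the text recommends when $zz^T$ is poorly conditioned:

```python
import numpy as np

def rbf_widths(centers, P=3):
    """P-nearest-neighbor heuristic: each width is the RMS Euclidean
    distance from a center to its P nearest neighboring centers."""
    D = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                  # exclude self-distance
    nearest = np.sort(D, axis=1)[:, :P]          # P smallest distances per center
    return np.sqrt(np.mean(nearest**2, axis=1))  # RMS over those P neighbors

def output_weights(z, y):
    """Minimum-SSE weights for y = w^T z, with z (K x M) the hidden
    outputs and y (L x M) the targets; pinv uses the singular value
    decomposition, so a near-singular z z^T is handled gracefully."""
    return (y @ np.linalg.pinv(z)).T             # w = (y z^+)^T, shape K x L
```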
Network Generalization
Applying a trained network to input patterns that were not used to train the network is known as network generalization. For a network to adequately generalize, one rule of thumb is that it should have fewer parameters than the number of data vectors in the training set. In the RBF network architecture, there are three sets of parameters: $N \times K$ centers, $1 \times K$ widths, and $K \times L$ output layer weights. Then, $M > K(N + L + 1)$ exemplars should be presented to the network before generalizing on the resultant network. Once the network is trained using the self-organizing and supervised learning techniques described above, all parameters in the network are fully specified and an output can be estimated for any input pattern. Through matrix multiplication, outputs for many input patterns can be found immediately. First, the output of the hidden layer, $z \in \mathbb{R}^{K \times M}$, is constructed as follows:

$$z_{km} = \exp\left(-\frac{\|x_m - \mu_k\|^2}{\sigma_k^2}\right) \qquad (5)$$
where $x_m \in \mathbb{R}^{N \times 1}$ represents the $m$th input pattern, $\mu_k \in \mathbb{R}^{N \times 1}$ represents the location of the $k$th receptive field center, and $\sigma_k \in \mathbb{R}$ represents the radial width of the $k$th receptive field. Once the matrix $z$ has been constructed, the network output $d \in \mathbb{R}^{L \times M}$ can be calculated as follows:

$$d = w^T z \qquad (6)$$

where $w \in \mathbb{R}^{K \times L}$ is the trained weight matrix. Then the $L$-dimensional output for the $m$th input pattern is the $m$th column of $d$. The ability of the RBF network to interpolate a function, and subsequently generalize, relies on the general assumption that the system is smooth, in the sense that small changes in input parameters result in a correspondingly small change in the output. When approximating a high dimensional function, the only option may be to assume a high degree of smoothness; otherwise the number of exemplars required to capture the function behavior would be totally impractical. Conditions of the system that further improve the generalization ability of the RBF network are a linear mapping, positive range, limited domain, and invariance to some transformations. Typically, manufacturing systems satisfy all the above conditions.
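Equations (5) and (6) amount to two matrix operations. A minimal batch forward pass, again in NumPy with assumed variable names, might look as follows:

```python
import numpy as np

def rbf_forward(X, centers, widths, w):
    """Batch forward pass: Eq. (5) builds the hidden matrix z (K x M)
    from the M input patterns X (M x N); Eq. (6) returns d = w^T z."""
    d2 = np.sum((X[None, :, :] - centers[:, None, :]) ** 2, axis=2)  # K x M
    z = np.exp(-d2 / widths[:, None] ** 2)                           # Eq. (5)
    return w.T @ z                                                   # Eq. (6), L x M

# Rule of thumb before trusting generalization: M > K * (N + L + 1).
```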
Temperature Controller Circuit Example
The following example shows the operation of the RBF network and its ability to generalize a known multi-dimensional mapping. Phadke (1989) shows off-line design for a temperature controller using Design of Experiments, and Swan (1997) compares these results with those obtained through the probability approach. The goal is to minimize variation in the resistive function $R_{T\text{-}ON}$, the resistive value at which the relay turns on, given a variable resistor $R_3$ used to set the desired temperature. The controller circuit is as follows (Fig. 2):
where R3 is the signal factor and the remaining variables are design variables. Upon converting R2 and R4 to scalars of Ri, (a = R2/R\,X = R4/R1) and using properties of electric circuits, RT-ON can be expressed in terms of a signal factor and four design variables as follows: RT- ON
R3a(EzX + Eg) = Eza + EzX - E0a
(7)
where Eg and Ez are constant voltage sources. To examine the ability of the RBF network to generalize this function, probability distributions describing the signal and design factors values have been assumed. Input/output patterns are generated by sampling a five-dimensional input vector using the respective distribution for each dimension and calculating the one-dimensional output vector from Eq. (7). The signal and design factors and their respective distributions are found in Table 1. Table 1. Temperature controller factor distributions. Factor
Probability Distributions
R3
Uniform [0.898, 1.102] Normal (4.8,0.08 2 ) Normal (8.0, 0.1333 2 ) Normal (1.33, 0.02217 2 ) Normal (5.33, 0.08883 2 )
Ez Eo a A
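Generating the exemplars described next is straightforward once the distributions of Table 1 are fixed. A sketch in Python, using Eq. (7) as reconstructed above (the seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 500                                   # exemplars, as used in the text

# Sample the signal and design factors from the Table 1 distributions.
R3    = rng.uniform(0.898, 1.102, M)
Ez    = rng.normal(4.8, 0.08, M)
E0    = rng.normal(8.0, 0.1333, M)
alpha = rng.normal(1.33, 0.02217, M)
lam   = rng.normal(5.33, 0.08883, M)

# One-dimensional output from Eq. (7).
RT_on = R3 * alpha * (Ez * lam + E0) / (Ez * alpha + Ez * lam - E0 * alpha)

X = np.column_stack([R3, Ez, E0, alpha, lam])   # five-dimensional inputs
y = RT_on                                       # target output
```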
Five hundred exemplar vectors were randomly generated and the corresponding output value for each was calculated by Eq. (7). According to the rule of thumb relating problem size, the number of basis functions, and the number of exemplars, a problem with 71 or fewer basis functions is suitable. Batch training was carried out using all the exemplars under architectures with 8, 16, 32, and 64 basis functions, using routines written in Matlab V Release 4. Initial basis function centers were chosen as the first $K$ exemplars, and $P = 3$.
The input vectors from an additional two hundred known input/output pairs were presented to each of the four resultant networks to compare their generalization abilities; results are shown in Fig. 3.

Fig. 3. Network output versus Actual output (200 validation vectors).

The superior network produces output that most closely matches the known output, and hence its points would be most tightly scattered about the line $y = x$ added to these graphs. It is apparent from these graphs that the network architectures with $K = 8$ and $K = 16$ are insufficient for determining the outputs in the validation set. When 32 and 64 basis functions are used, the network is much more capable of generalizing to input/output patterns not presented to it under training. In fact, when calculating the correlation between the actual outputs and each of the four network outputs, it can be seen in Table 2, as expected, that as more basis functions are used, the actual output and network output are more tightly correlated. Of course, all correlation values in Table 2 should be positive, so $K = 8$ is clearly an unacceptable architecture. Outputs from the $K = 32$ and $K = 64$ networks are tightly correlated, which indicates that there is not a significant advantage to using 64 rather than 32 basis functions.
In accordance with conditions described above that affect the ability of the RBF network to interpolate a function, the known $R_{T\text{-}ON}$ function is smooth and it has a positive range and a limited domain. It appears that the Radial Basis Function network architecture is well suited to interpolating the function for $R_{T\text{-}ON}$ in terms of the signal factor and four design factors through presentation of input/output patterns.

Table 2. Actual output and network output correlation.

              Actual    K = 8     K = 16    K = 32    K = 64
    Actual    1
    K = 8     -0.1334   1
    K = 16    0.1047    0.7842    1
    K = 32    0.6436    0.2933    0.5142    1
    K = 64    0.7206    0.1754    0.3650    0.9592    1
3. Use of Model in Activities for Robust Design

The diversity of tasks that can be carried out once a network model has been accurately trained makes this methodology the undeniable alternative for Robust Design of manufacturing systems. Neural network models can be used to carry out all six activities for Robust Design listed in Sec. 1.

Activity 1: Determining which process characteristics have a significant influence on system responses.

Activity 2: Determining how key process characteristics interact to influence the system responses.

The first two activities are components of system analysis. By including input/output data for all potentially significant variables, the network fits a model to all variables, and sensitivity analysis, using either plots or gradient information from the model, can determine which characteristics have a significant influence on the responses. The same methods can be used to determine how the process variables interact to influence the system responses.

Activity 3: Selecting levels of key process characteristics for maximizing the assurance that the system responses meet customer requirements.

The third task, system design, uses the model and distribution information of its components to determine what levels of the process parameters will maintain the response(s) within specifications with highest probability.
The tasks of systems analysis and systems design are efficiently carried out by the methodology known as "Design for Quality" [Swan (1997), Swan and Savage (1998)] when the system model and process distribution information are known. Underlying the "Design for Quality" methodology is the use of a probability transformation method with the mathematical system model, so that designers can perform parameter (nominal) and tolerance (variance) design simultaneously. Through this transformation, an algorithm is used to search the parameter space efficiently to determine which combination of variables is "most likely" to occur such that the result would be a system response not meeting its specifications. The result is a methodology that can perform analysis and design for linear and nonlinear systems, with variables of any distribution type, with or without correlation of the variables.

Activity 4: Predicting system responses for future expected levels of process characteristics.

Neural network models are useful for predicting the level of the response for combinations of the input levels that are not seen during training. The model can be used for worst-case analysis where the input levels are extremes of the input space. Simulation software packages such as Prosim and ProModel are particularly geared towards assessing manufacturing systems through simulation when a model is available.

Activity 5: Monitoring trends over time in the system in order to control system responses.

Neural network modeling can be used for activities traditionally carried out by control charts within SPC. The models are trained with a binary response that indicates whether or not the corresponding system response extends beyond the upper or lower control limits for observed inputs. A transfer function may be applied to the output from the network so that the new output lies in the range [0, 1], depending on its proximity to the mean. Then the new output is "rounded" to an endpoint, where the appropriate clusters are defined by some discriminant that depends on the spread of the data. This discriminant is analogous to control limits in SPC and is defined by the 3σ spread of the data. This automatic control system can base its decisions on the effect of process conditions on the product attributes, and not just on the process conditions themselves. This is an advantage over traditional SPC activities.

Activity 6: Comparing the performance of the process with customer requirements to judge the process' suitability.

Neural network models can be used to compare the performance of the process with customer requirements to judge the suitability of the process. When expected distributions of process characteristics can be reasonably estimated, the model can be used to predict yield, where yield is the proportion of responses that lie within customer specifications. The measure of yield may be obtained either through the response distribution produced by Monte Carlo simulation, a process of simulated experimentation by repeated trials for analyzing an expected system's performance given a model [Spence and Soin (1988)], or through "Design for Quality" routines. The calculation of yield can be used in a manner similar to traditional Process Capability studies. Monte Carlo experimentation and "Design for Quality" are also useful for sensitivity analysis of the predictions.
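For Activity 6, a Monte Carlo yield estimate reduces to sampling the process-characteristic distributions, pushing the samples through the trained model, and counting in-spec responses. A minimal sketch, with hypothetical specification limits and a caller-supplied sampling function:

```python
import numpy as np

def estimate_yield(model, sample_inputs, n_trials, lsl, usl, rng):
    """Monte Carlo yield: proportion of simulated responses lying
    within the customer specification limits [lsl, usl]."""
    X = sample_inputs(n_trials, rng)   # draws from the factor distributions
    y = model(X)                       # model predictions, one response per row
    return np.mean((y >= lsl) & (y <= usl))

# Illustrative call, with made-up limits, on a trained network `net`:
# y_hat = estimate_yield(net, sample_inputs, 10_000, lsl=1.8, usl=2.4,
#                        rng=np.random.default_rng(2))
```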
4. Subsystem Modeling

The benefit of creating neural network models is the ability to build models of processes where sufficient physical information is not available to build a complete mechanistic model. Where the system can be broken up into subsystems, the physical continuity and compatibility information of the connection of subsystems may or may not be known. Also, for certain subsystems, the mapping between subsystem inputs and subsystem outputs may be known, whereas for others this information may be lacking. Neural networks provide means to connect subsystem models when continuity and compatibility constraints are not available, and also to learn a mapping between subsystem inputs and subsystem outputs when this relationship is not well understood. Consider a cooling fin from a wall, made up of two repeated cascading subsystems, shown in Fig. 4.
Fig. 4. Cooling fin.
The through and across variables for this system are heat flow and temperature, respectively. As such, there are continuity and compatibility constraints joining the subsystems that result from the vertex and circuit postulates, respectively. A physical model for each of the identical subsystems can be derived through Graph Theoretic Modeling. For $T_1$ and $Q_1$, the temperature and heat flow, respectively, at the wall, and $T_2$ and $Q_2$, the temperature and heat flow, respectively, at the end of the first half of the fin, the physical model is as follows:
$$\begin{bmatrix} T_2 \\ Q_2 \end{bmatrix} = \frac{1}{a}\, G \begin{bmatrix} T_1 \\ Q_1 \end{bmatrix} \qquad (8)$$
where $G$ is a $2 \times 2$ matrix whose entries (among them $b$) are polynomials in $g_k = 8Ak/L$ and $g_h = PLh/8$; here $P$, $A$, $L$, $k$, and $h$ are fin parameters, and

$$a = g_k^4 + 16 g_k^3 g_h + 20 g_k^2 g_h^2 + 8 g_k g_h^3 + g_h^4, \qquad b = 4 g_k^3 + 10 g_k^2 g_h + 6 g_k g_h^2 + g_h^3 .$$

For the second fin half, the mapping from $T_3$ and $Q_3$ to $T_4$ and $Q_4$ is identical. Connecting the subsystems, the continuity and compatibility constraints are $Q_2 = -Q_3$ and $T_2 = T_3$, by the vertex and circuit postulates, respectively. The heat flow at the end of the fin, $Q_4$, and the temperature at the wall, $T_1$, are the through and across drivers, respectively, of the system. Since physical knowledge is available for the connection constraints for the system, as well as for the mappings between inputs and outputs of the subsystems, a mechanistic model can be constructed.

To demonstrate the ability to combine mechanistic and empirical information, first consider the common case where the physical subsystem connection constraints are fully known but the mappings between subsystem inputs and outputs are unknown. In order to discern a relationship between subsystem inputs and outputs, corresponding input/output samples need to be collected. To simulate these data, the physical relationship in Eq. (8) is used to generate corresponding input/output samples, where $T_1$ takes on integer values between 50 and 100, and $Q_1$ takes on integer values between 10 and 20. The set of samples $\{(T_{1i}, Q_{1i}), (T_{2i}, Q_{2i})\}$, $i = 1, 2, \ldots, 561$, is used to train a Radial Basis Function neural network.

The size of the neural network created to describe these data depends on performance measures including sum of squared error, mean absolute error, and the percentage of samples whose network output lies within a specified tolerance of the target output. These performance measures study the variability in predictions made using a particular model; it is useful also to look at the bias of a particular model through procedures of cross-validation [Twomey and Smith (1993)]. The network size that optimizes the above performance measures for the samples under investigation is 16, and Fig. 5 contains plots that demonstrate the performance of the ensuing network model. The first two plots show the accuracy of the network model predictions of samples in the training set. The target output and network output for each of the two model outputs are plotted; in the best case, all points lie on the line target output = network output. These plots show that the model is best suited for describing samples where the output is close to the centers of the range of all outputs, presumably due to the fact that Radial Basis Functions can best describe samples with abundant neighbors. The plots in Fig. 6 show the network output for integer values of $T_1$ in the range of the inputs and at three levels of $Q_1$: points marked "x" are network output values at $Q_1 = 10$, points marked "*" at $Q_1 = 15$, and points marked "o" at $Q_1 = 20$.
Fig. 5. Network versus target output plots (including network Q2 output versus target Q2 output).
It is interesting to note that while the general behavior of the network curves is as expected, there is a great deal of variability in some cases. The network performance would be more accurate either if the training sample inputs covered a smaller range or if more sample points were available within the current range.
Fig. 6. Network function plots: network and target function for T2 output versus T1 input, and for Q2 output versus T1 input.
As demonstrated, neural network modeling can be used to determine a mapping between subsystem inputs and outputs when the only information available is empirical. Once all subsystem models are created, physical constraint relationships between subsystems can be integrated to generate an overall system model.
To demonstrate the task of connecting physical information with empirical models using the fin modeling problem, assume that the continuity and compatibility constraints connecting the identical subsystems are known a priori. The problem becomes that of connecting the subsystems to form a model, where temperature at the wall ($T_1$) and heat flow at the end of the fin ($Q_4$) are system inputs, and heat flow at the wall ($Q_1$) and temperature at the end of the fin ($T_4$) are system outputs. Presumably the motivation for creating such a model is so that levels of design variables $T_1$ and $Q_4$ can be selected to maintain response characteristics $T_4$ and $Q_1$ within specifications. There are a total of eight subsystem inputs and outputs for the fin problem. Levels of two inputs are known a priori, as well as two constraint equations, and since each subsystem output can be written as a separate function of all subsystem inputs, there are a total of four subcomponent models. Consequently, the problem is entirely specified; that is, for given system inputs the system outputs are unique. In order to satisfy eight constraints simultaneously, a feedback procedure must be used in this example. The flowchart in Fig. 7 shows one method of joining the subsystem models, constraint equations, and system inputs to determine the system outputs.
Fig. 7. Feedback procedure to connect subsystems.
Here $C_1: T_2 = T_3$ and $C_2: Q_2 = -Q_3$ are the constraints, and $f_1: (T_1, Q_1) \Rightarrow (T_2, Q_2)$ and $f_2: (T_3, Q_3) \Rightarrow (T_4, Q_4)$ are identical mappings determined, as demonstrated, through neural networks. Values of inputs $Q_4$ and $T_1$ are known, and the procedure of calculating the output values, $T_4$ and $Q_1$, starts with a guess of the appropriate value of $Q_1$. Once $Q_4$ is calculated as indicated in the flowchart, a comparison to the known value is made. If the known value of $Q_4$ equals the calculated value, then the feedback procedure terminates. Otherwise, an adjustment is made to the initial guess of $Q_1$, and the entire process is repeated. Note that it may be desirable to develop separate neural network models for $f_1$ and $f_2$ in the case that the inputs to the two subsystems span different ranges. It is not an arduous task to automate the model building and feedback procedures.
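The guess-and-revise loop of Fig. 7 is easy to automate once $f_1$ and $f_2$ are available (as neural network models or otherwise). In the sketch below, the revision rule is bisection on the guessed $Q_1$; the chapter does not specify how the guess is revised, so this choice, like the names, is an assumption:

```python
def solve_fin(T1, Q4_known, f1, f2, q_lo, q_hi, tol=1e-6):
    """Feedback procedure of Fig. 7: guess Q1, push it through f1, the
    constraints (T3 = T2, Q3 = -Q2), and f2, then compare the computed
    Q4 against its known value and revise the guess. Assumes the
    mismatch changes sign on [q_lo, q_hi]."""
    def mismatch(Q1):
        T2, Q2 = f1(T1, Q1)           # first fin half
        T4, Q4 = f2(T2, -Q2)          # apply constraints, then second fin half
        return Q4 - Q4_known, T4
    Q1 = 0.5 * (q_lo + q_hi)
    err, T4 = mismatch(Q1)
    for _ in range(100):              # bisection until the compare step passes
        if abs(err) < tol:
            break
        if mismatch(q_lo)[0] * err < 0:
            q_hi = Q1                 # root lies in the lower half
        else:
            q_lo = Q1                 # root lies in the upper half
        Q1 = 0.5 * (q_lo + q_hi)
        err, T4 = mismatch(Q1)
    return Q1, T4                     # outputs: heat flow at wall, tip temperature
```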
The ability to use this procedure to connect empirical models with physical constraints is extremely practical. In the second case, consider that the mappings between subsystem inputs and outputs are known, but the constraints connecting subsystems are not. Again, neural networks provide an efficient, systematic technique for incorporating empirical information to complete a system model. In this instance, data must be collected on adjacent subsystem outputs and inputs in order to discern a relationship between the two. The same feedback procedure can be used for developing a system model; in this case, neural network models represent certain constraint equations. Finally, if no physical system information is available, a neural network can be constructed that models all system responses in terms of all system inputs. Where it is possible to break the system up into subsystems and combine subsystem models to form an overall system model, there will be a better understanding of subsystem interactions and sensitivities.

5. Conclusions

In this paper it has been shown that the Radial Basis Function methodology is well suited for developing models of manufacturing processes. Using these models for Robust Design provides advantages over traditional methods in that on-line, real-time implementation is possible and the six Robust Design activities can be carried out simultaneously. It has also been shown that forming and connecting subsystem models by continuity and compatibility constraints is more advantageous than modeling the complete system. Indeed, neural network models can be constructed wherever physical information on subsystem models and continuity and compatibility constraints is lacking.

References

Broomhead, D. S. and Lowe, D. (1988), "Multivariate Functional Interpolation and Adaptive Networks", Complex Systems 2, 321-355.
Box, G. E. P. and Draper, N. R. (1987), Empirical Model-Building and Response Surfaces, John Wiley & Sons, Toronto.
Hagan, M. T., Demuth, H. B. and Beale, M. (1996), Neural Network Design, PWS Publishing Company, Boston.
Hassoun, M. H. (1995), Fundamentals of Artificial Neural Networks, The MIT Press, Cambridge.
Khuri, A. I. and Cornell, J. A. (1996), Response Surfaces: Designs and Analyses, 2nd Edition, Revised and Expanded, Marcel Dekker, New York.
MacQueen, J. (1967), "Some Methods for Classification and Analysis of Multivariate Observations", in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds. Le Cam, L. M. and Neyman, J., University of California Press, Berkeley, 281.
Micchelli, C. A. (1986), "Interpolation of Scattered Data: Distance Matrices and Conditionally Positive Definite Functions", Constructive Approximation 2, 11-22.
Moody, J. and Darken, C. J. (1989), "Fast Learning in Networks of Locally-Tuned Processing Units", Neural Computation 1, 281-294.
Phadke, M. S. (1989), Quality Engineering Using Robust Design, Prentice-Hall, Englewood Cliffs, NJ.
Swan, D. A. (1997), Design for Quality: A Model-Based Probabilistic Approach, Ph.D. Thesis, Department of Systems Design Engineering, University of Waterloo.
Swan, D. A. and Savage, G. J. (1998), "Continuous Taguchi: A Model-Based Approach to Taguchi's "Quality by Design" with Arbitrary Distributions", Quality and Reliability Engineering International 14, 1-13.
Spence, R. and Soin, R. S. (1988), Tolerance Design of Electronic Circuits, Addison-Wesley, Wokingham, England.
Twomey, J. M. and Smith, A. E. (1993), "Nonparametric Error Estimation Methods for Evaluating and Validating Artificial Neural Network Prediction Models", Intelligent Engineering Systems through Artificial Neural Networks 3, 233-239.
Chapter 18

A DETERMINISTIC SELECTIVE MAINTENANCE MODEL FOR COMPLEX SYSTEMS

C. RICHARD CASSADY
Department of Industrial Engineering, Mississippi State University, Mississippi State, MS 39762

MAJ. W. PAUL MURDOCK, Jr.
Department of Operational Sciences, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433

MAJ. EDWARD A. POHL
Office of the Secretary of Defense, Washington, DC 20301-1800
1. Introduction

As budgets decline, force structures decrease, and operational requirements increase, today's military is faced with the task of doing "more with less". The military's primary mission requires that today's advanced technology weapon systems be ready to perform their mission when called upon. Many of today's weapon systems are required to support operational missions around the world at rates exceeding those anticipated for peace-time scenarios. Due to the increased use of these systems, the military must find cost-effective ways to keep these systems working and at the same time support the operational requirements. The maintenance philosophy must change from a purely repair function to one that focuses on the mission [Hedderich (1996)]. To do this, the military must develop strategies and procedures that balance maintenance (preventive and corrective) with the operational requirements. Since many of these systems are needed to perform several missions per day, maintenance crews must decide which systems to repair in the time allotted between missions. This problem, however, is not limited to military environments. System managers in industry are also being asked to do "more with less". In particular, industrial engineers are constantly seeking methods for improving the efficiency of system operation. One area that presents a significant opportunity for industrial
productivity improvement is maintenance planning [Cassady and Nachlas (1998)]. In industry, maintenance is often viewed as a necessary evil or a "people problem". As a result, the benefits of using mathematical modeling to optimize maintenance planning have not been fully realized in industry. This paper develops a generalized maintenance model that can be used to identify maintenance activities that optimize mission requirements. Although particularly applicable in military scenarios, this selective maintenance model can be applied to industrial systems which utilize equipment that performs a series of missions. For example, production equipment may be used during the work week and maintained on the weekend; commercial aircraft are maintained between flights; computer systems are heavily used during the day and maintained at night.

Consider a system composed of $m$ independent subsystems (subsystem 1, subsystem 2, ..., subsystem $m$) connected in series. Let each subsystem $i$ contain a set of $n_i$ independent components connected in some fashion. Each component in the system is denoted by $(i, j)$, where $i$ denotes the subsystem number and $j$ denotes the component number. In addition, each component, subsystem, and the system can be in only one of two states: functioning properly or failed. This type of system configuration represents a wide variety of equipment utilized in both military and industrial environments (including machinery, computers, and vehicles). Assume the system is required to perform a sequence of identical missions with breaks of known length between missions. For example, a piece of equipment may be used for eight hours each day. At the beginning of a mission, say mission $k$, the status of a component is defined by

$$X_{ij}(k) = \begin{cases} 1 & \text{if component } j \text{ of subsystem } i \text{ is functioning at the start of mission } k \\ 0 & \text{otherwise} \end{cases}$$

The status of a subsystem at the beginning of mission $k$ is defined by

$$X_i(k) = \begin{cases} 1 & \text{if subsystem } i \text{ is functioning at the start of mission } k \\ 0 & \text{otherwise} \end{cases}$$

Since each subsystem consists of a structural arrangement of its components, subsystem status is a function of the status of its components. The status of the system at the beginning of mission $k$ is defined by

$$X(k) = \begin{cases} 1 & \text{if the system is functioning at the start of mission } k \\ 0 & \text{otherwise} \end{cases}$$
Since the system consists of a series arrangement of its subsystems, each subsystem must function in order for the system to function. Therefore, system status can be computed using the widely recognized series system structure function
$$X(k) = \prod_{i=1}^{m} X_i(k)$$

Similar status functions can be defined for the end of missions. At the end of mission $k$, the status of a component is defined by

$$Y_{ij}(k) = \begin{cases} 1 & \text{if component } j \text{ of subsystem } i \text{ is functioning at the end of mission } k \\ 0 & \text{otherwise} \end{cases}$$

The status of a subsystem at the end of mission $k$ is defined by

$$Y_i(k) = \begin{cases} 1 & \text{if subsystem } i \text{ is functioning at the end of mission } k \\ 0 & \text{otherwise} \end{cases}$$

and the status of the system at the end of mission $k$ is defined by

$$Y(k) = \begin{cases} 1 & \text{if the system is functioning at the end of mission } k \\ 0 & \text{otherwise} \end{cases}$$

where

$$Y(k) = \prod_{i=1}^{m} Y_i(k)$$

The performance of a component, subsystem or system can be measured in many ways. For the purposes of maintenance planning, performance is typically measured by reliability. The reliability of a component is given by the probability that the component survives a particular mission. Let $r_{ij}$ denote the probability that component $j$ of subsystem $i$ survives a particular mission, say mission $k$, given that the component is functioning at the start of the mission. Thus,

$$r_{ij} = P(Y_{ij}(k) = 1 \mid X_{ij}(k) = 1)$$

Note that $r_{ij}$ is not indexed on $k$. This implies that $r_{ij}$ is the same for any given mission, which is a result of the following two assumptions: (1) missions are identical, and (2) all components have a constant failure rate (i.e. components do not age over time). Although it is important to note these assumptions, neither of them has a direct impact on the numerical results presented in this paper. Either of these assumptions could be relaxed by simply indexing $r_{ij}$ on $k$ and/or the age of the component. Having defined these component survival probabilities, the reliability of component $j$ of subsystem $i$ for mission $k$ is given by

$$P(Y_{ij}(k) = 1) = r_{ij}\, X_{ij}(k)$$
This expression is derived by conditioning on the status of the component at the beginning of the mission. Since each subsystem is a structured arrangement of its components, the reliability of subsystem $i$ for mission $k$ can be written as a function of its components' reliabilities. In general, it can simply be stated that

$$R_i(k) = P(Y_i(k) = 1)$$

Since the system is a series arrangement of the subsystems, the system reliability for mission $k$ can be defined using the series system reliability function
$$R(k) = P(Y(k) = 1) = \prod_{i=1}^{m} R_i(k)$$

At the completion of a particular mission, say mission $k$, each component in the system is either functioning or failed. Suppose that the technology exists to repair all failed components (if there are any) to a functioning condition for the next mission. Ideally, all failed components (those components having $Y_{ij}(k) = 0$) would be repaired prior to the beginning of the next mission. However, it may not be possible to repair all the failed components. Let $t_{ij}$ denote the amount of time required to repair component $j$ of subsystem $i$. By representing the repair time as a known constant, assumptions of a single mode of failure and a single method of repair for each component are implied. The total time required to repair all failed components in the system prior to the next mission, mission $k + 1$, is given by

$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} t_{ij}\, (X_{ij}(k+1) - Y_{ij}(k))$$
where, in this case, the $X_{ij}$'s represent the component status values at the beginning of the next mission. Suppose the total amount of time allotted to perform maintenance upon failed components between missions is $T_0$ time units. If

$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} t_{ij}\, (X_{ij}(k+1) - Y_{ij}(k)) > T_0$$

then all failed components cannot be repaired prior to beginning the next mission. In such a case, a method is needed to decide which failed components should be repaired prior to the next mission and which components should be left in a failed condition. This process is one example of selective maintenance. Selective maintenance is the process of selecting a set of maintenance activities to perform from a set of desirable maintenance activities. Solving selective maintenance problems falls under the research domain of maintenance modeling and optimization.

Maintenance modeling and optimization have been studied extensively in the open literature. Fortunately, several authors have reviewed the maintenance literature. In 1965, McCall (1965) conducted a survey on maintenance policies for equipment subject to stochastic failure. Over 80 references are cited. In 1976,
Pierskalla and Voelker (1976) surveyed the literature on maintenance models for deteriorating systems. Over 250 references are cited in this paper. Valdez-Flores (1989) surveyed the research done on preventive maintenance models between 1976 and 1988. Over 120 articles are cited, and the research on optimization models for repair, replacement, and inspection of systems subject to stochastic deterioration is summarized. Dekker (1996), in a more recent survey, provides another review and analysis of the maintenance optimization models available in the literature. Over 130 references are cited that address maintenance modeling and optimization.

More recently, Kapur and Bhalla (1990) develop optimum maintenance policies that maximize service reliability for systems that undergo both repair and preventive maintenance. Zheng and Fard (1991) develop opportunistic maintenance policies for repairable systems based on a unit hazard rate. Hsu (1991) investigates optimal preventive maintenance policies in a serial production system. A mathematical model is developed for a multi-device production system, and the results show that it is important to examine the joint effects of a maintenance policy on the various production stations simultaneously rather than separately. Jayabalan and Chaudhuri (1992) develop a cost optimization model for maintenance scheduling for a system with assured reliability. They develop strategies for performing maintenance at predetermined points such that the reliability does not drop below some acceptable level. Van Der Duyn Schouten and Vanneste (1993) investigate optimal group control policies for a set of M identical machines. They model the components using four states: good, doubtful, preventive maintenance due, and failed. Dekker and Smeitink (1994) investigate preventive maintenance policies under randomly occurring opportunities of restricted duration. A model for determining the optimal execution time for individual maintenance packages is developed; the authors focus on the use of the block replacement strategy and use a Monte Carlo simulation model to analyze deviations from the optimal time. Duboka et al. (1996) develop preventive maintenance policies for units subject to two kinds of wear failure. The authors optimize operational readiness for a system that has two modes of failure: one requiring adjustment and another requiring replacement. Dedopoulos and Shah (1996) investigate optimal preventive maintenance strategies for equipment in multi-purpose production facilities. The authors formulate a mixed integer nonlinear model that maximizes expected resource availability to determine the optimal maintenance strategies. Finally, El-Damcese (1997) investigates preventive maintenance policies by optimizing the average cost rate for system operation. The system is made up of a collection of components, each of which can be working or failed. The states of components are assumed to be independent, and criteria are developed for determining the importance of components to aid in the determination of those that will receive maintenance action.

In a previous paper, Rice et al. (1998) define a mathematical programming model for solving the selective maintenance problem studied in this paper under the assumptions: (1) each subsystem is comprised of a parallel arrangement of $n_i$ independent and identical copies of a component, and (2) the objective in performing the selective maintenance is to maximize the system reliability for the next mission.
In this paper, the mathematical programming approach is extended to consider subsystems of any structure. A general formulation is presented, and three specific examples are presented which demonstrate the types of results that can be obtained from solving this selective maintenance mathematical program. One of these examples corresponds to the specific case addressed by Rice et al. (1998). Then, the selective maintenance mathematical programming model is extended to include cost constraints, as well as systems of any structure.

2. Mathematical Programming Models

For the case in which the time allotted for maintenance is insufficient to repair all failed components in the system, a mathematical programming model is defined for assisting in the selective maintenance decision. The first step in the formulation of this model is the identification of the decision variables for the model. Given the status of each component at the end of a certain mission (the $Y_{ij}(k)$ values), the selective maintenance decision consists of identifying the failed components (those components having $Y_{ij}(k) = 0$) to be repaired prior to the next mission. This decision can be represented mathematically by specifying the $X_{ij}$ values for the next mission. Thus, the $X_{ij}(k+1)$ values serve as the decision variables for the mathematical programming model.

The next step in formulating the mathematical programming model for this selective maintenance problem is the construction of the objective function. Initially, assume that the objective in performing selective maintenance is to maximize the system reliability for the next mission. Therefore, the objective function is given by

$$R(k+1) = \prod_{i=1}^{m} R_i(k+1)$$

where $R(k+1)$ denotes the system reliability for the next mission and $R_i(k+1)$ denotes the reliability function of subsystem $i$ for the next mission. Recall that $R_i(k+1)$ is a function of the decision variables, where the mathematical form of the function depends on the configuration of the subsystem.

The final step in formulating this selective maintenance mathematical programming model is the construction of the constraints on the decision variables. The first constraint requires that all maintenance activities be completed within the allotted time. Specifically,

$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} t_{ij}\, (X_{ij}(k+1) - Y_{ij}(k)) \le T_0$$

Second, the decision variables are restricted to binary values:

$$X_{ij}(k+1) \text{ binary} \quad \forall i, j$$
Finally, a component's status at the beginning of the next mission must be at least as good as its status at the end of the previous mission. In other words, components that are operating at the end of a mission will still be operating at the beginning of the next mission. Thus,

$$X_{ij}(k+1) \ge Y_{ij}(k) \quad \forall\, i,j .$$
Adding the constraints to the objective function yields the final formulation of the basic selective maintenance problem:

$$\begin{aligned}
\mathrm{P}: \quad \text{Maximize} \quad & R(k+1) = \prod_{i=1}^{m} R_i(k+1) \\
\text{subject to} \quad & \sum_{i=1}^{m}\sum_{j=1}^{n_i} t_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \le T_0 \\
& X_{ij}(k+1) \ge Y_{ij}(k) \quad \forall\, i,j \\
& X_{ij}(k+1) \ \text{binary} \quad \forall\, i,j
\end{aligned}$$
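For the small systems considered in the examples that follow, problem P can be solved by complete enumeration of the binary decision vectors. The Python sketch below illustrates this approach; the function name solve_P, the list-of-lists encoding of the t_ij and Y_ij(k) values, and the representation of each subsystem as a reliability function are illustrative assumptions, not constructs from the original formulation.

```python
from itertools import product

def solve_P(subsys, t, Y, T0):
    """Brute-force solution of problem P for small systems.

    subsys -- list of subsystem reliability functions; subsys[i](X) returns
              R_i(k+1) for a full status matrix X (list of lists of 0/1)
    t      -- t[i][j], repair time for component j of subsystem i
    Y      -- Y[i][j], end-of-mission status (1 = working, 0 = failed)
    T0     -- total time allotted for maintenance
    """
    failed = [(i, j) for i, row in enumerate(Y)
                     for j, y in enumerate(row) if y == 0]
    best_R, best_X = -1.0, None
    # Enumerate every subset of failed components. Because only failed
    # components may change state, X >= Y holds by construction.
    for repair in product((0, 1), repeat=len(failed)):
        X = [row[:] for row in Y]
        time = 0
        for (i, j), rep in zip(failed, repair):
            if rep:
                X[i][j] = 1
                time += t[i][j]
        if time > T0:              # time constraint
            continue
        R = 1.0
        for Ri in subsys:          # system reliability: product over subsystems
            R *= Ri(X)
        if R > best_R:
            best_R, best_X = R, X
    return best_R, best_X
```

Enumeration is exponential in the number of failed components, so this sketch is only intended to make the formulation concrete; larger instances call for the mathematical programming machinery discussed in the text.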
It is important to recognize that this formulation applies to all selective maintenance problems that satisfy the defined system and mission conditions. This model is deterministic in that all the model parameters (end-of-mission status values, component reliabilities, component maintenance times, total allotted time for maintenance) are assumed to be known constants.

3. Selective Maintenance Examples

In this section, three example selective maintenance problems are solved. Each of these problems uses the general selective maintenance mathematical programming formulation (P) developed in Sec. 2. For each example, the appropriate parameters are defined and the optimal selective maintenance decisions are identified. These examples help to demonstrate the type of information that can be obtained from solving the selective maintenance optimization problem.

The first example involves the system described by the reliability block diagram found in Fig. 1. It is important to note that the system represented in Fig. 1 contains subsystems that are composed of identical components. Therefore, the selective maintenance problem for this system falls under the formulation defined by Rice et al. (1998). However, the selective maintenance mathematical programming model defined in this paper can also be applied to this problem. The reliability formulae required for solving problem P for this system are stated below. These formulae are a direct result of the fact that the subsystems consist of parallel arrangements of their components.
Fig. 1. Reliability block diagram for Example 1.
$$R_1(k+1) = 1 - \left(1 - r_{11}X_{11}(k+1)\right)\left(1 - r_{12}X_{12}(k+1)\right)\left(1 - r_{13}X_{13}(k+1)\right)$$

$$R_2(k+1) = 1 - \left(1 - r_{21}X_{21}(k+1)\right)\left(1 - r_{22}X_{22}(k+1)\right)$$

$$R_3(k+1) = 1 - \left(1 - r_{31}X_{31}(k+1)\right)\left(1 - r_{32}X_{32}(k+1)\right)\left(1 - r_{33}X_{33}(k+1)\right)\left(1 - r_{34}X_{34}(k+1)\right)$$
The specific parameter values for Example 1 are defined in Table 1, which reflects one possible realization of the status of the system upon completion of one mission. The failed components are those having $Y_{ij}(k) = 0$. Note that this particular realization results in a system failure due to the failure of subsystem 2.

Table 1. Example 1 parameter values.

Subsystem (i)   Component (j)   r_ij   t_ij   Y_ij(k)
      1               1          0.8     3       1
      1               2          0.8     3       0
      1               3          0.8     3       0
      2               1          0.9     2       0
      2               2          0.9     2       0
      3               1          0.7     4       1
      3               2          0.7     4       0
      3               3          0.7     4       1
      3               4          0.7     4       0
In addition to the values provided in Table 1, $T_0 = 10$ time units are available for maintenance. In order to repair all the failed components in the system prior to the next mission, 18 time units are required. Therefore, there is not a sufficient amount of time for performing all desirable repairs prior to beginning the next mission, and problem P must be solved in order to determine the failed components to repair. Solving problem P for Example 1 indicates that the following components
should be repaired: components 2 and 3 of subsystem 1, and components 1 and 2 of subsystem 2. These repairs consume all 10 units of available maintenance time and result in a system reliability of 0.8937 for the next mission. The maximum achievable reliability if all components were repaired is 0.9741.
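As a check on this result, the enumeration sketch of Sec. 2 can be instantiated with the Table 1 data; the parallel helper below is an assumed construction (not part of the original text) for building the subsystem reliability functions.

```python
def parallel(i, r):
    # Reliability of a parallel subsystem: 1 - prod_j (1 - r_ij * X_ij)
    def Ri(X):
        q = 1.0
        for j, rij in enumerate(r):
            q *= 1.0 - rij * X[i][j]
        return 1.0 - q
    return Ri

# Table 1 data
r = [[0.8, 0.8, 0.8], [0.9, 0.9], [0.7, 0.7, 0.7, 0.7]]
t = [[3, 3, 3], [2, 2], [4, 4, 4, 4]]
Y = [[1, 0, 0], [0, 0], [1, 0, 1, 0]]
subsys = [parallel(i, r[i]) for i in range(3)]

best_R, best_X = solve_P(subsys, t, Y, T0=10)   # reuses solve_P from Sec. 2
print(round(best_R, 4))  # 0.8937, matching the result reported above
print(best_X)            # components 2, 3 of subsystem 1 and 1, 2 of subsystem 2 repaired
```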
Fig. 2. Reliability block diagram for Example 2.
The second example involves the system described by the reliability block diagram found in Fig. 2. Although the system in Fig. 2 is similar in structure to the system presented in Fig. 1, it is worthwhile to note that the components in each of the three subsystems are no longer identical. The reliability formulae required for solving problem P for this system are provided below and are based again on the fact that the subsystems consist of parallel arrangements of their components.

$$R_1(k+1) = 1 - \left(1 - r_{11}X_{11}(k+1)\right)\left(1 - r_{12}X_{12}(k+1)\right)\left(1 - r_{13}X_{13}(k+1)\right)$$

$$R_2(k+1) = 1 - \left(1 - r_{21}X_{21}(k+1)\right)\left(1 - r_{22}X_{22}(k+1)\right)\left(1 - r_{23}X_{23}(k+1)\right)\left(1 - r_{24}X_{24}(k+1)\right)\left(1 - r_{25}X_{25}(k+1)\right)$$

$$R_3(k+1) = 1 - \left(1 - r_{31}X_{31}(k+1)\right)\left(1 - r_{32}X_{32}(k+1)\right)\left(1 - r_{33}X_{33}(k+1)\right)\left(1 - r_{34}X_{34}(k+1)\right)$$
The specific parameter values for Example 2 are defined in Table 2, with the failed components again being those having $Y_{ij}(k) = 0$.

Table 2. Example 2 parameter values.

Subsystem (i)   Component (j)   r_ij   t_ij   Y_ij(k)
      1               1          0.8     6       1
      1               2          0.7     3       0
      1               3          0.85    5       1
      2               1          0.65    2       0
      2               2          0.5     1       0
      2               3          0.7     5       1
      2               4          0.75    6       1
      2               5          0.6     3       0
      3               1          0.55    4       1
      3               2          0.75    2       1
      3               3          0.6     6       0
      3               4          0.7     3       0

In this example, the system has returned from its mission in working condition, i.e. all subsystems have at least one component functioning. In addition to the values provided in Table 2, $T_0 = 11$ time units are available for maintenance. In order to repair all the failed components in the system prior to the next mission, 18 time units are required. Thus, problem P must be solved in order to determine the failed components to repair. Solving problem P for Example 2 indicates that the following components should be repaired: component 2 of subsystem 1, components 1 and 5 of subsystem 2, and component 4 of subsystem 3. These repairs consume all 11 units of available maintenance time and result
Fig. 3. Reliability block diagram for Example 3.
in a system reliability of 0.9475. The maximum achievable reliability if all components were repaired is 0.9725.

The third example involves a more complex structure; its reliability block diagram is shown in Fig. 3. This example demonstrates the generality of the systems for which problem P is applicable. Subsystem 1 is the commonly recognized bridge structure, and subsystem 2 is a somewhat complex series-parallel structure. The reliability formulae required for solving problem P for this system are presented below.

$$\begin{aligned}
R_1(k+1) ={}& r_{13}X_{13}(k+1)\left[1 - \left(1 - r_{11}X_{11}(k+1)\right)\left(1 - r_{12}X_{12}(k+1)\right)\right] \\
& \times \left[1 - \left(1 - r_{14}X_{14}(k+1)\right)\left(1 - r_{15}X_{15}(k+1)\right)\right] \\
& + \left(1 - r_{13}X_{13}(k+1)\right)\left[1 - \left(1 - r_{11}X_{11}(k+1)\,r_{14}X_{14}(k+1)\right)\right. \\
& \qquad \left.\times \left(1 - r_{12}X_{12}(k+1)\,r_{15}X_{15}(k+1)\right)\right]
\end{aligned}$$

$$\begin{aligned}
R_2(k+1) = 1 - {}& \left\{1 - r_{21}X_{21}(k+1)\left[1 - \left(1 - r_{22}X_{22}(k+1)\right)\left(1 - r_{23}X_{23}(k+1)\right)\right]\right\} \\
& \times \left\{1 - r_{24}X_{24}(k+1)\,r_{25}X_{25}(k+1)\right\}
\end{aligned}$$
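These formulae translate directly into subsystem reliability functions for the enumeration sketch of Sec. 2. The following is a minimal rendering, assuming the Table 3 data given below and the earlier solve_P helper; the function names and data layout are illustrative choices.

```python
def R1_bridge(X, r):
    # Bridge structure: condition on the bridge component (1,3).
    a = [r[j] * X[0][j] for j in range(5)]   # effective component reliabilities
    works = (1 - (1 - a[0]) * (1 - a[1])) * (1 - (1 - a[3]) * (1 - a[4]))
    fails = 1 - (1 - a[0] * a[3]) * (1 - a[1] * a[4])
    return a[2] * works + (1 - a[2]) * fails

def R2_series_parallel(X, r):
    # Path 1: component (2,1) in series with the parallel pair (2,2), (2,3).
    # Path 2: components (2,4) and (2,5) in series; the two paths in parallel.
    a = [r[j] * X[1][j] for j in range(5)]
    path1 = a[0] * (1 - (1 - a[1]) * (1 - a[2]))
    path2 = a[3] * a[4]
    return 1 - (1 - path1) * (1 - path2)

# Table 3 data (see below)
r1 = [0.9, 0.7, 0.65, 0.8, 0.85]
r2 = [0.7, 0.9, 0.75, 0.8, 0.6]
t = [[6, 3, 5, 1, 4], [2, 1, 5, 6, 3]]
Y = [[0, 1, 0, 0, 0], [0, 1, 1, 0, 0]]
subsys = [lambda X: R1_bridge(X, r1), lambda X: R2_series_parallel(X, r2)]

best_R, best_X = solve_P(subsys, t, Y, T0=15)
print(round(best_R, 4))  # 0.6051, matching the result reported below
```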
The specific parameter values for Example 3 are defined in Table 3; once again, the failed components are those having $Y_{ij}(k) = 0$. Like the first example, this system has come back from its mission in a failed state; in this case, subsystem 1 is failed.
Table 3. Example 3 parameter values.

Subsystem (i)   Component (j)   r_ij   t_ij   Y_ij(k)
      1               1          0.9     6       0
      1               2          0.7     3       1
      1               3          0.65    5       0
      1               4          0.8     1       0
      1               5          0.85    4       0
      2               1          0.7     2       0
      2               2          0.9     1       1
      2               3          0.75    5       1
      2               4          0.8     6       0
      2               5          0.6     3       0
In addition to the values provided in Table 3, $T_0 = 15$ time units are available for maintenance. In order to repair all the failed components in the system prior to the next mission, 27 time units are required. Thus, problem P must be solved in order to determine the failed components to repair. Solving problem P for Example 3 indicates that the following components should be repaired: components 1, 4 and 5 of subsystem 1, and component 1 of subsystem 2. These repairs consume 13 units of available maintenance time and result in a system reliability of 0.6051. The maximum achievable reliability for this system is 0.7697.

4. Generalizations of the Selective Maintenance Model

In the previous formulation, a finite amount of time, $T_0$, was allotted for making repairs to the failed components. However, in many cases, both time and cost constrain the maintenance activities performed between missions. Let $c_{ij}$ denote the cost to repair component j of subsystem i, and suppose the total cost of repairs between two missions may not exceed $C_0$. The total cost of repairing selected failed components in the system prior to the next mission, say mission k+1, is given by

$$C(k) = \sum_{i=1}^{m}\sum_{j=1}^{n_i} c_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right).$$
A cost constraint can be added to the selective maintenance model such that all maintenance activities are required to be completed within the allotted cost. Specifically,

$$\sum_{i=1}^{m}\sum_{j=1}^{n_i} c_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \le C_0 .$$

Adding this constraint to the model yields a new selective maintenance optimization problem:

$$\begin{aligned}
\mathrm{P1}: \quad \text{Maximize} \quad & R(k+1) = \prod_{i=1}^{m} R_i(k+1) \\
\text{subject to} \quad & \sum_{i=1}^{m}\sum_{j=1}^{n_i} t_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \le T_0 \\
& \sum_{i=1}^{m}\sum_{j=1}^{n_i} c_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \le C_0 \\
& X_{ij}(k+1) \ge Y_{ij}(k) \quad \forall\, i,j \\
& X_{ij}(k+1) \ \text{binary} \quad \forall\, i,j
\end{aligned}$$
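For small systems, P1 can be handled by the same enumeration as P, tracking repair cost alongside repair time. The sketch below is again an illustrative construction under the assumed data layout, not the authors' algorithm.

```python
from itertools import product

def solve_P1(subsys, t, c, Y, T0, C0):
    """Problem P1: maximize next-mission reliability under time and cost limits."""
    failed = [(i, j) for i, row in enumerate(Y)
                     for j, y in enumerate(row) if y == 0]
    best_R, best_X = -1.0, None
    for repair in product((0, 1), repeat=len(failed)):
        X = [row[:] for row in Y]          # X >= Y holds by construction
        time = cost = 0
        for (i, j), rep in zip(failed, repair):
            if rep:
                X[i][j] = 1
                time += t[i][j]
                cost += c[i][j]
        if time > T0 or cost > C0:         # both resource constraints
            continue
        R = 1.0
        for Ri in subsys:
            R *= Ri(X)
        if R > best_R:
            best_R, best_X = R, X
    return best_R, best_X
```

Passing C0 = float('inf') disables the cost constraint, which mirrors the special-case relationship noted next.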
Note that P is the special case of P1 for which $C_0 = \infty$. In problem P1, the objective is to maximize system reliability subject to both time and cost constraints. An interesting variation on this selective maintenance problem is to consider minimizing the total system repair cost subject to the time constraint and a minimum required reliability level, $R_0$. The formulation for this selective maintenance optimization problem is given below.

$$\begin{aligned}
\mathrm{P2}: \quad \text{Minimize} \quad & C(k) = \sum_{i=1}^{m}\sum_{j=1}^{n_i} c_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \\
\text{subject to} \quad & \sum_{i=1}^{m}\sum_{j=1}^{n_i} t_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \le T_0 \\
& \prod_{i=1}^{m} R_i(k+1) \ge R_0 \\
& X_{ij}(k+1) \ge Y_{ij}(k) \quad \forall\, i,j \\
& X_{ij}(k+1) \ \text{binary} \quad \forall\, i,j
\end{aligned}$$
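The same enumeration pattern accommodates P2 by exchanging the roles of objective and constraint; a minimal sketch under the same assumed data layout follows.

```python
from itertools import product

def solve_P2(subsys, t, c, Y, T0, R0):
    """Problem P2: minimize repair cost subject to the time limit and R(k+1) >= R0."""
    failed = [(i, j) for i, row in enumerate(Y)
                     for j, y in enumerate(row) if y == 0]
    best_C, best_X = float("inf"), None
    for repair in product((0, 1), repeat=len(failed)):
        X = [row[:] for row in Y]
        time = cost = 0
        for (i, j), rep in zip(failed, repair):
            if rep:
                X[i][j] = 1
                time += t[i][j]
                cost += c[i][j]
        R = 1.0
        for Ri in subsys:
            R *= Ri(X)
        # Keep the cheapest repair set that is feasible in time and reliability.
        if time <= T0 and R >= R0 and cost < best_C:
            best_C, best_X = cost, X
    return best_C, best_X
```

Problem P3, given next, is the analogous swap of time and cost.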
A similar variation on this problem could be obtained by minimizing total repair time subject to cost and reliability constraints.

$$\begin{aligned}
\mathrm{P3}: \quad \text{Minimize} \quad & T(k) = \sum_{i=1}^{m}\sum_{j=1}^{n_i} t_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \\
\text{subject to} \quad & \sum_{i=1}^{m}\sum_{j=1}^{n_i} c_{ij}\left(X_{ij}(k+1) - Y_{ij}(k)\right) \le C_0 \\
& \prod_{i=1}^{m} R_i(k+1) \ge R_0 \\
& X_{ij}(k+1) \ge Y_{ij}(k) \quad \forall\, i,j \\
& X_{ij}(k+1) \ \text{binary} \quad \forall\, i,j
\end{aligned}$$
5. Summary and Conclusions

This paper introduces the concept of selective maintenance and develops a generalized modeling framework for optimizing selective maintenance decisions. This framework is applicable to systems that have fixed mission lengths and limited time between missions for maintenance. First, two simple series-parallel systems are analyzed. These examples illustrate the process by which components should be selected for repair. Next, a more complex reliability structure is analyzed; analysis of this structure shows the general nature of the formulated problem's structure. The next section of the paper generalizes the structure of the selective maintenance problem to include cost as well as maintenance time as constraints. The general problem structure is then redefined for the case in which minimizing cost or time is the objective and mission reliability is treated as a constraint.

The ideas and concepts in this paper represent the beginning of an ongoing research effort into how maintenance (corrective, preventive, and predictive) can be used to increase system performance and reduce life-cycle cost. In fact, the authors' research efforts are in concert with the conclusion of Cassady and Nachlas (1998) that effective maintenance planning is the new key to industrial productivity growth. Selective maintenance modeling is one of the key components of those efforts.

The general selective maintenance models discussed here lend themselves to a wide variety of extensions, many of which increase the complexity and validity of the selective maintenance problem. The authors are currently investigating several extensions, including the use of random variables to model repair time and cost. Other potential extensions include the consideration of minimal repair, preventive maintenance, imperfect maintenance, imperfect detection, obsolescence, technological evolution, and the role that these concepts play in the selective maintenance decision-making process.
References

Cassady, C. R. and Nachlas, J. A. (1998), "Preventive Maintenance: The Next Frontier in Industrial Productivity Growth", in Ind. Eng'g Solutions '98 Conf. Proc., 254.
Dedopoulos, I. and Shah, N. (1996), "Long-Term Maintenance Policy Optimization in Multi-Purpose Process Plants", Chem. Eng'g Rsch. & Des. 74, 307.
Dekker, R. (1996), "Applications of Maintenance Optimization Models: A Review and Analysis", Rel. Eng'g and Sys. Safety 51, 229.
Dekker, R. and Smeitink, E. (1994), "Preventive Maintenance Opportunities of Restricted Duration", Nav. Rsch. Log. 41, 335.
Duboka, C., Arsenic, Z. and Todorovic, J. (1996), "Preventive Maintenance of Units Subject to Two Kinds of Wear Failures", Yugoslav J. of Oper. Rsch. 6, 85.
El-Damcese, M. (1997), "Suggested Procedure for Preventive Maintenance Policy", MicroElec. and Rel. 37, 1173.
Hedderich, C. P. (1996), "Navy Predictive Maintenance", Naval Engineers J. 10, 41.
Hsu, L. (1991), "Optimal Preventive Maintenance Policies in a Serial Production System", Int'l. J. of Prod. Rsch. 29, 2543.
Jayabalan, V. and Chaudhuri, D. (1992), "Cost Optimization of Maintenance Scheduling for a System with Assured Reliability", IEEE Trans. on Rel. 41, 21.
Kapur, P. and Bhalla, V. (1990), "Optimum Maintenance Policies Maximizing Service Reliability", Int'l. J. of Sys. Sci. 21, 217.
McCall, J. J. (1965), "Maintenance Policies for Stochastically Failing Equipment: A Survey", Management Science 11, 493.
Pierskalla, W. P. and Voelker, J. A. (1976), "A Survey of Maintenance Models: The Control and Surveillance of Deteriorating Systems", Nav. Rsch. Log. Qtrly. 23, 353.
Rice, W. F., Cassady, C. R. and Nachlas, J. A. (1998), "Optimal Maintenance Plans under Limited Maintenance Time", in Proc. of the Seventh Ind. Eng'g Rsch. Conf.
Valdez-Flores, C. (1989), "Survey of Preventive Maintenance Models for Stochastically Deteriorating Single-Unit Systems", Nav. Rsch. Log. 36, 419.
Van Der Duyn Schouten, F. and Vanneste, S. (1993), "Two Simple Control Policies for a Multi-Component Maintenance System", Oper. Rsch. 41, 1125.
Zheng, X. and Fard, N. (1991), "A Maintenance Policy for Repairable System Based on Opportunistic Failure Rate Tolerance", IEEE Trans. on Rel. 40, 237.
AUTHOR INDEX
Allen, E. B. 247
Bai, D. S. 203
Bastani, F. B. 271
Berman, O. 33
Cassady, C. R. 311
Chang, M. 93
Chen, G. 169
Cooper, P. L. 291
Cukic, B. 271
Cutler, M. 33
Hong, S. H. 203
Izundu, A. E. 75
Jun, C.-H. 123
Kanagawa, A. 137
Khoshgoftaar, T. M. 247
Lee, K. T. 203
Leung, K. H. K. 153
Lewins, J. D. 93
Lu, M.-W. 61
Munson, J. C. 215
Murdock, W. P. 311
Nikora, A. P. 215
Ohta, H. 137
Park, K.-S. 123
Parks, G. T. 93
Pham, H. 17, 75
Philippe, C. 1
Pohl, E. A. 311
Rudy, R. J. 61
Savage, G. J. 291
Shiue, W.-K. 51
Tokuno, K. 181
Wang, H. 75
Whitfield, R. 153
Yamada, S. 181
Zhang, W. 51
Zhang, X. 17
Zuo, M. J. 153
SUBJECT INDEX
Accelerated life testing 144
Adaptive PM 128
AIC 254
ARL 7
Availability 81
Availability-intensive safety model 188
Average software availability 195
Bayesian approach 123
Binomial distribution 52
Bivariate control chart 71
Branch and bound scheme 37
Condition monitoring 155
Control chart 1, 203
Control limits 61
Control signals 66
Corrective maintenance 77
Cost allocation 33
Cost model 80
Economic dependence 75
Environmental concepts 169
EWMA 4
Expected maintenance cost 87
Failure intensity function 19
Fault insertion process 223
Fault introduction rate 21
Fault tree analysis 181
FMEA 161, 181
Functional complexity 238
Hard time replacement 154
Imperfect debugging 17
Imperfect maintenance 75, 123
Input domain model 275
Inspection sampling 52
Integrated reliability measures 110
Integration modules 34
Life testing 139
Long-run maintenance cost 87
Maintenance 75, 311
Maintenance cost rate 81
Maintenance model 123
Maintenance policy 80
Manufacturing process 169, 203
Manufacturing system 291
Markovian reliability 181
Mathematical programming 316
Maximum likelihood estimation 132
Mean squared error 23
Mean time between software failure 188
Mean time to failure 94
Mean value function 19
Monte Carlo simulation 93
Multi-component maintenance model 91
Multi-stage decision process 137
Multivariate control chart 61
Navigation control system 282
Neural network model 291
Nonhomogeneous Poisson process model 17
Opportunistic maintenance 77
Optimal maintenance 77, 126
Optimal policy 123
Optimal preparedness
Optimal reliability 45
Optimization model 91, 169
Outgoing quality characteristic 178
Percentage relative bias range 110
Poisson process 19
Predetermined model 37
Predictive power 24
Preventive maintenance 123
Principal component analysis 67
Process metric 258
Process-control program 285
Programmable electronic system 273
Programming module 34
Pseudo-Markov process 96
Radial basis function 294
Regression test 239
Reliability assurance 273
Reliability centered maintenance 153
Reliability model 17, 51
Reliability test plan 51, 137
Resource allocation 33
Reusability 273
Risk priority number 162
Robust design 170, 301
Safety critical system 273
Safety model 181
Sample size 54
Selective maintenance 317
Shewhart chart 70
Software complexity 215
Software evolution 219
Software product metrics 247
Software quality 247
Software reliability 17, 33, 181, 271
Software reliability growth 181
Software safety 186
Software testing process 19
Step-stress testing 145
Symmetrical distribution data 4
System reliability 80, 93
Temperature controller circuit 298
Testing coverage 17
Testing objective 236
Time domain model 274
Tolerance design 169
Transportation system 153
Type I error 55
Variable sampling rate 203
Variational principles 93
Virtual sampling 116
Weibull degradation model 53
Weibull model 134
X chart 70