PRODUCT RELIABILITY, MAINTAINABILITY, AND SUPPORTABILITY
HANDBOOK SECOND EDITION
© 2009 by Taylor & Francis Group, LLC
Edited by
MICHAEL PECHT
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2009 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-0-8493-9879-7 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Product reliability, maintainability, and supportability handbook / editors, Michael Pecht. -- 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8493-9879-7 (alk. paper)
1. Electronic apparatus and appliances--Reliability--Handbooks, manuals, etc. 2. Electronic apparatus and appliances--Maintainability--Handbooks, manuals, etc. I. Pecht, Michael. II. Title.

TK7870.P748 2009
658.5’75--dc22
2008044162
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Contents

Preface
Editor
Contributors

Chapter 1 Product Effectiveness and Worth
    Harold S. Balaban, Ned Criscimagna, Michael Pecht
Chapter 2 Reliability Concepts
    Diganta Das, Michael Pecht
Chapter 3 Statistical Inference Concepts
    Jun Ming Hu, Mark Kaminskiy, Igor A. Ushakov
Chapter 4 Practical Probability Distributions for Product Reliability Analysis
    Diganta Das, Michael Pecht
Chapter 5 Confidence Intervals
    Diganta Das, Michael Pecht
Chapter 6 Hardware Reliability
    Abhijit Dasgupta, Jun Ming Hu
Chapter 7 Software Reliability
    Richard Kowalski, Carol Smidts
Chapter 8 Failure Modes, Mechanisms, and Effects Analysis
    Sony Mathew, Michael Pecht
Chapter 9 Design for Reliability
    Diganta Das, Michael Pecht
Chapter 10 System Reliability Modeling
    Michael Pecht
Chapter 11 Reliability Analysis of Redundant and Fault-Tolerant Products
    Joanne Bechta Dugan
Chapter 12 Reliability Models and Data Analysis for Repairable Products
    Harold S. Balaban
Chapter 13 Continuous Reliability Improvement
    Walter Tomczykowski
Chapter 14 Logistics Support
    Robert M. Hecht
Chapter 15 Product Effectiveness and Cost Analysis
    Harold S. Balaban, David Weiss
Chapter 16 Process Capability and Process Control
    Diganta Das, Michael Pecht
Preface

To ensure product reliability, an organization must follow certain practices during the product development process. These practices impact reliability through the selection of parts (materials), product design, manufacturing, assembly, shipping and handling, operation, maintenance, and repair. The following practices are described in this book:

• Define realistic product reliability requirements determined by factors including the targeted life-cycle application conditions and performance expectations. The product requirements should consider the customer’s needs and the manufacturer’s capability to meet those needs.
• Define the product life-cycle conditions by assessing relevant manufacturing, assembly, storage, handling, shipping, operating, and maintenance conditions.
• Ensure that supply-chain participants have the capability to produce the parts (materials) and services necessary to meet the final reliability objectives.
• Select the parts (materials) that have sufficient quality and are capable of delivering the expected performance and reliability in the application.
• Identify the potential failure modes, failure sites, and failure mechanisms by which the product can be expected to fail.
• Design the process to capability (i.e., the quality level that can be controlled in manufacturing and assembly), considering the potential failure modes, failure sites, and failure mechanisms obtained from the physics-of-failure analysis and the life-cycle profile.
• Qualify the product to verify the reliability of the product in expected life-cycle conditions. Qualification encompasses all activities that ensure that the nominal design and manufacturing specifications will meet or exceed the reliability goals.
• Ascertain whether manufacturing and assembly processes are capable of producing the product within the statistical process window required by the design. Variability in material properties and manufacturing processes will impact the product’s reliability. Therefore, characteristics of the process must be identified, measured, and monitored.
• Manage the life-cycle usage of the product using closed-loop, root-cause monitoring procedures.
Chapter 1: Product Effectiveness and Worth. This chapter presents a definition of product effectiveness and discusses the relationships between product effectiveness and its related functions (availability, dependability, and capability). The chapter concludes with a discussion of assignment responsibility and product worth.

Chapter 2: Reliability Concepts. This chapter presents the fundamental mathematical theory for reliability. The focus is on reliability and unreliability functions, probability density function, hazard rate, conditional reliability function, and key time-to-failure metrics.

Chapter 3: Statistical Inference Concepts. This chapter introduces statistical inference concepts as ways to analyze probabilistic models from observational data. The chapter discusses basic types of statistical estimation, hypothesis testing, and reliability regression model fitting.
Chapter 4: Practical Probability Distributions for Product Reliability Analysis. In this chapter, basic types of discrete and continuous probability distributions are introduced. Two discrete distributions (binomial and Poisson) and four continuous distributions (Weibull, exponential, normal, and lognormal) commonly used in reliability modeling and hazard rate assessments are presented.

Chapter 5: Confidence Intervals. This chapter presents the concept of confidence interval and its relationship with tolerance, sample size, and confidence levels. Examples of confidence interval calculations and estimations are provided.

Chapter 6: Hardware Reliability. Using failure models and examples, this chapter focuses on reliability assessment and its associated validation techniques for engineering hardware. The chapter provides a case study on wirebond assembly in microelectronic packages to illustrate the implementation of a probabilistic physics-of-failure approach in reliability prediction and modeling.

Chapter 7: Software Reliability. This chapter provides a definition of software, software reliability, software quality, and software safety. It also discusses software development models and techniques for both improving and assessing software reliability.

Chapter 8: Failure Modes, Mechanisms, and Effects Analysis. Knowledge of the failure mechanisms that cause product failure is essential to implementing appropriate design practices for the design and development of reliable products. This chapter presents a new methodology called failure modes, mechanisms, and effects analysis (FMMEA) to identify potential failure mechanisms and models for potential failure modes and to prioritize failure mechanisms. FMMEA enhances the value of failure modes and effects analysis (FMEA) and failure modes, effects, and criticality analysis (FMECA) by identifying “high-priority failure mechanisms” to help create an action plan to mitigate their effects. Knowledge of the causes and consequences of the mechanisms found through FMMEA supports efficient and cost-effective product development.

Chapter 9: Design for Reliability. There are steps that must be taken to develop a product that meets reliability objectives. This chapter provides an overview of product requirements and constraints, product life-cycle conditions, parts selection and management, failure modes, mechanisms, and effects analysis, design techniques, qualification, manufacture and assembly, and closed-loop monitoring.

Chapter 10: System Reliability Modeling. This chapter describes how to combine reliability information from parts and subsystems to compute system-level reliability. Reliability block diagrams are used to represent the logical system architecture and to develop system reliability models. This chapter also presents fault-tree analysis for system reliability modeling.

Chapter 11: Reliability Analysis of Redundant and Fault-Tolerant Products. A fault-tolerant product is designed to continue operating correctly despite the failure of some constituent components. This chapter presents methods for evaluating reliability in several types of fault-tolerant conditions.

Chapter 12: Reliability Models and Data Analysis for Repairable Products. This chapter describes methods for modeling and analyzing failures of repairable products (particularly nonelectronic equipment) that normally exhibit wearout characteristics. Analytical background and data analysis techniques that describe the reliability behavior of repairable products are provided.

Chapter 13: Continuous Reliability Improvement. Reliability improvement techniques can be applied to a new product that has passed its major hardware and/or software design reviews, to a developed product that the manufacturer wishes to make more competitive, or to an existing product that is not meeting the customer’s expectations of reliability performance. This chapter discusses the principles of reliability growth, accelerated testing, and management of a continuous improvement program.

Chapter 14: Logistics Support. Integrated logistics support (ILS) applied to products constitutes a life-cycle approach to maintenance and support. This chapter discusses the influence of reliability on logistics support requirements, emphasizing how the reliability of a product, equipment, or assembly influences the need for spare or repair parts, support equipment, and maintenance personnel.

Chapter 15: Product Effectiveness and Cost Analysis. This chapter shows how reliability and maintainability data can be combined with performance data to assess overall product effectiveness and how cost aspects can be introduced to provide a more complete basis for design decisions.

Chapter 16: Process Capability and Process Control. Quality is a measure of a product’s ability to meet the workmanship criteria of the manufacturer. This chapter introduces the concepts of process capability and the basics of statistical process control techniques. Chapter sections present the concepts of average outgoing quality, process capability, defects calculation, and statistical process control, with examples.

The Audience for This Book

This book is for professionals interested in gaining knowledge of the practical aspects of reliability. It is equally helpful for students interested in pursuing a challenging career in reliability, as well as for maintainability and supportability teams.
Editor

Michael Pecht is visiting professor in Electrical Engineering at City University of Hong Kong. He has an MS in Electrical Engineering and an MS and PhD in Engineering Mechanics from the University of Wisconsin at Madison. He is a professional engineer, an IEEE Fellow, an ASME Fellow, and an IMAPS Fellow. He was awarded the highest reliability honor, the IEEE Reliability Society’s Lifetime Achievement Award, in 2008. He served as chief editor for IEEE Transactions on Reliability for eight years and was on the advisory board of IEEE Spectrum. He is chief editor for Microelectronics Reliability and is an associate editor for IEEE Transactions on Components and Packaging Technology. He is the founder of CALCE (Center for Advanced Life Cycle Engineering) at the University of Maryland, College Park, where he is also the George Dieter Chair Professor in Mechanical Engineering and a professor in Applied Mathematics. He has written more than 20 books on electronic products development, use, and supply chain management, and over 400 technical articles. He has been leading a research team in the area of prognostics for the past ten years. He has consulted for over 100 major international electronics companies, providing expertise in strategic planning, design, test, prognostics, IP, and risk assessment of electronic products and systems. He has previously received the European Micro and Nano-Reliability Award for outstanding contributions to reliability research, the 3M Research Award for electronics packaging, and the IMAPS William D. Ashman Memorial Achievement Award for his contributions in electronics reliability analysis.
Contributors

Harold S. Balaban has over 40 years of experience in developing weapon system models for cost and effectiveness analyses for the Department of Defense and other government agencies. He is currently employed by the Institute for Defense Analyses (IDA), where he specializes in applying reliability and maintainability concepts to weapon system life-cycle costs and effectiveness modeling. He has developed a number of models and cost-estimating relationships that enable such work to be accomplished efficiently and accurately, notably the IDA IMEASURE program for maintenance manpower estimation, the airlifter mission capable rate simulation model, and the IDA CER model for estimating depot-level reparable and consumables costs. Prior to his work at IDA, Dr. Balaban was employed by ARINC Research Corporation; his last position was director, advanced analysis. He was responsible for developing and applying analytical and simulation models to perform studies of cost, effectiveness, and reliability, maintainability, and availability of military systems. He led the team that developed the highly successful System Testability and Maintenance Program, which was the forerunner of products used today to improve organizational diagnostics of military systems. He was also a major contributor to efforts to introduce long-term warranties and logistic controls in military systems acquisition. Dr. Balaban has presented and published numerous papers on reliability and maintainability, contributed chapters for three textbooks, and taught graduate courses in reliability theory and operations research at George Washington University and at University College, University of Maryland. He holds a PhD degree in mathematical statistics from the George Washington University. He contributed to Chapters 1, 12, and 15 of this book.

Ned Criscimagna is president and owner of Criscimagna Consulting, LLC. He provides training, program assessment, and related reliability consulting services for industry and government. Prior to starting his own business, Mr. Criscimagna worked at Alion Science & Technology (previously IIT Research Institute, IITRI), where he was a senior science advisor and served in various capacities, including 5 years as the deputy director of the Reliability Analysis Center. Before joining IITRI in 1994, he served in various positions with the ARINC Research Corporation. Prior to his career with private industry, he served 20 years as an Air Force officer in various engineering, maintenance, and staff positions. While on the Air Force staff and the staff of the Air Force Systems Command, he helped develop and implement policies on reliability and maintainability, quality, and system acquisition. He was a member of the Air Force’s repair and maintenance 2000 study team and was involved with the initial efforts to implement the Department of Defense’s total quality management approach to acquisition. He is a member of the American Society of Quality Assurance and the Society of Automotive Engineers, and a senior member of the Society of Logistics Engineers. He holds a BS in mechanical engineering from the University of Nebraska, Lincoln, and an MS in systems engineering–reliability from the Air Force Institute of Technology. He contributed to Chapter 1 of this book.
Diganta Das has a PhD in mechanical engineering from the University of Maryland, College Park, and a BTech in manufacturing science and engineering from the Indian Institute of Technology. He is a member of the research staff at the Center for Advanced Life Cycle Engineering. His expertise is in reliability, environmental and operational ratings of electronic parts, uprating, electronic part reprocessing, technology trends in electronic parts, and parts selection and management methodologies. He benchmarks the processes and organizations of electronics companies for parts selection and management and for reliability practices. Dr. Das also assists organizations in design improvements. He has published more than 50 articles on these subjects and has presented his research at international conferences and workshops. He served as technical editor for two IEEE standards and is currently coordinator for two additional IEEE standards. He is an editorial board member for Microelectronics Reliability and the International Journal for Performability Engineering. He is a Six Sigma black belt and is a member of IEEE and IMAPS. He contributed to Chapters 2, 4, 5, 9, and 16 of this book.

Abhijit Dasgupta is a faculty member and researcher in the CALCE Electronic Packaging Research Center at the University of Maryland. He received his PhD in theoretical and applied mechanics from the University of Illinois. He conducts research in the area of micromechanical modeling of constitutive and damage behavior of heterogeneous materials and structures, with particular emphasis on fatigue and creep–fatigue interactions. His research also includes associated stress analysis techniques under combined thermomechanical loading, formulating physics-of-failure models to evolve guidelines for design, validation testing, and screening and derating for reliable electronic packages. He contributed to Chapter 6 of this book.
Joanne Bechta Dugan was awarded a BA in mathematics and computer science from La Salle University, Philadelphia, in 1980 and an MS and a PhD in electrical engineering from Duke University, Durham, North Carolina, in 1982 and 1984, respectively. Dr. Dugan is currently associate professor of electrical engineering at Duke University and visiting scientist at the Research Triangle Institute. She has performed and directed research on the development and application of techniques for the analysis of computer systems designed to tolerate hardware and software faults. Her research interests include hardware and software reliability engineering, fault-tolerant computing, and mathematical modeling using dynamic fault trees, Markov models, Petri nets, and simulation. Dr. Dugan is a senior member of the IEEE and a member of the Association for Computing Machinery, Eta Kappa Nu, and Phi Beta Kappa. She contributed Chapter 11 of this book.

Robert M. Hecht is a senior principal engineer with the ARINC Research Corporation. He specializes in the evaluation of reliability, maintainability, and testability problems of fielded equipment and the planning and management of product improvement programs. He has supported numerous weapon systems programs, including the P-3C, E-2C, A-6E, EA-6B, ES-3, GUARDRAIL, QUICK FIX, EF-111A, and the M1 Abrams main battle tank. He has extensive experience in the design for reliability of electronic, electromechanical, and mechanical systems. Prior to joining ARINC Research, Mr. Hecht was a reliability engineer with the Bell Aerospace Company’s New Orleans operation. At Bell, he conducted reliability analysis in support of U.S. Navy surface effect ship and air cushion vehicle programs. While with the U.S. Army, Mr. Hecht managed the reliability and maintainability demonstration testing of general military equipment. He received a BS in aerospace engineering from Pennsylvania State University and an MS in engineering from the University of New Orleans. Mr. Hecht is an ASQC certified reliability engineer. He contributed Chapter 14 of this book.

Jun Ming Hu is the managing director of Microsoft Asia Center for Hardware (MACH) in Shenzhen, China. The MACH team is responsible for the design, engineering, testing, and manufacturing of Microsoft hardware products, including mice, keyboards, Webcams, Xbox controllers, gaming text input devices, Zune music accessories, and other hardware products for world markets. The team also provides supporting work for Xbox console manufacturing, testing, and component sourcing and qualification. The MACH team manages many design and manufacturing partners in China. Dr. Hu joined Microsoft Corporation at Redmond in 2000 as the engineering manager of hardware reliability and component engineering. He moved to Shenzhen in February 2004 to set up the MACH organization. Before joining Microsoft, Dr. Hu worked for Ford Motor Company in Michigan for 8 years as a senior technical specialist and engineering manager of computer-aided design for automotive electronics development. Dr. Hu received a BS in 1982 and an MS in 1985 from Shanghai Jiao-Tong University; he received a PhD from the University of Maryland in 1989. Dr. Hu holds more than 14 U.S. and international patents for electronics products and qualification methods.
He was the associate editor of IEEE Transactions on Reliability and an editorial board member of Journal of the Institute of Environmental Sciences from 1993–1998. He is a recipient of the Asian American Corporate Achievements Award and two Henry Ford Technology Awards. He contributed to Chapters 3 and 6 of this book. Mark Kaminskiy is the chief statistician at the Center of Technology and Systems Management of the University of Maryland, College Park. He is a researcher and consultant in statistical and probabilistic reliability, life data analysis, and risk analysis of engineering systems. He has performed research and consulting projects funded by government and industrial companies such as the Department of Transportation, Coast Guard, Army Corps of Engineers, the Navy, Nuclear Regulatory Commission, American Society of Mechanical Engineers, Ford Motor, Qualcomm Inc., and several other engineering companies. Dr. Kaminskiy is the author and co-author of over 100 publications in journals, conference proceedings, reports, and books, including “Modeling Population Dynamics for Homeland Security Applications,” co-authored with B. Ayyub, in Wiley Handbook of Science and Technology for Homeland Security, edited by J. G. Voeller (John Wiley & Sons, 2008); Reliability Engineering and Risk Analysis: A Practical Guide, co-authored with M. Modarres and V. Krivtsov (Marcel Dekker, 1999, 2009); “Accelerated Testing” (Chapter 5) in Statistical Reliability Engineering (John Wiley & Sons, 1999); and “Statistical Analysis of Reliability Data” in Encyclopedia of IEEE (John Wiley & Sons, vol. 20, 1999). He received an
© 2009 by Taylor & Francis Group, LLC
9879_C000g.indd 15
3/4/09 12:04:45 PM
xvi
Contributors
MS in nuclear physics at the Polytechnic University of St. Petersburg (Russia) and a PhD in electrical engineering at the Electrotechnical University of St. Petersburg (Russia). He contributed to Chapter 3. Richard Kowalski retired as director, product assurance, from ARINC Incorporated in 2002 after 27 years with the firm. He was responsible for the development and execution of hardware and software quality program policy. Dr. Kowalski was trained in software capability evaluation (SCE), using the capability maturity model and the integrated model developed by the Software Engineering Institute at Carnegie–Mellon, and he conducted SCEs at several U.S. and European companies and for several programs at ARINC. Dr. Kowalski is a member of Sigma Xi and is a life senior member of the Institute of Electrical and Electronic Engineers (IEEE). For more than 20 years, Dr. Kowalski was a member of the IEEE Reliability Society’s Administrative Committee and is a past editor of the IEEE Transactions on Reliability. Dr. Kowalski received a BS in mathematics from Northeastern University and an MS and PhD in mathematics from Case Institute of Technology. He contributed to Chapter 7. Sony Mathew is a faculty research assistant at the Center for Advanced Life Cycle Engineering in the Mechanical Engineering Department of the University of Maryland, College Park. He is also pursuing his PhD in mechanical engineering from the A. James Clark School of Engineering at the University of Maryland. His areas of research are reliability, tin whiskers, and prognostics and health management of electronic products. He earned his MS in mechanical engineering from the University of Maryland in May 2005. He has a BA in mechanical engineering (1997) and an MBA (1999) from Pune University, India. He contributed to Chapter 8. Carol Smidts is an assistant professor in the Department of Materials and Nuclear Engineering, Reliability Program, University of Maryland, College Park. 
She obtained an MS in physics engineering in 1986 from the Universite Libre de Bruxelles, Belgium, and her PhD in physics engineering in 1991 from the same university. Her research has focused mainly on dynamic system reliability, Markovian analysis, and human reliability. Her recent work has been devoted to software reliability. She contributed to Chapter 7. Walter Tomczykowski the director of the Life Cycle Management and Operations Support Department at ARINC Engineering Services, reporting to the vice president of the Advanced Systems Division. He received an MS in reliability engineering from the University of Maryland and a BS in electrical engineering technology from Northeastern University in Boston. For over 25 years he has been leading specialized teams in the areas of reliability, maintainability, life-cycle cost, human factors, counterfeit prevention, and obsolescence management for the Office of the Secretary of Defense, Defense Logistics Agency, and Department of Defense (DoD) programs throughout the services and various federal agencies, such as the Department of Homeland Security and the Department of the Treasury. As the director for the Life Cycle Management and Operations Support Department, Mr. Tomczykowski is responsible for personnel in Boston, Annapolis (including
© 2009 by Taylor & Francis Group, LLC
9879_C000g.indd 16
3/4/09 12:04:45 PM
Contributors
xvii
Patuxent River), Maryland, Dayton, Ohio, San Antonio, Texas, Oklahoma City, and Panama City, Florida. Specifically, for obsolescence management (DMSMS— diminishing manufacturing and material shortages), his teams provide support to the Defense MicroElectronics Activity, Defense Supply Center, Columbus, NAVAIR Aging Aircraft IPT, Coast Guard, AWACS, B-2, USMC H-1, and a variety of other DoD programs. He is a primary author of the DMSMS Cost Factors, the DMSMS Program Manager’s Handbook, and the DMSMS Acquisition Guidelines. He is often requested as a keynote speaker at aging aircraft, DMSMS, and other obsolescence management conferences to share his knowledge of reliability, life-cycle cost, and obsolescence management. His work in reliability has also been published in the Wiley Encyclopedia of Electrical and Electronics Engineering. He contributed Chapter 13. Igor A. Ushakov taught for approximately 15 years at the Moscow Institute of Physics and Technology. In 1989 he was invited to be a distinguished visiting professor at George Washington University in Washington, D.C. Later, he taught at George Mason University and the University of California, San Diego. He has also worked at well-known American companies such as MCI, Qualcomm, Hughes Network Systems, and Mantech. His experience is focused on reliability and effectiveness analysis of large-scale telecommunication systems and mathematical and computer modeling of communication systems. He has been chair of sessions at numerous international conferences (in the United States, Russia, Ukraine, Canada, Japan, Great Britain, France, Italy, Germany, Norway, Poland, Hungary, and Bulgaria). He has authored more than 300 papers in various prestigious international math and engineering journals in operations research, reliability engineering and theory, and telecommunication network modeling. 
Professor Ushakov has written approximately 30 books in Russian, English, German, and Bulgarian, including three published in the United States. Publications include Histories of Scientific Insights (Lulu, Morrisville, North Carolina, 2007), Course on Reliability Theory (Drofa, Moscow, 2007), Statistical Reliability Engineering (John Wiley & Sons, New York, 1999), Probabilistic Reliability Engineering (John Wiley & Sons, New York, 1995), and Handbook of Reliability Engineering (John Wiley & Sons, New York, 1994). A member of Sigma Xi, Omega Rho, and Tau Beta Pi, Professor Ushakov is a founder of the Gnedenko Forum, an informal international association of specialists in probability and statistics. He contributed to Chapter 3.
David Weiss is a consultant in the fields of reliability and systems analysis. For 10 years he served as the manager for reliability programs in the Engineering Research Center at the University of Maryland, working with faculty in the creation of a graduate program in reliability engineering. Prior to joining the University of Maryland, he was a reliability manager with General Electric Company and a partner in the consulting firm Booz Allen Hamilton. He contributed to Chapter 15.
CHAPTER 1
Product Effectiveness and Worth
Harold S. Balaban, Ned Criscimagna, Michael Pecht
CONTENTS
1.1 Introduction
1.2 Attributes Affecting Product Effectiveness
1.3 Programmatic Factors Affecting Product Effectiveness
    1.3.1 Product Effectiveness
    1.3.2 Operational Readiness and Availability
    1.3.3 Dependability
    1.3.4 Capability
    1.3.5 Reliability
    1.3.6 Maintainability
    1.3.7 Relationships Among Time Elements
1.4 Assignment of Responsibility
    1.4.1 Administrative Time
    1.4.2 Logistics Time
    1.4.3 Active Repair Time and Operating Time
1.1 INTRODUCTION
The ultimate goal for any product or system is that it perform some intended function as affordably and as well as possible. The function may be described as some output characteristic, such as satisfactory message transmission in a communication system, cargo tonnage for a transportation system, or the accuracy of weather identification for airborne weather radar. The term for the overall capability of a product to meet customer objectives is product effectiveness. If the product is effective, it carries out the intended function well; if it is not effective, deficient attributes must be improved. The term for the overall cost, including purchase price, costs associated with operation, maintenance, and repair, and disposal costs, is product worth.
1.2 ATTRIBUTES AFFECTING PRODUCT EFFECTIVENESS
Product effectiveness is a function of many product attributes and external factors. For an automobile, dependability, safety, ease of repair, and comfort are among the attributes a buyer might cite as important. In terms of product worth, purchase price, economic operation, and good resale value may be additional attributes. A successful blend of these attributes results in a car that is perceived to be of high value. For any specific product, a distinct blend of attributes is needed to achieve high product effectiveness and worth. Good product design and development require that members of the design team (and the customer, if appropriate) evaluate and discuss all pertinent attributes affecting product effectiveness during the appropriate phases of the product’s life cycle: concept formulation, research and development, production, operation, and disposal. For many products—particularly those with a long life—the highest cost is to operate, support, and maintain the product. Many of the tasks and decisions arising early in the life cycle of a product affect the product at later stages and affect costs throughout product life. Table 1.1 shows how the cost of decisions made early in development affects downstream costs. For example, although only 3 to 5% of the total development and production costs may be expended in the concept definition phase, from 40 to 60% of the total cost may be committed as a result of decisions and actions taken during that period. Table 1.2 lists some attributes that affect product effectiveness in terms of performance, availability, and affordability. The term “performance” represents operational, physical, or functional characteristics. “Availability” represents the likelihood of having the product in a usable state; “affordability” relates to the economic consequences associated with product development, purchase, and operation. 
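The gap between committed and incurred cost described above can be made concrete with a short calculation. The following is a minimal sketch, not a costing method from this chapter: it takes the percentage ranges of Table 1.1 and brackets, for each phase, the share of total cost that is already committed but not yet spent. The names `PHASES` and `committed_unspent` are hypothetical, and treating the two ranges as independent bounds is an assumption made only for illustration.

```python
# Percent-of-total-cost ranges transcribed from Table 1.1.
# Layout: phase -> ((incurred_low, incurred_high), (committed_low, committed_high))
PHASES = {
    "Concept definition": ((3, 5), (40, 60)),
    "Design": ((5, 8), (60, 80)),
    "Testing": ((8, 10), (80, 90)),
    "Process planning": ((10, 15), (90, 95)),
    "Production": ((15, 100), (95, 100)),
}


def committed_unspent(phase):
    """Return (low, high) percent of total cost committed but not yet incurred.

    Assumption: the incurred and committed ranges are treated as independent
    bounds, so the result is only a rough bracket, floored at zero.
    """
    (inc_lo, inc_hi), (com_lo, com_hi) = PHASES[phase]
    return max(com_lo - inc_hi, 0), max(com_hi - inc_lo, 0)


for phase in PHASES:
    low, high = committed_unspent(phase)
    print(f"{phase}: {low}-{high}% of total cost committed but unspent")
```

Under these assumptions, roughly a third to over half of total cost is already locked in at the end of concept definition, which is why early design decisions deserve disproportionate attention.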
Overall product effectiveness and worth can theoretically be improved by trading off attributes, which is an extremely complex process. For example, an automobile manufacturer wants to maximize profits and may feel this is best done by increasing market share through offering a new car that provides maximum affordability and reliability. Affordability is a function of how cheaply the car can be manufactured; features that would make the car easy to maintain might have to be compromised or eliminated to achieve ease of manufacture. Under the hood of today’s automobiles, manufacturing cost and maintenance trade-offs are apparent compared with the cars of, say, 20 years ago. New design approaches, such as electronic ignition, are more reliable than those of the past and the use of computer diagnostics balances the repair challenge presented by today’s complex engines and transmissions. A good design team knows that attributes sometimes support each other and are sometimes contradictory; and that, consequently, trade-offs become a necessary part of the development process.

Table 1.1 Product Development and Production Costs

                              Percent of Total Costs
Development Process Phase     Incurred      Committed
Concept definition            3–5%          40–60%
Design                        5–8%          60–80%
Testing                       8–10%         80–90%
Process planning              10–15%        90–95%
Production                    15–100%       95–100%

Table 1.2 Example Attributes Affecting Product Effectiveness and Product Worth

Performance
    Operational: range; speed; accuracy; vulnerability; payload; output power
    Physical: volume and density; weight; input power; environment
    Functional: safety; mission success rate
Availability
    Reliability: failure-free operation; redundancy or graceful degradation; mean time to failure
    Maintainability: ease of repair (access, time to repair); required resources (manpower, tools); fault detection and isolation (testability)
    Logistics supportability: sparing; training; facilities
Affordability
    Cost to: develop or buy; own or operate; maintain; dispose
    Time to develop
1.3 PROGRAMMATIC FACTORS AFFECTING PRODUCT EFFECTIVENESS
A typical history of the development of a new product reveals a number of steps in the progression from original concept to an acceptable production model. These steps are particularly marked if the equipment represents a technical innovation—that is, if it pushes the state of the art by introducing entirely new functions or by performing established functions in an entirely new way. The marketplace (or an existing customer base) defines the need for new or improved technical performance. The design and development team executes a multitude of operations leading to accomplishment of program objectives, primarily the production of a system or product that will perform as intended, with minimum breakdowns and rapid repair. This must be done within acceptable development, production, and support budgets and within an established schedule. The three program criteria—performance, cost, and schedule—impose severe pressures on a company. Just as compromises among product attributes are required to achieve desired product effectiveness, compromises are often necessary among program objectives. These compromises begin early in the development process, usually in the basic research and concept validation phases. For example, the time allocated to develop needed technologies or to prove concept feasibility may be curtailed to meet a schedule driven by a competitive challenge.
After preliminary work on a typical product, a prototype model is built; ideally, it represents the final product as closely as possible. Its purpose is to establish the initial feasibility of satisfying critical effectiveness attributes. This model could be a hardware or software prototype, or a computer simulation of the system, key subsystems, or components. The prototype may be crude in appearance, unsuitable for production line manufacturing, subject to frequent failure, or repairable only by skilled technicians using expensive equipment and considerable time. Early attention to manufacture, quality, and reliability can save time and money later in the product development program. As the program moves forward, changes to improve reliability become more difficult and expensive, the schedule becomes more inflexible, and budgets become tighter. Despite increased emphasis on reliability, many new products experience serious growing pains during their first years of operation as designers undertake extraordinary and sometimes frantic efforts to determine causes of failure and to eliminate them through modifications, upgrades, or changes in operating and maintenance procedures. Factors important in the development of a new product (revolutionary change) also apply to modification or development programs integrating proven equipment (evolutionary change). For both revolutionary and evolutionary development, reliability is a key attribute affecting product effectiveness and should be considered from the outset. Figure 1.1 shows the major components of product effectiveness: availability, dependability, and capability. In turn, availability and dependability have reliability, maintainability, and logistics supportability as their major constituent elements.
[Figure 1.1: diagram. Product effectiveness (a measure of how well the product does its job) is divided into availability (a measure of the product’s condition when first required to perform), dependability (a measure of the product’s condition during the performance of its function), and capability (a measure of how well the product’s performance meets objectives). Constituent elements shown include maintainability (restoration capability) and logistic support (external factors).]
Figure 1.1 Major components of product effectiveness.
1.3.1 Product Effectiveness
Product effectiveness can be formally defined as the ability of a product to meet an operational demand when operated under specified conditions. Effectiveness is influenced by how a product is used and maintained, as well as by the design and production processes. It can also be influenced by the logistics support system, company policies, regulations and laws governing product use, fiscal constraints, and other administrative policy decisions. For single-use systems, such as missiles, torpedoes, and fuses, operating time and calendar time during the operational phase are relatively unimportant, as is repair of failed units. However, these time elements are critical in determining the effectiveness of a multi-use product, which also may have to accommodate the repair of failures. A product fails if it does not operate when called upon to perform or if it fails to operate successfully (that is, does not complete its function or mission). Both multi-use and one-shot products must be operated and supported under specified conditions defined by the customer or supplier. If a product is pushed to operate at higher stresses for uses unforeseen by the design team, product effectiveness may be decreased. The U.S. Air Force’s experience with the B-52 aircraft exemplifies how a change in usage environment can affect system effectiveness. The B-52 was originally designed as a high-altitude bomber, but changing needs required the Air Force to include low-altitude penetration as one of the aircraft’s missions. Because low-altitude flight imposed higher stresses on the airframe, additional modifications were necessary to strengthen the structure and maintain the desired service life. “Specified conditions” also include whether the product is used in continuous or cyclic operation. In continuous operation, maintenance is performed after a failure occurs, and any failure reduces product effectiveness. 
For products operated cyclically, such as a car or an airplane, maintenance can be performed in windows of time when product operation is not critical. Potential failures can be averted through a planned preventive maintenance program. Removing the product from the readiness state for a portion of each day to perform maintenance may increase effectiveness. However, if the percentage of equipment that becomes inoperable prior to demand for use is insensitive to preventive maintenance, it is best to maintain a continual state of readiness and perform only corrective maintenance. Another influence on product effectiveness is a change in operational requirements (technical performance attributes). For example, the vulnerability of a target may be reduced by a change in target design, such as the addition of armor or electronic countermeasures in a military system. The effectiveness of the system intended to counter the target would then decrease even though no degradation of the system itself had occurred. Consider a race car designed to attain a top speed of 200 miles per hour. If competitors’ cars are able to attain top speeds of 210 miles per hour, all other factors being equal, the effectiveness of the slower car has decreased. The terms design effectiveness and use effectiveness are sometimes used to describe the performance of a product. Design effectiveness measures how well the product meets specific performance requirements under test conditions that minimize
operator, maintenance, and logistic influences. Use effectiveness is at the other end of the effectiveness spectrum. It attempts to assess how well the product meets the demands placed on it, even if such demands exceed specifications. Although a sports car and a station wagon both provide transportation, the use effectiveness of the sports car for cargo carrying is not very high, just as the station wagon may not meet handling or acceleration requirements very well.
Product effectiveness measures how well the product does the job for which it was purchased. It is a function of availability (the likelihood that the product is ready to start the job), dependability (the likelihood that the product will operate in states that will produce output designed to do the job), and capability (how well the designed outputs actually accomplish the necessary tasks, given the states in which the product operated). Each of these topics (availability, dependability, and capability) will now be addressed, followed by a discussion of the three major components of availability and dependability: reliability, maintainability, and logistics supportability.
1.3.2 Operational Readiness and Availability
The capability of a product to perform its intended function when called upon is its operational readiness or its operational availability.* The difference between readiness and availability is that the latter includes only operating time and downtime, while the former also includes free and storage times, that is, periods when the product is not needed. Operational readiness or availability differs from product effectiveness in several ways. Its emphasis is on the “when called upon” aspect, rather than on the completion of the task or mission. This emphasis focuses on a probability at a point in time rather than over an interval, as is the case with the mission success rate (the percentage of successfully completed missions).
This interval of time can be extremely long, as in the case of a satellite on a long-term mission to another planet; the satellite may be operationally available at launch time, but that does not ensure that it will operate successfully for the duration of its mission. For products that are continually used and are providing useful output, availability is often estimated by calculating the fraction of total “need time” in which the product is operational or capable of providing useful output. Another difference between operational availability and product effectiveness is that the performance attributes of the latter include designed-in capabilities, such as accuracy, power, and weight. Operational availability typically excludes detailed examination of these characteristics by addressing only the product’s readiness to perform its intended function at a particular point in time. Depending on the intended use, one or more performance attributes may apply to availability. The difference between a product’s being operational or not is often a function of the customer’s definition of failure, which depends on the use of the product. If the performance related to a critical attribute is not satisfactory, the customer may consider the product to be “down,” and readiness or availability from that point until the need ends or until the deficiency is corrected is zero. For example, if a radar set has a specified range of 50 miles, should the radar be considered down if it is effective only to 45 miles? If the 50-mile range is the absolute minimum needed to avoid midair collisions, the aircraft on which the radar is installed would be considered unflyable, and the radar would be considered unavailable for the mission. If the 50-mile range is a goal value and 20 miles is the absolute minimum, a 45-mile range might be acceptable. An availability calculation could be based on a definition that includes as uptime all periods for which the range is at least 20 miles.
* Although terms such as availability were at one time closely associated with military systems, they are now more widely used in commercial industry. The availability of an off-shore oil rig, for example, is of extreme importance.
Operational availability and readiness, therefore, relate uptime and downtime to the conditions under which the product will be used. The following definitions are used:
• The operational availability of a system or product is the probability that it is operating satisfactorily at any point in time when used under stated conditions, where the total time considered includes operating time, active repair time, administrative time, and logistic time.
• The operational readiness of a system or product is the probability that, at any point in time, it is either operating satisfactorily or is ready to be placed in operation on demand when used under stated conditions, including allowable warning time. Total calendar time is the basis of computation.
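The two definitions differ mainly in the time base used as the denominator. The following is a minimal sketch of how both quantities might be estimated as time fractions from a usage log. The `TimeLog` class, its field names, and the sample hours are hypothetical, and estimating readiness as operating time plus ready-but-idle time over calendar time is an assumption for illustration rather than a formula given in this chapter.

```python
from dataclasses import dataclass


@dataclass
class TimeLog:
    """Hours observed over one calendar period (all values hypothetical)."""
    operating: float              # operating satisfactorily while needed
    active_repair: float          # active repair time
    administrative: float         # administrative delay
    logistic: float               # logistic delay
    idle_operable: float          # not needed, but ready to operate on demand
    idle_inoperable: float = 0.0  # not needed and not ready


def operational_availability(log):
    """Fraction of need time (operating + repair + delays) spent operating."""
    need_time = (log.operating + log.active_repair
                 + log.administrative + log.logistic)
    if need_time <= 0:
        raise ValueError("no need time recorded")
    return log.operating / need_time


def operational_readiness(log):
    """Assumed estimator: fraction of calendar time operating or ready."""
    calendar_time = (log.operating + log.active_repair + log.administrative
                     + log.logistic + log.idle_operable + log.idle_inoperable)
    if calendar_time <= 0:
        raise ValueError("no calendar time recorded")
    return (log.operating + log.idle_operable) / calendar_time


log = TimeLog(operating=900.0, active_repair=40.0, administrative=20.0,
              logistic=40.0, idle_operable=200.0)
print(f"operational availability: {operational_availability(log):.3f}")
print(f"operational readiness:    {operational_readiness(log):.3f}")
```

Keeping the four downtime and operating components as separate fields makes it easy to see which delay category dominates the denominator.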
A subset of operational availability is intrinsic or inherent availability. Like the design effectiveness concept, this measure attempts to minimize the effects of external influences by considering only active repair time and required use time. Thus, free time when the product is not needed and downtimes due to logistic and administrative delays are excluded. Intrinsic availability is a built-in capability; thus, the design and production engineers first must address discovered problems, assuming that operating conditions are compatible with design specifications. If these engineers cannot resolve the problem, then the product operations manager may be assigned to reduce administrative or logistics delays or to utilize and maintain the product more efficiently.
1.3.3 Dependability
Most products can be in any one of a number of different states during their operation. Dependability measures the likelihood of each possible product state. If a product contains n identifiable components and each component can be in only one of two states (say, success or failure), then the product can be in any one of 2^n states. For example, a product with 10 components has 1,024 possible states, if each component is either up or down. We do not usually quantify dependability by a single number as we may do for availability, but rather use the dependability concept to quantify effectiveness. However, dependability quantification is possible for simple cases. For example, for our sample product, we may define a subset of the 1,024 states as success states; the
product is considered dependable if it operates within this subset. However, not all of the success states will necessarily result in the same level of acceptable output; in such cases, the capability measure has to be considered, as discussed in the next section. From a more analytical point of view, the dependability concept describes how the product transitions from one state to another. For example, the failure of a component will generally transition the product from its present state to a less capable state. If repair during operation is possible, there may be a transition back to the more productive state. If an item failure brings the product down, then no useful output may be produced until repairs are made.
1.3.4 Capability
Capability measures how well the product accomplishes the task it is assigned. It is normally a state-dependent measure. If the product is not operating, then its capability would normally be zero, but not always. Consider a tank protecting an enclave from rebel troops. The tank may not be able to fire, but if the enemy sees the tank and is unaware of its state, its protective mission may still be accomplished while repairs are undertaken. On the other hand, a product that is operating as it is supposed to may not have the highest capability. An optical aerial camera may not get a desired picture because of cloud cover, even though all components are operating perfectly. Products that have backup or redundant modes of operation will have a number of states that can produce useful output. For each state, a capability measure exists. For example, the speed, range, and fuel consumption of a multi-engine aircraft depend on the number of engines operating. The units of measure of capability depend on the product and its tasks. The capability measure may be directly related to such product output as picture resolution, number of messages delivered, kilowatts of power produced, or the amount of damage to the enemy.
When it is difficult to define or to quantify such a measure, an ordinal scale may be used, for example, from 0 to 100, with 100 representing the best possible output. A probability measure may also be used in some cases. Each of the possible product states is determined to be either a success or a failure. Then the product capability is the probability that the product operates within the class of success states.
1.3.5 Reliability
A critical attribute determining product effectiveness is reliability, which is a measure of the product’s ability to avoid failure. A reliability deficiency will eventually result in an impaired or lost performance, compromised safety, and the need for such restorative actions as diagnosis, repair, spare replenishment, and maintenance. High-reliability products will operate longer, allowing resources to be focused on improving performance. Within a product effectiveness context, satisfactory operation is normally associated with a defined envelope of satisfactory outputs. If all the product outputs are within this envelope, then the product is operating reliably. Note that reliable operation by this definition does not imply satisfactory results. An optical aerial camera operating in a cloudy environment is an example.
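The state-based view used in Sections 1.3.3 and 1.3.4 (2^n product states, a subset of which are success states) can be illustrated by brute-force enumeration. The following is a minimal sketch under stated assumptions: components fail independently, each has a known probability of being up, and the success rule (at least two of three components up) is hypothetical. The function names are illustrative and do not come from this chapter.

```python
from itertools import product


def enumerate_states(n):
    """All 2**n up/down combinations for n two-state components."""
    return list(product((True, False), repeat=n))


def state_probability(state, p_up):
    """Probability of one state, assuming independent components."""
    prob = 1.0
    for up, p in zip(state, p_up):
        prob *= p if up else (1.0 - p)
    return prob


def success_probability(p_up, is_success):
    """Sum the probabilities of all states that the rule marks as success."""
    return sum(state_probability(state, p_up)
               for state in enumerate_states(len(p_up))
               if is_success(state))


# A 10-component product has 1,024 possible states, as noted in the text.
print(len(enumerate_states(10)))

# Hypothetical 3-component product that succeeds if at least 2 are up.
p_up = [0.9, 0.9, 0.9]
print(success_probability(p_up, lambda state: sum(state) >= 2))
```

Enumeration is only practical for small n; it is used here because it mirrors the counting argument directly.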
[Figure 1.2: plot of the reliability function R(t) against time, with the vertical axis running from 0 to 1.0.]
Figure 1.2 A typical reliability function.
Observed reliability is the ratio of items operating within specifications for the stated period to the total number of items in the sample. A reliability function is this same probability expressed as a function of the time period. Figure 1.2 is an example of a reliability function. Products that have built-in test equipment (BITE) to monitor output and warn the operator when one or more outputs are out of tolerance can be continually assessed for reliability. BITE is now a common design practice for industrial products, and it is becoming more prevalent for consumer products, especially for electronic items. Nevertheless, the assessment of product performance often has to be made by the operator, and distinguishing between a capability and a reliability problem is not always easy. A washing machine user may easily determine that the washing machine has failed because water is flooding the laundry room. A more difficult assessment is determining the reason that the washed clothes do not appear to be as clean as they should. Whether the problem is one of reliability (e.g., failure of the motor to agitate the water properly) or capability (e.g., insufficient motor capacity for the wash load), the usual decision is to assign the problem to reliability unless a specific analysis of effectiveness is conducted. Mission reliability is usually defined as the probability that a product will operate successfully for the duration of the mission, given that it is ready to start the mission when called upon to do so. Mission reliability, therefore, is the probability that no failure will occur during the mission that prevents the mission from being satisfactorily completed. For a one-time operation, this probability is a point on the reliability function curve corresponding to a time equal to the mission time.
If repeated missions are undertaken and wearout may be occurring, adjustments must be made for cumulative operating or stress time following the most recent maintenance or restoration. All the alternative modes of operation required for mission completion must be considered in mission reliability. Alternative modes include operations using redundant or backup units that take over for failed units.
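The adjustment for cumulative operating time can be sketched numerically. The example below is illustrative only: the chapter does not prescribe a form for R(t), so a Weibull reliability function is assumed here, with hypothetical scale and shape parameters `eta` and `beta`. Mission reliability for a unit that has already accumulated `age` hours since its last restoration is taken as the conditional probability R(age + T) / R(age).

```python
import math


def observed_reliability(n_within_spec, n_total):
    """Ratio of items still within specification to items in the sample."""
    return n_within_spec / n_total


def weibull_reliability(t, eta, beta):
    """Assumed reliability function R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


def mission_reliability(mission_time, age, eta, beta):
    """Probability of surviving the mission given survival to `age`."""
    return (weibull_reliability(age + mission_time, eta, beta)
            / weibull_reliability(age, eta, beta))


# beta > 1 models wearout: the same 10-hour mission is riskier on an aged unit.
fresh = mission_reliability(10.0, age=0.0, eta=1000.0, beta=2.0)
aged = mission_reliability(10.0, age=500.0, eta=1000.0, beta=2.0)
print(f"fresh unit: {fresh:.5f}, aged unit: {aged:.5f}")
print(f"observed reliability, 95 of 100 surviving: {observed_reliability(95, 100)}")
```

With `beta` equal to 1 the assumed function has no wearout, and the conditional reliability no longer depends on accumulated age.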
Logistic reliability, on the other hand, is concerned not only with mission accomplishment but also with all failures that place a demand on the logistic system, regardless of when the failures occur or whether they affect mission accomplishment, require maintenance, require spare parts, or all three. Thus, component redundancy, which normally increases mission reliability, almost always decreases logistic reliability. A natural measure of logistic reliability is the demand rate, which tracks the demand events occurring when a failure triggers the logistics support system.
1.3.6 Maintainability
Maintainability addresses the ease and economy with which the maintenance actions necessary to restore a failed product to a satisfactory state can be taken. Following a failure, restoration involves isolating the source of the failure, correcting the problem, checking out the product, removing test equipment and tools, securing all access doors and panels, and making the product acceptably available to perform its required function. The statistical average for downtime during restoration actions is called mean downtime (MDT). MDT comprises diagnostic time, active repair time, logistic delay, and administrative delay. The relative ease with which a product can be kept in operational condition or restored to it after failure is typically embodied in the maintainability characteristic. Maintainability, which comprises all active repair time, is a fundamental design attribute; most of the effort to affect this attribute favorably is expended in the design phase. For a product to be highly maintainable, the design should not be complex; equipment should be easy to access, remove, and replace; the types of fasteners should be as uniform as possible; few special tools should be needed; and so forth. Such factors are the responsibility of the design engineer.
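The opposing effect of redundancy on mission reliability and logistic reliability, noted at the start of this passage, can be shown with a small calculation. This is a minimal sketch under assumptions that are not stated in the chapter: identical units operating in active parallel, independent failures, and a constant failure rate per unit. The function names and numbers are hypothetical.

```python
import math


def parallel_mission_reliability(n_units, failure_rate, mission_time):
    """Mission succeeds if at least one of n identical units survives."""
    unit_reliability = math.exp(-failure_rate * mission_time)
    return 1.0 - (1.0 - unit_reliability) ** n_units


def logistic_demand_rate(n_units, failure_rate):
    """Every unit failure places a demand on the logistic system."""
    return n_units * failure_rate


rate = 0.001      # assumed failures per operating hour, per unit
mission = 100.0   # assumed mission length in hours

for n in (1, 2):
    r = parallel_mission_reliability(n, rate, mission)
    d = logistic_demand_rate(n, rate)
    print(f"{n} unit(s): mission reliability {r:.4f}, demand rate {d:.3f} per hour")
```

Adding the second unit raises mission reliability but doubles the rate at which failures generate maintenance and spares demands, which is the trade-off described in the text.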
The most general definition of maintainability is the probability that, when maintenance is initiated under stated conditions, a failed product will be restored to operational effectiveness within a given period of time, excluding downtime due to logistic or administrative delay. A subset of maintainability is testability. Testability is defined in terms of failure detection and source isolation. The definition may be expanded by including the rapidity and accuracy of detection and isolation. Ideally, all failures (and only failures) are detected as soon as they occur, allowing the operator to take appropriate action (for example, turning the product off to prevent further damage). Failures can be detected by human observation (for example, the operator may see smoke or an invalid product response) or by the product itself using a built-in test. Similarly, the maintainer can isolate the source of failure and identify the cause using manual or semiautomatic methods to check various components until the failure is found or using automatic built-in tests. In practice, some failures are intermittent and difficult to detect and isolate. Definitions for the time divisions are given in Table 1.3. Time is of fundamental importance for quantifying product or system properties because it permits measurement rather than qualitative description. The usual measures of time—year, month, day, and hour—normally form the basis for the computation of reliability, maintainability, and availability parameters. However, because there are so many ways of
Table 1.3 Definitions of Time Elements

Time Element: Definition

Operating time: Time during which the product is operating in a manner acceptable to the operator; this element includes the time when the customer is dissatisfied with the manner of operation, but not so dissatisfied that the product must be shut down for repair or, if repair will not satisfy customer needs, discarded

Downtime: Total time during which the product is not in an acceptable operating condition or is not operationally ready
    • Realization time: Time that elapses before the fault condition becomes apparent
    • Active repair time: Portion of downtime during which actual maintenance takes place; included is the time to prepare the product for repair, locate the fault, correct the fault, and check out the product
    • Logistic delay: That portion of downtime during which repair is delayed (waiting time) solely because a part or unit needed to make a repair is not available
    • Administrative delay: That portion of downtime not covered by active repair or logistic time

Free time: Time during which operational use of the product is not required; it may or may not be downtime, depending on whether the product is in operable condition; during free time periods, downtime is not included in operational availability calculations

Standby time: Time during which the product is operable but is being held as a spare; standby time is the time during which the product is operable but is not being used to perform a useful function; the product can be called upon to operate at any random point of time during the period

Access time: Time from realizing that a fault exists to making contact with displays and test points and commencing fault finding; this does not include travel or preparation; access time reflects the removal of covers and shields and the connection of test equipment and is determined largely by mechanical design

Diagnosis time: Fault-finding time, including the adjustment of test equipment (e.g., setting up an oscilloscope or generator), carrying out checks (e.g., examining wave forms for comparisons with a handbook), interpreting information (this may be aided by algorithms), verifying conclusions, and deciding upon corrective action

Replacement time: Time for removing the faulty line replaceable assembly (LRA), followed by connecting and wiring a replacement as appropriate; the LRA is the replaceable item beyond which fault diagnosis does not continue; replacement time is largely dependent on the choice of LRA and on mechanical design features, such as the choice of connectors

Supply delay: Time required from the point of identifying the need for a maintenance part or assembly (LRA) until that part or assembly is in the hands of the maintenance technician; supply delay can be factored into elements such as time to remove the part from the maintenance technician’s tool kit, time to obtain the part from a supply bin, time to receive the part from a warehouse at another site, or time to procure the part from a manufacturer

Checkout time: Time of verifying that the fault condition no longer exists and that the product is operational; it may be possible to restore the product to operation before completing the checkout, in which case, although it is a repair function, not all of checkout time constitutes downtime; adjustments may be required when a new module is inserted into the product; as in the case of checkout, some or all of the alignment time may fall outside the downtime window
Figure 1.3 Principal divisions of calendar time.
delineating these time intervals, the method adopted in each investigation must be carefully developed in order to provide the desired results. In general, the time interval of interest is the total calendar time during which the product is in use. As shown in Figure 1.3, this interval may be divided into available time and unavailable time. During available time, the product is available for use by the intended user; during unavailable time, the product is being supplied, repaired, or restored and is not available for use. Thus, there are really two time-division criteria: the equipment’s state of operability and the demand for its use. These criteria are outlined as follows:
Criterion 1: product state of operability
• operable/inoperable
• administrative delay
• logistic delay
• realization time
• repair time
Criterion 2: demand for product use
• use required
• use not required
  - storage time
  - free time
  - standby time
Figure 1.4 Relationships among the time elements as they influence product effectiveness.
1.3.7 Relationships Among Time Elements
An examination of the relationships among the various time elements can provide additional insight into the properties of the product effectiveness components. As an aid in doing this, Figure 1.4 shows how various time intervals combine and influence the product effectiveness components. Note that capability is not normally a time-dependent parameter and therefore no time factors are shown to influence it.
1.4 ASSIGNMENT OF RESPONSIBILITY
Even before discussing quantitative measurement for the concepts displayed in Figure 1.4, it is possible to demonstrate how such measures can be helpful in locating trouble areas and assigning responsibility for remedial action to improve effectiveness. These concepts can provide information for comparative evaluation of competing equipment or systems and for determining the particular characteristics responsible for the differences. The property of the time breakdown that leads to these results is the relationship between the lengths of various time intervals and the responsibilities of various personnel groups. It is apparent that administrative personnel are responsible for controlling free time, storage time, administrative time, and logistic time, while production and design engineers are responsible to a large degree for operating time
(failure frequency) and active repair time. Of course, maintenance and design engineers share the responsibility for active repair time. To achieve the utmost effectiveness, it is necessary to maximize operating time and minimize downtime. The role of non-use time (free time and storage time) is to serve as a safety valve. Maximum free time means minimum pressure for product use; storage time results from the existence of spares to carry the operational load in case of emergency. Because the deterioration rate during storage may be different from that during use and because, by error, some inoperable equipment may be placed in storage, this time element must be considered in determining operational readiness.
Large amounts of free time result when a product has a short operating time (for example, when equipment is needed relatively infrequently) and there is firm scheduling of the need. Free time also arises when working hours are restricted rather than continuous. For example, banks need time locks on safes at night but not during the day. Some communication equipment regularly has free time. Automatic answering services are needed only when the operator is absent. Some television stations have regular hours during which there is no telecast. It is clear that operational readiness can be enhanced by using free time for maintenance, and free time can thus compensate to some extent for poor maintainability and poor reliability.
The important point with respect to free time and storage time is that they provide administrative flexibility to help alleviate the effects of equipment inadequacies and thus to gain operational readiness. However, free time and storage time do nothing to improve poor equipment. They provide an inferior but sometimes necessary alternative to the preferred solution of obtaining better equipment. They are a substitute for quality, but not a way of achieving it.
It follows from the foregoing discussion that the more significant indicators of equipment characteristics are to be found in times other than free time and storage time. Figure 1.4 shows that these other types of time are all involved in the concept of availability, which combines operating time with total downtime, including the three subcategories of downtime: administrative time, logistic time, and active repair time. These subcategories involve both administrative and engineering responsibilities.
1.4.1 Administrative Time
The administrative time category is almost entirely determined by administrative decisions about the processing of records and the personnel policies governing maintenance engineers, technicians, and those engaged in associated clerical activities. Establishing efficient methods of monitoring, processing, and analyzing repair activities is the responsibility of administration. In addition, administrative time has been defined to include wasted time because such time is the responsibility of administration. It is independent of engineering as such and is not the responsibility of the equipment manufacturer.
1.4.2 Logistics Time
Logistic time is the time consumed by delays in repair due to the unavailability of replacement parts. This is a matter largely under the control of administration, although the requirements for replacements are determined by operating conditions and the built-in ability of the equipment to withstand operating stress levels. Policies determined by procurement personnel can, if properly developed, minimize logistic time. Therefore, the responsible administrative officials in this area are likely to be different from those who most directly influence the other time categories. This justifies separate consideration of logistic time.
1.4.3 Active Repair Time and Operating Time
Active repair time and operating time are both determined principally by the built-in characteristics of the equipment, and hence are primarily the responsibility of the equipment manufacturer. Improvement in this area requires action to reduce the frequency of failure, to increase the ease of repair, or both. Operating time and active repair time are associated with the concepts of reliability and repairability, respectively, which are related through the concept of intrinsic availability. Administration can do little to reduce active repair time or increase operating time (i.e., failure-free time). Administrators can influence these time elements to a limited extent by assuring that operating stress levels are within design specifications and that the maintenance shop is supplied with proper tools and adequately trained personnel.
Because products are generally purchased, most customers want to buy products that perform their intended function at the lowest total cost. Cost studies show that the total cost of ownership (including initial and operating costs for the service life of the equipment) can be materially reduced if proper attention is given to reliability and maintainability early in the design of the product.
These considerations lead to the concept of product worth, which is illustrated in Figure 1.5 and relates product effectiveness to total cost, scheduling, and personnel requirements.
Figure 1.5 Concepts associated with product worth.
Figure 1.6 Outline of requirements for a technical development plan. (Diagram elements: technical development plan; technical requirements; product design plan; cost (initial and operating) and schedule requirements; operability plan; dependability (reliability, maintainability) and supportability plan; development, production, delivery, and installation plan; personnel requirements; personnel requirements and training plan for system development and utilization; test and evaluation plan.)
To optimize product worth, program managers face the difficult task of striking balances to maximize product effectiveness while minimizing total cost, development time, and personnel requirements (see Chapter 13). Cost, schedule, and personnel are constraints faced by both military and commercial program managers. In the commercial world, time to market and staying up with the competition are additional constraints. The political constraints surrounding most military programs are unique to the military manager. In practice, managers from both communities select from several alternatives of the most promising product or component for which development effort is required. This selection can be facilitated by forming technical development plans, as outlined in Figure 1.6. At this point it should be noted that the product effectiveness applies to the operation of a product in its use environment and is capable of being measured. However, because the actual use environment is often unknown or beyond the control of the product manufacturer, only certain elements of the product effectiveness concept can be specified for contractual purposes. From a practical point of view, a mission or use analysis must be conducted to determine the required level of intrinsic availability, as well as the needed performance characteristic (design capability). The problem of specifying product requirements becomes increasingly complex if redundant or multimodal operation is employed. For example, to achieve the required level of availability at the optimum cost level, the product design team may have to consider several alternative approaches. Should several redundant systems be used so that one or more spare systems will always be available and the failed system can be repaired under less pressing circumstances, or should one highly reliable product be developed that can be repaired quickly? 
In many cases, improved reliability and repairability by the use of higher quality parts, redundant circuitry, plug-in assemblies, and simplified or semiautomatic fault
isolation devices can do much to improve availability and reduce total downtime. Trade-off analyses, therefore, are often essential in determining the minimum requirements for achieving the required availability with the optimum expenditure of money, personnel, and time, among which trade-offs will also be required. Historically, significant reliability improvements in electronic systems were made as industries transitioned from vacuum tubes to solid-state components to microchips. New materials and approaches to design and stress analysis contributed to reducing failures. As reliability problems were attacked and alleviated, the need to address maintainability, logistics, and cost issues became more evident. Product effectiveness provides an approach to product operation, support, and performance. For example, reliability engineers working on redundancy in the early 1960s found that, by adding components and a switching mechanism to a functionally duplicate operating component, reliability could theoretically be improved. But, as these designs were implemented, it became clear that penalties were incurred in such areas as power, weight, maintainability, and cost. Of what value is adding a duplicate transmitter circuit if the power requirements of a redundant design cause a significant increase in the failure rate of the power supply? Product effectiveness and product worth, which account for cost and resource usage, provide a conceptual framework for determining how best to direct design and development to achieve the desired performance more efficiently and effectively.
CHAPTER 2
Reliability Concepts
Diganta Das, Michael Pecht
CONTENTS
2.1 Introduction
2.2 Reliability
2.3 Probability Density Function
2.4 Hazard Rate
2.5 Conditional Reliability
2.6 Time to Failure
Homework Problems
2.1 INTRODUCTION
This chapter presents some of the fundamental definitions and mathematical theory for reliability. The focus is on the reliability and unreliability functions, the probability density function, the hazard rate, the conditional reliability function, and some time-to-failure metrics.
2.2 RELIABILITY
For a constant sample size, $n_o$, of identical products that are tested or being monitored, if $n_f$ products have failed and the remaining $n_s$ products are still operating satisfactorily at any time $t$, then
$$n_s(t) + n_f(t) = n_o \tag{2.1}$$
The factor “time” in Equation 2.1 can pertain to age, total time elapsed, operating time, number of cycles, or distance traveled, or it can be replaced by a measured
quantity, which ranges from –∞ to ∞. This quantity is called the variate in statistics. Variates may be discrete (e.g., number of cycles) or continuous, when they can take on any value within a certain range. The number of failures of a product (or process or event) that occur up to a given time is a fundamental reliability index. The ratio of failed products per sample size is an estimate of the unreliability, $\hat{Q}(t)$, of the product at any time $t$. That is,
$$\hat{Q}(t) = \frac{n_f(t)}{n_o} \tag{2.2}$$
where the “hat” on the top of the variable indicates that it is an estimate. Similarly, the estimate of reliability, $\hat{R}(t)$, of a product at time $t$ is given by the ratio of operating (not failed) products per sample size:
$$\hat{R}(t) = \frac{n_s(t)}{n_o} = 1 - \hat{Q}(t) \tag{2.3}$$
As fractional numbers, $\hat{R}(t)$ and $\hat{Q}(t)$ range in value from zero to unity; multiplied by 100, they give the probability in the form of percentages.
EXAMPLE 2.1
A semiconductor fabrication plant has an average output of 10 million devices per week. Over the last year, it has been found that 100,000 devices have been rejected in final test. (a) What is the unreliability of the semiconductor devices according to the conducted test? (b) If the tests reject 99% of all defective devices, what is the chance that any device a customer receives will be defective?
Solution:
(a) The total number of devices produced in a year is
$$n_o = 52 \times 10 \times 10^6 = 520 \times 10^6$$
The number of rejects (failures) $n_f$ over the same period is
$$n_f = 1 \times 10^5$$
Therefore, from Equation 2.2, an estimate for device unreliability is
$$\hat{Q}(t) = \frac{n_f(t)}{n_o} = \frac{1 \times 10^5}{520 \times 10^6} \approx 1.92 \times 10^{-4}$$
or 1 chance in 5,200.
(b) If the failed devices represent 99% of all the defective devices produced, then the number of defectives that passed testing is
$$x_d = \left(\frac{1 \times 10^5}{0.99}\right) - (1 \times 10^5) \approx 1010$$
Therefore, the probability of a customer getting a defective device or the expected unreliability of the supplied devices on first use is
$$\hat{Q}(t) = \frac{1010}{(520 \times 10^6) - (1 \times 10^5)} \approx 1.94 \times 10^{-6}$$
or 1 chance in 515,000.
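The arithmetic in Example 2.1 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative and not part of the handbook, and the inputs are taken directly from the example statement.

```python
# Example 2.1 arithmetic; all inputs come from the example statement.
WEEKS_PER_YEAR = 52
devices_per_week = 10e6   # average weekly output
rejected = 1e5            # devices rejected in final test over the year
test_coverage = 0.99      # fraction of defective devices the test rejects

# (a) Unreliability observed in final test (ratio of failures to sample size).
n_o = WEEKS_PER_YEAR * devices_per_week
q_test = rejected / n_o

# (b) Defective devices that escaped the test and reached customers.
total_defective = rejected / test_coverage
escaped = total_defective - rejected
shipped = n_o - rejected
q_shipped = escaped / shipped

print(f"(a) Q = {q_test:.3g}, about 1 chance in {1 / q_test:,.0f}")
print(f"(b) escaped = {escaped:.0f}, Q = {q_shipped:.3g}, "
      f"about 1 chance in {1 / q_shipped:,.0f}")
```

Part (b) divides by the number of devices actually shipped rather than by the number produced, which is why the denominator subtracts the rejects.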
Reliability estimates obtained by testing or monitoring samples generally exhibit variability. For example, light bulbs designed to last for 10,000 hours of operation that were all installed at the same time in the same room are unlikely to fail at exactly the same time or at exactly 10,000 hours. Variability in the measured product response as well as the time of operation is expected. In fact, product reliability assessment is often associated with the estimation and measurement of this variability. The accuracy of a reliability estimate at a given time is improved by increasing the sample size $n_o$. The requirement of a large sample is analogous to the conditions required in experimental measurements of probability associated with coin tossing and dice rolling. This implies that the estimates given by Equations 2.2 and 2.3 approach actual values for R(t) and Q(t) as the sample size becomes infinitely large. Thus, the practical meaning of reliability and unreliability is that, in a large number of repetitions, the proportional frequency of occurrence of success or failure will approximately equal the $\hat{R}(t)$ and $\hat{Q}(t)$ estimates, respectively.
The response values for a series of measurements on a certain product parameter can be plotted as a histogram in order to assess the variability. For example, Table 2.1 lists a series of time-to-failure results for 251 samples tested in 11 different groups. These data are summarized as a frequency table in the first two columns of Table 2.2 and a histogram is created from those two columns in Figure 2.1. In the histogram,
Figure 2.1 Frequency histogram or life characteristic curve for data from Table 2.2. (Axes: number of failures versus operating time in hours.)
Table 2.1 Measured Time-to-Failure (Hours) Data for 251 Samples
Series 1: 1 1 2 3 4 5 6 6 8 9 11 13 16 18 20 25 28 32 36 46 58 79 117
Series 2: 1 1 3 3 4 5 6 6 8 9 12 14 16 18 20 25 28 32 37 47 59 83 120
Series 3: 1 2 2 3 4 5 6 7 8 9 12 14 16 18 20 26 29 33 38 48 62 85 125
Series 4: 1 2 2 3 4 5 6 7 8 9 12 14 16 18 21 26 29 33 39 49 64 89 126
Series 5: 1 2 3 3 4 5 6 7 8 10 12 14 17 18 21 27 29 34 41 49 65 93 131
Series 6: 1 2 3 3 4 5 6 7 8 10 12 15 17 18 22 27 29 34 41 51 66 97 131
Series 7: 1 2 3 3 4 5 6 7 8 11 12 15 17 19 22 27 29 35 42 52 67 99 137
Series 8: 1 2 3 4 4 5 6 7 8 11 13 15 17 19 23 28 29 35 42 53 69 105 140
Series 9: 1 2 3 4 4 5 6 7 9 11 13 15 17 19 23 28 30 36 43 54 72 107 142
Series 10: 1 2 3 4 4 5 6 7 9 11 13 15 18 19 24 28 31 36 44 55 76 111 —
Series 11: 1 2 3 4 4 5 6 7 9 11 13 15 18 20 24 28 31 36 45 56 78 115 —
Table 2.2 Grouped and Analyzed Data from Table 2.1
Operating Time (hours) | Number of Failures (Δn_f) | Surviving Products (n_s) | Probability Density Function f(t) | Reliability R (n_o = 251) | Average Hazard Rate Estimate (Δt = 10)
0–10 | 105 | 146 | 0.418 | 0.58 | 0.04
11–20 | 52 | 94 | 0.207 | 0.372 | 0.03
21–30 | 28 | 66 | 0.112 | 0.26 | 0.03
31–40 | 17 | 49 | 0.068 | 0.192 | 0.02
41–50 | 12 | 37 | 0.048 | 0.144 | 0.02
51–60 | 8 | 29 | 0.032 | 0.112 | 0.02
61–70 | 6 | 23 | 0.024 | 0.088 | 0.02
71–80 | 4 | 19 | 0.016 | 0.072 | 0.01
81–90 | 3 | 16 | 0.012 | 0.06 | 0.01
91–100 | 3 | 13 | 0.012 | 0.052 | 0.02
101–110 | 2 | 11 | 0.008 | 0.044 | 0.01
Over 110 | 11 | 0 | 0.043 | 0 | —
Figure 2.2 Reliability histogram of data from Table 2.1. (Axes: reliability versus operating time in hours.)
each rectangular bar represents the number of failures in the interval. This histogram represents the life characteristic curve for the product. The ratios of the number of surviving products to the total number of products (i.e., the reliability at the end of each interval) are calculated in the fifth column of Table 2.2 and are plotted as a histogram in Figure 2.2. As the sample size increases, the intervals of the histogram can be reduced and the plot will approach a smooth curve. For some continuous time-to-failure data, the rectangles are replaced by ordinates to obtain a smooth hazard rate curve.
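The grouping that leads from Table 2.1 to the counts, survivors, and reliability values of Table 2.2 can be sketched in Python as follows. The data are transcribed from Table 2.1; the helper name `bin_index` and the layout of `rows` are illustrative choices, and the values are recomputed directly from the raw data rather than copied from Table 2.2.

```python
# Time-to-failure data (hours) transcribed from Table 2.1, one list per series.
series = [
    [1, 1, 2, 3, 4, 5, 6, 6, 8, 9, 11, 13, 16, 18, 20, 25, 28, 32, 36, 46, 58, 79, 117],
    [1, 1, 3, 3, 4, 5, 6, 6, 8, 9, 12, 14, 16, 18, 20, 25, 28, 32, 37, 47, 59, 83, 120],
    [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 12, 14, 16, 18, 20, 26, 29, 33, 38, 48, 62, 85, 125],
    [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 12, 14, 16, 18, 21, 26, 29, 33, 39, 49, 64, 89, 126],
    [1, 2, 3, 3, 4, 5, 6, 7, 8, 10, 12, 14, 17, 18, 21, 27, 29, 34, 41, 49, 65, 93, 131],
    [1, 2, 3, 3, 4, 5, 6, 7, 8, 10, 12, 15, 17, 18, 22, 27, 29, 34, 41, 51, 66, 97, 131],
    [1, 2, 3, 3, 4, 5, 6, 7, 8, 11, 12, 15, 17, 19, 22, 27, 29, 35, 42, 52, 67, 99, 137],
    [1, 2, 3, 4, 4, 5, 6, 7, 8, 11, 13, 15, 17, 19, 23, 28, 29, 35, 42, 53, 69, 105, 140],
    [1, 2, 3, 4, 4, 5, 6, 7, 9, 11, 13, 15, 17, 19, 23, 28, 30, 36, 43, 54, 72, 107, 142],
    [1, 2, 3, 4, 4, 5, 6, 7, 9, 11, 13, 15, 18, 19, 24, 28, 31, 36, 44, 55, 76, 111],
    [1, 2, 3, 4, 4, 5, 6, 7, 9, 11, 13, 15, 18, 20, 24, 28, 31, 36, 45, 56, 78, 115],
]
times = [t for s in series for t in s]
n_o = len(times)  # sample size

# Upper bounds of the 10-hour intervals; anything above 110 goes in a final bin.
edges = list(range(10, 111, 10))
labels = ["0-10"] + [f"{lo + 1}-{lo + 10}" for lo in range(10, 110, 10)] + ["Over 110"]

def bin_index(t):
    """Return the index of the interval that contains failure time t."""
    for i, upper in enumerate(edges):
        if t <= upper:
            return i
    return len(edges)

failures = [0] * len(labels)
for t in times:
    failures[bin_index(t)] += 1

# Each row: interval, failures, survivors at interval end, failures/n_o, survivors/n_o.
rows = []
survivors = n_o
for label, dn_f in zip(labels, failures):
    survivors -= dn_f
    rows.append((label, dn_f, survivors, dn_f / n_o, survivors / n_o))

for label, dn_f, n_s, f, r in rows:
    print(f"{label:>9} {dn_f:4d} {n_s:4d} {f:6.3f} {r:6.3f}")
```

Shrinking the interval width in `edges` (given enough data) produces the finer histogram that the text describes as approaching a smooth curve.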
2.3 PROBABILITY DENSITY FUNCTION
The ratio of the number of product failures in an interval to the total number of products gives an estimate of the probability density function corresponding to the interval. For the data in Table 2.1, the estimate of the probability density function for each interval is evaluated in the fourth column of Table 2.2. Figure 2.3 shows the estimate of the probability density function for the data of Table 2.1. The sum of all possible values equals unity (values in column four of Table 2.2). The probability density function is given by:
$$f(t) = \frac{1}{n_o}\frac{d[n_f(t)]}{dt} = \frac{d[Q(t)]}{dt} \tag{2.4}$$
Integrating both sides of this equation gives the relation for the unreliability in terms of f(t):
$$Q(t) = \frac{n_f(t)}{n_o} = \int_0^t f(\tau)\,d\tau \tag{2.5}$$
Figure 2.3 Probability density function for data in Table 2.1. (Axes: f(t) versus operating time in hours.)
where the integral is the probability that a product will fail in the time interval $0 \le \tau \le t$. The integral in Equation 2.5 is the area under the probability density function curve to the left of the time line at some time t. The unreliability is also called the cumulative failure probability distribution function for a continuous random variable. Similarly, the percentage of products that have not failed up to time t is represented by the area under the curve to the right of t by
$$R(t) = \int_t^\infty f(\tau)\,d\tau \tag{2.6}$$
Because the total probability of failures must equal one at the end of life for a population, the function f(t) is appropriately normalized. That is,
$$\int_0^\infty f(t)\,dt = 1 \tag{2.7}$$
EXAMPLE 2.2
From the histogram of Figure 2.3, (a) Calculate the unreliability of the product at a time of 30 hours. (b) Calculate the reliability.
Solution: For the discrete data represented in this histogram, the unreliability is the sum of the failure probability density function values from t = 0 to t = 30. This sum, as a percentage, is 74%. The reliability is the sum of the mass function values from t = 30 to t = ∞ and is equal to 26%. The sum of reliability and unreliability must always be equal to 100%.
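The sums in Example 2.2 can be reproduced with a running total over the interval probabilities. A minimal sketch, assuming the per-interval values printed in the fourth column of Table 2.2 for the first eleven intervals; the variable names are illustrative.

```python
from itertools import accumulate

# Interval probabilities from the fourth column of Table 2.2 (0-10, 11-20, ...).
upper_edges = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
f = [0.418, 0.207, 0.112, 0.068, 0.048, 0.032, 0.024, 0.016, 0.012, 0.012, 0.008]

# The running sum is the discrete counterpart of integrating f from 0 to t.
cumulative = list(accumulate(f))
q_30 = cumulative[upper_edges.index(30)]  # unreliability at t = 30 hours
r_30 = 1.0 - q_30                         # reliability at t = 30 hours

print(f"Q(30) = {q_30:.3f}, R(30) = {r_30:.3f}")
```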
2.4 HAZARD RATE
For a nonrepairable product, for a given period (from $t_i$ to $t_i + \Delta t$), the hazard rate is given by the conditional probability that the product will fail in this period, given that the product has survived up to time $t_i$. That is,
$$h(t) = P(t_i \le T \le t_i + \Delta t \mid T \ge t_i) \tag{2.8}$$
Assuming that the product has survived up to time $t_i$, an estimate of the average hazard rate over the time interval Δt, $\hat{h}$, is mathematically expressed as
$$\hat{h} = \frac{1}{n_{bp}(t_i)}\frac{\Delta n_f}{\Delta t} \tag{2.9}$$
where $n_{bp}(t_i)$ is the number of products monitored or tested that have not failed at the beginning of the time period, $\Delta n_f$ is the number of failures in the sampling period, and Δt is the sampling interval. As $n_{bp}$ becomes large and as the sampling interval goes to zero, the average hazard rate estimate approaches the instantaneous hazard rate (or just hazard rate) at time t. That is, in the limit as Δt goes to zero, Equation 2.9 becomes
$$h(t) = \frac{1}{n_s(t)}\frac{d\,n_f(t)}{dt} \tag{2.10}$$
The hazard rate, h(t), is the number of failures per unit time per number of nonfailed products left. It is thus a relative rate of failure, in that it does not depend on sample size. From Equations 2.2, 2.3, and 2.10, a relation for the hazard rate in terms of the reliability is
$$h(t) = -\frac{1}{R(t)}\frac{d\,R(t)}{dt} \tag{2.11}$$
Integrating Equation 2.11 over an operating time from 0 to t, noting that R(t = 0) = 1, and taking the exponential of each side gives
$$R(t) = \exp\left[-\int_0^t h(\tau)\,d\tau\right] \tag{2.12}$$
This is the fundamental equation of reliability expressed in terms of the hazard rate. The hazard rate can also be expressed as the ratio of the failure probability density function to the reliability by combining Equations 2.4 and 2.11 with Equation 2.3:
$$h(t) = \frac{f(t)}{R(t)} \tag{2.13}$$
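The integration step that leads from Equation 2.11 to Equation 2.12 can be written out explicitly. The following is a sketch of the standard separation-of-variables argument, with τ used as a dummy variable of integration:

```latex
\begin{align*}
h(\tau)\,d\tau &= -\frac{dR(\tau)}{R(\tau)} \\
\int_0^t h(\tau)\,d\tau &= -\bigl[\ln R(\tau)\bigr]_0^t
  = -\ln R(t) + \ln R(0) = -\ln R(t)
  \qquad \text{since } R(0) = 1 \\
R(t) &= \exp\!\left[-\int_0^t h(\tau)\,d\tau\right]
\end{align*}
```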
Using the data from Table 2.1 and Equation 2.9, an estimate (over Δt) of the hazard rate is calculated in the last column of Table 2.2. Figure 2.4 is the histogram of hazard rate versus time.
Figure 2.4 Hazard rate histogram of data from Table 2.1. (Axes: average hazard rate versus operating time in hours.)
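The average hazard rate estimate of Equation 2.9 can be sketched for the grouped data as follows. The interval failure counts are those of the first eleven 10-hour intervals in Table 2.2; the open-ended "Over 110" interval is left out because it has no defined width. Variable names are illustrative.

```python
n_o = 251    # sample size
dt = 10.0    # sampling interval, hours
failures = [105, 52, 28, 17, 12, 8, 6, 4, 3, 3, 2]  # intervals 0-10 ... 101-110

hazard = []
n_bp = n_o   # products that have not failed at the beginning of the interval
for dn_f in failures:
    hazard.append(dn_f / (n_bp * dt))  # Equation 2.9
    n_bp -= dn_f                       # survivors carried into the next interval

for i, h in enumerate(hazard):
    print(f"interval {i + 1:2d}: average hazard rate = {h:.4f} per hour")
```

The results should be close to the last column of Table 2.2, which is printed to only two decimal places.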
2.5 CONDITIONAL RELIABILITY
The conditional reliability function R(t, T) is defined as the probability of operating for a time interval t, given that the nonrepairable system has operated for a time T prior to the beginning of the interval. The conditional reliability can be expressed as the ratio of the reliability at time (t + T) to the reliability at an operating duration T, where T is the “age” of the system at the beginning of a new test or mission. That is,
$$R(t, T) = \frac{R(t + T)}{R(T)} \tag{2.14}$$
For a product with a decreasing hazard rate, the conditional reliability will increase as the age T increases. The conditional reliability will decrease for a product with an increasing hazard rate. The conditional reliability of a product with a constant rate of failure is independent of T; that is, the reliability for a mission time t is independent of previous mission times. This suggests that a product with a constant failure rate can be treated as “as good as new” at any time.
EXAMPLE 2.3
The reliability function for a system is assumed to be an exponential distribution and is given by $R(t) = e^{-\lambda_0 t}$, where $\lambda_0$ is a constant (i.e., a constant hazard rate). Calculate the reliability of the system for mission time t, given that the system has already been used for 10 years.
Solution: Using Equation 2.14,
$$R(t, 10) = \frac{R(t + 10)}{R(10)} = \frac{e^{-\lambda_0 (t + 10)}}{e^{-\lambda_0 \cdot 10}} = e^{-\lambda_0 t} = R(t)$$
That is, the system reliability is “as good as new,” regardless of the age of the system.
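The behavior described above can be illustrated numerically with Equation 2.14. In this sketch the constant hazard rate, the mission time, and the Weibull reliability function with its parameters are all hypothetical choices; the Weibull form is not part of this section and is used here only as a convenient example of an increasing hazard rate.

```python
import math

def exponential_reliability(lam):
    """Return R(t) = exp(-lam * t) for a constant hazard rate lam."""
    return lambda t: math.exp(-lam * t)

def weibull_reliability(eta, beta):
    """Return R(t) = exp(-(t / eta) ** beta); beta > 1 gives an increasing hazard rate."""
    return lambda t: math.exp(-((t / eta) ** beta))

def conditional_reliability(R, t, T):
    """Equation 2.14: probability of surviving a further time t at age T."""
    return R(t + T) / R(T)

mission = 1.0                                         # mission time t, years (hypothetical)
R_const = exponential_reliability(lam=0.05)           # hypothetical constant hazard rate
R_wearout = weibull_reliability(eta=20.0, beta=2.0)   # hypothetical wear-out behavior

for age in (0.0, 10.0):
    print(f"age {age:4.1f}: constant hazard {conditional_reliability(R_const, mission, age):.4f}, "
          f"increasing hazard {conditional_reliability(R_wearout, mission, age):.4f}")
```

With the constant hazard rate the two ages give the same value, while the increasing-hazard case gives a lower conditional reliability at the greater age.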
2.6 TIME TO FAILURE
The median M of the probability distribution is the time at which the area under the distribution is divided in half (i.e., it is the time to reach 50% reliability). That is,
$$\int_0^M f(t)\,dt = 0.5 \tag{2.15}$$
Because M appears as a limit of integration, determining an explicit relation for M can be difficult. As a result, a mean is often the preferred metric. The mean time to failure (MTTF), defined as the expected value of the failure probability density function, is one such parameter:
$$\mathrm{MTTF} = \int_0^\infty t f(t)\,dt \tag{2.16}$$
It can also be shown that Equation 2.16 is equivalent to
$$\mathrm{MTTF} = \int_0^\infty R(t)\,dt \tag{2.17}$$
The MTTF should be used only when the failure distribution function is specified because the value of the reliability function at a given MTTF depends on the probability distribution function used to model the failure data. In fact, different failure distributions can have the same MTTF while having different reliabilities. The first failures that occur in a product or system often have the biggest impact on safety, warranty, and supportability, and consequently on the profitability of the product. Thus, the beginning of the failure distribution is an important concern in reliability. The time to the first 1 or 5% of the failures is often estimated for that reason.
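The two MTTF integrals, and the point that equal MTTF values do not imply equal reliabilities, can be checked numerically. This sketch assumes two hypothetical distributions chosen to have the same MTTF of 100 hours: an exponential distribution with a constant hazard rate of 0.01 per hour, and a uniform distribution on 0 to 200 hours. The simple trapezoidal integrator and the truncation of the infinite upper limit are illustrative choices.

```python
import math

def trapezoid(func, a, b, n=20000):
    """Composite trapezoidal rule for the integral of func over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h

LAM = 0.01    # hypothetical constant hazard rate, per hour
T2 = 200.0    # hypothetical upper limit of the uniform distribution, hours

def f_exp(t):
    return LAM * math.exp(-LAM * t)

def R_exp(t):
    return math.exp(-LAM * t)

def R_uniform(t):
    return max(0.0, 1.0 - t / T2)

UPPER = 3000.0  # stands in for infinity; exp(-30) is negligible here
mttf_from_f = trapezoid(lambda t: t * f_exp(t), 0.0, UPPER)  # integral of t f(t)
mttf_from_R = trapezoid(R_exp, 0.0, UPPER)                   # integral of R(t)
mttf_uniform = trapezoid(R_uniform, 0.0, T2)

print(f"exponential: MTTF = {mttf_from_f:.3f} (from f), {mttf_from_R:.3f} (from R)")
print(f"uniform:     MTTF = {mttf_uniform:.3f}")
print(f"R at t = 100 h: exponential {R_exp(100.0):.3f}, uniform {R_uniform(100.0):.3f}")
```

Both distributions have an MTTF of about 100 hours, yet the reliability at that time is roughly 0.37 for the exponential case and 0.50 for the uniform case.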
HOMEWORK PROBLEMS Problem 2.1 Following the format of Table 2.1, record and calculate the different reliability metrics after bending 30 paper clips 90° back and forth to failure. Plot the life characteristics curve, the estimate of the probability density function, the reliability, and unreliability, and the hazard rate. Do you think your results depend on the amount of bend of the paper clip? Explain. Problem 2.2 Prove that MTTF
¯
c
0
© 2009 by Taylor & Francis Group, LLC
© 2009 by Taylor & Francis Group, LLC
tf (t ) dt
¯
c
0
R(t ) dt .
28
PRODUCT RELIABILITY, MAINTAINABILITY, AND SUPPORTABILITY HANDBOOK 2E
Problem 2.3 Calculate the MTTF for a failure probability density function given by k M f M x!e ; x 0,1,2,3,....... Problem 2.4 Calculate the MTTF for a failure probability density function given by ¹ ª0; for (t t ) 1 1 ; for (t1 a t a t2 ) º f (t ) « t2 t1 ¬0; for (t t2 ) » Problem 2.5 Assume that the system in Example 2.3 is a car. Do the results in the example make sense? Why or why not? Provide some examples of systems where the results may be more appropriate. Problem 2.6 Given the following: 1 $ nf fˆ no $ t f (t )
1 d [n f (t )] d [Q(t )] no dt dt c
t
Q (t )
¯
f (T )dT d , R (t )
0
¯ f (T )dT 0
c
prove
¯ f (t)dt 1 0
Problem 2.7 Hazard rate: Given the following:
hˆ
1 $nf nbp (t ) $ t
h (t )
1 d [n f (t )]
1 d [ R (t )] ns (t ) dt R (t ) dt t
R (t ) x e ¯0
h (t ) dt
, h(t ) x
f (t ) R (t )
prove the hazard rate equation. Problem 2.8 Discuss PC failure data in terms of conditional reliability (e.g., if a PC has survived 12 cycles at 90°, what is the probability that it will survive another 5 cycles?) R (t , T )
R (t T ) R (T )
where R(t, T) is the conditional probability that a product will survive for an additional time t, given that it has survived up to time T.
© 2009 by Taylor & Francis Group, LLC
© 2009 by Taylor & Francis Group, LLC
RELIABILITY CONCEPTS
29
Problem 2.9 What does the conditional reliability reduce to, if the hazard rate is a constant? Hint: t
h(t )dt R(t ) x e ¯0
Problem 2.10 Mean time to failure: prove that c
c
¯0
¯0
c
c
MTTF t f (t )dt R (t )dt
¯0 t f (t)dt ¯0 R (t)dt
CHAPTER 3
Statistical Inference Concepts
Jun Ming Hu, Mark Kaminskiy, Igor A. Ushakov

CONTENTS
3.1 Introduction
3.2 Statistical Estimation
    3.2.1 Point Estimation
        3.2.1.1 Method of Moments
        3.2.1.2 Method of Maximum Likelihood
    3.2.2 Interval Estimation
3.3 Hypothesis Testing
    3.3.1 Frequency Histogram
    3.3.2 Goodness-of-Fit Tests
        3.3.2.1 The Chi-Square Test
        3.3.2.2 The Kolmogorov–Smirnov Test
        3.3.2.3 Sample Comparison
3.4 Reliability Regression Model Fitting
    3.4.1 Gauss–Markov Theorem and Linear Regression
        3.4.1.1 Regression Analysis
        3.4.1.2 The Gauss–Markov Theorem
        3.4.1.3 Multiple Linear Regression
    3.4.2 Proportional Hazard (PH) and Accelerated Life (AL) Models
        3.4.2.1 Accelerated Life (AL) Model
        3.4.2.2 Proportional Hazard (PH) Model
    3.4.3 Accelerated Life Regression for Constant Stress
    3.4.4 Accelerated Life Regression for Time-Dependent Stress
3.5 Summary
References
3.1 INTRODUCTION
Evaluation in probabilistic terms of the real reliability of products is possible only on the basis of real statistical data obtained from special experiments or practical use. Adequately determining the form or class of the distribution function requires statistical data. Such data also provide the information to confirm prior hypotheses about different probabilistic parameters of the models used in reliability analysis. Techniques for analyzing probabilistic models from observational data are embodied in statistical inference.
3.2 STATISTICAL ESTIMATION
There are two kinds of estimation: point estimation and interval estimation. Point estimation provides a single number from a set of observational data to represent a parameter or other characteristic of the underlying distribution. Point estimation does not give any information about its accuracy. Interval estimation constructs a confidence interval that includes the true value of the parameter with a specified degree of confidence.

3.2.1 Point Estimation
Estimation of a parameter is necessarily based on a set of sample values, $X_1,\ldots,X_n$. If these values are independent and their underlying distribution remains the same from one sample value to another, they yield a random sample of size n from the distribution of the investigated random variable X. Let the distribution have a parameter $\theta$. Consider a random variable $t(X_1,\ldots,X_n)$ that is a single-valued function of $X_1,\ldots,X_n$. The random variable $t(X_1,\ldots,X_n)$ is referred to as a statistic. A point estimate is obtained by selecting an appropriate statistic and calculating its value from the sample data. The selected statistic is referred to as an estimator.

An estimator, $t(X_1,\ldots,X_n)$, is said to be an unbiased estimator for $\theta$ if $E[t(X_1,\ldots,X_n)] = \theta$ for any value of $\theta$. The bias is the difference between the expected value of an estimate and the parameter value itself; the smaller the bias is, the better the estimator is.

Another desirable property of an estimator, $t(X_1,\ldots,X_n)$, is the property of consistency. An estimator, t, is said to be consistent if, for every $\varepsilon > 0$,

$$\lim_{n \to \infty} P(|t - \theta| < \varepsilon) = 1 \tag{3.1}$$

This property implies that, as sample size n increases, the estimator, $t(X_1,\ldots,X_n)$, gets closer to the true value of $\theta$. In some situations, several unbiased estimators can be found. Selecting the best among the unbiased estimators involves choosing the one with the least variance. An unbiased estimator, t, of $\theta$, with minimum variance among all unbiased estimators of $\theta$, is called efficient.
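The consistency property of Equation 3.1 can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: the data are taken to be exponentially distributed with true mean θ = 1, the tolerance is ε = 0.1, and the helper name fraction_within is hypothetical. It estimates P(|t − θ| < ε) for the sample mean at a small and a large sample size.

```python
import random


def fraction_within(n, epsilon, theta=1.0, trials=200, seed=0):
    """Estimate P(|t - theta| < epsilon), where t is the mean of n exponential values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # expovariate takes the rate, so the distribution mean equals theta.
        sample = [rng.expovariate(1.0 / theta) for _ in range(n)]
        t = sum(sample) / n
        if abs(t - theta) < epsilon:
            hits += 1
    return hits / trials


# Assumed illustration values: theta = 1, epsilon = 0.1.
frac_small = fraction_within(n=5, epsilon=0.1)
frac_large = fraction_within(n=2000, epsilon=0.1)
print(frac_small, frac_large)
```

With these assumed settings, the estimated probability should be well below one for n = 5 and close to one for n = 2000, which is the behavior Equation 3.1 describes.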
Another useful estimation property is sufficiency. An estimator, $t(X_1,\ldots,X_n)$, is said to be a sufficient statistic of the parameter, $\theta$, if it contains all the information about $\theta$ that is in the sample $X_1,\ldots,X_n$. Several common methods for point estimation are briefly introduced next.

3.2.1.1 Method of Moments

The method of moments is used to estimate unknown parameters of the distribution function (d.f.) on the basis of empirically estimated moments of the random variable. The sample moments are equated to the corresponding distribution moments. The solutions of the equations obtained provide the estimators of the distribution parameters. For example, because the mean and variance are the expected values of X and of $(X - \mu)^2$, respectively, the sample mean and sample variance can be defined as the corresponding averages over a sample of size n (namely, $X_1,\ldots,X_n$) as follows:

$$\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{3.2}$$

and

$$S^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2 \tag{3.3}$$

$\bar X$ and $S^2$, respectively, are the point estimates of the distribution mean, $\mu$, and variance, $\sigma^2$. The estimator of variance in Equation 3.3 is biased; however, this bias can be removed by multiplying it by n/(n − 1):

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2 \tag{3.4}$$
It can be shown that this is the unbiased estimator of variance. Comparison of Equations 3.3 and 3.4 shows that there is little difference between the two estimates for large sample sizes.

EXAMPLE 3.1
The life of a device, T, is modeled as a random variable with the exponential distribution

$$f(t) = \lambda e^{-\lambda t} \tag{3.5}$$
The times to failure for accelerated life tests are 22, 24, 31, 41, 52, 63, and 70 hours. To determine the parameter, $\lambda$, of the distribution, the test data are considered as a sample of T, with a sample size of seven. Because the exponential distribution is only a one-parameter distribution, the first moment is used:

$$\bar t = \frac{1}{n}\sum_{i=1}^{n} t_i = \frac{1}{7}\sum_{i=1}^{7} t_i = 43.3 \text{ (hours)} \tag{3.6}$$
This yields an estimator of the mean value. The relationship between the mean value and parameter $\lambda$ is

$$\frac{1}{\lambda} = \int_0^\infty t\,\lambda e^{-\lambda t}\,dt \tag{3.7}$$

Therefore, an estimator of $\lambda$ is $\hat\lambda = 1/\bar t = 0.0231$ (1/hours).
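The arithmetic of Example 3.1 can be checked with a few lines of Python. This is a minimal sketch; the function name exponential_rate_by_moments is hypothetical, and the only data used are the seven failure times quoted in the example.

```python
# Failure times (hours) from Example 3.1.
times = [22, 24, 31, 41, 52, 63, 70]


def exponential_rate_by_moments(sample):
    """Equate the sample mean to the exponential mean 1/lambda and solve for lambda."""
    mean = sum(sample) / len(sample)
    return mean, 1.0 / mean


t_bar, lam_hat = exponential_rate_by_moments(times)
print(f"{t_bar:.1f} hours, {lam_hat:.4f} per hour")  # 43.3 hours, 0.0231 per hour
```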
3.2.1.2 Method of Maximum Likelihood

The method of maximum likelihood is one of the most popular methods of estimation. Consider a random variable, X, with density function $f(x, \theta_0)$, where $\theta_0$ is the parameter. Using the method of maximum likelihood, one can try to find the value of $\theta_0$ that has the highest (or most likely) probability (or probability density) of producing the particular set of measurements, $X_1,\ldots,X_n$. The likelihood of obtaining this particular set of sample values is proportional to the density function $f(x, \theta_0)$ evaluated at the sample points $X_1,\ldots,X_n$. The likelihood function is introduced as

$$L_f(X_1,\ldots,X_n;\theta_0) = f(x_1,\theta_0)\,f(x_2,\theta_0)\cdots f(x_n,\theta_0) \tag{3.8}$$

The definition of the likelihood function is based on the probability (for a discrete random variable) or the density (for a continuous random variable) of the joint occurrence of n events, $X = X_1,\ldots,X = X_n$. The maximum likelihood estimate, $\hat\theta_0$, is the value of $\theta_0$ that maximizes the likelihood function, $L_f(X_1,\ldots,X_n;\theta_0)$, with respect to $\theta_0$. The usual procedure for maximization with respect to a parameter is to calculate the derivative with respect to this parameter and equate it to zero. This yields

$$\frac{\partial L_f(X_1,\ldots,X_n;\theta_0)}{\partial \theta_0} = 0 \tag{3.9}$$

The solution of the preceding equation for $\theta_0$ will give $\hat\theta_0$, if it can be shown that $\hat\theta_0$ does indeed maximize $L_f(X_1,\ldots,X_n;\theta_0)$. Because of the multiplicative nature of the likelihood function, it is frequently more convenient to maximize the logarithm of the likelihood function instead; that is,

$$\frac{\partial \log L_f(X_1,\ldots,X_n;\theta_0)}{\partial \theta_0} = 0 \tag{3.10}$$

The solution for $\hat\theta_0$ from this equation is the same as the one from Equation 3.9. For a density function with two or more parameters, the likelihood function becomes

$$L_f(X_1,\ldots,X_n;\theta_1,\ldots,\theta_m) = \prod_{i=1}^{n} f(X_i;\theta_1,\ldots,\theta_m) \tag{3.11}$$
where $\theta_1,\ldots,\theta_m$ are the m parameters to be estimated. In this case, the maximum likelihood estimators can be obtained by solving the following m equations:

$$\frac{\partial L_f(X_1,\ldots,X_n;\theta_1,\ldots,\theta_m)}{\partial \theta_j} = 0, \quad j = 1,\ldots,m \tag{3.12}$$

Under some general conditions, the maximum likelihood estimates are consistent, asymptotically normal, and asymptotically efficient. Let us estimate the parameter p of the binomial distribution. In this case,

$$L_f(m \mid n) = \binom{n}{m} p^m (1-p)^{n-m}, \quad m = 0, 1, \ldots, n \tag{3.13}$$

Then,

$$\frac{\partial \log L_f}{\partial p} = \frac{n}{p(1-p)}\left(\frac{m}{n} - p\right) \tag{3.14}$$
It follows, therefore, that the maximum likelihood estimator of p is $\hat p = m/n$. It can also be easily checked that the sample mean is the maximum likelihood estimator of the normal distribution mean.

EXAMPLE 3.2
For the life-test data of the device given in Example 3.1, estimate the parameter of the distribution, using the method of maximum likelihood. The likelihood function for this problem is

$$L_f(t_1,\ldots,t_7;\lambda) = \prod_{i=1}^{7} f(t_i,\lambda) = \lambda^7 e^{-\lambda\sum_{i=1}^{7} t_i} \tag{3.15}$$

Taking the derivative and equating it to zero yields

$$\frac{dL_f(t_1,\ldots,t_7;\lambda)}{d\lambda} = \left[7\lambda^6 - \lambda^7\sum_{i=1}^{7} t_i\right] e^{-\lambda\sum_{i=1}^{7} t_i} = 0 \tag{3.16}$$

Solving this equation gives

$$\hat\lambda = \frac{7}{\sum_{i=1}^{7} t_i} = 0.0231 \tag{3.17}$$

where $\hat\lambda$ is the estimator of the parameter $\lambda$. In this example, the estimates by the method of moments and the method of maximum likelihood coincide.
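The result of Example 3.2 can also be checked numerically by maximizing the log-likelihood directly, as suggested by Equation 3.10. The following Python sketch is only an illustration: the golden-section search, the search interval [1e-6, 1], and the function names are assumptions made here, not the procedure used in the example.

```python
import math

# Failure times (hours) from Example 3.1.
times = [22, 24, 31, 41, 52, 63, 70]


def log_likelihood(lam, sample):
    """Log of Equation 3.15: n*log(lambda) - lambda*sum(t_i)."""
    return len(sample) * math.log(lam) - lam * sum(sample)


def golden_section_max(func, lo, hi, tol=1e-10):
    """Locate the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) > func(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


# Assumed search interval for lambda (1/hours).
lam_numeric = golden_section_max(lambda lam: log_likelihood(lam, times), 1e-6, 1.0)
lam_closed = len(times) / sum(times)  # Equation 3.17
print(f"{lam_numeric:.4f} {lam_closed:.4f}")
```

Both values should round to 0.0231 per hour, in agreement with Equation 3.17.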
Another popular estimation method is the least squares method; it will be considered in Section 3.4.1.
3.2.2 Interval Estimation

Let $L(X_1,\ldots,X_n)$ and $U(X_1,\ldots,X_n)$ be two statistics, such that the probability that parameter $\theta_0$ lies in the interval between them is

$$P\{L(X_1,\ldots,X_n) < \theta_0 < U(X_1,\ldots,X_n)\} = 1 - \alpha \tag{3.18}$$

The random interval [L, U] is called a 100(1 − α)% confidence interval for the parameter $\theta_0$. The endpoints L and U are referred to as the 100(1 − α)% confidence limits of $\theta_0$; (1 − α) is called the confidence coefficient. The most commonly used values for α are 0.10, 0.05, and 0.01. If $\theta_0 > L$ ($\theta_0 < U$) with probability equal to 1, then U (L) is the one-sided upper (lower) confidence limit or confidence bound for $\theta_0$. A 100(1 − α)% confidence interval for an unknown parameter, $\theta_0$, is interpreted as follows: If a series of repetitive experiments yields random samples from the same distribution and the confidence interval for each sample is calculated, then 100(1 − α)% of the constructed intervals will, in the long run, contain the true value of $\theta_0$.

The following example illustrates the common principle of confidence limits construction. (The other procedures of interval estimation widely used in reliability data analysis are considered in Section 3.4.4.) Consider the procedure for constructing confidence intervals for the mean of a normal distribution with known variance. Let $X_1, X_2,\ldots,X_n$ be a random sample from the normal distribution, $N(\mu, \sigma^2)$, in which $\mu$ is an unknown parameter and $\sigma^2$ is assumed to be known. It is easy to show that the sample mean has the normal distribution $N(\mu, \sigma^2/n)$. Thus, $(\bar X - \mu)\sqrt{n}/\sigma$ has the standard normal distribution. This means that

$$P\left(-z_{1-\alpha/2} \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right) = 1 - \alpha \tag{3.19}$$

where $z_{1-\alpha/2}$ is the 100(1 − α/2)th percentile of the standard normal distribution N(0, 1). Solving the inequalities inside the parentheses, Equation 3.19 can be rewritten as

$$P\left(\bar X - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar X + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \tag{3.20}$$

Thus, the symmetric (1 − α) confidence interval for the mean, $\mu$, of a normal distribution with known $\sigma^2$ is

$$\left[\bar x - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar x + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right] \tag{3.21}$$
The confidence interval is wider for a higher confidence level (1 − α). As $\sigma$ decreases or as n increases, the confidence interval becomes narrower for the same confidence level (1 − α).
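The interval of Equation 3.21 is straightforward to compute. The following Python sketch uses the standard library's NormalDist to obtain the percentile $z_{1-\alpha/2}$; the function name normal_mean_ci and the numerical inputs (sample mean 100, σ = 10, n = 25) are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def normal_mean_ci(sample_mean, sigma, n, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval for a normal mean with known sigma (Equation 3.21)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 100(1 - alpha/2)th percentile of N(0, 1)
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical inputs, for illustration only.
low, high = normal_mean_ci(sample_mean=100.0, sigma=10.0, n=25, alpha=0.05)
low_99, high_99 = normal_mean_ci(sample_mean=100.0, sigma=10.0, n=25, alpha=0.01)
print(f"95%: [{low:.2f}, {high:.2f}]  99%: [{low_99:.2f}, {high_99:.2f}]")
```

For these assumed inputs the 95% interval is about [96.08, 103.92], and the 99% interval is wider, consistent with the remark above about higher confidence levels.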
3.3 HYPOTHESIS TESTING
Researchers often need to determine a probability distribution based on available observational data. A histogram for a set of data can yield an idea about the distribution model when visually compared with several hypothesized density functions. Certain statistical tests, known as goodness-of-fit tests, can reject or accept an assumed probability distribution determined empirically or developed theoretically on the basis of prior assumptions. When two or more distributions appear to be plausible probability distribution models, such tests can determine the relative degree of validity of the different distributions. This section illustrates how a distribution can be studied by using a frequency histogram and verifying the result with goodness-of-fit tests. Two of the most commonly used goodness-of-fit tests, the chi-square ($\chi^2$) and Kolmogorov–Smirnov (K–S) tests, are further discussed.

3.3.1 Frequency Histogram
The frequency histogram is a graphic, empirical description of the variability of a random variable. For a specific set of experimental data, a corresponding histogram is constructed as follows:

• From the observed experimental data, select a range sufficient to include the largest and smallest data values.
• Divide this range into intervals of equal length, $\Delta x$ (intervals of different lengths are sometimes used to emphasize particular areas of the data domain).
• Count the number of measurements within each interval and draw vertical bars with heights representing the number of observations in that interval (if the intervals have different lengths, the number of measurements must be related to the length of $\Delta x$).

Alternatively, the heights of the bars can be determined as the ratio of the fraction of measurements in the interval (relative frequency) to the length of the interval on the horizontal axis. That is,

$$f_n = \frac{N_{x,x+\Delta x}}{n\,\Delta x} \tag{3.22}$$

where $N_{x,x+\Delta x}$ is the number of measurements in the interval $(x, x + \Delta x)$ and n is the total number of measurements (the sample size).

The frequency histogram can be used as an empirical frequency distribution for comparison with the theoretical density function. If the theoretical density function of an assumed distribution has the same general shape as the frequency histogram and the theoretical curve is close to the tops of the bars in the frequency histogram, this distribution might model the phenomenon. Figure 3.1 shows an example of a frequency histogram and density function of life results for a type of monolithic microwave integrated circuit (MMIC) device for which the life distribution of the MMIC devices is assumed to be normal.

Figure 3.1 Frequency diagram and probability density of the life distribution of MMIC devices. (Plot for MMIC type A at 175°C: probability density versus failure time in hours, showing the test data and the fitting curve.)

3.3.2 Goodness-of-Fit Tests

When an assumed theoretical distribution is used to model a random variable, based perhaps on the general shape of the frequency histogram, there is no quantitative measure of how well the data fit the model. A goodness-of-fit test provides a quantitative technique to reject (or fail to reject) the assumed distribution. Two of the most commonly used tests, the chi-square and K–S tests, are discussed next.

3.3.2.1 The Chi-Square Test

Consider a sample of N observed values (measurements) of a random variable. The chi-square goodness-of-fit test compares the observed frequencies, $n_1, n_2,\ldots,n_k$, of k intervals of the random variable with the corresponding frequencies, $e_1, e_2,\ldots,e_k$, from an assumed theoretical distribution, $F_0(x)$. The basis for appraising the goodness of fit is the distribution of the statistic

$$\sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} \tag{3.23}$$
This statistic has a distribution that approaches the chi-square ($\chi^2$) distribution with $f = k - 1$ degrees of freedom as $N \to \infty$. The $\chi^2$ distribution has the following probability density function:

$$f(x) = \frac{1}{2^{f/2}\,\Gamma(f/2)}\, x^{f/2 - 1} e^{-x/2}, \quad x > 0 \tag{3.24}$$

where f is the number of degrees of freedom, and $\Gamma(\cdot)$ is the gamma function. The cumulative probability function of the $\chi^2$ distribution is tabulated in most statistics textbooks (see references for this chapter). If the parameters of the theoretical distribution are unknown and are estimated from the data, the preceding distribution remains valid if the number of degrees of freedom is reduced by one for every unknown parameter that must be estimated. On this basis, if an assumed distribution yields a result such that

$$\sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} < C_{1-\alpha,f} \tag{3.25}$$

then the assumed theoretical distribution is not rejected (i.e., the so-called null hypothesis $H_0$: $F(x) = F_0(x)$ is not rejected) at significance level α. If the inequality 3.25 is not satisfied, the alternative hypothesis, $H_1$: $F(x) \ne F_0(x)$, is accepted. In the inequality 3.25, $C_{1-\alpha,f}$ is the value of the $\chi^2$ distribution with f degrees of freedom corresponding to the cumulative probability (1 − α). When employing the $\chi^2$ goodness-of-fit test, it is recommended that at least five intervals be used ($k \ge 5$), with at least five observations per interval ($e_i \ge 5$), to obtain satisfactory results. The steps for conducting the $\chi^2$ test are as follows:

• Divide the range of data into intervals (number of intervals $\ge 5$), with the first and the last being infinite intervals, and count $n_i$, the number of measurements in each interval.
• Estimate the parameters of the assumed theoretical distribution, $F_0(x)$, and calculate the theoretical quantity of data in each interval, $e_i$, as follows:

$$e_i = [F_0(x + \Delta x) - F_0(x)] \times [\text{sample size}] \tag{3.26}$$

• Calculate the statistic in Equation 3.23.
• Choose a specified significance level, α (generally, 1 − α = 90 or 95%), and determine the number of degrees of freedom of the $\chi^2$ distribution:

$$f = k - 1 - [\text{number of parameters of } F_0(x)] \tag{3.27}$$

• Determine $C_{1-\alpha,f}$ from the table and compare it with the obtained value of Equation 3.23. If the inequality 3.25 is satisfied, then the assumed theoretical distribution function, $F_0(x)$, is not rejected.
EXAMPLE 3.3
Life tests of a microelectronic device give 138 failure times, grouped in seven time intervals (k = 7). The number of failures in each interval is listed in Table 3.1. Determine the relative goodness of fit between the normal and exponential distributions, using the $\chi^2$ test at a significance level of α = 5%.

The given data lead to an estimate of the mean value and variance as $\bar T = 4545.3$ hours and $S^2 = 829.2$ hours², using the following formulas:

$$\bar T = \frac{\sum_{i=1}^{k} \frac{1}{2}(t_i + t_{i-1})\, n_i}{\sum_{i=1}^{k} n_i} \tag{3.28}$$

$$S^2 = \frac{\sum_{i=1}^{k} \left[\frac{1}{2}(t_i + t_{i-1}) - \bar T\right]^2 n_i}{\sum_{i=1}^{k} n_i - 1} \tag{3.29}$$

Then, the cumulative distribution function for the normal distribution is

$$F_{T,N}(t) = \Phi\left(\frac{t - \bar T}{S}\right) = \Phi\left(\frac{t - 4545.3}{29.2}\right) \tag{3.30}$$

After estimating the parameter of the exponential distribution as $\hat\lambda = 1/\bar T = 0.00022$, the following cumulative distribution function results for the exponential distribution:

$$F_{T,E}(t) = 1 - e^{-\lambda t} = 1 - e^{-0.00022\,t} \tag{3.31}$$

Then, the theoretical number of data in each interval, $e_i$, is calculated according to the following formulas:

$$e_{i,N} = 138\,[F_{T,N}(t_i) - F_{T,N}(t_{i-1})] \tag{3.32}$$

$$e_{i,E} = 138\,[F_{T,E}(t_i) - F_{T,E}(t_{i-1})] \tag{3.33}$$
Table 3.1 Chi-Square Testing for Life Distribution (Example 3.3)

Interval  Interval range          Recorded        Theoretical frequency, e_i    Running sum of (n_i − e_i)²/e_i
No.       t_{i−1}–t_i (hours)     frequency, n_i  Normal       Exponential      Normal       Exponential
1         2000–3000               1               0.184        54.298           0.042        52.317
2         3000–3500               11              7.314        11.659           1.900        52.354
3         3500–4000               24              15.180       6.856            7.027        95.227
4         4000–4500               33              26.496       6.203            8.621        210.980
5         4500–5000               31              32.430       5.613            8.684        325.800
6         5000–5500               22              28.014       5.079            9.090        382.180
7         5500–6500               16              22.080       6.727            9.341        394.965
The number of degrees of freedom is f = 7 − 1 − 2 = 4 for the normal distribution, and f = 7 − 1 − 1 = 5 for the exponential distribution. At a significance level of 5%, $C_{95\%,4} = 9.49$ for the normal distribution, and $C_{95\%,5} = 11.1$ for the exponential distribution. Comparing these values with the values of $\sum_i (n_i - e_i)^2/e_i$ listed at the bottoms of the sixth and seventh columns, it is apparent that the normal distribution is not rejected and the exponential distribution is rejected, according to the goodness-of-fit test at a 5% significance level.

The exponential distribution does not seem attractive for these data because, for the exponential distribution, the mean is equal to the standard deviation. In the case considered, the sample mean is about 150 times greater than the sample standard deviation!
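The exponential column of Table 3.1 can be reproduced with a short calculation of the statistic in Equation 3.23. The Python sketch below is only an illustration: it takes the recorded and theoretical frequencies exactly as tabulated, and uses the critical value 11.1 quoted in the text rather than computing it; the function name chi_square_statistic is hypothetical.

```python
# Recorded frequencies n_i and exponential theoretical frequencies e_i from Table 3.1.
observed = [1, 11, 24, 33, 31, 22, 16]
expected_exponential = [54.298, 11.659, 6.856, 6.203, 5.613, 5.079, 6.727]


def chi_square_statistic(n_obs, e_theory):
    """Sum of (n_i - e_i)^2 / e_i over all intervals (Equation 3.23)."""
    return sum((n - e) ** 2 / e for n, e in zip(n_obs, e_theory))


statistic = chi_square_statistic(observed, expected_exponential)
critical_value = 11.1  # C_{95%,5} as quoted in the text
rejected = statistic >= critical_value  # inequality 3.25 is not satisfied
print(f"{statistic:.1f}", rejected)
```

The computed value is close to the 394.965 shown at the bottom of the last column of Table 3.1, far above the critical value, so the exponential model is rejected.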
3.3.2.2 The Kolmogorov–Smirnov Test

Another widely used goodness-of-fit test is the K–S test. The basic procedure involves comparing the so-called empirical (or sample) distribution function (e.d.f.) with an assumed theoretical d.f. If the maximum discrepancy is large compared with what is anticipated from a given sample size, the assumed theoretical model is rejected. Consider an uncensored sample of n observed values of a random variable. The set of the data is rearranged in increasing order: $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$. Using the ordered sample data, the e.d.f., $S_n(x)$, is defined as follows:

$$S_n(x) = \begin{cases} 0, & -\infty < x < X_{(1)} \\ \dfrac{i}{n}, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1,\ldots,n-1 \\ 1, & X_{(n)} \le x < \infty \end{cases} \tag{3.34}$$

where $X_{(1)}, X_{(2)},\ldots,X_{(n)}$ are the values of the ordered sample data (order statistics). Figure 3.2 shows a plot of $S_n(x)$ and the proposed theoretical distribution function, $F_0(x)$. The law of large numbers shows that the e.d.f. is a consistent estimator for the corresponding d.f. In the K–S test, the maximum difference (the test statistic) between $S_n(x)$ and $F_0(x)$ over the entire range of x is used as the measure of the discrepancy between the theoretical model and the e.d.f. The maximum difference of $S_n(x)$ and $F_0(x)$ is denoted by

$$D_n = \max_x |F_0(x) - S_n(x)| \tag{3.35}$$

Figure 3.2 Empirical and theoretical distribution functions. (Plot of $S_n(x)$ and $F_0(x)$ versus x, with the sample points $x_1,\ldots,x_n$ on the horizontal axis and the difference $D_n(x)$ indicated.)

If the null hypothesis is true, the probability distribution of $D_n$ will be the same for every possible continuous $F_0(x)$. Thus, $D_n$ is a random variable whose distribution depends on the sample size, n, only. For a specified significance level, α, the K–S test compares the observed maximum difference with the critical value $D_n^\alpha$, defined by

$$P(D_n \le D_n^\alpha) = 1 - \alpha \tag{3.36}$$

Critical values, $D_n^\alpha$, at various significance levels, α, are tabulated in most statistics textbooks (see references for this chapter). If the observed $D_n$ is less than the critical value $D_n^\alpha$, the proposed distribution would not be rejected. The steps for conducting the K–S test are as follows:

• For each sample item, calculate $S_n(x_{(i)})$ (i = 1,…,n) according to Equation 3.34.
• Estimate the parameters of the assumed theoretical distribution, $F_0(x)$, using another sample, and calculate $F_0(x_{(i)})$ from the assumed d.f.
• Calculate the differences between $S_n(x_{(i)})$ and $F_0(x_{(i)})$ for each sample item and determine the maximum value of the differences according to Equation 3.35.
• Choose a specified significance level, α (generally, 1 − α = 90 to 95% for all tests), and determine $D_n^\alpha$ from the appropriate statistical table (Beyer 1968).
• Compare $D_n$ with $D_n^\alpha$; if $D_n$ is less than $D_n^\alpha$, the assumed theoretical d.f., $F_0(x)$, is not rejected.
EXAMPLE 3.4
The modulus of rupture of GaAs wafers was tested. The results from 11 wafers are listed in Table 3.2. Use the K–S test to determine if the data are normally distributed at a significance level of α = 5%.

Table 3.2 K–S Testing for Modulus of Rupture (Example 3.4)

Number  Modulus of rupture,   S_n(x_i)   F_0(x_i)   D_n = |F_0(x_i) − S_n(x_i)|
i       x_i (MPa)
1       67.38                 0.091      0.200      0.109
2       69.96                 0.182      0.249      0.067
3       71.00                 0.273      0.269      0.004
4       73.22                 0.364      0.318      0.046
5       74.75                 0.455      0.352      0.103
6       75.67                 0.545      0.374      0.171
7       80.37                 0.636      0.488      0.148
8       81.64                 0.727      0.518      0.209
9       84.23                 0.818      0.583      0.235
10      85.50                 0.909      0.614      0.295
11      125.71                1.000      0.997      0.003
The data given in Table 3.2 indicate the sample mean is $\bar X = 80.86$ MPa. The sample standard deviation is $\sigma_x = 16.02$ MPa. Calculate $S_n(x) = i/n$, as listed in column 3 of Table 3.2. Then calculate

$$F_0(X_{(i)}) = \Phi\left(\frac{X_{(i)} - 80.86}{16.02}\right) \tag{3.37}$$

and $|F_0(X_{(i)}) - S_n(X_{(i)})|$ for each sample element. The results are tabulated in Table 3.2. From these results, the maximum absolute discrepancy between the two functions is $D_n = 0.295$ and occurs at x = 85.5 MPa. In this case, there are 11 experimental data points. Hence, the critical value of $D_n^\alpha$ at the 5% significance level is found to be $D_n^{0.05} = 0.40$. Because the maximum discrepancy of 0.295 is less than $D_n^{0.05}$, the assumption of a normal distribution for the GaAs modulus of rupture is not rejected at the 5% significance level.
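The calculation in Example 3.4 can be sketched in Python with the standard library. The sketch follows the tabulated procedure, comparing $F_0$ with i/n at each ordered data point only, and, as in the example, estimates the normal parameters from the same 11 values. The function name ks_statistic_normal is hypothetical, and the critical value 0.40 is taken from the text rather than computed.

```python
from statistics import NormalDist, mean, stdev

# Modulus of rupture data (MPa) from Table 3.2.
moduli = [67.38, 69.96, 71.00, 73.22, 74.75, 75.67,
          80.37, 81.64, 84.23, 85.50, 125.71]


def ks_statistic_normal(sample):
    """Largest |F0(x_(i)) - i/n| over the ordered sample, with F0 a fitted normal d.f."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))  # sample mean and standard deviation
    diffs = [abs(fitted.cdf(x) - i / n) for i, x in enumerate(ordered, start=1)]
    d_n = max(diffs)
    return d_n, ordered[diffs.index(d_n)]


d_n, x_at_max = ks_statistic_normal(moduli)
critical_value = 0.40  # D_n^0.05 for n = 11, as quoted in the text
not_rejected = d_n < critical_value
print(round(d_n, 3), x_at_max, not_rejected)
```

The maximum discrepancy should come out at approximately 0.295 and occur at 85.50 MPa, matching the last column of Table 3.2.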
3.3.2.3 Sample Comparison

In some experimental situations two or more samples must be compared. For example, the failure times of two samples of the same device, tested under two different complex stress conditions, could be compared. The problem is to determine whether the reliability of the device under one stress condition differs from the reliability under another stress condition. It may be unnecessary to know the time to failure (TTF) distribution and the specific values of the reliability function. This type of problem is related to the class of statistical tests known as nonparametric or distribution free. Consider the Mann–Whitney (Wilcoxon) test for two samples as an example of nonparametric tests. Two independent samples of measurements $(X_1,\ldots,X_m)$ and $(Y_1,\ldots,Y_n)$ from the continuous distributions with d.f. $F_X(x)$ and d.f. $F_Y(y)$, respectively, are given. Consider the following hypotheses:

• Each measurement in the first sample has the same distribution as each measurement in the second sample; that is, $F_X(x) = F_Y(x)$ (null hypothesis, $H_0$).
• There exists a constant, $\Delta$, such that each random variable $(Y_i + \Delta)$ has the same distribution as each $X_i$; that is, $F_X(x) = F_Y(x - \Delta)$ (alternative hypothesis, $H_1$).
These hypotheses are written as follows:

$$H_0: F_X(x) = F_Y(x), \quad -\infty < x < \infty,$$
$$H_1: F_X(x) = F_Y(x - \Delta), \quad \Delta \ne 0, \quad -\infty < x < \infty.$$

The alternative hypothesis can also be formulated as $\Delta > 0$ or $\Delta < 0$. If the two samples are combined into a single sample, the order statistics of this sample are $Z_{(1)}, Z_{(2)},\ldots,Z_{(m+n)}$, where $Z_{(1)} < Z_{(2)} < \cdots < Z_{(m+n)}$. The rank of a sample element corresponds to its position in the previous ordering of $Z_{(i)}$. Thus, the smallest sample element has the rank 1, the second smallest sample element has the rank 2, and so on. Consider the ranks of all $Z_{(i)}$ (i = 1, 2,…, m + n) that represent the elements of the first sample, $X_{(i)}$, in the pooled sample, Z. Let the sum of these ranks be S. Because the average of the ranks in the pooled sample is (1 + m + n)/2, it is clear that if $H_0$ is true, then

$$E(S) = \frac{m(m + n + 1)}{2} \tag{3.38}$$

It can be shown that

$$\mathrm{Var}(S) = \frac{mn(m + n + 1)}{12} \tag{3.39}$$

Moreover, in this case, for any m and n greater than 8, S is approximately normally distributed with the preceding mean and variance. For any m and n less than 8, special tables must be used (Dixon and Massey 1969).

For reliability applications, the most critical interest is in testing the hypothesis $H_0$: $F_X(x) = F_Y(x)$, $-\infty < x < \infty$, against the alternative $H_1$: $F_X(x) = F_Y(x - \theta)$, $\theta > 0$. If x and y are the failure times, then this alternative is equivalent to the hypothesis (in terms of reliability functions) $R_X(x) > R_Y(x)$, which is also equivalent to the hypothesis that the items of the first sample from $F_X(x)$ are more reliable than the items of the second sample from $F_Y(y)$. The hypothesis $H_0$ is rejected if S > C(m, n, α), where α is the probability of rejecting $H_0$ when it is true (significance level). The values of C(m, n, α) are tabulated for small samples (m or n is less than eight) (Dixon and Massey 1969). For large samples, the hypothesis is rejected if $C > z_\alpha$, where

$$C = \frac{S - E(S)}{(\mathrm{Var}\,S)^{1/2}} \tag{3.40}$$

and $z_\alpha$ is the upper 100α% point of the standard normal distribution N(0, 1), that is, the value exceeded with probability α.

EXAMPLE 3.5
Assume two TTF samples of a device, obtained under different stress conditions:

Sample 1 (hours): 90, 367, 470, 572, 1307, 1345, 1392, 1603, 2152, 2858; m = 10.
Sample 2 (hours): 37, 150, 154, 319, 373, 433, 538, 571, 751, 1180; n = 10.
The null hypothesis is that the device has the same reliability under both stresses. The alternative hypothesis is that the second stress condition is more severe. The hypothesis is tested at the 5% significance level (i.e., α = 0.05). The pooled sample can be ordered as 37(1), 90(2*), 150(3), 154(4), 319(5), 367(6*), 373(7), 433(8), 470(9*), 538(10), 571(11), 572(12*), 751(13), 1180(14), 1307(15*), 1345(16*), 1392(17*), 1603(18*), 2152(19*), 2858(20*). The ranks marked with an asterisk represent the elements of the first sample $X_{(i)}$ in the pooled sample Z. The sum of these ranks is S = 134, E(S) = 105, Var S = 2100/12 = 175, and (Var S)^{1/2} = 13.23, so that C = (134 − 105)/13.23 = 2.19. From a table of the normal distribution (Beyer 1968), $z_{0.05} = 1.64$. Hence, $H_0$ is rejected; the second stress condition is concluded to be worse for reliability.
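The rank-sum calculation of Example 3.5 can be sketched in Python as follows. The function name rank_sum_test is hypothetical, the sketch assumes there are no tied values (true for these data), and the critical value 1.64 is taken from the text.

```python
import math

# TTF samples (hours) from Example 3.5.
sample_1 = [90, 367, 470, 572, 1307, 1345, 1392, 1603, 2152, 2858]
sample_2 = [37, 150, 154, 319, 373, 433, 538, 571, 751, 1180]


def rank_sum_test(x, y):
    """Rank sum S of the first sample, its mean and variance under H0, and C (Equations 3.38-3.40).

    Assumes all pooled values are distinct (no ties).
    """
    pooled = sorted(x + y)
    rank = {value: position for position, value in enumerate(pooled, start=1)}
    m, n = len(x), len(y)
    s = sum(rank[value] for value in x)
    expected = m * (m + n + 1) / 2
    variance = m * n * (m + n + 1) / 12
    c = (s - expected) / math.sqrt(variance)
    return s, expected, variance, c


s, expected, variance, c = rank_sum_test(sample_1, sample_2)
z_alpha = 1.64  # value quoted in the text for alpha = 0.05
print(s, expected, variance, round(c, 2), c > z_alpha)  # 134 105.0 175.0 2.19 True
```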
3.4 RELIABILITY REGRESSION MODEL FITTING
The previous sections dealt mainly with a single random variable. However, reliability problems often require an understanding of the probabilistic relationships among several random variables. For example, the time to failure of a device may depend on applied voltage, environmental temperature, and humidity. The time to failure can be considered as a random variable, Y, which is a function of the variables $x_1$ (voltage), $x_2$ (temperature), and $x_3$ (humidity). Such functions inevitably contain different kinds of uncertainties. Therefore, the term model is widely used.

3.4.1 Gauss–Markov Theorem and Linear Regression
3.4.1.1 Regression Analysis

In regression analysis, Y is referred to as the dependent variable or response and $x_1$, $x_2$, and $x_3$ are the independent variables or factors. For the general case, independent variables $x_1,\ldots,x_k$ might be random or nonrandom variables whose values are known or chosen by the experimenter. The conditional expectation of Y for any given values of $x_1,\ldots,x_k$, $E(Y \mid x_1,\ldots,x_k)$, is known as the regression of Y on $x_1,\ldots,x_k$. If the regression of Y is a linear function of the independent variables $x_1,\ldots,x_k$, then

$$E(Y \mid x_1,\ldots,x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \tag{3.41}$$

The coefficients $\beta_0, \beta_1,\ldots,\beta_k$ are called regression coefficients or parameters. Insofar as the expectation of Y is a nonrandom variable, the relationship 3.41 is deterministic. The corresponding regression model for the random variable Y can be written in the following form:

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \tag{3.42}$$

where $\varepsilon$ is called the random error, assumed to be distributed with mean $E(\varepsilon) = 0$ and finite variance $\sigma^2$. If $\varepsilon$ is normally distributed, one deals with the normal regression.
Figure 3.3 Linear regression fitting.
Simple linear regression. Now consider the regression model for the simple deterministic relationship

$$Y = \beta_0 + \beta_1 x \tag{3.43}$$

which is known as simple linear regression. Assume $n$ pairs of measurements $(x_1, y_1), \ldots, (x_n, y_n)$, as shown in Figure 3.3, where $Y$ has a general tendency to increase with increasing values of $x$. Also assume that, for any given variable $x$, the dependent variable $Y$ is related to the independent variable $x$ by

$$Y = \beta_0 + \beta_1 x + \varepsilon \tag{3.44}$$

where $\varepsilon$ is normally distributed with mean 0 and variance $\sigma^2$. The random variable $Y$ has, for a given $x$, a normal distribution with mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$. Thus, the regression model is a location transformation: the random variable $Y$ is formed by adding the nonrandom quantity $\beta_0 + \beta_1 x$ to the random variable $\varepsilon$. Also suppose that, for any given values $x_1, \ldots, x_n$, the random variables $Y_1, \ldots, Y_n$ are independent. For the preceding $n$ pairs of measurements, the joint probability density function of $y_1, \ldots, y_n$ is given by

$$f_n(y \mid x, \beta_0, \beta_1, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right] \tag{3.45}$$

The function 3.45 (discussed in Section 3.3.2.1) is the likelihood function of the parameters $\beta_0$ and $\beta_1$. Maximizing this function with respect to $\beta_0$ and $\beta_1$ is reduced to the problem of minimizing the sum of squares

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{3.46}$$

with respect to $\beta_0$ and $\beta_1$.
Thus, maximum likelihood estimation of the parameters $\beta_0$ and $\beta_1$ coincides with estimation by the method of least squares. The properties of least squares estimates are given by the Gauss–Markov theorem and will be discussed later. The values of $\beta_0$ and $\beta_1$ minimizing $S(\beta_0, \beta_1)$ are those for which

$$\frac{\partial S(\beta_0, \beta_1)}{\partial \beta_0} = 0, \qquad \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_1} = 0 \tag{3.47}$$

The solution of these equations yields the least squares estimates of the parameters $\beta_0$ and $\beta_1$ (denoted $\hat{\beta}_0$ and $\hat{\beta}_1$) as

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{and} \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{3.48}$$

where

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3.49}$$

Note that the estimates are linear functions of the measurements $y_i$. The estimate of the dependent variable variance $\sigma^2$ is given by

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2} \tag{3.50}$$

where

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \tag{3.51}$$

are the values of the dependent variable predicted by the regression model; $(n - 2)$ is the number of degrees of freedom, and "2" is the number of estimated parameters of the model. It can be shown that the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables with the corresponding means $\beta_0$ and $\beta_1$. The joint distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$ is a bivariate normal distribution with the covariance, $\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1)$, given by

$$\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\bar{x}\, s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{3.52}$$

Hence, unfortunately, in the general case, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are correlated. To avoid this, $x_1, x_2, \ldots, x_n$ must be chosen so that the sample mean, $\bar{x}$, equals 0 in Equation 3.52. This choice is a simple example of design of experiments.
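The estimates in Equations 3.48 through 3.52 can be sketched in a few lines of code. The function name and the data below are hypothetical and serve only as an illustration; the second call shows that centering the $x$ values (so that $\bar{x} = 0$) makes the estimated covariance vanish.

```python
def fit_simple_linear_regression(x, y):
    """Least squares estimates for Y = b0 + b1*x + error (Eqs. 3.48 to 3.52)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx   # slope, Eq. 3.48
    b0 = y_bar - b1 * x_bar                                      # intercept, Eq. 3.48
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in residuals) / (n - 2)                # variance estimate, Eq. 3.50
    cov_b0_b1 = -x_bar * s2 / sxx                                # covariance, Eq. 3.52
    return b0, b1, s2, cov_b0_b1

# Hypothetical measurements (illustration only, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, s2, cov = fit_simple_linear_regression(x, y)

# Centering x makes the sample mean zero, so the covariance in Eq. 3.52 vanishes.
x_mean = sum(x) / len(x)
x_centered = [xi - x_mean for xi in x]
b0_c, b1_c, s2_c, cov_c = fit_simple_linear_regression(x_centered, y)

print(b0, b1, s2, cov)
print(b0_c, b1_c, s2_c, cov_c)
```

Centering changes the intercept (it becomes $\bar{y}$) but leaves the slope and the residual variance unchanged.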
Using the obtained estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $s^2$, the following confidence intervals can be constructed. The $(1 - \alpha)$ two-sided confidence interval for $\beta_0$ is given by

$$\hat{\beta}_0 \pm t_{n-2;\,\alpha/2}\, (s^2)^{1/2} \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2} \tag{3.53}$$

The $(1 - \alpha)$ two-sided confidence interval for $\beta_1$ is

$$\hat{\beta}_1 \pm t_{n-2;\,\alpha/2}\, (s^2)^{1/2} \left[\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2} \tag{3.54}$$

The $(1 - \alpha)$ two-sided confidence interval for the mean value of $Y$ for any given point, $x_0$, is given by

$$\hat{y}(x_0) \pm t_{n-2;\,\alpha/2}\, (s^2)^{1/2} \left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2} \tag{3.55}$$
Based on the distributions of the preceding parameter estimates, several hypotheses can be tested:
• Let $\beta_0^*$ be a given number. Test the hypothesis that the regression parameter, $\beta_0$, is equal to $\beta_0^*$ (null hypothesis) against the alternative hypothesis that it is not equal to $\beta_0^*$; that is,
  $H_0$: $\beta_0 = \beta_0^*$
  $H_1$: $\beta_0 \neq \beta_0^*$
• Analogously, test the following hypothesis:
  $H_0$: $\beta_1 = \beta_1^*$
  $H_1$: $\beta_1 \neq \beta_1^*$
• Test the hypothesis about both $\beta_0$ and $\beta_1$:
  $H_0$: $\beta_0 = \beta_0^*$ and $\beta_1 = \beta_1^*$
  $H_1$: the hypothesis $H_0$ is not true.
The correlation coefficient, $\rho$, has an additional meaning in regression analysis. Let the random variables $Y$ and $x$ have a bivariate normal distribution. In this case, the conditional distribution of $Y$ for a given value of $x$ is the univariate normal distribution with the variance of $Y$ given by

$$\sigma^2 = \sigma_Y^2 (1 - \rho^2) \tag{3.56}$$
where $\sigma_Y^2$ is the unconditional variance of $Y$ (i.e., the variance of $Y$ when the value of $x$ is unknown). From Equation 3.56,

$$\rho^2 = \frac{\sigma_Y^2 - \sigma^2}{\sigma_Y^2} \tag{3.57}$$

This last relationship has a useful interpretation: the squared correlation coefficient is equal to the portion of the variance of $Y$ that is explained by knowledge of $x$.
3.4.1.2 The Gauss–Markov Theorem
Consider $n$ measurements, $Y_1, \ldots, Y_n$, of the dependent variable. Also suppose that the expectation, $E(Y_i)$, is given by Equation 3.41:

$$E(Y_i) = \beta_0 x_{0i} + \cdots + \beta_k x_{ki}, \qquad i = 1, \ldots, n, \quad n \geq k + 1 \tag{3.58}$$

where $x_{0i}, \ldots, x_{ki}$ are known values of the independent variables obtained in the experiments along with the values of $Y_i$ (as a rule, $x_{0i} \equiv 1$). Thus, each measurement, $Y_i$, can be written as

$$Y_i = \beta_0 x_{0i} + \cdots + \beta_k x_{ki} + \varepsilon_i, \qquad i = 1, \ldots, n, \quad n \geq k + 1 \tag{3.59}$$

where $\varepsilon_i$ are random uncorrelated errors ($\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$) with $E(\varepsilon_i) = 0$ and $\operatorname{Var}(\varepsilon_i) = \sigma^2$, $i, j = 1, \ldots, n$. These assumptions constitute the general linear model. It should be noted that no assumption has been made about a normal distribution of the random errors. As for the simple linear regression, the least squares estimates of $\beta_0, \ldots, \beta_k$ are the values $\hat{\beta}_0, \ldots, \hat{\beta}_k$ that minimize the sum of squares:

$$SS(\beta_0, \ldots, \beta_k) = \sum_{i=1}^{n} (Y_i - \beta_0 x_{0i} - \cdots - \beta_k x_{ki})^2 \tag{3.60}$$
Under the general linear model, the least squares estimates are unbiased and have the minimum variance among all unbiased estimates that are linear in the dependent variable measurements. This statement is known as the Gauss–Markov theorem.
3.4.1.3 Multiple Linear Regression
The general linear model can be written in a simple form in matrix notation. Let $Y = (Y_1, \ldots, Y_n)'$, $\beta = (\beta_0, \ldots, \beta_k)'$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$, and

$$X = \begin{pmatrix} x_{01} & \cdots & x_{k1} \\ x_{02} & \cdots & x_{k2} \\ \vdots & & \vdots \\ x_{0n} & \cdots & x_{kn} \end{pmatrix} \tag{3.61}$$
where $A'$ denotes the transpose of any vector or matrix $A$. Then Equation 3.59 takes the form:

$$Y = X\beta + \varepsilon \tag{3.62}$$

It can be shown that the vector of estimates $\hat{\beta} = (\hat{\beta}_0, \ldots, \hat{\beta}_k)'$ is given by

$$\hat{\beta} = (X'X)^{-1} X'Y \tag{3.63}$$

The estimates, $\hat{\beta}$, of the coefficients $\beta$ are, for the general case, correlated. The covariance matrix of $\hat{\beta}$ is

$$\operatorname{Cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1} \tag{3.64}$$

The matrix $X$ is sometimes called the design matrix of the experiments. This equation is the basis for the optimal design of experiments because practically all of the optimal experiment design criteria are expressed in terms of a covariance matrix. For example, if the design matrix is orthogonal, so that $X'X$ is the identity matrix, all of the estimates, $\hat{\beta}$, will be uncorrelated random variables with equal variances, $\sigma^2$. Note also that any optimal design of experiments is based on an a priori known form of the model 3.58.
All of these considerations were made within the limits of the general linear model, that is, without an assumption about the normal distribution of the error. If, in addition, the random errors are normally distributed, the least squares estimates will have the smallest variance among all unbiased estimates (including those that are nonlinear functions of $Y_i$). This is the case of multiple linear regression. In this case, it can be shown that the estimates, $\hat{\beta}$, are normally distributed, so different confidence limits can be constructed and some hypotheses can be tested. Most of them are similar to those used in the simple linear regression, but some of them have multivariate peculiarities. For example, the experimenter can test the hypothesis that one of the independent variables, $x_i$, in the model does not have an influence on the dependent variable $Y$; that is, test the hypothesis
  $H_0$: $\beta_i = 0$
  $H_1$: $\beta_i \neq 0$
In many practical situations, the experimenter might also be interested in the order of importance of the independent variables in predicting the dependent variable. For example, the experimenter might want to order the factors (load, temperature, humidity) influencing the reliability (dependent variable) of a given device. Testing the hypothesis mentioned earlier ($\beta_i = 0$) for each independent variable, $x_i$, $i = 1, \ldots, k$, does not reveal this ordering. In these situations, applying the so-called stepwise regression method is useful (Draper and Smith 1981).
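The matrix form of the least squares estimate in Equation 3.63 can be illustrated with a minimal sketch. It solves the normal equations $(X'X)\hat{\beta} = X'Y$ by Gaussian elimination instead of forming the inverse explicitly, which gives the same estimate. The helper names `solve_linear_system` and `least_squares` and the noise-free data are hypothetical, chosen only so that the true coefficients are known.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; rows are copied so the inputs are not modified.
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def least_squares(X, Y):
    """Return beta_hat from the normal equations (X'X) beta = X'Y (Eq. 3.63)."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, Y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical noise-free data generated from Y = 1 + 2*x1 - 3*x2.
# The first column of the design matrix is x0 = 1 (the intercept term).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
Y = [1 + 2 * x1 - 3 * x2 for _, x1, x2 in X]

beta_hat = least_squares(X, Y)
print(beta_hat)
```

Because the data contain no random error, the estimates recover the generating coefficients up to rounding; with noisy data the same call returns the least squares fit.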
3.4.2
Proportional Hazard (PH) and Accelerated Life (AL) Models
A reliability model is usually defined as the relationship between the time-to-failure distribution of a device and stress factors, such as load, cycling rate, temperature, humidity, and voltage. The reliability model can also be considered as a deterministic transformation of the random variable of the time to failure. Two main time transformations exist in life data analysis: the accelerated life (AL) model and the proportional hazard (PH) model.
3.4.2.1 Accelerated Life (AL) Model
Let $F_1(t; z_1)$ and $F_2(t; z_2)$ be time-to-failure cumulative distribution functions (Cdfs) of the device under the constant stress conditions $z_1$ and $z_2$, respectively. Stress condition $z_2$ is more severe than $z_1$ if, for all positive values of $t$,

$$F_2(t; z_2) > F_1(t; z_1) \tag{3.65}$$

This inequality means that a more severe stress condition accelerates the time to failure. Without loss of generality, it may be assumed that $z = 0$ for the normal (use) stress condition. If a failure time Cdf under normal stress conditions is denoted by $F_0(\cdot)$, the AL time transformation is given in terms of $F(t; z)$ and $F_0(\cdot)$ by the following relationship (Cox and Oakes 1984):

$$F(t; z) = F_0[t\,\psi(z, A)] \tag{3.66}$$

where $\psi(z, A)$ is a function connecting time to failure with a vector of stress factors, $z$, and $A$ is a vector of unknown parameters. An increase in $\psi(z, A)$ always corresponds to a decrease in time to failure. For $z = 0$, $\psi(z, A)$ is assumed to be equal to one. The relationship in Equation 3.66 is the scale transformation. It means that a change in stress does not result in a change of the shape of the distribution function, but rather changes its scale only. For the distribution functions $F_1(t; z_1)$ and $F_2(t; z_2)$, if $z_1$ is less severe than $z_2$ and $t_1$ and $t_2$ are the times at which $F_1(t_1; z_1) = F_2(t_2; z_2)$, there exists a function $g$ (for all positive $t_1$ and $t_2$) such that $t_1 = g(t_2)$, so

$$F_2(t_2; z_2) = F_1(g(t_2); z_1) \tag{3.67}$$

Because $F_1(t; z_1) < F_2(t; z_2)$, $g(t)$ must be an increasing function with $g(0) = 0$ and

$$\lim_{t \to \infty} g(t) = \infty \tag{3.68}$$

The function $g(t)$ is called the acceleration or the time transformation function. The assumption in Equation 3.66 that a change of stress condition does not change the shape of the Cdf, but changes its scale only, can be written in terms of the acceleration function as follows:

$$g(t) = \psi(z, A)\, t \tag{3.69}$$
In other words, Equation 3.66 is equivalent to an acceleration function that is linear in time. The relationships for the 100pth percentile of time to failure, $t_p(z)$, and the hazard rate, $\lambda(t; z)$, can be obtained from Equation 3.66 as

$$t_p(z) = t_p^0 / \psi(z, A) \tag{3.70}$$

$$\lambda(t; z) = \psi(z, A)\, \lambda_0[t\,\psi(z, A)] \tag{3.71}$$

where $t_p^0$ and $\lambda_0$ are the 100pth percentile and the hazard rate for the normal stress condition $z = 0$.
3.4.2.2 Proportional Hazard (PH) Model
For the PH model, the basic relationship analogous to Equation 3.66 is given by

$$F(t; z) = 1 - [1 - F_0(t)]^{\psi(z, A)} \tag{3.72}$$

The proper proportional hazard (Cox) model is known as the relationship for hazard rate, which can be obtained from Equation 3.72 as

$$\lambda(t; z) = \psi(z, A)\, \lambda_0(t) \tag{3.73}$$

where $\psi(z, A)$ is usually a log-linear function. The PH model time transformation does not normally retain the shape of the Cdf, and the function $\psi(z)$ no longer has a simple relationship to the acceleration function. Consequently, the PH model is not as popular in reliability applications as the AL model. Nevertheless, it can be shown (Cox and Oakes 1984) that, only for the Weibull distribution, the PH model coincides with the AL model. The AL model time transformation is more popular for reliability applications, and the PH model is widely used in biomedical life data analysis.
3.4.3
Accelerated Life Regression for Constant Stress
Consider the problem of prediction on the basis of AL tests under constant stress conditions. It is assumed that the reliability model, $\eta(z, B)$, for the 100pth percentile, $t_p$, is a given function of the stress factors, $z$, with unknown vector of parameters, $B$:

$$t_p(z, B) = \eta(z, B) \tag{3.74}$$

and the reliability model is related to Equation 3.70 as

$$\eta(z, B) = t_p^0 / \psi(z, A) \tag{3.75}$$
The most commonly used models for the percentiles (including median) are log-linear models. Two such models are the power rule model and the Arrhenius reaction rate model. For the power rule model,

$$t_p(x) = a / x^c, \qquad c > 0, \quad x > 0 \tag{3.76}$$

where $x$ is a mechanical or electrical stress. For the Arrhenius reaction rate model,

$$t_p(T) = a \exp(E_a / T) \tag{3.77}$$

where $T$ is absolute temperature and $E_a$ is activation energy. The model combining these two models is given by

$$t_p(x, T) = a\, x^{-c} \exp(E_a / T) \tag{3.78}$$

where $a$, $E_a$, and $c$ are the parameters to be estimated. The goal is to obtain an estimate of the vector, $B$, of the model 3.74 and to predict the percentile at the normal (or given) stress condition on the basis of AL tests at different stress conditions, $z_1, \ldots, z_k$, where $k$ is greater than the dimension of vector $B$; that is, $k > \dim B$. It is also assumed that
• the TTF distributions at all the stress conditions are increasing failure rate average (IFRA) distributions with continuous density functions $f(t; z)$; and
• the test results are type II censored samples, where the number of uncensored failure times, $r_i$ ($i = 1, \ldots, k$), and the sample sizes, $n_i$, are large enough to estimate $t_p$ as the sample percentile $\hat{t}_p$:

$$\hat{t}_p = \begin{cases} t_{([n_i p])}, & \text{if } n_i p \text{ is not an integer} \\ \text{any value from the interval } [t_{(n_i p)},\, t_{(n_i p + 1)}], & \text{if } n_i p \text{ is an integer} \end{cases} \tag{3.79}$$

where $t_{(r)}$ is the failure time (order statistic); the sample sizes are large enough that the asymptotic normal distribution of this estimate can be used. Based on the preceding assumptions, the model for Equation 3.74 can be written as

$$\hat{t}_p(z, B) = \eta(z, B)\, \varepsilon \tag{3.80}$$

where

$$\varepsilon \sim N\left\{1,\ \frac{p(1 - p)}{n_i\, \eta^2(z, B)\, f^2[\eta(z, B)]}\right\} \tag{3.81}$$

and $\sim N(a, b)$ means "is normally distributed with mean $a$ and variance $b$."
The multiplicative model 3.80 can be transformed by means of a logarithmic transformation to a model with a normally distributed additive error, that is, to the standard normal regression

$$\ln \hat{t}_p(z, B) = \ln \eta(z, B) + \varepsilon_1, \qquad \varepsilon_1 \sim N(0, \sigma^2) \tag{3.82}$$

where $\sigma^2$ is an unknown constant. This transformation is based on the properties of IFRA distributions and a probabilistic transformation: Let $x \sim N(0, \sigma^2)$ and $\sigma \ll 1$; then, the random variable $y = \ln(1 + x)$ is approximately distributed as $x$ (i.e., $y \sim N(0, \sigma^2)$). This results in an additional restriction on the possibility of the transformation of the model 3.80 to the form in Equation 3.82. This is the restriction superimposed on the sample size, $n_i$, as

$$\left[\frac{p}{n_i (1 - p) \ln^2(1 - p)}\right]^{1/2} \ll 1, \qquad i = 1, \ldots, k \tag{3.83}$$
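One way to see where the left-hand side of Equation 3.83 can come from is to evaluate the variance in Equation 3.81 for an exponential time-to-failure distribution. The exponential case is an assumption made here purely for illustration (it is the constant failure rate boundary of the IFRA class); the text itself states only that the restriction follows from the properties of IFRA distributions. With failure rate $\lambda$:

$$\begin{aligned}
F(t) &= 1 - e^{-\lambda t}, \qquad \eta = t_p = -\frac{\ln(1 - p)}{\lambda}, \qquad f(\eta) = \lambda e^{-\lambda \eta} = \lambda (1 - p), \\
\operatorname{Var}(\varepsilon) &= \frac{p(1 - p)}{n_i\, \eta^2 f^2(\eta)} = \frac{p(1 - p)}{n_i \,\dfrac{\ln^2(1 - p)}{\lambda^2}\, \lambda^2 (1 - p)^2} = \frac{p}{n_i (1 - p) \ln^2(1 - p)} .
\end{aligned}$$

Under this assumption, the square root of the last expression is the standard deviation of $\varepsilon$, and requiring it to be much smaller than one is the condition under which $\ln \varepsilon$ is approximately normal with the same standard deviation.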
Thus, the problem of prediction is reduced to the estimation of the parameters of a normal regression, which is then used for point and interval prediction. The standard regression experiment design techniques can be applied to AL test planning using Equation 3.81.
3.4.4
Accelerated Life Regression for Time-Dependent Stress
Accelerated life tests with time-dependent stress, such as step-stress and ramp tests, are of great importance. For example, one of the most common reliability tests of thin silicon dioxide films in metal-oxide semiconductor integrated circuits is the ramp-voltage test. In this test, the oxide film is stressed to breakdown by a voltage that increases linearly with time. Let $z(t)$ be a time-dependent stress vector such that $z(t)$ is integrable. In this case, Equation 3.66 can be written in the form (Cox and Oakes 1984):

$$F(t; z) = F_0[\Psi(t^{(z)})] \tag{3.84}$$

where

$$\Psi(t^{(z)}) = \int_0^{t^{(z)}} \psi[z(s), A]\, ds \tag{3.85}$$

and $t^{(z)}$ is the time connected with a device under the stress condition $z(t)$. The corresponding relationship for the 100pth percentile of time to failure, $t_p[z(t)]$, can be obtained from Equation 3.84 as

$$t_p^0 = \int_0^{t_p[z(t)]} \psi[z(s), A]\, ds \tag{3.86}$$
Using Equations 3.74 and 3.75, the last relationship can be rewritten as

$$1 = \int_0^{t_p[z(t)]} \frac{ds}{t_p^0 \{\psi[z(s), A]\}^{-1}} = \int_0^{t_p[z(t)]} \frac{ds}{\eta[z(s), B]} \tag{3.87}$$

Equation 3.87 is an exact probabilistic form of Miner's rule (Miner 1945), which is widely used in fracture mechanics to account for cumulative damage under different stresses. Thus, the problem of using AL tests with time-dependent stress is identical to the problem of the applicability of Miner's rule. Moreover, there may be a useful analogy between mechanical damage accumulation and electrical breakdown. The time-dependent analog of the model 3.80 is

$$t_p^0 = \int_0^{\hat{t}_p[z(t)]} \psi[z(s), A]\, ds \tag{3.88}$$

where $\hat{t}_p[z(t)]$ is the sample percentile for a device under the stress condition $z(t)$. The problem of estimating vector $A$ and $t_p^0$ (i.e., estimating the reliability model 3.74) in this case cannot be reduced to parameter estimation for a log-linear regression model, as in the previous case of constant stress. Consider $k$ different time-dependent stress conditions, $z_i(t)$, $i = 1, 2, \ldots, k$ ($k > (\dim A) + 1$), where the test results are (as in the previous case) type II censored samples and the number of uncensored failure times and sample sizes are large enough to estimate $t_p$ as the sample percentile $\hat{t}_p$. In this situation, the parameter estimates (of the vector $A$ and $t_p^0$) for the reliability model can be obtained using a least squares method solution of the following system of integral equations:

$$t_p^0 = \int_0^{\hat{t}_p[z_i(t)]} \psi[z_i(s), A]\, ds, \qquad i = 1, 2, \ldots, k \tag{3.89}$$
EXAMPLE 3.6
Assume a model 3.78 for the 10th percentile of time to failure, $t_{0.1}$, of a ceramic capacitor in the form

$$t_{0.1}(U, T) = a\, U^{-c} \exp(E_a / T) \tag{3.90}$$

where $U$ is applied voltage and $T$ is absolute temperature. Consider a time-step-stress AL test plan using step-stress voltage in conjunction with constant temperature as accelerating stress factors. A test sample starts at a specified low voltage, $U_0$, and is tested for a specified time, $\Delta t$. Then the voltage is increased by $\Delta U$, and the sample is tested at $U_0 + \Delta U$ during $\Delta t$, etc.:

$$U(t) = U_0 + \Delta U \cdot \mathrm{En}(t / \Delta t) \tag{3.91}$$

where $\mathrm{En}(x)$ means "nearest integer not greater than $x$." The test is terminated after a fraction $p \geq 0.1$ of the items fail. Thus, the test results are sample percentiles at each voltage–temperature combination. The test plan and results with $\Delta U = 10$ V, $\Delta t = 24$ h are given in Table 3.3.

Table 3.3 Ceramic Capacitors Test Results
Temperature, K    Voltage U0, V    TTF Percentile Estimate, h
398               100              347.9
358               150              1688.5
373               100              989.6
373               63               1078.6

For the example considered, the system of integral Equations 3.89 takes the form

$$a = \int_0^{\hat{t}_{0.1,i}} \exp(-E_a / T_i)\, [U_i(s)]^c\, ds, \qquad i = 1, 2, 3, 4 \tag{3.92}$$

Solving this system for the preceding data yields the following estimates for the parameters of the model in Equation 3.90: $a = 2.227 \times 10^{-8}$ h·V$^{1.885}$; $E_a = 1.321 \times 10^4$ K; $c = 1.885$.
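As a rough consistency check on these estimates, the right-hand side of Equation 3.92 can be evaluated for each row of Table 3.3. Because the staircase voltage of Equation 3.91 is constant on each step, the integral reduces to a finite sum. The sketch below is not the estimation procedure used in the text; it simply plugs the quoted estimates back into the equations, and the function and constant names are hypothetical.

```python
from math import exp

# Estimates quoted in the example (parameters of Eq. 3.90).
A_EST = 2.227e-8      # a
EA_EST = 1.321e4      # E_a, in K
C_EST = 1.885         # c, dimensionless

DELTA_U = 10.0        # voltage step, V
DELTA_T = 24.0        # step duration, h

# Table 3.3: (temperature in K, starting voltage U0 in V, sample 10th percentile in h)
TESTS = [(398.0, 100.0, 347.9), (358.0, 150.0, 1688.5),
         (373.0, 100.0, 989.6), (373.0, 63.0, 1078.6)]

def damage_integral(temp_k, u0, t_end):
    """Right-hand side of Eq. 3.92 for the staircase voltage of Eq. 3.91.

    U(s) is constant on each step, so the integral is an exact finite sum.
    """
    total = 0.0
    step = 0
    t = 0.0
    while t < t_end:
        duration = min(DELTA_T, t_end - t)      # the last step may be partial
        voltage = u0 + DELTA_U * step           # U0 + dU * En(t / dt)
        total += voltage ** C_EST * duration
        t += DELTA_T
        step += 1
    return exp(-EA_EST / temp_k) * total

integrals = [damage_integral(*test) for test in TESTS]
for test, value in zip(TESTS, integrals):
    # A ratio near 1 means the row is consistent with the quoted estimate of a.
    print(test, value, value / A_EST)
```

If the quoted estimates are a least squares solution of the system, each integral should come out close to, but not exactly equal to, the estimate of $a$, since four equations are fitted with three parameters.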
3.5
SUMMARY
Like the previous chapter, this one gives the reader the necessary basic statistical techniques (point and interval estimation, hypothesis testing, basic regression). Simultaneously, it is an introduction to specific reliability techniques (proportional hazard and accelerated life models).
REFERENCES
Beyer, W. 1968. Handbook of tables for probability and statistics. Boca Raton, FL: CRC Press.
Cox, D. R., and D. Oakes. 1984. The analysis of survival data. London: Chapman & Hall.
Dixon, W. J., and F. J. Massey, Jr. 1969. Introduction to statistical analysis, 3rd ed. New York: McGraw-Hill.
Draper, N., and H. Smith. 1981. Applied regression analysis. New York: John Wiley & Sons.
Miner, M. A. 1945. Cumulative damage in fatigue. Journal of Applied Mechanics 12:A159–A164.
CHAPTER 4
Practical Probability Distributions for Product Reliability Analysis
Diganta Das, Michael Pecht
CONTENTS
4.1 Introduction
4.2 Discrete Distributions
4.2.1 Binomial Distribution
4.2.2 Poisson Distribution
4.2.3 Other Discrete Distributions
4.3 Continuous Distributions
4.3.1 Weibull Distribution
4.3.2 Exponential Distribution
4.3.3 The Normal Distribution
4.3.4 The Lognormal Distribution
4.4 Probability Plots
4.1
INTRODUCTION
In reliability engineering, data are often collected from analysis of incoming parts and materials, tests during and after manufacturing, fielded products, warranty returns, and so on. If the collected data can be modeled by a probability distribution, then properties of the distribution can be used to make decisions for product design, manufacture, and reliability assessment. In this chapter, discrete and continuous probability distributions are introduced, along with their key properties. Then, two discrete distributions (binomial and Poisson) and four continuous distributions (Weibull, exponential, normal, and lognormal) commonly used in reliability modeling and hazard rate assessments are presented.
4.2
DISCRETE DISTRIBUTIONS
A discrete random variable ($x$) is a quantity that can be equal to any one of a number of discrete values ($x_0, x_1, x_2, \ldots, x_n$). There is a probability, $f(x_i)$, that $x = x_i$:

$$f(x_i) = P\{x = x_i\} \tag{4.1}$$

In Equation 4.1, $f(x_i)$ is called the probability mass function (PMF)* and the cumulative distribution function is written as

$$F(x_i) = P\{x \leq x_i\} \tag{4.2}$$

The mean $\mu$ and variance $\sigma^2$ of a discrete random variable are defined in terms of the probability mass function as

$$\mu = \sum_i x_i f(x_i) \tag{4.3}$$

$$\sigma^2 = \sum_i (x_i - \mu)^2 f(x_i) = \sum_i x_i^2 f(x_i) - \mu^2 \tag{4.4}$$
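Equations 4.2 through 4.4 can be checked on a small case. The sketch below uses the PMF of a fair six-sided die, which is an illustrative choice rather than an example from the text, and exact fractions so that the two forms of the variance in Equation 4.4 can be compared without rounding.

```python
from fractions import Fraction

# PMF of a fair six-sided die (an illustrative choice, not from the text).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * f for x, f in pmf.items())                        # Eq. 4.3
var_central = sum((x - mean) ** 2 * f for x, f in pmf.items())   # Eq. 4.4, first form
var_raw = sum(x ** 2 * f for x, f in pmf.items()) - mean ** 2    # Eq. 4.4, second form

# CDF F(x_i) = P{x <= x_i} (Eq. 4.2), accumulated over the ordered values.
cdf = {}
running = Fraction(0)
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

print(mean, var_central, var_raw, cdf[6])
```

Both variance expressions give the same value, and the CDF reaches one at the largest value, as required of any PMF.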
The binomial and Poisson distributions are distributions of interest to reliability engineers. These distributions are useful in developing product sampling and acceptance plans. They are also useful in assessing product reliability based on the reliability of the parts (materials) that comprise the product.
4.2.1 Binomial Distribution
The binomial distribution is a discrete probability distribution applicable in situations that have only two mutually exclusive outcomes for each trial or test. For example, for a roll of a die, the probability is one-sixth that a specified number will occur (success) and five-sixths that it will not occur (failure). This example, known as a "Bernoulli trial," is a random experiment with only two possible outcomes, denoted by "success" or "failure." Of course, what counts as success or failure is defined by the experiment. In some experiments, the probability of the result not being a certain number may be defined as a success. The probability mass function $f(k)$ for the binomial distribution gives the probability of exactly $k$ successes in $m$ attempts:

$$f(k) = \binom{m}{k} p^k q^{m-k}, \qquad 0 \leq p \leq 1, \quad q = 1 - p, \quad k = 0, 1, 2, \ldots, m \tag{4.5}$$
* Discrete probability functions are referred to as probability mass functions and continuous probability functions are referred to as probability density functions. When referring to probability functions in generic terms, the term “probability density function” is used to mean both discrete and continuous probability functions.
where $p$ is the probability of the defined success; $q$ (or $1 - p$) is the probability of failure; $m$ is the number of independent trials; $k$ is the number of successes in $m$ trials; and the combinational formula is defined by

$$\binom{m}{k} = \frac{m!}{k!\,(m - k)!} \equiv C_k^m \tag{4.6}$$

where ! is the symbol for factorial. Because $(p + q)$ equals 1, raising both sides to a power $j$ gives

$$(p + q)^j = 1 \tag{4.7}$$

The binomial expansion of the left-hand side term in Equation 4.7 gives the probabilities of each possible number of successes in $j$ trials, as represented by the binomial distribution. For example, for three components or trials, each with equal probabilities of success ($p$) or failure ($q$), Equation 4.7 becomes

$$(p + q)^3 = p^3 + 3p^2 q + 3pq^2 + q^3 = 1 \tag{4.8}$$

which is based on the general equation:

$$\sum_{k=0}^{m} f(k) = F(m) = P\{k \leq m\} = (p + q)^m \tag{4.9}$$

The four terms in the expansion of $(p + q)^3$ give the values of the probabilities for getting three, two, one, and no successes, respectively. That is, for $m = 3$ and probability of success $p$, $f(3) = p^3$, $f(2) = 3p^2 q$, $f(1) = 3pq^2$, and $f(0) = q^3$. The binomial expansion is also useful when there are products with different success and failure probabilities. The formula for the binomial expansion in this case is

$$\prod_{i=1}^{m} (p_i + q_i) = 1 \tag{4.10}$$

where $i$ pertains to the $i$th component in a system consisting of $m$ components. For a system of three different components, the expansion takes the following form:

$$(p_1 + q_1)(p_2 + q_2)(p_3 + q_3) = p_1 p_2 p_3 + (p_1 p_2 q_3 + p_1 q_2 p_3 + q_1 p_2 p_3) + (p_1 q_2 q_3 + q_1 p_2 q_3 + q_1 q_2 p_3) + q_1 q_2 q_3 = 1 \tag{4.11}$$

where the first term on the right-hand side of the equation gives the probability of success of all three components; the second term (in parentheses) gives the probability of success of any two components; the third term (in parentheses) gives the probability of success of any one component; and the last term gives the probability of failure for all components.
The cumulative distribution function for a binomial distribution, $F(k)$, gives the probability of $k$ or fewer successes in $m$ trials. It is defined in terms of the discrete PMF by

$$F(k) = \sum_{i=0}^{k} f(i) \tag{4.12}$$

or by using the PMF for the binomial distribution:

$$F(k) = \sum_{i=0}^{k} \binom{m}{i} p^i q^{(m-i)} \tag{4.13}$$

For a binomial distribution, the mean, $\mu$, is given by

$$\mu = mp \tag{4.14}$$

and the variance is given by

$$\sigma^2 = mp(1 - p) \tag{4.15}$$
EXAMPLE 4.1
An engineer wants to select four capacitors from a large lot of capacitors in which 10% are defective. What is the probability of selecting four capacitors with
(a) zero defective capacitors;
(b) exactly one defective capacitor;
(c) exactly two defective ones; and
(d) two or fewer defective ones?
Solution: Here, success will be defined as "getting a certain number of good capacitors." Therefore, $p = 0.9$, $q = 0.1$, and $m = 4$. Using Equations 4.5 and 4.6, $f(4)$ is the probability of all four being good (i.e., no defectives). That is, there are four components (trials), each with the same $p$ and $q$:

$$f(4) = \binom{4}{4} (0.9)^4 (0.1)^0 = 0.6561$$

Another way to solve this problem is by defining success as "getting a certain number of defective capacitors" with $p = 0.1$ and thus $q = 0.9$. In this case, $f(0)$ gives the probability that there will be no defectives in the four selected samples. That is,

(a) $f(0) = \binom{4}{0} (0.1)^0 (0.9)^4 = 0.6561$

Continuing with the latter approach, the solutions to problems b, c, and d are

(b) $f(1) = \binom{4}{1} (0.1)^1 (0.9)^3 = 0.2916$
(c) $f(2) = \binom{4}{2} (0.1)^2 (0.9)^2 = 0.0486$
(d) $F(2) = f(0) + f(1) + f(2) = 0.9963$
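The four answers of Example 4.1 can be reproduced with a direct implementation of Equations 4.5 and 4.13. This is a minimal sketch; the function names `binomial_pmf` and `binomial_cdf` are chosen here for illustration.

```python
from math import comb

def binomial_pmf(k, m, p):
    """Probability of exactly k successes in m independent trials (Eq. 4.5)."""
    return comb(m, k) * p ** k * (1 - p) ** (m - k)

def binomial_cdf(k, m, p):
    """Probability of k or fewer successes in m trials (Eq. 4.13)."""
    return sum(binomial_pmf(i, m, p) for i in range(k + 1))

# Example 4.1 with "success" defined as drawing a defective capacitor.
m, p = 4, 0.1
answers = {
    "a": binomial_pmf(0, m, p),   # zero defective
    "b": binomial_pmf(1, m, p),   # exactly one defective
    "c": binomial_pmf(2, m, p),   # exactly two defective
    "d": binomial_cdf(2, m, p),   # two or fewer defective
}
for part, value in answers.items():
    print(part, round(value, 4))
```

Swapping the definition of success (good instead of defective) only swaps `p` and `1 - p` and relabels `k` as `m - k`, so `binomial_pmf(4, 4, 0.9)` gives the same value as part (a).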
EXAMPLE 4.2
Consider a product with a probability of failure in a given test of 0.1. If 10 of these products are tested,
(a) What is the expected number of failures that will occur in the test?
(b) What is the variance in number of failures?
(c) What is the probability that no product will fail?
(d) What is the probability that two or more products will fail?
Solution: Here, $m = 10$ and $p = 0.1$.
(a) The expected number of failures is the mean, $\mu = mp = (10 \times 0.1) = 1$.
(b) The variance is $\sigma^2 = mp(1 - p) = [10 \times 0.1 \times (1 - 0.1)] = 0.9$.
(c) The probability of having no failures is the PMF with $k = 0$. That is,

$$f(0) = \binom{10}{0} \times 0.1^0 \times (1 - 0.1)^{10} = 0.349$$

(d) The probability of having two or more failures is the same as one minus the probability of having zero or one failure. It is given by

$$\Pr(\text{two or more failures}) = 1 - \{f(0) + f(1)\} = 1 - [0.349 + \{10 \times 0.1 \times (1 - 0.1)^9\}] = 0.264$$
EXAMPLE 4.3
An electronic automotive control module consists of three identical microprocessors in parallel. The microprocessors are independent of each other and fail independently. For successful operation of the module, it is required that at least two microprocessors operate normally. The probability of success of each microprocessor for the duration of the warranty is 0.95. Determine the failure probability of the control module during warranty.
Solution: The module fails when two or more microprocessors fail. In other words, the module fails when only one or none of the microprocessors is working. Thus, the probability of failure of the module during warranty will be given by

$$\Pr(\text{module fails during warranty}) = f(0) + f(1)$$

where $m = 3$ components, $k = 0$ or 1 is the total number of working components, $p = 0.95$, and $q = 0.05$. Therefore:

$$\Pr(\text{module fails during warranty}) = (0.05)^3 + \{3 \times 0.95 \times (0.05)^2\} = 0.00725$$

The possible states of the module are:

PMF term    Microprocessors         Module state
f(0)        0 work, 3 fail          failure
f(1)        1 works, 2 fail         failure
f(2)        2 work, 1 fails         success
f(3)        3 work, 0 fail          success
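Example 4.3 is an instance of a "k-out-of-n" arrangement (at least two of three units must work), and its failure probability can be written as a partial binomial sum. The sketch below assumes independent, identical units as stated in the example; the function name is hypothetical.

```python
from math import comb

def k_out_of_n_failure_probability(n, k_required, p_success):
    """Probability that fewer than k_required of n independent, identical units work."""
    q = 1 - p_success
    return sum(comb(n, k) * p_success ** k * q ** (n - k) for k in range(k_required))

# Example 4.3: three microprocessors, at least two must work, p = 0.95 each.
module_failure = k_out_of_n_failure_probability(3, 2, 0.95)
print(module_failure)
```

The sum runs over the failing states $k = 0$ and $k = 1$ only, which are the terms $f(0)$ and $f(1)$ of the example.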
Binomial Distribution Function in Excel
BINOMDIST (number_s, trials, probability_s, cumulative) returns the binomial probability distribution $f(k)$ or $F(k)$, where
Number_s is the number of successes in trials ($k$).
Trials is the number of independent trials ($m$).
Probability_s is the probability of success for each trial ($p$).
Cumulative is a logical value that determines the form of the function [CDF (TRUE) or PMF (FALSE)].
CRITBINOM (trials, probability_s, alpha) returns the smallest value of $k$ for which the cumulative binomial distribution is greater than or equal to a criterion value.
Trials is the number of Bernoulli trials ($m$).
Probability_s is the probability of success on each trial ($p$).
Alpha is the criterion value. The user selects this based on the problem in hand.
4.2.2 Poisson Distribution
In situations where the probability of success ($p$) is very low and the number ($m$) of samples tested (i.e., the number of Bernoulli trials conducted) is large, it is cumbersome to evaluate the binomial coefficients. A Poisson distribution is useful in such cases. The PMF of the Poisson distribution is independent of the number of trials, and is written as
$$f(k) = \frac{\mu^{k} e^{-\mu}}{k!}; \quad k = 0, 1, 2, \ldots \tag{4.16}$$
where $\mu$ is the mean and also the variance. For a Poisson distribution for $m$ Bernoulli trials with probability of success in each trial equal to $p$, the mean and the variance are given by
$$\mu = mp \tag{4.17}$$
$$\sigma^2 = mp \tag{4.18}$$
The Poisson distribution is widely used in industrial and quality engineering applications. It is also the foundation of control charts. Typical applications include counting particles of contamination in a manufacturing environment, power outages, and flaws in rolls of polymers.
EXAMPLE 4.4 Solve Example 4.2 using the Poisson distribution approximation.
Solution: The expected number of failures is the same as the mean, $\mu = (10)(0.1) = 1$. The variance is also equal to one.
The probability of obtaining no failures is the same as the PMF with $k = 0$:
$$f(0) = e^{-\mu} = e^{-1} = 0.3679$$
The probability of getting two or more failures is the same as one minus the probability of obtaining zero or one failure. It is given by
$$\Pr(\text{two or more failures}) = 1 - \{f(0) + f(1)\} = 1 - \{0.3679 + e^{-1}\} = 0.2642$$
Note the differences compared to Example 4.2.
Poisson Distribution Functions in Excel
POISSON (x, mean, cumulative)
x is the number of events.
Mean is the expected numeric value.
Cumulative is a logical value that determines the form of the probability distribution returned: TRUE returns the CDF; FALSE returns the PMF.
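To see how close the Poisson approximation comes to the exact binomial answer, the two can be evaluated side by side. The sketch below is illustrative only; the function names are assumptions made for this example, and the inputs are the $m = 10$, $p = 0.1$ values of Example 4.2.

```python
from math import comb, exp, factorial


def poisson_pmf(k: int, mu: float) -> float:
    """Poisson PMF of Equation 4.16 with mean mu."""
    return mu**k * exp(-mu) / factorial(k)


def binom_pmf(k: int, m: int, p: float) -> float:
    """Exact binomial PMF, used here only for comparison."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)


m, p = 10, 0.1
mu = m * p  # Equation 4.17

poisson_none = poisson_pmf(0, mu)
poisson_two_or_more = 1 - (poisson_pmf(0, mu) + poisson_pmf(1, mu))

binom_none = binom_pmf(0, m, p)
binom_two_or_more = 1 - (binom_pmf(0, m, p) + binom_pmf(1, m, p))

print(f"no failures:  Poisson {poisson_none:.4f}  binomial {binom_none:.4f}")
print(f"two or more:  Poisson {poisson_two_or_more:.4f}  binomial {binom_two_or_more:.4f}")
```

With only 10 trials the approximation is rough for the zero-failure case; it improves as $m$ grows and $p$ shrinks with $mp$ held fixed.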
4.2.3 Other Discrete Distributions
Other discrete distributions used in reliability analysis include the geometric distribution, the negative binomial distribution, and the hypergeometric distribution. These distributions can usually be modeled as special or limiting cases of the binomial distribution.
With the geometric distribution, the Bernoulli trials are conducted until the first success is obtained. The geometric distribution has the “lack of memory” property, implying that the count of the number of trials can be started at any trial without affecting the underlying distribution. In this regard, this distribution has some similarity to the continuous exponential distribution, which will be described later.
With the negative binomial distribution (a generalization of the geometric distribution), the Bernoulli trials are conducted until a certain number of successes are obtained. It is conceptually different from the binomial distribution because the number of successes is predetermined and the number of trials is random.
With the hypergeometric distribution, testing is conducted without replacement in samples containing more than one kind of product or defect. The hypergeometric distribution differs from the binomial distribution in that the population is finite and the sampling from the population is made without replacement.
4.3 CONTINUOUS DISTRIBUTIONS
If the range of a random variable $x$ extends over an interval (either finite or infinite) of real numbers, then $x$ is a continuous random variable. The cumulative distribution function is given by
$$F(x_i) = P\{x \le x_i\} \tag{4.19}$$
The probability density function (PDF, analogous to the PMF for discrete variables) is $f(x)$, given by
$$f(x) = \frac{d}{dx} F(x) \tag{4.20}$$
which yields
$$F(x) = \int_{-\infty}^{x} f(x)\,dx \tag{4.21}$$
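The mean $\mu$ and variance $\sigma^2$ of a continuous random variable are defined over the interval from $-\infty$ to $\infty$ in terms of the probability density function as
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \tag{4.22}$$
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2 \tag{4.23}$$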
If $f(x)$ is the failure probability density function (see Equation 2.4), then $F(x)$ can be considered to be the unreliability function $Q(x)$ when the random variable $x$ denotes time $t \ge 0$. Thus, Equation 4.21 becomes equivalent to Equation 4.5 and Equation 4.22 becomes equivalent to Equation 4.17.
EXAMPLE 4.5 The PDF for the failure of an appliance as a function of time to failure is given by
$$f(t) = \frac{1}{16}\, t\, e^{-t/4}$$
where $t$ is in years, and $t \ge 0$.
(a) What is the probability of failure in the first year?
(b) What is the probability of the appliance lasting at least 5 years?
(c) If no more than 5% of the appliances are to require warranty service, what is the maximum number of months for which the appliance should be warranted?
Solution: For the given PDF, the CDF is
$$F(t) = \int_{0}^{t} \frac{1}{16}\, \tau\, e^{-\tau/4}\,d\tau = 1 - \left(\frac{t}{4} + 1\right) e^{-t/4}$$
(a) The probability of failure during the first year is $F(1) = 0.0265$.
(b) The probability of lasting more than 5 years is $1 - F(5) = 1 - 0.3554 = 0.6446$.
(c) For this case, $F(t_0)$ has to be no more than 0.05, where $t_0$ is the warranty period. From the preceding results, $F(1) < 0.05$, so the warranty period can be longer than 1 year. Also, $F(2)$ is equal to 0.09; hence, the warranty period should be between 1 and 2 years. Using trial and error, we find that for no more than 5% warranty service, $t_0 = 1.42$ years. Therefore, the warranty should be set at no greater than 17 months.
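The trial-and-error step in part (c) of Example 4.5 can be replaced by a simple root search. The sketch below assumes the closed-form CDF $F(t) = 1 - (t/4 + 1)e^{-t/4}$ from the example; the bisection helper and its bracket of 0 to 50 years are illustrative choices, not part of the original solution.

```python
from math import exp


def cdf(t: float) -> float:
    """CDF of Example 4.5, with t in years."""
    return 1.0 - (t / 4.0 + 1.0) * exp(-t / 4.0)


def time_for_unreliability(target: float, lo: float = 0.0, hi: float = 50.0,
                           tol: float = 1e-9) -> float:
    """Bisection search for t such that cdf(t) == target (cdf is increasing)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


p_fail_first_year = cdf(1.0)          # part (a)
p_survive_5_years = 1.0 - cdf(5.0)    # part (b)
t0_years = time_for_unreliability(0.05)  # part (c)
warranty_months = int(t0_years * 12)  # whole months that keep F(t0) <= 5%

print(round(p_fail_first_year, 4), round(p_survive_5_years, 4))
print(round(t0_years, 2), warranty_months)
```

Rounding the warranty down to whole months keeps the expected fraction of warranty claims at or below the 5% limit.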
4.3.1 Weibull Distribution
The Weibull distribution is a continuous distribution developed in 1939 by Waloddi Weibull, who was also credited with inventing ball bearings and the electric hammer. The Weibull distribution is widely used for reliability analyses because it can model a wide diversity of hazard rate curves. The distribution can also approximate other distributions under special or limiting conditions. The Weibull distribution has been applied to life distributions for many engineered products and has also been used for material strength and warranty analysis. The probability density function for a three-parameter Weibull distribution is
$$f(t) = \frac{\beta}{\eta^{\beta}} (t - \gamma)^{\beta - 1} e^{-\left(\frac{t - \gamma}{\eta}\right)^{\beta}} \tag{4.24}$$
where $\beta > 0$ is the shape parameter, $\eta > 0$ is the scale parameter, and $\gamma$ is the location, or time delay, parameter. The reliability function is given by
$$R(t) = \int_{t}^{\infty} f(\tau)\,d\tau = e^{-\left(\frac{t - \gamma}{\eta}\right)^{\beta}} \tag{4.25}$$
It can be shown that, for a duration $t = \gamma + \eta$, starting at time $t = 0$, the reliability $R(t) = e^{-1} = 36.8\%$ regardless of the value of $\beta$. Thus, for any Weibull failure probability density function, 36.8% of the products survive for $t = \gamma + \eta$. The time to “failure” of a product with a specified reliability, $R$, is given by
$$t = \gamma + \eta(-\ln R)^{1/\beta} \tag{4.26}$$
The hazard rate function for the Weibull distribution is given by
$$h(t) = \frac{f(t)}{R(t)} = \frac{\beta}{\eta} \left(\frac{t - \gamma}{\eta}\right)^{\beta - 1} \tag{4.27}$$
The conditional reliability function is
$$R(t, T) = \frac{R(t + T)}{R(T)} = \exp\left\{-\left[\left(\frac{t + T - \gamma}{\eta}\right)^{\beta} - \left(\frac{T - \gamma}{\eta}\right)^{\beta}\right]\right\} \tag{4.28}$$
Equation 4.28 gives the reliability for a new mission of duration $t$ for which $T$ hours of operation were previously accumulated up to the beginning of this new mission. It is seen that the Weibull conditional reliability generally depends on both the age at the beginning of the mission and the mission duration (unless $\beta = 1$). In fact, this is true for most distributions, except for the exponential distribution. Table 4.1 lists the key parameters for a Weibull distribution. The function $\Gamma$ is the gamma function, for which the values are available from statistical tables.
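The Weibull relations in Equations 4.25 through 4.28 translate directly into code. The following Python sketch is one possible arrangement; the function names and the illustrative parameter values ($\beta = 2$, $\eta = 1000$ hours, $\gamma = 0$) are assumptions made for this example. The expressions are valid for $t > \gamma$.

```python
from math import exp, log


def weibull_reliability(t, beta, eta, gamma=0.0):
    """R(t) of Equation 4.25."""
    return exp(-(((t - gamma) / eta) ** beta))


def weibull_hazard(t, beta, eta, gamma=0.0):
    """h(t) of Equation 4.27."""
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1)


def weibull_conditional_reliability(t, T, beta, eta, gamma=0.0):
    """R(t, T) of Equation 4.28: survive a new mission t after age T."""
    return (weibull_reliability(t + T, beta, eta, gamma)
            / weibull_reliability(T, beta, eta, gamma))


def weibull_time_for_reliability(R, beta, eta, gamma=0.0):
    """Time at which reliability has dropped to R (Equation 4.26)."""
    return gamma + eta * (-log(R)) ** (1.0 / beta)


beta, eta = 2.0, 1000.0  # illustrative values only

# At t = gamma + eta the reliability is exp(-1), whatever beta is.
print(round(weibull_reliability(eta, beta, eta), 4))

# With beta > 1 an aged unit is less reliable over the same mission.
print(weibull_conditional_reliability(100, 0, beta, eta) >
      weibull_conditional_reliability(100, 500, beta, eta))

# With beta = 1 the age T has no effect (exponential case).
print(abs(weibull_conditional_reliability(100, 0, 1.0, eta)
          - weibull_conditional_reliability(100, 500, 1.0, eta)) < 1e-12)
```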
Table 4.1 Weibull Distribution Parameters
Location parameter: $\gamma$
Shape parameter: $\beta$
Scale parameter: $\eta$
Mean (arithmetic average): $\gamma + \eta\,\Gamma(1/\beta + 1)$
Median ($t_{50}$, or time at 50% failure): $\gamma + \eta(\ln 2)^{1/\beta}$
Mode (highest value of $f(t)$): $\gamma + \eta(1 - 1/\beta)^{1/\beta}$
Standard deviation: $\eta\sqrt{\Gamma(2/\beta + 1) - \Gamma^{2}(1/\beta + 1)}$
Figure 4.1 Effects of shape parameter $\beta$ on probability density function, where $\eta = 1$ and $\gamma = 0$. (Probability density function $f(t)$ versus operating time; curves for $\beta = 1$, $2$, and $3$.)
Figure 4.2 Dependence of hazard rate on shape parameter $\beta$, where $\eta = 1$ and $\gamma = 0$. (Hazard rate versus operating time; curves for $\beta = 0.5$, $1$, and $3$.)
Figure 4.3 Effects of scale parameter $\eta$ on the probability density function of a Weibull distribution, where $\beta = 2$ and $\gamma = 0$. (Probability density function $f(t)$ versus operating time; curves for $\eta = 1$, $2$, and $3$.)
Figure 4.4 Effects of location parameter $\gamma$, where $\beta = 2$ and $\eta = 1$. (Probability density function $f(t)$ versus operating time; curves for $\gamma = 0$, $\gamma > 0$, and $\gamma < 0$.)
The shape parameter of a Weibull distribution determines the shape of the hazard rate function. With $0 < \beta < 1$, the hazard rate decreases as a function of time and can represent early life (i.e., infant mortality) failures. A $\beta = 1$ indicates that the hazard rate is constant and is representative of the “useful life” period in the “idealized” bathtub curve. A $\beta > 1$ indicates that the hazard rate is increasing and can represent wearout failures. Figure 4.1 shows the effects of $\beta$ on the probability density function curve with $\eta = 1$ and $\gamma = 0$. Figure 4.2 shows the effects of $\beta$ on the hazard rate curve with $\eta = 1$ and $\gamma = 0$.
The scale parameter $\eta$ has the effect of scaling the time axis. Thus, for $\gamma$ and $\beta$ fixed, an increase in $\eta$ will stretch the distribution to the right while maintaining its starting location and shape (although there will be a decrease in the amplitude because the total area under the probability density function curve must be equal to unity). Figure 4.3 shows the effects of $\eta$ on the probability density function for $\beta = 2$ and $\gamma = 0$.
The location parameter estimates the earliest time to failure and locates the distribution along the time axis. For $\gamma = 0$, the distribution starts at $t = 0$. A $\gamma > 0$ implies that the product has a failure-free operating period equal to $\gamma$. Figure 4.4 shows the effects of $\gamma$ on the probability density function curve for $\beta = 2$ and $\eta = 1$. Note that if $\gamma$ is positive, the distribution starts to the right of the $t = 0$ line, or the origin. If $\gamma$ is negative, the distribution starts to the left of the origin and could imply failures had
occurred prior to the time $t = 0$, such as during transportation or storage. The Weibull distribution can also be formulated as a two-parameter distribution with $\gamma = 0$.
Weibull Distribution Functions in Excel
WEIBULL (x, beta, eta, cumulative) returns the two-parameter Weibull distribution.
x is the value at which to evaluate the function, a non-negative number.
Beta is the shape parameter of the distribution, a positive number.
Eta is the scale parameter of the distribution, a positive number.
Cumulative is a logical value: TRUE returns the cumulative distribution function; FALSE returns the probability density function.
4.3.2 Exponential Distribution
The exponential distribution is a single-parameter distribution that is simple and easy to use. This distribution can be viewed as a special case of a Weibull distribution, where $\beta = 1$. The exponential distribution models the time between independent events that occur at a constant rate. The probability density function has the form
$$f(t) = \lambda_0 e^{-\lambda_0 t}, \quad t \ge 0 \tag{4.29}$$
where $\lambda_0$ is a positive real number, often called the constant failure rate. Table 4.2 summarizes the key parameters for the exponential distribution. The parameter $\lambda_0$ is typically an unknown that must be calculated or estimated. Once $\lambda_0$ is known, reliability can be determined from the probability density function as
$$R(t) = \int_{t}^{\infty} f(\tau)\,d\tau = \int_{t}^{\infty} \lambda_0 e^{-\lambda_0 \tau}\,d\tau = e^{-\lambda_0 t} \tag{4.30}$$
The cumulative distribution function, or unreliability, is given by
$$Q(t) = 1 - \exp[-\lambda_0 t] \tag{4.31}$$
As mentioned, the hazard rate is constant:
$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda_0 e^{-\lambda_0 t}}{e^{-\lambda_0 t}} = \lambda_0 \tag{4.32}$$
Table 4.2 Exponential Distribution Parameters
Scale parameter: $1/\lambda_0$
Median ($t_{50}$, or time at 50% failure): $0.693/\lambda_0$
Mode (highest value of $f(t)$): $0$
Standard deviation: $1/\lambda_0$
Mean: $1/\lambda_0$
The conditional reliability is
$$R(t, T) = \frac{R(t + T)}{R(T)} = \frac{e^{-\lambda_0 (t + T)}}{e^{-\lambda_0 T}} = e^{-\lambda_0 t} \tag{4.33}$$
Equation 4.33 shows that previous trials (e.g., tests or missions) do not affect future reliability. This “as good as new” result stems from the fact that the hazard rate is a constant and the probability of a product failing is independent of the past history or use of the product. The mean time to failure (MTTF) for an exponential distribution is determined from the general equation for the mean of a continuous distribution:
$$\text{MTTF} = \int_{0}^{\infty} R(t)\,dt = \int_{0}^{\infty} e^{-\lambda_0 t}\,dt = \frac{1}{\lambda_0} \tag{4.34}$$
Because the hazard rate is a constant for an exponential distribution, the mean time to failure is also a mean time between failures (MTBF). The MTBF is inversely proportional to the constant failure rate, and thus the reliability can be expressed as
$$R(t) = e^{-t/\text{MTBF}} \tag{4.35}$$
MTBF is sometimes misunderstood to be the life of the product or the time by which 50% of products will fail. For a mission time of $t = \text{MTBF}$, the reliability calculated from Equation 4.35 gives $R(\text{MTBF}) = e^{-1} = 0.368$. Thus, only 36.8% of the products survive a mission time equal to the MTBF. For reliability tests in which the hazard rate is assumed to be constant, we divide the total number of accumulated device-hours by the total number of relevant failures to obtain a point estimate of the MTBF:
$$\text{MTBF} = t_a / r \tag{4.36}$$
where $t_a$ is the total number of device-hours and $r$ is the number of relevant failures.
EXAMPLE 4.6 Show that the exponential distribution is a special case of the Weibull distribution.
Solution: From Equation 4.24, set $\beta = 1$ and $\gamma = 0$:
$$f(t) = \frac{1}{\eta} e^{-t/\eta}$$
Thus, in this case, the Weibull distribution reduces to the single-parameter exponential distribution with $\lambda_0 = 1/\eta$. The reliability and the hazard rate functions simplify to
$$R(t) = e^{-t/\eta}, \qquad h(t) = \frac{1}{\eta}$$
where $\eta = 1/\lambda_0$. If $\beta = 1$ and $\gamma > 0$, then the Weibull distribution is similar to the exponential distribution but with a delay term that can be considered to be a period within which no failures occur.
EXAMPLE 4.7 Assume that the time to failure of a product can be described by the Weibull distribution with estimated parameter values of $\eta = 1{,}000$ hours, $\gamma = 0$, and $\beta = 2$. Estimate the reliability of the product after 100 hours of operation. Also, determine the MTTF.
Solution: From Equation 4.25 and Table 4.1,
$$R(100) = e^{-(100/1000)^{2}} = 0.990$$
and
$$\text{MTTF} = 1000\,\Gamma(1/2 + 1) = 1000\,\Gamma(1.5) = 886 \text{ hours}$$
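The gamma-function entries of Table 4.1 need not be looked up in statistical tables; Python's standard library provides `math.gamma`. The sketch below evaluates the Table 4.1 expressions for the parameters of Example 4.7. The function name `weibull_summary` is hypothetical, and returning the location parameter as the mode when $\beta \le 1$ is an assumption added here, because the Table 4.1 mode expression applies only for $\beta > 1$.

```python
from math import exp, gamma as gamma_fn, log, sqrt


def weibull_summary(beta, eta, loc=0.0):
    """Evaluate the Table 4.1 expressions for a Weibull distribution."""
    mean = loc + eta * gamma_fn(1.0 / beta + 1.0)
    median = loc + eta * log(2.0) ** (1.0 / beta)
    # Assumption: for beta <= 1 the density peaks at the location parameter.
    mode = loc + eta * (1.0 - 1.0 / beta) ** (1.0 / beta) if beta > 1 else loc
    std = eta * sqrt(gamma_fn(2.0 / beta + 1.0) - gamma_fn(1.0 / beta + 1.0) ** 2)
    return {"mean": mean, "median": median, "mode": mode, "std": std}


# Example 4.7: eta = 1,000 hours, gamma = 0, beta = 2
stats = weibull_summary(beta=2.0, eta=1000.0)
reliability_100h = exp(-((100.0 / 1000.0) ** 2.0))

print(round(reliability_100h, 3))
for name, value in stats.items():
    print(f"{name}: {value:.1f} hours")
```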
EXAMPLE 4.8 Estimate the MTBF for the following reliability test situations:
(a) Failure terminated, with no replacement: 12 items were tested until the fourth failure occurred, with failures at 200, 500, 625, and 800 hours.
(b) Time terminated, with no replacement: 12 items were tested up to 1,000 hours with four failures at 200, 500, 625, and 800 hours.
(c) Failure terminated, with replacement: eight items were tested until the third failure occurred, with failures at 150, 400, and 650 hours.
(d) Time terminated, with replacement: eight items were tested up to 1,000 hours with failures at 150, 400, and 650 hours.
(e) Mixed replacement/nonreplacement: six items were tested through 1,000 hours. The first failure occurred at 300 hours, and its replacement failed after an additional 400 hours. The second failure occurred at 350 hours, and its replacement failed after an additional 500 hours. The third failure occurred at 600 hours, and its replacement did not fail up to the completion of the test.
Solution:
(a) MTBF = (200 + 500 + 625 + 800 + 8(800))/4 = 2,131 hours
(b) MTBF = (200 + 500 + 625 + 800 + 8(1000))/4 = 2,531 hours
(c) MTBF = (8)(650)/3 = 1,733 hours
(d) MTBF = (8)(1000)/3 = 2,667 hours
(e) MTBF = (700 + 850 + 1000 + (3)(1000))/5 = 1,110 hours
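The point estimate of Equation 4.36 is simply accumulated device-hours divided by relevant failures, so the five cases of Example 4.8 differ only in how the device-hours are tallied. The following sketch makes that bookkeeping explicit; the helper name `mtbf_estimate` and the way each test position's hours are listed are illustrative choices.

```python
def mtbf_estimate(device_hours, failures):
    """Point estimate MTBF = t_a / r (Equation 4.36)."""
    return sum(device_hours) / failures


failure_times = [200, 500, 625, 800]

# (a) Failure terminated, no replacement: 8 survivors stop at the 4th failure (800 h).
mtbf_a = mtbf_estimate(failure_times + [800] * 8, 4)

# (b) Time terminated, no replacement: 8 survivors run the full 1,000 h.
mtbf_b = mtbf_estimate(failure_times + [1000] * 8, 4)

# (c) Failure terminated, with replacement: all 8 positions run until 650 h.
mtbf_c = mtbf_estimate([650] * 8, 3)

# (d) Time terminated, with replacement: all 8 positions run for 1,000 h.
mtbf_d = mtbf_estimate([1000] * 8, 3)

# (e) Mixed: two positions stop at their second failure (700 h and 850 h),
#     one replaced position and three unfailed positions run to 1,000 h.
mtbf_e = mtbf_estimate([700, 850, 1000] + [1000] * 3, 5)

for label, value in zip("abcde", (mtbf_a, mtbf_b, mtbf_c, mtbf_d, mtbf_e)):
    print(f"({label}) {value:,.0f} hours")
```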
EXAMPLE 4.9 Consider an electronic product where only “chance failures” occur; that is, the products exhibit a constant hazard rate. If the MTBF is 5 years, at what time will 10% of the products fail?
Solution: Using Equation 4.26 with $\beta = 1$, $\gamma = 0$, $\eta = \text{MTBF}$, $R = 0.90$, and MTBF ≈ 43,800 hours (5 years), we solve for $t$. Thus, $t = -[\text{MTBF} \times \ln(R)] \approx 4{,}600$ hours, or slightly more than half a year.
Exponential Distribution Functions in Excel
EXPONDIST (x, lambda, cumulative)
x is the value of the function, a non-negative number.
Lambda is the parameter value, a positive number equal to the constant failure rate.
Cumulative is a logical value that selects the form of the function returned: TRUE returns the cumulative distribution function (CDF); FALSE returns the probability density function (PDF).
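The calculation in Example 4.9 is short enough to verify directly. This sketch assumes a year of 8,760 hours, which is consistent with the 43,800-hour figure used for the 5-year MTBF; the variable names are illustrative.

```python
from math import exp, log

HOURS_PER_YEAR = 8760  # assumption: 365 days x 24 hours

mtbf_hours = 5 * HOURS_PER_YEAR   # 43,800 hours
target_reliability = 0.90         # 10% of products have failed

# Invert R(t) = exp(-t / MTBF) for t.
t_hours = -mtbf_hours * log(target_reliability)
t_years = t_hours / HOURS_PER_YEAR

# Cross-check: reliability at t = MTBF is exp(-1), not 50%.
reliability_at_mtbf = exp(-mtbf_hours / mtbf_hours)

print(round(t_hours), round(t_years, 2), round(reliability_at_mtbf, 3))
```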
4.3.3 The Normal Distribution
The normal distribution occurs whenever a random variable is affected by a sum of random effects such that no single factor dominates. It has been used to represent dimensional variability in manufactured goods, material properties, and measurement errors. It has also been used to assess product reliability. The probability density function for the normal distribution is
$$f(t) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{t - \mu}{\sigma}\right)^{2}\right], \quad -\infty < t < \infty \tag{4.37}$$
where the parameter $\mu$ is the mean or the MTTF, and $\sigma$ is the standard deviation of the distribution. The parameters for a normal distribution are listed in Table 4.3. Figure 4.5 shows the shape of the probability density function. The cumulative distribution function, or unreliability, for the normal distribution is
$$Q(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}\right] dx \tag{4.38}$$
Figure 4.5 Probability density function for normal distribution. (Failure probability distribution $f(t)$ versus time, centered on the mean.)
Table 4.3 Normal Distribution Parameters
Mean (arithmetic average): $\mu$
Median ($t_{50}$, or time at 50% failure): $\mu$
Mode (highest value of $f(t)$): $\mu$
Location parameter: $\mu$
Shape parameter: $\sigma$
$s$ (estimate of $\sigma$): $t_{50} - t_{16}$
A normal random variable with mean equal to zero and variance of one is called a standard normal variable ($z$), where
$$z = \frac{x - \mu}{\sigma} \tag{4.39}$$
Properties of the standard normal variable, in particular its cumulative distribution function, are tabulated in statistical tables. Table 4.4 provides the percentage values of the areas under the standard normal curve at different distances from the mean, in multiples of $\sigma$. There is no closed-form solution to the integral of Equation 4.38. Therefore, the values for the area under the normal distribution curve are obtained from the standard normal tables by converting the random variable to a standard normal variable using the following transformation:
$$F(t) = \Phi(z) = \Phi\left(\frac{t - \mu}{\sigma}\right) \tag{4.40}$$
The normal distribution has been used to describe the failure distribution when an expected wearout time, $\mu$, exists for a population (often defined as the time when the degradation level reaches a critical value). The lives of the treads of tires and the cutting edges of machine tools fit this description. In these situations, life is given by a mean value of $\mu$ and an uncertainty defined through the standard deviation. When the normal distribution is used, the probability of a failure occurring before or after this mean time is equal.
Table 4.4 Areas Under the Normal Curve (cumulative area to the left of the stated point)
$\mu - 1\sigma$: 15.87%    $\mu + 1\sigma$: 84.13%
$\mu - 2\sigma$: 2.28%     $\mu + 2\sigma$: 97.72%
$\mu - 3\sigma$: 0.135%    $\mu + 3\sigma$: 99.865%
$\mu - 4\sigma$: 0.003%    $\mu + 4\sigma$: 99.997%
EXAMPLE 4.10 A machinist estimates that there is a 90% probability that the washer of an air compressor will fail between 25,000 and 35,000 cycles of use. Assuming a normal distribution for washer degradation, find the mean life and standard deviation of the life of the washers.
Solution: Assuming that 5% of the failures occur at fewer than 25,000 cycles and 5% occur at more than 35,000 cycles, the distribution is centered at 30,000 cycles of use; that is, $\mu = 30{,}000$. In this condition:
$$\Phi(z_1) = 0.05,\quad z_1 = \frac{25000 - \mu}{\sigma}; \qquad \Phi(z_2) = 0.95,\quad z_2 = \frac{35000 - \mu}{\sigma}$$
From the normal distribution table, $z_1 = -1.65$ and $z_2 = +1.65$. Hence, $-1.65\sigma = 25{,}000 - \mu$ and $1.65\sigma = 35{,}000 - \mu$. Solving these two equations with the mean of 30,000 cycles results in $\sigma$ of 3,030 cycles.
EXAMPLE 4.11 The time for a corrosive failure is normally distributed with mean $\mu = 2.8$ hours and standard deviation $\sigma = 0.6$ hours. (a) What is the probability that the corrosion will occur within 1.5 hours? (b) If we want to analyze the corrosion at 10% growth, after what time from the start should the corrosion be analyzed?
Solution:
(a) The probability that the corrosion will occur in less than 1.5 hours is given by
$$P\{t < 1.5\} = Q(1.5) = \Phi(z), \qquad z = \frac{x - \mu}{\sigma} = \frac{1.5 - 2.8}{0.6} = -2.1667$$
From the standard normal table, $\Phi(-2.1667) = 0.0151$.
(b) For this condition, $F = \Phi(z) = 0.1$; then, from the standard normal table, $z$ is approximately $-1.28$. Therefore, $t = \mu - 1.28\sigma$; hence, $t = 2.03$ hours.
Normal Distribution Functions in Excel
NORMDIST (x, mean, standard_dev, cumulative) returns the normal cumulative distribution for the specified mean and standard deviation.
NORMINV (probability, mean, standard_dev) returns the inverse of the normal cumulative distribution for the specified mean and standard deviation.
NORMSDIST (z) returns the standard normal cumulative distribution.
NORMSINV (probability) returns the inverse of the standard normal cumulative distribution.
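Standard normal tables can be replaced by the `NormalDist` class in Python's `statistics` module (available from Python 3.8). The sketch below reworks Examples 4.10 and 4.11 this way; the variable names are illustrative. Because the code uses the unrounded quantile $z \approx 1.645$ rather than the table value 1.65, its standard deviation for Example 4.10 comes out near 3,040 cycles rather than 3,030.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Example 4.10: 90% of washer failures fall between 25,000 and 35,000 cycles.
lower, upper = 25_000, 35_000
mu_washer = (lower + upper) / 2            # symmetric interval, so the mean is the midpoint
z_95 = std_normal.inv_cdf(0.95)            # about 1.645
sigma_washer = (upper - mu_washer) / z_95

# Example 4.11: corrosion time ~ Normal(mean 2.8 h, standard deviation 0.6 h).
corrosion = NormalDist(mu=2.8, sigma=0.6)
p_within_1_5_h = corrosion.cdf(1.5)        # part (a)
t_at_10_percent = corrosion.inv_cdf(0.10)  # part (b)

print(round(mu_washer), round(sigma_washer))
print(round(p_within_1_5_h, 4), round(t_at_10_percent, 2))
```

`cdf` and `inv_cdf` play the roles of NORMDIST (cumulative form) and NORMINV, respectively.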
4.3.4 The Lognormal Distribution
For a continuous random variable, there may be a situation in which the random variable is a product of a series of random variables. For example, the wear on a system may be proportional to the product of the magnitudes of the loads acting on it. This condition can be described as
$$y = y_1 y_2 y_3 \ldots y_N \tag{4.41}$$
where $y_i$ are the different loads and $y$ on the left-hand side is representative of the amount of wear.
Taking the natural logarithm of the equation:
$$\ln y = \ln y_1 + \ln y_2 + \ln y_3 + \ldots + \ln y_N \tag{4.42}$$
If no single individual term on the right-hand side of Equation 4.42 has a dominant effect on the outcome (i.e., the value of $\ln y$), then $\ln y$ is distributed normally, and $y$ is considered lognormally distributed. The lognormal distribution has been shown to apply to many engineering situations, such as strengths of metals and dimensions of structural elements, and to biological parameters such as loads on bone joints. Lognormal distributions have been applied in reliability engineering to describe failures caused by fatigue. The probability density function for the lognormal distribution is
$$f(t) = \frac{1}{\sigma t \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln t - \mu}{\sigma}\right)^{2}\right]; \quad 0 < t < \infty \tag{4.43}$$
where $\sigma$ is the standard deviation of the logarithms of all times to failure and $\mu$ is the mean of the logarithms of all times to failure. The cumulative distribution function (unreliability) for the lognormal distribution is
$$Q(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_{0}^{t} \frac{1}{x} \exp\left[-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^{2}\right] dx \tag{4.44}$$
The probability density functions for two values of $\sigma$ are shown in Figure 4.6. The key parameter estimates for the lognormal distribution are provided in Table 4.5. The MTTF for a population whose time to failure follows a lognormal distribution is given by
$$\text{MTTF} = \exp\left[\mu + \frac{\sigma^{2}}{2}\right] \tag{4.45}$$
Figure 4.6 Lognormal probability density function where $\sigma = 0.1$ and $\sigma = 0.5$.
Table 4.5 Lognormal Distribution Parameters
Mean (arithmetic average): $\exp[\mu + \sigma^{2}/2]$
Median ($t_{50}$, or time at 50% failure): $e^{\mu}$
Mode (highest value of $f(t)$): $\exp[\mu - \sigma^{2}]$
Location parameter: $e^{\mu}$
Shape parameter: $\sigma$
$s$ (estimate of $\sigma$): $\ln(t_{50}/t_{16})$
From the basic properties of the logarithm operator, it can be shown that if independent variables $x$ and $y$ are distributed lognormally, then the product random variable, $z = xy$, is also lognormally distributed.
EXAMPLE 4.12 A population of industrial circuit breakers was found to have a lognormal failure distribution with parameters $\mu = 3$ and $\sigma = 1.8$. What is the MTTF of the population? What is the estimate of reliability of these circuit breakers for continuous operation of 30 years?
Solution: From Equation 4.45 for the MTTF:
$$\text{MTTF} = \exp(3 + 0.5 \times (1.8)^{2}) = 101.5 \text{ years}$$
For a 30-year operation (from Equation 4.40, applied to $\ln t$):
$$z = \frac{\ln(30) - 3}{1.8} = \frac{3.40 - 3}{1.8} = 0.223$$
Hence, from the table of the standard normal distribution, the estimate of reliability for a 30-year operation is given by
$$R(30) = 1 - \Phi(z) = 1 - \Phi(0.223) = 1 - 0.588 = 0.412$$
Lognormal Distribution Functions in Excel
LOGNORMDIST (x, mean, standard_dev) returns the cumulative distribution for x, where ln(x) is normally distributed and the mean and the standard deviation are for ln(x).
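Example 4.12 can be reproduced without tables by applying the standard normal CDF to $\ln t$. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative, and time is taken to be in years as in the example.

```python
from math import exp, log
from statistics import NormalDist

mu, sigma = 3.0, 1.8  # mean and standard deviation of ln(time to failure)

# Equation 4.45: mean time to failure of a lognormal population.
mttf_years = exp(mu + sigma**2 / 2)

# Reliability at 30 years: standardize ln(30), then use the normal CDF.
z = (log(30.0) - mu) / sigma
reliability_30y = 1.0 - NormalDist().cdf(z)

# Table 4.5: the median life is exp(mu).
median_years = exp(mu)

print(round(mttf_years, 1), round(z, 3), round(reliability_30y, 3), round(median_years, 1))
```

The gap between the median (about 20 years) and the MTTF (about 101 years) shows how strongly the long right tail pulls the mean when $\sigma$ is large.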
4.4 PROBABILITY PLOTS
Probability plotting is a method for determining whether data (observations) conform to a hypothesized distribution. Typically, computer software is used to assess the hypothesized distribution and determine the probability parameters. The method used by the software tools is analogous to using constructed probability plotting paper to plot data. The time-to-failure data are ordered from the smallest to the largest in value in an appropriate metric (e.g., time to failure, cycles to failure).
Table 4.6 Examples of CDF Estimates for N = 20: Estimate of Cumulative Distribution Function or Unreliability (%)
Columns: rank order ($i$); midpoint plotting position, $(i - 0.5)/N$; expected plotting position, $i/(N + 1)$; median plotting position, $(i - 0.3)/(N + 0.4)$; median rank.
1     2.5    4.8    3.4    3.4
2     7.5    9.5    8.3    8.3
3     12.5   14.3   13.2   13.1
4     17.5   19.0   18.1   18.0
5     22.5   23.8   23.0   23.0
6     27.5   28.6   27.9   27.9
7     32.5   33.3   32.8   32.8
8     37.5   38.1   37.7   37.7
9     42.5   42.9   42.6   42.6
10    47.5   47.6   47.5   47.5
11    52.5   52.4   52.5   52.5
12    57.5   57.1   57.4   57.4
13    62.5   61.9   62.3   62.3
14    67.5   66.7   67.2   67.2
15    72.5   71.4   72.1   72.1
16    77.5   76.2   77.0   77.0
17    82.5   81.0   81.9   81.9
18    87.5   85.7   86.8   86.8
19    92.5   90.5   91.7   91.7
20    97.5   95.2   96.6   96.6
An estimate of percent unreliability
is selected. The data are plotted on probability plotting papers (these are distribution specific) with ordered times to failure on the x-axis and the estimate of percent unreliability on the y-axis. A best-fit straight line is drawn through the plotted data points.
The time-to-failure data used for the x-axis are obtained from field or test. The estimate of unreliability against which to plot these time-to-failure data is less obvious. Several different techniques, such as “midpoint plotting position,” “expected plotting position,” “median plotting position,” “median rank,” and “Kaplan–Meier ranks” (in software) are used for this estimate. Table 4.6 provides the estimates for unreliability based on different estimation schemes for a sample size of 20. The median rank is given by the solution of the following equation:
$$\sum_{k=i}^{N} \frac{N!}{k!\,(N - k)!}\, Q^{k} (1 - Q)^{N - k} = 0.5 \tag{4.46}$$
where $N$ is the sample size, $i$ is the failure number, and $Q$ is the median rank (or estimate of unreliability at the failure time of the $i$th failure). Equation 4.47, which estimates the median plotting positions, can be used in place of the median rank:
$$Q_i = \frac{100 \times (i - 0.3)}{N + 0.4} \tag{4.47}$$
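The median rank and its plotting-position approximation can be compared numerically. The sketch below assumes the usual cumulative-binomial definition of the median rank, namely the value $Q$ for which the probability of at least $i$ failures among $N$ items equals 0.5, and solves it by bisection; the function names are illustrative. The approximation is Equation 4.47 expressed as a fraction rather than a percentage.

```python
from math import comb


def prob_at_least(i: int, n: int, q: float) -> float:
    """Probability of i or more failures among n items, each failing with probability q."""
    return sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(i, n + 1))


def median_rank(i: int, n: int, tol: float = 1e-10) -> float:
    """Solve prob_at_least(i, n, Q) = 0.5 for Q by bisection (increasing in Q)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_at_least(i, n, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def median_plotting_position(i: int, n: int) -> float:
    """Approximation of Equation 4.47, returned as a fraction."""
    return (i - 0.3) / (n + 0.4)


n = 20
for i in (1, 5, 10, 20):
    exact = 100 * median_rank(i, n)
    approx = 100 * median_plotting_position(i, n)
    print(f"i = {i:2d}: median rank {exact:5.1f}%  approximation {approx:5.1f}%")
```

For a sample of 20 the two estimates agree to within about 0.1 percentage point, which is why the approximation is commonly used when plotting by hand.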
The axes used for the plots are not linear. They are different for each probability distribution and are created by linearizing the reliability function, typically by
taking the logarithm of both sides repeatedly. For example, mathematical manipulation of Equation 4.25 for the Weibull distribution results in an ordinate (y-axis) on a log-log reciprocal scale, $\ln \ln [1/(1 - Q)]$, and an abscissa (x-axis) on a log scale of time to failure.
Once the probability plots are prepared for different distributions, the goodness of fit of the plots is one factor in determining which distribution is the right fit for the data. Probability distributions for data analysis should be selected based on their ability to fit the data and for physics-based reasons. There should be a physics-based argument for selection of a distribution that draws from the failure model for the mechanism(s) that caused the failures. These decisions are not always clear cut. For example, the lognormal and the Weibull distributions both model fatigue failure data well; thus, both can often fit the same failure data, and experience-based engineering judgments need to be made.
There is no reason to assume that all the time-to-failure data taken together need to fit only one failure distribution. Because the failures in a product can be caused by more than one mechanism, it is possible that some of the failures are caused by one mechanism and others by a different mechanism. In that case, no one probability distribution will fit the data well. Even if it appears that one distribution is fitting all the data, that distribution may not have any predictive ability. That is why it may be necessary to separate the failures by mechanisms into sets and then fit separate distributions for each set. Table 4.7 shows time-to-failure data separated into two groups by failure mechanism.
Table 4.7 Time-to-Failure Data Separated by Failure Mechanisms
Columns: item number; state (F = failure, S = suspension); time to F or S; subset ID.
1     F    2      V
2     F    10     V
3     F    13     V
4     F    23     V
5     F    23     V
6     F    28     V
7     F    30     V
8     F    65     V
9     F    80     V
10    F    88     V
11    F    106    V
12    F    143    V
13    F    147    W
14    F    173    V
15    F    181    W
16    F    212    W
17    F    245    W
18    F    247    V
19    F    261    V
20    F    266    W
21    F    275    W
22    F    293    W
23    S    300
24    S    300
25    S    300
26    S    300
27    S    300
28    S    300
29    S    300
30    S    300
31    S    300
Figure 4.7 Weibull probability plots for competing failure mechanism data shown in Table 4.7. (Unreliability $Q(t)$ in percent versus time $t$, with separate points and lines for the two competing failure modes; fitted parameters $\beta_1 = 0.67$, $\eta_1 = 450$; $\beta_2 = 4.33$, $\eta_2 = 340$.)
Figure 4.7 shows the Weibull probability plots for the competing failure
mechanism data. Note that the shape and scale factors for the two sets are distinct; one set has a decreasing hazard rate ($\beta = 0.67$) and the other set has an increasing hazard rate ($\beta = 4.33$). If the data are plotted together, then the result shows an almost constant hazard rate. Sparing and support decisions made based on results from a combined data analysis can be misleading and counterproductive.
EXAMPLE 4.13 Figure 4.8 shows reliability test data for 10 identical products in which 6 products failed within the test duration of 600 hours. The time-to-failure data are plotted on two-parameter Weibull probability plotting paper. Using the plot, estimate the following:
Figure 4.8 Two-parameter Weibull probability plot for the time-to-failure data shown in Table 4.8. (Unreliability $Q(t)$ in percent versus time $t$; 6 failures and 4 suspensions; fitted parameters $\beta = 0.65$, $\eta = 825$, $\rho = 0.9998$.)
(a) the unreliability and reliability at the end of 50 hours;
(b) the reliability for a new period of 50 hours, starting after the end of the previous 50-hour period; and
(c) the longest duration that will provide a reliability of 95%, assuming the operation starts at 50 hours.
Solution:
(a) For this example, the plot gives estimates of $\beta = 0.65$ and $\eta = 825$ hours. It is now possible to write the equation for the reliability and use it for analysis. The plotted straight line can also be used to determine the reliability values directly. From Figure 4.8, the unreliability estimate for a mission time of 50 hours can be read directly from the straight line. The value is $Q(50) = 15\%$. Thus, the reliability for this duration is $R(50) = 1 - Q(50) = 85\%$.
(b) The reliability for a new 50-hour period starting with an age of 50 hours is given by the conditional reliability equation as
$$R(50, 50) = \frac{R(50 + 50)}{R(50)} = \frac{R(100)}{R(50)} = \frac{0.78}{0.85} = 91.8\%$$
where $R(100) = 1 - Q(100)$ can be taken directly from the curve.
Table 4.8 Test Data for Example 4.13

Sample Number    Time to Failure (Hours)
1                14
2                58
3                130
4                245
5                382
6                563
7                no failure
8                no failure
9                no failure
10               no failure
(c) For a mission time t that starts after a 50-hour period and must have a reliability of 95%,

\[ R(t, 50) = \frac{R(t + 50)}{R(50)} = \frac{R(t + 50)}{0.85} = 0.95 \]

or R(t + 50) = 0.95 × 0.85 = 0.808. To obtain this reliability, the unreliability is 0.192, or 19.2%. From the curve, the time to reach this unreliability is about 75 hours. Thus, 50 + t = 75 gives a maximum new mission time of 25 hours in order to have a reliability of 95%.
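The three parts of Example 4.13 can also be checked analytically rather than graphically. The following is a minimal sketch, assuming the two-parameter Weibull reliability function R(t) = exp(−(t/η)^β) with the plotted estimates β = 0.65 and η = 825 hours; the function names are illustrative and not from the text. Because the example reads values off the plotted line, the computed numbers can be expected to differ slightly from the graphical readings.

```python
import math

BETA = 0.65   # shape parameter from the fitted line in Figure 4.8
ETA = 825.0   # scale parameter (hours) from the fitted line in Figure 4.8


def reliability(t, beta=BETA, eta=ETA):
    """Two-parameter Weibull reliability R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


def conditional_reliability(t, age, beta=BETA, eta=ETA):
    """Reliability for a new mission of length t that starts at the given age."""
    return reliability(age + t, beta, eta) / reliability(age, beta, eta)


def time_for_reliability(r, beta=BETA, eta=ETA):
    """Time at which the Weibull reliability has dropped to r (inverse of R)."""
    return eta * (-math.log(r)) ** (1.0 / beta)


r50 = reliability(50)                        # part (a)
r50_50 = conditional_reliability(50, 50)     # part (b)
# Part (c): R(t + 50) must equal 0.95 * R(50); solve for t + 50, then subtract 50.
mission = time_for_reliability(0.95 * r50) - 50
print(f"R(50) = {r50:.3f}, R(50, 50) = {r50_50:.3f}, new mission = {mission:.1f} h")
```

The analytical route avoids reading errors on the probability paper, at the cost of relying entirely on the fitted parameters.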
When the life data may contain two or more life segments, such as infant mortality, useful life, and wearout, a mixed Weibull distribution can be used to fit parts of the data with different distribution parameters. Curved or S-shaped Weibull probability plots (in either two parameters or three parameters) are an indication that a mixed Weibull distribution may be present.

Statistical analysis provides no magical way of projecting into the future. The results from an analysis are only as good as the assumed model and assumptions: for example, how failure is defined, the validity of the data, and how the model is used, taking into consideration the tail of the distribution and the limits of extrapolations and interpolations. The following example demonstrates the absurdity of extrapolating time-to-failure data beyond their reasonable limits.

EXAMPLE 4.14 A Weibull probability plot was made from failure data collected over the first 10 years of the life of a population (see Figure 4.9). (a) Estimate the percentage of this population expected to fail by 300 years. (b) Does the answer make sense if the time-to-failure data are for human mortality? Discuss with explanation.

Solution: (a) The results show that the probability of failure at 300 years is 2%. (b) The mortality data for over a billion people for a 10-year period from the time of birth fit the Weibull distribution very well. This is impressive work;
nevertheless, it is wrong. Looking at the result, it is clear that these data should not be used for making any judgment on human longevity even though all the calculations are correct. The mortality pattern of humans in the first 10 years of life cannot be extrapolated because the mortality pattern changes with age. (This is often true for engineered goods too. Failures that occur after manufacturing tests are often caused by defects introduced in manufacturing.) The first 10 years of time-to-failure data will result in a shape factor (β) of less than one. However, from early childhood through a large part of adulthood, the shape factor will be close to one, where most deaths can be considered random (e.g., caused by accidents). Then the population will enter a wearout stage in which people die from old-age causes. Complete human mortality data should be modeled using a mixed Weibull distribution.
Figure 4.9 Weibull probability plot of time-to-failure data for Example 4.14. [Plot: estimated failure probability (%), from 0.01 to 10, versus time in years, from 1 to 1,000.]
CHAPTER 5
Confidence Intervals
Diganta Das, Michael Pecht
CONTENTS
5.1 Introduction
5.2 Concepts
    5.2.1 Definitions
    5.2.2 Interpretation of Confidence Level
    5.2.3 Relationship Between Confidence Interval and Sample Size
5.3 Confidence Interval Estimate Methods
5.4 Confidence Interval for Normal Distribution
    5.4.1 Unknown Mean with Known Variance
    5.4.2 Unknown Mean with Unknown Variance
    5.4.3 Differences in Two Population Means with Variances Known
5.5 Confidence Interval on MTBF: Exponential Distribution Assumption
5.6 Confidence Intervals for Proportions
5.7 Summary
Reference
5.1 INTRODUCTION

Data are often collected from a sample of a population in order to estimate characteristics of the entire population. For example, times to failure for a sample of light bulbs produced in a lot may be assessed to estimate the longevity of all the light bulbs produced. Another example is in production, where manufactured goods are periodically sampled to estimate the defect rate of the total population. Sample acceptance testing can also be conducted at the receipt of goods in order to estimate the ability of the entire lot to meet specifications. A confidence interval is a measure of the uncertainty that comes from making a generalization about a population based on a sample. In this chapter, the concept of confidence intervals is introduced, along with their implementation in assessing uncertainty in reliability analysis.
5.2 CONCEPTS

A population is a set of data collected from all the members of some group. A sample is a set of data collected from only a portion of the members of the population. The data obtained from the sample are in turn used to make estimates about the population. Figure 5.1 is a schematic of estimating population parameters from examining only a sample. The primary precondition is that the population is created from the same process. It is often neither possible nor advisable to measure the whole population (for example, the act of measurement can damage a sample). A parameter calculated from a sample is called the point estimate for the parameter. A confidence interval puts a boundary around these point estimates and provides a probability of including the population parameters within these boundaries.

Figure 5.1 Schematic of estimation of population parameters from sample parameters. [Schematic: a process generates the population, from which a sample is drawn.]

Inferential statistics are used to draw inferences about a population from a sample. Statistics from a sample can include measures of location, such as mean, median, and mode, and measures of variability, such as variance, standard deviation, range, and interquartile range. A confidence interval is a range computed from a given sample that includes the actual value of the parameter with a degree of certainty. The width of the confidence interval is an indication of the uncertainty about the actual parameter.

Providing the standard deviation for a set of measurements is not the same as providing a confidence interval. Standard deviation is a measure of the dispersion of a measurement. In general, the higher the standard deviation is, the wider the confidence interval is on the mean value of that measurement. However, there is more to the statistics of a set of measurements than standard deviation. In fact, although it may not be meaningful to compute a standard deviation on the distribution parameters, an estimate of the confidence interval around them can be made. For example, one can calculate only one value of standard deviation from a collection of measurements; it is not possible to have a standard deviation on that estimate of standard deviation.

5.2.1 Definitions
When the probability of θ being in the interval between l and u is given by P(l ≤ θ ≤ u) = 1 − α, 0 ≤ α ≤ 1, the interval l ≤ θ ≤ u is called a 100 × (1 − α)% confidence interval. In this definition, l is the lower confidence limit, u is the upper confidence limit, and (1 − α) is called the confidence level, which is usually given as a percentage. A confidence interval can be either one sided or two sided. A two-sided (or two-tailed) confidence interval specifies both a lower and upper bound on the interval estimate of the parameter. A one-sided (or one-tailed) confidence interval specifies only a lower or upper bound on the interval estimate of the parameter. A lower one-sided 100(1 − α)% confidence interval is given by l ≤ θ, where l is chosen so that P{l ≤ θ} = 1 − α. Conversely, an upper one-sided 100(1 − α)% confidence interval is given by θ ≤ u, where u is chosen so that P{θ ≤ u} = 1 − α.

5.2.2 Interpretation of Confidence Level
The common perception is that confidence level is the probability of a parameter being within the confidence interval. Although this assumption is intuitive and gives a measure of understanding, the conceptual definition of confidence interval is more nuanced than that. One engineering statistics textbook (Montgomery and Runger 1994) states the nuance in the following way:

In practice, we obtain only one random sample and calculate one confidence interval. Since this interval either will or will not contain the true value of θ, it is not reasonable to attach a probability level to this specific event. The appropriate statement would be that the observed interval [l, u] brackets the true value of θ with confidence level 100(1 − α). This statement has a frequency implication; that is, we don’t know if the statement is true for a specific sample, but the method used to obtain the interval [l, u] yields correct statements 100(1 − α) percent of times.
Figure 5.2 shows 50 confidence intervals on the mean computed from samples taken from a population at a confidence level of 95%. The solid line represents the true mean calculated from the whole population. We expect that 95% of all possible samples taken from the population would produce a confidence interval that includes the true value of the parameter being estimated, and only 5% of all samples would yield a confidence interval that would not include the true value of the parameter. The simulated case shows that three of the 50 confidence intervals (6%, close to the expected 5%) do not contain the true mean.

Figure 5.2 Conceptualization of confidence interval. [Plot: confidence intervals on the mean versus experiment number, 0 to 50.]

With a fixed sample size, the higher the confidence level is, the wider the interval is. A confidence interval estimated at a 100% confidence level will always contain the actual value of the unknown parameter; however, the interval will stretch from −∞ to +∞. Such a wide confidence interval provides little insight. For example, one can say with a very high confidence level that the age of all students in a reliability class is between 1 and 100 years, but that does not provide any information that can be used in decision making.

Selection of the confidence level is part of the engineering risk analysis process. For example, with a confidence interval analysis, one can estimate the expected worst cases on warranty returns over a period. One can then make an estimate of spare parts to stock based on the point estimate, the 95% confidence level, or the 99% confidence level (or any other value of one’s choice) of the expected warranty return. The decision will depend on the balance between the cost of storing the spares and the cost of delay in repair time due to unavailability of spares. In many engineering situations, industry practices or customer contracts may require use of a specific confidence level; typically, values of 90 or 95% are quoted.

5.2.3 Relationship Between Confidence Interval and Sample Size
The limits of a confidence interval depend on the measurements in the sample. As long as the measurements made on the samples are from the same population, an increase in sample size will reduce the width of the confidence interval, provided the confidence level is kept constant. However, when an experiment is conducted or data are gathered from the field, data may come from multiple populations; in these cases, a larger sample size may actually widen the confidence interval. For example, in the manufacture of baseball bats, one may record hardness values of samples taken from the production line. If the production parameters are all within control, then all samples come from the same population and increasing the number of samples will narrow the confidence interval. However, if, for a certain period, the production parameters were out of control, then the hardness values for samples taken during those periods will differ. Therefore, increasing the sample size by including samples from the “out of control” population will broaden the confidence interval.
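The frequency interpretation of Section 5.2.2 and the sample-size effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of Figure 5.2: it assumes a normal population with a known standard deviation and uses the known-variance interval of Section 5.4.1; the population values (mean 100, standard deviation 10) and the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean


def z_interval(sample, sigma, confidence=0.95):
    """Two-sided interval on the mean for a known population sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width


def coverage(true_mean, sigma, n, trials, confidence=0.95, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, confidence)
        hits += low <= true_mean <= high
    return hits / trials


# Hypothetical population: mean 100, standard deviation 10.
fraction = coverage(true_mean=100.0, sigma=10.0, n=20, trials=2000)
print(f"fraction of intervals containing the true mean: {fraction:.3f}")

# Width shrinks with the square root of the sample size at a fixed confidence level.
low10, high10 = z_interval([100.0] * 10, sigma=10.0)
low40, high40 = z_interval([100.0] * 40, sigma=10.0)
print(f"width for n = 10: {high10 - low10:.2f}, for n = 40: {high40 - low40:.2f}")
```

The observed fraction should land near, but not exactly at, the nominal 95%, which is the same effect as the three misses out of 50 in the simulated case. Quadrupling the sample size halves the width only because every sample is drawn from one population.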
5.3 CONFIDENCE INTERVAL ESTIMATE METHODS

Two methods of computation are commonly used to estimate the confidence interval. Fisher matrix (FM) bounds are used in most commercial statistical applications, but are not recommended for use in cases with a small sample size because the results are often too optimistic. The likelihood ratio bounds (LRB) method can be used for any sample size. However, at very large sample sizes, the computational intensity can make LRB analysis time consuming. The details of the mathematical formulations and derivations are beyond the scope of this text. Confidence bounds in the Fisher matrix method for the Weibull shape and scale parameters are given in the following equation:
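\[
\beta_U = \hat{\beta}\, \exp\!\left( \frac{K_\alpha \sqrt{\operatorname{Var}(\hat{\beta})}}{\hat{\beta}} \right), \qquad
\beta_L = \frac{\hat{\beta}}{\exp\!\left( \dfrac{K_\alpha \sqrt{\operatorname{Var}(\hat{\beta})}}{\hat{\beta}} \right)}
\]
\[
\eta_U = \hat{\eta}\, \exp\!\left( \frac{K_\alpha \sqrt{\operatorname{Var}(\hat{\eta})}}{\hat{\eta}} \right), \qquad
\eta_L = \frac{\hat{\eta}}{\exp\!\left( \dfrac{K_\alpha \sqrt{\operatorname{Var}(\hat{\eta})}}{\hat{\eta}} \right)}
\qquad (5.1)
\]

where K_α is defined by

\[
\alpha = \frac{1}{\sqrt{2\pi}} \int_{K_\alpha}^{\infty} e^{-t^2/2}\, dt
\]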
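Once an estimate and its variance are available, the Fisher matrix bounds are a short calculation. The sketch below assumes the form of Equation 5.1 for a positive parameter and treats α as the upper-tail area that defines K_α; the function name and the numerical variance are hypothetical (in practice the variance comes from the inverse of the Fisher information matrix of the fitted model, which is not computed here).

```python
import math
from statistics import NormalDist


def fisher_matrix_bounds(estimate, variance, alpha):
    """Lower and upper bounds for a positive parameter, in the form of Equation 5.1.

    alpha is the upper-tail area that defines K_alpha: alpha = 1 - Phi(K_alpha).
    """
    k_alpha = NormalDist().inv_cdf(1.0 - alpha)
    factor = math.exp(k_alpha * math.sqrt(variance) / estimate)
    return estimate / factor, estimate * factor


# Hypothetical inputs: the estimate echoes the shape factor used earlier,
# but the variance is an illustrative value, not a result from the text.
beta_hat, var_beta = 0.65, 0.04
beta_low, beta_high = fisher_matrix_bounds(beta_hat, var_beta, alpha=0.05)
print(f"bounds on beta: {beta_low:.3f} to {beta_high:.3f}")
```

Because the bounds are built by multiplying and dividing by the same factor, they are symmetric on a logarithmic scale and can never become negative, which suits positive parameters such as the Weibull shape and scale factors.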
The likelihood ratio method is shown in the following equation:
\[
L(\beta, t) = \prod_{i=1}^{N}
\frac{\beta}{\left( \dfrac{t}{(-\ln R)^{1/\beta}} \right)}
\left( \frac{x_i}{\dfrac{t}{(-\ln R)^{1/\beta}}} \right)^{\beta - 1}
\exp\!\left[ -\left( \frac{x_i}{\dfrac{t}{(-\ln R)^{1/\beta}}} \right)^{\beta} \right]
\qquad (5.2)
\]

5.4 CONFIDENCE INTERVAL FOR NORMAL DISTRIBUTION
Concepts of confidence intervals are often illustrated using the normal distribution, partly because it is a symmetric distribution described by two parameters. In a population with a normal distribution, there is a direct relation between confidence interval and sample size. This section describes the calculation of confidence intervals for three cases: confidence interval on an unknown mean with known variance, confidence interval on an unknown mean with unknown variance, and confidence interval on the difference between two population means with known variances.

5.4.1 Unknown Mean with Known Variance
Consider a population with an unknown mean, μ, but a known variance, σ². The variance may be known from prior data, such as the physical processes that create the population or control charts. For this population, random samples of size n yield a sample mean X̄. The 100(1 − α)% confidence interval for the population mean is given by

\[ \bar{X} - \frac{Z_{\alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{Z_{\alpha/2}\,\sigma}{\sqrt{n}} \qquad (5.3) \]

where Z_{α/2} is the upper α/2 percentage point of the standard normal distribution. Correspondingly, to obtain the one-sided confidence intervals, Z_α replaces Z_{α/2}. Setting l = −∞ and u = +∞ in the two cases, respectively, the one-sided confidence intervals are given by

\[ \mu \le u = \bar{X} + \frac{Z_{\alpha}\,\sigma}{\sqrt{n}} \qquad (5.4) \]

and

\[ \bar{X} - \frac{Z_{\alpha}\,\sigma}{\sqrt{n}} = l \le \mu \qquad (5.5) \]

When using a sample mean X̄ to estimate the actual, but unknown, mean μ, the error is E = |X̄ − μ|. With confidence of 100(1 − α)% for a two-sided interval, the error is within the precision of estimation given by

\[ E \le \frac{Z_{\alpha/2}\,\sigma}{\sqrt{n}} \qquad (5.6) \]

Therefore, we can choose a sample size n that allows 100(1 − α)% confidence that the error will not exceed a specified amount E:

\[ n = \left( \frac{Z_{\alpha/2}\,\sigma}{E} \right)^{2} \qquad (5.7) \]

where n is rounded up to the next integer.

EXAMPLE 5.1 Consider measuring the propagation delay of a digital electronic part. You want to have a 99% confidence that the measured mean propagation delay is within 0.15 ns of the true mean propagation delay. What sample size do you need to choose?

Solution: Use Equation 5.7 with a known standard deviation of σ = 0.35 ns:

\[ n = \left( \frac{Z_{\alpha/2}\,\sigma}{E} \right)^{2} = \left( \frac{Z_{0.005} \times 0.35}{0.15} \right)^{2} = \left( \frac{2.576 \times 0.35}{0.15} \right)^{2} \approx 36.1 \]

which is rounded up to n = 37. In this application, α is 0.01 and α/2 is 0.005. The upper 0.005 percentage point, Z_{0.005}, is the 0.995 quantile of the standard normal distribution; because the distribution is symmetric, it has the same magnitude as the 0.005 quantile. The value, 2.576, is looked up from the standard normal distribution table.
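The sample-size relation of Equation 5.7 can be sketched in a few lines. This is an illustrative helper, not part of the original text: the function name is an assumption, and the percentage point is computed from the standard normal quantile function instead of being read from a table (it evaluates to approximately 2.576 for 99% confidence).

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, max_error, confidence):
    """Smallest n for which Z_(alpha/2) * sigma / sqrt(n) <= max_error."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 percentage point
    return math.ceil((z * sigma / max_error) ** 2)


# Values of Example 5.1: sigma = 0.35 ns, E = 0.15 ns, 99% confidence.
n = sample_size_for_mean(sigma=0.35, max_error=0.15, confidence=0.99)
print(n)
```

Rounding up with `math.ceil` matters: rounding to the nearest integer could return a sample size whose error bound slightly exceeds the specified E.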
5.4.2 Unknown Mean with Unknown Variance

When the population variance is not known, the Student’s t-distribution is used in place of the standard normal distribution. If the population can be assumed to be normal, the t-distribution can be used as the sampling distribution of the mean. The sample variance, s², is used in place of the population variance, σ², which is not known. Suppose a population has an unknown variance σ². A random sample of size n yields a sample mean X̄ and a sample variance s², and t_{α/2, n−1} is the upper α/2 percentage point of the t-distribution with (n − 1) degrees of freedom. The two-sided 100(1 − α)% confidence interval in this case is given by

\[ \bar{X} - \frac{t_{\alpha/2,\,n-1}\, s}{\sqrt{n}} \le \mu \le \bar{X} + \frac{t_{\alpha/2,\,n-1}\, s}{\sqrt{n}} \qquad (5.8) \]

EXAMPLE 5.2 Tensile strength of a synthetic fiber used to manufacture seatbelts is an important characteristic in predicting the reliability of the product. From past experience, the tensile strength can be assumed to be normally distributed. Sixteen samples were randomly selected and tested from a batch of fibers. The samples’ mean tensile strength was found to be 49.86 psi and their standard deviation was found to be 1.66 psi. Determine an appropriate interval to estimate the batch mean tensile strength.

Solution: Because one may be concerned only with tensile strengths that are too low, a one-sided confidence interval on the batch mean, μ, is appropriate. Because the population (batch) variance is unknown and the sample size is fairly small, a confidence interval based on the t-distribution is necessary. Using the tabulated value t_{0.05, 15} = 1.753, a one-sided, 95% confidence interval for the batch mean μ is

\[ P\left\{ \frac{\bar{x} - \mu}{\sqrt{s^{2}/n}} \le t_{\alpha,\,n-1} \right\} = 1 - \alpha
\;\Rightarrow\;
P\left\{ \frac{49.86 - \mu}{\sqrt{1.66^{2}/16}} \le t_{0.05,\,15} \right\} = 0.95 \]

\[ 49.86 - \frac{(1.753)(1.66)}{\sqrt{16}} \le \mu \;\Rightarrow\; 49.13 \le \mu \]
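The arithmetic of Example 5.2 can be sketched as follows. The Python standard library has no t-distribution quantile function, so this sketch assumes the percentage point is supplied from a table; 1.753 is the upper 0.05 percentage point for 15 degrees of freedom, which is the value the example's arithmetic uses. The function name is hypothetical.

```python
import math


def t_lower_bound(sample_mean, sample_std, n, t_value):
    """One-sided lower confidence limit on the mean: x_bar - t * s / sqrt(n)."""
    return sample_mean - t_value * sample_std / math.sqrt(n)


# Example 5.2 inputs; t_value is read from a t-table (15 degrees of freedom).
lower = t_lower_bound(sample_mean=49.86, sample_std=1.66, n=16, t_value=1.753)
print(f"lower limit on batch mean: {lower:.2f} psi")
```

Passing the tabulated value explicitly keeps the confidence level visible at the call site; a different confidence level only requires a different table entry.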
5.4.3 Differences in Two Population Means with Variances Known
A confidence interval for the difference between the means of two normal distributions specifies a range of values within which the difference between the means of the two populations (μ1 − μ2) may lie. A random sample of size n1 from the first population, with a known standard deviation of σ1, yields a sample mean X̄1. Similarly, a random sample of size n2 from the second population, with a known standard deviation of σ2, yields a sample mean X̄2. Then, a two-sided 100(1 − α)% confidence interval for the difference between the means is given by

\[ \bar{X}_1 - \bar{X}_2 - Z_{\alpha/2} \sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}} \;\le\; \mu_1 - \mu_2 \;\le\; \bar{X}_1 - \bar{X}_2 + Z_{\alpha/2} \sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}} \qquad (5.9) \]

where Z_{α/2} is the upper α/2 percentage point of the standard normal distribution.

EXAMPLE 5.3 Tensile strength tests are performed on two different types of aluminum wires used in wire bonding power electronic devices. The results of the tests are given in the following table. What are the limits of the 90% confidence interval on the difference in mean strength (μ1 − μ2) of the two aluminum wires?
Type    Sample Size (n_i)    Sample Mean Tensile Strength (kg/mm²)    Known Population Standard Deviation (kg/mm²)
1       15                   86.5                                     1.1
2       18                   79.6                                     1.4
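Solution: We have

\[ l = \bar{X}_1 - \bar{X}_2 - Z_{\alpha/2} \sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}}
= 86.5 - 79.6 - 1.645 \sqrt{\frac{(1.1)^{2}}{15} + \frac{(1.4)^{2}}{18}}
= 6.9 - 0.716 = 6.184\ \text{kg/mm}^{2} \]

Also,

\[ u = \bar{X}_1 - \bar{X}_2 + Z_{\alpha/2} \sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}}
= 86.5 - 79.6 + 1.645 \sqrt{\frac{(1.1)^{2}}{15} + \frac{(1.4)^{2}}{18}}
= 6.9 + 0.716 = 7.616\ \text{kg/mm}^{2} \]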
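The interval of Equation 5.9 can be sketched as a small function and checked against the numbers of Example 5.3. The function name is hypothetical, and the percentage point is computed from the standard normal quantile function instead of being taken from a table.

```python
import math
from statistics import NormalDist


def diff_means_interval(x1, x2, sigma1, sigma2, n1, n2, confidence):
    """Two-sided interval on mu1 - mu2 when both population sigmas are known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    difference = x1 - x2
    return difference - half_width, difference + half_width


# Example 5.3 inputs: tensile strength of two aluminum wire types, in kg/mm^2.
lower, upper = diff_means_interval(86.5, 79.6, 1.1, 1.4, 15, 18, confidence=0.90)
print(f"{lower:.3f} <= mu1 - mu2 <= {upper:.3f}")
```

Because the whole interval lies above zero, the data support the conclusion that the first wire type has the higher mean strength at this confidence level.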
5.5 CONFIDENCE INTERVAL ON MTBF: EXPONENTIAL DISTRIBUTION ASSUMPTION

The confidence limits for mean time between failures (MTBF), assuming an exponential distribution, are given by

\[ \mathrm{MTBF} = \frac{2 T_a}{\chi^{2}_{c;\,dF}} \qquad (5.10) \]

where T_a is the total number of device-hours, calculated as in Example 3.6, and the values of the parameters c and dF (degrees of freedom) for the χ²-distribution can be obtained for different testing conditions from Table 5.1. The reliability of items that fail with an exponential distribution is the same for all time intervals of equal length, regardless of the start time. It is given by the following equation:

\[ R = e^{-t/\mathrm{MTBF}} \qquad (5.11) \]

Table 5.1 Values of Parameters c and dF for Confidence Limit Calculations on MTBF (a)

                                 MTBF (l)              MTBF (u)
Type of Test                     c        dF           c            dF
Two-sided failure terminated     α/2      2r           1 − α/2      2r
One-sided failure terminated     α        2r           1 − α        2r
Two-sided time terminated        α/2      2r + 2       1 − α/2      2r
One-sided time terminated        α        2r + 2       1 − α        2r
No failures observed             α        2            none         none

(a) r is the number of failures observed.

EXAMPLE 5.4 In a failure-terminated test with four failures, 16,000 device-hours are accumulated. (a) What are the upper and lower one-sided 90% confidence limits on MTBF? (b) What are the one-sided 90% confidence limits on reliability for a 100-hour period?

Solution: (a) Here, T_a = 16,000 hours; CL = 1 − α = 0.90; α = 0.10; α/2 = 0.05; 1 − α/2 = 0.95; r = 4. Therefore,

\[ \mathrm{MTBF}(l) = \frac{2(16{,}000)}{\chi^{2}_{0.10;\,8}} = \frac{32{,}000}{13.362} = 2{,}395\ \text{hours} \]

\[ \mathrm{MTBF}(u) = \frac{2(16{,}000)}{\chi^{2}_{0.90;\,8}} = \frac{32{,}000}{3.490} = 9{,}169\ \text{hours} \]

(b) If the lower and upper 0.90 confidence limits on the MTBF for the item are 2,395 and 9,169 hours, respectively, the lower and upper 0.90 confidence limits on its reliability for any 100-hour interval are

\[ R_l = e^{-100/2{,}395} = e^{-0.0418} = 0.9591 \]

\[ R_u = e^{-100/9{,}169} = e^{-0.0109} = 0.9892 \]
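Equations 5.10 and 5.11 are simple enough to check numerically. The sketch below takes the chi-square percentage points as inputs read from a table (13.362 and 3.490 for 8 degrees of freedom, as quoted in Example 5.4), because the Python standard library has no chi-square quantile function; the function names are illustrative. Dividing 32,000 by 3.490 gives approximately 9,169 hours.

```python
import math


def mtbf_limit(device_hours, chi_square_value):
    """Confidence limit on MTBF from Equation 5.10: 2 * T_a / chi-square value."""
    return 2.0 * device_hours / chi_square_value


def exponential_reliability(t, mtbf):
    """Reliability for an interval of length t under the exponential model (5.11)."""
    return math.exp(-t / mtbf)


# Example 5.4: failure-terminated test, r = 4, T_a = 16,000 device-hours.
mtbf_lower = mtbf_limit(16_000, 13.362)   # chi-square at c = 0.10, dF = 8
mtbf_upper = mtbf_limit(16_000, 3.490)    # chi-square at c = 0.90, dF = 8
r_lower = exponential_reliability(100, mtbf_lower)
r_upper = exponential_reliability(100, mtbf_upper)
print(f"MTBF limits: {mtbf_lower:.0f} to {mtbf_upper:.0f} hours")
print(f"100-hour reliability limits: {r_lower:.4f} to {r_upper:.4f}")
```

The same two functions cover the time-terminated and zero-failure cases; only the table entries for c and dF change.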
EXAMPLE 5.5 In a time-terminated test with seven failures, 21,000 device-hours are accumulated. What are the upper and lower one-sided limits on MTBF with 0.99 confidence?

Solution: Here, T_a = 21,000 hours; CL = 1 − α = 0.99; α = 0.01; α/2 = 0.005; 1 − α/2 = 0.995; r = 7. Therefore,

\[ \mathrm{MTBF}(l) = \frac{2(21{,}000)}{\chi^{2}_{0.01;\,16}} = \frac{42{,}000}{32.000} = 1{,}313\ \text{hours} \]

\[ \mathrm{MTBF}(u) = \frac{2(21{,}000)}{\chi^{2}_{0.99;\,14}} = \frac{42{,}000}{4.660} = 9{,}013\ \text{hours} \]

Because this is a time-terminated test, the lower limit uses 2r + 2 = 16 degrees of freedom: it is possible that a failure would have occurred in the very next instant or at the beginning of the next measurement interval.

A common situation occurs when an estimate of the MTBF and the confidence interval around it are of interest, but no failures have occurred. A lower one-sided confidence limit, which is a conservative value for MTBF, can still be calculated. This is a time-terminated test in which it is assumed that a failure would have occurred in the very next instant or at the beginning of the next measurement interval. There is no upper confidence limit in this case, and the equation used is the same as that for the lower one-sided confidence limit with one (assumed) failure.
5.6 CONFIDENCE INTERVALS FOR PROPORTIONS

In engineering applications, the outgoing quality of a product is often estimated based on testing of a sample of parts. If p̂ is the proportion of observations in a random sample of size n that belongs to a class of interest (e.g., defective), then an approximate 100(1 − α)% confidence interval on the proportion p of the population that belongs to this class is

\[ \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \;\le\; p \;\le\; \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \qquad (5.12) \]

where z_{α/2} is the upper α/2 percentage point of a standard normal distribution. This relationship holds true when the proportion is not too close to either zero or one and the sample size n is large.

EXAMPLE 5.6 An inspector randomly selects 200 boards from the process line and finds 5 defective boards. Calculate the 90% confidence interval for the proportion of good boards from the process line.

Solution: We use Equation 5.12 with p̂ = 195/200 = 0.975:

\[ \frac{195}{200} - 1.64 \sqrt{\frac{0.975(0.025)}{200}} \;\le\; p \;\le\; \frac{195}{200} + 1.64 \sqrt{\frac{0.975(0.025)}{200}} \]

\[ 0.957 \le p \le 0.993 \]

The result implies that the total population is likely to have a proportion of good boards between 0.957 and 0.993. Note that no assumption is made regarding what the total population is.
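The normal-approximation interval of Equation 5.12 can be sketched as follows, using the board inspection numbers from the example above (195 good boards out of 200). The function name is hypothetical, and the percentage point is computed from the standard normal quantile function, so it is approximately 1.645 where the hand calculation uses 1.64; the limits agree to three decimal places.

```python
import math
from statistics import NormalDist


def proportion_interval(count, n, confidence):
    """Approximate two-sided interval on a population proportion (Equation 5.12)."""
    p_hat = count / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# 200 boards inspected, 5 defective, so 195 good boards.
lower, upper = proportion_interval(count=195, n=200, confidence=0.90)
print(f"{lower:.3f} <= p <= {upper:.3f}")
```

This approximation is only as good as the conditions stated with Equation 5.12; with proportions very close to zero or one, or with small samples, the computed limits can fall outside the range 0 to 1.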
5.7 SUMMARY

It is good practice to state the results of all engineering analyses with the degree of certainty (or uncertainty) associated with them. The confidence interval is one statistical measure for doing that. The concepts described in this chapter can be used to estimate confidence intervals on practically all the calculated metrics used in reliability analysis and reporting. For example, one can calculate confidence intervals on regression parameters. Just as confidence intervals around estimates can be used to estimate unknown distribution parameters, confidence intervals around a regression line can be used to estimate the uncertainties associated with regression relationships.

Statistical software tools can automatically produce and draw confidence intervals on calculated metrics or generated plots. The availability of this software makes it too easy to report the values without complete understanding. When a confidence interval is chosen as a measure of uncertainty, it is imperative to report the information completely. One should always include the confidence level and whether the interval is one sided or two sided. In general, it is also a good idea to include the sample size and how the samples were chosen. In formal engineering documents, the methods of analysis used (e.g., Fisher matrix) should also be mentioned with justifications.

The confidence interval is not the only way to present uncertainties in data. Under some circumstances, estimation and visualization of a confidence interval may not be possible. For example, with a very small sample size, one is likely to obtain a very wide confidence interval that has no practical use. In such cases, data visualization techniques display the complete results without making any statistical claim and facilitate making judgments on the data.
REFERENCE

Montgomery, D., and G. Runger. 1994. Applied statistics and probability for engineers, 324. New York: John Wiley & Sons.
CHAPTER 6
Hardware Reliability
Abhijit Dasgupta, Jun Ming Hu
CONTENTS
6.1 Introduction
6.2 Failure Mechanisms and Damage Models
    6.2.1 Incorrect Mechanical Performance
    6.2.2 Incorrect Thermal Performance
    6.2.3 Incorrect Electrical Performance
        6.2.3.1 Electromagnetic Interference
        6.2.3.2 Particle Radiation
    6.2.4 Yield
    6.2.5 Buckling
    6.2.6 Fracture
    6.2.7 Interfacial De-Adhesion
    6.2.8 Fatigue
    6.2.9 Creep
    6.2.10 Wear
    6.2.11 Aging due to Interdiffusion
    6.2.12 Aging due to Particle Radiation
    6.2.13 Other Forms of Aging
    6.2.14 Corrosion
    6.2.15 Metal Migration
6.3 Loadings, Stresses, and Material Behavior
6.4 Variabilities and Reliability
6.5 Reliability Prediction Techniques
6.6 Case Study: Wirebond Assembly in Microelectronic Packages
    6.6.1 Failure Mechanisms and Stress Analysis
        6.6.1.1 Wire Flexure
        6.6.1.2 Shear of Bond Pad
        6.6.1.3 Shear of Wire and Substrate
        6.6.1.4 Axial Tension of Wire
    6.6.2 Stochastic Modeling of Variabilities and Reliability
    6.6.3 Fatigue Lifetime and Reliability Prediction
6.7 Qualification and Accelerated Testing
6.8 De-Rating and Logistic Implications
6.9 Manufacturing Issues
    6.9.1 Process Qualification
    6.9.2 Manufacturability, Process Variabilities, Defects, and Yields
    6.9.3 Process Verification Testing and Statistical Process Control
6.10 Summary
References
6.1 INTRODUCTION

Reliability assessment and its associated validation techniques are crucial factors in the success of any engineering hardware. Hardware reliability is often defined as the probability that the equipment will perform throughout the intended mission life within specified tolerances under specified life-cycle loads. (Here, the term “loads” includes all external influences that can affect hardware performance: for example, mechanical forces, temperature, time, concentrations of harmful chemicals, radiation, and electrical voltage/current.) The purpose of reliability assessment is to provide criteria for selecting courses of action that affect (and are affected by) reliability. Reliability is not a matter of chance; it has to be consciously and actively built into hardware through careful specification of good design and manufacturing practices. Proactive, quantitative reliability assessment during the design phase can be an effective vehicle for a variety of other design functions, such as

• reliability allocation, based on complexity, cost, and risk;
• feasibility evaluation;
• determination of deficiencies in current databases regarding material properties, application profile data, or field failure data;
• comparison of alternative design configurations, based on relative reliability margins;
• comparison of alternative manufacturing processes, based on relative reliability margins;
• evaluation of cost effectiveness, based on the reliability margin;
• development of trade-offs with other product parameters, such as cost, risk, development time, producibility, and maintainability;
• design of accelerated tests to qualify a product to the customer’s specifications;
• design of accelerated stress tests for process verification;
• identification of reliability problems for corrective action;
• de-rating and redundancy decision-making, based on trade-offs between cost and risk;
• logistics planning, such as in maintainability decisions;
• measuring progress by monitoring reliability growth; and
• warranty analysis.
Failure of hardware is due to complex sets of interactions between (1) the “stresses” (i.e., environmental parameters that act on and within the module) and (2) the materials and configurations of components, interconnects, and assemblies. A proper evaluation of reliability requires a systematic analysis of the response of the materials and configurations to the “stresses.” From the viewpoint of physics of failure, the characteristics of the failures of a product should be defined by the failure mode, failure site, and failure mechanism. Failure mode is a physically observable change caused by a failure mechanism, such as an open, a short, an increase in resistance, or any other electrical parameter change in an electronic product. Failure mode can serve as a failure criterion in component qualification, and the traditional failure mode effect analysis (FMEA) is a helpful tool for identifying and ranking failure phenomena. A failure mechanism is the process by which a combination of thermal, mechanical, electrical, chemical, and magnetic stresses can induce failure. Failure processes usually start from existing defects such as voids in a material or microcracks in the interface of two materials. In general, a failure mechanism can occur at different sites. The failure mechanisms should not be confused with failure modes and initial defects, although they are often called the root cause of failures. A correct reliability assessment procedure should begin with an investigation of all critical failure mechanisms, followed by an identification of where and when they may occur and their effect on the operation of the product over the required design life. Failures must be identified with respect to the failure mechanisms that could potentially be activated during operation, with the understanding that a given failure mechanism can occur at many sites. 
In determining component qualification tests, we must be aware of all possible failure mechanisms and the responsible stresses in order to make sure that the component and module are capable of withstanding these stresses. Therefore, investigation of failure mechanisms should also serve as a guide for component qualification. Reliability assessment is greatly facilitated by effective quantitative models that not only can simulate the physics of potential and relevant failure mechanisms, but also can accommodate the variabilities of the parameters in the model through statistical modeling techniques. These models involve examining the stresses at the failure site under the given loads. Stresses are the local intensities of the loads at the failure site and can be calculated through appropriate stress analysis, based on the geometric configuration and material constitutive properties at the failure site. For example, stress analysis may involve computing the temperature field at a device gate under a specified cooling condition in an electronic box, based on the conduction and convection characteristics of the materials involved. Failure is then predicted, based on the stress magnitude, from an appropriate generic material damage model. This is termed the physics-of-failure approach. The inputs to a failure model are usually the local stresses and appropriate failure properties of the material. For example, if the failure mechanism is fatigue, then we need material fatigue properties and the cyclic mechanical stress. If the failure mechanism is electromigration, then the properties include activation energy and a rate constant, while the stress is a combination of the current density and local temperature.
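As a hedged illustration of how local stresses and failure properties feed a damage model, the sketch below uses Black's equation, a widely cited empirical electromigration model that this chapter does not itself state. It takes the same kinds of inputs named above: a rate constant (prefactor), an activation energy, the current density, and the local temperature. The function names, the current-density exponent, and all numerical values are illustrative assumptions, not material data.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf(current_density, temperature_k, prefactor, exponent, activation_energy_ev):
    """Median time to failure from Black's equation: A * j**(-n) * exp(Ea / (k*T)).

    The prefactor A and exponent n are empirical fitting constants (assumed here).
    """
    if current_density <= 0 or temperature_k <= 0:
        raise ValueError("current density and absolute temperature must be positive")
    arrhenius = math.exp(activation_energy_ev / (BOLTZMANN_EV_PER_K * temperature_k))
    return prefactor * current_density ** (-exponent) * arrhenius


def acceleration_factor(j_use, t_use_k, j_test, t_test_k, exponent, activation_energy_ev):
    """Ratio of life at use conditions to life at test conditions.

    The prefactor cancels in the ratio, so it is set to 1 here.
    """
    life_use = black_mttf(j_use, t_use_k, 1.0, exponent, activation_energy_ev)
    life_test = black_mttf(j_test, t_test_k, 1.0, exponent, activation_energy_ev)
    return life_use / life_test


if __name__ == "__main__":
    # Hypothetical conditions: same current density, test run hotter than use.
    print(acceleration_factor(1e6, 358.0, 1e6, 423.0, exponent=2.0, activation_energy_ev=0.7))
```

Because the prefactor cancels in the ratio, an acceleration factor can be estimated without knowing the rate constant, which is one reason such models are convenient when planning accelerated tests.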
Variabilities due to inadequacies of process control, workmanship, and flaws in natural materials are typically encountered in material properties, geometric tolerances, loads, and preexisting defect populations at the failure site. Piece-to-piece variations are difficult to model deterministically, and it is often convenient to adopt a probabilistic mechanics approach to the science of reliability modeling. In this approach, the variabilities in each parameter are described by suitable stochastic distribution functions. However, this is a phenomenological feature, and the accuracy of the mathematical distribution depends on the number of observations made. The achievable accuracy in the tails of the statistical distributions limits the accuracy of reliability assessments.
This chapter is subdivided into several sections that discuss
• generic failure mechanisms commonly encountered in engineering hardware;
• analysis techniques for evaluation of the magnitude of stresses at each failure site;
• qualification and validation techniques for reliability;
• quality assessment techniques to explore the impact of manufacturing processes on quality and reliability; and
• examples of the process of predicting hardware reliability, based on failure mechanism models.
The failure models and examples presented are as detailed and explicit as feasible. It is important to recognize, however, that there is always some degree of uncertainty in reliability assessment and that comprehensive validation through field-failure data is essential. Consequently, in new and emerging technologies, reliability assessment models may often have to be iterative, undergoing continuous improvements based on actual failure histories.
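The probabilistic mechanics approach described in this introduction can be sketched with a simple stress-strength interference calculation. The example below is a minimal sketch, assuming that the stress and the strength at a failure site are independent and normally distributed; the distributions, function names, and numerical values are hypothetical and chosen only for illustration. It compares a sampled estimate of the failure probability with the closed-form value for the normal case, which shows how the estimate of a tail probability depends on the number of observations.

```python
import math
import random


def analytic_failure_probability(stress_mean, stress_sd, strength_mean, strength_sd):
    """P(stress > strength) for independent, normally distributed stress and strength."""
    margin_mean = strength_mean - stress_mean
    margin_sd = math.sqrt(stress_sd ** 2 + strength_sd ** 2)
    # Standard normal CDF evaluated at -margin_mean / margin_sd.
    return 0.5 * (1.0 + math.erf(-margin_mean / (margin_sd * math.sqrt(2.0))))


def sampled_failure_probability(n_samples, stress_mean, stress_sd,
                                strength_mean, strength_sd, seed=0):
    """Monte Carlo estimate of P(stress > strength) from n_samples random pairs."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        stress = rng.gauss(stress_mean, stress_sd)
        strength = rng.gauss(strength_mean, strength_sd)
        if stress > strength:
            failures += 1
    return failures / n_samples


if __name__ == "__main__":
    # Hypothetical values in MPa, for illustration only.
    exact = analytic_failure_probability(300.0, 30.0, 400.0, 40.0)
    for n in (1_000, 100_000):
        estimate = sampled_failure_probability(n, 300.0, 30.0, 400.0, 40.0)
        print(n, estimate, exact)
```

With few samples the estimate of a small failure probability is noisy, which mirrors the point above that accuracy in the tails of the distributions limits the accuracy of the assessment.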
6.2 FAILURE MECHANISMS AND DAMAGE MODELS
Quantitative reliability assessment involves stress analysis for the expected loads, modeling the failure mechanisms, conducting parametric and sensitivity analysis based on the stochastic variabilities of all input parameters, and expressing the results as a time-dependent probability of satisfactory performance, for a given life-cycle load history. Formulating accurate reliability models requires a clear understanding of potential stresses and failure mechanisms. The utility of reliability models depends on the accuracy of
• material failure models;
• mechanistic stress analysis; and
• empirical data (geometry at the failure site and the material property database) used for the analysis.
Failure mechanisms are the physical processes by which stresses damage the materials comprising the product. Investigating failure mechanisms aids in the development of failure-free, reliable designs. The process of developing reliability assessment models is facilitated if quantitative models can be used to describe the
relevant failure mechanisms. It is necessary, therefore, first to identify the failure mechanisms that could be activated in typical engineering hardware by the stresses occurring during the life-cycle application profile. Table 6.1 lists some generic failure mechanisms commonly observed to be agents of failure. Further discussions of these mechanisms may be found in Dasgupta and Hu (1992). These mechanisms have been carefully selected to ensure that each has a generic quantitative failure model, characterized in terms of damage properties that are reasonably well documented in the literature for common engineering materials.

Table 6.1 Common Failure Mechanisms
Overstress failures (when any single stress excursion exceeds strength)
  Performance failures not associated with material damage
    • Mechanical
    • Electrical
    • Thermal
    • Cosmetic
  Material failure mechanisms
    • Fracture
    • Buckling
    • Yielding
    • Interfacial fracture
    • Electrical overstress
    • Electrostatic discharge
    • Dielectric breakdown
    • Thermal breakdown
Wearout failures (when accumulated damage exceeds endurance)
  Material failure mechanisms
    • Fatigue
    • Creep
    • Metal migration
    • Corrosion
    • Wear
    • Aging: interdiffusion, depolymerization, embrittlement

The failure mechanisms in Table 6.1 are grouped into two categories according to the type of the resulting failure. Catastrophic sudden failures due to a single occurrence of a stress event that exceeds the intrinsic strength of the material are termed overstress failure mechanisms. Failures due to monotonic accumulation of incremental damage occur when the accumulated damage exceeds the endurance of the material and are termed wearout mechanisms. Unanticipated large stress events can cause an overstress catastrophic failure or shorten life by causing a step increase in the accumulation of wearout damage. Examples of such stresses are accidental abuse and acts of God. On the other hand, anticipated design stresses in well-designed and high-quality hardware should cause only uniform accumulation of wearout damage; the threshold of damage required to cause eventual failure should not occur within the design life. Examples of wearout failure mechanisms include fatigue damage due to thermomechanical stresses during power cycling of electronic hardware, corrosion due to anticipated contaminants, and electromigration in high-power devices.
The stresses responsible for activating the failure mechanisms in Table 6.1 can be chemical, thermal, electrical/magnetic, radiative, or mechanical.
Mechanical failures result from elastic and plastic deformation, buckling, brittle and/or ductile
fracture, interfacial separation, fatigue crack initiation and propagation, creep, and creep rupture. Thermal overstress failures are a result of heating a component beyond such critical temperatures as the glass-transition temperature, melting point, fictive point, or flash point. Some examples of thermal wearout failures include aging due to depolymerization, intermetallic growth, and interdiffusion. Electrical failures, common in electronic hardware, include overstress mechanisms due to electrical overstress (EOS) and electrostatic discharge (ESD), such as dielectric breakdown, junction breakdown, hot electron injection, surface and bulk trapping, and surface breakdown, as well as wearout mechanisms such as electromigration. Radiation failures are principally caused by uranium and thorium contaminants and secondary cosmic rays. Radiation can cause wearout, aging, embrittlement of materials, or overstress “soft” errors in such electronic hardware as logic chips. Chemical failures occur in adverse chemical environments that result in corrosion, oxidation, or ionic surface dendritic growth.
There may also be interactions between different types of stresses. For example, metal migration may be accelerated in the presence of chemical contaminants and composition gradients, and a thermal load can activate mechanical failure due to a thermal expansion mismatch. Other common examples of interactions include stress-assisted corrosion, stress-corrosion cracking, stress-assisted diffusion voiding, field-induced metal migration, and temperature-induced acceleration of diffusion-related phenomena, such as the kinetics of intermetallic growth, creep, corrosion, and interdiffusion.
Table 6.1 also includes, in the overstress category, failure of hardware performance due to design mistakes that do not cause irreversible material damage. These include failures due to inadequate mechanical, thermal, and electrical performance.
Typical examples of mechanical design failure are excessive elastic deformations or incorrect damping of a structure under field loads. An example of thermal design failure is temperature buildup due to excessive thermal resistance of a critical heat-transfer path. Some electrical design failures include electromagnetic interference due to improper shielding and incorrect transients due to incorrect impedance or capacitance. For convenience, failures of a cosmetic nature are also included in this category.
The designer must be aware of all possible failure mechanisms in order to design hardware capable of withstanding loads without failing. Failure mechanisms and their associated models are also important for designing tests and screens to audit both the nominal design and manufacturing specifications and the level of defects introduced by excessive variability in manufacturing and material parameters. Brief descriptions of several important failure mechanisms are given in this chapter. Failures due to inadequate performance are discussed first, followed by failures due to irreversible material damage.
6.2.1 Incorrect Mechanical Performance
Incorrect product response to mechanical overstress loads may compromise the product performance, without necessarily causing any irreversible material damage. Such failures include incorrect elastic deformation in response to mechanical
static loads, incorrect transient response (such as natural frequency or damping) in response to dynamic loads, and incorrect time-dependent (viscoelastic) response. To take one example, elastic deformations occur due to stretching of molecular bonds under mechanical loads and are completely reversible; in other words, the deformation vanishes when the load is removed. Excessive elastic deformations in slender structures due to overstress loads can sometimes constitute functional failure, such as excessive deformation of precision mirrors in large optical devices; large deformation in large, flexible space structures, triggering unstable dynamic modes; or excessive flexing of interconnection wires, package lids, or flex circuits in electronic devices, causing shorting and/or excessive cross-talk.
Overstress damage models for excessive elastic deformation typically involve large deformation theory and can be based on nonlinear theory of finite-deformation elasticity (Malvern 1969). These models use a nonlinear strain-deformation definition and relate the strains to corresponding stresses through an elastic constitutive law. Imposing equilibrium requirements on the resulting stresses yields a nonlinear boundary value problem that can be solved, subject to appropriate boundary conditions, for the unknown displacements. Excessive elastic deformation is assumed to occur when the magnitude of the deformation reaches some critical amount, based on some geometric or tolerance constraints. Detailed discussions of the quantitative models and illustrative examples are presented elsewhere (Dasgupta and Hu 1992).
6.2.2 Incorrect Thermal Performance
Thermal performance failures can arise due to incorrect design of thermal paths in an assembly. This includes incorrect conductivity and surface emissivity of individual components, as well as incorrect convective paths for heat transfer. Failures due to inadequate thermal design may be manifested as components running too hot or too cold, causing operational parameters to drift beyond specifications. Typically, the degradation is reversible upon cooling. Such failures can be caused by direct thermal loads or by mechanical loads, such as friction and dissipative losses in materials during mechanical movement, or by electrical resistive loads, which in turn generate excessive localized thermal stresses. Adequate design checks require proper analysis for thermal stress and should include conductive, convective, and radiative heat paths.
6.2.3 Incorrect Electrical Performance
Electrical performance failures can be caused by incorrect resistance, impedance, voltage, current, capacitance, or dielectric properties or by inadequate shielding from electromagnetic interference (EMI), particle radiation, and ESD. The failure mode can be manifested as reversible drifts in electrical parameters and/or accompanying thermal malfunction. Here, only two major electrical design failures are discussed: failures caused by inadequate shielding from EMI and particle radiation.
6.2.3.1 Electromagnetic Interference
All electromagnetic waves consist of a magnetic (H) and an electric (E) field. The relative magnitude of these fields depends on the nature of the emitter (source) and the proximity of the emitter to the shielding. The ratio of E to H is called the wave impedance. At very large distances from the source, this ratio is the same regardless of the type of source. When this occurs, the wave is said to be a plane wave, and the impedance is equal to 377 ohms (the impedance of free space). If, compared to its voltage, the emitter has a large current flow, such as that generated by a transformer or power lines, it is called a current, magnetic, or low-impedance source. If the emitter operates at high voltage and only a small amount of current flows, the emitter is high impedance, and the wave is referred to as an electric field.
Many electronic circuits are susceptible to electromagnetic radiation and must be shielded to ensure proper operation. Electromagnetic interference (EMI) and radio frequency interference (RFI) are both forms of electromagnetic radiation; EMI encompasses the full spectrum of frequencies, whereas frequencies above approximately 20 kHz are termed RFI. RFI is defined by high-frequency, high-impedance radiated electromagnetic waves dominated by an electric field as the major wave component. The range from direct current (DC) to 20–25 kHz is the area of low frequency and low impedance, where a magnetic field dominates the EMI wave. Causes of EMI emissions in the RFI range include high-frequency digital circuits, such as signal clocks and high-speed logic devices, as well as radio circuits and microwave circuitry. If not properly shielded, this radiated energy can affect the operation of adjacent circuits or other circuits throughout the electronic product or can radiate from the cabinet to affect the circuitry in adjacent electronic cabinets.
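The dependence of wave impedance on source type and distance can be sketched numerically. The example below is a rough sketch, assuming the common textbook approximation (not given in this chapter) that the near-field wave impedance of a small electric source scales as the free-space impedance times λ/(2πr), that of a small magnetic source scales as the free-space impedance times 2πr/λ, and that both settle to the free-space value beyond r = λ/(2π). The function name and the sample frequencies and distances are hypothetical.

```python
import math

FREE_SPACE_IMPEDANCE = 376.73      # ohms, commonly rounded to 377
SPEED_OF_LIGHT = 299_792_458.0     # m/s


def wave_impedance(frequency_hz, distance_m, source):
    """Approximate wave impedance magnitude |E/H| in ohms.

    source is "electric" (high-impedance emitter) or "magnetic" (low-impedance emitter).
    Uses the simple near-field approximation described in the lead-in (an assumption).
    """
    if source not in ("electric", "magnetic"):
        raise ValueError("source must be 'electric' or 'magnetic'")
    wavelength = SPEED_OF_LIGHT / frequency_hz
    boundary = wavelength / (2.0 * math.pi)   # approximate near/far-field transition
    if distance_m >= boundary:
        return FREE_SPACE_IMPEDANCE            # plane wave: same for both source types
    ratio = distance_m / boundary              # less than 1 in the near field
    if source == "electric":
        return FREE_SPACE_IMPEDANCE / ratio    # high impedance close to the source
    return FREE_SPACE_IMPEDANCE * ratio        # low impedance close to the source


if __name__ == "__main__":
    # Hypothetical case: a 20 kHz emitter observed 1 m away, then a 1 GHz emitter at 10 m.
    print(wave_impedance(20e3, 1.0, "electric"))
    print(wave_impedance(20e3, 1.0, "magnetic"))
    print(wave_impedance(1e9, 10.0, "magnetic"))
```

Under this approximation, a shield placed close to a low-frequency magnetic source sees a wave impedance far below 377 ohms, while the same shield far from a high-frequency source sees the plane-wave value.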
Low-impedance EMI is generated by transformers, motors, solenoids, permanent magnets or electromagnets, and other current-operated devices that produce an external magnetic field. When strong enough, this field affects other components by inducing electrical currents. When an electromagnetic wave encounters a discontinuity, such as a metal shield, if the magnitude of the wave impedance differs greatly from the intrinsic impedance of the shield, most of the energy is reflected; very little is transmitted across the boundary and absorbed. Metals have an intrinsically low impedance because of their high conductivity. Therefore, for low-impedance waves, less energy is reflected and more is absorbed because the impedance of the metal shield is more closely matched to that of the wave.
6.2.3.2 Particle Radiation
The electrical failure modes caused by radiation are important to hardware design because they dictate, in part, the choice of packaging materials and the allowable impurities in them. Radiation shielding may also be an important consideration in package design and configuration. Radiation effects on microelectronics may be a serious obstacle to further rapid increases in very large scale integration (VLSI) densities. These effects are particularly important in memory chips, which usually lead other microelectronics technologies in advanced development. Particle radiation may also cause aging of materials and is discussed further under wearout failure
mechanisms. In most modern microelectronics, cosmic rays or radioactive contaminants can produce single events, caused by the passage through the microcircuit of a single, high-energy particle: electron, photon, muon, pion, or alpha particle. A single-event upset (soft error) is a nonrecurring and temporary change of logic state; that is, no physical defects are associated with it, and the failed bit is recovered during the next write cycle. In this sense, single-event upsets may be considered as electrical design failures (see Section 6.2.3). Single-event upsets are produced by the direct ionization of a particle as it loses energy passing through the microcircuit or by the ionization of secondary particles created by medium-energy nuclear reactions. The ionization creates electron–hole pairs. If this charge is generated in the vicinity of a p–n junction with a reverse bias, the intense electric fields present cause the electrons and holes to be separated; the charge of the appropriate sign is collected, while that of the opposite sign is swept out of the depletion region.
6.2.4 Yield
This is the first of the overstress material failures discussed in this chapter. Plastic deformations, caused by migration of microstructural defects (called dislocations) under mechanical loads in excess of the yield strength (sometimes called the flow stress) of the material, are irreversible. In other words, they are manifested as a permanent deformation in the material, even after the load is removed. Such permanent deformations may be functionally inadmissible and can be considered an overstress failure mechanism in some hardware. Common examples include overstress plastic strains in such precision structures as optic benches, metrological devices, and turbine blades. Some metals do not exhibit strain hardening beyond the yield strength; that is, the flow stress does not increase with strain beyond the elastic limit of the material. Some exhibit significant strain hardening, given by the Ramberg–Osgood power law (Hertzberg 1989):
$$\sigma = K \varepsilon_p^{\,n} \tag{6.1}$$
where $\sigma$ is the stress, $\varepsilon_p$ is the plastic strain, $n$ is the material’s strain-hardening exponent, and $K$ is the material’s plasticity coefficient. Numerical values for these material constants may be found in the handbooks of ASM International (formerly the American Society for Metals) for most engineering materials. Quantitative discussions of the models and illustrative examples may be found elsewhere (Dasgupta and Hu 1992).
6.2.5 Buckling
Buckling is an overstress failure mechanism caused by sudden catastrophic instability of a slender structure under applied compressive loads. Examples of buckling failures include lateral collapse of long slender columns under axial compression, bending-induced crippling of thin-walled structural beam sections, shear buckling of thin-walled tubular shafts under torsion, or wrinkling of thin plates and thin
films under in-plane compressive and shear loads. Instability occurs when the compressive load reaches a critical threshold value, called the critical buckling stress. The critical buckling stress is a function of material properties (such as stiffness) as well as of structural geometry (such as slenderness ratio). In mathematical terms, buckling is deformation along an unstable path orthogonal to the original deformation mode and can be solved by eigenvalue or bifurcation theory (Timoshenko and Gere 1961). Postbuckling analysis utilizes large-deformation theory and can be accomplished through incremental nonlinear algorithms (Dasgupta and Haslach 1993).
6.2.6 Fracture
Local microscale flaws, such as sharp microcracks, exist in most materials. Excessive stress concentrations at the tip of these sharp cracks can cause catastrophic propagation of the crack under overstress loads in brittle materials that exhibit little yielding and inelasticity before fracture. In ductile materials, a significant plastic zone may develop ahead of the crack tip due to localized yielding. The energy required to yield the material can increase the apparent resistance of a ductile material to fracture. Designing for brittle fracture, a relatively new science, started during World War II because of persistent catastrophic fracture of the welded steel hulls of Allied Liberty ships, which became brittle in the cold Atlantic Ocean. Fracture is now recognized as a major cause of failure in engineering hardware, such as turbine blades, airframe parts, bridges, building frames, electronic dies, glass and ceramic components, and so on. Quasi-brittleness can lead to failure in hardened metal alloys and ceramics. Thermoset polymers can also undergo extensive microcracking and crazing due to brittle cracking. Brittle fracture can also occur due to the formation of brittle intermetallics in otherwise ductile materials, such as solder. A failure criterion based on stress is infeasible because linear elastic analysis predicts infinite stresses at the tip of the flaw or crack, regardless of the magnitude of the far-field average or nominal stress. Hence, a new measure is required to quantify the severity of the stress field. This parameter, termed the stress intensity factor, indicates the intensity of the crack-tip stress field. Griffith postulated that catastrophic crack growth occurs when the energy required to create new free-crack surfaces in the fractured solid is less than the strain energy reduction in the solid due to changes in the crack length (Hertzberg 1989). 
The approach in fracture mechanics is to predict the level of far-field stress at which the crack will locally propagate. The stress intensity factor, $K$, used to characterize the intensity of the crack-tip stress field is defined in terms of the applied stress and the flaw size. For instance, in a plate of length 2h and width 2b, with a central crack of size 2a such that a << b (approximating an infinite plate), $K_I$ is
$$K_I = \sigma (\pi a)^{1/2} \tag{6.2}$$
where $\sigma$ is the applied far-field uniaxial stress.
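Equation 6.2, the product of the far-field stress and the square root of π times the half crack length, can be evaluated directly. The sketch below is a minimal illustration: it computes the stress intensity factor for a central crack in a wide plate and inverts the same relation to find the far-field stress at which the factor reaches an assumed critical value. The function names, units, and numbers are hypothetical and do not represent any particular material.

```python
import math


def stress_intensity(far_field_stress, half_crack_length):
    """K_I = sigma * sqrt(pi * a) for a central crack of length 2a in a wide plate.

    With stress in MPa and a in metres, the result is in MPa*sqrt(m).
    """
    if half_crack_length <= 0:
        raise ValueError("half crack length must be positive")
    return far_field_stress * math.sqrt(math.pi * half_crack_length)


def critical_far_field_stress(critical_k, half_crack_length):
    """Far-field stress at which K_I reaches an assumed critical value."""
    if half_crack_length <= 0:
        raise ValueError("half crack length must be positive")
    return critical_k / math.sqrt(math.pi * half_crack_length)


if __name__ == "__main__":
    # Hypothetical case: 100 MPa applied to a plate with a 2 mm central crack (a = 1 mm).
    print(stress_intensity(100.0, 0.001))
    # Hypothetical critical value of 30 MPa*sqrt(m) with the same flaw size.
    print(critical_far_field_stress(30.0, 0.001))
```

Because the stress intensity grows with the square root of the flaw size, halving the allowable stress corresponds to a fourfold increase in the tolerable crack length under this idealized geometry.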
The critical or threshold value of the stress intensity factor, at which the crack will propagate, is a measure of the material’s resistance to brittle fracture and is termed its fracture toughness. The fracture toughness depends on the orientation of the crack relative to the applied stress and is commonly characterized for three different fundamental fracture modes: the crack-opening mode, the shearing mode, and the tearing mode. Fracture toughness values for common engineering materials are listed in ASM handbooks. The common design approach in fracture mechanics analysis is to compute the critical far-field load, based on the assumption that a characteristic flaw is located at the highest stressed region in the component. Details and illustrative examples may be found elsewhere (Dasgupta and Hu 1992).
Ductile fracture, like brittle fracture, is an overstress failure mechanism. It requires nonlinear modeling methods because linear elastic theory of brittle fracture becomes inapplicable when there is large-scale plasticity at the crack tip. Ductile fracture can arise in many materials, such as aluminum, gold, copper, and solder, especially at high temperatures. Materials that behave in a brittle manner at relatively low temperatures and high strain rates can transition to ductile behavior at high temperatures and/or low deformation rates. The propagation of cracks in ductile materials requires higher energy because the inelastic deformation at the crack tip causes the material’s apparent fracture toughness to increase. The stress-intensity factor loses physical meaning in a nonlinear situation (except for cracks in power-law hardening materials, which can be characterized by a singular field) (Broek 1978) and can no longer be applied to characterize fracture toughness. Griffith’s energy concepts, however, still apply, and the energy required to propagate a crack can be measured and used to predict ductile fracture.
The basic concepts are discussed in this chapter; details are presented elsewhere (Dasgupta and Hu 1992). The most convenient energy formulation is given in terms of scalar conservation integrals, which are components of an energy-momentum tensor that defines the energy potentials for crack extension (Broek 1978). The conservation integrals also represent the energy release rate as the crack tip propagates in brittle materials, but not in ductile materials, because of inelastic work dissipation. Fracture toughness is characterized as a critical threshold value of the appropriate conservation integral. A commonly used conservation integral is Rice’s J-integral (Broek 1978). The J-integral can be suitably modified for dynamic crack propagation under shock loads, traction on crack faces (for example, fluid-filled cracks), large deformations at the crack tip, and creep-stress relaxation at the crack tip. Another popular method for characterizing resistance to fracture in ductile materials is by crack-tip opening displacement (CTOD). Inelastic deformations can cause blunting of the propagating crack tip, creating a nonsingular stress field that is usually analyzed by the Dugdale–Barenblatt equivalent sharp crack or by using the slip-line theory of plastic deformations in the case of gross yielding (Broek 1978). The CTOD can thus be analyzed, and the critical value at fracture is assumed to be indicative of the material’s resistance to fracture. Designing for stable crack propagation in ductile materials, which is extremely difficult because of inelastic work dissipation, is one of the areas of current research.
6.2.7 Interfacial De-Adhesion
Interfaces between dissimilar adhering materials can suffer adhesive failure under overstress loads. Examples include delaminations in composite materials and adhesion failures in bonded joints. Common examples in electronic packaging applications are failure at the interface of a die and the attached material, of a bond wire and the bond pad, and of the solder and the base material in a solder joint. The interfacial strength depends on the chemical and mechanical properties of the interface. Interfacial adhesive failures can occur in diffusion-bonded, adhesively bonded, welded, soldered, and brazed joints between dissimilar adherends. One of the factors enhancing interfacial adhesive strength between two dissimilar materials is interdiffusion. However, dissimilar interdiffusion rates for the two adherend materials can degrade the interfacial strength (see Section 6.2.11). Similarly, excessive intermetallic growth can cause a brittle interface of insufficient toughness.
The adhesive strength of an interface is a property measuring the maximum amount of mechanical work or energy that can be transferred across an interface before separation occurs (Hertzberg 1989). The work or energy of separating an interface between two materials includes the work done to overcome the energy of adhesion as well as the work required to deform, either elastically or inelastically, the two bulk phases. Thus, the interface between two brittle materials may have less toughness than the interface between two ductile materials. The total fracture energy, $G_f$, is
$$G_f = W_a + W_p \tag{6.3}$$
where $W_a$ is the reversible adhesion work, and $W_p$ is the irreversible work of inelastic deformation in the two phases. The energy required to cause elastic deformations in the two phases is small relative to $W_a$ and $W_p$ and is ignored in Equation 6.3. $W_a$ is defined as
$$W_a = \gamma_A + \gamma_B - \gamma_{AB} \tag{6.4}$$
where $\gamma$ is the interfacial or surface tension required to create a free surface at the interface and is defined as the rate of change in Gibbs free energy per unit of surface created, measured under conditions of constant temperature, pressure, and moles in the product. Subscripts A and B indicate surface tensions of the two respective phases; the subscript AB refers to the interfacial tension between the two phases.
Experimentalists often characterize interfacial bonding strength in terms of the electron binding energy between pairs of materials, which is a unique property of that pair. On a continuum length scale, the mechanical strength is characterized and measured in terms of the interfacial fracture toughness. This is a unique interfacial
© 2009 by Taylor & Francis Group, LLC
© 2009 by Taylor & Francis Group, LLC
HARDWARE RELIABILITY
107
property between any pair of materials, so the interfacial fracture toughness must be characterized for commonly used material pairs in hardware applications. 6.2.8
Fatigue
This is the first of the wearout failure mechanisms discussed in this chapter. Cyclic mechanical deformations (or strains) and loads (or stresses) in a material can cause eventual failure, even though the peak strains may never exceed the ultimate ductility (strain at failure) of the material. Such failure is due to the accumulation of incremental damage with each load cycle and is termed fatigue (Sandor 1972). Fatigue, a leading cause of failures in engineering hardware, is a wearout failure mechanism that includes the initiation and propagation of a crack. The crack typically develops at a point of discontinuity or defect in the material, which acts as a site of local stress and plastic strain concentration. Common examples are fatigue cracking in airframes, propulsion components, civil engineering structures such as bridges and buildings, metallization, and solder joints in electronic packages. Low-cycle fatigue (LCF) occurs within $10^3$ or $10^4$ load cycles due to large strain amplitudes. High-cycle fatigue (HCF) refers to failures occurring due to relatively low stress amplitudes at greater than $10^3$ or $10^4$ load cycles. The fatigue properties of a material can be characterized either by the stress-life (S–N) curve or by the strain-life curve for the material. These curves plot the stress or strain amplitude against the number of load reversals to failure. The total strain amplitude, $\Delta\varepsilon/2$, is related empirically to the cycles to failure, $N_f$, by the Coffin–Manson relation (Hertzberg 1989):
$$\frac{\Delta\varepsilon}{2} = \frac{\sigma'_f}{E_Y}(2N_f)^b + \varepsilon'_f (2N_f)^c \qquad (6.5)$$
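As a minimal sketch of how Equation 6.5 might be evaluated in practice, the Python fragment below computes the total strain amplitude for a given life and inverts the relation numerically to recover the life for a given amplitude. The function names and all material constants are hypothetical values chosen only for illustration; they are not taken from the handbook.

```python
import math


def strain_amplitude(n_f, sigma_f, e_mod, b, eps_f, c):
    """Total strain amplitude (delta_eps / 2) for n_f cycles, per Equation 6.5."""
    reversals = 2.0 * n_f
    elastic = (sigma_f / e_mod) * reversals ** b  # high-cycle (Basquin) term
    plastic = eps_f * reversals ** c              # low-cycle (ductility) term
    return elastic + plastic


def cycles_to_failure(amplitude, sigma_f, e_mod, b, eps_f, c, lo=0.5, hi=1e12):
    """Invert Equation 6.5 for N_f by bisection on log10(N_f).

    Bisection works because, for negative b and c, the strain amplitude
    decreases monotonically as N_f increases.
    """
    props = (sigma_f, e_mod, b, eps_f, c)
    if not strain_amplitude(hi, *props) <= amplitude <= strain_amplitude(lo, *props):
        raise ValueError("amplitude is outside the bracketed life range")
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    for _ in range(200):
        mid = 0.5 * (log_lo + log_hi)
        if strain_amplitude(10.0 ** mid, *props) > amplitude:
            log_lo = mid  # strain at mid is still too high, so the life is longer
        else:
            log_hi = mid
    return 10.0 ** (0.5 * (log_lo + log_hi))


# Hypothetical steel-like constants (stresses in MPa), for illustration only.
PROPS = dict(sigma_f=900.0, e_mod=200e3, b=-0.10, eps_f=0.60, c=-0.60)

amp = strain_amplitude(1.0e4, **PROPS)
life = cycles_to_failure(amp, **PROPS)
print(f"strain amplitude at 1e4 cycles: {amp:.5f}")
print(f"life recovered from that amplitude: {life:.1f} cycles")
```

The bisection is done on the logarithm of the life because fatigue lives span many decades; a root finder from a numerical library would serve equally well.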
This relationship is illustrated graphically in Figure 6.1 on a log–log scale. Each of the two terms on the right-hand side of Equation 6.5 is represented by a straight line on a log–log scale. The first term represents high-cycle fatigue, while the second term represents low-cycle fatigue; $\sigma'_f$ is the fatigue strength coefficient (true stress corresponding to fracture in one reversal); $E_Y$ is Young’s modulus; b is Basquin’s fatigue strength exponent (slope of the HCF line in Figure 6.1); $\varepsilon'_f$ is the fatigue ductility coefficient (the true strain corresponding to failure in one reversal); and c is the Coffin–Manson ductility exponent (slope of the LCF line in Figure 6.1). Rather than being independent parameters, b and c are both related to the cyclic strain hardening exponent, n, given in Equation 6.1. Thus, four material parameters are required to characterize a material’s fatigue behavior completely. Equation 6.5 holds for completely reversed cyclic loading. If the loading is not completely reversed, Foreman’s correction factor is used, and $\sigma'_f$ is replaced by ($\sigma'_f - \sigma_{mean}$), where $\sigma_{mean}$ is the mean of the applied cyclic stress. This fatigue law can also be used for variable history loading, as long as the load history is split into blocks so that, within each block, the mean stress and the stress amplitude are constant. Fatigue damage due to each stress block can be evaluated separately, and the cumulative damage can be estimated by using a superposition model such as Miner’s cumulative fatigue damage law (Sandor 1972).
Figure 6.1 The Coffin–Manson relationship. [Log–log plot of strain amplitude, Δε/2, against reversals to failure, 2Nf, showing the elastic line (intercept σ′f/E, slope b), the plastic line (intercept ε′f, slope c), and the total strain amplitude as their sum.]
Once initiated, a fatigue crack can propagate in a stable fashion under cyclic stresses until it becomes unstable under the applied stress amplitude. The fatigue crack propagation phase is important because it can act as a precursor to overstress failure at the most severe fatigue crack. The crack propagation rate, da/dN, becomes nonzero beyond a threshold value of the cyclic range of the stress intensity factor, ΔK. Crack propagation is initially characterized by a propagation rate that decreases with increasing ΔK, followed by a steady-state secondary phase in which the propagation rate is linearly proportional to ΔK on a log–log scale. During the tertiary and final phase, the propagation rate increases dramatically until catastrophic failure occurs, as shown in Figure 6.1. For design purposes, the steady-state (secondary) phase is usually most important. Steady-state fatigue crack propagation is modeled by the Paris power law for cyclic loading (Hertzberg 1989):
$$\frac{da}{dN} = \frac{A_c (\Delta K)^{p_c}}{(1 - R_o) K_{IC} - K_{max}}, \quad R_o \ge 0 \qquad (6.6)$$
where $A_c$ and $p_c$ are material constants. ASTM standard tests for characterizing these properties and values for many engineering materials are available in ASM handbooks. $R_o$ is given by
$$R_o = \frac{\sigma_{min}}{\sigma_{max}} \qquad (6.7)$$
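The crack-growth relations above lend themselves to numerical integration. The sketch below is a simplified illustration rather than the handbook's procedure: it keeps only the power-law numerator of Equation 6.6 (the classical Paris form, dropping the toughness-dependent denominator), assumes a stress intensity of the form K = σ√(πa) with a geometry factor of 1, and integrates from an initial flaw size to the critical size at which K_max reaches the fracture toughness. All names and numerical values are hypothetical.

```python
import math


def critical_crack_size(k_c, sigma_max):
    """Crack size at which K_max = sigma_max * sqrt(pi * a) reaches K_c."""
    return (1.0 / math.pi) * (k_c / sigma_max) ** 2


def paris_rate(a, delta_sigma, coeff, exponent):
    """Simplified crack growth rate da/dN = coeff * (delta K)^exponent."""
    delta_k = delta_sigma * math.sqrt(math.pi * a)  # assumes geometry factor of 1
    return coeff * delta_k ** exponent


def cycles_to_grow(a0, af, delta_sigma, coeff, exponent, n=2000):
    """Integrate dN = da / (da/dN) from a0 to af with Simpson's rule."""
    if n % 2:
        n += 1  # Simpson's rule needs an even number of intervals
    h = (af - a0) / n
    total = 0.0
    for i in range(n + 1):
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight / paris_rate(a0 + i * h, delta_sigma, coeff, exponent)
    return total * h / 3.0


# Hypothetical inputs: stresses in MPa, toughness in MPa*sqrt(m), lengths in m.
SIGMA_MAX, SIGMA_MIN = 100.0, 0.0
K_C, A0 = 50.0, 1.0e-3
COEFF, EXPONENT = 1.0e-11, 3.0

a_f = critical_crack_size(K_C, SIGMA_MAX)
n_p = cycles_to_grow(A0, a_f, SIGMA_MAX - SIGMA_MIN, COEFF, EXPONENT)
print(f"critical crack size: {a_f * 1e3:.1f} mm")
print(f"propagation life: {n_p:.3e} cycles")
```

For this simplified power law the integral also has a closed form, which makes a convenient check on the numerical result; with the full denominator of Equation 6.6 a numerical scheme of this kind would generally be needed.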
The number of cycles for a crack to propagate from initiation to failure, $N_p$, is obtained by integrating Equation 6.6 for increasing crack sizes. The limits of integration are from the initial flaw size, $a_o$ (smallest detectable flaw), to the largest allowable size, $a_f$, before the flaw becomes unstable under the applied stress. This is given as
$$a_f = \frac{1}{\pi}\left(\frac{K_c}{\sigma_{max}}\right)^2 \qquad (6.8)$$
where $K_c$ is the fracture toughness.
6.2.9 Creep
Some materials, such as thermoplastic polymers, solders, and many metals under mechanical stress at elevated temperature, can undergo a time-dependent deformation called creep. In reality, most deformations occur over a finite period. For convenience in mechanics modeling, deformations that occur over very short periods are treated as “instantaneous” and are termed elastic or plastic, depending on the reversibility of the deformation. Deformations requiring longer periods are termed “creep.” Creep deformations are classified as viscoelastic (or anelastic) or viscoplastic, depending on whether the deformations are reversible or not. Creep is a wearout failure mechanism that can cause functional failure due to excessive deformation or act as a precursor to creep rupture. Creep occurs due to dislocation climb mechanisms, polymer chain reorientation, grain boundary sliding (superplasticity), intragranular void migration (self-diffusion), and/or intergranular or transgranular void migration (grain-boundary diffusion). Different creep mechanisms can dominate at different temperatures within the same material, and sometimes more than one creep mechanism can occur simultaneously. In many materials, there is a stage of decreasing creep rate (primary creep), followed by a stage of constant creep rate (secondary creep), and, finally, a stage of increasing creep rate (tertiary creep). The designer must ensure that, over the life of the package, creep strain is within design constraints. At moderate stress levels, the creep strain, $\varepsilon_c$, due to steady-state (secondary) creep is generally expressed by Weertman’s creep law (Hertzberg 1989):
$$\varepsilon_c = C_t \sigma^{n_t} t \exp\left(-\frac{E_a}{k_B T}\right) \qquad (6.9)$$
where t is the elapsed time; $\sigma$ is the stress; $k_B$ is Boltzmann’s constant; T is the absolute temperature; and $E_a$ is the activation energy for the creep mechanisms.
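A short sketch may help show how the steady-state creep law of Equation 6.9 is used in a design check. The fragment below evaluates the accumulated creep strain and the time needed to reach an allowable strain. The function names and the constants (a solder-like stress exponent and activation energy) are hypothetical placeholders, not measured data.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def creep_strain(stress, time, temp_k, c_t, n_t, e_a):
    """Steady-state creep strain per Equation 6.9 (e_a in eV, temp_k in kelvin)."""
    return c_t * stress ** n_t * time * math.exp(-e_a / (K_B_EV * temp_k))


def time_to_strain(limit, stress, temp_k, c_t, n_t, e_a):
    """Time at which the creep strain reaches an allowable limit.

    Equation 6.9 is linear in time, so the limit is divided by the strain
    accumulated per unit time.
    """
    rate = creep_strain(stress, 1.0, temp_k, c_t, n_t, e_a)
    return limit / rate


# Hypothetical constants: stress in MPa, time in hours, e_a in eV.
PARAMS = dict(c_t=2.0e-3, n_t=5.0, e_a=0.5)

t_limit_cool = time_to_strain(0.01, stress=10.0, temp_k=323.0, **PARAMS)
t_limit_hot = time_to_strain(0.01, stress=10.0, temp_k=373.0, **PARAMS)
print(f"hours to 1% creep strain at 323 K: {t_limit_cool:.3e}")
print(f"hours to 1% creep strain at 373 K: {t_limit_hot:.3e}")
```

Because temperature enters through the exponential term, a modest temperature rise shortens the time to a given strain considerably, which is the reason temperature is treated as a primary design variable for creep.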
$C_t$, $n_t$, and $E_a$ are determined from laboratory tests for the materials under consideration and are available in engineering handbooks for many commonly used materials. Details and illustrative examples are presented in Li and Dasgupta (1993).
6.2.10 Wear
Wear is a wearout mechanism that is extremely important in all hardware that experiences impact from foreign particles or a sliding motion between surfaces in contact. For example, abrasive wear can occur due to continuous impingement of sand, water, or other foreign particles, causing gradual erosion; frictional wear can occur between gear teeth, sliding bearing surfaces, piston and cylinder assemblies, and so on, causing adhesive wear. In cooling ducts and other liquid-carrying hardware, wear can occur by liquid erosion as a result of cavitation. Adhesive wear can lead to pitting and galling phenomena. Wear not only is a failure mechanism in itself, but also can leave hardware vulnerable to subsequent corrosion and overstress failure. Adhesive wear between sliding surfaces is commonly described as
$$V = k F v t \qquad (6.10)$$
where V is the volume of material removed; F is the contact force; v is the relative velocity of sliding between the two mating surfaces; t is the time elapsed; and k is a material property. Details and illustrative examples are presented in Engel (1993).
6.2.11 Aging due to Interdiffusion
When two different materials are in intimate contact, molecules of one material can migrate into the other by diffusion, that is, the ability of a material to migrate within a second material by atomic motion. From an atomic perspective, diffusion is the migration of atoms between lattice sites. The atoms must have sufficient energy to break bonds and reform them at another lattice site. The diffusion rate is a characteristic material property and can be measured in the laboratory. The most common mathematical description of diffusional processes is expressed as Fick’s second law, given as
$$\frac{\partial C_d}{\partial t} = \nabla^2 (D_d C_d) \qquad (6.11)$$
where $C_d$ is the concentration of the diffusing species, $\nabla$ is the spatial gradient operator, t is the time, and $D_d$ is the diffusion coefficient. Values for $D_d$ can be characterized through well-defined laboratory tests and are listed in materials handbooks for many engineering materials.
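To make Fick's second law concrete, the sketch below advances a one-dimensional concentration profile with an explicit finite-difference scheme. It is a minimal illustration under stated assumptions: the diffusion coefficient is taken as constant (so Equation 6.11 reduces to the familiar form with the coefficient outside the Laplacian), the boundaries are sealed (zero flux), and all names and numbers are hypothetical.

```python
def diffuse_1d(conc, diff_coeff, dx, dt, steps):
    """Advance a 1-D concentration profile by explicit finite differences.

    Uses zero-flux (mirror) boundaries, so the total amount of the diffusing
    species is conserved. The explicit scheme is stable only when
    diff_coeff * dt / dx**2 <= 0.5, which is checked up front.
    """
    r = diff_coeff * dt / dx ** 2
    if r > 0.5:
        raise ValueError("time step too large for a stable explicit scheme")
    c = list(conc)
    n = len(c)
    for _ in range(steps):
        new = c[:]
        for i in range(n):
            left = c[i - 1] if i > 0 else c[i]       # mirror at the left wall
            right = c[i + 1] if i < n - 1 else c[i]  # mirror at the right wall
            new[i] = c[i] + r * (left - 2.0 * c[i] + right)
        c = new
    return c


# Hypothetical diffusion couple: species present only in the left half at t = 0.
initial = [1.0] * 25 + [0.0] * 25
final = diffuse_1d(initial, diff_coeff=1.0, dx=1.0, dt=0.25, steps=200)
print(f"concentration at left wall:  {final[0]:.4f}")
print(f"concentration at interface:  {final[24]:.4f}")
print(f"concentration at right wall: {final[-1]:.4f}")
```

The step profile represents two materials brought into contact; as time advances, the sharp interface spreads into a smooth concentration gradient while the total quantity of the species stays fixed.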
Diffusion phenomena in themselves are not intrinsic failure mechanisms. For example, diffusion is a beneficial mechanism for forming diffusion bonds. However, diffusion can act as a failure agent when, for example, the diffusing medium is a harmful or corrosive chemical or when diffusion leads to microstructural aging, detrimental creep deformation, metal migration, and unbalanced interdiffusion. When two materials have to be bonded, interdiffusion is important in forming a strong interfacial bond. However, if the effective diffusion rates for both materials are not equal, one of the materials may suffer a depletion of atoms, leading to Kirkendall voiding and loss of strength at the interface. A common example is the leaching of gold into aluminum in wirebonds of electronic devices, leading to “purple plague.” The interdiffusion constants for a given pair of materials must be similar in order to avoid this failure mechanism. Alternatively, excessive interdiffusion assisted by temperature over a period of time may cause growth of excessive intermetallics at the interface, causing a brittle interface with inadequate toughness. Interdiffusion is a time-dependent phenomenon and is therefore a wearout failure mechanism. Details and illustrative examples are presented in Li and Dasgupta (1994).
6.2.12 Aging due to Particle Radiation
Particle radiation is a common phenomenon in aerospace environments and in nuclear-power and particle-research establishments in terrestrial environments. Radiation damage includes both mechanical and electrical failures. The mechanical failure mechanism is typically an embrittlement aging phenomenon of the wearout type. Common examples include exposed hardware in space satellites, reactor vessels, and similar equipment. The electrical phenomenon is an overstress phenomenon that causes “soft errors” due to the passage of single radiation particles through large-scale integration (LSI)/VLSI electronic hardware (see Section 6.2.3). Radiation damage causes different aging in different types of materials. Radiation damage is a time-dependent wearout phenomenon and is of concern in metallic, ceramic, and polymeric materials. In metals and ceramics, radiation causes point defects, such as pairwise combinations of vacancies and interstitial atoms (Frenkel defects), by knocking atoms out of the lattice structure and lodging them in interstitial sites. These point defects cause embrittlement aging, which can be countered by annealing operations. More importantly, in electronic packaging applications, these defects can also alter thermal, optical, and electrical properties, thus impairing the operation of active devices. In polymeric materials, radiation aging is caused by breaks in polymer chains or changes in the degree of polymerization due to chain branching. Either of these can reduce the strength of the polymer. In its most common form, this can lead to photodegradation of polymers under prolonged exposure to UV radiation in strong sunlight. Stabilizers are sometimes needed to combat such wearout failures.
6.2.13 Other Forms of Aging
A variety of other forms of aging can alter a material’s performance over time. Examples include hydrogen embrittlement, thermally induced depolymerization, increased cross-linking leading to embrittlement in thermosetting polymers, and grain growth in crystalline materials. Detailed discussions of all these mechanisms are beyond the scope of this chapter.
6.2.14 Corrosion
Corrosion results from the chemical or electrochemical degradation of metals. Corrosion is a time-dependent wearout failure mechanism and can act as a precursor to subsequent overstress failure by brittle fracture or to subsequent wearout failure by fatigue-crack propagation. Corrosion can also alter the electrical and thermal behavior of materials at the microscale. The three most common forms of corrosion are uniform, galvanic, and pitting corrosion. The corrosion reaction rate depends on the material, the presence of an electrolyte, the presence of ionic contaminants, geometric factors, and local electrical bias.
Uniform corrosion is a chemical reaction occurring at the metal–electrolyte interface uniformly throughout the surface. The continuation of the corrosion process and its rate depend on the nature of the corrosion product. If the corrosion product is soluble in the electrolyte (e.g., water), it can be dissolved away, exposing fresh metal for further corrosion. On the other hand, if the corrosion product forms an insoluble, nonporous, adherent layer, it limits the rate of reaction and finally stalls the corrosion process.
Galvanic corrosion occurs when two or more different metals are in contact. Each metal is associated with a unique electrochemical potential. When two metals are in contact, the metal with the higher electrochemical potential becomes the cathode, and the other metal becomes the anode. The electrical contact between dissimilar metals leads to the formation of a galvanic cell. The rate of galvanic corrosion is governed by the rate of ionization at the anode (i.e., the rate at which anode material passes into solution), and this, in turn, depends on the difference in electrochemical potential between the two contacting metals. The conductivity of the corrosion medium affects both the rate and the distribution of galvanic attack.
In solutions of high conductivity, the corrosion of the more active alloy is dispersed over a relatively large area. In solutions with low conductivity, most of the galvanic attack occurs near the point of electrical contact between the dissimilar metals.
Pitting corrosion occurs at localized areas, causing the formation of pits. The corrosion conditions produced inside the pit accelerate the corrosion process. As the positive ions at the anode go into solution, they become hydrolyzed, producing hydrogen ions in the process. This increase in acidity in the pit destroys the adhering corrosion products, exposing more fresh metal to attack. Because the oxygen availability in the pit is low, the cathodic reduction reaction can occur only at the mouth of the pit, thus limiting lateral growth of the pit (Pecht and Ko 1990).
Surface oxidation, another common type of corrosion in metallic materials, is governed by the free energy of formation of the oxide. For example, there is a large driving force for the oxidation of aluminum and magnesium but much less of a force for copper, chromium, and nickel. Depending on the stoichiometry of the corrosion reaction, the type of the oxide formed can be either porous or densely packed. The oxide type frequently governs the subsequent rate of corrosion. For instance, a thick, nonporous oxide layer may act as a protective barrier and inhibit further corrosion by cutting off the oxygen supply to the surface, as with aluminum and stainless steel. Sometimes, the volume of the corrosion product (oxide) may be so much greater than that of the base material that the oxide layer peels off. Such scaling failure exposes the underlying metal to fresh attack.
Corrosion is a leading cause of damage and failure in engineering hardware, and the cost for prevention, cure, and replacement is significant.
6.2.15 Metal Migration
This wearout mechanism, important in electronic hardware, is driven by diffusion phenomena. There are many types of metal migration, including electromigration, cathodic dendritic growth, and conductive anodic filament (CAF) growth. Dendritic growth is essentially an electrolytic process in which the metal from the anode region migrates to cathodic areas. Metal migration leads to an increase in leakage current between the bridged regions or causes a short if complete bridging occurs (migrative resistance shorts). Although Ag migration has been most widely reported, depending on environmental conditions, many other electronic metals, like Pb, Sn, Ni, Au, and Cu, can also migrate (Dumoulin 1982). Because it is time dependent, this is a wearout mechanism. Metal migration is governed by the availability of metal, the presence of electrolytes such as condensed water and ionic species, and the existence of a voltage differential. Metals known to be susceptible to metal migration should be protected from water vapor and ionic contamination. Because the migration phenomenon is an electrolytic process, it is essential to have a conducting medium. Ionic species include impurities such as chlorides or products generated during corrosion. The driving force necessary to cause metal migration is the potential difference that exists when the electronic hardware is in a biased condition. Although the primary stress for this failure mechanism is an electrical potential gradient, it can be accelerated by secondary stresses such as moisture, ionic contaminants, and temperature.
6.3 LOADINGS, STRESSES, AND MATERIAL BEHAVIOR
Engineering hardware is subjected to a multitude of loads during manufacture, assembly, testing, rework and repair, transportation and storage, handling, and operation. Damage can occur and accumulate during any of these processes and affect the reliability of the hardware during operation. Quantifying the relevant stresses during these processes is thus necessary in order to understand and model reliability. Stresses are the stimuli that trigger the failure mechanisms discussed in Section 6.2. For the purpose of this discussion, stresses are the local measures of the intensity of load distributions at the failure site. For example, a cyclic mechanical force acting on a member can cause local stress concentrations at a hole, causing the initiation of a high-cycle fatigue crack. In this case, the applied far-field cyclic mechanical force is the load, and the amplitude of the force per unit area at the site of crack initiation is the stress. Obtaining the stresses caused by the loads requires some analysis of the component geometry, the material constitutive properties, and other boundary conditions on the component. In general, this involves solving an initial-boundary value
problem. Stress analysis can be performed by simple closed-form analytic means for simple geometries and linear material properties. However, for any complex situation or for rate-dependent, nonlinear material properties, computational schemes such as finite difference methods, finite element methods, boundary element methods, or other approximate numerical methods may be required. Accurate stress analysis forms one crucial step in reliability modeling. As discussed in Section 6.2, the types of loads that can cause stresses include magnitude and gradients of electrical currents and voltages, contaminant concentrations and their gradients, steady-state temperature, cyclic temperature variations, temperature transients, temperature gradients, humidity and moisture, mechanical static forces, dynamic vibrational or shock loads, electromagnetic fields, and particle radiation. Operational stresses constantly act on hardware elements, actuating various failure mechanisms and gradually stressing the component to failure. The operational stresses critical to the reliability of a specific component are determined by that component’s operational environment and the dominant failure mechanisms responsible for component failure. The critical stresses need to be identified for all hardware elements at potential failure sites and ranked according to their relative severities. Actually, a physical quantity could be either stress or loading, depending on the role it plays in a failure process. For example, in the failure mechanism of excessive device temperature (burning), temperature is a stress and appears in both the damage model and acceleration transform equation. However, temperature is a loading in thermally induced semiconductor chip cracking, in that it induces mechanical stress (Hu 1994). 
Traditionally, the stress analysis for obtaining the magnitude and the distribution of temperature in an electronic package is called thermal analysis, and the stress analysis for the magnitude and the distribution of mechanical stresses induced by temperature is called thermal stress analysis. The purpose of the analysis is to investigate the relationship between applied loading and stress with respect to given materials and geometry in an assembly. The stress at each site is obtained as a function of applied loads, including various environmental parameters, as well as geometry and material properties of the components and the module.
Good reliability prediction also requires that quantitative, scientific techniques be developed for realistically defining the product requirements and the design usage environment. The life-cycle application profile is a time-sequential listing of all the loads that can potentially cause failure. These loads constitute the parameters for quantifying the given application profile. For example, a flight application could be logged at a specified location and could involve engine warm-up, taxi, climb, cruising, high-speed maneuvers, gun-firing, ballistic impact, rapid descent, and emergency landing. This information is of little use to a hardware designer unless it is associated with the appropriate application load histories, such as acceleration, vibration, impact force, temperature, humidity, and electrical power cycle.
Among the most critical parameters governing reliability are the relevant material constitutive and damage properties over the entire range of loads anticipated during manufacturing, accelerated testing, storage, handling, and use. To obtain physics-based reliability predictions, these material properties and their variabilities must be described with suitable stochastic distribution functions. Unfortunately, information
regarding material properties is often grossly inadequate, which compromises the accuracy of reliability predictions. A wide variety of material properties can be found in handbooks of ASM International, in the materials selectors published annually by Penton Publishers as part of their Materials Engineering Series, and in numerous computerized databases such as CINDA (compiled by Purdue University). Often, manufacturers maintain their own material databases for relevant materials. The importance of adequate knowledge of material properties cannot be overemphasized; every reliability prediction team should allocate sufficient resources to collect, compile, and disseminate information about fundamental material properties.
6.4 VARIABILITIES AND RELIABILITY
In the previous sections, the parameters of any failure/damage model were listed as the material properties, the applied stresses, and the failure criteria. The stresses, in turn, depend on the loads, geometry, boundary conditions, and material constitutive properties. If all of these parameters could be modeled as deterministic variables, the predicted failure strength for overstress failure mechanisms and/or the predicted time to failure for wearout failure mechanisms would also be deterministic quantities. However, in real life the variabilities of the parameters must also be modeled. This is accomplished by using probability distributions with an estimated mean and standard deviation. For example, material constitutive and damage properties always have some variability due to microstructural and processing variability, geometric parameters always have some variability within specified tolerance ranges, and application-profile information must account for variations in load histories. Hence, the best a hardware designer can usually expect to achieve is to predict the probability of failure at a specified stress level or at a specified time, based on the appropriate failure mechanisms. This probability represents the unreliability of the device at any given time for a given application profile. In the following section we present a viable approach to computing reliability and present an example for illustrative purposes.
6.5 RELIABILITY PREDICTION TECHNIQUES
As discussed in the introduction, reliability predictions, based on the physics of failure, are important not only for proper design but also for a variety of related decisions, such as life-cycle logistics and cost assessments. Techniques for reliability prediction are often still subjective, due to the lack of adequate models for and data on failure mechanisms. Therefore, it is necessary to follow up every prediction with qualification testing for validation purposes. This section presents some reliability prediction schemes that are particularly well suited for incorporating physics-of-failure concepts, followed by a discussion of accelerated qualification testing methodology and the logistical aspects of reliability predictions. One possible approach to probabilistic modeling of overstress failure mechanisms for reliability assessment is the stress–strength interference theory (for detailed discussions, see, for example, Kapur and Lamberson, 1977).
Figure 6.2 Basic structured programming constructs.
In this approach, the applied stress is expressed as a continuous stochastic variable with a specified probability density function (pdf). The mean and standard deviation are estimated as well as possible from the loads specified in the application profile, based on stress analysis that includes the effects of geometric tolerances and material constitutive property variability at the failure site. Similarly, strength is assumed to be a continuous stochastic variable represented as a pdf, based on the mean and scatter of the damage properties of the material at the failure site. Obtaining accurate estimates for mean and standard deviation values for these pdfs is extremely difficult in reality. Figure 6.2 illustrates the two pdfs for stress and strength, respectively: $f_s(s)$ is the pdf for stress and $f_S(S)$ is the pdf for strength. The reliability, R, is the probability that the strength, S, exceeds stress, s, for all possible stress values in the distribution and is given as (Kapur and Lamberson 1977)
$$R = \int_{-\infty}^{\infty} f_s(s) \left[ \int_{s}^{\infty} f_S(S)\, dS \right] ds \qquad (6.12)$$
Alternatively, R is the probability that the stress, s, remains below the strength, S, for all possible strength values in the distribution:
$$R = \int_{-\infty}^{\infty} f_S(S) \left[ \int_{-\infty}^{S} f_s(s)\, ds \right] dS \qquad (6.13)$$
The unreliability, Q, is then obtained as Q = 1 – R.
Alternatively, a new function can be defined as U = S – s. Reliability, R, is the probability that U is positive. Thus, if s and S are statistically independent,
$$R = \int_{0}^{\infty} f_U(y)\, dy \qquad (6.14)$$
where $f_U$ is the pdf of U.
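The equivalence of these formulations can be checked numerically for a simple case. The sketch below assumes, purely for illustration, that stress and strength are independent and normally distributed; the interference integral of Equation 6.12 is then evaluated by the trapezoidal rule and compared with the closed-form result for the margin U = S – s, which is itself normal under this assumption. The function names and the numerical values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Probability density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative distribution of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def reliability_interference(stress_mean, stress_sd, strength_mean, strength_sd,
                             n=4000, span=8.0):
    """Equation 6.12 by the trapezoidal rule.

    The inner integral (probability that strength exceeds a given stress)
    is 1 - F_S(s); the outer integral runs over the stress distribution,
    truncated at +/- span standard deviations.
    """
    lo = stress_mean - span * stress_sd
    hi = stress_mean + span * stress_sd
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        s = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        survive = 1.0 - normal_cdf(s, strength_mean, strength_sd)
        total += weight * normal_pdf(s, stress_mean, stress_sd) * survive
    return total * h


def reliability_margin(stress_mean, stress_sd, strength_mean, strength_sd):
    """Equation 6.14 in closed form for independent normal stress and strength."""
    margin_mean = strength_mean - stress_mean
    margin_sd = math.hypot(strength_sd, stress_sd)
    return normal_cdf(margin_mean / margin_sd, 0.0, 1.0)


# Hypothetical values in MPa: stress 350 +/- 30, strength 500 +/- 40.
r_numeric = reliability_interference(350.0, 30.0, 500.0, 40.0)
r_closed = reliability_margin(350.0, 30.0, 500.0, 40.0)
print(f"R from Equation 6.12 (numerical): {r_numeric:.6f}")
print(f"R from Equation 6.14 (closed form): {r_closed:.6f}")
print(f"unreliability Q = 1 - R: {1.0 - r_closed:.6f}")
```

The numerical route carries over unchanged to other distributions once the pdf and cdf functions are replaced, whereas the closed-form shortcut holds only for the normal case assumed here.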
The limitations of the stress–strength integration method lie in the practical difficulty of obtaining accurate probability density functions. Further, the accuracy of the reliability prediction depends on the accuracy of the tails of the probability density functions, where confidence is lowest. The problem can be somewhat alleviated by using extreme-value distributions for the probability density functions. Stress–strength interference theory is usually applicable to overstress failure mechanisms. A slightly different approach is required to apply probabilistic methods to wearout failure mechanisms. One approach is to use the time interface model (sometimes called the strength degradation model; see Chapter 4). Another alternative is to replace the distributions of stress and strength by a single nondimensional damage curve, denoted by fD(D) in Figure 6.3. The damage parameter, DM, is a time-dependent parameter, monotonically increasing with elapsed time (or with load cycles experienced, in the case of fatigue loads). The instantaneous value of DM is based on the stress history, in conjunction with suitable definitions of the damage parameter and damage accumulation. For example, in the case of high-cycle fatigue damage accumulation, the damage parameter, DM, per load cycle is customarily defined as the reciprocal of the mean fatigue life, Nf, expected under the applied stress amplitude, Δσ, according to the Coffin–Manson law (Equation 6.5). Stochastic modeling techniques are used to represent the damage curve in terms of the uncertainties in the parameters that determine the amplitude of applied stress, Δσ, and the uncertainties in the material damage constants, σf and b, in the HCF regime of the Coffin–Manson law. Again, determining the actual mean and standard deviation of the damage curve may be extremely difficult in practice. Miner’s law (Sandor 1972) provides a simple linear damage superposition law for computing accumulation of damage under complex load histories. In simple materials that do not exhibit healing phenomena such as annealing, the damage curve migrates monotonically to the right with increasing damage accumulation. Thus, at any given point of time in the fatigue load history, there is a unique damage curve. For simplicity, analysts often assume that the mean, μD, is a function of time but that the variance of DM is time invariant. On this nondimensional damage scale, failure corresponds to DM > 1, so reliability, R, is given by the probability that DM < 1. Thus, the reliability at time t = t* is

$$R(t = t^*) = \int_{-\infty}^{1} \left[ f_D(x) \right]_{t = t^*} \, dx \tag{6.15}$$

Figure 6.3 The distribution of stress and strength. (Probability density versus damage, D: the “damage” curve f(D, t) at time t moves toward the “endurance” line.)
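As a rough illustration of Equation 6.15, the following sketch evaluates the probability that the damage parameter is below 1 after a given number of cycles. It assumes, purely for illustration, that damage is normally distributed, that its mean grows linearly with cycles (Miner's rule at constant amplitude), and that its standard deviation is time invariant, as in the simplification described above. The function names and all numerical values are hypothetical.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def damage_reliability(n_cycles, mean_life, damage_std):
    """Reliability R = P(D_M < 1) after n_cycles.

    Assumptions (illustrative, not taken from the source):
    - damage is normally distributed,
    - mean damage grows as n_cycles / mean_life (linear accumulation),
    - the standard deviation of damage does not change with time.
    """
    mean_damage = n_cycles / mean_life
    return normal_cdf(1.0, mean_damage, damage_std)


if __name__ == "__main__":
    # Hypothetical inputs: mean life of 10,000 cycles, damage std of 0.2.
    for n in (2_000, 6_000, 10_000, 14_000):
        r = damage_reliability(n, mean_life=10_000, damage_std=0.2)
        print(f"cycles = {n:6d}  R = {r:.4f}")
```

In this sketch the reliability falls through 0.5 when the mean damage reaches 1, which is the point at which the mean of the damage curve crosses the endurance line.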
Other models for reliability prediction are discussed in Chapter 4 and in a variety of other references (Haugen 1980; Kapur and Lamberson 1977; Lewis 1987). In general, the computation of reliability involves dealing with functions of multiple random variables. This cannot be accomplished in closed form for any but the simplest cases, and numerical schemes such as Monte Carlo methods may be required. A relatively new technique that may offer another numerical solution method is stochastic finite element analysis (Ghanem and Spanos 1991). A sample case study, using simple closed-form stress analysis, is presented in the following section, where the failure mechanisms in a wirebond assembly in an electronic device are identified, and models are presented to predict reliability.
6.6 CASE STUDY: WIREBOND ASSEMBLY IN MICROELECTRONIC PACKAGES
This sample case study illustrates the implementation of a probabilistic physics-of-failure approach in reliability prediction and modeling. The hardware is a bond wire-and-wirebond assembly in a microelectronic package. The potential failure mechanisms are identified and quantitative models are adopted. Thermomechanical stresses due to applied thermal cyclic loads are analyzed using a strength-of-materials approach, sample variability in material properties is assumed, and reliability is estimated for an anticipated fatigue life of 10^4 thermal cycles. Only the highlights of the analysis are presented; details are presented elsewhere (Hu, Pecht, and Dasgupta 1991).

6.6.1 Failure Mechanisms and Stress Analysis
Fatigue failure of a wirebond under thermal cycles occurs predominantly as a result of repeated flexure of the wire, repeated shear stress generated between the bond pad and the wire, repeated shear stress generated between the bond pad and the substrate, and repeated axial stress in the wire (Pecht, Dasgupta, and Lall 1989). To model these failure mechanisms better, the relationships between strain cycles and thermal cycles are derived to obtain the fatigue life and reliability.

6.6.1.1 Wire Flexure

As the temperature changes, the wire expands and contracts, and the wirebond undergoes flexural fatigue. The differential thermal expansion between the wire and the substrate due to temperature cycling causes flexure of the wire, produces stress reversal at the heel of the bond in wedge bonds and stitch bonds, and causes eventual fatigue failure of the wire. Figure 6.4 illustrates a wedge bond deformation due to thermal cycling. Because the cross-section of the wire is reduced near the bond site for wedge and stitch bonds, stress concentrations occur at the heel, making this a probable site for failure due to wire flexing. A wedge-bonded wire is modeled as a beam under pure bending due to thermal expansion. Using the theory of curved beams and the simple theory of linear elasticity, the cyclic amplitude of bending stress at the heel of the wire due to thermal expansion during a temperature cycle, ΔT, is (Hu et al. 1991)
Figure 6.4 A wedge bond deformation due to thermal cycling. (Schematic with labels X, Y, p, L, E, I, 2D, h, T, and T0.)
$$\sigma = 6 E_w \frac{r}{D_s} \left( \frac{L}{D_s} - 1 \right)^{1/2} \left[ 2\alpha_s + \frac{\alpha_s - \alpha_w}{1 - D_s/L} \right] \Delta T \tag{6.16}$$
where αw and αs are the thermal expansion coefficients of the wire and substrate materials, respectively; Ew is the Young’s modulus of the wire material; r is the radius of the wire; L is the length of the wire; and Ds is the span between the bonds at the two ends of the wire. The number of cycles to failure, Nf, in flexure is related to the stress range in fatigue using the Coffin–Manson equation (Section 6.2.8):

$$N_f = C_w \, \sigma^{-m_w} \tag{6.17}$$

where σ is given by Equation 6.16 and Cw and mw are fatigue properties determined by tensile fatigue tests of the wire material.

6.6.1.2 Shear of Bond Pad
As the temperature changes, bimetal bonds experience shear stresses as a result of differential thermal expansion between the wire and bond pad or between the bond pad and the substrate, as illustrated schematically in Figure 6.4. Because the bond pad is typically very thin compared with the substrate, it is modeled as an interlayer between the wire and the substrate. Neglecting all bending deformations and using a shear lag model (Jones 1975), a first-order approximation can be obtained for the shear stress distribution in the bond pad as (Hu et al. 1991)
$$\tau(x) = \frac{G_p \, \Delta T}{b_p Z} \left\{ (\alpha_w - \alpha_s) + \frac{\alpha_s - \alpha_p}{1 + (E_s A_s)/[E_p A_p (1 - \nu_s)]} \right\} \frac{\sinh(Zx)}{\cosh(Z l_w)} \tag{6.18}$$

where G is the shear modulus; E is the Young’s modulus; ν is the Poisson’s ratio; α is the coefficient of thermal expansion; subscript p refers to the bond pad, s to the substrate, and w to the wire; bp is the thickness and Ap the cross-sectional area of the bond pad; lw is the wire length; and Z is the eigenvalue expressed in terms of material properties and geometry of the wirebond assembly (Hu et al. 1991). Equation 6.18 shows that the shear stress in the pad is a function of position along the interface layer; it is maximum at x = lw and decreases toward the center of the bond-pad wire or bond-pad substrate interface. Notice that Zlw >> 1, so that tanh(Zlw) ≈ 1. Therefore, the amplitude of maximum shear stress due to the temperature cycles, ΔT, at the critical point x = lw, is
$$\tau_{\max} = Q \, \Delta T \tag{6.19}$$

where

$$Q = \frac{G_p}{b_p Z} \left\{ (\alpha_w - \alpha_s) + \frac{\alpha_s - \alpha_p}{1 + (E_s A_s)/[E_p A_p (1 - \nu_s)]} \right\} \tag{6.20}$$
Once the maximum amplitude of the shear stress is determined, the number of cycles to shear fatigue failure of the bond-pad material can be predicted by the Coffin–Manson equation for HCF:

$$N = C_p' \, \tau_{\max}^{-m_p'} \tag{6.21}$$

where Cp′ and mp′ are the experimentally determined shear fatigue properties of the bond-pad material. Values for common engineering materials are listed in engineering handbooks.

6.6.1.3 Shear of Wire and Substrate
Using arguments similar to those for Equation 6.18, the maximum shear stresses in the wire and substrate can also be obtained as (Hu et al. 1991)

$$\tau_{w,\max} = \left\{ \frac{r_w^2 Q^2}{4 Z^2 A_w^2} \left[ \frac{\cosh(Z x_w)}{\cosh(Z l_w)} - 1 \right]^2 + Q^2 \frac{\sinh^2(Z x_w)}{\cosh^2(Z l_w)} \right\}^{1/2} \Delta T \tag{6.22}$$

$$\tau_{s,\max} = \left\{ \left[ \frac{W_p Q}{2 Z A_s} \left( 1 - \frac{\cosh(Z x_s)}{\cosh(Z l_s)} \right) - \frac{\alpha_s - \alpha_p}{(1 - \nu_s)/E_s + A_s/(E_p A_p)} \right]^2 + Q^2 \frac{\sinh^2(Z x_s)}{\cosh^2(Z l_s)} \right\}^{1/2} \Delta T \tag{6.23}$$
where xw = ±arctanh(Aw/rw); xs = ±arctanh(As/rs); rw is the radius of the bond wire; Aw is the cross-sectional area of the wire; and As = bs(ws + wp)/2. All other terms are defined in Figure 6.4. The number of cycles to shear fatigue failure of the wire and substrate materials can also be modeled by the Coffin–Manson equation, which gives

$$N = C_w' \, \tau_{w,\max}^{-m_w'} \tag{6.24}$$

$$N = C_s' \, \tau_{s,\max}^{-m_s'} \tag{6.25}$$
where Cw′, mw′, Cs′, and ms′ are the shear fatigue properties of the wire and substrate materials, obtained either from controlled fatigue tests in the laboratory or from engineering handbooks. As with most material properties, the variability must be characterized with a measured mean value and a standard deviation. There are seldom adequate property data in the literature to obtain realistic estimates of the standard deviation.

6.6.1.4 Axial Tension of Wire

In plastic-encapsulated packages where the encapsulant surrounds and interfaces with the wire, temperature cycling of the component produces differential expansion between the wire and encapsulant. Repeated temperature fluctuations can cause axial fatigue of the wire. The total axial deformation in the wire and encapsulant is the deformation due to the rise of temperature, ΔT, plus the deformation due to the mechanical forces experienced by the wire and the encapsulant. The compatibility condition requires that the total deformation in the wire and encapsulant be equal. Using this condition in conjunction with the equilibrium of the assembly, the approximate axial stress is (Hu et al. 1991)
$$\sigma_w = E_w (\alpha_e - \alpha_w) \, \Delta T \tag{6.26}$$
Substituting Equation 6.26 into the Coffin–Manson equation, the number of cycles to failure is

$$N = C_w \, \sigma_w^{-m_w} \tag{6.27}$$
where Cw and mw are defined in Equation 6.17. The five fatigue failure mechanisms cited before act simultaneously. Based on these mechanisms, the fatigue lifetime, Ni, for each failure mode can be written generically as

$$N_i = C_i \, S_i^{-m_i}, \qquad i = 1, \ldots, 5 \tag{6.28}$$
where Ni is the number of cycles to failure for the ith failure mechanism. C1 = C5 = Cw and m1 = m5 = mw are determined from the tensile fatigue test; C2 = Cp′, C3 = Cw′, C4 = Cs′ and m2 = mp′, m3 = mw′, m4 = ms′ are determined from the shear fatigue tests. S1 = σ is given by Equation 6.16, S2 = τmax by Equation 6.19, S3 = τw,max by Equation 6.22, S4 = τs,max by Equation 6.23, and S5 = σw by Equation 6.26. Because the occurrence of any of the five mechanisms constitutes a failure of the wirebond, these mechanisms are assumed to act in series, in terms of a logic diagram. If Si and the mean values of Ci and mi (i = 1,…, 5) are known, the mean number of cycles to failure (which is a constant for each mechanism) can be estimated by Equation 6.28. The damage mechanism corresponding to the shortest life is considered the dominant failure mechanism. In wirebond failure, the dominant mechanism depends on the operating environment, the fatigue properties of the bonding materials, and the bonding conditions in manufacture. The dominant mechanism can shift, depending on the desired reliability level.
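The ranking of mechanisms by mean life can be sketched in a few lines. The example below assumes the inverse power-law reading of Equation 6.28, N_i = C_i S_i^(-m_i), and the axial stress form of Equation 6.26. All function names, material constants, and stress values are hypothetical placeholders chosen only to make the sketch run; they are not data from Hu et al. (1991).

```python
def cycles_to_failure(c, m, s):
    """Mean fatigue life for stress range s, assuming N = C * S**(-m)."""
    return c * s ** (-m)


def axial_stress(e_wire, alpha_encapsulant, alpha_wire, delta_t):
    """Axial wire stress from differential expansion (form of Equation 6.26)."""
    return e_wire * (alpha_encapsulant - alpha_wire) * delta_t


def rank_mechanisms(properties, stresses):
    """Return the name of the shortest-lived mechanism and all mean lives.

    properties maps a mechanism name to its (C, m) pair;
    stresses maps the same names to the stress range S.
    """
    lives = {name: cycles_to_failure(c, m, stresses[name])
             for name, (c, m) in properties.items()}
    dominant = min(lives, key=lives.get)
    return dominant, lives


if __name__ == "__main__":
    delta_t = 100.0  # temperature range in degrees C (hypothetical)
    # Hypothetical (C, m) pairs; stresses are in MPa.
    properties = {
        "wire flexure": (1.0e15, 5.0),
        "bond-pad shear": (1.0e14, 5.0),
        "wire shear": (1.0e14, 5.0),
        "substrate shear": (1.0e16, 5.0),
        "axial tension": (1.0e15, 5.0),
    }
    stresses = {
        "wire flexure": 60.0,
        "bond-pad shear": 30.0,
        "wire shear": 20.0,
        "substrate shear": 25.0,
        "axial tension": axial_stress(70.0e3, 20.0e-6, 14.0e-6, delta_t),
    }
    dominant, lives = rank_mechanisms(properties, stresses)
    for name, life in sorted(lives.items(), key=lambda item: item[1]):
        print(f"{name:16s} N = {life:.3e} cycles")
    print("dominant mechanism:", dominant)
```

Because only mean values are used here, the ranking says nothing about scatter; the dominant mechanism at a high reliability level may differ from the one with the shortest mean life.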
6.6.2 Stochastic Modeling of Variabilities and Reliability
The most common fatigue tests are those with constant amplitude stress, specified in terms of the mean stress and a stress range. Variability occurs in the fatigue life data, even under very carefully controlled testing conditions. Therefore, a complete description of sufficient test data should include the interrelationship among the survivor function (reliability), R; the stress range, Si; and the fatigue life, N. Generally, the material fatigue strength can be described by a reliability function that depends on both the stress level and fatigue life:

$$R = F(S_i, N) \tag{6.29}$$
For a given stress level, the log-normal and Weibull distributions are often used for fatigue analysis. The mean life, μN, and the standard deviation of the life, σN, which measures dispersion about the mean life, can be estimated from test data using the method of maximum likelihood estimation for a given stress range. The coefficient of variation, σN/μN, is a dimensionless measure of variability. Use of the log-normal distribution has been defended primarily on grounds of mathematical expediency, although it suggests the possibility of a decreasing hazard function, which contradicts the observed phenomena. The postulate that cracking results from a large number of independent events, coupled with the central limit theorem, lends some theoretical justification. The Weibull distribution arises from a “weakest link” hypothesis of failure and leads to a monotonically increasing hazard function with time, which agrees with the physical implication of progressive deterioration resulting from the fatigue process. Equation 6.29 can also be written in the form of Basquin’s equation:

$$N_i(p, S) = C_{fi}(p) \, S_i^{-m_i(p)} \tag{6.30}$$
where Cfi and mi are random variables describing the fatigue behavior of the wirebond materials. Therefore, the uncertainties of material fatigue properties are included in the probability distributions of the random variables Cfi (written Ci in what follows) and mi. Based on this, the fatigue strength of a material is often expressed by a p–S–N diagram, a family of S–N curves, as shown in Figure 6.5. These curves represent the high-cycle fatigue portion of the Coffin–Manson plot presented in Figure 6.1. The standard deviation of fatigue life, σN, increases with the decrease in the stress range in constant amplitude fatigue tests; σN decreases with the decrease in the stress range in random amplitude tests. For simplicity, mi is often considered a constant because, for some kinds of material, the p–S–N curves are almost parallel. In this case, from Equation 6.30, Ci may be considered to be a Weibull or log-normal distributed random variable.

Figure 6.5 A family of S–N curves. (Stress range, log S, versus cycles to failure, log N, in the high-cycle fatigue regime, with the mean S–N curve marked.)

From Section 6.6.1, the stress range Si at the critical failure site, due to the temperature cycles, can be derived for each failure mechanism as

$$S_i = H_i \, \Delta T \tag{6.31}$$
where Hi is a generic term to symbolize the coefficients of ΔT in the failure models of Section 6.6.1. Using Equation 6.31, Equation 6.30 can be written as

$$N_i = C_i \, H_i^{-m_i} \, (\Delta T)^{-m_i} \tag{6.32}$$
Hi is a function of other elementary variables, such as the mechanical and thermal constitutive properties of the materials and the geometry of the wirebond. If any of these variables is modeled as a random variable, Hi is a random variable. Hi reflects the variation of the stress for a unit change of temperature. In order to predict accurately the lifetime and reliability of a wirebond, measurements and statistical analysis of geometric parameters, such as r, D, and L/D, and of material properties, such as Ew, Ep, Es, αw, αs, and αp, are necessary to determine the distributions of Hi based on formulations for the mean value and standard deviation of a multi-random-variable function. The practical limitation of this approach lies in the difficulty of obtaining accurate measurements of the standard deviation; experimental research should be aimed at addressing this issue. Generally speaking, ΔT should also be modeled as a random variable because, in many cases, the temperature cycles during use vary substantially. The uncertainties in temperature cycles arise from the variations in environmental and usage circumstances described by a service history. From the viewpoint of the user, environment is important because it may affect safety; from the viewpoint of the manufacturer, usage is also important because the device should be suitable for different field conditions, with a given reliability at the lowest cost. Assessments of wirebond reliability should consider the uncertainties stemming from both sources; the uncertainty from the first source is directly related to the fatigue damage process, and the uncertainty from the second source can be evaluated via statistical analysis by device users. The temperature range, ΔT = ΔT(t), for a given environmental circumstance can be treated as a stochastic process composed of an ensemble of possible temperature–time histories.
At any time, t, ΔT is a random variable, and any measurement of temperature–time histories is a sample function. For example, Figure 6.6 shows a typical sample function of an air-launched weapon as a function of application time (Hu et al. 1991). The dotted line is the flight height; the solid line is the temperature.

Figure 6.6 A typical sample function of an air-launched weapon as a function of mission time. (Temperature, T (°C), from –40 to 40, and flight height, h (km), versus time, t (minutes), from 0 to 120.)

Because the temperature range affects the degree of fatigue damage, an acceptable cumulative fatigue damage theory is needed to determine the equivalent temperature range for a given stochastic process of ΔT(t). In general, field-data collection under service conditions is necessary to determine the distribution, statistical parameters, and other properties of the stochastic process of ΔT(t). Hu et al. (1991) use a three-term decomposition:

$$T = u_y + Y_c \cos(2\pi f_y t) + Z(t) \tag{6.33}$$
where uy is the annual mean temperature; Yc is the amplitude of the seasonal cycle, which is a random variable with a Rayleigh distribution; fy is the frequency of the annual wave; t is the time parameter; and Z(t) is characterized by the power spectrum that reflects the temperature variation:

$$f(\Delta T) = \frac{\beta}{\eta} \left( \frac{\Delta T - \gamma}{\eta} \right)^{\beta - 1} \exp\left[ -\left( \frac{\Delta T - \gamma}{\eta} \right)^{\beta} \right] \tag{6.34}$$
where β = 2.7, η = 109, and γ = 0. To describe the uncertainty caused by different device users, a statistical data analysis based on market prediction is needed. Based on some sample data, Hu et al. (1991) considered a sample Weibull distribution. For convenience, the data are fit with a Gaussian distribution:
$$f(\Delta T) = \frac{1}{95.6} \exp\left[ -2 \left( \frac{\Delta T - 96}{76} \right)^2 \right] \tag{6.35}$$
where the mean value of ΔT is 96°C, and the standard deviation is 38°C. If the probable numbers of a device needed by various users are evaluated from market prediction and the typical operational temperature range for each user is reasonably well known, the distribution of ΔT due to the second type of uncertainty can be estimated by the goodness-of-fit test method. The uncertainties involved in thermal fatigue of wirebond arise from the behavior of wirebonding materials, the dimensions of assembly, and the temperature variation. In order to predict the fatigue life accurately and analyze reliability, these uncertainties must be taken into account.
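The quoted Weibull parameters and the Gaussian fit can be compared numerically. The sketch below, with hypothetical function names, evaluates a three-parameter Weibull density in the form of Equation 6.34 and computes its mean and standard deviation from the standard gamma-function expressions, for comparison with the 96°C mean and 38°C standard deviation quoted for the Gaussian fit. The Gaussian is written in its textbook form rather than copied from Equation 6.35, which is an assumption of this sketch.

```python
import math


def weibull_pdf(dt, beta, eta, gamma=0.0):
    """Three-parameter Weibull density (form of Equation 6.34)."""
    if dt <= gamma:
        return 0.0
    z = (dt - gamma) / eta
    return (beta / eta) * z ** (beta - 1.0) * math.exp(-(z ** beta))


def weibull_mean_std(beta, eta, gamma=0.0):
    """Mean and standard deviation of a three-parameter Weibull variable."""
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    return gamma + eta * g1, eta * math.sqrt(g2 - g1 * g1)


def gaussian_pdf(dt, mean, std):
    """Standard normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((dt - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    beta, eta, gamma = 2.7, 109.0, 0.0  # parameters quoted in the text
    mean, std = weibull_mean_std(beta, eta, gamma)
    print(f"Weibull mean = {mean:.1f} C, std = {std:.1f} C")
    for dt in (25.0, 50.0, 100.0, 150.0, 200.0):
        print(f"dT = {dt:5.1f}  Weibull pdf = {weibull_pdf(dt, beta, eta, gamma):.5f}"
              f"  Gaussian pdf = {gaussian_pdf(dt, 96.0, 38.0):.5f}")
```

Comparing the two densities point by point shows where the symmetric Gaussian departs from the skewed Weibull, which matters most in the upper tail, where the largest temperature ranges occur.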
6.6.3 Fatigue Lifetime and Reliability Prediction
Equation 6.32 models the fatigue life of a wirebond for the ith failure mechanism as a four-variable random function:

$$N_i = C_i \, (H_i \, \Delta T)^{-m_i} \tag{6.36}$$
where Hi and ΔT are statistically independent of the other two variables, but Ci and mi are correlated because they are obtained from the same fatigue test data. For convenience of illustration, ΔT is assumed to have a specified value, and mi is assumed to be a constant. Using the theory of multiple random variables (see Chapters 2 and 3), we obtain:

$$\mu_{N_{ai}} = \mu_{C_i} \left( \mu_{H_i} \, \Delta T \right)^{-m_i} \tag{6.37}$$

$$\sigma_{N_{ai}}^2 = \left( \frac{\mu_{N_{ai}}}{\mu_{C_i}} \right)^2 \sigma_{C_i}^2 + m_i^2 \left( \frac{\mu_{N_{ai}}}{\mu_{H_i}} \right)^2 \sigma_{H_i}^2 \tag{6.38}$$

Thus, the mean, μNai, and standard deviation, σNai, of the anticipated fatigue life, Ni, can be obtained when the means and standard deviations of the remaining parameters are known.
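A first-order propagation of this kind can be checked against direct sampling. The sketch below assumes the life model N = C (H ΔT)^(-m) with ΔT and m fixed and with C and H independent; it computes the first-order mean and standard deviation in the form of Equations 6.37 and 6.38 and compares them with a simple Monte Carlo estimate. Normal distributions for C and H, the function names, and all numerical values are assumptions made for illustration only.

```python
import random
import statistics


def first_order_life_moments(mu_c, sigma_c, mu_h, sigma_h, delta_t, m):
    """First-order mean and std of N = C * (H * dT)**(-m).

    dT and m are treated as fixed; C and H are independent.
    """
    mu_n = mu_c * (mu_h * delta_t) ** (-m)
    var_n = ((mu_n / mu_c) ** 2 * sigma_c ** 2
             + (m * mu_n / mu_h) ** 2 * sigma_h ** 2)
    return mu_n, var_n ** 0.5


def monte_carlo_life_moments(mu_c, sigma_c, mu_h, sigma_h, delta_t, m,
                             n_samples=100_000, seed=1):
    """Sample mean and std of N, drawing C and H from normal distributions.

    The normal assumption is illustrative; small coefficients of variation
    keep the sampled values of C and H positive in practice.
    """
    rng = random.Random(seed)
    lives = []
    for _ in range(n_samples):
        c = rng.gauss(mu_c, sigma_c)
        h = rng.gauss(mu_h, sigma_h)
        lives.append(c * (h * delta_t) ** (-m))
    return statistics.fmean(lives), statistics.stdev(lives)


if __name__ == "__main__":
    # Hypothetical inputs: 10% scatter in C, 3% scatter in H.
    args = dict(mu_c=1.0e15, sigma_c=1.0e14, mu_h=1.0, sigma_h=0.03,
                delta_t=100.0, m=5.0)
    print("first order :", first_order_life_moments(**args))
    print("Monte Carlo :", monte_carlo_life_moments(**args))
```

The two estimates agree closely only while the coefficients of variation are small; because the exponent m amplifies scatter in H, the first-order result tends to understate both the mean and the spread as the scatter grows.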
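Because the random variables Ci and Hi are statistically independent, the cumulative distribution function (cdf) of N is

$$F_N(n) = P\{N \le n\} = \iint p(C_i) \, p(H_i) \, dC_i \, dH_i \tag{6.39}$$

where the integration extends over the region of (Ci, Hi) for which Ni ≤ n.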
The explicit expression of the integral can be obtained only for some very special distributions of Ci and Hi. Generally, the mean value of the amplitude of cyclic temperature ranges may be a function of time. The effect of this variable temperature amplitude on fatigue performance of wirebonds can be accounted for through cumulative damage rules, which relate fatigue behavior under a complex loading history to known statistical behavior under constant amplitude loading. Linear damage-accumulation models, such as Miner’s rule, are widely used, although they do not account for stress range sequencing effects on fatigue life. Miner’s rule predicts the following fatigue damage under variable amplitude stress. Using DM as the damage parameter in accordance with the notation of Section 6.4, we obtain

$$D_M = \sum_{j=1}^{k_b} \Delta D_j = \sum_{j=1}^{k_b} \frac{n_j}{N_j} \tag{6.40}$$
where ΔDj is the incremental damage due to the jth block of constant stress or strain range, Sj; nj is the total number of cycles in the jth block; Nj is the number of cycles to failure due to the ith mechanism (i dropped for convenience in Equation 6.40) under the stress cycles, Sj; and kb is the total number of blocks. Failure occurs when accumulated damage exceeds endurance, that is, when DM > 1. Substituting Equation 6.36 into Equation 6.40 gives

$$D_M = \frac{H_i^{m_i}}{C_i} \sum_{j=1}^{k_b} n_j \, \Delta T_j^{m_i} \tag{6.41}$$
Let nj = Ni pj (pj is the relative likelihood of the occurrence of temperature cycle ΔTj) and let DM = 1 in Equation 6.40. Then, Equation 6.41 becomes

$$N_i = \frac{C_i \, H_i^{-m_i}}{\sum_{j=1}^{k_b} p_j \, \Delta T_j^{m_i}} \tag{6.42}$$
Define the equivalent temperature cycle for the ith failure mode as

$$\Delta T_{eq,i} = \left[ \sum_{j=1}^{k_b} p_j \, \Delta T_j^{m_i} \right]^{1/m_i} \tag{6.43}$$
With this definition, Equation 6.42 takes the same form as Equation 6.36:

$$N_i = C_i \, (H_i \, \Delta T_{eq,i})^{-m_i} \tag{6.44}$$
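The equivalence between the Miner's-rule life and the constant-amplitude life at the equivalent temperature range can be verified numerically. The sketch below assumes the inverse power-law life model N = C (H ΔT)^(-m) and a discrete temperature spectrum; the function names and the spectrum values are hypothetical.

```python
def equivalent_temperature_range(delta_ts, probabilities, m):
    """Equivalent range [sum_j p_j * dT_j**m] ** (1/m) (form of Equation 6.43)."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    moment = sum(p * dt ** m for dt, p in zip(delta_ts, probabilities))
    return moment ** (1.0 / m)


def life_from_miner(c, h, delta_ts, probabilities, m):
    """Cycles to failure from Miner's rule with D_M = 1 (form of Equation 6.42)."""
    moment = sum(p * dt ** m for dt, p in zip(delta_ts, probabilities))
    return c * h ** (-m) / moment


def life_from_equivalent_range(c, h, delta_t_eq, m):
    """Constant-amplitude life N = C * (H * dT_eq)**(-m) (form of Equation 6.44)."""
    return c * (h * delta_t_eq) ** (-m)


if __name__ == "__main__":
    # Hypothetical spectrum: three temperature ranges and their likelihoods.
    delta_ts = [50.0, 100.0, 150.0]
    probabilities = [0.5, 0.3, 0.2]
    c, h, m = 1.0e15, 1.0, 5.0

    dt_eq = equivalent_temperature_range(delta_ts, probabilities, m)
    print(f"equivalent range      = {dt_eq:.1f} C")
    print(f"life, Miner's rule    = {life_from_miner(c, h, delta_ts, probabilities, m):.4e}")
    print(f"life, equivalent range = {life_from_equivalent_range(c, h, dt_eq, m):.4e}")
```

For m greater than 1 the equivalent range exceeds the probability-weighted mean range, because the largest cycles contribute disproportionately to damage.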
If the temperature range is described by a continuous random variable, the equivalent temperature range is

$$\Delta T_{eq,i} = \left[ \int_{0}^{\Delta T_{\max}} p(\Delta T) \, \Delta T^{m_i} \, d(\Delta T) \right]^{1/m_i} \tag{6.45}$$
where p(ΔT) is the pdf of ΔT. When mi in Equations 6.43 and 6.45 is considered as a deterministic quantity with a known value, ΔTeq,i is a constant for the ith failure mechanism, and (ΔTeq,i)^mi is the mi-th moment of ΔT. Thus, the variable amplitude temperature cycle problem becomes a thermal fatigue problem with constant temperature amplitude, and Equations 6.37 and 6.38 therefore apply. However, from a manufacturer’s point of view, the equivalent temperature cycle, ΔTeq,i, is still not a constant, but rather a random variable whose distribution needs to be estimated from analysis of statistical data of past users. Thus, in Equation 6.44, there are still three random variables, even if mi is considered to be a deterministic constant. For this case, the mean value and variance of the fatigue life are expressed as

$$\mu_{N_i} = \mu_{C_i} \left( \mu_{H_i} \, \mu_{\Delta T_{eq,i}} \right)^{-m_i} \tag{6.46}$$

$$\sigma_{N_i}^2 = \left( \frac{\mu_{N_i}}{\mu_{C_i}} \right)^2 \sigma_{C_i}^2 + m_i^2 \left( \frac{\mu_{N_i}}{\mu_{H_i}} \right)^2 \sigma_{H_i}^2 + m_i^2 \left( \frac{\mu_{N_i}}{\mu_{\Delta T_{eq,i}}} \right)^2 \sigma_{\Delta T_{eq,i}}^2 \tag{6.47}$$
In general, the distribution of fatigue life is very difficult to express with a closed-form equation, but it can be estimated numerically by computer simulation using, for example, the Monte Carlo method. The explicit expression of life distribution is possible only in some very special cases. For example, if Ci, Hi, and ΔTi are adequately
modeled by lognormal distributions, then the lifetime, Ni, is adequately modeled by a lognormal distribution whose mean is

$$\mu_{N_i'} = \mu_{C_i'} - m_i \, \mu_{h_i'} - m_i \, \mu_{t_i'} \tag{6.48}$$

where Ni′ = log Ni, Ci′ = log Ci, hi′ = log Hi, and ti′ = log ΔTi. The corresponding standard deviation is

$$\sigma_{N_i'} = \left( \sigma_{C_i'}^2 + m_i^2 \, \sigma_{h_i'}^2 + m_i^2 \, \sigma_{t_i'}^2 \right)^{1/2} \tag{6.49}$$
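In the lognormal case the reliability follows directly from the normal distribution of log life. The sketch below combines the log-domain parameters in the form of Equations 6.48 and 6.49 and evaluates the probability that the life exceeds a given number of cycles. Natural logarithms are assumed throughout, and the function names and numerical inputs are hypothetical.

```python
import math


def log_life_parameters(mu_c, sigma_c, mu_h, sigma_h, mu_t, sigma_t, m):
    """Mean and std of ln N for N = C * (H * dT)**(-m).

    Inputs are the means and stds of ln C, ln H, and ln dT, which are
    assumed independent and normally distributed; m is a constant.
    """
    mu_n = mu_c - m * mu_h - m * mu_t
    sigma_n = math.sqrt(sigma_c ** 2 + (m * sigma_h) ** 2 + (m * sigma_t) ** 2)
    return mu_n, sigma_n


def lognormal_reliability(n_cycles, mu_n, sigma_n):
    """P(N > n_cycles) when ln N is normal with mean mu_n and std sigma_n."""
    z = (math.log(n_cycles) - mu_n) / sigma_n
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical log-domain inputs (natural logarithms).
    mu_n, sigma_n = log_life_parameters(
        mu_c=math.log(1.0e15), sigma_c=0.1,
        mu_h=math.log(1.0), sigma_h=0.03,
        mu_t=math.log(100.0), sigma_t=0.3,
        m=5.0,
    )
    print(f"mean of ln N = {mu_n:.3f}, std of ln N = {sigma_n:.3f}")
    for n in (1.0e3, 1.0e4, 1.0e5, 1.0e6):
        print(f"N = {n:.0e}  R = {lognormal_reliability(n, mu_n, sigma_n):.4f}")
```

With these inputs the scatter in the temperature range dominates the log-life standard deviation, since it is multiplied by the exponent m.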
Once the probability distribution of the operating lifetime, pi(N), is determined from the closed-form equation or by simulation for each failure mechanism, the curves of reliability versus fatigue life can be plotted for each mechanism. From this plot, the governing failure mechanism can be determined, and the corresponding fatigue life at the required reliability level can be estimated. The reliability, Ri(N), which is defined as the probability of survival of a wirebond up to a certain number of operating thermal cycles, N, is written as follows for each failure mechanism:

$$R_i(N) = \int_{N}^{\infty} p_i(n) \, dn, \qquad i = 1, 2, \ldots, 5 \tag{6.50}$$
The reliability of the wirebond is calculated for the case in which the failure mechanisms are in series and statistically independent:

$$R(N) = \prod_{i=1}^{5} R_i(N) \tag{6.51}$$
However, some of these failure mechanisms may be correlated; for example, C1, C5, m1, and m5 are from the same group of test data. For correlated failure mechanisms, the reliability of the wirebond assembly can be estimated as

$$\prod_{i=1}^{5} R_i(N) \le R(N) \le \min_{1 \le i \le 5} R_i(N) \tag{6.52}$$
Numerical applications of these equations are presented elsewhere (Hu et al. 1991). As an example, Figure 6.7 illustrates a sample plot of the reliability computed by Hu et al. for sample assumed inputs.
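The series-system product and the bounds for correlated mechanisms are simple to evaluate once the individual reliabilities are known. The sketch below uses hypothetical function names and made-up reliabilities for the five mechanisms at some fixed number of cycles.

```python
import math


def series_reliability_independent(reliabilities):
    """Series-system reliability for independent mechanisms (form of Equation 6.51)."""
    return math.prod(reliabilities)


def series_reliability_bounds(reliabilities):
    """Lower and upper bounds for correlated mechanisms (form of Equation 6.52)."""
    return math.prod(reliabilities), min(reliabilities)


if __name__ == "__main__":
    # Hypothetical reliabilities of the five mechanisms at a given N.
    r = [0.99, 0.95, 0.98, 0.97, 0.90]
    lower, upper = series_reliability_bounds(r)
    print(f"independent mechanisms: R = {series_reliability_independent(r):.4f}")
    print(f"correlated mechanisms : {lower:.4f} <= R <= {upper:.4f}")
```

The width of the interval indicates how much the assumption of independence matters; when one mechanism clearly dominates, the two bounds are close.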
Figure 6.7 Sample plot of the reliability. (Reliability, R(n), versus operating cycles, log N, with curves for axial tension, flexure, and shear.)
6.7 QUALIFICATION AND ACCELERATED TESTING
The reliability predictions obtained from the damage models discussed in Section 6.2 are approximate at best, and their accuracy depends on the accuracy of the databases for the input parameters and their uncertainties. Therefore, it is essential to validate the predictions through extensive qualification testing and redesign at the prototype development phase. This qualification should be repeated every time there is a change in design or manufacturing specifications. The purpose of qualification testing is to verify whether the anticipated reliability is indeed achieved under actual life-cycle loads. In other words, qualification tests are intended to find the probability of survival of a product over an extended period of time (usually, the design life of the product). Qualification testing thus audits the ability of the design specifications to meet reliability goals. This is usually done under accelerated stresses to achieve test-time compression. A well-designed reliability qualification procedure provides economic savings and quick turnaround during development of new products or mature products subjected to manufacturing and process changes. Many modern engineering hardware items demonstrate very high reliability when operating within their intended use environment. Investigating the wearout failure mechanisms and measuring reliability for products where long life is required may be a challenge because a very long test period under actual operating conditions is necessary to obtain sufficient data to determine actual failure characteristics. One approach to the problem of obtaining meaningful qualification test data for high-reliability devices is accelerated wearout testing, sometimes called accelerated testing or accelerated-stress life testing. When qualifying the reliability for
overstress mechanisms, however, a single cycle of the expected overstress load may be adequate, and acceleration of test parameters may not be necessary. Accelerated testing for wearout failure mechanisms involves measuring the performance of the test device at accelerated conditions of load or stress that are more severe than the normal operating level, in order to induce failures within a reduced time period. The goal of such testing is to accelerate the time-dependent wearout failure mechanisms and the accumulation of damage to reduce the time to failure, under the conditions that

• failure mechanisms and modes in the accelerated environment are the same as (or can be quantitatively correlated with) those observed under usage conditions;
• it is possible to extrapolate quantitatively from the accelerated environment to the usage environment with some reasonable degree of assurance;
• engineering properties of the materials under accelerated stress are well characterized; and
• failure distributions at operating levels and accelerated test levels are similar.
A scientific approach to accelerated testing starts with identifying the relevant wearout failure mechanism. The stress parameter that directly causes the time-dependent failure is selected as the acceleration parameter and is commonly called the accelerated stress. Common accelerated stresses include thermal stresses, such as temperature, temperature cycling, or rates of temperature change; chemical stresses, such as humidity, corrosives, acid, or salt; electrical stresses, such as voltage, current, or power; and mechanical stresses, such as vibration loading, mechanical stress cycles, strain cycles, and shock/impulse. The accelerated environment may include one or a combination of these stresses. Interpretation of results for combined stresses requires a very clear and quantitative understanding of their relative interactions and the contribution of each stress to the overall damage. The techniques of accelerated testing involve selecting the failure mechanisms and the appropriate acceleration stress; determining the test procedures and the stress levels; determining the test method, such as constant stress acceleration or step-stress acceleration; performing the tests; and interpreting the test data, which includes extrapolating the accelerated test results to normal operating conditions. Quantitative extrapolation is often difficult because of the uncertainties in the failure models and material properties. However, the test results often provide designers with good qualitative failure information for improving the hardware through design and/or process changes. Failure due to a particular mechanism can be induced by several acceleration parameters. For example, corrosion can be accelerated by both temperature and humidity; creep can be accelerated by both mechanical stress and temperature. Furthermore, a single acceleration stress can induce failure by several wearout mechanisms simultaneously. 
For example, temperature can accelerate wearout damage accumulation not only due to electromigration, but also due to corrosion, creep, and so forth. Failure mechanisms that dominate under usual operating conditions may lose their dominance as stress is elevated. Conversely, failure mechanisms that are dormant under normal use conditions may contribute to device failure under
accelerated conditions. Thus, accelerated tests require careful planning in order to represent the actual usage environments and operating conditions without introducing extraneous failure mechanisms or nonrepresentative physical or material behavior. The degree of stress acceleration is usually controlled by an acceleration factor, defined as the ratio of the life under normal use conditions to that under the accelerated condition. The acceleration factor should be tailored to the hardware in question and should be estimated from an acceleration transform that gives a functional relationship between the accelerated stress and reduced life, in terms of all the hardware parameters. Obviously, the transform function should be based on quantitative failure models of the type discussed in Section 6.2. Detailed failure analysis of failed samples is a crucial step in the qualification and validation program. Without such analysis and feedback to designers for corrective action, the purpose of the qualification program is defeated. In other words, it is not adequate simply to collect statistical failure data. Mere statistics alone cannot provide insights into (and consequent control over) relevant failure mechanisms and ways to prevent them. The key is to use the test results to develop more robust and cost-effective designs.
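As one concrete acceleration transform, consider thermal cycling of a wearout mechanism that follows the inverse power-law life model of Section 6.6, N = C (H ΔT)^(-m). If C, H, and m are assumed to be the same under use and test conditions (that is, the failure mechanism does not change), the ratio of use life to test life reduces to (ΔT_test/ΔT_use)^m. The sketch below encodes that ratio; the function names and numerical values are hypothetical.

```python
def acceleration_factor(delta_t_use, delta_t_test, m):
    """Ratio of life under use conditions to life under test conditions.

    Assumes N = C * (H * dT)**(-m) with the same C, H, and m in both
    environments, so the factor is (dT_test / dT_use)**m.
    """
    return (delta_t_test / delta_t_use) ** m


def cycles_needed_in_test(target_use_cycles, delta_t_use, delta_t_test, m):
    """Test cycles equivalent to target_use_cycles under use conditions."""
    return target_use_cycles / acceleration_factor(delta_t_use, delta_t_test, m)


if __name__ == "__main__":
    # Hypothetical case: 50 C cycles in use, 100 C cycles in test, m = 5.
    af = acceleration_factor(delta_t_use=50.0, delta_t_test=100.0, m=5.0)
    n_test = cycles_needed_in_test(1.0e4, 50.0, 100.0, 5.0)
    print(f"acceleration factor = {af:.1f}")
    print(f"test cycles for 10,000 use cycles = {n_test:.1f}")
```

Because the factor depends strongly on m, uncertainty in the fatigue exponent translates directly into uncertainty in the extrapolated life, which is one reason quantitative extrapolation is difficult.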
6.8 DE-RATING AND LOGISTIC IMPLICATIONS
The feedback from qualification testing should be used for redesigning hardware for better reliability. Such reliability growth is an intrinsic part of product development in maturing technologies. The process is iterative and should be continued until reliability and/or cost-effectiveness goals have been met. When further improvements are no longer feasible and reliability is still short of requirements, then de-rating and design redundancies may be the next option to improve product reliability. De-rating is a technique by which either the operational stresses acting on a device or structure are reduced relative to rated strength or strength is increased relative to allocated operating stress levels. Reducing the stress is achieved by specifying upper limits on the operating loads below the rated capacity of the hardware. For example, manufacturers of electronic hardware often specify limits for supply voltage, output current, power dissipation, junction temperature, and frequency. The equipment designer may decide to select an alternative component or make a design change that ensures that the operational condition for a particular parameter, such as temperature, is always below the rated level. The component is then said to have been “de-rated for thermal stress.” The de-rating factor, typically defined as the ratio of the rated level of a given stress parameter to its actual operating level, is actually a margin of safety or “margin of ignorance” determined by the criticality of any possible failures and by the amount of uncertainty inherent in the reliability model and its inputs. Ideally, this margin should be kept to a minimum to maintain the cost effectiveness of the design. This puts the responsibility on the reliability engineer to identify as unambiguously as possible the rated strength, the relevant operating stresses, and reliability.
To be effective, de-rating criteria must target the right stress parameter to address a given failure mechanism. Further, the inputs to failure models and laboratory tests to measure these parameters accurately must be specified. Physics-of-failure concepts can relate allowable operating stresses to design strengths through quantitative modeling of the relevant failure mechanisms. Field measurements may also be necessary, in conjunction with modeling simulations, to identify the actual operating stresses at the failure site. Once the failure models have been quantified, the impact of de-rating on the effective reliability of the component for a given load can be determined. Quantitative correlations between de-rating and reliability enable designers and users to tailor the margin of safety effectively to the level of criticality of the component, leading to better and more cost-effective utilization of the functional capacity of the component. The value of reliability predictions goes well beyond design model development. Many life-cycle logistical decisions are driven by reliability predictions, as well as other considerations like cost, repairability, and preventive replacement schedules. Examples of logistic functions affected by reliability include reliability allocations, scheduling of maintenance actions, spares planning, warranty policies, and obsolescence planning. The other logistic aspect of physics-based reliability predictions is that it places limitations on the time taken for product development. Exhaustive modeling and qualification of a brand-new technology may require time and resources that are not feasible in a competitive market. In such situations, new products are often introduced in a semideveloped, premature stage. Ongoing qualification is a costly necessity in such situations in order to achieve reliability growth. 
The increased use of computer-aided modeling, automated prediction tools, and improved databases for material properties and application profiles should facilitate the task of timely reliability prediction. Proper understanding of relevant failure mechanisms and extensive material testing programs play key roles in attaining such an advanced modeling capability.
6.9
MANUFACTURING ISSUES
Manufacturing, processing, and assembly appreciably impact the quality and reliability of hardware. Improper assembly and manufacturing techniques introduce defects, flaws, and residual stresses that act as potential failure sites or “stress raisers” later in the life of the component. The fact that the defects and stresses during the assembly and manufacturing process can affect hardware reliability during operation necessitates the identification of these defects and stresses to help the design analyst account for them proactively during the design and development phase. The task of auditing the merits of the manufacturing process involves two crucial steps. First, qualification procedures are required, as in design qualification, to ensure that manufacturing specifications do not excessively compromise the long-term reliability of the hardware. Second, lot-to-lot screening is required to ensure that the variabilities of all manufacturing-related parameters are within specified tolerances. In other words, screening ensures the quality of the product and improves short-term reliability by precipitating latent defects before they reach the field.
6.9.1
Process Qualification
Like design qualification, this test program should be conducted at the prototype development phase. The intent is to ensure that the nominal manufacturing specifications and tolerances produce acceptable reliability in the hardware. Once qualified, the process needs requalification only when some process parameters, materials, manufacturing specifications, or human factors change. Process qualification tests could be the same set of accelerated wearout tests used in design qualification. As in design qualification, overstress tests may be used to qualify a product for anticipated field overstress loads. In addition, overstress tests may also be exploited to ensure that manufacturing processes do not degrade the intrinsic material strength of hardware beyond a specified limit. However, such tests should supplement, not replace, the accelerated wearout test program unless explicit physics-based correlations are available between overstress test results and wearout field-failure data. As in design qualification, acceleration of the field failure mechanism is the main goal, and all caveats stated in Section 6.4 also apply here. Failure analysis and closed-loop corrective action constitute the most important and effective uses of the test results.
6.9.2
Manufacturability, Process Variabilities, Defects, and Yields
The control and rectification of manufacturing defects have typically been the concern of production and process-control engineers, rather than of the designer. However, in the spirit and context of concurrent product development, hardware designers must understand material limits, available processes, and manufacturing process capabilities in order to select materials and construct architectures that promote producibility, aid in reducing the occurrence of defects, and consequently increase yield and quality. Therefore, no specification is complete without a clear discussion of manufacturing defects and acceptability limits. The reliability engineer must have clear definitions of the threshold for acceptable quality and of what constitutes nonconformance. Nonconformance that compromises hardware performance and reliability is considered a defect here. Failure mechanism models provide a convenient vehicle for developing such criteria. It is important for the reliability analyst to understand the deviations from specifications that can compromise performance or reliability and the deviations that are benign and can hence be accepted. The emphasis here is on poor quality and defects due to excessive lot-to-lot variabilities in an otherwise qualified process. Such variabilities can be typically attributed to poor process control. A defect is any outcome of a process (manufacturing or assembly) that impairs or has the potential to impair the functionality of the product at any time. The defect may arise during a single process or may be the result of a sequence of processes. The yield of a process is the fraction of products that are acceptable for use in a subsequent process in the manufacturing sequence or product life cycle. The cumulative yield of the process is determined by multiplying the individual yields of each of the individual process steps. The source of defects is not always apparent because
defects resulting from a process can go undetected until the product reaches some downstream point in the process sequence, especially if screening is not employed. When a process is in control (performing within specifications), the observed defect rate is the same as that anticipated by the process engineer. The correlation between the process sequence and the defects that result if any part of the process is not in control must be understood. Good process control optimizes yield and reliability. However, even if high yields are attained, product reliability cannot be assumed. Thus, the potential exists for a mismatch between the initial acceptance criteria and reliability. The engineer must design for reliability as well as keep the processes in control. It is often possible to simplify the manufacturing and assembly processes in order to reduce the probability of workmanship defects. However, as processes become more sophisticated, process monitoring and control are necessary to ensure a defect-free product. The bounds that specify whether the process is within tolerance limits, often referred to as the process window, are defined in terms of the independent variables to be controlled within the process and the effects of the process on the product, or the dependent product variables. The goal is to understand the effect of each process variable on each product parameter in order to formulate control limits for the process, that is, the points on the variable scale where the defect rate begins to possess a potential for causing failure. In defining the process window, the upper and lower limits of each process variable beyond which it will produce defects have to be determined. Manufacturing processes must be contained in the process window by defect testing, analysis of the causes of defects, and elimination of defects by process control, such as closed-loop corrective action systems.
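Two of the quantities described above lend themselves to a short sketch: the cumulative yield, obtained by multiplying the yields of the individual process steps, and a check of whether measured process variables lie inside the process window. The function names, variable names, and limits below are hypothetical and serve only as an illustration.

```python
import math


def cumulative_yield(step_yields):
    """Product of the individual step yields (each a fraction between 0 and 1)."""
    for y in step_yields:
        if not 0.0 <= y <= 1.0:
            raise ValueError("each yield must lie between 0 and 1")
    return math.prod(step_yields)


def in_process_window(measured, window):
    """True if every controlled variable lies within its lower and upper limits.

    measured: dict mapping variable name to measured value
    window:   dict mapping variable name to a (lower, upper) tuple
    """
    return all(low <= measured[name] <= high for name, (low, high) in window.items())


if __name__ == "__main__":
    # Three hypothetical process steps with yields of 99%, 98%, and 95%.
    print(round(cumulative_yield([0.99, 0.98, 0.95]), 5))

    # Hypothetical process window for a soldering step.
    window = {"peak_temp_c": (235.0, 250.0), "time_above_liquidus_s": (45.0, 90.0)}
    print(in_process_window({"peak_temp_c": 242.0, "time_above_liquidus_s": 60.0}, window))
```

Note that a high cumulative yield says nothing about long-term reliability by itself, which is the mismatch between acceptance criteria and reliability that the text warns about.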
The establishment of an effective feedback path to report process-related defect data is critical. Once this is done and the process window is determined, the process window becomes a feedback system for the process operator. Several process parameters may interact to produce a defect different from that which would have resulted from the individual effects of these parameters acting independently. This complex case may require that the interaction of various process parameters be evaluated in a matrix of experiments. In some cases, a defect cannot be detected until late in the process sequence. Thus, a defect can cause rejection, rework, or failure of the product after considerable value has been added to it. These cost items due to defects can reduce yield and return on investments by adding to hidden factory costs. All critical processes require special attention for defect elimination by process control. The strategy for quality manufacture and assembly relies on understanding and controlling each individual process step and its effect on the product. The goal is to reduce the probability of the occurrence of defects, facilitate monitoring of the process sequence, and improve the manufacturability of the hardware.
6.9.3
Process Verification Testing and Statistical Process Control
Process verification testing is often called screening. Screening invokes 100% auditing of all manufactured products to detect or precipitate defects, and it is required on a lot-to-lot basis for qualified products. The aim is to preempt potential quality
problems before they reach the field. In principle, this should not be required for a process fully under production control. However, due to the uncertainties in control procedures, screening is often used as a safety net. As discussed previously, quantitative models of field failure mechanisms may be used successfully to define acceptability thresholds for defects. Some products exhibit a multimodal probability density function for failures, with a secondary peak during the early period of their service life due to the use of faulty materials, poorly controlled manufacture and assembly technologies, or mishandling. This type of early-life failure is often called infant mortality. Properly applied screening techniques can successfully detect or precipitate these failures, eliminating or reducing their occurrence in field use. Screening may be a redundant cost item if there is only one main peak in the failure probability density function. Failures arising due to unanticipated events, such as acts of God (lightning, earthquake, etc.), may be difficult to design and screen for cost effectiveness. Screening should be considered for use only during the early stages of production, if at all, and only when products are expected to exhibit infant mortality field failures. It is appropriate for products with nonrobust designs and for mature products with newly specified components, materials, or processes. Because screening is done on a 100% basis, it is important to develop screens that do not harm good components. Stress screening involves the application of stresses, possibly above the rated operational limits. The best screens, therefore, are nondestructive evaluation techniques, such as microscopic visual exams, x-rays, acoustic scans, C-scans, nuclear magnetic resonance, electron paramagnetic resonance, and so on. 
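The multimodal failure density mentioned above, with a secondary early-life peak due to defective units, can be pictured with a simple mixture model. The sketch below is an assumption made for illustration only: it mixes two Weibull densities, one for a small defective subpopulation and one for the main wearout population, and all parameter values are hypothetical.

```python
import math


def weibull_pdf(t, shape, scale):
    """Two-parameter Weibull probability density at time t."""
    if t <= 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def mixture_pdf(t, defect_fraction, early, main):
    """Failure density of a population in which a fraction of units is defective.

    early and main are (shape, scale) tuples for the defective and the
    main subpopulation, respectively.
    """
    return (defect_fraction * weibull_pdf(t, *early)
            + (1.0 - defect_fraction) * weibull_pdf(t, *main))


if __name__ == "__main__":
    early = (2.0, 100.0)     # hypothetical infant-mortality subpopulation (hours)
    main = (4.0, 10000.0)    # hypothetical wearout subpopulation (hours)
    for t in (70.0, 1000.0, 9300.0):
        print(t, mixture_pdf(t, 0.05, early, main))
```

With these assumed parameters the density is high near 70 hours, drops by several orders of magnitude around 1,000 hours, and rises again near 9,300 hours. A screen aims at removing the early peak; if the defective fraction is zero, only the main peak remains and screening adds cost without benefit.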
If stress screens are unavoidable, overstress tests are preferred to accelerated wearout tests because the latter are more likely to consume some useful life of good components. A stress screen need not necessarily simulate the field environment or even utilize the same failure mechanism as the one likely to be triggered by this defect in field conditions. Instead, a screen should exploit the most convenient and effective failure mechanism to stimulate the defects that would show up in the field as infant mortality. Obviously, this requires an awareness of the possible defects that may occur in the hardware and extensive familiarity with the associated failure mechanisms. If damage to good components is unavoidable during stress screening, then quantitative estimates of the screening damage, based on failure mechanism models, must be developed to allow the designer to account for this loss of usable life. The appropriate stress levels for screening must be tailored to the specific hardware. As in qualification testing, transforms based on quantitative models of failure mechanisms can aid in determining screen parameters. Unlike qualification testing, the effectiveness of screens is maximized when screens are conducted immediately after the operation believed to be responsible for introducing the defect. Qualification testing is preferably conducted on the finished product or as close to the final operation as possible; on the other hand, screening only at the final stage, when all operations have been completed, is less effective because failure analysis, defect diagnostics, and troubleshooting are difficult and impair corrective actions. Further, if a defect is introduced early in the manufacturing process,
subsequent value added through new materials and processes is wasted, which additionally burdens operating costs and reduces productivity. Admittedly, there are also several disadvantages to such an approach. The cost of screening at every manufacturing station may be prohibitive, especially for small batch jobs. Further, components will experience repeated screening loads as they pass through several manufacturing steps, which increases the risk of accumulating wearout damage in good components due to screening stresses. To arrive at a screening matrix that addresses as many defects and failure mechanisms as feasible with each screen test, an optimum situation must be sought through analysis of cost effectiveness, risk, and the criticality of the defects. All defects must be traced back to the root cause of the variability. Stress screening carries substantial penalties in capital, operating expense, and cycle time, and its benefits diminish as a product approaches maturity. Any commitment to stress screening must include the necessary funding and staff to determine the root cause and appropriate corrective actions for all failed units. The type of stress screening chosen should be derived from the design, manufacturing, and quality teams. Although a stress screen may be necessary during the early stages of production, a plan that includes the earliest possible reduction in the sample size through corrective action is strongly recommended. If almost all the products fail in a properly designed screen test, the design is probably incorrect. If many products fail, a revision of the manufacturing process is required. If the number of failures in a screen test is small, the processes are likely to be within tolerances, and the observed faults may be beyond the resources of the design and production process. 
At the time the process matures and screening rejects decrease, the decision to screen is driven by economic considerations, and it may be appropriate to replace a screen with a sampling procedure. As in qualification testing, failure analysis, proper feedback, and appropriate corrective action are essential to ensure timely removal of the cause of the defects. However, with a properly qualified product, screening should be used only to improve lot-to-lot variability, product quality, and short-term reliability. Long-term product reliability improvement is possible only through proactive design and process changes and must be completed early in the development phase through timely qualification procedures. Continued design and process changes late in the life of the product line are costly and undesirable.
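The rule of thumb given above for interpreting screen results (almost all units fail, many fail, or few fail) can be written as a small decision function. The text gives no numerical thresholds, so the cut-off fractions used below are hypothetical placeholders that would have to be set for the specific product and screen.

```python
def interpret_screen_fallout(failed, screened, almost_all=0.9, many=0.2):
    """Map the fraction of screen failures to the interpretation given in the text.

    The thresholds almost_all and many are assumed values, not handbook data.
    """
    if screened <= 0 or not 0 <= failed <= screened:
        raise ValueError("need 0 <= failed <= screened and screened > 0")
    fraction = failed / screened
    if fraction >= almost_all:
        return "design is probably incorrect"
    if fraction >= many:
        return "revise the manufacturing process"
    return "processes are likely within tolerances"


if __name__ == "__main__":
    for failed in (95, 30, 2):
        print(failed, "of 100 failed:", interpret_screen_fallout(failed, 100))
```

Whatever the outcome, every failed unit still calls for root-cause analysis and corrective action, so a function of this kind can only direct attention; it does not replace failure analysis.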
6.10
SUMMARY
High product reliability can be assured only through robust product designs, capable processes that are known to be within tolerances, and qualified components and materials from vendors whose processes are also capable and within tolerances. Quantitative understanding and probabilistic modeling of all relevant failure mechanisms provide a convenient vehicle for formulating effective design and processing specifications and tolerances for high reliability. Accurate reliability predictions also require accurate stress analysis and accurate databases for anticipated life-cycle load
histories for different usage categories, as well as material constitutive and damage properties over the entire range of loads expected during manufacturing, accelerated testing, and usage. Proper reliability predictions also aid in proactive planning of logistical tasks, such as maintenance schedules, warranty pricing, and so on. Scientific reliability assessments should always be supplemented by accelerated qualification testing. The impact of manufacturing processes on quality and reliability should be carefully considered. Robust manufacturing can be ensured only by using qualified procedures that are carefully and continuously controlled to minimize variabilities and defects. Screens and qualification tests have distinctly different goals and hence must be tailored to serve specific purposes. In general, qualification tests must trigger the same failure mechanism expected to affect long-term reliability by causing field failures. Screen tests can use any convenient failure mechanism, as long as it succeeds in removing the target defects without damaging good components. Hardware reliability is not a matter of chance or good fortune; rather, it is a rational consequence of conscious, systematic, rigorous efforts at every stage of design, development, and manufacture. Admittedly, such an approach requires extensive modeling and knowledge of material behavior. However, it can be argued that, without such a systematic approach and painstaking attention to detail, we cannot achieve consistent control over the reliability of complex and expensive engineering hardware.
CHAPTER 7
Software Reliability
Richard Kowalski, Carol Smidts
CONTENTS
7.1 Introduction
7.2 Definitions
7.3 Software Development: The Classic Waterfall Life Cycle
7.3.1 Phase Descriptions
7.3.1.1 Software Requirements Definition and Analysis Phase
7.3.1.2 Preliminary and Detailed Design Phases
7.3.1.3 Code and Unit Testing Phase
7.3.1.4 Integration and System Testing Phase
7.3.1.5 Acceptance Testing Phase
7.3.1.6 Maintenance and Operation Phase
7.3.1.7 Retirement Phase
7.3.2 Software Development Standards
7.3.3 Distribution of Errors over the Development Life Cycle and Related Costs
7.4 Techniques to Improve Software Reliability
7.4.1 Designing Reliable Software
7.4.1.1 Structured Programming
7.4.1.2 Design Techniques
7.4.1.3 Design Issues
7.4.2 Designing Fault-Tolerant Software
7.4.2.1 Recovery-Block Design
7.4.2.2 N-Version Programming
7.4.2.3 Consensus Recovery Block
7.4.3 Testing
7.4.3.1 Black-Box and White-Box Testing
7.4.3.2 Module Testing: White-Box and Black-Box Testing Strategies
7.4.3.3 Integration Testing
7.4.4 Formal Methods (Neuhold and Paul 1991)
7.4.4.1 Formal Specification Methods
7.4.4.2 Formal Verification
7.4.5 Software Development Process Maturity
7.5 Techniques to Assess Software Reliability
7.5.1 Software Analysis Methods
7.5.1.1 Failure Mode and Effect Analysis (FMEA)
7.5.1.2 Fault-Tree Analysis
7.5.2 Software Metrics
7.5.2.1 Requirements Measures: The Specification Completeness Measure
7.5.2.2 Design Phase Measures
7.5.2.3 Code and Unit Test Phase Measure: Defect Density (IEEE 1989a)
7.5.3 Software Reliability Models
7.5.3.1 A Classification of Software Reliability Models
7.5.3.2 Jelinski and Moranda’s Model
7.5.3.3 Musa Basic Execution Time Model (BETM)
7.5.3.4 Musa–Okumoto Logarithmic Poisson Execution Time Model (LPETM)
7.5.3.5 Mills’s Fault Seeding Model (IEEE 1989b)
7.5.3.6 Nelson’s Input-Based Domain Model
7.5.3.7 Derived Software Reliability Models
7.5.3.8 A Critique of Existing Software Reliability Models
7.6 Summary
References
7.1
INTRODUCTION
It is impossible to ignore the importance of software in our lives and the need for reliable software. Software reliability has progressively become a critical issue because of the number and nature of the fields involved. Many products that rely heavily on software are safety critical: flight systems, air traffic control systems, products for helping operators to diagnose the root causes of accidents in nuclear power plants and to identify mitigating actions, remote controls for satellites, medical products, and so on. Even when not affecting safety, software failures may have severe consequences, such as loss of valuable data in information systems, mismanagement of banking transactions, and accounting errors. Software development companies are now held responsible for the quality of their products, and product development costs are increasingly driven by the cost of the software.
Interestingly enough, although hardware has become extremely reliable at low costs, software is now replacing hardware in many applications. Given these circumstances, a number of techniques are available to enhance and measure software reliability qualitatively as well as quantitatively. These techniques, which aim to improve the quality of software products as well as the quality of the software development process, include software engineering to design better code, testing techniques to remove faults efficiently, formal methods for detailed specification of software capabilities and for certifying that the final product meets its requirements, qualitative metrics to indicate the complexity of the product and to measure the status of the software development process, and quantitative models to assess reliability. Each of these techniques contributes to software reliability and will be reviewed in this chapter. This chapter first defines software, software reliability, software quality, and software safety. It then describes the software development life-cycle process and pinpoints the mechanisms by which software errors are introduced at each stage of the process. It continues with a description of selected techniques that can improve the reliability of a given software product and concludes with some of the qualitative measures and quantitative models that assess software reliability.
7.2
DEFINITIONS
Software and firmware. The Institute of Electrical and Electronics Engineers (IEEE) Standard Glossary of Software Engineering Terminology (1983) defines software as computer programs, procedures, rules, and possibly associated documentation and data pertaining to the operation of a computer product. From the definition, it is clear that software includes more than merely the lines of code that cause the operation of a program. It could be argued that software is everything but the physical hardware on which the software operates. Firmware is a special form of software. According to IEEE (1983), firmware is
• computer programs and data loaded in a class of memory that cannot be dynamically modified by the computer during processing;
• hardware that contains a computer program and data that cannot be changed in its user environment; the computer programs and data contained in firmware are classified as software, whereas the circuitry containing the computer program and data is classified as hardware;
• program instructions stored in a read-only storage; and
• an assembly composed of a hardware unit and a computer program integrated to form a functional entity whose configuration cannot be altered during normal operation; the computer program is stored in the hardware unit as an integrated circuit with a fixed-logic configuration that will satisfy a specific application or operational requirement.
Thus, software includes the microcode or microprograms found in embedded systems. Projects that develop or produce firmware (which may be noncomputational, e.g., lookup tables) should also be subject to software reliability activities.
Software reliability. The accepted definition for software reliability* is similar to that for hardware reliability. According to the IEEE (1983), software reliability is
• the probability that software will not cause the failure of a product for a specified time under specified conditions; this probability is a function of the inputs to and use of the product as well as a function of the existence of faults in the software; the inputs to the product determine whether existing faults, if any, are encountered; and
• the ability of a program to perform a required function under stated conditions for a stated period of time.
There has been and continues to be some debate about the definition of software reliability, arising mainly because time has been selected as the basis for reliability measurement. There is no question that “duration” is relevant to applications, such as operating systems, that need to function over long periods, but it may not be applicable to items such as compilers and scientific applications. Let us consider a scientific application coded by two different development teams. Their products are likely to be two distinct pieces of software, S1 and S2. Failures are defined as incorrect outputs. Let us assume that the input to the two pieces of software is the same, that S1 runs in time T1 and S2 runs in time T2, and that T1 < T2. If both outputs, O1 and O2, are incorrect, the inference from the IEEE definition of software reliability would be that software S1 is less reliable than S2 because it fails sooner, a meaningless conclusion. Hence, for that specific application, a sensible definition of reliability would be the “probability that the software will fulfill its mission over a specified number of runs.”
From both meanings, it is clear that software reliability is achieved by reducing or eliminating product failures due to software and by reducing and avoiding faults in the software. A specialized vocabulary found in IEEE (1983) relates faults and failures to the activities associated with software development:
• Errors are human actions that result in software containing a fault, for example, the omission or misinterpretation of user requirements, the omission or incorrect translation of a requirement into design documentation, or the improper coding of a line of code, data table, or branch condition.
• Fault is the manifestation of an error in software; a fault, if encountered, may cause a failure. The term is synonymous with “bug.”
• Failure is the inability of a product or component to perform a required function within specified limits; failures will be observed during either testing or actual use.
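The run-based definition of reliability suggested above can be made concrete with a short sketch. It rests on assumptions that are not part of the text: runs are treated as independent, each run succeeds with the same probability, and that probability is estimated as the fraction of successful test runs. The function names are hypothetical.

```python
def estimate_success_probability(failed_runs, total_runs):
    """Fraction of runs that produced a correct output (a simple point estimate)."""
    if total_runs <= 0 or not 0 <= failed_runs <= total_runs:
        raise ValueError("need 0 <= failed_runs <= total_runs and total_runs > 0")
    return (total_runs - failed_runs) / total_runs


def reliability_over_runs(success_probability, runs):
    """Probability of completing the given number of runs without a failure.

    Assumes independent runs with a constant per-run success probability.
    """
    if not 0.0 <= success_probability <= 1.0 or runs < 0:
        raise ValueError("invalid probability or run count")
    return success_probability ** runs


if __name__ == "__main__":
    p = estimate_success_probability(failed_runs=2, total_runs=1000)
    print(f"per-run success probability: {p:.3f}")
    print(f"reliability over 100 runs:   {reliability_over_runs(p, 100):.3f}")
```

Under this view the execution times T1 and T2 of the two programs play no role; only the number of runs and the likelihood that a run encounters a fault matter.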
Although the term “defect” is widely used in the literature, the IEEE does not define “defect” except to refer the reader to “fault.” Although other authors may differentiate between defect and fault, this chapter follows the IEEE terminology, unless specific sources make use of the word “defect.”
* This definition of software reliability is called the “users’ definition of software reliability” in contrast with the “developers’ definition of software reliability,” which is usually expressed in number of errors per thousand lines of source code.
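The run-based definition suggested above for the scientific application can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, the per-run reliability is estimated simply as the fraction of runs that produced correct output, and the mission calculation assumes that successive runs fail independently, which real input streams may not satisfy.

```python
def run_reliability(failures: int, runs: int) -> float:
    """Estimate per-run reliability as the fraction of runs with correct output."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must lie between 0 and runs")
    return 1.0 - failures / runs


def mission_reliability(per_run: float, n_runs: int) -> float:
    """Probability of n_runs consecutive correct runs.

    Assumption: runs are statistically independent.
    """
    if not 0.0 <= per_run <= 1.0:
        raise ValueError("per_run must be a probability")
    if n_runs < 0:
        raise ValueError("n_runs must be non-negative")
    return per_run ** n_runs


if __name__ == "__main__":
    r = run_reliability(failures=2, runs=1000)  # two incorrect outputs observed
    print("per-run reliability:", r)
    print("reliability over a 50-run mission:", mission_reliability(r, 50))
```

Note that execution time does not appear anywhere in this estimate, which is the point of the run-based definition: S1 and S2 in the example would be judged only on how often their outputs are correct.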
Because errors and faults occur before software is an operational product, software reliability programs are planned and initiated before the operating software is available. The purpose of a software reliability program is fivefold:
• to prevent the occurrence of errors and faults before software is available for test or operational use;
• to use information obtained during software development to eliminate related unobserved errors and faults;
• to mitigate the effects of software failures, particularly those with critical consequences;
• to collect project data to understand better the conditions and phenomena that permit errors and faults to occur and to use this experience to improve processes to develop software; and
• to perform an estimate of the reliability of the delivered code.
Figure 7.1 illustrates how errors cascade through a software development effort that lacks an effective reliability program. Errors that occur when identifying requirements propagate through development efforts and become faults in code. Errors that occur in the design and coding phases also ultimately become code faults. Because all faults are potential failures during testing (which will itself include errors—for example, improper testing or a failure to test certain areas), testing focuses on finding and correcting failures due to a cascade of faults that might otherwise have been identified and eliminated in earlier phases. However, faults present in the testing phase may mask each other and further increase the risk that some will pass through to the operational code. Few programs will have the test resources or schedule to address all possible effects of cascading errors.
Figure 7.1 Cascading of errors through the development effort. [Diagram: requirement errors lead to preliminary design errors, detailed design errors, and code faults; preliminary design errors lead to detailed design errors and code faults; detailed design errors lead to code faults; errors due to coding add further code faults.]
Table 7.1 Quality Factors Identified in DOD-STD-2167
Quality Factor | Definition
Correctness | The extent to which software is free from design defects and from coding defects (that is, fault free)
Efficiency | The extent to which software performs its intended functions with a minimum consumption of computing resources
Flexibility | The ease with which software can accept enhancements
Integrity | The extent to which software prevents or controls unauthorized access and modification to data or code
Interoperability | The ability of two or more products to exchange information
Maintainability | The ease with which software can be maintained
Portability | The ease with which software can be transferred from one computer system or environment to another
Reusability | The extent to which a module can be used in multiple applications
Testability | The ease with which software can be tested
Usability | The extent to which software incorporates human engineering capabilities and features
An effective software reliability program will identify and correct errors and faults at or near the phase in which they occur. This approach to reliability minimizes the effects of cascading errors so that test efforts can concentrate on product interface and integration issues not easily addressed at earlier stages of development.
Software quality. According to the IEEE (1983), software quality is defined as “the totality of features and characteristics of a software product that bear on its ability to satisfy given needs, for example, conformance to specifications.” The Department of Defense Standard 2167, Defense System Software Development (DoD 1985), identifies, in addition to reliability, 10 quality factors that are applicable to a software product. These factors are listed and defined in Table 7.1.
Although responsibility for meeting reliability goals will vary among organizations and projects, the software quality assurance (SQA) organization clearly has a role in achieving software reliability. For small organizations, SQA activities may be carried out by software development personnel. In larger organizations, an SQA organization will be given the task of planning and executing SQA efforts for a project as part of the project team. Such an organization will implement those responsibilities in several ways. For example, an SQA organization will
• review and audit the development organization’s conformance to standards, practices, and procedures for conducting software development;
• conduct independent reviews of project activity through all project phases;
• evaluate intermediate and final software products for conformance to corporate and project standards; and
• evaluate the management and engineering processes used in developing software (e.g., whether design and code reviews are adequate in scope and coverage; whether appropriate individuals participate), and collect process and product data relating to quality.
Thus, the individual or individuals given the task of satisfying software reliability goals will need to work closely with the project’s SQA activity to avoid duplication of effort, to ensure that proper data are collected, and to coordinate their respective tasks.
Software safety. Software safety concentrates on the study of particular types of software failures denoted as safety failures, that is, failures that lead to casualties or serious consequences. Specific techniques for developing safe code and for assessing software safety will be described in Sections 7.4.2 and 7.5.1.2.
7.3 SOFTWARE DEVELOPMENT: THE CLASSIC WATERFALL LIFE CYCLE
The software development life cycle is the period of time that starts when a software product is conceived and ends when the product is no longer available for use. The principal stages of the software development process are presented in Figure 7.2. Together, they form the waterfall life cycle. Variations occur in the number of phases, in their names, and in the individual responsible for a given activity. However, the waterfall life cycle provides a good baseline for understanding the general philosophy and main stages of the development process. A project’s software life cycle often differs from the waterfall life-cycle model or any of its analogs. For example, in principle, the end of each phase of the life cycle should see the beginning of a new phase; however, in reality, preliminary design will often start before all requirements have been elicited and analyzed, and coding will start before the complete software design has been defined. Hence, the actual life cycle may be a sequence of iterations among the eight different life-cycle phases.
A formal software life-cycle model (LCM) may not be needed for projects that are small in scope or that can be accomplished through the efforts of a small team (perhaps up to four or five persons). In these cases, team activities will move from one phase to another as the circumstances of the project demand, with little loss of coordination among team members. However, once project complexity or team size rises above a modest threshold, the organization will find that the establishment and use of a software life-cycle model will lead to better organization, increased efficiency, and an improved capability to manage team efforts and produce a reliable product.
Figure 7.2 The classic waterfall life cycle. [Diagram: eight phases in sequence: requirements definition and analysis; preliminary design; detailed design; code and unit testing; integration and system testing; acceptance testing; maintenance and operation; retirement.]
When a software life-cycle model is defined, three elements should be identified for each phase:
• the activities to be accomplished;
• the products or results to be obtained; and
• the reviews or other milestones to be met.
Once a software LCM is so defined, it becomes possible to assess the progress, cost, and risk associated with the project. For the reliability practitioner, the software LCM will determine or influence:
• the timing and nature of reliability engineering and assessment activities;
• the data that need to be collected and the methods to be used to collect them; and
• the nature of the possible reliability assessments in each phase.
7.3.1 Phase Descriptions
The following sections describe the principal phases of a software life cycle and discuss the sources of errors or faults that can occur during each phase.
7.3.1.1 Software Requirements Definition and Analysis Phase
The purpose of this phase is to define and document the engineering requirements for the software to be developed. The principal product of this phase is a software requirements specification (SRS) with sufficient detail to allow the development organization to proceed with the design phase of the project. Requirements may be either positive or negative. Positive requirements define the required (or tolerable) operational behaviors of the product. Negative requirements define unacceptable operational behaviors of the product (and may therefore define specific instances of software failure).
There are three means of creating software requirements for a project:
• The client develops requirements prior to a contract award; these requirements are the starting point for the project’s software development. The client’s software requirements are typically developed by a group within his or her own organization (perhaps with subcontractor support), coordinated within the user community, and then issued as the basis for subsequent software development.
• The initial task for the software development organization is to develop a software requirements specification; in a top-down approach, this is accomplished by identifying the users of the product and the specific needs and operational constraints to be met by the software through interviewing people in the user community and by reviewing policy or functional documents that relate to the product under development. The functions to be performed by the product are subsequently identified and documented.
• In a rapid prototyping approach, a minimal set of requirements is identified and a high-level language is used to develop executable prototype code that simulates the principal functions of the product to be developed. For example, in a management information system, prototype code would be developed to illustrate the data input process, sample video display terminal screens would be developed to illustrate what users would see during operations, and sample reports or other outputs would also be prototyped. The potential user (client) reviews the results of prototype operations and comments on features and deficiencies. Several iterations of this process occur until the prototype product is judged acceptable; at that time, the prototype becomes a de facto requirements specification, that is, a model of what the delivered product should do.
Errors traceable to the requirements definition and analysis phase can be most damaging to a software reliability program because they often propagate to many different areas of the code. For example, if some timing requirements of a real-time application are ambiguous or otherwise not properly specified, many functions of the final product may be affected. Several sources may generate errors in this phase:
• Incomplete requirements: some functions or performance requirements may not be identified.
• Nonfeasible requirements: the specified requirements may not be compatible with the existing or planned hardware or operator environment; for example, the specified accuracy of an algorithm may be inconsistent with the sampling rates of the inputs to the algorithm.
• Conflicting requirements: a combination of two or more requirements may be incompatible; a common example occurs with point-of-sale products where, at the time a clerk enters a transaction, the product is required both to update a number of transaction files and to return control to the operator within a specified time.
• Software requirements inconsistent with other product requirements: in these cases, the hardware requirements or the user’s operating policy is incompatible with the software requirements.
• User needs not properly described: for all but the simplest products, the user can be difficult to identify; for example, is the user the product operator, the operator’s supervisor, the manager who uses the product to conduct business, or the administrator of the network on which the product runs? The answer may be “all of the above.” If conflicts among these parties are not resolved when the requirements are developed, the subsequent acceptability and reliability of the product can be in question, and user needs reflected in requirements may not express the expected performance of the product.
7.3.1.2 Preliminary and Detailed Design Phases
The preliminary design effort decomposes the software product into a hierarchical structure consisting of computer software configuration items (CSCIs), which are further decomposed into computer software components (CSCs) and computer software units (CSUs), often called modules. A CSCI is a collection of software elements treated as a unit for the purpose of configuration management (e.g., CSCIs would normally be the level at which software versions are identified). A CSU is the smallest compilable group of code that can be treated as an entity. In the preliminary design phase, the software requirements for a CSCI are apportioned to CSCs and the functionality of each CSC is identified, together with its required inputs and outputs. This information constitutes the preliminary software design.
In the detailed design phase, CSC design specifications are made progressively more detailed until a set of primitive (nondecomposable) CSUs is identified that describes the elemental operations of the product. The detailed design should be language independent and should allow the coding of the program to be accomplished by someone who is not the original designer.
As the design evolves, planning for testing activities also continues. During preliminary design, the test organization establishes requirements for CSC integration and testing. During the detailed design, the organization establishes test cases for CSU integration testing and for product testing. Several sources may generate errors in this phase:
• Input data range errors: the allowable range for input data may not reflect requirements of the actual application; the scope of these errors can vary from numerical ranges that are too large or too small to improper definitions for alphanumeric or other character sequences.
• Inconsistent data definitions: problems arise when modules that are required to exchange information process a variable in different units (e.g., module A uses time in seconds and module B uses time in hours).
• Incorrect error analysis for algorithms: truncation and rounding errors are usually analyzed correctly. However, the ability of a sequence of algorithms to meet error requirements across the range of input variables is much more difficult to assess.
• Inadequate validity checking: individual variable validity checks may be relatively simple to implement. However, checking input combinations for validity requires a much greater level of understanding of the software’s purpose (e.g., is a person’s date of birth later than the date of death? Does the end date of a task precede its start date?).
• Interface errors: do some modules call themselves? Does a set of modules form a daisy chain (i.e., A calls B calls C calls A)? Is the proper set of inputs passed to each module?
• Inadequate error recovery: what happens when a module cannot execute? Will the design produce catastrophic results, or will the design provide some reasonable recovery action?
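Two of the design checks listed above lend themselves to a short illustration: validity checking of input combinations and detection of daisy-chained module calls. The following is a minimal sketch, not a prescribed method; the function names and the representation of the design as a dictionary mapping each module to the modules it calls are assumptions made for the example.

```python
from datetime import date
from typing import Dict, List, Optional


def check_person_dates(birth: date, death: Optional[date]) -> List[str]:
    """Cross-field validity check: each date may be valid alone, yet the pair invalid."""
    problems = []
    if death is not None and death < birth:
        problems.append("date of death precedes date of birth")
    return problems


def find_call_cycle(calls: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one call cycle (e.g., A calls B calls C calls A) or None.

    `calls` maps each module name to the modules it calls (hypothetical format).
    Uses a depth-first search that tracks the modules on the current call path.
    """
    on_path: List[str] = []   # modules on the path being explored
    finished = set()          # modules fully explored without finding a cycle

    def visit(module: str) -> Optional[List[str]]:
        on_path.append(module)
        for callee in calls.get(module, []):
            if callee in on_path:                      # back edge: cycle found
                return on_path[on_path.index(callee):] + [callee]
            if callee not in finished:
                cycle = visit(callee)
                if cycle is not None:
                    return cycle
        on_path.pop()
        finished.add(module)
        return None

    for module in calls:
        if module not in finished:
            cycle = visit(module)
            if cycle is not None:
                return cycle
    return None


if __name__ == "__main__":
    print(check_person_dates(date(1950, 1, 1), date(1940, 1, 1)))
    print(find_call_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

A module that calls itself is reported as a cycle of length one. Whether such recursion is an error depends on the design rules of the project; the check only makes it visible.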
7.3.1.3 Code and Unit Testing Phase
In the coding phase, a programmer translates the detailed software design into the programming language specified during the requirements or design phases. Code in higher order languages, such as C or Ada, is referred to as source code. A compiler is used to transform the source code into machine-readable object code. When a unit’s coding is completed, a code review or inspection should be conducted to identify any design errors, to identify cases where the code does not implement the design, or to determine conformance with code style (e.g., naming conventions) and standard constructs. Depending on the program requirements, these reviews may be informal or formal; however, in all cases, the goal is the same: to identify faults in the code before they become failures in operation.
Unit testing is conducted to verify the functionality of the unit. During unit testing, the programmer has the most control over test conditions. This is often the last opportunity to find and remove faults cheaply. In succeeding test phases, the units will have been aggregated into CSCs or a CSCI, and the isolation and removal of faults is usually much more time consuming (and expensive).
Coding is a fault-prone process. It takes great discipline by the individual creator of a unit to remember all the details that must be put into the code and then to have the patience and perseverance to implement the design and develop a thorough set of unit tests. A number of standard bugs (not unlike common grammatical errors that are found in written text) will occur in this phase. These include
• missing code;
• unreachable code;
• improper or incomplete validity checks;
• incomplete initialization or resetting of variables and parameters;
• improper logic for branch conditions and loops;
• out-of-range calculations and infinite loops; and
• any failure to implement the documented design.
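Several of the standard bugs listed above (incomplete validity checks, faulty branch conditions at range boundaries, uninitialized accumulators) are the kind that unit tests are well placed to catch. The sketch below shows a small hypothetical unit, `mean_in_range`, together with unit tests written with Python's standard `unittest` module; the unit and its test cases are invented for illustration and are not taken from the handbook.

```python
import unittest


def mean_in_range(samples, low, high):
    """Mean of the samples that lie in the closed interval [low, high]."""
    if low > high:                      # validity check on the input combination
        raise ValueError("low must not exceed high")
    total = 0.0                         # accumulators initialized on every call
    count = 0
    for x in samples:
        if low <= x <= high:            # branch condition includes both boundaries
            total += x
            count += 1
    if count == 0:                      # avoids an out-of-range division by zero
        raise ValueError("no samples in range")
    return total / count


class MeanInRangeTest(unittest.TestCase):
    def test_boundaries_are_included(self):
        self.assertEqual(mean_in_range([1, 2, 3], 1, 3), 2.0)

    def test_out_of_range_values_are_ignored(self):
        self.assertEqual(mean_in_range([0, 2, 10], 1, 3), 2.0)

    def test_empty_selection_is_rejected(self):
        with self.assertRaises(ValueError):
            mean_in_range([5], 1, 3)

    def test_inverted_limits_are_rejected(self):
        with self.assertRaises(ValueError):
            mean_in_range([1], 3, 1)


def run_unit_tests() -> bool:
    """Run the test case programmatically and report overall success."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanInRangeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()


if __name__ == "__main__":
    print("all unit tests passed:", run_unit_tests())
```

Each test targets one item from the list: the boundary test would fail if the branch condition used `<` instead of `<=`, and the last two tests would fail if the validity checks were missing.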
7.3.1.4 Integration and System Testing Phase
Integration testing begins when unit tests are completed. At this time, units that have been separately tested are assembled into logical combinations and tested again to demonstrate that the assembly meets product requirements. This process is repeated by combining larger and larger groups of units until all units in a CSCI have been integrated. System testing is then conducted to demonstrate that the software product (which may include multiple CSCIs) and hardware work together to accomplish all system functions and that the resulting product is ready for release to production. At this point, readiness also implies that product manuals, other documents, and training materials are ready, available, and consistent with the software product.
The test phase does not directly produce code for the final product. The errors and faults introduced in this phase influence the ability of the test program to detect faults in the software and can include
• test plans or procedures that have incorrectly interpreted software requirements or are not traceable to them; and
• errors, defects, or faults in code written for the test program (e.g., test cases, drivers, or special databases).
7.3.1.5 Acceptance Testing Phase
At this point, the development team relinquishes its testing responsibility to an acceptance test team. The acceptance test team determines whether or not the product now satisfies the original product requirements. An acceptance test plan is developed and the phase ends with successful completion of all the tests in the acceptance plan. (Different testing techniques will be described in Section 7.4.3.)
7.3.1.6 Maintenance and Operation Phase
If software requirements are stable, activities will be focused on fixing errors that appear in the exercise of the software or on fine-tuning the software to enhance its performance. On the other hand, if the requirements continue to change in response to changing user needs or changing hardware, this stage could resemble a mini life cycle in itself as software is modified to keep pace with operational needs.
7.3.1.7 Retirement Phase
The user may decide at some point not to use the software anymore and to discard it. Due to recurrent changes, the software might have become impossible to maintain further (e.g., the documentation may have grown to an unmanageable size, or it may be incomplete or missing).
7.3.2 Software Development Standards
DOD-STD-2167A, Defense System Software Development, 1988 (DoD 1988), established “uniform requirements for software development that are applicable throughout the system life cycle.” Although the standard did not specify or discourage the use of any specific software development method, it did use a waterfall-like life cycle to identify development phases and it identified documents and reviews associated with each phase. Many of the phases are similar to those described in Section 7.3.1.
MIL-STD-498, Software Development and Documentation, 1994 (DoD 1994), superseded DOD-STD-2167A and other DoD software standards. Although 2167A did not specify a waterfall model, 498 clearly presented more options to the developer. It described the possibility for software to be developed in one or more “builds” where “some activities may be performed in every build, [and] others may be performed only in selected builds . . . until all builds are accomplished.” In MIL-STD-498, CSCIs are decomposed into software units that may or may not be related to each other in a hierarchical way. This provided more flexibility than the approach of DOD-STD-2167A and was more compatible with object-oriented designs. This also provided greater flexibility in using computer-based configuration management tools.
IEEE/EIA 12207, Standard for Information Technology, 1996 (IEEE 1996), formally replaced MIL-STD-498 in 1998. This standard defines a set of processes that span the entire software life cycle, from concept to retirement. For each process, it also defines one or more information items that are inputs or outputs of the process. There are three volumes to this standard: 12207.0-1996 describes the base standard, 12207.1-1997 is a guide to life-cycle data, and 12207.2-1997 is a guide to process implementation.
7.3.3 Distribution of Errors over the Development Life Cycle and Related Costs
Ideally, the number of errors might be expected to decrease with each new phase of the life-cycle development process, with a high number of failures removed during unit testing, a smaller number at integration testing, and an even smaller number at system testing, and none in operation. Furthermore, errors removed during integration testing are likely to be related to interface problems, errors removed at the system test level should be related to compatibility between software and hardware, and so on. However, studies have shown that a number of software projects demonstrate very different trends (Neufelder 1993). Either the number of failures peaks in the system test phase or, worse, it increases progressively from one phase to another. The latter behavior could be found in software projects with constantly changing or added requirements. Errors identified late in the life cycle might be extremely costly to remove. Indeed, data show that the cost of fixing an error increases by roughly a factor of 10 when one progresses from one phase of the life-cycle model to another (Neufelder 1993).
7.4 TECHNIQUES TO IMPROVE SOFTWARE RELIABILITY
A number of techniques are available to improve the reliability and safety of software products. A brief description of these techniques follows.
7.4.1 Designing Reliable Software
Software engineering has given birth to a number of techniques that help programmers systematically derive the software design from its detailed specification. This section introduces some of the commonly used software design techniques. (The interested reader is referred to Bell, Morrey, and Pugh, 1992, for further information on this specific topic.)
7.4.1.1 Structured Programming
Structured programming is directly concerned with the clarity of the design. It postulates that, for the design to be easy to understand, a program structure needs to be developed in which the important software components and their interrelationships can be easily identified without being obscured by unnecessary details. Structured programming claims that limiting the number of allowed control structures is a first step in that direction. Programmers should restrict themselves to sequences (a succession of statements that will execute one after another), decision statements (written as if…then…else… statements), and loops (written as while…do… statements). These three control structures, shown in Figure 7.3, possess only one entry and one exit point, a factor that makes the software easier to understand as a whole at different levels of abstraction. Structured programming precludes the use of GOTOs because they allow software components to have more than one exit point.
7.4.1.2 Design Techniques
The limitation in the number and nature of control structures used for programming does not define the program design completely. Additional techniques are needed for its detailed specification. Functional decomposition. Functional decomposition is a structured programming method used to define the detailed software design, as well as its gross architecture.
Figure 7.3 Basic structured programming constructs: (a) sequence; (b) decision (if ... then ... else ...); (c) loop (while ... do ...). [Flowcharts built from assignments such as x = x + a and y = y + b, a test x > y with Yes and No branches, and a loop test on a counter i.]
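The three constructs of Figure 7.3 can be sketched in code. The Python functions below are only an illustration: the assignments echo those in the figure, but which branch performs which assignment, the loop bound `n`, and the function names are assumptions, since the flowchart details are not recoverable here. Each function has a single entry and a single exit (one `return` at the end).

```python
def sequence(x, y, a, b):
    """Sequence: statements executed one after another."""
    x = x + a
    y = y + b
    return x, y


def decision(x, y, a, b):
    """Decision: if ... then ... else ... with one entry and one exit."""
    if x > y:
        x = x + a
    else:
        y = y + b
    return x, y


def loop(x, a, n):
    """Loop: while ... do ..., repeated while the counter i is below n."""
    i = 0
    while i < n:
        x = x + a
        i = i + 1
    return x


if __name__ == "__main__":
    print(sequence(1, 2, 3, 4))
    print(decision(5, 2, 1, 1))
    print(loop(0, 2, 3))
```

Because every construct is entered at the top and left at the bottom, the functions can be composed or nested without the reader having to trace jumps between them.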
The method proceeds top down and, as indicated by its name, focuses on the functions that the software needs to perform. Data play a secondary role. High-level functions (with a high level of abstraction) are considered first. The designer determines how these may be achieved, using functions belonging to a lower level of abstraction. For example, if one wants to start the engine of a car (high-level function), one will need to open the door of the car, step inside, put the key in the lock, and turn it (four lower level functions). Once a level of abstraction has been entirely defined, the design process proceeds to the next level of abstraction. The technique is thus essentially based on a progressive refinement of functionalities. The design is expressed in pseudocode language (i.e., sequences of sentences beginning with verbs and using the approved constructs of structured programming). Breadth-first (dealing with one level of abstraction at a time) as well as depth-first (dealing with one function only through all levels of abstraction, then returning to deal with the next high-level function) approaches may be used. One of the major disadvantages of functional decomposition is that it is not well defined. Because of its inherent fuzziness, it is difficult to apply, and different programmers are bound to create different designs.
Data structure design. Data structure design, also called Jackson structured programming, is oriented toward the development of a detailed program design, expressed once again in pseudocode language (PDL). The methodology and its resulting design are driven by the structure of the input files from which the software will read and by the structure of the output files that will be generated. According to this method, an ideal software design should reproduce the structure of the data in the I/O files. This methodology is the most systematic design method available to date. Because it is laid down in steps that need to be followed precisely, it is simple, rational, teachable, anti-inspirational, and consistent: two different programmers will end up writing the same program. However, it is not applicable to scientific programs because their content is essentially algorithmic (functional).
7.4.1.3 Design Issues
Program modularity: a key issue in software design. Modularity is one of the key issues in software engineering and, more precisely, in software design. How large and how complex should a module be? How many interactions should be allowed between program modules? These questions need to be considered during software design. A program is modular if it is built from pieces essentially independent of each other. Modularity should be pursued in order to ease module design, debugging, and testing, and to facilitate maintenance. Further, and very significantly, it enables independent development of different pieces of the same software and increases potential for reuse.
Module size. An extreme view claims that the size of a module should be limited to seven lines of code or less, an argument that comes from the psychological observation that humans can only memorize seven items at a time. However, this would lead to a huge number of interfaces between software modules and a complexity that would defeat the attempt to increase the readability of the code. Actually, experience shows that the number of lines of code does not reflect the complexity of the software. Therefore, other measures of complexity have been developed.
Module complexity. McCabe’s complexity metric (1985) relates module complexity to the number of decision points (see Section 7.5.2). McCabe maintains that the complexity metric should not exceed 10 for any given software module. However, this argument is unconvincing, given how difficult it can be to understand the purpose of lengthy pieces of code, even if they have only a few decision points.
The use of global data. The code length and its number of decision points are not the only attributes of module complexity. The presence of numerous global data (data shared by more than one module) is extremely prejudicial to code readability. Procedures using local data are easier to study and easier to remove without contaminating the residual software.
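The effect of global data on readability can be seen in a very small example. The sketch below is hypothetical (the pricing functions and the name `TAX_RATE` are invented for illustration): the first function depends on data that any other module may change, whereas the second receives everything it needs through its parameters.

```python
# Version relying on global data: the result depends on state that
# any other part of the program may modify.
TAX_RATE = 0.25


def gross_price_global(net):
    return net * (1 + TAX_RATE)


def set_promotional_rate():
    """Some distant piece of code that silently changes the shared data."""
    global TAX_RATE
    TAX_RATE = 0.0


# Version relying on local data only: the result is determined
# entirely by the arguments, so the function can be studied,
# tested, or removed in isolation.
def gross_price(net, tax_rate):
    return net * (1 + tax_rate)


if __name__ == "__main__":
    print(gross_price_global(100))   # uses whatever TAX_RATE currently holds
    set_promotional_rate()
    print(gross_price_global(100))   # same call, different result
    print(gross_price(100, 0.25))    # unaffected by the global change
```

To understand the first function, a reader has to find every place in the program that assigns `TAX_RATE`; the second can be understood from its own two lines.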
In the same vein, concepts such as information hiding* and data abstraction or encapsulation,† which are used in object-oriented programming, achieve better modularity (i.e., they increase changeability, independent development, and comprehensibility). Cohesion and coupling. Modules should be built so that the number of interactions between modules is limited (low coupling) and a high number of interactions are inside a module (high cohesion). Different types of cohesion have been indexed: coincidental cohesion (when the module content has been limited arbitrarily), logical cohesion (when the module performs a set of logically similar functions—say, print salary and print date of birth), temporal cohesion (when the module contains functions that need to be performed simultaneously), communicational cohesion (when functions are grouped together because they act on the same data), and functional cohesion (when the module executes one and only one function). Design of shared modules. Shared modules should be designed bottom up to be sure that they are independent of the context in which they are being used. 7.4.2
Designing Fault-Tolerant Software
Fault-tolerant software stems from the recognition that whatever the amount of testing (see Section 7.4.3) or however extensive the use of formal proofs (see Section 7.4.4), it will never be possible to create error-free software (Neufelder 1993; Scott, Gault, and McAllister 1987). Recognizing this limitation, another approach to enhancing software reliability is to create a fault-tolerant architecture that will provide recovery from software failure. The expensive character of this approach limits its use to high-risk applications (such as embedded software for controlling satellites or for life-critical applications).

* Information hiding is based on the principle that the user of a specific object (piece of code) should not have access to the internals of the object.
† Data encapsulation means that data and procedures (operations) are contained in a common object, and the data in this object can only be modified through the procedures defined for that object.

7.4.2.1 Recovery-Block Design
The recovery-block design is shown in Figure 7.4. Versions of software are being developed under the same set of requirements. The assumption is that, because these pieces of software are being developed independently, the likelihood that they will fail at the same time (under the same set of inputs) is negligible. This, of course, assumes that the requirements are correct. Versions are then ordered from the most to the least reliable. (The testing intensity that each piece of software has undergone provides a basis for this ordering.) The most reliable software will execute first. Once the output has been computed, an acceptance test is run. If the output is rejected, a recovery block is entered. The recovery is performed in three steps: The input is recovered, the next most reliable software is executed with this restored input, and the output is submitted to the same acceptance test. If not accepted, the next recovery block is entered, and so on. Recovery of the input and acceptance testing are the two main weaknesses of this design. Hence, they need to be handled with caution. In particular, the acceptance test software needs to be tested thoroughly if the recovery block design is to achieve higher levels of reliability than the individual piece of software.

Figure 7.4 Fault-tolerant design: the recovery block structure. (Flowchart: the input is passed to the most reliable version, version 1, and its output goes to the acceptance test. If accepted, the program continues. If not, the recovery block is entered: the input is recovered, the next most reliable version is executed, and the acceptance test is run again.)

7.4.2.2 N-Version Programming
In N-version programming, several versions of the software are developed independently. The N programs are executed in parallel (see Figure 7.5). Upon their completion, the outputs are compared. If at least two programs share the same output, it is declared correct, the output is accepted, and the process continues. This design does not exhibit the weaknesses of recovery blocks. However, it is not fit for applications in which multiple correct outputs can be generated (such as would be the case of applications that, say, compute paths to go from one point to another on a map). It will also discriminate against correct solutions affected by rounding errors.

Figure 7.5 Fault-tolerant design: N-version programming. (Flowchart: the input is passed to versions 1, ..., k, ..., n; the acceptance test asks, "Do two or more versions agree that the output is correct?" If yes, the program continues; if no, the outcome is a failure.)

7.4.2.3 Consensus Recovery Block
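Because this design builds on the two preceding ones, it may help to see the three control flows side by side. The following is a minimal sketch, not an implementation taken from this chapter: the integer square root task, the version functions (`v_good`, `v_off`, `v_float`, `v_zero`), and the acceptance test `accept` are hypothetical, and the versions run sequentially rather than in parallel. The consensus variant votes first and, failing agreement, submits each output to the acceptance test in order of reliability.

```python
import math
from collections import Counter
from typing import Callable, Optional, Sequence

Version = Callable[[int], int]
AcceptanceTest = Callable[[int, int], bool]


def recovery_block(versions: Sequence[Version], accept: AcceptanceTest,
                   x: int) -> Optional[int]:
    """Run versions from most to least reliable until one output is accepted."""
    for version in versions:
        restored_input = x  # "recovery of the input": every version sees the original input
        try:
            result = version(restored_input)
        except Exception:
            continue  # a crash is treated like a rejected output
        if accept(restored_input, result):
            return result
    return None  # every version was rejected


def vote(outputs: Sequence[int]) -> Optional[int]:
    """Return an output shared by at least two versions, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def n_version(versions: Sequence[Version], x: int) -> Optional[int]:
    """N-version programming: run every version, then vote."""
    return vote([version(x) for version in versions])


def consensus_recovery_block(versions: Sequence[Version], accept: AcceptanceTest,
                             x: int) -> Optional[int]:
    """Vote first; without agreement, fall back to acceptance tests by reliability."""
    outputs = [version(x) for version in versions]  # ordered most reliable first
    agreed = vote(outputs)
    if agreed is not None:
        return agreed
    for output in outputs:
        if accept(x, output):
            return output
    return None


# Hypothetical versions of an integer square root routine.
def v_good(x: int) -> int:
    return math.isqrt(x)

def v_off(x: int) -> int:
    return math.isqrt(x) + 1  # deliberately faulty version

def v_float(x: int) -> int:
    return int(x ** 0.5)  # adequate for small inputs only

def v_zero(x: int) -> int:
    return 0  # deliberately faulty version

def accept(x: int, y: int) -> bool:
    """Hypothetical acceptance test: checks the defining property of isqrt."""
    return y >= 0 and y * y <= x < (y + 1) * (y + 1)
```

With the input 17, `n_version([v_off, v_good, v_zero], 17)` finds no two matching outputs and reports failure, whereas `consensus_recovery_block` with the same versions falls through to the acceptance test and returns the output of `v_good`.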
The consensus recovery block (see Figure 7.6) combines attributes of the recovery block with those of N-version programming and attempts to eliminate the weaknesses in the two designs. The consensus recovery block requires the development of N different versions of a program, of a voting procedure, and of an acceptance test. The different versions of the software are ranked according to their level of reliability. All versions are executed and submit their output to the voting procedure. If there is no agreement, each output is submitted successively to the acceptance test in order of reliability. The process stops as soon as one of the outputs passes the test, and the software continues with the calculation. Consensus blocks are more reliable than the two fault-tolerant designs previously presented.

Figure 7.6 Fault-tolerant design: the consensus recovery block. (Flowchart: the input is passed to versions 1, ..., k, ..., n; the voting step asks, "Do two or more versions agree that the output is correct?" If yes, the program continues. If no, the modified recovery block structure is entered: the acceptance test is run on the output of the most reliable software and, if that output is rejected, on the output of the next most reliable software, and so on; the program continues as soon as an output is accepted.)

7.4.3 Testing

Software testing is the process of executing a program with the intention of uncovering errors. However, no number of test cases can ever assure that the software will be error free. The number of test cases needed to verify even a very small program is huge. A number of principles and strategies can help programmers test programs more efficiently. The interested reader is referred to Myers (1979) for a brief introduction to testing and to Beizer (1984) for a more comprehensive survey of testing techniques. This section presents several test methods with emphasis on module testing and integration testing.

7.4.3.1 Black-Box and White-Box Testing
The two broad categories of testing strategies are called black-box and white-box testing. White-box testing uses knowledge of the program structure to ascertain that derived test cases cover as many logical paths in the programs as possible. Black-box testing views the program as a black box, does not take advantage of program structure, and usually concentrates on input and output to the program (see Section 7.4.3.2). Some specific white-box and black-box testing techniques are presented in Table 7.2; they will be described in more detail in the next section. Good coverage of the whole program requires a combination of white-box and black-box testing. Indeed, because white-box testing requires a deep knowledge of the program, it is usually carried out by the program developers, who may overlook flaws in the program. Furthermore, white-box testing does not test the conformity of the program to its requirements. In practice, test cases are first developed using black-box methods; supplementary test cases use white-box methods.

Table 7.2 Module Testing: White-Box and Black-Box Testing Techniques
Black box: equivalence partitioning; boundary-value analysis.
White box: statement coverage; decision coverage; condition coverage; decision/condition coverage; multiple-condition coverage.

7.4.3.2 Module Testing: White-Box and Black-Box Testing Strategies

Logical-coverage testing. Table 7.3 defines different white-box logical-coverage testing techniques used to test procedure M (see Figure 7.7). As an example, the condition coverage technique will be used to derive the set of corresponding test cases.
Table 7.3 Logical Coverage Testing Techniques

Logical Coverage | Description | Test Cases for Procedure M
Statement coverage | Exercise each statement in the program at least once | A=2, B=0
Decision coverage | Exercise each alternative output of each decision statement | A=2, B=0; A=0, B=1, X=0, C=2
Condition coverage | Exercise each condition of every decision statement at least once | A=2, B=0, C=3, X=2; A=0, B=1, C=2, X=0
Condition/decision coverage | Exercise each condition and each alternative output of each decision statement at least once | A=2, B=0, C=3, X=2; A=0, B=1, C=2, X=0
Multiple-condition coverage | Exercise every combination of conditions at least once | A=2, B=0, C=3, X=2; A=2, B=0, C=2, X=2; A=2, B=1, C=2, X=0; A=2, B=1, C=3, X=0; A=0, B=0, C=2, X=2; A=0, B=0, C=3, X=2; A=0, B=1, C=2, X=0; A=0, B=1, C=3, X=0
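A coverage claim of this kind can be checked mechanically. The sketch below is an illustrative Python transcription of procedure M, not code from the handbook; it reads the first decision as ((A > 1) or (C = 2)) and (B = 0), records the outcome of each decision, and checks whether a set of test cases drives both decisions to both "true" and "false." The helper name `decision_coverage` is hypothetical. The two test cases are the ones the table lists for condition/decision coverage.

```python
def procedure_m(a, b, c, x):
    """Python transcription of procedure M; returns (x, c, decision outcomes)."""
    first = (a > 1 or c == 2) and b == 0   # first decision statement
    if first:
        x = x / a
    second = a == 2 or x > 1 or c == 3     # second decision, evaluated on the updated x
    if second:
        x = x + 1
        c = x + a
    return x, c, (first, second)


def decision_coverage(cases):
    """True if every decision takes both outcomes over the given test cases."""
    seen = [set(), set()]
    for case in cases:
        _, _, outcomes = procedure_m(**case)
        for index, outcome in enumerate(outcomes):
            seen[index].add(outcome)
    return all(outcomes == {True, False} for outcomes in seen)


cases = [
    dict(a=2, b=0, c=3, x=2),  # both decisions true
    dict(a=0, b=1, c=2, x=0),  # both decisions false
]
```

Here `decision_coverage(cases)` holds, while the first case alone does not cover the "false" outcomes. Recording outcomes at run time, rather than reasoning from the inputs only, matters because the first branch changes X before the second decision is evaluated.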
The goal is to exercise each condition of every decision statement at least once. The procedure possesses:
• two decision statements, namely, "((A > 1) or (C = 2)) and (B = 0)" and "(A = 2) or (X > 1) or (C = 3)"; and
• six conditions: A > 1, C = 2, B = 0, A = 2, X > 1, C = 3.

Selecting cases where A > 1 or A ≤ 1, C = 2 or C ≠ 2, B = 0 or B ≠ 0, A = 2 or A ≠ 2, X > 1 or X ≤ 1, and C = 3 or C ≠ 3 yields, for instance, a first test case: A = 2 (satisfying A > 1 and A = 2), B = 0 (satisfying B = 0), C = 3 (satisfying C ≠ 2, C = 3), and X = 2 (satisfying X > 1). The second test case could be composed of A = 0 (satisfying A ≤ 1 and A ≠ 2), B = 1 (satisfying B ≠ 0), C = 2 (satisfying C = 2, C ≠ 3), and X = 0 (satisfying X ≤ 1).

M: PROCEDURE (A,B,C,X);
   IF (((A>1)|(C=2))&(B=0)) THEN DO;
      X=X/A;
   END;
   IF ((A=2)|(X>1)|(C=3)) THEN DO;
      X=X+1;
      C=X+A;
   END.
Figure 7.7 A piece of software code, procedure M.

The software presented in Figure 7.7 possesses three statements, "X = X/A," "X = X + 1," and "C = X + A," and two decision statements, "((A > 1) or (C = 2)) and (B = 0)" and "(A = 2) or (X > 1) or (C = 3)." Each decision statement has two alternative outputs: "true" and "false." The decision statement "((A > 1) or (C = 2)) and (B = 0)" has three conditions: "A > 1," "C = 2," and "B = 0."

Equivalence partitioning. The aim in this technique is to develop a minimum set of test cases covering as many different input conditions as possible. To achieve this goal, the input domain of the program is subdivided into a finite number of equivalence classes, that is, subdomains where the program is expected to have the same output no matter what the input condition is. Hence, test-case design using equivalence partitioning proceeds in two steps: (1) identification of equivalence classes (valid equivalence classes are those for which the input is valid; invalid equivalence classes are those for which the input is invalid), and (2) definition of test cases (first covering all valid equivalence classes, with each test case covering as many of them as possible, then designing test cases that each cover one invalid equivalence class). Consider a program that inputs the size of an array, A. The software requirements specify that the size of A ranges from 10 to 200. For this case, there is one valid equivalence class, "10 ≤ size of A ≤ 200," and two invalid equivalence classes, "size of A < 10" and "size of A > 200." Hence, this will generate three test cases: for instance, size of A = 50, size of A = 9, and size of A = 201. The assumption of equivalence partitioning is that two input values belonging to the same equivalence class provoke similar program responses; in other words, these input values are equivalent with respect to program behavior.

Boundary value analysis. Boundary value analysis is a technique for deriving test cases based on evidence that test cases that explore the boundaries of input and output equivalence classes are more likely to expose a fault in the software than others. If we take the case of size of A, test cases using the boundary value analysis technique will include an input set with size of A = 9, size of A = 10, size of A = 200, and size of A = 201.

7.4.3.3 Integration Testing
As briefly explained in Section 7.3.1, integration testing starts as soon as module testing has been completed. Modules are combined together progressively to test interfaces. The strategy used to combine the different elements into a final software product will influence the order in which modules are actually coded and tested, the cost of testing, and the cost of generating test cases. Nonincremental testing first checks all the different modules separately. Once this testing activity is completed, the software is completely assembled and tested for interface errors. Incremental testing tests one module first and then combines it with a second that has not yet been tested; the combination is then tested together. The next module is then added to the combination already constructed, and the process continues. Incremental testing can proceed in two ways: top down or bottom up. Bottom-up testing requires the use of drivers,* while top-down testing requires extensive use of stubs (see Figure 7.8). Table 7.4 discusses the respective advantages and disadvantages of top-down and bottom-up incremental testing.

* A driver module is one that needs to be coded to drive or transmit test cases to the module under test. A stub module must be coded to simulate the behavior of routines or procedures that the tested module calls during its execution.
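The roles of drivers and stubs can be illustrated with a small sketch. All names here (`compute_tax`, `rate_lookup_stub`, `driver`) and the tax example itself are hypothetical and serve only to show the two roles: the stub returns a canned value in place of a lower-level routine that has not been integrated yet, and the driver transmits test cases to the module under test and collects the results.

```python
def compute_tax(gross, rate_lookup):
    """Module under test: calls a lower-level routine supplied as rate_lookup."""
    return round(gross * rate_lookup(gross), 2)


def rate_lookup_stub(gross):
    """Stub: simulates the not-yet-integrated rate routine with a fixed answer."""
    return 0.25


def driver():
    """Driver: feeds test cases to the module under test and records the outcome."""
    test_cases = [(100.0, 25.0), (0.0, 0.0), (80.0, 20.0)]  # (input, expected output)
    results = []
    for gross, expected in test_cases:
        actual = compute_tax(gross, rate_lookup_stub)
        results.append((gross, expected, actual))
    return results
```

In a top-down sequence, the stub would later be replaced by the real routine; in a bottom-up sequence, the driver would be discarded once the real calling module is available.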
Figure 7.8 Incremental integration testing. (a) Program undergoing integration testing: modules A, B, C, D, and E. (b) First step of top-down integration: module A with Stub 1 and Stub 2. (c) First step of bottom-up integration: Driver 1 with modules D and E.

Table 7.4 Comparison of Top-Down and Bottom-Up Testing

Top-Down Testing
Advantages:
• Quickly locates problem areas at the higher levels of the program
• Once the I/O functions are added, representation of test cases is easier
• Early skeletal program allows early demonstrations
Disadvantages:
• Stub modules must be written
• Stub modules may be complex
• Before the I/O functions are added, the representation of test cases in stubs can be difficult
• Test conditions may be impossible, or difficult, to create
• Observation of test output is more difficult
• Design and testing can seemingly be overlapped
• Completion of the testing of certain modules may be deferred

Bottom-Up Testing
Advantages:
• Quickly locates problem areas at the lower levels of the program
• Test conditions are easier to create
• Observation of test results is easier
Disadvantages:
• Driver modules must be produced
• The program does not exist as an entity until the last module is added

7.4.4 Formal Methods (Neuhold and Paul 1991)

7.4.4.1 Formal Specification Methods
Formal specification methods consist of a detailed specification of the program requirements, using a formal language, that is, not the natural language used in traditional programming, but rather a "formal" substitute with a precisely defined syntax and semantics. Mathematics is an example of such a language. The following example illustrates the use of logic as a specification language. Consider the operation of a teller machine. Users need to withdraw money from or deposit it into their bank accounts, check the status of their accounts, and so on. During the withdrawal process, the teller machine should first screen access, admitting only those service requests of clients with magnetic cards and the right personal identification number (PIN). Three consecutive unsuccessful attempts with the same magnetic card lead to its removal. The amount that a client is allowed to debit will depend on the balance of the account, on the amount of credit allowed by the bank, and on restrictions on the amount that can be withdrawn from a machine in one attempt or in a week. These limitations can be specified in a formal language:
• x: card number;
• y: PIN;
• a: access variable (a(x) = 1, access is allowed to x); and
• ∀ x, y such that (pin(x) = y and attempts(x) ≤ 3 implies a(x) = 1).
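The access rule lends itself to a direct executable reading. The sketch below is one possible interpretation, not part of the specification itself: the function name `access` and its parameters are hypothetical, and returning 0 whenever the condition fails is an added assumption, since the rule as written only states when access must be granted.

```python
MAX_ATTEMPTS = 3  # bound on attempts(x) taken from the access rule


def access(card_pin: str, entered_pin: str, attempts: int) -> int:
    """Access variable a(x): 1 if access is allowed, 0 otherwise.

    card_pin    -- pin(x), the PIN registered for card x
    entered_pin -- y, the PIN typed by the client
    attempts    -- attempts(x), the number of attempts made with card x
    """
    if entered_pin == card_pin and attempts <= MAX_ATTEMPTS:
        return 1
    return 0  # assumption: access is refused in every other case


def rule_holds(card_pin: str, entered_pin: str, attempts: int) -> bool:
    """Check the implication (pin(x) = y and attempts(x) <= 3) implies a(x) = 1."""
    premise = entered_pin == card_pin and attempts <= MAX_ATTEMPTS
    return (not premise) or access(card_pin, entered_pin, attempts) == 1
```

Checking `rule_holds` over a range of PINs and attempt counts is a test, not a proof; it only samples the input domain.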
The complete set of these limitations constitutes a formal specification of the teller-machine software behavior. The Vienna development method (VDM), developed in 1973 at the IBM Vienna Research Laboratories, is an example of a widespread formal specification method (in this case, one using logic) that has been used for industrial applications. VDM bases its proof system on first-order logic. Formal specifications enforce clarity in the definition of requirements and leave little space for misinterpretation and ambiguity, which are always a concern in using a natural language. Formal specifications can be used by themselves or in combination with a proof apparatus in order to demonstrate that the program satisfies its requirements (see Section 7.4.4.2) or as the basis for automatic code generation (not considered here).

7.4.4.2 Formal Verification
Formal verification (Galton 1992) uses inference on the formal specification to demonstrate that the program satisfies its requirements. The following example, which uses first-order logic augmented by a small number of additional properties as a proof system, will be used throughout this section: Geo is a software module that computes the sum of the first n terms of a geometric progression: s = 1 + q + q^2 + q^3 + ... + q^k + ... + q^(n–1); the inputs to the program are q and n. Assume that q and n are positive integers. The output is "s" (consequently, s > 0). The program is given next (see Figure 7.9).

Procedure geo(q,n:integer; var s,k:integer) s:=1 k:=1 While k

A proof system based on an extension of the traditional logic was developed by Hoare (1969). This proof apparatus may be used to determine if, after running a program, preconditions bearing on the input variables of that program will lead to the right set of postconditions bearing on the output. The notations used are {P} for the set of preconditions, {Q} for the set of postconditions, and S as a notational representation of the software. We want to demonstrate that {P}S{Q}. The extension of the traditional logic consists of
• an axiom to describe the effect of an assignment statement: {P[x'/x]} x := x' {P}, where P[x'/x] is the logical property, P, in which all occurrences of x have been replaced by x'. This rule implies that a property that holds for variable x' will hold for the variable x if x' is assigned to x;
• rules to indicate how to compose two preconditions and two postconditions:
  • rule 1: if {P}S{Q} and {Q} implies {R}, then {P}S{R};
  • rule 2: if {P}S{Q} and {R} implies {P}, then {R}S{Q};
• a rule to indicate how to compose two software properties:
  • rule 3: if {P}S{Q} and {Q}S'{R}, then {P}S;S'{R}, where S;S' indicates that S' is run after S; and
• a rule that describes how to deal with a "while ... do ..." loop:
  • rule 4: if {P and B}S{P}, then {P} while B do S {not B and P}.
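The notation {P}S{Q} can be made concrete by writing the precondition, the loop invariant, and the postcondition as runtime assertions. The following Python sketch of Geo is illustrative only: the loop condition k < n and the order of the two assignments in the loop body are assumptions inferred from the surrounding discussion, not taken from the program listing. Assertions check the properties for the inputs actually run; they do not replace the proof, which covers every admissible input.

```python
def geo(q: int, n: int) -> int:
    """Sum of the first n terms of the geometric progression 1, q, q^2, ..."""
    # Precondition {P}: q > 0 (n is also assumed to be a positive integer).
    assert q > 0 and n > 0

    s = 1
    k = 1
    # The invariant {s > 0 and q > 0} holds before the loop is entered.
    assert s > 0 and q > 0

    while k < n:  # B, assumed loop condition
        s = s + q ** k
        k = k + 1
        # The loop body preserves the invariant (premise of rule 4).
        assert s > 0 and q > 0

    # On exit, {not B and P} holds (conclusion of rule 4).
    assert not (k < n) and s > 0 and q > 0
    return s
```

For q different from 1, the result can be cross-checked against the closed form (q^n – 1)/(q – 1); for example, `geo(3, 4)` is 40.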
We will prove that {q > 0} Geo {s > 0 and q > 0}. The proof starts with the last statement of the program and runs backward through the different program statements.

Proof: using the assignment axiom, {s > 0 and q > 0} k := k + 1 {s > 0 and q > 0}. Using the assignment axiom once again, {s + q^k > 0 and q > 0} s := s + q^k {s > 0 and q > 0}. Because {s > 0 and q > 0} implies {s + q^k > 0 and q > 0}, rule 2 yields {s > 0 and q > 0} L {s > 0 and q > 0}, where L denotes the body of the loop. In other words, {s > 0 and q > 0} is an invariant for the while ... do ... loop. Now, using rule 4 with P = {s > 0 and q > 0} and B = {k < n} gives: {s > 0 and q > 0} while k < n do L {k ≥ n and s > 0 and q > 0}, which implies P by rule 1. Using the assignment axiom once again, {1 > 0 and q > 0} s := 1; k := 1 {s > 0 and q > 0}. However, {1 > 0 and q > 0} is equivalent to {q > 0}; hence, {q > 0} Geo {s > 0 and q > 0}.

This short example illustrating the use of a formal proof system based on first-order logic shows that it can be demonstrated that a software product will meet its requirements for the whole range of possible input conditions. In this respect, formal methods differ drastically from conventional testing techniques, which can only afford to run a limited number of input conditions, sampled more or less cleverly from the input domain. The main problem, of course, is to identify the properties that the software needs to satisfy and to execute the proof. Proofs can be automated and the tools to do so are slowly being developed. However, there is a chance that, even if fully automated, the methods will remain underused because too few programmers know how to use them. Formal methods are not well understood because they are complex and require a considerable amount of training.

7.4.5 Software Development Process Maturity

After two decades of unfulfilled promises about the productivity and quality gains from applying new software methodologies and technologies, industry and government organizations are realizing that their fundamental problem is the inability to manage the software process (DoD 1987). In the absence of an organization-wide software process, repeating successful results often depends on having the same individuals available for the next project. This approach provides no basis for long-term product quality and reliability improvement throughout an organization. In 1986, the Software Engineering Institute (SEI) began developing a capability maturity model (CMM) for software development organizations (Paulk et al. 1993). The CMM describes five levels of software development process maturity for an organization. Higher maturity levels imply better predictability, lower risk, and increased quality and reliability. The CMM provides a guide for organizations wanting to improve their processes for developing and maintaining software or for procuring organizations wanting to evaluate the risks of contracting a software project to a particular organization.
The five levels of maturity are
• Initial: the software development process is characterized as ad hoc, few processes are defined, and project outcomes are hard to predict;
• Repeatable: basic project management processes are established to track cost, schedule, and functionality; processes may vary from project to project, but management controls are standardized; current status can be determined across a project's life; with high probability, the organization can repeat its previous level of performance on similar projects;
• Defined: the software process for both management and engineering activities is documented, standardized, and integrated into an organization-wide software process; all projects use a documented and approved version of the organization's process for developing and maintaining software;
• Managed: detailed measures of the software process are collected; both the process and the product are quantitatively understood and controlled, using detailed measures; and
• Optimizing: continuous process improvement is made possible by quantitative feedback from the process and from testing innovative ideas and technology.

At each maturity level, the CMM identifies "key process areas" (KPAs) that include organizational and project goals and activities that promote, document, or demonstrate process maturity in an area. The KPAs for level 2 comprise requirements management, project planning, project tracking and oversight, software subcontract management, software quality assurance, and software configuration management. SW-CMM appraisals conducted from 2000 through 2004 and reported to the SEI by January 2005 included over 1900 organizations. They indicated that 10% of the organizations were at level 1, 42% at level 2, 31% at level 3, 8% at level 4, and 9% at level 5. The SW-CMM was expanded and upgraded to address systems engineering, in the form of the systems engineering capability maturity model (SE-CMM) (Bate et al. 1995). The SW-CMM was eventually phased out and replaced by the CMMI (capability maturity model integration).
7.5 TECHNIQUES TO ASSESS SOFTWARE RELIABILITY

We list in this section a number of well-known techniques for the qualitative and quantitative assessment of software reliability.

7.5.1 Software Analysis Methods

The failure mode and effect analysis and fault-tree analysis (see Chapters 4, 5, and 9, this book) may also be applied to software development efforts.

7.5.1.1 Failure Mode and Effect Analysis (FMEA)

FMEA is a bottom-up approach that consecutively postulates failure of each component of the product and follows it through all its possible hazardous outcomes. The FMECA (failure mode and effect criticality analysis) includes a criticality ranking of each failure mode, based on failure probability and/or hazard severity. Applications of these techniques to hardware are numerous. However, applications to software are still scarce. This may be because it is difficult to identify failure modes of software components. The technique has been successfully used at the system level (for combinations of hardware and software).

7.5.1.2 Fault-Tree Analysis
Fault-tree analysis proceeds backward from effects (called either top event or hazard) to identification of the root causes of an event (failure of the individual components). Figure 7.10 is a fault-tree analysis of procedure M (see Section 7.4.3.2). The analysis is performed at the code level (the finest level) for one specific top event, namely, "x > 200." In other words, we want to identify the input conditions under which the output variable, x, will exceed 200. If M is part of a software that controls the rotational velocity of an engine, x exceeding 200 might violate the design limits for the engine and lead to its destruction. The methodology is applied by backtracking through the different software paths from an output value x > 200 to identify the values of different variables and parameters that lead to this unacceptable output. Analysis at the code level is not always affordable, especially for codes with up to 100,000 lines. Its application is usually limited to the module or submodule level and refined down to code level in areas that are critical for safety (Leveson and Harvey 1983).

Figure 7.10 Software fault-tree analysis for procedure M.

7.5.2 Software Metrics
Software metrics are quantitative indicators of the degree to which a software product or process possesses a given attribute. Metrics are frequently used to assess the status of and trends in a software development effort and to assess risk of moving from one phase of the life cycle to the next.

7.5.2.1 Requirements Measures: The Specification Completeness Measure

Product code is not usually produced during the software requirements phase, although prototyping activity or trial reuse of code from other projects may occur. The principal software measures available during this phase will address the size, completeness, and stability of the specification and hence the risk of moving the software life cycle into its next phase. The principal measures available at this time are
• M1 = number of requirements adequately defined or specified;
• M2 = number of requirements not adequately defined or specified;
• M3 = total number of requirements (M3 = M1 + M2); and
• M4 = number of requirements for which test requirements are defined.
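These four counts are typically combined into the ratios M1/M3 and M4/M3 and tracked over time. The sketch below shows one way to compute them for a single snapshot; the function names are hypothetical, and the default threshold of 0.75 is only an example value of the kind a project might agree on, not a prescribed limit.

```python
def specification_measures(m1: int, m2: int, m4: int) -> dict:
    """Derive M3 and the two ratios from the counts M1, M2, and M4."""
    m3 = m1 + m2  # total number of requirements
    if m3 == 0:
        raise ValueError("no requirements have been identified yet")
    return {
        "M3": m3,
        "defined_ratio": m1 / m3,  # M1/M3: share of requirements adequately specified
        "test_ratio": m4 / m3,     # M4/M3: share with test requirements defined
    }


def needs_management_review(m1: int, m2: int, m4: int, threshold: float = 0.75) -> bool:
    """True if either ratio is still below the (example) threshold."""
    measures = specification_measures(m1, m2, m4)
    return measures["defined_ratio"] < threshold or measures["test_ratio"] < threshold
```

For instance, with 60 requirements adequately specified, 20 not yet specified, and test requirements defined for 40, M1/M3 is 0.75 but M4/M3 is only 0.5, so this snapshot would still be flagged for review. Judging whether M3 is stabilizing requires a series of such snapshots, which is outside this sketch.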
A plot of M3 over time shows whether the identification of requirements is stabilizing. A plot of M1/M3 over time shows the completeness of the process of specifying the known requirements. Finally, a plot of M4/M3 over time shows how well the identification of test requirements is proceeding. If M3 has not begun to stabilize or if M1/M3 and M4/M3 have not reached some reasonable thresholds (e.g., 75%), then the uncertainty in the specification can present a technical risk that needs management review before proceeding to the next phase.

7.5.2.2 Design Phase Measures

In addition to the analysis of error data to identify error sources, other measures useful in this phase are
• a completeness measure that can be used to identify areas of technical risk associated with the specification and design;
• a complexity measure that can be used to evaluate the potential difficulty of committing the design to code; and
• a defect density measure that can be used to evaluate the growth of reliability in the design phase.

The completeness indicator. The completeness indicator (Department of the Air Force 1987) provides insight into the adequacy of the software specification. Its use can begin during the requirements phase of the software development process and continue effectively throughout the software development life cycle. The values determined from the components of the completeness indicator can be used to identify areas of technical risk and to assess how well product-level requirements have been translated into specification and subsequent design. The inputs for this indicator are obtained as the software specification's requirements and design mature. Inputs come directly from the requirements analysis and the design and code reviews. For the purpose of this indicator, functions and requirements may be considered equivalent. Initially, data will be available from the software specification review (SSR) or preliminary design review (PDR), as indicated for each input parameter. These data should be updated as the software moves through the design, coding, and unit testing phases. For products with multiple computer software configuration items (CSCIs), separate calculations should be made for each one.
The inputs to the completeness indicator are
P1 = number of functions not adequately defined or specified (SSR); the same as M2 in the requirements analysis phase;
P2 = total number of functions (SSR); the same as M3 in the requirements analysis phase;
P3 = number of data items not defined (PDR);
P4 = total number of data items (PDR);
P5 = number of defined functions not used (PDR);
P6 = total number of defined functions (SSR) (P6 = P2 – P1); the same as M1 in the requirements analysis phase;
P7 = number of functions referenced by defined functions but not defined (PDR);
P8 = total number of functions referenced by defined functions (PDR);
P9 = number of decision points not using all conditions or options (PDR);
P10 = total number of decision points (PDR);
P11 = number of condition options without processing (PDR);
P12 = total number of condition options (PDR);
P13 = number of calling routines with calling parameters that do not agree with defined parameters (PDR);
P14 = total number of calling routines (PDR);
P15 = number of condition options that are not set (PDR);
P16 = number of condition options that are set but have no processing associated with the option (PDR);
P17 = number of set condition options (PDR) (P17 = P12 – P15); and
P18 = number of data references having no destination (PDR).

Once the inputs are collected, the following components of the index are calculated:
functions satisfactorily defined (C1), C1 = (P2 – P1)/P2;
defined database references or items (C2), C2 = (P4 – P3)/P4;
defined functions used (C3), C3 = (P6 – P5)/P6;
defined referenced functions (C4), C4 = (P8 – P7)/P8;
all condition options are used at decision points (C5), C5 = (P10 – P9)/P10;
all condition options with processing are used at decision points (C6), C6 = (P12 – P11)/P12;
all calling routine parameters agree with the called routine's defined parameters (C7), C7 = (P14 – P13)/P14;
all condition options are set (C8), C8 = (P12 – P15)/P12;
all processing follows set condition options (C9), C9 = (P17 – P16)/P17; and
all data items have a destination (C10), C10 = (P4 – P18)/P4.
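The ten components and their weighted combination can be computed with a few lines of code. The sketch below is illustrative: the function names, the dictionary layout, and the sample counts are all hypothetical, and the weights are assumed to be non-negative values that sum to one. C9 uses P17 as its denominator, which keeps it a fraction of the set condition options.

```python
def completeness_components(p: dict) -> list:
    """Return [C1, ..., C10] from a dict mapping 1..18 to the inputs P1..P18."""
    return [
        (p[2] - p[1]) / p[2],      # C1: functions satisfactorily defined
        (p[4] - p[3]) / p[4],      # C2: data items defined
        (p[6] - p[5]) / p[6],      # C3: defined functions used
        (p[8] - p[7]) / p[8],      # C4: referenced functions defined
        (p[10] - p[9]) / p[10],    # C5: decision points using all condition options
        (p[12] - p[11]) / p[12],   # C6: condition options with processing
        (p[14] - p[13]) / p[14],   # C7: calling parameters that agree
        (p[12] - p[15]) / p[12],   # C8: condition options that are set
        (p[17] - p[16]) / p[17],   # C9: set condition options with processing
        (p[4] - p[18]) / p[4],     # C10: data references with a destination
    ]


def completeness(components: list, weights: list) -> float:
    """Weighted sum of the components; the weights must sum to one."""
    if len(components) != len(weights):
        raise ValueError("one weight is required per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical review counts, chosen so that P6 = P2 - P1 and P17 = P12 - P15.
sample = {1: 5, 2: 50, 3: 4, 4: 40, 5: 9, 6: 45, 7: 2, 8: 20, 9: 1, 10: 10,
          11: 2, 12: 20, 13: 1, 14: 10, 15: 4, 16: 4, 17: 16, 18: 8}
```

With equal weights of 0.1, the sample data give a completeness of about 0.855; putting all the weight on the first component reduces the measure to C1 alone.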
Completeness is then calculated as the weighted sum of these 10 components and is given by
\[
\text{COMPLETENESS} = \sum_{i=1}^{10} w_i C_i \tag{7.1}
\]
where wi is the weight associated with each component (a value between 0 and 1) and the sum of the weights equals one. Because each Ci is also between zero and one, the completeness measure is always between zero and one, with larger values representing more complete specifications. Although the individual input values P1 through P18 may be difficult to determine early in a project, the component values may be easier to estimate because they are all in fractional form. For example, although exact values for P1 and P2 may not be known, engineering experience may still allow one to estimate that C1 = 0.1 or 0.2 early in the project’s life.

The weights, wi, used to calculate completeness are selected by determining how essential each component in the equation is to developing a successful product. This can be a function of the complexity of the specification, the type of application, the amount of code to be reused, or other project or application factors. For example, during the requirements analysis and early preliminary design phases, it may be appropriate to set w1 = 1 and all other wi = 0. Although completeness is a complex measure, it can be calculated early in the design phase and updated periodically throughout development (e.g., at the conclusion of preliminary and detailed design). The values and value trends observed in the components of the indicator can be used to identify the stability of the specification and areas of technical risk.

Complexity measure. The cyclomatic complexity measure (IEEE 1989b), first introduced by McCabe (1985), may be used to determine the structural complexity of a module from a graph of the module’s operations. These graphs can be constructed from design phase information and, once coding has begun, from the program code.
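Returning to the completeness indicator of Equation 7.1, the weighted sum can be sketched in a few lines. This is a minimal illustration, not part of the cited standard: the function names `component` and `completeness`, the ten component values, and the equal weights are all hypothetical.

```python
def component(total, deficient):
    """One component Ci = (total - deficient) / total, e.g. C1 = (P2 - P1) / P2."""
    if total <= 0:
        raise ValueError("total count must be positive")
    return (total - deficient) / total


def completeness(components, weights):
    """Weighted sum of the components (Equation 7.1)."""
    if len(components) != len(weights):
        raise ValueError("one weight is required per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component estimates C1..C10 (assumed values for illustration).
C = [0.9, 0.8, 1.0, 0.95, 0.9, 0.85, 1.0, 0.9, 0.8, 0.9]

# Equal weights: every component matters equally.
equal = completeness(C, [0.1] * 10)

# Early-phase weighting described in the text: w1 = 1, all other weights 0.
early = completeness(C, [1.0] + [0.0] * 9)

print(round(equal, 3), round(early, 3))
```

With the early-phase weighting the indicator reduces to C1 alone, which is why it can be tracked before the PDR inputs exist.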
A strongly connected graph of a module contains four primitive elements. (“Strongly connected” means that each node is reachable from any other node. This is accomplished by adding an edge between an exit node and the entry node.) If
• N is the number of nodes (sequential groups of program statements);
• E is the number of edges (program flows between nodes);
• SN is the sum of splitting nodes (nodes with more than one edge emanating from them); nodes with N exit paths contribute a count of N – 1 to SN. For example, when a module contains an N-way predicate, such as a CASE statement with N cases, this predicate contributes a count of N – 1 to SN; and
• RG is the number of regions (areas bounded by edges with no edges crossing),
then the cyclomatic complexity (C) of the graph is given by
\[
C = E - N + 1 = SN + 1 = RG.
\]
For example, Figure 7.11 shows a graph with eight nodes (A through H), 12 edges, and five regions. The cyclomatic complexity is
\[
\begin{aligned}
C &= \text{edges} - \text{nodes} + 1 = 12 - 8 + 1 = 5 \\
  &= \text{number of splitting nodes} + 1 = 4 + 1 = 5 \\
  &= \text{number of regions bounded by the paths} = 5.
\end{aligned}
\]
The regions are ACEHA, ACFHA, ABFHA, ADFHA, and ADGHA.
Figure 7.11 Graph with eight nodes (A through H), 12 edges, and five regions.
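The two counting rules for cyclomatic complexity can be checked against each other in code. The sketch below is illustrative only: the edge list is an assumption inferred from the five region names (ACEHA, ACFHA, ABFHA, ADFHA, ADGHA), with the edge from H back to A being the one added to make the graph strongly connected, and the function name is hypothetical.

```python
def cyclomatic_complexity(edges):
    """Return (E - N + 1, SN + 1) for a strongly connected directed graph."""
    nodes = {node for edge in edges for node in edge}
    out_degree = {}
    for source, _target in edges:
        out_degree[source] = out_degree.get(source, 0) + 1
    by_edges = len(edges) - len(nodes) + 1
    # A node with k exit paths contributes k - 1 to the splitting-node sum SN.
    splitting_sum = sum(k - 1 for k in out_degree.values() if k > 1)
    return by_edges, splitting_sum + 1


# Assumed edge list for the eight-node example (inferred, not taken from the figure).
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "F"),
    ("C", "E"), ("C", "F"),
    ("D", "F"), ("D", "G"),
    ("E", "H"), ("F", "H"), ("G", "H"),
    ("H", "A"),  # edge added from the exit node back to the entry node
]

print(cyclomatic_complexity(EDGES))  # (5, 5)
```

Both counts agree with the five regions listed in the text. Here node A contributes 2 to SN and nodes C and D contribute 1 each.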
Physically, the cyclomatic complexity represents the number of linearly independent paths through the module. The complexity of a module may be used to indicate the relative effort required to code it correctly, to test it, and, ultimately, to modify or enhance it. Thus, in the design phase, an estimate of module complexity is useful in identifying module areas for further decomposition or other simplifying design changes. Although a hard and fast rule is difficult to justify, a complexity exceeding 10 or 11 often indicates a need for redesign for simplification.

Defect density (IEEE 1989a). This measure requires establishing defect severity categories and collecting the following data:
• Di is the total number of unique defects, at or above a specified severity level, detected during the ith design review;
• I is the total number of reviews; and
• KSLOD is, in the design phase, the number of source lines of design statements, in thousands.
Then, the cumulative defect ratio for design (DD) is given by
\[
DD = \frac{\sum_{i=1}^{I} D_i}{\text{KSLOD}} \tag{7.2}
\]
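A minimal sketch of Equation 7.2 follows. The function name and the review counts are hypothetical, chosen only to show the arithmetic. The same function applies to any cumulative defect ratio of this form, since only the size measure in the denominator changes.

```python
def cumulative_defect_ratio(defects_per_review, size_in_thousands):
    """Sum of unique defects over all reviews, divided by size in thousands of lines."""
    if size_in_thousands <= 0:
        raise ValueError("size must be positive")
    return sum(defects_per_review) / size_in_thousands


# Hypothetical data: three design reviews over 11.5 thousand lines of design statements.
dd = cumulative_defect_ratio([12, 7, 4], 11.5)
print(dd)  # 2.0 defects per thousand lines of design
```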
There is some ambiguity in this measurement because a low value could mean either a good product or a poor review process. If estimates of defect density are higher than those experienced on comparable projects, the development process should be reviewed to determine if poor training or practices are the causes or if the requirements are incomplete or ambiguous. In such cases, it may be appropriate to delay development until corrective actions can be taken. If estimates of defect density are lower than those experienced on comparable projects, the review process and methodology should be reviewed. If problems are identified, additional training or modification to the review procedures may solve the problem. If the review process is assessed to be adequate, it is reasonable to conclude that these phases of development are producing low-defect products.

7.5.2.3 Code and Unit Test Phase Measure: Defect Density (IEEE 1989a)
A second form of the defect density parameter is appropriate for this phase. Again, defect severity categories should be established and the following data collected:
• Di, total number of unique defects, at or above a specified severity level, detected during the ith code review;
• I, total number of reviews; and
• KSLOC, the number of source lines of code reviewed, in thousands.
Then, the cumulative defect ratio for code (CD) is given by
\[
CD = \frac{\sum_{i=1}^{I} D_i}{\text{KSLOC}} \tag{7.3}
\]
7.5.3 Software Reliability Models

7.5.3.1 A Classification of Software Reliability Models
Many software reliability models have been developed over the years. For a detailed description, see, for instance, Musa, Iannino, and Okumoto (1987). This section gives a broad overview of existing models, presents and critiques the assumptions on which they are based, provides a detailed description of four of the most commonly used software reliability models (SRMs), and describes their inherent limitations.

Different classification schemes for software reliability models have been proposed in the past. Musa, Iannino, and Okumoto demonstrate that most SR models published in the literature are based on a Markovian formulation of the fault removal process.* They differ in the number of faults present in the software when the testing begins and in the per-fault failure probability density function distribution.† Models of the finite failure category are based on the assumption that the exact number of faults in the software at time 0, or at least the mean number, is known. Models of the infinite failure category make the assumption that the number of faults is unlimited. Models of the finite failure category are broken down into types: namely, the binomial type (when the number of faults at time 0 is known precisely) and the Poisson type (when the average number of faults is known). Models of both categories are then divided into classes, depending on the per-fault failure intensity distribution used in the model (see Figure 7.12).

The classification presented here is a generalization of the classification proposed in Goel (1985). It includes most existing SRMs and provides guidelines for the selection of a software reliability model fitting a specific application. Most existing SRMs may be grouped into one of four categories:
• Time-between-failures category includes models that provide estimates of the times between failures (for an example, see Section 7.5.3.2).
• Failure count category includes models concerned with the number of faults or failures experienced in specific intervals of time (see Section 7.5.3.3).
• Fault seeding category includes models to assess the number of faults in the program at time 0 via seeding of extraneous faults (see Section 7.5.3.5).
• Input domain-based category includes models that assess the reliability of a program when the test cases are sampled randomly from a well-known operational distribution of inputs to the program. The “clean room methodology” is a tentative attempt to implement this approach in an industrial environment (the Software Engineering Laboratory, NASA) (Basili and Green 1993). The reliability estimate is obtained from the number of observed failures during execution (see Section 7.5.3.6).
* Markovian models formulate stochastic processes for which transitions between states are independent of the past history of the process (see Chapter 2).
† The per-fault failure probability density function, fa(t): fa(t)dt is the probability that fault “a” will provoke a software failure between times t and t + dt.
[Figure: tree dividing software reliability models into finite failure and infinite failure categories, with Poisson and binomial types and exponential, Weibull, and gamma classes.]
Figure 7.12 Classification of software reliability models. (From Musa, J. D. et al. 1987. Software reliability. New York: McGraw-Hill.)
Table 7.5 lists the key assumptions on which each category of models is based, as well as representative models of the category. Additional assumptions specific to each model are listed in Table 7.6. Table 7.7 examines the validity of some of these assumptions (whether they are generic assumptions of a category or the additional assumptions of a specific model). The software development process is environment dependent. Thus, even assumptions that would seem reasonable during the testing of one function or product may not hold true in subsequent testing. The ultimate decision about the appropriateness of the assumptions and the applicability of a model must be made by the user.

To select an SRM for a specific application, the following practical approach can be used. First, determine to which category the software reliability model of interest belongs. Then, assess which specific model in a given category fits the application. A choice will only be necessary if the model is in one of the first two categories (that is, time-between-failures or failure count models). If this is the case, select a software reliability model based on knowledge of the software development process and environment. Collect such failure data as the number of failures, their nature, the time at which each failure occurred, its severity level, and the time needed to isolate the fault and to make corrections. Plot the cumulative number of failures and the failure intensity as a function of time. Derive the different parameters of the models from the data collected and use the model to predict future behavior. If the future behavior corresponds to the model prediction, keep the model.

Table 7.5 Key Assumptions on Which Software Reliability Models Are Based

Time-Between-Failures Models
Key assumptions:
• Independent times between failures
• Equal probability of exposure of each fault
• Embedded faults are independent of each other
• No new faults introduced during correction
Specific models:
• Jelinski and Moranda’s de-eutrophication model (1972)
• Schick and Wolverton’s model (1973)
• Goel and Okumoto’s imperfect debugging model (1979)
• Littlewood and Verrall’s Bayesian model (1973)

Fault Count Models
Key assumptions:
• Testing intervals are independent of each other
• Testing during intervals is homogeneously distributed
• Numbers of faults detected during nonoverlapping intervals are independent of each other
Specific models:
• Shooman’s exponential model (1975)
• Goel and Okumoto’s nonhomogeneous Poisson process (1979)
• Goel’s generalized nonhomogeneous Poisson process model (1983)
• Musa’s execution time model (1975)
• Musa–Okumoto’s logarithmic Poisson execution time model (1983)

Fault Seeding Models
Key assumptions:
• Seeded faults are randomly distributed in the program
• Indigenous and seeded faults have equal probabilities of being detected
Specific models:
• Mills’s seeding model (1972)

Input Domain-Based Models
Key assumptions:
• Input profile distribution is known
• Random testing is used (inputs are selected randomly)
• Input domain can be partitioned into equivalence classes
Specific models:
• Nelson’s model (1978)
• Ramamoorthy and Bastani’s model (1982)

7.5.3.2 Jelinski and Moranda’s Model
Jelinski and Moranda (1972) developed one of the earliest reliability models. It assumes that
• all faults in a program are equally likely to cause a failure during test;
• the hazard rate is proportional to the number of faults remaining; and
• no new defects are incorporated into the software as testing and debugging occur.
Originally, the model assumed only one fault was removed after each failure, but an extension of the model, credited to Sukert, permits more than one fault to be removed.
Table 7.6 Specific Assumptions Related to Each Specific Software Reliability Model

Time-Between-Failures Models
• Jelinski and Moranda de-eutrophication model (JM): N faults at time 0; detected faults are immediately removed; the hazard rate(a) in the interval of time between two failures is proportional to the number of remaining faults
• Schick and Wolverton model: same as above, with the hazard rate a function of both the number of faults remaining and the time elapsed since the last failure
• Goel and Okumoto imperfect debugging model: same as above, but the fault, even if detected, is not removed with certainty
• Littlewood–Verrall Bayesian model: initial number of failures is unknown, times between failures are exponentially distributed, and the hazard rate is gamma distributed

Fault Count Models
• Shooman’s exponential model: same assumptions as in the JM model
• Goel–Okumoto nonhomogeneous Poisson process (GONHPP): the cumulative number of failures experienced follows a nonhomogeneous Poisson process (NHPP); the failure rate decreases exponentially with time
• Goel’s generalized nonhomogeneous Poisson process model: same assumptions as in the Goel–Okumoto NHPP, but the failure rate is intended to better replicate test results showing that the failure rate first increases and then decreases with time
• Musa’s execution time model: same assumptions as in the JM model
• Musa–Okumoto logarithmic Poisson execution time model: same assumptions as in the Goel–Okumoto NHPP model, but the time element considered in the model is the execution time

Fault Seeding Models
• Mills’s seeding model

Input Domain-Based Models
Models: Nelson’s model; Ramamoorthy and Bastani’s model
Specific assumption: the outcome of a test case provides some stochastic information about the behavior of the program for other inputs close to the inputs used in the test

(a) Software hazard rate is z(t) = f(t)/R(t), where R(t) is the reliability at time t and f(t) = –dR(t)/dt. The software failure rate (intensity) is λ(t) = dμ(t)/dt, where μ(t) is the mean value of the cumulative number of failures experienced by time t.
In this model, the hazard rate for the software is constant between failures and is
\[
Z_i(t) = \phi\,[N - n_{i-1}], \quad i = 1, 2, \ldots, m \tag{7.4}
\]
for the interval between the (i – 1)st and ith failures. In this equation, $N$ is the initial number of faults in the software, $\phi$ is a proportionality constant, and $n_{i-1}$ is the cumulative number of faults removed in the first (i – 1) intervals.
Table 7.7 Validity of Some Software Reliability Models’ Assumptions

• Assumption: Times between failures are independent.
  Intrinsic limitation: Only true if derivation of test cases is purely random (never the case).
• Assumption: A detected fault is immediately corrected.
  Intrinsic limitation: Faults will not be removed immediately; they are usually corrected in batches; however, the assumption is valid as long as further testing avoids the path in which the fault is active.
• Assumption: No new faults are introduced during the fault removal process.
  Intrinsic limitation: In general, this is not true.
• Assumption: Failure rate decreases with test time.
  Intrinsic limitation: Reasonable approximation in most cases.
• Assumption: Failure rate is proportional to the number of remaining faults for all faults.
  Intrinsic limitation: A reasonable assumption if the test cases are chosen to ensure equal probability of testing different parts of code.
• Assumption: Time is used as a basis for failure rate.
  Intrinsic limitation: Usually, time is a good basis for failure rate; if this is not true, the models are valid for other units.
• Assumption: Failure rate increases between failures for a given failure interval.
  Intrinsic limitation: Generally not the case unless testing intensity increases.
• Assumption: Testing is representative of the operational usage.
  Intrinsic limitation: Usually not the case because testing usually selects error-prone situations (see Section 7.4.3).
The maximum likelihood estimates of $N$ and $\phi$ are given by the solution of the following equations:
\[
\sum_{i=1}^{m} \frac{1}{N - n_{i-1}} - \sum_{i=1}^{m} \phi\, x_{li} = 0 \tag{7.5}
\]
and
\[
\frac{n}{\phi} - \sum_{i=1}^{m} [N - n_{i-1}]\, x_{li} = 0 \tag{7.6}
\]
where $x_{li}$ is the length of the interval between the (i – 1)st and ith failures, and $n$ is the number of errors removed so far. Once $N$ and $\phi$ are estimated, the number of remaining errors is given by
\[
N(\text{remaining}) = N - n_i \tag{7.7}
\]
The mean time to the next software failure is
\[
\text{MTTF} = \frac{1}{(N - n_i)\,\phi} \tag{7.8}
\]
The reliability is
\[
R_{i+1}(t \mid t_i) = \exp[-(N - n_i)\,\phi\,(t - t_i)], \quad t \ge 0 \tag{7.9}
\]
where $R_{i+1}(t \mid t_i)$ is the reliability of the software at time $t$ in the interval $[t_i, t_{i+1}]$, given that the ith failure occurred at time $t_i$.

7.5.3.3 Musa Basic Execution Time Model (BETM)
This model was first described by Musa (1975). It assumes that failures occur as a nonhomogeneous Poisson process. The units of failure intensity are failures per central processing unit (CPU) time. This relates failure events to the processor time used by the software. In the BETM, the reduction in the failure intensity function remains constant, whether the first or the Nth failure is being fixed. The failure intensity is a function of failures experienced:
\[
\lambda(\mu) = \lambda_0 (1 - \mu/\nu_0) \tag{7.10}
\]
where $\lambda(\mu)$ is the failure intensity (failures per CPU hour at $\mu$ failures), $\lambda_0$ is the initial failure intensity (at $\tau = 0$), $\mu$ is the mean number of failures experienced at execution time $\tau$, and $\nu_0$ is the total number of failures expected to occur in infinite time. Then, the number of failures that need to occur to move from a present failure intensity, $\lambda_P$, to a target intensity, $\lambda_F$, is given by
\[
\Delta\mu = \frac{\nu_0}{\lambda_0} (\lambda_P - \lambda_F) \tag{7.11}
\]
and the execution time required to reach this objective is
\[
\Delta\tau = \frac{\nu_0}{\lambda_0} \ln(\lambda_P/\lambda_F) \tag{7.12}
\]
In practice, $\nu_0$ and $\lambda_0$ can be estimated in three ways:
• Use previous experience with similar software. Then the model can be applied prior to testing.
• Plot the actual test data to establish or update previous estimates. Plot failure intensity versus execution time: the y-intercept of the straight line fit is an estimate of $\lambda_0$. Plot failure intensity versus failure number: the x-intercept of the straight line fit is an estimate of $\nu_0$.
• Use the test data to develop a maximum-likelihood estimate. The details for this approach are described in Musa et al. (1987).
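The graphical estimate in the second approach amounts to a straight-line fit. The sketch below fits failure intensity against failure number by ordinary least squares and reads off both intercepts; under Equation 7.10 the y-intercept of this particular line is the initial intensity and the x-intercept is the total expected number of failures. The function names and the data points are hypothetical, and real test data would scatter around the line rather than fall exactly on it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_betm_parameters(failure_numbers, intensities):
    """Return (lam0, nu0) from a line fitted to intensity versus failure number."""
    slope, intercept = fit_line(failure_numbers, intensities)
    if slope >= 0:
        raise ValueError("intensity is not decreasing; the fit gives no usable estimate")
    lam0 = intercept              # y-intercept: intensity at zero failures
    nu0 = -intercept / slope      # x-intercept: failures at which intensity reaches zero
    return lam0, nu0


# Hypothetical observations (failure number, failures per CPU hour).
lam0_hat, nu0_hat = estimate_betm_parameters([10, 20, 30, 40], [9.0, 8.0, 7.0, 6.0])
print(round(lam0_hat, 6), round(nu0_hat, 6))
```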
Musa also developed a method to convert execution time predictions to calendar time. The calendar time component is based on the fact that available resources limit the amount of execution time that is practical in a calendar day.
7.5.3.4 Musa–Okumoto Logarithmic Poisson Execution Time Model (LPETM)
The logarithmic Poisson execution time model was first described by Musa and Okumoto (1983). In the LPETM, the failure intensity is given by
\[
\lambda(\mu) = \lambda_0 \exp(-\theta\mu) \tag{7.13}
\]
where $\theta$ is the failure intensity decay parameter and $\lambda$, $\mu$, and $\lambda_0$ are the same as for the BETM. The parameter $\theta$ represents the relative change of failure intensity per failure experienced. This model assumes that repair of the first failure has the greatest impact in reducing failure intensity and that the impact of each subsequent repair decreases exponentially. In the LPETM, no estimate of $\nu_0$ is needed. The expected number of failures that must occur to move from a present failure intensity of $\lambda_P$ to a target intensity of $\lambda_F$ is
\[
\Delta\mu = (1/\theta) \ln(\lambda_P/\lambda_F) \tag{7.14}
\]
and the execution time to reach this objective is given by
\[
\Delta\tau = \frac{1}{\theta} \left[ \frac{1}{\lambda_F} - \frac{1}{\lambda_P} \right] \tag{7.15}
\]
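Equations 7.13 through 7.15 can be sketched in the same way. The function names and the parameter values (a decay parameter of 0.025 per failure, and present and target intensities of 5 and 1 failures per CPU hour) are hypothetical.

```python
import math


def lpetm_intensity(mu, lam0, theta):
    """Failure intensity after mu failures (Equation 7.13)."""
    return lam0 * math.exp(-theta * mu)


def lpetm_delta_failures(lam_present, lam_target, theta):
    """Expected failures needed to reach the target intensity (Equation 7.14)."""
    return (1.0 / theta) * math.log(lam_present / lam_target)


def lpetm_delta_time(lam_present, lam_target, theta):
    """Execution time needed to reach the target intensity (Equation 7.15)."""
    return (1.0 / theta) * (1.0 / lam_target - 1.0 / lam_present)


THETA = 0.025  # hypothetical decay parameter, per failure
d_mu = lpetm_delta_failures(5.0, 1.0, THETA)
d_tau = lpetm_delta_time(5.0, 1.0, THETA)
print(round(d_mu, 2), round(d_tau, 2))
```

Here the failure count grows with the logarithm of the intensity ratio, while the execution time is dominated by the reciprocal of the target intensity, so low targets become expensive quickly.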
In these equations, $\lambda_0$ and $\theta$ can be estimated based on previous experience, by plotting the test data to make graphical estimates, or by making a least-squares fit to the data.

7.5.3.5 Mills’s Fault Seeding Model (IEEE 1989b)
An estimate of the number of defects remaining in a program can be obtained by a seeding process that assumes a homogeneous distribution of a representative class of defects. The variables in this measure are $N_S$, the number of seeded faults; $n_S$, the number of seeded faults found; and $n_F$, the number of faults found that were not intentionally seeded.

Before seeding, a fault analysis is needed to determine the types of faults expected in the code and their relative frequency of occurrence. An independent monitor inserts into the code $N_S$ faults that are representative of the expected indigenous faults. During reviews (or testing), both seeded and unseeded faults are identified. The number of seeded and indigenous faults discovered permits an estimate of the number of faults remaining for the fault type considered. The measure cannot be computed unless some seeded faults are found. The maximum likelihood estimate of the indigenous (unseeded) faults is given by
\[
N_F = n_F N_S / n_S. \tag{7.16}
\]
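A minimal sketch of Equation 7.16 follows, including the guard implied by the remark that the measure cannot be computed when no seeded faults are found. The function name and the counts are hypothetical.

```python
def mills_estimate(seeded, seeded_found, indigenous_found):
    """Return (estimated indigenous faults, estimated indigenous faults remaining)."""
    if seeded_found == 0:
        raise ValueError("no seeded faults found; the estimate is undefined")
    if seeded_found > seeded:
        raise ValueError("cannot find more seeded faults than were inserted")
    total_indigenous = indigenous_found * seeded / seeded_found   # Equation 7.16
    return total_indigenous, total_indigenous - indigenous_found


# Hypothetical counts: 50 faults seeded, 40 of them found, 60 indigenous faults found.
estimated, remaining = mills_estimate(50, 40, 60)
print(estimated, remaining)  # 75.0 15.0
```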
Example. Twenty faults of a given type are seeded. Subsequently, 40 faults of that type are uncovered: 16 seeded and 24 unseeded. Then, NF = 30, and the estimate of faults remaining is NF (remaining) = NF – nF = 6.

7.5.3.6 Nelson’s Input-Based Domain Model
Nelson’s model (1978) obtains a reliability estimate, R, from K failures experienced in N runs. If N sets of inputs are randomly selected from the operational profile (i.e., the probability distribution of these inputs, which replicates their actual usage as inputs to the program), (1 – K/N) will be an unbiased estimator of R.

7.5.3.7 Derived Software Reliability Models
From the basic models presented previously, models can be built for applications involving more than one piece of software, such as models to assess the reliability of fault-tolerant software designs or the reliability of a group of modules assembled during the integration phase. As an example, take an N-version fault-tolerant design. Assume that the N different versions are totally independent. The design will experience a failure if one of the three following conditions is fulfilled:
• All outputs disagree; the error is labeled E1.
• Identical incorrect outputs are generated (E2).
• The voting procedure fails to fulfill its application (E3).

If errors E2 and E3 can be neglected, the design becomes equivalent to a multiple hardware redundancy. If $p(V_n = C)$ is the probability that version n executes correctly, and $p(V_n = I)$ is the probability that version n fails, for N = 3,
\[
\begin{aligned}
p(E_1) = {} & p(V_1 = I)\,p(V_2 = I)\,p(V_3 = I) + p(V_1 = C)\,p(V_2 = I)\,p(V_3 = I) \\
& + p(V_1 = I)\,p(V_2 = C)\,p(V_3 = I) + p(V_1 = I)\,p(V_2 = I)\,p(V_3 = C),
\end{aligned}
\]
where $p(V_i = C) \approx R_{V_i}$ (reliability of version i, if the reliability is measured per number of successful runs). If there is a probability that the voting procedure will fail (i.e., that two or more identical correct outputs are discarded), $p(V_P = I)$, $p(E_1)$ becomes
\[
\begin{aligned}
p(E_1) = {} & p(V_1 = I)\,p(V_2 = I)\,p(V_3 = I) + p(V_1 = C)\,p(V_2 = I)\,p(V_3 = I) \\
& + p(V_1 = I)\,p(V_2 = C)\,p(V_3 = I) + p(V_1 = I)\,p(V_2 = I)\,p(V_3 = C) \\
& + \{\, p(V_1 = C)\,p(V_2 = C)\,p(V_3 = C) + p(V_1 = C)\,p(V_2 = C)\,p(V_3 = I) \\
& \quad + p(V_1 = I)\,p(V_2 = C)\,p(V_3 = C) + p(V_1 = C)\,p(V_2 = I)\,p(V_3 = C) \,\}\; p(V_P = I).
\end{aligned}
\]
From these models, the most reliable design for a given cost can be assessed.
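The three-version expressions above can be evaluated by enumerating the correct/incorrect outcomes of the versions. The sketch below is illustrative only: the function name, the per-run reliabilities of 0.9, and the voter failure probability of 0.01 are hypothetical, and the versions are treated as independent, as the text assumes.

```python
from itertools import product


def three_version_failure_probability(reliabilities, voter_failure=0.0):
    """Probability of design failure for a three-version design.

    A majority of incorrect versions always fails; a majority of correct
    versions fails only if the voting procedure fails.
    """
    if len(reliabilities) != 3:
        raise ValueError("this sketch covers the N = 3 case only")
    failure = 0.0
    for outcome in product((True, False), repeat=3):   # True means the version is correct
        probability = 1.0
        for correct, r in zip(outcome, reliabilities):
            probability *= r if correct else (1.0 - r)
        if sum(outcome) >= 2:
            failure += probability * voter_failure      # majority correct, voter fails
        else:
            failure += probability                       # majority incorrect
    return failure


perfect_voter = three_version_failure_probability([0.9, 0.9, 0.9])
faulty_voter = three_version_failure_probability([0.9, 0.9, 0.9], voter_failure=0.01)
print(round(perfect_voter, 5), round(faulty_voter, 5))  # 0.028 0.03772
```

Comparing such values for alternative designs, together with their costs, is one way to carry out the assessment described above.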
Littlewood (1979) explicitly takes into account the structure (i.e., the modules) of the software and models the exchange of control between modules (time of sojourn in a module and target of exchange) using a semi-Markovian process. The failure rates of a given module can be obtained from the basic reliability models applied to the module; interface failures are modeled explicitly. This model can be used to study the integration process.

7.5.3.8 A Critique of Existing Software Reliability Models
Software reliability models have been criticized extensively. One class of objections is mainly conceptual. Software is purely deterministic, whereas hardware behavior is partly stochastic. Once a set of inputs to the program is defined, the program will operate correctly or it will fail; there is no such thing as a probability that the program will fail. Of course, this argument is valid. However, probabilities in software reliability models, as in many other models, are used as a representation of ignorance. The inputs with which the program will be run are not known with certainty; neither are the position and the nature of the fault.

A second class of objections is based on the fact that software reliability models originate from hardware reliability models. The latter were slightly modified to account for some software peculiarities (such as the fact that software does not wear out and, consequently, does not follow the bathtub curve). The major drawback of most of these models is that they are based on a limited number of sometimes questionable assumptions rather than on a deep knowledge of what we could call “the thermal-hydraulics equations” of software.*

As discussed in Section 7.5.3.1, a number of models have been developed, tested, validated, and invalidated over the years. Most models are based on software failure data. Their validity is often limited to the later phases of testing† (integration, product, and acceptance testing phases) and to operation. If a software project is started from scratch (with no available history on comparable projects), these models will be of no help in managing the early phases of the project.

* Some models, such as Littlewood’s structural model (Littlewood 1979), have moved toward better description and understanding of the nature of software.
† Note that the models presented will not hold if reliability growth is achieved by formal verification. To our knowledge, no software reliability model has yet been developed to account for the influence of formal methods on software reliability.

7.6 SUMMARY

Software is more than code; it can include the programs, procedures, rules, and documentation related to the operation of a computer system. Firmware is a special form of software that includes the computer programs and data that reside in types of memory that cannot be modified by the computer during processing. Software reliability reflects the ability of a program to perform a required function under stated conditions for a stated period of time. Software unreliability is reflected in failures that occur during operation. Software failures are the consequences of errors that occur during the development process; these lead, in turn, to faults in the code. Software quality is concerned with numerous features of a software product, including reliability. As a result, the organization or individuals responsible for software quality assurance will play an important role in achieving software reliability. Software safety is concerned with life-critical applications that cannot afford to fail. Achieving software safety requires the development of extremely reliable software.

Software development organizations should use a software life-cycle model to define a project’s software development process. It should identify the activities, products, and reviews or other milestones to be achieved in each development phase and the data to be collected. To provide a basis for long-term product quality and reliability improvement, the organization could use a program such as the SEI’s CMM to manage and improve its processes and enhance the repeatability of future software development efforts.

A number of techniques can be used during the software life cycle to reduce or eliminate the potential for failure in the final product. Such techniques include fault-tolerant designs, testing at several levels, and formal methods. Each technique has its own advantages, limitations, and costs that can limit the scope and depth of its application. Fault-tolerant designs can be expensive and this expense must be justified by the application. Also, software reliability will still depend on the accuracy of voting procedures. Testing can never be exhaustive, despite the art and skill displayed by the test organization. Formal methods are not yet fully automated; they are complex and require a high level of training to apply effectively.
Several metrics can be used during the development process to assess the reliability characteristics of the product and the risks of proceeding to the next development phase. Assessments may be based on failure data collected through the life cycle and can be used to estimate reliability and failure intensity, as well as the time required to reach specified levels of failure intensity. However, these assessments will be limited to the integration, system, and acceptance testing phases of the software development life cycle and to operation. Unless the software development organization has accumulated a consistent set of data from previous development efforts, these models will be of little help in managing the early phases of a project.
REFERENCES

Basili, V., and S. Green. 1993. The evolution of software processes based upon measurement in the SEL: The cleanroom example. University of Maryland and NASA/GSFC, draft.
Bate, R. et al. 1995. A systems engineering capability maturity model, version 1.1 (CMU/SEI95-MM-033). Pittsburgh, PA: Software Engineering Institute.
Beizer, B. 1984. Software system testing and quality assurance. New York: Van Nostrand Reinhold.
Bell, D., I. Morrey, and J. Pugh. 1992. Software engineering: A programming approach. Upper Saddle River, NJ: Prentice Hall.
DoD (Department of Defense). 1985. Defense system software development, DOD-STD-2167, Washington, D.C.
———. 1987. Report of the defense science board task force on military software, Office of the Under Secretary of Defense for Acquisition. Washington, D.C.
———. 1994. Software development and documentation. MIL-STD-498, Washington, D.C.
Department of the Air Force. 1987. Software quality indicators. Air Force Systems Command, AFSC Pamphlet 800-14.
Galton, A. 1992. Logic as a formal method. Computer Journal 35(5).
Goel, A. L. 1983. A guidebook for software reliability assessment. Rep. RADC TR-83-176.
———. 1985. Software reliability models: Assumptions, limitations, and applicability. IEEE Transactions on Software Engineering SE-11 (12): 1411.
Goel, A. L., and K. Okumoto. 1979a. A time dependent error detection rate model for software reliability and other performance measures. IEEE Transactions on Reliability R-28:206.
———. 1979b. A Markovian model for reliability and other performance measures of software systems. Proceedings of the National Computer Conference, New York 48.
Hoare, C. A. R. 1969. An axiomatic basis for computer programming. Communications of the ACM 12:576.
IEEE (Institute of Electrical and Electronics Engineers). 1983. IEEE standard glossary of software engineering terminology. ANSI/IEEE Std. 729.
———. 1989a. IEEE standard dictionary of measures to produce reliable software. IEEE Std. 982.1-1988.
———. 1989b. IEEE guide for the use of IEEE Standard Dictionary of Measures to Produce Reliable Software. ANSI/IEEE Std. 982.2-1988.
———. 1996. Standard for information technology. IEEE/IEC 12207.
Jelinski, Z., and P. Moranda. 1972. Software reliability research. In Statistical computer performance evaluation, ed. W. Freiberger. New York: Academic Press.
Leveson, N. G., and P. R. Harvey. 1983. Analyzing software safety. IEEE Transactions on Software Engineering SE-9:5.
Littlewood, B. 1979. Software reliability model for modular program structure. IEEE Transactions on Reliability R-28:3.
Littlewood, B., and J. L. Verrall. 1973. A Bayesian reliability growth model for computer software. Applied Statistics 22:332.
McCabe, T. 1985. Structural testing. Columbia, MD: McCabe and Associates.
Mills, H. D. 1972. On the statistical validation of computer programs. Rep. 72-6015. Gaithersburg, MD: IBM Federal Systems Division.
Musa, J. D. 1975. A theory of software reliability and its application. IEEE Transactions on Software Engineering SE-1:312.
Musa, J. D., A. Iannino, and K. Okumoto. 1987. Software reliability. New York: McGraw-Hill.
Musa, J. D., and K. Okumoto. 1983. A logarithmic Poisson execution time model for software reliability measurement. Proceedings 7th International Conference Software Engineering, Orlando, FL.
Myers, G. J. 1979. The art of software testing. IBM Systems Research Institute. New York: John Wiley & Sons.
Nelson, E. 1978. Estimating software reliability from test data. Microelectronic Reliability 17:67.
Neufelder, A. M. 1993. Ensuring software reliability. New York: Marcel Dekker.
Neuhold, E. J., and M. Paul. 1991. Formal description of programming concepts. IFIP International Federation for Information Processing, Laxenburg, Austria.
Paulk, M. C. et al. 1993. Capability maturity model, version 1.1 (SEI-93-TR-024). Software Engineering Institute, Pittsburgh, PA.
Ramamoorthy, C. V., and F. B. Bastani. 1982. Software reliability: Status and perspectives. IEEE Transactions on Software Engineering SE-8:359.
Schick, G. J., and R. W. Wolverton. 1973. Assessment of software reliability. Paper presented at 11th Annual Meeting German Operational Research Society, DGOR, Hamburg, Germany; also in Proceedings of Operational Research, Physica-Verlag, Würzburg-Wien.
Scott, R. K., J. W. Gault, and D. G. McAllister. 1987. Fault tolerant software reliability modeling. IEEE Transactions on Software Engineering SE-13:5.
Shooman, M. L. 1975. Software reliability measurement and models. Proceedings of the Annual Reliability and Maintainability Symposium, Washington, D.C.
© 2009 by Taylor & Francis Group, LLC
© 2009 by Taylor & Francis Group, LLC
CHAPTER 8
Failure Modes, Mechanisms, and Effects Analysis
Sony Mathew, Michael Pecht
CONTENTS
8.1 Introduction
8.2 Failure Modes, Mechanisms, and Effects Analysis Methodology
8.2.1 System Definition, Elements, and Functions
8.2.2 Potential Failure Modes
8.2.3 Potential Failure Causes
8.2.4 Potential Failure Mechanisms
8.2.5 Failure Models
8.2.6 Life-Cycle Profile
8.2.7 Failure Mechanism Prioritization
8.2.8 Documentation
8.3 Case Study
8.4 Conclusions
References
This chapter presents a methodology called failure modes, mechanisms, and effects analysis (FMMEA) that is used to identify potential failure modes and mechanisms, and their effects. FMMEA enhances the value of failure modes and effects analysis (FMEA) and failure modes, effects, and criticality analysis (FMECA) by identifying high-priority failure mechanisms to help create an action plan to mitigate their effects. The knowledge about the cause and consequences of mechanisms found through FMMEA helps in efficient and cost-effective product development. The application of FMMEA for an electronic circuit board assembly is described in the chapter.
8.1 INTRODUCTION
The competitive marketplace demands that manufacturers look for economic ways to improve the product development process. In particular, the industry has been interested in an efficient approach to understand potential product failures that might affect product performance over time. Some organizations are using or require the use of a technique called failure mode and effects analysis (FMEA) to achieve this goal, but most of these companies are not completely satisfied with this methodology. FMEA was developed as a formal methodology in the 1950s at Grumman Aircraft Corporation, where it was used to analyze the safety of flight control systems for naval aircraft. From the 1970s through the 1990s, various military and professional society standards and procedures were written to define and improve the FMEA methodology (Bowles 2003; Kara-Zaitri, Keller, and Fleming 1992; Guidelines for Failure Mode and Effects Analysis for Automotive, Aerospace, and General Manufacturing Industries 2003). In 1971, the Electronic Industries Association (EIA) G-41 committee on reliability published “Failure Mode and Effects Analysis.” In 1974, the U.S. Department of Defense published Mil-Std 1629, “Procedures for Performing a Failure Mode, Effects and Criticality Analysis,” which, through several revisions, became the basic approach for analyzing systems. In 1985, the International Electrotechnical Commission (IEC) introduced IEC 812, “Analysis Techniques for System Reliability—Procedure for Failure Modes and Effects Analysis.” In the late 1980s, the automotive industry adopted the FMEA practice. In 1993, the Supplier Quality Requirements Task Force, composed of representatives from Chrysler, Ford, and GM, introduced FMEA into the quality manuals through the QS 9000 process. 
In 1994, the Society of Automotive Engineers (SAE) published SAE J-1739, “Potential Failure Modes and Effects Analysis in Design and Potential Failure Modes and Effects Analysis in Manufacturing and Assembly Processes,” which provided general guidelines for preparing an FMEA. In 1999, as part of the International Automotive Task Force, DaimlerChrysler, Ford, and GM agreed to recognize the new international standard “ISO/TS 16949,” which included FMEA and would eventually replace QS 9000 in 2006.

FMEA is used across many industries and is one of the six sigma tools (Franceschini and Galetto 2001), utilized by six sigma organizations in some form. It may be applied to various applications, such as system FMEA, design FMEA, process FMEA, machinery FMEA, functional FMEA, interface FMEA, and detailed FMEA. Although the purpose and terminology can vary according to type of industry, the principal objectives of the different FMEA processes are to anticipate problems early in the development process and either prevent them or minimize their consequences (SAE Standard 2002). An extension of FMEA, called failure modes, effects, and criticality analysis (FMECA), was developed to include techniques to assess the probability of occurrence and criticality of potential failure modes. Today, the terms FMEA and FMECA are used interchangeably (Bowles and Bonnell 1998; Bowles 2003).

The FMEA methodology is based on a hierarchical approach to determine how potential failure modes affect a product. This involves inputs from a cross-functional team that has the ability to analyze the whole product life cycle. A typical design FMEA worksheet is shown in Figure 8.1.

Figure 8.1 FMEA worksheet. (From Guidelines for Failure Mode and Effects Analysis for Automotive, Aerospace, and General Manufacturing Industries 2003.) [Worksheet titled “Potential Failure Mode and Effects Analysis (Design FMEA).” Header fields: System, Subsystem, Component, Design Lead, Core Team, Key Date, FMEA Number, Prepared By, FMEA Date, Revision Date, Page. Columns: Item/Function, Potential Failure Mode(s), Potential Effect(s) of Failure, Sev, Potential Cause(s) of Failure, Prob, Current Design Controls, Det, RPN, Recommended Action(s), Responsibility & Target Completion Date, and Action Results (Actions Taken, New Sev, New Occ, New Det, New RPN).]

Failure mechanisms are the processes by which specific combinations of physical, electrical, chemical, and mechanical stresses induce failure (Hu et al. 1993). Neither FMEA nor FMECA identifies the failure mechanisms and models in the analysis and reporting process. In order to understand and prevent failures, failure mechanisms must be identified with respect to the predominant stresses (mechanical, thermal, electrical, chemical, radiation) that precipitate these failures. Understanding the cause and consequences of failure mechanisms aids the design and development of a product, including virtual qualification, accelerated testing, root cause analysis, and life consumption monitoring.

In virtual qualification, failure models are used to estimate analytically the time-to-failure distributions for products. Without knowledge of the relevant dominant failure mechanisms and the operating conditions, virtual qualification for a product cannot be meaningful. For accelerated testing design, one needs to know the failure mechanisms that are likely to be relevant in the operating conditions. Only with the knowledge of the failure mechanism can one design appropriate tests (stress levels, physical architecture, and durations) that will precipitate the failures by the relevant mechanism without resulting in spurious failures. All the root cause analysis techniques, including cause-and-effect diagrams and fault tree analysis, require that one know how the conditions during an incident may have an impact on the failure. The hypothesis development and verification processes are also affected by the failure mechanism analysis.
Knowledge of failure mechanisms and the stresses that influence these mechanisms is an important issue for life-consumption monitoring of a product. The limitations on physical space and interfaces available for data collection and transmission limit the number of sensors that can be implemented in a product in a realistic manner. To make sure that the appropriate data are collected and utilized for the remaining life assessment during health monitoring, the prioritized list of failure mechanisms is essential. The traditional FMEA and FMECA do not address the key issue of failure mechanisms to analyze failures in products. To overcome this, a failure modes,
mechanisms, and effects analysis (FMMEA) methodology has been developed. The FMMEA process merges the systematic nature of the FMEA template with the “design for reliability” philosophy and knowledge. In addition to the information gathered and used for FMEA, FMMEA uses application conditions and the duration of the intended application with knowledge of active stresses and potential failure mechanisms. The potential failure mechanisms are considered individually and are assessed using appropriate models for design and qualification of the product for the intended application. The following sections describe the FMMEA methodology in detail.

8.2 FAILURE MODES, MECHANISMS, AND EFFECTS ANALYSIS METHODOLOGY

FMMEA is a systematic approach to identify failure mechanisms and models for all potential failure modes and then prioritize them. High-priority failure mechanisms determine the operational stresses and the environmental and operational parameters that need to be accounted for in the design or to be controlled. FMMEA is based on understanding the relationships between product requirements and the physical characteristics of the product (and their variation in the production process), the interactions of product materials with loads (stresses at application conditions), and their influence on product failure susceptibility with respect to the use conditions. This involves finding the failure mechanisms and the reliability models to evaluate failure susceptibility quantitatively. The steps in conducting an FMMEA are illustrated in Figure 8.2. The individual steps are described in greater detail in the following subsections.
Figure 8.2 FMMEA methodology. [Flowchart blocks: define system and identify elements and functions to be analyzed; identify potential failure modes; identify potential failure causes; identify potential failure mechanisms; identify failure models; prioritize failure mechanisms; document the process. A separate block: identify life cycle profile.]
8.2.1 System Definition, Elements, and Functions
The FMMEA process begins by defining the system to be analyzed. A system is a composite of subsystems or levels that are integrated to achieve a specific objective. The system is divided into various subsystems or levels. These subsystems may comprise further divisions or may have multiple parts that make up this subsystem. The parts are “components,” which form the basic structure of the product. Based on convenience or needs of the team conducting the analysis, the system breakdown can be by function (i.e., according to what the system elements “do”) or by location (i.e., according to where the system elements “are”), or both (i.e., function of a system element at a location within the system or location of a system element within a system with respect to elements having similar functions). For example, an automobile is considered a system, a functional breakdown of which would involve cooling system, braking system, and propulsion system. A location breakdown would involve engine compartment, passenger compartment, and dashboard or control panel. In a printed circuit board system, a location breakdown would include the package, plated through hole (PTH), metallization, and the board itself. Further analysis is conducted on each element thus identified.

8.2.2 Potential Failure Modes
A failure mode is the effect by which a failure is observed to occur (SAE Standard 2002). It can also be defined as the way in which a component, subsystem, or system could fail to meet or deliver the intended function. For all the elements that have been identified, all possible failure modes for each given element are listed. For example, in a solder joint, the potential failure modes are open or intermittent change in resistance that can hamper its functioning as an interconnect. In cases where information on possible failure modes that may occur is not available, potential failure modes may be identified using numerical stress analysis, accelerated tests to failure (e.g., HALT), past experience, and engineering judgment. A potential failure mode may be the cause of a failure mode in a higher level subsystem or system or be the effect of one in a lower level component.

8.2.3 Potential Failure Causes
A failure cause is defined as the circumstance during design, manufacture, or use that leads to a failure mode (IEEE Standard 1413.1-2002 2003). In this step, for each failure mode, all the possible ways a failure can result are listed. Failure causes are identified by finding the basic reason that may lead to a failure during design, manufacturing, storage, transportation, or use conditions. Knowledge of potential failure causes can help identify the underlying failure mechanisms driving the failure modes for a given element. For example, consider a failed solder joint of an electronic component on a printed circuit board in an automotive under-hood environment. The solder joint failure modes, such as open and intermittent change in resistance, can potentially be caused by fatigue under conditions such as temperature cycling, random vibration, and/or shock impact.
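The first three steps (elements, failure modes, failure causes) amount to building a small hierarchy and flattening it into worksheet rows. The following is a minimal sketch of that bookkeeping, using the solder joint example from the text. The class and field names (`Element`, `FailureMode`, `worksheet_rows`) are hypothetical and chosen for illustration; they are not part of any standard or of the FMMEA methodology itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FailureMode:
    """One way an element can fail to deliver its intended function."""
    description: str
    causes: list[str] = field(default_factory=list)


@dataclass
class Element:
    """A system element obtained from a functional or location breakdown."""
    name: str
    function: str
    failure_modes: list[FailureMode] = field(default_factory=list)


def worksheet_rows(element: Element) -> list[tuple[str, str, str]]:
    """Flatten an element into (element, failure mode, failure cause) rows."""
    return [
        (element.name, mode.description, cause)
        for mode in element.failure_modes
        for cause in mode.causes
    ]


# Solder joint example from the text; the data layout is an assumption.
causes = ["temperature cycling", "random vibration", "shock impact"]
solder_joint = Element(
    name="solder joint",
    function="interconnect",
    failure_modes=[
        FailureMode("open", list(causes)),
        FailureMode("intermittent change in resistance", list(causes)),
    ],
)

rows = worksheet_rows(solder_joint)
for row in rows:
    print(row)
```

Each row produced this way is a candidate line in the FMMEA worksheet, to which a failure mechanism, failure model, and ratings are attached in the later steps.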
8.2.4 Potential Failure Mechanisms
Failure mechanisms are the processes by which specific combinations of physical, electrical, chemical, and mechanical stresses induce failure (Hu et al. 1993). They are determined based on a combination of potential failure mode and cause of failure (JEDEC 2004) and selection of appropriate available mechanisms corresponding to the failure mode and cause. Studies on electronic material failure mechanisms and the application of physics-based damage models to the design of reliable electronic products comprising all relevant wearout and overstress failures in electronics are available in the literature (Dasgupta and Pecht 1991; JEDEC 2003). Failure mechanisms thus identified are categorized as either overstress or wearout mechanisms. Overstress failures involve a failure that arises as a result of a single load (stress) condition. Wearout failure, on the other hand, involves a failure that arises as a result of cumulative load (stress) conditions (IEEE Standard 1413.1-2002 2003). For example, in the case of a solder joint, the potential failure mechanisms driving the opens and shorts caused by vibration and shock impact are fatigue and overstress shock, respectively.

8.2.5 Failure Models
Failure models use appropriate stress and damage analysis methods to evaluate susceptibility of failure. Failure susceptibility is evaluated by assessing the time to failure or likelihood of a failure for a given geometry, material construction, and environmental and operational conditions. For example, in the case of solder joint fatigue, Dasgupta (Dasgupta et al. 1992) and Coffin–Manson (Foucher et al. 2002) failure models are used for stress and damage analysis for temperature cycling. Failure models of overstress mechanisms use stress analysis to estimate the likelihood of a failure based on a single exposure to a defined stress condition. The simplest formulation for an overstress model is the comparison of an induced stress versus the strength of the material that must sustain that stress. Wearout mechanisms are analyzed using both stress and damage analysis to calculate the time required to induce failure based on a defined stress condition. In the case of wearout failures, damage is accumulated over a period until the item is no longer able to withstand the applied load. Therefore, an appropriate method for combining multiple conditions must be determined for assessing the time to failure. Sometimes, the damage due to the individual loading conditions may be analyzed separately, and the failure assessment results may be combined in a cumulative manner (Guidelines for Failure Mode and Effects Analysis for Automotive, Aerospace, and General Manufacturing Industries 2003). Time-to-failure assessment may be limited by the availability and accuracy of models for quantifying the time to failure of the system. It may also be limited by the ability to combine the results of multiple failure models for a single failure site and the ability to combine results of the same model for multiple stress conditions (IEEE Standard 1413.1-2002 2003). If no failure models are available, the appropriate parameters to monitor can be selected based on an empirical
model developed from prior field failure data or models derived from accelerated testing.

8.2.6 Life-Cycle Profile

A life-cycle profile includes environmental conditions such as temperature, humidity, pressure, vibration or shock, chemical environments, radiation, contaminants, and loads due to operating conditions such as current, voltage, and power (SAE 1978). The life-cycle environment of a product consists of assembly, storage, handling, and usage conditions of the product, including the severity and duration of these conditions. Information on life-cycle conditions can be used for eliminating failure modes that may not occur under the given application conditions. In the absence of field data, information on the product usage conditions can be obtained from environmental handbooks or data monitored in similar environments. Ideally, such data should be obtained and processed during actual application. Recorded data from the life-cycle stages for the same or similar products can serve as input toward the FMMEA process. Some organizations collect, record, and publish data in the form of handbooks that provide guidelines for designers and engineers developing products for market sectors of their interest. Such handbooks can provide first approximations for environmental conditions that a product is expected to undergo during operation. These handbooks typically provide an aggregate value of environmental variables and do not cover all the life-cycle conditions. For example, for general automotive application, life-cycle environment and operating condition can be obtained from an SAE handbook (SAE 1978). However, for specific applications, more detailed information of the particular application conditions needs to be obtained.

8.2.7 Failure Mechanism Prioritization

Ideally, all failure mechanisms and their interactions must be considered for product design and analysis.
In the life cycle of a product, several failure mechanisms may be activated by different environmental and operational parameters acting at various stress levels, but only a few operational and environmental parameters and failure mechanisms are in general responsible for the majority of the failures. High-priority mechanisms are those select failure mechanisms that may cause the product to fail earlier than the product’s intended life duration. These mechanisms occur during the normal operational and environmental conditions of the product’s application. Focusing on high-priority failure mechanisms allows resources to be used effectively; these mechanisms are identified through prioritization of all the potential failure mechanisms. The methodology for failure mechanism prioritization is shown in Figure 8.3. Environmental and operating conditions are set up for initial prioritization of all potential failure mechanisms. If the load levels generated by certain operational and environmental conditions are nonexistent or negligible, the failure mechanisms that are exclusively dependent on those environmental and operating conditions are assigned a “low” risk level and eliminated from further consideration.
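The initial prioritization rule described above can be sketched as a simple filter: a mechanism is eliminated only if every load it depends on is absent or negligible in the life-cycle profile. The sketch below is illustrative only. The `Mechanism` class, the load names, and the example mechanisms are assumptions made for the example, and deciding which loads count as "negligible" is an engineering judgment that the code takes as an input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mechanism:
    name: str
    driving_loads: frozenset  # loads this mechanism depends on exclusively


def initial_prioritization(mechanisms, significant_loads):
    """Split mechanisms into (retained, eliminated).

    A mechanism is eliminated (assigned "low" risk) only when none of the
    loads it depends on is present at a non-negligible level.
    """
    retained, eliminated = [], []
    for m in mechanisms:
        if m.driving_loads & significant_loads:
            retained.append(m)
        else:
            eliminated.append(m)
    return retained, eliminated


# Hypothetical life-cycle profile: loads judged to be non-negligible.
significant_loads = {"temperature cycling", "random vibration", "shock"}

# Hypothetical mechanisms with the loads that drive them.
mechanisms = [
    Mechanism("solder joint fatigue",
              frozenset({"temperature cycling", "random vibration"})),
    Mechanism("EOS/ESD", frozenset({"high-voltage discharge"})),
    Mechanism("EMI", frozenset({"high current", "magnetic source"})),
]

retained, eliminated = initial_prioritization(mechanisms, significant_loads)
print([m.name for m in retained])
print([m.name for m in eliminated])
```

Only the retained mechanisms go on to the susceptibility evaluation; the eliminated ones are recorded with a "low" risk level.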
Figure 8.3 Failure mechanism prioritization. [Flowchart blocks: potential failure mechanisms; evaluate failure susceptibility and assign occurrence; look up severity level; final prioritization (RPN); outcomes: high risk, medium risk, low risk.]
For all the failure mechanisms remaining after the initial prioritization, the susceptibility to failure by those mechanisms is evaluated using the previously identified failure models when such models are available. For the overstress mechanisms, failure susceptibility is evaluated by conducting a stress analysis to determine if failure is precipitated under the given environmental and operating conditions. For the wearout mechanisms, failure susceptibility is evaluated by determining the time to failure under the given environmental and operating conditions. To determine the combined effect of all wearout failures, the overall time to failure is also evaluated with all wearout mechanisms acting simultaneously. In cases where no failure models are available, the evaluation is based on past experience, manufacturer data, or handbooks.

After evaluation of failure susceptibility, occurrence ratings under environmental and operating conditions applicable to the system are assigned to the failure mechanisms. Occurrence describes how frequently a failure mechanism is expected to result in a failure. For the overstress failure mechanisms that precipitate failure, the highest occurrence rating of “5” (frequent) is assigned. In a case in which no overstress failures are precipitated, the lowest occurrence rating “1” (extremely unlikely) is assigned. For the wearout failure mechanisms, the ratings are assigned based on benchmarking the individual time to failure for a given wearout mechanism, with overall time to failure, expected product life, past experience, and engineering judgment. Table 8.1 shows the occurrence ratings. A “frequent” occurrence rating involves failure mechanisms with very low time to failure (TTF) and overstress failures that are almost inevitable in the use condition. A “reasonably probable” rating involves failure mechanisms with low TTF. An “occasional” rating involves failures with moderate TTF. A “remote” rating involves failure mechanisms that have a high TTF. An “extremely unlikely” rating is assigned to failures with very high TTF or overstress failure mechanisms that do not produce any failure.

Table 8.1 Occurrence Ratings
Occurrence            Rating (Occ)   Criteria
Frequent              5              Overstress failure or very low TTF
Reasonably probable   4              Low TTF
Occasional            3              Moderate TTF
Remote                2              High TTF
Extremely unlikely    1              No overstress failure or very high TTF

To provide a qualitative measure of the failure effect, each failure mechanism is assigned a severity rating. Severity is the seriousness of the effect of the failure. The failure effect is assessed first at the level being analyzed, then at the next higher level, the subsystem level, and so on to the system level (SAE Standard 2002). Safety issues and impact of a failure mechanism on the end system are used as the primary criteria for assigning the severity ratings. In the severity rating, possible worst-case consequence is assumed for the failure mechanism being analyzed. Past experience and engineering judgment may also be used in assigning severity ratings. The severity ratings shown in Table 8.2 are as follows:

• A “very high or catastrophic” severity rating indicates that there may be loss of life of the user or irreparable damage to the product.
• A “high” severity rating indicates that failure might cause a severe injury to the user or a loss of function of the product.
• A “moderate or significant” rating indicates that the failure may cause minor injury to the user or show gradual degradation in performance over time through loss of availability.
• A “low or minor” rating indicates that the failure may not cause any injury to the user but may result in the product operating at reduced performance.
• A “very low or none” rating indicates that the failure does not cause any injury and has no impact on the product or, at most, is a minor nuisance.

Table 8.2 Severity Ratings
Severity                    Rating (Sev)   Criteria
Very high or catastrophic   5              System failure or safety-related catastrophic failures
High                        4              Loss of function
Moderate or significant     3              Gradual performance degradation
Low or minor                2              System operable at reduced performance
Very low or none            1              Minor nuisance
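The occurrence-rating rules of Table 8.1 can be sketched in code. The overstress rule follows the text directly (rating 5 if failure is precipitated, 1 otherwise), with the simplest stress-versus-strength comparison as the failure check. For wearout, the text gives only qualitative bands ("very low" to "very high" TTF), so the numeric cut-offs below, expressed as a fraction of the intended product life, are purely hypothetical placeholders that an analysis team would replace with its own benchmarks.

```python
def overstress_fails(induced_stress: float, strength: float) -> bool:
    """Simplest overstress check: failure if induced stress reaches strength."""
    return induced_stress >= strength


def occurrence_overstress(failure_precipitated: bool) -> int:
    """Table 8.1: 5 (frequent) if failure is precipitated, else 1."""
    return 5 if failure_precipitated else 1


# HYPOTHETICAL cut-offs on TTF / intended life; not taken from the handbook.
# Each entry is (upper bound on the ratio, occurrence rating).
TTF_BANDS = [
    (0.10, 5),  # very low TTF  -> frequent
    (0.25, 4),  # low TTF       -> reasonably probable
    (1.00, 3),  # moderate TTF  -> occasional
    (5.00, 2),  # high TTF      -> remote
]


def occurrence_wearout(ttf_years: float, intended_life_years: float) -> int:
    """Rate a wearout mechanism by benchmarking its TTF against product life."""
    if ttf_years <= 0 or intended_life_years <= 0:
        raise ValueError("TTF and intended life must be positive")
    ratio = ttf_years / intended_life_years
    for upper_bound, rating in TTF_BANDS:
        if ratio < upper_bound:
            return rating
    return 1  # very high TTF -> extremely unlikely


print(occurrence_overstress(overstress_fails(induced_stress=30.0, strength=50.0)))
print(occurrence_wearout(ttf_years=0.5, intended_life_years=10.0))
```

In practice the wearout rating also draws on the overall time to failure, past experience, and engineering judgment, which a fixed lookup like this does not capture.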
Table 8.3 Risk Matrix
Severity \ Occurrence        5 Frequent        4 Reasonably probable  3 Occasional      2 Remote          1 Extremely unlikely
5 Very high or catastrophic  25 High risk      20 High risk           15 High risk      10 High risk      5 Moderate risk
4 High                       20 High risk      16 High risk           12 High risk      8 Moderate risk   4 Low risk
3 Moderate or significant    15 High risk      12 High risk           9 Moderate risk   6 Low risk        3 Low risk
2 Low or minor               10 High risk      8 Moderate risk        6 Low risk        4 Low risk        2 Low risk
1 Very low or none           5 Moderate risk   4 Low risk             3 Low risk        2 Low risk        1 Low risk
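The risk matrix in Table 8.3 can be read as a lookup keyed by the severity and occurrence ratings, where each cell's number is the product of the two. A minimal sketch of that lookup follows. The function and variable names are hypothetical. The matrix is stored explicitly rather than derived from thresholds on the product, because the published cells are not monotonic in it (a value of 5 is "moderate risk" while 6 is "low risk").

```python
# Rows are keyed by severity (5..1); each string lists the risk level for
# occurrence ratings 5, 4, 3, 2, 1 in that order, transcribed from Table 8.3.
RISK_MATRIX = {
    5: "HHHHM",
    4: "HHHML",
    3: "HHMLL",
    2: "HMLLL",
    1: "MLLLL",
}
LABELS = {"H": "High risk", "M": "Moderate risk", "L": "Low risk"}


def rpn(sev: int, occ: int) -> int:
    """Risk priority number: product of severity and occurrence ratings."""
    if sev not in range(1, 6) or occ not in range(1, 6):
        raise ValueError("severity and occurrence must be integers from 1 to 5")
    return sev * occ


def risk_level(sev: int, occ: int) -> str:
    """Look up the risk classification for a severity/occurrence pair."""
    rpn(sev, occ)  # reuse the range validation
    return LABELS[RISK_MATRIX[sev][5 - occ]]


for sev, occ in [(5, 5), (5, 1), (3, 2), (2, 4)]:
    print(sev, occ, rpn(sev, occ), risk_level(sev, occ))
```

Because the classifications may vary with product type, use condition, and business objectives, a team adopting different bands would only need to edit the `RISK_MATRIX` strings.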
Based on the severity and occurrence of each identified failure mechanism, the risk priority number (RPN) can be calculated. The RPN is the product of the severity rating (Sev) and the occurrence rating (Occ). The final prioritization step involves classification of the failure mechanisms into three risk levels based on the RPN. This can be achieved by using the risk matrix shown in Table 8.3. The classifications may vary based on the product type, use condition, and business objectives of the user or manufacturer.

8.2.8 Documentation
The FMMEA process involves documentation, which includes the actions considered and taken based on the FMMEA. For products already manufactured, documentation may exist in the form of records of root-cause analysis conducted for the failures that occur during product development and testing. The history and lessons learned contained within the documentation provide a framework for future product FMMEA. It is also necessary to maintain and update documentation about the FMMEA after the corrective actions so as to generate a new list of high-priority failure mechanisms for future analysis.

8.3 CASE STUDY
A simple printed circuit board (PCB) assembly used in an automotive application was selected to demonstrate the FMMEA process. The PCB assembly was mounted at all four corners in the engine compartment of a 1997 Toyota 4Runner. The assembly
consisted of an FR-4 PCB with copper metallizations, PTH, and eight surface mount inductors soldered into the pads using 63Sn-37Pb solder. The inductors were connected to the PTH through the PCB metallization. The PTHs were solder filled and an event detector circuit was connected in series with all the inductors through the PTHs to assess failure. Assembly failure was defined as one that would result in breakdown or no current passage in the event detector circuit. For all the elements listed, the corresponding functions and the potential failure modes were identified. Table 8.4 lists the physical location of all possible failure modes for the elements. For example, for the solder joint, the potential failure modes are open and intermittent change in resistance. For the sake of simplicity and for demonstration purposes, it was assumed that the test setup, the board, and its components were defect free. This assumption can be valid if proper screening was conducted after manufacture. In addition, it was assumed that there was no damage to the assembly after manufacture. Potential failure causes were then identified for the failure modes shown in Table 8.4. For example, for the solder joint, the potential failure causes for open and intermittent change in resistance are temperature cycling, random vibration, or sudden shock impact caused by vehicle collision. Based on the potential failure causes that were assigned to the failure modes, the corresponding failure mechanisms were identified. Table 8.4 lists the failure mechanisms for the failure causes that were identified. For example, for the open and intermittent change in resistance in solder joint, the mechanisms driving the failure were solder joint fatigue and fracture. For each of the failure mechanisms listed, the appropriate failure models were then identified from the literature.
Information about product dimensions and geometry was obtained from design specification, board layout drawing, and component manufacturer data sheets. Table 8.4 provides all the failure models for the failure mechanisms that were listed. For example, in the case of solder joint fatigue, a Coffin–Manson (Steinberg 1988) failure model was used for stress and damage analysis for temperature cycling. The assembly was powered by a 3-V battery source independent of the automobile electrical system. No high-current, voltage, magnetic, or radiation sources were identified as having an effect on the assembly. For the temperature, vibration, and humidity conditions prevalent in the automotive under-hood environment, data were obtained first from the SAE environmental handbook (SAE 1978) because no manufacturer field data were available for the automotive under-hood environment for the Washington, D.C., area. The maximum temperature in the automotive under-hood environment was listed as 121°C (SAE 1978). The car was assumed to operate on average 3 hours per day in two equal trips in the Washington, D.C., area. The maximum shock level was assumed to be 45 G for 3 ms. The maximum relative humidity in the under-hood environment was 98% at 38°C (SAE 1978). The average daily maximum and minimum temperatures in the Washington, D.C., area for the period of the study were 27 and 16°C, respectively. After all potential failure modes, causes, mechanisms, and models were identified for each element, an initial prioritization was made based on the life-cycle environmental and operating conditions. In the automotive under-hood environment for
the given test setup, failures driven by electrical overstress (EOS) and electrostatic discharge (ESD) were ruled out because of the absence of active devices and the low voltage source of the batteries. Electromagnetic interference (EMI) was also not anticipated because the circuit function was not susceptible to transients. Hence EOS, ESD, and EMI were each assigned a “low” risk level. The time to failure for the wearout failure mechanisms was calculated using calcePWA (University of Maryland). Occurrence ratings were assigned based on comparing the time to failure for a given wearout mechanism with the overall time to failure with all wearout mechanisms acting together. For the inductors, the occurrence rating was assigned based on failure rate data obtained from Telcordia (2001). Based on prior knowledge of wearout associated with the pads, pad fatigue was assigned a “remote” occurrence rating. An assessment of a shock level of 45 G for 3 ms using calcePWA produced no failure for the interconnects and the board. Hence shock overstress on the interconnects and the board was assigned an “extremely unlikely” occurrence rating. Because no overstress shock failure was expected on the board and the interconnects, it was assumed there would also be no failure on the pads. Hence overstress shock failure on pads was also assigned an “extremely unlikely” rating. The glass transition temperature for the board was 150°C. Because the maximum temperature in the under-hood environment was only 121°C (SAE 1978), no glass transition was expected to occur and it was assigned an “extremely unlikely” rating. A short or open PTH would not have had any impact on the functioning of circuits because it was used only as termination for the inductors. Hence it was assigned a “very low” severity rating. For all other elements, any given failure mode of the element would have led to disruption in the functioning of the circuit. Hence all other elements were assigned a “very high” severity rating.

Table 8.4 FMMEA Worksheet for the Case Study
(A hyphen marks a cell that is empty in the worksheet.)
Element | Potential Failure Mode | Potential Failure Cause | Potential Failure Mechanism | Mechanism Type | Failure Model | Failure Susceptibility | Occurrence | Severity | Risk
PTH | Electrical open in PTH | Temperature cycling | Fatigue | Wearout | CALCE PTH barrel thermal fatigue (a) | >10 years | Remote | Very low | Low
Metallization | Electrical short/open, change in resistance in the metallization traces | High temperature | Electromigration | Wearout | Black (b) | >10 years | Remote | Very high | Moderate
Metallization | Electrical short/open, change in resistance in the metallization traces | High relative humidity, ionic contamination | Corrosion | Wearout | Howard (c) | >10 years | Remote | Very high | Moderate
Component (inductors) | Short/open between windings and the core | High temperature | Wearout of winding insulation | Wearout | No model | - | Remote (d) | Very high | Moderate
Interconnect | Open/intermittent change in electrical resistance | Temperature cycling | Fatigue | Wearout | Coffin–Manson (e) | 170 days | Frequent | Very high | High
Interconnect | Open/intermittent change in electrical resistance | Random vibration | Fatigue | Wearout | Steinberg (f) | 43 days | Frequent | Very high | High
Interconnect | Open/intermittent change in electrical resistance | Sudden impact | Shock | Overstress | Steinberg (f) | No failure | Extremely unlikely | Very high | Moderate
PCB | Electrical short between PTHs | High relative humidity | CFF | Wearout | Rudra et al. (g) | 4.6 years | Occasional | Very low | Low
PCB | Crack/fracture | Random vibration | Fatigue | Wearout | Basquin (f) | >10 years | Remote | Very high | Moderate
PCB | Crack/fracture | Sudden impact | Shock | Overstress | Steinberg (f) | No failure | Extremely unlikely | Very high | Moderate
PCB | Loss of polymer strength | High temperature | Glass transition | Overstress | No model | No failure | Extremely unlikely | Very high | Moderate
PCB | Open | Discharge of high voltage through dielectric material | EOS/ESD | Overstress | No model | Eliminated in first level prioritization | - | - | Low
PCB | Excessive noise | Proximity to high current or magnetic source | EMI | Overstress | No model | Eliminated in first level prioritization | - | - | Low
Pad | Lift/crack | Temperature cycling/random vibration | Fatigue | Wearout | No model | - | Remote | Very high | Moderate
Pad | Lift/crack | Sudden impact | Shock | Overstress | - | - | Extremely unlikely | Very high | Moderate
(a) Bhandarkar, S. M. et al. 1992. Transactions of the ASME—Journal of Electronic Packaging 114:8–13.
(b) Black, J. R. 1983. IEEE Proceedings of International Reliability Physics Symposium 142–149.
(c) Howard, R. T. 1981. IEEE Transactions on CHMT 4 (4): 520–525.
(d) Based on failure rate data of inductors in Telcordia. (From Telcordia Technologies. May 2001. Special Report SR-332: Reliability prediction procedure for electronic equipment, Issue 1, Telcordia Customer Service, Piscataway, NJ.)
(e) Foucher, B. et al. 2002. Microelectronics Reliability 42 (8): 1155–1162.
(f) Steinberg, D. S. 1988. Vibration analysis for electronic equipment, 2nd ed. New York: John Wiley & Sons.
(g) Rudra, A. B. et al. 1995. Circuit World 22 (1): 67–70.
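The comparison of each wearout mechanism's time to failure with the overall time to failure can be illustrated with a short sketch. This is not the calcePWA calculation: it assumes linear damage accumulation (damage rates simply add), which the text does not state, and it uses only the two finite wearout times reported in Table 8.4 for the solder joint interconnect. The function and variable names are hypothetical.

```python
# Illustrative sketch only; this is not the calcePWA algorithm.
# Assumption (not from the source): damage from independent wearout
# mechanisms accumulates linearly, so damage rates (1 / time to failure) add.

def combined_time_to_failure(times_to_failure_days):
    """Return the overall time to failure when all mechanisms act together."""
    total_damage_rate = sum(1.0 / t for t in times_to_failure_days)
    return 1.0 / total_damage_rate

# Wearout times to failure for the interconnect, taken from Table 8.4.
interconnect_ttf_days = {
    "fatigue (temperature cycling)": 170.0,
    "fatigue (random vibration)": 43.0,
}

overall = combined_time_to_failure(interconnect_ttf_days.values())

# Rank mechanisms from shortest to longest time to failure and report the
# ratio on which an occurrence rating could be based.
ranked = sorted(interconnect_ttf_days.items(), key=lambda item: item[1])
for name, ttf in ranked:
    print(f"{name}: {ttf:.0f} days, ratio to overall = {ttf / overall:.2f}")
print(f"overall (assumed linear damage): {overall:.1f} days")
```

Under this assumption the overall time to failure is shorter than that of either mechanism alone, and the mechanism whose individual time is closest to the overall value (here, vibration fatigue) dominates.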
Final prioritization and risk assessment for the failure mechanisms are shown in Table 8.4. Out of all the failure mechanisms that were analyzed, fatigue due to thermal cycling and vibration at the solder joint interconnect were the only failure mechanisms that had a high risk. Because they were high-risk failure mechanisms, they were identified as high priority. An FMEA on the assembly would have identified all the elements, their functions, potential failure modes, and failure causes as in FMMEA. FMEA would then have identified the effect of failure for each failure mode. For example, in the case of a solder joint interconnect, the failure effect of the open joint would have involved no current passage in the test setup. Next, the FMEA would have identified the severity, occurrence, and detection probabilities associated with each failure mode. For example, in case of a solder joint open failure mode, based on past experience and use of engineering judgment, each of the metrics, severity, occurrence, and detection, would have received a rating on a scale of 1–10. The product of severity, occurrence, and detection would then have been used to calculate RPN. The RPNs for other failure modes would have been calculated in a similar manner and then all the failure modes would have been prioritized based on the RPN values. This is unlike FMMEA, which used failure mechanisms and models and used combined effects of all failure mechanisms to evaluate the occurrence quantitatively. The occurrence rating in conjunction with severity was then used to assign a risk level to each failure mechanism for prioritization.
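The contrast between the two prioritization schemes can be sketched in code. The lookup below holds only the occurrence and severity pairs that actually appear in Table 8.4; any other pair is reported as unknown rather than guessed. The FMEA ratings in the usage lines are hypothetical values chosen for illustration, not data from the case study.

```python
# Illustrative sketch; the rating pairs below are only those that appear in
# Table 8.4. Any other pair is treated as unknown rather than guessed.
RISK_FROM_TABLE_8_4 = {
    ("remote", "very low"): "low",
    ("remote", "very high"): "moderate",
    ("occasional", "very low"): "low",
    ("frequent", "very high"): "high",
    ("extremely unlikely", "very high"): "moderate",
}

def fmmea_risk(occurrence, severity):
    """Look up the FMMEA risk level for an (occurrence, severity) pair."""
    key = (occurrence.lower(), severity.lower())
    return RISK_FROM_TABLE_8_4.get(key, "unknown")

def fmea_rpn(severity, occurrence, detection):
    """Classical FMEA risk priority number: product of three 1-10 ratings."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be on a 1-10 scale")
    return severity * occurrence * detection

# FMMEA: solder joint fatigue row of Table 8.4.
print(fmmea_risk("Frequent", "Very high"))
# FMEA: hypothetical ratings for an open solder joint (not from the source).
print(fmea_rpn(severity=9, occurrence=7, detection=4))
```

The FMMEA lookup needs no detection estimate, whereas the RPN depends on three judgment-based ratings multiplied together.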
8.4 CONCLUSIONS
FMMEA allows the design team to take into account the available scientific knowledge of failure mechanisms and merge it with the systematic features of the FMEA template, in keeping with the “design for reliability” philosophy. The idea of prioritization embedded in the FMEA process is also utilized in FMMEA to identify mechanisms that are likely to cause failures during the product life cycle. FMMEA differs from FMEA in a few respects. In FMEA, potential failure modes are examined individually and the combined effects of coexisting failure causes are not considered. FMMEA, on the other hand, considers the impact of failure mechanisms acting simultaneously. FMEA involves precipitation and detection of failure for updating and calculating the RPN, and it cannot be applied in cases that involve continuous monitoring of performance degradation over time. In contrast, FMMEA does not require the failure to be precipitated and detected, and the uncertainties associated with the detection estimation are not present. FMEA does not use environmental and operating conditions at a quantitative level. At best, they are used to eliminate certain failure modes. FMMEA prioritizes the failure mechanisms, using the information on stress levels of environmental and operating conditions to identify high-priority mechanisms that must be accounted for in the design or be controlled. This prioritization in FMMEA overcomes the shortcomings of the RPN prioritization used in FMEA, which provides a false sense of granularity. Thus, the use of FMMEA provides additional quantitative information regarding product reliability and opportunities for improvement compared to FMEA because it takes into account specific failure mechanisms and the stress levels of environmental and operating conditions in the analysis process. There are several benefits to organizations that use FMMEA.
It provides specific information on stress conditions so that the acceptance and qualification tests yield usable results. Use of the failure models at the development stage of a product allows for appropriate “what-if” analysis on proposed technology upgrades. FMMEA can also be used to aid several design and development steps considered to be the best practices, which can only be performed or enhanced by the utilization of the knowledge of failure mechanisms and models. These steps include virtual qualification, accelerated testing, root cause analysis, life consumption monitoring, and prognostics. All the technological and economic benefits provided by these practices are realized better through the adoption of FMMEA.

REFERENCES
Bhandarkar, S. M. et al. 1992. Influence of selected design variables on thermomechanical stress distributions in plated through hole structures. Transactions of the ASME—Journal of Electronic Packaging 114:8–13.
Black, J. R. 1983. Physics of electromigration. IEEE Proceedings of International Reliability Physics Symposium 142–149, Phoenix, AZ.
Bowles, J. B. 2003. Fundamentals of failure modes and effects analysis. Tutorial Notes Annual Reliability and Maintainability Symposium, Tampa, FL.
Bowles, J. B., and R. D. Bonnell. 1998. Failure modes, effects and criticality analysis—What is it and how to use it. Tutorial Notes Annual Reliability and Maintainability Symposium, Anaheim, CA.
Dasgupta, A., C. Oyan, D. Barker, and M. Pecht. 1992. Solder creep-fatigue analysis by an energy-partitioning approach. ASME Transactions, Journal of Electronic Packaging 114 (2): 152–160.
Dasgupta, A., and M. Pecht. 1991. Material failure mechanisms and damage models. IEEE Transactions on Reliability 40 (5): 531–536.
Foucher, B., J. Boullie, B. Meslet, and D. Das. 2002. A review of reliability predictions methods for electronic devices. Microelectronics Reliability 42 (8): 1155–1162.
Franceschini, F., and M. Galetto. 2001. A new approach for evaluation of risk priorities of failure modes in FMEA. International Journal of Production Research 39 (13): 2991–3002.
Guidelines for failure mode and effects analysis for automotive, aerospace, and general manufacturing industries. 2003. Ontario, Canada: Dyadem Press.
Howard, R. T. 1981. Electrochemical model for corrosion of conductors on ceramic substrates. IEEE Transactions on CHMT 4 (4): 520–525.
Hu, J., D. Barker, A. Dasgupta, and A. Arora. 1993. Role of failure-mechanism identification in accelerated testing. Journal of the IES 36 (4): 39–45.
IEEE Standard 1413.1-2002. 2003. IEEE guide for selecting and using reliability predictions based on IEEE 1413.
JEDEC Publication JEP 122-B. August 2003. Failure mechanisms and models for semiconductor devices.
JEDEC Publication JEP 148. April 2004. Reliability qualification of semiconductor devices based on physics-of-failure risk and opportunity assessment.
Kara-Zaitri, C., A. Z. Keller, and P. V. Fleming. 1992. A smart failure mode and effect analysis package. Annual Reliability and Maintainability Symposium Proceedings, 414–421.
Rudra, A. B. et al. 1995. Electrochemical migration in multichip modules. Circuit World 22 (1): 67–70.
SAE (Society of Automotive Engineers). Rev. November 1978. Recommended environmental practices for electronic equipment design, SAE J1211.
SAE Standard. August 2002. SAE J1739 Potential failure mode and effects analysis in design (design FMEA) and potential failure mode and effects analysis in manufacturing and assembly processes (process FMEA) and effects analysis for machinery (machinery FMEA).
Steinberg, D. S. 1988. Vibration analysis for electronic equipment, 2nd ed. New York: John Wiley & Sons.
Telcordia Technologies. 2001. Special Report SR-332: Reliability prediction procedure for electronic equipment, issue 1, Telcordia Customer Service, Piscataway, NJ.
University of Maryland. A physics-of-failure-based virtual reliability assessment tool developed by CALCE, The University of Maryland.
CHAPTER 9
Design for Reliability
Diganta Das, Michael Pecht

CONTENTS
9.1 Introduction
9.2 Product Requirements and Constraints
9.3 Product Life-Cycle Conditions
9.4 Reliability Capability
9.5 Parts and Materials Selection
9.6 Failure Modes, Mechanisms, and Effects Analysis
9.7 Physics of Failure
9.7.1 Stress Margins
9.7.2 Model Analysis of Failure Mechanisms
9.7.3 De-Rating
9.7.4 Protective Architectures
9.7.5 Redundancy
9.7.6 Prognostics
9.8 Qualification
9.9 Manufacture and Assembly
9.9.1 Manufacturability
9.9.2 Process Verification Testing
9.10 Closed-Loop, Root-Cause Monitoring
9.11 Summary
References
Homework Problems

9.1 INTRODUCTION
To ensure product reliability, an organization must follow certain practices during the product development process. These practices impact reliability through the selection of parts (materials), product design, manufacturing, assembly, shipping and handling, operation, maintenance, and repair. The practices are listed here and are described in this book:
• Define realistic product reliability requirements determined by factors including the targeted life-cycle application conditions and performance expectations. The product requirements should consider the customer’s needs and the manufacturer’s capability to meet those needs.
• Define the product life-cycle conditions by assessing relevant manufacturing, assembly, storage, handling, shipping, operating, and maintenance conditions.
• Ensure that the supply-chain participants have the capability to produce the parts (materials) and services necessary to meet the final reliability objectives.
• Select the parts (materials) that have sufficient quality and are capable of delivering the expected performance and reliability in the application.
• Identify the potential failure modes, failure sites, and failure mechanisms by which the product can be expected to fail.
• Design to the process capability (i.e., the quality level that can be controlled in manufacturing and assembly), considering the potential failure modes, failure sites, and failure mechanisms obtained from the physics-of-failure analysis and the life-cycle profile.
• Qualify the product to verify its reliability in the expected life-cycle conditions. Qualification encompasses all activities that ensure that the nominal design and manufacturing specifications will meet or exceed the reliability goals.
• Ensure that all manufacturing and assembly processes are capable of producing the product within the statistical process window required by the design. Variability in material properties and manufacturing processes will impact the product’s reliability. Therefore, characteristics of the process must be identified, measured, and monitored.
• Manage the life-cycle usage of the product using closed-loop, root-cause monitoring procedures.
9.2 PRODUCT REQUIREMENTS AND CONSTRAINTS
There are various reasons to justify the creation, modification, or upgrade of a product. For example, a company may want to address a perceived market need or to open new markets. In some cases, a company may need to develop new products to remain competitive in a key market or to maintain market share and customer confidence. In other cases, a company may want to satisfy specific strategic customers, to demonstrate experience with a new technology or methodology, or to improve maintainability of an existing product. In addition, product updates are often developed to reduce the life-cycle costs of an existing product. To make reliable products, there should be cooperation between suppliers and customers throughout the supply chain. IEEE 1332 (1998) addresses this cooperation in three reliability objectives. First, the supplier, working with the customer, shall determine and understand the customer’s requirements and product needs so that a comprehensive design specification can be generated. Second, the supplier shall structure and follow a series of engineering activities so that the resulting product satisfies the customer’s requirements and product needs with regard to product reliability. Third, the supplier shall perform activities
that assure the customer that the reliability requirements and product needs have been satisfied. Initially, requirements are formulated into a requirements document, where they are prioritized. The specific people involved in prioritization and approval will vary with the organization and the product. For example, for safety-critical products, safety, reliability, and legal representatives may all provide guidance. Once a set of requirements has been completed, the product engineering function creates a response to the requirements in the form of a specification. The specification states the requirements that must be met, the schedule for meeting the requirements, the identification of those who will perform the work, and the identification of potential risks. Differences between the requirements document and the preliminary specification become the topic of trade-off analyses. Once product requirements are defined and the design process begins, there should be an assessment of the product’s requirements against the actual product design. As the product’s design becomes increasingly detailed, it becomes increasingly important to track the product’s characteristics in relation to the original product requirements. The rationale for making changes should be documented. Thorough requirement tracking can significantly reduce future product redesign costs. Planned redesigns or design refreshes through technology monitoring and use of roadmaps ensure that a company is able to market new products or redesigned versions of old products in a timely, effective manner to retain its customer base and ensure continued profits.
9.3 PRODUCT LIFE-CYCLE CONDITIONS
The life-cycle conditions of the product influence decisions regarding product design and development, materials and parts selection, qualification, product safety, warranty, and product support (i.e., maintenance). The phases in a product’s life cycle include manufacturing and assembly, testing, rework, storage, transportation and handling, operation (modes of operation, on–off cycles, etc.), repair, and maintenance. Operational conditions are sometimes referred to as the life-cycle application conditions. During each phase of its life cycle, a product will experience various environmental and usage loads. The life-cycle loads can include, but are not limited to, thermal (steady-state temperature, temperature ranges, temperature cycles, temperature gradients), mechanical (pressure levels, pressure gradients, vibrations, shock loads, acoustic levels), chemical (aggressive or inert environments, ozone, pollution, humidity levels, contamination, fuel spills), radiation (electromagnetic interference, altitude), and electrical (power, power surges, current, voltage, voltage spikes) loading conditions. The extent and rate of product degradation, and thus reliability, depend upon the nature, magnitude, and duration of exposure to such loads. Defining and characterizing the life-cycle loads is often an uncertain element of the overall design-for-reliability process. The challenge occurs because products can
experience completely different application conditions depending on the application location, the product utilization or nonutilization profile, duration of utilization, and maintenance and servicing conditions. For example, typically all desktop computers are designed for home or office environments. However, the operational profile of each unit may be completely different depending on user behavior. Some users may shut down the computer every time after it is used; others may shut down only once at the end of the day, and still others may keep their computers powered all the time. Furthermore, one user may keep the computer by a sunny window, while another person may keep the computer near an air conditioner. Thus, the temperature profile experienced by each product and hence its degradation due to thermal loads would be different.

There are four methods used to estimate product life-cycle loads: market surveys and standards, similarity analysis, field trial and service records, and in situ monitoring. Each is discussed next.

Market surveys and standards provide a very coarse and often inaccurate estimate of the environmental loads possible in various field applications. The environmental profiles available from these sources are typically classified according to industry type, such as military, consumer, telecommunications, automotive, and commercial avionics.

Similarity analysis is a technique for estimating environmental loads when sufficient field histories for similar products are available. Before using data on existing products for proposed designs, the characteristic differences in design and application use for the comparison products need to be reviewed. For example, electronics inside a washing machine in a commercial laundry are expected to experience a wider distribution of loads and use conditions (due to a large number of users) and higher usage rates compared with a home washing machine. As another example, it has been found that some users in Asia use a dishwasher to wash vegetables in addition to eating utensils. These dishwashers would experience higher usage rates than those used only for washing dishes.

Field trial records provide estimates of the environmental profiles experienced by the product. The data depend on the durations and conditions of the trials and can be extrapolated to estimate actual environmental conditions. Service records provide information on the maintenance, replacement, or servicing performed. These data can give an idea of the life-cycle environmental and usage conditions that lead to servicing or failure.

Environmental and usage conditions experienced by the product over its life cycle can be monitored in situ (Vichare et al. 2004). These data are often collected using sensors that are mounted externally or integrated with the product and supported by telemetry systems. Load distributions should be developed from data obtained by monitoring products used by different customers, ideally from various geographical locations where the product is used. The data should be collected over a sufficient period to provide an estimate of the loads and their variation over time. In situ monitoring provides the most accurate account of load histories and is most valuable in design for reliability (DFR) and product reliability assessment.
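As a rough illustration of how load distributions might be built from in situ data gathered on several units, the sketch below summarizes temperature logs per unit and then pools them across the population. The unit names and readings are invented for the example, and the choice of descriptors (mean, maximum, range, quartiles) is an assumption rather than a method prescribed by the text.

```python
# Illustrative sketch with made-up data; unit names and readings are hypothetical.
from statistics import mean, quantiles

# Temperature readings (deg C) logged in situ on three units used by
# different customers.
logs = {
    "unit_A": [22, 24, 31, 38, 35, 27, 23],
    "unit_B": [18, 19, 21, 22, 21, 20, 18],
    "unit_C": [25, 33, 45, 52, 47, 36, 28],
}

def summarize(readings):
    """Return simple load descriptors for one unit's temperature history."""
    return {
        "mean": mean(readings),
        "max": max(readings),
        "range": max(readings) - min(readings),
    }

per_unit = {unit: summarize(readings) for unit, readings in logs.items()}

# Pool all units to describe the load distribution across the population.
pooled = sorted(t for readings in logs.values() for t in readings)
q1, median, q3 = quantiles(pooled, n=4)

for unit, stats in per_unit.items():
    print(unit, stats)
print("pooled quartiles:", q1, median, q3)
```

The per-unit summaries show how differently the same product can be loaded, which is the point of the desktop computer example; the pooled quartiles give a first, coarse picture of the population-level distribution.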
9.4 RELIABILITY CAPABILITY
The selection of a supply chain is often based on factors that do not explicitly address reliability, such as technical capabilities, production capacity, geographic location, support facilities, and financial and contractual factors. A selection process that takes into account the ability of suppliers to meet reliability objectives during manufacturing, testing, and support can improve reliability of the final product throughout its life cycle and can provide valuable competitive advantages. Reliability capability is a measure of the practices within an organization that contribute to the reliability of the final product and the effectiveness of these practices in meeting the reliability requirements of customers. Reliability capability assessment is the act of quantifying the effectiveness of reliability activities, using a metric called reliability capability maturity. From a reliability perspective, maturity indicates whether the key reliability practices within an organization are well understood, supported by documentation and training, applied to all products throughout the organization, and continually monitored and improved.
9.5 PARTS AND MATERIALS SELECTION
A parts (materials) selection and management methodology helps a company to make risk-informed decisions concerning the incorporation (assembly) of parts and materials into a product. The part assessment process is shown in Figure 9.1. Key
elements of part assessment include performance, quality, reliability, and ease of assembly. The goal of performance assessment is to evaluate the part’s ability to meet the performance requirements (e.g., structural, mechanical, electrical, thermal, biological, etc.) of the product. In general, there are often minimum and maximum limits beyond which the part will not function properly, at least in terms of the datasheet specifications. These limits, or ratings, are often called the recommended operating conditions. Quality is evaluated by outgoing quality and process capability metrics. Reliability assessment results provide information about the ability of a part to meet the required performance specifications in its targeted life-cycle application for a specified period of time. Reliability is evaluated through part qualification and reliability test results. A part is acceptable from an assembly viewpoint if it is compatible with the downstream assembly equipment and processes. Assembly guidelines should be followed to prevent damage and deterioration of the part during the assembly process. Examples include a recommended temperature profile, cleaning agents, adhesives, moisture sensitivity, and electrical protection. As new technologies emerge and products become more complex, assembly guidelines become more important to ensure the targeted quality and reliability of the parts and the product.

Figure 9.1 Part assessment process. (Flowchart: a candidate part undergoes performance, quality, reliability, and assembly assessment. If all criteria are satisfied, the part joins the part group. If not, the part is rejected when an alternative part is available; otherwise supplier intervention is pursued.)
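The decision flow of Figure 9.1 can be written as a small function. This is a sketch of the flowchart logic only; the class, field, and function names are hypothetical, and each assessment is reduced to a pass or fail flag, which simplifies the evaluations described above.

```python
# Illustrative sketch of the decision flow in Figure 9.1; all names are
# hypothetical and each assessment is reduced to a pass/fail flag.
from dataclasses import dataclass

@dataclass
class PartAssessment:
    performance_ok: bool
    quality_ok: bool
    reliability_ok: bool
    assembly_ok: bool

    def all_criteria_satisfied(self) -> bool:
        return all((self.performance_ok, self.quality_ok,
                    self.reliability_ok, self.assembly_ok))

def disposition(assessment: PartAssessment, alternative_available: bool) -> str:
    """Return the outcome for a candidate part, following Figure 9.1."""
    if assessment.all_criteria_satisfied():
        return "add to part group"
    if alternative_available:
        return "reject part"
    return "supplier intervention"

candidate = PartAssessment(performance_ok=True, quality_ok=True,
                           reliability_ok=False, assembly_ok=True)
print(disposition(candidate, alternative_available=False))
```

A part that fails any one assessment is not rejected outright when no alternative exists; the flow instead routes it to supplier intervention.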
9.6 FAILURE MODES, MECHANISMS, AND EFFECTS ANALYSIS
A failure mode is the manner in which a failure can occur—that is, the way in which the product fails to perform its intended design function or performs the function but fails to meet its objectives. For example, failure modes of a cell phone include a button that does not cause a number to register or a microphone that does not pick up the user’s voice. Sometimes the failure modes are intentionally accentuated so that the user of the product will become aware of the existence of a problem. For example, a bad smelling substance is sometimes added to natural gas to indicate the existence of a leak. Another example is the grinding noise when the brake pads wear out in a car. Failure mechanisms are the processes by which a specific combination of physical, electrical, chemical, and mechanical stresses induces failures. For example, fracture, fatigue, and corrosion are failure mechanisms. The purpose of failure modes, mechanisms, and effects analysis (FMMEA) is to identify potential failure mechanisms and models for all the potential failure modes of a product and then to prioritize failure mechanisms for efficient product development. FMMEA is based on understanding the relationships between product requirements and the physical characteristics of the product (and their variation in the production process), the interactions of product materials with loads (stresses at application conditions), and their influence on the product’s susceptibility to failure with respect to the use conditions.
9.7 PHYSICS OF FAILURE
Once the parts (materials), load conditions, and possible failure risks based on the FMMEA have been identified, the design guidelines based on physics-of-failure models aid in making design trade-offs and can also be used to develop tests, screens, and de-rating* factors. Tests based on physics-of-failure models can be planned to measure specific quantities, to detect the presence of unexpected flaws, and to detect manufacturing or maintenance problems. Screens can be planned to precipitate failures in “weak” products while not deteriorating the design life of the shipped product. De-rating or safety factors can be determined to lower the stresses for the dominant failure mechanisms.

9.7.1 Stress Margins
Products should be designed to operate satisfactorily, with margins (the design margins) at the extremes of the stated recommended operating ranges (the specification limits). These ranges must be included in the procurement requirement or specifications. Figure 9.2 schematically represents the hierarchy of product load (stress) limits and margins. The specification limits are set by the manufacturer to limit the conditions of customer use. The design margins correspond to the load (stress) condition that the product is designed to survive without field failures. The operating margin is the expected load (stress) that may lead to a recoverable failure. The destruct margin is the expected load (stress) that may lead to permanent (overstress) failure. Statistical analysis and worst-case analysis should be used to assess the effects of product parameter variations. In statistical analysis, a functional relationship is established between the output characteristics of the product and its parameters. In worst-case analysis, the effect on the product outputs is evaluated on the basis of end-of-life performance values.
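The difference between worst-case and statistical analysis of parameter variation can be illustrated with a resistive voltage divider. Everything in this sketch is an assumption made for illustration: the circuit, the 3 V input, the 10 kΩ nominal values, the ±5% end-of-life tolerance, and the choice to treat that tolerance as a three-sigma bound of a normal distribution.

```python
# Illustrative sketch; circuit, nominal values, and tolerance are hypothetical.
import random
from itertools import product

def divider_output(v_in, r1, r2):
    """Functional relationship between the output and the part parameters."""
    return v_in * r2 / (r1 + r2)

V_IN = 3.0                              # volts
R1_NOM, R2_NOM = 10_000.0, 10_000.0     # ohms
TOL = 0.05                              # assumed end-of-life tolerance (+/-5%)

# Worst-case analysis: evaluate the output at every combination of the
# end-of-life extremes of the parameters.
corners = [
    divider_output(V_IN, R1_NOM * (1 + s1 * TOL), R2_NOM * (1 + s2 * TOL))
    for s1, s2 in product((-1, 1), repeat=2)
]
worst_low, worst_high = min(corners), max(corners)

# Statistical analysis: sample the parameters (tolerance treated as 3 sigma)
# and propagate them through the same functional relationship.
rng = random.Random(0)
samples = [
    divider_output(V_IN,
                   rng.gauss(R1_NOM, R1_NOM * TOL / 3),
                   rng.gauss(R2_NOM, R2_NOM * TOL / 3))
    for _ in range(10_000)
]
mean_out = sum(samples) / len(samples)
std_out = (sum((x - mean_out) ** 2 for x in samples) / (len(samples) - 1)) ** 0.5

print(f"worst case: {worst_low:.3f} V to {worst_high:.3f} V")
print(f"statistical: mean {mean_out:.3f} V, std {std_out:.4f} V")
```

The worst-case bounds are wider than the spread most units would show, because they assume every parameter sits at its extreme simultaneously; the statistical result describes how likely such extremes are.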
Figure 9.2 Load (stress) limits and margins. [Levels shown, from highest to lowest load: upper destruct limit, upper operating margin, upper design margin, upper specification limit, lower specification limit, lower design margin, lower operating margin, lower destruct limit.]
* De-rating is the practice of subjecting parts to lower electrical or mechanical stresses than they can withstand to increase the life expectancy of the part.
9.7.2 Model Analysis of Failure Mechanisms
Model analysis of failure mechanisms is based on computer-aided simulation. It can assist in identifying and ranking the dominant failure mechanisms associated with the product under life-cycle loads, determining the acceleration factor for a given set of accelerated test parameters, and determining the time to failure corresponding to the identified failure mechanisms. Each failure model comprises a load analysis model and a damage assessment model. The output is a ranking of different failure mechanisms, based on the time to failure. The load model captures the product architecture, and the damage model depends on a material’s response to the applied loads. Model analysis of failure mechanisms can be used to optimize the product design in such a way that the minimum time to failure of the product is greater than its desired life. Although the data obtained from model analysis of failure mechanisms cannot fully replace those obtained from physical tests, they can increase the efficiency of tests by indicating the potential failure modes and mechanisms that can be expected. It should be remembered that the accuracy of the modeling results depends on the accuracy of the inputs to the process: the product geometry and material properties, the life-cycle loads, the failure models used (e.g., constants in the failure model), the analysis domain, and the discretization approach (spatial and temporal). Hence, to obtain a reliable prediction, the variability in the inputs should be specified using distribution functions, and the validity of the failure models should be tested by conducting the appropriate tests.
9.7.3 De-Rating
To ensure that the product remains within the predetermined margins shown in Figure 9.2, de-rating can be used. De-rating is the practice of limiting loads (e.g., thermal, electrical, and mechanical) to improve reliability.
De-rating can provide added protection from anomalies unforeseen by the designer (e.g., transient loads, electrical surges). For example, manufacturers of electronic parts often specify limits for supply voltage, output current, power dissipation, junction temperature, and frequency. The product design team may decide to ensure that the operational condition for a particular load, such as temperature, is always below the rated level. The load reduction is expected to extend the useful operating life when the failure mechanisms under consideration are wearout types. This practice is also expected to provide a safer operating condition by furnishing a margin of safety when the failure mechanisms are of the overstress type. As inherently suggested by the term “de-rating,” the methodology involves a two-step process: “Rated” load values are first determined and then a reduced value is assigned. The margin of safety that the process of de-rating is to provide is the difference between the maximum allowable actual applied load and the product’s demonstrated limits. In order to be effective, de-rating must target the appropriate, critical load parameters based on models of the relevant failure mechanisms. Once the failure
models for the critical failure mechanisms have been identified using, for example, the FMMEA, the impact of de-rating on the effective reliability of the product for a given load can be determined. The goal should be to determine the “safe” operating envelope for the product and then to operate within that envelope. 9.7.4 Protective Architectures The objective of protective architectures is to enable some form of action, after an initial failure or malfunction, to prevent additional or secondary failures. Protective techniques include the use of fuses and circuit breakers, self-sensing structures, and adjustment structures that correct for parametric shifts. In designs where safety is an issue, it is generally desirable to incorporate some means for preventing a product from failing or from causing further damage when it fails. Fuses and circuit breakers are examples of elements used to sense excessive current or voltage spikes and disconnect power from electronic products. Similarly, thermostats can be used to sense critical temperature-limiting conditions and to unpower the product until the temperature returns to normal. Self-checking circuitry can also be incorporated to sense abnormal conditions and restore normal conditions or to activate circuitry that will compensate for the malfunction. In some instances, it may be desirable to permit partial operation of the product after a part failure, possibly with degraded performance, rather than completely unpower the product. For example, in shutting down a failed circuit whose function is to provide precise trimming adjustment within a dead band of another control product, acceptable performance may be achieved, under emergency conditions, with the dead-band control product alone. Protective architectures must be designed with consideration of the impact of maintenance. 
For example, if a fuse protecting a circuit is replaced, the following questions need to be answered: What is the impact when the product is re-energized? What protective architectures are appropriate for postrepair operations? What maintenance guidance must be documented and followed when fail-safe protective architectures have or have not been included? 9.7.5 Redundancy The purpose of redundancy is to enable the product to operate successfully even though one or more of the parts of the product fail. A design team often finds that redundancy is the quickest way to improve product reliability if there is insufficient time to explore alternatives. It can be the most cost-effective solution, or perhaps the only solution, if the reliability requirement is beyond the state of the art. A redundant design typically adds size, weight, and cost. When not properly implemented, redundancy can also provide a false sense of reliability. If a failure cause can affect all the redundant elements of a product at the same time, then the benefits of redundancy will be lost. Also, failures of sensing and switching circuitry or software can result in failure even in the presence of redundancy.
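As a rough illustration of why a shared failure cause erodes the benefit of redundancy, the sketch below compares a single part, two independent parts in active parallel, and the same pair with a common-cause event treated as an additional series element. It assumes statistically independent part failures; the function names and the numerical values (a part reliability of 0.95 and a 0.98 probability that no common-cause event occurs) are hypothetical and chosen only for illustration. The same series treatment could be applied to imperfect sensing or switching circuitry.

```python
def parallel_reliability(part_reliability: float, n_parts: int) -> float:
    """Reliability of n independent parts in active parallel (system works if any part works)."""
    return 1.0 - (1.0 - part_reliability) ** n_parts


def with_common_cause(redundant_reliability: float, p_no_common_cause: float) -> float:
    """Model a common-cause event as a series element that defeats all redundant parts at once."""
    return redundant_reliability * p_no_common_cause


if __name__ == "__main__":
    r_part = 0.95   # assumed reliability of one part (illustrative value)
    p_no_cc = 0.98  # assumed probability that no common-cause event occurs (illustrative value)
    r_pair = parallel_reliability(r_part, 2)
    print(f"single part:            {r_part:.4f}")
    print(f"independent pair:       {r_pair:.4f}")
    print(f"pair with common cause: {with_common_cause(r_pair, p_no_cc):.4f}")
```

Under these assumed numbers, the common-cause term caps the achievable reliability at 0.98 no matter how many redundant parts are added, which is one way to see how redundancy can give a false sense of reliability.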
9.7.6 Prognostics
A product’s health is the extent of deviation or degradation from its expected normal (in terms of physical performance) operating condition (Vichare et al. 2004). Knowledge of a product’s health can be used to detect and isolate faults or failures (diagnostics) and to predict an impending failure based on current conditions (prognostics). Thus, by determining the advent of failure based on actual life-cycle conditions, procedures can be developed to mitigate and manage potential failures and maintain the product. Prognostics can be designed into a product by (1) installing built-in fuses and canary structures that will fail faster than the actual product when subjected to life-cycle conditions (Mishra and Pecht 2002); (2) sensing parameters that are precursors to failure, such as defects or performance degradation (Pecht et al. 2001); or (3) sensing the life-cycle environmental and operational loads that influence the system’s health and processing the measured data using physics-of-failure models to estimate remaining useful life (Mishra et al. 2002; Ramakrishnan and Pecht 2003).
9.8 QUALIFICATION
Qualification tests are conducted to identify and assess potential failures that could arise during the use of a product. Qualification tests should be performed during initial product development and also after any significant design or manufacturing changes to an existing product. In some cases, the target application, and therefore the use conditions, of the product may not be known. For example, a part or an assembly may be developed for sale to the open market for incorporation into many different types of products. In such cases, standard qualification tests are often employed. However, passing these tests does not mean that the product will be reliable in the actual targeted application. As a result, it is generally not sufficient to rely on qualification tests conducted on the parts (materials) of a product to determine or ensure the reliability of the final product in the targeted application. Most often, there is not sufficient time to test products for their complete targeted application life under actual operating conditions. Therefore, accelerated (qualification) tests are often employed. Accelerated testing is based on the concept that a product will exhibit the same failure mechanisms and modes in a short time under high-load conditions as it would exhibit in a longer time under actual life-cycle load conditions. The purpose is to decrease the total time and cost required to obtain reliability information for the product under study. Accelerated tests can be divided into two categories: qualitative tests and quantitative tests. Qualitative tests generally overstress the products to determine the load conditions that will cause overstress or early wearout failures. Such tests may target a single load condition, such as shock, temperature extremes, and electrical overstress, or some combination of these. 
The results of the tests include failure mode information, but qualitative tests are not generally appropriate to estimate time to failure in the application.
Quantitative tests target wearout failure mechanisms in which failures occur as a result of cumulative load conditions. These tests make it possible to extrapolate quantitatively from the accelerated environment to the usage environment with some reasonable degree of assurance. The simplest form of accelerated life testing is continuous-use acceleration. The objective of this approach is to compress the life into the shortest time possible. This approach assumes that the product is not used continuously and that, when the product is not used, there are no loads (stresses) on the product. For example, consider that most washing machines are used for 10 hours per week on average. If a washing machine were operated continuously, the acceleration factor* would be (24)(7)/10 = 16.8. Thus, if the warranty or design life of the product were 5 years, the product should be tested for 5/16.8 ≈ 0.30 years, or about 109 days. Continuous-use acceleration is not very effective with high-usage products or with products that have a long expected life. Under such circumstances, accelerated testing is conducted to measure the performance of the product at loads (stresses) that are more severe than would normally be encountered, in order to accelerate the damage accumulation rate in a reduced time period. The goal of such testing is to accelerate time-dependent failure mechanisms and the damage accumulation rate to reduce the time to failure. Based on the data from accelerated tests, the time to failure in the targeted use conditions can be extrapolated. Accelerated testing begins by identifying all the significant overstress and wearout failure mechanisms from the failure modes, mechanisms, and effects analysis (FMMEA). The load parameters that cause the failure mechanisms are selected as the acceleration parameters and are commonly called accelerated loads.
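The continuous-use arithmetic in the washing machine example can be sketched in a few lines. This is only an illustration under the stated assumption that the product sees no load when it is not in use; the function names and the 365-day year are choices made here, not part of the handbook.

```python
HOURS_PER_WEEK = 24 * 7


def continuous_use_acceleration_factor(use_hours_per_week: float) -> float:
    """Ratio of continuous operation (168 h/week) to the average field usage per week."""
    if not 0 < use_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("usage must be between 0 and 168 hours per week")
    return HOURS_PER_WEEK / use_hours_per_week


def required_test_days(design_life_years: float,
                       use_hours_per_week: float,
                       days_per_year: float = 365.0) -> float:
    """Continuous test duration that accumulates the same operating hours as the design life."""
    factor = continuous_use_acceleration_factor(use_hours_per_week)
    return design_life_years * days_per_year / factor


if __name__ == "__main__":
    print(continuous_use_acceleration_factor(10))  # washing machine used 10 h/week
    print(required_test_days(5, 10))               # 5-year design life
```

With 10 hours of use per week and a 5-year design life, this gives an acceleration factor of 16.8 and a continuous test of roughly 109 days. The factor approaches 1 as usage approaches 168 hours per week, which is why the approach offers little for high-usage products.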
Common accelerated loads include thermal loads, such as temperature, temperature cycling, and rates of temperature change; chemical loads, such as humidity, corrosives, acid, solvents, and salt; electrical loads, such as voltage or power; and mechanical loads, such as vibration, mechanical load cycles, strain cycles, and shock/impulses. Accelerated tests may require a combination of these loads. Interpretation of the results for combined loads requires a quantitative understanding of their relative interactions. Failure due to a particular mechanism can be induced by several acceleration parameters. For example, corrosion can be accelerated by both temperature and humidity, and creep can be accelerated by both mechanical stress and temperature. Furthermore, a single accelerated load can induce failure by several mechanisms. For example, temperature can accelerate wearout damage accumulation of many failure mechanisms, such as corrosion, electrochemical migration, and creep. Failure mechanisms that dominate under usual operating conditions may lose their dominance as the load is elevated. For example, high-power electronics can generate temperatures that drive off moisture. Conversely, failure mechanisms that are dormant under normal use conditions may contribute to device failure under accelerated conditions. Thus, accelerated tests require careful planning if they are to accelerate the actual usage environments and operating conditions without introducing extraneous failure mechanisms or nonrepresentative physical or material behavior. Once the failure mechanisms are identified, it is necessary to select the appropriate acceleration load; to determine the test procedures and the load levels; to determine the test method, such as constant load acceleration or step-load acceleration; to perform the tests; and to interpret the test data, which includes extrapolating the accelerated test results to normal operating conditions. The test results provide failure information to assess the product reliability, to improve the product design, and to plan warranties and support.
* The acceleration factor is defined as the ratio of the life of the product under normal use conditions to that under an accelerated condition.
9.9 MANUFACTURE AND ASSEMBLY
Manufacturing and assembly processes significantly impact quality and reliability. Improper manufacture and assembly can introduce defects, flaws, and residual stresses that act as potential failure sites or stress enhancers (or raisers) later in the life of the product. The effect of manufacturing variability on time to failure is depicted in Figure 9.3. A shift in the mean or an increase in the standard deviation of key parameters during manufacturing can result in early failure due to a decrease in the strength of the product. Generally, qualification procedures are required to ensure that the nominal product is reliable. In some cases, lot-to-lot screening is required to ensure that the variability of assembly and manufacturing-related parameters is within specified tolerances. Here, screening ensures the quality of the product by precipitating latent defects before they reach the final customer.
9.9.1 Manufacturability
The design team must understand material limits and manufacturing process capabilities to construct products that promote producibility and reduce the occurrence of defects. The team must have clear definitions of the threshold for acceptable quality and of what constitutes nonconformance. Products with quality nonconformance should not be accepted.
Figure 9.3 Influence of quality on failure probability.
A defect is any outcome of a process that impairs or has the potential to impair the performance of the product at any time. A defect may arise during a single process or may be the result of a sequence of processes. The yield of a process is the fraction of products that are acceptable for use in a subsequent process sequence or product life cycle. The cumulative yield of the process is approximately determined by multiplying the individual yields of each of the individual process steps. The source of defects is not always apparent because defects resulting from a process can go undetected until the product reaches some downstream point in the process sequence. It is often possible to simplify processes to reduce the probability of workmanship defects. As processes become more sophisticated, however, process monitoring and control are necessary to ensure a defect-free product. The bounds that specify whether the process is within tolerance limits, often referred to as the process window, are defined in terms of the independent variables to be controlled within the process and the effects of the process on the product. The goal is to understand the effect of each process variable on each product parameter to formulate control limits for the process—that is, the condition in which the defect rate begins to have a potential for causing failure. In defining the process window, the upper and lower limits of each process variable beyond which defects can be produced must be determined. Manufacturing processes must be contained in the process window by defect testing, analysis of the causes of defects, and elimination of defects by process control, such as by closed-loop corrective action systems. The establishment of an effective feedback path to report process-related defect data is critical. Once this is accomplished and the process window is determined, the process window itself becomes a feedback system for the process operator. 
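The cumulative-yield relation mentioned above, the product of the individual step yields, can be sketched as follows. The step yields in the example are made-up values, the function name is hypothetical, and the product form assumes that defects introduced at different steps are independent, which is why the text describes the result as approximate.

```python
from math import prod
from typing import Iterable


def cumulative_yield(step_yields: Iterable[float]) -> float:
    """Approximate cumulative yield as the product of the individual process-step yields."""
    yields = list(step_yields)
    for y in yields:
        if not 0.0 <= y <= 1.0:
            raise ValueError(f"yield {y} is not a fraction between 0 and 1")
    return prod(yields)


if __name__ == "__main__":
    # Hypothetical four-step process; each value is the fraction of acceptable product per step.
    steps = [0.99, 0.98, 0.995, 0.97]
    print(f"cumulative yield: {cumulative_yield(steps):.4f}")
```

Even when every step in this example yields 97% or better, the cumulative yield falls to about 94%, which illustrates why long process sequences call for tight control of each step.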
Several process parameters may interact to produce a different defect than would have resulted from an individual parameter acting independently. This complex case may require that the interaction of various process parameters be evaluated by a design of experiments. In some cases, a defect cannot be detected until late in the process sequence. Thus, a defect can cause rejection, rework, or failure of the product after considerable value has been added to it. This cost can reduce return on investment by adding to hidden factory costs. All critical processes require special attention for defect elimination by process control.
9.9.2 Process Verification Testing
Process verification testing is often called screening. Screening involves 100% auditing of all manufactured products to detect or precipitate defects. The aim of this step is to preempt potential quality problems before they reach the field. Thus, screening can aid in reducing warranty returns and increasing customer goodwill. In principle, screening should not be required if parts (materials) are selected properly and if processes are well controlled. Some products exhibit a multimodal probability density function for failures, with peaks during the early period of their service life due to the use of faulty materials, poorly controlled manufacturing and assembly technologies, or mishandling. This type of early-life failure is often called infant mortality. Properly applied screening techniques can successfully detect or precipitate these failures, eliminating or reducing their occurrence in field use. Screening should only be considered for use during the early stages of production, if at all, and only when products are expected to exhibit infant mortality field failures. Screening will be ineffective and costly if there is only one main peak in the failure probability density function. Further, failures arising due to unanticipated events such as lightning or earthquakes may be impossible to screen cost effectively.
Because screening is conducted on a 100% basis, it is important to develop screens that do not harm good products. The best screens, therefore, are nondestructive evaluation techniques, such as microscopic visual exams, x-rays, acoustic scans, nuclear magnetic resonance, electron paramagnetic resonance, and so on. Stress screening involves the application of loads, possibly above the rated operational limits. If stress screens are unavoidable, overstress tests are preferred over accelerated wearout tests because the latter are more likely to consume some useful life of good products. If damage to good products is unavoidable during stress screening, then quantitative estimates of the screening damage, based on failure mechanism models, must be developed to allow the design team to account for this loss of usable life. The appropriate stress levels for screening must be tailored to the specific product. As in qualification testing, quantitative models of failure mechanisms can aid in determining screen parameters. A stress screen need not necessarily simulate the field environment or even utilize the same failure mechanism as the one likely to be triggered by this defect in field conditions. Instead, a screen should exploit the most convenient and effective failure mechanism to stimulate the defects that can show up in the field as infant mortality.
This requires an awareness of the possible defects that may occur in the product and familiarity with the associated failure mechanisms. Any commitment to stress screening must include the necessary funding and staff to determine the root cause and appropriate corrective actions for all failed units. The type of stress screening chosen should be derived from the design, manufacturing, and quality teams. Although a stress screen may be necessary during the early stages of production, stress screening carries substantial penalties in capital, operating expense, and cycle time, and its benefits diminish as a product approaches maturity. If many products fail in a properly designed screen test, the design is probably faulty or a revision of the manufacturing process may be required. If the number of failures in a screen is small, the processes are likely to be within tolerances and the observed faults may be beyond the resources of the design and production process.
9.10 CLOSED-LOOP, ROOT-CAUSE MONITORING
Product reliability needs to be ensured using a closed-loop process that provides feedback to design and manufacturing in each stage of the product life cycle. Data obtained from manufacturing, assembly, storage, shipping, periodic maintenance, use, and health monitoring methods can be used to aid future design plans and tests and to perform timely maintenance for sustaining the product and preventing catastrophic failures. Figure 9.4 depicts the closed-loop process for managing the reliability of a product over the complete life cycle.
Figure 9.4 Reliability management using a closed-loop process. [Blocks in the flow chart: design product, design qualification, establish manufacturing processes, manufacture, product qualification, quality assurance tests and screens, use and maintenance, analysis of failures.]
The objective of closed-loop monitoring is to analyze all failures throughout the product life cycle to identify the root cause of failure. The root cause is the most basic causal factor or factors that, if corrected or removed, will prevent recurrence of the situation. The purpose of determining the root cause is to fix the problem at its most basic source so that it does not occur again, even in other products, as opposed to merely fixing a failure symptom. Correct identification of the root cause during design, manufacturing, and use, followed by appropriate corrective actions, results in fewer field returns, major cost savings, and customer goodwill. The lessons learned from each failure analysis need to be documented, and appropriate actions need to be taken to update the design, manufacturing process, and maintenance actions. After products are developed, resources must be applied for supply chain management, obsolescence assessment, manufacturing and assembly feedback, manufacturer warranties management, and field failure and root-cause analysis. The risks associated with the product fall into two categories:
• managed risks: risks that the product development team chooses to manage proactively by creating a management plan and performing a prescribed monitoring regime of the field performance, manufacturer, and manufacturability; and
• unmanaged risks: risks that the product development team chooses not to manage proactively.
If risk management is considered necessary, a plan should be prepared. The plan should contain details about how the product is monitored (data collection) and how the results of the monitoring feed back into various product development processes. The feasibility, effort, and cost involved in management processes must be considered.
9.11 SUMMARY
The development of a reliable product is not a matter of chance; rather, it is a rational consequence of conscious, systematic, and rigorous efforts conducted throughout the entire life cycle of the product. Meeting the targeted product reliability can only be assured through robust product designs, capable processes that are known to be within tolerances, and qualified parts (materials) from vendors whose processes are also capable and within tolerances. Quantitative understanding and modeling of all relevant failure mechanisms can guide design, manufacturing, and the planning of test specifications. When utilized early in the concept stage of a product’s development, reliability analysis serves as an aid to determine feasibility and risk. In the design stage of product development, reliability analysis involves the selection of parts (materials), design trade-offs, design tolerances, manufacturing processes and tolerances, assembly techniques, shipping and handling methods, and maintenance and maintainability guidelines. Engineering concepts such as strength, fatigue, fracture, creep, tolerances, corrosion, and aging play a role in these design analyses. Physics-of-failure concepts, coupled with mechanistic and probabilistic techniques, are used to assess the potential problems and trade-offs and to take corrective actions.
REFERENCES
IEEE Reliability Society. 1998. IEEE Std. 1332-1998. IEEE standard reliability program for the development and production of electronic systems and equipment.
Mishra, S., and M. Pecht. 2002. In-situ sensors for product reliability monitoring. Proceedings of SPIE 4755:10–19.
Mishra, S., M. Pecht, T. Smith, I. McNee, and R. Harris. 2002. Remaining life prediction of electronic products using life consumption monitoring approach. Proceedings of the European Microelectronics Packaging and Interconnection Symposium, Cracow, Poland, 136–142, 16–18 June 2002.
Pecht, M., M. Dube, M. Natishan, and I. Knowles. 2001. An evaluation of built-in test. IEEE Transactions on Aerospace and Electronic Systems 37 (1): 266–272.
Ramakrishnan, A., and M. Pecht. 2003. A life consumption monitoring methodology for electronic systems. IEEE Transactions on Components and Packaging Technologies 26 (3): 625–634.
Vichare, N., P. Rodgers, V. Eveloy, and M. Pecht. 2004. In situ temperature measurement of a notebook computer: A case study in health and usage monitoring of electronics. IEEE Transactions on Device and Materials Reliability 4 (4): 658–663.
HOMEWORK PROBLEMS
Problem 9.1 Production lots and vendor sources for parts that comprise the design are subject to change, and variability in parts characteristics is likely to occur during the fielded life of a product. How does this impact design decisions that impact reliability?
Problem 9.2 Discuss the relationship between manufacturing process control and stress margins. How does this affect qualification? What are the implications for product reliability?
Problem 9.3 List five characteristic life-cycle loads for a computer keyboard. Describe how the product design could address these in order to ensure reliability.
Problem 9.4 Explain how the globalization of the supply chain could affect the parts selection and management process for a product used for critical military applications.
Problem 9.5 Explain the distinction between FMEA and FMMEA and how this is significant for design for reliability. For example, how would an FMMEA affect product qualification testing?
Problem 9.6 Explain how the intended application for a product would affect the decision on whether to incorporate redundancy into its design. Include in your answer a discussion of the relevant constraints related to product definition.
Problem 9.7 Discuss the concept of design for manufacturability and how it can lead to improvement of product reliability. Provide a specific example.
Problem 9.8 What are the advantages and disadvantages of virtual qualification as compared to accelerated testing? How can these be combined in a qualification program to reduce the overall product design cycle time?
CHAPTER 10
System Reliability Modeling Michael Pecht
CONTENTS
10.1 Introduction
10.2 Reliability Block Diagram
10.3 Series System
10.4 Products with Redundancy
10.4.1 Active Redundancy
10.4.2 Standby Systems
10.4.3 (k, n) Systems
10.4.4 Limits of Redundancy
10.4.5 Complex Systems
10.4.5.1 Complete Enumeration Method
10.4.5.2 Conditional Probability Method
10.4.5.3 Cut Set Methodology
10.5 Fault-Tree Analysis
10.6 Steps of Fault-Tree Analysis
References
Homework Problems
10.1 INTRODUCTION
This chapter describes how to model product reliability based on the parts and subsystems that make up the product. Reliability block diagrams are described as a means to represent the logical system architecture and develop system reliability models. Fault trees are then presented.
10.2 RELIABILITY BLOCK DIAGRAM
A system reliability block diagram presents the logical relationship of the system components. Series systems are described in Section 10.3, and redundant systems (including standby systems, k-out-of-n systems, and complex systems) are described in Section 10.4. All of these system configurations are analyzed using principles of probability theory.

10.3 SERIES SYSTEM
In a series system, all subsystems must operate if the system is to function. This implies that the failure of any subsystem will cause the system to fail. In terms of reliability, the system will be reliable if all the units are reliable. The units need not be physically connected in series for the system to be called a series system. The reliability block diagram of a series system is represented in Figure 10.1. The reliability of each block is represented by $R_i(t)$ and the times to failure are represented by TTF(i). The system reliability can be derived from the basic principles of probability theory, assuming that the units fail independently. The reliability and time to failure for the system are given in Equation 10.1. The reliability of the system decreases as the number of components in series increases (see Figure 10.2).

$$R_s(t) = R_1(t)\,R_2(t) \cdots R_n(t) = \prod_{i=1}^{n} R_i(t) \tag{10.1}$$

System TTF: min(TTF(i)), i = 1, 2, …, N

Assume that the time-to-failure distribution for each component of a system is exponential, with a constant failure rate $\lambda_i$. The component reliability is

$$R_i(t) = e^{-\lambda_i t} \tag{10.2}$$

Then, the system reliability is given by

$$R_s(t) = \prod_{i=1}^{n} R_i(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^{n} \lambda_i\right) t} \tag{10.3}$$
Figure 10.1 Series system representation. (Blocks 1, 2, 3, ..., N connected in series; block i has reliability $R_i(t)$ and time to failure TTF(i).)
Figure 10.2 Effect of part reliability and number of parts on system reliability in a series configuration. (Axes: part reliability, 96% to 100%, versus system reliability, 0% to 100%; curves for N = 1, 10, 50, and 100.)
The constant system failure rate is

$$\lambda_s = \sum_{i=1}^{n} \lambda_i \tag{10.4}$$

and the system mean time between failures is

$$\mathrm{MTBF} = \frac{1}{\lambda_s} = \frac{1}{\sum_{i=1}^{n} \lambda_i} \tag{10.5}$$
The system hazard rate is constant if all the components of the system are in series and have constant hazard rates. The assumptions of a constant hazard rate and a series system make the mathematics simple, but they often do not hold in practice.

EXAMPLE 10.1
An electronic system consists of two parts that operate in series. Assuming that failures are governed by a constant failure rate $\lambda_i$ for the ith part, determine (1) the system failure rate, (2) the system reliability for a 1000-hour mission, and (3) the system mean time to failure. The failure rates of the parts are given by:
$\lambda_1 = 6.5$ failures/$10^6$ hours
$\lambda_2 = 26.0$ failures/$10^6$ hours
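Solution: For a constant failure rate, the reliability $R_i$ for the ith part has the form:

$$R_i(t) = e^{-\int_0^t \lambda_i(\tau)\,d\tau} = e^{-\lambda_i t}$$

The reliability $R_s$ of the series system is

$$R_s(t) = e^{-\sum_{i=1}^{N} \int_0^t \lambda_i(\tau)\,d\tau} = e^{-\lambda_s t}, \qquad \lambda_s = \sum_{i=1}^{N} \lambda_i$$

for a series system with parts assumed to have a constant failure rate. Substituting the given values:

$$\lambda_s = 6.5 + 26.0 = 32.5 \text{ failures}/10^6 \text{ hours}$$

The reliability for a 1000-hour mission is thus:

$$R_s(1000) = e^{-(32.5 \times 10^{-6}) \times 1000} = 0.968$$

The mean time to failure (MTTF) for the system is:

$$\mathrm{MTTF} = \int_0^\infty R_s(t)\,dt = \int_0^\infty e^{-\lambda_s t}\,dt = \frac{1}{\lambda_s} \approx 30{,}770 \text{ hours}$$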
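The arithmetic in Example 10.1 can be checked with a short Python sketch. The function names below are illustrative and do not come from the handbook; the sketch assumes independent parts with constant failure rates, as in Equations 10.3 through 10.5.

```python
import math

def series_failure_rate(rates):
    """Equation 10.4: the system failure rate is the sum of the part rates."""
    return sum(rates)

def series_reliability(rates, t):
    """Equation 10.3: R_s(t) = exp(-(sum of rates) * t)."""
    return math.exp(-series_failure_rate(rates) * t)

def series_mttf(rates):
    """Equation 10.5: mean time to failure is 1 / lambda_s."""
    return 1.0 / series_failure_rate(rates)

# Failure rates from Example 10.1, converted to failures per hour.
rates = [6.5e-6, 26.0e-6]

print(f"lambda_s    = {series_failure_rate(rates):.3e} per hour")
print(f"R_s(1000 h) = {series_reliability(rates, 1000.0):.3f}")
print(f"MTTF        = {series_mttf(rates):.0f} hours")
```

The same product rule explains the trend in Figure 10.2: with 100 parts in series, each 99% reliable, the system reliability is 0.99 raised to the power 100, or roughly 0.37.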
10.4 PRODUCTS WITH REDUNDANCY
Redundancy exists when one or more of the parts of a system can fail and the system can still function with the parts that remain operational. Two common types of redundancy are active and standby. In active redundancy, all the parts are energized and operational during the operation of the system, so the redundant parts consume life at the same rate as the individual components. In standby redundancy, some parts do not contribute to the operation of the system and are switched on only when there are failures in the active parts. Parts in standby should ideally last longer than parts in active redundancy. There are three conceptual types of standby redundancy: cold, warm, and hot. In cold standby, the secondary part is shut down until needed. This lowers the number of hours that the part is active and consuming useful life; however, the transient stresses on the part during switching may be high. This transient stress can cause faster consumption
of life during switching. In warm standby, the secondary part is usually active, but is idling or unloaded. In hot standby, the secondary part forms an active parallel system, and its life is consumed at the same rate as that of the active parts. In a hot standby system, however, the standby part still needs to be switched into action when one of the parts in the main system fails; this is what distinguishes hot standby from active redundancy.

10.4.1 Active Redundancy
An active redundant system is a standard “parallel” system, which fails only when all components have failed. Sometimes, the parallel system is called a 1-out-of-n or (1, n) system, which implies that only one out of n subsystems needs to operate for the system to be operational. The reliability block diagram of a parallel system is given in Figure 10.3. The units need not be physically connected in parallel for the system to be called a parallel system. The system fails if all of the subsystems or components fail by time t; equivalently, the system survives the operating time t if at least one of the units survives up to time t. Then, the system reliability can be expressed as

$$R_s(t) = 1 - Q_s(t) \tag{10.6}$$

where $Q_s(t)$ is the probability of system failure, or

$$Q_s(t) = [1 - R_1(t)]\,[1 - R_2(t)] \cdots [1 - R_n(t)] = \prod_{i=1}^{n} [1 - R_i(t)] \tag{10.7}$$

The system reliability for a mission time, t, is

$$R_s(t) = 1 - \prod_{i=1}^{n} [1 - R_i(t)] \tag{10.8}$$

System TTF: max(TTF(i)), i = 1, 2, …, N

Figure 10.3 Active redundant system. (Blocks with reliabilities $R_1(t)$, $R_2(t)$, ..., $R_n(t)$ connected in parallel.)
Figure 10.4 Effect of part reliability and number of parts on system reliability in an active redundant system. (Axes: unit reliability, 50% to 100%, versus system reliability, 40% to 100%; curves for N = 1, 2, and 5.)
In general, each unit can have a different failure distribution. The system hazard rate is given by

$$\lambda_S(t) = \frac{f_S(t)}{R_S(t)} \tag{10.9}$$

where $f_S(t)$ is the system time-to-failure probability density function (pdf). The mean life, m, of the system is determined by

$$m = \int_0^\infty R_S(t)\,dt = \int_0^\infty \left(1 - \prod_{i=1}^{n} [1 - R_i(t)]\right) dt \tag{10.10}$$

For example, if the system consists of two units (n = 2) with exponential failure distributions and constant failure rates $\lambda_1$ and $\lambda_2$, then the system mean life is given by Equation 10.11. Note that the system mean life is not equal to the reciprocal of the sum of the component failure rates, and the system hazard rate is not constant over time, although the individual unit failure rates are constant.

$$m = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2} \tag{10.11}$$
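Equation 10.11 follows directly from Equation 10.10. As a worked step, under the same assumption of two independent units with exponential times to failure, expanding the product and integrating term by term gives:

```latex
\begin{aligned}
m &= \int_0^\infty \left[ 1 - \left(1 - e^{-\lambda_1 t}\right)\left(1 - e^{-\lambda_2 t}\right) \right] dt \\
  &= \int_0^\infty \left[ e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2) t} \right] dt \\
  &= \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2}
\end{aligned}
```

For two identical units with rate $\lambda$, this reduces to $m = 3/(2\lambda)$, so duplicating a unit raises the mean life by 50% rather than doubling it.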
EXAMPLE 10.2
Consider an electronics system consisting of two parts in active parallel redundancy with constant failure rates of $\lambda_1 = 6.5$ failures/$10^6$ hours and $\lambda_2 = 26.0$ failures/$10^6$ hours. Assume that failures are governed by a constant failure rate $\lambda_i$ for the ith part. Determine (1) the system reliability for a 1000-hour mission, (2) the system MTTF, (3) the failure probability density function, and (4) the system failure rate.

Solution: For a constant failure rate, the reliability $R_i$ of the ith part has the form:

$$R_i(t) = e^{-\int_0^t \lambda_i(\tau)\,d\tau} = e^{-\lambda_i t}$$

For a parallel system:

$$R_p(t) = 1 - \prod_{i=1}^{2} \left(1 - e^{-\lambda_i t}\right) = e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2) t}$$

The failure probability density function is

$$f_p(t) = -\frac{d[R_p(t)]}{dt} = \lambda_1 e^{-\lambda_1 t} + \lambda_2 e^{-\lambda_2 t} - (\lambda_1 + \lambda_2)\, e^{-(\lambda_1 + \lambda_2) t}$$

The system hazard rate for the parallel system is given by

$$\lambda_p(t) = \frac{f_p(t)}{R_p(t)}$$

Substituting numbers in the equation for constant failure rate, we get:

$$R_p(1000) = 0.99352 + 0.97434 - 0.96802 \approx 0.99983$$

The mean time to failure for the parallel system is

$$\mathrm{MTTF}_p = \int_0^\infty R_p(t)\,dt = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2} \approx 161{,}540 \text{ hours}$$

The failure probability density function for the parallel system is

$$f_p(t) = 6.5 \times 10^{-6}\, e^{-6.5 \times 10^{-6} t} + 26.0 \times 10^{-6}\, e^{-26.0 \times 10^{-6} t} - 32.5 \times 10^{-6}\, e^{-32.5 \times 10^{-6} t}$$

The system failure rate for the parallel system can be obtained by substituting these results into the hazard rate expression given above.
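The results of Example 10.2 can be reproduced numerically. The following is a minimal sketch with illustrative function names (not from the handbook), assuming two independent units with constant failure rates. It also evaluates the hazard rate at a few times to show that it is not constant for the parallel system.

```python
import math

def parallel_reliability(rates, t):
    """Equation 10.8 for exponential units: 1 - prod(1 - exp(-rate * t))."""
    q = 1.0
    for lam in rates:
        q *= 1.0 - math.exp(-lam * t)
    return 1.0 - q

def parallel_pdf_two_units(lam1, lam2, t):
    """Failure density f_p(t) = -dR_p/dt for two exponential units."""
    return (lam1 * math.exp(-lam1 * t)
            + lam2 * math.exp(-lam2 * t)
            - (lam1 + lam2) * math.exp(-(lam1 + lam2) * t))

def parallel_hazard_two_units(lam1, lam2, t):
    """Equation 10.9: hazard rate = f_p(t) / R_p(t)."""
    return (parallel_pdf_two_units(lam1, lam2, t)
            / parallel_reliability([lam1, lam2], t))

def parallel_mttf_two_units(lam1, lam2):
    """Equation 10.11: 1/lam1 + 1/lam2 - 1/(lam1 + lam2)."""
    return 1.0 / lam1 + 1.0 / lam2 - 1.0 / (lam1 + lam2)

# Failure rates from Example 10.2, converted to failures per hour.
lam1, lam2 = 6.5e-6, 26.0e-6

print(f"R_p(1000 h) = {parallel_reliability([lam1, lam2], 1000.0):.5f}")
print(f"MTTF        = {parallel_mttf_two_units(lam1, lam2):.0f} hours")
for t in (1_000.0, 10_000.0, 100_000.0):
    h = parallel_hazard_two_units(lam1, lam2, t)
    print(f"hazard at {t:>9.0f} h = {h:.3e} per hour")
```

The hazard rate starts near zero, because both units must fail before the system fails, and changes with time even though each unit's own failure rate is constant.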
10.4.2 Standby Systems
A standby system consists of an active unit or subsystem and one or more inactive units, which become active in the event of a failure of the functioning unit. The failures of active units are signaled by a sensing subsystem, and the standby unit is brought into action by a switching subsystem. The simplest standby configuration is the two-unit system shown in Figure 10.5. In the general case, there will be N units with (N – 1) of them in standby.

Figure 10.5 Standby system. (Active unit 1 and standby unit 2, with a sensing element, SE, and a switching subsystem.)

When the active and the standby units have equal constant failure rates, $\lambda$, and the switching and sensing units are perfect ($\lambda_{sw} = 0$), the reliability function for the two-unit system is given by

$$R(t) = e^{-\lambda t}\,(1 + \lambda t) \tag{10.12}$$
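To see what Equation 10.12 implies, the sketch below compares a single unit, two identical units in active parallel (Equation 10.8), and a two-unit standby system with perfect sensing and switching. The failure rate and mission time are hypothetical values chosen only for illustration, and the function names are not from the handbook.

```python
import math

def standby_reliability(lam, t):
    """Equation 10.12: one active unit plus one standby, perfect switching."""
    return math.exp(-lam * t) * (1.0 + lam * t)

def active_parallel_reliability(lam, t):
    """Equation 10.8 with two identical exponential units."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** 2

lam = 1.0e-4   # hypothetical failure rate, failures per hour
t = 5_000.0    # hypothetical mission time, hours

single = math.exp(-lam * t)
active = active_parallel_reliability(lam, t)
standby = standby_reliability(lam, t)

print(f"single unit     : {single:.4f}")
print(f"active parallel : {active:.4f}")
print(f"ideal standby   : {standby:.4f}")
```

Under these idealized assumptions the standby arrangement is at least as reliable as active parallel, because the spare accumulates no operating time until it is needed. Equation 10.12 is the Poisson probability of at most one failure in time t.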
10.4.3 (k, n) Systems
A system consisting of n components is called a k-out-of-n or (k, n) system when the system operates only if at least k components are in the operating state. The reliability block diagram for the k-out-of-n system is the same as that for the parallel system, but at least k items need to be operating for the system to be functional (Figure 10.6). The reliability function for the system is difficult to express in closed form when the components have different failure distributions. Assuming that all the components fail independently with the same failure distribution, Q(t), the system reliability can be determined using the binomial distribution; that is,

$$R_S(t) = \sum_{i=k}^{n} \binom{n}{i} [1 - Q(t)]^{i}\, [Q(t)]^{n-i} \tag{10.13}$$

Figure 10.6 k-out-of-n system. (Blocks $R_1(t)$, $R_2(t)$, ..., $R_n(t)$ in parallel, labeled (k, n).)

The probability of system failure is then

$$Q_S(t) = 1 - R_S(t) = 1 - \sum_{i=k}^{n} \binom{n}{i} [1 - Q(t)]^{i}\, [Q(t)]^{n-i} = \sum_{i=0}^{k-1} \binom{n}{i} [1 - Q(t)]^{i}\, [Q(t)]^{n-i} \tag{10.14}$$

The probability density function can be determined from

$$f_S(t) = \frac{dQ_S(t)}{dt} = \frac{n!}{(n-k)!\,(k-1)!}\, [1 - Q(t)]^{k-1}\, [Q(t)]^{n-k}\, f(t) \tag{10.15}$$

where $f(t) = dQ(t)/dt$ is the time-to-failure pdf of a single component, and the hazard rate is given by

$$\lambda_S(t) = \frac{f_S(t)}{R_S(t)} \tag{10.16}$$
EXAMPLE 10.3
Compute the reliability of an active redundant configuration with two out of three units (all with identical reliability R) required for success.
Solution: In this case, n = 3 and k = 2. The reliability of a k-out-of-n active redundant system is obtained from Equation 10.13, with Q = 1 − R:

$$R_{2\text{ out of }3} = \frac{3!}{(1!)(2!)}\, R^2 Q^1 + \frac{3!}{(0!)(3!)}\, R^3 Q^0 = 3R^2(1 - R) + R^3$$

The first term is the probability that two units will succeed and one will fail; the second term is the probability that all three units will succeed.
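Equation 10.13 is straightforward to evaluate for any k and n. The sketch below uses an illustrative function name and a hypothetical unit reliability of 0.9, assumes identical and independent units, and checks the general sum against the closed form of Example 10.3.

```python
from math import comb

def k_out_of_n_reliability(k, n, r):
    """Equation 10.13 with identical units of reliability r = 1 - Q."""
    q = 1.0 - r
    return sum(comb(n, i) * r**i * q**(n - i) for i in range(k, n + 1))

r = 0.9  # hypothetical unit reliability

general = k_out_of_n_reliability(2, 3, r)   # 2-out-of-3, as in Example 10.3
closed_form = 3 * r**2 * (1 - r) + r**3     # closed form from Example 10.3

print(f"2-out-of-3, general sum : {general:.4f}")
print(f"2-out-of-3, closed form : {closed_form:.4f}")
print(f"1-out-of-3 (parallel)   : {k_out_of_n_reliability(1, 3, r):.4f}")
print(f"3-out-of-3 (series)     : {k_out_of_n_reliability(3, 3, r):.4f}")
```

Setting k = 1 recovers the active parallel system of Equation 10.8, and k = n recovers the series system of Equation 10.1.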
10.4.4 Limits of Redundancy
It is often difficult to realize the benefits of redundancy because of additional operational and design issues. Three such issues are common mode failures, load sharing, and switching and standby failures.

Common mode failures are caused by phenomena that create dependencies between two or more redundant parts and cause them to fail simultaneously. Common mode failures have many causes (e.g., common electric connections, shared environmental stresses, and common maintenance problems). In system reliability analysis, common mode failures have the same effect as putting an additional part in series with the parallel redundant configuration.

Load sharing failures occur when the failure of one part increases the stress level of the other part. This increased stress level can affect the life of the active part. For redundant engines, motors, pumps, structures, and many other systems and devices in an active parallel setup, the failure of one part may increase the load on the other parts and decrease their times to failure (or increase their hazard rates).

Several common assumptions are generally made regarding the switching and sensing of a standby system. Regarding switching, we assume that switching is in one direction only and that switching devices respond only when directed to switch by the monitor and do not fail if not energized. Regarding standby, the general assumption is that standby nonoperating units cannot fail if not energized. When any of these idealizations is not met, switching and standby failures occur. Monitor or sensing failure includes both dynamic (failure to switch when the active path fails) and static (switching when not required) failures.

10.4.5 Complex Systems
If the system architecture cannot be decomposed into some combination of series-parallel structures, it is deemed a complex system. There are three methods for reliability analysis of a complex system. Using Figure 10.7 as an example, those three methods are illustrated in the following subsections.

Figure 10.7 A complex system. (Block diagram with units A, B, C, D, and E.)

10.4.5.1 Complete Enumeration Method
The complete enumeration method is based on a list of all possible combinations of the unit failures. Table 10.1 contains all possible states of the system given in Figure 10.7.
The symbol O stands for “system in operating state” and F stands for “system in failed state.” Uppercase letters denote a unit in an operating state and lowercase letters denote a unit in a failed state. Each combination representing system status can be written as a product of the probabilities of units being in a given state. For example, combination 2 can be written as $(1 - R_A) R_B R_C R_D R_E$, where $(1 - R_A)$ denotes the probability of failure of unit A by time t.

Table 10.1 Complete Enumeration Example
System Description               System Condition   System Status
All components operable          ABCDE              O
One unit in failed state         aBCDE              O
                                 AbCDE              O
                                 ABcDE              O
                                 ABCdE              O
                                 ABCDe              O
Two units in failed state        abCDE              F
                                 aBcDE              O
                                 aBCdE              O
                                 aBCDe              O
                                 AbcDE              F
                                 AbCdE              O
                                 AbCDe              O
                                 ABcdE              O
                                 ABcDe              O
                                 ABCde              O
Three units in failed state      ABcde              F
                                 AbCde              O
                                 AbcDe              F
                                 AbcdE              F
                                 aBCde              O
                                 aBcDe              O
                                 aBcdE              O
                                 abCDe              F
                                 abCdE              F
                                 abcDE              F
Four units in failed state       Abcde              F
                                 aBcde              F
                                 abCde              F
                                 abcDe              F
                                 abcdE              F
All five units in failed state   abcde              F

The system reliability can be written as the sum of all the combinations for which the system is in operating state, O; that is,

$$
\begin{aligned}
R_s ={}& R_A R_B R_C R_D R_E + (1 - R_A) R_B R_C R_D R_E + R_A (1 - R_B) R_C R_D R_E \\
&+ R_A R_B (1 - R_C) R_D R_E + R_A R_B R_C (1 - R_D) R_E + R_A R_B R_C R_D (1 - R_E) \\
&+ (1 - R_A) R_B (1 - R_C) R_D R_E + (1 - R_A) R_B R_C (1 - R_D) R_E + (1 - R_A) R_B R_C R_D (1 - R_E) \\
&+ \cdots + (1 - R_A) R_B (1 - R_C)(1 - R_D) R_E
\end{aligned}
\tag{10.17}
$$
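The enumeration behind Table 10.1 and Equation 10.17 can be automated. The sketch below is illustrative only: the structure function `system_works` is an assumption inferred from the operating and failed states in Table 10.1 and from the minimal cut sets listed later in the chapter (the system operates through A and C, B and C, B and D, or B and E), since the figure itself is not reproduced here. The unit reliabilities are hypothetical. The result is compared with the simplified closed form that the text gives as Equation 10.18.

```python
from itertools import product

UNITS = "ABCDE"

def system_works(up):
    """Assumed structure function for Figure 10.7 (inferred, not from the figure)."""
    return ((up["C"] and (up["A"] or up["B"]))
            or (up["B"] and (up["D"] or up["E"])))

def enumerate_reliability(rel):
    """Sum the probabilities of all operating states (Equation 10.17)."""
    total = 0.0
    operating_states = 0
    for states in product([True, False], repeat=len(UNITS)):
        up = dict(zip(UNITS, states))
        if not system_works(up):
            continue
        operating_states += 1
        prob = 1.0
        for unit in UNITS:
            prob *= rel[unit] if up[unit] else 1.0 - rel[unit]
        total += prob
    return total, operating_states

def closed_form(rel):
    """Equation 10.18."""
    a, b, c, d, e = (rel[u] for u in UNITS)
    return (b*c*d*e - a*b*c - b*c*d - b*c*e - b*d*e
            + a*c + b*c + b*d + b*e)

rel = {u: 0.9 for u in UNITS}  # hypothetical unit reliabilities
r_enum, n_ok = enumerate_reliability(rel)

print(f"operating states : {n_ok} of {2 ** len(UNITS)}")
print(f"enumeration      : {r_enum:.6f}")
print(f"Equation 10.18   : {closed_form(rel):.6f}")
```

Enumeration grows as 2 to the power n, so it is practical only for small systems, which is one reason the text goes on to present the conditional probability and cut set methods.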
After simplification, the system reliability is given by

$$R_s = R_B R_C R_D R_E - R_A R_B R_C - R_B R_C R_D - R_B R_C R_E - R_B R_D R_E + R_A R_C + R_B R_C + R_B R_D + R_B R_E \tag{10.18}$$

10.4.5.2 Conditional Probability Method
The conditional probability method is based on the law of total probability, which allows system decomposition by a selected unit and its state at time t. For example, system reliability is equal to the reliability of the system, given that unit A is in an operating state at time t (denoted by $R_S \mid A_G$), times the reliability of unit A, plus the reliability of the system, given that unit A is in a failed state at time t (denoted by $R_S \mid A_B$), times the unreliability of unit A, or

$$R_s = (R_S \mid A_G)\, R_A + (R_S \mid A_B)\, Q_A \tag{10.19}$$
This decomposition process continues until each term is written in terms of the reliability and unreliability of all the units. As an example of the application of this methodology, consider the system given in Figure 10.7 and decompose the system using unit C. Then the system reliability can be written

$$R_s = (R_S \mid C_G)\, R_C + (R_S \mid C_B)\, Q_C \tag{10.20}$$

If unit C is in an operating state at time t, the system reduces to the configuration shown in Figure 10.8. Therefore, the system reliability, given that unit C is in an operating state at time t, is equal to that of units A and B in parallel, or

$$R_S \mid C_G = 1 - (1 - R_A)(1 - R_B) \tag{10.21}$$

Figure 10.8 System reduction when unit C is operating. (The system reduces to units A and B in parallel.)

If unit C is in a failed state at time t, the system reduces to the configuration shown in Figure 10.9. Then the system reliability, given that unit C is in a failed state, is given by

$$R_S \mid C_B = R_B\, [1 - (1 - R_D)(1 - R_E)] \tag{10.22}$$

Figure 10.9 System reduction when unit C fails. (The system reduces to unit B in series with the parallel pair D and E.)

The system reliability is obtained by substituting Equation 10.21 and Equation 10.22 into Equation 10.20:

$$R_S = (R_S \mid C_G)\, R_C + (R_S \mid C_B)\, Q_C = [1 - (1 - R_A)(1 - R_B)]\, R_C + R_B\, [1 - (1 - R_D)(1 - R_E)]\, (1 - R_C) \tag{10.23}$$

The system reliability is expressed in terms of the reliabilities of its components. Simplification of Equation 10.23 gives the same expression as Equation 10.18. The component (i.e., block) reliabilities can be obtained using the methodologies presented in the preceding sections.

10.4.5.3 Cut Set Methodology
A cut set is a set of components with the property that failure of all the components causes the system to fail. A minimal cut set is a cut set that contains no smaller cut set: if a single unit is removed (i.e., considered not failed) from a minimal cut set, the remaining failures no longer cause the system to fail. This implies that all the units in a minimal cut set must fail in order for that cut set to fail the system. The procedure for system reliability calculation using minimal cut sets is as follows:
• Identify the minimal cut sets for a given system.
• Model the components in each cut set as a parallel configuration.
• Join all the minimal cut sets in series configuration.
• Model and solve for system reliability as a series combination of cut sets with the parallel combination of components in each cut set.
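Figure 10.10 System block diagram in terms of minimal cut sets. (Three parallel groups connected in series: A and B; B and C; and C, D, and E.)

For the preceding example, the following minimal cut sets can be identified:

$$C_1 = \{A, B\}, \qquad C_2 = \{B, C\}, \qquad C_3 = \{C, D, E\} \tag{10.24}$$

Following the steps described, the system block diagram in terms of minimal cut sets is as given in Figure 10.10. Using the methodologies for series and parallel systems, the system reliability is given by

$$R_s = [1 - (1 - R_A)(1 - R_B)]\, [1 - (1 - R_B)(1 - R_C)]\, [1 - (1 - R_C)(1 - R_D)(1 - R_E)] \tag{10.25}$$

Units B and C each appear in more than one cut set, so in the simplification, powers of a repeated unit's reliability are reduced (e.g., $R_B R_B = R_B$). Upon simplification, we get the following, which is the same as that given by Equation 10.18:

$$R_S = R_A R_C + R_B R_C + R_B R_D + R_B R_E - R_A R_B R_C - R_B R_D R_E - R_B R_C R_D - R_B R_C R_E + R_B R_C R_D R_E \tag{10.26}$$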
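The cut set procedure can be sketched in code as well. The example below finds the minimal cut sets by brute force and then computes the system reliability by inclusion-exclusion over the cut-set failure events, which handles units shared between cut sets. As before, `system_works` is an assumed structure function inferred from the text rather than taken from Figure 10.7, the unit reliabilities are hypothetical, and the function names are illustrative. The result is compared with the conditional probability form of Equation 10.23.

```python
from itertools import combinations

UNITS = "ABCDE"

def system_works(up):
    """Assumed structure function for Figure 10.7 (inferred from the text)."""
    return ((up["C"] and (up["A"] or up["B"]))
            or (up["B"] and (up["D"] or up["E"])))

def minimal_cut_sets():
    """Find the smallest sets of units whose joint failure fails the system."""
    found = []
    for size in range(1, len(UNITS) + 1):
        for cut in combinations(UNITS, size):
            if any(set(m).issubset(cut) for m in found):
                continue  # contains a smaller cut set, so it is not minimal
            up = {u: u not in cut for u in UNITS}
            if not system_works(up):
                found.append(cut)
    return found

def reliability_from_cut_sets(cuts, rel):
    """Exact unreliability by inclusion-exclusion over cut-set failure events."""
    unreliability = 0.0
    for m in range(1, len(cuts) + 1):
        for group in combinations(cuts, m):
            failed = set().union(*group)  # units that must all be failed
            term = 1.0
            for u in failed:
                term *= 1.0 - rel[u]
            unreliability += (-1) ** (m + 1) * term
    return 1.0 - unreliability

def conditional_form(rel):
    """Equation 10.23, decomposing on unit C."""
    a, b, c, d, e = (rel[u] for u in UNITS)
    return (1 - (1 - a) * (1 - b)) * c + b * (1 - (1 - d) * (1 - e)) * (1 - c)

rel = {u: 0.9 for u in UNITS}  # hypothetical unit reliabilities
cuts = minimal_cut_sets()

print("minimal cut sets :", cuts)
print(f"cut set method   : {reliability_from_cut_sets(cuts, rel):.6f}")
print(f"Equation 10.23   : {conditional_form(rel):.6f}")
```

Multiplying the three bracketed factors of Equation 10.25 numerically, without reducing repeated units, would treat the shared units B and C as if they were separate components and would give a slightly lower value (about 0.9791 here instead of 0.9801). Inclusion-exclusion avoids that issue.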
10.5 FAULT-TREE ANALYSIS
Fault-tree analysis (FTA) is a deductive methodology to determine the potential causes of failures and to estimate the failure probabilities. Fault-tree analysis addresses system design aspects and potential failures, tracks down system failures deductively, describes system functions and behaviors graphically, focuses on one error at a time, and provides qualitative and quantitative reliability analyses. The purpose of a fault tree is to show the sets of events, particularly the primary failures, that will cause the top event in a system. Failures can be classified in several ways (hardware faults or human error; hardware faults: early, random, or aging; primary, secondary, or command fault; passive
or active). In this chapter, we do not distinguish among the different classifications (Lewis 1996).

EXAMPLE 10.4
The following are a reliability block diagram and a fault tree for a blackout. Blackout happens if both the off-site power and the emergency power fail. The emergency power fails if either the voltage monitor or the diesel generator fails. (The voltage monitor signals the diesel generator to start when the off-site voltage falls below a threshold level.)

[Reliability block diagram for blackout, showing the emergency power system.]

Fault tree for blackout:
Blackout (AND)
    Off-site power loss
    Emergency system failure (OR)
        Voltage monitor failure
        Diesel generator failure

10.6 STEPS OF FAULT-TREE ANALYSIS
There are three steps in fault-tree analysis. The first step is to develop a logic block diagram or a fault tree using the elements of the fault tree. This step requires complete system definition and understanding of its operation. Every possible cause and effect of each failure condition should be investigated and related to the top event. The second step is to apply Boolean algebra to the logic diagram and develop algebraic relationships between events. If possible, the expressions should be simplified using Boolean algebra. The third step is to apply probabilistic methods to determine the probabilities of each intermediate event and the top event. The probability of occurrence of each event has to be known; that is, the reliability of each component or subsystem for every possible failure mode has to be considered.

The graphical symbols used to construct the fault tree fall into two categories: gate symbols and event symbols. The basic gate symbols are AND, OR, k-out-of-n voting gate, priority AND, exclusive OR, and inhibit gate. The basic event symbols are basic event, undeveloped event, conditional event, trigger event, resultant event, and transfer-in and transfer-out events (Lewis 1996; Rao 1992; Kececioglu 1991). Quantitative evaluation of the fault tree includes calculation of the probability of the occurrence of the top event. This is based on the Boolean expressions for the interaction of the tree events. Figure 10.11 shows the commonly used symbols in creating a fault tree. For the quantitative analysis, the basic Boolean relations are shown in Table 10.2.

Figure 10.11 Fault-tree symbols: events and gates.
Event symbols:
• Rectangle: fault event
• Circle: independent primary fault
• Diamond: undeveloped fault event
• Triangle-in/-out: transfer events (jump to another part of the tree)
• House: normally occurring basic event
Gate symbols:
• AND gate: output true only if all inputs are true
• OR gate: output true if any of the inputs are true
• INHIBIT gate: output exists if the condition occurs
Table 10.2 Boolean Algebra of AND and OR Relations
A   B   A AND B   A NAND B   A OR B   A NOR B
0   0   0         1          0        1
1   0   0         1          1        0
0   1   0         1          1        0
1   1   1         0          1        0
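For quantitative evaluation, the AND and OR relations translate into probabilities when the input events are independent. The sketch below applies this to the blackout tree of Example 10.4. The event probabilities are hypothetical values chosen for illustration, and the helper names `and_gate` and `or_gate` are not from the handbook.

```python
def and_gate(*probs):
    """AND gate: probability that all independent input events occur."""
    p = 1.0
    for x in probs:
        p *= x
    return p

def or_gate(*probs):
    """OR gate: probability that at least one independent input event occurs."""
    none_occur = 1.0
    for x in probs:
        none_occur *= 1.0 - x
    return 1.0 - none_occur

# Hypothetical basic-event probabilities; not taken from the handbook.
p_offsite_loss = 0.05
p_monitor_fail = 0.01
p_diesel_fail = 0.02

# Emergency system fails if the voltage monitor OR the diesel generator fails.
p_emergency_fail = or_gate(p_monitor_fail, p_diesel_fail)
# Blackout requires off-site power loss AND emergency system failure.
p_blackout = and_gate(p_offsite_loss, p_emergency_fail)

print(f"P(emergency system failure) = {p_emergency_fail:.4f}")
print(f"P(blackout)                 = {p_blackout:.5f}")
```

Evaluating gates one at a time like this is valid here because no basic event appears more than once in the tree. When events repeat, the Boolean expression should be simplified first.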
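EXAMPLE 10.5
Analyze the following fault tree:

[Fault tree: top event T with intermediate events E1 and E2; E1 has inputs A and E3; E3 has inputs B and C; E2 has inputs C and E4; E4 has inputs A and B.]

In the expressions below, "+" denotes OR and "·" denotes AND.

Top-down evaluation:
1. $T = E_1 \cdot E_2$
2. $E_1 = A + E_3$; $E_2 = C + E_4$
3. $E_3 = B + C$; $E_4 = A \cdot B$
4. $T = (A + E_3) \cdot (C + E_4) = [A + (B + C)] \cdot [C + (A \cdot B)]$

Bottom-up evaluation:
1. $E_3 = B + C$; $E_4 = A \cdot B$
2. $E_1 = A + E_3$; $E_2 = C + E_4$
3. $E_1 = A + (B + C)$
4. $E_2 = C + (A \cdot B)$
5. $T = E_1 \cdot E_2 = [A + (B + C)] \cdot [C + (A \cdot B)]$

Either evaluation direction can be used for fault-tree analysis. The expression is then simplified:
Associative law: $A + (B + C) = (A + B) + C$
Commutative law: $(A + B) + C = C + (A + B)$
Thus, $T = [C + (A + B)] \cdot [C + (A \cdot B)]$
Distributive law: $T = C + [(A + B) \cdot (A \cdot B)]$
Commutative law: $A \cdot B = B \cdot A$
Associative law: $T = C + [(A + B) \cdot B \cdot A]$
Absorption law: $(A + B) \cdot B = B$
$T = C + (B \cdot A)$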
Hence, the tree can be reduced to show that T occurs only when C or both A and B occur.

[Reduced fault tree: top event T is the OR of C and the AND of A and B.]
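The reduction in Example 10.5 can be confirmed by checking every combination of basic events. The sketch below assumes the gate types reconstructed from the example (T and E4 are AND gates; E1, E2, and E3 are OR gates); the function names are illustrative.

```python
from itertools import product

def top_event_full(a, b, c):
    """T = [A + (B + C)] . [C + (A . B)], built gate by gate from the tree."""
    e3 = b or c       # E3 = B + C
    e4 = a and b      # E4 = A . B
    e1 = a or e3      # E1 = A + E3
    e2 = c or e4      # E2 = C + E4
    return e1 and e2  # T  = E1 . E2

def top_event_reduced(a, b, c):
    """T = C + (A . B), the reduced expression."""
    return c or (a and b)

print("A B C | T")
for a, b, c in product([False, True], repeat=3):
    full = top_event_full(a, b, c)
    assert full == top_event_reduced(a, b, c)  # the two forms must agree
    print(int(a), int(b), int(c), "|", int(full))
```

An exhaustive truth-table check like this is practical only for small trees, but it is a useful guard against algebra slips when simplifying by hand.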
REFERENCES
Kececioglu, B. D. 1991. Reliability engineering handbook, vols. 1–2. Englewood Cliffs, NJ: Prentice Hall.
Lewis, E. E. 1996. Introduction to reliability engineering. New York: John Wiley & Sons.
Rao, S. S. 1992. Reliability-based design, 505–543. New York: McGraw-Hill.
HOMEWORK PROBLEMS
Problem 10.1
Review the following two reliability block diagrams. Both use the same components the same number of times. Assume the reliability of each block to be $R_X$, where X is the name of the block.
(a) Is there a difference in reliability between the two configurations when the failures are all independent of each other? Explain. You do not need to perform a complete algebraic derivation.
(b) Which configuration is more susceptible to common mode failure and why? Assume that each component (A, B, and C) fails primarily by different mechanisms and that those mechanisms are affected by different loads.

[Two reliability block diagrams, each using blocks A, B, and C twice.]
Problem 10.2
The reliability block diagram shown is a complex system that cannot be decomposed into a “series-parallel” configuration. We want to determine the reliability equation for the system using the conditional probability method. We have decided to use the component B for the decomposition. Draw the two reliability block diagrams that result from “B operating” and “B failed” conditions.

[Reliability block diagram with components A, B, C, D, E, and F.]
Problem 10.3
Consider the system shown in the block diagram and derive an equation for the reliability of the system. $R_X$ denotes the reliability of each component in the system, where X is the name of the component. Stage 3 (four C components in parallel) is a two-out-of-four system; that is, two components need to operate for the system to operate.

[Reliability block diagram with component A, two B components, and four C components.]
Problem 10.4
Derive the reliability equation (manually) of the following system. Note that the system is complex.

[Reliability block diagram with units A, B, C, D, and E.]

Find the following for this complex system:
(a) system reliability at 100 hours
(b) system reliability at 0 hours
(c) failure rate at 1000 hours
(d) time when wearout region begins (use the graph)
How long does it take for 75% of the systems to fail?

Node   Failure Distribution      Parameter (in Hour or Equivalent)
A      Weibull three parameter   β = 3; η = 1,000; γ = 100
B      Exponential               Mean time between failures = 1,000
C      Lognormal                 Mean = 6; standard deviation = 0.5
D      Weibull three parameter   β = 0.7; η = 150; γ = 100
E      Normal                    Mean = 250; standard deviation = 15

What happens to the results if you switch the properties of the blocks C and D?

Problem 10.5
For a top-level event T, the following minimum cut sets were identified: ABC, BDC, AE, ADF, and BEF. Draw a fault tree for the top event of these minimum cut sets.
CHAPTER 11
Reliability Analysis of Redundant and Fault-Tolerant Products
Joanne Bechta Dugan
CONTENTS
Terminology
Notation
11.1 Static Redundancy: Combinatorial Modeling
11.1.1 Simple Redundancy
11.1.1.1 Series Connections
11.1.1.2 Parallel Connections
11.1.1.3 Series-Parallel Connections
11.1.1.4 Non-Series-Parallel Connections
11.1.2 Masking Redundancy
11.1.2.1 Triple Modular Redundancy
11.1.2.2 N-Modular Redundancy
11.1.3 Fault Trees
11.1.3.1 Cut Set Generation
11.1.3.2 Inclusion/Exclusion Method
11.2 Time Dependence
11.2.1 Mean Time to Failure
11.2.2 Hazard Rate
11.3 Dynamic Redundancy: Markov Modeling
11.3.1 Standby Sparing
11.3.2 TMR/Simplex Product
11.3.3 Repairable Products
11.3.3.1 Independent Repair
11.3.3.2 Dependent Repair
11.4 Dependent Failures
11.4.1 Common-Mode Failures
11.4.2 Dependent Failure Rate
11.4.3 Multimode Failures
11.5 Coverage Modeling for Fault-Tolerant Computer Products
11.5.1 Terminology
11.5.2 The Impact of Imperfect Coverage
11.5.3 Some Coverage Models
11.5.3.1 General Structure of a Coverage Model
11.5.4 Near-Coincident Faults
11.5.5 Including the Coverage Model in the Product Model
11.6 Bounded Approximate Models
11.6.1 Truncated Exhaustive State Enumeration
11.6.2 Truncated Sum of Disjoint Products
11.6.3 Truncating a Markov Chain
11.7 Advanced Topics
11.7.1 Combining Performance with Reliability
11.7.2 Phased Applications
11.7.3 Advanced Fault-Tree Modeling
11.8 Summary
References
A fault-tolerant product is designed to continue operating correctly, despite the failure of some constituent components. Fault tolerance is generally achieved through the use of redundancy—that is, by providing alternative means for accomplishing a given task. This chapter presents methods for evaluating the reliability of several types of fault-tolerant products. Simple redundant products, in which the components are duplicated and have a fixed probability of failure, are discussed first. These are then generalized to include more complex redundancy techniques, such as standby sparing, majority voting, and hybrid redundancy, as well as failure probabilities that depend on time. More complex situations, such as dependent and multimode failures, are addressed. Coverage modeling, which is critically important when considering embedded fault-tolerant computers, is then considered. Special techniques for analyzing large models, as well as several advanced topics, are also introduced.
TERMINOLOGY
Component: the basic product unit under consideration. A component is usually the smallest single element that is replaceable or repairable in the field. It is either operational or failed. A redundant component has at least one functional duplicate.
Product: an integrated collection of redundant and nonredundant components that perform some function. Product reliability analysis estimates the probability that enough of the constituent components are operational to allow the product to perform its intended function.
Path set: the basic unit of redundancy; a physical means for accomplishing a task. If all the components associated with a path set are operational, then the product is operational; a path fails when one of the constituent components fails. Failure of a redundant path does not necessarily result in product failure.
Cut set: a set of components the failure of which causes the product to fail. If all the components associated with a cut set have failed, then the product has failed. Even when some or all of the components in a cut set are operational, the product is not necessarily operational.
NOTATION
Components are usually denoted by A, B, C, and so on. The following notation is used for probability statements concerning component success or failure:
• $A$ is the event "component A is operational";
• $\bar{A}$ is the event "component A has failed";
• $p_i$ is the probability that component i is operational, that is, the reliability of component i;
• $p_i(t)$ is the probability that component i is operational at time t, that is, the reliability of component i at time t;
• $q_i$ is the probability that component i has failed, that is, the unreliability of component i; and
• $q_i(t)$ is the probability that component i has failed at time t, that is, the unreliability of component i at time t.
In connection with probability statements about redundant configurations, the following notation is used:
• $R$, $R(t)$: product reliability (possibly time dependent);
• $U$, $U(t)$: product unreliability (possibly time dependent), with $R + U = 1$ and $R(t) + U(t) = 1$ for all $t \ge 0$; and
• $A$, $A(t)$: product availability (possibly time dependent).
11.1 STATIC REDUNDANCY: COMBINATORIAL MODELING
Some products achieve fault tolerance through the use of static redundancy, frequently termed passive redundancy. In such products, the operation and interconnection of the constituent components do not change in response to a component failure. Section 11.3 considers products with dynamic redundancy, in which the product is reconfigured when a component failure occurs.

11.1.1 Simple Redundancy
The simplest method for achieving fault tolerance is duplication, in which two redundant components are used to ensure the functionality of one. Figure 11.1 is a diagram representing the duplication of a single component. The product consists of two redundant components, A1 and A2, only one of which is needed for the product to be functional. Product reliability is therefore determined by the probability that at least one of the two components is operational, which is the complement of the probability that both have failed: $R = 1 - q_{A_1} q_{A_2}$.

Figure 11.1 Duplication of a single component.

Products can have more than one redundant component. Figure 11.2 shows two ways to provide redundancy in a product that requires the functionality of two components, A and B. The first configuration provides two redundant paths, each logically equivalent to the original product. This configuration fails when at least one of the components on each of the two paths fails. The second configuration duplicates the components individually, providing four nondisjoint redundant paths. This configuration survives as long as at least one Ai and one Bi are operational (the analysis of the reliability of these configurations is described later in this section).

The first step toward evaluating the reliability of a simple redundant configuration consists of drawing a reliability block diagram showing the combinations of components that must function for successful product operation. Each node in a reliability block diagram represents a component; connections between components depict logical relationships between components. Each path from the source to the sink represents a combination of components; if all the components on a path are operational, then the product is operational. The various paths for successful operation can be determined from the reliability block diagram. When one or more components on a path fail, the path is no longer usable. Because redundancy implies more than one path for product success, reliability is the probability that at least one path will be operational.

Many reliability block diagrams can be decomposed into series and parallel combinations of components. The reliability of these can be determined individually and then combined to gauge the reliability of the overall product.
Figure 11.2 Duplication of a simple system. Nonredundant system (A in series with B): $R = p_A p_B$. System-level redundancy (series path A1, B1 in parallel with series path A2, B2): $R = 1 - (1 - p_{A_1} p_{B_1})(1 - p_{A_2} p_{B_2})$. Component-level redundancy (A1 in parallel with A2, in series with B1 in parallel with B2): $R = (1 - q_{A_1} q_{A_2})(1 - q_{B_1} q_{B_2})$.
Figure 11.3 A series system of n components (A1, A2, ..., An–1, An).
11.1.1.1 Series Connections
If a product consisting of n components cannot survive the loss of any of them, then from a reliability perspective, it is a series product of n elements. Such a product can be represented by the reliability block diagram in Figure 11.3. The reliability of this series product, under the basic assumption of independent component failures, is the probability that each element on the path is functioning:

$$R_{\text{series}} = \prod_{i=1}^{n} p_{A_i} \tag{11.1}$$
11.1.1.2 Parallel Connections
A parallel connection of components is used when only one of several components is necessary for successful product operation. A parallel product of n elements can be represented by the reliability block diagram in Figure 11.4. Because the product is operational as long as at least one of the n components is operational, there are a total of n possible success paths. The product fails when all n components have failed, which yields the following expression for the reliability of the parallel product:

$$R_{\text{parallel}} = 1 - \prod_{i=1}^{n} q_{A_i} \tag{11.2}$$
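Equations 11.1 and 11.2 translate directly into a few lines of code. The following is a minimal sketch, assuming independent component failures as in the text; the function names and the numerical values are illustrative and are not part of the handbook.

```python
from math import prod

def series_reliability(p):
    """Equation 11.1: a series product works only if every component works."""
    return prod(p)

def parallel_reliability(p):
    """Equation 11.2: a parallel product fails only if every component fails."""
    return 1.0 - prod(1.0 - p_i for p_i in p)

# Illustrative values only: three components, each with reliability 0.9.
components = [0.9, 0.9, 0.9]
print("series:  ", series_reliability(components))
print("parallel:", parallel_reliability(components))
```

With these illustrative values, the series product is less reliable than any single component, while the parallel product is more reliable.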
Figure 11.4 A parallel system of n components (A1, A2, ..., An).

11.1.1.3 Series-Parallel Connections
Products built up from series and parallel connections of components can be analyzed by considering the series and parallel subproducts, then combining the analyses. Each series or parallel configuration can be analyzed and replaced in the reliability block diagram with a pseudocomponent whose reliability is equal to the reliability of the subproduct it is replacing.

As a simple example, consider the two redundant configurations shown in Figure 11.2. In Figure 11.5, the product-level redundancy configuration is decomposed into a parallel combination of two series combinations. The two series combinations are replaced by pseudocomponents, Ci, whose reliability is equal to the reliability of a series connection of Ai and Bi. The resulting product is a simple parallel combination of C1 and C2; reducing this parallel combination to a single component results in the reliability as shown in Figure 11.2.

Figure 11.5 Evaluation of system-level redundancy. Replacing each series pair gives $p_{C_1} = p_{A_1} p_{B_1}$ and $p_{C_2} = p_{A_2} p_{B_2}$; replacing the parallel pair gives $p_D = 1 - q_{C_1} q_{C_2} = 1 - (1 - p_{C_1})(1 - p_{C_2}) = 1 - (1 - p_{A_1} p_{B_1})(1 - p_{A_2} p_{B_2})$.

In Figure 11.6, the component-level redundancy configuration from Figure 11.2 is decomposed into a series combination of two parallel combinations. The parallel combination of A1 and A2 is replaced by a single pseudocomponent, C1, and the parallel combination of B1 and B2 by C2; each pseudocomponent has the reliability of the parallel combination it replaces. The series combination of C1 and C2 is then reduced to a single component, D, whose reliability is the same as shown in Figure 11.2.

The component-level and product-level redundancy approaches shown in Figure 11.2 can be compared by assuming that the component reliabilities are all equal ($p_{A_1} = p_{A_2} = p_{B_1} = p_{B_2}$). Figure 11.7 shows that, for this simple product, the component-level approach achieves higher reliability than the product-level approach for every component reliability strictly between 0 and 1.

11.1.1.4 Non-Series-Parallel Connections
A more generally applicable approach to the analysis of redundant products consists of determining the path sets of the product and calculating the reliability from these sets. This approach is more general because it is applicable to products that cannot be reduced to series or parallel combinations.
Figure 11.6 Evaluation of component-level redundancy. Replacing each parallel pair gives $p_{C_1} = 1 - q_{A_1} q_{A_2}$ and $p_{C_2} = 1 - q_{B_1} q_{B_2}$; replacing the series pair gives $p_D = p_{C_1} p_{C_2}$, so that $R = (1 - q_{A_1} q_{A_2})(1 - q_{B_1} q_{B_2})$.
Figure 11.7 Comparison of two simple approaches to redundancy. (Plot of product reliability versus component reliability, each from 0 to 1, for the component-level and product-level configurations.)
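The comparison plotted in Figure 11.7 can be reproduced numerically. The sketch below assumes, as the text does, that all four components share a single reliability p; the function names are illustrative.

```python
def system_level(p):
    """Two series paths (A, B) in parallel: 1 - (1 - p^2)^2."""
    return 1.0 - (1.0 - p * p) ** 2

def component_level(p):
    """Two parallel pairs in series: (1 - q^2)^2 with q = 1 - p."""
    return (1.0 - (1.0 - p) ** 2) ** 2

for k in range(11):
    p = k / 10
    print(f"p={p:.1f}  system-level={system_level(p):.4f}  "
          f"component-level={component_level(p):.4f}")
```

Algebraically, the difference between the two expressions is $2p^2(1 - p)^2$, which is nonnegative and vanishes only at p = 0 and p = 1.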
A path set is defined as a set of components such that, if all components in a set are operational, then the product is operational. A minimal path set contains the minimum number of elements on a path to ensure product operation. Suppose that there are n minimal path sets (called $P_j$, j = 1, ..., n) for a product. The reliability of a product is then given by the probability that the elements of at least one path set will be operational:

$$R = \text{Prob}\left\{\bigcup_{j=1}^{n} P_j\right\} \tag{11.3}$$
For the product-level redundancy configuration in Figure 11.2, the minimal path sets are {A1, B1} and {A2, B2}. Other nonminimal path sets include {A1, B1, B2} and {A1, A2, B1, B2}. Using Equation 11.3, the reliability of the product-level configuration can again be determined:

$$\begin{aligned}
R &= \text{Prob}\{A_1 B_1 \cup A_2 B_2\} \\
  &= \text{Prob}\{A_1 B_1\} + \text{Prob}\{A_2 B_2\} - \text{Prob}\{A_1 A_2 B_1 B_2\} \\
  &= p_{A_1} p_{B_1} + p_{A_2} p_{B_2} - p_{A_1} p_{A_2} p_{B_1} p_{B_2} \\
  &= 1 - (1 - p_{A_1} p_{B_1} - p_{A_2} p_{B_2} + p_{A_1} p_{A_2} p_{B_1} p_{B_2}) \\
  &= 1 - (1 - p_{A_1} p_{B_1})(1 - p_{A_2} p_{B_2})
\end{aligned} \tag{11.4}$$

This result is the same as that derived in Figure 11.5.
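Equation 11.3 can be evaluated mechanically for any list of path sets by expanding the probability of the union in the same way as Equation 11.4. The following is a minimal sketch, assuming independent components; the function name and the numerical reliabilities are illustrative assumptions, not values from the handbook.

```python
from itertools import combinations
from math import prod

def reliability_from_path_sets(path_sets, p):
    """Probability that at least one path set is fully operational.

    path_sets: iterable of sets of component names
    p: dict mapping component name -> reliability (independent components)
    """
    path_sets = [frozenset(s) for s in path_sets]
    total = 0.0
    for k in range(1, len(path_sets) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for group in combinations(path_sets, k):
            # Every path in the group is up iff every component in their union is up.
            needed = frozenset().union(*group)
            total += sign * prod(p[c] for c in needed)
    return total

# Illustrative reliabilities (hypothetical).
p = {"A1": 0.9, "B1": 0.8, "A2": 0.7, "B2": 0.6}
paths = [{"A1", "B1"}, {"A2", "B2"}]
r = reliability_from_path_sets(paths, p)
closed_form = 1 - (1 - p["A1"] * p["B1"]) * (1 - p["A2"] * p["B2"])
print(r, closed_form)
```

The number of terms in this expansion grows as 2^n with the number of path sets, so the direct approach is practical only for small products.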
Figure 11.8 Evaluation of a non-series-parallel system (components A1 through A5): $R_{sys} = R_1 \Pr(A_3) + R_2 \Pr(\bar{A}_3)$. To determine $R_1$, assume that A3 is operational: $R_1 = (1 - q_{A_1} q_{A_2})(1 - q_{A_4} q_{A_5})$. To determine $R_2$, assume that A3 has failed: $R_2 = 1 - (1 - p_{A_1} p_{A_4})(1 - p_{A_2} p_{A_5})$.
Next consider the product shown in Figure 11.8, which cannot be reduced to a series-parallel combination because of component A3. Conditioning on the status of A3 (i.e., operational or failed) reduces the figure to two simpler configurations. That is, the reliability of the product is given by

$$\begin{aligned}
R_{sys} &= \Pr\{\text{System operational} \mid A_3 \text{ operational}\}\,\Pr\{A_3 \text{ operational}\} \\
        &\quad + \Pr\{\text{System operational} \mid A_3 \text{ failed}\}\,\Pr\{A_3 \text{ failed}\} \\
        &= R_1 p_{A_3} + R_2 q_{A_3}
\end{aligned} \tag{11.5}$$

Now, assuming that A3 is operational, what remains is a series connection of two parallel connections (similar to Figure 11.6), whose reliability is given by

$$R_1 = (1 - q_{A_1} q_{A_2})(1 - q_{A_4} q_{A_5}) \tag{11.6}$$

Alternatively, assuming that A3 has failed leaves a parallel connection of two series connections (similar to Figure 11.5), whose reliability is given by

$$R_2 = 1 - (1 - p_{A_1} p_{A_4})(1 - p_{A_2} p_{A_5}) \tag{11.7}$$
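The conditioning argument of Equations 11.5 through 11.7 can be cross-checked by enumerating all 2^5 component states. In the sketch below, the list of minimal path sets is an assumption inferred from Equations 11.6 and 11.7 (the figure itself is not reproduced here), and the component reliabilities are hypothetical.

```python
from itertools import product

NAMES = ["A1", "A2", "A3", "A4", "A5"]
# Assumed minimal path sets, consistent with Equations 11.6 and 11.7.
PATHS = [{"A1", "A4"}, {"A2", "A5"}, {"A1", "A3", "A5"}, {"A2", "A3", "A4"}]

def bridge_by_conditioning(p):
    """Equation 11.5: condition on whether A3 is operational."""
    q = {name: 1.0 - value for name, value in p.items()}
    r1 = (1 - q["A1"] * q["A2"]) * (1 - q["A4"] * q["A5"])   # Eq. 11.6
    r2 = 1 - (1 - p["A1"] * p["A4"]) * (1 - p["A2"] * p["A5"])  # Eq. 11.7
    return r1 * p["A3"] + r2 * q["A3"]

def bridge_by_enumeration(p):
    """Sum the probabilities of all component states with a working path."""
    total = 0.0
    for states in product([True, False], repeat=len(NAMES)):
        up = {name for name, ok in zip(NAMES, states) if ok}
        if any(path <= up for path in PATHS):
            prob = 1.0
            for name, ok in zip(NAMES, states):
                prob *= p[name] if ok else 1.0 - p[name]
            total += prob
    return total

# Hypothetical component reliabilities.
p = {"A1": 0.9, "A2": 0.85, "A3": 0.8, "A4": 0.75, "A5": 0.7}
print(bridge_by_conditioning(p), bridge_by_enumeration(p))
```

Both functions should agree to within rounding error; the enumeration serves only as an independent check and scales poorly beyond a handful of components.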
11.1.2 Masking Redundancy
In products that duplicate components, the replication can be used for error detection: when the two redundant components do not produce the same output (for example, in the case of communication links), one of the outputs must be erroneous. However, without some additional information, there is no way to know which of the outputs is correct. Suppose, however, that redundant components always occur in groups of three or more. A majority voter can then be used to determine both the correct output and the faulty unit. The output that occurs most frequently is chosen as the correct output, while those that differ from the majority are assumed to result from faulty components. As long as a majority of the components are operating correctly, a correct output can be realized.

Figure 11.9 Reliability of a TMR system (components A1, A2, and A3 feeding a voter).

11.1.2.1 Triple Modular Redundancy
One example of masking redundancy, triple modular redundancy (TMR), is shown in Figure 11.9. In a TMR product, the three redundant components all perform precisely the same task, and the voter selects the correct output from among the three redundant outputs. As long as at least two of the redundant components are operating correctly and the voter has not failed, the TMR configuration operates correctly. There are three ways for exactly two units to be operational, each of which occurs with probability $p_A^2(1 - p_A)$, and one way for all three components to be operational, which occurs with probability $p_A^3$. The reliability of the TMR configuration is thus given by

$$R_{TMR} = p_v \times [p_A^3 + 3p_A^2(1 - p_A)] = p_v \times (3p_A^2 - 2p_A^3) \tag{11.8}$$

where $p_v$ is the probability that the voter operates correctly, and $p_A$ is the reliability of each redundant unit. The reliability of a single component and the reliability of the voter determine whether a TMR product provides an improvement in reliability over a single component. Figure 11.10 shows the reliability of a TMR product as a function of the reliability of the single component, for three different values of the reliability of the voter. The diagonal line shows the reliability of a single component. In the case of a perfect voter, the TMR product is only better than using a single component when the reliability of the single component is greater than 0.5. If the reliability of the voter is only 0.9, Equation 11.8 gives the TMR product at most a marginal advantage, and only for component reliabilities between about 0.67 and 0.83; elsewhere it is worse than a single component. In most practical products, the voting mechanism is substantially simpler than the replicated components and thus fails with a much lower probability, making TMR a useful means for achieving reliability improvement.

Figure 11.10 Triple modular redundancy.

11.1.2.2 N-Modular Redundancy
The TMR concept can be generalized to higher levels of redundancy. An N-modular redundant (NMR) product utilizes N = 2n + 1 redundant units (note that N is odd). As in the TMR product, the NMR voter selects as the correct output that which occurs in more than half the units. The product is operational as long as no more than n of the redundant components have failed, that is, as long as at least n + 1 are operational. The reliability of an NMR product is given by

$$R_{NMR} = p_v \times \left[\sum_{i=n+1}^{2n+1} \binom{2n+1}{i} p_A^{\,i} (1 - p_A)^{2n+1-i}\right] \tag{11.9}$$
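Equation 11.9 is a binomial tail sum scaled by the voter reliability, so it is easy to evaluate directly. The following sketch is illustrative (the function names are assumptions, not the handbook's); it also computes the voter reliability at which an NMR product merely equals a single component.

```python
from math import comb

def nmr_reliability(n_units, p_a, p_v=1.0):
    """Equation 11.9 for an odd number of units N = 2n + 1."""
    if n_units < 1 or n_units % 2 == 0:
        raise ValueError("N must be a positive odd integer")
    majority = n_units // 2 + 1  # at least n + 1 units must be operational
    tail = sum(comb(n_units, i) * p_a**i * (1.0 - p_a)**(n_units - i)
               for i in range(majority, n_units + 1))
    return p_v * tail

def break_even_voter_reliability(n_units, p_a):
    """Voter reliability at which the NMR product equals one component."""
    return p_a / nmr_reliability(n_units, p_a)

for n_units in (3, 5, 7):
    print(n_units, round(break_even_voter_reliability(n_units, 0.85), 4))
```

For a component reliability of 0.85, the break-even voter reliabilities come out at approximately 0.905, 0.873, and 0.860 for N = 3, 5, and 7, respectively.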
In Equation 11.9, $p_v$ is the reliability of the voter, and $p_A$ is the reliability of a single component. In Figure 11.11, the reliability of three different NMR configurations (N = 3, 5, 7) in which the voter is assumed not to fail ($p_v = 1$) is compared with the reliability of a single component. In all cases, there is an improvement in reliability whenever the reliability of a single component is greater than 0.5.

Figure 11.11 NMR reliability with nonfailing voter. (Plot of product reliability versus single-component reliability for a single component and for N = 3, 5, and 7.)

Assessing the effect of an imperfect voter, Figure 11.12 shows the reliability of three NMR products as a function of the voter reliability for $p_A$ fixed at the value 0.85. When using a TMR product (N = 3) with single-component reliability $p_A = 0.85$, the reliability of the voter must be greater than about 0.91 to achieve an improvement in reliability. For a five-modular redundant product, $p_v$ should be greater than about 0.88; a voter reliability greater than about 0.86 in a seven-modular redundant product improves reliability over a single component. Because the voter is treated as a series connection from a reliability point of view, if its reliability is not greater than that of a single component, an NMR product will not be more reliable than a single component.

Figure 11.12 The effect of the unreliability of the voter when component reliability is fixed at $p_A = 0.85$. (Plot of product reliability versus voter reliability for N = 3, 5, and 7.)

11.1.3 Fault Trees
A fault-tree model is a logical representation of the failure criteria of a product. The top event of a fault tree generally represents the failure of the product being analyzed, and it is broken down into its constituent causes. These contributing causes are connected by logical gates, including, for example, the AND, OR, and m-out-of-n gates. (An m-out-of-n gate is true when at least m of the n inputs have occurred.) An example of a fault tree, representing the component-level redundant product from Figure 11.2, is shown in Figure 11.13. The top event (product failure) is caused by A1 and A2 failing or by B1 and B2 failing. The bottom events of a fault tree (i.e., Ai and Bi) are the basic events of the tree and usually represent component failures.

Figure 11.13 An equivalent reliability block diagram and fault tree.

Fault-tree analysis consists of determining the probability of occurrence of the top event, given the probability of occurrence of the basic events (the usual assumption is that the basic events are statistically independent). Because the fault tree represents failure criteria (as contrasted with the reliability block diagram, which represents success criteria), fault trees are analyzed by generating the cut sets of the product. A cut set is a dual of a path set for a product, in that if all the components in a cut set fail, then the product fails. Thus, the path sets establish the success criteria for a product, while the cut sets establish the failure criteria. A minimal cut set contains the minimum number of elements whose failure ensures product failure.

11.1.3.1 Cut Set Generation
A top-down algorithm for determining the cut sets of a fault tree starts at the top event of the tree and constructs the set of cut sets by considering the gates at each lower level.
A set of cut sets is expanded at each level of the tree until the set of basic events is reached. If the gate being considered is an AND gate, then all the inputs must occur to enable the gate; thus, the gate is replaced at the lower level by a listing of all its inputs. If the gate being considered is an OR gate, then the cut set being built is split into several cut sets, one containing each input to the OR gate.

As an example, consider the fault tree shown in Figure 11.14, containing five gates (G1 through G5) and five basic events (A1 through A5). The derivation of the set of cut sets for this fault tree is shown in Figure 11.15. The top-down algorithm starts with the top gate, G1. Because G1 is an OR gate, it is replaced in the expansion by its inputs, G2 and G3, each starting a separate set. G3 is an AND gate and is replaced in the expansion by the basic events {A4, A5}, a cut set for this tree. G2 is expanded into {G4, G5} because both must occur to activate it. Expanding the G4 term splits the set into two because it is a two-input OR gate: {A1, G5} and {A2, G5}. Finally, the expansion of G5 splits both sets in two, yielding {A1, A3}, {A1, A4}, {A2, A3}, and {A2, A4} as the remaining minimal cut sets for the tree.

Figure 11.14 An example fault tree used for cut set generation. (G1 is an OR gate with inputs G2 and G3; G2 is an AND gate with inputs G4 and G5; G3 is an AND gate with inputs A4 and A5; G4 is an OR gate with inputs A1 and A2; G5 is an OR gate with inputs A3 and A4.)

Figure 11.15 A top-down algorithm for determining the cut sets for a fault tree. (G1 expands to G2 and G3; G3 expands to {A4, A5}; G2 expands to {G4, G5}, then to {A1, G5} and {A2, G5}, and finally to {A1, A3}, {A1, A4}, {A2, A3}, and {A2, A4}.)

If a gate being expanded is a k-out-of-n gate, then its expansion is a combination of the OR and AND expansions. The k-out-of-n gate is expanded into the C(n, k) combinations of input events that can cause the gate to occur. For example, a cut set with a gate, Gx, that is a 3-out-of-4 gate gets split into four cut sets, where Gx is replaced with the four possibilities for selecting three of the inputs.

When such an algorithm is used for generating cut sets, some reduction might be necessary. If a cut set contains the same basic event more than once, then the redundant entries can be eliminated. If one cut set is a subset of another, the latter can be removed from further consideration. For example, the set of cut sets {{A1, A2, A1, A3}, {A3, A4}, {A2, A3, A4}} can be reduced to {{A1, A2, A3}, {A3, A4}}.

11.1.3.2 Inclusion/Exclusion Method
Once the set of minimal cut sets for a fault tree has been determined, the probability of product failure can be calculated. The set of cut sets represents all the ways in which the product will fail, so the probability of product failure is simply the probability that the events in one or more cut sets will occur:

$$\Pr\{\text{product failure}\} = \Pr\left\{\bigcup_{i=1}^{n} C_i\right\} \tag{11.10}$$

where the $C_i$ are the minimal cut sets for the product. Because the cut sets are not generally disjoint, the probability of the union is not equal to the sum of the probabilities of the individual cut sets. The inclusion/exclusion method is a generalization of the rule for calculating the probability of the union of two events,

$$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\} - \Pr\{A \cap B\} \tag{11.11}$$

and is given by

$$\Pr\left\{\bigcup_{i=1}^{n} C_i\right\} = \sum_{i} \Pr\{C_i\} - \sum_{i<j} \Pr\{C_i \cap C_j\} + \sum_{i<j<k} \Pr\{C_i \cap C_j \cap C_k\} - \cdots + (-1)^{n+1} \Pr\left\{\bigcap_{i=1}^{n} C_i\right\} \tag{11.12}$$
Equation 11.12 calculates the probability of product failure exactly. As each successive summation term is calculated and added to the running sum, the result alternately overestimates (if the term is added) or underestimates (if the term is subtracted) the desired probability. Thus, bounds on the probability of product failure can be determined by using only a portion of the terms in Equation 11.12.

Consider the example fault tree shown in Figure 11.14, whose cut sets are C1 = {A4, A5}; C2 = {A1, A3}; C3 = {A1, A4}; C4 = {A2, A3}; and C5 = {A2, A4}. Assume that the probabilities of occurrence of the basic events are Pr{A1} = q1 = 0.05; q2 = 0.10; q3 = 0.15; q4 = 0.20; and q5 = 0.25. The probability of occurrence for each of the cut sets is then Pr{C1} = q4 × q5 = 0.05; Pr{C2} = 0.0075; Pr{C3} = 0.01; Pr{C4} = 0.015; and Pr{C5} = 0.02. The sum of the probabilities for the single cut sets, 0.1025, is an upper bound on the unreliability of the product:

$$0 \le \text{unreliability} \le 0.1025 \tag{11.13}$$

The second term of Equation 11.12 is the sum of the probability of occurrence for all the possible combinations of two cut sets; for the current example, this is 0.015175. Subtracting this from the first term yields a lower bound on the unreliability of the product, 0.087325:

$$0.087325 \le \text{unreliability} \le 0.1025 \tag{11.14}$$

Adding the third term, 0.0020875, the sum of the probabilities for all possible combinations of three cut sets, yields a better upper bound:

$$0.087325 \le \text{unreliability} \le 0.0894125 \tag{11.15}$$

Subtracting the fourth term, 0.0003, the sum of the probabilities for all the possible combinations of four cut sets, yields a tighter lower bound:

$$0.0891125 \le \text{unreliability} \le 0.0894125 \tag{11.16}$$

If we are interested in three decimal places of accuracy, the expansion can stop here, with a known unreliability of 0.089. Adding the final term, the probability that all five cut sets will occur (0.0000375), results in the exact unreliability:

$$\text{unreliability} = 0.08915 \tag{11.17}$$
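The worked example can be reproduced end to end: a top-down expansion of the gates yields the minimal cut sets, and the successive terms of Equation 11.12 yield the alternating bounds. The sketch below is one possible implementation under the assumption of independent basic events; the gate table follows the textual description of Figure 11.14, and all function names are illustrative.

```python
from itertools import combinations
from math import prod

# Gate structure of Figure 11.14 as described in the text.
GATES = {
    "G1": ("OR", ["G2", "G3"]),
    "G2": ("AND", ["G4", "G5"]),
    "G3": ("AND", ["A4", "A5"]),
    "G4": ("OR", ["A1", "A2"]),
    "G5": ("OR", ["A3", "A4"]),
}
Q = {"A1": 0.05, "A2": 0.10, "A3": 0.15, "A4": 0.20, "A5": 0.25}

def cut_sets(top, gates):
    """Top-down expansion: AND gates extend a set, OR gates split it."""
    pending, done = [frozenset([top])], set()
    while pending:
        current = pending.pop()
        gate = next((e for e in current if e in gates), None)
        if gate is None:            # only basic events remain
            done.add(current)
            continue
        kind, inputs = gates[gate]
        rest = current - {gate}
        if kind == "AND":
            pending.append(rest | frozenset(inputs))
        else:                       # OR gate
            pending.extend(rest | {x} for x in inputs)
    # Reduction step: drop any set that strictly contains another.
    minimal = [c for c in done if not any(other < c for other in done)]
    return sorted(minimal, key=sorted)

def inclusion_exclusion_terms(sets, q):
    """Unsigned sums over all groups of 1, 2, ..., n cut sets (Eq. 11.12)."""
    return [sum(prod(q[e] for e in frozenset().union(*group))
                for group in combinations(sets, k))
            for k in range(1, len(sets) + 1)]

def running_bounds(terms):
    """Partial sums with alternating signs: upper, lower, upper, ..."""
    total, bounds = 0.0, []
    for k, term in enumerate(terms):
        total += term if k % 2 == 0 else -term
        bounds.append(total)
    return bounds

sets = cut_sets("G1", GATES)
terms = inclusion_exclusion_terms(sets, Q)
bounds = running_bounds(terms)
print([sorted(s) for s in sets])
print(terms)
print(bounds)
```

Because repeated basic events collapse automatically inside a set, only the subset reduction needs to be coded explicitly. The last entry of `bounds` is the exact unreliability of Equation 11.17.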
11.2 TIME DEPENDENCE
In the preceding sections, it was assumed that components fail with a fixed probability. In many cases, however, the probability of failure depends on the amount of time that has elapsed since the component was put into operation. For the discussion that follows, the following notation will be used:
• $f_i(t)$ is the density function of the time to failure of component i;
• $F_i(t)$ is the cumulative distribution function for the time to failure of component i;
• $p_i(t)$ is the reliability function for component i at time t;
• $h(t)$ is the instantaneous failure rate (hazard rate) at time t; and
• MTTF is the mean time to failure or average lifetime of the component or product.
If the relationship between component reliability and time is known, the inclusion of the time factor does not change the methodology used to evaluate product reliability. In any equation where the term $p_i$ appears, $p_i(t)$ can be substituted; similarly, $q_i(t)$ can be substituted for $q_i$. For example, Equation 11.8, which gives the reliability of a TMR configuration of components, becomes

$$R_{TMR}(t) = p_v(t) \times [3p_A(t)^2 - 2p_A(t)^3] \tag{11.18}$$
Figure 11.16 TMR reliability as compared with single component reliability. (Plot of product reliability versus time in hours, on a logarithmic axis from 10 to 100,000, for the TMR system and a single component.)
Assume that the time to failure for the components in a TMR configuration is exponentially distributed, with rate parameters $\lambda_A$ for basic components and $\lambda_v$ for the voter. That is,

$$p_A(t) = e^{-\lambda_A t} \quad \text{and} \quad p_v(t) = e^{-\lambda_v t} \tag{11.19}$$

Substituting into Equation 11.18 yields

$$R_{TMR}(t) = e^{-\lambda_v t}\left(3e^{-2\lambda_A t} - 2e^{-3\lambda_A t}\right) \tag{11.20}$$
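Equation 11.20 is easy to tabulate against the single-component reliability. The sketch below uses the failure rates the text quotes for Figure 11.16 (10^-4 per hour for a component and 10^-6 per hour for the voter); the function names and the bisection search for the crossover time are illustrative additions.

```python
import math

LAMBDA_A = 1e-4  # component failure rate per hour (rate quoted for Figure 11.16)
LAMBDA_V = 1e-6  # voter failure rate per hour (rate quoted for Figure 11.16)

def r_single(t, lam_a=LAMBDA_A):
    """Reliability of one component with exponential time to failure."""
    return math.exp(-lam_a * t)

def r_tmr(t, lam_a=LAMBDA_A, lam_v=LAMBDA_V):
    """Equation 11.20."""
    return math.exp(-lam_v * t) * (3 * math.exp(-2 * lam_a * t)
                                   - 2 * math.exp(-3 * lam_a * t))

def crossover_time(lo=1.0, hi=1e5, iterations=100):
    """Bisection for the time at which the TMR and single curves cross.

    Assumes TMR is better at `lo` and worse at `hi`.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if r_tmr(mid) > r_single(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for t in (10, 100, 1_000, 10_000, 100_000):
    print(f"t={t:>7}  single={r_single(t):.6f}  TMR={r_tmr(t):.6f}")
print("crossover (hours):", crossover_time())
```

With these rates the two curves cross a little under 7,000 hours; with a perfect voter the crossover would occur at exactly ln 2 / λ_A, the time at which the component reliability falls to 0.5.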
Suppose further that the basic components fail at a rate of $\lambda_A = 10^{-4}$ per hour and that the voter fails at a rate of $\lambda_v = 10^{-6}$ per hour. The reliability of the TMR configuration, as a function of time, is compared with the reliability of a single component in Figure 11.16. Notice that the TMR product is not always better than the single component; which is better depends on the length of time that the product will be deployed. For short applications, the TMR product is better than the single-component product. For very long applications, the single component is more reliable. This eventual crossover in favor of the single component occurs regardless of the values of the component and voter failure rates.

11.2.1 Mean Time to Failure
For products whose failure probability is time dependent, one parameter of interest is the MTTF, or the average lifetime. If f(t) represents the time-to-failure density of the product (with or without redundant components), then the mean lifetime can be calculated from

$$\text{MTTF} = \int_0^\infty x f(x)\,dx \tag{11.21}$$
For example, in the case of a single component whose time to failure is exponentially distributed with parameter $\lambda_A$, the average life of a component is

$$\text{MTTF}_A = \int_0^\infty x \lambda_A e^{-\lambda_A x}\,dx = \frac{1}{\lambda_A} \tag{11.22}$$

The mean time to product failure can also be calculated from the reliability equation:

$$\text{MTTF}_s = \int_0^\infty R_s(t)\,dt \tag{11.23}$$

For the TMR product considered previously, where the failure rate of a single component is $\lambda_A$ and the failure rate of the voter is $\lambda_v$, the MTTF is

$$\text{MTTF}_{TMR} = \int_0^\infty e^{-\lambda_v t}\left(3e^{-2\lambda_A t} - 2e^{-3\lambda_A t}\right)dt = \frac{3}{\lambda_v + 2\lambda_A} - \frac{2}{\lambda_v + 3\lambda_A} \tag{11.24}$$
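The closed form in Equation 11.24 can be checked against a direct numerical evaluation of Equation 11.23. The following is a minimal sketch using the trapezoidal rule; the function names, the truncation point of the integral, and the step count are assumptions chosen for illustration.

```python
import math

def mttf_single(lam_a):
    """Equation 11.22."""
    return 1.0 / lam_a

def mttf_tmr(lam_a, lam_v):
    """Equation 11.24."""
    return 3.0 / (lam_v + 2.0 * lam_a) - 2.0 / (lam_v + 3.0 * lam_a)

def mttf_numeric(reliability, t_max, steps):
    """Equation 11.23 by the trapezoidal rule, truncated at t_max."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h

lam_a, lam_v = 1e-4, 1e-6  # rates used earlier in the text

def r_tmr(t):
    """Equation 11.20."""
    return math.exp(-lam_v * t) * (3 * math.exp(-2 * lam_a * t)
                                   - 2 * math.exp(-3 * lam_a * t))

print("single component:", mttf_single(lam_a))
print("TMR, closed form:", mttf_tmr(lam_a, lam_v))
print("TMR, numerical:  ", mttf_numeric(r_tmr, 2e5, 20000))
```

Setting the voter failure rate to zero in `mttf_tmr` gives 5/(6λ_A), which is smaller than the single-component value 1/λ_A.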
Comparing Equations 11.22 and 11.24, even making the optimistic assumption that the voter never fails (i.e., that $\lambda_v = 0$), the MTTF for the single component, $1/\lambda_A$, is greater than that of the TMR product, $5/(6\lambda_A)$. This implies that MTTF is not always the best metric to use in comparing products. Recall that the TMR product had a higher reliability for small t, but a lower reliability for large t.

11.2.2 Hazard Rate
Another metric applicable to a product whose failure probability is time dependent is the hazard rate, h(t). The hazard rate, or instantaneous failure rate, is the rate of failure of a product or component that has survived until at least time t. That is, for small Δt, h(t)Δt is approximately the conditional probability that a component will fail in the time interval (t, t + Δt), given that it has survived until time t:

$$h(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\,\frac{F(t + \Delta t) - F(t)}{R(t)} = \frac{f(t)}{R(t)} \tag{11.25}$$
For a component whose time to failure is exponentially distributed, the hazard rate is constant:

$$h_i(t) = \frac{f(t)}{R(t)} = \frac{\lambda_i e^{-\lambda_i t}}{e^{-\lambda_i t}} = \lambda_i \tag{11.26}$$

The exponential distribution is the only one with a constant hazard rate. If the time to failure for a component is uniformly distributed between a and b, then the hazard rate increases as t approaches b:

$$h_i(t) = \frac{f(t)}{R(t)} = \frac{\dfrac{1}{b - a}}{\dfrac{b - t}{b - a}} = \frac{1}{b - t}, \qquad a \le t < b \tag{11.27}$$
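The limit in Equation 11.25 can be approximated with a small finite time step and compared with the closed forms of Equations 11.26 and 11.27. This is a rough numerical sketch; the function names, the step size, and the parameter values are illustrative assumptions.

```python
import math

def hazard_from_cdf(cdf, t, dt=1e-6):
    """Finite-difference version of Equation 11.25 with R(t) = 1 - F(t)."""
    survival = 1.0 - cdf(t)
    return (cdf(t + dt) - cdf(t)) / (dt * survival)

def exponential_cdf(lam):
    """CDF of an exponential time to failure with rate lam."""
    return lambda t: 1.0 - math.exp(-lam * t)

def uniform_cdf(a, b):
    """CDF of a time to failure uniformly distributed on [a, b]."""
    return lambda t: min(max((t - a) / (b - a), 0.0), 1.0)

exp_cdf = exponential_cdf(0.5)
uni_cdf = uniform_cdf(0.0, 10.0)
for t in (1.0, 5.0, 9.0):
    print(f"t={t}: exponential h={hazard_from_cdf(exp_cdf, t):.6f}, "
          f"uniform h={hazard_from_cdf(uni_cdf, t):.6f}")
```

The exponential hazard stays at the rate parameter for every t, while the uniform hazard grows as 1/(b - t) and becomes unbounded as t approaches b.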
The determination of the hazard function leads to an alternative definition of the reliability. Integrating both sides of Equation 11.25, noting that $f(x) = -R'(x)$ and assuming that $R(0) = 1$, yields

$$\int_0^t h(x)\,dx = \int_0^t \frac{f(x)}{R(x)}\,dx = -\int_0^t \frac{R'(x)}{R(x)}\,dx = -\ln R(t) \tag{11.28}$$

Then,

$$R(t) = e^{-\int_0^t h(x)\,dx} \tag{11.29}$$
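As a quick consistency check on Equation 11.29 (a worked example added here, not part of the handbook's derivation), take the hazard rate of the uniform distribution from Equation 11.27, with $h(x) = 0$ for $x < a$ and assuming $a \ge 0$:

```latex
\begin{aligned}
\int_0^t h(x)\,dx &= \int_a^t \frac{dx}{b - x} = \ln\frac{b - a}{b - t}, \qquad a \le t < b, \\
R(t) &= \exp\!\left(-\ln\frac{b - a}{b - t}\right) = \frac{b - t}{b - a}.
\end{aligned}
```

This reproduces the reliability function $(b - t)/(b - a)$ that appears in the denominator of Equation 11.27. For a constant hazard rate $\lambda$, the same formula gives $R(t) = e^{-\lambda t}$, in agreement with Equation 11.19.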
11.3 DYNAMIC REDUNDANCY: MARKOV MODELING
Products that can automatically reconfigure themselves upon the failure of a component employ dynamic redundancy. For example, redundant units may remain unpowered until needed, switching into active operation when the primary unit fails. Figure 11.17 is an example of a product that employs standby sparing. Component $A_P$ is the primary component that is initially operational. When $A_P$ fails, then $A_S$, the spare unit, is switched into active operation.

Figure 11.17 Standby sparing (primary unit $A_P$ and spare unit $A_S$ connected through a switch).

Another example of dynamic redundancy is a TMR/simplex product in which a TMR product is reconfigured when the first failure occurs. In a standard TMR product, after one of the three components fails, the product fails when one of the two remaining components fails. If one of the two remaining components is discarded, as in a TMR/simplex product, the product will still fail on the next component failure; however, this second failure occurs with a lower probability than in the previous case. That is, the probability of one component failing is higher when there are two active components than when there is only one. NMR products can be dynamically reconfigured after failure by removing one good component each time one fails (thus keeping an odd number of active components).

The analysis of such reconfigurable products is substantially more complicated than the analysis of products employing static redundancy because the failure criteria depend on the order in which failures occur, rather than simply on the combinations. For example, in the standby redundant product shown in Figure 11.17, the product remains operational as long as the primary unit is operational. If the primary unit fails, then the product remains operational only if the switch has not already failed, because only then can the spare unit be switched on. If the switch fails after the spare is switched on, then the product remains operational (assuming that a switch failure means that the position of the switch cannot be changed). The combination of primary failure and switch failure can sometimes, but not always, cause product failure, depending on the order in which the failures occur.

Assuming that the times to failure for all components are exponentially distributed, we can use a Markov chain to evaluate the reliability of products that employ dynamic redundancy. Pictorially, a Markov chain is composed of circles that represent states of the product; these are connected by arcs representing events (usually failures) that alter the state of the product. An arc is labeled with the symbolic rate at which the event occurs. A Markov chain gives rise to a set of linear, ordinary differential equations. Let $P_i(t)$ be the probability of being in state i of a Markov chain at time t. The reliability of the product is given by the sum of the state probabilities for the operational states:

$$R(t) = \sum_{i \,\in\, \text{operational states}} P_i(t) \tag{11.30}$$
In the following subsections, Markov chains are used to analyze several example products.

11.3.1 Standby Sparing

First consider the standby sparing product shown in Figure 11.17. Assume that the failure rate for the primary unit and the spare unit (once activated) is $\lambda_P$ and that the failure rate for the switch is $\lambda_S$. A Markov chain representation of this product is shown in Figure 11.18. State 1 is the initial state of the product, in which the primary component is operating. Two events can issue from the initial state, each leading to a different subsequent state. If the primary unit fails first, then the product moves to state 2, in which the spare unit has been switched into active operation. If, in state 1, the switch fails first, then the primary unit is still operational, and the product moves to state 3. From state 2, the
Figure 11.18 Markov chain representation of standby system.
product moves to the failure state when the active unit (formerly the spare) fails. If the switch fails from state 2, the reliability of the product is not affected, so this event is not included in the Markov chain. From state 3, the product fails when the primary unit fails because the spare unit cannot be switched into active operation. The reliability of the product is determined by the probability that the product will be in state 1, 2, or 3. The unreliability depends on the probability that the product will be in state F.

The set of equations associated with the Markov chain in Figure 11.18 is as follows:

$$
\begin{aligned}
\frac{d}{dt}P_1(t) &= -(\lambda_P + \lambda_S)\,P_1(t) \\
\frac{d}{dt}P_2(t) &= \lambda_P P_1(t) - \lambda_P P_2(t) \\
\frac{d}{dt}P_3(t) &= \lambda_S P_1(t) - \lambda_P P_3(t) \\
\frac{d}{dt}P_F(t) &= \lambda_P P_2(t) + \lambda_P P_3(t)
\end{aligned}
\tag{11.31}
$$

This set of equations is most easily solved by using Laplace transforms. The first state is the initial state; that is, $P_1(0) = 1$ and $P_i(0) = 0$ for $i \neq 1$. Taking the Laplace transform of both sides, where $L_i(s)$ is the Laplace transform of $P_i(t)$, yields the following set of algebraic equations:

$$
\begin{aligned}
sL_1(s) - 1 &= -(\lambda_P + \lambda_S)\,L_1(s) \\
sL_2(s) &= \lambda_P L_1(s) - \lambda_P L_2(s) \\
sL_3(s) &= \lambda_S L_1(s) - \lambda_P L_3(s) \\
sL_F(s) &= \lambda_P L_2(s) + \lambda_P L_3(s)
\end{aligned}
\tag{11.32}
$$

Solving this set of equations for the $L_i$ terms yields

$$
\begin{aligned}
L_1(s) &= \frac{1}{s + \lambda_P + \lambda_S} \\
L_2(s) &= \frac{\lambda_P}{\lambda_S}\left(\frac{1}{s + \lambda_P} - \frac{1}{s + \lambda_P + \lambda_S}\right) \\
L_3(s) &= \frac{1}{s + \lambda_P} - \frac{1}{s + \lambda_P + \lambda_S} \\
L_F(s) &= \frac{1}{s} - \frac{\lambda_P + \lambda_S}{\lambda_S}\,\frac{1}{s + \lambda_P} + \frac{\lambda_P}{\lambda_S}\,\frac{1}{s + \lambda_P + \lambda_S}
\end{aligned}
\tag{11.33}
$$
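Table 11.1 Some Useful Laplace Transform Pairs

Laplace transform $F(s)$ | $f(t)$, $t > 0$
$\dfrac{c}{s}$ | The constant $c$
$\dfrac{1}{s+a}$ | $e^{-at}$
$\dfrac{1}{(s+a)(s+b)}$ | $\dfrac{1}{b-a}\left(e^{-at} - e^{-bt}\right)$
$\dfrac{1}{(s+a)(s+b)(s+c)}$ | $\dfrac{e^{-at}}{(b-a)(c-a)} + \dfrac{e^{-bt}}{(a-b)(c-b)} + \dfrac{e^{-ct}}{(a-c)(b-c)}$
$\dfrac{1}{(s+a)(s+b)(s+c)(s+d)}$ | $\dfrac{e^{-at}}{(b-a)(c-a)(d-a)} + \dfrac{e^{-bt}}{(a-b)(c-b)(d-b)} + \dfrac{e^{-ct}}{(a-c)(b-c)(d-c)} + \dfrac{e^{-dt}}{(a-d)(b-d)(c-d)}$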
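Inverting the Laplace transform gives the state probabilities (Table 11.1 lists several useful Laplace transform pairs):

$$
\begin{aligned}
P_1(t) &= e^{-(\lambda_P + \lambda_S)t} \\
P_2(t) &= \frac{\lambda_P}{\lambda_S}\left(e^{-\lambda_P t} - e^{-(\lambda_P + \lambda_S)t}\right) \\
P_3(t) &= e^{-\lambda_P t} - e^{-(\lambda_P + \lambda_S)t} \\
P_F(t) &= 1 - \frac{\lambda_P + \lambda_S}{\lambda_S}\,e^{-\lambda_P t} + \frac{\lambda_P}{\lambda_S}\,e^{-(\lambda_P + \lambda_S)t}
\end{aligned}
\tag{11.34}
$$

Thus, the reliability of the standby product shown in Figure 11.17 is given by

$$R(t) = \frac{\lambda_P + \lambda_S}{\lambda_S}\,e^{-\lambda_P t} - \frac{\lambda_P}{\lambda_S}\,e^{-(\lambda_P + \lambda_S)t} \tag{11.35}$$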
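The closed-form results for the cold-standby product can be checked numerically. The following Python sketch evaluates the state probabilities and the reliability expression; the function names and the illustrative rates (a primary failure rate of 10^-4 per hour and a switch failure rate of 10^-6 per hour) are assumptions made here for the example and do not come from the text.

```python
import math


def standby_state_probs(t, lam_p, lam_s):
    """Return (P1, P2, P3, PF) for the cold-standby Markov chain at time t."""
    e_p = math.exp(-lam_p * t)             # e^{-lambda_P t}
    e_ps = math.exp(-(lam_p + lam_s) * t)  # e^{-(lambda_P + lambda_S) t}
    p1 = e_ps
    p2 = (lam_p / lam_s) * (e_p - e_ps)
    p3 = e_p - e_ps
    return p1, p2, p3, 1.0 - (p1 + p2 + p3)


def standby_reliability(t, lam_p, lam_s):
    """Closed-form reliability of the standby product."""
    return ((lam_p + lam_s) / lam_s) * math.exp(-lam_p * t) \
        - (lam_p / lam_s) * math.exp(-(lam_p + lam_s) * t)


if __name__ == "__main__":
    lam_p, lam_s = 1e-4, 1e-6  # assumed illustrative rates, per hour
    for hours in (100.0, 1000.0, 10000.0):
        print(hours, standby_reliability(hours, lam_p, lam_s))
```

The sum of the three operational state probabilities should agree with the closed-form reliability, and the result should exceed the reliability of a single unit, exp(-lam_p * t), for any t > 0.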
For the same standby product, suppose that the spare unit is partially energized and thus can fail before being switched into operation. Such a spare is often termed a warm spare in contrast to a cold spare, which cannot fail before being activated, and a hot spare, which is fully activated and fails at the same rate as the primary unit. A Markov model of a standby product with a warm spare is shown in Figure 11.19. This Markov chain is similar to the one in Figure 11.18, but the rate of transition
Figure 11.19 A Markov model of a system with a warm spare.
from state 1 to state 3 has increased from $\lambda_S$ (representing switch failure) to $\lambda_S + \lambda_W$ (representing switch or spare failure). In the product with a warm spare, if the spare fails (at rate $\lambda_W$) before the primary unit, then the behavior is the same as if the switch had failed, in that the product fails when the primary unit fails. The reliability of the standby product with a warm spare is given by

$$R(t) = \frac{\lambda_P + \lambda_S + \lambda_W}{\lambda_S + \lambda_W}\,e^{-\lambda_P t} - \frac{\lambda_P}{\lambda_S + \lambda_W}\,e^{-(\lambda_P + \lambda_S + \lambda_W)t} \tag{11.36}$$

where $(\lambda_S + \lambda_W)$ is substituted for $\lambda_S$ in Equation 11.35.

11.3.2 TMR/Simplex Product

As a second example, consider the TMR/simplex product, a TMR product that reconfigures to a single component on the occurrence of the first failure. The Markov chain representation of this product is shown in Figure 11.20. The Laplace transform equations are

$$
\begin{aligned}
sL_1(s) - 1 &= -(3\lambda_A + \lambda_v)\,L_1(s) \\
sL_2(s) &= 3\lambda_A L_1(s) - (\lambda_A + \lambda_v)\,L_2(s) \\
sL_F(s) &= (\lambda_A + \lambda_v)\,L_2(s) + \lambda_v L_1(s)
\end{aligned}
\tag{11.37}
$$
Figure 11.20 Markov model of a TMR/simplex system.
Figure 11.21 Reliability comparison of TMR, TMR/simplex, and single component systems. [Plot of product reliability (0 to 1) versus time in hours (10 to 100,000) with three curves: TMR, TMR/simplex, and single component.]
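The state probabilities are given by

$$
\begin{aligned}
P_1(t) &= e^{-(3\lambda_A + \lambda_v)t} \\
P_2(t) &= \frac{3}{2}\left(e^{-(\lambda_A + \lambda_v)t} - e^{-(3\lambda_A + \lambda_v)t}\right)
\end{aligned}
\tag{11.38}
$$

The reliability of the product is

$$R_{\text{TMR/simplex}}(t) = \frac{3}{2}\,e^{-(\lambda_A + \lambda_v)t} - \frac{1}{2}\,e^{-(3\lambda_A + \lambda_v)t} \tag{11.39}$$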
Figure 11.21 compares the time-dependent reliability of a TMR/simplex product with a standard TMR product and with a single nonredundant component. As before, we assume that the components fail at a rate of $\lambda_A = 10^{-4}$ per hour and that the voter fails at a rate of $\lambda_v = 10^{-6}$ per hour. The reconfigurable product achieves a consistently higher reliability than both the standard TMR product and the single component. If the voter were less reliable, however, it is possible that the TMR/simplex product would be less reliable than a single component. This can be easily demonstrated by a case where the voter itself is less reliable than the single component ($\lambda_v > \lambda_A$). Because the voter can be considered, from a reliability standpoint, to be connected in series with the three components, and the reliability of the three components can never be greater than one, in this case the voting product can never be more reliable than the single component.
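The comparison can be reproduced numerically. The sketch below evaluates the three reliability functions at the rates quoted above. It assumes that the standard TMR product with a voter has reliability exp(-lam_v t) (3 exp(-2 lam_a t) - 2 exp(-3 lam_a t)), which is the usual two-out-of-three expression with exponential components; the function names are illustrative only.

```python
import math


def r_single(t, lam_a):
    """Reliability of a single nonredundant component."""
    return math.exp(-lam_a * t)


def r_tmr(t, lam_a, lam_v):
    """Standard TMR with a voter in series (assumed 2-out-of-3 form)."""
    p = math.exp(-lam_a * t)
    return math.exp(-lam_v * t) * (3.0 * p**2 - 2.0 * p**3)


def r_tmr_simplex(t, lam_a, lam_v):
    """TMR/simplex reliability: 1.5 e^{-(lam_a+lam_v)t} - 0.5 e^{-(3lam_a+lam_v)t}."""
    return 1.5 * math.exp(-(lam_a + lam_v) * t) \
        - 0.5 * math.exp(-(3.0 * lam_a + lam_v) * t)


if __name__ == "__main__":
    lam_a, lam_v = 1e-4, 1e-6  # rates quoted in the text, per hour
    for hours in (10.0, 100.0, 1000.0, 10000.0):
        print(hours,
              r_single(hours, lam_a),
              r_tmr(hours, lam_a, lam_v),
              r_tmr_simplex(hours, lam_a, lam_v))
```

Setting lam_v larger than lam_a in the same functions illustrates the case in which the voter dominates and the voting product falls below the single component.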
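The MTTF of the TMR/simplex product can be calculated by Equation 11.21:

$$\text{MTTF}_{\text{TMR/simplex}} = \int_0^{\infty}\left(\frac{3}{2}\,e^{-(\lambda_A + \lambda_v)t} - \frac{1}{2}\,e^{-(3\lambda_A + \lambda_v)t}\right)dt = \frac{3}{2(\lambda_A + \lambda_v)} - \frac{1}{2(3\lambda_A + \lambda_v)} \tag{11.40}$$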
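To compare the MTTF of the TMR/simplex product with that of a standard TMR product and with a single component, assume that the voter does not fail ($\lambda_v = 0$). Under this assumption, the mean times to failure are given by

$$
\begin{aligned}
\text{MTTF}_{\text{TMR/simplex}} &= \frac{3}{2\lambda_A} - \frac{1}{6\lambda_A} = \frac{4}{3\lambda_A} \\
\text{MTTF}_{\text{TMR}} &= \frac{3}{2\lambda_A} - \frac{2}{3\lambda_A} = \frac{5}{6\lambda_A} \\
\text{MTTF}_{A} &= \frac{1}{\lambda_A}
\end{aligned}
\tag{11.41}
$$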
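As a quick numerical illustration of these three expressions, the following Python sketch evaluates them for an assumed component failure rate of 10^-4 per hour (the value used earlier for Figure 11.21). The function names are hypothetical, and the optional voter rate in the TMR/simplex function simply applies Equation 11.40.

```python
def mttf_single(lam_a):
    """MTTF of one component with constant failure rate lam_a."""
    return 1.0 / lam_a


def mttf_tmr(lam_a):
    """MTTF of standard TMR with a perfect voter: 5 / (6 lam_a)."""
    return 3.0 / (2.0 * lam_a) - 2.0 / (3.0 * lam_a)


def mttf_tmr_simplex(lam_a, lam_v=0.0):
    """MTTF of TMR/simplex (Equation 11.40); lam_v = 0 means a perfect voter."""
    return 1.5 / (lam_a + lam_v) - 0.5 / (3.0 * lam_a + lam_v)


if __name__ == "__main__":
    lam_a = 1e-4  # assumed rate, per hour
    print("single      :", mttf_single(lam_a))
    print("TMR         :", mttf_tmr(lam_a))
    print("TMR/simplex :", mttf_tmr_simplex(lam_a))
```

With a perfect voter, the ordering is TMR below the single component and TMR/simplex above it, matching the ratios 5/6, 1, and 4/3.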
Assuming a perfect voter, the MTTF of the TMR/simplex is not only better than that of the standard TMR product, but also better than that of the single component.

11.3.3 Repairable Products

If a product can be repaired after failure, then reliability may not be the most appropriate measure. Because the reliability of a product is defined as the probability that the product has not failed in the interval [0,t], the fact that the product is repairable is not reflected in the reliability measure. A more useful measure of the effectiveness of a repairable product is the availability (A(t)) of the product, defined as the probability that the product is operational at time t. Steady-state availability (A) is the time-independent probability that the product will be operational. Both A(t) and A acknowledge that the product can be up or down and thus reflect either the instantaneous or long-term probability that the product will be operational.

Products that are repairable can be modeled combinatorially or with Markov models, depending on the assumptions made about the repair facility. If we assume that the components of a product can each be repaired independently and if a combinatorial model is appropriate for assessing reliability (in the case of no repair), then the same model can be used for assessing availability. If, however, there are fewer repair technicians available than there are components, then a Markov model should be used. The following sections consider these two cases separately.
11.3.3.1 Independent Repair

If the repair processes for failed components are all independent (that is, there are sufficient repair facilities to avoid waiting lines or priorities), then a combinatorial model can be used to assess availability as well as reliability. Consider, for example, a simple TMR product in which the components and voter are both repairable. Recall that the reliability of the TMR product is given by

$$R_{\text{TMR}} = p_v \times \left[p_A^3 + 3p_A^2(1 - p_A)\right] = p_v \times \left(3p_A^2 - 2p_A^3\right) \tag{11.42}$$
If $av_A$ represents the long-term proportion of time that component A is operational, and $av_v$ is the similar measure for the voter, then the availability of the TMR product is determined simply by replacing the $p$ terms with the corresponding $av$ terms:

$$A_{\text{TMR}} = av_v \times \left(3\,av_A^2 - 2\,av_A^3\right) \tag{11.43}$$

The time-dependent availability of the TMR product can be determined from the time-dependent component availabilities, $av_A(t)$ and $av_v(t)$:

$$A_{\text{TMR}}(t) = av_v(t) \times \left(3\,av_A(t)^2 - 2\,av_A(t)^3\right) \tag{11.44}$$
The time-independent availabilities for individual components can be determined from the average lifetime of the component (MTTF) and the average time to repair (MTTR) the component:

$$av_A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \tag{11.45}$$

In the case where both time to failure and time to repair are exponentially distributed, with rates $\lambda_A$ and $\mu_A$, respectively, the availability of the component is given by

$$av_A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \frac{1/\lambda_A}{1/\lambda_A + 1/\mu_A} = \frac{\mu_A}{\lambda_A + \mu_A} \tag{11.46}$$
In the same case (exponentially distributed time to failure and time to repair), the time-dependent availability of a component can be determined from a Markov analysis. A simple Markov model showing the transitions between the operational and failed states for a single component is given in Figure 11.22. As before, the state probabilities can be determined by using Laplace transforms. The probability of being in state UP at time t gives the availability of the component:

$$
\begin{aligned}
sL_{UP}(s) - 1 &= -\lambda_A L_{UP}(s) + \mu_A L_{DOWN}(s) \\
sL_{DOWN}(s) &= -\mu_A L_{DOWN}(s) + \lambda_A L_{UP}(s)
\end{aligned}
\tag{11.47}
$$

Figure 11.22 A Markov model for a single repairable component.

$$
\begin{aligned}
L_{UP}(s) &= \frac{1 + \mu_A L_{DOWN}(s)}{s + \lambda_A} \\
L_{DOWN}(s) &= \frac{\lambda_A L_{UP}(s)}{s + \mu_A}
\end{aligned}
\tag{11.48}
$$
Substituting the expression for $L_{DOWN}$ into the expression for $L_{UP}$ and then solving for $L_{UP}$ gives

$$L_{UP}(s) = \frac{s + \mu_A}{s^2 + s\mu_A + s\lambda_A} = \frac{1}{s + \mu_A + \lambda_A} + \frac{\mu_A}{s\,(s + \lambda_A + \mu_A)} \tag{11.49}$$

Inverting the Laplace transform gives the state probability and, hence, the availability:

$$A(t) = P_{UP}(t) = \frac{\mu_A}{\mu_A + \lambda_A} + \frac{\lambda_A}{\mu_A + \lambda_A}\,e^{-(\lambda_A + \mu_A)t} \tag{11.50}$$

Taking the limit as t approaches infinity yields the time-independent (steady-state) availability.
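The availability results for independent repair lend themselves to a short numerical check. The sketch below is a minimal illustration with assumed values (a failure rate of 10^-4 per hour and a repair rate of 0.1 per hour, that is, a 10-hour mean time to repair); none of the names or numbers come from the text.

```python
import math


def availability(t, lam, mu):
    """Time-dependent availability of one repairable component."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def steady_state_availability(lam, mu):
    """Limit of availability(t) as t grows: mu / (lam + mu)."""
    return mu / (lam + mu)


def tmr_availability(av_a, av_v):
    """TMR availability from component and voter availabilities."""
    return av_v * (3.0 * av_a**2 - 2.0 * av_a**3)


if __name__ == "__main__":
    lam_a, mu_a = 1e-4, 0.1  # assumed component failure and repair rates
    lam_v, mu_v = 1e-6, 0.1  # assumed voter failure and repair rates
    av_a = steady_state_availability(lam_a, mu_a)
    av_v = steady_state_availability(lam_v, mu_v)
    print("component:", av_a)
    print("TMR      :", tmr_availability(av_a, av_v))
```

The steady-state value equals MTTF / (MTTF + MTTR) with MTTF = 1/lam and MTTR = 1/mu, so either parameterization can be used.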
11.3.3.2 Dependent Repair

If there is not a separate repair crew for each component, then the repair process is not independent because one component may have to wait before it can be repaired. Consider, for example, a TMR/simplex product in which there are two available repair technicians: one for the voter and one for the redundant components. A Markov model for such a product is shown in Figure 11.23, in which $\mu_A$ is the repair rate for a single failed component, and $\mu_v$ is the repair rate for the voter. (If there were two repair technicians available for the individual components so that they could be repaired simultaneously, then the rate from state FA to state 2 would be $2\mu_A$.) The steady-state availability of this product can be determined from the steady-state probability of being in either state 3 or 2.
Figure 11.23 A repairable TMR/simplex system with dependent repair.
The calculation of the steady-state probabilities for a Markov chain is most easily done via balance equations. In steady state, the state probabilities are not changing (dP/dt = 0) and the total flow into a state is equal to the total flow out of a state. Equating the inbound and outbound flows gives rise to a set of balanced algebraic equations that can be solved for the state probabilities. For the Markov chain in Figure 11.23, the balance equations are given by

$$
\begin{aligned}
(3\lambda_A + \lambda_v)\,P_3 &= \mu_A P_2 + \mu_v P_{F1} \\
(\lambda_A + \mu_A + \lambda_v)\,P_2 &= 3\lambda_A P_3 + \mu_A P_{FA} + \mu_v P_{F2} \\
\lambda_v P_3 &= \mu_v P_{F1} \\
\lambda_v P_2 &= \mu_v P_{F2} \\
\lambda_A P_2 &= \mu_A P_{FA}
\end{aligned}
\tag{11.51}
$$

Solving this set of equations in terms of $P_3$ results in

$$
\begin{aligned}
P_3 &= P_3 \\
P_2 &= \frac{3\lambda_A}{\mu_A}\,P_3 \\
P_{F1} &= \frac{\lambda_v}{\mu_v}\,P_3 \\
P_{F2} &= \frac{3\lambda_A \lambda_v}{\mu_A \mu_v}\,P_3 \\
P_{FA} &= \frac{3\lambda_A^2}{\mu_A^2}\,P_3
\end{aligned}
\tag{11.52}
$$

Using the fact that $P_3 + P_2 + P_{F1} + P_{F2} + P_{FA} = 1$ yields the following for $P_3$:

$$P_3 = \frac{\mu_A^2 \mu_v}{\mu_A^2 \mu_v + 3\lambda_A \mu_A \mu_v + \lambda_v \mu_A^2 + 3\lambda_A \lambda_v \mu_A + 3\lambda_A^2 \mu_v} \tag{11.53}$$

from which the other state probabilities can be determined. The availability of the product is then given by the sum $A = P_3 + P_2$ because these two states represent operational configurations of the product:

$$A = \frac{\mu_A^2 \mu_v + 3\lambda_A \mu_A \mu_v}{\mu_A^2 \mu_v + 3\lambda_A \mu_A \mu_v + \lambda_v \mu_A^2 + 3\lambda_A \lambda_v \mu_A + 3\lambda_A^2 \mu_v} \tag{11.54}$$
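The balance-equation solution can be verified numerically. The following Python sketch builds the steady-state probabilities from their ratios to the probability of state 3, normalizes them, and compares the resulting availability with the closed-form expression. The state labels follow Figure 11.23; the function names and the rates used (failure rates of 10^-4 and 10^-6 per hour, repair rates of 0.1 per hour) are assumptions chosen only for illustration.

```python
def steady_state_probs(lam_a, lam_v, mu_a, mu_v):
    """Steady-state probabilities of the repairable TMR/simplex chain."""
    # Each entry is the ratio of a state probability to P3.
    relative = {
        "3": 1.0,
        "2": 3.0 * lam_a / mu_a,
        "F1": lam_v / mu_v,
        "F2": 3.0 * lam_a * lam_v / (mu_a * mu_v),
        "FA": 3.0 * lam_a**2 / mu_a**2,
    }
    total = sum(relative.values())  # normalization: probabilities sum to 1
    return {state: value / total for state, value in relative.items()}


def availability_closed_form(lam_a, lam_v, mu_a, mu_v):
    """Closed-form steady-state availability, A = P3 + P2."""
    up = mu_a**2 * mu_v + 3.0 * lam_a * mu_a * mu_v
    down = lam_v * mu_a**2 + 3.0 * lam_a * lam_v * mu_a + 3.0 * lam_a**2 * mu_v
    return up / (up + down)


if __name__ == "__main__":
    rates = (1e-4, 1e-6, 0.1, 0.1)  # assumed lam_a, lam_v, mu_a, mu_v
    probs = steady_state_probs(*rates)
    print("A from states :", probs["3"] + probs["2"])
    print("A closed form :", availability_closed_form(*rates))
```

Substituting the normalized probabilities back into the balance equations is a useful sanity check whenever the chain is modified, for example by changing the number of repair technicians.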
11.4 DEPENDENT FAILURES
In earlier sections, it has been assumed that component failures occur independently in a product. In several situations, however, this independence assumption is not valid. A common-mode failure occurs when the failure of one component causes the failure of other components. One example is the failure of a power supply to which several components are connected; if the power supply fails, the connected components are rendered useless. Another type of failure dependency arises when the failure of one component increases the probability that other components will fail; this occurs when several components share a load and the increased load that results from a component failure places additional stresses on the remaining components. A third type of failure dependency is associated with components that can fail in more than one manner; for example, a diode that can fail short or open. Such a component is said to exhibit multiple failure modes, or multimode failures.

11.4.1 Common-Mode Failures

The analysis of products that are subject to common-mode failures is relatively straightforward, using a combinatorial model or a Markov model, whichever is more appropriate. Consider the simple redundant product in Figure 11.2. Suppose that components A1 and B2 are connected to the same power supply, C1, and that A2 and B1 are connected to another power supply, C2. If a power supply fails, then the components connected to it cannot operate and they essentially fail. Including the power supplies in the reliability block diagram is effected by replacing each occurrence of a component with a series connection of the component and its power supply, as in Figure 11.24. Note that the resulting block diagram is no longer a simple series-parallel product because components appear on more than one path. In a fault-tree representation of the product, each basic event representing a component is replaced with an OR gate, with the component and its power
Figure 11.24 Including power supply in model of simple redundancy. [Two panels, System-Level Redundancy and Component-Level Redundancy, each showing components A1, B1, A2, and B2 in series with their power supplies C1 and C2, followed by the reduced diagrams.]
supply as input. For example, the fault tree in Figure 11.13 becomes the fault tree in Figure 11.25. When dynamic redundancy is used or the product has repair dependencies, necessitating a Markov model, introducing common-mode failures to the model is simpler. From a given origin state, an arc is added to the state that results when the common-mode component fails and is labeled with the failure rate of the common mode component. In the TMR/simplex product, for example, if the three redundant components are connected to the same power supply, an additional arc between the initial state (state 1) and the failure state (state F) is added. In such a case, the redundancy is rendered useless by the failure of the power supply; in fact, from a reliability standpoint, the failure of the power supply has the same effect as the failure of the voter.
Figure 11.25 Including power supply in fault-tree model.
11.4.2 Dependent Failure Rate

In some products, the failure probability for a component changes upon the failure of some other component. Such a case may arise, for example, when two redundant components share a load. When one fails, the other must work proportionately harder and may experience an increased probability of failure. Such a situation is handled most easily in a Markov model because the transition rates placed on the arcs can differ according to the origin state. For example, consider the product whose reliability block diagram is shown in Figure 11.26. Suppose, in addition to being redundant, that components A1 and A2 share some load, and that components B1 and B2 share another load; the components fail at rates $\lambda_{A1}$ and $\lambda_{B1}$, respectively, when both are operational. After one of the two A components has failed, the remaining one fails at rate $\lambda_{A2}$. Similarly, when only one B component remains, it fails at rate $\lambda_{B2}$. A Markov model of this product is shown in Figure 11.26, where the operational states are labeled with an ordered pair of integers telling how many A and B components remain. If the component failure rates were independent of the number of components in operation, then $\lambda_{A1}$ would equal $\lambda_{A2}$, and $\lambda_{B1}$ would equal $\lambda_{B2}$. In this case, a fault tree or reliability block diagram would have been simpler to use.

11.4.3 Multimode Failures

Some components can fail in more than one way; that is, there are more than the two possible states we have been assuming thus far. For example, the engines on a twin-engine aircraft might fail to half power or might lose power completely. Suppose that the aircraft can land safely as long as one engine is at full power or if both engines have at least half power. A reliability block diagram for such a product is shown in Figure 11.27, where $A_{iF}$ denotes that engine $i$ is at full power and $A_{iH}$ denotes that engine $i$ is at half power.
The components in this diagram are not independent, in that the states of component i are mutually exclusive: An
Figure 11.26 A Markov model of load sharing.
Figure 11.27 Block diagram of a system with multiple failure modes (blocks A1F, A2F, A1H, and A2H).
engine cannot be at full power and half power simultaneously. The dependence of the components in the reliability block diagram must be considered when the path sets are expanded. For this product, the minimum path sets are $\{A_{1F}\}$, $\{A_{2F}\}$, and $\{A_{1H}, A_{2H}\}$. Using the inclusion/exclusion algorithm gives the reliability of the product:

$$
\begin{aligned}
R &= \text{Prob}(A_{1F}) + \text{Prob}(A_{2F}) + \text{Prob}(A_{1H}, A_{2H}) - \text{Prob}(A_{1F}, A_{2F}) \\
&\quad - \text{Prob}(A_{1F}, A_{1H}, A_{2H}) - \text{Prob}(A_{2F}, A_{1H}, A_{2H}) + \text{Prob}(A_{1F}, A_{2F}, A_{1H}, A_{2H}) \\
&= 2\,\text{Prob}(A_F) + \text{Prob}^2(A_H) - \text{Prob}^2(A_F)
\end{aligned}
\tag{11.55}
$$

where we have assumed that the two engines are statistically identical. The zero terms (the last three) result because the components cannot be in two states simultaneously. If the probabilities for $A_F$, $A_H$, and $A_0$ are known, where $\text{Prob}(A_F) + \text{Prob}(A_H) + \text{Prob}(A_0) = 1$, we can substitute directly into Equation 11.55.

Suppose that, instead of fixed probabilities, we have the rate at which a full-power engine fails to half power ($\lambda_{FH}$), the rate at which a full-power engine fails to zero power ($\lambda_{F0}$), and the rate at which a half-power engine fails to zero power ($\lambda_{H0}$). We can use these rates in a Markov model (Figure 11.28) to determine the terms needed for Equation 11.55 (which would then become time dependent). The solution
Figure 11.28 Markov model of multiple modes of single engine.
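of the Markov chain shown in Figure 11.28 gives the probabilities that a component is in each of the three possible states:

$$
\begin{aligned}
\text{Prob}[A_F(t)] &= P_F(t) = e^{-(\lambda_{FH} + \lambda_{F0})t} \\
\text{Prob}[A_H(t)] &= P_H(t) = \frac{\lambda_{FH}}{\lambda_{H0} - \lambda_{FH} - \lambda_{F0}}\left[e^{-(\lambda_{FH} + \lambda_{F0})t} - e^{-\lambda_{H0}t}\right] \\
\text{Prob}[A_0(t)] &= P_0(t) = 1 - P_F(t) - P_H(t) \\
&= 1 - \frac{\lambda_{H0} - \lambda_{F0}}{\lambda_{H0} - \lambda_{FH} - \lambda_{F0}}\,e^{-(\lambda_{FH} + \lambda_{F0})t} + \frac{\lambda_{FH}}{\lambda_{H0} - \lambda_{FH} - \lambda_{F0}}\,e^{-\lambda_{H0}t}
\end{aligned}
\tag{11.56}
$$

Using Equation 11.55 results in the probability of a successful flight:

$$
\begin{aligned}
R(t) &= 2\,\text{Prob}[A_F(t)] + \text{Prob}^2[A_H(t)] - \text{Prob}^2[A_F(t)] \\
&= 2e^{-(\lambda_{FH} + \lambda_{F0})t} + \frac{\lambda_{FH}^2}{(\lambda_{H0} - \lambda_{FH} - \lambda_{F0})^2}\left[e^{-(\lambda_{FH} + \lambda_{F0})t} - e^{-\lambda_{H0}t}\right]^2 - e^{-2(\lambda_{FH} + \lambda_{F0})t}
\end{aligned}
\tag{11.57}
$$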
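The single-engine state probabilities and the resulting flight reliability can be evaluated directly. The Python sketch below does so and cross-checks the inclusion/exclusion result against a direct enumeration of the failed engine-state combinations (half/zero, zero/half, and zero/zero). The three rates are hypothetical values chosen for illustration, and the closed forms require that lam_h0 differ from lam_fh + lam_f0.

```python
import math


def engine_state_probs(t, lam_fh, lam_f0, lam_h0):
    """Return (P_full, P_half, P_zero) for one engine at time t."""
    a = lam_fh + lam_f0  # total rate of leaving the full-power state
    p_full = math.exp(-a * t)
    p_half = lam_fh / (lam_h0 - a) * (math.exp(-a * t) - math.exp(-lam_h0 * t))
    return p_full, p_half, 1.0 - p_full - p_half


def flight_reliability(t, lam_fh, lam_f0, lam_h0):
    """Probability of a successful flight: 2 P_F + P_H^2 - P_F^2."""
    p_full, p_half, _ = engine_state_probs(t, lam_fh, lam_f0, lam_h0)
    return 2.0 * p_full + p_half**2 - p_full**2


def flight_reliability_by_enumeration(t, lam_fh, lam_f0, lam_h0):
    """Same quantity as 1 minus the probability of the failed combinations."""
    _, p_half, p_zero = engine_state_probs(t, lam_fh, lam_f0, lam_h0)
    return 1.0 - (2.0 * p_half * p_zero + p_zero**2)


if __name__ == "__main__":
    rates = (1e-4, 5e-5, 3e-4)  # assumed lam_fh, lam_f0, lam_h0, per hour
    for hours in (1.0, 10.0, 100.0):
        print(hours, flight_reliability(hours, *rates))
```

The enumeration treats the two engines as independent and statistically identical, which is the same assumption made in deriving the combinatorial expression.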
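Suppose, however, that a combinatorial model is not appropriate for this product. Multiple failure modes can easily be incorporated into a Markov model as well. Figure 11.29 shows the Markov model of the product with two engines. In state 2F, both engines are at full power; in FH one is at full power and one at half; in 2H both are at half power; in F0 one is at full power and the other at zero, and in state 0 there is insufficient power and the product has failed. Generating, solving, and inverting the Laplace transforms gives the probabilities for each operational state. The Laplace transforms for the operational states in the Markov chain in Figure 11.29 are given in the following:

$$
\begin{aligned}
L_{2F}(s) &= \frac{1}{s + 2\lambda_{FH} + 2\lambda_{F0}} \\
sL_{FH}(s) &= 2\lambda_{FH} L_{2F}(s) - (\lambda_{FH} + \lambda_{H0} + \lambda_{F0})\,L_{FH}(s) \\
L_{FH}(s) &= \frac{2\lambda_{FH}}{(s + 2\lambda_{FH} + 2\lambda_{F0})(s + \lambda_{FH} + \lambda_{H0} + \lambda_{F0})} \\
sL_{2H}(s) &= \lambda_{FH} L_{FH}(s) - 2\lambda_{H0} L_{2H}(s) \\
L_{2H}(s) &= \frac{2\lambda_{FH}^2}{(s + 2\lambda_{FH} + 2\lambda_{F0})(s + \lambda_{FH} + \lambda_{H0} + \lambda_{F0})(s + 2\lambda_{H0})} \\
sL_{F0}(s) &= 2\lambda_{F0} L_{2F}(s) + \lambda_{H0} L_{FH}(s) - (\lambda_{FH} + \lambda_{F0})\,L_{F0}(s) \\
L_{F0}(s) &= \frac{2\lambda_{F0}}{(s + \lambda_{FH} + \lambda_{F0})(s + 2\lambda_{FH} + 2\lambda_{F0})} \\
&\quad + \frac{2\lambda_{FH}\lambda_{H0}}{(s + 2\lambda_{FH} + 2\lambda_{F0})(s + \lambda_{FH} + \lambda_{H0} + \lambda_{F0})(s + \lambda_{FH} + \lambda_{F0})}
\end{aligned}
\tag{11.58}
$$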
Figure 11.29 Markov model of twin-engine system with multiple failure modes.
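Inverting the Laplace transforms results in the state probabilities:

$$
\begin{aligned}
P_{2F}(t) &= e^{-2(\lambda_{FH} + \lambda_{F0})t} \\
P_{FH}(t) &= \frac{2\lambda_{FH}}{\lambda_{H0} - \lambda_{FH} - \lambda_{F0}}\left[e^{-2(\lambda_{FH} + \lambda_{F0})t} - e^{-(\lambda_{FH} + \lambda_{H0} + \lambda_{F0})t}\right] \\
P_{2H}(t) &= \frac{\lambda_{FH}^2}{(\lambda_{H0} - \lambda_{FH} - \lambda_{F0})^2}\left[e^{-2(\lambda_{FH} + \lambda_{F0})t} - 2e^{-(\lambda_{FH} + \lambda_{H0} + \lambda_{F0})t} + e^{-2\lambda_{H0}t}\right] \\
P_{F0}(t) &= \frac{2\lambda_{F0}}{\lambda_{FH} + \lambda_{F0}}\left[e^{-(\lambda_{FH} + \lambda_{F0})t} - e^{-2(\lambda_{FH} + \lambda_{F0})t}\right] \\
&\quad + 2\lambda_{H0}\lambda_{FH}\left[-\frac{e^{-2(\lambda_{FH} + \lambda_{F0})t}}{(\lambda_{H0} - \lambda_{FH} - \lambda_{F0})(\lambda_{FH} + \lambda_{F0})} + \frac{e^{-(\lambda_{FH} + \lambda_{H0} + \lambda_{F0})t}}{\lambda_{H0}(\lambda_{H0} - \lambda_{FH} - \lambda_{F0})} + \frac{e^{-(\lambda_{FH} + \lambda_{F0})t}}{\lambda_{H0}(\lambda_{FH} + \lambda_{F0})}\right]
\end{aligned}
\tag{11.59}
$$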
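The agreement between the Markov and combinatorial results can be confirmed numerically. The following Python sketch evaluates the four operational state probabilities of the twin-engine chain and compares their sum with the combinatorial expression 2 P_F + P_H^2 - P_F^2 built from the single-engine probabilities. The rates are hypothetical, and the formulas assume lam_h0 differs from lam_fh + lam_f0.

```python
import math


def twin_engine_state_probs(t, lam_fh, lam_f0, lam_h0):
    """Return (P_2F, P_FH, P_2H, P_F0) for the twin-engine Markov chain."""
    a = lam_fh + lam_f0            # rate at which one engine leaves full power
    d = lam_h0 - a                 # recurring denominator
    e1 = math.exp(-a * t)
    e2 = math.exp(-2.0 * a * t)
    e3 = math.exp(-(a + lam_h0) * t)
    e4 = math.exp(-2.0 * lam_h0 * t)
    p_2f = e2
    p_fh = 2.0 * lam_fh / d * (e2 - e3)
    p_2h = (lam_fh / d) ** 2 * (e2 - 2.0 * e3 + e4)
    p_f0 = 2.0 * lam_f0 / a * (e1 - e2) + 2.0 * lam_h0 * lam_fh * (
        -e2 / (d * a) + e3 / (lam_h0 * d) + e1 / (lam_h0 * a))
    return p_2f, p_fh, p_2h, p_f0


def combinatorial_reliability(t, lam_fh, lam_f0, lam_h0):
    """2 P_F + P_H^2 - P_F^2 from the single-engine state probabilities."""
    a = lam_fh + lam_f0
    p_full = math.exp(-a * t)
    p_half = lam_fh / (lam_h0 - a) * (math.exp(-a * t) - math.exp(-lam_h0 * t))
    return 2.0 * p_full + p_half**2 - p_full**2


if __name__ == "__main__":
    rates = (1e-4, 5e-5, 3e-4)  # assumed lam_fh, lam_f0, lam_h0, per hour
    for hours in (1.0, 100.0, 5000.0):
        markov = sum(twin_engine_state_probs(hours, *rates))
        print(hours, markov, combinatorial_reliability(hours, *rates))
```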
Summing the probabilities of the operational states given in Equation 11.59 results in the same reliability expression that was given in Equation 11.57.

11.5 COVERAGE MODELING FOR FAULT-TOLERANT COMPUTER PRODUCTS

Microprocessor products are used in a wide variety of everyday applications, from automobiles to kitchen appliances to sewing machines. Computer failures in such applications may be inconvenient but are rarely considered life threatening. Computer products are also used in critical applications such as in-flight aircraft control, nuclear power plant monitoring, chemical processes, or medical applications. Because a failure in a critical application of these products may result in loss of life or severe environmental or economic damage, high levels of reliability are required. This section considers the problem of assessing the reliability of fault-tolerant computer (FTC) products.
Technological advances have reduced costs to the point where fault-tolerant computer product designs can include a sufficient number of microprocessors to minimize the probability that all have failed. However, the product designer must be certain that faults and errors are detected promptly so that the redundant units can be used effectively. If a faulty unit is not reconfigured out of the product, it can produce incorrect results that contaminate the nonfaulty units. Reliability models of fault-tolerant computer products incorporate coverage factors to reflect the ability of the product to recover automatically from the occurrence of a fault during operation. A coverage factor is a probability that the product can automatically recover from a fault and its associated errors and thus continue normal operation.

An FTC product may fail to recover from a fault, even if spare units remain. For example, a fault may produce an undetected error, and the subsequent calculations or operations then operate on incorrect data, possibly leading to overall product failure. Even if an error is detected, the product may still be unable to recover because the fault could "confuse" the automatic recovery procedures into disabling the wrong component. A fault from which the product cannot automatically recover (an unrecovered fault) leads to immediate product failure, even if spare or redundant units remain.

11.5.1 Terminology

• Covered fault is a fault from which the product can automatically recover. The product can continue to operate, although possibly in a degraded mode.
• Error is the manifestation of a fault in the information processed by the computing product or in the internal product state.
• Failure is an unacceptable deviation from the anticipated delivered service; an incorrect output; the inability to perform the desired function.
• Fault is a defect, imperfection, or flaw within some hardware or software component.
• FTC is a fault-tolerant computer.
• Hardware fault is a fault that occurs in a hardware component during its lifetime; for example, a short circuit between two leads or an open-circuit transistor junction.
• Permanent hardware fault is a physical fault with lasting effects.
• Transient hardware fault is a physical fault of limited duration that causes no permanent hardware damage. Transient faults can be caused by excessive heat, power disruptions, or environmental influences.
• Uncovered fault is a fault from which the product cannot automatically recover. An uncovered fault leads to immediate product failure.
11.5.2 The Impact of Imperfect Coverage
EXAMPLE 11.1 MODEL WITH COVERAGE

An example FTC product has four processors. As long as no uncovered faults occur, the product can survive until the last processor fails; however, an uncovered fault in any processor causes immediate product failure. Figure 11.30 is a Markov model of the product, where $\lambda$ is the failure rate of a single unit and cp is the coverage probability.
Figure 11.30 A Markov model of a four-unit redundant system with coverage. [States 4, 3, 2, 1, and F. Arcs labeled 4λC, 3λC, and 2λC lead from each state to the next lower state, an arc labeled λ leads from state 1 to F, and arcs labeled 4λ(1-C), 3λ(1-C), and 2λ(1-C) lead from states 4, 3, and 2 directly to F.]
With a failure rate of $\lambda = 10^{-4}$ per hour and a coverage probability of cp = 0.99, the product failure probability is $3.98 \times 10^{-4}$ for a 100-h application. For the same product, if the coverage were perfect (cp = 1), the probability of failure would be considerably lower: $9.8 \times 10^{-9}$. Even a one-in-a-hundred possibility that a fault is uncovered affects reliability dramatically. If the product consisted of three units instead of four, the probability of failure (with cp = 0.99) would be $2.98 \times 10^{-4}$, which is better than the comparable case with four redundant units.
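The effect of imperfect coverage can be explored numerically. The sketch below integrates the differential equations of the coverage chain with a classical fourth-order Runge-Kutta scheme, which is a technique chosen here for self-containment and is not the method used in the text. It assumes that all working units are active (so the failure rate out of a state with k units is k times the unit rate) and that the failure of the last unit always causes product failure. Function and parameter names are hypothetical.

```python
def failure_probability(n_units, lam, coverage, t, steps=1000):
    """Probability that an n-unit product with imperfect coverage has failed by time t."""

    def derivative(p):
        # p[k] is the probability of having k working units; p[0] is state F.
        d = [0.0] * (n_units + 1)
        for k in range(1, n_units + 1):
            flow = k * lam * p[k]
            d[k] -= flow
            if k > 1:
                d[k - 1] += coverage * flow      # covered fault: one unit removed
                d[0] += (1.0 - coverage) * flow  # uncovered fault: product fails
            else:
                d[0] += flow                     # last unit fails
        return d

    def shifted(p, slope, scale):
        return [x + scale * y for x, y in zip(p, slope)]

    p = [0.0] * (n_units + 1)
    p[n_units] = 1.0  # all units working at time zero
    h = t / steps
    for _ in range(steps):
        k1 = derivative(p)
        k2 = derivative(shifted(p, k1, h / 2.0))
        k3 = derivative(shifted(p, k2, h / 2.0))
        k4 = derivative(shifted(p, k3, h))
        p = [x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p[0]


if __name__ == "__main__":
    print("4 units, c = 0.99:", failure_probability(4, 1e-4, 0.99, 100.0))
    print("4 units, c = 1.00:", failure_probability(4, 1e-4, 1.00, 100.0))
    print("3 units, c = 0.99:", failure_probability(3, 1e-4, 0.99, 100.0))
```

Under these assumptions the four-unit results come out close to the values quoted in the example, and the three-unit result is roughly 3 × 10^-4, again lower than the four-unit value with the same coverage.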
11.5.3 Some Coverage Models

The simple four-unit product in Example 11.1 in the preceding section demonstrated the dramatic effect that imperfect coverage can have on the reliability of a fault-tolerant product. How then can a reasonable value of cp be determined for use in the product model? Several methods have been used. If a working or prototype version of the product exists or if enough information is available about a product being designed, then a model of the recovery process can be developed. The parameters for the model can be measured from a working prototype or estimated from field data. Alternatively, a detailed simulation model of the product recovery process can be developed. If the details of the recovery process are not known, reasonable parameters can be deduced from other, similar products. This section concentrates on modeling the detailed behavior of the product when a fault occurs.

11.5.3.1 General Structure of a Coverage Model
Figure 11.31 shows the general structure of a coverage model. The entry point to the model is the occurrence of the fault, and the three exits (R, C, and S) are the three possible outcomes.

• R: transient restoration. Correct recognition of and recovery from a transient fault. A transient is usually caused by external or environmental factors, such as excessive heat or a glitch in the power line. The vast majority of faults are transient. Successful recovery from a transient fault restores the product to an operational state without discarding any components; for example, by masking the error, retrying an instruction, or rolling back to a previous checkpoint. Reaching this exit successfully requires
  • timely detection of an error produced by the fault;
  • performance of an effective recovery procedure; and
  • swift disappearance of the fault (the cause of the error).
• C: permanent coverage. Determination of the permanent nature of the fault and the successful isolation and removal of the faulty component.
• S: single-point failure. A single fault causes the product to fail, generally when an undetected error propagates through the product or when the faulty unit cannot be isolated and the product cannot be reconfigured.

Figure 11.31 General structure of a coverage model. [The fault/error handling model is entered when a fault occurs and is left through the R exit (transient restoration, probability r), the C exit (permanent coverage, probability c), or the S exit (single-point failure, probability s).]
EXAMPLE 11.2 COVERAGE MODEL FOR MEMORY SUBPRODUCT

A hypothetical recovery procedure for a memory subproduct is shown in Figure 11.32. The memory uses an error-correcting code, so a single-bit error is always detectable and correctable and no reconfiguration is required. If 98% of all memory faults affect only a single bit, then the probability of reaching the R exit is r = 0.98. The 2% of faults that affect more than one memory bit are 95% detectable. When a multiple memory error is detected, the affected portion of memory is discarded, the memory mapping function is updated, and the needed information is reloaded from a previous checkpoint and updated to represent the current state of the product. Experimentation on a prototype product revealed that this recovery from the detected multiple memory errors works 85% of the time. Thus, the probability of reaching the C exit is the probability that a multiple fault occurs, is detected, and is recovered from: cp = (0.02) × (0.95) × (0.85) = 0.01615.

There are two paths to the single-point failure exit: (1) The memory fault causes a single-point failure if a multiple-bit error is not detected (with probability 0.02 × 0.05). (2) A multiple-bit memory error is detected, but the attempted recovery is not successful (with probability 0.02 × 0.95 × 0.15). Thus, s = (0.02) × ((0.05) + (0.95) × (0.15)) = 0.00385.
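The exit probabilities of this example follow from multiplying the branch probabilities along each path of the recovery procedure. The short Python sketch below makes that arithmetic explicit; the function and parameter names are illustrative, and the default values are the percentages stated in the example.

```python
def memory_coverage(p_single=0.98, p_detect=0.95, p_recover=0.85):
    """Return the exit probabilities (r, c, s) of the memory coverage model."""
    p_multi = 1.0 - p_single                     # fault affects more than one bit
    r = p_single                                 # single-bit error, masked by the code
    c = p_multi * p_detect * p_recover           # detected and successfully recovered
    s = p_multi * ((1.0 - p_detect)              # multiple-bit error not detected
                   + p_detect * (1.0 - p_recover))  # detected, recovery fails
    return r, c, s


if __name__ == "__main__":
    r, c, s = memory_coverage()
    print("r =", r, "c =", c, "s =", s, "sum =", r + c + s)
```

Because every fault leaves the model through exactly one exit, the three probabilities should sum to one, which is a convenient check when the branch probabilities are changed.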
Figure 11.32 Coverage model for memory subsystem. [Decision tree: an error occurs; a single-bit memory error (0.98) is masked in zero time and leads to transient restoration exit R; a multiple-bit memory error (0.02) is either not detected (0.05), leading to failure exit S, or detected (0.95), after which the attempted recovery is either successful (0.85), leading to permanent coverage exit C, or unsuccessful (0.15), leading to failure exit S.]
EXAMPLE 11.3 COVERAGE MODEL FOR A PROCESSOR
A processor contains built-in test circuitry so that error checking occurs concurrently with instruction execution. If an error is detected, the instruction is retried immediately. Partial results are stored in case the retry is unsuccessful so that the computation can be continued from some intermediate point (checkpoint) in a process called a rollback. In some cases, the fault is such that the rollback is not successful, so the computation must start over after a product-level recovery procedure is invoked. An example of a processor fault coverage model and the subsequent recovery procedure is shown in Figure 11.33 (Ng and Avizienis 1976).
Transient recovery procedure. Assume that the fault is transient, and begin a multistep recovery procedure that continues as long as an error is detected. If an error persists after all three steps have been performed, then a permanent recovery procedure must be invoked.
• Step 1. Wait for 0.1 s and do nothing. If the fault is transient, it may disappear during this time, allowing rollback to succeed.
• Step 2. Retry the current instruction several times, for as long as 0.5 s. The probability that the retry will be successful (i.e., that no error will be detected) is 0.5.
Figure 11.33 Coverage model for a processor. [Flow: Wait, Retry, Rollback, Permanent recovery; exits: transient restoration exit R, permanent coverage exit C, failure exit S.]
• Step 3. If an error persists, perform a rollback to a previous checkpoint, followed by recomputation, taking a total of 2 s. The rollback succeeds in removing the error 80% of the time.
If the fault is transient, a transient recovery can be successful only if the fault has disappeared before the step begins. To analyze this example, assume that the lifetime of a transient fault is exponentially distributed with a mean lifetime of 0.25 s. Transients comprise 90% of all faults; the other 10% are permanent.
Permanent recovery. If an error still persists after the rollback, it is assumed to be caused by a permanent fault, and a product-level permanent fault recovery process is begun to remove the offending processor from the set of active units and to reconfigure the product to continue without it. The permanent fault recovery process succeeds with probability 0.875.
The analysis of this coverage model consists of calculating the probability of product recovery (PSRi) for each of the three steps of transient recovery and for permanent recovery. This calculation entails determining two intermediate sets of quantities: the probability that the transient has gone before step i is reached and the probability that step i is taken.
Transient recovery exit. The transient recovery exit is reached if the fault is transient and any of the three steps is successful in achieving product recovery.
• Step 1. Step 1 is taken with probability one immediately upon the occurrence of the fault. Step 1 performs no actual recovery, so the probability of successful product recovery at the end of step 1 is zero (PSR1 = 0).
• Step 2. Product recovery from a transient error will occur in step 2 if
    • the transient has disappeared during step 1 (with probability $1 - e^{-0.1/0.25} = 0.329$); and
    • the retry is successful (with probability 0.5).
  Thus, PSR2 = (0.329) × (0.5) = 0.165.
• Step 3. Product recovery from a transient will occur in step 3 if
    • steps 1 and 2 are unsuccessful (with probability 1 – 0.165 = 0.835);
    • the transient has disappeared during step 1 or 2 (with probability $1 - e^{-(0.1+0.5)/0.25} = 0.909$); and
    • the rollback is successful (with probability 0.8).
  Thus, PSR3 = (0.835) × (0.909) × (0.8) = 0.607.
The probability of transient recovery is then the sum of the probabilities associated with the three steps (0 + 0.165 + 0.607 = 0.772). The transient restoration exit is reached if the fault is transient (with probability 0.9) and if transient recovery is successful; thus, r = 0.9 × 0.772 = 0.695.
Permanent coverage exit. There are two relevant cases in the analysis of the permanent recovery exit. The first case deals with the invocation of the permanent recovery procedure to handle a persistent transient fault; the second deals with recovery from a permanent fault.
Case 1. The permanent recovery process is initiated against a transient fault if the three steps of transient recovery have been unsuccessful. The probability associated with this case is then the product of
• the probability that the fault is transient (0.90);
• the probability that the three steps of transient recovery were not successful (1 – 0.772 = 0.228); and
• the probability that the permanent recovery process is successful (0.875).
Case 2. The permanent recovery process is successful against a permanent fault if
• the fault is permanent (with probability 0.10); and
• the permanent recovery process succeeds (with probability 0.875).
The probability of reaching the permanent coverage exit is then the sum of the probabilities associated with the two cases; thus, cp = 0.179 + 0.0875 = 0.267.
Single-point failure exit. There are two cases to consider for the single-point failure exit.
Case 1. For a transient fault, the single-point failure is reached if the permanent recovery procedure is invoked but fails to achieve product recovery. The probability associated with this scenario is 0.228 × (1 – 0.875) = 0.028. Multiplying by the probability that the fault is transient yields the probability for Case 1: 0.9 × 0.028 = 0.0252.
Case 2. For a permanent fault, the single-point failure exit is reached if the permanent recovery procedure is unsuccessful. Multiplying by the probability that the fault is permanent yields the probability for Case 2: (1 – 0.875) × (0.10) = 0.0125.
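The exit probabilities of Example 11.3 can be recomputed with a short script. This is a minimal sketch under the assumptions stated in the example (0.1 s wait, 0.5 s retry, exponential transient lifetime with mean 0.25 s); the function and parameter names are hypothetical. It carries full precision through every step, so its results agree with the rounded values in the text only to about three decimal places.

```python
from math import exp


def processor_coverage(p_transient=0.9, mean_life=0.25, t_wait=0.1, t_retry=0.5,
                       p_retry=0.5, p_rollback=0.8, p_permanent=0.875):
    """Exit probabilities (r, c, s) for the processor model of Example 11.3."""
    # Probability that the transient has disappeared before each step begins.
    gone_before_retry = 1.0 - exp(-t_wait / mean_life)
    gone_before_rollback = 1.0 - exp(-(t_wait + t_retry) / mean_life)

    psr1 = 0.0                                      # waiting performs no recovery
    psr2 = gone_before_retry * p_retry              # retry succeeds
    psr3 = (1.0 - psr2) * gone_before_rollback * p_rollback  # rollback succeeds
    transient_ok = psr1 + psr2 + psr3

    # Permanent recovery is invoked for unrecovered transients and permanent faults.
    needs_permanent = p_transient * (1.0 - transient_ok) + (1.0 - p_transient)
    r = p_transient * transient_ok
    c = needs_permanent * p_permanent
    s = needs_permanent * (1.0 - p_permanent)
    return r, c, s


r, c, s = processor_coverage()
print(f"r = {r:.4f}, c = {c:.4f}, s = {s:.4f}")
```

Keeping the step durations and success probabilities as parameters makes it easy to see how sensitive the single-point failure probability is to, for example, a shorter transient lifetime.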
The probability of reaching the single-point failure exit is the sum of the probabilities associated with Cases 1 and 2; thus, s = (0.0252 + 0.0125) = 0.0377.

11.5.4 Near-Coincident Faults

In highly reliable products, such as those used for flight control, the probability of a second fault occurring while recovery from a prior fault is being attempted may dominate the product failure probability. The occurrence of a second, near-coincident fault may interfere with recovery and cause immediate product failure.

EXAMPLE 11.4 EXPONENTIAL TIME TO PERFECT RECOVERY: NO TRANSIENTS
This example considers a very simple coverage model in order to demonstrate the effect of a near-coincident fault. The recovery process assumes that every fault is permanent and reconfigures the product to bypass the faulty component. No transient recovery is attempted. As long as no second fault interferes, recovery is always successful, and the permanent coverage exit is taken. The time to recover is exponentially distributed with rate $\delta_r$, and the time until a second fault occurs is exponentially distributed with rate $\gamma$. Then, the first fault is covered (with probability cp) if the recovery time is less than the time until the occurrence of the next fault.

$$c_p = \text{Prob}[\text{time to recover} < \text{time to second fault}] = \int_0^{\infty} \left(\delta_r e^{-\delta_r x}\right)\left(e^{-\gamma x}\right) dx = \frac{\delta_r}{\delta_r + \gamma} \tag{11.60}$$
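The closed form in Equation 11.60 can be checked numerically. The sketch below compares the ratio of rates with a midpoint-rule approximation of the integral; the function names, the integration cutoff, and the step count are illustrative choices, not part of the handbook.

```python
from math import exp


def coverage_closed_form(delta_r, gamma):
    """Right-hand side of Equation 11.60."""
    return delta_r / (delta_r + gamma)


def coverage_numeric(delta_r, gamma, steps=200_000):
    """Midpoint-rule approximation of the integral in Equation 11.60."""
    # The integrand decays like exp(-(delta_r + gamma) x); beyond this cutoff
    # the neglected tail is on the order of exp(-40).
    upper = 40.0 / (delta_r + gamma)
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += delta_r * exp(-delta_r * x) * exp(-gamma * x)
    return total * h


# Hypothetical rates: recovery twice as fast as the arrival of a second fault.
print(coverage_closed_form(2.0, 1.0), coverage_numeric(2.0, 1.0))
```

When recovery is much faster than fault arrival (a large ratio of the recovery rate to the near-coincident rate), the coverage approaches one, which is why near-coincident faults matter mainly for long recovery times or highly loaded configurations.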
In the general case, the probability of successful fault recovery involves two phenomena:
• The product must be able to recover from the fault if given enough time; and
• The recovery must occur before another fault can interfere.
The first aspect is analyzed using the methods presented in Section 11.1.2. Calculating the probability of an interfering fault requires information about how much time the product spends recovering. To determine the probability of a near-coincident fault, some new notation must be introduced because the amount of time taken to reach an exit must be included with the exit probabilities:
• $P_C(\tau)$ represents the probability of the product recovering to a state with fewer components (i.e., reaching the C exit from the coverage model) in an amount of time ≤ $\tau$ from the time of occurrence of the fault.
• $P_R(\tau)$ represents the probability of successful transient restoration in a time ≤ $\tau$ from the time of occurrence of the fault.
• $P_S(\tau)$ represents the probability of a single-point failure in an amount of time ≤ $\tau$ from the time of occurrence of the fault.
The three distributions, $P_C(\tau)$, $P_R(\tau)$, and $P_S(\tau)$, may be defective in that their limiting values may be less than one. In fact,

$$\lim_{\tau \to \infty} \left( P_C(\tau) + P_R(\tau) + P_S(\tau) \right) = 1 \tag{11.61}$$

These distributions represent the solution of the coverage model without considering near-coincident faults. Let $\hat{r}$, $\hat{c}$, and $\hat{s}$ denote the eventual probability of reaching the appropriate exit:

$$\hat{r} = \lim_{\tau \to \infty} P_R(\tau), \quad \hat{c} = \lim_{\tau \to \infty} P_C(\tau), \quad \hat{s} = \lim_{\tau \to \infty} P_S(\tau), \quad \hat{r} + \hat{c} + \hat{s} = 1 \tag{11.62}$$

If the possibility of near-coincident faults is ignored (as was done in Section 11.1.2), then the desired coverage factors (r, c, and s) are equated to these limiting probabilities:

$$r = \hat{r}, \quad c = \hat{c}, \quad s = \hat{s} \tag{11.63}$$
In some products, it is not sufficient to know that the product eventually recovers. Recovery is successful only if it is completed before the occurrence of a second, interfering fault. Thus, setting the coverage factors to the limiting value of the exit distributions is optimistic, especially if the recovery time is long in comparison to fault arrivals. These limiting values must be adjusted to account for the time needed to recover from a fault.

Let W be a random variable that represents the time between occurrences of interfering faults, where $F_W(t) = 1 - e^{-\gamma t}$. Let $Y_R$, $Y_C$, and $Y_S$ be random variables representing the time to reach the corresponding exit of the coverage model (conditioned on actually reaching the exit), where $F_{Y_R}(\tau)$, $F_{Y_C}(\tau)$, and $F_{Y_S}(\tau)$ are the conditional distributions of time to exit:

$$F_{Y_R}(\tau) = \frac{P_R(\tau)}{\hat{r}}, \quad F_{Y_C}(\tau) = \frac{P_C(\tau)}{\hat{c}}, \quad F_{Y_S}(\tau) = \frac{P_S(\tau)}{\hat{s}} \tag{11.64}$$

These distributions are not defective distributions because the limiting value for each is one. The associated probability density functions are $f_{Y_R}(\tau)$, $f_{Y_C}(\tau)$, and $f_{Y_S}(\tau)$.

The $P_R$ and $F_{Y_R}$ distributions differ. $P_R(\tau)$ is the distribution of time to reach the R exit from entry into the model. In the limit, $P_R$ may not be one because the R exit might not be taken; the C or S exit may be reached instead. The $F_{Y_R}$ distribution, on the other hand, is the distribution of time for the paths that actually do reach the R exit. The $P_R$ distribution characterizes the R exit when the whole model is considered, while the $F_{Y_R}$ distribution characterizes only the R exit, ignoring the rest of the model. If another exit is added to the model, the $P_R$ distribution may change, but the $F_{Y_R}$ distribution does not.
The calculation of the coverage factors must include the possibility that a near-coincident fault occurs. For the analysis of the permanent coverage exit C, the C exit must be reached (rather than the other two exits), and the exit must be reached before another fault occurs. The exit probability, c (written cp in the earlier examples), is thus given by

$$c = \text{Prob}[\text{time to reach C exit} < W] = \int_0^{\infty} e^{-\gamma x}\, dP_C(x) = \hat{c} \lim_{u \to \gamma} L_{Y_C}(u) \tag{11.65}$$
where $L_{Y_C}(u)$ is the Laplace transform of the random variable $Y_C$. Similar calculations produce r and s.

In the calculations in Section 11.1.2, where near-coincident faults are ignored, $\hat{r} + \hat{c} + \hat{s} = 1$. When near-coincident faults are considered, a fourth exit (N) is added to the coverage model for near-coincident faults. The associated coverage factor for the N exit is n = 1 – (r + c + s). Because the rate at which near-coincident faults occur depends on the overall configuration of the product (the more components in the product, the higher is the probability of a near-coincident fault), the coverage parameters are functions of the near-coincident rate, $\gamma$. The numerical value of the rate, $\gamma$, will depend on the overall product model (see Section 11.1.4).

Return now to the earlier coverage models. For the memory coverage model (Example 11.2; Figure 11.32), the calculated coverage factors were

$$\hat{r} = 0.98, \quad \hat{c} = 0.01615, \quad \hat{s} = 0.00385 \tag{11.66}$$
Transient recovery exit. Because transient recovery occurs in zero time, there is no possibility of a near-coincident fault, and thus

$$r(\gamma) = \hat{r} = 0.98 \tag{11.67}$$

Permanent coverage exit. Permanent recovery time follows an exponential distribution with parameter $\delta_r$, so

$$c(\gamma) = 0.01615 \times \frac{\delta_r}{\delta_r + \gamma} \tag{11.68}$$

Single-point failure exit. Time to single-point failure follows the same distribution as permanent recovery because a single-point failure is caused by a permanent recovery failure. Thus,

$$s(\gamma) = 0.00385 \times \frac{\delta_r}{\delta_r + \gamma} \tag{11.69}$$

Near-coincident failure exit. The coverage factor for the near-coincident failure exit is

$$n(\gamma) = 1 - \left( r + c(\gamma) + s(\gamma) \right) \tag{11.70}$$
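Equations 11.67 through 11.70 can be collected into one small function. The sketch below is illustrative: the function name is hypothetical, and the recovery rate and near-coincident rate passed in the usage line are arbitrary values chosen only to exercise the formulas.

```python
def memory_coverage_factors(gamma, delta_r, r_hat=0.98, c_hat=0.01615, s_hat=0.00385):
    """Coverage factors (r, c, s, n) of Equations 11.67-11.70 for the memory model."""
    # Probability that an exponential recovery (rate delta_r) finishes before an
    # interfering fault (rate gamma) arrives.
    finishes_first = delta_r / (delta_r + gamma)
    r = r_hat                        # masked in zero time, so gamma has no effect
    c = c_hat * finishes_first
    s = s_hat * finishes_first
    n = 1.0 - (r + c + s)            # remaining probability: near-coincident exit
    return r, c, s, n


# Hypothetical rates, chosen so that recovery and interference are equally fast.
print(memory_coverage_factors(gamma=1.0, delta_r=1.0))
```

With the near-coincident rate set to zero, the function returns the original factors of Equation 11.66 and n = 0, which recovers the model that ignores near-coincident faults.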
For the processor recovery model (Example 11.3; Figure 11.33), the calculations for the coverage factors are altered slightly. The probability of successful recovery was previously given by the product of the probability that the phase is entered, the probability that the transient is gone, and the effectiveness of the phase; however, the product must now include the probability that no near-coincident fault occurred during the phase.

11.5.5 Including the Coverage Model in the Product Model

Once the product model and coverage models have been determined, behavioral decomposition can be used to combine the results of the coverage model into the product model. This section presents the methodology for combining the results into a Markov model of the product failure modes.
EXAMPLE 11.5 A THREE-PROCESSOR, TWO-MEMORY FTC PRODUCT
A simple FTC product (call it 3P2M) consists of three processors and two shared memories communicating over a shared bus (Figure 11.34). The product is operational as long as at least one processor can communicate with at least one of the memories. Figure 11.35 shows the Markov chain representation of the 3P2M product. The states are labeled with an ordered triple, where the elements are
• the number of operational processors;
• the number of operational memories; and
• the state of the bus.
Figure 11.34 Reduction of combined system and coverage models. [Diagram labels: Processors, Bus, Memories.]
Figure 11.35 3P2M system diagram.
The constant failure rates for the components are
• processor: $\lambda = 10^{-4}$;
• memory: $\mu = 10^{-5}$; and
• bus: $\sigma = 10^{-6}$.
In the Markov chain of Figure 11.35, the failure states are labeled F, with one state for each cause:
• F represents exhaustion of the processor cluster;
• F represents exhaustion of the memories; and
• F represents failure of the bus.

The coverage model for the processors is from Example 11.3 (Figure 11.33) and that for the memories is from Example 11.2 (Figure 11.32). There is no automatic recovery from a bus fault. A coverage model is inserted on each arc between operational states in the Markov chain, as shown in Figure 11.36. In the 3P2M example, the coverage models on the horizontal arcs are copies of the processor model (Figure 11.33), and the coverage models on the vertical arcs are copies of the memory recovery model (Figure 11.32). Two additional failure states are inserted:
• $F_{SPF}$ denotes the occurrence of a single-point failure; and
• $F_{NCF}$ denotes the occurrence of critically coupled near-coincident faults.
The coverage models are solved for the probability of reaching each of the exits and the coverage parameters are inserted on the arcs in place of the models. The resulting Markov chain (Figure 11.37) is then solved for the state probabilities. The reliability of the product is given by the probability that the product is not in any failure state.
Figure 11.36 Markov chain model of 3P2M example system.
Figure 11.37 Reduction of combined system and coverage models.
Table 11.2 Solution of 3P2M Example Product

Cause of Failure             Probability
Exhaustion of processors     2.00 × 10^-7
Exhaustion of memories       1.61 × 10^-8
Exhaustion of buses          9.99 × 10^-5
Single-point failure         3.53 × 10^-4
Near-coincident faults       4.23 × 10^-9
Total unreliability          4.53 × 10^-4
Because the number of active components differs for each state, the rate of occurrence of near-coincident faults is state- and transition dependent, resulting in coverage factors that also differ for each transition. For example, consider the two coverage models on arcs that originate in state 3,2,1. From state 3,2,1, when a processor fault occurs, two processors, two memories, and a bus are still active. A fault in any of these active components interferes with recovery, so the near-coincident rate associated with the processor recovery model is
$$\gamma_1 = 2\lambda + 2\mu + \sigma \tag{11.71}$$
From the same state, when a memory fault occurs, three processors, one memory, and a bus are still active, so the near-coincident rate associated with the memory recovery model is
$$\gamma_2 = 3\lambda + \mu + \sigma \tag{11.72}$$
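The two near-coincident rates follow from counting the components that remain active while a recovery is in progress. As a minimal sketch, using the failure rates of Example 11.5 and a hypothetical helper function:

```python
# Failure rates from Example 11.5 (processor, memory, bus).
LAMBDA, MU, SIGMA = 1e-4, 1e-5, 1e-6


def near_coincident_rate(processors, memories, buses=1):
    """Total failure rate of the components still active during a recovery."""
    return processors * LAMBDA + memories * MU + buses * SIGMA


# From state (3,2,1): a processor fault leaves 2 processors, 2 memories, 1 bus;
# a memory fault leaves 3 processors, 1 memory, 1 bus.
gamma_1 = near_coincident_rate(processors=2, memories=2)   # Equation 11.71
gamma_2 = near_coincident_rate(processors=3, memories=1)   # Equation 11.72
print(f"gamma_1 = {gamma_1:.3e}, gamma_2 = {gamma_2:.3e}")
```

The same helper can be applied to every arc of the Markov chain, which is how the coverage factors come to differ from one transition to the next.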
Table 11.2 shows the results of the reliability analysis for a 100-h application of the 3P2M example. The largest contributor to the unreliability is single-point failure, a fault from which recovery is not successful.
11.6 BOUNDED APPROXIMATE MODELS
Determining the probability of product failure (the probability of the top event) for a product described as a fault tree can be computationally expensive. Markov models grow exponentially with the number of constituent components. Exact analysis is feasible only for products with a relatively small number of components or in products in which there are no or few subproduct interdependencies. In this section, several methods are introduced for obtaining approximate solutions for models of fault-tolerant products. These methods truncate the solution process. The approximate solutions are given in terms of upper and lower bounds on product reliability, so an estimate of the degree of approximation is also available. Two methods are presented for truncating the combinatorial solution of a fault tree. These two methods are useful when coverage is assumed to be perfect and near-coincident faults are ignored. For products more accurately modeled with a Markov
model, the truncation of the Markov model is considered. All three methods are intended for the analysis of products that require high levels of reliability for short application times, during which no manual repair of the products is possible. The short application times result in very small probabilities for multiple simultaneous failures. The components that comprise the product generally are characterized by constant failure rates, although most of the techniques presented here apply to products whose failure processes are time dependent.

EXAMPLE 11.6 THE Cm* PRODUCT
The Cm* product, a loosely coupled distributed product, is used here to demonstrate the first two truncation methods. An instance of the Cm* product (see Figure 11.38) consists of two clusters, each containing four processors and four memory modules. Fault recovery is assumed to be perfect and instantaneous; thus, no coverage models are used. A fault-tree model is thus more appropriate than a Markov chain to analyze this product. Each cluster consists of four local switch-interface controllers (S.locals), each attached to one processor and one 12K-memory module. Each processor has 4K of memory on board. The K.map is a cluster controller connecting the S.locals; the clusters are connected by intercluster communications (L.inc). A fault in the K.map renders the associated S.locals (and their connected processors and memories) inaccessible, while a fault in an S.local makes the processor and memory modules connected to it inaccessible. The constant failure rates are
• memory: eight memory modules, each with failure rate $\lambda_M$ = 69.4 failures per million hours;
• processor: eight processors, each with failure rate $\lambda_P$ = 29.9 failures per million hours;
Figure 11.38 A diagram of the Cm* product. [Legend: Pc = LSI-11; MP = Memory (12K words); S.local = Local switch interface controller; K.map = Cluster controller; Linc = Inter-cluster communications.]
• S.local: eight local switch interface controllers, each with failure rate $\lambda_S$ = 24 failures per million hours;
• K.map: two cluster controllers, each with constant failure rate $\lambda_K$ = 131 failures per million hours; and
• L.inc: one intercluster communication link, with constant failure rate $\lambda_L$ = 34.8 failures per million hours.

The product requires three processors that can communicate with three memories. As long as the L.inc is operational, these requirements can be satisfied by the components of both clusters. However, if the L.inc fails, the requirements must be met within one cluster.

A fault-tree model of the Cm* product is shown in Figure 11.39. Product failure (the top event) can be attributed to one of two causes, which are shown as inputs into the uppermost OR gate. Failure occurs when either the L.inc fails and the requirements cannot be satisfied by a single cluster (the left input to the uppermost OR gate) or, independent of the state of the L.inc, there are insufficient processors or memories in both clusters together. The gates labeled with m/n, where m and n are integers, are true when m of the n input events have occurred. The row of OR gates at the bottom of the fault tree reflects the fact that a processor (or memory) is considered to have failed if it fails itself or if its S.local or K.map fails.

There are many potential product states. Because there are 27 components, the product can be in any one of $2^{27}$ > 134 million states, if any component can be in one of two states, functional and failed. There are 5,405 minimal cut sets for this product.
Figure 11.39 Fault-tree model of Cm* system. [Gate and event labels: Linc; 6/8 and 2/4 voting gates; basic events K1 and K2, S1 to S8, P1 to P8, M1 to M8.]
11.6.1 Truncated Exhaustive State Enumeration

Given a fault-tree model of a product, the simplest solution method is an exhaustive state enumeration. Using this method, all $2^n$ state vectors (given n basic events) are examined and are determined to represent either operational or failed product configurations. The probabilities associated with the state vectors for failed configurations are summed to determine the unreliability of the product. Although not very efficient, this exhaustive state vector examination process lends itself nicely to truncation. At any point in the solution, we can bound the total probability associated with the state vectors that have not yet been examined. When this total is less than the required accuracy, the state enumeration process can be stopped and a bounded estimate of the reliability is produced.

Assume that the n basic events are sorted in decreasing order of failure probability at some time, t, for which the reliability of the product is desired. That is, the failure probability of component i is higher than (or equal to) the failure probability of component i + 1, for 1 ≤ i < n. A binary vector of length n is used to represent the state of the components; a one in location i of the state vector means that component i has failed; a zero means that component i is operational. The initial state of the product, (0,0,0,…, 0), represents no failures in any components. The state vectors are evaluated in an orderly fashion, proceeding from the initial vector until the (1,1,1,…, 1) vector is reached. Each state vector represents either an operational configuration of the product or a failed configuration (as defined by the fault-tree model of the product). The sum of the probabilities of occurrence of all the operational configurations is the reliability of the product; the sum of the probabilities of occurrence of all the failure configurations is the unreliability of the product.
The probability of occurrence of each state vector is determined from the failure probabilities of the individual components. Denote by $p_i$ the probability that the ith component has failed, and by $q_i$ the probability that the ith component has not failed. Because the two events are complementary, $p_i + q_i = 1$. Let $v_i$ represent the ith element of the state vector being analyzed (each $v_i$ is either 0 or 1, depending on whether the ith component is operational or has failed). The probability of occurrence of the state vector, $(v_1, v_2, \ldots, v_n)$, is given by

$$\text{Prob}[(v_1, v_2, \ldots, v_n)] = \prod_{i=1}^{n} \left( v_i p_i + (1 - v_i) q_i \right) \tag{11.73}$$
The goal of this section is to describe a process by which the reliability of the product can be determined to some level of accuracy (say, $10^{-d}$) without analyzing all $2^n$ state vectors. The probabilities of occurrence of a few key state vectors can determine the number of failure levels that must be examined in order to guarantee an accuracy of $10^{-d}$ in the result. The result can be more accurate than required because the error term on each failure level will tend to be overestimated. The term failure level means that all the state vectors on failure level k have exactly k ones and exactly n – k zeros, representing a configuration in which k components have failed and the rest are operational.
Because the basic events are sorted in decreasing order of failure probability, the most probable state vector on any failure level is the first, (1,…, 1, 0,…, 0). Multiplying the probability associated with this first state on a level, l, by the number of state vectors on the level, $\binom{n}{l}$, gives the maximum possible contribution to the failure probability from all the states on the level. In general, on any failure level, l, the maximum error introduced by not expanding the states on the level is

$$E_l = \binom{n}{l} \times p_1 \times p_2 \times \cdots \times p_l \times q_{l+1} \times \cdots \times q_n \tag{11.74}$$

To avoid examining all $2^n$ state vectors, but still guarantee accuracy of $10^{-d}$, a sum of $E_i$ terms is formed that is less than $5 \times 10^{-(d+1)}$. For each $E_i$ that is included in the sum, the corresponding level i can be ignored when generating the state vectors. A practical solution is to find the smallest l such that

$$\text{Error Term} = \sum_{i=l+1}^{n} E_i \le 5 \times 10^{-(d+1)} \tag{11.75}$$
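The level selection of Equations 11.74 and 11.75 can be sketched in a few lines. The function names below are hypothetical, the failure rates are those listed for the Cm* components in Example 11.6, and the accuracy exponent d = 8 in the usage lines is an arbitrary illustrative choice rather than a value taken from the handbook.

```python
from math import comb, exp, prod


def level_error_bounds(p):
    """E_l of Equation 11.74 for l = 0..n, given component failure probabilities."""
    ranked = sorted(p, reverse=True)             # decreasing failure probability
    n = len(ranked)
    return [comb(n, l) * prod(ranked[:l]) * prod(1.0 - x for x in ranked[l:])
            for l in range(n + 1)]


def smallest_truncation_level(p, d):
    """Smallest l whose unexpanded levels satisfy Equation 11.75."""
    errors = level_error_bounds(p)
    threshold = 5.0 * 10.0 ** -(d + 1)
    for l in range(len(p) + 1):
        if sum(errors[l + 1:]) <= threshold:     # levels l+1..n are not expanded
            return l
    return len(p)


# Cm* failure rates per hour: 2 K.maps, 8 memories, 1 L.inc, 8 processors, 8 S.locals.
rates = [131e-6] * 2 + [69.4e-6] * 8 + [34.8e-6] + [29.9e-6] * 8 + [24e-6] * 8
t = 5.0                                          # application time in hours
p = [1.0 - exp(-rate * t) for rate in rates]
print(smallest_truncation_level(p, d=8))
```

Because each $E_i$ is an overestimate, the level returned this way is conservative: the true neglected probability is generally smaller than the bound used to choose l.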
The smallest l value that satisfies Equation 11.75 determines how many levels (l) must be expanded in order to achieve the desired accuracy. For a given l value satisfying Equation 11.75, all state vectors that correspond to states containing l or fewer failed components must be analyzed; however, the state vectors that correspond to states containing l + 1 or more failed components need not be expanded. Because the error term associated with each failure level is being overestimated, it is possible (and not unlikely) that the sum of all the error terms may exceed one.

The bound on the unreliability can be formed as follows. Let $U_i$ be the sum of the state vector probabilities for all failure configurations on level i. The exact unreliability of the product is

$$\text{Unreliability} = \sum_{i=1}^{n} U_i \tag{11.76}$$

and can be bounded by

$$\sum_{i=0}^{l} U_i \le \text{Unreliability} \le \sum_{i=0}^{l} U_i + \sum_{i=l+1}^{n} E_i \tag{11.77}$$
Table 11.3 shows the results of the solution of the Cm* example, using the truncated exhaustive state enumeration algorithm. Reasonably accurate estimates of the reliability of the product resulted from a consideration of only a small portion of the state space. As the application time increases, more components are likely to have failed, so more levels in the state space need to be considered.
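The truncated enumeration can be sketched for the Cm* example as follows. The structure function is an assumed reading of the fault tree taken from the prose description (the gate wiring of Figure 11.39 is not reproduced here), and all names are hypothetical, so the printed bounds should be compared against Table 11.3 rather than presumed to match it. Expanding levels 0 through 3 of a 27-component product visits 3,304 state vectors, one more than the model size listed in the table, presumably because the fault-free state is counted here.

```python
from itertools import combinations
from math import comb, exp, prod

# Failure rates per hour from Example 11.6, keyed by the first letter of the name.
RATES = {"K": 131e-6, "M": 69.4e-6, "L": 34.8e-6, "P": 29.9e-6, "S": 24e-6}
NAMES = ["L", "K1", "K2"] + [f"{kind}{i}" for kind in "SPM" for i in range(1, 9)]


def cm_star_failed(failed):
    """Assumed structure function: True if the set of failed components fails Cm*."""
    def lost(kind, i):
        # A processor or memory is lost if it, its S.local, or its K.map fails.
        kmap = "K1" if i <= 4 else "K2"
        return bool({f"{kind}{i}", f"S{i}", kmap} & failed)

    def count_lost(kind, units):
        return sum(lost(kind, i) for i in units)

    def cluster_short(units):
        # 2/4 gates: one cluster alone needs three processors and three memories.
        return count_lost("P", units) >= 2 or count_lost("M", units) >= 2

    everything = range(1, 9)
    pooled_short = count_lost("P", everything) >= 6 or count_lost("M", everything) >= 6
    isolated_short = ("L" in failed and cluster_short(range(1, 5))
                      and cluster_short(range(5, 9)))
    return pooled_short or isolated_short


def truncated_enumeration(t, max_level):
    """Bounds of Equation 11.77 after expanding failure levels 0..max_level."""
    p = {name: 1.0 - exp(-RATES[name[0]] * t) for name in NAMES}
    lower = 0.0
    for level in range(max_level + 1):
        for combo in combinations(NAMES, level):
            failed = set(combo)
            if cm_star_failed(failed):
                # Equation 11.73 for this state vector.
                lower += prod(p[c] if c in failed else 1.0 - p[c] for c in NAMES)
    ranked = sorted(p.values(), reverse=True)
    n = len(ranked)
    # Sum of E_i (Equation 11.74) over the levels that were not expanded.
    error = sum(comb(n, k) * prod(ranked[:k]) * prod(1.0 - x for x in ranked[k:])
                for k in range(max_level + 1, n + 1))
    return lower, lower + error


lower, upper = truncated_enumeration(t=5.0, max_level=3)
print(f"{lower:.3e} <= unreliability <= {upper:.3e}")
```

Enumerating by failure level with `combinations` avoids ever generating the full set of $2^{27}$ state vectors, which is the point of the truncation.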
Table 11.3 Results of Solution of Example Fault Trees: Truncation of Exhaustive State Enumeration Solution (Fault Tree for Cm* Product; 134,217,728 States Total)

Application time (hours)                                  5          10         100        1000
Truncation level (number of component failures)           3          3          4          7
Size of model (number of states)                          3303       3303       20,853     1,285,623
Lower bound on unreliability                              4.30e–7    1.73e–6    1.85e–4    2.65e–2
Upper bound on unreliability                              4.32e–7    1.75e–6    1.90e–4    2.84e–2
Total run time (Sun-4 CPU seconds)                        28         27         168        8923
11.6.2 Truncated Sum of Disjoint Products

Most qualitative algorithms for determining the unreliability of a product modeled with a fault tree start with the determination of the minimal cut sets of the product. A cut set is a set of basic events in which, if all the basic events occur, the top event (product failure) will occur. If any basic event is removed from a minimal cut set, the remaining events are no longer a cut set. If the p minimal cut sets are labeled $C_i$, i = 1,…, p, then product unreliability is

$$\text{Prob}\left[ \bigcup_{i=1}^{p} C_i \right] \tag{11.78}$$

The cut sets are not necessarily mutually disjoint, so we cannot simply sum the probabilities of the individual cut sets, although this sum does provide an upper bound on the unreliability of the product. A sum of disjoint products (SDP) method for fault-tree evaluation uses the equation:

$$\bigcup_{i=1}^{p} C_i = (C_1) \cup (\bar{C}_1 C_2) \cup (\bar{C}_1 \bar{C}_2 C_3) \cup \cdots \cup (\bar{C}_1 \bar{C}_2 \bar{C}_3 \cdots \bar{C}_{p-1} C_p) \tag{11.79}$$

where $\bar{C}_i$ is that part of the universal set that is not in $C_i$. Because the terms in the right-hand side of Equation 11.79 are mutually disjoint, the sum of the probabilities of the individual terms yields the exact unreliability of the product.
Both the determination of the set of minimal cut sets and the calculation of the reliability of the product from the cut sets are generally considered infeasible for all but the smallest products. Often, the determination of the cut sets is substantially faster than the reliability calculation. Therefore, this section is concerned with the problem of truncating the SDP after the cut sets have been determined.

The basic approach to an SDP algorithm is to take each cut set and make it disjoint with each preceding cut set, using Boolean algebra. If the cut sets are sorted in decreasing order of their probability of occurrence, the larger contributions to product unreliability are associated with the cut sets with lower indices. Making the cut sets with the highest indices disjoint from the others that precede them can require a substantial time investment that does not contribute much in terms of accuracy.

For the Cm* example, the most probable cut set (which contained both K.maps) was four orders of magnitude more probable than the second most probable cut set, and seventeen orders of magnitude more probable than the least probable cut set. This disparity in cut set probabilities leads us to consider ways to bound the unreliability of the product without exerting the effort to make the improbable cut sets disjoint from the others.

The SDP algorithm also lends itself to truncation. Suppose that the first l (l < p) cut sets have been made disjoint from the preceding ones. That is, the

$$\bar{C}_1 C_2,\ \bar{C}_1 \bar{C}_2 C_3,\ \ldots,\ \bar{C}_1 \bar{C}_2 \bar{C}_3 \cdots \bar{C}_{l-1} C_l \tag{11.80}$$

sets have all been determined. Then these terms, along with the remaining cut sets (indexed from l + 1 to p), can be used to provide bounds on the product unreliability:

$$\begin{aligned} &\text{Prob}[C_1] + \text{Prob}[\bar{C}_1 C_2] + \text{Prob}[\bar{C}_1 \bar{C}_2 C_3] + \cdots + \text{Prob}[\bar{C}_1 \bar{C}_2 \cdots \bar{C}_{l-1} C_l] \\ &\quad \le \text{Unreliability} \le \\ &\text{Prob}[C_1] + \text{Prob}[\bar{C}_1 C_2] + \text{Prob}[\bar{C}_1 \bar{C}_2 C_3] + \cdots + \text{Prob}[\bar{C}_1 \bar{C}_2 \cdots \bar{C}_{l-1} C_l] \\ &\quad + \text{Prob}[C_{l+1}] + \text{Prob}[C_{l+2}] + \cdots + \text{Prob}[C_p] \end{aligned} \tag{11.81}$$

If the bounds are tight enough after making cut set l disjoint, the process of making the remaining cut sets disjoint can be suspended. If the bounds from the l disjoint cut sets are too loose, the procedure can continue with the (l + 1)st cut set. In fact, one can determine l a priori if the maximum acceptable width of the error interval is known (say, $10^{-d}$). As in Section 11.6.1, form a sum of the cut set probabilities that is less than $5 \times 10^{-(d+1)}$. For example, find the smallest l such that

$$\sum_{i=l+1}^{p} \text{Prob}[C_i] \le 5 \times 10^{-(d+1)} \tag{11.82}$$
RELIABILITY ANALYSIS OF REDUNDANT AND FAULT-TOLERANT PRODUCTS
291
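The truncation bounds of Equation 11.81 and the a priori choice of l in Equation 11.82 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm used for the Cm* results: components are assumed independent, the disjoint terms are obtained by brute-force enumeration of component states (practical only for toy models), and the function names, the 2-out-of-3 example, and its failure probabilities are hypothetical.

```python
from itertools import product


def cut_set_prob(cut, q):
    """Prob[C]: every component in the cut set has failed (independent components)."""
    prob = 1.0
    for comp in cut:
        prob *= q[comp]
    return prob


def disjoint_term_prob(cuts, k, q):
    """Prob[not C_1 ... not C_(k-1) and C_k] for 0-based index k.

    Brute-force enumeration stands in for the Boolean-algebra step of a real
    SDP algorithm; it is exponential in the number of components.
    """
    comps = sorted(q)
    total = 0.0
    for state in product((False, True), repeat=len(comps)):  # True = failed
        failed = {c for c, is_failed in zip(comps, state) if is_failed}
        if not cuts[k] <= failed:
            continue  # C_k has not occurred in this state
        if any(cuts[j] <= failed for j in range(k)):
            continue  # an earlier cut set occurred; state belongs to an earlier term
        prob = 1.0
        for c, is_failed in zip(comps, state):
            prob *= q[c] if is_failed else 1.0 - q[c]
        total += prob
    return total


def sdp_bounds(cuts, q, l):
    """Lower and upper bounds of Equation 11.81 with the first l cut sets made disjoint."""
    lower = sum(disjoint_term_prob(cuts, k, q) for k in range(l))
    upper = lower + sum(cut_set_prob(c, q) for c in cuts[l:])
    return lower, upper


def choose_l(cut_probs, d):
    """Smallest l whose remaining cut set probabilities satisfy Equation 11.82."""
    threshold = 5.0 * 10.0 ** (-(d + 1))
    tail = sum(cut_probs)
    for l, prob in enumerate(cut_probs):
        if tail <= threshold:
            return l
        tail -= prob
    return len(cut_probs)


# Hypothetical 2-out-of-3 product; cut sets sorted by decreasing probability.
q = {"a": 0.1, "b": 0.2, "c": 0.3}
cuts = [{"b", "c"}, {"a", "c"}, {"a", "b"}]
for l in range(1, len(cuts) + 1):
    print(l, sdp_bounds(cuts, q, l))
```

For this toy model the interval narrows from roughly [0.06, 0.11] at l = 1 to the exact unreliability of 0.098 once all three cut sets have been made disjoint.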
Table 11.4 Results of Solution of Example Fault Trees: Truncation of Sum of Disjoint Products Solution (Fault Tree for Cm* Product; Total Number of Cut Sets = 5,405)

Application time (hours)                5          10         100        1000
Number of cut sets considered (l)       2          2          36         80
Lower bound on unreliability            4.28e-7    1.71e-6    1.81e-4    2.62e-2
Upper bound on unreliability            4.31e-7    1.74e-6    1.87e-4    2.68e-2
Total run time (Sun-4 CPU seconds)      24         24         33         238
The smallest l value that satisfies Equation 11.82 determines how many cut sets (the first l) must be made disjoint from the preceding ones to achieve the desired accuracy. Table 11.4 shows the results of the solution of the Cm* product, using the truncated sum of disjoint products algorithm. It took a total of 138 Sun-3 CPU seconds to generate the cut sets.

11.6.3 Truncating a Markov Chain

A major problem that arises when using Markov models is that the state space of the Markov chain can increase exponentially with the number of components in the product. However, often relatively few of the states of a large Markov model contribute significantly to the desired output measure. If the states that are likely to have low occupation probability can be identified in advance, both their generation and solution can be dispensed with. Products designed to provide a high level of reliability for a relatively short application time are unlikely to experience more than some small number of component failures during an application. Thus, states that represent more than a few component failures will have a very low occupation probability. Model reduction based on this observation is called state truncation.

Suppose that in the creation of a Markov chain, only states that have up to k failed components are created. States with more than k faults are aggregated into a special dummy state, called a truncation aggregation (TA) state. In general, the states aggregated into a TA state include both operational and failure states. Considering the TA state as an operational state gives an optimistic estimate, or upper bound, on the reliability of the product. Considering the TA state as a failure state gives a conservative estimate, or lower bound, on the reliability of the product.

Figure 11.40 Truncated Markov model of three-processor, two-memory system.

Let the probability of the TA state be denoted by Pr(TA) and the sum of the probabilities of the failure states in the unaggregated portion of the model (the part of the Markov chain above the truncation line) be denoted by Pr(DownStates). Bounds on the product unreliability are then given by

$$\Pr(\text{DownStates}) \le \text{Unreliability} \le \Pr(\text{DownStates}) + \Pr(\text{TA}) \tag{11.83}$$
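The bounds in Equation 11.83 can be tried out on a small model. The sketch below is not the 3P2M model: it assumes a hypothetical product of n identical units, each with constant failure rate `lam`, in which a failure is covered with probability `c` and an uncovered failure (or the loss of the last unit) takes the product to the failure state F. The transient solution uses uniformization, a standard numerical method chosen here only to keep the example dependency-free.

```python
import math


def transient_probs(Q, p0, t, tol=1e-12, max_terms=10_000):
    """State probabilities at time t for generator matrix Q, by uniformization."""
    n = len(Q)
    rate = max(-Q[i][i] for i in range(n))
    if rate == 0.0:
        return list(p0)
    # One-step matrix of the uniformized discrete-time chain: P = I + Q / rate.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    weight = math.exp(-rate * t)  # Poisson probability of zero jumps
    vec = list(p0)
    probs = [weight * v for v in vec]
    covered = weight
    m = 0
    while 1.0 - covered > tol and m < max_terms:
        m += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / m
        probs = [p + weight * v for p, v in zip(probs, vec)]
        covered += weight
    return probs


def truncated_chain(n, lam, c, k):
    """Generator with states 0..k (covered failures so far), then F, then TA."""
    if not 0 <= k <= n - 1:
        raise ValueError("need 0 <= k <= n - 1")
    F, TA = k + 1, k + 2
    Q = [[0.0] * (k + 3) for _ in range(k + 3)]
    for i in range(k + 1):
        rate = (n - i) * lam
        if i + 1 == n:
            Q[i][F] += rate  # last unit fails: product fails
        else:
            nxt = i + 1 if i + 1 <= k else TA  # beyond level k goes to TA
            Q[i][nxt] += rate * c              # covered failure
            Q[i][F] += rate * (1.0 - c)        # uncovered failure
        Q[i][i] = -rate
    return Q, F, TA


def unreliability_bounds(n, lam, c, k, t):
    """(lower, upper) bounds of Equation 11.83 for truncation level k."""
    Q, F, TA = truncated_chain(n, lam, c, k)
    p0 = [1.0] + [0.0] * (k + 2)
    probs = transient_probs(Q, p0, t)
    return probs[F], probs[F] + probs[TA]


# Hypothetical parameters: 4 units, 1e-3 failures/h each, 95% coverage, 100 h.
for k in range(4):
    print(k, unreliability_bounds(4, 1e-3, 0.95, k, 100.0))
```

Setting k = n - 1 generates the full chain, so the TA state is never entered and the two bounds coincide; smaller k gives a wider interval that still brackets that value.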
For the 3P2M product (Example 11.5; Figure 11.35), the Markov chain truncated after the first level is shown in Figure 11.40. Using the same parameters as before, as well as the same coverage models, this Markov chain is solved for a 100-h application. The sum of the probabilities for the failure states is 4.52 × 10^-4, which is the lower bound on the unreliability of the product. The probability for the TA state, 6.05 × 10^-5, is added to the lower bound to produce an upper bound on unreliability of 5.12 × 10^-4. Thus, a reasonable estimate of the reliability of the product may be obtained from a relatively small subset of the entire state space. However, if the bounds are too loose when the model is truncated at level k, then the model must be generated and solved anew in order to increase the truncation level.

11.7 ADVANCED TOPICS

This section introduces several other considerations that arise when analyzing fault-tolerant products and provides pointers to the literature.

11.7.1 Combining Performance with Reliability
The use of multiple processors in a computer product can enhance performance as well as reliability. Spare processors can be used to increase throughput or reduce job turnaround time, as well as to provide redundancy for critical functions. Consider a product with n processors. The highest reliability is achieved when one processor is sufficient to perform the intended tasks and the other n − 1 processors are used as backups. However, this scenario also provides the poorest average utilization because one would expect that many processors (the spares) would be idle at any given time. The configuration providing the best average utilization would use all n processors for computation, allowing no redundancy and poor reliability. Somewhere between these two extremes is a configuration that addresses both performance and reliability issues.

A gracefully degradable product provides high levels of performance when redundancy levels are high; performance degrades (but the product continues to operate) as components fail, as long as some minimal configuration is available. The analysis of a gracefully degradable product must consider both performance and reliability metrics (Meyer 1992). This is easily accomplished with a Markov reward model, a Markov chain in which some additional measure, a reward, is associated with each state. This reward might be throughput, response time, cost, computation capacity, and so on. In the simplest case, the reward metric is given as a vector of $r_i$ values, where $r_i$ is the reward associated with state i. The expected reward at time t, E[R(t)], is then given by

$$E[R(t)] = \sum_{i=1}^{n} r_i P_i(t) \tag{11.84}$$
where $P_i(t)$ is the probability of being in state i at time t. Equation 11.84 can be used to assess reliability by setting $r_i = 1$ for operational states and $r_i = 0$ for failure states.

For the 3P2M product (Example 11.5; Figure 11.35), suppose that a single processor, a single memory, and the bus provide a base performance level of one ($r_{1,1,1} = 1$). Adding a second memory unit improves performance by a factor of 1.25, adding a second processor improves performance by a factor of 1.8, and using three processors improves performance 2.5 times over the base case. The resulting reward vector for operational states is shown in Table 11.5. The expected performance of the 3P2M product is then 3.1, slightly less than the peak performance level associated with the initial state.

Table 11.5 Rewards and Probabilities for 3P2M Product

State     Reward    Probability at 100 h
3,2,1     3.125     0.98613
2,2,1     2.25      1.33 × 10^-2
3,1,1     2.5       3.18 × 10^-5
1,2,1     1.25      5.99 × 10^-5
2,1,1     1.8       4.30 × 10^-7
1,1,1     1         1.93 × 10^-9

11.7.2 Phased Applications

Fault-tolerant products are often used in applications characterized by several phases in which the product structure, failure processes, or success criteria can change with each phase. For example, an aircraft control product must control the aircraft through takeoff, cruising, and landing. The analysis of the reliability of such products is hampered by the existence of more than one phase because separate models must be developed for each phase. The problem arises because the models for the separate phases must be linked together, so the solution at the end of one phase becomes the initial condition for the beginning of the next phase.

Figure 11.41 Three-phase mission requirements for the example system. [Block diagrams of components A, B, and C for Phase 1, Phase 2, and Phase 3.]

A simple example illustrates several different approaches to solving the phased application problem. Figure 11.41 shows a reliability block diagram of the product used to illustrate the several approaches. The product consists of three components used for three sequential phases. The failure rates of the components are constant for the duration of a phase, but may be different for each phase. The product fails if it fails in any phase of the application. Figure 11.42 shows the Markov chain models for each of the three phases. The states in the Markov chain are labeled with the names of the components still operational in the state. The state labeled F is the failure state.
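The linking of phase models described above can be sketched for this three-component example. The code below is an illustration under stated assumptions rather than a general phased-mission solver: components are independent and not repairable, so the state probabilities at the end of a phase have a product form and no differential equations need to be solved; the success criteria are read from the state labels of the three phase models (any one component in phase 1, A together with B or C in phase 2, all three in phase 3); and the failure rates and phase durations are hypothetical.

```python
import math
from itertools import combinations

COMPONENTS = ("A", "B", "C")

# Success criterion for each phase, as a test on the set of working components.
PHASE_OK = [
    lambda up: len(up) >= 1,
    lambda up: "A" in up and ("B" in up or "C" in up),
    lambda up: up == frozenset(COMPONENTS),
]

# Hypothetical per-phase failure rates (per hour) and phase durations (hours).
PHASE_RATES = [
    {"A": 1e-4, "B": 2e-4, "C": 3e-4},
    {"A": 5e-5, "B": 1e-4, "C": 1e-4},
    {"A": 2e-4, "B": 2e-4, "C": 4e-4},
]
DURATIONS = [10.0, 100.0, 10.0]


def subsets(items):
    """All subsets of items, as frozensets."""
    items = tuple(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def run_phase(dist, rates, duration, is_ok):
    """Advance the distribution over sets of working components through one phase.

    Returns the new distribution over operational states and the probability
    mass sent to the failure state F during this phase.
    """
    survive = {c: math.exp(-rates[c] * duration) for c in COMPONENTS}
    new_dist = {}
    failed_mass = 0.0
    for up, prob in dist.items():
        if not is_ok(up):
            failed_mass += prob  # no matching state in this phase's model
            continue
        for still_up in subsets(up):
            p = prob
            for c in up:
                p *= survive[c] if c in still_up else 1.0 - survive[c]
            # Without repair, the product was up all phase if it is up at the end.
            if is_ok(still_up):
                new_dist[still_up] = new_dist.get(still_up, 0.0) + p
            else:
                failed_mass += p
    return new_dist, failed_mass


def mission_reliability(phase_rates, durations):
    """Probability of completing all phases, starting with every component up."""
    dist = {frozenset(COMPONENTS): 1.0}
    for rates, duration, is_ok in zip(phase_rates, durations, PHASE_OK):
        dist, _ = run_phase(dist, rates, duration, is_ok)
    return sum(dist.values())


print(mission_reliability(PHASE_RATES, DURATIONS))
```

Because the final phase needs all three components and nothing is repaired, the result equals the probability that A, B, and C each survive every phase, which gives a simple check on the bookkeeping.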
Figure 11.42 Markov model for three phases of example system. [Three panels, Phase 1 through Phase 3; states are labeled ABC, BC, AC, AB, C, B, A, and F, and transitions are labeled with the failure rates λA, λB, and λC or sums of them.]

Figure 11.43 Conservative reliability model of example phased system.
The simplest approach to phased application analysis consists of connecting the reliability block diagrams for each phase in series and then solving for the reliability of the product at the end of the final phase of the application. For the example product, the resulting reliability block diagram is shown in Figure 11.43. This block diagram reduces to a series connection of A, B, and C because all three are required in the final phase. If the components are not repairable, then the solution of this model will produce an exact result. However, if redundant components are repairable while the product is operational, this model will produce a conservative estimate of product reliability. To see this, suppose that component C fails during phase 1 and is repaired during phase 2. Because C is redundant in both phases, this failure/repair scenario will not cause product failure. In the conservative reliability model, however, the product fails as soon as component C fails.

An exact method for a combinatorial solution of phased products can be effected by constructing a block diagram for an equivalent product. In the block diagram for phase i, the component C is replaced by a series connection of independent components C1, C2, …, Ci, as shown in Figure 11.44. The failure probabilities associated with the new components are now conditional probabilities. The probability associated with component Ci is the probability that component C will survive phase i given that it has survived the previous phases. This method produces an exact result, but not without cost. Because the fault-tree model solution is exponential in the number of components, the addition of components can be computationally expensive.

Figure 11.44 Exact reliability model of example phased system.

The standard Markov approach to solving phased applications consists of solving each Markov model sequentially, using the final state probabilities from phase i as the initial state probabilities for phase i + 1. In terms of the example product, suppose that phase changes occur at times T1 and T2 and that the final phase ends at time T3. Then, the probabilities for states ABC, AC, and AB at time T1 from the model of the first phase are used as the initial state probabilities for the states with the same names in the model of the second phase. The sum of the other state probabilities is used as the initial state probability for state F in the second-phase model. For the change from phase 2 to phase 3, the state probability for state ABC at time T2 is used as the initial state probability for state ABC in the third-phase model; the sum of the remaining state probabilities is used as the initial state probability for the failure state. The state probability for state ABC in the third-phase model at time T3 gives the phased-application reliability of the product.

One of the disadvantages associated with this approach is that unless the set of components for each phase is the same, it is difficult to match the states in the model of one phase with the corresponding states in another model. Further complications arise if components can fail during one phase but not during another phase, or if failures during one phase are not detectable until some later phase, when the component may again be used. Somani and Trivedi (1994) used Boolean algebraic methods for phased application analysis, while Smotherman and Zemoudeh (1989) and Dugan (1991) considered Markovian methods.

11.7.3 Advanced Fault-Tree Modeling
The use of combinatorial models for analyzing redundant products has required consideration of several interesting modeling approaches. We provide pointers to the literature for several recent innovations.

Dynamic fault-tree models. A fundamental limitation of fault-tree models is their inability to capture sequence dependencies. A sequence dependency results when the order in which failures occur (not simply the combination) affects product operation. In addition to standby sparing and coverage modeling, sequence dependencies include the use of shared pools of spares and functional dependencies. A functional dependency exists when some components can become useless or disconnected as a result of the failure of another component. Dugan, Bavuso, and Boyd (1992) introduced the dynamic fault tree, which includes special gates to handle such dynamic behavior. Heidtmann (1992) considers deterministic methods for the analysis of dynamic redundancy.

Including coverage in combinatorial models. Most techniques for including coverage in models of fault-tolerant computer products have required the use of Markov models. In several recent papers, however, techniques for including coverage in combinatorial models have been introduced (Doyle, Dugan, and Patterson-Hine 1995).

Binary decision diagrams (BDDs). Fault-tree models of large products are inherently limited by long solution times because the amount of time needed to solve a fault-tree model can be exponential in the number of components or in the number of cut sets. Recently, some techniques from circuit theory have been adapted to the solution of fault-tree models, using the fault tree as a logic diagram. One such approach, the BDD, appears to hold great promise for the solution of very large fault trees (Coudert and Madre 1993). In many cases, the solution time for a fault tree using a BDD is independent of the number of cut sets in the product.

Combined models of hardware and software. Several authors have considered the analysis of both hardware and software faults in fault-tolerant computer products. Markov models for hardware and software in nonredundant products are considered in Stark (1987) and Laprie and Kanoun (1992). Product-level analysis of hardware and software fault-tolerant products using fault-tree and Markov models is considered in Dugan and Lyu (1993, 1994).

11.8 SUMMARY
This chapter introduced several important techniques for reliability analysis of fault-tolerant and redundant products. Combinatorial models, such as reliability block diagrams and fault trees, are useful for analyzing products that employ static redundancy; Markov models are more applicable to dynamic products. Combinatorial models have the advantage of being more concise and easier to understand, but Markov models are much more flexible. Time-independent analysis, as well as time-dependent measures, has also been considered. The possibility of imperfect coverage was shown to have a profound effect on the predicted reliability of fault-tolerant computer products. Example submodels for analyzing coverage were considered, as were methodologies for incorporating coverage models into Markov product models. A more thorough investigation into coverage modeling appears in Dugan and Trivedi (1989). Several advanced topics were also introduced.

REFERENCES

Coudert, O., and J. C. Madre. 1993. Fault tree analysis: 10^20 prime implicants and beyond. Proceedings of the Reliability and Maintainability Symposium, January 1993.
Doyle, S. A., and J. B. Dugan. 1995. Combinatorial models and coverage: A binary-decision-diagram (BDD) approach. Proceedings of the Reliability and Maintainability Symposium, January 1995, Atlanta, GA.
Doyle, S. A., J. B. Dugan, and A. Patterson-Hine. 1991. A combinatorial approach to modeling imperfect coverage. IEEE Transactions on Reliability.
Dugan, J. B. 1991. Automated analysis of phased mission reliability. IEEE Transactions on Reliability.
Dugan, J. B., S. Bavuso, and M. Boyd. 1992. Dynamic fault tree models for fault tolerant computer systems. IEEE Transactions on Reliability.
Dugan, J. B., and M. R. Lyu. 1993. System reliability analysis of an N-version programming application. Proceedings of the International Symposium on Software Reliability Engineering, Denver, CO.
———. 1994. Dependability modeling for fault tolerant software and systems, ed. M. R. Lyu. New York: John Wiley & Sons.
Dugan, J. B., and K. S. Trivedi. 1989. Coverage modeling for dependability analysis of fault-tolerant systems. IEEE Transactions on Computers 38 (6): 775.
Heidtmann, K. D. 1992. Deterministic reliability modeling of dynamic redundancy. IEEE Transactions on Reliability 41 (3): 378–385.
Johnson, A. M., and M. Malek. 1988. Survey of software tools for evaluating reliability, availability, and serviceability. ACM Computing Surveys 20 (4): 227.
Johnson, B. W. 1989. Design and analysis of fault tolerant digital systems. Reading, MA: Addison–Wesley.
Laprie, J. C., and K. Kanoun. 1992. X-ware reliability and availability modeling. IEEE Transactions on Software Engineering 1:130.
Meyer, J. F. M. 1992. Performability: A retrospective and some pointers to the future. Performance Evaluation 14:139.
Ng, Y-W., and A. Avizienis. 1976. A model for transient and permanent fault recovery in closed fault tolerant systems. Proceedings IEEE International Symposium on Fault Tolerant Computing, FTCS-6, 182, June 1976.
Rauzy, A. 1993. New algorithm for fault tree analysis. Reliability Engineering and System Safety 40:203.
Siewiorek, D. P., and R. S. Swarz. 1982. The theory and practice of reliable system design. Bedford, MA: Digital Press.
Smotherman, M. K., and K. Zemoudeh. 1989. A nonhomogeneous Markov model for phased mission reliability analysis. IEEE Transactions on Reliability 38 (5): 585.
Somani, A. K., and K. Trivedi. 1994. Phased-mission system analysis using Boolean algebraic methods. Proceedings of the ACM Sigmetric Conference on Measurement and Computer Systems 98.
Stark, G. E. 1987. Dependability evaluation of integrated hardware/software systems. IEEE Transactions on Reliability 440.
CHAPTER 12

Reliability Models and Data Analysis for Repairable Products

Harold S. Balaban
CONTENTS

12.1 Introduction
12.2 Analytical Background
    12.2.1 Age-Independent F-R Processes
    12.2.2 Age-Persistent F-R Processes
    12.2.3 Defining Characteristics of AI and AP Processes
    12.2.4 Failure Repair as Renewal and Poisson Processes
        12.2.4.1 Renewal Processes
        12.2.4.2 Homogeneous Poisson Processes
        12.2.4.3 Nonhomogeneous Poisson Processes
        12.2.4.4 F-R Process Relationships
12.3 Data Analysis Techniques
    12.3.1 Graphical Trend Tests
    12.3.2 Test for a Renewal Process
    12.3.3 Test for a Homogeneous Poisson Process
    12.3.4 Comparison of Two Samples
    12.3.5 Fitting the Weibull Nonhomogeneous Poisson Process
        12.3.5.1 Weibull Process Characteristics
        12.3.5.2 Estimation of h and ^
        12.3.5.3 Goodness-of-Fit Tests
        12.3.5.4 Confidence Interval Estimates
12.4 Summary
References
12.1 INTRODUCTION

This chapter describes methods for modeling and analyzing failures of repairable products that normally exhibit wearout characteristics, particularly nonelectronic equipment. Most repairable products are restored to operable condition through replacement or repair of failed components, rather than replacement or repair of the whole unit. The quality of the restoration is the central issue. If the maintenance action restores the product to as-new condition, then the theory of ordinary renewal processes can be used, and analysis of the failure data is relatively straightforward. However, if the product is not restored to a new condition after maintenance, the analytical procedures and operational decisions related to the analysis become more complex.

To illustrate some aspects of the problem, consider an automobile that has accumulated 30,000 miles. Assume that the automobile fails due to a small puncture in a tire. When the tire is repaired, the automobile is restored to an operable condition, but the repair of the minor, yet disabling, damage has had no significant influence on the car's age; thus, a postrepair "good-as-new" or renewal assumption clearly would not be valid. If the car's owner had decided to replace all four tires instead of repairing the puncture, then a partial restoration to a new condition would have taken place. If the owner decided that while the car was in the shop, all major products should be overhauled, then a renewal assumption might be reasonable.

It is most important that the analysis of repairable products consider the effects of the repair action on the product's age. Otherwise, decisions on operational and maintenance policy, sparing levels, design improvement, fleet sizes, and other operating and support factors may be subject to serious error.

Section 12.2 of this chapter provides an analytical background for the problem by introducing the concepts of age independence and age persistence. Section 12.3 presents a recommended procedure for analyzing repairable product failure data, illustrating the methods using actual data. Section 12.3.5 summarizes methods for analyzing data from a Weibull process, a special case of an age-persistent process often used in reliability growth theory that has equal utility for failure-repair processes.

12.2 ANALYTICAL BACKGROUND
The failure times of a product restored to an operable condition through repair represent a stochastic process, denoted by {X(n); n = 0, 1, 2, …} or, more briefly, as {X(n)}, where X(0) is defined as zero. X(k) represents the time of the kth failure, as measured from the origin. The origin is the time when a new equipment item is installed or when a failed product is restored to a new condition through major overhaul. This process is called a failure-repair (F-R) process. Before describing types of failure-repair processes, some necessary notation must be introduced:

• $F_{X(n)}(x)$ is the cumulative distribution function of X(n); that is, $F_{X(n)}(x) = P[X(n) \le x]$;
• $F_{X(n+1)|X(n)=x}(y)$ is the cumulative distribution function of the conditional random variable X(n + 1) | X(n) = x; that is, $F_{X(n+1)|X(n)=x}(y) = P[X(n+1) \le y \mid X(n) = x]$;
• T(i) is the ith interarrival time, X(i) − X(i − 1), i = 1, 2, …;
• $G_n(t)$ is the cumulative distribution function of T(n);
• $r_X(x)$ is the failure or hazard rate function, defined by pdf(x)/[1 − cdf(x)];
• H(x) is the cumulative failure (hazard) rate function;
• N(x, y) is the number of failures in the interval (x, y);
• IFR (DFR) is the increasing (decreasing) failure rate;
• IFRA (DFRA) is the increasing (decreasing) failure rate average;
• IMRL (DMRL) is increasing (decreasing) mean residual life; and
• NBUE (NWUE) is new better (worse) than used in expectation.
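The hazard-rate notation can be tied together with a short worked case for a lifetime with density f and distribution function F. The Weibull form below, with scale η and shape β, is an assumed illustration and not a distribution the text has specified at this point:

```latex
\begin{align*}
r_X(x) &= \frac{f(x)}{1 - F(x)}, \qquad
H(x) = \int_0^x r_X(u)\,du = -\ln\bigl[1 - F(x)\bigr], \\
F(x) &= 1 - e^{-(x/\eta)^{\beta}}
\;\Longrightarrow\;
r_X(x) = \frac{\beta}{\eta}\Bigl(\frac{x}{\eta}\Bigr)^{\beta - 1}, \qquad
H(x) = \Bigl(\frac{x}{\eta}\Bigr)^{\beta}.
\end{align*}
```

In this case β > 1 gives an increasing failure rate (IFR), β < 1 a decreasing failure rate (DFR), and β = 1 the constant-rate exponential distribution.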
Two characterizations represent, in practice, the boundaries of maintenance influence: age-independent and age-persistent F-R processes.

12.2.1 Age-Independent F-R Processes

The first characterization of the F-R process is equivalent to renewal. This implies that equipment maintenance will restore the equipment to a new condition or, equivalently, that T(n), the nth interarrival time, will have the same distribution as T(1). Such maintenance is termed maximal repair because it will generally involve complete equipment replacement or major overhaul. The term age independence, or AI, is used for such a failure-repair process because the underlying failure-influencing mechanism depends on the time since the last repair, rather than on the equipment age as measured from the origin. Formally, an F-R process is AI if and only if

$$P[X(n+1) \ge x_{n+1} \mid X(n) = x_n, \ldots] = P[X(1) \ge x_{n+1} - x_n] \tag{12.1}$$

for all $0 \le x_n < x_{n+1} < \infty$ and all integers n. Thus, the probability that the nth failure will occur after age x, given that the previous failure occurred at age y, is the same as the probability that a new product will fail after x − y hours.

12.2.2 Age-Persistent F-R Processes

In contrast to an AI or maximal repair process, a minimal repair policy assumes that the maintenance action restores the product to the operable state that it was in just prior to failure, so that the age remains unchanged. This would be a reasonable assumption for the repair of a punctured automobile tire. The term age persistence (AP) is used for such a process. An F-R process is AP if and only if

$$P[X(n+1) \ge x_{n+1} \mid X(n) = x_n, \ldots] = P[X(1) \ge x_{n+1} \mid X(1) \ge x_n] \tag{12.2}$$

for all $0 \le x_n < x_{n+1} < \infty$ and all integers n. Thus, for an AP process, the distribution of the (n + 1)st failure time, given that the nth failure occurred at time $x_n$, is the same as the distribution of the time to the first failure, given survival to $x_n$.
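The difference between Equations 12.1 and 12.2 is easiest to see with numbers. The sketch below assumes a Weibull distribution for the first failure time X(1); that choice, the parameter values, and the function names are illustrative assumptions and are not taken from the text.

```python
import math


def weibull_survival(x, scale, shape):
    """P[X(1) >= x] for an assumed Weibull first-failure time."""
    return math.exp(-((x / scale) ** shape))


def survival_after_repair_ai(x_next, x_last, scale, shape):
    """Equation 12.1 (maximal repair): the clock restarts at the last failure."""
    return weibull_survival(x_next - x_last, scale, shape)


def survival_after_repair_ap(x_next, x_last, scale, shape):
    """Equation 12.2 (minimal repair): condition on survival to the last failure age."""
    return (weibull_survival(x_next, scale, shape)
            / weibull_survival(x_last, scale, shape))


# Last failure at 800 h; probability that the next failure occurs after 1000 h.
x_last, x_next = 800.0, 1000.0
for shape in (1.0, 2.0):
    ai = survival_after_repair_ai(x_next, x_last, scale=1000.0, shape=shape)
    ap = survival_after_repair_ap(x_next, x_last, scale=1000.0, shape=shape)
    print(f"shape={shape}: AI={ai:.4f}  AP={ap:.4f}")
```

With shape 1 (the exponential case) the two repair assumptions give the same answer, because an exponential lifetime has no memory of age. With shape 2, a wearout case, the AP probability (about 0.70) is well below the AI probability (about 0.96), since minimal repair leaves the product at its accumulated age.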
12.2.3 Defining Characteristics of AI and AP Processes

Following the notation established previously, where I is the set of positive integers, {X(n), n ∈ I} is AI if and only if

$$F_{X(n+1)|X(n)}(x_{n+1} \mid x_n) = F_{X(1)}(x_{n+1} - x_n)$$

and {X(n), n ∈ I} is AP if and only if

$$\bar{F}_{X(n+1)|X(n)}(x_{n+1} \mid x_n) = \frac{\bar{F}_{X(1)}(x_{n+1})}{\bar{F}_{X(1)}(x_n)}$$

where $\bar{F} = 1 - F$ denotes the corresponding survival function.

These two forms of F-R processes can be characterized by the distribution of the random variable X(n + 1) | X(n), which, in turn, is defined only in terms of the first life-length random variable, X(1). Making the reasonable assumption that maintenance cannot restore a product to a better-than-new condition or to a condition that makes it statistically worse than it was prior to failure, the AI and AP characterizations are natural boundaries on the quality of the maintenance action. For products that contain components subject to wearout (for example, mechanical equipment), an AI process is clearly more desirable because the age of such products in a degraded state is reset to zero as a result of AI repair.

12.2.4 Failure Repair as Renewal and Poisson Processes

The relation of AI and AP F-R processes to renewal and Poisson processes is now considered.
12.2.4.1 Renewal Processes

A process that generates a series of events is called a renewal process if the times between events are independent and identically distributed (i.i.d.). Products that fail and are replaced with new products of the same type generally produce such a process. Let F be the distribution of X(1), called the underlying distribution, and let $F^{(k)}$ be the k-fold convolution of F with itself. $F^{(k)}$ represents the cumulative distribution function of the sum of k i.i.d. random variables. $F^{(k)}$ will then be the distribution of the time to the kth event; that is,

$$F^{(k)}(x) = P[X(k) \le x] \tag{12.3}$$

Also, because

$$P[N(0, x) \ge k] = P[X(k) \le x] = F^{(k)}(x) \tag{12.4}$$

then

$$P[N(0, x) = k] = P[N(0, x) \ge k] - P[N(0, x) \ge k + 1] = F^{(k)}(x) - F^{(k+1)}(x) \tag{12.5}$$

The Poisson distribution provides an upper bound on the probability of n or more failures in (0, x) for failure distributions with an increasing failure rate; that is, if F is IFR, then

$$P[N(0, x) \ge n] \le \sum_{j=n}^{\infty} \frac{e^{-x/\theta_p} (x/\theta_p)^j}{j!} \tag{12.6}$$

where $\theta_p$ is the Poisson mean interarrival time. The renewal function, $M_r(x)$, is defined as the expected number of renewals (events) in (0, x); that is,

$$M_r(x) = E[N(0, x)] \tag{12.7}$$

$M_r(x)$ can be represented as an integral equation, known as the fundamental renewal equation, with the following form:

$$M_r(x) = F(x) + \int_0^x M_r(x - t)\,dF(t) \tag{12.8}$$

If F has a density, f, then, by differentiation,

$$m_r(x) = f(x) + \int_0^x m_r(x - t) f(t)\,dt \tag{12.9}$$

The function $m_r(x) = dM_r(x)/dx$ is known as the renewal density; $m_r(x)\,dx$ may be interpreted as the unconditional probability of renewal (e.g., an F-R event) in the interval (x, x + dx) or, equivalently, as the expected number of renewals per unit time. The elementary renewal theorem states that the expected number of renewals per unit time approaches $1/\theta_p$; that is,

$$\lim_{x \to \infty} \left[ \frac{M_r(x)}{x} \right] = \frac{1}{\theta_p} \tag{12.10}$$

where $\theta_p$ is the mean interarrival time. For most common distributions, the expected number of renewals in an interval, x to x + h, is approximately $h/\theta_p$ for x large and h small.
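Equation 12.8 rarely has a closed-form solution, but it is straightforward to solve on a grid. The following is a minimal numerical sketch, not a method given in the text: the Stieltjes integral is replaced by a sum over grid cells with the renewal function averaged over each cell, and the function name, step size, and test distributions are assumptions made for illustration.

```python
import math


def renewal_function(cdf, x_max, h):
    """Approximate M_r on the grid 0, h, 2h, ..., x_max from Equation 12.8.

    Assumes cdf(0) == 0. The cell nearest t = 0 involves the unknown M[k],
    so that term is moved to the left-hand side before dividing.
    """
    n = int(round(x_max / h))
    F = [cdf(i * h) for i in range(n + 1)]
    dF = [0.0] + [F[j] - F[j - 1] for j in range(1, n + 1)]
    M = [0.0] * (n + 1)
    for k in range(1, n + 1):
        total = F[k] + 0.5 * M[k - 1] * dF[1]
        for j in range(2, k + 1):
            # Average of M_r(x_k - t) over the cell t in [(j-1)h, jh].
            total += 0.5 * (M[k - j] + M[k - j + 1]) * dF[j]
        M[k] = total / (1.0 - 0.5 * dF[1])
    return M


# Exponential interarrival times with mean 2: M_r(x) should be close to x / 2.
theta = 2.0
M_exp = renewal_function(lambda x: 1.0 - math.exp(-x / theta), 10.0, 0.02)
print(M_exp[-1])

# Weibull interarrival times (shape 2, scale 1): M_r(x) / x should approach
# 1 / mean, as Equation 12.10 states.
mean_weibull = math.gamma(1.5)
M_wbl = renewal_function(lambda x: 1.0 - math.exp(-x * x), 20.0, 0.02)
print(M_wbl[-1] / 20.0, 1.0 / mean_weibull)
```

The exponential case is a useful check because the renewal process is then a Poisson process and the renewal function is exactly linear in x. The double loop makes the cost quadratic in the number of grid points, which is acceptable for a sketch but not for fine grids over long horizons.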
Laplace transforms can be used to find the renewal function or the renewal density according to the following equations:

$$M_r^*(s) = \frac{F^*(s)}{1 - f^*(s)}, \qquad m_r^*(s) = \frac{f^*(s)}{1 - f^*(s)} \tag{12.11}$$

where

$$g^*(s) = \int_0^{\infty} e^{-sx} g(x)\,dx \tag{12.12}$$

with g(·) defined only over the positive part of the real line. The mean interarrival time (that is, the mean time between failures) when observing a renewal process over the interval $x_1$ to $x_2$ is given by

$$E[T(x_1, x_2)] = \frac{x_2 - x_1}{M_r(x_2) - M_r(x_1)} \tag{12.13}$$

This result is exact when observation starts at an arbitrary point in time and is asymptotically true if observation starts at a renewal. Now consider the distribution of the remaining life of a unit operating at time x. If τ(x) represents this random variable, then

$$P[\tau(x) > y] = \bar{F}(x + y) + \int_0^x \bar{F}(x + y - z)\,dM_r(z) \tag{12.14}$$

where $\bar{F} = 1 - F$. Note that, in the preceding equation, we know the time from the origin, x, but not the age of the unit in service. If x is not known, then the distribution of the waiting time to the next failure is given by

$$W(t) = \frac{1}{\theta_p} \int_0^t \bar{F}(y)\,dy \tag{12.15}$$

An important result of renewal processes, often called the Drenick theorem, deals with a series product of n independent components. It is assumed that the F-R process for each component is a renewal process. Under fairly weak conditions, it can be shown that as the number of components becomes large, the limiting distribution of times between product failures is exponential. This is true even though component failure times are not exponential. In reliability theory, this result is analogous to the central limit theorem and explains why the exponential distribution is applied even to products with wearout components. Work by Blumenthal, Greenwood, and Herbach (1973) indicates that the length of the operating period is more important than the number of components.
The well-known steady-state availability equation is another result of renewal processes. Let $T_i$ be the interarrival operating hours before the ith failure, and let $D_i$ be the repair time for the ith repair. If both $T_i$ and $D_i$ represent renewal processes, then the probability that the product will be operational at age x, A(x), is, in the limit,

$$A = \lim_{x \to \infty} A(x) = \frac{\mu_{ft}}{\mu_{ft} + \mu_{rt}} \tag{12.16}$$

where $\mu_{ft}$ and $\mu_{rt}$ are the means of the failure time and repair time distributions, respectively.

12.2.4.2 Homogeneous Poisson Processes

A homogeneous Poisson process (HPP) is a renewal process for which the distribution of the number of events in any interval of time is given by the following equation:

$$P[N(x_1, x_2) = m] = \frac{[\lambda(x_2 - x_1)]^m}{m!}\, e^{-\lambda(x_2 - x_1)} \tag{12.17}$$

The parameter $\lambda$ is a constant with the dimensions of the reciprocal of time. Over a long period of time, it measures the mean rate of occurrence of events. Thus, $\lambda(x_2 - x_1)$ can be thought of as the mean number of events in $(x_1, x_2)$. Other properties of homogeneous Poisson processes include the following:

• The time intervals between failures have an exponential distribution if and only if the F-R process is HPP.
• The number of events in any interval is independent of the number of events in any other nonoverlapping interval; this allows the origin from which times to failure are measured to be arbitrarily defined.
• The time to the nth event, X(n), has the gamma distribution

$$f_{X(n)}(x) = \frac{\lambda(\lambda x)^{n-1}}{(n-1)!}\, e^{-\lambda x}, \quad x \ge 0 \tag{12.18}$$

and $2\lambda X(n)$ is distributed as chi-square with 2n degrees of freedom; the mean and variance of X(n) are

$$E[X(n)] = n/\lambda, \qquad \mathrm{Var}[X(n)] = n/\lambda^2 \tag{12.19}$$

For large n, X(n) is approximately normal with mean and variance as given earlier. If there are p HPPs operating (e.g., p pieces of equipment, all identical with equal failure rates, $\lambda$) and we consider failures irrespective of the process in which they occur, then the overall process is an HPP, with rate parameter $p\lambda$. If there are p equipment items and failed items are not replaced, the distribution of the rth ordered failure, X(r), is determined from the following equation:

$$X(r) = \frac{V_1}{p} + \frac{V_2}{p-1} + \cdots + \frac{V_r}{p-r+1}, \quad r = 1, 2, \ldots, p \tag{12.20}$$

where the $V_i$ are independent exponentially distributed random variables with parameter $\lambda$; thus,

$$E[X(r)] = \frac{1}{\lambda}\left[\frac{1}{p} + \frac{1}{p-1} + \cdots + \frac{1}{p-r+1}\right] \tag{12.21}$$

The joint distribution of n failure times in the interval (0, x) with $x_1 \le x_2 \le \cdots \le x_n \le x$ is

$$f_X(x_1, x_2, \ldots, x_n, x) = \lambda e^{-\lambda x_1} \cdot \lambda e^{-\lambda(x_2 - x_1)} \cdots \lambda e^{-\lambda(x_n - x_{n-1})} \cdot e^{-\lambda(x - x_n)} = \lambda^n e^{-\lambda x} \tag{12.22}$$

The conditional probability density function (pdf) of $x_1, x_2, \ldots, x_n$, given that n events have occurred in (0, x), is

$$f_X(x_1, x_2, \ldots, x_n \mid N(0,x) = n) = \frac{n!}{x^n}, \quad 0 \le x_1 \le x_2 \le \cdots \le x_n \le x \tag{12.23}$$
This is the same distribution as that of n order statistics corresponding to n random variables uniformly distributed over (0, x).

12.2.4.3 Nonhomogeneous Poisson Processes

If the occurrence rate of a Poisson process is a time-dependent function, then it is called a nonhomogeneous Poisson process (NHPP). The probability distribution of the number of events occurring in any interval of time is

$$P[N(x_1, x_2) = m] = \frac{\left[\int_{x_1}^{x_2} v(x)\,dx\right]^m e^{-\int_{x_1}^{x_2} v(x)\,dx}}{m!} \tag{12.24}$$

Note that this probability has the Poisson form $[c^m/m!]\,e^{-c}$, where the parameter c is given by the integral term, which is dependent on $x_1$ and $x_2$. The function v(x) is called the intensity function, representing the time-dependent rate of event occurrence. We can say that $v(x)\,\Delta x$ is the approximate unconditional probability that an event (e.g., failure) occurs in the interval $(x, x + \Delta x)$ for $\Delta x$ small. The function

$$m(x) = \int_0^x v(t)\,dt \tag{12.25}$$

is called the mean-value function because it represents the expected number of failures from 0 to x.
If we consider a time-scale transformation of the form

$$\tau = \int_0^x v(t)\,dt \tag{12.26}$$

then the series of failures becomes a homogeneous Poisson process on the transformed scale. For an NHPP, the intervals between successive events are, in general, neither independent nor identically distributed. If we observe the process over $(0, x_0)$ and events occur at $(x_1, x_2, \ldots, x_n)$, the likelihood function for the observed failure times is

$$L(x_1, x_2, \ldots, x_n) = \left[\prod_{i=1}^{n} v(x_i)\right] e^{-\int_0^{x_0} v(x)\,dx} \tag{12.27}$$

Given that n events occurred in $(0, x_0)$, the conditional pdf of the n failure times $(x_1, x_2, \ldots, x_n)$ is the same as that of n order statistics of random variables with the common distribution function

$$F_x(x) = \frac{m(x)}{m(x_0)}, \quad 0 \le x \le x_0 \tag{12.28}$$
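The time-scale transformation in (12.26) also suggests a simple way to generate NHPP failure times: draw events of a unit-rate HPP on the transformed scale and map them back through the inverse of the mean-value function. The sketch below assumes, purely for illustration, a power-law intensity $v(x) = \lambda\beta x^{\beta-1}$, for which $m(x) = \lambda x^{\beta}$ can be inverted in closed form; the function name `simulate_power_law_nhpp` and the parameter values are hypothetical.

```python
import random

def simulate_power_law_nhpp(lam, beta, x_end, seed=0):
    """Failure times on (0, x_end] for the assumed intensity lam*beta*x**(beta-1)."""
    rng = random.Random(seed)
    m_end = lam * x_end ** beta          # mean-value function m(x_end), equation (12.25)
    times, tau = [], 0.0
    while True:
        tau += rng.expovariate(1.0)      # next event of a unit-rate HPP on the tau scale
        if tau > m_end:
            return times
        # invert tau = m(x) = lam * x**beta to recover the failure time x
        times.append((tau / lam) ** (1.0 / beta))

# Illustrative parameters (assumptions, not data from the text)
times = simulate_power_law_nhpp(lam=0.1, beta=2.0, x_end=12.0, seed=1)
print(len(times), "simulated failures; expected number is", 0.1 * 12.0 ** 2.0)
```

Because $\beta > 1$ in this illustration, the simulated failures cluster toward the end of the interval, which is the pattern expected of a deteriorating product.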
The preceding results correspond to those of homogeneous Poisson processes. Balaban and Singpurwalla (1984) derived a number of results about the properties of the random variable X(n + 1)|X(n), the time to the (n + 1)st failure given the time of the nth failure, for an NHPP. Some of these results are summarized next.

• $F_{X(1)}$ is IFR (DFR) if and only if $F_{X(n+1)|X(n)}$ is IFRA (DFRA). The preceding result leads to the following chain of implications:

$$F_{X(1)}\ \text{IFR (DFR)} \;\Rightarrow\; F_{X(n+1)|X(n)}\ \text{IFR (DFR)} \;\Rightarrow\; F_{X(n+1)|X(n)}\ \text{IFRA (DFRA)} \;\Rightarrow\; F_{X(1)}\ \text{IFR (DFR)}$$

• $F_{X(1)}$ is DMRL (IMRL) if and only if $F_{X(n+1)|X(n)}$ is NBUE (NWUE). The chain of implications from this result is

$$F_{X(1)}\ \text{DMRL (IMRL)} \;\Rightarrow\; F_{X(n+1)|X(n)}\ \text{DMRL (IMRL)} \;\Rightarrow\; F_{X(n+1)|X(n)}\ \text{NBUE (NWUE)} \;\Rightarrow\; F_{X(1)}\ \text{DMRL (IMRL)}$$

• For all other failure distribution characterizations, the property of X(1) does not necessarily carry over to X(n + 1)|X(n).
• If $F_{X(1)}$ is IFRA (DFRA), then

$$\bar{F}_{X(n+1)|X(n)}(y \mid x) \le (\ge)\ \left[\bar{F}_{X(1)}(x+y)\right]^{y/(x+y)} \tag{12.29}$$

where $\bar{F} = 1 - F$.
• The unconditional density of X(n) is

$$h_{X(n)}(x) = \frac{[H_{X(1)}(x)]^{n-1}}{(n-1)!}\, f_{X(1)}(x) \tag{12.30}$$

• The time of the nth failure can be predicted from the last failure time, $x_k$, by

$$P[X(n) > x_n \mid X(k) = x_k] = \sum_{m=0}^{n-k-1} e^{-z}\,\frac{z^m}{m!} \tag{12.31}$$

where

$$z = H_{X(1)}(x_n) - H_{X(1)}(x_k) \tag{12.32}$$
12.2.4.4 F-R Process Relationships

By definition, an age-independent process is a renewal process because each time to failure has the same distribution as that for new products, and all events are assumed to be independent. The times to failure of the age-persistent failure repair process are governed by an NHPP. This has been proven by Balaban and Singpurwalla (1984) in a theorem that states that an F-R process is AP if and only if it is an NHPP. Figure 12.1 summarizes the relationships of the various processes. These relationships do not imply that an F-R process can be AP and AI only if X(1) is exponential; this is a necessary but not sufficient condition. For example, even if the time to failure following repair is exponential, the F-R process might be neither AI nor AP if the failure rate changes inconsistently with both forms.

Figure 12.1 Relationship among F-R, Poisson, and renewal processes. (Diagram labels: F-R process is AP; F-R process is AI; HPP; NHPP; X(1) exponential; renewal process.)

12.3 DATA ANALYSIS TECHNIQUES
This section considers data analysis techniques for describing the reliability behavior of repairable products. We shall limit the scope of the detailed presentation to the two types of F-R processes considered, AI and AP; that is, renewal and NHPP processes, which represent natural boundaries on the repair effectiveness. The basic strategy for modeling and analyzing repairable product failure data is presented in Figure 12.2. It must first be determined whether the times between failure (interarrival times) can be modeled as a renewal process, by testing the renewal process hypothesis against the alternative hypothesis that the interarrival times have a monotonic trend or, equivalently, that the rate of occurrence of failure is monotonically increasing or decreasing. If no trend is evident, a renewal process, possibly an HPP, is an appropriate model; if a trend is evident, an NHPP may fit. If neither a renewal process nor an NHPP is appropriate, more complicated forms must be considered or nonparametric approaches must be used. This discussion is limited to graphical procedures for trend testing, tests for a renewal process, tests for an HPP, and fitting a Weibull NHPP. The references discuss other modeling forms and present additional data analysis methods.

Figure 12.2 Strategy for modeling of repairable equipment. (Flowchart decisions: Is the process a renewal process? Is an HPP in force? Is an NHPP in force? Outcomes: fit HPP; fit renewal process; fit NHPP; fit other type of process.)

12.3.1 Graphical Trend Tests

Graphical procedures can be used to provide a first-level indication of whether there is a trend in the rate of occurrence of failures. The simplest form of graphical analysis is to plot cumulative failures versus cumulative operating time. A line fitted through the data will be straight if there is no trend. An upward curve indicates an increasing failure trend; for example, wearout causes failures to occur more frequently with time.
A curve with a downward bend indicates improvement with age. These possibilities are depicted in Figure 12.3.

Figure 12.3 Graphical trend analysis. (Cumulative failures versus cumulative time, with curves for wearout, no trend, and improvement with age.)

Consider an equipment item with the following failure times: 80, 125, 191, 242, 292, 328, 410, 436, 480, 512, 540, 577, 601, 619, 640, 658, 678, 705, 720, and 741 hours. The plot of the cumulative number of failures versus cumulative time is given in Figure 12.4. The trend is toward increasing failure frequency because the curve has a definite upward bend. Supporting this conclusion is the fact that the average of the first 10 interarrival times is 51.2 hours, while the average of the last 10 interarrival times is 22.9 hours.

Figure 12.4 Cumulative failure versus cumulative time. (Plot of the sample failure data: cumulative number of failures versus cumulative time.)

Figure 12.5 Failure frequency over time. (Grouped sample failure data: failure frequency versus time interval, for the intervals 0–150, 150–300, 300–450, 450–600, and 600–750.)
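The numerical check that accompanies the cumulative plot can be reproduced in a few lines. The sketch below uses the 20 failure times listed above; the helper name `interarrival_times` is an assumption introduced here. It builds the points of the cumulative-failures plot and compares the mean interarrival time of the first and second halves of the record.

```python
failure_times = [80, 125, 191, 242, 292, 328, 410, 436, 480, 512,
                 540, 577, 601, 619, 640, 658, 678, 705, 720, 741]

def interarrival_times(times):
    """Differences between successive failure times, starting from time 0."""
    return [t - prev for prev, t in zip([0] + times[:-1], times)]

gaps = interarrival_times(failure_times)

# Points (cumulative time, cumulative failures) for a plot such as Figure 12.4
plot_points = [(t, i) for i, t in enumerate(failure_times, start=1)]

first_half_mean = sum(gaps[:10]) / 10
second_half_mean = sum(gaps[10:]) / 10
print("mean of first 10 interarrival times:", first_half_mean)
print("mean of last 10 interarrival times: ", second_half_mean)
```

A second-half mean well below the first-half mean is the numerical counterpart of the upward bend in the cumulative plot.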
For a single product, the data can be grouped and the grouped failure frequencies plotted. For example, the preceding data can be divided into five intervals of 150 hours each, yielding Table 12.1. Plotting the grouped failure frequencies gives the graph in Figure 12.5, again supporting the conclusion of an increasing trend.

Table 12.1 Failure Frequency Table
Interval   Time (hours)   Frequency
1          0–150          2
2          150–300        3
3          300–450        3
4          450–600        4
5          600–750        8

In many cases, data will be available for a number of like products. Because observation periods will vary, some analysis of the data is required before graphical procedures can be invoked. An outline of one such procedural approach follows:

• Define equal time intervals over the period of observation, $I_1 = (0, t)$, $I_2 = (t, 2t)$, ….
• Determine the number of products observed in each interval; fractional numbers may be used when the observation of a product stops before the end of an interval (censored observation). Let $n_i$ be the number of products observed in the ith interval.
• Calculate the conditional failure rate for each interval by the equation $r_i = f_i/n_i$, where $f_i$ is the number of failures occurring in the ith interval. Note that $r_i$ can be greater than 1.0.
• A plot of $r_i$ versus time will indicate the general relationship of failure occurrences to time.
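The interval procedure for several like products can be sketched as follows. This is an illustration under stated assumptions: the function name `interval_failure_rates` is hypothetical, the termination times and failure counts are made-up values rather than data from the text, and every interval is assumed to have at least one product under observation.

```python
def interval_failure_rates(termination_times, failures_per_interval, width):
    """Return (n_i, r_i) for each interval, where r_i = f_i / n_i."""
    results = []
    for i, f_i in enumerate(failures_per_interval):
        start = i * width
        # n_i: each product contributes the fraction of the interval it was observed
        n_i = sum(min(max(t_end - start, 0.0), width) / width
                  for t_end in termination_times)
        results.append((n_i, f_i / n_i))
    return results

# Hypothetical data: three products observed for 400, 400, and 250 hours,
# with failures counted in four 100-hour intervals.
rates = interval_failure_rates([400, 400, 250], [3, 3, 2, 2], width=100)
for (n_i, r_i), start in zip(rates, range(0, 400, 100)):
    print(f"{start}-{start + 100} h: n_i = {n_i:.2f}, r_i = {r_i:.2f}")
```

The third product is censored halfway through the third interval, so it counts as 0.5 of a product there and drops out of the fourth interval entirely.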
Example. Consider the time lines for five like units given in Table 12.2, where each * represents a failure occurrence and $T_i$ is the termination time for the ith unit.

Table 12.2 Failure and Termination Times
Time interval (hours)   Time lines of the units under observation
0–100                   -**-*--*-----*--
100–200                 -*--**-*--*--
200–300                 -*--*--*--
300–400                 -**-*---
400–500                 -*--*--
Termination times (hours): T1 = 500, T2 = 500, T3 = 350, T4 = 220, T5 = 99

These data yield Table 12.3.

Table 12.3 Failure Rate Estimates
Interval (hours)   Number of Samples   Number of Failures   Failure Rate
0–99               5                   5                    1.00
100–199            4                   5                    1.25
200–299            3.2                 3                    0.94
300–399            2.5                 3                    1.20
400–499            2                   2                    1.00

Note that the number of samples accounts for partial operation in an interval of a censored product. A plot of the failure rate will be relatively flat, indicating no evidence of a trend. Using this technique to combine data about different products must be approached with caution. It is not appropriate if there is evidence that the products are not from the same population. For example, assume that there are two products. For one, the numbers of failure occurrences in five time intervals are 5, 4, 3, 2, and 1; for the other product, the numbers are 1, 2, 3, 4, and 5. The total number of failures in each interval is six, indicating a constant failure occurrence rate. Actually, the first product shows a strong decreasing rate, while the second shows a strong increasing rate. Combining the two will create a canceling effect that can give misleading results. When confronted with these sorts of data, the analyst must first determine the consistency considering hardware, operation, environment, and data collection procedures. The analytical procedures in the following sections provide more complete tests for trends.

12.3.2 Test for a Renewal Process

The procedure for the Mann (1945) test for a renewal process versus a monotonic trend is as follows:

• Obtain the chronologically ordered interarrival times:

$$T(1) = X(1), \quad T(2) = X(2) - X(1), \quad \ldots, \quad T(n) = X(n) - X(n-1) \tag{12.33}$$

• Count the number of inversions, $I_n$; that is, for each interarrival time, compare every later interarrival time and count the number of occasions when the later interarrival time is larger. Mathematically, an inversion occurs if T(i) < T(j) when i < j.
• If n is less than 10, use the values given in Table 12.5 to determine if the null hypothesis of a renewal process is sustained. If the probability level for $I_n$ is consistent with the chosen significance level, then a renewal process can be used as a reasonable description of the F-R process.
• For n ≥ 10, calculate

$$Z = \frac{I_n - n(n-1)/4}{\left[(2n^3 + 3n^2 - 5n)/72\right]^{1/2}} \tag{12.34}$$

and compare it to a standardized normal deviate.
Example. The data in Table 12.4 represent failure times of a pump unit on a ship. With n = 10 and $I_n$ = 10,

$$Z = \frac{10 - 10(9)/4}{\left[(2 \cdot 10^3 + 3 \cdot 10^2 - 5 \cdot 10)/72\right]^{0.5}} = \frac{-12.5}{5.59} = -2.24 \tag{12.35}$$

Table 12.4 Failure Time Data for a Pump
Event   Failure Time (hours)   Interarrival Time (hours)   Number of Inversions
1       327                    327                         3
2       1380                   1053                        0
3       2289                   909                         0
4       3080                   791                         0
5       3197                   117                         2
6       3422                   225                         1
7       3498                   76                          2
8       3520                   22                          2
9       3755                   235                         0
10      3851                   96                          -
Total                                                      10
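The inversion count and the normal approximation of equation (12.34) are easy to compute directly. The following sketch applies them to the pump failure times of Table 12.4; the function names `count_inversions` and `mann_z` are assumptions introduced for illustration.

```python
import math

def count_inversions(gaps):
    """Number of pairs i < j for which the later interarrival time is larger."""
    n = len(gaps)
    return sum(1 for i in range(n) for j in range(i + 1, n) if gaps[i] < gaps[j])

def mann_z(gaps):
    """Inversion count and the statistic of equation (12.34)."""
    n = len(gaps)
    i_n = count_inversions(gaps)
    mean = n * (n - 1) / 4
    variance = (2 * n**3 + 3 * n**2 - 5 * n) / 72
    return i_n, (i_n - mean) / math.sqrt(variance)

pump_failure_times = [327, 1380, 2289, 3080, 3197, 3422, 3498, 3520, 3755, 3851]
gaps = [t - prev for prev, t in zip([0] + pump_failure_times[:-1], pump_failure_times)]

i_n, z = mann_z(gaps)
print("inversions:", i_n)
print("Z:", round(z, 2))
```

A strongly negative Z corresponds to few inversions, that is, to interarrival times that shrink over time.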
Table 12.5 Probability of Obtaining T or Fewer Inversions for Sample Size n (entries are probabilities × 1000)
T     n = 3   n = 4   n = 5   n = 6   n = 7   n = 8   n = 9
0     167     42      8       1       0       0       0
1     500     167     42      8       1       0       0
2     833     375     117     28      5       1       0
3     1000    625     242     68      15      3       0
4             833     408     136     35      7       1
5             958     592     235     68      16      3
6             1000    758     360     119     31      6
7                     883     500     191     54      12
8                     958     500     281     89      22
9                     992     640     386     138     38
10                    1000    765     500     199     60
11                            864     500     274     90
12                            932     614     360     130
13                            992     719     452     179
14                            1000    809     500     238
15                                    881     500     306
16                                    932     548     381
17                                    965     640     460
18                                    985     726     500
19                                    995     801     500
20                                    999     862     540
Because $|Z| > Z_{0.05} = 1.96$, a renewal process is not appropriate when compared to a monotonic trend. In this case, the interarrival times are decreasing in length, and an increasing trend of failure occurrence appears to be taking place. Note that, in Table 12.5, the maximum number of inversions is n(n − 1)/2. For values not shown, use symmetry, as described: If the number of observed inversions, I, is greater than the tabular value, then

$$\text{for } n = 8:\ P(I) = 1 - P(29 - I); \qquad \text{for } n = 9:\ P(I) = 1 - P(37 - I) \tag{12.36}$$

To illustrate the symmetry, we show two examples:

$$\text{(A) if } n = 8 \text{ and } I = 18,\ P(18) = 1 - P(29 - 18) = 1 - P(11); \qquad \text{(B) if } n = 9 \text{ and } I = 20,\ P(20) = 1 - P(37 - 20) = 1 - P(17) \tag{12.37}$$

If a significance level is set at, say, P, a trend exists if the probability is less than or equal to P/2 or greater than or equal to 1 − P/2 (for example, 0.05 and 0.95 for a 10% significance level). A one-tail test can also be used, depending on whether an increasing or decreasing trend is of interest. A large number of inversions means that times between failure are increasing with time or, equivalently, that a decreasing rate of failure is occurring.

12.3.3 Test for a Homogeneous Poisson Process

A relatively simple test can be used to determine whether an F-R process is an HPP, as opposed to one with a monotonic trend. This test, called the central limit theorem test or the Laplace test, is described in Cox and Lewis (1966). Two cases are considered:

Case 1: observation stops at time $X'$. Assume that n failures occur during the interval $(0, X')$ at times $X_1, X_2, \ldots, X_n$. Compute the following statistic:

$$U = \frac{\sum_{i=1}^{n} X_i - nX'/2}{X'\,(n/12)^{1/2}} \tag{12.38}$$

Case 2: observation stops at the nth failure. Assume that n failures occur at times $X_1, X_2, \ldots, X_n$. Compute the following:

$$U = \frac{\sum_{i=1}^{n-1} X_i - (n-1)X_n/2}{X_n\left[(n-1)/12\right]^{1/2}} \tag{12.39}$$
To test the null hypothesis that no trend exists in the data and that an HPP is in force, compare the statistic U to the standardized normal deviate at the chosen significance level, say $Z_\alpha$. If the absolute value of U is less than the value of $Z_\alpha$, we cannot reject the null hypothesis. If $U \ge Z_\alpha$, then an increasing trend exists; if $U \le -Z_\alpha$, then a decreasing trend exists. The preceding equations apply to a single product. If data exist on a number, m, of like products, then a test of pooled data should be applied. The following test checks whether each of the m products has an HPP, but with possibly different occurrence rates. Let

$$N_j = \begin{cases} \text{number of failures for item } j & \text{(Case 1)} \\ \text{number of failures} - 1 \text{ for item } j & \text{(Case 2)} \end{cases} \tag{12.40}$$

$$X_j = \begin{cases} \text{observation period for item } j & \text{(Case 1)} \\ \text{last failure time for item } j & \text{(Case 2)} \end{cases} \tag{12.41}$$

Then,

$$U = \frac{S_1 + S_2 + \cdots + S_m - \frac{1}{2}\sum_{i=1}^{m} N_i X_i}{\left[\frac{1}{12}\sum_{i=1}^{m} N_i X_i^2\right]^{1/2}} \tag{12.42}$$

where

$$S_j = \sum_{i=1}^{N_j} X_{ij} \tag{12.43}$$

$S_j$ represents the sum of the failure times for the jth product. To illustrate the computation, consider a product with observed failure times of 70, 122, 152, 165, and 170 hours. Observation stopped at the last failure time, so this is a single-product, case 2 situation. Therefore,

$$U = \frac{\sum_{i=1}^{n-1} X_i - (n-1)X_n/2}{X_n\left[(n-1)/12\right]^{1/2}} = \frac{70 + 122 + 152 + 165 - 4(170/2)}{170\,(4/12)^{1/2}} = 1.72 \tag{12.44}$$

A U value of 1.72 is significant at the 10% level, indicating that a positive trend exists.
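The single-product statistic can be sketched as a small function covering both cases. The name `laplace_u` and its `end_time` argument are assumptions made for this illustration: passing an end time selects case 1, and omitting it selects case 2, in which the last failure time closes the observation period.

```python
import math

def laplace_u(failure_times, end_time=None):
    """Laplace trend statistic, equations (12.38) and (12.39)."""
    if end_time is None:
        # Case 2: observation stops at the last failure, which is excluded from the sum
        *earlier, last = failure_times
        n, total, x = len(earlier), sum(earlier), last
    else:
        # Case 1: observation stops at a fixed time
        n, total, x = len(failure_times), sum(failure_times), end_time
    return (total - n * x / 2) / (x * math.sqrt(n / 12))

u = laplace_u([70, 122, 152, 165, 170])
print("U =", round(u, 2))
```

A positive U indicates that failures fall late in the observation period (an increasing trend); a negative U indicates that they fall early.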
12.3.4 Comparison of Two Samples

Establishing whether two samples come from the same population is important in many situations because such a test can be used to determine if a design improvement has occurred, if overhaul is beneficial, if the operating environment has changed, and so forth. The procedure for the Mann-Whitney test is as follows:

• Let $T_1(1), T_1(2), \ldots, T_1(k)$ be the interarrival times for one sample and $T_2(1), T_2(2), \ldots, T_2(m)$ be the interarrival times for the second sample.
• Order the k + m times, smallest to largest, and assign the rank of 1 to the smallest, 2 to the next smallest, and so on. The largest time receives the rank of k + m. If two or more times are tied, use the average of the tied ranks.
• Let S be the sum of the ranks associated with the first sample. That is, if $R[T_1(i)]$ is the rank for the ith interarrival time of the first sample,

$$S = \sum_{i=1}^{k} R[T_1(i)] \tag{12.45}$$

• Compute the test statistic U:

$$U = S - k(k+1)/2 \tag{12.46}$$

• If k or m is less than eight, compare the test statistic U with the critical values of U, such as those found in Conover (1983), to determine if the null hypothesis that the two samples come from the same population is valid. If U is greater than the larger of the two critical values given, then the conclusion is that the first population has greater reliability than the second population; a value of U less than the smaller critical value indicates less reliability for the first population.
• If both k and m are equal to or greater than 8, then a normal approximation can be used; compute

$$Z = \frac{U - \frac{1}{2}km}{\left[km(k+m+1)/12\right]^{1/2}} \tag{12.47}$$

and compare to a standardized normal value for the chosen significance level.

To illustrate the test, consider the data taken from operating ship equipment before and after overhaul (see Table 12.6). Interarrival times are then as in Table 12.7. Ordering all the times yields the results shown in Table 12.8. Summing the "before overhaul" ranks gives

$$S = 1 + 3 + 6 + 8 + 9 + 10 = 37 \tag{12.48}$$

Table 12.6 Failure Time Data
Before Overhaul   After Overhaul
327               161
1380              249
2289              590
3080              901
3197
3261

Table 12.7 Interarrival Time Data (Ordered)
Before Overhaul   After Overhaul
327               161
1053              88
909               341
791               311
117
64

Table 12.8 Ranks of Interarrival Time Data
Time   Rank   Sample
64     1      Before
88     2      After
117    3      Before
161    4      After
311    5      After
327    6      Before
341    7      After
791    8      Before
909    9      Before
1053   10     Before

The test statistic with k = 6 is

$$U = 37 - 6(7/2) = 16 \tag{12.49}$$

The critical values at the 10% level are 4 and 26 (Conover 1983). Because U = 16 falls within this range, the conclusion is that reliability performance was unchanged after overhaul.
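The rank sum and test statistic of equations (12.45) and (12.46) can be sketched as follows, using the interarrival times of Table 12.7. The function name `rank_sum_u` is an assumption introduced here; tied values receive the average of their ranks, as the procedure requires.

```python
def rank_sum_u(sample1, sample2):
    """Rank sum S of sample1 in the pooled data and U = S - k(k + 1)/2."""
    pooled = sorted(sample1 + sample2)

    def average_rank(value):
        first = pooled.index(value) + 1      # rank of the first occurrence (1-based)
        count = pooled.count(value)          # number of tied observations
        return first + (count - 1) / 2

    s = sum(average_rank(v) for v in sample1)
    k = len(sample1)
    return s, s - k * (k + 1) / 2

before = [327, 1053, 909, 791, 117, 64]   # interarrival times before overhaul
after = [161, 88, 341, 311]               # interarrival times after overhaul

s, u = rank_sum_u(before, after)
print("S =", s, " U =", u)
```

For samples this small the statistic is compared with tabulated critical values; the normal approximation of equation (12.47) applies only when both samples have at least eight observations.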
12.3.5 Fitting the Weibull Nonhomogeneous Poisson Process

If a trend is established through application of the trend test discussed in Section 12.3.3, the next step is to see whether the data can be modeled as a nonhomogeneous Poisson process and whether an equation can be established for the process so that relevant characteristics, such as mean time between failures, average number of failures, and probability of failure, can be estimated. Because the process is nonhomogeneous, the rate at which failures occur is time dependent; thus, these characteristics will vary over time, unlike an HPP that, for example, has a single mean time between failures (MTBF). As shown earlier, the NHPP has the defining property that the distribution of the number of failures from age 0 to x is given by

$$P[N(0,x) = n] = \frac{e^{-R(x)}\,[R(x)]^n}{n!}, \quad \text{where } R(x) = \int_0^x r(t)\,dt \tag{12.50}$$

and r(t) is the intensity function representing the instantaneous rate of failure occurrence. In the case where

$$r(t) = \lambda\beta t^{\beta-1} \tag{12.51}$$

the NHPP is called a Weibull process. A large amount of theory has been developed for such a process, and it can be used as a model for an NHPP.

12.3.5.1 Weibull Process Characteristics

Characteristics of a Weibull process include:

• cumulative MTBF over (0, x):

$$\mu(0,x) = \frac{x^{1-\beta}}{\lambda} \tag{12.52}$$

• instantaneous MTBF at age x:

$$M(x) = \frac{x^{1-\beta}}{\lambda\beta} \tag{12.53}$$

• failure time distribution at time t, with t measured from x:

$$F_x(t) = 1 - e^{-t/M(x)} \tag{12.54}$$

• expected number of failures in $(x_1, x_2)$:

$$E[N(x_1,x_2)] = \lambda\left(x_2^{\beta} - x_1^{\beta}\right) \tag{12.55}$$

• distribution of the number of failures in $(x_1, x_2)$:

$$P[N(x_1,x_2) = n] = \frac{1}{n!}\left[\lambda\left(x_2^{\beta} - x_1^{\beta}\right)\right]^n e^{-\lambda\left(x_2^{\beta} - x_1^{\beta}\right)} \tag{12.56}$$

To illustrate the use of these equations, assume that $\lambda = 0.1$ and $\beta = 2$ and that time is measured in months. The following then holds:

• cumulative MTBF over the period 0–6 months:

$$\mu(0,6) = 6^{1-2}/0.1 = 1.67 \text{ months} \tag{12.57}$$

• instantaneous MTBF at 12 months:

$$M(12) = 12^{1-2}/(0.1 \times 2) = 0.42 \text{ months} \tag{12.58}$$

• probability of failing within the next half-month for a product with 12 months of service:

$$F_{12}(0.5) = 1 - e^{-0.5/0.42} = 1 - 0.304 = 0.696 \tag{12.59}$$

• expected number of failures from 6 months to 12 months:

$$E[N(6,12)] = 0.1\left(12^2 - 6^2\right) = 10.8 \text{ failures} \tag{12.60}$$

• probability that eight failures occur from 6 months to 12 months:

$$P[N(6,12) = 8] = \frac{1}{8!}\left[0.1\left(12^2 - 6^2\right)\right]^8 e^{-0.1\left(12^2 - 6^2\right)} = 0.094 \tag{12.61}$$
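The characteristics in equations (12.52) through (12.56) translate directly into short functions. The sketch below evaluates them for the illustrative values $\lambda = 0.1$ and $\beta = 2$; the function names are assumptions introduced here, and the results are computed without intermediate rounding, so the last digit can differ slightly from a hand calculation that rounds M(x) first.

```python
import math

def cumulative_mtbf(lam, beta, x):
    return x ** (1 - beta) / lam                      # equation (12.52)

def instantaneous_mtbf(lam, beta, x):
    return x ** (1 - beta) / (lam * beta)             # equation (12.53)

def prob_failure_within(lam, beta, x, t):
    return 1 - math.exp(-t / instantaneous_mtbf(lam, beta, x))   # equation (12.54)

def expected_failures(lam, beta, x1, x2):
    return lam * (x2 ** beta - x1 ** beta)            # equation (12.55)

def prob_n_failures(lam, beta, x1, x2, n):
    mean = expected_failures(lam, beta, x1, x2)
    return mean ** n * math.exp(-mean) / math.factorial(n)       # equation (12.56)

lam, beta = 0.1, 2.0   # time measured in months
print("cumulative MTBF (0, 6):      ", round(cumulative_mtbf(lam, beta, 6), 2))
print("instantaneous MTBF at 12:    ", round(instantaneous_mtbf(lam, beta, 12), 2))
print("P[failure within 0.5 month]: ", round(prob_failure_within(lam, beta, 12, 0.5), 3))
print("E[N(6, 12)]:                 ", round(expected_failures(lam, beta, 6, 12), 1))
print("P[N(6, 12) = 8]:             ", round(prob_n_failures(lam, beta, 6, 12, 8), 3))
```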
12.3.5.2 Estimation of λ and β

Several data collection possibilities must be considered. The data may be ungrouped (U) so that every failure time on each product is known, or the data may be grouped (G) so that only the total number of failures within fixed time intervals is known. For ungrouped data, observation can stop at some given time (time-truncated [T]) or at the occurrence of a specified number of failures (N). These possibilities then give rise to the following three cases:

• (U-T) ungrouped, time truncated;
• (U-N) ungrouped, failure truncated; and
• (G) grouped.

Estimation for case U-T. For ungrouped data with time-truncated testing, the maximum likelihood estimate (MLE) for $\beta$, given n failure times over $(0, x')$, is

$$\hat{\beta} = \frac{n}{n \ln x' - \sum_{i=1}^{n} \ln x_i} \tag{12.62}$$

For $\lambda$, the MLE is

$$\hat{\lambda} = \frac{n}{(x')^{\hat{\beta}}} \tag{12.63}$$

Estimation for case U-N. For ungrouped data with truncation based on the number of failures, n,

$$\hat{\beta} = \frac{n}{(n-1)\ln x_n - \sum_{i=1}^{n-1} \ln x_i}, \qquad \hat{\lambda} = \frac{n}{x_n^{\hat{\beta}}} \tag{12.64}$$

Estimation for case G. For grouped data, the estimation procedure is somewhat more complicated in that a closed-form equation for $\beta$ does not exist. Assume that there are k intervals with boundaries $x_0 = 0, x_1, \ldots, x_k$, and let $n_i$ be the number of failures in the ith interval. Then, $\beta$ is estimated as the solution of the equation

$$\sum_{i=1}^{k} n_i \left[\frac{x_i^{\hat{\beta}} \ln x_i - x_{i-1}^{\hat{\beta}} \ln x_{i-1}}{x_i^{\hat{\beta}} - x_{i-1}^{\hat{\beta}}} - \ln x_k\right] = 0 \tag{12.65}$$

where $x_0^{\hat{\beta}} \ln x_0$ is defined as equal to zero. Numerical techniques must be employed to solve this equation for $\beta$. Given an estimate for $\beta$, $\lambda$ is estimated by

$$\hat{\lambda} = \frac{\sum_{i=1}^{k} n_i}{x_k^{\hat{\beta}}} \tag{12.66}$$
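The closed-form estimators for the two ungrouped cases can be sketched in a single function. The name `weibull_process_mle` and its `end_time` argument are assumptions made for this illustration: supplying an end time selects case U-T, and omitting it selects case U-N. The data are the pump failure times of Table 12.4, with observation assumed to stop at 4,162 hours.

```python
import math

def weibull_process_mle(failure_times, end_time=None):
    """MLEs (beta_hat, lambda_hat) for cases U-T (end_time given) and U-N."""
    n = len(failure_times)
    if end_time is None:
        # Case U-N, equation (12.64): truncation at the nth failure
        x_n = failure_times[-1]
        log_sum = sum(math.log(x) for x in failure_times[:-1])
        beta_hat = n / ((n - 1) * math.log(x_n) - log_sum)
        lambda_hat = n / x_n ** beta_hat
    else:
        # Case U-T, equations (12.62) and (12.63): truncation at a fixed time
        log_sum = sum(math.log(x) for x in failure_times)
        beta_hat = n / (n * math.log(end_time) - log_sum)
        lambda_hat = n / end_time ** beta_hat
    return beta_hat, lambda_hat

pump_failure_times = [327, 1380, 2289, 3080, 3197, 3422, 3498, 3520, 3755, 3851]
beta_hat, lambda_hat = weibull_process_mle(pump_failure_times, end_time=4162)
print("beta_hat   =", round(beta_hat, 2))
print("lambda_hat =", f"{lambda_hat:.1e}")
```

The grouped case (G) has no closed form; equation (12.65) would have to be solved with a root-finding routine, which is not shown here.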
Example. Use the data on the pump presented in Section 12.3.2, which were previously shown to have a monotonically increasing trend. If observation stopped at 4,162 hours, the data are ungrouped and time truncated, so the U-T category holds. Therefore,

$$\hat{\beta} = \frac{n}{n \ln x' - \sum_{i=1}^{n} \ln x_{(i)}} = \frac{10}{10(8.3338) - 77.8094} = 1.81 \tag{12.67}$$

Then,

$$\hat{\lambda} = \frac{n}{(x')^{\hat{\beta}}} = \frac{10}{4162^{1.81}} = 2.8 \times 10^{-6} \tag{12.68}$$
Note that the $\beta$ estimate is greater than 1.0, which is consistent with an increasing trend toward greater failure occurrence frequency.

12.3.5.3 Goodness-of-Fit Tests

Goodness-of-fit tests may be used to test whether observed failure data are consistent with a Weibull process. Generally, at least 20 failure times should be observed in order to apply a goodness-of-fit test. The equations for each of the three cases are presented next.

Test for case U-T. Calculate

$$G_{UT} = \frac{1}{12n} + \sum_{i=1}^{n}\left[\left(\frac{x_i}{T}\right)^{\bar{\beta}} - \frac{2i-1}{2n}\right]^2 \tag{12.69}$$

where T is the time at which observation ends and

$$\bar{\beta} = \frac{(n-1)\hat{\beta}}{n} \tag{12.70}$$

$G_{UT}$ is compared with critical values for the Cramér-von Mises test. If $G_{UT}$ exceeds the value in the table for the selected significance level, then the null hypothesis that the data are consistent with a Weibull process is rejected.

Test for case U-N. Calculate

$$G_{UN} = \frac{1}{12(n-1)} + \sum_{i=1}^{n-1}\left[\left(\frac{x_{(i)}}{T}\right)^{\bar{\beta}} - \frac{2i-1}{2(n-1)}\right]^2 \tag{12.71}$$

where

$$\bar{\beta} = \frac{(n-1)\hat{\beta}}{n} \tag{12.72}$$

Test for case G. For each interval, calculate the expected number of failures:

$$e_i = \hat{\lambda}\left(x_i^{\hat{\beta}} - x_{i-1}^{\hat{\beta}}\right), \quad i = 1, 2, \ldots, k \tag{12.73}$$

Combine adjacent intervals if required so that the expected number of failures is at least five. Assume that, after such grouping, there are $k'$ intervals. Let $n_i'$ be the number of failures in the adjusted ith interval and let $e_i'$ be the corresponding expected value. Then calculate

$$C = \sum_{i=1}^{k'} \frac{(n_i' - e_i')^2}{e_i'} \tag{12.74}$$

C is approximately distributed as chi-square with $k' - 2$ degrees of freedom. Critical values can be found from tables of the chi-square distribution.
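The statistic for case U-T in equation (12.69) can be sketched as follows. The function name `cramer_von_mises_ut` is an assumption introduced here, and the two-point data set is a made-up example chosen so that the result is easy to verify by hand; it is far too small for a real test, which generally needs at least 20 failure times.

```python
def cramer_von_mises_ut(failure_times, end_time, beta_hat):
    """Goodness-of-fit statistic G_UT of equation (12.69) for time-truncated data."""
    n = len(failure_times)
    beta_bar = (n - 1) * beta_hat / n                 # equation (12.70)
    total = sum(((x / end_time) ** beta_bar - (2 * i - 1) / (2 * n)) ** 2
                for i, x in enumerate(sorted(failure_times), start=1))
    return 1 / (12 * n) + total

# Hypothetical check: failures at 1 and 2, observation ending at 4, beta_hat = 2,
# so beta_bar = 1 and G_UT = 1/24 + (0.25 - 0.25)**2 + (0.50 - 0.75)**2.
g_ut = cramer_von_mises_ut([1, 2], end_time=4, beta_hat=2.0)
print("G_UT =", round(g_ut, 4))
```

The computed value would then be compared with a tabulated Cramér-von Mises critical value for the chosen significance level.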
12.3.5.4 Confidence Interval Estimates

This section presents the formulas to calculate confidence limits of Weibull characteristics for ungrouped data. In some cases, these limits are approximations; for example, there is no distinction between the two types of truncation. Many of the limits have factors of the form C = A(n − 1)/n and D = B(n − 1)/n, where n is the number of failures, and A and B depend on the confidence level and n. The C and D factors are tabulated in Crow (1975). For n > 60, they can be approximated as follows:

$$C = \left[1 - \left(\frac{2}{n}\right)^{1/2} z_{\alpha/2}\right](n-1)/n, \qquad D = \left[1 + \left(\frac{2}{n}\right)^{1/2} z_{\alpha/2}\right](n-1)/n \tag{12.75}$$

where $z_{\alpha/2}$ is the $1 - \alpha/2$ percentile of the standard normal distribution. The following confidence formulas can be used:

• intensity function, r(x):
$$\text{LCL: } \hat{r}_L(x) = C\,\hat{r}(x), \qquad \text{UCL: } \hat{r}_U(x) = D\,\hat{r}(x) \tag{12.76}$$

• expected number of failures, $N(x_1, x_2)$:

$$\text{LCL: } N_L(x_1,x_2) = C\,\hat{\lambda}\left(x_2^{\hat{\beta}} - x_1^{\hat{\beta}}\right), \qquad \text{UCL: } N_U(x_1,x_2) = D\,\hat{\lambda}\left(x_2^{\hat{\beta}} - x_1^{\hat{\beta}}\right) \tag{12.77}$$

• cumulative MTBF, $\mu(0, x)$:

$$\text{LCL: } \mu_L(0,x) = \frac{x}{N_U(0,x)}, \qquad \text{UCL: } \mu_U(0,x) = \frac{x}{N_L(0,x)} \tag{12.78}$$

• instantaneous MTBF, M(x):

$$\text{LCL: } M_L(x) = \frac{1}{\hat{r}_U(x)}, \qquad \text{UCL: } M_U(x) = \frac{1}{\hat{r}_L(x)} \tag{12.79}$$

where LCL is a lower confidence limit and UCL is an upper confidence limit. Currently, no confidence formulas appear to be available for grouped data. As a conservative approach, the number of groups can be used for n in the preceding equations, which will provide wider limits than the true values.

12.4 SUMMARY
This chapter presented various methods for formulating and analyzing the reliability of repairable products. The notions of age independence and age persistence were introduced to define the boundaries of maintenance influence and they were related to the well-known renewal and Poisson processes. A basic strategy was then formulated for modeling and analyzing product failure data. Graphical and analytical tests for determining a trend in the rate of occurrence of failure were described. Finally, for the case when a Weibull nonhomogeneous Poisson process is applicable, detailed goodness-of-fit and estimation procedures were developed and illustrated. REFERENCES Balaban, H., and N. Singpurwalla. 1984. Stochastic properties of a sequence of interfailure times under minimal repair and under revival. In Reliability theory and models, ed. M. Abdel-Hameed, E. Cinlar, and J. Quinn. New York: Academic Press. Blumenthal, S., J. Greenwood, and L. Herbach. 1973. The transient reliability behavior of series systems on superimposed renewal processes. Technometrics 15:255. Conover, W. 1971. Practical nonparametric statistics. New York: John Wiley & Sons. Cox, D. R., and P. A. W. Lewis. 1966. The statistical analysis of series of events. London: Methuen. Crow, L. H. 1975. Tracking reliability growth. U.S. Army Materiel Systems Analysis Agency, Aberdeen Proving Grounds, MD. Mann, H. B. 1945. Nonparametric tests against trend. Econometrika 13:245.
CHAPTER 13
Continuous Reliability Improvement
Walter Tomczykowski
CONTENTS
13.1 Introduction
13.2 Reliability Growth Process
    13.2.1 Reliability Improvement Program
    13.2.2 Failure Classification
    13.2.3 Test Optimization
    13.2.4 Test Cycles and Environmental Considerations
13.3 Stress Margin Testing
    13.3.1 Stressed Life Test (STRIFE)
    13.3.2 Highly Accelerated Life Test (HALT)
    13.3.3 Inverse Power Law Model and Miner’s Rule
13.4 Continuous Growth Monitoring
    13.4.1 Continuous Growth Models
        13.4.1.1 Duane Model
        13.4.1.2 AMSAA Model
    13.4.2 Discrete Models
        13.4.2.1 Lloyd and Lipow Model
        13.4.2.2 Wolman Model
13.5 Reliability Improvement Effectiveness and Uncertainty
    13.5.1 Reliability Improvement Effectiveness
    13.5.2 Reliability Improvement Uncertainty
13.6 Summary
References
13.1 INTRODUCTION
Reliability improvement techniques can be applied to a new product that has passed its major hardware and/or software design reviews, to a developed product that the manufacturer wishes to make more competitive, or to an existing product that is not meeting the customer’s expectations of reliability performance. Presumably, the last case should not occur because the desired level of reliability should be designed into the product before the design is released to full production.

The reliability improvement process recognizes that the reliability of the drawing board design of a complex product can be improved and allocates time for that improvement. By operating or testing the product in a manner that will identify deficiencies caused by the design, manufacturing process, and/or operation, deficiencies can be detected and removed, and methods for designing-in reliability can be reevaluated or used to improve reliability. By comparison, reliability qualification testing is intended to demonstrate the ability of the product to perform in its intended environment, and environmental stress screening is intended only to precipitate defects. These methods do not result in reliability improvement.

A continuous reliability improvement program is cost beneficial over the life cycle. The cost benefits result from reduced warranty repairs for commercial products and from reduced maintenance and spares. This chapter discusses the principles of reliability growth, accelerated testing, and management of a continuous improvement program.
13.2 RELIABILITY GROWTH PROCESS

When complex equipment is designed with innovative technology or advanced production methods, the equipment often has unforeseen design, manufacturing, or operating deficiencies that affect reliability. A reliability improvement program seeks to achieve reliability goals by improving product design. The objective of an improvement program is to identify, locate, and correct faulty and weak aspects of the design, manufacturing process, and operating procedures. Reliability improvement is often accomplished through a program employing a test, analyze, and fix (TAAF) philosophy. The product’s reliability improves when corrective actions that remove the faulty and weak aspects of the design are incorporated and then verified with further testing. The TAAF process should be applied to developmental equipment but can be implemented on equipment already fielded. The TAAF process for developmental or prototype equipment is illustrated by a feedback loop, as in Figure 13.1. The use of a feedback loop is the basis for a successful improvement program.

[Figure 13.1 The test, analyze, and fix process. Flowchart boxes: testing of prototype equipment; detection of failures; analyze and (re)design; decision “Fix verified?” with yes and no branches forming a feedback loop.]

13.2.1 Reliability Improvement Program

The length of a test in a reliability improvement program is a function of the perceived and desired reliability for commercial products; the required reliability for military products; the maturity and complexity of the product or prototype; the number of products available for test (test units); the effectiveness of the failure reporting, analysis, and corrective action program; and the amount of time available. Because time is needed for troubleshooting, analyzing detected failures, investigating corrective actions, and incorporating changes, the test time is a portion of the total time available. Other factors affecting test time, which could be avoided through planning and training, include operator errors, chamber breakdowns, inadequate spare test units, and poor supervision.

Test time divided by total available time (where available time is scheduled calendar time multiplied by the quantity of test units) is known as test efficiency. Experience has shown that most improvement programs have a test efficiency of 50% or lower. The test efficiency can be higher than 50% if additional manpower is used to perform the failure analysis, if spare test units are available, and if adequate planning and training have been conducted. However, test efficiencies that approach 90–100% are often an indicator that there is no growth. No growth could be caused by environmental stresses that are not severe enough to precipitate further faults, inadequate management of the improvement program, or a deficiency in the detection and reporting system.

If calendar time is limited, the quantity of test units must be increased, or accelerated life testing (Section 13.3) must be employed. When adequate program resources and funding are available, calendar time could be minimized by using several test units. When more than one unit is used, the amount of test time should be reasonably distributed among the units or prototypes. Distributing the test time will prevent the accrual of hours only on a “gold-plated” or “hand-built” prototype, while the “real” prototype is down for repair. Testing only a hand-built prototype may support the desire simply to “pass the test.” To keep the test from being biased, each unit must
operate for at least 75% of the average relevant test time of all the units under test. For example, for a test of 2,000 hours on two products, neither product can be tested for less than 1,000 × 0.75, or 750 hours.

An improvement program can also be applied to products whose reliability is not measured in terms of failures per unit time. Units of test duration other than time could be the number of miles before failure, number of flights before failure, number of copies before failure, etc. This chapter discusses growth and test duration in units of test time.

Before products are officially placed under test, several tasks should be completed. One of these is establishing an environmental stress screening (ESS) process, a manufacturing process that uses random vibration and temperature cycling to precipitate part and workmanship defects. ESS will identify infant mortality failures such as manufacturing defects (missing components, wrong components, reversed components), workmanship failures (poor solder joints, bent leads, weak wirebonds), and incorrect part types. Implementing corrective actions to remove these faults reduces production, rework, and life-cycle costs. ESS can be applied to either the subassembly or final product; however, ESS is most cost beneficial when applied to the lowest level of assembly. The maximum and minimum temperature values should not exceed the rating of any of the parts or materials comprising the assembly. Care should be exercised to ensure that the physical response associated with the failure mechanisms of a part or material under stress is large enough to generate an effective screen but does not exceed the capability of the product. The effectiveness of the screen can be continuously monitored by examining the yields and results of higher level tests. If many workmanship or manufacturing failures are discovered by such tests, then the screen at lower levels must be adjusted.
When the yield or fallout data are acceptable (that is, when the majority of the failures are design failures), formal reliability growth testing may begin. Equipment to be used for reliability growth testing must undergo ESS. In addition to ESS, the following five actions should be taken:

• Verify the performance of the field environment simulation instruments (temperature chambers, vibration tables that will be used to test the prototypes, and the test measurement equipment).
• Complete a thermal analysis of the product.
• Complete a failure modes and effects analysis (FMEA).
• Establish a closed loop failure reporting, analysis, and corrective action system (FRACAS). Under FRACAS, failure analysis is performed and corrective action is implemented on all failures occurring during developmental and operational testing, including failures that occur during ESS, rather than just those occurring during the formal improvement program.
• Complete a reliability improvement plan.
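The test-efficiency ratio and the 75% distribution rule described earlier in this section lend themselves to a quick numerical check. The following Python sketch is illustrative only: the function names and the sample hours are assumptions made for this example, not part of any standard tool.

```python
def compute_test_efficiency(test_hours, calendar_hours, n_units):
    """Test time divided by available time (calendar time x number of units)."""
    available = calendar_hours * n_units
    if available <= 0:
        raise ValueError("available time must be positive")
    return test_hours / available


def units_below_minimum(unit_hours, min_fraction=0.75):
    """Indices of units with less than min_fraction of the average test time."""
    average = sum(unit_hours) / len(unit_hours)
    threshold = min_fraction * average
    return [i for i, hours in enumerate(unit_hours) if hours < threshold]


# Hypothetical program: two units, 2,000 scheduled calendar hours each.
hours = [1300.0, 700.0]  # relevant test hours accumulated on each unit
print(compute_test_efficiency(sum(hours), 2000.0, 2))  # 0.5
print(units_below_minimum(hours))  # [1], because 700 h < 0.75 * 1,000 h
```

With these sample numbers the program sits at the 50% efficiency level quoted in the text, and the second unit falls short of the 750-hour minimum from the two-product example.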
Failure analysis and corrective action are the most critical aspects of a reliability improvement program. Failures must be isolated to the root failure mode. Common failure modes for electrical products include cracked solder joints, board delaminations, component failures, software errors, procedure errors, poor board placement, and manufacturing process problems. Common failure modes for mechanical products
are corrosion, binding, and fracture (crack propagation). Specific root causes, including thermal overstress, electrical overstress, contamination, wearout, and mechanical damage, must be identified. An accurately completed FMEA will aid the analysis process and save valuable time. The basic failure analysis steps for components are, in order:

1. Identify the part to ensure that the correct part was used.
2. Establish a part history to determine where it failed and how it may have failed in the past.
3. Confirm the failure.
4. Analyze the part, following an ordered failure analysis flow, as shown in Figure 13.2.
5. Identify the failure mode and cause.
6. Take photographs and x-rays.
7. Adopt recommended corrective action to preclude recurrence of the same failure.
8. Produce a concise report summarizing each step.
Figure 13.2 summarizes steps needed to determine the root cause of an electronic component failure. Once the part is identified and previous failure modes and causes are determined, the part should be externally examined for signs of overstress. Electrical testing, using a curve tracer, can then be performed; if the failure is verified, further nondestructive analysis techniques—x-ray, particle impact noise detection (PIND), or leak test—may be used to isolate the root cause of the failure. The component is then cross-sectioned or de-lidded to facilitate an internal inspection. Further fault isolation is accomplished using such techniques as scanning electron microscopy, die probing, or energy-dispersive x-ray analysis. Bond failures can be analyzed by performing destructive bond pull tests. If the failure has not been verified, additional tests for unverified failures (such as temperature cycling, temperature shock, and vibration) are performed until the failure is precipitated. Even after these additional tests, some failure modes may still not be identifiable. If this occurs, management will have to decide what additional resources (such as manpower and cost) to expend in attempting to identify the problem. Once the failure analysis is complete, the corrective action is identified, and the results are documented, the information is entered into FRACAS. The information in FRACAS is used by the manufacturer to incorporate the corrective actions into the product. To ensure that FRACAS is effective, it must be integrated into the reliability improvement plan and procedures. A reliability improvement plan must be completed, approved, and coordinated through the responsible test engineer, design engineer, reliability manager, manufacturing manager, logistics manager, and program manager. 
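One way to picture a FRACAS entry is as a record that stays open until the loop is closed. The sketch below is a hypothetical minimal data structure in Python. The field names and the closure rule (root cause identified, corrective action defined, fix verified by further testing) are assumptions chosen to mirror the steps described in this section, not a prescribed FRACAS schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureReport:
    """One closed-loop failure report (all field names are illustrative)."""
    report_id: str
    part_id: str
    failure_verified: bool
    failure_mode: Optional[str] = None       # e.g. "cracked solder joint"
    root_cause: Optional[str] = None         # e.g. "thermal overstress"
    corrective_action: Optional[str] = None
    fix_verified: bool = False               # confirmed by further testing

    def is_closed(self) -> bool:
        """Assumed rule: closed once cause, action, and verification exist."""
        return (self.root_cause is not None
                and self.corrective_action is not None
                and self.fix_verified)


report = FailureReport("FR-001", "U12", failure_verified=True)
print(report.is_closed())  # False: no root cause or corrective action yet

report.failure_mode = "cracked solder joint"
report.root_cause = "thermal overstress"
report.corrective_action = "revise solder profile"
report.fix_verified = True
print(report.is_closed())  # True
```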
[Figure 13.2 Failure analysis flow. (From LeStrange, J. 1990. Litton Amecon briefing to the University of Maryland Reliability Engineering Program.) Flowchart steps: failure history; external visual; electrical curve tracer; decision “Verify failure?”; x-ray; P.I.N.D.; leak; de-lid; internal visual; fault isolation; destructive part analysis; document results. Branch A for unverified failures: temperature cycling, vibration, and burn-in, each followed by an electrical test (curve tracer), going to the next technique if the failure is still unverified.]

For military contracts and some commercial contracts (such as improving the reliability of a transformer for a power company), the consumer should ensure that the plan and procedures are coordinated through the procuring activity and a representative of the product user. At a minimum, the plan should address the test schedule, resources, test equipment, manpower, test environment, test procedures, planned growth versus test time, failure reporting system, and corrective action program. To ensure a successful improvement program, the plan and procedures should thoroughly describe all aspects of the test, including ground rules. Establishing ground rules for the conduct of the test and guidelines for failure classification are critical aspects of the reliability growth test.
13.2.2 Failure Classification

A reliability improvement program with unlimited resources, funding, and time focuses attention on identifying and eliminating deficiencies in the design, rather than on the classification of failures. In the ideal case, corrections are incorporated as the root causes are discovered. However, this ideal usually does not exist due to limited customer funding, schedule demands, or the nature of the test itself (failures have to be counted and/or classified to monitor progress). In some contracts, the producer may designate certain types of failures as nonrelevant or nonchargeable to avoid a costly investigation of their causes. However, the customer may contest these decisions and demand to have the problem identified and corrected. This antagonistic situation must be avoided.

To ensure a successful reliability improvement program, all failures should be considered relevant. Every hardware and software failure, including those caused by loose test cables (often the cause of intermittent failures or failures that cannot be duplicated), faulty test equipment, or other test facility problems, should be investigated, and corrective action for each should be developed. Until all producers realize the benefit of investigating all failures, controversy will occur in failure classification. To minimize the problems associated with classifying failures, ground rules should be established prior to the start of the growth test.

A standard failure classification methodology is illustrated in Figure 13.3. Any anomaly in prototype behavior is classified and evaluated as either a relevant or nonrelevant failure. (Some contracts require that the root causes of all failures be determined.) Any anomaly in product operating behavior that is not expected to occur in the field is classified as nonrelevant.
Nonrelevant failures are often caused by improper installation, accidental damage or mishandling, failures of the test facility, or failures due to externally induced overstress exceeding the amount approved for testing. To judge failures as nonrelevant, the strength and stress distributions of the product must be understood (see Section 13.3) (Seusy 1987). Understanding the strength and stress distributions provides insight into whether the product should have operated successfully under the failure conditions.

Relevant failures include all operational anomalies not classified as nonrelevant, regardless of whether the failure is verified or unverified. For example, momentary cessation of equipment function, termed intermittent failure, is a relevant failure. Failures that cannot be duplicated during troubleshooting can also be classified as relevant. Relevant failures must be investigated and may result in design or production modifications.

An anomaly classified as a relevant failure may be further classified as chargeable or nonchargeable. Nonchargeable failures result from another failure; such dependent failures are induced by equipment furnished by the government or the customer or by the failure of parts whose specified life expectancy has been exceeded. Chargeable failures include intermittent failures, failures independent of equipment design, equipment and part manufacturing failures, part design failures, and failures resulting from contractor-furnished equipment (CFE) and contractor-furnished operating, maintenance, or repair procedures. Failures that have the same cause, failure mode, and environmental failure conditions are counted as chargeable only once. Chargeable failures can be used as the basis for tracking reliability growth.
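The classification logic in this section reduces to two yes/no decisions followed by a de-duplication step. The following Python sketch assumes each incident has already been judged on those two questions. The dictionary keys, enum names, and sample incidents are hypothetical and serve only to illustrate the rules as stated in the text.

```python
from enum import Enum


class FailureClass(Enum):
    NONRELEVANT = "nonrelevant"
    NONCHARGEABLE = "relevant, nonchargeable"
    CHARGEABLE = "relevant, chargeable"


def classify(could_occur_in_field, dependent_or_induced):
    """Apply the two decisions: field relevance first, then dependence."""
    if not could_occur_in_field:
        return FailureClass.NONRELEVANT
    if dependent_or_induced:
        return FailureClass.NONCHARGEABLE
    return FailureClass.CHARGEABLE


def count_chargeable(incidents):
    """Count chargeable failures; identical (cause, mode, conditions) count once."""
    seen = set()
    for inc in incidents:
        if classify(inc["field"], inc["dependent"]) is FailureClass.CHARGEABLE:
            seen.add((inc["cause"], inc["mode"], inc["conditions"]))
    return len(seen)


incidents = [
    {"field": True, "dependent": False, "cause": "thermal overstress",
     "mode": "cracked solder joint", "conditions": "hot soak"},
    # Same cause, mode, and conditions as the first incident: counted once.
    {"field": True, "dependent": False, "cause": "thermal overstress",
     "mode": "cracked solder joint", "conditions": "hot soak"},
    {"field": True, "dependent": True, "cause": "customer-furnished equipment",
     "mode": "loss of output", "conditions": "ambient"},
    {"field": False, "dependent": False, "cause": "test facility failure",
     "mode": "loss of power", "conditions": "ambient"},
    {"field": True, "dependent": False, "cause": "contamination",
     "mode": "corrosion", "conditions": "humidity"},
]
print(count_chargeable(incidents))  # 2
```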
[Figure 13.3 Failure classification. Decision flow: incident (anomaly or failure); “Could incident occur in the field?” (no: nonrelevant; yes: relevant); “Was the incident a dependent or induced failure?” (outcomes: nonchargeable or chargeable).]
Intermittent failures, failures that cannot be duplicated, and failures caused by operator error are usually controversial and difficult to classify. These types of failures place additional burdens on maintenance, support, and logistics personnel. They also frustrate the consumer whose automobile’s classic intermittent stalling cannot be duplicated by the mechanic or whose television requires a brightness adjustment every 30 minutes at home but works perfectly for hours at the repair shop.

Intermittent failures caused by external power interruptions, surges, or transients are usually nonchargeable failures. To avoid classifying such an intermittent failure as chargeable, external power monitors should be provided to monitor, regulate, and record the input power during the conduct of the reliability improvement program. If the intermittent failure cannot be associated with an external power interruption, surge, or transient, then it must be investigated as a “cannot duplicate” (CND). CNDs are test incidents that cannot be verified or duplicated by subsequent troubleshooting and maintenance. Common causes of CNDs are intermittent failures,
built-in test inadequacies, operator errors, improper maintenance, management misunderstanding, and incorrect user manuals. CNDs often are overlooked during an improvement program, due to schedule constraints, because time is needed to attempt to duplicate the failure and to verify that the equipment is in good condition. To maintain test efficiency, one approach is to classify CNDs as chargeable if, during troubleshooting, any component was swapped, disconnected and reconnected, or adjusted in place. If only internal built-in test (BIT) equipment or internal self-tests were used during troubleshooting (in other words, the equipment was not tampered with), then the CNDs could be classified as nonchargeable. In either case, all CNDs must be reported in the failure reporting system (FRACAS) and investigated by the cognizant testability and logistics engineers. The engineers can then attempt to remove the cause of the CNDs offline from the reliability improvement program. Even though CNDs could be classified as nonchargeable in terms of reliability, CNDs are always relevant in terms of maintainability or testability.

Numerous CNDs and intermittent failures in products that contain software and utilize built-in tests may indicate bugs or errors in the software code and should be evaluated for possible software deficiencies. If the product does not contain software or BITs, the failures may only manifest themselves under certain environmental conditions, such as humidity, temperature, or vibration.

Operator errors that cause failure during reliability growth are also controversial; however, during a reliability demonstration test, operator errors are always chargeable. If they could cause loss of life or other catastrophic failures in the field, then they are classified as chargeable.
In less catastrophic situations, guidelines could be established; for example, if the same operating error were to occur three times, then, at the third occurrence, the failure would be classified as chargeable. Operator errors that repeat constantly could indicate poorly written operating instructions or user manuals. For example, during a growth test on a video cassette recorder (VCR), the test engineer has to verify the operation of the timing feature once every 24 hours. The test engineer carefully follows the operating instructions for the timing feature of the VCR, but it does not always turn on when expected. If this occurred only once, this failure would probably be classified as nonchargeable due to operator error, but if it occurred three or more times, it would be considered chargeable.

Test time is also categorized as relevant or nonrelevant. The test time between failures when the equipment is officially under test is termed relevant test time. Time spent troubleshooting equipment failures and verifying repairs is termed nonrelevant test time. Only the accumulated relevant test time is used to determine the improvement in product reliability. Failures that occur during nonrelevant test time should be investigated and corrected, but they would not be used to gauge reliability.

13.2.3 Test Optimization

To avoid duplication of effort, other types of testing (such as functional, human factor, and safety testing) should be performed concurrently with growth testing. Because design changes stemming from other test results may affect reliability, test data derived from different types of tests are best shared and may, in this way, provide deeper insight into equipment behavior. An important but often overlooked test that
can be conducted during growth testing is BIT false-alarm verification. Many products use BIT to determine when a failure occurs. When BIT indicates a failure, but a failure has not occurred, these BIT indications are considered false alarms. False alarms are measured as a percentage of all failures indicated (usually specified as 1–5%) and are accounted for as a subset of CNDs. To use growth testing for false-alarm verification, external test instrumentation and recorders must monitor performance. During growth testing, if the BIT thresholds are set at the same sensitivities as the fielded equipment, then the BIT data, along with the information from the external test instrumentation, can be used to determine the percentage of false alarms. Even if the BIT thresholds are not the same, the data still provide a gross indication of false-alarm performance.

13.2.4 Test Cycles and Environmental Considerations

Relevant test time consists of a series of cycles that combine the worst-case environmental stresses that the equipment will experience in the field. Accelerated testing, which applies stresses higher than expected field conditions, is discussed in Section 13.3. Depending on the field application, the test cycle may include stresses that are electrical (power line cycling), thermal, moisture induced (humidity), or vibration induced. Typically, the simulation of worst-case environmental stresses may not be required for products such as consumer electronics for home use; however, worst-case stresses must always be simulated for products when safety is an issue. In order to precipitate the occurrence of failures, the environmental conditions in the test are often the worst-case stresses that the equipment would encounter in the field. An example of a typical test plan cycle is illustrated in Figure 13.4.

Test chambers provide rapid temperature change and are compatible with electrodynamic or mechanical vibration equipment. Operational checks on equipment are conducted continuously or at regular intervals during each test cycle; performance checks are conducted less frequently. The performance checks are commonly performed at room ambient temperature and consist of the operational check plus additional verification of equipment behavior, such as precision and measurement repeatability. However, performance tests conducted during or immediately following environmental extremes, such as after vibration, provide further insight into equipment behavior.

[Figure 13.4 Sample environmental test cycle.]

13.3 STRESS MARGIN TESTING
Accelerated testing is a reliability improvement technique used to identify deficiencies quickly by increasing the product’s normal stresses. Basic conditions for accelerated testing include the following:

• The dominant failure mode under normal stress and under accelerated stress should be the same.
• The engineering properties associated with the failure mechanisms of a material under accelerated stress should be the same before and after the test.
• The shape of the failure probability density function for the failure mechanisms at rated and higher stress levels should be the same.
To determine when these conditions are met, the failure mode (mechanism) has to be identified. The failure mechanism is the process by which the physical, electrical, mechanical, and chemical stresses combine to cause a failure. These stresses are used in the failure model to predict the reliability of the product. When the three basic conditions are met, accelerated life testing can be used to reduce test time and, consequently, test cost. Accelerated testing increases stresses such as temperature cycling, vibration, humidity, and power cycling above the product’s typical operating conditions or specifications. Pecht (1991) provides techniques to perform accelerated testing based on temperature, humidity, voltage, and mechanical stresses. To determine the equivalent amount of test time, accelerated test conditions can be extrapolated back to normal operating conditions. Failures occur only when stress exceeds the strength, as illustrated in Figure 13.5 (Seusy 1987). Product strength generally is broadly distributed and decreases with time, as shown in Figure 13.6. Stress testing simulates aging and amplifies unreliability; Figure 13.7 shows the general physical principle behind accelerated life tests.
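The idea that failures occur only when stress exceeds strength can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that stress and strength are independent and normally distributed; the text does not specify the distribution shapes, and the numbers are hypothetical. Under that assumption the unreliability P(stress > strength) has a closed form, because the difference of two independent normal variables is itself normal.

```python
import math


def unreliability(mean_strength, sd_strength, mean_stress, sd_stress):
    """P(stress > strength) for independent normal stress and strength."""
    margin_mean = mean_strength - mean_stress
    margin_sd = math.hypot(sd_strength, sd_stress)  # sqrt(sd1^2 + sd2^2)
    z = -margin_mean / margin_sd
    # Standard normal CDF evaluated at z, written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical values in arbitrary stress units.
new_product = unreliability(100.0, 10.0, 60.0, 10.0)
aged_product = unreliability(80.0, 10.0, 60.0, 10.0)    # strength has decreased
stress_test = unreliability(100.0, 10.0, 80.0, 10.0)    # stress has been raised

print(new_product < aged_product)   # True: aging increases unreliability
print(abs(aged_product - stress_test) < 1e-12)  # True: same margin either way
```

In this simplified model, lowering the mean strength by 20 units and raising the mean stress by 20 units give the same unreliability, which is one way to read the statement that stress testing simulates aging.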
[Figure 13.5 Stress vs. strength. Labels: strength distribution, stress distribution, unreliability.]

[Figure 13.6 Time effect on strength. Labels: strength distribution, stress distribution, unreliability.]
[Figure 13.7 Stress testing principle. Labels: stress distribution, strength distribution, unreliability.]

The accelerated testing techniques called STRIFE and HALT and accelerated life-testing models, such as the inverse power law model and Miner’s rule, are discussed next (Schinner 1988; Hobbs 1990).

13.3.1 Stressed Life Test (STRIFE)

The stressed life test (STRIFE), developed by the Hewlett Packard Company, uses temperature cycling, power line cycling, and/or frequency variation to accelerate product failures. This is essentially the same as the “normal” improvement program that uses the operating environment during the test. Hewlett Packard enhanced STRIFE testing for printed wiring boards by applying an expanded temperature range, increased temperature change rates, and increased random vibration (board electronic STRIFE testing [B.E.S.T.]). Some of the necessary conditions for B.E.S.T. include the following:

• The temperature of the components on the board must be kept in continuous and rapid transition for 90% of the temperature profile by using a 15°C “overshoot” at both hot and cold extremes. The temperature profile should be tailored to both the product and the test chamber; the duration of the “overshoot” should be such that the components reach at least 90% of the hot and cold extremes.
• The product must be powered on and off to create internal temperature cycles, thus accelerating electronic failures. When power is applied, the temperature of the component increases, based on the power dissipation, thermal mass, and heat transfer rate. Cycling the power on and off will also induce electrical stresses caused by voltage and current transients.
• Random vibration should be applied to the two axes that cause the worst-case mechanical stress to identify excessive displacement.
13.3.2 Highly Accelerated Life Test (HALT)

The highly accelerated life test (HALT), developed by Hobbs Engineering Corporation, applies stresses to the product that are higher than the normal operating and nonoperating levels (Hobbs 1990). Common stimuli applied are temperature, vibration, and voltage. HALT uses a stress-step approach to increase the stress levels of a stimulus progressively until an operational or destruct limit is reached. Once a failure occurs, it is investigated, and the product design is changed to compensate for the stress. This process is repeated for each stimulus and then continued by combining stimuli, such as temperature and vibration. HALT is completed only when the desired safety margin above normal operating conditions has been achieved.
Therefore, the length of a HALT program is difficult to predict. To be most cost effective, HALT should be performed prior to production and as early in the design as possible. Hobbs Engineering Corporation summarizes HALT as follows:

- an information-gathering tool and design approach to force failures that force product maturity;
- may rapidly find failure mechanisms;
- helps to improve the product to the technology limit; and
- will require a functional change for most companies.

HALT is applied to each level of production, from the subassembly to the final product. If performed correctly, HALT should increase customer satisfaction by improving reliability and lowering overall product life-cycle costs.
13.3.3 Inverse Power Law Model and Miner’s Rule

The inverse power law model and Miner’s rule are two tools that can be used for measuring the effects of accelerated life testing (Raheja 1990). The inverse power law model can be used to extrapolate accelerated test conditions back to normal operating conditions. The model states that product life is inversely proportional to the stress raised to the power of $N_a$, where $N_a$ is the acceleration factor derived from the slope of an S–N curve:

$$\text{slope} = -\frac{1}{N_a} \tag{13.1}$$

The inverse power law model can then be written as

$$\frac{\text{life at normal stress}}{\text{life at accelerated stress}} = \left(\frac{\text{accelerated stress}}{\text{normal stress}}\right)^{N_a} \tag{13.2}$$
Once the accelerated test is complete, solving for “life at normal stress” will yield the equivalent test time for normal operating conditions. For example, if the accelerated stress were twice as severe as the normal stress, the life at the accelerated stress determined to be 4 hours, and $N_a$ equal to 2, then the equivalent life at normal stress would be 16 hours.

To determine the cumulative damage that may have occurred as a result of this test, Miner’s rule can be used. Miner’s rule states that the cumulative damage, $CD$, is

$$CD = \sum_{i=1}^{k} \frac{C_{S_i}}{N_i} \le 1 \tag{13.3}$$

where $C_{S_i}$ is equal to the number of cycles applied with a given mean stress, $S_i$; $N_i$ is equal to the number of cycles to failure under stress $S_i$; and $k$ is the number of loads. Miner assumes that every part has a useful fatigue life and every cycle uses up a percentage of that life. When $CD$ is equal to one, the cumulative damage should cause a failure.
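The two relations above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the function names are chosen here for convenience, and the load history passed to Miner's rule is hypothetical (it does not come from the text). The first call reproduces the 4-hour, twice-the-stress, $N_a = 2$ example.

```python
def life_at_normal_stress(life_accelerated, accelerated_stress, normal_stress, n_a):
    """Inverse power law (Equation 13.2) solved for the life at normal stress."""
    return life_accelerated * (accelerated_stress / normal_stress) ** n_a


def cumulative_damage(cycles_applied, cycles_to_failure):
    """Miner's rule (Equation 13.3): sum of applied cycles over cycles to failure."""
    return sum(c / n for c, n in zip(cycles_applied, cycles_to_failure))


# Worked example from the text: stress doubled, 4 hours at accelerated stress, N_a = 2.
equivalent_life = life_at_normal_stress(4.0, 2.0, 1.0, 2)   # 16.0 hours

# Hypothetical load history (assumed for illustration): two stress levels.
# 1,000 of 10,000 cycles plus 500 of 2,000 cycles uses 0.1 + 0.25 of the fatigue life.
damage = cumulative_damage([1000, 500], [10000, 2000])
```

A damage value below one indicates that, under Miner's assumption, some fatigue life remains; a value of one corresponds to expected failure.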
13.4 CONTINUOUS GROWTH MONITORING
Product reliability will improve with appropriate design modifications resulting from the TAAF process (or accelerated life testing). The failure data obtained from reliability growth testing are often used to assess the rate of improvement and estimate the continued growth of the present product. The purpose of estimating the potential reliability growth is to aid management in scheduling. In some cases, a reliability growth estimate is used to determine the expected or estimated amount of test time needed to reach a reliability level. The reliability can be assessed at any point during testing to determine whether the product is improving on schedule and whether resources are allocated appropriately. Both continuous and discrete models have been developed to assess reliability growth (Duane 1964; Lloyd and Lipow 1962).

13.4.1 Continuous Growth Models

Continuous growth models were developed for repairable products in which reliability is measured in terms of mean time between failures (MTBF). The MTBF is plotted as a function of test time to illustrate growth. The MTBF is found by dividing the cumulative relevant test time by the cumulative number of relevant equipment failures. This concept was originally implemented by Duane (1964). Another continuous growth model is the Army Materiel Systems Analysis Activity (AMSAA) model (Crow 1974).

13.4.1.1 Duane Model

When Duane was working for General Electric, he recognized a general trend in the improvement of various products under test development in terms of the cumulative failure rate. The products included hydromechanical devices, aircraft generators, and an aircraft jet engine (Duane 1964). The cumulative number of failures, plotted on log–log paper as a function of cumulative operating hours, produced a nearly straight line for all of the products.
The slope of the line showed the rate of MTBF growth and indicated the effectiveness of the reliability improvement program in identifying and correcting design deficiencies. Progressively fewer failures occurred during the test program as design improvements were incorporated into the products. This phenomenon is mathematically modeled as

$$\lambda_\Sigma = \frac{\Sigma F}{t} = K t^{-\alpha_r} \tag{13.4}$$

where $\lambda_\Sigma$ is the cumulative failure rate; $\Sigma F$ is the cumulative number of failures; $t$ is the cumulative operating hours; $K$ is a constant indicating an initial failure rate; and $\alpha_r$ is the growth rate.
The growth rate, $\alpha_r$, must be between zero and one to model a decreasing failure rate. A growth rate that approaches one represents the maximum growth process achievable. A growth rate of 0.4–0.5 is generally accepted as a reasonable value for planning purposes. Upon completion of the reliability improvement program, the anticipated failure rate of production equipment is the instantaneous failure rate. A current or instantaneous failure rate, $\lambda_I$, can be found from the derivative of the cumulative number of failures, $\Sigma F$:

$$\lambda_I = \lim_{\Delta t \to 0} \frac{\Delta(\Sigma F)}{\Delta t} = \frac{d(\Sigma F)}{dt} = (1 - \alpha_r) K t^{-\alpha_r} \tag{13.5}$$
The instantaneous MTBF could also be determined graphically by plotting the failures on log–log paper, with the point estimate MTBF on the y-axis and the time to failure on the x-axis. A straight line is fitted to the points. The instantaneous MTBF is then determined by drawing a line parallel to and displaced by a factor of $1/(1 - \alpha_r)$ above this cumulative line.

The Duane model is also useful for constructing curves for predicted or planned growth in order to map the progress of reliability growth. The steps involved in the construction of a planned growth curve include:

- identifying reliability goals;
- setting the initial reliability on the growth curve based on historical data from similar products or initial test data;
- setting the initial test time to equal the time when fixes will initially be incorporated into the product (Crow 1986); the initial test time is determined by estimating the probability of obtaining at least one failure by time $t_i$; it is calculated by setting one minus the product’s reliability function equal to a failure probability of perhaps 63.2% (at $t$ = MTBF, 63.2% of test products have failed) to 95% and solving for $t$; increasing the failure probability will increase the expected total test time if the product’s reliability function follows an exponential distribution and the probability of at least one failure is 90%;
- determining a growth rate based on the product’s complexity, maturity, and technology, the level of effort and aggressiveness of the failure analysis program, and the amount of supportive attention provided by management; and
- developing a growth curve as a baseline by which reliability growth can be evaluated.
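The initial test time in the third step can be written out explicitly. The short derivation below is a sketch that assumes an exponential reliability function with mean $\theta$ (the MTBF); the symbol $p$ for the chosen probability of at least one failure is introduced here for illustration.

```latex
R(t) = e^{-t/\theta}, \qquad
1 - R(t_i) = p
\;\Longrightarrow\;
t_i = -\theta \ln(1 - p)

p = 0.632:\; t_i \approx 1.00\,\theta \qquad
p = 0.90:\; t_i \approx 2.30\,\theta \qquad
p = 0.95:\; t_i \approx 3.00\,\theta
```

Under this assumption, a 63.2% failure probability places the initial test time at roughly one MTBF, and higher probabilities push it later.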
The point where the instantaneous MTBF crosses the required MTBF line is the expected test time for the reliability growth program. The growth curve serves only as a guideline for assessing progress in terms of the schedule; there is no guarantee that reliability goals will be met. To obtain growth and meet reliability goals, design deficiencies have to be detected and corrective actions must be implemented.

EXAMPLE 13.1 A reliability growth test is planned for an avionic system to improve its current MTBF of 250 hours to 1,000 hours. The initial test time ($t_i$) is 250 hours, based on the 63.2% probability of having at least one failure by time $t_i$. If similar products at this stage of development have achieved a growth rate of 0.3, determine:

- test time for cumulative MTBF to reach 1,000 hours;
- test time for instantaneous MTBF to reach 1,000 hours;
- the required quantity of units under test if 6 months of calendar time are available and testing can run for 24 hours per day; and
- the cumulative MTBF line and the instantaneous MTBF line graphically using log–log paper.

Test time for cumulative MTBF to reach 1,000 hours. Using Equation 13.4, an expression for total cumulative test time and MTBF is determined (General Electric Company 1973). The total cumulative test time, $t_c$, can be derived as

$$t_c = t_i \left(\frac{\theta_R}{\theta_i}\right)^{1/\alpha_r} \tag{13.6}$$

where $t_i$ is equal to the initial test time, $\theta_i$ is equal to the initial MTBF, $\theta_R$ is equal to the cumulative or required MTBF, and $\alpha_r$ is equal to the growth rate. Therefore,

$$t_c = 250 \left(\frac{1000}{250}\right)^{1/0.3} = 25{,}398 \text{ hours} \tag{13.7}$$
Test time for instantaneous MTBF to reach 1,000 hours. Using Equation 13.5, the instantaneous test time $T$ can be derived by first converting the initial failure rate, $K$, to the equivalent instantaneous failure rate. This is derived as

$$\lambda_I = K (1 - \alpha_r) \tag{13.8}$$

where $\lambda_I$ is equal to the instantaneous failure rate and $K$ is the failure rate that will be converted. Therefore,

$$\lambda_I = 0.004 (1 - 0.3) = 0.0028 \tag{13.9}$$

Using a variation of the equation derived in the preceding step,

$$T = 250 \left(\frac{0.0028}{0.001}\right)^{1/0.3} = 7{,}735 \text{ hours} \tag{13.10}$$

where $T$ is equal to the instantaneous test time.

Units under test required. The test is run for 730 hours per month for 6 months, so the total time is 4,380 calendar hours; having 50% test efficiency, one gets 2,190 hours; using the instantaneous test time from Equation 13.10 divided by 2,190 hours, one obtains 3.53, or 4, units.

Cumulative and instantaneous MTBF line. Figure 13.8 provides the graphical solution to this problem.
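The arithmetic of Example 13.1 can be checked with a short Python sketch. The function and variable names are chosen here for illustration; the numbers are those given in the example, including the 50% test efficiency used for the unit count.

```python
import math


def duane_time_to_reach(t_initial, ratio, growth_rate):
    """Equation 13.6 form: t = t_i * ratio ** (1 / alpha_r)."""
    return t_initial * ratio ** (1.0 / growth_rate)


t_i = 250.0              # initial test time, hours
theta_i = 250.0          # initial MTBF, hours
theta_required = 1000.0  # required MTBF, hours
alpha_r = 0.3            # growth rate

# Cumulative MTBF reaches the requirement (Equation 13.7): about 25,398 hours.
t_cumulative = duane_time_to_reach(t_i, theta_required / theta_i, alpha_r)

# Instantaneous MTBF reaches the requirement (Equations 13.8 to 13.10): about 7,735 hours.
k = 1.0 / theta_i                        # initial failure rate, 0.004 per hour
lambda_i = k * (1.0 - alpha_r)           # equivalent instantaneous rate, 0.0028 per hour
lambda_required = 1.0 / theta_required   # 0.001 per hour
t_instantaneous = duane_time_to_reach(t_i, lambda_i / lambda_required, alpha_r)

# Units under test: 730 h/month for 6 months at 50% efficiency gives 2,190 h per unit.
hours_per_unit = 730 * 6 * 0.5
units_required = math.ceil(t_instantaneous / hours_per_unit)   # 4 units
```

Changing `alpha_r` in this sketch shows how strongly the planned test time depends on the assumed growth rate, because the rate enters as the exponent 1/α_r.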
Figure 13.8 Cumulative and instantaneous MTBF lines. (Log–log plot of MTBF in hours versus test time in hours; plotted lines: cumulative MTBF with slope = 0.3, instantaneous MTBF, and required MTBF.)
13.4.1.2 AMSAA Model

Concepts formulated by Duane were extended in the AMSAA model, which uses Duane’s cumulative failure rate but is also based on the Poisson distribution. It is explained as follows. Let $dm_1 < dm_2 < \dots < dm_k$ represent the cumulative test times when design modifications (dm) are made. Between design modifications, the failure rate can be assumed to be constant, as illustrated in Figure 13.9. Let $\lambda_i$ represent the failure rate during the $i$th time period between modifications $(dm_i - dm_{i-1})$. Based on the constant failure rate assumption, the number of failures, $N_i$, during the $i$th time period has a Poisson distribution with a mean number of failures $\lambda_i (dm_i - dm_{i-1})$. This is mathematically expressed by

$$\text{Prob}[N_i = n] = \frac{[\lambda_i (dm_i - dm_{i-1})]^n \, e^{-\lambda_i (dm_i - dm_{i-1})}}{n!} \tag{13.11}$$
where $n$ is an integer. Let $t$ represent the cumulative test time and let $N(t)$ represent the total number of product failures by time $t$. Then $N(t)$ is analogous to the cumulative number of failures, $\Sigma F$, from Duane’s model. If $t$ is in the first interval, then $N(t)$ has a Poisson distribution with mean $\lambda_1 t$. If $t$ is in the second interval, then $N(t)$ is the number of failures, $N_1$, in the first interval plus the number of failures in the second interval between $dm_1$ and $t$. Thus, in the second interval, $N(t)$ has the mean $\Theta(t) = \lambda_1 dm_1 + \lambda_2 (t - dm_1)$.

Figure 13.9 Constant failure rate between design modifications. (Step plot of failure rate, λ1 through λ4, versus design modifications DM1 through DM4.)

When the failure rate is assumed to be constant ($\lambda_0$) over a test interval (that is, between design modifications), then $N(t)$ is said to follow a homogeneous Poisson process, with a mean of the form $\lambda_0 t$. When the failure rates change with time, as in Figure 13.9, from interval 1 to interval 2, then $N(t)$ is said to follow a nonhomogeneous Poisson process. For tracking reliability growth between design modifications, $N(t)$ follows the nonhomogeneous Poisson process, with the mean value function

$$\Theta(t) = \int_0^t \nu_\lambda(x)\,dx \tag{13.12}$$
where the intensity function $\nu_\lambda(x) = \lambda_i$ and $dm_{i-1} < x < dm_i$. Thus, for any $t$,

$$\text{Prob}[N(t) = n] = \frac{[\Theta(t)]^n \, e^{-\Theta(t)}}{n!} \tag{13.13}$$

where $n$ is an integer. As $\Delta t$ approaches zero, $\nu_\lambda(t)\Delta t$ approximates the probability of a product failure in the time interval $(t, t + \Delta t)$. The intensity function is approximated by a continuous parametric function so that test data can be compiled and parameters estimated. The AMSAA model assumes that the intensity function can be approximated by a parametric function defined as

$$\nu_\lambda(t) = \lambda \beta t^{\beta - 1} \tag{13.14}$$
Here, the intensity function is analogous to the instantaneous failure rate, $\lambda_I$, defined in Equation 13.8. This is also the Weibull hazard rate function, with $\beta > 0$, $\lambda > 0$, and $t > 0$. Because the AMSAA model assumes a Poisson process with Weibull hazard rate function (not the Weibull distribution), statistical procedures for the Weibull distribution do not apply. Equation 13.14 can model various processes, including reliability growth. The parameter of interest is $\beta$ because $1 - \beta$ is the reliability growth rate. From the parametric assumption, the mean number of failures by time $t$ is defined as

$$\Theta(t) = \lambda t^{\beta} \tag{13.15}$$

This is not the product MTBF. If no additional modifications are incorporated into the product after the completion of the test, future failures would follow an exponential distribution, and the product MTBF could be obtained by taking the inverse of the intensity function, $\nu_\lambda(t)$. The cumulative failure rate, $\omega(t)$, is defined as

$$\omega(t) = \frac{N(t)}{t} \tag{13.16}$$
and is analogous to $\lambda_\Sigma$, defined in Equation 13.4. If the cumulative failure rate, $\omega(t)$, is linear with respect to time on a log–log scale, then the growth is analogous to that modeled by Duane. Parameters $\beta$ and $\lambda$ can be found graphically on full logarithmic paper or determined statistically using estimation theory. For graphical estimation, a straight line is fitted to the plot of the cumulative failure rate (or mean number of failures) versus the cumulative test time, using log–log paper. Taking the logarithm of the cumulative failure rate illuminates the relationship between the parameters and the slope and ordinate intercept. For statistical estimations, the maximum likelihood estimators method (discussed in Chapter 3) can provide point approximations of the parameters.

Prior to the use of the AMSAA method, the test data must be analyzed to identify significant trends, rather than a homogeneous Poisson process. One test used to identify such trends is the central limit theorem test or the Laplace test (Cox and Lewis 1966). When more than one prototype is used (say, $m$ prototypes), the product should be analyzed on a cumulative test duration basis (time, miles, etc.). The failure data on the prototypes are combined and analyzed as if they were a single product. If the period of observation ends with a failure (failure truncated), use the test statistic ($\mu_1$) generated by

$$\mu_1 = \frac{\sum_{i=1}^{M} X_i - M X_N / 2}{X_N [M/12]^{0.5}} \tag{13.17}$$

where $M$ is the number of failures ($N$) minus 1, $X_N$ is the time of the last failure, and $X_i$ is the time of the $i$th failure. If the failure data are time truncated, use the test statistic ($\mu_2$) generated by

$$\mu_2 = \frac{\sum_{i=1}^{N} X_i - N t_0 / 2}{t_0 [N/12]^{0.5}} \tag{13.18}$$

where $N$ is the number of failures and $t_0$ is the total test time. The statistic $\mu$ is compared to the standardized normal deviate at the chosen significance level, $z_\alpha$:

- $\mu \le -z_\alpha$: significant growth is indicated at the chosen significance level and the AMSAA model can be used for estimating parameters $\beta$ and $\lambda$;
- $\mu \ge z_\alpha$: significant reliability decay is indicated at the chosen significance level and further corrective action and design changes are needed; or
- $-z_\alpha < \mu < z_\alpha$: the trend is not significant at the chosen significance level because the data (failure rate) follow a homogeneous Poisson process; additional data should be accumulated.

Table 13.1 Test Statistics

  $z_\alpha$ Value    Percent Level of Significance (Two Sided)
  1.960               5.0
  1.645               10.0
  1.282               20.0
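The trend test above lends itself to a small helper. The following Python sketch is illustrative only: the function names and the string labels returned by `classify_trend` are choices made here, not part of the source. It applies the failure-truncated form when no end time is supplied and the time-truncated form otherwise, and it is exercised on the failure times listed in Example 13.2 of this chapter.

```python
import math


def laplace_statistic(times, end_time=None):
    """Laplace trend statistic for ordered cumulative failure times.

    end_time=None  -> failure-truncated form (Equation 13.17), M = N - 1
    end_time=t0    -> time-truncated form (Equation 13.18)
    """
    n = len(times)
    if end_time is None:
        m = n - 1
        x_n = times[-1]
        return (sum(times[:-1]) - m * x_n / 2.0) / (x_n * math.sqrt(m / 12.0))
    return (sum(times) - n * end_time / 2.0) / (end_time * math.sqrt(n / 12.0))


def classify_trend(mu, z_alpha):
    """Apply the three-way decision rule to the statistic mu."""
    if mu <= -z_alpha:
        return "growth"
    if mu >= z_alpha:
        return "decay"
    return "no significant trend"


# Failure times (hours) from Example 13.2; the test was stopped at 2,500 hours.
failure_times = [85, 151, 184, 267, 378, 474, 660, 803,
                 1031, 1230, 1400, 1589, 1643, 1756, 2122]
mu_2 = laplace_statistic(failure_times, end_time=2500.0)   # about -1.7806
verdict = classify_trend(mu_2, z_alpha=1.645)              # "growth" at the 10% level
```

A negative statistic means the failures cluster early in the test, which is what a decreasing failure rate looks like.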
Critical values of the test statistics can be found in the normal distribution tables. Common two-sided significance level test statistics are shown in Table 13.1. If significant growth is indicated, the estimates of $\beta$ and $\lambda$ can be determined by Equations 13.19 through 13.26. Biased estimators are typically used for large samples. However, if the goodness-of-fit test is passed, then unbiased estimators can be used for both small and large samples. In the equations that follow, a hat marks a biased estimate and an unmarked symbol denotes the unbiased estimate.

For failure-truncated tests, the biased estimate of $\beta$ is

$$\hat{\beta} = \frac{N}{(N - 1) \ln X_N - \sum_{i=1}^{N-1} \ln(X_i)} \tag{13.19}$$

The unbiased estimate of $\beta$ can be determined by multiplying the biased estimate by $[(N - 2)/N]$:

$$\beta = \frac{N - 2}{(N - 1) \ln X_N - \sum_{i=1}^{N-1} \ln(X_i)} \tag{13.20}$$

For failure-truncated tests, the biased estimate of $\lambda$ is

$$\hat{\lambda} = \frac{N}{X_N^{\hat{\beta}}} \tag{13.21}$$

The unbiased estimate of $\lambda$ is

$$\lambda = \frac{N}{X_N^{\beta}} \tag{13.22}$$

For time-truncated tests, the biased estimate of $\beta$ is

$$\hat{\beta} = \frac{N}{N \ln t_0 - \sum_{i=1}^{N} \ln(X_i)} \tag{13.23}$$

The unbiased estimate of $\beta$ can be determined by multiplying the biased estimate by $[(N - 1)/N]$:

$$\beta = \frac{N - 1}{N \ln t_0 - \sum_{i=1}^{N} \ln(X_i)} \tag{13.24}$$

The biased estimate of $\lambda$ is

$$\hat{\lambda} = \frac{N}{t_0^{\hat{\beta}}} \tag{13.25}$$

The unbiased estimate of $\lambda$ is

$$\lambda = \frac{N}{t_0^{\beta}} \tag{13.26}$$
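The eight estimators share one denominator per truncation type, which a short Python sketch makes visible. The function name, the returned dictionary keys, and the five failure times in the usage lines are hypothetical and chosen for illustration; only the formulas come from Equations 13.19 through 13.26.

```python
import math


def amsaa_estimates(times, end_time=None):
    """Biased and unbiased AMSAA estimates of beta and lambda.

    times: ordered cumulative failure times.
    end_time=None -> failure truncated (Equations 13.19 to 13.22)
    end_time=t0   -> time truncated    (Equations 13.23 to 13.26)
    """
    n = len(times)
    if end_time is None:
        horizon = times[-1]
        denom = (n - 1) * math.log(horizon) - sum(math.log(x) for x in times[:-1])
        beta_biased = n / denom
        beta_unbiased = (n - 2) / denom
    else:
        horizon = end_time
        denom = n * math.log(horizon) - sum(math.log(x) for x in times)
        beta_biased = n / denom
        beta_unbiased = (n - 1) / denom
    return {
        "beta_biased": beta_biased,
        "beta_unbiased": beta_unbiased,
        "lambda_biased": n / horizon ** beta_biased,
        "lambda_unbiased": n / horizon ** beta_unbiased,
    }


# Hypothetical failure-truncated data (hours), assumed purely for illustration.
hypothetical_times = [30.0, 110.0, 260.0, 540.0, 980.0]
estimates = amsaa_estimates(hypothetical_times)
growth_rate = 1.0 - estimates["beta_unbiased"]   # 1 - beta is the growth rate
```

Because the unbiased β is always smaller than the biased one, it implies a slightly higher estimated growth rate from the same data.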
A goodness-of-fit test is used to determine if the collected data fit the AMSAA model. A popular test used for AMSAA modeling is the Cramér–von Mises goodness-of-fit test discussed in Chapter 12. From the chosen level of significance, $\alpha$, the critical value of the test statistic, $C_M^2$, is determined from Table 13.2. The value calculated from the observations is then compared to this critical value. If the test is failure truncated, the calculated value is obtained by Equation 72 from Chapter 12:

$$C_M^2 = \frac{1}{12M} + \sum_{i=1}^{M} \left[\left(\frac{X_i}{X_N}\right)^{\beta} - \frac{2i - 1}{2M}\right]^2 \tag{13.27}$$

where, as in Equation 13.17, $M$ is the number of failures ($N$) minus 1.
Table 13.2 Critical Values of $C_M^2$: Parametric Form of the Cramér–von Mises Statistic

         Level of Significance, $\alpha$
  M      0.20     0.15     0.10     0.05     0.01
  2      0.138    0.149    0.162    0.175    0.186
  3      0.121    0.135    0.154    0.184    0.231
  4      0.121    0.136    0.155    0.191    0.279
  5      0.121    0.137    0.160    0.199    0.295
  6      0.123    0.139    0.162    0.204    0.307
  7      0.124    0.140    0.165    0.208    0.316
  8      0.124    0.141    0.165    0.210    0.319
  9      0.125    0.142    0.167    0.212    0.323
  10     0.125    0.142    0.167    0.212    0.324
  15     0.126    0.144    0.169    0.215    0.327
  20     0.128    0.146    0.172    0.217    0.333
  30     0.128    0.146    0.172    0.218    0.333
  60     0.128    0.147    0.173    0.221    0.333
  100    0.129    0.147    0.173    0.221    0.336
If the test is time truncated, then the calculated value is obtained by the equation

$$C_M^2 = \frac{1}{12N} + \sum_{i=1}^{N} \left[\left(\frac{X_i}{t_0}\right)^{\beta} - \frac{2i - 1}{2N}\right]^2 \tag{13.28}$$
If the calculated value is greater than the tabulated critical value, then the AMSAA model is rejected. A poor Cramér–von Mises fit may be caused by program changes that caused jumps or discontinuities in the reliability improvement program. Plotting the data may suggest whether a different model should be used or may indicate where the discontinuities occurred. If there are jumps, the AMSAA model can be applied piecewise; the data prior to and following the jump or discontinuity are treated separately.

If the calculated value is less than the tabulated critical value, then the AMSAA model is accepted and the product intensity function (Equation 13.14) may be estimated as a function of time. Once the intensity function is determined, a parametric curve can be drawn to predict the future behavior of the product. If the product is not modified after time $t_0$, then failures are assumed to continue at the constant rate, $\lambda_0 = \nu_\lambda(t_0) = \lambda \beta t_0^{\beta - 1}$, according to the exponential distribution. The estimate of the MTBF would then be equal to $1/(\lambda \beta t_0^{\beta - 1})$. Confidence tables developed for the AMSAA model can then be used to determine the lower and upper confidence bounds around this MTBF. In Table 13.2, for $M > 100$, use the values for $M = 100$.
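The goodness-of-fit statistic can be computed directly from the ordered failure times. The sketch below is a minimal Python illustration under stated assumptions: the function names are invented here, the failure-truncated branch assumes the usual convention that $M = N - 1$ and the last failure time serves as the horizon, and the critical value must still be read from Table 13.2 by the user. The data and the β value in the usage lines are hypothetical.

```python
def cramer_von_mises(times, beta, end_time=None):
    """Cramer-von Mises statistic for the AMSAA model.

    end_time=None -> failure truncated: horizon X_N, M = N - 1 (assumed convention)
    end_time=t0   -> time truncated:    horizon t0,  M = N
    """
    if end_time is None:
        horizon, used = times[-1], times[:-1]
    else:
        horizon, used = end_time, times
    m = len(used)
    total = 1.0 / (12.0 * m)
    for i, x in enumerate(used, start=1):
        total += ((x / horizon) ** beta - (2 * i - 1) / (2.0 * m)) ** 2
    return total


def amsaa_accepted(statistic, critical_value):
    """The model is accepted when the statistic is below the tabulated critical value."""
    return statistic < critical_value


# Hypothetical time-truncated data and beta estimate, assumed for illustration only.
hypothetical_times = [40.0, 130.0, 310.0, 520.0, 900.0]
statistic = cramer_von_mises(hypothetical_times, beta=0.7, end_time=1000.0)
accepted = amsaa_accepted(statistic, critical_value=0.160)   # Table 13.2, M = 5, alpha = 0.10
```

Each squared term compares the fitted fraction of failures expected by time $X_i$ with the midpoint of the $i$th step of the empirical distribution, so large values flag a systematic departure from the power-law shape.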
EXAMPLE 13.2 A reliability growth test has accumulated 2,500 hours of test time, with failures occurring at the following test times: 85, 151, 184, 267, 378, 474, 660, 803, 1,031, 1,230, 1,400, 1,589, 1,643, 1,756, and 2,122 hours. Determine if the AMSAA model is appropriate at the 10% level of significance. If it is, determine the MTBF at 2,500 hours.

$$\mu_2 = \frac{\sum_{i=1}^{N} X_i - N t_0 / 2}{t_0 [N/12]^{0.5}} = -1.7806 \tag{13.29}$$

Because −1.7806 is less than −1.645, significant growth is indicated at the 10% significance level and the AMSAA model can be used for estimating parameters $\beta$ and $\lambda$. The unbiased estimate of $\beta$ is

$$\beta = \frac{N - 1}{N \ln t_0 - \sum_{i=1}^{N} \ln(X_i)} = 0.679 \tag{13.30}$$

The unbiased estimate of $\lambda$ is

$$\lambda = \frac{N}{t_0^{\beta}} = 0.073942 \tag{13.31}$$

The Cramér–von Mises statistic is

$$C_M^2 = \frac{1}{12M} + \sum_{i=1}^{M} \left[(X_i / t_0)^{\beta} - \frac{2i - 1}{2M}\right]^2 = 0.100305 \tag{13.32}$$

The calculated value, 0.100305, is less than the critical value of 0.169 obtained from the Cramér–von Mises table. Therefore, the AMSAA model is accepted and the intensity function may be estimated for $t$ = 2,500 hours:

$$\nu_\lambda(t) = \lambda \beta t^{\beta - 1} = 0.004074 \tag{13.33}$$

The inverse of the intensity function provides an MTBF of 245 hours.
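The point estimates in Example 13.2 can be reproduced with a few lines of Python. This is a sketch, with a function name chosen here for illustration; it uses the unbiased time-truncated estimators and then inverts the intensity at the end of the test. The second call replaces the first two failure times with 1 and 3 hours to show how sensitive the result is to very early failures.

```python
import math


def amsaa_mtbf_time_truncated(times, t0):
    """Unbiased beta and lambda, intensity at t0, and the resulting MTBF."""
    n = len(times)
    beta = (n - 1) / (n * math.log(t0) - sum(math.log(x) for x in times))
    lam = n / t0 ** beta
    intensity = lam * beta * t0 ** (beta - 1.0)
    return beta, lam, intensity, 1.0 / intensity


failure_times = [85, 151, 184, 267, 378, 474, 660, 803,
                 1031, 1230, 1400, 1589, 1643, 1756, 2122]

# Example 13.2 as stated: beta about 0.679, lambda about 0.0739, MTBF about 245 hours.
beta, lam, intensity, mtbf = amsaa_mtbf_time_truncated(failure_times, 2500.0)

# Same data with the first two failures moved to 1 and 3 hours:
# beta drops to about 0.483 and the MTBF estimate rises to about 345 hours.
early_times = [1, 3] + failure_times[2:]
beta_e, lam_e, intensity_e, mtbf_e = amsaa_mtbf_time_truncated(early_times, 2500.0)
```

Algebraically the MTBF reduces to $t_0 / (N \beta)$, so any change in the data that lowers the β estimate raises the reported MTBF in direct proportion.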
The data in this example can be modified to illustrate the sensitivity of the MTBF calculated by the AMSAA model. If the first two failures occurred very early in the test (at 1 and 3 hours, instead of 85 and 151 hours), then $\beta$ would be equal to 0.483094, $\lambda$ would be equal to 0.342426, and the MTBF at $t$ = 2,500 hours would be 345 hours. The successful completion of the pretest tasks, such as ESS, burn-in, thermal survey, and so forth, should help assure that failures do not occur at the start of the improvement program.

A technique to estimate reliability growth when data may be missing or some failure times are not known was investigated by Crow (1988), who discussed a methodology useful for this situation over some interval of test time. Generally, for grouped data, the estimation procedure is somewhat more complicated because a closed form equation for $\beta$ does not exist. Assume that there are $k$ intervals with boundaries $x_0 = 0, x_1, \dots, x_k$; $\beta$ is estimated as the solution of the following equation:

$$\sum_{i=1}^{k} n_i \left[\frac{x_i^{\beta} \ln x_i - x_{i-1}^{\beta} \ln x_{i-1}}{x_i^{\beta} - x_{i-1}^{\beta}} - \ln x_k\right] = 0 \tag{13.34}$$

where $n_i$ is the number of failures in the $i$th interval and $x_0 \ln x_0$ is defined to equal zero. Numerical techniques must be employed to solve this equation for $\beta$. Given an estimate for $\beta$, $\lambda$ can be estimated by

$$\hat{\lambda} = \frac{\sum_{i=1}^{k} n_i}{x_k^{\hat{\beta}}} \tag{13.35}$$
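One possible numerical technique for the grouped-data equation is bisection, sketched below in Python. The choice of bisection, the search bracket of 0.01 to 5, the function names, and the interval data are all assumptions made here for illustration; the source only states that a numerical solution is required. The hypothetical counts were picked so that equal numbers of failures fall in intervals whose boundaries are 100, 400, and 900 hours, which a power law with β = 0.5 fits exactly.

```python
import math


def _x_beta_ln_x(x, beta):
    """x**beta * ln(x), with the convention that the value is zero at x = 0."""
    return 0.0 if x == 0.0 else x ** beta * math.log(x)


def grouped_score(beta, bounds, counts):
    """Left-hand side of Equation 13.34 for boundaries x_0..x_k and counts n_1..n_k."""
    ln_xk = math.log(bounds[-1])
    total = 0.0
    for i, n_i in enumerate(counts, start=1):
        lower, upper = bounds[i - 1], bounds[i]
        numerator = _x_beta_ln_x(upper, beta) - _x_beta_ln_x(lower, beta)
        denominator = upper ** beta - lower ** beta
        total += n_i * (numerator / denominator - ln_xk)
    return total


def solve_grouped_beta(bounds, counts, low=0.01, high=5.0, tol=1e-10):
    """Bisection on Equation 13.34; the bracket [low, high] is an assumed search range."""
    f_low = grouped_score(low, bounds, counts)
    f_high = grouped_score(high, bounds, counts)
    if f_low * f_high > 0:
        raise ValueError("no sign change in the search bracket")
    while high - low > tol:
        mid = 0.5 * (low + high)
        f_mid = grouped_score(mid, bounds, counts)
        if f_low * f_mid <= 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return 0.5 * (low + high)


# Hypothetical grouped data: 10 failures in each of (0, 100], (100, 400], (400, 900] hours.
bounds = [0.0, 100.0, 400.0, 900.0]
counts = [10, 10, 10]
beta_hat = solve_grouped_beta(bounds, counts)
lambda_hat = sum(counts) / bounds[-1] ** beta_hat   # Equation 13.35
```

Any bracketing root finder would serve equally well; bisection is used here only because it needs no derivative and no external library.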
Other continuous reliability growth models have been developed that model products under assumptions tailored to the development program and to specific circumstances. Some of these include models by Cox and Lewis (1966), and Lloyd and Lipow (1962) and software models by Jelinski and Moranda (1972) and Littlewood and Verrall (1973).

13.4.2 Discrete Models

The Duane and AMSAA reliability models are examples of continuous models, developed for repairable products in which reliability is measured over periods of time. Discrete models differ from continuous models because they measure reliability in terms of a go/no-go situation, such as for a missile or rocket. Products that either fail or operate when called into service are modeled by discrete functions. Popular discrete models were developed by Lloyd and Lipow (1962) and Wolman (1963).

13.4.2.1 Lloyd and Lipow Model

Lloyd and Lipow (1962) considered two models. The first assumes the product in the reliability improvement program has only one failure mode. Each trial assumes that the probability that the product will fail if the failure mode has not been previously eliminated is a constant. If the trial is a success and the product does not fail, the next trial is performed. If the product fails, an attempt is made to eliminate the failure mode by a corrective action or design change. The probability of removing this failure mode is also assumed to be constant. Therefore, the product reliability, $R_n$, on the $n$th trial is

$$R_n = 1 - A e^{-C(n - 1)} \tag{13.36}$$

where $A$ and $C$ are predetermined parameters.
Lloyd and Lipow considered a second improvement program conducted in $K$ stages. On the $i$th stage, $N$ number of products are tested. Failures and successes are recorded during each stage, and improvements are not incorporated until the completion of a given stage. The reliability growth function considered is

$$R_i = R_\infty - \frac{\alpha_r}{i} \tag{13.37}$$
where $R_i$ is the product reliability during the $i$th stage, $R_\infty$ is the ultimate value of reliability as $i \to \infty$, and $\alpha_r$ is the growth rate ($\alpha_r > 0$). Maximum likelihood estimates and least square estimates are used to determine the values of $R_\infty$ and $\alpha_r$. A lower confidence limit for the reliability determined during the final or $K$th stage was also determined.

13.4.2.2 Wolman Model

Wolman (1963) considered a situation where product failures could be classified as either an inherent cause or an assignable cause. For each trial (for instance, a missile launch), the trial is either a success or a failure. If it is a failure, then the cause is determined to be either inherent or assignable. Inherent cause failures reflect the state of the art of the product and cannot be eliminated by corrective action. Assignable cause failures can be eliminated. Wolman assumed, first, that a number of original assignable cause failure modes are known and, second, that when one of these modes causes a product failure, it will be permanently removed from the product. A Markov-chain approach was used to determine the reliability of the product after the $n$th trial. The model considered was

$$R(n) = \sum_{i=0}^{k} (1 - q_i)(1 - q_0)^{k - i} P_{0,i}^{(n)} \tag{13.38}$$

where $q_i$ is the probability that the product will fail due to inherent failure modes, $q_0$ is the probability of failure due to assignable cause failure modes, and $P_{0,i}^{(n)}$ is the $n$-step transition probability. Other discrete reliability growth models have been developed besides the Lloyd and Lipow and Wolman models. These include models by Barlow and Scheuer (1966) and Singpurwalla (1978).
13.5 RELIABILITY IMPROVEMENT EFFECTIVENESS AND UNCERTAINTY
The overall objective of any reliability improvement program is to identify, correct, and eliminate design and manufacturing deficiencies and failure modes. If the reliability improvement program is carried out as planned, design defects should be uncovered and corrected during the growth test instead of occurring in the field. However, two phenomena inhibit this process: reliability growth test effectiveness and uncertainty.
13.5.1 Reliability Improvement Effectiveness

When failures occur during a reliability growth test, they may or may not be corrected. Failure modes that will not be corrected are termed type A failure modes; failure modes that will be corrected are termed type B failure modes. (A similar classification of failures was also discussed by Wolman, 1963.) After the analysis of reliability data using one of the models, careful attention should be paid to the implementation of corrective actions. Ideally, all relevant failure modes should be corrected. However, due to funding limitations or the state of the art of the design, there will be a percentage of type A failures. Experience has shown that of the type B failure modes, an average of 30% will remain in the product, even though they were thought to have been corrected. The proportion of type B failure modes that will be eliminated equals the growth effectiveness factor. The potential growth upon the completion of the test can be determined by

$$\text{System}_{GP} = \frac{1}{\lambda_A + (1 - EF) \times \lambda_B} \tag{13.39}$$

where $\text{System}_{GP}$ is the product growth potential, $\lambda_A$ is the observed failure rate of type A failure modes, $\lambda_B$ is the observed failure rate of type B failure modes, and $EF$ is the effectiveness factor.

EXAMPLE 13.3 A reliability growth test was stopped at 3,000 hours of test time with 25 failures. Failure analysis determined that 6 failures were type A and the remaining 19 were type B. Experience on comparable products has shown that the effectiveness factor should be 70%. The growth potential for this product is

$$\text{System}_{GP} = \frac{1}{\dfrac{6}{3000} + (1 - 0.7) \times \dfrac{19}{3000}} = 256 \text{ hours} \tag{13.40}$$
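The growth potential calculation is a one-line formula, sketched here in Python with the figures from Example 13.3. The function and argument names are chosen for illustration.

```python
def growth_potential_mtbf(failures_a, failures_b, test_hours, effectiveness):
    """Growth potential MTBF from type A and type B failure counts (Equation 13.39)."""
    lambda_a = failures_a / test_hours   # observed type A failure rate
    lambda_b = failures_b / test_hours   # observed type B failure rate
    return 1.0 / (lambda_a + (1.0 - effectiveness) * lambda_b)


# Example 13.3: 6 type A and 19 type B failures in 3,000 hours, effectiveness 70%.
mtbf_gp = growth_potential_mtbf(6, 19, 3000.0, 0.7)   # about 256 hours
```

Setting `effectiveness` to 1.0 in this sketch gives the ceiling imposed by the uncorrected type A modes alone, 500 hours for these data.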
This example illustrates the use of known cumulative test data to determine the growth potential upon the completion of a test. Similar analytical methods could be applied to the Duane model prior to the start of the growth test to determine the amount of test time needed to reach the required MTBF, based on various values of the growth rate.

13.5.2 Reliability Improvement Uncertainty

The Department of Defense conducted a study on the uncertainty of MTBF and growth estimates (U.S. Air Force, Army, and Navy 1989). Earlier, an example of the AMSAA model was given where failures occurring at the start of the test caused a 40% increase in the MTBF (345 vs. 245 hours). Given the numerous models available for analyzing growth test data, the problems associated with failure classification, and growth test effectiveness, it is apparent why there could be large variations in the results. The U.S. Air Force, Army, and Navy conducted Monte Carlo simulations using the AMSAA model to determine the probable uncertainties for MTBF and growth estimates. For each Monte Carlo trial, all failure times up to the 30th failure were recorded, and estimates were made of the growth rate and of the instantaneous (or current) MTBF. Figure 13.10a illustrates the range of data containing 80% of the simulation results. After five failures, 10% of the MTBF estimates would be expected to exceed the true value by a factor of 2.6, and 10% would be less than 0.45 of the true value. The three panels in Figure 13.10b were also developed using Monte Carlo simulations. These panels illustrate that, as the true growth rate increases, the dispersion in estimated growth rate diminishes.

Figure 13.10a Uncertainty of prospective MTBF estimates. (Monte Carlo 80% band; estimated MTBF/true MTBF versus number of failures at which the estimate is made.)

The variations illustrated by these Monte Carlo simulations and the many variables inherent in the planning and conduct of growth testing make it imperative that critical program decisions not be made on the results of growth testing alone. This does not imply that reliability growth testing is not cost effective. Indeed, it can be a cost-effective method of continuously improving the reliability of a product. This does imply that sound engineering judgment should be used to compare the results of the growth program with the results of other development tests and other analyses, such as thermal surveys and reliability predictions. In addition, the reliability determined by the growth test should be bounded by upper and lower confidence limits. In this case, we could say with some confidence (say, 80%) that the true MTBF would be between the lower and upper confidence limits; at 80%, there is a one in five chance that the true MTBF would not be between the bounds.
When performed with the objective to eliminate and remove deficiencies, a reliability growth program will improve the reliability of a product and may eliminate the need for reliability demonstration testing. However, if there is any doubt about the results of the growth test or if major program decisions are needed, then a reliability demonstration or qualification test must be considered.
Figure 13.10b Uncertainty of prospective growth rate estimates (Monte Carlo). [Three panels, for true growth rates of 0.1, 0.3, and 0.5; x-axis: No. of failures at which estimate made; y-axis: estimated growth rate.]
13.6 SUMMARY
This chapter discussed continuous reliability improvement techniques that can be applied to products. The reliability improvement process recognizes that the reliability of the drawing board design of a complex product can be improved and that time must be allocated for that improvement. By operating or testing the product in a manner that will identify deficiencies caused by the design, manufacturing process, and/or operation, deficiencies can be detected and removed, and methods for designing-in reliability can be reevaluated or used to improve reliability. Specifically, this chapter discussed the principles of the reliability growth process, stress margin testing, methods for continuous growth monitoring and reliability improvement effectiveness, and uncertainty. Details presented in the reliability growth process section included management decisions required to implement a reliability improvement program, a summary of failure analysis procedures and common failure modes, and techniques to classify failures to ensure a successful reliability improvement program. The section on stress margin testing included accelerated testing techniques and tools to measure the effects of accelerated testing. The section on continuous growth monitoring presented both continuous and discrete growth models. Popular models such as the Duane and AMSAA models were discussed and examples presented. The final section provided techniques and examples to judge the effectiveness of corrective actions and the uncertainty associated with reliability improvement techniques. Information integrated throughout the chapter will assist both managers and engineers in continuously improving the reliability of a product in a cost-effective manner.
REFERENCES
Barlow, R. E., and E. M. Scheuer. 1966. Reliability growth during a development testing program. Technometrics 8:53.
Cox, D. R., and P. A. W. Lewis. 1966. The statistical analysis of series of events. New York: John Wiley & Sons.
Crow, L. H. 1974. Reliability analysis for complex repairable systems. Technical report #138, U.S. Army Materiel Systems Analysis Activity, Aberdeen Proving Ground, Aberdeen, MD.
Crow, L. H. 1986. On the initial system reliability. Proceedings of the Annual Reliability and Maintainability Symposium, Las Vegas.
Crow, L. H. 1988. Reliability growth estimation with missing data - II. Proceedings of the Annual Reliability and Maintainability Symposium, Los Angeles.
Duane, J. T. 1964. Learning curve approach to reliability monitoring. IEEE Transactions on Aerospace 2 (2): 563.
General Electric Company. 1973. Research study of radar reliability and its impact on lifecycle costs for the APQ-113, -119, -120, -144 radars, Utica, NY.
Hobbs, G. K. 1990. Highly accelerated life tests: HALT. Westminster, CO: Hobbs Engineering Corporation.
Jelinski, Z., and P. B. Moranda. 1972. Software reliability research. In Statistical computer performance evaluation, ed. W. Freiberger. New York: Academic Press.
LeStrange, J. 1990. Failure analysis laboratory. Litton Amecon briefing to the University of Maryland Reliability Engineering Program.
Littlewood, B., and J. L. Verrall. 1973. A Bayesian reliability growth model for computer software. Record IEEE Symposium on Computer Software Reliability, New York.
Lloyd, D. K., and M. Lipow. 1962. Reliability: Management methods and mathematics. Englewood Cliffs, NJ: Prentice Hall.
Pecht, M. 1991. Handbook of electronic package design. New York: Marcel Dekker.
Raheja, D. G. 1990. Assurance technologies: Principles and practices. New York: McGraw-Hill.
Schinner, C. 1988. The board electronic STRIFE test (B.E.S.T.) program. Reliability Review 8:3.
Seusy, C. J. 1987. Achieving phenomenal reliability growth. Proceedings of the ASM Conference on Reliability: Key to Industrial Success, Los Angeles, CA.
Singpurwalla, N. 1978. Estimating reliability growth (or deterioration) using time series analysis. MIL-HDBK-189, appendix D.
U.S. Air Force, Army, and Navy. 1989. The TAAF process, appendix C: Uncertainty of MTBF and growth estimates. HQ AMC/QA, OASN S&L, HQ USAF/LE-RD, C1-C3.
Wolman, W. 1963. Problems in system reliability analysis. In Statistical theory in reliability. Madison: University of Wisconsin Press.
CHAPTER 14
Logistics Support
Robert M. Hecht
CONTENTS
14.1 Introduction
14.2 Logistics Elements
14.3 Influence of Reliability on Logistics Resources
14.3.1 Reliability, Maintenance Rates, and Expected Demand for Logistics Resources
14.3.1.1 False Alarm Rate (FAR)
14.3.1.2 Cannot Duplicate (CND) Rate
14.3.1.3 Probability of Fault Detection (DET)
14.3.1.4 Probability of Fault Isolation (ISO)
14.3.1.5 Maintenance Action Rate (MAR)
14.3.1.6 Demand Rate (DEM)
14.3.1.7 Mean Downtime (MDT)
14.3.2 Supply Support: Provisioning of Repair Parts and Consumables
14.3.2.1 Optimal Reorder Quantity
14.3.2.2 Spares’ Availability and Provisioning
14.3.2.3 Provisioning a Product Composed of Replaceable Parts
14.3.2.4 Spares’ Optimization
14.3.3 Manpower and Personnel: Staffing Levels
14.3.4 Support and Test Equipment: Utilization and Productivity
14.4 Repair Level Analysis
14.5 Summary
References
14.1 INTRODUCTION
All consumers expect that a purchased product will not fail or otherwise malfunction during its expected service life. Furthermore, if a failure or performance degradation does occur, the typical consumer expects prompt repair or replacement at a reasonable cost when compared to the purchase price of the product. In addition, the consumer wants a product that requires little or no preventive or scheduled maintenance, in order to minimize both the cost of ownership and unavailability of the product. In the civilian world, industry has responded by improving the reliability and durability of products and by eliminating, or at least minimizing, preventive maintenance (PM) requirements. Examples of reduced PM in the automotive field include the use of electronic, pointless ignition products and lubricated-for-life bearings. The service life of many products (e.g., front wheel bearing/constant velocity joint assemblies and exhaust/emission products) has also improved to the point that the consumer may not need to replace these costly products during the typical ownership period. Ultimately, servicing or repairs will be required, and maintenance of some sort must be provided by an automobile dealer or an independent shop. The customer would prefer to wait while the servicing is done; if that is not possible, the automobile should be ready the same day or, at most, the next day. To accomplish this, the dealer or repair shop needs trained personnel and appropriate facilities, tools, test equipment, technical data, and repair parts. For fast turnaround, these logistical assets must be pre-positioned. To achieve this, the manufacturer must have invested time and money during the design and development of the product. The planning, acquisition, and positioning of the resources necessary to effect the repair or replacement of a product are termed logistics support. 
In order to meet consumer needs and expectations, it is important to develop products from a life-cycle perspective. Reliability, maintainability, and effectiveness must be considered by designers and program managers at the initiation of design and development if products are to meet the fundamental performance needs of the consumer cost effectively. However, consumer satisfaction cannot be completely fulfilled without addressing logistics and product support capability and properly integrating them with the application-oriented aspects of the product. Reliability, maintainability, and effectiveness requirements must be applied not only to the prime equipment and applicable software, but also to the acquired resources that comprise the logistics support elements. Integrated logistics support (ILS) applied to products constitutes a life-cycle approach to maintenance and support. ILS is an integral part of all aspects of product planning, design, and development; testing and evaluation; production and construction; utilization; and retirement. Elements of logistics support that concern the developer include the maintenance plan; supply support; product support; packaging, handling, storage, and transportation; manpower and personnel; training and training support; facilities; technical data; computer resources support; and design interface. Blanchard (1992) provides a broad overview of the life-cycle aspects of logistics support. This chapter discusses the influence of reliability on logistics
support requirements, emphasizing how the reliability of a product, equipment, or assembly influences the need for spares or repair parts, support equipment, and maintenance personnel.
14.2 LOGISTICS ELEMENTS
Logistics support encompasses the planning and management, design and development, and acquisition and positioning of the resources necessary to ensure the effective and economical support of a product throughout its programmed life cycle. The elements of logistics support must be integrated with all other segments of the product in order to ensure that both operational and cost requirements are met. There are several major elements of logistics support. The maintenance plan includes all planning and analysis for the overall support of a product throughout its life cycle. This process is formalized through logistics support analysis (LSA) and repair level analysis (RLA)—or level of repair analysis—and documented in the logistics support analysis record. The LSA is an iterative process that depends upon inputs from reliability and maintainability predictions, failure modes, effects, and criticality analysis. The results of these analyses, coupled with design reviews and audits conducted by logistics engineers, are used to develop the maintenance concept for each product, subassembly, and assembly, and to identify the logistics resources required to support the product if it is deemed replaceable, repairable, or both by the RLA. Like a life-cycle cost analysis, the RLA considers all costs associated with supporting a product over its life cycle. The RLA is an economic evaluation of the cost benefit of repairing or discarding the product, or its constituent components, at specified levels of maintenance. The levels typically considered are the organizational level or on-site, by-the-user level (O-level)—at which, typically, assemblies can only be removed and replaced; the intermediate level (I-level), which can be either an on- or off-site repair facility with a limited capability to repair assemblies and subassemblies; and the depot level (D-level) or remanufacturer, which has the capability to rework or refurbish subassemblies. 
At each of these maintenance levels, RLA is used to calculate the costs associated with the repair or discard of the candidate product. The results of the RLA are fed back to the LSA so that the final LSA reflects the maintenance concept and logistics resources that must be developed or procured to support the product over its life cycle. Supply support involves all spares (e.g., units, assemblies, modules), repair parts, consumables, special supplies, and related inventories needed to support prime application-oriented product, software, testing and support of the product, transportation and handling of the product, training equipment, and facilities. Supply support could also encompass provisioning documentation, procurement functions, warehousing, and the distribution of material, as well as the personnel associated with the acquisition and maintenance of spare and repair part inventories at all applicable locations. Considerations include each maintenance level and each geographical location where spare and repair parts are distributed and stocked, spares’ demand rates and
inventory levels, distances between stockage points, procurement lead times, and methods of material distribution. Support and test equipment (STE) includes all tools, special-condition monitoring equipment, diagnostic and checkout equipment, metrology and calibration equipment, maintenance stands, and servicing and product handling required to support scheduled and unscheduled maintenance of the product. Test and product support requirements must be addressed at each level of maintenance, as well as the overall requirements for test traceability to a primary or secondary standard. Test and product support may be classified as peculiar (newly designed or off-the-shelf products unique in the user inventory to the product under development) or common (existing products already in the user inventory). Test and product support is also classified into special purpose (designed specifically to support the product under development) or general purpose (typically, off-the-shelf product tests to support end products, in addition to the product under development, without the need for modification). Packaging, handling, storage, and transportation encompass all special provisions, containers (reusable and disposable), and supplies to support packaging, preservation, storage, handling, or transportation of prime product, test and product support, spares and repair parts, personnel, technical data, and mobile facilities. This element involves both the initial distribution of products and the transportation of personnel and materials for maintenance purposes. Manpower and personnel include the personnel required for the installation, checkout, operation, handling, and sustaining maintenance of the product and its associated test and product support. Personnel requirements are identified in terms of quantity and skill levels for each operation and maintenance function by level of support and geographical location. 
Training and training support involve initial training to familiarize personnel with the product as well as replenishment training to compensate for attrition and the development of replacement personnel. Training is designed to upgrade assigned personnel to the skill levels defined for the product. Training support also includes those aids (e.g., simulators, mock-ups, special products, software) developed to support personnel training operations. Facilities are all physical locations needed to operate the product and to perform maintenance functions at each level: physical plants, real estate, portable buildings, housing, intermediate maintenance shops, calibration laboratories, and special depot repair and overhaul facilities. Capital equipment and utilities (e.g., heat, power, energy requirements, environmental controls, communications) are generally included as part of facilities. Technical data include the installation and checkout procedures, operating and maintenance instructions, inspection and calibration procedures, overhaul procedures, modification instructions, facilities information, drawings, and specifications necessary to perform product operation and maintenance functions. Such data cover not only the prime application equipment, but also testing and support equipment, transportation and handling equipment, training equipment, and facilities.
Computer resources support encompasses all computer equipment and accessories, condition monitoring and maintenance diagnostic aids, software, program tapes, disks, and databases to perform product maintenance functions at each level. Design interface relates logistics design parameters to product readiness, resource requirements, and support cost. These design parameters could include product availability or the attainment of a required product output, compliance with local or national environmental or safety codes or laws, the minimization of the use of energy resources, and the ability of the designed product to be used or easily modified for use in more than one end product.
14.3 INFLUENCE OF RELIABILITY ON LOGISTICS RESOURCES
From the perspective of the logistician, reliability translates into a demand for logistics resources; maintainability translates into the range of logistics resources required to support the operation of the prime product and the length of time during which specific logistics resources (e.g., personnel or product support) are dedicated to a single repair action. The interaction of reliability and maintainability results in the need for logistics assets to maintain a level of operational readiness or availability over the time desired by the user. The equation for operational availability was given by
\[ A_0 = \frac{\mathrm{MTBM}}{\mathrm{MTBM} + \mathrm{MDT}} \tag{14.1} \]
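As a quick numerical illustration of Equation 14.1, the following minimal sketch computes operational availability from a mean time between maintenance (MTBM) and a mean downtime (MDT). The function name and the input values (500 hours and 25 hours) are hypothetical and chosen only for the example.

```python
def operational_availability(mtbm_hours, mdt_hours):
    """Operational availability A0 = MTBM / (MTBM + MDT), per Equation 14.1."""
    if mtbm_hours <= 0 or mdt_hours < 0:
        raise ValueError("MTBM must be positive and MDT must be non-negative")
    return mtbm_hours / (mtbm_hours + mdt_hours)


# Hypothetical values: 500 h between maintenance actions, 25 h mean downtime.
a0 = operational_availability(mtbm_hours=500.0, mdt_hours=25.0)
print(f"A0 = {a0:.4f}")
```

Halving the downtime in this example raises availability more than a modest increase in MTBM would, which is why the chapter examines the components of MDT separately.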
In this chapter, we will further discuss the term MDT (mean downtime) by examining how the mean time to repair (MTTR) and supply response time affect it. We will also examine how reliability affects supply support, provisioning, and the utilization of product support and maintenance personnel.
14.3.1 Reliability, Maintenance Rates, and Expected Demand for Logistics Resources
Low reliability creates an increased demand for logistics resources. The arrival rate is the demand rate that a product or population of products places on its logistics support. For a product whose reliability is represented by a homogeneous Poisson process (i.e., a constant hazard rate), the general demand rate equation is given by
\[ \text{Demand Rate} = (\text{Number of Units in Use}) \times (\text{Maintenance Actions per Unit Time per Unit}) \tag{14.2} \]
The number of maintenance actions per unit time is a function of the relevant, chargeable failures of the product, as well as removals due to deficient built-in
testing or product monitoring (false alarms and false fault indications), inadequate diagnostic procedures that result in the unnecessary removal of a functional unit, or the removal of one assembly in order to gain access to another (i.e., irrelevant, nonchargeable failures). Figure 14.1 illustrates the potential impact of these incidents.
Figure 14.1 Relationships of FAs, CNDs, and fault isolation to the maintenance of a replaceable item. [Flow diagram starting from "Fault detected," with branches for BIT indication and no BIT indication, true failures (TF%) and CNDs, false alarms (FA%), detection (DET%), and correct or incorrect isolation (ISO%) to 1, 2, or m RIs. Legend: BIT: built-in test; CND: cannot duplicate; ISO: isolation; FA: false alarm; TF: true failures; DET: detection; RI(s): repairable item(s).]
Some key terms associated with maintenance actions include:
• false alarm (FA);
• retest okay (RETOK);
• cannot duplicate (CND); and
• no fault found (NFF).
False alarms are indications, usually by built-in test equipment (BITE), that something is wrong with the monitored product, even though the operator does not perceive any degradation in performance. If the operator does report a problem, a maintainer may attempt to duplicate the problem and could trigger a CND or an NFF. From the reliability engineer’s viewpoint, a failure may not have occurred. However, from the perspective of the logistician, logistics resources have been expended.
As shown in Figure 14.1, if the operator detects a fault and requests assistance from a technician (O-level), one of two events may occur. Either the maintainer can duplicate a fault condition and conduct diagnostics to fault-isolate to a “potentially” failed replaceable or repairable item (RI), or the maintainer may be unable to find or duplicate the fault condition.
Several outcomes can result from a repair action. The maintainer could complete the repair or removal and replacement and conduct a checkout procedure. The result could be a fully functional product (successful repair) or the recurrence of the original fault condition (if the wrong component was replaced). If the parts used to effect the repair are themselves repairable, the failed parts will be shipped to a higher level maintenance facility (I- or D-level) or, perhaps, to the manufacturer. If two or more repairable products were replaced, it is likely that at least one was in functional working order (a RETOK event). This is especially true for electronics. In some cases, a product shipped to a repair facility may be found to be functional and meet minimum performance requirements, but the maintenance activity may require that it be refurbished to restore or increase the usable service life. To satisfy requirements at the I- or D-level, logistics support must again provide the required resources.
If the O-level maintainer replaces one or more components and the product remains faulty, logistics support must still provide the same resources as though a true failure occurred. Costs (time and money) were expended at the O-level, and perhaps at the I- and D-levels if repairable assemblies were replaced, but the affected product remains not fully functional. The resolution of the fault may require the use of supplemental diagnostic procedures or more experienced O-level maintainers; the assistance, on-site or remotely, of I- or D-level technicians or manufacturer’s field representatives; or the shipment of the product to a higher maintenance level. Due to differences in defining what incidents or events constitute an FA, a CND, or a RETOK, only two terms, the FA and the CND, are discussed here. The first, the false alarm rate (FA), covers indications by BITE or some other form of remote monitoring that something is wrong with the product, although the operator cannot perceive a problem. The second term, the cannot duplicate rate (CND), encompasses all nonoperational and shop maintenance actions during which a fault cannot be identified.
14.3.1.1 False Alarm Rate (FAR)
The FAR is defined as the number of FA actions divided by the total number of fault indications at the user (operator) level for a given product. Generally, FA incidents are considered only at the user level for systems or subsystems that employ some form of BITE or other means of automated or semiautomated monitoring. The FAR can be expressed as a percentage or as a decimal equivalent. Hence,
\[ \mathrm{FAR} = \frac{\text{Number of FA Actions}}{\text{Number of FA Actions} + \text{Number of True Malfunctions}} = \frac{\text{Number of FA Actions}}{\text{Total Number of User Fault Indications}} \tag{14.3} \]
and
\[ 1 - \mathrm{FAR} = \frac{\text{Number of True Malfunctions}}{\text{Total Number of User Fault Indications}} \tag{14.4} \]
Assume that FA incidents do not result in a request for maintenance and thus do not impact logistics support.
14.3.1.2 Cannot Duplicate (CND) Rate
CND incidents can occur at any level of the product hierarchy (i.e., system, subsystem, assembly, subassembly, module) but can only be incurred by maintainers. The CND rate for a product at a given maintenance level is given by
\[ \mathrm{CND} = \frac{\text{Number of Maintenance Actions for which No Fault is Found}}{\text{Number of Maintenance Actions for which No Fault is Found} + \text{Number of True Failures}} = \frac{\text{Number of Maintenance Actions for which No Fault is Found}}{\text{Total Number of Maintenance Actions}} \tag{14.5} \]
and
\[ 1 - \mathrm{CND} = \frac{\text{Number of True Detected Faults}}{\text{Total Number of Maintenance Actions}} \tag{14.6} \]
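Equations 14.3 through 14.6 are ratios of event counts, so they are straightforward to compute from maintenance records. The sketch below is illustrative only: the function names and the counts are hypothetical, not taken from any data set in this chapter.

```python
def false_alarm_rate(fa_actions, true_malfunctions):
    """FAR = FA actions / total user fault indications (Equation 14.3)."""
    total_indications = fa_actions + true_malfunctions
    if total_indications == 0:
        raise ValueError("no fault indications recorded")
    return fa_actions / total_indications


def cannot_duplicate_rate(no_fault_found_actions, true_failures):
    """CND = no-fault-found actions / total maintenance actions (Equation 14.5)."""
    total_actions = no_fault_found_actions + true_failures
    if total_actions == 0:
        raise ValueError("no maintenance actions recorded")
    return no_fault_found_actions / total_actions


# Hypothetical counts for one product over a reporting period.
far = false_alarm_rate(fa_actions=12, true_malfunctions=48)
cnd = cannot_duplicate_rate(no_fault_found_actions=6, true_failures=54)
print(f"FAR = {far:.2f}, 1 - FAR = {1 - far:.2f}")
print(f"CND = {cnd:.2f}, 1 - CND = {1 - cnd:.2f}")
```

The complements printed alongside each rate correspond to Equations 14.4 and 14.6.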
14.3.1.3 Probability of Fault Detection (DET)
An area of concern to both the user and the logistician is the inability of BITE or the user to detect failures. How is the product logistics support impacted when neither
the operator nor BITE detects all failures? Should the logistician be concerned? If the current and future effects of the failure and subsequent failure of other assemblies or components caused by potential overstress are totally benign, the logistics support system may never plan for or address (repair) failures of this type. If the latent failed product is detected, isolated, and replaced, it will probably be during a maintenance action for an overt failure. Detection and isolation of the latent failed product is a result of the tighter tolerance windows of intermediate or depot support products. For failure modes undetected by BITE that result in undesirable effects, the product logistics support must provide an alternate means of detection and correction. This means that the logistician’s plan must provide the operator with the capability to detect a non-BITE failure mode of the product. In some cases, the logistician may be required to procure and field a special-purpose test product to conduct pre- or postoperational checks. The DET is given by
\[ \mathrm{DET} = \frac{\text{Number of Detected True Failures}}{\text{Total Number of True Failures}} \tag{14.7} \]
14.3.1.4 Probability of Fault Isolation (ISO)
As shown in Figure 14.1, a term called the probability of fault isolation (ISO) is associated with how the system, subsystem, equipment, or assembly design permits fault isolation of one or more lower level elements for a given percentage of maintenance actions. For a given assembly,
\[ \mathrm{ISO}_n = \frac{\text{Number of Maintenance Actions Isolated to } n \text{ Components}}{\text{Total Number of Maintenance Actions}} \tag{14.8} \]
For a given product, ISO can also be stated in terms of the number of removals or maintenance actions on that product due to inclusion in an ambiguity group for which the product did not fail, divided by the total number of maintenance actions:
\[ \mathrm{ISOL}_i = \frac{\text{Number of (Nonfailure) Removals or Maintenance Actions for Item } i \text{ due to Inclusion in an Ambiguity Group}}{\text{Total Number of Maintenance Actions for Item } i} \tag{14.9} \]
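Equations 14.8 and 14.9 can be illustrated with a small, hypothetical maintenance log. In the sketch below each record lists the items removed together as an ambiguity group and the item later confirmed as failed. The record layout, the item names, and the reading of "maintenance actions for item i" as "actions in which item i was removed" are assumptions made for this example.

```python
from collections import Counter

# Hypothetical log: (items removed as one ambiguity group, item confirmed failed).
actions = [
    (("PWB-A",), "PWB-A"),
    (("PWB-A", "PWB-B"), "PWB-B"),
    (("PWB-A", "PWB-B", "PWB-C"), "PWB-A"),
    (("PWB-B",), "PWB-B"),
]


def iso_by_group_size(log):
    """ISO_n: share of maintenance actions isolated to n components (Equation 14.8)."""
    sizes = Counter(len(group) for group, _ in log)
    return {n: count / len(log) for n, count in sizes.items()}


def isol_for_item(log, item):
    """ISOL_i: nonfailure removals of an item / all actions that removed it (Equation 14.9)."""
    involved = [(group, failed) for group, failed in log if item in group]
    if not involved:
        raise ValueError(f"{item} does not appear in the log")
    nonfailure_removals = sum(1 for _, failed in involved if failed != item)
    return nonfailure_removals / len(involved)


print(iso_by_group_size(actions))
for name in ("PWB-A", "PWB-B", "PWB-C"):
    print(name, round(isol_for_item(actions, name), 3))
```

In this log the third board is only ever removed as part of an ambiguity group, so every one of its removals is a nonfailure removal.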
In terms of design for testability, ISO and ISOL define the size of ambiguity groups. From the viewpoint of the logistician and maintainer, the maintainer may have to remove one or more subassemblies, incurring longer repair time (and more downtime). Also, the logistician must provide more spare products or assemblies and allow longer utilization of product support and manpower than if the design permitted fault isolation to one product or assembly 100% of the time. The removal and replacement of ambiguity groups may also affect the CND rates of the removed components. For example, assuming that only one component can fail at a time, consider three printed wiring boards (PWBs) identified as possibly
causing the failure of a transmitter. Unless the maintainer or shop has known good subassemblies, replacement PWBs would have to be drawn from supply or ordered from a higher level of maintenance or the manufacturer. The maintainer could order one PWB at a time, remove the existing installed PWB, and install the new PWB to see if it corrected the problem. If not, the maintainer could order the next PWB on the list and either leave the new PWB in place or reinstall the old one. If the repair action was successful, the failed product would be subject to repair or discard, in accordance with the standing maintenance plan. Alternatively, the maintainer could order all three PWBs at the same time and conduct the fault isolation by substitution. In this case, the maintainer would have to send two good PWBs and one bad PWB back to the supply product or manufacturer. Therefore, the removal of a good (nonfailed) product as a result of its inclusion in an ambiguity group may become a CND at the next maintenance level.
14.3.1.5 Maintenance Action Rate (MAR)
The MAR is defined as the number of maintenance actions per operating unit per unit time. In general, the MAR is given by
\[
\begin{aligned}
\mathrm{MAR} &= (\text{Number of True Detected Failures per Unit Time}) + (\text{Number of Cannot Duplicate Actions per Unit Time}) \\
&\quad + (\text{Number of Removal Actions due to Ambiguities per Unit Time}) \\
&= \mathrm{DET} \times \lambda_L + \mathrm{CND} \times \mathrm{MAR} + \mathrm{ISOL} \times \mathrm{MAR} \\
&= \frac{\mathrm{DET} \times \lambda_L}{1 - \mathrm{CND} - \mathrm{ISOL}}
\end{aligned}
\tag{14.10}
\]
where CND is the probability of a maintenance action being a CND, and ISOL is the probability that the maintenance action is a result of the product being part of an ambiguity group and that the product has not failed. In the preceding equation, \(\lambda_L\) is the series or logistics failure rate. False alarms are not considered because the term addresses maintenance rather than failure or fault indications. ISOL actions at one level of maintenance (say, where a subassembly is removed) may become CND actions at the next level of maintenance. Then,
\[ \mathrm{MAR} = \frac{\mathrm{DET} \times \lambda_L}{1 - \mathrm{CND}} \tag{14.11} \]
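The closed form in Equation 14.10 follows from solving the relation MAR = DET × (logistics failure rate) + CND × MAR + ISOL × MAR for MAR. The sketch below evaluates that closed form and checks it against the relation it came from. The function name and the numerical inputs are hypothetical; setting ISOL to zero gives the form of Equation 14.11.

```python
def maintenance_action_rate(det, failure_rate, cnd=0.0, isol=0.0):
    """MAR = DET * failure_rate / (1 - CND - ISOL), the closed form of Equation 14.10."""
    denominator = 1.0 - cnd - isol
    if denominator <= 0.0:
        raise ValueError("CND + ISOL must be less than one")
    return det * failure_rate / denominator


# Hypothetical inputs: 95% detection, 0.002 failures per operating hour,
# 10% of actions are CNDs, 5% are nonfailure ambiguity-group removals.
det, lam_l, cnd, isol = 0.95, 0.002, 0.10, 0.05
mar = maintenance_action_rate(det, lam_l, cnd, isol)

# The result satisfies the defining relation: true failures + CND share + ISOL share.
assert abs(mar - (det * lam_l + cnd * mar + isol * mar)) < 1e-12
print(f"MAR = {mar:.6f} maintenance actions per operating hour")
```

Because CND and ISOL actions are expressed as fractions of all maintenance actions, even modest values inflate the maintenance action rate above the detected failure rate.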
If all removals or maintenance actions are true failures or, in the case of provisioning, if a product is removed and shipped to the next higher level of maintenance without further checkout to determine whether it is faulty, then the CND becomes equal to zero and the MAR is given by
\[ \mathrm{MAR} = \mathrm{DET} \times \lambda_L \tag{14.12} \]
As defined, the CND and ISOL are not statistically independent and cannot both be equal to one.
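The relationship in Equation 14.10 can be sketched in a few lines of Python. This is only an illustration: the function name and the numerical inputs (500 failures per million operating hours, CND = 0.10, ISOL = 0.05, DET = 1.0) are hypothetical and do not come from the handbook.

```python
def maintenance_action_rate(det, lam_l, cnd=0.0, isol=0.0):
    """Equation 14.10: MAR = DET * lambda_L / (1 - CND - ISOL)."""
    denom = 1.0 - cnd - isol
    if denom <= 0.0:
        # CND and ISOL are fractions of all maintenance actions,
        # so together they must stay below one.
        raise ValueError("CND + ISOL must be less than 1")
    return det * lam_l / denom


# Hypothetical inputs, chosen only to exercise the formula.
mar = maintenance_action_rate(det=1.0, lam_l=500e-6, cnd=0.10, isol=0.05)
print(f"MAR = {mar:.6f} maintenance actions per operating hour")
```

With CND and ISOL both set to zero, the function reduces to Equation 14.12.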
Redundancy is not normally considered when computing the demand for logistics resources because a failure in a redundant product needs to be corrected at some point, if not immediately. For most systems with redundant products, when one of the redundant products fails, repair is either immediately initiated or done at a time when it does not impede operations. However, for heavily redundant products (m of n must function), repair may be delayed until a fixed number of redundant products fails or a scheduled (periodic) maintenance event occurs. In the latter case, the product logistics support must still provide the same number of parts (at the same time), the same technical data, the same product support, and the same maintainers as for immediate repair. However, in the m of n redundancy case with delayed maintenance, the logistician may be required to restore the product to full functionality within the same time limit as if only one redundant subsystem had failed. This time limit may require the logistician to plan for and provide more maintainers and product support than would be required if the failures were repaired as they occurred.

14.3.1.6 Demand Rate (DEM)

For a given time period ($T_L$), the maintenance action rate can be converted into the absolute, expected number of maintenance actions. Following Equation 14.2, the expected number of maintenance actions in time period $T_L$, DEM, is given by

$$\mathrm{DEM} = \mathrm{MAR} \times T_L \tag{14.13}$$
The units of time, $T_L$, must be consistent with the units associated with the term MAR. If MAR is stated in terms of failures, removals, or maintenance actions per operating hour, then $T_L$ must also be in units of operating hours; if MAR is in units of flight hours, $T_L$ must be also. For example, to determine the expected demand for a consumable part at an operational (organizational) site, the logistician may define $T_L$ as

$$T_L = K_u \times N_{sys} \times \mathrm{OPHRS} \times \mathrm{RESUP} \tag{14.14}$$

where $K_u$ is the utilization factor, a conversion factor from one unit of operating time (e.g., flight hours) to another (e.g., operating hours); $N_{sys}$ is the number of products being supported; OPHRS is the number of operating or flight hours per unit of calendar time (e.g., per day); and RESUP is the resupply time (e.g., days), the period of time required to order and receive a part from an off-site source. The utilization factor ($K_u$) permits the equation to account for products that may be operated at a different rate from the parent product containing an operating time-recording device (e.g., an elapsed time indicator). For example, in addition to the avionics operating while the aircraft is airborne, they may also be in operation for preflight and postflight checkout. Alternatively, some avionics systems may only be
operated for a brief period during an extended application. The utilization factor can also be used to account for operating time during repair actions and to convert different application scenarios to a common time base.

14.3.1.7 Mean Downtime (MDT)

The concept of a time period associated with the number of demands for logistics resources can be extended to a more generalized expression that will be employed in supply support. The generalized term is commonly referred to as MDT or mean logistics downtime (MLDT); here, MDT is used. MDT is the expected time for a response from the product logistics support. For example, the MDT for a product that can be repaired at the user site can be given by

$$\mathrm{MDT} = P_{os} \times [P_{sos} \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os}) + (1 - P_{sos}) \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os} + \mathrm{RESUP})] + (1 - P_{os}) \times \mathrm{TAT}_{off} \tag{14.15}$$

where
$P_{os}$ is the probability that the necessary repair can be accomplished on-site;
$P_{sos}$ is the probability that the necessary spare parts are on-site, given that a repair could be accomplished on-site;
$\mathrm{MTTR}_{os}$ is the mean time to repair for on-site repair;
$\mathrm{MADM}_{os}$ is the mean administrative downtime for on-site repairs, including the time to obtain the repair parts from on-site storage;
RESUP is the average time required to obtain the longest lead time part from an off-site source; and
$\mathrm{TAT}_{off}$ is the turnaround time to have the product shipped and repaired, using an off-site repair facility.

When the product under consideration cannot be repaired on-site, the MDT equation is given by

$$\mathrm{MDT} = 0.0 \times [P_{sos} \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os}) + (1 - P_{sos}) \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os} + \mathrm{RESUP})] + (1 - 0.0) \times \mathrm{TAT}_{off} = \mathrm{TAT}_{off} \tag{14.16}$$

Example 14.1 illustrates the calculation of an MDT for different repair scenarios. As will be discussed in Section 14.3.2.2, the equation for MDT can be tailored for use in provisioning and determining supply support requirements.

EXAMPLE 14.1 MEAN LOGISTICS DOWNTIME
Farmer Brown has a tractor that he repairs himself in order to save money. He can make most repairs to the tractor because he has a fairly well-equipped shop, but due to the high cost of parts, he does not keep many spare parts on hand.
He lives in a very remote area that is far from tractor dealers, so he orders any needed parts by telephone and has the parts shipped. Let the probability that the repair can be accomplished on-site, $P_{os}$, be 0.90; the probability that the necessary spare parts are on-site, given that a repair could be accomplished on-site, $P_{sos}$, be 0.40; the mean time to repair for on-site repair, $\mathrm{MTTR}_{os}$, be 6.5 h; the mean administrative downtime for on-site repairs, $\mathrm{MADM}_{os}$, be 0.5 h; the average time required to obtain the longest lead time part from tractor dealers, RESUP, be 2.5 days; and the turnaround time to have the tractor shipped and repaired at an off-site repair facility, $\mathrm{TAT}_{off}$, be 0.5 months. If the tractor malfunctions, what is the average time it will be unavailable for use? Inserting these data into Equation 14.15,

$$\mathrm{MDT} = 0.9 \times [0.4 \times (6.5\ \text{h} + 0.5\ \text{h}) + (1 - 0.4) \times (6.5\ \text{h} + 0.5\ \text{h} + 2.5\ \text{days} \times 24\ \text{h/day})] + (1 - 0.9) \times (0.5\ \text{mo} \times 30\ \text{days/mo} \times 24\ \text{h/day}) = 74.7\ \text{h} = 3.1\ \text{days} \tag{14.17}$$
Thus, whenever Farmer Brown's tractor is malfunctioning, he can expect the tractor to be unavailable for 3.1 days, on average. Let us say that Farmer Brown has the money to buy every part he may ever need to repair the tractor. The term $P_{sos}$ then becomes equal to 1.00 and the MDT is given by

$$\mathrm{MDT} = 0.9 \times [1.0 \times (6.5\ \text{h} + 0.5\ \text{h}) + (1 - 1.0) \times (6.5\ \text{h} + 0.5\ \text{h} + 2.5\ \text{days} \times 24\ \text{h/day})] + (1 - 0.9) \times (0.5\ \text{mo} \times 30\ \text{days/mo} \times 24\ \text{h/day}) = 42.3\ \text{h} = 1.8\ \text{days} \tag{14.18}$$
Instead of the tractor being unavailable for 3 days, it is unavailable for just under 2 days. Is stocking every part a good investment for Farmer Brown? Let us suppose that Farmer Brown's tractor exhibits a mean time between breakdowns (MTBM) of 75 days and an MDT of 1.8 days; what is his operational availability?

$$A_o = \frac{\mathrm{MTBM}}{\mathrm{MTBM} + \mathrm{MDT}} = \frac{75\ \text{days}}{75\ \text{days} + 1.8\ \text{days}} = 0.976 \tag{14.19}$$
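The arithmetic of Example 14.1 can be checked with a short Python sketch of Equation 14.15 and the availability ratio. The function and variable names are illustrative choices, not part of the handbook; all times are converted to hours before use.

```python
def mean_downtime(p_os, p_sos, mttr_os, madm_os, resup, tat_off):
    """Equation 14.15, with every time argument expressed in hours."""
    on_site = (p_sos * (mttr_os + madm_os)
               + (1 - p_sos) * (mttr_os + madm_os + resup))
    return p_os * on_site + (1 - p_os) * tat_off


def operational_availability(mtbm, mdt):
    """Availability ratio MTBM / (MTBM + MDT); both in the same time unit."""
    return mtbm / (mtbm + mdt)


RESUP_H = 2.5 * 24          # 2.5 days of resupply time, in hours
TAT_OFF_H = 0.5 * 30 * 24   # 0.5 month of off-site turnaround, in hours

mdt_few_spares = mean_downtime(0.90, 0.40, 6.5, 0.5, RESUP_H, TAT_OFF_H)
mdt_all_spares = mean_downtime(0.90, 1.00, 6.5, 0.5, RESUP_H, TAT_OFF_H)
a_o = operational_availability(75 * 24, mdt_all_spares)

print(f"MDT with few spares on hand:  {mdt_few_spares:.1f} h")
print(f"MDT with every spare on hand: {mdt_all_spares:.1f} h")
print(f"Operational availability:     {a_o:.3f}")
```

Because the sketch carries the unrounded MDT of 42.3 h rather than the rounded 1.8 days, the availability it reports can differ from the hand calculation in the third decimal place.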
14.3.2 Supply Support: Provisioning of Repair Parts and Consumables

Harris (1915) is generally credited with the earliest published derivation of an inventory model. Raymond (1931) published the first textbook dedicated to the subject. World War II prompted the military to support studies in this area in an effort to optimize the procurement and stocking of spare (black boxes) and consumable repair parts, subject to such constraints as cost, weight, volume, or combined factors. Retailers and manufacturers embraced inventory theory in the 1950s as a means to
reduce costs and overstockages while at the same time minimizing outstockages and backorders. In this section, the discussion of inventory models will be limited to economic order quantity (EOQ), subject to no-shortage and shortage-allowed conditions, and to provisioning in terms of modeling a standby redundant product. The EOQ problem uses a deterministic approach for its solution, while the standby model adopts a stochastic approach. The emphasis here will be on the effect of reliability on the demand for supply support. The reader interested in inventory models is referred to Sivazlian and Stanfel (1975), Hillier and Lieberman (1970), and Goldman and Slattery (1967).

14.3.2.1 Optimal Reorder Quantity

One of the simplest inventory models considers a case that assumes that a product is drawn from inventory at a constant demand rate (DEM) and that the inventory stock is replenished periodically in equal amounts (REPLEN). It also assumes that shortages are not allowed. The costs associated with establishing and maintaining the inventory include:

• SETUP: the cost to set up the inventory at the start of a time period;
• UNIT: the unit production cost; and
• HOLD: the inventory holding cost per unit.

The cost associated with placing an order is given by

$$\mathrm{COSTORD} = \mathrm{SETUP} + \mathrm{UNIT} \times \mathrm{REPLEN} \tag{14.20}$$

The holding cost per time period is given by

$$\mathrm{HOLDCOST} = \mathrm{HOLD} \times \int_0^{\mathrm{REPLEN}/\mathrm{DEM}} (\mathrm{REPLEN} - \mathrm{DEM} \times T)\, dT = \frac{\mathrm{HOLD} \times \mathrm{REPLEN}^2}{2 \times \mathrm{DEM}} \tag{14.21}$$

The total cost per time period is given by

$$\mathrm{COSTPER} = \mathrm{SETUP} + \mathrm{UNIT} \times \mathrm{REPLEN} + \frac{\mathrm{HOLD} \times \mathrm{REPLEN}^2}{2 \times \mathrm{DEM}} \tag{14.22}$$

The total cost per unit of time is given by

$$\mathrm{TOTCOST} = \frac{\mathrm{COSTPER}}{\mathrm{REPLEN}/\mathrm{DEM}} = \frac{\mathrm{DEM} \times \mathrm{SETUP}}{\mathrm{REPLEN}} + \mathrm{UNIT} \times \mathrm{DEM} + \frac{\mathrm{HOLD} \times \mathrm{REPLEN}}{2} \tag{14.23}$$
The optimal order size, REPLEN*, is found by taking the first derivative of the total cost with respect to order size and setting the resulting expression equal to zero. The optimal order size is thus

$$\mathrm{REPLEN}^* = (2 \times \mathrm{DEM} \times \mathrm{SETUP}/\mathrm{HOLD})^{1/2} \tag{14.24}$$

The average or expected time required to expend the quantity of parts given previously, TIME*, is

$$\mathrm{TIME}^* = \mathrm{REPLEN}^*/\mathrm{DEM} = [2 \times \mathrm{SETUP}/(\mathrm{HOLD} \times \mathrm{DEM})]^{1/2} \tag{14.25}$$
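The differentiation step can be written out explicitly. Starting from the total cost per unit of time (Equation 14.23) and treating REPLEN as a continuous variable, which is a simplifying assumption, only the first and third terms depend on the order size:

$$\frac{d\,\mathrm{TOTCOST}}{d\,\mathrm{REPLEN}} = -\frac{\mathrm{DEM} \times \mathrm{SETUP}}{\mathrm{REPLEN}^2} + \frac{\mathrm{HOLD}}{2} = 0 \quad\Longrightarrow\quad \mathrm{REPLEN}^2 = \frac{2 \times \mathrm{DEM} \times \mathrm{SETUP}}{\mathrm{HOLD}}$$

$$\frac{d^2\,\mathrm{TOTCOST}}{d\,\mathrm{REPLEN}^2} = \frac{2 \times \mathrm{DEM} \times \mathrm{SETUP}}{\mathrm{REPLEN}^3} > 0$$

The positive second derivative indicates that the stationary point is a minimum. The term UNIT × DEM does not depend on REPLEN, which is why the unit cost does not appear in Equation 14.24.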
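Consider the case where shortages are allowed. Let STOCK denote the stock on hand at the beginning of a period. The holding cost per period, HOLDCOST, is given by

$$\mathrm{HOLDCOST} = \frac{\mathrm{HOLD} \times \mathrm{STOCK}^2}{2 \times \mathrm{DEM}} \tag{14.26}$$

For a given shortage penalty cost, SHOR$, the shortage cost per period is given by

$$\mathrm{SHORT} = \frac{\mathrm{SHOR\$} \times (\mathrm{REPLEN} - \mathrm{STOCK})^2}{2 \times \mathrm{DEM}} \tag{14.27}$$

The total cost per period is found from the following equation:

$$\mathrm{COSTPER} = \mathrm{SETUP} + \mathrm{UNIT} \times \mathrm{REPLEN} + \frac{\mathrm{HOLD} \times \mathrm{STOCK}^2}{2 \times \mathrm{DEM}} + \frac{\mathrm{SHOR\$} \times (\mathrm{REPLEN} - \mathrm{STOCK})^2}{2 \times \mathrm{DEM}} \tag{14.28}$$

The total cost per unit of time is given by

$$\mathrm{TOTCOST} = \frac{\mathrm{COSTPER}}{\mathrm{REPLEN}/\mathrm{DEM}} = \frac{\mathrm{DEM} \times \mathrm{SETUP}}{\mathrm{REPLEN}} + \mathrm{UNIT} \times \mathrm{DEM} + \frac{\mathrm{HOLD} \times \mathrm{STOCK}^2}{2 \times \mathrm{REPLEN}} + \frac{\mathrm{SHOR\$} \times (\mathrm{REPLEN} - \mathrm{STOCK})^2}{2 \times \mathrm{REPLEN}} \tag{14.29}$$

In order to find the optimal reorder size, REPLEN*, and optimal initial stock, STOCK*, the first partial derivatives of TOTCOST with respect to reorder size and initial stock size must be found. These derivatives are set equal to zero, and the two resulting equations are solved simultaneously to yield the following:

$$\mathrm{REPLEN}^* = \left[\frac{2 \times \mathrm{DEM} \times \mathrm{SETUP} \times (\mathrm{SHOR\$} + \mathrm{HOLD})}{\mathrm{HOLD} \times \mathrm{SHOR\$}}\right]^{1/2}$$

$$\mathrm{STOCK}^* = \left\{\frac{2 \times \mathrm{DEM} \times \mathrm{SETUP} \times \mathrm{SHOR\$}}{\mathrm{HOLD} \times (\mathrm{SHOR\$} + \mathrm{HOLD})}\right\}^{1/2} \tag{14.30}$$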
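The average or expected time to expend the reordered parts is given by

$$\mathrm{TIME}^* = \mathrm{REPLEN}^*/\mathrm{DEM} = \left[\frac{2 \times \mathrm{SETUP} \times (\mathrm{SHOR\$} + \mathrm{HOLD})}{\mathrm{DEM} \times \mathrm{HOLD} \times \mathrm{SHOR\$}}\right]^{1/2} \tag{14.31}$$

The average or expected fraction of each cycle during which stock is on hand (that is, no shortage exists) is given by

$$\mathrm{SHORTIME}^* = \mathrm{STOCK}^*/\mathrm{REPLEN}^* = \frac{\mathrm{SHOR\$}}{\mathrm{SHOR\$} + \mathrm{HOLD}} \tag{14.32}$$

A shortage may therefore exist for the remaining fraction of the cycle, HOLD/(SHOR$ + HOLD). Example 14.2 illustrates the use of the EOQ equations for the cases of no shortages allowed and shortages allowed.

EXAMPLE 14.2 ECONOMIC ORDER QUANTITY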
You are the inventory manager and buyer for Good-Gas Petroleum Company. Upper management is unhappy about the delays in making deliveries and the costs associated with the regulator valves installed in delivery trucks. You are tasked to determine the optimal number of valves to maintain in inventory and how many to order to minimize the company's cost. Based on historical data, you determine the following: SETUP = $100; HOLD = $5 per unit; UNIT = $500 per unit; DEM = five units per month. Using Equation 14.24,

$$\mathrm{REPLEN}^* = (2 \times \mathrm{DEM} \times \mathrm{SETUP}/\mathrm{HOLD})^{1/2} = [2 \times (5\ \text{units/month}) \times (\$100)/(\$5\ \text{per unit})]^{1/2} = 14.1\ \text{units} \tag{14.33}$$

The average or expected time required to expend the quantity of parts given earlier is given by Equation 14.25:

$$\mathrm{TIME}^* = [2 \times \mathrm{SETUP}/(\mathrm{HOLD} \times \mathrm{DEM})]^{1/2} = \{2 \times (\$100)/[(\$5\ \text{per unit}) \times (5\ \text{units/month})]\}^{1/2} = 2.8\ \text{months} \tag{14.34}$$

Upper management suggests that a few deliveries could be delayed if inventory costs were reduced. A value of $10 per shortage is suggested. Using Equation 14.30,

$$\mathrm{REPLEN}^* = \left[\frac{2 \times \mathrm{DEM} \times \mathrm{SETUP} \times (\mathrm{SHOR\$} + \mathrm{HOLD})}{\mathrm{HOLD} \times \mathrm{SHOR\$}}\right]^{1/2} = \left\{\frac{2 \times (5\ \text{units/month}) \times (\$100) \times [(\$10/\text{shortage}) + (\$5/\text{unit})]}{(\$5/\text{unit}) \times (\$10/\text{shortage})}\right\}^{1/2} = 17.3\ \text{units}$$

$$\mathrm{STOCK}^* = \left\{\frac{2 \times \mathrm{DEM} \times \mathrm{SETUP} \times \mathrm{SHOR\$}}{\mathrm{HOLD} \times (\mathrm{SHOR\$} + \mathrm{HOLD})}\right\}^{1/2} = \left\{\frac{2 \times (5\ \text{units/month}) \times (\$100) \times (\$10/\text{shortage})}{(\$5/\text{unit}) \times [(\$10/\text{shortage}) + (\$5/\text{unit})]}\right\}^{1/2} = 11.5\ \text{units} \tag{14.35}$$
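The average or expected time to expend the reordered parts is given by Equation 14.31:

$$\mathrm{TIME}^* = \mathrm{REPLEN}^*/\mathrm{DEM} = (17.3\ \text{units})/(5\ \text{units/month}) = 3.5\ \text{months} \tag{14.36}$$

Equation 14.32 gives the average or expected fraction of each cycle during which stock is on hand:

$$\mathrm{SHORTIME}^* = \mathrm{STOCK}^*/\mathrm{REPLEN}^* = (11.5\ \text{units})/(17.3\ \text{units}) = 0.7 \tag{14.37}$$

A shortage may thus exist for roughly the remaining 30% of each cycle.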
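The EOQ results of Example 14.2 can be reproduced with a short Python sketch of Equations 14.24, 14.25, 14.30, and 14.31. The function names are illustrative and are not taken from the handbook.

```python
import math


def eoq_no_shortage(dem, setup, hold):
    """Equations 14.24 and 14.25: optimal order size and time to expend it."""
    replen = math.sqrt(2 * dem * setup / hold)
    return replen, replen / dem


def eoq_with_shortage(dem, setup, hold, shor):
    """Equations 14.30 and 14.31: order size, initial stock, cycle time."""
    replen = math.sqrt(2 * dem * setup * (shor + hold) / (hold * shor))
    stock = math.sqrt(2 * dem * setup * shor / (hold * (shor + hold)))
    return replen, stock, replen / dem


# Data of Example 14.2: DEM = 5 units/month, SETUP = $100, HOLD = $5/unit.
q, t = eoq_no_shortage(dem=5, setup=100, hold=5)
q_s, stock_s, t_s = eoq_with_shortage(dem=5, setup=100, hold=5, shor=10)

print(f"No shortages:      REPLEN* = {q:.1f} units, TIME* = {t:.1f} months")
print(f"Shortages allowed: REPLEN* = {q_s:.1f} units, "
      f"STOCK* = {stock_s:.1f} units, TIME* = {t_s:.1f} months")
print(f"STOCK*/REPLEN* = {stock_s / q_s:.2f}")
```

Allowing shortages raises the order size and lengthens the cycle, since part of each order is used to clear backorders rather than being held in stock.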
14.3.2.2 Spares' Availability and Provisioning

For some enterprises, shortages in on-hand inventory can be tolerated. The shortage unit cost, SHOR$, can be used to represent either lost profit or a cost factor related to both lost profit and loss of customer goodwill. In some operational environments, the cost of spares' shortage cannot be easily assessed. For this case, a commonly used approach to provisioning modeling is to consider the problem in the context of a product that incorporates standby redundancy. It was previously shown that the cumulative Poisson equation can be used to model a standby redundant product if it is assumed that the failure rate remains constant with respect to time. This is given by

$$P(X \le x) = \sum_{i=0}^{x} \frac{(\lambda \times \mathrm{TIME})^i \times \exp(-\lambda \times \mathrm{TIME})}{i!} \tag{14.38}$$

To serve as a provisioning model, Equation 14.38 is rewritten as follows:

$$P_s(X \le S_{in}) = \sum_{i=0}^{S_{in}} \frac{(\mathrm{DEM}_{in})^i \times \exp(-\mathrm{DEM}_{in})}{i!} \tag{14.39}$$

where Ps(X ≤ Sin) is the probability that the number of demands (X) will be equal to or less than the number of spares in the inventory, that is, the probability that a spare will be available if needed; Sin is the number of spares in the inventory; and DEMin is the expected demand during a given period of time. As shown in Equation 14.13, the expected demand can be stated as simply the failure rate of the product (λ) multiplied by the number of applications (Nl) and the average operating time per application (OPTIME), or

$$\mathrm{DEM}_{in} = N_l \times \lambda \times \mathrm{OPTIME} \tag{14.40}$$
Alternatively, Equation 14.40 can be written as a function of a MAR that may include CNDs or maintenance events other than true failures. Then,

$$\mathrm{DEM}_{in} = N_l \times \mathrm{MAR} \times \mathrm{OPTIME} \tag{14.41}$$
The expected demand can also be stated as a function of the product's failure or maintenance action rate and the MDT:

$$\mathrm{DEM}_{in} = \mathrm{MAR} \times N_l \times \mathrm{UTIL} \times \mathrm{MDT} \tag{14.42}$$

For the supply support case, the MDT is calculated for a spare product and not the supported system or subsystem:

$$\mathrm{MDT}_i = \{P_{os} \times [P_{sos} \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os}) + (1 - P_{sos}) \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os} + \mathrm{RESUP})] + (1 - P_{os}) \times \mathrm{TAT}_{off}\} \tag{14.43}$$

In Equation 14.43, $\mathrm{MTTR}_{os}$ represents the mean time required to repair the failed assembly, subassembly, or module on-site; the term RESUP represents the time to obtain a part to repair the ith product, if the required part is not available on-site; and $\mathrm{TAT}_{off}$ represents the time to obtain a replacement product if the failed product cannot be repaired on-site. Example 14.3 illustrates the use of Equations 14.39, 14.42, and 14.43 as a provisioning model.

EXAMPLE 14.3 BASIC PROVISIONING PROBLEM
You are employed by Lightning-Overnite Delivery. As part of your job, you are tasked with provisioning a new fleet of 150 trucks used for deliveries and pickup of small packages. Because a fixed pickup schedule is not followed, each truck is equipped with a two-way radio so that drivers can be instructed to stop at various locations. If the radio becomes inoperable, the affected truck cannot be used. Management wants to know how many spare radios should be stocked in order to be 95% confident that a truck will leave the company's facility with an operable radio. Data obtained from the manufacturer of the radio and from historical maintenance data in your company files provide the following information: MAR = 0.0002 removals/ophr; $P_{os}$ = 0.10; $P_{sos}$ = 0.50; $\mathrm{MTTR}_{os}$ = 2.5 h; $\mathrm{MADM}_{os}$ = 0.5 h; RESUP = 3 days; $\mathrm{TAT}_{off}$ = 14 days; UTIL = 10 ophrs/day. The MDT due to radio failure is

$$\mathrm{MDT} = 0.10 \times [0.50 \times (2.5\ \text{h} + 0.5\ \text{h}) + (1 - 0.50) \times (2.5\ \text{h} + 0.5\ \text{h} + 3\ \text{days} \times 24\ \text{h/day})] + (1 - 0.10) \times (14\ \text{days} \times 24\ \text{h/day}) = 306.3\ \text{h} \tag{14.44}$$
The expected demand during the period MDT is given by

$$\mathrm{DEM} = (0.0002\ \text{removals/ophr}) \times 150\ \text{trucks} \times 1\ \text{radio/truck} \times (10\ \text{ophrs/day} \times 1\ \text{day}/24\ \text{h}) \times (306.3\ \text{h}) = 3.8\ \text{radio failures} \tag{14.45}$$
Table 14.1 Probability of Having a Required Spare

S     Ps
0     0.02
1     0.10
2     0.26
3     0.47
4     0.66
5     0.81
6     0.91
7     0.96
The number of spare radios required is given by

$$P_s(X \le S) = 0.95 \le \sum_{i=0}^{S} \frac{(3.8)^i \times \exp(-3.8)}{i!} \tag{14.46}$$
The results of iterating the preceding equation are given in Table 14.1. The analysis indicates that the company can expect fewer than four failures, on average, during the time it takes to repair or remove and replace a failed radio at an off-site repair shop and receive a replacement. However, the company must stock seven units in order to satisfy the confidence requirement imposed by management.
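The iteration behind Table 14.1 can be sketched in Python as follows. The helper names are illustrative rather than the handbook's; the sketch carries the unrounded demand (about 3.83) instead of the rounded value of 3.8, which does not change the stocking decision.

```python
import math


def poisson_cdf(s, dem):
    """Equation 14.39: probability that demand does not exceed s spares."""
    return sum(dem**i * math.exp(-dem) / math.factorial(i) for i in range(s + 1))


def spares_required(dem, target):
    """Smallest stock level whose sufficiency meets the target probability."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    s = 0
    while poisson_cdf(s, dem) < target:
        s += 1
    return s


# Example 14.3 inputs, with all times in hours.
mdt_hours = (0.10 * (0.50 * (2.5 + 0.5) + 0.50 * (2.5 + 0.5 + 3 * 24))
             + 0.90 * (14 * 24))
dem = 0.0002 * 150 * 1 * (10 / 24) * mdt_hours   # Equation 14.42
spares = spares_required(dem, 0.95)

print(f"MDT = {mdt_hours:.1f} h, expected demand = {dem:.2f} radios")
for s in range(spares + 1):
    print(f"S = {s}: Ps = {poisson_cdf(s, dem):.2f}")
print(f"Spares required for 95% sufficiency: {spares}")
```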
14.3.2.3 Provisioning a Product Composed of Replaceable Parts

Example 14.3 is representative of a top-down approach to provisioning. The analyst determines the number of spare products required by assuming a stocking level at the next lower level of assembly (i.e., the repair parts may be subassemblies, modules, products, or a combination of these). In order to provision the subassemblies, modules, and products, the analyst assumes an availability of repair parts and a supply response time. Products at the next lower level of assembly must be provisioned so that the aggregate probability of having a repair part available is equal to or greater than the value used in the model for the next higher level of assembly. In a bottom-up approach, the logistics analyst determines the inventory levels to be maintained at the lowest level of assembly and works upwards. At each potential stockage point, the calculation of the MDT and stockage quantity is based upon the probability of spares' sufficiency determined at the previous (lower) level of support. The spares' sufficiency, spares' availability, or probability of having a required spare is

$$P_{S,sys} = \prod_{k=1}^{M} P_{s,k} \tag{14.47}$$

where

$$P_{s,k} = \sum_{j=0}^{S_k} \frac{\mathrm{DEM}_k^{\,j} \times \exp(-\mathrm{DEM}_k)}{j!} \tag{14.48}$$
is the spares' sufficiency for a unit with $S_k$ spares, and $\mathrm{DEM}_k$ is the expected demand for the kth unit at the next lower level of assembly. The probability of having a spare or repair part, $P_{sos}$, for use in determining the MDT is not the same as $P_{s,k}$. The definition of $P_{sos}$ is the probability that a spare can be drawn from an on-site inventory, given that a failure has occurred. To determine the value of $P_{sos}$, the following equation can be used:

$$P_{sos} = P_{sc} \times P_s \tag{14.49}$$

where $P_{sc}$ is the probability that the required part is carried in the on-site inventory, and $P_s$ is the probability that the required part is in stock, given that it is carried in the on-site inventory. Equations for $P_{sc}$ and $P_s$ are given by

$$P_{sc} = \frac{\sum_{i=1}^{k} N_i \times \lambda_i}{\lambda_T} \tag{14.50}$$

$$P_s = \frac{\sum_{i=1}^{k} N_i \times \lambda_i \times P_{s,i}}{\sum_{i=1}^{k} N_i \times \lambda_i} \tag{14.51}$$

where $\lambda_i$ is the failure rate of the ith part that is carried in the on-site inventory; $N_i$ is the number of applications for the ith part; $\lambda_T$ is the total failure rate for the unit, assembly, or subassembly that is being provisioned, considering both carried and not carried parts; $P_{s,i}$ is the probability of the sufficiency of the ith part; and

$$P_{s,i} = \sum_{j=0}^{S_i} \frac{\mathrm{DEM}_i^{\,j} \times \exp(-\mathrm{DEM}_i)}{j!} \tag{14.52}$$

where $S_i \ge 1$. Therefore, $P_{sos}$ can be written as

$$P_{sos} = \frac{\sum_{i=1}^{k} N_i \times \lambda_i \times P_{s,i}}{\lambda_T} \tag{14.53}$$

Example 14.4 illustrates the use of this equation.
EXAMPLE 14.4 PROVISIONING OF A PRODUCT CONSISTING OF REPLACEABLE PARTS

Suppose a product consists of eight assemblies with the failure rates, expected demand, and probability of sparing sufficiency given in Table 14.2. From these data, $\lambda_T$ = 0.0188 and $P_{sos}$ = 0.9466.

$$P_{sos} = (1/0.0188) \times (0.0007 \times 0.9977 + 0.0028 \times 0.9970 + 0.001 \times 0.9998 + 0.006 \times 0.9769 + 0.0075 \times 0.9927) = 0.9466 \tag{14.54}$$
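The calculation in Example 14.4 can be sketched in Python using the data of Table 14.2. The sketch assumes one application per assembly (Ni = 1), as the worked equation implies, and it recomputes each Ps,i from the Poisson sum rather than using the rounded table values; the variable names are illustrative.

```python
import math


def poisson_cdf(s, dem):
    """Equation 14.52: sufficiency of s spares against expected demand dem."""
    return sum(dem**j * math.exp(-dem) / math.factorial(j) for j in range(s + 1))


# (failure rate, expected demand D_i, carried on-site?, spares S_i), Table 14.2
assemblies = [
    (0.0007, 0.07, True, 1),
    (0.0028, 0.28, True, 2),
    (0.0005, 0.05, False, 0),
    (0.0002, 0.02, False, 0),
    (0.0010, 0.10, True, 2),
    (0.0060, 0.60, True, 2),
    (0.0001, 0.01, False, 0),
    (0.0075, 0.75, True, 3),
]

lam_total = sum(lam for lam, _, _, _ in assemblies)
carried = [(lam, poisson_cdf(s, d)) for lam, d, is_carried, s in assemblies if is_carried]
lam_carried = sum(lam for lam, _ in carried)

p_sc = lam_carried / lam_total                               # Equation 14.50
p_s = sum(lam * p for lam, p in carried) / lam_carried       # Equation 14.51
p_sos = p_sc * p_s                                           # Equation 14.49

print(f"lambda_T = {lam_total:.4f}")
print(f"Psc = {p_sc:.4f}, Ps = {p_s:.4f}, Psos = {p_sos:.4f}")
```

Parts that are not carried contribute to the total failure rate but never to the numerator, which is why Psos is capped by Psc no matter how deeply the carried parts are stocked.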
Table 14.2 Data for Example 14.4

Assembly   λi       Di     Carried   Si   Ps,i (if Si ≥ 1)
1          0.0007   0.07   Yes       1    0.9977
2          0.0028   0.28   Yes       2    0.9970
3          0.0005   0.05   No        0    0.0000
4          0.0002   0.02   No        0    0.0000
5          0.0010   0.10   Yes       2    0.9998
6          0.006    0.6    Yes       2    0.9769
7          0.0001   0.01   No        0    0.0000
8          0.0075   0.75   Yes       3    0.9927

Many maintenance and supply products are multi-echelon. The level at which maintenance is performed is based on the design for maintainability and testability of the product, doctrine, and economics. The product supply responds to the maintenance concept by providing the range and depth of spare and repair parts needed to maintain the product at each applicable level of repair. For a military product, the depot or an inventory control point acts as the highest level of maintenance and inventory management and physical stocking of spare and repair parts. The depot buys spares and repair parts from either in-house sources or civilian manufacturers, maintains physical inventory control, and distributes required spares and repair parts to lower echelons. A program manager or a designated supply support specialist determines the range and depth of spares and repair parts to be procured, carried in inventory, and distributed to lower echelons for use as on-site inventory. A provisioning model must be developed for each echelon and used to compute the stockage quantity of each potentially repairable or replaceable part. Either a multi-echelon model or separate provisioning models must be developed for repairable and consumable products. Representative models for repairable and consumable products can be written as follows. The MDT for an on-site (intermediate) repairable product is

$$\mathrm{MDT}_{I,i} = \{P_{os,I} \times [P_{sos,I} \times (\mathrm{MTTR}_{os,I} + \mathrm{MADM}_{os,I}) + (1 - P_{sos,I}) \times (\mathrm{MTTR}_{os,I} + \mathrm{MADM}_{os,I} + \mathrm{RESUP}_I)] + (1 - P_{os,I}) \times \mathrm{TAT}_{off,I}\}_i \tag{14.55}$$

The MDT for an off-site repairable or consumable product is

$$\mathrm{MDT}_{I,i} = \mathrm{RESUP}_{I,i} \tag{14.56}$$

The on-site demand and stocking level are given by

$$\mathrm{DEM}_{I,i} = D_{I,i} = \mathrm{MAR}_i \times N_i \times \mathrm{UTIL}_i \times \mathrm{MDT}_{I,i} \tag{14.57}$$

$$P_{s,I,i} = \sum_{j=0}^{S_{I,i}} \frac{\mathrm{DEM}_{I,i}^{\,j} \times \exp(-\mathrm{DEM}_{I,i})}{j!} \tag{14.58}$$
At the depot or a major inventory control point, $\mathrm{MDT}_{D,i}$ is computed by

$$\mathrm{MDT}_{D,i} = \{(1 - Z_i) \times [P_{sos,D} \times (\mathrm{MTTR}_{os,D} + \mathrm{MADM}_{os,D}) + (1 - P_{sos,D}) \times (\mathrm{MTTR}_{os,D} + \mathrm{MADM}_{os,D} + \mathrm{REORDER}_{com})] + Z_i \times \mathrm{REORDER}_{rep}\}_i \tag{14.59}$$

The MDT for an off-site repairable product or a consumable product is given by

$$\mathrm{MDT}_{D,i} = \mathrm{REORDER}_{com,i} \tag{14.60}$$

where the subscript D represents the depot or manufacturer level of repair, as appropriate; $Z_i$ is the condemnation rate of a repairable product, that is, the probability that a repairable product inducted into the depot cannot be repaired for any reason; $\mathrm{REORDER}_{com}$ is the average time required to order and receive a repair part from a manufacturer; and $\mathrm{REORDER}_{rep}$ is the average time required to order and receive a new repairable product from a manufacturer. The depot demand and stocking levels are found as follows:

$$\mathrm{DEM}_{D,i} = D_{D,i} = M_{sites} \times \mathrm{MAR}_i \times N_i \times \mathrm{UTIL}_i \times \mathrm{MDT}_{D,i}$$

$$P_{s,D,i} = \sum_{j=0}^{S_{D,i}} \frac{\mathrm{DEM}_{D,i}^{\,j} \times \exp(-\mathrm{DEM}_{D,i})}{j!} \tag{14.61}$$

where $M_{sites}$ is the number of operating sites being supported by the depot. Examples 14.5 and 14.6 illustrate the use of these equations.
14.3.2.4 Spares' Optimization

A provisioning list can be very easily optimized for a single variable. A dynamic programming approach can be used by employing the following procedure:

• For each level of repair and for each assembly, subassembly, and product, determine the expected demand (DEM R,i) subject to:
  • level of repair
    - on-site sparing
    - depot or inventory control point
  • level of assembly
    - product or line replaceable unit
    - assembly or shop replaceable unit
    - subassembly or modules
    - products and piece parts
• Determine the unit cost of each product (Ci).
• For a specific level of repair and level of assembly (e.g., on-site sparing of a product), compute

$$P_{s,sys} = \prod_{i=1}^{M} \sum_{j=0}^{S_i} \frac{\mathrm{DEM}_i^{\,j} \times \exp(-\mathrm{DEM}_i)}{j!} \tag{14.62}$$

  where DEMi is the expected demand for the ith unit at the next lower level of assembly.
• For each product, i, determine the change in shortage risk by incrementing the inventory by one, from Si to Si + 1:

$$\mathrm{DELS}_i = \frac{\mathrm{DEM}_i^{\,S_i+1} \times \exp(-\mathrm{DEM}_i)}{(S_i+1)!} \tag{14.63}$$

• For each product, determine the change in shortage risk per dollar expended:

$$\mathrm{DELS\$}_i = \mathrm{DELS}_i / C_i \tag{14.64}$$

• Select the product with the highest DELS$i to have its inventory be incremented by one.
• Continue the process until either a dollar constraint has been reached or the desired level of PS,sys has been attained.
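The procedure above can be sketched in Python as a greedy marginal-analysis loop. This is a minimal illustration under stated assumptions: the function names, the two-item demand and cost data, and the 0.90 target are hypothetical, and the increment for each item is taken as the Poisson probability of exactly Si + 1 demands, which is the gain in that item's sufficiency from adding one spare.

```python
import math


def poisson_pmf(k, dem):
    """Probability of exactly k demands when the expected demand is dem."""
    return dem**k * math.exp(-dem) / math.factorial(k)


def poisson_cdf(s, dem):
    """Sufficiency of s spares against expected demand dem."""
    return sum(poisson_pmf(j, dem) for j in range(s + 1))


def system_sufficiency(stock, demands):
    """Equation 14.62: product of the item-level sufficiencies."""
    return math.prod(poisson_cdf(s, d) for s, d in zip(stock, demands))


def greedy_spares(demands, costs, target, budget=float("inf")):
    """Add one spare at a time to the item with the best gain per dollar."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    stock = [0] * len(demands)
    spent = 0.0
    while system_sufficiency(stock, demands) < target:
        gains = [poisson_pmf(s + 1, d) / c
                 for s, d, c in zip(stock, demands, costs)]
        best = max(range(len(demands)), key=gains.__getitem__)
        if spent + costs[best] > budget:
            break  # dollar constraint reached before the target
        stock[best] += 1
        spent += costs[best]
    return stock, spent


# Hypothetical two-item provisioning list.
demands = [0.5, 1.0]
costs = [100.0, 400.0]
stock, spent = greedy_spares(demands, costs, target=0.90)
print(f"Stock levels: {stock}, cost: ${spent:,.0f}, "
      f"Ps,sys = {system_sufficiency(stock, demands):.3f}")
```

The loop stops at the first stock mix that meets the target, so it does not guarantee the cheapest possible mix in every case; it simply follows the selection rule stated in the procedure.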
How does product reliability or a change in reliability affect provisioning? Example 14.5 gives the case of a proposed improvement in the reliability of a product and the case in which field reliability is less than predicted.

EXAMPLE 14.5 USING THE PROVISIONING MODEL AS PART OF A RELIABILITY TRADE-OFF ANALYSIS

As a repair and maintenance (R&M) engineer, you have been asked to evaluate an engineering change proposal. The proposal is for the upgrade of products used in a "ruggedized" communications radio. Part of the assigned task is to assess the change in required spare radios due to the improvement in reliability. The mean time between failures (MTBF) of the current product is 1,500 operating hours. A 30% improvement in reliability is expected if the modification is adopted, giving an improved MTBF of 1,950 operating hours. Let

$$\mathrm{MAR} = \frac{W \times \lambda_{sys}}{1 - \mathrm{CND}} \tag{14.65}$$
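where MAR is the maintenance action rate; W = 1.00; $\lambda_{sys,old}$ = 1/MTBF = 1/(1500 ophr/failure) = 0.0006667 failures/ophr; $\lambda_{sys,new}$ = 1/MTBF = 1/(1950 ophr/failure) = 0.0005128 failures/ophr; and CND = 0. Therefore, $\mathrm{MAR}_{old}$ = 0.0006667 failures/ophr and $\mathrm{MAR}_{new}$ = 0.0005128 failures/ophr. Let

$$\mathrm{DEM} = \mathrm{MAR} \times N_{sys} \times \mathrm{OPHRS} \times \mathrm{MLDT} \tag{14.66}$$

and

$$\mathrm{MLDT} = P_{os} \times [P_{sos} \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os}) + (1 - P_{sos}) \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os} + \mathrm{RESUP})] + (1 - P_{os}) \times \mathrm{TAT}_{off} \tag{14.67}$$

Let $N_{sys}$ = 30 products; OPHRS = 10 ophr/product/day; $P_{os}$ = 0.90; $P_{sos}$ = 0.90; $\mathrm{MTTR}_{os}$ = 6.5 h; $\mathrm{MADM}_{os}$ = 0.5 h; RESUP = 2.5 days; $\mathrm{TAT}_{off}$ = 0.5 month. Then,

$$\mathrm{DEM}_{old} = (0.0006667\ \text{failures/ophr}) \times (30\ \text{systems}) \times (10\ \text{ophr/system-day}) \times (1\ \text{day}/24\ \text{h}) \times \{0.90 \times [0.90 \times (6.5\ \text{h} + 0.5\ \text{h}) + (1 - 0.9) \times (6.5\ \text{h} + 0.5\ \text{h} + 2.5\ \text{days} \times 24\ \text{h/day})] + (1 - 0.9) \times (0.5\ \text{mo} \times 30\ \text{days/mo} \times 24\ \text{h/day})\} = (0.0006667\ \text{failures/ophr}) \times (596.25\ \text{ophr}) = 0.396\ \text{failures} \tag{14.68}$$

and

$$\mathrm{DEM}_{new} = (0.0005128\ \text{failures/ophr}) \times (596.25\ \text{ophr}) = 0.306\ \text{failures} \tag{14.69}$$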
The sparing level is determined using the Poisson equation as follows: Assume the desired $P_S$ is 0.95. Then,

$$P_{S,old} = 0.95 \le \sum_{j=0}^{S_{old}} \frac{(\mathrm{DEM}_{old})^j \times \exp(-\mathrm{DEM}_{old})}{j!} \tag{14.70}$$

$$0.95 \le 0.673 + 0.266 + 0.053 = 0.992$$

so $S_{old}$ = 2. Similarly,

$$P_{S,new} = 0.95 \le \sum_{j=0}^{S_{new}} \frac{(\mathrm{DEM}_{new})^j \times \exp(-\mathrm{DEM}_{new})}{j!} \tag{14.71}$$

$$0.95 \le 0.736 + 0.225 = 0.961$$

and $S_{new}$ = 1. The proposed design change would save the cost of one spare. What is the savings if $P_S$ is 0.90? If $P_S$ is equal to 0.90, both the old and the new (improved design) products require one spare, so the change saves no spares.
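The trade-off in Example 14.5 can be rerun for any sufficiency target with a short Python sketch. The helper names are illustrative. Carrying unrounded values gives an old-design demand of about 0.3975 rather than the 0.396 used in the hand calculation; the sparing levels are unaffected.

```python
import math


def poisson_cdf(s, dem):
    """P(X <= s) for a Poisson demand with mean dem."""
    return sum(dem**j * math.exp(-dem) / math.factorial(j) for j in range(s + 1))


def spares_required(dem, target):
    """Smallest stock level whose Poisson sufficiency meets the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    s = 0
    while poisson_cdf(s, dem) < target:
        s += 1
    return s


# Example 14.5 inputs, with all times in hours (Equation 14.67).
mldt = (0.90 * (0.90 * (6.5 + 0.5) + 0.10 * (6.5 + 0.5 + 2.5 * 24))
        + 0.10 * (0.5 * 30 * 24))
exposure = 30 * (10 / 24) * mldt   # fleet operating hours accumulated during MLDT

results = {}
for label, mtbf in (("old", 1500.0), ("new", 1950.0)):
    dem = exposure / mtbf          # Equation 14.66 with MAR = 1/MTBF
    results[label] = (dem, spares_required(dem, 0.95), spares_required(dem, 0.90))
    print(f"{label}: DEM = {dem:.3f}, spares at 0.95 = {results[label][1]}, "
          f"spares at 0.90 = {results[label][2]}")
```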
EXAMPLE 14.6 SPARING A REPLACEABLE ASSEMBLY

Consider a nonrepairable module with an MTBF of 4,000 operating hours, 15 operating sites, and one depot. Each operating site supports 50 aircraft. Each aircraft uses two modules and flies 100 h per month. The resupply time at the site level is 2 weeks. The lead time to acquire the module from the manufacturer is 15 months. What should the sparing level be at each operating site and at the depot, if the level of sparing sufficiency is 0.90? The expected demand at each operating site is found as follows:

$$\mathrm{DEM}_{opsite} = (1/\mathrm{MTBF}) \times \mathrm{NACFT} \times \mathrm{NUNITS} \times \mathrm{OPHR} \times \mathrm{RESUP} = (1/4000\ \text{ophr}) \times (50\ \text{aircraft/site}) \times (2\ \text{units/aircraft}) \times (100\ \text{ophr/aircraft-mo}) \times (2\ \text{wk} \times 1\ \text{mo}/4\ \text{wk}) = 1.25\ \text{failures/site} \tag{14.72}$$

Using the cumulative Poisson equation, the required sparing level to achieve a 0.90 sufficiency at an operating site is

$$P_S = 0.90 \le \sum_{j=0}^{S_{onsite}} \frac{(\mathrm{DEM}_{opsite})^j \times \exp(-\mathrm{DEM}_{opsite})}{j!} \tag{14.73}$$

$$0.90 \le 0.286 + 0.358 + 0.224 + 0.093 = 0.961$$

and $S_{onsite}$ = 3 spares/site. At the depot, the expected demand is

$$\mathrm{DEM}_{depot} = \mathrm{NSITES} \times (1/\mathrm{MTBF}) \times \mathrm{NACFT} \times \mathrm{NUNITS} \times \mathrm{OPHR} \times \mathrm{REORDER} = (15\ \text{sites}) \times (1/4000\ \text{ophr/failure}) \times (50\ \text{aircraft/site}) \times (2\ \text{units/aircraft}) \times (100\ \text{ophr/aircraft-mo}) \times (15\ \text{mo}) = 562.5\ \text{failures} \tag{14.74}$$

At the depot level, the module is a consumable or throw-away item. The depot must stock a sufficient quantity of the modules to meet the expected demand over the reorder period. Unless the cumulative Poisson equation has been programmed on a computer, it becomes tedious to compute the sparing level for an expected demand as large as the value in this example. The following equation can be used to derive an approximate value:

$$S = \mathrm{DEM} + z_\alpha \times (\mathrm{DEM})^{1/2} + z_\alpha^2/8 \tag{14.75}$$

where $z_\alpha$ is the normal variate for the desired sparing sufficiency. For example, if $P_s$ = 0.90, $z_\alpha$ = 1.28, then S = 594 for this example.
What is the economic order quantity? Assume the setup cost is $2,500 and the holding cost is $25 per spare. The economic order quantity, using Equation 14.24, is

$$\mathrm{REPLEN}^* = (2 \times \mathrm{DEM} \times \mathrm{SETUP}/\mathrm{HOLD})^{1/2} = (2 \times 594 \times \$2{,}500/\$25)^{1/2} = 345 \tag{14.76}$$

The expected time to expend the on-hand parts (i.e., the time between orders) is found by using Equation 14.25:

$$\mathrm{TIME}^* = \mathrm{REPLEN}^*/\mathrm{DEM} = 345/594 = 0.58 \tag{14.77}$$

In terms of calendar time,

$$\mathrm{TIME}^* = 0.58 \times (15\ \text{mo}) = 8.7\ \text{months} \tag{14.78}$$

The results in the example should be interpreted as follows:

• An initial spares' pool of 594 units must be procured.
• Immediately, an order should be placed for 345 units scheduled for delivery within 15 months of the initiation of aircraft operations.
• If field demand is the same as predicted demand, after 345 units are used (approximately 8.7 months), another order is placed for 345 units to be delivered within 15 months (month 23.7); if field demand differs significantly from predicted demand, the sparing level and EOQ must be recalculated and the procurement quantity adjusted accordingly.
• The process is repeated whenever 345 units are used or 8.7 months have elapsed since the last order, subject to statistically significant changes in the actual demand rate.
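The numbers in Example 14.6 can be reproduced with the Python sketch below. It assumes that both correction terms in the normal approximation (Equation 14.75) are added, which is consistent with the stated result of 594, and it obtains the normal variate from the standard library instead of a table. Function and variable names are illustrative.

```python
import math
from statistics import NormalDist


def poisson_cdf(s, dem):
    """P(X <= s) for a Poisson demand with mean dem."""
    return sum(dem**j * math.exp(-dem) / math.factorial(j) for j in range(s + 1))


def spares_required(dem, target):
    """Smallest stock level whose Poisson sufficiency meets the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    s = 0
    while poisson_cdf(s, dem) < target:
        s += 1
    return s


def normal_approx_spares(dem, sufficiency):
    """Equation 14.75 (signs assumed positive), rounded up to whole spares."""
    z = NormalDist().inv_cdf(sufficiency)
    return math.ceil(dem + z * math.sqrt(dem) + z**2 / 8)


failure_rate = 1 / 4000                                       # failures per ophr
dem_site = failure_rate * 50 * 2 * 100 * (2 / 4)              # 2-week resupply
s_site = spares_required(dem_site, 0.90)

dem_depot = 15 * failure_rate * 50 * 2 * 100 * 15             # 15-month lead time
s_depot = normal_approx_spares(dem_depot, 0.90)

replen = round(math.sqrt(2 * s_depot * 2500 / 25))            # Equation 14.24
months_between_orders = replen / s_depot * 15

print(f"Site:  DEM = {dem_site:.2f}, spares = {s_site}")
print(f"Depot: DEM = {dem_depot:.1f}, spares = {s_depot}")
print(f"EOQ = {replen} units, about {months_between_orders:.1f} months between orders")
```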
14.3.3 Manpower and Personnel: Staffing Levels

Table 14.3 describes the manpower tasks of a typical, multi-echelon corrective maintenance action. Each block on the diagram represents one or more maintenance, supply, manufacturing or fabrication, quality assurance, or other administrative logistics support tasks that must be accomplished to restore a failed end-product to operation and to repair or replace failed products removed from the end-product. Each action associated with a logistics support task requires a combination of technical skills and experience. The LSA identifies and documents these personnel requirements for organizational, intermediate, and depot maintenance. Using data from the maintainability analysis, the LSA also determines the time (typically clock-hours, but in some instances also man-hours) required to accomplish a given maintenance task. The RLA determines the maintenance tasks that will be accomplished at a given level of maintenance. In the ILS planning process, how does reliability affect manpower and personnel requirements? How can manpower and personnel requirements be estimated?
Table 14.3 Task Flow for a Multi-Echelon Logistics Support Product

Organizational Level
Failures originate at the organizational level and are isolated to a line replaceable unit (LRU). The faulty LRU is removed from the product and replaced with a spare LRU. The product is checked for proper operation. The faulty LRU is sent to the intermediate level base shop for repair.

Intermediate Level
At the intermediate level, the LRU is repaired by isolating to the faulty shop replaceable unit (SRU). The faulty SRU is removed and replaced with a spare SRU. The repaired LRU is checked for proper operation. Once the LRU is repaired, it is sent to the organizational level or to an inventory control or stockage point. If no fault is found, the LRU is also sent to the inventory control or stockage point. Occasionally, the LRU cannot be repaired by the intermediate level and it is sent to the depot for repair.

Depot
At the depot, the SRUs (and sometimes LRUs) are repaired by fault-isolating to components. The faulty component is removed and replaced. The SRU (or LRU) is checked for proper operation. Once the repair is complete, the repair unit is sent back to the intermediate or depot level inventory control or stockage point.
An estimate of the demand for manpower resources can be developed using an approach similar to that used to determine the expected demand for a spare or repair part. As part of a design and development program, the component, module, subassembly, assembly, unit, and product failure rates are determined and incorporated into the LSA as task frequency of occurrence. Supplemented by the man-hours to complete a maintenance task, task frequency of occurrence yields a man-hour per operating hour demand for specific skills and experience categories. Applying an operating profile (i.e., operating hours per unit per calendar period) and the number of units to be supported, the man-hours per calendar period for specific skill and experience categories can be found. The expected demand in man-hours per calendar period for the specific skill, experience level, or both required to support a given product is given by

$$\mathrm{MANTIME}_{ijkl} = M_{\mathrm{sites}} \times \mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MANTTR}_{ijkl}) \times \mathrm{UTIL}_i \tag{14.79}$$
where
MANTIME_ijkl is man-hours per calendar period of skill level “k” and experience level “l” required to support maintenance action “j” on the ith product;
M_sites is the number of operating sites supported by the maintenance facility;
MAR_ij is the maintenance action rate (actions/unit-operating-hour) for action “j” on product “i”;
NUNITS_i is the number of units of type “i” at each site;
MANTTR_ijkl is the mean man-hours of skill level “k” and experience level “l” expended per maintenance action “j” on the ith product; and
UTIL_i is the average utilization rate of the ith product, operating hours per unit per calendar period.
Note the similarity of Equation 14.79 to the equations developed for provisioning and supply support. In this case, the MDT term has been replaced by MANTTR_ijkl. Example 14.7 illustrates the use of this equation.

EXAMPLE 14.7 MANPOWER REQUIREMENTS
The removal and replacement of a starter motor on a small commuter or business jet requires the skill and experience of a licensed mechanic, a helper, and a quality assurance inspector. The task takes 2.5, 1.25, and 0.5 man-hours per removal action, respectively. The maintenance action rate for the starter is 25.0 removals per million aircraft-unit flight hours (i.e., 50.0 removals per million aircraft operating hours). There is only one maintenance site. Forty aircraft, each with two starters, are supported, for a total of eighty starters. Each aircraft averages 110 flight hours per month. What is the manpower utilization of each skill category per month and per year associated with removal and replacement of the starter? The expected man-hours to be expended for each skill level are given by

$$\begin{aligned}
\mathrm{MANTIME}_{ijkl} &= M_{\mathrm{sites}} \times \mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MANTTR}_{ijkl}) \times \mathrm{UTIL}_i \\
&= (1\ \text{site}) \times (25 \times 10^{-6}\ \text{removals/aircraft-unit flight hour}) \times (40\ \text{aircraft/site} \times 2\ \text{units/aircraft}) \\
&\quad \times (2.5 + 1.25 + 0.5) \times (110\ \text{flight hours/month/aircraft}) \\
&= (0.22\ \text{removals per month}) \times (4.25\ \text{man-hours/removal})
\end{aligned} \tag{14.80}$$
Table 14.4 provides the resulting MANTIMEijkl for this example. Note that the MAR for the starter was given in removals per aircraft-unit operating hour. If the MAR was given in terms of starter operating cycles or starter-on time, a conversion factor would be required in order to yield consistent units (i.e., removals per calendar time).
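The calculation in Example 14.7 can be reproduced with a short Python sketch of Equation 14.79. The function and variable names below are illustrative assumptions rather than notation from the handbook; the input values are those stated in the example.

```python
def man_hours_per_period(m_sites, mar, n_units, manttr, util):
    """Equation 14.79: expected man-hours per calendar period for one skill category."""
    return m_sites * mar * n_units * manttr * util


MAR = 25e-6        # removals per starter per flight hour
N_UNITS = 40 * 2   # 40 aircraft at the site, two starters each
UTIL = 110         # flight hours per aircraft per month
MANTTR = {"mechanic": 2.5, "helper": 1.25, "inspector": 0.5}  # man-hours per removal

# Monthly man-hours for each skill category, then the total across categories.
monthly = {skill: man_hours_per_period(1, MAR, N_UNITS, hours, UTIL)
           for skill, hours in MANTTR.items()}
monthly["total"] = sum(monthly.values())
yearly = {skill: 12 * hours for skill, hours in monthly.items()}

for skill in monthly:
    print(f"{skill:<10}{monthly[skill]:8.3f}{yearly[skill]:8.2f}")
```

The monthly values come out to roughly 0.55, 0.275, and 0.11 man-hours, for a total of about 0.935 man-hours per month, which Table 14.4 reports to two decimal places.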
Equation 14.79 can be used to calculate the man-hours associated with one skill level in one form of maintenance action on one unit or product. By performing a summation over the indices “i” and “j,” the total man-hours per calendar time for skill and experience levels “k” and “l” can be determined. Summing over the indices “k” and “l” yields the expected total man-hours required to support a product. The example illustrates this process for three skill categories, but for only one maintenance action type.

Table 14.4 MANTIME_ijkl

Calendar Period   Mechanic   Helper   Inspector   Total
Month             0.55       0.28     0.11        0.94
Year              6.60       3.30     1.32        11.22

In general, as with provisioning, spares and people can only be procured in integer quantities. However, unlike provisioning, the employment of part-time personnel and full-time personnel working overtime can offset surge demands. When using maintenance man-hour data extracted from a maintainability or logistics support analysis, the analyst must remember that the data represent the active man-hours required to conduct a specific task. Generally, the maintainability analysis does not include access time, setup, breakdown, or many other peripheral tasks normally associated with maintenance. In addition, manpower efficiency factors must be applied in order to convert the predicted active maintenance man-hours to employable man-hours. Moreover, the man-hours associated with maintenance support personnel (e.g., administrative, quality assurance, and supply) are not normally addressed in a maintainability or logistics support analysis. Logistics analysts or R&M engineers engaged in manpower planning must consider the following factors in order to derive the manpower and personnel requirements for a product:
• A decrease in the man-hours available to perform active maintenance or maintenance support can be expected due to many factors, including:
  – slack time;
  – direct administrative time;
  – direct time associated with peripheral maintenance subtasks not included in the prediction or measured task time;
  – sick leave;
  – holidays;
  – weekends;
  – shifts (hours per day availability);
  – job proficiency;
  – extra or collateral duties; and
  – work-rule restrictions.
• The use of higher percentile (e.g., 2σ or 3σ) values in lieu of expected (mean) values and the application of the algebra of normal variables will provide more confidence that manpower requirements can be met. Consider applying this approach to the following variables: MAR, MANTTR, and OPTIME.
• Excess personnel or overtime may be required to meet surge or other transient circumstances not addressed by the mean value equation.
14.3.4 Support and Test Equipment—Utilization and Productivity

The mean or expected value equation for determining the utilization of support and test equipment (STE) “k” is similar to the equation to determine personnel requirements. Like personnel, STE can only be procured in integer quantities and extra work shifts can accommodate surge requirements. Unlike personnel, indirect and other factors affecting productivity are more defined and predictable. A basic equation for
use in determining the utilization of an item of STE attributed to repair action “j” of a prime or lower level of assembly product “i” can be written as follows:

$$\begin{aligned}
\mathrm{SETIME}_{ijk} = {} & M_{\mathrm{sites}} \times \mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MTTR}_{ijk} + \mathrm{SET}_{ijk}) \times \mathrm{UTIL}_i \\
& \times (1 + \mathrm{MAR}_k \times \mathrm{MLDT}_k) \times (1 + \mathrm{PMTIME}_k \times \mathrm{PMRAT}_k + \mathrm{CALTIME}_k \times \mathrm{CALRAT}_k)
\end{aligned} \tag{14.81}$$

where
SETIME_ijk is the STE utilization time for the kth STE attributed to the jth maintenance action on the ith item of prime product;
MTTR_ijk is the active maintenance time using the kth STE for the jth maintenance action on product “i,” STE operating hours;
SET_ijk is any setup or other direct usage time associated with the use of the kth STE supporting the jth maintenance action on the ith prime product;
MAR_k is the maintenance action rate for the kth STE;
MLDT_k is the mean logistics downtime per STE maintenance action for the kth STE;
PMTIME_k is the mean time to conduct preventive maintenance for the kth STE;
PMRAT_k is the PM action rate of occurrence for the kth STE;
CALTIME_k is the mean time to conduct calibration for the kth STE; and
CALRAT_k is the calibration action rate of occurrence for the kth STE.

When Equation 14.81 is used, care must be taken to ensure consistent use of units of time. When summed over the indices “i” and “j,” the preceding equation yields the total utilization for the kth STE. Note the similarity to the demand equations developed for supply support and manpower. In the preceding equation, the terms equivalent to the MDT in the supply support equation define how long the support product will be used per prime product, support product, or calibration maintenance action.

Equation 14.81 assumes that preventive maintenance and calibration of the product support are a function of use and not based upon calendar interval. For many STEs, this is not the case, and PM and calibration are conducted at fixed calendar intervals, irrespective of use during the calendar interval. For this case, the equation can be modified as follows:

$$\begin{aligned}
\mathrm{SETIME}_k = {} & M_{\mathrm{sites}} \times \sum_{i=1}^{N_{SE}} \sum_{j=1}^{M_i} \left[ \mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MTTR}_{ijk} + \mathrm{SET}_{ijk}) \times \mathrm{UTIL}_i \right] \\
& \times (1 + \mathrm{MAR}_k \times \mathrm{MLDT}_k) \times \mathrm{CALEND} + \mathrm{PMTIME}_k \times \mathrm{NPER}_k + \mathrm{CALTIME}_k \times \mathrm{NCAL}_k
\end{aligned} \tag{14.82}$$
where
N_SE is the number of prime products supported by the kth STE;
M_i is the number of different maintenance actions for the ith prime product that use the kth STE for test support;
CALEND is the number of calendar planning periods under consideration;
NPER_k is the number of PM cycles in period CALEND; and
NCAL_k is the number of calibration cycles in period CALEND.

Equation 14.82 provides SE utilization (hours) per calendar planning period for the kth SE. As in manpower planning, when SE utilization hours computed by the mean value equation are used, the analyst must determine the maximum hours during which an SE will be available during the calendar or planning period. For example, if a company typically works one 8-hour shift per day, 5 days per week, 50 weeks per year, each piece of support equipment (SE) will be available for a maximum of 2,000 hours per year. If the results of applying Equation 14.82 indicate that 4,700 STE operating hours will be expended, including downtime for STE maintenance and calibration, then 2.35 units are required, or three units with rounding. If two shifts are used on a normal basis, the number of required STEs would be reduced.

As with provisioning and manpower planning, consistency of units must be ensured in order to obtain useful results. This is especially true when variables such as MLDT are encountered in an equation because product downtime, calendar time, and product-available or operating time must be related to a common time scale. Also, the mean value equations given earlier do not account for surge requirements or for meeting turnaround time or operational availability requirements that may be placed on the prime product.

14.4 REPAIR LEVEL ANALYSIS
The repair level analysis, also sometimes referred to as a level-of-repair analysis, is an economic analysis used to determine if a product should be repaired or discarded, as well as the maintenance level (e.g., organizational, intermediate, or depot) at which the repair or discard action should be made. The RLA is an iterative analysis that should interact with the design process. The RLA can be used to determine if an assembly can be cost effectively repaired, given an initial design approach. If the product is deemed a discard, money can be saved by simplifying the design to remove test points. On the other hand, if the initial analysis indicates that a product should be repairable, redesign to add more test points or initialization circuits may be warranted. Table 14.5 relates the use of the RLA during the life cycle of a product.

Table 14.5 RLA and the Product Life Cycle

Product life-cycle phase: Program initiation and concept exploration
Function of the RLA:
• Conduct trade-off studies of
  – Maintenance concept: evaluate possible support scenarios
  – Product support: new or existing
• Conduct operational effectiveness analysis to
  – Develop LCC estimate for budgetary planning
  – Identify noneconomic constraints to supportability and level of repair
RLA data and sources:
• R&M and LCC data from existing, fielded products
• Predictions

Product life-cycle phase: Design and development
Function of the RLA:
• Influence design for maintainability and testability
• Identify preliminary quantitative requirements for product support, facilities, personnel, and provisioning of major assemblies
• Make repair and discard decisions
• Evaluate LCC impact of proposed design changes
RLA data and sources:
• R&M predictions
• LSA
• Developer budgetary cost estimate

Product life-cycle phase: Production and initial fielding operations
Function of the RLA:
• Make level of repair decisions
• Determine provisioning requirements to include user/on-site spares and maintenance-site repair-part inventory design changes
• Evaluate LCC impact of proposed design changes
• Review and assess effectiveness of logistics product support
• Update provisioning lists
• Assess the LCC impact of proposed design changes
RLA data and sources:
• LSA
• Test results
• R&M prediction design-change proposals
• Field maintenance and cost data

For many acquisition programs, the RLA is strictly an economic analysis; however, a cost versus operational availability/readiness (or some other measure of effectiveness) trade-off analysis can be easily conducted. Typically, the RLA determines the nonrecurring and recurring costs associated with each of the 10 logistics elements for both the repair and discard alternatives. Unless the analysis is restricted a priori by noneconomic considerations (e.g., printed-wiring boards can only be repaired at the depot level, or hermetically sealed hybrid circuits are nonrepairable), the costs are computed for each applicable subsystem, assembly, subassembly, module, and potentially repairable product at each potentially applicable level of maintenance. These cost estimates use data drawn from the logistics support analysis report
(LSAR) and mean value models similar to those given in the preceding sections for provisioning, manpower, and SE.

The conduct of an RLA can become quite complex, given the number of repairable products in the typical weapon, aerospace, or electronics products. Unless prior noneconomic constraints are applied, each assembly, subassembly, or module must be evaluated for repair or discard. For either of these two alternatives, the selected action can be accomplished at one of three levels of maintenance. When conducting an RLA, the following issues, concerns, and considerations must be addressed:
• Sensitivity studies must be conducted in order to assess the potential of changing a decision for repair versus discard or for level of maintenance due to changes in a random variable (e.g., MTBF, mean time between maintenance [MTBM], MAR, MTTR, etc.).
• The cost associated with some logistics resources (e.g., multipurpose test product or repair facilities) must be amortized over all the products that will be supported by that particular resource. If the repair decision for a supported product results in that product no longer needing the resource, the analysis must be iterated to reflect the change.
• The costs associated with the support of the product can be significant. If new or peculiar STE is required, estimates must be developed for this category. This may entail performing an RLA for the STE.
• Although it may be feasible to repair an assembly or subassembly, it is important to consider repair procedure yield and condemnation rates. If the repair yield is low (high condemnation rate), the decision to attempt repair may change.
• Operational requirements and the possibility of product obsolescence or the future unavailability of new consumable assemblies may override economically based decisions. (For example, when a manufacturer decides to no longer make an assembly, but keeps the products comprising the assembly readily available, the availability of products may not be any help if repair procedures, other technical data, and the required SE do not exist.)
14.5 SUMMARY
This chapter has discussed how reliability and, to a more limited extent, testability affect the demand for spares and repair parts, personnel, and product support. This was accomplished by developing demand equations based on the assumption of a constant arrival rate and the development of a mean downtime term. The MDT was then tailored to reflect the time during which a product was undergoing maintenance, the time required to repair a spare, the response time of the product supply to deliver a part or spare from an off-site inventory, and the utilization of personnel or product support per maintenance action. The repair level analysis and how this analysis is placed in the life cycle of a product were also discussed.
REFERENCES

Blanchard, B. S. 1992. Logistics engineering and management. Upper Saddle River, NJ: Prentice Hall.
Goldman, A. S., and T. B. Slattery. 1967. Maintainability: A major element of systems effectiveness. New York: John Wiley & Sons.
Harris, F. W. 1915. Operations and cost. New York: A. W. Shaw.
Hillier, F. S., and G. J. Lieberman. 1970. Introduction to operations research. Ann Arbor: Holden-Day, University of Michigan Press.
Raymond, F. E. 1931. Quantity and economy in manufacture. New York: McGraw–Hill.
Sivazlian, B. D., and L. E. Stanfel. 1975. Analysis of systems in operations research. New York: Prentice Hall.
CHAPTER 15

Product Effectiveness and Cost Analysis

Harold S. Balaban, David Weiss
CONTENTS

15.1 Introduction
15.2 A Framework for Product Effectiveness Quantification Using Markov Processes
  15.2.1 A Generalization of the Model for Multifunction Operations
  15.2.2 Effectiveness Evaluation Example—Continuous Performance
  15.2.3 Model Applicability
15.3 Factors to Consider in Analyzing Product Effectiveness
  15.3.1 Phase I: Define Application, Product, and Logistics Support
  15.3.2 Phase II: Select Measures of Effectiveness
  15.3.3 Phase III: Develop the Mathematical Model
  15.3.4 Phase IV: Obtain Data Inputs
  15.3.5 Phase V: Exercise, Interpret, and Refine Model
15.4 Cost-Effectiveness Analysis
  15.4.1 Cost Categorization
  15.4.2 Cost Estimation
  15.4.3 Cost Adjustments
  15.4.4 Cost Uncertainty and Cost Sensitivity
  15.4.5 Combining Effectiveness and Cost
15.5 Summary
Reference
Additional Reading

15.1 INTRODUCTION
This chapter shows how reliability and maintainability data can be combined with performance data to assess overall product effectiveness and how cost aspects can be introduced to provide a more complete basis for design decisions. First, the product effectiveness concept is reviewed. A generalized model for quantifying effectiveness is then developed by first considering single-mode products and then extending the model to multimode cases. This is followed by a general discussion of how product effectiveness is analyzed and the chapter concludes with a discussion of ways of introducing cost into the decision process.

We noted in Chapter 1 that product effectiveness represents the overall capability of a product to meet customer or user requirements. It can be formally defined as a measure of the extent to which a product may be expected to achieve a set of specific application requirements in terms of availability, dependability, and capability:
• Availability is a measure of the product condition at the start of an application or start of use. It is a function of the relationships among hardware, personnel, and procedures.
• Dependability is a measure of the product condition at one or more points during the application, given the product condition at the start of the application.
• Capability is a measure of the product’s ability to achieve the application objectives, given the product condition during the application. Capability specifically accounts for the performance spectrum of the product. The capability measure can take on a number of forms, such as a success probability, a measure relative to maximum performance, or a value in terms of the product output (e.g., megawatts of power) or in terms of its impact (e.g., cargo tonnage hauled).
If we consider a very simple product, one that is either “working” or not and that cannot be repaired while in use, the preceding definitions result in the following set of questions that an effectiveness analysis seeks to answer:
• Availability: is the product working when the user needs to start using it?
• Dependability: will the product work throughout the use period?
• Capability: if the product works throughout the use period, will it perform its functions satisfactorily?
To complete the argument, we will define “working” to mean that product outputs fall within design specifications. It is surprising how many cases can be adequately handled by this simple model, at least for an initial, ballpark estimate. Speaking of ballparks, suppose the product was a television set, the “application” was to watch a World Series game, and the effectiveness measure was the probability of watching the whole game. The effectiveness of the television set is then given by

$$\begin{aligned}
E_{ff} = {} & P\{\text{set is available at the start of the game}\} \\
& \times P\{\text{set is dependable for the duration of the game, given it is available}\} \\
& \times P\{\text{set provides satisfactory picture and sound, given it is dependable}\}
\end{aligned} \tag{15.1}$$

It is not hard to develop takeoffs on this example that would make it more complicated. For example, what if a part exceeded tolerance so that the picture became
snowy but still viewable? What if sound was lost but a radio was available? What if the colors turned a ghastly hue but the picture was sharp? These questions focus on the requirement of “satisfactory picture and sound” that is embodied in the capability quantification. It is also easy to generate scenarios that focus on availability and dependability issues, especially if we allow for repair to take place during the game. Thus, although a simple “working or not” scenario may be useful, a more generalized approach may be needed.
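The simple “working or not” model of Equation 15.1 is a product of three conditional probabilities, as the following minimal Python sketch shows. The three probability values are hypothetical, chosen only to illustrate the arithmetic, and the function name `effectiveness` is likewise an assumption.

```python
from math import prod


def effectiveness(p_available, p_dependable, p_capable):
    """Equation 15.1: multiply availability, dependability, and capability."""
    return prod((p_available, p_dependable, p_capable))


# Hypothetical values for the television example (not taken from the text):
# P{available at game start}, P{dependable | available}, P{satisfactory | dependable}
eff = effectiveness(0.98, 0.99, 0.95)
print(f"Probability of watching the whole game: {eff:.4f}")
```

Because the three factors multiply, the overall effectiveness can never exceed the smallest of them, which is one reason the weakest of availability, dependability, and capability tends to dominate the result.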
15.2 A FRAMEWORK FOR PRODUCT EFFECTIVENESS QUANTIFICATION USING MARKOV PROCESSES

For relatively simple products with little or no repair capability, availability is not a significant issue, and reliability and dependability models can be formulated. However, today’s products are becoming more complex and frequently involve central computers, digital sensors, distributed microprocessor controllers, and built-in test capability. To model such complex repairable products, which really are systems of products, a standard technique is the use of a Markov model. A Markov process is governed by probabilities that are functions of the immediate past history. A Markov model is a function of the state of the product (e.g., operating, nonoperating) and the time of the observation. It is defined by a set of probabilities ($P_{ij}$) that define the probability of transition from state i to j. A Poisson process is a special type of Markov process.

To formulate a Markov model, we must define all mutually exclusive states of the product. Then, Markov state equations describe the probabilistic transitions from the initial to the final states. For complex products, the number of states in the system model becomes very large and the solution of the state equations can be very computer intensive, generally providing little design insight. To be tractable, it is necessary to reduce the number of states through combination or approximation or to use a Monte Carlo simulation. This section defines a framework for a Markov model. The reader is referred to Shooman (1990) for a detailed discussion of approximation and simplification technologies.

Product states can be used as the basis for product description, and state transitions can then be used to reflect the product reliability and maintainability characteristics. The easiest way, but not necessarily the best way, to describe the product states is to consider each product component as a success or a failure. A product state is then a combination of product successes and failures; if there are n components in the product, this method will define $2^n$ product states. A state transition occurs when the state of a component changes (a successful component fails or a failed component is repaired). With certain simplifying assumptions, state models allow for relatively easy analytical methods using Markov processes. Using the concepts of availability, dependability, and capability, we will describe several forms of this model.
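The success/failure description of product states can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names `enumerate_states` and `one_step_transitions` and the component labels are hypothetical, and each state is represented simply as a mapping from component name to a working/failed flag.

```python
from itertools import product


def enumerate_states(components):
    """List every success/failure combination; True means the component works."""
    return [dict(zip(components, flags))
            for flags in product((True, False), repeat=len(components))]


def one_step_transitions(state):
    """States reached when exactly one component fails or is repaired."""
    return [{**state, name: not up} for name, up in state.items()]


states = enumerate_states(["A", "B", "C"])
print(len(states))                      # 2**3 product states
print(one_step_transitions(states[0]))  # each neighbor differs in one component
```

The list doubles in length with every added component, which is why the text recommends combining or approximating states, or turning to simulation, for complex products.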
Single operating mode products—no in-use repair. These products can be adequately described as having only a single operating mode, as was assumed for the television set in the previous example, so repair of failures while in use is not possible. In such cases, we have the following generic model for effectiveness:

$$E_{ff} = A_v \times D_p \times C_{ap} \tag{15.2}$$

where
$A_v$ is the availability, the probability that the product is operating at the beginning of the use period;
$D_p$ is the dependability, the probability that the product continues to operate throughout the use period; and
$C_{ap}$ is the capability, a measure of product performance, given that it was dependable throughout the use period.

If the application is one in which continuous performance is required over the application length $t_m$, the effectiveness of the product, assuming well-behaved functions, may possibly be quantified as the time average of $E_{ff}(t)$, that is,

$$E_{ff} = \frac{1}{t_m} \int_0^{t_m} E_{ff}(t)\,dt \tag{15.3}$$
Note that if, at each performance time, the capability coefficient $c_j$ equals one if state j belongs to the set of satisfactory states and is zero otherwise, the preceding equation for $E_{ff}$ reduces to the expected fraction of the application performance time that the product is in a satisfactory state. If the Markov assumption does not hold, the capability matrix must be written as an N × N matrix (N = number of product states), with an entry for each state transition.

15.2.1 A Generalization of the Model for Multifunction Operations

Consider a product that has to perform f functions during an application where the kth function takes place during the interval $t_k$ to $t_k + \delta_k$, which we will denote by $T_k$. We will call such an interval the kth functional interval. We shall call the period between the two intervals, $T_{k-1}$ and $T_k$ (i.e., the period from $t_{k-1} + \delta_{k-1}$ to $t_k$) the kth nonfunctional interval and denote it by $n_k$. Figure 15.1 illustrates this symbology using “_ _ _ _” for a functional period and “…….” for a nonfunctional period. For simplicity, assume that effectiveness is measured as a probability of success, there is no overlap of the functional intervals, and that, for success, no transitions can take place during a functional interval. Then, an equation for effectiveness is as follows:

$$E_{ff} = A_v\, W \left[ \prod_{k=1}^{f-1} D_p(n_k)\, P(T_k)\, C_{ap}(T_k) \right] D_p(n_f)\, P(T_f)\, C_{ap}(T_f) \tag{15.4}$$

Figure 15.1 Product that has to perform f functions during an application where the kth function takes place during the interval $t_k$ to $t_k + \delta_k$.
where

$$A_v = [\,a_1 \;\; a_2 \;\; \cdots \;\; a_n\,] \tag{15.5}$$

$a_i$ is the probability the product is in state i at time 0 or application start.

$$W = \begin{bmatrix} w_1 & & & 0 \\ & w_2 & & \\ & & \ddots & \\ 0 & & & w_n \end{bmatrix} \tag{15.6}$$

$w_i$ is the probability that the product will be used given state i at time 0.

$$D_p(n_k) = \begin{bmatrix} d_{11}(n_k) & d_{12}(n_k) & \cdots & d_{1n}(n_k) \\ d_{21}(n_k) & d_{22}(n_k) & \cdots & d_{2n}(n_k) \\ \vdots & \vdots & & \vdots \\ d_{n1}(n_k) & d_{n2}(n_k) & \cdots & d_{nn}(n_k) \end{bmatrix} \tag{15.7}$$

$d_{ij}(n_k)$ is the probability of a transition from state i to state j during the kth nonfunctional interval.

$$P(T_k) = \begin{bmatrix} p_1(T_k) & & & 0 \\ & p_2(T_k) & & \\ & & \ddots & \\ 0 & & & p_n(T_k) \end{bmatrix} \tag{15.8}$$

$p_i(T_k)$ is the probability that, given state i at $t_k$, the beginning of the kth functional interval, there is no state transition before time $t_k + \delta_k$.

$$C_{ap}(T_k) = \begin{bmatrix} c_1(T_k) & & & 0 \\ & c_2(T_k) & & \\ & & \ddots & \\ 0 & & & c_n(T_k) \end{bmatrix}; \quad \text{for } k = 1 \text{ to } f - 1 \tag{15.9}$$

$$C_{ap}(T_f) = \begin{bmatrix} c_1(T_f) \\ c_2(T_f) \\ \vdots \\ c_n(T_f) \end{bmatrix} \tag{15.10}$$

$c_i(T_k)$ is the probability that the ith product state will lead to successful accomplishment of all required functions during the kth performance interval.

To illustrate, assume only two functional intervals. Then,

$$E_{ff} = A_v\, W\, D_p(n_1)\, P(T_1)\, C_{ap}(T_1)\, D_p(n_2)\, P(T_2)\, C_{ap}(T_2) \tag{15.11}$$
and a general term is $a_i w_i d_{ij} p_j c_j d_{jk} p_k c_k$. This term represents the probability of starting the application in state i ($a_i w_i$); transitioning to state j by the start of the first functional interval and persisting in that state throughout the interval ($d_{ij} p_j$); successfully performing the first function while in state j ($c_j$); transitioning to state k after the first functional interval and before the second functional interval ($d_{jk}$); persisting in state k during the second functional interval ($p_k$); and successfully performing the second function while in state k ($c_k$). It is, of course, possible to relax one or more of the model restrictions at the cost of more complexity. This is illustrated in the next section through an example.

15.2.2 Effectiveness Evaluation Example—Continuous Performance

One of the criticisms of the previous model is its reliance on discrete point, Markovian performance. Although many products can be cast within that framework, it is possible to use the basic model concepts to handle continuous performance. In some cases, this can be done by “attaching” a capability measure to each state transition. The approach is illustrated with the following example.

Example product definition. Two communications products, A and B, are used simultaneously to transmit information. Should either of the products fail, the remaining one is capable of transmitting alone (A and B performances are statistically independent). Failures in either product are not repaired during a transmission period, but are repaired during a period when the products are normally shut down. A transmission will be started whenever at least one of the products is available. (In other words, it is not necessary that both A and B be in operable condition in order to start a transmission.) The respective mean times between failure, mean repair times, and transmission bit rates for products A and B are given in Table 15.1.
Table 15.1 Mean Times Between Failure, Mean Repair Times, and Transmission Bit Rates for Products A and B

Product    Mean Time Between Failure (min)    Mean Repair Time (min)    Transmission Rate (r) (bits/min)
A          1200 (exponential)                 60                        30,000
B          2000 (exponential)                 80                        15,000

To illustrate the effectiveness evaluation approach, the basic product effectiveness model will be used to answer the question, “What is the effectiveness of A and B combined, if effectiveness is defined as the probability of transmitting at least 800,000 bits during a normal transmission period of 40 minutes?” For the analysis, we shall assume that only one transition is possible. The product state designations are given in Table 15.2 (a bar above a letter signifies a failed state).

Availability calculations. The availability (Av) of a product is the probability that a product is operating at any point in time and, in the steady-state case, is given by the equation:

$$\mathrm{Av} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \tag{15.12}$$
In particular, the availabilities of subproducts A and B are as follows:

$$\mathrm{Avail}(A) = \frac{1200}{1200 + 60} = 0.9524 \qquad \mathrm{Avail}(B) = \frac{2000}{2000 + 80} = 0.9615 \tag{15.13}$$

Definition: $a_i = P$(state i exists at start of transmission), a function of the availabilities of subproducts A and B:

$$\begin{aligned}
a_1 &= \mathrm{Avail}(A) \times \mathrm{Avail}(B) = (0.9524)(0.9615) = 0.9158 \\
a_2 &= \mathrm{Avail}(A) \times [1 - \mathrm{Avail}(B)] = (0.9524)(0.0385) = 0.0366 \\
a_3 &= [1 - \mathrm{Avail}(A)] \times \mathrm{Avail}(B) = (0.0476)(0.9615) = 0.0458 \\
a_4 &= [1 - \mathrm{Avail}(A)] \times [1 - \mathrm{Avail}(B)] = (0.0476)(0.0385) = 0.0018
\end{aligned} \tag{15.14}$$

Then, the availability vector is

$$\mathrm{Av} = [0.9158 \;\; 0.0366 \;\; 0.0458 \;\; 0.0018] \tag{15.15}$$

Table 15.2 The Product State Designation

Configuration            State number
$AB$                     1
$A\bar{B}$               2
$\bar{A}B$               3
$\bar{A}\bar{B}$         4
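The availability vector can be checked with a few lines of Python. This is a minimal sketch rather than part of the handbook's method: the function name `availability` and the list `availability_vector` are illustrative names, the inputs come from Table 15.1, and independence of A and B is assumed as stated in the example definition.

```python
def availability(mtbf_min: float, mttr_min: float) -> float:
    """Steady-state availability, Av = MTBF / (MTBF + MTTR)."""
    return mtbf_min / (mtbf_min + mttr_min)


avail_a = availability(1200.0, 60.0)  # product A, Table 15.1
avail_b = availability(2000.0, 80.0)  # product B, Table 15.1

# State order follows Table 15.2:
# 1 = A up, B up; 2 = A up, B down; 3 = A down, B up; 4 = both down.
# A and B are assumed statistically independent, so probabilities multiply.
availability_vector = [
    avail_a * avail_b,
    avail_a * (1.0 - avail_b),
    (1.0 - avail_a) * avail_b,
    (1.0 - avail_a) * (1.0 - avail_b),
]

for state, prob in enumerate(availability_vector, start=1):
    print(f"a{state} = {prob:.4f}")
```

Because the four states are mutually exclusive and exhaustive, the entries sum to 1, which is a quick sanity check on any availability vector built this way.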
Dependability calculations. Because there is no in-use repair, dependability is based only on the reliability measures associated with the operation of subproducts A and B. The reliability function is assumed to be exponential and is given by the following equation:

$$R(t) = e^{-t/\theta} \tag{15.16}$$

where t is the application time and $\theta$ is the mean time between failures (MTBF). Thus, the reliabilities for subproducts A and B over a 40-minute period are as follows:

$$R_A(40\ \text{min}) = e^{-40/1200} = 0.9672 \qquad R_B(40\ \text{min}) = e^{-40/2000} = 0.9802 \tag{15.17}$$

Definition: $d_{ij} = P$(transition from state i to state j during the 40-minute transmission period). The state transition probabilities are given in Table 15.3.

Table 15.3 State Transition Probabilities

From state 1                           From state 2
d11 = (0.9672)(0.9802) = 0.9481        d21 = (0.9672)(0) = 0
d12 = (0.9672)(0.0198) = 0.0192        d22 = (0.9672)(1) = 0.9672
d13 = (0.0328)(0.9802) = 0.0321        d23 = (0.0328)(0) = 0
d14 = (0.0328)(0.0198) = 0.0007        d24 = (0.0328)(1) = 0.0328

From state 3                           From state 4
d31 = (0)(0.9802) = 0                  d41 = (0)(0) = 0
d32 = (0)(0.0198) = 0                  d42 = (0)(1) = 0
d33 = (1)(0.9802) = 0.9802             d43 = (1)(0) = 0
d34 = (1)(0.0198) = 0.0198             d44 = (1)(1) = 1

Capability calculations. Definition: $c_{ij} = P$(transmit at least 800,000 bits in 40 minutes while undergoing a transition from state i to state j). We have the following results immediately: $c_{11} = 1$, $c_{12} = 1$, $c_{22} = 1$, $c_{23} = 0$, and $c_{ij} = 0$ for all i > 2. To illustrate the thinking, $c_{11}$, $c_{12}$, and $c_{22}$ are 1 because they represent the capability when product A works throughout the application, and this product alone can meet the transmission requirement (30,000 bits/min for 40 minutes is 1,200,000 bits). The states for which the c values are indicated to be 0 are those that represent impossible transitions or for which product A is not available at all throughout the application. Because product B cannot transmit 800,000 bits in 40 minutes (15,000 bits/min for 40 minutes is only 600,000 bits), the capability associated with states for which product A starts out failed is 0. For $c_{13}$, $c_{14}$, and $c_{24}$, things get a bit more complicated. We define the following:

• transmission rates per minute: $r_a$ and $r_b$, for products A and B, respectively
• application time: T
• failure rates: $\lambda_a$ and $\lambda_b$
• total bits transmission requirement: B
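The entries of Table 15.3 can be regenerated from the two reliabilities. The sketch below is illustrative only: the helper `survive_or_fail` and the `STATES` list are hypothetical names introduced here, and the code assumes exponential lifetimes and no repair during the 40-minute period, as in the example. Because it works from unrounded reliabilities, its results can differ from the table in the last printed digit.

```python
import math


def reliability(t_min: float, mtbf_min: float) -> float:
    """Exponential reliability R(t) = exp(-t / MTBF)."""
    return math.exp(-t_min / mtbf_min)


T = 40.0                          # transmission period, minutes
rel_a = reliability(T, 1200.0)    # probability A survives the period
rel_b = reliability(T, 2000.0)    # probability B survives the period


def survive_or_fail(up_at_start: bool, up_at_end: bool, rel: float) -> float:
    """Probability that one product goes from its start condition to its end
    condition. With no in-use repair, a failed product stays failed."""
    if up_at_start:
        return rel if up_at_end else 1.0 - rel
    return 0.0 if up_at_end else 1.0


# (A up?, B up?) for states 1..4, following Table 15.2.
STATES = [(True, True), (True, False), (False, True), (False, False)]

# D[i][j] is the probability of moving from state i+1 to state j+1.
D = [
    [survive_or_fail(a0, a1, rel_a) * survive_or_fail(b0, b1, rel_b)
     for (a1, b1) in STATES]
    for (a0, b0) in STATES
]

for row in D:
    print("  ".join(f"{d:.4f}" for d in row))
```

Each row of `D` sums to 1, since a product that starts in a given state must end the period in exactly one of the four states.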
First, we will consider $c_{24}$, the probability of transmitting at least B bits in time T when starting out in state 2 (A working and B failed) and transitioning to state 4, with both products failed. Clearly, this can only happen if A fails after such time that B bits were delivered; that is, the failure time of A, $t_a$, is no earlier than $B/r_a$. With $\lambda_a$ denoting the failure rate of A, this probability is given by the following equation:

$$c_{24} = \int_{B/r_a}^{T} \frac{\lambda_a e^{-\lambda_a t_a}}{1 - e^{-\lambda_a T}}\, dt_a = \frac{1}{1 - e^{-\lambda_a T}} \left[ e^{-\lambda_a B / r_a} - e^{-\lambda_a T} \right] \tag{15.18}$$
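The closed form in Equation 15.18 is easy to evaluate numerically. The following is a minimal sketch; the function name `c24` and its parameter names are illustrative, and the guard for the case in which A cannot deliver the required bits within the period is an added assumption not spelled out in the text.

```python
import math


def c24(bits_required: float, rate_a: float, period: float, mtbf_a: float) -> float:
    """Probability that A, given that it fails within the period, fails only
    after delivering the required bits (closed form of Equation 15.18)."""
    lam = 1.0 / mtbf_a                   # exponential failure rate of A
    t_needed = bits_required / rate_a    # earliest acceptable failure time
    if t_needed >= period:
        return 0.0                       # A alone cannot meet the requirement
    numerator = math.exp(-lam * t_needed) - math.exp(-lam * period)
    denominator = 1.0 - math.exp(-lam * period)
    return numerator / denominator


value = c24(bits_required=800_000, rate_a=30_000, period=40.0, mtbf_a=1200.0)
print(f"c24 = {value:.4f}")
```

The denominator conditions on A failing before time T, which is what a transition from state 2 to state 4 requires.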
When substituting the example values, we calculate $c_{24} = 0.3296$. The calculation for $c_{13}$ is similar. This case is the transition from both A and B working to A failing. If A fails at time $t_a$, it will have delivered $r_a t_a$ bits; thus, the surviving B product will transmit $r_b (T - t_a)$ bits over the remaining time $T - t_a$. Because the total bits delivered under this case, $r_a t_a + r_b (T - t_a)$, must be at least B, a lower limit on $t_a$ to meet the criterion is defined and a probability expression can be developed. For the numbers we used, we find that $c_{13} = 0.8478$.

For the $c_{14}$ capability measure, which represents a transition from both A and B working to both failed, we need to use a convolution equation, which represents the probability that the sum of the transmissions of A and B is at least B over time T. This is given by the following equation:

$$c_{14} = \int_{B/r_a}^{T} \frac{\lambda_a e^{-\lambda_a t_a}}{1 - e^{-\lambda_a T}}\, dt_a + \int_{0}^{B/r_a} \frac{\lambda_a e^{-\lambda_a t_a}}{1 - e^{-\lambda_a T}} \int_{(B - r_a t_a)/r_b}^{T} \frac{\lambda_b e^{-\lambda_b t_b}}{1 - e^{-\lambda_b T}}\, dt_b\, dt_a \tag{15.19}$$
The first term is the probability that A fails after it has transmitted B bits; the second term represents all the ways in which the sum of bits transmitted by A and B, given that both fail before time T, is at least B. On solving this equation and inserting the numerical values applicable for the example, we find that $c_{14} = 0.5509$. These calculations then lead to the following C matrix:

$$C_{(ap)ij} = \begin{bmatrix}
1 & 1 & 0.8478 & 0.5509 \\
0 & 1 & 0 & 0.3296 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix} \tag{15.20}$$
Effectiveness is then calculated by the following equation:

$$E_{\mathrm{eff}} = \sum_{i=1}^{4} \sum_{j=1}^{4} a_i\, d_{ij}\, c_{ij} \tag{15.21}$$

This may be written as follows:

$$E_{\mathrm{eff}} = \mathrm{Av} \begin{bmatrix}
\sum_{j=1}^{4} d_{1j}\, c_{1j} \\
\sum_{j=1}^{4} d_{2j}\, c_{2j} \\
\sum_{j=1}^{4} d_{3j}\, c_{3j} \\
\sum_{j=1}^{4} d_{4j}\, c_{4j}
\end{bmatrix} \tag{15.22}$$
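The double sum of Equation 15.21 can be sketched in Python using the rounded values quoted in the text for the availability vector, the transition probabilities of Table 15.3, and the capability matrix. The variable names are illustrative. Because the inputs are already rounded to four decimals, the intermediate row terms may differ from the handbook's figures in the last digit, while the overall result agrees with the reported effectiveness of 0.947 to three decimals.

```python
# Rounded values quoted in the text (availability vector, Table 15.3, C matrix).
availability = [0.9158, 0.0366, 0.0458, 0.0018]

dependability = [
    [0.9481, 0.0192, 0.0321, 0.0007],
    [0.0,    0.9672, 0.0,    0.0328],
    [0.0,    0.0,    0.9802, 0.0198],
    [0.0,    0.0,    0.0,    1.0],
]

capability = [
    [1.0, 1.0, 0.8478, 0.5509],
    [0.0, 1.0, 0.0,    0.3296],
    [0.0, 0.0, 0.0,    0.0],
    [0.0, 0.0, 0.0,    0.0],
]

# Inner sums: for each starting state i, sum over j of d_ij * c_ij.
row_terms = [
    sum(d * c for d, c in zip(d_row, c_row))
    for d_row, c_row in zip(dependability, capability)
]

# Outer sum: weight each row term by the probability a_i of starting in state i.
effectiveness = sum(a * term for a, term in zip(availability, row_terms))

print([round(term, 4) for term in row_terms])
print(round(effectiveness, 3))
```

Note that the product of dependability and capability is taken element by element, not as a matrix product, because each capability value is tied to a specific transition.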
Upon substituting the example values, we have

$$E_{\mathrm{eff}} = [0.9158 \;\; 0.0366 \;\; 0.0458 \;\; 0.0018] \begin{bmatrix} 0.9948 \\ 0.9780 \\ 0 \\ 0 \end{bmatrix} \tag{15.23}$$

or $E_{\mathrm{eff}} = 0.947$.

15.2.3 Model Applicability

It is impossible to support the proposition that a concept as complex as effectiveness can be universally quantified by a single model. As stated earlier, the proposed model provides a conceptual framework for effectiveness analysis. Products have to work to perform, have to be repaired if they fail, and have to do the job if they are operating. That is the essence of the model.

One of the criticisms often made of any effectiveness model is that, for complex products, no single measure can be used to describe how well the product meets its objectives. For example, in evaluating a communication product, one may want to consider capacity, error rate, security, and a number of other factors. Although it may not be possible to develop a single measure for all these factors, one may develop effectiveness measures for each of the important factors, thus providing a set of measures for evaluation. The proposed model, in fact, has the ability to do this. If the capability vector is transformed to a capability matrix, with each column representing the capability associated with a particular form of output (e.g., capacity, errors, security), then the model will develop the vector solution. However, this is possible only if all the availability and dependability formulations are applicable for all capability measures.
15.3 FACTORS TO CONSIDER IN ANALYZING PRODUCT EFFECTIVENESS
The preceding sections provided a basis for quantifying product effectiveness. We will now examine the approaches and factors to consider in evaluating alternative designs or in analyzing fielded products. Figure 15.2 is a flow diagram for a typical program. A product effectiveness analysis is performed at several product and support levels. The first application is usually at the product or major subproduct level (e.g., computer system or data processor) and the associated support levels. Early applications provide decisions on overall design approaches and supply the basis for further analysis at lower hardware and support levels. Thus, analysis at the computer system level helps to define the overall architecture, and analysis at the data processor level will determine how inputs, computations, and outputs are to be handled by the central processing unit and associated hardware.
Figure 15.2 Flow diagram for a typical product effectiveness analysis.
Because data are often limited in the early stages, some decisions will have to be deferred and others made on a contingency basis. Until the design is frozen, the results of each iteration of product effectiveness analysis are used to refine the analytical model and criteria and early design and support decisions. Management must plan on performing the analyses at times consistent with key points in the product design process. These steps and the corresponding points in time of the design process are shown in Figure 15.3. The analysis proceeds to translate product requirements and constraints into requirements and constraints on the parameters of progressively smaller parts of the product, which are then fed to the applicable design groups. As this process continues, the relationships between the product components and the overall effectiveness of the product become better defined so that decisions among alternative approaches can be made on a factual, rather than subjective, basis.

Figure 15.3 Process for model implementation.

15.3.1 Phase I: Define Application, Product, and Logistics Support

Phase I involves a detailed study of those aspects of the application, product, and logistics support that will eventually form the basis for the effectiveness model (Phase III) and the establishment of design-decision criteria. The first task in this phase involves translating general application objectives into quantitative operational requirements suitable for effectiveness analysis. The overall objective of a command and control system (for example, to maintain command after a disaster) must be translated into specific requirements. These include the particular type of enemy attack to be survived, the environment under which the product must operate, the information available to and required by the system, and any real-time data requirements.

If more than one application is involved, the first phase will lead to a set of operational requirements for each application. The resulting sets of requirements should be reduced to a composite set that defines a product that will be effective for the primary or most likely application types. Weighting by importance and probability of occurrence can be used in deriving the composite.

The operational requirements are used to define the product and major subproducts with respect to boundaries, functions, and constraints. The subproducts and their functions must be clearly defined through, for example, preliminary specifications, hardware sketches, and functional block diagrams. The interfaces between the product and the operational user, the logistics support functions, any larger product, and the application environment must be analyzed. In the early stages of product development, block diagrams are used to analyze multimodal capabilities, failure effects, and redundancy.
Analyses should also be made of other general reliability and maintainability design considerations, such as the use of on-board testing and modular avionics. Logistics support should be examined in areas such as maintenance levels, available repair facilities, and available maintenance skills. Various approaches for achieving high reliability, maintainability, and readiness must be formulated for further study, using the effectiveness model.

15.3.2 Phase II: Select Measures of Effectiveness

The definitions of product, logistics support, and application developed in Phase I provide a basis for formulating measures of effectiveness, using the factors discussed in Chapter 1 and earlier in this chapter. Some factors, such as reliability, apply to all products. Others are peculiar to a given type of product; maximum gross takeoff weight, for example, is associated only with aircraft. In selecting measures of effectiveness, care must be exercised to avoid limiting design options prematurely. For example, if a low infrared signature is required for an aircraft during supersonic cruise, then the measure of effectiveness should be stated in those terms. If the measure were restated as supersonic cruise without afterburner, the options left to the design team would be reduced.

The frequency with which the product will be used and the criteria for measuring performance must also be defined. Frequency of use is important because product effectiveness is based on meeting operational demands within a specified time period, or satisfying some other applicable use constraint such as one based on miles or number of attempts. Usually, the function performed will indicate the nature of this time or use requirement, but in some cases there are several possible choices. Three types of time or use requirements are described in Table 15.4, using time as the basis.

Table 15.4 Time or Use Requirements Related to Product Operational Demands

Type of Requirement    Description                                                                  Example
Instantaneous          Product must meet a given demand at a given instant for a short duration     Retro-rocket package for re-entry of orbiting satellite
Continuous interval    Product must meet a demand continuously for a given interval                 Plane on a transport application
Fraction interval      Product must meet a demand for a specified fraction(s) of a given interval   Reconnaissance satellite that takes pictures on demand

Given a time or use requirement, it is then necessary to determine the performance measure(s) to consider, which are generally in terms of the product outputs, and how the performance measure(s) will be reflected in an overall measure of effectiveness. In terms of the model discussed earlier, the effectiveness measure is equivalent to the way capability is quantified. Naturally this measure must be related to the application objectives, but there are often a number of ways to express it. For example, consider the communication product example discussed earlier in this chapter. There we defined the communication product performance by the number of bits transmitted, and we defined effectiveness as the probability that a minimum number of bits would be transmitted during a certain time period. Instead, we could have chosen to define effectiveness as the expected number of bits to be transmitted. Instead of bits transmitted as the performance measure, we might have chosen some other communication measure such as error rate or transmit delay time. We will discuss two basic forms of effectiveness measures given that a performance measure has been defined.

The minimum performance criterion. The minimum performance criterion specifies quantitative bounds on the output of the product.
These bounds define the range of acceptable performance; there is no assessment of the degree of acceptability within these bounds. Thus, this criterion leads to the commonly used dichotomous description of product performance: success or failure. For example, a criterion that states that a computer must perform a set of benchmark problems within a certain time period or that a bomb must be dropped within d miles of the center of the target is a minimum performance criterion.

The overall performance criterion. The overall performance criterion concerns the complete distribution of output, considered in terms of the actual probability distribution or of a related statistical measure, such as the expected value. This criterion may be used when the degree of conformance to application or functional requirements is important and when the minimum performance criterion is too artificial. Referring to the computer benchmark example, other effectiveness measures are the amount of output provided in a given time period or the time it took to complete the benchmark.

The choice of which form of measure to use is not always clear. The application objectives, product functions, and associated output sometimes dictate the evaluation criterion. If the functional output is dichotomous (e.g., for a detection product, the output might be detection or no detection), the choice would be the minimum performance criterion using success probability as the measure. For a multi- or continuous-output product, the overall performance criterion can be used if success boundaries would be highly artificial or if interest centers on the statistical properties of the output such as the mean and variance.

The choice between the minimum performance criterion and the overall performance criterion can be critical. For an effectiveness measure based on a minimum performance criterion, specified bounds on the product output define the region of acceptable performance and lead to the classification of success or failure. In the other method of quantifying effectiveness, using an overall performance criterion, the complete distribution of output is considered and is usually quantified by an appropriate statistical measure, such as the mean output. Products that have outputs for which partial or degraded information return may be of some value could possibly be measured under the overall performance criterion. In summary, the product output and associated effectiveness measures form the basis for evaluating the effectiveness of proposed designs.
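The difference between the two criteria can be seen by applying both to the same set of outputs. The sketch below is purely illustrative: the five output values are hypothetical numbers invented for the example (not data from the handbook), and the two function names are assumptions. The 800,000-bit threshold is the one used in the communication example earlier in the chapter.

```python
from statistics import mean


def minimum_performance_effectiveness(outputs, threshold):
    """Minimum performance criterion: fraction of trials that meet the bound,
    i.e., an estimate of the probability of success."""
    return sum(1 for x in outputs if x >= threshold) / len(outputs)


def overall_performance_effectiveness(outputs):
    """Overall performance criterion: a statistic of the whole output
    distribution; here the mean (expected output) is used."""
    return mean(outputs)


# Hypothetical bits transmitted in five 40-minute periods (illustrative only).
bits = [1_200_000, 1_200_000, 950_000, 700_000, 0]

p_success = minimum_performance_effectiveness(bits, threshold=800_000)
expected_bits = overall_performance_effectiveness(bits)

print(f"P(at least 800,000 bits) = {p_success:.2f}")
print(f"Expected bits per period = {expected_bits:,.0f}")
```

Under the first criterion the period that delivered 700,000 bits counts as a plain failure; under the second it still contributes its partial output to the mean, which is why the choice matters when degraded output has some value.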
This evaluation is performed through mathematical models developed to represent these measures and the associated costs and burdens involved in achieving the design objectives.

15.3.3 Phase III: Develop the Mathematical Model

Three tasks must be performed in this phase:

• selection of variables that affect product effectiveness;
• definition of existing and imposed economic constraints; and
• development of the mathematical relationships among the variables to express the effectiveness measures.
The variables are the product and logistics support parameters that influence product effectiveness and will appear in the model. Within the defined limitations on these parameters and on the model, the proper selection of these parameter values will yield a product with an optimal level of product effectiveness. Typical are those variables that affect the major components of effectiveness (reliability, maintainability, and performance) and that can be traded off against one another on the basis of some common denominator, which may be called worth.
Variables include complexity, the number of redundant units and the type of redundancy, the number and level of replaceable modules, and the type and frequency of preventive maintenance. As the effectiveness analysis is iterated, more “detailed variables” must be considered, for example, stresses on parts and components, maintenance accessibility factors, and the numbers and types of test points. The associated burden and benefit of each alternative should be determined. For example, increased weight, cost, and complexity constitute the burden of improving reliability through redundancy. The burden and benefit of improving reliability through redundancy can be compared with the burden and benefit of using ultrahigh-reliability parts.

The second task is to determine existing and imposed physical, support, and economic limitations. Imposed physical limitations can be explicitly required (e.g., available space) or implied by operational requirements; existing constraints are defined by the current state of the art and may include achievable failure rate levels, maintenance repair rates, and data transmission speeds. Constraints such as funding, desired date for fielding the product, and available maintenance staffing and skill levels should also be listed. Also to be considered are such economic and logistic limitations as total cost, development time, test product requirements, and maintenance manpower and skill level requirements. These factors must be introduced into the mathematical model to ensure that the product made will meet its operational requirements within the specified constraints, including cost, schedule, and support.

The third task is to develop the mathematical model for (1) estimating the effectiveness of proposed products, (2) evaluating the effectiveness potential of various alternatives, (3) trading off these alternatives, and (4) determining reliability and maintainability requirements at product levels.
A general model is first developed in terms of product states defined by the states of major product elements, as we discussed earlier in this chapter. By assessing the performance of these elements with respect to each subfunction, the effect of design and support decisions on overall product effectiveness can be determined. In this early stage, the capability analysis is a product engineering function, often relying more on basic principles because directly applicable data may be very limited. A complete analysis would provide probabilistic performance indices. For example, a cumulative distribution function of detection probability versus range might be appropriate for evaluating radar performance, and radar range equations could serve as the basis for the engineering analysis. At this point, a cost/effectiveness trade-off should also be conducted. For example, building in high reliability and maintainability increases initial investment costs; however, support costs over the product service life should be reduced. This type of trade-off should be investigated at the major product and support levels to narrow the choice of alternatives early in the development stage. Models should be developed and applied to the analysis of costs with respect to reliability and maintenance characteristics. Product failure triggers the support system and thus determines how often a particular product will consume support resources. The expenditure in terms of manpower and maintenance time is a function of the maintainability characteristics.
The framework of the cost model should provide the following information:

• comparative cost data on the operation of major subordinate organizations;
• the cost of supporting the units; identification of those units that might justify engineering changes;
• information on the comparative costs of making a given repair at different echelons of maintenance and the elasticity of support costs with respect to failure frequency;
• information needed to trade off support savings against increases in fixed investment (e.g., the introduction of new test equipment); and
• a tool such as a simulation model and applicable data with which suggested changes in the support organization can be evaluated before being initiated.
More detailed treatment of cost-effectiveness analysis is provided in the following sections.

15.3.4 Phase IV: Obtain Data Inputs

The required data inputs for the model consist of such items as part or component reliability and maintainability parameters, costs, weight and space estimates, and other data on pertinent physical, engineering, or economic factors. Initially, such inputs may be obtained from past experience and from appropriate estimation techniques. As early design approaches become more definite and component and unit development progresses, these inputs should be refined and the model iteratively run. It is important that policy and procedures be established to ensure that the effectiveness analysis team is aware of the data-generating activities during development and that it has a say on the data to be collected.

15.3.5 Phase V: Exercise, Interpret, and Refine Model

The steps in this phase are essentially:

• designing a product that satisfies constraints;
• computing the values of effectiveness and worth;
• comparing these values with requirements;
• making generalizations concerning appropriate combinations of design support factors;
• revising the factors and rerunning the model; and
• refining the model as additional data become available.
Schematically, this process might be represented as in Figure 15.4. Only those designs that meet physical and economic limitations are actually evaluated by the model. The range of designs, however, is naturally restricted by the customer’s requirements. With the aid of the model, a conceptual design is translated into actual hardware configurations and support plans that, given the constraints under which the product is being developed, should provide the highest level of product effectiveness.
Figure 15.4 Process for model implementation. [Plot labels as extracted: Cost ($) versus Time; curves labeled R&D, Investment, and Annual operating.]

15.4 COST-EFFECTIVENESS ANALYSIS
In this section we bring cost into the effectiveness picture in order to describe a more complete approach to choosing among competing designs, operational methods, and logistic structures. We can define cost-effectiveness analysis as the process of comparing alternative solutions for meeting product or application requirements in terms of value received (effectiveness) and resources expended (costs). Note that cost is the measure of resource usage, and effectiveness is the measure of value received. We must recognize that effectiveness may not include all of the value elements of a product and that cost does not embrace all of the resources required. For example, resource requirements in terms of such factors as personnel skills and schedule delays are often difficult to translate to cost measures. Thus, one must ensure that the documentation associated with any cost-effectiveness analysis include the important elements that are not explicitly considered as cost or effectiveness numbers.

Cost-effectiveness analysis became prominent in the early 1960s in large-scale military development and acquisition projects. It evolved from economic analysis work (termed cost-benefit analysis) done several decades earlier, such as that done on the flood control project in the late 1930s. Another related term is product analysis, which embraces many of the same ideas as cost-effectiveness analysis but is not as definitive in requiring that cost and effectiveness numbers be produced. The analytic output of a cost-effectiveness analysis may then be fed into the higher level system analysis framework so that the decision maker can act on it, in conjunction with expert judgment and intuition, in deciding on the best course of action.

15.4.1 Cost Categorization

Product costs have been categorized in a number of different ways, depending on the product type and applicability of available data.
A cost categorization should focus attention on the major resources that will be consumed during the life of the product. A broad categorization based on program phases includes costs associated with research and development, investments, and operation. Examples of the types of costs in each major category are given in Table 15.5.

Table 15.5 Cost Categorization by Program Phase

Research and Development Costs
  Design and development
    • Preliminary research and design studies
    • Development engineering and hardware fabrication
    • Development instrumentation
    • Industrial facilities
  Product test
    • Test vehicle fabrication
    • Test vehicle spares
    • Test operations
    • Test support product
    • Test facilities
    • Data collection, reduction, analysis, and storage
    • Maintenance, supply, miscellaneous
  Product management and technical direction

Initial Investment Costs
  Equipment
    • Primary application equipment
    • Support equipment
    • Other equipment
  Stocks
    • Application product and product spares
    • Equipment support and part spares
    • Consumables
  Initial training
  Installation
    • Construction of facilities
    • Platform modifications
  Miscellaneous investment
    • Technical data
    • Transportation and travel
    • Administrative and support costs

Operating Costs
  Equipment and installation replacement
    • Primary application equipment
    • Specialized equipment
    • Other equipment
    • Installations
  Maintenance and support
    • Primary application equipment
    • Specialized equipment
    • Other equipment
  Recurrent training
  Inventory management
  Management and technical data
  Facilities
  Operation costs
    • Personnel
    • Fuel
    • Power
    • Other

• Research and development costs include all the costs necessary to bring a product into the production or procurement phase.
• Initial investment costs comprise all costs incurred in introducing a product into the active inventory. They include production or procurement costs, facility costs, personnel training, installation, and procurement of initial spares.
• Operating costs are all costs necessary to the operation of the product once it has been phased into the operational inventory. Although both research and development (R&D) and investment costs are incurred just once, operating costs continue throughout the lifetime of the product.
The curves of Figure 15.5 show a typical distribution of these costs over the life cycle of the product. The term life-cycle cost is often used to represent the sum of the R&D, investment, and operating costs for a period representing the expected lifetime of the product. It is important to note that when alternatives are compared, all costs that will not affect the decision should be excluded. For example, assume that life-cycle costs are to be estimated for several alternatives and they include depot repair. If there is no foreseeable reason for depot management costs to depend on the selected alternative, such costs should be excluded in order to simplify the problem.

Another cost categorization is shown in Figure 15.6. Each of the eight categories is subdivided. Figure 15.7 shows a breakdown of the development cost category.

Figure 15.5 Distribution of product costs over the life cycle.

Figure 15.6 Total cost of ownership. (Cost of ownership, 0000, comprises Development, 1000; Procurement, 2000; Installation, 3000; Maintenance, 4000; Operation, 5000; Management and technical services, 6000; Modification, 7000; and Disposal, 8000.)

Figure 15.7 Development cost. (Development, 1000, comprises Prime equipment development, 1100, and Special support equipment development, 1200. Each is subdivided into Initial research and development, 1110 and 1210; Hardware for technical evaluation, 1120 and 1220; Initial technical evaluation, 1130 and 1230; Operations evaluation, 1140 and 1240; and Other, 1150 and 1250. Initial technical evaluation is further divided, 1131 to 1134 and 1231 to 1234, among contractor, government laboratory, naval shipyard, and commercial laboratory.)

15.4.2 Cost Estimation

Three general methods for cost estimation are bottom up, top down, and analogy with similar products.

Bottom-up method. The bottom-up approach, sometimes called the accounting or grassroots method, attempts to estimate costs by breaking expenditures down into elemental categories called a work breakdown structure (WBS). Estimates are made of required labor and materials for each category; these are then used with standard labor rates and material costs to estimate category costs. These category costs are then aggregated to higher level cost categories to build up the cost estimate. Higher level costs such as facilities are introduced into the buildup process as necessary. Estimates of operation and support (O&S) costs usually involve some form of bottom-up approach, at least with respect to logistics and maintenance costs. As a simple example, if there are 200 products and each operates 40 hours a month and has an MTBF of 1,000 hours, then

\[ \text{Expected number of failures per month} = \frac{200 \times 40}{1000} = 8 \tag{15.24} \]
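The arithmetic behind Equation 15.24 can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions, not part of the handbook, and the calculation assumes a constant failure rate of 1/MTBF.

```python
def expected_failures_per_month(n_products: int, hours_per_month: float, mtbf_hours: float) -> float:
    """Expected failures per month for a fleet, assuming a constant failure rate (1 / MTBF)."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    total_operating_hours = n_products * hours_per_month
    return total_operating_hours / mtbf_hours


# Values from the example in the text: 200 products, 40 h/month, MTBF of 1,000 h.
failures = expected_failures_per_month(200, 40, 1000)
print(failures)  # 8.0
```

Multiplying the result by 12 gives the expected number of failures per year, which is the quantity a bottom-up estimate would then price.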
Estimates of the manpower, materials, and other costs (e.g., transportation costs if the repair facility is not on site) needed to restore the product to operating condition are used to provide an estimate of the cost of failure, which is then aggregated over the year or time period of interest. In addition to the active repair costs, MTBF and pipeline time determine how many spares should be bought to ensure a required level of product availability.

If there is a range of costs that may be applicable for some activities or materials, these can be incorporated into the aggregation methodology to provide a measure of uncertainty in the estimate. As one simplistic approach, using all the high (low) cost estimates provides a bound on the maximum (minimum) cost. If cost distributions for uncertain cost elements are known or can be assumed, then statistical procedures based on sums of random variables can be used for developing a distributional estimate of the aggregated cost.

Top-down method. The top-down method, also called the parametric method, uses historical data and statistical techniques in an attempt to find a relationship between high-level costs and a set of product parameters that is applicable to the product under study. The term cost-estimating relationship (CER) is often used for these types of cost-estimating equations. A typical CER development exercise involves collecting data on applicable products of a general class (for example, radar products). Typically, separate CERs are developed for each of the major cost categories (R&D, investment, operation) because factors that influence one cost category may have little influence on another or may even have an opposite effect. For example, a program to produce a product with an ultrahigh reliability may increase development costs over those of a product with a typical reliability level; it is hoped, however, that operational costs may be greatly reduced because of the reduction in the number of failures. Data elements to consider include factors related to product performance, physical attributes, and costs.
Data may also be collected on factors related to technology and the acquisition environment. For example, in a computer-costing exercise, it may be important to know the amount of large-scale integration in each of the computers in the data set. If some products were bought in a sole-source environment and others were procured competitively, that may have an important bearing on the prices paid. Factors such as reliability requirements or length of warranty coverage could also impact costs. Using such techniques as multiple regression analysis, factors that are highly correlated to the cost numbers of interest are selected by the regression technique as “significant” and are included in the predictive equation. For radar products, the CER may depend on such technical factors as range and sensitivity; for computers, CPU speed and memory capacity may be good cost predictors; for aircraft, range, speed, and carrying capacity are obvious candidates. The statistical processes used to develop a CER also provide measures of the strength of the relationship and can provide measures of uncertainty in the estimate through such techniques as confidence intervals. A great deal of engineering judgment is required in deciding which factors to include in the database, what adjustments must be made to the cost numbers to account for unusual events, and what screening criteria to use to ensure that the resultant equation makes engineering as well as statistical sense.
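To make the idea of a CER concrete, the sketch below fits a one-parameter log-linear relationship, ln(cost) = a + b ln(x), by ordinary least squares. Everything in it is an assumption for illustration: the data points are made up (generated from a = 1.0 and b = 0.5 so that the fit can be checked), and the function names are hypothetical. A real CER exercise would use historical data, several candidate parameters, and a statistics package that reports standard errors and confidence intervals.

```python
import math


def fit_loglinear_cer(params, costs):
    """Fit ln(cost) = a + b * ln(param) by ordinary least squares; return (a, b)."""
    xs = [math.log(p) for p in params]
    ys = [math.log(c) for c in costs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def predict_cost(a, b, param):
    """Predicted cost from the fitted relationship."""
    return math.exp(a + b * math.log(param))


# Hypothetical data: a technical parameter (e.g., range) and a cost that follows
# cost = exp(1.0) * param ** 0.5 exactly, so the fit should recover a = 1.0, b = 0.5.
params = [10.0, 20.0, 40.0, 80.0, 160.0]
costs = [math.exp(1.0) * p ** 0.5 for p in params]

a, b = fit_loglinear_cer(params, costs)
print(round(a, 6), round(b, 6))
print(round(predict_cost(a, b, 100.0), 3))
```

Working in logarithms keeps the fitted cost positive and turns a power-law relationship into a straight line, which is one reason log-linear forms are common for CERs.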
To illustrate a CER, some years ago Aeronautical Radio, Incorporated (ARINC) developed the following equation to predict radar product development costs:

\[ \ln C_{st} = 0.784 + 0.205 \ln A_{pr} + 0.165\, D_{dev} + 0.151 \ln P_k + 0.028\, S + 0.082\, SC + 1.37\, TD \tag{15.25} \]

where \(A_{pr}\) is the antenna aperture in square feet; \(D_{dev}\) is the degree of development (score 0, 1, 2); \(P_k\) is the peak power in kilowatts; \(S\) is the sensitivity in negative dBm; \(SC\) is the number of special circuits; \(TD\) is the type of development (new = 2; modification = 1); and \(C_{st}\) is the cost of development plus the cost of the first prototype.

Analogy with similar products. The analogy method uses cost data from products with similar characteristics, which are then adjusted to account for known differences between the current products and the one being evaluated. Sources of data may range from price lists to costs incurred under previous procurement contracts. In many cases, the adjustment will involve extrapolation: increased CPU speed for a computer, increased range and better fuel consumption for an engine, or a graphical user interface for a software purchase. Again, experience and good engineering judgment will be required to determine which historical data are relevant and which adjustments to make to account for differences between the past and present. The analogy method clearly will give good results when the products and the acquisition environments of the analogous product and subject product are similar. As the differences increase, the uncertainty about the accuracy of the analogous estimate increases, and the degree of uncertainty is difficult to quantify.

15.4.3 Cost Adjustments

Several “standard” techniques are applied to cost estimates to account for significant cost influences that may not be explicitly included in the initial estimate. Two of the more important relate to economy of scale and discounting the time value of money.

Economy of scale.
When products are made, the production quantity will usually have a significant impact on the unit cost. This phenomenon is known as economy of scale and can be explained by a number of factors (e.g., the ability to buy larger lots of raw material, itself an economy-of-scale factor; the spreading of fixed costs over a larger number of units; and learning effects). A generalized equation that reflects this phenomenon is

\[ K_{CA} = (P^{*}/P)^{a_c} \tag{15.26} \]

where \(P^{*}\) is a reference or standard production size; \(P\) is the production lot size under consideration; \(a_c\) is a positive constant; and \(K_{CA}\) is the cost adjustment factor to adjust a cost based on a production lot size of \(P^{*}\).

A typical learning curve equation reflecting reduction in time (or some other measure of resource expenditure such as man-hours) as a result of increasing usage or application (i.e., production, maintenance) is given by

\[ T(R_C) = A_t R_C^{\,b_c} \tag{15.27} \]

where \(T(R_C)\) is the time required for the \(R_C\)th unit; \(A_t\) is the time for the first unit; \(R_C\) is the cumulative unit number; and \(b_c\) is a negative constant.

If we assume that the resource usage declines by a constant percentage for each doubling of usage, the constant \(b_c\) would be determined as follows:

\[ b_c = \frac{\log_{10}(\text{percent}) - 2}{\log_{10}(2)} = \frac{\ln(\text{percent}/100)}{\ln(2)} \tag{15.28} \]

where percent is the learning-curve percentage, that is, the resource usage after a doubling expressed as a percentage of the usage before it (e.g., 90 for a 10% decline per doubling).
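Equations 15.26 through 15.28 translate directly into code. The following Python sketch uses illustrative function names and hypothetical inputs (a 90% learning curve, 100 hours for the first unit, and an economy-of-scale exponent of 0.2); none of these values come from the handbook.

```python
import math


def cost_adjustment_factor(p_ref: float, p_lot: float, a_c: float) -> float:
    """Equation 15.26: K_CA = (P*/P) ** a_c, with a_c > 0."""
    return (p_ref / p_lot) ** a_c


def learning_exponent(percent: float) -> float:
    """Equation 15.28: b_c = ln(percent / 100) / ln(2); negative when percent < 100."""
    return math.log(percent / 100.0) / math.log(2.0)


def unit_time(first_unit_time: float, unit_number: int, percent: float) -> float:
    """Equation 15.27: T(R_C) = A_t * R_C ** b_c."""
    return first_unit_time * unit_number ** learning_exponent(percent)


# Hypothetical inputs: a 90% learning curve and 100 hours for the first unit.
for unit in (1, 2, 4, 8):
    print(unit, round(unit_time(100.0, unit, 90.0), 2))

# Hypothetical inputs: doubling the lot size relative to the reference, with a_c = 0.2.
print(round(cost_adjustment_factor(1000, 2000, 0.2), 4))
```

With a 90% curve, each doubling of the cumulative unit number multiplies the unit time by 0.9, so units 1, 2, 4, and 8 take 100, 90, 81, and 72.9 hours, respectively.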
Discounting. If work is to be performed over a period of time, it should be clear that, ignoring issues such as surety of payment, it would be better to be paid up front than after the job is finished. By being paid up front, one can deposit the money in a bank so that, at the end of the job, one will have the agreed-to amount plus any interest earned. The payer, however, may recognize that, by paying up front, he is losing that potential interest because he must take the money out of the bank to make the payment. He may therefore propose paying a discounted amount that represents his loss of interest. If the discount rate matches the interest rate earned when the money is deposited, the payee will have the full amount at the end of the job and the payer’s bank account will be the same as it would be if he had paid at the end of the job.

Discounting, therefore, is a process that considers the time value of money. It should be applied to all out-year expenditures so that all costs are expressed on a common present-value basis, which is necessary for an accurate assessment. The discounted amount of an out-year expenditure is called the present value and is computed as follows:

\[ PV = \frac{C_{fe}}{(1 + i_d)^{n}} \tag{15.29} \]
where \(PV\) is the present value; \(n\) is the number of periods in the future; \(C_{fe}\) is the future expenditure, \(n\) periods in the future; and \(i_d\) is the discount rate per period.

Consider Table 15.6 as a simple example of comparing costs of two products (a 10% discount rate was used). Both products have the same total expenditure, $3,500, before discounting. However, because the majority of product B’s costs are incurred in the last 2 years, its total discounted cost is less than that of product A, where over 70% of the expenditure occurs in the first 3 years. With everything else being equal, product B is the less costly product.

Table 15.6 Example of Comparing Costs of Two Products

Year    Discount    Project A      Project A           Project B      Project B
        Factor      Expenditure    Discounted Costs    Expenditure    Discounted Costs
1       0.91        1000           910                 500            455
2       0.83        500            415                 500            415
3       0.75        1000           750                 500            375
4       0.68        500            340                 1000           680
5       0.62        500            310                 1000           620
Total               3500           2725                3500           2545

Note that discounting and inflation are two different concepts. Inflation refers to the buying power of a unit of currency when compared to some base year, and discounting refers to the value of having the money. Cost-effectiveness analyses must take the time value of money into account, but usually need not deal with inflation directly unless there are some unusual circumstances.

15.4.4 Cost Uncertainty and Cost Sensitivity

In the discussion of the types of cost estimates that can be used, we indicated that uncertainty is an issue that must be considered. There is usually uncertainty in every cost estimate, whether it be the cost of a small piece part or elemental activity in the bottom-up approach, the result of applying a CER in the top-down approach, or the adjustment of the cost of a similar product using the analogy method. The CER method provides the most direct way of dealing with uncertainty because the statistical techniques used in developing the CER usually can provide measures of uncertainty in the form of standard errors or confidence interval factors. But, even here, there may be additional uncertainty concerning the applicability of the historical data to the product and associated environment under consideration, or the prediction may be an extrapolation of the past rather than an interpolation.
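Returning to the discounting example of Section 15.4.3, the comparison in Table 15.6 can be reproduced with a short Python sketch. The function names are illustrative assumptions, and exact discount factors 1/(1 + 0.10)^n are used rather than the two-decimal factors in the table, so the totals differ slightly from the tabulated 2725 and 2545.

```python
def present_value(future_expenditure: float, rate: float, periods: int) -> float:
    """Equation 15.29: PV = C_fe / (1 + i_d) ** n."""
    return future_expenditure / (1.0 + rate) ** periods


def total_discounted_cost(expenditures, rate):
    """Sum of present values; expenditures[0] is incurred in year 1."""
    return sum(present_value(c, rate, year)
               for year, c in enumerate(expenditures, start=1))


# Yearly expenditures for years 1 to 5, taken from Table 15.6.
project_a = [1000, 500, 1000, 500, 500]
project_b = [500, 500, 500, 1000, 1000]

pv_a = total_discounted_cost(project_a, 0.10)
pv_b = total_discounted_cost(project_b, 0.10)
print(round(pv_a), round(pv_b))
print("B is less costly" if pv_b < pv_a else "A is less costly")
```

The unrounded factors give totals of roughly 2,726 for A and 2,547 for B, which leads to the same conclusion as the table: B is the less costly alternative once the timing of expenditures is considered.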
For the other two methods, it is not as easy to develop quantitative measures of uncertainty although, as indicated earlier, statistical theory can be used when summing up uncertain cost numbers in the bottom-up approach in much the same manner as in a program (or project) evaluation and review technique (PERT) analysis, where pessimistic, most likely, and optimistic activity times are used to develop a distribution of project time. For all three methods, when the costing exercise involves a large-scale product where life-cycle cost is to be estimated, there is also the concern of whether all relevant and significant cost elements have been identified.

Whether quantitative measures of uncertainty can be provided or not, it is the responsibility of the cost analyst to provide, as explicitly as possible, cost uncertainty information and to help the decision maker in evaluating the effect of the uncertainty. One way to get further insight into the effects of uncertainty is to perform a sensitivity analysis. By varying a cost variable about which one is uncertain (say, a labor rate), one can determine how sensitive the overall cost is to variations in labor rate. If two products have the same effectiveness, but A is less costly than B using best estimates of elemental costs, then this type of exercise provides an indication of how sensitive the decision to select A is to errors in the labor rate. If, for example, the same decision would be made even if the labor rate were at the extreme most favorable to B, then the labor rate uncertainty becomes much less of a concern.

15.4.5 Combining Effectiveness and Cost

We have now established some methodology for developing effectiveness and cost measures to be used to evaluate candidate products. If we have two products, A and B, we can conceptualize four possibilities when comparing their effectiveness and cost measures, as shown in Table 15.7.

Table 15.7 Comparisons of Effectiveness and Cost Measures of Two Products

Case      Effectiveness    Cost        Apparent Decision
Case 1    A better         A better    A
Case 2    A better         B better    ?
Case 3    B better         A better    ?
Case 4    B better         B better    B

Cases 1 and 4 show complete dominance of one product over another, whereas the situation is unclear regarding cases 2 and 3. Taking case 2 as an example, if A is much better than B in effectiveness, but almost equal to B in cost, then A might be chosen. But what if cost is the predominant criterion? Even for the dominant cases, the nonquantifiable factors and the uncertainty in the estimates for one product when compared to another may make the decision less obvious than it appears to be.

Another complication concerns the concept of leverage or the influence a product may have on factors not explicitly considered in the cost or effectiveness models. As an example, suppose the results in Table 15.8 were obtained for two engine types to be used in a new helicopter. On the surface, the $15 million savings in cost of engine A over that of B would make it the logical choice.
However, suppose the design of engine B will allow it, with modification, to be used in an airplane also. If, for example, the expected development cost for the airplane engine were to be reduced by $30 million if engine B were to be selected, then that amount of leverage may make B the better choice.

Table 15.8 Example of Leverage

                  Powerplant A    Powerplant B
Total cost ($)    60,000,000      75,000,000
Effectiveness     0.95            0.95
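The leverage argument reduces to simple arithmetic, sketched below in Python. The function names are illustrative, and treating leverage as a straight credit against total cost is a simplifying assumption; the text does not prescribe a formula.

```python
def net_cost(total_cost: int, leverage_credit: int = 0) -> int:
    """Total cost less any savings the choice creates elsewhere (assumed additive)."""
    return total_cost - leverage_credit


def break_even_leverage(cost_a: int, cost_b: int) -> int:
    """Leverage credit on B at which A and B have equal net cost."""
    return cost_b - cost_a


# Values from Table 15.8 and the $30 million airplane-engine saving in the text.
cost_a, cost_b = 60_000_000, 75_000_000
leverage_b = 30_000_000

print(net_cost(cost_a))                     # 60000000
print(net_cost(cost_b, leverage_b))         # 45000000
print(break_even_leverage(cost_a, cost_b))  # 15000000
```

Because the two powerplants have equal effectiveness (0.95), B becomes preferable on net cost whenever its leverage exceeds $15 million. This doubles as a sensitivity check: the decision is only as firm as the leverage estimate is relative to that threshold.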
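Figure 15.8 Equal performance curves. (Axes: cost ($) versus system characteristic, e.g., weight; one curve for each of performance levels 1, 2, and 3, with a line of minimum cost points.)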
Figure 15.9 Cost versus performance trade-off cost versus product characteristic. (Axes: cost ($) versus performance level, e.g., payload.)
Although we cannot provide a decision methodology that can be consistently used when cost and effectiveness values are provided for alternative designs or alternative products, the general agreement is that the decision maker is better off having the cost and effectiveness values than not having them. However, some techniques that can be used to help in the decision process will now be discussed.

Product design studies. We will first discuss the use of effectiveness analysis when alternative designs for a given product class are being considered. One of the desirable outputs is a cost versus effectiveness trade-off curve. Consider, for example, a new transport aircraft in which a particular performance parameter, such as payload, is of interest and some product characteristic, such as engine thrust or aircraft weight, is being analyzed. Figure 15.8 shows the relationship between cost and the product characteristic for several performance levels. In economic terms, these curves are known as isoquants (i.e., equal quantities). If we take the minimum costs for each of the performance levels and plot them on a graph of cost versus performance, we can develop a trade-off curve showing the lowest cost for any performance level. This is illustrated in Figure 15.9. The trade-off curve then becomes a useful tool for the decision maker to consider in selecting a design alternative. If, in fact, all relevant factors are included in the cost and performance measures, the curve provides an optimized solution.

Product comparison studies. These studies involve comparing two or more products designed to accomplish the same application. The products may or may not be similar in design or operation (e.g., trucks or trains can be used to transport material, and both transportation forms are capable of meeting the objectives). If it is possible to develop a cost versus effectiveness curve for each product as we described before, we may have a result similar to Figure 15.10.
Figure 15.10 Cost effectiveness curves: products A and B.

Here we see that, for effectiveness values less than \(E_0\), product A provides the lower cost solution and that, for the higher effectiveness values, product B is preferred. Two possible ways to approach this situation are fixing effectiveness or fixing costs:

• Fixed effectiveness: if a required effectiveness or performance level can be specified, then the product that meets that level at minimum cost is the preferred choice, without considering any leverage effects.
• Fixed cost: if a total cost has been budgeted, then the product that provides the maximum effectiveness for that cost level is the preferred choice, without considering any leverage effects.
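The two selection rules above can be sketched in Python as follows. The design points are hypothetical samples standing in for the cost-effectiveness curves of products A and B, and the names `Option`, `choose_fixed_effectiveness`, and `choose_fixed_cost` are illustrative. As in the text, leverage effects are ignored.

```python
from typing import NamedTuple, Optional


class Option(NamedTuple):
    name: str
    cost: float           # total cost, in any consistent unit
    effectiveness: float  # effectiveness measure, e.g., between 0 and 1


def choose_fixed_effectiveness(options, required: float) -> Optional[Option]:
    """Cheapest option whose effectiveness meets the required level."""
    feasible = [o for o in options if o.effectiveness >= required]
    return min(feasible, key=lambda o: o.cost) if feasible else None


def choose_fixed_cost(options, budget: float) -> Optional[Option]:
    """Most effective option whose cost does not exceed the budget."""
    feasible = [o for o in options if o.cost <= budget]
    return max(feasible, key=lambda o: o.effectiveness) if feasible else None


# Hypothetical design points sampled from the curves of two products.
options = [
    Option("A-low", 40.0, 0.80),
    Option("A-high", 70.0, 0.90),
    Option("B-low", 50.0, 0.85),
    Option("B-high", 65.0, 0.95),
]

print(choose_fixed_effectiveness(options, required=0.90).name)
print(choose_fixed_cost(options, budget=55.0).name)
```

Both functions return `None` when no option satisfies the constraint, which signals that the requirement or the budget needs to be revisited rather than silently picking an infeasible product.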
Fixing effectiveness or cost is the more desirable way to proceed if it makes sense to do so. In many cases, it may not be desirable to fix either effectiveness or cost, but they may be constrained. For example, the challenge to the cost-effectiveness analyst may be to provide information so that the decision maker can select the best product that has an effectiveness of at least \(E_w\) and a total cost of not more than \(C_w\). This type of situation is illustrated by the shaded region shown in Figure 15.11. Any combination of cost and performance inside the region represents an acceptable solution. For the example shown, we see that neither A nor B dominates and the choice is still not clear. One approach that is often used is to compute a ratio of effectiveness to cost; the resultant value then has a “bang for the buck” characteristic. In some cases, this is acceptable, but the approach has been criticized as one that “reaches for corners.”

Figure 15.11 Cost effectiveness curves: region of acceptability.

The absence of a standard criterion does not diminish the value of cost-effectiveness analysis. It means that as much information as possible must be made available to the decision maker. Although the information may not be amenable to being wrapped up into a neat individual number, it can be displayed in a manner that facilitates its use in conjunction with the decision maker’s expert judgments. This challenges the cost-effectiveness analyst to maintain flexibility in modeling the product so that the evaluation is adaptable to changing information requirements.

15.5 SUMMARY
A Markov model framework for combining product reliability, maintainability, and performance characteristics into an overall effectiveness measure was introduced in this chapter. Although this model framework has many simplifying characteristics, it provides a basis for analyzing complex products through appropriate extension, as illustrated by the communications system example in Section 15.2.2. The approaches and factors to consider in evaluating the effectiveness of alternative designs were discussed, with specific focus given to cost issues. The elements of the three major cost categories (research and development, initial investment, and operating) and cost estimating methods were reviewed. Also addressed were the issues of economy of scale, discounting, and cost uncertainty and sensitivity. The last section dealt with the challenge of combining cost and effectiveness values for use by the decision maker.

REFERENCE

Shooman, M. L. 1990. Probabilistic reliability: An engineering approach. Malabar, FL: Robert E. Krieger.

ADDITIONAL READING

ARINC Research Corporation. 1969. Guidebook for systems analysis/cost effectiveness. U.S. Army Electronics Command.
DARCOM P700-6 (Army), NAVMAT P5242 (Navy), AFLCP/AFSCP 800-19 (Air Force). 1977. Joint-design-to-cost guide: Life-cycle cost as a design parameter. Washington, DC: U.S. Department of Defense.
Dhillon, B. S. 1989. Life-cycle costing: Techniques, models and applications. New York: Gordon and Breach, Science.
English, J. M. 1968. Cost effectiveness: The economic evaluation of engineering systems. New York: John Wiley & Sons.
Fabrycky, W. J., and B. S. Blanchard. 1991. Life-cycle cost and economic analysis. Englewood Cliffs, NJ: Prentice Hall.
Goldman, T., ed. 1967. Cost-effectiveness analysis. New York: Frederick A. Praeger.
Michaels, J. V., and W. P. Wood. 1989. Design to cost. New York: John Wiley & Sons.
Ostwald, P. F. 1992. Engineering cost estimating, 3rd ed. Englewood Cliffs, NJ: Prentice Hall.
Quade, E. S., and W. I. Boucher. 1968. Systems analysis and policy planning. New York: American Elsevier.
Taguchi, G. 1992. Taguchi methods: Research and development. Englewood, CO: ASI Press.
CHAPTER 16
Process Capability and Process Control
Diganta Das, Michael Pecht

CONTENTS
16.1 Introduction
16.2 Average Outgoing Quality
16.3 Process Capability
16.4 Statistical Process Control
16.4.1 Control Charts: Recognizing Sources of Variation
16.4.1.1 Constructing a Control Chart
16.5 Examples of Control Chart Constants
References
Homework Problems

16.1 INTRODUCTION
Quality is a measure of a product’s ability to meet the workmanship criteria. This chapter introduces the concepts of process capability and the basics of the statistical process control techniques used to attain and maintain part and product quality.
16.2 AVERAGE OUTGOING QUALITY

A measure of product quality is average outgoing quality (AOQ). It is typically defined as the number of products per million (ppm) that are outside manufacturer specification limits during the final quality control inspection (Ackermann and Fabia 1993). A high AOQ indicates a high defective count and therefore a poor quality level.

\[ \text{AOQ} = \frac{\text{Shaded area under the process curve}}{\text{Total area under the process curve}} \times 10^{6} \tag{16.1} \]
Figure 16.1 Visualization of average outgoing quality.
where USL is the upper specification limit, LSL is the lower specification limit, and \(\mu\) is the process mean. For example, manufacturers conduct visual, mechanical, and electrical tests to measure AOQ of electronic products. Visual and mechanical tests include dimensions, solderability, and bent leads. Electrical tests include functional and parametric tests at room temperature, high temperature, and low temperature. AOQ is defined in Equation 16.1, referring to Figure 16.1.

The formulae for AOQ calculations may differ among manufacturers. For example, the formula for AOQ based on JEDEC standard JESD 16-A (JEDEC 1995) is

\[ \text{AOQ} = P \times \text{LAR} \times 10^{6} = \frac{D}{N} \times \text{LAR} \times 10^{6} = \frac{D}{N} \times \frac{AL}{TL} \times 10^{6} \tag{16.2} \]

where \(D\) is the total number of defective products; \(N\) is the total number of products tested; LAR is the lot acceptance rate; \(AL\) is the total number of accepted lots; and \(TL\) is the total number of lots tested. IDT, a semiconductor manufacturer, provided AOQ based on the following formula:

\[ \text{AOQ} = P \times 10^{6} = \frac{D}{N} \times 10^{6} \tag{16.3} \]
where D is the total number of defective products and N is the total number of products tested. Most manufacturers provide data in terms of the number of defective products, D, and the sample size, N.
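Both AOQ formulae can be covered by one small Python function, sketched here with illustrative names and hypothetical inspection data. With the default lot counts the lot acceptance rate is 1 and the result matches Equation 16.3; supplying lot counts gives the form of Equation 16.2.

```python
def aoq_ppm(defective: int, tested: int, lots_accepted: int = 1, lots_tested: int = 1) -> float:
    """AOQ in ppm: (D / N) * (AL / TL) * 1e6; default lot counts give LAR = 1."""
    if tested <= 0 or lots_tested <= 0:
        raise ValueError("tested and lots_tested must be positive")
    fraction_defective = defective / tested            # P = D / N
    lot_acceptance_rate = lots_accepted / lots_tested  # LAR = AL / TL
    return fraction_defective * lot_acceptance_rate * 1e6


# Hypothetical inspection data: 25 defective out of 500,000 tested; 98 of 100 lots accepted.
print(aoq_ppm(25, 500_000))           # Equation 16.3 form
print(aoq_ppm(25, 500_000, 98, 100))  # Equation 16.2 form
```

For these made-up numbers the two forms give about 50 ppm and 49 ppm, which illustrates why AOQ figures from different manufacturers should only be compared once the formula behind them is known.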
16.3 PROCESS CAPABILITY
AOQ is a measure of the quality of products as they leave the production facility. In contrast, process capability is a measure of conformance to customer requirements and is typically measured at key process steps. A process capability assessment is conducted to determine whether a process, given its natural variation, is capable of meeting established requirements or specifications. It can help to identify changes that have been made in the process and determine the percent of product or service that does not meet the requirements. If the process is incapable of making products that conform to the specifications, then a different process needs to be selected or, in some cases, specifications may have to be changed because they may have been set in an unrealistic manner.

Figure 16.2 shows specification limits of a product; these are usually based solely on the customer requirements and are not meant to reflect on the capability of the process. Specification limits are used to determine if the products will meet the expectations of the customer. Figure 16.2 overlays a normal distribution curve on top of the specification limits.

To determine the process capability, the first step is to determine the process grand average, \(\bar{\bar{X}}\), and the average range, \(\bar{R}\). This is followed by determination of the USL and the LSL. The process standard deviation, \(\sigma\), is then estimated, using the control charts, by

\[ \hat{\sigma} = \frac{\bar{R}}{d_2} \quad \text{or} \quad \hat{\sigma} = \frac{\bar{s}}{c_4} \tag{16.4} \]
where \(\bar{R}\) and \(\bar{s}\) are the averages of the subgroup ranges and standard deviations for a period when the process was known to be in control, and \(d_2\) and \(c_4\) are the associated constant values based on the subgroup sample sizes. The process average can be estimated by \(\bar{X}\), \(\bar{\bar{X}}\), and \(\tilde{X}\).

A stable process can be characterized by its spread, taken as six standard deviations. Comparing six standard deviations of the process variation to the customer specifications provides a measure of capability. Some measures of capability include \(C_p\), \(C_r\) (inverse of \(C_p\)), \(C_{pl}\), \(C_{pu}\), and \(C_{pk}\). \(C_p\) is calculated using the following equation:

\[ C_p = \frac{USL - LSL}{6\hat{\sigma}} \tag{16.5} \]

Figure 16.2 Measuring conformance to the customer requirements. (Labels: lower specification limit (LSL), upper specification limit (USL), and specification width.)

Figure 16.3 Cp, simple process capability. (Panels: Cp < 1, Cp = 1, and Cp > 1, each shown against LSL and USL.)
The \(C_p\) value can predict the reject rate of new products by using normal probability distribution curves. When \(C_p < 1\), the process variation exceeds specification and defective products are being made. When \(C_p = 1\), the process is just meeting specification. A minimum of 0.3% defective products will be made in this condition, and more if the process is not centered. When \(C_p > 1\), the process variation is less than the specification; however, defective products might be made if the process is not centered on the target value. Figure 16.3 shows three cases of \(C_p\) values in their relation to the specification limits.

The indices \(C_{pl}\) and \(C_{pu}\) (for single-sided specification limits) and \(C_{pk}\) (for two-sided specification limits) not only measure the process variation with respect to the specification, but also take into account the location of the process average. Capability describes how well centered the curve is in the specification spread and how tight the variation is. \(C_{pk}\) is considered a measure of the process capability and is taken as the smaller of \(C_{pl}\) and \(C_{pu}\). If the process is near normal and in statistical control, \(C_{pk}\) can be used to estimate the expected percentage of defective products.

\[ C_{pl} = \frac{\bar{\bar{X}} - LSL}{3\hat{\sigma}}, \qquad C_{pu} = \frac{USL - \bar{\bar{X}}}{3\hat{\sigma}} \tag{16.6} \]

\[ C_{pk} = \min\{C_{pu}, C_{pl}\} \tag{16.7} \]
Figure 16.4 Process not capable of meeting parts. (Labels: LSL, target, and USL; the distances \(\bar{\bar{X}} - LSL\) and \(USL - \bar{\bar{X}}\), each compared with the actual spread \(3\hat{\sigma}\) to give \(C_{pl}\) and \(C_{pu}\).)
Figure 16.4 shows an example of a process not capable of meeting targets. For the process in this figure, \(C_p > 1\), but the process is incapable because it is not centered between LSL and USL. If the process is not capable of consistently making products to specification, common causes of the variation in the process must be identified and corrected. Examples of common remedies include assigning another machine to the process, procuring a new piece of equipment, providing additional training to reduce operator variations, and requiring vendors to implement statistical process controls.

EXAMPLE 16.1

In the die-cutting process, a control chart was maintained, producing the following statistics: \(\bar{\bar{X}} = 212.5\), \(\bar{R} = 1.2\), and \(n = 5\). The specification limit for this process is 210 ± 3. This means that USL = 213 and LSL = 207. Calculate \(C_p\) and \(C_{pk}\) for this process. Also, find the number of defects.

Solution:

\[ \hat{\sigma} = \frac{\bar{R}}{d_2} = \frac{1.2}{2.326} = 0.516 \]

\[ C_p = \frac{USL - LSL}{6\hat{\sigma}} = \frac{213 - 207}{6(0.516)} = \frac{6}{3.096} = 1.938 \]

\[ C_{pl} = \frac{\bar{\bar{X}} - LSL}{3\hat{\sigma}} = \frac{212.5 - 207}{3(0.516)} = \frac{5.5}{1.548} = 3.553 \]

\[ C_{pu} = \frac{USL - \bar{\bar{X}}}{3\hat{\sigma}} = \frac{213 - 212.5}{3(0.516)} = \frac{0.5}{1.548} = 0.323 \]

\[ C_{pk} = \min\{C_{pl}, C_{pu}\} = 0.323 \]
Figure 16.5 Schematic of a process that is not capable. (Horizontal axis with ticks from 207 to 215; LSL, USL, and the process average marked.)
Because \(C_{pk} < 1\), defective material is being made. Figure 16.5 shows the schematic of the problem.

Defects calculation: If the process is near normal and in statistical control, the process of calculating \(C_{pk}\) can also be used to estimate the expected percent of defective material. The area under the curve outside the specification limits is used to determine the number of defects. To determine the area under the curve, the following factors must be calculated:

\[ z_1 = \frac{LSL - \bar{\bar{X}}}{\hat{\sigma}} = \frac{207 - 212.5}{0.516} = -10.66 \]

\[ z_2 = \frac{USL - \bar{\bar{X}}}{\hat{\sigma}} = \frac{213 - 212.5}{0.516} = 0.969 \]

Defects below the LSL = \(F(z_1)\); here, \(F(z_1) = 0\) (approximately). Defects above the USL = \(1 - F(z_2)\); here, \(1 - F(z_2) = 1 - 0.832 = 0.168\). \(F(z) = P(Z \le z)\) is the cumulative distribution value for any value of \(z\) obtained from the standard normal distribution table as shown in Figure 16.6. Total defects = \(F(z_1) + [1 - F(z_2)]\) = 16.8%.
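The calculations of Example 16.1 can be checked with a short Python sketch. The function names are illustrative assumptions, and the standard normal cumulative distribution is evaluated with the error function from the standard library instead of being read from a table.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution F(z) = P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def capability(x_bar: float, r_bar: float, d2: float, lsl: float, usl: float) -> dict:
    """Cp, Cpl, Cpu, Cpk and the expected fraction outside the limits (normal assumption)."""
    sigma = r_bar / d2                    # Equation 16.4
    cp = (usl - lsl) / (6.0 * sigma)      # Equation 16.5
    cpl = (x_bar - lsl) / (3.0 * sigma)   # Equation 16.6
    cpu = (usl - x_bar) / (3.0 * sigma)   # Equation 16.6
    z1 = (lsl - x_bar) / sigma
    z2 = (usl - x_bar) / sigma
    defects = normal_cdf(z1) + (1.0 - normal_cdf(z2))
    return {"sigma": sigma, "Cp": cp, "Cpl": cpl, "Cpu": cpu,
            "Cpk": min(cpl, cpu), "defect_fraction": defects}


# Inputs from Example 16.1; d2 = 2.326 is the constant used there for n = 5.
result = capability(x_bar=212.5, r_bar=1.2, d2=2.326, lsl=207.0, usl=213.0)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because the sketch carries the unrounded standard deviation and evaluates F(z) directly, its results can differ from the hand calculation in the last digit; the defect fraction comes out near 0.166 rather than the table-based 0.168.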
16.4 STATISTICAL PROCESS CONTROL

Statistical process control (SPC) is a technique that uses a measure of central tendency (the average) and a measure of dispersion (the range) to monitor sampled measurements of the quality characteristics of a process, instead of inspecting results after the process has produced a product.
16.4.1 Control Charts: Recognizing Sources of Variation

A control chart is a type of trend chart with statistically determined upper and lower control limits. It is used to determine if a process is “in control.” A process is said to be in control when the variation within the process is consistently random and within predictable (control) limits. Control charts are used to assess process variations and their sources and to monitor, control, and improve process performance over time. Random variation results from the interaction of the steps within a process. When the performance falls outside the control limits, assignable variation may be the cause. Assignable variation can be attributed to special causes. A control chart helps determine what type of variation is present within the process: using it, one can distinguish special causes of variation from common causes of variation. Control charts can serve as an ongoing control tool and help improve the process to perform consistently and predictably.

16.4.1.1 Constructing a Control Chart

There are many types of control charts. The appropriate control chart depends on the type of data. Figure 16.7 presents the different types of data and the associated control charts. Figure 16.8 shows a guideline to select the control chart based on the information from Figure 16.7. To construct a control chart, follow the steps shown in Figure 16.9. To calculate appropriate statistics, one needs to know the method and the constants for that method. Constants and different formulae that are used in
Figure 16.6 Sample cumulative normal distribution table and its use for defect estimate. Entries are F(z) = P(Z ≤ z); the column heading is added to the row value of z.

z        0.00     0.02     0.04     0.06     0.08
−4.00    0.0000   0.0000   0.0000   0.0000   0.0000
−3.80    0.0001   0.0001   0.0001   0.0001   0.0001
:        :        :        :        :        :
−3.00    0.0013   0.0014   0.0015   0.0016   0.0018
−2.90    0.0019   0.0020   0.0021   0.0023   0.0024
−2.80    0.0026   0.0027   0.0029   0.0031   0.0033
−2.70    0.0035   0.0037   0.0039   0.0041   0.0044
:        :        :        :        :        :
0.00     0.5000   0.5080   0.5160   0.5239   0.5319
0.10     0.5398   0.5478   0.5557   0.5636   0.5714
:        :        :        :        :        :
0.70     0.7580   0.7642   0.7704   0.7764   0.7823
0.80     0.7881   0.7939   0.7995   0.8051   0.8106
0.90     0.8159   0.8212   0.8264   0.8315   0.8365
1.00     0.8413   0.8461   0.8508   0.8554   0.8599
:        :        :        :        :        :
2.80     0.9974   0.9976   0.9977   0.9979   0.9980
2.90     0.9981   0.9982   0.9984   0.9985   0.9986
3.00     0.9987   0.9987   0.9988   0.9989   0.9990
Figure 16.7 A process to select the appropriate control chart.

Data
  Attribute data: counted and plotted as discrete events
    Defects
      Constant sample size, usually > 5: c chart
      Variable sample size: u chart
    Defectives
      Constant sample size, usually ≥ 50: np chart
      Variable sample size, usually ≥ 50: p chart
  Variable data: measured and plotted on a continuous scale
    Sample size = 1: X and Rm
    Sample size is small, median value: $\tilde{X}$ and R
    Sample size is large, usually > 10: $\bar{X}$ and s
    Sample size is small, usually 3 to 5: $\bar{X}$ and R
constructing control charts are shown in Tables 16.1 and 16.2 for variable and attribute data, respectively. Use Tables 16.3 and 16.4 for the values of the constants in the formulae. When interpreting control charts, one should determine if the process mean (center line) is where it should be relative to specifications or objectives. If the
Figure 16.8 Guidelines for selecting control charts.

Start: Type of data?
  Variable data (measurable): Subgroup size = 1?
    Yes: X-moving range chart
    No: Avoid math?
      Yes: Median and R chart
      No: Ranges or standard deviations?
        Ranges (n < 10): $\bar{X}$ and R chart
        Standard deviations (n > 10): $\bar{X}$ and S chart
  Attribute data (countable): Counting nonconforming items rather than nonconformities?
    No (nonconformities): Equal-sized subgroups?
      Yes: c or u chart
      No: u chart
    Yes (nonconforming items): Equal-sized subgroups?
      Yes: np or p chart
      No: p chart
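The selection logic of Figure 16.7 can be written as a small decision function. This is a sketch only: the function name, argument names, and string labels are illustrative, the median chart branch is omitted, and the boundary at n = 10 follows Table 16.1 (fewer than 10 for ranges, 10 or more for standard deviations).

```python
def select_control_chart(data_type, subgroup_size=None, counts=None,
                         constant_sample_size=True):
    """Suggest a control chart type from the kind of data being collected.

    data_type: "attribute" or "variable"
    counts: for attribute data, "defects" or "defectives"
    subgroup_size: for variable data, the number of measurements per subgroup
    """
    if data_type == "attribute":
        if counts == "defects":
            return "c chart" if constant_sample_size else "u chart"
        if counts == "defectives":
            return "np chart" if constant_sample_size else "p chart"
        raise ValueError("counts must be 'defects' or 'defectives'")
    if data_type == "variable":
        if subgroup_size is None or subgroup_size < 1:
            raise ValueError("subgroup_size must be a positive integer")
        if subgroup_size == 1:
            return "X and Rm chart"
        if subgroup_size < 10:
            return "X-bar and R chart"
        return "X-bar and s chart"
    raise ValueError("data_type must be 'attribute' or 'variable'")


print(select_control_chart("variable", subgroup_size=5))
print(select_control_chart("attribute", counts="defectives"))
```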
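Table 16.1 Variable Data

Type of control chart: Average and range ($\bar{X}$ and R)
  Sample size: < 10, but usually 3–5
  Central line (a): $\bar{\bar{X}} = (\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k)/k$ and $\bar{R} = (R_1 + R_2 + \cdots + R_k)/k$
  Control limits: $UCL_{\bar{x}} = \bar{\bar{X}} + A_2\bar{R}$, $LCL_{\bar{x}} = \bar{\bar{X}} - A_2\bar{R}$, $UCL_R = D_4\bar{R}$, $LCL_R = D_3\bar{R}$

Type of control chart: Average and standard deviation ($\bar{X}$ and S)
  Sample size: usually ≥ 10
  Central line (a): $\bar{\bar{X}} = (\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k)/k$ and $\bar{S} = (S_1 + S_2 + \cdots + S_k)/k$
  Control limits: $UCL_{\bar{x}} = \bar{\bar{X}} + A_3\bar{S}$, $LCL_{\bar{x}} = \bar{\bar{X}} - A_3\bar{S}$, $UCL_S = B_4\bar{S}$, $LCL_S = B_3\bar{S}$

Type of control chart: Median and range ($\tilde{X}$ and R)
  Sample size: < 10, but usually 3–5
  Central line (a): $\bar{\tilde{X}} = (\tilde{X}_1 + \tilde{X}_2 + \cdots + \tilde{X}_k)/k$ and $\bar{R} = (R_1 + R_2 + \cdots + R_k)/k$
  Control limits: $UCL_{\tilde{x}} = \bar{\tilde{X}} + \tilde{A}_2\bar{R}$, $LCL_{\tilde{x}} = \bar{\tilde{X}} - \tilde{A}_2\bar{R}$, $UCL_R = D_4\bar{R}$, $LCL_R = D_3\bar{R}$

Type of control chart: Individuals and moving range (X and Rm)
  Sample size: 1
  Central line (a): $\bar{X} = (X_1 + X_2 + \cdots + X_k)/k$, with $R_m = |X_{i+1} - X_i|$ and $\bar{R}_m = (R_1 + R_2 + \cdots + R_{k-1})/(k - 1)$
  Control limits: $UCL_x = \bar{X} + E_2\bar{R}_m$, $LCL_x = \bar{X} - E_2\bar{R}_m$, $UCL_{R_m} = D_4\bar{R}_m$, $LCL_{R_m} = D_3\bar{R}_m$

(a) k = number of subgroups; $\tilde{X}$ = median value within each subgroup; and $\bar{X} = \sum X_i / n$.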
process mean is not where it should be, then either the process or objectives have changed. To distinguish between common causes and special causes, data relative to control limits must be analyzed. Upper and lower control limits are not specification limits and do not make a value judgment (good, bad, marginal) about a process. To analyze the data, follow the steps in Figure 16.9 and Figure 16.10. To use a control chart as a monitoring tool, all special causes must be eliminated. The chart will show when special causes reemerge. A common cause is defined as a deviation from the mean due to statistical errors such as normal variations in measurements. Parts affected by random causes usually fall within the control limits. A special cause is a deviation due to a process failure such as a malfunctioning machine. Parts
Table 16.2 Attribute Data
Table 16.3 Constants

                 X(bar) and R Chart         X(bar) and s Chart
Sample Size n    A2      D3      D4         A3      B3      B4      c4
2                1.880   0       3.267      2.659   0       3.267   0.7979
3                1.023   0       2.574      1.954   0       2.568   0.8862
4                0.729   0       2.282      1.628   0       2.266   0.9213
5                0.577   0       2.114      1.427   0       2.089   0.9400
6                0.483   0       2.004      1.287   0.030   1.970   0.9515
7                0.419   0.076   1.924      1.182   0.118   1.882   0.9594
8                0.373   0.136   1.864      1.099   0.184   1.815   0.9650
9                0.337   0.184   1.816      1.032   0.239   1.761   0.9693
10               0.308   0.223   1.777      0.975   0.284   1.716   0.9727
Table 16.4 Constants

                 Median and R Chart         X and Rm Chart
Sample Size n    A2      D3      D4         E2      D3      D4      d2
2                -       0       3.267      2.659   0       3.267   1.128
3                1.187   0       2.574      1.772   0       2.574   1.693
4                -       0       2.282      1.457   0       2.282   2.059
5                0.691   0       2.114      1.290   0       2.114   2.326
6                -       0       2.004      1.184   0       2.004   2.534
7                0.509   0.076   1.924      1.109   0.076   1.924   2.704
8                -       0.136   1.864      1.054   0.136   1.864   2.847
9                0.412   0.184   1.816      1.010   0.184   1.816   2.970
10               -       0.223   1.777      0.975   0.223   1.777   3.078
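Two of the tabulated constants can be reproduced from d2. The sketch below relies on the standard three-sigma relationships A2 = 3/(d2·√n) and E2 = 3/d2, which the handbook does not state explicitly; the dictionary and function names are illustrative.

```python
import math

# d2 values copied from Table 16.4, keyed by sample size n.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534,
      7: 2.704, 8: 2.847, 9: 2.970, 10: 3.078}


def a2(n):
    """Factor for X-bar chart limits: three sigma of a subgroup mean, in units of R-bar."""
    return 3.0 / (D2[n] * math.sqrt(n))


def e2(n):
    """Factor for individuals chart limits: three sigma of one value, in units of R-bar."""
    return 3.0 / D2[n]


for n in sorted(D2):
    print(f"n={n:2d}  A2={a2(n):.3f}  E2={e2(n):.3f}")
```

Seeing where the factors come from makes it clearer why the X(bar) limits tighten as the subgroup size grows: the standard deviation of a subgroup mean shrinks with the square root of n.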
Start
Select the process to be charted, and allow it to run according to standard procedure. Determine the sampling method and plan.
How large a sample can be drawn?
Can all samples be drawn from the same conditions?
Does data shift during different times or due to other factors? (E.g., do traffic patterns change during rush hour?)
Can a baseline be developed from historical data?
Initiate data collection by running the process, gathering data, and recording it properly.
Generally, collect 20–25 random samples.
Calculate the appropriate statistics.
Figure 16.9 Ten steps in control chart construction.
Calculate the control limits using the appropriate formulas.
Construct the control chart(s).
For attribute data, construct one chart, plotting each subgroup’s proportion or number defective, number of defects, or defects per unit.
For variable data, construct one chart with each subgroup’s mean, median, or individual, and a second chart with subgroup’s range or standard deviation.
On all charts, draw a solid horizontal line showing the process average, and dashed horizontal lines for the upper and lower limits.
Figure 16.10 Data analysis process for control charts.
Figure 16.11 Guidelines to distinguish out-of-control process.
Table 16.5 Rules to Detect Out-of-Control Processes

1. One or more points fall outside control limits.
2. Two out of three consecutive points are in zone A.
3. Four out of five consecutive points are in zone A or B.
4. Nine consecutive points are on one side of the average.
5. Six consecutive points are increasing or decreasing.
6. Fourteen consecutive points alternate up and down.
7. Fifteen consecutive points within zone C.
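Three of the rules in Table 16.5 (rules 1, 4, and 5) can be sketched as simple scans over the plotted points. The function names and the convention of returning the index of the point that completes each violation are assumptions made for illustration; the zone-based rules are not covered here.

```python
def rule_outside_limits(points, lcl, ucl):
    """Rule 1: indices of points that fall outside the control limits."""
    return [i for i, x in enumerate(points) if x > ucl or x < lcl]


def rule_run_one_side(points, center, run_length=9):
    """Rule 4: indices where run_length consecutive points sit on one side of the average."""
    hits, run, side = [], 0, 0
    for i, x in enumerate(points):
        s = (x > center) - (x < center)   # +1 above, -1 below, 0 on the center line
        if s != 0 and s == side:
            run += 1
        else:
            side = s
            run = 1 if s != 0 else 0
        if run >= run_length:
            hits.append(i)
    return hits


def rule_trend(points, run_length=6):
    """Rule 5: indices where run_length consecutive points steadily rise or fall."""
    hits, run, direction = [], 1, 0
    for i in range(1, len(points)):
        d = (points[i] > points[i - 1]) - (points[i] < points[i - 1])
        if d != 0 and d == direction:
            run += 1
        else:
            direction = d
            run = 2 if d != 0 else 1      # a single step already spans two points
        if run >= run_length:
            hits.append(i)
    return hits


print(rule_outside_limits([1.0, 2.5, 0.3], lcl=0.49, ucl=2.41))
print(rule_trend([1, 2, 3, 4, 5, 6]))
```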
Figure 16.12 Examples of out-of-control process from Table 16.5. (Three panels plot measured values against time period, with zones A, B, and C marked on each side of the center line between the UCL and the LCL: one or more points fall outside control limits; six consecutive points are increasing or decreasing; fifteen consecutive points within zone C.)
Table 16.6 Common Questions for Investigating an Out-of-Control Process

Are there differences in the measurement accuracy of instruments/methods used?
Are there differences in the methods used by different personnel?
Is the process affected by the environment (e.g., temperature and humidity)?
Has there been a significant change in the environment?
Is the process affected by predictable conditions (e.g., tool wear)?
Were any untrained personnel involved in the process at the time?
Has there been a change in the source for input to the process (e.g., raw materials, information)?
Is the process affected by employee fatigue?
Has there been a change in policies or procedures (e.g., maintenance procedures)?
Is the process adjusted frequently?
Did the samples come from different parts of the process? Shifts? Individuals?
Are employees afraid to report “bad news”?
affected by special causes usually fall outside the control limits or demonstrate unusual patterns, such as all points being on the upper control limit (UCL) line. A process is in statistical control if it is not affected by special causes. Statistical control means only that the process is consistent; the process must also be checked to see if it meets specification limits. After detecting special causes, one should change the process to fix them; common causes are a fact of life, and trying to change them may result in worse deviations later. As long as the process does not change, control limits should not change. There are seven rules to detect out-of-control processes, as shown in Table 16.5. Examples of the rules are shown graphically in Figure 16.12. After identifying an out-of-control process, a series of actions must take place in order to bring the process back in control. Table 16.6 lists questions to ask during this investigation. A team should address any “yes” answer as a potential source of a special cause.
16.5 EXAMPLES OF CONTROL CHART CONSTANTS

In this section, we provide examples of several types of control charts, including the X(bar)-chart, which displays the variation in the average of a measurement series; the r-chart, which displays the variation in the range of a measurement series; the c-chart, which displays the variation in the number of defects; the u-chart, which displays the variation in the number of defects per unit; the p-chart, which displays the variation in the fraction of defective units; and the np-chart, which displays the variation in the number of defective units.
Table 16.7 Data for Machine Shop Part

Group No.   A     B     C     D     E     X(bar)   R
1           1.4   1.2   1.3   1.4   1.2
2           1.3   1.2   1.3   1.5   1.3
3           1.7   1.3   1.4   1.2   1.2
4           1.4   1.2   1.3   1.3   1.4
5           1.5   1.1   1.7   1.3   1.3
6           1.8   1.2   1.5   1.5   1.4
7           1.5   1.2   1.3   1.3   1.2
8           1.7   1.7   1.2   1.2   1.1
9           1.8   1.8   1.7   1.8   1.5
10          1.1   1.2   1.8   1.6   1.3
11          1.2   1.3   1.4   1.4   1.4
12          1.3   1.9   1.9   1.5   1.5
13          1.4   1.8   1.7   1.1   1.3
14          1.8   1.9   1.5   1.4   1.4
15          1.1   1.3   1.1   1.8   1.5
16          1.8   1.9   1.7   1.6   1.3
17          1.2   1.4   1.3   1.2   1.4
18          1.1   1.1   1.7   1.2   1.3
19          1.8   1.6   1.5   1.7   1.8
20          1.1   1.3   1.3   1.4   1.3
EXAMPLE 16.2
Analyze the weights of a specific part made in a machine shop using X(bar)- and r-charts. The machine shop sampled the parts at 20 different times (groups), and each group had five measurements (samples). (Data are given in Table 16.7.)

Solution: Because we have variable data with a constant sample size of 5, we choose X(bar)- and r-charts.

Mean and range calculation for each group: The mean ($\bar{X}$) = sum of the samples within the group divided by the group size. For group 1, $\bar{X}_1 = (1.4 + 1.2 + 1.3 + 1.4 + 1.2)/5 = 1.3$. The range (R) = difference between the largest observation within a group and the smallest observation within that group: $R_1 = 1.4 - 1.2 = 0.2$. Compute the totals of the $\bar{X}$ and R columns.

Average mean and average range calculation: Overall average ($\bar{\bar{X}}$) = total $\bar{X}$/total number of groups = 28.54/20 = 1.43.
Table 16.8 Add Calculated Data to the Chart

Group No.   A     B     C     D     E     X(bar)   R
1           1.4   1.2   1.3   1.4   1.2   1.30     0.2
2           1.3   1.2   1.3   1.5   1.3   1.32     0.3
3           1.7   1.3   1.4   1.2   1.2   1.36     0.5
4           1.4   1.2   1.3   1.3   1.4   1.32     0.2
5           1.5   1.1   1.7   1.3   1.3   1.38     0.6
6           1.8   1.2   1.5   1.5   1.4   1.48     0.6
7           1.5   1.2   1.3   1.3   1.2   1.30     0.3
8           1.7   1.7   1.2   1.2   1.1   1.38     0.6
9           1.8   1.8   1.7   1.8   1.5   1.72     0.3
10          1.1   1.2   1.8   1.6   1.3   1.40     0.7
11          1.2   1.3   1.4   1.4   1.4   1.34     0.2
12          1.3   1.9   1.9   1.5   1.5   1.62     0.6
13          1.4   1.8   1.7   1.1   1.3   1.46     0.7
14          1.8   1.9   1.5   1.4   1.4   1.60     0.5
15          1.1   1.3   1.1   1.8   1.5   1.36     0.7
16          1.8   1.9   1.7   1.6   1.3   1.66     0.6
17          1.2   1.4   1.3   1.2   1.4   1.30     0.2
18          1.1   1.1   1.7   1.2   1.3   1.28     0.6
19          1.8   1.6   1.5   1.7   1.8   1.68     0.3
20          1.1   1.3   1.3   1.4   1.3   1.28     0.3
Total                                     28.54    9.0
It is also called the grand average. The grand average is used as the center line for the X(bar)-chart. Average of all group ranges ($\bar{R}$) = total R/total number of groups = 9.0/20 = 0.45. It is used as the center line (average) for the range chart (Table 16.8).

Control limits calculation:

$$UCL_{\bar{X}} = \bar{\bar{X}} + A_2\bar{R} = 1.43 + (0.577 \times 0.45) = 1.69$$

$$LCL_{\bar{X}} = \bar{\bar{X}} - A_2\bar{R} = 1.43 - (0.577 \times 0.45) = 1.17$$

About 99.73% (three sigma limits) of the average values should fall between 1.17 and 1.69.

$$UCL_R = D_4\bar{R} = 2.114 \times 0.45 = 0.951$$

$$LCL_R = D_3\bar{R} = 0 \times 0.45 = 0$$

About 99.73% (three sigma limits) of the sample ranges should fall between 0 and 0.951.
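The calculations of Example 16.2 follow a fixed pattern that is easy to script. In this sketch the function names are illustrative; the subgroup statistics are demonstrated on groups 1 and 2 of Table 16.7, and the limits use the rounded grand average (1.43) and average range (0.45) from the example with the n = 5 constants from Table 16.3.

```python
def subgroup_stats(subgroups):
    """Return the mean and the range of each subgroup."""
    means = [sum(g) / len(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    return means, ranges


def xbar_r_limits(grand_mean, r_bar, a2, d3, d4):
    """Return (LCL, center line, UCL) for the X-bar chart and for the R chart."""
    return {
        "xbar": (grand_mean - a2 * r_bar, grand_mean, grand_mean + a2 * r_bar),
        "range": (d3 * r_bar, r_bar, d4 * r_bar),
    }


# Groups 1 and 2 of Table 16.7.
means, ranges = subgroup_stats([[1.4, 1.2, 1.3, 1.4, 1.2],
                                [1.3, 1.2, 1.3, 1.5, 1.3]])

# Summary values from Example 16.2; A2, D3, D4 for n = 5.
limits = xbar_r_limits(1.43, 0.45, a2=0.577, d3=0.0, d4=2.114)
print([round(m, 2) for m in means], [round(r, 1) for r in ranges])
print({name: tuple(round(v, 3) for v in vals) for name, vals in limits.items()})
```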
EXAMPLE 16.3
The weights of a product made in the machine shop are given in Table 16.9. For speed of production, only one sample was evaluated in each observation period. Analyze the weights of a specific part made in a machine shop using the MR-chart.

Solution: Because we have variable data and there is only one product (unit) in each sample, we choose a moving range chart.

Calculating the MR: $MR_n = |X_n - X_{n-1}|$ = absolute value of the difference between consecutive observations. This is also known as the two-sample moving range (the most common form of moving range). There is no range for the first observation. The first MR value works out to $|1.4 - 1.3| = 0.1$. Calculate the totals of the sample (X) and MR columns as shown in Table 16.10.

Average the mean and group range calculation:
Table 16.9 Data for Example 16.3

Observation No.   Sample (X)   MR
1                 1.4
2                 1.3
3                 1.7
4                 1.4
5                 1.5
6                 1.8
7                 1.5
8                 1.7
9                 1.8
10                1.1
11                1.2
12                1.3
13                1.4
14                1.8
15                1.1
16                1.8
17                1.2
18                1.0
19                1.8
20                1.1
Total             28.9
Table 16.10 Add Calculated Data to the Chart

Observation No.   Sample (X)   MR
1                 1.4          N/A
2                 1.3          0.1
3                 1.7          0.4
4                 1.4          0.3
5                 1.5          0.1
6                 1.8          0.3
7                 1.5          0.3
8                 1.7          0.2
9                 1.8          0.1
10                1.1          0.7
11                1.2          0.1
12                1.3          0.1
13                1.4          0.1
14                1.8          0.4
15                1.1          0.7
16                1.8          0.7
17                1.2          0.6
18                1.0          0.2
19                1.8          0.8
20                1.1          0.7
Total             28.9         6.9
The overall average ($\bar{X}$) = sum of the measurements/number of observations = 28.90/20 = 1.45. $\bar{X}$ is also called the grand average and is used as the center line for the X chart. Average of all moving ranges ($\overline{MR}$) = total MR/number of ranges = 6.9/19 = 0.36. $\overline{MR}$ is used as the center line (average) for the MR chart.

Determining control limits:

$$UCL_X = \bar{X} + (E_2 \times \overline{MR}) = 1.45 + (2.659 \times 0.36) = 2.41$$

$$LCL_X = \bar{X} - (E_2 \times \overline{MR}) = 1.45 - (2.659 \times 0.36) = 0.49$$

$$UCL_{MR} = D_4 \times \overline{MR} = 3.267 \times 0.36 = 1.18$$

$$LCL_{MR} = D_3 \times \overline{MR} = 0 \times 0.36 = 0$$

Note: The sample size used to obtain the values for E2, D3, and D4 is two in this case because we are using a two-sample moving range in this example. If a
Figure 16.13 Mean chart. (Weight of Parts: X Bar, plotted against group number 1 to 20, with UCL, CL, and LCL marked.)
three-sample moving range is used, the number of ranges will reduce to 18, and values of constants used will change accordingly.
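The individuals and moving range calculation of Example 16.3 can be reproduced from the 20 observations in Table 16.9. This is a sketch with illustrative names; it uses the n = 2 constants from Table 16.4 and carries the averages unrounded, so the limits for X differ slightly in the second decimal from the hand calculation, which rounds the averages to 1.45 and 0.36 first.

```python
# Observations from Table 16.9.
weights = [1.4, 1.3, 1.7, 1.4, 1.5, 1.8, 1.5, 1.7, 1.8, 1.1,
           1.2, 1.3, 1.4, 1.8, 1.1, 1.8, 1.2, 1.0, 1.8, 1.1]


def individuals_mr_limits(observations, e2=2.659, d3=0.0, d4=3.267):
    """Return center lines and control limits for X and two-sample MR charts."""
    # Two-sample moving range: absolute difference of consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(observations, observations[1:])]
    x_bar = sum(observations) / len(observations)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "x_bar": x_bar,
        "mr_bar": mr_bar,
        "ucl_x": x_bar + e2 * mr_bar,
        "lcl_x": x_bar - e2 * mr_bar,
        "ucl_mr": d4 * mr_bar,
        "lcl_mr": d3 * mr_bar,
    }


result = individuals_mr_limits(weights)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```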
Figure 16.14 Range chart. (Weight of Parts: R Chart, plotted against group number 1 to 20, with UCL, CL, and LCL marked.)

EXAMPLE 16.4
Analyze the weights of a specific part made in a machine shop with the following information. Ten weeks of data on defectives have been collected, with a sample size of 50 each week. (Data are in Table 16.11.)
Table 16.11 Data for Example 16.4

Week No.   No. Defective
1          9
2          7
3          4
4          2
5          4
6          5
7          2
8          3
9          5
10         5
Total      46
Solution: Because we have attribute data with constant sample size and number of defectives, we use the np-chart.

Determining the averages: The average fraction defective ($\bar{p}$) = total defectives/total sampled:

$$\bar{p} = \frac{46}{(n)(\text{weeks})} = \frac{46}{(50)(10)} = 0.092$$

The grand average $n\bar{p}$ (center line) is also total defectives/total number of samples.
Figure 16.15 X-chart. (Weight of Parts: X Chart, plotted against observation number 1 to 20, with UCL, CL, and LCL marked.)

Figure 16.16 MR-chart. (Weight of Parts: MR Chart, plotted against observation number 1 to 20, with UCL, CL, and LCL marked.)
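$$n\bar{p} = (50)(0.092) = 4.6, \qquad n\bar{p} = 46/10 = 4.6$$

Determining control limits:

$$UCL = n\bar{p} + 3\sqrt{n\bar{p}(1 - \bar{p})} = 4.6 + 3\sqrt{4.6(1 - 0.092)} = 10.731$$

$$LCL = n\bar{p} - 3\sqrt{n\bar{p}(1 - \bar{p})} = 4.6 - 3\sqrt{4.6(1 - 0.092)} = -1.531 \rightarrow 0$$

Note: Because the calculated lower control limit (LCL) is less than zero, use zero.

Draw the np-chart: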
Figure 16.17 np-Chart. (Vertical axis labeled Number of Changes, 0 to 16; horizontal axis labeled Week, 1 to 10; UCL, CL, and LCL = 0 marked.)
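The np-chart limits of Example 16.4 can be computed directly from the weekly counts in Table 16.11. The function name below is an illustrative choice; a negative lower limit is clamped to zero, as the example's note prescribes.

```python
import math

# Weekly numbers of defectives from Table 16.11; 50 parts sampled each week.
defectives = [9, 7, 4, 2, 4, 5, 2, 3, 5, 5]


def np_chart_limits(counts, sample_size):
    """Return (LCL, center line, UCL) for an np-chart with constant sample size."""
    p_bar = sum(counts) / (sample_size * len(counts))   # average fraction defective
    np_bar = sample_size * p_bar                        # center line
    spread = 3 * math.sqrt(np_bar * (1 - p_bar))        # three-sigma half-width
    return max(0.0, np_bar - spread), np_bar, np_bar + spread


lcl, center, ucl = np_chart_limits(defectives, sample_size=50)
print(f"LCL={lcl:.3f} CL={center:.3f} UCL={ucl:.3f}")
```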
Table 16.12 Data for Example 16.5

Week No.   No. Defective
1          9
2          7
3          4
4          2
5          4
6          15
7          2
8          3
9          5
10         5
Total      56
EXAMPLE 16.5
A company tracks the number of times a specification was changed by either an engineering change proposal (ECP) or a letter from the contracting officer. Attribute data summarize changes to 50 contracts over a 10-week period (Table 16.12). Analyze the number of changes.

Solution: Because we have attribute data with constant sample size, and the number of changes is represented by the number of “defects,” we use the c-chart.

Determining center line ($\bar{c}$) and control limits: $\bar{c}$ = total defects found/total number of groups = 56/10 = 5.6 (changes per week). Determine the control limits. If LCL < 0, set LCL = 0.

$$UCL = \bar{c} + 3\sqrt{\bar{c}} = 5.6 + 3\sqrt{5.6} = 12.699$$

$$LCL = \bar{c} - 3\sqrt{\bar{c}} = 5.6 - 3\sqrt{5.6} = -1.499 \rightarrow 0$$
Draw the c-chart.
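Before drawing the chart, the c-chart limits of Example 16.5 can be computed and each week compared against them. This is a sketch with illustrative names, applying only rule 1 of Table 16.5 (points outside the control limits) to the counts in Table 16.12.

```python
import math

# Weekly numbers of specification changes from Table 16.12.
changes = [9, 7, 4, 2, 4, 15, 2, 3, 5, 5]


def c_chart_limits(counts):
    """Return (LCL, center line, UCL) for a c-chart; a negative LCL is set to zero."""
    c_bar = sum(counts) / len(counts)
    spread = 3 * math.sqrt(c_bar)
    return max(0.0, c_bar - spread), c_bar, c_bar + spread


lcl, c_bar, ucl = c_chart_limits(changes)

# Weeks (numbered from 1) whose count falls outside the control limits.
out_of_control_weeks = [week for week, c in enumerate(changes, start=1)
                        if c > ucl or c < lcl]
print(f"LCL={lcl:.3f} CL={c_bar:.3f} UCL={ucl:.3f} flagged weeks={out_of_control_weeks}")
```

With these data, the only count above the upper limit is the 15 changes recorded in week 6, which would be the first point to investigate for a special cause.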
HOMEWORK PROBLEMS

Problem 16.1
For each of the data sets given, identify which of the following control charts should be used to plot the data for process control: c-chart, u-chart, p-chart, np-chart, X(bar)-r-chart, or X-Rm chart. For each case, state why you selected the particular chart type.

Data-Set Details (state the Control Chart Type for each)

An equal number of samples of process output have been monitored each week for the last 5 weeks. Ten defective parts were found the first week, eight the second week, six the third week, nine the fourth week, and seven the fifth week.

Different numbers of samples (between 40 and 60) of process output have been monitored each week for the last 4 weeks. In the first week, 1.2 defects per sample were observed. In the second week, 1.5 defects per sample were observed. In the third week, one defect per sample was observed. In the fourth week, 0.8 of a defect per sample was observed.

The thicknesses of 10 samples were measured each day for a week.

An equal number of samples of process output have been monitored each week for the last 4 weeks. In the first week, eight defects were observed. In the second week, 12 defects were observed. In the third week, 10 defects were observed. In the fourth week, nine defects were observed.

The thickness of a single sample was measured each day for a week.

A process has been observed each week for the last three weeks. The first week, 10% of the parts were found to be defective; 20% were found to be defective the second week, and 15% were found to be defective the third week.
Problem 16.2
The copper content of a plating bath is measured three times per day and the results are reported in parts per million. The X(bar)- and r-values for 10 days are shown in the following table.

Day   X(bar)   r
1     5.45     1.21
2     5.39     0.95
3     6.85     1.43
4     6.74     1.29
5     5.83     1.35
6     7.22     0.88
7     6.39     0.92
8     6.50     1.13
9     7.15     1.25
10    5.92     1.05
a. Determine the upper and lower control limits.
b. Is the process in statistical control?
c. Estimate the Cp and Cpk given that the specification is 6.0 ± 1.0. Is the process capable?

Problem 16.3
Printed circuit boards are assembled by a combination of manual assembly and automation. A reflow soldering process is used to make the mechanical and electrical connections of the leaded components to the board. The boards are run through the solder process continuously, and every hour five boards are selected and inspected for process-control purposes. The number of defects in each sample of five boards is noted. Results for 20 samples are shown in the following table. What type of control chart is appropriate for this case and why? Construct the control chart limits and draw the chart. Is the process in control? Does it need improvement?

Sample   No. of Defects   Sample   No. of Defects
1        6                11       9
2        4                12       15
3        8                13       8
4        10               14       10
5        9                15       8
6        12               16       2
7        16               17       7
8        2                18       1
9        3                19       7
10       10               20       13

Problem 16.4
Twelve parts of the same type are tested for 1,000 hours, and seven failures are observed at 250, 450, 510, 625, 750, 825, and 979 hours. The items are removed at failure without replacement. Calculate the upper and lower one-sided 90% confidence limits on mean time between failures. Also, calculate the two-sided 90% confidence limits on reliability for a 200-hour period.
Problem 16.5
The diameter of a shaft with nominal specifications of 60 ± 3 mm is measured six times each hour and the results are recorded. The X(bar)- and r-values for 8 hours are shown in the following table:

Hour   X(bar)   R
1      62.54    1.95
2      60.23    2.03
3      58.46    1.43
4      59.95    1.29
5      61.58    0.78
6      57.93    1.48
7      61.56    0.86
8      57.34    1.35

(a) Determine the upper and lower control limits.
(b) Determine if the process is in statistical control.
(c) Estimate the Cp and Cpk for the process. Is the process capable?
Problem 16.6
The specification for a shaft diameter is 212 ± 2 mm. Provided below are 30 recorded observations for the diameter of a shaft (in millimeters) taken at 30 different points in time.

212.1 (a)   214.2   213.7   212.7   212.5
212.7 (b)   212.8   213.0   212.9   212.3
212.5       212.1   211.8   213.5   212.0
213.0       214.5   212.3   212.2   211.9
213.2       212.7   211.9   212.3   212.0
212.8       213.9   212.6   214.0   212.4 (c)

(a) First observation. (b) Sixth observation. (c) Thirtieth observation.

(a) Determine the three-sample X(bar)- and MR(bar)-control limits from the data.
(b) Determine from the control charts whether the process is in control or not.
(c) Determine the capability indices (Cp and Cpk) for the process.
(d) Determine the percent defective shafts produced by the process.