Complex System Maintenance Handbook (Springer Series in Reliability Engineering)


Author: Khairy A.H. Kobbacy | D.N. Prabhakar Murthy


Springer Series in Reliability Engineering

Series Editor Professor Hoang Pham Department of Industrial Engineering Rutgers The State University of New Jersey 96 Frelinghuysen Road Piscataway, NJ 08854-8018 USA

Other titles in this series

The Universal Generating Function in Reliability Analysis and Optimization
Gregory Levitin

Warranty Management and Product Manufacture
D.N.P. Murthy and Wallace R. Blischke

Maintenance Theory of Reliability
Toshio Nakagawa

System Software Reliability
Hoang Pham

Reliability and Optimal Maintenance
Hongzhou Wang and Hoang Pham

Applied Reliability and Quality
B.S. Dhillon

Shock and Damage Models in Reliability Theory
Toshio Nakagawa

Risk Management
Terje Aven and Jan Erik Vinnem

Satisfying Safety Goals by Probabilistic Risk Assessment
Hiromitsu Kumamoto

Offshore Risk Assessment (2nd Edition)
Jan Erik Vinnem

The Maintenance Management Framework
Adolfo Crespo Márquez

Human Reliability and Error in Transportation Systems
B.S. Dhillon

Khairy A.H. Kobbacy • D.N. Prabhakar Murthy Editors

Complex System Maintenance Handbook


Khairy A.H. Kobbacy, PhD Management and Management Sciences Research Institute University of Salford Salford, Greater Manchester M5 4WT UK

D.N. Prabhakar Murthy, PhD Division of Mechanical Engineering The University of Queensland Brisbane 4072 Australia

ISBN 978-1-84800-010-0

e-ISBN 978-1-84800-011-7

DOI 10.1007/978-1-84800-011-7

Springer Series in Reliability Engineering series ISSN 1614-7839

British Library Cataloguing in Publication Data
A Complex system maintenance handbook. - (Springer series in reliability engineering)
1. Maintenance 2. Reliability (Engineering) 3. Maintenance - Management
I. Murthy, D. N. P. II. Kobbacy, Khairy A. H.
620'.0046
ISBN-13: 9781848000100

Library of Congress Control Number: 2008923781

© 2008 Springer-Verlag London Limited

Watchdog Agent™ is a trademark of the Intelligent Maintenance Systems (IMS) Center, University of Cincinnati, PO Box 210072, Cincinnati, OH 45221, USA. www.imscenter.net

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: deblik, Berlin, Germany

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com

To our wives Iman and Jayashree for their patience, understanding and support

Preface

Modern societies depend on the smooth operation of many complex systems (designed and built by humans) that provide a variety of outputs (products and services). These include transport systems (trains, buses, ferries, ships and aeroplanes), communication systems (television, telephone and computer networks), utilities (water, gas and electricity networks), manufacturing plants (to produce industrial products and consumer durables), processing plants (to extract and process minerals and oil), hospitals (to provide services) and banks (for financial transactions) to name a few. Every system built by humans is unreliable in the sense that it degrades with age and/or usage. A system is said to fail when it is no longer capable of delivering the designed outputs. Some failures can be catastrophic in the sense that they can result in serious economic losses, affect humans and do serious damage to the environment. Typical examples include the crash of an aircraft in flight, failure of a sewerage processing plant and collapse of a bridge. The degradation can be controlled, and the likelihood of catastrophic failures reduced, through maintenance actions, including preventive maintenance, inspection, condition monitoring and design-out maintenance. Corrective maintenance actions are needed to restore a failed system to operational state through repair or replacement of the components that caused the failure. Maintenance has moved from being an engineering activity after a system has been put into operation into an important issue that needs to be addressed during the design and manufacturing or building of the system. Maintenance impacts on reliability (a technical issue) with serious economic and commercial implications. This implies that operators of complex systems need to look at maintenance from an overall business perspective that integrates the technical and commercial issues in an effective manner. The literature on maintenance is vast. 
Over the last 50 years, there have been dramatic changes due to advances in the understanding of the physics of failure, in technologies to monitor and assess the state of the system, in computers to store and process large amounts of relevant data and in the tools and techniques needed to build models to determine the optimal maintenance strategies. The aim of this book is to integrate this vast literature, with different chapters focusing on different aspects of maintenance and written by active researchers and/or experienced practitioners with international reputations. Each chapter reviews the literature dealing with a particular aspect of maintenance (for example, methodology, approaches, technology, management, modelling analysis and optimisation), reports on the developments and trends in a particular industry sector, or deals with a case study. It is hoped that the book will help to narrow the gap between theory and practice and will trigger new research in maintenance.

The book is written for a wide audience. This includes practitioners from industry (maintenance engineers and managers) and researchers investigating various aspects of maintenance. It is also suitable for use as a textbook for postgraduate programs in maintenance, industrial engineering and applied mathematics.

We would like to thank the authors of the chapters for their collaboration and prompt responses to our enquiries, which enabled completion of this handbook on time. We also wish to acknowledge the support of the University of Salford and the award of a CAMPUS Fellowship in 2006 to one of us (PM). We gratefully acknowledge the help and encouragement of the editors of Springer, Anthony Doyle and Simon Rees. Our thanks also go to Sorina Moosdorf and the staff involved with the production of the book.

Contents

Part A An Overview
Chapter 1: An Overview (K. Kobbacy and D. Murthy)

Part B Evolution of Concepts and Approaches
Chapter 2: Maintenance: An Evolutionary Perspective (L. Pintelon and A. Parodi-Herz)
Chapter 3: New Technologies for Maintenance (Jay Lee and Haixia Wang)
Chapter 4: Reliability Centred Maintenance (Marvin Rausand and Jørn Vatn)

Part C Methods and Techniques
Chapter 5: Condition-based Maintenance Modelling (Wenbin Wang)
Chapter 6: Maintenance Based on Limited Data (David F. Percy)
Chapter 7: Reliability Prediction and Accelerated Testing (E. A. Elsayed)
Chapter 8: Preventive Maintenance Models for Complex Systems (David F. Percy)
Chapter 9: Artificial Intelligence in Maintenance (Khairy A. H. Kobbacy)

Part D Problem Specific Models
Chapter 10: Maintenance of Repairable Systems (Bo Henry Lindqvist)
Chapter 11: Optimal Maintenance of Multi-component Systems: A Review (Robin P. Nicolai and Rommert Dekker)
Chapter 12: Replacement of Capital Equipment (P.A. Scarf and J.C. Hartman)
Chapter 13: Maintenance and Production: A Review of Planning Models (Gabriella Budai, Rommert Dekker and Robin P. Nicolai)
Chapter 14: Delay Time Modelling (Wenbin Wang)

Part E Management
Chapter 15: Maintenance Outsourcing (D.N.P. Murthy and N. Jack)
Chapter 16: Maintenance of Leased Equipment (D.N.P. Murthy and J. Pongpech)
Chapter 17: Computerised Maintenance Management Systems (Ashraf Labib)
Chapter 18: Risk Analysis in Maintenance (Terje Aven)
Chapter 19: Maintenance Performance Measurement (MPM) System (Uday Kumar and Aditya Parida)
Chapter 20: Forecasting for Inventory Management of Service Parts (John E. Boylan and Aris A. Syntetos)

Part F Applications (Case Studies)
Chapter 21: Maintenance in the Rail Industry (Jørn Vatn)
Chapter 22: Condition Monitoring of Diesel Engines (Renyan Jiang and Xinping Yan)
Chapter 23: Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration) (Ulla Espling and Uday Kumar)
Chapter 24: Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets (Jayantha P. Liyanage)
Chapter 25: Fault Detection and Identification for Longwall Machinery Using SCADA Data (Daniel R. Bongers and Hal Gurgenci)

Contributor Biographies
Index

Part A

An Overview

1 An Overview

K.A.H. Kobbacy and D.N.P. Murthy

1.1 Introduction

The efficient functioning of modern society depends on the smooth operation of many complex systems, each comprising several pieces of equipment, that provide a variety of products and services. These include transport systems (trains, buses, ferries, ships and aeroplanes), communication systems (television, telephone and computer networks), utilities (water, gas and electricity networks), manufacturing plants (to produce industrial products and consumer durables), processing plants (to extract and process minerals and oil), hospitals (to provide services) and banks (for financial transactions) to name a few. All equipment is unreliable in the sense that it degrades with age and/or usage and fails when it is no longer capable of delivering the products and services. When a complex system fails, the consequences can be dramatic. It can result in serious economic losses, affect humans and do serious damage to the environment as, for example, the crash of an aircraft in flight, the failure of a sewage processing plant or the collapse of a bridge.

Through proper corrective maintenance, one can restore a failed system to an operational state by actions such as repair or replacement of the components that failed and in turn caused the failure of the system. The occurrence of failures can be controlled through maintenance actions, including preventive maintenance, inspection, condition monitoring and design-out maintenance. With good design and effective preventive maintenance actions, the likelihood of failures and their consequences can be reduced but failures can never be totally eliminated.

The approach to maintenance has changed significantly over the last one hundred years. Over a hundred years ago, the focus was primarily on corrective maintenance delegated to the maintenance section of the business to restore failed systems to an operational state. Maintenance was carried out by trained technicians and was viewed as an operational issue and did not play a role in the design and operation of the system. The importance of preventive maintenance was fully appreciated during the Second World War. Preventive maintenance involves additional costs and is worthwhile only if the benefits exceed the costs. Deciding the optimum level of maintenance requires building appropriate models and the use of sophisticated optimisation techniques. Also, around this time, maintenance issues started to be addressed at the design stage and this led to the concept of maintainability. Reliability and maintainability (R&M) became major issues in the design and operation of systems.

Degradation and failure depend on the stresses on the various components of the system. These depend on the operating conditions that are dictated by commercial considerations. As a result, maintenance moved from being a purely technical issue to being a strategic management issue with options such as outsourcing of maintenance, leasing equipment as opposed to buying, etc. Also, advances in technologies (new materials, new sensors for monitoring, data collection and analysis) added new dimensions (science, technology) to maintenance. These advances will continue at an ever-increasing pace in the twenty-first century.

This handbook tries to address the various issues associated with the maintenance of complex systems. The aim is to give a snapshot of the current status and highlight future trends. Each chapter deals with a particular aspect of maintenance (for example, methodology, approaches, technology, management, modelling analysis and optimisation), reports on developments and trends in a particular industry sector, or deals with a case study.

In this chapter we give an overview of the handbook. The outline of the chapter is as follows. Section 1.2 deals with the framework that is needed to study the maintenance of complex systems and we discuss some of the salient issues. Section 1.3 presents the structure of the book and gives a brief outline of the different chapters in the handbook. We conclude with a discussion of the target audience for the handbook.
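The remark earlier in this section that preventive maintenance is worthwhile only if the benefits exceed the costs can be made concrete with the classical age replacement model. This is a standard textbook formulation offered here as an illustration, not a model taken from this chapter. The symbols are assumptions introduced for the sketch: an item is replaced preventively at age T at cost c_p, or on failure (if that happens first) at a higher cost c_f, and R(t) is the probability that a new item survives beyond age t. The long-run cost per unit time is then the expected cost of one replacement cycle divided by its expected length:

```latex
C(T) = \frac{c_p\,R(T) + c_f\,\bigl(1 - R(T)\bigr)}{\int_0^T R(t)\,\mathrm{d}t},
\qquad
C(\infty) = \frac{c_f}{\int_0^\infty R(t)\,\mathrm{d}t} = \frac{c_f}{\mathrm{MTTF}} .
```

Under these assumptions, preventive replacement at age T pays off only when C(T) is smaller than the run-to-failure cost rate C(∞). For an item with a constant failure rate this never happens, which is one way of seeing why the benefit of preventive maintenance depends on how the equipment degrades.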

1.2 Framework for Study of Maintenance

A proper study of maintenance requires a comprehensive framework that incorporates all the key elements. However, not all the elements would be relevant for a particular maintenance problem under consideration. The systems approach is an effective way of solving maintenance problems. In this approach, the real world relevant to the problem is described through a characterisation where one identifies the relevant variables and the interactions between the variables. This characterisation can be done using language or a schematic network representation where the nodes represent the variables and the connecting arcs denote the relationships. This is good for qualitative analysis. For quantitative analysis, one needs to build mathematical models to describe the relationships. Often this requires stochastic and dynamical formulations as system degradation and failures occur in an uncertain manner.

In this section, we discuss the various key elements and some related issues. We use the term “asset” to denote a complex system or individual equipment. It can include infrastructures such as buildings, bridges etc. in addition to those listed in Section 1.1.


1.2.1 Stakeholders

For an asset there can be several stakeholders as indicated in Figure 1.1.

Figure 1.1. Stakeholders for maintenance of an asset

The number of parties involved would depend on the asset under consideration. For example, in the case of a rail network (used to provide a service to transport people and goods) the customers can include the rail operators (operating the rolling stock) and the public. The owner can be a business entity, a financial institution or a government agency. The operator is the agency that operates the track and is responsible for the flow of traffic. The service provider refers to the agency carrying out the maintenance (preventive and corrective). It can be the operator (in which case maintenance is done in-house) or some external agent (if maintenance is outsourced) or both (when only some of the maintenance activities are outsourced). The regulator is the independent agency which deals with safety and risk issues. It defines the minimum standards for safety and can impose fines on the owner, operator and possibly the service provider should the safety levels be compromised. Government plays a critical role in providing the subsidy and assuming certain risks. In this case all the parties involved are affected by the maintenance carried out on the asset. If the line is shut frequently and/or for long durations, it can affect customer satisfaction and patronage, the returns to the operators and owners and the costs to the government.

1.2.2 Different Perspectives

We focus our attention on the case where the owner of the asset outsources its maintenance. In this case, we have two parties: (i) the owner (of the asset) and (ii) the service agent (providing the maintenance). Figure 1.2 is a very simplified system characterisation of the maintenance process where the maintenance activities are defined through a maintenance service contract. The problem is to determine the terms of the service contract.

Figure 1.2. System characterisation for maintenance out-sourcing

Each of the elements of Figure 1.2 involves several variables. For example, the maintenance service contract involves the following: (i) duration of contract, (ii) price of contract, (iii) maintenance performance requirements, (iv) incentives and penalties, (v) dispute resolution, etc. The maintenance performance requirements can include measures such as availability, mean time between failures and so on. The characterisation of the owner’s decision-making process can involve costs, asset state at the end of the contract, risks (service agent not providing the level and quality of service) and so on. The interests and goals of the owner are different from those of the service agent. The study of maintenance is complicated by unknown and uncontrollable factors. These could be the rate of degradation (which depends on several factors such as material properties, operating environment etc.) and other commercial factors (such as high demand for power in the case of a power plant due to very hot weather).

1.2.3 Key Issues and the Need for Multi-disciplinary Approach

The key issues in the maintenance of an asset are shown in Figure 1.3. The asset acquisition is influenced by business considerations and its inherent reliability is determined by the decisions made during design. The field reliability and degradation are affected by operations (usage intensity, operating environment, operating load etc.). Through use of technologies, one can assess the state of the asset. The analysis of the data and models allows the maintenance decisions to be optimized (either for a given operating condition or jointly with operations). Once the maintenance actions have been formulated they need to be implemented.


Figure 1.3. Key Issues in maintenance of an asset

To execute effective maintenance one needs to have a good understanding of a variety of concepts and techniques for each of the issues. Another issue is the computer packages that allow one to collect and analyze data and build models and derive the optimal solutions. The linking of the technical and commercial issues is indicated in Figure 1.4 and this requires an inter-disciplinary approach.

Figure 1.4. Linking technical and commercial issues


The disciplines involved are as follows.

1.2.3.1 Engineering

The degradation of an asset depends to some extent on the design and building (or production) of the asset. Poor design leads to poor reliability that in turn results in a high level of corrective maintenance. On the other hand, a well-designed system is more reliable and hence less prone to failures. Maintainability deals with maintenance issues at the design and development stage of the asset.

1.2.3.2 Science

This is very important in the understanding of the physical mechanisms that are at play and have a significant influence on the degradation and failure. Choosing the wrong material can have serious consequences and affect the subsequent maintenance actions needed.

1.2.3.3 Economic

Maintenance costs can be a significant fraction of the total operating budget for a business depending on the industry sector. There are two types of costs: annual cost and cost over the life cycle of the asset. The costs can be divided into direct (labour, material etc.) and indirect (consequence of failure).

1.2.3.4 Legal

This is important in the context of maintenance out-sourcing and maintenance of leased equipment. In both cases, the central issue is the contract between the parties involved. Of particular importance is dispute resolution when the parties disagree over whether some terms of the contract have been violated.

1.2.3.5 Statistics

Degradation and failures occur in an uncertain manner. As such, the analysis of degradation and failure data requires the use of statistical techniques. Statistics provides the concepts and tools to extract information from data and to plan efficient data collection systems.

1.2.3.6 Operational Research

Operational research provides the tools and techniques for model building, analysis and optimization. Often, analytical approaches fail and one needs to use a simulation approach to evaluate the outcomes of different decisions and to choose the optimal (or near optimal) strategies.

1.2.3.7 Reliability Theory

Reliability theory deals with the interdisciplinary use of probability, statistics and stochastic modelling, combined with engineering insights into the design and the scientific understanding of the failure mechanisms, to study the various aspects of reliability. As such, it encompasses issues such as (i) reliability modelling, (ii) reliability analysis and optimization, (iii) reliability engineering, (iv) reliability science, (v) reliability technology and (vi) reliability management.
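The simulation approach mentioned under Operational Research can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not a method taken from this handbook: it assumes an age replacement policy (replace at age T or on failure, whichever comes first), Weibull-distributed lifetimes, and made-up cost figures. The function name `simulate_cost_rate` and all parameter values are hypothetical.

```python
import random


def simulate_cost_rate(T, shape, scale, cost_pm, cost_cm, horizon, seed=0):
    """Estimate the long-run cost per unit time of an age replacement policy.

    Assumptions for this sketch: lifetimes are Weibull(shape, scale), a planned
    replacement at age T costs cost_pm, a failure replacement costs cost_cm,
    and replacements take no time. T = float("inf") means run to failure.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    clock = 0.0
    total_cost = 0.0
    while clock < horizon:
        # random.weibullvariate takes the scale first, then the shape.
        lifetime = rng.weibullvariate(scale, shape)
        if lifetime < T:
            # The item fails before its planned replacement age.
            clock += lifetime
            total_cost += cost_cm
        else:
            # The item survives to age T and is replaced preventively.
            clock += T
            total_cost += cost_pm
    return total_cost / clock


if __name__ == "__main__":
    # Hypothetical figures: wear-out behaviour (shape > 1), failure ten times
    # as costly as a planned replacement.
    params = dict(shape=2.5, scale=1000.0, cost_pm=1.0, cost_cm=10.0,
                  horizon=2_000_000.0, seed=1)
    for T in (200.0, 400.0, 600.0, 800.0, float("inf")):
        print(T, simulate_cost_rate(T, **params))
```

Comparing the estimated cost rates across candidate values of T gives a near optimal replacement age for the assumed inputs. The same loop can be extended with features that are awkward to treat analytically, such as imperfect repairs or random repair times, which is where simulation tends to earn its place.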


1.2.3.8 Information Technology and Computer Science

The operation and maintenance of complex assets generates a lot of data. One needs efficient ways to store and manipulate the data and to extract relevant information from data. Computer science provides a range of artificial intelligence techniques such as data mining, expert systems, neural networks etc., which are very important in the context of maintenance.

1.2.4 Maintenance Management

Maintenance management deals with the overall management of the maintenance of an asset. The management needs to be done at three different levels (strategic, tactical and operational) as indicated in Figure 1.5.

Figure 1.5. Maintenance management. Strategic level (maintenance strategy): business perspective; technical and commercial; in-house vs. out-sourcing; replacement / design changes. Tactical level (maintenance planning and scheduling): degradation (reliability science); maintenance policies; logistics (spares, facilities etc.). Operational level (maintenance work execution): data collection; data analysis (root cause, other factors).

The strategic level deals with maintenance strategy. This needs to be formulated so that it is consistent and coherent with other (production, marketing, finance, etc.) business strategies. The tactical level deals with the planning and scheduling of maintenance. The operational level deals with the execution of the maintenance tasks and collection of relevant data.

1.3 Structure of the Handbook

The handbook integrates the vast literature on maintenance, with each chapter focusing on a different aspect of maintenance and written by active researchers with international reputations and/or experienced practitioners from industry. Each chapter either reviews the literature dealing with a particular aspect of maintenance (for example, methodology, approaches, technology, management, modelling analysis and optimisation), reports on developments and trends in a particular industry sector, or deals with a case study. The book is structured into six parts and each of the last five parts contains several chapters. The topics of the different chapters are as indicated below.

Part A: An Overview

Chapter 1: An Overview (Khairy Kobbacy and Pra Murthy)

Part B: Evolution of Concepts and Approaches

Chapter 2: Maintenance: An Evolutionary Perspective (Liliane Pintelon and Alejandro Parodi-Herz)
Chapter 3: New Technologies for Maintenance (Jay Lee and Haixia Wang)
Chapter 4: Reliability Centred Maintenance (Marvin Rausand and Jørn Vatn)

Part C: Methods and Techniques

Chapter 5: Condition-based Maintenance Modelling (Wenbin Wang)
Chapter 6: Maintenance Based on Limited Data (David F. Percy)
Chapter 7: Reliability Prediction and Accelerated Testing (Elsayed A. Elsayed)
Chapter 8: Preventive Maintenance Models for Complex Systems (David F. Percy)
Chapter 9: Artificial Intelligence in Maintenance (Khairy A.H. Kobbacy)

Part D: Problem Specific Models

Chapter 10: Maintenance of Repairable Systems (Bo Henry Lindqvist)
Chapter 11: Optimal Maintenance of Multi-component Systems: A Review (Robin P. Nicolai and Rommert Dekker)
Chapter 12: Replacement of Capital Equipment (Philip A. Scarf and Joseph C. Hartman)
Chapter 13: Maintenance and Production: A Review of Planning Models (Gabriella Budai, Rommert Dekker and Robin P. Nicolai)
Chapter 14: Delay Time Modelling (Wenbin Wang)

Part E: Management

Chapter 15: Maintenance Outsourcing (Pra Murthy and Nat Jack)
Chapter 16: Maintenance of Leased Equipment (Pra Murthy and Jarumon Pongpech)
Chapter 17: Computerised Maintenance Management Systems (Ashraf Labib)
Chapter 18: Risk Analysis in Maintenance (Terje Aven)
Chapter 19: Maintenance Performance Measurement (MPM) System (Uday Kumar and Aditya Parida)
Chapter 20: Forecasting for Inventory Management of Service Parts (John E. Boylan and Aris A. Syntetos)

Part F: Applications (Case Studies)

Chapter 21: Maintenance in the Rail Industry (Jørn Vatn)
Chapter 22: Condition Monitoring of Diesel Engines (Renyan Jiang and Xinping Yan)
Chapter 23: Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration) (Ulla Espling and Uday Kumar)
Chapter 24: Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets (Jayantha P. Liyanage)
Chapter 25: Fault Detection and Identification for Longwall Machinery Using SCADA Data (Daniel Bongers and Hal Gurgenci)

A brief outline of each chapter is as follows.

Chapter 2: Maintenance: An Evolutionary Perspective

In the past few decades industrial maintenance has evolved from a non-issue into a strategic concern. During this period the role of maintenance has been drastically transformed. This chapter, while considering the fundamental elements of maintenance and its environment, describes the evolution path of maintenance management and the driving forces of such changes. It explains how and why maintenance practice has evolved over time. It includes basic notions of maintenance and classifies and distinguishes between the different types of maintenance actions, policies and concepts currently available. The chapter concludes by presenting some new challenges in maintenance.

Chapter 3: New Technologies for Maintenance

Predictive maintenance is critical to any engineering system, especially complex systems, in order to avoid system breakdown. With the recent advances in pervasive computing, prognostics can be easily embedded in any devices and systems. When smart machines are networked and remotely monitored, and when their data is modelled and continually analyzed with sophisticated embedded systems, it is possible to go beyond mere “predictive maintenance” to intelligent “prognostics”, the process of pinpointing exactly which components of a machine are likely to fail and then autonomously triggering service and ordering spare parts. This chapter addresses the paradigm shift in modern maintenance systems from the traditional “fail and fix” practices to a “predict and prevent” methodology. Recent advances in prognostic technologies and tools are presented, and future work directions are discussed.

Chapter 4: Reliability Centred Maintenance

This chapter gives an introduction to reliability centred maintenance (RCM). The RCM analysis process is divided into 12 distinct steps. Each step is thoroughly described and discussed.
The main RCM process is similar to the processes outlined in RCM standards and guidelines, but has more focus on the optimization of maintenance intervals. A new approach is proposed based on generic RCM analyses related to specified classes of consequences. The new approach will significantly reduce the workload of the RCM analysis. A computer tool, OptiRCM, that has been developed by the authors is used to illustrate the new approach. Several examples from railway applications are provided.

Chapter 5: Condition-based Maintenance Modelling

This chapter presents a model for supporting condition based maintenance decision making. The chapter discusses various issues related to the subject, such as the definition of the state of an asset, direct or indirect monitoring, the relationship between observed measurements and the state of the asset, and current modelling developments. In particular, the chapter focuses on a modelling technique used recently in predicting the residual life via stochastic filtering. This is a key element in modelling the decision making aspect of condition based maintenance. A few key condition monitoring techniques are also introduced and discussed. Methods of estimating model parameters are outlined and a numerical example based on real data is presented.

Chapter 6: Maintenance Based on Limited Data

Reliability applications often suffer from a paucity of data for making informed maintenance decisions. This is particularly noticeable for high reliability systems and when new production lines or new warranty schemes are planned. Such issues are of great importance when selecting and fitting mathematical models to improve the accuracy and utility of these decisions. This chapter investigates why reliability data are so limited and proposes statistical methods for dealing with these difficulties. It considers graphical and numerical summaries, appropriate methods for model development and validation, and the powerful approach of subjective Bayesian analysis for including expert knowledge about the application area.

Chapter 7: Reliability Prediction and Accelerated Testing

This chapter presents an overview of accelerated life testing (ALT) methods and their use in reliability prediction at normal operating conditions.
It describes the most commonly used models and introduces new ones which are “distribution free”. The design of optimum test plans to improve the accuracy of reliability prediction is also presented and discussed. The chapter provides, for the first time, the link between accelerated life testing and maintenance actions. It develops procedures for using the ALT results for estimating the optimum preventive maintenance schedule and the optimum degradation threshold level for degrading systems. The procedures are demonstrated using two numerical examples.

Chapter 8: Preventive Maintenance Models for Complex Systems

Preventive maintenance (PM) of repairable systems can be very beneficial in reducing repair and replacement costs, and in improving system availability. Strategies for scheduling PM are often based on intuition and experience, though considerable improvements in performance can be achieved by fitting mathematical models to observed data. For simple repairable systems comprising few components or many identical components, compound renewal processes are appropriate. This chapter reviews basic and advanced models for complex repairable systems and demonstrates their use for determining optimal PM intervals. Computational difficulties are addressed and practical illustrations are presented, based on subsystems of oil platforms and

Chapter 9: Artificial Intelligence in Maintenance

AI techniques have been used successfully in the past two decades to model and optimise maintenance problems. This chapter reviews the application of Artificial Intelligence (AI) in maintenance management and introduces the concept of developing an intelligent maintenance optimisation system. The chapter starts with an introduction to maintenance management, planning and scheduling, followed by a brief definition of AI and some of its techniques that have applications in maintenance management. A review of the literature is then presented covering the applications of AI in maintenance. We focus on five AI techniques, namely Knowledge Based Systems, Case Based Reasoning, Genetic Algorithms, Neural Networks and Fuzzy Logic. This review also covers “hybrid” systems where two or more AI techniques are used in an application. A discussion then follows of the prototype hybrid intelligent maintenance optimisation system (HIMOS), which was developed to evaluate and enhance PM routines of complex engineering systems. The chapter ends with a discussion of future research and concluding remarks.

Chapter 10: Maintenance of Repairable Systems

A repairable system is traditionally defined as a system which, after failing to perform one or more of its functions satisfactorily, can be restored to fully satisfactory performance by any method other than replacement of the entire system. An extended definition used in this chapter includes the possibility of additional maintenance actions which aim at servicing the system for better performance, referred to as preventive maintenance (PM). The common models for the failure process of a repairable system are renewal processes (RP) and non-homogeneous Poisson processes (NHPP).
The chapter considers several generalizations and extensions of the basic models, for example the trend renewal process (TRP), which includes NHPP and RP as special cases and allows a trend in processes of non-Poisson type. When several systems of the same kind are considered, there may be an unobserved heterogeneity between the systems which, if overlooked, may lead to wrong decisions. This phenomenon is considered in the framework of the TRP. We then consider the extension of the basic models obtained by introducing the possibility of PM using a competing risks approach. Finally, models for periodically inspected systems are studied, using a combination of time-continuous and time-discrete Markov chains.

Chapter 11: Optimal Maintenance of Multi-component Systems: A Review

This chapter gives an overview of the literature on multi-component maintenance optimization, focusing on work appearing since the 1991 survey by Cho and Parlar. A classification scheme primarily based on the dependence between components (stochastic, structural or economic) is introduced. Next, the papers are also classified on the basis of the planning aspect (short-term vs. long-term), the grouping of maintenance activities (either grouping preventive or corrective maintenance, or opportunistic grouping) and the optimization approach used (heuristic, policy classes or exact algorithms). Finally, attention is paid to the applications of the models.

Chapter 12: Replacement of Capital Equipment

This chapter deals with models of replacement of capital equipment. Capital replacement models may be classified as economic life models or dynamic programming models. The former are concerned with determining the optimal lifetime of an item of equipment taking account of costs over some planning horizon. The latter consider replacement decisions dynamically, determining whether plant should be retained or replaced after each period. We begin by looking at simple economic life models. These are applied in a case study on escalator replacement. Economic life models are then extended to consider first an inhomogeneous fleet and then a network system viewed as an inhomogeneous fleet with interacting items. A number of different dynamic programming models are introduced for singular systems and then expanded to homogeneous and inhomogeneous fleets and networks of assets.

Chapter 13: Maintenance and Production: A Review of Planning Models

This chapter gives an overview of the relation between planning of maintenance and production. Production planning and scheduling models where failures and maintenance aspects are taken into account are considered first. The planning of maintenance activities is considered next, where both preventive and corrective maintenance are discussed. Third, the planning of maintenance activities at moments in time when the items to be maintained are not needed, or are less needed, for production, also called opportunity maintenance, is considered. Apart from describing the main ideas, approaches and results, a number of applications are provided.

Chapter 14: Delay Time Modelling

This chapter presents Delay Time Modelling (DTM), a modelling tool that was created to model the problems of inspection maintenance and planned maintenance interventions.
This concept provides a modelling framework readily applicable to a wide class of actual industrial maintenance problems of assets in general and inspection problems in particular. The delay time concept defines the failure process of an asset as a two-stage process. The first stage is the normal operating stage from new to the point that a hidden defect has been identified. The second stage is defined as the failure delay time from the point of defect identification to failure. It is the existence of such a failure delay time which provides the opportunity for preventive maintenance to be carried out to remove or rectify the identified defects before failures. With appropriate modelling of the durations of these two stages, optimal inspection intervals can be identified to optimise a criterion function of interest. This chapter first gives an outline of the delay time concept, then introduces two delay time inspection models, of a single component and of a complex system respectively. The parameter estimation techniques used in DTM are discussed. Extensions to the basic delay time model are highlighted, and a discussion of future research in DTM concludes the chapter.
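
The way an inspection interval follows from the two-stage idea can be illustrated with a minimal numerical sketch. It is not the model developed in Chapter 14. It assumes, purely for illustration, that defects arise at a constant rate, that the delay time is exponentially distributed, that inspections are perfect and remove every defect present, and that the criterion is expected downtime per unit time. All function names and parameter values below are hypothetical.

```python
import math


def failure_fraction(T, mu):
    """Fraction of defects that fail before the next inspection.

    Assumes a defect is equally likely to arise anywhere in an interval of
    length T and that its delay time is exponential with rate mu, so that
    b(T) = (1/T) * integral_0^T (1 - exp(-mu * h)) dh.
    """
    return 1.0 - (1.0 - math.exp(-mu * T)) / (mu * T)


def downtime_per_unit_time(T, lam, mu, d_insp, d_fail):
    """Expected downtime per unit time when inspecting every T time units.

    lam    : assumed defect arrival rate (defects per unit time)
    d_insp : assumed downtime caused by one inspection
    d_fail : assumed downtime caused by one failure
    """
    expected_failures = lam * T * failure_fraction(T, mu)
    return (d_insp + expected_failures * d_fail) / (T + d_insp)


def best_interval(candidates, lam, mu, d_insp, d_fail):
    """Return the candidate interval with the lowest expected downtime."""
    return min(
        candidates,
        key=lambda T: downtime_per_unit_time(T, lam, mu, d_insp, d_fail),
    )


# Hypothetical parameter values, chosen only to make the sketch runnable.
LAM = 0.1      # defects per day
MU = 0.05      # 1 / mean delay time (mean delay of 20 days)
D_INSP = 0.25  # days of downtime per inspection
D_FAIL = 1.0   # days of downtime per failure

CANDIDATES = [float(t) for t in range(1, 61)]  # intervals of 1 to 60 days
T_STAR = best_interval(CANDIDATES, LAM, MU, D_INSP, D_FAIL)

if __name__ == "__main__":
    print("interval with lowest expected downtime:", T_STAR, "days")
```

Very short intervals are penalised by inspection downtime and very long ones by failures, so under these assumptions the criterion has an interior minimum. A grid search is used rather than a numerical optimiser to keep the sketch dependency-free.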

Chapter 15: Maintenance Outsourcing

It is often uneconomical for businesses to carry out their own maintenance on complex equipment. The alternative is to ‘out-source’ the maintenance function and use an external agent, under a service contract, to carry out some or all of the maintenance actions (preventive and corrective). This chapter develops the framework needed to study decision-making for maintenance outsourcing from both the customer (equipment owner) and service agent perspectives. The relevant literature is reviewed and a game theoretic approach to maintenance outsourcing and the use of agency theory is discussed. The link between maintenance outsourcing and extended warranties is highlighted and the scope for future research in both areas is examined.

Chapter 16: Maintenance of Leased Equipment

For leased equipment, the lessor has to carry out the maintenance of the equipment over the lease period. To ensure satisfactory performance and maintenance, the lease contract has penalty terms which result in the lessor having to compensate the lessee if the number of failures exceeds some specified number and/or the time to rectify each failure exceeds some specified value. This implies that the lessor needs to take into account these penalties in determining the optimal maintenance strategy. The chapter starts with a conceptual framework to discuss the different issues involved and then looks at models to help the lessor in developing the optimal maintenance strategy.

Chapter 17: Computerised Maintenance Management Systems

Computerised maintenance management systems (CMMSs) are vital for the coordination of all activities related to the availability, productivity and maintainability of complex systems. Modern computational facilities have offered a dramatic scope for improved effectiveness and efficiency in, for example, maintenance. CMMSs have existed, in one form or another, for several decades.
In this chapter, the characteristics of CMMSs are investigated, the need for them in industry is highlighted and their current deficiencies are identified. A proposed model is then presented to provide a decision analysis capability that is often missing in existing CMMSs. The effect of such a model is to contribute towards the optimisation of the functionality and scope of CMMSs for enhanced decision analysis support. The use of AI techniques in CMMSs is illustrated. Finally, the features of next generation maintenance systems are highlighted.

Chapter 18: Risk Analysis in Maintenance

Risk analysis can be used for selection and prioritisation of maintenance activities, and this application of risk analysis has been given increased attention in recent years. This chapter presents and discusses the use of risk analysis for this purpose. The chapter reviews some critical aspects of risk analysis important for the successful implementation of such analyses in maintenance. These relate to risk descriptions and categorisations, uncertainty assessments, risk acceptance and risk informed decision making, as well as selection of appropriate methods and tools. Both qualitative and quantitative approaches are covered. A detailed risk analysis is outlined showing the effect of maintenance on risk.

Chapter 19: Maintenance Performance Measurement (MPM) System

It is important that factors influencing the performance of the maintenance process are identified and measured, so that they can be monitored and controlled for improvement. Besides an overview of performance measurement, this chapter discusses maintenance performance indicators, the issues and challenges associated with developing a maintenance performance measurement framework, and the indicators in use in different industries. The framework considers stakeholders, business environment, multi-criteria and hierarchical needs, amongst others.

Chapter 20: Forecasting for Inventory Management of Service Parts

This chapter addresses issues pertinent to forecasting for the inventory management of service parts. In some sectors, such as the aerospace and automotive industries, a very wide range of service parts are held in stock, with significant implications for availability and inventory holding. Their management is therefore an important task. First, a number of possible approaches to classifying service parts for forecasting and inventory management related purposes are reviewed. Second, parametric and non-parametric approaches to forecasting service parts requirements are discussed, followed by the presentation of appropriate metrics for measuring the performance of the inventory management system. The existing empirical evidence on various forecasting methods is then summarised. Finally, the conclusions of this work are presented along with the identification of some natural avenues for further research.

Chapter 21: Maintenance in the Rail Industry

The chapter presents two case studies in railway maintenance. The first case study presents an optimisation model for preventive maintenance of a train bogie. The model uses a dynamic approach to grouping of maintenance activities, which enables, e.g., opportunity maintenance. Data from the Norwegian State Railways have been used in the calculation example. The second case study presents a life cycle cost approach to prioritization of larger maintenance and renewal projects under budget constraints.

Chapter 22: Condition Monitoring of Diesel Engines

Various techniques have been widely used to monitor the condition of diesel engines. Analysis of engine lubricant is one of the most widely used condition monitoring techniques. In this chapter, a case study applying the oil analysis technique to monitor the condition of marine diesel engines is presented. The case study focuses on analysis and modelling of oil monitoring data. The study first introduces the concept of state discriminant capability of condition variables and uses it to identify the significant condition variables, and then develops a state discriminant model to determine the state of the monitored system based on the current observation. The model parameters are obtained by directly minimizing the misjudgment probability. We believe that the proposed model has great potential for use owing to its plausible mathematical basis and simplicity, though it needs further testing with new data.

Chapter 23: Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration)

To sustain a competitive edge in the business, railway companies all over the world are looking for ways and means to improve their maintenance performance. Benchmarking is a very effective tool that can assist the management in their pursuit of continuous improvement of their operation. Three different benchmarking studies are considered: a project benchmarking the maintenance process across borders, a project dealing with benchmarking of maintenance outsourcing by different track regions in Sweden, and a project studying the level of transparency among the European railway administrations. The chapter discusses the pros and cons, the areas for improvement and the need for improvement of benchmarking metrics and framework.

Chapter 24: Integrated e-Operations–e-Maintenance: Application in North Sea Offshore Assets

Ongoing developments in Norway provide a good example of how an industry-wide re-engineering process has triggered major changes in the operations and maintenance practice of complex and high-risk assets, leading towards what is termed integrated e-operations e-maintenance. It aims at a step change in the conventional operations and maintenance practices of offshore assets. Initiatives have already been taken to exploit new methods, smart techniques and digital technologies to enable remote monitoring of offshore equipment condition and asset performance in land-based onshore support facilities using large ICT networks. This has already proved to have direct positive implications for the technical and safety integrity of assets, and subsequently for the plant economics. This chapter shares current experience and knowledge with reference to ongoing developments in the Norwegian oil and gas industry.
It highlights current offshore asset maintenance practice, the changing technical and economic environment that leads towards an e-approach, the development and implementation of integrated e-operations and e-maintenance solutions in the North Sea, key features of the e-approach in North Sea assets, and the future challenges in becoming fully integrated and fail-safe.

Chapter 25: Fault Detection and Identification for Longwall Machinery Using SCADA Data

In an attempt to improve equipment availability and facilitate informed, preventative maintenance, engineers may choose to implement one or more fault detection and identification (FDI) technologies. For complex systems (systems for which component interactions are not understood and model uncertainties are significant), data-driven methods of FDI are often the only practicable solution. The development of a data-driven FDI system for longwall mining equipment using SCADA data is described here. Significant data preprocessing was required to generate a quality example set. Missing value estimation (MVE) techniques were required to complete the high-dimensional stream of condition monitoring data from existing sensors. A cost function, in combination with a linear discriminant analysis, was used to ‘align’ the inaccurate, categorical delay records with those delays inferred by the SCADA data. A neural network was developed to determine the state of the system as a function of the real-time SCADA data input. Validation of this algorithm with unseen condition monitoring data showed misclassification rates of machine faults as low as 14.3%.

1.4 Target Audience

The unique features of the book are as follows:

1. Coverage of the different approaches to maintenance.
2. Treatment of many different aspects (scientific, technical, commercial, management, quantitative modelling, etc.).
3. A blend of theory and practice.

As such it should appeal to both researchers and practitioners. For researchers (from different disciplines) it should provide a starting point for new research into different aspects of maintenance. For practitioners it should provide the concepts and tools so that these can be used for improvements in the overall business performance. We also hope that it will serve as a reference book for use in postgraduate programs in maintenance.

Part B

Evolution of Concepts and Approaches

2 Maintenance: An Evolutionary Perspective

Liliane Pintelon and Alejandro Parodi-Herz

2.1 Introduction

Over the last decades industrial maintenance has evolved from a non-issue into a strategic concern. Perhaps there are few other management disciplines that have undergone so many changes over the last half-century. During this period, the role of maintenance within the organization has been drastically transformed. At first maintenance was nothing more than an inevitable part of production; now it is an essential strategic element in accomplishing business objectives. Without a doubt, the maintenance function is better perceived and valued in organizations. One could consider that maintenance management is no longer viewed as an underdog function; now it is considered an internal or external partner for success.

In view of the unrelenting competition, many organizations seek to survive by producing more, with fewer resources, in shorter periods of time. To meet these demands, physical assets take a central role. However, installations have become highly automated and technologically very complex and, consequently, maintenance management has had to become more complex, having to cope with higher technical and business expectations. Now the maintenance manager is confronted with very complicated and diverse technical installations operating in an extremely demanding business context.

This chapter, while considering the fundamental elements of maintenance and its environment, describes the evolution path of maintenance management and the driving forces of such changes. In Section 2.2 the maintenance context is described and its dynamic elements are briefly discussed. Section 2.3 explains how maintenance practices have evolved in time and distinguishes different epochs. Further, this section devotes special attention to describing a common lexicon for maintenance actions and policies, and then focuses on the evolution of maintenance concepts. Section 2.4 underlines how the role of the maintenance manager has been reshaped as a consequence of the changes in the maintenance function. Finally, the chapter concludes with Section 2.5, which identifies the new challenges for maintenance.

2.2 Maintenance in Context

To discuss the context in which maintenance management is embedded, one may first ask what maintenance is as such. Most authors in the maintenance management literature, one way or another, agree on defining maintenance as the “set of activities required to keep physical assets in the desired operating condition or to restore them to this condition”. While this defines what maintenance is about, it may suggest that maintenance is simple, which it is not, as any maintenance practitioner will confirm. Hence “maintenance management” is needed to embed maintenance practice in a complex and dynamic context.

From a pragmatic view, the key objective of maintenance management is “total asset life cycle optimization”. In other words, it is to maximize the availability and reliability of the assets and equipment so as to produce the desired quantity of products, with the required quality specifications, in a timely manner. Obviously, this objective must be attained in a cost-effective way and in accordance with environmental and safety regulations.

Figure 2.1 shows that maintenance is embedded in a given business context to which it has to contribute. What is more, it shows that the maintenance function needs to cope with multiple forces and requirements within and outside the walls of the organization. Beyond any doubt, the tasks of maintenance are complex, comprising a blend of management, technology, operations and logistics support elements.

[Figure: total asset life cycle optimization at the centre; elements shown: management, technology, operations, logistics support; surrounding factors shown: people, legislation, society, technological evolution, e-business, outsourcing market, information technology, competition]

Figure 2.1. Maintenance in context

To cope with and to coordinate the complex and changing characteristics that constitute maintenance in the first place, a management layer is imperative. Management is about “what to decide” and “how to decide”. In the maintenance arena, a manager juggles technology, operations and logistics elements that mainly need to harmonize with production. Technology refers to the physical assets which maintenance has to support with adequate equipment and tools. Operations indicate the combination of service maintenance interventions with core production activities. Finally, the logistics element supports the maintenance activities in planning, coordinating and ultimately delivering resources such as spare parts, personnel, tools and so forth. In one way or another, all these elements are always present, but their intensity and interrelationships will vary from one situation to another. For example, elevator maintenance in a hospital and plant maintenance in the chemical process industries each call for a different maintenance recipe tailored to the specific needs.

Clearly, the choice of the structural elements of maintenance is not independent of the environment. Besides, other factors such as the business context, society, legislation, technological evolution and the outsourcing market will be important. Furthermore, relatively new trends, such as the e-business context, will influence current and future maintenance management enormously. A whole new era for maintenance is expected as communication barriers are bridged and coordination opportunities of maintenance service become more intense.

2.2.1 Changes in the Playing Field of Maintenance

One should expect that neither maintenance management nor its environment is stationary. The constant changes in the field of maintenance are acknowledged to have enabled new and innovative developments in the field of maintenance science.

The technological evolution in production equipment, an ongoing evolution that started in the twentieth century, has been tremendous. At the start of the twentieth century, installations were barely mechanized or not mechanized at all, had a simple design, worked in stand-alone configurations and often had considerable overcapacity. Nowadays, installations are highly automated and technologically very complex. Often these installations are integrated with production lines that are right-sized in capacity. Installations not only became more complex, they also became more critical in terms of reliability and availability.
Redundancy is only considered for very critical components. For example, a pump in a chemical process installation can be considered very critical in terms of safety hazards. Furthermore, built-in equipment characteristics such as modular design and standardization are considered in order to reduce downtime during corrective or preventive maintenance. However, these principles are commonly applied predominantly to some newer, very expensive installations, such as flexible manufacturing systems (FMS). Fortunately, a move towards higher levels of standardization and modularization is beginning to be seen at all levels of the installations. As life cycle optimization concepts are commendable, it becomes mandatory that supportability and maintainability requirements are well thought out at the early design stages.

Parallel to the technological evolution, the ever-increasing customer focus causes even higher pressure, especially on critical installations. As customer service in terms of time, quality and choice becomes central to production decisions, more flexibility is required to cope with these varying needs. This calls for well-maintained and reliable installations capable of meeting shorter and more reliable lead-time estimates. Physical assets are ever more important for business success.

Maintenance does not escape from the (r)evolution in information and communication technology (ICT), which has tremendously changed business practices. We comment further on this topic in Section 2.3, by illustrating the impact on the role of the maintenance manager as such. Furthermore, new production and management principles such as the Just-in-time (JIT) philosophy, Lean principles, total quality management (TQM) and so forth have emerged. These production trends aim, by all means, to reduce waste and remove non-value-added transactions. It is not surprising that work-in-process (WIP) inventories are one of the key issues for improvement. Clearly, WIP inventories incur high costs as a consequence of capital immobilization, expensive floor space, etc. As processes are streamlined, WIP inventories are no longer a buffer for problems; accordingly, asset availability and reliability are ever more imperative. Although these principles were initially conceived for production and manufacturing environments, they are currently also applied and translated to service contexts.

Above all, the business environment has also changed. Competition has become fierce and worldwide due to globalization. The latter not only implies that competitors are located all over the world, but also that decisions to move production or service activities from a non-efficient site (e.g. due to high operations and maintenance costs) to another site are quickly taken, even if the other location is on another continent. Obviously, with the advent of globalization and intense competitive pressures, organizations are looking for every possible source of competitive advantage. This implies that the nature of the business environment has become more complex and dynamic, requiring different competitive strategies. Many companies are critically evaluating their value chain and often decide to drastically reorganize it. This results in focusing on the core business.
Consequently, outsourcing of some non-core business activities and the creation of new partnerships and alliances are being considered by many organizations. Not surprisingly, maintenance as a support function is no exception for outsourcing. Yet, it may not be so simple. Outsourcing maintenance of technical systems can become a sensitive issue if it is not handled with diligence. Technical systems are unique and situation specific. For example, outsourcing maintenance of utilities or elevators can be relatively straightforward, but when it comes to production floor equipment it can be a strategic issue that has to be handled with extreme care. These circumstances suggest that outsourcing needs to be considered at operational, tactical and strategic level; see Figure 2.2.

The simplest, and also the most common, form of outsourcing is “operational outsourcing”. At this level, a specific task is outsourced and the relationship between supplier and customer is strictly limited to a sell-buy situation. The impact on the internal organization of the customer is also limited. As outsourcing moves up in the organizational pyramid, the relationship between supplier and customer changes and “tactical outsourcing” may be required. At this level of outsourcing the customer shares management responsibility with the supplier and a simple kind of partnership is established. The impact on the internal organization is also greater. Finally, moving towards the organization’s top and for more critical maintenance services, a new form of outsourcing is created, the so-called “strategic outsourcing”. This type of outsourcing is also labelled “transformational outsourcing” because of its impact on the customer’s internal organization. Here a complete outsourcing is carried out: the maintenance department is cut away from the customer and moved to the supplier. The relationship between customer and supplier is a strong partnership: the customer has fully entrusted the supplier with one of its strategic maintenance activities. This level of outsourcing is still less common than the former ones.

The rationale for whether or not to outsource maintenance activities is complex and requires a well-thought-out and structured outsourcing process. As mentioned, maintenance outsourcing can cover a lot of alternatives. Fortunately, besides the traditional outsourcing of maintenance activities to equipment suppliers or the use of some small local firms, there is nowadays a growing market of medium-sized and large outsourcing firms. These firms offer a range of consulting support, specialized services and even full service to allow strategic outsourcing to work.

[Figure: pyramid with three decision levels: strategic (“transformational”), tactic (“partnership”) and operational (“supplier – customer”); service types from top to bottom: full service (e.g. outsourcing of all maintenance, BOT, ...), service package (e.g. MRO, utilities, facilities, ...), projects (e.g. renovation, shutdown, ...), specialised services (e.g. high tech equipment, piping, insulation, ...) and generic services (e.g. temporary extra capacity such as painting, welding, ...); supplier roles labelled “to think with”, “to manage”, “to organise” and “to carry out”]

Figure 2.2. Outsourcing decision levels

Societal expectations concerning technology are also creating boundary conditions for maintenance management. The attention paid to sustainability (3P: people, profit, planet) is a clear sign of this. Legislation is getting more and more stringent. This is especially important here because of its impact on occupational safety and environmental standards. Note that most of the above-mentioned trends for industrial installations can be easily translated to the service sector. Think, for example, of automated warehouses in distribution centres, hospital equipment or building utilities.


L. Pintelon and A. Parodi-Herz

2.3 Maintenance Practices Over Time

As a consequence of the transformation of the maintenance context, the maintenance function has also evolved drastically, from a non-issue into a strategic concern (see Figure 2.3). At first maintenance was nothing more than an inevitable part of production; it simply was a necessary evil. Repairs and replacements were tackled when needed and no optimization questions were raised. Later on, maintenance came to be seen as a technical matter. This included not only optimizing technical maintenance solutions, but also attention to the organization of the maintenance work. Further on, maintenance became a full-blown function instead of a production sub-function. Now maintenance management has clearly become a complex function, encompassing technical and management skills, while still requiring flexibility to cope with the dynamic business environment. Top management recognizes that a well-thought-out maintenance strategy, together with a careful implementation of that strategy, can have a significant financial impact. Nowadays, this has led to treating maintenance as a mature partner in business strategy development, possibly at the same level as production. In turn, these strategies formally consider establishing external partnerships and outsourcing the maintenance function.

[Figure: timeline by decade from 1940 to 2000, with the labels “Necessary evil” (at about 1940), “Technical matter” (about 1960), “Profit contributor” (about 1980) and “Cooperative partnership” (about 2000).]

Figure 2.3. The maintenance function in a time perspective

The fact that maintenance has become more critical implies that a thorough insight into the impact of maintenance interventions, or of their omission, is indispensable. In essence, good maintenance means allocating the right resources (personnel, spares and tools) and deciding on a suitable combination of maintenance actions, so as to guarantee a higher reliability and availability of the installations. Furthermore, good maintenance foresees and avoids the consequences of failures, which are far more important than the failures as such. Bad or no maintenance can appear to yield some savings in the short run, but sooner or later it will be more costly due to additional unexpected failures, longer repair times, accelerated wear, etc. Moreover, bad or no maintenance may well have a significant impact on customer service, as delivery promises may become difficult to fulfil. Hence, a well-conceived maintenance program is mandatory to meet business, environmental and safety requirements. Whatever the particular circumstances, if one intends to compile or judge any maintenance programme, some elementary maintenance terms need to be unambiguous and handled consistently. Yet both in practice and in the literature a lot of confusion exists. For example, what some call a maintenance policy others refer to as a maintenance action; what some consider preventive maintenance others refer to as predetermined or scheduled maintenance. Furthermore, some argue that some concepts can almost be considered strategies or philosophies, and

Maintenance: An Evolutionary Perspective


so on. Certainly there is a lot of confusion, which is perhaps characteristic of such a dynamic and young management science. Discussions about the precise terminology can almost be taken as philosophical arguments. Nevertheless, adopting a simple but truly relevant classification is essential. Without intending to disregard earlier terminologies, or to impose a norm, we draw attention to three of these confusing terms in particular: maintenance action, maintenance policy and maintenance concept. In the remainder of this chapter the following terminology is adopted.

Maintenance Action. Basic maintenance intervention, elementary task carried out by a technician (What to do?)

Maintenance Policy. Rule or set of rules describing the triggering mechanism for the different maintenance actions (How is it triggered?)

Maintenance Concept. Set of maintenance policies and actions of various types and the general decision structure in which these are planned and supported (Which logic and maintenance recipe are used?)

2.3.1 Maintenance Actions

Basically, as depicted in Figure 2.4, maintenance actions or interventions can be of two types: corrective maintenance (CM) actions or precautionary maintenance (PM) actions.

2.3.1.1 Corrective Maintenance Actions (CM)

CM actions are repair or restore actions following a breakdown or loss of function. These actions are “reactive” in nature; this simply means “wait until it breaks, then fix it!”. Corrective actions are difficult to predict, as equipment failure behaviour is stochastic and breakdowns are unforeseen. Replacing a failed light bulb, repairing a ruptured pipeline and repairing a stalled motor are examples of corrective actions.

2.3.1.2 Precautionary Maintenance Actions (PM)

PM actions can be “preventive, predictive, proactive or passive” in nature. These types of actions are somewhat more complex than corrective ones; a whole book could be written on each of them. Nonetheless, the fundamental idea is to diminish the failure probability of the physical asset and/or to anticipate, or if possible avoid, the consequences of a failure. Some PM actions (preventive and predictive) are somewhat easier to plan, because they can rely on fixed time schedules or on predictions of stochastic behaviour. Other types of PM actions, however, become ongoing tasks that originate from the attitude towards maintenance; in a sense they have become part of the tacit knowledge of the organization. Examples of precautionary actions are lubrication, bi-monthly bearing replacements, inspection rounds, vibration monitoring, oil analysis, design adjustments, etc. All these tasks are considered precautionary maintenance actions, although the underlying principles may differ.


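[Figure: three linked columns labelled ACTIONS, POLICIES and CONCEPTS. Actions: corrective (reactive) and precautionary (predictive, preventive, proactive and passive). Policies: FBM, T/UBM, CBM, OBM and DOM, annotated with the action types reactive, preventive, predictive, proactive and passive. Concepts: ad hoc, Q&D, LCC, TPM, RCM, BCM and CIBOCOF, grouped under “optimizing existing concept” and “customized concept”.]

Figure 2.4. Actions, policies and concepts in maintenance¹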

Although this seems a very clear-cut way of defining elementary maintenance interventions, it may still be difficult in practice to assign some interventions to either class. An example is routine maintenance on medical equipment such as a breathing device. Cleaning and sterilizing this equipment can be called precautionary maintenance, since the equipment is not defective at the moment of the intervention. On the other hand, it is very difficult to predict when an intervention will be needed, which is a typical characteristic of a corrective intervention. Furthermore, even within precautionary maintenance it is not always simple to classify certain actions into simple types. This is due to the changing perception of maintenance and the fast evolution of its techniques.

2.3.1.3 Acuity of Maintenance Actions

As maintenance knowledge has grown and more advanced enabling technologies have become available, the perception of which maintenance action is “right” has changed a lot during the last decades. In the 1950s almost all maintenance actions were corrective. Maintenance was considered an annoying and unavoidable cost, which could not be managed. Later on, in the 1960s, many companies switched to precautionary (preventive) maintenance programs, as they recognized that some failures of mechanical components were directly related to the time or number of cycles in use. This belief was mainly based on physical wear of components or age-related fatigue characteristics. At that time, it was accepted

¹ See the abbreviations list at the end of this chapter.


that preventive actions could avoid some of the breakdowns and would lead to cost savings in the long run. The main concern was how to determine, based on historical data, the adequate period for performing preventive maintenance. Not enough was known about failure patterns, which, among other reasons, led to a whole separate branch of engineering and statistics: reliability engineering.

In the late 1970s and early 1980s, equipment became in general more complex. As a result, the superposition of the failure patterns of individual components started to alter the failure characteristics known from simpler equipment. If there is no dominant age-related failure mode, preventive maintenance actions are of limited use in improving the reliability of complex items. At this point, the effectiveness of preventive maintenance actions started to be questioned and was considered more carefully. A common concern about “over-maintaining” grew rapidly. Moreover, as the entrenched belief in the benefits of preventive maintenance was called into question, new precautionary (predictive) maintenance techniques emerged. This meant a gradual, though not complete, switch to predictive (inspection and condition-based) maintenance actions. Naturally, predictive maintenance was, and still is, limited to those applications where it is both technically feasible and economically interesting. This trend was supported by the fact that condition-monitoring equipment became more accessible and cheaper. Before that time, these techniques were reserved for high-risk applications such as airplanes or nuclear power plants.

In the late 1980s and early 1990s maintenance history took a different turn with the emergence of concurrent engineering, or life cycle engineering. Here maintenance requirements were already taken into consideration at earlier product stages such as design or commissioning. As a result, instead of having to deal with built-in characteristics, maintenance became active in setting design requirements for installations and became partly involved in equipment selection and development. All this led to a different type of precautionary (proactive) maintenance, the underlying principle of which was to be proactive at earlier product stages in order to avoid later consequences. Furthermore, as the maintenance function became better appreciated within the organization, more attention was paid to additional proactive maintenance actions. For example, as operators are in direct and regular contact with the installations, they can intuitively identify and “feel” right or wrong working conditions of the equipment. Conditions such as noise, smell, rattle, vibration, etc., which at a given point are not really measured, represent tacit knowledge that the organization can use to foresee, prevent or avoid failures and their consequences in a proactive manner. These actions are typically not performed by maintenance people themselves, but they are certainly part of the structural evolution of maintenance as a formal or informal partner within the organization.

The last type of precautionary (passive) maintenance actions is driven by the opportunity created when other maintenance actions are planned. These maintenance actions are precautionary since they occur prior to a failure, but are passive as they “wait” to be scheduled depending on other, probably more critical, actions. Passive actions are in principle low priority for the maintenance staff since, at a given moment in time, they may not pose a real threat of functional or safety failures. However, these actions can save significant maintenance resources as they may reduce the


number of maintenance interventions, especially when the set-up cost of maintenance is high. For example, when maintenance actions are planned or need to be carried out on offshore oil platforms or on windmills in remote locations, getting to the equipment can be costly. Finding the best combination of maintenance actions at that point in time is therefore mandatory. This may involve replacing components with significant residual life that in different circumstances would not be replaced.

2.3.2 Maintenance Policies

As new maintenance techniques become available and the economic implications of maintenance actions are better understood, a direct impact on maintenance policies is to be expected. Several types of maintenance policies can be considered to trigger, in one way or another, either precautionary or corrective maintenance interventions. As described in Table 2.1, these policies are mainly failure-based maintenance (FBM), time/use-based maintenance (TBM/UBM), condition-based maintenance (CBM), opportunity-based maintenance (OBM), design-out maintenance (DOM) and e-maintenance.

Table 2.1. Generic maintenance policies

| Policy | Description |
|---|---|
| FBM | Maintenance (CM) is carried out only after a breakdown. In case of CFR behaviour and/or low breakdown costs this may be a good policy. |
| TBM / UBM | PM is carried out after a specified amount of time (e.g. 1 month, 1000 working hours, etc.). CM is applied when necessary. UBM assumes that the failure behaviour is predictable and of the IFR type. PM is assumed to be cheaper than CM. |
| CBM | PM is carried out each time the value of a given system parameter (condition) exceeds a predetermined value. PM is assumed to be cheaper than CM. CBM is gaining popularity because the underlying techniques (e.g. vibration analysis, oil spectrometry, ...) are becoming more widely available and at better prices. The traditional plant inspection rounds with a checklist are in fact a primitive type of CBM. |
| OBM | For some components one often waits to maintain them until the “opportunity” arises when repairing some other, more critical components. The decision whether or not OBM is suited for a given component depends on the expectation of its residual life, which in turn depends on utilization. |
| DOM | The focus of DOM is to improve the design in order to make maintenance easier (or even eliminate it). Ergonomic and technical (reliability) aspects are important here. |

CFR = constant failure rate, IFR = increasing failure rate
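The remarks in Table 2.1 about CFR and IFR behaviour can be made concrete with a small numerical sketch. The chapter itself does not give a model; the code below assumes the classical age-replacement cost rate from the reliability literature, C(T) = [c_pm R(T) + c_cm (1 - R(T))] / ∫₀ᵀ R(t) dt, with a Weibull reliability function R(t). All function names, parameter names and numbers are hypothetical and serve only as an illustration.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a component survives beyond age t (Weibull model)."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(interval, shape, scale, c_pm, c_cm, steps=2000):
    """Long-run cost per unit time when replacing preventively at age `interval`.

    Assumed model (not from the chapter): expected cost of one renewal
    cycle divided by the expected cycle length.
    """
    h = interval / steps
    area = 0.0          # trapezoidal estimate of the integral of R(t) on [0, interval]
    previous = 1.0      # R(0) = 1
    for i in range(1, steps + 1):
        current = weibull_reliability(i * h, shape, scale)
        area += 0.5 * (previous + current) * h
        previous = current
    r_end = previous    # R(interval)
    expected_cycle_cost = c_pm * r_end + c_cm * (1.0 - r_end)
    return expected_cycle_cost / area


def best_interval(shape, scale, c_pm, c_cm, candidates):
    """Candidate interval with the lowest cost rate (simple grid search)."""
    return min(candidates, key=lambda t: cost_rate(t, shape, scale, c_pm, c_cm))


if __name__ == "__main__":
    candidates = list(range(100, 5001, 100))   # hypothetical intervals in hours
    for shape in (1.0, 3.0):                   # shape 1 = CFR, shape > 1 = IFR
        t_best = best_interval(shape, 1000.0, 1.0, 5.0, candidates)
        print(f"shape={shape}: best candidate interval = {t_best} h")
```

Under these assumptions, a shape parameter of 1 (CFR) makes the cost rate fall steadily as the interval grows, so the search drifts to the longest candidate, which amounts to running to failure (FBM). With a shape parameter above 1 (IFR) and PM cheaper than CM, a finite interval gives the lowest cost rate, which is the situation in which TBM/UBM pays off.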

For the more common maintenance policies many models have been developed to support tuning and optimization of the policy setting. It is not our intention to explain the fundamental differences between these models, but rather to provide an overview of types of policies available and why these have been developed. Much


has to do with the discussion in the previous section regarding the acuity of maintenance actions. It is therefore clear that policy setting, and the understanding of its efficiency and effectiveness, continues to be fine-tuned, as in any other management science. We refer the reader who is particularly interested in the underlying principles and types of models to McCall (1965), Geraerds (1972), Valdez-Flores and Feldman (1989), Cho and Parlar (1991), Pintelon and Gelders (1992), Dekker (1996), Dekker and Scarf (1998) and Wang (2002) for a full overview of the state-of-the-art literature.

The whole evolution of maintenance was based not solely on technical but rather on techno-economic considerations. FBM is still applied provided the cost of PM is equal to or higher than the cost of CM. FBM is also typically suitable in the case of random failure behaviour with a constant failure rate, as TBM or UBM cannot then reduce the failure probability. In some cases, if there is a measurable condition that can signal the probability of a failure, CBM can also be feasible. Finally, an FBM policy is also applied for installations where frequent PM is impracticable and expensive, as can be the case for the maintenance of glass ovens.

Either TBM or UBM is applied if the CM cost is higher than the PM cost, or if it is necessary because of criticality due to bottleneck installations or safety hazards. TBM and UBM policies are also appropriate in the case of increasing failure rates, for example wear-out phenomena.

Originally, CBM was mainly applied in situations where the investment in condition-monitoring equipment was justified by high risks, as in aviation or nuclear power generation. Currently, CBM is becoming generally accepted for maintaining all types of installations, and it is increasingly common practice in the process industries. In some cases, however, technical feasibility is still a hurdle to overcome. Another reason why CBM attracts the attention of practitioners is the potential savings in spare parts replacements thanks to accurate and timely demand forecasts. In turn, this may enable better spare parts management through coordinated logistics support. Finding and applying a suitable CBM technique is not always easy. For example, analysing the output of some measurement equipment, such as advanced vibration monitoring equipment, requires a lot of experience and is often work for experts. But there are also simpler techniques, such as infrared measuring and oil analysis, that are suitable in other contexts. At the other extreme, predictive techniques can be rather simple, as is the case with checklists. Although a fairly low-level activity, these checklists, together with human senses (visual inspections, detection of “strange” noises in rotating equipment, etc.), can detect a lot of potential problems and initiate PM actions before the situation deteriorates into a breakdown.

FBM, TBM, UBM and CBM all take the physical assets they intend to maintain as a given. In contrast, there are more proactive maintenance actions and policies which, instead of considering the systems as “a given”, look at the possible changes or safety measures needed to avoid maintenance in the first place. This proactive policy is referred to as DOM. It implies that maintenance is proactively involved at earlier stages of the product life cycle to solve potential maintenance-related problems. Ideally, DOM policies intend to avoid maintenance completely throughout the operating life of installations, although this may not be realistic. This leads one to consider a diverse set of maintenance requirements at the


early stages of equipment design. As a consequence, equipment modifications are geared either at increasing reliability by raising the mean time between failures (MTBF) or at increasing maintainability by decreasing the mean time to repair (MTTR). In this way DOM aims to improve equipment availability and safety. Some equipment modifications may merely require ergonomic considerations to reduce MTTR; others may need totally new designs. Often DOM projects are combined with efforts to increase occupational safety or production capacity, such as set-up reduction programs.

A rather passive, but considerably important, maintenance policy that needs to be mentioned is OBM. Typically OBM is applied for non-critical components with a relatively long lifetime. For these components no separate maintenance programs are scheduled; maintenance happens if an opportunity arises due to a maintenance intervention on another component of that machine.

More recently, in the mid-1990s, with the emergence of the Internet as an enabling technology and the growth of e-business as the standard for business communication, e-maintenance also appeared on the radar of maintenance policies. Rather than a policy, e-maintenance can also be considered a means or enabler for some, if not all, of the previous policies. However, it is more than just an acronym; it is a step forward to fully integrated maintenance techniques without the boundaries of place. It is in fact a maintenance policy in its own right that can support other policies. In particular, academics and practitioners watch with anticipation the great impact it may have on CBM. Conditions measured on site can be remotely monitored, opening entirely new dimensions and opportunities for maintenance services. E-maintenance has therefore captured much attention from maintenance researchers, given its great impact on business practice. An example of this evolution is telemaintenance, which allows installations to be diagnosed and limited types of repairs to be performed from a remote location using ICT and sophisticated control and knowledge tools.

2.3.3 Maintenance Concepts

The idea of an “optimized” maintenance program suggests that an adequate mix of maintenance actions and policies needs to be selected and fine-tuned in order to improve uptime, extend the total life cycle of physical assets and assure safe working conditions, while bearing in mind limited maintenance budgets and environmental legislation. This is not straightforward and may require a holistic view. Therefore, a “maintenance concept” for each installation is necessary to plan, control and improve the various maintenance actions and policies applied. In the long term a maintenance concept may even become a philosophy, tenet or attitude towards performing maintenance. In some cases advanced maintenance concepts are almost considered strategies in their own right. What is certain is that maintenance concepts determine the business philosophy concerning maintenance, and that they are needed to manage the complexity of maintenance as such. In practice, more and more companies are spending time and effort on determining the right maintenance concept. Maintenance concepts need to be formulated considering the physical characteristics of installations and the context within which they operate. Not


surprisingly, as system complexity increases and maintenance requirements become more complex, maintenance concepts require different levels of complexity. The literature provides various concepts that have been developed through a combination of theoretical insights and practical experience. Choosing and implementing the best concept in a given context is hard. To the question “what concept is best for us?”, no short and straightforward answer exists. The right answer is determined by the context, with its complex interaction of technology, business, organization, and so forth. Designing and implementing a good concept will take time and effort. Many companies establish teams with members from different areas (engineering, production, maintenance, ...) to accomplish this difficult task. On the market, many consultants offer their services to assist in this process. This outside help may be very useful to get started and to obtain a better insight into one’s own situation. However, it is useful to note that many consultants have “their” concept (e.g. RCM) that they are used to implementing, which may bias their judgment on what concept is “right”. Some outside guidance can thus be useful, but a good concept that fits all the company’s needs should be built by in-house people, using all the knowledge available.

Several times in this chapter, it has been suggested that, alongside increasing system complexity, maintenance has also evolved over time. This has led to three generations of maintenance concepts, with their respective transition points. The following paragraphs offer an overview, which is also portrayed in Table 2.2.

In the past, equipment was generally much simpler; hence the need for maintenance decision support was moderate. For truly simple systems, even a single maintenance policy may be considered a concept on its own. This is considered the simplest form, the “first generation”, of maintenance concepts. Here, only one maintenance policy or even one type of action was applied to certain equipment. For a state-of-the-art review of this type of maintenance concept see Wang (2002).

With the advent of automation, installations became highly mechanized, the equipment became more complex and the interdependencies of multi-unit systems could no longer be ignored. To maintain such installations efficiently a specific mixture of maintenance policies and actions was required, and the need for decision structures became crucial. These circumstances prompted, in the first instance, the concept of simple quick and dirty (Q&D) decision diagrams. Q&D charts help to select adequate maintenance policies by asking a series of structured but simple questions to which only ‘yes’ or ‘no’ answers can be given. The authors note that even though Q&D charts lack the holistic view required for well-conceived and sophisticated maintenance concepts, they are still widely used in practice in specific situations thanks to their simplicity. Examples are reported in Pintelon et al. (2000) and Waeyenbergh and Pintelon (2002).

Eventually, as the complexity of maintenance decisions increased, more advanced maintenance concepts were called for. As a result, in the last 40 years a vast range of maintenance concepts has been extensively documented in the literature. This group of concepts is considered the “second generation” of maintenance concepts and provides a pool of knowledge for maintenance practitioners and researchers. Typical examples, and perhaps the most important ones, are total productive


maintenance (TPM), reliability-centred maintenance (RCM) and life cycle costing (LCC) approaches.

Table 2.2. Description of the maintenance concepts generations

| Generation | Concept | Description | Main strengths | Main weaknesses |
|---|---|---|---|---|
| 1st | Ad hoc | Implementing FBM and UBM policies; rarely CBM, DOM, OBM | Simple | Ad hoc decisions |
| 1st → 2nd | Q&D | Easy-to-use decision chart. It helps to decide on the “right” maintenance policy | Consistent, allows for priorities | Rough questions and answers |
| 2nd | LCC | Detailed cost breakdown over the equipment’s lifetime, helping to plan the maintenance logistics | Sound basic philosophy | Resource and data intensive |
| 2nd | TPM | Approach with an overall view on maintenance and production. Especially successful in the manufacturing industry | Considers human/technical aspects, fits in kaizen approach. Extensive tool box | Time-consuming implementation |
| 2nd | RCM | Structured approach focused on reliability. Initially developed for high-tech/high-risk environments | Powerful approach, step-by-step procedure | Resource intensive |
| 2nd → 3rd | RCM-based | Approaches focused on remediating some of the perceived RCM shortcomings. Examples: streamlined RCM, BCM, RBCM | Improved performance through e.g. use of sound statistical analysis | Sometimes an oversimplification |
| 3rd | Customized | In-house developed; cherry-picking from existing concepts. Examples: CIBOCOF, VDM | Exploiting the company’s strengths and considering the specific business context | Ensuring consistency and quality in the concept developed |

All these concepts, like many others, have several advantages and suffer from specific shortcomings. Correspondingly, new maintenance concepts are developed, old ones are updated and methodologies to design customized maintenance concepts are created. These concepts enjoy a lot of interest in their original form and also give rise to many derived concepts, for example streamlined RCM, derived from RCM. One may consider that customized maintenance concepts constitute the “third generation” of this evolution. They have emerged essentially because it is very difficult to claim a “one size fits all” concept in the complex and constantly changing world of maintenance. They are inspired by the earlier concepts while trying to avoid the drawbacks previously experienced. One way or another, customized maintenance concepts mainly consist of “cherry-picking” useful techniques and ideas applied in other maintenance concepts. This important but relatively new type of concept is expected to grow in importance both in practice and among academics. Concepts that belong to this generation are, for example, value driven maintenance (VDM) and CIBOCOF, which was developed at the Centre of


Industrial Management (CIB), K.U. Leuven, Belgium. Additionally, in-house maintenance concepts, mostly developed in organizations with fairly high maintenance maturity, also belong to this category. One example is a petrochemical company that developed a customised concept which basically followed the RCM logic. By extending the RCM analysis steps and introducing risk-based inspections (RBI), a more focused and better-conceived maintenance plan could be developed. Moreover, the company borrowed some elements from TPM and incorporated these in its maintenance concept. For example, multi-skilled training programmes were implemented and special tool kits were designed for a number of maintenance jobs using TPM principles.

Customized concepts were perceived as necessary even before the third generation of maintenance concepts took shape. The literature recognizes an intermediate step bridging the second and third generations, in which maintenance concepts such as business-centred maintenance (BCM) and risk based centred maintenance (RBCM) were developed. These concepts are essentially RCM-related and still widely applied in many organizations. However, a slow but steady movement towards more customized maintenance concepts is expected in the near future, as the maintenance function matures. Next, a straightforward description of the most important concepts is presented and important references are provided for the interested reader.

2.3.3.1 Quick & Dirty Decision Charts (Q&D)

A Q&D decision chart is a decision diagram with questions on several aspects, including failure patterns, repair behaviour of the equipment, business context, maintenance capabilities, cost structure, etc. By answering the questions for a given installation, the user proceeds through the branches of the diagram. The process stops with the recommendation of the most appropriate policy for the specific installation.
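The mechanics of walking through such a chart can be sketched in a few lines of code. The chart below is purely hypothetical: it is not the chart of Pintelon et al. (2000), and the questions, their order and the identifiers (`Question`, `CHART`, `recommend`) are assumptions loosely based on the policy descriptions given earlier in this chapter.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Question:
    """One yes/no node of the chart; `yes` and `no` lead to the next node."""
    key: str
    text: str
    yes: "Node"
    no: "Node"


# A node is either another question or a leaf holding the recommended policy.
Node = Union[Question, str]

# Hypothetical chart; a real one would encode company-specific knowledge.
CHART: Node = Question(
    "redesign_feasible",
    "Can the failure cause be designed out at acceptable cost?",
    yes="DOM",
    no=Question(
        "condition_measurable",
        "Is there a measurable condition that signals an upcoming failure?",
        yes="CBM",
        no=Question(
            "increasing_failure_rate",
            "Does the failure rate increase with age or use (IFR)?",
            yes=Question(
                "pm_cheaper",
                "Is PM cheaper than CM?",
                yes=Question(
                    "critical",
                    "Is the component critical (bottleneck or safety)?",
                    yes="TBM/UBM",
                    no="OBM",
                ),
                no="FBM",
            ),
            no="FBM",
        ),
    ),
)


def recommend(chart: Node, answers: Dict[str, bool]) -> Tuple[str, List[Tuple[str, bool]]]:
    """Follow the yes/no answers through the chart until a policy is reached."""
    node = chart
    path: List[Tuple[str, bool]] = []
    while isinstance(node, Question):
        answer = answers[node.key]        # only questions on the path are needed
        path.append((node.key, answer))
        node = node.yes if answer else node.no
    return node, path


if __name__ == "__main__":
    policy, path = recommend(CHART, {
        "redesign_feasible": False,
        "condition_measurable": False,
        "increasing_failure_rate": True,
        "pm_cheaper": True,
        "critical": True,
    })
    print(policy, path)
```

Keeping the chart as data rather than as nested `if` statements mirrors the way companies are said to adapt such charts: questions can be added, removed or reordered without touching the traversal logic.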
The Q&D approach allows for a relatively quick determination of the most advantageous maintenance policy and ensures consistent decision making for all installations. Although some Q&D decision charts are available from the literature (e.g. Pintelon et al. 2000), most companies adopting this approach prefer to draw up their own charts, which incorporate their experience and knowledge in the decision process. This can be done in several ways, for instance by defining specific questions, adding or deleting maintenance policies, or establishing a preferred sequence in which the different policies should be considered. This approach, however, has the drawback of being rough (dirty). The questions are usually put in the basic yes/no format, limiting the answering possibilities. Moreover, the questions are usually answered on a subjective basis; for example, the question whether a given action or policy is feasible is answered based on experience rather than on a sound feasibility study.

2.3.3.2 Life Cycle Costing (LCC) Approaches

LCC originated in the late 1960s and is now enjoying a revival. The basic principle of LCC is sometimes summarised as “it is unwise to pay too much, but is foolish to spend too little”. This refers to the two main underlying ideas of LCC. The first concerns the cost iceberg structure presented by Blanchard (1992), by whom LCC


was revived. Mainly he proposes that when considering maintenance or equipment purchasing alternatives, one should not be limited to what momentarily can be seen: “the top of the iceberg”, such as direct maintenance costs (material, labour, etc.) or the purchase price. The indirectly relevant long run cost such as operational expenses, trainning cost, spares inventory costs, etc. are at least of the same order of magnitude. The second refers to the principle that the further one gets in the design or construction cycle of equipment, the more costly it will be to make modifications (e.g. DOM). Maintenance should be taken into account from the very first moment of designing a machine or system. LCC is a methodology for calculating or estimating the total cost of a system during the entire course of its life. This LCC approach implies a synthesis of costing analysis and engineering design principles that must satisfy life cycle requirements at minimum cost. In turn, design decisions are based on total cost of ownership (TCO) principles. In the literature, several LCC approaches can be distinguished. Among the more important ones are Terotechnology, Integrated Logistic Support/Logistics Support Analysis (ILS/LSA) and Capital asset management. During the 1970s, the Terotechnology concept originated in the UK and was the first formal attempt towards LCC (Parkes 1970). It describes a total view of maintenance management that combines management, technology, logistical support and financial control for industrial systems. Terotechnology is concerned with the specification and design for reliability and maintainability of physical assets. The application of Terotechnology also takes into account the processes of installation, commissioning, operation, maintenance, modification and replacement. Decisions are influenced by feedback of information on design, performance and cost, throughout the life cycle of a project. 
Although generally accepted as very useful, it was not until fairly recently that terotechnology or similar LCC approaches were adopted by large-scale industry. This was largely due to developments in ICT that made LCC easier.

In the 1980s a different LCC approach, integrated logistic support/logistics support analysis (ILS/LSA), originated in military logistics support. Maintenance is regarded as an important issue within the integral logistical support. ILS comprises the spectrum of all activities related to the logistical support of a system during its entire life cycle. These logistical support activities include maintenance concept development, spare parts provisioning, technical information, the maintenance crew, training programs, etc. The goal of ILS may be summarized as achieving minimum life cycle costs. Furthermore, LSA is an iterative analytical process to identify and evaluate the logistic support for a new system. LSA constitutes the integration and application of various techniques and methods to ensure that supportability requirements are considered in the system design process.

Finally, capital asset management, an LCC approach with real concern for the financial performance of assets, was developed. Capital asset management provides information to make the financial and operational decisions that optimize equipment performance, from deployment through operations, maintenance and retirement. The key focus is not technical, but financial. Asset management aims at maximizing the return on investment (ROI) in capital assets so that they last longer, perform better and cost less to maintain.
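The LCC principle described above, comparing alternatives on total cost over the whole life rather than on purchase price alone, can be illustrated with a minimal sketch. The cost categories, the figures and the use of a discount rate are assumptions made for illustration only; they are not taken from any of the LCC approaches cited in the text.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    """One purchasing alternative; all figures are hypothetical."""
    name: str
    purchase_price: float      # the visible "top of the iceberg"
    annual_operating: float    # e.g. energy and operator costs per year
    annual_maintenance: float  # e.g. labour, material, spares inventory per year
    life_years: int


def life_cycle_cost(alt: Alternative, discount_rate: float = 0.0) -> float:
    """Purchase price plus (optionally discounted) yearly costs over the life.

    Discounting is an assumption of this sketch; with a rate of 0 the
    function simply adds up all costs.
    """
    annual = alt.annual_operating + alt.annual_maintenance
    discounted = sum(
        annual / (1.0 + discount_rate) ** year
        for year in range(1, alt.life_years + 1)
    )
    return alt.purchase_price + discounted


cheap = Alternative("cheap machine", 100_000, 20_000, 15_000, 10)
robust = Alternative("robust machine", 150_000, 18_000, 8_000, 10)

for alt in (cheap, robust):
    print(alt.name, round(life_cycle_cost(alt, discount_rate=0.05)))
```

With these invented figures the machine with the higher purchase price has the lower life cycle cost, which is the iceberg argument in numerical form.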

Maintenance: An Evolutionary Perspective


2.3.3.3 Total Productive Maintenance (TPM)

TPM (Takahashi and Takeshi 1990) is much more than just a concept; it is even considered a maintenance philosophy, which derives the greater part of its substance from a variety of non-Japanese management structures and practices that were adapted by the Japanese to fit their culture. TPM involves total participation at all levels of the organization. It aims at maximizing equipment effectiveness and establishing a thorough system of preventive maintenance. TPM fits entirely with the TQM philosophy and the JIT approach. The latter makes sure that problems of various natures (material related, breakdown related, training related, ...) are tackled and solved one by one, instead of being camouflaged by large buffer stocks as was the case with MRP approaches.

The TPM toolbox consists of various techniques, some of which are universal, such as Six Sigma, Pareto or ABC analysis, and Ishikawa or fishbone diagrams. Other concepts and techniques such as SMED, poka-yoke, jidoka, OEE and the 5S are specific to the TPM philosophy. The last two are of particular importance and deserve further explanation. The overall equipment effectiveness (OEE) is a powerful tool to measure the effective use of production capacity. The strength of the concept is the integration of production, maintenance and quality issues into what is called the “six big losses” of useful capacity. Figure 2.5 illustrates this concept. The 5S, on the other hand, form one of the basic principles of TPM: Seiri (sorting out), Seiton (systematic arrangement), Seiso (spic and span), Seiketsu (standardizing) and Shitsuke (self-discipline).

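[Figure 2.5: time cascade from total time to valuable operating time. Total time less planning losses (planning delays, planned maintenance) gives loading time; loading time less downtime losses (failures, set-up and adjustment) gives operating time; operating time less loss of speed (stoppages, reduced speed) gives net operating time; net operating time less quality losses (process defects, reduced yields) gives valuable operating time. The six big losses are failures, set-up and adjustment, stoppages, reduced speed, process defects and reduced yields.]

Figure 2.5. The “big six losses” of overall equipment effectiveness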
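The time cascade of Figure 2.5 can be turned into a small calculation. The sketch below follows the figure's terms (loading time, operating time, net operating time, valuable operating time); the split into availability, performance and quality rates is the commonly used OEE decomposition and is an assumption here, as are the function name and the example figures.

```python
def oee_from_times(loading_time, downtime_losses, speed_losses, quality_losses):
    """Compute OEE from the time cascade of Figure 2.5 (all times in one unit).

    Assumes every intermediate time is positive; the three rates are the
    usual availability/performance/quality split, not taken from the text.
    """
    if loading_time <= 0:
        raise ValueError("loading_time must be positive")

    operating_time = loading_time - downtime_losses        # after failures, set-up
    net_operating_time = operating_time - speed_losses     # after stoppages, reduced speed
    valuable_time = net_operating_time - quality_losses    # after defects, reduced yields
    if valuable_time <= 0:
        raise ValueError("losses exceed the available loading time")

    availability = operating_time / loading_time
    performance = net_operating_time / operating_time
    quality = valuable_time / net_operating_time
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        # The product of the three rates equals valuable_time / loading_time.
        "oee": availability * performance * quality,
    }


# Hypothetical shift: 400 min loading time, 40 min downtime,
# 60 min speed losses, 30 min quality losses.
result = oee_from_times(400, 40, 60, 30)
print({name: round(value, 3) for name, value in result.items()})
```

Planning losses do not appear in the function because, in the figure, they are removed before loading time is reached.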

2.3.3.4 Reliability Centered Maintenance (RCM)

RCM originated in the 1960s in the North American aviation industry. Later on it was adopted by military aviation, and afterwards it was initially implemented only at high-risk industrial plants such as nuclear power plants. Now it can be found in industry at large. Well known are the books by Nowlan and Heap (1978), Anderson and Neri (1990) and Moubray (1997), who contributed to the adoption of RCM by industry. Note that today many versions of RCM are around, streamlined RCM being one of the more popular ones. However, the Society of Automotive Engineers (SAE) holds the RCM definition that is generally accepted. SAE puts forward the following seven basic questions to be answered by any RCM implementation; if any of these is omitted, the method is incorrectly being referred to as RCM. To answer these seven questions a clear step-by-step procedure exists, and decision charts and forms are available:

• What are the functions and associated performance standards of the asset in its present operating context?
• How can it fail to fulfil its functions? (functional failures)
• What causes each failure? (failure modes)
• What happens when each failure occurs? (failure effects)
• In what way does each failure matter? (failure consequences)
• What should be done to predict or prevent each failure? (proactive tasks and task intervals)
• What should be done if a suitable proactive task cannot be found? (default actions)

RCM is undeniably a valuable maintenance concept. It takes into account system functionality, and not just the equipment itself. The focus is on reliability. Safety and environmental integrity are considered to be more important than cost. Applying RCM helps to increase the asset’s lifetime and to establish more efficient and effective maintenance. Its structured approach fits in the knowledge management philosophy: reduced human error, more and better historical data and analysis, exploitation of expert knowledge and so forth. RCM is popular and many RCM implementations have started during the last decade. Although RCM offers many benefits, there are also drawbacks. From the conceptual point of view there are some weak points.
For instance, the original RCM does not offer a task packaging feature and thus does not automatically yield a workable maintenance plan, and the standard decision charts and forms offered are helpful but far from perfect. A serious remark, mainly from the academic side, concerns the scientific basis of RCM: the FMEA, which is the heart of the RCM analysis, is often done on a rather ad hoc basis. Often the available statistical data are insufficient or inaccurate, there is a lack of insight into the equipment degradation process (failure mechanisms), and the physical environment (e.g. a corrosive or dusty environment) is ignored. The balance between valuable experience and equally valuable, objective statistical evidence is often absent. Many companies call in the (expensive) help of consultants to implement RCM; some of these consultants, however, are not capable of offering the help wanted, and this, in combination with the lack of in-house experience with RCM, discredits the methodology. RCM is in fact an on-going process, which often causes reluctance to engage in an RCM project. RCM is undoubtedly a very resource-consuming process, which also makes it difficult to apply RCM to all equipment.


2.3.3.5 RCM-Related Concepts

RCM as such has proven to be a very valuable concept, focussing on reliability and paying attention to safety and environment. Its structured approach ensures asset sustainability. However, there are some drawbacks that should be kept in mind and, if possible, remedied. In the literature one can find many RCM-related concepts such as Gits, Coetzee, BCM, RBCM, streamlined RCM, and so forth. All of them adopt RCM principles with the intention of solving some of its shortcomings. This group of concepts constitutes the bridging step to the third generation of maintenance concepts.

Gits (1984) developed an RCM-like maintenance concept. The main difference from the original RCM is that the methodology delivers a workable maintenance plan. The focus of the concept is on technical and organizational aspects, rather than on economic considerations. This three-phase approach establishes the maintenance plan by quantifying and clustering basic maintenance rules. Those rules are harmonised into operational entities that describe exactly what must be done.

Later on, Jones (1995) put forward risk based reliability centred maintenance (RBCM), a new variant of basic RCM. Basically, RBCM can be described as RCM with a strong statistical background. This tackles the drawback of the ad hoc FMEA of the traditional RCM approach. Risk based inspections (RBI) are one of the core concepts here. The RBI methodology enables the assessment of the likelihood and potential consequences of pressure equipment failures. RBI provides companies with the opportunity to prioritize equipment inspections and to optimize the inspection methods, frequencies and resources. Furthermore, RBI helps to develop specific equipment inspection plans and enables the implementation of RCM as such. This results in improved safety, lower failure risks, fewer forced shutdowns and reduced operational costs.
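The RBI idea of prioritizing inspections by likelihood and consequence of failure can be sketched as a simple ranking. Scoring risk as probability times consequence cost is a common simplification and an assumption of this sketch; the equipment tags and figures are invented, and real RBI schemes are considerably richer.

```python
from dataclasses import dataclass


@dataclass
class EquipmentItem:
    """Hypothetical pressure equipment item with illustrative figures."""
    tag: str
    failure_probability: float  # estimated probability of failure per year
    consequence_cost: float     # estimated cost if the failure occurs

    @property
    def risk(self) -> float:
        # Assumed risk measure: likelihood multiplied by consequence.
        return self.failure_probability * self.consequence_cost


def prioritise(items):
    """Return the items ordered from highest to lowest risk."""
    return sorted(items, key=lambda item: item.risk, reverse=True)


items = [
    EquipmentItem("V-101", 0.02, 500_000),
    EquipmentItem("P-201", 0.10, 20_000),
    EquipmentItem("E-301", 0.01, 2_000_000),
]

for item in prioritise(items):
    print(item.tag, round(item.risk))
```

In this example the item that fails most often (P-201) ends up last in the inspection ranking because its consequences are small, which is the kind of trade-off a purely frequency-based plan would miss.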
The risk-based approach requires a systematic and integrated use of expertise from the different disciplines that affect plant integrity. These include design, materials selection, operating parameters and scenarios, and understanding of the current and future degradation mechanisms and of the risks involved.

So far, all preceding RCM-inspired concepts aimed at remedying technical drawbacks of RCM by converting them into workable solutions. It was not until Kelly (1997), with his business-centred maintenance (BCM), a full-fledged concept for determining a detailed maintenance plan, that the business as such became the focal point. Kelly emphasised the importance of identifying, mapping and auditing the maintenance function. The BCM concept also pays attention to the necessary administrative support. Kelly calls his approach a BUTD (bottom-up/top-down) approach. The first step is top-down: starting from the business context, the exact objectives for maintenance are outlined, considering all corporate levels. The second step is bottom-up: it aims at establishing a life maintenance plan for every item of equipment. In a third and last step, all item life plans are fitted into a maintenance strategy. Applying BCM thus results in a detailed maintenance schedule, ready for use.

RCM implementation is complex, time consuming and not straightforward. Hence, it should be implemented in a controlled fashion with the total support of all levels of the organization. Coetzee (2002) mentions that RCM is a core methodology to ensure that the organization can achieve world-class results. However, to achieve this objective the traditional RCM should be enhanced. Coetzee proposes a “new” RCM that blends concepts and related techniques from different RCM authors. He also puts forward some innovations, like the funnelling approach to ensure that RCM efforts are concentrated on the most important failure modes in the organization.

Finally, there is a vast range of so-called “streamlined RCM” concepts. These concepts claim to be derivations of RCM. It is mainly consultants who promote streamlined RCM as the solution to the resource-consuming character of RCM. Although streamlining sounds attractive, it should be applied carefully in order to keep the RCM benefits. Different streamlining approaches exist; however, very few are acceptable as formal RCM methodologies. Based on Pintelon and Van Puyvelde (2006), Table 2.3 provides a picture of popular streamlined RCM approaches.

Table 2.3. Classification of streamlined RCM concepts

Example: Retro-active approach
Characteristics: Starts from the existing maintenance plan. Determines the failure mode for all maintenance tasks and implements the last RCM steps for these.
Pitfalls: Quite time-consuming to find the failure modes for all tasks. “Functions” are detected on an ad hoc basis. It implies that the existing maintenance plan is good.

Example: Generic approach
Characteristics: Uses generic lists of failure modes, or even generic analyses of technical systems.
Pitfalls: Ignores the operational context of the technical systems and the current maintenance practices. It assumes a standard level of analysis detail for all systems.

Example: Skipping approach
Characteristics: Omits one or more steps. Typically, the first step (functions) is skipped and the analysis starts with listing the failure modes.
Pitfalls: Omits the first and essential step of RCM, i.e. the functional analysis, and as such also does not allow for sound performance standard setting.

Example: Criticality approach
Characteristics: Limits the implementation to critical functions and/or failures; for these a full RCM analysis is performed.
Pitfalls: Often determines criticality on an ad hoc basis or uses criticality tools which are less reliable than the RCM approach.

Example: Troublemaker approach
Characteristics: Carries out a full RCM analysis for critical equipment only. Critical equipment is defined here as bottleneck equipment, equipment which had a lot of maintenance problems in the past, or equipment which is critical in terms of safety hazards.
Pitfalls: Same as above, although here all RCM steps are followed, which guarantees a complete “picture”.

2.3.3.6 Customized Maintenance Concepts

The value driven maintenance (VDM) methodology proposed by Haarman and Delahay (2004) builds a bridge between traditional maintenance philosophies and shareholder value. Not only does VDM simplify the boardroom discussion, it also shows that, far from being a cost center, maintenance is actually a major source of economic value within the overall business performance. It is built on established best maintenance practices and concepts such as TPM, RCM and RBI. It shows where the added value of maintenance lies and how an organisation can best be structured to realise this value. One of the main contributions of VDM is that it offers a common language for management and maintenance to discuss maintenance matters. VDM identifies four value drivers in maintenance and provides concepts to manage by those drivers. For all four value drivers, maintenance can help to increase a company’s economic value. VDM makes a link between value drivers and core competences. For each of the core competences, some managerial concepts are provided.

Most recently, Waeyenbergh (2005) presented CIBOCOF as a framework to develop customised maintenance concepts. CIBOCOF starts out from the idea that although all maintenance concepts available from the literature contain interesting ideas, none of them is suitable for implementation without further customization. Companies have their own priorities in implementing a maintenance concept and are likely to go for “cherry picking” from existing concepts. CIBOCOF offers a framework to do this in an integrated and structured way. Figure 2.6 illustrates the steps that this concept goes through. A particularly interesting step is step 5, maintenance policy optimization, where a decision chart is offered to determine which mathematical decision model can be used to optimize the chosen policy (step 4). This decision chart guides the user through the vast literature on the topic.

[Figure 2.6: five modules arranged around the maintenance plan: M1 Start-up, M2 Technical analysis, M3 Policy decision making, M4 Implementation & Evaluation, M5 Continuous improvement.]

Figure 2.6. CIBOCOF logic

2.4 Maintenance Manager

As maintenance management evolved, so did the job of the maintenance manager. Clearly maintenance management is no longer a pure technical function. Business economics (cost-benefit considerations) and business context (how important are the installations in question?, what are the functional requirements?, …) play an important role. A good maintenance manager needs to have a technical background in order to have an eye for the “big picture” and not lose any aspect out of sight.


Nowadays, the decisions expected from the maintenance manager are complex and can sometimes have far-reaching consequences. He/she is (partly) responsible for operational, tactical and strategic aspects of the company’s maintenance management. This involves the final responsibility for operational decisions, like the planning of maintenance jobs, and for tactical decisions concerning the long-term maintenance policy to be adopted. More recently, maintenance managers are also consulted on strategic decisions, e.g. purchases of new installations, design choices, personnel policy, … The career path of today’s maintenance manager starts out from rather technical content, but evolves over time towards more financial and strategic responsibilities. This career path can be horizontal or vertical.

It is also important that the maintenance manager is a good communicator and people manager, as maintenance remains a labor-intensive function. The maintenance manager needs to be able to attract and retain highly skilled technicians. On-going training for technicians is needed to keep track of rapidly evolving technology. Motivation of maintenance technicians often requires special attention: job autonomy is greater in maintenance than in production, instructions may be vague, immediate assessment of the quality of work is mostly not possible, complaints are heard more often than compliments, etc. Aspects like safety and ergonomics are an indispensable element of current maintenance management.

Besides people, materials are another important resource for maintenance work. Maintenance material logistics mainly concerns spare parts management and finding the optimum trade-off between high spare parts availability and the corresponding stock investments.
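The trade-off between spare parts availability and stock investment mentioned above can be made concrete with a small sketch. It assumes that demand for a part during the replenishment lead time follows a Poisson distribution, a common modelling choice for slow-moving spares but not something stated in the text; the function names, demand rate and unit price are hypothetical.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """Probability that Poisson demand with the given mean is at most k."""
    return sum(
        math.exp(-mean) * mean ** i / math.factorial(i)
        for i in range(k + 1)
    )


def min_stock_for_availability(mean_demand: float, target: float) -> int:
    """Smallest stock level whose no-stockout probability reaches the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    stock = 0
    while poisson_cdf(stock, mean_demand) < target:
        stock += 1
    return stock


# Hypothetical part: on average 2 demands per lead time, 1500 per unit.
unit_price = 1500
for target in (0.80, 0.90, 0.95, 0.99):
    stock = min_stock_for_availability(2.0, target)
    print(f"availability {target:.2f}: stock {stock}, investment {stock * unit_price}")
```

Each step up in target availability requires more units on the shelf, so the printed lines show how quickly the stock investment grows as the target approaches 100%.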
The evolution in maintenance management described above creates a sharp need for decision support techniques of various kinds: statistical analysis tools for predicting the failure behaviour of equipment, decision schemes for determining the right maintenance concept, mathematical models to optimize the maintenance policy parameters (e.g. PM frequency), decision criteria concerning e-maintenance, decision aids for outsourcing decisions, etc. Table 2.4 illustrates the use of some decision support techniques for maintenance management. These techniques are available and have proven their usefulness for maintenance, but they are not yet widely adopted.

In the 1960s most maintenance publications were very mathematically oriented and mainly focussed on reliability. The publications of the 1970s and early 1980s were more focused on maintenance policy optimization, such as determining the optimum preventive maintenance interval, planning group replacements and inspection modelling. This was a step forward, although these models were still often focussed on mathematical tractability rather than on realistic assumptions and hypotheses. This caused an unfortunate gap between academics and practitioners. The former had the impression that industry and the service sector were not “ready” for their work, while the latter felt frustrated because the models were too theoretical. Fortunately, this is changing. Academics pay more attention to the real-life background of their subject and practitioners discover the usefulness of the academic work. Moreover, academic work is getting broader and offers a more diverse range of models and concepts, such as maintenance strategy design models, e-maintenance concepts, service parts supply policies and the like, besides the more traditional maintenance optimization models. With the introduction of maintenance software, the data required for these models could be collected more easily. There still is a big gap between practitioners and academics, but it is slowly closing.

Table 2.4. OR/OM techniques and their application in maintenance

Techniques               Application examples in maintenance management
Statistics               Describing failure behaviour
Reliability theory       Reliability prediction of complex systems
Markov theory            Availability studies of repairable systems
Renewal theory           Replacement decisions (group or individual)
Math programming         Maintenance policy parameter optimization
Decision theory          Decisions under uncertainty
Queueing theory          Trade-off personnel capacity - service level
Simulation               Comparison of alternative maintenance policies
Inventory control        MRO management: FMI, NMI, SMI and VSMI
Time and motion study    Estimation of maintenance intervention times
Scheduling – rostering   Daily planning of maintenance jobs
Project planning         Planning of turnaround, large renovation projects
MCDM                     Selecting the best outsourcing partner

MRO = maintenance, repair and operating supplies, FMI = fast moving items, NMI = normal moving items, SMI = slow moving items, VSMI = very slow moving items, MCDM = multi-criteria decision making, OR/OM = Operations Research / Operations Management
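As an illustration of the "maintenance policy parameter optimization" entry in Table 2.4, the sketch below searches for a preventive replacement interval under the classical age-replacement model. The model, the Weibull lifetime, the cost figures and all function names are assumptions chosen for illustration; the chapter does not prescribe a specific model. The expected cost per unit time for replacing at age T is taken as [Cp·R(T) + Cf·(1 − R(T))] divided by the integral of R(t) from 0 to T, where R is the reliability function.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a Weibull-distributed lifetime exceeds t."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(interval, shape, scale, cost_preventive, cost_failure, steps=2000):
    """Expected cost per unit time when replacing preventively at age `interval`."""
    # Expected cycle length = integral of R(t) over [0, interval] (trapezoidal rule).
    h = interval / steps
    area = 0.5 * (1.0 + weibull_reliability(interval, shape, scale))  # R(0) = 1
    area += sum(weibull_reliability(i * h, shape, scale) for i in range(1, steps))
    expected_cycle_length = area * h

    survive = weibull_reliability(interval, shape, scale)
    expected_cycle_cost = cost_preventive * survive + cost_failure * (1.0 - survive)
    return expected_cycle_cost / expected_cycle_length


def best_interval(candidates, shape, scale, cost_preventive, cost_failure):
    """Candidate interval with the lowest expected cost rate (simple grid search)."""
    return min(
        candidates,
        key=lambda t: cost_rate(t, shape, scale, cost_preventive, cost_failure),
    )


candidates = range(10, 501, 10)
# Hypothetical wear-out behaviour (shape > 1): failure costs 5x a planned replacement.
print(best_interval(candidates, shape=2.5, scale=100.0, cost_preventive=1.0, cost_failure=5.0))
# Constant failure rate (shape = 1): the search drifts to the longest interval.
print(best_interval(candidates, shape=1.0, scale=100.0, cost_preventive=1.0, cost_failure=5.0))
```

The second call shows a well-known property of this model: with a constant failure rate, preventive replacement brings no benefit, so the grid search pushes the interval to the largest candidate.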

The help from information technology (IT) is of special interest when discussing decision support for maintenance managers. Computerized maintenance management systems (CMMS), also called computer aided maintenance management (CAMM), maintenance management information systems (MMIS) or even enterprise asset management systems (EAM), nowadays offer substantial support for the maintenance manager. These systems too have evolved over time (Table 2.5). IT of course also supports e-maintenance applications and offers splendid opportunities for knowledge management implementations. At the beginning of the knowledge management hype, knowledge management was mainly aimed at fields like R&D, innovation management, etc. Later on the potential benefits of knowledge management were also recognized for most business functions. For maintenance management, a knowledge management programme helps to capture the implicit knowledge and expertise of maintenance workers and to secure this information in information systems, thus making it accessible to other technicians. The benefits of this in terms of consistency in problem solving approach and knowledge retention are obvious. Other knowledge management applications can be, for example, expert systems assisting in the diagnosis of complex equipment failures, or data mining on maintenance history records to learn about failure causes. A knowledge management programme will also help to keep track of individual skills and expertise and as such support personnel management over time.

Table 2.5. Evolution of CMMS

Business IT systems: 1970s
CMMS: 1st generation
Characteristics: Mainly registration and data administration (EDP). Limited or no process support. Low priority mainframe applications. Limited software market, a lot of in-house development.

Business IT systems: 1980s–1990s
CMMS: 2nd generation
Characteristics: Cost control and work order management; MRO management most often included, ... Link with company’s financial information module. First MIS for maintenance. Many stand-alone microcomputer applications. Dynamic, but not always reliable, software market.

Business IT systems: 1990s ...
CMMS: 3rd generation
Characteristics: Broader, e.g. also asset utilization and EHS module. External communication possible, e.g. e-MRO. Enhanced analytical capabilities. Multimedia and web enabled features. Matured market for embedded (part of e.g. ERP) or BoB.

Clearly, the evolution in maintenance management offers a challenging job environment for today’s maintenance manager. This maintenance manager needs to be aware of “the big picture”, i.e. the business context and the maintenance organization as a whole. Moreover, he/she needs to have a sound technological background and be prepared to keep informed of technological evolutions. The maintenance manager needs real management skills to manage the resources, personnel and materials, in an efficient and effective way, while keeping asset utilization and asset life cycles in mind. Growing in the function of maintenance manager will also mean acquiring new skills, e.g. in financial management. Last but not least, today’s maintenance manager needs to be flexible enough to face threats and to grab opportunities in today’s dynamic business environment, where increasing globalisation, many mergers and acquisitions, growing outsourcing markets and emerging e-maintenance technologies are part of daily life.

2.5 Conclusions and New Challenges of Maintenance

Maintenance management has undoubtedly undergone major changes during the past decade. It has moved from being a low-profile, necessary but difficult to manage problem to being regarded as a prominent business function and an important element in business strategy. Not only practitioners have changed their mind about maintenance; academics have as well. Maintenance nowadays is a professional business function and an area of intensive academic research. Efforts are aimed at advancing towards world class maintenance and providing methodologies to do so. Pintelon et al. (2006) describe several maintenance maturity levels required to achieve world class maintenance; these are illustrated in Figure 2.7.

Figure 2.7. Maturity levels of maintenance

Maintenance concept optimization has professionalized. Corrective and precautionary actions are combined in different policies, from reactive to preventive and from predictive to proactive policies. A sound insight into the pros and cons of each of these policies is available in practice, and research supports the selection and optimization of these policies. These policies are no longer ad hoc, loose elements within maintenance management; they are embedded in maintenance concepts, focussing on reliability and productivity. These concepts ensure consistent decision making for all equipment and at the same time allow for individualized installation maintenance concepts. Decision tools are available to support this process. Top management nowadays, at least in most companies, recognizes the importance of maintenance as an element of their business strategy. Expectations for maintenance are no longer formulated as “keep things running”, but are based upon the overall business strategy. This strategy can be based on flexibility, quality and low cost. The maintenance organization, with its structural and infrastructural elements, is built accordingly.

The previous paragraph may give the impression that all problems for maintenance management are already solved; this, however, is not the case. New opportunities in terms of, for example, outsourcing and e-maintenance exist. Moreover, there is a threatening gap between the top management level, where the overall maintenance strategy is determined, and the tactical level, on which the maintenance concepts are designed, detailed and implemented (Figure 2.8). In other words, the tactical and subsequent operational phase on the one hand and the strategic phase on the other are poorly aligned. While both aspects are well studied, the link between the two is often not well established. This leads to disappointment among top management as well as frustration among maintenance managers.
Research shows a similar gap. There is some, though still not enough, research on the link between maintenance and business strategy. The main focus of maintenance management research is still on tactical and operational planning. Links between the former and the latter part of research, however, are still very rare. Closing this gap by linking maintenance and business throughout all decision levels is one of the major challenges for the future; every step taken brings us closer to real world-class maintenance.

Figure 2.8. Gap between maintenance and business strategy

2.6 List of Abbreviations

BCM: Business-centred maintenance
BoB: Best-of-breed
BUTD: Bottom-up/top-down analysis
CAMM: Computer aided maintenance management
CBM: Condition-based maintenance
CFR: Constant failure rate
CIBOCOF: Center Industrieel Beleid Onderhoudsontwikkelingsframework
CM: Corrective maintenance
CMMS: Computerized maintenance management systems
DOM: Design-out of Maintenance
DSS: Decision support systems
EAM: Enterprise asset management (system)
EHS: Energy, health and safety
EDP: Electronic data processing
EUC: End user computing
FBM: Failure-based maintenance
FMEA: Failure modes and effect analysis
FMI: Fast moving items
FMS: Flexible manufacturing systems
GUI: Graphical user interface
ICT: Information and communication technology
IFR: Increasing failure rate
ILS: Integrated logistics support
IT: Information technology
JIT: Just-in-time
LCC: Life-cycle costing
LSA: Logistics support analysis
MCDM: Multi-criteria decision making
MIS: Management information systems
MMIS: Maintenance management information system
MRO: Maintenance, repair and operating supplies
MTBF: Mean-time-between-failures
MTTR: Mean-time-to-repair
NMI: Normal moving items
OBM: Opportunity-based maintenance
OEE: Overall equipment effectiveness
OM: Operations management
OR: Operations research
PM: Precautionary maintenance
Q&D: Quick & dirty decision charts
R&D: Research & development
RBI: Risk-based inspections
RBCM: Risk-based reliability-centred maintenance
RCM: Reliability-centred maintenance
ROI: Return on investment
SAE: Society of Automotive Engineers
SMED: Single minute exchange of dies
SMI: Slow moving items
TBM: Time-based maintenance
TCO: Total cost of ownership
TPM: Total productive maintenance
TQM: Total quality management
UBM: Use-based maintenance
VDM: Value-driven maintenance
VSMI: Very slow moving items
WIP: Work in progress

2.7 References

Anderson, R.T., Neri, L., (1990), Reliability Centred Maintenance: Management and Engineering Methods, Elsevier Applied Sciences, London
Blanchard, B.S., (1992), Logistics Engineering and Management, Prentice Hall, Englewood Cliffs, New Jersey
Cho, I.D., Parlar, M., (1991), A survey on maintenance models for multi-unit systems. European Journal of Operational Research, 51:1–23
Coetzee, J.L., (2002), An Optimized Instrument for Designing a Maintenance Plan: A Sequel to RCM. PhD thesis, University of Pretoria, South Africa
Dekker, R., (1996), Applications of maintenance optimization models: A review and analysis. Reliability Engineering and System Safety, 52(3):229–240
Dekker, R., Scarf, P.A., (1998), On the impact of optimisation models in maintenance decision making: the state of the art. Reliability Engineering and System Safety, 60:111–119
Geraerds, W.M.J., (1972), Towards a Theory of Maintenance. The English University Press, London
Gits, C.W., (1984), On the Maintenance Concept for a Technical System: A Framework for Design, PhD thesis, TU Eindhoven, The Netherlands
Haarman, M., Delahay, G., (2004), Value Driven Maintenance – New Faith in Maintenance, Mainnovation, Dordrecht, The Netherlands
Jones, R.B., (1995), Risk-Based Maintenance, Gulf Professional Publishing (Elsevier), Oxford
Kelly, A., (1997), Maintenance Organizations & Systems: Business-Centred Maintenance, Butterworth-Heinemann, Oxford
McCall, J.J., (1965), Maintenance policies for stochastically failing equipment: A survey. Management Science, 11(5):493–524
Moubray, J., (1997), Reliability-Centred Maintenance. Second Edition. Butterworth-Heinemann, Oxford
Nowlan, F.S., Heap, H.F., (1978), Reliability Centered Maintenance, United Airlines Publications, San Francisco
Parkes, D., in Jardine, A.K.S., (1970), Operational Research in Maintenance, University of Manchester Press, Manchester
Pintelon, L., Gelders, L., Van Puyvelde, F., (2000), Maintenance Management, Acco, Leuven/Amersfoort
Pintelon, L., Gelders, L., (1992), Maintenance management decision making. European Journal of Operational Research, 58:301–317
Pintelon, L., Pinjala, K., Vereecke, A., (2006), Evaluating the Effectiveness of Maintenance Strategies, Journal of Quality in Maintenance Engineering (JQME), 12(1):214–229
Pintelon, L., Van Puyvelde, F., (2006), Maintenance Decision Making, Acco, Leuven, Belgium
Takahashi, Y., Takashi, O., (1990), TPM: Total Productive Maintenance. Asian Productivity Organization, Tokyo
Valdez-Flores, C., Feldman, R.M., (1989), A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics, 36:419–446
Waeyenbergh, G., (2005), CIBOCOF – A Framework for Industrial Maintenance Concept Development, PhD thesis, Centre for Industrial Management – K.U.Leuven, Leuven, Belgium
Waeyenbergh, G., Pintelon, L., (2002), A framework for maintenance concept development. International Journal of Production Economics, 77:299–313
Wang, H., (2002), A survey of maintenance policies of deteriorating systems. European Journal of Operational Research, 139:469–489

3 New Technologies for Maintenance

Jay Lee and Haixia Wang

3.1 Introduction

For years, maintenance has been treated as a dirty, boring and ad hoc job. It is seen as critical for maintaining productivity but has yet to be recognized as a key component of revenue generation. The question most often asked is “Why do we need to maintain things regularly?” The answer is “To keep things as reliable as possible.” However, the question that should be asked is “How much change or degradation has occurred since the last round of maintenance?” The answer to this question is “I don’t know.” Today, most machine field services depend on sensor-driven management systems that provide alerts, alarms and indicators. By the time the alarm sounds, it is already too late to prevent the failure. Therefore, most machine maintenance today is either purely reactive (fixing or replacing equipment after it fails) or blindly proactive (assuming a certain level of performance degradation, with no input from the machinery itself, and servicing equipment on a routine schedule whether service is actually needed or not). Both scenarios are extremely wasteful.

Rather than reactive “fail-and-fix” maintenance, world-class companies are moving towards “predict-and-prevent” maintenance. A maintenance scheme referred to as condition based maintenance (CBM) was developed by considering current degradation and its evolution. CBM methods and practices have been continuously improved over recent decades; however, CBM is conducted at the equipment level, one piece of equipment at a time, and the prognostics approaches developed so far are application or equipment specific. A holistic approach, real-time prognostics devices, and a rapid implementation environment are potential future research topics in product and system health assessment and prognostics.
With the level of integrated network systems development in today’s global business environment, machines and factories are networked, and information and decisions are synchronized in order to maximize a company’s asset investments. This generates a critical need for a real-time remote machinery prognostics and health management (R2M-PHM) system. The unmet needs in maintenance can be categorized into the following:


1. Machine intelligence: intelligent monitoring, prediction and prevention, and compensation and reconfiguration for sustainability (self-maintenance).
2. Operations intelligence: prioritized, optimized, and responsive maintenance scheduling for reconfiguration needs.
3. Synchronization intelligence: autonomous information flow from market demand to factory asset utilization.

Based on the unmet needs in maintenance, many research and development questions concerning next generation maintenance systems can be raised. Some of them are the following:

1. How to adapt maintenance schedules to cope dynamically with shop-floor reality?
2. How to feed back information and knowledge gathered in maintenance to the designers of the process?
3. How to link maintenance policies to corporate strategy and objectives?
4. How to synchronize production scheduling based on maintenance performance?

The rest of this chapter is organized as follows. Section 3.2 gives a state-of-the-art review of maintenance technologies, which includes a maintenance paradigm overview and CBM prognostics approaches. Section 3.3 presents the newly developed platform of the Watchdog Agent®-based real-time remote machinery prognostics and health management (R2M-PHM) system, the Watchdog Agent® toolbox method for multi-sensor performance assessment and prognostics, and real-life industrial case studies. Section 3.4 summarizes new developments and discusses future work.

3.2 State-of-the-art Reviews on Maintenance Technologies

3.2.1 Maintenance Paradigm Overview

Looking back on the development history of maintenance technologies and forecasting their development trend, the roadmap to excellence in maintenance can be illustrated as in Figure 3.1.

3.2.1.1 No Maintenance

There are two kinds of situations in which no maintenance will occur:

• No way to fix it: the maintenance technique is not available for a special application, or the maintenance technique is at too early a stage of development.
• Not worth fixing: some machines were designed to be used only once. When compared to the maintenance cost, it may be more cost-effective just to discard them.

Neither of the scenarios above is within the scope of the discussion here.

Figure 3.1. The development of maintenance technologies (vertical axis: machine performance and uptime; stages shown: no maintenance; reactive maintenance (fire fighting); preventive maintenance (scheduled maintenance); predictive maintenance; proactive maintenance (failure root cause analysis); self-maintenance or maintenance-free machine)

3.2.1.2 Reactive Maintenance

The aim of reactive maintenance is just to “fix it after it’s broken”. Most of the time a machine breaks down without warning and it is urgent for the maintenance crew to put it back to work; this is also referred to as “fire-fighting”. This fire-fighting mode of maintenance is still present in many maintenance operations today because accurate knowledge of the equipment behavior is lacking. Essentially, little to no maintenance is conducted and the machinery operates until a failure occurs. At this time, appropriate personnel are contacted to assess the situation and make the repairs as expeditiously as possible. In a situation where the damage to equipment is not a critical factor, plenty of downtime is available, and the values of the assets are not a concern, the fire-fighting mode may prove to be an acceptable option. Of course, one must consider the additional cost of making repairs on an emergency basis, since soliciting bids to obtain reasonable costs may not be feasible in these situations. Due to market competition and environmental/safety issues, the trend is toward adopting an organized and efficient maintenance program as opposed to fire-fighting.

3.2.1.3 Preventive Maintenance

Preventive maintenance (PM) is an equipment maintenance strategy based on replacing, overhauling or remanufacturing an item at fixed or adaptive intervals, regardless of its condition at the time. These maintenance operations models can be characterized as long term maintenance policies (Wang 2002) that do not take into account instantaneous equipment status. Scheduled restoration tasks and scheduled discard tasks are both examples of preventive maintenance tasks. In preventive maintenance, breakdowns are tracked and recorded in a database, and the information accumulated provides a base for general preventive actions.
The age-dependent PM policy can be considered as the most common maintenance policy in which a unit’s PM times are based on the age of the unit. The basic idea is to replace or repair a unit at its age T or failure whichever occurs first (Badia et al., 2002; Mijailovic 2003). Commonly used equipment reliability indices such as mean time between failure (MTBF) and mean time to repair (MTTR) are extracted


from the historical databases of equipment behavior over time. These two indices provide a rough estimate of the time between two adjacent breakdowns and the mean time needed to restore a system when such breakdowns happen. Although equipment degradation processes vary from case to case, and the causes of failure can be different as well, the information contained in MTBF and MTTR can still be informative. Other indices can also be extracted and used, including the mean lifetime, mean time to first failure, and mean operational life, as discussed by Pham et al. (1997).

With the introduction of minimal repair and imperfect maintenance, various extensions and modifications to the age-dependent PM policy have been proposed (Bruns 2002; Chen et al. 2003). Another preventive maintenance policy that has received much attention is the periodic PM policy, in which degraded machines are repaired or replaced at fixed time intervals independent of the equipment failures. Various modifications and enhancements to this maintenance policy have also been proposed recently (Cavory et al. 2001).

Preventive maintenance schemes are time-based and do not consider the current health state of the product; they are therefore inefficient and less valuable for a customer whose individual asset is of the most concern. For the case of helicopter gearboxes, it was found that almost half of the units were removed for overhaul even though they were in a satisfactory operating condition. Therefore, techniques for more economical and reliable maintenance are needed.

3.2.1.4 Predictive Maintenance

Predictive maintenance (PdM) is a right-on-time maintenance strategy. It is based on the failure limit policy in which maintenance is performed only when the failure rate, or other reliability indices, of a unit reaches a predetermined level.
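The MTBF and MTTR indices and the failure limit policy described above can be illustrated with a minimal sketch. The log format, the function names and the choice of a Weibull hazard rate are assumptions made here for illustration; the chapter does not prescribe a particular failure model.

```python
import math


def mtbf_mttr(repairs, observation_hours):
    """Estimate MTBF and MTTR from a failure log.

    repairs: list of (failure_time, restored_time) pairs in hours, measured
    from the start of the observation window (hypothetical log format).
    MTBF is taken here as total uptime divided by the number of failures,
    MTTR as total repair time divided by the number of failures.
    """
    if not repairs:
        raise ValueError("at least one failure is needed to estimate the indices")
    downtime = sum(restored - failed for failed, restored in repairs)
    uptime = observation_hours - downtime
    return uptime / len(repairs), downtime / len(repairs)


def failure_limit_age(beta, eta, hazard_limit):
    """Age at which an assumed Weibull hazard rate reaches hazard_limit.

    The hazard is h(t) = (beta / eta) * (t / eta) ** (beta - 1). Solving
    h(t) = hazard_limit for t requires beta > 1 (a wear-out failure mode),
    because only then does the hazard increase with age.
    """
    if beta <= 1:
        raise ValueError("the failure limit policy needs an increasing hazard (beta > 1)")
    return eta * (hazard_limit * eta / beta) ** (1.0 / (beta - 1.0))


# Illustrative numbers only: three failures in a 500-hour window.
log = [(100.0, 104.0), (250.0, 252.0), (430.0, 436.0)]
mtbf, mttr = mtbf_mttr(log, observation_hours=500.0)
print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.1f} h")

# Weibull shape 2, scale 1000 h, hazard limit of 0.001 failures per hour.
print(f"maintain at age {failure_limit_age(2.0, 1000.0, 0.001):.0f} h")
```

Under these assumptions the unit is maintained when its age reaches the value returned by `failure_limit_age`, rather than at a fixed calendar interval.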
This maintenance strategy has been implemented as condition based maintenance (CBM) in most production systems, where certain performance indices are periodically (Barbera et al. 1996; Chen and Trivedi 2002) or continuously monitored (Marseguerra et al. 2002). Whenever an index value crosses some predefined threshold, maintenance actions are performed to restore the machine to its original state, or to a state where the changed value is at a satisfactory level in comparison to the threshold. Predictive maintenance can be best described as a process that requires both technology and human skills, while using a combination of all available diagnostic and performance data, maintenance history, operator logs and design data to make timely decisions about maintenance requirements of major/critical equipment. It is this integration of various data, information and processes that leads to the success of a PdM program. It analyzes the trend of measured physical parameters against known engineering limits for the purpose of detecting, analyzing and correcting a problem before a failure occurs. A maintenance plan is devised based on the prediction results derived from condition based monitoring. This method can cost more up front than PM because of the additional monitoring hardware and software investment, cost of manning, tooling, and education that is required to establish a PdM program. However, it provides a basis for failure diagnostics and maintenance operations, and offers increased equipment reliability and a sufficient advance in information to improve planning, thereby reducing unexpected downtime and operating costs.
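As a rough illustration of trending a measured parameter against an engineering limit, the sketch below fits a straight line to recent readings and extrapolates the time remaining until the limit is crossed. The linear trend, the function names and the sample readings are assumptions for illustration; a real PdM program would choose a degradation model suited to the monitored parameter.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    if n < 2:
        raise ValueError("at least two readings are needed to fit a trend")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_threshold(times, values, threshold):
    """Time from the last reading until the fitted trend reaches threshold.

    Returns 0.0 if the last reading is already at or above the threshold and
    None if the fitted trend is flat or decreasing (no crossing predicted).
    """
    if values[-1] >= threshold:
        return 0.0
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return max(crossing_time - times[-1], 0.0)


# Hypothetical vibration readings (mm/s) taken every 10 hours.
hours = [0.0, 10.0, 20.0, 30.0, 40.0]
vibration = [2.0, 2.5, 3.0, 3.5, 4.0]
remaining = time_to_threshold(hours, vibration, threshold=7.0)
print(f"estimated hours until the limit is reached: {remaining:.0f}")
```

The returned lead time is what allows a maintenance action to be planned before the failure, instead of reacting to an alarm once the limit is already exceeded.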


3.2.1.5 Proactive Maintenance

Proactive maintenance (PaM) is a new maintenance concept that is emerging along with the development of business globalization. It encompasses any tasks that seek to realize the seamless integration of diagnosis and prognosis information and maintenance decision making via a wireless internet or satellite communication network. Machine health information should represent a trend, not just a status, so that a company’s productivity can be focused on asset-level utilization, not just production rates. Moreover, through integrated life-cycle management, such degradation information can be used to make improvements in every aspect of a product’s life-cycle. Intelligent maintenance systems (IMS), presented by Lee (1996), are representative of PaM. Specifically, IMS has three main working directions:

• Develop intertwined embedded informatics and electronic intelligence in a networked and tether-free environment and enable products and systems to intelligently monitor, predict, and optimize their performance.
• Change “failure reactive” to “failure proactive” by avoiding the underlying conditions that lead to machine faults and degradation. Focus on analyzing the root cause, not just the symptoms. That is, seek to prevent or to fix failure at its source.
• Feed the maintenance information back to the product, process and machine design, and ultimately make improvements in every aspect of the product life-cycle.

3.2.1.6 Self-maintenance

Self-maintenance is a new design and system methodology. Self-maintenance machines are expected to be able to monitor, diagnose, and repair themselves in order to increase their uptime. One system approach to enabling self-maintenance is based on the concept of functional maintenance (Umeda et al. 1995). Functional maintenance aims to recover the required function of a degrading machine by trading off functions, whereas traditional repair (physical maintenance) aims to recover the initial physical state by replacing faulty components, cleaning, etc. The way to fulfil the self-maintenance function is to add intelligence to the machine, making it clever enough for functional maintenance, so that the machine can monitor and diagnose itself and can still maintain its functionality for a while if any kind of failure or degradation occurs. In other words, self-maintainability would be appended to an existing machine as an additional embedded reasoning system. The required capabilities of a self-maintenance machine (SMM) are defined as follows (Labib 2006):

• Monitoring capability: the SMM must have the ability of on-line condition monitoring using sensor fusion. The sensors send the raw data of machine condition to a processing unit.
• Fault judging capability: from the sensory data, the SMM can judge whether the machine condition is normal or abnormal. Judging the condition of the machine gives its current condition and the time left to failure.
• Diagnosing capability: if the machine condition is abnormal, the causes of faults must be diagnosed and identified to allow repair planning to be carried out.
• Repair planning capability: the machine is able to propose repair actions based on the result of diagnosis and functional maintenance. Repair planning is performed using knowledge from experts which is stored in the database system. There may be more than one repair action proposed; however, the optimized one will be selected for implementation.
• Repair executing capability: the maintenance is carried out by the machine itself without any human intervention. This can be achieved through the computer control system and actuators in the machine.
• Self-learning and improvement: when faced with unfamiliar problems, the machine is able to repair itself, and it is expected that if such problems occur again, the machine will take a shorter time to repair itself and the outcome of maintenance will be more effective and efficient.

Efforts towards realizing self-maintenance have been mainly in the form of intelligent adaptive control, where control has been investigated using fuzzy logic control. In order to realize self-maintenance, one needs to develop and implement an adaptive neuro-fuzzy inference system, which allows the fuzzy logic controller to learn from the data it is modeling and automatically produce appropriate membership functions and the required rules. Such a controller must be able to cater for sensor degradation, and this leads to self-learning and improvement capabilities.

Another system approach to enabling self-maintenance is to add a self-service trigger function to a machine. The machine self-monitors, self-prognoses and self-triggers a service request before a failure actually occurs. The maintenance task may still be conducted by a maintenance crew, but the gap-free integration of machine, maintenance schedule, dispatch system and inventory management system will minimize maintenance costs and raise customer satisfaction.

3.2.2 Prognostics Approaches for Condition Based Maintenance

Condition based maintenance (CBM) was presented as a maintenance scheme to provide sufficient warning of an impending failure on a particular piece of equipment, allowing that equipment to be maintained only when there is objective evidence of an impending failure. CBM methods and practices have been continuously improved in recent decades. Sensor fusion techniques are now commonly in use due to their inherent superiority in taking advantage of mutual information from multiple sensors (Hansen et al. 1994; Reichard et al. 2000; Roemer et al. 2001). A variety of techniques in vibration, temperature, acoustic emissions, ultrasonics, oil debris, lubricant condition, chip detectors, and time/stress analyses have received considerable attention.
For example, vibration signature analysis, oil analysis and acoustic emissions, because of their excellent capability for describing machine performance, have been successfully employed for prognostics for a long time (Kemerait 1987; Wilson et al. 1999; Goodenow et al. 2002). Current prognostic approaches can be classified into three basic groups: model-based


approach, data-driven approach, and hybrid approach.

The model-based approach requires detailed knowledge of the physical relationships between, and characteristics of, all related components in a system. It is a quantitative model used to identify and evaluate the difference between the actual operating state determined from measurements, and the expected operating state derived from the values of the characteristics obtained from the physical model. Bunday (1991) presented the theory and methodology of obtaining reliability indices from historical data. In direct implementation in maintenance, the reliability of the system is kept at a defined level, and whenever the reliability falls below the defined level, maintenance actions should take place to restore it to its proper level. However, it is usually prohibitive to use the model-based approach, since the relationships and characteristics of all related components in a system and its environment are often too complicated to build a model with a reasonable amount of accuracy. In some cases, values of some process parameters/factors are not readily available. A poor model leads to poor judgment.

The data-driven approach requires a large amount of historical data representing both normal and “faulty” operations. It uses no a priori knowledge of the process but, instead, derives behavioral models only from measurement data from the process itself. Pattern recognition techniques are widely used in this approach. General knowledge of the process can be used to interpret data analysis results, based on which qualitative methods such as fuzzy logic and artificial intelligence methods can be used for decision making to realize fault prevention.

The hybrid approach fuses the model-based information and sensor-based information and takes advantage of both model-driven and data-driven approaches, through which more reliable and accurate prognostic results can be generated (Hansen et al. 1994). Garga et al.
(2001) introduced a hybrid reasoning method for prognostics, which integrated explicit domain knowledge and machinery data. In this approach, a feed-forward neural network was trained using explicit domain knowledge to obtain a parsimonious representation of that knowledge. However, a major breakthrough has not been made since.

Existing prognostic methods are application or equipment specific. For instance, the development of neural networks has added new dimensions to solving existing problems in the prognostics of a centrifugal pump (Liang et al. 1988). A comparison of the results using the signal identification technique shows various merits of employing neural nets, including the ability to handle multivariate wear parameters in a much shorter time. A polynomial neural network was applied to fault detection, isolation, and estimation in a helicopter transmission prognostic application (Parker et al. 1993). Ray and Tangirala (1996) built a stochastic model of fatigue crack dynamics in mechanical structures to predict remaining service time. Fuzzy logic-based neural networks have been used to predict paper web breakage in a paper mill (Bonissone 1995) and the failure of a tensioned steel band with seeded crack growth (Swanson 2001). Yet another prognostic application presented an integrated system in which a dynamically linked ellipsoidal basis function neural network was coupled with an automated rule extractor to develop a tree-structured rule set which closely approximates the classification of the neural network (Brotherton et al. 2000). That method allowed assessment of trending from the nominal class to each of the identified fault classes, which means quantitative


prognostics were built into the network functionality. Vachtsevanos and Wang (2001) gave an overview of different CBM algorithms and suggested a method to compare their performance for a specific application.

Prognostic information, obtained through intelligence embedded into the manufacturing process or equipment, can also be used to improve manufacturing and maintenance operations in order to increase process reliability and improve product quality. For instance, the ability to increase the reliability of manufacturing facilities using awareness of the deterioration levels of manufacturing equipment has been demonstrated through an example of improving robot reliability (Yamada and Takata 2002). Moreover, a life cycle unit (LCU) (Seliger et al. 2002) was proposed to collect usage information about key product components, enabling one to assess product reusability and facilitating the reuse of products that have significant remaining useful life.

Despite the progress in CBM, many fundamental issues still remain. For example:

1. Most research is conducted at the single equipment level, and no infrastructure exists for employing a real-time remote machinery diagnosis and prognosis system for maintenance.
2. Most of the developed prognostics approaches are application or equipment specific. A generic and scalable prognostic methodology or toolbox does not exist.
3. Currently, methods are focused on solving the failure prediction problem. The need for tools for system performance assessment and degradation prediction has not been well addressed.
4. The maintenance world of tomorrow is an information world for feature-based monitoring. Features used for prognostics need to be further developed.
5. Many developed prediction algorithms have been demonstrated in a laboratory environment, but are still without industry validation.
To address the aforementioned unmet needs, Watchdog Agent®-based intelligent maintenance systems (IMS) have been presented by the IMS Center, with a vision to develop a systematic approach in advanced prognostics that enables products and systems to achieve near-zero breakdown reliability and performance.
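To close the review, the core idea of the model-based approach described earlier in this section, comparing the measured operating state with the state expected from a physical model, can be sketched as a residual check. The linear temperature model, its coefficient, the tolerance and all names below are hypothetical and chosen only to keep the example short.

```python
def expected_temperature(load_kw, ambient_c, rise_per_kw=0.8):
    """Expected bearing temperature from an assumed linear physical model."""
    return ambient_c + rise_per_kw * load_kw


def residuals(measurements, ambient_c):
    """Measured minus expected temperature for each (load_kw, temp_c) pair."""
    return [temp - expected_temperature(load, ambient_c)
            for load, temp in measurements]


def flag_anomalies(measurements, ambient_c, tolerance_c=5.0):
    """True where the residual exceeds the tolerance in either direction."""
    return [abs(r) > tolerance_c for r in residuals(measurements, ambient_c)]


# Hypothetical readings: (load in kW, measured temperature in deg C).
readings = [(10.0, 33.5), (20.0, 41.0), (30.0, 58.0)]
print(flag_anomalies(readings, ambient_c=25.0))
```

The quality of such a check depends entirely on the model: if the assumed relationship between load and temperature is wrong, the residuals are meaningless, which is the sense in which a poor model leads to poor judgment.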

3.3 Watchdog Agent®-based Intelligent Maintenance Systems

Today most state-of-the-art manufacturing, mining, farming, and service machines (e.g., elevators) are actually quite “smart” in themselves. Many sophisticated sensors and computerized components are capable of delivering data concerning a machine’s status and performance. The problem is that little or no practical use is made of most of this data. We have the devices, but we do not have a continuous and seamless flow of information throughout entire processes. Sometimes this is because the available data is not rendered in a usable, or instantly understandable,


form. More often, no infrastructure exists for delivering the data over a network, or for managing and analyzing the data, even if the devices were networked. A Watchdog Agent®-based real-time remote machinery prognostics and health management (R2M-PHM) system has recently been developed by the IMS Center. It focuses on developing innovative prognostics algorithms and tools, as well as remote and embedded predictive maintenance technologies to predict and prevent machine failures, as illustrated in Figure 3.2.

Figure 3.2. Key focus and elements of the Intelligent Maintenance Systems

The rest of the section is organized as follows. Section 3.3.1 deals with the platform of the Watchdog Agent®-based real-time remote machinery prognostics and health management (R2M-PHM) system. Section 3.3.2 presents a generic and scalable prognostic methodology or toolbox, i.e., the Watchdog Agent® toolbox; and Section 3.3.3 illustrates the effectiveness and potential of this new development using several real industry case studies.

3.3.1 Watchdog Agent®-based R2M-PHM Platform

A generic and scalable prognostics framework was presented by Su et al. (1999) to integrate with embedded diagnostics to provide “total health management” capability. A reconfigurable and scalable Watchdog Agent®-based R2M-PHM platform is being developed by the IMS Center, which expands the well known open system architecture for condition-based maintenance (OSA-CBM) standard (Thurston and Lebold 2001) by including real-time remote machinery diagnosis and prognosis systems and embedded Watchdog Agent® technology. As illustrated in Figure 3.3, the Watchdog Agent® (hardware and software) is embedded onto machines to convert multi-sensory data to machine health information. The extracted information is managed and transferred through wireless internet or a satellite communication network, and service is automatically triggered.


Figure 3.3. Illustration of IMS real-time remote machinery diagnosis and prognosis system

3.3.1.1 System Architecture

The system architecture of the Watchdog Agent®-based R2M-PHM platform is shown in Figure 3.4. In most products or systems, different sensors measure different aspects of the same physical phenomena. For example, sensor signals such as vibration, temperature and pressure are collected. A “digital doctor” inspired by biological perceptual systems and machine psychology theory, the Watchdog Agent® consists of embedded computational prognostic algorithms and a software toolbox for predicting degradation of devices and systems. It is being built to be extensible and adaptable to most real-world machine situations. The health related information is saved to the database. The diagnostic and prognostic outputs of the Watchdog Agent®, which is mounted on all the machinery of interest, can then be fed into the decision support tools. Decision support tools help the operation personnel balance and optimize their resources, when one or more machines are likely to fail, by constantly looking ahead. For example, if a production line has three processes A, B and C, such that A has one machine, B has three machines, and C has one machine, what would we do if we could anticipate that one of the machines at station B is not behaving normally? Perhaps we would arrange a staging area for output from A, or perhaps we would ramp up production on the other two machines at station B. Whatever the case, we would be making our decision before experiencing the impending breakdown. These tools are critical to maintenance and process personnel, enabling them to stay ahead of the game, balancing limited resources with constant change in demand. Decision support tools also help minimize losses in productivity caused by downtime, and help production and logistics managers optimize their maintenance schedule to minimize downtime costs.
The lean and necessary information for maintenance can then be determined and published to the internet through an embedded web server.
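The production line example above (stations A, B and C, with three parallel machines at B) reduces to simple arithmetic that a decision support tool could perform ahead of the breakdown. The sketch below is an assumption-laden illustration, not the IMS decision support tool: the function name, the rates and the per-machine capacity limit are all hypothetical.

```python
def rebalance_station(line_rate, n_machines, n_degraded, max_rate_per_machine):
    """Plan for a parallel station when some machines are expected to fail.

    Returns (required_rate, shortfall):
    required_rate is the rate each remaining machine would need to keep the
    line running at line_rate; shortfall is the part of line_rate that the
    remaining machines cannot cover at their maximum rate, i.e. the amount
    per hour that would have to be held in a staging area upstream.
    """
    healthy = n_machines - n_degraded
    if healthy <= 0:
        raise ValueError("no healthy machine left at the station")
    required_rate = line_rate / healthy
    achievable = min(required_rate, max_rate_per_machine) * healthy
    return required_rate, line_rate - achievable


# Hypothetical numbers: the line runs at 60 units/h, station B has three
# machines, one is predicted to fail, and each machine can do at most 25 units/h.
required, shortfall = rebalance_station(60.0, 3, 1, 25.0)
print(f"each remaining machine must run at {required:.0f} units/h")
print(f"staging needed for {shortfall:.0f} units per hour of degraded operation")
```

With these numbers, ramping up the two remaining machines is not enough on its own, so both options mentioned in the text (ramping up and staging output from A) would be combined.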

Figure 3.4. System architecture of a reconfigurable Watchdog Agent® (elements shown: sensor signals (vibration, temperature, pressure, current, voltage, on/off, …); embedded computer with I/O cards, embedded operating system and embedded software (Watchdog Agent® toolbox, database, decision support tools, web server); remote computer with client software)

The rapid development of web-enabled and cyber-infrastructure technologies is important in providing enablers for remote monitoring and prognostics. One of the major barriers is that most manufacturers adopt proprietary communication protocols, which lead to difficulties in connecting diverse machines and products. Currently, the IMS Center is developing a web-enabled remote monitoring Device-to-Business (D2B™) platform for remote monitoring and prognostics of diversified products and systems. A system methodology and infotronics platform has been developed that enables the transformation of product condition data into a more useful health information format for remote and network-enabled prognostics applications. The MIMOSA (maintenance information management open system architecture) organization has adopted the IMS infotronics platform as one of its standard platforms and will use an IMS testbed to demonstrate MIMOSA standards in its future activities. As shown in Figure 3.5, the IMS infotronics platform includes the Watchdog Agent® toolbox (which contains adaptive algorithms for different situations and applications), decision support tools, data storage, and D2B™ (Device-to-Business) system level connectivity. The Watchdog Agent® toolbox includes signal processing, feature extraction, performance assessment, autonomous learning, prediction and prognostics functions. The lean and necessary information for maintenance from decision support tools can then be determined and sent out through D2B™ system level connectivity to remote workstations or computers.
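The toolbox chain named above (signal processing, feature extraction, performance assessment) can be illustrated with a very small sketch. The features chosen, the Gaussian-style mapping to a health value between 0 and 1, and all names are assumptions made for illustration only; they are not the algorithms of the Watchdog Agent® toolbox.

```python
import math


def extract_features(window):
    """Simple time-domain features of one window of a vibration signal."""
    n = len(window)
    rms = math.sqrt(sum(x * x for x in window) / n)
    peak = max(abs(x) for x in window)
    # Crest factor is undefined for an all-zero window; report 0.0 in that case.
    crest = peak / rms if rms > 0 else 0.0
    return {"rms": rms, "peak": peak, "crest_factor": crest}


def health_value(features, baseline_mean, baseline_std):
    """Map the distance from normal behavior to a value in (0, 1].

    Each feature is standardized against the baseline recorded during normal
    operation; 1.0 means the features match the baseline exactly, and values
    near 0 mean the features are far from normal behavior.
    """
    z_squared = [((features[k] - baseline_mean[k]) / baseline_std[k]) ** 2
                 for k in baseline_mean]
    return math.exp(-0.5 * sum(z_squared) / len(z_squared))


# Hypothetical baseline learned from a healthy machine.
mean = {"rms": 1.0, "peak": 1.0, "crest_factor": 1.0}
std = {"rms": 0.5, "peak": 0.5, "crest_factor": 0.5}

healthy = health_value(extract_features([1.0, -1.0, 1.0, -1.0]), mean, std)
degraded = health_value(extract_features([2.0, -2.0, 2.0, -2.0]), mean, std)
print(f"healthy window: {healthy:.2f}, degraded window: {degraded:.2f}")
```

Only the resulting health value, not the raw signal, would need to be sent to a remote workstation, which is one way to read the phrase "lean and necessary information".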


Figure 3.5. Integrated infotronics platform

3.3.1.2 Hardware Requirements

For a given industry application, the selection of Watchdog Agent® hardware depends on the characteristics of the input/output signals (for example, the signal types and the number of channels needed), the tools or algorithms selected (for example, different algorithms require different computation and storage capacities), and the hardware’s working environment (which determines, for example, the storage type and temperature range). The hardware prototype currently used in the IMS Center is based on PC104 architecture, as shown in Figure 3.6a. PC104 architecture enables the hardware to be easily expanded to a multi-board system, which includes multiple CPUs and a large number of input channels. It has a powerful VIA Eden 400MHz CPU and 128MB


of memory since all of the tools are embedded into the hardware. It has 16 high speed analog input channels to deal with highly dynamic signals. It also has various peripherals that can acquire non-analog sensor signals such as RS232/485/432, parallel and USB. The prototype uses a compact flash card for storage, so it can be placed on top of machine tools and is suitable for withstanding vibrations in a working environment. Once a certain set of tools/algorithms is determined for a certain industry application, commercially available hardware, such as Advantech and National Instruments (NI) as illustrated in Figure 3.6b and c, respectively, will be further evaluated for customized Watchdog Agent® applications.

Figure 3.6a–c. Options of hardware prototypes for Watchdog Agent® application

3.3.1.3 Software Development

The software system of the Watchdog Agent®-based IMS platform consists of two parts: the embedded side software and the remote side software, as shown in Figure 3.7. The embedded side software is the software running on the Watchdog Agent® hardware, which includes a communication module, a command analysis module, a task module, an algorithm module, a function module, and a DAQ module. The communication module is responsible for communicating with the remote side via the TCP/IP protocol. The command analysis module is used to analyze different commands coming from the remote side. The task module includes multithread scheduling and management. The algorithm module contains specific Watchdog Agent® tools. The function module has several auxiliary functions such as channel configuration, security configuration, and email list management. The DAQ module performs A/D conversion using either an interrupt or a software trigger to get data from different sensors. The remote side software is the software running on the remote computers. It is implemented with ActiveX control technology and can be used as a component of the Internet Explorer browser. The remote side software is mainly composed of a communication module and a user interface module. The communication module is used for communicating with the embedded side via the TCP/IP protocol. The user interface has a health information display, an ATC status display, and a discrete event display. It also possesses an algorithm module, as well as an error log database and a data format interface.
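The role of the command analysis module, turning text received from the remote side into calls on other modules, can be sketched as a small dispatcher. The command names, reply strings and class below are entirely hypothetical, since the chapter does not document the actual command set, and the TCP/IP transport is left out so that the sketch stays self-contained.

```python
def parse_command(line):
    """Split one received line into an upper-case command name and arguments."""
    parts = line.strip().split()
    if not parts:
        raise ValueError("empty command")
    return parts[0].upper(), parts[1:]


class EmbeddedAgent:
    """Toy stand-in for the embedded side: dispatches parsed commands."""

    def __init__(self):
        self.channels = {}   # channel index -> enabled flag (function module)
        self.health = 1.0    # latest health value (algorithm module)
        self.handlers = {
            "SET_CHANNEL": self._set_channel,
            "GET_HEALTH": self._get_health,
        }

    def handle(self, line):
        """Return the reply that would be sent back to the remote side."""
        try:
            name, args = parse_command(line)
            handler = self.handlers[name]
        except (ValueError, KeyError):
            return "ERR unknown or empty command"
        return handler(args)

    def _set_channel(self, args):
        valid = (len(args) == 2 and args[0].isdigit()
                 and args[1].upper() in ("ON", "OFF"))
        if not valid:
            return "ERR usage: SET_CHANNEL <index> ON|OFF"
        self.channels[int(args[0])] = args[1].upper() == "ON"
        return "OK"

    def _get_health(self, args):
        return f"HEALTH {self.health:.2f}"


agent = EmbeddedAgent()
for request in ("set_channel 3 on", "GET_HEALTH", "REBOOT"):
    print(request, "->", agent.handle(request))
```

Keeping parsing separate from the handlers mirrors the module split described in the text: a communication module would only move lines in and out, while the handlers touch configuration and algorithm state.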


J. Lee and H. Wang

Figure 3.7. Software structure of Watchdog Agent®

3.3.1.4 Remote Monitoring Architecture and Human Machine Interface Standards A four-layer infrastructure for remote monitoring and human machine interface standards is illustrated in Figure 3.8. The data acquisition layer consists of multiple sensors which obtain raw data from the components of a machine or machines in different locations. The Network layer will use either traditional Ethernet connections, or wireless connections for communication between the Watchdog Agent®s, or for sending short messages (SM) to an engineer’s mobile phone via GPRS services. The Application layer functions as a control server to save related information and control the behavior of the Watchdog Agent®s in the network. The Enterprise layer offers a user-friendly interface for maintenance-related engineers to access information either via an Internet browser or a mobile phone.

Figure 3.8. Illustration of Watchdog Agent®-based remote monitoring architecture

New Technologies for Maintenance


3.3.2 Watchdog Agent® Toolbox for Multi-sensor Performance Assessment and Prognostics The Watchdog Agent® toolbox, with autonomic computing capabilities, is able to convert critical performance degradation data into health features and quantitatively assess their confidence value to predict further trends so that proactive actions can be taken before potential failures occur. Figure 3.9 illustrates one of the developed enabling prognostics tools that can assess and predict the performance degradation of products, machines and complex systems.

Figure 3.9. IMS innovation in advanced prognostics

The Watchdog Agent® toolbox enables one to assess and predict quantitatively the performance degradation levels of key product components, and to determine the root causes of failure (Casoetto et al. 2003; Djurdjanovic et al. 2000; Lee 1995, 1996), thus making it possible to realize physically closed-loop product life cycle monitoring and management. The Watchdog Agent® consists of embedded computational prognostic algorithms and a software toolbox for predicting degradation of devices and systems. Degradation assessment is conducted after the critical properties of a process or machine are identified and measured by sensors. It is expected that the degradation process will alter the sensor readings that are being fed into the Watchdog Agent®, and thus enable it to assess and quantify the degradation by quantitatively describing the corresponding change in sensor signatures. In addition, a model of the process or piece of equipment that is being considered, or available application specific knowledge, can be used to aid the degradation process description, provided that such a model and/or such knowledge exist. The prognostic function is realized through trending and statistical modeling of the observed process performance signatures and/or model parameters.

In order to facilitate the use of the Watchdog Agent® in a wide variety of applications (with various requirements and limitations regarding the character of signals, available processing power, memory and storage capabilities, limited space, power consumption, the user’s preference etc.) the performance assessment module of the Watchdog Agent® has been realized in the form of a modular, open architecture toolbox. The toolbox consists of different prognostics tools, including neural network-based, time-series based, wavelet-based and hybrid joint time-frequency methods, etc., for predicting the degradation or performance loss of devices, processes, and systems. The open architecture of the toolbox allows one to add new solutions to the performance assessment modules easily, as well as to interchange different tools, depending on the application needs. To enable rapid deployment, a quality function deployment (QFD) based selection method has been developed to provide a general suggestion to aid in tool selection; this is especially critical for those industry users who have little knowledge about these algorithms. The current tools employed in the signal processing and feature extraction, performance assessment, diagnostics and prognostics modules of Watchdog Agent® functionality are summarized in Figure 3.10. Each of these modules is realized in several different ways to facilitate the use of the Watchdog Agent® in a wide variety of products and applications.

Figure 3.10. Watchdog Agent® prognostics toolbox

3.3.2.1 Signal Processing and Feature Extraction Module The signal processing module transforms multiple sensor signals into domains that are the most informative of a product’s performance. Time-series analysis (Pandit and Wu 1993) or frequency domain analysis (Marple 1987) can be used to process stationary signals (signals with time invariant frequency content), while wavelet (Burrus et al. 1998; Yen and Lin 2000), or joint time-frequency analysis (Cohen 1995; Djurdjanovic et al. 2002) could be used to describe non-stationary signals (signals with time-varying frequency content). Most real life signals, such as speech, music, machine tool vibration, acoustic emission etc. are non-stationary signals, which places a strong emphasis on the need for development and utilization of non-stationary signal analysis techniques, such as wavelets, or joint time-frequency analysis. The feature extraction module extracts the features most relevant to describing a product’s performance. Those features are extracted from the domain into which the signal processing module transforms the sensor signals, using expert knowledge about the application, or automatic feature selection methods such as roots of the autoregressive time-series model, or time-frequency moments and singular value decomposition. Currently the following signal processing and feature extraction tools are used in the Watchdog Agent® toolbox:

• The Fourier transformation method has been widely used in de-noising and feature extraction. The noise component in the signal can be distinguished after it is transformed, and feature components can be identified after the removal of noise. However, the Fourier transformation is applicable to stationary signals only, since its frequency-band energies presuppose time-invariant frequency content.

• The autoregressive modeling method calculates frequency peak locations and intensities using autoregressive oscillation modes of sensor readings and bears significant information about the process (usually, mechanical systems are well described by the modes of oscillations).

• The wavelet/wavelet packet decomposition method enables the rapid calculation of non-stationary signal energy distribution at the expense of losing some of the desirable mathematical properties.

• The time-frequency analysis method provides both temporal and spectral information with good resolution, and is applicable to highly non-stationary signals (e.g. impacts or transient behaviors). However, it is not applicable if a large amount of data has to be considered and calculation speed is a concern.

• The application specific features extraction method is applicable in cases when one can directly extract performance-relevant features out of the time-series of sensor readings.
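As a hedged illustration of the autoregressive modeling tool listed above, the following minimal Python sketch fits a second-order AR model with the Yule-Walker equations and reads the oscillation mode (frequency and pole modulus) off the complex roots of the characteristic polynomial. The model order, the function names `autocov` and `ar2_mode`, and the synthetic 50 Hz signal are assumptions made for illustration only; they are not taken from the Watchdog Agent® implementation.

```python
import math


def autocov(x, lag):
    """Biased sample autocovariance of a zero-mean sequence at a given lag."""
    n = len(x)
    return sum(x[t] * x[t + lag] for t in range(n - lag)) / n


def ar2_mode(x, fs):
    """Fit x[t] = a1*x[t-1] + a2*x[t-2] + e[t] by the Yule-Walker equations.

    Returns (frequency_hz, modulus) of the oscillation mode, or None when
    the characteristic roots are real (no oscillatory mode).
    """
    mean = sum(x) / len(x)
    xc = [v - mean for v in x]
    r0, r1, r2 = (autocov(xc, k) for k in range(3))
    rho1, rho2 = r1 / r0, r2 / r0
    a1 = rho1 * (1.0 - rho2) / (1.0 - rho1 ** 2)
    a2 = (rho2 - rho1 ** 2) / (1.0 - rho1 ** 2)
    if a1 * a1 + 4.0 * a2 >= 0.0:
        return None
    modulus = math.sqrt(-a2)                  # |root| of z^2 - a1*z - a2 = 0
    theta = math.acos(a1 / (2.0 * modulus))   # root angle in rad/sample
    return theta * fs / (2.0 * math.pi), modulus


# Hypothetical vibration record: a 50 Hz tone sampled at 1 kHz.
fs = 1000.0
signal = [math.cos(2.0 * math.pi * 50.0 * t / fs) for t in range(2000)]
freq_hz, modulus = ar2_mode(signal, fs)
```

The recovered frequency and pole modulus could then serve as two entries of a feature vector; a modulus close to one indicates a lightly damped mode. Real sensor data would normally call for a higher model order and a more careful order selection than this sketch uses.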

3.3.2.2 Performance Assessment Module The performance assessment module evaluates the overlap between the most recently observed signatures and those observed during normal product operation. This overlap is expressed through the so-called confidence value (CV), ranging between zero and one, with higher CVs signifying a high overlap, and hence performance closer to normal (Lee 1995, 1996). In case data associated with some failure mode exist, the most recent performance signatures obtained through the signal processing and feature extraction module can be matched against signatures extracted from faulty behavior data as well. The areas of overlap between the most recent behavior and the nominal behavior, as well as the faulty behavior, are continuously transformed into CV over time for evaluating the deviation of the recent behavior from nominal to faulty. Realization of the performance evaluation module depends on the character of the application and extracted performance signatures. If significant application expert knowledge exists, simple but rapid performance assessment based on the feature-level fused multi-sensor information can be made using the relative number of activated cells in the neural network, or by using the logistic regression approach. For products with open-control architecture, the match between the current and nominal control inputs and the performance criteria can also be utilized to assess the product’s performance. For more sophisticated applications with intricate and complicated signals and performance signatures, statistical pattern recognition methods, or the feature map based approach can be employed. The following performance assessment tools are currently being used in the Watchdog Agent® toolbox:

• The logistic regression method allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. It can quantitatively represent the proximity of current operating conditions to the region of desirable or undesirable behavior. However, it is applicable only when a good feature domain description of unacceptable behavior is available.

• The feature map method assesses the overlap between the normal and most recent process behavior, and is applicable in cases when the Gaussianity of extracted features cannot be guaranteed.

• The statistical pattern recognition method calculates the overlap of feature distributions based on the assumption of Gaussian distribution of the features, and is applicable to a repeatable and stable process. However, it is not applicable to highly dynamic systems in which the feature distribution cannot be approximated as Gaussian.

• The hidden Markov model method is applicable to highly dynamic phenomena when a sequence of process observations, rather than a single observation, is needed to describe adequately the behavior of process signatures.

• The particle filter performance assessment method is able to describe process performance quantitatively, and is applicable in cases of complex systems that display multiple regimes of operation (both normal and faulty). In this case a hybrid description of the system is needed, incorporating both discrete and continuous states.
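To make the confidence value concrete, here is a minimal Python sketch of a CV computed as the overlap between two Gaussian feature distributions, in the spirit of the statistical pattern recognition tool above. The text does not state which overlap measure is used; the Bhattacharyya coefficient below is an assumed stand-in, chosen because it lies in (0, 1] and equals one for identical distributions. The function names and feature samples are hypothetical.

```python
import math
import statistics


def gaussian_overlap(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya coefficient of two univariate Gaussians (1 = identical)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


def confidence_value(baseline, recent):
    """CV in (0, 1]: overlap of the recent feature distribution with baseline."""
    return gaussian_overlap(
        statistics.mean(baseline), statistics.stdev(baseline),
        statistics.mean(recent), statistics.stdev(recent),
    )


# Hypothetical feature samples (e.g. vibration RMS), for illustration only.
normal = [1.00, 1.10, 0.90, 1.05, 0.95]
healthy_now = list(normal)                   # behaves like the baseline
degraded_now = [v + 1.0 for v in normal]     # mean has drifted upward

cv_healthy = confidence_value(normal, healthy_now)
cv_degraded = confidence_value(normal, degraded_now)
```

With these samples the healthy CV is essentially one and the degraded CV is close to zero, matching the interpretation given in the text. A multi-sensor version would need a multivariate overlap measure, which this sketch does not attempt.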

3.3.2.3 Diagnostics Module The diagnostics module tells not only the level of behavior degradation (the extent to which the newly arrived signatures belong to the set of signatures describing normal system behavior), but also how close the system behavior is to any of the previously observed faults (overlap between signatures describing the most recent system behavior with those characterizing each of the previously observed faults). This matching allows the Watchdog Agent® to recognize and forecast a specific fault behavior, once a high match with the failure associated signatures is assessed for the current process signatures, or forecasted based on the current and past product’s performance. Figure 3.11 illustrates this signature matching process for performance evaluation.


Figure 3.11. Performance evaluation using Confidence Value (CV)

The following diagnostic tools are currently used in the Watchdog Agent® toolbox:

• The support vector machine method establishes a non-linear maximum margin classifier that infers the machine condition from a new set of measurements. It works by using a non-linear kernel to transform the input vector space (which is a set of measurements believed to be correlated with machine condition) to a much higher dimension feature space, and drawing a linear hyper-plane classifier there. It is especially applicable to the situation when Gaussianity of the performance related features cannot be guaranteed and when a process may display multiple normal and faulty modes of behavior (multiple regimes of operation and/or multiple possible faults in the process). The main drawback of this method is that the choice of a kernel in real applications is usually based on experience or trial-and-error tests.

• The hidden Markov model method is especially applicable to a situation in which multiple signals exist and the system may have multiple failure modes. It is applicable to both stationary and non-stationary signals.

• The Bayesian belief network is a compact representation of cause-and-effect for a complex system, and is especially applicable to situations where there are multiple faults with multiple symptoms. The main drawback of this method is that no standard procedure exists to determine the network structure, and expert knowledge is needed to identify the node states.

• Condition diagnosis based on analytically calculated overlaps of Gaussians that describe the signatures corresponding to the current process behavior and the signatures corresponding to various modes of normal or faulty equipment behavior is applicable to the cases in which performance related features approximately behave as Gaussians.
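The cause-and-effect reasoning of a Bayesian belief network can be sketched with a deliberately small example: one condition node with three states and two symptom nodes that are assumed conditionally independent given the condition. This is a simplification, not the network used by the authors, and the states, symptoms and every probability below are hypothetical values chosen for illustration.

```python
def fault_posterior(priors, likelihoods, observed):
    """Posterior P(state | symptoms) for symptoms independent given the state.

    priors:      {state: P(state)}
    likelihoods: {state: {symptom: P(symptom present | state)}}
    observed:    {symptom: True or False}
    """
    joint = {}
    for state, prior in priors.items():
        p = prior
        for symptom, present in observed.items():
            p_present = likelihoods[state][symptom]
            p *= p_present if present else (1.0 - p_present)
        joint[state] = p
    total = sum(joint.values())
    return {state: p / total for state, p in joint.items()}


# Hypothetical expert-supplied numbers.
priors = {"normal": 0.90, "outer_race_defect": 0.06, "imbalance": 0.04}
likelihoods = {
    "normal": {"high_kurtosis": 0.02, "high_1x_amplitude": 0.05},
    "outer_race_defect": {"high_kurtosis": 0.90, "high_1x_amplitude": 0.20},
    "imbalance": {"high_kurtosis": 0.10, "high_1x_amplitude": 0.90},
}
observed = {"high_kurtosis": True, "high_1x_amplitude": False}

posterior = fault_posterior(priors, likelihoods, observed)
most_likely = max(posterior, key=posterior.get)
```

Even this toy case shows the drawback noted in the text: the structure and the conditional probabilities have to come from expert knowledge or from labeled failure data.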

3.3.2.4 Prediction and Prognostics Module The prediction and prognostics module is aimed at extrapolating the behavior of process signatures over time and predicting their behavior in the future. Currently, autoregressive moving average (ARMA) modeling (Pandit and Wu 1993) and match matrix (Liu et al. 2004) methods are used to forecast the performance behavior. Over time, as new failure modes occur, performance signatures related to each specific failure mode can be collected and used to teach the Watchdog Agent® to recognize and diagnose those failure modes in the future. Thus, the Watchdog Agent® is envisioned as an intelligent device that utilizes its experience and human supervisory inputs over time to build its own expandable and adjustable world model. Performance assessment, prediction and prognostics can be enhanced through feature-level or decision-level sensor fusion, as defined by Hall and Llinas (2000) (Chapter 2). Feature-level sensor fusion is accomplished through concatenation of features extracted from different sensors, and the joint consideration of the concatenated feature vector in the performance assessment and prediction modules. Decision-level sensor fusion is based on separately assessing and predicting process performance from individual sensor readings and then merging these individual sensor inferences into a multi-sensor assessment and prediction through some averaging technique. In summary, the following performance forecasting tools are currently used in the Watchdog Agent®:

• The autoregressive moving average (ARMA) method is applicable to linear time-invariant systems whose performance features display stationary behavior. ARMA utilizes a small amount of historic data and can provide good short term predictions.

• The compound match matrix/ARMA prediction method is applicable to cases when abundant records of multiple maintenance cycles exist for nonlinear processes. It excels at dealing with high dimension data and can provide good long term prediction by converting vector-based feature prediction to scalar-based prediction.

• The fuzzy logic prediction method is applicable to complex systems whose behavior is unknown and for which no model, function or numerical technique to describe the system is readily available. It utilizes linguistic vagueness or form and allows imprecision, to some extent, in formulating approximations. Fuzzy logic can give fast approximate solutions.

• The Elman recurrent neural network (ERNN) prediction method is applicable to non-linear systems and can give long term predictions when given a large amount of training data. However, no standard methodology exists to determine the ERNN structure, and trial-and-error is usually used in the modeling process.
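The short-term forecasting idea behind the ARMA tool can be sketched with its simplest special case, a first-order autoregressive model fitted by least squares to a confidence value history and then iterated forward until a maintenance threshold is crossed. This is a simplification and an assumption on our part: a full ARMA model also has moving-average terms and a proper order selection step. The function names, the CV history and the threshold of 0.55 are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx
    return mean_curr - phi * mean_prev, phi


def forecast(series, steps):
    """Iterate the fitted AR(1) recursion forward from the last observation."""
    c, phi = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


def steps_until_below(series, threshold, max_steps=1000):
    """Forecast steps until the value first drops below threshold, else None."""
    for k, value in enumerate(forecast(series, max_steps), start=1):
        if value < threshold:
            return k
    return None


# Hypothetical CV history generated by cv[t] = 0.1 + 0.8 * cv[t-1].
cv_history = [1.0]
for _ in range(7):
    cv_history.append(0.1 + 0.8 * cv_history[-1])

c_hat, phi_hat = fit_ar1(cv_history)
lead_time = steps_until_below(cv_history, threshold=0.55)
```

Because the history here is noise-free, the fit recovers the generating coefficients almost exactly; with measured data the forecast uncertainty grows with the horizon, which is consistent with the text's remark that ARMA is best suited to short term prediction.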

New tools will be continuously developed and added to the modular, open architecture Watchdog Agent® toolbox based on the development procedure as shown in Figure 3.12.

[Flowchart boxes: Problem definition & constraints; Tool selection; Parameter & tool selection; Prototyping & testing; Accepted (Yes/No decision); Program development; Evaluation (Yes/No decision); Deployment]

Figure 3.12. Flowchart for developing Watchdog Agent® tools

3.3.3 Case Studies Several Watchdog Agent® tools for on-line performance assessment and prediction have already been implemented as stand-alone applications in a number of industrial and service facilities. Listed below are several examples to illustrate the developed tools.

3.3.3.1 Example 1: Prognostics of an AS/RS Material Handling System A time-frequency based method (Cohen 1995) has been implemented for performance assessment of a gearbox in the AS/RS material handling system shown in Figure 3.13. Four vibration sensor readings have been fused to evaluate its performance autonomously while it is on-line. The vibration signals were processed into joint time-frequency energy distributions (Cohen 1995) and a set of time-shift invariant time-frequency moments (Zalubas et al. 1996; Djurdjanovic et al. 2000; Tacer and Loughlin 1996) was extracted. Since those moments asymptotically follow a Gaussian distribution (Zalubas et al. 1996), statistical reasoning was utilized to evaluate the overlap between signatures describing normal process behavior (used for training) and those describing the most recent process behavior. Figure 3.14 shows a screenshot of the software application housing this time-frequency based Watchdog Agent® used for performance assessment of a material handling system. The CV was generated by fusing multiple signal features for performance assessment.


Figure 3.13. Material handling system for mail staging

Figure 3.14. Screenshot of the time-frequency based Watchdog Agent®

3.3.3.2 Example 2: Roller Bearing Prognostics Testbed Most bearing diagnostics research involves studying defective bearings recovered from the field or from laboratory experiments where the bearings exhibit mature faults. Experiments using defective bearings have a lower capability for discovering natural defect propagation in its early stages. In order truly to reflect real defect propagation processes, bearing run-to-failure tests were performed under normal load conditions on a specially designed test rig sponsored by Rexnord Technical Service. The bearing test rig hosts four double row test bearings on one shaft, as shown in Figure 3.15. Shaft rotation speed was kept constant at 2000 rpm. A radial load of 6000 lb was applied to the shaft and bearings by a spring mechanism. A magnetic plug installed in the oil feedback pipe collected debris from the oil as evidence of bearing degradation. The test stopped when the accumulated debris adhering to the magnetic plug exceeded a certain level. A high sensitivity accelerometer was installed on each bearing housing. Four thermocouples were attached to the outer race of each bearing to record bearing temperature (which is relevant to the bearing lubrication condition). Several sets of tests ending with various failure modes were carried out. The time domain feature shows that most of the bearing fatigue time is consumed during the period of material accumulative damage, while the period of crack propagation and development is relatively short. This means that if the traditional threshold-based condition monitoring approach is used, the time available for the maintenance crew to respond between the detection of a defect and catastrophic failure is very short. A prognostic approach that can detect the defect at an early stage is required so that enough buffer time is available for maintenance and logistical scheduling.


Figure 3.15. The bearing test rig sponsored by Rexnord Technical Service

Figure 3.16 presents the vibration waveform collected from bearing 4 at the last stage of the bearing test. The signal exhibits strong impulse periodicity because of the impacts generated by a mature outer race defect. However, in the historical vibration signal recorded three days before the bearing failed, there is no sign of periodic impulses, as shown in Figure 3.17a. The periodic impulse feature is completely masked by the noise.

Figure 3.16. The vibration signal waveform of a faulty bearing

An adaptive wavelet filter is designed to de-noise the raw signal and enhance degradation detection. The adaptive wavelet filter is constructed in two steps. First, the optimal wavelet shape factor is found by the minimal entropy method. Then an optimal scale is identified by maximizing the signal periodicity. By applying the designed wavelet filter to the noisy raw signal, the de-noised signal can be obtained as shown in Figure 3.17b. The periodic impulse feature can then be clearly discovered, which serves as strong evidence of bearing outer race degradation. The wavelet filter-based de-noising method successfully enhanced the signal feature and provided potent evidence for prognostic decision-making.
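The first of the two design steps, choosing the wavelet shape factor by minimal entropy, can be sketched as follows. The sketch assumes a real Morlet-type wavelet of the form exp(-β²u²/2)·cos(πu), where β is the shape factor, and uses the Shannon entropy of the normalized coefficient energies as the sparsity criterion; both choices are assumptions, since the text does not give the exact formulation. The scale is fixed by hand here, so the second step (maximizing periodicity) is not shown. All names and the synthetic signal are hypothetical.

```python
import math
import random


def morlet(beta, half_width, scale):
    """Sampled real Morlet-type wavelet with shape factor beta."""
    return [
        math.exp(-(beta ** 2) * (k / scale) ** 2 / 2.0) * math.cos(math.pi * k / scale)
        for k in range(-half_width, half_width + 1)
    ]


def filter_signal(x, kernel):
    """Correlate x with the kernel, keeping only fully overlapping positions."""
    m = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(m)) for i in range(len(x) - m + 1)]


def shannon_entropy(coeffs):
    """Entropy of normalized coefficient energies; low values mean sparse, impulsive output."""
    energy = [c * c for c in coeffs]
    total = sum(energy)
    if total == 0.0:
        return 0.0
    return -sum((e / total) * math.log(e / total) for e in energy if e > 0.0)


def select_shape_factor(x, betas, scale, half_width):
    """Return the candidate beta whose filtered output has minimal entropy."""
    return min(
        betas,
        key=lambda b: shannon_entropy(filter_signal(x, morlet(b, half_width, scale))),
    )


# Synthetic record: a decaying ring every 50 samples, buried in Gaussian noise.
rng = random.Random(0)
signal = []
for n in range(400):
    k = n % 50
    ring = math.exp(-0.3 * k) * math.cos(math.pi * k / 4.0)
    signal.append(ring + 0.5 * rng.gauss(0.0, 1.0))

betas = [0.2, 0.4, 0.8, 1.6, 3.2]
best_beta = select_shape_factor(signal, betas, scale=4.0, half_width=20)
denoised = filter_signal(signal, morlet(best_beta, 20, 4.0))
```

Lower entropy means the filtered energy is concentrated in a few samples, which is what a train of defect impacts should look like once the noise is suppressed.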


[Panels: (a) raw signal; (b) de-noised signal using the wavelet filter]

Figure 3.17a, b. The vibration waveform with early stage defect

3.3.3.3 Example 3: Bearing Risk of Failure and Remaining Useful Life Prediction An important issue in prognostic technology is the estimation of the risk of failure, and of the remaining useful life of a component, given the component’s age and its past and current operating condition. In numerous cases, failures are attributed to many correlated degradation processes, which can be reflected by multiple degradation features extracted from sensor signals. These features are the major source of information regarding the health of the component under monitoring; however, the failure boundary is hard to define using these features. In reality, the same feature vector could be attributed to totally different combinations of the underlying degradation processes and their severity levels. There is only a probabilistic relationship between component failure and a given level of the degradation features. A typical example can be found during bearing operation: two bearings of the same type could fail at different levels of RMS and kurtosis of the vibration signal. To capture the probabilistic relationship between the multiple degradation features and the component failure, as well as to predict the risk of failure and the remaining useful life, IMS has developed a Proportional Hazards (PH) approach (Liao et al. 2005) based on the PH model proposed by Cox (1972). The PH model involving multiple degradation features is given as

$$\lambda(t; Z) = \lambda_0(t)\,\exp(\beta' Z) \qquad (3.1)$$

where λ(t; Z) is the hazard rate of the component given the current age t and the degradation feature vector Z; λ0(t) is called the baseline hazard rate function; and β is the model parameter vector. This formulation relates the working age and the multiple degradation features to the hazard rate of the component. To estimate the parameters, the maximum likelihood approach can be applied to off-line data, including the degradation features over time of many components and their failure times. Afterwards, the established model can be used for predicting the risk of failure of the component by plugging in the working age and the degradation features extracted from the on-line sensor signals. In addition, the remaining useful life L(tcurrent), given the current working age and the history of degradation features, can be estimated as

$$L(t_{\mathrm{current}}) \approx \int_{t_{\mathrm{current}}}^{\infty} \exp\!\left(-\int_{t_{\mathrm{current}}}^{\tau} \lambda\bigl(v; \hat{z}(v)\bigr)\,dv\right) d\tau \qquad (3.2)$$

where ẑ(v) is the predicted feature vector. Consider the vibration data obtained from the test rig in Example 2. To facilitate on-line implementation, root-mean-square (RMS) and kurtosis are calculated and used as degradation features. Figure 3.18 shows the predicted hazard rate over time based on these degradation features. This quantity can be utilized to trigger maintenance when the risk level crosses a predetermined threshold level. Table 3.1 provides the remaining useful life predictions given the current bearing age and the feature observations. The predictions are in accordance with the actual life of the studied bearing (≈ 32 days), and the prediction error shrinks as the degradation progresses.
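Equations 3.1 and 3.2 can be evaluated numerically with a short Python sketch. It assumes a Weibull baseline hazard λ0(t) = (k/η)(t/η)^(k−1), which the text does not specify, and it truncates the outer integral of Equation 3.2 at a finite horizon and uses the trapezoidal rule for both integrals. The parameter values, the constant feature prediction and all function names are hypothetical; in practice β and the baseline parameters would come from the maximum likelihood fit described above.

```python
import math


def baseline_hazard(t, shape, scale):
    """Assumed Weibull baseline hazard lambda_0(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def hazard(t, z, beta, shape, scale):
    """Equation 3.1: lambda(t; Z) = lambda_0(t) * exp(beta' Z)."""
    return baseline_hazard(t, shape, scale) * math.exp(
        sum(b * zi for b, zi in zip(beta, z))
    )


def expected_rul(t_now, z_pred, beta, shape, scale, horizon, dt=0.01):
    """Equation 3.2 by trapezoidal integration up to t_now + horizon.

    z_pred is a callable v -> predicted feature vector at time v.
    """
    steps = int(round(horizon / dt))
    cum_hazard = 0.0          # inner integral from t_now to tau
    surv_prev = 1.0           # conditional survival at tau = t_now
    h_prev = hazard(t_now, z_pred(t_now), beta, shape, scale)
    rul = 0.0
    for i in range(1, steps + 1):
        tau = t_now + i * dt
        h = hazard(tau, z_pred(tau), beta, shape, scale)
        cum_hazard += 0.5 * (h_prev + h) * dt
        surv = math.exp(-cum_hazard)
        rul += 0.5 * (surv_prev + surv) * dt
        h_prev, surv_prev = h, surv
    return rul


# Hypothetical fitted parameters and a constant (RMS, kurtosis) prediction.
beta = [0.8, 0.3]
shape, scale = 2.0, 30.0


def z_hat(v):
    return [1.2, 3.5]


risk_now = hazard(20.0, z_hat(20.0), beta, shape, scale)
rul_now = expected_rul(20.0, z_hat, beta, shape, scale, horizon=100.0)
```

A useful sanity check is the memoryless case: with shape k = 1 and β = 0 the hazard is the constant 1/η, and the expected remaining life should equal η at any age.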

Figure 3.18. Hazard rate prediction of bearing 3 in Test 1

Table 3.1. Estimates of expected remaining useful life – Test 1, Bearing 3 (unit: day)

Time                                        26        29        31
Estimated expected remaining useful life    3.5549    3.3965    1.5295
True remaining useful life                  6.5278    3.5278    1.5278
Error                                       2.9729    0.1313    0.0017

3.4 Conclusions and Future Research This chapter addresses the paradigm shift in modern maintenance systems from the traditional “fail and fix” practices to a “predict and prevent” methodology. A reconfigurable and scalable Watchdog Agent®-based intelligent maintenance system has been developed, which serves as a baseline system for researchers and companies to develop next-generation e-maintenance systems. It enables machine makers and users to predict machine health degradation conditions, diagnose fault sources, and suggest maintenance decisions before a fault actually occurs. The Watchdog Agent®-based R2M-PHM platform expands the OSA-CBM architecture topology by including real-time remote machinery diagnosis and prognosis systems and embedded Watchdog Agent® technology. The Watchdog Agent® is an embedded algorithm toolbox which converts multi-sensory data to machine health information. Innovative sensory processing and autonomous feature extraction methods are developed to facilitate the plug-and-play approach in which the Watchdog Agent® can be set up and run without any need for expert knowledge or intervention. Future work will be the further development of the Watchdog Agent®-based IMS platform. Smart software and NetWare will be further developed for proactive maintenance capabilities such as performance degradation measurement, fault recovery, self-maintenance and remote diagnostics. For the embedded Watchdog Agent® application, we need to harvest the developed technologies and tools and to accelerate their deployment in real-world applications through close collaboration between industrial and academic researchers.
Specifically, future work will include the following aspects: (i) evaluate the existing Watchdog Agent® tools and identify the application needs from the smart machine testbed; (ii) develop a configurable prognostics tools platform for rotary machinery elements such as bearings, motors, and gears, so that several of the most frequently used prognostics tools can be pretested and deposited into a ready-to-use tool library; (iii) develop a user interface system for tool selection, which allows users to use the right tools effectively for the right applications and achieve “the first tool correct” accuracy; (iv) validate the reconfiguration of these tools to a variety of similar applications (to be defined by the company participants); and (v) explore research in a “peer-to-peer” (P2P) paradigm in which Watchdog Agent®s embedded on identical products operating under similar conditions could exchange information and thus assist each other in machine health diagnosis and prognosis. To predict, prioritize, and plan precision maintenance actions to achieve an “every action correct” objective, the IMS Center is creating advanced maintenance simulation software for maintenance schedule planning and service logistics cost optimization for transparent decision making. At the same time, the Center is exploring the integration of decision support tools and optimization techniques for proactive maintenance; this integration will facilitate the functionalities of the Watchdog Agent®-based R2M-PHM, in which an intelligent maintenance system can operate as a near-zero down-time, self-sustainable and self-aware artificially intelligent system that learns from its own operation and experience.
Embedding is crucial for creating an enabling technology that can facilitate proactive maintenance and life cycle assessment for mobile systems, transportation devices and other products for which cost-effective realization of predictive performance assessment capabilities cannot be implemented on general purpose personal computers. The main research challenge will be to accomplish sophisticated performance evaluation and prediction capabilities under the severe power consumption, processing power and data storage limitations imposed by embedding. The Center will develop a wireless sensor network made of self-powered wireless motes for machine health monitoring and embedded prognostics. These networked smart motes can be easily installed in products and machines with ad hoc communications. In addition, the Center is investigating the feasibility of harvesting energy from vibration in an environment equipped with wireless motes for remote monitoring of equipment and machinery. In conjunction with that investigation, the Center is looking at ways of developing communication protocols that require less energy for communication. Power converter circuitry has been designed to convert vibration energy into useful electric energy. These technologies are critical for monitoring equipment or systems in a complex environment where the availability of power is the major constraint. In the area of collaborative product life cycle design and management, the Watchdog Agent® can serve as an infotronics agent to store product usage and end-of-life (EOL) service data and to send feedback to designers and life cycle management systems. Currently, an international intelligent manufacturing systems consortium on product embedded information systems for service and EOL has been proposed. The goal is to integrate Watchdog Agent® capabilities into products and systems for closed-loop design and life cycle management, as illustrated in Figure 3.19.

Figure 3.19. Embedded and tether-free product life cycle monitoring

The Center will continue advancing its research to develop technologies and tools for closed-loop life cycle design for product reliability and serviceability, as well as explore research in new frontier areas such as embedded and networked agents for self-maintenance, self-healing, and self-recovery of products and systems. These new frontier efforts will lead to a fundamental understanding of reconfigurability and allow the closed-loop design of autonomously reconfigurable engineered systems that integrate physical, information, and knowledge domains. These autonomously reconfigurable engineered systems will be able to sense, perform self-prognosis, self-diagnose, and reconfigure the system to function uninterruptedly when subject to unplanned failure events, as illustrated in Figure 3.20.

[Figure 3.20 elements: Near “0” Downtime; Closed-Loop Life Cycle Design (Design for Reliability and Serviceability); Product Center; Health Monitoring (Sensors & Embedded Intelligence); Product or System In Use; Product Redesign; Smart Design; Enhanced Six-Sigma Design; Degradation; Watchdog Agent®; Self-Maintenance (Redundancy, Active, Passive); Communications (Tether-Free (Bluetooth), Internet, TCP/IP); Service (Web-enabled Monitoring & Prognostics; Decision Support Tools for Optimized Maintenance; Condition-based Maintenance (CBM); Business and Service Synchronization; Asset Optimization); Web-enabled D2B™ Platform (XML-based)]

Watchdog Agent and Device-to-Business (D2B) are Trademarks of IMS Center

Figure 3.20. Intelligent maintenance systems and its key elements

3.5 References Badia, F.G., Berrade, M.D. and Campos, C.A., (2002) Optimal Inspection and Preventive Maintenance of Units with Revealed and Unrevealed Failures. Reliability Engineering and System Safety 78: 157–163. Barbera, F., Schneider, H. and Kelle, P., (1996) A Condition Based Maintenance Model with Exponential Failures and Fixed Inspection Interval. Journal of the Operational Research Society 47(8): 1037–1045. Bonissone, G., (1995) Soft computing applications in equipment maintenance and service in: ISIE ’95, Proceedings of the IEEE International Symposium, 2: 10–14. Brotherton, T., Jahns, G., Jacobs, J. and Wroblewski, D., (2000) Prognosis of faults in gas turbine engines, in: Aerospace Conference Proceedings, (2000) IEEE, 6: 18–25. Bruns, P., (2002) Optimal Maintenance Strategies for Systems with Partial Repair Options and without Assuming Bounded Costs. European Journal of Operational Research 139: 146–165. Bunday, B.D., (1991) Statistical Methods in Reliability Theory and Practice, Ellis Horwood. Burrus, C., Gopinath, R. and Haitao, G., (1998) Introduction to wavelets and wavelet transforms – a primer. NJ: Prentice Hall. Casoetto, N., Djurdjanovic, D., Mayor, R., Lee, J. and Ni, J., (2003) Multisensor process performance assessment through the use of autoregressive modeling and feature maps. Trans. of SME/NAMRI, 31:483–490.

New Technologies for Maintenance

77

Cavory, G., Dupas, R. and Goncalves, R., (2001) A Genetic Approach to the Scheduling of Preventive Maintenance Tasks on a Single Product Manufacturing Production Line, International Journal of Production Economics, 74: 135–146. Chen, C.T., Chen, Y.W. and Yuan, J., (2003) On a Dynamic Preventive Maintenance Policy for a System under Inspection. Reliability Engineering and System Safety 80: 41–47. Chen, D. and Trivedi, K., (2002) Closed-Form Analytical Results for Condition-Based Maintenance. Reliability Engineering and System Safety 76: 43–51. Cohen, L., (1995) Time-frequency analysis. NJ: Prentice Hall. Cox, D., (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34:187–220. Djurdjanovic, D., Widmalm, S.E., William, W.J., et al., (2000) Computerized classification of temporomandibular joint sounds. IEEE Transactions on Biomedical Engineering 47:977–984. Djurdjanovic, D., Ni, J. and Lee, J., (2002) Time-frequency based sensor fusion in the assessment and monitoring of machine performance degradation. Proceedings of 2002 ASME Int. Mechanical Eng. Congress and Exposition, paper number IMECE2002-32032. Garga, A., McClintic, K.T., Campbell, R.L., et al., (2001) Hybrid reasoning for prognostic learning in CBM systems, in: Aerospace Conference, 10–17 March, 2001, IEEE Proceedings, 6: 2957–2969. Goodenow, T., Hardman, W., Karchnak, M., (2000) Acoustic emissions in broadband vibration as an indicator of bearing stress. Proceedings of IEEE Aerospace Conference, 2000; 6: 95–122.L.D. Hall, L.D. and Llinas, J., (Eds.), (2000) Handbook of Sensor Fusion, CRC Press. Hall, L.D., (1992) Mathematical techniques in Multi-Sensor Data Fusion, Artech House Inc. Hansen, R., Hall, D., Kurtz, S., (1994) New approach to the challenge of machinery prognostics. Proceedings of the International Gas Turbine and Aeroengine Congress and Exposition, American Society of Mechanical Engineers, June 13–16 1994: 1–8. 
IMS, NSF I/UCRC Center for Intelligent Maintenance Systems, www.imscenter.net; 2004. Kemerait, R., (1987) New cepstral approach for prognostic maintenance of cyclic machinery. IEEE SOUTHEASTCON, 1987: 256–262. Kleinbaum, D., (1994) Logistic regression. New York: Springer-Verlag. Labib, A.W., (2006) Next generation maintenance systems: Towards the design of a selfmaintenance machine. 2006 IEEE International Conference on Industrial Informatics, Integrating Manufacturing and Services Systems, 16–18 August, Singapore Lee, J., (1995) Machine performance monitoring and proactive maintenance in computerintegrated manufacturing: review and perspective. International Journal of Computer Integrated Manufacturing 8:370–380. Lee, J., (1996) Measurement of machine performance degradation using a neural network model. Computers in Industry 30:193–209. Lee, J., Ni, J., (2002) Infotronics agent for tether-free prognostics. Proceeding of AAAI Spring Symposium on Information Refinement and Revision for Decision Making: Modeling for Diagnostics, Prognostics, and Prediction. Stanford Univ., Palo Alto, CA, March 25–27. Liang, E., Rodriguez, R., Husseiny, A., (1988) Prognostics/diagnostics of mechanical equipment by neural network, Neural Networks 1 (1) 33. Liao, H., Lin, D., Qiu, H., Banjevic, D., Jardine, A., Lee, J., (2005) A predictive tool for remaining useful life estimation of rotating machinery components. ASME International 20th Biennial Conference on Mechanical Vibration and Noise, Long Beach, CA, 24–28 September, 2005. Liu, J., Djurdjanovic, D., Ni, J., Lee, J., (2004) Performance similarity based method for enhanced prediction of manufacturing process performance. Proceedings of the 2004 ASME International Mechanical Engineering Congress and Exposition (IMECE), 2004.

78

J. Lee and H. Wang

Marple, S., (1987) Digital spectral analysis. NJ: Prentice Hall. Marseguerra, M., Zio, E. and Podofillini, L. (2002) Condition-Based Maintenance Optimization by Means of Genetic Algorithm and Monte Carlo Simulation. Reliability Engineering and System Safety 77: 151–166. Mijailovic, V. (2003) Probabilistic Method for Planning of Maintenance Activities of Substation Component. Electric Power System Research 64: 53–58. Pandit, S., Wu, S-M., (1993) Time series and system analysis with application. FL: Krieger Publishing Co. Parker, B.E., Jr., Nigro, T.M., Carley, M.P., et al., (1993) Helicopter gearbox diagnostics and prognostics using vibration signature analysis, in: Proceedings of the SPIE — The International Society for Optical Engineering: 531–542. Pham, H., Suprasad, A. and Misra, R.B. (1997) Availability and Mean Life Time Prediction of Multistage Degraded System with Partial Repairs. Reliability Engineering and System Safety 56: 169–173 Radjou, N., (2002) The collaborative product life-cycle. Forrester Research, May 2002. Ray, A. and Tangirala, S., (1996) Stochastic Modeling of Fatigue Crack Dynamic for OnLine Failure Prognostics, IEEE Transactions on Control Systems Technology, 4(4): 443– 449. Reichard, K., Van Dyke, M. and Maynard, K. (2000) Application of sensor fusion and signal classification techniques in a distributed machinery condition monitoring system. Proceedings of SPIE – The International Society for Optical Engineering 4051:329–336. Roemer, M., Kacprzynski, G. and Orsagh, R., (2001) Assessment of data and knowledge fusion strategies for prognostics and health management. Proceedings of IEEE Aerospace Conference, 2001; 6:62979–62988. Seliger, G., Basdere, B., Keil, T., et al. (2002) Innovative processes and tools for disassembly. Annals of CIRP 51:37–41. Su, L., Nolan M, DeMare G, Carey D. (1999) Prognostics framework ‘for weapon systems health monitoring’. 
Proceedings of IEEE Systems Readiness Technology Conference, IEEE AUTOTESTCON '99, 30 August–2 September 1999: 661–672. Swanson, D.C., (2001) A General Prognostics tracking algorithm for predictive maintenance, Proc. of the IEEE Aerospace Conference, 2001, 6: 2971–2977. Tacer, B., Loughlin, P., (1996) Time-frequency based classification. SPIE Proceedings 42:2697–2705. Thurston, M. and Lebold, M., (2001) Open Standards for Condition Based Maintenance and Prognostic Systems, Pennsylvania State University, Applied Research Laboratory. Umeda, Y., Tomiyama, T. and Yoshikawa, H., (1995) A design methodology for selfmaintenance machines, ASME journal of mechanical design, 117, September Vachtsevanos, G. and Wang, P., (2001) Fault prognosis using dynamic wavelet neural networks, Proceedings of the IEEE International Symposium on Intelligent Control 2001 (ISIC '01): 79–84. Wang, H.Z., (2002) A Survey of Maintenance Policies of Deteriorating Systems, European Journal of Operationa Research, 139: 469–489. Wilson, B.W., Hansen, N.H., Shepard, C.L., et al. (1999) Development of a modular in-situ oil analysis prognostic system. International Society of Logistics (SOLE) 1999 Symposium, Nevada, Las Vegas, 30 August –2 September. Yamada, A., Takata, S., (2002) Reliability improvement of industrial robots by optimizing operation plans based on deterioration evaluation. Annals of the CIRP 51:319–322. Yen, G., Lin, K., (2000) Wavelet packet feature extraction for vibration monitoring. IEEE Trans. on Industrial Electronics 2000; 47:650–667. Zalubas, E.J., O’Neill, J.C., Williams, W.J., et al., (1996) Shift and scale invariant detection. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1996; 5:3637–3640.

4 Reliability Centred Maintenance

Marvin Rausand and Jørn Vatn

4.1 Introduction

Reliability centred maintenance (RCM) is a method for maintenance planning that was developed within the aircraft industry and later adapted to several other industries and military branches. A large number of standards and guidelines have been issued where the RCM methodology is tailored to different application areas, e.g., IEC 60300-3-11, MIL-STD-2173, NAVAIR 00-25-403 (NAVAIR 2005), SAE JA 1012 (SAE 2002), USACERL TR 99/41 (USACERL 1999), ABS (2003, 2004), NASA (2000) and DEF-STD 02-45 (DEF 2000). On a generic level, IEC 60300-3-11 (IEC 1999) defines RCM as a “systematic approach for identifying effective and efficient preventive maintenance tasks for items in accordance with a specific set of procedures and for establishing intervals between maintenance tasks.”

A major advantage of the RCM analysis process is a structured and traceable approach to determine the optimal type of preventive maintenance (PM). This is achieved through a detailed analysis of failure modes and failure causes. Although the main objective of RCM is to determine the preventive maintenance, the results from the analysis may also be used in relation to corrective maintenance strategies, spare part optimization, and logistic considerations. In addition, RCM also has an important role in overall system safety management.

An RCM analysis process, when properly conducted, should answer the following seven questions:

1. What are the system functions and the associated performance standards?
2. How can the system fail to fulfil these functions?
3. What can cause a functional failure?
4. What happens when a failure occurs?
5. What might the consequence be when the failure occurs?
6. What can be done to detect and prevent the failure?
7. What should be done when a suitable preventive task cannot be found?


The main objectives of an RCM analysis process are to:

• Identify effective maintenance tasks
• Evaluate these tasks by some cost–benefit analysis
• Prepare a plan for carrying out the identified maintenance tasks at optimal intervals

The RCM analysis process is carried out as a sequence of activities. Some of these activities, or steps, overlap in time. The structuring of the RCM process is slightly different in the various standards, guidelines, and textbooks. In this chapter we split the RCM analysis process into the following 12 steps:

1. Study preparation
2. System selection and definition
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure modes, effects, and criticality analysis (FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating

The rest of the chapter is structured as follows: In Section 4.2 we describe and discuss the 12 steps of the RCM process. The concepts of generic and local RCM analysis are introduced in Section 4.3. These concepts have been used in a novel RCM approach to improve and speed up the analyses in a railway application. Models and methods for optimization of maintenance intervals are discussed in Section 4.4. Some main features of a new computer tool, OptiRCM, are briefly introduced. Concluding remarks are given in Section 4.5. The RCM analysis approach that is described in this chapter is mainly in accordance with accepted standards, but also contains some novel issues, especially related to steps 6 and 8 and the approach chosen in OptiRCM. The RCM approach is illustrated with examples from railway applications. Simple examples from the offshore oil and gas industry are also mentioned.

4.2 Main Steps of the RCM Analysis Process

4.2.1 Step 1: Study Preparation

Before the actual RCM analysis process is initiated, an RCM project group must be established. The group should include at least one person from the maintenance function and one from the operations function, in addition to an RCM specialist. In Step 1 the RCM project group should define and clarify the objectives and the scope of the analysis. Requirements, policies, and acceptance criteria with respect to safety and environmental protection should be made visible as boundary conditions for the RCM analysis.

The part of the plant to be analyzed is selected in Step 2. The type of consequences to be considered should, however, be discussed and settled on a general basis in Step 1. Possible consequences to be evaluated may comprise:

• Human injuries and/or fatalities
• Negative health effects
• Environmental damage
• Loss of system effectiveness (e.g. delays, production loss)
• Material loss or equipment damage
• Loss of market shares

The consequence classes cannot usually all be measured in a common unit. It is therefore necessary to prioritize between means affecting the various consequence classes. Such a prioritization is not an easy task and will not be discussed in this chapter. The trade-off problems can to some extent be solved within a decision theoretical framework (Vatn et al. 1996).

RCM analyses have traditionally concentrated on PM strategies. It is, however, possible to extend the scope of the analysis to cover topics like corrective maintenance strategies, spare part inventories, logistic support problems, and input to safety management. The RCM project group must decide what should be part of the scope and what should be outside. The resources that are available for the analysis are usually limited. The RCM project group should therefore be realistic with respect to what to look into, realizing that analysis cost should not dominate potential benefits.

In many RCM applications the plant already has effective maintenance programs. The RCM project will therefore be an upgrade project to identify and select the most effective PM tasks, to recommend new tasks or revisions, and to eliminate ineffective tasks. A further aim is to apply those changes within the existing programs in a way that allows the most efficient allocation of resources. When applying RCM to an existing PM program, it is best to utilize, to the greatest extent possible, established plant administrative and control procedures in order to maintain the structure and format of the current program. This approach provides at least three additional benefits:

• It preserves the effectiveness and success of the current program
• It facilitates acceptance and implementation of the project’s recommendations when they are processed
• It allows incorporation of improvements as soon as they are discovered, without the necessity of waiting for major changes to the PM program or analysis of every system

4.2.2 Step 2: System Selection and Definition

Before a decision to perform an RCM analysis is taken, two questions should be considered:

• To which systems is an RCM analysis beneficial compared with more traditional maintenance planning?
• At what level of assembly (plant, system, subsystem) should the analysis be conducted?

All systems may in principle benefit from an RCM analysis. With limited resources we must, however, set priorities, at least when introducing RCM in a new plant. We should start with the systems we assume will benefit most from the analysis. The following criteria may be used to prioritize systems for an RCM analysis:

• The failure effects of potential system failures must be significant in terms of safety, environmental consequences, production loss, or maintenance costs
• The system complexity must be above average
• Reliability data or operating experience from the actual system, or similar systems, should be available

Most operating plants have developed an assembly hierarchy, i.e. an organization of the system hardware elements into a structure that looks like the root system of a tree. In the offshore oil and gas industry this hierarchy is usually referred to as the tag number system. Several other names are also used. Moubray (1997) refers to the assembly hierarchy as the plant register. In railway infrastructure maintenance it is common to use the disciplinary areas as the next highest level in the plant register. These are typically:

• Superstructure
• Substructure
• Signalling
• Telecommunications
• Power supply (overhead line with supporting systems)
• Low voltage systems

In this chapter, the following terms are used for the levels of the assembly hierarchy:

Plant: A logical grouping of systems that function together to provide an output or product by processing and manipulating various input raw materials and feedstock. An offshore gas production platform may, e.g., be considered as a plant. For railway application a plant might be a maintenance area, where the main function of that “plant” is to ensure satisfactory infrastructure functionality in that area. Moubray (1997) refers to the plant as a cost centre. In railway application a plant corresponds to a train set (rolling stock), or a line (infrastructure).

System: A logical grouping of subsystems that will perform a series of key functions, which often can be summarized as one main function, that is required of a plant (e.g., feed water, steam supply, and water injection). The compression system on an offshore gas production platform may, e.g., be considered as a system. Note that the compression system may consist of several compressors with a high degree of redundancy. Redundant units performing the same main function should be included in the same system. It is usually easy to identify the systems in a plant, since they are used as logical building blocks in the design process.


The system level is usually recommended as the starting point for the RCM process. This is further discussed and justified, e.g., by Smith (1993) and in MIL-STD-2173 (MIL-STD 1986). This means that on an offshore oil/gas platform the starting point of the analysis should be the compression system, the water injection system or the fire water system, and not the whole platform. In railway application the systems were defined above as the next highest level in the plant hierarchy.

The systems may be further broken down into subsystems, and sub-subsystems, and so on. For the purpose of the RCM analysis process the lowest level of the hierarchy should be what we will call an RCM analysis item.

RCM analysis item: A grouping or collection of components, which together form some identifiable package that will perform at least one significant function as a stand-alone item (e.g., pumps, valves, and electric motors). For brevity, an RCM analysis item will in the following be called an analysis item. By this definition, a shutdown valve, e.g., is classified as an analysis item, while the valve actuator is not. The actuator is supporting equipment to the shutdown valve, and only has a function as a part of the valve. The importance of distinguishing the analysis items from their supporting equipment is clearly seen in the FMECA in Step 6. If an analysis item is found to have no significant failure modes, then none of the failure modes or causes of the supporting equipment are important, and therefore do not need to be addressed. Similarly, if an analysis item has only one significant failure mode, then the supporting equipment only needs to be analyzed to determine if there are failure causes that can affect that particular failure mode (Paglia et al. 1991). Therefore, only the failure modes and effects of the analysis items need to be analyzed in the FMECA in Step 6. An analysis item is usually repairable, meaning that it can be repaired without replacing the whole item. In the offshore reliability database OREDA (2002) the analysis item is called an equipment unit. The various analysis items of a system may be at different levels of assembly. On an offshore platform, for example, a huge pump may be defined as an analysis item in the same way as a small gas detector. If we have redundant items, e.g., two parallel pumps, each of them should be classified as an analysis item. When in Step 6 we identify causes of analysis item failures, we often find it suitable to attribute these failure causes to failures of items on an even lower level of indenture. The lowest level is usually referred to as components.

Component: The lowest level at which equipment can be disassembled without damage or destruction to the items involved. Smith (2005) refers to this lowest level as least replaceable assembly, while OREDA (2002) uses the term maintainable item.

It is very important that the analysis items are selected and defined in a clear and unambiguous way in this initial phase of the RCM analysis process, since the following analysis will be based on these analysis items. If the OREDA database is to be used in later phases of the RCM process, it is recommended as far as possible to define the analysis items in compliance with the “equipment units” in OREDA.
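The assembly hierarchy defined above can be sketched as a small data structure. This is only an illustration under simplifying assumptions: subsystem levels are omitted, and the class names, field names, and example equipment are hypothetical rather than taken from any standard or from OREDA.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnalysisItem:
    """An RCM analysis item: performs at least one significant stand-alone function."""
    name: str
    # Supporting equipment only has a function as part of the analysis item.
    supporting_equipment: List[str] = field(default_factory=list)


@dataclass
class System:
    """A logical grouping (subsystem levels omitted here) with one main function."""
    name: str
    items: List[AnalysisItem] = field(default_factory=list)


@dataclass
class Plant:
    """A logical grouping of systems that together provide an output or product."""
    name: str
    systems: List[System] = field(default_factory=list)

    def analysis_items(self) -> List[str]:
        """Return 'system / item' labels for every analysis item in the plant."""
        return [f"{s.name} / {i.name}" for s in self.systems for i in s.items]


# Hypothetical example data, loosely following the offshore examples in the text.
platform = Plant(
    name="Offshore gas production platform",
    systems=[
        System(
            name="Compression system",
            items=[
                # Redundant units are separate analysis items in the same system.
                AnalysisItem("Compressor A"),
                AnalysisItem("Compressor B"),
                AnalysisItem("Shutdown valve", supporting_equipment=["Valve actuator"]),
            ],
        ),
    ],
)

for label in platform.analysis_items():
    print(label)
```

Keeping the valve actuator as supporting equipment of the shutdown valve, rather than as an item of its own, mirrors the distinction that matters in the FMECA in Step 6.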


4.2.3 Step 3: Functional Failure Analysis (FFA)

The objectives of this step are to:

1. Identify and describe the systems’ required functions
2. Describe input interfaces required for the system to operate
3. Identify the ways in which the system might fail to function

4.2.3.1 Step 3(i): Identification of System Functions

The objective of this step is to identify and describe all the required functions of the system. According to ABS (2004) “each function should be documented as a function statement that contains a verb describing the function, an object on which the function acts, and performance standard(s)”. A function of a shutdown valve may therefore be “close flow of oil within 5 s”.

A complex system will usually have a large number of different functions. It is often difficult to identify all these functions without a checklist. The checklist or classification scheme of the various functions presented below may help the analyst in identifying the functions. The same scheme may be used in Step 6 to identify functions of analysis items. The term item is therefore used in the classification scheme to denote either a system or an analysis item:

1. Essential functions: These are the functions required to fulfil the intended purpose of the item. The essential functions are simply the reasons for installing the item. Often an essential function is reflected in the name of the item. An essential function of a pump is, e.g., to pump a fluid.
2. Auxiliary functions: These are the functions that are required to support the essential functions. The auxiliary functions are usually less obvious than the essential functions, but may in many cases be as important as the essential functions. Failure of an auxiliary function may in many cases be more critical than a failure of an essential function. An auxiliary function of a pump is, e.g., to “contain fluid.”
3. Protective functions: The functions intended to protect people, equipment, and the environment from damage and injury. The protective functions may be classified according to what they protect, as: (i) safety functions, (ii) environment functions, and (iii) hygiene functions. An example of a protective function is the protection provided by a rupture disk on a pressure vessel.
4. Information functions: These functions comprise condition monitoring, various gauges and alarms, and so on.
5. Interface functions: These functions apply to the interfaces between the item in question and other items. The interfaces may be active or passive. A passive interface is, e.g., present when an item is a support or a base for another item.
6. Superfluous functions: According to Moubray (1997) “Items or components are sometimes encountered which are completely superfluous. This usually happens when equipment has been modified frequently over a period of years, or when new equipment has been over-specified”. Superfluous functions are sometimes present when the item has been designed for an operational context that is different from the actual operational context. In some cases failures of a superfluous function may cause failure of other functions.

For analysis purposes the various functions of an item may also be classified as:

• On-line functions: These are functions operated either continuously or so often that the user has current knowledge about their state. The termination of an on-line function is called an evident (or detectable) failure. In relation to safety instrumented systems, on-line functions correspond to high demand systems; see IEC 61508 (IEC 1997).
• Off-line functions: These are functions that are used intermittently or so infrequently that their availability is not known by the user without some special check or test. The protective functions are very often off-line functions. An example of an off-line function is the essential function of an emergency shutdown (ESD) system on an oil platform. The termination of an off-line function is called a hidden (or undetectable) failure. In the IEC 61508 setting, off-line functions correspond to low demand systems.
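As a minimal sketch, a function statement (verb, object, performance standard) together with the checklist class and the on-line/off-line distinction could be recorded as follows. The class and field names are hypothetical, and treating the shutdown valve's closing function as off-line is an assumption made only for the example.

```python
from dataclasses import dataclass
from enum import Enum


class FunctionClass(Enum):
    """Checklist classes from Step 3(i); used as a reminder, not a strict taxonomy."""
    ESSENTIAL = "essential"
    AUXILIARY = "auxiliary"
    PROTECTIVE = "protective"
    INFORMATION = "information"
    INTERFACE = "interface"
    SUPERFLUOUS = "superfluous"


@dataclass(frozen=True)
class FunctionStatement:
    verb: str                  # what the item does
    obj: str                   # what the function acts on
    performance_standard: str  # may be empty if none is stated
    function_class: FunctionClass
    online: bool               # True if the user has current knowledge of its state

    def statement(self) -> str:
        """Render the statement as 'verb object performance-standard'."""
        return f"{self.verb} {self.obj} {self.performance_standard}".strip()

    def failure_visibility(self) -> str:
        """Termination of an on-line function is evident; of an off-line one, hidden."""
        return "evident" if self.online else "hidden"


# Assumption for illustration: the valve is only demanded occasionally (off-line).
close_valve = FunctionStatement(
    "Close", "flow of oil", "within 5 s", FunctionClass.ESSENTIAL, online=False
)
# Assumption for illustration: a leak from a running pump is noticed by the user.
contain_fluid = FunctionStatement(
    "Contain", "fluid", "", FunctionClass.AUXILIARY, online=True
)

for f in (close_valve, contain_fluid):
    print(f"{f.statement()}: {f.function_class.value}, {f.failure_visibility()} failure")
```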

Note that this classification of functions should only be used as a checklist to ensure that all relevant functions are revealed. Discussions about whether to classify a function as, e.g., “essential” or “auxiliary” should be avoided. The item may in general have several operational modes (e.g., running, and standby), and several functions related to each operating state.

4.2.3.2 Step 3(ii): Functional Block Diagrams

Various types of functional diagrams may represent the system functions identified in Step 3(i). The most common diagram is the so-called functional block diagram. A simple functional block diagram of a diesel engine is shown in Figure 4.1. It is generally not required to establish functional block diagrams for all the system functions. The diagrams are, however, efficient tools to illustrate the input interfaces to a function. In some cases we may want to split system functions into sub-functions on an increasing level of detail, down to functions of analysis items. The functional block diagrams may be used to establish this functional hierarchy in a pictorial manner, illustrating series-parallel relationships, possible feedbacks, and functional interfaces (e.g., see Blanchard and Fabrycky 1998; Rausand and Høyland 2004). Alternatives to the functional block diagram are reliability block diagrams and fault trees. Functional block diagrams are also useful as a basis for the FMECA in Step 6 in the RCM analysis process.

4.2.3.3 Step 3(iii): Functional Failures

The next step of the FFA is to identify and describe how the various system functions may fail. A system function may be subject to a set of performance standards (or functional requirements) that may be grouped as physical properties, operational performance properties including output tolerances, and time requirements such as continuous operation or required availability. An unacceptable deviation from one or more of these performance standards is called a functional failure.


Figure 4.1. Functional block diagram for a diesel engine

The term functional failure is mainly used in the RCM literature, and has the same meaning as the more common term failure mode. In RCM we talk about functional failures on equipment level, and use the term failure mode related to the parts of the equipment. The failure modes will therefore be causes of a functional failure. It is important to realize that a functional failure (and a failure mode) is a manifestation of the failure as seen from the outside, i.e., a deviation from performance standards. Functional failures and failure modes may be classified in three main groups related to the function of the item:

• Total loss of function: In this case the function is not achieved at all, or the quality of the function is far beyond what is considered as acceptable.
• Partial loss of function: This group may be very wide, and may range from the nuisance category almost to the total loss of function.
• Erroneous function: This means that the item performs an action that was not intended, often the opposite of the intended function.

A variety of classification schemes for functional failures (failure modes) have been published. Some of these schemes, e.g., Blache and Shrivastava (1994), may be used in combination with the function classification scheme in Step 3(i) to ensure that all relevant functional failures are identified.

The system functional failures may be recorded on a specially designed FFA-worksheet that is rather similar to a standard FMECA worksheet. An example of an FFA-worksheet is presented in Figure 4.2. In the first column of Figure 4.2 the various operational modes of the system are recorded. For each operational mode, all the relevant functions of the system are recorded in column 2.

System:                 Date:               Page:    of:
Ref. drawing no.:       Performed by:

Operational mode | Function | Function requirements | Functional failure | Frequency | Criticality (S, E, A, C)

Figure 4.2. Example of an FFA-worksheet

The performance requirements to the functions, like target values and acceptable deviations, are listed in column 3. For each function (in column 2) all the relevant functional failures are listed in column 4. In column 5 the frequency/probability of the functional failure is listed. A criticality ranking of each functional failure in that particular operational mode is given is given in column 6. The reason for including the criticality ranking is to be able to limit the extent of the further analysis by disregarding insignificant functional failures. For complex systems such a screening is often very important in order not to waste time and money. The criticality ranking depends on both the frequency/probability of the occurrence of the functional failure, and the severity of the failure. The severity must be judged at plant level. The severity ranking should be given in the four consequence classes: (S) safety of personnel, (E) environmental impact, (A) production availability, and (C) economic losses. For each of these consequence classes the severity should be ranked as for example (H) high, (M) medium, or (L) low. How we should define the borderlines between these classes will depend on the specific application. If at least one of the four entries are (M) medium or (H) high, the severity of the functional should be classified as significant, and the functional failure should be subject to further analysis. The frequency of the functional failure may also be classified in the same three classes. (H) high may, e.g., be defined as more than once per 5 years, and (L) low less than once per 50 years. As above, the specific borderlines will depend on the application. The frequency classes may be used to prioritize between the significant system failure modes. 
If all the four severity entries of a system failure mode are (L) low, and the frequency is also (L) low, the criticality is classified as insignificant, and the functional failure is disregarded in the further analysis. If, however, the frequency is (M) medium or (H) high the functional failure should be included in the further analysis even if all the severity ranks are (L) low, but with a lower priority than the significant functional failures. The FFA may be rather time-consuming because, for all functional failures, we have to list all the maintenance significant items (MSIs) (see Step 4). The MSI lists will hence have to be repeated several times. To reduce the workload we often conduct a simpler FFA where for each main function we list all functional failures in one column, and all the related MSIs in another column. This is illustrated in Figure 4.3 for a railway application.
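The screening rule described above can be sketched in a few lines of Python. This is only an illustration: the names `Rank`, `frequency_rank` and `screen` are our own, and the frequency borderlines are the example values from the text (once per 5 years and once per 50 years), which a real application would replace with its own.

```python
from enum import IntEnum


class Rank(IntEnum):
    """Three-level ranking used for both severity and frequency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def frequency_rank(failures_per_year: float) -> Rank:
    """Classify a functional-failure frequency with the example borderlines.

    HIGH: more than once per 5 years; LOW: less than once per 50 years.
    These borderlines are the examples from the text, not fixed rules.
    """
    if failures_per_year > 1 / 5:
        return Rank.HIGH
    if failures_per_year < 1 / 50:
        return Rank.LOW
    return Rank.MEDIUM


def screen(severity: dict[str, Rank], frequency: Rank) -> str:
    """Return the screening outcome for one functional failure.

    `severity` maps the consequence classes S, E, A and C to a rank.
    """
    # At least one medium or high severity entry makes the failure significant.
    if any(rank >= Rank.MEDIUM for rank in severity.values()):
        return "significant"
    # All severities low: keep the failure only if it is frequent enough.
    if frequency >= Rank.MEDIUM:
        return "include, lower priority"
    return "insignificant"
```

For example, a functional failure with all four severities ranked low and an expected frequency of once every two years would be kept in the analysis, but with lower priority than the significant functional failures.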


M. Rausand and J. Vatn

The function name reflects the functions to be carried out on a relatively high level in the system. In principle, we should explicitly formulate the function(s) to be carried out. Instead we often specify the equipment class performing the function. For example, “departure light signal” is specified rather than the more correct formulation “ensure correct departure light signal”. We observe that the last functional failure in Figure 4.3 is not a failure mode for the “correct” functional description (Ensure correct departure light signal), but is related to another function of the “departure light signal”. Thus, if we use an equipment class description rather than an explicit functional statement, the list of failure modes should cover all (implicit) functions of the equipment class. At the functional failure level, it is also convenient to specify whether the failure mode is evident or hidden; see Figure 4.3 where we have introduced an “EF/HF” column. For each function we also list the relevant items that are required to perform the function. These items will form “rows” in the FMECA worksheets; see Step 5.

4.2.4 Step 4: Critical Item Selection

The objective of this step is to identify the analysis items that are potentially critical with respect to the functional failures identified in Step 3(iii). These analysis items are denoted functional significant items (FSI). For simple systems the FSIs may be identified without any formal analysis. In many cases it is obvious which analysis items have influence on the functional failures. For complex systems with an ample degree of redundancy or with buffers, we may need a formal approach to identify the FSIs. If failure rates and other necessary input data are available for the various analysis items, it is usually a straightforward task to calculate the relative importance of the various analysis items based on a fault tree model or a reliability block diagram.
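As a minimal sketch of such a calculation, the Python code below computes Birnbaum's importance measure, one commonly used measure and our choice for this illustration, for a small reliability block diagram. The three-item structure and the reliabilities are hypothetical, components are assumed independent, and the brute-force enumeration is only practical for small systems.

```python
from itertools import product
from typing import Callable, Sequence


def system_reliability(structure: Callable[[Sequence[int]], int],
                       p: Sequence[float]) -> float:
    """Exact system reliability by enumerating all component state vectors.

    Assumes independent components; p[i] is the probability that
    component i functions.
    """
    total = 0.0
    for states in product((0, 1), repeat=len(p)):
        prob = 1.0
        for state, p_i in zip(states, p):
            prob *= p_i if state else 1.0 - p_i
        total += prob * structure(states)
    return total


def birnbaum_importance(structure: Callable[[Sequence[int]], int],
                        p: Sequence[float], i: int) -> float:
    """Birnbaum's measure: reliability with item i perfect minus with item i failed."""
    p_up = list(p)
    p_up[i] = 1.0
    p_down = list(p)
    p_down[i] = 0.0
    return system_reliability(structure, p_up) - system_reliability(structure, p_down)


def example_structure(x: Sequence[int]) -> int:
    """Hypothetical block diagram: item 0 in series with items 1 and 2 in parallel."""
    return x[0] * max(x[1], x[2])
```

With reliabilities 0.9, 0.8 and 0.7, the series item 0 receives the highest importance, which matches the intuition that the redundant items 1 and 2 are less likely to be functional significant items.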
A number of importance measures are discussed by Rausand and Høyland (2004). In addition to the FSIs, we should also identify items with high failure rate, high repair costs, low maintainability, long lead-time for spare parts, or items requiring external maintenance personnel. These analysis items are denoted maintenance cost significant items (MCSI). Together, the functional significant items and the maintenance cost significant items are denoted maintenance significant items (MSI). In an RCM project for the Norwegian Railway Administration the use of generic RCM analyses (see Section 4.3) made it possible to analyze all identified MSIs. In this case this step could be omitted.

4.2.5 Step 5: Data Collection and Analysis

The purpose of this step is to establish a basis for both the qualitative analysis (relevant failure modes and failure causes), and the quantitative analysis (reliability parameters such as MTTF, PF-intervals, and so on). The data necessary for the RCM analysis may be categorized into the following three groups:

Reliability Centred Maintenance


[Figure 4.3 shows a stack of function sheets, e.g., Function: “Home signal” and Function: “Departure light signal”. The sheet for the departure light signal reads:]

Function: “Departure light signal”
Description: “Five lamp signals, with three main signals and two pre-signals”

Functional failure (EF / HF):
- Wrong signal picture (HF)
- Missing signal picture (HF)
- Unclear signal picture (HF)
- Does not prevent contact hazard in case of earth fault (HF)
- etc.

MSI:
- Signal mast
- Brands
- Background shade
- Earth conductor
- Lamp
- Lens
- Transformer
- etc.

Figure 4.3. Structure of functional failure analysis

1. Design data: (i) system definition: a description of the system boundaries including all subsystems and equipment to fulfil the main functions of the system, (ii) system breakdown: the assembly hierarchy as described in Step 2, (iii) a technical description of each subsystem, such as the structure of the subsystem, capacity and functions (e.g., input and output), (iv) system performance requirements, e.g., desired system availability, environmental requirements, (v) requirements related to maintenance/testing, e.g., according to rules and regulations.
2. Operational and failure data: (i) performance requirements, (ii) operating profile (continuous or intermittent operation), (iii) control philosophy (remote/local and automatic/manual), (iv) environmental conditions, (v) maintainability, (vi) calendar- and accumulated operating time for overhauls, (vii) maintenance and downtime costs, (viii) recommended maintenance for each analysis item based on manufacturer specification, general guidelines or standards, or in-house recommended practice, and (ix) failure information, i.e., what happens when a failure occurs.
3. Reliability data: Reliability data may be derived from the operational data by statistical analysis. The reliability data is used to decide the criticality, to describe the failure process mathematically and to optimize the time between PM tasks.

During the initial phase of the RCM analysis process it often becomes evident that the format and quality of the operational data are not sufficient to estimate the relevant reliability parameters. Some of the main problems encountered are:

• The failure data is at too high a level in the assembly hierarchy, i.e., data is not reported on the RCM analysis item level (MSI).
• Failure modes and failure causes are not reported, or the recorded information does not correspond to the definitions and code lists used in the FMECA of Step 6.
• For systems being monitored by measurements or visual inspection, the state information is often not reported, making it impossible to establish models for the failure progression.
• For multiple copies of a component, the failure reporting does not link each failure report to a physical unit, but only states that “one of the components has failed and has been replaced.”

When such problems are encountered, it is important to start a process to improve the reporting of operational and failure data. However, there will always be a cost associated with improved reporting, because (i) the maintenance personnel need to spend more time on reporting, (ii) the maintenance personnel need to be trained in failure reporting, and get insight into the structured FMECA thinking, and (iii) the reporting systems (maintenance management systems) have to be restructured to allow reporting in a format in accordance with the logical structure of the FMECA worksheets. Our experience is that improved reporting quality is unattainable unless maintenance personnel executing the maintenance also participate in the RCM process. This would give ownership to the process, but it is no guarantee that reporting will improve.

4.2.6 Step 6: FMECA

The objective of this step is to identify the dominant failure modes of the MSIs identified in Step 4. The information entered into the FMECA worksheet should be sufficient both with respect to maintenance task selection in Step 7, and interval optimization in Step 8. Our FMECA worksheet has more fields than the FMECAs found in most RCM standards. The reason for this is that we use the FMECA as the main database for the RCM analysis. Other RCM approaches often use a rather simple FMECA worksheet, but then have to add an additional FMECA-like worksheet with the data required for optimization of maintenance intervals.

TOP Events

Experience has shown that we can significantly reduce the workload of the FMECA by introducing so-called TOP events as a basis for the analysis. The idea is that for each failure mode in the FMECA, a so-called TOP event is specified as a consequence of the failure mode. A number of failure modes will typically lead to the same TOP event. A consequence analysis is then carried out for each TOP event to identify the end consequences of that particular TOP event, covering all consequence classes (e.g., safety, availability/punctuality, environmental aspects). For many plants, risk analyses (or safety cases) have been carried out as part of the design process. These may sometimes be used as a basis for the consequence analysis. Figure 4.4 shows a conceptual model of this approach for a railway application where the part to the left of the TOP event is treated in the FMECA, and the part to the right is treated as generic, i.e., only once for each TOP event.

[Figure 4.4 shows a barrier diagram running from an initiating event (example: “Red bulb failure”) to a TOP event (example: “Train collision”) and on to the end consequence classes C1 to C6. The labels in the diagram are:
Failure cause: Burn-out bulb
Maintenance barrier: Preventive replacement
Other barriers: Directional setting “block”, Automatic train protection, Train control centre
Consequence reducing barriers: Rescue team, Train construction, Fire protection]

Figure 4.4. Barrier model for safety

In the rectangle (dashed line) on the left-hand side of Figure 4.4 an “initiating event” and a “barrier” are illustrated. To analyze this “rectangle” we need reliability parameters, such as MTTF, aging parameter, and PF interval, that are included in the FMECA worksheet (e.g., see Rausand and Høyland 2004). Three situations are considered:

1. There is a failure or a fault situation that is not related to the component we are analyzing with respect to maintenance. If, for example, we are analyzing the automatic train protection (ATP) on the train, the initiating event may be “locomotive driver does not comply with signaling”, and thus the ATP is a barrier against this initiating event. In this situation the function of the ATP is typically a hidden function.
2. There is a potential failure in the component that is being analyzed, and maintenance is a barrier against this failure. An example is a crack that has been initiated in the rail, or in an axle (initiating event); ultrasonic inspection is then a maintenance activity to reveal the crack and prevent a serious incident.
3. The initiating event is a component aging failure, and preventive maintenance is carried out to reduce the likelihood of this failure. In this situation the initiating event and the first barrier in Figure 4.4 merge into one element. An example is aging failure of a light bulb. The likelihood of such a failure will, however, be reduced if the light bulb is periodically replaced with a new one before the aging effect becomes dominant.

“Other barriers” in Figure 4.4 can prevent the component failure from developing into a critical TOP event. “Track circuit detection” may be a barrier against rail breakage, because the track circuit can detect a broken rail. Typical examples of TOP events in railway application are:

• Train derailment
• Collision train-train
• Collision train-object
• Fire
• Persons injured or killed in or at the track
• Persons injured or killed at level crossings
• Passengers injured or killed at platforms

Several consequence-reducing barriers may also be available. Guide rails may, e.g., be installed to mitigate the consequences in case of derailment. In Figure 4.4 we have indicated that the outcome of the TOP event may be one out of six (end) consequence classes:

C1: Minor injury
C2: Medical treatment
C3: Permanent injury
C4: 1 fatality
C5: 2–10 fatalities
C6: >10 fatalities

Note that the consequence reducing barriers and the end consequences are not analyzed explicitly during the FMECA, but treated as generic for each TOP event. In the railway situation this means only six analyses of the safety consequences related to human injuries/fatalities. In the following, a list of fields (columns) for the FMECA worksheets is proposed. The structure of the FMECA is hierarchical, but the information is usually presented in a tabular worksheet. The starting point in the FMECA is the functional failures from the FFA in Step 3. Each maintainable item is analyzed with respect to any impact on the various functional failures. In the following we describe the various columns:

• Failure mode (equipment class level). The first column in the FMECA worksheet is the failure mode at the equipment class level identified in the FFA in Step 3.
• Maintenance significant item (MSI). The relevant MSIs were identified in the FFA.
• MSI function. For each MSI, the functions of the MSI related to the current equipment class failure mode are identified.
• Failure mode (MSI level). For the MSI functions we also identify the failure modes at the MSI level.
• Detection method. The detection method column describes how the MSI failure mode may be detected, e.g., by visual inspection, condition monitoring, or by the central train control system (for railway applications).
• Hidden or evident. Specify whether the MSI function is hidden or evident.
• Demand rate for hidden function, fD. For MSI functions that are hidden, the rate of demand of this function should be specified.
• Failure cause. For each failure mode there are one or more failure causes. A failure mode will typically be caused by one or more component failures at a lower level. Note that supporting equipment to the component is considered for the first time at this step. In this context a failure cause may therefore be a failure mode of supporting equipment.
• Failure mechanism. For each failure cause, there are one or several failure mechanisms. Examples of failure mechanisms are fatigue, corrosion, and wear. To simplify the analysis, the columns for failure cause and failure mechanism are often merged into one column.
• Mean time to failure (MTTF). The MTTF when no maintenance is performed should be specified. The MTTF is specified for one component if it is a “point” object, and for a standardized distance if it is a “line” object such as rails, sleepers, and so on.
• TOP event safety. The TOP event in this context is the accidental event that might be the result of the failure mode. The TOP event is chosen from a predefined list established in the generic analysis.
• Barrier against TOP event safety. This field is used to list barriers that are designed to prevent a failure mode from resulting in the safety TOP event. For example, brands on the signalling pole would help the locomotive driver to recognize the signal in case of a dark lamp.
• PTE-S. This field is used to assess the probability that the other barriers against the TOP event all fail; see Figure 4.4. PTE-S should account for all the barriers listed under “Barrier against TOP event safety”.
• TOP event availability/punctuality. Also for this dimension a predefined list of TOP events may be established in the generic analysis.
• Barrier against TOP event availability/punctuality. This field is used to list barriers that are designed to prevent a failure mode from resulting in an availability/punctuality TOP event. Since the fail safe principle is fundamental in railway operation, there are usually no barriers against the punctuality TOP event when a component fails. An example of a barrier is a two out of three voting system on some critical components within the system.
• PTE-P. This field is used to assess the probability that the other barriers against an availability/punctuality TOP event all fail. PTE-P should account for all the barriers listed under “Barrier against TOP event availability/punctuality”. Due to the fail safe principle, PTE-P will often be equal to one.
• Other consequences. Other consequences may also be listed. Some of these are non-quantitative like noise effects, passenger comfort, and aesthetics. Material damage to rolling stock or components in the infrastructure may also be listed. Material damage may be categorized in terms of monetary value, but this is not pursued here.
• Mean downtime (MDT). The MDT is the time from when a failure occurs until the failure has been corrected and any traffic restrictions have been removed.
• Criticality indexes. Based on already entered information, different criticality indexes can be calculated. These indexes are used to screen out nonsignificant MSIs.
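To make the worksheet fields and the criticality screening more concrete, the Python sketch below stores a few of the fields in a record and computes one possible criticality index. The text does not prescribe a formula for the index; the one used here (failure rate times PTE-S, i.e., expected safety TOP events per year) is our assumption, as are the field names, the constant failure rate 1/MTTF, and the screening threshold.

```python
from dataclasses import dataclass


@dataclass
class FmecaRow:
    """A subset of the worksheet fields; the field names are illustrative."""
    msi: str
    failure_mode: str
    failure_cause: str
    mttf_years: float        # MTTF when no maintenance is performed
    top_event_safety: str
    pte_s: float             # probability that all other safety barriers fail
    mdt_hours: float = 0.0   # mean downtime


def safety_top_event_rate(row: FmecaRow) -> float:
    """Hypothetical criticality index: expected safety TOP events per year.

    Assumes a constant failure rate of 1/MTTF and that PTE-S already
    accounts for all listed barriers.
    """
    return row.pte_s / row.mttf_years


def screen_rows(rows: list[FmecaRow], threshold_per_year: float) -> list[FmecaRow]:
    """Keep the rows whose index reaches the (assumed) screening threshold."""
    return [r for r in rows if safety_top_event_rate(r) >= threshold_per_year]
```

A row with an MTTF of 2 years and a PTE-S of 3 × 10⁻⁴ would, under these assumptions, contribute 1.5 × 10⁻⁴ safety TOP events per year; whether that is significant depends on the threshold chosen for the application.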

If a failure mode is considered significant with respect to safety or availability/punctuality (or other dimensions) a preventive maintenance task should be assigned. In order to do such an assignment, further information has to be specified. This additional information will be completed during Steps 7 and 8. The following fields are recommended:

• Failure progression. For each failure cause the failure progression should be described in terms of one of the following categories: (i) gradual observable failure progression, (ii) non-observable and fast observable failure progression (PF model), (iii) non-observable failure progression but with aging effects, and (iv) shock type failures.
• Gradual failure information. If there is a gradual failure progression, information about what values of the measurable quantity represent a fault state should be recorded. Further information about the expected time and standard deviation to reach this state should also be recorded.
• PF-interval information. In case of observable failure progression the PF model is often applied (e.g., see Rausand and Høyland 2004, p. 394). The PF concept assumes that a potential failure (P) can be observed some time before the failure (F) occurs. This time interval is denoted the PF interval. We need information both on the expected value and the standard deviation of the PF interval.
• Aging parameter. For non-observable failure progression aging effects should be described. Relevant categories are strong, moderate or low aging effects. The aging parameter can alternatively be described by a numeric value, i.e., the shape parameter α in the Weibull distribution.
• Maintenance task. The maintenance task is determined by the RCM logic discussed in Step 7.
• Maintenance interval. Often we start by describing the existing maintenance interval, but after the formalized process of interval optimization in Step 8 we enter the optimized interval.
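The role of the aging parameter can be illustrated with the Weibull failure rate function. The sketch below uses the shape/scale parameterization z(t) = (α/θ)(t/θ)^(α−1); the scale parameter θ and the function names are our choices for this illustration, not taken from the worksheet.

```python
def weibull_failure_rate(t: float, alpha: float, theta: float) -> float:
    """Weibull failure rate z(t) = (alpha / theta) * (t / theta) ** (alpha - 1).

    alpha is the shape (aging) parameter and theta the scale parameter,
    expressed in the same time unit as t.
    """
    if t <= 0 or alpha <= 0 or theta <= 0:
        raise ValueError("t, alpha and theta must be positive")
    return (alpha / theta) * (t / theta) ** (alpha - 1)


def shows_aging(alpha: float) -> bool:
    """The failure rate increases with age only when alpha exceeds 1."""
    return alpha > 1
```

With α = 1 the failure rate is constant (no aging), so replacing the item earlier does not reduce the likelihood of failure; with α = 2 the failure rate grows linearly with age.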

An example of an FMECA worksheet is shown in Table 4.1 for a departure light signal.

4.2.7 Step 7: Selection of Maintenance Actions

This step is the most novel compared to other maintenance planning techniques. A decision logic is used to guide the analyst through a question-and-answer process. The input to the RCM decision logic is the dominant failure modes from the FMECA in Step 6. The main idea is, for each dominant failure mode, to decide whether a preventive maintenance task is suitable, or whether it is best to let the item deliberately run to failure and afterwards carry out a corrective maintenance task. There are generally three reasons for doing a preventive maintenance task:

• Prevent a failure
• Detect the onset of a failure
• Reveal a hidden failure

Only the dominant failure modes are subjected to preventive maintenance. To obtain appropriate maintenance tasks, the failure causes or failure mechanisms should be considered.

Table 4.1. Example of part of an FMECA worksheet

System function: Ensure correct departure light signal
Functional failure: No signal

MSI | Function | Failure mode | Failure cause | TOP event | Safety barriers | PTE-S | TOP event
Lamp | Give light | No light | Burnt-out filament | Train – Train | Directional block, ATP, TCC, “Black=red” | 3 × 10⁻⁴ | Manual train operation
Lens | Protect lamp | Broken lens | Rock fall | Train – Train | Directional block, ATP, TCC, “Black=red” | 2 × 10⁻⁵ | None
Lens | Slip through light | No light slipping through | Fouling | Train – Train | Directional block, ATP, TCC, “Black=red” | 2 × 10⁻⁴ | None

The failure mechanisms behind each of the dominant failure modes should be entered into the RCM decision logic to decide which of the following basic maintenance tasks is most applicable:

1. Continuous on-condition task (CCT)
2. Scheduled on-condition task (SCT)
3. Scheduled overhaul (SOH)
4. Scheduled replacement (SRP)
5. Scheduled function test (SFT)
6. Run to failure (RTF)

Continuous on-condition task (CCT) is a continuous monitoring of an item to find any potential failures. An on-condition task is applicable only if it is possible to detect reduced failure resistance for a specific failure mode from the measurement of some quantity.

Scheduled on-condition task (SCT) is a scheduled inspection of an item at regular intervals to find any potential failures. There are three criteria that must be met for an on-condition task to be applicable:

1. It must be possible to detect reduced failure resistance for a specific failure mode.
2. It must be possible to define a potential failure condition that can be detected by an explicit task.
3. There must be a reasonably consistent age interval between the time of potential failure and the time of failure.

There are two disadvantages of a scheduled vs. a continuous on-condition task:

• The man-hour cost of inspection is often larger than the cost of installing a sensor.
• Since the scheduled inspection is carried out at fixed points of time, one might “miss” situations where the degradation is faster than anticipated.

An advantage of a scheduled on-condition task is that the human operator is then able to “sense” information that a sensor will not be able to detect. This means that traditional “walk around checks” should not be totally skipped even if sensors are installed.

Scheduled overhaul (SOH) is a scheduled overhaul of an item at or before some specified age limit, and is often called “hard time maintenance”. An overhaul task can be considered applicable to an item only if the following criteria are met:

1. There must be an identifiable age at which the item shows a rapid increase in the item’s failure rate function.
2. A large proportion of the units must survive to that age.
3. It must be possible to restore the original failure resistance of the item by reworking it.

Scheduled replacement (SRP) is scheduled discard of an item (or one of its parts) at or before some specified age limit. A scheduled replacement task is applicable only under the following circumstances:

1. The item must be subject to a critical failure.
2. Test data must show that no failures are expected to occur below the specified life limit.
3. The item must be subject to a failure that has major economic (but not safety) consequences.
4. There must be an identifiable age at which the item shows a rapid increase in the failure rate function.
5. A large proportion of the units must survive to that age.

Scheduled function test (SFT) is a scheduled inspection of a hidden function to identify any failure. A scheduled function test task is applicable to an item under the following conditions:

1. The item must be subject to a functional failure that is not evident to the operating crew during the performance of normal duties.
2. The item must be one for which no other type of task is applicable and effective.

Run to failure (RTF) is a deliberate decision to run to failure because the other tasks are not possible or the economics are less favourable.

[Figure 4.5 is a decision flowchart:
Does a failure alerting measurable indicator exist?
  Yes: Is continuous monitoring feasible?
    Yes: Continuous on-condition task (CCT)
    No: Scheduled on-condition task (SCT)
  No: Is aging parameter α>1?
    Yes: Is overhaul feasible?
      Yes: Scheduled overhaul (SOH)
      No: Scheduled replacement (SRP)
    No: Is the function hidden?
      Yes: Scheduled function test (SFT)
      No: No PM activity found (RTF)]

Figure 4.5. Maintenance task assignment/decision logic

In many situations a maintenance task may prevent several failure mechanisms. Hence in some situations it is better to enter failure modes rather than failure mechanisms into the RCM decision logic. Note also that if a failure cause for a dominant failure mode corresponds to supporting equipment, the supporting equipment should be defined as the “item” to be entered into the RCM decision logic. The RCM decision logic is shown in Figure 4.5. Note that this logic is much simpler than that found in most RCM standards and guidelines. It should be emphasized that such logic can never cover all situations. For example, in the situation of a hidden function with aging failures, a combination of scheduled replacements and function tests is required.
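The decision logic of Figure 4.5 can be written as a short function. The sketch below follows our reading of the flowchart (question order and branch targets); the function and argument names are our own, and, as noted above, such logic cannot cover every situation.

```python
def select_task(indicator_exists: bool,
                continuous_monitoring_feasible: bool,
                alpha: float,
                overhaul_feasible: bool,
                function_hidden: bool) -> str:
    """Return the task code suggested by the simplified decision logic.

    alpha is the aging (Weibull shape) parameter from the FMECA worksheet.
    """
    # A measurable, failure-alerting indicator leads to an on-condition task.
    if indicator_exists:
        return "CCT" if continuous_monitoring_feasible else "SCT"
    # Without an indicator, aging (alpha > 1) leads to overhaul or replacement.
    if alpha > 1:
        return "SOH" if overhaul_feasible else "SRP"
    # No indicator and no aging: test hidden functions, otherwise run to failure.
    if function_hidden:
        return "SFT"
    return "RTF"
```

For a burnt-out filament with no measurable indicator, an aging parameter above 1 and no feasible overhaul, the function returns "SRP", i.e., scheduled replacement of the lamp.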

4.2.8 Step 8: Determination of Maintenance Intervals

Usually, formalized methods for optimization of maintenance intervals are not a part of the RCM analysis. In order to optimize maintenance intervals we need to structure the analysis in such a way that it fits into the maintenance optimization models that exist. See Section 4.4 for a discussion of determination of maintenance intervals using optimization models.

4.2.9 Step 9: Preventive Maintenance-Comparison Analysis

Two overriding criteria for selecting maintenance tasks are used in RCM. Each task selected must meet two requirements:

• It must be applicable
• It must be effective

Applicability: The task must be applicable in relation to our reliability knowledge and in relation to the consequences of failure. If a task is found based on the preceding analysis, it should satisfy the applicability criterion. A PM task is applicable if it can eliminate a failure, or at least reduce the probability of occurrence to an acceptable level (Hoch 1990), or reduce the impact of failures.

Cost-effectiveness: The task must not cost more than the failure(s) it is going to prevent. The PM task’s effectiveness is a measure of how well it accomplishes that purpose and whether it is worth doing. Clearly, when evaluating the effectiveness of a task, we are balancing the cost of performing the maintenance against the cost of not performing it. In this context, we may refer to the cost as follows (Hoch 1990). The cost of a PM task may include:

• The risk of maintenance personnel error, e.g., “maintenance introduced failures”
• The risk of increasing the effect of a failure of another component while one is out of service
• The use and cost of physical resources
• The unavailability of physical resources elsewhere while in use on this task
• Production unavailability during maintenance
• Unavailability of protective functions during maintenance of these
• “The more maintenance you do the more risk you expose your maintenance personnel to”

On the other hand, the cost of a failure may include:

• The consequences of the failure should it occur (e.g., loss of production, possible violation of laws or regulations, reduction in plant or personnel safety, or damage to other equipment)
• The consequences of not performing the PM task even if a failure does not occur (e.g., loss of warranty)
• Increased costs for emergency
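The cost-effectiveness criterion can be expressed as a simple comparison. The Python sketch below is a deliberately crude illustration: it assumes that all the cost elements listed above have been aggregated into a single cost per PM task and a single cost per failure, and that the failure frequencies with and without the task are known. The function and parameter names are hypothetical.

```python
def pm_is_cost_effective(pm_cost_per_task: float,
                         tasks_per_year: float,
                         failure_cost: float,
                         failures_per_year_without_pm: float,
                         failures_per_year_with_pm: float) -> bool:
    """Check that the PM task does not cost more than the failures it prevents.

    All costs are per year in the same currency; the inputs are assumed
    to already include indirect costs such as downtime and risk.
    """
    annual_pm_cost = pm_cost_per_task * tasks_per_year
    prevented_failures = failures_per_year_without_pm - failures_per_year_with_pm
    annual_avoided_cost = failure_cost * prevented_failures
    return annual_pm_cost <= annual_avoided_cost
```

For example, a task costing 100 per execution and carried out four times a year is cost-effective under these assumptions if it reduces the frequency of a failure costing 10 000 from 0.10 to 0.02 per year, but not if it has to be carried out monthly.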

4.2.10 Step 10: Treatment of Non-MSIs

In Step 4 critical items (MSIs) were selected for further analysis. A remaining question is what to do with the items that are not analyzed. For plants already having a maintenance program it is reasonable to continue this program for the non-MSIs. If a maintenance program is not in effect, maintenance should be carried out according to vendor specifications if they exist, else no maintenance should be performed. See Paglia et al. (1991) for further discussion.


4.2.11 Step 11: Implementation

A necessary basis for implementing the result of the RCM analysis is that the organizational and technical maintenance support functions are available. A major issue is therefore to ensure the availability of the maintenance support functions. The maintenance actions are typically grouped into maintenance packages, each package describing what to do, and when to do it. Many accidents are related to maintenance work. When implementing a maintenance program it is therefore of vital importance to consider the risk associated with the execution of the maintenance work. Checklists may be used to identify potential risk involved with maintenance work:

• Can maintenance people be injured during the maintenance work?
• Is a work permit required for execution of the maintenance work?
• Are means taken to avoid problems related to re-routing, by-passing, etc.?
• Can failures be introduced during maintenance work?

Task analysis, e.g., see Kirwan and Ainsworth (1992), may be used to reveal the risk involved with each maintenance job. See Hoch (1990) for further discussion on implementing the RCM analysis results.

4.2.12 Step 12: In-service Data Collection and Updating

The reliability data we have access to at the outset of the analysis may be scarce or almost non-existent. In our opinion, one of the most significant advantages of RCM is that we systematically analyze and document the basis for our initial decisions and, hence, can better utilize operating experience to adjust those decisions as operating experience data is collected. The full benefit of RCM is therefore only achieved when operation and maintenance experience is fed back into the analysis process. The updating process should be concentrated on three major time perspectives:

1. Short term interval adjustments
2. Medium term task evaluation
3. Long term revision of the initial strategy

For each significant failure that occurs in the system, the failure characteristics should be compared with the FMECA. If the failure was not covered adequately in the FMECA, the relevant part of the RCM analysis should, if necessary, be revised. The short-term update can be considered as a revision of previous analysis results. The input to such an analysis is updated reliability figures either due to more data, or updated data because of reliability trends. This analysis should not require excessive resources, since the framework for the analysis is already established. Only Steps 5 and 8 in the RCM process will be affected by short-term updates. The medium term update will also review the basis for the selection of maintenance actions in Step 7. Analysis of maintenance experience may identify significant failure causes not considered in the initial analysis, requiring an updated FMECA in Step 6.


The long-term revision will consider all steps in the analysis. It is not sufficient to consider only the system being analyzed; it is required to consider the entire plant with its relations to the outside world, e.g., contractual considerations, new laws regulating environmental protection, and so on.
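For the short-term update, the simplest way to obtain updated reliability figures is to pool the new operating experience with the data used in the initial analysis. The sketch below does this for the MTTF under the assumption of a constant failure rate (total time in service divided by the number of failures); this estimator is our illustration, not a method prescribed in this chapter, and it ignores reliability trends.

```python
def updated_mttf(prior_time: float, prior_failures: int,
                 new_time: float, new_failures: int) -> float:
    """Pooled MTTF estimate: total time in service / total number of failures.

    Assumes a constant failure rate; times are accumulated operating
    times over all units, in the same unit for both periods.
    """
    total_failures = prior_failures + new_failures
    if total_failures == 0:
        raise ValueError("at least one failure is needed for this estimate")
    return (prior_time + new_time) / total_failures
```

The updated MTTF would then replace the Step 5 value and the interval optimization in Step 8 would be re-run with it.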

4.3 Generic and Local RCM Analyses

An RCM analysis should be conducted for physical units in a stated operational context. Assume that we are planning to carry out an RCM analysis of a specific railway point at location X on line Y. (A railway point is a railway "switch" that allows a train to go from one track to another; it is called a "turnout" in American English.) For this railway point we identify all functions, failure modes, and so on. We then propose a set of maintenance tasks, and finally we choose the maintenance intervals based on reliability parameters for the railway point, punctuality parameters, and personnel risk. Now, there might be several hundred similar railway points, with slightly varying reliability performance and risk profiles that would require different maintenance intervals. To avoid repeating the entire RCM analysis for all these railway points, we propose to conduct a generic RCM analysis, and then make local adjustments with regard to reliability and risk parameters. The following steps are then required:

1. Conduct a generic RCM analysis for selected components. In this analysis we use generic (average) values of reliability and risk parameters (regarding punctuality and personnel risk).

2. Establish a generic RCM database. The results from generic RCM analyses of selected equipment types are stored in a generic RCM database. In a first phase we may restrict ourselves to consider broad classes of typical railway points. In a later phase, we may want to refine our analysis to cover specific types and brands of railway points (with different failure modes).

3. Select local analysis objects. In the local analysis we work with a subset of the railway system. This can, for example, be a specific railway point, railway points in the main track of a specific line, and so on.

4. Find an appropriate generic RCM template. For a local analysis object, we now recall the corresponding generic RCM analysis from the RCM database. We first verify that the generic RCM analysis object (template) is appropriate in terms of qualitative properties, with respect to functions, failure modes, and so on. At this point it might be necessary to add more functions, failure modes, etc. In this case, we add the "new" RCM object to the generic RCM database in order to make the generic RCM database more comprehensive.

5. Adjust parameters. At the local level we identify differences from the parameters used in the generic RCM database. A specific line may, for example, have very old railway points that may cause the MTTF to be smaller than the average MTTF. In this step of the procedure we have to consider all parameters that are involved in the optimization model (see Section 4.4).

6. Re-run the optimization procedure. Based on the new "local" parameters we next re-run the optimization procedure to adjust maintenance intervals, taking local differences into account. To carry out this process we need a computerized tool to streamline the work.

7. Document the results. The results from the local analysis are stored in a local RCM database. This is a database where only the adjustment factors are documented; for example, for railway points A, B, C, and D on line Y the MTTF is 30 % higher than the average. Hence the maintenance interval is also adjusted accordingly.
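The relationship between the generic RCM database and the local adjustment factors in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `GenericTemplate` and `LocalAdjustment`, the function `local_parameters`, the generic MTTF of 20 and the failure mode label are hypothetical and are not taken from OptiRCM or from the chapter.

```python
from dataclasses import dataclass, field


@dataclass
class GenericTemplate:
    """Generic (average) parameters for one class of equipment."""
    equipment_class: str
    mttf: float                      # generic mean time to failure
    failure_modes: list = field(default_factory=list)


@dataclass
class LocalAdjustment:
    """Local RCM record: only deviations from the template are stored."""
    objects: list                    # local analysis objects
    mttf_factor: float = 1.0         # 1.3 means MTTF is 30 % above average


def local_parameters(template, adjustment):
    """Apply a local adjustment factor to a generic template."""
    return {
        "equipment_class": template.equipment_class,
        "objects": list(adjustment.objects),
        "mttf": template.mttf * adjustment.mttf_factor,
        "failure_modes": list(template.failure_modes),
    }


# Hypothetical generic database with a single template (values are made up).
generic_db = {
    "railway point": GenericTemplate(
        equipment_class="railway point",
        mttf=20.0,
        failure_modes=["does not change position"],
    )
}

# Railway points A-D on line Y: MTTF 30 % higher than the generic average.
line_y = LocalAdjustment(objects=["A", "B", "C", "D"], mttf_factor=1.3)
params = local_parameters(generic_db["railway point"], line_y)
print(params["mttf"])
```

The adjusted parameters would then be passed to whatever interval optimization is in use (Step 6); storing only the factor keeps the local database small and lets a change to the generic template propagate to every local object.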

4.4 Modelling and Optimizing Maintenance Intervals

A wide range of general models and methods for maintenance optimization have been proposed, e.g., see Rausand and Høyland (2004), Pierskalla and Voelker (1979), Valdez-Flores and Feldman (1989), Cho and Parlar (1991), Gertsbakh (2000) and Wang (2002). A large number of models and methods for specific applications have also been developed, e.g., see Vatn and Svee (2002), Chang (2005), Castanier and Rausand (2006), and Welte et al. (2006). In this section we present the basic elements required to optimize the maintenance interval, $\tau$, and propose a standard procedure for setting up the cost function, $C(\tau)$.

A computerized tool called OptiRCM has been developed by the authors to support the RCM procedure presented in this chapter. OptiRCM is currently being used by the Norwegian National Railway (NSB). The Norwegian National Rail Administration (JBV) has also adopted the same procedure. OptiRCM imports the FMECA results generated by Steps 6 and 7 of the RCM analysis process. Cost information is usually not available in the FMECA; hence information about preventive and corrective maintenance costs must be provided separately. A screen presenting the information on the MSI level is shown in Figure 4.6.

OptiRCM uses a procedure with three steps to optimize maintenance intervals: (i) the component performance is established (left-hand part of Figure 4.6), (ii) the system model is established (centre part of Figure 4.6), and (iii) the total cost is calculated (right-hand part of Figure 4.6).

4.4.1 Component Model

The aim of the component model is to establish the effective failure rate with respect to a specific failure mode, $\lambda_E(\tau)$, as a function of the maintenance interval $\tau$. The effective failure rate is the unconditional expected number of failures per time unit for a given maintenance level. Typically, the effective failure rate is an increasing function of $\tau$. A large number of models for determining the effective failure rate as a function of the maintenance strategies, the degradation models, and so on, have been proposed in the literature.


Figure 4.6. OptiRCM input and analysis screen

The interpretation of the effective failure rate is not straightforward for hidden functions. For such functions we also need to specify the rate at which the hidden function is demanded. In this situation we may approximate the effective failure rate by the product of the demand rate and the probability of failure on demand (PFD) for the hidden function. In the following we indicate models that may be used for modelling the effective failure rate, and we refer to the literature for details. The aim of OptiRCM has been to:

• Cover the standard situations, with respect both to evident/hidden failures and to the type of failure progression.
• Provide formulae that do not require too many reliability parameters to be specified.
• Limit the number of probabilistic models used as a basis for the optimization.
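The approximation for hidden functions described before the list, effective failure rate ≈ demand rate × PFD, can be illustrated with a minimal sketch. The chapter does not say how the PFD is obtained; the sketch assumes the common first-order approximation PFD ≈ λτ/2 for a single item that is function-tested every τ time units, which holds only when λτ is small. The function names and the numerical values are hypothetical.

```python
def pfd_periodic_test(failure_rate, test_interval):
    """Average probability of failure on demand for one periodically tested item.

    Assumption: first-order approximation lambda * tau / 2 (small lambda * tau).
    """
    return failure_rate * test_interval / 2.0


def effective_failure_rate_hidden(demand_rate, failure_rate, test_interval):
    """Effective failure rate of a hidden function: demand rate times PFD."""
    return demand_rate * pfd_periodic_test(failure_rate, test_interval)


# Made-up inputs: 0.5 demands and 0.02 hidden failures per time unit,
# with a function test once per time unit.
rate = effective_failure_rate_hidden(demand_rate=0.5,
                                     failure_rate=0.02,
                                     test_interval=1.0)
print(rate)
```

Under this assumption the effective failure rate grows linearly with the test interval, which is consistent with the statement that it is typically an increasing function of τ.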

Only the Weibull distribution is used to model aging failures in OptiRCM. There may, of course, be situations where another distribution would be more realistic, but our experience is that the user of such a tool rarely has data or insight that would allow him or her to do better than applying the Weibull model.

4.4.1.1 Effective Failure Rate in the Situation of Aging

A standard block replacement policy is considered, where an aging component is periodically replaced after intervals of length $\tau$. Upon a failure within an interval, the component is replaced without affecting the next planned replacement. The effective failure rate, i.e., the average number of failures per time unit, is then given by $\lambda_E(\tau) = W(\tau)/\tau$, where $W(\tau)$ is the renewal function (e.g., see Rausand and Høyland 2004). Approximation formulas for the effective failure rate exist if we assume Weibull distributed failure times (e.g., see Chang et al. 2006). OptiRCM uses the renewal equation to establish an iterative scheme for the effective failure rate based on an initial approximation.

4.4.1.2 Effective Failure Rate in the Situation of Gradual Observable Failure Progression

The assumption behind this situation is that the failure progression, say $Y(t)$, can be observed as a function of time. In the simplest situation $Y(t)$ is one-dimensional, whereas in more complex situations $Y(t)$ may be multidimensional. We may also have situations where $Y(t)$ denotes some kind of signal where, for example, the fast Fourier transform of the signal is available. In OptiRCM a very simple situation is considered, where $Y(t)$ is monotonically increasing. As $Y(t)$ increases, the probability of failure also increases, and at a predefined level (maintenance limit), say $l$, the component is replaced or overhauled. The effective failure rate, $\lambda_E(\tau, l)$, is now a function of both the inspection interval and the maintenance limit. In OptiRCM a Markov chain model is used to model the failure progression (e.g., see Welte et al. 2006 for details of the Markov chain modelling, and also an extension where it is possible to reduce the inspection intervals as we approach the maintenance limit). In the Markov chain model it is easy to treat the situation where $Y(t)$ is a nonlinear function of time. If we restrict ourselves to linear failure progression, continuous models such as the Wiener and gamma processes may also be used.

4.4.1.3 Effective Failure Rate in the PF Model

The assumption behind the PF model is that failure progression is not observable for a rather long time, and then at some point of time we have a rather fast failure progression. This is the typical situation for cracks (potential failures) that can be initiated after a large number of load cycles. The cracks may develop rather fast, and it is important to detect the cracks before they develop into breakages.
The time from when a crack becomes observable until a failure (breakage) occurs is denoted the PF-interval. The important reliability parameters are the rate of potential failures, the mean and standard deviation of the PF-interval, and the coverage of the inspection method. The model implemented in OptiRCM for the PF situation is described in Vatn and Svee (2002). See also Castanier and Rausand (2006) for a similar approach, and the more general application of delay time models (Christer and Waller 1984).

4.4.2 System Model

Figure 4.4 shows a simplified model of the risk picture related to the component failure being analyzed. In order to quantify the risk related to safety, we need the following input data:

• The effective failure rate, $\lambda_E(\tau)$.
• The probability that the other barriers against the TOP event with respect to safety all fail, $P_{\text{TE-S}}$.
• The probability, $P_{C_j}$, that the TOP event results in consequence class $C_j$, for each of the consequence classes $j$.

Table 4.2. PLL and cost contribution for each consequence class

Consequence               PLL_j = PLL contribution   SC_j = Cost (Euro)
C1: Minor injury          0.01                       2,000
C2: Medical treatment     0.05                       30,000
C3: Permanent injury      0.1                        300,000
C4: 1 fatality            0.7                        1,600,000
C5: 2–10 fatalities       4.5                        13,000,000
C6: >10 fatalities        30                         160,000,000

The frequency of the consequence class $C_j$ is given by

$$F_j = \lambda_E(\tau) \cdot P_{\text{TE-S}} \cdot P_{C_j} \tag{4.1}$$

where $P_{C_j}$ is the probability that the TOP event results in consequence class $C_j$. We will later indicate how we can model Equation 4.1 as a function of the maintenance interval, $\tau$. In some situations we also assign a cost, and/or a PLL (potential loss of life) contribution, to the various consequence classes. PLL denotes the annual, statistically expected number of fatalities in a specified population. Proposed values adopted by the Norwegian National Rail Administration are given in Table 4.2. See the discussion by Vatn (1998) regarding what it means to assign monetary values to safety. The total PLL contribution related to the component failure being analyzed is then

$$\mathrm{PLL} = P_{\text{TE-S}} \cdot \sum_{j=1}^{6} \left(P_{C_j} \cdot \mathrm{PLL}_j\right) \cdot \lambda_E(\tau) \tag{4.2}$$

and the total cost contribution related to the component is

$$C_S = P_{\text{TE-S}} \cdot \sum_{j=1}^{6} \left(P_{C_j} \cdot SC_j\right) \cdot \lambda_E(\tau) \tag{4.3}$$

where $SC_j$ is the safety cost of consequence class $C_j$. Note that in the FMECA analysis we can have an automatic procedure that calculates both the PLL contribution and the safety cost contribution based on the reliability parameters and the type of TOP event.

In the same way as we have done for safety consequences, we proceed with punctuality or unavailability costs. Here we simplify, and assume that there exists a fixed (expected) cost for each TOP event for punctuality, say PC(TOP). The punctuality cost per time unit is then

$$C_P = P_{\text{TE-P}} \cdot \mathrm{PC}(\mathrm{TOP}) \cdot \lambda_E(\tau) \tag{4.4}$$

This procedure may, if required, be repeated for other dimensions like environment, material damage, and so on.

4.4.3 Total Cost and Interval Optimization

The approach to interval optimization is based on minimizing the total cost related to safety, punctuality, availability, material damage, etc. Within an ALARP regime (e.g., see Vatn 1998) this requires that the risk is not unacceptable. Assuming that risk is acceptable, we proceed by calculating the total cost per time unit:

$$C(\tau) = C_S(\tau) + C_P(\tau) + C_{PM}(\tau) + C_{CM}(\tau) \tag{4.5}$$

where $C_S(\tau)$ and $C_P(\tau)$ are given by Equations 4.3 and 4.4, respectively. Further,

$$C_{PM}(\tau) = \frac{\text{PM Cost}}{\tau} \tag{4.6}$$

where PM Cost is the cost per preventive maintenance activity. Note that for condition-based tasks we distinguish between the cost of monitoring the item, and the cost of physically improving the item by some restoration or renewal activity. This complicates Equation 4.6 slightly because we have to calculate the average number of renewals. Further, if CM Cost is the cost of a corrective maintenance activity, we have

$$C_{CM}(\tau) = \text{CM Cost} \cdot \lambda_E(\tau) \tag{4.7}$$
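The safety and punctuality terms of Equations 4.1 to 4.4, which enter the total cost, can be evaluated directly from the tabulated values. The sketch below does this for the TOP event "Fire", using the PLL and cost figures of Table 4.2 and the Fire row of Table 4.3. The function names are hypothetical, and the effective failure rate is set to 1 per time unit purely so that the printed values are the coefficients multiplying $\lambda_E(\tau)$; $P_{\text{TE-S}} = 0.0005$ is the value used in the pump example of this section.

```python
# Table 4.2: PLL contribution and safety cost (Euro) for classes C1..C6.
PLL_J = [0.01, 0.05, 0.1, 0.7, 4.5, 30.0]
SC_J = [2_000, 30_000, 300_000, 1_600_000, 13_000_000, 160_000_000]

# Table 4.3, row "Fire": probability of each consequence class C1..C6.
PC_FIRE = [0.1, 0.2, 0.2, 0.1, 0.02, 0.005]


def weighted_sum(probabilities, values):
    """Sum over j of P_Cj times the value attached to consequence class j."""
    return sum(p * v for p, v in zip(probabilities, values))


def consequence_frequencies(lambda_e, p_te_s, pc):
    """Equation 4.1: frequency F_j of each consequence class."""
    return [lambda_e * p_te_s * p for p in pc]


def pll_contribution(lambda_e, p_te_s, pc):
    """Equation 4.2: total PLL contribution."""
    return p_te_s * weighted_sum(pc, PLL_J) * lambda_e


def safety_cost(lambda_e, p_te_s, pc):
    """Equation 4.3: safety cost per time unit."""
    return p_te_s * weighted_sum(pc, SC_J) * lambda_e


def punctuality_cost(lambda_e, p_te_p, pc_top):
    """Equation 4.4: punctuality cost per time unit."""
    return p_te_p * pc_top * lambda_e


lambda_e = 1.0      # placeholder rate, so results are coefficients of lambda_E
p_te_s = 0.0005     # probability that the remaining safety barriers fail

print(weighted_sum(PC_FIRE, SC_J))
print(safety_cost(lambda_e, p_te_s, PC_FIRE))
print(pll_contribution(lambda_e, p_te_s, PC_FIRE))
```

Keeping the weighted sums separate from $\lambda_E(\tau)$ mirrors the structure of the equations: the sums depend only on the type of TOP event, so they can be computed once and reused for every candidate maintenance interval.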

Table 4.3. Generic probabilities, P_Cj, of consequence class C_j for the different TOP events

TOP event                                      PC1    PC2    PC3    PC4    PC5    PC6
Derailment                                     0.1    0.1    0.1    0.1    0.05   0.01
Collision train-train                          0.02   0.03   0.05   0.5    0.3    0.1
Collision train-object                         0.1    0.2    0.3    0.15   0.01   0.001
Fire                                           0.1    0.2    0.2    0.1    0.02   0.005
Passengers injured or killed at platforms      0.3    0.3    0.2    0.05   0.01   0.001
Persons injured or killed at level crossings   0.1    0.2    0.3    0.3    0.09   0.01
Persons injured or killed in or at the track   0.2    0.2    0.2    0.3    0.1    0.0001

To find the optimum maintenance interval we can now calculate $C(\tau)$ in Equation 4.5 for various values of the maintenance interval, $\tau$, and then choose the $\tau$ value that minimizes $C(\tau)$.

As a numerical example we consider a pump used for oil cooling of the main high voltage transformer in a locomotive. The relevant figures in the example are assessed by experts in the Norwegian National Railway. Upon failure of the oil pump, the TOP event for punctuality will most likely be a FULL STOP, with a probability $P_{\text{TE-P}} = 0.75$ for this punctuality consequence. It is considered that a full stop gives an average delay of 15 min, and the cost of 1 min delay is set to 150 Euros. The potential TOP event for safety is a FIRE, but the likelihood is very small, i.e., $P_{\text{TE-S}} = 0.0005$. The reliability parameters of the pump are an aging parameter $\alpha = 3.5$ and a mean time to failure without any preventive maintenance of MTTF = 10 million km. To calculate the safety cost we find $\sum_j (P_{C_j} \cdot SC_j) = 1.286$ million Euros by combining Tables 4.2 and 4.3. Equation 4.3 thus reads $C_S(\tau) = 643 \cdot \lambda_E(\tau)$. With a cost per full stop of $\mathrm{PC}(\mathrm{TOP}) = 15 \cdot 150 = 2250$ Euros, the punctuality cost in Equation 4.4 is similarly given by $C_P(\tau) = 0.75 \cdot 2250 \cdot \lambda_E(\tau)$. For PM and CM cost we have PM Cost = 3100 Euros and CM Cost = 4400 Euros, respectively. Chang et al. (2006) argue that a good approximation for the effective failure rate is

$$\lambda_E(\tau) = \left(\frac{\Gamma(1 + 1/\alpha)}{\mathrm{MTTF}}\right)^{\alpha} \tau^{\alpha - 1} \left[1 - \frac{0.1\,\alpha\,\tau^2}{\mathrm{MTTF}^2} + \frac{(0.09\,\alpha - 0.2)\,\tau}{\mathrm{MTTF}}\right] \tag{4.8}$$

The total cost $C(\tau)$ in Equation 4.5 can now be found as a function of $\tau$; see Figure 4.7 for a graphical illustration. The optimum interval is found to be 7.5 million km. The maintenance action is scheduled replacement of the pump; see Figure 4.5.
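The search for the cost-minimizing interval in the pump example can be reproduced with a simple grid search. The sketch below is not the OptiRCM implementation: it evaluates Equation 4.8 and Equations 4.3 to 4.7 with the figures quoted above, takes the punctuality coefficient as $P_{\text{TE-P}} \cdot \mathrm{PC}(\mathrm{TOP})$ following Equation 4.4, and uses a hypothetical grid from 1 to 10 million km in steps of 0.1. All function and constant names are illustrative.

```python
import math

ALPHA = 3.5                          # Weibull aging parameter
MTTF = 10.0                          # million km, without preventive maintenance
PM_COST = 3100.0                     # Euro per preventive replacement
CM_COST = 4400.0                     # Euro per corrective replacement
SAFETY_COEFF = 0.0005 * 1.286e6      # Equation 4.3 coefficient (about 643)
PUNCTUALITY_COEFF = 0.75 * 15 * 150  # Equation 4.4: P_TE-P * PC(TOP)


def effective_failure_rate(tau, alpha=ALPHA, mttf=MTTF):
    """Equation 4.8: approximate effective failure rate per million km."""
    base = (math.gamma(1.0 + 1.0 / alpha) / mttf) ** alpha * tau ** (alpha - 1.0)
    correction = (1.0
                  - 0.1 * alpha * tau ** 2 / mttf ** 2
                  + (0.09 * alpha - 0.2) * tau / mttf)
    return base * correction


def total_cost(tau):
    """Equation 4.5: total cost per million km for replacement interval tau."""
    lam = effective_failure_rate(tau)
    failure_cost = (SAFETY_COEFF + PUNCTUALITY_COEFF + CM_COST) * lam
    return PM_COST / tau + failure_cost


def optimise(lo=1.0, hi=10.0, step=0.1):
    """Grid search for the interval with the lowest total cost."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return min(grid, key=total_cost)


best_tau = optimise()
print(best_tau, total_cost(best_tau))
```

Under these assumptions the minimum lies close to the interval quoted in the text. The cost curve is very flat around its minimum, so moderate changes in the inputs or in the grid resolution shift the computed optimum by a few tenths of a million km while changing the total cost only marginally.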

Figure 4.7. Cost elements as a function of the maintenance interval
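Equation 4.8 is an approximation to $\lambda_E(\tau) = W(\tau)/\tau$ from Section 4.4.1.1, so it can be cross-checked against a direct numerical solution of the renewal equation $W(t) = F(t) + \int_0^t W(t - x)\,dF(x)$. The sketch below uses a simple first-order discretization with Weibull distributed failure times. It is an illustrative scheme chosen for brevity, not the iterative scheme used in OptiRCM, and the function names and step count are assumptions.

```python
import math


def weibull_cdf(t, alpha, mttf):
    """Weibull distribution function with shape alpha and the given mean."""
    scale = mttf / math.gamma(1.0 + 1.0 / alpha)
    return 1.0 - math.exp(-((t / scale) ** alpha))


def renewal_function(tau, alpha, mttf, steps=400):
    """W(tau) from a first-order discretization of the renewal equation."""
    h = tau / steps
    F = [weibull_cdf(k * h, alpha, mttf) for k in range(steps + 1)]
    W = [0.0] * (steps + 1)
    for k in range(1, steps + 1):
        # Convolution term: sum of W(t_k - t_i) * (F(t_i) - F(t_{i-1})).
        conv = sum(W[k - i] * (F[i] - F[i - 1]) for i in range(1, k + 1))
        W[k] = F[k] + conv
    return W[steps]


def effective_failure_rate_numeric(tau, alpha, mttf):
    """Block replacement: lambda_E(tau) = W(tau) / tau."""
    return renewal_function(tau, alpha, mttf) / tau


def effective_failure_rate_eq48(tau, alpha, mttf):
    """Equation 4.8 approximation."""
    base = (math.gamma(1.0 + 1.0 / alpha) / mttf) ** alpha * tau ** (alpha - 1.0)
    correction = (1.0
                  - 0.1 * alpha * tau ** 2 / mttf ** 2
                  + (0.09 * alpha - 0.2) * tau / mttf)
    return base * correction


for tau in (2.5, 5.0, 7.5):
    numeric = effective_failure_rate_numeric(tau, 3.5, 10.0)
    approx = effective_failure_rate_eq48(tau, 3.5, 10.0)
    print(tau, numeric, approx)
```

For the pump parameters ($\alpha = 3.5$, MTTF = 10 million km) the two values should agree closely over the intervals of interest, since multiple failures within one interval are rare when $\tau$ is below the MTTF.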


4.5 Conclusions

The main parts of the RCM approach that we have described in this chapter are compatible with common practice and with most of the RCM standards. We are, however, using a more complex FMECA where we also record data that are necessary during maintenance interval optimization. The novel parts of our approach are related to the use of so-called generic RCM analysis and to maintenance interval optimization. The use of generic RCM analysis will significantly reduce the workload of a complete RCM analysis.

Maintenance optimization is, generally, a very complex task, and only a brief introduction is presented in this chapter. For maintenance personnel to be able to use the proposed methods, they need to have access to simple computerized tools where the mathematically complex methods are hidden. This was our objective in developing the OptiRCM tool. Maintenance optimization modules are, more or less, non-existent in the standard RCM tools. OptiRCM is not a replacement for these tools, but rather a supplement. OptiRCM is still in the development stage, and we are currently trying to implement several new features into OptiRCM. Among these are additional methods related to maintenance strategies, and grouping of maintenance tasks.

4.6 References

ABS, (2003) Guide for Survey Based on Reliability-Centered Maintenance. American Bureau of Shipping, Houston.
ABS, (2004) Guidance Notes on Reliability-Centered Maintenance. American Bureau of Shipping, Houston.
Blanchard BS, Fabrycky WJ, (1998) Systems Engineering and Analysis, 3rd ed. Prentice Hall, Englewood Cliffs, NJ.
Blanche KM, Shrivastava AB, (1994) Defining failure of manufacturing machinery and equipment. Proceedings from the Annual Reliability and Maintainability Symposium, pp. 69–75.
Castanier B, Rausand M, (2006) Maintenance optimization for subsea oil pipelines. Pressure Vessels and Piping 83:236–243.
Chang KP, (2005) Reliability-centered maintenance for LNG ships. ROSS report 200506, NTNU, Trondheim, Norway.
Chang KP, Rausand M, Vatn J, (2006) Reliability Assessment of Reliquefaction Systems on LNG Carriers. Submitted for publication in Reliability Engineering and System Safety.
Cho DI, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23.
Christer AH, Waller WM, (1984) Delay time models of industrial inspection maintenance problems. Journal of the Operational Research Society 35:401–406.
DEF-STD 02-45 (NES 45), (2000) Requirements for the application of reliability-centred maintenance technique to HM ships, submarines, Royal fleet auxiliaries and other naval auxiliary vessels. Defence Standard, U.K. Ministry of Defence, Bath, England.
Gertsbakh I, (2000) Reliability Theory with Applications to Preventive Maintenance. Springer, New York.
Hoch R, (1990) A practical application of reliability centered maintenance. The American Society of Mechanical Engineers, 90-JPGC/Pwr-51, Joint ASME/IEEE Power Generation Conference, Boston, MA, 21–25 October.
IEC60300-3-11, (1999) Dependability management – Application guide – Reliability centered maintenance. International Electrotechnical Commission, Geneva.
IEC61508, (1997) Functional safety of electrical/electronic/programmable electronic safety-related systems, Part 1–7. International Electrotechnical Commission, Geneva.
Kirwan B, Ainsworth LK, (1992) A Guide to Task Analysis. Taylor and Francis, London.
MIL-STD-2173 (AS), (1986) Reliability-Centered Maintenance. Requirements for Naval Aircraft, Weapon Systems and Support Equipment. U.S. Department of Defense, Washington, DC.
Moubray J, (1997) Reliability-centered Maintenance II, 2nd ed. Industrial Press, New York.
NASA, (2000) Reliability Centered Maintenance Guide for Facilities and Collateral Equipment. NASA Office of Safety and Mission Assurance, Washington DC.
NAVAIR 00-25-403, (2005) Guidelines for the naval aviation reliability-centered maintenance process. Naval Air Systems Command, U.S.A.
OREDA, (2002) Offshore Reliability Data, 4th ed. OREDA Participants. Available from Det Norske Veritas, NO-1322 Høvik, Norway.
Paglia A, Barnard D, Sonnett D, (1991) A case study of the RCM project at V.C. Summer Nuclear Generating Station. 4th International Power Generation Exhibition and Conference, Tampa, Florida, USA. 5:1003–1013.
Pierskalla WP, Voelker JA, (1979) A survey of maintenance models: The control and surveillance of deteriorating systems. Naval Research Logistics Quarterly 23:353–388.
Rausand M, Høyland A, (2004) System Reliability Theory; Models, Statistical Methods, and Applications. Wiley, New York.
SAE JA1012, (2002) A Guide to the Reliability-Centered Maintenance (RCM) Standard. The Engineering Society for Advancing Mobility Land, Sea, Air, and Space, Warrendale, PA.
Smith AM, (1993) Reliability-Centered Maintenance. McGraw-Hill, New York.
Smith DJ, (2005) Reliability, Maintainability and Risk; Practical Methods for Engineers including Reliability Centred Maintenance and Safety-Related Systems. Elsevier, Butterworth Heinemann, Amsterdam.
USACERL TR 99/41, (1999) Reliability centered maintenance (RCM) guide. Operating a more effective maintenance program. U.S. Army Corps of Engineers.
Valdez-Flores C, Feldman RM, (1989) A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics 36:419–446.
Vatn J, Hokstad P, Bodsberg L, (1996) An overall model for maintenance optimization. Reliability Engineering and System Safety 51:241–257.
Vatn J, (1998) A discussion of the acceptable risk problem. Reliability Engineering and System Safety 61:11–19.
Vatn J, Svee H, (2002) A risk based approach to determine ultrasonic inspection frequencies in railway applications. ESReDA Conference, Madrid, 27–28 May.
Wang H, (2002) A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139:469–489.
Welte T, Vatn J, Heggset J, (2006) Markov state model for optimization of maintenance and renewal of hydro power components. 9th International Conference on Probabilistic Methods Applied to Power Systems, KTH, Stockholm, 11–15 June 2006.

Part C

Methods and Techniques

5 Condition-based Maintenance Modelling

Wenbin Wang

5.1 Introduction

The use of condition monitoring techniques in industry to direct maintenance actions has increased rapidly over recent years, to the extent that it has marked the beginning of what is likely to prove a new generation in production and maintenance management practice. There are both economic and technological reasons for this development, driven by tight profit margins, high outage costs and an increase in plant complexity and automation. Technical advances in condition monitoring techniques have provided a means to achieve high availability and to reduce scheduled and unscheduled production shutdowns. In all cases, the measured condition information, in addition to potentially improving decision making, has a value-added role for a manager in that there is now a more objective means of explaining actions if challenged.

In November 1979, the consultants Michael Neal & Associate Ltd published 'A Guide to Condition Monitoring of Machinery' for the UK Department of Trade and Industry (Neal et al. 1979). This groundbreaking report illustrated the difference in maintenance strategies (e.g., breakdown, planned, etc.) and suggested that condition based maintenance, using a range of techniques, would offer significant benefits to industry. By the late 1990s condition based maintenance had become widely accepted as one of the drivers to reduce maintenance costs and increase plant availability.

With the advent of e-procurement, business to business (B2B), customer to business (C2B), business to customer (B2C) etc., industry is fast moving towards enterprise-wide information systems associated with the internet. Today, plant asset management is the integration of computerised maintenance management systems and condition monitoring in order to fulfil the business objectives. This enables significant production benefits through objective maintenance prediction and scheduling, and positions the manufacturer to remain competitive in a dynamic market.
Today there exists a large and growing variety of condition monitoring techniques for machine condition monitoring and fault diagnosis. A particularly popular one for rotating and reciprocating machinery is vibration analysis. However, irrespective of the particular condition monitoring technique used, the working principle of condition monitoring is the same, namely condition data become available which need to be interpreted and appropriate actions taken accordingly.

There are generally two stages in condition based maintenance. The first stage is related to condition monitoring data acquisition and their technical interpretation. There have been numerous papers contributing to this stage, as evidenced by the proceedings of COMADEM over recent years. This stage is characterised by engineering skill, knowledge and experience. Much effort at this stage has gone into determining the appropriate variables to monitor (Chen et al. 1994), the design of systems for condition monitoring data acquisition (Drake et al. 1995), signal processing (Wong et al. 2006; Samanta et al. 2006; Harrison 1995; Li and Li 1995), and how to implement computerised condition monitoring (Meher-Homji et al. 1994). These are just a few examples, and none of them explicitly models the maintenance decision process based upon the results of condition monitoring. For detailed technical aspects of condition monitoring and fault diagnosis, see Collacott (1997).

The second stage is maintenance decision making, namely what to do now given that condition information data and their interpretations are available. The decision at this stage can be complicated and entails consideration of cost, downtime, production demand, preventive maintenance shutdown windows, and most importantly, the likely survival time of the item monitored. Compared with the extensive literature on condition monitoring techniques and their applications, relatively little attention has been paid to the important problem of modelling appropriate decision making in condition based maintenance.
This chapter focuses on the second stage of condition monitoring, namely condition based maintenance modelling as an aid to effective decision making. In particular, we will highlight a modelling technique used recently in condition based maintenance, namely residual life modelling via stochastic filtering (Wang and Christer 2000). This is a key element in modelling the decision making aspect of condition based maintenance.

The chapter is organised as follows. Section 5.2 gives a brief introduction to condition monitoring techniques. Section 5.3 focuses on condition based maintenance modelling and discusses the various modelling techniques used. Section 5.4 presents the modelling of the residual life conditional on observed monitoring information using stochastic filtering. Section 5.5 concludes the chapter with a discussion of topics for future research.

5.2 Condition Monitoring Techniques

For many years condition monitoring has been defined as "The assessment on a continuous or periodic basis of the mechanical and electrical condition of machinery, equipment and systems from the observation and/or recordings of selected measurement parameters" (Collacott 1997). One of the obvious analogies is the temperature measurement of a human body, where the observation is the temperature and the system is the human body. Just as doctors strongly recommend periodic checks of key health parameters such as blood pressure, pulse, weight and/or temperature for an early indication of potential health problems, for industrial equipment some measurements can be taken and the likely condition of the plant assessed.

Today there exists a large and growing variety of condition monitoring techniques for machine condition monitoring and fault diagnosis. Understanding the nature of each monitoring technique and the type of information measured will certainly help us when establishing a decision model. Here we briefly introduce five main techniques, of which vibration and oil analysis are the two most popular.

5.2.1 Vibration Based Monitoring

Vibration based monitoring is the mainstream of current applications of condition monitoring in industry. It is an on-line (or off-line) technique used to detect system malfunction based on measured vibration signals. Generally speaking, vibration is the variation with time of the magnitude of a quantity that is descriptive of the motion or position of a mechanical system, when the magnitude is alternately greater than and smaller than some average value or reference. Vibration monitoring consists essentially of identifying two quantities:

• The magnitude (overall level) of the vibration
• The frequency content (and/or time waveform)

The magnitude is basically used for establishing the severity of the vibration, and the frequency content for establishing its cause or origin. Vibration velocity has been seen as the most meaningful magnitude criterion for assessing machine condition, though displacement or acceleration is also used. The magnitude of vibration is usually measured in root mean square (rms). If $T$ denotes the period of vibration and $V(t)$ is the vibration (say, velocity) measured at time $t$, then

$$V_{rms} = \sqrt{\frac{1}{T} \int_0^T (V(t))^2 \, dt},$$

the square of which is proportional to the energy of vibration (Reeves 1998). However, since vibration signals from machines are, in general, periodic in nature, a great deal of information is contained in their frequency spectrum. The frequency spectrum is usually obtained digitally using a digital analyser or computer via a mathematical algorithm known as the "fast Fourier transform" (FFT). The spectrum analysis of vibration signals is commonly used in the fault diagnosis of rotating machines. Potentially, all machines can benefit from vibration monitoring except, perhaps, those running at very low speed (below about 20 rev/min), and those where isolation (or damping) occurs between the source and the sensor.

From observed vibration signals we often see a typical two-stage process, where the signals may stay flat over the normal operation period and then display some increasing trend when a defect has initiated (Wang 2002). Another factor coming into play when establishing a vibration based maintenance model is the causal relationship between the measured signals and the state of the plant. It is the defect which causes the abnormal signals, and not vice versa (Wang 2002). This factor plays an important role when selecting an appropriate model for describing such a relationship.

5.2.2 Oil Based Monitoring

A detailed analysis of a sample of engine, transmission and hydraulic oils is a valuable preventive maintenance tool for machines. In many cases it enables the identification of potential problems before a major repair is necessary, has the potential to reduce the frequency of oil changes, and can increase the resale value of used equipment. Oil based monitoring involves sampling and analyzing oil for various properties and materials to monitor wear and contamination in an engine, transmission or hydraulic system etc. Sampling and analyzing on a regular basis establishes a baseline of normal wear and can help indicate when abnormal wear or contamination is occurring.

Oil analysis works as follows. Oil that has been inside any moving mechanical apparatus for a period of time reflects the possible condition of that assembly. Oil is in contact with engine or mechanical components, and metallic wear particles enter the oil as these components wear. These particles are so small that they remain in suspension. Many products of the combustion process will also become trapped in the circulating oil. The oil becomes a working history of the machine. Particles caused by normal wear and operation will mix with the oil. Any externally caused contamination also enters the oil. By identifying and measuring these impurities, one can get an indication of the rate of wear and of any excessive contamination. An oil analysis will also suggest methods to reduce accelerated wear and contamination. The typical oil analysis tests for the presence of a number of different materials to determine sources of wear, find dirt and other contamination, and even check for the use of appropriate lubricants.
Today there exists a variety of oil based condition monitoring methods and techniques to check the volume and nature of foreign particles in oil for equipment health monitoring. These include spectrometric oil analysis, scanning electron microscopy/energy dispersive X-ray analysis, energy dispersive X-ray fluorescence, low powered optical microscopy, and ferrous debris quantification. One purpose of the oil analysis is to provide a means of predicting possible impending failure without dismantling the equipment. One can "look inside" an engine, transmission or hydraulic system without taking it apart.

For oil based monitoring there is no such clear-cut distinction between normal and abnormal operation based on observed particle information in the oil samples. The foreign particles that accumulate in the lubricant oil increase monotonically, so that we may not be able to see a two-stage failure process as seen in vibration based monitoring. The causal relationship between the measured amount of particles in the oil and the state of the plant may also be bilateral in that, for example, the wear may cause the increase of observed metals in the oil, but the metals and other contaminants in the oil may also accelerate the wear. This marks a difference when modelling the state of the plant in oil based monitoring compared to vibration based monitoring.
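Returning to the two vibration quantities of Section 5.2.1, the rms magnitude and the frequency content, both can be computed from a sampled signal in a few lines. The sketch below uses only the Python standard library and a naive discrete Fourier transform rather than a true FFT, which is adequate for a short illustrative record. The test signal (a 25 Hz component of amplitude 3 plus a 60 Hz component of amplitude 1, sampled at 256 Hz for one second) and all function names are hypothetical.

```python
import cmath
import math


def rms(samples):
    """Root mean square of a sampled signal (discrete form of V_rms)."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def amplitude_spectrum(samples):
    """Single-sided amplitude spectrum via a naive O(n^2) DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(samples))
        # The DC bin and, for even n, the Nyquist bin are not doubled.
        single = k == 0 or (n % 2 == 0 and k == n // 2)
        spectrum.append(abs(coeff) * (1.0 if single else 2.0) / n)
    return spectrum


SAMPLE_RATE = 256                    # samples per second (assumed)
N = 256                              # one second of data
signal = [3.0 * math.sin(2 * math.pi * 25 * i / SAMPLE_RATE)
          + 1.0 * math.sin(2 * math.pi * 60 * i / SAMPLE_RATE)
          for i in range(N)]

level = rms(signal)
spectrum = amplitude_spectrum(signal)
peak_bin = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
dominant_hz = peak_bin * SAMPLE_RATE / N
print(level, dominant_hz)
```

The rms value summarizes severity in a single number, whereas the spectrum separates the contributions at individual frequencies, which is what allows a diagnosis of the cause. In practice an FFT routine would replace the naive transform for long records.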

Condition-based Maintenance Modelling


5.2.3 Other Monitoring Techniques

The other popular condition monitoring techniques are infrared thermography, acoustics and motor current analysis. The basis of infrared thermography is quite simple. All objects emit heat or infrared electromagnetic energy, but only a very small proportion of this energy is visible to the naked eye. At low temperatures, an infrared camera must be used in order to ‘see’ the heat being emitted. The camera detects the invisible thermal energy and converts it to a visible image on a screen. The image can then be analyzed to identify any abnormality. The acoustic emission (AE) based method is widely used for monitoring the condition of rotating machinery. Compared to traditional vibration based methods, the high frequency approach of AE has the advantage of a significant improvement in signal to noise ratio. It can also be used for non-rotating machinery where defect activities do not generate distinct repetition frequencies and hence FFT analysis cannot be used. A point to note is that AE transducers need to have a relatively narrow band to be able to detect high frequency faults. Motor current signature analysis, which monitors the operating characteristics of an electric motor-operated device such as a motor-operated valve, has frequently been used for early detection of rotor related faults in AC induction motors. Frequency domain signal analysis techniques are applied to a conditioned motor current signal to identify distinctly various operating parameters of the motor driven device from the motor current signature. The signature may be recorded and compared with subsequent signatures to detect operating abnormalities and degradation of the device.
This diagnostic method does not require special equipment to be installed on the motor-operated device, and the current sensing may be performed at remote control locations, e.g., where the motor-operated devices are used in inaccessible or hostile environments. All the techniques briefly introduced above can offer some help for indicating the current state or condition of the plant monitored. Based on the technical analysis of the observed condition monitoring data, a maintenance decision has to be made to maintain the plant in a cost effective way. We discuss in the next section, how modeling can be used to support such a decision making utilizing available monitoring information.

5.3 Condition Based Maintenance Modelling

There is a basic but not always clearly answered question in condition monitoring: what is the purpose of condition monitoring? Have we lost sight of the ultimate need? Condition monitoring is not an end in itself; it involves an expenditure entered into by managers in the belief that it will save them money. How is this saving achieved? It can be obtained by using monitored condition information to optimise maintenance to achieve minimum breakdown of the plant with maximum availability for production, and to ensure that maintenance is only carried out when necessary. This is what one calls condition based maintenance which contrasts with the traditional breakdown or time based maintenance policies where maintenance


W. Wang

is only carried out when it becomes necessary utilizing available condition information. But in reality, all too often we see effort and money spent on monitoring equipment for faults which rarely occur, and we also see planned maintenance being carried out when the equipment is perfectly healthy though the monitored information indicates something is “wrong”. A study of oil based condition monitoring of gear boxes of locomotives used by Canadian Pacific Railway (Aghjagan 1989) indicated that, since condition monitoring was commissioned (entailing 3–4 samples per locomotive per week, 52 weeks per year), the incidence of gear box failures while in use fell by 90 %. This is a significant achievement. However, when subsequently stripped down for reconditioning/overhaul, there was nothing evidently wrong in 50 % of cases. Clearly, condition monitoring can be highly effective, but may also be very inefficient at the same time. Modelling is necessary to improve the cost effectiveness and efficiency of condition monitoring.

5.3.1 The Decision Model

This is an extension to the age-based replacement model in that the replacement decision depends not only upon the age, but also upon the monitored information, plus other cost or downtime parameters. If we take the cost model as an example, then the decision model amounts to minimising the long run expected cost per unit time. We use the following notation:

$c_f$: the mean cost per failure
$c_p$: the mean cost per preventive replacement
$c_m$: the mean cost per condition monitoring
$t_i$: the $i$th and the current monitoring point
$Y_i$: monitored information at $t_i$, with observed value $y_i$
$\Im_i$: history of observed condition variables to $t_i$, $\Im_i = \{y_1, \ldots, y_i\}$
$X_i$: the residual life at time $t_i$
$p_i(x_i \mid \Im_i)$: pdf of $X_i$ conditional on $\Im_i$

The long term expected cost per unit time, $C(t)$, given that a preventive replacement is scheduled at time $t > t_i$, is given by (Wang 2003)

$$C(t) = \frac{(c_f - c_p)\, P(t - t_i \mid \Im_i) + c_p + i c_m}{t_i + (t - t_i)\bigl(1 - P(t - t_i \mid \Im_i)\bigr) + \int_0^{t - t_i} x_i \, p_i(x_i \mid \Im_i)\, dx_i} \qquad (5.1)$$

where $P(t - t_i \mid \Im_i) = P(X_i < t - t_i \mid \Im_i) = \int_0^{t - t_i} p_i(x_i \mid \Im_i)\, dx_i$, which is the probability of a failure before $t$ conditional on $\Im_i$. The right hand side of Equation 5.1 is the expected cost per unit time formulated as a renewal reward function, though the lifetimes are independent but not identical.


The time point $t$ is usually bounded within the time period from the current to the next monitoring since a new decision shall be made once a new monitoring reading becomes available at time $t_{i+1}$. In general, if a minimum of $C(t)$ is found within the interval to the next monitoring in terms of $t$, then this $t$ should be the optimal replacement time. If no minimum is found, then the recommendation would be to continue to use the plant and evaluate Equation 5.1 at the next monitoring point when new information becomes available. For a graphical illustration of the above principle see Figure 5.1.

Figure 5.1. A graph to show the optimal replacement time (plot of $C(t)$ against $t$ between the current time $t_i$ and the next monitoring time $t_{i+1}$; one curve has its minimum at the optimal replacement time $t^*$, the other is labelled “No replacement is recommended”)
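As a rough illustration of how Equation 5.1 and the decision rule might be evaluated numerically, the sketch below integrates a residual life density with the trapezoidal rule and scans candidate replacement times up to the next monitoring point. The function names, the grid sizes and the use of a plain Weibull density as a stand-in for $p_i(x_i \mid \Im_i)$ are assumptions made for this example, as is the convention that a cost still falling at the end of the interval counts as "no minimum found".

```python
import math


def weibull_pdf(x, alpha, beta):
    """Weibull density alpha*beta*(alpha*x)**(beta-1)*exp(-(alpha*x)**beta)."""
    if x <= 0.0:
        return 0.0
    return alpha * beta * (alpha * x) ** (beta - 1) * math.exp(-((alpha * x) ** beta))


def trapezoid(f, a, b, n=400):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n)))


def expected_cost_rate(t, t_i, i, pdf, c_f, c_p, c_m):
    """Equation 5.1: expected cost per unit time for a replacement planned at t >= t_i."""
    horizon = t - t_i
    p_fail = trapezoid(pdf, 0.0, horizon)            # P(X_i < t - t_i | history)
    partial_mean = trapezoid(lambda x: x * pdf(x), 0.0, horizon)
    numerator = (c_f - c_p) * p_fail + c_p + i * c_m
    denominator = t_i + horizon * (1.0 - p_fail) + partial_mean
    return numerator / denominator


def recommend_replacement(t_i, t_next, i, pdf, c_f, c_p, c_m, steps=200):
    """Return t* if C(t) has a minimum before t_next, else None (keep running)."""
    h = (t_next - t_i) / steps
    candidates = [t_i + k * h for k in range(1, steps + 1)]
    costs = [expected_cost_rate(t, t_i, i, pdf, c_f, c_p, c_m) for t in candidates]
    k_best = min(range(steps), key=costs.__getitem__)
    # Assumption: a cost still falling at the last grid point means "no minimum found".
    return None if k_best == steps - 1 else candidates[k_best]
```

In use, `pdf` would be whatever approximation of $p_i(x_i \mid \Im_i)$ is available at the current monitoring point; `recommend_replacement` returning `None` corresponds to continuing operation and re-evaluating at $t_{i+1}$.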

Obviously the key element in Equation 5.1 is the determination of pi ( xi | ℑi ) , which is the topic of the next two sections. 5.3.2 Modelling pi ( xi | ℑi ) Before we proceed to the discussion of the modelling of pi ( xi | ℑi ) , there are few issues that need clarification. The first relates to the concept of direct and indirect monitoring (Christer and Wang 1995). In direct monitoring, the actual condition of the item, say the depth of a brake pad, can be observed, and a critical level, say C , can be set up. While in the indirect monitoring case we can only collect measurements related to the actual condition of the item monitored in a stochastic manner. For example, in the vibration monitoring case, if a high vibration signal is observed we may suspect the item’s condition might be bad, but we may neither know the exact condition of it, nor its quantification. For direct monitored systems, Markov models are popular; see Black et al. (2005), Chen and Trivedi (2005), and Love (2000). Counting processes have also been used for modeling the deterioration of directly monitored plant; see Aven (1996) and Jensen (1992). Christer and Wang (1992) used a random coefficient model for a direct monitored case. It is noted however that the majority of condition monitoring applications are indirect monitoring such as the


five popular monitoring techniques discussed earlier. It is therefore in this chapter that our attention is paid to indirect monitoring cases. The second issue is the appropriate definition of the plant state. This also relates to the first issue whether the monitoring is direct or indirect. In direct monitoring, the actual observed condition of the item is clearly the plant state. While in the indirect monitoring case we can only observe measures indirectly related to the actual condition of the item monitored as discussed earlier. The most simple and intuitive definition is a set of categorical states ranging, say from 0 (new) to N (failed) as seen from Markov based models (Baruah and Chinnam 2005). Wang (2006a) also used a generic term of wear to represent the state of the monitored plant, which is particularly useful in modelling wear related problems in condition monitoring. Wang and Christer (2000) first used the residual life at the time of checking as a measure of the state of the monitored unit of interest. This definition provides an immediate modeling means to establish directly a link between the measured information and the residual life of interest. It is noted however, that this residual life is usually not observable which increases modeling complexity. A model of pi ( xi | ℑi ) introduced later will be based on this definition. Various different methods or models have been proposed in the literature to formulate and calculate pi ( xi | ℑi ) . Proportional Hazard Modeling (PHM, one particular and natural form for modelling the hazard) is a popular one; Kumar and Westberg (1997), Love and Guo 1991, Makis and Jardine (1991), Jardine et al. (1998), Banjevic et al. (2001). Accelerated life models (Kalbfleisch and Prentice 1980; Wang and Zhang 2005) could also be used here, and may be more appropriate since the analogy between accelerated life testing, where these models originate, and condition monitoring is a close one. 
It should be noted that accelerated life models and proportional hazard models are identical when the time to failure distribution is Weibull, that is, when the hazard function is given by $h(t) = \alpha \beta t^{\beta - 1}$.
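The equivalence can be checked numerically. The sketch below assumes the usual log-linear forms, which the chapter does not spell out: a proportional hazards model multiplies the baseline hazard by $e^{\gamma z}$ for a covariate $z$, while an accelerated life model rescales time by a factor $\phi$, giving hazard $\phi\, h(\phi t)$. With the Weibull hazard above, choosing $\phi = e^{\gamma z / \beta}$ makes the two coincide. The names `z`, `gamma` and `phi` are illustrative.

```python
import math


def weibull_hazard(t, alpha, beta):
    """Baseline hazard h(t) = alpha * beta * t**(beta - 1)."""
    return alpha * beta * t ** (beta - 1)


def proportional_hazard(t, z, alpha, beta, gamma):
    """PHM: the covariate z multiplies the baseline hazard by exp(gamma * z)."""
    return math.exp(gamma * z) * weibull_hazard(t, alpha, beta)


def accelerated_life_hazard(t, z, alpha, beta, gamma):
    """Accelerated life: z rescales time by phi, giving hazard phi * h(phi * t)."""
    phi = math.exp(gamma * z / beta)
    return phi * weibull_hazard(phi * t, alpha, beta)
```

The agreement follows from $\phi\,\alpha\beta(\phi t)^{\beta-1} = \phi^{\beta}\,\alpha\beta t^{\beta-1}$, so it holds for every $t$ only because the baseline hazard is a power of $t$.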

There are two problems with proportional hazards modeling or accelerated life models in condition based maintenance. The first is that the current hazard is determined partially by the current monitoring measurements and the full monitoring history is not used. The second is the assumption that the hazard or the life is a function of the observed monitoring data which acts directly on the hazard via a covariate function. Both problems relate to the modeling assumption rather than the technique. The first can be overcome if some sort of transformation of the observed data is used. The second problem remains unless the nature of monitoring indicates so. It is noted however that, for most condition monitoring techniques, the observed monitoring measurements are concomitant types of information which are a function of the underlying plant state. A typical example is in vibration monitoring where a high level of vibration is usually caused by a hidden defect but not vice versa as we have discussed earlier. In this case the observed vibration signals may be regarded as concomitant variables which are caused by the plant state. Note that in oil based monitoring things are different as the metal particles and other contaminants observed in the oil can be regarded both as concomitant


variables and covariates as we discussed earlier. In this case a model that considers both types of variable might be appropriate. The last decade has seen an increased use of stochastic filtering and Hidden Markov Models (HMM) for modelling $p_i(x_i \mid \Im_i)$ in condition based maintenance; see Hontelez et al. (1996), Christer et al. (1997), Wang and Christer (2000), Bunks et al. (2000), Dong and He (2004), Lin and Makis (2003, 2004), Baruah and Chinnam (2005), and Wang (2006a). These techniques overcome both problems of PHM and provide a flexible way to model the relationship between the observed signals and the unobserved plant state. HMM can be seen as a specific type of stochastic filtering model that is usually used for discrete state and observation variables. If the noise factors in the model are not Gaussian, then a closed form for $p_i(x_i \mid \Im_i)$ is generally not available and one has to resort to numerical approximations. A comparison study using both filtering (Wang 2002) and PHM (Makis and Jardine 1991) based on vibration data revealed that the filtering based model produced a better result in terms of prediction accuracy (Matthew and Wang 2006). It should also be noted that if the monitored variables also influence the state to some extent, then both HMM and PHM should be used to tackle the problem. Alternatively an interactive HMM can also be formulated where a bilateral relationship is assumed between the observed and the unobserved. In the next section, we shall discuss in detail a specific filtering model used for the derivation of $p_i(x_i \mid \Im_i)$. This model is simple to use and is analytically tractable.

5.4 Conditional Residual Life Prediction

First we define the true state of plant as the residual life conditional upon measured condition related information to date, such as vibration, temperature, etc. Next we assume these conditional pieces of information are functions of the residual life, that is, it is the residual life which controls the behavior of the measured conditional information, but not vice versa (this assumption can be relaxed). Generally we expect that a short residual life (depending on the severity of the defect) will generate a high signal level in some of the measures of condition variables, though in a typical stochastic fashion. In theory, we may have the following relationship:

Defect → Short residual life → Higher than normal signal may be observed.

If the severity of the defect is represented by the length of the residual life, the relationship between the residual life and observed condition related variables follows.


5.4.1 Conditional Residual Life Prediction

The model is built based on the following assumptions:

1. Plant items are monitored regularly at discrete time points.
2. There are two periods in the plant life, where the first period is the time length from new to the point when the item was first identified to be faulty, and the second period is the time interval from this point to failure if no maintenance intervention is carried out. The second period is often called the failure delay time. It is also assumed that these two periods are statistically independent of each other.
3. A threshold level is established to classify the item monitored to be in a potentially faulty state if the condition information signal is above the level. Such a threshold level is usually determined by engineering experience or by a statistical analysis of measured condition related variables.
4. The conditional information obtained at time $t_i$, $y_i$, during the failure delay time is a random variable which depends on $x_i$.

Assumptions 1 and 2 can often be observed in condition monitoring practice. Assumption 3 can be relaxed, and a model which can both identify the starting point of the second stage and predict the residual life can be established (Wang 2006b). For now, to keep the model simple, we still use assumption 3. Assumption 4 was first proposed in Wang and Christer (2000), and states that the rapid increase in the observed condition information is partly due to the shortened residual life because of the hidden defect. However this relationship is contaminated with random noise. Assumption 4 is the fundamental principle underpinning our model. For a detailed discussion on assumption 4 see Wang and Christer (2000).
Because the interest in residual life prediction is over the failure delay time (assuming it exists) and the information collected over the normal working period may not be beneficial for residual life prediction, we revise our notation on $t_i$ as the $i$th and the current monitoring time since the item was suspected to be faulty but still operating (note that the order starts from the moment when the item was first identified to be possibly faulty). This implies that $t_1$ is the first monitoring point which may indicate that the second stage has started. However, some monitoring, such as oil based monitoring, may not be able to display a two-stage process. If this is the case, we can simply set the threshold level to be zero. Figure 5.2 shows a typical condition monitoring practice. It is noted from Figure 5.2 that the conditional information obtained before $t_1$ is not used since it is irrelevant to the decision making process. It is noted however, that the time to $t_1$ is one of the important information sources to be used in determining the condition monitoring interval (Wang 2003). Since the residual life at $t_i$ is the residual life at $t_{i-1}$ minus the interval between $t_i$ and $t_{i-1}$, provided the item has survived to $t_i$ and no maintenance action has been taken, it follows that

$$X_i = \begin{cases} X_{i-1} - (t_i - t_{i-1}) & \text{if } X_{i-1} > t_i - t_{i-1} \\ \text{not defined} & \text{otherwise} \end{cases} \qquad (5.2)$$


Figure 5.2. Condition monitoring practice (plot labels: readings $y_1$, $y_2$, $y_3$ at monitoring times $t_1$, $t_2$, $t_3$; residual lives $x_1$, $x_2$, $x_3$; threshold level; failure)

The relationship between Yi and X i is yet to be identified. From assumption 4 we know that it can be described by a distribution, say, p( yi | xi ) . We will discuss this later when fitting the model to data. We wish to establish the expression of pi ( xi | ℑi ) , and therefore a consequential decision model can be constructed on the basis of such a conditional probability; see Equation 5.1. Since ℑi = { y1 , y2 ,..., yi } = { yi , ℑi −1} , then pi ( xi | ℑi ) can be expressed as pi ( xi | ℑi ) = p( xi | yi , ℑi −1 ) . It follows that

$$p_i(x_i \mid \Im_i) = p(x_i \mid y_i, \Im_{i-1}) = \frac{p(x_i, y_i \mid \Im_{i-1})}{p(y_i \mid \Im_{i-1})} \qquad (5.3)$$

By using the multiplicative rule, the joint distribution $p(x_i, y_i \mid \Im_{i-1})$ is given as

$$p(x_i, y_i \mid \Im_{i-1}) = p(y_i \mid x_i, \Im_{i-1})\, p(x_i \mid \Im_{i-1}) \qquad (5.4)$$

Since, given both $x_i$ and $\Im_{i-1}$, $y_i$ depends on $x_i$ only (assumption 4), Equation 5.4 reduces to

$$p(x_i, y_i \mid \Im_{i-1}) = p(y_i \mid x_i, \Im_{i-1})\, p(x_i \mid \Im_{i-1}) = p(y_i \mid x_i)\, p(x_i \mid \Im_{i-1}) \qquad (5.5)$$

Integrating out the $x_i$ term in Equation 5.5 we have

$$p(y_i \mid \Im_{i-1}) = \int_0^{\infty} p(x_i, y_i \mid \Im_{i-1})\, dx_i = \int_0^{\infty} p(y_i \mid x_i)\, p(x_i \mid \Im_{i-1})\, dx_i \qquad (5.6)$$


We focus our attention on $p(x_i \mid \Im_{i-1})$, which appears both in Equation 5.4 and Equation 5.6. From Equation 5.2 we have $x_{i-1} = g(x_i) = x_i + (t_i - t_{i-1})$ conditional on $X_{i-1} > t_i - t_{i-1}$. Then the distribution of $X_i \mid \Im_{i-1}$ can be expressed by a transformation of variables from $X_i$ to $X_{i-1}$ (Freund 2004) as

$$p(x_i \mid \Im_{i-1}) = p_{i-1}\bigl(g(x_i) \mid \Im_{i-1}, X_{i-1} > t_i - t_{i-1}\bigr)\, \frac{dg(x_i)}{dx_i} \qquad (5.7)$$

Since $\dfrac{dg(x_i)}{dx_i} = 1$ and

$$p_{i-1}\bigl(g(x_i) \mid \Im_{i-1}, X_{i-1} > t_i - t_{i-1}\bigr) = \frac{p_{i-1}(g(x_i) \mid \Im_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1}} \qquad (5.8)$$

we finally have

$$p(x_i \mid \Im_{i-1}) = \frac{p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1}} \qquad (5.9)$$

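Using Equations 5.5, 5.6 and 5.9, Equation 5.3 becomes

$$p_i(x_i \mid \Im_i) = \frac{p(y_i \mid x_i)\, p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})}{\int_0^{\infty} p(y_i \mid x_i)\, p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})\, dx_i} \qquad (5.10)$$

which is a recursive equation starting at time $t_1$; the normalising integral of Equation 5.9 appears in both the numerator and the denominator and cancels. At time $t_1$, using Equation 5.10 we have

$$p_1(x_1 \mid \Im_1) = \frac{p(y_1 \mid x_1)\, p_0(x_1 + t_1 - t_0 \mid \Im_0)}{\int_0^{\infty} p(y_1 \mid x_1)\, p_0(x_1 + t_1 - t_0 \mid \Im_0)\, dx_1} \qquad (5.11)$$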

Since $\Im_0$ is usually empty or not available, $p_0(x_1 + t_1 - t_0 \mid \Im_0) = p_0(x_1 + t_1 - t_0)$. Thus, if $p_0(x_0)$ and $p(y_1 \mid x_1)$ can be specified, Equation 5.11 can be determined. Similarly we can proceed to determine $p_i(x_i \mid \Im_i)$ once $p(y_i \mid x_i)$ is specified and $p_{i-1}(x_{i-1} \mid \Im_{i-1})$ is available from the previous step's calculation at time $t_{i-1}$. Now the task is how to specify $p_0(x_0)$ and $p(y_i \mid x_i)$.
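The recursion in Equation 5.10 lends itself to a simple grid approximation. The following is a minimal sketch rather than the implementation used in the cited work: the residual life axis is discretised on a uniform grid, the monitoring interval is assumed to be a whole number of grid steps, and the densities are those the chapter goes on to specify in Section 5.4.2 (a Weibull $p_0$ and a Weibull observation density with a floating scale). All function names are illustrative.

```python
import math


def weibull_pdf(x, alpha, beta):
    """Prior delay time density p_0 in the chapter's Weibull parameterisation."""
    if x <= 0.0:
        return 0.0
    return alpha * beta * (alpha * x) ** (beta - 1) * math.exp(-((alpha * x) ** beta))


def observation_pdf(y, x, A, B, c, eta):
    """Weibull density of a reading y whose scale floats with the residual life x."""
    scale = A + B * math.exp(-c * x)
    return (eta / scale) * (y / scale) ** (eta - 1) * math.exp(-((y / scale) ** eta))


def trapezoid(values, step):
    """Trapezoidal integral of equally spaced samples."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def filter_step(grid, previous, dt, y, likelihood):
    """One pass of Equation 5.10 on a uniform grid of residual life values.

    previous[k] approximates p_{i-1}(grid[k] | history); dt = t_i - t_{i-1} is
    assumed to be a whole number of grid steps (an assumption of this sketch).
    """
    step = grid[1] - grid[0]
    shift = int(round(dt / step))
    shifted = previous[shift:] + [0.0] * shift      # p_{i-1}(x + dt | history)
    unnormalised = [likelihood(y, x) * p for x, p in zip(grid, shifted)]
    norm = trapezoid(unnormalised, step)            # denominator of Equation 5.10
    return [u / norm for u in unnormalised]
```

Calling `filter_step` once per new reading, starting from `previous = [weibull_pdf(x, alpha, beta) for x in grid]`, carries the whole monitoring history forward in a single array. The grid must extend far enough that the truncated tail of the density is negligible.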


5.4.2 Specification of $p_0(x_0)$ and $p(y_i \mid x_i)$

$p_0(x_0)$ is just the delay time distribution over the second stage of the plant life. Here we use the Weibull distribution as an example in this context. In practice, the density function $p_0(x_0)$ should be chosen as the one which best fits the data, or taken from some known theory. The set-up of the $p(y_i \mid x_i)$ term requires more attention. Here we follow the one used in Wang (2002), where $y_i \mid x_i$ is assumed to follow a Weibull distribution with the scale parameter being equal to the inverse of $A + Be^{-cx_i}$. In this way we establish a negative correlation between $y_i$ and $x_i$ as expected, that is, $E(Y_i \mid X_i = x_i) \propto A + Be^{-cx_i}$. The pdf is given below:

$$p(y_i \mid x_i) = \frac{\eta}{A + Be^{-cx_i}} \left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta - 1} e^{-\left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta}} \qquad (5.12)$$

This is a concept called floating scale parameter, which is particularly useful in this case (Wang 2002). There are other choices to model the relationship between $y_i$ and $x_i$, but these will not be discussed here, and can be found in Wang (2006a).

5.4.3 Estimating the Model Parameters Within $p_i(x_i \mid \Im_i)$

To calculate the actual $p_i(x_i \mid \Im_i)$ we need to know the values for the model parameters. They are the parameters of $p_0(x_0)$ and $p(y_i \mid x_i)$. The most popular way to estimate them is the method of maximum likelihood. At each monitoring point, $t_i$, two pieces of information are available, namely, $y_i$ and $X_{i-1} > t_i - t_{i-1}$, both conditional on $\Im_{i-1}$. The pdf of $y_i \mid \Im_{i-1}$ is given by Equation 5.6 and the probability function of $X_{i-1} > t_i - t_{i-1} \mid \Im_{i-1}$ is given by

$$P(X_{i-1} > t_i - t_{i-1} \mid \Im_{i-1}) = \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1} \qquad (5.13)$$

If the item monitored failed at time $t_f$ after the last monitoring at time $t_n$, the complete likelihood function is then given by

$$L(\Theta) = \left( \prod_{i=1}^{n} p(y_i \mid \Im_{i-1}) \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1} \right) p_n(t_f - t_n \mid \Im_n) \qquad (5.14)$$

where Θ is the set of parameters to be estimated. Taking logs on both sides of Equation 5.14 and maximising it in terms of unknown parameters should give the estimated values of those parameters. However, computationally it has to be solved numerically since Equation 5.14 involves many integrals which may not have analytical solutions.
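A hedged sketch of how the log of Equation 5.14 might be evaluated for a single failed item is given below. It reuses the grid recursion of Equation 5.10 and relies on the fact that, by Equations 5.6 and 5.9, each factor $p(y_i \mid \Im_{i-1})\, P(X_{i-1} > t_i - t_{i-1} \mid \Im_{i-1})$ equals the normalising integral of that filtering step. The grid size, the requirement that all times fall on grid points, the convention $t_0 = 0$ and the function names are assumptions of this example; a full estimation would sum such terms over all items and pass the result to a numerical optimiser.

```python
import math


def weibull_pdf(x, alpha, beta):
    if x <= 0.0:
        return 0.0
    return alpha * beta * (alpha * x) ** (beta - 1) * math.exp(-((alpha * x) ** beta))


def observation_pdf(y, x, A, B, c, eta):
    scale = A + B * math.exp(-c * x)
    return (eta / scale) * (y / scale) ** (eta - 1) * math.exp(-((y / scale) ** eta))


def log_likelihood(times, readings, t_fail, theta, x_max=600.0, step=0.5):
    """Log of Equation 5.14 for one item that failed at t_fail (t_0 = 0 assumed)."""
    alpha, beta, A, B, c, eta = theta
    n = int(round(x_max / step))
    grid = [k * step for k in range(n + 1)]
    dens = [weibull_pdf(x, alpha, beta) for x in grid]        # p_0(x_0)
    log_l, t_prev = 0.0, 0.0
    for t_i, y_i in zip(times, readings):
        shift = int(round((t_i - t_prev) / step))
        shifted = dens[shift:] + [0.0] * shift                # p_{i-1}(x + t_i - t_{i-1})
        weighted = [observation_pdf(y_i, x, A, B, c, eta) * p
                    for x, p in zip(grid, shifted)]
        z = step * (sum(weighted) - 0.5 * (weighted[0] + weighted[-1]))
        log_l += math.log(z)   # log of p(y_i | history) * P(X_{i-1} > dt | history)
        dens = [w / z for w in weighted]                      # p_i(x_i | history)
        t_prev = t_i
    k_fail = int(round((t_fail - t_prev) / step))
    return log_l + math.log(dens[k_fail])                     # p_n(t_f - t_n | history)
```

For a single item the recursion telescopes, so the result should agree with $\log p_0(t_f) + \sum_k \log p(y_k \mid t_f - t_k)$, which makes a convenient check on any implementation.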


5.4.4 A Case Study Figure 5.3 shows the data of overall vibration level in rms of six bearings, which is from a fatigue experiment (Wang 2002). It can be seen from Figure 5.3 that the bearing lives vary from around 100 h to over 1000 h, which shows a typical stochastic nature of the life distribution. The monitored vibration signals also indicate an increasing trend with bearing ages in all cases, but with different paths. An important observation is the pattern of vibration signals which stays relatively flat in the early stage of the bearing life and then increases rapidly (a defect may have been initiated). This indicates the existence of the two stage failure process as defined earlier.

Figure 5.3. Vibration data of six bearings

The initial point of the second stage in these bearings is identified using a control chart called the Shewhart average level chart and the threshold levels of the bearings are shown in Table 5.1 (Zhang 2004).

Table 5.1. Threshold level for each bearing

Bearing          1     2     3     4     5     6
Threshold level  5.06  5.62  4.15  5.14  3.92  4.9
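The chapter does not give the chart calculation, which is in Zhang (2004). Purely as an illustration of the general idea, the sketch below uses a generic three-sigma upper control limit computed from readings taken while the item is believed healthy, and reports the first reading that exceeds it. The multiplier of 3, the function names and any data passed in are assumptions, not the procedure behind Table 5.1.

```python
import statistics


def upper_control_limit(baseline, k=3.0):
    """Mean plus k sample standard deviations of readings taken while healthy."""
    return statistics.mean(baseline) + k * statistics.stdev(baseline)


def first_exceedance(readings, limit):
    """Index of the first reading above the limit, or None if there is none."""
    for index, value in enumerate(readings):
        if value > limit:
            return index
    return None
```

In the notation of the model, the monitoring time of the first exceedance would play the role of $t_1$, and readings before it would be discarded for residual life prediction.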


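Assuming both distributions for $p_0(x_0)$ and $p(y_i \mid x_i)$ are Weibull, where

$$p_0(x_0) = \alpha\beta(\alpha x_0)^{\beta - 1} e^{-(\alpha x_0)^{\beta}}$$

and

$$p(y_i \mid x_i) = \frac{\eta}{A + Be^{-cx_i}} \left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta - 1} e^{-\left( \frac{y_i}{A + Be^{-cx_i}} \right)^{\eta}}$$

then starting from $t_1$ and after recursive filtering we have

$$p_i(x_i \mid \Im_i) = \frac{(x_i + t_i)^{\beta - 1} e^{-(\alpha(x_i + t_i))^{\beta}} \prod_{k=1}^{i} \psi_k(x_i, t_i)}{\int_0^{\infty} (z + t_i)^{\beta - 1} e^{-(\alpha(z + t_i))^{\beta}} \prod_{k=1}^{i} \psi_k(z, t_i)\, dz} \qquad (5.15)$$

where

$$\psi_k(z, t_i) = \frac{e^{-\left( y_k \left( A + Be^{-c(z + t_i - t_k)} \right)^{-1} \right)^{\eta}}}{\left( A + Be^{-c(z + t_i - t_k)} \right)^{\eta}}.$$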

To estimate the parameters in $p_0(x_0)$ and $p(y_i \mid x_i)$ we need to write down the likelihood function as in Equation 5.14. The actual process of estimating these unknown parameters is complicated and involves heavy numerical manipulation, which we omit; interested readers can find the details in Zhang (2004). The estimated result is listed in Table 5.2.

Table 5.2. Estimated parameter values in $p_0(x_0)$ and $p(y_i \mid x_i)$

$\hat{\alpha}$   $\hat{\beta}$   $\hat{A}$   $\hat{B}$    $\hat{c}$   $\hat{\eta}$
0.011            1.873           7.069       27.089       0.053       4.559

Based on the estimated parameter values in Table 5.2 and Equation 5.15 the predicted residual life at some monitoring points given the history information of bearing 6 in Figure 5.3 is plotted in Figure 5.4. In Figure 5.4 the actual residual lives at those checking points are also plotted with symbol *. It can be seen that actual residual lives are well within the predicted residual life distribution as expected. Given the estimated values for parameters and associated costs such as c f = 6000 , c p = 2000 and cm = 30 (Wang and Jia 2001) we have the expected cost per unit time for one of the bearings at various checking time t, shown in Figure 5.5.
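To give a feel for how Equation 5.15 behaves with the Table 5.2 estimates, the sketch below evaluates the posterior on a grid and computes its mean. The monitoring times and readings supplied to it are hypothetical, not the bearing data of Figure 5.3, and the grid limits, function names and the use of the posterior mean as a summary are choices made for this example only.

```python
import math

# Estimates from Table 5.2.
ALPHA, BETA = 0.011, 1.873
A, B, C_RATE, ETA = 7.069, 27.089, 0.053, 4.559


def psi(z, t_i, t_k, y_k):
    """Factor contributed by reading y_k taken at t_k, for residual life z at t_i."""
    scale = A + B * math.exp(-C_RATE * (z + t_i - t_k))
    return math.exp(-((y_k / scale) ** ETA)) / scale ** ETA


def unnormalised(z, times, readings):
    """Numerator of Equation 5.15 at residual life z, given readings up to times[-1]."""
    t_i = times[-1]
    value = (z + t_i) ** (BETA - 1) * math.exp(-((ALPHA * (z + t_i)) ** BETA))
    for t_k, y_k in zip(times, readings):
        value *= psi(z, t_i, t_k, y_k)
    return value


def posterior(times, readings, z_max=600.0, n=2400):
    """Grid approximation of p_i(x_i | history); z_max truncates the integral."""
    step = z_max / n
    grid = [k * step for k in range(n + 1)]
    weights = [unnormalised(z, times, readings) for z in grid]
    norm = step * (sum(weights) - 0.5 * (weights[0] + weights[-1]))
    return grid, [w / norm for w in weights]


def mean_residual_life(times, readings):
    grid, dens = posterior(times, readings)
    step = grid[1] - grid[0]
    moments = [z * d for z, d in zip(grid, dens)]
    return step * (sum(moments) - 0.5 * (moments[0] + moments[-1]))
```

With these parameter values a rising sequence of readings pulls the posterior towards short residual lives, whereas readings that stay near the lower scale bound $A$ leave most of the probability on long residual lives.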


Figure 5.4. Predicted condition residual life of bearing 6

Figure 5.5. Expected cost per unit time vs. planned replacement time in hours from the current time t (vertical axis: expected cost per unit time; horizontal axis: planned replacement time, 0 to 30 h; curves for t = 80.5, 92.5, 104, 116.5 and 129 h)

It can be seen from Figure 5.5 that at t = 116.5 and 129 h planned replacements are recommended within the next 30 h. To illustrate an alternative decision chart in terms of the actual condition monitoring reading, we transformed the cost related decision into actual readings in Figure 5.6, where the dark grey area indicates that if the reading falls within this area a preventive replacement is required within the planning period of consideration. The advantage of Figure 5.6 is that it not only tells us whether a preventive replacement is needed but also shows how far the reading is from the area of preventive replacement, so that appropriate preparation can be made before the actual replacement.


Figure 5.6. Decision chart using observed CM reading (vertical axis: observed CM reading, 0 to 14; horizontal axis: time (age in hours) at which the CM reading was taken, 80.5 to 129; regions labelled “Preventive replacement area” and “No preventive replacement area”)

The transformation is carried out in this way – at each monitoring point of ti , by gradually changing the value of yi in pi ( xi | ℑi ) used in Equation 5.1 until a preventive replacement is recommended by the model within the planning period, and then marking this value of yi as the threshold value at time ti . Connecting these threshold values at those monitoring points forms the boundary between the light and dark grey areas. Finally mark the actual reading of yi on the graph to see which area it falls in.

5.5 Future Research Directions 5.5.1 Multi-component Systems Previous condition based prognosis models developed in the literature mainly focused on a single failure mode system subject to routine monitoring and replacement such as bearings, pumps and motors, and various probability distributions are used to describe the lifetime of the component. In the case of a high value and high risk system with many components such as aircraft engines and gas turbines, how to assess the health condition and make prognosis based on condition information obtained from all components is still an open question. It is typical with a multicomponent system that many observed signal parameters are available and the times between failures are neither independent nor identical. 5.5.2 Identification of the Initial Point of a Random Defect With the delay time concept (see Chapter 14), system life is assumed to be classified into two stages. The first is the normal working stage where no abnormal condition parameters are to be expected. The second starts when a hidden defect is first initiated with possible abnormal signals. The identification of the initial point in the evolution of such a defect is important and has a direct impact on the


subsequent prediction model. Most research on fault diagnosis focuses on the location of the fault, the possible cause of the fault and, of course, the type of fault. This serves the engineering purpose of deciding what to repair, but does not aid the decision of when to do the task. This initial point defect identification has received very little attention in prognosis literature. Wang (2006b) addressed this problem to some extent using a combination of the delay time concept and the HMM. Much work still remains. It is possible that a multi-stage (>2) failure process could be used, which might be more appropriate in some cases. 5.5.3 The Definition of Plant State The definition of the underlying state and the relationship between the observed monitoring parameters and the state of the system are issues which still need attention. In the model presented in this chapter, the state of the system is defined as the residual life, which is assumed to influence the observed signal parameters. Whilst the modelling output appears to make sense, there are a few potential problems with the approach. The first is the issue that the life of the plant is fixed at birth (installation) but unknown. This is termed as playing God. Second, the residual life is not the direct cause of the observed abnormal signals. These are more likely caused by some hidden defects which are linked to the residual life in this chapter. To correct the first problem we can introduce another equation describing the relationship between X i and X i −1 deterministically or randomly. This will allow X i to change during use, which is more appropriate. If the relationship is deterministic, then a closed form of Equation 5.3 is still available, but if it is random, HMM must be used and no closed form of Equation 5.3 exists unless the noises associated are normally distributed. 
The second problem can be overcome if we adopt a discrete or continuous state hidden Markov chain to describe the system deterioration process where the state space of the chain represents the system state in question.

5.5.4 Information Fusion

There is now a considerable amount of condition monitoring and process control information available in industry, thanks to recent developments in condition monitoring technology. It is noted that not all information is useful, and because of correlation different signals may carry similar information. There are two ways to deal with this. One is to use some statistical methods to reduce the dimension of the original data such as principal component analysis, and the other is to use multi-variate distributions. The principal component analysis method has been used in Wang and Zhang (2005), but unless the first principal component accounts for most of the variation in the original data we still need to deal with a data set with more than two dimensions. The use of multi-variate distributions in prognosis has not been reported apart from the normal distribution which has the drawback of producing negative values. A final point worth mentioning is that, in practice, observed condition monitoring variables could be concomitant variables or covariates with respect to the


system state. A model which can handle both types of information is ideal, but very few attempts have been made (Hussin and Wang 2006).

5.6 Summary and Conclusions This chapter introduces the concept of condition monitoring, key condition monitoring techniques, condition based maintenance and associated modelling support in aid of condition based maintenance. Particular attention is paid to the residual time prediction based on available condition information to date. An important development made here is the establishment of the relationship between the observed information and underlying condition which is the residual life in this case. This is achieved by letting the mean of the observed information at ti be a function of the residual life at that point conditional on X i = xi . The mathematical development is based on a recursive algorithm called filtering where all past information is included. The example illustrated is based on real data which came from a fatigue experiment. However, data from industry has shown the robustness of the approach and the residual life predictions conducted so far are satisfactory.

5.7 References

Aghjagan, H.N., (1989) Lubeoil analysis expert system, Canadian Maintenance Engineering Conference, Toronto.
Aven, T., (1996) Condition based replacement policies – a counting process approach, Rel. Eng. & Sys. Safety, 51(3), 275–281.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001) A control-limit policy and software for condition based maintenance optimization, INFOR 39(1), 32–50.
Baruah, P. and Chinnam, R.B., (2005) HMM for diagnostics and prognostics in machining processes, I. J. Prod. Res., 43(6), 1275–1293.
Black, M., Brint, A.T. and Brailsford, J.R., (2005) A semi-Markov approach for modelling asset deterioration, J. Opl. Res. Soc. 56(11), 1241–1249.
Bunks, C., McCarthy, D. and Al-Ani, T., (2000) Condition based maintenance of machine using hidden Markov models, Mech. Sys. & Sig. Pro., 14(4), 597–612.
Chen, D. and Trivedi, K.S., (2005) Optimization for condition based maintenance with semi-Markov decision process, Rel. Eng. & Sys. Safety, 90(1), 25–29.
Chen, W., Meher-Homji, C.B. and Mistree, F., (1994) COMPROMISE: an effective approach for condition-based maintenance management of gas turbines. Engineering Optimization, 22, 185–201.
Christer, A.H., Wang, W. and Sharp, J.M., (1997) A state space condition monitoring model for furnace erosion prediction and replacement, Euro. J. Opl. Res., 101, 1–14.
Christer, A.H. and Wang, W., (1992) A model of condition monitoring inspection of production plant, I. J. Prod. Res., 30, 2199–2211.
Christer, A.H. and Wang, W., (1995) A simple condition monitoring model for a direct monitoring process, E. J. Opl. Res., 82, 258–269.
Collacott, R.A., (1977) Mechanical fault diagnosis and condition monitoring, Chapman and Hall Ltd., London.
Dong, M. and He, D., (2004) Hidden semi-Markov models for machinery health diagnosis and prognosis, Trans. North Amer. Manu. Res. Ins. of SME, 32, 199–206.


Drake, P.R., Jennings, A.D., Grosvenor, R.I. and Whittleton, D., (1995) acquisition system for machine tool condition monitoring. Quality and Reliability Engineering International 11, 15–26.
Freund, J.E., (2004) Mathematical statistics with applications, Pearson Prentice and Hall, London.
Harrison, N., (1995) Oil condition monitoring for the railway business. Insight 37, 278–283.
Hontelez, J.A.M., Burger, H.H. and Wijnmalen, D.J.D., (1996) Optimum condition based maintenance policies for deteriorating systems with partial information, Rel. Eng. & Sys. Safety, 51(3), 267–274.
Hussin, B. and Wang, W., (2006) Conditional residual time modelling using oil analysis: a mixed condition information using accumulated metal concentration and lubricant measurements, to appear in Proc. 1st Main. Eng. Conf, Chengdu, China.
Jardine, A.K.S., Makis, V., Banjevic, D., Braticevic, D. and Ennis, M., (1998) A decision optimization model for condition based maintenance, J. Qua. Main. Eng., 4(2), 115–121.
Jensen, U., (1992) Optimal replacement rules based on different information level, Naval Res. Log. 39, 937–955.
Kalbfleisch, J.D. and Prentice, R.L., (1980) The Statistical Analysis of Failure Time Data. Wiley, New York.
Kumar, D. and Westberg, U., (1997) Maintenance scheduling under age replacement policy using proportional hazard modelling and total-time-on-test plotting, Euro. J. Opl. Res., 99, 507–515.
Li, C.J. and Li, S.Y., (1995) Acoustic emission analysis for bearing condition monitoring. Wear 185, 67–74.
Lin, D. and Makis, V., (2003) Recursive filters for a partially observable system subject to random failures, Adv. Appl. Prob., 35(1), 207–227.
Lin, D. and Makis, V., (2004) Filters and parameter estimation for a partially observable system subject to random failures with continuous-range observations, Adv. Appl. Prob., 36(4), 1212–1230.
Love, C.E., Zhang, Z.G., Zitron, M.A. and Guo, R., (2000) A discrete semi-Markov decision model to determine the optimal repair/replacement policy under general repairs, Euro. J. Opl Res, 125(2), 398–409.
Love, C.E. and Guo, R., (1991) Using proportional hazard modelling in plant maintenance. Quality and Reliability Engineering International, 7, 7–17.
Makis, V. and Jardine, A.K.S., (1991) Computation of optimal policies in replacement models, IMA J. Maths. Appl. Business & Industry, 3, 169–176.
Matthew, C. and Wang, W., (2006) A comparison study of proportional hazard and stochastic filtering when applied to vibration based condition monitoring, submitted to Int. Tran OR.
Meher-Homji, C.B., Mistree, F. and Karandikar, S., (1994) An approach for the integration of condition monitoring and multi-objective optimization for gas turbine maintenance management. International Journal of Turbo and Jet Engines, 11, 43–51.
Neal, M., and Associates, (1979) Guide to the condition monitoring of machinery, DTI, London.
Reeves, C.W., (1998) The vibration monitoring handbook, Coxmoor Publishing Company, Oxford.
Samanta, B., Al-Balushi, K.R. and Al-Araimi, S.A., (2006) Artificial neural networks and genetic algorithm for bearing fault detection, Soft Computing, 10(3), 264–271.
Wang, W., (2002) A model to predict the residual life of rolling element bearings given monitored condition monitoring information to date, IMA. J. Management Mathematics, 13, 3–16.


Wang, W., (2003) Modelling condition monitoring intervals: A hybrid of simulation and analytical approaches, J. Opl. Res Soc, 54, 273–282.
Wang, W., (2006a) A prognosis model for wear prediction based on oil based monitoring, to appear in J. Opl. Res Soc.
Wang, W., (2006b) Modelling the probability assessment of the system state using available condition information, to appear in IMA. J. Management Mathematics.
Wang, W. and Christer, A.H., (2000) Towards a general condition based maintenance model for a stochastic dynamic system, J. Opl. Res. Soc. 51, 145–155.
Wang, W. and Jia, Y., (2001) A multiple condition information sources based maintenance model and associated prototype software development, proceedings of COMADEM 2001, Eds. A. Starr and Raj B.K.N. Rao, Elsevier, 889–898.
Wang, W. and Zhang, W., (2005) A model to predict the residual life of aircraft engines based on oil analysis data, Naval Logistics Research, 52, 276–284.
Wong, M.L.D., Jack, L.B. and Nandi, A.K., (2006) Modified self-organising map for automated novelty detection applied to vibration signal monitoring, Mech. Sys. & Sig. Proc., 20(3), 593–610.
Zhang, W., (2004) Stochastic modeling and applications in condition based maintenance, PhD thesis, University of Salford, UK.

6 Maintenance Based on Limited Data

David F. Percy

6.1 Introduction

Reliability applications often suffer from a lack of data with which to make informed maintenance decisions. Indeed, the very purpose of maintenance is to prevent failure data from arising! This effect is particularly noticeable for high reliability systems such as aircraft engines and emergency vehicles, and when new production lines are established or warranty schemes are planned. The evaluation of such systems is a learning process and knowledge is continually updated as more information becomes available. Such issues are of great importance when selecting and fitting mathematical models to improve the accuracy and utility of these decisions.

This chapter investigates why reliability data are so limited, identifies the problems that this causes and proposes statistical methods for dealing with these difficulties. In particular, it considers graphical and numerical summaries, appropriate methods for model development and validation, and the powerful approach of subjective Bayesian analysis for including expert knowledge about the application area, such as information pertaining to a particular manufacturing process and experience of similar operational systems.

Many reliability problems involve making strategic decisions under risk or uncertainty. Stochastic models involving unknown parameters are often adopted for this purpose and our concern is how to make inference about, and arising from, these unknown parameters. The easiest approach involves skilfully guessing the parameter values by subjective means, which is fine so long as there is sufficient expert knowledge to perform this task well. More commonly, the parameters are estimated from observed data and decisions are then made by assuming the parameters equal to their estimates. This frequentist approach to inference is very good if there are sufficient data to estimate the parameters well. However, few data are available in many areas of maintenance and replacement; see Percy et al. (1997) and Kobbacy et al. (1997) for example.

There are several reasons why data are scarce in these situations. New systems and processes


naturally offer scant historical data about their performance and reliability. Poor and incomplete maintenance records are often kept, as the engineers and managers do not always appreciate the potential benefits that can be achieved through quantitative modelling and analysis. Of equal importance, many observations of failure times tend to be censored due to maintenance interventions. Typical applications take the form of reliability analysis, such as modelling a critical system’s time to failure, and scheduling problems, such as determining efficient policies for scheduling capital replacement and preventive maintenance, all of which are considered elsewhere in this book. Other applications include determining appropriate thresholds for condition monitoring and specifying warranty schemes for new products. Under these circumstances, it is important to allow for the uncertainty about the unknown model parameters. This is readily achieved by adopting the Bayesian approach to inference, as described by Bernardo and Smith (2000) and O’Hagan (2004). The structure for the remainder of this chapter is as follows. Section 6.2 explains the need for Bayesian analysis and Section 6.3 introduces the concepts beginning with Bayes’ theorem, which is of great importance in its own right. Section 6.4 discusses the construction of prior and posterior distributions, whilst Section 6.5 considers the role of predictive distributions and Section 6.6 considers techniques for setting the hyperparameters of prior distributions. One of the great strengths of the Bayesian approach, particularly in relation to practical problems in reliability and maintenance, is its ability to improve the quality of decision analysis, as described in Section 6.7. Section 6.8 presents a review of the Bayesian approach to maintenance and Section 6.9 includes specific case studies that demonstrate these methods. Finally, Section 6.10 suggests topics for future research and possible new applications. 
For convenience, there follows a list of symbols and acronyms that are used throughout this chapter.

P(⋅) : Probability
E(⋅) : Expected value
p(⋅) : Probability mass function
f(⋅) : Probability density function
R(⋅) : Reliability function
L(⋅) : Likelihood function
g(⋅) : Prior or posterior probability density function
Be(θ) : Bernoulli distribution
Po(µ) : Poisson distribution
Ge(θ) : Geometric distribution
Ex(λ) : Exponential distribution
No(µ, ψ) : Normal distribution
Ga(α, λ) : Gamma distribution
We(α, λ) : Weibull distribution


6.2 Need for Bayesian Approach

Figure 6.1 shows the links between equipment, maintenance, models, parameters and data. Starting with the equipment, imperfect reliability necessitates some forms of maintenance. These affect the performance of the equipment as implied by the arrow, which represents a directional influence. In order to determine suitable maintenance policies and strategies, we formulate appropriate mathematical models. These involve unknown parameters that are modelled using expert knowledge and observed data. Variations to the models arise due to modified reliability characteristics when maintenance strategies are in place for particular equipment, forming the cycle at the top of the chart.

Figure 6.1. The link between fundamental aspects of maintenance modelling and analysis

The conventional approach to model fitting is based upon frequentist methods of estimation, as described in statistics books. One of the best such methods is that of maximum likelihood. Essentially, all unknown model parameters are replaced by estimates calculated from samples of data. For example, a parameter that represents the mean lifetime of a rechargeable battery in a portable computer might be replaced by the average lifetime calculated from a sample of such batteries that were run from charged to flat. However, this approximation can and does lead to substantial errors, inaccuracies and poor decisions, particularly when the estimates are based on small samples of data. When data are limited in this way, one starts with subjective estimates and updates them as new data are observed. This is a very common scenario in reliability and maintenance, where the samples typically contain few, and often censored, failure data.

Example 6.1 Suppose we use a Weibull $\mathrm{We}(\alpha, \lambda)$ distribution to model the random variable $X$, which represents the breaking strain of a steel cable. As destructive testing can be very expensive and safety precautions can be crucial, it is feasible that we might only collect right-censored observations of the form $D = \{x_i : X > x_i;\ i = 1, 2, \ldots, n\}$. In order to make useful inference involving the model parameters $\alpha > 0$ and $\lambda > 0$, we need to construct the likelihood function. For this scenario, the likelihood involves the reliability function of $X$,

$$R(x \mid \alpha, \lambda) = \exp\left(-\lambda x^{\alpha}\right) \tag{6.1}$$

for $x > 0$, and takes the form

$$L(\alpha, \lambda; D) \propto \prod_{i=1}^{n} R(x_i \mid \alpha, \lambda) = \exp\left(-\lambda \sum_{i=1}^{n} x_i^{\alpha}\right). \tag{6.2}$$

We typically maximize this function in order to evaluate the maximum likelihood estimates of $\alpha$ and $\lambda$. To do so, the likelihood equations are

$$\lambda \sum_{i=1}^{n} x_i^{\alpha} = 0; \tag{6.3}$$

$$\lambda \sum_{i=1}^{n} x_i^{\alpha} \log x_i = 0. \tag{6.4}$$

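These have no finite solutions for $\hat{\alpha}$ and $\hat{\lambda}$ within the parameter space: every term $x_i^{\alpha}$ is positive, so the likelihood in Equation 6.2 simply increases towards one as $\lambda \to 0$. Our analysis has therefore been thwarted by the lack of uncensored data. □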
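The difficulty in Example 6.1 can be illustrated numerically. The following Python sketch evaluates the log-likelihood implied by Equation 6.2, $-\lambda \sum x_i^{\alpha}$, for a small set of hypothetical right-censored breaking strains. The data values, the fixed choice $\alpha = 1.5$ and the function name are assumptions made purely for illustration. For any fixed $\alpha$, the value keeps rising as $\lambda$ shrinks towards zero, so no interior maximum exists.

```python
def log_likelihood(alpha, lam, censored):
    """Log-likelihood for right-censored Weibull data: -lam * sum(x_i ** alpha)."""
    return -lam * sum(x ** alpha for x in censored)

# Hypothetical right-censored breaking strains (illustrative values only).
censored = [1.2, 1.5, 2.0, 2.3]

# For a fixed alpha, decreasing lam always moves the log-likelihood closer to 0.
lams = (1.0, 0.1, 0.01, 0.001)
values = [log_likelihood(1.5, lam, censored) for lam in lams]
for lam, value in zip(lams, values):
    print(f"lambda = {lam:<6} log-likelihood = {value:.5f}")
```

The printed values increase monotonically towards zero, which is the log of the supremum of the likelihood.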

6.3 Bayesian Inference

In 1763, some research on probability theory by the Reverend Thomas Bayes was posthumously published in Philosophical Transactions of the Royal Society. This contained an incredibly important statement of what we now refer to as Bayes’ theorem. In its simplest form, Bayes’ theorem states that for two events $A$ and $B$, the conditional probability of $B$ given that $A$ has occurred can be expressed as

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \tag{6.5}$$

where it is sometimes useful to evaluate the probability of $A$ using the law of total probability

$$P(A) = P(A \mid B)\,P(B) + P(A \mid B')\,P(B') \tag{6.6}$$

where the event $B'$ is the complement of the event $B$; that is, the event that $B$ does not occur. Bayes’ theorem can be interpreted as a way of transposing the conditionality from $P(A \mid B)$ to $P(B \mid A)$, or as a way of updating the prior probability $P(B)$ to give the posterior probability $P(B \mid A)$.

Example 6.2 An aircraft warning light comes on if the landing gear on either side is faulty. Suppose we know that faults only occur 0.4% of the time, that they are detected with 99.9% reliability and that false alarms only occur 0.5% of the time when the landing gear is operational. Defining events $W$ = “warning light comes on” and $L$ = “landing gear faulty”, this information can be summarized


concisely as $P(L) = 0.004$, $P(W \mid L) = 0.999$, $P(W \mid L') = 0.005$. Our aim is to calculate the probability that the landing gear is faulty if the warning light comes on. Intuitively, one might suppose that this probability is very close to one, as the alarm system appears to be very accurate. However, the law of total probability gives

$$P(W) = P(W \mid L)\,P(L) + P(W \mid L')\,P(L') = 0.999 \times 0.004 + 0.005 \times 0.996 = 0.008976, \tag{6.7}$$

from which Bayes’ theorem gives

$$P(L \mid W) = \frac{P(W \mid L)\,P(L)}{P(W)} = \frac{0.999 \times 0.004}{0.008976} = 0.45 \tag{6.8}$$

to two decimal places. This result implies that most (55%) of these warning lights are false alarms, despite the apparent accuracy of the alarm system! The reason for this paradoxical outcome is that the landing gear is operational for the vast majority of the time. If we were to specify $P(L) = 0.04$ instead, we would obtain $P(L \mid W) = 0.89$, which is far more acceptable. Similar patterns of behaviour apply to medical screening procedures: in order to reduce the incidence of misdiagnoses, only patients deemed to be at risk of an illness are routinely screened for it. □

Only in the mid-twentieth century were the real benefits of Bayes’ theorem appreciated, though. Not only does it apply to probabilities, but also to random variables. For example, suppose $X$ is a discrete random variable and $Y$ is a continuous random variable. Then the conditional probability density function of $Y$ given $X$ can be determined using Bayes’ theorem, if we know the marginal distributions of $X$ and $Y$, and the conditional distribution of $X$ given $Y$:

$$f(y \mid x) = \frac{p(x \mid y)\,f(y)}{p(x)}. \tag{6.9}$$
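As a check on the arithmetic of Example 6.2, the short Python sketch below applies Equations 6.5 and 6.6 directly to the warning light figures. The function and argument names are illustrative choices and are not part of the original text.

```python
def posterior_probability(prior, p_signal_given_fault, p_signal_given_ok):
    """Return P(fault | signal) using Bayes' theorem and the law of total probability."""
    p_signal = p_signal_given_fault * prior + p_signal_given_ok * (1.0 - prior)
    return p_signal_given_fault * prior / p_signal

# Values quoted in Example 6.2: P(L) = 0.004, P(W | L) = 0.999, P(W | L') = 0.005.
p_rare = posterior_probability(0.004, 0.999, 0.005)
# The alternative prior P(L) = 0.04 considered at the end of the example.
p_common = posterior_probability(0.04, 0.999, 0.005)
print(round(p_rare, 2), round(p_common, 2))  # 0.45 0.89
```

Changing only the prior probability moves the answer from below one half to almost 0.9, which is the point the example makes about rare faults.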

The rule (6.9) for “transposing the conditionals” has proven to be crucial in a variety of important applications, including quality control, fault diagnosis, image processing, medical screening and criminal trials. Even more importantly, we can apply Bayes’ theorem to unknown model parameters. This is the foundation of the Bayesian approach to statistical inference and has had an enormous and profound impact on the subject over the last few decades.

Suppose that a continuous random variable $X$ has a probability distribution that depends on an unknown parameter $\theta$. For example, $X$ might represent the fire-breach time of a door in minutes and it might have an exponential distribution with unknown mean $\mu = 1/\theta$. A naïve approach to statistical inference would simply replace $\theta$ by a good guess based on expert opinions. However, this is inherently inaccurate and can lead to poor decisions. A better method is the frequentist approach to inference, whereby we evaluate an estimate $\hat{\theta}$ for the unknown parameter $\theta$ based on a set of observed data $D = \{x_1, x_2, \ldots, x_n\}$, which might consist of a random sample of actual fire-breach times for the above example. Subsequent analyses generally invoke the approximation $\theta \approx \hat{\theta}$, which can again lead to poor decisions. In contrast, the Bayesian approach does not involve any guesses or estimates of unknown parameters in the model. Rather, it uses Bayes’ theorem to update our prior beliefs about $\theta$ in response to the observed data $D$ thus:

$$g(\theta \mid D) = \frac{f(D \mid \theta)\,g(\theta)}{f(D)}. \tag{6.10}$$

This enables us to make any inference we wish about $\theta$. We can also use our posterior beliefs about $\theta$ for any subsequent inference involving $X$. The price that we pay for obtaining exact answers and avoiding approximations in this way comes in two parts: the need to assume a prior distribution for $\theta$ and the increase in algebraic complexity. This chapter shows how to resolve these issues.

Example 6.3 Suppose the unknown parameter $\theta$ represents the proportion of car batteries that fail within two years and our prior beliefs about $\theta$ can be expressed in terms of the probability density function

$$g(\theta) = 2(1 - \theta); \quad 0 < \theta < 1. \tag{6.11}$$

Suppose also that we observe three car batteries, one of which fails within two years and two of which do not. Then we can express the likelihood of these data using the binomial probability mass function

$$p(D \mid \theta) = 3\theta(1 - \theta)^{2}, \tag{6.12}$$

which is the discrete equivalent to the probability density function $f(D \mid \theta)$ referred to above. As $p(D)$, the discrete equivalent of $f(D)$, does not depend on $\theta$, an application of Bayes’ theorem as stated above gives

$$g(\theta \mid D) \propto p(D \mid \theta)\,g(\theta) \propto \theta(1 - \theta)^{3} \tag{6.13}$$

for $0 < \theta < 1$, so our posterior beliefs about the unknown parameter $\theta$ can be expressed as a beta distribution $\theta \mid D \sim \mathrm{Be}(2, 4)$. We elaborate on this process further in Section 6.4. □
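The posterior in Example 6.3 can also be checked numerically without recognising the beta form. The sketch below is a minimal illustration that multiplies the likelihood (6.12) by the prior (6.11) and estimates the constant of proportionality $p(D)$ with a midpoint rule. The grid size and all function names are assumptions chosen for this illustration.

```python
def unnormalised_posterior(theta):
    """Likelihood 3*theta*(1-theta)^2 multiplied by the prior 2*(1-theta)."""
    return 3.0 * theta * (1.0 - theta) ** 2 * 2.0 * (1.0 - theta)

def beta_2_4_density(theta):
    """Be(2, 4) density theta*(1-theta)^3 / B(2, 4), where B(2, 4) = 1/20."""
    return 20.0 * theta * (1.0 - theta) ** 3

# Midpoint rule on (0, 1) to estimate the normalising constant p(D).
n = 10_000
grid = [(k + 0.5) / n for k in range(n)]
evidence = sum(unnormalised_posterior(t) for t in grid) / n
posterior_mean = sum(t * unnormalised_posterior(t) for t in grid) / n / evidence

# Largest pointwise difference between the normalised product and Be(2, 4).
max_gap = max(abs(unnormalised_posterior(t) / evidence - beta_2_4_density(t))
              for t in grid)
print(f"p(D) ~ {evidence:.4f}, posterior mean ~ {posterior_mean:.4f}")
```

The numerical normalisation agrees with the closed-form beta density, and the posterior mean is close to $2/(2+4) = 1/3$.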


6.4 Prior and Posterior Distributions

Section 6.3 concluded by deriving an equation for updating a prior probability density function $g(\theta)$ for an unknown parameter $\theta$, based on some observed data $D$, to give a posterior probability density function $g(\theta \mid D)$. The term $f(D \mid \theta)$ is proportional to the likelihood function. If the data set $D$ consists of a random sample of observations $x_1, x_2, \ldots, x_n$ of a continuous random variable $X$ with probability density function $f(x \mid \theta)$, then the likelihood function becomes

$$L(\theta; D) \propto \prod_{i=1}^{n} f(x_i \mid \theta). \tag{6.14}$$

As the term $f(D)$ does not depend on $\theta$, we can therefore write

$$g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \tag{6.15}$$

or, in words, “posterior is proportional to likelihood times prior”. This is the fundamental rule for Bayesian inference.

Example 6.4 Previously, in Example 6.3, we considered the proportion of car batteries that fail within two years. This involved the use of Bayes’ theorem for this unknown model parameter $\theta$ and was an illustration of how the fundamental rule “posterior is proportional to likelihood times prior” can be applied. To clarify this demonstration, the likelihood function takes the form

$$L(\theta; D) \propto \prod_{i=1}^{3} p(x_i \mid \theta) = \theta(1 - \theta)^{2} \tag{6.16}$$

where the probability mass function $p(x_i \mid \theta)$ corresponds to a Bernoulli distribution. Consequently, the posterior probability density function of $\theta$ given the data $D$ has the form

$$g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \propto \theta(1 - \theta)^{3} \tag{6.17}$$

for $0 < \theta < 1$, which agrees with the result we obtained previously. The corresponding prior and posterior probability density functions are graphed for comparison in Figure 6.2. □
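Because the prior $2(1 - \theta)$ in Example 6.4 is itself a beta density, $\mathrm{Be}(1, 2)$, the same posterior follows from a simple counting rule for Bernoulli data. The sketch below illustrates that shortcut. The helper name `beta_update` and the 0/1 coding of the three batteries are assumptions introduced here, and the mode and mean formulas are the standard ones for a beta distribution with both parameters greater than one.

```python
def beta_update(a, b, observations):
    """Update a Be(a, b) prior with Bernoulli observations (1 = failure, 0 = survival)."""
    failures = sum(observations)
    survivals = len(observations) - failures
    return a + failures, b + survivals

# One battery fails within two years and two do not.
a_post, b_post = beta_update(1, 2, [1, 0, 0])

# Standard beta summaries (the mode formula requires a_post > 1 and b_post > 1).
posterior_mode = (a_post - 1) / (a_post + b_post - 2)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mode, round(posterior_mean, 4))
```

The updated parameters are those of $\mathrm{Be}(2, 4)$, matching Equation 6.17, with the mode at 0.25.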

Figure 6.2. Prior and posterior probability density functions for Example 6.4 (curves prior(θ) and posterior(θ) plotted against θ from 0 to 1)

Having evaluated a posterior distribution using this rule, we can evaluate the posterior mode $\hat{\theta}$ such that

$$g(\hat{\theta} \mid D) \ge g(\theta \mid D) \quad \forall\, \theta, \tag{6.18}$$

by solving the equation

$$\frac{d}{d\theta}\left\{ L(\theta; D)\,g(\theta) \right\} = 0. \tag{6.19}$$

However, to find the median or mean, and to use this posterior density to make any further inference, we need to determine the constant of proportionality in the fundamental rule above. In standard situations, we can recognise the functional form of $L(\theta; D)\,g(\theta)$ and hence quote published work on probability distributions to determine this constant of proportionality and so derive $g(\theta \mid D)$ explicitly. In non-standard situations, we determine this constant of proportionality using numerical quadrature or simulation, both of which we discuss later.

6.4.1 Reference Priors

There are two main types of prior distribution, which loosely correspond with objective priors and subjective priors. As objective priors strictly do not exist, this category is generally known as reference priors. These are used if little prior information is available and as a benchmark against which to compare the output from using subjective priors. This offers a default Bayesian analysis that is not dependent upon any personal prior knowledge. The simplest reference prior is proposed by the Bayes-Laplace postulate and simply recommends the use of a uniform or locally-uniform prior $g(\theta) \propto 1$ for all $\theta$ in the region of support $R_\theta$.


However, different parameterisations can lead to different inferences with this approach. To avoid this inconsistency, the standard univariate reference prior that analysts now adopt is the invariant prior of Jeffreys (1998), defined by

$$g(\theta) \propto \sqrt{I(\theta)}; \quad \theta \in R_\theta \tag{6.20}$$

where

$$I(\theta) = -E_{X \mid \theta}\left\{ \frac{d^{2} \log f(x \mid \theta)}{d\theta^{2}} \right\} \tag{6.21}$$

is Fisher’s expected information. An extension exists for the case of a parameter vector $\theta$, though we usually assume the components of $\theta$ are independent, so $g(\theta)$ is just the product of the univariate invariant priors. This invariant prior distribution is occasionally improper, as its integral sometimes diverges. However, this problem is generally unimportant because the corresponding posterior distributions are usually proper. Books on Bayesian methods, such as Bernardo and Smith (2000) and Lee (2004), present tables of invariant prior and posterior distributions for common models.

6.4.2 Subjective Priors

Subjective prior distributions should be used if prior information is available, which is almost always the case. They represent the best available knowledge about unknown parameters and can be specified using smoothed histograms, relative likelihoods or parametric families. The first two of these are arbitrary and computationally awkward, so we now investigate the last of these. A family of priors $C$ is closed under sampling if

$$g(\theta) \in C \;\Rightarrow\; g(\theta \mid D) \in C, \tag{6.22}$$

so that the posterior density has the same functional form as the prior density. This property is particularly appealing, as our prior knowledge can be regarded as posterior to some previous information. Again, we tend to suppose that components in multi-parameter problems are independent, so that their joint prior density is the product of corresponding univariate marginal priors. Such closed priors exist, and are called natural conjugate priors, for sampling distributions $f(x \mid \theta)$ that belong to the exponential family. This family includes Bernoulli, binomial, geometric, negative binomial, Poisson, exponential, gamma, normal and lognormal models. For a model in the exponential family with scalar parameter $\theta$, we can express the probability density or mass function in the form


$$f(x \mid \theta) = \exp\left\{ a(x)\,b(\theta) + c(x) + d(\theta) \right\} \tag{6.23}$$

and the natural conjugate prior for $\theta$ is defined by

$$g(\theta) \propto \exp\left\{ k_1 b(\theta) + k_2 d(\theta) \right\} \tag{6.24}$$

for suitable constants $k_1$ and $k_2$. However, any conjugate prior of the form

$$g(\theta) \propto h(\theta) \exp\left\{ k_1 b(\theta) + k_2 d(\theta) \right\} \tag{6.25}$$

is also closed under sampling for models in the exponential family. Books on Bayesian methods, such as Bernardo and Smith (2000), present tables of the conjugate prior and posterior distributions for common models. However, many applications in reliability and maintenance are not amenable to such simple analyses. For example, the Weibull distribution is not a member of the exponential family. As a result of this, the constant of proportionality in the expression

$$g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \tag{6.26}$$

can sometimes not be evaluated algebraically and analytical approximations or numerical computation are usually required. It is desirable to avoid the inconsistency of using natural conjugate priors when they exist and other forms of subjective prior, such as location-scale forms, when they do not. The following recommendations by Percy (2004) provide a simple, consistent and comprehensive strategy that achieves this for general use:

• Infinite range $-\infty < \theta < \infty$, use a normal prior distribution for $\theta$
• Semi-infinite range $0 < \theta < \infty$, use a gamma prior distribution for $\theta$
• Finite range $0 < \theta < 1$, use a beta prior distribution for $\theta$

If necessary, linear transformations of the parameters ensure that these priors are sufficient for modelling all situations. They match with the natural conjugate priors for simple models and extend to deal with more complicated models. Mixtures of these priors can be used if multimodality is present and prior independence can be assumed for multiparameter situations.
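The three recommendations above amount to a lookup on the range of the parameter. As a minimal sketch, the Python functions below encode that lookup together with one linear transformation onto $(0, 1)$. The function names, the string labels and the decision to raise an error for ranges outside the three listed cases are all assumptions made for this illustration and do not appear in Percy (2004).

```python
import math

def recommend_prior(lower, upper):
    """Suggest a prior family from the range of the parameter, per the three rules above."""
    if math.isinf(lower) and math.isinf(upper):
        return "normal"
    if math.isfinite(lower) and math.isinf(upper):
        return "gamma"   # after shifting so that the range starts at zero
    if math.isfinite(lower) and math.isfinite(upper):
        return "beta"    # after rescaling the range onto (0, 1)
    raise ValueError("range not covered by the three listed cases")

def to_unit_interval(theta, lower, upper):
    """Linear transformation that maps a finite range (lower, upper) onto (0, 1)."""
    return (theta - lower) / (upper - lower)

print(recommend_prior(-math.inf, math.inf))   # normal
print(recommend_prior(0.0, math.inf))         # gamma
print(recommend_prior(10.0, 20.0))            # beta
print(to_unit_interval(15.0, 10.0, 20.0))     # 0.5
```

A parameter confined to $(10, 20)$, for instance, would be rescaled onto $(0, 1)$ and then given a beta prior.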

6.5 Predictive Distributions

The frequentist approach to inference involves estimating unknown parameters, evaluating confidence intervals and performing significance tests. Such intervals and tests are statements about the data rather than the parameters and so are of little use. For example, null hypotheses are often strictly impossible, in which case a test will be significant if, and only if, sufficient data are observed. In contrast, the Bayesian approach to inference makes statements about the parameters given the data, which are precisely what is required. O’Hagan (1994) commented that the “Bayesian approach … is fundamentally sound, very flexible, produces clear and direct inferences and makes use of all the available information.” In contrast, he noted that the “Classical approach suffers from some philosophical flaws, has a restrictive range of inferences with rather indirect meanings and ignores prior information.”

One of the most important and useful features of the Bayesian approach arises when we wish to make predictions about future values of the random variable $X$ where $f(x \mid \theta)$ is specified. If $\theta$ is unknown, the prior predictive probability density function of $X$ is

$$f(x) = \int_{-\infty}^{\infty} f(x \mid \theta)\,g(\theta)\,d\theta. \tag{6.27}$$

If data $D$ are observed, the posterior predictive probability density function of $X$ is

$$f(x \mid D) = \int_{-\infty}^{\infty} f(x \mid \theta)\,g(\theta \mid D)\,d\theta \propto \int_{-\infty}^{\infty} f(x \mid \theta)\,L(\theta; D)\,g(\theta)\,d\theta. \tag{6.28}$$

In contrast, a frequentist approach either uses the approximation $f(x \mid D) \approx f(x \mid \hat{\theta})$ or gives a point prediction $\hat{x} = E(X \mid \hat{\theta})$ with a prediction interval if available.

Example 6.5 Suppose that the time $X$ to breakdown of a large pulper in a paper mill has an exponential sampling distribution given some unknown hazard parameter $\lambda$. With Jeffreys’ invariant prior, the prior predictive density is given by

$$f(x) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\frac{1}{\lambda}\,d\lambda \propto \frac{1}{x} \tag{6.29}$$

for $x > 0$, which is improper. However, this does provide information about the relative likelihoods for different values of $X$. For example, the ratio of probabilities that $X$ lies in the intervals $(5, 10)$ and $(10, 20)$ is given by

$$\frac{P(5 < X < 10)}{P(10 < X < 20)} = \frac{\int_{5}^{10} \frac{1}{x}\,dx}{\int_{10}^{20} \frac{1}{x}\,dx} = \frac{\log 10 - \log 5}{\log 20 - \log 10} = 1 \tag{6.30}$$

so the time to breakdown of this pulper is equally likely to lie in these two intervals without taking account of any subjective or empirical information that might be available. If we subsequently observe a random sample of lifetimes $D = \{x_1, x_2, \ldots, x_n\}$, the posterior predictive density

$$f(x \mid D) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x) \times \lambda^{n-1} \exp\left\{ -\left( \sum_{i=1}^{n} x_i \right) \lambda \right\} d\lambda = \frac{n!}{\left( x + \sum_{i=1}^{n} x_i \right)^{n+1}}; \quad x > 0 \tag{6.31}$$

is stated only up to a constant of proportionality. Unlike the prior predictive density, it is proper for $n \ge 1$, because $\left( x + \sum x_i \right)^{-(n+1)}$ is integrable over $x > 0$, and we can again evaluate relative likelihoods as we did for the prior predictive density. In contrast, a frequentist approach would merely generate the approximation $X \mid D \sim \mathrm{Ex}(1/\bar{x})$ and could do no better than guess a value for $X$ before observing any data. □

Example 6.6 Reconsidering the time to breakdown of the pulper in Example 6.5, suppose we instead use a gamma prior $\lambda \sim \mathrm{Ga}(a, b)$ to reflect the knowledge of experts on site. The prior predictive density is now given by

$$f(x) = \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\frac{b^{a} \lambda^{a-1} \exp(-b\lambda)}{\Gamma(a)}\,d\lambda = \frac{a b^{a}}{(x + b)^{a+1}}; \quad x > 0 \tag{6.32}$$

which corresponds with a special form of gamma-gamma distribution. If we subsequently observe a random sample of lifetimes $D = \{x_1, x_2, \ldots, x_n\}$, the posterior predictive density is given by

$$f(x \mid D) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\lambda^{a+n-1} \exp\left\{ -\left( b + \sum_{i=1}^{n} x_i \right) \lambda \right\} d\lambda = \frac{\Gamma(a + n + 1)}{\left( x + b + \sum_{i=1}^{n} x_i \right)^{a+n+1}}; \quad x > 0 \tag{6.33}$$

which again corresponds to a gamma-gamma distribution. As before, a frequentist approach would merely yield the approximation $X \mid D \sim \mathrm{Ex}(1/\bar{x})$. □
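The closed forms in Examples 6.5 and 6.6 can be checked by direct numerical integration. The sketch below compares the closed-form prior predictive density of Equation 6.32 with a midpoint-rule evaluation of the defining integral, and also reproduces the ratio in Equation 6.30. The hyperparameter values $a = 2$, $b = 3$, the evaluation point $x = 1.5$, the truncation of the integral at 20 and the function names are all assumptions chosen for illustration only.

```python
import math

def prior_predictive(x, a, b):
    """Closed form a * b**a / (x + b)**(a + 1) for an exponential model with a Ga(a, b) prior."""
    return a * b ** a / (x + b) ** (a + 1)

def prior_predictive_numeric(x, a, b, upper=20.0, steps=20_000):
    """Midpoint-rule integral of lam*exp(-lam*x) times the Ga(a, b) density over (0, upper)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * h
        gamma_pdf = b ** a * lam ** (a - 1) * math.exp(-b * lam) / math.gamma(a)
        total += lam * math.exp(-lam * x) * gamma_pdf
    return total * h

# Hypothetical hyperparameters and breakdown time, chosen only for illustration.
a, b, x = 2.0, 3.0, 1.5
closed = prior_predictive(x, a, b)
numeric = prior_predictive_numeric(x, a, b)

# Relative likelihood of the intervals (5, 10) and (10, 20) under the density 1/x.
ratio = (math.log(10) - math.log(5)) / (math.log(20) - math.log(10))
print(f"closed form {closed:.6f}, numerical {numeric:.6f}, ratio {ratio:.6f}")
```

The truncation at 20 is harmless here because the integrand decays like $\exp(-4.5\lambda)$; other hyperparameters may need a wider range.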

6.6 Prior Specification In Section 6.4 we discussed what objective and subjective prior distributions are appropriate for practical applications. As some prior knowledge is always available, a conjugate prior should be used whenever possible. However, reference priors are useful in these circumstances: • • •

For an objective analysis with no specific personal inputs For comparison with similar analyses by other investigators As baselines to assess the sensitivity of results to choice of prior

We now consider the difficult problem of assigning values to the hyperparameters of subjective prior distributions. Suppose we have a model for a continuous random variable X with probability density function f ( x θ ) , which depends on a parameter θ with subjective prior probability density function g (θ ) . Typically, this prior distribution consists of two unknown hyperparameters, which we now label a and b . We set fixed values for these hyperparameters, to reflect our prior knowledge about θ . For two hyperparameters, we need two distinct pieces of information such as the upper and lower tertiles ( 33 1 3 and 66 2 3 percentiles) of θ or the cumulative probabilities corresponding to any two suitable values of θ . Alternative information about θ could be provided, though quantiles and cumulative probabilities are the easiest and best formulations. One obvious alternative is to specify the prior mode, but this is occasionally at an endpoint of the parameter’s range and so provides no useful information. Furthermore, there is no suitable candidate for the second piece of information when the prior mode is used. Another obvious alternative is to specify the prior mean and standard deviation. However, we cannot make meaningful judgments about these purely mathematical abstracts. Unfortunately, parameters are not observable and we cannot make accurate statements about them directly. The sole exception is when our parameter represents the probability of an event associated with infinitely repeatable Bernoulli trials. In this case, it is feasible to elicit information about an identical quantity, the asymptotic proportion. In general, however, we can elicit hyperparameter values by considering the prior predictive distribution introduced in Section 6.5, which is also a function of a and b ; refer to Percy (2002) for further details. 
Research in this area is still ongoing, particularly for models for which the prior predictive cumulative distribution function cannot be determined analytically and for multiparameter models for which there are implicit and indeterminable constraints on the prior predictive quantiles.

Example 6.7 We saw earlier that the prior predictive probability density function for the exponential sampling distribution (perhaps representing the time $X$ to failure of a pulper, as before, or the downtime $X$ incurred as a result of a computer system failure) with a gamma prior is given by

$$f(x) = \frac{a b^{a}}{(x+b)^{a+1}}; \quad x > 0. \tag{6.34}$$

Hence the prior predictive cumulative distribution function is

$$F(x) = \int_{0}^{x} \frac{a b^{a}}{(u+b)^{a+1}}\,du = 1 - \left(\frac{b}{x+b}\right)^{a}; \quad x > 0. \tag{6.35}$$

If an expert specifies tertiles $L$ and $U$, such that $F(L) = 1/3$ and $F(U) = 2/3$, then we can solve these two simultaneous nonlinear equations numerically for $a$ and $b$. These can then be substituted into our prior predictive density $f(x)$. □

Example 6.8 In Example 6.7, the exponentially distributed random variable $X$ might instead represent the lifetime of an energy-efficient light bulb, in operating hours. Suppose that, based on subjective knowledge of similar light bulbs, we believe that one third of the new type will fail within 2500 operating hours and one third will last for at least 7500 operating hours. This implies that we believe that the remaining third will fail between these two values. Then $L = 2500$ and $U = 7500$, so we need to solve these two simultaneous nonlinear equations numerically for $a$ and $b$:

$$\frac{1}{3} = 1 - \left(\frac{b}{2500+b}\right)^{a}; \tag{6.36}$$

$$\frac{2}{3} = 1 - \left(\frac{b}{7500+b}\right)^{a}. \tag{6.37}$$

There are many algorithms for solving simultaneous nonlinear equations and several computer packages that contain these algorithms. Mathcad gives the values a = 3.5240 and b = 20,502, so the prior distribution for the exponential parameter λ is specified completely as λ ~ Ga(3.5240, 20,502). □
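As an illustration of one way to solve Equations 6.36 and 6.37 without a commercial package, the following minimal Python sketch eliminates $a$ by taking logarithms and dividing the two equations, then finds $b$ by bisection. The function name and the bracketing interval are assumptions made for this sketch; the text itself only states that Mathcad was used.

```python
import math

def elicit_gamma_hyperparameters(lower, upper, p_lower=1/3, p_upper=2/3):
    """Solve F(lower) = p_lower and F(upper) = p_upper for (a, b),
    where F(x) = 1 - (b / (x + b))**a is the prior predictive CDF (6.35)."""
    # Taking logs of (b / (x + b))**a = 1 - p gives a * log(1 + x/b) = -log(1 - p).
    # Dividing the two equations eliminates a and leaves one equation in b.
    target = math.log(1 - p_lower) / math.log(1 - p_upper)

    def ratio(b):
        return math.log1p(lower / b) / math.log1p(upper / b)

    # ratio(b) decreases from 1 (b -> 0) towards lower/upper (b -> infinity),
    # so bisect on a logarithmic scale for ratio(b) = target.
    lo, hi = 1e-6 * upper, 1e6 * upper  # assumed bracketing interval
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if ratio(mid) > target:
            lo = mid
        else:
            hi = mid
    b = math.sqrt(lo * hi)
    a = -math.log(1 - p_lower) / math.log1p(lower / b)
    return a, b

a, b = elicit_gamma_hyperparameters(2500.0, 7500.0)
print(f"a = {a:.4f}, b = {b:.0f}")
```

For the tertiles of Example 6.8 the result should agree closely with the Mathcad values quoted above. The reduction to a single equation only works when the target ratio lies between lower/upper and 1, which holds for tertiles.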

6.7 Bayesian Decision Theory

Much research into maintenance modelling, as presented throughout this book, involves making informed decisions in the presence of stochastic variability. Sensitivity analyses are always advisable in such circumstances, to consider how the conclusions are affected by misspecification of the model and its parameters; see Kobbacy et al. (1995). Rather than replacing model parameters by guesses or estimates, however, more accurate decisions can be made by adopting a Bayesian analysis to allow for the uncertainty attached to these parameters. This effect is particularly important when dealing with limited amounts of data, a common problem in the area of reliability and maintenance and the subject of this chapter.

For example, the author recently acquired a set of data relating to the performance of an industrial valve subject to corrective and preventive maintenance. Only 12 uncensored lifetime observations were available, even though they represent six years of data collection. From a frequentist point of view, it would be unwise to fit any model involving more than three parameters to these data. However, the Bayesian is not constrained in this manner, as prior knowledge gleaned from experience of similar systems can be incorporated in the analysis. Of course, parsimony still dictates that models with fewer parameters are more robust for predictive purposes, even if models with more parameters provide better fits to the observed data. We can resolve such issues with model comparison methods based on prior odds, Bayes factors and posterior odds, which we do not discuss here.

Consider a set of possible decisions $d \in \Delta$ with associated utility function $u(d, \theta)$, which depends on an unknown parameter $\theta$. The best decision is that which maximizes the prior expected utility

$$E\{u(d,\theta)\} = \int_{-\infty}^{\infty} u(d,\theta)\, g(\theta)\, d\theta \tag{6.38}$$

with respect to the prior probability density function $g(\theta)$. Alternatively, we can minimize the prior expected loss

$$E\{l(d,\theta)\} = \int_{-\infty}^{\infty} l(d,\theta)\, g(\theta)\, d\theta \tag{6.39}$$

for some loss function $l(d,\theta)$. If we observe exchangeable data $D = \{x_1, x_2, \ldots, x_n\}$ from the sampling density $f(x \mid \theta)$, the criterion to maximize (minimize) is the posterior expected utility (loss) defined by

$$E\{u(d,\theta) \mid D\} = \int_{-\infty}^{\infty} u(d,\theta)\, g(\theta \mid D)\, d\theta \tag{6.40}$$

where

$$g(\theta \mid D) \propto L(\theta; D)\, g(\theta) \tag{6.41}$$

is the posterior probability density function.

Example 6.9 Which of two alarm systems should we buy if they cost $c_i$ units and fail at times $X_i$, where

$$X_i \mid \lambda_i \sim \mathrm{Ex}(\lambda_i) \tag{6.42}$$

for $i = 1, 2$ respectively? Assuming replacements on failure for an infinite horizon, the elementary renewal theorem gives the expected cost per unit time for action $i$ as the loss function

$$l(i, \lambda_i) = c_i \lambda_i. \tag{6.43}$$

Then

$$E_{\lambda_i}\{l(i, \lambda_i)\} = c_i E(\lambda_i) \tag{6.44}$$

and we choose the system $i$ that minimizes this expected loss, where $E(\lambda_i)$ is the prior mean. □
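The decision rule of Example 6.9 can be sketched in a few lines of Python. The example does not specify the priors, so this sketch assumes gamma priors Ga(a, b) with mean a/b (the parametrization used elsewhere in this chapter); the costs and hyperparameter values are hypothetical and serve only to illustrate Equation 6.44.

```python
def prior_expected_loss(cost, a, b):
    """Prior expected cost per unit time, c * E(lambda), for lambda ~ Ga(a, b)."""
    return cost * a / b  # the mean of a Ga(a, b) distribution is a / b

# Hypothetical costs and gamma hyperparameters for the two alarm systems.
systems = {
    1: {"cost": 120.0, "a": 2.0, "b": 10.0},
    2: {"cost": 150.0, "a": 3.0, "b": 20.0},
}

losses = {i: prior_expected_loss(s["cost"], s["a"], s["b"]) for i, s in systems.items()}
best = min(losses, key=losses.get)  # action with the smallest expected loss
print(losses, "-> choose system", best)
```

With these assumed values the dearer system is preferred because its prior mean hazard is lower, which shows that the purchase cost alone does not determine the decision.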

6.8 Review of Bayesian Approach to Maintenance

Whether we are interested in modelling the reliability of components or systems, assessing the quality of manufactured products, determining optimal replacement policies, deciding when to intervene with preventive maintenance, interpreting the results from condition monitoring, resolving stock control problems or establishing warranty schemes, mathematical models and statistical analysis offer many advantages over subjective expert knowledge alone. This book describes many techniques related to the modelling aspects and generally advocates the frequentist approach of estimating unknown model parameters based upon random samples of observed data. However, this chapter has emphasised that this approach only provides approximate inference, decisions, predictions and solutions. When many data are available, such as might arise when analysing the returns data from common household appliances, these approximations are very accurate. However, they can be very inaccurate when few data are available. We often encounter this situation in maintenance modelling, as the whole purpose of maintenance is to prevent failures from occurring and so lifetime observations are typically censored. Moreover, some applications in this general area relate to products or systems that are completely new or modified versions and for which reliability data are simply not available.

By combining the observed data with expert knowledge, the Bayesian approach to statistical analysis avoids the need for approximate inference and yields exact answers under the assumed model and prior formulations. These enable us to make the best maintenance decisions given all available information. This chapter began by justifying this approach and then investigated suitable forms for the prior distributions corresponding to common reliability models. After describing how to calculate the related posterior and predictive distributions, it discussed how to use this knowledge for decision making in practice.

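Table 6.1. Common distributions for maintenance modelling

| Model and parameter | Prior | Posterior |
| --- | --- | --- |
| BERNOULLI $\mathrm{Be}(\theta)$; probability $\theta$ | beta $\mathrm{Be}(a, b)$ | beta $\mathrm{Be}(a + n\bar{x},\; b + n(1 - \bar{x}))$ |
| POISSON $\mathrm{Po}(\mu)$; mean $\mu$ | gamma $\mathrm{Ga}(a, b)$ | gamma $\mathrm{Ga}(a + n\bar{x},\; b + n)$ |
| GEOMETRIC $\mathrm{Ge}(\theta)$; probability $\theta$ | beta $\mathrm{Be}(a, b)$ | beta $\mathrm{Be}(a + n,\; b + n(\bar{x} - 1))$ |
| EXPONENTIAL $\mathrm{Ex}(\lambda)$; hazard $\lambda$ | gamma $\mathrm{Ga}(a, b)$ | gamma $\mathrm{Ga}(a + n,\; b + \sum x)$ |
| NORMAL $\mathrm{No}(\mu, \psi)$; mean $\mu$, known precision $\psi$ | normal $\mathrm{No}(a, b)$ | normal $\mathrm{No}\left(\frac{ab + n\bar{x}\psi}{b + n\psi},\; b + n\psi\right)$ |
| NORMAL $\mathrm{No}(\mu, \psi)$; known mean $\mu$, precision $\psi$ | gamma $\mathrm{Ga}(c, d)$ | gamma $\mathrm{Ga}\left(c + \frac{n}{2},\; d + \frac{n(\bar{x} - \mu)^{2}}{2} + \frac{(n-1)s^{2}}{2}\right)$ |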
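The conjugate updates in Table 6.1 amount to simple arithmetic on the hyperparameters. As a minimal sketch, the exponential row can be coded as follows, taking the Ga(3.5240, 20,502) prior of Example 6.8 as the starting point. The three observed lifetimes are hypothetical values invented for this illustration, and the function name is an assumption.

```python
def update_exponential_gamma(a, b, lifetimes):
    """Exponential row of Table 6.1: Ga(a, b) prior -> Ga(a + n, b + sum of x)."""
    return a + len(lifetimes), b + sum(lifetimes)

# Prior from Example 6.8; the observed lifetimes (hours) are hypothetical.
observed = [4100.0, 6900.0, 3000.0]
a_post, b_post = update_exponential_gamma(3.5240, 20502.0, observed)

# The posterior mean of the hazard is a / b under the rate parametrization.
posterior_mean_hazard = a_post / b_post
print(a_post, b_post, posterior_mean_hazard)
```

The other single-parameter rows follow the same pattern, with only the two update formulas changing.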

Table 6.1 presents a summary of the probability distributions commonly encountered in maintenance analysis, together with details of their natural conjugate prior and posterior distributions. For models with two parameters, including the unconstrained normal, gamma and Weibull sampling distributions, the analysis is less straightforward and readers are referred to Section 6.4.2 for guidance.

Among the published research that applies this methodology to maintenance modelling is the extensive book on Bayesian reliability analysis by Martz and Waller (1982). Journal papers that address specific issues include those by Soland (1969), Bury (1972) and Canavos and Tsokos (1973), who are concerned particularly with analysis of the Weibull distribution. Singpurwalla (1988) and Percy (2004) are concerned with prior elicitation for reliability analysis and O’Hagan (1998) presents an accessible, general discussion of Bayesian methods. There are many other academic publications dealing with Bayesian approaches in maintenance and a representative sample of recent articles includes those by van Noortwijk et al. (1992), Mazzuchi and Soyer (1996), Chen and Popova (2000), Apeland and Aven (2000), Kallen and van Noortwijk (2005) and Celeux et al. (2006). The general aim is to determine optimal policies for maintenance scheduling and operation, by combining subjective prior knowledge with observed data using Bayes’ theorem and employing belief networks for larger systems.

6.9 Case Studies

We now consider two case studies in which the techniques of this chapter can be applied successfully.

6.9.1 Digital Set Top Boxes

The proportion $\theta$ of defective test versions of digital set top boxes in a large shipment is unknown, but a beta prior probability density function of the form

$$g(\theta) = \frac{1}{B(a,b)}\, \theta^{a-1} (1-\theta)^{b-1}; \quad 0 < \theta < 1 \tag{6.45}$$

is appropriate. An expert believes that $\theta$ is equally likely to lie in each of the intervals $(0, 1/50)$, $(1/50, 1/20)$ and $(1/20, 1)$, which corresponds to hyperparameter values of a = 1.112 and b = 24.03 as displayed in Figure 6.3. Given that 100 boxes are selected at random from the shipment and 3 of these are found to be defective, we can determine the posterior probability density function of $\theta$ from the first row of Table 6.1 as a Be(4.112, 121.0) distribution, which is also displayed in Figure 6.3. This enables us to evaluate numerically the posterior probability that the proportion of defective boxes in the shipment exceeds 1/10 as $P(\theta > 1/10 \mid D) = 0.0013$, or 1 in 763. As a final exercise, suppose we select a further box at random from the shipment and consider the random variable $X$ which takes the value 0 if the box is functional, or 1 if it is defective. Then Equation 6.28 can be used to determine the posterior predictive probability mass function for $X$ given the data above as

$$p(x \mid D) = \begin{cases} 0.967; & x = 0 \\ 0.033; & x = 1 \end{cases} \tag{6.46}$$

so the posterior probability that a randomly chosen box from the shipment is defective is $P(X = 1 \mid D) = 0.033$, or 1 in 30.
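The figures in this case study can be checked with a short Python sketch. It applies the Bernoulli row of Table 6.1, takes the posterior predictive probability of a defective box as the posterior mean, and approximates the tail probability by composite Simpson integration. The integration scheme and the function names are choices made for this sketch, not the method used in the text.

```python
import math

def beta_pdf(theta, a, b):
    """Density of a Be(a, b) distribution, using log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log1p(-theta))

def beta_upper_tail(t, a, b, steps=4000):
    """P(theta > t) by composite Simpson integration over (t, 1); steps must be even."""
    h = (1.0 - t) / steps
    total = beta_pdf(t, a, b)  # the density at theta = 1 is zero when b > 1
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * beta_pdf(t + k * h, a, b)
    return total * h / 3

a0, b0 = 1.112, 24.03            # elicited prior hyperparameters
n, defective = 100, 3            # observed sample
a1, b1 = a0 + defective, b0 + (n - defective)  # Bernoulli row of Table 6.1

p_exceeds = beta_upper_tail(0.1, a1, b1)  # P(theta > 1/10 | D)
p_defective = a1 / (a1 + b1)              # P(X = 1 | D), the posterior mean
print(a1, b1, p_exceeds, p_defective)
```

The posterior hyperparameters come out as 4.112 and 121.03, and the two probabilities should round to the values 0.0013 and 0.033 quoted above.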

Figure 6.3. Prior and posterior probability density functions for digital set top boxes (curves prior(θ) and posterior(θ) plotted against θ)

6.9.2 Rechargeable Tool Batteries

A manufacturer is interested in assessing the unknown hazard $\lambda$ of rechargeable tool batteries for inter-charge operational times measured in hours. Her prior beliefs are represented by a Ga(10, 40) distribution with probability density function

$$g(\lambda) = \frac{40^{10}\, \lambda^{9} \exp(-40\lambda)}{9!}; \quad \lambda > 0. \tag{6.47}$$

She runs an experiment for one day, replacing each flat battery by an identical fully charged battery after failure, so that the total number of failures $X$ has a Poisson distribution with probability mass function

$$p(x \mid \lambda) = \frac{(24\lambda)^{x}}{x!} \exp(-24\lambda); \quad x = 0, 1, 2, \ldots. \tag{6.48}$$

In fact, she runs n = 10 such experiments in parallel, giving a sample mean of $\bar{x} = 6.7$. Referring to the second row of Table 6.1 and transforming to failures per hour, we see that her posterior beliefs about $\lambda$ correspond to a Ga(77, 280) distribution, as displayed in Figure 6.4. The posterior mode is 0.27, which corresponds to the most likely value of $\lambda$.
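The transformation to failures per hour can be made explicit with a small sketch. Each experiment contributes a Poisson count with mean 24λ, so the likelihood is proportional to λ raised to the total count times exp(−24nλ), and the gamma rate parameter grows by the total exposure of 24n hours. The function name and argument names are assumptions for this illustration.

```python
def update_poisson_gamma(a, b, n, mean_count, exposure):
    """Ga(a, b) prior on a rate; n Poisson counts, each observed over
    `exposure` time units, with sample mean `mean_count`."""
    return a + n * mean_count, b + n * exposure

# Prior Ga(10, 40); ten one-day (24 h) experiments with mean count 6.7.
a_post, b_post = update_poisson_gamma(10.0, 40.0, n=10, mean_count=6.7, exposure=24.0)

posterior_mode = (a_post - 1) / b_post  # mode of Ga(a, b) when a > 1
posterior_mean = a_post / b_post
print(a_post, b_post, round(posterior_mode, 2))
```

This reproduces the Ga(77, 280) posterior, with mode 76/280, which rounds to the value 0.27 given in the text.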

Figure 6.4. Prior and posterior probability density functions for rechargeable tool batteries (curves prior(λ) and posterior(λ) plotted against λ)

6.10 Conclusions

Bayesian inference represents a methodology for mathematical modelling and statistical analysis of random variables and unknown parameters. It provides an excellent alternative to the frequentist approach which gained immense popularity throughout the twentieth century. Whereas the frequentist approach is based upon the restrictive inference of point estimates, confidence intervals, significance tests, p-values and asymptotic approximations, the Bayesian approach is based upon probability theory and provides complete solutions to practical problems. Advocates of the Bayesian approach regard it as superior to the frequentist approach in most circumstances and infinitely superior in some. However, it does depend upon the existence and specification of subjective probability to represent individual beliefs, whereas the frequentist approach is almost completely objective. Partial resolution of these difficulties was addressed in Section 6.6 and continues to be improved upon, particularly with regard to eliciting subjective prior knowledge for multiparameter models.

The approach advocated here also involves more analytical and computational complexity, though this is not much of a hindrance with modern computing power. In particular, this approach often involves intractable integrals of the forms

$$g(\theta \mid D) = \frac{L(\theta; D)\, g(\theta)}{\int_{-\infty}^{\infty} L(\theta; D)\, g(\theta)\, d\theta} \quad \text{posterior densities;} \tag{6.49}$$

$$f(x) = \int_{-\infty}^{\infty} f(x \mid \theta)\, g(\theta)\, d\theta \quad \text{predictive densities;} \tag{6.50}$$

$$E\{u(d,\theta)\} = \int_{-\infty}^{\infty} u(d,\theta)\, g(\theta)\, d\theta \quad \text{expected utilities.} \tag{6.51}$$

Monte Carlo simulation can be used to approximate any integral of this form by generating many pseudo-random numbers $\theta_1, \theta_2, \ldots, \theta_n$ from the prior or posterior density in the integrand and evaluating the unbiased estimator

$$\int_{-\infty}^{\infty} s(\theta)\, g(\theta)\, d\theta \approx \frac{1}{n} \sum_{i=1}^{n} s(\theta_i), \tag{6.52}$$

though more efficient procedures exist. Rejection methods are used to generate the pseudo-random numbers and the most powerful such algorithms are referred to as Markov chain Monte Carlo (MCMC) methods, the most common of which is Gibbs sampling. At the time of writing, WinBUGS software is freely available for performing MCMC calculations and may be downloaded from the internet. Further information about MCMC techniques, and other analytical and numerical methods for Bayesian computation, is given in the textbooks mentioned in the introduction.

We have explained why the solution of many problems arising in maintenance applications is often hampered by a lack of data, which makes these problems prime candidates for applying the ideas presented in this chapter. In particular, we suggested and demonstrated how this methodology might benefit decision making related to modelling times to failure and scheduling problems, such as determining efficient policies for scheduling capital replacement and preventive maintenance, determining appropriate thresholds for condition monitoring and specifying warranty schemes for new products.

There is considerable scope for developing these techniques for new application areas within maintenance and extending them into related areas. Potential future projects might consider original products, such as recent inventions or modified lines, and items that are tailored to consumers’ specifications, such as construction works, for which historical data are not available. Similarly, some rare, expensive and safety critical systems will have limited failure data with which to estimate model parameters. Enhancements to warranty analysis are also possible, particularly in cases where returns data are not readily available, including natural extensions of the basic concepts to the analysis of extended warranties. Finally, broader definitions of reliability and maintenance would enable us to apply some of the preceding ideas to non-industrial systems, such as information networks, social communities and public services.
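The Monte Carlo estimator of Equation 6.52 can be illustrated with Python's standard library. This sketch draws from the Ga(77, 280) posterior of Section 6.9.2 and estimates the posterior expected loss for a loss of the form cλ, as in Example 6.9. The cost value, the seed and the sample size are hypothetical, and direct sampling is used here rather than MCMC because the posterior is a standard distribution.

```python
import random

def monte_carlo_expectation(s, sampler, n=100_000):
    """Approximate the integral of s(theta) g(theta) by averaging s over n draws from g."""
    return sum(s(sampler()) for _ in range(n)) / n

rng = random.Random(2024)        # fixed seed so the sketch is repeatable
a, b = 77.0, 280.0               # Ga(77, 280) posterior from Section 6.9.2
cost = 50.0                      # hypothetical cost per replacement

def draw_lambda():
    # random.gammavariate takes a shape and a SCALE, so pass 1 / b for rate b.
    return rng.gammavariate(a, 1.0 / b)

estimate = monte_carlo_expectation(lambda lam: cost * lam, draw_lambda)
exact = cost * a / b             # available in closed form for this simple loss
print(estimate, exact)
```

Here the exact answer is known, which makes the example a convenient check on the estimator; the same function applies unchanged when the integrand has no closed form.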

6.11 References

Apeland S, Aven T, (2000) Risk based maintenance optimization: foundational issues. Reliability Engineering & System Safety 67:285–292
Bernardo JM, Smith AFM, (2000) Bayesian Theory. Chichester: Wiley
Bury KV, (1972) Bayesian decision analysis of the hazard rate for a two-parameter Weibull process. IEEE Transactions on Reliability 21:159–169
Canavos GC, Tsokos CP, (1973) Bayesian estimation of life parameters in the Weibull distribution. Operations Research 21:755–763
Celeux G, Corset F, Lannoy A, Ricard B, (2006) Designing a Bayesian network for preventive maintenance from expert opinions in a rapid and reliable way. Reliability Engineering & System Safety 91:849–856
Chen TM, Popova E, (2000) Bayesian maintenance policies during a warranty period. Communications in Statistics 16:121–142
Jeffreys H, (1998) Theory of Probability. Oxford: University Press
Kallen MJ, van Noortwijk JM, (2005) Optimal maintenance decisions under imperfect maintenance. Reliability Engineering & System Safety 90:177–185
Kobbacy KAH, Percy DF, Fawzi BB, (1995) Sensitivity analyses for preventive-maintenance models. IMA Journal of Mathematics Applied in Business and Industry 6:53–66
Kobbacy KAH, Percy DF, Fawzi BB, (1997) Small data sets and preventive maintenance modelling. Journal of Quality in Maintenance Engineering 3:136–142
Lee PM, (2004) Bayesian Statistics: an Introduction. London: Arnold
Martz HF, Waller RA, (1982) Bayesian Reliability Analysis. New York: Wiley
Mazzuchi TA, Soyer R, (1996) A Bayesian perspective on some replacement strategies. Reliability Engineering & System Safety 51:295–303
O’Hagan A, (1994) Kendall's Advanced Theory of Statistics Volume 2B: Bayesian Inference. London: Arnold
O’Hagan A, (1998) Eliciting expert beliefs in substantial practical applications. The Statistician 47:21–35
Percy DF, (2002) Bayesian enhanced strategic decision making for reliability. European Journal of Operational Research 139:133–145
Percy DF, (2004) Subjective priors for maintenance models. Journal of Quality in Maintenance Engineering 10:221–227
Percy DF, Kobbacy KAH, Fawzi BB, (1997) Setting preventive maintenance schedules when data are sparse. International Journal of Production Economics 51:223–234
Singpurwalla ND, (1988) An interactive PC-based procedure for reliability assessment incorporating expert opinion and survival data. Journal of the American Statistical Association 83:43–51
Soland RM, (1969) Bayesian analysis of the Weibull process with unknown scale and shape parameters. IEEE Transactions on Reliability 18:181–184
van Noortwijk JM, Dekker A, Cooke RM, Mazzuchi TA, (1992) Expert judgement in maintenance optimization. IEEE Transactions on Reliability 41:427–432

7 Reliability Prediction and Accelerated Testing E. A. Elsayed

7.1 Introduction

Reliability is one of the key quality characteristics of components, products and systems. It cannot be directly measured and assessed like other quality characteristics but can only be predicted for given times and conditions. Its value depends on the use conditions of the product as well as the time at which it is to be predicted. Reliability prediction has a major impact on critical decisions such as the optimum release time of the product, the type and length of warranty policy and associated duration and cost, and the determination of the optimum maintenance and replacement schedules. Therefore, it is important to provide accurate reliability predictions over time in order to determine accurately the repair, inspection and replacement strategies of products and systems.

Reliability predictions are based on testing a small number of samples or prototypes of the product. The difficulty in predicting reliability is further complicated by many limitations such as the available time to conduct the test and budget constraints, among others. Testing products at design conditions requires extensive time, a large number of units and considerable cost. Clearly some kind of reliability testing, other than testing at normal design conditions, is needed. One of the most commonly used approaches for testing products within the above stated constraints is accelerated life testing (ALT), where units or products are subjected to more severe stress conditions than normal operating conditions to accelerate their failure times, and the test results are then used to predict (extrapolate) the reliability at design conditions. This chapter addresses the determination of the optimum maintenance schedule at normal operating conditions while utilizing the results from accelerated testing.

We classify ALT into two types: accelerated failure time testing (AFTT) and accelerated degradation testing (ADT). AFTT is conducted when accelerated conditions result in the failure of test units without inducing failure mechanisms different from those occurring at normal operating conditions and when there are “enough” units to be tested at different conditions. Moreover, the economics of conducting AFTT need to be justified as the test is destructive and its duration is directly related to the reliability of test units and the applied stresses. Finally, testing at stresses far from normal makes it difficult to predict reliability accurately at normal conditions. In some cases few or no failures are observed even under accelerated conditions, making reliability inference via failure time analysis highly inaccurate, if not impossible.

On the other hand, ADT is a viable alternative to AFTT when the product’s physical characteristics or performance indices leading to failure (e.g. drift in resistance value of a resistor, change in light intensity of light emitting diodes (LED) and loss of strength of a bridge structure) experience degradation over time. Moreover, significant degradation data can be obtained by observing degradation of a small number of units over time. Degradation testing may also be conducted either at normal or accelerated conditions, and no actual failure is required for reliability inference (Liao 2004).

In this chapter, we address the issues associated with conducting accelerated life testing and describe how the reliability models obtained from ALT are used in the determination of the optimum maintenance schedules at normal operating conditions. This chapter is organized as follows. Section 7.1 provides an overview of the role of reliability prediction and the importance of accelerated life testing. In Section 7.2 we present the two most commonly used accelerated life testing types in reliability engineering. The approaches and models for predicting reliability using accelerated life testing are described in Section 7.3 while Section 7.4 focuses on mathematical formulation and solution of the design of accelerated life testing plans. Section 7.5 shows how accelerated life testing is related to maintenance decisions at normal operating conditions. Models to determine the optimum preventive maintenance schedules for both failure time models and degradation models are presented in Section 7.6. A summary of the chapter is presented in Section 7.7. We begin by describing the ALT types.

7.2 ALT Types

7.2.1 Accelerated Failure Time Testing

It is known that the more reliable the device, the more difficult it is to measure its reliability. In fact, many devices last so long that life testing at normal operating conditions is impractical. Furthermore, testing devices or components at normal operating conditions requires an extensive amount of time and a large number of devices in order to obtain accurate measures of their reliabilities. ALT is commonly used to obtain reliability and failure rate estimates of devices and components in a much shorter time.

A simple way to accelerate the life of many components or products that are not used continuously, such as tires and light bulbs, is to accelerate time (i.e. run the product at a higher usage rate). It is typically assumed that the number of cycles, hours, etc., to failure during testing is the same as would be observed at the normal usage rate. For example, in evaluating the failure time distribution of light bulbs which are used on average about 6 h per day, one year of operating experience can be compressed into three months by using the light bulb for 24 h every day. The advantage of this type of testing is that no assumptions need to be made about the relationship of the failure time distributions at both the accelerated and the normal conditions. However, it is not always true that the number of cycles to failure at a high usage rate is the same as that at the normal usage rate. Moreover, the effect of aging is ignored. Therefore, this type of testing must be run with special care to ensure that product operation and stress remain normal in all regards except usage rate, and that the effect of aging is taken into account, if possible.

An alternative to the above accelerated failure time testing is to accelerate stress (apply stresses more severe than those of the normal conditions) to shorten product or component life. Typical accelerating stresses are temperature, voltage, humidity, pressure, vibration, and fatigue cycling. It is important to identify the type of stress that actually accelerates product or component life, so suitable accelerating stresses need to be determined. One may also wish to know how product life depends on several stresses operating simultaneously. In accelerated life testing, the test stress levels should also be controlled. They cannot be so high as to produce other failure modes that rarely occur, or are unlikely to occur, at normal conditions. Yet levels should be high enough to yield enough failures similar to those that occur at the design (operating) stress. The limited range of the stress levels needs to be specified in the test plans to avoid invalid or biased estimates of reliability. The stress loading can be constant, increase (or decrease) continuously or in steps, vary cyclically, vary randomly, or combine these patterns. The choice of stress loading depends on how the product is loaded in service and on practical and theoretical limitations (Shyur 1996).
7.2.2 Accelerated Degradation Testing

In some cases, applying high stresses might not induce failures or result in sufficient data, and reliability inference via failure time analysis becomes highly inaccurate, if not impossible. However, if a product’s physical characteristics or performance indices leading to failure experience degradation over time, then degradation analysis could be a viable alternative to traditional failure time analysis. The advantages of degradation modeling over time-to-failure modeling are significant. Indeed, degradation data may provide more reliability information than would otherwise be available from time-to-failure data with censoring. Moreover, degradation testing may be conducted either at normal or accelerated conditions, and no actual failure is required for reliability inference.

Degradation data needed for reliability inference may be obtained from two categories: the first is field application and the second is degradation testing experiments. The first category requires an extensive data collection system over a long time. Since the collected data are often subject to a highly random stress environment and human errors, the data may exhibit significant volatility and sometimes their accuracy is questionable, limiting their use for reliability inference and prediction. The second category, prognostics, is a process of predicting the future state of a product (or component). Degradation data analysis might be used in this process to minimize field failure and reduce the life-cycle expenses by recommending condition-based maintenance on observed components or systems. Moreover, degradation testing is usually conducted to demonstrate products’ reliability and helps in revealing the main failure mechanisms and the major failure-causing stress factors. It may be conducted at or close to the normal operating conditions to provide more accurate and precise information for reliability estimates. Yet, to save time and cost, accelerated degradation testing (ADT) is commonly used to obtain data quickly for extrapolating reliability under normal conditions. ADT is conducted by testing units (products or components) at accelerated conditions and measuring their degradation indicators with time. The test can be terminated once “enough” observations are obtained without causing destruction of the test unit (nondestructive testing) if possible. For general purposes, a degradation model along with an inference procedure that can utilize both field degradation data and degradation testing data is preferred, and its potential to be embedded into the development of systems for prognostics purposes is of additional value to the manufacturers.

Reliability assessment using ADT experiments requires an appropriate degradation model, a carefully designed test plan and insightful investigation of the field operating environment in order to achieve high accuracy of the reliability estimates. An appropriate degradation model is one that accurately interprets the effects of the stresses on the degradation process of a product based on its physical properties and the related probability distributions. On the other hand, a carefully designed test plan may improve the accuracy of the developed degradation model and the efficiency of the experiments. The design of the test plan consists of objective functions, several constraints and decision variables such as stress levels, sample allocation ratios at stress levels, frequency of observing and measuring degradation and test termination time. Inappropriate assignments of these decision variables in practice result in inaccurate reliability estimates.

Moreover, it is a challenging and critical issue to consider the stochastic nature of the normal (field) operating conditions in reliability inference from ADT to the normal operating conditions. When field stresses are not deterministic, which is usually the case, their uncertainty will potentially influence the degradation process of the product. If such variations and extremes are ignored in a reliability model, an inaccurate estimate will result, sometimes misleading the judgment for reliability requirements, warranty decisions and the maintenance plans. Therefore, it is important to design robust test plans subject to constraints. The plans should be robust to inaccurate estimates of the model parameters, to the underlying distributions (in case of misspecification) and to the underlying stress-life relationship. Currently, the literature relating ADT to field applications is sparse. Without scientific guidance from the literature, it is hard to produce an appropriate robust design that tolerates the extremes while avoiding “over-design” of the product.

In both accelerated failure time (usage or stress) and degradation testing (normal or stress), robust models that relate the results of the test to the normal operating conditions (or other conditions) are needed. In the following section, we describe such models and discuss their assumptions and limitations.

7.3 ALT Reliability Estimation Models

The accuracy of reliability estimation depends on the models that relate the failure data under severe conditions, or high stress, to that at normal operating conditions, or design stress. Elsayed (1996) classifies these models into three groups: statistics-based models, physics-statistics models, and physics-experimental models. Furthermore, he classifies the statistics models into two sub-categories: parametric and nonparametric models. We limit the models in this chapter to the statistics models as they are more general, while the physics-statistics and physics-experimental models are usually developed for particular applications such as fatigue testing, creep testing and electromigration models.

7.3.1 Statistics-based Models: Parametric

The failure times at each stress level are used to determine the most appropriate failure time distribution along with its parameters. We refer to these models as AFT (accelerated failure time). Parametric models assume that the failure times at different stress levels are related to each other by a common failure time distribution with different parameters. Usually, the shape parameter of the failure time distribution remains unchanged for all stress levels, but the scale parameter may present a multiplicative relationship with the stress levels. For practical purposes, we assume that the time scale transformation (also referred to as the acceleration factor, $AF > 1$) is constant, which implies that we have a true linear acceleration. Thus the relationships between the accelerated and normal conditions are summarized as follows (Tobias and Trindade 1986; Elsayed 1996). Let the subscripts $o$ and $s$ refer to the operating conditions and stress conditions, respectively. The relationship between the time to failure at operating conditions and stress conditions is

$$t_o = AF \times t_s \tag{7.1}$$

The cumulative distribution functions are related as

$$F_o(t) = F_s\left(\frac{t}{AF}\right) \tag{7.2}$$

The probability density functions are related as

$$f_o(t) = \left(\frac{1}{AF}\right) f_s\left(\frac{t}{AF}\right) \tag{7.3}$$

The failure rates are given by

$$h_o(t) = \left(\frac{1}{AF}\right) h_s\left(\frac{t}{AF}\right) \tag{7.4}$$
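As an illustration of Equations 7.2 to 7.4, the following minimal sketch maps a stress-level distribution to operating conditions. The Weibull form and the numerical values (AF = 20, shape 1.5, scale 100 h) are assumptions made only for this example; they do not come from the chapter.

```python
import math

def weibull_cdf(t, shape, scale):
    """Weibull CDF, used here only as an example stress-level distribution."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def weibull_pdf(t, shape, scale):
    """Weibull pdf matching weibull_cdf."""
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))

def cdf_operating(t, af, shape, scale_s):
    """Equation 7.2: F_o(t) = F_s(t / AF)."""
    return weibull_cdf(t / af, shape, scale_s)

def pdf_operating(t, af, shape, scale_s):
    """Equation 7.3: f_o(t) = (1 / AF) * f_s(t / AF)."""
    return weibull_pdf(t / af, shape, scale_s) / af

def hazard_operating(t, af, shape, scale_s):
    """Equation 7.4: h_o(t) = (1 / AF) * h_s(t / AF), with h_s = f_s / (1 - F_s)."""
    ts = t / af
    h_s = weibull_pdf(ts, shape, scale_s) / (1.0 - weibull_cdf(ts, shape, scale_s))
    return h_s / af

# Hypothetical values chosen for illustration only.
AF, SHAPE, SCALE_S = 20.0, 1.5, 100.0
print(cdf_operating(1500.0, AF, SHAPE, SCALE_S))
```

Under these assumptions the operating-condition distribution is again Weibull with the same shape and a scale of AF times the stress-level scale, which is the constant-shape, scaled-life behaviour described above.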


The acceleration factor is obtained by determining the median lives of units tested at two different accelerated stresses and extrapolating to the median life at normal operating stress. It can also be estimated by replacing the medians with other quantiles. The accuracy of the reliability estimates suffers when small samples are tested at the stress conditions, since the determination of the proper failure time distribution that describes these failures becomes difficult. More importantly, the assumption of having the same failure time distribution at different stress levels is difficult to justify, especially when small numbers of failures are observed. In these cases, it is more appropriate to use nonparametric models as described next.

7.3.2 Statistics-based Models: Nonparametric

Nonparametric models relax the requirement of the common failure time distribution, i.e., no common failure time distribution is required among all stress levels. Several nonparametric models have been developed and validated in recent years. We describe these models below.

7.3.2.1 Proportional Hazards Model

Cox’s Proportional Hazards (PH) model (Cox 1972, 1975) is the most popular nonparametric model. It has become the standard nonparametric regression model for accelerated life testing in the past few years. The PH model is distribution-free, requiring only the ratio of hazard rates between two stress levels to be constant with time. The proportional hazards model has the following form:

$$\lambda(t; z) = \lambda_0(t) \exp(\beta z) \tag{7.5}$$

The baseline hazard function $\lambda_0(t)$ is an arbitrary function; it is modified multiplicatively by the covariates (i.e., applied stresses). Elsayed and Zhang (2006) assume $\lambda_0(t)$ to be linear with time: $\lambda_0(t) = \gamma_0 + \gamma_1 t$. Substituting $\lambda_0(t)$ into the PH model, we obtain $\lambda(t; z) = (\gamma_0 + \gamma_1 t) \exp(\beta z)$, where $z = (z_1, z_2, \ldots, z_p)^T$ is a column vector of the covariates (or applied stresses). For ALT, the column vector represents the stresses used in the test and/or their interactions. $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is a row vector of the unknown coefficients corresponding to the covariates $z$. These coefficients can be estimated using a partial likelihood estimation procedure. This model usually produces “good” reliability estimates with failure data for which the proportional hazards assumption holds, and even when it does not exactly hold.

7.3.2.2 Extended Linear Hazards Regression Model

The PH and AFT models have different assumptions. The only model that satisfies both assumptions is the Weibull regression model (Kalbfleisch and Prentice 2002). For generalization, the Extended Hazard Regression (EHR) model (Ciampi and Etezadi-Amoli 1985; Etezadi-Amoli and Ciampi 1987; Shyur et al. 1999) is proposed to combine the PH and AFT models into one form:

$$\lambda(t; z) = \lambda_0\left(e^{z'\beta} t\right) \exp(z'\alpha) \tag{7.6}$$

The unknowns of this model are the regression coefficients $\alpha$, $\beta$ and the unspecified baseline hazard function $\lambda_0(t)$. The model reflects that the covariate $z$ has both a time scale changing effect and a hazard multiplicative effect. It becomes the PH model when $\beta = 0$ and the AFT model when $\alpha = \beta$. Elsayed et al. (2006) propose a new model called the Extended Linear Hazard Regression (ELHR) model. The ELHR model (e.g., with one covariate) assumes those coefficients to be changing linearly with time:

$$\lambda(t; z) = \lambda_0\left(t\, e^{(\beta_0 + \beta_1 t) z}\right) \exp\left((\alpha_0 + \alpha_1 t) z\right) \tag{7.7}$$

The model considers the proportional hazards effect, time scale changing effect as well as time-varying coefficients effect. It encompasses all previously developed models as special cases. It may provide a refined model fit to failure time data and a better representation regarding complex failure processes. Since the covariate coefficients and the unspecified baseline hazard cannot be expressed separately, the partial likelihood method is not suitable for estimating the unknown parameters. Elsayed et al. (2006) propose the maximum likelihood method which requires the baseline hazard function to be specified in a parametric form. In the EHR model, the baseline hazards function has two specific forms; one is a quadratic function and the other is a quadratic spline. In the proposed ELHR model, we assume the baseline hazard function λ 0 (t ) to be a quadratic function:

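$$\lambda_0(t) = \gamma_0 + \gamma_1 t + \gamma_2 t^2 \tag{7.8}$$

Substituting $\lambda_0(t)$ into the ELHR model yields

$$\lambda(t; z) = \gamma_0 e^{\alpha_0 z + \alpha_1 z t} + \gamma_1 t\, e^{\theta_0 z + \theta_1 z t} + \gamma_2 t^2 e^{\omega_0 z + \omega_1 z t} \tag{7.9}$$

where $\theta_0 = \alpha_0 + \beta_0$, $\theta_1 = \alpha_1 + \beta_1$, $\omega_0 = \alpha_0 + 2\beta_0$, $\omega_1 = \alpha_1 + 2\beta_1$.

The cumulative hazard rate function is obtained as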

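$$\begin{aligned}
\Lambda(t; z) &= \int_0^t \lambda(u; z)\,du = \int_0^t \gamma_0 e^{\alpha_0 z + \alpha_1 z u}\,du + \int_0^t \gamma_1 u\, e^{\theta_0 z + \theta_1 z u}\,du + \int_0^t \gamma_2 u^2 e^{\omega_0 z + \omega_1 z u}\,du \\
&= \frac{\gamma_0}{\alpha_1 z} e^{\alpha_0 z + \alpha_1 z t} - \frac{\gamma_0}{\alpha_1 z} e^{\alpha_0 z} + \frac{\gamma_1 t}{\theta_1 z} e^{\theta_0 z + \theta_1 z t} - \frac{\gamma_1}{(\theta_1 z)^2} e^{\theta_0 z + \theta_1 z t} + \frac{\gamma_1}{(\theta_1 z)^2} e^{\theta_0 z} \\
&\quad + \frac{\gamma_2 t^2}{\omega_1 z} e^{\omega_0 z + \omega_1 z t} - \frac{2\gamma_2 t}{(\omega_1 z)^2} e^{\omega_0 z + \omega_1 z t} + \frac{2\gamma_2}{(\omega_1 z)^3} e^{\omega_0 z + \omega_1 z t} - \frac{2\gamma_2}{(\omega_1 z)^3} e^{\omega_0 z}
\end{aligned}$$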

The reliability function $R(t; z)$ and the probability density function $f(t; z)$ are obtained as

$$R(t; z) = \exp(-\Lambda(t; z))$$

$$f(t; z) = \lambda(t; z) \exp(-\Lambda(t; z))$$
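The ELHR expressions above can be checked numerically. The sketch below evaluates the hazard of Equation 7.9 and the closed-form cumulative hazard for a single covariate, then compares the latter with a Simpson-rule integral of the hazard. All parameter values are hypothetical and were chosen only so that $\alpha_1 z$, $\theta_1 z$ and $\omega_1 z$ are nonzero, which the closed form requires.

```python
import math

def elhr_hazard(t, z, gam, alpha, beta):
    """Equation 7.9 with gam=(g0, g1, g2), alpha=(a0, a1), beta=(b0, b1)."""
    g0, g1, g2 = gam
    th0, th1 = alpha[0] + beta[0], alpha[1] + beta[1]
    om0, om1 = alpha[0] + 2 * beta[0], alpha[1] + 2 * beta[1]
    return (g0 * math.exp(alpha[0] * z + alpha[1] * z * t)
            + g1 * t * math.exp(th0 * z + th1 * z * t)
            + g2 * t * t * math.exp(om0 * z + om1 * z * t))

def elhr_cum_hazard(t, z, gam, alpha, beta):
    """Closed-form integral of Equation 7.9 over (0, t]."""
    g0, g1, g2 = gam
    th0, th1 = alpha[0] + beta[0], alpha[1] + beta[1]
    om0, om1 = alpha[0] + 2 * beta[0], alpha[1] + 2 * beta[1]
    c0, c1, c2 = alpha[1] * z, th1 * z, om1 * z  # must all be nonzero
    e0 = math.exp(alpha[0] * z + c0 * t)
    e1 = math.exp(th0 * z + c1 * t)
    e2 = math.exp(om0 * z + c2 * t)
    part0 = g0 / c0 * (e0 - math.exp(alpha[0] * z))
    part1 = g1 * (t / c1 * e1 - e1 / c1 ** 2 + math.exp(th0 * z) / c1 ** 2)
    part2 = g2 * (t * t / c2 * e2 - 2 * t / c2 ** 2 * e2
                  + 2 / c2 ** 3 * e2 - 2 / c2 ** 3 * math.exp(om0 * z))
    return part0 + part1 + part2

def elhr_reliability(t, z, gam, alpha, beta):
    """R(t; z) = exp(-Lambda(t; z))."""
    return math.exp(-elhr_cum_hazard(t, z, gam, alpha, beta))

def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

# Hypothetical parameter values, for illustration only.
GAM, ALPHA, BETA, Z = (1e-3, 2e-4, 1e-5), (0.5, 0.01), (0.2, 0.005), 1.0
closed = elhr_cum_hazard(10.0, Z, GAM, ALPHA, BETA)
numeric = simpson(lambda u: elhr_hazard(u, Z, GAM, ALPHA, BETA), 0.0, 10.0)
print(closed, numeric)
```

The two values should agree closely; a disagreement would point to a transcription error in one of the nine closed-form terms.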

Although the ELHR model is developed based on the distribution-free concept, a close investigation of the model reveals its capability of capturing the features of commonly used failure time distributions. The main limitation of this model is that “good” estimates of the many parameters of the model require a large number of test units.

7.3.2.3 Proportional Mean Residual Life Model

Oakes and Dasu (1990) originally proposed the concept of the Proportional Mean Residual Life (PMRL) by analogy with the PH model. Two survivor distributions $F(t)$ and $F_0(t)$ are said to have PMRL if

$$e(x) = \theta e_0(x) \tag{7.10}$$

where $e(x)$ is the mean residual life at time $x$. We extend the model to a more general framework with a covariate vector $z$ (applied stress):

$$e(t \mid z) = \exp(\beta^T z)\, e_0(t) \tag{7.11}$$

We refer to this model as the proportional mean residual life regression model, which is used to model accelerated life testing. Clearly $e_0(t)$ serves as the MRL corresponding to a baseline reliability function $R_0(t)$ and is called the baseline mean residual life function; $e(t \mid z)$ is the conditional mean residual life function of $T - t$ given $T > t$ and $Z = z$. Here $z^T = (z_1, z_2, \ldots, z_p)$ is the vector of covariates, $\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ is the vector of coefficients associated with the covariates, and $p$ is the number of covariates. Typically, we can experimentally obtain $\{(t_i, z_i),\ i = 1, 2, \ldots, n\}$, the set of failure times and the vectors of covariates for each unit (Zhao and Elsayed 2005). The main assumption of this model is the proportionality of mean residual lives with applied stresses. In other words, the mean residual life of a unit subjected to high stress is proportional to the mean residual life of a unit subjected to low stress.

7.3.2.4 Proportional Odds Model

In many applications, however, it is often unreasonable to assume that the effects of covariates on the hazard rates remain fixed over time. Brass (1971) observes that the ratio of the death rates, or hazard rates, of two populations under different stress levels (for example, one population of smokers and the other of nonsmokers) is not constant with age, or time, but follows a more complicated course, in particular converging closer to unity for older people. So the PH model is not suitable for this case. Brass (1974) proposes a more realistic model: the proportional odds (PO) model. The proportional odds model has been successfully used in categorical data analysis (McCullagh 1980; Agresti and Lang 1993) and survival analysis (Hannerz 2001) in the medical fields. The PO model makes a distinctly different assumption on proportionality, and is complementary to the PH model. It had not been used in reliability analysis of accelerated life testing until Zhang and Elsayed (2005) extended it to reliability estimation using ALT data. We describe the PO model as follows. Let $T > 0$ be a failure time associated with stress level $z$ with cumulative distribution $F(t; z)$, and let the ratio $\frac{F(t; z)}{1 - F(t; z)}$, or $\frac{1 - R(t; z)}{R(t; z)}$, be the odds on failure by time $t$. The PO model is then expressed as

$$\frac{F(t; z)}{1 - F(t; z)} = \exp(\beta z)\, \frac{F_0(t)}{1 - F_0(t)} \tag{7.12}$$

where $F_0(t) \equiv F(t; z = 0)$ is the baseline cumulative distribution function and $\beta$ is an unknown regression parameter. Let $\theta(t; z)$ denote the odds function; then the above PO model is transformed to

$$\theta(t; z) = \exp(\beta z)\, \theta_0(t) \tag{7.13}$$

where $\theta_0(t) \equiv \theta(t; z = 0)$ is the baseline odds function. For two failure time samples with stress levels $z_1$ and $z_2$, the difference between the respective log odds functions is

$$\log[\theta(t; z_1)] - \log[\theta(t; z_2)] = \beta(z_1 - z_2),$$

which is independent of the baseline odds function $\theta_0(t)$ and the time $t$. Hence, the odds functions are constantly proportional to each other. The baseline odds function could be any monotone increasing function of time $t$ with the property $\theta_0(0) = 0$. When $\theta_0(t) = t^{\varphi}$, the PO model presented by Equation 7.13 becomes the


log-logistic accelerated failure time model (Bennett 1983), which is a special case of the general PO models. In order to utilize the PO model in predicting reliability at normal operating conditions, it is important that both the baseline function and the covariate parameter, $\beta$, be estimated accurately. Since the baseline odds function of the general PO models could be any monotone increasing function, it is important to define a viable baseline odds function structure to approximate most, if not all, of the possible odds functions. In order to find such a “universal” baseline odds function, we investigate the properties of the odds function and its relation to the hazard rate function. The odds function $\theta(t)$ is defined by

$$\theta(t) = \frac{F(t)}{1 - F(t)} = \frac{1 - R(t)}{R(t)} = \frac{1}{R(t)} - 1 \tag{7.14}$$

From the properties of the reliability function and its relation to the odds function shown in Equation 7.14, we can easily derive the following properties of the odds function $\theta(t)$:

1. $\theta(0) = 0$, $\theta(\infty) = \infty$
2. $\theta(t)$ is a monotonically increasing function of time
3. $\theta(t) = \frac{1 - \exp[-\Lambda(t)]}{\exp[-\Lambda(t)]} = \exp[\Lambda(t)] - 1$, and $\Lambda(t) = \ln[\theta(t) + 1]$
4. $\lambda(t) = \frac{\theta'(t)}{\theta(t) + 1}$
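The relations among the odds function, the cumulative hazard and the hazard rate can be sketched numerically. The example below uses the log-logistic baseline $\theta_0(t) = t^{\varphi}$ mentioned in the text; the function names, the value $\varphi = 2$ and the finite-difference step are assumptions made for illustration.

```python
import math

def odds_from_reliability(r):
    """Equation 7.14: theta = (1 - R) / R = 1 / R - 1."""
    return 1.0 / r - 1.0

def cum_hazard_from_odds(theta):
    """Property 3: Lambda = ln(theta + 1)."""
    return math.log(theta + 1.0)

def hazard_from_odds(theta_fn, t, h=1e-5):
    """Property 4: lambda(t) = theta'(t) / (theta(t) + 1).

    theta'(t) is approximated with a central difference of step h.
    """
    dtheta = (theta_fn(t + h) - theta_fn(t - h)) / (2.0 * h)
    return dtheta / (theta_fn(t) + 1.0)

def po_odds(theta0_fn, t, z, beta):
    """Equation 7.13: theta(t; z) = exp(beta * z) * theta0(t)."""
    return math.exp(beta * z) * theta0_fn(t)

# Log-logistic baseline odds with a hypothetical exponent phi = 2.
baseline = lambda t: t ** 2

# The log-odds difference between two stress levels depends only on beta * (z1 - z2).
beta, z1, z2, t = 0.8, 2.0, 1.0, 1.5
diff = math.log(po_odds(baseline, t, z1, beta)) - math.log(po_odds(baseline, t, z2, beta))
print(diff, hazard_from_odds(baseline, t))
```

For this baseline the hazard from property 4 is $\varphi t^{\varphi - 1} / (1 + t^{\varphi})$, the familiar log-logistic hazard, which gives a quick sanity check on the implementation.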

Further investigation of such a “universal” odds function shows that it can be approximated by a polynomial function.

An appropriate ALT model is important since it explains the influences of the stresses on the expected life of a product based on its physical properties and the related statistical properties. On the other hand, a carefully designed test plan improves the accuracy and efficiency of the reliability estimation. The design of an accelerated life testing plan consists of the formulation of the objective function, the determination of constraints and the definition of the decision variables such as stress levels, sample size, allocation of test units to each stress level, stress level changing time, test termination time, and others. Inappropriate values of the decision variables result in inaccurate reliability estimates and/or unnecessary use of test resources. Thus it is important to design test plans that minimize the objective function under specific time and cost constraints.


7.4 Design of Accelerated Life Testing Plans

Conducting an accelerated life test (ALT) requires the determination or development of a reliability inference model that relates the failure data at stress conditions to design or operating conditions. Moreover, an accelerated test plan needs to be developed to obtain appropriate and sufficient information in order to estimate reliability performance accurately at operating conditions. A test plan requires the identification of the type of stresses to be applied, stress levels, methods of stress application (constant, ramp, cyclic), number of units at every stress level, minimum number of failures at every stress level, optimum test duration, frequency of test data collection and other test parameters. Indeed, without an optimum test plan, it is likely that a long sequence of expensive and time-consuming tests will be conducted, which might cause delays in product release or in some cases the termination of the entire product. In this section, we describe the procedure for designing an optimum test plan based on the proportional hazards model, followed by a numerical example. Optimum test plans based on other ALT models can be developed in a similar fashion.

7.4.1 Design of ALT Plans

An ALT plan requires the determination of the type of stress, method of applying stress, stress levels, the number of units to be tested at each stress level and an applicable accelerated life testing model that relates the failure times at accelerated conditions to those at normal conditions. When designing an ALT, we need to address the following issues: (a) select the stress types to use in the experiment, (b) determine the stress levels for each stress type selected, (c) determine the proportion of devices to be allocated to each stress level (Elsayed and Jiao 2002). We refer the reader to Meeker and Escobar (1998) and Nelson (2004) for other approaches to the design of ALT plans.
We consider the selection of the stress level $z_i$ and the proportion of devices $p_i$ to allocate to each $z_i$ such that the most accurate reliability estimate at use conditions $z_D$ can be obtained. We consider two types of censoring. Type I censoring involves running each test unit until a prespecified time; the censoring times are fixed and the number of failures is random. Type II censoring involves simultaneously testing units until a prespecified number of them fail; the censoring time is random while the number of failures is fixed. We use the following notation:

ln                  Natural logarithm
ML                  Maximum likelihood
n                   Total number of test units
z_H, z_M, z_L       High, medium, low stress levels, respectively
z_D                 Specified design stress
p_1, p_2, p_3       Proportion of test units allocated to z_L, z_M and z_H, respectively
T                   Pre-specified period of time over which the reliability estimate is of interest
R(t; z)             Reliability at time t, for given z
f(t; z)             Pdf at time t, for given z
F(t; z)             Cdf at time t, for given z
Λ(t; z)             Cumulative hazard function at time t, for given z
λ_0(t)              Unspecified baseline hazard function at time t

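We assume the baseline hazard function $\lambda_0(t)$ to be linear with time:

$$\lambda_0(t) = \gamma_0 + \gamma_1 t$$

Substituting $\lambda_0(t)$ into the PH model given by Equation 7.5, we obtain

$$\lambda(t; z) = (\gamma_0 + \gamma_1 t) \exp(\beta z)$$

We obtain the corresponding cumulative hazard function $\Lambda(t; z)$ and the variance of the hazard function estimate at the design stress $z_D$ as

$$\Lambda(t; z) = \left(\gamma_0 t + \frac{\gamma_1 t^2}{2}\right) e^{\beta z}$$

$$\mathrm{Var}\left[(\hat\gamma_0 + \hat\gamma_1 t)\, e^{\hat\beta z_D}\right] = \left(\mathrm{Var}[\hat\gamma_0] + \mathrm{Var}[\hat\gamma_1]\, t^2\right) e^{2\left(\beta z_D + \mathrm{Var}[\hat\beta]\, z_D^2\right)} + e^{2\beta z_D + \mathrm{Var}[\hat\beta]\, z_D^2} \left(e^{\mathrm{Var}[\hat\beta]\, z_D^2} - 1\right) (\gamma_0 + \gamma_1 t)^2$$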
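A small numerical sketch of the variance expression may help when comparing candidate plans. The code below evaluates the expression at a given time and integrates it over a period $(0, T]$ with a Simpson rule. It is an illustration only: the function names are hypothetical, the parameter values in the usage line are made up, and the variances of the estimates are taken as given inputs rather than derived from a Fisher information matrix.

```python
import math

def hazard_variance(t, z_d, g0, g1, beta, var_g0, var_g1, var_beta):
    """Variance of (g0_hat + g1_hat * t) * exp(beta_hat * z_d), as given in the text."""
    s2 = var_beta * z_d * z_d
    part1 = (var_g0 + var_g1 * t * t) * math.exp(2.0 * (beta * z_d + s2))
    part2 = math.exp(2.0 * beta * z_d + s2) * (math.exp(s2) - 1.0) * (g0 + g1 * t) ** 2
    return part1 + part2

def integrated_variance(T, z_d, g0, g1, beta, var_g0, var_g1, var_beta, n=200):
    """Integral of hazard_variance over (0, T] by the composite Simpson rule (n even)."""
    h = T / n
    args = (z_d, g0, g1, beta, var_g0, var_g1, var_beta)
    total = hazard_variance(0.0, *args) + hazard_variance(T, *args)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * hazard_variance(i * h, *args)
    return total * h / 3.0

# Hypothetical planning values, for illustration only.
print(integrated_variance(100.0, 2.0, 1e-4, 2e-5, -0.5, 1e-10, 1e-12, 1e-4))
```

In an actual plan the three variances depend on the stress levels and allocations through the information matrix, so this function would sit inside an optimizer rather than be called with fixed inputs.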

7.4.1.1 Formulation of the Test Plan

Under the constraints of available test units, test time and specification of the minimum number of failures at each stress level, the problem is to allocate stress levels and test units optimally so that the asymptotic variance of the hazard rate estimate at normal conditions is minimized over a prespecified period of time $T$. If we consider three stress levels, then the optimal decision variables $(z_L^*, z_M^*, p_1^*, p_2^*, p_3^*)$ are obtained by solving the following optimization problem with a nonlinear objective function and both linear and nonlinear constraints:

$$\min \int_0^T \mathrm{Var}\left[(\hat\gamma_0 + \hat\gamma_1 t)\, e^{\hat\beta z_D}\right] dt$$

subject to

$$\Sigma = F^{-1}$$

$$0 < p_i < 1, \quad i = 1, 2, 3$$

$$\sum_{i=1}^{3} p_i = 1$$

$$z_D < z_L < z_M < z_H$$

$$n p_i \Pr[t \le \tau \mid z_i] \ge MNF, \quad i = 1, 2, 3$$


where MNF is the minimum number of failures and $\Sigma$ is the inverse of Fisher's information matrix. Other objective functions can be formulated, which result in different designs of the test plans. These functions include the D-optimal design, which provides efficient estimates of the parameters of the distribution. It allows relatively efficient determination of all quantiles of the population, but the estimates are distribution dependent.

7.4.1.2 Numerical Example

An accelerated life test is to be conducted at three temperature levels for MOS capacitors in order to estimate their life distribution at a design temperature of 50 °C. The test needs to be completed in 300 h. The total number of items to be placed under test is 200 units. To avoid the introduction of failure mechanisms other than those expected at the design temperature, it has been decided, through engineering judgment, that the testing temperature should not exceed 250 °C. The minimum number of failures for each of the three temperatures is specified as 25. Furthermore, the experiment should provide the most accurate reliability estimate over a 10-year period of time. Consider three stress levels; then the formulation of the objective function and the test constraints follows the formulation given in the above section. The optimum plan derived (Elsayed and Jiao 2002) that optimizes the objective function and meets the constraints is:

$z_L = 160$ °C, $z_M = 190$ °C, $z_H = 250$ °C

The corresponding allocations of units to each temperature level are:

$p_1 = 0.5$, $p_2 = 0.4$, $p_3 = 0.1$

7.4.1.3 Concluding Remarks

Design of ALT plans plays a major role in providing accurate estimates of reliability, mean time to failure and the variance of failure time at normal operating conditions. These estimates have a major impact on many decisions during the product life cycle such as maintenance schedules, warranty and repair policies and replacement times.
Therefore, the test plans should be robust (Pascual 2006), i.e., they should be:

1. Robust to planning values of the model parameters. This implies that ALTs conducted at three or more stresses are more robust than those conducted at two stresses. Allocating more units at the low stress level will also improve the robustness of the plan.
2. Robust to the type of the underlying distribution. In other words, misspecification of the underlying distribution should not result in significant errors in calculating reliability characteristics.
3. Robust to the underlying stress-life relationship. The commonly used concept that higher stresses result in more failures might result in the “wrong” stress-life relationship. For example, testing circuit packs at higher temperature reduces humidity, which in turn results in fewer failures than those at field conditions. In essence, this is a deceleration test (higher stresses show fewer failures).


7.5 Relating ALT Results to Maintenance Decisions at Normal Operating Conditions

It is important to note that it is not necessary to conduct destructive ALT when the product’s characteristics can be monitored through degradation with time. For example, light emitting diodes (LEDs) are likely to experience degradation in light intensity before they are deemed completely unsuitable for use. In such cases it is important to conduct accelerated degradation testing (ADT). The threshold level at which a unit is considered unacceptable might be taken as the threshold level for replacement or maintenance (if possible). In a typical experiment the threshold level is set as the level at which the light intensity drops to 50% of its original value. This threshold level is set based on engineering and users’ experience. Of course, an optimum level can be determined based on other factors such as economics, maintenance strategy, availability of maintenance crew and others. It should be noted that this level is set for accelerated conditions. Clearly, the determination of the optimum maintenance schedule at normal operating conditions depends on many factors, as follows.

(1) The variance of the time to failure at normal conditions is much larger than that of the ADT, as shown in Figure 7.1.
(2) The failure time distribution or the degradation paths at accelerated conditions are directly related to the failure rate or degradation rate. Higher accelerated stresses result in higher rates, as shown in Figure 7.2. Thus the failure rate at normal conditions requires careful evaluation as it directly affects the maintenance schedule.
(3) Since there are no universal normal operating conditions, but a distribution is likely to describe these conditions, the maintenance threshold level will be greatly affected by such a distribution.
(4) The repair rate in field conditions is likely to be different from that of the ALT.
(5) The effect of aging at stress conditions is not captured.
(6) When a unit is repaired it is not considered as good as new; consequently the time to next failure is shorter.

Therefore, the maintenance threshold level needs to be optimally determined so that the total maintenance cost is minimized or the system availability is maximized, as discussed in Section 7.7.

Figure 7.1. Distributions of the time to failure at stress and normal conditions

Figure 7.2. Distributions of degradation paths with time at different stress levels (degradation path versus time for 40 °C, 60 °C and 80 °C, with the failure threshold marked)

In order to determine the optimum maintenance schedule at normal operating conditions using accelerated testing results one needs to perform the following two steps: 1. Relate the reliability function at stress conditions to that at normal conditions by developing an appropriate model using the approaches discussed earlier in this chapter or using an ADT model as described in Eghbali and Elsayed (2001), Liao (2004) and Meeker and Escobar (1998). 2. Relate the maintenance threshold level to the operating conditions. For example, when the stress at operating conditions is higher than the mean of the normal conditions then a lower threshold level is used. Similarly, when the stress at operating conditions is lower than the mean of the normal operating conditions then a higher threshold level is used. The first step has been discussed in Section 7.3 and the second step will be discussed in Section 7.6.

7.6 Determination of the Optimum Preventive Maintenance Schedule and Optimum Threshold Degradation Level at Normal Conditions

The optimum preventive maintenance schedule at operating conditions can be determined by relating the reliability function at accelerated conditions to that at normal conditions and then utilizing an optimization function that relates reliability to the preventive maintenance schedule. In Section 7.6.1 we demonstrate these steps through an example. Another approach for determining the optimum preventive maintenance for degrading systems is to determine the optimum threshold degradation level at which maintenance actions are taken by minimizing the overall cost of maintenance or by ensuring a minimum acceptable system availability level (Liao et al. 2005). This will be illustrated in Section 7.6.2.

7.6.1 Optimum Preventive Maintenance Schedule at Operating Conditions

The first step is to relate the accelerated testing results to stress conditions and obtain a reliability expression which is a function of the applied stresses. We then substitute the normal operating conditions in the expression to obtain a reliability function at normal conditions. We illustrate this by designing an optimum test plan and then using its results to obtain the reliability expression. Suppose we develop an accelerated life test plan for a certain type of electronic device using two stresses: temperature and electric voltage. The reliability estimate at the design condition over a 10-year period of time is of interest. The design condition is characterized by 50 °C and 5 V. From engineering judgment, the highest levels (upper bounds) of temperature and voltage are pre-specified as 250 °C and 10 V, respectively. The allowed test duration is 200 h, and the total number of devices placed under test is 200. The minimum number of failures at any test combination is specified as 10. The test plan is determined through the following steps:

1. According to the Arrhenius model, we use 1/(absolute temperature) as the first covariate $z_1$ and 1/(voltage) as the second covariate $z_2$ in the ALT model.

2. The PH model is used in conducting reliability data analysis and designing the optimal ALT plan using the approach described in Section 7.4.1.1. The model is given by

$$\lambda(t; z) = \lambda_0(t) \exp(\beta_1 z_1 + \beta_2 z_2)$$

where $\lambda_0(t) = \gamma_0 + \gamma_1 t + \gamma_2 t^2$.

3. A baseline experiment is conducted to obtain initial estimates for the model parameters. These values are $\hat\gamma_0 = 0.0001$, $\hat\gamma_1 = 0.5$, $\hat\gamma_2 = 0$, $\hat\beta_1 = -3800$, and $\hat\beta_2 = -10$. Approximating $\hat\gamma_0$ by zero, we write the hazard rate function as

$$\lambda(t; T, V) = 0.5\, t\, e^{-\left(\frac{3800}{T} + \frac{10}{V}\right)} \tag{7.15}$$

The reliability and the probability density function (pdf) expressions are respectively given as

$$R(t; T, V) = \exp\left[-\left(e^{-0.25\left((3800/T) + 10/V\right)}\, t^2\right)\right] \tag{7.16}$$

$$f(t; T, V) = 0.5\, t \exp\left[-\left(e^{-0.25\left((3800/T) + 10/V\right)}\, t^2\right)\right] \tag{7.17}$$

Assume that the normal operating temperature is 30 °C and the normal operating voltage is 5 V. Substituting in Equations 7.16 and 7.17 yields

$$R_n(t) = R(t; 30\,^{\circ}\mathrm{C}, 5\,\mathrm{V}) = \exp\left[-\left(e^{-3.6336}\, t^2\right)\right] \tag{7.18}$$

$$f_n(t) = f(t; 30\,^{\circ}\mathrm{C}, 5\,\mathrm{V}) = 0.5\, t \exp\left[-\left(e^{-3.6336}\, t^2\right)\right] \tag{7.19}$$

In the second step, we choose an appropriate preventive maintenance (PM) model and determine the optimum PM schedule. Consider a simple preventive maintenance and replacement policy. Under this policy, two types of actions are performed. The first type is the preventive replacement that occurs at fixed intervals of time. Components or parts are replaced at predetermined times regardless of the age of the component or the part being replaced. The second type of action is the failure replacement, where components or parts are replaced upon failure. This policy is illustrated in Figure 7.3. The most widely used criterion of maintenance models is to minimize the total expected maintenance and replacement cost per unit time. This can be accomplished by developing a total expected cost function per unit time as follows.

Figure 7.3. Constant interval replacement policy (one cycle from 0 to $t_p$: a new item at time 0, failure replacements during the cycle, and a preventive replacement at $t_p$)

Let $c(t_p)$ be the total replacement cost per unit time as a function of $t_p$. Then

$$c(t_p) = \frac{\text{Total expected cost in the interval } (0, t_p]}{\text{Expected length of the interval}} \tag{7.20}$$

The total expected cost in the interval $(0, t_p]$ is the sum of the expected cost of failure replacements and the cost of the preventive replacement. During the interval $(0, t_p]$, one preventive replacement is performed at a cost of $c_p$ and $M(t_p)$ failure replacements at a cost of $c_f$ each, where $M(t_p)$ is the expected number of replacements (or renewals) during the interval $(0, t_p]$. The expected length of the interval is $t_p$. Equation 7.20 can be rewritten as

$$c(t_p) = \frac{c_p + c_f M(t_p)}{t_p} \tag{7.21}$$
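Equation 7.21 lends itself to a simple grid search over candidate intervals. The sketch below is a minimal illustration: the cost values and the expected-failures function $M(t) = 2t^2$ are hypothetical placeholders (in practice $M(t_p)$ comes from the fitted failure model), and the function names are not from the chapter.

```python
def cost_rate(tp, cp, cf, expected_failures):
    """Equation 7.21: (cp + cf * M(tp)) / tp."""
    return (cp + cf * expected_failures(tp)) / tp

def best_interval(candidates, cp, cf, expected_failures):
    """Return the candidate replacement interval with the lowest cost per unit time."""
    return min(candidates, key=lambda tp: cost_rate(tp, cp, cf, expected_failures))

# Hypothetical inputs, for illustration only.
CP, CF = 50.0, 500.0
m_hyp = lambda t: 2.0 * t * t          # assumed expected number of failures in (0, t]
grid = [i / 1000.0 for i in range(1, 1001)]
tp_star = best_interval(grid, CP, CF, m_hyp)
print(tp_star, cost_rate(tp_star, CP, CF, m_hyp))
```

With this assumed $M(t)$ the cost rate is $c_p / t + 2 c_f t$, whose analytic minimizer $\sqrt{c_p / (2 c_f)}$ can be used to confirm that the grid search lands on the nearest grid point.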

We apply the above model to determine the optimum preventive maintenance schedule for the electronic devices of the example, whose reliability and pdf functions, obtained from accelerated conditions, are given in Equations 7.18 and 7.19, respectively. Assuming $c_p = 100$ and $c_f = 1200$, we rewrite Equation 7.21 as

$$c(t_p) = \frac{100 + 1200 \int_0^{t_p} t f_n(t)\, dt}{t_p} \tag{7.22}$$

Calculated values of the cost per unit time are shown in Table 7.1 and plotted in Figure 7.4. The optimum preventive maintenance schedule at normal operating conditions is 0.18 time units.

Table 7.1. Time vs. cost per unit time values (the optimum is marked with *)

Time             0.13   0.14   0.15   0.16   0.17   0.18*  0.19   0.20
Cost/unit time   918    885    862    847    839    836*   840    848

Figure 7.4. Optimum preventive maintenance schedule (cost per unit time versus time)


7.6.2 Optimum Preventive Maintenance Schedule Based on Accelerated or Normal Degradation

Determining the optimum maintenance schedule for systems subject to degradation follows the same procedure described in Section 7.6.1. It begins by developing a degradation model (at normal operating conditions, or at accelerated conditions followed by extrapolation to normal conditions as shown above). Liao et al. (2005) assume that the degradation is described by a gamma process and obtain the optimum degradation level accordingly. Ettouney and Elsayed (1999) obtain the reliability function for different threshold degradation levels. We demonstrate the determination of the degradation threshold level at normal stress levels using the results of Ettouney and Elsayed (1999); then we utilize the optimum degradation level to determine the corresponding optimum preventive maintenance schedule as follows.

Consider the case of corrosion in reinforced concrete bridges, which is a major concern to professional engineers because of both public safety and the cost associated with needed repairs and replacement. Prediction of bridge functional degradation due to corrosion conditions is investigated below. The two main corrosion parameters which affect the reinforcing bars in reinforced concrete bridges are the corrosion rate, $r_{corr}$, and the time it takes to initiate corrosion, $T_1$. Enright and Frangopol (1998) present several mean and variance test measurements for both $r_{corr}$ and $T_1$. In a typical case, they show that the mean and variance of $r_{corr}$ are 0.005 in/year and $3 \times 10^{-6}$ in/year, respectively. The mean and standard deviation of $T_1$ are 10 years and 0.4 years, respectively. In order to estimate the time-variant strength of a corroded reinforced concrete beam, the corrosion effects on the diameter of the reinforcing bars are evaluated first. After the corrosion initiation time, $T_1$, the diameter of a reinforcing bar, $D(t)$, can be evaluated as

$$D(t) = D_i - r_{corr}(t - T_1) \tag{7.23}$$

where $D_i = 1.41$ in. is the initial reinforcing bar diameter and $t$ is the elapsed time. Note that $t \ge T_1$ and $D(t) \ge 0$. For more details of Equation 7.23 the reader is referred to Enright and Frangopol (1998). The time-variant reinforced concrete strength, $M_p(t)$, can now be evaluated using the conventional design equations in Enright and Frangopol (1998):

$$M_p = n A_s f_y \left(d - \frac{a}{2}\right) \tag{7.24}$$

$$a = \frac{n A_s f_y}{0.85 f_c' b} \tag{7.25}$$
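Equations 7.23 to 7.25 can be combined into a short Monte Carlo sketch of the beam strength, with the bar area taken as $A_s = \pi D(t)^2 / 4$ and the material and geometry values set to those chosen for this example ($f_y = 40$ ksi, $f_c' = 3$ ksi, $d = 27$ in., $b = 16$ in.). Several items are assumptions made only for illustration: the number of bars (4), normal distributions for $r_{corr}$ and $T_1$ truncated at zero, and holding the diameter at $D_i$ before corrosion starts.

```python
import math
import random

DI = 1.41       # initial bar diameter, in.
FY = 40.0       # reinforcing steel strength, ksi
FC = 3.0        # concrete strength, ksi
D_EFF = 27.0    # effective depth of the beam, in.
B = 16.0        # width of the beam, in.
N_BARS = 4      # hypothetical: the number of bars is not stated in the text

def bar_diameter(t, r_corr, t1):
    """Equation 7.23; assumed constant before t1 and never below zero."""
    if t <= t1:
        return DI
    return max(DI - r_corr * (t - t1), 0.0)

def strength(t, r_corr, t1, n_bars=N_BARS):
    """Equations 7.24 and 7.25 with A_s = pi * D(t)^2 / 4 per bar."""
    d_t = bar_diameter(t, r_corr, t1)
    a_s = math.pi * d_t * d_t / 4.0
    force = n_bars * a_s * FY
    a = force / (0.85 * FC * B)
    return force * (D_EFF - a / 2.0)

def simulate(t, n_samples, seed=0):
    """Sample strengths at time t; normal inputs are an assumption of this sketch."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        r_corr = max(rng.gauss(0.005, math.sqrt(3e-6)), 0.0)  # mean, sqrt(variance)
        t1 = max(rng.gauss(10.0, 0.4), 0.0)                   # mean, std deviation
        samples.append(strength(t, r_corr, t1))
    return samples

values = simulate(50.0, 1000)
print(sum(values) / len(values))
```

Repeating `simulate` for each year of interest gives a set of simulated strengths per year that can serve as degradation data.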

Note that $A_s = \pi D(t)^2 / 4$. The reinforcing steel and the concrete strengths are $f_y$ and $f_c'$, respectively. The number of reinforcing bars is $n$. The effective depth and the width of the beam are $d$ and $b$, respectively. For the current example, the


values of different parameters are chosen as $f_y = 40$ ksi, $f_c' = 3$ ksi, $d = 27$ in. and $b = 16$ in. Using Equations 7.23 through 7.25, the random time-variant strength, $M_p(t)$, can be estimated. Using the previously mentioned values of $r_{corr}$ and $T_1$ and a Monte Carlo simulation technique, different strength values for different reinforced concrete beams can be simulated. Thus, a discrete time-variant reinforced concrete strength, $x_{ij}$, can be evaluated from the Monte Carlo simulation of the continuous strength $M_p(t)$. Eghbali and Elsayed (2001) show that the reliability function for a specified failure threshold degradation $x$ is expressed as

$$R_x(t) = P(X > x; t) = \exp\left[\frac{-x^{\gamma}}{b \exp(-at)}\right] \tag{7.26}$$

where $X$ is a random variable that represents the degradation measure, and $a$, $b$ and $\gamma$ are constants. The maximum likelihood method was utilized to estimate the parameters of Equation 7.26:

$$L(\gamma, a, b, t) = \prod_{i=1}^{m} \left(\frac{\gamma}{b \exp(-a t_i)}\right)^{n_i} \prod_{i=1}^{m} \prod_{j=1}^{n_i} x_{ij}^{\gamma - 1} \exp\left(\frac{-x_{ij}^{\gamma}}{b \exp(-a t_i)}\right) \tag{7.27}$$

where $m$ is the number of years, $n_i$ is the total number of degradation data in year $i$ and $x_{ij}$ is the strength of unit $j$ in year $i$. Taking the logarithm of Equation 7.27 we obtain

\[ \ln L = \sum_{i=1}^{m} n_i \ln \gamma - \sum_{i=1}^{m} n_i \ln b + \sum_{i=1}^{m} n_i a t_i + \sum_{i=1}^{m} \sum_{j=1}^{n_i} (\gamma - 1) \ln x_{ij} - \sum_{i=1}^{m} \sum_{j=1}^{n_i} \frac{x_{ij}^{\gamma}}{b \exp(-a t_i)} \tag{7.28} \]
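As an illustration only, the log-likelihood of Equation 7.28 can be coded directly and checked against the sum of the log-densities of Equation 7.30, which appears later in this section. The function names and the small data set below are hypothetical and are not taken from Ettouney and Elsayed (1999); this is a sketch of the bookkeeping, not of the solver used in the text.

```python
import math

def log_likelihood(gamma, a, b, times, strengths):
    """Equation 7.28. times[i] is t_i; strengths[i] holds the x_ij observed in year i."""
    total = 0.0
    for t_i, xs in zip(times, strengths):
        n_i = len(xs)
        theta_i = b * math.exp(-a * t_i)  # the term b*exp(-a*t_i)
        total += n_i * math.log(gamma) - n_i * math.log(b) + n_i * a * t_i
        total += sum((gamma - 1.0) * math.log(x) for x in xs)
        total -= sum(x ** gamma / theta_i for x in xs)
    return total

def log_density(x, t, gamma, a, b):
    """Logarithm of the density in Equation 7.30, used here only as a cross-check."""
    theta = b * math.exp(-a * t)
    return math.log(gamma / theta) + (gamma - 1.0) * math.log(x) - x ** gamma / theta

# Hypothetical strengths for two years of simulated beams (illustrative values only).
times = [1.0, 2.0]
strengths = [[4600.0, 4550.0, 4620.0], [4500.0, 4480.0]]
params = (1.49, 0.12, 1.1346598e7)  # gamma, a, b as quoted in the text

ll = log_likelihood(*params, times, strengths)
check = sum(log_density(x, t, *params) for t, xs in zip(times, strengths) for x in xs)
print(ll, check)
```

In practice this function would be handed to a numerical optimizer or root finder; the choice of routine is left open here.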

Equating the partial derivatives of Equation 7.28 with respect to $\gamma$, $a$ and $b$ to zero and solving the resulting equations using a modified Powell hybrid algorithm and a finite difference approximation to the Jacobian yields $a = 0.12$, $b = 1.1346 \times 10^{7}$ and $\gamma = 1.49$. The resulting reliability function is

\[ R_x(t) = P(X > x;\, t) = \exp\left[ \frac{-x^{\gamma}}{b \exp(-at)} \right] \]

or

\[ R_x(t) = \exp\left[ \frac{-x^{1.49}}{1.1346598 \times 10^{7} \times \exp(-0.12\,t)} \right]. \]

The reliability for different threshold values of the strength is shown in Figure 7.5. The time to failure for threshold values of 4800, 4000, 3500, 3000, and 2500 are 25.04, 27.25, 28.88, 30.76, and 33.0 years respectively.

Figure 7.5. Reliability for different threshold levels (curves for s = 2500, 3000, 3500, 4000 and 4800; reliability from 0 to 1 plotted against time from 0 to 60 years)
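The fitted reliability function can be tabulated with a few lines of code. The sketch below simply evaluates Equation 7.26 with the parameter values quoted above for the five thresholds of Figure 7.5; the function name and the ten-year grid are assumptions made for illustration.

```python
import math

GAMMA, A, B = 1.49, 0.12, 1.1346598e7  # fitted values quoted in the text

def reliability(x, t):
    """R_x(t) from Equation 7.26 for threshold x at time t (years)."""
    return math.exp(-(x ** GAMMA) / (B * math.exp(-A * t)))

thresholds = [2500, 3000, 3500, 4000, 4800]
for s in thresholds:
    row = ", ".join(f"{reliability(s, t):.3f}" for t in range(0, 61, 10))
    print(f"s={s}: {row}")
```

Each printed row gives the reliability at 0, 10, ..., 60 years for one threshold, which corresponds to reading one curve of Figure 7.5 at those times.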

The next step is to determine the optimum preventive maintenance schedule for every threshold level and select the schedule corresponding to the smallest cost among all optimum cost values. This will represent both the optimum threshold level and the corresponding optimum preventive maintenance schedule. We demonstrate this for two threshold levels (S = 4800 and S = 2500) assuming $c_p = 10$ and $c_f = 1200$; we utilize Equation 7.21 as follows:

\[ c(t_p) = \frac{10 + 1200 \int_0^{t_p} t\, f(x;\, t)\, dt}{t_p} \tag{7.29} \]

where

\[ f(x;\, t) = \frac{\gamma}{\theta(t)}\, x^{\gamma - 1} \exp\left( \frac{-x^{\gamma}}{\theta(t)} \right), \quad t > 0, \quad \theta(t) = b e^{-at} \tag{7.30} \]

As shown in Figure 7.6, the optimum $t_p$ values for S = 4800 and S = 2500 are 17 and 16 years, respectively. The lower of the two minimum costs is the one corresponding to S = 2500. Therefore, the optimum threshold is 2500 and the corresponding optimum maintenance schedule is 16 years.

Figure 7.6. Total cost per unit time vs. time (curves for S = 4800 and S = 2500; cost per unit time from 0 to 3 plotted against time from 2 to 32)
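A minimal numerical sketch of the cost comparison follows, taking Equations 7.29 and 7.30 exactly as printed. The trapezoidal rule, the 0.05-year step and the 40-year search window are assumptions chosen for illustration, so the optimum it reports need not match Figure 7.6 to the year.

```python
import math

GAMMA, A, B = 1.49, 0.12, 1.1346598e7  # fitted values quoted in the text
C_P, C_F = 10.0, 1200.0                # preventive and failure costs from the text

def density(x, t):
    """f(x; t) from Equation 7.30 with theta(t) = b*exp(-a*t)."""
    theta = B * math.exp(-A * t)
    return GAMMA / theta * x ** (GAMMA - 1.0) * math.exp(-(x ** GAMMA) / theta)

def cost_curve(x, t_max=40.0, step=0.05):
    """Return (t_p, c(t_p)) pairs on a grid, integrating by the trapezoidal rule."""
    n_steps = int(round(t_max / step))
    curve = []
    integral = 0.0
    previous = 0.0  # integrand t*f(x; t) evaluated at t = 0
    for k in range(1, n_steps + 1):
        t = k * step
        current = t * density(x, t)
        integral += 0.5 * (previous + current) * step
        previous = current
        curve.append((t, (C_P + C_F * integral) / t))
    return curve

def optimum(x):
    """Grid point with the smallest cost per unit time inside the search window."""
    return min(cost_curve(x), key=lambda pair: pair[1])

for s in (4800, 2500):
    t_opt, c_opt = optimum(s)
    print(f"S={s}: t_p ~ {t_opt:.2f} years, cost per unit time ~ {c_opt:.3f}")
```

Comparing the two minimum costs returned by `optimum` mirrors the selection rule in the text: the threshold with the smaller minimum cost is retained together with its $t_p$.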

7.7 Summary

In this chapter we present the common approaches for predicting reliability using accelerated life testing. The models are classified as accelerated life testing (ALT) models and accelerated degradation testing (ADT) models. The ALT models are further classified as accelerated failure time models with assumed failure time distributions and “distribution free” models. We also modify the proportional odds model to be used for reliability prediction with multiple stresses. Most of the research in the literature does not extend the use of accelerated life testing beyond reliability predictions at normal conditions. This is the first work that links the ALT to maintenance theory and maintenance scheduling. We develop optimum preventive maintenance schedules for both ALT models and degradation models. We demonstrate how the reliability prediction models obtained from ALT can be used in obtaining the optimum maintenance schedules. We also demonstrate the link between the optimum degradation threshold level and the optimum maintenance schedule. This work can be further extended to include other maintenance costs or the assurance of a minimum availability level of a system. Further work is needed to investigate the relationship between threshold levels at accelerated conditions and those at normal conditions. Moreover, the models need to include the repair rate as well as the availability of spares.

7.8 References

Agresti, A. and Lang, J.B., (1993) Proportional odds model with subject-specific effects for repeated ordered categorical responses, Biometrika, 80, pp. 527–534
Bennett, S., (1983) Log-logistic regression models for survival data, Applied Statistics, 32, 165–171
Brass, W., (1971) On the scale of mortality, In: Brass, W., editor. Biological aspects of Mortality, Symposia of the society for the study of human biology. Volume X. London: Taylor & Francis Ltd.: 69–110
Brass, W., (1974) Mortality models and their uses in demography, Transactions of the Faculty of Actuaries, Vol. 33, 122–133
Ciampi, A. and Etezadi-Amoli, J., (1985) A general model for testing the proportional hazards and the accelerated failure time hypotheses in the analysis of censored survival data with covariates, Commun. Statist. - Theor. Meth., Vol. 14, pp. 651–667
Cox, D.R., (1972) Regression models and life tables (with discussion), Journal of the Royal Statistical Society B, Vol. 34, pp. 187–208
Cox, D.R., (1975) Partial likelihood, Biometrika, Vol. 62, pp. 269–276
Eghbali, G. and Elsayed, E.A., (2001) Reliability estimate using degradation data, in Advances in Systems Science: Measurement, Circuits and Control, Mastorakis, N. E. and Pecorelli-Peres, L. A. (Editors), Electrical and Computer Engineering Series, WSES Press, pp. 425–430
Elsayed, E.A., (1996) Reliability engineering, Addison-Wesley Longman, Inc., New York
Elsayed, E.A. and Jiao, L., (2002) Optimal design of proportional hazards based accelerated life testing plans, International Journal of Materials & Product Technology, Vol. 17, Nos. 5/6, 411–424
Elsayed, E.A. and Zhang, H., (2006) Design of PH-based accelerated life testing plans under multiple-stress-type, to appear in the Reliability Engineering and Systems Safety
Elsayed, E.A., Liao, H., and Wang, X., (2006) An extended linear hazard regression model with application to time-dependent-dielectric-breakdown of thermal oxides, IIE Transactions on Quality and Reliability Engineering, Vol. 38, No. 4, 329–340
Elsayed, E.A. and Zhang, H., (2005) Design of optimum simple step-stress accelerated life testing plans, Proceedings of 2005 International Workshop on Recent Advances in Stochastic Operations Research, Canmore, Canada
Enright, M.P. and Frangopol, D.M., (1998) Probabilistic analysis of resistance degradation of reinforced concrete bridge beams under corrosion, Engineering Structures, Vol. 20, No. 11, pp. 960–971
Etezadi-Amoli, J. and Ciampi, A., (1987) Extended hazard regression for censored survival data with covariates: a spline approximation for the baseline hazard function, Biometrics, Vol. 43, pp. 181–192
Ettouney, M. and Elsayed, E.A., (1999) Reliability estimation of degraded structural components subject to corrosion, Fifth ISSAT International Conference, Las Vegas, Nevada, August 11–13
Hannerz, H., (2001) An extension of relational methods in mortality estimation, Demographic Research, Vol. 4, pp. 337–368
Kalbfleisch, J.D. and Prentice, R.L., (2002) The statistical analysis of failure time data, John Wiley & Sons, New York, New York
Liao, H., Elsayed, E.A., and Ling-Yau Chan, (2005) Maintenance of continuously monitored degrading systems, European Journal of Operational Research, Vol. 75, No. 2, 821–835
Liao, H., (2004) Degradation models and design of accelerated degradation testing plans, Ph.D. Dissertation, Department of Industrial and Systems Engineering, Rutgers University
McCullagh, P., (1980) Regression models for ordinal data, Journal of the Royal Statistical Society, Series B, Vol. 42, No. 2, 109–142
Meeker, W.Q. and Escobar, L.A., (1998) Statistical methods for reliability data, John Wiley & Sons, New York, New York
Nelson, W., (2004) Accelerated testing: statistical models, test plans, and data analyses, John Wiley & Sons, New York, New York
Oakes, D. and Dasu, T., (1990) A note on residual life, Biometrika, 77, pp. 409–410
Pascual, F.G., (2006) Accelerated life test plans robust to misspecification of the stress-life relation, Technometrics, Vol. 48, No. 1, 11–25
Shyur, H-J., (1996) A general nonparametric model for accelerated life testing with time-dependent covariates, Ph.D. Dissertation, Department of Industrial and Systems Engineering, Rutgers University
Shyur, H-J., Elsayed, E.A. and Luxhoj, J.T., (1999) A general model for accelerated life testing with time-dependent covariates, Naval Research Logistics, Vol. 46, 303–321
Tobias, P. and Trindade, D., (1986) Applied reliability, Van Nostrand Reinhold Company, New York, New York
Zhang, H. and Elsayed, E.A., (2005) Nonparametric accelerated life testing based on proportional odds model, Proceedings of the 11th ISSAT International Conference on Reliability and Quality in Design, St. Louis, Missouri, USA, August 4–6
Zhao, W. and Elsayed, E.A., (2005) Optimum accelerated life testing plans based on proportional mean residual life, Quality and Reliability Engineering International

8 Preventive Maintenance Models for Complex Systems David F. Percy

8.1 Introduction Preventive maintenance (PM) of repairable systems can be very beneficial in reducing repair and replacement costs, and in improving system availability, by reducing the need for corrective maintenance (CM). Strategies for scheduling PM are often based on intuition and experience, though considerable improvements in performance can be achieved by fitting mathematical models to observed data; see Handlarski (1980), Dagpunar and Jack (1993) and Percy and Kobbacy (2000) for example. For systems comprising few components, and systems comprising many identical components, modelling and analysis using compound renewal processes might be possible. Such situations are considered by Dekker et al. (1996) and Van der Duyn Schouten (1996). However, many systems comprise a large variety of different components and are too complicated for applying this methodology. We refer to these as complex repairable systems. This chapter reviews basic models for complex repairable systems, explaining their use for determining optimal PM intervals. Then it describes advanced methods, concentrating on generalized proportional intensities models, which have proven to be particularly useful for scheduling PM. Computational difficulties are addressed and practical illustrations are presented, based on sub-systems of oil platforms and refineries. The motivation is that for complex systems, one needs to build models for failures based on the history of maintenance (PM and CM) available. Once a model is built, one can evaluate different PM strategies to determine the best one. The focus is to look at different models and how to determine the best model based on historical data. Section 8.2 presents some real examples of complex systems with historical data sets. In each case, it discusses current maintenance policies and any problems with collection or accuracy of the data. 
Section 8.3 considers the effects of PM and CM actions upon system reliability and availability, so justifying the need for

modelling the operating situations in order to determine suitable scheduling strategies. In Section 8.4, we review the models that can be used for this purpose. We also assess the relevance, strengths and weaknesses of each model and provide references where readers can find more details. The remainder of the chapter presents general recommendations for modelling of complex systems in order to schedule PM in practice. Section 8.5 describes the generalized proportional intensities model, Section 8.6 reviews the method of maximum likelihood for estimating unknown model parameters, Section 8.7 addresses the problem of model selection, and considers statistical tests for this purpose, and Section 8.8 looks at the scheduling problem. Finally, Section 8.9 applies these methods to some of the data of Section 8.2 and Section 8.10 presents some concluding remarks. For convenience, we now present a list of symbols and acronyms that are used throughout this chapter.

PM            Preventive maintenance
CM            Corrective maintenance
ROCOF         Rate of occurrence of failures
NHPP          Nonhomogeneous Poisson process
T1, T2, …     Failure times of a system
X1, X2, …     Inter-failure times of a system
N(t)          Number of failures up to time t
H(t)          History of process up to time t
ι(t)          Intensity function
ι0(t)         Baseline intensity function
Po(µ)         Poisson distribution
F(x)          Cumulative distribution function
f(x)          Probability density function
R(x)          Reliability or survivor function
h(x)          Hazard function
h0(x)         Baseline hazard function
DRP           Delayed renewal process
DARP          Delayed alternating renewal process
VAM           Virtual age model
PHM           Proportional hazards model
IRM           Intensity reduction model
PIM           Proportional intensities model
GPIM          Generalized proportional intensities model
MLE           Maximum likelihood estimate
AIC           Akaike information criterion
BIC           Bayes information criterion

8.2 Examples with Historical Data Sets

Example 8.1 Ascher and Feingold (1984) presented three hypothetical sets of reliability data to illustrate the forms of historical failure information that are typically observed for complex systems. The numbers represent inter-failure times corresponding to happy, sad and noncommittal systems respectively and are displayed in Table 8.1. The inter-failure times are increasing for the happy system, as the system settles down and fewer failures occur later on. This phenomenon can arise with prototype systems, such as a new aircraft, items subject to a burn-in phase of operation, such as a piston engine, and debugging of computer programs. Conversely, the inter-failure times are decreasing for the sad system, as the system ages and wears over time. This situation is very common and applies to most systems, such as television sets, music centres and motor vehicles. The noncommittal system displays no clear trend in inter-failure times.

Table 8.1. Hypothetical reliability data from Ascher and Feingold (1984)

Happy system    Sad system    Noncommittal system
15              177           51
27              65            43
32              51            27
43              43            177
51              32            15
65              27            65
177             15            32

Example 8.2 Percy et al. (1998) published a set of data relating to the reliability and maintenance history of a valve in a petroleum refinery, as displayed in Table 8.2. The two columns successively represent the times in days between maintenance actions and the types of actions, where 0 indicates no failure (PM) and 1 indicates failure (CM). At first glance, this would appear to be a noncommittal system. However, on further inspection, there appear to be fewer failures later on and more preventive actions. Whether the PM is proving to be effective or the system is generally happy is not easy to determine. Modelling can provide these answers though. Based on these data, our ultimate goal is to decide how often to perform PM in future or on similar systems. When collecting such data, it is very important to record all PM and CM events accurately, as errors of omission or commission can result in wrong decisions. For example, if the first failure were not recorded, the average time until system failure over the first 94 days would appear to be twice its actual value, perhaps suggesting that PM is not required.

Table 8.2. Reliability and maintenance history of a petroleum refinery valve

Time since last action    Type of action    Time since last action    Type of action
71                        1                 186                       0
23                        1                 14                        1
64                        1                 8                         1
207                       0                 112                       1
136                       1                 57                        0
66                        1                 28                        1
37                        0                 4                         1
119                       0                 139                       0
2                         1                 250                       0
5                         1                 206                       0
250                       0                 144                       0
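The remark in Example 8.2 about recording errors can be checked directly from Table 8.2. The sketch below encodes the valve history, assuming that the left-hand pair of columns precedes the right-hand pair in time, and compares the average time per failure over the first 94 days with and without the first failure on record. The function and variable names are illustrative.

```python
# (time since last action in days, type of action): 1 = failure (CM), 0 = no failure (PM)
valve_history = [
    (71, 1), (23, 1), (64, 1), (207, 0), (136, 1), (66, 1),
    (37, 0), (119, 0), (2, 1), (5, 1), (250, 0),
    (186, 0), (14, 1), (8, 1), (112, 1), (57, 0), (28, 1),
    (4, 1), (139, 0), (250, 0), (206, 0), (144, 0),
]

def mean_time_per_failure(history, horizon_days):
    """Elapsed time divided by recorded failures, using records that end within the horizon."""
    elapsed, failures = 0, 0
    for days, is_failure in history:
        if elapsed + days > horizon_days:
            break
        elapsed += days
        failures += is_failure
    return elapsed / failures

# If the first failure had been missed, its 71 days would merge into the next interval.
with_omission = [(71 + 23, 1)] + valve_history[2:]

print(mean_time_per_failure(valve_history, 94))  # both early failures recorded
print(mean_time_per_failure(with_omission, 94))  # first failure omitted
print(sum(kind for _, kind in valve_history[:11]), sum(kind for _, kind in valve_history[11:]))
```

With complete records the average is 47 days per failure; with the omission it doubles to 94 days, as stated in the example. The last line counts failures in the first and second halves of the record.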

Example 8.3 Kobbacy et al. (1997) published a set of historical reliability and maintenance data collected from a main pump at an oil refinery over a period of nearly seven years. These data are reproduced in Table 8.3, with consecutive observations reading down the columns successively from left to right.

Table 8.3. Reliability and maintenance history of a main oil refinery pump

Times since last actions
34*     1       37      22      3
14      4       28      51      21
81*     13      38      51      6
86*     27      20      15      26
156*    8       28*     18      15*
20*     148*    44      1       35
96*     92      3       26*     44*
47*     13      56      37      61
45*     13      64      36      84*
97      67*     8*      2       12
88*     29      62*     12*     65*
30      12      8       27      43*
4       1       46      102     4

Right-censored observations corresponding to preventive maintenance are marked by asterisks, whereas unmarked observations correspond to failures and corrective maintenance. To clarify this point, consider the pump’s performance from the time when data collection commenced. After 34 days without failure, PM was performed. The pump then continued to operate for 14 more days and then failed. Following CM, the pump worked for 81 more days without failure and then PM was performed. Following 6 further PM actions, the next failure occurred 676 (=34+14+…+97) days after data collection began. By scanning the inter-event times in Table 8.3, it is clear that preventive maintenance was not performed at regular intervals or according to any other simple pattern. Such irregularity can arise because of opportunistic PM, such as when a maintenance team is on site or has idle time, or because of condition monitoring warnings, such as vibration and noise indicators. In many applications including this, however, PM is simply not modelled and monitored effectively. This can result in excessive repair costs and unacceptable levels of downtime.
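The walk-through above can be reproduced with a short script. The sketch below uses only the first column of Table 8.3, treats an asterisk as a right-censored (PM) observation and accumulates the elapsed days to find when failures occurred. The helper names are hypothetical.

```python
# First column of Table 8.3, in order; "*" marks an observation ended by PM.
first_column = ["34*", "14", "81*", "86*", "156*", "20*", "96*",
                "47*", "45*", "97", "88*", "30", "4"]

def parse(entry):
    """Return (days, is_pm) for one table entry."""
    is_pm = entry.endswith("*")
    return int(entry.rstrip("*")), is_pm

def failure_days(entries):
    """Day, counted from the start of data collection, on which each failure occurred."""
    clock, days_of_failure = 0, []
    for entry in entries:
        days, is_pm = parse(entry)
        clock += days
        if not is_pm:
            days_of_failure.append(clock)
    return days_of_failure

print(failure_days(first_column))
```

The first two values returned are 48 (= 34 + 14) and 676, in agreement with the description in the text.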

8.3 Effects of Preventive and Corrective Maintenance Before considering suitable models for the reliability and maintenance of complex repairable systems, we must consider what is meant by these terms. A complex system consists of any structure of more than one component, which performs a particular function. Typical systems include industrial and domestic machinery, such as production lines, utility supplies, railway operations, motor vehicles, central heating systems and washing machines. We concentrate on industrial systems, which benefit greatly from reliability and maintenance modelling. Such complex systems are often subject to failures, upon which we either discard the systems or repair them. Failures can be total or catastrophic, in which case the system stops working, such as when an exhaust pipe drops off a car or a microchip short circuits in a refrigerator. Alternatively, they can be partial or debilitating, such as when a car headlight bulb blows or a refrigerator clogs up with ice. Total failures incur immediate repair costs. Repairs usually consist of replacing broken components and we incur the costs of replacement parts, labour associated with repair and system downtime. For expensive systems, the cost of replacement parts might contribute most. For dangerous situations, the cost of labour might be most influential. For continuous process industries, the cost of downtime will dominate. As these costs can be very large, management will seek to avoid catastrophic failures by intervening with preventive maintenance at a much smaller cost. Debilitating failures are of less importance, as they do not incur direct costs. However, when observable, they can serve as indicators of when to perform preventive maintenance or capital replacement. Consequently, the failures that this chapter generally refers to are catastrophic in nature. 
Preventive maintenance can be specific, as identified by condition monitoring indicators, or opportunistic, when such actions are convenient because of other environmental factors. These possibilities are very much application dependent and require in-depth analyses, though the models we consider here do extend to include

such information. Much preventive maintenance is less specific in terms of particular systems but not in terms of the work involved, and applies more generally. For example, motor vehicles might be serviced annually according to a strict checklist procedure. The actual work conducted during PM can involve many tasks, such as cleaning surfaces, lubricating joints, sharpening blades, replacing fluids, removing waste, cooling down and redecorating. As for CM, we incur costs of PM due to parts, labour and downtime, though these tend to be substantially less than for repairs. The challenge is to balance the costs of preventive maintenance with the supposed improvements in system reliability. Too few PM actions means we incur big CM costs and small PM costs, whereas too many PM actions means we incur small CM costs and big PM costs. Unfortunately, there is no simple explanation of how CM and PM affect system reliability. By modelling the failure patterns of these systems mathematically, we can gain valuable insights about cost-effective strategies for maintenance and replacement.

8.4 Review of Suitable Models Many mathematical models have been proposed for statistical analysis of complex repairable systems. Table 8.4 presents a summary of the main types. In order to discuss the strengths and weaknesses of each model in more depth, we first introduce some standard notation. Suppose that each time a system fails, we repair it and thereby return it to operational condition. For a preliminary analysis, we also assume that repair times are negligible. Let T1 , T2 , T3 ,… be the times to successive failures of the system and let X i = Ti − Ti −1 be the time between failure i − 1 and failure i where T0 = 0 . The Ti and X i are random variables and we define ti and xi to be their corresponding realized values. Figure 8.1 illustrates this situation. We also define N (t ) as the number of failures in the interval (0, t ] .

Figure 8.1. Notation for a repairable system
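To make the notation concrete, the sketch below converts the inter-failure times $x_i$ of Table 8.1 into failure times $t_i$ and evaluates $N(t)$. The function names are illustrative and repair times are taken as negligible, as assumed above.

```python
from bisect import bisect_right
from itertools import accumulate

# Inter-failure times x_i from Table 8.1
inter_failure_times = {
    "happy": [15, 27, 32, 43, 51, 65, 177],
    "sad": [177, 65, 51, 43, 32, 27, 15],
    "noncommittal": [51, 43, 27, 177, 15, 65, 32],
}

def failure_times(x):
    """t_i = x_1 + ... + x_i, with t_0 = 0 implied."""
    return list(accumulate(x))

def number_of_failures(t_values, t):
    """N(t): number of failures in the interval (0, t]."""
    return bisect_right(t_values, t)

for name, x in inter_failure_times.items():
    t_values = failure_times(x)
    print(name, t_values, number_of_failures(t_values, 100))
```

All three systems accumulate the same total operating time, so the differences between them lie entirely in how the failures are spread over that time.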

We generally model the time to first failure using a familiar lifetime probability distribution or hazard function. However, this approach is inadequate for modelling other times to failure, as the inter-failure times are neither independent nor identically distributed in general (Ascher and Feingold 1984). Stochastic processes form the appropriate basis for models to use under these circumstances. We are interested in the probability that a system fails in the interval (t, t + ε ] given the history of the process up to time t . We describe the behaviour of the failure process by the intensity function (identified here by the Greek letter iota):

\[ \iota(t) = \lim_{\varepsilon \to 0} \frac{P\{ N(t + \varepsilon) - N(t) \ge 1 \mid H(t) \}}{\varepsilon}. \tag{8.1} \]

For an orderly process, where simultaneous failures are impossible, the intensity function is equal to the derivative of the conditional expected number of failures:

\[ \iota(t) = \frac{d}{dt} E\{ N(t) \mid H(t) \}, \tag{8.2} \]

which is referred to as the rate of occurrence of failures (ROCOF).

Table 8.4. Summary of models for complex repairable systems

Model                                         Comments                                                              References
Renewal process                               Repair back to (or replace by) new item                               Taylor and Karlin (1994)
Nonhomogeneous Poisson process                Only CM actions, zero repair times                                    Ascher and Feingold (1984); Crowder et al. (1991); Lindqvist et al. (2003)
Delayed renewal process                       Distributions for failures after PM and CM actions, zero downtimes    Watson (1970)
Delayed alternating renewal process           Fixed or random downtimes                                             Percy et al. (1998a)
Virtual age model                             CM minimal repair, PM reduction in age                                Jack (1998); Doyen and Gaudoin (2004)
Proportional hazards model                    Different hazard functions for failures after PM and CM actions       Cox (1972a); Jardine et al. (1987); Newby (1994); Lutigheid et al. (2004)
Intensity reduction model                     CM minimal repair, PM reduction in intensity function                 Doyen and Gaudoin (2004)
Proportional intensities model                Takes account of covariates, CM as minimal repairs                    Cox (1972b); Percy et al. (1998b)
Generalized proportional intensities model    Both CM and PM affect the intensity function                          Percy and Alkali (2006)

For more details on statistical inference in this context, we refer readers to Crowder et al. (1991). Our fundamental model is the nonhomogeneous Poisson process (NHPP), which effectively implies that a repair restores a system to the state it was in immediately before failure. Such corrective maintenance effects are commonly referred to as minimal repairs. The NHPP satisfies these conditions for $0 < s < t$:

(i) $N(0) = 0$ [system initialisation at time $t = 0$]
(ii) $\{ N(t) - N(s) \} \perp N(s)$ [independence of increments]
(iii) $\{ N(t) - N(s) \} \sim \mathrm{Po}\left\{ \int_s^t \iota(u)\, du \right\}$ [Poisson inter-failure times]
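Condition (iii) translates into a simple calculation: integrate the intensity over the interval to obtain the Poisson mean, then evaluate the Poisson probabilities. The sketch below is one way to do this numerically; the trapezoidal rule, the step count and the two example intensities are assumptions for illustration.

```python
import math

def expected_failures(intensity, s, t, steps=1000):
    """Integral of the intensity over (s, t], by the trapezoidal rule."""
    h = (t - s) / steps
    total = 0.5 * (intensity(s) + intensity(t))
    total += sum(intensity(s + k * h) for k in range(1, steps))
    return total * h

def prob_k_failures(intensity, s, t, k):
    """P{N(t) - N(s) = k} under condition (iii)."""
    mu = expected_failures(intensity, s, t)
    return math.exp(-mu) * mu ** k / math.factorial(k)

constant = lambda u: 0.5        # hypothetical constant intensity
increasing = lambda u: 0.1 * u  # hypothetical linearly increasing intensity

print(expected_failures(constant, 0.0, 4.0), prob_k_failures(constant, 0.0, 4.0, 0))
print(expected_failures(increasing, 0.0, 10.0), prob_k_failures(increasing, 0.0, 10.0, 0))
```

A constant intensity recovers the homogeneous Poisson process, while a time-varying intensity changes only the mean used in the Poisson formula.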

If we were to model the time to first failure of a complex system as a random variable $X$, we could describe its probability distribution by a cumulative distribution function $F(x) = P(X \le x)$ with corresponding probability density function $f(x) = F'(x)$, reliability or survivor function $R(x) = 1 - F(x)$ and hazard function $h(x) = f(x) / R(x)$. Typical examples of probability density functions are:

(i) $f(x \mid \lambda) = \lambda \exp(-\lambda x)$; $x > 0$ [exponential]
(ii) $f(x \mid \alpha, \lambda) = \dfrac{\lambda^{\alpha} x^{\alpha - 1} \exp(-\lambda x)}{\Gamma(\alpha)}$; $x > 0$ [gamma]
(iii) $f(x \mid \alpha, \beta) = \alpha \beta (\alpha x)^{\beta - 1} \exp\{ -(\alpha x)^{\beta} \}$; $x > 0$ [Weibull]
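The relationship between the density, the reliability function and the hazard function can be illustrated with the Weibull form (iii). In the sketch below the reliability function $R(x) = \exp\{-(\alpha x)^{\beta}\}$ follows from integrating that density; the function names and parameter values are chosen only for illustration.

```python
import math

def weibull_pdf(x, alpha, beta):
    """Density (iii): alpha*beta*(alpha*x)**(beta-1) * exp(-(alpha*x)**beta)."""
    return alpha * beta * (alpha * x) ** (beta - 1.0) * math.exp(-((alpha * x) ** beta))

def weibull_reliability(x, alpha, beta):
    """R(x) = 1 - F(x) = exp(-(alpha*x)**beta)."""
    return math.exp(-((alpha * x) ** beta))

def hazard(x, alpha, beta):
    """h(x) = f(x) / R(x)."""
    return weibull_pdf(x, alpha, beta) / weibull_reliability(x, alpha, beta)

for x in (1.0, 5.0, 10.0):
    print(x, hazard(x, 0.1, 1.0), hazard(x, 0.1, 2.0))
```

With $\beta = 1$ the hazard is constant, which is the exponential case (i); with $\beta > 1$ it increases with $x$, as expected for a system that wears out.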

The form of the hazard function is precisely the same as the form of the intensity function if we were to use a stochastic process to model the complex system. For a nonhomogeneous Poisson process, this intensity function applies beyond the first failure. However, successive hazard functions for inter-failure times have different forms, which correspond to shifted and truncated versions of the distribution for time to first failure. Imperfect maintenance models must allow for the dynamic evolution of a system and take account of hypothesized and observed knowledge about the effectiveness of repairs. As mentioned above, this section reviews a variety of existing models for repairable systems and describes suitable adaptations for systems that are subject to preventive maintenance. In passing, we remark that time is used as the only scale of measurement here. Some applications use running time instead, or both, such as the flight time of an aircraft or the mileage and age of a car. Further details of such variations are described by Baik et al. (2004) and Jiang and Jardine (2006).

8.4.1 Renewal Process (Maximal Repair) This model assumes that repairs renew a system to its condition as new. A renewal process is a counting process that registers the successive occurrence of events during a given time interval ( 0,t ] where the time durations between consecutive events X 1 , X 2 , X 3 ,… form a sequence of independent and identically distributed non-negative random variables. The special case where their distribution is exponential corresponds to the homogeneous Poisson process. We can characterize the intensity function of a renewal process by

\[ \iota(t) = \iota_0\left( t - t_{N(t)} \right) \tag{8.3} \]

where $\iota_0(t)$ is the baseline intensity function, which would prevail if there were no system failures. As this is a renewal process, the baseline intensity function is equal to the hazard function for the inter-failure times: $\iota_0(x) = h(x)$. The baseline intensity function can take many forms, including:

(i) $\iota_0(t) = \alpha$ [constant]
(ii) $\iota_0(t) = \alpha \beta^{t}$ [loglinear]
(iii) $\iota_0(t) = \alpha t^{\beta}$ [power-law]
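As a small illustration of Equation 8.3, the sketch below evaluates the renewal-process intensity for a power-law baseline of form (iii). The failure times and parameter values are hypothetical; the only point is that the argument of the baseline restarts from zero at each failure.

```python
from bisect import bisect_right

def power_law(alpha, beta):
    """Baseline intensity of form (iii): alpha * t**beta."""
    return lambda t: alpha * t ** beta

def renewal_intensity(baseline, failure_times, t):
    """Equation 8.3: evaluate the baseline at the time elapsed since the last failure."""
    n = bisect_right(failure_times, t)                     # N(t)
    last_failure = failure_times[n - 1] if n > 0 else 0.0  # t_{N(t)}, with t_0 = 0
    return baseline(t - last_failure)

baseline = power_law(0.01, 1.5)   # hypothetical parameter values
failures = [10.0, 25.0]           # hypothetical failure times

for t in (5.0, 12.0, 30.0):
    print(t, renewal_intensity(baseline, failures, t), baseline(t))
```

The second printed value on each line is the intensity under perfect renewal at each failure and the third is the baseline evaluated at the global time, which shows how strongly the renewal assumption lowers the intensity after each repair when the baseline is increasing.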

The renewal process is a plausible first-order model for components or parts when the repair time is negligible, since complete replacement of a component after failure implies renewal instead of repair. Conversely, the renewal process is a poor model for complex systems, where repairs involve replacing or restoring just a fraction of the system’s components. If a large portion of a system needs to be restored, it is often more economical to replace the entire system. Even if a repair restores the system’s performance to its original specification, the presence of predominantly aged components implies that system reliability is not renewed.

8.4.2 Nonhomogeneous Poisson Process (Minimal Repair)

The assumptions underlying this model imply that, when a repair is carried out, a system assumes the same condition that it was in immediately before failure. The nonhomogeneous Poisson process (NHPP) differs from the homogeneous Poisson process only in that the rate of occurrence of failures varies with time rather than being constant. As mentioned earlier in this section, it is the fundamental model for repairable systems. The NHPP is also the most appropriate model for the reliability of a complex system comprising infinitely many components. However, for a finite number of components, this model can only serve as an approximation, often a poor one, as the intensity function changes following each repair. In this model, the inter-arrival times X1, X2, X3, … are neither independent nor identically distributed.

An important characteristic of the NHPP is that the intensity function depends on the system’s global operating time, measured from the instant the system is put into operation. A simple NHPP model can be expressed as

\[ \iota(t) = \iota_0(t) \tag{8.4} \]

where $\iota_0(t)$ is the baseline intensity function introduced earlier. In modelling the reliability of repairable systems under the nonhomogeneous Poisson process assumptions, the numbers of events in non-overlapping intervals are independent random variables and the intensity becomes the rate of occurrence of failures or peril rate of a repairable system. This model corresponds to minimal repair, whereby system reliability returns to the condition immediately before failure. If repair times are small relative to times between failures, so that they can be ignored, then we have $\iota(t) = h(t)$.

8.4.3 Delayed Renewal Process

We refer to a repairable system as stationary if there is no long-term improvement or deterioration of its performance. For many applications, the assumptions of renewal and minimal repair are too restrictive. We have encountered the need for an alternative scenario that allows for minor repairs, as follows:

• Corrective maintenance is performed upon failure, to restore the system to a reasonable operating state
• Preventive maintenance takes place at regular intervals, to reset the system to a good operating state

Corrective maintenance (CM) corresponds to major or minor repair work and may involve replacing the damaged components, whereas preventive maintenance (PM) usually corresponds to minor interventions such as lubrication, cleaning and inspection. Given this structure, we assume that failure times after corrective operations are independent and identically distributed, as are failure times after preventive operations. However, we allow for different probability distributions in the two cases and this defines the delayed renewal process (DRP). This is not a simple renewal process, because of the different lifetime distributions following the two types of action. However, the simple renewal process could be regarded as a limiting case of the DRP, if corrective operations were to repair the system to the same state as preventive operations. Maximal repairs involve restoring the system upon failure to its condition as new. Similarly, if corrective operations were to restore the system to the state immediately before failure, minimal repairs would result. This is not strictly a special case of the delayed renewal process, but a computer program could easily allow for this assumption if required. However, we believe that minimal repairs are convenient for mathematical modelling but are not always valid in practice.

Figure 8.2. Delayed renewal process

As shown in Figure 8.2, define the random variables U and V to be the lifetimes after PM and CM respectively. Their probability density functions, conditional upon known parameters, are fU ( u ) and fV ( v ) respectively. These distributions might take the exponential, gamma or Weibull forms defined earlier, to achieve the required flexibility. Note that the exponential distribution is a limiting case of the gamma as α → 1 and Weibull as β → 1 . The DRP assumes that downtimes are negligible compared with the costs of parts and labour. We now consider the effects of non-ignorable downtimes. 8.4.4 Delayed Alternating Renewal Process The delayed renewal process described above assumes that the downtimes for preventive and corrective maintenance are negligible when compared with the lifetimes. It also assumes that the costs associated with these downtimes are dominated by the costs of parts and labour. The model and analysis are further complicated when we allow for periods of downtime, when maintenance actions take place. In many applications involving continuous-process industries, the principal costs are not due to parts and labour, but are due to lost production whilst the system is down. Consequently, we must consider downtime costs and durations when determining cost-effective strategies for scheduling PM. This extension results in the delayed alternating renewal process (DARP), for which analytical solution is not even feasible in practice. The downtimes following preventive and corrective maintenance can be fixed or random. Since analytical solution of the optimisation problems is not possible and we are adopting a simulation approach here, either of these can be included in the calculations with ease. In the following work, we consider them fixed to avoid confusion. 
Another benefit of simulation over numerical solution of the renewal equations is that anomalies are readily catered for, such as switching from CM to PM if the system is in the failed state when PM is due. The DARP is illustrated in Figure 8.3.

Figure 8.3. Delayed alternating renewal process

The delayed alternating renewal process is appropriate when the time to replace (or repair back to new) a failed item is non-zero. In this case, we have working and


D. Percy

failed states and these alternate. So far, we have only allowed for systems that display no long-term trends, corresponding to improvement or deterioration. We now discuss age-based models that allow for such trends. These models can also be used for stationary and non-stationary systems when concomitant information is available. We discuss these benefits later, as the need for including such extra sources of information is described.

8.4.5 Virtual Age Model (Rejuvenation)

The virtual age model (VAM) modifies the hazard function for a system’s interfailure times at each corrective maintenance action. For these repairs, the system’s virtual age at any given time is determined by a variety of additive or multiplicative age-reduction factors. This resets the system to a younger state, which is only an approximation for reasons mentioned earlier. The intensity function of a point process under the age reduction model may be additive

\[ \iota(t) = \iota_0\left( t - \sum_{i=1}^{N(t)} s_i \right) \tag{8.5} \]

or multiplicative

\[ \iota(t) = \iota_0\left( t \prod_{i=1}^{N(t)} s_i \right) \tag{8.6} \]

where, in both cases, the s_i are constants representing the age reduction factors, and ι0(t) is the baseline intensity function again. In order to evaluate the intensity function for a sequence of failures under age reduction, the renewal function governs the system failure pattern. The additive model can generate negative intensities but the multiplicative model is suitable if replacement components are infallible. The age-reduction model has been applied to systems under a block replacement policy. A critical defect of the age-reduction model and its many variants is that they do not provide a realistic description of the failure processes. For example, replacing a corroded exhaust pipe does not reduce a car’s age, as very many other components are no less likely to fail.

8.4.6 Proportional Hazards Model

The proportional hazards model (PHM) is more flexible than the renewal process, DRP and DARP, as it allows for non-stationarity. It is also more flexible than the virtual age model because it allows for concomitant information. In principle, this model appears to be inappropriate for representing a complex system, because hazards naturally relate to lifetimes of components rather than inter-failure times of processes. We cannot physically justify this model as readily as the proportional intensities model described later. However, this does not invalidate its use in this context as a statistical model rather than a mathematical model, and considerable success in applying the proportional hazards model to real reliability and PM scheduling problems has been achieved. In formulating the PHM for a repairable system, we adopt different hazard functions after PM

\[ \kappa(u) = \kappa_0(u) \exp\left( y_t' \gamma \right) \tag{8.7} \]

and after CM

\[ \lambda(v) = \lambda_0(v) \exp\left( z_t' \delta \right) \tag{8.8} \]

where u and v represent the lifetimes following PM and CM respectively. The baseline hazard functions can take any suitable forms, including exponential, Gumbel and Weibull. The covariates that might be contained in the vectors y_t and z_t include cumulative observations of:

• Time since last PM
• Time since last CM
• Total number or total downtime of PMs
• Total number or total downtime of CMs
• Average PM interval duration

We might consider other factors and covariates for inclusion here, representing the concomitant information mentioned earlier. These could include:

• Severity measures of failures
• Quality measures of maintenance
• Condition-monitoring measurements

Temporal, or continuously time varying covariates (time since last PM and time since last CM) cause substantial computational difficulties. These may be avoided by choosing baseline hazard functions that are sufficiently flexible. The vectors γ and δ contain the regression coefficients, which generally take the form of unknown parameters. The results from extensive analyses demonstrate that this proportional hazards model is flexible, easy to use and of considerable practical value, despite its doubtful mathematical suitability for modelling repairable systems.

8.4.7 Intensity Reduction Model (Correction)

Improvement factors feature in additive and multiplicative intensity reduction models (IRM) for imperfect maintenance. Perhaps the most suitable of these is an intensity reduction model that involves a multiplicative scaling of the intensity function upon each failure and repair. This is the natural model for systems that are improving or deteriorating with time and provides a perfect description of the physical situation. This model can be expressed as an NHPP with intensity function

\[ \iota(t) = \iota_0(t) \prod_{i=1}^{N(t)} s_i \tag{8.9} \]

where the s_i are constants representing the intensity reduction factors and ι0(t) is the baseline intensity function again. We later generalize this model by supposing the s_i are simple functions of i, or are random variables that are independent of the failure and repair process. Having concluded that this model is ideally suited to modelling complex repairable systems, this chapter later considers how to extend it to allow for preventive maintenance and concomitant information.

8.4.8 Proportional Intensities Model

Whilst the proportional hazards model offered a valuable generalization of the delayed renewal process and delayed alternating renewal process to allow for nonstationarity and concomitant information, it is not the natural model for repairable systems. The natural model takes the form of a nonhomogeneous Poisson process and is the essence of the proportional intensities model (PIM), which is the subject of this subsection and is a generalization of the intensity reduction model described above. Define the random variable N(t) as the number of system failures by time t. Then the NHPP is characterised by conditionally independent increments, corresponding with conditionally independent times between failures that occur with intensity

\[ \iota(t) = \lim_{\varepsilon \to 0} \frac{P\{ N(t+\varepsilon) - N(t) \ge 1 \mid H(t) \}}{\varepsilon} \tag{8.10} \]

at system age t units, where H(t) is the history of the process. However, the NHPP corresponds with minimal repair as in Section 8.4.2 and makes no allowances for system improvement, or even deterioration, arising from maintenance actions. Hence, we modify the intensity function by introducing a multiplicative factor, so that we can express the intensity function as

\[ \iota(t) = \iota_0(t) \exp\left( x_t^T \beta \right), \tag{8.11} \]

where the baseline intensity ι0(t) has a standard form such as constant, loglinear or power-law. Furthermore, the parameter vector β represents the regression coefficients and the observation vector x_t contains factors and covariates relating to the system, such as the cumulative observations and concomitant information mentioned in Section 8.4.6.

An alternative option arises when using the PIM to model a complex repairable system subject to PM. Rather than adopting a global time scale for the baseline intensity function as implied above, we could reset the time scale of the baseline intensity function to zero upon each PM action. This introduces an element of renewal that might be applicable if PM involves major reworking. System age could then be included amongst the covariates if necessary. However, this intervention results in a hybrid model, which suffers from the same difficulty of interpretation as does the PHM.

As for predictor variables, the process simulation calculations for scheduling PM simplify greatly if we hold factors and covariates at fixed values throughout each PM interval. However, this essentially treats all CM as minimal repair work, an assumption that we earlier claimed is often unreasonable. To avoid this constraint, we need to consider variables that change during a PM interval, such as the cumulative number of failures. The computational effort required to incorporate such temporal covariates in our simulation is immense, but this relates to computer power rather than manpower and so is quite acceptable.

8.5 Generalized Proportional Intensities Model (GPIM)

The GPIM is this chapter’s main model of interest, as it allows for covariates and offers much potential for decision making related to scheduling preventive maintenance. Special cases of the GPIM are the intensity reduction model and the proportional intensities model investigated in Section 8.4. An algebraic representation of the GPIM in terms of the intensity function is given by

\[ \iota(t) = \iota_0(t) \left\{ \prod_{i=1}^{M(t)} r_i \right\} \left\{ \prod_{j=1}^{N(t)} s_j \right\} \exp\left( x_t^T \beta \right). \tag{8.12} \]

Here, ι0(t) is the baseline intensity function, whilst r_i > 0 and s_j > 0 are the intensity scaling factors for preventive maintenance (PM) and corrective maintenance (CM) actions respectively. Furthermore, M(t) and N(t) are the total numbers of PM and CM actions, whilst x_t is a vector of predictor variables and β is an unknown parameter vector of regression coefficients. One might expect the r_i and s_j to be less than one for a deteriorating system and greater than one for an improving system, though replacing failed components with used parts and accidentally introducing faults during maintenance can produce the opposite effects. System copies can have different forms of baseline intensity function. For reduction of intensity, the scaling factors can take the forms of positive constants, random variables, deterministic functions of time (t) and events (i and j) or stochastic functions of time and events. As for the intensity reduction model described in Section 8.4.7, a reasonable assumption for initial analysis is that r_i = ρ for i = 1, 2, …, M(t) and s_j = σ for j = 1, 2, …, N(t), in which case the GPIM corresponds with the PIM of Section 8.4.8. The vector of predictor variables might include:

• Quality of last maintenance action
• Time since last maintenance action
• Condition indicators

The quality of maintenance affects the functionality of a system and its future performance. Our justification for including the time since last maintenance here is to allow for the possibility that maintenance interventions can introduce problems similar to the burn-in of new components. The first of these is a discrete function of time, whereas the second is a continuous function of time. Condition indicators, when available, give direct and very strong guidance on the likely occurrence of failures. They are typically discrete functions of time that vary at, and between, maintenance actions.
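As a minimal sketch of how the intensity in (8.12) might be evaluated under the initial-analysis assumption of constant scaling factors (r_i = ρ, s_j = σ), consider the following Python fragment. The function names, the power-law baseline and every numerical value are hypothetical choices made only for illustration; they are not taken from the chapter.

```python
import math


def gpim_intensity(t, baseline, rho, sigma, m_t, n_t, x_t, beta):
    """Evaluate (8.12) assuming constant factors r_i = rho and s_j = sigma.

    baseline : callable giving the baseline intensity iota_0(t)
    m_t, n_t : numbers of PM and CM actions up to time t, M(t) and N(t)
    x_t, beta: predictor values and regression coefficients (same length)
    """
    linear_predictor = sum(x * b for x, b in zip(x_t, beta))
    return baseline(t) * rho ** m_t * sigma ** n_t * math.exp(linear_predictor)


def power_law_baseline(t, a=0.01, b=0.5):
    """Hypothetical power-law baseline a * t**b (illustrative values only)."""
    return a * t ** b


# Illustrative call: two PM and three CM actions so far, two predictors.
value = gpim_intensity(
    t=100.0, baseline=power_law_baseline,
    rho=0.7, sigma=0.9, m_t=2, n_t=3,
    x_t=[1.0, 0.5], beta=[0.1, -0.2],
)
print(f"intensity at t = 100: {value:.5f}")
```

With no predictors and no maintenance actions the function returns the baseline intensity itself, which is a convenient sanity check when adapting the sketch.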

8.6 Parameter Estimation

All of the preceding models contain unknown parameters. In order to make any decisions based on these models, such as determining when to schedule the next PM activity, we need to quantify our knowledge about these parameters subjectively and empirically. Three forms of inference are applicable here. In increasing order of accuracy and precision, but also of algebraic complexity, they are naïve (fully subjective), frequentist (fully objective) and Bayesian (both subjective and objective). The first of these is trivial, whereas the others both require us to specify the likelihood function.

Firstly, consider the delayed renewal process of Section 8.4.1. In practical applications, the model parameters for each of the PM and CM lifetime distributions are unknown and we need some subjective or objective information about these parameters. Subjective information typically represents the expert views of maintenance engineers about a system’s repair and failure process, and can take many forms such as simply specifying values for the unknown parameters. Objective information typically takes the form of historically observed failure and repair data for the system under consideration, of the form

\[ D = \left\{ (u_i, v_{ij}) ;\ i = 1, \ldots, n ;\ j = 1, \ldots, n_i \right\} \tag{8.13} \]

which covers n complete PM intervals, where interval i contains n_i failures. Note that the u_i are right censored if n_i = 0 and the v_ij are right censored when j = n_i. Otherwise, the observations represent actual failure times. We introduce the indicator variables

\[ c_i = \begin{cases} 0 ; & u_i \text{ right censored} \\ 1 ; & u_i \text{ observed lifetime} \end{cases} \tag{8.14} \]

and

\[ d_{ij} = \begin{cases} 0 ; & v_{ij} \text{ right censored} \\ 1 ; & v_{ij} \text{ observed lifetime} \end{cases} \tag{8.15} \]

to identify when observations are right censored. The likelihood function for this delayed renewal process then becomes

\[ L(\theta, \phi ; D) \propto \prod_{i=1}^{n} \left[ \{ f(u_i \mid \theta) \}^{c_i} \{ R(u_i \mid \theta) \}^{1 - c_i} \prod_{j=1}^{n_i} \{ f(v_{ij} \mid \phi) \}^{d_{ij}} \{ R(v_{ij} \mid \phi) \}^{1 - d_{ij}} \right] \tag{8.16} \]

where R(⋅) represents the corresponding reliability function. Due to the nature of the DRP model, this likelihood function can be written as the product of a function of θ and a function of φ, so that

\[ L(\theta, \phi ; D) = L(\theta ; D) \, L(\phi ; D) \tag{8.17} \]

where

\[ L(\theta ; D) \propto \prod_{i=1}^{n} \{ f(u_i \mid \theta) \}^{c_i} \{ R(u_i \mid \theta) \}^{1 - c_i} \tag{8.18} \]

and

\[ L(\phi ; D) \propto \prod_{i=1}^{n} \prod_{j=1}^{n_i} \{ f(v_{ij} \mid \phi) \}^{d_{ij}} \{ R(v_{ij} \mid \phi) \}^{1 - d_{ij}} . \tag{8.19} \]
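The factor (8.18) can be sketched in code once a lifetime distribution is chosen. The fragment below assumes a two-parameter Weibull distribution with density (shape/scale)(u/scale)^(shape−1) exp{−(u/scale)^shape}; this parameterisation, the function names and the data values are assumptions made for illustration and may differ from the forms defined earlier in the chapter. Observed lifetimes (c_i = 1) contribute the log-density and right-censored ones (c_i = 0) contribute the log-reliability.

```python
import math


def weibull_log_pdf(u, shape, scale):
    """Log-density of the assumed Weibull(shape, scale) lifetime model."""
    z = u / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape


def weibull_log_reliability(u, shape, scale):
    """Log of R(u) = exp{-(u/scale)**shape}."""
    return -((u / scale) ** shape)


def censored_log_likelihood(times, observed, shape, scale):
    """Logarithm of (8.18), up to an additive constant.

    times    : lifetimes u_i following PM
    observed : indicators c_i (1 = observed lifetime, 0 = right censored)
    """
    total = 0.0
    for u, c in zip(times, observed):
        if c == 1:
            total += weibull_log_pdf(u, shape, scale)
        else:
            total += weibull_log_reliability(u, shape, scale)
    return total


# Hypothetical data: the second lifetime is right censored.
example = censored_log_likelihood([2.0, 3.0, 5.0], [1, 0, 1], shape=1.0, scale=4.0)
print(f"log-likelihood: {example:.4f}")
```

The same function applies to the CM lifetimes in (8.19) by passing the v_ij and d_ij in place of the u_i and c_i. A frequentist fit would maximise this function numerically over the shape and scale.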

For a frequentist analysis, we evaluate the maximum likelihood estimates of θ and φ by maximising the natural logarithm of this function with respect to these parameters. Subsequent inference generally assumes that the parameters are equal to these values. To avoid the errors that arise through adopting a naïve or frequentist approach, we can instead adopt a Bayesian approach. This leads naturally to a decision-theoretic solution to the problem of PM scheduling and we refer interested readers to the article by Percy et al. (1998a) for details.

We now turn our attention to parameter estimation for the nonhomogeneous Poisson process. For failure times T_1, T_2, …, T_N(T) with observed values t_1, t_2, …, t_N(T) in the interval (0, T], the likelihood function corresponding to a NHPP with intensity function ι(t) is given by

\[ L\{ \iota ; H(t) \} \propto \left\{ \prod_{i=1}^{N(T)} \iota\left( t_i^- \right) \right\} \exp\left\{ - \int_0^T \iota(t) \, dt \right\} \tag{8.20} \]

and so the log-likelihood function becomes

\[ l\{ \iota ; H(t) \} = \text{const.} + \sum_{i=1}^{N(T)} \log \iota\left( t_i^- \right) - \int_0^T \iota(t) \, dt . \tag{8.21} \]

Therefore, once we specify the formulation of ι(t), we can obtain estimates for its unknown parameters via likelihood-based methods.

Example 8.4 Assuming T = t_N(T) so that observation ceases at a failure, the maximum likelihood estimates (MLEs) can be determined analytically for the power-law process (NHPP with power-law intensity). With ι(t) = α t^β and n = N(T), the MLEs are

\[ \hat{\beta} = \frac{n}{\sum_{i=1}^{n} \log \left( T / t_i \right)} - 1 \tag{8.22} \]

and

\[ \hat{\alpha} = \frac{n \left( \hat{\beta} + 1 \right)}{T^{\hat{\beta} + 1}} . \tag{8.23} \]
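The closed forms (8.22) and (8.23) are easy to check numerically. The short Python sketch below applies them to the arrival times quoted in this example; the function name is a hypothetical choice, and the code assumes, as the example does, that observation ceases at the last failure so that T equals the final arrival time.

```python
import math


def power_law_mle(arrival_times):
    """MLEs (8.22)-(8.23) for iota(t) = alpha * t**beta, with T = last arrival."""
    n = len(arrival_times)
    T = arrival_times[-1]
    log_sum = sum(math.log(T / t) for t in arrival_times)
    beta_hat = n / log_sum - 1.0
    alpha_hat = n * (beta_hat + 1.0) / T ** (beta_hat + 1.0)
    return alpha_hat, beta_hat


# Successive arrival times (days) from Example 8.4.
times = [15, 42, 74, 117, 168, 233, 410]
alpha_hat, beta_hat = power_law_mle(times)
print(f"beta_hat  = {beta_hat:.4f}")
print(f"alpha_hat = {alpha_hat:.5f}")
```

The final arrival time contributes log(T/T) = 0 to the sum, so only the first six failures influence the estimate of β.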

For a particular system, successive arrival times (not inter-arrival times) were observed to be 15, 42, 74, 117, 168, 233 and 410 days. With n = 7 , T = 410 and t1 = 15,… , t7 = 410 , we have βˆ ≈ −0.3007 and then αˆ ≈ 0.07288 . As βˆ < 0 , the intensity is a strictly decreasing function of time; this is a happy system that seems to improve with age. Analysis of the intensity based models follows by extending this likelihood function corresponding to the NHPP. Consider the generalized proportional intensities model of Section 8.5. The choice of which predictor variables to include depends upon the sample size (history of failures) and the results of standard selection procedures based on analyses of deviance for nested models. Only important predictors should be included in order to produce a robust model. We can estimate the parameters in the model by maximum likelihood, on extending the NHPP likelihood presented above, whereby the log-likelihood is given by

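\[ l\{ \iota ; H(T) \} = \text{const.} + \sum_{k=1}^{n} c_k \left\{ \log \iota_0\left( t_k^- \right) + M\left( t_k^- \right) \log \rho + N\left( t_k^- \right) \log \sigma + x_{t_k^-}^T \gamma \right\} - \sum_{k=0}^{n} \left\{ \rho^{M(t_k)} \sigma^{N(t_k)} \int_{t_k}^{t_{k+1}} \iota_0(t) \exp\left( x_t^T \gamma \right) dt \right\} . \tag{8.24} \]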

This corresponds to the simple case where the scaling factors are constant: minor changes are needed for the more general cases.


8.7 Model Selection

In this section, we consider how to choose among the many approaches described above, namely the renewal process (RP), delayed renewal process (DRP), delayed alternating renewal process (DARP), virtual age model (VAM), proportional hazards model (PHM), nonhomogeneous Poisson process (NHPP), intensity reduction model (IRM), proportional intensities model (PIM) and generalized proportional intensities model (GPIM). The main distinguishing features are process stationarity, goodness of fit, mathematical robustness, consistency and ease of implementation. All of these models are concerned with describing the failure and repair process of complex repairable systems subject to preventive maintenance. We subsequently use the fitted models to forecast the system behaviour under different PM strategies by simulation. This enables us to determine the optimal strategy by minimising the expected cost per unit time over a suitable horizon, finite or infinite, with respect to suitable loss or utility functions.

The RP only applies to individual components, for which corrective and preventive maintenance effectively amount to replacement. The DRP and DARP apply to stationary systems, whereas all of the later models allow for nonstationarity. The DRP is easier to fit than the DARP, but ignores the influence of downtimes, so we use the latter if these are significant.

Ascher and Feingold (1984) discuss several methods for assessing stationarity. Perhaps the simplest of these is a graph that plots the observed cumulative number of failures against the observed cumulative operating hours. Consistent departures from linearity might suggest that some trends are present. Naturally, we must exercise care to avoid distorting the results when allowing for PM interventions. Sometimes though, we seek a formal hypothesis test to assess whether the assumption of stationarity is reasonable. One of these is Laplace's trend test, which is simple and sufficient for most needs.
Suppose we observe the system history from time 0 until time t and suppose that we observe n failures at times t_1, t_2, …, t_n. Then Laplace's trend test compares the test statistic

\[ U = \frac{\sum_{i=1}^{n} t_i - \dfrac{n t}{2}}{t \sqrt{\dfrac{n}{12}}} \tag{8.25} \]

with standard normal critical values, rejecting the null hypothesis of no trend if U ∉ (−z_{p/2}, z_{p/2}) for a hypothesis test at the 100p% level of significance, where the proportion p represents the size of the test. For a 5% significance test, the critical values are given by z_{p/2} = 1.960.

If we decide that a system is nonstationary, we could use the VAM or PHM, which are easier to fit to data than the stochastic processes considered next, but are less robust because of their statistical rather than mathematical derivation. However, all of these models require numerical computation to some extent. The VAM and PHM might provide a better fit to the observed data on occasions, though the mathematical justification and consequent robustness of the stochastic processes are most appealing properties. The NHPP corresponds to minimal repairs and only applies to systems containing very many similar components. However, it is the fundamental model for complex repairable systems and its simplicity appeals to many practitioners. The IRM and PIM improve upon the NHPP by allowing for partial repairs and preventive maintenance. The GPIM combines the best features of both models and perhaps offers the most potential for PM scheduling problems, despite the extra computational burden it attracts.

Whichever model we choose to fit to our data, some degree of model comparison is necessary. For the RP, DRP and DARP, we need to decide which lifetime distributions to fit following PM and CM. For the VAM and PHM, we must select suitable baseline hazard functions, scaling factors and explanatory variables for our linear predictors. Similar choices are necessary for the NHPP, IRM, PIM and GPIM. We can assess the goodness of fit of a model using its likelihood function and can compare different models using likelihood ratios or Bayes factors. Consider two nested models M_1 ⊃ M_2 with p_1 > p_2 parameters and likelihood functions L_1 > L_2 respectively. Then under general conditions, asymptotic sampling distribution theory states that

\[ 2 \log \frac{L_1}{L_2} \sim \chi^2 \left( p_1 - p_2 \right) \tag{8.26} \]

and so we can test whether the extra parameters are significant. This is particularly beneficial when choosing which elements to include in a linear predictor. If the models M_1 and M_2 are not nested, we cannot use this formal test and simply compare the log-likelihood functions log L_1 and log L_2, choosing the model with the larger log-likelihood. This is appropriate for choosing between gamma and Weibull baseline hazard functions, for example. However, it is only valid if p_1 = p_2, as a model with more parameters often fits better than a model with fewer parameters, simply because it has more freedom to adapt to the data.

To compare non-nested models with different numbers of parameters, we usually apply a correction factor to the log-likelihood functions. Two common modified forms are the Akaike information criterion (AIC), which suggests that we compare log L_1 − p_1 with log L_2 − p_2, and the Schwarz criterion, or Bayes information criterion (BIC), which suggests that we compare log L_1 − (p_1 log n)/2 with log L_2 − (p_2 log n)/2, where n is the number of observations in the data set. The latter arises as the limiting case of the posterior odds resulting from a Bayesian analysis with reference priors. In each case, the best model to choose is the one that maximizes the information criterion.

Example 8.5 Suppose we fit two non-nested models to a set of lifetime data, based on n = 31 observed failures. The first model contains three parameters and has a likelihood of L_1 = 8.742 × 10^−18. The second model contains five parameters and has a likelihood of L_2 = 3.110 × 10^−17. The Bayes information criterion for the first model is log L_1 − (p_1 log n)/2 ≈ −44.43 and for the second model it is log L_2 − (p_2 log n)/2 ≈ −46.59, so we prefer the first, simpler model here.
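The arithmetic of Example 8.5 can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and both criteria are written in the "larger is better" form used in this section (log-likelihood minus a penalty), which differs in sign and scale from the convention found in much statistical software.

```python
import math


def aic_score(likelihood, p):
    """AIC in the form used in this section: log L - p (larger is better)."""
    return math.log(likelihood) - p


def bic_score(likelihood, p, n):
    """BIC in the form used in this section: log L - (p log n) / 2."""
    return math.log(likelihood) - p * math.log(n) / 2.0


n = 31  # observed failures in Example 8.5
bic1 = bic_score(8.742e-18, p=3, n=n)
bic2 = bic_score(3.110e-17, p=5, n=n)
aic1 = aic_score(8.742e-18, p=3)
aic2 = aic_score(3.110e-17, p=5)

print(f"BIC: model 1 = {bic1:.2f}, model 2 = {bic2:.2f}")
print(f"AIC: model 1 = {aic1:.2f}, model 2 = {aic2:.2f}")
print("preferred by BIC:", "model 1" if bic1 > bic2 else "model 2")
```

For these particular likelihoods the AIC comparison points the same way as the BIC, although its lighter penalty makes the margin smaller.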


8.8 Preventive Maintenance Scheduling

The objective of model fitting is to determine the optimal PM period for minimising the expected cost per unit time. We will see that analytical solution of this problem is not possible and that simulation of the failure and repair process over a given horizon provides the best approach for resolving this difficulty. Sometimes, no particular horizon is specified and we can do no more than assume an infinite horizon. However, this problem simplifies for stationary systems involving the models based on the renewal process, as we only need to simulate the process over a single PM interval.

On other occasions, a finite horizon is clearly defined. Perhaps a factory or machine is owned on a 20-year lease. Alternatively, the equipment might be retained until cost efficiencies on a larger scale recommend replacement or scrapping. For a pre-determined finite horizon such as these, we base decisions on simulating the process for the whole horizon. Further analysis could be performed for the situation where a finite horizon is not pre-specified and must be regarded as random. The extra complexity introduced is a current research issue.

We begin by considering the delayed renewal process again. Suppose that the costs associated with PM and CM are k_PM and k_CM units respectively. Assuming an infinite horizon, we now simulate a PM interval of length t. This involves generating a pseudo-random observation u from f_U(u), to represent a typical lifetime following PM. If u ≥ t, the interval is complete and the total cost incurred is k_PM. However, if u < t we generate a pseudo-random observation v_1 from f_V(v), to represent a typical lifetime following CM, and add a cost k_CM for the repair. We continue this process, generating CM lifetimes v_1, v_2, v_3, … and adding a further cost k_CM each time until this interval is complete, and then calculate the total cost for this interval. Call this total cost K_1.
This procedure has completely simulated a PM interval of length t. We next repeat the procedure, until we have m repetitions in total, and determine the total costs for these simulated intervals, K_1, K_2, …, K_m. Then their sample mean

\[ \bar{K} = \frac{1}{m} \sum_{i=1}^{m} K_i \tag{8.27} \]

represents an unbiased estimator for the expected total cost per PM interval. This enables us to estimate the expected cost per unit time as K̄/t. Now we must repeat the whole simulation for different values of t, using an efficient search algorithm, to determine the value of t that minimises this expected cost per unit time. This is the recommended PM interval duration. We advocate direct search algorithms for practical implementation, such as golden-section search. For practical purposes, t is unlikely to vary continuously and discrete values will dominate. Convenient multiples of days, weeks or months provide suitable units of measurement for practical implementation.
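The simulation procedure for the delayed renewal process can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the Weibull lifetime distributions and their parameters, the costs k_PM = 1 and k_CM = 5, the number of replications and the list of candidate intervals. For simplicity the sketch compares a small set of discrete candidate intervals rather than running a golden-section search.

```python
import random


def simulate_interval_cost(t, draw_u, draw_v, k_pm, k_cm):
    """Simulate one PM interval of length t and return its total cost."""
    cost = k_pm                # every interval ends with one PM action
    elapsed = draw_u()         # lifetime following PM
    while elapsed < t:         # a failure occurs before the next PM
        cost += k_cm           # pay for the corrective repair
        elapsed += draw_v()    # lifetime following CM
    return cost


def expected_cost_rate(t, draw_u, draw_v, k_pm, k_cm, m=2000):
    """Estimate the expected cost per unit time, K-bar / t, from m replications."""
    total = sum(simulate_interval_cost(t, draw_u, draw_v, k_pm, k_cm)
                for _ in range(m))
    return total / m / t


rng = random.Random(1)  # fixed seed so that the sketch is reproducible

# Hypothetical lifetime models: random.weibullvariate(scale, shape).
draw_u = lambda: rng.weibullvariate(100.0, 2.0)   # after PM
draw_v = lambda: rng.weibullvariate(60.0, 1.5)    # after CM

candidates = [10, 20, 30, 50, 75, 100, 150]       # candidate PM intervals
rates = {t: expected_cost_rate(t, draw_u, draw_v, k_pm=1.0, k_cm=5.0)
         for t in candidates}
best_t = min(rates, key=rates.get)

for t in candidates:
    print(f"t = {t:>3}: estimated cost per unit time = {rates[t]:.4f}")
print("recommended PM interval:", best_t)
```

Because the estimates are subject to simulation noise, neighbouring candidates with similar cost rates may swap order between runs with different seeds; increasing m reduces this effect.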


To deal with scenarios involving finite horizons, we modify this simulation procedure. Instead of generating one simulated PM interval on many occasions, we simulate the process over the whole horizon h and accumulate the costs of PM and CM over this period. If we redefine K_1 as the total cost over this horizon, then successive replications of this simulated process generate total costs K_1, K_2, …, K_m as before. This time however, the expected cost per unit time is given by K̄/h, where K̄ is the sample mean defined earlier.

We now shift our attention to the delayed alternating renewal process. For most applications, it is reasonable to suppose that all PM activities have downtimes of similar durations, at an average cost of k_PM units, and that all CM activities have downtimes of similar durations, at an average cost of k_CM units. The analysis of this DARP model then proceeds exactly as for the DRP model, except that the simulation of successive PM intervals must also take account of these downtimes. Our model assumptions could be extended to consider different levels of maintenance activity, if these are evident in practice. For example, PM and CM might each be performed as minor or major activities, with corresponding downtimes. Such possibilities are application specific and can readily be incorporated as required, by adapting the basic simulation program. Indeed, simulation is the only feasible method of analysis and optimisation in this case.

To investigate PM scheduling for the nonhomogeneous Poisson process, we condition only upon the history at time t to avoid the problems associated with doubly stochastic processes and obtain

\[ P\{ N(t+\varepsilon) - N(t) = n \mid H(t) \} = \frac{\{ \mu_t(\varepsilon) \}^n}{n!} \exp\{ -\mu_t(\varepsilon) \} \tag{8.28} \]

for n = 0, 1, 2, … where

\[ \mu_t(\varepsilon) = \int_t^{t+\varepsilon} \iota(t) \, dt \tag{8.29} \]

is the mean number of failures in the interval (t, t + ε). Consequently, the reliability function for the next failure from time t is

\[ R_t(\varepsilon) = P\{ N(t+\varepsilon) - N(t) = 0 \mid H(t) \} = \exp\{ -\mu_t(\varepsilon) \}, \tag{8.30} \]

from which we can determine the lifetime distribution following a particular maintenance action at time t as

\[ f_t(\varepsilon) = -R_t'(\varepsilon) = \iota(t+\varepsilon) \exp\{ -\mu_t(\varepsilon) \}. \tag{8.31} \]
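One way to draw the time to the next failure from (8.30) is to invert the reliability function. The sketch below does this for a power-law intensity ι(t) = α t^β with β > −1, for which (8.29) has the closed form α{(t + ε)^(β+1) − t^(β+1)}/(β + 1). The choice of intensity, the function names and the use of the rounded estimates from Example 8.4 as inputs are illustrative assumptions only.

```python
import math
import random


def mean_failures(t, eps, alpha, beta):
    """mu_t(eps) in (8.29) for iota(t) = alpha * t**beta, with beta > -1."""
    k = beta + 1.0
    return alpha * ((t + eps) ** k - t ** k) / k


def reliability(t, eps, alpha, beta):
    """R_t(eps) in (8.30)."""
    return math.exp(-mean_failures(t, eps, alpha, beta))


def sample_time_to_next_failure(t, alpha, beta, rng):
    """Draw eps by solving R_t(eps) = u for a uniform variate u in (0, 1]."""
    k = beta + 1.0
    u = 1.0 - rng.random()        # rng.random() lies in [0, 1)
    target = -math.log(u)         # value that mu_t(eps) must reach
    return (t ** k + k * target / alpha) ** (1.0 / k) - t


rng = random.Random(2024)
alpha, beta = 0.07288, -0.3007    # rounded estimates from Example 8.4
draws = [sample_time_to_next_failure(410.0, alpha, beta, rng) for _ in range(5)]
print([round(d, 1) for d in draws])
```

For intensities without a closed-form integral, the same idea applies with numerical integration and root finding in place of the explicit inverse.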


This allows us to simulate the process as before, evaluate expected costs over a finite horizon, and so deduce the most economical time for the next preventive maintenance. This decision can be made at any specific event, such as during PM or CM, or even between events, so long as the intensity function is known.

Next we consider the proportional hazards model. To avoid referring separately to the hazard functions κ(u) and λ(v), consider a general hazard function h(x). For the purposes of simulation in order to schedule PM in the future, the reliability function can be determined as

\[ R(x) = \exp\left\{ - \int_0^x h(x) \, dx \right\}, \tag{8.32} \]

from which the probability density function is

\[ f(x) = -R'(x) = h(x) \exp\left\{ - \int_0^x h(x) \, dx \right\}, \tag{8.33} \]

allowing us to simulate the system’s failure process for PM optimisation as before.

8.9 Applications

We now apply some of these models to the data sets in Section 8.2.

Example 8.6 For each system, we fitted the intensity reduction model using constant, loglinear and power-law baseline intensities with constant reduction factors. Its goodness of fit is measured by the log-likelihoods in Table 8.5, obtained using Mathcad software. For comparison, we also display the log-likelihoods for the extremes of renewal process (maximal repairs) and nonhomogeneous Poisson process (minimal repairs).


Table 8.5. Log-likelihoods for analyses of hypothetical reliability data

Model                 Baseline intensity   Happy system   Sad system   Noncommittal system
intensity reduction   constant             −33.7          −33.7        −35.5
                      loglinear            −32.4          −28.5        −33.4
                      power-law            −29.4          −32.0        −34.7
maximal repair        constant             −35.5          −35.5        −35.5
                      loglinear            −34.8          −34.8        −34.8
                      power-law            −35.1          −35.1        −35.1
minimal repair        constant             −35.5          −35.5        −35.5
                      loglinear            −34.8          −32.0        −35.2
                      power-law            −35.0          −31.8        −35.3

As expected, the intensity reduction model provides a good fit to all three systems, preferring the power-law baseline intensity for the happy system and the loglinear baseline intensity for the sad and noncommittal systems. Figure 8.4 shows that these baseline intensities are all increasing functions and any apparent happiness is due to the high quality of repairs rather than a self-improving system.


[Figure 8.4: three panels, each titled "Intensity Function", plotting λ(t, a, b, s) on a vertical axis from 0 to 0.18 against t on a horizontal axis from 0 to 410]

Figure 8.4. Best fitting models for happy, sad and noncommittal systems, respectively


Example 8.7 Regarding all PM actions as CM actions for demonstration purposes, we apply the Laplace trend test to determine whether there is any evidence of non-stationarity at the 5% level of significance. Our test statistic is

\[ U = \frac{\sum_{i=1}^{n} t_i - \dfrac{n t}{2}}{t \sqrt{\dfrac{n}{12}}} = \frac{21{,}901 - \dfrac{22 \times 2{,}128}{2}}{2{,}128 \times \sqrt{\dfrac{22}{12}}} \approx -0.5230 . \tag{8.34} \]
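The value in (8.34) can be verified with a short Python sketch of Laplace's trend test. The function name is a hypothetical choice; the inputs are the summary quantities quoted in (8.34), namely the sum of the failure times, the number of failures and the length of the observation period.

```python
import math


def laplace_statistic(sum_of_failure_times, n, t):
    """Laplace trend statistic U of (8.25) from summary quantities."""
    return (sum_of_failure_times - n * t / 2.0) / (t * math.sqrt(n / 12.0))


u_stat = laplace_statistic(sum_of_failure_times=21901, n=22, t=2128)
reject_no_trend = abs(u_stat) > 1.960   # two-sided test at the 5% level

print(f"U = {u_stat:.4f}")
print("evidence of trend at 5% level:", reject_no_trend)
```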

As −1.960 < U < 1.960, the test is not significant at the 5% level and we conclude that this test provides no evidence of non-stationarity for these data. Consequently, the delayed renewal process might provide an adequate fit to these data, without the need for a more complicated model. However, we might consider using the DARP if downtime is important or one of the later models if concomitant information is also available.

Example 8.8 Here the data comprise 65 event observations collected over seven years. In the first half of this period, there were 15 CM and 11 PM actions. In the second half of this period, there were 29 CM actions and 10 PM actions. Hence, this is a sad system, which might benefit from preventive maintenance. We fit the generalized proportional intensities model to these data with explanatory variables representing quality of last maintenance and time since last maintenance. A loglinear baseline with constant reduction factors generates the results in Table 8.6.

Table 8.6. Log-likelihoods and parameter estimates for GPIM analyses of oil pump data

| Predictor variables | Log-likelihood | $\hat{\alpha}$ | $\hat{\beta}$ | $\hat{\rho}$ | $\hat{\sigma}$ | $\hat{\gamma}$ |
|---|---|---|---|---|---|---|
| ─ | −211.7 | $5 \times 10^{-4}$ | 1.01 | 0.719 | 0.740 | ─ |
| Quality of last action | −210.2 | $6 \times 10^{-4}$ | 1.01 | 0.699 | 0.745 | $6 \times 10^{-3}$ |
| Time since last action | −210.8 | $7 \times 10^{-4}$ | 1.01 | 0.666 | 0.728 | $-8 \times 10^{-3}$ |
| Quality of last action; Time since last action | −209.5 | $8 \times 10^{-4}$ | 1.01 | 0.653 | 0.734 | $6 \times 10^{-3}$; $-7 \times 10^{-3}$ |
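The log-likelihoods in Table 8.6 can also be compared on a penalised basis, using the Akaike and Bayes information criteria that this chapter lists among its model selection tools. The sketch below is illustrative only. It assumes parameter counts of 4, 5, 5 and 6 for the four rows (one coefficient per predictor variable, which is an inference from the table rather than a statement in the text) and takes the 65 events of Example 8.8 as the sample size for the Bayes criterion. The dictionary keys are hypothetical labels.

```python
import math

# (log-likelihood, assumed number of fitted parameters) for each row of Table 8.6.
models = {
    "none": (-211.7, 4),
    "quality": (-210.2, 5),
    "time": (-210.8, 5),
    "quality+time": (-209.5, 6),
}
n_events = 65  # event observations reported in Example 8.8


def aic(loglik, k):
    """Akaike information criterion: smaller is better."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayes information criterion: smaller is better."""
    return k * math.log(n) - 2 * loglik


aic_scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
bic_scores = {name: bic(ll, k, n_events) for name, (ll, k) in models.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)

for name in models:
    print(f"{name:13s} AIC={aic_scores[name]:.1f} BIC={bic_scores[name]:.1f}")
print("lowest AIC:", best_aic, "| lowest BIC:", best_bic)
```

Under these assumptions the penalised criteria need not rank the models in the same order as the raw log-likelihood, because each extra parameter has to earn its place.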

The best model includes both “quality of last maintenance action” and “time since last maintenance action” as predictor variables. This is not surprising, as it contains six parameters whereas the model with no predictor variables has only four. As the associated PM reduction factor $\hat{\rho}$ is about two-thirds, preventive maintenance reduces the intensity of critical failures for this system and so improves its reliability. Although slightly less impressive, corrective maintenance reduces the intensity function too. Hence, the maintenance workforce appears to be very effective for this application! A graph of the intensity function for the GPIM with both covariates follows in Figure 8.5, based on the corresponding parameter estimates in the last row of Table 8.6.

[Figure 8.5: plot titled "Intensity Function" of λ(t, a, b, r, s, c1, c2) against t, with t from 0 to 2487 and the intensity from 0 to 0.1.]

Figure 8.5. Intensity function for GPIM analysis of oil pump data with two covariates

We now perform a simulation analysis for this last model based on the methods described in Section 8.8, in order to determine an optimal strategy for scheduling preventive maintenance. Several convenient PM intervals are considered for our calculations, including weekly, monthly, two-monthly, quarterly, biannually, annually and biennially. The minimum cost per unit time over a ten-year fixed horizon is achieved with monthly PM and generates a projected 80% saving over annual PM, though this estimated reduction in costs is sensitive to the choice of model. The previous policy implemented averages about three PM actions per year, which our simulation estimates would cost about four times as much in preventive maintenance when compared with the optimal policy of monthly PM.

8.10 Conclusions

This chapter discussed the ideas of modelling complex repairable systems, with the intention of scheduling preventive maintenance to improve operational efficiency and reduce running costs. It started by emphasising the importance of improved, accurate and complete data collection in practice. It then presented the renewal process, delayed renewal process and delayed alternating renewal process as reasonable models for systems that exhibit stationary failure patterns.

The virtual age model and proportional hazards model were described as suitable for systems that do not exhibit stationarity and for systems where predictor variables such as condition monitoring observations are also measured. The nonhomogeneous Poisson process, intensity reduction model and proportional intensities model, with a promising generalization, were described next. We claim that these models offer natural interpretations of the physical underlying reliability and maintenance processes. Finally, this chapter demonstrated some applications of these ideas using reliability and maintenance data taken from the oil industry and reviewed several methods for model selection and goodness-of-fit testing, including graphs, Laplace trend test, likelihood ratios and the Akaike and Bayes information criteria. The use of mathematical modelling and statistical analysis in this fashion can improve, and has improved, the quality of PM scheduling. This can then result in considerable cost savings and help to improve system availability.

8.11 References

Ascher HE, Feingold H, (1984) Repairable Systems Reliability: Modeling, Inference, Misconceptions and their Causes. New York: Marcel Dekker
Baik J, Murthy DNP, Jack N, (2004) Two-dimensional failure modeling with minimal repair. Naval Research Logistics 51:345–362
Cox DR, (1972a) Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B 34:187–220
Cox DR, (1972b) The statistical analysis of dependencies in point processes. In Stochastic Point Processes (Lewis PAW). New York: Wiley
Crowder MJ, Kimber AC, Smith RL, Sweeting TJ, (1991) Statistical Analysis of Reliability Data. London: Chapman and Hall
Dagpunar JS, Jack N, (1993) Optimizing system availability under minimal repair with nonnegligible repair and replacement times. Journal of the Operational Research Society 44:1097–1103
Dekker R, Frenk H, Wildeman RE, (1996) How to determine maintenance frequencies for multi-component systems? A general approach. In Reliability and Maintenance of Complex Systems (Ozekici S). Berlin: Springer
Doyen L, Gaudoin O, (2004) Classes of imperfect repair models based on reduction of failure intensity or virtual age. Reliability Engineering and System Safety 84:45–56
Handlarski J, (1980) Mathematical analysis of preventive maintenance schemes. Journal of the Operational Research Society 31:227–237
Jack N, (1998) Age-reduction model for imperfect maintenance. IMA Journal of Mathematics Applied in Business and Industry 9:347–354
Jardine AKS, Anderson PM, Mann DS, (1987) Application of the Weibull proportional hazards model to aircraft and marine engine failure data. Quality and Reliability Engineering International 3:77–82
Jiang R, Jardine AKS, (2006) Composite scale modeling in the presence of censored data. Reliability Engineering and System Safety 91:756–764
Kobbacy KAH, Fawzi BB, Percy DF, Ascher HE, (1997) A full history proportional hazards model for preventive maintenance scheduling. Quality and Reliability Engineering International 13:187–198
Lindqvist BH, Elvebakk G, Heggland K, (2003) The trend-renewal process for statistical analysis of repairable systems. Technometrics 45:31–44
Lugtigheid D, Banjevic D, Jardine AKS, (2004) Modelling repairable systems reliability with explanatory variables and repair and maintenance actions. IMA Journal of Management Mathematics 15:89–110
Newby M, (1994) Perspective on Weibull proportional hazards models. IEEE Reliability Transactions 43:217–223
Percy DF, Alkali BM, (2006) Generalized proportional intensities models for repairable systems. IMA Journal of Management Mathematics 17:171–185
Percy DF, Kobbacy KAH, (2000) Determining economical maintenance intervals. International Journal of Production Economics 67:87–94
Percy DF, Bouamra O, Kobbacy KAH, (1998a) Bayesian analysis of fixed-interval preventive-maintenance models. IMA Journal of Mathematics Applied in Business and Industry 9:157–175
Percy DF, Kobbacy KAH, Ascher HE, (1998b) Using proportional-intensities models to schedule preventive-maintenance intervals. IMA Journal of Mathematics Applied in Business and Industry 9:289–302
Taylor HM, Karlin S, (1994) An Introduction to Stochastic Modelling. London: Academic Press
Van der Duyn Schouten F, (1996) Maintenance policies for multicomponent systems: an overview. In Reliability and Maintenance of Complex Systems (Ozekici S). Berlin: Springer
Watson C, (1970) Is preventive maintenance worthwhile? In Operational Research in Maintenance (Jardine AKS). Manchester: University Press

9 Artificial Intelligence in Maintenance Khairy A. H. Kobbacy

9.1 Introduction

Over the past two decades there has been substantial research and development in operations management including maintenance. Kobbacy et al. (2007) argue that the continuous research in these areas implies that solutions were not found to many problems. This was attributed to the fact that many of the solutions proposed were for well-defined problems, that the solutions assumed accurate data were available and that the solutions were too computationally expensive to be practical. Artificial intelligence (AI) was recognised by many researchers as a potentially powerful tool, especially when combined with OR techniques to tackle such problems. Indeed, there has been vast interest in the applications of AI in the maintenance area, as witnessed by the large number of publications in the area. This chapter reviews the application of AI in maintenance management and planning and introduces the concept of developing an intelligent maintenance optimisation system.

The outline of the chapter is as follows. Section 9.2 deals with various maintenance issues including maintenance management, planning and scheduling. Section 9.3 introduces a brief definition of AI, some of its techniques that have applications in maintenance and Decision Support Systems. A review of the literature is then presented in Section 9.4 covering the applications of AI in maintenance. We have focused on five AI techniques, namely knowledge based systems, case based reasoning, genetic algorithms, neural networks and fuzzy logic. This review also covers “hybrid” systems where two or more of the above mentioned AI techniques are used in an application. Other AI techniques seem to have very few applications in maintenance to date. A discussion of the development of the prototype hybrid intelligent maintenance optimisation system (HIMOS), which was developed to evaluate and enhance preventive maintenance (PM) routines of complex engineering systems, follows in Section 9.5.
HIMOS uses a knowledge based system to identify suitable models to schedule PM activities and case based reasoning to add the capability to utilise past experience in model selection. Future developments and the outline design of an Adaptive Maintenance Measurement and Control Model are covered in Section 9.6. Concluding remarks are presented in Section 9.7.

The following abbreviations are used throughout this chapter.

AHP: Analytic hierarchy process
AI: Artificial intelligence
CBR: Case based reasoning
CO: Corrective action
DMG: Decision making grid
DSS: Decision support system
FL: Fuzzy logic
GAs: Genetic algorithms
HIMOS: Hybrid intelligent maintenance optimisation system
IDSS: Intelligent decision support system
IMOS: Intelligent maintenance optimisation system
KBS: Knowledge based systems
NHPP: Non-homogeneous Poisson process
NNs: Neural networks
OR: Operational research
PHM: Proportional hazards model
PIM: Proportional intensities model
PM: Preventive maintenance
RBR: Rule based reasoning

9.2 Maintenance Management, Planning and Scheduling

Most industrial organisations have maintenance departments which deal with many issues regarding operations. For example, they can be involved in process design, inventory, scheduling and staffing. However, the ultimate objective of maintenance is to keep equipment at an acceptable standard. To achieve this objective a variety of maintenance actions are employed including inspection, repair, planned maintenance and replacement. Adequate planning of the type, contents and timing of maintenance actions is essential for the success of the maintenance function (Kobbacy 1992).

A survey of some 34 companies was carried out in the UK (Kobbacy et al. 2005). It indicated that around half of the work that was carried out by maintenance departments was on repair; around a quarter was on preventive maintenance and 5% on inspection. The remaining effort was on other types of maintenance actions including opportunistic maintenance, condition monitoring and design-out maintenance.

Repairs represent the largest proportion of maintenance actions carried out by maintenance departments and indeed all departments surveyed carried out repairs. Repair is the maintenance action that restores the equipment to operating condition. Some repair actions restore equipment to as-new condition while others are classed as minimal repair, i.e. they restore equipment to the condition prior to failure. In reality, equipment is likely to be restored to a condition between these two states. Occasionally, repairs may introduce faults to the equipment.

Preventive maintenance is the maintenance action that is undertaken in the belief that it reduces the occurrence of failures as compared with the alternative of repairing components only upon failure (Kobbacy et al. 1995a). PM is perhaps the most intractable of maintenance actions in terms of mathematical modelling. The main reason is that only one point is usually known on the curve representing cost/availability against PM interval, and the analyst attempts to predict the failure rate at a range of PM intervals in order to select the optimal interval.

Inspection is the action taken to establish the condition of equipment at some point in time. It can be triggered by observing unusual performance of equipment, e.g. noise, or else the inspection can be carried out at regular predetermined intervals. A major difference between PM and inspection is that PM routines usually involve planned maintenance action, e.g. replace component, make adjustment, etc., while inspection involves checking the condition of equipment and carrying out maintenance action based on the outcome of inspection. In other words, inspection routines, unlike PM, do not contain predetermined restoration of equipment condition.

Fault diagnosis is an integral part of maintenance actions and it follows from realizing that a fault has occurred. This is essentially required before repairs are carried out following failure, preventive maintenance, inspection or condition monitoring.

There are two approaches to maintenance planning/management: the engineering approach and the mathematical approach (Gits 1984). The engineering approach has a broad view of the maintenance problem as the maintenance concept is determined through consideration of the operations plan, maintenance constraints and item behaviour. Thus it emphasises the development of rules or guidelines for planning maintenance action. The mathematical approach places more emphasis on developing optimal maintenance policies, e.g. the optimal PM interval. A major challenge in this field is how to integrate these approaches. Many software packages have been developed over the years to help in the analysis and modelling of maintenance situations, though they have their limitations, including the need for intervention by an analyst, which can slow the process or make the analysis almost intractable for large systems.

Scheduling of maintenance actions is a part of maintenance planning. Not all maintenance actions require scheduling, e.g. repair upon failure and design-out maintenance. Opportunistic maintenance, by definition, is carried out taking advantage of the time when equipment is not in use, but planning for spare parts can be required. Condition monitoring can be a continuous process but often requires planning of the monitoring interval and the subsequent replacement. The other two major maintenance actions that require scheduling are preventive maintenance and “planned” inspections. Typically, one needs first to establish a model for the failure pattern, i.e. times between failures. A non-homogeneous Poisson process is usually the model of first choice for deteriorating repairable systems (Ascher and Kobbacy 1995). There have been many attempts to schedule PM routines, i.e. to decide on the frequency of PM actions per year. Ascher and Kobbacy (1995) present models for scheduling PM by minimizing cost/maximizing availability and using NHPP. Other attempts use Cox’s proportional hazards model (Kobbacy et al. 1997) and the proportional intensities model (Percy et al. 1998). The latter has proved to be of great promise and is indeed being investigated for application in more complex PM situations, e.g. multiple PM routines.
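As a hedged illustration of how an NHPP failure model can feed a PM scheduling decision, the sketch below uses the classical cost rate for periodic renewal with minimal repair between renewals. This textbook model is swapped in for simplicity and is not the model of Ascher and Kobbacy (1995). It assumes a power-law intensity λ(t) = a b t^(b−1) with b > 1, that each PM action renews the system, and hypothetical costs `c_pm` and `c_cm`. All names and numbers are illustrative.

```python
def expected_failures(t, a, b):
    """Expected number of failures in (0, t] for intensity a*b*t**(b-1)."""
    return a * t ** b


def cost_rate(pm_interval, a, b, c_pm, c_cm):
    """Long-run cost per unit time: one PM plus the expected repairs per cycle."""
    cycle_cost = c_pm + c_cm * expected_failures(pm_interval, a, b)
    return cycle_cost / pm_interval


def optimal_pm_interval(a, b, c_pm, c_cm):
    """Interval that sets the derivative of the cost rate to zero."""
    if b <= 1:
        raise ValueError("a finite optimum needs an increasing intensity (b > 1)")
    return (c_pm / (c_cm * a * (b - 1))) ** (1.0 / b)


# Hypothetical parameter values, chosen only to exercise the functions.
a, b = 0.001, 2.0
c_pm, c_cm = 100.0, 500.0

t_star = optimal_pm_interval(a, b, c_pm, c_cm)
print(t_star, cost_rate(t_star, a, b, c_pm, c_cm))
```

The closed form follows from differentiating (c_pm + c_cm a T^b)/T with respect to T. For a system that is not deteriorating (b ≤ 1) this model gives no finite optimum, which is why the function rejects that case.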

9.3 AI Techniques

AI is a branch of computer science that develops programmes to allow machines to perform functions normally requiring human intelligence (Microsoft ENCARTA College Dictionary 2001). The goal of AI is to teach machines to “think” to a certain extent under special conditions (Firebaugh 1988). There are many AI techniques; those most used in maintenance decision support are as follows.

Knowledge based systems (KBS): use domain specific rules of thumb or heuristics (production rules) to identify a potential outcome or suitable course of action.

Case based reasoning (CBR): utilises past experiences to solve new problems. It uses case index schemes, similarity functions and adaptation. It provides machine learning through updating of the case base.

Genetic algorithms (GAs): these are based on the principle that solutions can evolve. Potential promising solutions evolve through mutation and weaker solutions become extinct.

Neural networks (NNs): use the back propagation algorithm to emulate the behaviour of the human brain. Both NNs and GAs are capable of learning how to classify, cluster and optimise.

Fuzzy logic (FL): allows the representation of information of an uncertain nature. It provides a framework in which membership of a category is graded and hence quantifies such information for mathematical modelling, etc.

There are several other AI techniques and these include Data Mining, Robotics and Intelligent Agents. However, to date very few publications are available about their applications in maintenance.

9.3.1 Intelligent Decision Support Systems

A useful definition of DSS is as follows. It is a computer based system that helps decision makers confront ill-structured problems through direct interaction with data and analysis models (Sprague and Watson 1986). The result of integrating an AI technique within a DSS is referred to in this chapter as an Intelligent DSS. This is essentially a DSS as defined above, but has the additional capabilities to “understand”, “suggest” and “learn” in dealing with managerial tasks and problems. The method of integration and the features of the end product depend very much on the area of application.
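To make the notion of graded membership in the fuzzy logic description of Section 9.3 concrete, the following minimal sketch grades a single reading against three overlapping categories. Everything in it is hypothetical: the triangular shape, the category labels and the breakpoints (taken here to be a vibration reading in mm/s) are assumptions chosen for illustration, not values from the literature reviewed in this chapter.

```python
def triangular_membership(x, left, peak, right):
    """Degree (0 to 1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets for equipment condition, as (left, peak, right).
condition_sets = {
    "good": (-1.0, 0.0, 4.0),
    "fair": (2.0, 5.0, 8.0),
    "poor": (6.0, 10.0, 14.0),
}


def grade_condition(vibration):
    """Membership grade of one reading in every condition category."""
    return {
        label: triangular_membership(vibration, *params)
        for label, params in condition_sets.items()
    }


grades = grade_condition(3.0)
print(grades)
```

A reading of 3.0 belongs partly to "good" and partly to "fair", which is the kind of uncertain information that a crisp threshold would discard.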

9.4 AI in Maintenance

AI techniques have been used successfully in the past two decades to model and optimise maintenance problems. Since the resurgence of AI in the mid-1980s researchers have considered the applications of AI in this field. The article by Dhaliwal (1986) is one of the early ones that argued for the appropriateness of using AI techniques for addressing the issues of operating and maintaining large and complex engineering systems. Kobbacy (1992) discusses the useful role of knowledge based systems in the enhancement of maintenance routines. Over the years the applications of AI in maintenance grew to cover a very wide area of applications using a variety of AI techniques. This can be explained by the individual nature of each technique. For example, GAs and NNs have the advantage of being useful in optimising complex and nonlinear problems and overcome the limitations of the classic “black box” approaches, where an attempt is made to identify the system by relating system outputs to inputs without understanding and modelling the underlying process. Hence the widespread applications in the scheduling area and also in fault diagnosis.

In this section, an up to date survey is presented covering the area of application of AI techniques in maintenance including fault diagnosis. This chapter will only refer to some of the references in the vast applications of AI in fault diagnosis. Interested readers can refer to the recent comprehensive review by Kobbacy et al. (2007) on applications of AI in Operations.

9.4.1 Case Based Reasoning (CBR)

CBR is an interesting AI technique which adds learning capabilities to DSSs. This may explain the lack of publications on using CBR on its own in maintenance. Instead there are a few hybrid applications which utilise CBR together with other AI techniques. Details about the CBR technique are discussed while presenting the case study in Section 9.5.3. Yu et al. (2003) present a problem-oriented multi-agent-based E-service system (POMAESS). The system uses a CBR-based decision support function. The case study, which is discussed later in this chapter, deals with a hybrid KBS/CBR maintenance optimisation system (HIMOS).

More publications are found in fault diagnosis including papers on its application in locomotive diagnostics, e.g. Varma and Roddy (1999). Xia and Rao (1999) argue the need to develop dynamic CBR, which introduces new mechanisms such as time-tagged indexes and dynamic and multiple indexing to help accurate solving of problems taking into account system dynamics and fault propagation phenomena. Cunningham et al. (1998) describe an incremental CBR mechanism that can initiate the fault diagnosis process with only a few features. There are also papers on hybrid CBR systems in fault diagnosis including the use of CBR with Petri nets for induction motor fault diagnosis (Tang et al. 2004), CBR with FL in fault diagnosis of modern commercial aircraft (Wu et al. 2004), CBR with NN in a web-based intelligent fault diagnosis system (Hui et al. 2001), CBR with heuristic reasoning and hypermedia for incident monitoring (Rao et al. 1998) and CBR with KBS in the pattern search problem in fault diagnosis (Kohno et al. 1997).

9.4.2 Genetic Algorithms (GAs)

GAs are popular in maintenance applications because of their robust search capabilities that help reduce the computational complexity of large optimisation problems (Morcous and Lounis 2005), such as large scale maintenance scheduling models. GAs have applications in infrastructure networks including programming the maintenance of concrete bridge decks (Morcous and Lounis 2005; Lee and Kim 2007), pavement maintenance programmes (Chootinan et al. 2006), and optimising highway life-cycle by considering maintenance of roadside appurtenances (Jha and Abdullah 2006). GAs also have applications in maintenance activities in nuclear power plants, including optimising the technical specification of a nuclear safety system by coupling GAs and Monte Carlo simulation in an attempt to minimise the expected value of system unavailability and its associated variance (Marseguerra et al. 2004).

Another important area of application is in manufacturing. Ruiz et al. (2006) present an approach for scheduling of PM in a flowshop problem with the aim of maximising availability. Sortrakul et al. (2005) present a heuristic based on genetic algorithms to solve an integrated optimisation model for production scheduling and preventive maintenance planning. Chan et al. (2006) propose a GA approach to deal with the distributed flexible manufacturing system scheduling problem subject to machine maintenance constraints.

Other popular application areas for GAs include preventive maintenance scheduling optimisation. Application areas in PM include chemical process operations (Tan and Kramer 1997), power systems (Huang 1998), single product manufacturing production lines (Cavory et al. 2001) and mechanical components (Tsai et al. 2001). GAs are also used in deciding on opportunistic maintenance policies (Saranga 2004; Dragan et al. 1995).

GAs have had some moderate but constant interest over the past decade in the area of fault diagnosis.
Applications range from manufacturing systems (Khoo et al. 2000), nuclear power plants (Yangping et al. 2000) and electrical distribution networks (Wen and Chang 1998) to a new area of application in automotive fuel cell power generators (Hissel et al. 2004).

9.4.3 Neural Networks (NNs)

NNs are a popular AI technique applied in the areas of maintenance and in particular in fault diagnosis. NNs are the primary information processing structure used in neurocomputing, i.e. systems that learn the relationship between data through a process of training (Dendronic Decisions Ltd 2003). NNs have many applications in the areas of predictive maintenance and condition monitoring. Gilabert and Arnaiz (2006) present a case study for noncritical machinery, where NN is used for elevator monitoring and diagnosis as no previous experience existed. Al-Garni et al. (2006) also use NN for predicting the failure rate of airplane tyres. Gromann de Araujo Goes et al. (2005) have developed a computerised online reliability monitoring system for nuclear power plant applications. An interesting application, developed by Garcia et al. (2004), uses NNs to aid tele-maintenance, where staff can carry out the work remotely and in collaboration with other experts. Other applications of NNs in condition monitoring include the work of Bansal et al. (2004) on machine systems, Booth and McDonald (1998) on electrical power transformers and Spoerre (1997) on bearings. Shyur et al. (1996) use NNs to predict component inspection requirements for ageing aircraft and Eldin and Senouci (1995) use NNs for the condition rating of joint concrete pavements. Lin and Wang (1996) developed an approach combining NNs and advanced vibration monitoring methods for online predictive maintenance of rotating machinery. Luxhoj and Williams (1996) present a hybrid NN/KBS DSS for aircraft safety inspection.

NNs suit model based fault detection and isolation when analytical models are not available. Frank and Koppen-Seliger (1997) define three steps for fault detection: residual generation, i.e. generation of a signal that reflects the fault; residual evaluation, i.e. the logical decision making on the time of occurrence and location of the fault; and fault analysis, i.e. determination of the type of fault, its size and cause. NNs have to be trained for both residual generation and evaluation, using collected or simulated data for the former and residuals for the latter.

There is a large number of papers published on the use of NNs in fault diagnosis covering a wide range of applications. These include diagnosis in induction motors (Yang and Kim 2006), marine propulsion systems (Kuo and Chang 2004), supervision of desalination plant during dynamic states, e.g. start up (Tarifa et al. 2003), engineering structures (Chen et al. 2003), navigation systems (Zhang et al. 2001), power plants (Simani and Fantuzzi 2000) and automotive engine management (Shayler et al. 2000). There are various applications in the chemical process industry for using NNs in fault diagnosis, e.g. packed towers (Sharma et al. 2004) and batch processes (Scenna 2000). There are studies that make use of hybrid NN systems for fault diagnosis. Yang et al.
(2004) integrate CBR with an ART-KNN to enhance fault diagnosis when solving a new problem, with the NN used to make hypotheses and to guide CBR to search for similar previous cases. Jota et al. (1998) use neuro-fuzzy, neuro-expert and fuzzy expert algorithms for fault detection in a range of electrical power system equipment.

9.4.4 Knowledge Based Systems (KBSs)

The use of KBS in maintenance management represents one of the early applications of AI in maintenance. Martland et al. (1990) developed a knowledge-based expert system to guide the rail scheduling process, i.e. in developing a plan for rail relay or replacement. Ahmed et al. (1991) developed an expert system for offshore structure inspection and maintenance. Kobbacy (1992) argued for the use of KBS in the evaluation and enhancement of maintenance routines. Batanov et al. (1993) developed EXPERT-MM, an expert system that supports maintenance policy suggestions, machine diagnosis and maintenance scheduling. Feldman et al. (1992) designed a rule-based expert system to investigate maintenance policies with regard to replacement, minimal repair or no actions in continuous manufacturing environments. Srinivasan et al. (1993) present an intelligent scheduling system using KBS for application on a power-distributed system. Drury and Prabhu (1996) provide a framework for information design that captures the interaction between the inspection task and its information requirements in the operation of commercial aircraft. The framework is used together with the cognitive control categories of skill-rule-knowledge-based behaviour to analyse information needs of aircraft inspectors. de Brito et al. (1997) developed a prototype system for optimising the inspection, maintenance and repair strategies for bridges. A fuzzy knowledge based method for maintenance planning in power systems is demonstrated by Sergaki and Kalaitzakis (2002). In addition, the work of Kobbacy and Jeon (2001) is discussed later (Section 9.5.3).

KBSs also have a wide range of applications in fault diagnosis, which, unlike applications in maintenance planning, are showing an increasing trend. KBSs can be used in all three phases of fault diagnosis (see Section 9.4.3). In the case of complex systems where there is insufficient information to formulate a mathematical model, KBSs have been particularly successful. Examples of applications of KBS in fault diagnosis include diagnosing electrical failures in induction motors (Acosta et al. 2006), fault diagnosis of rotating machinery (Yang et al. 2005), CNC machine-tools (Leung and Romagnoli 2002), industrial gas turbines (Milne et al. 2001), research reactors (Varde et al. 1998), power transmission networks (Baroni et al. 1997), real-time fault detection of greenhouse sensors (Beaulah and Chalabi 1997), monitoring, diagnosis and optimisation of a coal washing plant (Villanueva and Lamba 1997), continuous and semi-continuous chemical processes (Nam et al. 1996) and diagnosis and maintenance of robotic systems (Patel et al. 1995). Miller et al. (1990) developed a vehicle trouble-shooting expert system which has integrated imaging capability. The system is used to diagnose maintenance problems in the electrical/hydraulic subsystems. Hybrid KBS applications in fault diagnosis include the KBS/NNs application in batch chemical plants (Ruiz et al. 2001).
Frank and Ding (1997) outline advances in the theory of observer-based fault diagnosis in dynamic systems, covering the use of AI including KBSs and NNs.

9.4.5 Fuzzy Logic (FL)

FL has been used in various applications in the maintenance area to deal with uncertainty. Oke and Charles-Owaba (2006) apply an FL control model to Gantt charting preventive maintenance scheduling. Al-Najjar and Alsyouf (2003) use fuzzy multiple criteria decision making to select in advance the most informative (efficient) maintenance approach, i.e. strategies, policies or philosophies. Braglia et al. (2003) adopt FL in an approach that allows analysts to formulate efficiently their assessment of possible causes of failure in failure mode, effects and criticality analysis. Sudiarso and Labib (2002) investigated an FL approach to an integrated maintenance/production scheduling algorithm. Jeffries et al. (2001) develop an efficient hybrid method for capturing machine information in a packaging plant using FL (fuzzy condition monitoring) in order to reduce wastage and maintenance overheads.

Examples of FL hybrid applications include the use of a KBS for bridge damage diagnosis, which aims at providing information about the impact of design factors on bridge deterioration, with FL used to handle uncertainties (Zhao and Chen 2001). Sinha and Fieguth (2006) propose a neuro-fuzzy classifier that combines FL and NNs for the classification of defects by extracting features in segmented buried pipe images. Applications of FL in fault diagnosis include fault diagnosis of railway wheels (Skarlatos et al. 2004), thrusters for an open-frame underwater vehicle (Omerdic and Roberts 2004), chemical processes (Dash et al. 2003) and rolling element bearings in machinery (Mechefske 1998).

9.5 The Hybrid Intelligent Maintenance Optimisation System

In this section we discuss the Hybrid Intelligent Maintenance Optimisation System (HIMOS).

9.5.1 Why Intelligent Maintenance DSSs are Needed

Optimisation of the maintenance policies of complex technical systems, such as telecommunication systems and complex manufacturing plants, can prove to be difficult. With the developments in information technology over the past two decades, many organisations with complex technical systems have developed maintenance databases. Though the stored history data are potentially very useful to the maintenance engineer aiming to improve maintenance policies, in many cases the data are mainly used to produce simple statistics for management reporting. This is not due to a lack of interest on the part of maintenance practitioners, but to the challenging nature of these systems. The following difficulties are likely to be encountered while attempting to optimise the maintenance routines of complex systems:

1. The system contains a large number of sub-systems and components. This gives rise to a wide variety of maintenance situations that can be handled using different models and methods.
2. A maintenance engineer optimising the maintenance routines using available software packages requires a familiarity with maintenance modelling in addition to engineering expertise.
3. Even if engineers with such experience were available, the time required to examine a large number of components using this type of software can be prohibitive.
4. The changeable nature of large technical systems, e.g. replacement of components with different types or modification of design, will present constant challenges.

All these difficulties accentuate the need to develop special computerised systems that can cope with the management of complex engineering systems. Intelligent DSSs are a candidate.


9.5.2 The Required Functional Features of an Intelligent Maintenance DSS

The main functional features which would be expected of such a system, to cope with the above situation, are (Kobbacy 2004):

1. To access the history data from a maintenance database.
2. To check the quality of data.
3. To recognise characteristic data patterns.
4. To query the user for additional information, judgement and criteria.
5. To select the most appropriate PM scheduling model for the decision analysis.
6. To optimise the selected model, evaluate the current policy and propose an optimal maintenance policy.
7. To present the results of the analysis in a flexible format.
8. To respond to user enquiries, perform ‘What if?’ decision modelling and provide explanations of the recommended decisions.
9. To have learning capabilities.
10. To have a user-friendly Windows interface.

In the following section we present one specific application of intelligent systems in maintenance, namely the development of an intelligent system to schedule PM for complex technical systems. This section is based on the work of the author with others (see references below).

9.5.3 The Hybrid Intelligent Maintenance Optimisation System (HIMOS)

HIMOS aims at deciding the optimal PM cycle interval for a repairable system by selecting and applying the most appropriate optimisation model automatically and without the need for expert intervention (Kobbacy and Jeon 2001). HIMOS was developed from its predecessor IMOS (Kobbacy et al. 1995b), the intelligent maintenance optimisation system, which used rule-based reasoning to select an appropriate model for analysis. HIMOS employs hybrid reasoning by combining rule-based reasoning (RBR) and case-based reasoning (CBR) to choose a model from a model base for a given data set. Analysis of a typical large data file by IMOS showed that about two thirds of components cannot be modelled, mostly because of insufficient history data needed for model selection (Kobbacy 2004). However, some of the cases which could not be modelled may have parameters with values close to those of a model’s acceptance level as stated in the rule base. By introducing case-based reasoning, the system can model cases which are not identified by the rule base, provided it has analysed similar cases in the past. Thus, such a hybrid (KBS and CBR) system is expected to increase the previously low percentage of cases for which the system is able to identify a suitable model.


Figure 9.1. Outline Design of HIMOS (Kobbacy and Jeon 2001)

Figure 9.1 illustrates the conceptual structure of HIMOS, which is divided into two areas. The DSS contribution area contains a database to store maintenance historical data, a model base for data analysis models and optimisation models, and a user interface to communicate with the user. In the AI contribution area, there are two bases which contain experts’ knowledge: the knowledge base and the case base.

9.5.3.1 HIMOS Procedure

Figure 9.2 illustrates the model selection for a data set consisting of a sequence of preventive maintenance (PM) and corrective action (CO) events to enable calculating the optimal PM interval. HIMOS has the ability to use a set of production rules to select and then optimise a suitable model in order to provide an evaluation of the current maintenance routine and to propose an optimal policy. These rules are acquired from experts’ knowledge and may require subjective judgements to be made. The processor of HIMOS identifies data patterns through the data analysis procedure and then selects the most appropriate model for a given data set by consulting the rule base. If a data set cannot be matched by any of the KBS rules, then the system attempts to use CBR to identify a suitable model.
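The fallback from the rule base to CBR described above can be sketched as a small control flow. This is only an illustration under stated assumptions: the function `select_model`, the two toy selectors, and the thresholds `MIN_CO_EVENTS` and `RELAXED_MIN_CO` are hypothetical names and values introduced here, not the HIMOS rules or its case base.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, float]
Selector = Callable[[Metadata], Optional[str]]


def select_model(metadata: Metadata, rbr: Selector, cbr: Selector) -> Tuple[Optional[str], str]:
    """Try the rule base first; fall back to case-based reasoning."""
    model = rbr(metadata)
    if model is not None:
        return model, "RBR"
    model = cbr(metadata)
    if model is not None:
        return model, "CBR"
    return None, "no model"


# Hypothetical stand-ins for the two reasoners (not the HIMOS rules).
MIN_CO_EVENTS = 5    # assumed acceptance level in the rule base
RELAXED_MIN_CO = 3   # assumed relaxed level used by the CBR fallback


def toy_rbr(metadata: Metadata) -> Optional[str]:
    """Accept the data set only if it meets the strict acceptance level."""
    return "RP" if metadata["n_co"] >= MIN_CO_EVENTS else None


def toy_cbr(metadata: Metadata) -> Optional[str]:
    """Accept data sets that fall just short of the strict level."""
    return "RP" if metadata["n_co"] >= RELAXED_MIN_CO else None


for n_co in (8, 4, 1):
    print(n_co, select_model({"n_co": n_co}, toy_rbr, toy_cbr))
```

With these toy selectors, a data set with eight CO events is matched by the rule base, one with four events is picked up only by the CBR fallback, and one with a single event is left without a model.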


Figure 9.2. Model Selection In HIMOS (Kobbacy and Jeon 2001)

Data Formatting and Analysis

After reading data from the input data file, the system formats and checks the data to create a suitable data set for the next step of analysis. Suspect or missing items of data are flagged in order to be sorted out by the system or investigated by the user. The analysis consists of five steps: recognition of PM and CO patterns, calculation of current availability, Weibull distribution fitting to failure times, trend tests of frequency and severity to establish whether or not the data are stationary with respect to frequency and severity, and, if applicable, analysis of multi-PM cases. In the first step a basic analysis is carried out to identify the features of the data set, such as the numbers of PM and CO events and the mean lives to failure, so that the data set can be compared with characteristic data patterns in the model selection process. The data produced in this process are referred to as ‘metadata’.

Model Base

The model base contains two sets of models: the data analysis models and the PM scheduling optimisation models. The data analysis models identify a data pattern which, together with the RBR/CBR, helps to select an optimisation model. The optimisation models are a set of mathematical models of maintenance policies which evaluate current policies and, in certain circumstances, recommend an optimal policy. These models deal with components rather than systems and they assume independence of components, i.e. the failure of one component does not affect the performance of another. The models in the model base are classified into single-PM and multi-PM models, and the former are further classified into stationary and nonstationary models. Stationary models deal with the data sets in which no trend is found. If there are frequency or severity trends, then a nonstationary model can be used. Multi-PM models are structured to deal with components subject to more than one PM routine. The IMOS model base includes 21 different models. The description of the models used in HIMOS can be found elsewhere (Kobbacy and Jeon 2001).

Model Selection Using the Rule-Base

In HIMOS the rule base (or knowledge base) consists of a list of rules capturing some of the knowledge of experts in maintenance modelling concerning mathematical modelling techniques and their applicability to various situations. The rules match data sets to the models by searching for patterns in the data set for each component, such as relative numbers of CO and PM events, component life distribution, range of PM intervals, etc. The approach used to develop the rule base is described in Kobbacy (2004). The knowledge base implemented in HIMOS consists of a set of 15 rules, examples of which are shown below. If the rule base fails to identify a suitable model, the CBR is invoked.

RULE 1:
  If   Not matched and
       There are multi-PMs
  Then Apply Multi-PM Model
       Matched

RULE 2:
  If   Not matched and
       Trend test statistics of frequency is significantly large and
       Trend test statistics of severity of CO is significantly large and
       Trend test statistics of severity of PM is significantly large
  Then Apply NHPPScoSpm Model
       Matched

RULE 3:
  If   Not matched and
       Trend test statistics of frequency is significantly large and
       Trend test statistics of severity of CO is significantly large
  Then Apply NHPPSco Model
       Matched
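The three example rules can be read as an ordered list in which the first rule whose conditions hold marks the data set as matched. The sketch below illustrates that reading only. The `Metadata` fields, the function names and the critical value `CRITICAL = 1.96` used for "significantly large" are assumptions made for illustration; the text does not state how significance is decided, and the trend statistics are taken as already computed.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

CRITICAL = 1.96  # assumed critical value for "significantly large"


class Metadata(NamedTuple):
    """Hypothetical summary of one component's data set."""
    multi_pm: bool        # component is subject to more than one PM routine
    trend_freq: float     # trend test statistic of frequency
    trend_sev_co: float   # trend test statistic of severity of CO
    trend_sev_pm: float   # trend test statistic of severity of PM


def large(stat: float) -> bool:
    """Assumed one-sided significance check."""
    return stat > CRITICAL


# Ordered rules: (model name, condition). Order matters, as in RULE 1 to RULE 3.
RULES: List[Tuple[str, Callable[[Metadata], bool]]] = [
    ("Multi-PM", lambda m: m.multi_pm),
    ("NHPPScoSpm", lambda m: large(m.trend_freq) and large(m.trend_sev_co)
                             and large(m.trend_sev_pm)),
    ("NHPPSco", lambda m: large(m.trend_freq) and large(m.trend_sev_co)),
]


def select_by_rules(m: Metadata) -> Optional[str]:
    """Return the model of the first matching rule, or None if unmatched."""
    for model, condition in RULES:
        if condition(m):      # "If Not matched and ..." holds for this rule
            return model      # "Then Apply ... Model, Matched"
    return None               # no rule matched; CBR would be invoked next


print(select_by_rules(Metadata(False, 3.0, 2.5, 0.4)))
```

Because RULE 2 is tested before RULE 3, a data set with all three trend statistics significant is assigned NHPPScoSpm rather than NHPPSco, which is why the rules are kept in an ordered list rather than an unordered mapping.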

Model Selection Using CBR

CBR is an approach to problem solving that utilises past experiences to solve new problems. The first step in the operation of a CBR system is retrieval, in which the inputs are analysed to determine the critical features to use in retrieving past cases from the case database. Among the well-known methods for case retrieval is the nearest neighbour method, which is used in HIMOS. To find the nearest neighbour matching the case being considered, the case with the largest weighted average of similarity functions for selected features is selected. In HIMOS four features were selected and all given equal weights. These features are: number of PM, number of CO, trend value and variability of PM cycle length. The reason for selecting these features is that they were found to be the main causes of failure to select a suitable model using the rule-based system. The similarity function was selected as the difference between the values of the feature in the current and retrieved cases divided by the standard deviation of the feature.

Once the best matching case has been retrieved, adaptation is carried out to reduce any prominent difference between the retrieved case and the current case through the derivational replay method. Thus in the CBR phase, the system uses rules similar to those used in the KBS phase to find a solution. However, some critical values in the adaptation rules are more relaxed compared with the original rules.

In the evaluation step the system displays multiple candidate models (possible solutions) with their critical features for the current case (adaptation results). The user can then evaluate these alternatives and select one using their expertise. For the non-expert user, the system itself provides the ‘Recommended Model’ as a result of evaluation. Here the system compares the results of adaptation with the results of retrieval. If there is no matching model then no recommendation is made, otherwise the system recommends the matching model. If there is more than one matching model, the system merely recommends the first ranked (nearest neighbour) model.

9.5.3.2 Results and Validation of HIMOS

HIMOS results for a component include some basic statistics, e.g. the numbers of PM and CO events, the current availability, etc. The most important result from the decision-maker’s point of view is the recommended PM interval. The optimal availability gives an estimate of the availability which might be achieved if the recommended PM policy is implemented. Table 9.1 shows the percentage success rate of HIMOS in modelling a large number of components. As can be seen, with RBR alone around two thirds of the components could not be modelled because no rule matched the data to a specific model. The introduction of case-based reasoning can add to the success rate of modelling components: the table shows that the introduction of CBR reduces the percentage of cases where no suitable model was identified from 68.6 % to 52.7 %. Given the self-learning nature of CBR, where the case base expands with use, it is possible to improve the success rate with the extended use of the system in certain environments.
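The nearest-neighbour retrieval described under ‘Model Selection Using CBR’ can be sketched as follows. The four features and the equal weights come from the text; everything else is an assumption for illustration. In particular, the text defines the per-feature function as a difference scaled by the standard deviation, and this sketch converts that scaled difference d into a similarity 1/(1 + d) so that the largest weighted average corresponds to the closest case. The feature names, the `Case` class and the stored cases are hypothetical.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Dict, List, Optional

FEATURES = ("n_pm", "n_co", "trend", "pm_cycle_variability")


@dataclass(frozen=True)
class Case:
    """A stored case: its feature values and the model that was selected."""
    name: str
    model: str
    features: Dict[str, float]


def feature_stdevs(case_base: List[Case]) -> Dict[str, float]:
    """Standard deviation of each feature over the stored cases."""
    return {f: pstdev([c.features[f] for c in case_base]) for f in FEATURES}


def similarity(current: Dict[str, float], stored: Case,
               stdevs: Dict[str, float],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-feature similarities (equal weights by default)."""
    weights = weights or {f: 1.0 for f in FEATURES}
    total = 0.0
    for f in FEATURES:
        sd = stdevs[f] or 1.0                      # guard against zero spread
        d = abs(current[f] - stored.features[f]) / sd
        total += weights[f] / (1.0 + d)            # assumed distance-to-similarity map
    return total / sum(weights.values())


def nearest_neighbour(current: Dict[str, float], case_base: List[Case]) -> Case:
    """Retrieve the stored case with the largest weighted similarity."""
    stdevs = feature_stdevs(case_base)
    return max(case_base, key=lambda c: similarity(current, c, stdevs))


# Hypothetical case base and query, for illustration only.
CASE_BASE = [
    Case("A", "RP", {"n_pm": 10, "n_co": 2, "trend": 0.1, "pm_cycle_variability": 0.2}),
    Case("B", "NHPP", {"n_pm": 3, "n_co": 8, "trend": 1.5, "pm_cycle_variability": 0.9}),
    Case("C", "Geometric I", {"n_pm": 6, "n_co": 5, "trend": 0.8, "pm_cycle_variability": 0.5}),
]
query = {"n_pm": 9, "n_co": 2, "trend": 0.2, "pm_cycle_variability": 0.25}
best = nearest_neighbour(query, CASE_BASE)
print(best.name, best.model)
```

Dividing by the standard deviation puts features measured on very different scales, such as event counts and trend values, on a comparable footing before they are averaged.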


Table 9.1. Percentage use of maintenance models for HIMOS when applied to large systems, 1633 components in three data files (Kobbacy 2004)

Model                 HIMOS*
                      RBR      RBR+CBR
Stochastic
  RP                  6.6      12.8
  NHPP                1.6      1.6
  NRP                 2.3      2.3
  Total stochastic    10.5     16.4
Geometric I           15.7     23.5
Geometric II          1.7      1.8
Weibull               1.7      3.7
Deterministic         1.8      1.9
No model suitable     68.6     52.7
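As a rough reading of Table 9.1, the share of components for which some model was selected follows directly from the ‘No model suitable’ row. The sketch below assumes that the percentages refer to the pooled 1633 components, which the caption suggests but does not state explicitly, so the component count it derives is only approximate. The variable names are introduced here for illustration.

```python
N_COMPONENTS = 1633  # from the caption of Table 9.1

# 'No model suitable' percentages from Table 9.1.
no_model = {"RBR": 68.6, "RBR+CBR": 52.7}

# Share of components that could be modelled under each configuration.
modelled = {k: round(100.0 - v, 1) for k, v in no_model.items()}

# Gain from adding CBR, in percentage points and in (approximate) components.
gain_points = round(no_model["RBR"] - no_model["RBR+CBR"], 1)
extra_components = round(N_COMPONENTS * gain_points / 100.0)

print(modelled)
print(gain_points, extra_components)
```

Under that assumption, adding CBR raises the modelled share from 31.4 % to 47.3 %, a gain of 15.9 percentage points, or roughly 260 additional components.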

HIMOS was validated using test cases by comparing the results of analysis of selected cases by HIMOS with the recommendations of an expert panel. For the validation of HIMOS, eight data sets were used and a panel of five experts was involved. In general there was agreement between HIMOS and the experts. The experts had a measure of disagreement in their advice as a result of making different assumptions in their analysis. Experts also made useful suggestions for the operation of the system. Table 9.2 is a typical example of HIMOS and the experts’ recommendations.

Table 9.2. Example of validation of HIMOS

Data Set 3
HIMOS       Increase PM interval from 177 to 403 days (CBR-RPOW model)
Expert A    There is no evidence of trend. Increase PM interval but should not be allowed to approach 600 days
Expert B    Unless failure has substantive safety or risk association, PM could be extended.
Expert C    Optimal PM interval is found to be 404 days
Expert D    Increase PM interval
Expert E    Increase PM interval to 250 days


9.6 Future Developments

In approaching the problem of maintenance management of complex engineering systems one can identify two broad levels for tackling the maintenance issues (Kobbacy and Labib 2003). At the higher decision-making level, one is usually concerned with effectiveness issues such as prioritising machines, modes of failure and types of maintenance actions that will lead to improving systems operations. At the lower decision level, one is concerned with maintenance efficiency issues, e.g. the PM interval. Researchers tend to address either the higher-level issues of effectiveness or the lower-level issues of efficiency.

Labib (1998) proposed two techniques to identify effective maintenance policies at the higher level, namely the rule-based decision making grid (DMG) and the analytic hierarchy process (AHP). The AHP is a technique for prioritisation that relies on modelling a problem as a hierarchical structure with a goal at the apex, levels of criteria below it and alternatives at the bottom. The DMG acts as a map where the machines with the worst performances are placed, based on selected multiple criteria. These criteria, such as downtime and frequency of breakdowns, are determined through prioritisation based on the AHP approach. The objective is to take maintenance actions to improve the machines’ performance as measured by the selected multiple criteria. This approach is discussed in Chapter 17.

In order to tackle the issue of efficient maintenance management for complex engineering systems, Kobbacy (1992) proposed integrating artificial intelligence techniques such as rule-based reasoning with mathematical modelling. Such an approach allows automated modelling of large amounts of maintenance data to carry out analysis and propose an optimal maintenance schedule, i.e. frequency of PM. This approach has been explained in Section 9.5.

Kobbacy and Labib (see Section 9.8) propose merging their approaches of DMG and HIMOS in order to develop an integrated approach to ‘effective’ and ‘efficient’ maintenance management. Figure 9.3 outlines the design of such a futuristic system. The proposed concept emphasises the sharing of data and tools between the two models while maintaining their distinct features and allowing flow of information between them.


Figure 9.3. Outline design of AMMCM (Adaptive Maintenance Measurement and Control Model)

9.7 Concluding Remarks

There have been many developments in the use of AI in the maintenance area, and hundreds of papers have been published. Kobbacy et al. (2007) have shown that the number of publications using NNs and GAs in maintenance has increased in the past few years, which can be explained by their use in optimising complex and nonlinear problems (see Section 9.4). There is an apparent increase in the use of hybrid approaches that utilise the combined strengths of their component techniques. There is enormous potential for developments in many applications of AI in maintenance by combining two or more AI techniques. Multiple hybrid intelligent management systems (MHIMS) are potentially powerful tools that can help in making the right decisions right, i.e. making effective and efficient decisions. The author has a vision that such MHIMS may be assembled in the future from off-the-shelf modules, resulting in a reduction in the time and cost of development.


9.8 Acknowledgments

The author wishes to acknowledge the contributions of those who collaborated at the various stages of the development of IMOS and HIMOS. In particular, the author wishes to acknowledge the significant contribution of A.W. Labib in developing the proposal for the AMMCM presented in Section 9.6.

9.9 References

Acosta, G.G., Verucchi, C.J. and Gelso, E.R. (2006), A current monitoring system for diagnosing electrical failures in induction motors, Mechanical Systems and Signal Processing, 20, 953–965.
Ahmed, K., Langdon, A. and Frieze, P.A. (1991), An expert system for offshore structure inspection and maintenance, Computers and Structures, 40, 143–159.
Al-Garni, A.Z., Jamal, A., Ahmad, A.M., Al-Garni, A.M. and Tozan, M. (2006), Neural network-based failure rate prediction for De Havilland Dash-8 tires, Engineering Applications of Artificial Intelligence, 19, 681–691.
Al-Najjar, B. and Alsyouf, I. (2003), Selecting the most efficient maintenance approach using fuzzy multiple criteria decision making, International Journal of Production Economics, 84, 85–100.
Ascher, H.E. and Kobbacy, K.A.H. (1995), Modelling preventive maintenance for deteriorating repairable systems, IMA Journal of Mathematics Applied in Business & Industry, 6, 85–99.
Bansal, D., Evans, D.J. and Jones, B. (2004), A real-time predictive maintenance system for machine systems, International Journal of Machine Tools and Manufacture, 44, 759–766.
Baroni, P., Canzi, U. and Guida, G. (1997), Fault diagnosis through history reconstruction: an application to power transmission networks, Expert Systems with Applications, 12, 37–52.
Batanov, D., Nagarue, N. and Nitikhunkasem, P. (1993), EXPERT-MM: A knowledge-based system for maintenance management, Artificial Intelligence in Engineering, 8, 283–291.
Beaulah, S.A. and Chalabi, Z.C. (1997), Intelligent real-time fault diagnosis of greenhouse sensors, Control Engineering Practice, 5, 1573–1580.
Booth, C. and McDonald, J.R. (1998), The use of artificial neural networks for condition monitoring of electrical power transformers, Neurocomputing, 23, 97–109.
Braglia, M., Frosolini, M. and Montanari, R. (2003), Fuzzy criticality assessment model for failure modes and effects analysis, International Journal of Quality & Reliability Management, 20, 503–524.
Cavory, G., Dupas, R. and Goncalves, G. (2001), A genetic approach to the scheduling of preventive maintenance tasks on a single product manufacturing production line, International Journal of Production Economics, 74, 135–146.
Chan, F.T.S., Chung, S.H., Chan, L.Y., Finke, G. and Tiwari, M.K. (2006), Solving distributed FMS scheduling problems subject to maintenance: Genetic algorithms approaches, Robotics and Computer-Integrated Manufacturing, 22, 493–504.
Chen, Q., Chan, Y.W. and Worden, K. (2003), Structural fault diagnosis and isolation using neural networks based on response-only data, Computers & Structures, 81, 2165–2172.


Chootinan, P., Chen, A., Horrocks, M.R. and Bolling, D. (2006), A multi-year pavement maintenance program using a stochastic simulation-based genetic algorithm approach, Transportation Research Part A: Policy and Practice, 40, 725–743.
Cunningham, P., Smyth, B. and Bonzano, A. (1998), An incremental retrieval mechanism for case-based electronic fault diagnosis, Knowledge-Based Systems, 11, 239–248.
Dash, S., Rengaswamy, R. and Venkatasubramanian, V. (2003), Fuzzy-logic based trend classification for fault diagnosis of chemical processes, Computers & Chemical Engineering, 27, 347–362.
de Brito, J., Branco, F.A., Thoft-Christensen, P. and Sorensen, J.D. (1997), An expert system for concrete bridge management, Engineering Structures, 19, 519–526.
Dendronic Decisions Ltd (2003), www.dendronic.com/articles.htm.
Dhaliwal, D.S. (1986), The use of AI in maintaining and operating complex engineering systems, in Expert Systems and Optimisation in Process Control, A. Mamdani and J. Efstathiou, eds, 28–33, Gower Technical Press, Aldershot.
Dragan, A.S., Walters, G.A. and Knezevic, J. (1995), Optimal opportunistic maintenance policy using genetic algorithms, 1 formulation, Journal of Quality in Maintenance Engineering, 1, 34–49.
Drury, C.G. and Prabhu, P. (1996), Information requirements of aircraft inspection: framework and analysis, International Journal of Human-Computer Studies, 45, 679–695.
Eldin, N.N. and Senouci, A.B. (1995), Use of neural networks for condition rating of joint concrete pavements, Advances in Engineering Software, 23, 133–141.
Feldman, R.M., William, M.L., Slade, T., McKee, L.G. and Talbert, A. (1992), The development of an integrated mathematical and knowledge-based maintenance delivery system, Computers & Operations Research, 19, 425–434.
Firebaugh, M.W. (1988), Artificial Intelligence: A Knowledge-based Approach, Boyd & Fraser Publishing Co., Danvers, MA, USA.
Frank, P.M. and Ding, X. (1997), Survey of robust residual generation and evaluation methods in observer-based fault detection systems, Journal of Process Control, 7, 403–427.
Frank, P.M. and Koppen-Seliger, B. (1997), New developments using AI in fault diagnosis, Engineering Applications of Artificial Intelligence, 10, 3–14.
Garcia, E., Guyennet, H., Lapayre, J.C. and Zerhouni, N. (2004), A new industrial cooperative tele-maintenance platform, Computers & Industrial Engineering, 46, 851–864.
Gilabert, E. and Arnaiz, A. (2006), Intelligent automation systems for predictive maintenance: A case study, Robotics and Computer-Integrated Manufacturing, 22, 543–549.
Gits, C.W. (1984), On the maintenance concept for a technical system, PhD Thesis, Eindhoven Technische Hogeschool, Eindhoven.
Gromann de Araujo Goes, A., Alvarenga, M.A.B. and Frutuoso e Melo, P.F. (2005), NAROAS: a neural network-based advanced operator support system for the assessment of systems reliability, Reliability Engineering & System Safety, 87, 149–161.
Hissel, D., Pera, M.C. and Kauffmann, J.M. (2004), Diagnosis of automotive fuel cell power generators, Journal of Power Sources, 128, 239–246.
Huang, S.J. (1998), Hydroelectric generation scheduling – an application of genetic-embedded fuzzy system approach, Electric Power Systems Research, 48, 65–72.
Hui, S.C., Fong, A.C.M. and Jha, G. (2001), A web-based intelligent fault diagnosis system for customer service support, Engineering Applications of Artificial Intelligence, 14, 537–548.
Jeffries, M., Lai, E., Plantenberg, D.H. and Hull, J.B. (2001), A fuzzy approach to the condition monitoring of a packaging plant, Journal of Materials Processing Technology, 109, 83–89.


Jha, M.K. and Abdullah, J. (2006), A Markovian approach for optimising highway life-cycle with genetic algorithms by considering maintenance of roadside appurtenances, Journal of the Franklin Institute, 343, 404–419.
Jota, P.R.S., Islam, S.M., Wu, T. and Ledwich, G. (1998), A class of hybrid intelligent system for fault diagnosis in electric power systems, Neurocomputing, 23, 207–224.
Khoo, L.P., Ang, C.L. and Zhang, J. (2000), A fuzzy-based genetic approach to the diagnosis of manufacturing systems, Engineering Applications of Artificial Intelligence, 13, 303–310.
Kobbacy, K.A.H. (1992), The use of knowledge-based systems in evaluation and enhancement of maintenance routines, International Journal of Production Economics, 24, 243–248.
Kobbacy, K.A.H. (2004), On the evolution of an intelligent maintenance optimisation system, Journal of the Operational Research Society, 55, 139–146.
Kobbacy, K.A.H. and Jeon, J. (2001), The development of a hybrid intelligent maintenance optimisation system (HIMOS), Journal of the Operational Research Society, 52, 762–778.
Kobbacy, K.A.H., Percy, D.F. and Fawzi, B.B. (1995a), Sensitivity analysis for preventive maintenance models, IMA Journal of Mathematics Applied in Business & Industry, 6, 53–66.
Kobbacy, K.A.H., Proudlove, N.L. and Harper, M.A. (1995b), Towards an intelligent maintenance optimisation system, Journal of the Operational Research Society, 46, 229–240.
Kobbacy, K.A.H., Fawzi, B.B., Percy, D.F. and Ascher, H.E. (1997), A full history proportional hazards model for preventive maintenance modelling, Journal of Quality and Reliability Engineering International, 13, 187–198.
Kobbacy, K.A.H., Percy, D.F. and Sharp, J.M. (2005), Results of preventive maintenance survey, unpublished report, University of Salford.
Kobbacy, K.A.H., Vadera, S. and Rasmy, M.H. (2007), AI and OR in management of operations: history and trends, Journal of the Operational Research Society, 58, 10–28.
Kohno, T., Hamada, S., Araki, D., Kojima, S. and Tanaka, T. (1997), Error repair and knowledge acquisition via case-based reasoning, Artificial Intelligence, 91, 85–101.
Kuo, H-C. and Chang, H-K. (2004), A new symbiotic evolution-based fuzzy-neural approach to fault diagnosis of marine propulsion systems, Engineering Applications of Artificial Intelligence, 17, 919–930.
Labib, A.W. (1998), World class maintenance using a computerised maintenance management system, Journal of Quality in Maintenance Engineering, 4, 66–75.
Lee, C-K. and Kim, S-K. (2007), GA-based algorithm for selecting optimal repair and rehabilitation methods for reinforced concrete (RC) bridge decks, Automation in Construction, 16, 153–164.
Leung, D. and Romagnoli, J. (2002), An integration mechanism for multivariate knowledge-based fault diagnosis, Journal of Process Control, 12, 15–26.
Lin, C-C. and Wang, H-P. (1996), Performance analysis of rotating machinery using enhanced cerebellar model articulation controller (E-CMAC) neural networks, Computers & Industrial Engineering, 30, 227–242.
Luxhoj, J.T. and Williams, T.P. (1996), Integrated decision support for aviation safety inspectors, Finite Elements in Analysis and Design, 23, 381–403.
Marseguerra, M., Zio, E. and Podofillini, L. (2004), A multiobjective genetic algorithm approach to optimisation of the technical specifications of a nuclear safety system, Reliability Engineering & System Safety, 84, 87–99.
Martland, C.D., McNeil, S., Acharya, D., Mishalani, R. and Eshelby, J. (1990), Applications of expert systems in railroad maintenance: Scheduling rail relays, Transportation Research Part A: General, 24, 39–52.


Mechefske, C.K. (1998), Objective machinery fault diagnosis using fuzzy logic, Mechanical Systems and Signal Processing, 12, 855–862.
Microsoft ENCARTA College Dictionary (2001), St Martin’s Press, N.Y.
Miller, D., Mellichamp, J.M. and Wang, J. (1990), An image enhanced knowledge based expert system for maintenance trouble shooting, Computers in Industry, 15, 187–202.
Milne, R., Nicole, C. and Trave-Massuyes, L. (2001), TIGER with model based diagnosis: initial deployment, Knowledge-Based Systems, 14, 213–222.
Morcous, G. and Lounis, Z. (2005), Maintenance optimisation of infrastructure networks using genetic algorithms, Automation in Construction, 14, 129–142.
Nam, D.S., Jeong, C.W., Choe, Y.J. and Yoon, E.S. (1996), Operation-aided system for fault diagnosis of continuous and semi-continuous processes, Computers & Chemical Engineering, 20, 793–803.
Oke, S.A. and Charles-Owaba, O.E. (2006), Application of fuzzy logic control model to Gantt charting preventive maintenance scheduling, International Journal of Quality & Reliability Management, 23, 441–459.
Omerdic, E. and Roberts, G. (2004), Thruster fault diagnosis and accommodation for open-frame underwater vehicles, Control Engineering Practice, 12, 1575–1598.
Patel, S.A., Kamrani, A.K. and Orady, E. (1995), A knowledge-based system for fault diagnosis and maintenance of advanced automated systems, Computers & Industrial Engineering, 29, 147–151.
Percy, D.F., Kobbacy, K.A.H. and Ascher, H.E. (1998), Using proportional intensities models to schedule preventive maintenance intervals, IMA Journal of Mathematics Applied in Business & Industry, 9, 289–302.
Rao, M., Yang, H. and Yang, H. (1998), Integrated distributed intelligent system architecture for incidents monitoring and diagnosis, Computers in Industry, 37, 143–151.
Ruiz, D., Canton, J., Nougues, J.M., Espuna, A. and Puigjaner, L. (2001), On-line fault diagnosis system support for reactive scheduling in multipurpose batch chemical plants, Computers & Chemical Engineering, 25, 829–837.
Ruiz, R., Garcia-Diaz, C. and Maroto, C. (2006), Considering scheduling and preventive maintenance in the flowshop sequencing problem, Computers & Operations Research, 34, 3314–3330.
Saranga, H. (2004), Opportunistic maintenance using genetic algorithms, Journal of Quality in Maintenance Engineering, 10, 66–74.
Scenna, N.J. (2000), Some aspects of fault diagnosis in batch processes, Reliability Engineering & System Safety, 70, 95–110.
Sergaki, A. and Kalaitzakis, K. (2002), Reliability Engineering & System Safety, 77, 19–30.
Sharma, R., Singh, K., Singhal, D. and Ghosh, R. (2004), Neural network applications for detecting process faults in packed towers, Chemical Engineering and Processing, 43, 841–847.
Shayler, P.J., Goodman, M. and Ma, T. (2000), The exploitation of neural networks in automotive engine management systems, Engineering Applications of Artificial Intelligence, 13, 147–157.
Shyur, H.J., Luxhoj, J.T. and Williams, T.P. (1996), Using neural networks to predict component inspection requirements for aging aircraft, Computers & Industrial Engineering, 30, 257–267.
Simani, S. and Fantuzzi, C. (2000), Fault diagnosis in power plant using neural networks, Information Sciences, 127, 125–136.
Sinha, S.K. and Fieguth, P.W. (2006), Neuro-fuzzy network for the classification of buried pipe defects, Automation in Construction, 15, 73–83.
Skarlatos, D., Karakasis, K. and Trochidis, A. (2004), Railway wheel fault diagnosis using a fuzzy-logic method, Applied Acoustics, 65, 951–966.


Sortrakul, N., Nachtmann, H.L. and Cassady, C.R. (2005), Genetic algorithms for integrated preventive maintenance planning and production scheduling for a single machine, Computers in Industry, 56, 161–168.
Spoerre, J.K. (1997), Application of the cascade correlation algorithm (CCA) to bearing fault classification problems, Computers in Industry, 32, 295–304.
Sprague, R.H. and Watson, H.J. (1986), Decision support systems – putting theory into practice, Prentice Hall, Englewood Cliffs, New Jersey.
Srinivasan, D., Liew, A.C., Chen, J.S.P. and Chang, C.S. (1993), Intelligent maintenance scheduling of distributed system components with operating constraints, Electric Power Systems Research, 26, 203–209.
Sudiarso, A. and Labib, A.W. (2002), A fuzzy logic approach to an integrated maintenance/production scheduling algorithm, International Journal of Production Research, 40, 3121–3138.
Tan, J.S. and Kramer, M.A. (1997), A general framework for preventive maintenance optimization in chemical process operations, Computers & Chemical Engineering, 21, 1451–1469.
Tang, B-S., Jeong, S.K., Oh, Y-M. and Tan, A.C.C. (2004), Case-based reasoning system with Petri nets for induction motor fault diagnosis, Expert Systems with Applications, 27, 301–311.
Tarifa, E.E., Humana, D., Franco, S., Martinez, S.L., Nunez, A.F. and Scenna, N.J. (2003), Fault diagnosis for MSF using neural networks, Desalination, 152, 215–222.
Tsai, Y-T., Wang, K-S. and Teng, H-Y. (2001), Optimizing preventive maintenance for mechanical components using genetic algorithms, Reliability Engineering & System Safety, 74, 89–97.
Varde, P.V., Sankar, S. and Verma, A.K. (1998), An operator support system for research reactor operations and fault diagnosis through a connectionist framework and PSA based knowledge based system, Reliability Engineering & System Safety, 60, 53–69.
Varma, A. and Roddy, N. (1999), ICARUS: design and deployment of a case-based reasoning system for locomotive diagnostics, Engineering Applications of Artificial Intelligence, 12, 681–690.
Villanueva, H. and Lamba, H. (1997), Operator guidance system for industrial plant supervision, Expert Systems with Applications, 12, 441–454.
Wen, F. and Chang, C.S. (1998), A new approach to fault diagnosis in electrical distribution networks using a genetic algorithm, Artificial Intelligence in Engineering, 12, 69–80.
Wu, H., Liu, Y., Ding, Y. and Qiu, Y. (2004), Fault diagnosis expert system for modern commercial aircraft, Aircraft Engineering and Aerospace Technology, 76, 398–403.
Xia, Q. and Rao, M. (1999), Dynamic case-based reasoning for process operation support systems, Engineering Applications of Artificial Intelligence, 12, 343–361.
Yang, B-S. and Kim, K.J. (2006), Applications of Dempster-Shafer theory in fault diagnosis of induction motors, Mechanical Systems and Signal Processing, 20, 403–420.
Yang, B-S., Han, T. and Kim, Y-S. (2004), Integration of ART-Kohonen neural network and case-based reasoning for intelligent fault diagnosis, Expert Systems with Applications, 26, 387–395.
Yang, B-S., Lim, D-S. and Tan, A.C.C. (2005), VIBEX: an expert system for vibration fault diagnosis of rotating machinery using decision tree and decision table, Expert Systems with Applications, 28, 735–742.
Yangping, Z., Bingquan, Z. and DongXin, W. (2000), Application of genetic algorithms to fault diagnosis in nuclear power plants, Reliability Engineering & System Safety, 67, 153–160.
Yu, R., Iung, B. and Panetto, H. (2003), A Multi-Agents based E-maintenance system with case-based reasoning decision support, Engineering Applications of Artificial Intelligence, 16, 321–333.


Zhang, H.Y., Chan, C.W., Cheung, K.C. and Ye, Y.J. (2001), Fuzzy ARTMAP neural network and its application to fault diagnosis of navigation systems, Automatica, 37, 1065–1070.
Zhao, Z. and Chen, C. (2001), Concrete bridge deterioration diagnosis using fuzzy inference system, Advances in Engineering Software, 32, 317–325.

Part D

Problem Specific Models

10 Maintenance of Repairable Systems

Bo Henry Lindqvist

10.1 Introduction

A commonly used definition of a repairable system (Ascher and Feingold 1984) states that this is a system which, after failing to perform one or more of its functions satisfactorily, can be restored to fully satisfactory performance by any method other than replacement of the entire system. In order to cover more realistic applications, and much of the recent literature on the subject, we need to extend this definition to include the possibility of additional maintenance actions which aim at servicing the system for better performance. This is referred to as preventive maintenance (PM), where one may further distinguish between condition based PM and planned PM. The former type of maintenance is due when the system exhibits inferior performance, while the latter is performed at predetermined points in time.

Traditionally, the literature on repairable systems is concerned with modeling of the failure times only, using point process theory. A classical reference here is Ascher and Feingold (1984). The most commonly used models for the failure process of a repairable system are renewal processes (RP), including the homogeneous Poisson processes (HPP), and nonhomogeneous Poisson processes (NHPP). While such models are often sufficient for simple reliability studies, the need for more complex models is clear. In this chapter we consider some generalizations and extensions of the basic models, with the aim of arriving at more realistic models which give a better fit to data.

First we consider the trend renewal process (TRP) introduced and studied in Lindqvist et al. (2003). The TRP includes NHPP and RP as special cases, and the main new feature is to allow a trend in processes of non-Poisson (renewal) type. As exemplified by some real data, in the case where several systems of the same kind are considered, there may be unobserved heterogeneity between the systems which, if overlooked, may lead to non-optimal or possibly completely wrong decisions. We will consider this in the framework of the TRP, which in Lindqvist et al. (2003) is extended to the so-called HTRP model which includes the possibility of heterogeneity. Heterogeneity can be thought of as an effect of an unobserved covariate.

Another extension of the basic models is to allow the systems to be preventively maintained. We review some recent research in this direction, where this situation is modeled as a competing risks problem between failure and PM. This leads to a need for combining the theory of competing risks with repair models and point process theory. Relevant statistical data for such analyses are found in most modern reliability databases. The book by Bedford and Cooke (2001) contains a chapter related to this. A general reference to competing risks is the book by Crowder (2001).

Figure 10.1. Event times ($T_i$) and sojourn times ($X_i$) of a repairable system

The last extension of the basic models to be considered in the present chapter consists of using Markov models to model the behavior of periodically inspected systems in between inspections, with the use of separate Markov models for the maintenance tasks at inspections. Recent review articles concerning repairable systems and maintenance include Peña (2006) and Lindqvist (2006). A review of methods for analysis of recurrent events with a medical bias is given by Cook and Lawless (2002). General books on statistical models and methods in reliability, covering many of the topics considered here, are Meeker and Escobar (1998) and Rausand and Høyland (2004).

10.2 Point Process Approach

10.2.1 Notation and Basic Definitions

Consider a repairable system where time usually runs from $t = 0$ and events occur at ordered times $T_1, T_2, \ldots$. Here time is not necessarily calendar time, but can be, for example, operation time, number of cycles, number of kilometers run, length of a crack, etc. In the present treatment we shall disregard time durations of repair and maintenance, and assume that the system is always restarted immediately after failure or maintenance action. The inter-event, or inter-failure, times will be denoted $X_1, X_2, \ldots$. Here $X_i = T_i - T_{i-1}$, $i = 1, 2, \ldots$, where for convenience we define $T_0 \equiv 0$. Figure 10.1 illustrates the notation. We also make use of the counting process representation $N(t)$ = number of events in $(0, t]$.

In order to describe probability models for repairable systems we use some notation from the theory of point processes. A key reference is Andersen et al. (1993). $H_t$ denotes the history of the failure process up to, but not including, time $t$.
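The notation above can be illustrated with a minimal Python sketch. The function names and the event times are hypothetical and serve only to show how the inter-event times $X_i$ and the counting process $N(t)$ are obtained from the ordered event times $T_i$.

```python
from bisect import bisect_right

def inter_event_times(event_times):
    """Return X_i = T_i - T_{i-1}, with T_0 = 0, for ordered event times."""
    previous = 0.0
    gaps = []
    for t in event_times:
        gaps.append(t - previous)
        previous = t
    return gaps

def count_events(event_times, t):
    """Counting process N(t): number of events in (0, t]."""
    return bisect_right(event_times, t)

# Hypothetical event times, for illustration only.
T = [2.0, 5.5, 6.0, 9.25]
X = inter_event_times(T)
```

Here `count_events(T, 6.0)` counts the event at time 6.0 itself, in line with the half-open interval $(0, t]$.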


The conditional intensity of the process at time $t$ is defined as

$$\gamma(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr(\text{event in } [t, t + \Delta t) \mid H_t)}{\Delta t}. \tag{10.1}$$

From this we obtain an expression for the likelihood function, which is needed for statistical inference. Suppose that a single system as described above is observed from time 0 to time $\tau$, resulting in observations $T_1, T_2, \ldots, T_{N(\tau)}$. The likelihood function is then given by (Andersen et al. 1993, Section II.7)

$$L = \left\{ \prod_{i=1}^{N(\tau)} \gamma(T_i) \right\} \exp\left\{ -\int_0^\tau \gamma(u)\,du \right\}. \tag{10.2}$$
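As a worked special case of Equation 10.2, assume (for illustration only) that the conditional intensity is a constant, $\gamma(t) = \lambda$. The likelihood then depends on the data only through $N(\tau)$, and the maximum likelihood estimate follows in one step:

```latex
% Worked special case of Equation 10.2 under the assumption \gamma(t) = \lambda
\begin{align*}
L(\lambda) &= \lambda^{N(\tau)} \exp(-\lambda \tau), \\
\log L(\lambda) &= N(\tau) \log \lambda - \lambda \tau, \\
\frac{d}{d\lambda} \log L(\lambda) &= \frac{N(\tau)}{\lambda} - \tau = 0
\quad\Longrightarrow\quad \hat{\lambda} = \frac{N(\tau)}{\tau}.
\end{align*}
```

The estimate is simply the observed number of events per unit of observation time.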

10.2.2 Perfect and Minimal Repair Models

Consider a system with failure rate $z(t)$. Suppose first that after each failure, the system is repaired to a condition as good as new, called a perfect repair. In this case the failure process can be modeled by a renewal process (RP) with inter-event time distribution $F$, denoted RP($F$). Clearly, the conditional intensity defined in Equation 10.1 is given by

$$\gamma(t) = z(t - T_{N(t-)}),$$

where $t - T_{N(t-)}$ is the time since the last failure strictly before time $t$.

Suppose instead that after a failure, the system is repaired only to the state it had immediately before the failure, called a minimal repair. This means that the conditional intensity of the failure process immediately after the failure is the same as it was immediately before the failure, and hence is exactly as it would be if no failure had ever occurred. Thus we must have

$$\gamma(t) = z(t),$$

and the process is a nonhomogeneous Poisson process (NHPP) with intensity $z(t)$, denoted NHPP($z(\cdot)$). In practice a minimal repair usually corresponds to repairing or replacing only a minor part of the system. If $z(t) = \lambda$ does not depend on $t$, then NHPP($z(\cdot)$) is a homogeneous Poisson process which we denote by HPP($\lambda$). Note that an HPP is at the same time an RP with exponential inter-failure times.

10.2.3 The Trend-Renewal Process

The idea behind the trend-renewal process is to generalize the following well known property of the NHPP. First let the cumulative intensity function corresponding to an intensity function $\lambda(\cdot)$ be defined by $\Lambda(t) = \int_0^t \lambda(u)\,du$. Then if $T_1, T_2, \ldots$ is an NHPP($\lambda(\cdot)$), the time-transformed stochastic process $\Lambda(T_1), \Lambda(T_2), \ldots$ is HPP(1).

The trend-renewal process (TRP) is defined simply by allowing the above HPP(1) to be any renewal process RP($F$). Thus, in addition to the intensity function $\lambda(t)$, for a TRP we need to specify a distribution function $F$ of the inter-arrival times of this renewal process. Formally we can define the process TRP($F$, $\lambda(\cdot)$) as follows. Let $\lambda(t)$ be a nonnegative function defined for $t \ge 0$, and let $\Lambda(t) = \int_0^t \lambda(u)\,du$. The process $T_1, T_2, \ldots$ is called TRP($F$, $\lambda(\cdot)$) if the process $\Lambda(T_1), \Lambda(T_2), \ldots$ is RP($F$), that is, if the $\Lambda(T_i) - \Lambda(T_{i-1})$; $i = 1, 2, \ldots$ are i.i.d. with distribution function $F$. The function $\lambda(\cdot)$ is called the trend function, while $F$ is called the renewal distribution. In order to have uniqueness of the model it is usually assumed that $F$ has expected value 1.

Figure 10.2 illustrates the definition. For the cited property of the NHPP, the lower axis would be an HPP with unit intensity, HPP(1). For the TRP, this process is instead taken to be any renewal process, RP($F$), where $F$ has expectation 1. This shows that the TRP includes the NHPP as a special case. Further, if $\lambda(t) \equiv 1$, then $\Lambda(T_i) = T_i$, and so $T_1, T_2, \ldots$ is RP($F$). For an NHPP($\lambda(\cdot)$), the RP($F$) would be HPP(1). Thus TRP($1 - e^{-x}$, $\lambda(\cdot)$) = NHPP($\lambda(\cdot)$). Also, TRP($F$, 1) = RP($F$), which shows that the TRP class includes both the RP and NHPP classes.

Figure 10.2. The defining property of the trend-renewal process
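The defining property can be turned directly into a simulation recipe: generate a renewal process on the transformed time axis and map it back through $\Lambda^{-1}$. The sketch below is only an illustration under assumptions that are not part of the definition: a power-law cumulative trend $\Lambda(t) = c t^b$ and a Weibull renewal distribution rescaled to have expected value 1. The function name and parameter values are hypothetical.

```python
import math
import random

def simulate_trp(c, b, shape, tau, rng):
    """Simulate event times of a TRP(F, lambda) on (0, tau].

    Assumptions of this sketch: Lambda(t) = c * t**b, and F is Weibull with
    the given shape, rescaled so that its expected value is 1.
    """
    scale = 1.0 / math.gamma(1.0 + 1.0 / shape)  # Weibull scale giving mean 1
    events = []
    transformed = 0.0  # current position on the Lambda(t) axis
    while True:
        # Renewal process RP(F) on the transformed time axis.
        transformed += rng.weibullvariate(scale, shape)
        t = (transformed / c) ** (1.0 / b)  # map back: T_i = Lambda^{-1}(.)
        if t > tau:
            return events
        events.append(t)

rng = random.Random(1)
times = simulate_trp(c=0.01, b=1.5, shape=1.0, tau=100.0, rng=rng)
```

With `shape=1.0` the renewal distribution is the unit exponential, so the call above simulates an NHPP; with `c=1.0` and `b=1.0` the trend is constant and the output is an ordinary renewal process.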

It can be shown (Lindqvist et al. 2003) that the conditional intensity function, given the history $H_t$, for the TRP($F$, $\lambda(\cdot)$) is

$$\gamma(t) = z(\Lambda(t) - \Lambda(T_{N(t-)}))\,\lambda(t), \tag{10.3}$$

where $z(\cdot)$ is the hazard rate corresponding to $F$. This is a product of one factor, $\lambda(t)$, which depends on the age $t$ of the system and one factor which depends on a transformed time from the last previous failure.

Suppose now that a single system has been observed in $[0, \tau]$, with failures at $T_1, T_2, \ldots, T_{N(\tau)}$. If a TRP($F$, $\lambda(\cdot)$) is used as a model, then substitution of Equation 10.3 into Equation 10.2 gives the likelihood

$$L = \left\{ \prod_{i=1}^{N(\tau)} z[\Lambda(T_i) - \Lambda(T_{i-1})]\,\lambda(T_i) \right\} \exp\left\{ -\int_0^\tau z[\Lambda(u) - \Lambda(T_{N(u-)})]\,\lambda(u)\,du \right\}. \tag{10.4}$$

For the NHPP($\lambda(\cdot)$) we have $z(t) \equiv 1$, so the likelihood simplifies to the well known expression (Crowder et al. 1991, p 166)

$$L = \left\{ \prod_{i=1}^{N(\tau)} \lambda(T_i) \right\} \exp\left\{ -\int_0^\tau \lambda(u)\,du \right\}.$$
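For a concrete trend function the NHPP likelihood can be maximized in closed form. The following sketch assumes a power-law trend $\lambda(t) = c b t^{b-1}$, so that $\int_0^\tau \lambda(u)\,du = c\tau^b$; setting the partial derivatives of the log-likelihood to zero gives $\hat{b} = N(\tau) / \sum_i \log(\tau / T_i)$ and $\hat{c} = N(\tau) / \tau^{\hat{b}}$. The function names and failure times are hypothetical.

```python
import math

def nhpp_power_law_loglik(times, tau, c, b):
    """Log-likelihood of an NHPP with lambda(t) = c*b*t**(b-1) on [0, tau]."""
    log_intensities = sum(math.log(c * b) + (b - 1.0) * math.log(t) for t in times)
    return log_intensities - c * tau ** b  # the integral of lambda is c * tau**b

def nhpp_power_law_mle(times, tau):
    """Closed-form maximizers (c_hat, b_hat) of the log-likelihood above.

    Requires at least one failure time strictly inside (0, tau).
    """
    n = len(times)
    b_hat = n / sum(math.log(tau / t) for t in times)
    c_hat = n / tau ** b_hat
    return c_hat, b_hat

# Hypothetical failure times, for illustration only.
failure_times = [12.0, 40.0, 55.0, 71.0, 80.0, 93.0]
c_hat, b_hat = nhpp_power_law_mle(failure_times, tau=100.0)
```

An estimate `b_hat` above 1 indicates an increasing trend, while a value below 1 indicates a decreasing one.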

Returning to the general case, if $f$ is the density function corresponding to $F$, then we can write the likelihood at Equation 10.4 as

$$L = \left\{ \prod_{i=1}^{N(\tau)} f[\Lambda(T_i) - \Lambda(T_{i-1})]\,\lambda(T_i) \right\} \{1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})]\}. \tag{10.5}$$

This latter form of the likelihood of the TRP follows directly from the definition, since the conditional density of $T_i$ given $T_1 = t_1, \ldots, T_{i-1} = t_{i-1}$ is $f[\Lambda(t_i) - \Lambda(t_{i-1})]\lambda(t_i)$, and the probability of no failures in the time interval $(T_{N(\tau)}, \tau]$, given $T_1, \ldots, T_{N(\tau)}$, is $1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})]$. This again simplifies if $\lambda(t) \equiv 1$, in which case it gives the likelihood of an RP($F$) observed on $[0, \tau]$.

10.2.4 Observations from Several Similar Systems

Suppose that $m$ systems of the same kind are observed, where the $j$-th system ($j = 1, 2, \ldots, m$) is observed in the time interval $[0, \tau_j]$. For the $j$-th system, let $N_j$ denote the number of failures that occur during the observation period, and let the specific failure times be denoted $T_{1j} < T_{2j} < \cdots < T_{N_j j}$. Figure 10.3 illustrates the notation and explains the information given in a so-called event plot which is provided by computer packages for analysis of this kind of data (see examples below).

Example 1 Nelson (1995) presented data for times of valve-seat replacements in a fleet of $m = 41$ diesel engines. Figure 10.4 shows an event plot for the complete dataset.

Example 2 Bhattacharjee et al. (2003) presented failure data for motor operated closing valves in safety systems at two boiling water reactor plants in Finland. Failures of the type “external leakage” were considered for 104 valves with a follow-up time of nine years. An event plot for the 16 valves which experienced at least one failure is given in Figure 10.5. The remaining 88 valves had no failures.


Figure 10.3. Observation of failure times of $m$ systems. The $j$-th system is observed over the time interval $[0, \tau_j]$, with $N_j \ge 0$ observed failures

Figure 10.4. Event plot for times of valve seat replacements for 41 diesel engines, taken from Nelson (1995)

When data are available for m systems as described above, one will typically assume that the systems behave independently but with the same probability laws (“i.i.d. rules”). The total likelihood for the data will then be the product of the likelihoods at Equations 10.4 or 10.5, one factor for each of the m systems.
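A minimal sketch of this total log-likelihood, built from Equation 10.5, is given below. The parametrization is an assumption made for the illustration: a power-law cumulative trend $\Lambda(t) = c t^b$ and a Weibull renewal distribution with expected value 1. The function names are hypothetical.

```python
import math

def trp_loglik(times, tau, c, b, shape):
    """Log of Equation 10.5 for one system observed on [0, tau].

    Assumed parametrization: Lambda(t) = c*t**b, hence
    lambda(t) = c*b*t**(b-1), and F is Weibull with the given shape and
    expected value 1.
    """
    scale = 1.0 / math.gamma(1.0 + 1.0 / shape)  # Weibull scale giving mean 1

    def cum_trend(t):
        return c * t ** b

    loglik = 0.0
    previous = 0.0
    for t in times:
        # Transformed gap Lambda(T_i) - Lambda(T_{i-1}), in units of the scale.
        x = (cum_trend(t) - cum_trend(previous)) / scale
        log_f = math.log(shape / scale) + (shape - 1.0) * math.log(x) - x ** shape
        log_trend = math.log(c * b) + (b - 1.0) * math.log(t)
        loglik += log_f + log_trend
        previous = t
    # Survival term 1 - F(.) for the interval after the last failure.
    x_last = (cum_trend(tau) - cum_trend(previous)) / scale
    return loglik - x_last ** shape

def total_loglik(systems, c, b, shape):
    """Sum over independent systems; `systems` is a list of (times, tau)."""
    return sum(trp_loglik(times, tau, c, b, shape) for times, tau in systems)
```

With `shape=1.0` the renewal distribution is the unit exponential and `trp_loglik` reduces to the NHPP log-likelihood, which gives a convenient consistency check.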


Figure 10.5. Event plot for times of external leakage from nuclear plant valves, taken from Bhattacharjee et al. (2003). In addition, 88 valves had no failures in 3286 days (9 years)

However, even if the $m$ systems are considered to be of the same type, they may well exhibit different probabilistic failure mechanisms. For example, systems may be used under varying environmental or operational conditions. To cover such cases we shall assume that failures of the $j$-th system follow the process TRP($F$, $\lambda_j(\cdot)$), $j = 1, \ldots, m$, where the renewal distribution $F$ is fixed and differences between systems are modeled by letting the trend functions $\lambda_j(t)$ vary from system to system. The assumption of a fixed $F$ parallels the NHPP case, where $F$ is the unit exponential distribution. Assuming that systems work independently of each other, we obtain from Equation 10.5 the full likelihood $L \equiv \prod_{j=1}^{m} L_j$, where

$$L_j = \left\{ \prod_{i=1}^{N_j} f[\Lambda_j(T_{ij}) - \Lambda_j(T_{i-1,j})]\,\lambda_j(T_{ij}) \right\} \{1 - F[\Lambda_j(\tau_j) - \Lambda_j(T_{N_j j})]\}. \tag{10.6}$$

As an example of the use of Equation 10.6, assume that differences between system performances can be attributed to an observable covariate vector $\mathbf{x}$, and that the trend $\lambda_j(t)$ for system $j$ is represented by a proportional trend model with

$$\lambda_j(t) = g(\mathbf{x}_j)\lambda(t), \quad j = 1, \ldots, m. \tag{10.7}$$

Here $\lambda(\cdot)$ is a basic trend function common to all systems, while $g$ is a function of the covariate vector $\mathbf{x}_j$ of system $j$. The special cases of this model corresponding to NHPP and RP are studied, respectively, by Lawless (1987) and Follmann and Goldberg (1988).

10.2.5 The Heterogeneous Trend Renewal Process

As noted in the introduction, in addition to observable differences there may be an unobserved heterogeneity between systems. A common way of incorporating such heterogeneity is to modify Equation 10.7 to $\lambda_j(t) = a_j g(\mathbf{x}_j)\lambda(t)$, where the $a_j$ are unobservable (positive) random variables taking values independently across systems (Andersen et al. 1993, Chapter IX). For simplicity we shall in this chapter restrict attention to the case with no observed covariates, and instead concentrate on unobserved heterogeneity. In the following we thus assume the model

$$\lambda_j(t) = a_j \lambda(t), \tag{10.8}$$

where the $a_j$ are independently distributed according to a common probability distribution $H$, say, and where for convenience we assume that the expected value of $a_j$ equals 1. Thus in Equation 10.8, $\lambda(\cdot)$ is regarded as a basic trend function, while the $a_j$ represent a possibly different failure intensity “level” for each system, averaging to 1. The special case when $a_j = 1$ with probability 1 will be referred to as the “no heterogeneity” case. For given values of the $a_j$ the likelihood for the $j$-th system is, by Equation 10.6,

$$L_j(a_j) = \left\{ \prod_{i=1}^{N_j} f[a_j(\Lambda(T_{ij}) - \Lambda(T_{i-1,j}))]\,a_j \lambda(T_{ij}) \right\} \{1 - F[a_j(\Lambda(\tau_j) - \Lambda(T_{N_j j}))]\}.$$

However, since the $a_j$ are unobservable, we need to take the expectation with respect to the $a_j$, giving

$$L_j = E[L_j(a_j)] = \int L_j(a_j)\,dH(a_j)$$

as the contribution to the likelihood from the $j$-th system. The total likelihood is then the product

$$L = \prod_{j=1}^{m} L_j. \tag{10.9}$$

We shall use the notation HTRP($F$, $\lambda(\cdot)$, $H$) for the model with the likelihood at Equation 10.9. Here the renewal distribution $F$ and the heterogeneity distribution $H$ are distributions corresponding to positive random variables with expected value 1, while the basic trend function $\lambda(t)$ is a positive function defined for $t \ge 0$.
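To make the expectation over the heterogeneity distribution concrete, the following sketch treats the simplest heterogeneous case, chosen here purely for illustration: an exponential renewal distribution, a constant trend $\lambda(t) = \nu$, and a distribution $H$ concentrated on two points. In that case $L_j(a) = (a\nu)^{N_j} \exp(-a\nu\tau_j)$ and the integral over $H$ becomes a two-term sum. The function names and the labels "good" and "bad" are hypothetical.

```python
import math

def hpp_likelihood(n, tau, rate):
    """Likelihood of n events on [0, tau] for an HPP with the given rate."""
    return rate ** n * math.exp(-rate * tau)

def hhpp_two_point_loglik(counts, taus, nu, a_good, p_good):
    """Log-likelihood when a_j = a_good w.p. p_good and a_bad otherwise.

    The second support point a_bad is fixed by the constraint E[a_j] = 1.
    """
    a_bad = (1.0 - p_good * a_good) / (1.0 - p_good)
    total = 0.0
    for n, tau in zip(counts, taus):
        # Expectation of L_j(a_j) over the two-point distribution H.
        mixture = (p_good * hpp_likelihood(n, tau, a_good * nu)
                   + (1.0 - p_good) * hpp_likelihood(n, tau, a_bad * nu))
        total += math.log(mixture)
    return total
```

Setting `a_good=1.0` forces `a_bad` to 1 as well, which recovers the model without heterogeneity.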


A useful feature of the HTRP model is that several important models for repairable systems are easily represented as submodels. With the notation HPP, NHPP, RP and TRP used as before, we define corresponding models with heterogeneity as at Equation 10.8 by putting an H in front of the abbreviations. Specifically, from a full model, HTRP($F$, $\lambda(\cdot)$, $H$), we can identify the seven submodels described in Table 10.1.

Table 10.1. The seven submodels of HTRP($F$, $\lambda(\cdot)$, $H$). 'exp' means the unit exponential distribution, '1' means the distribution degenerate at 1. The third column contains references to work on the corresponding models or special cases of them.

Submodel                       HTRP-formulation
HPP($\nu$)                     HTRP(exp, $\nu$, 1)
RP($F$, $\nu$)                 HTRP($F$, $\nu$, 1)
NHPP($\lambda(\cdot)$)         HTRP(exp, $\lambda(\cdot)$, 1)
TRP($F$, $\lambda(\cdot)$)     HTRP($F$, $\lambda(\cdot)$, 1)
HHPP($\nu$, $H$)               HTRP(exp, $\nu$, $H$)
HRP($F$, $\nu$, $H$)           HTRP($F$, $\nu$, $H$)
HNHPP($\lambda(\cdot)$, $H$)   HTRP(exp, $\lambda(\cdot)$, $H$)

The HTRP and the seven submodels may also be represented in a cube, as illustrated in Figures 10.6 and 10.7. Each vertex of the cube represents a model, and the lines connecting them correspond to changing one of the three “coordinates” in the HTRP notation. Going to the right corresponds to introducing a time trend, going upwards corresponds to entering a non-Poisson case, and going backwards (inwards) corresponds to introducing heterogeneity. In analyzing data by parametric HTRP models we shall see below how the cube can be used to present the maximum log-likelihood values for the different models in a convenient, visual manner. The log-likelihood cube was introduced in Lindqvist et al. (2003).

Example 1 (continued) Figure 10.6 shows the log-likelihood cube of the valve-seat data. It should be noted that each arrow points in a direction where exactly one parameter is added (see the caption of Figure 10.6 for definitions of parameters). Using standard asymptotic likelihood theory we know that if this parameter has no influence in the model, then twice the difference in log likelihood is approximately chi-square distributed with one degree of freedom. For example, if twice the difference is larger than 3.84, then the p-value for the hypothesis that the extra parameter has no effect is less than 5%, and we have an indication that the extra parameter in fact has some relevance. Note that adding an extra parameter will always lead to a value of the maximum log likelihood which is at least as large, but from what we just argued, the difference needs to be more than, say, $3.84 / 2 = 1.92$ to be of real interest.
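The comparison along one edge of the cube can be sketched in a few lines of Python. The function name is hypothetical; the only mathematical fact used is that for one degree of freedom $P(\chi^2 > x) = \mathrm{erfc}(\sqrt{x/2})$, which avoids the need for a statistics library.

```python
import math

def lr_test_p_value(loglik_small, loglik_large):
    """Approximate p-value for adding one parameter (chi-square, 1 d.f.)."""
    statistic = 2.0 * (loglik_large - loglik_small)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))

# A gain of 1.92 in maximum log likelihood sits at about the 5% threshold.
p = lr_test_p_value(-100.0, -98.08)
```

The approximation relies on the asymptotic theory referred to above, so it should be read as indicative rather than exact for small samples.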


Figure 10.6. The log-likelihood cube for the valve seat data of Nelson (1995), fitted with a parametric HTRP($F$, $\lambda(\cdot)$, $H$) model and its sub-models. Here $F$ is a Weibull distribution with expected value 1 and shape parameter $s$, $\lambda(t) = cbt^{b-1}$ is a power function of $t$, and $H$ is a gamma distribution with expected value 1 and variance $v$. The maximum value of the log likelihood is denoted $l$

Looking at the valve-seat data cube in Figure 10.6 we note first that going from a vertex of the front face to the corresponding vertex of the back face (adding “H” in front of the model acronym) there is never much to gain (1.17 at most, from HPP to HHPP). This indicates no apparent heterogeneity between the various engines. By comparing the left and right faces we conclude, however, that there seems to be a gain in including a time trend. Having already excluded heterogeneity we are thus faced with the possibilities of either NHPP or TRP. Here the latter model “wins”, since the difference in log-likelihood is as large as $(-343.66) - (-346.49) = 2.83$, and twice the difference equals 5.66, corresponding to an approximate p-value of 0.017. The resulting estimated TRP is seen to have a renewal distribution which is Weibull with shape parameter 0.6806, which implies a decreasing failure rate. This means that the conditional intensity function will jump upward at each failure, which may be explained by burn-in problems at each valve-seat replacement. Further, there will be an estimated time trend of the form $\hat{\lambda}(t) = 3.26 \times 10^{-6} \times 1.929 \times t^{0.929} = 6.29 \times 10^{-6} \times t^{0.929}$, which increases with $t$ so that replacements are becoming more and more frequent.

Example 2 (continued) For the closing valve failures considered by Bhattacharjee et al. (2003), previous studies had shown significant variations in the number of failures of each valve, suggesting a heterogeneity between valves. Bhattacharjee et al. (2003) thus stressed the importance of taking heterogeneity into consideration and concluded that even very simple models may describe the heterogeneous behavior successfully. In particular they considered a model where heterogeneity is represented by assuming that each valve is either “good” or “bad”. While Bhattacharjee et al. (2003) used hierarchical Bayes-models, we fitted an HTRP model and its sub-models, with a trend function of power law type as for the valve-seat data, $\lambda(t) = cbt^{b-1}$, but now with a heterogeneity distribution $H$ being a two-point distribution with values $a_1$ (“good”) and $a_2$ (“bad”), so that $a_1 \le a_2$ by assumption, and $P(\text{“good”}) = p$. In order to have uniqueness of parameters we imposed the restriction of expected value 1 for the distribution $H$, leading to $pa_1 + (1 - p)a_2 = 1$.

The results are given in the log-likelihood cube of Figure 10.7. By comparing the front and back faces of Figure 10.7 it is clear that there is a considerable heterogeneity present, leaving us with the back face. Thus we continue by investigating whether we have Poisson-behavior or renewal-behavior at failures. This is done by comparing the bottom and top faces, in other words (HHPP, HNHPP) vs. (HRP, HTRP). The difference from HHPP to HRP happens to be 1.92, so the p-value is 5%. Thus we might prefer the HRP model. However, in order to obtain a simple model with a simple interpretation we might go for the HHPP, which gives that the closing valve is a “good” one with probability 0.9524, with failures following an HPP with rate

$$1.083 \times 10^{-4} \times 0.35 = 3.79 \times 10^{-5} \text{ (per day)},$$

or a “bad” one with probability 0.0476 and rate

$$1.083 \times 10^{-4} \times 14.0 = 1.52 \times 10^{-3} \text{ (per day)}.$$

The expected numbers of failures in 3286 days are hence 0.125 and 4.99, respectively, for the “good” and “bad” valves.
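The arithmetic behind these figures can be retraced with a short script. The variable names are hypothetical; the numerical values are the estimates quoted above. Small differences from the quoted results arise because the text rounds the rates before multiplying.

```python
nu = 1.083e-4                # estimated basic rate (per day)
a_good, a_bad = 0.35, 14.0   # estimated heterogeneity levels
p_good = 0.9524              # estimated probability of a "good" valve
days = 3286                  # follow-up time

rate_good = nu * a_good      # roughly 3.79e-5 per day
rate_bad = nu * a_bad        # roughly 1.52e-3 per day

# Expected number of failures of an HPP is rate times observation time.
expected_good = rate_good * days
expected_bad = rate_bad * days

# The constraint p*a1 + (1 - p)*a2 = 1 should hold up to rounding.
mean_level = p_good * a_good + (1 - p_good) * a_bad
```

The last quantity serves as a check that the rounded estimates remain consistent with the unit-expectation restriction on $H$.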

10.3 A Competing Risks Model for Failure vs. Preventive Maintenance

10.3.1 A General Setup

Consider again the situation illustrated in Figure 10.1, where the sojourns $X_1, X_2, \ldots$ are times to failure of a system which is repaired immediately before the start of the sojourn. In the present section we consider the case when the failure which we expect at the end of the sojourn $X_i$ may be avoided by a preventive maintenance (PM) after a time $Z_i$ in the sojourn. The experienced sojourn time will in this case be $Y_i = \min(X_i, Z_i)$, and it will result in either a failure or a PM according to whether $Y_i = X_i$ or $Y_i = Z_i$. We thus have a competing risks situation with two risks, corresponding to failure and PM.


Figure 10.7. The log-likelihood cube for the data of Bhattacharjee et al. (2003) concerning failures of motor operated closing valves in nuclear reactor plants in Finland, fitted with a parametric HTRP($F$, $\lambda(\cdot)$, $H$) model and its sub-models. Here $F$ is a Weibull distribution with expected value 1 and shape parameter $s$, $\lambda(t) = cbt^{b-1}$ is a power function of $t$, and $H$ is a two-point distribution with unit expectation, giving probability $p$ for the value “low” and $1 - p$ for the value “high”. The maximum value of the log likelihood is denoted by $l$

Doyen and Gaudoin (2006) recently presented a point process approach for modeling of such competing risks situations between failure and PM. A general setup for this kind of process is furthermore suggested in the review paper Lindqvist (2006). For simplicity we shall in this chapter consider only the case where the component or system is perfectly repaired or maintained at the end of each sojourn. This will lead to the observation of independent copies of the competing risks situation in the same way as for a renewal process. We will therefore in the following consider only a single sojourn and hence suppress the subscripts of the observed times. Thus we let $X$ and $Z$ be, respectively, the potential time to failure and time to PM of a single sojourn. Then $Y = \min(X, Z)$ is the observed sojourn, and in addition we observe the indicator variable $\delta$ which we define to be 1 if there is a PM ($Y = Z$) and 0 if there is a failure ($Y = X$). This situation has been extensively studied by Cooke (1993, 1996), Bedford and Cooke (2001), Langseth and Lindqvist (2003, 2006), Lindqvist et al. (2006) and Lindqvist and Langseth (2005).

Note that the observable result is the pair $(Y, \delta)$, rather than the underlying times $X$ and $Z$, which may often be the times of interest. For example, knowing the distribution of $X$ would be important as a basis for maintenance optimization. It is well known (see Crowder 2001, Chapter 7), however, that in a competing risks case as described here, the marginal distributions of $X$ and $Z$ are not identifiable from observation of $(Y, \delta)$ alone unless specific assumptions are made on the dependence between $X$ and $Z$. The most frequently used assumption of this kind is to let $X$ and $Z$ be independent, in which case identifiability follows. This assumption is not reasonable in our application, however, since the maintenance crew is likely to have some information regarding the system’s state during operation. This insight is used to perform maintenance in order to avoid failures. We are thus in practice usually faced with a situation of dependent competing risks between $X$ and $Z$.

10.3.2 Random Signs Censoring

Cooke (1993, 1996) suggested that the competing risks situation between failure and PM will often satisfy what he called the random signs censoring property. The important features of random signs censoring are that the marginal distribution of $X$ is always identifiable, and that an indication of the validity of this type of censoring could be found from data plotting. A lifetime $Z$ is said to be a random signs censoring of $X$ if the event $\{Z < X\}$ is stochastically independent of $X$, i.e. if the event of having a PM before failure is not influenced by the time $X$ at which the system fails or would have failed without PM. The idea is that the system emits some kind of signal before failure, and that this signal is discovered with a probability which does not depend on the age of the system.

We now introduce some notation. Below we assume without further mention that $X$, $Z$ are positive and continuous random variables, with $P(X = Z) = 0$. We let $F_X(t) = P(X \le t)$ and $F_Z(t) = P(Z \le t)$ be the cumulative distribution functions of $X$ and $Z$, respectively. The subdistribution functions of $X$ and $Z$ are defined as, respectively, $F_X^*(t) = P(X \le t, X < Z)$ and $F_Z^*(t) = P(Z \le t, Z < X)$. Note that the functions $F_X^*$ and $F_Z^*$ are nondecreasing with $F_X^*(0) = 0$ and $F_Z^*(0) = 0$. Moreover, we have $F_X^*(\infty) + F_Z^*(\infty) = 1$. We will also use the notion of conditional distribution functions, defined by $\tilde{F}_X(t) = P(X \le t \mid X < Z)$ and $\tilde{F}_Z(t) = P(Z \le t \mid Z < X)$. Note then that $\tilde{F}_X(t) = F_X^*(t) / F_X^*(\infty)$ and $\tilde{F}_Z(t) = F_Z^*(t) / F_Z^*(\infty)$.

It is important to note that the functions $F_X^*$, $F_Z^*$, $\tilde{F}_X$, $\tilde{F}_Z$ are identifiable from data of the form $(Y, \delta)$, since they are given in terms of probabilities of events that can be expressed by $(Y, \delta)$. For example, $F_X^*(t) = P(Y \le t, \delta = 0)$, which can hence be estimated consistently from a sample of values of $(Y, \delta)$. On the other hand, as already mentioned, the marginal distribution functions $F_X$, $F_Z$ are not identifiable in general since they are not probabilities of events that can be expressed by $(Y, \delta)$.

We now show that the marginal distribution of $X$ is identifiable under random signs censoring. In fact this follows directly from the definition, since we must have


$$\tilde{F}_X(t) = P(X \le t \mid X < Z) = P(X \le t) = F_X(t) \tag{10.10}$$

by independence of $X$ and the event $X < Z$. As verified above, $\tilde{F}_X(t)$ can always be estimated consistently from data, and thus this holds for $F_X(t)$ as well by Equation 10.10. Hence we have the somewhat surprising result under random signs censoring that the marginal distribution of $X$ is the same as the distribution of the observed occurrences of $X$. Cooke (1993) showed that under random signs censoring we have

$$\tilde{F}_X(t) < \tilde{F}_Z(t) \quad \text{for all } t > 0. \tag{10.11}$$
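Checking Equation 10.11 against data requires estimates of the two conditional distribution functions. A minimal sketch using empirical distribution functions is given below; the function names and data are hypothetical. Since empirical estimates cannot confirm a strict inequality (both are zero before the first observation), the sketch only flags points where the ordering is clearly reversed.

```python
from bisect import bisect_right

def conditional_ecdfs(observations):
    """Empirical conditional distribution functions from (y, delta) pairs.

    delta = 1 marks a PM (Y = Z) and delta = 0 a failure (Y = X). Returns
    estimates of P(X <= t | X < Z) and P(Z <= t | Z < X), in that order.
    """
    failures = sorted(y for y, delta in observations if delta == 0)
    pms = sorted(y for y, delta in observations if delta == 1)
    if not failures or not pms:
        raise ValueError("need at least one failure and one PM")

    def make_ecdf(sample):
        def ecdf(t):
            return bisect_right(sample, t) / len(sample)
        return ecdf

    return make_ecdf(failures), make_ecdf(pms)

def random_signs_violations(observations, grid):
    """Grid points where the estimate for X exceeds the estimate for Z."""
    f_x, f_z = conditional_ecdfs(observations)
    return [t for t in grid if f_x(t) > f_z(t)]

# Hypothetical data in which PMs tend to occur earlier than failures.
data = [(1.0, 1), (2.0, 1), (4.0, 1), (5.0, 0), (8.0, 0), (9.0, 0)]
violations = random_signs_violations(data, grid=[1.0, 3.0, 5.0, 7.0, 9.0])
```

An empty list of violations is compatible with random signs censoring but does not prove it; in practice sampling variability should be taken into account before drawing conclusions.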

Moreover, he showed the kind of inverse statement that whenever Equation 10.11 holds, there exists a joint distribution of $(X, Z)$ satisfying the requirements of random signs censoring and giving the same sub-distribution functions. On the other hand, if $\tilde{F}_X(t) \ge \tilde{F}_Z(t)$ for some $t$, then there is no joint distribution of $(X, Z)$ for which the random signs requirement holds. For more discussion on random signs censoring and its applications we refer to Cooke (1993, 1996) and Bedford and Cooke (2001, Chapter 9). One idea is to estimate the functions $\tilde{F}_X(t)$ and $\tilde{F}_Z(t)$ from data to check whether Equation 10.11 may possibly hold and, when this is the case, to suggest a model that satisfies the random signs property.

10.3.3 The Repair Alert Model

Lindqvist et al. (2006) introduced the so-called repair alert model which extends the idea of random signs censoring by defining an additional repair alert function which describes the “alertness” of the maintenance crew as a function of time. The definition can be given as follows: The pair $(X, Z)$ of life variables satisfies the requirements of the repair alert model provided the following two conditions both hold:

(i) $Z$ is a random signs censoring of $X$.
(ii) There exists an increasing function $G$ defined on $[0, \infty)$ with $G(0) = 0$, such that for all $x > 0$,

$$P(Z \le z \mid Z < X, X = x) = \frac{G(z)}{G(x)}, \quad 0 < z \le x.$$

The function $G$ is called the cumulative repair alert function. Its derivative $g$ (when it exists) is called the repair alert function. The repair alert model is hence a specialization of random signs censoring, obtained by introducing the repair alert function $G$.

Part (ii) of the above definition means that, given that there would be a failure at time $X = x$, and given that the maintenance crew will perform a PM before that time (i.e. given that $Z < X$), the conditional density of the time $Z$ of this PM is proportional to the repair alert function $g$. Lindqvist et al. (2006) showed that whenever Equation 10.11 holds there is a unique repair alert model giving the same sub-distribution functions. Thus, restricting to repair alert models we are able to strengthen the corresponding result for random signs censoring which does not guarantee uniqueness.

The repair alert function is meant to reflect the reaction of the maintenance crew. More precisely, $g(t)$ ought to be high at times $t$ for which failures are expected and the alert therefore should be high. Langseth and Lindqvist (2003) simply put $g(t) = \lambda(t)$ where $\lambda(t)$ is the failure rate of the marginal distribution of $X$. This property of $g(t)$ of course simplifies analyses since it reduces the number of parameters, but at the same time it seems fairly reasonable given a competent maintenance crew. In a subsequent paper, Langseth and Lindqvist (2006) present ways to test whether $g(t)$ can be assumed equal to the hazard function $\lambda(t)$.

It follows from the construction in Lindqvist et al. (2006) that the repair alert model is completely determined by the marginal distribution function $F_X$ of $X$, the cumulative repair alert function $G$, the probability $q \equiv P(Z < X)$, and the assumption that $X$ is independent of the event $\{Z < X\}$ (i.e. random signs censoring). Thus, given statistical data, the inference problem consists of estimating $F_X(t)$ (possibly in parametric form), the repair alert function $g$ (or $G$), and the probability $q$ of PM. We refer to Lindqvist et al. (2006) and Lindqvist and Langseth (2005) for details on such statistical inferences. The following is a simple example of a repair alert model.

Example 3 Let $(X, Z)$ be a pair of life variables with joint density parameterized by $\lambda > 0$ and $0 < q < 1$,

$$f_{XZ}(x, z; \lambda, q) = (q/x)\lambda e^{-\lambda x} \quad \text{for } x > 0,\ 0 < z < x/q.$$

The marginal distribution of X is the exponential distribution with density f X ( x) = λ e − λ x , while the conditional distribution of Z given X = x is the uniform distribution on (0, x /q) . From this we obtain P( Z < X | X = x) = q for all x > 0 . Thus the event Z < X is independent of X and condition (i) of the definition is satisfied. The following computation shows that condition (ii) holds as well. Let 0 < z < x . Then P ( Z ≤ z, Z < X | X = x) P( Z < X | X = x) P( Z ≤ z | X = x) = q z ( q /x ) z = = , q x

P( Z ≤ z | Z < X , X = x) =

which implies condition (ii) of Definition 2 with G (t ) = t .
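The two conditions verified in Example 3 can also be checked by simulation. The following is a minimal sketch, not part of the original text: the function names, the parameter values λ = 2 and q = 0.3, the sample size and the seed are all illustrative assumptions. It draws pairs (X, Z) from the joint density of the example and estimates P(Z < X), the mean of Z/X among PM cases (which should be near 1/2 when G(t) = t), and the mean of X separately for PM and failure cases (which should agree under random signs censoring).

```python
import random


def simulate_example3(lam, q, n, seed=0):
    """Draw n pairs (X, Z): X ~ Exp(lam), then Z | X = x ~ Uniform(0, x / q)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.expovariate(lam)
        z = rng.uniform(0.0, x / q)
        pairs.append((x, z))
    return pairs


def summarize(pairs):
    """Return (q_hat, mean Z/X over PM cases, mean X over PM cases, mean X over failures)."""
    pm = [(x, z) for x, z in pairs if z < x]       # PM comes first
    fail = [(x, z) for x, z in pairs if z >= x]    # failure comes first
    q_hat = len(pm) / len(pairs)
    mean_ratio = sum(z / x for x, z in pm) / len(pm)
    mean_x_pm = sum(x for x, _ in pm) / len(pm)
    mean_x_fail = sum(x for x, _ in fail) / len(fail)
    return q_hat, mean_ratio, mean_x_pm, mean_x_fail


# Illustrative (assumed) parameter values: lambda = 2, q = 0.3.
q_hat, mean_ratio, mean_x_pm, mean_x_fail = summarize(
    simulate_example3(lam=2.0, q=0.3, n=200_000)
)
print(q_hat, mean_ratio, mean_x_pm, mean_x_fail)
```

With these assumed values the first estimate should be close to q = 0.3, the second close to 0.5, and the last two both close to 1/λ = 0.5, up to Monte Carlo error.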


The practical interpretation of this example is as follows. We consider a component or system with lifetime X which is exponentially distributed with failure rate λ. With probability q a PM is performed before X, at a time which for given X = x is uniformly distributed on the interval from 0 to x.

10.3.4 Further Properties of the Repair Alert Model

The following formula (taken from Lindqvist et al. 2006) shows in particular why Equation 10.11 holds under the repair alert model:

\[ \tilde{F}_Z(t) = F_X(t) + G(t) \int_t^\infty \frac{f_X(y)}{G(y)}\,dy. \tag{10.12} \]

Note that for random signs censoring, and hence for the repair alert model, we have \(\tilde{F}_X(t) = F_X(t)\). We next discuss some implications of the repair alert model, in particular how the parameters q and G influence the observed performance of PM and failures. In order to help intuition, we sometimes consider the power version G(t) = t^β where β > 0 is a parameter. Then g(t) = βt^{β−1}, so β = 1 means a constant repair alert function, while β < 1 and β > 1 correspond to, respectively, a decreasing and an increasing repair alert function.

Under the random signs assumption, the parameter q = P(Z < X) is connected to the ability to discover “signals” regarding a possibly approaching failure. More precisely, q is understood as the probability that a failure is avoided by a preceding PM. Given that there will be a PM, one should ideally have the time of PM immediately before the failure. It is seen that this issue is connected to the function G. For example, large values of β correspond to conditional distributions of Z (given Z < X and X = x) with most of their mass near x. Moreover, it follows from Equation 10.12 that

\[ E(Z \mid Z < X) = \int_0^\infty \bigl(1 - \tilde{F}_Z(z)\bigr)\,dz = E(X) - E\!\left[\frac{M(X)}{G(X)}\right], \]

where \( M(x) = \int_0^x G(t)\,dt \). For the special case when G(t) = t^β, we obtain the simple result

\[ E(Z \mid Z < X) = \frac{\beta}{\beta + 1}\,E(X), \tag{10.13} \]

which clearly indicates that good PM performance corresponds to large values of β. An interesting observation is, furthermore, that Equation 10.13 can be used to estimate β from a sample of (Y, δ). In fact, E(Z | Z < X) can be estimated simply by the average of the observed Z, and since E(X) = E(X | X < Z) for random signs censoring, we can estimate E(X) similarly by the average of the observed X. An estimate of the quotient β/(β + 1), and hence of β, follows.

Instead of merely considering the conditional expectation E(Z | Z < X) one may more generally study the conditional distribution of Z given Z < X, or the conditional distribution of X − Z given Z = z, Z < X. A good PM performance would then mean that the former distribution is stochastically as large as possible, while the latter distribution should be small (stochastically). For precise results in this direction we refer to Lindqvist et al. (2006).

Consider next Y = min(X, Z), which is the actual sojourn time. The following results are hence of practical interest, and may in addition shed light on the influence of the parameters of the repair alert model:

\[ P(Y \le t) = F_X(t) + q\,G(t) \int_t^\infty \frac{f_X(y)}{G(y)}\,dy, \]

\[ E(Y) = E(X) - q\,E\!\left[\frac{M(X)}{G(X)}\right], \quad \text{where } M(x) = \int_0^x G(t)\,dt. \]

Furthermore, if G(t) = t^β, then

\[ E(Y) = E(X)\left(1 - \frac{q}{\beta + 1}\right). \tag{10.14} \]
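The moment estimator of β suggested by Equation 10.13, and the expression for E(Y) in Equation 10.14, can be illustrated on simulated data. The sketch below is an assumption-laden illustration rather than the estimation procedure of Lindqvist et al. (2006): it assumes an exponential lifetime X, the power form G(t) = t^β, and hypothetical function names and parameter values. Given Z < X and X = x, the PM time is drawn as Z = x·U^{1/β} with U uniform on (0, 1), which has conditional distribution function (z/x)^β as required.

```python
import random


def simulate_sojourns(lam, q, beta, n, seed=1):
    """Simulate n sojourns (Y, delta) under the repair alert model with G(t) = t**beta.

    delta = 1 marks a PM (Y = Z); delta = 0 marks a failure (Y = X).
    Assumption for this sketch: X is exponential with rate lam.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.expovariate(lam)                    # potential failure time X
        if rng.random() < q:                        # PM occurs, independently of X
            z = x * rng.random() ** (1.0 / beta)    # P(Z <= z | Z < X, X = x) = (z/x)**beta
            data.append((z, 1))
        else:
            data.append((x, 0))
    return data


def estimate_beta(data):
    """Moment estimate of beta based on Equation 10.13."""
    pm_times = [y for y, d in data if d == 1]
    failure_times = [y for y, d in data if d == 0]
    mean_z = sum(pm_times) / len(pm_times)              # estimates E(Z | Z < X)
    mean_x = sum(failure_times) / len(failure_times)    # estimates E(X) under random signs
    ratio = mean_z / mean_x                             # estimates beta / (beta + 1)
    return ratio / (1.0 - ratio)


# Illustrative (assumed) values: lambda = 1, q = 0.4, beta = 2.
data = simulate_sojourns(lam=1.0, q=0.4, beta=2.0, n=200_000)
beta_hat = estimate_beta(data)
mean_y = sum(y for y, _ in data) / len(data)
print(beta_hat, mean_y)
```

Under these assumed values, beta_hat should be near 2 and the sample mean of Y near E(X)(1 − q/(β + 1)) = 1 − 0.4/3. In small samples the estimated ratio can reach or exceed 1, in which case this simple estimator breaks down.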

We finally give a simple illustration of how the parameters q and β (assuming G(t) = t^β for simplicity) influence the long run cost per time unit under the repair alert model. Let C_PM, C_F be the costs of PM and failure, respectively, for a single sojourn. Assume now that following an event (PM or failure), the operation is restarted with a system assumed to be as good as new, and that this process continues. This leads to a sequence of observations of (Y, δ), which we shall assume are independent and identically distributed. The theory of renewal reward processes (e.g. Ross 1983, p 78) implies that the expected cost per unit time in the long run equals the expected cost per sojourn divided by the expected length of a sojourn, i.e.

\[ \frac{q\,C_{PM} + (1 - q)\,C_F}{E(X)\left(1 - \dfrac{q}{\beta + 1}\right)}, \]

where we used Equation 10.14. This is a decreasing function of β , which seems reasonable. On the other hand, it is a decreasing function of q provided β > CPM / (CF − CPM ) . This last inequality is likely to hold in many practical cases since the right hand side will usually be much less than 1, while β should for a competent maintenance crew be larger than 1. Thus a high value of q is usually preferable.
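The monotonicity claims for the long run cost rate can be checked numerically. The following sketch simply evaluates the cost rate expression for G(t) = t^β; the function names and the cost values C_PM = 1, C_F = 5 and E(X) = 1 are hypothetical choices made for illustration only.

```python
def cost_rate(q, beta, c_pm, c_f, mean_x):
    """Long run expected cost per unit time for G(t) = t**beta (uses Equation 10.14)."""
    expected_cost = q * c_pm + (1.0 - q) * c_f            # expected cost per sojourn
    expected_length = mean_x * (1.0 - q / (beta + 1.0))   # E(Y), Equation 10.14
    return expected_cost / expected_length


def beta_threshold(c_pm, c_f):
    """Value of beta above which the cost rate decreases in q."""
    return c_pm / (c_f - c_pm)


# Hypothetical costs: a failure is five times as expensive as a PM.
c_pm, c_f, mean_x = 1.0, 5.0, 1.0
print(beta_threshold(c_pm, c_f))
for beta in (0.1, 0.25, 1.0):
    print(beta, [round(cost_rate(q, beta, c_pm, c_f, mean_x), 3) for q in (0.2, 0.5, 0.8)])
```

For these assumed costs the threshold is β = 0.25: below it the cost rate increases with q, at the threshold it does not depend on q, and above it the cost rate decreases with q.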


10.4 Periodically Tested Systems

Certain systems, for example alarm systems, are tested only at fixed times which are usually periodic. If the system is found in a failed state, then it is repaired or replaced. Thus repair is usually not done at the same time as the failure, and the situation is hence not covered by the methods considered earlier in this chapter. A simple model of this situation was suggested by Hokstad and Frøvig (1996) and further studied and extended by Lindqvist and Amundrustad (1998), which is the main source for the present section.

The approach of Lindqvist and Amundrustad (1998) involves a continuous time Markov model for the system state when time runs between testing epochs, and in addition two discrete time Markov chains for the states of the system reported immediately before and after each test, respectively. As will be seen, the given framework also allows in an easy manner the potentially useful extension to modeling of incomplete repairs or maintenance actions.

We consider a standby system observed from time 0, with testing, repair and PM performed periodically at times τ, 2τ, 3τ, …, called PM epochs. Here τ > 0 is the length of what we shall call the PM interval.

10.4.1 The Markov Model

Let X(t) ∈ S denote the state of the system at time t, where the set S of possible states is finite. It is assumed that X(t) behaves like a time homogeneous Markov chain as long as time runs inside PM intervals, i.e. inside time intervals nτ ≤ t < (n + 1)τ for n = 0, 1, …. This Markov chain is governed by an infinitesimal intensity matrix A, where the entry a_jk of A for j ≠ k is the transition intensity from state j to state k; see for example Taylor and Karlin (1984, p 254). An example of an intensity matrix A is given by Equation 10.15, an illustration of which is provided by the state diagram in Figure 10.9. Let

\[ P_{jk}(t) = P(X(t) = k \mid X(0) = j); \quad j, k \in S,\ t > 0 \]

denote transition probabilities for the Markov chain governed by A and let

\[ P(t) = (P_{jk}(t);\ j, k \in S) \]

be the corresponding transition matrix. In order to specify the effect of maintenance and repair at PM epochs, we next introduce for n = 1, 2, …,

\[ Y_n = X(n\tau-) \equiv \lim_{t \uparrow n\tau} X(t), \]

which is the state of the system immediately before the n-th PM epoch. The effect of PM at time nτ is to change the state of the system from Y_n to Z_n according to a transition matrix R = (R_jk), where

\[ P(Z_n = k \mid Y_n = j) = R_{jk}; \quad j, k \in S. \]

Moreover, given Yn it is assumed that Z n is independent of all transitions of the system state before time nτ . The definitions of the Yn and Z n are illustrated in Figure 10.8.

Figure 10.8. The definition of Yn and Z n

The model description is completed by defining the initial state of the Markov chain X(t) running inside the PM interval [nτ, (n + 1)τ) to be X(nτ) ≡ Z_n (n = 0, 1, …), where Z_0 is the initial state of the system, usually the perfect state in S. It is furthermore assumed that the Markov chain X(t) on [nτ, (n + 1)τ), given its initial state Z_n, is independent of all transitions occurring before time nτ. Let the distribution of Z_0 ≡ X(0) be denoted ρ = (ρ_j; j ∈ S), where ρ_j = P(Z_0 = j). Then for any k ∈ S,

\[
\begin{aligned}
P(Y_1 = k) &= P(X(\tau-) = k) \\
&= \sum_{j \in S} P(X(\tau-) = k \mid X(0) = j)\,P(X(0) = j) \\
&= \sum_{j \in S} \rho_j P_{jk}(\tau) = [\rho P(\tau)]_k .
\end{aligned}
\]

Thus the distribution of Y_1 is given by the vector-matrix product ρP(τ). Further, for n ≥ 1,

\[
\begin{aligned}
P(Y_{n+1} = k \mid Y_n = j) &= \sum_{\ell \in S} P(Y_{n+1} = k \mid Z_n = \ell, Y_n = j)\,P(Z_n = \ell \mid Y_n = j) \\
&= \sum_{\ell \in S} P_{\ell k}(\tau)\,R_{j\ell} = [R P(\tau)]_{jk} .
\end{aligned}
\]

It follows that Y_1, Y_2, … is a discrete time Markov chain on S with transition matrix

\[ Q = R P(\tau). \]


On the other hand,

\[
\begin{aligned}
P(Z_{n+1} = k \mid Z_n = j) &= \sum_{\ell \in S} P(Z_{n+1} = k \mid Y_{n+1} = \ell, Z_n = j)\,P(Y_{n+1} = \ell \mid Z_n = j) \\
&= \sum_{\ell \in S} P_{j\ell}(\tau)\,R_{\ell k} = [P(\tau) R]_{jk} .
\end{aligned}
\]

Thus, our assumptions imply that Z_0, Z_1, … is a discrete time Markov chain on S with transition matrix

\[ T = P(\tau) R. \]
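The difference between the two embedded chains, with transition matrices Q = RP(τ) and T = P(τ)R, is easy to see in a small example. The sketch below is purely illustrative and is not taken from Lindqvist and Amundrustad (1998): it assumes a hypothetical two-state system (working, failed) with constant failure rate, no repair between tests, and a test that fails to repair a failed system with probability r. All names and values are assumptions.

```python
import math


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][l] * b[l][k] for l in range(len(b))) for k in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical values: failure rate, PM interval length, probability of an unsuccessful repair.
lam, tau, r = 0.1, 1.0, 0.2
p = math.exp(-lam * tau)          # probability of surviving one PM interval

# State 0 = working, state 1 = failed (absorbing between PM epochs).
P_tau = [[p, 1.0 - p],
         [0.0, 1.0]]
R = [[1.0, 0.0],                  # a working system is left as it is
     [1.0 - r, r]]                # a failed system is repaired with probability 1 - r

Q = mat_mul(R, P_tau)             # chain Y_1, Y_2, ...: states just before the PM epochs
T = mat_mul(P_tau, R)             # chain Z_0, Z_1, ...: states just after the PM epochs

print(Q)
print(T)
```

Both products are stochastic matrices, but they differ: Q applies the repair first and then lets the system run for one interval, while T lets the system run first and then applies the repair.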

10.4.2 Reliability Measures

The approach may now be used to compute interesting reliability measures.

10.4.2.1 Average Rate of Critical Failures

Let π = (π_j, j ∈ S) be the stationary distribution of the Markov chain Y_1, Y_2, …, i.e. π is the unique probability vector satisfying the equation

\[ \pi Q \equiv \pi R P(\tau) = \pi. \]

For any subset G ⊂ S, define \( \pi_G = \sum_{j \in G} \pi_j \). This is the expected relative number of PM epochs, in the long run, where the system is found to be in G. Moreover, 1/π_G is the mean time, in the long run, between visits to G (measured with time unit τ). These facts are well known from the theory of Markov chains (Taylor and Karlin 1984, Chapter 4). In the following, let G be the subset of S defining the critical failure states of the system. Then as in Hokstad and Frøvig (1996) we define the mean time between critical failures to be

\[ \mathrm{MTBF}_{crit} = \tau / \pi_G \]

and the average rate of critical failures to be

\[ \lambda_{crit} = 1 / \mathrm{MTBF}_{crit} = \pi_G / \tau. \]
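As an illustration of these definitions, the stationary distribution π and the resulting MTBF_crit and λ_crit can be computed numerically for a given Q. The sketch below uses power iteration, which is one simple choice among several, on a hypothetical two-state chain (state 0 working, state 1 failed and critical) with assumed failure rate, PM interval and non-repair probability r. None of these values come from the source.

```python
import math


def stationary(m, iterations=1000):
    """Approximate the stationary row vector of a transition matrix by power iteration."""
    n = len(m)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(v[j] * m[j][k] for j in range(n)) for k in range(n)]
    return v


# Hypothetical two-state illustration: Q = R P(tau) with imperfect repair probability r.
lam, tau, r = 0.1, 1.0, 0.2
p = math.exp(-lam * tau)
Q = [[p, 1.0 - p],
     [(1.0 - r) * p, 1.0 - (1.0 - r) * p]]

critical = [1]                                  # indices of the critical failure states G
pi = stationary(Q)
pi_G = sum(pi[j] for j in critical)
mtbf_crit = tau / pi_G                          # mean time between critical failures
lambda_crit = pi_G / tau                        # average rate of critical failures
print(pi, mtbf_crit, lambda_crit)
```

For this two-state chain the balance equation gives π_G = (1 − p)/((1 − p) + (1 − r)p), which can be used to check the iteration.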

10.4.2.2 Critical Safety Unavailability

Consider a PM interval [nτ, (n + 1)τ). The expected relative amount of time in this interval that the system is in a critical state, i.e. in G, is

\[ U_n = \frac{1}{\tau} \int_{n\tau}^{(n+1)\tau} P(X(t) \in G)\,dt. \]

By our assumptions, X(t) behaves in the interval [nτ, (n + 1)τ) in the same manner as if it was run in the interval [0, τ) and started in state Z_n. Thus

\[ U_n = \frac{1}{\tau} \int_0^\tau \sum_{j \in S} P_{jG}(t)\,P(Z_n = j)\,dt, \]

where \( P_{jG}(t) = \sum_{k \in G} P_{jk}(t) \).

Letting n tend to infinity, the probabilities P(Z_n = j) tend to the limiting values γ_j defined by the stationary distribution γ = (γ_j) of the Markov chain Z_0, Z_1, …. This distribution is found by solving the equations

\[ \gamma T \equiv \gamma P(\tau) R = \gamma. \]

Following Hokstad and Frøvig (1996) we shall define the critical safety unavailability (CSU) of the system by

\[ \mathrm{CSU} = \lim_{n \to \infty} U_n = \frac{1}{\tau} \int_0^\tau \sum_{j \in S} P_{jG}(t)\,\gamma_j\,dt = \sum_{j \in S} \gamma_j Q_j, \]

where

\[ Q_j = \frac{1}{\tau} \int_0^\tau P_{jG}(t)\,dt \]

is the critical safety unavailability given that the system state is j at the beginning of the PM interval.

10.4.3 The Failure Model of Hokstad and Frøvig

As an illustration we shall reconsider the most general failure model of Hokstad and Frøvig (1996), namely their Failure Mechanism III. Here the state space is

\[ S = \{O, D, K_I, K_{II}\}, \]

where O = the system is as good as new, D = the system has a failure classified as degraded (noncritical), K_I = the system has a failure classified as critical, caused by a sudden shock, K_II = the system has a failure classified as critical, caused by the degradation process. It is assumed that the Markov chain X(t) is defined by the state diagram of Figure 10.9, and thus has infinitesimal transition matrix


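\[
A = \begin{bmatrix}
-\lambda_d - \lambda_k & \lambda_d & \lambda_k & 0 \\
0 & -\lambda_k - \lambda_{dk} & \lambda_k & \lambda_{dk} \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}
\tag{10.15}
\]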

Note that both K I and K II are absorbing states.

Figure 10.9. State diagram for the failure mechanism of Hokstad and Frøvig (1996)

The model assumes that no repairs are done in the time intervals between PM epochs. Moreover, since A is upper triangular, we can obtain P(t) = e^{tA} rather easily. It is clear that P(t) can be written

\[
P(t) = \begin{bmatrix}
P_{OO}(t) & P_{OD}(t) & P_{OK_I}(t) & P_{OK_{II}}(t) \\
0 & P_{DD}(t) & P_{DK_I}(t) & P_{DK_{II}}(t) \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]

where expressions for the entries are found in Lindqvist and Amundrustad (1998).

In practice it is of interest to quantify the effect of various forms of preventive maintenance. This can be done in the presented framework by means of the repair matrix R. Some examples are given below. If all failures are repaired at PM epochs, then the PM always returns the system back to state O, and we have

\[
R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}
\]


Next, if only critical failures are repaired at PM epochs, then the appropriate R matrix is

\[
R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}
\]

More generally one may consider an extension of this by assuming that all critical failures are repaired, while degraded failures are repaired with probability 1 − r and remain unrepaired with probability r, 0 ≤ r ≤ 1. The repair strategy is thus determined by the parameter r. This clearly leads to the matrix

\[
R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 - r & r & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}
\]

A more general imperfect repair model can be defined by

\[
R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 - r & r & 0 & 0 \\
1 - r_{k1} & 0 & r_{k1} & 0 \\
1 - r_{k2} & 0 & 0 & r_{k2}
\end{bmatrix}
\]

Here r has the same meaning as before, while 1 − r_k1 is the probability of successful repair of a K_I failure and 1 − r_k2 is the corresponding probability for K_II.
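The pieces of Sections 10.4.1 to 10.4.3 can be combined into a small numerical sketch that computes λ_crit and the CSU for the failure model of Equation 10.15 under the one-parameter repair matrix R. This is an illustration under stated assumptions, not the computation used by Hokstad and Frøvig (1996) or Lindqvist and Amundrustad (1998): P(t) = e^{tA} is approximated by a truncated Taylor series with scaling and squaring instead of the closed-form entries, stationary distributions are found by power iteration, the time integral is approximated by the composite Simpson rule, and all rates and function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][l] * b[l][k] for l in range(len(b))) for k in range(len(b[0]))]
            for i in range(len(a))]


def mat_exp(a, t, terms=20, squarings=8):
    """Approximate exp(t * a) by a truncated Taylor series with scaling and squaring."""
    n = len(a)
    scaled = [[v * t / 2 ** squarings for v in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(t, lam_d, lam_k, lam_dk):
    """P(t) = exp(tA) with A as in Equation 10.15; state order (O, D, K_I, K_II)."""
    a = [[-(lam_d + lam_k), lam_d, lam_k, 0.0],
         [0.0, -(lam_k + lam_dk), lam_k, lam_dk],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
    return mat_exp(a, t)


def repair_matrix(r):
    """Critical failures always repaired; degraded failures stay unrepaired with probability r."""
    return [[1.0, 0.0, 0.0, 0.0],
            [1.0 - r, r, 0.0, 0.0],
            [1.0, 0.0, 0.0, 0.0],
            [1.0, 0.0, 0.0, 0.0]]


def stationary(m, iterations=2000):
    """Approximate the stationary row vector of a transition matrix by power iteration."""
    n = len(m)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(v[j] * m[j][k] for j in range(n)) for k in range(n)]
    return v


def reliability_measures(tau, r, lam_d, lam_k, lam_dk, steps=200):
    """Return (lambda_crit, CSU); steps must be even for the composite Simpson rule."""
    critical = (2, 3)                                  # indices of K_I and K_II
    p_tau = transition_matrix(tau, lam_d, lam_k, lam_dk)
    rep = repair_matrix(r)
    pi = stationary(mat_mul(rep, p_tau))               # chain Y_n, Q = R P(tau)
    gamma = stationary(mat_mul(p_tau, rep))            # chain Z_n, T = P(tau) R
    lambda_crit = sum(pi[k] for k in critical) / tau
    h = tau / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        p_t = transition_matrix(i * h, lam_d, lam_k, lam_dk)
        total += weight * sum(gamma[j] * sum(p_t[j][k] for k in critical) for j in range(4))
    csu = total * h / 3.0 / tau
    return lambda_crit, csu


# Hypothetical rates (per unit time) and PM interval length.
for r in (0.0, 0.5, 1.0):
    print(r, reliability_measures(tau=1.0, r=r, lam_d=0.5, lam_k=0.1, lam_dk=1.0))
```

With these assumed rates, both measures should grow with r, since leaving degraded failures unrepaired means more PM intervals start in state D, from which critical failures occur at the higher total rate λ_k + λ_dk.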

10.5 Concluding Remarks

In the present chapter we have considered some aspects of the modeling and analysis of repaired and maintained systems. Rather than giving a comprehensive review of the field we have concentrated on a few points, partly chosen by the interest of the author. It is believed, however, that the chapter touches some topics that have to a certain degree been overlooked in much of reliability practice.

The first point concerns the use of the NHPP as the single model for repairable systems with trend. Although this is appropriate in perhaps most cases, there are cases where renewal effects caused by repair or maintenance destroy the randomness associated with Poisson processes. One way of checking NHPP models is to embed them in larger models, and here the TRP can serve as a means of model
checking (see for example the consideration of maximum log likelihoods in the examples of Section 10.2.5). Another way of extending the NHPP processes is via the large class of imperfect repair models. The classical model is here the one suggested by Brown and Proschan (1983) (see the review paper Lindqvist 2006 for an introduction to the subsequent literature). Imperfect repair models combine two basic ingredients, a hazard rate z (t ) of a new system together with a particular repair strategy which governs a so called virtual age process. The idea is that the virtual age of the system is reduced at repairs by a certain amount which depends on the repair strategy. The extreme cases are the perfect repair (renewal) models where the virtual age is set to 0 after each repair, and the minimal repair (NHPP) models where the virtual age is not reduced at repairs and hence always equals the actual age. Second, we have put some emphasis on the consideration of possible heterogeneity between systems of the same kind. Recall our Example 2 based on data from Bhattacharjee et al. (2003). The authors write in their conclusion: “The heterogeneity of failure behaviour of safety related components, such as valves in our case study, may have important implications for reliability analysis of safety systems. If such heterogeneity is not identified and taken into account, the decisions made to maintain or to enhance safety can be non-optimal or even erroneous. This non-optimality is more serious if the safety related decisions are made on the basis of failure histories of the components”. Still it is believed that heterogeneity has been neglected in many reliability applications. In fact, analyses of reliability data will often lead to an apparent decreasing failure rate which is counterintuitive in view of wear and ageing effects. Proschan (1963) pointed out that such observed decreasing rates could be caused by unobserved heterogeneity. 
Proschan presented failure data from 17 air conditioner systems on Boeing 720 airplanes, concluding that an HPP model was appropriate for each plane, but that the rates differed from plane to plane. This is a classical example of heterogeneity in reliability. If times between failures had been treated as independent and identically distributed across planes, the conclusion would have been that these times between failures had a decreasing failure rate. It has long been known in biostatistics that neglecting individual heterogeneity may lead to severe bias in estimates of lifetime distributions. The idea is that individuals have different “frailties”, and that those who are most “frail” will die or fail earlier than the others. This in turn leads to a decreasing population hazard, which has often been misinterpreted in the same manner as mentioned for the reliability applications. Important references on heterogeneity in the biostatistics literature are Vaupel et al. (1979), Hougaard (1984) and Aalen (1988). It should be noted that heterogeneity is in general unidentifiable if being considered an individual quantity. For identifiability it is necessary that frailty is common to several individuals, for example in family studies in biostatistics, or if several events are observed for each individual, such as for the repairable systems considered in this paper. The presence of heterogeneity is often apparent for data from repairable systems if there is a large variation in the number of events per system. However, it is not really possible to distinguish between heterogeneity and dependence of the intensity on past events for a single process.
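The effect described by Proschan (1963), where constant but unequal failure rates produce an apparently decreasing failure rate when the data are pooled, can be illustrated with a mixture of exponential distributions. The sketch below is a hypothetical two-group example and does not use the air conditioner data: a unit is drawn from one of several groups with constant rates, and the hazard of the resulting mixture is evaluated on a time grid. The function name, rates and weights are assumptions.

```python
import math


def mixture_hazard(t, rates, weights):
    """Hazard rate at time t of a mixture of exponential lifetimes."""
    survival_parts = [w * math.exp(-lam * t) for lam, w in zip(rates, weights)]
    density_parts = [lam * s for lam, s in zip(rates, survival_parts)]
    return sum(density_parts) / sum(survival_parts)


# Hypothetical population: half the units fail at rate 0.5, half at rate 2.0.
rates, weights = (0.5, 2.0), (0.5, 0.5)
grid = [0.25 * i for i in range(41)]            # times 0, 0.25, ..., 10
hazards = [mixture_hazard(t, rates, weights) for t in grid]
print([round(h, 3) for h in hazards[:9]])
```

The population hazard starts at the weighted average of the rates and decreases toward the smallest rate as the more frail units are removed first, even though every individual unit has a constant hazard.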


The third point to be mentioned regards the use, or lack of use, of methods for competing risks in reliability applications. The following is a citation from Crowder (2004) appearing in the article on Competing Risks in Encyclopedia of Actuarial Sciences: “If something can fail, it can often fail in one of several ways and sometimes in more than one way at a time. In the real world, the cause, mode, or type of failure is usually just as important as the time to failure. It is therefore remarkable that in most of the published work to date in reliability and survival analysis there is no mention of competing risks. The situation hitherto might be referred to as a lost case”. Fortunately, some work has been done recently in order to include competing risks in the study of repaired and maintained systems. Much of this work, partly reviewed in Section 10.3, has been motivated by the work of Cooke (1996) and his collaborators. His point of departure was formulated in the conclusion of Cooke (1996): “The main themes of Parts I and II of this article are that current RDB (Reliability Data Bank) designs: 1. are not giving RDB users what they need; 2. are not doing a good job of analyzing competing risk data; 3. are not doing a good job in handling uncertainty. Improvements in all these areas are possible. However, it must be acknowledged that the models and methods presented here merely scratch the surface. It is therefore appropriate to conclude with a summary of open issues...” The final section of the present chapter considers an example of an approach which in some sense generalizes the competing risks issue, namely using Markov chains to model failure mechanisms of various equipment. The chapter has mostly considered the modeling of repairable systems, with less mention of statistical methods. 
It is believed that much of future research on maintenance of repairable systems will still be centered around modeling, possibly with an increased emphasis on point process models including multiple types of events (see for example Doyen and Gaudoin 2006). More detailed models of the underlying failure and maintenance mechanisms may indeed be of great value for planning and optimization of maintenance actions. On the other hand, the new advances in modeling certainly lead to considerable statistical challenges. This point was touched on by Cooke (1996) as cited above, and it is clear that the information in reliability databases could and should be handled by more sophisticated methods than the ones that are traditionally used. Here there is much to learn from the biostatistics literature where there has for a long time been an emphasis on nonparametric methods and on regression methods using covariate information.

10.6 References

Aalen OO, (1988) Heterogeneity in survival analysis. Statistics in Medicine 7:1121–1137.
Andersen P, Borgan O, Gill R, Keiding N, (1993) Statistical Models Based on Counting Processes. Springer, New York.
Ascher H, Feingold H, (1984) Repairable Systems – Modeling, inference, misconceptions and their causes. Marcel Dekker, New York.
Bedford T, Cooke RM, (2001) Probabilistic Risk Analysis: Foundations and Methods. Cambridge University Press, Cambridge.
Bhattacharjee M, Arjas E, Pulkkinen U, (2003) Modeling heterogeneity in nuclear power plant valve failure data. In: Mathematical and Statistical Methods in Reliability (Lindqvist BH, Doksum KA, eds.). World Scientific Publishing, Singapore, pp 341–353.
Brown M, Proschan F, (1983) Imperfect repair. Journal of Applied Probability 20:851–859.
Cook RJ, Lawless JF, (2002) Analysis of repeated events. Statistical Methods in Medical Research 11:141–166.
Cooke RM, (1993) The total time on test statistics and age-dependent censoring. Statistics and Probability Letters 18:307–312.
Cooke RM, (1996) The design of reliability databases, Part I and II. Reliability Engineering and System Safety 51:137–146 and 209–223.
Crowder MJ, (2001) Classical competing risks. Chapman & Hall/CRC, Boca Raton.
Crowder MJ, (2004) Competing risks. In: Encyclopedia of actuarial science (Teugels JL, Sundt B, eds.). Wiley, Chichester, pp 305–313.
Crowder MJ, Kimber AC, Smith RL, Sweeting TJ, (1991) Statistical Analysis of Reliability Data. Chapman & Hall, Great Britain.
Doyen L, Gaudoin O, (2006) Imperfect maintenance in a generalized competing risks framework. Journal of Applied Probability 43:825–839.
Follmann DA, Goldberg MS, (1988) Distinguishing heterogeneity from decreasing hazard rate. Technometrics 30:389–396.
Hokstad P, Frøvig AT, (1996) The modelling of degraded and critical failures for components with dormant failures. Reliability Engineering and System Safety 51:189–199.
Hougaard P, (1984) Life table methods for heterogeneous populations: Distributions describing the heterogeneity. Biometrika 71:75–83.
Langseth H, Lindqvist BH, (2003) A maintenance model for components exposed to several failure mechanisms and imperfect repair. In: Mathematical and Statistical Methods in Reliability (Lindqvist BH, Doksum KA, eds.). World Scientific Publishing, Singapore, pp 415–430.
Langseth H, Lindqvist BH, (2006) Competing risks for repairable systems: A data study. Journal of Statistical Planning and Inference 136:1687–1700.
Lawless JF, (1987) Regression methods for Poisson process data. Journal of American Statistical Association 82:808–815.
Lindqvist BH, (2006) On the statistical modelling and analysis of repairable systems. Statistical Science 21:532–551.
Lindqvist BH, Amundrustad H, (1998) Markov models for periodically tested components. In: Safety and Reliability. Proceedings of the European Conference on Safety and Reliability - ESREL ’98 (Lydersen S, Hansen GK, Sandtorv HA, eds.). AA Balkema, Rotterdam, pp 191–197.
Lindqvist BH, Langseth H, (2005) Statistical modelling and inference for component failure times under preventive maintenance and independent censoring. In: Modern Statistical and Mathematical Methods in Reliability (Wilson A, Limnios N, Keller-McNulty S, Armijo Y, eds.). World Scientific Publishing, Singapore, pp 323–337.
Lindqvist BH, Elvebakk G, Heggland K, (2003) The trend-renewal process for statistical analysis of repairable systems. Technometrics 45:31–44.
Lindqvist BH, Støve B, Langseth H, (2006) Modelling of dependence between critical failure and preventive maintenance: The repair alert model. Journal of Statistical Planning and Inference 136:1701–1717.
Meeker WQ, Escobar LA, (1998) Statistical methods for reliability data. Wiley, New York.
Nelson W, (1995) Confidence limits for recurrence data – applied to cost or number of product repair. Technometrics 37:147–157.
Peña EA, (2006) Dynamic modelling and statistical analysis of event times. Statistical Science 21:487–500.
Proschan F, (1963) Theoretical explanation of observed decreasing failure rates. Technometrics 5:375–383.
Rausand M, Høyland A, (2004) System reliability theory: Models, statistical methods, and applications. 2nd ed. Wiley-Interscience, Hoboken, N.J.
Ross SM, (1983) Stochastic Processes. Wiley, New York.
Taylor HM, Karlin S, (1984) An introduction to stochastic modeling. Academic Press, Orlando.
Vaupel JW, Manton KG, Stallard E, (1979) The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 16:439–454.

11 Optimal Maintenance of Multi-component Systems: A Review Robin P. Nicolai and Rommert Dekker

11.1 Introduction

Over the last few decades the maintenance of systems has become more and more complex. One reason for this is that systems consist of many components which depend on each other. On the one hand, interactions between components complicate the modelling and optimization of maintenance. On the other hand, interactions also offer the opportunity to group maintenance, which may save costs. It follows that planning maintenance actions is a big challenge and it is not surprising that many scholars have studied maintenance optimization problems for multi-component systems. In some articles new solution methods for existing problems are proposed, in other articles new maintenance policies for multi-component systems are studied. Moreover, the number of papers with practical applications of optimal maintenance of multi-component systems is still growing.

Cho and Parlar (1991) give the following definition of multi-component maintenance models: “Multi-component maintenance models are concerned with optimal maintenance policies for a system consisting of several units of machines or many pieces of equipment, which may or may not depend on each other (economically/stochastically/structurally).” So, these models are all about making an optimal maintenance plan for systems consisting of components that interact with each other. We will come back later to the concepts of optimality and interaction. For now it is important to remember that the condition of the systems depends on (the state of) the components, which will only function if adequate maintenance actions are performed.

In this chapter we will give an up-to-date review of the literature on multi-component maintenance optimization. Let us start with a brief summary of the overview articles that have appeared in the past. Cho and Parlar (1991) review articles from 1976 to 1991. The authors divide the literature into five topical categories: machine-interference/repair models, group/block/cannibalization/opportunistic models, inventory/maintenance models, other maintenance/replacement models and inspection/maintenance models. Dekker et al. (1996) deal exclusively
with multi-component maintenance models that are based on economic dependence. Emphasis is put on articles that have been published after 1991, but there is an overlap with the review of Cho and Parlar (1991). The classification scheme of Dekker et al. (1996) differs from that of Cho and Parlar (1991). First, models are classified based on the planning aspect of the model: stationary (long-term) and dynamic (short-term). Second, the stationary-grouping models are divided into the categories grouping corrective maintenance, grouping preventive maintenance and opportunistic grouping maintenance. Here, opportunistic grouping is grouping both preventive and corrective maintenance. The dynamic grouping models are divided into two categories: those with a finite horizon and those with a rolling horizon. In a recent article Wang (2002) gives an overview of maintenance policies of deteriorating systems. The emphasis is on policies for single component systems. One section is devoted to opportunistic maintenance policies for multi-component systems. The author primarily considers models with economic dependence. The existing review articles indicate that there are several ways to categorize articles and models. In Section 11.2 of this chapter we structure the field and present our comprehensive classification scheme. It differs from the schemes used in the review articles discussed earlier. First of all, we distinguish between models with economic, structural and stochastic dependence. Economic dependence implies that grouping maintenance actions either saves costs (economies of scale) or results in higher costs (because of, e.g. high down-time costs), as compared to individual maintenance. Stochastic dependence occurs if the condition of components influences the lifetime distribution of other components. Structural dependence applies if components structurally form a part, so that maintenance of a failed component implies maintenance of working components. 
In Sections 11.3–11.5, we discuss papers concerning economic, stochastic and structural dependence between components. In Section 11.6 we classify articles according to the planning aspect of the maintenance model and the method used to optimize the model. Following the review of Dekker et al. (1996) we distinguish between models with finite and infinite planning horizons. Models with an infinite planning horizon are called stationary, since they usually provide static rules for maintenance, which do not change over the planning horizon. Finite horizon models are called dynamic, since these models can generate dynamic decisions that may change over the planning horizon. In these models short-term information can be taken into account. With respect to the optimization methods, we divide the papers into three categories: exact, heuristic and policy optimization. Section 11.7 covers trends and open research areas in multi-component maintenance. Conclusions are drawn in Section 11.8.

11.2 Structuring the Field

In Section 11.2.1 we give a short review of the terminology used in multi-component maintenance optimization models and explain how we searched the literature. In Section 11.2.2 we present our comprehensive classification scheme.

11.2.1 Search Strategy and Terminology

Presenting a scientific review on a certain topic implies that one tries to discuss all relevant articles. Finding these articles, however, is very difficult. It depends on the search engines and databases used, electronic availability of articles and the search strategy. We used Google Scholar, Scirus and Scopus as search engines, and used ScienceDirect, JStor and MathSciNet as (online) databases. We primarily searched on key words, abstracts and titles, but we also searched within the papers for relevant references. Note that papers published in books or proceedings that are not electronically available are likely not to have been identified.

Terminology is another important issue, as the use of other terms can hide a very interesting paper. The field has been delineated by maintenance, replacement or inspection on one hand and optimization on the other. This combination, however, provides almost 5000 hits in Google Scholar. Next, the term multi-component has been used in conjunction with related terms such as opportunistic maintenance (policies), piggyback(ing), joint replacement, joint overhaul, combining maintenance, grouping maintenance, economies of scale and economic dependence. With respect to the term stochastic dependence, we have also searched for synonyms and related terms such as failure interaction, probabilistic dependence and shock damage interaction. This yields approximately 500 hits. Relevant articles have been selected from this set by scanning the articles.

The vast literature on maintenance of multi-component systems has been reviewed earlier by others. Therefore, we have also consulted existing reviews and overview articles in this field. Moreover, we have applied a citation search (looking both backwards and forwards in time for citations) to all articles found. This citation search is an indirect search method, whereas the above methods are direct methods.
The advantage of this method is that one can easily distinguish clusters of related articles.

11.2.2 Classification Scheme

First of all, we classify the multi-component maintenance models on the basis of the dependence/interaction between components in the system considered. Thomas (1986) defines three different types of interactions: economic, structural and stochastic dependence.

Simply put, economic dependence implies that the cost of joint maintenance of a group of components does not equal the total cost of individual maintenance of these components. The effect of this dependence comes to the fore in the execution of maintenance activities. On the one hand, the joint execution of maintenance activities can save costs in some cases (e.g. due to economies of scale). On the other hand, grouping maintenance may also lead to higher costs (e.g. due to manpower restrictions) or may not be allowed. For this reason, we will subdivide the models with economic dependence into two categories: positive and negative economic dependence. That is, we refine the definition of economic dependence as compared to the definition used in the review article of Dekker et al. (1996). Note that in many systems both positive and negative economic dependence between


R. Nicolai and R. Dekker

components are present. We give special attention to the modelling of maintenance optimization of these systems, in particular the k-out-of-n system. Stochastic dependence occurs if the condition of components influences the lifetime distribution of other components. Synonyms of stochastic dependence are failure interaction or probabilistic dependence. This kind of dependence defines a relationship between components upon failure of a component. For example, it may be the case that the failure of one component induces the failure of other components or causes a shock to other components. Structural dependence applies if components structurally form a part, so that maintenance of a failed component implies maintenance of working components, or at least dismantling them. So, structural dependence restricts the maintenance manager in his decision on the grouping of maintenance activities. A second classification of the models is based on the planning aspect: stationary or dynamic. That is, do we make a short-term/operational or a long-term/ strategic planning for the maintenance activities? Is the planning horizon finite or infinite? In stationary models, a long-term stable situation is assumed and mostly these models assume an infinite planning horizon. Models of this kind provide static rules for maintenance, which do not change over the planning horizon. They generate for example long-term maintenance frequencies for groups of related activities or control-limits for carrying out maintenance depending on the state of components. In dynamic grouping models, short-term information such as a varying deterioration of components or unexpected opportunities can be taken into account. These models generate dynamic decisions that may change over the planning horizon. The last classification we consider is based on the type of optimization method used. This can be an exact method, a heuristic or a search within classes of policies. 
Exact optimization methods are designed to find the real optimal solution of a problem. However, if the computing time of the optimization method increases exponentially with the number of components, then exact methods are only desirable to a certain extent. In that case solving problems with many components is impossible and heuristics should be used. Heuristics are local optimization methods that do not pretend to find the global optimum, but can be applied to find a solution to the problem in reasonable time. The quality of such a solution depends on the problem instance. In some cases it is possible to give an upper bound on the gap between the optimal solution and the solution found by the heuristic. In many papers a maintenance planning is made by optimizing a certain type of policy. Well known maintenance policies are the age and block replacement policies and their extensions. The advantage of policy optimization over other optimization methods is that it gives more insight into the solution of the problem. Note that policy optimization will not always result in the global optimal solution, since there may be another policy that results in a better solution. In some cases however, it can be proved that applying a certain maintenance policy results in the exact (global) optimal solution.
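To make the idea of policy optimization concrete, the following minimal sketch optimizes a single-component age replacement policy by searching over a set of candidate replacement ages. It is an illustration only and is not taken from any of the reviewed papers: the Weibull lifetime, the preventive and corrective replacement costs, the function names and the candidate grid are all assumptions. The cost rate used is the classical renewal-reward expression (c_p R(T) + c_f F(T)) / ∫_0^T R(t) dt, where R is the survival function and F = 1 – R.

```python
import math


def weibull_reliability(t, shape, scale):
    """Survival function R(t) of a Weibull lifetime (assumed lifetime model)."""
    return math.exp(-((t / scale) ** shape))


def age_replacement_cost_rate(T, c_p, c_f, shape, scale, steps=2000):
    """Long-run cost per unit time when replacing at age T or at failure.

    c_p: cost of a preventive replacement, c_f: cost of a failure replacement
    (both hypothetical parameters). The expected cycle length is the integral
    of R(t) over [0, T], approximated here with the trapezoidal rule.
    """
    h = T / steps
    expected_cycle_length = 0.0
    for i in range(steps):
        left = weibull_reliability(i * h, shape, scale)
        right = weibull_reliability((i + 1) * h, shape, scale)
        expected_cycle_length += 0.5 * (left + right) * h
    r_T = weibull_reliability(T, shape, scale)
    expected_cycle_cost = c_p * r_T + c_f * (1.0 - r_T)
    return expected_cycle_cost / expected_cycle_length


def best_replacement_age(c_p, c_f, shape, scale, candidates):
    """Optimize within the class of age policies by searching the candidates."""
    return min(
        candidates,
        key=lambda T: age_replacement_cost_rate(T, c_p, c_f, shape, scale),
    )


if __name__ == "__main__":
    ages = [0.5 * i for i in range(1, 61)]  # candidate ages 0.5, 1.0, ..., 30.0
    T_star = best_replacement_age(1.0, 10.0, 2.0, 10.0, ages)
    print("best age on the grid:", T_star)
    print("cost rate:", age_replacement_cost_rate(T_star, 1.0, 10.0, 2.0, 10.0))
```

The search returns the best policy within the chosen class only. As noted above, this need not be the global optimum over all possible maintenance rules, and the grid itself limits the accuracy of the result.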


11.3 Economic Dependence

In this section we review articles on multi-component systems with economic dependence. We focus on articles appearing since the review of Dekker et al. (1996). In Sections 11.3.1 and 11.3.2 we discuss models with positive and negative dependence, respectively. In Section 11.3.3 we discuss articles on k-out-of-n systems, in which both positive and negative dependence between components are present.

11.3.1 Positive Economic Dependence

Positive economic dependence implies that costs can be saved when several components are maintained jointly instead of separately. Compared with the review of Dekker et al. (1996) we refine the concept of (positive) economic dependence and distinguish the following forms:

• Economies of scale
  – General
  – Single set-up
  – Multiple set-ups
    o Hierarchy of set-ups
• Downtime opportunity

The term economies of scale is often used to indicate that combining maintenance activities is cheaper than performing maintenance on components separately. The term is very general and seems to be similar to positive economic dependence. In this chapter we will speak of economies of scale when the maintenance cost per component decreases with the number of maintained components.

Economies of scale can result from preparatory or set-up activities that can be shared when several components are maintained simultaneously. The cost of this set-up work is often called the set-up cost. Set-up costs can be saved when maintenance activities on different components are executed simultaneously, since execution of a group of activities requires only one set-up. In this overview we distinguish between single set-ups and multiple set-ups. In the latter case there usually is a hierarchy of set-ups. For instance, consider a system consisting of two components, which both consist of two subcomponents. Maintenance of the subcomponents of the components may require a set-up at system level and component level.
First, this means that the set-up cost at component level is paid only once when the maintenance of two subcomponents of a component is combined. Second, the set-up cost at system level is paid only once when all subcomponents are maintained at the same time. Set-up costs usually appear in the objective function of the maintenance problem. If economies of scale are not explicitly modelled by including set-up costs in the objective function, then we classify the model in the category ‘general’.

Another form of positive dependence is the downtime opportunity. Component failures can often be regarded as opportunities for preventive maintenance of non-failed components. In a series system a component failure results in a non-operating


system. In that case it may be worthwhile to replace other components preventively at the same time. This way the system downtime results in cost savings since more components can be replaced at the same time. Moreover, by grouping corrective and preventive maintenance the downtime can be regulated and in some cases it can even be reduced. Note that if the downtime cost is included in the set-up cost in a certain paper, then we will not classify the paper in the category ‘downtime opportunity’, but in the category ‘set-up cost’. In general, however, it is difficult to assess the cost associated with the downtime (see, e.g. Smith and Dekker (1997), who approximate the availability and the cost of downtime for a 1-out-of-n system). Therefore, the downtime cost is usually not included in the set-up cost.

In the paragraphs below we discuss articles dealing with positive economic dependence. Our main focus is on the modelling of this dependence.

11.3.1.1 Economies of Scale

General

In comparison with Dekker et al. (1996) the category ‘general economies of scale’ is new. The papers in this category deal with multi-component systems for which joint maintenance of components is cheaper than individual maintenance of components. This form of economies of scale cannot be modelled by introducing a single set-up cost. The cost associated with the maintenance of components is often concave in the number of components that are maintained simultaneously.

Dekker et al. (1998a) evaluate a new maintenance concept for the preservation of highways. In road maintenance, cost savings can be realized by maintaining larger sections instead of small patches. The road is divided into sectors of 100-m length. Set-up costs are present in the form of the direct costs associated with the maintenance of different parts of the road. The set-up cost is a function of the number of these parts in a maintenance group. A heuristic search procedure is proposed to find the optimal maintenance planning.
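The two-level set-up hierarchy introduced in Section 11.3.1 (a system of two components, each consisting of two subcomponents) can be illustrated with a small cost function. The sketch below is hypothetical: the cost figures, the function name and the representation of a maintenance group as (component, subcomponent) pairs are assumptions made for illustration. The system-level set-up is charged once per maintenance occasion and the component-level set-up once per component that is opened.

```python
def group_cost(selected, system_setup, component_setup, unit_cost):
    """Cost of maintaining a group of subcomponents on one occasion.

    selected: iterable of (component, subcomponent) pairs (assumed encoding).
    system_setup: paid once if anything is maintained.
    component_setup: paid once for every component that is touched.
    unit_cost: variable cost per maintained subcomponent.
    """
    selected = set(selected)
    if not selected:
        return 0.0
    touched_components = {component for component, _ in selected}
    return (
        system_setup
        + component_setup * len(touched_components)
        + unit_cost * len(selected)
    )


if __name__ == "__main__":
    everything = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
    joint = group_cost(everything, 100.0, 20.0, 5.0)
    separate = sum(group_cost([item], 100.0, 20.0, 5.0) for item in everything)
    print("joint maintenance:", joint)        # 100 + 2*20 + 4*5 = 160
    print("separate maintenance:", separate)  # 4 * (100 + 20 + 5) = 500
    for size in range(1, 5):
        cost = group_cost(everything[:size], 100.0, 20.0, 5.0)
        print(size, "subcomponents, cost per subcomponent:", cost / size)
```

With these assumed figures the cost per subcomponent falls as the group grows, which is the sense in which this chapter speaks of economies of scale.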
Papadakis and Kleindorfer (2005) introduce the concept of network topology dependencies (NTD) for infrastructure networks. In these networks two types of NTD can be distinguished: contiguity and set-up discounts. Both types define positive economic dependence between components. In the former case savings are realized when costs are paid once when contiguous sections are maintained at the same time. In the latter case savings are realized when costs are paid once for a neighbourhood of the infrastructure network, independently of how much work is carried out on it. For both types of dependencies a non-linear discount function is defined. The authors consider the problem of maintaining an infrastructure network. It is modelled as an undirected network. Risk measures or failure probabilities for the segments of this network are assumed to be known. A maximum flow minimum cut formulation of the problem is developed. This formulation makes it easier to solve the problem exactly and efficiently.

Single Set-up

Nearly all articles reviewed by Dekker et al. (1996) fall into this category. The objective function of the maintenance optimization model usually consists of a


fixed cost (the set-up cost) and variable costs. The articles discussed below are no exception.

Castanier et al. (2005) consider a two-component series system. Economic dependence between the two components is present in the following way. The set-up cost for inspecting or replacing a component is charged only once if the actions on both components are combined. That is, joint maintenance of components saves costs. In this article the condition of the components is modelled by a stochastic process and it is monitored by non-periodic inspections. In the opportunistic maintenance policy several thresholds are defined for doing inspections, corrective and preventive replacements, and opportunistic maintenance. These thresholds are decision variables. Many articles on this type of model have appeared, but most of them only consider single-component models.

The articles of Scarf and Deara (1998, 2003) consider both economic and stochastic dependence between components in a series system. This combination is scarce in the literature. Positive economic dependence is modelled on the basis that the cost of replacement of one or more components includes a one-off set-up cost whose magnitude does not depend on the number of components replaced. We will discuss these articles in more detail in Section 11.4.

In one of the few case studies found in the literature, Van der Duyn Schouten et al. (1998) investigate the problem of replacing light bulbs in traffic control signals. Each installation consists of three compartments for the green, red, and yellow lights. Maintenance of light bulbs means replacement, either correctively or preventively. First, positive economic dependence is present in the form of set-up cost, because each replacement action requires a fixed cost in the form of transportation of manpower and equipment. Second, the failure of individual bulbs is an opportunity for doing preventive maintenance on other bulbs.
The authors propose two types of maintenance policies. In the first policy, also known as the standard indirect-grouping strategy (introduced in maintenance by Goyal and Kusy 1985; for a review of this strategy we refer to Dekker et al. 1996), corrective and preventive replacements are strictly separated. Economies of scale can thus only be achieved by combining preventive replacements of the bulbs. The authors also propose the following opportunistic age-based grouping policy. Upon failure of a light bulb, the failed bulbs and all other bulbs older than a certain age are replaced. Budai et al. (2006) consider a preventive maintenance scheduling problem (PMSP) for a railway system. In this problem (short) routine activities and (long) unique projects for one track have to be scheduled in a certain period. To reduce costs and inconvenience for the travellers and operators, these activities should be scheduled together as much as possible. With respect to the latter, maintenance of different components of one track simultaneously requires only one track possession. Time is discretized and the PMSP is written as a mixed-integer linear programming model. Positive dependence is taken into account by the objective function, which is the sum of the total track possession cost and the maintenance cost over a finite horizon. To reduce possible end-of-horizon effects an end-of-horizon valuation is also incorporated in the objective function. Note that the possession cost can be seen as a downtime cost. The cost is modelled as a fixed/ set-up cost. This is the reason that it is classified in this category. Besides this positive dependence there also exists negative dependence between components, since some activities exclude each other.


The advantage of a discrete time model is that negative dependence can be incorporated in the model by adding additional restrictions. It appears that the PMSP is an NP-hard problem. Heuristics are proposed to find near-optimal solutions in reasonable time.

Multiple Set-ups

This is also a new category. The maintenance of different components may require different set-up activities. These set-up activities may be combined when several components are maintained at the same time. We have found one article in this category; it assumes a complex hierarchical set-up structure.

Van Dijkhuizen (2000) studies the problem of clustering preventive maintenance jobs in a multiple set-up multi-component production system. As far as the authors know, this is the first attempt to model a maintenance problem with a hierarchical (tree-like) set-up structure. Different set-up activities have to be done at different levels in the production system before maintenance can be done. Each component is maintained preventively at an integer multiple of a certain basic interval, which is the same for all components, and corrective maintenance is carried out in between whenever necessary. So, every component has its own maintenance frequency; the frequencies are based on the optimal maintenance planning for single components. Obviously, set-up activities may be combined when several components are maintained at the same time. The problem is to find the maintenance frequencies that minimize the average cost per unit of time. This problem is an extension of the standard indirect-grouping problem (for an overview of this problem see Dekker et al. 1996).

11.3.1.2 Downtime Opportunity

As we stated earlier, the downtime of a system is often an opportunity to combine preventive and corrective maintenance. This is especially true for series systems, where a single failure results in a system breakdown.
Of course, non-failed components should not be replaced when they are in a good condition, because useful lifetime can be wasted. The maintenance policies proposed in the articles discussed below use this idea. Gürler and Kaya (2002) propose a new opportunistic maintenance policy for a series system with identical items. The article is an extension of the work by Van der Duyn Schouten and Vanneste (1993), who also propose an opportunistic policy for such a system. In their model, the lifetime of the components is described by several stages, which are classified as good, doubtful, preventive maintenance due and failed. Gürler and Kaya (2002) classify the stages in the same way, but the stages good and doubtful are subdivided into a number of states. The proposed policy is of the control-limit type. Components which are PM due (failed) are preventively (correctively) replaced immediately. The entire system is replaced when a component is PM due or down and the number of components in doubtful states is at least N. Here, N is a decision variable. It appears that this policy achieves significant savings over a policy where the components are maintained individually without any system replacement. Popova and Wilson (1999) consider m-failure, T-age and (m,T) failure group policies for a system of identical components operating in parallel. According to


these policies the system is replaced at the time of the m-th failure, every T time units, and at the minimum time of these events, respectively. These policies were first introduced by Assaf and Shanthikumar (1987), Okumoto and Elsayed (1983) and Ritchken and Wilson (1990), respectively. Popova and Wilson (1999) assume that downtime costs are incurred when failed components are not repaired or replaced. So, when the system operates there is also negative dependence between the components. After all, when the components are left in a failed condition, with the intention to group corrective maintenance, then downtime costs are incurred. In the maintenance policies a trade-off between the downtime costs and the advantages of grouping (corrective) maintenance is made. Sheu and Jhang (1996) propose a new two-phase opportunistic maintenance policy for a group of independent identical repairable units. Their model takes into account downtime costs and the maintenance policy includes minimal repair, overhaul, and replacement. In the first phase, (0,T], minor failures are removed by minimal repairs and ‘catastrophic’ failures by replacements. In the second phase, (T,T+W], minor failures are also removed by minimal repairs, but ‘catastrophic’ failures are left idle. Group maintenance is conducted at time T+W or upon the k-th idle, whichever comes first. The generalized group maintenance policy requires inspection at either the fixed time T+W or the time when exactly k units are left idle, whichever comes first. At an inspection, all idle components are replaced with new ones and all operating components are overhauled so that they become as good as new. Higgins (1998) studies the problem of scheduling railway track maintenance activities and crews. In this problem positive economic dependence is present in the following way. The occupancy of track segments due to maintenance prevents all train movements on those segments. 
The costs associated with this can be regarded as downtime costs. The maintenance scheduling problem is modelled as a large scale 0-1 programming problem with many (non-linear) restrictions. The objective is to minimize expected interference delay with the train schedule and prioritized finishing time. The downtime costs are modelled by including downtime probabilities in the objective function. The author proposes tabu search to solve the problem. The neighbourhood, which plays a prominent role in local search techniques, is easily defined by swapping the order of activities or maintenance crews.

The article of Sriskandarajah et al. (1998) discusses the maintenance scheduling of rolling stock. Multiple train units have to be overhauled before a certain due date. The aim is to find a suitable common due date for each train so that the due dates of individual units do not deviate too much from the common due date. Maintenance carried out too early or too late is costly since this may cause loss of use of a train. A genetic algorithm is proposed to solve this scheduling problem.

11.3.2 Negative Economic Dependence

Negative economic dependence between components occurs when maintaining components simultaneously is more expensive than maintaining components individually. There can be several reasons for this:


• Manpower restrictions
• Safety requirements
• Redundancy/production-loss

First, grouping maintenance results in a peak in manpower needs. Manpower restrictions may even be violated, so that additional labour needs to be hired, which is costly. The problem here is to find the balance between workload fluctuation and grouping maintenance.

Second, there are often restrictions on the use of equipment when executing maintenance activities simultaneously. For instance, use of equipment may hamper use of other equipment and cause unsafe operations. Legal and/or safety requirements often prohibit joint operation.

Third, joint (corrective) maintenance of components in systems in which some kind of redundancy is available may not be beneficial. Although there may exist economies of scale through simultaneous repair of a number of (identical) components, leaving components in a failed condition for some time increases the risk of costly production losses. We will come back to this in Section 11.3.3.

Production loss may increase more than linearly with the number of components out of operation. For an example of this type of economic dependence we refer to Stengos and Thomas (1980). The authors give an example of the maintenance of blast furnaces. The more furnaces are out of operation, the greater the disturbance due to maintenance. That is, the cost of overhauling the furnaces increases more than linearly with the number of furnaces out of action.

It appears that maintenance of systems with negative dependence is often modelled in discrete time. The models can be regarded as scheduling problems with many restrictions. These restrictions can easily be incorporated in discrete time models such as (mixed) integer programming models. With respect to these models, there is always the question whether the exact solution can be found efficiently. In other words, the question arises whether the problem is NP-hard. An example of discrete time modelling is given by the article of Grigoriev et al. (2006).
In this article the so-called periodic maintenance problem (PMP) is studied. In this problem machines have to be serviced regularly to prevent costly production losses. The failures causing these production losses are not modelled. Time is discretized into unit-length periods. In each period at most one machine can be serviced. Apparently, negative economic dependence in the form of manpower restrictions or safety measures plays a role in the maintenance of the machines. The problem is to find a cyclic maintenance schedule of a given length T that minimizes total service and operating costs. The operating costs of a machine increase linearly with the number of periods elapsed since that machine was last serviced. PMP appears to be an NP-hard problem and the authors propose a number of solution methods. This leads to the first exact solutions for larger sized problems.

In Stengos and Thomas (1980) time is also discretized, but the maintenance problem, scheduling the overhaul of two pieces of equipment, is set up as a Markov decision process. The pieces can be in different states and the probability of failure increases with the time since the last overhaul. So, in contrast to the problem of Grigoriev et al. (2006), pieces can fail during operation. Negative economic dependence is modelled as follows. The cost of overhauling the pieces


increases more than linearly with the number of pieces out of action. The objective is to minimize the ‘loss of production’ cost, which is incurred when a piece is overhauled. The optimal policy is found by a relative value successive approximation algorithm.

In Langdon and Treleaven (1997) the problem of scheduling maintenance for electrical power transmission networks is studied. There is negative economic dependence in the network due to redundancy/production-loss. Grouping certain maintenance activities in the network may prevent a cheap electricity generator from running, so that a more expensive generator has to be run in its place. That is, some parts of the network should not be maintained simultaneously. These exclusions are modelled by adding restrictions to the MIP formulation of the problem. The authors propose several genetic algorithms and other heuristics to solve the problem.

11.3.3 k-out-of-n Systems

In this section we discuss the different dependencies in the k-out-of-n system in more detail. This system is a typical example of a system with both positive and negative economic dependence between components. A k-out-of-n system functions if at least k components function. If k = 1, then it is a parallel system; if k = n, then it is a series system. Let us for the moment distinguish between the cases k = n and k < n.

In the series system (k = n), there is positive economic dependence due to downtime opportunities. The failure of one component results in an expensive downtime of the system and this time can be used to group preventive and corrective maintenance. Negative economic dependence is not explicitly present in the series system.

If k < n, then there is redundancy in the system and it fails less often than its individual components. This way a specified reliability can be guaranteed. Typically, the components of this system are identical, which allows for economies of scale in the execution of maintenance activities.
It is not only possible to obtain savings by grouping preventive maintenance, but also by grouping corrective maintenance. Note that the latter form of grouping is not advantageous in series systems. In other words, the redundant components introduce additional positive dependence in the system. Whereas positive economic dependence is present upon failure of a component, negative economic dependence plays a role as long as the system operates. A single failure of a component may not always be an opportunity to combine maintenance activities. First, grouping corrective and preventive maintenance upon the failure of the component increases the probability of system failure and costly production losses. Second, leaving components in a failed condition for some time, with the intention to group corrective maintenance at a later stage, has the same effect. So, there is a trade-off between the potential loss resulting from a system failure and the benefit of joint maintenance. One problem of optimizing (age-based) maintenance in k-out-of-n systems is the determination of downtime costs, as a failure does not directly result in system failure. Smith and Dekker (1997) derive the uptime, downtime and costs of maintenance in a 1-out-of-n system (with cold standby), but in general it is very difficult


to assess the availability and the downtime costs of a k-out-of-n system. In their article, Smith and Dekker (1997) optimize the following age-replacement policy. A component is taken out for preventive maintenance and replaced by a stand-by one if its age has reached a certain value T_pm. Moreover, they determine the number of redundant components needed in the system.

In the maintenance policies considered in the articles below, an attempt is made to balance the negative aspects of downtime costs and the positive aspects of grouping (corrective) maintenance. The opportunistic maintenance policies proposed in these articles are age-based and also contain a threshold for the number of failures (except for the policy introduced by Sheu and Kuo 1994).

In Dekker et al. (1998b) the maintenance of light standards is studied. A light standard consists of n independent and identical lamps screwed on a lamp assembly. To guarantee a minimum luminance, the lamps are replaced if the number of failed lamps reaches a pre-specified number m. In order to replace the lamps the assembly has to be lowered. This set-up activity is an opportunity to combine corrective and preventive maintenance. Several opportunistic age-based variants of the m-failure group replacement policy (in its original form only corrective maintenance is grouped) are considered in this paper. Simulation optimization is used to determine the optimal opportunistic age threshold.

Pham and Wang (2000) introduce imperfect PM and partial failure in a k-out-of-n system. They propose a two-stage opportunistic maintenance policy for the system. In the first stage failures are removed by minimal repair; in the second stage failed components are jointly replaced with operating components when m components have failed, or the entire system is replaced at time T, whichever occurs first. Positive economic dependence is of an opportunistic nature. Joint maintenance requires less time than individual maintenance.
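The trade-off in the m-failure group replacement policy, between sharing the set-up cost and leaving failed units unrepaired, can be illustrated with a deliberately simplified calculation. The sketch below is not the model of any of the cited papers. It assumes exponential lifetimes (so that the surviving units are as good as new after a group replacement), replacement of the failed units only, and a hypothetical penalty rate per failed unit per unit of time; all names and figures are illustrative. Under these assumptions the long-run cost rate follows from the renewal-reward theorem.

```python
def m_failure_cost_rate(n, m, failure_rate, setup_cost, unit_cost, penalty_rate):
    """Long-run cost rate of replacing all failed units at the m-th failure.

    Assumes n identical units with exponential lifetimes (rate failure_rate).
    With j units already failed, the next failure occurs after an expected
    time of 1 / ((n - j) * failure_rate).
    """
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= n")
    # Expected time from a group replacement until the m-th failure.
    cycle_length = sum(1.0 / ((n - j) * failure_rate) for j in range(m))
    # Expected failed-unit time per cycle: j units are down between the
    # j-th and the (j + 1)-th failure.
    failed_unit_time = sum(j / ((n - j) * failure_rate) for j in range(1, m))
    cycle_cost = setup_cost + m * unit_cost + penalty_rate * failed_unit_time
    return cycle_cost / cycle_length


if __name__ == "__main__":
    for penalty in (0.0, 50.0, 1000.0):
        rates = {
            m: m_failure_cost_rate(10, m, 1.0, 50.0, 5.0, penalty)
            for m in range(1, 11)
        }
        best_m = min(rates, key=rates.get)
        print("penalty rate", penalty, "-> best m:", best_m)
```

Without a penalty for failed units, larger groups spread the set-up cost over more replacements. A sufficiently high penalty pushes the best threshold back towards m = 1, which mirrors the balance between downtime costs and grouping described above.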
Sheu and Kuo (1994) introduce a general age replacement policy for a k-out-of-n system. Their model includes minimal repair, planned and unplanned replacements, and general random repair costs. The system is replaced when it reaches age T. The long-run expected cost rate is obtained. The aim of the paper is to find the optimal age replacement time T that minimizes the long-run expected cost per unit time of the policy.

The article of Sheu and Liou (1992) will be discussed in Section 11.4, because they assume stochastic dependence between the components of a k-out-of-n system.
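The definition of the k-out-of-n system given at the start of this section can be made concrete with the standard reliability formula for independent, identical components. The sketch below is an illustration under that assumption; the function name and the numerical values are hypothetical. If each component functions with probability p, the system functions with probability equal to the sum over i = k, ..., n of C(n, i) p^i (1 – p)^(n – i).

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent components function.

    p is the probability that a single component functions (assumed to be
    the same for all components).
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p = 0.9
    print("series (3-out-of-3):  ", k_out_of_n_reliability(3, 3, p))
    print("2-out-of-3:           ", k_out_of_n_reliability(2, 3, p))
    print("parallel (1-out-of-3):", k_out_of_n_reliability(1, 3, p))
```

For p = 0.9 the 2-out-of-3 system is more reliable than a single component (about 0.972 against 0.9), which illustrates the remark that a system with k < n fails less often than its individual components.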

11.4 Stochastic Dependence

In the survey of Thomas (1986) multi-component maintenance models with stochastic dependence are considered as a separate class of models. In the more recent review articles this is not the case. In Cho and Parlar (1991) some articles dealing with failure interaction are discussed, but the modelling of failure interaction between components is not. In Wang (2002) nothing is said about systems with failure interaction; articles on this kind of system only appear in the references. Actually, this chapter is the first publication since the survey of Thomas (1986) to give a comprehensive review of multi-component maintenance models with stochastic dependence. We do not aim to give solely a list of papers that have appeared.


Instead, we want to give insight into the different ways of modelling failure interaction between components and explain the implications of certain approaches and assumptions with respect to practical applicability. Stochastic dependence, also referred to as failure interaction or probabilistic dependence, implies that the state of components can influence the state of the other components. Here, the state can be given by the age, the failure rate, state of failure or any other condition measure. In their seminal work on stochastic dependence, Murthy and Nguyen (1985b) introduce three different types of failure interaction in a two-component system. Type I failure interaction implies that the failure of a component can induce a failure of the other component with probability p (q), and has no effect on the other component with probability 1 – p (1 – q). It follows that there are two types of failures: natural and induced. The natural failures are modelled by random variables and the induced failures are characterized by the probabilities p and q. In Murthy and Nguyen (1985a) the authors extend type I failure interaction to systems with multiple components. It is assumed that whenever a component fails it induces a total failure of the system with probability p and has no effect on the other components with probability (1 – p). In this chapter we will consider this to be the definition of type I failure interaction. Type II failure interaction in a two-component system is defined as follows. The failure of component 2 can induce a failure of component 1 with probability q, whereas every failure of component 1 acts as a shock to component 2, without inducing an instantaneous failure, but affecting its failure rate. Type III failure interaction implies that the failure of each component affects the failure rate of the other component. That is, every failure of one of the components acts as a shock to the other component. 
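The distinction between natural and induced failures under type I failure interaction in a two-component system can be illustrated by a small simulation. The sketch below is not the model of Murthy and Nguyen; it rests on several simplifying assumptions that are ours: exponential lifetimes with rates rate1 and rate2, instantaneous replacement of every failed component, no cascading of induced failures, and the convention that a failure of component 1 induces a failure of component 2 with probability p while a failure of component 2 induces a failure of component 1 with probability q.

```python
import random


def simulate_type_one(rate1, rate2, p, q, horizon, seed=0):
    """Count natural and induced failures over [0, horizon].

    Because lifetimes are assumed exponential and failed components are
    replaced immediately, natural failures of the two components form a
    merged Poisson process with rate rate1 + rate2.
    """
    rng = random.Random(seed)
    counts = {"natural_1": 0, "natural_2": 0, "induced_1": 0, "induced_2": 0}
    total_rate = rate1 + rate2
    t = 0.0
    while True:
        t += rng.expovariate(total_rate)  # time of the next natural failure
        if t > horizon:
            break
        if rng.random() < rate1 / total_rate:
            counts["natural_1"] += 1
            if rng.random() < p:  # failure of 1 induces a failure of 2
                counts["induced_2"] += 1
        else:
            counts["natural_2"] += 1
            if rng.random() < q:  # failure of 2 induces a failure of 1
                counts["induced_1"] += 1
    return counts


if __name__ == "__main__":
    result = simulate_type_one(1.0, 0.5, 0.3, 0.1, horizon=20000.0)
    print(result)
    print("share of failures of 1 that induce a failure of 2:",
          result["induced_2"] / result["natural_1"])
```

In a long run the share of natural failures of component 1 that induce a failure of component 2 should be close to p, and similarly for q. Type II and type III interaction would require tracking the failure rate or the accumulated damage of each component, which this sketch does not attempt.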
A potential problem with the failure rate interaction defined by the last two types is determining the size of the shock. In practice it is very difficult to assess the effect of a failure of one component on the failure rate of another component. Usually there is not much data on the course of the failure rate of a component after the occurrence of a shock. Shocks can also be modelled by adding a (random) amount of damage to the state of another component. Natural failures then occur if the state of a component (measured by the cumulative damage) exceeds a certain level. In this paper we bring these ways of modelling type II and III failure interaction together in one definition. That is, we redefine type II failure interaction for multi-component systems as follows. The system consists of several components and the failure of a component either affects the failure rate of, or causes a (random) amount of damage to the state of, one or more of the remaining components. It follows that we regard a mixture of induced failures and shock damage as type II failure interaction. Models with type II failure interaction will also be called shock damage models. In general, the maintenance policies considered in the literature on stochastic dependence are mainly of an opportunistic nature, since the failure of one component is potentially harmful for the other component(s). Modelling failure interaction appears to be quite elaborate. Therefore, most articles only consider two-component systems. Below we review the articles on failure interaction in the following order. First, we will discuss the type I interaction models. For this type of interaction different opportunistic versions of the well-known age and block replacement policies have been proposed. Second, the articles on type II interaction will be reviewed. We will see that in most of these articles the occurrence of shocks is modelled as a non-homogeneous Poisson process (NHPP) or that the failure rate of components is adjusted upon failure of other components. Third, we pay attention to articles that consider both types of failure interaction. Finally, we discuss other forms of modelling failure interaction.

R. Nicolai and R. Dekker

11.4.1 Type I Failure Interaction

Murthy and Nguyen (1985a) consider two maintenance policies in a multi-component system with type I failure interaction. Under the first policy all failed components are replaced by new ones. When there is no total system failure, then only the single failed component is replaced. Under the second policy all components, also the functioning component(s), are replaced. When there is no total system failure, then the single failed component is subjected to minimal repair and made operational. The failure rate of the failed component after repair is the same as that just before failure. The authors deduce both the expected cost of keeping the system operational for a finite time period and the expected cost per unit time of keeping the system operational for an infinite time period. Sheu and Liou (1992) consider an optimal replacement policy for a k-out-of-n system subject to shocks. Shocks arrive according to an NHPP. The system is replaced preventively whenever it reaches age $T > 0$ at a fixed cost $c_0$. If the m-th shock arrives at age $S_m < T$, it can cause the simultaneous failure of $i$ components with probability $p_i(S_m)$ for $i = 0, 1, \ldots, n$, where

$$\sum_{i=0}^{n} p_i(S_m) = 1.$$

If $i \ge k$, then the k-out-of-n system is replaced by a new one at a cost $c_\infty$ (unplanned failure replacement). So, the downtime is used to replace all components. If $0 \le i < k$, then the system is minimally repaired with cost $c_i(S_m)$. After a complete replacement (either a planned or a failure replacement), the shock process is reset to zero. All failures caused by shocks are assumed to be instantly detected and repaired. The aim of the paper is to find the optimal $T$ that minimizes the long-run expected cost per unit time of the maintenance policy. The articles of Scarf and Deara (1998, 2003) consider failure-based, (opportunistic) age and (opportunistic) block replacement policies for a labelled two-component series system with type I failure interaction. The articles can be seen as an extension of the article of Murthy and Nguyen (1985b) on failure-based replacement for such systems. Note that since we deal with a series system, the failure of either component causes system downtime. So, if the system is down, this does not necessarily mean that both components have failed. Economic dependence is modelled on the basis that the cost of replacement of one or more components includes a one-off set-up cost whose magnitude does not depend on the number of components replaced. The maintenance policies considered in Scarf and Deara (1998) are of the age-based replacement type: replace a component on failure or at age $T$, whichever is sooner. Failure-based maintenance is viewed as the limiting case ($T \to \infty$) of age-based replacement. As there is also economic dependence between components,
the authors consider opportunistic age-based replacement policies: replace a component on failure or at age $T$ or at age $T' < T$ if an opportunity exists. The policies considered in Scarf and Deara (2003) are of the block replacement type and are extended for two-component systems. The independent block replacement policy is a single component policy and it is of the following form: replace all failed components, replace component 1 at times $k\Delta_1$, $k = 1, 2, \ldots$ and replace component 2 at times $k\Delta_2$, $k = 1, 2, \ldots$. Block replacement can be grouped: replace failed components and replace the system at times $k\Delta$, $k = 1, 2, \ldots$. It can also be combined: replace both components (whether failed or not) on failure of the system and replace the system at times $k\Delta$, $k = 1, 2, \ldots$. In modified block replacement policies for a two-component system, a component is only replaced at the block replacement times if its age is greater than some critical value. The block replacement times may be independent or grouped, or the components may be combined. Opportunistic modified block replacement policies are of the form: on failure of component 1, if the age of component 2, $\tau_2$, is greater than $b_2'$, then replace both components; otherwise just replace component 1. On failure of component 2, if the age of component 1, $\tau_1$, is greater than $b_1'$, then replace both components; otherwise just replace component 2. At block replacement times for component 1, $k\Delta_1$, $k = 1, 2, \ldots$, replace component 1 if $\tau_1 > b_1$ and replace component 2 if $\tau_2 > b_2'$; at block replacement times for component 2, $k\Delta_2$, $k = 1, 2, \ldots$, replace component 2 if $\tau_2 > b_2$ and replace component 1 if $\tau_1 > b_1'$ (for suitably chosen thresholds $b_1$, $b_2$, $b_1'$ and $b_2'$). In both articles the maintenance policies are considered in the context of the clutch system used in a bus fleet. This system consists of the clutch assembly (component 2) and the clutch controller (component 1).
In this system, the failure of the controller causes a failure of the assembly with probability 1, whereas the failure of the assembly causes a failure of the controller with probability 0. It is important to mention that the maintenance policies are not only compared on the basis of cost, but also on ease of implementation and system reliability. It is found that an age-based policy is best, but since this implies that component ages have to be monitored, the authors propose to implement a block or modified block policy. Combined modified block replacement seems to be the best alternative for the clutch system under consideration. Combining maintenance of components has the advantage that the system is in general more reliable, although the long-run costs per unit time are higher. The economic gains from using a complex policy have to be weighed against the additional investment required to implement such policies. Jhang and Sheu (2000) address the problem of analyzing preventive maintenance policies in a multi-component system with type I failure interaction. Component $i$ ($1 \le i \le N$) has two types of failures. Type 1 failures are minor failures and are rectified through minimal repair. Type 2 failures are catastrophic failures and induce a total failure of the system (i.e. failure of all other components in the system). Type 2 failures are removed by an unplanned/unscheduled replacement of the system. The model takes into account costs for minimal repairs, replacements and preventive maintenance. Generalized age and block replacement policies are proposed. The age replacement policy implies preventive replacement of all components whenever an operating system reaches age $T$. In the case of a block replacement policy the system is preventively replaced every $T$ years. The expected long-run cost per unit time for each policy is derived and it is discussed how the optimal $T$ can be determined. Various special cases are discussed in detail. Finally, the authors mention the application of their model to the maintenance of mining cables used in hoisting loads.

11.4.2 Type II Failure Interaction

Satow and Osaki (2003) consider a two-component parallel system. Component 1 is repairable and at failure minimal repair is done. Failures of component 1 occur according to an NHPP. Whenever the component fails it induces a random amount of damage to component 2. The damage is additive and component 2 fails whenever the total damage exceeds a certain failure level. A system failure always occurs whenever component 2 fails, because both components fail simultaneously. By assumption component 2 is not repairable. This means that a failed system needs to be replaced by a new one. Since preventive replacement is cheaper than failure replacement, a two-parameter preventive replacement policy is analyzed. The policy takes into account both system age and the total damage of component 2. The system is replaced preventively whenever the total damage of component 2 exceeds $k$ or at time $T$, and it is replaced correctively at system failures. An expression for the expected cost per unit time for long-run operation is derived and the policy is optimized analytically for two special cases (the one-parameter policies). Numerical examples show that the policy imposing a limit on the total damage ($k$) of component 2 outperforms the age $T$ policy. It appears that the two-parameter preventive maintenance policy does not necessarily lead to lower expected costs. This is because in this model the state of component 2 is best indicated by the total damage and its age does not provide any additional information.
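A shock damage model with a two-parameter (T, k) policy of the kind just described can be explored by simulation. The sketch below is only an illustration under stated assumptions, not the analytical model of Satow and Osaki (2003): component 1 failures follow an NHPP with an assumed power-law cumulative intensity a·t^b, damage increments are assumed exponential, thresholds are checked only at shock times, and no minimal repair cost is charged for the shock that ends a cycle. All names and the values in `PARAMS` are hypothetical.

```python
import random

# Illustrative parameter values (assumptions, not taken from the cited article).
PARAMS = dict(K=10.0, a=1.0, b=2.0, mean_damage=1.0,
              c_min=1.0, c_prev=5.0, c_fail=20.0)


def simulate_cycle(T, k, K, a, b, mean_damage, c_min, c_prev, c_fail, rng):
    """Simulate one renewal cycle and return (cycle_length, cycle_cost).

    T: age limit, k: preventive damage limit, K: failure level of component 2.
    """
    cum_intensity = 0.0  # value of a * t**b at the most recent shock
    damage = 0.0
    cost = 0.0
    while True:
        # NHPP arrivals by inversion: unit-rate exponential gaps on the
        # cumulative-intensity scale, mapped back to calendar time.
        cum_intensity += rng.expovariate(1.0)
        t = (cum_intensity / a) ** (1.0 / b)
        if t >= T:
            return T, cost + c_prev            # preventive replacement at age T
        damage += rng.expovariate(1.0 / mean_damage)
        if damage > K:
            return t, cost + c_fail            # system failure, corrective
        if damage > k:
            return t, cost + c_prev            # damage limit exceeded
        cost += c_min                          # minimal repair of component 1


def cost_rate(T, k, n_cycles=20000, seed=0, **params):
    """Estimate the long-run cost per unit time via the renewal-reward ratio."""
    rng = random.Random(seed)
    total_length = 0.0
    total_cost = 0.0
    for _ in range(n_cycles):
        length, cost = simulate_cycle(T, k, rng=rng, **params)
        total_length += length
        total_cost += cost
    return total_cost / total_length


if __name__ == "__main__":
    for T, k in [(2.0, 10.0), (1e9, 6.0), (2.0, 6.0)]:
        print(T, k, cost_rate(T, k, **PARAMS))
```

Setting k equal to K gives an age-only policy and a very large T gives a damage-only policy, so the three one- and two-parameter variants discussed in the text can be compared with the same routine.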
Zequeira and Bérenguer (2005) study inspection policies for a two-component parallel standby system. The system operates successfully if at least one component functions. Failures can be detected only by periodic inspections. The failure times are modelled as independent random variables. Type II failure interaction is modelled as follows. The failure of one component modifies the (conditional) failure probability of the other component with probability p and does not influence the failure time with probability 1 – p. In this respect, the model extends the failure rate interaction models proposed by Murthy and Nguyen (1985b). Inspections are either staggered, i.e. the components are inspected one at a time, or non-staggered, i.e. the components are inspected simultaneously. It is assumed that there are no economies of scale from non-staggered inspections. Numerical experiments show that for the case of constant hazard rates, staggered inspections outperform non-staggered inspections on the expected average cost per unit time criterion. The authors explain this counterintuitive result as follows. When inspections are staggered, at least one component is in an operating condition more frequently than when inspections are not staggered.


Lai and Chen (2006) consider a two-component system with failure rate interaction. The lifetimes of the components are modelled by random variables with increasing failure rates. Component 1 is repairable and it undergoes minimal repair at failures. That is, component 1 failures occur according to an NHPP. Upon failure of component 1 the failure rate of component 2 is modified (increased). Failures of component 2 induce the failure of component 1 and consequently the failure of the system. The authors propose the following maintenance policy. The system is completely replaced upon failure, or preventively replaced at age $T$, whichever occurs first. The expected average cost per unit time is derived and the policy is optimized with respect to parameter $T$. The optimum turns out to be unique. Barros et al. (2006) introduce imperfect monitoring in a two-component parallel system. It is assumed that the failure of component $i$ is detected with probability $1 - p_i$ and is not detected with probability $p_i$. The components have exponential lifetimes and when a component fails, extra stress is placed on the surviving one, whose failure rate is increased. Moreover, independent shocks occur according to a Poisson process. These shocks correspond to common cause failures and induce a system failure. The following maintenance policy is proposed. Replace the system upon failure (either due to a shock or failure of the components separately), or preventively at time $T$, whichever occurs first. Assuming that preventive replacement is cheaper, the total expected discounted cost over an unbounded horizon is minimized. Numerical examples show the relevance of taking into account monitoring problems in the maintenance model. The model is applied to a parallel system of electronic components. When one fails, the surviving one is overworked so as to keep the delivery rate unaffected.
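Several policies reviewed above replace the system on failure or at age T, whichever occurs first, and then choose T to minimize the long-run cost per unit time. As a much simplified illustration of that optimization, the sketch below evaluates the classical single-unit age replacement cost rate obtained from the renewal-reward theorem, C(T) = [c_f F(T) + c_p (1 − F(T))] / ∫₀ᵀ (1 − F(t)) dt. It deliberately ignores failure interaction and minimal repair, so it is not the model of any article cited here; the Weibull lifetime, the cost values and the function names are assumptions.

```python
import math


def weibull_survival(t, shape, scale):
    """Survival function 1 - F(t) of an (assumed) Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))


def age_replacement_cost_rate(T, c_prev, c_fail, shape, scale, steps=2000):
    """Long-run cost per unit time of age replacement at age T.

    Expected cycle cost:   c_fail * F(T) + c_prev * (1 - F(T))
    Expected cycle length: integral of the survival function over [0, T],
    evaluated here with the trapezoidal rule.
    """
    h = T / steps
    surv_T = weibull_survival(T, shape, scale)
    area = 0.5 * (1.0 + surv_T)  # end points: survival(0) = 1 and survival(T)
    area += sum(weibull_survival(i * h, shape, scale) for i in range(1, steps))
    expected_length = area * h
    expected_cost = c_fail * (1.0 - surv_T) + c_prev * surv_T
    return expected_cost / expected_length


if __name__ == "__main__":
    # Coarse grid search over T for an increasing failure rate (shape > 1).
    grid = [0.05 * j for j in range(1, 101)]
    best_T = min(grid, key=lambda T: age_replacement_cost_rate(T, 1.0, 5.0, 2.0, 1.0))
    print(best_T, age_replacement_cost_rate(best_T, 1.0, 5.0, 2.0, 1.0))
```

With a constant failure rate (shape 1) the cost rate decreases in T, so preventive replacement never pays; an interior optimum requires an increasing failure rate, which is why the reviewed models assume one.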
11.4.3 Types I and II Failure Interaction

Murthy and Nguyen (1985b) derive the expected cost of operating a two-component system with type I or type II failure interaction for both a finite and an infinite time period. They consider a simple, non-opportunistic maintenance policy: always replace failed components immediately. This means that the system is only renewed if a natural failure induces a failure of the other component. Nakagawa and Murthy (1993) elaborate on the ideas of Murthy and Nguyen (1985b). They consider two types of failure interaction between two components. In the first case the failure of component 1 induces a failure of component 2 with a certain probability. In the second case the failure of component 1 causes a random amount of damage to the other component. In the latter case the damage accumulates and the system fails when the total damage exceeds a specified level. Failures of component 1 are modelled as an NHPP with increasing intensity function. The following maintenance policy is examined. The system is replaced at failure of component 2 or at the N-th failure of component 1, whichever occurs first. For both models the optimal number of failures before replacing the system, so as to minimize the expected cost per unit time over an infinite horizon, is derived. The maintenance policy for the shock damage model is extended as follows: the system is also replaced at time $T$. This results in a two-parameter maintenance policy, which is also optimized. The authors give an application of their models to the
chemical industry; component 1 is a pneumatic pump and component 2 is a metal container. The failure of the pneumatic pump may either lead to an explosion, causing system failure (model 1), or lead to a reduction in the wall thickness of the container (model 2). The extension of model 2 captures the introduction of preventive maintenance of the container at time $T$.

11.4.4 Other Types of Failure Interaction

Özekici (1988) considers a reliability system of $n$ components. The state of the system is given by the random vector $X_t$ of the ages of the components at time $t$, that is $X_t = (X_t^1, \ldots, X_t^n)$. It is assumed that $X_t^i \ge 0$ for all $t > 0$ and $i = 1, \ldots, n$, where $X_t^i = \infty$ implies that component $i$ is in a failed state at time $t$. The stochastic structure of the system is that the stochastic process with state-space [0, ∞) is a positive, increasing, right-continuous and quasi-left-continuous strong Markov process. Stochastic dependence between the components is modelled by making the age (state) of a component at time $t$ dependent on the age of the system up to time $t$. The failure interaction considered here differs from the type I and II failure interaction defined above. It is worth mentioning that this paper was written independently of the work of Murthy and Nguyen (1985a,b). Maintenance is modelled as follows. There are periodic overhauls at which the state of the system is inspected and a replacement decision is made on the components based on the observation of the system. Here the cost structure of the maintenance decision is very general and consists of two types: costs which only depend on the number of replaced components and costs which depend on the state of the system at the time of inspection. Economic dependence between components is ‘hidden’ in the former costs. Replacing a group of components together is cheaper than replacing the components separately or in a smaller subgroup. The optimal replacement problem is formulated as a Markov decision process.
The author proposes a very general class of replacement policies, for which the decision to replace a component depends on the age of all components. It appears to be possible to characterize the optimal solution to the replacement problem. Unfortunately, it cannot be proved that there exists a single critical age for the system, which describes the optimal replacement problem. The author provides some intuitive results, e.g. it is not always optimal to replace new components and if the age of components that have to be replaced is increased, then the optimal policy does not change. He also gives an important counter-intuitive result: it is not true that more components are replaced as the system gets older.

11.5 Structural Dependence

Structural dependence means that some operating components have to be replaced, or at least dismantled, before failed components can be replaced or repaired. In other words, structural dependence between components indicates that they cannot be maintained independently. This is not failure dependence, but maintenance dependence. Since the failure of a component offers an opportunity to replace other
components, opportunistic policies are expected to perform well on systems with structural dependence between components. Obviously, preventive maintenance may also be advantageous, since maintenance of structurally dependent components can be grouped. There may be several reasons for structural dependence. For example, a bicycle chain and a cassette form a unit, which should always be replaced together rather than individually. Another example is from Dekker et al. (1998a), which considers road maintenance. Several deterioration mechanisms affect roads, e.g. longitudinal and transversal unevenness, cracking and ravelling. For each mechanism one may define a virtual component, but if one applies a maintenance action to such a component it also affects the state with respect to the other failure mechanisms. The seminal paper in this category is from Sasieni (1956). He considers the production of rubber tyres. The machine that produces the tyres consists of two “bladders”; one tyre is produced on each bladder simultaneously. Upon failure of a bladder, the machine must be stripped down before replacement can be done. This means that the other bladder can be replaced at the same time. Note that immediate replacement is not mandatory, but a failed bladder will produce faulty tyres. Two maintenance policies are analyzed and optimized. The first is a preventive maintenance policy. Bladders which have made a predetermined number of tyres ($m$) without failure are replaced. The second is an opportunistic version of the first policy. When a machine is stripped to replace one bladder, replace the other bladder if it has produced more than $n \le m$ tyres.
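The opportunistic (m, n) rule for the two-bladder machine can be written as a short decision function. This is a minimal sketch of the rule as paraphrased above, not Sasieni's (1956) Markov chain analysis; the function name, the argument layout and the choice to treat "has made m tyres" as age ≥ m are assumptions.

```python
def bladders_to_replace(tyres_made, failed, m, n):
    """Return the set of bladder indices (0 or 1) to replace at this moment.

    tyres_made: tyres produced by each bladder since its installation
    failed:     whether each bladder has failed
    m:          preventive replacement limit (tyres made without failure)
    n:          opportunistic limit, with n <= m
    """
    if n > m:
        raise ValueError("the opportunistic limit n must not exceed m")
    # Bladders that force the machine to be stripped down.
    due = {i for i in (0, 1) if failed[i] or tyres_made[i] >= m}
    if not due:
        return set()
    # The machine is stripped anyway: also replace the other bladder if it
    # has produced more than n tyres.
    opportunistic = {i for i in (0, 1) if i not in due and tyres_made[i] > n}
    return due | opportunistic


if __name__ == "__main__":
    print(bladders_to_replace((10, 3), (True, False), m=20, n=5))
    print(bladders_to_replace((10, 7), (True, False), m=20, n=5))
```

Setting n equal to m recovers the purely preventive first policy, because a bladder that has made more than m tyres would already have been replaced.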

11.6 Planning Horizon and Optimization Methods

In this section we will classify articles on the basis of the planning horizon of the maintenance model and the optimization methods used to solve this model. Actually, these two concepts are related. The majority of the articles reviewed here assume an infinite horizon. This assumption facilitates the mathematical analysis; it is often possible to derive analytical expressions for optimal control parameters and the corresponding optimal costs. So, in the category of infinite horizon (stationary grouping) models, policy optimization is the most popular optimization method. For convenience we will not review the articles in this category. Finite-horizon models consider the system in this horizon only, and hence assume implicitly that the system is not used afterwards, unless a so-called residual value is incorporated to estimate the industrial value of the system at the end of the horizon. In the article of Budai et al. (2006) the so-called end-of-horizon effect is eliminated by adding an additional term to the objective function. This term values the last interval.


The optimization methods applied to finite horizon models are either exact methods or heuristics¹. Exact methods always find the global optimum solution of a problem. If the complexity of an optimization problem is high and the computing time of the exact method increases exponentially with the size of the problem, then heuristics can be used to find a near-optimal solution in reasonable time. The scheduling problem studied by Grigoriev et al. (2006) appears to be NP-hard. Instead of defining heuristics, the authors choose to work on a relatively fast exact method. Column generation and a branch-and-price technique are utilized to find the exact solution of larger-sized problems. The problem considered by Papadakis and Kleindorfer (2005) is first modelled as a mixed integer linear programming problem, but it appears that it can also be formulated as a max-flow min-cut problem in an undirected network. For this problem efficient algorithms exist and thus an exact method is applicable. Langdon and Treleaven (1997), Sriskandarajah et al. (1998), Higgins (1998) and Budai et al. (2006) propose heuristics to solve complex scheduling problems. The first two articles utilize genetic algorithms. Higgins (1998) applies tabu search and Budai et al. (2006) define different heuristics that are based on intuitive arguments. In all four articles the heuristics perform well; a good solution is found within reasonable time.

11.7 Trends and Open Areas

In this section we comment on future research on optimal maintenance of multi-component systems. We first analyze the trends in modelling multi-component maintenance and then discuss the future research areas in this field.

11.7.1 Trends

In the last few years several articles have appeared on optimal maintenance of systems with stochastic dependence. In particular, the shock damage models have received much attention. One explanation for this is that type II failure interaction can be modelled in several ways, whereas there is not much room for extensions in the type I failure model. Another reason is that since the field of stochastic dependence is not very broad yet, it is easy to add a new feature such as minimal repair or imperfect monitoring to an existing model. Third, many existing opportunistic maintenance policies for systems with economic dependence have not yet been applied to systems with (type II) failure interaction. Another upcoming field in multi-component maintenance modelling is the class of finite horizon maintenance scheduling problems. Finite horizon models can be regarded as dynamic models, because short-term information can be taken into account. Maintenance scheduling problems are often modelled in discrete time as mixed integer linear programming problems. These problems can be NP-hard and in that case heuristics or local search methods have to be developed in order to solve the problems to near-optimality efficiently. In the last decade tabu search, genetic algorithms and problem-specific heuristics have already been applied to maintenance scheduling problems (see Langdon and Treleaven 1997, Sriskandarajah et al. 1998, Higgins 1998 and Budai et al. 2006). However, there is still a need for better local search algorithms.

¹ Actually, if the maintenance policy is relatively simple, it is sometimes possible to determine the expected maintenance costs over a finite period of time. For instance, Murthy and Nguyen (1985a,b) consider failure-based policies in a system with stochastic dependence and derive an expression for the expected cost of operating the system for a finite time.

11.7.2 Open Areas

There is scope for more work in the following areas.

11.7.2.1 Finite Horizon Models

On the one hand, the class of infinite horizon models has been studied extensively in the literature. Based on renewal-reward theory many maintenance policies for stationary grouping models have been analyzed. On the other hand, the class of finite horizon models, which includes many maintenance scheduling problems, has received far less attention. However, maintenance of multi-component systems has to be made operational. Therefore, finite horizon and especially rolling horizon models, which also take short-term information into account, have to be developed. In order to solve these models heuristics/local search methods should be further developed. Exact algorithms also need more attention. The article of Grigoriev et al. (2006) shows that some scheduling problems of reasonable size can be optimized exactly in a reasonable time.

11.7.2.2 Case-studies

This review shows that case-studies are not represented very well in the field. This is surprising, since maintenance is an applied topic. In our opinion many models are just (mathematical) extensions of existing models and most of the time models are not validated empirically. Case-studies can lead to new models, both in the context of cost structures and dependencies between components.

11.7.2.3 Modelling Multiple Set-up Activities

In this article we have subdivided the category “economic dependence” into a number of subcategories. It appears that examples of modelling maintenance of systems with multiple set-up activities are scarce. Therefore, this seems to be a promising field for further research. After all, in many production systems complex set-up structures exist.

11.7.2.4 Structural Dependence

The field of structural dependence is wide open. In our opinion there have only been a few articles published on this topic.


11.7.2.5 Stochastic Dependence

Two decades ago Murthy and Nguyen published two articles on the maintenance of systems with stochastic dependence. Although this topic has received much attention since then, most articles still deal with two-component systems. So, there is still a lot of work to do on modelling maintenance of systems with failure interaction that consist of more than two components.

11.7.2.6 Combination of Dependencies

In this article we have seen one example of the combination of stochastic and economic dependence (Scarf and Deara 1998, 2003). We have also reviewed some papers with both positive and negative economic dependence. Obviously, the combination of different types of interaction results in difficult optimization models. So, this is also an opportunity for researchers to come up with some new models.

11.7.2.7 Simulation Optimization

We have already said that much work has been done on maintenance policies for the class of infinite horizon models. Many maintenance policies are not analytically tractable and simulation is needed to analyze these policies. We observe that the optimization of policies via simulation is often done by using algorithms for deterministic optimization problems. Methods such as simulated annealing and response surface methodology may be more efficient. This should be investigated further.

11.8 Conclusions

In this chapter we have reviewed the literature on optimal maintenance of multi-component systems. We first classified articles on the basis of the type of dependence between components: economic, stochastic and structural dependence. Subsequently, we subdivided these classes into new categories. For example, we have introduced the categories positive and negative economic dependence. We have paid attention to articles with both forms of interaction. Moreover, we have defined several subcategories in the class of models with positive economic dependence. With respect to articles in the class of stochastic dependence, we are the first to review these articles systematically. Another classification has been made on the basis of the planning horizon of the models and the optimization methods. We have focussed our attention on the use of heuristics and exact methods in finite horizon models. We have concluded that this is a promising open research area. We have discussed the trends and the open areas of research reported in the literature on multi-component maintenance. We have observed a shift from infinite horizon models to finite horizon models and from economic to stochastic dependence. This immediately defines the open research areas, which also include topics such as case studies, modelling combinations of dependencies between components and modelling multiple set-up activities.


11.9 References Assaf D, Shanthikumar J, (1987) Optimal group maintenance policies with continuous and periodic inspections. Management Science 33:1440–1452 Barros A, Bérenguer C, Grall A, (2006) A maintenance policy for two-unit parallel systems based on imperfect monitoring information. Reliability Engineering and System Safety 91:131–136 Budai G, Huisman D, Dekker R, (2006) Scheduling preventive railway maintenance activities. Journal of the Operational Research Society 57:1035–1044 Castanier B, Grall A, Bérenguer C (2005) A condition-based maintenance policy with nonperiodic inspections for a two-unit series system. Reliability Engineering & System Safety 87:109–120 Cho D, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23 Dekker R, Plasmeijer R, Swart J, (1998a) Evaluation of a new maintenance concept for the preservation of highways. IMA Journal of Mathematics applied in Business and Industry 9:109–156 Dekker R, van der Duyn Schouten F, Wildeman R, (1996) A review of multi-component maintenance models with economic dependence. Mathematical Methods of Operations Research 45:411–435 Dekker R, van der Meer J, Plasmeijer R, Wildeman R, (1998b) Maintenance of lightstandards: a case-study. Journal of the Operational Research Society 49:132–143 Goyal, S, Kusy M, (1985) Determining economic maintenance frequency for a family of machines. Journal of the Operational Research Society 36:1125–1128 Grigoriev A, van de Klundert J, Spieksma F, (2006) Modeling and solving the periodic maintenance problem. European Journal of Operational Research 172:783–797 Gürler Ü, Kaya A, (2002) A maintenance policy for a system with multi-state components: an approximate solution. Reliability Engineering & System Safety 76:117–127 Higgins A, (1998) Scheduling of railway track maintenance activities and crews. 
Journal of the Operational Research Society 49:1026–1033 Jhang J, Sheu S, (2000) Optimal age and block replacement policies for a multi-component system with failure interaction. International Journal of Systems Science 31:593–603 Lai M, Chen Y, (2006) Optimal periodic replacement policy for a two-unit system with failure rate interaction. The International Journal of Advanced Manufacturing and Technology 29:367–371 Langdon W, Treleaven P, (1997) Scheduling maintenance of electrical power transmission networks using genetic programming. In Warwick K, Ekwue A, Aggarwal R, (eds.) Artificial intelligence techniques in power systems, Institution of Electrical Engineers, Stevenage, UK, 220–237 Murthy D, Nguyen D, (1985a) Study of a multi-component system with failure interaction. European Journal of Operational Research 21:330–338 Murthy D, Nguyen, D (1985b) Study of two-component system with failure interaction. Naval Research Logistics Quarterly 32:239–247 Nakagawa T, Murthy D, (1993) Optimal replacement policies for a two-unit system with failure interactions. RAIRO Recherche operationelle / Operations Research 27:427–438 Okumoto K, Elsayed E, (1983) An optimum group maintenance policy. Naval Research Logistics Quarterly 30:667–674 Özekici S, (1988) Optimal periodic replacement of multicomponent reliability systems. Operations Research 36:542–552 Papadakis I, Kleindorfer P, (2005) Optimizing infrastructure network maintenance when benefits are interdependent. OR Spectrum 27:63–84


R. Nicolai and R. Dekker

Pham H, Wang H, (2000) Optimal (τ, T) opportunistic maintenance of a k-out-of-n:G system with imperfect PM and partial failure. Naval Research Logistics 47:223–239 Popova E, Wilson J, (1999) Group replacement policies for parallel systems whose components have phase distributed failure times. Annals of Operations Research 91: 163–189 Ritchken P, Wilson J, (1990) (m; T) group maintenance policies. Management Science 36:632–639 Sasieni M, (1956) A Markov chain process in industrial replacement. Operational Research Quarterly 7:148–155 Satow T, Osaki S, (2003) Optimal replacement policies for a two-unit system with shock damage interaction. Computers and Mathematics with Applications 46:1129–1138 Scarf P, Deara M, (1998) On the development and application of maintenance policies for a two-component system with failure dependence. IMA Journal of Mathematics Applied in Business & Industry 9:91–107 Scarf P, Deara M, (2003) Block replacement policies for a two-component system with failure dependence. Naval Research Logistics 50:70–87 Sheu S, Jhang J, (1996) A generalized group maintenance policy, European Journal of Operational Research 96:232–247 Sheu S, Kuo C, (1994) Optimal age replacement policy with minimal repair and general random repair costs for a multi-unit system. RAIRO Recherche Operationelle/Operations Research 28:85–95 Sheu S, Liou C, (1992) Optimal replacement of a k-out-of-n system subject to shocks. Microelectronics Reliability 32:649–655 Smith M, Dekker R, (1997) Preventive maintenance in a 1 out of n system: the uptime, downtime and costs. European Journal of Operational Research 99:565–583 Sriskandarajah C, Jardine A, Chan C, (1998) Maintenance scheduling of rolling stock using a genetic algorithm. Journal of the Operational Research Society 49:1130–1145 Stengos D, Thomas L, (1980) The blast furnaces problem. 
European Journal of Operational Research 4:330–336 Thomas L, (1986) A survey of maintenance and replacement models for maintainability and reliability of multi-item systems. Reliability Engineering 16:297–309 Van der Duyn Schouten F, Vanneste S, (1993) Two simple control policies for a multicomponent maintenance system. Operations Research 41:1125–1136 Van der Duyn Schouten F, van Vlijmen B, Vos de Wael S, (1998) Replacement policies for traffic control signals. IMA Journal of Mathematics Applied in Business & Industry 9:325–346 Van Dijkhuizen G, (2000) Maintenance grouping in multi-setup multi-component production systems. In Ben-Daya M, Duffuaa S, Raouf A, (eds.) Maintenance, Modeling and Optimization, Kluwer Academic Publishers, Boston, 283–306 Wang H, (2002) A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139:469–489 Zequeira R, Bérenguer C, (2005) On the inspection policy of a two-component parallel system with failure interaction. Reliability Engineering and System Safety 88:99–107

12 Replacement of Capital Equipment

P.A. Scarf and J.C. Hartman

12.1 Introduction

Businesses require equipment in order to function and deliver their outputs. In the global, competitive environment, this equipment is critical to success. However, equipment generally degrades with age and usage, and investment is required to maintain the functional performance of equipment. For example, in mass urban transportation, annual expenditure on equipment replacement for the Hong Kong underground is of the order of $50 million, and further, the Hong Kong underground network is a fraction of the size of that in London, Paris or New York. Where equipment replacement impacts significantly on the bottom line of a corporation and decision-making about such expenditure is under the control of the company executive, the modelling of such decision making is within the scope of this chapter.

Capital equipment investment projects are typically driven by operating cost control, technical obsolescence, requirements for performance and functionality improvements, and safety. That is, rational decision-making about capital equipment replacement will take account of engineering, economic, and safety requirements. In this chapter we will assume that the engineering requirements concerning replacement will define certain choices for equipment replacement. For example, engineers would normally propose a number of options for providing the continuity of equipment function: retain the current equipment as is, refurbish the equipment in order to improve operation and functionality, or replace the equipment with new improved technology. We will further assume that safety requirements are addressed when these options are analysed by engineers. Consequently, we argue that rational choice between the defined replacement options is an economic question. Thus, a logistics corporation may be considering replacement of certain assets in its road transportation fleet. The organisation may have to raise capital to fund such replacement.
There is the expectation that engineers for the corporation will offer a number of choices for replacement (e.g. buy tractors from company X or Y, buy tractors now or in N years' time, or scrap or retain existing tractors as spares) that meet future functional and safety requirements. In this way, decision making about replacement then necessarily considers the costs of the replacement options over some suitable planning horizon. As capital equipment replacement potentially incurs significant costs, the cost of capital is a factor in the decision problem and models to support decision making typically take account of the time value of capital through discounting.

Capital equipment is a significant asset of a business. It consists of necessarily complex systems and a business would typically own or operate a fleet of equipment: the Mass Transit Railway Corporation Limited of Hong Kong operates hundreds of escalators; FedEx Express, the cargo airline corporation, operates more than 600 aircraft; electricity distribution systems comprise thousands of kilometres of cable and hundreds of thousands of items such as transformers and switches; water supply networks are on a similar scale. We can appeal to the law of large numbers and assume with some justification that the economic costs that enter capital equipment replacement decisions are deterministic. Consequently, we consider deterministic models in this chapter and model rational decision making throughout using net present value techniques (e.g. see Arnold 2006; Northcott 1985).

When considering optimal equipment replacement in an uncertain environment, authors have argued the case for using real options (Dixit and Pindyck 1994; Bowe and Lee 2004). Whenever replacement decisions may be exercised continuously, it is argued that the choice to replace an existing asset with a new asset at a specified time is characteristic of an American call option; this approach seeks to value the opportunity to replace the asset. Such a modelling approach would be valuable when considering expansion of assets, for example, through the building of a new transportation link for which the likely return on investment would be highly uncertain. However, we do not consider this approach in this chapter.
We do not consider problems of component replacement in which the functionality of repairable systems is optimized either on a cost basis or a required reliability basis. Such maintenance does not typically involve capital expenditure, and the models used are often stochastic in nature—times to failure are considered to be random. For a recent review of such models, see Wang 2002. The outline of the chapter is as follows. In Section 12.2 we describe the framework for the classification of models that are discussed in this chapter. This framework considers the nature of capital equipment replacement problems in general and presents further detail regarding the nature of cost factors that contribute to replacement decisions. Section 12.3 looks at economic life models and discusses several models and an application of one of the models. Section 12.4 deals with replacement of a network system. Dynamic programming models are discussed in Section 12.5 and the chapter concludes with a discussion of topics for future research in Section 12.6.

12.2 Framework for Replacement Modelling of Capital Equipment

The composition of a fleet may be classified as singular (one operating plant), multiple identical (homogeneous), or multiple non-identical (inhomogeneous). Replacement policies may be classified as single plant replacement, sub-fleet replacement, or entire fleet replacement (Scarf and Christer 1997). The capital replacement models that are considered in this chapter may be classified as economic life models or dynamic programming models. The former are concerned with determining the optimal lifetime of an item of equipment, taking account of costs over some planning horizon. The latter considers replacement decisions dynamically, determining whether plant should be retained or replaced after each period. Economic life models may be further classified according to the length of the planning horizon: infinite, variable finite (with length of the horizon a function of decision variables), or fixed (with a variable number of replacement cycles). Dynamic programming models generally require a finite horizon, but may be used to identify the optimal time zero decision for an infinite horizon.

Early models (e.g. Eilon et al. 1966) were formulated in continuous time with optimum policy obtained using calculus. More complex models are simpler to implement under a discrete time formulation. In the case of economic life models, optimization may be performed using a crude search when there exists a small number of decision variables. For fleets with many items, the discrete time formulation naturally gives rise to mathematical programming problems. Dynamic programming models necessarily require a discrete time formulation. Real options models are formulated in continuous time.

We begin by looking at simple economic life models. These are applied in a case study on escalator replacement. Economic life models are then extended to consider first an inhomogeneous fleet and second a network system viewed as an inhomogeneous fleet with interacting items. A number of different dynamic programming models are introduced for singular systems and then expanded to homogeneous and inhomogeneous fleets and networks of assets.

It is assumed that data relating to maintenance are available and sufficient for modelling purposes.
Data on other “age” related operating costs, such as fuel costs and failures (breakdowns), would also ideally be available. Where usage of plant is non-uniform, particularly if decreasing with age, usage data are also required for replacement policy to be meaningful. This is because, for example, maintenance costs for older plant may be artificially low due to under-utilization or neglect of good maintenance practice for plant near the end of its useful life. Some plant may even be retired as occasional spares. Under-reporting, and thus bias, of maintenance cost data may also be significant (Scarf 1994). Replacement models have also been considered when cost information is obtained subjectively (Apeland and Scarf 2003).

Penalty costs play a role in all replacement decisions (Christer and Scarf 1994). It is only the extent to which penalty cost is quantified in the modelling process that varies. Rather than attempt to estimate the values of “difficult to quantify” parameters such as penalty cost and then determine optimal policy, the influence of these parameters on the decision should be quantified. In this latter approach, threshold values that lead to a step-change in optimum policy can be investigated and presented and the decision makers can then consider whether they believe that such values are realistic within the context of the problem. Thus, the penalty cost can be used to measure in part the subjective component of a replacement choice.

All costs considered in the modelling will be discounted to net present value through the use of a constant discount factor. We refer the reader to Kobbacy and
Nicol (1994) for a detailed discussion of the role of discounting in capital replacement. Appropriate functions describing resale values are assumed to be known, as are purchase costs. Tax considerations in particular contexts should be taken into account and modelled.

12.3 Economic Life Models

12.3.1 A Simple Model for Individual Plant

Early economic life models such as Eilon et al. (1966) considered an idealised equipment replaced at age $T$, that is, replacement every $T$ time units, in perpetuity. In this idealised framework, for $T$ small, frequent replacement leads to high replacement or capital costs. Infrequent replacement (large $T$), on the other hand, results in high operating or revenue costs (assuming that operating costs increase with the age of equipment). Trading-off capital costs against revenue costs leads to an optimum age at replacement, $T^*$, the so-called economic life. The decision criterion is typically the total cost per unit time or the annuity; this latter term has been called the rent by Christer (1984). In the case without discounting, the total cost per unit time, $c(T)$, and the annuity are equivalent and

$$c(T) = \Big\{ \int_0^T m_0(t)\,dt + R \Big\} \Big/ T, \tag{12.1}$$

where $m_0(t)$ is the operating cost rate and $R$ is the replacement cost, and assuming no residual value. From Equation 12.1, it follows that $T^*$ is the solution of

$$\int_0^{T^*} m_0(t)\,dt + R = T^* m_0(T^*),$$

provided it exists. In its discrete time form the total cost per unit time is $c(T) = \{\sum_{i=1}^{T} m_{0i} + R\}/T$, where $m_{0i}$ is the operating cost in time period $i$. With a discount factor $\nu$, discounting to year end, and a residual value function $S(T)$, the net present value (NPV) of all future costs in perpetuity is

$$c_{\mathrm{NPV}}(T) = (1 + \nu^T + \nu^{2T} + \cdots)\Big\{ \sum_{i=1}^{T} m_{0i}\nu^i + \nu^T [R - S(T)] \Big\} = (1 - \nu^T)^{-1}\Big\{ \sum_{i=1}^{T} m_{0i}\nu^i + \nu^T [R - S(T)] \Big\}.$$

An objection to this criterion is that as $\nu \to 1$, $c_{\mathrm{NPV}}(T) \to \infty$. Consequently, we recommend the annuity or rent (the amount paid annually and in perpetuity that is necessary to meet the total discounted cost) given by

$$(1 + \nu + \nu^2 + \cdots)\, c_{\mathrm{rent}}(T) = (1 + \nu^T + \nu^{2T} + \cdots)\Big\{ \sum_{i=1}^{T} m_{0i}\nu^i + \nu^T [R - S(T)] \Big\},$$

whence

$$c_{\mathrm{rent}}(T) = \frac{1 - \nu}{1 - \nu^T}\Big\{ \sum_{i=1}^{T} m_{0i}\nu^i + \nu^T [R - S(T)] \Big\}.$$
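The rent lends itself to a direct search over T. The following Python sketch is illustrative only: the function names and the cost data (operating cost 10 + 2i in period i, replacement cost 100, no residual value) are hypothetical and are not taken from any of the studies cited.

```python
def rent(T, op_costs, R, nu, resale=lambda age: 0.0):
    """Annuity c_rent(T): replace every T periods, in perpetuity.

    op_costs[i - 1] plays the role of m_0i; nu is the discount factor.
    """
    cycle_cost = sum(op_costs[i - 1] * nu**i for i in range(1, T + 1))
    cycle_cost += nu**T * (R - resale(T))
    if nu == 1.0:
        # Limiting case without discounting: total cost per unit time, c(T).
        return cycle_cost / T
    return (1 - nu) / (1 - nu**T) * cycle_cost


def economic_life(op_costs, R, nu, resale=lambda age: 0.0):
    """Crude search for the replacement age T that minimises the rent."""
    return min(range(1, len(op_costs) + 1),
               key=lambda T: rent(T, op_costs, R, nu, resale))


# Hypothetical data: operating cost 10 + 2i in period i, replacement cost 100.
costs = [10 + 2 * i for i in range(1, 31)]
T_star = economic_life(costs, R=100.0, nu=1.0)
print(T_star, rent(T_star, costs, 100.0, 1.0))
```

With these assumed numbers and no discounting, c(T) = 11 + T + 100/T, which is smallest at T = 10 where c(T) = 31; the search returns the same value. Passing a discount factor below one, or a resale function, evaluates the discounted rent in the same way.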

Notice that as $\nu \to 1$, $c_{\mathrm{rent}}(T) \to c(T)$, the total cost per unit time. The economic life can be obtained by minimising $c_{\mathrm{rent}}(T)$, typically using a spreadsheet by considering a range of values of $T$.

12.3.2 Analysing Technological Change Using a Two-cycle Model

The economic life model can be adapted to consider technological change in a number of ways. One can consider economic factors for new models of equipment (future operating costs) in a parametric fashion, specifying a model for technological change which then implies operating cost functions, replacement cost and residual values for each replacement cycle into the future (Elton and Gruber 1976). Alternatively, one can model replacement over a limited time scale, either by fixing the time horizon, or by fixing the number of replacement cycles. Christer (1984) did the latter and described a two-cycle model which models the immediate replacement decision problem by considering existing plant as having age $\tau$ and age-related operating cost $m_{0i}$, and new plant as having operating cost $m_{1i}$. In its discrete form, the annuity for this model is

$$c^{2}_{\mathrm{rent}}(K, L) = \frac{\sum_{i=1}^{K} m_{0(i+\tau)}\nu^i + \nu^K \Big\{ R_1 - S_0(K + \tau) + \sum_{i=1}^{L} m_{1i}\nu^i + \nu^L [R_1 - S_1(L)] \Big\}}{\sum_{i=1}^{K+L} \nu^i}. \tag{12.2}$$

Here $K$ and $L$ are decision variables, with $K$ modelling the time (from now) to replacement of the existing asset; $K+L$ is the time to second replacement. The advantage of this model is that one only need estimate the operating cost of the existing and new assets (as functions of age), the capital cost for the new asset, $R_1$, and the age-related resale or residual value of new and existing assets, $S_0$, $S_1$.

12.3.3 A Fixed Planning Horizon Model

In the financial appraisal of projects, a standard approach fixes the time horizon and determines the NPV of future costs over this horizon (e.g. Northcott 1985). This fixed horizon model has been studied by Scarf and Hashem (2003) and its simplicity lends itself to application in complex contexts (e.g. Scarf and Martin 2001). The annuity for this model can be derived from Equation 12.2 above simply by setting $X = K$ and $K + L = h$, the length of the planning horizon, and then considering $h$ as fixed. Whence, there is only one decision variable, $X$, the time to replacement. Given the possibility that $X = h$, that is, no replacement over the planning horizon whence we retain the current asset, the annuity function has a discontinuity at $X = h$, and $X^* = h$ implies that it is not optimal to undertake the
(replacement) project. Furthermore, since the replacement at the end of the horizon has a fixed cost (with respect to the decision variable $X$) its inclusion or exclusion has no effect on the optimal time to replacement. It is natural not to include the replacement cost at the horizon-end since a standard financial appraisal approach would only account for revenue costs up to project execution, capital costs at project execution, subsequent revenue costs up to the horizon-end, and residual values. Including the replacement at $h$ on the other hand allows cost comparisons with the two-cycle model and the associated rent, Equation 12.2. We take the former approach here however and the annuity is

$$c^{h}_{\mathrm{rent}}(X) = \begin{cases} \Big\{ \sum_{i=1}^{X} m_{0(i+\tau)}\nu^i + \nu^X [R_1 - S_0(X + \tau)] + \sum_{i=X+1}^{h} m_{1(i-X)}\nu^i - \nu^h S_1(h - X) \Big\} \Big/ \sum_{i=1}^{h} \nu^i, & X < h, \\[6pt] \Big\{ \sum_{i=1}^{h} m_{0(i+\tau)}\nu^i - \nu^h S_0(h + \tau) \Big\} \Big/ \sum_{i=1}^{h} \nu^i, & X = h. \end{cases} \tag{12.3}$$
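Equations 12.2 and 12.3 can be evaluated with a few lines of code. The Python sketch below is one possible transcription, not the authors' implementation: the function names are our own, and the cost and residual value functions used in the example are hypothetical. Operating costs and residual values are passed as functions of age.

```python
def annuity_factor(n, nu):
    """Sum of nu**i for i = 1..n (the denominator in Equations 12.2 and 12.3)."""
    return sum(nu**i for i in range(1, n + 1))


def two_cycle_rent(K, L, m0, m1, R1, S0, S1, nu, tau=0):
    """Equation 12.2: replace at K, and again after a further L periods."""
    first = sum(m0(i + tau) * nu**i for i in range(1, K + 1))
    second = sum(m1(i) * nu**i for i in range(1, L + 1))
    total = first + nu**K * (R1 - S0(K + tau) + second + nu**L * (R1 - S1(L)))
    return total / annuity_factor(K + L, nu)


def fixed_horizon_rent(X, h, m0, m1, R1, S0, S1, nu, tau=0):
    """Equation 12.3: one replacement at X within a fixed horizon h."""
    old = sum(m0(i + tau) * nu**i for i in range(1, X + 1))
    if X == h:
        # No replacement over the horizon: keep the current asset throughout.
        return (old - nu**h * S0(h + tau)) / annuity_factor(h, nu)
    new = sum(m1(i - X) * nu**i for i in range(X + 1, h + 1))
    total = old + nu**X * (R1 - S0(X + tau)) + new - nu**h * S1(h - X)
    return total / annuity_factor(h, nu)


# Hypothetical inputs, for illustration only.
m0 = lambda age: 9.0 + 0.1 * age          # existing asset, cost rises with age
m1 = lambda age: 7.0 + 0.05 * age         # new asset
S0 = lambda age: 0.0                      # existing asset has no resale value
S1 = lambda age: max(0.0, 40.0 - 2.0 * age)
h, nu, tau, R1 = 22, 0.95, 20, 63.0

best_X = min(range(h + 1),
             key=lambda X: fixed_horizon_rent(X, h, m0, m1, R1, S0, S1, nu, tau))
print(best_X)
```

For X < h the two functions differ only by the discounted horizon-end replacement cost, nu**h * R1, spread over the same annuity factor, which is why including or excluding that replacement does not move the optimal X.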

12.3.4 A Modified Two-cycle Model

It is interesting to consider the behaviour of these models at Equations 12.2 and 12.3 when the operating costs are constant (or increasing only slowly), since it is not unusual for plant to age only slowly. Of course, replacement of an existing asset in these circumstances would only be contemplated if the operating cost (or functionality) of the new asset is significantly lower (or functionality higher), e.g. electricity supply network components; see Brint et al. (1998). The behaviour is simplest to follow for the continuous time formulation when the discount factor is unity (no discounting) and residual values are zero. Under these circumstances, the costs per unit time (annuity) for the two-cycle model and the fixed horizon model become

$$c^{2}_{\mathrm{rent}}(K, L) = (K m_0 + R + L m_1 + R)/(K + L), \tag{12.4}$$

and

$$c^{h}_{\mathrm{rent}}(X) = \begin{cases} [X m_0 + R + (h - X) m_1]/h, & X < h, \\ m_0, & X = h, \end{cases}$$

respectively. From Equation 12.4 we get

$$\partial c^{2}_{\mathrm{rent}}(K, L)/\partial K = [L(m_0 - m_1) - 2R]/(K + L)^2.$$

Thus there is no $K$ such that $\partial c^{2}_{\mathrm{rent}}(K, L)/\partial K = 0$, but $\partial c^{2}_{\mathrm{rent}}(K, L)/\partial K > 0$, and hence $K^* = 0$, if $L(m_0 - m_1) > 2R$, for any fixed $L$. Furthermore,

$$\partial c^{2}_{\mathrm{rent}}(K, L)/\partial L = [K(m_1 - m_0) - 2R]/(K + L)^2,$$

and so if $m_0 > m_1$ (which is a necessary condition for $\partial c^{2}_{\mathrm{rent}}(K, L)/\partial K > 0$) then $\partial c^{2}_{\mathrm{rent}}(K, L)/\partial L < 0$ and so $L^*$ does not exist. However, in any practical implementation of the two-cycle model, one would bound $L$ with some upper value $l_{\max}$, say. Then the optimal policy would be $K^* = 0$ ($L^* = l_{\max}$) if

$$l_{\max}(m_0 - m_1) > 2R. \tag{12.5}$$

We can consider a similar argument for the fixed horizon model. Thus, $d c^{h}_{\mathrm{rent}}(X)/dX = (m_0 - m_1)/h$ for $X < h$, and so $d c^{h}_{\mathrm{rent}}(X)/dX > 0$ if $m_0 > m_1$. However, since $c^{h}_{\mathrm{rent}}(X)$ has a discontinuity at $X = h$, $X^* = 0$ is optimal only if $m_0 > m_1$ and $c^{h}_{\mathrm{rent}}(0) < c^{h}_{\mathrm{rent}}(h)$. That is, if $(R + h m_1)/h < m_0$, that is, if

$$h(m_0 - m_1) > R. \tag{12.6}$$
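The two thresholds can be checked numerically. The Python sketch below uses illustrative values only (m0 = 9, m1 = 7, R = 63, with hypothetical search bounds) and searches integer grids for the undiscounted, constant-cost rents; the function names are our own.

```python
def rent_const(K, L, m0, m1, R, replacements=2):
    """Undiscounted rent with constant costs and zero residual values.

    replacements=2 is Equation 12.4; replacements=1 drops the replacement
    at the end of the second cycle.
    """
    return (K * m0 + L * m1 + replacements * R) / (K + L)


def best_KL(m0, m1, R, l_max, k_max, replacements=2):
    """Grid search over K = 0..k_max and L = 1..l_max."""
    grid = [(K, L) for K in range(k_max + 1) for L in range(1, l_max + 1)]
    return min(grid, key=lambda kl: rent_const(*kl, m0, m1, R, replacements))


def best_X(m0, m1, R, h):
    """Fixed horizon model: X = h means no replacement over the horizon."""
    def rent(X):
        return m0 if X == h else (X * m0 + R + (h - X) * m1) / h
    return min(range(h + 1), key=rent)


m0, m1, R = 9.0, 7.0, 63.0   # illustrative values only

# Horizon of 40: 40 * (m0 - m1) = 80 exceeds R but not 2R.
print(best_X(m0, m1, R, h=40))                    # fixed horizon model
print(best_KL(m0, m1, R, l_max=40, k_max=40))     # two replacements
print(best_KL(m0, m1, R, l_max=40, k_max=40, replacements=1))
```

With these assumed values the fixed horizon model selects immediate replacement (X = 0), whereas the two-replacement rent pushes K to its upper bound; with a single replacement the search returns K = 0 at L = 40, in agreement with the fixed horizon result.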

Thus, comparison of the inequalities at Equations 12.5 and 12.6 shows that the two models have different properties in terms of the behaviour of optimal policy as a function of cost parameters. Thus the two-cycle model is inconsistent with standard financial models. However, a simple modification to the model will correct this inconsistency. Scarf et al. (2006) suggest simply omitting the replacement at the end of the second cycle. For the constant revenue case above, the rent becomes $c^{2}_{\mathrm{rent}}(K, L) = (K m_0 + R + L m_1)/(K + L)$ and optimal policy would be $K^* = 0$ ($L^* = l_{\max}$) if $l_{\max}(m_0 - m_1) > R$, which is consistent with the fixed horizon model and hence with standard financial appraisal models.

However, it would appear that the two-cycle model with its two replacements (at $t = K$ and at $t = K + L$) is applicable for the case of increasing operating costs and that a modified two-cycle model with one replacement (at $t = K$ only) is applicable for operating costs that are constant or increasing only slowly. This issue can be resolved. When operating costs are increasing only slowly, typically $L^*$ does not exist, and in practice $L$ must be constrained such that $L \le l_{\max}$ (as pointed out above) since numerically we can only search for $L^*$ over a finite space. In constraining $L \le l_{\max}$ under the two replacements formulation, we impose a replacement at $l_{\max}$ when in fact there should not be a second replacement since $L^*$ does not exist. This then suggests that the two-cycle replacement model should be modified in the following subtle way: if there does not exist an $L$ such that $c^{2}_{\mathrm{rent}}(K, L)$ has a minimum strictly within the search space, that is, within $\{(K, L) : 0 < K < K_{\max},\ 0 < L < l_{\max}\}$, then, when determining that $K$ which minimises $c^{2}_{\mathrm{rent}}(K, l_{\max})$, no replacement cost should be incurred at $t = K + l_{\max}$. Thus the model should be modified so that there is only one replacement. Otherwise the “cost hurdle” for replacement of the current asset will be set artificially high (inequality at Equation 12.5). Thus, in all practical situations for which operating costs are increasing only slowly, one should use this modified two-cycle model or the fixed horizon model as a special case.

12.3.5 Discussion of Finite Horizon Replacement Models

Using the fixed horizon model or equivalently using the modified two-cycle model with a finite search space may lead to significant end-of-horizon effects (since costs beyond the horizon-end are ignored). Thus time to first replacement will depend on $h$ (or equivalently $l_{\max}$). Choice of $h$ (or $l_{\max}$) will need to be considered carefully; in practice the horizon may be specified by company policy on accounting methods and discounting may reduce those costs incurred in the distant
future to a small or insignificant level. Furthermore, specification of the residual value may be problematic, particularly for non-movable assets with either constant or slowly changing operating revenue. This is because the market resale value of the asset is arguably zero. However, the residual value, as measured by the benefit of the function the asset performs rather than its value if sold, may be non-zero. In this case company policy may prescribe a “straight line” depreciation so that the residual value is proportional to the estimated asset life fraction remaining at replacement or horizon end. However, such an approach may be difficult to justify since the asset life is unknown and linearity is a strong assumption. One possible approach here would be to look at sensitivity to the parameters in a residual value model such as this but there would be a number of parameters and this may become over-complex. An alternative would be to equate the residual value at the horizon-end to the cost-benefit of the replacement (whenever it took place) over the next m years. But this then amounts to extending the planning horizon from K+L to K+L+m or from h to h+m. This of course will lead to models the same as those considered at present but with longer horizons (or to a three cycle model if a subsequent refurbishment is also considered). Thus if one accepts that a two-cycle model is sufficient for modelling purposes, then, logically, consideration of residual values for a non-movable asset amounts to considering sensitivity to horizon length (either h or K+L whichever model is used). Restricting replacement models to (at most) two cycles may lead to sub-optimal replacement policies. However, for typical discount rates and planning horizons the modelling of operation and replacement beyond the end of the second cycle may have only a small effect on the time to first replacement, the issue of principal interest in practice. 
Replacement models that are not restricted to (at most) two replacement cycles are considered in Section 12.3.

12.3.6 An Application of Finite Horizon Replacement Models

Example 12.1 Decision making regarding the replacement of escalators on a mass transit rail system in a particular city has been considered over a number of years by the corporation that owns and operates the system (Scarf et al. 2006). Maintenance of escalators is generally outsourced to equipment suppliers due to the difficulty that alternative contractors have in obtaining proprietary spares. The original manufacturers can keep costs down as a result of the economy of scale that is achievable through maintaining equipment over a large number of client organisations. Currently, the corporation operates of the order of 600 escalators and the annual maintenance contract price is over $10 million. Escalator replacement is therefore a significant issue within the organisation. Studies by the corporation suggest that the economic life of escalators is of the order of 25 years but that, based on overseas experience, escalator life can be extended to up to 40 years. However, given the size of the fleet, a strategy has to be set to manage escalator maintenance and to deal with the replacement or refurbishment of older escalator assets. A key factor in this strategy is the approach of the organisation to the re-negotiation of maintenance contracts and in particular to
determine the scale of refurbishment of older assets and the level of major parts replacement and supply within the negotiated contract. For the presentation of the modelling work in this example, it is necessary to consider the asset management options open to the corporation in a simple manner, and a homogeneous sub-fleet of the escalators is considered, with modeling carried out for a typical escalator—this is a reasonable simplification since all escalators in the sub-fleet were installed at approximately the same time. For this group, replacement, although crudely costed, was not really a viable option—economic costs were too high and disruption unacceptable given the duration of replacement work. Refurbishment by the original manufacturer, replacing worn parts, upgrading the control system and maintenance access was being carefully considered by the corporation as a viable strategy for managing the asset life. Cost savings could be achieved through a reduction in the annual maintenance contract price subsequent to refurbishment. Thus, put simply, for the escalator group, the corporation was faced with the decision: continue with the current relatively higher-price maintenance contract or refurbish and benefit from a new relatively lower-priced maintenance contract. Other benefits would also accrue from refurbishment for both contractor and the corporation. For the contractor, improved access and safety for maintenance was part of the refurbishment package. For the corporation, upgrade of the control system would result in fewer unplanned escalator stoppages. We consider some four asset management options: “do nothing”—continue with high-price maintenance contract; “refurb”—renew worn parts, retro-fit new control system and proceed with lower-price maintenance contract; “delay refurb”—delay refurbishment for up to n years; “replace”—a full replacement option with nominal costs included for comparison purposes. 
The cost of refurbishment (per escalator) in the present study, $63K, was obtained from initial quotations from the respective manufacturers. On-going annual maintenance contract costs (per escalator) are: $9K pre-refurbishment; $7K post-refurbishment. Prior to refurbishment the cost of replacement of major parts is in addition to the annual maintenance contract and major parts are replaced on the basis of condition. Post-refurbishment, the annual maintenance contract includes replacement of major parts at no extra cost. Given that we might expect major parts to be replaced somewhat less frequently than dictated by their recommended lives, we introduce a cost parameter to model such life-extension; this is called the effective life factor, $\rho$. $\rho = 1$ implies that major parts are replaced at a frequency corresponding to their recommended life (for example, once every 25 years for the steps at a cost of $48K), and the replacement frequency is proportional to $1/\rho$ ($\rho = 2$ implies replacement of steps once every 50 years). The cost of a replacement ($170K) is a nominal figure and used mainly for crude comparison with refurbishment. In practice, replacement may cost significantly more than this.

The corporation recommends a discount rate of $r = 0.11$ and a projected inflation rate of $i = 0.05$. This corresponds to an effective (inflation-adjusted) discount rate of $(1 + r)/(1 + i) - 1 \approx 0.057$, and hence to a discount factor $\nu = (1 + i)/(1 + r) \approx 0.946$. Integral to the refurbishment option is the up-grading of the escalator control system to allow “power-dip ride-through”; this facility prevents unnecessary emergency stops caused by momentary power loss that can cause injuries to passengers. However, the effectiveness of the “ride-through” facility is uncertain; hence we introduce another cost parameter, control system
retro-fit effectiveness, which represents the percentage of passenger injuries due to power dips that would be prevented by up-grading of the control system at refurbishment. Also, for the purposes of sensitivity analysis it is necessary to place a cost on an emergency stop due to a power dip. We call this the penalty cost of failure. Historic records of the number of such stops (approximately 0.5 stops per escalator per year) and the retro-fit effectiveness are used to calculate a total penalty cost (saving) per escalator per year post-refurbishment. Another parameter that is difficult to quantify was also considered, the passenger delay cost due to refurbishment, but the results are omitted here. In Table 12.1, we look at annuities for the modified two-cycle model and the fixed horizon model for a range values of the respective decision variables. Note that the annuities for the fixed horizon model lie on the diagonal indicated. This is because the fixed horizon model is equivalent to the modified two-cycle model with the additional constraint that K+L= h. These annuities are also presented in Figure 12.1a. Figure 12.1b shows the annuities for the two models in the case of no discounting—discounting has the effect of slightly extending the economic life (since the NPV of future costs are reduced) and this accounts for the small difference in optimum policy between the fixed horizon model and the modified two-cycle model in Figure 12.1(a), X* = 1, K* = 4 (years). Table 12.1. Annuities ($000s per escalator per year) escalator for modified two-cycle model with refurbishment at K years from now and again after a further L years. Annuities for fixed horizon model with h = 22 years highlighted, except for X* = 22 (no replacement) for which annuity = $139.4K. 
Cost parameters as follows: refurbishment cost, $62.9K; effective discount factor, 0.06 (equivalent to inflation rate of approximately 0.05 and discount rate of 0.11); penalty cost of failure, $5K; effective life parameter, 1.5; control system retro-fit effectiveness, 75%; cost of refurbishment delay, $10K; annual maintenance contract prerefurbishment, $8.8K (per escalator); annual maintenance contract post-refurbishment, $6.9K (per escalator).

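Columns: K, length of the first cycle (years). Rows: L, length of the second cycle (years). Entries with K + L = 22 (the fixed horizon model with h = 22) are marked with an asterisk.

| L \ K | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 | 19 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 491.8 | 302.4 | 241.1 | 211.6 | 193.9 | 182.2 | 173.9 | 167.8 | 163.1 | 159.4 | 156.4* |
| 3 | 298.0 | 236.1 | 206.8 | 190.2 | 179.3 | 171.6 | 165.9 | 161.5 | 158.0 | 155.3* | 153.0 |
| 5 | 233.0 | 202.7 | 186.2 | 176.1 | 169.0 | 163.7 | 159.7 | 156.5 | 154.0* | 151.9 | 150.2 |
| 7 | 200.3 | 182.8 | 172.5 | 166.1 | 161.4 | 157.7 | 154.9 | 152.6* | 150.7 | 149.1 | 147.8 |
| 9 | 180.7 | 169.6 | 162.9 | 158.7 | 155.5 | 153.0 | 151.0* | 149.3 | 148.0 | 146.8 | 145.9 |
| 11 | 167.8 | 160.2 | 155.8 | 153.1 | 150.9 | 149.2* | 147.8 | 146.7 | 145.7 | 144.9 | 144.2 |
| 13 | 158.6 | 153.3 | 150.3 | 148.6 | 147.2* | 146.1 | 145.2 | 144.4 | 143.8 | 143.2 | 142.8 |
| 15 | 151.8 | 148.0 | 146.0 | 145.0* | 144.2 | 143.6 | 143.0 | 142.5 | 142.1 | 141.8 | 141.5 |
| 17 | 146.6 | 143.8 | 142.5* | 142.1 | 141.7 | 141.4 | 141.2 | 140.9 | 140.8 | 140.6 | 140.4 |
| 19 | 142.5 | 140.5* | 139.7 | 139.7 | 139.6 | 139.6 | 139.6 | 139.6 | 139.6 | 139.5 | 139.5 |
| 21 | 139.2* | 137.8 | 137.4 | 137.7 | 137.9 | 138.1 | 138.3 | 138.4 | 138.5 | 138.6 | 138.7 |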

The cost parameters in Table 12.1 and Figure 12.1 are held at intermediate values. In Figure 12.2, we present annuities for a number of “replacement” options as a function of each of the cost parameters. These replacement options correspond to those considered by the corporation, with “refurb” referring to immediate refurbishment (in year 1), and “delay refurb” referring to refurbishment in year 10 (from time of study). Given the size of the fleet, a constraint on the number of escalators that can be refurbished at any one time and the duration of refurbishment, we would expect the refurbishment programme to last some 15 years, and therefore a significant proportion of the fleet would experience this kind of delay prior to refurbishment. We therefore include it as a particular policy for indicative purposes. We use the fixed horizon model here in order to make comparisons between annuities, because one would wish to compare the cost of different options over the same horizon. Equivalently, we could use the modified two-cycle model with the additional constraint K + L = h = 22 (years), say.

[Figure 12.1: two panels, a and b, each plotting annuity per escalator (HK$K) against K (years), with curves labelled L = 10, 12, 14, 16, 18, 21 and fixed.]

Figure 12.1a,b. Annuities ($000s per escalator per year) for modified two-cycle model with refurbishment at K years from now and operation for a further L years. Annuities for fixed horizon model with h = 22 years also shown (X < 22: bold, solid curve; X = 22: ■). Cost parameters as Table 12.1 except: a effective discount factor equals 0.06 (equivalent to inflation rate of approximately 0.05 and discount rate of 0.11); b no discounting.

From Figure 12.2 we can see that optimum policy is certainly sensitive to these cost factors, with the influence of cost parameters as expected. Threshold values that lead to a step-change in the optimum policy (option) can be observed from these figures. Thus while estimation of the penalty cost of failure, for example, may be difficult and contentious, the importance of its effect can be observed. This may then provide an incentive for further investigation of this parameter or discussion about whether its true value is above or below the threshold of policy change. As a final note for the escalator replacement problem in particular, one could argue that the cost of differing options or policies will reflect the maintenance contractor’s profit requirement, whatever the details of the arrangement, and therefore the total costs of options would be expected to vary very little. What can differ, however, is that some options may lead to lower risk (for example where the contractor bears the cost of major parts’ wear-out, which may be subject to significant uncertainty), and lower risk is certainly desirable from the point of view of the operator.


P. Scarf and J. Hartman

12.3.7 A Model for an Inhomogeneous Fleet

Consider now a fleet consisting of sub-fleets classified on the basis of class (e.g. vehicle-type) and age (or condition) so that the operator of the fleet is concerned with the replacement of sub-fleets, and not with replacement of individual equipment or with replacement of the entire fleet. For this fleet, it is natural to focus on the replacement of particular sub-fleet(s). The economic life models of the previous section must be extended given that the replacement of particular sub-fleets has cost implications for the rest of the fleet.

[Figure 12.2: four panels, a to d, each plotting annuity per escalator (HK$000s) for the options refurb, delay refurb, do nothing and replace against: a effective life factor; b penalty cost of failure (HK$K); c discount rate; d horizon (years).]

Figure 12.2a–d. Annuities (per escalator) as a function of cost parameters for fixed horizon model, with h = 22 years for various refurbishment/replacement options: a annuity vs. effective life parameter; b annuity vs. penalty cost of failure; c annuity vs. nominal discount rate; d annuity vs. horizon length h. Cost parameter values when not varying set at: effective life, 1.5; penalty cost of failure, $5K; nominal discount rate, 0.11; control system retro-fit effectiveness, 75%; refurbishment delay cost, $10K.

Replacement of Capital Equipment


Considering a “rolling schedule” of replacements, the questions of interest are then: which sub-fleet should be replaced first (second, third)?; when should they be replaced?; and what model (equipment specification) should be purchased at replacements? The order in which the sub-fleets are replaced we call the replacement schedule. We call a particular replacement schedule, along with the time scale for replacements and the choice of model for purchase, a replacement policy. It is expected that the operator would have significant input into the choice of sub-fleets to be replaced, based on experience with the fleet. The choice of model for purchase at replacements is also likely to be decided in advance by the operator. This would give the operator a level of control over the modelling process which is highly desirable in practice (Russell 1982). The purpose of the modelling is therefore to provide decision support for the operator on: (i) the cost implications of alternative replacement schedules; (ii) the time scale for replacements and budget requirements; (iii) the cost implications of particular sub-optimal policies necessitated by technological obsolescence or changing economic considerations (e.g. Suzuki and Pautsch 2005). The most important considerations are the choice of sub-fleet for first replacement and the time to first replacement. It is expected that the optimal policy will be updated periodically, as the fleet evolves, and when new information about maintenance costs and new (equipment) models becomes available.

To model the replacement scenario described, we consider an extension of the simple fixed horizon model (Equation 12.3) in which we have a variable number, N, of replacement cycles. Let the inhomogeneous fleet comprise r sub-fleets, with the current sub-fleets indexed by k = 1,...,r. New replacement sub-fleets are indexed by k = r+1,...,r+N.
For a fixed planning horizon of length h, and a given replacement schedule and choice of class for the replacement sub-fleets, the decision variables are then: the number of cycles, N (>1); and the time from the beginning of the i-th cycle to the replacement of sub-fleet i, \(L_i\) (i = 1,...,N). The whole fleet is operated over cycle i, which ends with the resale of sub-fleet i of size \(n_i\) and the purchase of sub-fleet r+i of size \(n_{r+i}\). Sub-fleets need not be homogeneous and the current ages of plant are denoted by \(\tau_{ij}\) (i = 1,...,r+N). The fleet size may be constant (\(n_i = n_{r+i}\) for all i) or variable. For a given replacement schedule, the total discounted cost over the horizon can be expressed as

\[ c_{tdc}(N, L; h) = \sum_{i=1}^{N} \nu^{t_i} \Big\{ \sum_{s=t_{i-1}+1}^{t_i} m_i(s)\,\nu^{s-t_i} + n_{r+i} R_{r+i} - S_i(t_i) \Big\} \tag{12.7} \]

where \(t_i = \sum_{j=0}^{i-1} L_j\) with \(L_0 = 0\). Here \(m_i(\cdot)\) is the age related operating cost of the whole fleet in cycle i; \(S_i(\cdot)\) is the age related resale value of plant in sub-fleet i; \(R_{r+i}\) is the unit cost of replacement plant in sub-fleet r+i; and \(\nu\) is the discount rate. The costs \(m_i(\cdot)\) and \(S_i(\cdot)\) may be expressed as

\[ m_i(s) = \sum_{k=i}^{r+i-1} \sum_{j=1}^{n_k} M_k(\tau_{kj} + s), \quad (i = 1,\dots,N), \]

\[ S_i(L_i) = \sum_{j=1}^{n_i} S_{i1}(\tau_{ij} + L_i), \quad (i = 1,\dots,N), \]


where \(M_k(\cdot)\) is the age related operating cost per unit time for an individual plant in sub-fleet k (k = 1,...,r+N), and \(S_{i1}(\cdot)\) is the age related resale value for individual plant in sub-fleet i. (Also, \(\tau_{kj} = 0\) for k > r.) Appropriate penalty costs, associated with failures, may be incorporated into the operating costs. The annuity, \(c_{tdc}(N, L; h) / \sum_{i=1}^{h} \nu^i\), or other suitable objective function may then be minimized subject to the constraint \(\sum_{i=1}^{N} L_i = h\). Technological change is allowed for in that costs relating to proposed plant for cycles 2,...,N may be assigned as appropriate. The optimum replacement schedule may be obtained by minimizing the objective function over all possible schedules. In practice the range of possible schedules would be narrowed greatly by the experience of the operator.

However, as the decision-maker will not have a firm value for the horizon length, the optimum policy must be robust to variation in h. Furthermore, because the fleet is mixed, both different replacement schedules and different planning horizon lengths will give rise to different age compositions of the fleet at the end of the horizon. Thus replacement policies may need to be compared not just on the basis of cost but also on the basis of the age composition of the fleet at the end of the planning horizon. This final age composition can be considered as quantifying the end-of-horizon effect.

Non-uniform usage, particularly between sub-fleets, may be allowed for by varying the fleet size at replacements. For example, if older plant are underutilized, a smaller number of new plant would be required to meet the demand currently placed on an older sub-fleet. This effectively reduces the replacement cost for that sub-fleet by a factor equal to the ratio of the utilization of the old to the new sub-fleet. Of course, other more complex methods of accounting for differing usage may be considered.
Given sufficient data, operating costs could be quantified in terms of usage, and optimum policy may be obtained given forecasts for usage of sub-fleets over the planning horizon. The models may be extended to the case in which sub-fleets are retired as spares. The number of sub-fleets would simply increase by one at each replacement, with the costs associated with the retired sub-fleet added. Predicting operating costs for a retired sub-fleet would be difficult, however, as it is likely that no data would be available for this. Also it is assumed that equipment is bought new: in principle it is a simple matter to extend Equation 12.7 to the case in which used equipment may be purchased. Note that the formulation as presented allows for the possibility of a sub-fleet composed of a single unit of equipment. This may be appropriate if the fleet is small. The complexity of the computational problem increases rapidly as the number of sub-fleets increases. However, we do not consider efficient algorithms for determining optimum policy here.

12.3.8 Application of a Replacement Model for an Inhomogeneous Fleet

Example 12.2 Scarf and Hashem (1997) consider the inter-city coach fleet operated by Express National Berhad in Malaysia. The fleet comprised 160 vehicles of 5 vehicle types of varying ages, with maintenance cost modelled as \(M(\tau) = a\tau^b\) and resale values \(S(\tau) = 0.6R(0.81)^\tau\), for replacement cost R (Table 12.2). The data available were not sufficient for obtaining the maintenance cost model for all vehicle types: for example, for the MAN, only data relating to their first year of operation were available. Furthermore, for older vehicles the costs appeared to be decreasing. This could perhaps be put down to under-utilization (partial retirement) and also neglect of vehicles reaching the end of their useful life. It was therefore necessary to pool the data to obtain reasonable cost models. The fitted maintenance cost models for the Cummins, Isuzu CJR and MAN were obtained by first fitting an overall cost model to data on vehicles up to eight years old, and then scaling this model to the costs of the individual vehicle types in the manner described in Christer (1988). The costs for the older sub-fleets, the Mitsubishi and Isuzu CSA, were taken as constant. Penalty costs for breakdowns on the road were also modelled; see Scarf and Hashem (1997) for a full discussion of this.

It was known that the Mitsubishi and Isuzu CSA sub-fleets were in partial retirement and candidates for immediate replacement, capital expenditure permitting. The usage of sub-fleets was unknown, although with a daily requirement for 125 vehicles, it was reasonable to suppose that the usage level for the Mitsubishi and Isuzu CSA sub-fleets was about half that of the other newer sub-fleets. This assumption led to the null “optimal policy” (replace the Mitsubishi and Isuzu CSA sub-fleets as soon as possible), which is uninteresting from a model validation point of view. Therefore, in order to illustrate the replacement model, we consider the following sub-problem in detail: investigate replacement policy for the “fleet” comprising the Cummins, Isuzu CJR and MAN, assuming a fixed fleet size (93 vehicles) and uniform usage.

Table 12.2.
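Fleet composition by vehicle type showing purchase cost, R, maintenance cost parameters and age distribution at time of replacement study

| Vehicle type | R (M$000s) | a | b |
|---|---|---|---|
| Isuzu CSA | 750 | 55.6 | 0 |
| Mitsubishi | 800 | 57.8 | 0 |
| Cummins | 500 | 24.7 | 0.72 |
| Isuzu CJR | 300 | 11.1 | 0.72 |
| MAN | 450 | 18.4 | 0.72 |

Maintenance cost model: \(M(\tau) = a\tau^b\). Age distribution (number of vehicles in each age group, 2 year intervals): the age groups are <2, 2–3, 4–5, 6–7, 8–9, 10–11 and >12, and the vehicle counts as extracted are 9, 30, 28, 18, 8, 45 and 22 (160 in total); the assignment of these counts to vehicle type and age group is not recoverable from the extracted layout.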
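The two cost models used in Example 12.2 can be evaluated directly. The following is a minimal sketch only: the function names are hypothetical, the Cummins parameter values are those read from Table 12.2, and the units of the maintenance cost are whatever units the fitted parameter a carries in the source study.

```python
def maintenance_cost(age, a, b):
    """Age related maintenance cost M(tau) = a * tau**b."""
    return a * age ** b


def resale_value(age, purchase_cost):
    """Resale value S(tau) = 0.6 * R * 0.81**tau for purchase cost R."""
    return 0.6 * purchase_cost * 0.81 ** age


# Parameters for the Cummins sub-fleet as read from Table 12.2 (R in M$000s).
cummins = {"R": 500.0, "a": 24.7, "b": 0.72}

for age in (1, 4, 8):
    m = maintenance_cost(age, cummins["a"], cummins["b"])
    s = resale_value(age, cummins["R"])
    print(f"age {age}: maintenance {m:.1f}, resale {s:.1f}")
```

With b = 0 (the older Isuzu CSA and Mitsubishi sub-fleets) the maintenance cost reduces to the constant a, which matches the statement that their costs were taken as constant.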

For the three sub-fleets problem, optimal policy is presented for each of the six replacement schedules in Table 12.3. Horizon lengths, h, of 120, 150 and 180 months (10, 12.5 and 15 years) are considered. For the fleet as a whole it is difficult to determine optimal replacement policy, as two sub-fleets are partially retired and usage levels are unknown. The problem is made more difficult because it is likely that the maintenance of these sub-fleets is less thorough than that for the newer sub-fleets. Under simple usage assumptions, the optimum policy is to replace the Mitsubishi and Isuzu CSA sub-fleets immediately. For the particular sub-problem relating to the Cummins, Isuzu CJR and MAN, it appears that the optimum replacement schedule depends on the length of the horizon. The end-of-horizon effect, as represented by the mean age of the fleet, also varies with the replacement schedule. The choice of optimal policy is therefore not straightforward. Over a fifteen year planning horizon, there is little to choose between the three schedules Cummins-IsuzuCJR-MAN, Cummins-MAN-IsuzuCJR and MAN-Cummins-IsuzuCJR, in terms of both cost and age. Sensitivity to model parameters is considered more fully in Scarf and Hashem (1997).

Table 12.3. Optimum policy for each schedule for various horizon lengths, h = 120, 150, 180 months; penalty cost, M$2000; annual discount rate 0.97. Cost of equivalent rent (M$000s per month for whole fleet), average age of fleet at end of horizon, and optimum cycle lengths. Replacement schedules: CIM – Cummins-IsuzuCJR-MAN, etc.

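| h | Schedule | Cost/month (M$000s) | Age/years | L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|---|---|
| 120 | CIM | 745.8 | 7.0 | 6 | 114 | | |
| 120 | CMI | 763.5 | 9.9 | 120 | | | |
| 120 | ICM | 816.7 | 8.4 | 120 | | | |
| 120 | IMC | 816.7 | 8.4 | 120 | | | |
| 120 | MCI | 782.1 | 5.1 | 24 | 6 | 90 | |
| 120 | MIC | 838.1 | 6.5 | 42 | 78 | | |
| 150 | CIM | 779.4 | 8.6 | 6 | 144 | | |
| 150 | CMI | 767.3 | 5.6 | 6 | 54 | 90 | |
| 150 | ICM | 844.7 | 4.6 | 54 | 6 | 6 | 84 |
| 150 | IMC | 848.1 | 4.4 | 60 | 6 | 6 | 78 |
| 150 | MCI | 782.8 | 5.7 | 42 | 6 | 102 | |
| 150 | MIC | 850.6 | 7.7 | 60 | 90 | | |
| 180 | CIM | 787.9 | 6.0 | 18 | 72 | 6 | 84 |
| 180 | CMI | 778.1 | 6.0 | 18 | 60 | 36 | 66 |
| 180 | ICM | 841.0 | 5.2 | 72 | 6 | 6 | 96 |
| 180 | IMC | 843.9 | 5.4 | 72 | 6 | 6 | 96 |
| 180 | MCI | 794.6 | 6.7 | 54 | 6 | 120 | |
| 180 | MIC | 859.9 | 4.4 | 72 | 6 | 6 | 96 |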

12.4 Capital Replacement for a Network System

Consider now a network viewed as a fleet of connected or dependent assets. For such a network, replacements can be considered as capital projects identified using engineering considerations. A simple way to proceed is then to consider a fixed horizon of length h over which we will evaluate the consequences of such projects, P. For a network, a potential project may be the replacement or refurbishment of a component or group of components which comprise part of the network, or a network expansion, or a network re-design. Define \(f_{t1}\) to be the operating cash flow (or performance) relating to a project, P, in year t after the release of project P. Define \(f_{t0}\) to be the baseline operating cash flow (or performance) relating to P in year t. For network expansion projects \(f_{t0} = 0\). Let C (>0) be the capital cost of project P. Assume income cashflows are negative and expenditure cashflows are positive, and that all cashflows are incurred at the year end and discounted at rate \(\nu\). If project P is released in year x from now then the total cashflow over h years from now will be

\[ \sum_{t=1}^{x-1} f_{t0}\,\nu^t + \nu^x \Big( \sum_{t=0}^{h-x} f_{t1}\,\nu^t + C \Big). \tag{12.8} \]

If project P is not released then the cashflow over the horizon will be \(\sum_{t=1}^{h} f_{t0}\,\nu^t\). Define the “gain” from releasing project P in year x to be the difference between these cashflows:

\[ g_P(x; h) = \sum_{t=1}^{h} f_{t0}\,\nu^t - \Big[ \sum_{t=1}^{x-1} f_{t0}\,\nu^t + \nu^x \Big\{ \sum_{t=0}^{h-x} f_{t1}\,\nu^t + C \Big\} \Big] = \sum_{t=x}^{h} \big( f_{t0} - f_{t-x,1} \big)\nu^t - \nu^x C. \]

Release of project P in year x (x ≤ h) is worthwhile only if \(g_P(x; h) > 0\). We can optimize project release by choosing that x for which the gain is a maximum and positive. If \(g_P(x; h) \le 0\) for all x = 1,...,h then release of project P will not be recommended (over the horizon). The consequences of project release may also be measured in terms of performance (Scarf and Martin, 2001). For a large system comprising many potential projects, the outcome of this modelling approach will typically be a list of projects that should be released immediately, and a list of those that should not. Of course, the release of projects will, in both cases, be limited by the budget for capital expenditure.

For prioritizing project release, we can consider all projects over the horizon (0, h). Let \(g_{ij}(h)\) be the gain when project i is released in year j (1 ≤ j ≤ h), and suppose that, in year j, n projects have positive gains. These projects might be listed in order of magnitude of their gains. They might also be listed in order of magnitude of the “profitability” index, \(g_{ij}(h) / C_i\), where \(C_i\) is the capital cost of project i. If rational decision criteria are to be used to determine policy then projects at the top of the list should be given priority, since they would be associated with the largest expected gain over the planning horizon. Under capital rationing with a fixed budget, this project priority list would indicate which projects can be released in the current year. For consequences considered in cashflow terms, appropriate discount rates may be chosen to reflect the investment risks of projects. A higher required rate of return (smaller discount factor) might be imposed on network expansion projects than on replacement of existing assets. Also, factors other than cost may impact on decisions: safety-related projects or projects with high customer benefit may take priority.
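The gain calculation and the choice of release year can be sketched in a few lines. This is an illustration only: the function names and the cashflow data are hypothetical, and the discounting parameter is treated as a per-year discount factor applied as a power of t, as in Equation 12.8.

```python
def gain(x, h, f0, f1, capital_cost, nu):
    """Gain g_P(x; h) from releasing a project in year x (1 <= x <= h).

    f0[t]: baseline cashflow in year t (t = 1..h; f0[0] is unused).
    f1[t]: cashflow t years after release (t = 0..h-x).
    nu:    discount factor per year.
    """
    no_release = sum(f0[t] * nu ** t for t in range(1, h + 1))
    before = sum(f0[t] * nu ** t for t in range(1, x))
    after = sum(f1[t] * nu ** t for t in range(0, h - x + 1))
    return no_release - (before + nu ** x * (after + capital_cost))


def best_release_year(h, f0, f1, capital_cost, nu):
    """Return (x, gain) for the largest gain; x is None if that gain is not positive."""
    gains = {x: gain(x, h, f0, f1, capital_cost, nu) for x in range(1, h + 1)}
    x_best = max(gains, key=gains.get)
    return (x_best if gains[x_best] > 0 else None), gains[x_best]


# Hypothetical data: the existing asset costs more to run each year,
# while the replacement has a flat running cost.
h = 10
nu = 0.9
capital_cost = 60.0
f0 = [0.0] + [20.0 + 3.0 * t for t in range(1, h + 1)]
f1 = [10.0] * h

x_best, g_best = best_release_year(h, f0, f1, capital_cost, nu)
print(x_best, round(g_best, 2))
```

For these illustrative cashflows the gain falls as release is postponed, so the earliest year is selected; other cashflow profiles may favour a later release or none at all.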
A capital rationing model to prioritize project release over k years (k ≤ h) may be formulated as a linear program (LP), similarly to that proposed in vehicle fleet management by Karabakal et al. (1994). Suppose the capital investment budget for year j is \(B_j\) (j = 1,...,h). Introduce the indicator variable \(x_{ij}\) which takes the value 1 if project i is released in year j and 0 otherwise. Then seek those values of \(x_{ij}\) (i = 1,...,n; j = 1,...,k) which maximize the total gain over (0, h) of all projects released, subject to the constraints that the capital investment budget is not exceeded in each year. That is, maximize

\[ \sum_{i=1}^{n} \sum_{j=1}^{k} x_{ij}\, g_{ij}(h) \]

subject to

\[ \sum_{i=1}^{n} x_{ij} C_i \le B_j \quad \text{for all } j = 1,\dots,k; \tag{12.9} \]

\[ \sum_{j=1}^{k} x_{ij} \le 1 \quad \text{for all } i = 1,\dots,n; \tag{12.10} \]

\[ x_{ij} = 0, 1. \]
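For a very small instance, the formulation above can be solved by simple enumeration. The sketch below is illustrative only: the function name and the data are hypothetical, and exhaustive search is used in place of a proper integer programming solver, which any instance of realistic size would need.

```python
from itertools import product


def solve_capital_rationing(gains, costs, budgets):
    """Enumerate release plans for n projects over k years.

    gains[i][j]: gain of project i if released in year index j.
    costs[i]:    capital cost of project i.
    budgets[j]:  capital budget for year index j.
    Returns (best_total_gain, plan), where plan[i] is the release
    year index of project i, or None if project i is not released.
    """
    n, k = len(costs), len(budgets)
    best_total, best_plan = 0.0, tuple([None] * n)
    # Each project gets at most one release year (constraint 12.10).
    for plan in product([None, *range(k)], repeat=n):
        spend = [0.0] * k
        for i, j in enumerate(plan):
            if j is not None:
                spend[j] += costs[i]
        if any(spend[j] > budgets[j] for j in range(k)):
            continue  # budget constraint 12.9 violated in some year
        total = sum(gains[i][j] for i, j in enumerate(plan) if j is not None)
        if total > best_total:
            best_total, best_plan = total, plan
    return best_total, best_plan


# Hypothetical data: three projects, two years.
gains = [[30, 25], [20, 18], [-5, -2]]
costs = [50, 40, 10]
budgets = [60, 60]

best_total, best_plan = solve_capital_rationing(gains, costs, budgets)
print(best_total, best_plan)
```

In this example the first two projects cannot both be released in the first year, so one is deferred, and the third project, whose gain is negative in every year, is never released.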

Constraint set 12.9 ensures that the budget for year j is not exceeded. Constraint set 12.10 ensures that project i is released at most once over the planning horizon. Note that if an individual project has negative gain whatever its execution time, then the contribution to the objective function from this project will be greatest when this project is not released over (0, k). Typically such planning may be informative over the planning horizon, but only decisions relating to the immediate future (one to two years) would be acted on. Therefore policy would be continually updated, implying a “rolling horizon” approach. Where a network consists of many identical components, the modelling of project planning may be extended to the case in which a proportion of “similar” projects are released in a given year. This could be done by formulating the capital rationing model (CRM) as a mixed programming problem.

Consider now dependence between projects. For example, a major expansion project, while not replacing existing assets, may have significant operating cost or performance implications for particular assets: the building of a large ring-main in a water supply network is one such example. Essentially, if two projects \(P_1\) and \(P_2\) interact in this way, then new projects \(P_1' = (P_1, \text{not } P_2)\), \(P_2' = (\text{not } P_1, P_2)\), and \(P_{12}' = (P_1, P_2)\) would have to be introduced, along with the constraint to ensure that at most one of \(P_1'\), \(P_2'\), and \(P_{12}'\) is released over the planning horizon. While this approach may lead to a significant increase in the number of “projects” in the model, in principle the solution procedure would remain unchanged. The existence of future-cost dependencies between projects would have to be identified by the network owner. This may be extremely difficult in practice. However, such dependency would very much characterize the network replacement problem, and therefore the approach described is an advance over current methods. A similar approach has been taken by Santhanam and Kyparisis (1996) in modelling dependency in the project release of information systems. Capital costs may be considered simply using the concept of shared set-up. It is possible that it may be optimal to release both \(P_1\) and \(P_2\) during the planning horizon, but not simultaneously. This presents a more difficult modelling task, without introducing many pseudo-projects, that is. For example, we could consider: release \(P_1\) at time s and \(P_2\) at time t; however for k = 10, say, this would mean the introduction of 25 variables, \(x_{(P_1 P_2)(s,t)}\), for the \(P_1\), \(P_2\) decision alone!


For reasons of budgeting constraints or technical delays, the release of some recommended project at some optimal execution moment x* may not be possible. In such cases, it would be informative to have an indication of the extra cost to be incurred in revenue expenditure because of lack of capital expenditure; this is the marginal increased revenue expenditure due to delayed release. Given the capital rationing model, and focusing on cashflows, the operating cost consequences of capital rationing can be determined by calculating the delay associated with each project as a result of capital rationing. The revenue cost implications due to this delay would, from Equation 12.8, be

\[ \delta_P(x^*, x'; h) = \sum_{t=x^*}^{x'-1} f_{t0}\,\nu^t + \sum_{t=0}^{h-x'} f_{t1}\,\nu^{t+x'} - \sum_{t=0}^{h-x^*} f_{t1}\,\nu^{t+x^*} \]

where x′ is the execution time for the project under capital rationing. The marginal increase in revenue expenditure would be found by summing over all projects. In a similar manner, the marginal increase in revenue expenditure due to projects delayed in year j could be found by summing over all projects with x* = j, and this measure indicates how much more capital investment would be required to reduce revenue expenditure to the optimum level.

Uncertainty in the cashflow/performance model parameter estimates, reflecting the extent of currently available information about particular components and potential projects, and the extent of technological developments (new materials and techniques), may be propagated through into uncertainty in the gain function, g(.). This would be most easily done using the delta method; see Baker and Scarf (1995) for an example of this in maintenance. The variance of the gain, as well as the expected gain, may then be used to produce the project priority list, and those projects for which the expected gain is high and the uncertainty in the gain (variance of the gain) is low are candidates for release; these projects would be viewed as sound investments. Markowitz (1952) is the classic reference here; for a more recent discussion see Booth and King (1998). Also, a real options approach might be taken (e.g. Bowe and Lee 2004).

Where there are no data regarding a potential project, there will be no objective basis for determining if and where the project lies on the project priority list. One possible approach to this problem would be to use data relating to other projects that are similar in design. Also subjective data may be collected, and used to update component data for the whole network in the manner described in O’Hagan (1994) and Goldstein and O’Hagan (1996) in the context of sewer networks. These methods are particularly useful for multi-component systems in which there are only limited data for a limited number of individual components. On the other hand, it may be that the income cashflow is deterministic in some situations. For example, expansion of the network may be initiated by legislation, and the compensation for the investment costs is fixed and predetermined per customer connection.
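The revenue cost of delaying a project from its optimal release year x* to a later year x′, as defined earlier in this section, can be computed directly from the two cashflow series. The following is a sketch under stated assumptions: the function name and the cashflow data are hypothetical, and the discounting parameter is again treated as a per-year discount factor.

```python
def delay_cost(x_star, x_delayed, h, f0, f1, nu):
    """Extra revenue expenditure from releasing in year x_delayed instead of x_star.

    f0[t]: baseline cashflow in year t (t = 1..h; f0[0] is unused).
    f1[t]: cashflow t years after release.
    nu:    discount factor per year.
    """
    # Baseline running costs incurred during the delay.
    baseline = sum(f0[t] * nu ** t for t in range(x_star, x_delayed))
    # Post-release cashflows under the delayed and the optimal timing.
    late = sum(f1[t] * nu ** (t + x_delayed) for t in range(0, h - x_delayed + 1))
    on_time = sum(f1[t] * nu ** (t + x_star) for t in range(0, h - x_star + 1))
    return baseline + late - on_time


# Hypothetical data: rising baseline cost, flat cost after release.
h = 10
nu = 0.9
f0 = [0.0] + [20.0 + 3.0 * t for t in range(1, h + 1)]
f1 = [10.0] * h

print(round(delay_cost(1, 3, h, f0, f1, nu), 2))
```

Summing this quantity over all delayed projects gives the marginal increase in revenue expenditure attributable to capital rationing.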


12.5 Dynamic Programming Models

Dynamic programming (DP) is a versatile technique for modelling and solving optimization problems which are sequential in nature. Thus, it is ideally suited to solve capital equipment replacement problems that consider whether plant should be kept or replaced after each period. The use of dynamic programming in equipment replacement analysis is significant as the methodology allows for keep/replace decisions to be evaluated after every period. This relaxes the assumption of economic life models that an asset and its replacements are retained for the same length of time over the horizon. Furthermore, dynamic programming allows for general modelling of costs and technological change. A dynamic program evaluates the transition from an initial state to possible eventual states, determining the optimal path of decisions over time. In our application, this entails evaluating an asset and the periodic decisions to keep or replace that asset over some horizon. This requires the definition of a state for the asset. As there are many possibilities for the definition of a state space, a number of methods have been developed. We highlight these different models in the following sections.

12.5.1 Age Based Model

Bellman (1955) introduced the first dynamic programming model to analyze the equipment replacement problem. In this model, the state of the system is defined as the age of the asset and the decision to be evaluated at each stage is whether to keep or replace the asset. Thus, a solution consists of keep and replace decisions in each period of the horizon. The dynamic program can be described by the network in Figure 12.3. Each node in the network represents the age of the asset, which is the state of the system, at the end of the period. The states are labelled according to the age of the asset along the y-axis, increasing from 1 to N, the maximum allowable age of the asset (N = 5 in the figure), at the end of the time period, which is labelled on the x-axis from 0 to T, the horizon time (T = 4 in the figure). The arcs connecting the nodes represent keep and replace decisions. An arc representing a “keep” decision (K) connects a state (age) of n to n+1 in consecutive stages (periods), as the asset ages one period. A “replace” decision (R) connects a state of n to a state of 1, as the n-period old asset is salvaged and a new asset is purchased and used for one period. The initial decision is made at time zero, with n = 4 in the figure, and the asset is salvaged at the end of the horizon.

Define \(f_t(n)\) as the minimum net present value cost of making optimal keep and replace decisions for an asset of age n at time period t through time period T. Mathematically, we evaluate \(f_t(n)\) with the following recursion:

\[ f_t(n) = \min \begin{cases} \text{K}: & \alpha \big( C_{t+1}(n+1) + f_{t+1}(n+1) \big) \\ \text{R}: & P_t - S_t(n) + \alpha \big( C_{t+1}(1) + f_{t+1}(1) \big) \end{cases}, \quad n \le N,\ t \le T-1 \tag{12.11} \]

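[Figure 12.3: network with asset age (1 to N = 5) on the vertical axis and time period (0 to T = 4) on the horizontal axis; arcs are labelled K (keep) and R (replace), starting from age 4 at time 0.]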

Figure 12.3. Dynamic programming network for an age-based model

If the n-period old asset is kept (K), the operating and maintenance (O&M) cost \(C_{t+1}(n+1)\) is incurred for the asset in the following period. As the asset is age n+1 at the end of the period, \(f_{t+1}(n+1)\) defines the costs going forward. (This is why \(f_t\) is often referred to as the “cost to go function” in dynamic programming.) If the asset is replaced (R), then a salvage value \(S_t(n)\) is received and a purchase price \(P_t\) is paid for a new asset. The new asset is utilized for the period as the state transitions to an age of 1, defined by costs \(f_{t+1}(1)\) going forward. If the asset reaches the maximum age of N, then only the replace decision is feasible. When the horizon time T is reached, the asset is salvaged such that

\[ f_T(n) = -S_T(n) \tag{12.12} \]

Traditionally, the recursion is solved backwards, such that Equation 12.12 is evaluated for each feasible age n. These values are substituted into Equation 12.11 when determining \(f_{T-1}(n)\) for each feasible n. This process continues until the value and decision at stage zero (t = 0) are computed, signaling the initial decision in the optimal sequence of decisions over time. Note that O&M costs are paid at the end of the period and thus are discounted by the periodic factor α, along with the ensuing state cost \(f_{t+1}(n)\). Salvage values and purchase costs are assumed to occur at the beginning of the period. As the recursion works from the horizon time T to time zero, the net present value cost is computed.

Example 12.3 Assume a four-period old asset is owned at time zero, its maximum age is 5, and an asset is required to be in service in each of the next four periods (such that the decisions are represented in Figure 12.3). The purchase price is $50,000 with first year O&M costs $10,000, increasing 20% per period of use. The salvage value is expected to decline 30% (from the purchase price) after the first year of use and an additional 10% each year thereafter. For simplicity, we assume no technological change and the interest rate is 12% per period.

Table 12.4. Dynamic programming state values \(f_t(n)\) for the example problem

| t \ n | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | - | - | - | $47,996 | - |
| 1 | $16,332 | - | - | - | $35,602 |
| 2 | −$407 | $6,292 | - | - | - |
| 3 | −$17,411 | −$12,455 | −$7,353 | - | - |
| 4 | −$35,000 | −$31,500 | −$28,350 | −$25,515 | - |

Table 12.4 shows the results of solving the dynamic programming algorithm. The values in the final row (\(f_4(n)\)) are the negative salvage values received for a given asset of age n at that time. To illustrate a calculation at t = 3, consider n = 1. Substituting into Equation 12.11:

\[ f_3(1) = \min \begin{cases} \text{K}: & 0.893\,(\$12{,}000 - \$31{,}500) \\ \text{R}: & \$50{,}000 - \$35{,}000 + 0.893\,(\$10{,}000 - \$35{,}000) \end{cases} = -\$17{,}411 \]
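The backward recursion of Equations 12.11 and 12.12 can be sketched for the data of Example 12.3 as follows. This is a minimal illustration, not a definitive implementation: the function and variable names are hypothetical, costs are assumed not to vary with t (no technological change, as in the example), and values are computed for every age from 1 to N at each stage, including ages that cannot be reached from the initial state.

```python
def solve_age_based(T, N, price, om_cost, salvage, alpha):
    """Backward recursion for the age-based replacement model.

    om_cost(n): O&M cost incurred in the n-th period of an asset's life.
    salvage(n): salvage value of an n-period old asset.
    Returns (f, decision), both keyed by (t, n).
    """
    f, decision = {}, {}
    for n in range(1, N + 1):
        f[T, n] = -salvage(n)  # terminal condition, Equation 12.12
    for t in range(T - 1, -1, -1):
        for n in range(1, N + 1):
            options = {"R": price - salvage(n) + alpha * (om_cost(1) + f[t + 1, 1])}
            if n < N:  # keeping is infeasible at the maximum age
                options["K"] = alpha * (om_cost(n + 1) + f[t + 1, n + 1])
            decision[t, n] = min(options, key=options.get)
            f[t, n] = options[decision[t, n]]
    return f, decision


# Data of Example 12.3.
T, N, start_age = 4, 5, 4
price = 50_000.0
alpha = 1 / 1.12


def om_cost(n):
    return 10_000.0 * 1.2 ** (n - 1)  # rises 20% per period of use


def salvage(n):
    return price * 0.7 * 0.9 ** (n - 1)  # 30% drop, then 10% per year


f, decision = solve_age_based(T, N, price, om_cost, salvage, alpha)

# Trace the optimal sequence of decisions from the initial state.
plan, n = [], start_age
for t in range(T):
    d = decision[t, n]
    plan.append(d)
    n = 1 if d == "R" else n + 1

print(round(f[0, start_age]), plan)
```

The traced plan starts with a replacement at time zero followed by keep decisions, in agreement with the values in Table 12.4.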

The recursion continues in this fashion until $f_0(4)$ is evaluated, with the decision to replace the asset immediately with a new asset. This new asset is retained through the horizon. The net present value cost of this sequence of decisions is $47,996.

The benefit of using this model, in addition to allowing for replacements after each period, is that periodic costs are explicitly modeled on each arc in the network. This allows for detailed cost modelling of technological change, as in Regnier et al. (2004), or of those costs associated with after-tax analysis, as in Hartman and Hartman (2001).

A similar line of models has also been developed in which the condition of the asset, not its age, is tracked (e.g. Derman 1963). As opposed to moving from state to state by increasing the age of the asset, there is some probability that the asset will degrade to a lower condition during a period. The work assuming stochastic deterioration has been extended to include technological change (Hopp and Nair 1994) or to consider probabilistic utilization (Hartman 2001).

12.5.2 Period Based Model

Wagner (1975) offered an alternative dynamic programming formulation for the equipment replacement problem in which the state of the system is the time period and the decision at each period is the length of time to retain an asset. This model is described in the network in Figure 12.4. The nodes represent the state of the system (time period) and the arcs connecting two nodes represent the decision to keep an asset in service between those time periods.

Figure 12.4. Dynamic programming network for a period-based model

The objective is to find the sequence of service lives that minimizes costs from time 0 through time T. (As previously, T = 4 in the figure.) Assuming costs along an arc connecting node t to node t+n are defined as net present value costs at time t, the optimal sequence of decisions can be determined by solving the following recursion:

$$f(t) = \min_{n \le N,\; t+n \le T} \left\{ c_{tn} + \alpha^{n} f(t+n) \right\}, \qquad t = 0, 1, \ldots, T-1 \qquad (12.13)$$

where $c_{tn}$ represents the cost of retaining the asset for n periods from period t. Using our previous notation, $c_{tn}$ is defined as

$$c_{tn} = P_t + \sum_{j=1}^{n} \alpha^{j} C_{t+j}(j) - \alpha^{n} S_{t+n}(n) \qquad (12.14)$$

This model can be solved similarly to the age-based model, assuming that f(T) = 0 is substituted into Equation 12.13. Note that the network in Figure 12.4 assumes that a new asset is purchased at time 0. To include the option to keep or replace an asset owned at time zero, another set of arcs must be drawn, emanating from node 0, representing the length of time to retain the owned asset with its associated costs. As these arcs parallel those illustrated in Figure 12.4, the higher cost parallel arcs can be deleted, as they will not reside on the optimal path. This can be completed in a pre-processing step, with the recursion ensuing as defined.

Example 12.4 Utilizing the same data from Example 12.3, the network in Figure 12.4 represents the options associated with purchasing a new asset in each period. We would add an arc from node 0 to node 1 to represent the decision to retain the four-period old asset for one additional period (to its maximum feasible age of 5). Table 12.5 provides the net present value costs (at time t) on the arcs from node t to node t+n. The arc from node 0 to 1 represents the cost of retaining the four-period old asset for one period, as this is cheaper than salvaging the used asset and purchasing a new asset for one period of use. The values of $c_{02}$, $c_{03}$, and $c_{04}$ include the revenue received for salvaging the four-period old asset at time zero. With the values in Table 12.5, the dynamic programming recursion in Equation 12.13 can be solved.

Table 12.5. Arc costs for Figure 12.4 using the example data

t \ t+n   1          2          3          4
0         –$1,989    $17,868    $33,051    $47,996
1         -          $27,679    $43,383    $58,566
2         -          -          $27,679    $43,383
3         -          -          -          $27,679
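The arc costs of Table 12.5 and the recursion of Equation 12.13 can be reproduced with the following sketch. It assumes stationary costs as in Example 12.3; `new_asset_cost`, `arc_cost` and `solve_period_based` are hypothetical names, and the pre-processing of parallel arcs at node 0 is implemented here simply as taking the cheaper of the keep and replace arcs.

```python
ALPHA = 1 / 1.12
PRICE, MAX_AGE, HORIZON, START_AGE = 50_000, 5, 4, 4


def om_cost(age):
    """O&M cost of the period that ends with the asset at the given age."""
    return 10_000 * 1.2 ** (age - 1)


def salvage(age):
    """Salvage value at the given age."""
    return 0.7 * PRICE * 0.9 ** (age - 1)


def new_asset_cost(n):
    """Equation 12.14 with stationary costs: buy, operate n periods, salvage."""
    om = sum(ALPHA ** j * om_cost(j) for j in range(1, n + 1))
    return PRICE + om - ALPHA ** n * salvage(n)


def arc_cost(t, n):
    """Net present value cost (at time t) of the arc from node t to node t + n."""
    cost = new_asset_cost(n)
    if t > 0:
        return cost
    # At node 0 a new purchase also brings in the salvage of the owned asset.
    cost -= salvage(START_AGE)
    # Parallel arc: keep the owned asset instead, if its age limit allows it.
    if START_AGE + n <= MAX_AGE:
        keep = sum(ALPHA ** j * om_cost(START_AGE + j) for j in range(1, n + 1))
        keep -= ALPHA ** n * salvage(START_AGE + n)
        cost = min(cost, keep)   # drop the more expensive parallel arc
    return cost


def solve_period_based():
    f = {HORIZON: 0.0}           # boundary condition f(T) = 0
    best_n = {}
    for t in range(HORIZON - 1, -1, -1):
        options = {n: arc_cost(t, n) + ALPHA ** n * f[t + n]
                   for n in range(1, min(MAX_AGE, HORIZON - t) + 1)}
        best_n[t] = min(options, key=options.get)
        f[t] = options[best_n[t]]
    return f, best_n


f, best_n = solve_period_based()
print(round(arc_cost(0, 1)), round(f[0]), best_n[0])
```

Here `best_n[0]` is the service life chosen on the first arc; for this data it is the four-period arc, which corresponds to selling the owned asset and keeping the new one through the horizon.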

To illustrate a calculation, note that f(4) = 0, f(3) = $27,679, and consider t = 2. Substituting into Equation 12.13:

$$f(2) = \min \left\{ \begin{array}{l} \$43{,}383 + \$0 \\ \$27{,}679 + 0.893\,(\$27{,}679) \end{array} \right\} = \$43{,}383,$$

indicating that it is cheaper to keep the asset for two periods (from the end of period 2 to the end of the horizon) rather than replacing it after one period of use. Continuing in this manner, it is found that f(0) = $47,996, signaling that the four-period old asset should be sold and the new asset should be retained through the horizon. This is the same solution found with Bellman’s model.

While this model can be shown to be more computationally efficient than the age-based model, it is the ease with which multiple challengers (as parallel arcs) or technological change is modelled that has led to numerous extensions in the literature. See Oakford et al. (1984), Bean et al. (1985, 1994), and Hartman and Rogers (2006).

12.5.3 Cumulative-usage Based Model

Recently, Hartman and Murphy (2006) offered a third dynamic programming formulation for the equipment replacement problem following the form of the classical knapsack model. The model determines the number of times an asset is used for a given length of time over some horizon. The dynamic program is described by the network in Figure 12.5. The y-axis defines the periods, 1 through T, while the x-axis identifies the stage in which retaining an asset for a given length of time, 1 through N, is evaluated. In the figure, the order is ages 4, 3, 2, and then 1. Thus, in the first stage of the dynamic program, the number of times to retain an asset for four consecutive periods is analyzed. (For this small example with T = 4, the asset can only be retained for four periods once.) In the second stage, it is evaluated whether an asset should be retained for three periods. In the third stage, it is evaluated whether an asset should be retained for two periods either once (for two periods of total service) or twice (for four periods of total service).

Figure 12.5. Dynamic programming network for a cumulative usage-based model

A node in the network represents the cumulative service that has been accrued through a given stage. For example, after the first stage in Figure 12.5, either 0 or 4 periods of service have been reached. As the horizon is 4, a solution must ultimately result in 4 periods of service. As with the other dynamic programming models, the goal is to find the minimum cost path from the initial node, representing no service at time zero, to the final node, representing an entire horizon’s worth of service after the final stage. To determine an optimal solution it is assumed that the costs are stationary and the stages (lengths of service) are ordered according to increasing annualized costs. Thus, before the recursion can be solved, the annualized costs of keeping an asset for each possible service life must be computed such that the stages can be ordered accordingly.

Example 12.5 We revisit the previous examples again. From the given costs, the annual equivalent costs are computed as given in Table 12.6. For example, to retain the asset for two years costs $25,670 per year, equivalently, assuming a 12 percent interest rate. The net present value (NPV) costs are also given. We restrict the set of decisions to those of a new asset, namely how many to purchase and how long to retain them over the finite horizon.

Table 12.6. Annual equivalent costs of keeping the asset for up to five years

Age   O&M        SV         AEC        NPV
0     -          $50,000    -          -
1     $10,000    $35,000    $31,000    $27,679
2     $12,000    $31,500    $25,670    $43,383
3     $14,400    $28,350    $24,384    $58,566
4     $17,280    $25,515    $24,202    $73,511
5     $20,736    $22,964    $24,540    $88,462

Given the information in Table 12.6, the stages are ordered according to ages 4, 3, 5, 2, and 1, as the annual equivalent costs increase accordingly. As an asset is only required for four periods, the age 5 cost can be ignored.
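The NPV and AEC columns of Table 12.6, and the resulting stage order, can be checked with a short sketch. It assumes the annual equivalent cost is obtained by multiplying the NPV by the standard capital recovery factor $i(1+i)^n / ((1+i)^n - 1)$; the function names are hypothetical.

```python
RATE = 0.12
ALPHA = 1 / (1 + RATE)
PRICE = 50_000


def om_cost(age):
    """O&M cost of the period that ends with the asset at the given age."""
    return 10_000 * 1.2 ** (age - 1)


def salvage(age):
    """Salvage value at the given age."""
    return 0.7 * PRICE * 0.9 ** (age - 1)


def npv_cost(n):
    """NPV of buying a new asset, operating it n periods and salvaging it."""
    om = sum(ALPHA ** j * om_cost(j) for j in range(1, n + 1))
    return PRICE + om - ALPHA ** n * salvage(n)


def annual_equivalent(n):
    """Spread the NPV over n periods with the capital recovery factor."""
    growth = (1 + RATE) ** n
    return npv_cost(n) * RATE * growth / (growth - 1)


for n in range(1, 6):
    print(n, round(npv_cost(n)), round(annual_equivalent(n)))

# Stages ordered by increasing annual equivalent cost.
stage_order = sorted(range(1, 6), key=annual_equivalent)
print(stage_order)
```

The service life with the smallest annual equivalent cost is the economic life, which is age 4 for this data.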


According to Figure 12.5, an asset can be retained a maximum of one time for four years, at a cost of $73,511. Thus, the states in the first stage and their values are

$f_1(0) = 0$, $f_1(4) = \$73{,}511$. Similar reasoning defines $f_2(0) = 0$, $f_2(3) = \$58{,}566$, and $f_2(4) = \$73{,}511$. For the third stage, the decisions are more interesting because an asset can be retained for two years twice in the sequence. Thus

$$f_3(0) = 0, \quad f_3(2) = \$43{,}383, \quad f_3(3) = \$58{,}566,$$

$$f_3(4) = \min\left\{ \$73{,}511,\; \$43{,}383 + (0.893)^2\,\$43{,}383 \right\} = \$73{,}511.$$
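The stage values above, and the final stage described next, can be reproduced with a knapsack-style sketch. The recursion form used here is an assumption inferred from the stage calculations shown (an asset started after s periods of accumulated service has its NPV discounted by $\alpha^s$); it is not claimed to be the exact formulation of Hartman and Murphy (2006). The NPV values are the rounded figures from Table 12.6.

```python
ALPHA = 1 / 1.12
HORIZON = 4
NPV = {1: 27_679, 2: 43_383, 3: 58_566, 4: 73_511}   # Table 12.6 (rounded)
STAGES = [4, 3, 2, 1]    # service lives in order of increasing annual cost


def solve_cumulative_usage():
    """Return the stage values; stage_values[k][s] is the value after stage k + 1."""
    inf = float("inf")
    current = [0.0] + [inf] * HORIZON    # before any stage only s = 0 is reachable
    stage_values = []
    for life in STAGES:
        nxt = current[:]                 # option: do not use this service life
        for s in range(life, HORIZON + 1):
            # Use the service life once more, starting after s - life periods.
            candidate = nxt[s - life] + ALPHA ** (s - life) * NPV[life]
            nxt[s] = min(nxt[s], candidate)
        stage_values.append(nxt)
        current = nxt
    return stage_values


stages = solve_cumulative_usage()
print(stages[2][4])    # third stage, four periods of cumulative service
print(stages[-1][4])   # value after the final stage
```

Because `nxt[s - life]` may already include earlier uses of the same service life, a single pass over s covers using that life once, twice, and so on within the stage.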

The final stage evaluates using assets for a single period with previous combinations (three-period and two-period aged assets). It can be shown that the optimal decision is to retain the asset for all four periods at a net present value cost of $73,511. Note that this is the same decision found with the two previous formulations, as $73,511 less the salvage value of the four-period old asset ($25,515) is $47,996.

This recursion was not developed in order to provide another computational approach to the equipment replacement problem. Rather, it was developed to illustrate the relationship between the infinite and finite horizon solutions under stationary costs. Specifically, as the optimal solution to the infinite horizon problem is to repeatedly replace an asset at its economic life (the age which minimizes equivalent annualized costs), the question being investigated was whether the solution (replacing at the economic life) translates to the finite horizon case. It was shown that using the infinite horizon solution provides a good answer when O&M costs increase over the life of an asset more drastically than salvage values decline. In the case when the salvage value declines are more drastic than the O&M cost increases, it is generally better to retain the final asset in the sequence for a period longer than the economic life of the asset. For the cases when O&M cost increases and salvage value declines are similar, it is beneficial to solve a dynamic programming recursion to find the optimal policy.

12.5.4 Infinite Horizon Considerations

The solution of a dynamic programming algorithm assumes that the horizon is finite. In the case of an infinite horizon in which an asset is expected to remain in service indefinitely, it may be possible to identify an optimal time zero decision. Bean et al. (1985) show that if the time zero decision for an equipment replacement problem does not change for N consecutive horizons, where N is the maximum age of an asset, then the decision is optimal for any length of horizon, including an infinite horizon. Unfortunately, this does not guarantee the existence of an optimal time zero decision. For the age or period based dynamic programming recursions, the models must be solved over T, T+1, T+2, …, T+N horizons. If the time zero decision does not change for these problems, then the optimal time-zero decision is found. If this is not the case, the progression must continue until N consecutive time zero decisions are identified. This may be more easily facilitated using a forward recursion. In the period-based model, this requires defining f(t) as a function of f(t–1), f(t–2), etc., with f(0) = 0. We illustrate by revisiting Example 12.4.

Example 12.6 We illustrate the first few stages of the forward recursion, as its implementation is better suited for infinite horizon analysis. As noted earlier, the recursion is initialized with f(0) = 0. Stepping forward in time, it is assumed that T = 1. Using the values from Table 12.5, it is clear that the only feasible decision is to retain the four-period old asset for one period such that f(1) = –$1,989. For the second stage, there are two feasible decisions to evaluate, such that

$$f(2) = \min \left\{ \begin{array}{l} 0.893\,(\$27{,}679) - \$1{,}989 \\ \$17{,}868 + \$0 \end{array} \right\} = \$17{,}868.$$

The first decision evaluates using the new asset for one period, assuming (from stage 1) that the four-period old asset is retained for one period. The second decision assumes the four-period old asset is retired immediately and a new asset is used for two periods. This process moves forward in time, increasing the value of T in each step. The process stops when, in this case, five consecutive solutions (with increasing T) result in the same time zero decision.

12.5.5 Modeling Complex Systems

The presented dynamic programming algorithms are designed for single asset systems. More complex systems are obviously defined by multiple assets which are not independent, otherwise the presented models would be sufficient. The most straightforward case is where all assets operate in parallel, such as a fleet. Jones et al. (1991) offered the first dynamic programming recursion for the parallel machine replacement model, which can be used to analyze fleet replacement decisions. Machines are assumed to operate in parallel and thus the capacity of the system is equal to the sum of the individual asset capacities. In addition to defining the capacity of the system, the assets are often linked economically. Jones et al. focused on the assumption that a fixed cost would be charged in any period in which a replacement occurs (in addition to the typical per unit charges for each asset replaced). This provides an incentive to replace multiple assets together over time so as to reduce the number of times the fixed charge is incurred over some horizon.

To model replacement decisions for this system, the state of the system is defined as the number of assets aged 1 through N, represented as a vector, $[m_1, m_2, \ldots, m_N]$. In general, this would seem to be an intractable model, as the number of feasible combinations of replacements to evaluate in each period is exponential: one could replace any combination of the $m_1 + m_2 + \cdots + m_N$ assets. However, Jones et al. established two key theorems that drastically reduce the computational complexity. First, it was shown that clusters, or assets of the same age in the same time period, do not split. That is, a group of same-aged assets is kept or replaced in its entirety at the end of each period. Second, under some mild cost assumptions, they showed that older clusters of assets are replaced before younger clusters. With these two theorems, the number of possible replacements in a given period is drastically reduced. In each period, either the oldest cluster is replaced, or the two oldest clusters are replaced, etc., for a given state of the system.

Consider the network in Figure 12.6. At time zero, a system is defined by six assets: three of age one, two of age two, and one of age three, and the maximum feasible age of an asset is three. Replacement decisions, assuming the no-splitting rule and oldest cluster replacement rule, are illustrated for three periods. Note that the maximum number of decisions for a given state is N = 3. Determining the optimal sequence of replacement decisions for the system is similar to our previous dynamic programming recursions. A value is assigned to $f_3(S)$, where S refers to the state vector. This would merely be the sum of the salvage values for each asset owned at time T = 3. Then, the value of each state S at time 2 would be determined by summing the costs of the decision and discounting the value of the resulting state. For example, moving from state [3,0,3] to state [3,3,0] would entail selling the three three-period old assets and purchasing three new assets.
The new assets would be utilized for one period (incurring O&M costs) while the three one-period old assets (at the end of time period two) would be utilized for a second year, also incurring O&M costs. These costs (discounted accordingly) would be added to the discounted value of f3(3,3,0). This value would be compared to the decision of replacing all six assets (leading to state [6,0,0]) to determine the value of state f2(3,0,3).
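The effect of the two theorems on the decision space can be illustrated with a small sketch that enumerates the states reachable in one period under the no-splitting and oldest-cluster-first rules. The function `successors` and the tuple representation of the state are hypothetical choices made here for illustration; costs and the fixed charge are deliberately left out.

```python
def successors(state):
    """States reachable in one period under the two replacement rules.

    state[j - 1] is the number of assets of age j, and the maximum feasible
    age is len(state). All assets of age n and older are replaced together.
    """
    max_age = len(state)
    reachable = []
    # n is the youngest age replaced; n = max_age + 1 means nothing is replaced.
    for n in range(1, max_age + 2):
        kept = list(state[: n - 1])
        replaced = sum(state[n - 1:])
        if len(kept) == max_age:
            if kept[-1] > 0:
                continue          # assets at the maximum age must be replaced
            kept = kept[:-1]      # the empty oldest cluster simply disappears
        # New assets enter at age 1; every kept cluster moves up one age class.
        nxt = [replaced] + kept
        nxt = tuple(nxt + [0] * (max_age - len(nxt)))
        if nxt not in reachable:  # replacing an empty cluster adds nothing new
            reachable.append(nxt)
    return reachable


start = (3, 2, 1)                 # initial state of Figure 12.6
print(successors(start))
print(2 ** sum(start))            # unrestricted subsets of six assets, for contrast
```

For the initial state the sketch yields three successor states, compared with 64 subsets of assets that would have to be considered without the two rules.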

Figure 12.6. Dynamic programming network for parallel-machine replacement problem

Define n as the minimum age of the assets to be replaced for a given state at time t. That is, all assets of age n and older are replaced while the remaining assets are retained. We can model the recursion in general as follows:

$$f_t(m_1, m_2, \ldots, m_N) = \min_{n} \Biggl\{ K_t \cdot \mathbf{1}_{n>1} + \Bigl( \sum_{j=n}^{N} m_j \Bigr) P_t - \sum_{j=n}^{N} m_j S_t(j) + \alpha \Bigl[ \Bigl( \sum_{j=n}^{N} m_j \Bigr) C_{t+1}(1) + \sum_{j=1}^{n-1} m_j C_{t+1}(j+1) + f_{t+1}\bigl( m_n + m_{n+1} + \cdots + m_N,\; m_1, m_2, \ldots, m_{n-1}, 0, 0, \ldots, 0 \bigr) \Bigr] \Biggr\} \qquad (12.15)$$

Examining the recursion, a purchase price is paid and a salvage value is received for all assets that are replaced. All of the newly purchased assets (the total number of assets is the sum of mn+mn+1+…+ mN) incur the O&M cost of a new asset while the O&M costs of the retained assets are incurred according to their age. A fixed charge Kt is paid if at least one group of assets is replaced (n>1), captured by the indicator function. The resulting state is a group of new assets (age 1) with all other assets incrementing one period in age. A number of extensions to this model have been published in the literature, although many utilize integer programming modeling techniques to deal with the large-state space. Chand et al. (2000) focus on the use of dynamic programming and include capacity expansion decisions with the replacement decisions. Unfortunately, capital budgeting constraints greatly complicate the problem as it cannot be assumed that groups of assets must be kept or replaced together. While the theorems presented in Jones et al. (1991) greatly reduce the computational difficulties of solving the dynamic program for the parallel replacement problem, it should be clear that using dynamic programming to address replacement decisions for more complex systems may be difficult due to computational complexities that arise due to the number of combinations of replacement alternatives. (See Hartman and Ban (2002) and the references therein for a discussion of these issues.) Consider a more complex system in which a number of machines are used in series (such as a production line) and there are a number of lines in parallel, such as the one given in Figure 12.7. The lines are labeled 1, 2 and 3 while the machines are labeled a, b, c, and d.

Figure 12.7. System with assets in series and lines in parallel


The capacity of a line is now defined by the machine in the line with the minimum capacity. However, the capacity of the system is raised due to the parallel design. The capacity of the system is defined by the sum of the capacity of each line. Therefore, it is defined by the sum of the minimum capacity asset in each line. Reliability is measured similarly to capacity, in that it is reduced by the series structure but increased with parallel (redundant) structure. For a given series, the reliability of the line is equal to the product of the reliability of each individual asset. That is because if one asset is down, the line is down. The reliability of the system, assuming only one line must be up and running, is increased as the system is operating even if two of the three lines are down. If one defines minimum system capacity or reliability constraints, these can be incorporated into a dynamic programming recursion that evaluates the possibility of replacing any combination of assets in each period over some horizon. Presumably, newer assets would have higher capacity or reliability, either due to technological change or due to the fact that they are new (and have not deteriorated), and thus would increase the respective capacity or reliability of the system (in order to meet the defined constraints). The difficulty with using a dynamic programming recursion to evaluate these decisions is not in capturing the capacity or reliability constraints. Rather, the difficulty is in the exponential growth in the number of possible combinations of replacements in each period. Consider the 12 assets shown in Figure 12.7. In the most general problem, each asset and each combination of assets can be replaced in each period, totaling $2^{12} = 4{,}096$ combinations each period for each state of the system. This system could easily become more complicated, merely by defining a, b, c, and d as processes, each of which may have a number of assets in parallel (or in series).
In the parallel machine replacement problem, a similar problem was encountered, but the number of possible decisions was reduced to N (the maximum allowable age for an asset) for each possible state in each period with the two theorems introduced by Jones et al., without sacrificing optimality. Unfortunately, the interaction of the assets may prohibit the application of these theorems to other systems. In fact, defining the state of the system is not entirely clear. For the system described in Figure 12.7, we could define the system as a matrix of asset ages. Each row would be defined by the age of each machine in a given line, with a row defined for each line. If an asset is replaced, then the age would translate to 1 in the next stage while it would merely increment 1 period if the machine is retained. This modeling approach could be expanded to the case of multiple machines in a given process – by expanding the size of the matrix. Again, the difficulty would be in restricting the number of decisions to evaluate for each state in a given period. Following the approach of Jones et al. (1991), older assets would be replaced first (and even further restricted to have to be above a certain age for consideration) and similarly aged assets of the same type would be replaced in the same time period. Another approach would be to only consider replacing assets that increase the system capacity or reliability. Thus, replacements could be examined in the order of either increasing capacity or increasing reliability. Whether these heuristic approaches provide a good solution for a given problem instance would require extensive numerical testing.
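The capacity and reliability definitions used above for the series-parallel system can be written down directly. The sketch below assumes independent asset failures and that a single working line keeps the system up; the capacities and reliabilities are made-up illustrative numbers, not data from the chapter, and the function names are hypothetical.

```python
from math import prod


def system_capacity(capacities):
    """Sum over lines of the minimum asset capacity in each line."""
    return sum(min(line) for line in capacities)


def system_reliability(reliabilities):
    """Reliability when at least one line must be up (independence assumed)."""
    line_reliability = [prod(line) for line in reliabilities]   # series structure
    all_lines_down = prod(1 - r for r in line_reliability)      # parallel structure
    return 1 - all_lines_down


# Three lines (1, 2, 3), each with machines a, b, c, d in series (hypothetical data).
capacities = [[10, 8, 12, 9], [7, 11, 10, 10], [9, 9, 9, 6]]
reliabilities = [[0.9] * 4 for _ in range(3)]

print(system_capacity(capacities))
print(system_reliability(reliabilities))
print(2 ** 12)   # replacement combinations per state for 12 assets
```

Either function could serve as a feasibility check on a candidate replacement decision; the difficulty noted in the text is that the number of candidates to check grows exponentially.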


12.6 Discussion and Further Topics for Research

In this chapter we have reviewed both economic life and dynamic programming models to address capital replacement problems which can arise in various settings, including manufacturing, transportation, and utility industries. It should be clear that the trend in developing solutions to these problems has migrated from single assets to complex systems. While good solutions exist for systems with homogeneous assets in parallel, computational difficulties exist for those with inhomogeneous assets both in series and parallel, and opportunities exist to develop optimal solution methods with advanced computational techniques or good solution rules developed from simpler, tractable models.

In addition to the investigation of systems with multiple assets, further savings can be achieved by considering operational and replacement decisions simultaneously (Hartman 1999, 2004). It should be clear that the usage of an asset over time impacts its replacement schedule. In the context of multiple assets, it may be possible to allocate usage to assets over time, thereby influencing replacement schedules. Thus, in order to minimize total system costs over time, both replacement and operating decisions should be considered simultaneously. There are numerous application areas where this analysis is warranted, including transportation networks, water distribution networks, and production systems.

A final area of future research must center on technological change. While assets are often replaced due to deterioration, newer assets are often purchased because they are technologically advanced, providing similar capabilities at lower cost or additional capabilities for additional revenue. Numerous studies focus on the continuous evolution of technological change. However, more detailed research must focus on appropriate models for various applications, as it is clear that technology advances in different ways for different industrial sectors.

12.7 References

Apeland, S. and Scarf, P.A. (2003) A fully subjective approach to capital equipment replacement. Journal of the Operational Research Society 54, 371–378.
Arnold, G. (2006) Essentials of Corporate Financial Management. Pearson, London.
Baker, R.D. and Scarf, P.A. (1995) Can models fitted to small data samples lead to maintenance policies with near-optimal cost? IMA Journal of Mathematics Applied in Business and Industry 6, 3–12.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1985) A dynamic infinite horizon replacement economy decision model. The Engineering Economist 30, 99–120.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1994) Equipment replacement under technological change. Naval Research Logistics 41, 117–128.
Bellman, R.E. (1955) Equipment replacement policy. Journal of the Society for the Industrial Applications of Mathematics 3, 133–136.
Booth, P. and King, P. (1998) The relationship between finance and actuarial science. In Hand, D.J., Jacka, S.D. (Eds), Statistics in Finance, Arnold, London, pp. 7–40.
Bowe, M. and Lee, D.L. (2004) Project evaluation in the presence of multiple embedded real options: evidence from the Taiwan High-Speed Rail Project. Journal of Asian Economics 15, 71–98.
Brint, A.T., Hodgkins, W.R., Rigler, D.M. and Smith, S.A. (1998) Evaluating strategies for reliable distribution. IEEE Comput.Applns. in Power 11, 43–47.
Chand, S., McClurg, T. and Ward, J. (2000) A model for parallel machine replacement with capacity expansion. European Journal of Operational Research 121, 519–531.
Christer, A.H. (1984) Operational research applied to industrial maintenance and replacement. In Eglese, R.W. and Rand, G.K. (Eds) Developments in Operational Research (pp. 31–58). Pergamon Press, Oxford.
Christer, A.H. (1988) Determining economic replacement ages of equipment incorporating technological developments. In Rand, G.K. (Eds) Operational Research ’87 (pp. 343–354). Elsevier, Amsterdam.
Christer, A.H. and Scarf, P.A. (1994) A robust replacement model with applications to medical equipment. J.Opl.Res.Soc. 45, 261–275.
Derman, C. (1963) Inspection-maintenance-replacement schedules under Markovian deterioration. In Mathematical Optimization Techniques, University of California Press, Berkeley, CA, pp. 201–210.
Dixit, A.K. and Pindyck, R.S. (1994) Investment Under Uncertainty. Princeton University Press, New Jersey.
Eilon, S., King, J.R. and Hutchinson, D.E. (1966) A study in equipment replacement. Opl.Res.Quart. 17, 59–71.
Elton, D.J. and Gruber, M.J. (1976) On the optimality of an equal life policy for equipment subject to technological change. Opl.Res.Quart. 22, 93–99.
Goldstein, M. and O’Hagan, A. (1996) Bayes linear sufficiency and systems of expert posterior assessments. Journal of the Royal Statistical Society Series B 58, 301–316.
Hartman, J.C. (1999) A General Procedure for Incorporating Asset Utilization Decisions into Replacement Analysis. Eng. Econ. 44(3), 217–238.
Hartman, J.C. (2001) An Economic Replacement Model with Probabilistic Asset Utilization. IIE Transactions 33, 717–729.
Hartman, J.C. (2004) Multiple asset replacement analysis under variable utilization and stochastic demand. European Journal of Operational Research 59, 145–165.
Hartman, J.C. and Ban, J. (2002) The series-parallel replacement problem. Robotics and Computer Integrated Manufacturing 18, 215–221.
Hartman, J.C. and Hartman, R.V. (2001) After-Tax Replacement Analysis. The Engineering Economist 46, 181–204.
Hartman, J.C. and Murphy, A. (2006) Finite Horizon Equipment Replacement Analysis. IIE Transactions 38, 409–419.
Hartman, J.C. and Rogers, J.L. (2006) Dynamic Programming Approaches for Equipment Replacement Problems with Continuous and Discontinuous Technological Change. IMA Journal of Management Mathematics 17, 143–158.
Hopp, W.J. and Nair, S.K. (1991) Timing replacement decisions under discontinuous technological change. Naval Research Logistics 38, 203–220.
Hopp, W.J. and Nair, S.K. (1994) Markovian deterioration and technological change. IIE Transactions 26, 74–82.
Jones, P.C., Zydiak, J.L. and Hopp, W.J. (1991) Parallel machine replacement. Naval Research Logistics 38, 351–365.
Karabakal, N., Lohmann, J.R. and Bean, J.C. (1994) Parallel replacement under capital rationing constraints. Management Science 40, 305–319.
Kobbacy, K. and Nicol, D. (1994) Sensitivity of rent replacement models. Int.J.Prod.Econ. 36, 267–279.
Markowitz, H.M. (1952) Portfolio selection. Journal of Finance 7, 77–91.
Northcott, D. (1985) Capital Investment Decision Making. Dryden Press, London.
Oakford, R.V., Lohmann, J.R. and Salazar, A. (1984) A dynamic replacement economy decision model. IIE Transactions 16, 65–72.
O’Hagan, A. (1994) Robust modelling for asset management. Journal of Statistical Planning and Inference 40, 245–259.
Regnier, E., Sharp, G. and Tovey, C. (2004) Replacement under ongoing technological progress. IIE Transactions 36, 497–508.
Russell, J.C. (1982) Vehicle replacement: a case study in adapting a standard approach for a large organisation. Journal of the Operational Research Society 33, 899–911.
Santhanam, R. and Kyparisis, G.J. (1996) A decision model for interdependent information system project selection. European Journal of Operational Research 89, 380–399.
Scarf, P.A. (1994) Optimal buying, running and selling policy for the private motorist: an application of capital replacement modelling. IMA Bulletin 30, 181–186.
Scarf, P.A. and Christer, A.H. (1997) Applications of capital replacement models with finite planning horizons. International Journal of Technology Management 13, 25–36.
Scarf, P.A. and Hashem, M. (1997) On the application of an economic life model with a fixed planning horizon. International Transactions in Operations Research 4, 139–150.
Scarf, P.A. and Hashem, H. (2003) Characterization of optimal policies for capital replacement models. IMA Journal of Management Mathematics 13, 261–271.
Scarf, P.A. and Martin, H. (2001) A framework for maintenance and replacement of a network structured system. Int. J. Prod.Econ. 69, 287–296.
Scarf, P.A., Dwight, R., McCusker, A. and Chan, A. (2006) Asset replacement for an urban railway using a modified two-cycle replacement model. Journal of the Operational Research Society (doi: 10.1057/palgrave.jors.2602288).
Suzuki, Y. and Pautsch, G.R. (2005) A vehicle replacement policy for motor carriers in an unsteady economy. Transportation Research Part A: Policy and Practice 39, 463–480.
Wagner, H.M. (1975) Principles of Operations Research. Prentice-Hall Inc., Englewood Cliffs, NJ.
Wang, H. (2002) A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139, 469–489.

13 Maintenance and Production: A Review of Planning Models

Gabriella Budai, Rommert Dekker and Robin P. Nicolai

13.1 Introduction Maintenance is the set of activities carried out to keep a system into a condition where it can perform its function. Quite often these systems are production systems where the outputs are products and/or services. Some maintenance can be done during production and some can be done during regular production stops in evenings, weekends and on holidays. However, in many cases production units need to be shut down for maintenance. This may lead to tension between the production and maintenance department of a company. On one hand the production department needs maintenance for the long-term well-being of its equipment, on the other hand it leads to shutting down the operations and loss of production. It will be clear that both can benefit from decision support based on mathematical models. In this chapter we give an overview of mathematical models that consider the relation between maintenance and production. The relation exists in several ways. First of all, when planning maintenance one needs to take production into account. Second, maintenance can also be seen as a production process which needs to be planned and finally one can develop integrated models for maintenance and production. Apart from giving a general overview of models we will also discuss some sectors in which the interactions between maintenance and production have been studied. Many review articles have been written on maintenance, e.g. Cho and Parlar (1991), but to our knowledge only one on the combination between maintenance and production, Ben-Daya and Rahim (2001). This review differs from that in several aspects. First of all, we also consider models which take production restrictions into account, rather than integrated models. Second we discuss some specific sectors. Finally, we discuss the more recent articles since that review. Maintenance is related to production in several ways. 
First of all, maintenance is intended to allow production, yet to execute maintenance production often has to be stopped. This negative effect therefore has to be considered in maintenance planning and optimization. It arises specifically in the costing of downtime and in opportunity maintenance. All articles that explicitly take the effect of production on maintenance into account fall into this category. Second, maintenance can also be seen as a production process which needs to be planned. Planning in this respect implies determining appropriate levels of capacity (e.g. manpower) relative to the demand. Third, we are concerned with production planning in which one needs to take maintenance jobs into account. The point is that maintenance jobs take production capacity away and hence need to be planned together with production. Maintenance has to be done either because of a failure or because the quality of the produced items is not high enough. In this third category we also consider the integrated planning of production and maintenance.

The relation between maintenance and production is also determined by the business sector. We consider the following sectors: railway, road, airline and electric power system maintenance.

The outline of the rest of this chapter is as follows. In Section 13.2 we present an overview of the main elements of maintenance planning, as these are essential to understand the rest of this chapter. Following our classification scheme, in Section 13.3 we review articles in which maintenance is modelled explicitly and where the needs of production are taken into account. Since these needs differ between business sectors, we discuss in Section 13.4 the relation between production and maintenance for some specific business sectors. In Section 13.5 we consider the second category in our classification scheme: maintenance as a production process which needs to be planned. In Section 13.6 we are concerned with production planning in which one needs to take maintenance jobs into account (integrated production and maintenance planning).
Trends and open research areas are discussed in Section 13.7 and, finally, conclusions are drawn in Section 13.8.

13.2 Maintenance Planning and Optimization: An Overview

In maintenance several important decisions have to be made. We distinguish between (i) the long-term strategy and maintenance concept, (ii) medium-term planning, (iii) short-term scheduling and finally (iv) control and performance indicators.

Major strategic decisions concerning maintenance are made in the design process of systems. What type of maintenance is appropriate and when should it be done? This is laid down in the so-called maintenance concept. Many optimization models address this problem and the relation with production is implicit in some of them. Another important strategic problem is the organization of the maintenance department. Is maintenance done by production personnel (in the way total productive maintenance prescribes) or are there dedicated maintenance personnel? Furthermore, questions such as “Where is it located?” and “Are specific types of work outsourced?” should be answered. Although these are important topics, they are more the concern of industrial organization than the topic of mathematical models. Further important strategic issues concern how a system can be maintained, whether specific expertise or equipment is needed, whether one can easily reach
the subsystems, what information is available and what elements can be easily replaced. These are typical maintainability aspects, but they have little to do with production.

In the tactical phase, usually between a month and a year, one plans the major maintenance or upgrade of major units, and this has to be done in cooperation with the production department. Accordingly, specific decision support is needed in this respect. Another tactical problem concerns the capacity of the maintenance crew. Is there enough manpower to carry out the preventive maintenance program? These questions can be addressed by the use of models, as will be indicated later on.

In the short-term scheduling phase one determines the moment and order of execution, given an amount of outstanding corrective or preventive work. This is typically the domain of work scheduling, where extensive model-based support can be given.

We next consider another important aspect of maintenance: the type of maintenance. A typical distinction is made between corrective and preventive maintenance work. The first is carried out after a failure, which is defined as the event by which a system stops functioning in a prescribed way. Preventive work, however, is carried out to prevent failures. Although this distinction is often made, we would like to remark that the difference is not as clear as it may seem. This is due to the definition of failure. An item may be in a bad state while still functioning, and one may or may not consider this a failure. In any case, an important distinction between the two is that corrective maintenance is usually not plannable, whereas preventive maintenance typically is. The execution of maintenance can also be triggered by condition measurements, and then we speak of condition-based maintenance. This has often been advocated as more effective and efficient than time-based preventive maintenance.
Yet it is very hard to predict failures well in advance, and hence condition-based maintenance is often unplannable. Instead of time-based maintenance one can also base preventive maintenance on utilisation (run hours, mileage), as these are more appropriate indicators of wear-out. Finally, one may also have inspections, which can be done by sight or with instruments and often do not affect operation. They do not improve the state of a system, however, but only the information about it. This can be important in case machines start producing items of bad quality. There are inspection-quality problems where inspection optimization is connected to quality control.

Another distinction concerns the amount of work. Often there are small jobs, grouped into maintenance packages. They may start with inspection and cleaning, followed by some improvement actions such as lubricating and/or replacing some parts. These are typically part of the preventive maintenance program attached to a system and have to be done on a repetitive basis (monthly, quarterly, yearly or every two years). Next, one has replacements of parts or subsystems, and overhauls or refurbishments where a substantial system is improved. The latter are planned well in advance and carried out as projects with individual (or separate) budgets.

A traditional optimization problem has been the choice and trade-off between preventive and corrective maintenance. The typical motivation is that preventive maintenance is cheaper than corrective maintenance. Maintenance costs are usually due to man-hours, materials and indirect costs. The difference between corrective and preventive maintenance costs lies especially in the last category. Indirect costs represent loss of production and environmental damage or safety consequences. Costing these consequences can be a difficult problem and is tackled in Section 13.3.1. It will also be clear that preventive maintenance should be done when production is least affected. This can be done using opportunities, which has given rise to a specific class of models dealt with in a separate section (Section 13.3.2).
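
The trade-off between preventive and corrective maintenance can be made concrete with the classical age-replacement model. This is a standard textbook formulation and is not taken from any of the articles reviewed here. The sketch below assumes a Weibull lifetime and two hypothetical cost parameters, c_p (a preventive replacement) and c_f (a corrective replacement, including downtime consequences). It evaluates the long-run cost rate g(T) = [c_p R(T) + c_f (1 - R(T))] / ∫_0^T R(t) dt, where R is the survival function and T the preventive replacement age, and searches a grid of candidate ages.

```python
import math

def weibull_reliability(t, shape, scale):
    """Survival probability R(t) of a Weibull lifetime (assumed distribution)."""
    return math.exp(-((t / scale) ** shape))

def cost_rate(T, c_p, c_f, shape, scale, steps=2000):
    """Long-run cost per unit time when replacing preventively at age T."""
    # Expected cycle length = integral of R(t) over [0, T], trapezoidal rule.
    h = T / steps
    integral = 0.0
    for i in range(steps):
        a = weibull_reliability(i * h, shape, scale)
        b = weibull_reliability((i + 1) * h, shape, scale)
        integral += 0.5 * (a + b) * h
    r_T = weibull_reliability(T, shape, scale)
    # Expected cycle cost: preventive if the unit survives to T, else corrective.
    expected_cost = c_p * r_T + c_f * (1.0 - r_T)
    return expected_cost / integral

def best_age(c_p, c_f, shape, scale, grid):
    """Grid search for the replacement age with the lowest cost rate."""
    return min(grid, key=lambda T: cost_rate(T, c_p, c_f, shape, scale))

if __name__ == "__main__":
    grid = [0.1 * k for k in range(1, 51)]  # hypothetical candidate ages
    T_star = best_age(c_p=1.0, c_f=10.0, shape=2.0, scale=1.0, grid=grid)
    print(T_star, cost_rate(T_star, 1.0, 10.0, 2.0, 1.0))
```

With an increasing failure rate (shape above 1) and c_f well above c_p, the search returns an interior age; with a constant failure rate (shape equal to 1) preventive replacement never pays and the largest candidate age is returned.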

13.3 When to do Maintenance in Relation to Production

In this section we discuss articles in which maintenance (planning or scheduling) is modelled explicitly and the needs of production are taken into account. Production, however, is usually not modelled as such, but is taken into account in the form of constraints or requirements. Alternatively, the effect of maintenance on varying production scenarios may be considered. Following this reasoning we arrive at three streams of research. The first stream assesses the costs of downtime, which are important in the planning of maintenance. The second stream deals with studies that try to schedule maintenance work at moments when units are not needed for production (opportunities). The last stream comprises articles which schedule maintenance in line with production. Each stream is dealt with in a separate section.

13.3.1 Costing of Downtime

Assessing the costs of downtime is an important step in the determination of the costs of preventive and corrective maintenance. Although most optimization results show that exact values are not necessary, it is important to assess these values with reasonable accuracy. It is easier to determine downtime costs for preventive maintenance than for corrective maintenance, as failures may have many unforeseen consequences. Yet even for preventive maintenance the assessment can be difficult, e.g. in the case of highway shutdowns or railway stoppages. Another problem to be tackled is the system-unit relation. A system can be a complex configuration of different units, which may imply that downtime of one unit does not necessarily halt the full system. Accordingly, an assessment of the consequences of unit downtime on system performance has to be made. This is especially a problem for k-out-of-n systems or even more general configurations. Several articles deal with this issue. Some give an overall model, others describe a detailed case.
Geraerds (1985) gives an outline of a general structuring to determine downtime costs. In Dekker and Van Rijn (1996) a downtime model is described for k-out-of-n systems used on the oil production platforms. Edwards et al. (2002) give a detailed model for the costs of equipment downtime in open-pit mining. They use regression models based on historical data. Knights et al. (2005) present a model to assist maintenance managers in evaluating the economic benefits of maintenance improvement projects.
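
The system-unit relation discussed in this section can be illustrated for a k-out-of-n system. The following is a minimal sketch under simplifying assumptions of our own, not the model of Dekker and Van Rijn (1996): the n units are identical and independent, each is available with probability a, and the system produces only while at least k units are up. The function names and the parameters hours and loss_per_hour are hypothetical.

```python
from math import comb

def prob_at_least(k, n, a):
    """P(at least k of n independent units are up), each up with probability a."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, j) * a**j * (1 - a) ** (n - j) for j in range(k, n + 1))

def downtime_cost_of_unit_outage(k, n, a, hours, loss_per_hour):
    """Expected extra production loss when one unit is taken out for `hours`.

    While the unit is out, the system works only if at least k of the
    remaining n - 1 units are up.
    """
    p_down_normal = 1.0 - prob_at_least(k, n, a)
    p_down_outage = 1.0 - prob_at_least(k, n - 1, a)
    return (p_down_outage - p_down_normal) * hours * loss_per_hour

if __name__ == "__main__":
    # Hypothetical numbers: an 8-hour outage, 100 cost units lost per system-down hour.
    print(downtime_cost_of_unit_outage(k=2, n=3, a=0.9, hours=8, loss_per_hour=100))
    print(downtime_cost_of_unit_outage(k=3, n=3, a=0.9, hours=8, loss_per_hour=100))
```

Comparing the 2-out-of-3 and 3-out-of-3 cases shows how redundancy lowers the downtime cost that should be attributed to maintaining a single unit.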

13.3.2 Opportunity Maintenance

Opportunity maintenance is maintenance that is carried out at an opportune moment, i.e. a moment at which the units to be maintained are less needed for their function than normally. We speak of opportunities if these events occur occasionally and if they are difficult to predict in advance. There can be several reasons for a maintenance opportunity:

• Failure and hence repairs of other units/components. The failure of one component is often an opportunity to preventively maintain other components. Especially if the failure causes the breakdown of the production system, it is favourable to perform preventive maintenance on other components. After all, little or no production is lost beyond that resulting from the original failure. An example is given in Van der Duyn Schouten et al. (1998), who consider the replacement of traffic lights at an intersection.

• Other interruptions of production. Production processes are not only interrupted by failures or repairs. Several outside events may create an opportunity as well. These can be market interruptions, or other work for which production needs to be stopped (e.g. replacing catalysts), and this is an opportunity to combine it with preventive maintenance.

According to the foregoing discussion there are two approaches to opportunities. The first models a whole multi-component system in which upon a failure preventive maintenance can be carried out on other components as well. In the latter stream the opportunities are modelled as an outside event at which one may do maintenance. In the simplest form one considers one component, with maintenance which may be done at opportunities, or also with a forced shutdown. Bäckert and Rippin (1985) consider the first type of opportunistic maintenance for plants subject to breakdowns. In this article three methods are proposed to solve the problem. In the first two cases the problem is formulated as a stochastic decision tree and solved using a modified branch and bound procedure. In the third case the problem is formulated as a Markov decision process. The planning period is discretised, resulting in a finite state space to which a dynamic programming procedure can be applied. In Wijnmalen and Hontelez (1997) a multi-component system is considered where failures of one component may create an opportunity, but the opportunity process is approximated by an independent process with the same mean rate. In this way they circumvent the problem of dimensionality which appears in the study of Bäckert and Rippin (1985). There are several articles considering the other stream. Tan and Kramer (1997) propose a general framework for preventive maintenance optimization in chemical process operations. The authors combine Monte Carlo simulation with a genetic algorithm. Opportunities are the failure of other components. In Dekker and Dijkstra (1992) and Dekker and Smeitink (1991) it is assumed that the opportunity-generating process is completely independent of the failure process and is modelled as a renewal process. 
Dekker and Smeitink (1994) consider multi-component maintenance at opportunities of restricted duration and determine priorities of what preventive maintenance to do at an opportunity.
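
The simplest setting mentioned above, one component that may be maintained at externally generated opportunities, can be explored by simulation. The sketch below is an illustration under our own assumptions and does not reproduce the analytical methods of the cited articles: opportunities arrive according to a Poisson process with rate mu (a special case of a renewal process, independent of the failure process), lifetimes are Weibull, and the component is replaced preventively at the first opportunity at which its age is at least the control limit t, or correctively at failure. All names and parameter values are hypothetical.

```python
import random

def simulate_cost_rate(t, mu, c_p, c_f, shape, scale, cycles=20000, seed=1):
    """Estimate the long-run cost rate of an opportunity-based age policy.

    Each renewal cycle ends with a preventive replacement (cost c_p) at the
    first opportunity after age t, or with a failure replacement (cost c_f),
    whichever comes first.
    """
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(cycles):
        # random.weibullvariate takes (scale, shape) in that order.
        life = rng.weibullvariate(scale, shape)
        # Poisson arrivals are memoryless, so the wait after age t is exponential.
        first_opportunity = t + rng.expovariate(mu)
        if first_opportunity < life:
            total_cost += c_p
            total_time += first_opportunity
        else:
            total_cost += c_f
            total_time += life
    # Renewal-reward argument: cost rate = total cost / total time.
    return total_cost / total_time

if __name__ == "__main__":
    for limit in (0.1, 0.2, 0.4, 0.8, 1e9):  # 1e9 stands for "never use opportunities"
        print(limit, simulate_cost_rate(limit, mu=5.0, c_p=1.0, c_f=10.0,
                                        shape=2.0, scale=1.0))
```

Scanning a few control limits in this way gives a rough picture of how much can be saved by using opportunities compared with purely corrective replacement.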

In Dekker and Van Rijn (1996) a decision-support system (PROMPT) for opportunity-based preventive maintenance is discussed. PROMPT was developed to take care of the random occurrence of opportunities of restricted duration. Here, opportunities are not only failures of other components, but also preventive maintenance on (essential) components. Many of the techniques developed in the articles of Dekker and Smeitink (1991), Dekker and Dijkstra (1992) and Dekker and Smeitink (1994) are implemented in the decision-support system. In PROMPT preventive maintenance is split up into packages. For each package an optimum policy is determined, which indicates when it should be carried out at an opportunity. From the separate policies a priority measure is derived that indicates which maintenance package should be executed at a given opportunity.

In Dekker et al. (1998b) the maintenance of light-standards is studied. A light-standard consists of n independent and identical lamps screwed on a lamp assembly. To guarantee a minimum luminance, the lamps are replaced if the number of failed lamps reaches a prespecified number m. In order to replace the lamps the assembly has to be lowered. As a consequence, each failure is an opportunity to combine corrective and preventive maintenance. Several opportunistic age-based variants of the m-failure group replacement policy (in its original form only corrective maintenance is grouped) are considered. Simulation optimization is used to determine the optimal opportunistic age threshold.

Dagpunar (1996) introduces a maintenance model where replacement of a component within a system is possible when some other part of the system fails, at a cost of c2. The opportunity process is Poisson. A component is replaced at an opportunity if its age exceeds a specified control limit t. Upon failure a component is replaced at cost c4 if its age exceeds a specified control limit x; otherwise it is minimally repaired at cost c1.
In case of a minimal repair the age and failure rate of the component after the repair are as they were immediately before failure. There is also a possibility of a preventive or “interrupt” replacement at cost c3 if the component is still functioning at a specified age T. A procedure to optimise the control limits t and T is given in Dekker and Plasmeijer (2001).

13.3.3 Maintenance Scheduling in Line with Production

Here we consider models where the effect of production on maintenance is explicitly taken into account. These models address only maintenance decisions; they do not give advice on how to plan production. The models developed in the articles in this category show that a good maintenance plan, one that is integrated with the production plan, can result in considerable cost savings. This integration with production is crucial because production and maintenance are directly related. Any breakdown in machine operation results in disruption of production and leads to additional costs due to downtime, loss of production, decrease in productivity and quality, and inefficient use of personnel, equipment and facilities. Below we review articles following this stream of research in chronological order.

Dedopoulos and Shah (1995) consider the problem of determining the optimal preventive maintenance policy parameters for individual items of equipment in multipurpose plants. In order to formulate maintenance policies, the benefits of
maintenance, in the form of reduced failure rates, must be weighed against the costs. The approach in this study first attempts to estimate the effect of the failure rate of a piece of equipment on the overall performance/profitability of the plant. An integrated production and maintenance planning problem is also solved to determine the effects of PM on production. Finally, the results of these two procedures are then utilized in a final optimization problem that uses the relationship between profitability and failure rate as well as the costs of different maintenance policies to select the appropriate maintenance policy. Vatn et al. (1996) present an approach for identifying the optimal maintenance schedule for the components of a production system. Safety, health and environment objectives, maintenance costs and costs of lost production are all taken into consideration, and maintenance is thus optimized with respect to multiple objectives. The approach is flexible as it can be carried out at various levels of detail, e.g. adapted to available resources and to the management’s willingness to give detailed priorities with respect to objectives on safety vs. production loss. Frost and Dechter (1998) define the scheduling of preventive maintenance of power generating units within a power plant as constraint satisfaction problems. The general purpose of determining a maintenance schedule is to determine the duration and sequence of outages of power generating units over a given time period, while minimizing operating and maintenance costs over the planning period. Vaurio (1999) develops unavailability and cost rate functions for components whose failures can occur randomly. Failures can only be detected through periodic testing or inspections. If a failure occurs between consecutive inspections, the unit remains failed until the next inspection. 
Components are renewed by preventive maintenance periodically, or by repair or replacement after a failure, whichever occurs first (age-replacement). The model takes into account finite repair and maintenance durations as well as costs due to testing, repair, maintenance and lost production or accidents. For normally operating units the time-related penalty is loss of production. For standby safety equipment it is the expected cost of an accident that can happen when the component is down due to a dormant failure, repair or maintenance. The objective is to minimize the total cost rate with respect to the inspection and replacement intervals. General conditions and techniques are developed for solving optimal test and maintenance intervals, with and without constraints on the production loss or accident rate. Insights are gained into how the optimal intervals depend on various cost parameters and reliability characteristics.

Van Dijkhuizen (2000) studies the problem of clustering preventive maintenance jobs in a multiple set-up multi-component production system. This article has been reviewed in Chapter 11, which gives an overview of multi-component maintenance models.

Cassady et al. (2001) introduce the concept of selective maintenance. Often production systems are required to perform a sequence of operations with finite breaks between operations. The authors establish a mathematical programming framework for assisting decision-makers in determining the optimal subset of maintenance activities to perform prior to beginning the next operation. This decision-making process is referred to as selective maintenance.

The article by Haghani and Shafahi (2002) deals with the problem of scheduling bus maintenance activities. A mathematical programming approach to the problem
is proposed. This approach takes as input a given daily operating schedule for all buses assigned to a depot, along with the available maintenance resources. A daily inspection and maintenance schedule is then designed for the buses that require inspection so as to minimize the interruptions in the daily bus-operating schedule, maximize the reliability of the system and utilize the maintenance facilities efficiently.

Charles et al. (2003) examine the interaction effects of maintenance policies on batch plant scheduling in a semiconductor wafer fabrication facility. The purpose of the work is to improve the quality of maintenance department activities by implementing optimized preventive maintenance (PM) strategies, and it comes within the scope of a total productive maintenance (TPM) strategy. The production of semiconductor devices is carried out in a wafer lab. In this production environment equipment breakdown or procedure drifting usually induces unscheduled production interruptions.

Cheung et al. (2004) consider a plant with several units of different types. There are several shutdown periods for maintenance. The problem is to allocate units to these periods in such a way that production is least affected. Maintenance is not modelled in detail, but incorporated through frequency or period restrictions.
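
The selective maintenance idea of Cassady et al. (2001), reviewed earlier in this section, can be illustrated with a toy example. The sketch below is our own simplification and uses plain enumeration rather than the mathematical programming framework of the authors. It assumes a series system, a hypothetical list of components with a maintenance duration and a reliability for the next operation with and without maintenance, and a break of limited length; it selects the subset of maintenance activities that fits in the break and maximizes system reliability.

```python
from itertools import combinations
from math import prod

def select_maintenance(components, break_length):
    """Pick the subset of activities maximizing series-system reliability.

    components: list of tuples
        (name, duration, reliability_if_skipped, reliability_if_maintained)
    break_length: time available before the next operation starts.
    """
    best_subset, best_reliability = (), -1.0
    indices = range(len(components))
    for size in range(len(components) + 1):
        for subset in combinations(indices, size):
            duration = sum(components[i][1] for i in subset)
            if duration > break_length:
                continue  # does not fit in the break
            # Series system: reliability is the product over all components.
            reliability = prod(
                c[3] if i in subset else c[2] for i, c in enumerate(components)
            )
            if reliability > best_reliability:
                best_subset, best_reliability = subset, reliability
    return [components[i][0] for i in best_subset], best_reliability

if __name__ == "__main__":
    # Hypothetical data for illustration only.
    plant = [("pump", 2, 0.80, 0.95), ("valve", 1, 0.90, 0.99), ("motor", 3, 0.70, 0.98)]
    print(select_maintenance(plant, break_length=4))
```

Enumeration is only workable for a handful of components, which is one reason a mathematical programming formulation is attractive for realistic systems.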

13.4 Specific Business Sectors

The purpose here is to illustrate the interdependence between maintenance and production for some specific sectors in more detail. Moreover, we show which ideas were employed in which sector and how the sectors differ. Although many sectors could be distinguished, we take those where maintenance plays an important role. Not surprisingly, these are all capital-intensive sectors with high maintenance expenditure: we discuss railway, road, airline and electric power system maintenance.

13.4.1 Railway Maintenance

Since rail is an important transportation mode, proper maintenance of the existing lines and timely repairs and replacements are all important to ensure efficient operation. Moreover, since some failures might have a strong impact on the safety of the passengers, it is important to prevent these failures by carrying out preventive maintenance works in time and according to predefined schedules. The preventive maintenance works are small routine works and/or projects. The routine (spot) maintenance activities, which consist of inspections and small repairs (see Esveld 2001), do not take much time to perform and are done regularly, with frequencies varying between monthly and once a year. The projects include renewal works and are carried out once or twice every few years.

In the literature there are a couple of articles that provide useful methods for finding optimal track possession intervals for carrying out preventive maintenance works, i.e. time periods when a track is required for maintenance and is therefore blocked for operation. In production planning terms track possession means
downtime required for maintenance. The main question is when to carry out maintenance such that the inconvenience for the train operators, the disruption to the scheduled trains and the infrastructure possession time for maintenance are minimized and the maintenance cost is as low as possible. For a more detailed overview of techniques used in planning railway infrastructure maintenance we refer to Dekker and Budai (2002) and Improverail (2002).

In some articles (see, e.g. Higgins 1998, Cheung et al. 1999 and Budai et al. 2006) the track possession is modelled in between operations. This can be done for occasionally used tracks, which is the case in Australia and some European countries. If tracks are used frequently, one has to perform maintenance during nights, when train traffic is almost absent, or during weekends (with possible interruption of the train services), when there are fewer disturbances for the passengers. In the first case one can either make a cyclic static schedule, which is done by Den Hertog et al. (2005) and Van Zante-de Fokkert (2001) for the Dutch situation, or a dynamic schedule with a rolling horizon, which is done in Cheung et al. (1999). The latter schedule has to be made regularly.

Some other articles deal with grouping railway maintenance activities to reduce costs, downtime and inconvenience for the travellers and operators. Here we mention the study of Budai et al. (2006), in which the preventive maintenance scheduling problem is introduced. This problem arises in other public/private sectors as well, since preventive maintenance of other technical systems (machines, roads, airplanes, etc.) also consists of small routine works and large projects.

13.4.2 Road Maintenance

Road maintenance has many characteristics in common with railway maintenance. Failures are often indirect, in the sense that norms are surpassed, but there may not be any consequences.
The production function is indirect, but that does not mean that it is not felt by many. Governments may define a cost penalty per vehicle for one hour of waiting because of congestion caused by road maintenance. Similar to railway maintenance, one sees that work is shifted to nights or that a lot of work is combined into a large project about which the public is informed long before it starts. Night work causes high logistics costs for maintenance, yet it is useful for small repairs or patches. Another similarity with railways is the large number of identical parts (a road is typically split up into lanes of 100 meters about which information is stored). Vans with complex road-analysing equipment are used to assess road quality; for railways, special trains with complex measuring equipment are used. Videos are used in both cases. Next, both roads and rails have multiple failure modes. Furthermore, the assets to be maintained are spread out geographically, which results in high logistics costs for maintenance. This is also true for airline and truck maintenance. Both road and rail need much maintenance and as a result large budgets need to be allocated to both.

Although several articles have been written on road maintenance, few take the production or user consequences into account. We would like to mention Dekker et al. (1998a), who compare two concepts for road maintenance: one with small projects carried out during nights, and the other where large road segments (some
4 km) are overhauled in one stretch. In the latter case the traffic is diverted to other lanes or to the side of the road. It is shown that the latter is both advantageous for the traffic and cheaper, provided the volume of traffic on the road is not too high. Another interesting contribution is from Rose and Bennett (1992), who provide a model to locate road maintenance depots for corrective maintenance and to decide on their size (or capacity).

13.4.3 Airline Maintenance

Maintenance costs are a substantial part of an airline’s costs. Estimates are that 20% of the cost is due to maintenance. Maintenance is crucial for safety reasons and because of high downtime costs. Apart from a crash, the worst event for an airline is an aircraft on ground (AOG) because of failures. Accordingly, a lot of technology has been developed to facilitate maintenance. We would like to mention in-flight diagnosis, which allows quick action to be taken on the ground, and a very high level of modularity, such that failed components can easily be replaced. Yet in an aircraft there is still a high level of time-based preventive maintenance rather than condition-based maintenance. A plane has to undergo several checks, ranging from an A check taking about an hour after each flight, to a monthly B check, a yearly C check and a five-yearly D check, in which it is completely overhauled and which can take a month. The presence of the monthly check implies that planes cannot always fly the same route, but need to be rotated on a regular basis. It also implies that airlines need multiple units of a type in order to provide a consistent service.

Several studies have addressed the issue of fleet allocation and maintenance scheduling. In fleet allocation one decides which planes fly which route and at which time. One would preferably make an allocation which remains fixed for a whole year, but due to the regular maintenance checks this is not possible.
Gopalan and Talluri (1998) give an overview of mathematical models on this problem. Moudani and Mora-Camino (2000) present a method to do both flight assignment and maintenance scheduling of planes. It uses dynamic programming and heuristics. A case of a charter airline is considered. Sriram and Haghani (2003) consider the same problem and solve it in two phases. Finally, Feo and Bard (1989) consider the problem of maintenance base planning in relation to an airline’s fleet rotation, while Cohn and Barnhart (2003) consider the relation between crew scheduling and key maintenance routing decisions.

In another line of research, Dijkstra et al. (1994) develop a model to assess maintenance manpower scheduling and requirements in order to perform inspection checks (A type) between flight turnarounds. It appears that the workload is quite peaked because many flights arrive at more or less the same time (in so-called banks) in order to allow fast passenger transfers. The same problem is also tackled by Yan et al. (2004). The articles in this line of research consider in effect the production planning of maintenance, a topic also addressed in Section 13.5. As the last article in this category we would like to mention Cobb (1995), who presents a simulation model to evaluate current maintenance system performance or the positive effect of ad hoc operating decisions on maintenance turn times (i.e. the time maintenance takes to carry out a check or to do a repair).

13.4.4 Electric Power System Maintenance

Kralj and Petrovic (1988) have presented an overview article on optimal maintenance of thermal generating units in power systems. They primarily focused on articles published in IEEE Transactions on Power Apparatus and Systems. Here we briefly discuss the typical problems of the maintenance of power systems and review two articles dealing with these problems.

First of all, note that maintenance of power systems is costly, because it is impossible to store generated electrical energy. Moreover, the continuity of supply is very important for customers. A second problem of scheduling the maintenance of power systems is that joint maintenance of units is often impossible or very expensive, since that would affect production too much.

Frost and Dechter (1998) consider the problem of scheduling preventive maintenance of power generating units within a power plant. The purpose of the maintenance scheduling is to determine the duration and sequence of outages of power generating units over a given time period, while minimizing operating and maintenance costs over the planning period, subject to various constraints. A subset of the constraints contains the pairs of components that cannot be maintained simultaneously. In this article the maintenance problems are cast as constraint satisfaction problems (CSPs). The optimal solution is found by solving a series of CSPs with successively tighter cost-bound constraints.

Langdon and Treleaven (1997) study the problem of scheduling maintenance for electrical power transmission networks. Grouping maintenance in the network may prevent the use of a cheap electricity generator, thus requiring a more expensive generator to be run in its place. That is, some parts of the network should not be maintained simultaneously. These exclusions are modelled by adding restrictions to the MIP formulation of the problem.
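
The idea of solving a series of CSPs with successively tighter cost bounds, as described for Frost and Dechter (1998), can be sketched on a toy instance. The code below is an illustration under our own assumptions and is not the authors' algorithm or model: each unit needs exactly one maintenance period, a hypothetical table gives the non-negative integer cost of maintaining a unit in a period, and excluded pairs of units may not share a period. A plain backtracking search finds any schedule within the current bound, after which the bound is tightened to one less than the cost found, until no schedule remains.

```python
def solve_csp(units, periods, cost, exclusions, bound):
    """Backtracking search for any schedule with total cost <= bound, or None."""
    assignment = {}

    def consistent(unit, period):
        # Units in an exclusion pair must not be maintained in the same period.
        for other, other_period in assignment.items():
            if other_period == period and frozenset((unit, other)) in exclusions:
                return False
        return True

    def backtrack(index, spent):
        if spent > bound:  # valid pruning because costs are non-negative
            return False
        if index == len(units):
            return True
        unit = units[index]
        for period in periods:
            if consistent(unit, period):
                assignment[unit] = period
                if backtrack(index + 1, spent + cost[unit][period]):
                    return True
                del assignment[unit]
        return False

    return dict(assignment) if backtrack(0, 0) else None

def cheapest_schedule(units, periods, cost, exclusions):
    """Solve a series of CSPs, tightening the cost bound after each solution."""
    bound = sum(max(cost[u].values()) for u in units)  # loose initial bound
    best = None
    while True:
        schedule = solve_csp(units, periods, cost, exclusions, bound)
        if schedule is None:
            return best  # the last schedule found is optimal
        best = schedule
        # Integer costs assumed: require a strictly cheaper schedule next time.
        bound = sum(cost[u][p] for u, p in schedule.items()) - 1

if __name__ == "__main__":
    # Hypothetical data: three generating units, two periods.
    cost = {"G1": {1: 4, 2: 6}, "G2": {1: 3, 2: 4}, "G3": {1: 2, 2: 1}}
    exclusions = {frozenset(("G1", "G2"))}
    print(cheapest_schedule(["G1", "G2", "G3"], [1, 2], cost, exclusions))
```

The final infeasible CSP serves as the proof of optimality for the last schedule found.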

13.5 Production Planning of Maintenance

Maintenance can also be regarded as a production process which needs to be planned. Planning in this respect implies determining appropriate levels of capacity in relation to the demand. Clearly, this activity can only be carried out for plannable maintenance, e.g. overhauls or refurbishment, and it is only needed when there are capacity restrictions, e.g. in a shipyard. What distinguishes maintenance production planning from standard production planning is that there tend to be more unforeseen events and more intervening corrective maintenance work. Articles in this category are Dijkstra et al. (1994) and Yan et al. (2004), who both consider manpower determination and allocation problems in the case of a fluctuating workload for aircraft maintenance. Shenoy and Bhadury (1993) use the MRP approach to develop a maintenance-manpower plan. Bengü (1994) discusses the organization of maintenance centres that are specialized to carry out particular types of maintenance jobs in the telecommunication sector. Al-Zubaidi and


G. Budai, R. Dekker and R. Nicolai

Christer (1997) consider the problem of manpower planning for hospital building maintenance. Another typical production planning problem concerns layout planning. A case study for a maintenance tool room is described in Rosa and Feiring (1995). The study by Rose and Bennett (1992), which was discussed in Section 13.4, also falls into this category.

13.6 Integrated Production and Maintenance Planning

In recent years there has been considerable interest in models attempting to integrate production, quality and maintenance (Ben-Daya and Rahim 2001). Whereas in the past these aspects have been treated as separate problems, nowadays models take into account the mutual interdependencies. Production planning typically concerns determining lot sizes and evaluating capacity needs, in case of fluctuating demand. Both the optimal lot size and the capacity needs are influenced by failures. On the other hand, maintenance prevents breakdowns and improves quality. Accordingly, they should be planned in an integrated way (see, e.g. Nahmias 2005).

We subdivide the class of integrated production and maintenance planning models into four categories: high-level models considering conceptual and process design problems (Section 13.6.1); the economic manufacturing quantity model, which was originally posed as a simple inventory problem, but has been (successfully) extended to deal with quality and failure aspects (Section 13.6.2); models of production systems with buffer capacities, which by definition are suitable to deal with breakdowns (Section 13.6.3); finally, production and maintenance rate optimization models, which aim to find the production and preventive/corrective maintenance rates of machines so as to minimize the total cost of inventory, production and maintenance (Section 13.6.4). In Section 13.6.5 we discuss articles which do not fit in any of the above categories.

13.6.1 Conceptual and Design Models

In a number of articles conceptual models are developed that integrate the preventive and corrective aspects of the maintenance planning, with aspects of the production system such as quality, service level and priority and capacity activities.
For instance, Finch and Gilbert (1986) present an integrated conceptual framework for maintenance and production in which they focus especially on manpower issues in corrective and preventive work. Weinstein and Chung (1999) test the hypothesis that integrating the maintenance policy with aggregate production planning significantly reduces total cost. This appears to be the case in the experimental setting investigated in their study. Lee (2005) considers production inventory planning, where high level decisions on maintenance (viz. their effects) are made.

Another group of articles deals with integrating process design, production and maintenance planning. Already at the design stage decisions on the process system and the initial reliabilities of the equipment are made. Pistikopoulos et al. (2000) describe an optimization framework for general multipurpose process models, which determines the optimal design as well as the production and maintenance plans simultaneously. In this framework, the basic process and system reliability-maintainability characteristics are determined in the design phase with the selection of system structure, components, etc. The remaining characteristics are determined in the operation phase with the selection of appropriate operating and maintenance policies. Therefore, the optimization of process system effectiveness depends on the simultaneous identification of optimal design, operation and maintenance policies, with their interactions properly accounted for. In Goel et al. (2003) a reliability allocation model is coupled with the existing design, production, and maintenance optimization framework. The aim is to identify the optimal size and initial reliability for each unit of equipment at the design stage. They balance the additional design and maintenance costs against the benefits obtained from increased process availability.

13.6.2 EMQ Problems

In the classical economic manufacturing quantity (EMQ) model items are produced at a constant rate p and the demand rate for the items is equal to d < p. The aim of the model is to find the production uptime that minimizes the sum of the inventory holding cost and the average, fixed, ordering cost. This model is an extension of the well known economic order quantity (EOQ) model, the difference being that in the EOQ model the whole order arrives at once when there is no inventory, whereas in the EMQ model stock is built up gradually during production. Note that the EMQ model is also referred to as the economic production quantity (EPQ) model. In the extensive literature on production and inventory problems, it is often assumed that the production process does not fail, that it is not interrupted and that it only produces items of acceptable quality. Unfortunately, in practice this is not always the case. A production process can be interrupted due to a machine breakdown or because the quality of the produced items is no longer acceptable.
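For the classical EMQ model just described, the cost-minimizing lot size has the well-known closed form Q* = sqrt(2Kd / (h(1 - d/p))), with production uptime Q*/p. The sketch below evaluates it. The symbols K (fixed setup cost per production run) and h (holding cost per item per unit time), as well as the function names, are notation assumed here and do not appear in the chapter.

```python
import math

def emq(setup_cost, holding_cost, demand_rate, production_rate):
    """Classical EMQ: return (lot size, production uptime) minimizing average cost."""
    if not 0 < demand_rate < production_rate:
        raise ValueError("the model requires 0 < d < p")
    # Stock builds up at rate p - d while producing, hence the factor (1 - d/p).
    lot_size = math.sqrt(
        2 * setup_cost * demand_rate
        / (holding_cost * (1 - demand_rate / production_rate))
    )
    return lot_size, lot_size / production_rate

def average_cost(uptime, setup_cost, holding_cost, demand_rate, production_rate):
    """Average setup plus holding cost per unit time for a given production uptime."""
    cycle_length = production_rate * uptime / demand_rate  # time to consume one lot
    mean_inventory = (production_rate - demand_rate) * uptime / 2
    return setup_cost / cycle_length + holding_cost * mean_inventory
```

As the production rate p grows without bound, the factor (1 - d/p) tends to one and the expression reduces to the EOQ lot size, which is one way to see the EMQ model as an extension of the EOQ model.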
The EMQ model has been extended to deal with these aspects and we thus divide the literature on EMQ models into two categories. First, we consider EMQ problems that take into account the quality aspects of the items produced. The second category of EMQ models analyzes the effects of (stochastic machine) breakdowns on the lot sizing decision.

13.6.2.1 EMQ Problems with Quality Aspects

One of the reasons why a production process is interrupted is the (lack of) quality of the items produced. Obviously, items of inferior quality can only be sold at a lower price or cannot be sold at all. Thus, the production of these items results in a loss (or a lower profit) for the firm. This type of interruption is usually modelled as follows. It is assumed that at the start of the production cycle the production process is in an “in-control” state, producing items of acceptable quality. After some time the production process may shift to an “out-of-control” state. In this state a certain percentage of the items produced are defective or of substandard quality. The time the process spends in the in-control state before the shift occurs is a random variable. Once a shift to the out-of-control state has occurred, it is assumed that the production process stays in that state until the shift is discovered by (a periodic) inspection of the process, followed by corrective maintenance.

One of the earliest works that consider the problem of finding the optimal lot size and optimal inspection schedule is the article of Lee and Rosenblatt (1987). They show that the derived optimal lot size is smaller than the classical EMQ if the time for the process to be in the in-control state follows an exponential distribution. Lee and Rosenblatt (1989) have extended this work by assuming that the cost of restoration is a function of the elapsed time since a shift from an in-control to an out-of-control state of the production process has occurred. In addition, shortages are allowed in the model.

Many attempts have been made to extend these two models. For instance, Tseng (1996) assumes that the process lifetime is arbitrarily distributed with an increasing failure rate. Furthermore, two maintenance actions are considered. The first is a perfect maintenance action, which restores the system to an as-good-as-new condition if the process is in the in-control state. If, however, the production process is in the out-of-control state, it is restored to the in-control state at a given restoration cost. Secondly, maintenance is always done at the end of a production cycle to ensure that the process is perfect at the beginning of each production cycle. Wang and Sheu (2003) assume that the periodic inspections are imperfect. Two types of inspection errors are considered, namely (I) the process is declared out-of-control when it is in-control and (II) the process is declared in-control when it is out-of-control. They use a Markov chain to jointly determine the production cycle, process inspection intervals, and maintenance level. Wang (2006) derives some structural properties for the optimal production/preventive maintenance policy, under the assumption that the (sufficient) conditions for the optimality of the equal-interval PM schedule hold.
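The qualitative result attributed above to Lee and Rosenblatt (1987), an optimal lot size below the classical EMQ when the in-control time is exponential, can be reproduced in a deliberately simplified setting. The sketch below is not their model: it assumes no inspection during a run, a process that is restored to the in-control state at the start of every run at no extra cost, a fixed cost per defective item, and a plain grid search. All parameter names are hypothetical.

```python
import math

def cost_rate(uptime, setup_cost, holding_cost, demand_rate, production_rate,
              shift_rate=0.0, defect_fraction=0.0, defect_cost=0.0):
    """Average cost per unit time for a production run of the given uptime."""
    cycle_length = production_rate * uptime / demand_rate
    setup = setup_cost / cycle_length
    holding = holding_cost * (production_rate - demand_rate) * uptime / 2
    if shift_rate > 0:
        # Expected time out of control: E[(T - X)^+] with X ~ Exp(shift_rate).
        out_of_control = uptime - (1 - math.exp(-shift_rate * uptime)) / shift_rate
    else:
        out_of_control = 0.0
    expected_defectives = defect_fraction * production_rate * out_of_control
    return setup + holding + defect_cost * expected_defectives / cycle_length

def best_uptime(candidates, **params):
    """Pick the candidate uptime with the lowest average cost (grid search)."""
    return min(candidates, key=lambda uptime: cost_rate(uptime, **params))
```

Because the expected share of a run spent out of control grows with the run length, the defect term penalizes long runs, so the grid search settles on a shorter uptime (and hence a smaller lot) than in the classical model with the same setup and holding costs.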
The structural properties derived by Wang (2006) increase the efficiency of the solution procedure.

The quality characteristics of the product in a production process can be monitored by an x̄-control chart. The economic design of the x̄-control chart determines the sample size n, sampling interval h, and the control limit coefficient k such that the total cost is minimized. Rahim (1994) develops an economic model for joint determination of production quantity, inspection schedule and control chart design for a production process which is subject to a non-Markovian random shock. In this model it is assumed that the in-control period follows a general probability distribution with an increasing failure rate and that production ceases only if the process is found to be out of control during inspection. However, if the alarm turns out to be false the time for searching an assignable cause is assumed to be zero. Rahim and Ben-Daya (1998) generalize the model of Rahim (1994) by assuming that the production stops for a fixed amount of time not only for a true alarm, but also whenever there is a false alarm during the in-control state. Rahim and Ben-Daya (2001) further extend the model of Rahim (1994) by looking at the effect of deteriorating products and a deteriorating production process on the optimal production quantity, inspection schedule and control chart design parameters. The deterioration times for both product and process are assumed to follow Weibull distributions. It is assumed that the process is stopped either at failure or at the m-th inspection interval, whichever occurs first. Furthermore, the inventory is depleted to zero before a new cycle starts. Tagaras (1988) develops an economic model that incorporates both process control and maintenance policies, and simultaneously optimizes their design parameters. Lam and Rahim (2002) present an integrated model for joint determination of economic design of x̄-control charts, economic production quantity, production run length and maintenance schedules for a deteriorating production system. In the model of Ben-Daya and Makhdoum (1998) PM activities are also coordinated with quality control inspections, but they are carried out only when a preset threshold of the shift rate of the production process is reached.

13.6.2.2 EMQ Problems with Failure Aspects

Several articles study the EMQ model in the presence of random machine breakdowns or random failures of a bottleneck component. For instance, Groenevelt et al. (1992a) consider the effects of stochastic machine breakdowns and corrective maintenance on economic lot sizing decisions. Maintenance of the machine is carried out after a failure or after a predetermined time interval, whichever occurs first. They consider two production control policies. Under the first policy, when the machine breaks down the interrupted lot is not resumed and a new lot starts only when all available inventory is depleted. Under the second policy, production is immediately resumed after a breakdown if the current on-hand inventory is below a certain threshold level. They show that under these policies the optimal lot size increases with the failure rate and that, assuming a constant failure rate and instantaneous repair times, the optimal lot sizes are always larger than the EMQ. Nevertheless, Groenevelt et al. (1992a) propose to use the EMQ as an approximation to the optimal production lot size. Chung (2003) provides a better approximation to the optimal production lot size. Groenevelt et al.
(1992b) study the problem of selecting the economic lot size for an unreliable manufacturing facility with a constant failure rate and generally distributed repair times. The quantity of safety stock that is used while the machine is being repaired is derived from the managerially prescribed service level.

Makis and Fung (1995) present a model for joint determination of the lot size, inspection interval and preventive replacement time for a production facility that is subject to random failure. The time that the process stays in the in-control state is exponentially distributed and once the process is in the out-of-control state, a certain percentage of the items produced is defective or qualitatively not acceptable. Periodic inspections are done to review the production process, and the time to machine failure is a generally distributed random variable. Preventive replacement of the production facility is based on operation time, i.e. after a certain number of production runs the production facility is replaced.

Some other articles are concerned with PM policies for EMQ models. For instance, in Srinivasan and Lee (1996) an (S, s) policy is considered, i.e. as soon as the inventory level reaches S, a preventive maintenance operation is initiated and the machine becomes as good as new. After the preventive maintenance operation, production resumes as soon as the inventory level drops down to or below a prespecified value, s, and the facility continues to produce items until the inventory level is raised back to S. If the facility breaks down during operation, it is minimally repaired and put back into commission. Okamura et al. (2001) generalize the model of Srinivasan and Lee (1996) by assuming that both the demand and the production process are continuous-time renewal counting processes. Furthermore, they suppose that machine breakdowns occur according to a non-homogeneous Poisson process. In Lee and Srinivasan (2001) the demand and production rates are considered constant and a production run begins as soon as the inventory drops to zero. If the facility fails during operation, it is repaired, but only to the condition it was in before the failure. Lee and Srinivasan (2001) consider an (S, N) policy, where the control variable N specifies the number of production cycles the machine should go through before it is set aside for a preventive maintenance overhaul, which restores the facility to its original condition.

Recently, Lin and Gong (2006) determined the effect of breakdowns on the decision of optimal production uptime for items subject to exponential deterioration under a no-resumption policy. Under this policy, a production run is executed for a predetermined period of time provided that no machine breakdown has occurred in this period. Otherwise, the production run is immediately aborted. The inventories are built up gradually during the production uptime and a new production run starts only when all on-hand inventories are depleted. If a breakdown occurs then corrective maintenance is carried out and this takes a fixed amount of time. If the inventory built up during the production uptime is not enough to meet the demand during the entire period of the corrective maintenance, shortages (lost sales) will occur. Maintenance restores the production system to the same initial working conditions.
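The mechanics of a no-resumption policy with a fixed repair time can be sketched for a single production cycle. This is only an illustration under assumptions that are not taken from Lin and Gong (2006): item deterioration is ignored, breakdown times are exponential, and the function and parameter names are hypothetical.

```python
import random

def cycle_outcome(breakdown_time, planned_uptime, production_rate, demand_rate,
                  repair_time):
    """Return (actual uptime, inventory at end of run, lost sales) for one cycle."""
    if breakdown_time >= planned_uptime:
        # No breakdown during the run: it ends as planned and no repair is needed.
        return planned_uptime, (production_rate - demand_rate) * planned_uptime, 0.0
    # Breakdown: the run is aborted and is not resumed after the repair.
    inventory = (production_rate - demand_rate) * breakdown_time
    demand_during_repair = demand_rate * repair_time
    lost_sales = max(0.0, demand_during_repair - inventory)
    return breakdown_time, inventory, lost_sales

def expected_lost_sales(failure_rate, planned_uptime, production_rate, demand_rate,
                        repair_time, runs=10_000, seed=0):
    """Monte Carlo estimate of lost sales per cycle with exponential breakdown times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        breakdown_time = rng.expovariate(failure_rate)
        total += cycle_outcome(breakdown_time, planned_uptime, production_rate,
                               demand_rate, repair_time)[2]
    return total / runs
```

Lost sales arise only when the breakdown comes so early that the stock accumulated at rate p - d cannot cover the demand during the repair, which is why a longer planned uptime does not help against early failures.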
13.6.3 Deteriorating Production System with Buffer Capacity

In order to reduce the negative effect of a machine breakdown on the production process, a buffer inventory may be built up during the production uptime (as is done in the EMQ model). The role of this buffer inventory is that, if an unexpected failure of the installation occurs, the inventory is used to satisfy the demand during the period in which corrective maintenance is carried out.

One of the earliest works on this subject is Van der Duyn Schouten and Vanneste (1995). In their model the demand rate is constant and equal to d (units/time), and as long as the fixed buffer capacity (K) is not reached the installation operates at a constant rate of p units/time (p > d) and the excess output is stored in the buffer. When the buffer is full, the installation reduces its speed from p to d. Upon failure corrective maintenance starts and the installation becomes as good as new. It is possible to perform preventive maintenance, which takes less time than repair and also brings the installation back into the as-good-as-new condition. The decision to start a preventive maintenance action is based not only on the condition of the installation, but also on the level of the buffer. The criterion is to minimize the average inventory level and the average number of backorders. Since the optimal policy is difficult to implement, the authors develop suboptimal (n, N, k) control-limit policies. Under such a policy, if the buffer is full, preventive maintenance is undertaken at age n. If the buffer is not full, but it has at least k items, preventive maintenance is undertaken at age N. Maintenance is never performed unless the system has at least k items. The objective is to obtain the best values for n, N and k.

Iravani and Duenyas (2002) extend the above model by assuming a stochastic demand and production process. Demand that cannot be met from the inventory is lost and a penalty is incurred. Moreover, it is assumed that the production characteristics of the system change with usage: the more the system deteriorates, the more its production rate decreases and the more time-consuming and costly its maintenance operation becomes. In a recent article, Yao et al. (2005) assume that the production system can produce at any rate from 0 (idle) to its maximum rate if it is in the working state. Upon failure corrective maintenance is performed immediately to restore the system to the working state. Preventive maintenance actions can be performed as well. Both the failure process and the times to complete corrective/preventive maintenance are assumed to be stochastic. Thus, in addition to the direct cost of performing corrective/preventive maintenance, the non-negligible maintenance completion time leads to an indirect cost of lost production capacity due to system unavailability.

Kyriakidis and Dimitrakos (2006) study an infinite-state generalisation of Van der Duyn Schouten and Vanneste (1995). The deterioration process of the installation is considered nonstationary, i.e. the transition probabilities depend not only on the working conditions of the installation but on its age and buffer level as well. Furthermore, the cost structure is more general than in Van der Duyn Schouten and Vanneste (1995) since it includes operating and maintenance costs of the installation as well as storage and shortage costs. It is assumed that the operating costs of the installation depend on both the working condition and the age of the installation.

Another way of maintaining the buffer inventory is according to an (S, s) policy, i.e.
the system stops production when the buffer inventory reaches S and the production restarts when the inventory drops to s. This idea is used by Das and Sarkar (1999). They assume that exogenous demand for the product arrives according to a Poisson process. Back-orders are not allowed. The unit production time, the time between failures, and the repair and maintenance times are assumed to have general probability distributions. Preventive maintenance decisions are made only at the time that the buffer inventory reaches S, and they depend on both the current inventory level and the number of items produced since the last repair/maintenance operation. The objective is to determine when to perform preventive maintenance on the system in order to improve the system performance.

A different approach to integrated maintenance/production scheduling with buffer capacity is presented in Chelbi and Ait-Kadi (2004). They assume that preventive maintenance actions are performed regularly (every T time periods) and that the duration of corrective and preventive maintenance actions is random. The proposed strategy consists of building up a buffer stock whose size S covers at least the average consumption during the repair periods following breakdowns within the period of length T. When the production unit has to be stopped to undertake the planned preventive maintenance actions, a certain level of buffer stock must still be available in order to avoid stoppage of the subsequent assembly line. The two decision variables are the period T at which preventive maintenance must be performed, and the level S of the buffer stock.
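The (n, N, k) control-limit policy of Van der Duyn Schouten and Vanneste (1995), as summarized earlier in this section, amounts to a simple decision rule on the age of the installation and the buffer level. The sketch below encodes only that verbal description. The function name, the argument names and the reading of "age" as a single number are assumptions, and finding good values of n, N and k is not addressed.

```python
def start_preventive_maintenance(age, buffer_level, buffer_capacity, n, N, k):
    """Decide whether to start preventive maintenance under an (n, N, k) policy."""
    if buffer_level < k:
        # Never maintain preventively with fewer than k items in the buffer.
        return False
    if buffer_level >= buffer_capacity:
        # Full buffer: maintain once the installation has reached age n.
        return age >= n
    # Buffer holds at least k items but is not full: wait until age N.
    return age >= N
```

With a buffer capacity of 10 and the hypothetical values n = 5, N = 8 and k = 3, an installation of age 6 is maintained when the buffer is full but not when it holds 5 items, which shows how a full buffer makes earlier maintenance acceptable.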


A recent article of Kenne et al. (2006) considers the effects of both preventive maintenance policies and machine age on optimal safety stock levels. As the machine ages, larger stock levels hedge against more frequent random failures. The objective of the study is to determine when to perform preventive maintenance on the machine and to find the level of the safety stock to be maintained.

13.6.4 Production and Maintenance Rate Optimization

Integrated production and maintenance planning can also be achieved by optimizing the production and maintenance rates of the machines under consideration. In this line of research we mention the work of Gharbi and Kenne (2000, 2005), Kenne and Boukas (2003) and Kenne et al. (2003). In these articles a multiple-identical-machine manufacturing system with random breakdowns, repairs and preventive maintenance activities is studied. The objective of the control problem is to find the production and the preventive maintenance rates of the machines so as to minimize the total cost of inventory/backlog, repair and preventive maintenance.

13.6.5 Miscellaneous

Finally, we list some articles that deal with integrated maintenance and production planning, but whose modelling approaches or problem settings differ from those of the articles in the previous categories. For instance, the model presented in Ashayeri et al. (1996) deals with the scheduling of production and preventive maintenance jobs on multiple production lines, where each line has one bottleneck machine. The model indicates whether or not to produce a certain item in a certain period on a certain production line. In Kianfar (2005) the manufacturing system is composed of one machine that produces a single product. The failure rate of the machine is a function of its age and the demand for the manufactured product is time-dependent. The demand rate depends on the level of advertising for the product.
The objective is to maximize the expected discounted total profit of the firm over an infinite time horizon. Sarper (1993) considers the following problem. Given a fixed repair/maintenance capacity, how many of each of the low demand large items (LDLIs) should be started so that there are no incomplete jobs at the end of the production period? The goal is to ensure that the portion of the total demand started will be completed regardless of the amount by which some machines may stay idle due to insufficient work. A mixed-integer model is presented to determine what portion of the demand for each LDLI type should be rejected as lost sales so that the remaining portion can be finished completely.

13.7 Trends and Open Areas

Initial publications on models in the production and maintenance area date from the end of the 1980s (Lee and Rosenblatt 1987). Since that time many papers have been published, with the majority dating from the 1990s and the new millennium. The most popular area in this review is also the oldest one, i.e. integrated models for maintenance and production. Many papers still appear in that area and the models become more and more complex, with more decision parameters and more aspects. The topics of opportunity maintenance and scheduling maintenance in line with production have also been popular, but perhaps more in the past than today. We expected to find more studies on specific business sectors, but could only find many for the airline sector. That sector seems to be the most popular because it combines a lot of interaction between maintenance and production with high costs. In the other sectors we do see the interaction, and perhaps more papers will be published in the future. The other sections are interesting but small in terms of papers published.

In general, the demands on maintenance become higher as the public and companies are less likely to accept failures, bad quality products or non-performance. Yet at the same time society's inventory of capital goods in the western societies is both increasing and ageing. This is very much the case for roads, railways, electric power generation, transport, and aircraft. As there are continuous pressures on maintenance budgets, we foresee a need for research supporting maintenance and production decisions, also because decision support software is gaining in popularity and more data becomes electronically available. A theory is therefore needed for such decision support systems. As several case studies have taught us that practical problems have many complex aspects, there is a great need for more theory that can help us to understand and improve complex maintenance decision-making.

13.8 Conclusions

In this chapter we have given an overview of planning models for production and maintenance. These models are classified on the basis of the interactions between maintenance and production. First, although maintenance is intended to allow production, production is often stopped during maintenance. The question arises when to do maintenance such that production is least affected. In order to answer this question planning models should take into account the needs of production. These needs are business sector specific and thus applications of planning models in different areas have been considered. In comparison with other specific sectors, much work has been done on modelling maintenance for the airline sector. Second, maintenance itself can also be seen as a production process which needs to be planned. Models for maintenance production planning mainly address allocation and manpower determination problems. Finally, maintenance also affects the production process since it takes capacity away. In production processes maintenance is mostly initiated by machine failures or low quality items. Maintenance and production should therefore be planned in an integrated way to deal with these aspects. Indeed, integrated maintenance and production planning models determine optimal lot sizes while taking into account failure and quality aspects. We observe continuing attention for such models, which take more and more “real world” aspects into account.

Although many articles have been written on the interaction between production and maintenance, a careful reader will detect several open issues in this review. The theory developed thus far is far from complete and any real application is likely to reveal many more open issues.

13.9 Acknowledgements

The authors would like to thank Georgios Nenes, Sophia Panagiotidou, and the editors for their helpful suggestions and comments.

13.10 References

Al-Zubaidi H, Christer A, (1997) Maintenance manpower modelling for a hospital building complex. European Journal of Operational Research 99:603–618
Ashayeri J, Teelen A, Selen W, (1996) A production and maintenance planning model for the process industry. International Journal of Production Research 34:3311–3326
Bäckert W, Rippin D, (1985) The determination of maintenance strategies for plants subject to breakdown. Computers and Chemical Engineering 9(2):113–126
Ben-Daya M, Makhdoum M, (1998) Integrated production and quality model under various preventive maintenance policies. Journal of the Operational Research Society 49(8):840–853
Ben-Daya M, Rahim M, (2001) Integrated production, quality & maintenance models: an overview. In Rahim M, Ben-Daya M, (eds), Integrated models in production planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 3–28
Bengü G, (1994) Telecommunications systems maintenance. Computers and Operations Research 21:337–351
Budai G, Huisman D, Dekker R, (2006) Scheduling preventive railway maintenance activities. Journal of the Operational Research Society 57:1035–1044
Cassady C, Pohl E, Murdock W, (2001) Selective maintenance modeling for industrial systems. Journal of Quality in Maintenance Engineering 7(2):104–117
Charles A, Floru I, Azzaro-Pantel C, Pibouleau L, Domenech S, (2003) Optimization of preventive maintenance strategies in a multipurpose batch plant: application to semiconductor manufacturing. Computers and Chemical Engineering 27:449–467
Chelbi A, Ait-Kadi D, (2004) Analysis of a production/inventory system with randomly failing production unit submitted to regular preventive maintenance. European Journal of Operational Research 156:712–718
Cheung B, Chow K, Hui L, Yong A, (1999) Railway track possession assignment using constraint satisfaction. Engineering Applications of AI 12(5):599–611
Cheung K, Hui C, Sakamoto H, Hirata K, O'Young L, (2004) Short-term site-wide maintenance scheduling. Computers and Chemical Engineering 28:91–102
Cho D, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23
Chung K, (2003) Approximations to production lot sizing with machine breakdowns. Computers and Operations Research 30:1499–1507
Cobb R, (1995) Modeling aircraft repair turntime: simulation supports maintenance marketing. Journal of Air Transport Management 2:25–32


Cohn A, Barnhart C, (2003) Improving crew scheduling by incorporating key maintenance routing decisions. Operations Research 51(3):387–396
Dagpunar J, (1996) A maintenance model with opportunities and interrupt replacement options. Journal of the Operational Research Society 47:1406–1409
Das T, Sarkar S, (1999) Optimal preventive maintenance in a production inventory. IIE Transactions 31:537–551
Dedopoulos L, Shah N, (1995) Preventive maintenance policy optimisation for multipurpose plant equipment. Computers and Chemical Engineering 19:693–698
Dekker R, Budai G, (2002) An overview of techniques used in planning railway infrastructure maintenance. In Geraerds W, Sherwin D, (eds), Proceedings of IFRIMmmm (maintenance management and modelling) conference, Vaxjo University, Sweden, 1–8
Dekker R, Dijkstra M, (1992) Opportunity-based age replacement: exponentially distributed times between opportunities. Naval Research Logistics 39:175–190
Dekker R, Plasmeijer R, (2001) Multi-parameter maintenance optimisation via the marginal cost approach. Journal of the Operational Research Society 52:188–197
Dekker R, Smeitink E, (1991) Opportunity-based block replacement. European Journal of Operational Research 53:46–63
Dekker R, Smeitink E, (1994) Preventive maintenance at opportunities of restricted duration. Naval Research Logistics 41:335–353
Dekker R, van Rijn C, (1996) Prompt - a decision support system for opportunity based preventive maintenance. In Özekici S, (ed) Reliability and Maintenance of Complex Systems, NATO ASI series 154:530–549
Dekker R, Plasmeijer R, Swart J, (1998a) Evaluation of a new maintenance concept for the preservation of highways. IMA Journal of Mathematics applied in Business and Industry 9:109–156
Dekker R, van der Meer J, Plasmeijer R, Wildeman R, (1998b) Maintenance of light-standards - a case-study. Journal of the Operational Research Society 49:132–143
Den Hertog D, van Zante-de Fokkert J, Sjamaar S, Beusmans R, (2005) Optimal working zone division for safe track maintenance in the Netherlands. Accident Analysis and Prevention 37:890–893
Dijkstra M, Kroon L, Salomon M, van Nunen J, van Wassenhoven L, (1994) Planning the size and organization of KLM's aircraft maintenance personnel. Interfaces 24:47–58
Edwards D, Holt G, Harris F, (2002) Predicting downtime costs of tracked hydraulic excavators operating in the UK opencast mining industry. Construction Management & Economics 20:581–591
Esveld C, (2001) Modern Railway Track. MRT-Productions, Zaltbommel, The Netherlands
Feo T, Bard J, (1989) Flight scheduling and maintenance base planning. Management Science 35(12):1415–1432
Finch B, Gilbert J, (1986) Developing maintenance craft labor efficiency through an integrated planning and control system: a prescriptive model. Journal of Operations Management 6(4):449–459
Frost D, Dechter R, (1998) Optimizing with constraints: a case study in scheduling maintenance of electric power units. Lecture Notes in Computer Science 1520:469–488
Geraerds W, (1985) The cost of downtime for maintenance: preliminary considerations. Maintenance Management International 5:13–21
Gharbi A, Kenne J, (2000) Production and preventive maintenance rates control for a manufacturing system: an experimental design approach. International Journal of Production Economics 65:275–287
Gharbi A, Kenne J, (2005) Maintenance scheduling and production control of multiple-machine manufacturing systems. Computers and Industrial Engineering 48:693–707

342

G. Budai, R. Dekker and R. Nicolai

Goel H, Grievink J, Weijnen M, (2003) Integrated optimal reliable design, production, and maintenance planning for multipurpose process plant. Computers and Chemical Engineering 27:1543–1555 Gopalan R, Talluri K, (1998) Mathematical models in airline schedule planning: a survey. Annals of Operations Research 76(1): 155–185 Groenevelt H, Pintelon L, Seidmann A, (1992a) Production batching with machine breakdowns and safety stocks. Operations Research 40(5):959–971 Groenevelt H, Pintelon L, Seidmann A, (1992b) Production lot sizing with machine breakdowns. Management Science 48(1):104–123 Haghani A, Shafahi Y, (2002) Bus maintenance systems and maintenance scheduling: model formulations and solutions. Transportation Research Part A 36:453–482 Higgins A, (1998) Scheduling of railway maintenance activities and crews. Journal of the Operational Research Society 49:1026–1033 Improverail (2002) http://www.tis.pt/proj/improverail/downloads/d6final.pdf (accessed September 26, 2006) Iravani S, Duenyas I, (2002) Integrated maintenance and production control of a deteriorating production system. IIE Transactions 34:423–435 Kenne J, Boukas E, (2003) Hierarchical control of production and maintenance rates in manufacturing systems. Journal of Quality in Maintenance Engineering 9:66–82 Kenne J, Boukas E, Gharbi A, (2003) Control of production and corrective maintenance rates in a multiple-machine, multiple-product manufacturing system. Mathematical and Computer Modelling 38:351–365 Kenne J, Gharbi A, Beit M, (2006) Age-dependent production planning and maintenance strategies in unreliable manufacturing systems with lost sale. Accepted for publication in European Journal of Operational Research 178(2):408–420 Kianfar F, (2005) A numerical method to approximate optimal production and maintenance plan in a flexible manufacturing system. 
Applied Mathematics and Computation 170:924–940 Knight P, Jullian F, Jofre L, (2005) Assessing the “size” of the prize: developing business cases for maintenance improvement projects. Proceedings of the International Physical Asset Management Conference, 284–302 Kralj B, Petrovic R, (1988) Optimal preventive maintenance scheduling of thermal generating units in power systems – a survey of problem formulations and solution methods. European Journal of Operational Research 35:1–15 Kyriakidis E, Dimitrakos T, (2006) Optimal preventive maintenance of a production system with an intermediate buffer. European Journal of Operational Research 168:86–99 Lam K, Rahim M, (2002) A sensitivity analysis of an integrated model for joint determination of economic design of x -control charts, economic production quantity and production run length for a deteriorating production system. Quality and Reliability Engineering International 18:305–320 Langdon W, Treleaven P, (1997) Scheduling maintenance of electrical power transmission networks using genetic programming. In Warwick K, Ekwue A, Aggarwal A, (eds), Artificial intelligence techniques in power systems, Institution of Electrical Engineers, Stevenage, UK, 220–237 Lee H, (2005) A cost/benefit model for investments in inventory and preventive maintenance in an imperfect production system. Computers and Industrial Engineering 48:55–68 Lee H, Rosenblatt M, (1987) Simultaneous determination of production cycle and inspection schedules in a production system. Management Science 33:1125–1137 Lee H, Rosenblatt M, (1989) A production and maintenance planning model with restoration cost dependent on detection delay. IIE Transactions 21(4):368–375

Maintenance and Production: A Review

343

Lee H, Srinivasan M, (2001) A production/inventory policy for an unreliable machine. In Rahim M, Ben-Daya M, (eds) Integrated models in production planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 79–94 Lin G, Gong D, (2006) On a production-inventory system of deteriorating items subject to random machine breakdowns with a fixed repair time. Mathematics and Computer Modelling 43:920–932 Makis V, Fung J, (1995) Optimal preventive replacement, lot sizing and inspection policy for a deteriorating production system. Journal of Quality in Maintenance Engineering, 1(4): 41–55 Moudani WE, Mora-Camino F, (2000) A dynamic approach for aircraft assignment and maintenance scheduling by airlines. Journal of Air Transport Management 6:233–237 Nahmias S, (2005) Production and operations analysis (5th ed). McGraw-Hill, Boston Okamura H, Dohi T, Osaki S, (2001) Computation algorithms of cost-effective EMQ policies with PM. In Rahim M, Ben-Daya M, (eds) Integrated models in production planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 31–65 Pistikopoulos E, Vassiliadis C, Papageorgiou L, (2000) Process design for maintainability: an optimization approach. Computers and Chemical Engineering 24:203–208 Rahim M, (1994) Joint determination of production quantity, inspection schedule, and control chart design. IIE Transactions, 26(6), 2–11 Rahim M, Ben-Daya M, (1998) A generalized economic model for joint determination of production run, inspection schedule and control chart design. International Journal of Production Research 36:277–289 Rahim M, Ben-Daya M, (2001) Joint determination of production quantity, inspection schedule, and quality control for an imperfect process with deteriorating products. Journal of the Operational Research Society 52(12):1370–1378 Rosa L, Feiring B, (1995) Layout problem for an aircraft maintenance company tool room. 
International Journal of Production Economics 40:219–230 Rose G, Bennett D, (1992) Locating and sizing road maintenance depots. European Journal of Operations Research 63:151–163 Sarper H, (1993) Scheduling for the maintenance of completely processed low-demand large items. Applied Mathematical Modelling 17:321–328 Shenoy D, Bhadury B, (1993) MRSRP – a tool for manpower resources and spares requirements planning. Computers and Industrial Engineering 24:421–439 Srinivasan M, Lee H, (1996) Production-inventory systems with preventive maintenance. IIE Transactions 28:879–890 Sriram C, Haghani A, (2003) An optimization model for aircraft maintenance scheduling and re-assignment. Transportation Research Part A 37:29–48 Tagaras G, (1988) An integrated cost model for the joint optimization of process control and maintenance. Journal of the Operational Research Society 39(8):757–766 Tan J, Kramer M, (1997) A general framework for preventive maintenance optimization in chemical process operations. Computers and Chemical Engineering 21(12):1451–1469 Tseng S, (1996) Optimal preventive maintenance policy for deteriorating production systems. IIE Transactions 28:687–694 Van der Duyn Schouten F, Vanneste S, (1995) Maintenance optimization of a production system with buffer capacity. European Journal of Operational Research 82:323–338 Van der Duyn Schouten F, van Vlijmen B, Vos de Wael S, (1998) Replacement policies for traffic control signals. IMA Journal of Mathematics Applied in Business and Industry 9:325–346 Van Dijkhuizen G, (2000) Maintenance grouping in multi-setup multi-component production systems. In Ben-Daya M, Duffuaa M, Raouf A, (eds) Maintenance, Modeling and Optimization, Kluwer Academic Publishers, 283–306

344

G. Budai, R. Dekker and R. Nicolai

Van Zante-de Fokkert J, den Hertog D, van den Berg F, Verhoeven J, (2001) Safe track maintenance for the Dutch Railways, Part II: Maintenance schedule. Technical report, Tilburg University, the Netherlands Vatn J, Hokstad P, Bodsberg L, (1996) An overall model for maintenance optimization. Reliability Engineering and System Safety 51:241–257 Vaurio J, (1999) Availability and cost functions for periodically inspected preventively maintained units. Reliability Engineering and System Safety 63:133–140 Wang C, (2006) Optimal production and maintenance policy for imperfect production systems. Naval Research Logistics 53:151–156 Wang C, Sheu S, (2003) Determining the optimal production-maintenance policy with inspection errors: using a Markov chain. Computers & Operations Research 30:1–17 Weinstein L, Chung C, (1999) Integrating maintenance and production decisions in a hierarchical production planning environment. Computers & Operations Research 26:1059–1074 Wijnmalen D, Hontelez A, (1997) Coordinated condition-based repair strategies for components of a multi-component maintenance system with discounts. European Journal of Operational Research 98:52–63 Yan S, Yang T, Chen H, (2004) Airline short-term maintenance manpower supply planning. Transportation Research Part A 38:615–642 Yao X, Xie X, Fu M, Marcus S, (2005) Optimal joint preventive maintenance and production policies. Naval Research Logistics 52:668–681

14 Delay Time Modelling

Wenbin Wang

14.1 Introduction

In this chapter we present delay time modelling (DTM), a modelling tool that was created to model the problems of inspection maintenance and planned maintenance interventions. This concept provides a modelling framework readily applicable to a wide class of actual industrial maintenance problems of assets in general, and inspection problems in particular. The concept of the delay time was first mentioned by Christer (1976) in the context of building maintenance. It was not until 1984 that the concept was first applied to an industrial maintenance problem (Christer and Waller 1984). Since then, a series of research papers has appeared on the theory and applications of delay time modelling of industrial asset inspection problems; see Christer (1999) for a detailed review.

The delay time concept itself is simple: it defines the failure process of an asset as a two-stage process. The first stage is the normal operating stage, from new to the point at which a hidden defect has been identified. The second stage is defined as the failure delay time, from the point of defect identification to failure. It is the existence of such a failure delay time which provides the opportunity for preventive maintenance to be carried out to remove or rectify the identified defects before failures. With appropriate modelling of the durations of these two stages, optimal inspection intervals can be identified to optimise a criterion function of interest.

The delay time concept is similar in definition to the well known potential failure (PF) interval in reliability centred maintenance (Moubray 1997). It is noted, however, that two differences between these two definitions mark a fundamental difference in modelling maintenance inspection of assets. First, the delay time is random in Christer’s definition while the PF interval is assumed to be constant.
Second, the initial point of defect identification is very important to setting up an appropriate inspection interval, but is ignored by Moubray. Moreover, Moubray did not provide any means of modelling the inspection practice, while DTM provides a rich source of modelling methodologies ranging from the concept to practical solutions.

Asset inspection modelling has long been researched by many others. Among them, the model proposed by Barlow and Proschan (1965) is perhaps the most famous. They consider a unit subject to inspections as follows. The unit is inspected at prespecified times, where each inspection is executed perfectly and instantaneously. The policy terminates with an inspection which detects the unit failure. This implies that the unit may have already failed during an operation interval between inspections, but the failure can only be identified at the forthcoming inspection. Various modifications and extensions to the Barlow and Proschan model have been proposed; see, for example, Thomas et al. (1991), Luss (1983), Abdel-Hameed (1995), Kaio and Osaki (1989) and McCall (1965).

The delay time inspection model differs from the classical Barlow and Proschan model on two counts. First, a failure is identified immediately when it occurs. This is perhaps more rational than the Barlow and Proschan model since, if the system fails, it may have stopped operating and should be observed immediately by the operators. Second, there is a failure delay time in DTM which characterises the abnormal deterioration before failure, which is not defined in the Barlow and Proschan model. It is noted, however, that for a certain class of equipment, such as fire extinguishers, the Barlow and Proschan model is appropriate.

To clarify the objective of the type of inspection modelling we are concerned with here, consider a plant item with an inspection practice every period T (say, weeks or months), with repair of failures undertaken as they arise. The inspection consists of a check list of activities to be undertaken, and a general inspection of the operational state of the plant. Any defect identified leads to immediate repair, and the objective of the inspection is to minimise operational downtime. Other objectives could be considered, for example cost, availability or output. There are other types of inspection activities, such as condition monitoring and preventive maintenance, which are introduced and discussed elsewhere in this book; for now we focus on the inspection practice outlined above using the delay time inspection modelling technique.

This chapter is organised as follows. Section 14.2 gives an outline of the delay time concept. Sections 14.3 and 14.4 introduce two delay time inspection models, for a complex system and for a single component respectively. Section 14.5 discusses the parameter estimation techniques used in DTM. Section 14.6 highlights extensions to the basic delay time model and future research in DTM, and Section 14.7 concludes the chapter.

14.2 The Delay Time Concept

We are interested in the relationship between the performance of assets and inspection intervention, and to capture this, the conventional reliability analysis of time to first failure, or time between failures, requires enrichment. Consider a repairable item of an asset. It could be, say, a component, a machine, a building, or an integrated set of machines forming a production line, but viewed by management as a unit. For now we take a complex system of multiple components as an example; the case of a single component is considered in Section 14.4. The interaction between inspection and equipment performance may be captured using the delay time concept presented below.

Let the item of an asset be maintained on a breakdown basis. The time history of breakdown or failure events is a random series of points; see Figure 14.1. For any one of these failures, the likelihood is that, had the item been inspected at some point just prior to failure, the inspection could have revealed a defect which, though the item was still working, would ultimately lead to a failure. Signals of such defects include excessive vibration, unusual noise, excessive heat, surface staining, smell, reduced output, increased quality variability, etc. The first instance where the presence of a defect might reasonably be expected to be recognised by an inspection, had it taken place, is called the initial point $u$ of the defect, and the time $h$ to failure from $u$ is called the delay time of the defect; see Figure 14.2. Had an inspection taken place in $(u, u + h)$, the presence of a defect could have been noted and corrective actions taken prior to failure. Given that a defect arises, its delay time represents a window of opportunity for preventing a failure.

Clearly, the delay time $h$ is a characteristic of the item concerned, the type of defect, the nature of any inspection, and perhaps the person inspecting. For example, if the item was a vehicle, and the maintenance practice was to respond when the driver reported a problem, then there is in effect a form of continuous monitoring inspection of cab related aspects of the vehicle, with a reasonably long delay time consistent with the rate of deterioration of the defect. However, should the exhaust collapse because a support bracket was corroded through, the likely warning period for the driver, the delay time, would be virtually zero, since he would not normally be expected to look under the vehicle. At the same time, had an inspection been undertaken by a service mechanic, the delay time may have been measured in weeks or months. Had the exhaust collapsed because securing bolts became loose before falling out, then the driver could have had a warning period of excessive vibration, and perhaps noise, and the defects would have had a driver related delay time measured in days or weeks.

Figure 14.1. Failure points ‘●’ (horizontal axis: time)

Figure 14.2. The delay time for a defect (‘○’ initial point u; ‘●’ failure; h is the interval between them)


To see why the delay time concept is of use, consider Figure 14.3, which incorporates the same failure point pattern as Figure 14.1 along with the initial points associated with each failure arising under a breakdown system. Had an inspection taken place at point (A), one defect could have been identified and the seven failures could have been reduced to six. Likewise, had inspections taken place at points (B) and (A), four defects could have been identified and the seven failures could have been reduced to three. Figure 14.3 demonstrates that, provided it is possible to model the way defects arise, that is, the rate of arrival of defects $\lambda(u)$, and their associated delay time $h$, the delay time concept can capture the relationship between the inspection frequency and the number of plant failures. We are assuming for now that inspections are perfect, that is, a defect is recognised if, and only if, it is there, and is removed by corrective action. Delay time modelling is still possible if these assumptions are not valid; this more complex case is discussed in Section 14.3.2.

Figure 14.3. ‘○’ initial points; ‘●’ failure points (inspection points A and B marked on the time axis)
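The counting argument behind Figure 14.3 can be sketched in a few lines of Python. This is only an illustration under the perfect-inspection assumption: each defect is a pair (u, h), and it ends in a failure at u + h unless some inspection falls within its delay time. The function name `count_failures` and the defect and inspection times are hypothetical values chosen for the example; they are not read from the figure.

```python
def count_failures(defects, inspection_times):
    """Count defects that end in failure under perfect inspections.

    defects          -- list of (u, h): initial point and delay time
    inspection_times -- list of times at which an inspection occurs
    A defect is caught (and removed) if some inspection time s satisfies
    u <= s < u + h; otherwise it becomes a failure at time u + h.
    """
    failures = 0
    for u, h in defects:
        caught = any(u <= s < u + h for s in inspection_times)
        if not caught:
            failures += 1
    return failures


# Hypothetical (u, h) pairs, for illustration only.
defects = [(1.0, 2.0), (2.5, 0.5), (4.0, 3.0), (6.0, 1.0)]

breakdown_only = count_failures(defects, [])           # no inspections at all
one_inspection = count_failures(defects, [5.0])        # catches the defect arising at 4.0
two_inspections = count_failures(defects, [2.6, 5.0])  # also catches the first two defects
```

Moving or adding inspection times changes how many delay-time windows are hit, which is exactly the link between inspection frequency and number of failures that the models in the following sections quantify.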

14.3 Delay Time Models for Complex Plant

14.3.1 Perfect Inspections

A complex plant, or multi-component plant, is one where a large number of failure modes arise, and the correction of one defect or failure has nominal impact in the steady state upon the overall plant failure characteristics. Consider the following basic complex plant maintenance modelling scenario where:

1. An inspection takes place every $T$ time units, costs $c_s$ units and requires $d_s$ time units, where $d_s \ll T$.
2. Inspections are perfect in that all (and only) defects present are identified.
3. Defects identified are repaired during the inspection period.
4. Defects arise according to a homogeneous Poisson process (HPP) with rate of occurrence of defects $\lambda$ per unit time.
5. The delay time, $H$, of a random defect is described by a pdf $f(h)$ and cdf $F(h)$, and is independent of the initial point $U$.
6. Failures are repaired immediately at an average cost $c_f$ and downtime $d_f$.
7. The plant has operated sufficiently long since new to be considered effectively in a steady state.
8. Defects and failures only arise whilst the plant is operating.


These assumptions characterise the simplest non-trivial inspection maintenance problem (Christer et al. 1995), and would, of course, only be agreed in any particular case after careful analysis and investigation of the specific situation. We now proceed to construct the mathematical model of the relationship between $T$ and an objective function of interest. From assumptions 1–4, the numbers of system failures over successive inspection intervals are independent and identically distributed, so we can simply study the behaviour of the failure process over one interval, say the first interval $[0, T)$. Suppose for now that we take the expected downtime per unit time, $D(T)$, as our objective function. The relationship between $T$ and $D(T)$ can be established directly by using the renewal reward theorem (Ross 1983) as

$$D(T) = \lim_{t \to \infty} \frac{E(\text{Downtime over } t)}{t} = \frac{d_f E[N_f(T)] + d_s}{T + d_s} \tag{14.1}$$

where $E[N_f(T)]$ is the expected number of failures within $[0, T)$. Clearly, if $E[N_f(T)]$ is available, $D(T)$ can be readily calculated. It can be shown that the failure process shown in Figure 14.3 is a marked Poisson process (Taylor and Karlin 1998), with the delay time $h$ as the mark. It has been proved that this failure process over $[0, T)$ is a nonhomogeneous Poisson process (NHPP) (Taylor and Karlin 1998; Christer and Wang 1995). To derive the rate of occurrence of failures (ROCOF), $\nu(t)$, for this NHPP within $[0, T)$, we start by deriving the expected number of failures within $[0, T)$. Since the expected number of defects arriving within $[t, t + \delta t)$, $0 \le t < T$, is $\lambda \delta t$, the expected number of failures caused by these defects before $T$ is $\lambda F(T - t)\delta t$. Integrating over $t$ from 0 to $T$ and substituting $s = T - t$, we have

$$E[N_f(T)] = \int_0^T \lambda F(T - t)\,dt = \int_0^T \lambda F(t)\,dt \tag{14.2}$$

Differentiating Equation 14.2 with respect to $T$, and writing $t$ for the upper limit, we have

$$\nu(t) = \lambda F(t) \tag{14.3}$$
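As a rough numerical illustration of Equations 14.1 and 14.2, the sketch below evaluates $D(T)$ on a grid of inspection intervals. Everything specific in it is an assumption made for the example: the exponential delay time with rate `theta`, the parameter values, the helper names, and the use of a composite Simpson rule for the integral. It is not the computation behind any figure in this chapter.

```python
import math


def simpson(func, a, b, n=200):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    step = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(a + k * step)
    return total * step / 3.0


def expected_failures(T, lam, F):
    """Equation 14.2: E[N_f(T)] = integral from 0 to T of lam * F(t) dt."""
    return simpson(lambda t: lam * F(t), 0.0, T)


def downtime_per_unit_time(T, lam, F, d_f, d_s):
    """Equation 14.1: D(T) = (d_f * E[N_f(T)] + d_s) / (T + d_s)."""
    return (d_f * expected_failures(T, lam, F) + d_s) / (T + d_s)


# Illustrative values (assumptions): time measured in days.
lam = 2.0                      # defects arriving per day
theta = 0.03                   # rate of the exponential delay time, per day
d_f = d_s = 30.0 / (24 * 60)   # 30 minutes expressed in days


def F(h):
    """cdf of the assumed exponential delay time."""
    return 1.0 - math.exp(-theta * h)


# Evaluate D(T) for T = 0.5, 1.0, ..., 30 days and pick the grid minimum.
grid = [0.5 * k for k in range(1, 61)]
best_T = min(grid, key=lambda T: downtime_per_unit_time(T, lam, F, d_f, d_s))
```

A grid search is used here only because it is transparent; any one-dimensional minimiser could replace it, and a different delay time distribution can be tried by swapping the function `F`.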

The original model developed in Christer and Waller (1984) for Equation 14.2 uses a different approach, but leads to the same result.

14.3.2 Imperfect Inspections

Section 14.3.1 outlined a basic delay time model under perfect inspections. It is established under a set of assumptions, some of which may not be valid in practical situations. These assumptions greatly simplify the mathematics involved but also restrict a wider use of the models developed. Perhaps the most restrictive assumption is that of perfect inspections. In almost all the case studies conducted using the delay time concept, we found that none supported the perfect inspection assumption. The other assumption of concern is the HPP for defect arrival in the case of a complex system. One would naturally expect that, as the system ages, defects arrive more frequently than in a younger system. In this section, we introduce one delay time model that relaxes the perfect inspection assumption. Delay time models using an NHPP for defect arrival are presented in Christer and Wang (1995) and Wang and Christer (2003). These models are mainly developed for complex systems, but a non-perfect inspection single component delay time model can also be developed along similar lines (Baker and Wang 1991). All the assumptions proposed in Section 14.3.1 hold except that of perfect inspection. Assume for now that if a defect is present at an inspection, there is a probability $r$ that the defect is identified. This implies that there is a probability $1 - r$ that the defect goes unnoticed. Figure 14.4 depicts such a process.

Figure 14.4. Failure process of a multi-component system subject to three non-perfect inspections at points A, B, and C; two potential failures were removed and two missed

It has been proved that the failure process over each inspection interval is still an NHPP (Christer and Wang 1995), but it is not identical over the earlier inspection intervals of the system. It can be shown that as the number of inspections increases, the number of failures over each inspection interval becomes stable and identical, so we need to study the asymptotic behaviour of the failure process assuming the number of previous inspections is very large. Let

$i$: index of the $i$-th inspection
$U$: random variable of the initial time $u$
$r$: probability that a defect present at an inspection is identified (probability of a perfect inspection)
$\nu_i(t)$: ROCOF at time $t$, $t \in [(i-1)T, iT)$
$E[N_f((i-1)T, iT)]$: expected number of failures over $[(i-1)T, iT)$
$E[N_s(iT)]$: expected number of defects identified at $iT$

It can be shown (Christer et al. 1995; Christer and Wang 1995) that $\nu_i(t)$ is given by

$$\nu_i(t) = \lambda \sum_{n=1}^{i} (1 - r)^{i-n+1}\,[F(t - (n-1)T) - F(t - nT)] + \lambda F(t - (i-1)T) \tag{14.4}$$

for $t \in [(i-1)T, iT)$.


It can also be proved by induction that $\nu_{i-1}(t) \approx \nu_i(t)$ when $i$ is large. Given Equation 14.4, the expected number of failures over $[(i-1)T, iT)$ is

$$E[N_f((i-1)T, iT)] = \int_{(i-1)T}^{iT} \nu_i(t)\,dt = \int_{(i-1)T}^{iT} \Big\{ \lambda \sum_{n=1}^{i} (1 - r)^{i-n+1}\,[F(t - (n-1)T) - F(t - nT)] + \lambda F(t - (i-1)T) \Big\}\,dt \tag{14.5}$$

The number of defects found at an inspection point, say $iT$, is also a Poisson variable, with mean given by (Christer et al. 1995; Christer and Wang 1995)

$$E[N_s(iT)] = \lambda \sum_{n=1}^{i} (1 - r)^{i-n+1}\, r \int_{(n-1)T}^{nT} [1 - F(iT - u)]\,du + \lambda r \int_{(i-1)T}^{iT} [1 - F(iT - u)]\,du \tag{14.6}$$

The expected downtime per unit time is given by Equation 14.1 with the expected number of failures given by Equation 14.5, so that

$$D(T) = \frac{d_f E[N_f((i-1)T, iT)] + d_s}{T + d_s} \tag{14.7}$$

The use of Equation 14.7 assumes that the system is already in a steady state with $i \to \infty$. For computational purposes we can select a large $i$ and start the sum over $n$ from the first $k$ for which $(1 - r)^{i-k+1} \ge \varepsilon$, where $\varepsilon$ is a very small number. Equation 14.7 is established assuming that the defects identified at an inspection will always be removed without incurring any extra downtime or cost. This assumption can be relaxed. Let $d_r$ be the mean downtime per defect repaired. Then, using the same approach as before, the expected downtime per unit time is given by

$$D(T) = \frac{d_f E[N_f((i-1)T, iT)] + d_s + d_r E[N_s(iT)]}{T + d_s + d_r E[N_s(iT)]} \tag{14.8}$$
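The truncation rule for the sum over $n$ can be made concrete with a short sketch. The function name `first_significant_term` and the sample values of $i$, $r$ and $\varepsilon$ are assumptions chosen for illustration; the code only locates the first index $k$ whose weight $(1 - r)^{i-k+1}$ is still at least $\varepsilon$.

```python
def first_significant_term(i, r, eps):
    """Return the smallest n in 1..i with (1 - r)**(i - n + 1) >= eps.

    Terms with a smaller n carry weights below eps and can be dropped.
    Returns i + 1 if even the n = i term falls below eps (empty sum).
    """
    n = i + 1
    weight = 1.0 - r              # weight of the n = i term
    while n > 1 and weight >= eps:
        n -= 1                    # accept the term whose weight was just checked
        weight *= (1.0 - r)       # weight of the next earlier term
    return n


# Hypothetical settings: 100 past inspections, detection probability 0.7.
k = first_significant_term(100, 0.7, 1e-6)
kept_weights = [(1.0 - 0.7) ** (100 - n + 1) for n in range(k, 101)]
```

Because the weights decay geometrically, only a handful of the most recent inspection intervals contribute once $r$ is moderately large, which keeps the numerical evaluation of the sums cheap even for large $i$.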

If the objective function is the expected cost per unit time, we obtain it by simply replacing the downtime parameters in Equation 14.7 or 14.8 by the corresponding cost parameters.

Example 14.1 Assume that the rate of occurrence of defects is two per day, and the delay time distribution is exponential with scale parameter 0.03 measured in days. The downtime measures are $d_f = 30$ min and $d_s = 30$ min respectively. The probability of a perfect inspection is assumed to be 0.7. Using Equations 14.5 and 14.7, we have the expected downtime against inspection intervals as shown in Figure 14.5. It can be seen from Figure 14.5 that a weekly inspection interval is the best.

Figure 14.5. Expected downtime per unit time vs. inspection interval (in days) (horizontal axis: inspection interval; vertical axis labelled ‘Expected cost per unit time’ in the original plot)

14.4 Delay Time Model for a Component Subject to a Single Failure Mode (Single Component System)

Most DTM applications are for multiple component systems subject to independent failure modes. Although most maintained equipment falls into this category, there are plant items which have a single dominant failure mode and may, in some cases, be replaced or renewed upon failure. Examples of such plant items are batteries, traffic lights, small pumps and motors. Such plant items are called single component systems. Note that a system in this category may not actually be a single component; the key difference compared with a complex multi-component system is that a single component system is subject to a single failure mode, and the only maintenance action is to renew the whole system either by a complete replacement or a renewal type of repair. This implies that at any point of time, only one defect of the dominant failure mode can exist. This contrasts with a complex system with many failure modes, where only the failed component is replaced or repaired upon a failure, many defects may be present at any point of time, and the system is not renewed at failures. The failure process of this type of single plant item is different from that of a multi-component complex system; see Figures 14.6 and 14.7.

Figure 14.6. Failure process of a multi-component system, where ‘○’ denotes initial points; ‘●’ failure points

Figure 14.7. Failure process of a single component system

For the system in Figure 14.6, the system may be renewed at inspection points if these inspections are perfect and the rate of arrival of defects is constant. For the system in Figure 14.7, however, the system can be renewed either at a failure or at an inspection. We present the case with a perfect inspection assumption. The case of an imperfect inspection delay time model for a single component can be found in Baker and Wang (1991, 1993). We need the following additional assumptions and notation:

1. The system is renewed at either a failure repair or at a repair done at an inspection if a defect is identified.
2. After either a failure renewal or inspection renewal the inspection process re-starts.
3. The initial time, $U$, to the appearance of a random defect has a probability density function $g(u)$.
4. The defective component identified at an inspection will be renewed either by a repair or a replacement at an average cost of $c_r$ and downtime $d_r$.

14.4.1 Inspection Model Based on an Exponentially Distributed Initial Time

We first consider the simple case in which an inspection renews the system regardless of whether a defect was identified or not. This effectively assumes an exponential distribution for the initial time $U$. Since each failure or inspection renews the system with associated downtimes or costs, the process is a renewal reward process, and the long term expected cost per unit time, $C(T)$, is given by Ross (1983):

$$C(T) = \frac{E(CC)}{E(CL)}$$

where $CC$ is the renewal cycle cost and $CL$ is the renewal cycle length, which is the interval between two consecutive renewals. There could be two different renewal cycles: one is the failure renewal and the other is the inspection renewal. Taking the expected cost per renewal cycle as an example, since a failure costs $c_f$ and happens with probability $P(X < T)$, the expected cost due to a failure renewal within $T$ is

$$c_f P(X < T) = c_f \int_0^T g(u) F(T - u)\,du \tag{14.9}$$

where $X$ is the time to failure.


The expected cost due to an inspection renewal with a defect identified at $T$ is

$$(c_r + c_s) P(U < T \cap X \ge T) = (c_r + c_s) \int_0^T g(u)\{1 - F(T - u)\}\,du \tag{14.10}$$

and finally the expected cost due to an inspection renewal without a defect being identified at $T$ is given by

$$c_s P(U \ge T) = c_s \int_T^{\infty} g(u)\,du \tag{14.11}$$

From Equations 14.9–14.11 we have the expected cost per renewal cycle:

$$E(CC) = c_f \int_0^T g(u) F(T - u)\,du + (c_r + c_s) \int_0^T g(u)\{1 - F(T - u)\}\,du + c_s \int_T^{\infty} g(u)\,du \tag{14.12}$$

As to the expected cycle length, there are two possibilities. The first is that the cycle ends at a failure before $T$. Let $p(t)$ denote the density function of the time to failure, which is given by

$$p(t) = \frac{d}{dt} P(X \le t) = \int_0^t g(u) f(t-u)\,du.$$

The second possibility is that no failure occurs before $T$, which implies an inspection renewal at $T$. This has probability $1 - P(X < T) = 1 - \int_0^T g(u) F(T-u)\,du$, so we have

$$E(CL) = \int_0^T t \int_0^t g(u) f(t-u)\,du\,dt + T\left(1 - \int_0^T g(u) F(T-u)\,du\right). \tag{14.13}$$

For the detailed derivation of Equations 14.9–14.13 see Baker and Wang (1991, 1993). Finally, the expected cost per unit time is given by

$$C(T) = \frac{c_f \int_0^T g(u) F(T-u)\,du + (c_r + c_s) \int_0^T g(u)\{1 - F(T-u)\}\,du + c_s \int_T^{\infty} g(u)\,du}{\int_0^T t \int_0^t g(u) f(t-u)\,du\,dt + T\left(1 - \int_0^T g(u) F(T-u)\,du\right)}. \tag{14.14}$$

The expected downtime can be obtained in a similar manner.

Delay Time Modelling


Example 14.2 Assume that both the initial time and the delay time distributions are exponential, with scale parameters 0.6 and 0.75 respectively. The time unit is 100 days and the cost parameter values are $c_f$ = £1000, $c_r$ = £150 and $c_s$ = £15. Using Equation 14.14, the calculated expected cost per unit time as a function of $T$ is shown in Figure 14.8.

[Figure 14.8. Expected cost per unit time vs. inspection interval. Horizontal axis: inspection interval; vertical axis: expected cost per unit time.]
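As a rough numerical check on Example 14.2, Equation 14.14 can be evaluated directly. The sketch below is illustrative only: it assumes that the quoted parameters 0.6 and 0.75 are exponential rates per time unit (the text calls them scale parameters), and the helper `simpson`, the grid of candidate intervals and all variable names are hypothetical choices, not part of the original study.

```python
import math

def simpson(fn, a, b, n=60):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = fn(a) + fn(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * fn(a + k * h)
    return total * h / 3.0

# Assumption: 0.6 and 0.75 are rates per time unit (1 time unit = 100 days).
LAM, BETA = 0.6, 0.75
C_F, C_R, C_S = 1000.0, 150.0, 15.0

def g(u):
    """pdf of the initial time U."""
    return LAM * math.exp(-LAM * u)

def f(h):
    """pdf of the delay time."""
    return BETA * math.exp(-BETA * h)

def F(h):
    """cdf of the delay time."""
    return 1.0 - math.exp(-BETA * h)

def expected_cost_rate(T):
    """Equation 14.14: expected cycle cost divided by expected cycle length."""
    p_fail = simpson(lambda u: g(u) * F(T - u), 0.0, T)            # P(X < T)
    p_defect = simpson(lambda u: g(u) * (1.0 - F(T - u)), 0.0, T)  # P(U < T, X >= T)
    p_clean = math.exp(-LAM * T)                                   # P(U >= T)
    cycle_cost = C_F * p_fail + (C_R + C_S) * p_defect + C_S * p_clean  # Eq. 14.12

    def p(t):
        # density of the time to failure, p(t)
        return simpson(lambda u: g(u) * f(t - u), 0.0, t)

    cycle_length = simpson(lambda t: t * p(t), 0.0, T) + T * (1.0 - p_fail)  # Eq. 14.13
    return cycle_cost / cycle_length

grid = [round(0.05 * k, 2) for k in range(2, 41)]   # T = 0.10, 0.15, ..., 2.00
costs = [expected_cost_rate(T) for T in grid]
best_T = grid[costs.index(min(costs))]
print(best_T, min(costs))
```

Under these assumptions the cost curve is fairly flat near its minimum, so the interval reported depends on the resolution of the grid.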

The optimal inspection interval is 0.4 × 100 = 40 days, so an approximately monthly inspection schedule is appropriate.

14.4.2 Inspection Model Based on a Non-exponentially Distributed Initial Time

If the initial time is not exponentially distributed, then we cannot assume that an inspection renews the system unless a defect is identified at the inspection and the system is replaced or repaired to an as-new condition. In this case a renewal cycle may span several inspection intervals. Using a similar framework as before, and now taking the expected downtime per renewal cycle as an example, the expected downtime due to a failure renewal at time $X$, where $X \in [(i-1)T, iT)$, is

$$[(i-1)d_s + d_f]\,P((i-1)T < X < iT) = [(i-1)d_s + d_f] \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du. \tag{14.15}$$

This is because inspections are perfect, so that if a failure occurs at time $X$, the initial time $U$ must lie within $[(i-1)T, X)$ with $X < iT$. There are $(i-1)$ inspections with no defect identified before the failure, so $(i-1)$ inspection downtimes are added. Equation 14.15 models only one of the possibilities; a failure can occur in any of the inspection intervals, so summing over all possible intervals $i$ from 1 to infinity gives the expected downtime due to a failure:


$$\sum_{i=1}^{\infty} [(i-1)d_s + d_f]\,P((i-1)T < X < iT) = \sum_{i=1}^{\infty} [(i-1)d_s + d_f] \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du. \tag{14.16}$$

Equation 14.16 is always finite, since the probability terms tend to zero for large $i$ because $g(u)$ tends to zero for $u > (i-1)T$ when $i$ is large. Similarly, the expected downtime due to an inspection renewal with a defect identified is

$$\sum_{i=1}^{\infty} [(i-1)d_s + d_r] \int_{(i-1)T}^{iT} g(u)[1 - F(iT-u)]\,du. \tag{14.17}$$

Summing Equations 14.16 and 14.17 gives the complete expected downtime per renewal cycle:

$$E(CD) = \sum_{i=1}^{\infty} \left\{ [(i-1)d_s + d_r] \int_{(i-1)T}^{iT} g(u)\,du + (d_f - d_r) \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du \right\}. \tag{14.18}$$

The expected cycle length is obtained in a similar manner and is given by

$$E(CL) = \sum_{i=1}^{\infty} \left\{ \int_{(i-1)T}^{iT} t \int_{(i-1)T}^{t} g(u) f(t-u)\,du\,dt + iT \int_{(i-1)T}^{iT} g(u)\{1 - F(iT-u)\}\,du \right\}. \tag{14.19}$$

Finally, the expected downtime per unit time is given by

$$C(T) = \frac{\sum_{i=1}^{\infty} \left\{ [(i-1)d_s + d_r] \int_{(i-1)T}^{iT} g(u)\,du + (d_f - d_r) \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du \right\}}{\sum_{i=1}^{\infty} \left\{ \int_{(i-1)T}^{iT} t \int_{(i-1)T}^{t} g(u) f(t-u)\,du\,dt + iT \int_{(i-1)T}^{iT} g(u)[1 - F(iT-u)]\,du \right\}}. \tag{14.20}$$

14.4.3 A Case Example

The medical physics department of a teaching hospital in England, which maintains a large number of items of medical equipment, records the history of breakdowns and repairs using history cards for each individual item of departmental equipment. The information available included purchase date, dates of preventive maintenance, failures and some description of the work carried out. No costs were recorded, but some estimated cost values were provided by the hospital staff.


Following a discussion with the chief technician, it seemed best to focus on the following items, to ensure a sample of similar machine types, under heavy and constant use, with a usefully long history of failures, and with reasonably well-defined modes of failure. Two types of pump were chosen, namely volumetric infusion pumps and peristaltic pumps, all from the intensive-care, neurosurgery and heart-care units. There were 105 volumetric pumps, and the most frequent failure mode was the failure of the pressure transducer. There were 35 peristaltic pumps, and the most frequent failure mode was battery failure. For a detailed description of the case, data and model fitting see Baker and Wang (1991). Several distributions were tried for the initial and delay time distributions for both pumps, and it turned out that in both cases a Weibull distribution was the best for the initial time distribution and an exponential distribution for the delay time distribution. The parameter values estimated from the history data using the maximum likelihood method are shown for both pumps in Table 14.1.

Table 14.1. Estimated parameter values for the pumps

Pump                  Initial time pdf                                                        Delay time pdf
                      $g(u) = \alpha\eta(\alpha u)^{\eta-1} e^{-(\alpha u)^{\eta}}$           $f(h) = \beta e^{-\beta h}$
Volumetric infusion   $\hat{\alpha} = 0.0017$, $\hat{\eta} = 1.42$                            $\hat{\beta} = 0.0174$
Peristaltic           $\hat{\alpha} = 0.0007$, $\hat{\eta} = 2.41$                            $\hat{\beta} = 0.0093$

Although the cost data were not recorded, it was relatively easy to estimate the cost of an inspection (called preventive maintenance in the hospital) and the cost of an inspection repair if a defect was identified. However, it was extremely difficult to obtain an estimate for the failure cost, since if a pump failed to work when needed the penalty cost could be very high compared with the cost of the pump itself. Nevertheless, some estimates were provided, which are shown in Table 14.2.

Table 14.2. Cost estimates

Pump                  Inspection cost   Inspection repair cost   Failure cost
Volumetric infusion   £15               £50                      £2000
Peristaltic           £15               £70                      £1000

This time we cannot derive an analytical formula for the expected cost because of the use of the Weibull distribution. Numerical integration has to be used to evaluate the expected cost per unit time, which takes the form of Equation 14.20 with costs in place of downtimes. We did this using the mathematical software package MathCad, and the results are shown in Figures 14.9 and 14.10.
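The numerical integration can be sketched in a few lines of Python. The sketch below is not the original MathCad calculation. It assumes the cost version of Equation 14.20 is obtained by replacing $d_s$, $d_r$ and $d_f$ with the inspection, inspection repair and failure costs of Table 14.2, it uses the parameter estimates of Table 14.1, and it truncates the infinite sums once the probability that no defect has yet arisen is negligible. The helper `simpson`, the function `cost_rate`, the tolerance and the grid of candidate intervals are hypothetical choices.

```python
import math

def simpson(fn, a, b, n=12):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = fn(a) + fn(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * fn(a + k * h)
    return total * h / 3.0

def weibull_pdf(u, alpha, eta):
    """Initial time pdf in the Weibull form of Table 14.1."""
    if u <= 0.0:
        return 0.0
    z = alpha * u
    return alpha * eta * z ** (eta - 1.0) * math.exp(-(z ** eta))

def cost_rate(T, alpha, eta, beta, c_s, c_r, c_f, tol=1e-9):
    """Cost analogue of Equation 14.20 (an assumption), with truncated sums."""
    g = lambda u: weibull_pdf(u, alpha, eta)
    f = lambda h: beta * math.exp(-beta * h)     # exponential delay time pdf
    F = lambda h: 1.0 - math.exp(-beta * h)      # exponential delay time cdf
    num = den = 0.0
    i = 1
    # Stop once P(U > (i-1)T) is below tol, so later terms are negligible.
    while math.exp(-((alpha * (i - 1) * T) ** eta)) > tol:
        a, b = (i - 1) * T, i * T
        p_defect = simpson(g, a, b)                           # defect arises in interval i
        p_fail = simpson(lambda u: g(u) * F(b - u), a, b)     # ... and fails before iT
        num += ((i - 1) * c_s + c_r) * p_defect + (c_f - c_r) * p_fail
        fail_time = simpson(
            lambda t: t * simpson(lambda u: g(u) * f(t - u), a, t), a, b)
        den += fail_time + b * (p_defect - p_fail)
        i += 1
    return num / den

pumps = {
    "volumetric infusion": dict(alpha=0.0017, eta=1.42, beta=0.0174,
                                c_s=15.0, c_r=50.0, c_f=2000.0),
    "peristaltic": dict(alpha=0.0007, eta=2.41, beta=0.0093,
                        c_s=15.0, c_r=70.0, c_f=1000.0),
}
intervals = list(range(10, 190, 10))   # candidate inspection intervals in days
curves = {name: [cost_rate(T, **p) for T in intervals] for name, p in pumps.items()}
for name, curve in curves.items():
    print(name, intervals[curve.index(min(curve))], round(min(curve), 3))
```

Because the cost curves are flat near their minima and the cost formula here is an assumed analogue, the printed intervals should be compared with the reported optima rather than read as a reproduction of them.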


[Figure 14.9. Expected cost per unit time vs. inspection interval for the volumetric infusion pump. Horizontal axis: T; vertical axis: C(T).]

[Figure 14.10. Expected cost per unit time vs. inspection interval for the peristaltic pump. Horizontal axis: T; vertical axis: C(T).]

Time is given in days in Figures 14.9 and 14.10, so the optimal inspection interval for the volumetric infusion pump is about 30 days and for the peristaltic pump is around 70 days. The hospital at the time checked the pumps at an interval of six months, so clearly for both pumps the inspection intervals should be shortened. However, it has to be pointed out that the model is sensitive to the failure cost, and had a different estimate been provided, the recommendation would have been different.


14.5 Delay Time Model Parameter Estimation

14.5.1 Introduction

In previous sections, delay time models for both a complex system and a single component have been introduced. In a practical situation, however, before expected cost or downtime models can be constructed, it is necessary to estimate the values of the parameters that characterise the defect arrival and failure processes. In this section we discuss various methods developed to estimate the parameters from either ‘subjective’ data in the form of experts' opinions or ‘objective’ data collected at failures and inspections. Naturally, the parameter estimation process is not the same for the different types of delay time model: in single component models a single potential failure mode is modelled and at most one defect may be present at any one time, whereas in complex system models many defects can exist simultaneously and many failures can occur in the interval between inspections. This distinction is particularly important for the method using objective data. In this section, we mainly focus on the estimation methods for complex systems, since these systems are the asset items to which DTM is most applicable. The details of the approaches developed for parameter estimation for a single component DTM can be found in Baker and Wang (1991, 1993).

14.5.2 Subjective Data Method

If the maintenance records of failures and the recorded findings at maintenance interventions such as inspections (collectively called objective data in this chapter) are available and sufficient in quantity and quality, the delay time distribution and parameters can be estimated by the classical statistical method of maximum likelihood; see Section 14.5.3 and the paper by Christer et al. (1995). If, however, such a data set does not exist, or is insufficient in quality or quantity for the purpose of estimation, the alternative is to use the subjective judgement of experienced maintenance engineers or technicians to obtain the delay time distribution and parameters. This section introduces three methods, developed by Christer and Waller (1984), Wang (1997) and Wang and Jia (2007), for estimating the delay time distribution and the associated parameters using subjective data.

14.5.2.1 Subjective Estimation of the Delay Times Through an On-site and On-spot Survey

This method needs to be carried out over a period of time to collect detailed information and assessments at every maintenance intervention or failure (Christer and Waller 1984). At every failure repair, the maintenance technician repairing the plant would be asked to estimate:

HLA: how long ago the defect causing the failure might first have been expected to be recognised at an inspection.

If a defect was identified at an inspection, then in addition to HLA, the technician would be asked to estimate:


HML: how much longer the defect could be left unattended before a repair became essential.

The delay time estimates are given by $\hat{h}$ = HLA for a failure, and $\hat{h}$ = HLA + HML for an inspection repair; see Figure 14.11a,b. $f(h)$ is then estimated from the data $\{\hat{h}\}$.

[Figure 14.11. HLA and HML estimates at (a) a failure and (b) an inspection]
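The bookkeeping behind these estimates is simple enough to sketch. In the illustration below the survey records are invented, and the final step assumes an exponential delay time (whose maximum likelihood rate is the reciprocal of the sample mean) purely for the sake of example; the text does not prescribe a particular distribution at this point.

```python
from statistics import mean

# Hypothetical survey records in days; not data from any of the cited case studies.
records = [
    {"event": "failure", "HLA": 12.0},
    {"event": "inspection", "HLA": 5.0, "HML": 20.0},
    {"event": "failure", "HLA": 30.0},
    {"event": "inspection", "HLA": 8.0, "HML": 10.0},
]

def delay_time_estimate(record):
    """h_hat = HLA at a failure, HLA + HML at an inspection repair."""
    if record["event"] == "failure":
        return record["HLA"]
    return record["HLA"] + record["HML"]

h_hat = [delay_time_estimate(r) for r in records]

# Assumption: exponential delay time, so the ML estimate of the rate is 1 / mean.
beta_hat = 1.0 / mean(h_hat)
print(h_hat, beta_hat)
```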

At the time of repair, the maintenance technician has information available on which to base the estimate. In addition to the technician's experience, the defect is present, the plant may be examined, and operatives may be questioned. The rate of defect arrivals can be estimated directly from the number of observed failures and defects identified over the survey period. For a case study using this approach for estimating delay time model parameters, see Christer and Waller (1984).

14.5.2.2 Subjective Estimation of the Delay Times Based on Identified Failure Modes

The method introduced earlier is a questionnaire survey based approach in which the subjective opinions of maintenance engineers are sought. It has the advantage that the engineer is directly facing the defect or failure when the information regarding the delay time is requested. However, it also has the following problems: (a) conducting such a survey is time consuming, particularly when the frequency of failures or defects is not high, which implies a longer time to obtain sufficient data; (b) the estimation process is not easy to control, since all the forms are left in the hands of the maintenance engineers involved without an analyst present, which may result in confusion and mistakes, as experienced in the studies of Christer and Waller (1984) and Christer et al. (1998b). Wang (1997) recommended a new approach to estimate the delay time distribution directly, based on pre-defined major failure modes or types. The idea is as follows:

1. If the estimates can be made for pre-selected major failure types instead of for each individual failure or defect when it occurs, the time spent on the questionnaire survey will be greatly reduced, since the estimates for all major failure types can be made at the same time, which may take only a few hours. This also creates the opportunity for an analyst to be present to reduce possible confusion and mistakes.
2. A group of experts should be questioned on the same failure type, and their opinions can be properly combined to reduce sampling errors.
3. The question asked should be a probabilistic measure of the delay time over all possible ranges.


The following phases for estimating the delay time were suggested by Wang (1997).

The problem identification phase

This phase identifies all major failure types and the possible causes of the failures. This is normally done via a failure mode and criticality analysis so that a list of dominant failures can be obtained. The process entails a series of discussions with the maintenance engineers to clarify any hidden issues. If some failure data exist, they should be used to validate the list; otherwise a questionnaire should be designed and forwarded to the person concerned to obtain a list of dominant failure types.

Expert identification and choice phase

The term ‘expert’ is not defined by any quantitative measure of resident knowledge. However, it is clear in the case here that a person who is regarded by others as being one of the most knowledgeable about the machine should be chosen as the expert. The shop floor fitters or any maintenance technicians or engineers who maintain the machine would be the desired experts (Christer and Waller 1984). After the set of experts is identified, a choice is made of which experts to use in the study. Full discussion with management is necessary in order to select the persons who know the machine ‘best’. On psychological grounds, no fewer than three and no more than five experts should take part in the exercise.

The question formulation phase

The quantities we want to ask about in this case are the rate of occurrence of defects (assuming we are modelling a complex plant) and the delay time distribution. For the rate of arrival of a defect type, we can simply ask for a point estimate since it is not a random variable. Without maintenance interventions, this rate would, in the long term, be equal to the average number of failures of the same type per unit time. For example, we may ask ‘how many failures of this type will occur per year, month, week or day?’. It is noted that this quantity is usually observable. In fact, our focus is mainly on the delay time estimates. Given the amount of uncertainty inherent in making a prediction of the delay time, the experts may feel uncomfortable about giving a point estimate, and may prefer to communicate something about the range of their uncertainty. Accepting these points, perhaps the best that experts could do in this case would be to give their subjective probability mass function for the quantity in question. In other words, they could provide a histogram over intervals of the delay time such that the mass above each interval is proportional to their subjective probability for that interval. Alternatively, three point estimates can be requested, such as the most likely, the minimum and the maximum durations of the delay time for a particular type of failure. The term ‘delay time’ was not used in the question, since it takes some effort to explain what the delay time is. Instead, we asked a question similar to HLA. Even so, our case experience showed that this question was still difficult for the experts to understand. The lesson learned is to work through one example with them before starting the session.


The elicitation phase

Elicitation should be performed with each expert individually. If possible, the analyst should be present, which proved to be vital in our case studies. The above-mentioned histogram was used to draw the answers from the experts, so that the experts have a visual overview of their estimates and a smooth histogram can be achieved if the experts are advised accordingly. The maximum number of histogram intervals is set to five, as advised by psychological experiments.

The calibration phase

Roughly speaking, calibration is intended to measure the extent to which a set of probability mass functions ‘correspond to reality’. Having reviewed the problem, we concluded that subjective calibration is not recommended because it is time consuming. If any objective data are available, we may calibrate the experts' opinions by a Bayesian approach, as discussed by many others. Another approach is to calibrate the estimates by matching an observed statistic. If a significant difference is found, the estimates must be revised.

The combination phase

Expert resolution, or combining probabilities from experts, has received some attention. Here we use one of the simplest approaches, namely the weighting method, which is simply a weighted average of the estimates of all experts. The weights need to be selected carefully according to each expert's level of expertise, and they should sum to one. Other more complicated methods are available; see Wang (1997). It is noted that the combined delay time distribution obtained from this phase is in the form of a discrete probability distribution, whereas a continuous delay time distribution is needed in delay time inspection modelling.
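The weighting method can be illustrated with a short sketch. The histograms, interval edges and weights below are invented for illustration, and the function name `combine` is hypothetical; only the weighted-average rule and the requirement that the weights sum to one come from the text.

```python
# Hypothetical elicited histograms: probability mass per delay time interval.
interval_edges = [0, 5, 10, 20, 40, 80]      # days; five intervals, as advised
expert_pmfs = [
    [0.10, 0.30, 0.40, 0.15, 0.05],
    [0.20, 0.40, 0.25, 0.10, 0.05],
    [0.05, 0.25, 0.40, 0.20, 0.10],
]
weights = [0.5, 0.3, 0.2]                    # one weight per expert

def combine(pmfs, weights):
    """Weighted average of the experts' probability mass functions."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    n_bins = len(pmfs[0])
    return [sum(w * pmf[k] for w, pmf in zip(weights, pmfs)) for k in range(n_bins)]

combined = combine(expert_pmfs, weights)
for lo, hi, mass in zip(interval_edges[:-1], interval_edges[1:], combined):
    print(f"[{lo}, {hi}) days: {mass:.3f}")
```

Since each expert's histogram sums to one and the weights sum to one, the combined histogram is again a probability mass function.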
To obtain such a continuous distribution, based upon the number of delay times in each interval, an estimated continuous delay time distribution $\hat{F}(h)$ of $F(h)$ can be obtained by fitting a distribution from a known family of failure distributions, such as the exponential or Weibull, using the least squares method or the maximum likelihood method.

The updating phase

This phase applies mainly after some failure data and recorded findings become available. In a sense it is a way of calibrating. A case study using the above method is detailed in Akbarov et al. (2006).

14.5.2.3 An Empirical Bayesian Approach for Estimating the DTM Parameters Based on Subjective Data

In the previous subjective data based delay time estimation approaches (Christer and Waller 1984; Wang 1997; Akbarov et al. 2006), some direct subjective estimates of the delay time are required. These have been found to be extremely difficult for the experts to provide, since the delay time is not usually observable and is difficult to explain (Akbarov et al. 2006). We now introduce a recently developed approach which starts with subjective data and then updates the estimates when objective data become available. The initial estimates are made using the empirical Bayesian method, matching a few subjective summary statistics provided by the experts. These statistics should be designed so that they are easy to obtain from the experience of the experts and from observed practice, rather than relying on unobservable delay times. The updating mechanism then enters the process when objective data become available, which requires a repeated evaluation of the likelihood function, to be introduced later.

In the framework of Bayesian statistics, and assuming that no objective data are available at the beginning, we first assume a prior on the parameters which characterise the underlying defect and failure arrival processes. When objective data become available, we calculate the joint posterior distribution of the parameters, and then we may use this posterior distribution to evaluate the expected cost or downtime per unit time conditional on the observed data. Assume for now that we are interested in the rate of arrival of defects, $\lambda$, and the delay time pdf, $f(h)$, which is characterised by a two-parameter distribution $f(h \mid \alpha, \beta)$. Unlike the methods proposed in Christer and Waller (1984) and Wang (1997), here we treat the parameters $\lambda$, $\alpha$ and $\beta$ as random variables. The classical Bayesian approach is used to define the prior distributions for the model parameters $\lambda$, $\alpha$ and $\beta$ as $f(\lambda \mid \Phi_\lambda)$, $f(\alpha \mid \Phi_\alpha)$ and $f(\beta \mid \Phi_\beta)$, where $\Phi_\bullet$ is the set of hyper-parameters within $f(\bullet \mid \Phi_\bullet)$. Once the $\Phi_\bullet$ are available, the point estimates of $\lambda$, $\alpha$ and $\beta$ are their expected values, given by

$$\hat{\lambda} = \int_0^{\infty} \lambda f(\lambda \mid \Phi_\lambda)\,d\lambda, \quad \hat{\alpha} = \int_0^{\infty} \alpha f(\alpha \mid \Phi_\alpha)\,d\alpha \quad \text{and} \quad \hat{\beta} = \int_0^{\infty} \beta f(\beta \mid \Phi_\beta)\,d\beta.$$

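Let $g(\lambda, \alpha, \beta)$ denote a statistic of interest, which may be a function of $\lambda$, $\alpha$ and $\beta$, say the mean number of failures within an inspection interval, and let $E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)]$ denote its expected value in terms of $\Phi_\lambda$, $\Phi_\alpha$ and $\Phi_\beta$. Then we have

$$E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = \int_0^{\infty}\!\int_0^{\infty}\!\int_0^{\infty} g(\lambda, \alpha, \beta) f(\lambda \mid \Phi_\lambda) f(\alpha \mid \Phi_\alpha) f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta. \tag{14.21}$$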

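If we can obtain a subjective estimate of $E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)]$ provided by the experts, denoted by $g_s$, then letting $E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = g_s$, we have

$$g_s = \int_0^{\infty}\!\int_0^{\infty}\!\int_0^{\infty} g(\lambda, \alpha, \beta) f(\lambda \mid \Phi_\lambda) f(\alpha \mid \Phi_\alpha) f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta. \tag{14.22}$$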

Equation 14.22 is only one such equation; if several different subjective estimates are provided, we obtain a set of equations of the same form. The hyper-parameters $\Phi_\bullet$ may be estimated by solving these equations, provided that the number of equations is at least the same as the number of hyper-parameters in $\Phi_\bullet$. We now demonstrate this in our case. Suppose that the experts can provide us with the following subjective statistics for estimating $\Phi_\lambda$:


• The average number of failures within $[0, T)$, denoted by $n_f$
• The average number of defects identified at inspection time $T$, denoted by $n_d$
• The average probability of no defect at all in $[0, T)$, denoted by $p_{nd}$.

In this case, if the statistic of interest is the average number of defects within $[0, T)$, we have from the property of the HPP that $g(\lambda, \alpha, \beta) = \lambda T$, and then

$$E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = \int_0^{\infty}\!\int_0^{\infty}\!\int_0^{\infty} \lambda T f(\lambda \mid \Phi_\lambda) f(\alpha \mid \Phi_\alpha) f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta = \int_0^{\infty} \lambda T f(\lambda \mid \Phi_\lambda)\,d\lambda.$$

Since $g_s = n_f + n_d$ if inspection is perfect, it follows from Equation 14.22 that

$$n_f + n_d = \int_0^{\infty} \lambda T f(\lambda \mid \Phi_\lambda)\,d\lambda. \tag{14.23}$$

Similarly, from the property of the HPP that $P(N_d(0,T) = n \mid \lambda) = \dfrac{e^{-\lambda T}(\lambda T)^n}{n!}$, we have

$$p_{nd} = \int_0^{\infty} P(N_d(0,T) = 0 \mid \lambda) f(\lambda \mid \Phi_\lambda)\,d\lambda = \int_0^{\infty} e^{-\lambda T} f(\lambda \mid \Phi_\lambda)\,d\lambda, \tag{14.24}$$

where $N_d(0, T)$ is the number of defects in $[0, T)$. If we have only two hyper-parameters in $\Phi_\lambda$, then solving Equations 14.23 and 14.24 simultaneously in terms of $\Phi_\lambda$ gives the estimated values of the hyper-parameters in $\Phi_\lambda$. Note that $\lambda$ is independent of $\alpha$ and $\beta$, so that the integrals of $f(\alpha \mid \Phi_\alpha)$ and $f(\beta \mid \Phi_\beta)$ drop out of Equation 14.21. Similarly, if more subjective estimates are provided, the hyper-parameters in $\Phi_\alpha$ and $\Phi_\beta$ can be obtained. For a detailed description of this approach to estimating delay time model parameters see Wang and Jia (2007). This approach improves on the previously developed subjective methods both in the way the data are obtained and in the accuracy of the estimated parameters. It is also naturally linked, via Bayes' theorem, to the objective method for estimating DTM parameters presented in the next section, if such objective data become available (Wang and Jia 2007).

14.5.3 Objective Data Method

Objective data for complex systems under regular inspections should consist of the failures (and associated times) in each interval of operation between inspections and the number of defects found in the system at each inspection. From these data, we estimate the parameters for the chosen form of the delay time model.


Initially, we consider a simple case of the estimation problem for the basic delay time model, where only the number of failures, $m_i$, occurring in each cycle $[(i-1)T, iT)$ and the number of defects found and repaired, $j_i$, at each inspection (at time $iT$) are required. We do not know the actual failure times within the cycles. The probability of observing $m_i$ failures in $[(i-1)T, iT)$ is

$$P(N_f((i-1)T, iT) = m_i) = \frac{e^{-E[N_f((i-1)T, iT)]}\, E[N_f((i-1)T, iT)]^{m_i}}{m_i!}. \tag{14.25}$$

Similarly, the probability of removing $j_i$ defects at inspection $i$ (at time $iT$) is

$$P(N_s(iT) = j_i) = \frac{e^{-E[N_s(iT)]}\, E[N_s(iT)]^{j_i}}{j_i!}. \tag{14.26}$$

As the observations are independent, the likelihood of observing the given data set is just the product of the Poisson probabilities of observing each cycle of data, $m_i$ and $j_i$. As such, the likelihood function for $K$ intervals of data is

$$L(\Theta) = \prod_{i=1}^{K} \left\{ \left( \frac{e^{-E[N_f((i-1)T, iT)]}\, E[N_f((i-1)T, iT)]^{m_i}}{m_i!} \right) \left( \frac{e^{-E[N_s(iT)]}\, E[N_s(iT)]^{j_i}}{j_i!} \right) \right\}, \tag{14.27}$$

where $\Theta$ is the set of parameters within the delay time model. The likelihood function is maximised with respect to the parameters to obtain the estimated values. This process can be simplified by taking natural logarithms. The log-likelihood function is

$$\ell(\Theta) = \sum_{i=1}^{K} \left( m_i \log\{E[N_f((i-1)T, iT)]\} + j_i \log\{E[N_s(iT)]\} - E[N_f((i-1)T, iT)] - E[N_s(iT)] \right) - \sum_{i=1}^{K} \left( \log(m_i!) + \log(j_i!) \right), \tag{14.28}$$

where the final summation term is irrelevant when maximising the log-likelihood, as it is a constant and therefore not a function of any of the parameters under investigation. When the times of failures are available, it is often necessary to refine the likelihood function in Equation 14.27 by considering the detailed pattern of behaviour within each interval in terms of the number of failures and their associated times. Let $t_{ij}$ denote the time of the $j$-th failure in the $i$-th inspection interval; the likelihood is given by (Christer et al. 1998a)

$$L(\Theta) = \prod_{i=1}^{K} \left\{ \left[ \prod_{j=1}^{m_i} v_i(t_{ij}) \right] e^{-E[N_f((i-1)T, iT)]} \left( \frac{e^{-E[N_s(iT)]}\, E[N_s(iT)]^{j_i}}{j_i!} \right) \right\}, \tag{14.29}$$

where $v_i(t_{ij})$ is given by Equation 14.4.
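To make the count-based log-likelihood of Equation 14.28 concrete, the following sketch fits it to a small invented data set. It rests on assumptions that go beyond this section: defects arrive as an HPP with rate $\lambda$, inspections are perfect, and the delay time is exponential with rate $\beta$, so that $E[N_s(iT)] = \lambda(1 - e^{-\beta T})/\beta$ and $E[N_f((i-1)T, iT)] = \lambda T - E[N_s(iT)]$. Under these assumptions the maximum has a simple structure, so a bisection search stands in for a general optimiser. All function names and data are hypothetical.

```python
import math

def q(beta, T):
    """Expected fraction of defects arising in an interval that survive to its inspection."""
    return -math.expm1(-beta * T) / (beta * T)

def expected_counts(lam, beta, T):
    e_ns = lam * T * q(beta, T)      # E[N_s(iT)]: defects found at an inspection
    e_nf = lam * T - e_ns            # E[N_f((i-1)T, iT)]: failures in an interval
    return e_nf, e_ns

def log_likelihood(lam, beta, T, m, j):
    """Equation 14.28 without the constant factorial terms."""
    e_nf, e_ns = expected_counts(lam, beta, T)
    return sum(mi * math.log(e_nf) + ji * math.log(e_ns) - e_nf - e_ns
               for mi, ji in zip(m, j))

def fit(T, m, j):
    """ML estimates of (lam, beta); needs at least one failure and one defect found."""
    K, M, J = len(m), sum(m), sum(j)
    lam_hat = (M + J) / (K * T)      # total events per unit time
    target = J / (M + J)             # observed fraction of defects caught at inspection
    lo, hi = 1e-9, 1e3 / T
    for _ in range(200):             # q is decreasing in beta, so bisect on beta
        mid = 0.5 * (lo + hi)
        if q(mid, T) > target:
            lo = mid
        else:
            hi = mid
    return lam_hat, 0.5 * (lo + hi)

# Hypothetical data: four inspection intervals of length T = 1 (say, one week).
T = 1.0
m = [1, 0, 2, 1]   # failures per interval
j = [3, 2, 4, 3]   # defects found at each inspection
lam_hat, beta_hat = fit(T, m, j)
print(lam_hat, beta_hat, log_likelihood(lam_hat, beta_hat, T, m, j))
```

For richer models, such as imperfect inspection or a non-exponential delay time, no such shortcut is available and the log-likelihood would be maximised numerically.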


In the case study of Christer et al. (1995), only the daily numbers of failures are available. They formulated a different likelihood to take account of this pattern of data. This was done essentially by formulating the probability of a particular number of failures for each day over each inspection interval; the likelihood for a particular inspection interval is then just the product of these probabilities and the probability of observing some number of defects at the inspection. See Christer et al. (1995) for details.

14.5.4 A Case Example

A copper works in the north-west of England has used the same extrusion press for over 30 years, and the plant is a key item in the works since 70% of its products go through this press at some stage of their production. The machine comprises a 1700-ton oil-hydraulic extrusion press with one 1700 kW induction heater and completely mechanised gear for the supply of billets to the press and for the removal of the extruded products. The machine was operated 15–18 h a day (two shifts), five days a week, excluding holidays and maintenance downtime. Preventive maintenance (PM) has been carried out on this machine since 1993. It consists of a thorough inspection of the machinery, along with any subsequent adjustments or repairs if the defects found can be rectified within the PM period. Any major defects which cannot be rectified during the PM time are supposed to be dealt with during non-production hours. PM lasts about 2 h and is performed once a week at the beginning of each week. The questions of concern are (i) whether PM is or could be effective for this machine; (ii) whether the current PM period is the right choice, particularly the one-week PM interval, which was based upon the maintenance engineers' subjective judgement; (iii) whether PM is efficient, i.e. whether it can identify most defects present and reduce the number of failures caused by those defects.

In this case study, the delay time model introduced earlier was used to address the above questions. The first question can also be answered in part by comparing the total downtime per week under PM with the total downtime per week of the previous years without PM. A parallel study carried out by the company revealed that PM has lowered the total downtime: the proportion of downtime was reduced from 7.8% to 5.8%. To establish the relationship between the downtime measure and the PM activities using the delay time concept, the first task is to estimate the parameters of the underlying delay time distribution from the available data, and hence build a model to describe the failure and PM processes. The type of delay time model used in the study is the non-perfect inspection model. In the original study, Christer et al. (1995), a number of different candidate delay time distributions were considered, including exponential and Weibull distributions. The chosen form for the delay time distribution is a mixed distribution consisting of an exponential distribution (scale parameter $\alpha$) with a proportion $P$ of defects having a delay time of 0. The cdf is given by

$$F(h) = 1 - (1 - P)e^{-\alpha h}.$$


An optimisation algorithm is required for maximisation of the likelihood with respect to the parameters. The estimated values are given in Table 14.3 with their associated coefficients of variation (CV).

Table 14.3. Estimated model parameters

Parameter                                     Estimate                     CV
Rate of occurrence of defects                 $\hat{\lambda}$ = 1.3561     0.0832
Probability of perfect inspection             $\hat{r}$ = 0.902            3.4956
Proportion of defects with zero delay time    $\hat{P}$ = 0.5546           0.4266
Scale parameter                               $\hat{\alpha}$ = 0.0178      1.1572

Inserting the optimal parameter estimates into the log-likelihood function gives a maximised log-likelihood value of 101.86. See Christer et al. (1995) for the analysis and the fit of the model to the data.
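The fitted mixed distribution is easy to explore numerically. The sketch below simply evaluates the cdf $F(h) = 1 - (1 - P)e^{-\alpha h}$ with the point estimates of Table 14.3; the function names are hypothetical, and the time unit of $h$ is whatever unit was used in the original study, which is not restated here.

```python
import math

P_HAT, ALPHA_HAT = 0.5546, 0.0178    # point estimates from Table 14.3

def delay_cdf(h, p=P_HAT, alpha=ALPHA_HAT):
    """F(h) = 1 - (1 - P) exp(-alpha h) for h >= 0; F(0) = P is the mass at zero."""
    if h < 0.0:
        return 0.0
    return 1.0 - (1.0 - p) * math.exp(-alpha * h)

def mean_delay(p=P_HAT, alpha=ALPHA_HAT):
    """Mean of the mixture: zero with probability P, exponential (mean 1/alpha) otherwise."""
    return (1.0 - p) / alpha

print(delay_cdf(0.0), delay_cdf(50.0), mean_delay())
```

The jump of size $P$ at $h = 0$ corresponds to the proportion of defects that lead to failure with no delay, and therefore cannot be caught by any inspection.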

14.6 Other Developments in DTM and Future Research

Several useful extensions have been made over the last decade to make the delay time model more realistic, although they increase the mathematical complexity as well. Christer and Wang (1995) addressed an NHPP non-perfect inspection delay time model of multiple component systems. In this case the constant inspection interval assumption cannot be maintained, and a recursive algorithm was developed in Wang and Christer (2003) to find the optimal non-constant intervals until final replacement. Christer and Redmond (1990) reported a problem of sampling bias, and proposed ways of estimating the delay time distribution from subjective data. Wang and Christer (1997) modelled a single component system subject to inspections over a finite time horizon. Christer et al. (1997) used an NHPP in modelling the rate of arrival of defects within a case study. Wang (2000) developed a model of nested inspections using the delay time concept. Wang and Jia (2007) reported the use of empirical Bayesian statistics in the estimation of delay time model parameters using subjective data, which overcame a number of problems in previous subjective delay time parameter estimation. Christer et al. (2000) addressed the case in which the downtime due to failures cannot be ignored in the calculation of the expected number of failures during an inspection interval, and proposed a refined method. Christer et al. (2001) compared the delay time model with an equivalent semi-Markov setting to explore the robustness of both modelling techniques to the Markov assumption. Carr and Christer (2003) studied the problem of non-perfect repairs at failures, which allows failures to reoccur if the repair is not perfect.

Future research on the DTM depends on the application areas, the data involved, and the objective function chosen. We consider that the following areas or problems are worthy of research using the delay time concept:

W. Wang

1. PM type of inspections. Inspections may consist of many activities, some of which are purely preventive, such as greasing, topping up oil, and cleaning, and may have no connection with defect identification. It is noted, however, that this type of PM may change the rate of defect arrivals and therefore the expected number of failures within an inspection interval. This problem has not been modelled in previous DTM research, but it is a reality we have to face. An initial idea is to introduce another parameter in the rate of defect arrivals to model the effectiveness of such PM activities.
2. Multiple inspection schemes. These are again common in practice, in that more than one inspection interval, of different scales or types, is in place. Wang (2000) developed a DTM for nested inspections, but the model is not generic and can only be used for a specific type of problem.
3. Condition monitoring (CM). This is becoming more popular in industry and offers abundant modelling opportunities with a large amount of data. CM may identify the initial point of a random defect at an earlier stage than manual inspections, and it is possible that u, the initial point of a random defect, becomes observable by CM. A pilot study has been carried out to investigate the use of DTM in condition based maintenance modelling (Wang 2006).
4. Parameter estimation. This is still an ongoing research item since for each specific problem we may have to develop a tailor-made approach. The empirical Bayesian approach outlined earlier is promising since it combines both subjective and objective data. It is noted, however, that the computation involved is intensive, and algorithmic developments are therefore required to speed up the process.

14.7 Conclusion

There is considerable scope for advances in maintenance modelling that impact productively upon current maintenance practice. This chapter reports upon one methodology for modelling inspection practice. The power of mathematics and statistics is used to exploit an elementary mathematical construct of the failure process to build operational models of maintenance interactions. The delay time concept is a natural one within the maintenance engineering context. More importantly, it can be used to build quantitative models of the inspection practice of asset items, which have proved to be valid in practice. The theory is still developing, but so far there has been no technical barrier to developing DTM for any plant items studied. This chapter has introduced the delay time concept and has shown how it can be applied to various production equipment to optimise inspection intervals. To provide substance to this statement, the processes of model parameter estimation and case examples outlining the use of delay time modelling in practice are introduced. We have presented only some fundamental DTMs and the associated parameter estimation procedures; interested readers can consult the references listed at the end of the chapter.

Delay Time Modelling

14.8 Dedications

This chapter is dedicated to Professor Tony Christer who recently passed away. Tony was a “world class” researcher with an international reputation. He was the originator of the delay time concept and had produced in conjunction with others a considerable number of papers in delay time modelling theory and applications. He was a great man who enthused, mentored and guided many of us to strive for higher quality research. He will be sadly missed by all who knew him.

14.9 References

Abdel-Hameed, M., (1995), Inspection, maintenance and replacement models, Computers and Operations Research, V22, 4, 435–441.
Akbarov, A., Wang W. and Christer A.H., (2006), Problem identification in the frame of maintenance modelling: a case study, to appear in I. J. Prod. Res.
Baker, R.D. and Wang, W., (1991), Estimating the delay time distribution of faults in repairable machinery from failure data, IMA J. Maths. Applied in Business and Industry, 4, 259–282.
Baker, R. and Wang, W., (1993), Developing and testing the delay time model, Journal of Operational Research Society, Vol. 44, No. 4, 361–374.
Barlow, R.E and Proschan, F., (1965), Mathematical theory of reliability, Wiley, New York.
Carr, M.J., and Christer, A.H, (2003), Incorporating the potential for human error in maintenance models, J. Opl. Res. Soc., 54 (12), 1249–1253.
Christer, A.H., (1976), Innovative decision making, proceedings of NATO conference on the role of effectiveness of theory of decision in practice, eds. Bowen K.C and White D.J., Hodder and Stoughton, 368–377.
Christer, A.H., (1999), Developments in delay time analysis for modeling plant maintenance, J. Opl. Res. Soc., 50, 1120–1137.
Christer, A.H. and Redmond, D.F., (1990), A recent mathematical development in maintenance theory, Int. J. Prod. Econ, 24, 227–234.
Christer, A.H. and Waller, W.M., (1984), Delay time Models of Industrial Inspection Maintenance Problems, J. Opl. Res. Soc., 35, 401–406.
Christer, A.H and Wang, W., (1995), A delay time based maintenance model of a multicomponent system, IMA Journal of Maths. Applied in Business and Industry, Vol. 6, 205–222.
Christer, A.H and Whitelaw, J. (1983), An Operational Research approach to breakdown maintenance: problem recognition, J Opl Res Soc, 34, 1041–1052.
Christer, A.H., Wang, W., Baker, R.D. and Sharp, J.M., (1995), Modelling maintenance practice of production plant using the delay time concept, IMA J. Maths. Applied in Business and Industry, Vol. 6, 67–83.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1997), A stochastic modelling problem of high-tech steel production plant, in Stochastic Modelling in Innovative Manufacturing, Lecture Notes in Economics and mathematical Systems, (Eds. by A.H Christer, Shunji Osaki and L. C. Thomas), Springer, Berlin, 196–214.
Christer, A.H., Wang, W., Choi, K. and Sharp, J.M., (1998a), The delay-time modelling of preventive maintenance of plant given limited PM data and selective repair at PM, IMA J. Maths. Applied in Business and Industry, Vol. 9, 355–379.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1998b), A case study of modelling preventive maintenance of production plant using subjective data, J. Opl. Res. Soc., 49, 210–219.
Christer, A.H., Wang, W. and Lee, C., (2000), A data deficiency based parameter estimating problem and case study in delay time PM modelling, Int. J. Prod. Eco. Vol. 67, No. 1, 63–76.
Christer, A.H. Wang, W., Choi, K. and Schouten, F.A., (2001), The robustness of the semi-Markov and delay time maintenance models to the Markov assumption, IMA. J. Management Mathematics, 12, 75–88.
Kaio, N. and Osaki, S., (1989), Comparison of Inspection Policies, Journal of the Operational Research Society, Vol. 40, No. 5, 499–503.
Luss, H., (1983), An Inspection Policy Model for Production Facilities, Management Science, Vol. 29, No. 9, 1102–1109.
McCall, J., (1965), Maintenance Policies for Stochastically Failing Equipment: A Survey, Management Science, Vol. 11, No. 5, 493–524.
Moubray, J., (1997), Reliability Centred Maintenance, Butterworth-Heineman, Oxford.
Ross, (1983), Stochastic processes, Wiley, New York.
Taylor, H.M., and Karlin, S., (1998), An introduction to stochastic modeling, 3rd Ed., Academic press, San Diego.
Thomas, L.C., Gaver, D.P. and Jacobs, P.A. (1991), Inspection Models and their application, IMA Journal of Management Mathematics, 3(4):283–303.
Wang, W., (1997), Subjective estimation of the delay time distribution in maintenance modelling, European Journal of Operational Research, 99, 516–529.
Wang W., (2000), A model of multiple nested inspections at different intervals, Computers and Operations Research, 27, 539–558.
Wang W., (2006), Modelling the probability assessment of the system state using available condition information, to appear in IMA. J. Management Mathematics.
Wang W. and Christer A.H., (1997), A modelling procedure to optimise component safety inspection over a finite time horizon, Quality and Reliability Engineering International, 13, No. 4, 217–224.
Wang W. and Christer A.H., (2003), Solution algorithms for a multi-component system inspection model, Computers and OR, 30, 190–134.
Wang W. and Jia, X., (2007), A Bayesian approach in delay time maintenance model parameters estimation using both subjective and objective data, Quality Maintenance and reliability Int., 23, 95–105.

Part E

Management

15 Maintenance Outsourcing D.N.P. Murthy and N. Jack

15.1 Introduction

Every business (mining, processing, manufacturing and service-oriented businesses such as transport, health, utilities, communication) needs a variety of equipment to deliver its outputs. Equipment is an asset that is critical for business success in the fiercely competitive global economy. However, equipment degrades with age and usage and ultimately becomes non-operational, and businesses incur heavy losses when their equipment is not in full operational mode. For example, in open cut mining, the loss in revenue resulting from a typical dragline being out of action is around one million dollars per day and the loss in revenue from a 747 plane being out of action is roughly half a million dollars per day. Non-operational equipment leads to delays in delivery of goods and services and this in turn causes customer dissatisfaction and loss of goodwill.

Rapid changes in technology have resulted in equipment becoming more complex and expensive. Maintenance action can reduce the likelihood of such equipment becoming non-operational (referred to as preventive maintenance) and also restore a non-operational unit to an operational state (referred to as corrective maintenance). For most businesses it is no longer economical to carry out maintenance in house. There are a variety of reasons for this, including the need for a specialist work force and diagnostic tools that often require constant upgrading. In these situations it is more economical to outsource the maintenance (in part or total) to an external agent through a service contract. Campbell (1995) gives details of a survey where it was reported that 35% of North American companies had considered outsourcing some of their maintenance.

Consumer durables (products such as kitchen appliances, televisions, automobiles, computers, etc.) that are bought by individuals are certainly getting more complex. A 1990 automobile is immensely more complex than its 1950 counterpart.
Customers need assurance that a new product will perform satisfactorily over its lifetime. In the case of consumer durables, manufacturers have used warranties to provide this assurance during the early part of a product’s useful life. Under warranty the manufacturer repairs all failures that occur within the warranty period and this is often done at no cost to the customer. The warranty period for most consumer durables has been increasing and the warranty terms have been becoming more favourable to the customer. For example, the typical warranty period for an automobile in 1930 was 90 days, in 1970 it was 1 year, and in 1990 it was 3 years. A warranty is tied to the sale of a product and the cost of servicing the warranty is factored into the sale price. For customers who need assurance beyond the warranty period, manufacturers and/or third parties (such as financial institutions, insurance companies and independent operators) offer extended warranties (or service contracts) at an additional cost to the customer. Extended warranties for automobiles of 5–7 years are now fairly common.

Governments (local, state or national) own infrastructure (roads, rail and communication networks, public buildings, dams, etc.) that was traditionally maintained by in-house maintenance departments. Here there is a growing trend towards outsourcing these maintenance activities to external agents so that the governments can focus on their core activities.

In all the above cases, we have an asset (complex equipment, consumer durable or an element of public infrastructure) that is owned by the first party (the owner) and the asset maintenance is outsourced to the second party (the service agent who is also referred to as the “contractor” in many technical papers) under a service contract. This chapter deals with maintenance outsourcing from the perspectives of both the owner (the customer for the maintenance service) and the service agent (the service provider). We focus on the first case (where the customer is a business) and we develop a framework to indicate the different issues involved, carry out a review of the literature, and indicate topics that need further investigation and research.

The outline of the chapter is as follows. Section 15.2 deals with the customer and the agent perspectives. In Section 15.3, we propose a framework to study maintenance outsourcing. Section 15.4 reviews the relevant literature on maintenance outsourcing and on extended warranties. Section 15.5 deals with a game theoretic approach to maintenance outsourcing and extended warranties. In Section 15.6 we briefly discuss agency theory and its relevance to maintenance outsourcing and, in Section 15.7, we conclude with a brief discussion of future research in maintenance outsourcing.

15.2 Customer and Service Agent Perspectives

15.2.1 Customer

Outsourcing of maintenance involves some or all of the maintenance actions (preventive and/or corrective) being carried out by an external service agent under a service contract. The contract specifies the terms of maintenance and the cost issues. It can be simple or complex and can involve penalty and incentive terms.

15.2.1.1 Businesses

Businesses (producing products and/or services) need to come up with new solutions and strategies to develop and increase their competitive advantage. Outsourcing is one of these strategies that can lead to greater competitiveness (Embleton and Wright 1998). It can be defined as a managed process of transferring activities performed in-house to some external agent. The conceptual basis for outsourcing (see Campbell 1995) is as follows:

1. Domestic (in-house) resources should be used mainly for the core competencies of the company.
2. All other (support) activities that are not considered strategic necessities and/or for which the company does not possess adequate competencies and skills should be outsourced (provided there is an external agent who can carry out these activities in a more efficient manner).

Most businesses tend not to view maintenance as a core activity and have moved towards outsourcing it. The advantages of outsourcing maintenance are as follows:

1. Better maintenance due to the expertise of the service agent.
2. Access to high-level specialists on an “as and when needed” basis.
3. Fixed cost service contract removes the risk of high costs.
4. Service providers respond to changing customer needs.
5. Access to latest maintenance technology.
6. Less capital investment for the customer.
7. Managers can devote more resources to other facets of the business by reducing the time and effort involved in maintenance management.

However, there are some disadvantages of outsourcing the maintenance and these are indicated below:

1. Dependency on the service provider.
2. Cost of outsourcing.
3. Loss of maintenance knowledge (and personnel).
4. Becoming locked in to a single service provider.

For very specialised (and custom built) products, the knowledge to carry out the maintenance and the spares needed for replacement need to be obtained from the original equipment manufacturer (OEM). In this case, the customer is forced into having a maintenance service contract with the OEM and this can result in a non-competitive market. In the USA, Section II of the Sherman Act (Khosrowpour 1995) deals with this problem by making it illegal for OEMs to act in this manner. When the maintenance service is provided by an agent other than the OEM, the cost of switching often prevents customers from changing their service agent. In other words, customers get “locked in” and are unable to do anything about it without a major financial consequence.

As a result, it is very important for businesses to carry out a proper evaluation of the implications of outsourcing their maintenance. If done properly, outsourcing can be cheaper than in-house maintenance and can lead to greater business profitability.

15.2.1.2 Owners of Infrastructure

Traditionally, governments owned and operated infrastructures (such as road, rail, water and electricity networks). There has been a growing trend towards selling these assets to private businesses who either lease them back to the government or operate the asset. The maintenance of the asset is often outsourced as it is again viewed as not being the core activity of the business owning the asset. A complicating factor is the additional parties involved and these are shown in Figure 15.1. For example, in the case of a rail network, the operators are the different rail companies that use the track and the maintenance is outsourced to specialist contractors. The government plays a critical role in terms of providing loans to and/or acting as a guarantor for the owner and the regulators are independent authorities responsible for ensuring public safety. The role of maintenance now becomes important in the context of safety and risk. For further discussion see Vickerman (2004).

[Figure 15.1 (block diagram). Parties shown: REGULATOR, OWNER, SERVICE AGENT [MAINTENANCE], ASSET [INFRASTRUCTURE], GOVERNMENT, OPERATOR, PUBLIC]

Figure 15.1. Different parties that need to be considered in the maintenance of infrastructures

15.2.1.3 Individual Consumers

In the case of consumer durables, the cost of rectifying failures in the post-warranty period is a concern to buyers. The uncertainty in the cost of repair and the customer's attitude to risk determine the amount a customer is willing to pay for an extended warranty or service contract. In one sense, opting for an extended warranty can be viewed as taking out insurance to cover potential future costs resulting from product failures in the post-warranty period.
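
As a rough illustration of how attitude to risk can drive the amount a buyer is willing to pay, the sketch below makes assumptions that are not from the chapter: post-warranty failures are Poisson with mean `mean_failures`, every repair costs the same `repair_cost`, and the buyer has exponential (constant absolute risk aversion) utility with coefficient `risk_aversion`. Under those assumptions the most the buyer would pay for full cover is the certainty equivalent of the random repair bill. All names and numbers are hypothetical.

```python
import math

def willingness_to_pay(mean_failures, repair_cost, risk_aversion):
    """Certainty equivalent of the post-warranty repair bill.

    Assumptions (illustrative only, not from the chapter):
      - number of failures N ~ Poisson(mean_failures)
      - every repair costs the same fixed amount repair_cost
      - exponential utility with coefficient risk_aversion >= 0
    """
    if risk_aversion == 0:
        # Risk-neutral buyer: pays at most the expected repair bill.
        return mean_failures * repair_cost
    # For Poisson N, E[exp(a*c*N)] = exp(mean * (exp(a*c) - 1)),
    # so the certainty equivalent is ln(E[exp(a*c*N)]) / a.
    return mean_failures * math.expm1(risk_aversion * repair_cost) / risk_aversion

# Hypothetical buyer: 1.5 expected failures, 200 per repair.
expected_bill = willingness_to_pay(1.5, 200.0, 0.0)
cautious_buyer = willingness_to_pay(1.5, 200.0, 0.005)
```

In this sketch the risk-averse buyer is willing to pay more than the expected repair bill, which is the margin an extended warranty provider can price against.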

15.2.1.4 Decision Problems

In the case of businesses (producing goods and services) and infrastructure operators the decision problems are (i) whether to outsource or not, (ii) what maintenance activities to outsource and, (iii) how to implement and manage the process. We will discuss these issues in a later section. In the case of an extended warranty, the customer has to decide (i) whether or not to buy an extended warranty and (ii) the best one to buy when there are several different options.

15.2.2 Service Agent – Issues and Decisions

The service agent providing the maintenance needs to operate as a service business. This implies that return on investment (ROI), number of customers to service (market share), location of operations and range of service contracts to offer are some of the variables that are important in the context of strategic management of the business. The type of contract depends on the needs of customers; contracts can be either standard or customized. At the operational level, the service agent needs to deal with issues such as scheduling of maintenance tasks, spare part inventory control, etc. The pricing of the different service contracts offered is critical for business profitability. If the price is too low, the service agent might end up making a loss instead of a profit. On the other hand, if it is too high then there might be no customers for the service. The price must cover the costs and estimating the cost is a challenge due to information uncertainties.

15.2.2.1 Extended Warranty Providers – Issues and Decisions

For most products, the product market has become global and highly competitive, resulting in many similar brands. Survival and growth in such an environment requires the manufacturers to differentiate their products from those of competitors. Product support provides the mechanism for this differentiation. Product support deals with issues such as providing better information about the product before sale and post-sale support in the form of warranty, extended warranty, training, upgrades, spares, etc. The bundling of products with product support is a mechanism that manufacturers have used very effectively to market their products (see Eppen et al. 1991). In many industries (for example, consumer electronics) extended warranties have been highly profitable to manufacturers (see Padmanabhan 1996 and the UK Competition Commission Report 2003). The popularity of extended warranties has resulted in third parties (financial institutions, insurance companies and independent operators) providing these to customers. The decision problem here is the pricing of extended warranties. The price must exceed the cost of servicing claims over the warranty period. In the case where the extended warranty is offered by the manufacturer, the manufacturer has some information about product reliability. However, third parties offering extended warranties lack this information and as such the decision on pricing must take into account this uncertainty.
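
The cost floor under an extended warranty price can be sketched numerically. The example below is a minimal illustration under assumptions that are not from the chapter: failures are rectified by minimal repair, the cumulative failure intensity follows a power law `(t / eta) ** beta`, and each claim costs a fixed amount. A third party that does not know `eta` is represented by a hypothetical list of `(eta, probability)` scenarios.

```python
def expected_claims(start, end, beta, eta):
    """Expected failures in (start, end] under minimal repair with
    cumulative intensity (t / eta) ** beta (an assumed power-law form)."""
    return (end / eta) ** beta - (start / eta) ** beta

def expected_servicing_cost(start, end, beta, eta, cost_per_claim):
    """Expected cost of servicing claims over the extended warranty."""
    return cost_per_claim * expected_claims(start, end, beta, eta)

def cost_under_uncertainty(start, end, beta, eta_scenarios, cost_per_claim):
    """Average the servicing cost over (eta, probability) scenarios, as a
    provider without reliability data might do."""
    return sum(p * expected_servicing_cost(start, end, beta, eta, cost_per_claim)
               for eta, p in eta_scenarios)

# Hypothetical numbers: base warranty ends at year 3, extension runs to year 6.
manufacturer_cost = expected_servicing_cost(3.0, 6.0, 2.0, 6.0, 150.0)
third_party_cost = cost_under_uncertainty(
    3.0, 6.0, 2.0, [(4.0, 0.25), (6.0, 0.5), (8.0, 0.25)], 150.0)
```

With these made-up numbers the third party's expected cost is higher than the manufacturer's even though its central estimate of `eta` is the same, because the cost is convex in `eta` for this intensity; the price would have to exceed whichever figure applies.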

15.3 Framework to Study Maintenance Outsourcing

A proper framework to study maintenance outsourcing from both customer and service agent points of view involves several interlinked elements as indicated in Figure 15.2. In Section 15.2 we discussed the customer and the service agent elements and in this section we discuss the remaining elements.

[Figure 15.2 (block diagram). Elements shown: PAST USAGE, PAST MAINTENANCE, ASSET STATE AT THE START OF CONTRACT, OWNER (CUSTOMER), CONTRACT, SERVICE AGENT, NOMINATED USAGE RATE, ACTUAL USAGE RATE, PENALTIES / INCENTIVES, ASSET DEGRADATION RATE, NOMINATED MAINTENANCE, ACTUAL MAINTENANCE, ASSET STATE AT THE END OF CONTRACT]

Figure 15.2. Framework for study of maintenance outsourcing

15.3.1 Asset and State of Asset

In general, an asset is a complex system comprising several components. The state of the system degrades with age and/or usage and this leads to a failure. An asset is said to be in a failed state when it is no longer functioning properly. In the case of equipment, or a consumer durable, the failure is due to the failure of one or more components. In the case of infrastructure, for example a road, a failure occurs when a pothole reaches some size or the number of potholes per kilometre exceeds some specified amount.

In the case of a new asset, the initial state is determined by the decisions made during its design and construction (or manufacture). The asset reliability characterises the probability of no failure and this decreases with age. The field reliability also depends on the operating stress (load) on the asset and the operating environment. The stress can be thermal, mechanical, electrical, etc., and the reliability decreases as the stress increases and/or the environment gets harsher.

When a failure occurs, the asset can be restored to an operational state through corrective maintenance (CM). In the case of equipment, this involves repairing or replacing the failed components. In the case of the road example, the CM involves filling the potholes and resealing a section of the road. The degradation in the asset state can be controlled through use of preventive maintenance (PM) and, in the case of equipment, this involves regular monitoring and replacing of components before failure.

The asset state at any given time (subsequent to it being put into operation) is a function of its inherent reliability and past history of usage and maintenance. This information is important in the context of maintenance service contracts for used assets. The information that the service agent (and the customer) has can vary from very little to a lot (if detailed records of past usage and maintenance have been kept).

Finally, for some assets, the delivery of maintenance requires the service agent to visit the site where the asset is located (for example, lifts in buildings and roads) and for others (most consumer durables and some industrial equipment) the failed asset can be brought to a service centre to carry out the maintenance actions.

15.3.2 Maintenance

15.3.2.1 Corrective Maintenance (CM)

These are corrective actions performed when the asset has a failure. The most common form of CM is “minimal repair” where the state of the asset after repair is nearly the same as that just before failure. The other extreme is “as good as new” repair and this is seldom possible unless one replaces the failed asset by a new one. Any repair action that restores the asset state to better than that before failure but not as good as that of a new asset is referred to as “imperfect repair”.
15.3.2.2 Preventive Maintenance (PM)

In the case of equipment or consumer durables, PM actions are carried out at component level where components are replaced based on age, usage and/or condition. As a result, there are several different kinds of PM policies (Blischke and Murthy 2000). Some of the more commonly used ones are the following:

• Age based maintenance. Replace a component (under PM) when it reaches age T (after being put into use) or on failure under CM, if the item fails earlier.
• Clock based maintenance. Replace a component (under PM) at set times t = kT, k = 1, 2, ..., or on failure under CM.
• Opportunistic maintenance. This is based on exploiting opportunities that become available. An example is PM actions for some components being carried out at the same time as the CM action for a failed component.
• Condition-based maintenance. Here, the maintenance action is based on an assessment of the state of a component from a set of measurement data obtained. For example, the state of a turbine bearing is assessed on data relating to noise, vibration, wear debris in oil, etc.
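
The age based policy above lends itself to a small numerical sketch. The code below is illustrative only: it assumes a Weibull lifetime and uses the standard renewal-reward argument (expected cost per cycle divided by expected cycle length) to get the long-run cost rate for a replacement age T. The parameter values and the names `cost_pm` and `cost_cm` are hypothetical and do not come from the chapter.

```python
import math

def weibull_reliability(t, shape, scale):
    """Probability that a new component survives beyond age t."""
    return math.exp(-((t / scale) ** shape))

def age_policy_cost_rate(T, shape, scale, cost_pm, cost_cm, steps=2000):
    """Long-run cost per unit time when a component is replaced at age T
    (cost_pm) or on failure (cost_cm), whichever comes first."""
    R_T = weibull_reliability(T, shape, scale)
    # Expected cycle length = integral of R(t) over [0, T] (trapezoidal rule).
    h = T / steps
    area = 0.5 * (1.0 + R_T)
    for i in range(1, steps):
        area += weibull_reliability(i * h, shape, scale)
    mean_cycle = area * h
    # A cycle ends with a PM replacement (prob. R_T) or a failure (prob. 1 - R_T).
    mean_cost = cost_pm * R_T + cost_cm * (1.0 - R_T)
    return mean_cost / mean_cycle

# Hypothetical wear-out component: shape 2.5, scale 1000 h,
# a failure replacement costing five times a PM replacement.
candidates = range(100, 3001, 50)
best_T = min(candidates,
             key=lambda T: age_policy_cost_rate(T, 2.5, 1000.0, 1.0, 5.0))
```

A coarse grid search is used here for transparency; with an increasing hazard and a failure cost above the PM cost, the cost rate has an interior minimum, so replacing at `best_T` is cheaper in this sketch than running the component almost to failure.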

15.3.2.3 Modeling Failures and Maintenance Actions

To evaluate different maintenance actions, mathematical models are needed for the failure of assets and the effect of maintenance on these failures. The modeling can be done at two levels – system or component.

System level modeling

If only CM and no PM is used and the time to repair is very much smaller than the time between failures, then one can model failures over time as a stochastic point process with an intensity function λ(t) that is increasing with t (time or age) to capture the degradation with time (see Rigdon and Basu 2000). The effect of operating stress and operating environment can be modeled through a Cox-regression model where the intensity function is modified to g(z)λ(t), where z is the vector of covariates representing the stress and the environment (see Cox and Oakes 1984). The effect of PM actions can be modeled through a reduction in the intensity function as shown in Figure 15.3. The level of PM (indicated by δ in the figure) determines the reduction in the intensity function and the cost of a PM action increases with the level of PM.

[Figure 15.3 (plot). Axes: INTENSITY FUNCTION against TIME; PM ACTIONS marked at T1 and T2, with reductions δ1 and δ2]

Figure 15.3. Effect of PM actions on the intensity function
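
The system level model can be sketched in a few lines. The code below is an illustration under assumptions not taken from the chapter: the baseline intensity is a power law, g(z) is the exponential form exp(b·z) commonly used in Cox-type models, and each PM action subtracts its δ from the intensity from the time it is carried out (floored at zero). The expected number of failures is obtained by integrating the intensity numerically.

```python
import math

def baseline_intensity(t, beta, eta):
    """Power-law intensity, increasing in t when beta > 1 (assumed form)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def intensity(t, beta, eta, coeffs, covariates, pm_actions):
    """g(z) * lambda(t), less the reductions of PM actions done by time t.

    g(z) = exp(sum(b_i * z_i)) is an assumed Cox-type choice;
    pm_actions is a list of (time, delta) pairs.
    """
    g = math.exp(sum(b * z for b, z in zip(coeffs, covariates)))
    reduction = sum(delta for when, delta in pm_actions if when <= t)
    return max(0.0, g * baseline_intensity(t, beta, eta) - reduction)

def expected_failures(horizon, beta, eta, coeffs, covariates, pm_actions,
                      steps=10000):
    """Integrate the intensity over [0, horizon] with the midpoint rule."""
    h = horizon / steps
    return h * sum(
        intensity((i + 0.5) * h, beta, eta, coeffs, covariates, pm_actions)
        for i in range(steps))

# Hypothetical asset: beta = 2, eta = 10, studied over 10 time units.
no_pm = expected_failures(10.0, 2.0, 10.0, [], [], [])
harsher = expected_failures(10.0, 2.0, 10.0, [0.5], [1.0], [])
with_pm = expected_failures(10.0, 2.0, 10.0, [], [], [(4.0, 0.03), (8.0, 0.03)])
```

Comparing `no_pm`, `harsher` and `with_pm` shows the two effects described above: a harsher environment scales the expected number of failures up, and PM actions bring it down by an amount that depends on δ and on how early the action is taken.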

Component level modeling

If a component of the asset fails and is non-repairable and/or too costly to repair, then it is replaced by a new one. If the replacement time is small relative to the mean time to failure, then it can be ignored and component failures (over time) can be modeled by a renewal process (see Ross 1980). If the component is repairable and costly and a failed component is subjected to minimal repair, then failures (over time) can be modeled by a stochastic point process with intensity function having the same form as the hazard function of the component.
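
The two component level cases can be written compactly. The notation below (F, f, h, N(t), M(t)) is introduced here for illustration and is not the chapter's own; the relations are the standard ones for a renewal process and for a minimal-repair point process.

```latex
% Notation (assumed): F = failure distribution of a new component,
% f = its density, h(t) = f(t) / (1 - F(t)) = hazard function,
% N(t) = number of failures in [0, t].
\begin{align*}
\text{Replacement by new (renewal process):}\quad
  M(t) &= E[N(t)] = F(t) + \int_0^t M(t - x)\, f(x)\, dx, \\
\text{Minimal repair (intensity } \lambda(t) = h(t)\text{):}\quad
  E[N(t)] &= \int_0^t h(x)\, dx = -\ln\bigl(1 - F(t)\bigr).
\end{align*}
```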

15.3.3 Contract

The contract is a legal document that is binding on both parties (customer and service agent) and it needs to deal with technical, management and economic issues.

15.3.3.1 Technical and Management Issues

Maintenance of an asset involves carrying out several activities as indicated in Figure 15.4 (adapted from Dunn 1999). There are many different contract scenarios depending on how these activities are outsourced. Table 15.1 indicates three different scenarios (S-1 to S-3), where the decisions D-1 to D-3 are:

• (D-1). What (components) need to be maintained?
• (D-2). When should the maintenance be carried out?
• (D-3). How should the maintenance be carried out?

[Figure 15.4 (diagram). Activities shown: WORK IDENTIFICATION, WORK PLANNING, WORK SCHEDULING, DATA ANALYSIS, DATA RECORDING, WORK EXECUTION]

Figure 15.4. Activities in asset maintenance

Table 15.1. Different contract scenarios

SCENARIO   DECISIONS: CUSTOMER   DECISIONS: SERVICE AGENT
S-1        D-1, D-2              D-3
S-2        D-1                   D-2, D-3
S-3        -                     D-1, D-2, D-3

In scenario S-1, the service agent only provides the resources (workforce and material) to execute the work. This corresponds to the minimalist approach to outsourcing. In scenario S-2, the customer decides what is to be done and the service agent decides how and when. Finally, in scenario S-3 the service agent makes all three decisions.

There is a growing trend towards functional guarantee contracts. Here the contract specifies a level for the output generated from equipment, for example, the amount of electricity produced by a power plant, or the total length of flights and number of landings and takeoffs per year. The service agent has the freedom to decide on the maintenance needed (subject to operational constraints), with incentives if the target levels are exceeded and/or penalties if they are not met. For more on this, see Kumar and Kumar (2004). In the context of infrastructures, there is a trend towards giving the service agent the responsibility for ongoing upgrades or the responsibility for the initial design resulting in a BOOM (build, own, operate and maintain) contract. The levels of risk to both parties vary with the contract scenario.

15.3.3.2 Economic Issues

There are a number of alternative contract payment structures. The following list is from Dunn (1999):

• Fixed or firm price
• Variable price
• Price ceiling incentive
• Cost plus incentive fee
• Cost plus award fee
• Cost plus fixed fee
• Cost plus margin
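
To make the idea of risk sharing concrete, the sketch below compares three of the listed payment structures under a simplified reading that is an assumption of this example, not a definition from Dunn (1999): a fixed price ignores the actual cost, cost plus fixed fee reimburses the cost and adds a set fee, and cost plus margin reimburses the cost plus a percentage. The contract terms and cost figures are hypothetical.

```python
def fixed_price(actual_cost, price):
    """Customer pays the agreed price whatever the work actually cost."""
    return price

def cost_plus_fixed_fee(actual_cost, fee):
    """Customer reimburses the actual cost and adds a fixed fee."""
    return actual_cost + fee

def cost_plus_margin(actual_cost, margin):
    """Customer reimburses the actual cost plus a percentage margin."""
    return actual_cost * (1.0 + margin)

def agent_profit_by_structure(actual_cost, price=120.0, fee=20.0, margin=0.2):
    """Service agent's profit under each structure (hypothetical terms)."""
    payments = {
        "fixed price": fixed_price(actual_cost, price),
        "cost plus fixed fee": cost_plus_fixed_fee(actual_cost, fee),
        "cost plus margin": cost_plus_margin(actual_cost, margin),
    }
    return {name: paid - actual_cost for name, paid in payments.items()}

# Maintenance cost as planned (100) and with an overrun (140).
as_planned = agent_profit_by_structure(100.0)
overrun = agent_profit_by_structure(140.0)
```

In this toy comparison a cost overrun falls entirely on the service agent under the fixed price and entirely on the customer under the two cost-plus forms.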

Each of these price structures represents a different level of risk sharing between the customer and the service agent. According to Vickerman (2004), an issue of growing importance in privatized infrastructure is the appropriate incentives needed to ensure adequate maintenance of the infrastructure as a public resource.

15.3.3.3 Other Issues

Some other issues are as follows:

Requirements. Both parties might need to meet some stated requirement. For example, the customer needs to ensure that the stresses on the asset do not exceed the levels specified in the contract as this can lead to greater degradation and higher servicing costs to the service agent. Similarly, the service agent needs to ensure proper data recording.

Contract duration. This is usually fixed with options for renewal at the end of the contract.

Dispute resolution. This specifies the avenues to follow when there is a dispute. The dispute can involve going to a third party (legal courts).

Unless the contract is written properly and relevant data (relating to equipment and collected by the service agent) are analysed properly by the customer, the long-term costs and risks will escalate.

15.3.4 Maintenance Outsourcing Market

Whether the maintenance outsourcing market is competitive or not depends on the number of customers and service agents. Table 15.2 indicates the different market scenarios. These have an impact on issues such as the types of service contracts available to customers and the pricing of the contracts.

Table 15.2. Maintenance outsourcing market scenarios

NUMBER OF CUSTOMERS   NUMBER OF SERVICE AGENTS
                      ONE    FEW
ONE                   A-1    B-1
FEW                   A-2    B-2
MANY                  A-3    B-3

15.4 Review of Literature

There is a vast literature on maintenance and it covers a range of topics (approaches to maintenance, mathematical models for deciding optimal maintenance, maintenance management, etc.). There are several review papers that have appeared over the last 40 years and these include McCall (1965), Pierskalla and Voelker (1976), Jardine and Buzacott (1985), Sherif and Smith (1986), Thomas (1986), Valdez-Flores and Feldman (1989), Pintelon and Gelders (1992) and Scarf (1997). Cho and Parlar (1991) and Dekker et al. (1997) deal with the maintenance of multi-component systems. There are also several maintenance books. In contrast, the literature on maintenance outsourcing is very limited and in this section we briefly review this literature.

15.4.1 Maintenance Outsourcing

The literature deals with maintenance outsourcing mainly from the customer perspective and is focussed on management issues. More specifically, attempts are made to address one or more of the following questions in a qualitative manner:

1. Does outsourcing make sense?
2. Are the objectives achievable?
3. Is the organisation ready?
4. What are the outsourcing alternatives?
5. What maintenance activities should be outsourced?
6. How should the best service agent be selected?
7. What are the negotiating tactics for contract formation?

Some of the relevant papers are Campbell (1995), Judenberg (1994), Martin (1997), Levery (1998) and Sunny (1995). Unfortunately, cost has been the sole basis used by businesses for making maintenance outsourcing decisions. Sunny (1995) looks at what activities are to be outsourced by looking at the long-term strategic dimension (core competencies) as well as the short-term cost issues. Bertolini et al. (2004) take a quantitative approach and use the analytic hierarchy process (AHP) to make decisions regarding the outsourcing of maintenance. Asgharizadeh and Murthy (2000) and Murthy and Asgharizadeh (1998, 1999) look at maintenance outsourcing from both customer and service agent perspectives and propose game-theoretic models to determine the optimal strategies for both parties. This approach is discussed further in Section 15.5.

On the application side, Armstrong and Cook (1981) look at clustering of highway sections for awarding maintenance contracts to minimise the cost and use a fixed-charge goal programming model to determine the optimal strategy. Bevilacqua and Braglia (2000) illustrate their AHP model in the context of an Italian brick manufacturing business having to make decisions regarding maintenance outsourcing. Stremersch et al. (2001) look at the industrial maintenance market.

15.4.2 Extended Warranties

The literature can be broadly divided into three groups.

15.4.2.1 Group 1: Warranty Cost Analysis

The cost analysis of many different types of basic warranties can be found in Blischke and Murthy (1994, 1996). For a review of more recent literature, see Murthy and Djamaludin (2002). These techniques can be easily extended to obtain the costs for extended warranties and this has been done by Sahin and Polatoglu (1998).

15.4.2.2 Group 2: Warranty Servicing Strategy

When a repairable asset fails under warranty, the manufacturer has the choice of either repairing or replacing it with a new one. The first option costs less than the second but a repaired asset has a greater probability of failing during the remainder of the warranty period. It is therefore important for the manufacturer to choose an appropriate servicing strategy in order to minimise the expected cost of servicing the warranty per asset sold.

Servicing strategies for products sold with one-dimensional warranties have received considerable attention. Biedenweg (1981) and Nguyen and Murthy (1986, 1989) assume that repaired items have independent and identically distributed lifetimes different from that of a new item and consider strategies where the warranty period is divided into distinct intervals for repair and replacement.
Nguyen (1984) introduces the first servicing model with minimal repair (see Barlow and Hunter 1960), with the warranty period split into a replacement interval followed by a repair interval. The length of the first interval is selected optimally to minimize the expected warranty cost. Jack and Van der Duyn Schouten (2000) show that this strategy is sub-optimal and that the optimal servicing strategy is in fact characterized by three distinct intervals – [0, x), [x, y] and (y, W] where W is the warranty period. The optimal strategy is to carry out minimal repairs in the first and last intervals and to use either minimal repair or replacement by new in the middle interval depending on the age of the item at failure. This strategy is difficult to implement, so Jack and Murthy (2001) propose a near optimal strategy involving the same three intervals but with only the first failure in the middle interval resulting in a replacement and all other failures being minimally repaired.
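The near-optimal three-interval rule attributed above to Jack and Murthy (2001) can be illustrated with a short sketch. This is only a reading of the verbal description, not the authors' implementation: the function names, the `Action` type and the idea of feeding in a fixed list of failure times are assumptions made for illustration (in a full model the failure times would themselves depend on the actions taken).

```python
from enum import Enum


class Action(Enum):
    MINIMAL_REPAIR = "minimal repair"
    REPLACE = "replace by new"


def servicing_action(t, x, y, W, replaced_in_middle):
    """Action for a failure at warranty time t under the three-interval rule.

    Intervals are [0, x), [x, y] and (y, W]. `replaced_in_middle` is True
    once an earlier failure in [x, y] has already triggered a replacement.
    """
    if not 0 <= x <= y <= W:
        raise ValueError("need 0 <= x <= y <= W")
    if not 0 <= t <= W:
        raise ValueError("failure time lies outside the warranty period")
    # Only the first failure in the middle interval leads to replacement.
    if x <= t <= y and not replaced_in_middle:
        return Action.REPLACE
    # All other failures (first, last and later middle ones) are minimally repaired.
    return Action.MINIMAL_REPAIR


def service_history(failure_times, x, y, W):
    """Map a chronologically sorted list of failure times to the actions taken."""
    replaced = False
    actions = []
    for t in failure_times:
        action = servicing_action(t, x, y, W, replaced)
        if action is Action.REPLACE:
            replaced = True
        actions.append(action)
    return actions
```

For example, with hypothetical values x = 0.5, y = 1.5 and W = 2.0, failures at times 0.2, 0.8, 1.1 and 1.9 would be handled by minimal repair, replacement, minimal repair and minimal repair respectively.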


Servicing strategies for products sold with two-dimensional warranties have been studied by Iskandar and Murthy (2003) who propose two strategies similar to those from Nguyen and Murthy (1986, 1989) but with minimal repair. Iskandar et al. (2005) deal with a servicing strategy similar to that given in Jack and Murthy (2001). When the cost of replacement is high compared to the cost of a minimal repair then strategies involving replacement are not appropriate. In this case, strategies involving imperfect repair (where the failure characteristics of the repaired asset are better than those after minimal repair but are not the same as a new item) are more appropriate. The advantage of using imperfect repair is that the degree of improvement in the reliability after repair is a decision variable under the control of the manufacturer. Yun et al. (2006) discuss this topic. Every EW provider also needs to choose appropriate servicing strategies to minimise the costs of servicing the EWs that they have sold. The techniques that have been developed for basic warranties can easily be adapted to the EW case. 15.4.2.3 Group 3: Market for EWs There are a number of studies that have been carried out to show how EWs can be used as a tool for market segmentation. Unfortunately, most of the failure modeling used in these studies is static in nature. The asset either functions or doesn’t function properly during the EW period. Padmanabhan and Rao (1993) consider strategies that manufacturers should adopt for warranty provision when consumers vary in risk attitude and consumer moral hazard is also present. Moral hazard problems occur when consumers who have purchased EWs reduce their level of maintenance effort and this causes increased servicing costs to EW providers. Lutz and Padmanabhan (1994) look at the effect of income variation on EW purchasing and Padmanabhan (1995) and Hollis (1999) consider heterogeneity in consumer usage. 
Lutz and Padmanabhan (1998) investigate differences in consumers’ valuations of a working asset and the effect of independent EW providers in the market. Desai and Padmanabhan (2004) consider the impact of different distributional arrangements for the sale of assets and their EWs.

15.5 Game Theoretic Approach In the game theoretic approach, the outsourcing problem is viewed as a game with two players – customer and service agent. Each player has his/her own goal or objective and a set of decisions that need to be selected optimally. There are several different scenarios depending on whether there is a dominant player (a leader-follower situation where the actions of the follower depend on the actions of the leader – referred to as a “Stackelberg game formulation”) or there isn’t (both players decide on their actions either in a cooperative or non-cooperative mode – referred to as a “Nash game formulation”), and also on the kinds of information available to each player and their attitudes to uncertainty and risk. This approach allows maintenance outsourcing to be studied from both customer and service agent perspectives.


15.5.1 Maintenance Outsourcing

Consider the case where the service agent is the leader and offers $n$ options $A_i(\theta_i)$, $1 \le i \le n$, to the customer, where $\theta_i$, $1 \le i \le n$, are the decision variables corresponding to the different options that the agent needs to select optimally. As an illustrative case, let $n = 2$ and the two options that the service agent offers for CM actions are as follows:

Option 1 [Fixed Price Service Contract – $A_1(\theta_1)$]: For a fixed price $P$, the service agent agrees to rectify all failures occurring over a period $L$ at no additional cost to the customer. If a failure is not rectified within a period $\tau$, the service agent incurs a penalty. If $Y$ denotes the time for which the equipment is in the non-operational state before it becomes operational, then the penalty incurred is given by $\max\{0, \alpha (Y - \tau)\}$, where $\alpha$ is the penalty cost per unit time. This ensures that the service agent does not deprive the customer of the use of the equipment for too long. Here, $\theta_1 = \{P, \tau, \alpha\}$.

Option 2 [Pay for each repair contract – $A_2(\theta_2)$]: In this case, whenever a failure occurs, the service agent charges an amount $C_s$ for each repair and does not incur any penalty if the equipment is in the non-operational state for greater than $\tau$ units of time. Here, $\theta_2 = \{C_s\}$.

In the Stackelberg game formulation, given the set of options (along with the values for the decision variables of the service agent), the customer chooses the best option to optimize his/her goal. This generates the optimal response function $A^*(\theta_1, \theta_2, \ldots, \theta_n)$ as shown in Figure 15.5. Using this, the service agent then optimally selects the decision variables to optimize his/her objective.

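Figure 15.5. Stackelberg game formulation (the service agent offers the options $A_i(\theta_i)$, $1 \le i \le n$, to the customer; the customer's optimal response is $A^*(\theta_1, \theta_2, \ldots, \theta_n)$)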
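The customer's side of the two-option illustration can be sketched numerically. The sketch below is not the model of the cited papers: it assumes a risk-neutral customer who simply compares expected payments, failures arriving at a constant rate `lam` over the contract period `L`, repair times that are exponentially distributed with rate `mu` and short enough that the expected number of failures is roughly `lam * L`, and no valuation of lost production. All function and parameter names are hypothetical.

```python
import math


def expected_penalty_per_failure(alpha, tau, mu):
    """E[max(0, alpha * (Y - tau))] for Y ~ Exponential(rate mu).

    Closed form: alpha * exp(-mu * tau) / mu.
    """
    return alpha * math.exp(-mu * tau) / mu


def cost_option_1(P, tau, alpha, lam, L, mu):
    """Expected net payment under the fixed price contract (price less penalties received)."""
    return P - lam * L * expected_penalty_per_failure(alpha, tau, mu)


def cost_option_2(Cs, lam, L):
    """Expected payment when each repair is charged at Cs."""
    return Cs * lam * L


def best_response(P, tau, alpha, Cs, lam, L, mu):
    """Customer's choice between the two options for given agent decisions."""
    c1 = cost_option_1(P, tau, alpha, lam, L, mu)
    c2 = cost_option_2(Cs, lam, L)
    return ("A1", c1) if c1 <= c2 else ("A2", c2)
```

Under these assumptions the service agent, as leader, would search over $(P, \tau, \alpha, C_s)$ knowing that the customer responds through `best_response`; a risk-averse customer or one who values downtime would need a utility-based objective instead of the plain expected payment used here.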

Murthy and Asgharizadeh (1998, 1999) and Asgharizadeh and Murthy (2000) use a Stackelberg game formulation for a special case where the time between equipment failures is given by an exponential distribution so that the failures over time occur according to a Poisson process. They consider the two options discussed earlier and consider the following three cases:

1. Single service agent and single customer (Case A-1)
2. Single service agent, multiple customers (Case A-2) and one repair facility so that only one failed equipment can be repaired at any given time
3. Single service agent, multiple customers (Case A-3) and more than one repair facility
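The link between exponential times between failures and a Poisson failure count can be checked with a small simulation. This is an illustrative sketch only: it assumes repairs are instantaneous (so the clock is never stopped for downtime), and the function names and parameter values are hypothetical.

```python
import random


def simulate_failure_times(lam, L, rng):
    """Failure epochs in (0, L] when times between failures are Exponential(lam)."""
    times = []
    t = rng.expovariate(lam)
    while t <= L:
        times.append(t)
        t += rng.expovariate(lam)
    return times


def mean_failure_count(lam, L, runs, seed=0):
    """Average number of failures over [0, L] across independent simulated histories."""
    rng = random.Random(seed)
    total = sum(len(simulate_failure_times(lam, L, rng)) for _ in range(runs))
    return total / runs
```

For a Poisson process the expected number of failures over the contract period is `lam * L`, so the simulated average should settle close to that value as the number of runs grows.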


In case 2 the service agent has to decide the optimal number of customers to service and in case 3 the agent has to decide the optimal number of repair facilities.

15.5.2 Extended Warranties

Jack and Murthy (2006) consider the case where the product is complex and so the specialist knowledge of the manufacturer is required to carry out any repairs after the base warranty expires. The consumer must decide how long to keep the item and how to maintain it until replacement. Two maintenance options are available: the consumer can (i) pay the manufacturer to repair the item each time it fails, or (ii) purchase an extended warranty (EW) from the manufacturer. These are similar to Options 2 and 1 respectively, discussed earlier. The EW contract specifies that the manufacturer will again rectify all failures free of charge to the consumer. The consumer has flexibility in choosing when the EW will begin and the length of cover. The price of the EW depends on these two variables and is set by the manufacturer. The manufacturer also has to decide the price of each repair if the item fails and the consumer does not have an EW. A Stackelberg game formulation is used to determine the optimal strategies for both the consumer and the manufacturer.

15.6 Agency Theory (The Principal – Agent Problem) Agency theory deals with the relationship that exists between two parties (a principal and an agent) where the principal delegates work to the agent who performs that work and a contract defines the relationship. Agency theory is concerned with resolving two problems that can occur in agency relationships. The first problem arises when the two parties have conflicting goals and it is difficult or expensive for the principal to verify the actual actions of the agent and whether the agent has behaved properly or not. The second problem involves the risk sharing that takes place when the principal and agent have different attitudes to risk (due to various uncertainties). According to Eisenhardt (1989), the focus of the theory is on determining the optimal contract, behaviour vs. outcome, between the principal and the agent. Many different cases have been studied in depth in the principal-agent literature and these deal with the range of issues indicated in Figure 15.6. Agency theory has also been applied in many different disciplines. For an overview see Van Ackere (1993).


Figure 15.6. Issues in agency theory (diagram elements: principal, agent, contract, incentives, monitoring, costs, informational asymmetry, risk preferences, moral hazard, adverse selection)

15.6.1 Issues in Agency Theory

Moral hazard. Moral hazard refers to lack of effort (or shirking) on the part of the agent. The agent does not put in the agreed-upon effort because the objectives of the two parties are different and the principal cannot assess the level of effort that the agent has actually used.

Adverse selection. Adverse selection refers to any misrepresentation of ability by the agent that the principal is unable to completely verify before deciding to hire the agent.

Information. To counteract adverse selection, the principal can invest in getting information about the agent’s ability. One way of getting the desired information is by contacting people for whom the agent has provided service in the past.

Monitoring. The principal can counteract the moral hazard problem by monitoring the actions of the agent. Monitoring provides information about the agent’s actual actions.

Information asymmetry. There are several uncertainties that affect the overall outcome of the relationship. The two parties, in general, will have different information to make an assessment of these uncertainties and will also differ in terms of other information.

Risk. This results from the different uncertainties that affect the outcome of the relationship. The risk attitude of the two parties, in general, will differ for a variety of reasons. A problem arises when this disagreement is over the allocation of risk between the two parties.

Costs. There are various kinds of costs for both parties. Some of these depend on the outcome (which is influenced by uncertainties); others arise from acquiring information, monitoring and the administration of the contract. The heart of the principal-agent theory is the trade-off between (i) the cost of monitoring the actions of the


agent and (ii) the cost of measuring the outcomes of the relationship and the transferring of risk to the agent. Contract. The design of the contract that takes into account the issues discussed above is the challenge that lies at the heart of the principal-agent relationship. 15.6.2 Relevance to Maintenance Outsourcing and Extended Warranties 15.6.2.1 Maintenance Outsourcing Outsourcing of maintenance involves all the Agency Theory issues discussed in Section 15.6.1 with the customer as the principal and the maintenance service provider as the agent. The key factor is the contract that specifies what, when, and how maintenance is to be carried out. This needs to be designed taking into account all the various issues. Kraus (1996) reviews the literature on incentive contracting. The customer and service agent both potentially face moral hazard. This can occur for the customer when the service agent shirks to reduce costs and doesn’t do proper maintenance and it can occur for the agent when the customer uses the asset in a manner different to that stated in the contract. Adverse selection can also take place when the customer chooses from a pool of potential maintenance service providers (the B scenarios in Table 15.2). The two parties have different information about asset state, usage level, care and attention of the asset, and quality of maintenance used and this asymmetry will affect the outcome of their relationship. The different market scenarios for maintenance outsourcing are as indicated in Table 15.2. In scenario A-1, the classical principal-agent model discussed in Section 15.6.1 is appropriate with a single principal (customer) and a single agent (maintenance provider). This could be a large business unit, for example. In the remaining five scenarios, there are multiple principals and/or multiple agents. In scenarios A-2 and A-3, the equipment under consideration could be a particular brand of lift installed in different buildings within a city. 
In this case, all the equipment is maintained either by the OEM or an agent of the OEM. There is an extensive literature dealing with the design of contracts for multiple principal/multiple agent problems (Macho-Stadler and Perez-Castrillo 1997 and Laffont and Martimort 2002 are two examples from this literature) and all the issues from Section 15.6.1 are still relevant. The principal-agent models that have been studied in the literature are static in nature and new, dynamic models need to be formulated so that they can be applied meaningfully in the context of maintenance outsourcing.

15.6.2.2 Extended Warranties

This case is similar to A-3. In the case of standard commercial and industrial products and consumer durables, the EW policy is decided by the EW provider and the customer does not have any direct input. The issues (such as moral hazard, adverse selection, risk, monitoring, etc.) from agency theory are all relevant for EW policies. Current EW offerings lack flexibility from the customer point of view and there is a perception (amongst customers and EW regulators) that the pricing of EWs is not fair. This provides an opportunity for EW providers to offer flexible


warranties to meet the different needs across the customer population. Agency theory offers a framework to evaluate the costs of different policies taking into account all the relevant issues.

15.7 Conclusion and Topics for Future Research

In this chapter we have proposed a framework to look at maintenance outsourcing from both the equipment owner (customer for maintenance service) and the service agent (maintenance service provider) perspectives. A review of the literature indicates that the bulk of it is qualitative with only very few papers dealing with the topic in a more quantitative manner. Also, not all the relevant issues have been addressed effectively. Agency theory provides an approach to address all these issues in a unified manner. This will require building new models and offers scope for a lot of new research in the future.

The provision of extended warranties is very similar to maintenance outsourcing. We have highlighted this link and have also discussed the concept of flexible EWs. The framework proposed in this chapter combined with agency theory can be used by EW providers to obtain better estimates of the cost of offering different EW options in a more objective and scientific manner where all the various issues such as moral hazard, adverse selection, risk, etc., are taken into account. Again, there is considerable scope for more future research in EWs.

15.8 References

Armstrong, R.D. and Cook, W.D. (1981), The contract formation problem in preventive pavement maintenance: A fixed-charge goal-programming model, Comp. Environ. Urban Systems, 6, 147–155
Asgharizadeh, E. and Murthy, D.N.P. (2000), Service contracts – a stochastic model, Mathematical and Computer Modelling, 31, 11–20
Barlow, R.E. and Hunter, L.C. (1960), Optimum preventive maintenance policies, Operations Research, 8, 90–100
Bertolini, M., Bevilacqua, M., Braglia, M. and Frosolini, M. (2004), An analytical method for maintenance outsourcing service selection, International Journal on Quality & Reliability Management, 21, 772–788
Bevilacqua, M. and Braglia, M. (2000), The analytic hierarchy process applied to maintenance strategy selection, Reliability Engineering & System Safety, 70, 71–83
Biedenweg, F.M. (1981), Warranty Analysis: Consumer Value vs. Manufacturers Cost, Unpublished Ph.D. Thesis, Stanford University, U.S.A.
Blischke, W.R. and Murthy, D.N.P. (1994), Warranty Cost Analysis, Marcel Dekker, New York
Blischke, W.R. and Murthy, D.N.P. (1996), Product Warranty Handbook, Marcel Dekker, New York
Blischke, W.R. and Murthy, D.N.P. (2000), Reliability, Wiley, New York
Campbell, J.D. (1995), Outsourcing in maintenance management: a valid alternative to self-provision, Journal of Quality in Maintenance Engineering, 1, 18–24


Cho, D. and Parlar, M. (1991), A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51, 1–23
Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman and Hall, New York
Day, E. and Fox, R.J. (1985), Extended warranties, service contracts and maintenance agreements – A marketing opportunity? Journal of Consumer Marketing, 2, 77–86
Dekker, R., Wildeman, R.E. and van der Duyn Schouten, F.A. (1997), Review of multicomponent models with economic dependence, Zor/Mathematical Methods of Operations Research, 45, 411–435
Desai, P.S. and Padmanabhan, V. (2004), Durable good, extended warranty and channel coordination, Review of Marketing Science, 2, Article 2, available at www.bepress.com/romsjournal/vol2/iss1/art2
Dunn, S. (1999), Maintenance outsourcing – Critical issues, available at: www.plantmaintenance.com/maintenance_articles_outsources.html
Eisenhardt, K.M. (1989), Agency theory: An assessment and review, The Academy of Management Review, 14, 57–74
Embleton, P.R. and Wright, P.C. (1998), A practical guide to successful outsourcing, Empowerment in Organizations, 6, 94–106
Eppen, G.D., Hanson, W.A. and Martin, R.K. (1991), Bundling – new products, new markets, low risks, Sloan Management Review, Summer, 7–14
Hollis, A. (1999), Extended warranties, adverse selection and aftermarkets, The Journal of Risk and Insurance, 66, 321–343
Iskandar, B.P. and Murthy, D.N.P. (2003), Repair-replace strategies for two-dimensional warranty policies, Mathematical and Computer Modelling, 38, 1233–1241
Iskandar, B.P., Murthy, D.N.P. and Jack, N. (2005), A new repair-replace strategy for items sold with a two-dimensional warranty, Computers and Operations Research, 32, 669–682
Jack, N. and Murthy, D.N.P. (2001), A servicing strategy for items sold under warranty, Jr. Oper. Res. Soc., 52, 1284–1288
Jack, N. and Murthy, D.N.P. (2006), A Flexible Extended Warranty and Related Optimal Strategies, Jr. Oper. Res. Soc. (accepted for publication)
Jack, N. and Van der Duyn Schouten, F. (2000), Optimal repair-replace strategies for a warranted product, Int. J. Production Economics, 67, 95–100
Jardine, A.K.S. and Buzacott, J.A. (1985), Equipment reliability and maintenance, European Journal of Operational Research, 19, 285–296
Judenberg, J. (1994), Applications maintenance outsourcing, Information Systems Management, 11, 34–38
Khosrowpour, M. (ed) (1995), Managing Information Technology Investments with Outsourcing, Idea Group Publishing, Harrisburg
Kraus, S. (1996), An overview of incentive contracting, Artificial Intelligence, 83, 297–346
Kumar, R. and Kumar, U. (2004), Service delivery strategy: Trends in mining industries, Int. J. Surface Mining, Reclamation and Environment, 18, 299–307
Laffont, J. and Martimort, D. (2002), The Theory of Incentives: the Principal-Agent Model, Princeton University Press
Levery, M. (1998), Outsourcing maintenance: a question of strategy, Engineering Management Journal, February, 34–40
Lutz, N.A. and Padmanabhan, V. (1994), Income variation and warranty policy, Working Paper, Graduate School of Business, Stanford University
Lutz, N.A. and Padmanabhan, V. (1998), Warranties, extended warranties and product quality, International Journal of Industrial Organization, 16, 463–493
Macho-Stadler, I. and Perez-Castrillo, D. (1997), An Introduction to the Economics of Information, Oxford University Press


Martin, H.H. (1997), Contracting out maintenance and a plan for future research, Journal of Quality in Maintenance Engineering, 3, 81–90
McCall, J.J. (1965), Maintenance policies for stochastically failing equipment: A survey, Management Science, 11, 493–524
Murthy, D.N.P. and Asgharizadeh, E. (1998), A stochastic model for service contract, Int. Jr. of Reliability Quality and Safety Engineering, 5, 29–45
Murthy, D.N.P. and Asgharizadeh, E. (1999), Optimal decision making in a maintenance service operation, European Journal of Operational Research, 116, 259–273
Murthy, D.N.P. and Djamaludin, I. (2002), Product warranty – A review, International Journal of Production Economics, 79, 231–260
Nguyen, D.G. (1984), Studies in Warranty Policies and Product Reliability, Unpublished Ph.D. Thesis, The University of Queensland, Australia
Nguyen, D.G. and Murthy, D.N.P. (1986), An optimal policy for servicing warranty, Jr. Oper. Res. Soc., 37, 1081–1088
Nguyen, D.G. and Murthy, D.N.P. (1989), Optimal replace-repair strategy for servicing items sold with warranty, Euro. Jr. of Oper. Res., 39, 206–212
Padmanabhan, V. (1995), Usage heterogeneity and extended warranties, Journal of Economics and Management Strategy, 4, 33–53
Padmanabhan, V. (1996), Extended warranties, in Product Warranty Handbook, W.R. Blischke and D.N.P. Murthy (eds), Marcel Dekker, New York
Padmanabhan, V. and Rao, R.C. (1993), Warranty policy and extended warranties: theory and an application to automobiles, Marketing Science, 12, 230–247
Pierskalla, W.P. and Voelker, J.A. (1976), A survey of maintenance models: The control and surveillance of deteriorating systems, Naval Research Logistics Quarterly, 23, 353–388
Pintelon, L.M. and Gelders, L. (1992), Maintenance management decision making, European Journal of Operational Research, 58, 301–317
Rigdon, S.E. and Basu, A.P. (2000), Statistical Methods for the Reliability of Repairable Systems, Wiley, New York
Ross, S.M. (1980), Stochastic Processes, Wiley, New York
Sahin, I. and Polatoglu, H. (1998), Quality, Warranty and Preventive Maintenance, Kluwer, Amsterdam
Scarf, P.S. (1997), On the application of mathematical models to maintenance, European Journal of Operational Research, 63, 493–506
Sherif, Y.S. and Smith, M.L. (1986), Optimal maintenance models for systems subject to failure – A review, Naval Research Logistics Quarterly, 23, 47–74
Stremersch, S., Wuyts, S. and Frambach, R.T. (2001), The purchasing of full-service contracts: An exploratory study within the industrial maintenance market, Industrial Marketing Management, 30, 1–12
Sunny, I. (1995), Outsourcing maintenance: making the right decisions for the right reasons, Plant Engineering, 49, 156–157
Thomas, L.C. (1986), A survey of maintenance and replacement models for maintainability and reliability of multi-item systems, Reliability Engineering, 16, 297–309
UK Competition Commission (2003), A report into the supply of extended warranties on domestic electrical goods within the UK, available at: www.competition-commission.org.uk/inquiries/completed/2003/warranty/index.htm
Valdez-Flores, C. and Feldman, R.M. (1989), A survey of preventive maintenance models for stochastically deteriorating single-unit systems, Naval Research Logistics Quarterly, 36, 419–446
Van Ackere, A. (1993), The principal-agent paradigm: Its relevance to various functional fields, European Journal of Operational Research, 70, 83–103


Vickerman, R. (2004), Maintenance incentives under different infrastructure regimes, Utilities Policy, 12, 315–322
Yun, W.Y., Murthy, D.N.P. and Jack, N. (2006), Warranty servicing with imperfect repair, Submitted for publication

16 Maintenance of Leased Equipment D.N.P. Murthy and J. Pongpech

16.1 Introduction

Businesses need equipment to produce their outputs (goods/services). Equipment degrades with age and usage, and eventually fails (Blischke and Murthy 2000). This impacts business performance in several ways – reduced equipment availability, lower output quality, higher operating costs, increased customer dissatisfaction, etc. The degradation can be controlled through preventive maintenance (PM) actions whilst corrective maintenance (CM) actions restore failed equipment to its working state.

Prior to 1970, businesses owned the equipment, and maintenance was done in-house. Since 1970, there has been a shift towards outsourcing of maintenance. This was primarily due to a change in the management paradigm where activities in a business were classified as either core or non-core, with the non-core activities to be outsourced to external agents if this was deemed to be cost effective. Also, as technology became more complex it was no longer economical to carry out in-house maintenance due to the need for expensive maintenance equipment and highly trained maintenance staff.

Since 1990, there has been an increasing trend towards leasing rather than owning equipment. According to Fishbein et al. (2000) there are several reasons for this. Some of these are as follows:

• Rapid technological advances have resulted in improved equipment appearing on the market, making the earlier generation equipment obsolete at an ever-increasing pace.
• The cost of owning equipment has been increasing very rapidly.
• Businesses view maintenance as a non-core activity.
• It is often economical to lease equipment, rather than buy, as this involves less initial capital investment and often there are tax benefits that make it attractive.


In the USA, the Equipment Leasing Association (ELA) conducted a survey in 2002 (ELA, 2002a) and its findings were as follows:

• 80% of businesses acquire equipment through leasing.
• Leasing accounts for roughly 30% of business capital investment.
• Nearly 50% of office equipment is leased.
• Leasing companies own more equipment than companies in other US industries.

The leasing industry grew from 1990 till the last quarter of year 2001 when it experienced an economic downturn due to the impact from 9/11. In 2002, the predictions made by the Department of Commerce for equipment leasing volume for 2003 and 2004 were $208 and $218 billion respectively. The ELA Online Focus Groups Report (ELA 2002b) states that 60% of leasing benefits come from maintenance options. This is because some equipment leases come with maintenance as an integral part of the lease so that the physical equipment is bundled with maintenance service and offered as a package under a lease contract. This implies that the lessee can focus on the core activities of the business and not be distracted with equipment maintenance. Maintenance of leased equipment raises several new issues for both the lessor and the lessee (Desai and Purohit 1998; Kleiman 2001). The strategic issues deal with the size and composition of the equipment fleet, the number and the location of lease centers, workshop facilities, warehouse for spares, etc. The operational issues include logistics, pricing, marketing, and maintenance strategies. In this chapter we touch on these issues and then focus our attention on maintenance strategies for leased equipment. The outline of the chapter is as follows. Section 16.2 starts with a general introduction to equipment leasing and then the different types of leases are discussed. Section 16.3 deals with a framework to study equipment leasing and reviews the relevant literature. In Section 16.4, we look at the maintenance of equipment under operational lease. We discuss the modeling issues and propose various maintenance policies. Section 16.5 looks at the analysis of two of these policies and the optimal selection of the policy parameters. We conclude with a brief discussion of topics for future research in Section 16.6. We use the following abbreviations and notation. 
Abbreviations
AFT: Accelerated failure time
PH: Proportional hazard
NHPP: Non-homogeneous Poisson process
ROCOF: Rate of occurrence of failure
CM: Corrective maintenance
PM: Preventive maintenance

Notation
F(t): Failure distribution for the time to first failure of new equipment
f(t), r(t): Failure density and hazard functions associated with F(t)
λ0(t): Intensity function with only CM actions

Maintenance of Leased Equipment

λ(t): Intensity function with both CM and PM actions
A: Age of used equipment
x: Reduction in age with PM action
L: Duration of lease period
δj: Reduction in intensity function with j-th PM action
tj: Time instant of j-th PM action
N(L): Number of equipment failures over the lease period
Y: Time to carry out minimal repair (random variable)
G(y): Distribution function for Y
γ, τ: Parameters of penalty cost
Cp(δ): Cost of PM action with reduction in intensity function δ
Cu(x): Cost of PM action with reduction in virtual age x
Cf: Mean cost of a CM action (minimal repair)
Cn: Penalty cost per failure (when number of failures exceeds γ)
Ct: Penalty cost per unit time (when repair time exceeds τ)

16.2 Equipment Leasing

16.2.1 Lease Definition

A lease is a contractual agreement under which the owner of equipment (referred to as the “lessor”) allows another person (referred to as the “lessee”) to operate the equipment for a stated period of time and under specified conditions. Examples of equipment include aircraft, computers, telecommunications equipment, hospital equipment, office equipment, cars, forklifts, etc.

16.2.2 Types of Leases

There are several types of leases but, unfortunately, there is no standard terminology. The terms used in the USA often differ from those used in the UK. We briefly discuss the three main types.

16.2.2.1 Operating Lease

In an operating lease the lessee pays the lessor for the use of equipment over a specified period. Usually, new equipment (for example, cars) is leased with an operating lease but in some cases used equipment is also leased with this type of lease. The lease period is much shorter than the equipment’s expected useful life. At the end of the lease period, the lessor retains ownership of the equipment and can renew the lease contract (if the lessee is interested), lease the equipment to some other lessee, or sell the equipment as second-hand equipment. Additional services, such as operator training (to ensure that the leased item is operated properly, for example in the leasing of specialized industrial equipment) and maintenance (to ensure that the equipment is in a proper operating condition and meets the requirements stated in the lease contract), are provided by the lessor as part of the lease contract. This kind of lease is also referred to as a “true” lease. In the


D. Murthy and J. Pongpech

USA, the Internal Revenue Code defines a true lease as a transaction that allows the lessor to claim ownership and the lessee to claim rental payments as tax deductions. The advantages and disadvantages of an operating lease from the lessee’s perspective are as follows:

Advantages
• The lessee can obtain new equipment (based on the latest technologies) and thus avoid the risks associated with equipment obsolescence.
• The lessee usually gets maintenance and other support from the lessor so that the business can focus on core activities.
• Equipment disposal is the lessor’s responsibility.

Disadvantages
• If the lessee’s needs change over the lease period, premature termination of the lease agreement can incur penalties.
• There is a risk that the lessor does not provide the level of maintenance needed.

16.2.2.2 Finance Lease

In a finance lease, the lessee pays the lessor for the use of equipment over a specified period. At the end of the lease period, the lessee gets the ownership of the equipment either at no cost or at a previously established price. The total payments by the lessee must cover the lessor’s initial investment (for acquiring the equipment) and the profit margin. The type of equipment sold with this type of lease can vary from very expensive industrial and commercial equipment (such as a financial institution leasing aircraft to an airline operator) to less expensive consumer products (banks or retailers leasing domestic appliances, cars, etc. to consumers who own the equipment at the end of the lease). This type of lease is also referred to as a “capital” or “full payout” lease. The advantages and disadvantages of a finance lease from the lessee’s perspective are as follows:

Advantages
• The lessee is able to spread the payments over the lease period (no need for initial cash at purchase).
• It offers greater flexibility as the lessee can choose from a range of lease options, especially in the consumer product market where several institutions offer different types of leases.

Disadvantages
• If the lessee fails to make lease payments as per schedule, the leased equipment can be repossessed and sold by the lessor to recover the payments due.
• Maintenance is often not a part of the lease agreement so that the lessee has to provide for this separately.
• The overall cost to the lessee is significantly higher than the purchase price of the equipment because the payments include not only the financing costs, but also other costs associated with insurance, taxes, etc.

16.2.2.3 Sale and Leaseback

Under a sale and leaseback lease, the owner sells the equipment to a lessor (usually a finance company) and immediately leases it back without ever surrendering the use of the equipment. The maintenance is carried out either by the lessee or some third party. This type of lease is used mainly for infrastructure assets such as rail transport, electricity, sewerage and water pipe networks, etc. The main benefit of using such a lease is that both the lessor and the lessee are eligible for tax deductions.

16.2.2.4 Other Types of Leases

For a discussion of other types of leases see Coyle (2000) and ELA (2005).

16.3 A Framework for Study of Equipment Lease

A framework for the study of equipment leasing involves several key elements and these are shown in Figure 16.1. We discuss each of these briefly.

[Figure 16.1. Framework for study of equipment lease. Elements shown: regulator, owner, customer (user), service provider, equipment (asset), outputs (products/services), government, operator.]

Customer: The customer is the lessee. The lessee can be an individual (purchasing a car under a finance lease), a business (operating industrial or commercial equipment under an operating lease) or a government agency (responsible for operating an infrastructure, such as a train network, under a sale and leaseback lease).

Equipment: Equipment can be an infrastructure (for example, parts of a road network, railway network, sewerage and water network, electricity network, etc.); industrial equipment (for example, trucks, cranes, plant machinery, etc.); commercial equipment (for example, office furniture, vending machines, photocopiers, etc.); and consumer products (for example, refrigerators, computers, etc.). The cost of the equipment (or asset) can vary significantly. Ezzel and Vora (2001) give some interesting statistics relating to sale and leaseback, and operating leases in the USA over the period 1984–1991.

Owner: The owner is a person or agency that owns the equipment from a legal point of view. In the case of a finance lease, the financial institution is the owner as the equipment is mortgaged to the institution.

Service provider: In the case of an operating lease, the lessor is the service provider. However, if the lessor decides to outsource the maintenance to some external service agent, then the agent is the service provider. In the case of a finance lease, the lessee is responsible for the maintenance and might decide to outsource it to an external agent.

Outputs (products/services): If the lessee is a business, then the leased equipment is used to produce its outputs (goods and/or services) as discussed in Section 16.1. For consumer goods, the output is the utility (in the case of a kitchen appliance) or the satisfaction (in the case of a television) derived by the lessee.

Operator: In general, the lessee is the operator of the equipment. However, the lessee, in turn, might hire some other business to operate the equipment and produce the desired outputs. An example of this is a business that leases a fleet of aircraft, then outsources the flying to another business that employs the crew and operates the planes.

Government: Government plays an important role in the context of sale and leaseback leases of infrastructure. The lessee can be a department of the government or an independent unit acting as a proxy for the government. Decisions relating to subsidies, tax incentives, etc., are made by the government and have a significant impact on the lease structure.

Regulator: This applies mainly to equipment used in certain industry sectors (such as health, transport, energy) where public safety is of great concern. The regulator is often an independent body that monitors and makes recommendations that can be binding on the owners and operators of equipment.

Vickerman (2004) deals with the infrastructure maintenance issues in the context of rail and road transport in the UK and discusses the role of government and regulators. Interested readers should consult the references cited in the paper for more details.

16.3.1 Different Scenarios of Leasing

There are many different scenarios depending on the number of parties involved. Table 16.1 gives three different scenarios involving up to four parties. Other scenarios can include additional parties such as the government and/or the regulator. In the remainder of the chapter we focus our attention on industrial and commercial equipment leased under an operating lease and this corresponds to Scenario 1.

Table 16.1. Three different scenarios of leasing

                            Scenario 1                        Scenario 2               Scenario 3
Number of parties involved  2                                 3                        4
First party                 Lessor: Owner & Service Provider  Lessor: Owner            Lessor: Owner
Second party                Lessee: User & Operator           Lessee: User & Operator  Lessee: User
Third party                 --                                Service Provider         Service Provider
Fourth party                --                                --                       Operator

16.3.2 Business Equipment and Operating Lease

According to Baker and Hayes (1981), some of the pioneers in business equipment leasing were IBM and Xerox. Since then, the number of businesses that lease business equipment has grown significantly and many kinds of equipment are leased. ELA (2005) gives a list of some of the businesses leasing their products under operating leases. We focus on the maintenance (provided by the lessor) of equipment leased under an operating lease.[1] A framework to study this involves several key elements and these are indicated in Figure 16.2.

[Figure 16.2. Conceptual model of equipment leasing. Elements shown: lease contract, lessor, maintenance, lessee, equipment.]

Lessor: The lessor is not only the owner of the leased equipment, but also the maintenance service provider. The lessor is a business (either manufacturer or some other entity) and as such has certain business objectives. At the strategic level these can include issues such as ROI, market share, profits, etc. In order to achieve these objectives, the lessor needs to have proper strategies at the strategic level (to deal with issues such as type and number of equipment to lease, upgrade options to compensate for technological obsolescence, etc.) and at the operational level (maintenance servicing, inventory of spares, crew size, etc.).

[1] In the case of a finance lease, the lessee has the option of either doing the maintenance in house or outsourcing it to some third party. For more on maintenance outsourcing, see Deelen et al. (2003).

Lessee: The lessee is a business that leases the equipment to produce its outputs (goods and/or services). The lessee has to choose which equipment to lease when there are several competing brands, the best lease arrangement from the set of lease options available, the terms of the lease, etc. Critical to this decision-making are issues such as equipment availability, cost, etc. Also, the lessee needs to take into account the effect of failures on production and their subsequent impact on customer satisfaction. As a result, the lessee’s objectives are different from those of the lessor.

Equipment: A critical factor is the reliability of equipment. One needs to differentiate between new and used equipment. The reliability of new equipment is the inherent reliability and this depends on the decisions made by the manufacturer during the design and production of the equipment. The field reliability depends on factors such as usage intensity (which determines the load on the equipment) and the operating environment. In the case of used equipment, the reliability depends on the inherent reliability and the operating and maintenance history.

Maintenance: Equipment degrades with age and usage, and ultimately fails. Maintenance actions can be broadly grouped into two categories: corrective maintenance (CM) and preventive maintenance (PM). CM actions are needed to restore failed equipment to an operational state. PM actions are needed to control equipment degradation and reduce the likelihood of failure.

Contract: The contract needs to take into account the interests of both the lessor and the lessee. The contract defines the terms and conditions of the lease (lease period, rental payments, renewal options, penalty for early termination, equipment upgrade, etc.).
From the lessee’s point of view, the number of failures over the lease period and the recovery time after each failure are important as they affect equipment availability and the smooth running of the operations. The contract can include terms to ensure that failures occur infrequently and the recovery times are small. The lessor incurs penalties if these terms are violated. Also, in the case of incentive-oriented contracts, the lessor is paid a bonus if equipment-related measures are better than the values specified in the contract (for example, availability above, or number of failures below, the specified value).

16.3.3 Literature Review

The literature on equipment leasing deals with a variety of issues. For a broader overview see Baker and Hayes (1981), Schallheim (1994) and Coyle (2000). The bulk of the literature deals with issues from the lessee’s perspective, and these can be broadly divided into two groups: (a) management oriented and (b) economics and finance oriented. The management oriented literature is mainly qualitative and deals with the following issues:

• Buy vs. lease options through proper cost and benefit analysis
• Selection of the most appropriate lease option
• Negotiating the terms of the lease option
• Administration of lease contracts

See Deelen et al. (2003) and ELA (2005) for more details. The economics and finance oriented literature looks at both the lessor and lessee perspectives and the leased equipment market resulting from the interaction between these two parties. Ezzel and Vora (2001), Sharpe and Nguyen (1995), Desai and Purohit (1998), Stremersch et al. (2001), Handa (1991) and Kim et al. (1978) are an illustrative sample where readers can find more details. The literature on maintenance is vast and there are many survey papers and books on the topic. They deal with a range of issues – determining optimal maintenance strategies, planning and implementation of maintenance actions, logistics of maintenance, etc. References to these can be found in review/survey papers (McCall 1965; Pierskalla and Voelker 1976; Sherif and Smith 1976; Jardine and Buzacott 1985; Gits 1986; Thomas 1986; Valdez-Flores and Feldman 1989; Cho and Parlar 1991; Pintelton and Gelders 1992; Dekker et al. 1997; Scarf 1997). There are very few papers dealing with the maintenance of leased equipment and these will be discussed later in the chapter.

16.4 Maintenance of Equipment Under an Operational Lease

The lessee has to decide first on whether to lease or buy equipment and, once a decision is made to lease, the next step is to decide on the lease contract. The lease contract might be decided by the lessor or by the lessee or jointly. Figure 16.3 shows the key elements that are involved in the decision-making processes of the lessor and the lessee. We focus on the maintenance of the leased equipment in the remainder of the chapter. The lessor has to decide on an effective maintenance strategy for the leased equipment. The maintenance decision depends on the following factors:

• The duration of the lease.
• The penalty terms in the lease contract.
• The usage intensity (which is under the control of the lessee) and the operating environment (which might or might not be under the control of the lessee).
• The initial state of the equipment (in the case of used equipment).

To determine the optimal maintenance for a specific leased equipment, the lessor has to decide on the maintenance policy, and then determine the optimal values for the parameters of this policy. In order to do this, both failures and the effect of maintenance actions on failures need to be modeled. Figure 16.3 shows the key elements for determining the optimal maintenance.


Figure 16.3. Framework for decision-making with regards leased equipment

16.4.1 Equipment Failures

One needs to differentiate between first and subsequent failures. The first failure depends on the age of the equipment (in the case of used equipment) and the subsequent failures depend on the type of CM actions (to rectify failures) and the PM actions (to avoid failures).

16.4.1.1 First Failure

In the case of new equipment, the time to first failure is a random variable and is modeled by a distribution function F(t). The failure density function f(t) and the hazard function r(t) are given by

$$ f(t) = \frac{dF(t)}{dt} \quad \text{and} \quad r(t) = \frac{f(t)}{1 - F(t)} \tag{16.1} $$

respectively. In the case of used equipment, let A denote the age at the start of the lease. Then, the time to first failure is given by the conditional failure distribution function

$$ F(t \mid A) = \frac{F(t) - F(A)}{1 - F(A)}, \quad t \ge A. \tag{16.2} $$
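Equations 16.1 and 16.2 can be illustrated with a short numerical sketch. It assumes a two-parameter Weibull form for F(t), chosen here only for illustration (the chapter fixes a Weibull intensity later, in Section 16.5); the function names are hypothetical.

```python
import math

# Assumption: F(t) is a two-parameter Weibull distribution with scale alpha
# and shape beta. This choice is illustrative only.

def weibull_cdf(t, alpha, beta):
    """F(t) = 1 - exp(-(t/alpha)^beta)."""
    return 1.0 - math.exp(-((t / alpha) ** beta))

def weibull_pdf(t, alpha, beta):
    """f(t) = dF(t)/dt."""
    return (beta / alpha) * (t / alpha) ** (beta - 1) * math.exp(-((t / alpha) ** beta))

def hazard(t, alpha, beta):
    """r(t) = f(t) / [1 - F(t)], as in Equation 16.1."""
    return weibull_pdf(t, alpha, beta) / (1.0 - weibull_cdf(t, alpha, beta))

def conditional_cdf(t, age, alpha, beta):
    """F(t | A) from Equation 16.2, defined for t >= A."""
    if t < age:
        raise ValueError("t must be at least the age A at the start of the lease")
    f_age = weibull_cdf(age, alpha, beta)
    return (weibull_cdf(t, alpha, beta) - f_age) / (1.0 - f_age)

if __name__ == "__main__":
    # Hazard of new equipment at t = 1 and the first-failure distribution
    # of equipment that is already one year old at the start of the lease.
    print(hazard(1.0, alpha=1.0, beta=2.0))
    print(conditional_cdf(2.0, age=1.0, alpha=1.0, beta=2.0))
```

For the Weibull case the conditional distribution reduces to 1 − exp{−[(t/α)^β − (A/α)^β]}, which gives a quick check on the numerical values.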

16.4.1.2 Corrective Maintenance (CM) Actions

CM actions are performed to restore failed equipment to its operational state. Depending on the effect of CM on the failure rate we have many different models. If the failure rate after repair is essentially the same as it would have been had the equipment not failed, then it is called “minimal repair” (see Barlow and Hunter 1960). This is appropriate for complex equipment where the equipment failure is due to failure of one or a few components. The equipment becomes operational by replacing (or repairing) the failed components. This action has very little impact on the reliability characteristics of the equipment. If the failure rate changes (in either direction) after repair, it is called “imperfect repair”. Many different types of imperfect repair models have been proposed and for a review of such models see Pham and Wang (1996).

The time to repair is in general a random variable and needs to be modeled by a distribution function. Typically, the time to repair is very much smaller (in a statistical sense) than the time between failures, so that one can ignore it and treat repair as being instantaneous for determining failures over time. With this assumption, the failures over time (with only CM actions) occur according to a non-homogeneous Poisson process (NHPP) with intensity function λ0(t) = r(t), the hazard function defined earlier. The intensity function (characterizing the failures over time) is also referred to as the “rate of occurrence of failure” (ROCOF). The cost of repair is also a random variable and needs to be modeled by a distribution function. Let Cf denote the average cost of each minimal repair.

16.4.1.3 Preventive Maintenance (PM) Actions

PM actions are used to control the degradation process and to reduce the likelihood of failure occurrences. Inspection, cleaning, lubrication, adjustment and calibration, replacement of degraded components, and major overhaul are some common tasks that are carried out under PM. The effect of a PM action is to improve the reliability of the equipment. There are several ways of modeling this improvement and we discuss two of them (Reduction in Failure Intensity and Reduction in Age) later in the section.
The time needed to carry out PM actions can vary and needs to be modeled properly. For minor PM actions, the time needed is small relative to the time between failures and can be ignored. For a major overhaul, the time can be significant and cannot be ignored. The cost of a PM action comprises the administration cost, labor cost, material cost, and spare parts inventory cost, and some of these costs are uncertain.

Reduction in intensity function: Here a PM action results in a reduction in the intensity function (ROCOF). λ0(t) is the intensity function without any PM actions. Let λ(t) denote the intensity function with PM actions. We assume that the time for a PM action is small relative to the mean time between failures so that it can be ignored. The effect of PM on the intensity function is given by

$$ \lambda(t_j^{+}) = \lambda(t_j^{-}) - \delta_j \tag{16.3} $$

where δj is the reduction resulting from the PM action at time tj. δj depends on the level of PM effort and is constrained as follows:

$$ 0 \le \delta_j \le \lambda(t_j^{-}) - \lambda(0) \tag{16.4} $$

This implies that a PM action cannot make the equipment better than new. As a result, if PM actions are carried out at time instants tj, j ≥ 1, and the reductions in the intensity function are given by δj, j ≥ 1, then the intensity function is given by

$$ \lambda(t) = \lambda_0(t) - \sum_{i=0}^{j} \delta_i, \quad t_j < t < t_{j+1}, \tag{16.5} $$

for j ≥ 0, with t0 = 0 and δ0 = 0. This implies that the reduction resulting from the action at tj lasts for all t ≥ tj, as shown in Figure 16.4.

[Figure 16.4. Effect of PM action on the intensity function for new equipment. The plot shows λ0(t) and λ(t) against time, with reductions δ1 and δ2 at the PM instants t1 and t2.]
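The reduction-in-intensity model of Equations 16.3 to 16.5 can be sketched numerically as follows. The sketch assumes a Weibull baseline intensity with shape β ≥ 1 (the form used later in Section 16.5), PM times sorted in increasing order and falling inside the lease period, and hypothetical function names. The expected number of failures uses the fact that, for an NHPP, E[N(L)] is the integral of λ(t) over [0, L]; integrating Equation 16.5 gives (L/α)^β − Σ δj (L − tj).

```python
import bisect

def baseline_intensity(t, alpha, beta):
    """Weibull baseline intensity lambda_0(t); an assumed form, with beta >= 1."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)

def intensity_with_pm(t, pm_times, reductions, alpha, beta):
    """lambda(t) from Equation 16.5: baseline minus all reductions applied before t."""
    done = bisect.bisect_left(pm_times, t)  # number of PM instants strictly before t
    return baseline_intensity(t, alpha, beta) - sum(reductions[:done])

def reductions_feasible(pm_times, reductions, alpha, beta, tol=1e-9):
    """Check Equation 16.4 at every PM instant: the cumulative reduction
    may not push the intensity below its value for new equipment."""
    floor = baseline_intensity(0.0, alpha, beta)
    total = 0.0
    for t_j, d_j in zip(pm_times, reductions):
        total += d_j
        if d_j < 0 or total > baseline_intensity(t_j, alpha, beta) - floor + tol:
            return False
    return True

def expected_failures(lease, pm_times, reductions, alpha, beta):
    """E[N(L)]: integral of lambda(t) over the lease period [0, L]."""
    baseline = (lease / alpha) ** beta
    return baseline - sum(d * (lease - t) for t, d in zip(pm_times, reductions))

if __name__ == "__main__":
    # Illustrative values (not from the source): four PM actions, each
    # restoring the intensity to its as-new level, for alpha = 1, beta = 2.
    times = [0.9, 1.8, 2.7, 3.6]
    deltas = [1.8, 1.8, 1.8, 1.8]
    print(reductions_feasible(times, deltas, 1.0, 2.0))
    print(intensity_with_pm(1.0, times, deltas, 1.0, 2.0))
    print(expected_failures(5.0, times, deltas, 1.0, 2.0))
```

With no PM actions the expected number of failures over five years is 25 for these baseline parameters; the four actions in the example bring it down to about 5.2.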

The cost of each PM action depends on the reduction in the intensity function. Let Cp(δ) denote the cost of a PM action; this is an increasing function of δ.

Reduction in age: Used equipment can be subjected to an upgrade (or overhaul) where components that have degraded significantly are replaced with new ones so that the equipment is in a sense younger (from a reliability point of view). If the age of the equipment is A before it is subjected to PM action, then it can be viewed as equipment of virtual age A − x after the PM action. The reduction in the age is x, 0 < x < A. As a result, the intensity function decreases after PM action as shown in Figure 16.5.

[Figure 16.5. Effect of upgrade action on the intensity function for used equipment. The plot shows λ0(t) and λ(t) against time, with the virtual age reduced from A to A − x.]

The cost of this type of PM action depends on the reduction in the virtual age and is modeled by a function Cu(x), which is an increasing function of x.

16.4.1.4 Usage Intensity and Operating Environment

Equipment is usually designed for some nominal usage intensity and operating environment. When it is operated under these conditions, the ROCOF (with no PM actions) is given by λ0(t). If the equipment is used in a more intense mode and/or the operating environment becomes harsher, then the ROCOF can increase significantly. As a result, failures occur more frequently. Many different models have been proposed to model this change. Two of the well known ones are (i) the accelerated failure time (AFT) model and (ii) the proportional hazard (PH) model. For more on this, see Blischke and Murthy (2000).

16.4.2 Penalties

Both the lessor and the lessee can incur penalties if they violate the terms of the contract. In the case of the lessee, it could be the usage intensity exceeding that specified in the contract (provided the lessor can monitor this). In the case of the lessor, the penalties are linked to equipment failures and the time to repair failed equipment. Two simple forms of penalty are as follows.

Penalty 1: Let N(L) denote the number of equipment failures over the lease period L. If N(L) exceeds γ (a pre-specified value) the lessor incurs a penalty. The amount that the lessor pays to the lessee at the end of the contract is Cn max{N(L) − γ, 0}.

Penalty 2: Let the random variable Y denote the time that the lessor takes to restore failed equipment to its working state. If Y exceeds τ (a pre-specified value) then the lessor incurs a penalty given by Ct max{Y − τ, 0}.
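The two penalty forms can be written down directly. The following is a minimal sketch with hypothetical function names and made-up repair times; it simply evaluates Cn max{N(L) − γ, 0} and Ct max{Y − τ, 0} for one realized lease history.

```python
def failure_count_penalty(n_failures, gamma, c_n):
    """Penalty 1: C_n * max{N(L) - gamma, 0}, paid at the end of the contract."""
    return c_n * max(n_failures - gamma, 0)

def repair_time_penalty(repair_time, tau, c_t):
    """Penalty 2: C_t * max{Y - tau, 0}, incurred for a single repair."""
    return c_t * max(repair_time - tau, 0.0)

def total_penalty(repair_times, gamma, tau, c_n, c_t):
    """Total penalty over one lease, given the repair time of each failure."""
    count_part = failure_count_penalty(len(repair_times), gamma, c_n)
    time_part = sum(repair_time_penalty(y, tau, c_t) for y in repair_times)
    return count_part + time_part

if __name__ == "__main__":
    # Illustrative history (assumed): three failures with repair times in days.
    history = [0.5, 3.0, 1.0]
    print(total_penalty(history, gamma=0, tau=2.0, c_n=200.0, c_t=300.0))
```

In this example every failure is penalized because γ = 0, and only the three-day repair exceeds τ = 2 days, so the two parts contribute 600 and 300 respectively.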


16.4.3 Optimal Maintenance

Whenever a failure occurs, the lessor incurs a direct cost in restoring the failed equipment to its operating state. Also, the lessor can incur indirect costs resulting from the penalties incurred. As a result, the total CM costs are the sum of both the direct and the indirect costs. These costs can be lowered through greater PM effort but this implies increased PM costs. The total cost to the lessor as a function of the PM effort is as shown in Figure 16.6 and the optimal PM effort is one that minimizes the total costs. Since the CM costs are uncertain, the optimal PM effort is based on minimizing the expected total cost. This requires the lessor to first define the kind of PM policy that would be employed and then to optimally select the parameters of the policy so as to minimize the expected total cost.

[Figure 16.6. Optimal PM effort. The plot shows PM cost, CM cost and total cost against PM effort (low to high), with the optimal PM effort at the minimum of the total cost.]

16.4.4 Maintenance Policies

One can define many different types of PM policies that the lessor can use. We first consider new equipment lease. We define a few policies and indicate the parameters that need to be optimally selected. Later we look at used equipment lease.

16.4.4.1 New Equipment Lease

Policy 1: The equipment is subjected to k preventive maintenance actions over the lease period. The time instants at which these actions are carried out are given by {tj, 1 ≤ j ≤ k} with ti < tj for i < j. The reduction in the intensity function during the j-th PM action is δj. All failures over the lease period are rectified through minimal repair. As a result, the policy is characterized by the parameter set θ ≡ {k, tj, δj, 1 ≤ j ≤ k}.

Policy 2: The equipment is subjected to preventive maintenance actions periodically so that the j-th PM action is carried out at time tj = jT, j = 1, 2, ..., k. After each PM action the intensity function is reduced by δj. All failures over the lease period are rectified through minimal repair. The policy is characterized by the parameter set θ ≡ {T, δj}.

Policy 3: The equipment is subjected to preventive maintenance action whenever the intensity function reaches a specified level ρ. Each PM action reduces the intensity function by a fixed amount δ. All failures over the lease period are rectified through minimal repair. The policy is characterized by the parameter set θ ≡ {ρ, δ}.

Policy 4: Let 0 < ς1 < ς2 < L. The equipment is subjected to no PM actions in the interval [0, ς1), periodic PM actions with period 2∆ in the interval [ς1, ς2) and period ∆ in the interval [ς2, L). Each PM reduces the intensity function to a specified level ν. All failures over the lease period are rectified through minimal repair. The policy is characterized by the parameter set θ ≡ {ς1, ς2, ν}.

16.4.4.2 Used Equipment Lease

In this case, the lessor has the additional option of subjecting the equipment to an overhaul. This can be modeled as a reduction in the virtual age so that we now have an additional parameter x (the reduction in age). During the lease period the lessor can use the PM policies defined in Section 16.4.4.1.
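As an illustration of how a policy's parameters translate into a PM schedule, the sketch below works out the PM instants under Policy 3. It assumes a Weibull baseline intensity with β > 1 (the form used in Section 16.5) and the reduction-in-intensity model, so that after j PM actions λ(t) = λ0(t) − jδ and the next action is triggered when λ0(t) = ρ + jδ. The requirement δ ≤ ρ is an assumption added here so that no action drives the intensity below its as-new value of zero; the function name is hypothetical.

```python
def policy3_pm_times(lease, rho, delta, alpha, beta):
    """PM instants under Policy 3 for a Weibull baseline intensity.

    The (j+1)-th action occurs when lambda_0(t) = rho + j * delta, that is at
    t = alpha * ((rho + j * delta) * alpha / beta) ** (1 / (beta - 1)).
    """
    if beta <= 1:
        raise ValueError("this sketch needs an increasing intensity (beta > 1)")
    if not 0 < delta <= rho:
        raise ValueError("delta must satisfy 0 < delta <= rho")
    times = []
    j = 0
    while True:
        level = rho + j * delta  # baseline intensity at the next trigger
        t = alpha * (level * alpha / beta) ** (1.0 / (beta - 1.0))
        if t >= lease:  # only actions inside the lease period count
            break
        times.append(t)
        j += 1
    return times

if __name__ == "__main__":
    # Illustrative values (assumed): alpha = 1, beta = 2, rho = 2, delta = 1, L = 5.
    print(policy3_pm_times(5.0, rho=2.0, delta=1.0, alpha=1.0, beta=2.0))
```

For these values the first action falls at t = 1 and later actions follow every half year, giving eight actions within the five-year lease.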

16.5 Analysis and Optimisation of Maintenance Policies

We confine our attention to Policies 1 and 2 with new equipment lease and Policy 1 with lease of used equipment. Let J(θ) denote the expected total cost to the lessor. This includes the CM and PM costs as well as the penalty costs. We assume γ = 0 so that the lessor incurs a penalty even if there is one failure. (The expressions are slightly more complicated when γ > 0 and the analysis is a lot more difficult.) We present the final expressions for J(θ) and indicate references where interested readers can get the details. It is not possible to derive any analytical results. A computational scheme is needed to obtain the optimal values for the parameters of the policy. Our focus is on the effect of penalty terms in the contract on the optimal maintenance strategies. We illustrate this through numerical examples based on the Weibull intensity function given by

$$ \lambda_0(t) = \frac{\beta}{\alpha} \left( \frac{t}{\alpha} \right)^{\beta - 1} \tag{16.6} $$

where α is the scale parameter and β is the shape parameter. The repair time distribution is given by a two-parameter Weibull distribution

$$ G(y) = 1 - \exp\left[ -\left( \frac{y}{n} \right)^{m} \right], \quad 0 \le y < \infty \tag{16.7} $$

with shape parameter m < 1 (implying a decreasing repair rate) and scale parameter n > 0. We assume the following parameter values:

Intensity function: α = 1 (year) and β > 1 (implying an increasing failure rate)
Repair time: m = 0.5 and n = 0.5 (mean time to repair is one day)
Reduction in intensity function: Cp(δ) = 100 + 50δ ($)
Reduction in age: Cu(x) = wx / (1 − e^{−ϕ(A − x)}) ($) with w = 10 and ϕ = 0.1
Cost parameters: Cf = 100 ($), Cn = 200 ($), Ct = 300 ($)

16.5.1 Policy 1 (New Equipment Lease)

From Jaturonnatee et al. (2005), the expected total cost is given by

$$ J(\theta) = C_f E[N(L)] + \sum_{j=1}^{k} C_p(\delta_j) + C_t E[N(L)] \left\{ \int_{\tau}^{\infty} (y - \tau)\, g(y)\, dy \right\} + C_n E[N(L)] \tag{16.8} $$

The first term on the right-hand side is the cost of rectifying failures, the second term is the PM costs, and the third and fourth terms represent the penalty costs associated with repair times and number of failures over the lease period. The parameters, given by the set θ ≡ {k, tj, δj, 1 ≤ j ≤ k}, need to be selected optimally to minimize J(θ).

Example 16.1

Table 16.2 (extracted from Table 3 of Jaturonnatee et al. 2005) shows k*, the optimal number of PM actions (the optimal values for the remaining parameters are omitted) and J(θ*), the corresponding expected costs for a range of τ and Cn. The optimisation needs to take into account the following constraint:

$$ 0 < \sum_{i=0}^{j} \delta_i < \lambda_0(t_j) - \lambda_0(0), \quad j \ge 1, \tag{16.9} $$

with t0 = 0 and δ0 = 0.

Table 16.2. Optimal maintenance under Policy 1

                Cn = 0 ($)           Cn = 200 ($)
β    τ (days)   k*    J(θ*)          k*    J(θ*)
1.5  1          4     $1002.27       5     $1298.34
1.5  2          3     $907.63        5     $1223.91
1.5  3          3     $838.08        5     $1179.39
1.5  ∞          2     $615.31        4     $1042.31
2    1          7     $1992.32       10    $2531.77
2    2          6     $1811.05       9     $2399.16
2    3          6     $1693.48       9     $2317.58
2    ∞          4     $1280.00       7     $2067.71
3    1          19    $7511.43       26    $8962.08
3    2          16    $7009.92       24    $8610.66
3    3          16    $6677.07       23    $8388.23
3    ∞          10    $5437.03       20    $7712.87

The case of no penalty corresponds to Cn = 0 and τ → ∞. In this case, for β = 2 and L = 5, we have from Table 16.2 k* = 4 and J(θ*) = 1280.00 ($). With only the penalty for repair not being completed within the specified time (τ = 2 and Cn = 0), k* increases to 6 and the expected total cost increases to 1811.05 ($). With only the penalty for failure occurrence (Cn = 200 and τ → ∞), k* increases to 7 and the expected total cost increases to 2067.71 ($). With both penalties (τ = 2 and Cn = 200), k* increases to 9 and the expected total cost increases to 2399.16 ($). The impact of the penalty is more significant as β increases.
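The expected total cost of Equation 16.8 can be evaluated numerically for a given parameter set. The sketch below is an illustration under stated assumptions, not the computational scheme of Jaturonnatee et al. (2005): it uses the Weibull forms of Equations 16.6 and 16.7, takes E[N(L)] = (L/α)^β − Σ δj (L − tj) (the integral of the intensity in Equation 16.5), and computes the expected excess repair time as the integral of 1 − G(y) from τ to infinity by Simpson's rule after the substitution u = (y/n)^m. It assumes m ≤ 1, lease times in years and repair times in days, does not check the constraint in Equation 16.9, and all function names are hypothetical.

```python
import math

def expected_failures(lease, pm_times, reductions, alpha, beta):
    """E[N(L)] for the reduction-in-intensity model with a Weibull baseline."""
    return (lease / alpha) ** beta - sum(
        d * (lease - t) for t, d in zip(pm_times, reductions))

def expected_excess_repair_time(tau, m, n, steps=20000, width=60.0):
    """Integral of (y - tau) g(y) dy over [tau, inf) for Weibull repair times.

    Equal to (n/m) * integral of exp(-u) * u**(1/m - 1) over [u0, inf), with
    u0 = (tau/n)**m; the upper limit is truncated at u0 + width.
    """
    if math.isinf(tau):
        return 0.0
    u0 = (tau / n) ** m
    power = 1.0 / m - 1.0
    h = width / steps  # steps must be even for Simpson's rule

    def f(u):
        return math.exp(-u) * u ** power

    total = f(u0) + f(u0 + width)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(u0 + i * h)
    return (n / m) * total * h / 3.0

def expected_total_cost(lease, pm_times, reductions, tau, alpha, beta,
                        m=0.5, n=0.5, c_f=100.0, c_n=200.0, c_t=300.0,
                        pm_fixed=100.0, pm_slope=50.0):
    """J(theta) from Equation 16.8 with C_p(delta) = pm_fixed + pm_slope * delta."""
    e_n = expected_failures(lease, pm_times, reductions, alpha, beta)
    pm_cost = sum(pm_fixed + pm_slope * d for d in reductions)
    excess = expected_excess_repair_time(tau, m, n)
    return c_f * e_n + pm_cost + c_t * e_n * excess + c_n * e_n

if __name__ == "__main__":
    # PM times and reductions chosen for illustration; the source omits them.
    cost = expected_total_cost(
        lease=5.0, pm_times=[0.9, 1.8, 2.7, 3.6], reductions=[1.8] * 4,
        tau=math.inf, alpha=1.0, beta=2.0, c_n=0.0)
    print(f"{cost:.2f}")
```

For the parameter set in the usage example (β = 2, L = 5, Cn = 0, τ → ∞, four PM actions) the sketch evaluates to 1280.00, which agrees with the corresponding entry of Table 16.2. The PM times and reductions used are an illustrative choice, since the table does not report them.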

16.5.2 Policy 2 (New Equipment Lease)

The number of PM actions carried out over the lease period is k(T), given by the largest integer less than [L/T]. The expected total cost is given by Equation 16.8 with t_j = jT, j ≥ 1, and the parameters, given by the set θ ≡ {T, δ_j, 1 ≤ j ≤ k(T)}, need to be selected optimally to minimize J(θ) subject to the constraint given by Equation 16.9.
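
A minimal sketch of the periodic schedule and the feasibility check may help here. It reads the definition of k(T) literally, as the largest integer strictly below L/T, which is an interpretation. The power-law baseline intensity λ0(t) = (β/α)(t/α)^(β−1) and its parameter α are assumptions introduced for illustration and are not taken from this section; all function names are hypothetical.

```python
import math


def pm_count(L, T):
    """Number of PM actions: largest integer strictly less than L / T."""
    return math.ceil(L / T) - 1


def pm_times(L, T):
    """Periodic PM instants t_j = j * T for j = 1, ..., k(T)."""
    return [j * T for j in range(1, pm_count(L, T) + 1)]


def baseline_intensity(t, alpha=1.0, beta=2.0):
    """Assumed power-law failure intensity (illustrative only; beta >= 1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


def satisfies_constraint(times, deltas, intensity=baseline_intensity):
    """Check 0 < sum_{i<=j} delta_i < lambda0(t_j) - lambda0(0) for all j >= 1."""
    total = 0.0
    for t_j, d_j in zip(times, deltas):
        total += d_j
        if not (0.0 < total < intensity(t_j) - intensity(0.0)):
            return False
    return True


times = pm_times(5, 0.5)                      # L = 5, T = 0.5
print(len(times), satisfies_constraint(times, [0.4] * len(times)))
```

With these assumed parameters a constant reduction of 0.4 per PM action is feasible, whereas a reduction of 1.2 already violates the bound at the first PM instant.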

Example 16.2 Table 16.3 (extracted from Pongpech and Murthy 2006) shows T* (the optimal values for the other parameters are omitted) and the corresponding expected total cost for β = 3 and L = 5.

Table 16.3. Optimal maintenance under Policy 2

              Cn = 0 ($)              Cn = 200 ($)
τ (days)    T*        J(θ*)         T*        J(θ*)
1           0.2381    $7827.21      0.1786    $9336.99
2           0.2778    $7312.50      0.1923    $8969.90
3           0.3125    $6968.53      0.2000    $8737.14
∞           0.5000    $5750.00      0.2273    $8034.90


When there is no penalty (Cn = 0 and τ → ∞) we see from Table 16.3 that T* = 0.5 and J(θ*) = 5750.00 ($). The effect of the repair time penalty is that T* decreases as τ decreases. The effect of the failure penalty is similar, with T* decreasing as Cn increases.

16.5.3 Policy 1 (Used Equipment Lease)

The age of the used equipment is A and the lessor carries out an overhaul which reduces its age by x before the equipment is leased out. The analysis is similar to that of Policy 1 for new equipment and the expected total cost (see Pongpech et al. (2006) for details) is given by

\[ J(\theta) = C_f E[N(L)] + \sum_{j=1}^{k} C_p(\delta_j) + C_t E[N(L)] \left\{ \int_{\tau}^{\infty} (y - \tau)\, g(y)\, dy \right\} + C_n E[N(L)] + C_u(x) \qquad (16.10) \]

where θ ≡ {x, k, t_j, δ_j, 1 ≤ j ≤ k}. This differs from Equation 16.8 in two ways: (i) E[N(L)] depends on A (the age of the used equipment) and (ii) the last term is the cost of the PM action before the equipment is leased out.

Example 16.3 Table 16.4 (extracted from Pongpech et al. 2006) shows x* and k* (the optimal values for the remaining parameters are omitted) and the corresponding optimal expected total cost for A = 5, β = 2 and L = 5.

Table 16.4. Optimal maintenance under Policy 1 for used equipment

              Cn = 0 ($)                   Cn = 200 ($)
τ (days)    x*     k*    J(θ*)           x*     k*    J(θ*)
1           3.5     7    $8484.36        4.0    10    $11312.57
2           3.5     6    $7488.90        4.0     9    $10637.18
3           3.5     6    $6884.50        4.0     9    $10231.04
∞           2.5     4    $4792.55        3.5     7    $8918.59

With no penalty (Cn = 0 and τ → ∞) we have x* = 2.5, k* = 4 and J(θ*) = 4792.55 ($). The effect of the repair time penalty is that x* increases and then stays constant as τ decreases. Similarly, k* increases (implying more frequent PM actions over the lease period). The effect of the failure penalty is also similar, with x* and k* increasing as Cn increases. Table 16.5 shows the results with A ranging from one to seven years.


Table 16.5. Effect of variation in A on optimal strategy

A     x*     k*    J(θ*)
1     0.0    4     $2280.00
2     0.6    4     $3111.58
3     1.2    4     $3752.68
4     2.0    4     $4290.03
5     2.5    4     $4792.55
6     3.6    4     $5198.07
7     4.2    4     $5601.10

As can be seen, x* (the reduction in age due to PM actions before the equipment is leased out) increases with A as is to be expected since the ROCOF increases with age. Note that no upgrade is needed when the equipment is fairly young ( A = 1 ). Also, k * does not change when β = 2 . However, when β > 2 , then we find that k * increases as A increases.
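
To make the structure of Equation 16.10 concrete, the following sketch evaluates its five terms for given inputs. It is not the authors' optimisation procedure. The expected number of failures E[N(L)] is passed in by the caller because its expression is not given in this section, and the repair-time density g is assumed to be exponential purely so that the penalty integral has the closed form m·exp(−τ/m), where m is the mean repair time. All names and numerical values are hypothetical.

```python
import math


def expected_overrun_exponential(tau, mean_repair):
    """Integral of (y - tau) * g(y) dy over [tau, inf) for an exponential
    repair-time density g with the given mean (an assumption of this sketch)."""
    if math.isinf(tau):
        return 0.0  # no repair-time limit, so no overrun penalty
    return mean_repair * math.exp(-tau / mean_repair)


def expected_total_cost(C_f, pm_costs, C_t, C_n, C_u,
                        expected_failures, tau, mean_repair):
    """Evaluate the five terms of Equation 16.10 for supplied inputs.

    pm_costs holds C_p(delta_j) for j = 1..k, C_u is the upgrade cost C_u(x),
    and expected_failures stands in for E[N(L)].
    """
    overrun = expected_overrun_exponential(tau, mean_repair)
    return (C_f * expected_failures                # cost of rectifying failures
            + sum(pm_costs)                        # PM costs
            + C_t * expected_failures * overrun    # repair-time penalty
            + C_n * expected_failures              # failure-count penalty
            + C_u)                                 # pre-lease upgrade


# Hypothetical inputs, for illustration only.
print(expected_total_cost(100.0, [50.0, 50.0], 300.0, 200.0, 0.0,
                          4.0, 2.0, 0.5))
```

Setting τ to infinity and C_n to zero removes both penalty terms, which corresponds to the no-penalty case discussed in the examples.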

16.6 Topics for Future Research

In this section we briefly discuss some future research areas.

1. The occurrence of failures depends on factors such as usage intensity, operating environment and operator skills. These can vary across the lessee population. One way of modeling this is through the Cox regression model where the intensity function includes an extra term to reflect the effect of these variables.
2. The penalty terms in the lease contract studied so far are fairly simple: a penalty when the repair time and/or the number of failures over the lease period exceed some specified limits. The lease contract can involve more complex penalty terms, for example different upper limits on the number of failures for different intervals over the lease period, or limits on the time interval between subsequent failures.
3. The time to carry out CM actions depends on the availability of repair crew and spare parts. This raises several issues, such as the optimal inventory levels for spares and the number of repair crews, that the lessor needs to deal with. A large inventory and a greater number of crews reduce the penalty cost but increase the inventory holding and operating costs. As a result, these parameters must be selected optimally to achieve a proper trade-off between the two costs.
4. The research so far has focussed mainly on issues of interest to the lessor. When the lessor offers a wide range of options the lessee has to decide on the optimal choice. This needs to take into account the price of the lease and a proper cost-benefit analysis of each option.
5. From the lessor’s point of view, the size and variety of equipment to stock for leasing are both important issues. The optimal choice of these and the replacement decisions must take into account the needs of different lessees and the investment needed for the purchase of new stock.
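
Item 1 above can be illustrated with a short sketch of a proportional intensity of the Cox type, in which the baseline intensity is multiplied by exp(γ·z) for a covariate vector z describing the lessee. The power-law baseline, the covariates and the coefficient values are all hypothetical and serve only to show the form of the model.

```python
import math


def baseline_intensity(t, alpha=1.0, beta=2.0):
    """Assumed power-law baseline intensity lambda0(t) (illustrative only)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


def cox_intensity(t, covariates, coefficients, baseline=baseline_intensity):
    """lambda(t | z) = lambda0(t) * exp(sum_i gamma_i * z_i)."""
    linear = sum(g * z for g, z in zip(coefficients, covariates))
    return baseline(t) * math.exp(linear)


# Hypothetical covariates: z1 = 1 for heavy usage, z2 = 1 for a harsh environment.
gamma = [0.5, 0.3]
print(cox_intensity(1.0, [0, 0], gamma))   # reference lessee
print(cox_intensity(1.0, [1, 1], gamma))   # heavy usage in a harsh environment
```

In practice the coefficients would have to be estimated from failure data collected across the lessee population.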

16.7 References

Baker CR, Hayes RS (1981) Lease Financing – A Practical Guide, John Wiley, New York, USA
Barlow RE, Hunter LC (1960) Optimum preventive maintenance policies, Operations Research, 8:90–100
Blischke WR, Murthy DNP (2000) Reliability Modeling, Prediction, and Optimization, John Wiley, New York, USA
Cho D, Parlar M (1991) A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51:1–23
Coyle B (2000) Leasing, Glenlake, Chicago, USA
Deelen L, Dupleich M, Othieno L, Wakelin O (2003) Leasing for small and micro enterprises – a guide for designing and managing leasing schemes in developing countries, Berold, R. (ed), Cristina Pierini, Turin, Italy
Dekker R, Wildeman RE, Van Der Duyn Schouten FA (1997) Review of multi-component models with economic dependence, Mathematical Methods of Operations Research, 45:411–435
Desai P, Purohit D (1998) Leasing and selling: optimal marketing strategies for a durable goods firm, Management Science, 44(11):19–34
http://www.leasefoundation.org/pdfs/2001StateofIndustryRpt.pdf
ELA (2002a) Equipment Leasing and Financial Foundation 2002 State of the Industry Report, Price Water House Coopers, Available on http://www.leasefoundation.org/pdfs/2002SOIRpt.pdf
ELA (2002b) Equipment Leasing Association Online Focus Groups Report, Available on http://www.chooseleasing.org/Market/2002FocusGroupsRpt.pdf
ELA (2005) The economic contribution of equipment leasing to the U.S. economy: growth, investment & jobs – update, Equipment Leasing Association, Global Insight, Advisory Services Group, Available on http://www.elaonline.org/press/
Ezzell JR, Vora PP (2001) Leasing versus purchasing: Direct evidence on a corporation’s motivation for leasing and consequences of leasing, The Quarterly Review of Economics and Finance, 41:33–47
Fishbein BK, McCarry LS, Dillon PS (2000) Leasing: A step toward producer responsibility, Available on http://www.informinc.org
Gits CW (1986) On the maintenance concept for a technical system: II. Literature review, Maintenance Management International, 6:181–196
Handa P (1991) An economic analysis of leasebacks, Review of Quantitative Financing and Accounting, 1:177–189
Jardine AKS, Buzacott JA (1985) Equipment reliability and maintenance, European Journal of Operational Research, 116:259–273
Jaturonnatee J, Murthy DNP, Boondiskulchok R (2005) Optimal preventive maintenance of leased equipment with corrective minimal repair, European Journal of Operational Research, Available online 30 March 2005
Kim EH, Lewellen WG, McConnell JJ (1978) Sale-and-leaseback agreements and enterprise valuation, Journal of Financial and Quantitative Analysis, 13:871–881
Kleiman RT (2001) The characteristics of venture lease financing, Journal of Equipment Lease Financing, 19(1):1–10
McCall JJ (1965) Maintenance policies for stochastically failing equipment: A survey, Management Science, 11:493–524
Pham H, Wang H (1996) Imperfect maintenance, European Journal of Operational Research, 94:425–438
Pierskalla WP, Voelker JA (1976) A survey of maintenance models: The control and surveillance of deteriorating systems, Naval Logistics Research Quarterly, 23:353–388
Pintelon LM, Gelders L (1992) Maintenance management decision making, European Journal of Operational Research, 58:301–317
Pongpech J, Murthy P (2006) Optimal periodic preventive maintenance policy for leased equipment, Reliability Engineering and System Safety, 91(7):772–777
Pongpech J, Murthy DNP, Boondiskulchok R (2006) Maintenance strategies for used equipment under lease, Journal of Quality in Maintenance Engineering, 12(1):52–67
Scarf PS (1997) On the application of mathematical models to maintenance, European Journal of Operational Research, 63:493–506
Schallheim JS (1994) Lease or Buy? Principles for Sound Decision Making, Harvard Business School Press, Cambridge, Mass.
Sharpe SA, Nguyen HH (1995) Capital market imperfections and incentive to lease, Journal of Financial Economics, 39:271–294
Sherif YS, Smith ML (1976) Optimal maintenance models for systems subject to failure – A review, Naval Logistics Research Quarterly, 23:47–74
Stremersch S, Wuyts S, Rambach RT (2001) The Purchasing of Full-Service Contracts: An Exploratory Study within the Industrial Maintenance Market, Industrial Marketing Management, 30(1):1–12
Thomas LC (1986) A survey of maintenance and replacement models for maintainability and reliability of multi-unit systems, Reliability Engineering, 16:297–309
Valdez-Flores C, Feldman RM (1989) A survey of preventive maintenance models for stochastically deteriorating single-unit systems, Naval Logistics Research Quarterly, 36:419–446
Vickerman R (2004) Maintenance incentives under different infrastructure regimes, Utilities Policy, 12:315–322

17 Computerised Maintenance Management Systems Ashraf Labib

17.1 Introduction

Computerised maintenance management systems (CMMSs) are vital for the coordination of all activities related to the availability, productivity and maintainability of complex systems. Modern computational facilities have offered dramatic scope for improved effectiveness and efficiency in many fields, including maintenance. CMMSs have existed, in one form or another, for several decades. The software has evolved from relatively simple mainframe planning of maintenance activity to Windows-based, multi-user systems that cover a multitude of maintenance functions. The capacity of CMMSs to handle vast quantities of data purposefully and rapidly has opened new opportunities for maintenance, facilitating a more deliberate and considered approach to managing assets. Some of the benefits that can result from the application of a CMMS are:

• Resource control – tighter control of resources
• Cost management – better cost management and auditability
• Scheduling – ability to schedule complex, fast-moving workloads
• Integration – integration with other business systems
• Reduction of breakdowns – improved reliability of physical assets through the application of an effective maintenance programme

The most important factor may be reduction of breakdowns. This is the aim of the maintenance function and the rest are ‘nice’ objectives (or by-products). This is a fundamental issue as some system developers and vendors as well as some users lose focus and compromise reduction of breakdowns in order to maintain standardisation and integration objectives, thus confusing aim with objectives. This has led to the fact that the majority of CMMSs in the market suffer from serious drawbacks, as will be shown in the following section.


The term maintenance has many definitions. One comprehensive definition is provided by the UK Department of Trade and Industry (DTI): “The management, control, execution and quality of those activities which will ensure that optimum levels of availability and overall performance of plant are achieved, in order to meet business objectives”. It is worth noting that the definition implies that maintenance is a managerial and strategic activity; today, the term ‘asset management’ is often used instead. It is also worth noting that the word ‘optimum’ was used rather than ‘maximum’, which implies that maintenance is an optimisation problem in which both over-maintenance and under-maintenance should be avoided.

In this chapter the characteristics of computerised maintenance management systems (CMMSs) are investigated in order to highlight the need for them in industry and to identify their current deficiencies. This is achieved through an assessment of the state of the art of existing CMMSs. A proposed model is then presented to provide a decision analysis capability that is often missing in existing CMMSs. The effect of such a model is to contribute towards the optimisation of the functionality and scope of CMMSs for enhanced decision analysis support. The system is highly adaptive and has been successfully applied in industry. The proposed model employs a hybrid of intelligent approaches. In this chapter, we also demonstrate the use of AI techniques in CMMSs, show how the model integrates with the work of Kobbacy in Chapter 9, and outline features of next generation maintenance systems.

The chapter is organized as follows. Section 17.2 provides evidence of the existence of ‘black holes’ in the CMMS market. An alternative is provided in Section 17.3, where a model for decision analysis called the Decision Making Grid (DMG) is introduced. Section 17.4 describes maintenance policies that are covered by the DMG. This is followed in Section 17.5 by a case study that demonstrates the incorporation of the DMG into a CMMS, with a discussion of the results. The final two sections (Sections 17.6 and 17.7) deal with the unmet needs in CMMSs and a discussion of future directions for research.

17.2 Evidence of ‘Black Holes’

Most existing off-the-shelf software packages, especially CMMSs and enterprise resource planning (ERP) systems, tend to be ‘black holes’. This term has been coined by the author as a description of systems that are greedy for data input but that seldom provide any output in terms of decision support. In astronomical terms, ‘black holes’ used to be stars at some time in the past and now possess such a high gravitational force that they absorb everything that comes across their fields and do not emit anything at all, including light. This is analogous to systems that, at worst, are hungry for data and resources and, at best, provide the decision-maker with information that he/she already knows. Companies consume a significant amount of management and supervisory time compiling, interpreting and analysing the data captured within the CMMS. Companies then encounter difficulties analysing equipment performance trends and their causes as a result of inconsistency in the form of the data captured and the historical nature of certain elements of it. In short, companies tend to spend a vast amount of capital on the acquisition of off-the-shelf systems for data collection, but their added value to the business is questionable.

Few books have been published about the subject of CMMSs (Bagadia 2006; Mather 2002; Cato and Mobley 2001; Wireman 1994). However, they tend to highlight the advantages of CMMSs rather than their drawbacks. All CMMSs offer data collection facilities; more expensive systems offer formalised modules for the analysis of maintenance data, and the market leaders allow real-time data logging and networked data sharing (see Table 17.1). Yet, despite the observations made above regarding the need for information to aid maintenance management, a ‘black hole’ exists in the row titled ‘Decision analysis’ in Table 17.1, because virtually no CMMS offers decision support.¹ This is a definite problem, because the key to systematic and effective maintenance is managerial decision-making that is appropriate to the particular circumstances of the machine, plant or organisation. This decision-making process is made all the more difficult if the CMMS package can only offer an analysis of recorded data. As an example, when a certain preventive maintenance (PM) schedule is input into a CMMS, for example to change the oil filter every month, the system will simply produce a monthly instruction to change the oil filter and is thus no more than a diary.

Table 17.1. Facilities offered by commercially available CMMS packages

Price range (columns): £1,000+, £10,000+, £30,000+, £40,000+
Facilities (rows): Data collection; Data analysis; Real-time; Network; Decision analysis (marked “A black hole”)

A step towards decision support is to vary the frequency of PM depending on the combination of failure frequency and severity. A more intelligent feature would be to generate and prioritise PM according to modes of failure in a dynamic real-time environment. A PM is usually static and theoretical in that it does not reflect shop floor realities. In addition, the PM that is copied from machine manuals is usually inapplicable because:

• All machines work in different environments and would therefore need different PMs
• Machine designers often have a different experience of machine failures and means of prevention from those who operate and maintain them
• Machine vendors may have a hidden agenda of maximising spare parts replacements through frequent PMs


The use of CMMSs for decision support lags significantly behind the more traditional applications of data acquisition, scheduling and work-order issuing, as illustrated in Figure 17.1. While many packages offer inventory tracking and some form of stock level monitoring, the reordering and inventory holding policies remain relatively simplistic and inefficient (see Exton and Labib (2002) and Labib and Exton (2001)). Also, there is no mechanism to support managerial decision-making with regard to inventory policy, diagnostics or the setting of adaptive and appropriate preventive maintenance schedules.

[Bar chart: “Applications of CMMS Modules”. Horizontal axis: percentage of systems incorporating module (70 to 100). Modules shown: maintenance budgeting; predictive maintenance data analysis; equipment failure diagnosis; inventory control; spare parts requirements planning; material and spare parts purchasing; manpower planning and scheduling; work-order planning and scheduling; equipment parts list; equipment repair history; preventative maintenance planning and scheduling. Part of the chart is annotated “A Black Hole”.]

Figure 17.1. Extent of CMMS module usage (from Swanson 1997)

According to Boznos (1998): “The primary uses of CMMS appear to be as a storehouse for equipment information, as well as a planned maintenance and a work maintenance planning tool.” The same author suggests that CMMS appears to be used less often as a device for analysis and co-ordination and that: “Existing CMMS in manufacturing plants are still far from being regarded as successful in providing team based functions”. He surveyed CMMSs as well as the total productive maintenance (TPM) and reliability-centred maintenance (RCM) concepts, and the extent to which the two concepts are embedded in existing marketed CMMSs. He concluded that: “It is worrying the fact that almost half of the companies are either in some degree dissatisfied or neutral with their CMMS and that the responses indicated that manufacturing plants demand more user-friendly systems.” This is further proof of the existence of a ‘black hole’.

To make matters worse, it appears that there is a new breed of CMMSs that are complicated and lack basic aspects of user-friendliness. Although they emphasise integration and logistics capabilities, they tend to ignore the fact that the fundamental reason for implementing CMMSs is to reduce breakdowns. These systems are difficult to handle for both production operators and maintenance engineers; they are accounting- and/or IT-orientated rather than engineering-orientated.

Results of an investigation (EPSRC – GM/M35291) show that managers’ lack of commitment to maintenance models can be attributed to a number of reasons:

• Managers are unaware of the various types of maintenance models.
• A full understanding of the various models and the appropriateness of these systems to companies is not available.
• Managers do not have confidence in mathematical models due to their complexities and the number of unrealistic assumptions they contain.

This correlates with surveys of existing maintenance models and optimisation techniques. Ben-Daya et al. (2001) and Sherwin (2000) have also noticed that models presented in their work have not been widely used in industry for several reasons, such as:

• Unavailability of data
• Lack of awareness about these models
• Restrictive assumptions of some of these models

Finally, here is an extract from the textbook on operations management by Professor Nigel Slack (Warwick University), giving a critical commentary on ERP implementations (which may as well apply to CMMSs, as many of them tend nowadays to be classified as specialised ERP systems):

“Far from being the magic ingredient which allows operations to fully integrate all their information, ERP is regarded by some as one of the most expensive ways of getting zero or even negative return on investment. For example, the American chemicals giants, Dow Chemical, spent almost half-a-billion dollars and seven years implementing an ERP system which became outdated almost as it was implemented. One company, FoxMeyer Drug, claimed that the expense and problems which it encountered in implementing ERP eventually drove it to bankruptcy. One problem is that ERP implementation is expensive. This is partly because of the need to customise the system, understand its implications for the organisation, and train staff to use it. Spending on what some call the ERP ecosystem (consulting, hardware, networking and complimentary applications) has been estimated as being twice the spending on the software itself. But it is not only the expense which has disillusioned many companies, it is also the returns they have had for their investment. Some studies show that the vast majority of companies implementing ERP are disappointed with the effect it has had on their businesses. Certainly many companies find that they have to (sometimes fundamentally) change the way they organise their operations in order to fit in with ERP systems. This organisational impact of ERP (which has been described as the corporate equivalent of dental root canal work) can have a significantly disruptive effect on the organisation’s operations.”

Hence, theory and implementation of existing maintenance models are, to a large extent, disconnected. It is concluded that there is a need to bridge the gap between theory and practice through intelligent optimisation systems (e.g. rule-based systems). It is also argued that the success of this type of research should be measured by its relevance to practical situations and its impact on the solution of real maintenance problems. The developed theory must be made accessible to practitioners through IT tools. Efforts need to be made in the data capturing area to provide the necessary data for such models. Obtaining useful reliability information from collected maintenance data requires effort. In the past, this has been referred to as ‘data mining’, as if data can be extracted in its desired form if only it can be found. In the next section we introduce a decision analysis model. We then show how such a model has been implemented for decision support in maintenance systems.

17.3 Application of Decision Analysis in Maintenance

The proposed maintenance model is based on the concepts of effectiveness and adaptability. Mathematical models have been formulated for many typical situations. These models can be useful in answering questions such as “How much maintenance should be done on this machine? How frequently should this part be replaced? How many spares should be kept in stock? How should the shutdown be scheduled?” It is generally accepted that the vast majority of maintenance models are aimed at answering efficiency questions, that is, questions of the form “How can this particular machine be operated more efficiently?”, and not effectiveness questions, such as “Which machine should we improve and how?”. The latter question is often the one in which practitioners are interested. From this perspective it is not surprising that practitioners are often dissatisfied if a model is directly applied to an isolated problem. This is precisely why, in the integrated approach proposed by the author, efficiency analysis (do the things right) is preceded by effectiveness analysis (do the right thing). Two techniques were employed to illustrate these concepts, namely the fuzzy logic rule-based decision making grid (DMG) and the analytic hierarchy process (AHP), as proposed by Labib et al. (1998). The proposed model is illustrated in Figure 17.2.

The decision-making grid (DMG) acts as a map on which the performances of the worst machines are placed based on multiple criteria. The objective is to implement appropriate actions that will lead to the movement of machines towards an improved state with respect to multiple criteria. These criteria are determined through prioritisation based on the analytic hierarchy process (AHP) approach. The AHP is also used to prioritise failure modes and fault details of components of critical machines within the scope of the actions recommended by the DMG.


The model is based on identification of criteria of importance such as downtime and frequency of failures. The DMG then proposes different maintenance policies based on the state in the grid. Each system in the grid is further analyzed in terms of prioritisations and characterisation of different failure types and main contributing components.

Figure 17.2. Decision analysis maintenance system

17.4 Maintenance Policies

Maintenance policies can be broadly categorised as technology or systems oriented (systems or engineering), management of human factors oriented, and monitoring and inspection oriented. RCM is a technology-based concept in which the reliability of machines is emphasised. RCM is a method for defining the maintenance strategy in a coherent, systematic and logical manner. It is a structured methodology for determining the maintenance requirements of any physical asset in its operating context. The primary objective of RCM is to preserve system function. The RCM process consists of looking at the way equipment fails, assessing the consequences of each failure (for production, safety, etc.), and choosing the correct maintenance action to ensure that the desired overall level of plant performance (i.e. availability, reliability) is met. The term RCM was originally coined by Nolan and Heap (1979). For more details on RCM see Moubray (1991, 2001) and Netherton (2000).


TPM is a human-based technique in which maintainability is emphasised. TPM is a tried and tested way of cutting waste, saving money, and making factories better places to work. TPM gives operators the knowledge and confidence to manage their own machines. Instead of waiting for a breakdown and then calling the maintenance engineer, they deal directly with small problems before they become big ones. Operators investigate and then eliminate the root causes of machine errors. They also work in small teams to achieve continuous improvements to the production lines. For more details on TPM see Nakajima (1988), Hartmann (1992) and Willmott (1994).

Condition based maintenance (CBM), not to be confused with condition based monitoring, is a sensing technique in which availability based on inspection and follow-up is emphasised. In the British Standards, CBM is defined as “the preventive maintenance initiated as a result of knowledge of the condition of an item from routine or continuous monitoring.” (BS 3811, 1984). It is the means whereby sensors, sampling of lubricant products, and visual inspection are utilised to permit continued operation of critical machinery and avoid catastrophic damage to vital components. The integral components for the successful application of condition monitoring of machinery are: reliable detection, correct diagnosis, and dependable decision-making. For more details on CBM, see Brashaw (1998) and Holroyd (2000).

The approach proposed in this chapter differs from the above-mentioned ones in that it offers a decision map, adaptive to the collected data, which suggests the appropriate use of RCM, TPM, and CBM.

17.5 The DMG Through an Industrial Case Study

This case study demonstrates the application of the proposed model and its effect on asset management performance. The application of the model is shown through the experience of a company seeking to achieve world-class status in asset management. The company has implemented the proposed model, which has had the effect of reducing total downtime from an average of 800 h per month to less than 100 h per month, as shown in Figure 17.3.

17.5.1 Company Background and Methodology

In this particular company there are 130 machines, varying from robots and machine centres to manually operated assembly tables. Notice that, in this case study, only two criteria are used (frequency and downtime). However, if more criteria are included, such as spare parts cost and scrap rate, the model becomes multi-dimensional, with low, medium, and high ranges for each identified criterion. The methodology implemented in this case followed three steps: (i) criteria analysis, (ii) decision mapping, and (iii) decision support.

[Chart: “Breakdown trends (h)”. Vertical axis: 0 to 1200 h. Horizontal axis: months from Nov to the following Nov.]

Figure 17.3. Total breakdown trends per month

17.5.2 Step 1: Criteria Analysis

As indicated earlier, the aim of this phase is to establish a Pareto analysis of two important criteria: downtime, which is the main concern of production, and frequency of calls, which is the main concern of asset management. The objective of this phase is to assess how bad the worst performing machines are over a certain period of time, say one month. The worst performers in both criteria are sorted and grouped into high, medium, and low sub-groups. These ranges are selected so that machines are distributed evenly among every criterion. This is presented in Figure 17.4. In this particular case, the total number of machines is 120. Machines include CNCs, robots, and machine centres.

Figure 17.4. Step 1: criteria analysis
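
The sorting and grouping in this step can be sketched as follows. This is only one plausible reading of "distributed evenly": the worst performers on a criterion are ranked and then cut into three groups of (nearly) equal size. The function name, the `top_n` cut-off and the sample data are hypothetical.

```python
def group_worst(values, top_n=10):
    """Rank machines by a criterion (largest value = worst) and split the
    top_n worst performers into high / medium / low groups of equal size."""
    ranked = sorted(values, key=values.get, reverse=True)[:top_n]
    size = -(-len(ranked) // 3)          # ceiling division: group size
    labels = ("high", "medium", "low")
    return {machine: labels[min(position // size, 2)]
            for position, machine in enumerate(ranked)}


# Hypothetical monthly downtime in hours for six machines.
downtime_hours = {"A": 30, "B": 20, "C": 18, "D": 16, "E": 12, "F": 10}
downtime_level = group_worst(downtime_hours, top_n=6)
print(downtime_level)
```

The same function would be applied separately to the frequency-of-calls data, so that each machine ends up with one level per criterion.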


17.5.3 Step 2: Decision Mapping

The aim of this step is twofold: it scales the high, medium, and low groups, so that the genuinely worst machines in both criteria can be monitored on this grid; it also monitors the performance of different machines and suggests appropriate actions. The next step is to place the machines in the “decision making grid” shown in Figure 17.5 and, accordingly, to recommend asset management decisions to management. This grid acts as a map on which the performances of the worst machines are placed based on multiple criteria. The objective is to implement appropriate actions that will move machines towards the north-west section of low downtime and low frequency. In the top-left region, the action to implement, or the rule that applies, is OTF (operate to failure). The rule that applies for the bottom-left region is SLU (skill level upgrade), because data collected from breakdowns attended by maintenance engineers indicate that machine [G] has been visited many times (high frequency) for limited periods (low downtime). In other words, maintaining this machine is a relatively easy task that can be passed to operators after upgrading their skill levels. A machine located in the top-right region, such as machine [B], is a problematic machine, in maintenance words “a killer”. It does not break down frequently (low frequency), but when it stops it is usually a big problem that lasts for a long time (high downtime). In this case the appropriate action is to analyse the breakdown events and closely monitor its condition, i.e. condition based monitoring (CBM). A machine that enters the bottom-right region is considered to be one of the worst performing machines based on both criteria. It is a machine that maintenance engineers are used to seeing out of action rather than performing normal operating duty. A machine of this category, such as machine [C], will need to be structurally modified and major design-out projects need to be considered; hence the appropriate rule to implement is design out maintenance (DOM).

If one of the antecedents is a medium downtime or a medium frequency, then the rule to apply is to carry on with the preventive maintenance schedules. However, not all of the medium regions are the same. Some regions are near the top-left corner, where FTM (fixed time maintenance) is “easy” because it is close to the OTF region; it requires re-addressing issues regarding who will perform the instruction or when the instruction will be implemented. For example, machines [I] and [J] are situated in the region between OTF and SLU, and the question is who will perform the instruction: operator, maintenance engineer, or sub-contractor. Also, a machine such as machine [F] has been shifted from the OTF region due to its relatively higher downtime, and hence the timing of instructions needs to be addressed. Other preventive maintenance schedules need to be addressed in a different manner. The “difficult” FTM issues are the ones related to the contents of the instruction itself. It might be the case that the wrong problem is being solved or the right one is not being solved adequately. In this case machines such as [A] and [D] need to be investigated in terms of the contents of their preventive instructions, and expert advice is needed.
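As a minimal sketch of the crisp grid described above, the lookup below maps a machine's frequency and downtime to one of the five strategies. The band limits, the constant names and the function names are hypothetical; in the chapter the ranges come out of Step 1 and change from period to period. Only the four corner rules and the fallback to FTM are taken from the text.

```python
from typing import Dict, Tuple

# Hypothetical band limits: (upper bound of "low", upper bound of "medium").
FREQUENCY_LIMITS: Tuple[float, float] = (10.0, 40.0)    # calls per period
DOWNTIME_LIMITS: Tuple[float, float] = (100.0, 300.0)   # hours per period

# Corner rules named in the text, keyed by (frequency band, downtime band).
CORNER_RULES: Dict[Tuple[str, str], str] = {
    ("low", "low"): "OTF",     # operate to failure
    ("high", "low"): "SLU",    # skill level upgrade
    ("low", "high"): "CBM",    # condition based monitoring
    ("high", "high"): "DOM",   # design out maintenance
}


def band(value: float, limits: Tuple[float, float]) -> str:
    """Classify a crisp value as low, medium or high."""
    low_max, medium_max = limits
    if value <= low_max:
        return "low"
    if value <= medium_max:
        return "medium"
    return "high"


def dmg_strategy(frequency: float, downtime: float) -> str:
    """Return the grid strategy; any medium antecedent gives FTM."""
    key = (band(frequency, FREQUENCY_LIMITS), band(downtime, DOWNTIME_LIMITS))
    return CORNER_RULES.get(key, "FTM")
```

With these assumed limits, `dmg_strategy(50, 50)` falls in the SLU corner and `dmg_strategy(5, 400)` in the CBM corner. The sketch does not distinguish the who, when, what and how variants of FTM.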


[Figure 17.5 graphic: decision making grid with downtime (low, medium, high) on one axis and frequency (low, medium, high) on the other. The cells carry the strategies O.T.F., F.T.M. (When?), F.T.M. (Who?), F.T.M. (How?), F.T.M. (What?), C.B.M., S.L.U. and D.O.M., and machines [A] to [J] are plotted on the grid.]

CBM: condition base monitoring; SLU: skill level upgrade; FTM: fixed time maintenance; OTF: operate to failure; DOM: design out M/C

Figure 17.5. Step 2: decision mapping

17.5.4 Step 3: Multileveled Decision Support

Once the worst performing machines are identified and the appropriate action is suggested, it is a case of identifying a focused action to be implemented. In other words, we need to move from the strategic systems level to the operational component level. Using the analytic hierarchy process (AHP), one can model a hierarchy of levels related to objectives, criteria, failure categories, failure details and failed components. For more details on the AHP readers can consult Saaty (1988). This step is shown in Figure 17.6. The AHP is a mathematical model developed by Saaty (1980) that prioritises every element in the hierarchy relative to the other elements in the same level. The prioritisation of each element is achieved with respect to all elements in the level above. Therefore, we obtain a global prioritised value for every element in the lowest level. We can then compare the prioritised fault details (level 4 in Figure 17.6) with PM signatures (keywords) related to the same machine. PMs can then be varied accordingly, adapting to shop floor realities. The proposed decision analysis maintenance model, shown previously in Figure 17.2, combines both fixed rules and flexible strategies, since machines are compared on a relative scale. The scale itself is adaptive to machine performance with respect to the identified criteria of importance. Hence the concept of flexibility is embedded in the proposed model.
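The way local priorities roll up into global ones can be illustrated with a small sketch. It uses the row geometric-mean approximation of the AHP priority vector rather than Saaty's eigenvector calculation, and every pairwise judgement below is hypothetical; the function names are likewise illustrative and not part of the chapter's model.

```python
from math import prod
from typing import List


def local_priorities(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priorities by normalising the row geometric means."""
    n = len(pairwise)
    means = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(means)
    return [m / total for m in means]


def global_priorities(criteria_weights: List[float],
                      local_by_criterion: List[List[float]]) -> List[float]:
    """Weight each element's local priority by the criterion above it and sum."""
    n_elements = len(local_by_criterion[0])
    return [
        sum(w * local[j] for w, local in zip(criteria_weights, local_by_criterion))
        for j in range(n_elements)
    ]


# Hypothetical judgement: downtime is three times as important as frequency.
criteria_matrix = [[1.0, 3.0],
                   [1.0 / 3.0, 1.0]]
# Hypothetical comparisons of fault categories (electrical, mechanical,
# hydraulic) with respect to each criterion.
faults_wrt_downtime = [[1.0, 2.0, 4.0], [0.5, 1.0, 2.0], [0.25, 0.5, 1.0]]
faults_wrt_frequency = [[1.0, 0.5, 1.0], [2.0, 1.0, 2.0], [1.0, 0.5, 1.0]]

weights = local_priorities(criteria_matrix)
fault_priorities = global_priorities(
    weights,
    [local_priorities(faults_wrt_downtime), local_priorities(faults_wrt_frequency)],
)
```

The resulting list sums to one and ranks the fault categories for this hypothetical machine; a deeper hierarchy would repeat the same weighting step once per level.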


[Figure 17.6 graphic: Multiple Criteria Decision Analysis (MCDA) hierarchy]
Level 1: Criteria Evaluation: Downtime, Frequency, Spare Parts, Bottlenecks
Level 2: Critical Machines: System A, System B, System C, …
Level 3: Critical Faults: Electrical, Mechanical, Hydraulic, Pneumatic, Software
Level 4: Fault Details: Motor Faults, Limit Faults, No Power Faults, Panel Faults, Proximity Faults, Pressure Faults, Switch Faults, Push Button Faults

Figure 17.6. Step 3: decision support

17.5.4.1 Fuzzy Logic Rule Based Decision Making Grid

In practice, however, there are two cases in which the model needs to be refined. The first is when two machines are located near each other on opposite sides of a boundary between two policies. In this case we apply two different policies despite only a minor performance difference between the two machines. The second is when two machines are at opposite extremes of the quadrant of a certain policy. In this case we apply the same policy despite the fact that they are not near each other. Both cases are illustrated in Figure 17.7. For both cases we can apply the concept of fuzzy logic, where boundaries are smoothed and rules are applied simultaneously with varying weights. In fuzzy logic, one needs to identify membership functions for each controlling factor, in this case frequency and downtime, as shown in Figure 17.8a,b. A membership function defines a fuzzy set by mapping crisp inputs from its domain to degrees of membership in the interval [0, 1]. The scope/domain of the membership function is the range over which a membership function is mapped. Here the domain of the fuzzy set medium frequency is from 10 to 40 and its scope is 30 (40–10), whereas the domain of the fuzzy set high downtime is from 300 to 500 and its scope is 200 (500–300), and so on.
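A membership function of this kind is easy to write down. The sketch below uses the two domains quoted in the text (10 to 40 for medium frequency, 300 to 500 for high downtime), but the triangular and ramp shapes and the peak at 25 are assumptions; they are not read from Figure 17.8 and are not expected to reproduce the membership values marked there.

```python
def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Triangular membership: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def ramp_up(x: float, start: float, end: float) -> float:
    """Rising shoulder: 0 up to `start`, 1 from `end` onwards."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def medium_frequency(calls: float) -> float:
    # Domain 10 to 40 (scope 30) as stated in the text; peak at 25 is assumed.
    return triangular(calls, 10.0, 25.0, 40.0)


def high_downtime(hours: float) -> float:
    # Domain 300 to 500 (scope 200) as stated in the text; linear rise is assumed.
    return ramp_up(hours, 300.0, 500.0)
```

Under these assumed shapes a machine with 400 h of downtime belongs to the set high downtime to degree 0.5, so it sits halfway across the smoothed boundary instead of on one side of a hard limit.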


Figure 17.7. Special cases for the DMG model

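[Figure 17.8 graphic: (a) membership functions µ for the sets Low, Medium and High over frequency (no. of times), axis from 0 to 50, with the values 0.75 and 0.4 marked for a frequency of 12; (b) membership functions µ for the sets Low, Medium and High over downtime (hrs), axis from 0 to 500, with the values 0.7 and 0.2 marked for a downtime of 380.]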

Figure 17.8. a Membership function of frequency b Membership function of downtime


The output strategies have a membership function, and we have assumed a cost (or benefit) function that is linear and follows the relationship DOM > CBM > SLU > FTM > OTF, as shown in Figure 17.9a. The rules are then constructed based on the DMG, giving nine rules. Examples of the rules are as follows:

• If frequency is high and downtime is low then maintenance strategy is SLU (skill level upgrade).
• If frequency is low and downtime is high then maintenance strategy is CBM (condition based maintenance).

Rules are shown in Figure 17.9b.
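To show how the nine rules can fire simultaneously with varying weights, the following sketch evaluates them for given membership degrees. It assumes the common choice of minimum for "and" and maximum for combining rules that share a consequent; the chapter does not state which operators were used, the membership degrees in the example are hypothetical, and the defuzzification over the cost scale of Figure 17.9a is not reproduced.

```python
from typing import Dict

LABELS = ("low", "medium", "high")
STRATEGIES = ("OTF", "FTM", "SLU", "CBM", "DOM")


def rule_consequent(frequency_label: str, downtime_label: str) -> str:
    """Consequent of the rule for one cell of the grid (nine cells in all)."""
    corners = {
        ("low", "low"): "OTF",
        ("high", "low"): "SLU",
        ("low", "high"): "CBM",
        ("high", "high"): "DOM",
    }
    return corners.get((frequency_label, downtime_label), "FTM")


def fire_rules(frequency_mu: Dict[str, float],
               downtime_mu: Dict[str, float]) -> Dict[str, float]:
    """Firing strength per strategy: min for AND, max across rules (assumed)."""
    strengths = {strategy: 0.0 for strategy in STRATEGIES}
    for f_label in LABELS:
        for d_label in LABELS:
            strength = min(frequency_mu[f_label], downtime_mu[d_label])
            strategy = rule_consequent(f_label, d_label)
            strengths[strategy] = max(strengths[strategy], strength)
    return strengths


# Hypothetical membership degrees for one machine.
frequency_mu = {"low": 0.8, "medium": 0.2, "high": 0.0}
downtime_mu = {"low": 0.0, "medium": 0.3, "high": 0.7}
strengths = fire_rules(frequency_mu, downtime_mu)
strongest = max(strengths, key=strengths.get)
```

For these degrees the CBM rule fires at 0.7 and the FTM rules at no more than 0.3, so CBM dominates while FTM still carries some weight.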

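[Figure 17.9 graphic: (a) output membership functions µ for the strategies OTF, FTM, SLU, CBM and DOM over units of cost (x £1,000/unit), axis from 0 to 50; (b) the nine rules of the DMG.]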

Figure 17.9. a Output (strategies) membership function. b The nine rules of the DMG


The fuzzy decision surface is shown in Figure 17.10. In this figure, given any combination of frequency (x-axis) and downtime (y-axis), one can determine the most appropriate strategy to follow (z-axis).


Figure 17.10. The fuzzy decision surface

It can be noticed from Figure 17.11 that the relationship DOM > CBM > SLU > FTM > OTF is maintained. As illustrated in Figure 17.11, given a downtime of 380 h and a frequency of 12, the suggested strategy to follow is CBM.

Figure 17.11. The fuzzy decision surface showing the regions of different strategies


17.5.5 Discussion

The concept of the DMG was originally proposed by Labib (1996). It was then implemented in a company that has achieved a world-class status in maintenance (Labib 1998a). The DMG model has also been extended for use as a technique to deal with crisis management in an award winning paper (Labib 1998b). The DMG could be used as a practical continuous improvement process because, when machines in the top ten have been addressed, they will, if and only if appropriate action has been taken, move down the list of top ten worst machines. When they move down the list, other machines show that they need improvement, and resources can then be directed towards the new offenders. If this practice is used continuously then eventually all machines will be running optimally. If problems are chronic, i.e. regular, minor and usually neglected, some of these could be due to the incompetence of the user, and thus skill level upgrading would be an appropriate solution. However, if machines tend towards RCM then the problems are more sporadic and, when they occur, could be catastrophic. The use of techniques such as FMEA and FTA can help determine the cause and may help predict failures, thus allowing a prevention scheme to be devised. Figure 17.12 shows when to apply TPM and RCM. TPM is appropriate in the SLU range, since skill level upgrade of machine tool operators is a fundamental concept of TPM, whereas RCM is applicable for machines exhibiting severe failures (high downtime and low frequency). Also, CBM and FMEA will be ideal for this kind of machine, and hence an RCM policy will be most applicable. The significance of this approach is that RCM and TPM are combined in one unified model rather than treated as two competing concepts.

Figure 17.12. When to apply RCM and TPM in the DMG


Figure 17.13. Parts of PM schedules that need to be addressed in the DMG

Generally the easy preventive maintenance (PM), fixed time maintenance (FTM) questions are who? and when? (efficiency questions). The more difficult ones are what? and how? (effectiveness questions), as indicated in Figure 17.13.

17.6 Unmet Needs in Responsive Maintenance

According to Professor Jay Lee, of the National Science Foundation (NSF) Industry/University Cooperative Research Centre on Intelligent Maintenance Systems (IMS) at the University of Cincinnati, unmet needs in responsive maintenance can be categorised as follows:

• Machine intelligence – intelligent monitoring, prediction, prevention and compensation and reconfiguration for sustainability (self-maintenance)
• Operations intelligence – prioritisation, optimisation and responsive maintenance scheduling for reconfiguration needs
• Synchronisation intelligence – autonomous information flow from market demand to factory asset utilisation

It can be concluded that the challenges and research questions facing research and development (R&D) concerning next generation maintenance systems are:

• How to adapt PM schedules to cope dynamically with shop-floor reality
• How to feed back information and knowledge gathered in maintenance to the designers
• How to link maintenance policies to corporate strategy and objectives
• How to synchronise production scheduling based on maintenance performance


17.7 Future Directions and Conclusions

Training and educational programmes should be designed to address the considerable gap between the skills that are essential to maximise the potential benefits from these advanced systems and technologies in the area of maintenance and asset management and the skills that currently exist in the maintenance sections of most industries. Existing ERP and CMMS systems tend to put much emphasis on data collection and analysis rather than on decision analysis. Although the existing teaching programmes already address some of the issues related to next-generation maintenance systems, there is still room for considering other issues, such as:

• Emphasis on CMMS and ERP systems in the market, as well as their use and limitations
• Design awareness in maintenance and design for maintainability
• Learning from failures across different industries and disciplines
• Emphasis on prognostics rather than diagnostics
• e-Maintenance and remote maintenance, including self-powered sensors
• Modelling and simulation using OR tools and techniques
• AI applications in maintenance

As the success of systems implementation is based on two factors, human and systems, it is important to develop and nurture skills as well as to use advanced technologies. In this chapter we have investigated the characteristics of computerised maintenance management systems (CMMSs), highlighted the need for them in industry and identified their current deficiencies. A proposed model was then presented to provide a decision analysis capability that is often missing in existing CMMSs. The effect of such a model is to contribute towards the optimisation of the functionality and scope of CMMSs for enhanced decision analysis support. We have also demonstrated the use of AI techniques in CMMSs and shown how the model integrates with the work of Kobbacy in Chapter 9. Finally, we have outlined features of next generation maintenance systems.

17.8 References

Bagadia, K. (2006), Computerized Maintenance Management Systems Made Easy, McGraw-Hill.
Brashaw, C. (1998), Characteristics of acoustic emission (AE) signals from ill fitted copper split bearings, Proc 2nd Int. Conf on Planned Maintenance, Reliability and Quality.
Ben-Daya, M., Duffuaa, S.O. and Raouf, A. (eds) (2001), Maintenance Modelling and Optimisation, Kluwer Academic Publishers, London.
Boznos, D. (1998), The Use of CMMSs to Support Team-Based Maintenance, MPhil thesis, Cranfield University.
Cato, W. and Mobley, K. (2001), Computer-Managed Maintenance Systems: A Step-by-Step Guide to Effective Management of Maintenance, Labor, and Inventory, Butterworth Heinemann, Oxford.
Exton, T. and Labib, A.W. (2002), Spare parts decision analysis – The missing link in CMMSs (Part II), Journal of Maintenance & Asset Management, 17, 14–21.
Hartmann, E.H. (1992), Successfully Installing TPM in a Non-Japanese Plant, TPM Press, Inc., New York.
Holroyd, T. (2000), Acoustic Emission & Ultrasonics, Coxamoor Publishing Company, Oxford.
Labib, A.W. and Exton, T. (2001), Spare parts decision analysis – The missing link in CMMSs (Part I), Journal of Maintenance & Asset Management, 16(3), 10–17.
Labib, A.W., Williams, G.B. and O’Connor, R.F. (1998), An intelligent maintenance model (system): An application of the analytic hierarchy process and a fuzzy logic rule-based controller, Journal of the Operational Research Society, 49, 745–757.
Labib, A.W. (1998a), World-class maintenance using a computerised maintenance management system, Journal of Quality in Maintenance Engineering, 4, 66–75.
Labib, A.W. (1998b), A Logistic approach to managing the millennium information systems problem, Journal of Logistics Information Management, 11, 285–384.
Labib, A.W. (1996), An integrated appropriate productive maintenance, PhD Thesis, University of Birmingham.
Mather, D. (2002), CMMS: A Timesaving Implementation Process, CRC Press, New York.
Moubray, J. (2001), The case against streamlined RCM, Maintenance & Asset Management, 16, 15–27.
Moubray, J. (1991), Reliability Centred Maintenance, Butterworth-Heinemann Ltd, Oxford.
Nakajima, S. (1988), Total Productive Maintenance, Productivity Press, Illinois.
Netherton, D. (2000), RCM Standard, Maintenance & Asset Management, 15, 12–20.
Nolan, F. and Heap, H. (1979), Reliability Centred Maintenance, National Technical Information Service Report, # A066-579.
Saaty, T.L. (1988), The Analytic Hierarchy Process, Pergamon Press, New York.
Saaty, T.L. (1980), The Analytic Hierarchy Process: Planning, Priority Setting – Resource Allocation, McGraw-Hill, New York.
Sherwin, D. (2000), A review of overall models for maintenance management, Journal of Quality in Maintenance Engineering, 6, 138–164.
Swanson, L. (1997), Computerized Maintenance Management Systems: A study of system design and use, Production and Inventory Management Journal, Second Quarter: 11–14.
Willmott, P. (1994), Total Productive Maintenance. The Western Way, Butterworth Heinemann Ltd., Oxford.
Wireman, T. (1994), Computerized Maintenance Management Systems, 2nd edition, Industrial Press Inc, New York.

18 Risk Analysis in Maintenance Terje Aven

18.1 Introduction

This chapter discusses the use of risk analysis to support decision making on maintenance activities. In recent years there has been a growing interest in the use of risk analysis and risk based (informed) approaches for guiding decisions on maintenance; see, e.g., Vatn et al. (1996), Clarotti et al. (1997), Dekker (1996) and Cepin (2002). The topic has also been given much attention in industry; see, for example, van Manen et al. (1997), Knoll et al. (1996), Perryman et al. (1995) and Podofillini et al. (2006). This chapter provides a critical review of some of the key building blocks of the theories and methods developed. We also discuss some critical factors for ensuring a successful use of risk analysis for maintenance applications. The issues discussed include:

• Risk descriptions and categorisations
• Uncertainty assessments
• Risk acceptance and risk informed decision making
• Selection of appropriate methods and tools

An example is presented of a detailed risk analysis, showing the effect of maintenance efforts on risk. The chapter is organised as follows. First, in Section 18.2, we review the basic elements of risk management and risk management processes, and clarify the risk perspective adopted in this chapter. Then, in Section 18.3, we address the use of risk analysis to support decisions on maintenance. Various types of decision situations and analyses are covered. Section 18.4 presents the case mentioned above. In Section 18.5 we discuss key building blocks of the theories and methods developed, as well as the critical factors for ensuring a successful use of risk analysis for maintenance applications. Section 18.6 concludes. When not otherwise stated, we use terminology from ISO (2002).


List of abbreviations:

PLL: Potential loss of life (expected number of fatalities per year)
FAR: Fatal accident rate (expected number of fatalities per 100 million exposed hours)
ETA: Event tree analysis
FTA: Fault tree analysis
CCA: Cause consequence analysis
FMECA: Failure mode and effect and criticality analysis
HAZOP: Hazard and operability studies
RIF: Risk influencing factor
BORA: Barrier operational risk analysis
RCM: Reliability centred maintenance
HMI: Human machine interface
TTS: Technical condition safety
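The two risk indices at the top of this list are linked through the exposure time. As a small sketch based only on the definitions above, a FAR value can be obtained from a PLL value once the total exposed hours per year are known; the function name and the example figures are hypothetical.

```python
HOURS_PER_FAR_UNIT = 1.0e8  # FAR is expressed per 100 million exposed hours


def far_from_pll(pll_per_year: float, exposed_hours_per_year: float) -> float:
    """Convert expected fatalities per year into fatalities per 1e8 exposed hours."""
    if exposed_hours_per_year <= 0.0:
        raise ValueError("exposed hours per year must be positive")
    return pll_per_year * HOURS_PER_FAR_UNIT / exposed_hours_per_year


# Hypothetical installation: 100 persons, each exposed 2000 hours per year.
exposed_hours = 100 * 2000.0
far_value = far_from_pll(pll_per_year=0.004, exposed_hours_per_year=exposed_hours)
```

For this hypothetical case a PLL of 0.004 fatalities per year spread over 200 000 exposed hours corresponds to a FAR of 2.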

18.2 Basics of Risk Management and Risk Analysis

18.2.1 General

The purpose of risk management is to ensure that adequate measures are taken to protect people, the environment and assets from harmful consequences of the activities being undertaken, as well as to balance different concerns, in particular risks and costs. Risk management includes measures both to avoid the occurrence of hazards and to reduce their potential harm. Traditionally, risk management was based on a prescriptive regulating regime, in which detailed requirements were set for the design and operation of the arrangements. This regime has gradually been replaced by a more goal oriented regime, putting emphasis on what to achieve rather than on the solutions. Risk management is an integral aspect of a goal oriented regime. It is acknowledged that risk cannot be eliminated but must be managed. There is nowadays an enormous drive and enthusiasm in various industries, and in society as a whole, to implement risk management in organizations. There are high expectations that risk management is the proper framework for obtaining high levels of performance. To support decision making on design and operation, risk analyses are conducted. The analyses cover identification of hazards and threats, cause analyses, consequence analyses and risk description. Evaluations of the results of the analyses are carried out. The totality of the analyses and the evaluations is referred to as risk assessment. Risk assessment is followed by risk treatment, which is the process and implementation of measures to modify risk, including measures to avoid, reduce (“optimize”), transfer or retain risk. Risk transfer means sharing with another party the benefit or loss associated with a risk. It is typically effected through insurance. Risk management covers all co-ordinated activities to direct and control an organisation with regard to risk. The risk management process is the systematic application of management policies, procedures and practices to the tasks of establishing the context, assessing, treating, monitoring, reviewing and communicating risks; see Figure 18.1.


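[Figure 18.1 graphic: risk management process with the steps Establishing the context, Identify risks, Analyse risks, Evaluate risks and Treat risks, where identifying, analysing and evaluating risks are grouped as Risk assessment; Communicate and consult and Monitor and review accompany the steps.]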

Figure 18.1. Risk management process (based on ISO 2005)

Risk management involves managing to achieve an appropriate balance between realizing opportunities for gains and minimizing losses. It is an integral part of good management practice and an essential element of good corporate governance. It is an iterative process consisting of steps that, when undertaken in sequence, enable continuous improvement in decision making and facilitate continuous improvement in performance. “Establishing the context” (see Figure 18.1) defines the basic frame conditions within which the risks must be managed and sets the scope for the rest of the risk management process. The context includes the organization’s external and internal environment and the purpose of the risk management activity. This also includes consideration of the interface between the external and internal environments. Establishing the context also means defining suitable decision criteria as well as structures for how to carry out the risk assessment process.


Risk analysis is often used in combination with risk acceptance criteria, as inputs to risk evaluation. Sometimes the term risk tolerability limits is used instead of risk acceptance criteria. The criteria state what is deemed as an unacceptable risk level. The need for risk reducing measures is assessed with reference to these criteria. In some industries and countries, it is a requirement in regulations that such criteria should be defined in advance of performing the analyses.

18.2.2 Risk Perspective Adopted

The discussion in this chapter is based on a risk perspective characterised by the following points:

1. Risk is defined by the combination of possible consequences associated with an activity and the assessor’s uncertainty about these consequences. The consequences are normally expressed by quantities that can be measured (such as money, loss of lives, etc.). A set of quantities is typically needed to give a proper description of the consequences. We refer to these quantities as observable quantities or just observables.
2. Risk (uncertainty) is quantitatively expressed by probabilities and expected values. We assess the uncertainties and assign probabilities (and hence we assign values for risk). A probability is always conditional on some information and knowledge.
3. Risk analyses provide decision support by analysing and describing risk (uncertainty). The risk analysts analyse the risks and evaluate them, i.e. they discuss the significance of the risks in relation to comparable activities and possible criteria. The analyses need to be evaluated in light of their premises, assumptions and limitations. The analyses are based on background information that must be reviewed together with the results of the analyses. The decision maker performs what we refer to as a managerial review and judgment.
4. It is essential to make a distinction between the expected values determined at the point of decision making and the real outcomes. The expected values give, to varying degrees, good predictions of the future outcomes. Uncertainty and safety management are justified by reference to these outcomes and not the expected values alone.
5. What is acceptable risk and the need for risk reduction cannot be determined just by reference to the results of risk analyses. To be precise, we do not accept a risk, but we accept a solution, with all its attributes.
6. Cost-benefit analysis means calculating expected net present values with a risk adjusted discount rate or risk-adjusted cash-flows. In a societal context, the society’s willingness to pay is the appropriate reference, whereas for businesses it is the decision maker’s willingness to pay that is to be used.
7. Cost-effectiveness analysis means calculating measures such as the expected cost per expected saved life.
8. A multi-attribute analysis is an analysis of the various attributes (costs, safety, …) of the decision problem, separately for each attribute.
9. Risk and decision analyses need extensive use of sensitivity and robustness analyses.

Thus we adopt a broad perspective on risk, acknowledging that risk cannot be distinguished from the context it is a part of, the aspects that are addressed, those who assess the risk, the methods and tools used, etc. Following our definition of risk, a low degree of uncertainty does not necessarily mean a low risk, nor does a high degree of uncertainty necessarily mean a high level of risk. As risk is defined as the combination of possible consequences and the associated uncertainties (quantified by probabilities), any judgment about the level of risk needs to consider both dimensions. For example, consider a case where only two outcomes are possible, 0 and 1, corresponding to 0 and 1 fatality, and the decision alternatives are A and B, having uncertainty (probability) distributions (0.5, 0.5) and (0.0, 1.0), respectively. Hence for alternative A there is a higher degree of uncertainty than for alternative B. However, considering both dimensions, we would of course judge alternative B to have the higher risk, as the negative outcome 1 is certain to occur.

The above building blocks are premises for the analysis and discussion in this chapter. For their justification and suitability we refer to Aven and Kristensen (2005), Aven et al. (2007) and Aven and Vinnem (2007). Some aspects of particular importance for the maintenance applications are addressed in Section 18.5.
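The two-alternative example can be checked numerically. The sketch below computes the expected number of fatalities together with the variance of the outcome for each alternative; using the variance as a summary of the degree of uncertainty is an assumption made here for illustration, since the text does not prescribe a particular uncertainty measure.

```python
from typing import Sequence, Tuple


def mean_and_variance(outcomes: Sequence[float],
                      probabilities: Sequence[float]) -> Tuple[float, float]:
    """Expected value and variance of a discrete outcome distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mean = sum(p * x for x, p in zip(outcomes, probabilities))
    variance = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probabilities))
    return mean, variance


outcomes = (0, 1)  # number of fatalities
mean_a, var_a = mean_and_variance(outcomes, (0.5, 0.5))   # alternative A
mean_b, var_b = mean_and_variance(outcomes, (0.0, 1.0))   # alternative B
```

Alternative A has the larger spread (variance 0.25 against 0) while alternative B has the larger expected number of fatalities (1 against 0.5), which is why neither dimension alone settles the comparison.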

18.3 Risk Analysis to Support Decisions on Maintenance

Our starting point is a decision maker facing some decision points in a project. These decision points include problems and opportunities related to maintenance. Having identified the main decision points, adequate decision alternatives need to be generated and assessed, relating to whether or not to execute an activity, alternative maintenance policies, etc. Our focus is on situations characterized by a potential for rather large consequences, large associated uncertainties and/or high probabilities of what will be the consequences if the alternatives are in fact realised, i.e. high risks according to our definition of risk. The consequences and associated uncertainties relate to economic performance, possible accidents leading to loss of lives and/or environmental damage, etc. Risk analyses are considered to give valuable decision support in such situations.

In this chapter we are particularly concerned with how the maintenance activities are reflected in the risk analysis. A distinction between different types of analysis methods is then required. To identify hazards and risks, FMECA (failure mode and effect and criticality analysis) and HAZOP (hazard and operability studies) are two of the most common methods. In FMECA, categories of the possible consequences and associated likelihoods are introduced and the criticality is determined using a risk matrix approach. Using this approach, different maintenance strategies may be assessed with respect to risk (criticality) and compared using the risk matrix. This is a crude risk analysis. The next level of sophistication of risk analysis is obtained when models are developed to represent cause and/or consequence scenarios. The standard tools used are FTA (fault tree analysis) and ETA (event tree analysis) and the combination of the two, CCA (cause consequence analysis). These models are important elements in a qualitative risk analysis, and provide the basis for a quantitative risk analysis. These are all standard risk analysis methods and we refer to textbooks for a description and discussion of them; see, e.g., Aven (1992) and Modarres (1993).

The models are used to identify critical systems, and thus provide a basis for selecting appropriate maintenance activities. To illustrate this, let $R$ be a risk index, for example expressing the expected number of fatalities (PLL) or the probability of a system failure, and let $R_i$ be the risk index when subsystem $i$ is in the functioning state. Then a common way of ranking the different subsystems is to compute the risk improvement potential (also referred to as the risk achievement worth) $I_i = R_i - R$, i.e. the maximum potential risk improvement that can be obtained by improving subsystem $i$ (Aven 1992; Haimes 1998). The potential $I_i$ is referred to as a risk importance measure. An application of this approach is presented in Brewer and Canady (1999). Criteria are established based on such a ranking to identify when maintenance improvements are needed to reduce risks. Identifying critical items is an important basis for maintenance management, and is one of the key steps in various maintenance frameworks, e.g. the RCM (reliability centred maintenance) approach (Andersen and Neri 1990).

In risk analysis, the maintenance efforts are incorporated by:

1. Showing the relation between maintenance effort and component performance
2. Showing the relation between component performance and overall risk indices.
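The ranking by risk improvement potential described earlier in this section can be sketched as follows. The system structure, the unavailability figures and the function names are hypothetical. Because a lower risk index is better, the difference between the risk with subsystem i functioning and the baseline risk comes out non-positive here, and the sketch ranks subsystems by its magnitude.

```python
from typing import Callable, Dict, List


def system_risk(q: Dict[str, float]) -> float:
    """Hypothetical risk index: probability that the system fails, where the
    system fails if A fails, or if both B and C fail (independent failures)."""
    return 1.0 - (1.0 - q["A"]) * (1.0 - q["B"] * q["C"])


def improvement_potentials(risk: Callable[[Dict[str, float]], float],
                           q: Dict[str, float]) -> Dict[str, float]:
    """I_i = R_i - R, with R_i evaluated for subsystem i certainly functioning."""
    baseline = risk(q)
    potentials = {}
    for name in q:
        functioning = dict(q)
        functioning[name] = 0.0          # subsystem i never fails
        potentials[name] = risk(functioning) - baseline
    return potentials


def rank_by_potential(potentials: Dict[str, float]) -> List[str]:
    """Subsystems ordered from largest to smallest achievable risk reduction."""
    return sorted(potentials, key=lambda name: abs(potentials[name]), reverse=True)


# Hypothetical failure probabilities of the three subsystems.
unavailability = {"A": 0.01, "B": 0.1, "C": 0.2}
potentials = improvement_potentials(system_risk, unavailability)
ranking = rank_by_potential(potentials)
```

In this hypothetical structure, making B or C perfect removes more risk than making A perfect, so maintenance attention would go to the redundant pair first even though A is a single point of failure.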
An example demonstrating level 1 (the component level) is the periodic testing of a component, where the component has a failure rate λ and the testing interval is τ. The unavailability of the component is then approximated by λτ/2, expressing the mean fractional down time of the component. We refer to the literature for further details on this example and related models and methods, including Markov methods; see, e.g., Aven (1992), Rausand and Høyland (2003) and Modarres (1993). The component measures often express features of the performance of safety barriers, reflected in the event trees. In this way a link is established between the component performance level and risk (level 2). For the periodic testing example, suppose that the component is a safety barrier modelled as a branching event of the event tree. Then the unavailability λτ/2 expresses the probability that this barrier is not functioning on demand. In Figure 18.2 we present a model for integrating maintenance activities and risk analysis, taken from Apeland and Aven (2000), which also shows the two levels 1 and 2 mentioned above. On the low system level we have maintenance, component and operating characteristics, describing alternative maintenance actions and strategies, alternative components available and relevant operating patterns.
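The periodic testing example lends itself to a short numerical sketch of the two levels. It assumes an exponentially distributed time to failure, failures that are revealed only at tests, and negligible test and repair times; under these assumptions the exact mean unavailability over a test interval is 1 − (1 − e^(−λτ))/(λτ), of which λτ/2 is the first-order approximation. The failure rate, the test intervals and the initiating event frequency are hypothetical.

```python
import math


def unavailability_approx(failure_rate: float, test_interval: float) -> float:
    """Level 1: mean fractional down time, approximated by lambda * tau / 2."""
    return failure_rate * test_interval / 2.0


def unavailability_exact(failure_rate: float, test_interval: float) -> float:
    """Mean of 1 - exp(-lambda * t) over one test interval (assumptions above)."""
    x = failure_rate * test_interval
    return 1.0 + math.expm1(-x) / x


def failed_barrier_frequency(initiating_frequency: float, unavailability: float) -> float:
    """Level 2: frequency of the event-tree branch where the barrier fails on demand."""
    return initiating_frequency * unavailability


failure_rate = 1.0e-5          # hypothetical, per hour
initiating_frequency = 0.1     # hypothetical, demands per year
for tau in (2190.0, 4380.0, 8760.0):   # hypothetical test intervals in hours
    q = unavailability_approx(failure_rate, tau)
    branch = failed_barrier_frequency(initiating_frequency, q)
    print(f"tau={tau:7.0f} h  q={q:.4f}  failed-barrier branch={branch:.5f} per year")
```

Halving the test interval halves the approximate unavailability and hence the frequency of the branch in which the barrier fails, which is the kind of link between maintenance effort and risk that the two levels are meant to express.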


Predictions concerning alternative maintenance strategies’ effect on the main objectives are normally subject to uncertainty, and in the model we apply risk analyses for expressing this uncertainty. In risk analyses we evaluate the effect of different low system level alternatives on the maintenance performance and component performance, for example described through times to failure and test intervals. The system performance describes how the maintenance and component performance affect systems on different levels, for example through resulting production capacity, quality, availability and reliability, and through occurrence of accidents and other undesirable events. On the high system level we have the organization’s main objectives. In the figure we refer to indices describing risk closely linked to the main objectives as system attributes. One example could be the PLL value. Since the main objectives should include elements relating to humans, the environment and assets/financial interests, the risk indices would normally relate to each of these categories. Applying this model will produce risk results for each relevant low system level alternative, and this forms a basis for deciding which maintenance alternative to apply. To be able to quantify the effect of the performed maintenance actions on the organization’s main objectives, the following issues have to be discussed:

• Which system attributes should be applied for describing performance related to the main objectives?
• Which indices should be applied for describing low and intermediate analysis level performance?
• How should risk analyses be applied for describing the relationship between high and low system level elements, how should engineering judgments be integrated into the analyses and how should uncertainty be expressed?

We return to these issues in Sections 18.4 and 18.5. Risk-based inspection is an example of a risk informed approach (Faber 2002). Here risk analysis principles are used to manage inspection programs for plant equipment. The need for inspections and the level of inspections are determined by reference to the risks, for example described by risk matrices and expected cost figures, and of course other relevant information.


T. Aven

[Figure 18.2. Model showing the relationship between maintenance efforts and risk (Apeland and Aven 2000). Labels in the figure: high system level (main objectives; system attributes; comparison of alternatives); risk analysis at high, intermediate and low analysis level; system performance; maintenance performance and component performance; low system level (maintenance characteristics, operating characteristics, component characteristics). Inputs: expert opinions, historical data, suitable models.]

Traditionally, risk analyses using FTA and ETA have not had the level of detail necessary to support many decisions related to maintenance. However, recent developments within risk analysis allow for more detailed analysis that takes into account risk influencing factors, for example maintenance activities. In Section 18.4 we will look closer into this type of risk analysis and show how maintenance activities can be incorporated. Here we summarise the basic features of the method, using a cause analysis based example as an illustration:

1. Identify top events A that summarise essential barrier performance. An example is ‘ignition’ or ‘avoid ignition’ given a specific leakage scenario. The event A must be precisely defined; no ambiguity can exist.
2. Establish a deterministic model that links A and events Bi and quantities Xi on a more detailed level. A fault tree is an example of such a model.
3. Specify a set of operational and management factors Fi that could influence the performance of the barriers, and which have not been included in the fault tree model. Examples of such factors are the quality of the maintenance work, the level of competence and the adequacy of organisation.
4. Specify probabilities P(Bi|F), where F is the vector of the Fi.
5. Use probability calculus to obtain P(A|F).

To carry out such an analysis there are a number of challenges, of which the following are some of the more important:

• Determine which F factors should be included in the fault tree. The F factors are fixed, meaning that the probability assignments are conditioned on these factors. If some of the F factors are to be considered unknown to the analyst, these factors need to be included in the fault tree, or the factors should be divided into two categories, reflecting unknown factors on the one hand and some given factors on the other. Such a distinction is made in the SAM method (Paté-Cornell and Murphy 1996).
• Find adequate procedures for specifying the probabilities P(Bi|F). These procedures need to be based on models and methods used for barrier performance analyses, such as human reliability analysis. We refer to Section 18.4.

The above analysis provides decision support by describing the effect of maintenance efforts on risk. To make a decision, costs and other aspects also need to be considered, and an important issue is then how this should be done. A standard approach is cost-benefit analysis based on computation of the expected net present value. We will discuss this issue in Section 18.5.
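As a minimal illustration of step 5 of the method above, suppose the deterministic model of step 2 reduces A to a single gate over k events B1, …, Bk, and suppose the Bi are conditionally independent given F. Both the single-gate structure and the independence are simplifying assumptions made here for illustration (as is the symbol k); they are not requirements of the method. The probability calculus then gives:

```latex
% OR gate: A occurs if at least one B_i occurs
P(A \mid F) = 1 - \prod_{i=1}^{k} \bigl(1 - P(B_i \mid F)\bigr)

% AND gate: A occurs only if every B_i occurs
P(A \mid F) = \prod_{i=1}^{k} P(B_i \mid F)
```

A realistic fault tree combines several such gates, and dependencies between the Bi that are not captured by F would have to be modelled explicitly.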

18.4 A Case

In this section we present a risk analysis incorporating operational and maintenance factors. The presentation is based on Sklet et al. (2005), and is referred to as the BORA (barrier and operational risk analysis) approach. The approach is inspired by the I-Risk method (Papazoglou et al. 2003). The case relates to an offshore installation, and releases of hydrocarbons. The BORA approach consists of the following steps:

1. Development of a basic risk model.
2. Assignment of industry average frequencies/probabilities of initiating events and basic events.
3. Identification of risk influencing factors (RIFs) and development of risk influence diagrams.
4. Assessment of the status of RIFs.
5. Calculation of installation specific frequencies/probabilities.
6. Calculation of installation specific risk, incorporating the effect of technical systems, technical conditions, human factors, operational conditions, and organizational factors.


18.4.1 Development of a Basic Risk Model

The basic building blocks of the BORA model are barrier block diagrams, event trees, fault trees, and influence diagrams. Barrier block diagrams are used to illustrate the event scenarios and the effect of barrier systems on the event sequences. They consist of initiating events, barriers aimed at influencing the event sequence in a desired direction, and possible outcomes of the event sequence. Event trees are used in the quantitative analysis of the scenarios. The performance of the safety barriers is analyzed using fault trees. Influence diagrams are used to analyze how the RIFs affect the initiating events in the event trees and the basic events in the fault trees.

This case restricts attention to modeling of the containment function (“prevent release of hydrocarbons”). For this function a number of release scenarios have been modeled by use of barrier block diagrams. Each barrier block diagram comprises the following:

• An initiating event, i.e. a deviation from the normal situation which may cause a release of hydrocarbons.
• Barrier systems aimed at preventing release of hydrocarbons.
• The possible outcomes of the event sequence, which depend upon the successful operation of the barrier system(s).

The barrier block diagram for the release scenario “Release due to valve(s) in wrong position after maintenance” is illustrated in Figure 18.3.

[Figure 18.3. Barrier block diagram for one release scenario. Initiating event: valve(s) in wrong position after maintenance. Barrier functions: detection of valve(s) in wrong position (self control/checklists (isolation plan); 3rd party control of work) and detection of release prior to normal production (leak test). End events: “safe state” (failure revealed) or release of hydrocarbons.]


As seen in Figure 18.3, several of the barriers are non-physical by nature, thus requiring human and operational factors to be included in the risk model. In order to perform a quantitative risk analysis, frequencies/probabilities of three main types of events need to be quantified:

1. The frequency of the initiating event, i.e. in the example case: “The frequency of valve in wrong position after maintenance”.
2. The probability of failure of the barrier systems, which for the example case includes:
   i) failure to reveal valve(s) in wrong position after maintenance by self control/use of checklists,
   ii) failure to reveal valve(s) in wrong position after maintenance by third party control of work, and
   iii) failure to detect potential release during leak test prior to start-up.
3. The (end event) frequency of release of hydrocarbons due to valve in wrong position (needed for further analysis of the effect of the consequence barriers).

The frequency of the initiating event is in our example a function of the annual number of maintenance operations where valve(s) may be set in wrong position in hydrocarbon systems, and the probability of setting a valve in wrong position per maintenance operation. In order to determine the probability of failure of barrier systems, the barrier systems may be further analyzed by use of fault trees as shown in Figure 18.4.

[Figure 18.4. Fault tree for failure of one barrier. Top event: failure to reveal valve(s) in wrong position after maintenance by self control/use of checklists. Events below the top event: operator fails to detect a valve in wrong position by self control/use of checklists (A13); self control not performed/checklists not used, which is developed further into: use of self control/checklists not specified in program (A11) and activity specified, but not performed (A12).]

A corresponding analysis may be performed for all barriers for all the identified release scenarios. For further illustration of the quantification methodology in the BORA project, we consider the initiating event and the basic events shown in Figures 18.3 and 18.4:

• Valve(s) in wrong position after maintenance that may cause release (the initiating event).
• Use of self control/checklists not specified in program (basic event A11).
• Use of self control/checklists specified, but not performed (basic event A12).
• The operator fails to detect valve(s) in wrong position by self control/use of checklists (basic event A13).

18.4.2 Assignment of Average Frequencies/Probabilities

The first step in the quantification process is to assign industry average frequencies and probabilities for all the initiating events in the event trees and basic events in the fault trees. Generic data may be found in generic databases or company internal databases. Alternatively, industry average values can be established by use of expert judgment. For our example case, Table 18.1 shows the assigned industry average frequencies and probabilities for the initiating event and the basic events shown in Figures 18.3 and 18.4.

Table 18.1. Assigned average frequencies (F) and probabilities (P)

Event description                                                                            Assigned value
Annual frequency of valve(s) in wrong position after maintenance that may cause release     F = 6
Failure to specify self control/use of checklist                                             P = 0.1
Failure to perform self control/use of checklist                                             P = 0.05
Failure of operator to detect valve(s) in wrong position by self control/use of checklist   P = 0.06
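The values in Table 18.1 can be combined into a barrier failure probability once the gate types of the fault tree are fixed. The sketch below is only an illustration: it reads both gates in Figure 18.4 as OR gates and treats the basic events as independent, which are assumptions made here and not statements from the BORA documentation. The function name `or_gate` and the variable names are likewise hypothetical.

```python
# Illustrative sketch only: OR gates and independence are assumptions;
# the numbers are the industry averages from Table 18.1.

def or_gate(*probs):
    """Probability that at least one of several independent events occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= (1.0 - p)
    return 1.0 - none_occur

# Table 18.1 values
F_INITIATING = 6.0   # valve(s) in wrong position after maintenance, per year
P_A11 = 0.10         # self control/checklists not specified in program
P_A12 = 0.05         # activity specified, but not performed
P_A13 = 0.06         # operator fails to detect valve(s) in wrong position

# Fault tree of Figure 18.4, read here as two OR gates (assumption)
p_not_performed = or_gate(P_A11, P_A12)
p_barrier_fails = or_gate(p_not_performed, P_A13)

# Annual frequency of wrong valve positions not revealed by this first barrier
f_not_revealed = F_INITIATING * p_barrier_fails

print(f"P(self control not performed) = {p_not_performed:.4f}")
print(f"P(barrier fails)              = {p_barrier_fails:.4f}")
print(f"Frequency past first barrier  = {f_not_revealed:.3f} per year")
```

The remaining barriers (third party control of work and the leak test) would be handled in the same way, but the chapter gives no values for them, so they are left out here.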

18.4.3 Qualitative Risk Influence Modeling

RIFs for every initiating event in the event trees and every basic event in the fault trees need to be identified. An example of an influence diagram for the basic event “Operator fails to detect a valve in wrong position by self check/checklist” is shown in Figure 18.5.

[Figure 18.5. Influence diagram for the basic event “Operator fails to detect a valve in wrong position by self check/checklist”. The event “Area technician fails to detect a valve in wrong position by self control/use of checklists” is influenced by six RIFs: HMI, maintainability/accessibility, time pressure, competence of area technician, procedures for self control, and work permit.]


Table 18.2 shows the RIFs for all the relevant events in our example case.

Table 18.2. Proposed RIFs for basic events in the example case

Event description: Valve in wrong position after maintenance
RIFs: Process complexity; Maintainability/accessibility; HMI (valve labeling and position feedback features); Time pressure; Competence (of area technician); Work permit

Event description: Self control/use of checklists not specified
RIFs: Program for self control

Event description: Self control/use of checklists not performed
RIFs: Work practice (regarding use of self control/checklists); Time pressure; Work permit

Event description: Area technician fails to detect valve(s) in wrong position by self control/use of checklists
RIFs: HMI (valve labeling and position feedback features); Maintainability/accessibility; Time pressure; Competence (of area technician); Procedures for self control; Work permit

The next step is the quantification process.

18.4.4 Scoring of RIFs

The first step is to assess the status of the RIFs. Two schemes are used for scoring of RIFs.

Scheme 1. Use of results from existing projects such as technical condition safety, TTS (Thomassen and Sørum 2002), the risk level on the Norwegian continental shelf (PSA 2004), and investigations of incidents. The TTS project is a review method to map and monitor the technical safety level based on the status of safety critical elements and safety barriers, and each system is given a score (rating) according to predefined performance standards. Table 18.3 shows the definition of grades.

Table 18.3. Definition of grades in the TTS project

Rating   Description of safety level
A        Condition is significantly better than the reference level
B        Condition is in accordance with the reference level
C        Condition is satisfactory, but does not fully comply with the reference level
D        Condition is acceptable and within the statutory regulations’ minimum intended safety level, but deviates significantly from the reference level
E        Condition with significant deficiencies as compared with “D”
F        Condition is unacceptable


Scheme 2. Expert judgment of status of RIFs on a specific platform. A scoring scheme for each RIF will be developed as a basis for this assessment. An example of a scoring scheme is shown in Table 18.4.

Table 18.4. Example of scoring scale for the RIF procedures

Score   Grade characteristics for the RIF procedures
A       Almost perfect procedures, with checklists, highlighting of important information, illustrations, etc.
B       Procedures better than industry average
C       Industry average procedures
D       Poorly written procedures and no highlighting
E       Procedures incomplete, out-of-date, inaccurate, much cross-referencing, etc.
F       No procedures, even though the task demands them

18.4.5 Calculation of Installation Specific Frequencies/Probabilities

The next task is to adjust the industry average probabilities based on the scoring of the RIFs. Three main aspects are discussed: a) the formulas for calculation of installation specific frequencies/probabilities, b) assignment of appropriate RIF scores, and c) weighting of RIFs. The procedure is illustrated by use of numbers from the example case.

18.4.5.1 Principles for Adjustment

The following principles for adjustment are proposed. Let Prev(A) be the “installation specific” probability of the failure event A. The probability Prev(A) is determined by the following procedure:

$$P_{\mathrm{rev}} = P_{\mathrm{ave}} \sum_{i=1}^{n} w_i \cdot Q_i \tag{18.1}$$

where Pave is the industry average probability, wi is the weight/importance of RIF no. i for the event, Qi is a measure of the status of RIF no. i, and n is the number of RIFs. Here

$$\sum_{i=1}^{n} w_i = 1 \tag{18.2}$$

The challenge is now to determine appropriate values for Qi and wi.

18.4.5.2 Determining Appropriate Values of Qi

To determine the Qi we need to associate a number with each of the scores A–F. This can be done in many ways, and the proposed scheme is:

• Determine by expert judgment Plow as the lower limit for Prev.
• Determine by expert judgment Phigh as the upper limit for Prev.
• Then put, for i = 1, 2, …, n,

$$Q_i = \begin{cases} P_{\mathrm{low}} / P_{\mathrm{ave}} & \text{if } s_i = A \\ 1 & \text{if } s_i = C \\ P_{\mathrm{high}} / P_{\mathrm{ave}} & \text{if } s_i = F \end{cases} \tag{18.3}$$

where si denotes the score or status of RIF no. i.

Hence if the score si is A, and Plow is 10% of Pave, then Qi is equal to 0.1. If the score si is F, and Phigh is ten times higher than Pave, then Qi is equal to 10. If the score si is C, then Qi is equal to 1. Furthermore, if all scores are C, then Prev = Pave; if all scores are A, then Prev = Plow; and if all scores are F, then Prev = Phigh. Note that in this study we use a fixed factor of ten to describe the variations caused by different scores, from A to F. That is, if all scores are A, Plow is 10% of Pave, and if all the scores si are F, then Phigh is ten times higher than Pave. Furthermore, we have adopted the grade scores from the TTS project: A = 3, B = 2, C = 1, D = 0, E = –2 and F = –5. Thus, letting Qi(j) denote the value of Qi if the score si takes the value j, we have the results shown in Table 18.5.

Table 18.5. Adaptation of scores from the TTS project

Score si = j   Qi(j)
3 (A)          0.10
2 (B)
1 (C)          1
0 (D)
–2 (E)
–5 (F)         10

Hence it remains to determine Qi(j) for j = 2, 0 and –2. Using a linear transform seems natural, and we obtain the following Q values. For j = 0 and j = –2 (D and E), interpolating between the values for F and C:

$$Q_i(j) = Q_i(-5) + \bigl(j - (-5)\bigr)\,\frac{Q_i(1) - Q_i(-5)}{1 - (-5)}$$

And for j = 2 (B), interpolating between the values for C and A:

$$Q_i(j) = Q_i(1) + (j - 1)\,\frac{Q_i(3) - Q_i(1)}{3 - 1}$$

which gives the values for Qi shown in Table 18.6.

Table 18.6. Specified values for Qi

Score si = j   Qi(j)
3 (A)          0.10
2 (B)          0.55
1 (C)          1
0 (D)          2.5
–2 (E)         5.5
–5 (F)         10
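The linear interpolation that yields the values in Table 18.6 can be sketched as follows. This is a minimal illustration rather than the BORA implementation; the function name `q_value` and its parameter names are hypothetical, and the default ratios encode the fixed factor of ten used in the text.

```python
# Sketch of the piecewise-linear mapping from grade to Q_i (Table 18.6).
# Anchor points: A -> P_low/P_ave, C -> 1, F -> P_high/P_ave (Equation 18.3).

TTS_SCORE = {"A": 3, "B": 2, "C": 1, "D": 0, "E": -2, "F": -5}

def q_value(grade, p_low_ratio=0.1, p_high_ratio=10.0):
    """Return Q_i for a grade A-F.

    p_low_ratio  = P_low / P_ave   (Q_i for grade A)
    p_high_ratio = P_high / P_ave  (Q_i for grade F)
    """
    j = TTS_SCORE[grade]
    if j >= 1:
        # Interpolate between C (j = 1, Q = 1) and A (j = 3, Q = p_low_ratio)
        return 1.0 + (j - 1) * (p_low_ratio - 1.0) / (3 - 1)
    # Interpolate between C (j = 1, Q = 1) and F (j = -5, Q = p_high_ratio)
    return 1.0 + (1 - j) * (p_high_ratio - 1.0) / (1 - (-5))

for grade in "ABCDEF":
    print(grade, round(q_value(grade), 2))
```

Writing both branches relative to grade C keeps Qi equal to 1 for an industry average RIF regardless of the limits chosen for Plow and Phigh.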

18.4.5.3 Weighting of RIFs

To determine the weights wi, we start from a weight equal to 10 assigned to the most important RIF. The other RIFs are afterwards given weights relative to this one, on the scale 10 – 8 – 6 – 4 – 2. The idea is to think of relative changes in the probability given that the score of RIF no. i is changed from A to F. According to Equation 18.2, normalization is required to ensure that the sum of the wi is equal to 1.


18.4.5.4 Calculation Example

An example of results from calculation of Prev when Pave = 0.01, Phigh = 0.1, and Plow = 0.001 is shown in Table 18.7.

Table 18.7. Example – calculation of Prev

RIF no. i   Weight of RIF i (wi)   Normalized weight   Status of RIF i (si)   Qi     wi · Qi
1           4                      0.12                B                      0.55   0.065
2           6                      0.18                C                      1      0.176
3           4                      0.12                E                      5.5    0.647
4           6                      0.18                D                      2.5    0.441
5           10                     0.29                C                      1      0.294
6           4                      0.12                D                      2.5    0.294
Sum         34                     1.0                 –                      –      1.918
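The arithmetic behind Table 18.7 can be reproduced with a short script. The sketch below applies Equations 18.1 and 18.2 to the weights and grades in the table, using the Qi values of Table 18.6; the function name `revised_probability` is a hypothetical choice made for this illustration.

```python
# Sketch reproducing Table 18.7 via Equations 18.1 and 18.2.

Q_TABLE = {"A": 0.10, "B": 0.55, "C": 1.0, "D": 2.5, "E": 5.5, "F": 10.0}  # Table 18.6

def revised_probability(p_ave, weights, grades):
    """Return (P_rev, adjustment factor) for raw RIF weights and grades."""
    if len(weights) != len(grades):
        raise ValueError("one grade is needed per RIF weight")
    total = sum(weights)
    normalized = [w / total for w in weights]                         # Equation 18.2
    factor = sum(w * Q_TABLE[g] for w, g in zip(normalized, grades))
    return p_ave * factor, factor                                     # Equation 18.1

weights = [4, 6, 4, 6, 10, 4]             # raw RIF weights from Table 18.7
grades = ["B", "C", "E", "D", "C", "D"]   # RIF status from Table 18.7

p_rev, factor = revised_probability(0.01, weights, grades)
print(f"adjustment factor = {factor:.3f}")
print(f"P_rev             = {p_rev:.4f}")
```

Because the weights are normalized inside the function, the raw 10 – 8 – 6 – 4 – 2 scale can be passed in directly.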

By use of (18.1), Prev is equal to Pave × 1.918. In our example case, the RIF analysis gave an increase of the probability of occurrence of the basic event by a factor of 1.9 (from Pave = 0.01 to Prev = 0.019).

18.4.6 Recalculation of the Installation Specific Risk

A revised value for the installation specific risk may be calculated by use of the platform specific data (Prev) as input data in the risk model (event trees/fault trees) described above.

18.4.7 Remarks

We refer to Sklet et al. (2005) for a detailed discussion of this approach, and relevant references for similar methods. Compared to a traditional QRA model, the BORA approach is a more detailed method and includes considerably more risk influencing factors. It therefore gives more detailed information on the factors contributing to the total risk, i.e. a more detailed risk picture. The analysis allows one to study the effect of maintenance efforts on risk, and thus provides support for maintenance decisions. The risk analysis can be used to identify the critical factors, as well as to express the effect of risk reducing measures.

18.5 Discussion of Critical Issues

In maintenance applications, it is common to define risk as the expected loss, i.e. risk is equal to the probability of failure multiplied by the consequence of failure, see e.g. Khan and Haddara (2003), or in probabilistic terms, E[X], where X is the possible consequences measured for example in fatalities or economic values. It seems to be a common understanding among many risk and maintenance analysts that the use of expected values is the appropriate criterion for determining the best policies. The justification is the statistical property of a mean. If we consider a large set of similar activities and Xi is the consequences of the i-th activity, then the law of large numbers says that under certain conditions the mean of the Xi is approximately equal to E[Xi]. Portfolio theory also supports the use of expected values; see e.g. Abrahamsen et al. (2005).

The use of traditional cost-benefit analyses to support decision making is based on the same type of logic. Cost-benefit analysis means that we assign monetary values to all relevant attributes, including costs and safety, and summarise the performance of an alternative by the expected net present value, E[NPV]. The main principle in the transformation of goods into monetary values is to find the maximum amount society is willing to pay to obtain an improved performance. Cost-benefit analysis is seen as a tool for obtaining efficient allocation of resources, by identifying which potential actions are worth undertaking and in what fashion. By adopting the cost-benefit method the total welfare is optimised. This is the rationale for the approach. Although cost-benefit analysis was originally developed for the evaluation of public policy issues, the analysis is also used in other contexts, in particular for evaluating projects and activities in firms. The same principles apply, but using values reflecting the decision maker’s benefits and costs, and the decision maker’s willingness to pay.

However, risk is more than expected values. The most common definition of risk in the engineering community is that risk is the combination of consequences and probability, i.e. the combination (X, P), where P refers to probability; see e.g. ISO (2002). We extend this definition by using the pair (X, U), where U refers to uncertainty. Probability is a way of expressing the uncertainties.
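To make the expected net present value criterion concrete, the following minimal sketch computes E[NPV] for one alternative from a small set of scenarios. All figures (cash flows, probabilities, discount rate) are invented for illustration, the function names `npv` and `expected_npv` are hypothetical, and the accident scenario is assumed to already include the monetised safety consequences.

```python
# Hypothetical sketch: expected net present value of one maintenance alternative.
# All numbers below are invented for illustration.

def npv(cash_flows, rate):
    """Net present value of yearly cash flows, with year 0 first."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))

def expected_npv(scenarios, rate):
    """E[NPV] over a list of (probability, cash_flows) scenarios."""
    total_p = sum(p for p, _ in scenarios)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(p * npv(flows, rate) for p, flows in scenarios)

scenarios = [
    (0.95, [-100.0, 60.0, 60.0]),    # no accident
    (0.05, [-100.0, 60.0, -400.0]),  # accident in year 2, monetised loss
]

print(f"E[NPV] = {expected_npv(scenarios, rate=0.05):.2f}")
```

The single number returned hides the spread between the two scenarios, which is the limitation of expected values that the text goes on to discuss.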
Following these perspectives on risk, there is a need to see beyond the expected values. The arguments can be summarised as follows. What we search for is desirable outcomes X, for example no accidents and high profit. In practice we have a finite number of projects, and the mean numbers based on these projects are not the same as the expected value. An accident could result in losses that are significant also in a corporate perspective: the standard deviation of the project loss could be significant relative to the total cash flow of the firm. And since the uncertainties in the consequences are large, the assumptions and suppositions made in the calculation of the expected value may influence the results to a large extent. The assessments made should be seen as considerations based on relevant information, but there could be different assessments, different views and different perspectives on the uncertainties. This applies in particular to assigned, small probabilities of rare events.

A complicating factor is that safety and risk involve the balance between different attributes, including lives and money. The above expected value approach, for example based on cost-benefit analyses, relies on one being able to transform all values to one unit, the economic value. From a business perspective, firms may argue that this is the only relevant value. All relevant values should be transformed to this unit. This means that the expected costs of accidents and lives should be incorporated in the evaluations. But what is the economic value of a life? For most human beings it is infinite; most people would not be willing to give their lives for a certain amount of money. We say that a life has a value in itself. But of course, an individual may accept a risk for a certain amount of money or other benefits. And for the firm, this is the way of thinking: the balance of costs and risk. The challenge is however to perform this balance. What are reasonable numbers for the firm to use for valuing that a life has a value in itself? Obviously there are no correct answers, as it is a managerial and strategic issue. High values may be used if it can be justified that this would produce high performance levels, on both safety and production.

Consequently, uncertainty needs to be considered, beyond the expected values, which means that the principles of robustness and caution (precaution) have a role to play. Risk-averse behaviour is often the result. The point is that we put more weight on possible negative outcomes than the expected values support. Many firms seem in principle to be in favour of a risk neutral strategy for guiding their decisions, but in practice it turns out that they are often risk averse. The justification is partly based on the above arguments. In the case of a large accident, the possible total consequences could be rather extreme: the total loss for the firm in a short and long term perspective is likely to be high due to loss of production, penalties, loss of reputation, changes in the regulation regimes, etc. The overall loss is difficult to quantify, since the uncertainties are large, and it is seldom done in practice, but the overall conclusion is that investments in safety are required. The expected value is not the only basis for making this conclusion. We apply a cautionary principle, expressing that in the face of uncertainty, caution should be a ruling principle. For example, in a process plant, major hydrocarbon leaks might occur, requiring investments in various safety systems and barriers to reduce the possible consequences: we are cautious.
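The argument that the mean over a finite number of projects need not be close to the expected value can be illustrated numerically. The sketch below assumes n independent projects that each suffer a fixed loss with a small probability; the independence, the loss figures and the function name `portfolio_loss_stats` are assumptions made here for illustration only.

```python
import math

def portfolio_loss_stats(n_projects, p_accident, loss):
    """Expected loss per project and standard deviation of the average loss
    over n independent projects, each losing `loss` with probability p."""
    expected = p_accident * loss
    std_single = loss * math.sqrt(p_accident * (1.0 - p_accident))
    std_average = std_single / math.sqrt(n_projects)
    return expected, std_average

# Hypothetical numbers: 1% accident probability, loss of 1000 monetary units
for n in (1, 10, 1000):
    mean, std = portfolio_loss_stats(n, p_accident=0.01, loss=1000.0)
    print(f"n = {n:5d}   E[loss] = {mean:.1f}   std of average loss = {std:.1f}")
```

For a handful of projects the standard deviation is several times the expected loss, so the expected value alone says little about the outcome; only for very large portfolios does the average settle near E[X].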
Uncertainties in phenomena and processes justify investments in safety. Thus, to conclude on maintenance alternatives, we need an approach which provides decision support beyond expected values. We recommend an assessment process following the structure summarized below (Aven and Vinnem 2007).

For a specified alternative, say A, we assess the consequences or effects of this alternative seen in relation to the defined attributes (safety, costs, reputation, etc.). Hence we first need to identify the relevant attributes (X1, X2, …) and then assess the consequences of the alternative for these attributes. These assessments could involve qualitative or quantitative analysis. Regardless of the level of quantification, the assessments need to consider both what the expected consequences are and the uncertainties related to the possible consequences. Often the uncertainties could be large. In line with the adopted perspective on risk, we recommend a structure for the assessment according to the following scheme:

1. Identify the relevant attributes (safety, costs, reputation, alignment with main concerns, ...).
2. What are the assigned expected consequences, i.e. E[Xi] given the available knowledge and assumptions?
3. Are there special features of the possible consequences? In addition to assessing the consequences on the quantities Xi, some aspects of the possible consequences might need special attention. Examples are the temporal extension and aspects of the consequences that could cause social mobilization, i.e. violation of individual, social or cultural interests and values generating social conflicts and psychological reactions by individuals and groups who feel afflicted by the risk consequences. A system based on the scheme developed by Renn and Klinke (2002) is recommended; see Sandøy et al. (2005).
4. Are there large uncertainties related to the underlying phenomena, and do experts have different views on critical aspects? The aim is to identify factors that could lead to consequences Xi far from the expected consequences E[Xi]. A system for describing and characterising the associated uncertainties is outlined in Sandøy et al. (2005). This system reflects features such as the current knowledge and understanding about the underlying phenomena and the systems being studied, the complexity of technology, the level of predictability, the experts’ competence, and the vulnerability of the system. If a quantitative analysis is performed, the uncertainties are expressed by probability distributions.
5. The level of manageability during project execution: to what extent is it possible to control and reduce the uncertainties, and obtain desired outcomes? The expected values and the probabilistic assessments performed in the risk analyses provide predictions for the future, but some risks are more manageable than others, meaning that the potential for reducing the risk is larger for some risks than for others. By proper uncertainty and safety management, we seek to obtain desirable consequences. This leads to considerations on, for example, how to run processes reducing risks (uncertainties) and how to deal with human and organisational factors and obtain a good safety culture.

Hence for each alternative and attribute we may have information covering the following points:

• Predictions of attribute (e.g. zero fatalities)
• Expected value (e.g. 0.1 fatalities)
• Probability distribution (e.g. expressing a probability of a “major accident”)
• Risk description on a “lower level” (e.g. prediction of number of leaks, expected number of leaks, etc.)
• Aspects of the consequences
• Uncertainty factors
• Manageability factors

These assessments provide a basis for comparing alternatives and making a decision. Compared to standard ways of presenting risk results, this basis is much more comprehensive. In addition, sensitivity analyses and robustness analyses are to be performed. Of course, the depth of the analysis will be a function of the decision situation, the risks involved and the resources to be used. The full risk descriptions as outlined above would be used only in special situations, requiring a comprehensive decision support basis.

We refer to Aven and Vinnem (2007) for further reflections on the above issues, and in particular the use of cost-benefit analyses. A key question discussed is to what extent it is appropriate to adjust the value of a (statistical) life and adjust the discount rate to take into account the uncertainties.

In maintenance applications there is often reference to the use of risk acceptance criteria, as upper limits of risk acceptance expressed for example by the PLL or FAR values; see e.g. Khan and Haddara (2003). We are sceptical of the prevailing thinking concerning risk acceptance criteria; see Aven and Vinnem (2005, 2007). We all agree on the need for considering risk as a basis for making decisions under uncertainty. Such considerations must however be seen in relation to other concerns, costs and benefits. Care should be shown when using pre-determined risk acceptance criteria in order to obtain good arrangements, plans and measures, as they easily lead to the wrong focus: risk analysis is used to verify that these limits are met, and there is no drive for risk reduction and safety improvements.

The use of risk acceptance criteria cannot replace managerial review and judgement. The decision support analyses need to be evaluated in the light of the premises, assumptions and limitations of these analyses. The analyses are based on background information that must be reviewed together with the results of the analyses. Risk analysis provides decision support, not hard decisions. We refer to Aven and Vinnem (2007).

18.6 Conclusions

This chapter has presented and discussed the use of risk analysis for the selection and prioritisation of maintenance activities. The chapter has reviewed some critical aspects of risk analysis important for the successful implementation of such analyses in maintenance. This relates to risk descriptions and categorisations, uncertainty assessments, risk acceptance and risk informed decision making, as well as selection of appropriate methods and tools. In the risk analysis, the maintenance efforts are incorporated by:

• Showing the relation between maintenance effort and component performance
• Showing the relation between component performance and overall risk indices

An example is shown in Section 18.4. This example demonstrates some of the problems related to incorporating the maintenance efforts into the risk analysis. The analysis needs to be rather detailed to support the decision making. Developing suitable methodology is not straightforward, for example on how to assign installation specific probabilities, based on the information available (including reliability and maintenance data). Further research is undoubtedly required to give confidence in the methods to be used. A detailed analysis requires substantial input data, and the data must be relevant. Such analyses cannot be performed without extensive use of expert judgment. However, expert judgment is not to be seen as something negative. The risk analysis is a tool for summarising the information available (including uncertainties), and expert judgment constitutes an important part of this information.



19 Maintenance Performance Measurement (MPM) System Uday Kumar and Aditya Parida

19.1 Introduction Maintenance is an important support function for business processes with significant investment in physical assets, and it plays an important role in achieving organizational goals. However, the cost of maintenance and downtime is too high for many industries. For example, the cost of maintenance in a highly mechanized mine can be 40–60% of the operating cost (Campbell 1995), maintenance spending in the UK’s manufacturing industry ranges from 12 to 23% of the total factory operating costs (Cross 1988) and, according to a study in Germany, the annual spending on maintenance in Europe is around 1500 billion euros (Altmannshopfer 2006). All these have motivated senior managers and maintenance engineers to measure the contribution of maintenance towards total business goals, for example in terms of return on investment. Prior to the 1940s, maintenance was considered a necessary evil and the general attitude to maintenance was “It costs what it costs.” During 1950–80, with the advent of techniques like preventive maintenance and condition monitoring, the perception changed to “maintenance is an important support function and it can be planned and controlled.” Today maintenance is considered an integral part of the business process and is perceived as: “It creates additional value” (Liyanage and Kumar 2003). The creation of additional value by maintenance is expressed in terms of increased productivity, better utilisation of plant and system, lower accident rates and a better working environment. With increasing awareness that maintenance creates additional value in the business process, more and more companies are treating maintenance as an integral part of the business process, and the maintenance function has become an essential element of the strategic thinking of many companies in the service and manufacturing industries. 
With this change in the mindset of senior asset managers and owners, it has become essential to measure the performance of the manufacturing process to understand the tangible and, if possible, intangible contribution of maintenance towards business goals. However, without any formal measures of performance, it is difficult to plan, control and improve the maintenance process. With this, the focus has shifted to measuring the performance of maintenance. Maintenance performance needs to be measured to evaluate, control and improve the maintenance activities for ensuring achievement of organizational goals and objectives. In recent years, maintenance performance measurement (MPM) has received a great amount of attention from researchers and practitioners due to a paradigm shift in maintenance. This chapter deals with the broad topic of performance measurement (PM) and the metrics and measures for MPM, reviews the existing MPM frameworks, and discusses various issues and challenges associated with the development and implementation of an MPM system. The outline of the chapter is as follows: an overview of various PM frameworks and their development is presented in Section 19.2. Definitions of maintenance performance indicator (MPI) and MPM system, and their salient features, are discussed in Section 19.3. The important issues associated with the development of an MPM system are discussed in Section 19.4, while the MPIs under different criteria are explained in Section 19.5. The MPM system and the framework are explained in Section 19.6. Some of the MPIs and MPM systems in different industries are discussed in Section 19.7. The final section concludes the chapter with limitations of the current literature and practice.

19.2 Performance Measurement – An Overview In the past two decades, performance measurement (PM) has received a great amount of attention from researchers, practitioners and industry. Andersen and Fagerhaug (2002) have listed the reasons for measuring performance: providing employees with feedback on the work they are performing, providing the information on which employees and management can base correct decisions, helping to implement strategies and policies for an organization, and using PM data to monitor the performance trend over time. PM is defined as the process of quantifying the efficiency and effectiveness of past and future activities. Major issues related to this field concern what to measure and how to measure it in a practically feasible and cost-effective way (Neely 1999). Measurement thus gives the status of the variable, compares the data with target or standard data and points out what corrective and preventive actions should be taken and where they should be applied. Without adequate data, it is extremely difficult to develop models for supporting the decision making process (Wealleans 2000). Today, PM is related to product, operation process, partnering, stakeholders and the production. PM of a process essentially involves mapping of the process, measurement of the performance, undertaking root-cause analysis and benchmarking of the performance. PM is a multi-disciplinary activity as it involves multiple stakeholders. A PM system needs to be integrated, linking all the perspectives in a balanced manner, and to take a holistic approach for the entire organization to achieve the stakeholders’ goals at various levels.

19.2.1 Metrics, Measures and Indicators Performance measure is the term used when talking about PM in general. Performance indicators (PIs) are measures that describe how well an operation is achieving its objectives. A PI of an activity is a ratio of two variables: the output to the input of that activity. A performance measure can thus be defined as a metric for quantifying the efficiency and/or effectiveness of past or future activities, whereas a performance metric is the definition of the scope, content and component parts of a broadly based performance measure (Neely et al. 2002). The characteristics of performance measures include relevance, interpretability, timeliness, reliability and validity (Al-Turki and Duffuaa 2003). A PI is a more specific measure that gauges or indicates performance. PIs are broadly classified as leading or lagging indicators. Leading indicators are performance drivers and are used for understanding the present status and taking corrective measures to achieve the desired target. A leading indicator is of the non-financial and statistical type that predicts fairly and reliably in advance. A leading indicator thus works as a performance driver and ascertains the present status in comparison with the reference indicator level. At the maintenance department level, condition monitoring indicators such as noise, vibration, thermographic measurement and particles in oil can be leading indicators. Lagging indicators are outcome measures and provide a basis for studying the deviations after the completion of the activities. Cost of maintenance and mean time between failures (MTBF) are examples of lagging indicators. Since PIs are just indicators of performance, a key performance indicator (KPI) is an aggregation of various PIs in a logical way. Thus, a KPI is a more strategic and important indicator of performance (Wireman 1998). The main purpose of a KPI is to pinpoint possible areas for improvement within an organization. 
Until 1980, PM was mostly based on financial measures. Kaplan and Norton (1992) suggested the balanced scorecard as a more pragmatic and progressive framework to measure performance in a balanced way. The balanced scorecard focuses on both tangible and intangible aspects of the business process through its four perspectives: customers, internal processes, financial, and innovation and learning. Subsequently, various researchers have developed frameworks considering non-financial measurements and intangible assets to achieve competitive advantages (Parida and Kumar 2006). Some studies have shown that companies using an integrated balanced PM system perform better than those which do not measure their performance (Kennerly and Neely 2003; Lingle and Schiemann 1996). Some of the major PM frameworks developed by various authors and researchers are the Du Pont Pyramid (Chandler 1977), the PM matrix (Keegan et al. 1989), the results and determinants matrix (Fitzgerald et al. 1991), the balanced scorecard (BSC) (Kaplan and Norton 1992), the SMART pyramid (Lynch and Cross 1991), the integrated PM framework (Medori and Steeple 2000), the performance prism (Neely et al. 2002), the BSC of advanced information (Abran and Buglione 2003), and the European Foundation for Quality Management (EFQM) (Wongrassamee et al. 2003).

19.3 Maintenance Performance Measurement (MPM) Generally, a maintenance performance measurement (MPM) system forms part of the organization’s operational system and includes all related maintenance performance indicators (MPIs) and their interrelationship within the whole maintenance process. MPM is the process of measuring maintenance performance, to know how well the maintenance process is performing and to identify the opportunities for improvement. In an MPM system, data are collected and analyzed, and relevant information is extracted for timely decision making. MPM is a complex task involving measurement of varying inputs and multiple outputs of the maintenance process. One way of measuring the performance is to develop PIs and implement them with the total involvement of the entire organisation. An indicator is a function of several metrics; when used for the measurement of maintenance, it is called a maintenance performance indicator (MPI). MPIs are ratios of two maintenance-related variables, which need to be defined beforehand and whose values may change with time. For example, the value of an MPI may change after five years as compared to the first year. MPIs are the means to measure the performance of a maintenance process and are used to facilitate the understanding and measurement of past performance, so that future prediction can be visualized, resulting in appropriate decision making. MPIs can act as an early warning system for the operation and maintenance process, indicating the present status of the process, so as to enable evaluation, prediction and corrective action. The data from measurement tell us the status of the job carried out and what action is to be taken thereafter, and indicate where those actions should be targeted. 
For example, MPIs could be used for financial reports; for control of the performance of employees and other resources, as in the costing and appraisal system; for finding the competitive position within business organizations, as with customer satisfaction and competitor ranking; for health, safety and environmental (HSE) rating in the production industry; and for finding internal effectiveness, as with the overall equipment effectiveness (OEE) for the manufacturing and process industry. The selection of MPIs to follow up the contribution of maintenance is an important but complex issue. Thus, the structure of the MPI needs to be considered from different perspectives of the maintenance process in an integrated manner. There are a large number of MPIs used by different industries today, which need to be carefully identified and selected to meet the specific requirements of the organization. While defining or identifying MPIs, it is important to relate them to both the process inputs and the process outputs. If this is carried out properly, then MPIs can (Kumar and Ellingsen 2000):
• Provide or identify a basis for resource allocation and control
• Help to identify the problem areas
• Provide individuals and teams with the means to measure their performance
• Provide teams/individuals with the means to measure their contribution to the business objectives
• Facilitate easy benchmarking of performance
• Provide trends in performance
• Indicate the contribution of maintenance to overall business objectives
Some of the MPIs provide quality information for monitoring operational safety performance during the implementation phase of the MPM system. These MPIs are critical for industries where safety aspects play an important role, especially nuclear power plants. The characteristics of operational safety performance indicators as applicable to nuclear power plants, which can be applied to other industries as well, are (IAEA 2000):
• There is a direct relationship between the indicator and safety
• Necessary data are available or capable of being generated
• Indicators are unambiguous, their significance is understood, they can be expressed in quantitative terms and local action can be taken on the basis of the indicators
• They are not susceptible to manipulation, form a manageable set, are meaningful, and can be validated and integrated into normal operational activities
• They can be linked to the cause of a malfunction, and the accuracy of the data at each level can be subjected to quality control and verification
The MPIs could be time and target-based, giving a positive or negative indication. An MPI could be trend-based in some cases. If the trend is positive or steady, meaning that everything is working well, then no action may be required. If it shows a negative trend and has crossed the lower limit of the target, then the decision is to act immediately. Whenever the value of the MPI falls within the target limits (as set by the decision maker), the decision is “wait and see”. Different types of graphs and figures could be used for indicating the health state of the technical system, using different color codes for “excellent”, “satisfactory”, “improvement required” and “unsatisfactory” performance levels. There could be other visualization techniques using bar charts or other graphical tools for monitoring MPIs. 
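The trend-and-target decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Decision` and `decide` are hypothetical, a higher MPI value is assumed to be better (for example availability in percent), the trend is taken crudely as the change between the last two observations, and the order in which the three rules are checked is our own reading of the text rather than something the chapter specifies.

```python
from enum import Enum


class Decision(Enum):
    NO_ACTION = "no action required"
    WAIT_AND_SEE = "wait and see"
    ACT_IMMEDIATELY = "act immediately"


def decide(history, lower_limit):
    """Map a short MPI history to one of the three decisions in the text.

    Assumptions (not from the chapter): higher values are better, and the
    trend is simply the difference between the last two observations.
    """
    if len(history) < 2:
        raise ValueError("need at least two observations to see a trend")
    current, previous = history[-1], history[-2]
    falling = current < previous
    if falling and current < lower_limit:
        # Negative trend that has crossed the lower target limit.
        return Decision.ACT_IMMEDIATELY
    if not falling:
        # Positive or steady trend.
        return Decision.NO_ACTION
    # Negative trend, but the value is still at or above the lower limit.
    return Decision.WAIT_AND_SEE


# Hypothetical monthly availability readings (percent), lower limit 96%.
print(decide([97.0, 96.5], lower_limit=96.0).value)
print(decide([96.5, 95.0], lower_limit=96.0).value)
```

In practice the trend would more likely be estimated over a longer window, and the returned decision could drive the color coding mentioned in the text.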
The SMART test developed by the Department of Energy (DOE) can also be used effectively to describe the five characteristics of an MPI (DOE 2002):
S = Specific; clear and focused to avoid misinterpretation. Should include measure assumptions and definitions and be easily interpreted.
M = Measurable; can be quantified and compared to other data. It should allow for meaningful statistical analysis. Avoid “yes/no” measures except in limited cases, such as start-up or systems-in-place situations.
A = Attainable; achievable, reasonable, and credible under conditions expected.
R = Realistic; fits into the organization's constraints and is cost-effective.
T = Timely; the indicator should reflect the status in real time and on time.

19.4 Development and Implementation Issues Today, many companies involved in industrial production measure their maintenance performance in order to remain competitive in the market. However, improper implementation and management of a measurement system that aims to use new measures to reflect new priorities often lead to ineffective results. This is due to the failure of the organization to discard measures reflecting old priorities, uncorrelated and inconsistent indicators and inadequate measurement techniques (Meyer and Gupta 1994). Understanding the need for MPM in the business for effective management of maintenance and its work process is critical for the development and successful implementation of the MPM. In order to develop the MPM system, maintenance performance issues need to be considered, which include the complexity of tasks, the multiple inputs and outputs of the maintenance process and the stakeholders’ continuously changing requirements. Some of the important issues associated with the development of an MPM system are as follows.
19.4.1 Measuring Values Created by the Maintenance The most important issue in developing an MPM system is to measure the value created by the maintenance process. As a manager, one must know that what is being done is what is needed by the business process, and if the maintenance output is not contributing/creating any value for the business, it needs to be restructured. Examples are the ratio of investment made and trends in cost per ton.
19.4.2 Justifying Investment The second issue in developing an MPM system is to justify the organization’s investment in the maintenance organisation; not so much as to whether one is doing the right thing, but whether the investment being made is producing a return on the resources that are being consumed.
19.4.3 Revising Resource Allocations The third issue in developing an MPM system is to determine if additional investment is required in maintenance and to justify it. Alternatively, such measurement of activities also permits one to determine the need for change or how to carry out the current activities more effectively by using the resources allocated.
19.4.4 Health, Safety and Environmental (HSE) Issues The fourth issue is to understand the contribution of maintenance towards HSE issues. Bad maintenance performance can lead to accidents (safety issue) and pollution (health hazards and environmental issues), besides encouraging an unhealthy work culture and environment.

19.4.5 Adapting to New Trends in Operation and Maintenance Strategy New operating and maintenance strategies are adopted and followed by industries in quick response to market demand, for the reduction of production loss and process waste. MPM measures the value created by the maintenance. Some of the important questions related to strategy are as follows:
• How does one assess and respond to stakeholders’ (internal and external) needs?
• How does one translate the corporate goal and strategy into targets and goals at the operational level (converting a subjective vision into objective goals)?
• How does one integrate the results and outcomes from the operational level to develop lead indicators at the corporate level (converting objective outcomes into strategic KPIs and linking them to strategic goals and targets)?
• How does one support innovation and training for the employees to facilitate an MPM-oriented culture?
19.4.6 Measuring What is Easy to Measure Most organizations make the mistake of measuring what is easy to measure, rather than what is required to be measured. Thus, over a period of time, the indicators are out of tune with the corporate strategy. Besides, a large amount of undesired data creates data overload, and such data are rarely utilised for analysis or decision making. Therefore, the MPIs need to be identified and selected to meet the specific requirements of the organization and its related issues.
19.4.7 Organizational Issues Today organizations are trying to adopt a flat and compact organizational structure, a virtual work organization, and empowered, self-managing, knowledge management work teams and work stations. The organisational maintenance issues are to measure maintenance effectiveness and the resources spent on maintenance. Typically in an organization, the top level looks for the investment and decides the corporate strategy, based on which the operation and maintenance strategies are formulated. Depending on the maintenance strategy, the maintenance program and policies are defined, which are implemented by the middle level. The operational level undertakes the actual tasks of performing the activities. The issues pertaining to organization are:
• Need for developing a reliable and meaningful MPM system.
• Commitment of the top management for the MPM system.
• Converting the subjective corporate goals to specific targets and MPIs required to be measured.
• Involvement of the employees in implementation of the MPM system.
• Method and means of these measurements.
• Periodicity (time period) of such measurements.
• Analysis of the collected data, its conversion to information, owner of the information and its accountability within the organization.
• Effective and efficient communication within and outside the organization on issues related to information and decision making.

19.5 Framework for MPM System A conceptual framework explains, either graphically or in narrative form, the main things to be studied, the key factors or variables and the presumed relationship between them. Frameworks can be rudimentary, elaborate, theory driven, descriptive or causal (Miles and Huberman 1994). The MPM framework, which links the multiple criteria of MPIs, needs to consider the different hierarchical levels of the organization from the viewpoint of the internal and external stakeholders’ requirements. There is also a need to map the maintenance process and identify the gap between maintenance planning and execution, so that the MPIs can take care of these gaps.
19.5.1 Multiple Hierarchical Levels in MPM System In order to accomplish the top-level objectives of the espoused maintenance strategy, these objectives need to be cascaded into team and individual goals. The adoption of fair processes is the key to successful alignment of these goals. It helps to harness the energy and creativity of committed managers and employees to drive the desired organizational transformations (Tsang 1998). Murthy et al. (2002) mentioned that maintenance management needs to be carried out in both strategic and operational contexts and that the organization is generally structured into three levels. At each level, the linkage and relationship between maintenance and operation need to be clearly understood. Defining the measures and the actual measurements for monitoring and control constitutes an extremely complex task for large organizations. The complexity of MPM is further increased for multiple criteria objectives. In the MPM system, MPIs are considered from the multiple hierarchical levels. The first hierarchical level could correspond to the corporate or strategic level, the second to the tactical or managerial level, and the third to the functional/operational level. Depending on the organizational structure, there could be more than three hierarchical levels. The three hierarchical levels shown in Figure 19.1 (adapted from Parida and Kumar 2006) are considered for the proposed MPM framework. The top level is responsible for framing the mission/vision statement, goals and objectives, which form part of the strategic management. It decides the investment to be made in infrastructure and manpower, and what the consequences or likely return on investment will be. The detailed activities are not in focus at this level. The maintenance data at the functional level are aggregated and linked to the tactical or middle level to help the management with analysis and decision making at the strategic or tactical level. The corporate KPIs are cascaded down from the strategic level to MPIs at the operational level in a top-down manner, and the MPIs are aggregated from the operational to the strategic level in a bottom-up information flow. The subjectivity increases as we integrate the objective outcomes from the shop floor to the organizational goal at the higher level. An illustration of the breaking down of the corporate goals into objective targets at shop floor level is shown in Figure 19.2. Similarly, Figure 19.3 exhibits an example of aggregation of MPIs.

Figure 19.1. Linkages between objective outcomes at operational level to strategic level and breaking down of goals into objective targets

As shown in the figure, while cascading down the corporate goals of a mining company with an installed capacity of 0.6 million tons per month, the monthly production target of 0.51 million tons of iron ore pellets cascades down to a system availability of 96% at the tactical level, which must be translated into a maximum allowed planned stop of 20 h per month and an unplanned plant stop of 8.8 h per month. Similarly, MPIs such as planned and unplanned stops need to be aggregated to a higher level in terms of availability and capacity utilization. The calculations are as under:
• Plant capacity = 0.6 million tons per month
• Saleable quantity = 0.51 million tons per month
• Plant capacity is 835 tons per hour
• Goals (tactical): Availability (A) = 96%, Speed (P) = 90% and Quality (Q) = 99%
• OEE = A × P × Q = 0.96 × 0.90 × 0.99 = 0.85
• Non-availability = 24 × 30 × 0.04 = 28.8 h per month
• Planned stop = 20 h/month and unplanned stop = 8.8 h/month
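The arithmetic of this cascade can be checked with a short Python sketch. The function names below are hypothetical and chosen only for illustration; the numbers are those of the mining example, and the month is taken as 24 × 30 = 720 h as in the text. With a 96% availability target, the non-available fraction is 1 − 0.96 = 0.04, which gives the 28.8 h monthly downtime budget.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, as in the example


def oee(availability, performance, quality):
    """Overall equipment effectiveness as the product A x P x Q."""
    return availability * performance * quality


def downtime_budget(availability, planned_stop_h, hours=HOURS_PER_MONTH):
    """Top-down: turn an availability target into stop-hour targets.

    Returns (total allowed downtime, allowed unplanned downtime) in hours.
    """
    total = (1.0 - availability) * hours
    return total, total - planned_stop_h


def availability_from_stops(planned_h, unplanned_h, hours=HOURS_PER_MONTH):
    """Bottom-up: aggregate stop hours back into an availability figure."""
    return 1.0 - (planned_h + unplanned_h) / hours


target_oee = oee(0.96, 0.90, 0.99)
total_down, unplanned = downtime_budget(0.96, planned_stop_h=20.0)

print(round(target_oee, 3))                            # 0.855
print(round(total_down, 1), round(unplanned, 1))       # 28.8 8.8
print(round(availability_from_stops(20.0, 8.8), 2))    # 0.96
print(round(600_000 * target_oee))                     # 513216
```

The last line multiplies the installed capacity by the target OEE and gives roughly 0.51 million tons per month, which agrees with the saleable quantity quoted in the example.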

Figure 19.2. An example of cascading down of corporate goal to operational targets

Figure 19.3. Aggregation of MPIs from operational level to corporate level

Since the actual production and the OEE level have gone down, the management now has to take remedial measures and appropriate decisions to achieve the desired level of OEE and production.
19.5.2 Multiple Criteria of MPM System The objectives of the organizational decision makers are expressed in terms of different criteria. For example, at the beginning of the twentieth century, financial cost was the single criterion used by managers. After the 1980s, management felt that a single criterion was unable to meet all their objectives, and the concept of multiple criteria evolved. When there are a number of criteria, the multi-criteria choice problem arises, which is solved by obtaining information about all the criteria and their relative priorities. For the MPM system, different MPIs are grouped under different criteria as per the organization’s requirements, based on the stakeholders’ needs. The multiple criteria of the MPIs can be considered from a balanced and integrated point of view. Besides the four perspectives (customer, financial, internal processes, and innovation and learning) of Kaplan and Norton (1992), three more criteria, namely HSE, employee satisfaction and maintenance task related, are considered and included in the MPM framework. Some of the MPIs thus grouped under the seven criteria associated with the development of the MPM framework are selected to improve the productivity, quality and safety of the organization (Parida et al. 2005). The seven criteria considered are discussed below.
19.5.2.1 Plant/Equipment Related Indicators The indicators under this criterion measure the performance pertaining to the plant and equipment of the organization. These MPIs provide relevant information to the management at different hierarchical levels for appropriate decision making. Some of the MPIs under this criterion are:
• Availability. The availability is represented by the percentage of the plant availability used for manufacturing/production. This is calculated as the ratio of the mean time to failure (MTTF) to the total time, i.e. MTTF plus the mean time to repair (MTTR).
• Performance (output per hour). This MPI indicates the speed of production and is expressed as a percentage of the production/performance speed.
• Quality. This MPI refers to the quality of the product/service. This is the percentage of good parts produced out of the total number of parts produced. The overall equipment effectiveness (OEE) is one of the main benchmarks or key performance indicators for the total process of a company. The OEE is the product of the equipment availability, performance and quality.
• Number of minor and major stops. This indicator is the number of stops, either minor or major. Stoppages can also be quantified in time (hours and minutes).
• Down-time for the number of minor and major stops. This is expressed in hours and minutes for the total number of stops or for each minor and major stop.
• Rework. Rework due to maintenance lapses (for example, not sharpening the tools) is expressed in time (hours and minutes), the number of pieces on which rework has been carried out and the cost of the rework undertaken.
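As a rough sketch of how the plant/equipment indicators above could be computed from shop-floor data, the following Python functions implement the stated definitions. The function names and the sample figures are hypothetical. Availability follows the MTTF/(MTTF + MTTR) ratio given in the text; treating performance as actual output per hour divided by rated output per hour is an assumption on our part, since the text only says it is a percentage of the production speed.

```python
def availability(mttf_h, mttr_h):
    """Ratio of mean time to failure to total time (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)


def performance(actual_output_per_h, rated_output_per_h):
    """Production speed as a fraction of the rated speed (assumed definition)."""
    return actual_output_per_h / rated_output_per_h


def quality(good_parts, total_parts):
    """Fraction of good parts out of all parts produced."""
    return good_parts / total_parts


def oee(a, p, q):
    """Overall equipment effectiveness: availability x performance x quality."""
    return a * p * q


# Hypothetical shop-floor figures, for illustration only.
a = availability(mttf_h=96.0, mttr_h=4.0)
p = performance(actual_output_per_h=90.0, rated_output_per_h=100.0)
q = quality(good_parts=990, total_parts=1000)
print(f"A={a:.2f} P={p:.2f} Q={q:.2f} OEE={oee(a, p, q):.3f}")
```

Keeping the three factors as separate functions makes it easy to see which of them drives a drop in OEE.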

19.5.2.2 Maintenance Task Related Indicators MPIs under this criterion pertain to the maintenance tasks carried out. These MPIs indicate the efficiency and effectiveness of the maintenance department of the organization. The MPIs are:
• Change over time
• Planned maintenance task (preventive maintenance)
• Unplanned maintenance tasks (corrective maintenance)
• Response time for maintenance

19.5.2.3 Finance/Cost-related Indicators The finance or the cost related MPIs are the most sought information for the management; these measures are valuable in summarizing the readily measurable economic results of the business. The MPIs under this criterion relate to maintenance and production costs. Besides, management of the organization can include other financial and cost related MPIs or PIs as per their need. Some of the MPIs of this criterion are:
• Maintenance cost/unit
• Production cost per unit
• Total maintenance cost

19.5.2.4 Customer Satisfaction Customer satisfaction is one of the most important criteria for an organization to focus on. This criterion measures the organization’s performance in satisfying its customers, and its measures are formulated from the organizational business strategy. Some of the MPIs under this criterion are:
• Number of quality complaints
• Low quality returns (number/quantity)
• Customer satisfaction (value-for-money feedback etc.)
• Customer retention
• Number of new customers added

19.5.2.5 Learning and Growth

This criterion is related to the infrastructure of the organization required for creating long-term growth and improvement. The global competitive environment compels companies to continuously improve their capabilities for delivering the required value to customers and other stakeholders. Root cause analysis is carried out to check the frequency of failure and the time taken to fix the failure. Some of the MPIs considered under this criterion are:

• Number of new ideas generated for improvement
• Skills and competency development/training

19.5.2.6 Health, Safety, and the Environment (HSE)

Today, all plants and organizations are compelled to consider criteria related to societal and environmental issues, besides the economy. Safety precautions such as protective clothing and safeguards for chemical cleaning are undertaken by the organization because they are mandatory requirements. Health and safety, which form part of the societal requirements, are considered by the organization under this criterion, together with the environmental issues. Some of the MPIs under this criterion are:

• Number of incidents/accidents
• Lost time due to HSE issues
• Number of legal cases
• Number of compensation cases/amount of compensation paid
• Number of HSE complaints

19.5.2.7 Employee Satisfaction

Employees are among the important partners and stakeholders of any organization today. Their satisfaction is therefore essential for successfully implementing an MPM system and achieving the desired goals of the organization. Sample MPIs under this criterion, which indicate the motivation and satisfaction level of the employees, are:

• Employee absenteeism
• Employee complaints
• Employee retention

19.5.3 Multiple Criteria and Hierarchical MPM Framework

While developing an MPM framework, multiple criteria and the hierarchical levels of the organization are considered. Based on the stakeholders’ requirements, corporate objectives and strategy, multi-criteria MPIs are integrated into the different hierarchical levels of the organization, involving employees at all levels. At the functional level, the corporate objectives are converted to specific measurable targets. It is essential that all employees speak the same language throughout the entire organization. In addition to external stakeholders’ requirements, internal aspects such as the capacity and capability of the organization, comprising the departments, employee requirements, the organizational climate and skill enhancement, are taken into consideration. An MPM framework considering the multi-criteria and hierarchical approach is given in Table 19.1, with sample MPIs.

Table 19.1. A multi-criteria hierarchical maintenance performance measurement (MPM) framework

Hierarchical levels: Level 1 (strategic/top management), Level 2 (tactical/middle management), Level 3 (functional/operational)

Process and effectiveness dimensions (side column of the table):
- Front-end process: Timely delivery; Quality; HSE issues
- External effectiveness: Customers/stakeholders; Compliance with regulations
- Internal effectiveness: Reliability; Productivity; Efficiency; Growth & innovation
- Back-end process: Process stability; Supply chain; HSE

Multi-criteria MPIs by hierarchical level:

Equipment/process related
- Level 1: Capacity utilization
- Level 2: Availability; OEE; Production rate; Quality; Number of stops
- Level 3: Production rate; Number of defects/rework; Number of stops/downtime; Vibration & thermography

Cost/finance related
- Level 1: Maintenance budget; ROMI
- Level 2: Maintenance production cost per ton; Maintenance/production cost
- Level 3: Maintenance cost per ton

Maintenance task related
- Level 1: Cost of maintenance tasks
- Level 2: Quality of maintenance task; Change over time; Planned maintenance task; Unplanned maintenance task
- Level 3: Change over time; Planned maintenance task; Unplanned maintenance task

Learning, growth & innovations
- Level 1: Generation of a number of new ideas; Skill improvement training
- Level 2: Generation of a number of new ideas; Skill improvement training
- Level 3: Generation of a number of new ideas; Skill improvement training

Customer satisfaction related
- Level 1: Quality complaint numbers; Quality return; Customer satisfaction; Customer retention
- Level 2: Quality complaint numbers; Quality return; Customer satisfaction; New customer addition
- Level 3: Quality complaint numbers; Quality return; Customer satisfaction

Health, safety, security & environment (HSSE)
- Level 1: Number of accidents; Number of legal cases; HSSE losses; HSSE complaints
- Level 2: Number of accidents/incidents; Number of legal cases; Compensation paid; HSSE complaints
- Level 3: Number of accidents/incidents; HSSE complaints

Employee satisfaction
- Level 1: Employee satisfaction; Employee complaints
- Level 2: Employee turnover rate; Employee complaints
- Level 3: Employee absentees; Employee complaints

The MPIs at the functional and tactical levels are aggregated into KPIs at the strategic level. For example, MPIs such as availability, performance (production rate) and quality at the operational level aggregate to OEE at the tactical level, and to capacity utilization at the strategic level under the plant/equipment criterion.

19.6 Some Examples from Different Industries

Each industry has its own system for MPM; this is especially relevant for industries such as nuclear power and oil and gas. MPM frameworks and indicators to monitor, control and evaluate performance are in use in different industries. More and more industries are trying to develop specific MPIs for their own organizations and to identify the indicators best suited to their industry. The industries in which MPIs have been tried out include the nuclear, oil and gas (O & G), railway, process and energy sectors. Different approaches have been applied to developing the MPM framework and indicators for different industries, as per the stakeholders’ requirements. Some of the MPIs used in different industries are briefly discussed below.

19.6.1 Nuclear Industry

The International Atomic Energy Agency (IAEA) has been actively sponsoring work on indicators to monitor nuclear power plant (NPP) operational safety performance since the early 1990s. The safe operation of nuclear power plants is the accepted goal of top management. A high level of safety results from the integration of good design, operational safety and human performance. To be effective, a holistic and integrated approach must be adopted for providing a performance measurement framework and identifying the performance indicators with the desired safety attributes for the operation of the nuclear plant. The NPP performance parameters include both safety and economic performance indicators, with the safety aspects overriding. To assess the operational safety of an NPP, a set of tools such as the plant safety aspect (PSA), regulating inspection, quality assurance and self assessment is used. Two categories of indicators commonly applied are risk-based indicators and safety culture indicators.

19.6.1.1 Operational Safety Performance Indicators

Indicator development starts with the operational safety attributes, from which the operational safety performance indicators are identified. Under each attribute, overall indicators are established to provide an overall evaluation of the relevant aspects of safety performance, and under each overall indicator, strategic indicators are identified. The strategic indicators are meant to bridge the gap between the overall and specific indicators. Finally, a set of specific indicators is identified or developed for each strategic indicator to cover all the relevant safety aspects of the NPP. Specific indicators are used to measure performance and to identify declining performance, so that management can take corrective decisions. Some of the indicators used in plants are given in Table 19.2 (IAEA 2000).

Table 19.2. Some of the operational safety performance indicators

Attribute 1. Operates smoothly

Overall indicator 1. Operating performance
  Strategic indicator 1. Forced power reductions and outages
    Specific indicators:
    1. No. of forced power reductions and outages due to internal causes
    2. No. of forced power reductions and outages due to external causes

Overall indicator 2. State of structures, systems and components
  Strategic indicator 1. Corrective work orders issued
    Specific indicators:
    1. No. of corrective work orders issued for safety system
    2. No. of corrective work orders issued for risk important BOP systems
    3. Ratio of corrective work orders executed to work orders programmed
    4. No. of pending work orders for more than 3 months
  Strategic indicator 2. Material condition
    Specific indicators:
    1. Chemistry Index (WANO performance indicators)
    2. Ageing related indicators
  Strategic indicator 3. State of the barriers
    Specific indicators:
    1. Fuel reliability (WANO)
    2. RCS leakage
    3. Containment leakage

19.6.2 Oil and Gas Industry

The cost of maintenance and its influence on the total system effectiveness of the oil and gas industry is too high to ignore (Kumar and Ellingsen 2000). The safe operation of oil and gas production units is the accepted goal of the management of the industry. A high level of safety is achieved through the integration of good design, operational safety and human performance. To be effective, an integrated approach must be adopted for identifying the MPIs with the desired safety attributes for the operation of the oil and gas production unit. Some of the MPIs reported from plant level to result unit level to result area level for the Norwegian oil and gas industry, grouped into different categories, are as follows (Kumar and Ellingsen 2000):

• Production
  – Produced volume (Sm3)
  – Planned production (Sm3)
• Technical integrity
  – Backlog preventive maintenance (man-hours)
  – Backlog corrective maintenance (man-hours)
• Maintenance
  – Maintenance man-hours total
  – Maintenance man-hours safety systems
• Deferred production
  – Due to maintenance (Sm3)
  – Due to operation (Sm3)
  – Due to drilling/well operations (Sm3)
  – Weather and other causes (Sm3)

19.6.3 Railway Industry: Example from Rail Infrastructure

Railway operation and maintenance is meant to provide acceptable service to users, while meeting the regulating authorities’ requirements. Today, one of the requirements for infrastructure managers is to achieve cost-effective maintenance activities and a punctual and cost-effective railroad transport system. As a result of a research project for the Swedish railroad transport system, some of the identified maintenance performance indicators are (Åhren and Kumar 2004):

• Capacity utilization of infrastructure
• Capacity restriction of infrastructure
• Hours of train delays due to infrastructure
• Number of delayed freight trains due to infrastructure
• Number of disruptions due to infrastructure
• Degree of track standard
• Markdown in current standard
• Maintenance cost per track-km
• Traffic volume
• Number of accidents involving railway vehicles
• Number of accidents at level crossings
• Energy consumption per area
• Use of environmentally hazardous material
• Use of non-renewable materials
• Total number of functional disruptions
• Total number of urgent inspection remarks

19.6.4 Process and Utility Industries

Measuring maintenance performance has drawn considerable interest in the utility, manufacturing and process industries over the last decade. Organizations are keen to know the return on the investment made in maintenance, while meeting business objectives and strategy. Under the challenge of increasing technological change, implementing an appropriate performance measurement system in an organization ensures that actions are aligned to the strategies and objectives of the organization. The MPIs for a utility in the energy sector will differ from those of the process industry. Some of the MPIs identified for an energy sector organization in Europe are:

(a) Customer satisfaction related
• SAIDI (system average interruption duration index)
• CAIDI (customer average interruption duration index)
• CSI (customer satisfaction index)

(b) Cost related
• Total maintenance cost
• Profit margin

(c) Plant/process related
• Down time
• OEE rating

(d) Maintenance task related
• Number of unplanned stops (number and time)
• Number of emergency work
• Inventory cost

(e) Learning and growth/innovation related
• Number of new ideas generated
• Skill and improvement training

(f) Health, safety and environment related
• Number of accidents
• Number of HSE complaints

(g) Employee satisfaction related
• Employee satisfaction level
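The chapter names SAIDI and CAIDI without defining them. As a hedged sketch, the Python below uses the conventional definitions (those commonly given in IEEE Std 1366, assumed here rather than taken from the chapter): SAIDI divides total customer interruption minutes by the number of customers served, and CAIDI divides the same total by the number of customers interrupted. The class, function names and interruption data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interruption:
    customers_affected: int
    duration_minutes: float


def customer_minutes(events):
    """Total customer interruption duration across all events."""
    return sum(e.customers_affected * e.duration_minutes for e in events)


def saidi(events, customers_served):
    """System average interruption duration: minutes per customer served."""
    return customer_minutes(events) / customers_served


def caidi(events):
    """Customer average interruption duration: minutes per customer interrupted."""
    interrupted = sum(e.customers_affected for e in events)
    return customer_minutes(events) / interrupted


# Hypothetical year of interruptions for a utility serving 1000 customers.
events = [Interruption(200, 30.0), Interruption(50, 120.0)]
print(f"SAIDI = {saidi(events, customers_served=1000):.1f} min per customer served")
print(f"CAIDI = {caidi(events):.1f} min per customer interrupted")
```

Under these assumed definitions, SAIDI reflects what the average customer experiences, while CAIDI reflects how long restoration takes once an interruption has occurred.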

19.7 Concluding Remark

The MPM system for each organization needs to be different, as each organization is unique. A holistic and balanced MPM system should be developed and implemented by involving all the stakeholders of the maintenance process. Even though there has been a several-fold growth in research publications on performance measurement, researchers and managers dealing with the specific area of maintenance are still trying to find a universal maintenance performance measurement system that shows the “added value” generated by the maintenance process and its contribution towards the business goal of the company. Therefore, in future, it will be challenging for maintenance professionals to show the contribution of maintenance towards the total business goal and to provide metrics that measure the “added value” generated by the maintenance process. Thus, future research will need to focus on understanding the maintenance process and developing simple, easy-to-implement performance measurement frameworks. There is further scope to study the impact of different cultures and human behavioral aspects associated with MPM.

19.8 References

Abran, A. and Buglione, L. (2003), A multidimensional performance model for consolidating Balanced Scorecards, Advances in Engineering Software, 34, pp. 339–349
Åhren, T. and Kumar, U. (2004), Use of maintenance performance indicators: a case study at Banverket, Conference proceedings of the 5th Asia-Pacific Industrial Engineering and Management Systems Conference (APIEMS2004), Gold Coast, Australia
Altmannshoffer, R. (2006), Industrielles FM, Der Facility Manager (in German), April Issue, pp. 12–13
Al-Turki, U. and Duffuaa, S. (2003), Performance measures for academic departments, International Journal of Educational Management, Vol. 17, No. 7, pp. 330–338
Andersen, B. and Fagerhaug, T. (2002), Eight steps to a new performance measurement system, Quality Progress, 35, 2, pp. 1125
Campbell, J.D. (1995), Uptime: Strategies for Excellence in Maintenance Management, Portland, OR, Productivity Press
Chandler, A.D. (1977), The Visible Hand: the Managerial Revolution in American Business, Boston, MA, Harvard University Press, pp. 417
Cross, M. (1988), Raising the value of maintenance in the corporate environment, Management Research News, Vol. 11, No. 3, pp. 8–11
DOE-HDBK-1148-2002 (2002), Work Smart Standard (WSS) Users’ Handbook, Department of Energy, USA, www.eh.doe.govt/tecgstds/standard/hdbk1148/hdbk11482002.pdf
Fitzgerald, L., Johnson, R., Brignall, S., Silvestro, R. and Voss, C. (1991), Performance Measurement in Service Businesses, London, CIMA
IAEA, International Atomic Energy Agency (2000), A Framework for the Establishment of Plant specific Operational Safety Performance Indicators, Report, Austria
Kaplan, R.S. and Norton, D.P. (1992), The balanced scorecard: measures that drive performance, Harvard Business Review, January–February, pp. 71–79
Keegan, D., Eiler, R. and Jones, C. (1989), Are your performance measures obsolete? Management Accounting, June, pp. 45–50
Kennerly, M. and Neely, A. (2003), Measuring performance in a changing business environment, International Journal of Operation and Production Management, Vol. 23, No. 2, pp. 213–229
Kumar, U. and Ellingsen, H.P. (2000), Development and implementation of maintenance performance indicators for the Norwegian oil and gas industry, Conference proceedings of 15th European Maintenance Conference (Euro Maintenance 2000), Gothenburg, Sweden
Lingle, J.H. and Schiemann, W.A. (1996), From balanced scorecard to strategy gauge: is measurement worth it? Management Review, March, pp. 56–62
Liyanage, J.P. and Kumar, U. (2003), Towards a value-based view on operations and maintenance performance management, Journal of Quality in Maintenance Engineering, Vol. 9, pp. 333–350
Lynch, R.L. and Cross, K.F. (1991), Measure up!: the Essential Guide to Measuring Business Performance, London, Mandarin
Medori, D. and Steeple, D. (2000), A framework for auditing and enhancing performance measurement systems, International Journal of Operation & Production Management, Vol. 20, No. 5, pp. 520–533
Meyer, M.W. and Gupta, V. (1994), The performance paradox, in Straw, B.M. and Cummings, L.L. (Eds), Research in Organizational Behavior, Vol. 16, Greenwich, CT, JAI Press, pp. 309–369
Miles, M.B. and Huberman, A.M. (1994), Qualitative Data Analysis, Sage Publication, California, USA
Murthy, D.N.P., Atrens, A. and Eccleston, J.A. (2002), Strategic maintenance management, Journal of Quality in Maintenance Engineering, Vol. 8, No. 4, pp. 287–305
Neely, A.D. (1999), The performance measurement revolution: why now and where next, International Journal of Operation and Production Management, Vol. 19, No. 2, pp. 205–228
Neely, A., Adams, C. and Keenerly, M. (2002), The Performance Prism, Prentice Hall, Financial Times, Harlow, UK
Parida, A., Chattopadhyay, G. and Kumar, U. (2005), Multi criteria maintenance performance measurement: a conceptual model, in Proceedings of the 18th International Congress of COMADEM, 31st Aug–2nd Sep 2005, Cranfield, UK, pp. 349–356
Parida, A. and Kumar, U. (2006), Maintenance performance measurement (MPM): issues and challenges, Journal of Quality in Maintenance Engineering, Vol. 12, No. 3, pp. 1355–2511
Tsang, A.H.C. (1998), A strategic approach to managing maintenance performance, Journal of Quality in Maintenance Engineering, Vol. 4, No. 2, pp. 87–94
Wealleans, D. (2000), Organizational Measurement Manual, Abingdon, Oxon, GBR, Ashgate Publishing Limited
Wireman, T. (1998), Developing Performance Indicators for Managing Maintenance, New York, Industrial Press, Inc.
Wongrassamee, S., Gardiner, P.D. and Simmons, J.E.L. (2003), Performance measurement tools: the balanced scorecard and the EFQM Excellence Model, Measuring Business Performance, Vol. 7, pp. 14–29

20 Forecasting for Inventory Management of Service Parts

John E. Boylan and Aris A. Syntetos

20.1 Introduction

Service parts are ubiquitous in modern societies. Their need arises whenever a component fails or requires replacement. In some sectors, such as the aerospace and automotive industries, a very wide range of service parts is held in stock, with significant implications for availability and inventory holding. Their management is therefore an important task.

A distinction should be drawn between preventive maintenance and corrective maintenance. Demand arising from preventive maintenance is scheduled and is deterministic, at least in principle. Demand arising from corrective maintenance, after a failure has occurred, is stochastic and requires forecasting.

Fortuin and Martin (1999) categorise the contexts for service logistics as follows:

• Technical systems under client control (e.g. machines in production departments, transport vehicles in a warehouse)
• Technical systems sold to customers (e.g. telephone exchange systems, medical systems in hospitals)
• End products used by customers (e.g. TV sets, personal computers, motor cars)

In the first context, there is usually a specialist department within the client organization performing maintenance activities and managing service parts inventories. In the second context, a specialist department within the vendor organization will generally undertake these tasks. In both cases, a large amount of information is known by the vendor, or can be shared with the vendor. This information may include scheduled (preventive) maintenance activities, times between failures, usage rates and condition of equipment.

When a wealth of data is available, it is possible to identify explanatory variables which may be used to predict the demand of service parts. For example, Ghobbar and Friend (2002) showed that the average demand interval for aircraft spare parts depends on the aircraft utilization rate, the component overhaul life and the type of primary maintenance process. In a further study, Ghobbar and Friend (2003) showed how forecast accuracy depends on various characteristics of the demand process, including the seasonal period length, as well as the primary maintenance process. Hua et al. (2006) used two zero-one explanatory variables, plant overhaul and equipment overhaul, to help predict demand of spare parts in the petrochemical industry. In other cases, explanatory variables have been used to predict part of the demand for a stock keeping unit (SKU). For example, Kalchschmidt et al. (2006) identified clusters of customers whose sales were correlated with promotional activities and clusters of customers that were unaffected, using appropriate forecasting methods for each group.

In the third context, parts are used by consumers and much less information is available. Fortuin and Martin (1999, p 957) commented, “Clients are anonymous, their usage of consumer products and their ‘maintenance concept’ are not known”. Most demand arises from purely corrective maintenance (e.g. on TV sets, personal computers) required in the case of a defect. Even when preventive maintenance occurs (e.g. on motor cars), prediction is complicated by the ‘maintenance concept’ of consumers being unknown. For example, customers may not bring in their cars at the correct time for a service, or may not bring them in at all. In many practical situations where end products are used by consumers, the vendor must gauge demand for service parts from the demand history alone. Such demand patterns are often sporadic, with occasional ‘spikes’ of demand. Alternatively, demand for an SKU may be decomposed into regular and irregular components (Kalchschmidt et al. 2006). In both cases, sporadic demand for service parts poses a considerable challenge to those responsible for managing inventories. It is this challenge that will be addressed in this chapter.

The remainder of the chapter is structured as follows. In the next section we address issues pertinent to the classification of service parts for forecasting and inventory management related purposes. Parametric and non-parametric approaches to forecasting service parts requirements are then discussed in Sections 20.3 and 20.4 respectively. In Section 20.5, we present various metrics appropriate for measuring the performance of the inventory management system, whereas in Section 20.6 we review the limited number of studies that provide empirical evidence on: i) the performance of forecasting methods for service parts and ii) the empirical fit of statistical distributions to the corresponding underlying demand patterns. Finally, the conclusions of our work are summarized in Section 20.7.

20.2 Classification of Service Parts

Service parts for consumer products are highly varied, with differing costs, service requirements and demand patterns. Classification of stock keeping units (SKUs) is widely adopted by organizations, but the method of classification varies widely. This is to be expected, as classification serves a number of different purposes.

20.2.1 Service Requirement

The first aim of classification is to determine service requirements. It is common for organizations to segment their service parts, assigning higher service-level targets to some segments than others. A direct approach is to classify according to a part’s service criticality. The ‘criticality’ may be determined informally or by formal methods such as failure mode, effects and criticality analysis (FMECA). According to this method, criticality analysis is defined as “A procedure by which each potential failure mode is ranked according to the combined influence of severity and probability of occurrence” (Department of Defense 1980, p 3). This approach is likely to be more appropriate for those situations where technical systems are being managed by the vendor or are under client control. However, it may also be applicable to service parts for consumers. For example, safety-critical automotive components, such as brakes, may be assigned to a higher criticality category than automotive accessories, such as furry dice.

Alternatively, an ABC (Pareto) classification can be used to determine service requirements. A Pareto report lists all SKUs in descending order, by total volume or total value of sales. An ABC analysis by value is often used as a proxy for criticality, with the A items being assumed to be the most critical and requiring the highest service levels.

Some authors also argue that the sophistication of the replenishment method should reflect the ABC classification: “For a true C item the low total of replenishment, carrying and shortage costs implies that, regardless of the type of control system used, we cannot achieve a sizable absolute savings in these costs. Therefore, the guiding principle should be to use simple procedures that keep the control costs per SKU quite low…” (Silver et al. 1998, p 359). For a single SKU, this is undoubtedly correct. However, for many hundreds or thousands of SKUs, the argument has less force. The additional savings accruing from more sophisticated forecasting and stock control methods potentially outweigh any additional investment or system running costs.

A further disadvantage of the ABC classification is that it is not obvious in what different ways the categories should be treated. (This basic requirement of inventory classification schemes was first discussed by Williams 1984.) In particular, the choice of forecasting methods for slower demand items depends on the degree of intermittence and the variability of demand, neither of which is fully captured by a Pareto analysis by volume or value. Therefore, if a Pareto classification is adopted, it may be advantageous to supplement it with further categorizations. For example, classification based on value of sales is often used to determine the frequency of orders to be placed, whereas classification of demand characteristics (to be examined later in this section) is a more effective way to determine the order levels and the safety stocks.

Categorization of service parts by cost is common practice. In some organizations, a two-way classification by cost and volume is employed. This allows greater flexibility in adjusting service targets, by category, in order to achieve overall targets at minimum cost. It is a slightly more sophisticated variation of the ABC approach, again requiring supplementary classifications for forecasting purposes.
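The Pareto report and ABC analysis by value described in this section can be sketched in a few lines of Python. This is an illustration only: the 80% and 95% cumulative-value cut-offs, the function name and the SKU data are assumptions, since the chapter does not prescribe class boundaries.

```python
def abc_classify(value_by_sku, a_cutoff=0.80, b_cutoff=0.95):
    """Rank SKUs by sales value (descending) and assign A, B or C.

    An SKU is classed by the cumulative share of value held by the SKUs
    ranked above it, so the item that crosses a cut-off stays in the
    higher class. The cut-offs are illustrative assumptions.
    """
    ranked = sorted(value_by_sku.items(), key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ranked)
    classes = {}
    cumulative = 0
    for sku, value in ranked:
        share_before = cumulative / total
        if share_before < a_cutoff:
            classes[sku] = "A"
        elif share_before < b_cutoff:
            classes[sku] = "B"
        else:
            classes[sku] = "C"
        cumulative += value
    return classes


# Hypothetical annual sales value per service part.
annual_value = {"brake-pad": 5000, "filter": 3000, "belt": 1000, "gasket": 600, "clip": 400}
classes = abc_classify(annual_value)
for sku, cls in classes.items():
    print(sku, cls)
```

As the text notes, such a ranking says nothing about intermittence or variability of demand, so it would normally be supplemented by a classification of demand characteristics before forecasting methods are chosen.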

20.2.2 Inventory Decision

A product life cycle approach is often used in marketing, with three phases of growth, maturity and decline. A similar classification may be adopted for stock control, with the phases aligned directly to the decisions required for the inventory management of service parts. Fortuin (1980) suggested three phases: initial, normal and final. In the initial phase, when the part is introduced, there are two decisions: i) should the item be stocked and ii) if so, what are the initial stock requirements? In the normal phase, an inventory policy must be determined and the parameters estimated. If an order-up-to (OUT) policy is adopted, for example, then the order-up-to level must be calculated. As the part nears the end of its life, suppliers may become reluctant to manufacture small volumes, as required by clients, particularly if the part has high manufacturing set-up costs. In this final phase, a decision must be taken on the size of a single order to cover all remaining demand (sometimes known as an ‘all time buy’). Teunter (1998) analysed this problem from a theoretical perspective, while Teunter and Fortuin (1998) reported a case-study of a company facing such a decision.

20.2.3 Forecasting Approach

Forecasting approaches may be broadly divided into two categories:

• Dependent on explanatory variables (causal methods)
• Dependent only on the history of demand (time-series methods)

A classification of service parts according to the product life cycle can assist in choosing the better approach. As discussed in the first section of this chapter, the choice of forecasting approach is mainly determined by the availability of data on explanatory variables, such as the timing of preventive maintenance activities. However, the forecasting approach is also driven by the availability of demand history data which, in turn, is determined by the stage of the service part’s life cycle. Causal methods are particularly useful in the initial phase, when the part is introduced, since the lack of an adequate length of demand history precludes the use of extrapolative time-series methods. Models linking sales to promotional expenditure, for example, can be applied. In the normal phase, which is the focus of this chapter, causal methods are used when maintenance activities are under the control of the vendor or the client (if the client is not an end-consumer). For consumer clients, historical data for the explanatory variables are usually not available, and time-series methods are used to forecast service parts’ requirements. In the final phase, when an ‘all time buy’ from a supplier is required, extrapolative methods can be applied. For example, a regression model on the logarithm of sales against time may be used, assuming an exponential decline in demand over time.

20.2.4 Forecasting Method

Faster moving service parts are commonly forecast using time-series methods. The specific method that should be employed depends on the characteristics of the demand pattern. For non-intermittent demand, exponential smoothing methods are often used, with appropriate variants for trended, damped trended and seasonal data. For intermittent demand, with some periods showing no demand at all, different methods are needed. Demand is said to be ‘intermittent’ if it is “infrequent in the sense that the average time between consecutive transactions is considerably larger than the unit time period, the latter being the interval of forecast updating” (Silver et al. 1998, p 127). An item with ‘erratic demand’ is “one having primarily small demand transactions with occasional very large transactions” (Silver 1970, p 7). Intermittent and erratic demand patterns are very common amongst service parts. If an item is both intermittent and erratic, it is said to be ‘lumpy’. The graph in Figure 20.1 shows examples of intermittent and lumpy demand patterns (based on annual demand history for two service parts used in the aerospace industry).

[Figure: demand (units, 0 to 70) plotted against time period (Jan to Dec) for two series, slow demand and lumpy demand]

Figure 20.1. Intermittent and lumpy demand patterns

A further approach to the categorization of service parts is to examine the sources of intermittence and erraticness in the demand pattern. Bartezzaghi et al. (1996) identified the following factors contributing to these demand characteristics:

1. Numerousness of potential customers
2. Frequency of customer requests
3. Heterogeneity of customers (measured by Gini’s index)
4. Variety of customers’ requests (measured by the coefficient of variation of the demand of a single customer)
5. Correlation between customers’ requests
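The two measures named in factors 3 and 4 can be computed as in the following sketch. The data are hypothetical, the Gini index is applied here to total demand per customer, and the coefficient of variation uses the population standard deviation; both choices are assumptions, as the chapter does not give the formulas.

```python
import statistics


def gini_index(values):
    """Gini index of non-negative values (0 = all equal; larger = more unequal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical total demand per customer for one SKU (heterogeneity of customers).
demand_per_customer = [0, 0, 0, 10]
# Hypothetical order sizes of a single customer (variety of requests).
single_customer_orders = [2, 4, 4, 4, 5, 5, 7, 9]

print(f"Gini index: {gini_index(demand_per_customer):.2f}")
print(f"Coefficient of variation: {coefficient_of_variation(single_customer_orders):.2f}")
```

A Gini index near zero would indicate customers of similar size, while a high coefficient of variation for one customer would indicate widely varying request sizes.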


J. Boylan and A. Syntetos

The first and second factors determine the intermittence of demand. In response to this intermittence, for those SKUs with very few customers, it may become feasible to liaise directly with them, and to enhance forecasts accordingly. The third and fourth factors determine the ‘erraticness’ of demand. As orders become more irregular, exploiting early information at the customer level becomes more attractive. Of course, such early indications are not always available. This will often be the case when addressing consumer demand. It is also possible that early confirmed orders may give a good indication of final orders. This is particularly useful when there is a strong correlation between customers’ demands. The five factors, and their effect on intermittence, erraticness and lumpiness, are summarised in Figure 20.2.

[Fig. 20.2. Categorization based on the sources of demand characteristics. Numerousness of customers and frequency of individual orders lead to intermittence; heterogeneity of customers and variety of customers’ requests lead to erraticness; intermittence, erraticness and correlation between customers’ requests lead to lumpiness.]

For those items without early indicators, forecasting must be undertaken using a purely time-series approach. This is usually linked to a demand distribution, so that inventory levels may be set to achieve high percentage service level targets. Many inventory management systems make distributional assumptions of demand according to the ABC classification. For example, A and B items may be taken to be normally distributed, whilst C items are assumed to be Poisson. In practice, however, many service parts have demand that is more erratic than Poisson (sometimes known as ‘over-dispersed’). The Poisson dispersion index (ratio of the variance to the mean of demand, including zero demands) can be used to classify SKUs as Poisson or non-Poisson. If the index is close to unity, then a Poisson distribution is indicated; if the index is greater than unity, then other distributions, such as the negative binomial, may be more appropriate, or a non-parametric approach may be required, as discussed in Section 20.4. An obvious way to classify service parts is by frequency of demand. As demand occurrence becomes more infrequent, with some periods having no demand at all, a number of difficulties emerge. From a forecasting perspective, methods such as


single exponential smoothing (SES) can no longer be recommended (Croston 1972). Second, assumptions of normality of demand become unsustainable, as demand becomes more skewed. Johnston and Boylan (1996) examined the conditions under which Croston’s method (designed for intermittent demand and reviewed in the next section of this chapter) is more accurate than SES. The authors concluded, on the basis of simulation of a wide range of conditions, that Croston’s method is more accurate (in terms of mean square error) when the average demand interval (p) exceeds 1.25 review periods. In a recent case-study, Boylan et al. (2006) showed that overall forecast accuracy is robust to the choice of break-points above 1.25 but much less so to values below 1.25. For practical applications, break-points may be determined using simulation studies. The important point is that it is preferable to identify conditions for superior forecasting performance, and then to categorise demand based on these results, rather than the other way round.

A complementary method of classification is by variability of demand size. From a forecasting perspective, Syntetos et al. (2005) re-examined the comparison between methods such as Croston (1972), based on intermittent demand, and SES. This study was based on comparison of approximate expressions for the theoretical mean square error of different methods. The authors identified two key categorization variables, namely the average demand interval (p) and the squared coefficient of variation of demand size (CV²). (Note that the latter measure ignores zero demand periods.) Comparisons between forecasting methods yield regions of superior performance. The performance of SES depends on whether it is assessed at all points in time, or only immediately after demand occurrences (which trigger stock orders in most inventory systems). The performance of Croston’s method does not depend on this timing.
Suppose that SES is compared with Croston’s method, for the forecasts made immediately after a demand occurrence; then the matrix presented in Figure 20.3 is obtained.

                        Low p (below 1.34)     High p (above 1.34)
High CV² (above 0.28)   Erratic (Croston)      Lumpy (Croston)
Low CV² (below 0.28)    Smooth (SES)           Intermittent (Croston)

Fig. 20.3. Categorisation of SKUs by forecast accuracy (break-points: p = 1.34, CV² = 0.28)
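The matrix in Figure 20.3 can be applied mechanically once p and CV² have been estimated from a demand history. The following is a minimal sketch only: the function name `classify_sku`, the simple estimate of p as the number of periods divided by the number of demand occurrences, and the treatment of values exactly at a break-point as 'low' are all illustrative assumptions rather than part of the published scheme.

```python
from statistics import mean, pvariance

P_BREAK = 1.34     # break-point on the average demand interval (Figure 20.3)
CV2_BREAK = 0.28   # break-point on the squared CV of demand sizes (Figure 20.3)


def classify_sku(demand):
    """Return (category, suggested method, p, cv2) for per-period demands.

    Hypothetical helper: p is estimated as periods / demand occurrences,
    and CV^2 is computed on non-zero demands only (zero periods ignored).
    """
    sizes = [d for d in demand if d > 0]
    if not sizes:
        raise ValueError("no demand occurrences in the history")
    p = len(demand) / len(sizes)
    cv2 = pvariance(sizes) / mean(sizes) ** 2
    # Assumption: a value exactly on a break-point is treated as 'low'.
    if p <= P_BREAK and cv2 <= CV2_BREAK:
        return "smooth", "SES", p, cv2
    if p <= P_BREAK:
        return "erratic", "Croston", p, cv2
    if cv2 <= CV2_BREAK:
        return "intermittent", "Croston", p, cv2
    return "lumpy", "Croston", p, cv2


if __name__ == "__main__":
    history = [0, 0, 5, 0, 0, 0, 40, 0, 0, 2, 0, 0]  # invented monthly demands
    print(classify_sku(history))
```

In practice the break-points themselves may need to be re-derived by simulation for the data set at hand, as noted earlier in this section.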

In summary, service parts may be in the initial, normal or final phases of their life cycle. In this chapter, we focus attention on the normal phase. Although service parts may be classified as A, B or C in a Pareto analysis, it is likely that most parts


will be categorized as C. The service requirements for the part may be guided by criticality and cost considerations, as well as the ABC classification. Further refinements are necessary to the Pareto classification in order to allocate the most appropriate forecasting methods to each SKU. Enhancement of the ABC classification in the manner described above gives a coherent approach to classification according to forecasting performance and a foundation for theoretically-informed usage of terms such as ‘erratic’, as shown in Figure 20.4 (after Syntetos 2001, adapted by Boylan et al. 2006).

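[Fig. 20.4. Categorisation of demand patterns for service parts. Classifying measures: mean inter-demand interval (high: intermittent; low: non-intermittent), mean demand size (low: slow, linked to intermittent by AND/OR) and coefficient of variation of demand sizes (high: erratic; low: non-erratic). Combined categories: lumpy (intermittent AND erratic) and clumped (intermittent AND non-erratic).]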

As in Figure 20.2, a ‘lumpy’ SKU is defined as one that is both ‘intermittent’ and ‘erratic’; definitions of ‘slow’ and ‘clumped’ are also included. Figure 20.4 offers a different perspective from Figure 20.2. The former diagram shows the measures that may be used to classify SKUs as intermittent and erratic, whereas the latter showed the factors that lead to intermittence and erraticness. An understanding of both issues is required for effective forecasting of service parts.


20.3 Parametric Forecasting

Practical parametric approaches to inventory management rely upon estimates of some essential demand distribution parameters. The decision parameters of the inventory systems (such as the re-order point or the order-up-to-level) are then based on these estimates. Different inventory systems require different variables to be forecasted. Some of the most widely cited, for example (R, s, S) policies (Naddor 1975; Ehrhardt and Mosier 1984), require only estimates of the mean and variance of demand. (In such systems, the inventory position is reviewed every R periods and if the stock level drops to the re-order point s enough is ordered to bring the stock up to the order-up-to-level S.) In other cases, and depending on the objectives or constraints imposed on the system, such estimates are also necessary, although they do not constitute the ‘key’ quantities to be determined. We may consider, for example, an (R, S) or an (s, Q) policy operating under a fill-rate constraint – known as P2 and discussed in Section 20.5. (In the former case, the inventory position is reviewed periodically, every R periods, and enough is ordered to bring it up to S. In the latter case, there is a continuous review of the inventory position and as soon as that drops to, or below, s an order is placed for a fixed quantity Q.) In those cases we wish to ensure that x% of demand is satisfied directly off-the-shelf and estimates are required for the probabilities of any demands exceeding S or s. Such probabilities are typically estimated indirectly, based on the mean demand and variance forecast in conjunction with a hypothesized demand distribution. Nevertheless, a reconstruction of the empirical distribution through a bootstrapping (non-parametric) procedure would render such forecasts redundant and this issue is further discussed in the following section.

Similar comments apply when these systems operate under a different service-driven constraint: there is no more than x% chance of a stock-out during the replenishment cycle (this service measure is known as P1). Consequently, we need to estimate the (100 – x)th percentile of the demand distribution. In summary, parametric approaches to forecasting involve estimates of the mean and variance of demand. In addition, a demand distribution also needs to be hypothesized, in the majority of stock control applications, for the purpose of estimating the quantities of interest. Issues related to the hypothesized demand distribution are addressed in the following sub-section. The estimation of the mean and variance of demand is addressed in Sections 20.3.2 and 20.3.4 respectively.

20.3.1 The Demand Distribution

Demand for service parts is most commonly intermittent in nature. The demand pattern is characterized by infrequent demands, often of variable size, occurring at irregular intervals. Consequently, as discussed in Section 20.3.2, it is preferable to model demand from constituent elements, i.e. the demand size and inter-demand interval. Therefore, compound theoretical distributions (that explicitly take into account the size-interval combination) are typically used in such contexts of application. We first discuss some issues related to modelling demand arrivals and


hence inter-demand intervals. We then extend our discussion to compound demand distributions. In a service parts demand context, two demand generation processes have dominated the literature. If time is treated as a discrete (whole number) variable, demand may be generated based on a Bernoulli process, resulting in a geometric distribution of the inter-demand intervals. When time is treated as a continuous variable, the Poisson demand generation process results in negative exponentially distributed inter-arrival intervals. There is sound theory in support of both geometric and exponential distributions for representing the time interval between successive demands. There is also empirical evidence in support of both distributions (e.g. Dunsmuir and Snyder 1989; Kwan 1991; Willemain et al. 1994; Janssen 1998; Eaves 2002). With Poisson arrivals of demands and an arbitrary distribution of demand sizes, the resulting distribution of total demand over a fixed lead time is compound Poisson. Inter-demand intervals following the geometric distribution, in conjunction with an arbitrary distribution for the sizes, result in a compound binomial distribution. Regarding the compound Poisson distributions, the stuttering Poisson, which is a combination of a Poisson distribution for demand occurrence and a geometric distribution for demand size, has received the attention of many researchers (for example: Gallagher 1969; Ward 1978; Watson 1987). Another possibility is the combination of a Poisson distribution for demand occurrence and a normal distribution for demand sizes (Vereecke and Verstraeten 1994), although the latter assumption has little empirical support. Quenouille (1949) showed that a Poisson-logarithmic process yields a negative binomial distribution (NBD). When order occasions are assumed to be Poisson distributed and the order size is not fixed but follows a logarithmic distribution, total demand is then negative binomially distributed over time.
Another possible distribution for representing demand is the gamma distribution. The gamma distribution is the continuous analogue of the NBD and “although not having a priori support (in terms of an explicit underlying mechanism such as that characterizing compound distributions), the gamma is related to a distribution which has its own theoretical justification” (Boylan 1997, p 168). The gamma covers a wide range of distribution shapes, it is defined for non-negative values only and it is generally mathematically tractable in its inventory control applications (Burgin and Wild 1967; Burgin 1975; Johnston 1980). Nevertheless if it is assumed that demand is discrete, then the gamma can be only an approximation to the distribution of demand. At this point it is important to note that the use of both NBD and gamma distributions requires estimation of the mean and variance of demand only. In addition, and as discussed in section 20.6, there is empirical evidence in support of both distributions and therefore they are recommended for practical applications. Vereecke and Verstraeten (1994) presented an algorithm developed for the implementation of a computerised stock control system for spare parts in a chemical plant. Of the items, 90% were classified as lumpy, with the remaining 10% consisting of slow or fast movers. The demand was assumed to occur as a Poisson process with a package of several pieces being requested at each demand occurrence. The parameters of the distribution of the demand size can be estimated from


the variance and the average of the demand history data of each item. The resulting distribution of demand per period was called a ‘package Poisson’ distribution. The same distribution has appeared in the literature under the name ‘hypothetical SKU’ (h-SKU) Poisson distribution (Williams 1984), where demand is treated as if it occurs as a multiple of some constant, or ‘clumped Poisson’ distribution, for multiple item orders for the same SKU of a fixed ‘clump size’ (Ritchie and Kingsman 1985) (please also refer to Figure 20.4 where a definition of ‘clumped’ demand is offered). In an earlier work, Friend (1960) also discussed the use of a Poisson distribution for demand occurrence, combined with demands of constant size. The ‘package Poisson’ distribution requires, as the Poisson distribution itself, an estimate of the mean demand only. If demand occurs as a Bernoulli process and orders follow the logarithmic-Poisson distribution (which is not the same as the Poisson-logarithmic process that yields NBD demand) then the resulting distribution of total demand per period is the log-zero-Poisson (Kwan 1991). The log-zero-Poisson is a three-parameter distribution and requires a rather complicated estimation method. Moreover, it was found by Kwan (1991) to be empirically outperformed by the NBD. Hence, the log-zero-Poisson cannot be recommended for practical applications. One other compound binomial distribution that has appeared in the literature is that involving normally distributed demand sizes (Croston 1972, 1974). However, and as discussed above, a normality assumption is unrealistic and therefore the distribution is not recommended for practical applications.

20.3.2 Estimation of Mean Demand: Size – Interval Methodology

Single exponential smoothing (SES) and simple moving averages (SMA) are often used in practice to forecast intermittent demand. Both methods have been shown to perform satisfactorily on real service parts data.
However, the ‘standard’ forecasting method for such items is considered to be Croston’s method (Croston 1972, as corrected by Rao 1973). Croston suggested treating the size of orders ($z_t$) and the intervals between them ($p_t$) as two separate series and combining their exponentially weighted moving averages (obtained using SES) to achieve a forecast of the demand per period. (Recently, some adaptations of Croston’s method have appeared in the literature that rely upon SMA rather than SES estimates and such modifications are further discussed later in this section.) In Croston’s work, both demand sizes and intervals were assumed to have constant means and variances, for modelling purposes, and demand sizes and demand intervals were assumed to be mutually independent. Demand was assumed to occur as a Bernoulli process. Consequently, the inter-demand intervals are geometrically distributed (with mean $p$). The demand sizes were assumed to follow the normal distribution (with mean $\mu$ and variance $\sigma^2$). These assumptions have been challenged in respect of their realism (see, for example, Willemain et al. 1994) and they have also been challenged in respect of their theoretical consistency with Croston’s forecasting method. The latter issue is further discussed in Section 20.3.3.


Croston’s method works in the following way: SES estimates of the average size of the demand ($\hat{z}_t$) and the average interval between demand incidences ($\hat{p}_t$) are made after demand occurs (using the same smoothing constant value, $\alpha$). If no demand occurs, the estimates remain exactly the same. The forecast of demand per period ($\hat{Y}_t$) is given by $\hat{Y}_t = \hat{z}_t / \hat{p}_t$. If demand occurs in every time period, Croston’s estimator is identical to SES. For constant lead times of length $L$, the mean lead-time demand estimate ($\hat{Y}_L$) is then obtained as follows:

$$\hat{Y}_L = L \hat{Y}_t \tag{20.1}$$
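The updating scheme just described can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the initialisation (first observed demand size, and the number of periods elapsed up to the first demand as the first interval) is an assumption, since the text does not prescribe starting values.

```python
def croston(demand, alpha=0.1):
    """Per-period forecasts z_hat / p_hat, updated only when demand occurs.

    Returns one forecast per period (made after observing that period);
    entries are None until the first demand occurrence.
    """
    z_hat = None   # smoothed demand size
    p_hat = None   # smoothed inter-demand interval
    q = 1          # periods elapsed since the last demand occurrence
    forecasts = []
    for y in demand:
        if y > 0:
            if z_hat is None:
                # Assumed initialisation: first size and first elapsed interval.
                z_hat, p_hat = float(y), float(q)
            else:
                z_hat += alpha * (y - z_hat)
                p_hat += alpha * (q - p_hat)
            q = 1
        else:
            q += 1   # no demand: estimates are left unchanged
        forecasts.append(None if z_hat is None else z_hat / p_hat)
    return forecasts


def lead_time_demand(forecast_per_period, lead_time):
    """Equation 20.1: mean lead-time demand for a constant lead time L."""
    return lead_time * forecast_per_period


if __name__ == "__main__":
    history = [3, 0, 0, 6, 0, 4]   # invented demand history
    f = croston(history, alpha=0.5)
    print(f)
    print(lead_time_demand(f[-1], lead_time=4))
```

When every period has a non-zero demand the interval estimate stays at one, so the sketch reduces to SES, in line with the remark above.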

Despite the theoretical superiority of such an estimator, only modest benefits were recorded in the literature when the method was actually applied on real data. Syntetos and Boylan (2001) showed that Croston’s estimator is biased. The bias is introduced by estimating the probability of demand occurrence from the average inter-demand interval (inversion bias). This is now explained in some more detail. We start with Croston’s assumptions:

$$E(z_t) = E(\hat{z}_t) = \mu \tag{20.2a}$$

$$E(p_t) = E(\hat{p}_t) = p \tag{20.2b}$$

According to Croston, the expected estimate of demand per period in that case would be $E(\hat{Y}_t) = E(\hat{z}_t / \hat{p}_t) = E(\hat{z}_t) / E(\hat{p}_t) = \mu / p$ (i.e. the method is unbiased). If it is assumed that estimators of demand size and demand interval are independent, then

$$E\left(\frac{\hat{z}_t}{\hat{p}_t}\right) = E(\hat{z}_t)\, E\left(\frac{1}{\hat{p}_t}\right) \tag{20.3}$$

but

$$E\left(\frac{1}{\hat{p}_t}\right) \neq \frac{1}{E(\hat{p}_t)} \tag{20.4}$$

and therefore Croston’s method is biased. It is clear that this result does not depend on Croston’s assumptions of stationarity and geometrically distributed demand intervals. More recently, Boylan and Syntetos (2003), Syntetos and Boylan (2005) and Shale et al. (2006) presented correction factors to overcome the bias associated with Croston’s approach. Some of these papers discuss: i) Croston’s applications under a Poisson demand arrival process and ii) estimation of demand sizes and intervals using an SMA (using the ratio of the former to the latter as an estimate of demand per period). The correction factors are summarized in Table 20.1 (where $k$ is the length of the moving average and $\alpha$ is the smoothing constant for SES).


Table 20.1. Bias correction factors

                 Demand generation process
Estimation       Bernoulli                                   Poisson
SES              $1 - \alpha/2$                              $1 - \alpha/(2 - \alpha)$
                 (Syntetos and Boylan 2005)                  (Shale et al. 2006)
SMA              $k/(k + 1)$                                 $(k - 1)/k$
                 (Boylan and Syntetos 2003)                  (Shale et al. 2006)

At this point it is important to note that SMA and SES are often treated as equivalent when the average age of the data in the estimates is the same (Brown, 1963). A relationship links the number of points in an arithmetic average ($k$) with the smoothing parameter of SES ($\alpha$) for stationary demand. Hence it may be used to relate the correction factors presented in Table 20.1 for each of the two demand generation processes considered. The linking equation is

$$k = \frac{2 - \alpha}{\alpha} \tag{20.5}$$
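As a quick numerical check on how Equation 20.5 relates the entries of Table 20.1, the sketch below evaluates the SES and SMA correction factors for a few smoothing constants. The function names are hypothetical and the smoothing constants are chosen arbitrarily for illustration.

```python
def ses_factor(alpha, process):
    """SES correction factors from Table 20.1."""
    if process == "bernoulli":
        return 1 - alpha / 2
    if process == "poisson":
        return 1 - alpha / (2 - alpha)
    raise ValueError("process must be 'bernoulli' or 'poisson'")


def sma_factor(k, process):
    """SMA correction factors from Table 20.1."""
    if process == "bernoulli":
        return k / (k + 1)
    if process == "poisson":
        return (k - 1) / k
    raise ValueError("process must be 'bernoulli' or 'poisson'")


def equivalent_k(alpha):
    """Equation 20.5: moving-average length matching smoothing constant alpha."""
    return (2 - alpha) / alpha


if __name__ == "__main__":
    for alpha in (0.05, 0.1, 0.2, 0.5):
        k = equivalent_k(alpha)
        for process in ("bernoulli", "poisson"):
            print(alpha, round(k, 2), process,
                  ses_factor(alpha, process), sma_factor(k, process))
```

Substituting $k = (2 - \alpha)/\alpha$ gives $k/(k+1) = 1 - \alpha/2$ and $(k-1)/k = 1 - \alpha/(2 - \alpha)$, so the two rows of the table coincide under the equal-average-age relationship.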

20.3.3 Method – Model Inconsistencies

Snyder (2002) pointed out that Croston’s model assumes stationarity of demand intervals and yet an SES estimator is used, implying a non-stationary demand process. The same comment applies to demand sizes. Snyder commented that this renders the model and method inconsistent and he proposed some alternative models, and suggested a new forecasting approach based on parametric bootstrapping. Shenstone and Hyndman (2005) developed this work by examining Snyder’s models. In their paper they commented on the wide prediction intervals that arise for non-stationary models and recommended that stationary models should be reconsidered. However, they concluded: “... the possible models underlying Croston’s and related methods must be non-stationary and defined on a continuous sample space. For Croston’s original method, the sample space for the underlying model included negative values. This is inconsistent with reality that demand is always non-negative” (Shenstone and Hyndman, 2005, pp 389–390). In summary, any potential non-stationary model assumed to be underlying Croston’s method must have properties that do not match the demand data being modeled. Obviously, this does not mean that Croston’s method and its variants are not useful. Such methods do constitute the current state of the art in intermittent demand parametric forecasting. An interesting line of further research would be to consider stationary models for intermittent demand forecasting rather than restricting attention to models implying Croston’s method. For example, Poisson autoregressive models have been suggested to be potentially useful by Shenstone and Hyndman (2005).

20.3.4 Estimation of Demand Variance

In parametric forecasting and inventory control applications, estimating the variability of the lead-time demand forecast error is as important as estimating the level of demand itself. In this section we address issues related to the estimation of the error variance but it is important to note that such an estimate may not always be required. Under the assumption of Poisson distributed demand, for example, an estimate of the mean demand only would be sufficient. In practical applications, and assuming constant lead-times, the variance of the lead-time forecast error is most often taken as the sum of the error variances of the individual forecast periods. In particular, if $L$ is the length of the lead-time (constant), and $\hat{\sigma}_t$ is the standard deviation of the one-step-ahead forecast error at time $t$, then the standard deviation of the lead-time forecast error ($\hat{\sigma}_L$) is estimated as follows:

$$\hat{\sigma}_L = \sqrt{L}\, \hat{\sigma}_t \tag{20.6}$$

In theory, the standard deviation of the one-step-ahead forecast error $\hat{\sigma}_t$ can be estimated by using either the mean squared error (MSE) or the mean absolute deviation (MAD) approach. However, the ‘smoothing’ versions of those error measures are most often used in practice to improve the responsiveness of the system (see, for example, Silver et al. 1998):

$$\hat{\sigma}_t = \sqrt{MSE_t} \tag{20.7}$$

where $MSE_t = \alpha (Y_{t-1} - \hat{Y}_{t-1})^2 + (1 - \alpha) MSE_{t-1}$, or

$$\hat{\sigma}_t \approx 1.25\, MAD_t \tag{20.8}$$

where $MAD_t = \alpha |Y_{t-1} - \hat{Y}_{t-1}| + (1 - \alpha) MAD_{t-1}$.

In the above calculations, $\alpha$ is the smoothing constant, $Y_{t-1}$ the actual demand in period $t - 1$ and $\hat{Y}_{t-1}$ the forecast of demand for period $t - 1$. If the mean demand level fluctuates over time (steady state model or autoregressive integrated moving average – ARIMA process of order (0,1,1)), it has been shown (Johnston and Harrison 1986) that Equation 20.6 neglects any correlation between the estimates of demand. This correlation exists, at least in part, because of the uncertainty in the estimate of the true underlying level of demand that is carried from one period to another.


When using SES, under the above model formulation, the standard deviation of the lead time forecast error was shown (Johnston and Harrison 1986) to be correctly calculated as follows:

$$\hat{\sigma}_L = \sqrt{L + \alpha (L - 1) L \left(1 + \alpha (2L - 1)/6\right)}\; \hat{\sigma}_t \tag{20.9}$$
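The calculations in this sub-section can be illustrated with a short sketch that updates the smoothed error measures of Equations 20.7 and 20.8 and then compares the lead-time scaling of Equation 20.6 with that of Equation 20.9. The function names and the numerical inputs are illustrative assumptions only.

```python
import math


def update_smoothed_errors(actual, forecast, mse_prev, mad_prev, alpha):
    """One smoothing step for the MSE and MAD used in Equations 20.7 and 20.8."""
    error = actual - forecast
    mse = alpha * error ** 2 + (1 - alpha) * mse_prev
    mad = alpha * abs(error) + (1 - alpha) * mad_prev
    return mse, mad


def sigma_from_mse(mse):
    """Equation 20.7."""
    return math.sqrt(mse)


def sigma_from_mad(mad):
    """Equation 20.8 (approximation)."""
    return 1.25 * mad


def sigma_lead_time_simple(sigma_t, lead_time):
    """Equation 20.6: per-period error variances summed over the lead time."""
    return math.sqrt(lead_time) * sigma_t


def sigma_lead_time_steady_state(sigma_t, lead_time, alpha):
    """Equation 20.9: SES under the steady state (ARIMA(0,1,1)) model."""
    L = lead_time
    factor = L + alpha * (L - 1) * L * (1 + alpha * (2 * L - 1) / 6)
    return math.sqrt(factor) * sigma_t


if __name__ == "__main__":
    mse, mad = update_smoothed_errors(10, 7, mse_prev=4.0, mad_prev=2.0, alpha=0.5)
    sigma_t = sigma_from_mse(mse)
    print(mse, mad, sigma_t, sigma_from_mad(mad))
    for L in (1, 2, 4, 8):
        print(L, sigma_lead_time_simple(sigma_t, L),
              sigma_lead_time_steady_state(sigma_t, L, alpha=0.2))
```

For a lead time of one period the two scalings coincide; for longer lead times the factor in Equation 20.9 grows faster than $\sqrt{L}$, which is the effect of the correlation that Equation 20.6 neglects.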

Under the stationary mean model assumption (the demand level is assumed to be constant) the forecast error correlation still exists because of the uncertainty associated with the variance of the forecasts, which is carried forward from one period to another. In addition, if a biased estimator is in place to forecast future demand requirements, the auto-correlation can be also attributed to the bias. This issue has been analytically addressed by Strijbosch et al. (2000) and Syntetos et al. (2005).

20.4 Non-parametric Forecasting

As discussed in Section 20.2, demand for service parts may often be lumpy in nature. Considering Figure 20.3, such SKUs are characterized by very infrequent demand occurrences (intermittence) coupled with highly erratic demand sizes, when demand occurs. Croston’s method and its variants (in conjunction with an appropriate distribution) have been reported to offer tangible benefits to stockists forecasting intermittent demand. (Relevant empirical evidence follows in Section 20.6.) Nevertheless, there are certainly some restrictions regarding the degree of lumpiness that may be dealt with effectively by any parametric distribution. In addition to the average inter-demand interval, the coefficient of variation of demand sizes has been shown in the literature to be very important from a forecasting perspective for demand classification purposes. However, as the data become more erratic, the true demand size distribution may not comply with any standard theoretical distribution. This challenges the effectiveness of any parametric approach. When SKUs exhibit a lumpy demand pattern such as that presented in Figure 20.1, one could argue that only non-parametric approaches may provide opportunities for further improvements in this area.

Willemain et al. (2004) developed a patented non-parametric forecasting method for intermittent demand data. Their method is not model-based but instead is a heuristic that combines a Markov process, bootstrapping and ‘jittering’ to simulate an entire distribution for lead-time demand rather than a single forecast. The method works according to the following steps:

1. Obtain historical demand data in chosen time buckets (e.g. days, weeks, months)
2. Estimate transition probabilities for two-state (zero vs. non-zero) Markov model
3. Conditional on last observed demand, use Markov model to generate a sequence of zero/non-zero values over forecast horizon
4. Replace every non-zero state marker with a numerical value sampled at random, with replacement, from the set of observed non-zero demands
5. ‘Jitter’ the non-zero demand values – this is effectively an ad hoc procedure designed to allow greater variation than that already observed. The process enables the sampling of demand size values that have not been observed in the demand history
6. Sum the forecast values over the horizon to get one predicted value of lead time demand (LTD)
7. Repeat steps 3 – 6 many times
8. Sort and use the resulting distribution of LTD values.

Willemain et al. (2004, p 381) argued that “… we need to assess the quality not of a point forecast of the mean but of a forecast of the entire distribution”, but they conceded that it is impossible to compare this on an item-specific basis. Instead, the authors recommended pooling percentile estimators across items and measuring the conformance of the observations (expressed using the corresponding percentiles) to a uniform distribution. The researchers claimed significant improvements in forecasting accuracy achieved by using their approach over single exponential smoothing and Croston’s method. (Issues related to assessing forecasting performance are further considered in the next section.) Gardner and Koehler (2005) criticized this study in terms of its methodological arrangements and experimental structure, pointing out that:

• Willemain et al. did not use the correct lead time demand distribution for either SES or Croston’s method. This was a twofold criticism consisting of arguments against the use of Equation 20.6 for estimating the lead-time demand variance (please refer to Section 20.3.4) and the use of the normal distribution for representing demand
• They did not consider published modifications to Croston’s method such as the estimator proposed by Syntetos and Boylan (2005)

Further empirical evidence is required in order to develop our understanding of the benefits offered by such a non-parametric approach. In particular, a comparison between the recently developed adaptations of Croston’s method – see Table 20.1 (in conjunction with an appropriate distribution) – with the bootstrapping approach should prove to be beneficial from both theoretical and practitioner perspectives.
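The eight steps of the bootstrapping heuristic described earlier in this section can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the function names are hypothetical, the fallback used when a Markov state has never been observed is an assumption, and the `jitter` function is an assumed stand-in for the ad hoc jittering step (a normal perturbation scaled by the square root of the sampled value, reverting to the sampled value if the result is not positive).

```python
import random


def estimate_transitions(demand):
    """Step 2: probability that the next period is non-zero, given the current state."""
    counts = {False: [0, 0], True: [0, 0]}   # state -> [to zero, to non-zero]
    for prev, nxt in zip(demand, demand[1:]):
        counts[prev > 0][int(nxt > 0)] += 1
    overall = sum(1 for d in demand if d > 0) / len(demand)
    probs = {}
    for state, (to_zero, to_nonzero) in counts.items():
        seen = to_zero + to_nonzero
        # Assumption: fall back to the overall non-zero share for unseen states.
        probs[state] = to_nonzero / seen if seen else overall
    return probs


def jitter(value, rng):
    """Step 5 (assumed stand-in): perturb a sampled size, keeping it positive."""
    jittered = round(value + rng.gauss(0.0, 1.0) * value ** 0.5)
    return jittered if jittered > 0 else value


def bootstrap_ltd(demand, horizon, n_reps=1000, seed=0):
    """Steps 3 to 8: sorted simulated lead-time demand (LTD) values."""
    sizes = [d for d in demand if d > 0]
    if not sizes:
        raise ValueError("no non-zero demands to resample")
    rng = random.Random(seed)
    p_nonzero = estimate_transitions(demand)
    results = []
    for _ in range(n_reps):
        state = demand[-1] > 0                # condition on last observed demand
        total = 0
        for _ in range(horizon):
            state = rng.random() < p_nonzero[state]
            if state:
                total += jitter(rng.choice(sizes), rng)
        results.append(total)
    return sorted(results)


if __name__ == "__main__":
    history = [0, 0, 5, 0, 0, 0, 40, 0, 0, 2, 0, 0]   # invented demand history
    ltd = bootstrap_ltd(history, horizon=3, n_reps=2000)
    print("approximate 95th percentile of LTD:", ltd[int(0.95 * len(ltd)) - 1])
```

A percentile read from the sorted values could then serve, for example, as a re-order point under a P1-type constraint, without hypothesizing a demand distribution.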

20.5 Performance Measurement

In assessing the performance of an inventory management system, there are two essential measures: stock-holding cost and service level. Stock-holding costs are relatively straightforward to interpret. They are generally calculated as a percentage of the value of inventory investment, where the percentage takes into account such factors as the cost of capital, insurance, warehousing and obsolescence costs. Often, the same percentage is applied to all service parts. However, one could


argue that slow-moving service parts should attract a higher percentage cost, since these parts are at the highest risk of obsolescence. ‘Service level’ is generally interpreted as ‘off the shelf availability’ but the way in which it is measured varies. Three common measures are defined as follows (Silver et al. 1998):

• The fraction of replenishment cycles in which total demand can be delivered from stock (known as P1). This is equivalent to a specified probability of no stock-outs during a replenishment cycle.
• The fraction of total demand that can be delivered from stock (known as P2). This is also called the ‘fill rate’.
• The fraction of time during which there is stock on the shelf (known as P3). This is sometimes used when equipment is needed for emergency purposes.

The ‘fill rate’ is probably the measure with the greatest appeal to practitioners, since it relates most directly to customer satisfaction. Care is needed with its application, since different results are obtained if it is calculated over a lead-time or over all time. If unsatisfied demand is back-ordered, Brown (1967) showed that

$$P_2 = 1 - \frac{LS}{Q}\left(1 - P_2^{LT}\right) \tag{20.10}$$

where $P_2^{LT}$ is the measure over lead-time, $P_2$ is the measure over all time, $L$ is the lead-time, $S$ is total demand in a year, $Q$ is the order-quantity and it is assumed that $LS < Q$. Ronen (1982) showed that, if unsatisfied demand is lost, then

$$P_2 = \frac{1}{\dfrac{LS}{Q}\left(1 - P_2^{LT}\right) + 1} \tag{20.11}$$
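A small sketch may help to show how much the two conventions differ. The function names and the sample figures are illustrative assumptions; the lead time is expressed in years so that $LS$ is the expected lead-time demand.

```python
def fill_rate_backorder(p2_lt, lead_time, annual_demand, order_qty):
    """Equation 20.10: all-time fill rate when unsatisfied demand is back-ordered."""
    ls = lead_time * annual_demand
    if ls >= order_qty:
        # Equation 20.10 is stated under the assumption LS < Q.
        raise ValueError("Equation 20.10 assumes LS < Q")
    return 1 - (ls / order_qty) * (1 - p2_lt)


def fill_rate_lost_sales(p2_lt, lead_time, annual_demand, order_qty):
    """Equation 20.11: all-time fill rate when unsatisfied demand is lost."""
    ls = lead_time * annual_demand
    return 1 / ((ls / order_qty) * (1 - p2_lt) + 1)


if __name__ == "__main__":
    # Invented example: 80% fill rate over the lead time, lead time of 0.1 years,
    # 1000 units demanded per year, order quantity of 400 units.
    args = dict(p2_lt=0.8, lead_time=0.1, annual_demand=1000, order_qty=400)
    print(fill_rate_backorder(**args))
    print(fill_rate_lost_sales(**args))
```

In this invented example a fill rate of 80% measured over the lead time corresponds to roughly 95% over all time under either convention, which illustrates why the basis of measurement should always be stated.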

These measures are based on the fraction of units satisfied from stock. Some organizations also use measures that relate to the successful completion of an ‘order-line’ for a number of units of the same SKU. Typically, these are based on the fraction of order-lines completely satisfied (partial satisfaction does not count). Boylan and Johnston (1994) identified relationships between such measures and fill-rates. In addition to these standard measures, other suggestions have been made. Gardner (1990) recommended the use of trade-off curves, showing the effect of inventory investment on the average delay in filling backorders. Separate curves are drawn for each forecasting method, allowing the manager to see at a glance if one method dominates the others. Sani and Kingsman (1997) proposed the use of average regret measures. The service regret is the amount each method falls short of the maximum service level over all methods for that SKU. (The ‘method’ may be a forecasting or inventory method.) The regret is then divided by the maximum service level and the ratios


are averaged across all SKUs. A cost regret measure is defined similarly. This approach allows more detailed assessments of the interaction between forecasting and inventory methods. Eaves and Kingsman (2004) suggested assessment of forecast performance according to implied stock-holdings. These are based on a calculation of the exact safety margin providing a maximum stock-out of zero. The advantage of this approach is that it gives monetary values of stock-savings. However, these savings may not be achieved in practice using a standard stock control method based on the mean and variance of lead-time forecasts. Whilst it is essential to assess stock-holding costs and service levels, it is also important to be able to diagnose the reasons for any deterioration in these measures. Boylan and Syntetos (2006) argue that, since this may arise as a result of forecasting methods or inventory rules (see Figure 20.5), the accuracy of forecasting methods should also be monitored.

[Figure 20.5. Inventory system performance measurement. Diagram labels: Forecasting method; Inventory rules; Stock management system; Stock-holding costs; Service level.]

Forecast error measures may be used to detect changes in forecast accuracy over time or to determine the relative accuracy of two (or more) forecasting methods. For faster moving service parts, a measure that serves both purposes and is easy to interpret is the mean absolute percentage error (MAPE):

$$\text{MAPE} = \frac{1}{n}\sum_{t=1}^{n} 100\,\frac{|Y_t - \hat{Y}_t|}{Y_t} \qquad (20.12)$$

where $n$ is the number of historical forecasts included in the error measure, $Y_t$ is the observation at time $t$ and $\hat{Y}_t$ is the forecast of demand at time $t$. Unfortunately, this error measure is not defined for zero observations. This rules out its application for service parts with intermittent demand. An alternative measure, suggested by Makridakis (1993) and used by Makridakis and Hibon (2000), is the Symmetric MAPE (sMAPE):

$$\text{sMAPE} = \frac{1}{n}\sum_{t=1}^{n} 100\,\frac{|Y_t - \hat{Y}_t|}{|Y_t + \hat{Y}_t|/2} \qquad (20.13)$$

Forecasting and Inventory Management for Service Parts


However, as pointed out by Syntetos (2001) and by Boylan and Syntetos (2006), the sMAPE term equals 200 for any period in which the actual demand is zero and the forecast is non-zero, regardless of the size of the error. Therefore, the sMAPE does not discriminate between forecast methods in this case. Since it does not allow a satisfactory comparison of forecast methods, it is inappropriate for intermittent demand.

A simple measure that can be used to assess the bias of forecasting methods is the mean error (ME):

$$\text{ME} = \frac{1}{n}\sum_{t=1}^{n} (Y_t - \hat{Y}_t) \qquad (20.14)$$

This measure is simple to interpret: if it is close to zero, then the forecast method is unbiased; negative values indicate that forecasts are consistently too high, while positive values show that forecasts are too low. Its use is recommended for intermittent service parts. If a forecast method has high forecast errors, but is approximately unbiased, then the positive and negative errors cancel one another out, yielding a mean error close to zero. To capture the degree of error, regardless of sign, other error measures are required. The mean absolute error (MAE) is often used for an individual SKU and is defined as follows:

$$\text{MAE} = \frac{1}{n}\sum_{t=1}^{n} |Y_t - \hat{Y}_t| \qquad (20.15)$$
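The behaviour of the measures in Equations 20.12 to 20.15 on an intermittent series can be illustrated with a short Python sketch. The demand history, the two constant forecasts and all function names below are invented for illustration and are not taken from the chapter or from any forecasting package.

```python
import math

def mean_error(actual, forecast):
    """ME (Eq. 20.14): positive when forecasts are too low on average."""
    return sum(y - f for y, f in zip(actual, forecast)) / len(actual)

def mean_absolute_error(actual, forecast):
    """MAE (Eq. 20.15): average size of the error, ignoring its sign."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """MAPE (Eq. 20.12); returns NaN when any observation is zero."""
    if any(y == 0 for y in actual):
        return math.nan  # the measure is not defined for zero observations
    return sum(100 * abs(y - f) / y for y, f in zip(actual, forecast)) / len(actual)

def smape_terms(actual, forecast):
    """Per-period terms of the sMAPE (Eq. 20.13); assumes y + f is never zero."""
    return [100 * abs(y - f) / (abs(y + f) / 2) for y, f in zip(actual, forecast)]

# Hypothetical intermittent demand history and two competing constant forecasts.
demand = [0, 0, 5, 0, 3, 0]
forecast_a = [1.0] * 6
forecast_b = [4.0] * 6

terms_a = smape_terms(demand, forecast_a)
terms_b = smape_terms(demand, forecast_b)

print("MAPE, method A:", mape(demand, forecast_a))
print("sMAPE terms, method A:", terms_a)
print("sMAPE terms, method B:", terms_b)
print("ME  A / B:", mean_error(demand, forecast_a), mean_error(demand, forecast_b))
print("MAE A / B:", mean_absolute_error(demand, forecast_a),
      mean_absolute_error(demand, forecast_b))
```

In this example both methods receive the same sMAPE term in every zero-demand period, although method B's errors in those periods are four times larger. The ME and MAE do separate the two methods.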

This error measure should not be averaged over a whole set of parts, since it may be dominated by a few SKUs with large errors. To avoid this problem, four alternatives have been suggested: the MAE: mean ratio, the geometric mean absolute error, the percentage better measure and the mean absolute scaled error. Each of these measures will be reviewed in turn.

Hoover (2006) proposed the application of the MAE: mean ratio for intermittent demand:

$$\text{MAE:Mean} = \frac{\frac{1}{n}\sum_{t=1}^{n} |Y_t - \hat{Y}_t|}{\frac{1}{n}\sum_{t=1}^{n} Y_t} = \frac{\sum_{t=1}^{n} |Y_t - \hat{Y}_t|}{\sum_{t=1}^{n} Y_t} \qquad (20.16)$$

This measure is robust to outlying data and is easy to interpret. Hyndman (2006) observed that the MAE: mean ratio assumes that the data is stable over time and that, for seasonal intermittent data, the measure may become unreliable. This problem may be overcome, for non-trended data, by calculating the measure over a full set of seasonal cycles. If the data is trended, however, then Hyndman’s criticism stands. Therefore, the MAE: mean ratio can be recommended for non-trended intermittent service parts.

A second alternative to the MAE is the geometric mean absolute error (GMAE), defined below for a single series:

$$\text{GMAE} = \left(\prod_{t=1}^{n} |Y_t - \hat{Y}_t|\right)^{1/n} \qquad (20.17)$$

This can be generalized across series by taking the geometric mean again to obtain the geometric mean (across series) of the geometric mean (across time) of the absolute errors (GMGMAE):

$$\text{GMGMAE} = \left(\prod_{i=1}^{N}\left(\prod_{t=1}^{n} |Y_{it} - \hat{Y}_{it}|\right)^{1/n}\right)^{1/N} \qquad (20.18)$$

where $Y_{it}$ is the observation for the $i$-th SKU at time $t$, $\hat{Y}_{it}$ is the forecast of demand for the $i$-th SKU at time $t$, and $N$ is the number of SKUs. An outlying observation, producing a large error by any statistical method, will affect the GMAE similarly for all methods, and so the ratio of the GMAE for one method to another will be robust to outliers. (The same robustness property applies to the GMGMAE.) This was first shown by Fildes (1992), using a general argument, and applied to intermittent data by Syntetos and Boylan (2005). In fact, these authors used a slightly more complex measure, the geometric root mean square error (GRMSE); however, Hyndman (2006) pointed out that the GRMSE and the GMAE are identical.

Although the measure is robust to outliers, it is sensitive to zero errors (Boylan and Syntetos 2006). Just one exact forecast will yield a zero error and a zero GMAE, regardless of the size of the other errors. This problem may be overcome, for stationary errors, by using the geometric mean (across series) of the arithmetic mean (across time) of the absolute errors (GMAMAE):

$$\text{GMAMAE} = \left(\prod_{i=1}^{N}\left(\frac{1}{n_i}\sum_{t=1}^{n_i} |Y_{it} - \hat{Y}_{it}|\right)\right)^{1/N} \qquad (20.19)$$

where $n_i$ is the number of forecasts for the $i$-th SKU.
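The sensitivity of the geometric measures to a single exact forecast can be checked numerically. The following Python sketch implements Equations 20.17 to 20.19 directly. The two SKU histories and the helper names are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math

def geometric_mean(values):
    """Geometric mean of a list of non-negative numbers."""
    return math.prod(values) ** (1 / len(values))

def mae(actual, forecast):
    """Arithmetic mean (across time) of the absolute errors."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)

def gmae(actual, forecast):
    """GMAE (Eq. 20.17): geometric mean (across time) of the absolute errors."""
    return geometric_mean([abs(y - f) for y, f in zip(actual, forecast)])

def gmgmae(series):
    """GMGMAE (Eq. 20.18) over a list of (actual, forecast) pairs."""
    return geometric_mean([gmae(y, f) for y, f in series])

def gmamae(series):
    """GMAMAE (Eq. 20.19) over a list of (actual, forecast) pairs."""
    return geometric_mean([mae(y, f) for y, f in series])

# Hypothetical data for two SKUs. The second forecast for SKU 1 is exact.
sku_1 = ([0, 2, 0, 4], [1.0, 2.0, 1.0, 1.0])
sku_2 = ([3, 0, 0, 1], [1.0, 1.0, 1.0, 2.0])
series = [sku_1, sku_2]

print("GMAE, SKU 1:", gmae(*sku_1))    # one zero error collapses the product
print("GMGMAE:", gmgmae(series))       # and therefore the cross-series measure
print("GMAMAE:", gmamae(series))       # the arithmetic inner mean survives
```

A single exact forecast drives both the GMAE and the GMGMAE to zero here, whereas the GMAMAE remains informative because it averages the errors arithmetically within each series before taking the geometric mean across series.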

This measure collapses to zero only if a series has zero forecast errors for all periods of time, and so is more robust to zero errors than the GMGMAE. The measure is also robust to occasional large forecast errors, provided the remaining errors are stable, and are not unduly affected by trend or seasonality. It can therefore be recommended for application in these cases.

Another approach, which is simple to use and interpret, is the percentage better method. According to this approach, for each service part, one forecast method is compared to another according to a criterion such as mean error or geometric root mean square error (Syntetos and Boylan 2005). The percentage better shows the percentage of series for which one method has the lower error. This approach is robust to large forecast errors and the results can be subjected to formal statistical tests (Syntetos 2001). It is a useful measure, although it does not quantify the degree of improvement in forecast error.

Hyndman (2006) recently suggested a new error measure for intermittent demand. This measure, known as the mean absolute scaled error (MASE), is defined as follows:

$$\text{MASE} = \operatorname{mean}(q_t) \qquad (20.20a)$$

$$q_t = \frac{|Y_t - \hat{Y}_t|}{\frac{1}{n-1}\sum_{i=2}^{n} |Y_i - Y_{i-1}|} \qquad (20.20b)$$

The errors are scaled based on the in-sample MAE from the naïve forecasting method (i.e. the forecast for the next period is this period’s observation). The measure is robust to outliers, and is valid for all non-constant series. Hyndman (2006) gave an example of the application of the MASE on intermittent data from a major Australian lubricant manufacturer. He compared the out-of-sample MASE of four methods: naïve, overall mean, single exponential smoothing and Croston’s method. The naïve method has the lowest MASE. This result is valid statistically, but is counter-intuitive from an inventory-management perspective. Boylan and Syntetos (2006) commented that the naïve method is sensitive to large demands and will generate high forecasts. Its use will almost certainly lead to over-stocking and possibly to obsolescence. This example highlights the danger of relying on statistical error measures alone. As noted earlier in this section, attention should always be paid to the stock-holding and service implications of different forecasting methods. Improvements in forecasting accuracy do not necessarily translate into improved stock-control performance. However, if stock-control performance has deteriorated, then forecast error measures can be used to diagnose problems with forecasting methods, and to suggest alternatives.
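The two scaled measures discussed in this section, the MAE: mean ratio (Equation 20.16) and the MASE (Equation 20.20), can be sketched in Python as follows. The split into an in-sample history (used only for scaling) and an out-of-sample evaluation period is an assumption made for this illustration, and the data and function names are hypothetical.

```python
def mae_mean_ratio(actual, forecast):
    """MAE: mean ratio (Eq. 20.16): sum of absolute errors over sum of demand."""
    total_demand = sum(actual)
    if total_demand == 0:
        raise ValueError("ratio is undefined when total demand is zero")
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / total_demand

def naive_in_sample_mae(history):
    """MAE of the naive forecast (next period = this period) on the history."""
    diffs = [abs(history[i] - history[i - 1]) for i in range(1, len(history))]
    return sum(diffs) / len(diffs)

def mase(history, actual, forecast):
    """MASE (Eq. 20.20): absolute errors scaled by the in-sample naive MAE."""
    scale = naive_in_sample_mae(history)
    if scale == 0:
        raise ValueError("MASE is undefined for a constant in-sample series")
    scaled = [abs(y - f) / scale for y, f in zip(actual, forecast)]
    return sum(scaled) / len(scaled)

# Hypothetical in-sample history and out-of-sample evaluation period.
history = [0, 3, 0, 0, 2, 0]
actual = [0, 4, 0]
forecast = [1.0, 1.0, 1.0]

print("MAE:mean ratio:", mae_mean_ratio(actual, forecast))
print("MASE:", mase(history, actual, forecast))
```

A MASE below one indicates that the method's errors are, on average, smaller than the in-sample one-step errors of the naive method. As the discussion above makes clear, a low value does not by itself imply good stock-control performance.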

20.6 Empirical Evidence

Empirical evidence on the performance of forecasting methods for service parts is not extensive. The same is true regarding the empirical fit of various distributions to the underlying demand patterns of such SKUs. In this section the relevant studies are reviewed.

20.6.1 Statistical Distributions

Kwan (1991) conducted research to identify the theoretical distributions that best fit the empirical distributions of demand sizes, inter-demand intervals and demand per unit time period for low demand items. Regarding inter-demand intervals, both the geometric and the negative exponential distribution were found to provide a good fit to the demand patterns observed. The geometric distribution was also found to be a reasonable approximation to the distribution of inter-demand intervals, for real demand data, by Dunsmuir and Snyder (1989) and Willemain et al. (1994). Janssen (1998) tested the Bernoulli demand generation process on a set of empirical data obtained from a Dutch wholesaler of fasteners. The results indicated that the Bernoulli demand generation process is a reasonable approximation for intermittent demand processes. Finally, Eaves (2002) examined the demand patterns associated with 6795 service parts from the Royal Air Force (UK). The findings of this detailed study provide support for both Poisson and Bernoulli processes. In particular, the geometric distribution was found to provide a statistically significant fit (5% significance level) to 91% of his sample whereas the negative exponential distribution fitted 88% of the demand histories examined.

Kwan (1991) tested the empirical fit of the log-zero-Poisson (lzP) and negative binomial (NBD), amongst other possible underlying demand distributions. The NBD was found to be the best, fitting 90% of the SKUs. Boylan (1997) tested the goodness-of-fit of four demand distributions (NBD, lzP, condensed negative binomial distribution (CNBD) and gamma distribution) on real demand data. The CNBD arises if we consider a condensed Poisson incidence distribution (‘censored’ Poisson process in which only every second event is recorded) assuming that the mean rate of demand incidence is not constant, but varies according to a gamma distribution. The empirical sample used for testing goodness-of-fit contained the six-month histories of 230 SKUs, demand being recorded weekly. The analysis showed strong support for the NBD. The results for the gamma distribution were also encouraging, although not as good, for slow moving SKUs, as the NBD.

20.6.2 Demand Estimators

Willemain et al. (1994) compared SES and Croston’s method on both theoretically generated and empirical intermittent demand data (54 series). They concluded that Croston’s method is robustly superior to exponential smoothing and can provide tangible benefits to stockists dealing with intermittent demand. A very important feature of their research, though, was the fact that industrial results showed very modest benefits as compared with the simulation results.

Sani and Kingsman (1997) compared the performance (service level and inventory costs) of various empirical and theoretically proposed stock control policies for low demand items as well as that of various forecasting methods (SMA, SES, Croston) on 30 service parts. Their results indicated: i) the very good overall performance of SMA; ii) the fact that stock control policies that have been developed in conjunction with specific distributional assumptions, such as the power approximation (explicitly built upon the assumption of a compound Poisson underlying demand pattern; please also refer to Section 20.3), perform particularly well.

Willemain et al. (2004) assessed the forecast accuracy of SES, Croston’s method (both in conjunction with a hypothesised normal distribution) and the nonparametric approach that they proposed (please refer to Section 20.4) on 28,000 service inventory items. They concluded that the bootstrap method was the most accurate forecasting method and that Croston’s method had no significant advantage over SES. As discussed in Section 20.4, some reservations have been expressed regarding the study’s methodology. Nevertheless, the bootstrapping approach is intuitively appealing for very lumpy demand items. More empirical studies are needed to substantiate its forecast accuracy in comparison with other methods.

Syntetos and Boylan (2005) conducted an empirical investigation to compare the forecast accuracy of SMA, SES, Croston’s method and a bias-corrected adaptation of Croston’s estimator (termed the Syntetos-Boylan approximation, SBA; please refer to Table 20.1). The forecast accuracy of these methods was tested, using a wide range of forecast accuracy metrics, on 3000 service parts from the automotive industry. The results demonstrated quite conclusively the superior forecasting performance of the SBA method. In a later project, Syntetos and Boylan (2006) assessed the empirical stock control implications of the same estimators on the same 3000 SKUs. The results demonstrated that the increased forecast accuracy achieved by using the SBA method (also known as the ‘Approximation Method’) translates into better stock control performance (service level achieved and stock volume differences).

A similar finding was reported in an earlier research project conducted by Eaves and Kingsman (2004). They compared the empirical stock control performance (implied stock holdings given a specified service level) of the above discussed estimators on 18750 service parts from the Royal Air Force (UK). They concluded that ‘the best forecasting method for a spare parts inventory is deemed to be the approximation method’ (Eaves and Kingsman 2004, p 436).
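For readers who wish to experiment with the estimators compared in these studies, the following Python sketch implements Croston’s method and the SBA adjustment in their commonly stated forms: demand sizes and inter-demand intervals are smoothed separately, and updated only in periods with demand; the SBA multiplies the resulting ratio by (1 − α/2). The smoothing constant, the initialisation from the first demand occurrence, and the function name are assumptions of this sketch rather than settings used in the studies cited.

```python
def croston(demand, alpha=0.1, sba=False):
    """Return the estimate of mean demand per period after the last observation.

    demand : sequence of non-negative demands, one per period
    alpha  : smoothing constant applied to both sizes and intervals
    sba    : if True, apply the Syntetos-Boylan factor (1 - alpha / 2)
    """
    size = None       # smoothed demand size
    interval = None   # smoothed inter-demand interval
    periods = 1       # periods elapsed since the last demand occurrence
    for d in demand:
        if d > 0:
            if size is None:
                # Assumed initialisation: first demand size and first interval.
                size, interval = float(d), float(periods)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods - interval)
            periods = 1
        else:
            periods += 1
    if size is None:
        return 0.0    # no demand has been observed yet
    estimate = size / interval
    return (1 - alpha / 2) * estimate if sba else estimate

# Hypothetical intermittent demand history.
history = [0, 0, 6, 0, 0, 0, 4]
print("Croston:", croston(history, alpha=0.2))
print("SBA:    ", croston(history, alpha=0.2, sba=True))
```

With this history the first demand initialises the size at 6 and the interval at 3. The second demand, four periods later, updates them to 5.6 and 3.2, so Croston’s estimate is 1.75 units per period and the SBA estimate is 10% lower.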

20.7 Conclusions

Service parts, particularly those subject to corrective maintenance, present a considerable challenge for both forecasting and inventory management. If stocking decisions are made injudiciously, then the result will be poor service or excessive stock-holdings, possibly leading to obsolescence. Conversely, effective forecasting and stock control will lead to cost savings and improved customer service.

A number of stock-control methods may be employed for slow-moving service parts. Sani and Kingsman (1997) recommended the (R, s, S) policy, based on its inventory cost and service performance in an empirical study. However, empirical evidence is not extensive, and further research is needed in this area.

Classification of service parts is an essential element in their management. Four purposes are served by classification:

• Determination of service targets
• Establishment of inventory decisions
• Choice of forecasting approach
• Choice of forecasting method

Determining service level requirements may be supported by a criticality classification, undertaken using management judgment or a more formal approach, such as an assessment of the risk and severity of a part failure. Alternatively, a Pareto classification may be used as a proxy for criticality, with the A items being deemed the most important. A variation on this approach is to use a matrix of cost and sales volume to determine service requirements. Inventory decisions relate directly to a product life cycle classification. Initial provisioning decisions must be taken during the initial phase of the life cycle. Decisions should be taken regarding stocking locations, too. In the normal phase, the inventory policy and the appropriate parameters must be determined. The inventory rules depend on forecasts of demand over lead-time, so the most appropriate forecasting method should be chosen. In the final phase, an ‘all time buy’ requires a decision on the final order quantity. The product life cycle can also be used to help determine the forecasting approach: causal or time-series. Causal methods are often used in the initial phase, because of the lack of data on demand history. In the normal phase, causal methods also have an important role, if data on explanatory variables are available. Some models have been proposed, for example, that link the sales rate with the renewal function associated with the part replacement in order to derive the demand for spares (Blischke and Murthy 1994). If such data are not available, then time-series methods are used, usually based on exponential smoothing. In the final phase, regression-based extrapolations have been recommended, assuming an exponential decline of demand. A further aim of classification, in the normal phase, is to determine the most appropriate forecasting method. 
By examining the sources of intermittence and erraticness of demand, it may be possible to identify parts with few customers and to forecast using advance information or to predict responses to promotional activity. For SKUs where this is not possible, two approaches have been proposed: bootstrapping and distribution-based. In the former case, no distributional assumptions are made and the lead-time distribution of demand is generated by re-sampling from previous observations. In the latter case, a demand distribution must be determined and its parameters estimated. Classification by the shape of the demand distribution allows the system to determine whether the Poisson, the compound Poisson or some other distribution should be used. Classification by demand frequency and by demand size variability allows the system to choose between smoothing methods such as single exponential smoothing (for non-intermittent, non-lumpy data) and methods such as Croston’s (for intermittent or lumpy data). The time-series method that should be employed for service parts depends on the characteristics of the demand pattern. For non-intermittent demand, exponential smoothing methods should be employed, with appropriate variants being chosen for trended, damped trended and seasonal data. For intermittent demand series, Croston’s method is the standard approach and has been adopted by a number of forecast packages. The performance of Croston’s method can be improved by applying an appropriate adjustment factor to reduce the bias of the forecast. This has been shown by Eaves and Kingsman (2004) and by Syntetos and Boylan (2006) to improve the inventory performance of the system.
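The re-sampling idea behind the bootstrapping approach can be illustrated with a deliberately simple Python sketch. It draws past observations independently and with replacement, sums them over the lead time, and reads a stock level from the resulting empirical distribution. This is a plain bootstrap under an independence assumption, not the specific procedure of Willemain et al. (2004); the demand history, lead time, target percentile and function names are all hypothetical.

```python
import random

def bootstrap_lead_time_demand(history, lead_time, n_samples=1000, seed=0):
    """Simulate lead-time demand totals by re-sampling past observations."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [sum(rng.choices(history, k=lead_time)) for _ in range(n_samples)]

def empirical_quantile(samples, level):
    """Smallest sampled value at or above the requested fraction of samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]

# Hypothetical intermittent demand history (units per period).
history = [0, 0, 5, 0, 3, 0, 0, 0, 7, 0, 0, 2]
lead_time = 3

samples = bootstrap_lead_time_demand(history, lead_time)
stock_level = empirical_quantile(samples, 0.95)
print("95th percentile of simulated lead-time demand:", stock_level)
```

No demand distribution is specified anywhere in the sketch; the shape of the lead-time distribution comes entirely from the observed history. Independent re-sampling ignores any autocorrelation in demand, which is one reason more elaborate schemes have been proposed.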


The performance of forecasting methods for non-intermittent demand can be assessed directly using measures such as the mean absolute percentage error (MAPE) and the mean absolute scaled error (MASE). However, the MAPE is not defined for intermittent series. For an individual series, forecast accuracy may be assessed using the mean error and the mean absolute error: mean demand ratio. The latter measure has the shortcoming of producing unreliable results for trended data, but trend is often barely perceptible in intermittent series. Alternatively, a geometric mean (across series) of the arithmetic mean (across time) of the absolute errors (GMAMAE) can be used. Statistical measures of forecast accuracy should not be used alone, however, since optimization of forecast accuracy does not necessarily lead to optimization of inventory performance. Inventory measures assessing inventory costs and service level, particularly the fill rate, should also be considered. Taken together, forecast accuracy and inventory measures provide the manager with a comprehensive overview of the system’s performance. Empirical evidence on the forecasting and inventory management of service parts is not extensive, but has grown in recent years. There is good empirical support for both compound Bernoulli and compound Poisson demand distributions. The negative binomial distribution has been found to be a good fit to many parts’ distributions, although some SKUs do not appear to be well represented by any standard statistical distribution. Simple forecasting methods often work well, with the simple moving average being a good benchmark method for service parts with intermittent demand. Croston’s method and its bias-reduced variants, including the Syntetos-Boylan approximation (SBA), should be considered. The SBA method has been shown to perform well from both forecasting and inventory management perspectives. 
In summary, there has been substantial progress in research on forecasting for inventory management of service parts over recent decades. Three challenges remain: for researchers to resolve theoretical inconsistencies and develop more powerful methods, for software manufacturers to reflect the state of the art in their packages, and for both researchers and software developers to work with practitioners to broaden the base of empirical evidence in this field.

20.8 References

Bartezzaghi E, Verganti R, Zotteri G, (1996) A framework for managing uncertain lumpy demand. Paper presented at the 9th International Symposium on Inventories, Budapest, Hungary
Blischke WR, Murthy DNP, (1994) Warranty cost analysis. Marcel Dekker, Inc., New York
Boylan JE, (1997) The centralisation of inventory and the modelling of demand. Unpublished Ph.D. thesis, University of Warwick, UK
Boylan JE, Johnston FR, (1994) Relationships between service level measures for inventory systems. Journal of the Operational Research Society 45: 838–844
Boylan JE, Syntetos AA, (2003) Intermittent demand forecasting: size-interval methods based on average and smoothing. Proceedings of the International Conference on Quantitative Methods in Industry and Commerce, Athens, Greece
Boylan JE, Syntetos AA, (2006) Accuracy and accuracy-implication metrics for intermittent demand. Foresight: the International Journal of Applied Forecasting 4: 39–42

Boylan JE, Syntetos AA, Karakostas GC, (2006) Classification for forecasting and stock-control: a case-study. Journal of the Operational Research Society: in press
Brown RG, (1963) Smoothing, forecasting and prediction of discrete time series. Prentice-Hall, Inc., Englewood Cliffs, N.J.
Brown RG, (1967) Decision rules for inventory management. Holt, Rinehart and Winston, Chicago
Burgin TA, (1975) The gamma distribution and inventory control. Operational Research Quarterly 26: 507–525
Burgin TA, Wild AR, (1967) Stock control experience and usable theory. Operational Research Quarterly 18: 35–52
Croston JD, (1972) Forecasting and stock control for intermittent demands. Operational Research Quarterly 23: 289–304
Croston JD, (1974) Stock levels for slow-moving items. Operational Research Quarterly 25: 123–130
Department of Defense USA, (1980) Procedures for performing a Failure Mode, Effects and Criticality Analysis. MIL-STD-1629A
Dunsmuir WTM, Snyder RD, (1989) Control of inventories with intermittent demand. European Journal of Operational Research 40: 16–21
Eaves AHC, (2002) Forecasting for the ordering and stock holding of consumable spare parts. Unpublished Ph.D. thesis, Lancaster University, UK
Eaves A, Kingsman BG, (2004) Forecasting for ordering and stock holding of spare parts. Journal of the Operational Research Society 55: 431–437
Ehrhardt R, Mosier C, (1984) A revision of the power approximation for computing (s, S) inventory policies. Management Science 30: 618–622
Fildes R, (1992) The evaluation of extrapolative forecasting methods. International Journal of Forecasting 8: 81–98
Fortuin L, (1980) The all-time requirements of spare parts for service after sales – theoretical analysis and practical results. International Journal of Operations and Production Management 1: 59–69
Fortuin L, Martin H, (1999) Control of service parts. International Journal of Operations and Production Management 19: 950–971
Friend JK, (1960) Stock control with random opportunities for replenishment. Operational Research Quarterly 11: 130–136
Gallagher DJ, (1969) Two periodic review inventory models with backorders and stuttering Poisson demands. AIIE Transactions 1: 164–171
Gardner ES, (1990) Evaluating forecast performance in an inventory control system. Management Science 36: 490–499
Gardner ES, Koehler AB, (2005) Correspondence: Comments on a patented bootstrapping method for forecasting intermittent demand. International Journal of Forecasting 21: 617–618
Ghobbar AA, Friend CH, (2002) Sources of intermittent demand for aircraft spare parts within airline operations. Journal of Air Transport Management 8: 221–231
Ghobbar AA, Friend CH, (2003) Evaluation of forecasting methods for intermittent parts demand in the field of aviation: a predictive model. Computers and Operations Research 30: 2097–2114
Hoover J, (2006) Measuring forecast accuracy: omissions in today’s forecasting engines and demand-planning software. Foresight: the International Journal of Applied Forecasting 4: 32–35
Hua ZS, Zhang B, Yang J, Tan DS, (2006) A new approach of forecasting intermittent demand for spare parts inventories in the process industries. Journal of the Operational Research Society: in press

Hyndman RJ, (2006) Another look at forecast-accuracy metrics for intermittent demand. Foresight: the International Journal of Applied Forecasting 4: 43–46
Janssen FBSLP, (1998) Inventory management systems; control and information issues. Published Ph.D. thesis, Centre for Economic Research, Tilburg University, The Netherlands
Johnston FR, (1980) An interactive stock control system with a strategic management role. Journal of the Operational Research Society 31: 1069–1084
Johnston FR, Boylan JE, (1996) Forecasting for items with intermittent demand. Journal of the Operational Research Society 47: 113–121
Johnston FR, Harrison PJ, (1986) The variance of lead-time demand. Journal of the Operational Research Society 37: 303–308
Kalchschmidt M, Verganti R, Zotteri G, (2006) Forecasting demand from heterogeneous customers. International Journal of Operations and Production Management 26: 619–638
Kwan HW, (1991) On the demand distributions of slow moving items. Unpublished Ph.D. thesis, Lancaster University, UK
Makridakis S, (1993) Accuracy measures: theoretical and practical concerns. International Journal of Forecasting 9: 527–529
Makridakis S, Hibon M, (2000) The M3-Competition: results, conclusions and implications. International Journal of Forecasting 16: 451–476
Naddor E, (1975) Optimal and heuristic decisions in single and multi-item inventory systems. Management Science 21: 1234–1249
Quenouille MH, (1949) A relation between the logarithmic, Poisson and negative binomial series. Biometrics 5: 162–164
Rao AV, (1973) A comment on: Forecasting and stock control for intermittent demands. Operational Research Quarterly 24: 639–640
Ritchie E, Kingsman BG, (1985) Setting stock levels for wholesaling: performance measures and conflict of objectives between supplier and stockist. European Journal of Operational Research 20: 17–24
Ronen D, (1982) Measures of product availability. Journal of Business Logistics 3: 45–58
Sani B, Kingsman BG, (1997) Selecting the best periodic inventory control and demand forecasting methods for low demand items. Journal of the Operational Research Society 48: 700–713
Shale EA, Boylan JE, Johnston FR, (2006) Forecasting for intermittent demand: the estimation of an unbiased average. Journal of the Operational Research Society 57: 588–592
Shenstone L, Hyndman RJ, (2005) Stochastic models underlying Croston’s method for intermittent demand forecasting. Journal of Forecasting 24: 389–402
Silver EA, (1970) Some ideas related to the inventory control of items having erratic demand patterns. CORS Journal 8: 87–100
Silver EA, Pyke DF, Peterson R, (1998) Inventory management and production planning and scheduling (3rd edition). John Wiley & Sons, New York
Snyder R, (2002) Forecasting sales of slow and fast moving inventories. European Journal of Operational Research 140: 684–699
Strijbosch LWG, Heuts RMJ, van der Schoot EHM, (2000) A combined forecast-inventory control procedure for spare parts. Journal of the Operational Research Society 51: 1184–1192
Syntetos AA, (2001) Forecasting of intermittent demand. Unpublished Ph.D. thesis, Buckinghamshire Chilterns University College, Brunel University, UK
Syntetos AA, Boylan JE, (2001) On the bias of intermittent demand estimates. International Journal of Production Economics 71: 457–466
Syntetos AA, Boylan JE, (2005) The accuracy of intermittent demand estimates. International Journal of Forecasting 21: 303–314

Syntetos AA, Boylan JE, Croston JD, (2005) On the categorization of demand patterns. Journal of the Operational Research Society 56: 495–503
Syntetos AA, Boylan JE, (2006) On the stock control performance of intermittent demand estimators. International Journal of Production Economics 103: 36–47
Teunter RH, (1998) Inventory control of service parts in the final phase. Published Ph.D. thesis, University of Groningen, The Netherlands
Teunter RH, Fortuin L, (1998) End-of-life-service: a case-study. European Journal of Operational Research 107: 19–34
Vereecke A, Verstraeten P, (1994) An inventory management model for an inventory consisting of lumpy items, slow movers and fast movers. International Journal of Production Economics 35: 379–389
Ward JB, (1978) Determining re-order points when demand is lumpy. Management Science 24: 623–632
Watson RB, (1987) The effects of demand-forecast fluctuations on customer service and inventory cost when demand is lumpy. Journal of the Operational Research Society 38: 75–82
Willemain TR, Smart CN, Shockor JH, DeSautels PA, (1994) Forecasting intermittent demand in manufacturing: a comparative evaluation of Croston’s method. International Journal of Forecasting 10: 529–538
Willemain TR, Smart CN, Schwarz HF, (2004) A new approach to forecasting intermittent demand for service parts inventories. International Journal of Forecasting 20: 375–387
Williams TM, (1984) Stock control with sporadic and slow-moving demand. Journal of the Operational Research Society 35: 939–948

Part F

Applications (Case Studies)

21 Maintenance in the Rail Industry Jørn Vatn

21.1 Introduction

This chapter presents two case studies of maintenance optimization in the rail industry. The first case study discusses grouping of maintenance activities into maintenance packages. The second case study uses a life cycle cost approach to prioritize between maintenance and renewal projects under budget constraints.

Grouping of maintenance activities into maintenance packages is an important issue in maintenance planning and optimization. This grouping is important both from an economic point of view, in terms of minimization of set-up costs, and with respect to obtaining administratively manageable solutions. If several maintenance activities can be specified as one work-order in the computerized maintenance management system, there are fewer work-orders to administer. The maintenance intervals are usually determined by considering the various components or activities separately, and then the activities are grouped into maintenance packages. By executing several activities at the same time, the set-up costs may be shared by several activities. However, this requires shifting the intervals for the individual activities. If we try to put too many activities into the same group, the gain with respect to set-up costs may be outweighed by the costs of changing the intervals for the individual activities. The case study we present for maintenance grouping is related to train maintenance, with a particular focus on activities related to components in the bogie.

Another problem most industries are facing is the limited resources available for maintenance and renewal, implying that optimization has to be conducted under budget constraints. Two main questions should then be addressed. The first is whether the budget constraints should be relaxed to some extent by putting more resources into maintenance and renewal when there are more good projects than resources. The second is how to prioritize, given the budget constraints. In the case study we present an approach to cost-benefit analysis of the various projects. This gives a ranked list of projects to consider for execution. The proposed method has been implemented by the Norwegian National Rail Administration (JBV), which is responsible for the Norwegian railway network.

Section 21.2 presents some general information about rail maintenance in Norway as a basis for the two case studies. The first case study in Section 21.3 discusses grouping of maintenance activities into maintenance packages. The second case study in Section 21.4 uses a life cycle cost approach to prioritize between maintenance and renewal projects under budget constraints.

21.2 Background Information About Rail Maintenance

During the past few decades there has been a dramatic change in the organization of the European railways. The European Union has been an important driving force, and legislation has been introduced to split the former state railways into one national infrastructure manager and one or more train operators (railway undertakings). The idea has been to allow many train operators to compete with each other in offering train services on the European network. Further, the maintenance of both the rolling stock (trains) and the infrastructure has to a great extent been outsourced. In the following sections we present some Norwegian case studies, and give some background information about the organization of maintenance in Norway.

The Norwegian State Railways (NSB) is the main Norwegian railway undertaking. NSB has outsourced most of the maintenance to MANTENA, a maintenance contractor. The preventive maintenance is based on activity-based contracts where NSB decides the type and amount of maintenance, whereas corrective maintenance is compensated for by a lump sum. The potential for the contractor to earn more money lies in effective grouping of the maintenance, and in improved work processes for organization and execution of maintenance. NSB has implemented reliability centred maintenance (RCM) as the basis for the preventive maintenance program. There is an objective that maintenance should be executed in natural lulls in the timetable. Major revisions of, e.g. the bogies, need longer depot stops.

JBV is the infrastructure manager of the Norwegian network. The level of outsourcing of maintenance work is relatively low. Less than 10% of the operations and corrective maintenance work is performed by external contractors, whereas for preventive maintenance contract work represents 10–20%. For renewals the percentage is almost 70, and for investment (new lines) the percentage is more than 80. JBV has also implemented RCM as a basis for the preventive maintenance program. For larger maintenance projects and all renewal projects a prioritization regime supported by life cycle cost considerations has been implemented. The Norwegian network is split into three regions, where each region is responsible for prioritization of the resources that are allocated by the central maintenance administration. For the track and overhead line, data from special measuring wagons are an important input to the models used to support prioritization between large maintenance and renewal projects such as rail grinding, level tamping, ballast cleaning and rail repair and renewal.

Most European infrastructure managers have introduced more formalized optimization models for maintenance and renewal planning. Some recent references

Maintenance in the Rail Industry


are Carretero et al. (2003), Zoeteman (2003), Veit and Wogowitsch (2003), Vatn et al. (2003), Zarembski and Palese (2003), Pedregal et al. (2004), Meier-Hirmer et al. (2005), Budai et al. (2005) and Reddy et al. (2006). Railway research related to maintenance is, however, dominated by wear modelling. Wheel-rail wear models and track degradation models are especially important because the major maintenance and renewal costs of a railway line are due to track components. Some important references are Bing and Gross (1983), Li and Selig (1995), Sato (1995), Bogdański et al. (1996), Ferreira and Murray (1997), Zhang et al. (1997), Kay (1998), Zakharov et al. (1998), Salim (2004), Telliskivi and Olofsson (2004), Grassie (2005) and Braghin et al. (2006). A complete survey of reported models is beyond the scope of this chapter.

21.3 Case Study 1

21.3.1 Grouping of Maintenance Activities

Rolling stock maintenance is characterized by the fact that the trains have to be taken out of service while they are maintained in a maintenance depot. This causes many challenges in scheduling the train services while taking the need for maintenance into account. The scheduling problem is not considered here, and we only present a rather simple model for grouping of some maintenance activities, assuming that we have access to the train whenever we want. Sriskandarajah et al. (1998) present a methodology utilizing genetic algorithms on a much more complex situation within train maintenance scheduling. In our example we only consider the following cost elements:

• Man-hour costs and material costs related to preventive maintenance of each component.
• Set-up costs to get access to the components to be maintained. By paying the set-up costs, access to several components is obtained.
• Costs of taking the train out of service. These costs are included in the set-up costs from a modelling point of view.
• Man-hour costs and material costs related to corrective maintenance. Typically set-up costs cannot be shared by other components unless preventive maintenance is advanced (opportunity maintenance).
• Costs related to the effect of a failure, i.e. punctuality, safety and material damage costs.

In classical maintenance optimization the objective is to find the optimum frequency of maintenance of one component at a time. However, in the multi-component situation there exist dependencies between the components: for example, they may share common set-up costs (economy of scope), or costs may be reduced if the contract with a maintenance contractor is large (economy of scale). These dependencies complicate the modelling compared with the single-component approach; see Dekker et al. (1997) for a survey of models used in the multi-component situation. In this chapter we only consider the situation where we can save some set-up costs by executing several maintenance activities at the same time.


We often distinguish between the static and the dynamic planning regimes. In the static regime the grouping is fixed during the entire system lifetime, whereas in the dynamic regime the groups are re-established over and over again. The static grouping situation may be easier to implement than the dynamic, and the maintenance effort is constant, or at least predictable. The advantage of the dynamic grouping is that it can respond to new information, unforeseen events, etc., which may require a new grouping and a change of plans. For an introduction to maintenance grouping we refer to Wildeman (1996), who discusses these different regimes in detail. In the example that follows we illustrate some aspects of dynamic grouping related to maintenance activities on a train bogie.

21.3.2 Modelling Framework for the Grouping of Maintenance Activities

The trains are regularly taken out of service and sent to the maintenance depot for execution of maintenance. Several subsystems are maintained at the same time, and this makes the definition of set-up costs rather complicated when we develop grouping strategies. In principle, some of the set-up costs are related to the fact that the train is sent to the depot for maintenance, whereas other parts of the set-up costs are specific to one subsystem. In the following, we will simplify and only consider costs related to the bogie, i.e. we assume one fixed set-up cost related to the bogie. We also assume that the train is available at the maintenance depot at any time. This is also a simplification, since each train follows a schedule, and can only enter the maintenance depot at some of the end stations for the different services. In order to get access to the various components in the bogie some disassembling is required before maintenance can be executed, and some reassembling is required after execution of maintenance. The costs of disassembling and reassembling are here included in the set-up cost.
In the model presented we also assume that the set-up costs are the same for all activities. It is further assumed that there is one and only one maintenance activity related to each component. This simplifies notation because we may then alternate between failure of component i and execution of maintenance activity i, since there is a unique relation between component and activity. The basic notation to be used is as follows.

Notation

$c_i^P$: Planned maintenance cost, exclusive of set-up cost. Typically the cost of replacing one unit periodically.
$c_i^U$: Unplanned costs upon a failure. These costs include the corrective maintenance costs, safety costs, punctuality costs, and costs due to material damage.
$S$: Set-up costs, i.e. the costs of preparing the preventive maintenance of a group of components maintained at the same time. We assume the same set-up costs for all activities.
$\lambda_{E,i}(x)$: Effective failure rate for component i when maintained at intervals of length x.
$M_i(x)$: $M_i(x) = x \times c_i^U \times \lambda_{E,i}(x)$ = expected costs due to failures in a period [0,x) for a component maintained at time 0, exclusive of planned maintenance cost.
$\Phi_i(x,k)$: $\Phi_i(x,k) = [c_i^P + S/k + M_i(x)]/x$ = average costs per unit time if x is the length of the interval between planned maintenance, and the set-up costs are shared by k activities in total.
$\Phi^*_{i,k}$: The minimum value of $\Phi_i(x,k)$, i.e., minimization over x.
$x^*_{i,k}$: The x-value that minimizes $\Phi_i(x,k)$.
$k_{i,\mathrm{Av}}$: Average number of components sharing the set-up costs for the i-th component, i.e. the i-th component is on average maintained together with $k_{i,\mathrm{Av}} - 1$ other components.
$\Phi^*_{i,\mathrm{Av}}$: Average minimum costs per unit time over all k-values.
$x^*_{i,\mathrm{Av}}$: Optimum value of $x_i$ over all k-values. $x^*_{i,\mathrm{Av}}$ is measured in million kilometres since last maintenance on component i.
$t_0$: Point of time when we are planning the next group of activities. Initially $t_0 = 0$. $t_0$ is measured in running (million) kilometres since t = 0.
$x_i$: Age of component i at time $t_0$, i.e., time since preventive maintenance.
$t^*_{i,\mathrm{Av}}$: $t^*_{i,\mathrm{Av}} = t_0 + x^*_{i,\mathrm{Av}} - x_i$ = optimum time in running (million) kilometres.
$K_k$: Candidate group, i.e. the set of the first k components to be maintained according to individual schedule with $t^*_{i,\mathrm{Av}}$ as the basis for due time.
$N$: Number of activities/components.
$T$: End of planning horizon, i.e. we are planning from $t_0 = 0$ to T.

The optimization problem is basically a question of balancing planned costs against unplanned costs. The planned costs are paid when the train is taken out of service for preventive maintenance, whereas the unplanned costs arise upon failures, i.e. corrective maintenance costs (repairs), costs related to accidents, delays, etc. For each component there is an expected time dependent cost which is a function of the time since the last preventive maintenance activity, i.e. Mi(x). In order to establish Mi(x) we need: (i) to establish the accumulated expected number of failures in the period [0,x), (ii) to specify the expected corrective maintenance costs for the repair of each failure, and (iii) to specify the impact of the failure on safety, punctuality, etc., and quantify these into cost figures. In the model presented here we assume that the effective failure rate, λE,i(x), may be established for the different failure characteristics and maintenance strategies (e.g. periodic replacement and condition monitoring). Next, the costs associated with a failure of component i can in principle be found by risk modelling, punctuality modelling, etc. (see Chapter 4). The result of such modelling is one figure for the expected costs, i.e. cUi. Thus, Mi(x) = x × cUi × λE,i(x). The planned costs comprise the costs of executing the maintenance on component i (ciP) and the set-up costs (S) of getting access to the component. The set-up costs may in general be shared with k–1 other activities. The average contribution to the total costs for component i per unit time is given by

$$\Phi_i(x,k) = \frac{c_i^P + S/k + M_i(x)}{x} \tag{21.1}$$
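As a worked step, the first-order condition for minimizing Equation 21.1 over x, with k held fixed, can be sketched as follows. The sketch assumes that $M_i$ is differentiable and that an interior minimum exists; neither assumption is stated explicitly in the text.

```latex
\begin{align*}
\frac{\partial \Phi_i(x,k)}{\partial x}
  &= \frac{x\,M_i'(x) - \bigl[c_i^P + S/k + M_i(x)\bigr]}{x^2} = 0 \\
\Longrightarrow\quad M_i'(x^*_{i,k})
  &= \frac{c_i^P + S/k + M_i(x^*_{i,k})}{x^*_{i,k}}
   = \Phi_i(x^*_{i,k},k) = \Phi^*_{i,k}
\end{align*}
```

In words, at the optimum interval the marginal expected failure cost per unit time equals the minimum average cost per unit time.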

If the grouping was fixed, i.e. static grouping, the optimization problem would just be to minimize ΣiΦi(x,k) for all k components maintained at the same time.


Static grouping will not be discussed, but we present an approach for dynamic grouping. Mathematically, the challenge now is to establish the grouping over either a finite or an infinite time horizon. In addition to the grouping, we also have to schedule the execution time for each group (maintenance package). The grouping and the scheduling cannot be done separately. Generally, such optimization problems are NP-hard (see Garey and Johnson (1979) for a definition), and heuristics are required. Before we propose our heuristic we present some motivating results.

Let Φ*i,k be the minimum average costs when one component is considered individually, and let x*i,k be the corresponding optimum x value. It is then easy to prove that mi(x*i,k) = M’i(x*i,k) = Φ*i,k, meaning that when the instantaneous expected unplanned costs per unit time, mi(x), exceed the average costs per unit time, maintenance should be carried out. The result is used as follows. Assume we are going to determine the first point of time to execute the maintenance, i.e. to find t = x*i,k starting at t = 0. Further, assume that we know the average costs per unit time (Φ*i,k) but that we have for some reason “lost” or “forgotten” the value of x*i,k. What we can then do is find t such that mi(t) = M’i(t) = Φ*i,k, yielding the first point of time for maintenance. From time t and for the remaining planning horizon we then pay Φ*i,k as the minimum average costs per unit time. This is the traditional marginal costs approach to the problem, and gives the same result as minimizing Equation 21.1.

The advantage of the marginal thinking is that we are now able to cope with the dynamic grouping. Assume that the time now is t0, and xi is the age (time since last maintenance) for component i in the group we are considering for the next execution of maintenance. Further, assume that the planning horizon is [t0,T).
The problem now is to determine the point of time t (≥ t0) when the next maintenance is to be executed. The total cost of executing the maintenance activities in a group is S + ΣiciP, which we pay at time t. Further, the expected unplanned costs in the period [t0, t) are ΣiMi(t–t0+xi) – ΣiMi(xi). For the remaining time of the planning horizon the total costs are (T–t)ΣiΦ*i,k, provided that each component i can be maintained at “perfect match” with k–1 activities for the rest of the period. Since Φ*i,k depends on how many components share the set-up cost, which we do not know at this time, we use some average value Φ*i,Av. We assume that we know this average value at the first planning. To determine the next point of time for maintaining a given group of components we thus minimize:

$$c_1(t;k) = S + \sum_{i \in K_k} \left[ c_i^P + M_i(t - t_0 + x_i) - M_i(x_i) + (T - t)\,\Phi^*_{i,\mathrm{Av}} \right] \tag{21.2}$$

The costs in Equation 21.2 depend on which components to include in the group of activities to be executed next. The more activities we include, the higher the costs will be. For some activities it might thus be cheaper to include them in groups to be executed later. For activities we do not include in this first group we assume that they will be maintained at their “optimum” time t*i,Av > t. The total contribution to the costs related to these activities in [t0,T) is

$$c_2(t;k) = \sum_{i \notin K_k} \left[ c_i^P + S/k_{i,\mathrm{Av}} + M_i(x^*_{i,\mathrm{Av}}) - M_i(x_i) + (T - t^*_{i,\mathrm{Av}})\,\Phi^*_{i,\mathrm{Av}} \right] \tag{21.3}$$

provided they can be maintained at “perfect match” with other activities, i.e. the set-up costs are shared with ki,Av – 1 activities, and executed at time t*i,Av. The total optimization problem related to the next group of activities is therefore to minimize:

$$c(t;k) = S + \sum_{i \in K_k} \left[ c_i^P + M_i(t - t_0 + x_i) - M_i(x_i) + (T - t)\,\Phi^*_{i,\mathrm{Av}} \right] + \sum_{i \notin K_k} \left[ c_i^P + S/k_{i,\mathrm{Av}} + M_i(x^*_{i,\mathrm{Av}}) - M_i(x_i) + (T - t^*_{i,\mathrm{Av}})\,\Phi^*_{i,\mathrm{Av}} \right] \tag{21.4}$$
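To make the structure of Equation 21.4 concrete, the following is a minimal sketch of how the objective could be evaluated for one candidate group and minimized over t by a simple grid search. The class `Activity`, the functions `total_cost` and `best_time`, the grid search itself, and the toy numbers are all illustrative assumptions and are not taken from the chapter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Activity:
    c_p: float                   # planned cost c_i^P, excluding set-up cost
    M: Callable[[float], float]  # expected failure cost M_i(x) over [0, x)
    age: float                   # x_i, age of the component at time t0
    x_opt: float                 # x*_{i,Av}, individual optimum interval
    phi_opt: float               # Phi*_{i,Av}, minimum average cost rate
    k_av: float                  # k_{i,Av}, average number sharing set-up


def total_cost(t, group, others, S, t0, T):
    """Evaluate Equation 21.4 when the candidate group is executed at time t."""
    # First sum (Equation 21.2): activities executed together at time t.
    c1 = S + sum(
        a.c_p + a.M(t - t0 + a.age) - a.M(a.age) + (T - t) * a.phi_opt
        for a in group
    )
    # Second sum (Equation 21.3): activities left at their individual optimum.
    c2 = 0.0
    for a in others:
        t_star = t0 + a.x_opt - a.age
        c2 += (a.c_p + S / a.k_av + a.M(a.x_opt) - a.M(a.age)
               + (T - t_star) * a.phi_opt)
    return c1 + c2


def best_time(group, others, S, t0, T, steps=1000):
    """Grid search for the t in [t0, T] that minimizes Equation 21.4."""
    grid = [t0 + (T - t0) * j / steps for j in range(steps + 1)]
    return min(grid, key=lambda t: total_cost(t, group, others, S, t0, T))


# Hypothetical toy data, chosen only so the arithmetic is easy to follow.
a = Activity(c_p=1.0, M=lambda x: 2.0 * x**2, age=0.0,
             x_opt=2.0, phi_opt=3.0, k_av=2.0)
b = Activity(c_p=2.0, M=lambda x: x**2, age=1.0,
             x_opt=4.0, phi_opt=1.0, k_av=2.0)
cost_at_1 = total_cost(1.0, [a], [b], S=10.0, t0=0.0, T=10.0)
t_best = best_time([a], [b], S=10.0, t0=0.0, T=10.0)
```

In the heuristic, an evaluation of this kind would be repeated for each candidate group size k, keeping the pair (t, k) with the lowest cost. A finer grid or a proper one-dimensional optimizer could replace the grid search.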

The idea is simple: we first determine the best group to execute next, and the best time to execute it. Further, we assume that subsequent activities can be executed at their local optimum. We would expect to do better by taking the second grouping into account when planning the first group, rather than treating the remaining activities individually. See, e.g. Budai et al. (2005) for more advanced heuristics in situations similar to those presented here. The heuristic is as follows.

Step 0: Initialization. Find initial estimates of ki,Av, and use these k-values as the basis for minimization of Equation 21.1. This gives initial estimates for x*i,Av and Φ*i,Av. Finally, the time horizon for the scheduling is specified, i.e., we set t0 = 0 and choose an appropriate end of the planning horizon (T).

Step 1: Prepare for defining the group of activities to execute next. First calculate t*i = x*i,Av + t0 – xi and sort in increasing order.

Step 2: Establish the candidate groups, i.e. for k = 1 to N we use the ordered t*i s to find a candidate group of size k to be executed next. If t*k > mini

Table 21.1. Snapshot of FMECA for bogie components

| # | Component | Function | Failure type | Failure effect |
|---|---|---|---|---|
| 1 | Torsions bar and lever, motor bogie | Anti roll device | Crack | Potential reduction of antitilting |
| 2 | ZF-Ecomat 5HP600 | Transmission between motor and axle gear | Wear and tear | Defect of gear |
| 3 | Flexible coupling bearing (CENTA) | Coupling between diesel engine/gear | Wear and tear | Worn out bearing -> vibrations |
| 4 | Deep groove ball bearing | Power transfer | Wear and tear | Worn out bearing |
| 5 | Aeration valve | Pressure balance | Locked | Problems with fuel oil filling |
| 6 | Torque reaction arm | Torque reaction link | Wear and tear | Fissure and damaged rubber of silent blocks |
| 7 | Diesel engine Cummins N14-R | Actuation of half train set | Wear and tear | Functional failure or lower compression of engine |
| 8 | Engine attachment (bearing NS3.59) | Engine seat | Wear and tear | Worn out bearing |
| 9 | Plant frame bearing (NS3.61) | Damping of vibrations | Wear and tear | Worn out bearing |
| 10 | Primary damper | Absorbing the vibration between axle box and bogie | Functional failure | Reduced dynamic characteristics |
| 11 | Horizontal damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics |
| 12 | Horizontal damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics |
| 13 | Vertical damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics |
| 14 | Vertical damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics |
| 15 | Longitudinal car body damper | Absorbing vibrations between car bodies | Functional failure | Reduced dynamic characteristics |
| 16 | Brake beam support bush | Fixing pin for brake beam | Wear and tear | Increased gap between pin and bush |
| 17 | Bush for brake pad link | Reduction of wear between bolts brake support | Wear and tear | Increased gap between pin and bush |
| 18 | Bush for brake unit | Reduction of wear between bolts and brake unit support | Wear and tear | Increased gap between pin and bush |
| 19 | Cylindrical roller bearing actuation side | Bearing rotor of generator | Wear and tear | Rotor of generator blocks |
| 20 | Cardan shaft | Power transmission from gear box to bogie | Wear and tear | Fracture joint bearing |

The procedure is demonstrated by analyzing components in a train bogie. A snapshot of the corresponding FMECA is presented in Table 21.1. Table 21.2 gives cost figures for the bogie components. All failure times are assumed to be Weibull distributed, where we specify the mean time to failure (MTTF, given in million kilometres) and the aging (shape) parameter α. The parameter values have been established in cooperation with NSB experts. However, some of the parameters have been deliberately modified for competitive reasons. The example is thus realistic, but no single figure should be regarded as approved by NSB. The format and quality of the data currently available within the maintenance organization of NSB do not meet the requirements for estimating aging parameters or fitting parametric distributions. The shape parameters have therefore been established from a very qualitative understanding of failure mechanisms, and the Weibull distribution has been chosen for convenience. Set-up costs are assumed to be 3000 Euros for all activities. We assume a standard age replacement model, but it is easy to adapt to more complex situations where we, for example, combine inspection and replacement upon condition rather than age (see, e.g. Podofillini et al. 2006 for an example model).

In Step 0 of the algorithm we first set ki,Av = 13 for all activities, meaning that we initially believe that on average more than half of the activities are included in each execution of a maintenance group. We then use Equation 21.1 to find x*i,Av for each activity. The result is shown in Table 21.2. The values of Φ*i,Av are not presented here. The time horizon is set to T = 15 million kilometres.

In Step 1 we calculate the optimum of each individual activity, t*i = x*i,Av + t0 – xi. In the example we have assumed that initially all xi’s are zero (a new train), and since t0 also is zero initially, we simply have t*i = x*i,Av. These values are sorted in Table 21.3 (values given in million kilometres).
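Step 0 can be illustrated with a short numerical sketch that minimizes Equation 21.1 for one component. The chapter does not state the functional form of the effective failure rate, so the code assumes a first-order Weibull approximation in which the expected number of failures in [0, x) is (x/θ)^α with scale θ = MTTF/Γ(1 + 1/α). The function names and the golden-section search are likewise illustrative choices, so the results need not reproduce Table 21.2.

```python
import math


def failure_cost(x, mttf, alpha, c_u):
    """M(x) = x * c_u * lambda_E(x) under the assumed Weibull approximation.

    ASSUMPTION (not from the chapter): lambda_E(x) is approximated by
    (Gamma(1 + 1/alpha) / mttf)**alpha * x**(alpha - 1).
    """
    scale = mttf / math.gamma(1.0 + 1.0 / alpha)  # Weibull scale parameter
    return c_u * (x / scale) ** alpha


def average_cost(x, c_p, c_u, S, k, mttf, alpha):
    """Equation 21.1: average cost per unit time for maintenance interval x."""
    return (c_p + S / k + failure_cost(x, mttf, alpha, c_u)) / x


def optimal_interval(c_p, c_u, S, k, mttf, alpha, lo=1e-6, hi=None, tol=1e-9):
    """Golden-section search for the x that minimizes Equation 21.1."""
    hi = 10.0 * mttf if hi is None else hi
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        f_c = average_cost(c, c_p, c_u, S, k, mttf, alpha)
        f_d = average_cost(d, c_p, c_u, S, k, mttf, alpha)
        if f_c < f_d:
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


# Parameters of component 3 in Table 21.2, with S = 3000 Euros and k = 13.
x_star = optimal_interval(c_p=680, c_u=6230, S=3000, k=13, mttf=1.33, alpha=3.5)
phi_star = average_cost(x_star, 680, 6230, 3000, 13, 1.33, 3.5)
```

Under this assumed failure-rate model the interval for component 3 comes out at about 0.66 million kilometres, close to the 0.67 reported in Table 21.2. Other components may deviate more, since the model used in the chapter may include correction terms that are not reproduced here.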


In Step 2 we establish candidate groups. For k = 12 we note that t*12 > t*1 + x*1,Av, which means that we only process candidate groups with k < 12. In Step 3 we calculate c(t,k), and the minimum values are shown in Table 21.3. The minimum is found for k = 10. Further, c(t,10) has its minimum for t* = 0.829 million kilometres. We observe that for those activities included in the first group, the t*i-values are rather close to 0.829 million kilometres. In Step 4 we now proceed, and set xi to 0 for those activities which are executed (i.e. i ≤ 10), whereas xi = xi + 0.829 million kilometres for i > 10. Finally we set t0 = 0.829 million kilometres before we go to Step 1 again. The next group of activities is similarly found to be executed at t* = 1.606 million kilometres. This next group comprises some activities not included in the first group, but also some activities that were executed in the first group and are now executed for the second time. We proceed until t0 > 15. When the procedure terminates, we have a total cost of 1.2 million Euros. We have also recorded the average values of ki,Av, which in this example range from 13.5 to 17, slightly higher than the initial assessment of ki,Av = 13. By repeating the entire procedure with the new values for ki,Av a small reduction in costs of 1% is obtained.

Table 21.2. Cost figures and reliability parameters

| # | CP (€) | CU (€) | MTTF (10^6 km) | Aging, α | x*i,Av (10^6 km) |
|---|---|---|---|---|---|
| 1 | 960 | 6,740 | 2.56 | 3.5 | 1.38 |
| 2 | 9,600 | 22,400 | 3.33 | 3 | 2.48 |
| 3 | 680 | 6,230 | 1.33 | 3.5 | 0.67 |
| 4 | 632 | 5,960 | 2.22 | 3.5 | 1.12 |
| 5 | 720 | 6,320 | 10.00 | 2 | 4.76 |
| 6 | 400 | 5,720 | 2.11 | 3.5 | 0.98 |
| 7 | 37,000 | 72,500 | 2.00 | 3.5 | 7.90 |
| 8 | 520 | 5,960 | 4.17 | 3.5 | 2.01 |
| 9 | 780 | 6,440 | 12.50 | 3.5 | 6.46 |
| 10 | 664 | 6,236 | 1.60 | 3.5 | 0.80 |
| 11 | 424 | 5,786 | 1.61 | 3.5 | 0.75 |
| 12 | 384 | 5,711 | 1.61 | 3.5 | 0.74 |
| 13 | 384 | 5,711 | 1.78 | 3.5 | 0.82 |
| 14 | 184 | 5,336 | 1.78 | 3.5 | 0.74 |
| 15 | 600 | 6,116 | 1.78 | 3.5 | 0.88 |
| 16 | 1,440 | 7,580 | 2.67 | 3.5 | 1.53 |
| 17 | 4,060 | 12,590 | 2.67 | 3.5 | 1.77 |
| 18 | 1,160 | 7,130 | 2.67 | 3.5 | 1.48 |
| 19 | 6,080 | 16,220 | 1.61 | 2.5 | 1.22 |
| 20 | 6,400 | 16,700 | 1.33 | 3.5 | 0.93 |


21.3.3 Opportunity Based Maintenance

The dynamic scheduling regime presented above is a good basis for opportunity based maintenance. The scheduling we have proposed may be used to set up an explicit maintenance plan for the time horizon [0, T). But even though the plan exists, we may consider changing it as new information becomes available, either in terms of new reliability parameter estimates, or if unforeseen failures occur. In operation, for any time t0 we may update the scheduling of preventive maintenance.

Table 21.3. Results for the first maintenance group

| # | Activity | t*i (10^6 km) | k | c(t*,k) (10^6 €) | t* (10^6 km) |
|---|---|---|---|---|---|
| 3 | PM | 0.674 | 1 | 1.2009 | 0.659 |
| 6 | PM | 0.740 | 2 | 1.2007 | 0.682 |
| 10 | PM | 0.742 | 3 | 1.2005 | 0.690 |
| 11 | PM | 0.751 | 4 | 1.2002 | 0.700 |
| 12 | PM | 0.805 | 5 | 1.2000 | 0.718 |
| 13 | PM | 0.819 | 6 | 1.1998 | 0.728 |
| 14 | PM | 0.879 | 7 | 1.1996 | 0.743 |
| 15 | PM | 0.932 | 8 | 1.1995 | 0.814 |
| 20 | PM | 0.979 | 9 | 1.1993 | 0.820 |
| 1 | PM | 1.120 | 10 | 1.1991 | 0.829 |
| 9 | Wait | 1.221 | 11 | 1.1993 | 0.872 |
| 18 | Wait | 1.375 | 12 | . | . |
| 2 | Wait | 1.475 | 13 | . | . |
| 16 | Wait | 1.534 | 14 | . | . |
| 5 | Wait | 1.769 | 15 | . | . |
| 17 | Wait | 2.013 | 16 | . | . |
| 7 | Wait | 2.483 | 17 | . | . |
| 8 | Wait | 4.760 | 18 | . | . |
| 19 | Wait | 6.461 | 19 | . | . |
| 4 | Wait | 7.904 | 20 | . | . |

Upon a failure requiring the set-up costs to be paid, it is rather obvious that activities that already were due if they were treated individually according to Equation 21.1 should be executed upon this opportunity. Further, activities not scheduled in the next group (maintenance package) should not be executed since they were not even included in a group to be executed later than the time of this


opportunity. The basic question is thus which of the remaining activities in the next due group should be executed at this opportunity. Let Kk be the set of k activities in this group. Assume that we have found that it is favourable to execute the first i–1 < k activities on this opportunity. The procedure to test whether or not activity i should also be executed is as follows:

• First perform a scheduling by starting at Step 1 in Section 21.3.2. Here we assume that all activities up to i are executed on this opportunity, i.e. xj = 0, j ≤ i, and xj is set to the time since activity j was executed for j > i.
• Let C1 be the minimum value of c(t,k) obtained in Step 3 plus the marginal cost, ciP, of executing activity i.
• Next, we assume that only activities up to i–1 are executed, i.e. xj = 0, j ≤ i–1, and xj is set to the time since activity j was executed for j ≥ i.
• Let C2 be the minimum value of c(t,k) obtained in Step 3 this second time.
• If C1 > C2 it is not beneficial to do activity i. If it was beneficial to do activity i at t0, we should test for i = i+1 as long as i ≤ k.

The procedure is demonstrated by the following example. We assume that a failure occurs at time t = 0.8 million km. From Table 21.3 we observe that the first 10 activities were scheduled for execution at time 0.829 million km. Since the set-up costs are already paid by the corrective activity, it is obvious that the first four activities, i.e. those with individual optimum less than t = 0.8 million km, should be done. Then we test whether activity 5 (t*5 = 0.805) should be done at this opportunity. We calculate C1 = 1.188267 million Euros and C2 = 1.188274 million Euros, hence activity 5 should be done. We proceed similarly, and find that activity 6 should also be executed. For activity 7 (t*7 = 0.879) we find that it is not cost effective to execute this activity. Since the first six activities have been executed upon this opportunity, the next planned maintenance can be postponed from the original t = 0.829 million km to t = 0.985 million km.

21.4 Case Study 2

21.4.1 Prioritization of Major Maintenance and Renewal Projects

The infrastructure manager usually has a limited budget for maintenance and renewal of the railway network. This calls for a structured approach to prioritization of possible projects. In this section we discuss a portfolio approach to larger projects, in contrast to the situation in Section 21.3 where the scheduling of periodic activities was discussed. Examples of such larger projects are:

• Ballast cleaning when the ballast is polluted and stones are crushed
• Rail grinding when the rail surface is rough
• Tamping and leveling when track geometry is degraded
• Sandblasting of bridges exposed to corrosion
• Renewal of overgrown ditches
• Point replacement of rails, e.g. in curvatures with high wear factor


The challenge is to schedule the candidate projects proposed by the local railway departments. Scheduling here means deciding which projects to include in the renewal plan for the next 10 years, and the order of executing the proposed projects. JBV requires that all candidate projects are subject to a cost-benefit analysis (CBA). For such projects we need to consider a time span of several decades; hence it is natural to calculate the net present value (NPV) as a basis for CBA. The CBA figures will only be used as input to the decision process, since there may be considerations other than the pure CBA figures that are taken into account when projects are selected.

Notation

$\rho_{C/B}$: Cost-benefit ratio, i.e. the net present value of the benefits divided by the net present value of the costs of the project.
$\{RC(t)\}$: Portfolio costs of renewals without the project.
$\{RC^*(t)\}$: Portfolio costs of renewals with the project.
$\{T^*\}$: Set of renewal times with the project.
$\{T\}$: Set of renewal times without the project.
$c(t)$: Time dependent cost as at point of time t (from now).
$c^*(t)$: Time dependent cost when a maintenance or renewal project is executed.
$d$: Factor to describe increase in time dependent cost due to degradation, i.e. the increase from one year to another is d × 100%.
LCC: Life cycle cost.
$N$: Calculation period for net present value calculations.
$r$: Discount rate.
RIF: Risk influencing factor, i.e. a factor that influences the risk level.
RLT: Residual lifetime without the project.
RLT*: Residual lifetime with the project.

21.4.2 Model Formulation Related to Case Study 2

The basic situation is that the railway infrastructure is deteriorating as a function of time and operational load. This deterioration may be transformed into cost functions, and when the costs become very large it may be beneficial to maintain or renew the infrastructure. In the following we introduce the notation c(t) for the time dependent costs as a function of time. In c(t) we include costs related to (i) punctuality loss, (ii) accidents, and (iii) extra maintenance and operation due to reduced track quality. By executing a maintenance or renewal project we typically reset the time dependent cost function c(t), either to zero, or at least a level significantly below the current value. Thus, the operating costs will be reduced in the future if we execute the maintenance or renewal project. Figure 21.1 shows the savings in operational costs, c(t)–c*(t), if we perform maintenance or renewal at time T. In addition to the savings in operational costs, we will also often achieve savings due to an increased “residual lifetime”.


Figure 21.1. Cost savings (axes: costs versus time; curves c(t) and c*(t); renewal costs incurred at time T; savings marked)

Special attention will be paid to projects that aim at extending the lifelength of a railway system. A typical example is rail grinding for lifelength extension of the rail, but the fastenings, sleepers and ballast will also benefit from the rail grinding. Figure 21.2 shows how a smart activity may suppress the increase in c(t) and thereby postpone the point of time at which the costs explode and a renewal is necessary.

From a modelling point of view the situation is rather complex because different projects are interconnected. For example, by executing a ballast cleaning project the track quality is increased, reducing the need for tamping and leveling. On the other hand, by tamping and point-wise supplement of ballast in pumping areas (surface water) we may postpone the much more expensive ballast cleaning. A third factor to take into account is the fact that for each tamping cycle there is some stone crushing, and hence we should also be reluctant to do too much tamping. Despite the fact that railways have existed for over 160 years there is a lack of documented mathematical models describing the interaction between different components in the railway, and the effect of the various maintenance activities. When developing a tool for prioritization it has therefore been necessary to base the model on model parameters specified by the maintenance planners and their experts. In the future, it is planned to improve the models based on the findings from a joint research project between Norway and Austria. In the following we describe the basic input for performing the cost-benefit analysis. The numerical calculations are supported by a computerized tool (PriFo).

21.4.2.1 Qualitative Information

The situation leading up to each proposed project is described. This is typically information from measurements and analysis of track quality, trends, etc. It is important to describe the situation qualitatively before any quantitative parameters are assessed. It is, however, a great challenge to transform the qualitative problem description into quantitative numbers. In the future this can be supported by the expected results from various research projects on deterioration models.

21.4.2.2 Safety Related Information

A general risk model has been derived where important risk influencing factors (RIFs) have been identified. The RIFs relate both to the accident frequency such as


number of cracks in the rails, but also to the accident consequences such as speed, terrain description, etc. Table 21.4 shows an example related to the derailment frequency. In the modelling, f0 corresponds to the “average” derailment frequency related to rail problems. The value of f0 is found by analysing derailment statistics in Norway, which give f0 = 3 × 10^-4 per kilometre per year.

Figure 21.2. Lifelength extension (variable cost against time; curves c(t) and c*(t), with Renewal at RLL and Renewal* at RLL*; the marked point is a smart maintenance activity, e.g., rail grinding)

The variation width (w) in Table 21.4 shows the maximum negative or positive effect of each RIF. In this model the values of the various RIFs are standardised, which means that -1 represents the “worst value” of the RIF, 0 represents the “base case”, and +1 represents the “best value” of the RIF. The interpretation of w is as follows: if one RIF equals -1, the derailment frequency is w times higher than for the base case, and if the RIF equals +1, the derailment frequency is w times lower than for the base case. Assuming that the various RIFs act independently of each other, an influence model for the derailment frequency may be written

$$f = f_0 \prod_i w_i^{-\mathrm{RIF}_i} \qquad (21.5)$$

where w_i is the variation width of RIF number i, and RIF_i is the value of RIF number i. By using Equation 21.5 with the generic weights from Table 21.4, we may assess the derailment frequency for a given railway line or section simply by assessing the values of the RIFs. In addition to the current value of the risk, the future increase also has to be described, corresponding to the two cost curves c(t) and c*(t) in Figure 21.1. For example, we might use an exponential growth of the form c(t) = f (1+d)^(t-1), where d is the degradation from one year to the next. The rationale behind an exponential growth is that the forces driving the track deterioration are often assumed proportional to the deviation from an ideal track. A simple differential equation argument then gives an exponential growth.
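The influence model of Equation 21.5 and the exponential growth assumption can be sketched in a few lines of Python. This is only an illustration, not the PriFo implementation: the dictionary keys, function names and the validation of the RIF range are assumptions made here, while the variation widths and f0 are taken from Table 21.4 and the text.

```python
from math import prod

# Variation widths w_i taken from Table 21.4; the dictionary keys are
# hypothetical short names introduced for this sketch.
VARIATION_WIDTH = {
    "failures_cracks": 4.0,
    "rail_quality": 2.0,
    "gradient": 2.0,
    "sleepers_ballast_fastening": 2.0,
    "fixed_points_narrow_filling": 1.5,
    "horizontal_geometry": 1.5,
}

F0 = 3e-4  # average derailment frequency per kilometre per year (from the text)


def derailment_frequency(rif_values, f0=F0, widths=VARIATION_WIDTH):
    """Equation 21.5: f = f0 * prod_i w_i ** (-RIF_i).

    rif_values maps a RIF name to its standardised value in [-1, +1];
    RIFs that are not listed are treated as base case (value 0).
    """
    for name, value in rif_values.items():
        if name not in widths:
            raise KeyError(f"unknown RIF: {name}")
        if not -1.0 <= value <= 1.0:
            raise ValueError(f"RIF {name} must lie in [-1, 1], got {value}")
    return f0 * prod(widths[name] ** (-value) for name, value in rif_values.items())


def cost_in_year(first_year_cost, d, t):
    """Exponential growth c(t) = c(1) * (1 + d) ** (t - 1) for t = 1, 2, ..."""
    return first_year_cost * (1.0 + d) ** (t - 1)
```

For instance, a section whose crack score is at its worst value (-1) and whose other RIFs are at base case gets a frequency of 4 × f0, in line with the interpretation of w given above.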


21.4.2.3 Punctuality Information
The basic punctuality information to be specified is the ordinary speed for the line, and any speed reductions due to the degradation the project is intended to counteract. Based on the extent of the speed restrictions it is rather easy to calculate the corresponding train delay minutes. Very often such delays cause cascading effects in a tight network. Such effects cannot be assessed unless we have a good understanding of the network capacity, and the possibilities for change of crossings, etc. In Norway, where most lines are single track lines, change of crossing may cause large disturbances in the network. It may also be possible to catch up with a delay if there is slack in the schedule.

Table 21.4. Example of effect of risk influencing factors

Risk influencing factor, RIF                    Variation width, w
Number of failures/cracks                       4
Rail quality (age, type, rail profile)          2
Gradient                                        2
Quality of sleepers, ballast and fastening      2
Number of fixed points with narrow filling      1.5
Horizontal geometry                             1.5

21.4.2.4 Maintenance and Operating Information
The degradation of the permanent way will very often require extra maintenance and operating costs. Examples of such costs are extra runs of the measurement car, extra line inspections, use of alternative transportation such as buses, shorter lifetime of influenced components, etc. These costs need to be quantified in the model. Describing the change in maintenance and operating costs is very challenging because short term and long term activities interact. It is possible to perform explicit modelling of such interactions if we have a good understanding of the physical deterioration. Welte et al. (2006) have, for example, used a Markov state model to model degradation and the effect of different inspection and renewal strategies.

21.4.2.5 Residual Lifelength
To be able to calculate the economic gain due to increased lifelength, it is required to describe the residual lifelength both if the proposed project is executed, RLL*, and if the project is not executed, RLL.

21.4.2.6 Project Costs
The project costs are specified for each year in the project period.


21.4.2.7 Cost Parameters
A set of general cost parameters is common for all projects. For JBV these are:
• The discount rate is r = 4%. Note that we here introduce the discount rate as the difference between the interest rate and the inflation rate.
• Monetary values for safety consequence classes as given in Table 21.5.
• Costs per kiloton freight delayed 1 min = 160 Euros.
• Costs per passenger delayed 1 min = 0.4 Euros. A train with 250 passengers then gives 100 Euros per minute delayed.

Table 21.5. Monetary values in Euros for each safety consequence class

Class    Safety consequence    Monetary value (€)
C1       Minor injury          2 000
C2       Medical treatment     33 000
C3       Serious injury        330 000
C4       1 fatality            1.7 million
C5       2-10 fatalities       11 million
C6       > 10 fatalities       175 million

21.4.3 LCC Calculations
A life cycle cost (LCC) perspective will be taken with respect to calculating the cost benefit ratio for the different projects. This includes a net present value analysis, taking the following aspects into consideration:
• Change in variable costs, c(t)
• The effect of extending the lifelength
• The project costs

21.4.3.1 Change in Variable Costs
The variable cost contributions from the dimensions safety, punctuality, and maintenance and operation can be treated similarly from a methodological point of view. Let c(t) denote the variable costs in year t (from now) if the project is not executed, and similarly let c*(t) be the cost if the project is run. See Figure 21.1 for an illustration. For example, for the safety dimension we have

$$\Delta LCC_S = \sum_{t=1}^{N} \left[c(t) - c^*(t)\right](1+r)^{-t} \qquad (21.6)$$

where r is the discount rate, and N is the calculation period. N is here the residual lifelength (RLL) if nothing is done. This means that we compare the situation with and without the project in the period from now till we have to do something in any


case. Similarly we obtain the change in punctuality costs, ∆LCC_P, and the change in maintenance and operational costs, ∆LCC_M&O. To calculate Equation 21.6 we may in some special situations find closed formulas. For example, if c(t) is constant, i.e. c(t) = c, the formula for the sum of a geometric series yields

$$\sum_{t=1}^{N} c\,(1+r)^{-t} = c\left[\frac{1-(1+r)^{-N}}{r}\right] \qquad (21.7)$$

Further, if c(t) in the first year is c_1 and c(t) increases by a factor (1+d) each year, we have

$$\sum_{t=1}^{N} c_1 (1+d)^{t-1}(1+r)^{-t} = c_1\left[\frac{1-\left(\frac{1+d}{1+r}\right)^{N}}{r-d}\right] \qquad (21.8)$$
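As a quick check on the closed formulas, the discounted sum of Equation 21.6 can be evaluated year by year and compared with Equations 21.7 and 21.8. The sketch below is illustrative only; the function names are assumptions, and the special case d = r (where Equation 21.8 is undefined as written) is handled by its limit, which is an addition made here.

```python
def npv_direct(costs, r):
    """Sum of costs[t-1] * (1 + r) ** (-t) for t = 1..N (the form of Equation 21.6)."""
    return sum(c * (1.0 + r) ** (-t) for t, c in enumerate(costs, start=1))


def npv_constant(c, r, n):
    """Equation 21.7: constant yearly cost c over n years."""
    return c * (1.0 - (1.0 + r) ** (-n)) / r


def npv_growing(c1, d, r, n):
    """Equation 21.8: first-year cost c1 growing by a factor (1 + d) per year."""
    if abs(r - d) < 1e-12:
        # Limit of Equation 21.8 as d -> r: every discounted term equals c1 / (1 + r).
        return c1 * n / (1.0 + r)
    return c1 * (1.0 - ((1.0 + d) / (1.0 + r)) ** n) / (r - d)
```

Calling `npv_direct([c1 * (1 + d) ** (t - 1) for t in range(1, n + 1)], r)` should agree with `npv_growing(c1, d, r, n)` up to rounding, which is a convenient test when cost profiles that are neither constant nor geometric have to be summed directly.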

21.4.3.2 The Effect of Extending the Lifelength
To motivate the calculation, Figure 21.3 sketches the need for renewal both with and without the proposed project. We now let:
{RC(t)} = Portfolio costs of renewals without the project
{RC*(t)} = Portfolio costs of renewals with the project
{T} = Set of renewal times without the project
{T*} = Set of renewal times with the project.
The cost contribution related to increased residual lifetime may now be found from

$$\Delta LCC_{RLT} = \sum_{t \in \{T\}} RC(t)\,(1+r)^{-t} - \sum_{t \in \{T^*\}} RC^*(t)\,(1+r)^{-t} \qquad (21.9)$$

21.4.3.3 The Project Costs
The LCC contribution from the project cost, LCC_I, is the net present value of the project costs in the project period. The project costs may be spread over some years, and hence we have to calculate the NPV of the project cost profile.

21.4.3.4 Total LCC Contribution
The total gain in terms of life cycle costs is

$$\Delta LCC = \Delta LCC_S + \Delta LCC_P + \Delta LCC_{M\&O} + \Delta LCC_{RLT} - LCC_I \qquad (21.10)$$

The cost benefit ratio, or more precisely the benefit cost ratio, is given by

$$\rho_{C/B} = \frac{\Delta LCC_S + \Delta LCC_P + \Delta LCC_{M\&O} + \Delta LCC_{RLT}}{LCC_I} \qquad (21.11)$$


Figure 21.3. Renewals if and if not the project is executed

21.4.4 Illustrative Example
As a calculation example we consider a rail-grinding project. Grooves and wave formations have a strong impact on the track and rolling stock due to increased dynamic loads and vibrations. This in turn shortens the life length of the rails, the sleepers, fastenings and ballast. Increased noise, increased energy consumption and lower comfort can also be expected. A 160-km section on the Rauma line in Norway has rails of age 40–50 years, and rail grinding is recommended primarily to extend the life length of the rails.

21.4.4.1 Safety Costs
The derailment frequency due to rail breakages is estimated to be 0.01 per year. For the most severe consequences we have the following distribution: P(C4) = 13.5%, P(C5) = 11% and P(C6) = 5%, where the consequence classes are explained in Table 21.5. The material damage cost given a derailment is estimated to be 1,300,000 Euros. Thus the yearly “safety cost” is found to be 0.01 × (0.135 × 1.7 + 0.11 × 11 + 0.05 × 175 + 1.3) million Euros, which is approximately 110,000 Euros. It is further expected that the rate of rail breakages leading to derailments will increase by d = 7% each year if no grinding is performed. If the grinding project is executed, the derailment probability in the first year is assumed to be reduced by 50%, and the deterioration factor is assumed to be reduced to d = 3% each year. By utilizing Equation 21.8 we have the following contribution to the safety part of the LCC (the calculation period is set to N = 5 years, which is the expected residual life of the rails if no grinding is performed):

$$\Delta LCC_S = 110\,000\left[\frac{1-\left(\frac{1+0.07}{1+0.04}\right)^{5}}{0.04-0.07}\right] - 55\,000\left[\frac{1-\left(\frac{1+0.03}{1+0.04}\right)^{5}}{0.04-0.03}\right] \approx 300\,000\ \text{Euros}$$
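The safety figures can be re-derived with a short script. The sketch below uses only numbers quoted in the text and in Table 21.5; the constant and function names are assumptions, and the discounted sum is evaluated year by year rather than through the closed form of Equation 21.8.

```python
R = 0.04     # discount rate (Section 21.4.2.7)
N_YEARS = 5  # residual life of the rails without grinding

# Monetary values in million Euros (Table 21.5) and the consequence
# probabilities quoted in the text for classes C4 to C6.
VALUE_MEUR = {"C4": 1.7, "C5": 11.0, "C6": 175.0}
PROBABILITY = {"C4": 0.135, "C5": 0.11, "C6": 0.05}
MATERIAL_DAMAGE_MEUR = 1.3
DERAILMENT_FREQUENCY = 0.01  # per year


def yearly_safety_cost_eur():
    """Expected yearly safety cost in Euros before rounding."""
    expected_meur = MATERIAL_DAMAGE_MEUR + sum(
        PROBABILITY[k] * VALUE_MEUR[k] for k in VALUE_MEUR
    )
    return DERAILMENT_FREQUENCY * expected_meur * 1e6


def discounted_growing_cost(c1, d, r=R, n=N_YEARS):
    """Year-by-year form of the sum that Equation 21.8 gives in closed form."""
    return sum(c1 * (1 + d) ** (t - 1) * (1 + r) ** (-t) for t in range(1, n + 1))


def delta_lcc_safety(c_without=110_000.0, d_without=0.07, c_with=55_000.0, d_with=0.03):
    """Safety contribution: discounted cost without the project minus cost with it."""
    return discounted_growing_cost(c_without, d_without) - discounted_growing_cost(c_with, d_with)
```

The unrounded yearly cost comes out slightly above the 110,000 Euros used in the text, and `delta_lcc_safety()` lands close to the quoted 300,000 Euros.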


21.4.4.2 Punctuality Costs
Due to a high number of cracks it is recommended to reduce the speed from 80 to 70 km/h for a section of 20 km. The speed reduction corresponds to a 2 minute increase in travelling time. Slightly more than a thousand passengers travel this line per week, thus the yearly delay time cost is in the order of 50,000 Euros. In addition, there is a freight delay time cost in the order of 60,000 Euros in the first year. An increase in the speed restriction of d = 10% per year is expected if the grinding project is not executed. If the grinding project is executed, we may relax the speed restriction, yielding a punctuality loss in the first year of 40,000 Euros, followed by a yearly increase of d = 3%. Again, utilizing Equation 21.8 we have

$$\Delta LCC_P = 110\,000\left[\frac{1-\left(\frac{1+0.10}{1+0.04}\right)^{5}}{0.04-0.10}\right] - 40\,000\left[\frac{1-\left(\frac{1+0.03}{1+0.04}\right)^{5}}{0.04-0.03}\right] \approx 400\,000\ \text{Euros}$$
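The step from a speed restriction to delay minutes and delay costs can be sketched as follows. The function names are assumptions, the passenger count is left as a parameter because the text only says "slightly more than a thousand" per week, and 52 weeks per year is assumed.

```python
def extra_travel_minutes(length_km, normal_kmh, reduced_kmh):
    """Extra running time in minutes over a speed-restricted section."""
    return 60.0 * length_km * (1.0 / reduced_kmh - 1.0 / normal_kmh)


def yearly_passenger_delay_cost(passengers_per_week, delay_min,
                                eur_per_passenger_min=0.4, weeks_per_year=52):
    """Yearly passenger delay cost in Euros (0.4 Euros per passenger-minute)."""
    return passengers_per_week * weeks_per_year * delay_min * eur_per_passenger_min


def delta_lcc_punctuality(c_without=110_000.0, d_without=0.10,
                          c_with=40_000.0, d_with=0.03, r=0.04, n=5):
    """Punctuality contribution, summed year by year (cf. Equation 21.8)."""
    def npv(c1, d):
        return sum(c1 * (1 + d) ** (t - 1) * (1 + r) ** (-t) for t in range(1, n + 1))
    return npv(c_without, d_without) - npv(c_with, d_with)
```

With 20 km at 70 km/h instead of 80 km/h, `extra_travel_minutes` gives a little over 2 minutes, consistent with the figure used in the text.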

21.4.4.3 Maintenance and Operation Costs
From different studies it is found that rail grinding every 40 megatons reduces the wear of other components (sleepers, ballast and fastenings) by an amount corresponding to 3 Euros per metre per year. This corresponds to a yearly (fixed) cost of 480,000 Euros for the 160 km section. Using Equation 21.7 with N = 5, this corresponds to an NPV of 2.1 million Euros. The reduction in critical cracks that have to be fixed is estimated to be 10 per year, and with a cost of 2500 Euros per crack this gives an NPV of 110,000 Euros. Finally, the extra yearly ultrasonic inspection accounts for 12,000 Euros per year, corresponding to an NPV of 50,000 Euros. The total extra maintenance and operation costs are therefore found to be almost 2.3 million Euros.

21.4.4.4 Extended Lifelength
With the rail grinding project it is assumed that the rails may be kept in service for another 15 years, whereas a rail renewal is expected after 5 years if the project is not run. The lifelength of new rails is approximately 40 years. The cost of new rails is in the order of 250 Euros per metre. The LCC contribution is thus the difference between changing the rails in 5 years, 45 years, 85 years, etc. and changing the rails in 15 years, 55 years, 95 years, etc. With a discount rate of r = 4%, only the first two renewals need to be counted, hence:

$$\Delta LCC_{RLT} = 250 \times 160\,000\left[1.04^{-5} + 1.04^{-45} - 1.04^{-15} - 1.04^{-55}\right] \approx 12.9\ \text{million Euros}$$

21.4.4.5 Project Costs
The cost of rail grinding is in the order of 8 Euros per metre, giving a total cost of 1.3 million Euros. In addition we have to expect a second grinding within 5–10 years, giving an additional contribution. The net present value of the grinding activity is then 2.2 million Euros.

21.4.4.6 Cost Benefit Ratio
Summing up we find the following contribution to the change in LCC (million Euros):

∆LCC_S     = 0.3
∆LCC_P     = 0.4
∆LCC_M&O   = 2.3
∆LCC_RLT   = 12.9
LCC_I      = 2.2

This yields a cost benefit ratio of ρC/B = 7.2, meaning that for each Euro put into rail grinding, the payback is about 7 Euros. By calculating the cost benefit ratio for the various maintenance and renewal projects, we get a sorted list of the most promising projects. In principle, we should execute those projects having a cost benefit ratio, ρC/B, higher than one. If the budget constraints imply that we cannot execute all projects with ρC/B higher than one, it would be necessary to have a thorough discussion related to the budget for maintenance and renewal. Since most organizations suffer from the short term cost cutting syndrome, it is a hard struggle to argue for spending more money now in order to save money in a five to ten year perspective. Even if we cannot do much about the budget situation, we may use the results from the cost-benefit analysis to prioritize between the various projects.
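The remaining contributions and the final ratio can be recomputed in the same way. The following is a minimal sketch, not the PriFo tool: function names are assumptions, the wear saving is taken as 3 Euros per metre over 160,000 m, and `rank_projects` is a hypothetical helper that simply sorts projects with a ratio above one, as the prioritization principle above suggests.

```python
R = 0.04
SECTION_M = 160_000  # 160 km section, in metres


def annuity_factor(r, n):
    """Equation 21.7 with c = 1."""
    return (1.0 - (1.0 + r) ** (-n)) / r


def delta_lcc_mo(r=R, n=5):
    """Maintenance and operation contribution in Euros."""
    wear_saving = 3.0 * SECTION_M   # Euros per year
    crack_repairs = 10 * 2_500.0    # Euros per year
    inspections = 12_000.0          # Euros per year
    return (wear_saving + crack_repairs + inspections) * annuity_factor(r, n)


def renewal_npv(first_year, life_years, cost, r=R, renewals=2):
    """Discounted cost of the first `renewals` renewals (cf. Equation 21.9)."""
    return sum(cost * (1.0 + r) ** (-(first_year + k * life_years))
               for k in range(renewals))


def delta_lcc_rlt(r=R):
    """Lifelength contribution: renewals at 5, 45 years versus 15, 55 years."""
    cost = 250.0 * SECTION_M
    return renewal_npv(5, 40, cost, r) - renewal_npv(15, 40, cost, r)


def benefit_cost_ratio(benefits, project_cost):
    """Equation 21.11: sum of the benefit terms divided by LCC_I."""
    return sum(benefits) / project_cost


def rank_projects(projects):
    """projects maps name -> (list of benefit terms, project cost).

    Returns (name, ratio) pairs with ratio > 1, most promising first.
    """
    ratios = {name: benefit_cost_ratio(b, c) for name, (b, c) in projects.items()}
    return sorted(((name, rho) for name, rho in ratios.items() if rho > 1.0),
                  key=lambda item: item[1], reverse=True)
```

Feeding the rounded table values into `benefit_cost_ratio([0.3, 0.4, 2.3, 12.9], 2.2)` reproduces the ratio of about 7.2; a budget-constrained selection would need an additional rule that is not specified in the text.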

21.5 Conclusions
The two case studies presented elaborate on some of the challenges in Norwegian rail maintenance. Both the railway undertaking (NSB) and the infrastructure manager (JBV) aim at implementing more proactive strategies for maintenance and renewal based on more formal methods such as RCM and NPV/CBA. These methods require reliability parameters at a much higher level of detail than the current experience databases can offer. Therefore both NSB and JBV have started the process of restructuring databases, and emphasize the importance of proper failure reporting. Due to the lack of experience data it has up to now been necessary to utilize expert judgment to a great extent. It is further important to emphasize that optimization models like the ones presented here should be considered as decision support, rather than decision rules.

In order to improve on these areas we believe that more systematic collection and analysis of reliability data is an important factor, and here the rail industry may learn from the offshore industry, where joint data collection exercises have been run for 25 years (OREDA 2002).

Another challenge of such modelling is the lack of consistent degradation models. For example, for the track there is a good qualitative understanding of factors affecting degradation such as water in the track, contamination, geometry failures, heavy axles, etc. However, the quantitative models for degradation taking these factors into account are not very well developed. Research has paid much attention to design problems to ensure long service life, but it is difficult to use the research results for maintenance and renewal considerations. More empirical research on degradation mechanisms will also be important in the future.


21.6 References
Bing AJ, Gross A (1983) Development of Railroad Track Degradation Models. Transportation Research Record 939, Transportation Research Board, National Research Council, National Academy Press, Washington, DC, USA.
Bogdański S, Olzak M, Stupnicki J (1996) Numerical stress analysis of rail rolling contact fatigue cracks. Wear 191:14–24.
Braghin F, Lewis R, Dwyer-Joyce RS, Bruni S (2006) A mathematical model to predict railway wheel profile evolution due to wear. Accepted for publication in Wear.
Budai G, Huisman D, Dekker R (2005) Scheduling Preventive Railway Maintenance Activities. Accepted for publication in Journal of the Operational Research Society.
Carretero J, Perez JM, García-Carballeira F, Calderon A, Fernandez J, García JD, Lozano A, Cardona L, Cotaina N, Prete P (2003) Applying RCM in large scale systems: a case study with railway networks. Reliability Engineering and System Safety 82:257–273.
Dekker R, Wildeman RE, Van der Duyn Schouten FA (1997) A Review of Multi-Component Maintenance Models with Economic Dependence. Mathematical Methods of Operations Research 45:411–435.
Ferreira L, Murray M (1997) Modelling rail track deterioration and maintenance: current practices and future needs. Transport Reviews 17(3):207–221.
Garey MR, Johnson DS (1979) Computers and Intractability: a Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York.
Grassie SL (2005) Rolling contact fatigue on the British railway system: treatment. Wear 258:1310–1318.
Hecke A (1998) Effects of future mixed traffic on track deterioration. Report TRITA-FKT 1998:30, Railway Technology, Department of Vehicle Engineering, Royal Institute of Technology, Stockholm.
Kay AJ (1998) Behaviour of Two Layer Railway Track Ballast under Cyclic and Monotonic Loading. PhD Thesis, University of Sheffield, UK.
Li D, Selig ET (1995) Evaluation of railway sub grade problems. Transportation Research Record 1489:17–25.
Meier-Hirmer C, Sourget F, Roussignol M (2005) Optimising the strategy of track maintenance. Advances in Safety and Reliability, Kołowrocki (ed.), Taylor & Francis Group, London.
OREDA (2002) Offshore Reliability Data, 4th ed. OREDA Participants. Available from Det Norske Veritas, NO-1322 Høvik, Norway.
Pedregal DJ, García FP, Schmid F (2004) RCM2 predictive maintenance of railway systems based on unobserved components models. Reliability Engineering and System Safety 83:103–110.
Podofillini L, Zio E, Vatn J (2006) Risk-informed optimization of railway tracks inspection and maintenance procedures. Reliability Engineering and System Safety 91:20–30.
Reddy V, Chattopadhyay G, Larsson-Kråik PO, Hargreaves DJ (2006) Modelling and analysis of rail maintenance cost. Accepted for publication in International Journal of Production Economics.
Salim W (2004) Deformation and degradation aspects of ballast and constitutive modeling under cyclic loading. PhD Thesis, University of Wollongong, Australia.
Sato Y (1995) Japanese studies on deterioration of ballasted track. Vehicle System Dynamics 24:197–208.
Sriskandarajah C, Jardine AKS, Chan CK (1998) Maintenance scheduling of rolling stock using a genetic algorithm. European J. Oper. Res. 35:1–15.
Telliskivi T, Olofsson U (2004) Wheel–rail wear simulation. Wear 257:1145–1153.


Vatn J, Podofillini L, Zio E (2003) A risk based approach to determine type of ultrasonic inspection and frequencies in railway applications. World Congress on Railway Research, Edinburgh, Scotland, 28 September – 1 October 2003.
Veit P, Wogowitsch M (2003) Track Maintenance based on life-cycle cost calculations. In Innovations for a cost effective Railway Track. www.promain.org/images/publications/Innovations-LCC.pdf
Welte T, Vatn J, Heggset J (2006) Markov state model for optimization of maintenance and renewal of hydro power components. 9th International Conference on Probabilistic Methods Applied to Power Systems, KTH, Stockholm, 11–15 June 2006.
Wildeman RE (1996) The art of grouping maintenance. PhD Thesis, Erasmus University Rotterdam, Faculty of Economics.
Zakharov S, Komarovsky I, Zharov I (1998) Wheel flange/rail head wear simulation. Wear 215:18–24.
Zarembski AM, Palese JW (2003) Risk Based Ultrasonic Rail Test scheduling: Practical Application in Europe and North America. 6th International Conference on Contact Mechanics and Wear of Rail/Wheel Systems (CM2003), Gothenburg, Sweden, June 10–13, 2003.
Zhang YJ, Murray MH, Ferreira L (1997) Railway track performance models: degradation of track structures. Road and Transport Research 6(2):4–19.
Zoeteman A (2003) Life Cycle Management Plus. In Innovations for a cost effective Railway Track. www.promain.org/images/publications/Innovations-LCC.pdf

22 Condition Monitoring of Diesel Engines Renyan Jiang, Xinping Yan

22.1 Introduction
The engine is the heart of the ship, and the lubricant is the lifeblood of the engine. Wear is one of the main causes of engine failures. It is desirable to avoid engine breakdowns for reasons of safety and economy. This has led to an increasing interest in engine condition monitoring and performance modeling so as to provide useful information for maintenance decisions.

Generally, an engine goes through three phases: (i) a running-in phase with an increasing wear rate, (ii) a normal operational phase with a roughly constant wear rate, and (iii) a wear-out phase with a quickly increasing wear rate. The wear state can be effectively monitored by a number of techniques. The most popular technique is lubrication oil testing and analysis. Other techniques such as vibration and acoustic emission analyses also provide evidence of the wear state. A more effective way may be an integrated use of various monitoring techniques. In this chapter we confine our attention to oil analysis.

Oil analysis techniques fall into the following three types. The first is concentration analysis of wear particles in the lubricant. This can be conducted in the field or the laboratory. The second is wear debris analysis. This deals with examination of the shape, size, number, composition, and other characteristics of the wear particles so as to identify the wear state. This is usually conducted in the laboratory. The third is lubricant degradation analysis. This is used to analyze the physical and chemical characteristics of the lubricant and determine its state. This can be conducted in the field or the laboratory.

To avoid the use of expensive laboratory instrumentation for wear state identification, a usual practice is to build a quantitative relation (or discriminant model) between the condition variables (e.g. concentrations of wear particles) and the wear state using an observation sample obtained from both field and laboratory analysis. Once such a relation is built and verified, only field analysis is needed in practical applications. As a result, a key issue is to develop an effective and quantitative condition monitoring model.


In this chapter we present a case study, which deals with applying oil analysis techniques to condition monitoring of marine diesel engines. We present a systematic approach to identify the important condition variables, construct a multivariate control chart, build the quantitative relation between the condition variables and the wear state, and establish the state discrimination criterion or critical value. The proposed approach is formulated based on intuitive reasoning, optimization techniques and real data.

The chapter is organized as follows. Section 22.2 presents a literature review on condition-based maintenance (CBM) and its applications to diesel engines. Section 22.3 provides the background details and presents the monitoring and experimental results. The results are analyzed and modeled in Section 22.4. Finally, we conclude the chapter with a summary and discussion in Section 22.5.

Notation and Acronyms
AE      Acoustic emission
AI      Artificial intelligence
CBM     Condition-based maintenance
CM      Condition monitoring
CV      Coefficient of variation
TBM     Time-based maintenance
f(x)    Pdf of X
F(x)    Cdf of X
m       Mean
r       Correlation coefficient
V       Variance
φ()     Standard normal pdf
Φ()     Standard normal cdf
µ, σ    Distribution model parameters
and so on

22.2 CBM and its Applications to Diesel Engines
CBM is a maintenance approach, where a maintenance action is performed only when needed. Research and development in the CBM area has been growing rapidly. This section outlines the CBM concept and summarizes the relevant literature on the applications of CBM to diesel engines.

22.2.1 CBM and its Main Constituent Elements
Traditional maintenance policies are run-to-failure and time-based maintenance (TBM). TBM is usually carried out at regular and fixed intervals that are determined based on experience or the recommendations of manufacturers. The CBM decision is based on the information collected through condition monitoring (CM) and hence it is particularly applicable to situations where maintenance and failure are very costly.


From a viewpoint of reliability, TBM is based on traditional reliability models and CBM on dynamic multivariate models. According to Lu et al. (2001), a traditional reliability model is represented by a probability distribution of time to failure of a population, which reflects the average behavior of the population’s reliability characteristics, while a dynamic multivariate model focuses on estimating individual system reliability under dynamic operating and environmental conditions. As such, CBM usually deals with condition monitoring and reliability evaluation of individual systems in a quantitative and real-time manner.

Jardine et al. (2006) provide a comprehensive literature review on the recent research and developments in diagnostics and prognostics of mechanical systems implementing CBM. They divide a CBM program into three main steps: data acquisition, data processing and maintenance decision-making. Saranga (2002) considers that a complete architecture for CBM systems should cover the range of functions from data collection through the recommendation of specific maintenance actions, and enumerates the following key functions:
• Sensing and data acquisition
• Signal processing and feature extraction
• Production of alarms or alerts
• Failure or fault diagnosis and health assessment
• Prognostics
• Projection of health profiles to future health or estimation of remaining useful life
• Decision aiding
• Maintenance recommendations, or evaluation of asset readiness for a particular operational scenario
• Management and control of data flows or test sequences
• Management of historical data storage and historical data access
• System configuration management
• Human system interface

In this chapter, we summarize the CBM literature from the following five perspectives:
• Data acquisition
• Data processing
• Diagnosis and prognostics
• Maintenance decision-making
• Computerized CBM management system

22.2.1.1 Data Acquisition
Data or information is the basis of the CBM decision. There are three main sources: field records, CM, and expert knowledge. Field records provide event data such as breakdown, minor repair, overhaul, oil change, etc. CM data are the measurements related to the health condition of the system. They can be vibration signals, acoustic signals, debris concentrations, temperature, pressure, etc., obtained using various sensors or techniques such as


accelerometers, laser vibrometers, microphones, acoustic emission sensors, ferrography, spectroscopy, thermography, thermocouples, etc. Finally, the knowledge and experience of experts provide important information in determining the system state and the importance of relevant factors, indices or measures.

CM can be continuous or intermittent. The former is often expensive and probably inaccurate; the latter may be more cost effective and accurate but may miss some failure events. Thus, it has been an important issue to determine the optimal monitoring (or inspection or sampling) interval.

22.2.1.2 Data Processing
According to Jardine et al. (2006), CM data fall into three categories: value type, waveform type, and image type. Data processing for value-type data is called data analysis; data processing for waveform and image data is called signal processing; and the procedure of extracting useful information from raw signals is called feature extraction.

Two commonly used techniques for analyzing value-type data are trend analysis and time series modeling. When the problem involves a number of variables, dimension reduction becomes very important. In a CBM setting, a reliability model with covariates can combine event data (e.g. times to failure) with CM data (or covariates). One such model is the proportional hazards model. Another well known approach is the class of two-interval models, where the failure process is divided into two intervals: the time interval from the working state to the initiation of the defect, and the time interval from the initiation of the defect to failure. Moubray (1997) describes the latter as the P-F interval, where P means potential failure and F means functional failure. Goode et al. (2000) describe the former as the I-P interval, which is the time interval from machine installation to its potential failure. Each of the intervals can be represented by a certain distribution. Based on the fitted distributions and the outcomes of condition monitoring, machine prognosis can be derived.

Waveform data analysis includes three main categories: time-domain analysis, frequency-domain analysis and time-frequency analysis. Time-domain analysis calculates descriptive statistics such as mean, standard deviation, root mean square (RMS), skewness, kurtosis, time synchronous average, etc., based on the time waveform. More advanced approaches include time series, autoregressive and autoregressive moving average models. The most widely used frequency-domain analysis is spectrum analysis by means of the fast Fourier transform. A typical time-frequency analysis is the wavelet transform. Image processing is similar to waveform signal processing but more complicated.

22.2.1.3 Diagnostics and Prognostics
Diagnostics. Diagnostics deals with detection, isolation and identification of faults. It maps the monitoring information and extracted features to machine faults. This mapping process is usually called pattern recognition. Typical fault diagnostic approaches are model-based (or first principles; see Grimmelius et al. 1999), statistical, and artificial intelligence (AI) approaches.
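The time-domain statistics listed in Section 22.2.1.2 can be computed directly from a sampled waveform. The sketch below is an illustration only: the function name is an assumption, and it uses population (biased) moment estimators and non-excess kurtosis, which is one of several conventions in use.

```python
import math


def time_domain_features(signal):
    """Descriptive statistics of a sampled waveform (population moments)."""
    n = len(signal)
    if n == 0:
        raise ValueError("signal must contain at least one sample")
    mean = sum(signal) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "rms": math.sqrt(sum(x * x for x in signal) / n),
        # Skewness and kurtosis are undefined for a constant signal;
        # zero is returned in that case as a convention of this sketch.
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
    }
```

Features of this kind are typically computed per measurement window and then passed on to the diagnostic step.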


The model-based approaches use mathematical simulation models based on the underlying physical principles of the monitored machine. This kind of approach requires specific mechanistic knowledge and theory relevant to the monitored machine.

The statistical process control approach has been widely used for fault detection. It compares the monitored signal with a reference signal representing the normal condition to determine whether the monitored signal is within the control limits or not. Cluster analysis is a statistical classification approach that groups signals into different fault categories based on a certain distance or similarity measure between two signals. The measure is usually derived from a certain discriminant function in statistical pattern recognition.

AI approaches have been increasingly applied to machine diagnosis. Typical AI techniques include artificial neural networks, expert systems, fuzzy logic systems, and evolutionary algorithms.

According to Grimmelius et al. (1999), the model-based approaches can be applied efficiently to newly developed machinery because the design data is already available, whereas the other two kinds of approaches strongly depend on the availability of measured data and are more suited for application to existing machinery.

Prognostics. Unlike diagnostics, which deals with posterior event analysis, prognostics deals with fault prediction. Prognostics usually needs to evaluate the remaining useful life or the probability that a machine will operate normally for a given time interval. Similar to diagnosis, the prognosis approaches include model-based, statistical, and AI approaches.

Data fusion. For a complex system, a single sensor cannot provide sufficient information for producing accurate results from analysis. In such cases, multiple sensors are needed to obtain additional condition information, and multi-sensor data fusion techniques are used to combine the information from these sensors for more accurate diagnosis and prognosis. Data collected from each sensor may be a mixture of data from several sources. Some of the sources are related to a particular machine condition of interest. Thus, an issue is to separate different sources by fusing the observed multi-sensor data. Fusion can be conducted at data level, feature level, or decision level.

22.2.1.4 Maintenance Decision
The outcome of prognosis provides decision support for maintenance actions. Therefore, prognostics and maintenance optimization are often considered together. Maintenance optimization is usually based on certain criteria such as risk, cost, reliability and availability. Widely used criteria are cost and availability. However, risk and reliability criteria may be more appropriate for critical equipment or situations where the consequence cannot be estimated by cost. Many CBM optimization models have been developed and can be found in the literature. Their implementation often needs specially developed software packages.

22.2.1.5 Computerized CBM Management System
A well-developed CBM system can provide the user with a simple and automated method to plan and implement maintenance quickly and efficiently. To achieve


R. Jiang and X. Yan

this, a computerized management system with all those functions mentioned in Saranga (2002) is a crucial tool.

22.2.2 Applications of CBM to Diesel Engines: A Literature Survey

We classify the literature based on the following five dimensions:

• Reference
• Machinery type: diesel engine or marine diesel engine
• CM technique: oil, vibration, acoustic emission (AE), others (including multi-sensors)
• Modeling technique: model-based, statistical, AI, others
• Focus: data processing and modeling, development of sensors, and development of a CBM system

The relevant literature is summarized in Table 22.1. From the table, we can draw the following observations:

1. CM technique: among the 23 references, 10 deal with oil analysis, 5 with vibration analysis, 4 with AE analysis, and 7 with other analysis techniques (mainly multi-sensor techniques). This implies that oil analysis is the most widely used CM technique for diesel engines.
2. Analysis and modeling technique: 4 references deal with the model-based approach, 9 with the statistical approach, 10 with the AI approach, and 3 with other approaches (mainly integrated approaches). This implies that statistical and AI approaches are the most widely used analysis techniques for CM data of diesel engines.
3. Application type: 15 references deal with data processing and/or modeling, 4 with development of on-line sensors or measurement systems, and 4 with development and/or application of integrated CBM systems. This implies that data processing and modeling plays a key role in a CBM program.

Table 22.1. Summary of literature in CBM of diesel engines

| Reference | Machinery | CM technique | Model type | Focus |
|---|---|---|---|---|
| Anderson et al. (1983) | Engine | Quantitative analytical ferrography | | Development of a standard ferrography analysis procedure, evaluation of a high gradient magnetic separator |
| Douglas et al. (2006) | Diesel engines | Acoustic emission | Statistical, AE energy | Identification of AE signals of ring/liner |
| Gorin and Shay (1997) | Marine diesel engine | Oil analysis | | Development of onboard oil analysis meters |
| Grimmelius et al. (1999) | Marine diesel engines | Torsional vibration of crank shaft | First principles, feature extraction, neural networks | Demonstration of modeling techniques through two cases |
| Hargis et al. (1982) | Marine diesel engines | Oil, ferrography | Discriminant score plotting technique | Identification of normal and abnormal states, condition-based inspection |
| Hofmann (1987) | Ship main engine | Vibration analysis | | Vibration monitoring program for preventive maintenance onboard |
| Hojen-Sorensen et al. (2000) | Marine diesel engines | Vibration, acoustic emission | Neural network, discriminant methods, hidden Markov decision trees | On-line classification scheme |
| Hountalasa and Kouremenosa (1999) | Marine diesel engines | Thermodynamics | Model-based, simulation model | Automatic troubleshooting method |
| Hubert et al. (1983) | Cummins VT-903 diesel engine | Ferrography | Model-based approach | Development of a testing methodology for determining wear particle generation rates and filter efficiencies |
| Jakopovic and Bozicevic (1991) | Marine engine | Oil analysis | Expert system, theory of fuzzy sets | Expert system for assessing the quality of lubricant and diagnosing engine failures |
| Jardine et al. (1989) | Diesel engine | Metal concentration of engine oil | Proportional hazard model | Fit proportional hazards model to oil analysis data |
| Johnson and Hubert (1983) | Medium duty truck engine | Analytical ferrography | Statistical | Evaluation of the particle generation rate and the filtering efficiency |
| Liu et al. (2000) | Marine diesel engines | Ferrograph, grid capacitance and photoelectric sensors | | On-line wear condition monitoring system |
| Logan (2005) | Gas turbine and diesel generators | Electric sensing devices | Neural network diagnostic inferencing | Intelligent diagnostic software agents operating in real-time onboard naval ships |
| Pontoppidan and Larsen (2003) | Marine diesel engines | Acoustical emission | Independent component analysis | Detection of condition changes |
| Priha (1991) | Marine diesel engines | | Knowledge-based | Development of Fault Avoidance Knowledge System |
| Scherer et al. (2004) | Diesel engines | Viscosity, permittivity, temperature, IR-spectroscopy | | Development and application of prototype of an oil condition sensor |
| Sharkey (2001) | Diesel engines | Vibration, AE, cylinder pressure | Neural network | Decision fusion through a multinet system |
| Sun et al. (1996) | Diesel engines | Oil analysis | Artificial neural network | Application of multisensor fusion technology |
| Tang et al. (1998) | Marine diesel engine | Temperature, pressure, combustion air flow | Fuzzy neural network, combustion simulation model | Condition monitoring system |
| Wang and Wang (2000) | Diesel engine | | Evidence theory, decision-layer multisensor data fusion | Approach to diagnose multiple faults of a working diesel engine |
| Wu et al. (2001) | Diesel engine | Vibration | Statistics analysis, multi-index fusion | Analysis of the piston–liner wear condition |
| Zhang et al. (2003) | Marine diesel engines | Oil spectrometric analysis | Grey system theory | Determination of the turning point |

22.3 Problem Background and Observation Results

22.3.1 Problem Background

Wear states of two 8NVD48A-2u marine main propulsion diesel engines were experimentally investigated at the Reliability Institute of Wuhan University of Technology, China. The overall objective of the program was to develop a CBM technique to provide condition information for maintenance decisions on the engines. There are three kinds of condition variables to represent the wear condition of the engines:

• Wear particle concentrations
• Lubricant quality parameters such as viscosity and contamination index
• Operational parameters such as vibration level, shaft torque moment and instantaneous rotation velocity


In this case study, we focus on the concentrations of wear particles, which reflect the wear condition of the main tribo-pairs (e.g. ring-liner, shaft-bearing, and gears) in the engines. The metallic elements in the wear particles consist of ferrous elements and non-ferrous elements. Elemental Fe comes from many parts such as valves, bearings, piston rings, shafts, and so forth. The other ferrous elements include Cr, Mn, and Ni. Among them, Cr is from the surface coating of the first piston ring, Mn from the cylinder liner and Ni from the transmission gears. The non-ferrous elements include Al, Cu, Pb and Si. Among them, Al is from the piston, Cu mainly from the connecting rod bearing, Pb from the crankshaft bearing and Si from the piston or contaminants.

22.3.2 Observation Results

The experiment was started after an overhaul of the engines, which is assumed to restore each engine to good-as-new, and finished at the time of the next overhaul. During the experiment the engines ran cumulatively for 4831 h, the engine oil was periodically sampled, and a total of 110 oil samples were taken from the 2 engines. Various pieces of equipment such as a direct reading ferrograph, rotary ferrograph, infrared spectrum analyzer, scanning electron microscope and electronic digital analyzer, viscosity meter, and lubricant quality meter were used to analyze the oil samples in order to classify the wear state. In this study, the wear is divided into two states: normal (or State 0) and abnormal (or State 1). The wear state can be determined by analyzing the size, composition, and type of wear particles. Several different techniques were used to identify the wear state in the laboratory. For more details about the state classification based on wear particle morphology, see Roylance et al. (1994), Roylance and Raadnui (1994) and Raadnui and Roylance (1995). Most of the observations were made under normal operational conditions. A trend analysis of concentration vs. time was carried out. The main findings were as follows:

1. There exists a close relation between oil degradation and abnormal wear. It was observed that the concentration of wear particles increases as viscosity decreases and the contamination index increases.
2. There exist some differences among the outcomes provided by different analysis techniques, and sometimes the outcomes are in disagreement.

The trend analysis identified 28 observations within the neighboring regions of the condition change point, which is somewhat similar to the P-point of the P-F interval. Among them, 12 observations are identified as abnormal and 16 observations as normal. These 28 observations are shown in Table 22.2 for further analysis and modeling. In the table, j denotes sample number, and the 14th row gives the mean concentration values of the first 12 observations.

Table 22.2. Concentration of main elements in oil samples (ppm)

| j | State | Fe | Cr | Ni | Mn | Al | Cu | Pb | Si |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 52.18 | 2.95 | 2.66 | 2.36 | 8.7 | 10.98 | 13.29 | 5.32 |
| 2 | 1 | 52.73 | 3.25 | 2.55 | 1.78 | 8.04 | 8.93 | 9.65 | 5.46 |
| 3 | 1 | 35.31 | 0.95 | 0.68 | 1.26 | 5.57 | 4.33 | 6.23 | 4.57 |
| 4 | 1 | 32.2 | 1.35 | 1.17 | 1.03 | 5.70 | 3.57 | 5.89 | 4.56 |
| 5 | 1 | 82.87 | 4.74 | 2.61 | 1.85 | 9.85 | 13.34 | 17.1 | 7.22 |
| 6 | 1 | 48.22 | 2.17 | 1.94 | 1.37 | 7.08 | 6.82 | 8.05 | 4.88 |
| 7 | 1 | 30.78 | 1.03 | 0.00 | 1.15 | 4.71 | 4.18 | 5.94 | 4.00 |
| 8 | 1 | 37.99 | 1.30 | 0.00 | 1.07 | 6.07 | 4.52 | 5.87 | 3.90 |
| 9 | 1 | 39.51 | 1.39 | 0.41 | 1.04 | 7.24 | 3.17 | 5.51 | 7.15 |
| 10 | 1 | 33.47 | 1.06 | 0.35 | 0.86 | 6.77 | 2.87 | 4.95 | 7.18 |
| 11 | 1 | 36.50 | 2.17 | 1.05 | 1.61 | 7.73 | 3.88 | 6.29 | 7.61 |
| 12 | 1 | 35.03 | 1.73 | 0.57 | 1.30 | 7.68 | 3.47 | 5.11 | 7.43 |
| Mean | | 43.07 | 2.01 | 1.17 | 1.39 | 7.10 | 5.84 | 7.82 | 5.77 |
| 13 | 0 | 28.2 | 0.39 | 0.00 | 0.72 | 4.04 | 2.71 | 3.70 | 3.96 |
| 14 | 0 | 27.02 | 0.79 | 0.40 | 0.87 | 4.09 | 3.24 | 5.02 | 5.71 |
| 15 | 0 | 25.66 | 0.43 | 0.00 | 0.69 | 3.64 | 2.65 | 3.57 | 5.29 |
| 16 | 0 | 22.25 | 0.50 | 0.18 | 0.50 | 3.94 | 2.15 | 3.80 | 5.50 |
| 17 | 0 | 30.72 | 1.28 | 0.64 | 1.09 | 5.09 | 4.15 | 6.63 | 3.99 |
| 18 | 0 | 29.4 | 0.58 | 0.00 | 1.01 | 4.30 | 4.20 | 4.92 | 3.65 |
| 19 | 0 | 29.17 | 0.47 | 0.00 | 0.97 | 4.12 | 3.67 | 4.73 | 3.67 |
| 20 | 0 | 31.45 | 1.10 | 0.00 | 1.12 | 4.73 | 4.27 | 5.96 | 4.01 |
| 21 | 0 | 30.04 | 0.43 | 0.00 | 1.02 | 4.16 | 3.91 | 5.30 | 3.90 |
| 22 | 0 | 29.48 | 0.66 | 0.00 | 0.91 | 4.49 | 3.58 | 4.79 | 3.85 |
| 23 | 0 | 25.97 | 0.34 | 0.00 | 0.68 | 3.69 | 2.56 | 3.71 | 4.03 |
| 24 | 0 | 42.05 | 2.34 | 1.98 | 1.94 | 7.75 | 11.12 | 12.78 | 5.93 |
| 25 | 0 | 43.16 | 2.10 | 2.16 | 1.92 | 7.41 | 10.64 | 13.25 | 5.67 |
| 26 | 0 | 23.38 | 0.96 | 0.31 | 0.58 | 4.63 | 2.13 | 3.21 | 6.74 |
| 27 | 0 | 29.16 | 0.62 | 1.07 | 0.91 | 3.23 | 2.95 | 4.93 | 8.52 |
| 28 | 0 | 22.82 | 0.66 | 0.31 | 0.91 | 4.52 | 2.06 | 3.66 | 6.35 |

22.4 Development of Multivariate Control Chart and Discriminant Model

The data of Table 22.2 has been modeled using a stepwise pluralistic regression approach by Zhao et al. (2003). There, an empirical relation between the state and debris concentrations was built, which included three major elements: Mn, Cr, and Cu. Jiang and Jardine (2006) propose a composite scale modeling approach, where the data is used as a numerical example and is reanalyzed. It is shown that the composite scale approach gives a better result in terms of statistical significance and failure (or abnormal) prediction capability. In this section, we propose a new


approach to model the data. Compared with the previous approaches, it appears more straightforward and comprehensive.

22.4.1 Correlation Analysis

The correlation coefficient r is a measure of the strength of the linear relationship between two variables (Blischke and Murthy 2000, p 367). In this section we conduct a correlation analysis to:

• Examine whether or not the correlation depends on the wear state
• Identify potentially significant variables for further analysis

For the former issue we examine the correlation coefficient matrices for both States 0 and 1. The correlation coefficient matrix associated with State 1 can be obtained from the first 12 rows of Table 22.2. The first figure of each entry in Table 22.3 gives the correlation coefficient for this case. Similarly, the correlation coefficient matrix associated with State 0 can be obtained from the last 16 rows of Table 22.2. The second figure of each entry in Table 22.3 gives the correlation coefficient for this case. As can be seen from the table, the two figures of each entry are close to each other, with an average relative error of about 10%, except for those in the column corresponding to Si. Thus, we can roughly assume that the correlation is state-independent. In the following discussion, the correlation coefficients we mention are the first figure of each entry in Table 22.3.

To judge the significance of a linear correlation, we need to determine a critical value for the correlation coefficient r. According to Fisher (1970), given a correlation coefficient r, the significance of the linear correlation of two variables can be tested by using the following statistic to transform the correlation coefficient to a Student's t-value:

$$ t = \frac{r}{\sqrt{1 - r^2}}. \tag{22.1} $$

The critical value of t associated with the 95% level, one tail, and 12 − 1 = 11 degrees of freedom is 1.7959. Inverting Equation 22.1 gives $r = t / \sqrt{1 + t^2}$, so the critical value of r is 0.8737. Namely, the linear relation between two variables is significant if their correlation coefficient is larger than 0.8737 in this application. As can be seen from Table 22.3, there are eight correlation coefficients that are larger than 0.8737. They are:

$$ r(\mathrm{Cu}, \mathrm{Pb}) = 0.98,\; r(\mathrm{Fe}, \mathrm{Cr}) = 0.95,\; r(\mathrm{Fe}, \mathrm{Pb}) = 0.94,\; r(\mathrm{Fe}, \mathrm{Cu}) = 0.93,\; r(\mathrm{Cr}, \mathrm{Cu}) = r(\mathrm{Cr}, \mathrm{Pb}) = 0.92,\; r(\mathrm{Cr}, \mathrm{Al}) = r(\mathrm{Cu}, \mathrm{Ni}) = 0.88. \tag{22.2} $$

Equation 22.2 involves six elements: Fe, Cr, Ni, Al, Cu, and Pb. Among them, Ni and Al appear only once and the corresponding correlation coefficients (= 0.88) are very close to the critical value. Thus, we may classify the elements into three groups:

• Strong correlation group: (Fe, Cr, Cu, Pb)
• Weak correlation group: (Ni, Al)
• Independent group: (Mn, Si)
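The significance check can be sketched in a few lines. The sketch below is illustrative only: the function names are our own, the Fe and Cr readings are the State 1 values of Table 22.2, and the critical value is obtained by inverting Equation 22.1 as printed. Note that the commonly quoted form of this test statistic carries an extra factor √(n − 2) and uses n − 2 degrees of freedom; the sketch follows Equation 22.1 because that is what reproduces the quoted value 0.8737.

```python
import math

# State 1 readings (ppm) for Fe and Cr, copied from Table 22.2 (j = 1..12).
fe = [52.18, 52.73, 35.31, 32.2, 82.87, 48.22,
      30.78, 37.99, 39.51, 33.47, 36.50, 35.03]
cr = [2.95, 3.25, 0.95, 1.35, 4.74, 2.17,
      1.03, 1.30, 1.39, 1.06, 2.17, 1.73]


def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def critical_r(t_crit):
    """Invert Equation 22.1, t = r / sqrt(1 - r^2), for r."""
    return t_crit / math.sqrt(1.0 + t_crit ** 2)


r_fe_cr = pearson_r(fe, cr)
r_crit = critical_r(1.7959)  # t value quoted in the text (95%, one tail, 11 df)
is_significant = r_fe_cr > r_crit
```

Applying `pearson_r` to every pair of columns of the first 12 rows of Table 22.2 would rebuild the State 1 figures of Table 22.3.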

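Table 22.3. Correlation coefficients matrices (first figure: State 1; second figure: State 0)

| | Cr | Ni | Mn | Al | Cu | Pb | Si |
|---|---|---|---|---|---|---|---|
| Fe | 0.95, 0.84 | 0.78, 0.80 | 0.66, 0.96 | 0.81, 0.85 | 0.93, 0.96 | 0.94, 0.96 | 0.24, -0.04 |
| Cr | | 0.87, 0.88 | 0.78, 0.89 | 0.88, 0.96 | 0.92, 0.91 | 0.92, 0.93 | 0.32, 0.23 |
| Ni | | | 0.84, 0.83 | 0.75, 0.82 | 0.88, 0.85 | 0.84, 0.89 | 0.11, 0.50 |
| Mn | | | | 0.73, 0.90 | 0.84, 0.97 | 0.81, 0.97 | 0.11, 0.05 |
| Al | | | | | 0.74, 0.93 | 0.76, 0.93 | 0.64, 0.06 |
| Cu | | | | | | 0.98, 0.99 | 0.02, 0.05 |
| Pb | | | | | | | 0.12, 0.10 |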

When two variables are strongly correlated and their means differ significantly, one can ignore the one with the smaller mean and simply use the one with the larger mean. Using this reasoning, we may delete some of the elements in the strong correlation group. Consider the first three correlation coefficients of Equation 22.2, which have the largest r values. According to the first correlation coefficient and the means given in Table 22.2, Cu may be deleted. Similarly, Cr and Pb may be deleted based on the second and third correlation coefficients, respectively. As a result, only five elements (Fe, Ni, Mn, Al and Si) are retained for further analysis. A physical interpretation of the correlation in this case study is that the wear debris may not be pure metal and can come from different parts. Its mathematical interpretation is that an increase or decrease in the readings of one element implies a possible change in the same direction in the readings of a positively correlated element, and in the opposite direction in the readings of a negatively correlated element. When the absolute value of the readings is very small, e.g. some of the readings of Si, the correlation should be considered insignificant.

22.4.2 State Discrimination Capability of Condition Variables

Each condition variable contributes partial information for identifying the state of the monitored system. By quantitatively examining their contributions, we can identify those variables which carry more state information. This study develops a method to quantitatively evaluate the contributions of the condition variables. It starts with building the marginal distributions associated with the abnormal and normal states for each condition variable.

22.4.2.1 Marginal Distribution Associated with State 1

We use the index 1 ≤ i ≤ 5 to denote the elements (Fe, Ni, Mn, Al, Si), respectively. For a given element i, the entries of the corresponding column in Table 22.2 form a censored sample, denoted by $\{x_{ij},\ j = 1, 2, \ldots, 28\}$. Assume that $X_i$ follows a


certain distribution $F_1^{(i)}(x)$. The data associated with State 0 can be viewed as right-censored. Namely, if the observed value associated with State 0 is $x_{ij}^{+}$, then the corresponding value of x associated with State 1 satisfies $x > x_{ij}^{+}$, and its contribution to the likelihood is $1 - F_1^{(i)}(x_{ij}^{+})$. The overall likelihood function is given by

$$ L_1^{(i)} = \prod_{j=1}^{12} f_1^{(i)}(x_{ij}) \prod_{j=13}^{28} \left[ 1 - F_1^{(i)}(x_{ij}^{+}) \right]. \tag{22.3} $$

The model parameters can be estimated by maximizing $L_1^{(i)}$ or $\ln L_1^{(i)}$. Murthy et al. (2003) present a model selection method based on a WPP (Weibull plotting paper) plot. The method is based on a match between the WPP plot of the data and the WPP plot of a model. The WPP plots of the data are shown in Figure 22.1. As can be seen from the figure, the WPP plots of the data have three different kinds of shape: convex for Ni, S-shaped for Mn and Al, and concave for Si and Fe. It is well known that the WPP plot of the two-parameter Weibull distribution is a straight line, and hence that distribution is not an appropriate model for the data. We examined the WPP plots of some common two-parameter distributions and found that the WPP plot is S-shaped for the normal distribution truncated at x = 0, concave for the lognormal distribution, and convex for the Gumbel distribution of the smallest extreme truncated at x = 0, which is given by

$$ F(x) = 1 - \exp\left[ -\exp\left( \frac{x - \mu}{\sigma} \right) \right] \exp\left( e^{-\mu/\sigma} \right), \quad x \ge 0. \tag{22.4} $$
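To make the censored likelihood of Equation 22.3 concrete, the following minimal sketch evaluates ln L for the Fe column under a lognormal model. It is an illustration under stated assumptions rather than the fitting routine used in the study: the function names are ours, no optimizer is included, and the log-likelihood is only compared at the Fe estimates reported in Table 22.4 (µ = 3.7554, σ = 0.1897) and at two deliberately poor parameter pairs chosen for illustration.

```python
import math

# Fe readings (ppm) from Table 22.2. State 1 values are exact observations;
# State 0 values are treated as right-censored for the State 1 distribution.
fe_state1 = [52.18, 52.73, 35.31, 32.2, 82.87, 48.22,
             30.78, 37.99, 39.51, 33.47, 36.50, 35.03]
fe_state0 = [28.2, 27.02, 25.66, 22.25, 30.72, 29.4, 29.17, 31.45,
             30.04, 29.48, 25.97, 42.05, 43.16, 23.38, 29.16, 22.82]


def std_normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lognormal_logpdf(x, mu, sigma):
    """ln f(x) for the lognormal distribution."""
    z = (math.log(x) - mu) / sigma
    return -0.5 * z * z - math.log(x * sigma * math.sqrt(2.0 * math.pi))


def lognormal_logsf(x, mu, sigma):
    """ln[1 - F(x)] for the lognormal distribution."""
    z = (math.log(x) - mu) / sigma
    return math.log(1.0 - std_normal_cdf(z))


def log_likelihood_state1(mu, sigma, exact, censored):
    """ln L of Equation 22.3: density terms plus right-censored survival terms."""
    ll = sum(lognormal_logpdf(x, mu, sigma) for x in exact)
    ll += sum(lognormal_logsf(x, mu, sigma) for x in censored)
    return ll


# Log-likelihood at the reported estimates and at two poor alternatives.
ll_reported = log_likelihood_state1(3.7554, 0.1897, fe_state1, fe_state0)
ll_low_mu = log_likelihood_state1(3.0, 0.1897, fe_state1, fe_state0)
ll_wide_sigma = log_likelihood_state1(3.7554, 1.5, fe_state1, fe_state0)
```

In practice `log_likelihood_state1` would be handed to a numerical optimizer; the comparison above only shows that the reported estimates score better than parameter values far away from them.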

Figure 22.1. WPP plots of data (curves for Ni, Mn, Si, Al and Fe; axes x and y)


Their WPP plots are shown in Figure 22.2. Clearly, for each WPP plot of data in Figure 22.1 one can find a shape that matches one of the WPP plots in Figure 22.2. Thus, an appropriate model can be found from these three models for each variable. Once the model type is determined, the maximum likelihood method can be used to obtain the estimates of the model parameters. The estimated parameters, $(\mu_1^{(i)}, \sigma_1^{(i)})$, are shown in Table 22.4. In later analysis, we need to know the means and variances of the fitted marginal distributions. For the truncated normal distribution, the mean and variance are given by

$$ m = \mu + \sigma \frac{\phi(-\mu/\sigma)}{1 - \Phi(-\mu/\sigma)}, \quad V = \sigma^2 - m(m - \mu), \tag{22.5} $$

where µ and σ are the model parameters, and φ(.) and Φ(.) are the pdf and cdf of the standard normal distribution, respectively. For the truncated Gumbel distribution, the mean and variance are given by

$$ m = \mu + \sigma \exp\left( e^{-\mu/\sigma} \right) I_1, \quad V = (m - \mu)\left[ (\mu - m) + \sigma I_2 / I_1 \right], \tag{22.6} $$

where

$$ I_1 = \int_{e^{-\mu/\sigma}}^{\infty} \ln(s)\, e^{-s}\, ds, \quad I_2 = \int_{e^{-\mu/\sigma}}^{\infty} \ln^2(s)\, e^{-s}\, ds. \tag{22.7} $$

For the lognormal distribution, the mean and variance are given by

$$ m = \exp(\mu + \sigma^2 / 2), \quad V = m^2 \left[ \exp(\sigma^2) - 1 \right]. \tag{22.8} $$
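The moment formulas can be checked numerically against Table 22.4. The sketch below is a hedged illustration: the function names, the Simpson integration and the upper cut-off of the integral are our own choices. It evaluates Equation 22.8 for Fe (State 1), Equation 22.5 for Mn (State 0) and the mean of Equation 22.6 for Ni (State 1). For the lognormal and truncated normal cases, the spread values listed in Table 22.4 agree with the square root of V, so the sketch reports standard deviations.

```python
import math


def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_moments(mu, sigma):
    """Mean and variance from Equation 22.5 (normal truncated at x = 0)."""
    ratio = phi(-mu / sigma) / (1.0 - Phi(-mu / sigma))
    m = mu + sigma * ratio
    return m, sigma ** 2 - m * (m - mu)


def lognormal_moments(mu, sigma):
    """Mean and variance from Equation 22.8."""
    m = math.exp(mu + 0.5 * sigma ** 2)
    return m, m ** 2 * (math.exp(sigma ** 2) - 1.0)


def simpson(f, a, b, n=20000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0


def truncated_gumbel_mean(mu, sigma, upper=50.0):
    """Mean from Equations 22.6 and 22.7; the integral is cut off at s = upper."""
    lower = math.exp(-mu / sigma)
    i1 = simpson(lambda s: math.log(s) * math.exp(-s), lower, upper)
    return mu + sigma * math.exp(lower) * i1


m_fe, v_fe = lognormal_moments(3.7554, 0.1897)         # Fe, State 1
m_mn, v_mn = truncated_normal_moments(0.0548, 0.8948)  # Mn, State 0
m_ni = truncated_gumbel_mean(2.0401, 0.7985)           # Ni, State 1
sd_fe, sd_mn = math.sqrt(v_fe), math.sqrt(v_mn)
```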

Figure 22.2. WPP plots of truncated normal, lognormal, and truncated Gumbel distributions


Table 22.4. Maximum likelihood estimates of the distribution parameters

| | i = 1, Fe | i = 2, Ni | i = 3, Mn | i = 4, Al | i = 5, Si |
|---|---|---|---|---|---|
| Model | Lognormal | Truncated Gumbel | Truncated normal | Truncated normal | Lognormal |
| $\mu_0^{(i)}$ | 3.3532 | 1.0286 | 0.0548 | 4.4977 | 1.5194 |
| $\sigma_0^{(i)}$ | 0.1243 | 0.6361 | 0.8948 | 1.1658 | 0.1880 |
| $m_0^{(i)}$ | 28.8153 | 0.9530 | 0.7342 | 4.4980 | 4.6511 |
| $\sqrt{V_0^{(i)}}$ | 3.5965 | 0.8064 | 0.5494 | 1.1653 | 0.8820 |
| $\mu_1^{(i)}$ | 3.7554 | 2.0401 | 1.5644 | 7.3574 | 1.8317 |
| $\sigma_1^{(i)}$ | 0.1897 | 0.7985 | 0.4440 | 1.3679 | 0.1964 |
| $m_1^{(i)}$ | 43.5269 | 1.7723 | 1.5648 | 7.3574 | 6.3660 |
| $\sqrt{V_1^{(i)}}$ | 8.3335 | 1.1040 | 0.4433 | 1.3679 | 1.2624 |
| $x_c^{(i)}$ | 34.3485 | 1.5703 | 1.0493 | 5.9022 | 5.3512 |
| $Err_1^{(i)}$ | 0.0701 | 0.1171 | 0.2540 | 0.1142 | 0.2005 |
| $Err_2^{(i)}$ | 0.1244 | 0.3797 | 0.1230 | 0.1437 | 0.2159 |
| $P(x_c^{(i)})$ | 0.0973 | 0.2484 | 0.1885 | 0.1289 | 0.2082 |
| Rank | 1 | 5 | 3 | 2 | 4 |

22.4.2.2 Marginal Distribution Associated with State 0

Denote by $F_0^{(i)}(x)$ the marginal distribution associated with State 0. The data associated with State 1 can be viewed as left-censored (see Blischke and Murthy 2000). Namely, if the observed value associated with State 1 is $x_{ij}^{-}$, then the corresponding value of x associated with State 0 satisfies $x < x_{ij}^{-}$, and its contribution to the likelihood is $F_0^{(i)}(x_{ij}^{-})$. The overall likelihood function is given by

$$ L_0^{(i)} = \prod_{j=1}^{12} F_0^{(i)}(x_{ij}^{-}) \prod_{j=13}^{28} f_0^{(i)}(x_{ij}). \tag{22.9} $$

A careful examination has been carried out to determine the model type of $F_0^{(i)}(x)$. We found that $F_0^{(i)}(x)$ has the same model type as $F_1^{(i)}(x)$. The maximum likelihood estimates of the model parameters, $(\mu_0^{(i)}, \sigma_0^{(i)})$, are also shown in Table 22.4.

22.4.2.3 Critical Value Between States 0 and 1

For a given element i and observation value $x_i$, we need to establish a state discrimination criterion based on a certain critical value $x_c^{(i)}$. Namely, we classify it as normal if $x_i < x_c^{(i)}$; otherwise, as abnormal and accordingly initiate an


appropriate maintenance action. We propose the following method to determine the value of $x_c^{(i)}$. Consider the case $x_i = x_c^{(i)}$. As can be seen from Figure 22.3, there can be two kinds of misjudgments or errors:

• The real state is 0 but is misjudged as State 1
• The real state is 1 but is misjudged as State 0

The former error is given by

$$ Err_1^{(i)} = 1 - F_0^{(i)}(x_c^{(i)}), \tag{22.10} $$

and the latter error is given by

$$ Err_2^{(i)} = F_1^{(i)}(x_c^{(i)}). \tag{22.11} $$

The average of the two errors is given by

$$ P(x_c^{(i)}) = \left( Err_1^{(i)} + Err_2^{(i)} \right) / 2. \tag{22.12} $$

Figure 22.3. Distributions of condition variable and critical value (densities $f_0(x)$ and $f_1(x)$, with $x_a$ and $x_c$ marked on the x-axis)

Thus, $x_c^{(i)}$ can be determined by minimizing $P(x_c^{(i)})$, i.e.

$$ \frac{dP(x_c^{(i)})}{dx_c^{(i)}} = 0 \quad \text{or} \quad f_0^{(i)}(x_c^{(i)}) = f_1^{(i)}(x_c^{(i)}). \tag{22.13} $$
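As an illustration of Equations 22.10 to 22.13, the sketch below locates the critical value for Fe by solving f0(x) = f1(x) with a simple bisection and then evaluates the two errors. The lognormal parameters are the Fe estimates of Table 22.4; the function names, the bisection routine and the search bracket (chosen between the two fitted means) are our own assumptions.

```python
import math

# Lognormal parameters for Fe from Table 22.4 (State 0 and State 1).
MU0, SIGMA0 = 3.3532, 0.1243
MU1, SIGMA1 = 3.7554, 0.1897


def lognormal_pdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def lognormal_cdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def density_gap(x):
    """f1(x) - f0(x); its root between the two means satisfies Equation 22.13."""
    return lognormal_pdf(x, MU1, SIGMA1) - lognormal_pdf(x, MU0, SIGMA0)


def bisect(f, lo, hi, tol=1e-8):
    """Plain bisection; assumes f changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x_c = bisect(density_gap, 28.8, 43.5)         # bracket between the fitted means
err1 = 1.0 - lognormal_cdf(x_c, MU0, SIGMA0)  # Equation 22.10
err2 = lognormal_cdf(x_c, MU1, SIGMA1)        # Equation 22.11
p_c = 0.5 * (err1 + err2)                     # Equation 22.12
```

Restricting the search to the interval between the two means matters because the equality of the two densities generally has a second root in the tail, which is not the threshold of interest.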


The specific values of the relevant parameters ($Err_1^{(i)}$, $Err_2^{(i)}$, $P(x_c^{(i)})$, $x_c^{(i)}$) for each element are shown in Table 22.4.

22.4.2.4 Discussion

$P(x_c^{(i)})$ is a measure of the misjudgment probability. The smaller it is, the better the discrimination capability of variable i, namely, the more state information the variable contains. Using it as an importance criterion, we can rank the condition variables. The last row of Table 22.4 shows the rank number of each variable. As can be seen from the table, Fe has the best discrimination capability. This is consistent with the result of the correlation analysis, which shows that it is highly correlated with (Cr, Cu, Pb). Namely, the concentration of Fe comprehensively reflects the concentrations of Cr, Cu, Pb and itself, and hence the reading of Fe reflects the wear state to a great extent. The second most significant element is Al. This also appears reasonable since debris of Al and Cr (the latter is reflected by Fe) mainly comes from the piston and piston rings, which are the main wear parts. Mn and Si have almost the same discrimination capability. This appears reasonable due to their independence. Finally, it is noted that Ni has the worst state discrimination capability. This can be explained by the dispersion of its readings (see Table 22.2), and the fact that the wear of the transmission gears may not be a major problem.

22.4.3 Construction of a Multivariate Control Chart

A multivariate control chart can intuitively display the results of condition monitoring and the evolution trend. It is therefore especially important to set an alarm threshold and an abnormal threshold. Usually, the thresholds are optimized in a CBM model. Here, our focus is on the construction of such a control chart, and hence we only present a simple method to set the thresholds when the optimal thresholds are unavailable. We define $x_c^{(i)}$ as the abnormal threshold, and define the alarm threshold $x_a^{(i)}$ as follows:

$$ F_1^{(i)}(x_a^{(i)}) = \alpha < Err_2^{(i)}. \tag{22.14} $$

Here, α depends on the condition degradation speed, sampling interval, maintenance reaction time, and the values of $Err_2^{(i)}$, i = 1, …, 5. In the current example, we take α = 5%. The control chart is designed to have the following features:

• It is displayed in the x-y plane with the elements ordered from the most important to the least important
• The abnormal thresholds are normalized to 1 for all the elements
• The alarm thresholds are transformed to the same value, γ, for all the elements
• The overall state is represented along the y-axis


To achieve the second and third features, we use the following relation to transform an observed concentration $x_i$ into a normalized concentration $y_i$ without changing the relative magnitude of the original readings:

$$ y_i = a_i + b_i x_i, \quad b_i > 0. \tag{22.15} $$

Let

$$ 1 = a_i + b_i x_c^{(i)}, \quad \gamma = a_i + b_i x_a^{(i)}. \tag{22.16} $$

From Equation 22.16 we have

$$ a_i = \frac{\gamma x_c^{(i)} - x_a^{(i)}}{x_c^{(i)} - x_a^{(i)}}, \quad b_i = \frac{1 - \gamma}{x_c^{(i)} - x_a^{(i)}}. \tag{22.17} $$

To specify the value of γ, we let

$$ \sum_i a_i = 0 \tag{22.18} $$

so as to decrease the influence of the constant term in Equation 22.15. This yields

$$ \gamma = \sum_i \frac{x_a^{(i)}}{x_c^{(i)} - x_a^{(i)}} \Bigg/ \sum_i \frac{x_c^{(i)}}{x_c^{(i)} - x_a^{(i)}}. \tag{22.19} $$

Clearly, γ is a function of α. For the present case, γ = 0.8404. The alarm thresholds and relevant parameters are shown in Table 22.5. The multivariate control chart with the rescaled element concentrations associated with the 12th and 13th observations is displayed in Figure 22.4.

Table 22.5. Alarm thresholds and transformed parameters

| | i = 1, Fe | i = 2, Ni | i = 3, Mn | i = 4, Al | i = 5, Si |
|---|---|---|---|---|---|
| $x_a^{(i)}$ | 31.2898 | 0.4048 | 0.8342 | 5.1075 | 4.5206 |
| $a_i$ | -0.7925 | 0.7849 | 0.2215 | -0.1855 | -0.0284 |
| $b_i$ | 0.0522 | 0.1370 | 0.7419 | 0.2008 | 0.1922 |
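The rescaling of Equations 22.15 to 22.19 can be sketched as follows. The thresholds are taken from Tables 22.4 and 22.5, and the Fe alarm threshold is recomputed from Equation 22.14 under the lognormal State 1 model; the function and variable names are ours and the sketch is illustrative only.

```python
import math
from statistics import NormalDist

# Order of elements: Fe, Ni, Mn, Al, Si.
X_C = [34.3485, 1.5703, 1.0493, 5.9022, 5.3512]  # abnormal thresholds, Table 22.4
X_A = [31.2898, 0.4048, 0.8342, 5.1075, 4.5206]  # alarm thresholds, Table 22.5


def common_alarm_level(x_c, x_a):
    """gamma from Equation 22.19."""
    num = sum(a / (c - a) for c, a in zip(x_c, x_a))
    den = sum(c / (c - a) for c, a in zip(x_c, x_a))
    return num / den


def transform_coefficients(x_c, x_a, gamma):
    """List of (a_i, b_i) pairs from Equation 22.17."""
    return [((gamma * c - a) / (c - a), (1.0 - gamma) / (c - a))
            for c, a in zip(x_c, x_a)]


gamma = common_alarm_level(X_C, X_A)
coeffs = transform_coefficients(X_C, X_A, gamma)

# Alarm threshold for Fe from Equation 22.14 with alpha = 5%
# (lognormal State 1 model, parameters from Table 22.4).
x_a_fe = math.exp(3.7554 + 0.1897 * NormalDist().inv_cdf(0.05))

# Rescale observation 12 of Table 22.2 (Fe, Ni, Mn, Al, Si) with Equation 22.15.
obs12 = [35.03, 0.57, 1.30, 7.68, 7.43]
rescaled12 = [a + b * x for (a, b), x in zip(coeffs, obs12)]
```

A rescaled value above 1 means the element exceeds its abnormal threshold, and a value between γ and 1 places it in the alarm band.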

22.4.4 State Discriminant Model

A state discriminant model consists of a relation between the condition variables and the wear state and a critical value. A composite scale modeling approach can be used to combine several scales or variables into a single scale or variable. The


combined scale is expected to have better failure (or abnormal state) prediction capability than individual scales. Two typical models are the linear and multiplicative ones. Their parameters are determined by minimizing the sample coefficient of variation (CV) of the composite scale. The minimum CV approach is hard to apply in the presence of censored data. In this context, Jiang and Jardine (2006) propose a simple method to estimate the model parameters in the presence of censored data. The method transforms censored data into complete data by adding a mean residual value to a censored datum for each scale. Such a new data set, thus obtained, is called an equivalent complete data set and will be used for the parameter estimation using the minimal CV approach under the assumption that the transformation does not significantly impact the composite scale model to be built. They also conclude that a small value of CV is a necessary but insufficient condition of a good prediction capability of failure for the composite scale model. Therefore, they consider more than one alternative model, use the minimum CV method to estimate the parameters of the alternative models, and determine the best model based on the prediction capability of the models.

Figure 22.4. Multivariate control chart (rescaled concentration plotted against State, Fe, Al, Si, Mn and Ni for observations No. 12 and No. 13, with the abnormal and alarm thresholds marked)

The above approach appears somewhat troublesome as it involves multiple steps and intensive numerical calculations. In this subsection, we propose a simpler and more straightforward approach. It is based on the following assumptions, which appear plausible:

• The correlations between the individual variables under consideration can be ignored (see the correlation analysis in Section 22.4.1)
• The composite scale is a linear combination of the individual variables, and follows the normal distribution

Under these assumptions, the misjudgment probability can be directly represented by a function of the parameters of the composite scale and the means and variances of the condition variables. As a result, the parameters of the composite


scale and the misjudgment probability can be simultaneously determined by minimizing the misjudgment probability. The critical value is then established using the approach presented in Section 22.4.2.

22.4.4.1 Determination of a Composite Scale

Consider the following linear model:

$$ y = \sum_{i=1}^{5} c_i x_i, \quad \sum_{i=1}^{5} c_i = 1. \tag{22.20} $$

If we want to exclude a certain variable, say $x_k$, from the model, we just need to set $c_k = 0$. According to the above assumptions, the composite scale, Y, is a normal random variable. For State 1, the mean and variance of Y are given by

$$ m_1 = \sum_{i=1}^{5} c_i m_1^{(i)}, \quad V_1 = \sum_{i=1}^{5} c_i^2 V_1^{(i)}. \tag{22.21} $$

For State 0, the mean and variance of Y are given by

$$ m_0 = \sum_{i=1}^{5} c_i m_0^{(i)}, \quad V_0 = \sum_{i=1}^{5} c_i^2 V_0^{(i)}. \tag{22.22} $$

According to Equation 22.13, the critical value of the composite scale, $y_c$, satisfies the following relation:

$$ \phi\left( \frac{y_c - m_0}{\sqrt{V_0}} \right) \Big/ \sqrt{V_0} = \phi\left( \frac{y_c - m_1}{\sqrt{V_1}} \right) \Big/ \sqrt{V_1}. \tag{22.23} $$

From Equation 22.23 we have

$$ y_c = m_1 + \sqrt{V_1}\, \frac{\sqrt{d^2 + 2(s^2 - 1)\ln(s)} - s d}{s^2 - 1}, \tag{22.24} $$

where

$$ s = \sqrt{V_1 / V_0}, \quad d = (m_1 - m_0) / \sqrt{V_0}. \tag{22.25} $$

According to Equation 22.12, the misjudgment probability is given by

$$ P(y_c) = \left[ 1 - \Phi\left( \frac{y_c - m_0}{\sqrt{V_0}} \right) + \Phi\left( \frac{y_c - m_1}{\sqrt{V_1}} \right) \right] \Big/ 2. \tag{22.26} $$


Since m0, V0, m1, and V1 are functions of the decision variables {ci}, P(yc) is also a function of {ci}. As a result, {ci} can be determined optimally by directly minimizing P(yc).

22.4.4.2 Candidate Linear Models

If we consider all linear models that include at least three variables, we have ten three-parameter models, five four-parameter models, and one five-parameter model. If we always include the two most important elements, Fe and Al, in all the models, then we need to consider only three three-parameter models, three four-parameter models, and one five-parameter model. We take the latter approach.

We first consider the five-parameter model. Using the approach outlined in Section 22.4.4.1, we obtained the model parameters and objective function value shown in the second row of Table 22.6. The third row of the table shows the values of ci m1(i), which reflect the contribution of each element to the composite scale: the larger the value, the more important the element. Based on this criterion, we re-rank the elements; the results are shown in the fourth row. Comparing these results with those shown in Table 22.4, we find that the ranks are basically consistent, except that the positions of Mn and Si are exchanged.

By eliminating one of (Ni, Mn, Si), we obtain three four-parameter models, whose parameters and objective function values are shown in rows 5–7 of Table 22.6. Similarly, by eliminating two of (Ni, Mn, Si), we obtain three three-parameter models, whose parameters and objective function values are shown in rows 8–10 of Table 22.6. As can be seen from rows 5–10 of the table, the objective function value obtained from a model excluding a less important element is smaller than that obtained from a model excluding a more important element. This confirms that the new ranking is reasonable.

Table 22.6. Parameters of composite scale models

| Model No. | c1, Fe | c2, Ni | c3, Mn | c4, Al | c5, Si | P(yc) | rI |
|---|---|---|---|---|---|---|---|
| 1 | 0.0529 | 0.1182 | 0.4009 | 0.2310 | 0.1969 | 0.0201 | 12.4486 |
| ci m1(i) | 2.3039 | 0.2096 | 0.6274 | 1.6994 | 1.2534 | | |
| Rank | 1 | 5 | 4 | 2 | 3 | | |
| 2 | 0.0602 | 0 | 0.4542 | 0.2620 | 0.2235 | 0.0224 | 14.8925 |
| 3 | 0.0917 | 0.1974 | 0 | 0.3810 | 0.3299 | 0.0290 | 11.5072 |
| 4 | 0.0664 | 0.1476 | 0.4981 | 0.2879 | 0 | 0.0294 | 11.3420 |
| 5 | 0.1149 | 0 | 0 | 0.4739 | 0.4112 | 0.0323 | 15.4769 |
| 6 | 0.0782 | 0 | 0.5837 | 0.3381 | 0 | 0.0328 | 15.2289 |
| 7 | 0.1394 | 0.2948 | 0 | 0.5658 | 0 | 0.0425 | 11.7512 |
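The model counts quoted in Section 22.4.4.2 follow from simple combinatorics over the five elements. The following sketch enumerates the candidate subsets; the function and variable names are illustrative and not part of the original study.

```python
from itertools import combinations

ELEMENTS = ["Fe", "Ni", "Mn", "Al", "Si"]
ALWAYS_INCLUDED = {"Fe", "Al"}  # the two most important elements


def candidate_models(min_size=3, required=frozenset()):
    """List every element subset of at least min_size that contains `required`."""
    models = []
    for size in range(min_size, len(ELEMENTS) + 1):
        for subset in combinations(ELEMENTS, size):
            if set(required) <= set(subset):
                models.append(subset)
    return models


def count_by_size(models):
    """Count the models by their number of parameters."""
    counts = {}
    for model in models:
        counts[len(model)] = counts.get(len(model), 0) + 1
    return counts


all_models = candidate_models()
restricted_models = candidate_models(required=ALWAYS_INCLUDED)
print(count_by_size(all_models))         # {3: 10, 4: 5, 5: 1}
print(count_by_size(restricted_models))  # {3: 3, 4: 3, 5: 1}
```

The restricted enumeration yields the seven models listed in Table 22.6.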

22.4.4.3 Selection of the Best Model

The best model should have a small value of P(yc) and include few model parameters. Denote by n the number of model parameters (i.e. the number of variables included in a linear composite scale). Owing to the second relation in Equation


R. Jiang and X. Yan

22.20, an n-parameter model only has n–1 independent parameters. Define the information quantity of a model as follows: I = 1/P(yc).

(22.27)

The average information quantity per independent parameter is given by:

rI = 1 /[(n − 1) P( yc )] .

(22.28)
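As a cross-check, Equation 22.28 can be applied to the P(yc) values printed in Table 22.6. The sketch below transcribes those values and the element sets implied by the non-zero coefficients; the function and variable names are illustrative. Because P(yc) is printed to only four decimals, the recomputed ratios agree with the printed rI column to within about 0.02 rather than exactly.

```python
# Model number -> (elements with non-zero coefficients, P(yc)), from Table 22.6.
TABLE_22_6 = {
    1: (("Fe", "Ni", "Mn", "Al", "Si"), 0.0201),
    2: (("Fe", "Mn", "Al", "Si"), 0.0224),
    3: (("Fe", "Ni", "Al", "Si"), 0.0290),
    4: (("Fe", "Ni", "Mn", "Al"), 0.0294),
    5: (("Fe", "Al", "Si"), 0.0323),
    6: (("Fe", "Mn", "Al"), 0.0328),
    7: (("Fe", "Ni", "Al"), 0.0425),
}

# Last column of Table 22.6, as printed.
PRINTED_R_I = {1: 12.4486, 2: 14.8925, 3: 11.5072, 4: 11.3420,
               5: 15.4769, 6: 15.2289, 7: 11.7512}


def information_ratio(n_params: int, p_yc: float) -> float:
    """Equation 22.28: r_I = 1 / [(n - 1) * P(yc)]."""
    if n_params < 2 or p_yc <= 0.0:
        raise ValueError("need n >= 2 and P(yc) > 0")
    return 1.0 / ((n_params - 1) * p_yc)


r_i = {no: information_ratio(len(elements), p)
       for no, (elements, p) in TABLE_22_6.items()}

# Rank models from largest to smallest r_I.
ranking = sorted(r_i, key=r_i.get, reverse=True)
for no in ranking:
    print(no, TABLE_22_6[no][0], round(r_i[no], 4), PRINTED_R_I[no])
```

The two highest ratios belong to the three-parameter models (Fe, Al, Si) and (Fe, Mn, Al).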

The ratio rI reflects both of the above requirements, and a large value of rI implies a better model. We use this criterion to select the best model. The last column of Table 22.6 shows the values of rI. As can be seen from the table, the best model is the three-parameter model that includes the three important elements (Fe, Al, Si). It should also be noted that the second-best model is the three-parameter model that includes the elements (Fe, Al, Mn). Once more, this shows that Mn and Si have almost the same importance, as indicated in the correlation analysis.

22.4.4.4 Rescaling of the Best Model

To display the state discrimination result on the control chart, we normalize the state critical value yc (= 8.9081) to 1. To do so, all the coefficients in the composite condition variable are divided by yc. Similarly, we may set an alarm threshold for y as below:

F1 ( ya ) = β < Err2 = 4.15% .

(22.29)

In the current case, we take β = 1%. This yields y0.01 = 8.1536. The rescaled alarm threshold for the composite scale equals 0.9156, which is not equal to the rescaled alarm threshold (= γ) for the elements; see Figure 22.4.

22.5 Conclusions and Discussion

In this case study, we have presented an approach for modeling and analysis of the condition monitoring data of the 8NVD48A-2u marine diesel engines. The main conclusions are:

1. The correlation analysis is useful for identifying the correlation strengths between the elements and whether or not the correlations are state-independent.
2. It is possible and useful to build the marginal distributions of element concentrations associated with both the abnormal state and the normal state. A discrimination capability analysis helps in evaluating the state discrimination capability of elements.
3. A multivariate condition monitoring control chart has been developed to provide the maintenance engineer with intuitive wear state information.


4. The composite scale modeling approach based on minimizing the misjudgment probability is a useful technique to combine multiple variables. The proposed information criterion for selecting the best model appears reasonable.

Some issues that need to be considered in the future are as follows:

1. Some additional work is needed to validate the proposed model. This can be done by examining the agreement between the model prediction results and the actual observations in the field.
2. The alarm threshold and oil sampling interval can be optimized so as to obtain a balance between the acquired information and the effort involved.
3. To provide a more accurate assessment of engine condition, it appears necessary to use multiple monitoring techniques. Thus, fusion of multisensor data and aggregation of multi-state measures are important topics that need further study.
4. A maintenance decision optimization model and a computerized software implementation need to be developed to promote greater use of this approach in industry.

22.6 Acknowledgement

The authors wish to thank Prof. D.N.P. Murthy for his constructive comments on an earlier version of this chapter.


23 Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration) Ulla Espling and Uday Kumar

23.1 Introduction

To sustain a competitive edge in business, railway companies all over the world are looking for ways and means to improve their maintenance performance. Benchmarking is a very effective tool that can assist management in its pursuit of continuous improvement of operations. The benefits are many, as benchmarking helps in developing realistic goals and strategic targets and facilitates the achievement of excellence in operation and maintenance (Almdal 1994).

In this chapter three different benchmarking studies are presented: (1) benchmarking of the maintenance process for cross-border operations, (2) a study of the effectiveness of outsourcing of the maintenance process by different track regions in Sweden, and (3) a study of the level of transparency among the European railway administrations. In these case studies the focus is on railway infrastructure, excluding the rolling stock.

The outline of the chapter is as follows. An overview of Swedish railway operation is presented in Section 23.2. The definition and methodology of benchmarking in general are discussed in Section 23.3. The special demands for benchmarking of maintenance are described in Section 23.4, and in Section 23.5 the special considerations caused by the railway context are reviewed, generally for railways and in more detail for the Swedish context. The case studies are discussed in Sections 23.6–23.8. The discussions and conclusions are presented in Sections 23.9 and 23.10, respectively.

All the data pertinent to benchmarking of railway operation and maintenance are retrieved, classified and analyzed in close cooperation with operation and maintenance personnel from both infrastructure owners and maintenance contractors. The chapter discusses the pros and cons, the areas for improvement and the need for the development of a framework and metrics for benchmarking. The focus of this chapter is to visualize best practices in maintenance and also to propose means for improvement in the railway sector, with special reference to railway infrastructure.


U. Espling and U. Kumar

23.2 Swedish Railway Operations

The railway industry is presently in a state of transition, with new stakeholders emerging and old ones trying to adjust to the new operating environment. In every European country, the railway administration was vertically integrated, i.e. it comprised everything in “one body”, until almost the end of the 1980s, when a new railway era started. The vertically integrated railway organisations were and still are partly government-funded and regulated by parliament through government directives. Figure 23.1 illustrates the organisational changes in Sweden from a “single entity”, SJ (the Swedish State Railways), to a number of business units, each functioning independently to achieve its business goals.

During 1988, SJ as a state authority was restructured to enhance its competitiveness and make railway travel and transportation economically viable. The restructuring programme divided SJ into two major groups, namely train operating companies (TOCs) and infrastructure owners. The TOCs are expected to take responsibility for the transportation of goods and passengers in close cooperation with infrastructure managers. Today there are about 20 TOCs functioning in Sweden. The railway infrastructure is managed by ‘Banverket’ (the Swedish National Rail Administration), which is a government body. In 1998, Banverket was reorganised into two distinct categories: purchasers, or ‘service buyers’, and contractors, or ‘service providers’. For administrative purposes, Banverket is divided into five regions, each of which is responsible for maintenance planning and purchasing, and for following up the execution of the maintenance contract. In recent years, maintenance contracts have increasingly been awarded through open tender, and are thus subject to market competition.

Figure 23.1. Organisational changes within the Swedish railway system (timeline markers: 1988-07-01, 1998-01-01, 2001-07-01, 2004-07-01)


23.2.1 Maintenance

Railway infrastructure is a complex system. Usually such infrastructure is technically divided into substructures, namely bridges, tunnels, permanent way, turnouts, sleepers, electrical assets (both low and high voltage), signalling systems (including systems for traffic control), and telecom systems such as systems for radio communication, telecommunications and detectors. Maintenance of all these subsystems is a complex issue, which makes it difficult to plan and execute the maintenance tasks. Factors such as geographical and geological features, topography and climatic conditions need to be considered when planning for maintenance. Furthermore, the availability of track for maintenance is an important issue to be considered when planning the maintenance tasks to be executed. Previously, maintenance management was based on technical system characteristics instead of asset delivery functions. Maintenance is critical for ensuring safety, train punctuality, overall capacity utilization and lower costs for modern railways. The deregulation, privatization and outsourcing processes have created new situations, new organizations and new structures for collecting appropriate data from the field operations and extracting relevant information, so as to make correct decisions.

23.2.2 Need for Benchmarking in Maintenance

Many of the European railways have followed a similar evolution. Although many of the countries of Europe are now members of the European Union, questions are being raised concerning the transparency of the state-controlled railway sector, in order to make comparisons possible and to find the best practices followed within the railway business. The European railway sector has gradually started to use benchmarking so that the different actors may be able to learn from each other.

23.3 Benchmarking: An Overview

Benchmarking has its roots in fundamental business practice and began to take shape at the beginning of the 1980s. It was introduced as a tool for business development and is supposed to offer a key to large-scale improvements, as it provides a basis for learning from the best practices and a road map for copying the work process of the best in the class, i.e. it provides gains with relatively little effort (Dunn 2003). In general the magnitude of the improvement is around 10–15% (Varcoe 1996), and in some cases it can be as high as 35% (Burke 2004). There are different benchmarking approaches, ranging from the purely quantitative to the highly qualitative (Oliverson 2000). Quantitative benchmarking will benchmark, for example, the percentage of emergency work orders, the number of skilled workmen per first-line supervisor or the percentage of overtime. Moulin (2004) discusses benchmarking of the public sector, in which some aspects of performance measurement must be considered, and states that, since organisations in this sector often perform non-profitable administrative work, they should be viewed from a balanced scorecard perspective (see Kaplan and Norton 1992).


Such organizational measures are useful to service users and provide a clear system for translating feedback from the analysis into a strategy for corrective actions.

23.3.1 The Benchmarking Methodology

Successful benchmarking starts with a deep understanding and good knowledge of one’s own organisation’s processes, i.e. learning about one’s own performance and bringing one’s own core business under control before learning from others (Wireman 2004). The most common approach to benchmarking is to compare one’s own performance indicators with those of competitors or other companies in the same area. This can range from simple questionnaires completed by personnel involved in maintenance activities, with little or no expert help, to comprehensive studies conducted with help from outside firms providing expertise in the planning, execution and implementation of such processes. Based on what is to be compared, benchmarking can be classified as performance, process or strategic benchmarking (Campbell 1995). Similarly, based on whom one should make a comparison with, benchmarking can be classified as internal, competitive, functional or generic benchmarking (Zairi and Leonard 1994).

The results obtained from benchmarking identify the gap between one’s own organisation’s performance and that of the organisation following best practices. These results are then used to improve and develop core competencies and core businesses, leading to lower costs, increased profit, better service towards the customers, increased quality, and continuous improvements. In order to gain benefits, an organisation has to mature in its own core competencies, and to ensure success, the ROI (return on investment) should be calculated for each benchmarking exercise (Wireman 1998, 2004).

A broad survey of the literature shows that, even though all the suggested methodologies for benchmarking are similar in their approach, they vary from a general two-step process to a more detailed 10-step process (Varcoe 1996; Ramabadron et al. 1997; Wireman 2004). All these steps can be related to Deming’s famous PDCA cycle. Malano (2000) goes a little further and describes Deming’s cycle as a “circular process” which includes the following phases: planning, analysis, integration, action and review. The operational form of these four steps for the purpose of benchmarking may look like the following:

1. Detailed planning of the benchmarking operation, keeping the goal of benchmarking in focus (for example cost reduction, productivity, etc.) and identifying suitable partners for benchmarking. This step essentially encompasses an internal audit to learn about the organisation’s business indicators etc.
2. Identifying which businesses to visit and collecting appropriate data.
3. Analysis of the data and information collected to identify gaps, and the sharing of information.
4. Implementation and continuous improvement.


Most of the literature points out that successful benchmarking needs a good plan specifying what to benchmark, whom to visit (to study the best practice), when to visit, and what types of resources are required for analysis and implementation. Often simple studies are completed at little cost and generally have no follow-up. Good benchmarking, on the other hand, is time- and resource-consuming and has well-structured follow-up plans. The selection of the type and scope of the benchmarking process should be made on the basis of the impact of the outcome on the critical success factors for the process (Mishra et al. 1998). A benchmarking exercise is of no value if the findings are not implemented; in fact, without implementation it would be a waste of resources. The benefits of benchmarking do not occur until the findings from the benchmarking project are realized, and therefore performance improvement through benchmarking needs to be a continuous process.

23.3.2 Metrics

Metrics for benchmarking can be indicators or KPIs, as discussed in Chapter 19. In order to make the benchmarking process a successful exercise, it is important that the areas, the process enablers and the critical success factors required for good performance are identified, so that the common denominator or any common structure that is important to compare can be described by indicators or other types of measurements, often presented as percentages (Wireman 2004). These indicators can be characterized as lead and lag indicators, lead indicators being performance drivers and lag indicators being outcome measures (Åhrén et al. 2005).

23.4 Benchmarking of Maintenance

Maintenance is treated as an enabler of improved asset or equipment performance (see Figure 23.2), which creates additional value for the business process (Liyanage and Kumar 2003). Its performance can be monitored by performance measures like availability, quality, value (cost) etc. (Mishra et al. 1998).

Figure 23.2. Maintenance’s link with benchmarked value (diagram blocks: Operation, Equipment state, Maintenance, Performance measurement, Comparison with benchmarked value)


Since maintenance is a process of continuous improvement of the delivered performance, benchmarking can be used to improve efficiency in maintenance and offer solutions for improvement in maintenance performance. One definition of benchmarking maintenance used in practice is “the process of comparing performance with other organisations, identifying comparatively high performance organisations, and learning what they do that allow them to achieve the high level of performance” (Dunn 2003). Relevant data can contain the following: (1) the man hours, (2) the material costs, (3) the cost of preventive maintenance, (4) the cost of predictive maintenance and (5) the cost of maintenance contracting. In Europe the European Federation of National Maintenance Societies (EFNMS 2006) has agreed upon 13 different maintenance indices to be used for presenting the results from benchmarking maintenance organisations. These are:

1. Maintenance costs as a percentage of plant replacement value
2. Store investment as a percentage of plant replacement value
3. Contract cost as a percentage of maintenance cost
4. Preventive maintenance costs as a percentage of maintenance costs
5. Preventive maintenance man hours as a percentage of maintenance man hours
6. Maintenance cost as a percentage of turnover
7. Training man hours as a percentage of maintenance hours
8. Immediate corrective maintenance man hours as a percentage of maintenance hours
9. Planned and scheduled man hours as a percentage of maintenance man hours
10. Required operating time as a percentage of total available time
11. Actual operating time as a percentage of required operation time
12. Actual operating time divided by the number of immediate corrective maintenance events
13. Immediate corrective maintenance time divided by the number of immediate corrective maintenance events
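Most of these indices are simple ratios of annual totals. The following Python sketch computes four of them (indices 1, 3, 4 and 12 in the list above) for a hypothetical organisation. All input figures, field names and function names are invented for illustration and do not come from EFNMS or from the case studies in this chapter.

```python
from dataclasses import dataclass


@dataclass
class AnnualMaintenanceData:
    """Hypothetical one-year totals; field names are illustrative only."""
    maintenance_cost: float
    plant_replacement_value: float
    contract_cost: float
    preventive_cost: float
    actual_operating_hours: float
    immediate_corrective_events: int


def percentage(part: float, whole: float) -> float:
    """Return part as a percentage of whole, refusing a zero denominator."""
    if whole == 0:
        raise ValueError("denominator must be non-zero")
    return 100.0 * part / whole


def selected_indices(d: AnnualMaintenanceData) -> dict:
    """Compute indices 1, 3, 4 and 12 of the list of 13 indices."""
    if d.immediate_corrective_events == 0:
        raise ValueError("index 12 is undefined without corrective events")
    return {
        1: percentage(d.maintenance_cost, d.plant_replacement_value),
        3: percentage(d.contract_cost, d.maintenance_cost),
        4: percentage(d.preventive_cost, d.maintenance_cost),
        12: d.actual_operating_hours / d.immediate_corrective_events,
    }


# Invented example figures (currency units and hours).
example = AnnualMaintenanceData(
    maintenance_cost=2_000_000.0,
    plant_replacement_value=80_000_000.0,
    contract_cost=500_000.0,
    preventive_cost=800_000.0,
    actual_operating_hours=7_500.0,
    immediate_corrective_events=25,
)
indices = selected_indices(example)
for number, value in indices.items():
    print(f"index {number}: {value:.1f}")
```

Keeping each index as a plain ratio of recorded totals makes it easy to see which accounting items must be collected consistently by all benchmarking partners before the values can be compared.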

Wireman (2004) states that the maintenance management impact on the return on fixed assets (ROFA) can be measured by two indicators, namely:

• Maintenance cost as a percentage of the total process, production, or manufacturing cost
• Maintenance cost per square foot maintained

23.4.1 Decision Criteria from Benchmarking Exercise

Results from different benchmarking projects in the US have yielded some rules of thumb that can be used to evaluate the results as well as to make suggestions for future actions. One rule of thumb concerns the ratio of the corrective maintenance volume to the total maintenance volume. A level higher than 20% indicates a reactive situation, where the future focus will be to bring the


core business under control, since planned work vs. unplanned work may have a cost ratio as high as 1:5. Another rule of thumb concerns a high level of overtime, which indicates reactive situations in the maintenance process. Since labour is a large cost driver for maintenance, the amount of overtime can have a large impact on maintenance costs. Another large cost driver is spare parts (Wireman 2004, Hägerby 2002).

23.4.2 Railway Context

Benchmarking approaches used by industries to improve their performance through comparison with the best in the class can equally be used for benchmarking of railway operations. But unlike the industrial sector, railway infrastructure consists of a larger number of individual assets, including substructure, permanent way, signalling, electrical and telecom assets, that extend over a few or hundreds of kilometres. Furthermore, there are large differences between the structures of the different railway organisations. At present, many organisations comprise one entity, whereas some are divided into traffic companies and infrastructure owners, with an in-house or outsourced maintenance function. The different types of traffic on the railway tracks have different degradation characteristics and, therefore, it is difficult to compare passenger-intensive lines with heavy haul lines or lines with mixed traffic. Furthermore, the data collected from the different partners selected for benchmarking cannot always be compared without normalisation. It is also important to validate and audit the collected data to find outliers (Oliverson 2000).

Some examples of the normalisation required within railway benchmarking are presented in the following. In a benchmarking project called “InfraCost”, data have been collected over a number of years to compare the asset life cycle costs of different railways. A complex normalisation process has been used to bring all the information, for example maintenance costs, renewal costs, local labour costs, and intensity and speed of trains, from different countries in Europe to a common base for comparison (see http://promain.server.de; Zoeteman and Swier 2005). Another way to normalize data is to identify the cost drivers and try to establish a link between performance and cost on the one hand, and performance and the age of the assets on the other. In order to compare the assets, compensation factors were established on the basis of the network complexity, measured in terms of (Stalder et al. 2002):

• Density of turnouts
• Length of lines on bridges and in tunnels
• Degree of electrification
• Usage according to average frequencies of train per year
• Average gross tonnage per year (freight and passenger)

In the project, the cost drivers have been established, but the implementation of life cycle cost (LCC) strategies for avoiding the difficulties of separating the maintenance cost from the renewal expenditures has not yet been fully realized (Stalder et al. 2002).


When the International Union of Railways (UIC), in its benchmarking projects between 1996 and 2002, compared costs between Europe, the USA and Asia, it found big differences in the costs. In an attempt to understand the differences, Zoeteman and Swier (2005) developed a model that converted the benchmarked results into life cycle cost per km of track, including the maintenance cost, renewal cost and overhead cost for both the organization and the contractors. The major differences are in purchasing power, wages, turnout density, degree of electrification, the proportion of single track and intensity of use.

Benchmarking is not yet common practice within the railway sector, and there is a need to build up a framework and metrics in order to compare and identify the best practices. The aim of using benchmarking as a tool to improve prevalent maintenance practices within the railway sector is to establish measures that make it possible to compare the results of railway administrations operating under different circumstances and conditions, and to identify the best practices in the area. Therefore, the benchmarking process has to be evaluated and normalised to fit the railway maintenance process. Accordingly, it is also essential to decide what kind of KPIs (key performance indicators) need to be implemented for improvement.

23.5 Benchmarking in the Swedish Railway Sector

Benchmarking within the railway sector is characterized by state ownership and monopoly. One of the first benchmarking projects, “InfraCost” (http://promain.server.de; Zoeteman and Swier 2005), showed big differences in maintenance costs among European, Asian and American railway administrations. The result from this benchmarking shows the need for establishing a common framework and common metrics for benchmarking. Initially, benchmarking within Sweden was motivated by reasons other than finding the best practice. These were:

• Checking whether it is possible to perform benchmarking, and studying benchmarking methods
• Finding those key areas that are critical success factors
• Finding answers to questions like “Why is it less expensive to run railways in neighbouring countries?”

The case studies presented in Sections 23.6–23.8 have used three different approaches concerning methodologies for data collection and classification, normalization and analysis of results. The case studies are:

1. Two neighbouring local track areas sharing a line for railway traffic on each side of the border. The aim was to compare the maintenance cost, identify differences and find areas to improve.
2. Internal benchmarking for maintenance contracts in order to find the best practice and to improve the maintenance contracts.
3. To determine what (maintenance) performance measures were in use within the railway sector in Europe. The aim was to scan the possibility of finding areas to compare, just by looking into those official documents that some of the railways have presented.

The common denominator of these case studies is the use of benchmarking methodologies in order to find out whether they are useful within the railway sector. The differences between the case studies lie in the main objectives of the benchmarking.

23.6 Case Study – 1: Benchmarking Across the Border

A case study benchmarking a cross-border operation and maintenance process was initiated by Track Area A for the rail administration in Country A. Track Area A provides railway infrastructure in the western part in Country B, between City B in Country B and City A in their own country (Country A). The aim was to study and understand why the operation and maintenance costs are different on the other side of the border. They also needed to find out whether those costs were comparable with the costs in Country B and whether it was possible to coordinate parts of the maintenance work between the two countries in order to decrease the cost (Åhrén and Espling 2003).

The benchmarking process was conducted by Luleå Railway Research Center, a neutral party to both organizations. During the preparatory stage of the benchmarking process, total transparency between the infrastructure owners representing the two countries was agreed upon. It was also decided (by the sponsor of the study) not to make the result of the study public and to keep it confidential for five years. Both track areas were organised more or less identically for the purpose of maintenance, and the maintenance activities were planned and executed in a similar way. It was therefore not necessary to examine and normalise the overhead costs of the two railway administrations.

23.6.1 Metrics and Data

The metrics and data collected were the costs of operation and maintenance and the outcome of performance losses. The data were collected for one calendar year from the accounting, planning, failure reporting and inspection systems, and contained:

• Budget vs. performed outcome for maintenance costs
• Overhead costs for the local administrations
• Maintenance planning
• Failure statistics
• The inspection remarks


U. Espling and U. Kumar

However, the following information and data relevant to the study could not be collected:

• Overhead cost for the contractor (not available due to the competition between the different contractors)
• Man hours (not available, not collected in the client system from the invoice)
• Traffic volume
• Asset age, which were approximately the same (not necessary to collect, since the traffic mix and volume were the same)
• Spare part costs (not available)

23.6.1.1 Normalisation

Since the organisation and accounting structure were almost the same, it was assumed that the missing data could be disregarded. The amount of normalisation was restricted to adjusting the currency.

23.6.2 Results and Interpretations

The available data and information were then sorted as shown in Table 23.1. The maintenance costs were grouped into the categories snow removal, corrective maintenance and preventive maintenance; see Table 23.2.

Table 23.1. Comparing cost per metre of track

Object                                        Track Area A   Track Area B
Total cost                                    795            290
Maintenance cost                              285            280
Track area administration cost (overhead)     220            8
Other external costs, e.g. consultancy        90             2
Charges for electric power                    200            0

Table 23.2. Difference in percentage in maintenance costs between Track Areas A and B

Maintenance activities                                                                   Difference in percentage from Track Area B
Snow removal                                                                             +10%
Corrective maintenance, including organisation for preparedness (emergency service)     +32%
Preventive maintenance, including inspection                                             –62%

The benchmarking result showed that the maintenance cost per track metre was approximately the same in both track areas, whereas the total cost per track metre differed considerably (Table 23.1). One of the findings was that the amount of corrective maintenance was very high in both track areas. A closer investigation showed that Track Area A had a larger amount of corrective maintenance and therefore less money for preventive maintenance.

Benchmarking of the Maintenance Process at Banverket


Furthermore the overhead cost and other external costs such as travel costs, costs for consultancy etc. in Track Area A were much higher compared to Track Area B. One of the explanations was the geographical isolation of Track Area A from its own administration, resulting in higher traveling costs and the necessity of buying consultancy for some services that Track Area B could obtain from its nearby regional office. Another explanation was that Track Area A had to finance all its buildings, the electrical power and the cost for the traffic control centre, while this was taken care of by a separate organization for Track Area B. It was also possible to find those areas of work that could be mutually coordinated, for example snow removal. However, this was something that needed to be negotiated and was therefore considered a political matter. The implementation phase was the responsibility of the national railway administrations. The results were mainly used as arguments clarifying why the costs were so much higher for the railway line in Country A compared with those of other national lines.
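The cost comparison in Table 23.1 can be retraced with a few lines of code. The following is a minimal sketch, not part of the original study: it assumes the Table 23.1 figures as printed, and the dictionary keys and function names (cost_shares, component_gap) are illustrative only. It checks that the components add up to the printed totals, expresses each component as a share of the total, and shows which components account for the gap between the two track areas.

```python
# Cost per metre of track, copied from Table 23.1 (currency as printed there).
# The key names are illustrative labels, not terms used by the authors.
COMPONENTS = {
    "A": {"maintenance": 285, "overhead": 220, "other_external": 90, "electric_power": 200},
    "B": {"maintenance": 280, "overhead": 8, "other_external": 2, "electric_power": 0},
}
PRINTED_TOTAL = {"A": 795, "B": 290}


def cost_shares(components):
    """Return each cost component as a fraction of the component sum."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


def component_gap(components_a, components_b):
    """Difference (A minus B) per component, showing where the gap arises."""
    return {name: components_a[name] - components_b[name] for name in components_a}


for area, components in COMPONENTS.items():
    # The components should add up to the printed total cost.
    assert sum(components.values()) == PRINTED_TOTAL[area]
    shares = cost_shares(components)
    print(area, {name: round(share, 2) for name, share in shares.items()})

gap = component_gap(COMPONENTS["A"], COMPONENTS["B"])
print(gap)
```

In this reading of the table, almost the whole difference in total cost comes from overhead, external costs and electric power rather than from maintenance itself, which is in line with the explanations given above.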

23.7 Case Study – 2: Internal Benchmarking for Maintenance Contracts

All the maintenance work within Banverket is purchased either from the in-house contractor or from an external contractor. This necessitates legal operations, and maintenance business contracts are prepared and written for every maintenance commission, containing details of the work to be provided, with targets and agreed performance measures (for example, minimum of track down time in order to increase the train punctuality) to control the quality of the maintenance work to be performed. Purchasing infrastructure maintenance is a complex issue due to the engineering complexities of railway assets, safety assurance, the usage type, the climate and the traffic mix. In particular, it is very difficult to define the task to be performed (procured) and the desired final outcome from the contract. Many different procurement models have been tested with varying degrees of success (Larsson 2002).

This benchmarking project was launched at the request of one of the 16 local regional track area managers (clients) responsible for procuring the maintenance contracts. The manager had observed that their contracts with, in this case, the in-house contractor had resulted in an increase in the cost limits, while the performance and the quality had started to decrease. The process started with an internal survey of an ongoing contract. The contract included snow removal and maintenance activities such as corrective maintenance (failure repair and repair due to inspection remarks classifying faults as requiring immediate action), inspections for safety and inspections for maintenance (classified as condition-based maintenance) and predetermined maintenance pin-pointed by the internal regulations. Repair work due to faults not classified in inspections as requiring immediate action was to be bought separately.
The survey showed problems such as a high amount of corrective maintenance, increasing costs for failure repair, an increasing amount of backlogs and a long response time for failure calls. The aim was to find ways to improve the procurement and the next maintenance contract by learning from the experience and knowledge of other regional track areas in this respect. The benchmarking process followed the standard procedure recommended for benchmarking as stated in an earlier section (Section 23.3). The study covered nine local track areas named as Track Areas A–I, and six of these were selected for the study and follow-up of qualitative interviews (D–I).

23.7.1 Metrics and Data

Before starting the collection of data and other relevant information, the existing indicators and indices used by maintenance professionals available in the literature and through professional bodies, for example the EFNMS indices (EFNMS 2006), were examined for their suitability for the purpose of benchmarking maintenance practices in different track regions at Banverket. Most of these metrics were not found suitable for the purpose of this study and therefore actions were initiated to establish indicators that would facilitate this benchmarking process. Furthermore, information and data which were planned to be included in the study, namely details of maintenance-related measures such as maintenance costs, maintenance hours, material, maintenance vehicle costs, overhead costs etc., were missing or only available in the aggregate form, due to the competitive situation. As the deregulation of the railway transport system in Sweden has led to competition among the traffic companies, it was not possible to get hold of traffic data, i.e. how the track was used, because this information is being treated as a business secret by the train operators.

Data from 2002 were collected from the systems for accounting, the failure reports, the inspection remarks, and the asset information and from the train delay reports. The following data were collected:

• Asset data from BIS: total length of track, total length of operated track, total amount of turnouts, total amount of operated turnouts, length of electrification, number of protected level crossings. An attempt was also made to define their standard by the assets’ age and what type of traffic they had been exposed to, the purpose being to know the intensity of track utilization. This had to be skipped as it was not possible to obtain complete data for all the assets and different track lines.
• From the accounting system AGRESSO: snow removal and maintenance costs for one year, defined per maintenance activity corresponding to the maintenance contract (corrective, predetermined, condition-based etc.) and cost per asset type (rail, sleeper, turnout etc.).
• From BESSY (inspection remark system): the number of inspection remarks, classified as remarks requiring immediate attention or deployment of corrective measures or remarks requiring attention or correction in the near future (deferred inspection remarks).
• From OFELIA (failure report system): failure reports (including asset type and type of failure, time to fault localization and time to repair, symptoms and causes, place, date and time), and the time taken to get established at the fault location.
• From TFÖR (train delay system): train delay statistics corresponding to infrastructure failures. TFÖR registers all the train delays and records them together with the respective reported infrastructure failure.
• Contracts and procurement documents.

23.7.2 Data Collection

The data collected from the accounting system needed normalisation in particular, due to difficulties in separating normal track maintenance activities from track renewal activities, as these two concepts were frequently being mixed in the database. There were also some difficulties in using the prescribed terminology, because of misunderstandings in the maintenance context which resulted in the common structure for reporting cost back into the system not being used, and data had to be sorted afterwards into the “right boxes”. Some track areas were using maintenance definitions and concepts from other branches representing the building and construction industry. Some “outliers” were also eliminated from the data, especially those representing some special or just-one-time investments made to increase train punctuality or reduce winter problems.

Cost drivers leading to non-availability of infrastructure for train operation or affecting safety were identified. The respective train delay hours were also retrieved. The cost drivers for the infrastructure were failures or defects in rail, sleepers, rail joints, turnouts, level crossings, and catenaries (overhead wire). On further investigation it was found that the cost related to sleepers could be classified as outliers, because a large amount of the sleepers replaced in the 1990s were delivered with inbuilt defects. These sleepers are being dealt with in a replacement phase within the framework of a large project. In order to find the best internal practice within the organization, two parameters, the “amount of corrective maintenance” and the management indicator “return on fixed asset” (ROFA), were used.

23.7.3 Results and Interpretation

Track Areas A–I are the nine track areas, D–I are those selected by the infrastructure manager for qualitative interviews and Track Areas A–C are references.

The data pertaining to various costs, corrective maintenance, condition-based maintenance and failure and delay statistics from Track Areas A–I for the year 2002 are given in Tables A.1–A.7 of the Appendix to this chapter. When using the parameter ROFA and the rule of thumb concerning the lowest amount of corrective maintenance, Track Areas B, G, C and H were the best performers (see Figure 23.3) and the ROFA measurement showed a tendency of “more money per track metre, less corrective maintenance”; see Figure 23.4.


Figure 23.3. Share of corrective maintenance and preventive maintenance for the nine track areas studied (Espling 2004). [Chart: share of corrective and preventive maintenance, 0–100%, for Track Areas A–I.]

Figure 23.4. Maintenance cost per square metre of track area (Espling 2004). [Chart: maintenance cost in Skr per square metre, scale 0–10, for Track Areas A–I.]

Another comparison was made concerning the maintenance cost per metre within the framework of the maintenance contract for each track region under study. Track Areas H, C and G showed the best practice; see Figure 23.5. It was noted that the maintenance cost per asset or per track metre varies greatly among the compared track areas due to the asset standard, type of wear, climate and type of traffic. To compare the performance, the amount of functional failures and train delay hours were listed as failures or delay hours per metre or per cost-driving asset; see Figure 23.6. Even here the best performance was shown by Track Areas G and H.


Figure 23.5. Maintenance cost in the maintenance contract. [Chart: cost per m, scale 0–140, for Track Areas A–I; legend: predetermined maintenance, maintenance inspection, safety inspection, repair of immediate inspection remarks, failure repair, snow removal.]

Figure 23.6. A comparison of the performance of the different track regions. [Chart: delay hours and amount of failures and immediate inspection remarks per km or per asset, scale 0–25, for Track Areas A–I; legend: inspection remarks/km, failures/crossing, failures/turnout, failures/km, h/catenary, h/turnout, h/track km.]

All these results obtained from the comparison of different track regions, in combination with the content of the maintenance contract defining work specifications within the maintenance contracts, were used for the gap analysis. The gap analysis was conducted with the help of interviews with the track area managers for Track Areas D–I. The best practice criteria were identified with the help of interviews and survey questionnaires. The best practices were:

• Goal-oriented maintenance contracts combined with incentives
• Scorecard perspectives, quality meetings and feedback facilitate management by objectives
• Frequent meetings where top managers from the local areas participate
• Forms for cooperation and an open and clear dialogue, for example partnering
• Focus on increased preventive maintenance of assets with frequent functional failures and a high maintenance cost will give results, e.g. turnouts
• The use of Root Cause analysis


The best practices identified from the benchmarking study were immediately implemented in the new purchasing procedures and documents. These were used for floating tenders and for new contracts by the infrastructure manager for the local track area initiating this benchmark, and resulted in maintenance contracts at a much lower price with better control of quality and performance. The benchmarking study also identified the best practice for gaining control over backlogs by using SMS and other internet-based tools. Besides these, the maintenance contract was also provided with information about goals, objectives and expected incentives related to the execution of the maintenance contracts.

23.8 Case Study – 3: Transparency Among the European Railway Administrations

In an attempt to find ways of benchmarking railway infrastructure administrations as an “external observer” and to give an answer to the question “is there any transparency in the railway systems of Europe?” five railway administrations were selected; see Table 23.3.

Table 23.3. Infrastructure managers (A–E) and important organisational differences

Infrastructure manager   Outsourced maintenance        Traffic operation   Traffic operators
A                        Both external and internal    Free service        Many
B                        Internal outsourcing          Free service        Few
C                        Internal outsourcing          Included            Few
D                        Both external and internal    Free service        Many
E                        Both external and internal    Is bought           Few

23.8.1 Metrics

In this study, many official documents, such as annual reports and regulation letters and documents, were studied in detail in order to gain insight into the types of measures, key performance indicators and indices used by the railway administrations investigated (Åhrén et al. 2005). The collected measures were then compared with those recommended by EFNMS in order to see if these could be used in future benchmarking exercises. Rather soon it was found that the EFNMS indices were developed for factories and plants and were not suitable for studying or benchmarking the performance of infrastructures, as they did not consider the type of asset, the age of the asset, the asset condition or the practice of outsourcing maintenance work in an open market.

23.8.2 Normalisation

Since data were qualitative in nature, no normalisation was carried out for the purpose of this study.


23.8.3 Results and Interpretation

The next step was to group the measurements according to the unit which they measured; for example cost went into the economy group. The parameters collected and reported by the infrastructure managers were then classified into different categories of common denominators. These categories comprised the following: strong denominators (SDs) collected by everyone, medium denominators (SDm) collected by more than 50%, weak denominators (SDw) collected by less than 50%, and finally some indicators (I), also identified as SDs, presented as a percentage value; see Figure 23.7. The results show that economic values, safety, and traffic are strong denominators, followed by quality, assets, and labour. It is important to note that “traffic” is the total traffic volume on a national level. These parameters could later on be used to develop new benchmark measures, e.g. maintenance costs per staff and amount of accidents per traffic volume. Today the comparable indicators are:

• Corrective maintenance cost / total maintenance cost including renewal
• Total maintenance cost / turnover
• Maintenance and renewal costs / cost for asset replacement
• Maintenance cost / track metre
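The classification rule for common denominators described above (collected by everyone, by more than 50%, or by less than 50% of the administrations) can be written down as a small function. This is only a sketch under stated assumptions: the function name classify_denominator and the example counts are hypothetical and are not data from the study, and since the text does not say how a share of exactly 50% should be treated, it is counted as weak here.

```python
def classify_denominator(n_collecting, n_total):
    """Classify a parameter by how many of the administrations report it."""
    if n_total <= 0 or not 0 <= n_collecting <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_collecting <= n_total")
    if n_collecting == n_total:
        return "strong"  # collected by everyone
    # Integer comparison (2 * n > total) avoids floating-point edge cases.
    if 2 * n_collecting > n_total:
        return "medium"  # collected by more than 50%
    return "weak"  # less than 50%; an exact 50% share is an assumption here


# Hypothetical counts for five administrations; not data from the study.
example_counts = {"maintenance cost": 5, "train punctuality": 3, "spare part cost": 1}
classes = {name: classify_denominator(n, 5) for name, n in example_counts.items()}
print(classes)
```

With five administrations, as in this case study, an exact 50% share cannot occur, so the assumption only matters if the rule is reused for an even number of participants.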

When comparing the outcomes of the findings only highly aggregated measures were used for the purpose of analysis, in terms of:

• Economy
• Punctuality
• Safety
• Number of staff employed
• Track quality
• Total traffic volume divided up into passenger and freight kilometres

They can be used as benchmarking measures, the lag indicators showing past performance. This indicates that these areas of interest are important for every studied railway administration. It is also important to note that the identified measures can be defined as outcome measures from the railway maintenance process. It has not been possible to find any measures reflecting the actual maintenance performance. This can probably be explained by the fact that the maintenance activities are carried out by either in-house or external maintenance contractors (Åhrén et al. 2005). Some of the maintenance performance indicators are used by various organizations and provide railways with an opportunity to benchmark their operations internationally to improve their performance. One of the findings in the studies is that there are parameters missing regarding the traffic volume, infrastructure age, and history of the performed maintenance.


Figure 23.7. Classification of possible comparable parameters. [Chart: amount of parameters, scale 0–30, per category (Economy, Asset, Asset history, Labour, Material, Safety, Environment, Quality, Traffic); legend: SDw, SDm, I, Others.]

23.9 Discussion

The reason why most plants do not enjoy best practices in maintenance is that they do not picture how to structure a sustainable improvement process (Oliverson 2000). Benchmarking can then be a tool for waking up organisations and their management in order to find improvement areas that create more value from the business process. However, on the way there are many pitfalls to be aware of, such as starting the process without knowing the starting point and the destination (Oliverson 2000; Wireman 2004). Other pitfalls are:

• Just doing quantitative benchmarking. Quantitative numbers just tell parts of the story, and the difficulty is to start the sustainable improvement process, by focusing on qualitative benchmarking (Oliverson 2000). If the organisation does not have maturity or self-knowledge, it just glances at the figures and continues to do as it always has done before.
• Rejection of the results. Managers often overestimate their performance and react with disbelief to feedback that tells them that their plants are merely mediocre (Wiarda and Luria 1998).
• Not being aware of the need for normalisation of data, including the problem of outliers or comparing “apples with bananas”.
• Not finding the enablers (Wireman 2004).
• Using benchmarking data as a performance goal.
• Believing that it is as easy as just copying the best practice into one’s own organisation, rather than learning.
• Unethical benchmarking.

The methodologies for performing benchmarking for plants are rather well developed, but need to be adapted for infrastructure. Today it is difficult to establish what is included in maintenance, renewal and new investment. Other difficulties are how the infrastructure administrations are organized, for example whether the client and the contractor are within the same organization, whether the maintenance is outsourced, and how it is outsourced; outsourcing makes it difficult to collect costs for overheads, maintenance, man hours, spare parts, backlogs, etc. Today there are a number of performance indicators in use connected to maintenance, covering for example the areas of safety, track quality and asset reliability. Maintenance performance and cost control are the so-called lag indicators.

23.10 Conclusion

Stating that the “benchmarking of maintenance provides gains with relatively little effort” is a truth that needs some modification. First of all, the theory of maintenance is a rather young science, which has resulted in a lack of common nomenclature and understanding of maintenance through value. This is one of the reasons why it is difficult to define what is included in maintenance and where to put the boundaries for renewal. There can also be different structures in use to describe what operation is and what maintenance is, and also for grouping maintenance into preventive and corrective maintenance. Outsourcing maintenance has become popular in recent years, and this makes it difficult to obtain all the necessary measurements, especially if the outsourcing is carried out in a performance contract (lump sum, fixed price). The assets’ complexity and condition are also difficult to compare and measure. The multitude of entities involved in the railway systems after their restructuring has made it considerably difficult to locate the organization responsible for the problems encountered and to ascertain the course of action to be taken to rectify them.

Benchmarking cannot be used if its results are not implemented. The benefits from benchmarking do not occur until the findings from the benchmarking project are implemented and systematically followed up and analyzed against the set targets and goals. The results from the three benchmarking studies presented show that benchmarking is a powerful tool and its methodology can be used by other industries. Since the focus of these case studies is the benchmarking process and not the continuous improvement process, it is important to point out the need for empowered enablers, who will be responsible for identifying the problem, finding a solution to the problem and implementing the solution and the continuous improvement processes.
The case studies also show that there is some more improvement to be made in order to start the whole process of benchmarking including the implementation in an integrated manner.


23.11 Further Research

Further research could be conducted to identify those parameters that are essential for developing lead indicators (Kaplan and Norton 1992) for effective planning and execution of railway infrastructure maintenance tasks, by developing methods to select, evaluate and implement these indicators in open market competition. More metrics, i.e. indicators and a measurement framework, should be developed and reconfigured for maintenance, making comparisons possible, for example from the Life Cycle Cost perspective vis-à-vis the business perspective. In railway administrations, one critical improvement area is enhancement of the quality of the incoming data. This can be achieved:

• By giving details of the status of the assets (age and degree of wear), the total traffic volume per year and the available time on track for infrastructure maintenance. This information should be incorporated as a correction factor in the analysis.
• By well-structured economic feedback reports on maintenance activities. This should be implemented so that it is possible to differentiate resources which are consuming corrective maintenance activities and those consuming preventive maintenance activities. The structure of the economic feedback reports on maintenance should be designed so that it may be possible to differentiate operation and corrective and preventive maintenance.
• By separating the specially targeted maintenance investment from normal “maintenance activities”; efforts to enhance punctuality in special campaign form are an example of the former.

23.12 Acknowledgements

The authors are grateful to Banverket (the Swedish Rail Administration) for sponsoring this research work and providing information and statistics through free access to their database.


Appendix

Table A.1. Failure and delay statistics from Track Areas A-I for the year 2003

Track   Train delay   Train delay   Train delay       Amount of   Amount of   Amount of   Inspection
area    h/track km    h/turnout     h/catenaries km   failures/   failures/   failure/    remarks/
                                                      track km    turnout     crossing    track km
A       1.07          0.25          0.15              4.2         3.5         2.5         4.7
B       0.88          0.33          0.61              3.7         2.9         1.9         3.1
C       0.73          0.21          0.45              2.5         1.68        1.5         4.2
D       0.57          0.29          0.1               3.6         2.24        1.3         2.7
E       0.93          0.76          0.25              4.7         4.59        1.5         1.4
F       0.97          0.36          0.41              3.8         2.22        1.0         0.9
G       0.35          0.14          0.05              2.8         1.28        1.3         3.5
H       0.32          0.31          0.14              2.0         2.24        1.1         3.0
I       1.18          0.84          0.14              6.5         6.1         1.9         3.2
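The statement in Section 23.7.3 that Track Areas G and H show the best performance can be checked against two columns of Table A.1. The sketch below is an illustration only: the values are copied from the table as printed, while the helper names (rank_ascending, best_on_all) and the choice of looking at the three lowest values per measure are assumptions made for this example, not a method described by the authors.

```python
AREAS = list("ABCDEFGHI")

# Values copied from Table A.1, in track-area order A..I.
DELAY_H_PER_TRACK_KM = dict(zip(AREAS, [1.07, 0.88, 0.73, 0.57, 0.93, 0.97, 0.35, 0.32, 1.18]))
FAILURES_PER_TRACK_KM = dict(zip(AREAS, [4.2, 3.7, 2.5, 3.6, 4.7, 3.8, 2.8, 2.0, 6.5]))


def rank_ascending(metric):
    """Track areas ordered from lowest (best) to highest value of the metric."""
    return sorted(metric, key=metric.get)


def best_on_all(metrics, top_n):
    """Track areas that are among the top_n lowest values for every metric."""
    common = set(AREAS)
    for metric in metrics:
        common &= set(rank_ascending(metric)[:top_n])
    return sorted(common)


print(rank_ascending(DELAY_H_PER_TRACK_KM))
print(rank_ascending(FAILURES_PER_TRACK_KM))
print(best_on_all([DELAY_H_PER_TRACK_KM, FAILURES_PER_TRACK_KM], top_n=3))
```

A ranking of this kind only compares outcomes; as noted in the chapter, asset standard, climate and traffic differ between the track areas and are not corrected for here.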

Table A.2. Cost of various maintenance activities in thousands of SEK for each track area for the year 2003

Track area   Snow removal   Corrective maintenance   Preventive maintenance   Contract sum
A            15,325         24,189                   14,130                   53,644
B            16,801         17,792                   12,941                   47,534
C            12,908         28,728                   10,863                   52,553
D            22,085         46,772                   20,537                   89,394
E            18,074         44,168                   21,532                   83,774
F            8,250          39,181                   15,991                   63,442
G            4,336          22,050                   26,388                   52,774
H            3,041          22,854                   19,131                   45,026
I            4,976          46,414                   31,803                   83,193

Normalisation is necessary due to the investment of extra money just for one year to enhance the preparedness to deal with failures causing train delays. The figures in Table A.2 are the figures before normalisation
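As an illustration of how the share of corrective maintenance might be derived from Table A.2, the sketch below computes corrective cost divided by the sum of corrective and preventive cost for each track area. Two assumptions should be kept in mind: snow removal is left out of the share, and the figures are the pre-normalisation values as printed, so the result need not coincide with Figure 23.3. The sketch also lists the rows whose three cost columns do not add up to the printed contract sum, a simple check of the kind of data quality issue discussed in this chapter.

```python
AREAS = list("ABCDEFGHI")

# Values copied from Table A.2 (thousands of SEK, before normalisation).
SNOW = [15325, 16801, 12908, 22085, 18074, 8250, 4336, 3041, 4976]
CORRECTIVE = [24189, 17792, 28728, 46772, 44168, 39181, 22050, 22854, 46414]
PREVENTIVE = [14130, 12941, 10863, 20537, 21532, 15991, 26388, 19131, 31803]
CONTRACT_SUM = [53644, 47534, 52553, 89394, 83774, 63442, 52774, 45026, 83193]


def corrective_share(corrective, preventive):
    """Corrective cost as a fraction of corrective plus preventive cost."""
    return corrective / (corrective + preventive)


shares = {
    area: corrective_share(c, p)
    for area, c, p in zip(AREAS, CORRECTIVE, PREVENTIVE)
}

# Rows where snow removal + corrective + preventive differs from the printed sum.
mismatched = [
    area
    for area, s, c, p, total in zip(AREAS, SNOW, CORRECTIVE, PREVENTIVE, CONTRACT_SUM)
    if s + c + p != total
]

for area in sorted(shares, key=shares.get):
    print(area, round(shares[area], 3))
print("rows not adding up to the contract sum:", mismatched)
```

Whether the rows that do not add up reflect rounding, a printing error or a cost item not shown in the table cannot be decided from the table alone.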


Table A.3. Costs in thousands of SEK for corrective maintenance due to failure reports from s for the year 2003 Track area

A B C D E F G H I

Maintenance organisation (personnel, machines, spare parts)

Emergency organisation

Actual cost

2880a 4416a 3732a

7,989 6,145 4,128 11,448 16,078 14,095

3512a

7,785 20,274

4701 4776 4884

Fixed price (lump sum)

6304

Total cost (t SEK)

10,869 10,861 7,860 16,150 20,854 18,897 12,686 11,444 28,246

SEK/ failure

5933 5273 4690 5379 5073 5530 5838 6065 145

a Extra preparedness 2003

Table A.4. Cost statistics for corrective maintenance triggered by the failure reporting system ofelia (in thousands of SEK) after normalisation Track area

A B C D E F G H I

Maintenance organisation

Emergency organisation

2156 1472 4701 4776 4884 3512

Actual cost

Fixed price (lump sum)

7,989 6,145 4,128 11,448 16,078 14,095 7,785 20,274

6 304

Total cost (t SEK)

7,989 8,601 5,600 16,150 20,854 18,897 12,686 11,367 28,246

SEK/ failure

1832 2060 1676 3002 4111 3417 2173 1887 5490


Table A.5. Reported corrective maintenance caused by inspection remarks classifying faults as requiring immediate repair; also including activities such as inspection and conditionbased and predetermined maintenance that should have been booked under other codes in the accounting system (before normalisation of the data) Track area

A B C D E F G H I

Inspection remarks calling for immediate repair

Mixes of inspection remarks calling for immediate repair and CBM Remarks

13,320 6,931 12,355 16,361 10,864 9,963

Inspection cost including inspection remarks calling for immediate repair

Operational actions due to predetermined maintenance

Care of electrical assets due to predetermined maintenance

1485

Conditionbased maintenance

7081 7614 1962 3194

3558 4732 4289

3091 1486 2756

11,107 18,169

168 303

Total cost

13,320 6,931 20,921 30,638 19,044 20,383 9,346 11,410 18,168

Table A.6. Reported corrective maintenance caused by inspection remarks classifying faults as requiring immediate repair; also including activities such as inspection and conditionbased and predetermined maintenance that should have been booked under other codes in the accounting system (after normalisation) Track Inspection remarks area calling for immediate repair

A B C D E F G H I

13,320 6,931 12,355 16,361 10,864 9,963 11,107 18,169

Inspection remarks calling for immediate repairbooked under inspection

995 1904 491 799

Corrective maintenance booked as inspection in the accounting system

1506 1553 8

916

New total cost

13,320 6,931 13,350 19,771 12,908 10,770 9,346 11,410 19,084


Table A.7. Condition-based maintenance bought as extra orders in thousands of SEK, but including the so-called special maintenance activity

Track area   Original accounting sum   Minus defective sleepers   New sum
A            32,319                                               32,319
B            43,831                                               43,831
C            44,139                                               44,139
D            6,607                                                6,607
E            81,720                    –60,913                    20,807
F            53,797                    –27,972                    25,825
G            50,753                                               50,753
H            45,198                    –7,680                     37,518
I            63,426                    –12,722                    51,004
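The outlier treatment behind Table A.7, i.e. removing the cost of the defective sleepers from the accounting sum, amounts to a simple subtraction per track area. The following sketch retraces it under one explicit assumption: the four deductions are assigned to Track Areas E, F, H and I because those are the areas whose printed new sum differs from the original sum. The function name remove_outlier_cost is illustrative only.

```python
# Values copied from Table A.7 (thousands of SEK).
ORIGINAL = {"A": 32319, "B": 43831, "C": 44139, "D": 6607, "E": 81720,
            "F": 53797, "G": 50753, "H": 45198, "I": 63426}
# Assumption: deductions assigned to the areas whose sum changes in the table.
DEFECTIVE_SLEEPERS = {"E": 60913, "F": 27972, "H": 7680, "I": 12722}
PRINTED_NEW_SUM = {"A": 32319, "B": 43831, "C": 44139, "D": 6607, "E": 20807,
                   "F": 25825, "G": 50753, "H": 37518, "I": 51004}


def remove_outlier_cost(original, outlier):
    """Subtract the outlier cost (if any) from each track area's original sum."""
    return {area: cost - outlier.get(area, 0) for area, cost in original.items()}


normalised = remove_outlier_cost(ORIGINAL, DEFECTIVE_SLEEPERS)

# Areas where the recomputed value differs from the value printed in the table.
differences = {
    area: (normalised[area], PRINTED_NEW_SUM[area])
    for area in ORIGINAL
    if normalised[area] != PRINTED_NEW_SUM[area]
}
print(normalised)
print(differences)
```

Under this assignment the recomputed values match the printed ones for all areas except Track Area I, where the subtraction gives a slightly lower figure than the printed new sum; the table itself does not explain the difference.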

23.13 References Almdal, W. (1994), “Continuous improvement with the use of benchmarking”, CIM Bulletin, Vol. 87 No.983, pp.21–26 Burke, C.J. 2004. 10 steps to Best–Practices Benchmarking. http://www.qualitydigest.com/feb/bench.html Campbell, J.D. (1995). Uptime: Strategies for Excellence in Maintenance Management, Productivity Press, Portland, US Dunn, S. (2003), Benchmarking as a Maintenance Performance Measurement and Improvement Technique. Assetivity Pty Ltd, http://www.plant-maintenance.com/maintenance_articles-Performance.shhtml EFNMS (2006), http://www.efnms.org/efnms/publications/13defined101.doc Espling, U. (2004), Benchmarking av Basentreprenad år 2002 för drift och underhåll, Research Report, LTU 2004:16, (In Swedish). Hägerby, M., Johansson, M. (2002). Maintenance performance assessment: strategies and indicators. Master thesis, Linköping, Linköpings tekniska högskola, LiTH – IPE Ex arb 2002:635. Kaplan, R.S. and Norton, D. P. (1992), The Balanced Scorecard: the measures that drive performance, Harvard Business Review, Jan–Feb (1992), pp. 71–79. Larsson. L. (2002). Utvärdering av underhållspiloterna, delrapport 1. Banverket F02§713/AL00. (In Swedish). Liyanage, J.P. and Kumar, U. (2003). Towards a value-based view on operations and maintenance performance management, Journal of Quality in Maintenance Engineering, Vol. 9, pp. 333–350. Malano, H. (2000), Benchmarking irrigation and drainage performance: a case study in Australia. Report on a Workshop 3 and 4 August 2000, FAO, Rome, Italy. Mishra, C., Dutta Roy, A., Alexander, T.C. and Tyagi, R.P. (1998), Benchmarking of maintenance practice for steel plants, Tata Search 1998, 167–172. Moulin, M. (2004), Eight essentials of performance measurements, International Journal of Health Care Quality Assurance, Vol .17, Number 3. pp. 110–112. Oliverson, R.J. (2000), Benchmarking: a reliability driver, Hydrocarbon Processing, August 2000, pp. 71–76. Ramabadron, R., Dean Jr J.W. and Evans J.R. 
(1997), Benchmarking and project management: a review and organisational model, Benchmarking for Quality Management & Technology, Vol. 4, No. 1, pp. 437–458.


Stalder, O., Bente, H. and Lüking, J. (2002), The Cost of Railway Infrastructure. ProM@ain – Progress in Maintenance and Management of Railway Infrastructure, 2, pp. 32–37. http://promain.server.de Varcoe, B.J. (1996), Business-driven facilities benchmarking, Facilities, Vol. 14. Number 3/4, March /April, pp. 42–48, MCB University Press. Wiarda, E.A. and Luria, D.D. (1998), The Best-practice Company and Other Benchmarking Myths Wireman, T. (1998), Developing Performance Indicators in Maintenance. New York: Industrial Press Inc. Wireman, T. (2004), Benchmarking Best Practice in Maintenance Management. New York: Industrial Press Inc. Zairi, M. and Leonard, P. (1994). Practical Benchmarking: the Complete Guide. London: Chapman and Hall. Zoeteman, A. and Swier, J. (2005), Judging the merits of life cycle cost benchmarking, in Proceedings International Heavy Haul Association Conference, Rio de Janeiro June, Åhrén, T., Espling, U. and Kumar, U. (2005), Benchmarking of maintenance process: two case studies from Banverket, Sweden, in Conference proceedings of the 8th Railway Engineering Conference, London June 29–30. Åhrén, T. and Espling, U. (2003), Samordnet/Felles drift av Järnvägen Kiruna – Narvik (confidential). Luleå, Luleå tekniska universitet (In Swedish).

24 Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets Jayantha P. Liyanage

24.1 Introduction
There is clear and growing interest today in the development and use of e-maintenance concepts for industrial facilities. This is particularly visible in the offshore oil and gas (O&G) production environment in the North Sea, in relation to a major re-engineering process termed ‘integrated operations’ (IO) that began in 2004–2005 as a new development scenario for the offshore industry (OLF 2003). Major challenges to conventional operations and maintenance (O&M) practice have been seen as unavoidable under this new IO initiative. Subsequently, the industry began to take a serious interest in novel and smart solutions for O&M. The developments began in 2005, seeking long-term changes to conventional O&M practice. The change process was relatively slow during the 2005–2006 period, but has since gathered a gradual and steady pace. This is a large-scale change, and hence the current plan is to realize a fully functional e-operations and e-maintenance status by about 2012–2015. Even though the integrated e-operations and e-maintenance applications in the North Sea are still at their inception, the learning process and the state of current knowledge can be very valuable for similar efforts to develop and implement novel solutions in other industries and/or regions of the world. Current developments in Norway show that the smart use of advanced information and communication technology (ICT) is a principal driving factor in the development and implementation of novel and smart solutions to realize e-maintenance (Liyanage and Langeland 2007; Liyanage et al. 2006). In principle, it seeks to establish better offshore-onshore connectivity and interactivity, enhancing decisions and work processes. The emerging O&M practice will be based on a smart blend of application technologies, novel managerial solutions, new organizational forms, etc. to enable 24/7 online real-time operating modes.
The new set of O&M solutions for North Sea offshore assets is not simply about the use of some form of core technologies for electronic data acquisition and so on, but a large-scale re-engineering process dedicated to making a significant change to the conventional O&M practice based on a solid technical platform. It is noteworthy that, even though the changes within O&M are by far mostly technology-dependent, their managerial implications are inevitable, and managerial changes have to be properly blended into the technology-based change. Such an integrated change is very critical in terms of the technical and safety integrity of assets, and the subsequent commercial impact in terms of production, plant economics, and safety and environmental performance. Ongoing developments in Norway are a good example of how an industry-wide re-engineering process has triggered major changes in O&M practice, leading the path towards integrated e-operations and e-maintenance. The integrated e-operations and e-maintenance initiative in Norway is not a standalone, short-term technical change limited to O&M, but an integral part of a wider, long-term development process that combines various technical disciplines and different sectors of the industry in search of an optimum long-term solution. In this context, there are two salient features that define the future of e-based O&M practice in the Norwegian O&G industry:

• Integration with other technical disciplines that have major roles in the realization of fully functional 24/7 online real-time operational status
• The important technological and managerial change that the e-approach has to incorporate to ensure fail-safe status

Owing to the growing interest in and importance of e-operations and e-maintenance, learning from different application scenarios in various industries is timely. This chapter shares current experience and knowledge with reference to ongoing developments in the Norwegian O&G industry. The chapter highlights current offshore asset maintenance practice, the changing technical and economic environment that leads the path towards an e-approach, the development and implementation of integrated e-operations and e-maintenance solutions in the North Sea, key features of the e-approach in North Sea assets, and future challenges to becoming fully integrated and fail-safe. The specific acronyms and their application definitions are given in Section 24.2, and Section 24.3 contains some recent reflections on the work on e-maintenance. Section 24.4 gives a brief introduction to offshore asset maintenance. It describes current thinking, practice, and visible trends. The technical and economic environment that guides a shift towards e-operations and e-maintenance is discussed in Section 24.5. It illustrates some of the major drivers that demand technological and managerial integration in the search for comprehensive solutions for offshore assets in the North Sea. The section that follows (Section 24.6) highlights issues related to the development and implementation of integrated e-operations and e-maintenance solutions on the Norwegian continental shelf. The major features of the e-approach for operations and maintenance are highlighted in Section 24.7, which pays specific attention to the diagnostic and prognostic technologies and the emerging infrastructure (i.e. ICT network, onshore centers) for their implementation and use. Since the emerging environment represents a step change towards a more complex operational setting, there are numerous challenges to realizing a reliable, fully integrated status and to remaining fail-safe. Section 24.8 briefly covers these issues, and highlights the critical role and specific features of intelligent watchdog agent technology in this context. This section also underlines some of the important non-technical issues that play pivotal roles in being fully integrated and fail-safe.

24.2 Acronyms and Application Definitions
Acronyms given in Table 24.1 are used throughout the chapter.

Table 24.1. Acronyms in the chapter and their application definitions

Acronym   Application definition
B2B       Business-to-business
CBM       Condition based maintenance
CMMS      Computerized maintenance management system
CV        Confidence value
D2A       Decisions-to-actions
D2D       Data-to-decisions
ERP       Enterprise resource planning
ICT       Information and communication technology
IO        Integrated operations
IT        Information technology
LAN       Local area network
NCS       Norwegian continental shelf
NOK       Norwegian Crowns (Norwegian currency)
NPD       Norwegian Petroleum Directorate
OLF       Norwegian Oil Industry Association
OOC       Onshore operational center
OSC       Onshore support center
O&G       Oil and gas
O&M       Operations and maintenance
PDA       Personal digital assistant
PM        Preventive maintenance
PSA       Petroleum safety authority
RUL       Remaining useful life
R&D       Research and development
SAP       A commercially available business ERP system
SOIL      Secure Oil Information Link
WAN       Wide area network


24.3 Current Reflections on e-Maintenance
Over the last few years, e-maintenance has drawn the attention of industry and academia alike. With growing attention to and interest in near-zero-downtime performance, cost-effective maintenance strategies, data-dependent decision support systems, etc., conventional maintenance practices have been substantially challenged during the last couple of decades (Hansen et al. 1994; Bonissone 1995; Emmanouilidis et al. 1998; Khatib et al. 2000; Roemer et al. 2001; Koc and Lee 2001; Swanson 2001; Djurdjanovic et al. 2002; Wang 2002; Iung 2003; Yen 2003; Arnaiz et al. 2005; Han and Yang 2006). Subsequently, industrial practice gradually showed some inclination to adopt condition monitoring as a strategic tool to resolve some major challenges in various plants, facilities, and industrial settings. The emergence of various condition monitoring solutions coupled with data acquisition and presentation software appears to have laid a good foundation for further development of technology-based maintenance solutions, leading the path towards diagnostics and prognostics (Liao et al. 2005; Emmanouilidis et al. 2006; Jardine et al. 2006). Current waves of interest in a range of e-maintenance solutions are largely dependent on parallel developments in information technology infrastructures and communication technologies that enhance online communication, remote monitoring capabilities, remote expert assistance, etc. As R&D activities gradually progress in search of novel solutions to conventional condition monitoring practices, more advanced solutions have begun to appear, generating a strong focus on intelligent maintenance solutions (Sanz-Bobi et al. 2002; Iung 2003; Lee 2004; Moore and Starr 2006). The trend appears to be towards more robust and comprehensive technical solutions in which data acquisition, processing and interpretation, and decision support components are integrated. Along this line of practice, developments in the discipline seem to be progressing towards intelligent e-maintenance solutions. Furthermore, some interesting work has also been performed incorporating, for instance, neural networks, expert systems, fuzzy logic, genetic algorithms, multi-agent platforms, and case based reasoning (Liang et al. 1988; Yager and Zadeh 1992; Jantunen et al. 1996; Lee 1996; Chande and Tokekar 1998; Sanz-Bobi and Toribio 1999; Yang et al. 2000; Garcia and Sanz-Bobi 2002; Marceguerra et al. 2002; Yu et al. 2003; Palluat et al. 2006). Moreover, the growth of R&D activities has resulted in the introduction of novel application concepts and products such as PROTEUS (Bangemann et al. 2006), EXAKT (Jardine et al. 1998), Watchdog agents (Djurdjanovic et al. 2003), SIMAP (Garcia et al. 2006), etc. Obviously, condition monitoring and e-maintenance solutions have already shown substantial potential for wider industrial applications. However, the type of solutions required and the nature of the practical applications may differ from one setting to another, depending on the commercial challenges and the available technical infrastructure. This chapter gives an overview of this with reference to current developments in the North Sea offshore asset management environment.


24.4 Offshore Asset Maintenance
An O&G production asset is in general at the heart of the petroleum business, and the oil or gas it produces is in fact its life-blood. A production asset can by and large be seen as a property with an economic value. Such a physical property can involve a number of integrated modules. In O&G terms, a production asset comprises modules for extraction, processing, treatment and supply of raw oil and/or gas to a refinery or straight to a market (also see Liyanage 2003). With reference to the physical process of O&G production, a production asset primarily consists of:

• Reservoir(s) containing oil and/or gas
• Production and injection wells
• Production platform(s) and drilling/injection rig
• Pipes for export of production

An O&G production asset is a complex mechanical design involving various machinery, tools, mechanisms, etc. in the production process. The production process comprises different stages and has four major technical processes, namely:

• Reservoir engineering, drilling and well intervention
• Development and modifications
• Operations and maintenance
• Logistics and support services

The O&M process plays a major role on production platforms (the so-called ‘topside’) and rigs. Production and injection wells also require some maintenance, but this is a highly specialized technical area. This chapter mainly covers O&M aspects of the topside. O&M, inclusive of testing and inspection, is an important discipline in terms of the technical condition and the mechanical integrity of an O&G asset. The necessary functional and technical conditions are achieved through a blend of O&M strategies, programs, and technologies. A diversity of O&M strategies and management practices may be necessary during the life of an asset, which in general is in operation for 20–30 years of commercial production. The challenges to plant O&M can be quite dramatic, particularly at the beginning and end of the production life cycle, i.e. in the startup phase and in the tail-end production phase (when production begins to decline gradually). During the various stages of the life cycle, demands for maintenance can also vary, for instance due to design flaws, varying operational conditions (pressure, temperature, etc.), ageing equipment, outdated O&M procedures, modifications, and so on. The fact that a good number of production platforms on the Norwegian shelf are at the moment in the maturity and tail-end phases of production poses a significant challenge and demands novel solutions to improve maintenance practices.

Obviously there is a common cause for performing O&M activities in various O&G production assets, i.e. commercial, or statutory and regulatory. However, there can be differences among the O&M programs and practices of various producers. Such variations can exist, for instance, due to the age of installations and equipment, scale of production operations, level of technological complexity, competence availability, budgeted operating costs, etc.

Preventive maintenance (PM) tasks account for the larger portion of the maintenance work performed on offshore installations. Such PM programs can be based on industry practices, third party recommendations, or reliability analysis. PM programs are built into running maintenance plans and thus are executed as calendar-based or periodic maintenance tasks. The planning process can for instance be done on a 3-month or 7-week basis, and the plan can be frozen weekly for execution offshore. One of the major concerns related to current maintenance practice is the consequences if the PM on equipment in offshore plants exceeds what is actually required. Excessive PM has significant commercial implications in terms of production interruptions, although it does ensure compliance with strict regulatory requirements, particularly for safety critical equipment.

Lately, there seems to be some general preference for the use of condition monitoring techniques and risk-based methods. While methods such as risk-based inspection are already available, technology experts believe that applying CBM techniques together with risk computation can be of great benefit, as it can greatly facilitate ‘need based maintenance’. This implies that the experts can precisely identify at which point in time certain maintenance tasks have to be performed, based on risk-conscious decisions. This is expected to bring substantial commercial benefits by prolonging maintenance intervals and thus reducing production interruptions. However, CBM techniques have conventionally not been widely applied other than on an ad hoc basis or on special rotating machinery such as turbines. It has so far been a challenge to make effective use of condition monitoring in the production facilities on the Norwegian shelf.
Some applications are in use, such as vibration monitoring on heavy rotating equipment, thermography on electrical equipment, and oil analysis. However, many producers have been struggling for quite some time to capitalize on the inherent potential of CBM technologies. The underlying bottlenecks are largely related to the physical distance between offshore assets and the onshore support organization, the availability of expertise on site at a moment’s notice, and the reluctance of some producers to initiate a quick response based solely on a CBM expert’s opinion, since they conventionally rely heavily on the original equipment manufacturer’s recommendations and guidelines, etc.

O&M organizations, on the other hand, appear gradually to be becoming more team-based. Recent downsizing moves, an ageing workforce, and the ongoing efforts to integrate maintenance and operational crews have contributed much to this trend. The way in which such teams are formed and the way they carry out work may vary from one situation to another. Work teams can be dedicated to individual plants (i.e. dedicated work teams), and certain teams may also be involved in campaign maintenance (i.e. fly-out maintenance) tasks. Campaign activities imply that, while dedicated work teams carry out asset-specific tasks, there are teams (called campaign teams) with specific technical expertise (e.g. for turbine maintenance) who carry out certain specific PM tasks in addition to the dedicated maintenance personnel. They fly across platforms attending to pre-assigned tasks in accordance with maintenance programs registered in the system. Administratively, campaign teams may be responsible to the maintenance manager, while functionally they may be responsible to individual assets. In addition, certain technical expertise can also be called in for specialized maintenance tasks, either from the onshore support organization of the producer or from a third party maintenance contractor. Such competence compensation strategies are relatively ad hoc and can take place on a need-by-need basis.

Regardless of the various strategies aiming at the best use of available competencies, the O&G industry in the North Sea has suffered from a scarcity of competent labor for some years now. The situation has been further aggravated by the ageing workforce: in some instances about 50–60% of competent and experienced personnel are said to be reaching retirement age within a few years. Major producers have already begun to resort to options such as outsourcing and insourcing to compensate for the growing competency gaps. Despite third parties playing pivotal roles in the entire O&G production process, contract management seemingly remains relatively undeveloped and is the subject of continuing discussion and debate. It is often pointed out that the potential of external maintenance technicians, engineering contractors, CBM experts, etc. has not been fully capitalized on. However, under the competence mapping and outsourcing-insourcing efforts by the producers, some of the issues have been taken up for open discussion. There seems to be relatively wide acknowledgement today that the knowledge industry external to the producers has significant potential, and that more prudent use of such resources is important for the long-term benefit of the industry. The industry in general has begun to explore win-win options to establish better commercial relationships between producers and third parties. Also, there is a growing inclination today towards the use of more advanced technical means for doing maintenance.
More and more maintenance free products and exotic gadgets have drawn the attention of the engineers. One such example is usage of smart sensors for gas detection, whose built-in self-testing capability removes the requirement for periodical inspection and functionality tests. Equally, the dependence on IT tools for O&M management has gradually increased with notable improvements in data and knowledge management capabilities. Certainly recent developments in the IT sector have specific effects on maintenance planning and decision-making processes. O&M management tools are often seen built into corporate ERP systems such as SAP, but the effective use of such capabilities and the efficiency with which they are put into use still need some major improvements. In many cases, the biggest problem notably is the use of different databases without an effective configuration for data acquisition. Given the large volume of sensitive data accumulated and stored in those databases, effective and efficient data management is often seen as a daunting and a resource consuming task.
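The idea of ‘need based maintenance’ described in this section, in which a condition reading is combined with a risk computation to decide when a task is actually due, can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration only: the equipment names, the vibration values, the linear mapping from condition reading to failure probability, the cost figures and the risk limit are not taken from the chapter or from any operator’s practice.

```python
from dataclasses import dataclass


@dataclass
class Equipment:
    name: str
    vibration_mm_s: float    # latest condition reading (hypothetical)
    alarm_level_mm_s: float  # reading at which failure is assumed near-certain
    failure_cost: float      # assumed consequence of a failure, e.g. lost production


def failure_probability(eq: Equipment) -> float:
    """Assumed linear map from condition reading to probability of failure
    before the next planning window, clamped to the range [0, 1]."""
    return max(0.0, min(1.0, eq.vibration_mm_s / eq.alarm_level_mm_s))


def risk(eq: Equipment) -> float:
    """Risk taken as probability multiplied by consequence."""
    return failure_probability(eq) * eq.failure_cost


def needs_maintenance(eq: Equipment, risk_limit: float) -> bool:
    """A task is due only when the computed risk reaches the chosen limit."""
    return risk(eq) >= risk_limit


# Hypothetical fleet: a poor reading on cheap equipment may still be tolerable,
# while a moderate reading on critical equipment may not be.
fleet = [
    Equipment("gas turbine A", 2.0, 10.0, 1_000_000.0),
    Equipment("export pump B", 8.0, 10.0, 400_000.0),
    Equipment("fan C", 9.0, 10.0, 20_000.0),
]
due = [eq.name for eq in fleet if needs_maintenance(eq, risk_limit=250_000.0)]
print(due)
```

In contrast to a calendar-based PM plan, where all three items would be serviced at the same fixed interval, only the item whose risk exceeds the limit is selected here. A real application would replace the linear probability map with a model justified by reliability data.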

24.5 e-Approach: Changing Technical and Economic Environment
The global industrial environment is being strongly challenged today in both engineering and management terms. There is a clear growth of application technologies, engineering techniques, organizational forms, management principles, cooperative policies, etc. to cope with the complex socio-economical and techno-political change processes. The trend of deviation from conventional wisdom and practices has become more and more clear, seeking to adopt creative, innovative, and smart solutions to manage complex systems for commercial advantage (During et al. 2004; Hosni and Khalil 2004; Russell and Taylor 2006). With the growth of business uncertainties, the enterprise risk profile has become more complex, demanding more flexible, collaborative, and open strategies to support various operational activities in industrial plants and facilities. The emerging commercial environment has already indicated a greater reliance on new technological and managerial solutions to manage important asset processes such as O&M, establishing a new landscape for commercial activities. This seems to be a generic trend across almost all commercial business sectors, though to varying degrees, in which the dependence on advanced technological solutions to manage complex technical systems is rapidly growing. The resulting environment will obviously be very dynamic, enabling key stakeholders of complex technical systems to remain intact within an extended live network (Wang et al. 2006). The production, manufacturing, and process industries are seen to be directly affected by the new demands and the wave of subsequent changes. Technologically complex and high-risk businesses in particular cannot afford to divert their management strategies for complex assets away from the mainstream technology-driven change. Today different industrial sectors are seen adopting various novel and integrated solutions to manage their industrial assets and internal processes to realize major commercial benefits. More often than not, rapid advancement in information and communication technologies (ICT) has been a catalyst for progress in technology applications (e.g. diagnostic technologies) and data management solutions, particularly for complex systems such as offshore oil and gas (O&G) production platforms.
O&G activities on the NCS began in the early 1970s with the discovery of the great Ekofisk asset. Ever since, NCS has been a major supplier of oil to the world energy market. Today, after more than 30 years of continuous production, NCS has stepped up to its peak level. Despite the fact that NCS foresees a gradual decline after 2010 or so, the remaining potential is known to be substantial. But the future is known to have a unique set of challenges with a major need to enhance the recovery efficiency so that the commercial lives of major production assets can be extended by another 40–50 years. By 2003–2004, the forthcoming challenges to O&G exploration and production activities in North Sea became very obvious. The major part of the industry became relatively more inclined to resort to advanced application technologies to address underlying commercial risks. At the same time the industry has been undergoing some other challenges widely acknowledged as serious impediments to future growth on NCS. For instance, the industry has been experiencing some major setbacks in attracting talent, and in centralizing core competencies. The problem has been further aggravated by the ageing workforce with no suitable remedy to solve competency gaps. Industry restructuring has been seen by the majority as a feasible solution to provide a tighter integration and partnerships with the knowledge industry. Table 24.2 illustrates the complex set of economical and technical drivers that challenged the conventional practices in the North Sea O&G production environment.


Table 24.2. Technical and economical factors that contributed to a step-change in North Sea asset management practice, also introducing changes to conventional O&M practices

Risk and uncertainty profile: Risk and uncertainty profile is seen to be too large to ignore due to maturing assets, declining production, rising lifting costs, discovery of marginal fields, declining investments for developments, lower recovery efficiency, etc.

Commercial incentives: The underlying commercial incentives of a major change have been very convincing. This mainly includes substantial enhancement in production recovery at least by 10% or more, significant reduction in operating costs, and major safety and environmental benefits.

Technical and managerial setting: The emerging technological and business environment has offered its own solutions to counter major problems. Such solutions seem to be feasible through application technologies and ICT solutions, business-to-business collaboration forms, closer inter-disciplinary integration to jointly manage offshore activities, and standardized platforms for dynamic data and knowledge sharing.

Business conditions: Various other industrial circumstances have also constantly been demanding some form of change to the conventional industry practice. This is primarily due to remaining substantial un-tapped value potential, major need for more open and flexible partnerships, emerging competency gaps, obsolete technologies and ageing equipment, more complex and new kinds of challenges in production settings, etc.

Under such circumstances, the risk-and-uncertainty profile on the NCS appeared too high for major O&G producers to ignore. This brought major momentum to challenge the conventional practices of core technical disciplines such as O&M, drilling, etc. Subsequently, key stakeholders stepped in directly to re-engineer the conventional practices, targeting long-term commercial advantages. Thus, the O&G business in Norway has, since 2004, stepped into what has been termed integrated e-operations as a new development scenario for the continental shelf. This is known as the ‘third efficiency leap’ for O&G activities on the Norwegian shelf. It was further envisioned by the Norwegian parliament through report no. 38: 2003–2004. Today, this has become an industry-wide program with major national interests, drawing billions of NOK in investment from various sources for re-engineering tasks and further development. Under integrated e-operations, major improvements are expected in three technical asset processes, namely:

• Drilling and well intervention
• Reservoir management and production optimization
• Operations and maintenance

O&M drew the attention of the industry slightly later than the other two technical disciplines, but is widely acknowledged today as a technical process with substantial improvement potential. In fact, some signs of development within O&M began to appear in the 2005–2006 period. Nevertheless, it has been known for some time that the conventional O&M process has large limitations, and some well-established O&M policies have been seen as a significant hindrance to cost-effective and efficient solutions. The integrated e-operations–e-maintenance concept for North Sea assets brought forward a long-term development path for the O&M process from 2005 onwards, with substantial opportunities to:

• Test out and implement new technological solutions, particularly those enabling predictive maintenance capabilities
• Implement more robust technical platforms for effective O&M data management
• Establish new organizational forms to compensate for a lack or shortage of experienced O&M personnel
• Standardize the technical language in use between different stakeholders to enhance communication and cooperation
• Provide fast access to technical experts in demanding and urgent situations
• Build an agile competence network to enhance decisions and activities

Experience so far suggests that ongoing activities will eventually result in a relatively more dynamic and complex functional environment. However, it is also noteworthy that e-operations–e-maintenance is a sensitive change process in terms of safety and security, and thus has its own challenges in becoming fully functional and fail-safe.

24.6 Development and Implementation of Integrated e-Operations e-Maintenance Solutions in the North Sea
The new integrated e-operations–e-maintenance scenario, as mentioned above, has its major focus on changing conventional practices. The Norwegian O&G industry has in principle looked to the smart use of advanced application technologies and information solutions as the driving forces to push forward smart O&M solutions for offshore assets. Implementation of new solutions depends on the four factors shown in Figure 24.1. These are:

• Advanced technologies that enhance the maintainability of assets
• Digital IT infrastructure that enhances reliable transfer and exchange of O&M data between different stakeholders
• Active operational networks that connect, live, the producer’s O&M personnel with those of engineering contractors, original equipment manufacturers, logistics suppliers, other external technical groups, etc.
• Business-to-business (B2B) collaborative partnerships that lay the foundations to create a reliable information and knowledge network

[Figure 24.1 shows four elements: Advanced technologies; Digital infrastructure; B2B collaborative partnerships; Active operational networks]

Figure 24.1. e-Approach to O&M in North Sea assets in principle relies on fourfold aspects

Partner industries, particularly those related to electronic and communication technologies, have major effects on the new O&M environment, as they have to provide a stable and reliable technical infrastructure to support O&M decisions and work processes. As of today, there are clear indications of the use of the following technologies for integrated O&M solutions:

• Web-based data exchange and communication networks
• Real-time visualization and simulation tools
• Equipment with built-in smart electronics and advanced functionalities
• Online diagnostic and prognostic tools and methods
• Process automation and real-time data acquisition techniques
• Online video conferencing and monitoring facilities

Wireless communication capability appears to be a major step forward in the new O&M environment. Smart technological products such as ‘VisiWear’ are already in use. This is a man-wearable technology with live video and audio communication capabilities between offshore and onshore. The technology-dependent change has direct implications for the establishment of new organizational forms to bring improvements to conventional O&M management practice. This is achieved through an active blend of digital technologies and infrastructures with active operational networks and business-to-business collaborative partnerships. The ongoing rapid reformation of industry infrastructure aims to enhance the live interactivity between the different stakeholders involved in O&M decisions and activities. This helps the systematic establishment of tight online and real-time collaborative partnerships between the O&M crew on offshore assets and those positioned in the onshore support system. In fact, the current tendency to test out and implement novel O&M solutions actively seeks options to effectively bring in other sectors of the industry (e.g. engineering contractors, equipment suppliers, technical expert centers, spare-part vendors, logistics, etc.). It is in this application context that large-scale ICT networks and web-enabled solutions play a key role in establishing the necessary connectivity and interactivity between dispersed groups and organizations. It implies that integrated O&M solutions, as experienced in the North Sea offshore environment, break the conventional boundaries, for instance advancing:

• From in-house competencies to collaborative shared-expertise
• From centralized databases to open data management landscapes
• From localized on-the-site diagnostics to remote monitoring
• From on-the-site O&M expert interventions to tele-consultancy capabilities, etc.

Figure 24.2 is a schematic diagram of the technical infrastructure in the North Sea that facilitates realizing integrated e-operations–e-maintenance.

[Figure 24.2 (schematic). Labels: sources of data; asset operator; satellite; equipment and spare parts; experience data; offline-online technical data; intelligent systems and components; offshore asset; direct visualization; central data hub; fiber-optic network; advanced fiber-optic based and wireless information and communication network; IP-VPN/ADSL based access; logistics and emergency; offshore O&M contractors; onshore support system; wireless network and radio links; distributed control and monitoring systems; technical/engineering expertise]

Figure 24.2. Technical infrastructure for integrated e-operations–e-maintenance solutions in the North Sea

The figure highlights that the functional landscape for establishing an e-based O&M setting in the North Sea is a relatively complex combination of various technical as well as social elements. The synergy among at least three elements is critical in the development of the necessary technical infrastructure, i.e.:

• Advanced process and safety technologies implemented in equipment on offshore assets that allow real-time data acquisition and transfer
• A large-scale ICT network with an appropriate bandwidth, which uses wireless, fiber-optic, and web-enabled capabilities to enable sharing of acquired data and communication traffic on a 24/7 basis
• Well-equipped onshore expert centers with built-in advanced data management capabilities and collaborative technologies to process and interpret data, and to stay connected with offshore assets as well as other partners to interact online for enhancing decisions and activities

Such a large-scale technical setting can perhaps be considered the heart of e-operations–e-maintenance activities, as it allows:

• Integration of geographically dispersed knowledge centers, creating a virtual workplace
• Establishment of 24/7 online net-based connectivity to provide easy and fast access to remote experience and knowledge
• Access to a reliable IT network with a higher bandwidth and speed to acquire, process, and interpret volumes of real-time data

The largest implication of such a setting by far is the significant improvement in decision-making and work processes. The connectivity and the interactivity between offshore and onshore, as well as between different onshore-based competence groups and knowledge centers, allow more effective decision loops and more coordinated planning and execution of O&M activities (see Figure 24.3). Smart combination of real-time data with multi-disciplinary expertise has major benefits for the effectiveness and efficiency of data-to-decision (D2D) processes. Continuous monitoring of the functional status of equipment and joint interpretation of technical and safety integrity levels with the experts in the active network, on the other hand, bring major benefits to the decision-to-action (D2A) processes. Such benefits have already been visible in terms of the time and quality of D2D and D2A processes, and are said to be very encouraging and commercially attractive for further improvements.

[Figure 24.3 (schematic). Labels: digital technology platforms; integrated O&M work processes; collaborative management solutions; D2D processes; D2A processes. Annotations: “Coupling the real-time equipment data with rapid analysis techniques and joint decision making”; “Joint coordination and planning of maintenance work by use of advanced communication capabilities and technical data sharing platforms”]

Figure 24.3. Integrated e-operations–e-maintenance brings key solutions to enhance data-to-decision (D2D) and decision-to-action (D2A) processes

The targeted benefits of these developments within O&M, together with those in other technical disciplines, are expected to accrue continuously over a 30–40 year time span. The key value creation elements identified include, for example, methods and techniques to reduce uncertainty in data interpretation, reduced cycle time on decisions, better planning and work coordination procedures, and reduced offshore operating costs through offshore-onshore work re-organization and prolonged maintenance intervals. The overall commercial benefits expected include an increase of approximately 10% in production, a 30–40% reduction in operating costs, and significant improvements in health and safety performance.

24.7 Key features of the e-Approach for O&M in North Sea Assets

As mentioned above, integrated e-operations–e-maintenance is not just an effort to introduce new technologies. It in fact represents a change in the use of technical tools, advanced methods, and joint expertise to make O&M processes more effective and efficient. It introduces a novel scenario for managing the process that steps outside convention. However, the successful implementation and use of the e-approach depends heavily on the synergy between remote diagnostic and prognostic technology, onshore expert centers directly connected to offshore collaborative rooms, and net-based web-enabled ICT solutions (Figure 24.4).

[Figure 24.4 (schematic). Three components: net-based and web-enabled ICT solutions (e.g. SOIL); remote monitoring technology (e.g. diagnostic and prognostic); offshore-onshore expert centers]

Figure 24.4. The solid foundation to e-approach in O&M demands a synergy between three main components that establish a complex and an interactive technical system

24.7.1 Prognostic and Diagnostic Technologies

For a long time it was a challenge to make effective use of condition monitoring on the Norwegian shelf (Ellingsen et al. 2006). There had been ad hoc use of some diagnostic technologies, such as vibration monitoring on heavy rotating equipment, thermography on electrical equipment, and oil analysis, but mainly on a discontinuous need-by-need basis. In most cases the use of diagnostic expertise had been limited to on-the-site tapping and data acquisition after a malfunction or some abnormal technical indication had been reported. Today, however, many O&G producers are keen to capitalize on the inherent potential provided by the digital infrastructure in the North Sea and by advanced technologies. It implies that the use of condition monitoring to support technical and safety integrity is strengthened in the integrated environment since:

• Data acquisition techniques have developed to the extent that experts at onshore support centers (OSCs) can tap signals from critical equipment in real time
• Online communication capability has allowed joint interpretation and trend analysis, for instance by coupling to the asset operator’s OSC and comparing with set alarm levels
• Expert centers have acquired the technological capability to secure connections to several offshore assets so that those assets can be served simultaneously if necessary
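The trend analysis and comparison with set alarm levels mentioned in the list above can be illustrated with a minimal Python sketch. It is not taken from any system described in this chapter: the `AlarmLevels` class, the function names, and the numerical limits are hypothetical, and the trend is summarized by a simple least-squares slope.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class AlarmLevels:
    """Hypothetical alert/danger limits for one measurement point (mm/s RMS)."""
    alert: float
    danger: float


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square of one block of vibration samples."""
    return sqrt(sum(x * x for x in samples) / len(samples))


def classify(value: float, levels: AlarmLevels) -> str:
    """Compare one trend value with the set alarm levels."""
    if value >= levels.danger:
        return "danger"
    if value >= levels.alert:
        return "alert"
    return "normal"


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the trend, in units per sample interval.

    Requires at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    levels = AlarmLevels(alert=4.5, danger=7.1)  # illustrative limits only
    trend = [2.0, 2.1, 2.4, 3.0, 3.9, 4.8]       # made-up RMS history
    print(classify(trend[-1], levels), round(trend_slope(trend), 3))
```

In a joint-interpretation session, the status string and the slope would be two of the indicators that offshore and onshore parties look at together; real alarm limits depend on the machine class and the operator’s own criteria.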

The use of advanced networking technologies, as opposed to offline technologies, is in fact a landmark of integrated O&M solutions for North Sea assets. It has brought some unique capabilities for sharing expertise. With the rapid uptake of portable communication technology, offshore personnel can also communicate effectively with OSCs, allowing more sensible use of data acquisition technologies. The current setting has given a new dimension to the diagnostic and prognostic efforts for North Sea assets today. The OSC in SKF-Norway, for instance, is a CBM expert center that has remote diagnostic and prognostic capabilities and serves various operators in the Norwegian and Danish O&G sectors. Over the past few years it has carried out online remote vibration monitoring of critical machinery on offshore production platforms from its OSC. Other technical options, such as wireless or web-enabled solutions and software such as Microsoft NetMeeting, together with the highly secure data traffic enabled by SOIL, have made it possible for experts to have simultaneous access to critical data around the clock from different geographical locations for rapid analysis and interpretation.

In the absence of access to fiber-optic technology, for instance due to technical limitations, multiplexed solutions are available that reduce the amount of data traffic to a level that can easily be handled by satellite communications. In such cases, foreign expert centers can reasonably record data, e.g. from 20 accelerometers at a rate of one reading every 20 s. In the presence of adequate data exchange and communication networks, the data management systems of those foreign locations can easily be linked to SOIL. This has proven to work successfully, for instance in the case of mobile offshore drilling rigs and floating production and storage vessels.

Developments in sensor systems, microelectronics, wireless systems, and software have actively contributed to the growth in the use of advanced CBM techniques on the NCS today. Advances in exception-reporting techniques, on the other hand, have reduced the need for massive amounts of data for failure predictions. This has actively contributed to faster decision-making processes on the basis of variations in pre-defined data sets or associations between technical indicators. The need for a fast track of communication from remote locations has been met by web-based reporting systems that follow a pre-defined format and a tag-based reporting structure to ease faster action. This allows the data and reports to be transferred automatically into CMMS systems, such as those built into corporate ERP systems such as SAP, Workmate, etc. This has substantially shortened the time and conventional routines for data collection, analysis, reporting, work orders, and feedback.
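As a rough illustration of exception reporting combined with a tag-based report structure, the Python sketch below forwards a reading only when it deviates from the last reported value by more than a deadband. The deadband rule, the `TagReport` fields, and the tag name `P-101-VIB` are assumptions made for the example; the chapter does not specify the actual reporting format or the CMMS interface.

```python
from dataclasses import dataclass, asdict
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class TagReport:
    """Hypothetical tag-based report record; field names are illustrative."""
    tag: str          # equipment tag (made-up example: "P-101-VIB")
    timestamp_s: int  # seconds since the start of monitoring
    value: float      # reported measurement
    reason: str       # why this sample was reported


def exception_reports(tag: str,
                      samples: Iterable[tuple[int, float]],
                      deadband: float) -> Iterator[TagReport]:
    """Yield a report only when the value moves more than `deadband`
    away from the last reported value; all other samples are dropped."""
    last_reported: Optional[float] = None
    for timestamp_s, value in samples:
        if last_reported is None:
            yield TagReport(tag, timestamp_s, value, "initial")
            last_reported = value
        elif abs(value - last_reported) > deadband:
            yield TagReport(tag, timestamp_s, value, "deviation")
            last_reported = value


if __name__ == "__main__":
    # One reading every 20 s, mirroring the satellite-link example in the text.
    readings = [(0, 2.00), (20, 2.02), (40, 2.05),
                (60, 2.60), (80, 2.62), (100, 3.30)]
    for report in exception_reports("P-101-VIB", readings, deadband=0.5):
        print(asdict(report))
```

Of the six made-up readings, only three are forwarded (at 0 s, 60 s, and 100 s), which is the sense in which exception reporting reduces data traffic. The dictionary form of each record stands in for whatever pre-defined format a receiving system would expect.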
24.7.2 Onshore Remote Support Centres and Virtual Activity

Onshore Support Centres (OSCs) can be considered the active nodes of the integrated e-operations–e-maintenance setting. Such OSCs are established on the premises of both O&G producers and third parties. The functional characteristics of OSCs can vary from one to another depending on the contractual roles and specific assignments of external organizations. For instance, ConocoPhillips, as the operator of the Ekofisk asset, has two such onshore centers. One of them is called the onshore operational center (OOC) and has built-in integrated solutions for O&M planning, logistics, and other production and operation related activities (Figure 24.5).

[Figure 24.5 (photographs). Panels: 3D technologies and simulations landscape; logistics and planning landscape; conferencing landscape; real-time monitoring landscape]

Figure 24.5. Onshore support centers (OSCs) with built-in collaborative and decision support technologies are the active nodes of the integrated e-operations–e-maintenance environment on NCS (courtesy: ConocoPhillips, Norway)

In general, OSCs have built-in communication capabilities with offshore control rooms and external business partners. The OSCs of third-party organizations are dedicated to providing expert assistance, for instance in logistics, vibration monitoring, etc., on a 24/7 online and real-time basis. To enable active collaboration, these OSCs are equipped with tabletop collaborative workstations, back-projected large VDUs, technologies for remote monitoring, video-conferencing facilities, other advanced technological capabilities for joint decision-making (e.g. VisiWear, Smart boards), supportive advanced technology to produce 3D images and to run simulations, etc. The success stories of OSCs such as those of ConocoPhillips have given an industry-wide boost to further advancement both in the number of OSCs and in the type of technologies in use. This has created a very fruitful environment for rapid exploitation of technology (e.g. CBM), decision and work process optimization, multi-disciplinary coordination of planning (e.g. between O&M and drilling), shared expertise, etc. The technological capabilities built into OSCs, together with the net-based access via the ICT infrastructure, have resulted in the establishment of a dedicated virtual environment to support O&M decisions and activities. This takes place through:

• Real-time online connection between offshore and onshore organizations
• Real-time online connection between different technical disciplines (e.g. planning and scheduling, transport logistics, equipment suppliers, external service contractors, spare part suppliers, health and safety advisors, etc.)
• Real-time online connection between the asset operator and the external experts (e.g. remote condition monitoring center of SKF-Norway and BP)
• Real-time online connection across geographical borders to the corporate network to receive expert support, for instance from Aberdeen, UK, Houston or Alaska, USA, or to remotely monitor activities ‘following the sun’

Certainly, the new network-based and collaborative O&M environment has already shown its capabilities in making notable changes to conventional O&M practice. The new thinking and the progress so far have indicated great potential for expansions of substantial scale that can lead to a completely different technological setting and operating mode by the year 2010 or so. The ongoing developments would at some stage be coupled with other technologies, for instance related to scenario simulations of technical faults and failures using 3D technologies, intelligent watchdog agents for condition prognostics, virtual tools to train O&M crews, etc.

24.7.3 The ICT Network: Secure Oil Information Link (SOIL)

Advanced ICT solutions are at the heart of the principal commercial activities of almost all industrial sectors today (Chang et al. 2004; van Oostendrep et al. 2005; Mezgaar 2006). Current developments on the Norwegian shelf have also resorted to such solutions as the basis for inducing the change. Current ICT solutions range from more centralized LANs, primarily localized within organizational boundaries, to large-scale WAN solutions that open up transaction routes for complex business-to-business (B2B) traffic. In fact, the specific need for such robust integrated solutions for the O&G industry in the North Sea has been growing over the last 2–3 years, demanding more common platforms, for instance to manage complex O&M and other plant data. The large-scale ICT network established in the North Sea is called the Secure Oil Information Link (SOIL). SOIL was introduced to the Norwegian E&P industry in 1998. It is a result of growing demands for integrated data management and B2B communication solutions. SOIL consists of a number of application services actively connecting almost all the business sectors of the Norwegian O&G industry. This network helps to establish the connectivity and interactivity between different parties, for instance offshore O&M teams, the operator’s onshore O&M support groups, third-party CBM experts, logistic contractors, etc., through the use of fiber-optic cables and wireless communications.
Real-time equipment data can be acquired and jointly analyzed, and the results can be exchanged online between these parties, enhancing the ability for shared interpretation and decision-making. In this context, there are two major functional features of SOIL (see also Figure 24.6):

• A highly reliable information and knowledge-sharing network to remotely coordinate and manage O&M activities on North Sea offshore assets regardless of geographical location
• Many-to-many simultaneous authorized connectivity, breaking with the conventional one-to-one solution and enhancing collaboration between experts, third-party services, the asset operator, and the offshore crew

The conventional one-to-one setting only enabled connectivity between two distinct parties, for example between an inspection engineer of a contractor and a maintenance planner of an asset owner. However, with the use of the web-enabled networking solutions available today, a number of distinct groups can stay connected and interact simultaneously (i.e. many-to-many connectivity). This capability has a major effect on improvements to the D2D and D2A processes of O&M in terms of time, cost, and quality.


Figure 24.6. SOIL’s application solutions provide many-to-many connectivity and interactivity on a 24/7 online, real-time basis to enhance D2D and D2A performance of O&M
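The difference between one-to-one and many-to-many authorized connectivity can be sketched with a small in-memory publish/subscribe hub in Python. This is only a conceptual illustration under stated assumptions: the `Hub` class, the party names, and the topic name are hypothetical, and nothing here reflects how SOIL’s application services are actually implemented.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A handler receives the sender's name and the message payload.
Handler = Callable[[str, dict], None]


class Hub:
    """Hypothetical in-memory hub: any authorized party may publish to a
    topic, and every other subscriber of that topic receives the message."""

    def __init__(self, authorized: set[str]) -> None:
        self._authorized = set(authorized)
        self._subscribers: DefaultDict[str, List[tuple[str, Handler]]] = defaultdict(list)

    def subscribe(self, party: str, topic: str, handler: Handler) -> None:
        self._check(party)
        self._subscribers[topic].append((party, handler))

    def publish(self, party: str, topic: str, message: dict) -> int:
        """Deliver `message` to all other subscribers; return the delivery count."""
        self._check(party)
        delivered = 0
        for subscriber, handler in self._subscribers[topic]:
            if subscriber != party:  # do not echo back to the sender
                handler(party, message)
                delivered += 1
        return delivered

    def _check(self, party: str) -> None:
        if party not in self._authorized:
            raise PermissionError(f"{party} is not authorized")


if __name__ == "__main__":
    parties = {"offshore_crew", "operator_osc", "cbm_expert", "logistics"}
    hub = Hub(parties)
    inbox: dict[str, list] = {p: [] for p in parties}
    for p in parties:
        # Bind p as a default argument so each handler keeps its own inbox.
        hub.subscribe(p, "pump-vibration",
                      lambda sender, msg, p=p: inbox[p].append((sender, msg)))
    count = hub.publish("cbm_expert", "pump-vibration",
                        {"finding": "bearing wear suspected"})
    print(count)  # the three other parties each receive the finding
```

With a one-to-one link, the same finding would have to be relayed separately to each party; here a single publication reaches every authorized subscriber at once, which is the property the text credits with shortening D2D and D2A cycles.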

24.8 Future Challenges to be Fully-integrated and Fail-safe

The integrated e-operations–e-maintenance approach that is currently in progress on the Norwegian shelf has given a new perspective that challenges convention. It has already illustrated, through a number of successful implementations, how the technology can be coupled with suitable managerial solutions (e.g. better interfacing with contractors, fast access to external expertise, etc.) to address novel challenges and to meet O&M engineers’ demand for innovative solutions. With the availability of the highly secure ICT infrastructure, the Norwegian shelf has opened up substantial space for technological innovation seeking major improvements in O&M practice. In fact, the SOIL-enabled operational network together with the OSCs has provided the structural skeleton for test-beds and for implementing novel solutions. It is the rapid developments within data acquisition and offshore-onshore communication technologies that are expected to take O&M to a different mode of practice. However, certain challenges remain in the use of some of the novel technologies, including for instance:

• Portable video-communication technologies
• Smart sensors and intelligent transducers for equipment with built-in self-diagnostic and reporting capabilities
• Electronic products such as PDAs with advanced functionalities
• 3D technologies, etc.

Regardless of the notable achievements during the last 1–2 years, many challenges remain for the further development of the O&M process. In pure technological terms, smart and cost-effective use of CBM technologies in particular still remains a significant challenge. In fact, there is no argument about the benefits of CBM in terms of being fully integrated and fail-safe. The demand by far is for more sensitive use of the diagnostic and prognostic technologies as a principal means to improve and to be in control of the technical and safety integrity of assets. The demand in the current O&M setting is for advanced technical platforms that, for instance, combine unique signal processing, risk analysis, and decision-making features. In fact, the demand is for technologies where failure modes can be learnt by intelligent programs and reported automatically by decision support software. It implies that as remote support through CBM becomes the ‘practice of the future’, the O&G activities in the North Sea demand innovative solutions such as intelligent watchdog agents, common data management platforms, smart decision support tools, etc. to support the rapid transition towards a fully integrated and fail-safe e-operations–e-maintenance setting.

24.8.1 Intelligent Watchdog Agents

From a pure CBM perspective, there is a greater demand for the use of enabling technologies as integral parts of robust CBM solutions. As the operating environment steps into a remote mode, where 24/7 access becomes a sensitive issue, the experts need to ensure a tight technical coupling, for instance between:

• Signal-processing technology with a series of toolboxes for signal processing and system performance evaluation to track the health of a system/machine and provide diagnostic and prognostic information in order to achieve the goal of “near-zero-downtime” performance
• Application software solutions to optimally interpret monitored data signals regarding the execution of a maintenance action and to estimate remaining useful life (RUL)
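To make the notion of remaining useful life (RUL) in the list above concrete, the following Python sketch extrapolates a fitted linear trend of a health indicator to a failure threshold. The linear degradation model, the threshold, and the function names are assumptions chosen for simplicity; the tools cited in this chapter use more elaborate models, and this sketch does not reproduce them.

```python
from typing import Optional, Sequence


def fit_line(times: Sequence[float], values: Sequence[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept).

    Requires at least two distinct time points.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def estimate_rul(times: Sequence[float],
                 values: Sequence[float],
                 failure_threshold: float) -> Optional[float]:
    """Time from the last observation until the fitted trend reaches
    `failure_threshold`.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the indicator shows no upward trend (no crossing predicted).
    """
    slope, intercept = fit_line(times, values)
    now = times[-1]
    if slope * now + intercept >= failure_threshold:
        return 0.0
    if slope <= 0:
        return None
    crossing_time = (failure_threshold - intercept) / slope
    return crossing_time - now


if __name__ == "__main__":
    hours = [0.0, 100.0, 200.0, 300.0]  # made-up operating hours
    vibration = [2.0, 2.5, 3.1, 3.4]    # made-up RMS readings (mm/s)
    print(estimate_rul(hours, vibration, failure_threshold=7.1))
```

A straight-line extrapolation is rarely adequate on its own, since degradation often accelerates; it is shown here only because it makes the link between a monitored trend, a threshold, and a time-to-action estimate easy to see.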

The requirement on the Norwegian shelf today is for a CBM technology that is not limited to data acquisition but also integrates advanced signal processing and decision-making capabilities, making it a more attractive and commercially viable solution. In a series of more recent R&D efforts, the Center for Intelligent Maintenance Systems (IMS) at University of Wisconsin-Milwaukee and the CBM Lab at University of Toronto have developed such an integrated O&M optimization platform to provide asset owners and operators with an advanced tool for signal processing and maintenance decision-making (see Jardine et al. 1997; Banjevic et al. 2001). Figure 24.7 shows the multi-sensor performance assessment framework of this technology. This watchdog agent constitutes a toolbox with modules for signal processing, feature extraction, degradation assessment, and performance evaluation embedded in a common software application. It includes signal processing and feature extraction tools built on Fourier analysis, time-frequency distribution, wavelet packet analysis, and ARMA time series models. The performance evaluation component uses such tools as fuzzy logic, match matrix, neural network, and other advanced algorithms. Functionally, the watchdog agent is in principle used to extract features from a series of signals under a given condition and to compare those with a template model built up from signals under a pre-identified normal condition. The performance evaluation yields a “confidence value” (CV), which indicates the health status of the system and is used as the basis for diagnostics and prognostics under given circumstances. If the data can be directly associated with some failure mode, then the most recent performance signatures, obtained through the signal processing and feature extraction modules, can also be matched against signatures extracted from faulty behavior data for proper decisions.
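The idea of comparing extracted features with a template built from normal-condition signals, and condensing the result into a confidence value, can be sketched in Python as follows. This is a hedged illustration and not the watchdog agent’s algorithm: the two features (RMS and peak-to-peak), the `NormalTemplate` class, and the Gaussian-style mapping from distance to a value between 0 and 1 are all assumptions made for the example.

```python
from math import exp, sqrt
from statistics import mean, pstdev
from typing import Sequence


def extract_features(signal: Sequence[float]) -> tuple[float, float]:
    """Return (RMS, peak-to-peak) for one block of samples."""
    rms = sqrt(sum(x * x for x in signal) / len(signal))
    return rms, max(signal) - min(signal)


class NormalTemplate:
    """Per-feature mean and spread learned from signals recorded under
    a pre-identified normal condition (hypothetical template model)."""

    def __init__(self, normal_signals: Sequence[Sequence[float]]) -> None:
        features = [extract_features(s) for s in normal_signals]
        columns = list(zip(*features))
        self.means = [mean(c) for c in columns]
        # Guard against zero spread so the distance below stays finite.
        self.stds = [max(pstdev(c), 1e-9) for c in columns]

    def confidence_value(self, signal: Sequence[float]) -> float:
        """Map the normalized squared distance from the template to (0, 1]:
        1.0 at the template centre, approaching 0 as the signal departs.
        This mapping is an assumption, not the published CV definition."""
        feats = extract_features(signal)
        d2 = sum(((f - m) / s) ** 2
                 for f, m, s in zip(feats, self.means, self.stds))
        return exp(-0.5 * d2)


if __name__ == "__main__":
    normal = [[1.0, -1.0, 1.0, -1.0],
              [1.1, -1.1, 1.1, -1.1],
              [0.9, -0.9, 0.9, -0.9]]   # made-up normal-condition blocks
    template = NormalTemplate(normal)
    print(template.confidence_value([1.0, -1.0, 1.0, -1.0]))  # close to 1
    print(template.confidence_value([3.0, -3.0, 3.0, -3.0]))  # close to 0
```

A falling CV over successive blocks would then serve as the degradation indicator; matching against signatures of known faulty behavior, as described above, would need an additional template per failure mode.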

[Figure 24.7 (schematic). Multisensor performance assessment:
• Sensory signal processing: time-frequency analysis; ARMA modeling; Fourier analysis; wavelet packet analysis
• Feature extraction: time-frequency/wavelet moments and PCA; wavelet frequency bands; AR model roots; expert extracted features (intensity, peak-to-peak value, RMS)
• Multisensor performance evaluation: logistic regression; statistical pattern recognition; feature map pattern matching; neural network pattern matching; hidden Markov model; particle filter]

Figure 24.7. The potential for further enhancement in the use of advanced CBM technologies such as intelligent watchdog agents is very evident for North Sea assets (courtesy: CBM Lab, University of Toronto, Canada)

24.8.2 Early Warning and Decision Support Systems

Offshore production facilities are threatened by various unwanted events and incidents every year. The risk exposure due to such serious events and incidents is much higher in a 24/7 online real-time environment than in a conventional operating mode. The former demands more robust early warning systems and decision support tools for fast decisions and actions. The ability to better control such events and incidents demands tools and techniques for recognizing the actual condition of technical items (i.e. systems, sub-systems, equipment, and components) and for early prediction of imminent faults and failures that may lead to such events and incidents, based on performance cues (or early indications). In this context, the major challenge in avoiding serious events and catastrophic incidents relates to the ability to employ smart technologies and techniques to obtain such critical performance cues and to actively use them as a basis for diagnostic and prognostic purposes, enabling early decisions and actions prior to the ‘point of no return’ (e.g. emergency shutdown). Some of the current R&D projects for O&M optimization seek to implement such technical solutions as integral parts of early warning systems to deal better with unwanted events and incidents. Such early warning systems, which quickly initiate further technical analysis based on trends, associations, failure histories, or expert judgments, will be built into OSCs to support decisions. Additional software application solutions are under testing at the moment that can be mapped onto the existing ERP systems, with built-in data mining logic to tap into complex events and incidents data. However, apart from the technology, there are other impediments such as ontology, semantics, mechanics of reporting, custom data flow structures, etc. that need to be addressed. It implies that the current integrated e-operations–e-maintenance setting also requires some effort towards standardization of data in order to make use of reliable early warning and decision support systems.
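One simple way to turn a performance cue into an early warning is to smooth the monitored indicator and flag the first point at which it leaves its normal band. The Python sketch below uses an exponentially weighted moving average (EWMA) and a limit of the baseline mean plus k standard deviations. Both the technique and the parameter values are assumptions chosen for illustration; the R&D projects mentioned above are not described in enough detail here to reproduce their methods.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def ewma(values: Sequence[float], alpha: float) -> list[float]:
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def first_warning(values: Sequence[float],
                  baseline_len: int,
                  alpha: float = 0.3,
                  k: float = 3.0) -> Optional[int]:
    """Index of the first sample after the baseline whose EWMA exceeds
    baseline mean + k * baseline standard deviation; None if never."""
    baseline = values[:baseline_len]
    limit = mean(baseline) + k * pstdev(baseline)
    for i, s in enumerate(ewma(values, alpha)):
        if i >= baseline_len and s > limit:
            return i
    return None


if __name__ == "__main__":
    # Made-up indicator: five normal readings followed by a steady rise.
    readings = [10.0, 10.2, 9.8, 10.1, 9.9] + [10.0 + 0.5 * i for i in range(1, 9)]
    print(first_warning(readings, baseline_len=5))
```

The warning index marks where further technical analysis would be initiated, well before any shutdown limit; the smoothing factor and k trade sensitivity against false alarms and would have to be tuned per indicator.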


24.8.3 Non-technical Issues

In fact, the initiatives in 2005–2006 to introduce integrated e-operations–e-maintenance solutions to the Norwegian shelf represent a very sensitive commercial step to enhance technology exploitation and resource management capability. Owing to the direct long-term commercial implications of this industry-wide program, the involvement of socio-political and authoritative organizations has been rapidly increasing, with the intention of taking all possible measures to ensure a fully integrated and fail-safe system in operation. The Norwegian Petroleum Directorate (NPD), the Petroleum Safety Authority (PSA), and the Norwegian Oil Industry Association (OLF) are at the front end of this move, and have already sponsored and launched various programs. Breaking the conventional barriers that have been in place over the last 20–30 years is a serious challenge in itself. The implications of the ongoing change processes go far beyond the purely technical. They induce a new socio-technical environment with inherent characteristics and complexities. Interestingly, it is commonly acknowledged within the industry that a much greater portion of the current challenges are in fact non-technical rather than technical. It is constantly highlighted that increasing complexities and ill-defined solutions can easily increase vulnerabilities and risk exposure. With the expansion of activities on a global scale, the path to establishing best practice, and thus to achieving commercial excellence, has to overcome some critical challenges that mostly relate to effective and efficient interfacing with fairly diversified sectors of the industry. Understanding and interfacing critical socio-technical dimensions are important to avoid the vulnerabilities and risks of the ongoing change processes (Health and Safety Executive 1997; Perow 1999; Booher 2003). Table 24.3 illustrates some of the known challenges that need to be overcome for fully integrated and fail-safe operations.

Table 24.3. The challenges of the full-scale integration of e-operations–e-maintenance are quite complex

Challenges for integrated e-operations–e-maintenance solutions
Liabilities of shared decisions and activities
Trust and openness between business partners and distinctive groups
Semantics and ontology for data integration
Security and reliability of digital infrastructure
Information quality, data filtering, common data exchange platforms
Incentives for and risk of knowledge-based industry integration
Standards and interfacing for work processes optimization
Human and organizational learning
Competence development programs for change absorption
Trade union matters
Etc.

There is much to do to make sure that the new integrated e-operations–e-maintenance setting is fully functional and fail-safe. Perhaps the greater concern is that the marvel of the success brought by ad hoc technological solutions may easily lead to miscalculation of the underlying risks of process re-engineering tasks. With this realization, a major portion of the industry has begun to adopt a more cautious, synchronized, and incremental development path. Initiatives by authorities (e.g. NPD, PSA, etc.) and by socio-political sources (e.g. OLF) are critical to establishing a more harmonized setting that ensures the necessary levels of safety and security. Even though a systematic strategy may prolong the integration plan, the argument is that such a systematic move will have a substantial long-term payback, whereas a rapid solution would eventually force major stakeholders to deal with unforeseen events requiring ‘ad hoc solutions’ or ‘quick fixes’ that would be too costly to bear.

24.9 Conclusion

Commencing in 2003–2004, the Norwegian O&G industry launched a dedicated program to overcome obvious commercial risks on the NCS. This is termed the third efficiency leap, and it has directly supported the implementation of integrated e-operations–e-maintenance solutions for offshore assets in the North Sea. This new practice has greatly challenged the conventional practices of many disciplines, particularly O&M, seeking a technological as well as a managerial change. The new O&M practice places major emphasis on more active exploitation of application technologies and of new data and knowledge management techniques. The change process has also begun to re-engineer the industry infrastructure to actively integrate the O&M expertise of O&G producers with that of the external knowledge-based industry. The large-scale ICT network called the Secure Oil Information Link and the onshore support centers are the main facilitators of the rapid development within the O&M process. The new setting has already brought major commercial benefits by streamlining D2D and D2A processes, with substantial improvements in work processes. However, some critical challenges still remain to be addressed, and the socio-political organizations and authorities are keen on ensuring fully functional and fail-safe operations. The demand and the interest are to complete the rest of the journey through more cautious and systematic strategies, sustaining commercial benefits beyond the year 2050 without exposing the industry to unwanted or hidden risks that would be too costly to bear.

24.10 References

Arnaiz, A., Arana, R., Maurtua, I., et al., (2005), Maintenance: future technologies, Proceedings of the IMS (Intelligent Manufacturing System) International Forum IMS Forum 2004 Como, Italy, May 17–19, pp. 300–307.
Bangemann, T., Rebeuf, X., Reboul, D., et al., (2006), PROTEUS-creating distributed maintenance systems through an integration platform, Computers in Industry, 57(6), pp. 539–551.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001), A control-limit policy and software for condition-based maintenance optimization, INFOR, 39, pp. 32–50.
Bonissone, G., (1995), Soft computing applications in equipment maintenance and service, ISIE ’95, Proceedings of the IEEE International Symposium, 2, pp. 10–14.
Booher, H.R. (ed.) (2003). Handbook of human systems integration, Wiley-Interscience.


Chande, A., Tokekar, R., (1998), Expert-based maintenance: a study of its effectiveness, IEEE Transactions on Reliability 47, pp. 53–58.
Chang, Y.S., Makatsoris, H.C., Richards, H.D., (2004), Evolution of supply chain management: symbiosis of adaptive value networks and ICT, Boston: Kluwer Academic Publishers.
Djurdjanovic, D., Ni, J., Lee, J., (2002), Time-frequency based sensor fusion in the assessment and monitoring of machine performance degradation, Proceedings of the 2002 ASME International Mechanical Engineering Congress and Exposition, paper number IMECE 2002-32032.
Djurdjanovic, D., Lee, J., Ni, J., (2003), Watchdog agent – an infotronics-based prognostics approach for product performance degradation assessment and prediction, special issue on intelligent maintenance systems, Engineering Informatics Journal 17 (3–4), pp. 107–189.
During, W., Oakey, R., et al. (ed.) (2004). New technology-based firms in the new millennium. Elsevier.
Ellingssen, H.P., Liyanage, J.P., Ruså, R., (2006), Smart integrated operations and maintenance solutions to manage offshore assets in North Sea, Proceedings of the 18th EuroMaintenance, MM Support GmbH, pp. 319–324.
Emmanouilidis, C., MacIntyre, J., Cox, C., (1998), An integrated, soft computing approach for machine condition diagnosis, Proceedings of the Sixth European Congress on Intelligent Techniques & Soft Computing (EUFIT’98), vol. 2, Aachen, Germany, pp. 1221–1225.
Emmanouilidis, C., Jantunen, E., MacIntyre, J., (2006), Flexible software for condition monitoring, Computers in Industry, 57(6), pp. 516–527.
García, M.C., Sanz-Bobi, M.A., (2002), Dynamic Scheduling of Industrial Maintenance Using Genetic Algorithms, Proceedings of EuroMaintenance 2002, Helsinki, Finland.
Garcia, M.C., Sanz-Bobi, M.A., Pico, J., (2006), SIMAP: Intelligent systems for predictive maintenance: Application to the health condition monitoring of a wind-turbine gearbox, Computers in Industry, 57(6), pp. 552–568.
Han, T., Yang, B.S., (2006), Development of an e-maintenance system integrating advanced techniques, Computers in Industry, 57(6), pp. 569–580.
Hansen, R., Hall, D., Kurtz, S., (1994), New approach to the challenge of machinery prognostics, Proceedings of the International Gas Turbine and Aeroengine Congress and Exposition, American Society of Mechanical Engineers, pp. 1–8.
Health and Safety Executive (HSE). (1997). Human and organizational factors in offshore safety. HSE, UK.
Hosni, Y.A., Khalil, T.M. (ed.) (2004). Management of technology. Elsevier.
Iung, B., (2003), From remote maintenance to MAS-based e-maintenance of an industrial process, International Journal of Intelligent Manufacturing 14(1), pp. 59–82.
Jardine, A.K.S., Banjevic, D., Makis, V., (1997), Optimal replacement policy and the structure of software for condition-based maintenance, Journal of Quality in Maintenance Engineering, 3, pp. 109–119.
Jardine, A.K.S., Makis, V., Banjevic, D., et al., (1998), Decision optimization model for condition-based maintenance, Journal of Quality in Maintenance Engineering 4 (2), pp. 115–121.
Jardine, A.K.S., Lin, D., Banjevic, D., (2006), A review on machinery diagnostics and prognostics implementing condition based maintenance, Mech. Syst. Signal Process. 20 (7), pp. 1483–1510.
Jantunen, E., Jokinen, H., Milne, R., (1996), Flexible expert system for automated on-line diagnostics of tool condition, Integrated Monitoring & Diagnostics & Failure Prevention, Technology Showcase, 50th MFPT, Mobile, Alabama.


J. Liyange

Khatib, A.R., Dong, Z., Qiu, B., et al., (2000), Thoughts on future Internet based power system information network architecture, in: Proceedings of the 2000 Power Engineering Society Summer Meeting, vol. 1, Seattle, USA. Koc, M., Lee, J., (2001), A system framework for next-generation e-maintenance system, Proceeding of Second International Symposium on Environmentally Conscious Design and Inverse Manufacturing Tokyo, Japan. Lee, J. (1996), Measurement of machine performance degradation using a neural network model, Computers in Industry 30, pp. 193–209. Lee, J., (2004), Infotronics based intelligent maintenance system and its impacts to closed loop product life cycle systems, Proceedings of the Proceedings of the IMS’2004 International Conference on Intelligent Maintenance Systems Arles, France. Liao, H.T., Lin, D.M. Qiu, H., et al., (2005), A predictive tool for remaining useful life estimation of rotating machinery components, ASME International 20th Biennial Conference on Mechanical Vibration and Noise Long Beach, CA. Liyanage, J.P., (2003), Operations and maintenance performance in oil and gas production assets: Theoretical architecture and capital value theory in perspective, PhD Thesis, Norwegian University of Science and Technology (NTNU), Norway. Liyanage, J.P., Herbert, M., Harestad, J., (2006), Smart integrated e-operations for high-risk and technologically complex assets: Operational networks and collaborative partnerships in the digital environment, Wang, Y.C., et al., (ed.), Supply chain management: Issues in the new era of collaboration and competition, Idea Group, USA, pp. 387–414. Liyanage, J.P., Langeland, T., (2007), Smart assets through digital capabilities, Mehdi Khosrow-Pour (ed.), Encyclopaedia of Information Science and Technology, Idea Group, USA. Liang, E., Rodriguez, R., Husseiny, A., (1988), Prognostics/diagnostics of mechanical equipment by neural network, Neural Networks 1 (1), p. 33. 
Marseguerra, M., Zio, E., Podofilini, L., (2002), Condition-based optimisation by means of genetic algorithms and Monte Carlo simulation, Reliability Engineering and System Safety 77, pp. 151–166. Mezgaar, I., (2006), Integration of ICT in smart organizations, Hershey, PA: Idea Group Pub. Moore, W.J., Starr, A.G., (2006), An intelligent maintenance system for continuous costbased prioritization of maintenance activities, Computers in Industry, 57(6), pp. 595–606. OLF (Oljeindustriens landsforening / Norwegian Oil Industry Association), (2003). eDrift for norsk sokkel: det tredje effektiviseringsspranget (eOperations in the Norwegian continental shelf: The third efficiency leap), OLF (www.olf.no). (in Norwegian) Palluat, N., Racoceanu, D., Zerhouni, N., (2006), A neuro-fuzzy monitoring system: Application to flexible production systems, Computers in Industry, 57(6), pp. 528–538. Perow, C. (1999). Normal accidents: Living with high-risk technologies, Pinceton University Press. Roemer, M. Kacprzynski, G., Orsagh, R. (2001), Assessment of data and knowledge fusion strategies for prognostics and health management, IEEE Aerospace Conference Proceedings, vol. 6, pp. 62979–62988 Russell, R.S., Taylor, B.W., (2006), Operations management: Quality and competitiveness in a global environment, Hoboken, N.J.: Wiley Sanz-Bobi, M.A., Toribio, M.A.D., (1999), Diagnosis of electrical motors using artificial neural networks, IEEE International Symposium on Diagnostics for Electrical Machines, Power Electronics and Drives (SDEMPED) Gijón, Spain, pp. 369–374. Sanz-Bobi, M.A., Palacios, R. Munoz, A., et al., (2002), ISPMAT: Intelligent System for Predictive Maintenance Applied to Trains, Proceedings of EuroMaitenance 2002, Helsinki, Finland. Swanson, L., (2001), Linking maintenance strategies to performances, International Journal of Production Economics 70, pp. 237–244


van Oostendrep, H., Breure, L., Dillon, A., (2005), Creation, use, and deployment of digital information, Mahwah, N.J. : Lawrence Erlbaum Associates. Wang, W., (2002), A stochastic control model for on line condition based maintenance decision support, Proceedings of the Sixth World Multiconference on Systemics, Cybernetics and Informatics, Part 6, vol. 6, pp. 370–374 Wang, W.Y.C., Heng, M.S.H., Chau, P.Y.K., (2006), Supply chain management: Issues in the new era of collaboration and competition, Idea Group Publishing. Yager R., Zadeh, L., (1992), An Introduction to Fuzzy Logic Applications in Intelligent Systems, Kluwer Academic Publishers. Yang, B.S., Lim, D.S., Lee, C.M., (2000), Development of a case-based reasoning system for abnormal vibration diagnosis of rotating machinery, Proceedings of the International Symposium on Machine Condition Monitoring and Diagnosis Japan, pp. 42–48. Yen, G.G., (2003), Online multiple-model-based fault diagnosis and accomodation, IEEE Transaction on Industrial Electronics 50 (2). Yu, R., Iung B., Panetto, H., (2003), A mutli-agents based e-maintenance system with casebased reasoning decision support, Engineering Applications of Artificial Intelligence 16, pp. 321–333.

25 Fault Detection and Identification for Longwall Machinery Using SCADA Data

Daniel R. Bongers and Hal Gurgenci

25.1 Introduction

Despite the most refined maintenance strategies, equipment failures do occur. The degree to which an industrial process or system is affected by these depends on the severity of the faults/failures, the time required to identify the faults and the time required to rectify the faults. Real-time fault detection and identification (FDI) offers maintenance personnel the ability to minimise, and potentially eliminate, one or more of these factors, thereby facilitating greater equipment utilisation and increased system availability.

This case study describes, in some detail, the application of data-driven fault detection to an underground mining operation. Although this application is specific, the concept can be employed on any system of machines, with or without complex machine-machine or machine-environment interactions, or on individual plant. In addition to detailing the implementation of an FDI system in real time, we propose a semi-autonomous approach to dealing with inaccurate and incomplete records of equipment malfunction. Since past equipment performance is often the principal information source for maintenance planning and evaluation, it is of utmost importance that this information be as accurate as possible. The method described allows for varying levels of confidence in the record keeping.

Section 25.2 introduces the longwall mining system, at present the most common form of underground coal mining. The availability of longwall equipment systems is low compared to surface systems of similar complexity. The present approaches towards reducing equipment downtime in longwall mining are summarised in Section 25.3. Common FDI approaches are summarised in Section 25.4. Two data-driven techniques are used in this study, namely artificial neural networks and multi-variate statistics. The availability of quality training data is of critical importance for both. This issue is addressed in Sections 25.5–25.7.
Once the training data set is constructed, the application of the selected FDI techniques is reasonably straightforward. The application and the results are summarised in Section 25.8 with concluding remarks in Section 25.9.

25.2 The Longwall Mining System

Longwall is an underground mining technique used to extract coal from relatively flat coalbeds. The basic principle is simple. A coalbed is selected and blocked out into panels averaging nearly 240 m wide, 2100 m in length, and several metres in height, by excavating passageways around its perimeter. A panel of this size contains millions of tonnes of coal, most of which is recovered. In the extraction process, numerous pillars of coal are left untouched in certain parts of the mine in order to support the overlying strata. The mined-out area is allowed to collapse. In some instances this may cause some surface subsidence.

Figure 25.1. Longwall equipment layout

Extraction by longwall mining is an almost continuous operation involving the use of self-advancing hydraulic roof supports, a sophisticated coal-shearing machine, and an armoured conveyor parallel to the coal face (see Figure 25.1). Working under the movable roof supports and riding on the conveyor frame, the shearing machine cuts and spills coal onto the conveyor for transport out of the mine. When the shearer has traversed the full length of the coal face, it reverses direction (without turning) and travels back along the face taking the next cut. As the shearer passes each roof support, the support is moved closer to the newly cut face. The steel canopies of the roof supports protect the workers and equipment located along the face, while the roof is allowed to collapse behind the supports as they are advanced. Extraction continues in this manner until the entire panel of coal is removed.


25.2.1 Longwall Maintenance – Current Practices

As with all forms of mining, production downtime represents massive losses in potential revenue. An effective maintenance program is seen in the industry as the key to a profitable and sustainable longwall. The purpose of this section is to describe the current maintenance practices in most longwall mines.

Maintenance is broadly categorized as either planned or reactive. Typically, one shift per week is assigned to routine planned maintenance, with major overhauls of machinery occurring at the end of each panel. Weekly maintenance is performed by the regular longwall operators, and usually includes the following:

• Checking the levels and quality of lubricants and hydraulic fluids
• Visual inspection of electrical panels, pumps, motors, gearboxes and hoses
• Testing of all underground communication devices
• Start-up tests for all AC motors
• Replacement of worn shearer picks and faulty spray nozzles on the cutting drum

Reactive (or breakdown) maintenance refers to the repair or replacement of failed or faulty equipment which interrupts longwall production. It is these actions which are reported in the maintenance log of all unscheduled downtime. This maintenance is attended to by longwall operators, as each team typically includes specialists in both mechanical and electrical repairs.

25.2.1.1 Daily Management Meetings

Senior mine management typically meets daily to discuss and update current issues. A large portion of the meeting is spent discussing planned maintenance, and any events from the previous day’s shifts that caused production delays. This forum allows input from managers with an array of specialities, so that the action decided on is the one that will most benefit longwall production as a whole. Any urgent issues are relayed to the longwall team presently underground for immediate attention. Other maintenance that can be postponed until a later production shift or the weekly maintenance shift is requested in the form of a work order. Work orders are given to a shift supervisor at the beginning of each shift, and are carried out before production begins.

25.2.1.2 Performance Indicators

Although coal tonnage per unit cost is the best indication of revenue, it gives no information about the performance of individual operations such as longwall production or new panel development. For this reason, maintenance teams use many performance indicators. Examples of such indicators are:

• Number of failures (or faults) of a piece of machinery in an operational month
• Total downtime of a piece of machinery in an operational month
• Tonnes of coal per unit of longwall operations


Of the many indicators (or measurements) of performance, longwall mines universally define key performance indicators (KPIs) that are production related. These KPIs are:

$$\text{Longwall availability} = \frac{\text{Operating time}}{\text{Operating time} + \text{Maintenance delays}} \times 100\%$$

This KPI looks at the ability to have the machines operate for the time that they are planned to operate. It is simply the percentage of the available planned time that they do actually operate. The ‘maintenance delays’ refer to scheduled and breakdown maintenance. Some sites include only the breakdown maintenance in this statistic, which leads to an inflated value of the equipment availability. Such confusion in terms makes it difficult to benchmark practices between sites. Typical values for this KPI average between 40% and 60%.

$$\text{Mean time between failure (MTBF)} = \frac{\text{Actual operating time}}{\text{Number of maintenance delays}}$$

This KPI looks at the ability to sustain the operation of machines over periods of time. It is a measure of how long, on average, before machines stop due to a maintenance problem. Typical values average around 1 h.

$$\text{Mean time to repair (MTTR)} = \frac{\text{Actual delay time}}{\text{Number of maintenance delays}}$$

This KPI looks at the ability to diagnose and remedy maintenance delays once they have occurred. It is a measure of how long, on average, before machines that have faulted are returned to operation. Typical values average around 20 min.

KPIs are typically reviewed on a weekly basis.

25.2.2 Longwall Monitoring

Equipment monitoring is playing an increasingly significant role in the modern longwall. By monitoring is meant both the determination of the overall state of plant as well as measurement of individual properties; for example, the running temperature of a gearbox. This section discusses all forms of condition monitoring undertaken in Australian longwalls. Condition monitoring of longwall equipment takes on two main forms: on-line and off-line. On-line monitoring includes all sensor measurements recorded and transmitted to the surface using a PLC-driven SCADA network. All other measurement and monitoring, including regular maintenance inspections, are classed as off-line.
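The three production KPIs defined in Section 25.2.1.2 can be computed directly from a record of operating time and maintenance delays. The following is a minimal sketch only: the function name `longwall_kpis`, the choice of minutes as the unit, and the sample figures are assumptions made for illustration and do not come from any mine's reporting system.

```python
def longwall_kpis(operating_minutes, delay_minutes):
    """Return (availability in %, MTBF, MTTR) for one reporting period.

    operating_minutes: actual operating time in the period.
    delay_minutes: one entry per maintenance delay, in minutes.
    Assumes the period is not empty (operating time plus delays > 0).
    """
    n_delays = len(delay_minutes)
    total_delay = sum(delay_minutes)
    availability = 100.0 * operating_minutes / (operating_minutes + total_delay)
    # With no delays, MTBF and MTTR are undefined; None marks that case.
    mtbf = operating_minutes / n_delays if n_delays else None
    mttr = total_delay / n_delays if n_delays else None
    return availability, mtbf, mttr


# Hypothetical period: 300 min of operation and five maintenance delays.
availability, mtbf, mttr = longwall_kpis(300.0, [20.0, 25.0, 15.0, 30.0, 10.0])
print(f"availability={availability:.1f}%  MTBF={mtbf:.0f} min  MTTR={mttr:.0f} min")
```

Whether scheduled maintenance is included in `delay_minutes` changes the availability figure, which is the benchmarking difficulty noted in the text.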


25.2.2.1 Off-line Monitoring

Oil is used in longwall machinery as both a lubricant and a cooling fluid. As such, it is important to monitor the quality and quantity of all oils used to ensure satisfactory operating conditions for machinery. A change in the quantity of oil in a piece of machinery is typically due to some sort of leakage; however, changes in the quality (or specific properties) of a particular oil may be due to any or all of the following:

• Wearing away of machine part material will increase the amount of metal inclusions in the oil
• Unsealed or cracked housings may allow water or dust to become mixed in the oil
• The use of an incorrect oil may cause a change in the viscosity or PQ index (the PQ or Particle Quantifier Index is a measure of the amount of entrained debris in the oil)

Oil analysis is done at irregular intervals, but on average it is conducted fortnightly. The properties usually measured are metal inclusions (Cr, Ni, Mo, Mn, Pb, Zn, Cu, Sn, Fe, Si, Al, Mg), kinematic viscosity at 40 °C, PQ index, and water content (% m/m). The task of checking oil level, collecting and analysing oil samples, as well as conducting vibration and thermographic analysis (see below) is typically contracted to one of a few condition monitoring companies. Results are returned to the mine in around three days, and are classed as satisfactory, caution, or action for each sample. If action is required, a facsimile is sent to the mine detailing the perceived problem. An example is an oil sample whose measured viscosity is significantly high or low while all other analyses are normal, suggesting that the wrong lubricating fluid was used. The mine is immediately notified so that the lubricant may be checked and/or changed in the next shift.

Vibration analysis is performed on all machinery with rotating parts, but is focused primarily on the crusher rotor, where the largest forces exist.
Accelerometers are attached in strategic locations to the casing of the machinery, and measurements taken under operation for around 10 min. The results from the analysis are almost immediate, and action can be taken if excessive play is found in any plant. Not all longwalls perform vibration analysis, as it is quite expensive and rarely highlights any problems. This is also due to the fact that large machinery that is rotating off-centre tends to generate massive noise, which is easily detected by longwall operators. Thermal analysis of various pieces of equipment is performed through thermography, a form of photography in which different colours represent different temperatures. Thermography is used both underground and on the surface to check the temperatures of parts not monitored by the on-line system. Typical parts inspected are control boxes, sensor communication subsystems, and sometimes the fluid coupling at the AFC drive.


25.2.2.2 On-line Monitoring

The on-line monitoring systems at longwall mines consist of a network of sensors and switches, which may be monitored from the surface or by operators underground. The interface allows users to view the overall operation, as well as monitor individual pieces of machinery. The system is designed to indicate simple faults such as an over-temperature trip by displaying messages and changing the colour of the particular machine at fault. This system is very expensive, both to purchase and maintain. At present, the display is only looked at when production personnel indicate that the longwall has stopped due to an unknown fault. Data that is recorded is rarely used, and undergoes no more than a visual analysis.

Over 300 analogue sensors are typically monitored by these SCADA systems. In the past, data storage was considered one of the major pitfalls of archiving all that was sampled. Today, however, this has been mitigated by inexpensive, very large capacity computer hard drives. Having ready access to millions of measurements tends to cause an information overload, which is where FDI technology fits in. The ability to process the data in real time allows the vast number of data points to be summarized into a highly accurate and concise history of equipment failures.

25.3 Reducing Equipment Downtime: The Case for FDI

Available techniques to minimise equipment/process downtime are examined first. After all, any maintenance program has the principal goal of achieving the greatest possible equipment utilisation. The discussion of the various techniques is based on generalized plant, with no regard to the specific challenges of the longwall industry. Only in conclusion is such consideration given, so as to justify the method selected for this particular application.

25.3.1 Available Techniques

25.3.1.1 Preventative Maintenance

The most common way that industries attempt to improve machine availability is through preventative maintenance. This consists of regular inspections, regular replacement of lubricant, filters etc. and occasional reconditioning of parts. Preventative maintenance has proven very successful in prolonging the life of machinery, as well as showing improvements in performance measures such as the mean time between failures (MTBF) and mean time to repair (MTTR). Although the particular preventative maintenance philosophy employed may vary depending on the training and experience of the engineers responsible, regular maintenance tends to prove very cost effective, and is employed in nearly all industries.

25.3.1.2 Design Modification

It is possible to redesign machinery to alter its inherent reliability. That is, based on industry experience of certain faults, parts or entire machines are redesigned to reduce or stop the occurrence of the fault. Typically, machine design is the responsibility of the manufacturer, and is a very time consuming and costly operation.

25.3.1.3 Redundancy

The concept of redundancy is widely used on a component level to make machines more reliable. The principle is rather simple: machine components such as relays, hydraulic valves or electronic capacitors are unnecessarily duplicated in the design in such a way that if one fails, another may take on its role. It may be possible in some situations to extend this concept of redundancy to entire machinery, having complete units on standby in case one should fail. Although this method in no way improves the inherent reliability of the individual plant, the plant is effectively made more available. The use of this sort of redundancy is therefore suitable in those situations where the time to commission replacement machinery is relatively small in comparison to the fault repair time (or maintenance time). Other factors that must be taken into account when considering the use of redundant machinery are cost and storage. Many industries use machinery that is far too expensive to purchase spares, or is too large to economically store.

25.3.1.4 Diagnosis and Repair Time

When a fault occurs causing a machine to cease operation, the process time lost is that required to diagnose and resolve/repair the fault. If this time can be reduced, machinery is then made more available. This may be achieved by improved operator training, or specialization of operator tasks. Another way to reduce the time required to determine the nature of the fault is to employ a diagnostic system that uses sensor and/or operational information to detect and isolate the fault. Such a system would typically produce an instantaneous diagnosis; however, it would not help to reduce the time required to repair the fault, once determined. The maintainability of the system is improved by reduction of the diagnosis and repair time.
A reliable FDI tool assists this by providing the operating and maintenance staff with a clear indicator of the failing component and sometimes of the mode of failure.

25.3.1.5 Failure Prediction

Similar to weather forecasting, the knowledge of an imminent failure of a piece of machinery would allow machine operators to cease or change operation to minimize and potentially avoid any subsequent downtime. Such a predictive system would rely on a continual stream of quality data. There is an inherent risk in such a concept in that false or misleading predictions could cause additional downtime, and quickly lose operator confidence. For a predictive system to be effective it must not only produce a prediction of an imminent fault, but also provide a sensible reason for the prediction, so that operators may make an informed decision whether to act on or ignore the suggestion.


25.3.2 Potential Benefits

When determining what approach should be taken to improve the availability of longwall machinery it is important to weigh the cost of any changes against the potential benefits, both financial benefits and improvements in worker safety. Since just a handful of lost time injuries (LTIs) are recorded internationally each year as a result of machine failure, we only consider here the financial benefits.

Quite obviously, the financial benefit of a more available longwall is the additional extraction of coal. To quantify this, we consider a single rip undertaken by a shearer traversing from the maingate to the tailgate, and then returning to the maingate. In an average-sized longwall, a rip can take between 18 and 26 min given no interruption, and will yield approximately 1800 tons of run-of-mine (ROM) coal. Assuming a 70% yield for export quality coal, sold at $50 per ton, this equates to between $145,000 and $210,000 gross turnover for each continuous hour of longwall production. Generously accounting for expenses incurred such as water and electricity, additional longwall production could be valued at around $100,000 per hour. With such a large potential profit for increased machine availability, it is clear that the benefits from just a few hours of additional production per month would far outweigh the cost of employing any of the techniques mentioned, with the possible exclusion of machine redesign.

25.3.3 Conclusions

All longwall mines employ preventative maintenance. It is one of the largest suboperations at any mine, and proves effective in that, when less attention is paid to maintenance, more faults occur. The optimum level of planned maintenance is difficult to determine because not all failures are age-related. In fact, many failures follow an exponential distribution, with a constant failure rate that is not related to age.
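The hourly turnover range quoted in Section 25.3.2 follows from simple arithmetic on the figures given there (1800 tons of ROM coal per rip, a 70% export yield, $50 per ton, and a rip time of 18 to 26 min). The short sketch below reproduces that calculation; the function and parameter names are hypothetical and chosen only for this illustration.

```python
def gross_revenue_per_hour(rip_minutes, rom_tons_per_rip=1800.0,
                           export_yield=0.70, price_per_ton=50.0):
    """Gross turnover per continuous hour of production, in dollars."""
    revenue_per_rip = rom_tons_per_rip * export_yield * price_per_ton
    rips_per_hour = 60.0 / rip_minutes
    return revenue_per_rip * rips_per_hour


# Slowest and fastest uninterrupted rip times quoted in the text.
for minutes in (26.0, 18.0):
    print(f"{minutes:.0f} min per rip: ${gross_revenue_per_hour(minutes):,.0f} per hour")
```

Each rip is worth about $63,000 under these assumptions, so the two rip times give roughly $145,000 and $210,000 per hour, matching the range in the text.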
Redesign and re-engineering of major offenders has been effective in many instances and it is believed that more improvements can be realized through such efforts. Redundancy in design has not been fully explored by longwall machine designers mainly due to the extra cost and the bulk associated with the redundant systems. It is the authors’ opinion that the future of longwall mining should include intelligent predictive systems that rely on the currently unused monitoring data. The possibility of such a system relies on answering the question: “does the currently recorded condition monitoring data contain sufficient information regarding imminent faults?” If such information exists in the data, then information must also exist that the fault has occurred, and specifically which fault occurred. The outcome of the work described in this case study, the detection and isolation of major longwall faults, should therefore be seen as a stepping stone towards a predictive system for longwall faults/failures. A detection system would also act as a diagnostic tool, as described above, itself contributing to the goal of improved longwall availability.


25.4 Fault Detection and Isolation Methodology

The fault detection and isolation problem can be viewed as a subset (or branch) of the general classification problem. For detection, a distinct change in the state of a dynamic system can indicate that a fault has been induced. The isolation task is thereby reduced to the categorization (or classification) of the state of the system, which locates the source of the fault; e.g. see Willsky (1976).

The development of a specific FDI system requires the engineer to capture the relationship between the available information (input data) and the state of the system (which includes the presence and nature of faults). This system, when developed, can then be viewed as a mapping function, allowing the state of the system to be determined at each measurement interval. The nature of this function is left to the reasoning of the engineer, dependent on both the nature of the faults/failures, and the available system information (including sensor data and fault/failure history).

25.4.1 FDI Techniques

Perhaps the most commonly applied FDI technique is the informal, qualitative opinion of the expert. Analogous to the diagnostic method applied by a car mechanic, operators (experts) use typical indicators such as heat, noise, vibration or poor performance to ascertain the presence and nature of the fault. Typically, faults detected using this rather subjective FDI technique must be confirmed by further investigation.

The most rigorous of the FDI approaches, qualitative expert systems, are rule based methods usually relying on a large number of if-then relationships. Expert systems truly require an expert, as they rely heavily on knowledge of the influence of all faults on system behaviour. This approach can provide excellent FDI; however, it is not robust to variations in system parameters or the occurrence of unforeseen faults.
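The if-then structure of a rule based expert system can be illustrated with a very small sketch. Everything specific in it is an assumption made for illustration: the sensor names, the threshold values and the fault labels are hypothetical and are not taken from any longwall SCADA configuration.

```python
# Each rule pairs a fault label with a condition on one set of sensor readings.
# Sensor names and limits below are illustrative assumptions only.
RULES = [
    ("gearbox over-temperature", lambda s: s["gearbox_temp_c"] > 95.0),
    ("low hydraulic pressure", lambda s: s["hydraulic_pressure_bar"] < 280.0),
    ("conveyor motor overload", lambda s: s["afc_motor_current_a"] > 400.0),
]


def diagnose(sample):
    """Return the labels of all rules whose condition holds for this sample."""
    return [label for label, condition in RULES if condition(sample)]


# Hypothetical reading with a hot gearbox and otherwise normal values.
reading = {"gearbox_temp_c": 101.0,
           "hydraulic_pressure_bar": 310.0,
           "afc_motor_current_a": 250.0}
print(diagnose(reading))
```

The sketch also shows the weakness noted above: a fault that no rule anticipates returns an empty list, and a drift in normal operating values silently invalidates the fixed limits.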
Model-based methods, as the name suggests, rely on a mathematical model of the system of interest and/or a model of how system faults affect sensor measurements; e.g. see Frank (1990). These techniques typically rely on analytical redundancy. The principle behind analytical redundancy is simple: for a given measured input, a mathematical model of the system may be used to generate estimates of its output, which serve as the redundant measurements. Comparison of these with the real output measurements allows inference to be made regarding the operating state of the system. A commonly applied regime is that of the Kalman filter, an optimal state estimator. The extended Kalman filter (EKF) is used when non-linearities are dominant. In either case, state representations can be chosen that are most sensitive to fault induced behaviour. While originally developed for estimating states in a control system, the Kalman filter has been applied in a wide range of fields including control, communications, image processing, biomedical science, meteorology, and geology. For more information on the Kalman filter and its applications, there are many excellent references available; e.g. Sorenson (1985); Gelb (1974) and Grewal and Andrews (2001).
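Analytical redundancy with a Kalman filter can be sketched for the simplest possible case: a single measured quantity that the model assumes to be roughly constant. The code below is an illustrative assumption throughout, not the model used in this case study. It uses a scalar random-walk model, hypothetical noise variances, and a 95% chi-squared limit with one degree of freedom (3.841) as one common choice of statistical test on the difference between the measurement and the model prediction.

```python
CHI2_95_1DOF = 3.841  # 95% point of the chi-squared distribution, 1 degree of freedom


def kalman_residual_flags(measurements, x0, p0, q, r):
    """Return indices of samples whose prediction residual exceeds the limit.

    Scalar random-walk model: x_k = x_{k-1} + w,  z_k = x_k + v,
    with process noise variance q and measurement noise variance r.
    """
    x, p = x0, p0
    flagged = []
    for k, z in enumerate(measurements):
        # Prediction step: the model expects the state to stay where it was.
        x_pred = x
        p_pred = p + q
        # Residual between the real measurement and the model's estimate of it.
        residual = z - x_pred
        s = p_pred + r  # predicted variance of the residual
        if residual * residual / s > CHI2_95_1DOF:
            flagged.append(k)
        # Update step: blend prediction and measurement using the Kalman gain.
        gain = p_pred / s
        x = x_pred + gain * residual
        p = (1.0 - gain) * p_pred
    return flagged


# Hypothetical temperature trace: steady at 70, then a step change at sample 20.
trace = [70.0] * 20 + [80.0] * 10
print(kalman_residual_flags(trace, x0=70.0, p0=1.0, q=0.01, r=1.0))
```

Because the filter keeps updating on the faulty measurements, it gradually adapts to the new level and the flags eventually stop; a practical scheme would need to decide how to treat the state estimate once a fault has been declared.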


The inference drawn from the difference between the model and system outputs, referred to as the residual, often uses simple statistical limits. Assumptions enforced for model validity, including the random distribution of sensor noise, allow chi-squared confidence limits, for example, to be determined for each element of the residual vector. Expert knowledge is then employed to establish which faults will be evident in each of these elements.

In contrast to model-based approaches, where a priori knowledge of the system is required, process history based or data-driven methods require only the availability of a large amount of historical process data. These techniques attempt to capture the relationship between system measurements and system behaviour, with the goal of detecting and identifying fault-affected behaviour from future measurements.

By definition, a data-driven approach to fault detection and isolation is one in which the decision criteria are based primarily or wholly on example data. Essentially, a sufficiently large example dataset representative of each fault of interest is used to generate an algorithm which ‘maps’ a single observation input to a single fault classification output. As new or ‘unseen’ observations of the system are presented, they are classified using these mappings, which allows both the detection and isolation of faults.

Data-driven methods are typically applied to systems for which the development of accurate state-space or other dynamical equations is not possible or practical. Difficulty in the determination of accurate dynamical equations is common in engineering problems for one or more of the following reasons:

1. A lack of understanding of the dynamic interactions between system components
2. Unpredictable environmental conditions which significantly affect the system
3. Non-linearities within the system for which suitable approximations have not been determined

Whether applied to fault detection or other classification problems, data-driven methods are often referred to as black box solutions. This is because little or no understanding of the particular system is required for their implementation. Although often based on strict optimality criteria, little interpretation can be drawn from the resulting equations which map the input data to a system classification. Regardless, these methods have proven useful tools for the detection and isolation of faults in a wide variety of engineering problems.

25.4.2 Data Driven Techniques for FDI

Numerous journal and conference papers have been published describing the application of data-driven techniques to fault detection problems. Their popularity is largely due to the fact that the established algorithms, namely principal components analysis (PCA), partial least squares (PLS), linear discriminant analysis, fuzzy logic discriminant analysis and neural networks, are simple and fast to apply with little system knowledge. Venkatasubramanian et al. (2003) provide a comprehensive review of process history based methods applied to FDI, referencing over 140 such papers. This section provides just a handful of brief descriptions of data-driven FDI applications, for the sole purpose of illustrating the methods by which the example data classifications are typically determined.

McKay et al. (1996) described the use of an artificial neural network, or ANN (see Section 25.4.4), to determine the acceptability of a polymer coating used to coat copper wire. It was determined that the viscosity of the polymer as it exited the extrusion process (during manufacture) was the most reliable indicator of quality, short of destructive testing. A neural network was employed to estimate this viscosity based on sensor measurements on the extrusion equipment and data from an attached rheometer. Network training data was developed over a period of time during which laboratory experiments were performed to accurately determine the viscosity of a number of extruded polymer samples. This form of training data is manually generated, and relies on a number of supervised sets of measurements.

Also described in McKay et al. (1996) is the use of a neural network integrated as part of a model-based predictive control scheme. In this case, a detailed model of the process of mixing air and fuel in a combustion engine was developed, and the model was interrogated with a number of initial condition scenarios to generate a predicted set of measurements. This set of conditions and artificial measurements formed the training dataset for the neural network.

Chow (2000) describes the use of an ANN to detect and isolate simple faults in a DC motor. In contrast to the two prior examples, the training process involved expert diagnosis to classify faults/failures as they occurred. With each occurrence, the network weights were updated. To expedite the process, faults were induced by damaging components or changing the resistance of internal components.
The supervised approach to generating example data is typical of data-driven FDI examples in the open literature. Such research focuses on new detection and isolation regimes, and assumes that training data is both available and accurate.

25.4.3 Training Data Set

All data-driven FDI systems need to be trained on known data before they are applied to unknown data. Availability of quality training or example data is an essential requirement whether one uses statistical FDI or artificial neural networks. Example data are a sufficiently large dataset with the state of the system identified for each observation. The identification process maps every observation to a discrete state. Below is an augmented matrix illustrating the form in which such a training set, with associated classifications, Y, would be assembled.

$$
Y = \left[ \begin{array}{cccc|c}
y_{11} & y_{12} & \cdots & y_{1p} & C_1 \\
y_{21} & y_{22} & \cdots & y_{2p} & C_2 \\
\vdots & \vdots &        & \vdots & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{np} & C_n
\end{array} \right]
$$


The last column in the above matrix includes the state descriptors assigned to each observation vector (each row). Based on the assumption that the classifications accurately and discretely describe the state of the system, various algorithms may be applied to generate rules (or equations) that map a single observation vector input to a single classification output. Once generated from the training set, these rules can be used to classify new observations of the system. As the state of the system changes from ‘normal operation’ to a state indicative of the presence of a particular fault, this may be recognized as a fault being both detected and isolated (identified).

Various data-driven techniques for FDI were discussed in the previous section. The most common of these are multivariate statistical analysis (linear and nonlinear) and artificial neural networks. Both approaches have proven to be valuable data-driven tools for the classification of multivariate observations.

The performance of an FDI system generated from example data is a function of both the observability of each fault within the monitored variables and the quality of the example data collected. Since these techniques are typically applied where mathematical modeling is not feasible, a rigorous study of the observability of each fault in observation space is not possible. The successful detection of faults implies observability, but failure to detect certain faults does not imply non-observability. Observable faults will not be detected if the FDI function is not sensitive to the specific changes exhibited by a fault, or if the training dataset is not of good quality. It is paramount that one endeavours to apply a complete, unbiased and representative training dataset in order to achieve a robust and accurate fault detection and isolation system.
25.4.4 Neural Networks for FDI

Inspired by the way the biological nervous system processes information, artificial neural networks (ANNs) are a mathematical paradigm composed of a large number of interconnected elements operating in parallel. The function of the network, although influenced by a number of factors including its architecture, is largely determined by the connections between elements. Analogous to the ability of the biological system to learn by example, particular functions can be developed by adjusting the values of these connections, which are known as weights.

Essentially, neural networks are adjusted, or trained, so that a particular input produces a specific target output. Based on a comparison of the output and the target, network parameters are adjusted in an iterative process until the output adequately matches the target. This process is known as supervised learning, which typically involves a large number of input/target pairs. During training, each output is set to be a binary indicator for each data classification. Unlike linear discriminant FDI, however, the output of the network for unseen data cannot be interpreted as the likelihood that the observation belongs to a particular class.

Figure 25.2 shows the mathematical workings of the most basic neural network element, often termed a neuron. Each element of the vector input x is multiplied by a weight. These products are summed, together with the neuron bias b, to form the net input, n. This net input is then applied to a transfer function to produce the neuron output, z. The projection of the neuron element can be viewed as a discriminant function g(x) given by

$$
g(\mathbf{x}) \equiv z = f\left( \sum_{i=1}^{n} x_i w_i + b \right)
$$

Figure 25.2. Single neuron with vector input
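The single-neuron computation can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names, the example weights and the choice of `tanh` and logistic transfer functions are not taken from the chapter, which does not specify the transfer function used.

```python
import math


def logistic(net):
    """Logistic (sigmoid) transfer function, one common choice."""
    return 1.0 / (1.0 + math.exp(-net))


def neuron(x, w, b, f=math.tanh):
    """Return z = f(sum_i x_i * w_i + b) for input x, weights w and bias b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    net = sum(xi * wi for xi, wi in zip(x, w)) + b  # net input
    return f(net)


if __name__ == "__main__":
    x = [1.0, 2.0]          # hypothetical measurements
    w = [0.5, -0.25]        # hypothetical weights
    print(neuron(x, w, b=0.0, f=logistic))  # net input is 0, so output is 0.5
```

Swapping `f` changes only the shape of the output; the weighted sum and bias are the parameters that training adjusts.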

It is the transfer functions of a neural network that allow it to produce highly non-linear relationships between the input and output. Figure 25.3 illustrates a multilayer neural network. Such a network has significantly greater expressive power, and is able to map a vector input to a vector output. In this case, the input is the set of measurements collected by the SCADA system. The output of the network is the state classification of the longwall system, which may indicate that everything is normal, or that a particular fault is present.

Figure 25.3. Multiple layered neural network with vector input


A number of software packages exist to implement neural networks of various architectures for any classification task. As such, engineers tend not to focus on the detailed and ambiguous task of governing the precise training process of the network. It must be stated, however, that the details of the training algorithm dictate the level of classification success achieved, second only to the requirement for quality training data. A detailed description of the algorithm used in this application can be found in Bongers (2004), which also outlines the flexibility that the engineer has in varying specific training parameters.

25.5 Longwall Mining FDI Training Set Development

The purpose of a training dataset is to provide an algorithm with sufficiently broad examples of each classification to allow the generation of an FDI function with high distinguishing power. It is important that the training set is not biased in its ability to determine a particular class of observation, as this will lead to high rates of misclassification of other classes. Most importantly, however, the training dataset must provide the subsequent FDI function with sufficient information to capture the underlying relationships between various distributions of observation vectors and the associated state of the system.

In terms of classification bias, all data-driven techniques are affected in the same way: an unequal number of each class of observation in the example dataset will cause the resulting FDI function to be biased in its class assignment. Ironically, the class most often presented during the development stage (training) will have a less than appropriate chance of being assigned to new observations. This is a result of an overfitting of the FDI function to that particular class, effectively placing a higher importance on variations from the class mean observation vector. Clearly, this is undesirable, leading to above-average rates of misclassification.

Equally important in the quest for an accurate fault detection system is the distribution of observations for each class. Since the data-driven approach does not assume an understanding of the effect of individual faults on data properties, large amounts of data must be collected. If possible, the data should span an operational period in which a large number of example faults occur, and should exhibit an ordinary proportion of faults/failures per unit time. In situations where a large amount of data cannot be collected, data-driven approaches may not be appropriate.
The goal of this section is to demonstrate the non-triviality of developing a classification scheme for longwall fault detection and isolation. Also, links will be formed to illustrate how this phenomenon is common to a large number of engineering problems.

The first concern in processing data is, however, the estimation of missing entries. Estimation must be accurate and efficient. The k-th nearest-neighbour (kNN) algorithm (Todeschini 1990) is used in this study.

A training dataset requires classifications associated with each observation, for the purpose of generating an FDI function. In order to generate a list of classifications, one must attempt to determine the ‘state’ of the longwall at each observation. It should be noted that the development of an accurate FDI system relies on the assumption that the state of the longwall system can be classified into a finite number of categories.

The only record of the activity of the longwall is the maintenance log, which details all unscheduled downtime at the longwall face. Table 25.1 is an excerpt from the maintenance log corresponding to the condition monitoring data discussed earlier. It records the time that the delay began and the duration of downtime experienced. The plant responsible is also recorded, as well as a description of the delay cause.

Figure 25.4 illustrates the inaccuracy of the maintenance records. It shows traces of motor currents and the shearer position, centred on a time corresponding to a documented delay. In this case, the maintenance records show that a delay began at observation 9059, and that the longwall was inactive for 50 observations (25 min).

Table 25.1. Excerpt from the maintenance log

Date       DST    Dur. (min)  Major delay        Minor delay            Detail delay          Remark
06-May-01  21:35  5           Support services   Pumps                  –                     –
06-May-01  21:45  25          Support services   Power supply           –                     –
06-May-01  23:20  40          Support services   Power supply           –                     –
06-May-01  0:30   80          Support services   Pumps                  –                     –
06-May-01  2:00   10          Maingate drive     Drive assembly         Cooling water supply  Blown hose in pump station
06-May-01  3:40   5           Maingate drive     Drive assembly         Cooling water supply  –
06-May-01  4:45   104         Panel              Supplies               –                     –
07-May-01  6:30   20          Labour             Travel                 –                     Panel prepr.
07-May-01  6:50   30          Maingate drive     Drive assembly         Cooling water supply  Tripper belt slip
07-May-01  8:25   20          Shearer            Cutting drum assembly  Cutter shear shaft    Tripper belt slip
07-May-01  12:40  10          Shearer            Cutting drum assembly  Cutter shear shaft    –
07-May-01  13:15  10          Shearer            Electrical – Control   Display – Screen      ESR faults on remotes
07-May-01  13:58  45          Shearer            Cutting drum assembly  Cutter shear shaft    Intermittent loss of shearer position
07-May-01  14:58  2           Maingate drive     Drive assembly         Cooling water supply  –
07-May-01  15:08  7           Maingate drive     Drive assembly         Cooling water supply  Tripper CST trip
07-May-01  15:15  10          Mining conditions  Fall/clean up          –                     Tripper CST trip
07-May-01  16:05  5           Maingate drive     Drive assembly         Cooling water supply  Tripper CST pump fault

A stoppage in production is indicated by the shearer position remaining constant (i.e. not moving) and all motor currents falling to zero. Figure 25.4 shows two examples of longwall shutdown, neither of which coincides with the documented event. The reason for this common discrepancy is simple. The shift supervisor enters the details into the maintenance log. Values such as the delay start time and duration are taken from his/her wristwatch, whereas the time associated with the condition monitoring data is that of the computer clock. Additionally, there are significantly more stoppages in production than documented delays. As a result, there is uncertainty as to which stoppage in longwall production corresponds to each documented delay.

In addition to this uncertainty, there is no indication of how long the fault was present prior to the resulting shutdown. This information is necessary so that fault-affected observations can be appropriately classified. The observations after the shutdown are not generally useful for the purposes of fault isolation since all shutdowns look similar, regardless of the triggering cause. Therefore, to generate a training dataset, there are two distinct challenges:

1. To determine the event time; i.e. to determine which longwall stoppage relates to each documented instance of a maintenance event
2. To determine the number of observations prior to shutdown that contain information about the presence of a fault, and to develop a scheme to classify observations based on the maintenance record

Figure 25.4. Example of fault at observation 9059

Where possible, the challenges are approached in a generic manner. This will illustrate the applicability of this research to a large number of engineering problems where system modeling is highly complex, and discrete states of the system are not immediately apparent. All faults considered lead to a complete longwall shutdown. That is, one or more parameters (examples include gearbox temperatures, AFC chain tension and earth leakage current) measure outside preset safety limits, causing all major longwall machinery to shut down. As such, all longwall stoppages represent candidates for each documented maintenance event. This section describes the process by which the start time and duration of all longwall stoppages were determined, as well as the selection criteria for candidates for each maintenance event of interest.


25.6 Event Time Determination

No binary channel exists to indicate whether the longwall is operational or shut down; therefore other channels must be used to make this simple decision. The most obvious choice is the motor currents of the major equipment, namely the shearer, AFC and BSL/Crusher, as shown in Figure 25.4. When the longwall is shut down, each of these records a value of 0.01 A, which is the minimum recordable level set in the monitoring software of the SCADA system.

Figure 25.4 illustrates the idiosyncrasies of longwall stoppages. First, when the motor currents resume typical operating values, there is often a delay prior to shearer movement. This is most commonly a result of the 1–3-min period required for the conveyor start-up sequence. Second, large spikes in the armature current of both AFC drives are evident 30 s to 1 min before the face equipment is powered. This is known as the inrush current, which is the initial current demand on startup of an AC drive before a load resistance or impedance increases to its normal operating value. Third, the shearer position may change (the shearer may be moved) although the remainder of the longwall face is inactive. This is simply due to the operators moving the shearer to allow access for repairs.

Due to the large number of stoppages present in the data, the process of stoppage detection and candidate selection must be automated. It is important, therefore, that the duration of each longwall stoppage be clearly defined. Considering the idiosyncrasies mentioned above, a single longwall stoppage is defined as lasting from the time when all face equipment motor currents have a value of 0.01 A and the shearer stops moving to the time the motor currents resume typical operating values, ignoring any ‘non-zero’ values for current that occur for two observations (1 min) or less.

25.6.1 Candidate Selection

We consider now the selection of candidate stoppages for each maintenance event.
It is of course likely that the true event time lies in the vicinity of the documented delay start time (DDST), and most certainly within the same 8-h working shift. Although not shown in Table 25.1, the maintenance log contains a ‘shift’ field, which indicates day, afternoon or night shift. The shift schedule is known for the mine from which the data was collected. Therefore, to establish a conservative approach that will be adopted throughout this chapter, all longwall stoppages within the same shift will be considered candidates for each fault occurrence of interest.

25.6.1.1 Procedure

The process of determining candidate stoppages was automated using the following procedure:

Step 1: Determine a list L of all observations for which the value of all face equipment motor currents is 0.01.
Step 2: Determine the observation number for each observation in L. Step 2 is required since the removal of sparse observations (a procedure carried out during data preprocessing) disturbed the sequential nature of observations in the data matrix Y.
Step 3: Beginning with the first entry in L, determine which successive observation numbers have a difference greater than 2. Place the latter observation number of each such pair in a new list L2.
Step 4: Using lists L and L2, create a two-column matrix S which lists the start time and duration of each stoppage.

The following steps are repeated for each maintenance event of interest:

Step 5: Using the maintenance log (stored electronically), determine the observation numbers spanned by the appropriate shift.
Step 6: Determine which stoppages listed in S have a start time within this shift.
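The steps above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function names, the representation of a stoppage as a `(start_observation, duration_in_observations)` pair and the half-open shift interval are assumptions. The gap rule follows Step 3 as written, so a difference greater than 2 between successive observation numbers starts a new stoppage.

```python
def find_stoppages(zero_current_obs, max_gap=2):
    """Group observation numbers at the 0.01 A floor into stoppages.

    zero_current_obs: observation numbers for which all face equipment
    motor currents read 0.01 (list L in the procedure).
    Returns a list of (start_observation, duration_in_observations),
    playing the role of matrix S.
    """
    obs = sorted(zero_current_obs)
    if not obs:
        return []
    stoppages = []
    start = prev = obs[0]
    for n in obs[1:]:
        if n - prev > max_gap:  # gap too large: previous stoppage has ended
            stoppages.append((start, prev - start + 1))
            start = n
        prev = n
    stoppages.append((start, prev - start + 1))
    return stoppages


def candidates_in_shift(stoppages, shift_start, shift_end):
    """Return the stoppages whose start lies in [shift_start, shift_end)."""
    return [s for s in stoppages if shift_start <= s[0] < shift_end]


if __name__ == "__main__":
    L = [10, 11, 12, 14, 20, 21]  # hypothetical observation numbers
    S = find_stoppages(L)
    print(S)                               # [(10, 5), (20, 2)]
    print(candidates_in_shift(S, 0, 15))   # [(10, 5)]
```

With two observations per minute, a duration in observations converts to minutes by halving it.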

The stoppages listed in S that have a start time within the same shift as a particular delay are the candidate stoppages for that delay.

25.6.1.2 Results

When this procedure was applied to data representing five months of longwall operations, 2452 stoppages were identified. The average duration of a stoppage is 69 observations or 34.5 min (the sampling rate is two observations per min). On average, five candidates were selected for each maintenance event of interest using the procedure described. As further evidence of the inaccuracy of the maintenance log, analysis showed that two particular shifts had fewer longwall stoppages than the number of catastrophic maintenance events documented for each shift.

25.6.2 Event Candidate Cost Function

A number of electronically-recorded stoppages in longwall production have been identified as those associated with documented maintenance events. Furthermore, for reasons discussed in the previous section, only a handful of these are considered candidates for each occurrence of a fault of interest. This section attempts to discriminate further between candidates.

A two-stage process was adopted. In the first stage, a number of stoppages in longwall production were identified as candidates for each documented delay of interest. On average, five candidates were selected for each event. In the second stage (presented in this and subsequent sections), these likely candidates were compared against each other to identify the best match to the event described in the production delay history. It was important that this was done in a generic manner, i.e. with no consideration given to the nature of each specific failure. Each step in the development of the training dataset had to be universally applicable, allowing the generation of FDI systems for a variety of applications.
The research question that must now be posed is: ‘what information is available that can be used to determine which candidate corresponds to the documented downtime?’


To answer this, we look to the maintenance log. The only information available is the difference between the delay start time and duration of each candidate and those of the documented event. We define ∆DST as the difference between the delay start time of a candidate and the documented delay start time. ∆DD is similarly defined as the difference between the duration of each candidate stoppage and that of the documented delay. Each candidate will have associated values of ∆DST and ∆DD, and these will initially be used to determine which candidate corresponds to the documented downtime.

The discriminating metric is simply a weighted sum of the available discriminatory information, in this case ∆DST and ∆DD. Commonly referred to as a cost function, it provides a crude way of determining which stoppage relates to the documented maintenance event. The form of the cost function is

Cost = α ∆DST + β ∆DD

where α and β are the (generally) unequal weights. A cost of zero indicates a stoppage whose start time and duration are congruent with the maintenance records. Conversely, a large cost shows that one or both of the indicators differ significantly from those documented. Typically, the candidate with the lowest cost would be selected.

25.6.2.1 Determining α and β

The task of assigning values to the cost coefficients is usually approached in an ad hoc manner. One must rely on particular knowledge of the application and make an educated guess as to the contribution of each indicator. In this case, we are trying to answer the question: ‘how much confidence can we place in the operator to correctly document the delay start time and delay duration?’ More specifically, ‘by what factor do we believe ∆DD to be more/less accurate than ∆DST?’

Section 25.2 presented the key performance indicators that are universal to Australian longwall mines. Clearly, longwall availability and consequently machine downtime are under the watchful eyes of the mine manager. As such, it is expected that the documented delay duration is a reasonably accurate reflection of the actual lost time. Discrepancies arise, however, when a failure and repair immediately precede or follow a meal break. The latter is more likely; a failure of face equipment is the most logical time for workers to break, rather than interrupt continuous production. A result of this is that the documented DD may not include the time of the meal break. Therefore, the computer records used to detect candidates will show one long production delay rather than two distinct events.

On the other hand, there is little advantage in correctly documenting the delay start time. In fact, some longwall operations see documenting the DST as needless bookkeeping and do not record it.

Our experience of underground operations showed that operators were diligent and accurate in the documentation of the delay duration. For no apparent reason, the DD usually included any crib break that immediately followed, negating the problem previously described. Large variation, however, was noticeable in the DST.


Table 25.2 shows the maintenance log from a single shift we observed. Table 25.3 is our record of the events as they occurred at the longwall face. Clearly, there are discrepancies in both the DST and DD. Analysis of these errors shows the average discrepancy to be 8 min and 31 min for DD and DST respectively. In line with the previous arguments, and the limited comparative data, it is decided that, on average, |∆DST| will be four times larger than |∆DD|. Therefore, the cost function for initial candidate selection will be

Cost = ∆DST + 4 ∆DD
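A minimal sketch of how such a cost function might be evaluated over a set of candidate stoppages is given below. Several details are assumptions made for illustration: the differences are taken in absolute value so that a zero cost means a perfect match, times are expressed in minutes, and the `margin` argument and function names are hypothetical. Setting `margin=3` corresponds to requiring the best candidate to be three times cheaper than any other, the nominal factor used in Section 25.6.2.2.

```python
def cost(delta_dst, delta_dd, alpha=1.0, beta=4.0):
    """Weighted sum of the start-time and duration discrepancies."""
    return alpha * abs(delta_dst) + beta * abs(delta_dd)


def select_candidate(candidates, doc_start, doc_duration, margin=1.0):
    """Return the index of the lowest-cost candidate, or None.

    candidates: list of (start_min, duration_min) for each stoppage.
    A candidate is returned only if its cost multiplied by `margin`
    does not exceed the cost of every other candidate.
    """
    costs = [cost(start - doc_start, dur - doc_duration)
             for start, dur in candidates]
    if not costs:
        return None
    ranked = sorted(range(len(costs)), key=lambda i: costs[i])
    best = ranked[0]
    if len(ranked) > 1 and costs[best] * margin > costs[ranked[1]]:
        return None  # no candidate stands out clearly enough
    return best


if __name__ == "__main__":
    # Hypothetical candidates: (start time in min, duration in min).
    cands = [(100, 30), (160, 28), (95, 60)]
    print(select_candidate(cands, doc_start=105, doc_duration=30, margin=3))
```

Returning `None` when no candidate is clearly best keeps the selection conservative, at the price of leaving some maintenance events unassigned.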

Table 25.2. Maintenance records

TIME   DELAY (min)  CLASS  DETAILS
15:10  120          M      Shearer out of hydraulic oil, pressure switch faulted
17:40  45           M      Hydraulics: change stabilizer cylinder valve
19:15  20           M      Replace LW shearer picks
19:45  25           M      AFC Chain overtension

Table 25.3. True equivalent of Table 25.2

TIME   DELAY (min)  CLASS  DETAILS
14:23  137          M      Replace shearer hydraulic fluid pressure switch
17:32  44           M      Changeover stabilizer cylinder valve on support #62
19:37  34           M      Problem with AFC Tension - system reset
20:17  16           M      Shearer picks
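The average discrepancies quoted in the text can be reproduced from Tables 25.2 and 25.3 with a short calculation. The pairing of rows is an assumption: events are matched by their descriptions, so the shearer picks and AFC tension entries, which appear in a different order in the two tables, are paired with each other.

```python
def minutes(hhmm):
    """Convert an 'H:MM' or 'HH:MM' clock time to minutes after midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


# (documented start, documented delay, observed start, observed delay)
events = [
    ("15:10", 120, "14:23", 137),  # shearer hydraulic pressure switch
    ("17:40", 45, "17:32", 44),    # stabilizer cylinder valve
    ("19:15", 20, "20:17", 16),    # shearer picks
    ("19:45", 25, "19:37", 34),    # AFC chain tension
]

dst_errors = [abs(minutes(doc) - minutes(obs)) for doc, _, obs, _ in events]
dd_errors = [abs(doc - obs) for _, doc, _, obs in events]

mean_dst = sum(dst_errors) / len(events)  # 31.25 min
mean_dd = sum(dd_errors) / len(events)    # 7.75 min

if __name__ == "__main__":
    print(dst_errors, mean_dst)
    print(dd_errors, mean_dd)
    print(mean_dst / mean_dd)  # roughly 4
```

Under this pairing the mean errors round to 31 min for the DST and 8 min for the DD, a ratio of about four.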

25.6.2.2 Application of the Cost Function

Application of a simple cost function whose coefficients are based on ‘gut feeling’ and data from a single operating shift can provide misleading results. As such, a candidate will only be selected if its cost is at least three times lower (a nominal factor) than that of every other candidate within the 8-h window. When this rule was applied to all 89 fault occurrences of interest, a single candidate could be selected for 11 of them. It is the data immediately prior to these that will form the basis of the work in Section 25.7.

25.6.3 Clustering Algorithm for Candidate Selection

The purpose of this section is to employ a clustering algorithm to select candidates for faults where a single candidate could not be conservatively selected by the cost function. The approach taken relies on two assumptions: first, that the candidates selected by the cost function are, in fact, the actual stoppages corresponding to the documented delay (which can be confirmed by the manual inspection of key individual channels prior to the selected candidate), and second, that the dynamics of the longwall are slow enough that the observations immediately prior to these instances of shutdown contain information indicative of the fault that is present.


25.6.3.1 Discriminant Analysis

Similar to PCA, discriminant analysis produces a new dataset, via linear or nonlinear projection, which is of the same dimensionality as the original set. The projections are orthogonal, and are developed with criteria to maximize the separation (in multivariate space) between classes of data, while minimizing the spread within each class. The hypothesis is that data representing similar (known) longwall behaviour can be made to cluster. If data of an unknown class, under the same linear projections, tends to join a particular cluster, it may then be classified accordingly.

25.6.3.2 Clustering Results

In line with the conservative approach in the previous sections, two observations prior to each identified candidate were classified according to the type of fault they represent. Other classes of data included were randomly selected observations corresponding to longwall shutdown and normal operation. Some observations were duplicated to ensure an equal number of each class. The clustering algorithm was applied to this data.

The expectation is that observations prior to certain unassigned candidates, projected under the same algorithm, would fall into the confidence interval defined by data representative of a known fault type. This would allow the assignment of candidates to the remaining faults, completing the event time determination process.

Figure 25.5 shows an example where the clustering algorithm projected candidates in such a way that one alone could be identified as the stoppage of interest. The selected candidate is projected within the 95% confidence interval of the test group.

Figure 25.5. Projections of candidate observations for classification
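The duplication of observations used to equalize class sizes before clustering can be sketched as below. This is an assumption-laden illustration: the chapter does not say how the duplicated observations were chosen, so random duplication with a fixed seed is used here, and the function name and data layout are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_duplication(observations, labels, seed=0):
    """Duplicate randomly chosen rows until every class is equally large."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for obs, lab in zip(observations, labels):
        by_class[lab].append(obs)
    if not by_class:
        return [], []
    target = max(len(rows) for rows in by_class.values())
    out_obs, out_lab = [], []
    for lab, rows in by_class.items():
        # Draw extra copies from the existing rows of the smaller classes.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            out_obs.append(row)
            out_lab.append(lab)
    return out_obs, out_lab


if __name__ == "__main__":
    obs = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0], [9.5]]
    lab = ["normal"] * 5 + ["fault"] * 2
    new_obs, new_lab = balance_by_duplication(obs, lab)
    print(new_lab.count("normal"), new_lab.count("fault"))
```

Duplication adds no new information about a fault; it only prevents the larger classes from dominating the fitted projections.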


Similar results were seen for the majority of maintenance events, with the exception of seven. Candidate selection was not possible for these because:

• No candidates were projected within the 95% confidence interval as defined by the T²-statistic
• More than one candidate was projected within this confidence interval

25.7 Classification of Observations

The work presented in the previous section established the one-to-one correspondence between stoppages in longwall production and documented maintenance events. A simple cost function of the discrepancies in the start time and duration of candidate stoppages allowed the automated selection of a small number of candidates. Slow longwall dynamics permitted the classification of observations prior to these shutdowns, which were used to generate a discriminant function to select the remaining candidates.

This section describes the assignment of classifications to a large number of observation vectors, which will combine to form a training set for FDI development. In this training set, every observation vector will be assigned to a class. The label itself is of little consequence; it serves only to indicate which group or class of data each observation represents. These classifications must be accurate, discrete descriptions of the state of the longwall. As with the event time determination problem, this process must have a reasonable level of automation and generality to allow fast and accurate classification of observations in applications other than longwall fault detection.

The majority of the observations will be assigned to one of the two main classes: the class ‘normal’, representing fault-free, normal operation; and the class ‘longwall shutdown’, representing the state of complete stoppage. Particular types of failure are identified by analyzing the observations while the longwall is in transition from the ‘normal’ to the ‘shutdown’ state. This assumes that there is a transition. In other words, it is assumed that the data prior to each resulting shutdown contains information regarding the presence of the fault. Analysis of this data should then reveal the transition from normal operation to fault-induced operation.

This section addresses the problem of determining the length of that transition period; i.e. identifying the set of observations that are distinct to each shutdown. Unfortunately, the system has the same shutdown signature regardless of the cause of the shutdown. Therefore, the observations to be nominated as training set entries for a particular shutdown should occur immediately before the shutdown. The following questions need to be answered:

1. Is there a set of N observations $\{y_{k-N-1}, \ldots, y_{k-1}\}$ before the shutdown at $y_k$, which are different from normal operation and can be used as an indicator of the development of the fault that eventually causes the shutdown?
2. What is the value of N? That is, how many observations prior to shutdown can be included in the training set for each fault class?

Fault Detection and Identification for Longwall Machinery Using SCADA Data


Analysis of individual trends prior to shutdown would be a laborious process. Reliance on specific knowledge of each fault is also undesirable if a certain level of generality is to be retained. As such, bulk data properties, or metrics, are observed instead. Numerous metrics were investigated; two of these, Hotelling’s T² statistic (Hotelling 1931) and the PCA-residual Q-statistic (Bongers 2004), clearly emphasized a change in the data properties (the relationships between variables) prior to each of the failures of interest.

25.7.1 Distribution Prior to Shutdown

Figure 25.6 shows the trace of the T²-statistic around the time of a longwall stoppage identified as an example of a maingate drive cooling fault. The values on the horizontal axis of this and subsequent figures have been shifted so that observation zero represents the first measurement of longwall shutdown. There is a clear transition from normal operation to shutdown, indicated by the values of this statistic starting to rise a number of observations prior to shutdown. The dashed lines represent the upper and lower confidence limits for data representative of normal longwall operation. These were determined by conservatively selecting data between a number of stoppages in production. The T² values for observations in the class normal are likely to stay between these limits. It is the violation of these limits that can be used to test whether the system is behaving in an abnormal manner.

Figure 25.6. T² statistic prior to drive assembly cooling fault

This particular figure shows a distinct change from what is apparently normal operation. The four observations prior to shutdown are clearly outside the 95% confidence limit, which suggests that these represent operation with the fault present.


D. Bongers and H. Gurgenci

Figure 25.7. T² statistic prior to maingate blockage

Figure 25.7 shows the T² values in the vicinity of an AFC maingate blockage fault. Once again, significant abnormal activity is observed prior to shutdown.

Figure 25.8. Q-statistic prior to BSL Dupline fault

The Q-statistic was able to illustrate abnormal system behaviour prior to shutdown resulting from a BSL Dupline fault, as shown in Figure 25.8. This is an encouraging result, as one would expect a faulty Dupline controller to contribute little detectable variation in the system. The detection of this sort of variation in the Q-statistic also highlights the fact that it is sensitive to smaller effects not captured by the lower-dimensional PCA representation.
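The two metrics discussed in this section can be illustrated with a short sketch. It is not the authors' implementation: it assumes a PCA model fitted to auto-scaled fault-free data, and it assumes the 95% limit is taken as an empirical quantile of the fault-free values, whereas the chapter only states that the limits were derived from conservatively selected normal data. The function names (fit_pca_monitor, t2_and_q, empirical_limit) are hypothetical.

```python
import numpy as np

def fit_pca_monitor(X_normal, n_components):
    """Fit a PCA model to fault-free data (rows are observations)."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std                    # auto-scale each variable
    # The SVD of the scaled data gives the principal directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                        # loadings (variables x components)
    eigvals = s[:n_components] ** 2 / (len(Z) - 1) # variance of each retained score
    return {"mean": mean, "std": std, "P": P, "eigvals": eigvals}

def t2_and_q(model, X):
    """Return Hotelling's T2 and the PCA-residual Q for each row of X."""
    Z = (X - model["mean"]) / model["std"]
    scores = Z @ model["P"]
    # T2: squared scores weighted by the inverse score variances.
    t2 = np.sum(scores ** 2 / model["eigvals"], axis=1)
    # Q: squared length of the part of Z left outside the PCA subspace.
    residual = Z - scores @ model["P"].T
    q = np.sum(residual ** 2, axis=1)
    return t2, q

def empirical_limit(values, level=0.95):
    """Assumed limit: the empirical quantile of the fault-free statistic."""
    return np.quantile(values, level)
```

In this sketch, an observation that moves far along the normal correlation structure raises T², while one that breaks the relationships between variables raises Q and may leave T² unremarkable. This is consistent with the remark above that the Q-statistic picks up effects not captured by the lower-dimensional representation.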


25.7.2 Classification of Observations

The previous section highlighted distinct changes in data properties prior to longwall shutdown. In the case of the maingate drive cooling fault, AFC blockage (maingate) and BSL drive stall, this was represented by a gradual increase in the T² value outside the confidence limits for normal longwall operation. A similar trend was displayed by the Q-statistic values preceding a BSL Dupline fault. These will be referred to as one-stage faults because the transition from the normal class to the shutdown class occurs along a line of constant slope. Some instances of fault-affected behaviour showed two distinct abnormal periods prior to shutdown; these will be referred to as two-stage faults. Whether one- or two-stage, it is these abnormal observations during longwall operation that must be classified in a way that indicates the presence of a fault leading to a particular cause of shutdown. A separate class of data is required for each type of one-stage fault, and two classes for each two-stage fault. The data corresponding to the first of the two stages will be classified as “fault ‘x’ imminent” to indicate that it precedes the second class of data, named “fault ‘x’ present”, with ‘x’ corresponding to a name indicating the nature of the subsequent shutdown. One-stage faults will be assigned only the latter class, since it is the class of data that immediately precedes longwall shutdown. Figure 25.9 shows the metric values prior to shutdown for a specific fault with the boundaries for each classification.

Figure 25.9. Classifications from a T² trace

25.7.2.1 Other Classifications

Although not investigated here, a large number of other classes of data are present in the longwall data. These would include fault-affected operation leading to shutdown as a result of other failure modes not of interest here. As such, all observations not clearly representative of fault-affected behaviour, longwall shutdown or fault-free operation will be pooled into an alternative class, arbitrarily labelled ‘OTHER’. This will ensure that other fault-affected operation will not be classified as NORMAL, and an OTHER classification may itself suggest that abnormal behaviour has been detected. Given this, it should be noted that less than 12% (roughly one-ninth) of observations in the training dataset fall into this category.

25.7.2.2 Automation of Classification Process

In order to maintain the generic and semi-autonomous nature of the classification scheme, the process of classifying observations must be automated. One question that arises is: must the classifications be determined on a case-by-case basis, at each instance looking at either the T²- or Q-statistic, or should classifications for each fault type be determined by an average number of affected observations prior to shutdown? In generating training data for fault detection and isolation, all observations prior to a documented shutdown of interest that are continuously outside the 95% confidence interval for normal longwall operation will initially be classified as “fault ‘x’ present”. Of these, if the first four or more observations outside this limit show a total variation of less than 10% of their average value, they will be classified as “fault ‘x’ imminent”. This condition will be waived if either:
• These observations immediately precede the subsequent shutdown
• They represent measurements of the Q-statistic, which was utilized only for the Dupline fault

25.7.2.3 Interpretation of Classifications

Names given to classifications are purely arbitrary; however, they should provide a description of the discrete state of the system. Specifically, the IMMINENT classification should be distinguished as always preceding the PRESENT classification.
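The labelling rule of Section 25.7.2.2 can be sketched in a few lines. This is a hedged reading rather than the authors' code: ‘total variation’ is assumed to mean the range (maximum minus minimum) of the leading observations, only the upper limit is tested, and the function name and parameters are hypothetical. The two waivers are modelled by leaving a plateau that runs right up to the shutdown as ‘present’, and by a use_imminent flag that would be switched off for Q-statistic traces.

```python
def label_run_before_shutdown(stat, shutdown_index, upper_limit, fault="x",
                              use_imminent=True, min_plateau=4,
                              max_variation=0.10):
    """Label the out-of-limit run that ends just before a shutdown.

    stat is a sequence of metric values, shutdown_index the first shutdown
    observation. Returns a dict mapping observation index to class label.
    """
    # Walk backwards from the shutdown while the metric stays outside the limit.
    start = shutdown_index
    while start > 0 and stat[start - 1] > upper_limit:
        start -= 1
    run = list(range(start, shutdown_index))
    labels = {i: f"fault {fault} present" for i in run}
    if not use_imminent or len(run) < min_plateau:
        return labels

    # Grow the leading plateau while its range stays below 10% of its mean
    # (assumed interpretation of "total variation").
    plateau_len = 0
    for n in range(min_plateau, len(run) + 1):
        window = [stat[i] for i in run[:n]]
        mean = sum(window) / n
        if max(window) - min(window) < max_variation * mean:
            plateau_len = n
        else:
            break

    # Waiver: a plateau that reaches the shutdown itself stays "present".
    if 0 < plateau_len < len(run):
        for i in run[:plateau_len]:
            labels[i] = f"fault {fault} imminent"
    return labels
```

For a trace that sits near 5 for four observations and then climbs to 9 and 12 before the shutdown, the first four out-of-limit observations would be labelled imminent and the last two present.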
Detection of the IMMINENT class of data serves to predict both the presence of a fault and the shutdown that may ultimately follow. In keeping with the preferred passive rather than active nature of an on-line fault detection system, it is important to remember that the real-time detection of faults merely provides forewarning for longwall operators. Detection precedes shutdown, allowing changes in the system to be initiated to avoid or reduce longwall downtime. Also, since the classifications described above indicate conditions that may or may not lead to a system shutdown, their detection should be treated as only one factor in deciding any corrective action. The basic classifications determined earlier in this section are more accurately described as follows:
1. Normal longwall operation: current measurements indicate that the longwall is operating in a fault-free manner
2. Fault x imminent: current measurements are commensurate with those in the early stages of fault-affected operation which may lead to the eventual shutdown of the longwall as a result of an x-type fault
3. Fault x present: current measurements are commensurate with those associated with fault-affected operation which suggests that a shutdown of the longwall as a result of an x-type fault is imminent
4. Longwall shutdown: current measurements suggest that the longwall is in full shutdown; that is, no major face equipment is operational

25.7.3 Compilation of Training Set

To this point, a large number of observations have been classified as representing normal operation, longwall shutdown, fault-affected behaviour, or other (non-normal operation). This section describes the engineering judgment that must be employed in the assembly of the training set that will be used for FDI function development. Such judgment is required to obtain an unbiased training dataset that best characterizes the various states of the longwall system.

25.7.3.1 Removing Bias from Training Set

An unbiased training set is one that contains an equal number of observations for each class to be discriminated. Research has shown that an unequal quantity of each class of data typically results in a discrimination function that is less likely to classify new observations as the less frequently presented class. As such, some fault-affected observations were duplicated to ensure a large training set with an equal proportion of each data class.

25.7.3.2 Transitional Observations

Ultimately the goal of the training data development is to have a set of observations and associated classifications that, when applied to an FDI development algorithm, can produce an FDI function with the greatest distinguishing power.
Erroneous classifications in the training set may alter the decision criteria in a way that either:
• Reduces the mutual exclusivity of the decision space
• Incorrectly loosens or tightens the decision criteria for one or more classes

Observations immediately preceding the IMMINENT or FAULT PRESENT classifications that were classified as NORMAL OPERATION may represent operation with a fault present. As mentioned previously, the dynamics of the longwall are slow, requiring time for the presence of a fault to become evident in the sensor measurements. These observations may also have been taken when the severity of the impending fault was low. These so-called ‘normal’ observations are therefore considered transitional, and may not adequately characterize fault-free operation. In order to remain consistent with the stated goal of training data development, these transitional observations were removed from the training dataset.
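The two training-set adjustments of Section 25.7.3 can be sketched as follows. The chapter does not state how many ‘normal’ observations before a fault were treated as transitional, nor how duplicates were chosen, so n_transitional and the cyclic duplication used here are assumptions made purely for illustration, as are the function names and the label strings.

```python
from itertools import cycle, islice

def drop_transitional(labels, n_transitional):
    """Return the indices kept after removing 'normal' observations that
    fall within n_transitional slots before the start of a fault run."""
    keep = [True] * len(labels)
    for i, lab in enumerate(labels):
        starts_fault = lab.startswith("fault") and (
            i == 0 or labels[i - 1] == "normal")
        if starts_fault:
            for j in range(max(0, i - n_transitional), i):
                if labels[j] == "normal":
                    keep[j] = False
    return [i for i, kept in enumerate(keep) if kept]

def balance_by_duplication(indices, labels):
    """Duplicate members of the smaller classes (cycling through them)
    until every class has as many entries as the largest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(islice(cycle(members), target))
    return balanced
```

Duplication leaves the fault-affected observations themselves unchanged and only alters how often each class is presented during training.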


25.8 FDI Results

A multilayer neural network was trained with the data described. Approximately 20% of the data was reserved for network validation, and was therefore not used in the training process. Figure 25.10 presents the output of the network in the vicinity of a maingate drive cooling fault. The input data used was that reserved for validation, and hence is ‘unseen’ data. Clearly, the output indicator of normal operation drops significantly as the network indicates that the fault is present. After a time, the LONGWALL SHUTDOWN classification is dominant, and the other outputs are essentially zero. As with some figures shown previously, the horizontal axis has been shifted so that observation zero indicates the first observation of shutdown.

Figure 25.10. Network output using ‘unseen’ data

In order to measure the overall FDI performance of the neural network, we calculated the average recall and precision over all ‘unseen’ faults. These measures are defined as follows. The recall(i) of a classification system for a given class of input, i, is defined as

$$\mathrm{recall}(i) = \frac{\left|\mathrm{output}(i) \cap \mathrm{correct}(i)\right|}{\left|\mathrm{correct}(i)\right|}$$

where output(i) refers to the set of all observations that the system classifies as that of fault type i, correct(i) is the set of all observations in the input set that are actually in fault class i, and |·| denotes the number of observations in a set. The recall is then the fraction of the observations of type i that the system classifies correctly. It is of course possible that correct(i) is empty (when the system is presented with an input set for which no correct classification exists). In such a situation recall is defined as unity, regardless of the performance of the classification system. The precision(i) of a classification system for a given class of input, i, is defined as

$$\mathrm{precision}(i) = \frac{\left|\mathrm{output}(i) \cap \mathrm{correct}(i)\right|}{\left|\mathrm{output}(i)\right|}$$

Subtly different from the concept of recall, precision is the fraction of observations that the system classifies as type i that are actually of type i. In the situation where output(i) is empty (when the system never classifies an observation as type i), precision is defined as follows:
• If output(i) and correct(i) are both empty, then precision(i) = 1
• If output(i) is empty but correct(i) is not, then precision(i) = 0

Analogous to measures typically used in applied statistics, the recall and precision are the complements of the probabilities of a Type I and a Type II error respectively. That is, for each class of data,

$$P(\text{Type I Error}) = 1 - \mathrm{recall}(i)$$
$$P(\text{Type II Error}) = 1 - \mathrm{precision}(i)$$

Table 25.4 presents the overall FDI performance of the neural network. For all faults, the values of precision and recall are higher than those for the linear discriminant algorithm. All instances of faults were both detected and isolated although, again, occasionally only a few observations after the FAULT PRESENT class of data had begun.

Table 25.4. FDI performance using neural network

Fault                      No. test examples   Recall   Precision
MG drive cooling fault     14                  0.929    0.813
AFC blockage (maingate)    21                  0.952    0.833
BSL drive stall            16                  0.934    0.789
BSL Dupline fault          8                   0.750    0.857
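The recall and precision measures used in this section translate directly into set operations. The sketch below is illustrative only: the function name and the label strings are hypothetical. When the system never outputs class i, precision is taken as 1 if there was nothing of that class to find and 0 otherwise; the second case is an assumed reading of the definition.

```python
def recall_precision(predicted, actual, cls):
    """Recall and precision for one class, from per-observation labels."""
    output = {i for i, p in enumerate(predicted) if p == cls}   # output(i)
    correct = {i for i, a in enumerate(actual) if a == cls}     # correct(i)
    hits = len(output & correct)

    # Recall is defined as unity when no observation belongs to the class.
    recall = 1.0 if not correct else hits / len(correct)

    if output:
        precision = hits / len(output)
    else:
        # Nothing was classified as cls: perfect only if nothing was there.
        precision = 1.0 if not correct else 0.0
    return recall, precision
```

Averaging the two returned values over all fault classes gives the kind of summary figures reported in Table 25.4.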

The results presented in this section show the successful detection and isolation of faults using both the linear discriminant algorithm and the two-layer neural network. The improvements in FDI performance offered by the NN suggest that there exists some non-linearity in the relationship between sensor measurements and the determined classifications. This is typical of most mechanical systems, largely due to the non-linear effect of damping.


25.9 Concluding Remarks

The work presented here illustrates the application of fault detection and isolation to a longwall mine. Given the accurate and timely detection of faults, the equipment operators can preempt a catastrophic failure, or respond more rapidly to a longwall shutdown. In either case, the FDI has served its purpose, which was to reduce the system downtime associated with equipment faults.

Condition monitoring data representing five months of operation was collected from a longwall mine. An error surface, in combination with analysis of the distribution of missing entries per observation, was used to determine specific limits, α and β, for the removal of rows and columns from the data matrix that were deemed too sparse to allow sufficiently accurate missing entry estimation. Missing values were estimated using the k-NN algorithm, which displayed a smaller estimation error than other documented techniques.

The results show misclassification rates as low as 14.3%, which is considerably better than the majority of documented performances of FDI systems using real data. The two-layer neural network performed better than the linear discriminant analysis, which revealed a level of non-linearity within the system. Overall, these results were deemed largely successful, thereby verifying the validity of the classification scheme.

Significant effort has been spent on correcting the maintenance logs, which were the historical record of faults. In a number of industries, systems are in place to ensure that this data is very accurate. As such, the implementation of FDI in other systems may not require the degree of treatment presented here.

Finally, this study employed a data-driven approach for detection and isolation of longwall face equipment faults. This was necessitated by the complexity of the equipment, which made a model-based approach impractical. However, it may be possible to address at least subsets of the target fault lists by several model-based approaches (e.g. Reid 2007). Although success has been shown using data-driven techniques, any implemented FDI system would most likely be a hybrid system, incorporating decision support from a number of FDI functions. As an example, a sensor fault would cause an abnormal signal to be produced in a single channel with a negligible effect on the overall relationships in the data. The approach taken in this study would be insensitive to this sort of fault, which is, however, ideally suited to detection by model-redundancy-based algorithms.

25.10 References

Bongers, D., (2004) Development of a Classification System for Fault Detection in Longwall Systems, PhD Thesis, The University of Queensland
Chow, M.Y., (2000) Guest Editorial: Special Section on Motor Fault Detection and Diagnosis. IEEE Transactions on Industrial Electronics, 47(5):982–983
Frank, P.M., (1990) Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy – a survey and some new results, Automatica, 26(3):459–474
Gelb, A., (1974) Applied Optimal Estimation, MIT Press, Cambridge, Massachusetts
Grewal, M.S., Andrews, A.P., (2001) Kalman Filtering: Theory and practice using MATLAB, John Wiley and Sons, New York
Hotelling, H., (1931) The generalization of Student's ratio. Annals of Mathematical Statistics, 2:360–378
McKay, B., Lennox, B., Willis, M., Barton, G., Montague, G., (1996) Extruder Modelling: A Comparison of two Paradigms. UKACC International Conference on Control'96, 2:734–739, Exeter, UK. Conference publication No. 427
Reid, A., (2007) Longwall Shearer Cutting Force Estimation, PhD Thesis, The University of Queensland
Sorenson, H.W., (1985) Kalman Filtering: Theory and Application, IEEE Press, New York
Todeschini, R., (1990) Weighted k-nearest neighbor method for the calculation of missing values, Chemometrics and Intelligent Laboratory Systems, 9:201–205
Venkatasubramanian, V., Rengaswamy, R., Yin, K., Kavuri, S., (2003) Review of Process Fault Diagnosis – Parts I, II, III. Computers and Chem Eng, 27(3):293–346
Willsky, A.S., (1976) A survey of design methods for failure detection in dynamic systems, Automatica, 12:601–611

Contributor Biographies

Chapter 1

Khairy Kobbacy is the Professor of Management Science and Associate Head (Research) of Salford Business School, Salford University, UK. He is also the Director of the Management and Management Sciences Research Institute. Prof Kobbacy has a B.Sc. from Cairo, an M.Sc. from Strathclyde and a Ph.D. from Bath University. He has sustained research interests in mathematical modelling in maintenance, intelligent management systems in operations, and supply chain management. He has over 40 refereed publications and has edited 9 volumes including conference proceedings, special issues of international journals and ORS 46 Keynote papers. He chaired the European Conference on Intelligent Management Systems in Operations in 1997, 2001 and 2005 and the IBC Middle East Conference: Superstrategies for Maintenance in 1998. He was elected Vice President of the Operational Research Society (UK) 2001–2003.

Prabhakar Murthy obtained B.E. and M.E. degrees from Jabalpur University and the Indian Institute of Science in India and M.S. and Ph.D. degrees from Harvard University. He is currently Research Professor in the Division of Mechanical Engineering at the University of Queensland. He has held visiting appointments at several universities in the USA, Europe and Asia. His research interests include various aspects of new product development, operations management (lot sizing, quality, reliability, maintenance), and post-sale support (warranties, service contracts). He has authored or coauthored 20 book chapters, 150 journal papers and 140 conference papers. He is a coauthor of five books and co-editor of two books. He is on the editorial boards of eight international journals.


Chapter 2

Liliane Pintelon holds degrees in Chemical Engineering (1983) and Industrial Management (1984) from the KULeuven (Catholic University of Leuven, Belgium). In 1988–1989 she worked as a visiting research associate at the W. Simon Graduate Business School (University of Rochester, USA). She obtained her doctoral degree in industrial management (maintenance management) from the KULeuven in 1990. Currently, she is professor at the Centre for Industrial Management (KULeuven); she is also a Board Member of BEMAS (Belgian Maintenance Society) and of IFRIM (International Foundation for Research in Maintenance). Her research and teaching area is industrial engineering and logistics, with a special interest in maintenance. The majority of her academic publications lie in this area. She also has considerable experience as an industrial consultant in this area.

Alejandro Parodi-Herz received his M.Sc. degree in Mechanical Engineering from the Simon Bolivar University, Venezuela (2002), the degree of Master of Industrial Management (2003) from the Katholieke Universiteit Leuven and the degree of Master in Operations and Technology Management (2004) from the Universiteit Gent. Currently he works with the Centre for Industrial Management at the Katholieke Universiteit Leuven as a research associate while pursuing his Ph.D. degree. His research interest is mainly focused on maintenance, spare parts demand categorisation and inventory control.

Chapter 3

Jay Lee is Ohio Eminent Scholar and L.W. Scott Alter Chair Professor in Advanced Manufacturing at the University of Cincinnati and is founding director of the National Science Foundation (NSF) Industry/University Cooperative Research Centre (I/UCRC) on Intelligent Maintenance Systems. His current research focuses on autonomic computing and smart prognostics technologies for predictive maintenance and self-maintenance systems, as well as closed-loop product life cycle service model studies. He has authored/co-authored over 100 technical publications, edited 2 books, contributed numerous book chapters, 3 U.S. patents and 2 trademarks. He received his B.S. degree from Taiwan, an M.S. in Mechanical Engineering from the University of Wisconsin-Madison, an M.S. in Industrial Management from the State University of New York at Stony Brook, and a D.Sc. in Mechanical Engineering from the George Washington University. He is a Fellow of ASME and SME.

Haixia Wang is a postdoctoral researcher in the NSF Industry/University Cooperative Research Centre (I/UCRC) on Intelligent Maintenance Systems (IMS), headquartered at the University of Cincinnati. Her current research interest focuses on data streamlining for machinery prognostics and health management, manufacturing process performance and quality improvement, and design for product reliability and serviceability. Haixia Wang received her B.S. degree in Mechanical Engineering from Shandong University in China, a Ph.D. in Mechanical Engineering from Southeast University in China, and an M.S. and a Ph.D. in Industrial and Systems Engineering from the University of Wisconsin-Madison.


Chapter 4

Marvin Rausand is Professor of Reliability Engineering at the Norwegian University of Science and Technology (NTNU). He worked for the research institute SINTEF for ten years, mostly related to offshore oil and gas activities. For the last four years of this period he was Director of the SINTEF Department of Safety and Reliability. In 1989 he joined NTNU as a full-time professor. He was head of NTNU’s Department of Machine Design for five years and vice-dean of the Faculty of Mechanical Engineering for six years. In 1985–1986 he was visiting professor at Heriot-Watt University in Scotland, and in 2002–2003 he was visiting professor at Ecole des Mines de Nantes. Professor Rausand is a member of the Norwegian Academy of Technical Sciences, and of the Royal Norwegian Society of Letters and Science.

Jørn Vatn is Professor of Maintenance Optimisation at the Norwegian University of Science and Technology (NTNU). He worked for the research institute SINTEF for 15 years, mostly related to transportation, critical infrastructure, and offshore oil and gas activities. He has developed several computerized tools for decision support in safety, reliability and maintainability. For the last five years he has been involved in implementing a new maintenance strategy in the Norwegian National Railway Administration.

Chapter 5

Wenbin Wang is Chair of Operational Research at the Centre for OR and Applied Statistics, Salford Business School, University of Salford, UK. Prof. Wang received his B.Sc. (Harbin, China) in Mechanical Engineering in 1981, M.Sc. (Xian, China) in Operations Management in 1984 and Ph.D. in OR and Applied Statistics from Salford University (UK) in 1992. He has over 20 years' experience in OR modelling in general and maintenance and reliability modelling in particular. He received 3 EPSRC projects in the past and has authored and co-authored over 80 research papers. Professor Wang is a fellow of the Royal Statistical Society, the Operational Research Society and the Institute of Mathematics and its Applications, and a chartered mathematician. He is also a member of the International Foundation for Research in Maintenance. Professor Wang holds a guest professorship at Harbin Institute of Technology, China.

Chapter 6

David Percy gained a B.Sc. degree with first class honours in mathematics from Loughborough University in 1985 and a Ph.D. degree in statistics from Liverpool University in 1990. He is a reader in mathematics at the University of Salford and his research into Bayesian inference, stochastic processes and multivariate analysis has produced 40 refereed publications and many conference presentations. He is actively involved in collaborative research for industrial applications, particularly concerning maintenance scheduling problems for complex systems. Dave is a chartered scientist, chartered mathematician and member of the governing Council for the Institute of Mathematics and its Applications.


Chapter 7

Elsayed Elsayed is Professor in the Department of Industrial Engineering, Rutgers University. He is also the Director of the NSF/Industry/University Co-operative Research Centre for Quality and Reliability Engineering, Rutgers-Arizona State University. His research interests are in the areas of quality and reliability engineering and production planning and control. He is a co-author of Quality Engineering in Production Systems, McGraw Hill Book Company, 1989. He is also the author of Reliability Engineering, Addison-Wesley, 1996. These two books received the 1990 and 1997 IIE Joint Publishers Book-of-the-Year Award respectively. He is a co-recipient of the 2005 Golomski Award for the outstanding paper.

Chapter 8

David Percy: See Chapter 6

Chapter 9

Khairy Kobbacy: See Chapter 1

Chapter 10

Bo Lindqvist is Professor in Statistics at the Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim (associate professor since 1979, professor since 1988). He obtained the degree of Dr.Philos. in statistics at the University of Oslo in 1982. Lindqvist's main research interest is in stochastic modeling and statistical analysis related to reliability and survival analysis. Lindqvist is Editor of the Scandinavian Journal of Statistics (2007–). He is an elected member of The Royal Norwegian Society of Sciences and Letters and the International Statistical Institute.

Chapter 11

Robin Nicolai is a Ph.D. student at Tinbergen Institute Rotterdam. He is also affiliated with the Econometric Institute at Erasmus University Rotterdam. His research interests are maintenance optimization, in particular degradation modelling, discrete-event systems and simulation optimization. One of his papers has been accepted for publication in Reliability Engineering and System Safety. Other papers have appeared in proceedings of different international conferences.
Rommert Dekker is a full-time professor in Operations Research and Quantitative Logistics at Erasmus University Rotterdam. His research interests are maintenance optimization, inventory control, service and reverse logistics. He has published over 100 papers in scientific journals and he has been involved in the development of several decision support systems for maintenance planning.


Chapter 12

Philip Scarf is a lecturer at the University of Salford. He obtained his Ph.D. in 1989 from the University of Manchester. Among his research interests are capital replacement, reliability and maintenance modelling, and extreme value theory. He has worked on capital replacement problems with the UK NHS, Mass Transit Rail Corporation of Hong Kong, Express National Berhad Malaysia, and Malaysia Truck and Bus Berhad. He currently serves as co-editor of the IMA Journal of Management Mathematics.

Joseph Hartman is an Associate Professor of Industrial and Systems Engineering at Lehigh University in Bethlehem, PA, USA. He also serves as Department Chair and holds the Kledaras Endowed Chair. He received his Ph.D. in 1996 from the Georgia Institute of Technology and currently serves as Editor of The Engineering Economist, a journal devoted to the problems of capital investment. His research and teaching interests are in economic decision analysis, including equipment replacement analysis and transportation logistics.

Chapter 13

Gabriella Budai is a Ph.D. student at Tinbergen Institute Rotterdam. She is also affiliated with the Econometric Institute at Erasmus University Rotterdam. Her research topic is railway maintenance optimization, in particular scheduling preventive railway maintenance activities and rescheduling of the rolling stock during track possession. Her papers have been published in Journal of the Operational Research Society (JORS) and in proceedings of different international conferences.

Rommert Dekker: See Chapter 11

Robin Nicolai: See Chapter 11

Chapter 14

Wenbin Wang: See Chapter 5

Chapter 15

Prabhakar Murthy: See Chapter 1

Nat Jack is a Lecturer in Operational Research and Statistics at the University of Abertay Dundee and has more than 30 publications in refereed journals, books, and conference proceedings. The present focus of his research deals with product warranty, in collaboration with Professor D.N.P. Murthy from the University of Queensland, and this research has resulted in a series of papers examining optimal maintenance strategies for items sold with one- and two-dimensional warranties. His latest project involves a study of extended warranty decision-making using a game theoretic approach.


Chapter 16

Prabhakar Murthy: See Chapter 1

Jarumon Pongpech received her B.E. (IE) from Chiang Mai University, Thailand, in 1993. She received a scholarship from the Faculty of Engineering, Chiang Mai University, to pursue her master's degree and graduated with an M.S. (EM) from The George Washington University, USA, in 1996. For her doctoral degree she also received a grant from Thailand’s Commission on Higher Education in 2000 to study at the Department of Industrial Engineering, Chulalongkorn University, Thailand, and to conduct her research at the Division of Mechanical Engineering, The University of Queensland, Brisbane, Australia. She was formerly a lecturer at Chiang Mai University until 1999 before moving to Thammasat University. Her research interests are in the areas of maintenance policy of a system, service contracts, engineering management, and industrial engineering.

Chapter 17

Ashraf Labib is Chair of Operations and Decision Analysis at the Strategy and Business Systems Department, Portsmouth Business School, University of Portsmouth. He holds a B.Sc. in Production Engineering, an M.B.A., an M.Sc. in integrated manufacturing systems and a Ph.D. in maintenance systems. His research work focuses on asset management, manufacturing maintenance systems, best practice and decision-making. In particular, he is concerned with the analysis of data related to machine failures and design and with the development of computerised maintenance management systems (CMMSs). He is a Fellow of the Operational Research Society (ORS), a Fellow of the IEE and a Chartered Engineer. He has published over 80 refereed papers in professional journals and international conference proceedings. He is currently the Associate Editor of IEEE Transactions SMC (Systems, Man, and Cybernetics).

Chapter 18

Terje Aven is Professor of Risk Analysis and Risk Management at the University of Stavanger, Norway. He is also a Principal Researcher at the International Research Institute of Stavanger (IRIS). He was Professor II (adjunct professor) in reliability and safety at the University of Trondheim (Norwegian Institute of Technology) 1990–1995 and Professor II in reliability and risk analysis at the University of Oslo 1990–2000. He was the Dean of the Faculty of Technology and Science, Stavanger University College, 1994–1996. Dr. Aven has many years of experience from the petroleum industry (The Norwegian State Oil Company, Statoil). He is the author of several reliability and risk related books and he is an associate editor/area editor/member of the editorial board of several international journals. He received his master's degree (cand.real) and Ph.D. in mathematical statistics (reliability) at the University of Oslo in 1980 and 1984, respectively.


Chapter 19

Uday Kumar is a Professor of Operation and Maintenance Engineering at Luleå University of Technology, Sweden. He obtained his B. Tech. from India and a Ph.D. degree in the field of reliability and maintenance from Luleå University of Technology, Luleå, Sweden, in 1990. He worked for six years in Indian mining industries prior to joining the postgraduate program. His research interests are equipment maintenance, reliability and maintainability analysis, product support, life cycle costing, risk analysis, system analysis, etc. He is also a member of the editorial boards of, and a reviewer for, many international journals. He has published more than 100 papers in international journals and conference proceedings.

Aditya Parida obtained his Ph.D. in the area of maintenance performance measurement and has taught operation and maintenance engineering at Luleå University of Technology, Sweden, since 2002. Prior to this, he taught the same subject at a couple of institutes in India and was joint-director of NIILM Centre for Management Studies, New Delhi. He has a bachelor’s degree in mechanical engineering and a post-graduation qualification in industrial engineering from IIT, Kharagpur, India, and has more than two decades of experience in the area of operation and maintenance engineering from the Indian Army, amongst others. He is actively involved in research in the area of maintenance performance measurement and other related issues. He has published a number of papers in this subject area and was the co-editor of the proceedings of COMADEM 2006.

Chapter 20

John Boylan is Professor of Management Science at Buckinghamshire Chilterns University College. He holds degrees from Oxford and Warwick Universities and has published papers on short-term forecasting in a variety of academic and practitioner-oriented journals. In addition to his academic work, Professor Boylan advises commercial organisations on forecasting processes and software. He also leads a large project, funded by the European Union and the Learning and Skills Council, facilitating the education and training of managers in small and medium enterprises. His current research interests relate to demand forecasting in the supply chain, with a particular emphasis on intermittent demand.

Aris Syntetos is a reader working with the Centre for Operational Research and Applied Statistics (CORAS) at the University of Salford, UK. He holds a B.A. degree from the University of Athens, an M.Sc. degree from Stirling University and in 2001 he completed a Ph.D. at Brunel University – Buckinghamshire Business School. His research interests relate primarily to intermittent demand forecasting and the interface between forecasting and stock control. Aris’s work has appeared in the International Journal of Forecasting, International Journal of Production Economics and Journal of the Operational Research Society. He currently holds three research grants: two from the Engineering and Physical Sciences Research Council (EPSRC, UK) and one from the Department of Trade and Industry (DTI, UK).


Chapter 21

Jørn Vatn: See Chapter 4

Chapter 22

Renyan Jiang is a Professor and Director of the Quality, Reliability and Maintenance Laboratory at Changsha University of Science and Technology, China. He obtained his undergraduate and graduate degrees from Wuhan University of Technology, China, and his Ph.D. from the University of Queensland, Australia. He has held visiting appointments at City University of Hong Kong, University of Saskatchewan, The Hong Kong Polytechnic University, and University of Toronto. His research interests are in various aspects of quality, reliability and maintenance. He is the author or co-author of three reliability-related books, including Weibull Models, Wiley, 2003. He has published 28 papers in international journals and a number of other papers.

Xinping Yan is a Professor and Director of the Reliability Engineering Institute at Wuhan University of Technology, China. He obtained his undergraduate and graduate degrees from Wuhan University of Technology, China, and his Ph.D. from Xi’an Jiaotong University, China. He is a member of the ISO/TC108/SC5 Committee and a member of the Council Committee of the Tribology Institute of the Chinese Mechanical Engineering Society (CMES). He is an editorial member of Journal of COMADEM (U.K.) and Journal of Maritime Environment (U.K.). His research interests include condition monitoring and fault diagnosis, tribology and its industrial application, and intelligent transport systems.

Chapter 23

Uday Kumar: See Chapter 19

Ulla Espling is deputy director at Luleå Railway Research Centre (JVTC) and a researcher within “Framework for Maintenance Strategies for Railway Infrastructure”, dealing with a regulated administration, outsourced maintenance, high demands on safety and yearly funding. She has an M.Sc. degree in mechanical engineering and a Licentiate in operation and maintenance engineering. Her railway background goes back to 1984. Within the railway she has worked in both traffic operation and planning, and as a track engineer, design leader and head of a track area, giving her broad and rich experience.


Chapter 24

Jayantha Liyanage is an Associate Professor of Asset Operations, Maintenance Technology, and Asset Management at the University of Stavanger (UiS), Norway. He is also the Chair and a project advisor of the Center for Industrial Asset Management (CIAM), and a member of the R&D group of the Center for Risk Management and Societal Safety (SEROS), at UiS. In addition, Dr Liyanage serves as the Co-Organiser and Coordinator of the European Research Network for Strategic Engineering Asset Management (EURENSEAM). He has been appointed to the Board of Directors of the Society of Petroleum Engineers (SPE) Stavanger section, where he also serves as Chairman of the Scholarship Committee. Dr Liyanage is actively involved in numerous joint industry projects in advisory and managerial capacities. He has received a number of awards for his excellent academic and research performance. He serves on the editorial boards of a number of international journals and on the international steering committees of many international conferences.

Chapter 25

Daniel Bongers received his B.E. (1999) and Ph.D. (2004) from the University of Queensland, Australia. He is currently a research fellow for the Australian Cooperative Research Centre Mining, and is responsible for managing two late-stage technology development projects. His current research interests include physiological signal processing, fault detection and isolation, physiological fatigue detection and signal measurement.

Hal Gurgenci received his B.Sc. (1976) and M.Sc. (1979) from the Middle East Technical University, Turkey, and Ph.D. (1982) from the University of Miami. He is currently a professor with the School of Engineering, The University of Queensland in Brisbane. Previously, he was a Vice President of the Australian Cooperative Research Centre on Mining responsible for research and education activities of the Centre. He was the principal investigator of several large projects in mining equipment design, automation, reliability and maintenance. His current research interests include energy generation and conservation.

Index

A ABC classification 484 Accelerated Degradation testing 157 Failure time testing 156 Life testing plans 156 Adverse selection 388 Agency Theory 387 Issues 388 Aging parameter 517 AHP 427 Artificial intelligence 209 Asset 4

B Bayesian Approach 135 Decision Theory 146 Inference 136 Benchmarking Methodology 562 Need 563 Overview 561

C Candidate group 515 Case Based reasoning 209, 212 Studies 69, 124, 150, 445 CBA See cost benefit analysis CBM 52, 54

Applications 538 CMMS 43, 417 Composite scale 542 Condition monitoring techniques 112 Contract 402 Cost benefit ratio 525, 526, 529 Cost benefit analysis 509, 521, 529 Costs Down time 324 Punctuality 526 Safety 525 Criticality index 93

D Data Acquisition 535 Fusion 537 Processing 536 Decision Charts 35 Model 116 Support 42 Delay time Bayesian approach 362 Modelling 345 Objective data method 364 Subjective estimation 359 Demand Distribution 487 Estimators 500 Mean 489 Variance 492


Dependence Economic 265 Stochastic 266 Structural 266 Diagnostics 536 Module 66 Technologies 598 Diesel Engine 538 Discount rate 521, 525, 528 Distributions Event time 627 Posterior 139 Predictive 142 Prior 139 DMG 422 Dynamic grouping 512, 514, 516, 519, 527

E Economy of scale 511 Economy of scope 511 Effective failure rate 512 E-maintenance 586 EMQ 333 E-operations 586 Equipment leasing 397 ERP 418 Extrusion press 366

F Failure Information 94 Interaction 275 Interaction Type I 276 Interaction Type II 278 Fault detection 611 FMECA 90, 517 Forecasting Non-parametric 493 Parametric 487 FTA 442 Functional Block diagrams 85 Failure analysis 84 Failures 85 Fuzzy logic 212, 428

G Game Nash 385 Stackelberg 385 Genetic algorithm 212 Government 400

H HAZOP 441 HIMOS 217 HSE 471

I Industry Nuclear 473 Oil and gas 474 Process and utility 475 Railway 475, 565 Information fusion 128 Infrastructure 376 Inspections Imperfect 349 Perfect 348 Intensity function General proportional 193 Reduction 405 Interval optimization 105 Inventory decision 482

K Knowledge based systems 212 KPI 461

L Laplace trend test 197 Lease Definition 397 Finance 398 New equipment 408 Operating 397 Sale and leaseback 399 Used equipment 409 Lessee 402 Lessor 401 Life cycle cost 34, 509, 510, 525 Calculations 525


M Maintainability 8 Maintenance Actions 27 Actions Selection 94 Benchmarking 563 Concepts 32 Concepts customized 40 Condition based 49, 424 Context 22 Contract 569 Corrective 27, 379 Design-out 30 Failure-based 30 Framework 4 Grouping activities 511 Intelligent Systems 56 Intervals 97 Longwall 613 Management 9 Management 22 Manager 41 Maturity levels 45 Measurement and control 225 Offshore asset 589 Opportunity based 519 Opportunity-based 30 Optimization 509, 511 Outsourcing 24 Outsourcing advantages 375 Outsourcing disadvantages 375 Passive 29 Performance 6 Performance Measurement 459 Policies 30 Predictive 29 Preventive 28, 199, 511, 79, 379, 510–513, 519 Preventive comparison analysis 97 Preventive optimal schedule 170, 173 Proactive 29, 53 Reactive 51 Reliability centered 37 Scheduling 199, 271 Self 53 Service contract 6 Technologies 50 Time based 30


Total productive 37 Usage based 30 Metrics 461, 570 Misjudgment 549 Model Age based 306 Basic risk 446 Capital replacement 303 Competing risk 245 Cumulative usage based 310 Dynamic programming 306 Economic life 290 Finite horizon 294 Intensity reduction 191 Linear regression 160 Markov 252 Non-homogeneous Poisson process 187 Period based 308 Proportional hazards 190 Proportional intensities 192 Renewal process 187 Repair alert 248 Risk influence 448 Selection 197, 553 State discriminant 533 Statistic based nonparametric 160 Statistic based parametric 159 Two-cycle 291 Virtual age 190 Monitoring Off-line 615 Oil based 114 On-line 616 Vibration based 113 Moral hazard 388 MPM system 469 MTTF 517, 518 Multivariate control chart 549

N Net present value 521, 525, 526, 528 Neural network 212, 622 NPV See Net present value

O Oil degradation 541 Opportunity maintenance 325


Outsourcing Operational 24 Strategic 24 Tactical 24

P Parameter estimation Delay time model 359 Parameter Estimation 194 Penalties 407 Performance Assessment 65 Assessment Multi-sensor 63 Indicators 461, 613 Measurement 494 Peristaltic pump 357 Planned maintenance cost 512 Point process 236 PriFo 522 Priors Reference 140 Specification 145 Subjective 141 Prognostics 537 Approach 54 Approach Technologies 598 Project costs 524, 528 Punctuality costs 528

R RCM Analysis process 79 Data collection 80, 99 Implementation 80, 99 Regulator 400 Reliability Inherent 6 Measures 254 Theory 8 Renewal Process Alternating 189 Trend 235 Repair Maximal 187 Minimal 187 Replacement model Age based 306 Cumulative-usage based 310

Period based 308 Residual life prediction 119 Residual lifelength 524, 525 Risk Competing 245 Influencing factors 522, 524 Influence modeling 448 Management 438 RLL See residual lifelength Run to failure 94

S Safety costs 527 Scheduled Function test 95 Overhaul 96 Service Agent 377 Contract 381 Part classification 480 Set-up costs 509, 511, 512, 513, 515, 517, 519 Shape parameter 517 Signal Processing 64 Spares forecasting Approach 482 Method 482 Stakeholders 5 Static grouping 512, 513 Systems approach 4

T Technologies Diagnostic 598 Prognostic 598 Training set 637 Trend renewal process 237 Heterogeneous 242

U Unplanned costs 512

V Variable costs 525 Virtual age 406


W Warranty Extended 377

Servicing 384 Wear particles 541 Weibull 517

